/ R

Acquiring and visualizing Berlin city air quality data

In this post I will go over a couple of methods and techniques of fetching and basic visualizations of data. For this purspose, I am going to be using specific tools and specific data sets.

  1. Overview of the data
  2. Fetching data
  3. Visualizing data
    3.1. Heat maps

    3.2. Time series
  4. Spatial visualization

Please be aware that I am only scratching the surface with this blog post. Maybe I will put this into a series in the near future.

Overview of the data

There are lots of interesting data sets on the web worth exploring. I will take a look at data which is, in my eyes, kind of underappreciated. That is air quality data. The city of Berlin operates an air quality measuring network and provides the data through a web portal. In total there are 16 stations measuring different pollutants.

Station at Hardenbergplatz in Berlin Charlottenburg

Station at Hardenbergplatz in Berlin Charlottenburg - Source

The measuring network is in existence since 1975, but online data only became available in 2006. Since then, monthly and annual data collections in the form of PDF documents are published.

Fetching data

Unfortunately, PDF documents are kind of pain for data mining. Fortunately though, the measured values over the last week are available as a nice HTML table, which would be pretty simple for processing.

Table data
Measured pollutants of August 17, 2016 as a nice table - Source

What's also pretty convenient, is the fact that each table can be accessed through a simple URL directing to a specific date. http://www.stadtentwicklung.berlin.de/umwelt/luftqualitaet/de/messnetz/tageswerte/download/20160817.html i.e. would access the table for August 17, 2016. Just change the last part of the URL to the desired date and you get the values.

R is an (the) ideal tool for analyzing structured data. There are also a couple of packages for scraping web pages and transforming their content into data frames.

I wrote a function in R that does exactly that - scrape the tables and put them into data frames. All you have to do is provide a starting and ending date. You can find it on Github.

For example, if you want to get the data of the last month, run the following command in R:

blume <- getBlume(start = "2016-07-01", end = "2016-07-31")

Besides the pollutants, the function also adds measured values for temperature, wind velocity, wind direction and relative humidity. It also adds geographic coordinates for each of the 16 stations.

Visualizing data

What you could do now is go ahead and use your preferred code snippets of ggplot2 and create some basic line charts or bar plots. But for producing interactive, web based visualizations, there is a nice package called rCharts created by Ramnath Vaidyanathan, or short ramnathv.

rCharts offers a couple of visualizations libraries, like

  • NVD3, which is based on D3.js
  • Morris, which is based on Raphael.js
  • and Polychart, which "takes many ideas from the Grammar of Graphics and the R library ggplot2"

A very nice overview and tutorial on the capabilities of rCharts can be fount here.

Heat maps

What about we would like to find out, at which station high values of pollutants were found. We could do that, with a heat map, which plots each station against each measured pollutant. To make them comparable, we rescale each value on a scale of 0 to 1. Let's compare new year's eve 2015 with January 1, 2016.

Pollutants on New Year's Eve 2015
Pollutants on New Year's Day 2016
# Load the required packages
require(rCharts)
require(plyr)
require(knitr)
require(reshape2)
require(scales)

# plot
blume.newYearsEve <- blume[31825:31840,]
blume.transform <- ddply(melt(blume.newYear[,-c(9,10,11,12,13,14,15)]), .(variable), transform, rescale = rescale(value))
p1 <- rPlot(Station ~ variable, color = 'rescale', data = blume.transform, type = 'tile', height = 600)
p1$guides("{color: {scale: {type: gradient, lower: white, upper: steelblue}}}")
p1

Time series

Let's take a look at time series. We'll use the temperature value, because it's kind of interesting to see it go up and down over the months and years.

Time series temperature

blume.oneStation <- subset(blume[c("Station","Date","Temperature")], Station == "Buch")
blume.transform <- reshape2::melt(blume.oneStation[,c('Date', 'Temperature')], id = 'Date')
p2 <- nPlot(value ~ Date, group = 'variable', data = blume.transform, type = 'lineWithFocusChart')
p2$xAxis( tickFormat="#!function(d) {return d3.time.format('%d-%m-%Y')(new Date( d * 86400000 ));}!#" )
p2

We can also compare two pollutants over the course of one year.

Time series nitrogen dioxide and particles PM10
blume.oneYear <- subset(blume[c("Station","Date","Nitrogen dioxide", "Particles PM10")], Station == "Mitte")
blume.oneYear <- blume.oneYear[(5*365):(6*365),-1]
blume.transform <- reshape2::melt(blume.oneYear, id = 'Date')
p7 <- nPlot(value ~ Date, group = 'variable', data = blume.transform, type = 'lineWithFocusChart')
p7$xAxis( tickFormat="#!function(d) {return d3.time.format('%d-%m-%Y')(new Date( d * 86400000 ));}!#" )
p7

Spatial visualization

Now because the function getBlume() adds geographic coordinates for each station, we can also do a little spatial analysis. The package rMaps, also created by Ramnath Vaidyanathan, provides us with functions for easy creation of Leaflet based web maps.

How about we create a heat map based on the values of nitrogen dioxide for New Year's Day.

Leaflet based heat map of values of nitrogen dioxide
require(rMaps)
require(rjson)
L2 <- Leaflet$new()
L2$setView(c(52.51361, 13.41883), 11)
L2$tileLayer(provider = "Stamen.Toner")
blume.json <- toJSONArray2(blume.oneDay[,c(14,15,4)], json = FALSE, names = FALSE)
cat(rjson::toJSON(blume.json))
# Add leaflet-heat plugin
L2$addAssets(jshead = c(
  "https://leaflet.github.io/Leaflet.heat/dist/leaflet-heat.js"
))
# Add javascript to modify underlying chart
L2$setTemplate(afterScript = sprintf("
  &lt;script&gt;
   var blumePoints = %s
   var heat = L.heatLayer(blumePoints).addTo(map)           
  &lt;/script&gt;
", rjson::toJSON(blume.json)
))
L2

Sadly, this provides no indication on the actual values that were measured.

Outlook

I hope I could provide some insights on basic data fetching visualization tasks. On a follow up post, I'll be looking into spatial interpolation processes, and how to create a short animation over time. Keep an eye out for that.

Thanks for reading.

Spatial interpolation animation