Here is a little data visualization I made and the story behind it.
Last Friday I, along with over 20 other data mungers, attended the Bristol Hackday, held in the new part of the Colston Hall. The event was part of International Open Data Day. There was a good mix of people, including Stephen Hilton and Sarah Billings from Connecting Bristol (@connectbristol), together with Rob Scott and Councillor Mark Wright. Reportage includes the Twitter hashtag #bristolhackday, http://www.delib.co.uk/dblog/, http://www.delib.co.uk/dblog/bristol-open-data-hack-day-kicks-off/ and Dan Dixon's.
Whilst some explored the Your Freedom data, others worked on data released by Bristol City Council on data.gov.uk http://data.gov.uk/search/apachesolr_search/bristol. One group worked on data about lost shopping trolleys and had a nice demo working by the end of the day.
Water Quality data
I had taken a look at a dataset on River Water Quality the previous night, and on the day teamed up with Mariateresa Bucciante from the Environment Agency as domain expert/client, Leroy Kirby (@LeroyKirby) on the front-end, and me on the back-end. I had already converted the CSV data to XML using a utility XQuery script and loaded it into an eXist database on a UWE server [later moved to an EC2 instance]. The data comprises a large number of samples taken at various locations around Bristol, recording the values of about 12 different measures of water quality. I also extracted the location references and their location names, together with a script to generate KML, giving us a map of the Google-geocoded locations.
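The conversion step is simple enough to sketch. My actual script was XQuery, but the same idea in Python looks like this (the column names and values in the demo data are invented for illustration; the real dataset has about 12 quality measures per sample):

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(csv_text, root_name="samples", row_name="sample"):
    """Convert CSV text to a simple XML document: one element per row,
    one child element per column, named from the CSV header."""
    root = ET.Element(root_name)
    for row in csv.DictReader(io.StringIO(csv_text)):
        sample = ET.SubElement(root, row_name)
        for field, value in row.items():
            # Normalise header names into valid element names
            el = ET.SubElement(sample, field.strip().replace(" ", "-"))
            el.text = value
    return ET.tostring(root, encoding="unicode")

demo = "Site,Date,pH\n1,2010-05-01,7.4\n2,2010-05-02,7.1\n"
print(csv_to_xml(demo))
```

Once the data is in this row-of-elements shape it can be loaded straight into eXist and queried with XQuery.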
Leroy's first thought was to allow comparisons of several sites by graphing a chosen quality measure over time. The user can select sites from a map or from a selection list. jQuery and jqPlot will be used to generate the graphs. The back-end provides a basic API to deliver the locations, the properties, and the values of a property at a site as JSON.
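The back-end API is essentially a filter over the samples. A minimal Python sketch of the values-at-a-site call (the site ids, property names, and readings below are invented; the real back-end is an XQuery script querying eXist):

```python
import json

# Toy in-memory store standing in for the eXist database;
# all ids and readings here are made up for illustration.
SAMPLES = [
    {"site": "ST1", "property": "pH", "date": "2010-05-01", "value": 7.4},
    {"site": "ST1", "property": "pH", "date": "2010-06-01", "value": 7.2},
    {"site": "ST2", "property": "pH", "date": "2010-05-01", "value": 7.8},
]

def values_as_json(site, prop):
    """Body of the API call: values of one property at one site,
    as JSON rows ready for a client-side grapher such as jqPlot."""
    rows = [[s["date"], s["value"]] for s in SAMPLES
            if s["site"] == site and s["property"] == prop]
    return json.dumps({"site": site, "property": prop, "values": rows})

print(values_as_json("ST1", "pH"))
```

The locations and properties calls are the same pattern over different fields.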
To make sense of the data, we needed to understand what the different observational values meant, both technically (what is it and how is it measured?) and environmentally (what values are acceptable according to which standards, and how do these separate measures relate to the Environment Agency's classification of water quality?). Maria put together a Google Spreadsheet with a row for each column, with links to definitional data; this was augmented with other columns to provide guidance to programs processing the data.
The water quality data exhibits minor problems with data formatting and data quality. These are pretty common in live data sets, and we noted some of them as we went:
[Later I discovered this additional data set, which defines the River Sites and provides National Grid References (e.g. ST553763) for most of the sites. However, some sites with a low number of values (Site 2a Boiling Wells has only 2) are missing. I really should have found this, since it was in the list of related sites.]
However, we could not get the two parts to work together, despite mucking around with serialization types (text) and MIME types (it should be application/json, but all sorts of values seem to be in use). With help from another hacker, Richard Burrell, we finally realized that we had forgotten about the same-origin problem; our attempts to solve it with JSON-P did not quite work within the time of the workshop. A pity we got stuck on this, because it was only a problem because we were developing on two hosts - in practice the front-end and back-end would be hosted on the same site and the problem would not arise.
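For the record, JSON-P is nothing more than the JSON payload wrapped in a caller-named callback and served as script, so that a script tag on another host can load it despite the same-origin policy. A minimal Python sketch of what the server side has to emit (the callback name drawChart and the payload are invented):

```python
import json

def jsonp(callback, payload):
    """Wrap a JSON payload in a caller-supplied callback so a <script>
    tag on another host can consume it - the pre-CORS workaround."""
    return "%s(%s);" % (callback, json.dumps(payload))

body = jsonp("drawChart", {"site": "ST1", "values": [[1, 7.4]]})
print(body)
```

Note that the response should then be served as application/javascript, not application/json, since the browser executes it as script.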
It would have been helpful if I'd found the page on the data.gov.uk site about JSON-P, but one of the benefits of working on problems at a hackday is access to knowledgeable hackers.
Look before you leap
Having complained about the need for metadata, after the event I discovered a metadata link on the Water Quality page. This data is available in folds on the page and, though a bit limited, it is useful. I also discovered another dataset, http://data.gov.uk/dataset/bristol-river-site-national-grid-references, containing UK National Grid References for the monitoring points, accurate to 100m.
I had also failed to see the link to the wiki page for this data set, where we can comment on the data - so much of the feedback mechanism is already in place. Actually, data.gov.uk has improved greatly, and I was operating on memories of encounters with the site a year ago. Now I've created an account and can make a contribution to the wiki page.
Current project state
I hated to leave a good project incomplete, so I took a bit of time this week to put together a prototype based on the ideas we had discussed at the workshop. This basic site allows the user to graph any of the measurements at any of the sites, and to show the site location on a map.
In addition to the base data on data.gov.uk, it uses an additional Google Spreadsheet which defines the columns of the data and starts to supplement the column names with additional information.
Conversion to XML was done using XQuery, and all data is stored in eXist, an open-source XML database. The UK National Grid References are converted to easting/northing, and hence lat/long, using XQuery library functions. Graphing uses the Google Visualization API. The application is running on a free EC2 micro-instance.
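The first stage of that conversion is mostly letter arithmetic: the two leading letters of a grid reference select a 100 km square (the letter I is skipped in the OS grid), and the remaining digits split evenly into an easting and a northing within it. A Python sketch of grid reference to easting/northing - the application itself does this, and the further step to lat/long, with XQuery library functions:

```python
def osgb_to_en(gridref):
    """Convert an OS grid reference such as 'ST553763' to an
    (easting, northing) pair in metres."""
    gridref = gridref.strip().upper().replace(" ", "")
    letters, digits = gridref[:2], gridref[2:]

    def idx(c):
        # Letter position in the 5x5 OS grid; 'I' is not used
        i = ord(c) - ord("A")
        return i - 1 if i > 7 else i

    l1, l2 = idx(letters[0]), idx(letters[1])
    # 100 km square indices: the first letter picks a 500 km square,
    # the second a 100 km square within it
    e100 = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100 = (19 - (l1 // 5) * 5) - (l2 // 5)

    # Digits split evenly into easting and northing; a 6-digit
    # reference like 553763 is accurate to 100 m
    half = len(digits) // 2
    east = int(digits[:half]) * 10 ** (5 - half)
    north = int(digits[half:]) * 10 ** (5 - half)
    return e100 * 100000 + east, n100 * 100000 + north

print(osgb_to_en("ST553763"))  # (355300, 176300)
```

So the example reference from the River Sites dataset, ST553763, lands at easting 355300, northing 176300 - squarely in the Bristol area, as it should.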
Many additional visualisations could be provided, to allow multiple sites to be graphed together or to map the readings. However, I'm more interested in ways in which the user community can interact with the data, commenting on it and augmenting it. The data.gov.uk wiki entry for this dataset provides one avenue, but it would be good if a user could annotate data items and have those annotations appear on the chart. I have two use cases in mind: one is to mark clearly erroneous data, such as high temperature readings which otherwise throw out the graph scale; the other is to provide an explanation of anomalies in the data, perhaps the occurrence of a sewage spill or exceptional rainfall. That's for another day.