Search

Bristol Open Data Hack Day - Bristol Water Quality data processed with XQuery.

Here is a little data visualization I made and the story behind it.

Background

Last Friday I, along with over 20 other data mungers, attended the Bristol Hackday held in the new part of the Colston Hall. This event was part of the International opendataday.. A good mix pf people, including Stephen Hilton and Sarah Billings from Connecting Bristol  (@connectbristol)  with Rob Scott and Councillor Mark Wright. Reportage includes the twitter Hash tag  #bristolhackday, http://www.delib.co.uk/dblog/, http://www.delib.co.uk/dblog/bristol-open-data-hack-day-kicks-off/ and Dan Dixon'.

Whilst some explored the Your Freedom data, others worked on data released by the Bristol City Council on data.gov.uk http://data.gov.uk/search/apachesolr_search/bristol. One group worked on data on lost shopping trolleys and had a nice demo working on the day.

Water Quality data

I had taken a look at a dataset on River Water Quality the previous night and on the day teamed withMariateresa Bucciante from the Environment Agency as domain expert/client,  Leroy Kirby @LeroyKirby on front-end and me on back-end. I had already converted the CSV data  to XML using a utility XQuery script and loaded it into an eXist database on a UWE server [ later moved to an EC2 instance]. The data comprises a large number of samples taken at various locations around Bristol recording the values of about 12 different measures of water quality. I also extracting the location references, and their location names, together with a script to generate kml, giving us a map of the Google Geocoded locations.

Front-end

Leroys first thought was to allow comparisons of several sites by graphing a chosen quality over time. The user can select the sites from a map or from a selection list.  JQuery and jqPlot will be used to generate the graphs. The backend provides a basic API to deliver the locations, the properties and the values of a property at a site as JSON.

Semantics

To make sense of the data, we needed to understand what the different observational data values meant, both technically (what is it and how is it measured) and environmentally (what values are acceptable according to which standards and how do these separate measures relate to the Environmental Agencies classification of water quality?  Maria put together a Google Spreadsheet with rows for each column with links to definitional data and this was augmented with other columns to provide guidance to programs processing the data. 

Problems

Data quality

The water quality data exhibits minor problems with data formating and data quality. These are pretty common to live data sets and we noted some of these as we went:

  • Date format - the format of dd-3-letter-month-yy is of little use for processing and had to be converted to the XML date format yyyy-mm-dd
  • Erroneous values -   Temperature radings of  50, 110, 136
  • Column naming: Turburdity instead of Turbidity
  • Lack of temporal precision:  sample date but not time
  • Lack of geographic Precision: a location name - like Sea Mills can be geocoded to get an approximate position.

[later I discovered this additional data set which defines the River Sites  and provides National Grid References  e.g ST553763 for most of the sites - however some sites with a low number of values ( Site 2a Boiling Wells has only 2) are missing - I really should have found this since it was in the list of related sites]

  • Coding of missing values: mainly mising values are blank but other wording is sometimes used :"no data", "no access", "NR"
  • Coding of imprecise values - for example, Nitrite is often recorded as ">1"
  • Multiple names for the same thing: Cotswold Road, Cotswold Rd
    •  [actually the Location is mainly(but not quite) redundant with the associated Site reference data]

Cross-platform development

The backend data in XML was being hosted on an UWE server. The front-end was being developed on a lap-top in Javascript and JQuery. I wrote a simple API delivering JSON which was validated by a http://jsonformatter.curiousconcept.com/ .

However we could not get the two parts to work together, despite mucking round with serialization types(text) and mime-types (it should be application/json but all sorts of values seem to be in use).  With help from another hacker, Richard Burrell, we finally realized that we had forgotten about the same-domain problem - our attempts to solve it with JSON-P did not quite work within the time of the workshop . Pity we got stuck on this because it was only a problem because we were developing on two hosts - in practice the front-end and backend would have been hosted on the same site and the problem would not have arisen.

It would have been helpful if I'd found the page on the data.gov.uk site on JSON-P but one of the benefits of working on problems in a hackaday is the access to knowledgable hackers.

Look before you leap

Having complained about the need for meta data, after the event I discovered a metadata link on the Water Quality page. This data is available in folds on the page, and it is useful though a bit limited.  I also discovered another dataset http://data.gov.uk/dataset/bristol-river-site-national-grid-references   containing UK National Grid References for the monitoring points accurate to 100m.

I also failed to see the link to the wiki page for this data set where we can comment on the data - so much of the feedback mechanism is already in place. Actually data.gov.uk has improved greatly and I was operating on memories of encounters with the site a year ago.  Now I've created an account and can make a contribution to the wiki page.

Current project state

I hated to leave a good project un-completed, so I took a bit of time this week to put together a prototype based on the ideas we had discussed in the workshop. This basic site allows the user to graph any of the measurements at any of the sites, and to show the site location on a map. 

Coliforms at Eastville Park Lake

In addition to the base data on data.gov.uk, it uses an additional Google Spreadsheet which defines the columns of the data and starts to supplement the column names with additional information.

Conversion to XML was done using XQuery and all data stored in eXist an open source XML database. The UK National Grid Regerences are converted to easting/northing and hence lat/long using XQuery library functions. Graphing uses the Google Visualization API . The application is running on a free EC2 micro-instance. 

Improvements

Many additional visualisations could be provided to allow multiple sites to be graphed together or to map the readings.  However I'm interested in ways in which the user community can interact with the data and comment and augment it.  The data.gov.uk wiki entry for this dataset provides one avenue, but it would be good if  a user could annotate date items and for these annotations to appear on the chart.  I have in mind two use cases: One is to mark clearly erroneous data, such as high temperature readings which otherwise throw out the graph scale.  The other is to provide an explanation, perhaps the occurance of a sewage spill, or exceptional rainfalll,  of anomolies in the data.  That's for another day.

Thanks for the great writeup, Chris, and for exploring the challenges you faced on the way to arriving at such a useful real-world project!
hi kit. we're working on some open data standards for the air quality data too, to make it easier to hack. hope to have some updates in the new year. http://data.gov.uk/dataset/bristol-air-quality
Thanks for the info - I would like to base this application on linked data but first I need to find a suitable vocabulary for the water quality measures - also need URIs for the sites, which should include the separate sites in the Floating Harbor as well.
We are developing a new database for this data at the moment which will iron out the problems you identified. We will be developing some automated outputs as well.