Government data conversion, ASHE tables and the small print

If I'm feeling generous, I could put the trouble the Conservative Party got into the other day over sexed-up crime figures down to careless data transformation.  Access to government data is now  easier thanks to the portal. Much of the data is in the form of Excel spreadsheets and we are all busy scraping and converting that data.  I've been using various approaches recently: exporting as xslx and open office xml formats or csv and processing in XQuery, now assisted by the neat service .  Good fun.  But there be dragons too. 

Take the Annual Survey of Hours and Earnings (ASHE) (Currently these seem not to be included in the data although they are a rich data source)

As an aside, navigating is oddly difficult:

First an HTML page by year; then for a year an HTML page of data classifications  e.g. 2009; then a PDF document containing links to Excel spreadsheets e.g. Analysis by Occupation  next an Excel workbook e.g.Table 2.1a Weekly pay:Gross  (the URI is buried in the PDF)  which contains 8 worksheets showing the further breakdown by gender and employment status.

We can convert any single spreadsheet to XML using the service - this Converts the first table. It's XML but there is lots more work to get to a usable data-set and even more to recover the n-dimensional data set from which one presumes all this is generated.  Perhaps someone is working away right now converting it to RDF SCOVO.

However in such simple conversions important information can be lost. In the ASHE spreadsheets, the confidence levels are shown by cell styling.  If you export to CSV, Excel warns you that formatting might be lost, but here the formatting carries semantic information about the reliability of the data.  It is probably recoverable from the xslx format and  the Open Office XML format but only with a lot of work. 

Another problem arises with footnotes - whilst the linking symbol and the footnote itself is retained in the export, its hard work to put the links back. Footnotes crucially indicate changes in the boundaries of classifications and time intervals and problems of data quality.

Data transformations which drop these awkward bits will inevitably lead to misinterpretations. So we really do want to hammer on as TBL does for access to raw data with all its wrinkles.