Couldn't help myself, had to have a go at Brian Kelly's Linked Data problem to find the UK town with the highest proportion of students.
1 Get the HESA data - convert the latest overall figures (in Excel) with elev.at to generic XML, then a custom XSLT to domain XML. HESA provides several totals and in the analysis I used the overall total which includes FE students - depends what you mean by 'student'.
2 Get Population data - fortuitously the guardian just published a GoogleSpreadsheet - Just a matter of picking the most appropriate OS Administrative area to match the data in edubase.
My RDF browser helped to formulate the query and explore the data , for example by providing the list of all Higher Education Institutions
4 Since there is no institution identifer in the HESA data, write a custom XQuery script to merge by institution name. Doing a simple sort/merge, on name we get only 70 matches. Using a computed key (the name is space-normalised, lower-cased, punctuation removed, split into words, noise words ignored, the words sorted and re-joined) brings this up to 123, with 13 unmatched from the edubase data, 43 from HESA. Here are the merge results. Exact matches are green, fuzzy matches black.The prevalance of minor differences in article and prepositions is quite high, considering these are formal institutional names.
Analysing the failures, we see that that nearly all the unmatched HESA institutions are in Scotland, Wales and Northern Ireland. The generated GoogleMap shows the coverage of eduBase. Although Welsh schools are in edubase,Welsh universities are not. I could not find any documentation on the coverage of the education dataset to confirm this.
The matching failures from instutions in edubase have several causes. Some are caused by duplicate names - there are two "Conservatoire for Dance and Drama" (In Camden and Leeds), two "University of the Arts,London" (in Westminster and Wimbledon) .
Others are due to names which are more than noisily different:"Trinity and All Saints College" and "Leeds Trinity and All Saints";"The Arts University College At Bournemouth"and " The Arts Institute at Bournemouth"; "Norwich University College of the Arts" and "Norwich School of Art & Design"; "Bath Spa University College" and "Bath Spa University"; "College of St Mark & St John" and "University College Plymouth St Mark and St John"; "Goldsmiths College, University of London" and "Goldsmiths College". A second pass using edit distance would probably resolve these mismatches.
Another cause is mergers : edubase has two Manchester Universities, one of which is UMIST. Some minor colleges are included in edubase but not in HESA.
5. To link to the Guardian data, I need to decide on the most appropriate ONS area. local-authority district looks about right but I need the appropriate ONS code. In the edubase dataset, only the full URI is present e.g. http://statistics.data.gov.uk/id/local-authority-district/47UE so the code has to be extracted from the URI with a string operation. Pulling URIs apart like this is deprecated linked data practice however.
The resultant XML so far is here:
7 Extract the ons code and population (in thousands) from the Guardian data: (I fudged the extraction to csv for the moment since its getting late, starting with a partial csv file which was converted to XML. [Flow now from the source Datastore spreadsheet]
8 Merge the education data and the population data, again using a sort/merge in XQuery and transforming to an HTML table. Table is ordered by the computed percentage of students per head of population:
Tada! It's Milton Keyes, of course, but remember the mis-matches, the missing Scottish, Welsh and Northern Irish Universities, the inference of the student location as their University location and the choice of ONS local authority districts as "town". The last choice is dubious since, for example, it puts my university, UWE Bristol, in South Gloucestershire rather than Bristol,
Inventory of custom scripts : one hosted service (elev.at) four generic XQuery utilities, five XSLT for transforming and grouping and one SPARQL query. Intermediate caching of temporary XML documents. It would be nice to put into a single pipeline - a good use case for me to learn XProc I think.
Linked data perhaps, but not quite as we want it, yet. I've used ad hoc XML schemas as the data format for integration here. Next experiment is to convert the HESA and Guardian data to triples, put them in Talis's Data Incubator and integrate the data with SPARQL. However common identifiers and scope between HESA and Edubase remains the missing magic.