A data mashup to explore Brian Kelly's Linked Data challenge

Couldn't help myself, had to have a go at Brian Kelly's Linked Data problem to find the UK town with the highest proportion of students.

Here is my result  and its  data flow

In detail:

1 Get the HESA data - convert the latest overall figures (in Excel) with to generic XML, then a custom XSLT to domain XML.  HESA provides several totals and in the analysis I used the overall total which includes FE students - depends what you mean by 'student'.

2 Get Population data - fortuitously the guardian just published a GoogleSpreadsheet  - Just a matter of picking the most appropriate OS Administrative area to match the data in edubase.

3 Write a SPARQL Query to get Higher Education Institutions and their areas from the edubase RDF Extract

My RDF browser helped to formulate the query and explore the data , for example by providing the list of all Higher Education Institutions

4 Since there is no institution identifer in the HESA data, write a custom XQuery script to merge by institution name.  Doing a simple sort/merge, on name we get only 70 matches.  Using a computed key (the name is space-normalised, lower-cased, punctuation removed, split into words, noise words ignored, the words sorted and re-joined) brings this up to 123, with 13 unmatched from the edubase data, 43 from HESA.  Here are the merge results. Exact matches are green, fuzzy matches black.The prevalance of minor differences in article and prepositions is quite high, considering these are formal institutional names.

Analysing the failures, we see that that nearly all the unmatched HESA institutions are in Scotland, Wales and Northern Ireland.   The generated GoogleMap shows the coverage of eduBase. Although Welsh schools are in edubase,Welsh universities are not. I could not find any  documentation on the coverage of the education dataset to confirm this.

The matching failures from instutions in edubase have several causes. Some are caused by duplicate names - there are two "Conservatoire for Dance and Drama"  (In Camden and Leeds), two "University of the Arts,London" (in Westminster and Wimbledon) .

Others are due to names which are more than noisily different:"Trinity and All Saints College" and "Leeds Trinity and All Saints";"The Arts University College At Bournemouth"and " The Arts Institute at Bournemouth"; "Norwich University College of the Arts" and "Norwich School of Art & Design"; "Bath Spa University College" and "Bath Spa University"; "College of St Mark & St John" and "University College Plymouth St Mark and St John"; "Goldsmiths College, University of London" and "Goldsmiths College". A second pass using edit distance would probably resolve these mismatches.

Another cause is mergers : edubase has two Manchester Universities, one of which is UMIST.  Some minor colleges are included in edubase but not in HESA.

5. To link to the Guardian data, I need to decide on the most appropriate ONS area.  local-authority district looks about right but I need the appropriate ONS code. In the edubase dataset, only the full URI is present e.g. so the code has to be extracted from the URI with a string operation. Pulling URIs apart like this is deprecated linked data practice however.

The resultant XML so far is here:

6. This data has to be grouped and summed by area code - XSLT can do this yielding this XML summary

7 Extract the ons code and population (in thousands)  from the Guardian data: (I fudged the extraction to csv for the moment since its getting late, starting with a partial csv file  which was converted to XML. [Flow now from the source Datastore spreadsheet]

8 Merge the education data and the population data, again using a sort/merge in XQuery and transforming to an HTML table.  Table is ordered by the computed percentage of students per head of population:

Tada! It's Milton Keyes, of course, but remember the mis-matches, the missing Scottish, Welsh and Northern Irish Universities, the inference of  the student location as their University location and the choice of ONS local authority districts as "town".  The last choice is dubious since, for example, it puts my university, UWE Bristol, in South Gloucestershire rather than Bristol,

Inventory of custom scripts : one hosted service (  four generic XQuery utilities, five XSLT for transforming and grouping and one SPARQL query. Intermediate caching of temporary XML documents. It would be nice to put into a single pipeline - a good use case for me to learn XProc I think.

Linked data perhaps, but not quite as we want it, yet.  I've used ad hoc XML schemas as the data format for integration here. Next experiment is to convert the HESA and Guardian data to triples, put them in Talis's Data Incubator and integrate the data with SPARQL. However common identifiers and scope between HESA and Edubase remains the missing magic.

Excellent example Chris. But doesn't the MK answer make you feel it's wrong? Exclude distance students?
If only! HESA data doesn't break-down students by mode of study (that I could find), and really distance students should count in their town of domicile but that data's not there either. Birkbeck is skewing the Camden figure too I think. I briefly looked for data on distance learning students by University (just proportion would do) which you would think would be useful data but didnt find anything.

The lack of universities anywhere but in England thanks to the edubase scope is another big problem.

I do think these questions about aggregates are a much tougher challenge for Linked data than some of the typical instance examples cited. If you are looking to buy a green car in Queensland, you are happy to filter false positives, will never know about false negatives and don't care as long as you get some true positives, which are probably redundantly present anyway. That doesn't work with aggregate questions - fail to account for distance learning students, or perhaps Scottish Universities and your answer is 100% false positive.