On the LinkedIn Linked Data group, Brian Kelly at UKOLN asked the question: Which town or city in the UK has the largest proportion of students?
He has now summarised the responses on his blog.
A very interesting question, fraught with definitional problems, and an interesting exploration of the data quality of DBpedia. However, interesting though DBpedia is as a social knowledge database queryable with SPARQL, I don't see that it is really an exploration of Linked Data. I would expect the solution to be based on linking data from disparate, more definitive sources closer to the point of data collection: student numbers from HESA (even if available only as .xls at present), institution names linked to places using the RDFed EduBase data, and then Census data on populations.
Student Numbers
The HESA site has a page on online statistics which leads to a list of products, of which the first, Students and Qualifiers Data Tables, goes to a set of xls and csv tables showing analyses by different factors: subject of study; disability; ethnicity; institution level; qualifications obtained, year by year from 1995/6 to 2007/8 (the latest collated). For example, there is the institutional level data for 2007/8 as an xls. We can use elev.at to do the conversion to XML, then a stylesheet to generate semantic XML (and hence RDF?). OK, straightforward.
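As a rough sketch of the table-to-RDF step, the Python below reads rows from a CSV export and writes N-Triples directly, skipping the XML and stylesheet stage. The column names, the example.org namespace and the sample figures are all placeholders invented for illustration; the real HESA tables have their own layout and HESA publishes no such vocabulary.

```python
import csv
import io

# Placeholder rows standing in for an institution-level table saved as CSV.
# The column names and the figures are made up for illustration only.
SAMPLE_CSV = """institution,total_students
Example University,1000
Sample College,250
"""

# Hypothetical namespace; not a vocabulary published by HESA.
BASE = "http://example.org/hesa/"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def slug(name):
    """Turn an institution name into a URI-safe path segment."""
    cleaned = "".join(c if c.isalnum() else " " for c in name.lower())
    return "-".join(cleaned.split())


def rows_to_ntriples(csv_text, year="2007-08"):
    """Yield two N-Triples lines (name and student count) per table row."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["institution"].strip()
        subject = f"<{BASE}institution/{slug(name)}>"
        # Escape backslashes and quotes so the literal stays valid.
        literal = name.replace("\\", "\\\\").replace('"', '\\"')
        yield f'{subject} <{BASE}def/name> "{literal}" .'
        count = int(row["total_students"])
        yield f'{subject} <{BASE}def/students-{year}> "{count}"^^<{XSD_INTEGER}> .'


if __name__ == "__main__":
    for line in rows_to_ntriples(SAMPLE_CSV):
        print(line)
```

Note that the subject URIs here are minted from the institution name, which is exactly the unstable key discussed under Linking University data; a shared code would be a better basis.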
University locations
Discovering where Universities are is trickier. Universities are included in the EduBase2 data and hence in the schools RDF in data.gov.uk - this is the data on Aston in my prototype RDF browser and in EduBase2.
The schools dataset provides a range of administrative areas - Parliamentary constituency, ONS Census areas, as well as OS easting/northings and latitude and longitude. It is perhaps not so clear which of these geographic regions are most useful to answer the question, but this too looks doable.
Town Populations
The census provides population data and is readily available online via the ONS. The latest data is from 2001, but counts are available by the areas given in the EduBase2 data, so this looks possible too, within the limits of the data.
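Once the three sources are aligned, the calculation itself is small. The sketch below sums student numbers by area and divides by the area population; the institution names, area names and all figures are placeholders, not real HESA or Census counts, and it assumes each institution maps to exactly one area.

```python
from collections import defaultdict

# All names and figures below are placeholders for illustration.
students_by_institution = {
    "Example University": 1000,
    "Sample College": 250,
    "Other Institute": 500,
}
area_of_institution = {
    "Example University": "Area A",
    "Sample College": "Area A",
    "Other Institute": "Area B",
}
population_by_area = {"Area A": 50000, "Area B": 10000}


def student_proportions(students, areas, populations):
    """Return {area: students / population} for areas with known population."""
    totals = defaultdict(int)
    for institution, count in students.items():
        area = areas.get(institution)
        if area is not None:  # skip institutions we could not locate
            totals[area] += count
    return {
        area: total / populations[area]
        for area, total in totals.items()
        if populations.get(area)  # skip missing or zero populations
    }


proportions = student_proportions(
    students_by_institution, area_of_institution, population_by_area
)
# Areas ordered from highest to lowest student proportion.
ranked = sorted(proportions.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for area, share in ranked:
        print(f"{area}: {share:.1%}")
```

The choice of area (constituency, Census area, or something derived from the coordinates) changes the denominator and hence the ranking, so the answer depends heavily on that definition.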
Student Surveys
Although not directly relevant to the question posed, there are additional data sources providing quality measures and student satisfaction.
The Times Good University Guide is one, with individual pages for each University (Aston)
The Guardian University guide (available as a Google Spreadsheet)
Other data sources
For completeness, UCAS should be included as the central clearing house for University applications. Its site provides more detailed data on each university and its courses.
Of course there are entries in Wikipedia, e.g. Aston University, and hence in DBpedia: Aston University
Linking University data
Linking these various sets of University data is however not straightforward.
The HESA tables contain only the institution names, but these are not stable (Aston University used to be called The University of Aston in Birmingham).
EduBase2 gives the institution name and two codes: the UKPRN (UK Provider Reference Number, issued by the UK Register of Learning Providers) of 10007759, as well as its own code, 133787.
The university is identifiable with the EduBase2 URI:
http://www.edubase.gov.uk/establishment/summary.xhtml?urn=133787
but UKRLP does not display RESTful URIs thanks to the rewriting of URLs like:
http://www.ukrlp.co.uk/ukrlp/ukrlp_provider.page_pls_provDetails?x=&pn_p_...
In the RDF, new URIs have been minted based on the EduBase2 code e.g.
http://education.data.gov.uk/doc/school/133787
(as an aside, the creators of the latest data.gov.uk RDF dataset on BIS research funding also have entries for Universities such as this page (from the Stromness browser) and have chosen to mint new URIs for universities
http://education.data.gov.uk/id/institution/H-0108
which at present are not resolvable)
UCAS also has a code for Universities (A80 for Aston)
http://www.ucas.ac.uk/students/choosingcourses/choosinguni/instguide/a/a80
with a lot of data (but nothing program-friendly).
The Times Guide has no code and changes the name between pages - "Aston" in the guide, "Aston University Birmingham" on the individual page. The Guardian guide uses "Aston".
In the absence of a common identifier, linking can only be based on fuzzy matching of institutional names.
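To give a feel for what such fuzzy matching involves, here is a minimal sketch using Python's standard difflib module. The stop-word list is a guess, "Example University" is a placeholder entry, and a real matcher would need tuning and manual checking against the actual source lists.

```python
import difflib

# Words that carry little identifying information in institution names.
# This list is a guess and would need tuning against the real sources.
STOPWORDS = {"the", "of", "in", "university"}


def normalise(name):
    """Lower-case, strip punctuation and drop stop words."""
    cleaned = "".join(c if c.isalnum() else " " for c in name.lower())
    return " ".join(w for w in cleaned.split() if w not in STOPWORDS)


def similarity(a, b):
    """Similarity in [0, 1] between two normalised names."""
    return difflib.SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def best_match(name, candidates):
    """Return (candidate, score) for the closest candidate name."""
    score, candidate = max((similarity(name, c), c) for c in candidates)
    return candidate, score


# The second entry is a placeholder, not taken from any of the sources.
HESA_NAMES = ["Aston University", "Example University"]

if __name__ == "__main__":
    variants = [
        "Aston",
        "Aston University Birmingham",
        "The University of Aston in Birmingham",
    ]
    for variant in variants:
        print(variant, "->", best_match(variant, HESA_NAMES))
```

With these inputs "Aston" scores 1.0 against "Aston University", but the two longer variants score below 0.5 because of the extra place name, which shows how hard it is to pick a threshold that accepts true matches without also accepting false ones.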
What institutions are included?
To count students we need to identify which institutions to include. However, sources differ in which institutions they include. Just looking at the numbers without matching the sets we see a wide range of implicit definitions:
Linked Data project
Integrating these disparate datasets represents an interesting challenge in Linked Data. It is tempting to start to scrape and integrate as a private project, just for the challenge, but such an approach would yield not Linked Data but another database requiring re-scraping and recoding as sources changed structure. If Linked Data means anything, it means linking disparate datasets published as close to the source as possible - by UCAS, by HESA, by the Guardian, by EduBase etc. Central to such integration is agreement on a common identifier, say the UKPRN code, perhaps also expressed as part of a URI based on some agreed internet domain.
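As an illustration of what a shared identifier might look like, the sketch below mints a URI from a UKPRN and attaches the other codes mentioned above as N-Triples. The example.org domain and the property names are hypothetical, since no such domain has been agreed, and the UKPRN is assumed to be an 8-digit number as in the Aston example.

```python
# Hypothetical base for agreed identifiers; no such service exists.
BASE = "http://example.org/id/ukprn/"
# Hypothetical properties for the other codes.
DEF = "http://example.org/def/"
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def ukprn_uri(ukprn):
    """Mint a URI from a UKPRN, assumed here to be 8 digits."""
    code = str(ukprn).strip()
    if not (code.isdigit() and len(code) == 8):
        raise ValueError(f"not a plausible UKPRN: {ukprn!r}")
    return BASE + code


def identifier_triples(ukprn, codes, documents):
    """Return N-Triples tying other codes and documents to the UKPRN URI."""
    subject = f"<{ukprn_uri(ukprn)}>"
    triples = [
        f'{subject} <{DEF}{scheme}> "{value}" .'
        for scheme, value in sorted(codes.items())
    ]
    triples += [f"{subject} <{RDFS_SEE_ALSO}> <{doc}> ." for doc in documents]
    return triples


# Codes for Aston as quoted in the text above.
aston = identifier_triples(
    10007759,
    codes={"edubaseCode": "133787", "ucasCode": "A80"},
    documents=["http://education.data.gov.uk/doc/school/133787"],
)

if __name__ == "__main__":
    print("\n".join(aston))
```

If each publisher carried the UKPRN in its own data, these links could be generated at source rather than reconstructed by name matching.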
The reason I suggested use of DBpedia to answer my query was that I expected a quick win, but also expected to find some flaws in the answers. I expected this to lead on to a discussion as to why a better approach would be to use authoritative sources. And as I was unaware of the location of such services and the format of their data, the challenge I posed allowed developers with no knowledge of Linked Data sources in the UK public sector to carry out their SPARQL development.
From the exercise I have learnt about limitations of DBpedia of which I was previously unaware - so that has been useful, especially as DBpedia does seem to be promoted as central to the Linked Data space.
I'm still unaware of what a SPARQL query would look like which would query official data sources. I guess we're saying that such a query is not yet solvable without significant effort in data manipulation and cleansing?
It also struck me that HESA is only the aggregator of multiple returns from individual institutions, quite probably with their own interpretation of what counts as a student, and when that counting is done in the academic year. At least the statisticians at HESA are aware of such problems and are charged with the curation of the data set, but perhaps this data should be published as linked data by each institution!