On the LinkedIn Linked Data group, Brian Kelly at UKOLN asked the question: "Which town or city in the UK has the largest proportion of students?"
He has now summarised the responses on his blog.
A very interesting question, fraught with definitional problems, and an interesting exploration of the data quality of DBpedia. However, interesting though DBpedia is as a social knowledge database queryable with SPARQL, I don't see that this is really an exploration of Linked Data. I would expect the solution to be based on linking data from disparate, more definitive sources closer to the point of data collection: student numbers from HESA (even if available only as .xls at present), institution names linked to places using the RDFed EduBase data, and then Census data on populations.
The HESA site has a page on online statistics which leads to a list of products. The first of these, Students and Qualifiers Data Tables, goes to a set of xls and csv tables showing analyses by different factors: subject of study; disability; ethnicity; institution level; qualifications obtained year by year from 1995/6 to 2007/8 (the latest collated). For example, the institutional level data for 2007/8 is available as an xls. We can use elev.at to do the conversion to XML, then a stylesheet to generate semantic XML (and hence RDF?). OK, straightforward.
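As a rough sketch of the "and hence RDF" step, here is one way the csv version of a table could be turned directly into N-Triples in Python, skipping the XML stage. This is not the elev.at and stylesheet route described above, just an alternative for illustration. The `example.org` namespace, the column names `name` and `students`, and the sample row are all assumptions; the real HESA tables have their own headers.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace; nothing is actually published at this address.
BASE = "http://example.org/hesa/"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def rows_to_ntriples(csv_text):
    """Turn each CSV row into two N-Triples lines describing one institution.

    Assumes (hypothetically) a header row with 'name' and 'students' columns.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Only the name is available, so the subject URI is built from it.
        subject = f"<{BASE}institution/{quote(row['name'])}>"
        # Escape backslashes and double quotes for the N-Triples literal.
        name = row["name"].replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} <{BASE}name> "{name}" .')
        triples.append(
            f'{subject} <{BASE}totalStudents> '
            f'"{int(row["students"])}"^^<{XSD_INTEGER}> .'
        )
    return triples


if __name__ == "__main__":
    # Placeholder row, not real HESA data.
    sample = "name,students\nExample University,1000\n"
    print("\n".join(rows_to_ntriples(sample)))
```

Building the subject URI from the institution name is the weak point here, since a name is the only key these tables offer.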
Discovering where Universities are is trickier. Universities are included in the EduBase2 data and hence in the schools RDF in data.gov.uk - this is the data on Aston in my prototype RDF browser and in EduBase2.
The schools dataset provides a range of administrative areas (Parliamentary constituency, ONS Census areas) as well as OS easting/northings and latitude and longitude. It is perhaps not so clear which of these geographic regions is most useful for answering the question, but this too looks doable.
The census provides population data and is readily available online via the ONS. The latest data is from 2001, but counts are available for the areas given in the EduBase2 data, so this looks possible too, within the limits of the data.
Although not directly relevant to the question posed, there are additional data sources providing quality measures and student satisfaction.
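Once the three sources are in hand, the calculation itself is small. The sketch below assumes three plain dictionaries: student counts per institution (from HESA), an area code per institution (from EduBase2) and a population per area (from the census). The function name, the keys and the figures in the usage example are placeholders made up for illustration, not real data.

```python
def student_proportions(students_by_institution, area_of_institution, population_by_area):
    """Return {area: students / population} for areas with a known population."""
    totals = {}
    for institution, count in students_by_institution.items():
        area = area_of_institution.get(institution)
        if area is None:
            # Institution could not be located; leave it out rather than guess.
            continue
        totals[area] = totals.get(area, 0) + count
    return {
        area: total / population_by_area[area]
        for area, total in totals.items()
        if population_by_area.get(area)  # skip areas with missing or zero population
    }


if __name__ == "__main__":
    # Placeholder figures only.
    students = {"U1": 100, "U2": 50, "U3": 10}
    areas = {"U1": "AREA_A", "U2": "AREA_A", "U3": "AREA_B"}
    population = {"AREA_A": 1000, "AREA_B": 100}
    ranked = sorted(student_proportions(students, areas, population).items(),
                    key=lambda item: item[1], reverse=True)
    print(ranked)
```

This says nothing about the definitional problems, such as whether students are already counted in the census population of the area where they study.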
Other data sources
For completeness, UCAS should be included as the central clearing house for University applications. Its site provides more detailed data on each university and its courses.
Linking University data
Linking these various sets of University data is however not straightforward.
The HESA tables contain only the institution names, but these are not stable (Aston University used to be called The University of Aston in Birmingham).
EduBase2 gives the institution name and two codes: a UKPRN (UK Provider Reference Number, issued by the UK Register of Learning Providers) of 10007759, as well as its own code, 133787.
The university is identifiable with the EduBase2 URI:
but UKRLP does not display RESTful URIs thanks to the rewriting of URLs like:
In the RDF, new URIs have been minted based on the EduBase2 code e.g.
(as an aside, the creators of the latest data.gov.uk RDF dataset on BIS research funding also have entries for Universities such as this page (from the Stromness browser) and have chosen to mint new URIs for universities
which at present are not resolvable)
UCAS also has a code for Universities (A80 for Aston)
with a lot of data (but nothing program-friendly).
The Times Guide has no code and changes the name between pages: "Aston" in the guide, "Aston University Birmingham" on the individual page. The Guardian guide uses "Aston".
In the absence of a common identifier, linking can only be based on fuzzy matching of institutional names.
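To give a feel for what fuzzy matching involves, here is a minimal sketch using Python's standard `difflib`. The stopword list, the `normalise` and `best_match` helpers and the 0.6 cutoff are my own assumptions for illustration, not something any of these sources prescribe.

```python
import difflib
import re

# Words that carry little identifying information in UK institution names
# (an assumed list, chosen for this illustration).
STOPWORDS = {"the", "university", "of", "in"}


def normalise(name):
    """Lower-case the name, drop punctuation and remove stopwords."""
    words = re.findall(r"[a-z]+", name.lower())
    return " ".join(w for w in words if w not in STOPWORDS)


def best_match(name, candidates, cutoff=0.6):
    """Return the candidate whose normalised form is closest to `name`, or None.

    If two candidates normalise to the same string, the later one wins.
    """
    by_normalised = {normalise(c): c for c in candidates}
    hits = difflib.get_close_matches(
        normalise(name), list(by_normalised), n=1, cutoff=cutoff
    )
    return by_normalised[hits[0]] if hits else None


if __name__ == "__main__":
    guide_names = ["Aston University Birmingham", "University of Oxford"]
    print(best_match("The University of Aston in Birmingham", guide_names))
```

Even on this one institution the approach is fragile: a bare "Aston" is short enough that it may fall below the cutoff against "Aston University Birmingham", so the threshold and stopwords would need tuning against real lists, with manual checking of the results.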
What institutions are included?
To count students we need to identify which institutions to include. However, sources differ in which institutions they include. Just looking at the numbers without matching the sets, we see a wide range of implicit definitions:
Linked Data project
Integrating these disparate datasets represents an interesting challenge in Linked Data. It is tempting to start to scrape and integrate as a private project, just for the challenge, but such an approach would yield not Linked Data but another database requiring re-scraping and recoding as sources changed structure. If Linked Data means anything, it means linking disparate datasets published as close to the source as possible - by UCAS, by HESA, by the Guardian, by EduBase etc. Central to such integration is agreement on a common identifier, say the UKPRN code, perhaps also expressed as part of a URI based on some agreed internet domain.
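To make the point concrete, here is a small sketch of how records from different publishers would fall together if each carried the UKPRN. The `example.org` base stands in for whatever domain might be agreed, the check that a UKPRN is eight digits is my assumption based on the Aston value, and the function names are hypothetical. The Aston codes are the ones quoted above.

```python
import re

# Hypothetical base; stands in for an agreed, resolvable domain.
BASE = "http://example.org/id/institution/"


def institution_uri(ukprn):
    """Build an institution URI from a UKPRN (assumed here to be eight digits)."""
    ukprn = str(ukprn).strip()
    if not re.fullmatch(r"\d{8}", ukprn):
        raise ValueError(f"not a plausible UKPRN: {ukprn!r}")
    return BASE + ukprn


def merge_by_ukprn(*datasets):
    """Merge {ukprn: {field: value}} dictionaries into one record per URI."""
    merged = {}
    for dataset in datasets:
        for ukprn, fields in dataset.items():
            merged.setdefault(institution_uri(ukprn), {}).update(fields)
    return merged


if __name__ == "__main__":
    edubase = {"10007759": {"edubase_code": "133787"}}
    ucas = {"10007759": {"ucas_code": "A80"}}
    print(merge_by_ukprn(edubase, ucas))
```

With a shared key the join is a dictionary update; no name matching is needed at all.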