
University data as #linkeddata

On the LinkedIn Linked Data group, Brian Kelly at UKOLN asked the question: which town or city in the UK has the largest proportion of students?

He has now summarised the responses on his blog.

Very interesting question, fraught with definitional problems, and an interesting exploration of the data quality of DBpedia. However, interesting though DBpedia is as a social knowledge database queryable with SPARQL, I don't see that it is really an exploration of Linked Data. I would expect the solution to be based on linking data from disparate, more definitive sources closer to the point of data collection: student numbers from HESA (even if available only as .xls at present), institution names linked to places using the RDFed EduBase data, and then Census data on populations.

Student Numbers

The HESA site has a page on online statistics which leads to a list of products. The first of these, Students and Qualifiers Data Tables, goes to a set of xls and csv tables showing analyses by different factors: subject of study; disability; ethnicity; institution level; qualifications obtained, year by year from 1995/6 to 2007/8 (the latest collated). For example, the institutional level data for 2007/8 is available as an xls. We can use elev.at to do the conversion to XML, then a stylesheet to generate semantic XML (and hence RDF?). OK, straightforward.
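As a rough illustration of the csv route (rather than the elev.at and stylesheet route described above), here is a minimal Python sketch that turns institution-level rows into N-Triples. The column names "Institution" and "Total", the example.org namespace, the predicate names and the sample row are all assumptions made up for the example; the real HESA tables will need their own header mapping.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace: not an agreed vocabulary, just a placeholder.
BASE = "http://example.org/hesa/"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def rows_to_ntriples(csv_text):
    """Convert rows with assumed columns Institution,Total into N-Triples lines."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["Institution"].strip()
        # Counts in spreadsheets are often formatted with thousands separators.
        total = int(row["Total"].replace(",", ""))
        subject = f"<{BASE}institution/{quote(name)}>"
        # Escape backslashes and double quotes for the N-Triples string literal.
        literal = name.replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} <{BASE}name> "{literal}" .')
        triples.append(
            f'{subject} <{BASE}totalStudents> "{total}"^^<{XSD_INTEGER}> .'
        )
    return triples


# Invented sample row, not a real HESA figure.
SAMPLE = 'Institution,Total\nExample University,"12,345"\n'

if __name__ == "__main__":
    print("\n".join(rows_to_ntriples(SAMPLE)))
```

Note that the subject URI here is built from the institution name, which is exactly the unstable kind of identifier that makes linking hard later on.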

University locations


Discovering where Universities are is trickier.  Universities are included in the EduBase2 data and hence in the schools RDF in data.gov.uk -  this is the data on Aston in my prototype RDF browser and in EduBase2.

The schools dataset provides a range of administrative areas - Parliamentary constituency, ONS Census areas, as well as OS easting/northings and latitude and longitude. It is perhaps not so clear which of these geographic regions is most useful for answering the question, but this too looks doable.

Town Populations

The census provides population data and is readily available online via the ONS. The latest data is from 2001, but counts are available by the areas given in the EduBase2 data, so this looks possible too, within the limits of the data.
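Once student numbers and populations are keyed on the same area code, the calculation itself is simple. The sketch below assumes that both inputs have already been extracted and share one kind of area code; the function name and the shape of the inputs are hypothetical.

```python
from collections import defaultdict


def student_proportions(institutions, populations):
    """Return students / residents for each area.

    institutions: iterable of (area_code, student_count) pairs, one per institution.
    populations: dict mapping area_code to resident population.
    Areas with a missing or zero population are skipped.
    """
    students_by_area = defaultdict(int)
    for area, count in institutions:
        students_by_area[area] += count
    return {
        area: total / populations[area]
        for area, total in students_by_area.items()
        if populations.get(area)
    }


def largest_proportion(institutions, populations):
    """Return the (area_code, proportion) pair with the highest proportion, or None."""
    proportions = student_proportions(institutions, populations)
    if not proportions:
        return None
    return max(proportions.items(), key=lambda item: item[1])
```

The hard part is not this arithmetic but deciding which area definition to use and whether the student and population counts are comparable.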

Student Surveys

Although not directly relevant to the question posed, there are additional data sources providing quality measures and student satisfaction.

The Times Good University Guide is one, with individual pages for each University (Aston)

The Complete University Guide

The Guardian University guide (available as  a Google Spreadsheet)

Other data sources

For completeness, UCAS should be included as the central clearing house for University applications.  Its site provides more detailed data on each university and its courses.

Of course there are entries in Wikipedia, e.g. Aston University, and hence in DBpedia: Aston University.

Linking University data

Linking these various sets of University data is however not straightforward. 

The HESA tables contain only the institution names, but these are not stable (Aston University used to be called The University of Aston in Birmingham).

EduBase2 gives the institution name and two codes: the UKPRN (UK Provider Reference Number, issued by the UK Register of Learning Providers) of 10007759, as well as its own code, 133787.

The university is identifiable with the EduBase2 URI:

         http://www.edubase.gov.uk/establishment/summary.xhtml?urn=133787

but UKRLP does not display RESTful URIs thanks to the rewriting of URLs like:

           http://www.ukrlp.co.uk/ukrlp/ukrlp_provider.page_pls_provDetails?x=&pn_p_...

In the RDF, new URIs have been minted  based on the EduBase2 code e.g.

            http://education.data.gov.uk/doc/school/133787

(as an aside, the creators of the latest data.gov.uk RDF dataset on BIS research funding also have entries for Universities such as this page (from the Stromness browser) and have chosen to mint new URIs for universities

           http://education.data.gov.uk/id/institution/H-0108

which at present are not resolvable)

UCAS also has a code for Universities (A80 for Aston) 

          http://www.ucas.ac.uk/students/choosingcourses/choosinguni/instguide/a/a80

with a lot of data (but nothing program-friendly).

The Times Guide has no code and changes the name between pages - "Aston" in the guide, "Aston University Birmingham" on the individual page. The Guardian guide uses "Aston".

In the absence of a common identifier, linking can only be based on fuzzy matching of institutional names.
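To give a feel for what fuzzy matching of names involves, here is a minimal sketch using Python's standard difflib module. The stopword list, the cutoff of 0.6 and the function names are arbitrary assumptions for illustration, not a tested matching scheme.

```python
import difflib
import re

# Words assumed to carry little identifying information; this list is a guess.
STOPWORDS = {"the", "of", "in", "university"}


def normalise(name):
    """Lower-case a name and drop punctuation and stopwords."""
    words = re.findall(r"[a-z]+", name.lower())
    return " ".join(word for word in words if word not in STOPWORDS)


def best_match(name, candidates, cutoff=0.6):
    """Return the candidate whose normalised form is closest to name, or None."""
    keyed = {normalise(candidate): candidate for candidate in candidates}
    hits = difflib.get_close_matches(
        normalise(name), list(keyed), n=1, cutoff=cutoff
    )
    return keyed[hits[0]] if hits else None


if __name__ == "__main__":
    print(best_match("Aston", ["Aston University", "Example College"]))
```

With these settings "Aston" and "Aston University" normalise to the same string and match. Longer variants that add a place name reduce to "aston birmingham" and may fall below the cutoff, or may match a different institution in the same city, so any matches produced this way would need checking by hand.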

What institutions are included?

To count students we need to identify which institutions to include. However, sources differ in which institutions they include. Just looking at the numbers, without matching the sets, we see a wide range of implicit definitions:

  • HESA - 166
  • EduBase2 - Higher Education 139, Further Education 482
  • UCAS - 304
  • Times Good University Guide - 114
  • Complete University Guide - 113
  • Guardian University Guide - 117 + 32 minor

Linked Data project

Integrating these disparate datasets represents an interesting challenge in Linked Data. It is tempting to start to scrape and integrate as a private project, just for the challenge, but such an approach would yield not Linked Data but another database requiring re-scraping and recoding as sources changed structure. If Linked Data means anything, it means linking disparate datasets published as close to the source as possible - by UCAS, by HESA, by the Guardian, by EduBase etc. Central to such integration is agreement on a common identifier, say the UKPRN code, perhaps also expressed as part of a URI based on some agreed internet domain.
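As a sketch of what a UKPRN-based identifier might look like in practice, the Python below mints a URI from a UKPRN and emits rdfs:seeAlso links to the pages other sources already publish. The id.example.org domain is a placeholder for whatever domain might be agreed, the function names are hypothetical, and the eight-digit check is an assumption based on the Aston example above.

```python
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
# Placeholder for an agreed domain; no such service is known to exist.
BASE = "http://id.example.org/institution/"


def institution_uri(ukprn):
    """Mint a URI for an institution from its UKPRN (assumed to be 8 digits)."""
    ukprn = str(ukprn).strip()
    if not (ukprn.isdigit() and len(ukprn) == 8):
        raise ValueError(f"unexpected UKPRN format: {ukprn!r}")
    return BASE + ukprn


def see_also_triples(ukprn, other_uris):
    """Link the minted URI to existing pages about the same institution."""
    subject = institution_uri(ukprn)
    return [f"<{subject}> <{RDFS_SEE_ALSO}> <{uri}> ." for uri in other_uris]


if __name__ == "__main__":
    links = [
        "http://education.data.gov.uk/doc/school/133787",
        "http://www.ucas.ac.uk/students/choosingcourses/choosinguni/instguide/a/a80",
    ]
    print("\n".join(see_also_triples(10007759, links)))
```

The links use rdfs:seeAlso rather than owl:sameAs because the targets are pages about the institution rather than identifiers for the institution itself.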


 

Many thanks for the comment on my blog and your more in-depth post.

The reason I suggested use of DBpedia to answer my query was that I expected a quick win, but also to find some flaws in the answers. I expected this to lead on to a discussion as to why a better approach would be to use authoritative sources. And as I was unaware of the location of such services and the format of their data, the challenge I posed allowed developers with no knowledge of Linked Data sources in the UK public sector to carry out their SPARQL development.

From the exercise I have learnt about the limitations of DBpedia which I was previously unaware of - so that has been useful, especially as DBpedia does seem to be promoted as central to Linked Data space.

I'm still unaware of what a SPARQL query which would query official data sources would look like. I guess we're saying that such a query is not yet solvable without significant effort in data manipulation and cleansing?

Interesting & thanks. Your post makes clear what should be obvious: making a query work is also a data analysis problem, and potentially quite different. I'm sure you thought of it, but you didn't mention the multi-campus issue that Brian mentioned. One would have to go back to the asker to sort some of those issues out!
Chris, thanks. I did miss that part of the problem, although we have it here at UWE: the main campus is in South Gloucestershire but the Bower Ashton campus is in Bristol.

It also struck me that HESA is only the aggregator of multiple returns from individual institutions, quite probably with their own interpretation of what counts as a student, and of when that counting is done in the academic year. At least the statisticians at HESA are aware of such problems and are charged with the curation of the data set, but perhaps this data should be published as linked data by each institution!