An exercise in RDF - 2005 Election data

@psychemedia recently wrote in his blog about visualizing the 2005 election results based on CSV data on the Electoral Commission site. 

To continue my exercises in RDF scraping, I converted these to XML (constituencies, results and distribution) and then to RDF (constituencies, results  and distribution).

This process used three custom XQuery scripts to do the CSV-to-XML conversion, and the recursive XML-to-RDF function from the last post to make the RDF. The minor problems encountered included:

  • Partly de-normalized data - in the results, the constituency name appears only on the first candidate row of each constituency, so the following-sibling approach was needed to group the candidates
  • Naming Constituencies - I replaced all & with "and". However there are differences with other sources of constituency names - for example TheyWorkForYou has Torridge & West Devon whereas this data has West Devon and Torridge - it seems a hand-coded name map for the exceptions is needed.
  • Naming Candidates - I concatenated name and initials with a ", "
  • Ids - I replaced spaces with _ but there are a few non-ASCII characters in the CSV file - there's always a Lembit Opik - work is needed on character encoding
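The cleanup steps above can be sketched in Python (the actual conversion was done in XQuery; the function names and the NFKD folding used for the non-ASCII characters are my assumptions):

```python
import unicodedata

def constituency_name(raw):
    # Replace "&" with "and", as in the conversion scripts
    return raw.replace("&", "and")

def candidate_name(name, initials):
    # Concatenate name and initials with a ", "
    return name + ", " + initials

def make_id(raw):
    # Replace spaces with "_" and fold non-ASCII characters (there's
    # always a Lembit Opik) to their nearest ASCII equivalents by
    # decomposing and dropping the combining marks
    folded = unicodedata.normalize("NFKD", raw)
    ascii_only = folded.encode("ascii", "ignore").decode("ascii")
    return ascii_only.replace(" ", "_")

print(make_id(constituency_name("Torridge & West Devon")))  # Torridge_and_West_Devon
print(make_id("Lembit Öpik"))                               # Lembit_Opik
```

The NFKD fold is lossy (accents are simply dropped), which is tolerable for identifiers but not for display names - hence keeping the name and the id as separate properties.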

XQuery script

An XQuery script provides a basic, standardized extract from the XML datasets: e.g. Bath

RDF modeling

The difference between entity/attribute/relationship based models (including XML) and RDF really struck me with this exercise.

Entities don't need to be created. Resources like the election, constituencies and candidates are never themselves created. All we do is add triples which make statements about their URIs. Of course it's nice if the URIs are dereferenceable, but the RDF is usable without this.
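A minimal sketch of this idea, modeling a triple store as a Python set of (subject, predicate, object) tuples - the URIs are hypothetical placeholders:

```python
# Hypothetical URIs for a constituency resource and a name property
BATH = "http://example.org/2005/constituency/Bath"
NAME = "http://example.org/vocab/name"

store = set()

# There is no "create constituency" step: the resource comes into
# being only as the subject of statements made about its URI
store.add((BATH, NAME, "Bath"))

print((BATH, NAME, "Bath") in store)  # True
```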

The model is composable. The results RDF/XML document adds constituency properties like the name, and candidate properties like the name, party and number of votes, but does not say who was elected. The constituency RDF/XML document provides properties about voting numbers in the constituency - for example the turnout - and, for the winning candidate, the boolean elected property. When uploaded to the same dataset, these separate triples are pooled, and SPARQL queries can return aggregated data based on common types, properties and Resource URIs.
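Composition can be sketched the same way: two independently produced sets of triples are pooled by set union, with a comprehension standing in for the SPARQL query (the candidate, vote figure and vocabulary URIs are invented for illustration):

```python
CAND = "http://example.org/2005/candidate/Bath/Smith_J"
NAME = "http://example.org/vocab/name"
VOTES = "http://example.org/vocab/votes"
ELECTED = "http://example.org/vocab/elected"

# From the results conversion: name and votes, but not who won
results = {
    (CAND, NAME, "Smith, J"),
    (CAND, VOTES, 12345),
}

# From the constituencies conversion: the boolean elected property
constituencies = {
    (CAND, ELECTED, True),
}

# Loading both documents into one dataset pools the triples
pooled = results | constituencies

# Stand-in for a SPARQL query: who was elected, and with how many votes?
winners = [s for (s, p, o) in pooled if p == ELECTED and o is True]
for w in winners:
    votes = next(o for (s, p, o) in pooled if s == w and p == VOTES)
    print(w, votes)
```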

Triples are unique. Duplicate triples are generated across these three files by the conversion from the separate XML files to RDF, but they will be ignored when loaded into the same datastore, or should be ignored by the query engine if they sit in different graphs. RDF is kind-of auto-normalized.
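In the set-of-tuples picture, this de-duplication is just set semantics - asserting the same triple twice leaves one copy:

```python
CONST = "http://example.org/2005/constituency/Bath"  # hypothetical URI
NAME = "http://example.org/vocab/name"

store = set()
store.add((CONST, NAME, "Bath"))  # from the results file
store.add((CONST, NAME, "Bath"))  # same triple again, from the constituencies file

print(len(store))  # 1 - the duplicate is absorbed
```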

Literal / Resource dilemma

The age-old Attribute/Entity dilemma appears in RDF as the Literal/Resource dilemma. When modeling the candidates, I modeled the party as a literal. I later added the distribution of seats, whereupon I was forced to model a Party as a Resource with properties like its label and number of seats. This meant reworking the results XML. Perhaps I should have anticipated this, but the nature of RDF means that the scope is not closed as it is in localised modeling, so these challenges will inevitably happen. However a generic process for schema evolution is possible in RDF, partly because RDF datasets are typically highly redundant. It thus seemed OK to add links to Party resources alongside the initial party-name literals. 
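That evolution step - keeping the original party-name literal while adding a link to a new Party resource - can be sketched like this (the party, seat count and property URIs are invented for illustration):

```python
CAND = "http://example.org/2005/candidate/Bath/Smith_J"
PARTY_NAME = "http://example.org/vocab/partyName"   # literal-valued, the original modeling
PARTY = "http://example.org/vocab/party"            # resource-valued, added later
PURPLE = "http://example.org/2005/party/Purple_Party"
LABEL = "http://example.org/vocab/label"
SEATS = "http://example.org/vocab/seats"

store = set()
store.add((CAND, PARTY_NAME, "Purple Party"))  # original literal stays in place
store.add((CAND, PARTY, PURPLE))               # new link to the Party resource
store.add((PURPLE, LABEL, "Purple Party"))     # properties now hang off the resource
store.add((PURPLE, SEATS, 3))

# Old queries against the literal and new queries against the
# resource both keep working; the redundancy is harmless
print(len(store))  # 4
```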

Now to get a new Talis store to put this stuff into ...