DBpedia and football #linkeddata

There was a little flurry of tweets about #linkeddata and football, triggered by a query from @degsy.

I remembered that some of my early experiments with DBpedia had involved English soccer clubs, and indeed a couple of my applications are featured on the DBpedia use case page. The first maps the birthplaces of players in English football clubs. XQuery provides the glue, generating the SPARQL queries and formatting the results.

SPARQL queries

The first SPARQL query collects the teams in the category English Premier League Football clubs to generate an index page:

PREFIX skos: <>
PREFIX rdfs: <>
SELECT ?club ?clubName WHERE {
     # every resource filed under the clubs category
     ?club skos:subject <>.
     # keep only the English label of each club
     ?club rdfs:label ?clubName.  FILTER (lang(?clubName) = 'en' ).
}

The index links to a second script to generate the player map, passing the club URI e.g.

The second SPARQL query finds the current players in a team, along with each player's date of birth, birthplace, the latitude and longitude of that place, and a thumbnail image. The results are used to generate a KML file which is displayed on Google Maps (e.g. Arsenal):

PREFIX geo: <>
PREFIX p: <>
PREFIX rdfs: <>
PREFIX o: <>
SELECT * WHERE {
      ?player p:currentclub <%club%>.
      <%club%> rdfs:label ?clubName. FILTER ( lang(?clubName) = 'en').
      ?player p:playername ?playerName.
      # everything below is optional, so players with missing data are still returned
      OPTIONAL { ?player p:cityofbirth ?city . ?city rdfs:label ?cityName.  FILTER (lang(?cityName) = 'en') }.
      OPTIONAL { ?player o:thumbnail ?thumbnail }.
      OPTIONAL { ?player o:birthDate ?dob }.
      OPTIONAL { ?city geo:long ?long. }
      OPTIONAL { ?city geo:lat ?lat. }
}

The placeholder %club% in the query template is replaced by the URI of the club before the query is submitted. Dates of birth need converting from xs:date values to a more readable format, and the positions are randomly dithered so that players born in the same town are not superimposed on the map.
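The three post-processing steps can be sketched roughly as follows. This is an illustration in Python rather than the XQuery the application actually uses, and the function names, the date output format, the 0.01 degree spread and the example.org URI are all assumptions of mine.

```python
import random

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def fill_template(template, club_uri):
    """Substitute the club URI for every %club% placeholder in the template."""
    # Reject characters that could break out of the <...> URI brackets.
    if any(ch in club_uri for ch in '<>"{} \n'):
        raise ValueError("unsafe character in club URI")
    return template.replace("%club%", club_uri)

def readable_date(xs_date):
    """Turn an xs:date such as '1987-05-04' into '4 May 1987'."""
    # Only the first ten characters are used, so a timezone suffix is ignored.
    year, month, day = xs_date[:10].split("-")
    return f"{int(day)} {MONTHS[int(month) - 1]} {int(year)}"

def dither(lat, long, spread=0.01, rng=random):
    """Shift a position by up to `spread` degrees on each axis."""
    return (lat + rng.uniform(-spread, spread),
            long + rng.uniform(-spread, spread))

# Hypothetical club URI, for illustration only.
query_line = fill_template("?player p:currentclub <%club%>.",
                           "http://example.org/resource/Example_FC")
```

The dithering is deliberately small relative to the map scale, so a marker stays visually attached to its town while still separating from its neighbours.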


Until I fixed the scripts tonight, the application was broken and flagged for removal. The breakage had two causes: changes in the DBpedia vocabularies and in the mapping from Wikipedia to those vocabularies, and changes to the Wikipedia content itself. Since there is no mechanism for propagating such changes, a developer using DBpedia can do nothing but check occasionally that the application is still working, perhaps with an automatic monitor, and rewrite the queries to match any changes.
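An automatic monitor of the kind mentioned above could be as simple as the following Python sketch. It assumes the endpoint's response has already been fetched and parsed from the standard SPARQL JSON results format; the function name and the choice of checks are mine, not part of the application.

```python
def find_problems(results, required_vars):
    """Return a list of reasons to suspect that a query has stopped working.

    `results` is a parsed SPARQL JSON results document, i.e. a dict of the
    form {"head": {...}, "results": {"bindings": [...]}}.
    """
    bindings = results.get("results", {}).get("bindings", [])
    if not bindings:
        # A renamed property or category typically shows up as an empty result.
        return ["query returned no rows"]
    problems = []
    for var in required_vars:
        # A variable that is bound in no row at all suggests a vocabulary change.
        if not any(var in row for row in bindings):
            problems.append(f"variable ?{var} is never bound")
    return problems
```

Run on a schedule against the player query with variables such as playerName, city and lat, a non-empty list would be the cue to look at the vocabulary again.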

Data Quality

I compared the list of Arsenal players with a more authoritative, up-to-date source, SoccerBase (accessed 24/02/2010).

  • SoccerBase and DBpedia - 28
  • SoccerBase only - 8: 4 are due to update delays between Wikipedia and DBpedia, 3 players have no Wikipedia entry, and 1 has a mistyped club (fixed in Wikipedia)
  • DBpedia only - 7: 6 are players on loan to other clubs, with the changes not synced with Wikipedia, and 1 was the manager
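The comparison amounts to a three-way split of two sets of names. The Python sketch below assumes names have already been normalised so that the same player matches across sources, and it treats SoccerBase as the reference when turning the counts above into precision and recall. The function names are mine.

```python
def compare_sources(reference, candidate):
    """Split two collections of player names into shared and unshared names."""
    ref, cand = set(reference), set(candidate)
    return {
        "both": ref & cand,            # found in both sources
        "reference_only": ref - cand,  # missing from the candidate
        "candidate_only": cand - ref,  # not in the reference
    }

def precision_recall(both, reference_only, candidate_only):
    """Compute precision and recall of the candidate from the three counts."""
    precision = both / (both + candidate_only)
    recall = both / (both + reference_only)
    return precision, recall

# Counts from the Arsenal comparison: 28 in both, 8 SoccerBase only, 7 DBpedia only.
precision, recall = precision_recall(28, 8, 7)
print(f"precision {precision:.2f}, recall {recall:.2f}")  # precision 0.80, recall 0.78
```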

Not all of the players retrieved could be mapped. Of the 28 true positives, 22 could be mapped, 1 had no birth city, and the remaining 5 were from towns in France which, surprisingly, all lacked geo-coding.

Happily, it appears that more frequent updates of DBpedia, or better still live updates (in the pipeline), would improve data quality significantly. But wouldn't it be great to persuade the owners of the Racing Post to allow their data to be converted to RDF? To my mind, Linked Data needs data as close to the source as possible, rather than data which has been re-typed into Wikipedia.