Scraping with XQuery and making some RDF (inspired by ScraperWiki)

Via Twitter today I learnt about ScraperWiki. This looks really good and is a much-needed service. All power to it.

I've been using XQuery for this task for a while but have never developed a collaborative platform. Nonetheless, an XQuery-based platform would be a useful addition to the set of tools: it would integrate nicely with the pipeline, and it would also be nice to get the data as RDF.

As an example of the XQuery approach, I looked at scraping the Premier League table from the BBC site. Here it is on ScraperWiki with its Python code and cached data.

The XQuery equivalent (with a few added metadata attributes) returns XML and comprises the following XQuery code:

import module namespace convert = "" at "../lib/convert.xqm";

let $uri := ""
let $html := convert:get-html($uri)
let $table := $html//table[@class = "fulltable"]
let $date := string($html//div[@class = "fulltableHeader"]/text())
let $xml :=
  element football-league {
    attribute id { "Barclays_Premier_League" },
    attribute label { "Barclays Premier League" },
    attribute valid-date { $date },
    attribute acquired { current-dateTime() },
    attribute source { $uri },
    for $row in $table/tr[@class = ("r1", "r2")]
    return
      element team {
        attribute id { replace($row/td[2], " ", "_") },
        attribute label { string($row/td[2]) },
        attribute position { string($row/td[1]) },
        attribute games-played { string($row/td[3]) },
        attribute goal-difference { string($row/td[14]) },
        attribute points { string($row/td[15]) }
      }
  }
return $xml
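
Run against the BBC page, the query yields a document shaped roughly like this (attribute values elided, since they change with every round of fixtures):

<football-league id="Barclays_Premier_League" label="Barclays Premier League"
                 valid-date="..." acquired="..." source="...">
   <team id="..." label="..." position="..." games-played="..."
         goal-difference="..." points="..."/>
   ...
</football-league>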

The source page does not declare a namespace, so none is needed here, though I often use a wildcard (e.g. *:table) just in case.

The convert:get-html function is a wrapper round an eXist httpclient function:

declare function convert:get-html($uri as xs:string) {
   let $headers := element headers { element header { attribute name { "Pragma" }, attribute value { "no-cache" } } }
   let $response := httpclient:get(xs:anyURI($uri), false(), $headers)
   return
      if ($response/@statusCode eq "200")
      then $response/httpclient:body/*
      else ()
};

This XML is not designed to be Excel-friendly, because I want to make a simple transformation to RDF, in which attributes become properties and elements become resources:
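
Applied to the football-league element above, the mapping produces something along these lines (the base URI and the uwe prefix are illustrative placeholders, not part of the original):

<rdf:Description rdf:about="http://example.org/data/football-league/Barclays_Premier_League">
   <uwe:label>Barclays Premier League</uwe:label>
   <uwe:valid-date>...</uwe:valid-date>
   <uwe:team rdf:resource="http://example.org/data/football-league/Barclays_Premier_League/team/..."/>
</rdf:Description>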

The transformation is accomplished by an XQuery function (revised):

declare function convert:element-to-rdf($element, $path, $prefix, $base) {
  let $epath := concat($path, "/", local-name($element), "/", $element/@id)
  return (
    element rdf:Description {
      attribute rdf:about { concat($base, $epath) },
      for $at in $element/(@* except @id)
      return
        element { concat($prefix, local-name($at)) } {
          if (starts-with($at, "http://"))
          then attribute rdf:resource { string($at) }
          else string($at)
        },
      for $child in $element/*
      return
        element { concat($prefix, local-name($child)) } {
          attribute rdf:resource { concat($base, $epath, "/", local-name($child), "/", $child/@id) }
        }
    },
    for $child in $element/*
    return convert:element-to-rdf($child, $epath, $prefix, $base)
  )
};

The function descends through the XML structure, creating an rdf:Description for each element, a property for each attribute, and a linking property for each child element; the child elements are then transformed into rdf:Descriptions in turn. Element names are assumed to be unique within the base, ids unique within an element name, and attributes single-valued, so it's a bit limited, but helpful for simple data.
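
Putting it together, the top-level call wraps everything in an rdf:RDF element. This is a sketch only — the vocab namespace and base URI below are hypothetical placeholders, not values from the original:

declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
declare namespace uwe = "http://example.org/vocab/";  (: hypothetical vocab namespace :)

element rdf:RDF {
  convert:element-to-rdf($xml, "", "uwe:", "http://example.org/data")  (: hypothetical base URI :)
}

Note that the prefix passed in must be bound to a namespace in scope, since the function builds element names like uwe:points with computed constructors.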

RDF this might be (well, it validates), but it's not Linked Data because:

  • this uses only local resource and vocab URIs
  • resource URIs can't be dereferenced
  • there is no schema for the vocab
  • literals are not typed

Most of these can be fixed with a bit more time, but the biggest challenge is resource matching, for example matching "uwe:Man_Utd" with the corresponding dbpedia resource. Perhaps common aliases could be added to wikipedia, and hence to dbpedia, but it would be much better of course if these identifiers were in the BBC page.

Another challenge is the lack of a vocab for football games.  @degsy tweeted about the need for this - I wonder if he got anywhere.