The Wallace Line.

Scraping with XQuery and making some RDF (inspired by ScraperWiki)

Via Twitter today I learnt about ScraperWiki. This looks really good and is a much-needed service. All power to it.

I've been using XQuery for this task for a while but never developed a collaborative platform. Nonetheless an XQuery-based platform would be a useful addition to the set of tools. It would integrate with the pipeline nicely. It would also be nice to get the data as RDF.

To illustrate the XQuery approach, I took the example of scraping the Premier League table from the BBC site. Here it is on ScraperWiki with its Python code and cached data.

The XQuery equivalent (with a few added metadata attributes) to return XML can be run as

http://www.cems.uwe.ac.uk/xmlwiki/Scrape/premierLeague-xml.xq

and comprises the following XQuery code:



import module namespace convert = "http://www.cems.uwe.ac.uk/xmlwiki/convert" at "../lib/convert.xqm";

let $uri := "http://news.bbc.co.uk/sport1/hi/football/eng_prem/table/default.stm"
let $html := convert:get-html($uri)
let $table := $html//table[@class="fulltable"]
let $date := string ($html//div[@class="fulltableHeader"]/text()) 
return
  element football-league {
     attribute id {"Barclays_Premier_League"},
     attribute label  {"Barclays Premier League"},
     attribute valid-date { $date },
     attribute acquired { current-dateTime()},
     attribute source {$uri},
     for $row in $table/tr[@class=("r1","r2")]
     return 
       element  team {
           attribute id {replace ($row/td[2]," ","_")},
           attribute label {string ($row/td[2])},
           attribute position {string ($row/td[1])},
           attribute games-played {string ($row/td[3])},
           attribute goal-difference {string ($row/td[14])},
           attribute points {string ($row/td[15])}
       }
    }

The source file does not define a namespace so none is needed here. I often use a wildcard namespace (e.g. *:table) just in case.
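To see why the wildcard is a useful precaution, here is a small sketch. The two inline documents are made-up stand-ins for a scraped page, one in no namespace and one in the XHTML namespace:

```xquery
xquery version "1.0";

(: Made-up sample pages: the same table without a namespace and as XHTML :)
let $plain :=
  <html><body><table class="fulltable"><tr><td>1</td></tr></table></body></html>
let $xhtml :=
  <html xmlns="http://www.w3.org/1999/xhtml"><body><table class="fulltable"><tr><td>1</td></tr></table></body></html>
return
  <result>
    <!-- a bare name test only matches elements in no namespace -->
    <bare plain="{count($plain//table)}" xhtml="{count($xhtml//table)}"/>
    <!-- *:table matches a table element in any namespace, or in none -->
    <wildcard plain="{count($plain//*:table)}" xhtml="{count($xhtml//*:table)}"/>
  </result>
```

The bare path finds the table only in the plain document, while the wildcard path finds it in both, so a scraper written with *:table should keep working if the source later switches to XHTML.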

The convert:get-html function is a wrapper round an eXist httpclient function:



declare function convert:get-html($uri as xs:string) {
   let $headers := element headers { element header {attribute name {"Pragma" }, attribute value {"no-cache"}}}
   let $response := httpclient:get(xs:anyURI($uri), false(), $headers)    
   return  
       if  ($response/@statusCode eq "200")
       then 
           $response/httpclient:body 
      else ()
};

This XML is not Excel-friendly; I shaped it this way because I want to make a simple transformation to RDF, in which attributes become properties and elements become resources:

http://www.cems.uwe.ac.uk/xmlwiki/Scrape/premierLeague-rdf.xq

The transformation is accomplished by an XQuery function (revised):



declare function convert:element-to-rdf ($element,$path,$prefix,$base) {
let $epath:= concat($path,"/",local-name($element),"/",$element/@id)
return
(
   element rdf:Description {
        attribute rdf:about  {concat($base,$epath)},
        for $at in $element/(@* except @id)
         return
           element {concat($prefix,local-name($at))} {
              if (starts-with ($at,"http://"))
              then attribute rdf:resource {string ($at)}
              else string ($at)
            },   
          for $child in $element/* 
           return 
              element {concat($prefix,local-name($child))}
                {attribute rdf:resource {concat($base,$epath,"/",local-name($child),"/",$child/@id)} 
                }
          },
      for $child in $element/* 
      return 
             convert:element-to-rdf($child,$epath,$prefix,$base)
    )
};

The function descends through the XML structure. For each element it creates an rdf:Description, with a property for each attribute and a linking property for each child element; the child elements are then transformed into rdf:Descriptions in turn. Element names are assumed to be unique within the base, ids are assumed to be unique within an element name, and attributes are single-valued, so it's a bit limited but helpful for simple data.
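As a rough illustration of that descent, the sketch below runs a copy of the function over a tiny made-up league. The uwe namespace URI, the base URI and the sample values are all placeholders of mine; the function is moved into the local: namespace only so the query runs on its own. Note that the prefix passed in has to be declared in the query, since the element names are built from strings such as "uwe:label".

```xquery
xquery version "1.0";

declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
(: hypothetical vocabulary namespace, needed to resolve names like "uwe:label" :)
declare namespace uwe = "http://example.org/vocab/";

(: copy of convert:element-to-rdf, placed in local: so this query is standalone :)
declare function local:element-to-rdf($element, $path, $prefix, $base) {
  let $epath := concat($path, "/", local-name($element), "/", $element/@id)
  return (
    element rdf:Description {
      attribute rdf:about { concat($base, $epath) },
      (: every attribute except id becomes a property :)
      for $at in $element/(@* except @id)
      return
        element { concat($prefix, local-name($at)) } {
          if (starts-with($at, "http://"))
          then attribute rdf:resource { string($at) }
          else string($at)
        },
      (: every child element becomes a link to its own description :)
      for $child in $element/*
      return
        element { concat($prefix, local-name($child)) } {
          attribute rdf:resource {
            concat($base, $epath, "/", local-name($child), "/", $child/@id)
          }
        }
    },
    for $child in $element/*
    return local:element-to-rdf($child, $epath, $prefix, $base)
  )
};

(: made-up sample data, not scraped values :)
let $sample :=
  <football-league id="Barclays_Premier_League" label="Barclays Premier League">
    <team id="Arsenal" label="Arsenal" position="1" points="30"/>
  </football-league>
return
  <rdf:RDF>{ local:element-to-rdf($sample, "", "uwe:", "http://example.org/data") }</rdf:RDF>
```

This should produce two rdf:Description elements: one about http://example.org/data/football-league/Barclays_Premier_League, carrying a uwe:label literal and a uwe:team link, and one about http://example.org/data/football-league/Barclays_Premier_League/team/Arsenal with the team's literal properties. The order of the properties depends on the order in which the processor returns attributes.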

RDF this might be (well it validates) but it's not Linked Data because:

this uses only local resource and vocab URIs
resource URIs can't be dereferenced
there is no schema for the vocab
literals are not typed
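Taking the last point as an example, one possible approach is to guess a datatype from the lexical form of each attribute value. This is only a sketch: the local:guess-datatype and local:property helpers and the uwe namespace URI are hypothetical, and guessing from the value is a shortcut where a proper schema would be better.

```xquery
xquery version "1.0";

declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
(: hypothetical vocabulary namespace :)
declare namespace uwe = "http://example.org/vocab/";

(: hypothetical helper: return an XSD datatype URI, or nothing for plain strings :)
declare function local:guess-datatype($value as xs:string) as xs:string? {
  let $xsd := "http://www.w3.org/2001/XMLSchema#"
  return
    if ($value castable as xs:integer) then concat($xsd, "integer")
    else if ($value castable as xs:dateTime) then concat($xsd, "dateTime")
    else ()
};

(: hypothetical helper: build a property element, typed where a datatype was guessed :)
declare function local:property($name as xs:string, $value as xs:string) as element() {
  element { $name } {
    (
      for $type in local:guess-datatype($value)
      return attribute rdf:datatype { $type }
    ),
    $value
  }
};

(
  local:property("uwe:points", "30"),
  local:property("uwe:label", "Arsenal")
)
```

Here uwe:points gets an rdf:datatype of xsd:integer, while uwe:label stays an untyped literal. The same test could replace the `else string ($at)` branch of the transformation function.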

Most can be fixed with a bit more time, but the biggest challenge is resource matching, for example matching "uwe:Man_Utd" with http://dbpedia.org/page/Manchester_United_F.C. Perhaps common aliases could be added to Wikipedia, and hence to DBpedia, but it would of course be much better if these were in the BBC page.
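Until such aliases exist upstream, a stopgap might be a hand-maintained alias table that maps local ids to DBpedia resource URIs and emits owl:sameAs links. The table, its single entry and the local:same-as function are illustrative assumptions, not part of the code above; the entry uses DBpedia's /resource/ form of the URI rather than the /page/ HTML view.

```xquery
xquery version "1.0";

declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
declare namespace owl = "http://www.w3.org/2002/07/owl#";

(: hand-maintained alias table; the single entry is illustrative :)
declare variable $local:aliases :=
  <aliases>
    <alias id="Man_Utd" resource="http://dbpedia.org/resource/Manchester_United_F.C."/>
  </aliases>;

(: hypothetical helper: owl:sameAs links for a local id, or nothing if it is unknown :)
declare function local:same-as($id as xs:string) as element()* {
  for $alias in $local:aliases/alias[@id = $id]
  return
    element owl:sameAs { attribute rdf:resource { string($alias/@resource) } }
};

<matches>
  <known>{ local:same-as("Man_Utd") }</known>
  <unknown>{ local:same-as("Unknown_Team") }</unknown>
</matches>
```

The known element ends up holding one owl:sameAs link and the unknown element stays empty, so unmatched teams simply keep their local URI. Such a table does not scale, which is why aliases published at the source would be preferable.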

Another challenge is the lack of a vocab for football games. @degsy tweeted about the need for this - I wonder if he got anywhere.