
Scraping, Election nominations and RDF

A more challenging document to scrape is the list of nominations for the forthcoming election. This is one of the datasets on ScraperWiki, but it does not fit easily into a simple table. The source is the Press Association.

The XQuery version is more complex than the Python code but it does more, including decoding party abbreviations and some of the other data.
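As a rough illustration of the decoding step, a nomination entry such as "Seema Kennedy (C)" can be split into a name and a party with two regular-expression replaces and a lookup table. This is only a sketch, not the code used for the conversion: the function name local:parse-nomination, the $local:parties table and the candidate output element are all hypothetical, and only the abbreviations seen in the sample page are covered.

```xquery
xquery version "1.0";

(: Hypothetical lookup table; covers only the abbreviations in the sample :)
declare variable $local:parties :=
  <parties>
    <party abbrev="C">Conservative</party>
    <party abbrev="Lab">Labour</party>
    <party abbrev="LD">Liberal Democrat</party>
    <party abbrev="BNP">British National Party</party>
    <party abbrev="UKIP">UK Independence Party</party>
  </parties>;

(: Splits "Name (ABBREV)" into a name and a decoded party name.
   Any leading marker such as "+" is left in the name untouched. :)
declare function local:parse-nomination($li as xs:string) as element(candidate) {
  let $text := normalize-space($li)
  (: the abbreviation is whatever sits in the trailing parentheses, if any :)
  let $abbrev :=
    if (matches($text, "\([^)]+\)$"))
    then replace($text, "^.*\(([^)]+)\)$", "$1")
    else ()
  let $name := normalize-space(replace($text, "\s*\([^)]+\)$", ""))
  let $party := $local:parties/party[@abbrev = $abbrev]
  return
    (: fall back to the raw abbreviation when it is not in the table :)
    <candidate name="{$name}"
               party="{if ($party) then string($party) else string($abbrev)}"/>
};

local:parse-nomination("Seema Kennedy (C)")
```

For the sample entry this should yield a candidate element with name "Seema Kennedy" and party "Conservative". An unknown abbreviation is passed through unchanged rather than dropped.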

Here is the converted XML: http://www.cems.uwe.ac.uk/xmlwiki/Scrape/data/election-nominations.xml and the code

Scraping is complicated by the lack of structure in the page. Each constituency is just a flat sequence of sibling elements under the main div:

<p class="constituency-heading">21. ASHTON UNDER LYNE</p>
<ul class="nominations">
    <li class="odd">Seema Kennedy (C)</li>
    <li class="even">+David Heyes (Lab)</li>
    <li class="odd">Paul Larkin (LD)</li>
    <li class="even">David Lomas (BNP)</li>
    <li class="odd">Angela McManus (UKIP)</li>
</ul>
<p class="boundary-change">10.25% boundary change</p>
<p class="previous-result-summary">2005 notional: Lab maj 13,199 (38.33%) &ndash; Turnout 34,432 (51.53%)</p>
<p class="previous-result-detail">Lab 20,136 (58.48%); C 6,937 (20.15%); LD 4,017 (11.67%); Others 2,621 (7.61%); UKIP 721 (2.09%)</p>
<p class="constituency-heading">22. AYLESBURY</p>

Finding the headings is straightforward:

    for $constituency in $html//*:p[@class="constituency-heading"]

as is finding the end of the elements for this constituency, taking care with the last one, which is followed by the abbreviations paragraph rather than another heading:

    let $end := $constituency/following-sibling::*:p[@class=("constituency-heading","abbreviations-intro")][1]

but I scratched my head about how to select the nodes in between. I ended up with a function:



(: Returns the leading nodes of $s, in order, up to but not including $end :)
declare function local:nodes-before($s,$end) {
      if (empty($s) or $s[1] is $end)
      then ()
      else ($s[1],local:nodes-before(subsequence($s,2),$end))
};


let $cdata := <data>{local:nodes-before($constituency/following-sibling::*,$end)}</data>


but this seems slow and I'm sure there is a better way.
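One candidate for a better way is the node-order comparison operator `<<`, which is true when its left operand precedes its right operand in document order. The sketch below selects the following siblings that come before $end in a single path expression, with no recursion. The inline $html fragment is a hypothetical, cut-down stand-in for the real page, and I have not measured whether this is actually faster on the full document.

```xquery
xquery version "1.0";

(: Hypothetical, cut-down sample with the same flat structure as the page :)
let $html :=
  <div>
    <p class="constituency-heading">21. ASHTON UNDER LYNE</p>
    <ul class="nominations"><li class="odd">Seema Kennedy (C)</li></ul>
    <p class="boundary-change">10.25% boundary change</p>
    <p class="constituency-heading">22. AYLESBURY</p>
    <ul class="nominations"/>
    <p class="abbreviations-intro">Abbreviations</p>
  </div>
for $constituency in $html//*:p[@class = "constituency-heading"]
(: the next heading, or the abbreviations paragraph after the last constituency :)
let $end := $constituency/following-sibling::*:p
              [@class = ("constituency-heading", "abbreviations-intro")][1]
return
  <data name="{string($constituency)}">{
    (: keep only the siblings that precede $end in document order :)
    $constituency/following-sibling::*[. << $end]
  }</data>
```

For the sample this should produce two data elements, the first holding the ul and the boundary-change paragraph, the second holding only the empty ul. If $end were ever empty, the comparison would yield the empty sequence and nothing would be selected, so the approach relies on a terminating element always being present, as it is here.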

In converting to RDF, I had to improve the original conversion function so that it generates a proper type/instance/type/instance/... resource path and outputs rdfs:label and rdf:type properties. The resulting RDF is http://www.cems.uwe.ac.uk/xmlwiki/Scrape/data/election-nominations.rdf
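The shape of such a path can be shown in isolation. The sketch below builds it from an element's ancestors rather than by passing the path down through recursion, so it is an alternative illustration and not the conversion code itself. The function name local:resource-path and the element names and ids in the sample are hypothetical; the base URI and the "/resource" segment follow the layout of the published RDF.

```xquery
xquery version "1.0";

(: Sketch only: one /type/instance pair per ancestor-or-self element,
   where type is the element name and instance is its id attribute. :)
declare function local:resource-path($element as element()) as xs:string {
  string-join(
    for $e in $element/ancestor-or-self::*
    return concat("/", local-name($e), "/", $e/@id),
    "")
};

let $base := "http://www.cems.uwe.ac.uk/xmlwiki"
(: hypothetical sample data; the real element names may differ :)
let $doc :=
  <election id="2010">
    <constituency id="21" label="ASHTON UNDER LYNE">
      <candidate id="1" label="Seema Kennedy"/>
    </constituency>
  </election>
return concat($base, "/resource", local:resource-path($doc/constituency/candidate))
```

For the sample candidate this should give a URI ending in /resource/election/2010/constituency/21/candidate/1, so each resource URI also encodes the resources that contain it.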

The conversion function now looks a bit more complicated:



declare function convert:element-to-rdf ($element,$path,$prefix,$base) {
(: path of this element: the parent path plus /type/instance :)
let $epath := concat($path,"/",local-name($element),"/",$element/@id)
return
(
   (: one rdf:Description per element :)
   element rdf:Description {
        attribute rdf:about {concat($base,"/resource",$epath)},
        element rdfs:label {string($element/@label)},
        element rdf:type {
             attribute rdf:resource {concat($base,"/vocab/type/",local-name($element))}
        },
        (: remaining attributes become properties; http:// values become resource links :)
        for $at in $element/(@* except (@label,@id))
        return
           element {concat($prefix,local-name($at))} {
              if (starts-with($at,"http://"))
              then attribute rdf:resource {string($at)}
              else string($at)
           },
        (: each child element becomes a link to the child's own resource :)
        for $child in $element/*
        return
           element {concat($prefix,local-name($child))} {
              attribute rdf:resource {concat($base,"/resource",$epath,"/",local-name($child),"/",$child/@id)}
           }
   },
   (: recurse to describe the children themselves :)
   for $child in $element/*
   return
      convert:element-to-rdf($child,$epath,$prefix,$base)
)
};


A script does the conversion and, if a filename is supplied, stores the result:



declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
import module namespace convert = "http://www.cems.uwe.ac.uk/xmlwiki/convert" at "../lib/convert.xqm";

let $uri := request:get-parameter("uri",())
let $filename := request:get-parameter("filename",())
let $xml := doc($uri)/*
let $rdf := 
  <rdf:RDF
       xmlns:rdf= "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
       xmlns:rdfs = "http://www.w3.org/2000/01/rdf-schema#"
       xmlns:uwe="http://www.cems.uwe.ac.uk/xmlwiki/"> 
        {convert:element-to-rdf($xml,(),"uwe:", "http://www.cems.uwe.ac.uk/xmlwiki")}  
  </rdf:RDF>
return 
   if ($filename) 
   then  
      let $login := xmldb:login("db/Wiki/Scrape","user","password")
      return 
            xmldb:store("/db/Wiki/Scrape/data", $filename, $rdf )
    else 
       $rdf