
Reminder to self: avoid attributes for base data - reprise on my RDF scraping #XQuery

Banging my head on the table:  elements not attributes for data!  elements not attributes for data!

Every time I cut corners and use attributes for base data, I come to regret it. I usually don't, but my attempts at page scraping to XML and then RDF did take this approach, and it was silly. An attribute leaves nowhere to put attributes of the data itself, such as a datatype, and it restricts each item to zero or one occurrence. Duh!

Here's what I should do.

First convert the page to a formalized XML with nested elements. Resources have an id attribute whose value is valid in a URI; properties have attributes to define datatype and language. I need a permissive schema definition to check this structure - must learn Schematron.

Here is the revised code for the Premier League:



import module namespace convert = "http://www.cems.uwe.ac.uk/xmlwiki/convert" at "../lib/convert.xqm";

let $uri := "http://news.bbc.co.uk/sport1/hi/football/eng_prem/table/default.stm"
let $html := convert:get-html($uri)
let $table := $html//table[@class="fulltable"]
let $date := $html//div[@class="fulltableHeader"]/text()
let $xml :=
  element football-league {
     attribute id {"Barclays_Premier_League"},
     element label  {"Barclays Premier League"},
     element valid-date { $date },
     element acquired { attribute datatype {"xs:dateTime"}, current-dateTime()},
     element source {attribute datatype {"uri"} , $uri},
     for $row in $table/tr[@class=("r1","r2")]
     return 
       element  team {
           attribute id {replace ($row/td[2]," ","_")},
           element label {string ($row/td[2])},
           element position {attribute datatype {"xs:integer"}, string ($row/td[1])},
           element games-played {attribute datatype {"xs:integer"}, string ($row/td[3])},
           element goal-difference {attribute datatype {"xs:integer"}, string ($row/td[14])},
           element points {attribute datatype {"xs:integer"}, string ($row/td[15])}
       }
    }
return 
  $xml




Here is the Premier League data as XML (Live or Cached). The date still needs reformatting to an ISO date.
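A minimal sketch of that date reformatting, assuming the header text contains a date written as day, English month name and four-digit year (for example "14 February 2009"); I have not checked this against the actual page text. The function name local:to-iso-date is hypothetical, not part of the convert module.

```xquery
(: Hypothetical helper: find "<day> <month name> <year>" in free text and
   return it as an xs:date, or the empty sequence if nothing matches. :)
declare function local:to-iso-date($text as xs:string) as xs:date? {
  let $months := ("January", "February", "March", "April", "May", "June",
                  "July", "August", "September", "October", "November", "December")
  (: assumed date layout; adjust the pattern to whatever the page really uses :)
  let $pattern := "^.*?(\d{1,2})\s+([A-Za-z]+)\s+(\d{4}).*$"
  return
    if (matches($text, $pattern, "s"))
    then
      let $day   := xs:integer(replace($text, $pattern, "$1", "s"))
      let $month := index-of($months, replace($text, $pattern, "$2", "s"))[1]
      let $year  := replace($text, $pattern, "$3", "s")
      (: zero-pad by hand so the sketch only needs XQuery 1.0 functions :)
      let $pad   := function-name-placeholder-free-padding($day, $month)
      return
        if (exists($month))
        then xs:date(concat($year, "-", $pad[2], "-", $pad[1]))
        else ()
    else ()
};

(: Returns the zero-padded day and month strings, in that order. :)
declare function local:function-name-placeholder-free-padding($day as xs:integer, $month as xs:integer?) as xs:string* {
  for $n in ($day, $month)
  return concat(if ($n lt 10) then "0" else "", $n)
};
```

With this in place, the valid-date element could carry attribute datatype {"xs:date"} and local:to-iso-date(string($date)) instead of the raw text.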

This formalized XML can then be converted to RDF with a revised XQuery function:



(:~
  : convert formalized XML to RDF
  :  elements which become resources have an id attribute; properties may have a datatype attribute, whose value is "uri" if the property value is a URI
  :@param element  the XML element to be converted to RDF
  :@param base  base for resource URIs
  :@param path  hierarchical path to element resource - initially ()
  :@param prefix  default prefix for local property names
  :@param map XML document used to map local names to external vocab names
:)
declare function convert:element-to-rdf-v2 ($element,$base,$path,$prefix,$map) {
let $epath:= concat($path,"/",local-name($element),"/",$element/@id)
return
(
   element rdf:Description {
        attribute rdf:about  {concat($base,"/resource",$epath)},     
        element rdf:type {
             attribute rdf:resource {concat($base,"/vocab/type/",local-name($element))}
        },
        for $property in $element/*[empty(@id)]
        let $localname := local-name($property)
        let $localname := 
             if ($map/property[@local=$localname])
             then string($map/property[@local=$localname]/@external)
             else concat($prefix,$localname)
        return
           element {$localname} {
              if ($property/@datatype = "uri")
              then attribute rdf:resource {string ($property)}
              else 
               (
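                 (: rdf:datatype takes a full URI, so expand the xs: prefix to the XML Schema namespace :)
                 if ($property/@datatype) then attribute rdf:datatype {replace($property/@datatype, "^xs:", "http://www.w3.org/2001/XMLSchema#")} else (),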
                 string ($property)
                ) 
            },   
          for $child in $element/*[@id]
          return 
              element {concat($prefix,local-name($child))}
                {attribute rdf:resource {concat($base,"/resource",$epath,"/",local-name($child),"/",$child/@id)} 
                }
          },
       for $child in $element/*[@id] 
       return 
             convert:element-to-rdf-v2($child,$base, $epath,$prefix,$map)
 )
};


The $map provides a map between local names and external names. If no entry is found, the local name is prefixed by $prefix. For example:



<map>
  <property local="label" external="rdfs:label"/>
  <property local="latitude" external="geo:lat"/>
  <property local="longitude" external="geo:long"/>
</map>
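A sketch of how the function might be called with such a map. The base URI http://example.org/football, the fb: prefix and the sample team are made up for illustration. Since computed element names are resolved against the namespaces of the module containing the constructor, this assumes the rdf, rdfs and fb prefixes are declared in convert.xqm itself.

```xquery
declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
import module namespace convert = "http://www.cems.uwe.ac.uk/xmlwiki/convert" at "../lib/convert.xqm";

(: map only the label; every other property falls back to the default prefix :)
let $map :=
  <map>
    <property local="label" external="rdfs:label"/>
  </map>

(: a cut-down instance of the formalized XML; the team and its points are invented sample data :)
let $league :=
  <football-league id="Barclays_Premier_League">
    <label>Barclays Premier League</label>
    <team id="Example_FC">
      <label>Example FC</label>
      <points datatype="xs:integer">0</points>
    </team>
  </football-league>

return
  element rdf:RDF {
    (: path starts empty; "fb:" is the hypothetical default prefix for unmapped names :)
    convert:element-to-rdf-v2($league, "http://example.org/football", (), "fb:", $map)
  }
```

This should yield one rdf:Description for the league and one for the team, with label emitted as rdfs:label, points as fb:points, and an fb:team property on the league pointing at the team resource.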


RDF output: Cached or Live (ish)

Still not Linked data of course, for all the reasons mentioned in previous posts, but it's a cleaner approach to data/XML/RDF conversion. Now to tackle the election data again.

 

 

Yes, I totally agree. Use XML elements not attributes for data!