
XQuery for scraping - long live XML

Tony Hirst at the OU and Rufus Pollock at CKAN set up a new Q&A forum for questions on open datasets which has captured my interest recently: http://getthedata.org/. It's been helpful to me in getting ideas about the availability of weather and environmental data. I hope it prospers.

One of the questions asked was about tools or services to use for scraping. Naturally I posted an XQuery answer and was rather shocked when the only vote it got was down. How could this be? Am I alone in thinking that XQuery, especially coupled with eXist, is an ideal platform for this work? Perhaps so. But I came across a couple of examples of scraping this week which tempted me to try to promote this toolset. Don Quixote comes to mind, but what the heck.

One was a blog post by John Goodwin on using Python and an RDF library to create a KML-based map of programmes, using the RDF now available from the BBC site. Another was an example by Tim Davies of using ScraperWiki with PHP to get a map of available garages in Oxford.

I also realised that I need a better way of documenting small projects like these, one which provides simple documentation and a worksheet from which I can run conversion tasks and tests. It's a bit crude at present but I find it helpful. I have a very simple way of protecting the visitor from launching time-consuming scraping tasks, but it's only there to stop accidents.

The two projects are the BBC programmes and Oxford Garages. Both have the same structure. The core is a script to do the scraping. Other scripts handle the scrape-and-cache operation and schedule a job to repeat the scrape. Further project-specific scripts handle transformations of the cached data. The BBC example creates a KML file directly; Garages creates an intermediate XML file from which the KML is generated on demand. A sketch of the core fetch-and-transform step follows.
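
To give a flavour of the core scraping script, here is a minimal XQuery sketch of the fetch-and-transform step. The feed URL and the element names (programme, title, lat, long) are illustrative assumptions, not the real BBC schema; the real scripts are viewable in the projects.

    xquery version "1.0";
    (: A minimal sketch of the fetch-and-transform step. The feed URL and the
       element names (programme, title, lat, long) are illustrative assumptions,
       not the real BBC schema. :)
    declare namespace kml = "http://www.opengis.net/kml/2.2";

    let $feed := doc("http://example.org/programmes.xml")
    return
      <kml:kml>
        <kml:Document>
        {
          for $p in $feed//programme
          return
            <kml:Placemark>
              <kml:name>{ $p/title/text() }</kml:name>
              <kml:Point>
                <kml:coordinates>{ concat($p/long, ",", $p/lat) }</kml:coordinates>
              </kml:Point>
            </kml:Placemark>
        }
        </kml:Document>
      </kml:kml>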

The code for the scrapers is viewable in each project. I think (but then I would) that it is pretty minimal and clean. Some support modules are required to handle common tasks like string parsing, geocoding, CSV decoding (not in use here), and data and coordinate transformations. An added benefit of eXist is that the whole application can be written in the one language, including storage of cached files, indexing stored files, searching, selection and job scheduling.
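
As an illustration of that single-language benefit, here is a sketch of the cache-and-schedule steps, assuming eXist's xmldb and scheduler modules. The collection paths, resource names and the scrape script are hypothetical, and the scheduler parameters should be checked against the documentation for your eXist version.

    xquery version "1.0";
    (: A sketch of the scrape-and-cache and scheduling steps, assuming eXist's
       xmldb and scheduler modules. Collection paths, resource names and the
       scrape script are hypothetical. :)
    declare namespace xmldb = "http://exist-db.org/xquery/xmldb";
    declare namespace scheduler = "http://exist-db.org/xquery/scheduler";

    let $page := doc("http://example.org/garages.xml")
    (: cache the fetched document in the database :)
    let $stored := xmldb:store("/db/projects/garages/cache", "garages.xml", $page)
    (: re-run the scrape script daily: the period is in milliseconds; the final
       -1 is intended as "repeat indefinitely" - check the scheduler module
       documentation for your eXist version :)
    let $scheduled := scheduler:schedule-xquery-periodic-job(
        "/db/projects/garages/scrape.xq",
        xs:long(24 * 60 * 60 * 1000),
        "garages-scrape",
        (),
        xs:long(0),
        xs:int(-1))
    return ($stored, $scheduled)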

It is interesting to see John's use of the BBC's RDF files. It seems to me that the pure XML approach is cleaner, and RDF hasn't brought any gains here. If only other sites generated such comprehensive XML. Please, please may it continue and spread. JSON and RDF are useful to some clients, but please keep the XML coming.

Kit, just to let you know the links through to the BBC programmes and Oxford garages projects appear to be broken.

FWIW, I took that question to be asking for things that "ordinary people" might use without too much set-up cost. Using eXist might be part of the answer but it isn't one that people could pick up and use immediately. Plus XQuery is *almost* as obscure as SPARQL as a language for getting stuff done ;) (It wasn't me who voted your answer down, by the way! ;)

Jeni, many thanks. Oh, the perils of small-hours posting - links now fixed.

I take your point about the user community, but ScraperWiki is the top answer and you still have to do the coding in Python or PHP, even if ScraperWiki provides a nice community environment for the work. I guess the point is that it allows users to use a variety of popular languages rather than one unpopular one.

And I'm sure you would not down-vote XQuery - actually I know who did, because the karma history says so :-)

Hey Kit,

This is great stuff. (As per tweet) if you wanted to write a step-by-step recipe of the XQuery approach that would be fantastic.

At the moment I've not really got the 'submit a recipe' bit working - so feel free to blog a step-by-step (with screenshots if possible), or jot one in an e-mail and I'll get it included.

Tim

Hi Kit
Thanks for the getTheData mention and the contributions you've made to it :-)
The voting on the scraper page may seem a little erratic, but I seem to remember (maybe incorrectly) that the question did request a hosted solution?

As far as documentation goes, I really like the view over the two projects you link to. One of the things I chatted to Rufus about was whether we should have answer pages that could include runnable code snippets (cf. http://semanticreports.com/reports/ ), or live data previews (the first 10 rows, or 20 random rows, etc., a bit like ScraperWiki offers, and something I think is lacking on CKAN too?)

Have you also seen http://blog.dexy.it/, a "living/live" automated documentation project? It feels to me like it might be complementary to what you're trying to achieve, and it also feels like it suits the XProc pipeline model you're working with.

Thanks Tony, I'll check that out. I've been messing with the ideas of 'literate programming' for a while. The duplication of effort between my own scripts for running code, entries in the XQuery Book, blog items and code repositories is really getting to me. This project file is simple to author in XML and transform, but it needs some work.