Tony Hirst at the OU and Rufus Pollock at CKAN have set up a new Q&A forum for questions on open datasets which has captured my interest recently: http://getthedata.org/ . It's been helpful to me in getting ideas about the availability of weather and environmental data. I hope it prospers.
One of the questions asked was about tools or services to use for scraping. Naturally I posted an XQuery answer and was rather shocked when the only vote it got was a down-vote. How could this be? Am I alone in thinking that XQuery, especially coupled with eXist, is an ideal platform for this work? Perhaps so. But I came across a couple of examples of scraping this week which tempted me to try to promote this toolset. Don Quixote comes to mind, but what the heck.
One was a blog post by John Goodwin on using Python and an RDF library to create a KML-based map of programmes, using the RDF now available from the BBC site. Another was an example by Tim Davies of using ScraperWiki with PHP to get a map of available garages in Oxford.
I also realised that I need a better way of documenting small projects like these, one which provides simple documentation and a worksheet from which I can run conversion tasks and tests. It's a bit crude at present but I find it helpful. I have a very simple way of protecting the visitor from launching time-consuming scraping tasks, but it's only there to stop accidents.
The two projects are the BBC programmes and Oxford Garages. Both have the same structure. The core is a script that does the scraping. Other scripts handle the scrape-and-cache operation and schedule a job to repeat the scrape. Further project-specific scripts handle transformations of the cached data: the BBC example creates a KML file directly, while Garages creates an intermediate XML file from which the KML is generated on demand.
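To give a flavour of the scrape-and-cache step, here is a minimal sketch of what such a script can look like in eXist, using its httpclient and xmldb modules. The URL and collection path are placeholders, not the ones used in the live projects:

```xquery
xquery version "1.0";
(: sketch of a scrape-and-cache script; the URL and the
   /db/apps/scraper/cache collection are illustrative only :)
import module namespace httpclient = "http://exist-db.org/xquery/httpclient";

let $url := xs:anyURI("http://www.example.org/data/page.xml")
(: fetch the page; the response wraps the document in httpclient:body :)
let $response := httpclient:get($url, false(), ())
let $page := $response//httpclient:body/*
return
    (: store the fetched document in the cache collection :)
    xmldb:store("/db/apps/scraper/cache", "page.xml", $page)
```

Because the fetched page lands straight in the database as XML, every later transformation script can simply query the cached copy rather than hitting the remote site again.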
The code for the scrapers is viewable in each project. I think (but then I would) that it is pretty minimal and clean. Some support modules are required to handle common tasks like string parsing, geocoding, (not in use here) CSV decoding, and data and coordinate transformations. An added benefit of eXist is that the whole application can be written in the one language, including storage of cached files, indexing of stored files, searching, selection and job scheduling.
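The job-scheduling side can also stay in XQuery. A sketch using eXist's scheduler module might look like this, assuming the scrape-and-cache script is stored in the database; the stored-query path and job name are placeholders:

```xquery
xquery version "1.0";
(: sketch: re-run the cached scrape nightly via eXist's scheduler;
   the path and job name are illustrative only :)
import module namespace scheduler = "http://exist-db.org/xquery/scheduler";

scheduler:schedule-xquery-cron-job(
    "/db/apps/scraper/scrape-and-cache.xq",  (: stored query to run :)
    "0 0 2 * * ?",                           (: Quartz cron: 02:00 daily :)
    "nightly-scrape"                         (: job name :)
)
```

The cron expression follows the Quartz syntax that eXist's scheduler uses, so the whole pipeline, fetch, store, transform and schedule, lives in one environment.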
It is interesting to see John's use of the BBC's RDF files. It seems to me that the pure XML approach is cleaner and that RDF hasn't brought any gains here. If only other sites generated such comprehensive XML. Please, please may it continue and spread. JSON and RDF are useful to some clients, but please keep the XML coming.