More fiddling with my little pipeline framework this morning, this time to add a simple path operator. The initial application was to pull a table from a Wikipedia page:
WWII Casualties extracted
For later processing, this HTML table needs to be cleaned up to plain XML with element names taken from the heading. I habitually write this kind of generic transformation in XQuery, but that doesn't play so well with the pipeline architecture. XSLT would fit better, but I wasn't quite sure how to write this generalised transformation. My plea for help via Twitter elicited an early morning masterclass in XSLT from @JeniT and @AlainCouthures (thanks again folks, I'm really touched). Here is Jeni's work.
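For anyone who wants the gist of the generic transformation without reading XSLT, here is a rough Python sketch of the same idea: take the first row's headings, turn them into element names, and wrap each data cell in the matching element. This is not Jeni's stylesheet, just an illustration. The function names `element_name` and `table_to_xml`, the `row` wrapper element and the sample data are all made up for the example, and it assumes a well-formed table with a single heading row and no spanning cells.

```python
import re
import xml.etree.ElementTree as ET


def element_name(heading):
    """Turn heading text into a string that is safe to use as an XML element name."""
    name = re.sub(r"[^A-Za-z0-9_]+", "_", heading.strip()).strip("_")
    # XML names cannot be empty or start with a digit, so prefix those cases.
    if not name or not name[0].isalpha():
        name = "_" + name
    return name


def table_to_xml(table):
    """Convert an HTML table element to plain XML, one <row> per data row."""
    rows = table.findall(".//tr")
    # Assumption: the first row holds the headings as <th> cells.
    names = [element_name("".join(th.itertext())) for th in rows[0].findall("th")]
    out = ET.Element("table")
    for tr in rows[1:]:
        cells = tr.findall("td")
        if not cells:
            continue  # skip rows with no data cells
        row = ET.SubElement(out, "row")
        for name, td in zip(names, cells):
            # itertext() flattens links, footnote markers and other nested markup.
            ET.SubElement(row, name).text = "".join(td.itertext()).strip()
    return out


# Made-up sample data, not the real Wikipedia table.
html = """<table>
<tr><th>Country</th><th>Total deaths</th></tr>
<tr><td><a href="#">Exampleland</a></td><td>100</td></tr>
<tr><td>Samplestan</td><td>250</td></tr>
</table>"""

result = table_to_xml(ET.fromstring(html))
xml_text = ET.tostring(result, encoding="unicode")
print(xml_text)
```

Real Wikipedia tables are a good deal messier than this (row and column spans, nested tables, footnotes), which is where a stylesheet specialised for the page comes in.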
In the end I used a simpler transformation with numbered column headings. With that step added, I can now get the XML. It's messy because the Wikipedia tables have complex cells, so the XSLT needs to be specialised for scraping:
Now to write the next transform, to a visualization. Another XSLT subsets the table to get just the British Imperial Forces casualties and creates the graph XML, and finally comes the transformation to a pie chart using Jan's XSLT.
The pipeline can be run from the location line for testing and then embedded in a generated page:
Next job is to create a pipeline editor, which will make it easier to view the stylesheets and the intermediate documents.