Following Tony Hirst's blog, I see he's now into American politics with his Yahoo Pipes.
1. Register for http://developer.nytimes.com/docs/congress_api
2. Explore the well-documented interface (cached extracts):
2.1 all committees
2.2 individual committees, e.g. Committee on Energy and Natural Resources
The structure is very similar to the Councils, so only minor edits were needed to those scripts:
3. Two XSLT stylesheets in a pipeline
3.1 Step 1: create the view, integrating multiple committees; this is parameterised by congress and chamber
4. Random 403s - the rate limit of 2 requests per second on this data set generates random failures - how to trap and retry a doc() failure?
5. Add a choke - do a number of extra doc() fetches to slow things down, but there is no random function to cache-bust them, so this is not very useful - how else to choke in XSLT?
6. Finally get a clean run: report cached
7. NYT tells me I'm a transgressor - I ask for an increased rate - nothing yet, and more runs now fail
8. I'm now banned from that API - two strikes and you're out - how long for?
9. [Later] My request for an increased rate was answered quickly and helpfully. However, per-second rate limits make this work hard because the traffic is very bursty. Even at the increased rate of 10, I still had to throttle the script. Thanks to Dawn at NYT for the help.
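The two questions in steps 4 and 5 (how to trap and retry a failed fetch, and how to choke the request rate) are easier to show outside XSLT. What follows is only a minimal sketch in Python, not the script used here: `Throttle`, `http_get` and `fetch_with_retry` are hypothetical names, and the choice to retry only on a 403 with a linearly growing pause is an assumption about what would suit this API.

```python
import time
import urllib.error
import urllib.request


class Throttle:
    """Space calls at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


def http_get(url):
    """Fetch a URL and return the response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_with_retry(url, throttle, fetch=http_get, max_attempts=4,
                     backoff=1.0, sleep=time.sleep):
    """Fetch `url`, pausing and retrying when the server answers 403."""
    for attempt in range(1, max_attempts + 1):
        throttle.wait()  # the choke: never start a request too soon
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Assumption: only a 403 signals the rate limit; other errors,
            # and a 403 on the final attempt, are passed to the caller.
            if err.code != 403 or attempt == max_attempts:
                raise
            sleep(backoff * attempt)  # pause grows with each failed attempt
```

With a limit of 2 requests per second, `Throttle(0.5)` would space the fetches half a second apart. Sharing one `Throttle` across every fetch is what smooths out the bursts.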
If this had been written in XQuery, I could have added a transparent caching interface, saving each document as it is fetched and retrieving the cached version if available. Cache flushing could be manual. The pipeline could do this for the intermediate documents but not for the individual committee source documents. That is an argument for fetching the site into a local copy using XQuery and doing the presentation transformations with XSLT. Back to XQuery and the benefits of running on an XML database.
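The caching idea can be sketched independently of XQuery. This is a rough illustration in Python rather than the XQuery version described above; `cached_fetch` and `flush_cache` are hypothetical names, the cache is assumed to be a plain directory of files keyed by a hash of the URL, and `fetch` stands for whatever function actually retrieves a document.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, cache_dir, fetch):
    """Return the document at `url`, fetching it only if no cached copy exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The file name is derived from the URL, so each document has one stable slot.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".xml"
    path = cache_dir / name
    if path.exists():
        return path.read_bytes()
    body = fetch(url)  # expected to return bytes
    path.write_bytes(body)
    return body


def flush_cache(cache_dir):
    """Manual flush: delete every cached document."""
    for path in Path(cache_dir).glob("*.xml"):
        path.unlink()
```

Once every committee document has been fetched successfully, later runs read only from the cache and never touch the rate limit until the cache is flushed by hand.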
I note that the last committee, on Global Warming, has no members. That's not what its web site says and I can't check the data now, but I see a gap in the bottom right corner of Tony's treemap, so I guess the problem is with the data.