Hierachical structures are excellent for organising complex information, but a specific hierarchy is a design choice. Often there are multiple ways in which the same information can be structured. A typical example is encountered in document structures. One might represent the content as a hierarchy of chapters, sections ,sub-sections and paragraphs. One might also want to represent the physical structure into pages and lines. The two structures conflict since a line may be split over page boundaries. In Jackson Structured Programming, a program design method developed by Michael Jackson which influenced me greatly in the '70s, this is called a structure clash.
The problem of how to represent both views of the content in a single XML document is a well-known problem. Jeni Tennison discusses the alternatives in her excellent blog post.
In TEI (Text Encoding Initiative), the practice is to represent the chapter/section structure as the main XML structure and use milestones to mark the page boundaries using page-breaks (pb elements). In addition to providing the page number, these elements also provide a place to link to facsimile images. The convention is for page breaks to refer to the following page, but sometimes its not clear, for example in the Punch corpus converted from Guttenberg Project text to TEI by the Oxford University Computing Services,
I first encountered this problem last year when working on the Virginia Secession project with Chris Kemp at the University of Richmond in Virginia. I searched for prior work on the problem, but not knowing the right terminology, missed the algorithm written by David Sewell which was used for the Java function util:get-fragment-between () in eXist.
So I set about writing my own function. In addition to extracting a subtree of the full document, I needed to expand a couple of domain-specific attributes. I also thought it would be necessary to mark nodes which had been truncated so continuation marks could be inserted into the rendered page.
When I later discovered the eXist function, although it was faster, it did not allow the customisation necessary in this application. David Sewell's algorithm seemed somewhat slower and I wasn't clear how to mark truncated nodes.
I've finally got round to documenting my algorithm. To document the algorithms and compare timings, I've added this problem to a comparison application I'm playing with.
Timings need to be treated with caution but they seem to indicate performance ratios between the Java function, my XQuery algorithm and David Sewell's algorithm of 1: 4: 12. I need a full set of tests for these algorithms to ensure that my algorithm is sound. It is used in my version of the Punch example ,with the zip of the full application.