XProc Step by Step: Check text content after structural transformations
David Maus, 25th April 2019 08:52
When working with TEI documents stemming from a Word-to-XML conversion, I usually perform some structural
modifications, like joining tei:hi elements with identical rend attributes if they are immediate siblings.
Although I consider my XSLT good enough and use XSpec to test the merging algorithm, for good measure I
like to verify that no text content was lost.
One tried and true method of doing this is dumping the text nodes of the document before and after the
transformation, and then comparing the two. Because I only want to know whether the text is the same,
comparing SHA1 checksums is sufficient.
Yesterday I finally sat down and wrote an XProc 1.0 step that does exactly this.
Step 1: Calculate the checksum
The first step, dmaus:content-checksum, reads a document and adds the attribute dmaus:checksum to the
outermost element. The attribute contains the SHA1 checksum of the document's text content.
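
A minimal sketch of how such a step could look in XProc 1.0, not necessarily the original implementation. It assumes the processor supports the optional p:hash step, and the namespace URI bound to the dmaus prefix is a placeholder chosen for illustration.

```xml
<p:declare-step version="1.0" type="dmaus:content-checksum"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:dmaus="http://example.org/ns/dmaus">
  <!-- Assumption: the dmaus namespace URI above is a placeholder. -->
  <p:input port="source" primary="true"/>
  <p:output port="result" primary="true"/>

  <!-- Add an empty placeholder attribute to the outermost element. -->
  <p:add-attribute match="/*" attribute-name="dmaus:checksum" attribute-value=""/>

  <!-- Replace the placeholder's value with the SHA1 hash of the
       concatenated text nodes of the document. -->
  <p:hash match="/*/@dmaus:checksum" algorithm="sha" version="1">
    <p:with-option name="value" select="string(/)"/>
    <p:input port="parameters">
      <p:empty/>
    </p:input>
  </p:hash>
</p:declare-step>
```

Note that string(/) includes whitespace-only text nodes, so a transformation that merely changes indentation would alter the checksum. Using normalize-space(/) as the value would be one way to make the comparison less strict.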
Step 2: Compare checksums
The second step, dmaus:check-text-content-match, has two input ports,
source and other. It calculates the checksum for the documents appearing on
each port and compares the two. If the checksums differ, the step raises an error. For convenient use in a
pipeline, the document from source is passed through to the primary output port.
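
The comparison could be sketched as follows, again as an assumption-laden illustration rather than the original code. The declaration is meant to sit next to dmaus:content-checksum in the same p:library. The error code dmaus:text-content-mismatch, the message text, and the dmaus namespace URI are hypothetical.

```xml
<p:declare-step version="1.0" type="dmaus:check-text-content-match" name="check"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:dmaus="http://example.org/ns/dmaus">
  <p:input port="source" primary="true"/>
  <p:input port="other"/>
  <p:output port="result" primary="true"/>

  <!-- Calculate the checksum of the document on each input port. -->
  <dmaus:content-checksum name="checksum-source">
    <p:input port="source">
      <p:pipe step="check" port="source"/>
    </p:input>
  </dmaus:content-checksum>
  <dmaus:content-checksum name="checksum-other">
    <p:input port="source">
      <p:pipe step="check" port="other"/>
    </p:input>
  </dmaus:content-checksum>

  <p:group>
    <!-- Read both checksums into variables so they can be compared. -->
    <p:variable name="source-checksum" select="/*/@dmaus:checksum">
      <p:pipe step="checksum-source" port="result"/>
    </p:variable>
    <p:variable name="other-checksum" select="/*/@dmaus:checksum">
      <p:pipe step="checksum-other" port="result"/>
    </p:variable>

    <p:choose>
      <p:xpath-context>
        <p:pipe step="checksum-source" port="result"/>
      </p:xpath-context>
      <p:when test="$source-checksum != $other-checksum">
        <!-- Hypothetical error code and message. -->
        <p:error code="dmaus:text-content-mismatch">
          <p:input port="source">
            <p:inline>
              <message>The text content of the documents differs.</message>
            </p:inline>
          </p:input>
        </p:error>
      </p:when>
      <p:otherwise>
        <!-- Pass the unmodified source document through. -->
        <p:identity>
          <p:input port="source">
            <p:pipe step="check" port="source"/>
          </p:input>
        </p:identity>
      </p:otherwise>
    </p:choose>
  </p:group>
</p:declare-step>
```

Wrapping the comparison in p:group is what allows the variables to read from the two checksum steps that precede it.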
Step 3: Use the step
Both steps are defined in a library that I can import in my pipeline. In this
contrived example, I connect the step dmaus:check-text-content-match to the result of a
p:xslt step that implements the structural modification and to the original document appearing
on the pipeline's source port.
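
Such a pipeline might look like the sketch below. The file names library.xpl and merge-hi.xsl and the dmaus namespace URI are placeholders. The sketch also assumes that the transformed document goes to the source port and the original to the other port, so that the transformed document is the one passed through.

```xml
<p:declare-step version="1.0" name="main"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:dmaus="http://example.org/ns/dmaus">
  <p:input port="source" primary="true"/>
  <p:output port="result" primary="true"/>

  <!-- Hypothetical location of the library defining both steps. -->
  <p:import href="library.xpl"/>

  <!-- The structural modification; reads the pipeline's source port. -->
  <p:xslt name="transform">
    <p:input port="stylesheet">
      <p:document href="merge-hi.xsl"/>
    </p:input>
    <p:input port="parameters">
      <p:empty/>
    </p:input>
  </p:xslt>

  <!-- The source port is implicitly connected to the p:xslt result;
       the other port receives the original document. -->
  <dmaus:check-text-content-match>
    <p:input port="other">
      <p:pipe step="main" port="source"/>
    </p:input>
  </dmaus:check-text-content-match>
</p:declare-step>
```

If the text content matches, the transformed document appears on the pipeline's result port. Otherwise the pipeline fails with the error raised by the check.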