XProc Step by Step: Check text content after structural transformations
David Maus, 25. Apr 2019
When working with TEI documents stemming from a Word-to-XML conversion, I usually perform some structural
modifications, like joining
tei:hi elements with identical
rend attributes if they are immediate siblings. Although I consider my XSLT good enough and use
XSpec to test the merging algorithm, I
like to verify, for good measure, that no text content was lost.
One tried and true method of doing this is dumping the text nodes of the document before and after the transformation, and then comparing the two. Because I only want to know whether the text is the same, comparing SHA1 checksums is sufficient.
Yesterday I finally sat down and wrote an XProc 1.0 step that does exactly this.
Step 1: Calculate the checksum
The first step
dmaus:content-checksum reads a document and adds the attribute
dmaus:checksum containing the SHA1 checksum of the document's text content to the
Step 2: Compare checksums
The second step
dmaus:check-text-content-match has two input ports, source and
other. It calculates the checksum for the documents appearing on
either port and compares the two. If the checksums differ, the step raises an error. For convenient use in a
pipeline, the document from
source is passed through to the primary output port.
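A minimal sketch of how the two steps might look in XProc 1.0 follows. It assumes a processor that supports the optional p:hash step, uses a placeholder namespace URI for the dmaus prefix, places the checksum attribute on the outermost element, and invents the dmaus:checksums wrapper element and the error code dmaus:text-content-mismatch. None of these details are taken from the actual library.

```xml
<p:library version="1.0"
           xmlns:p="http://www.w3.org/ns/xproc"
           xmlns:dmaus="http://example.org/ns/dmaus">
  <!-- The namespace URI above is a placeholder, not the real one. -->

  <!-- Step 1: add the SHA1 checksum of the text content as an attribute. -->
  <p:declare-step type="dmaus:content-checksum">
    <p:input port="source"/>
    <p:output port="result"/>

    <!-- Create the attribute first so that p:hash has a node to fill.
         Putting it on the outermost element is an assumption. -->
    <p:add-attribute match="/*" attribute-name="dmaus:checksum"
                     attribute-value="pending"/>

    <!-- string(/) concatenates all text nodes of the document;
         attribute values are not part of it. -->
    <p:hash match="/*/@dmaus:checksum" algorithm="sha" version="1">
      <p:input port="parameters"><p:empty/></p:input>
      <p:with-option name="value" select="string(/)"/>
    </p:hash>
  </p:declare-step>

  <!-- Step 2: compare the checksums of the documents on source and other. -->
  <p:declare-step type="dmaus:check-text-content-match" name="check">
    <p:input port="source" primary="true"/>
    <p:input port="other"/>
    <!-- Pass the document from source through unchanged. -->
    <p:output port="result" primary="true">
      <p:pipe step="check" port="source"/>
    </p:output>

    <dmaus:content-checksum name="source-checksum">
      <p:input port="source"><p:pipe step="check" port="source"/></p:input>
    </dmaus:content-checksum>

    <dmaus:content-checksum name="other-checksum">
      <p:input port="source"><p:pipe step="check" port="other"/></p:input>
    </dmaus:content-checksum>

    <!-- Put both documents under one (hypothetical) wrapper element so a
         single XPath expression can see both checksums. -->
    <p:wrap-sequence wrapper="dmaus:checksums">
      <p:input port="source">
        <p:pipe step="source-checksum" port="result"/>
        <p:pipe step="other-checksum" port="result"/>
      </p:input>
    </p:wrap-sequence>

    <p:choose>
      <p:when test="/*/*[1]/@dmaus:checksum != /*/*[2]/@dmaus:checksum">
        <!-- The error code is a hypothetical name. -->
        <p:error code="dmaus:text-content-mismatch">
          <p:input port="source">
            <p:inline>
              <message>Text content of source and other differs.</message>
            </p:inline>
          </p:input>
        </p:error>
      </p:when>
      <p:otherwise>
        <p:identity/>
      </p:otherwise>
    </p:choose>

    <!-- The comparison result itself is not needed downstream. -->
    <p:sink/>
  </p:declare-step>

</p:library>
```

Because string(/) includes whitespace-only text nodes, a transformation that merely changes indentation would also change the checksum in this sketch. Normalizing whitespace before hashing would be one way around that, if it matters for the documents at hand.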
Step 3: Use the step
Both steps are defined in a
library I can import in my pipeline. In this
contrived example I connect the step
dmaus:check-text-content-match to the result of a
p:xslt that implements the structural modification and the original document appearing
on the pipeline's
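Such a pipeline might be sketched as follows. The file names dmaus-library.xpl and join-hi.xsl and the namespace URI are placeholders, and the wiring of the other port to the pipeline's own input is an assumption about how the example is connected.

```xml
<p:declare-step version="1.0" name="main"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:dmaus="http://example.org/ns/dmaus">
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- Hypothetical location of the library that defines both steps. -->
  <p:import href="dmaus-library.xpl"/>

  <!-- The structural modification; join-hi.xsl is a placeholder name. -->
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="join-hi.xsl"/>
    </p:input>
    <p:input port="parameters"><p:empty/></p:input>
  </p:xslt>

  <!-- The primary port source receives the p:xslt result by default;
       other is connected to the untransformed input document. -->
  <dmaus:check-text-content-match>
    <p:input port="other">
      <p:pipe step="main" port="source"/>
    </p:input>
  </dmaus:check-text-content-match>
</p:declare-step>
```

If the text content survives the transformation, the pipeline simply emits the transformed document. Otherwise the run fails with the error raised by the check step.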