XProc Step by Step: Check text content after structural transformations
David Maus, 25. Apr 2019
When working with TEI documents stemming from a Word-to-XML conversion, I usually perform some structural modifications, like joining tei:hi elements with identical rend attributes if they are immediate siblings. Although I consider my XSLT good enough and use XSpec to test the merging algorithm, I like to verify for good measure that no text content was lost.
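To make the kind of modification concrete, here is a minimal XSLT 2.0 sketch of such a merge. It is not the stylesheet used for this post, only an illustration under a few assumptions: the TEI namespace is http://www.tei-c.org/ns/1.0, only tei:hi elements that carry a rend attribute are merged, and any other attributes on merged elements are dropped.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Identity template: copy everything that is not handled below. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Regroup the children of every element that has tei:hi children. -->
  <xsl:template match="*[tei:hi]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <!-- Adjacent tei:hi elements with the same @rend share a grouping key;
           every other node gets a unique key and so forms a group of its own. -->
      <xsl:for-each-group select="node()"
                          group-adjacent="if (self::tei:hi[@rend]) then concat('hi:', @rend) else concat('other:', generate-id())">
        <xsl:choose>
          <xsl:when test="self::tei:hi[@rend] and count(current-group()) gt 1">
            <!-- Merge: one tei:hi holding the children of all group members.
                 Attributes other than @rend are dropped in this sketch. -->
            <tei:hi rend="{@rend}">
              <xsl:apply-templates select="current-group()/node()"/>
            </tei:hi>
          </xsl:when>
          <xsl:otherwise>
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Note that a whitespace-only text node between two tei:hi elements breaks the adjacency, so those two elements are left alone. A stylesheet like this moves text nodes around without meaning to change them, which is exactly the property worth checking.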
One tried and true method of doing this is dumping the text nodes of the document before and after the transformation, and then comparing both. Because I only want to know if the text is the same, comparing a SHA1 checksum is sufficient.
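Outside of XProc, the text dump could be produced with a tiny stylesheet along the following lines. This is a sketch, not part of the pipeline described here; it relies on the fact that the string value of the document node is the concatenation of all text nodes in document order.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Serialize as plain text, without markup or XML declaration. -->
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Emit the string value of the document node, i.e. all text content. -->
  <xsl:template match="/">
    <xsl:value-of select="string(/)"/>
  </xsl:template>

</xsl:stylesheet>
```

Applying it to the document before and after the transformation gives two text files that can be compared directly or by checksum.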
Yesterday I finally sat down and wrote an XProc 1.0 step that does exactly this.
Step 1: Calculate the checksum
The first step, dmaus:content-checksum, reads a document and adds the attribute dmaus:checksum, containing the SHA1 checksum of the document's text content, to the outermost element.
<p:declare-step type="dmaus:content-checksum" name="content-checksum">
  <p:documentation>
    Add @dmaus:checksum attribute containing the SHA1 checksum of the document content to the
    outermost element.
  </p:documentation>
  <p:input port="source"/>
  <p:output port="result"/>

  <p:add-attribute attribute-name="dmaus:checksum" attribute-value="" match="/*"/>

  <p:hash algorithm="sha" version="1" match="@dmaus:checksum">
    <p:with-option name="value" select="/"/>
    <p:input port="parameters">
      <p:empty/>
    </p:input>
  </p:hash>
</p:declare-step>
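Because the value is selected with "/", the checksum covers every text node, including whitespace-only ones, so a transformation that merely re-indents the document changes the checksum. If that is too strict for a given project, a variant could hash the whitespace-normalized text instead. The following is a sketch of that idea; the step name dmaus:normalized-content-checksum is made up for this illustration, and the namespace declarations are spelled out so the step can stand on its own.

```xml
<p:declare-step version="1.0"
                type="dmaus:normalized-content-checksum"
                name="normalized-content-checksum"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:dmaus="tag:dmaus@dmaus.name,2019:XProc">
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- Add an empty placeholder attribute for p:hash to fill in. -->
  <p:add-attribute attribute-name="dmaus:checksum" attribute-value="" match="/*"/>

  <p:hash algorithm="sha" version="1" match="@dmaus:checksum">
    <!-- normalize-space() trims the text and collapses runs of whitespace
         into single spaces before the checksum is calculated. -->
    <p:with-option name="value" select="normalize-space(/)"/>
    <p:input port="parameters">
      <p:empty/>
    </p:input>
  </p:hash>
</p:declare-step>
```

The trade-off is that a transformation which accidentally removes or adds whitespace inside running text would go unnoticed as long as at least one whitespace character remains.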
Step 2: Compare checksums
The second step, dmaus:check-text-content-match, has two input ports, source and other. It calculates the checksum for the documents appearing on each port and compares the two. If the checksums differ, the step raises an error. For convenient use in a pipeline, the document from source is passed through to the primary output port.
<p:declare-step type="dmaus:check-text-content-match" name="check-text-content-match">
  <p:documentation>
    Signal an error if the content of the document appearing on the 'source' port differs from the
    content of the document appearing on the 'other' port.
  </p:documentation>
  <p:input port="source"/>
  <p:input port="other"/>
  <p:output port="result"/>

  <dmaus:content-checksum name="checksum-source">
    <p:input port="source">
      <p:pipe step="check-text-content-match" port="source"/>
    </p:input>
  </dmaus:content-checksum>

  <dmaus:content-checksum name="checksum-other">
    <p:input port="source">
      <p:pipe step="check-text-content-match" port="other"/>
    </p:input>
  </dmaus:content-checksum>

  <p:group>
    <p:variable name="source" select="/*/@dmaus:checksum">
      <p:pipe step="checksum-source" port="result"/>
    </p:variable>
    <p:variable name="other" select="/*/@dmaus:checksum">
      <p:pipe step="checksum-other" port="result"/>
    </p:variable>

    <p:choose>
      <p:when test="$other ne $source">
        <p:error code="dmaus:content-mismatch">
          <p:input port="source">
            <p:inline>
              <message>The content of the two documents does not match</message>
            </p:inline>
          </p:input>
        </p:error>
      </p:when>
      <p:otherwise>
        <p:identity>
          <p:input port="source">
            <p:pipe step="check-text-content-match" port="source"/>
          </p:input>
        </p:identity>
      </p:otherwise>
    </p:choose>
  </p:group>
</p:declare-step>
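Since the mismatch is signalled as a dynamic error, it aborts the pipeline unless something catches it. Where a report is preferable to a failed run, the step could be wrapped in p:try. The following sketch assumes the two steps are available from a library file, here called library.xpl; the step name report-mismatch is made up for this illustration.

```xml
<p:declare-step version="1.0" name="report-mismatch"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:dmaus="tag:dmaus@dmaus.name,2019:XProc">
  <p:input port="source"/>
  <p:input port="other"/>
  <p:output port="result"/>

  <!-- Assumption: both dmaus steps are declared in this library file. -->
  <p:import href="library.xpl"/>

  <p:try>
    <p:group>
      <p:output port="result"/>
      <!-- On success the document from 'source' is passed through. -->
      <dmaus:check-text-content-match>
        <p:input port="source">
          <p:pipe step="report-mismatch" port="source"/>
        </p:input>
        <p:input port="other">
          <p:pipe step="report-mismatch" port="other"/>
        </p:input>
      </dmaus:check-text-content-match>
    </p:group>
    <p:catch name="mismatch">
      <p:output port="result"/>
      <!-- On failure, return the error document from the catch's 'error' port. -->
      <p:identity>
        <p:input port="source">
          <p:pipe step="mismatch" port="error"/>
        </p:input>
      </p:identity>
    </p:catch>
  </p:try>
</p:declare-step>
```

With this wrapper the result port carries either the checked document or a c:errors document, so whatever consumes the output has to tell the two apart.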
Step 3: Use the step
Both steps are defined in a library I can import in my pipeline. In this contrived example I connect the step dmaus:check-text-content-match to the result of a p:xslt that implements the structural modification and the original document appearing on the pipeline's source port.
<p:declare-step version="1.0" name="main"
                xmlns:dmaus="tag:dmaus@dmaus.name,2019:XProc"
                xmlns:p="http://www.w3.org/ns/xproc">
  <p:input port="source"/>
  <p:output port="result"/>

  <p:import href="library.xpl"/>

  <p:xslt name="structural-modification">
    <p:input port="stylesheet">
      <p:document href="…"/>
    </p:input>
    <!-- The pipeline declares no parameter input port, so the parameters
         port of p:xslt is bound explicitly; no parameters are passed. -->
    <p:input port="parameters">
      <p:empty/>
    </p:input>
  </p:xslt>

  <dmaus:check-text-content-match>
    <p:input port="source">
      <p:pipe step="structural-modification" port="result"/>
    </p:input>
    <p:input port="other">
      <p:pipe step="main" port="source"/>
    </p:input>
  </dmaus:check-text-content-match>
</p:declare-step>
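The step declarations in steps 1 and 2 are shown without their namespace declarations, which in a library would typically sit on the p:library element. For completeness, here is one way the imported library.xpl could be assembled from the two declarations. The layout is an assumption, not a copy of the actual library file, and the documentation elements are shortened to comments.

```xml
<p:library version="1.0"
           xmlns:p="http://www.w3.org/ns/xproc"
           xmlns:dmaus="tag:dmaus@dmaus.name,2019:XProc">

  <!-- Step 1: add @dmaus:checksum with the SHA1 checksum of the text content. -->
  <p:declare-step type="dmaus:content-checksum" name="content-checksum">
    <p:input port="source"/>
    <p:output port="result"/>
    <p:add-attribute attribute-name="dmaus:checksum" attribute-value="" match="/*"/>
    <p:hash algorithm="sha" version="1" match="@dmaus:checksum">
      <p:with-option name="value" select="/"/>
      <p:input port="parameters"><p:empty/></p:input>
    </p:hash>
  </p:declare-step>

  <!-- Step 2: raise dmaus:content-mismatch if the two checksums differ. -->
  <p:declare-step type="dmaus:check-text-content-match" name="check-text-content-match">
    <p:input port="source"/>
    <p:input port="other"/>
    <p:output port="result"/>

    <dmaus:content-checksum name="checksum-source">
      <p:input port="source">
        <p:pipe step="check-text-content-match" port="source"/>
      </p:input>
    </dmaus:content-checksum>
    <dmaus:content-checksum name="checksum-other">
      <p:input port="source">
        <p:pipe step="check-text-content-match" port="other"/>
      </p:input>
    </dmaus:content-checksum>

    <p:group>
      <p:variable name="source" select="/*/@dmaus:checksum">
        <p:pipe step="checksum-source" port="result"/>
      </p:variable>
      <p:variable name="other" select="/*/@dmaus:checksum">
        <p:pipe step="checksum-other" port="result"/>
      </p:variable>
      <p:choose>
        <p:when test="$other ne $source">
          <p:error code="dmaus:content-mismatch">
            <p:input port="source">
              <p:inline>
                <message>The content of the two documents does not match</message>
              </p:inline>
            </p:input>
          </p:error>
        </p:when>
        <p:otherwise>
          <!-- Pass the document from 'source' through unchanged. -->
          <p:identity>
            <p:input port="source">
              <p:pipe step="check-text-content-match" port="source"/>
            </p:input>
          </p:identity>
        </p:otherwise>
      </p:choose>
    </p:group>
  </p:declare-step>

</p:library>
```

Declaring the dmaus prefix once on p:library keeps the individual step declarations free of repeated namespace attributes.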