David Maus

XProc Step by Step: Check text content after structural transformations

When working with TEI documents that stem from a Word-to-XML conversion, I usually perform some structural modifications, such as joining tei:hi elements that are immediate siblings and carry the same rend attribute. Although I consider my XSLT good enough and use XSpec to test the merging algorithm, I like to verify for good measure that no text content was lost.

One tried-and-true method is to dump the text nodes of the document before and after the transformation and then compare the two. Because I only want to know whether the text is the same, comparing SHA1 checksums is sufficient.

Yesterday I finally sat down and wrote an XProc 1.0 step that does exactly this.

Step 1: Calculate the checksum

The first step, dmaus:content-checksum, reads a document and adds a dmaus:checksum attribute to the outermost element. The attribute contains the SHA1 checksum of the document's text content.

dmaus:content-checksum
<p:declare-step type="dmaus:content-checksum" name="content-checksum">
  <p:documentation>
    Add @dmaus:checksum attribute containing the SHA1 checksum of the document content to the outermost element.
  </p:documentation>
  <p:input  port="source"/>
  <p:output port="result"/>
  <!-- Create an empty placeholder attribute that p:hash can match and fill -->
  <p:add-attribute attribute-name="dmaus:checksum" attribute-value="" match="/*"/>
  <p:hash algorithm="sha" version="1" match="@dmaus:checksum">
    <!-- select="/" yields the string value of the document node, i.e. all text nodes concatenated -->
    <p:with-option name="value" select="/"/>
    <p:input port="parameters">
      <p:empty/>
    </p:input>
  </p:hash>
</p:declare-step>

Step 2: Compare checksums

The second step, dmaus:check-text-content-match, has two input ports, source and other. It calculates the checksum for the document appearing on each port and compares the two. If the checksums differ, the step raises an error. For convenient use in a pipeline, the document from source is passed through to the primary output port.

dmaus:check-text-content-match
<p:declare-step type="dmaus:check-text-content-match" name="check-text-content-match">
  <p:documentation>
    Signal an error if the content of the document appearing on the 'source' port differs from the content of the
    document appearing on the 'other' port.
  </p:documentation>
  <p:input  port="source"/>
  <p:input  port="other"/>
  <p:output port="result"/>
  <!-- Calculate the checksum of each input document -->
  <dmaus:content-checksum name="checksum-source">
    <p:input port="source">
      <p:pipe step="check-text-content-match" port="source"/>
    </p:input>
  </dmaus:content-checksum>
  <dmaus:content-checksum name="checksum-other">
    <p:input port="source">
      <p:pipe step="check-text-content-match" port="other"/>
    </p:input>
  </dmaus:content-checksum>
  <p:group>
    <p:variable name="source" select="/*/@dmaus:checksum">
      <p:pipe step="checksum-source" port="result"/>
    </p:variable>
    <p:variable name="other" select="/*/@dmaus:checksum">
      <p:pipe step="checksum-other" port="result"/>
    </p:variable>
    <p:choose>
      <!-- Checksums differ: signal an error -->
      <p:when test="$other ne $source">
        <p:error code="dmaus:content-mismatch">
          <p:input port="source">
            <p:inline>
              <message>The content of the two documents does not match</message>
            </p:inline>
          </p:input>
        </p:error>
      </p:when>
      <!-- Checksums match: pass the original 'source' document through -->
      <p:otherwise>
        <p:identity>
          <p:input port="source">
            <p:pipe step="check-text-content-match" port="source"/>
          </p:input>
        </p:identity>
      </p:otherwise>
    </p:choose>
  </p:group>
</p:declare-step>
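
Because the step signals a mismatch as an error, a calling pipeline that should keep running can wrap it in p:try. The following is only a sketch under a few assumptions: both steps are saved in a library file, here called library.xpl, and the pipeline receives the two documents on ports named source and other. In the p:catch branch it simply returns whatever error document the processor provides.

```xml
<p:declare-step version="1.0" name="main"
                xmlns:dmaus="tag:dmaus@dmaus.name,2019:XProc"
                xmlns:p="http://www.w3.org/ns/xproc">
  <p:input  port="source" primary="true"/>
  <p:input  port="other"/>
  <p:output port="result"/>

  <!-- Assumption: both steps live in a library file with this name -->
  <p:import href="library.xpl"/>

  <p:try>
    <p:group>
      <p:output port="result"/>
      <!-- Raises dmaus:content-mismatch if the text content differs -->
      <dmaus:check-text-content-match>
        <p:input port="source">
          <p:pipe step="main" port="source"/>
        </p:input>
        <p:input port="other">
          <p:pipe step="main" port="other"/>
        </p:input>
      </dmaus:check-text-content-match>
    </p:group>
    <p:catch name="mismatch">
      <p:output port="result"/>
      <!-- Return the error document instead of aborting the pipeline -->
      <p:identity>
        <p:input port="source">
          <p:pipe step="mismatch" port="error"/>
        </p:input>
      </p:identity>
    </p:catch>
  </p:try>
</p:declare-step>
```

Whether to catch the error at all is a design choice. For a safety net like this one, letting the pipeline fail is often exactly what is wanted.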

Step 3: Use the step

Both steps are defined in a library I can import in my pipeline. In this contrived example I connect the source port of dmaus:check-text-content-match to the result of a p:xslt step that implements the structural modification, and its other port to the original document appearing on the pipeline's source port.

Example pipeline
<p:declare-step version="1.0" name="main"
                xmlns:dmaus="tag:dmaus@dmaus.name,2019:XProc"
                xmlns:p="http://www.w3.org/ns/xproc">
  <p:input  port="source"/>
  <p:output port="result"/>
  <p:import href="library.xpl"/>
  <p:xslt name="structural-modification">
    <p:input port="stylesheet">
      <p:document href="…"/>
    </p:input>
  </p:xslt>
  <dmaus:check-text-content-match>
    <p:input port="source">
      <p:pipe step="structural-modification" port="result"/>
    </p:input>
    <p:input port="other">
      <p:pipe step="main" port="source"/>
    </p:input>
  </dmaus:check-text-content-match>
</p:declare-step>
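
To try the check without a stylesheet, the two documents can also be supplied inline. This sketch mirrors the tei:hi case from the introduction. The sample content is made up, and the sketch assumes that library.xpl sits next to the pipeline. Both inline documents have the same text content, so the step should pass the document from source through. Changing the text in either document should trigger the dmaus:content-mismatch error.

```xml
<p:declare-step version="1.0" name="main"
                xmlns:dmaus="tag:dmaus@dmaus.name,2019:XProc"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:tei="http://www.tei-c.org/ns/1.0">
  <p:output port="result"/>
  <p:import href="library.xpl"/>

  <dmaus:check-text-content-match>
    <!-- Hypothetical result of the merge: one tei:hi element -->
    <p:input port="source">
      <p:inline><tei:p><tei:hi rend="italic">Lorem ipsum</tei:hi></tei:p></p:inline>
    </p:input>
    <!-- Hypothetical original: two sibling tei:hi elements with identical @rend -->
    <p:input port="other">
      <p:inline><tei:p><tei:hi rend="italic">Lorem </tei:hi><tei:hi rend="italic">ipsum</tei:hi></tei:p></p:inline>
    </p:input>
  </dmaus:check-text-content-match>
</p:declare-step>
```

Each inline document is kept on a single line on purpose. The checksum is calculated over all text nodes, so whitespace added for indentation inside the elements would count as a difference.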