XProc Step by Step: Implementing a DOCX to TEI step

Some of our projects use an XML-first workflow in which the principal researchers transcribe and annotate historical documents in Microsoft Word. While this solution is far from perfect, it allows humanists to capture valuable content with a tool they are comfortable with.

Such is the case in The Nuns' Network. The project aims at a digital edition of letters written by the Benedictine nuns of the Lüne monastery between 1460 and 1555. With a couple of hundred letters to process, automating repetitive tasks is key. The project recently entered a stage where the publication workflow is stable enough to be eased or partially replaced by automated processes.

The first step of the current workflow is an initial transformation of a Word document into a TEI document. The project uses the official transformation stylesheet provided by the Text Encoding Initiative and included in the <oXygen/> XML editor. The transformation is executed manually: <oXygen/> provides an Ant-based scenario that unzips the Word document and applies the respective transformation.

To integrate this first step into an XProc pipeline, I implemented the initial transformation as an XProc step. The step uses the XProc extension steps pxp:unzip and pxf:delete, and Calabash's cx:depends-on attribute. It unzips all XML documents in the Word file (including the .rels relationship files) to a temporary folder, runs the TEI transformation, and deletes the temporary folder.

The final TEI document is sent to the result port of the step.

docx2tei.xpl

<p:declare-step version="1.0" type="ndn:docx2tei"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                xmlns:ndn="http://diglib.hab.de/edoc/ed000248/ns"
                xmlns:pxp="http://exproc.org/proposed/steps"
                xmlns:pxf="http://exproc.org/proposed/steps/file"
                xmlns:p="http://www.w3.org/ns/xproc">

  <p:output port="result" primary="true">
    <p:pipe step="apply-transform" port="result"/>
  </p:output>

  <p:option name="inputFile" required="true"/>

  <p:import href="http://xmlcalabash.com/extension/steps/library-1.0.xpl"/>

  <p:variable name="outputTempDir" select="concat($inputFile, '.tmp')"/>
  <p:variable name="outputTempDirUri" select="resolve-uri(encode-for-uri($outputTempDir))"/>

  <!-- Without a "file" option, pxp:unzip returns a c:zipfile listing of the archive. -->
  <pxp:unzip name="list-zip-directory">
    <p:with-option name="href" select="$inputFile"/>
  </pxp:unzip>

  <!-- Extract every XML part (*.xml, *.rels) and store it in the temporary directory. -->
  <p:for-each name="iterate-zip-directory">
    <p:iteration-source select="/c:zipfile/c:file[ends-with(@name, '.xml') or ends-with(@name, '.rels')]">
      <p:pipe step="list-zip-directory" port="result"/>
    </p:iteration-source>
    <p:variable name="outputTempName" select="/c:file/@name"/>
    <pxp:unzip>
      <p:with-option name="href" select="$inputFile"/>
      <p:with-option name="file" select="$outputTempName"/>
    </pxp:unzip>
    <p:store method="xml">
      <p:with-option name="href" select="concat($outputTempDirUri, '/', encode-for-uri($outputTempName))"/>
    </p:store>
  </p:for-each>

  <p:load name="load-stylesheet">
    <p:with-option name="href" select="resolve-uri('stylesheet/profiles/default/docx/from.xsl')"/>
  </p:load>

  <!-- cx:depends-on delays loading until all parts have been written to disk. -->
  <p:load name="load-source" cx:depends-on="iterate-zip-directory">
    <p:with-option name="href" select="concat($outputTempDirUri, '/word/document.xml')"/>
  </p:load>

  <p:xslt name="apply-transform">
    <p:with-param name="word-directory" select="$outputTempDirUri"/>
    <p:input port="stylesheet">
      <p:pipe step="load-stylesheet" port="result"/>
    </p:input>
    <p:input port="source">
      <p:pipe step="load-source" port="result"/>
    </p:input>
  </p:xslt>

  <!-- Clean up after the transformation; delete the same URI the parts were stored under. -->
  <pxf:delete recursive="true" fail-on-error="false" cx:depends-on="apply-transform">
    <p:with-option name="href" select="$outputTempDirUri"/>
  </pxf:delete>

</p:declare-step>
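
Calling the step from another pipeline could look roughly like the sketch below. It is a minimal illustration, not part of the project's actual workflow: the option name docxFile is hypothetical, and the import assumes that docx2tei.xpl is stored next to the calling pipeline.

```xml
<p:declare-step version="1.0"
                xmlns:ndn="http://diglib.hab.de/edoc/ed000248/ns"
                xmlns:p="http://www.w3.org/ns/xproc">

  <!-- Hypothetical option: the Word file to convert, supplied by the caller. -->
  <p:option name="docxFile" required="true"/>

  <!-- With no explicit binding, this port receives the primary output
       of the last step, i.e. the TEI document produced by ndn:docx2tei. -->
  <p:output port="result"/>

  <!-- Assumption: docx2tei.xpl sits in the same directory as this pipeline. -->
  <p:import href="docx2tei.xpl"/>

  <ndn:docx2tei>
    <p:with-option name="inputFile" select="$docxFile"/>
  </ndn:docx2tei>

</p:declare-step>
```

Further steps, such as project-specific clean-up transformations, could be placed after ndn:docx2tei and would read the TEI document from its result port.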