XProc Step by Step: Implementing a DOCX to TEI step
David Maus, 20th August 2018 01:35
Some of our projects use an XML-first workflow in which the principal researchers transcribe and annotate
historical documents in Microsoft Word. While this solution is far from perfect, it allows humanists
to capture valuable content with a tool they are comfortable with.
Such is the case in The Nuns' Network. The project aims at a
digital edition of letters written by the Benedictine nuns of the Lüne monastery between 1460 and 1555. With a
corpus of a couple of hundred letters, automating repetitive tasks is key. The project recently entered a stage
where the publication workflow is stable enough for manual steps to be eased or partially replaced by automated processes.
The first step of the current workflow is an initial transformation of a Word document into a TEI document. The project uses
the official transformation stylesheet provided by the Text Encoding Initiative and included in the <oXygen/>
XML editor. It is executed manually: <oXygen/> provides an Ant-based scenario that unzips the Word document
and applies the respective transformation.
To integrate this first step into an XProc pipeline, I implemented the initial transformation as an XProc
step. The step uses the XProc extension steps pxp:unzip and pxf:delete, and Calabash's
cx:depend-on attribute. It unzips all XML documents to a temporary folder, runs the TEI
transformation, and deletes the temporary folder.
The final TEI document is sent to the result port of the step.
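
The control flow described above (unzip to a temporary folder, transform, delete the folder) can be illustrated outside XProc. The following Python sketch is not the XProc step itself, only a rough analogue of its three phases. The function name docx_to_tei and the transform callback are hypothetical, and the filter on .xml and .rels members is an assumption about what "all XML documents" covers. The callback stands in for the TEI stylesheet, which is not reproduced here.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path
from typing import Callable


def docx_to_tei(docx_path: str, transform: Callable[[Path], str]) -> str:
    """Unzip the XML parts of a Word file, transform them, then clean up.

    `transform` is a hypothetical placeholder for the TEI transformation:
    it receives the temporary folder and returns the resulting document.
    """
    workdir = Path(tempfile.mkdtemp(prefix="docx2tei-"))
    try:
        with zipfile.ZipFile(docx_path) as archive:
            for name in archive.namelist():
                # Assumption: only XML parts are needed by the transformation.
                if name.endswith((".xml", ".rels")):
                    archive.extract(name, workdir)
        return transform(workdir)
    finally:
        # Runs after the transformation, whether it succeeded or failed.
        shutil.rmtree(workdir, ignore_errors=True)
```

In Python the try/finally block guarantees that the cleanup runs after the transformation. An XProc processor is free to order steps that have no data dependency on each other, which is why the step described above relies on cx:depend-on to make the ordering explicit.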