Advanced grouping with xsl:for-each-group

Abstract

The following post explains the use of xsl:for-each-group with a custom grouping key function to join sequences of foreign text. It assumes some familiarity with user-defined functions in XSLT and requires a thorough understanding of tree navigation with XPath.

The challenge

I recently spoke with Wolfgang about the challenges of The Nuns' Network, a project that aims at a digital edition of letters written by the Benedictine nuns of the Lüne monastery between 1460 and 1555. The project uses an XML workflow that starts with Word documents, which are processed step by step into the final TEI markup. The first step of this workflow creates a raw TEI document, which is then processed further.

Figure 1: Current structure

One challenge the project faces is dealing with spans of text written in a letter's secondary language. These spans are marked up as TEI foreign elements in the raw TEI document and may be interrupted by line and/or page breaks. Figure 1 illustrates the structure. Ideally, all those spans should be merged into one foreign element with the page and line breaks inside, as shown in figure 2.

Figure 2: Desired structure
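
As a minimal illustration of the current structure, assume a paragraph in which a Latin span runs across a line break. The text content and the xml:lang value are made up for this sketch and are not taken from the project's data:

<p>Main text <foreign xml:lang="la">first part</foreign><lb/><foreign xml:lang="la">second part</foreign> main text.</p>

The desired structure would then keep a single foreign element with the line break inside:

<p>Main text <foreign xml:lang="la">first part<lb/>second part</foreign> main text.</p>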

Currently Wolfgang applies a regular expression to the TEI document. While this works, and we also found a way to integrate this step into an XProc publishing pipeline, it certainly lacks elegance. Manipulating the XML's textual (lexical) representation in order to achieve what is essentially a tree transformation feels very clumsy.

The grouping function

The xsl:for-each-group instruction is specified as follows:

[Definition: The xsl:for-each-group instruction allocates the items in an input sequence into groups of items (that is, it establishes a collection of sequences) based either on common values of a grouping key, or on a pattern that the initial or final node in a group must match.] The sequence constructor that forms the content of the xsl:for-each-group instruction is evaluated once for each of these groups.

...

If the group-by attribute is present, the items in the population are examined, in population order. For each item J, the expression in the group-by attribute is evaluated to produce a sequence of zero or more grouping key values. For each one of these grouping keys, if there is already a group created to hold items having that grouping key value, J is added to that group; otherwise a new group is created for items with that grouping key value, and J becomes its first member.

XSL Transformations (XSLT) Version 2.0, Section 14.3
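
To make the quoted group-by rule concrete, here is a minimal sketch. The element names list, item, and group and the attribute type are hypothetical and only serve the illustration; current-group() and current-grouping-key() are the standard XSLT 2.0 functions for accessing the group being processed and its key.

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Hypothetical input: <list><item type="a"/><item type="b"/><item type="a"/></list> -->
  <xsl:template match="list">
    <xsl:copy>
      <!-- One group per distinct value of @type, in order of first appearance -->
      <xsl:for-each-group select="item" group-by="@type">
        <group type="{current-grouping-key()}">
          <!-- current-group() holds all items sharing the key, adjacent or not -->
          <xsl:copy-of select="current-group()"/>
        </group>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

For the hypothetical input, the first and third item end up in the group for "a" although they are not adjacent. This is why a suitable key is all we need: nodes that share a key are collected into one group regardless of what the key value looks like.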

To solve our grouping problem we thus need a function that computes a grouping key such that all nodes in a sequence of foreign, lb, and pb share the same key, while every other node gets a key of its own.

If we only had to cover the nodes that are not part of a sequence, we could write the function simply as:

<xsl:function name="fn:grouping-key" as="xs:string">
  <xsl:param name="node" as="node()"/>
  <!-- xsl:sequence returns the string itself instead of building a text node first -->
  <xsl:sequence select="generate-id($node)"/>
</xsl:function>
      

The function generate-id() is guaranteed to return a unique value for each node during a transformation. If the node is part of a sequence, we can return the unique value of the node that starts the sequence. This fulfills the requirement that all nodes of a sequence share the key. To implement this, we need to check whether the node in question is part of a sequence and, if so, find the first node of this sequence.

If we think along the lines of XPath, a node is part of the sequence if both of these conditions hold true:

  1. It has a foreign as preceding sibling and between this element and the node are only lb, pb, or foreign elements.
  2. It has a foreign as following sibling and between this element and the node are only lb, pb, or foreign elements, or it is itself a foreign element.
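
To see how the two conditions play out, consider the following annotated fragment. It is a made-up example, not project data. The whitespace-only text nodes that result from the indentation are ignored here; the fn:valid-seq function defined further down tolerates them as well.

<p>
  <lb/>                  <!-- condition 1 fails: no preceding foreign -->
  <foreign>a</foreign>   <!-- condition 1 fails, but this is the node whose id the others will share -->
  <lb/>                  <!-- both hold: a foreign before and after, nothing else in between -->
  <pb/>                  <!-- both hold -->
  <foreign>b</foreign>   <!-- both hold: condition 2 because it is itself a foreign -->
  <lb/>                  <!-- condition 2 fails: no following foreign -->
</p>

The first foreign does not satisfy condition 1, yet it still ends up in the group: its own generate-id() value is exactly the key that the other members of the sequence return.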

Another way of looking at these conditions is as functions that find the first and the last foreign of a sequence. Thus we can say:

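<!-- First foreign of the sequence: look backwards from $node -->
<xsl:variable name="seq-start" as="element()?" select="$node/preceding-sibling::foreign[fn:valid-seq(fn:siblings-between(., $node))][last()]"/>
<!-- Last foreign of the sequence: look forwards, or take $node itself.
     The as attribute matters: without it the variable would hold a document node, which always tests as true. -->
<xsl:variable name="seq-end" as="element()?">
  <xsl:choose>
    <xsl:when test="$node/following-sibling::foreign[fn:valid-seq(fn:siblings-between($node, .))]">
      <xsl:sequence select="$node/following-sibling::foreign[fn:valid-seq(fn:siblings-between($node, .))][last()]"/>
    </xsl:when>
    <xsl:when test="$node/self::foreign">
      <xsl:sequence select="$node"/>
    </xsl:when>
  </xsl:choose>
</xsl:variable>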
      

Here, the function fn:siblings-between returns all siblings between two nodes, and fn:valid-seq returns true if a sequence of nodes contains only lb, pb, and foreign elements. Whitespace-only text nodes are tolerated, and an empty sequence counts as valid.

They are defined as follows:

<xsl:function name="fn:valid-seq" as="xs:boolean">
  <xsl:param name="nodes" as="node()*"/>
  <!-- True if nothing but lb, pb, foreign, or whitespace-only text is left over -->
  <xsl:sequence select="empty($nodes[not(self::lb or self::pb or self::foreign or (self::text() and normalize-space() eq ''))])"/>
</xsl:function>

<xsl:function name="fn:siblings-between" as="node()*">
  <xsl:param name="start" as="node()"/>
  <xsl:param name="end" as="node()"/>
  <!-- Siblings after $start that are also siblings before $end -->
  <xsl:sequence select="$start/following-sibling::node() intersect $end/preceding-sibling::node()"/>
</xsl:function>
      

With these two variables in place we can improve our fn:grouping-key function.

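<xsl:function name="fn:grouping-key" as="xs:string">
  <xsl:param name="node" as="node()"/>
  <!-- First foreign of the sequence: look backwards from $node -->
  <xsl:variable name="seq-start" as="element()?" select="$node/preceding-sibling::foreign[fn:valid-seq(fn:siblings-between(., $node))][last()]"/>
  <!-- Last foreign of the sequence: look forwards, or take $node itself.
       The as attribute matters: without it the variable would hold a document node, which always tests as true. -->
  <xsl:variable name="seq-end" as="element()?">
    <xsl:choose>
      <xsl:when test="$node/following-sibling::foreign[fn:valid-seq(fn:siblings-between($node, .))]">
        <xsl:sequence select="$node/following-sibling::foreign[fn:valid-seq(fn:siblings-between($node, .))][last()]"/>
      </xsl:when>
      <xsl:when test="$node/self::foreign">
        <xsl:sequence select="$node"/>
      </xsl:when>
    </xsl:choose>
  </xsl:variable>
  <!-- Inside a sequence: share the id of its first foreign; otherwise use the node's own id -->
  <xsl:sequence select="if ($seq-start and $seq-end) then generate-id($seq-start) else generate-id($node)"/>
</xsl:function>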
      

...

Now we can apply this grouping function to our challenge. We match all elements containing at least one foreign element, create a shallow copy, and iterate its child nodes with xsl:for-each-group and our grouping function.

<xsl:template match="*[foreign]">
  <xsl:copy>
    <xsl:sequence select="@*"/>
    <xsl:for-each-group select="node()" group-by="fn:grouping-key(.)">
      ...
    </xsl:for-each-group>
  </xsl:copy>
</xsl:template>
      

Inside the loop we handle two cases. If the current group contains exactly one node, then we simply copy it. Otherwise we create a foreign element by copying the group's first element and iterating the group members: lb and pb we copy, foreign we unwrap.
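
The two cases could be put together roughly as in the following sketch of a complete stylesheet. It is one possible reading of the description above, not the project's actual code. Several points are assumptions: the namespace URI bound to the fn prefix is a placeholder, the elements are taken to be in no namespace (for real TEI input one would add xpath-default-namespace="http://www.tei-c.org/ns/1.0" to xsl:stylesheet), all other nodes are passed through an identity template, and the key function carries one extra check, fn:valid-seq($node), so that ordinary text or other elements sitting directly between two foreign elements are not pulled into a group.

<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:fn="http://example.org/ns/functions"
                exclude-result-prefixes="xs fn">

  <!-- Assumption: everything not handled below is copied unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:function name="fn:valid-seq" as="xs:boolean">
    <xsl:param name="nodes" as="node()*"/>
    <xsl:sequence select="empty($nodes[not(self::lb or self::pb or self::foreign or (self::text() and normalize-space() eq ''))])"/>
  </xsl:function>

  <xsl:function name="fn:siblings-between" as="node()*">
    <xsl:param name="start" as="node()"/>
    <xsl:param name="end" as="node()"/>
    <xsl:sequence select="$start/following-sibling::node() intersect $end/preceding-sibling::node()"/>
  </xsl:function>

  <xsl:function name="fn:grouping-key" as="xs:string">
    <xsl:param name="node" as="node()"/>
    <xsl:variable name="seq-start" as="element()?"
                  select="$node/preceding-sibling::foreign[fn:valid-seq(fn:siblings-between(., $node))][last()]"/>
    <!-- Furthest valid following foreign if there is one, else $node itself if it is a foreign -->
    <xsl:variable name="seq-end" as="element()?"
                  select="($node/following-sibling::foreign[fn:valid-seq(fn:siblings-between($node, .))][last()], $node/self::foreign)[1]"/>
    <!-- fn:valid-seq($node) is an addition in this sketch, see the note above -->
    <xsl:sequence select="if ($seq-start and $seq-end and fn:valid-seq($node)) then generate-id($seq-start) else generate-id($node)"/>
  </xsl:function>

  <xsl:template match="*[foreign]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:for-each-group select="node()" group-by="fn:grouping-key(.)">
        <!-- Keep the group in a variable so it stays available in the nested loops -->
        <xsl:variable name="group" select="current-group()"/>
        <xsl:choose>
          <!-- Case 1: a node on its own is processed as usual -->
          <xsl:when test="count($group) eq 1">
            <xsl:apply-templates select="."/>
          </xsl:when>
          <!-- Case 2: a sequence becomes one foreign, modelled on its first member -->
          <xsl:otherwise>
            <xsl:for-each select="$group[1]">
              <xsl:copy>
                <xsl:apply-templates select="@*"/>
                <xsl:for-each select="$group">
                  <xsl:choose>
                    <!-- foreign is unwrapped: only its content is kept -->
                    <xsl:when test="self::foreign">
                      <xsl:apply-templates select="node()"/>
                    </xsl:when>
                    <!-- lb, pb, and whitespace between them are copied -->
                    <xsl:otherwise>
                      <xsl:apply-templates select="."/>
                    </xsl:otherwise>
                  </xsl:choose>
                </xsl:for-each>
              </xsl:copy>
            </xsl:for-each>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Note that the merged element takes its attributes from the first foreign of the sequence only. Whether adjacent spans with different attribute values, for example different xml:lang values, should be merged at all is a question this sketch does not address.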

And that's about it.

Summary