David Maus

A quick one: Use Metafacture to create a list of records with cataloging errors

Some records in our library catalog contain an error that causes trouble when we distribute catalog data to national aggregators like the Zentrales Verzeichnis Digitalisierter Drucke (ZVDD), the central access point to printed works from the 15th century up to today, digitized in Germany. The catalogers made a typo and didn't separate the name of the publisher and the place of publication.

Metafacture is a really helpful suite of tools when working with a comparatively large set of records in a somewhat unwieldy format. The following Metamorph transformation runs over a dump of our catalog and outputs a list that contains the record's primary key (Pica production number, PPN), an indicator of the resource type, and the erroneous publication statement.

morph.xml
<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1" entityMarker="$">
  <rules>
    <!-- Titles in which publisher and place are not separated          -->
    <!-- E.g. http://uri.hab.de/instance/proxy/opac-de-23/391719076.xml -->
    <!-- which has 033A $p[Strassburg: Prüss]                           -->
    <combine name="" value="${ppn} - ${mat} - ${fehler-verlag}">
      <data source="003@$0" name="ppn"/>
      <data source="002@$0" name="mat">
        <substring start="0" end="1"/>
      </data>
      <data source="033A$p" name="fehler-verlag">
        <regexp match=".+: .+" format="${0}"/>
        <normalize-utf8/>
      </data>
    </combine>
  </rules>
</metamorph>

If subfield p of field 033A matches the regular expression, and subfield 0 of field 003@ and subfield 0 of field 002@ are both present, the combine joins the three values into a single unnamed output literal. Because 002@ and 003@ are always present, the combine acts as a filter: it produces output only when the erroneous 033A is detected.
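For readers who don't know Metamorph, the filter logic can be sketched in a few lines of Python. This is only an approximation under stated assumptions, not how Metafacture works internally: the function name `publication_error` and the material code `"Aau"` are made up for illustration, the regular expression is assumed to be applied as a search over the subfield value, and `normalize-utf8` is assumed to correspond to Unicode NFC normalization.

```python
import re
import unicodedata

# Same pattern as in morph.xml: some text, a colon and a space, more text.
PATTERN = re.compile(r".+: .+")


def publication_error(ppn, material, place):
    """Return one report line, or None if the record is not affected.

    Mirrors the combine: all three inputs must be present, and the
    033A $p value must match the pattern, before anything is emitted.
    """
    if ppn is None or material is None or place is None:
        return None
    match = PATTERN.search(place)
    if match is None:
        return None
    # Assumption: normalize-utf8 behaves like NFC normalization.
    statement = unicodedata.normalize("NFC", match.group(0))
    # substring start=0 end=1 keeps only the first character of 002@ $0.
    return f"{ppn} - {material[0:1]} - {statement}"


# Hypothetical record values, modelled on the example in the morph comment.
print(publication_error("391719076", "Aau", "Strassburg: Prüss"))
print(publication_error("391719076", "Aau", "Strassburg"))
```

The first call yields a report line because the place subfield contains the telltale colon; the second returns `None`, which is the filtering effect described above.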

I run this morph with a simple Flux pipeline.

Flux pipeline
default src = "s:\\edoc\\ed000242\\2017\\20170429-title.pp";
default dst = "stdout";

src|
open-file|
as-records|
decode-pica|
morph(FLUX_DIR + morph)|
encode-literals|
write(dst);

It turns out that only 678 of approximately 1.2 million records, or about 0.06%, are affected. This speaks volumes for the dedication of our staff and makes the problem manageable.