A quick one: Use Metafacture to create a list of records with cataloging errors
David Maus, 24. Apr 2018
Some records in our library catalog contain an error that causes trouble when we distribute catalog data to national aggregators like the Zentrales Verzeichnis Digitalisierter Drucke (ZVDD), the central access point to printed works from the 15th century up to today, digitized in Germany. The catalogers made a typo and did not separate the name of the publisher from the place of publication; both ended up in a single subfield.
Metafacture is a really helpful suite of tools when working with a comparatively large set of records in a somewhat unwieldy format. The following Metamorph transformation runs over a dump of our catalog and outputs a list that contains the record's primary key (Pica production number, PPN), an indicator of the resource type, and the erroneous publication statement.
<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1" entityMarker="$">
  <rules>
    <!-- Titles where publisher and place of publication are not separated -->
    <!-- E.g. http://uri.hab.de/instance/proxy/opac-de-23/391719076.xml -->
    <!-- here 033A $p[Strassburg: Prüss] -->
    <combine name="" value="${ppn} - ${mat} - ${fehler-verlag}">
      <data source="003@$0" name="ppn"/>
      <data source="002@$0" name="mat">
        <substring start="0" end="1"/>
      </data>
      <data source="033A$p" name="fehler-verlag">
        <regexp match=".+: .+" format="${0}"/>
        <normalize-utf8/>
      </data>
    </combine>
  </rules>
</metamorph>
If subfield p of field 033A matches the specified regular expression, and both subfield 0 of field 003@ and subfield 0 of field 002@ are present, these three values are combined into an unnamed output entity. Because 002@ and 003@ are always present, the combine acts as a filter and generates an output entity only if an erroneous 033A is detected.
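The filtering logic of the morph can be sketched in plain Python. This is a simplified illustration, not part of the actual pipeline: it assumes a record is already available as a dict mapping PICA field/subfield keys to values, which is an assumption made for the sketch; the field names (003@ $0 for the PPN, 002@ $0 for the material type, 033A $p for the publication statement) and the regular expression are taken from the morph above.

```python
import re

# Publisher and place were erroneously entered in one subfield,
# e.g. "Strassburg: Prüss" -- the pattern from the morph above.
ERROR_PATTERN = re.compile(r".+: .+")

def check_record(record):
    """Return 'PPN - mat - publication' if the record shows the error, else None.

    `record` is a simplified dict representation of a PICA record,
    assumed for this sketch only.
    """
    ppn = record.get("003@$0")  # primary key (PPN)
    mat = record.get("002@$0")  # material/resource type
    pub = record.get("033A$p")  # publication statement
    # All three values must be present, and the publication statement
    # must contain "<publisher>: <place>" in a single subfield.
    if ppn and mat and pub and ERROR_PATTERN.search(pub):
        # Like the <substring> in the morph, keep only the first
        # character of the material type.
        return f"{ppn} - {mat[0]} - {pub}"
    return None

print(check_record({"003@$0": "391719076",
                    "002@$0": "Aau",
                    "033A$p": "Strassburg: Prüss"}))
```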
I run this morph with a simple Flux pipeline.
default src = "s:\\edoc\\ed000242\\2017\\20170429-title.pp";
default dst = "stdout";

src
| open-file
| as-records
| decode-pica
| morph(FLUX_DIR + morph)
| encode-literals
| write(dst);
It turns out that only 678 of approximately 1.2 million records, or 0.06%, are affected. This speaks volumes for the dedication of our staff and makes the problem manageable.