Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Introduction

This project provides a toolkit for efficiently processing bibliographic records encoded in MARC 21, which is a popular file format used to exchange bibliographic data between libraries. In particular, the command line tool marc21 allows efficient filtering of records and extraction of data into a rectangular schema. Since the extracted data is in tabular form, it can be processed with popular frameworks such as Polars or Tidyverse.

marc21-rs is developed by the Metadata Department of the German National Library (DNB). It is used for data analysis and for automating metadata workflows (data engineering) as part of automatic content indexing.

The source code is licensed under the European Union Public License 1.2.

Getting Started

Installation

Binaries

In order to install the marc21 binary, archives with a precompiled binary are available for Windows, macOS and Linux.

Two variants are available for Linux: a dynamically linked version and a fully statically linked version (MUSL). In most cases, the statically linked version should be preferred, as it is independent of the glibc version of the host system. The following commands install the binary into the /usr/local/bin directory:

$ tar xfz marc21-0.2.0-x86_64-unknown-linux-musl.tar.gz
$ sudo install -Dm755 marc21-0.2.0-x86_64-unknown-linux-musl/marc21 \
   /usr/local/bin/marc21

From Source

If a Rust toolchain is available, marc21 can also be installed using the Rust package manager cargo. The project requires a Rust compiler with a minimum version of 1.93. Use the following command to install the program with the default features:

$ cargo install marc21-cli  

The binary can be built with the following features as needed:

build
Commands and functions that are only needed during the build process or packaging are activated with the build feature. This includes, for example, the commands for generating man pages (build-man) and shell completions (build-completion).
performant
This feature activates optimizations aimed at improving performance. This includes, for example, the activation of SIMD or a more aggressive inline strategy. Since the main goal of the project is high performance, the feature is enabled by default.
unstable
New features that are still in the testing phase can be activated using the unstable feature. Keep in mind that these functions may change at any time.

First Steps

This section provides an overview of working with the marc21 command line tool. It demonstrates important commands using simple use cases. An in-depth explanation of the concepts, in particular the structure of filter expressions, has been omitted for brevity.

The marc21 tool provides various commands for processing MARC 21 records (see marc21 --help for a complete list of available commands).

Concatenate Multiple Files

The concat command can be used to combine multiple files into a single output. In the following example, the authority data files from the Integrated Authority Files (GND) are concatenated into the single file GND.mrc.gz.

$ marc21 concat -ps authorities-gnd-*.mrc.gz -o GND.mrc.gz
10,122,437 records, 0 invalid | 49,035 records/s, elapsed: 00:03:19

The --skip-invalid (-s) option is used to skip invalid records that could not be decoded. If the option is not specified, processing will abort at the first invalid record. In addition, the processing progress can be displayed with the --progress (-p) option.

Filtering Records

The filter command extracts those records that fulfill a specified condition. For example, all records with status z and at least one field 100 with indicators 1 and # (space) can be filtered as follows:

$ marc21 filter -s 'ldr.status == "z" && 100/1#?' DUMP.mrc.gz -o out.mrc

Operators

The comparison operators ==, !=, >=, >, <=, and < can be used for values in selected leader fields, values in control fields, and values in subfields. Here are a few examples

$ marc21 filter -s '100/1#.a == "Lovelace, Ada"' DUMP.mrc.gz -o out.mrc
$ marc21 filter -s '100/*.a != "Curie, Marie"' DUMP.mrc.gz -o out.mrc
$ marc21 filter -s '001 == "119232022"' DUMP.mrc.gz -o out.mrc
$ marc21 filter -s 'ldr.length > 3000' DUMP.mrc.gz -o out.mrc
$ marc21 filter -s 'ldr.status == "z"' DUMP.mrc.gz -o out.mrc

To check whether a value (control field or data field) comes from a specified list, the in operator is used. In contrast, the not in operator checks whether a value is not contained in the list. The following example tests whether a field 100 exists that has a subfield a with the value “Curie, Marie” or “Lovelace, Ada”:

$ marc21 filter -s '100/*.a in ["Lovelace, Ada", "Curie, Marie"]' \
    DUMP.mrc.gz -o out.mrc

The =? operator and, in negated form, !? perform a substring search on subfield values. These operators allow simultaneous searching for multiple patterns by using the []-notation:

$ marc21 filter -s '100/*.a =? ["Hate", "Love"]' DUMP.mrc.gz -o out.mrc.gz
$ marc21 filter -s '100/1#.a =? "Love"' DUMP.mrc.gz -o out.mrc.gz

Subfield values can be checked against one or a set of regular expressions. The filter expression uses the =~ operator or the !~ operator in negated form. The underlying regex engine does not support all regex features; please refer to the specification to learn more about the syntax and possible limitations. The following example searches for all records with a field 533 that contains a subfield n whose value matches the regular expression for an ISBN.

$ marc21 filter -s \
    '533.n =~ "(?i)ISBN(?:-1[03])?(?::?\\s*)?\\s(?:97[89][-\ ]?)?\\d{1,5}[-\\ ]?(?:\\d+[-\\ ]?){2}(?:\\d|X)"' \
    DUMP.mrc.gz -o out.mrc.gz

To test whether a subfield value begins with a prefix or not, the =^ operator or, in its negated form, the !^ operator is used:

$ marc21 filter -s '400/1#.a =^ "Love"' DUMP.mrc.gz -o out.mrc.gz
$ marc21 filter -s '400/1#.a =^ ["Hate", "Love"]' DUMP.mrc.gz -o out.mrc.gz
$ marc21 filter -s '400/1#{ [ac] =^ "Count" }' DUMP.mrc.gz -o out.mrc.gz

In contrast, the =$ operator can be used to check whether a subfield value ends with a specific suffix. Keep in mind that the $ character often has a special meaning on the command line and may need to be escaped.

$ marc21 filter -s '548.4 =$ "/gnd#dateOfBirthAndDeath"' DUMP.mrc.gz -o out.mrc.gz
$ marc21 filter -s '401/1#.a !$ "Ada"' DUMP.mrc.gz -o out.mrc.gz

Similarity comparisons between character strings are performed using the =* operator (in negated form !*). The normalized Levenshtein distance is calculated between the subfield value and the comparison value. If this is greater than the specified threshold value, the comparison is considered a match. The default threshold value is 0.8 and can be changed using the command line option --strsim-threshold:

$ marc21 filter -s --strsim-threshold 0.9 '100/1#.a =* "Lovelace, Bda"' \
    DUMP.mrc.gz -o out.mrc.gz

Counting Records

The number of records contained in the input can be determined using the count command:

$ marc21 count GND.mrc.gz
10329438

The --where option can be used to count only those records that match a certain criterion:

$ marc21 count GND.mrc.gz --where 'ldr.type == "z" && 075{ b == "gik" && 2 == "gndspec" }'
179672

The print command output records in a human-readable format. The leader, control and data fields are written on a separate line. Consecutive records are divided by a blank line. The output of the command can be used in combination with standard UNIX tools such as grep, cut or sed. In the following example, a single data record is printed on the console:

$ marc21 print tests/data/ada.mrc --where '100/*.a =? "Love"'
LDR 03612nz  a2200589nc 4500
001 119232022
003 DE-101
005 20250720173911.0
008 950316n||azznnaabn           | aaa    |c
024/7# $a 119232022 $0 http://d-nb.info/gnd/119232022 $2 gnd
035 $a (DE-101)119232022
035 $a (DE-588)119232022
035 $z (DE-588)172642531
035 $z (DE-588a)172642531 $9 v:zg
035 $z (DE-588a)119232022 $9 v:zg
035 $z (DE-588c)4370325-2 $9 v:zg
040 $a DE-386 $c DE-386 $9 r:DE-576 $b ger $d 1841
042 $a gnd1
043 $c XA-GB
065 $a 28p $2 sswd
065 $a 9.5p $2 sswd
075 $b p $2 gndgen
075 $b piz $2 gndspec
079 $a g $q f $q s $q z $u w $u k $u v
100/1# $a Lovelace, Ada $d 1815-1852
375 $a 2 $2 iso5218
400/1# $a Lovelace, Augusta Ada ˜ofœ $d 1815-1852
400/1# $a Lovelace, Ada Augusta ˜ofœ $d 1815-1852
400/1# $a Byron, Ada $d 1815-1852
400/1# $a Byron King, Augusta Ada $d 1815-1852
400/1# $a King, Augusta Ada $d 1815-1852
400/1# $a King, Ada $d 1815-1852
400/1# $a King-Noel, Augusta Ada $c Countess of Lovelace $d 1815-1852
...

Reference

Commands

The marc21 tool provides the following commands:

  • concat — Concatenate records from multiple inputs (alias cat)
  • count — Print the number of records in the input data (alias cnt)
  • filter — Filter records that fulfill a specified condition
  • hash — Compute SHA-256 checksum of records
  • invalid — Output invalid records that cannot be decoded
  • print — Print records in human readable format
  • sample — Select a random permutation of records
  • split — Split the input into chunks of a given size

marc21-concat(1)

NAME

marc21-concat — Concatenate records from multiple inputs

SYNOPSIS

marc21 concat [options] [path]…
marc21 cat [options] [path]…

DESCRIPTION

The concat command is used to combine records from multiple files into a single file or output (stdout).

OPTIONS

FILTER OPTIONS

-s, --skip-invalid
Skip invalid records that can’t be decoded
--strsim-threshold
The minimum score for string similarity comparisons (0 <= score <= 100)
--where
An expression for filtering records

COMMON OPTIONS

-p, --progress
If set, show a progress bar
-c, --compression
Specify compression level (0..=9)

EXIT STATUS

  • 0 — Command succeeded.
  • 1 — Command failed.

EXAMPLES

In the following example, the five files dnb_all_dnbmarc.1.mrc.gz to dnb_all_dnbmarc.5.mrc.gz are concatenated into a single file DNB.mrc.gz. Invalid data records are skipped (option -s):

$ marc21 concat -s dnb_all_dnbmarc.*.mrc.gz -o DNB.mrc.gz

marc21-count(1)

NAME

marc21-count — Print the number of records in the indput data.

SYNOPSIS

marc21 count [options] [path]…
marc21 cnt [options] [path]…

DESCRIPTION

The count is used to determine the number of records contained in the input.

OPTIONS

FILTER OPTIONS

-s, --skip-invalid
Skip invalid records that can’t be decoded
--strsim-threshold
The minimum score for string similarity comparisons (0 <= score <= 100)
--where
An expression for filtering records

COMMON OPTIONS

-p, --progress
If set, show a progress bar
-c, --compression
Specify compression level (0..=9)

EXIT STATUS

  • 0 — Command succeeded.
  • 1 — Command failed.

EXAMPLES

tba

marc21-filter(1)

NAME

marc21-filter — Filter records that fulfill a specified condition

SYNOPSIS

marc21 filter [options] [path]…

DESCRIPTION

tba

OPTIONS

COMMON OPTIONS

-p, --progress
If set, show a progress bar
-c, --compression
Specify compression level (0..=9)

EXIT STATUS

  • 0 — Command succeeded.
  • 1 — Command failed.

EXAMPLES

tba

marc21-hash(1)

NAME

marc21-hash — Compute SHA-256 checksum of records

SYNOPSIS

marc21 hash [options] [path]…

DESCRIPTION

tba

OPTIONS

FILTER OPTIONS

-s, --skip-invalid
Skip invalid records that can’t be decoded
--strsim-threshold
The minimum score for string similarity comparisons (0 <= score <= 100)
--where
An expression for filtering records

COMMON OPTIONS

-p, --progress
If set, show a progress bar
-c, --compression
Specify compression level (0..=9)

EXIT STATUS

  • 0 — Command succeeded.
  • 1 — Command failed.

EXAMPLES

tba

marc21-invalid(1)

NAME

marc21-invalid — Output invalid records that cannot be decoded

SYNOPSIS

marc21 invalid [options] [path]…

DESCRIPTION

tba

OPTIONS

COMMON OPTIONS

-p, --progress
If set, show a progress bar
-c, --compression
Specify compression level (0..=9)

EXIT STATUS

  • 0 — Command succeeded.
  • 1 — Command failed.

EXAMPLES

tba

marc21-print(1)

NAME

marc21-print — Print records in human readable format

SYNOPSIS

marc21 print [options] [path]…

DESCRIPTION

tba

OPTIONS

FILTER OPTIONS

-s, --skip-invalid
Skip invalid records that can’t be decoded
--strsim-threshold
The minimum score for string similarity comparisons (0 <= score <= 100)
--where
An expression for filtering records

COMMON OPTIONS

-p, --progress
If set, show a progress bar
-c, --compression
Specify compression level (0..=9)

EXIT STATUS

  • 0 — Command succeeded.
  • 1 — Command failed.

EXAMPLES

tba

marc21-sample(1)

NAME

marc21-sample — Select a random permutation of records

SYNOPSIS

marc21 sample [options] [path]…

DESCRIPTION

tba

OPTIONS

FILTER OPTIONS

-s, --skip-invalid
Skip invalid records that can’t be decoded
--strsim-threshold
The minimum score for string similarity comparisons (0 <= score <= 100)
--where
An expression for filtering records

COMMON OPTIONS

-p, --progress
If set, show a progress bar
-c, --compression
Specify compression level (0..=9)

EXIT STATUS

  • 0 — Command succeeded.
  • 1 — Command failed.

EXAMPLES

tba

marc21-split(1)

NAME

marc21-split — Split the input into chunks of a given size

SYNOPSIS

marc21 split [options] [path]…

DESCRIPTION

tba

OPTIONS

FILTER OPTIONS

-s, --skip-invalid
Skip invalid records that can’t be decoded
--strsim-threshold
The minimum score for string similarity comparisons (0 <= score <= 100)
--where
An expression for filtering records

COMMON OPTIONS

-p, --progress
If set, show a progress bar
-c, --compression
Specify compression level (0..=9)

EXIT STATUS

  • 0 — Command succeeded.
  • 1 — Command failed.

EXAMPLES

tba