Tokenizer for German

I have implemented a tokenizer for German in Perl, which can be used by anybody who is interested. It optionally provides a rather detailed analysis of the tokens (and whitespace) in the input text. Please read the license terms before you download the software. By downloading the software you agree to the terms stated there.

Any feedback is heartily welcome.

Download

Usage

$ perl tokenize.perl [OPTIONS] <fileIn.text> <fileOut.tok>

If you call the script without any argument, you will get an overview of all OPTIONS (also documented below).
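
For example, a call that uses an abbreviation list and produces typed XML output could look like this (the option flags are documented below; the file arguments are placeholders, as above):

$ perl tokenize.perl -xml -type -abbrev abbrev.lex <fileIn.text> <fileOut.tok>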

Options and Features

The tokenizer reads in plain text (and optionally a list of abbreviations) and produces a tokenized version. The following sections give more details:

  • Output formats

    • Text format:
      • Tokens are separated by spaces
      • 1 sentence per line
      • Multiple empty lines in the input file are interpreted as paragraph boundaries and are represented by a single empty line in the output file
    • XML format:
      • Uses XML tags <tok>, <sent_bound/>, <newline/> (and, optionally, <space>)
  • Options

    1. Optional use of a list of abbreviations: [-a|-abbrev <abbrev>]
      • <abbrev> specifies a file with a list of abbreviations
      • List format: one abbreviation per line (like “etc.”)
      • Abbreviations can consist of regular expressions: “/regex/” (e.g. “/str./”); a sample abbreviation file is sketched after this list
    2. Optional XML output: [-x|-xml]
    3. The XML output optionally records all white space: [-s|-space]
      • Simple linebreaks are ignored
      • Multiple empty lines are squeezed
      • Leading and trailing empty lines are deleted
    4. The output optionally records the “types” of words (and spaces, with XML output): [-t|-type]
      • for words:
        1. unmarked default: [a-zA-Z]+
        2. “alphanum”: if word contains digits (among other characters)
        3. “mixed”: if word contains characters like brackets, quotes, …
        4. “allCap”: if word consists of capital letters only
      • for numbers:
        1. “card”: cardinals
        2. “ord”: ordinals
        3. “year” (see below)
      • for abbreviations: “abbrev”, with an additional “source” attribute:
        1. “listed”, i.e. the full abbreviation is listed in the file <abbrev>
        2. “regEx”, i.e. a matching regex is listed in the file <abbrev>
        3. “nextWordLC”, i.e. the next word is lower case
      • for special characters:
        1. “specialChar_lead”: special chars preceding a word, like “(”
        2. “specialChar_trail”: special chars following a word, like “)”
        3. “punc”: punctuation marks
      • for whitespace:
        1. unmarked default: single space
        2. “tab”: tabulator
        3. “carrRet”: carriage return
        4. “unknown”: anything else

      NOTE: multiple types are possible (e.g. type='space,tab')

    5. Variants of “year recognizers”: [-y|-yearRobust]. This triggers a simplified version of date tagging:
      • Year expression candidates: four-digit numbers of the form (1|2)[0-9][0-9][0-9], i.e. covering the years 1000–2999
      • The default recognizer carefully checks the preceding context of the number (for expressions like ‘Januar’ or ‘Winter’ or ‘Jahr’) and will therefore miss year expressions such as the one in “1999 regnete es oft.”
      • The “robust” recognizer ignores the context, i.e. any four-digit number starting with 1 or 2 will be interpreted as a year expression. It therefore incorrectly analyses the cardinal in “Es gibt 1999 Optionen.” as a year expression.
      • NOTE: the default year recognizer does not work if option -s is chosen!

    NOTE: The script contains a hard-wired list of German date expressions (if you want to change them, you will have to edit the value of the variable “$yearRegex” in the Perl script). A rough sketch of such a context check is given after this list.
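
For illustration of option 1, an abbreviation list (called abbrev.lex in the example section below) could look as follows. The entries “etc.”, “u.a.” and the regex entry “/str./” are taken from this page; the remaining lines are merely plausible additions and not part of the distributed software:

    etc.
    u.a.
    z.B.
    d.h.
    /str./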
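
The following Perl fragment is only a rough sketch of the idea behind the two year recognizers described under option 5; it is not the code of tokenize.perl, and the variable names and the (deliberately short) keyword list are assumptions. Only the candidate pattern (1|2)[0-9][0-9][0-9] is taken from the description above:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch only, not the actual code of tokenize.perl:
    # names and keyword list are assumed for illustration.
    my $contextRegex   = qr/(?:Jahr\w*|Januar|Winter)\s*$/;   # assumed context keywords
    my $candidateRegex = qr/^[12][0-9]{3}$/;                  # candidates: years 1000-2999

    sub looks_like_year {
        my ($left_context, $token, $robust) = @_;
        return 0 unless $token =~ $candidateRegex;       # must be a four-digit candidate
        return 1 if $robust;                             # robust mode: accept any candidate
        return $left_context =~ $contextRegex ? 1 : 0;   # default: require a keyword before
    }

    print looks_like_year("In den Jahren ", "1999", 0), "\n";  # 1: keyword "Jahren" precedes
    print looks_like_year("Es gibt ",       "1999", 0), "\n";  # 0: no keyword, default mode
    print looks_like_year("Es gibt ",       "1999", 1), "\n";  # 1: robust mode (wrongly a year)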

Example

  • Sample input text:

    Das hier ist tatsächlich ein Mini-Testtext. Er testet u.a. Abkürzungen wie “Hauptstr. 3” und mit dem Ausdruck 4.9.2008 (oder auch 4. 9. 2008) einen Datumsausdruck. In den Jahren 1999 und 2000 hat es 1999 Liter geregnet.

  • Text output without any options (misses the abbreviations “u.a.” and “Hauptstr.”):

    Das hier ist tatsächlich ein Mini-Testtext .
    Er testet u.a .
    Abkürzungen wie “ Hauptstr .
    3 “ und mit dem Ausdruck 4.9.2008 ( oder auch 4. 9. 2008 ) einen Datumsausdruck .
    In den Jahren 1999 und 2000 hat es 1999 Liter geregnet .

  • Text output with option -abbrev abbrev.lex:

    Das hier ist tatsächlich ein Mini-Testtext .
    Er testet u.a. Abkürzungen wie “ Hauptstr. 3 “ und mit dem Ausdruck 4.9.2008 ( oder auch 4. 9. 2008 ) einen Datumsausdruck .
    In den Jahren 1999 und 2000 hat es 1999 Liter geregnet .

  • XML output with options -xml -type -abbrev abbrev.lex:

    <?xml version="1.0" encoding="utf-8"?>
    <text>
    <tok>Das</tok>
    <tok>hier</tok>
    <tok>ist</tok>
    <tok>tatsächlich</tok>
    <tok>ein</tok>
    <tok>Mini-Testtext</tok>
    <tok type='punc'>.</tok>
    <sent_bound/>
    <tok>Er</tok>
    <tok>testet</tok>
    <tok type='abbrev' source='listed'>u.a.</tok>
    <tok>Abkürzungen</tok>
    <tok>wie</tok>
    <tok type='specialChar_lead'>"</tok>
    <tok type='abbrev' source='regEx'>Hauptstr.</tok>
    <tok type='card'>3</tok>
    <tok type='specialChar_trail'>"</tok>
    <tok>und</tok>
    <tok>mit</tok>
    <tok>dem</tok>
    <tok>Ausdruck</tok>
    <tok type='alphanum,mixed'>4.9.2008</tok>
    <tok type='specialChar_lead'>(</tok>
    <tok>oder</tok>
    <tok>auch</tok>
    <tok type='ord'>4.</tok>
    <tok type='ord'>9.</tok>
    <tok type='year'>2008</tok>
    <tok type='specialChar_trail'>)</tok>
    <tok>einen</tok>
    <tok>Datumsausdruck</tok>
    <tok type='punc'>.</tok>
    <sent_bound/>
    <tok>In</tok>
    <tok>den</tok>
    <tok>Jahren</tok>
    <tok type='year'>1999</tok>
    <tok>und</tok>
    <tok type='card'>2000</tok>
    <tok>hat</tok>
    <tok>es</tok>
    <tok type='card'>1999</tok>
    <tok>Liter</tok>
    <tok>geregnet</tok>
    <tok type='punc'>.</tok>
    <sent_bound/>
    </text>
    

    Note that the year analyzer only recognizes the first of the two year expressions in “In den Jahren 1999 und 2000” because it checks the preceding context for selected keywords such as “Jahr”. With the option -yearRobust, both would be marked as year expressions.