Overview
In the context of both the REA and REM projects, a part-of-speech tagset for the annotation of historical German, HiTS, was developed. The tagset is guided by the Stuttgart-Tübingen-Tagset (STTS) for Modern Standard German in that it adopts a number of its tags and its hierarchical tag design. In addition, it contains several new POS tags as well as more fine-grained distinctions for several major word classes.
Main Characteristics
Main POS tags
HiTS distinguishes between the following main POS tags, which are themselves divided into several sub-tags:
- ADJ – adjectives
- AP – adpositions
- AV – adverbs
- CARD – cardinal numbers
- D – determiners
- P – pronouns
- PAV – pronominal adverbs
- PTK – particles
- V – verbs
- ITJ, FM – miscellaneous
- $ – punctuation marks
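The hierarchical design can be illustrated with a small lookup. This is only a sketch: it assumes that a sub-tag is spelled as its main tag followed by further letters, which is what the examples on this page suggest (AVD, VVINF, DDART) but which is not stated as a rule here. The names `MAIN_TAGS` and `main_tag` are hypothetical and are not part of HiTS.

```python
# Main POS tags as listed above; descriptions are copied from the list.
MAIN_TAGS = {
    "ADJ": "adjectives",
    "AP": "adpositions",
    "AV": "adverbs",
    "CARD": "cardinal numbers",
    "D": "determiners",
    "P": "pronouns",
    "PAV": "pronominal adverbs",
    "PTK": "particles",
    "V": "verbs",
    "ITJ": "miscellaneous",
    "FM": "miscellaneous",
    "$": "punctuation marks",
}


def main_tag(tag):
    """Return the longest listed main tag that is a prefix of `tag`, or None.

    Assumption (not stated in the source): sub-tags extend their main tag.
    The longest match is needed because P is itself a prefix of PAV and PTK.
    """
    candidates = [m for m in MAIN_TAGS if tag.startswith(m)]
    return max(candidates, key=len) if candidates else None


for tag in ["AVD", "VVINF", "DDART", "PAV"]:
    print(tag, "->", main_tag(tag))
```

A tag whose main class does not appear in the list above, such as NA from example (b), yields `None` in this sketch.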
Token-specific vs. lemma-specific annotation
A special feature of HiTS is the distinction between a token-specific annotation and a lemma-specific annotation. In the former, a token is tagged according to its actual use, while in the latter the lemma corresponding to that token is annotated; that is, every token is annotated with two tags. As a result, it is possible, for example, to distinguish between an adjective and its actual adverbial usage (as in ex. (a)) or between a verb and its actual use as a noun (as in ex. (b)). This double annotation makes it possible to monitor language change that results from a change in a word's category.
(a) lebet rehto/ADJ > AVD ‘live fair’
(b) mit suften/VVINF > NA und mit weinen/VVINF > NA ‘with sighing and crying’
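The double annotation can be pictured as a record that carries both tags. The sketch below parses the `form/LEMMA-TAG > TOKEN-TAG` notation used in examples (a) and (b); that notation is taken from this page only and is not necessarily the storage format of the corpora. The names `Token`, `parse` and `category_shift` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str
    lemma_tag: str  # lemma-specific annotation (word class of the lemma)
    token_tag: str  # token-specific annotation (actual use in context)

    @property
    def category_shift(self):
        """True if the token is used in a different category than its lemma."""
        return self.lemma_tag != self.token_tag


def parse(annotated):
    """Parse one 'form/LEMMA-TAG > TOKEN-TAG' item as written in the examples.

    Raises ValueError if either separator is missing.
    """
    form, tags = annotated.split("/", 1)
    lemma_tag, token_tag = (part.strip() for part in tags.split(">", 1))
    return Token(form.strip(), lemma_tag, token_tag)


examples = [parse("rehto/ADJ > AVD"), parse("suften/VVINF > NA")]
shifted = [t.form for t in examples if t.category_shift]
print(shifted)
```

Filtering on `category_shift` is one simple way to collect candidates for the kind of category change the double annotation is meant to expose.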
Determiners and Pronouns
Unlike STTS, HiTS distinguishes between determiners (D) and pronouns (P) at the basic word class level. Furthermore, new tags were introduced, for example for postposed pronouns (ex. (c)). The definite and indefinite articles, which share one independent tag (ART) in STTS, are now part of the larger determiner class. In sum, there are four different article tags, depending on how the article is used (see ex. (d) and (e)): definite or indefinite, and article-like (MHG corpus) or attributive-like (OHG corpus).
(c) under disen chunigen allen/DI > DIN ‘under those kings’ (D = determiner, I = indefinite, N = postposed)
(d) der/DD > DDA, DDART liehte tac ‘the bright day’ (D = determiner, D = demonstrative, A = attributive / ART = article-like)
(e) ein/DI > DIA, DIART tîer ‘an animal’ (D = determiner, I = indefinite, A = attributive / ART = article-like)
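The glosses in examples (c) to (e) read each determiner tag as a sequence of components. A minimal sketch of that reading follows; it covers only the tags shown in these examples (DD, DI, DDA, DDART, DIA, DIART, DIN) and makes no claim about the rest of the tagset. The names `DET_TAG`, `KIND`, `USE` and `decompose` are hypothetical.

```python
import re

# D + kind (D or I) + optional usage suffix, following the glosses in (c)-(e).
DET_TAG = re.compile(r"^D(?P<kind>[DI])(?P<use>ART|A|N)?$")

KIND = {"D": "demonstrative", "I": "indefinite"}
USE = {"A": "attributive", "ART": "article-like", "N": "postposed"}


def decompose(tag):
    """Split a determiner tag into (class, kind, use); use is None if absent."""
    match = DET_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a determiner tag covered by this sketch: {tag!r}")
    return ("determiner", KIND[match["kind"]], USE.get(match["use"]))


for tag in ["DI", "DIN", "DDA", "DDART", "DIART"]:
    print(tag, decompose(tag))
```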
Publications
- Stefanie Dipper, Karin Donhauser, Thomas Klein, Sonja Linde, Stefan Müller, Klaus-Peter Wegera (2013). HiTS: ein Tagset für historische Sprachstufen des Deutschen. In: Journal for Language Technology and Computational Linguistics, Special Issue, 28(1), pp. 85–137.