This is the news blog of the Sprachwissenschaftliches Institut (Department of Linguistics) at Ruhr-Universität Bochum.









Talk by Bryan Jurish on Tuesday, 28 May 2013, 4:00 p.m.

Saturday, 18 May 2013. From the category 'Vortragsreihe' (lecture series). The Sprachwissenschaftliches Institut invites you to a talk by Bryan Jurish (BBAW): "Canonicalizing the Deutsches Textarchiv"
Conventional natural-language processing techniques cannot adequately handle historical input text. On the one hand, conventional tools rely on a fixed, application-specific lexicon keyed by contemporary orthographic surface form; on the other, historical text lacks consistent orthographic conventions.

Spelling variation can be treated as an error-correction problem or "canonicalization" task: an attempt to automatically assign each (historical) input word a unique extant canonical cognate, thus allowing direct application-specific processing (tagging, parsing, etc.) of the returned canonical forms without need for additional application-specific modifications. This talk provides an overview of the canonicalization techniques currently employed by the Deutsches Textarchiv (www.deutschestextarchiv.de) to prepare a corpus of historical German text for part-of-speech tagging, lemmatization, and robust information retrieval.
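
To make the idea of canonicalization concrete, here is a minimal toy sketch in Python. It is not the method used by the Deutsches Textarchiv, which the abstract does not specify. It only illustrates the general task of mapping a historical spelling to a contemporary form found in a lexicon. The lexicon `MODERN_LEXICON`, the rule list `REWRITE_RULES`, and the function names are all hypothetical and chosen for illustration.

```python
from typing import Iterator

# Hypothetical toy data (assumptions, not taken from the talk):
# a tiny contemporary lexicon and a few character-level rewrite rules.
MODERN_LEXICON = {"teil", "sein", "frei", "tun", "dass"}
REWRITE_RULES = [("ſ", "s"), ("th", "t"), ("ey", "ei"), ("ß", "ss")]


def candidates(word: str) -> Iterator[str]:
    """Yield the lowercased word, then the form after each rule is applied cumulatively."""
    form = word.lower()
    yield form
    for old, new in REWRITE_RULES:
        form = form.replace(old, new)
        yield form


def canonicalize(word: str) -> str:
    """Return the first candidate found in the lexicon, else the lowercased input."""
    for form in candidates(word):
        if form in MODERN_LEXICON:
            return form
    # No rule produced a known contemporary form: leave the word as it is.
    return word.lower()


if __name__ == "__main__":
    for token in ["Theil", "seyn", "frey", "thun", "Haus"]:
        print(token, "->", canonicalize(token))
```

A downstream tagger or lemmatizer could then be run on the returned forms unchanged, which is the point made in the abstract. A realistic system would need far richer rules, would have to rank competing candidates, and would have to preserve capitalization, all of which this sketch ignores.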

The talk will take place in room 3/159.