Current Interests

I am generally interested in theoretical and computational linguistics, and in the more practical question of how methods from computational linguistics can help theoretical linguists, linguistic fieldworkers, and speakers of endangered languages. I also try to contribute to preserving, or at least documenting, endangered languages. For that reason, I have become a member of the Foundation for Endangered Languages (Gesellschaft für bedrohte Sprachen e.V.), and I currently maintain the foundation's website.

Dissertation Project

My dissertation project is supervised by Prof. Dr. Tibor Kiss at the University of Bochum. In my dissertation, I will examine relative clause extraposition in German from several angles, both theoretical and computational. Based on a large corpus of naturalistic data, I want to build a statistical model that predicts with high accuracy whether a specific relative clause will be extraposed or not. Another objective of my project is to discover factors that help to disambiguate the attachment of (extraposed) relative clauses. Based on these empirical studies, I plan to implement an efficient parsing system that identifies the correct antecedent of extraposed relative clauses with high accuracy.

Here is a word cloud that visualizes the topic of my thesis quite well.

Dissertation word cloud

Preposition-Noun Constructions (Project by Tibor Kiss)

I am currently working as a researcher in a project on so-called preposition-noun constructions (PNCs) led by Tibor Kiss. In this project, we investigate the properties of these constructions, which consist of a preposition plus a nominal complement headed by a singular count noun that is not accompanied by a determiner, although a determiner is normally obligatory for singular count nouns. We are building a very large corpus, annotated partly by hand and partly automatically, in order to compare PNCs to ordinary PPs and to shed light on the question of why and under what conditions PNCs are possible. My role in this project consists mostly of computational linguistic support for building, maintaining, and evaluating the corpus.

Other Interests in Theoretical Linguistics

I am interested in nominal coordination and its interaction with attributive and possessive constructions in several Germanic languages and in Kurdish. What is most interesting about these constructions is the frequent occurrence of bracketing paradoxes and mismatches between morphology, syntax, and semantics. Among the constructions I have worked on are so-called linker constructions (e.g. the ezafe construction in Kurdish), which involve a linking element between a modified noun and its modifier.

Because of my background in Germanic Linguistics and Scandinavian Studies, I am interested in the comparative syntax and morphology of the Germanic languages. Two Germanic varieties that I am especially interested in are Low Saxon (a.k.a. Low German, Plattdeutsch, or Nedersachsisch) and the vernacular of the Ruhrgebiet, which is a German regional dialect with a Low Saxon substrate. For this reason, I also contribute to the Bochum research workshop on Ruhr German.

Another topic that interests me very much is the borderline between syntax and morphology, especially clitics and so-called weak pronominals.

I have always believed that linguists should underpin their theories with data from large corpora rather than just their own grammaticality judgements or those of one or two informants. Moreover, I think that the actual grammars in the minds of human speakers and hearers are non-categorical to some extent and involve stochastic learning in some form or other, even after the main phase of language acquisition during childhood. I am thus not averse to more quantitative approaches to grammatical theory. Recently, I have become interested in syntactic alternations and the factors that condition them. For my MA thesis, which I wrote at Stanford University under the supervision of Joan Bresnan, I studied the choice between different possessive constructions in Modern Low Saxon.

Last but not least, following a field linguistics course at the University of Bochum taught by Nikolaus Himmelmann, I have developed an interest in the Kurdish language and its various dialects.

Other Interests in Computational Linguistics

At the University of Bochum, I have done extensive research on the problems of tokenization, automatic abbreviation detection, and sentence boundary disambiguation using statistical algorithms. Together with Tibor Kiss, I have published an article about these issues in Computational Linguistics. Our approach has been integrated into the tokenization module of the Natural Language Toolkit (NLTK).
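
To give a rough idea of what automatic abbreviation detection involves, here is a small Python sketch. It is a deliberately crude frequency heuristic and not the method from the article: it only assumes that a word type which always occurs with a final period, and never without one, is more likely to be an abbreviation than an ordinary sentence-final word. The function name and the threshold min_count are made up for this illustration.

```python
from collections import Counter


def abbreviation_candidates(text, min_count=2):
    """Return lowercased types that occur only with a trailing period.

    Crude illustration: a type seen at least `min_count` times with a
    final period and never without one is treated as a candidate.
    """
    with_period = Counter()
    without_period = Counter()
    for token in text.split():
        # Remove surrounding punctuation other than the period itself.
        token = token.strip("\"'()[],;:")
        if not token:
            continue
        if token.endswith(".") and len(token) > 1:
            with_period[token[:-1].lower()] += 1
        else:
            without_period[token.lower()] += 1
    return {
        t for t, n in with_period.items()
        if n >= min_count and without_period[t] == 0
    }


sample = "Dr. Smith met Dr. Jones. Smith left. Jones stayed with Smith."
print(abbreviation_candidates(sample))
```

In the sample, "Jones." and "Smith." also end in a period, but both words occur without a period elsewhere, so only "dr" remains as a candidate. A real system needs far more evidence than this, for instance to handle abbreviations that also happen to end a sentence.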

Another topic that I worked on in a project during my stay at Stanford University, and on which I plan to do more work in the future, is information retrieval for non-standardized languages, i.e. languages that lack a fixed orthography. One of the main problems is to find accurate and fast algorithms for fuzzy matching of query terms to index terms. This involves developing linguistically plausible string-similarity measures and graphemic parsers that split words into graphemes.
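
The following Python sketch illustrates the general idea of combining a graphemic parser with a string-similarity measure. It is only a toy under stated assumptions, not the system from the project: the grapheme inventory MULTI_GRAPHEMES is invented for the example, the measure is plain Levenshtein distance computed over graphemes instead of single characters, and all function names are hypothetical.

```python
# Hypothetical inventory of multi-character graphemes (an assumption
# made for this example, not a description of any real orthography).
MULTI_GRAPHEMES = ("sch", "ch", "ck", "aa", "ee", "oo")


def split_graphemes(word, inventory=MULTI_GRAPHEMES):
    """Greedy longest-match split of a word into graphemes."""
    word = word.lower()
    ordered = sorted(inventory, key=len, reverse=True)
    graphemes = []
    i = 0
    while i < len(word):
        for g in ordered:
            if word.startswith(g, i):
                graphemes.append(g)
                i += len(g)
                break
        else:
            # No multi-character grapheme matches here: take one character.
            graphemes.append(word[i])
            i += 1
    return graphemes


def grapheme_distance(a, b):
    """Levenshtein distance computed over graphemes, not characters."""
    x, y = split_graphemes(a), split_graphemes(b)
    prev = list(range(len(y) + 1))
    for i, gx in enumerate(x, 1):
        cur = [i]
        for j, gy in enumerate(y, 1):
            cost = 0 if gx == gy else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def best_matches(query, index_terms, max_distance=1):
    """Return index terms within max_distance of the query, closest first."""
    scored = sorted((grapheme_distance(query, t), t) for t in index_terms)
    return [t for d, t in scored if d <= max_distance]


print(split_graphemes("schaap"))
print(best_matches("schaap", ["schap", "schip", "huus"]))
```

Working on graphemes means that a spelling difference such as "oo" versus "aa" counts as one substitution instead of two. A linguistically plausible measure would go further and assign lower costs to substitutions between graphemes that are commonly interchanged across spellings; the uniform cost of 1 is used here only to keep the sketch short.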

I am also interested in automatic grammar induction and statistical learning algorithms in general, although I haven't really done much work in these areas myself.

During my first quarter at Stanford, I also worked a little on tools for obtaining linguistic data from the internet. A small toolkit which uses Perl and the Google API can be downloaded from here.

Last but not least, I am interested in finding ways to integrate computational linguistic methods and tools into theoretical linguistic research and efforts to preserve endangered languages.




Jan Strunk's Homepage