Universal Dependency Parsing

Natural language processing is ubiquitous in our society today. Every time we use a search engine, a translation service, or a voice command system, we rely on the capacity of computers to interpret natural language. However, today’s technology works well only for a restricted set of languages, in particular English, and usually does not generalize well to languages with different structural characteristics. Thus, while the most accurate systems today can correctly analyze well over 90% of the grammatical structures occurring in English text, the error rate for other languages can be as high as 35–40%, which means that over one third of the information encoded in the text can be lost. Since grammatical analysis, or parsing, is only the first step in end-to-end applications like information extraction or question answering, errors introduced at this stage will lead to even higher error rates at later stages, making natural language processing practically unusable for many languages. The lack of sufficiently accurate natural language processing affects a wide range of languages, including some of the most widely spoken languages in the world, such as Chinese, Arabic, Hindi and Russian. Addressing this problem is therefore not only of great scientific interest but also of great societal and democratic value.

The purpose of this project is to study parsing models for typologically diverse languages in order to find out what techniques work well across languages and what aspects require language-specific adaptation. The central hypothesis is that parsing models need a better abstraction over concrete realization patterns, such as morphological inflection, function words and word order, in a way that is informed by linguistic typology. To test this hypothesis, we will extend existing dependency-based parsing models to better cope with typological diversity and adapt them to the representations of Universal Dependencies (UD), a system for cross-linguistically consistent grammatical analysis so far applied to over 40 languages. The more specific aims of the project can be stated as follows:

The project is funded by the Swedish Research Council (Vetenskapsrådet 2016-01817).