Uppsala University * Dept. of Linguistics and Philology * Computational Linguistics

Methods and Tools for Automatic Grammar Extraction

Financed by the Swedish Research Council




Project description

Almost all applications in language technology need a grammar in some form, but computational grammars are lacking for many application areas, not least for Swedish. The main goal of the project is to develop tools that, given a partial grammar and a corpus, use machine learning to automatically create a grammar for the text type that the corpus represents.

Training data for the machine learning algorithms is created by linguistically annotating part of a large Swedish corpus. The annotation is performed incrementally and mainly automatically. After the corpus annotation, we use a combination of deductive and inductive learning to induce specific grammars.

Traditionally, syntactic analysis has been carried out with either grammar-based or data-driven methods. In recent years, effort has gone into combining these methods, since their synergy may yield higher quality. The project will contribute to this development.

To evaluate the method, the induced grammars are tested in machine translation, where the result is compared with that of a hand-written grammar. Another result is a large Swedish corpus annotated with linguistic information: a treebank that can serve as training data in other applications and as a language resource for language technology and linguistic research.


Participants

Anna Sågvall Hein
Joakim Nivre
Beáta Megyesi
Bengt Dahlqvist
Mats Dahllöf
Eva Forsbom
Sofia Gustafson-Capková
Marco Kuhlmann
Mattias Nilsson
Eva Pettersson
Markus Saers
Filip Salomonsson
Per Starbäck

Data

The corpora we are working with are listed below.

Publications/Presentations


Dahlqvist, B. and Megyesi, B. 2007. Changing the tokenization in Talbanken to SUC2.0. Department of Linguistics and Philology, Uppsala University.

Megyesi, B. B. and Dahlqvist, B. 2006. Corpus format (in Swedish), December 6, 2006. Uppsala University.

Megyesi, B. B. 2006. Corpus collection (in Swedish), November 15, 2006. Uppsala University.

Nivre, J. 2006. Project presentation (in English), September 26, 2006. Uppsala University.

Events

CoNLL Shared Task 2007, Chair: Joakim Nivre. Shared task on dependency parsing at the Conference on Computational Natural Language Learning 2007.


Department of Linguistics and Philology, Uppsala University, Sweden
Visiting address: Engelska parken, Humanistiskt centrum, Thunbergsvägen 3
Postal address: Department of Linguistics and Philology, Box 635, SE-751 26 Uppsala, Sweden.
E-mail: info@lingfil.uu.se
Tel: +46 (0)18 471 22 52
Fax: +46 (0)18 471 10 94