Methods and Tools for Automatic Grammar Extraction
Financed by the Swedish Research Council
Project description
Almost all applications in language technology need a grammar in some
form, but computational grammars are lacking for many application
areas, not least for Swedish. The main goal of the project is to
develop tools that, given a partial grammar and a corpus, use machine
learning to automatically create a grammar for the text type that the
corpus represents.
Training data for the machine learning algorithms is created by linguistically annotating part of a large Swedish corpus. The annotation is performed incrementally and mainly automatically. After the corpus annotation, we use a combination of deductive and inductive learning to induce specific grammars.
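As a rough illustration of the inductive step only, the sketch below reads context-free rules off a tiny bracketed treebank and estimates rule probabilities by relative frequency. This is a common baseline for treebank grammar extraction, not the project's actual method, which also involves a partial grammar and deductive learning. The function names, the bracketed tree format, and the two toy Swedish trees are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy treebank in bracketed notation (invented for illustration).
TOY_TREEBANK = [
    "(S (NP (PN hon)) (VP (VB sover)))",
    "(S (NP (NN hunden)) (VP (VB sover)))",
]


def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # skip "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                      # skip ")"
            return (label, children)
        word = tokens[pos]                # a leaf: the word itself
        pos += 1
        return word

    return read()


def productions(tree):
    """Yield one (lhs, rhs) pair for every internal node of the tree."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield (label, rhs)
    for child in children:
        if isinstance(child, tuple):
            yield from productions(child)


def extract_grammar(treebank):
    """Map each observed rule to its relative frequency among rules with the same left-hand side."""
    counts = Counter(p for t in treebank for p in productions(parse_tree(t)))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


if __name__ == "__main__":
    for (lhs, rhs), prob in sorted(extract_grammar(TOY_TREEBANK).items()):
        print(f"{lhs} -> {' '.join(rhs)}  [{prob:.2f}]")
```

A grammar read off in this way covers only the constructions seen in the annotated data, which is one reason for combining it with a hand-written partial grammar.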
Traditionally, syntactic analysis has been realized by either grammar-based or data-driven methods. In recent years, considerable effort has gone into combining these methods, since their synergy may yield higher quality. The project will contribute to this development.
To evaluate the method, the induced grammars are tested in machine translation, where the results are compared with those obtained using a hand-written grammar. A further outcome is a large Swedish corpus annotated with linguistic information: a treebank that can serve as training data in other applications and as a resource for language technology and linguistic research.
Participants
Anna Sågvall Hein
Joakim Nivre
Beáta Megyesi
Bengt Dahlqvist
Mats Dahllöf
Eva Forsbom
Sofia Gustafson-Capková
Marco Kuhlmann
Mattias Nilsson
Eva Pettersson
Markus Saers
Filip Salomonsson
Per Starbäck
Data
The corpora we are working with are
listed below.
Publications/Presentations
Dahlqvist, B. and Megyesi, B. 2007. Changing the tokenization in Talbanken to SUC2.0. Department of Linguistics and Philology, Uppsala University.
Megyesi, B. B. and Dahlqvist, B. 2006. Corpus format (in Swedish). December 6, 2006. Uppsala University.
Megyesi, B. B. 2006. Corpus collection (in Swedish). November 15, 2006. Uppsala University.
Nivre, J. 2006. Project presentation (in English). September 26, 2006. Uppsala University.
Events
CoNLL Shared Task 2007, Chair: Joakim Nivre. Shared task on dependency parsing at the Conference on Computational Natural Language Learning 2007.
Department of Linguistics and Philology, Uppsala University, Sweden
Visiting address: Engelska parken, Humanistiskt centrum, Thunbergsvägen 3
Postal address: Department of Linguistics and Philology, Box 635, SE-751 26 Uppsala, Sweden
E-mail: info@lingfil.uu.se
Tel: +46 (0)18 471 22 52
Fax: +46 (0)18 471 10 94