Methods and Tools for Automatic Grammar Extraction
Financed by the Swedish Research Council
Almost all applications in language technology need a grammar in some
form, but computational grammars are lacking for many application
areas, not least for Swedish. The main goal of the project is to
develop tools that use machine learning to automatically create a
grammar for the text type a given corpus represents, starting from a
partial grammar and the corpus itself.
Training data for the machine learning algorithms is created by linguistically annotating part of a large Swedish corpus. The annotation is performed incrementally and mainly automatically. Once the corpus is annotated, we use a combination of deductive and inductive learning to induce specific grammars.
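As a rough illustration of what inducing a grammar from annotated data can look like, the sketch below reads context-free rules off a toy treebank and estimates rule probabilities by relative frequency. This is a generic textbook technique and not the project's actual method, which combines deductive and inductive learning. The tree encoding, the category labels, and the names `TREEBANK`, `extract_rules`, and `induce_pcfg` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy treebank (not project data).
# Each tree is (label, child, child, ...); a leaf is a plain string.
TREEBANK = [
    ("S", ("NP", ("PN", "hon")), ("VP", ("V", "sover"))),
    ("S", ("NP", ("PN", "han")),
          ("VP", ("V", "läser"), ("NP", ("N", "boken")))),
]


def extract_rules(tree):
    """Yield one (lhs, rhs) pair for every internal node of a tree."""
    label, *children = tree
    # The right-hand side is the word itself for leaves,
    # and the category label for subtrees.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from extract_rules(child)


def induce_pcfg(treebank):
    """Estimate P(rhs | lhs) by relative frequency over the treebank."""
    counts = Counter(rule for tree in treebank for rule in extract_rules(tree))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


grammar = induce_pcfg(TREEBANK)
for (lhs, rhs), p in sorted(grammar.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

A grammar read off in this way only covers constructions seen in the annotated corpus, which is one reason to start from a partial hand-written grammar and extend it with learned rules.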
Traditionally, syntactic analysis has been realized by either grammar-based or data-driven methods. In recent years, effort has gone into combining these methods, since the synergy between them may yield higher quality. The project will contribute to this development.
To evaluate the method, the induced grammars are tested in machine translation, where the result is compared to that of a hand-written grammar. Another result is a large Swedish corpus annotated with linguistic information: a treebank that can serve as training data in other applications and as a language resource for language technology and linguistic research.
Participants
Anna Sågvall Hein
The corpora we are working with are
Dahlqvist, B. and Megyesi, B. 2007. Changing the tokenization in Talbanken to SUC2.0. Department of Linguistics and Philology, Uppsala University.
Megyesi, B. B. and Dahlqvist, B. 2006. Corpus format (in Swedish), December 6, 2006. Uppsala University.
Megyesi, B. B. 2006. Corpus collection (in Swedish), November 15, 2006. Uppsala University.
Nivre, J. 2006. Project presentation (in English), September 26, 2006. Uppsala University.
Shared Task 2007, Chair: Joakim Nivre. Shared task on
dependency parsing at the Conference on Computational Natural Language
Department of Linguistics and Philology, Uppsala
Visiting address: Engelska parken, Humanistiskt centrum, Thunbergsvägen 3
Postal address: Institutionen för lingvistik och filologi, Box 635, SE-751 26 Uppsala, Sweden.
Tel: +46 (0)18 471 22 52
Fax: +46 (0)18 471 10 94