Uppsala University * Dept. of Linguistics and Philology * Computational Linguistics * Beata Megyesi

Parallel Corpora for Less Explored Languages

Financed by
The Swedish Research Council and the Faculty of Languages at Uppsala University

Project description

The main goal of the project is to promote research and teaching in less explored languages by building language resources for language pairs that differ in linguistic structure. We build parallel corpora, consisting of original texts and their translations, with a focus on contrastive studies. The corpora are built semi-automatically, using a common module for formatting and markup together with a basic language resource kit (BLARK) for each of the languages involved.

BLARKs often include carefully compiled text corpora and a set of tools for the automatic analysis of the language, such as a sentence splitter, tokenizer, part-of-speech tagger, chunker, and shallow parser. These tools are used in the automatic alignment phase to improve alignment accuracy.
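
As an illustration only, the way a sentence splitter and a tokenizer can feed into sentence alignment might be sketched as follows in Python. Everything in the sketch is an assumption made for the example and not the project's actual tools: the function names, the regular expressions, the skip penalty, and the cost function, which is a simplified length-based stand-in for a real alignment model.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def align_sentences(src, tgt, skip_penalty=20):
    """Align two sentence lists with dynamic programming.

    Allowed alignment units: 1-1, 1-0, 0-1, 2-1 and 1-2.
    Cost of a unit: absolute difference in character length,
    plus a hypothetical fixed penalty when one side is empty.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                step = abs(src_len - tgt_len)
                if di == 0 or dj == 0:
                    step += skip_penalty
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    # Follow the back-pointers from the end to recover the aligned units.
    units = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        units.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    units.reverse()
    return units


english = "The project builds corpora. It covers Swedish and Turkish. The tools are shared."
swedish = "Projektet bygger korpusar. Det omfattar svenska och turkiska. Verktygen delas."

for src_unit, tgt_unit in align_sentences(split_sentences(english), split_sentences(swedish)):
    print(src_unit, "<->", tgt_unit)
```

A real pipeline would replace the length-based cost with richer evidence, for example part-of-speech tags or chunk boundaries produced by the other tools in the kit, which is the kind of information the tools above are meant to contribute to alignment accuracy.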

The parallel corpus is intended to be used in research, teaching and applications such as machine translation.

Language Pairs

English-Hindi-Swedish Parallel Corpus

English-Swedish-Turkish Parallel Corpus


Éva Á. Csató Johanson
Bengt Dahlqvist
Beáta B. Megyesi
Joakim Nivre
Eva Pettersson
Anju Saxena
Anna Sågvall Hein