Parallel Corpora for Less Explored Languages
Financed by
The Swedish Research Council and the
Faculty of Languages at Uppsala University
Project description
The main goal of the project is to promote research and teaching in less explored languages by building language resources for language pairs that are dissimilar in language structure. We build parallel corpora, consisting of original texts and their translations, with contrastive studies in focus. The corpora are built semi-automatically, using a common module for formatting and markup together with a basic language resource kit (BLARK) for each of the languages involved.
BLARKs often include carefully compiled corpora of collected texts and a set of tools for the automatic analysis of the languages, such as a sentence splitter, tokenizer, part-of-speech tagger, chunker and shallow parser. These tools are used in the automatic alignment phase to improve alignment accuracy.
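As a rough illustration of how such tools feed the alignment phase, the sketch below splits two texts into sentences, tokenizes them, and pairs the sentences by position. It is a minimal sketch and not the project's actual pipeline: the function names (split_sentences, tokenize, align_one_to_one) are hypothetical, the splitting rule is deliberately naive, and a real aligner would also handle one-to-many and many-to-one sentence correspondences.

```python
import re


def split_sentences(text):
    # Naive rule (an assumption for illustration): a sentence ends at
    # '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Tokens are runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def align_one_to_one(source_text, target_text):
    # Pair sentences strictly by position; zip() stops at the shorter text,
    # so unmatched trailing sentences are dropped in this simplified version.
    source = split_sentences(source_text)
    target = split_sentences(target_text)
    return [(tokenize(s), tokenize(t)) for s, t in zip(source, target)]


if __name__ == "__main__":
    pairs = align_one_to_one(
        "I see a house. It is red.",
        "Jag ser ett hus. Det är rött.",
    )
    for source_tokens, target_tokens in pairs:
        print(source_tokens, "<->", target_tokens)
```

Better sentence splitting and tokenization matter here because every segmentation error shifts the sentence pairs that follow it, which is one reason language-specific tools can improve alignment accuracy.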
The parallel corpus is intended to be used in research, teaching and applications such as machine translation.
Language Pairs
English-Hindi-Swedish Parallel Corpus
English-Swedish-Turkish Parallel Corpus
Participants
Éva Á. Csató Johanson
Bengt Dahlqvist
Beáta B. Megyesi
Joakim Nivre
Eva Pettersson
Anju Saxena
Anna Sågvall Hein