Swedish Language Models for HunPoS

Beata Megyesi

Description

HunPoS (Halacsy et al., 2007) is an open source reimplementation of the statistical part-of-speech tagger Trigrams'n Tags, also called TnT (Brants, 2000).
To build a Swedish version of the tagger, HunPoS was trained on the Stockholm Umeå Corpus, also called SUC (Ejerhed et al.). SUC is a corpus of over one million words. The texts were written in the 1990s and are balanced following the principles of the Brown and LOB corpora. The tokens in SUC are lemmatized (their uninflected form is marked) and tagged with their syntactically correct part-of-speech and morphological features.
According to a tagger evaluation and comparison test, HunPoS achieved an accuracy of 97% when trained on one million tokens taken from SUC, the same performance as TnT applied to the same Swedish data set. Language models are available trained on the original SUC tagset as well as on the Parole annotation scheme for Swedish.
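The accuracy figure above is a token-level measure: the share of tokens whose predicted tag matches the gold-standard tag. A minimal sketch of that computation follows; the function name `tagging_accuracy` and the example tag sequences are illustrative assumptions, not taken from the evaluation itself.

```python
def tagging_accuracy(gold, predicted):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical tag sequences for illustration only; not real SUC data.
gold = ["DT", "NN", "VB", "PP", "NN"]
predicted = ["DT", "NN", "VB", "PP", "JJ"]

accuracy = tagging_accuracy(gold, predicted)
print(f"{accuracy:.0%}")
```

Here four of the five tags agree, so the sketch reports an accuracy of 80%.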

License

SUC is subject to license agreements. The corpus is publicly available and free for research purposes. For more information, see SUC's homepage.

Download

The language models for tagging Swedish text with HunPoS can be downloaded below.

Swedish HunPoS model with SUC-tags

Swedish HunPoS model with Parole-tags
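As a rough sketch of how a downloaded model might be used: the HunPoS tagger is commonly run as `hunpos-tag MODEL_FILE < input > output`, reading one token per line with an empty line between sentences and writing each token followed by a tab and its tag. That invocation and format are stated here as an assumption, so check them against your HunPoS installation. The helper names below (`to_hunpos_input`, `parse_hunpos_output`) are hypothetical and only prepare the input and read back the output; they do not call the tagger.

```python
def to_hunpos_input(sentences):
    """Serialize tokenized sentences: one token per line, blank line after each sentence.

    Assumes the one-token-per-line input format described in the lead-in.
    """
    lines = []
    for sentence in sentences:
        lines.extend(sentence)
        lines.append("")  # blank line marks the sentence boundary
    return "\n".join(lines) + "\n"


def parse_hunpos_output(text):
    """Parse token<TAB>tag lines into a list of sentences of (token, tag) pairs.

    Assumes tab-separated output with blank lines between sentences.
    """
    sentences, current = [], []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected token<TAB>tag, got: {raw_line!r}")
        current.append((fields[0], fields[1]))
    if current:
        sentences.append(current)
    return sentences


# Illustrative input; the tags in the sample output are made up for the example.
tagger_input = to_hunpos_input([["Det", "är", "bra", "."]])
sample_output = "Det\tPN\när\tVB\nbra\tJJ\n.\tMAD\n\n"
tagged = parse_hunpos_output(sample_output)
```

The text in `tagger_input` can be written to a file and passed to the tagger on standard input, and the tagger's standard output can be handed to `parse_hunpos_output`.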

References

When using the language models with the tagger, please refer to the following papers:

Halacsy, P., Kornai, A., and Oravecz, Cs. 2007. Hunpos - an open source trigram tagger. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 209-212, Prague, Czech Republic. Association for Computational Linguistics.

Megyesi, B. 2008. The Open Source Tagger HunPoS for Swedish. Report, September 2008. Dept. of Linguistics and Philology, Uppsala University [.pdf]

SUC. Department of Linguistics Umeå University, and Department of Linguistics, Stockholm University. 1997. SUC 1.0 Stockholm Umeå Corpus, Version 1.0. ISBN: 91-7191-348-3.