Persian Language Model for HunPoS: TagPer
HunPoS (Halácsy et al., 2007) is an open-source reimplementation of the statistical part-of-speech tagger Trigrams'n'Tags, also known as TnT (Brants, 2000), that lets the user tune the tagger through different feature settings. TagPer (Seraji, 2015, Chapter 4, pp. 91-96) was developed by training HunPoS on the Uppsala Persian Corpus (UPC), a large, freely available Persian corpus. UPC is a modified version of the Bijankhan corpus with additional sentence segmentation and consistent tokenization. It contains over 2.7 million words and is annotated with morpho-syntactic and partly semantic features.
The tool was developed by Mojgan Seraji (firstname.lastname@example.org) and is licensed under the GNU General Public License. It is used for part-of-speech tagging of Persian texts and can be downloaded below:
- Persian Part of Speech Tagger
Start using TagPer
Before using the language model, you first need to download HunPoS. You can then tag your text with the model using the following command line:
prompt> hunpos-tag model_TagPer < input_file.txt > output_file.txt
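The command above reads the input file from standard input and writes the tagged text to standard output. The following Python sketch shows one way to prepare the input and read the result. It assumes the usual HunPoS conventions: one token per line with an empty line after each sentence on input, and `token<TAB>tag` lines on output. The helper names (`to_hunpos_input`, `parse_hunpos_output`, `tag_sentences`) are hypothetical and not part of HunPoS or TagPer, and the sentences are assumed to be tokenized already, ideally in the same way as UPC.

```python
import subprocess


def to_hunpos_input(sentences):
    """Render tokenized sentences as one token per line.

    Assumption: hunpos-tag expects an empty line after each sentence.
    """
    lines = []
    for tokens in sentences:
        lines.extend(tokens)
        lines.append("")  # empty line marks the sentence boundary
    return "\n".join(lines) + "\n" if lines else ""


def parse_hunpos_output(text):
    """Parse 'token<TAB>tag' lines into a list of sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token, _, tag = line.partition("\t")
        current.append((token, tag.strip()))
    if current:
        sentences.append(current)
    return sentences


def tag_sentences(sentences, model="model_TagPer", binary="hunpos-tag"):
    """Run hunpos-tag as in the command line above and return tagged sentences.

    Assumptions: the binary is on PATH, the model file is in the working
    directory, and the text is UTF-8 encoded.
    """
    result = subprocess.run(
        [binary, model],
        input=to_hunpos_input(sentences),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return parse_hunpos_output(result.stdout)
```

With HunPoS installed, `tag_sentences([["token1", "token2"], ["token3"]])` would return one list of `(token, tag)` pairs per input sentence. Writing the output of `to_hunpos_input` to `input_file.txt` and running the command line above is equivalent.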
References
1. Halácsy, P., Kornai, A., and Oravecz, Cs. 2007. Hunpos - an open source trigram tagger. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 209-212, Prague, Czech Republic. Association for Computational Linguistics.
2. Seraji, Mojgan. 2011. A Statistical Part-of-Speech Tagger for Persian. In Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA), Riga, Latvia.
3. Seraji, Mojgan. 2015. Morphosyntactic Corpora and Tools for Persian. Doctoral dissertation, Uppsala University. Studia Linguistica Upsaliensia 16.