

Linguistic software

(This page is only available in English.)

Only a small subset of installed software is mentioned here.

Granska Tagger, an efficient Hidden Markov Model part-of-speech tagger for Swedish, has its own page.

Others: [Brill] [HTK] [NLTK] [SRILM] [svannotate] [TnT] [UUParser]


Eric Brill has written a tagger that doesn’t seem to have a more specific name than Rule Based Tagger. See the copyright terms in /local/ling/brill/RBT/COPYRIGHT. The tagger assumes that a particular directory is the current directory when you run it, so change to that directory first.

The parameters are LEXICON YOUR-CORPUS BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE. It is therefore convenient to use files that already live in that directory, for example LEXICON.BROWN as the lexicon, but for other files, such as your own corpus, you need to give a full path.
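Following the parameter order above, an invocation might look like the sketch below. The driver name tagger, the Bin_and_Data subdirectory, and all file names other than LEXICON.BROWN are assumptions; check the local installation for the exact names.

```shell
# Hedged sketch: change to the tagger's own directory first, since it
# expects to find its data files relative to the current directory.
cd /local/ling/brill/RBT/Bin_and_Data   # assumed location of the driver

# Parameter order: LEXICON YOUR-CORPUS BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE.
# LEXICON.BROWN is in the tagger's directory; the corpus is your own file,
# so it gets a full path. Rule-file names here are placeholders.
./tagger LEXICON.BROWN /home/you/mycorpus.txt BIGRAMS \
    LEXICALRULEFILE CONTEXTUALRULEFILE
```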

There is more information in the folder /local/ling/brill/RBT/Docs/.


HTK (Hidden Markov Model Toolkit) is used primarily for speech recognition. There is documentation locally in the folder /local/share/doc/htk/.

The programs that are part of this package are LSubset, LMerge, LNewMap, LNorm, LPlex, LGList, LGPrep, LLink, LBuild, LFoF, LGCopy, HLMCopy, LAdapt, Cluster, HSmooth, HVite, HResults, HSGen, HParse, HQuant, HRest, HLRescore, HLStats, HMMIRest, HInit, HLEd, HList, HDMan, HERest, HHEd, HBuild, HCompV, HCopy, HSLab.
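As a small illustration of how these programs are typically combined, the sketch below uses HCopy to parameterise a waveform into MFCC feature vectors and HList to inspect the result. The configuration values and file names are assumptions for the example, not a recommended setup; see the local documentation for details.

```shell
# Hypothetical example: write a minimal HCopy configuration, convert a
# WAV file to 12 MFCC coefficients plus energy, then inspect the output.
cat > mfcc.conf <<'EOF'
SOURCEFORMAT = WAV
TARGETKIND   = MFCC_0
TARGETRATE   = 100000.0
WINDOWSIZE   = 250000.0
NUMCEPS      = 12
EOF

HCopy -C mfcc.conf speech.wav speech.mfc   # speech.wav is a placeholder
HList -h speech.mfc                        # print the feature-file header
```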


NLTK (Natural Language Toolkit) is a series of modules and corpora for research and education in NLP in Python. See more on the Python page.


The SRI Language Modeling Toolkit has several programs. They are in /local/ling/srilm/bin and are documented with man pages. There are also some special man pages that cover more than one program; see, for example, man training-scripts.
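Two of the most commonly used programs are ngram-count (for estimating a model) and ngram (for applying it). The sketch below trains a trigram model and computes perplexity on held-out text; the corpus and model file names are placeholders, and the smoothing defaults are whatever the installed version uses (see man ngram-count).

```shell
# Estimate a trigram language model from a plain-text training corpus.
/local/ling/srilm/bin/ngram-count -order 3 -text train.txt -lm model.lm

# Compute the perplexity of held-out text under that model.
/local/ling/srilm/bin/ngram -order 3 -lm model.lm -ppl heldout.txt
```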


svannotate tokenizes, segments, tags, and parses Swedish text files.

Run svannotate --help to get some help. Read /local/ling/svannotate/README for more. In the same folder are suc.hun and talbanken-default+splitmorph2.mco, which the program uses unless you tell it otherwise.

Trigrams’n’Tags (TnT)

TnT is presented as a very efficient statistical part-of-speech tagger that is trainable on different languages and virtually any tagset. We have a license (for the whole department) for non-commercial use. There are four programs: tnt, tnt-diff, tnt-para, and tnt-wc. See the folder /local/ling/tnt, with subfolders, for documentation and license.
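A typical workflow, sketched below, trains model parameters from a tagged corpus with tnt-para and then tags new text with tnt. The file names are placeholders, and the exact options may differ; consult the documentation in /local/ling/tnt before relying on this.

```shell
# Hypothetical session; corpus and text files are placeholders.
tnt-para tagged-corpus     # estimate model parameters (lexicon and n-grams)
tnt tagged-corpus new-text # tag new-text using that model
tnt-wc tagged-corpus       # report corpus statistics
```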


UUParser is a transition-based dependency parser for Universal Dependencies, developed at the department. An example:

uuparser --out /tmp/out --datadir /corpora/ud/ud-treebanks-v2.5/ --include "mr_ufal"