On this page, we gather a wide range of historical corpora and other useful resources and tools for researchers working with historical text.
In the table below, you may download historical corpora for fourteen different languages. For more information about these language-specific corpora, and to download them, click on the name of the language that interests you. All resources below are provided on an "AS IS", "WHERE IS", and "WITH ALL FAULTS" basis, without warranty of any kind, express or implied.
Language models derived from these corpora may be downloaded from the Language Models section of this page. There you may also create your own language models by uploading files of your choice.
Furthermore, some useful tools for processing historical text are found in the Tools section of this page.
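As a rough illustration of what deriving a language model from a tokenised corpus can involve, the sketch below estimates bigram probabilities by simple counting. It is only an assumption-laden example: the function name `bigram_model`, the sample sentence, and the whitespace-separated token layout are hypothetical, and the models offered on this page may be built quite differently.

```python
from collections import Counter


def bigram_model(tokens):
    """Return a function giving the maximum-likelihood estimate P(word | prev)."""
    # Count every token that has a successor (i.e. every left context).
    contexts = Counter(tokens[:-1])
    # Count every pair of adjacent tokens.
    bigrams = Counter(zip(tokens, tokens[1:]))

    def prob(prev, word):
        # Unseen contexts get probability 0.0 (no smoothing in this sketch).
        if contexts[prev] == 0:
            return 0.0
        return bigrams[(prev, word)] / contexts[prev]

    return prob


# Hypothetical tokenised text; a real [tok] file may use a different layout.
tokens = "in the beginning was the word and the word was good".split()
prob = bigram_model(tokens)
print(prob("the", "word"))
```

A practical model would add smoothing for unseen word pairs, which matters especially for historical text with its highly variable spelling.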
The following corpora are currently available for historical Czech:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
charters | 1310–1346 | charters | [txt] | [tok] | — | [xml] | [all] | www | www | [readme] |
DIAKORP | 1350–1939 | mixed | [txt] | [tok] | — | — | [all] | www | www | [readme] |
Gutenberg | 1890–1897 | fiction | [txt] | [tok] | — | — | [all] | www | www | [readme] |
You may also download all Czech corpora files (including readme files) here: all-czech-corpora.zip
The following corpora are currently available for historical Dutch:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
BaB | 1661–1783 | letters | — | — | — | — | — | www | register | [readme] |
EDGeS | 1360–1939 | bible | [txt] | [tok] | — | [txt] | [all] | www | www | [readme] |
Gutenberg | 14nn–1875 | fiction | [txt] | [tok] | — | — | [all] | www | www | [readme] |
Compilation | 1236–1938 | chancellery, narrative | — | — | — | — | — | www | Free for research | [readme] |
You may also download all Dutch corpora files (including readme files) here: all-dutch-corpora.zip
The following corpora are currently available for historical English:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
CLMET | 1710–1920 | mixed | [txt] | [tok] | — | [txt] | [all] | www | www | [readme] |
EDGeS | 1395–1890 | bible | [txt] | [tok] | — | [txt] | [all] | www | www | [readme] |
lampeter | 1640–1740 | tracts | [txt] | [tok] | — | — | [all] | www | www | [readme] |
You may also download all English corpora files (including readme files) here: all-english-corpora.zip
The following corpora are currently available for historical French:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
Paris | 1296–1790 | vernacular speech, tax-rolls | [txt] | [tok] | — | — | [all] | www | www | [readme] |
SRCMF | 842–1325 | — | [txt] | [tok] | — | [conll] | [all] | www | www | [readme] |
You may also download all French corpora files (including readme files) here: all-french-corpora.zip
The following corpora are currently available for historical German:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
DTA | 1600–1899 | mixed | [txt] | [tok] | — | — | [all] | www | www | [readme] |
EDGeS | 1460–1871 | bible | [txt] | [tok] | — | [txt] | [all] | www | www | [readme] |
GeMi | 1500–1690 | medicine | [txt] | [tok] | — | — | [all] | www | www | [readme] |
GerManC | 1654–1799 | mixed | [txt] | [tok] | — | [conll] | [all] | www | www | [readme] |
LitHist | 1790–1829 | literature | [txt] | [tok] | — | [conll] | [all] | www | www | [readme] |
ReM | 1050–1350 | mixed | [txt] | [tok] | — | [xml] | [all] | www | www | [readme] |
ReN | 1200–1650 | mixed | [txt] | [tok] | — | [xml] | [all] | www | www | [readme] |
Ridges | 1482–1914 | science | [txt] | [tok] | [txt] | [conll] | [all] | www | www | [readme] |
You may also download all German corpora files (including readme files) here: all-german-corpora.zip
The following corpora are currently available for Greek:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
AGLDT | — | mixed | [txt] | [tok] | — | [xml] | [all] | www | www | [readme] |
Perseus | — | mixed | [txt] | [tok] | — | — | [all] | www | www | [readme] |
Proiel | — | mixed | [txt] | [tok] | — | [conll] | [all] | www | www | [readme] |
You may also download all Greek corpora files (including readme files) here: all-greek-corpora.zip
The following corpora are currently available for historical Hungarian:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
HGDS | 1440–1539 | codices | [txt] | [tok] | [txt] | [conll] | [all] | www | free | [readme] |
The following corpora are currently available for historical Icelandic:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
IcePaHC | 1150–2008 | mixed | [txt] | [tok] | [txt] | [txt] | [all] | www | www | [readme] |
The following corpora are currently available for historical Italian:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
Gutenberg | 1300–1897 | books | [txt] | [tok] | — | — | [all] | www | www | [readme] |
The following corpora are currently available for Latin:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
AGLDT | — | mixed | [txt] | [tok] | — | [xml] | [all] | www | www | [readme] |
charters | 1310–1346 | charters | [txt] | [tok] | — | [txt] | [all] | www | www | [readme] |
Corpus Corporum | 100–1200 | mixed | — | — | — | — | — | www | www | [readme] |
LLCT1 | — | charters | [txt] | [tok] | — | — | [all] | www | www | [readme] |
LLCT2 | — | charters | [txt] | [tok] | — | [conll] | [all] | www | www | [readme] |
Perseus | — | mixed | [txt] | [tok] | — | — | [all] | www | www | [readme] |
Proiel | — | mixed | [txt] | [tok] | — | [conll] | [all] | www | www | [readme] |
Thomisticus | 1225–1274 | mixed | [txt] | [tok] | — | [conll] | [all] | www | www | [readme] |
You may also download all Latin corpora files (including readme files) here: all-latin-corpora.zip
The following corpora are currently available for historical Polish:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
PolDiLemma | 1567–1800 | mixed | [txt] | [tok] | — | [txt] | [all] | www | www | [readme] |
The following corpora are currently available for historical Portuguese:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
Tycho | 1380–1881 | mixed | — | — | — | — | — | www | www | [readme] |
The following corpora are currently available for historical Russian:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
RNC | — | — | [txt] | [tok] | — | [conll] | [all] | www | www | [readme] |
The following corpora are currently available for historical Slovene:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
besedje | 1550–1603 | lexicon | [txt] | — | — | [xml] | [all] | www | www | [readme] |
DigLib | 1584–1918 | mixed | [txt] | [tok] | — | [txt] | [all] | www | www | [readme] |
RefCorpus | 1584–1899 | mixed | [txt] | [tok] | [txt] | [txt] | [all] | www | www | [readme] |
lex | 1584–1918 | lexicon | [txt] | — | — | — | — | www | www | [readme] |
You may also download all Slovene corpora files (including readme files) here: all-slovene-corpora.zip
The following corpora are currently available for historical Spanish:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
IMPACT BVC | 1481–1962 | mixed | [txt] | [tok] | [txt] | — | [all] | www | www | [readme] |
IMPACT GT | 1543–1748 | mixed | [txt] | [tok] | — | — | [all] | www | www | [readme] |
You may also download all Spanish corpora files (including readme files) here: all-spanish-corpora.zip
The following corpora are currently available for historical Swedish:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
Dalin | 1850–1853 | lexicon | [txt] | — | — | [txt] | [all] | www | www | [readme] |
EDGeS | 1703–1917 | bible | [txt] | [tok] | — | [txt] | [all] | www | www | [readme] |
Fornsvenska | 1350–1758 | mixed | [txt] | [tok] | — | — | [all] | www | www | [readme] |
GaW | 1527–1812 | court, church | [txt] | [tok] | [txt] | — | [all] | www | Free for research | [readme] |
Gutenberg | 1789–1902 | books | [txt] | [tok] | — | — | [all] | www | www | [readme] |
Konsistoriet | 1624–1699 | protocols | [txt] | [tok] | — | — | [all] | www | Open Access | [readme] |
Schlyter | 500–1500 | lexicon | [txt] | — | — | [txt] | [all] | www | www | [readme] |
Swedberg | 1700–1735 | lexicon | [txt] | — | — | [txt] | [all] | www | www | [readme] |
You may also download all Swedish corpora files (including readme files) here: all-swedish-corpora.zip
Eva Pettersson, Department of Linguistics and Philology, Uppsala University, eva.pettersson@lingfil.uu.se
Beáta Megyesi, Department of Linguistics and Philology, Uppsala University, beata.megyesi@lingfil.uu.se