HistCORP



Historical Corpora

On this page, we gather a wide range of historical corpora and other useful resources and tools for researchers working with historical text.

The tables below list downloadable historical corpora for sixteen languages. For more information about the corpora for a particular language, and to download them, click on the name of that language. All resources on this page are provided on an "AS IS", "WHERE IS", and "WITH ALL FAULTS" basis, without warranty of any kind, express or implied.

Language models derived from these corpora may be downloaded from the Language Models section of this page. There you may also create your own language models, by uploading files of your choice.

Furthermore, some useful tools for processing historical text are found in the Tools section of this page.


Download Historical Corpora

Czech

The following corpora are currently available for historical Czech:

  • Medieval Charter Sections Corpus (charters)
  • The diachronic section of the Czech National Corpus (DIAKORP)
  • Selection of books from Project Gutenberg (Gutenberg)

Name             Time Period  Genre(s)  Download (Text, Token, Norm, Anno, All)  Source  Licence  Info
charters         1310–1346    charters  [txt] [tok] [xml] [all]                  www     www      [readme]
DIAKORP          1350–1939    mixed     [txt] [tok] [all]                        www     www      [readme]
Gutenberg        1890–1897    fiction   [txt] [tok] [all]                        www     www      [readme]

You may also download all Czech corpora files (including readme files) here: all-czech-corpora.zip

Dutch

The following corpora are currently available for historical Dutch:

  • Brieven als Buit, BaB (not available for download, but included in the language models)
  • The EDGeS Diachronic Bible Corpus (EDGeS)
  • Selection of books from Project Gutenberg (Gutenberg)
  • The Compilation Corpus Historical Dutch (Compilation)

Name             Time Period  Genre(s)                Download (Text, Token, Norm, Anno, All)  Source  Licence            Info
BaB              1661–1783    letters                                                          www     register           [readme]
EDGeS            1360–1939    bible                   [txt] [tok] [txt] [all]                  www     www                [readme]
Gutenberg        14nn–1875    fiction                 [txt] [tok] [all]                        www     www                [readme]
Compilation      1236–1938    chancellery, narrative                                           www     Free for research  [readme]

You may also download all Dutch corpora files (including readme files) here: all-dutch-corpora.zip

English

The following corpora are currently available for historical English:

  • The Corpus of Late Modern English Texts, version 3.1 (CLMET)
  • The EDGeS Diachronic Bible Corpus (EDGeS)
  • The Lampeter Corpus of Early Modern English Tracts (lampeter)

Name             Time Period  Genre(s)  Download (Text, Token, Norm, Anno, All)  Source  Licence  Info
CLMET            1710–1920    mixed     [txt] [tok] [txt] [all]                  www     www      [readme]
EDGeS            1395–1890    bible     [txt] [tok] [txt] [all]                  www     www      [readme]
lampeter         1640–1740    tracts    [txt] [tok] [all]                        www     www      [readme]

You may also download all English corpora files (including readme files) here: all-english-corpora.zip

French

The following corpora are currently available for historical French:

  • Paris speech in the past (Paris)
  • Syntactic Reference Corpus of Medieval French (SRCMF)

Name             Time Period  Genre(s)                      Download (Text, Token, Norm, Anno, All)  Source  Licence  Info
Paris            1296–1790    vernacular speech, tax-rolls  [txt] [tok] [all]                        www     www      [readme]
SRCMF            842–1325                                   [txt] [tok] [conll] [all]                www     www      [readme]

You may also download all French corpora files (including readme files) here: all-french-corpora.zip

German

The following corpora are currently available for historical German:

  • Deutsches TextArchiv (DTA)
  • The EDGeS Diachronic Bible Corpus (EDGeS)
  • The Nottingham Corpus of Early Modern German Midwifery and Women's Medicine (GeMi)
  • GerManC
  • German Literary History (LitHist)
  • Reference Corpus of Middle High German (ReM)
  • Reference Corpus of Middle Low German/Low Rhenish (ReN)
  • Register in Diachronic German Science (Ridges)

Name             Time Period  Genre(s)    Download (Text, Token, Norm, Anno, All)  Source  Licence  Info
DTA              1600–1899    mixed       [txt] [tok] [all]                        www     www      [readme]
EDGeS            1460–1871    bible       [txt] [tok] [txt] [all]                  www     www      [readme]
GeMi             1500–1690    medicine    [txt] [tok] [all]                        www     www      [readme]
GerManC          1654–1799    mixed       [txt] [tok] [conll] [all]                www     www      [readme]
LitHist          1790–1829    literature  [txt] [tok] [conll] [all]                www     www      [readme]
ReM              1050–1350    mixed       [txt] [tok] [xml] [all]                  www     www      [readme]
ReN              1200–1650    mixed       [txt] [tok] [xml] [all]                  www     www      [readme]
Ridges           1482–1914    science     [txt] [tok] [txt] [conll] [all]          www     www      [readme]

You may also download all German corpora files (including readme files) here: all-german-corpora.zip

Greek

The following corpora are currently available for Greek:

  • Ancient Greek and Latin Dependency Treebank (AGLDT)
  • Perseus Digital Library (Perseus)
  • Proiel Treebank (Proiel)

Name             Time Period  Genre(s)  Download (Text, Token, Norm, Anno, All)  Source  Licence  Info
AGLDT                         mixed     [txt] [tok] [xml] [all]                  www     www      [readme]
Perseus                       mixed     [txt] [tok] [all]                        www     www      [readme]
Proiel                        mixed     [txt] [tok] [conll] [all]                www     www      [readme]

You may also download all Greek corpora files (including readme files) here: all-greek-corpora.zip

Hungarian

The following corpora are currently available for historical Hungarian:

  • Hungarian Generative Diachronic Syntax (HGDS)

Name             Time Period  Genre(s)  Download (Text, Token, Norm, Anno, All)  Source  Licence  Info
HGDS             1440–1539    codices   [txt] [tok] [txt] [conll] [all]          www     free     [readme]

Icelandic

The following corpora are currently available for historical Icelandic:

  • Icelandic Parsed Historical Corpus (IcePaHC)

Name             Time Period  Genre(s)  Download (Text, Token, Norm, Anno, All)  Source  Licence  Info
IcePaHC          1150–2008    mixed     [txt] [tok] [txt] [txt] [all]            www     www      [readme]

Italian

The following corpora are currently available for historical Italian:

  • Selection of books from Project Gutenberg (Gutenberg)

Name             Time Period  Genre(s)  Download (Text, Token, Norm, Anno, All)  Source  Licence  Info
Gutenberg        1300–1897    books     [txt] [tok] [all]                        www     www      [readme]

Latin

The following corpora are currently available for Latin:

  • Ancient Greek and Latin Dependency Treebank (AGLDT)
  • Medieval Charter Sections Corpus (charters)
  • Corpus Corporum (not available for download, but included in the language models)
  • Late Latin Charter Treebank 1 (LLCT1)
  • Late Latin Charter Treebank 2 (LLCT2)
  • Perseus Digital Library (Perseus)
  • Proiel Treebank (Proiel)
  • Index Thomisticus Treebank (Thomisticus)

Name             Time Period  Genre(s)  Download (Text, Token, Norm, Anno, All)  Source  Licence  Info
AGLDT                         mixed     [txt] [tok] [xml] [all]                  www     www      [readme]
charters         1310–1346    charters  [txt] [tok] [txt] [all]                  www     www      [readme]
Corpus Corporum  100–1200     mixed                                              www     www      [readme]
LLCT1                         charters  [txt] [tok] [all]                        www     www      [readme]
LLCT2                         charters  [txt] [tok] [conll] [all]                www     www      [readme]
Perseus                       mixed     [txt] [tok] [all]                        www     www      [readme]
Proiel                        mixed     [txt] [tok] [conll] [all]                www     www      [readme]
Thomisticus      1225–1274    mixed     [txt] [tok] [conll] [all]                www     www      [readme]

You may also download all Latin corpora files (including readme files) here: all-latin-corpora.zip

Polish

The following corpora are currently available for historical Polish:

  • Middle Polish Diachrone Lemmatised Corpus (PolDiLemma)

Name             Time Period  Genre(s)  Download (Text, Token, Norm, Anno, All)  Source  Licence  Info
PolDiLemma       1567–1800    mixed     [txt] [tok] [txt] [all]                  www     www      [readme]

Portuguese

The following corpora are currently available for historical Portuguese:

  • Tycho Brahe Parsed Corpus of Historical Portuguese (Tycho)

Name             Time Period  Genre(s)  Download (Text, Token, Norm, Anno, All)  Source  Licence  Info
Tycho            1380–1881    mixed                                              www     www      [readme]

Russian

The following corpora are currently available for historical Russian:

  • Middle Russian Corpus (RNC)

Name             Time Period  Genre(s)  Download (Text, Token, Norm, Anno, All)  Source  Licence  Info
RNC                                     [txt] [tok] [conll] [all]                www     www      [readme]

Slovene

The following corpora are currently available for historical Slovene:

  • Words of the 16th-Century Slovenian Literary Language (besedje)
  • Digital Library (DigLib)
  • Reference corpus of historical Slovene (RefCorpus)
  • Lexicon of historical Slovene (lex)

Name             Time Period  Genre(s)  Download (Text, Token, Norm, Anno, All)  Source  Licence  Info
besedje          1550–1603    lexicon   [txt] [xml] [all]                        www     www      [readme]
DigLib           1584–1918    mixed     [txt] [tok] [txt] [all]                  www     www      [readme]
RefCorpus        1584–1899    mixed     [txt] [tok] [txt] [txt] [all]            www     www      [readme]
lex              1584–1918    lexicon   [txt]                                    www     www      [readme]

You may also download all Slovene corpora files (including readme files) here: all-slovene-corpora.zip

Spanish

The following corpora are currently available for historical Spanish:

  • IMPACT-es diachronic corpus, BVC-section (IMPACT BVC)
  • IMPACT-es diachronic corpus, GT-section (IMPACT GT)

Name             Time Period  Genre(s)  Download (Text, Token, Norm, Anno, All)  Source  Licence  Info
IMPACT BVC       1481–1962    mixed     [txt] [tok] [txt] [all]                  www     www      [readme]
IMPACT GT        1543–1748    mixed     [txt] [tok] [all]                        www     www      [readme]

You may also download all Spanish corpora files (including readme files) here: all-spanish-corpora.zip

Swedish

The following corpora are currently available for historical Swedish:

  • Dalin's 19th Century Swedish Dictionary (Dalin)
  • The EDGeS Diachronic Bible Corpus (EDGeS)
  • Fornsvenska Textbanken (Fornsvenska)
  • Selection of books from Project Gutenberg (Gutenberg)
  • Texts from the Gender and Work project (GaW)
  • Protocols from the Academic Consistory of Uppsala University (Konsistoriet)
  • Schlyter's Medieval Swedish Dictionary (Schlyter)
  • Swensk Ordabok by Jesper Swedberg (Swedberg)

Name             Time Period  Genre(s)       Download (Text, Token, Norm, Anno, All)  Source  Licence            Info
Dalin            1850–1853    lexicon        [txt] [txt] [all]                        www     www                [readme]
EDGeS            1703–1917    bible          [txt] [tok] [txt] [all]                  www     www                [readme]
Fornsvenska      1350–1758    mixed          [txt] [tok] [all]                        www     www                [readme]
GaW              1527–1812    court, church  [txt] [tok] [txt] [all]                  www     Free for research  [readme]
Gutenberg        1789–1902    books          [txt] [tok] [all]                        www     www                [readme]
Konsistoriet     1624–1699    protocols      [txt] [tok] [all]                        www     Open Access        [readme]
Schlyter         500–1500     lexicon        [txt] [txt] [all]                        www     www                [readme]
Swedberg         1700–1735    lexicon        [txt] [txt] [all]                        www     www                [readme]

You may also download all Swedish corpora files (including readme files) here: all-swedish-corpora.zip
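
Once one of the per-language archives listed above (for example all-swedish-corpora.zip) has been saved locally, its contents can be inspected before unpacking. The following is a minimal sketch in Python, not something provided by this page: it assumes only that the archive is an ordinary zip file and that readme files have "readme" somewhere in their file name. The local path and the helper names list_archive and find_readmes are hypothetical, and the internal directory layout of the archives is not documented here, so the sketch simply lists whatever it finds.

```python
import zipfile
from pathlib import Path


def list_archive(archive_path):
    """Return the sorted names of all files stored in a zip archive."""
    with zipfile.ZipFile(archive_path) as archive:
        return sorted(
            info.filename for info in archive.infolist() if not info.is_dir()
        )


def find_readmes(names):
    """Pick out entries whose file name contains 'readme' (case-insensitive)."""
    return [n for n in names if "readme" in Path(n).name.lower()]


if __name__ == "__main__":
    # Assumed local path: the archive must already have been downloaded by hand.
    path = Path("all-swedish-corpora.zip")
    if path.exists():
        names = list_archive(path)
        print(f"{len(names)} files in {path.name}")
        for name in find_readmes(names):
            print("readme:", name)
    else:
        print(f"{path} not found; download it from the page first.")
```

Reading the readme entries first is a reasonable habit here, since the licence terms differ between corpora.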




For questions or comments, or if there are corpora that you would like to add to this page, don't hesitate to contact us:

Eva Pettersson, Department of Linguistics and Philology, Uppsala University, eva.pettersson@lingfil.uu.se
Beáta Megyesi, Department of Linguistics and Philology, Uppsala University, beata.megyesi@lingfil.uu.se