HistCORP



Historical Corpora

On this page, we gather a wide range of historical corpora and other useful resources and tools for researchers working with historical text.

In the tables below, you may download historical corpora for sixteen different languages. For more information about the corpora for a particular language, and to download them, click on the name of that language. All resources on this page are provided on an "AS IS", "WHERE IS", and "WITH ALL FAULTS" basis, without warranty of any kind, express or implied.

Language models derived from these corpora may be downloaded from the Language Models section of this page. There you may also create your own language models, by uploading files of your choice.
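As a rough illustration of what a language model derived from a tokenised corpus can look like, the following Python sketch estimates add-one smoothed bigram probabilities from a token sequence. It is not the method used to build the models offered on this page, which is not described here. The function names and the toy sentence are hypothetical, and real input would come from a downloaded [tok] file whose format should be checked in the accompanying readme.

```python
from collections import Counter


def train_bigram_model(tokens):
    """Count bigram contexts and bigrams over a token sequence."""
    # Every token except the last one serves as the left context of a bigram.
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocabulary = set(tokens)
    return contexts, bigrams, vocabulary


def bigram_probability(prev, word, contexts, bigrams, vocabulary):
    """Add-one (Laplace) smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocabulary))


# Toy token sequence standing in for the contents of a tokenised corpus file
# (an assumption for illustration, not taken from any corpus on this page).
tokens = "in the beginning was the word and the word was with god".split()
contexts, bigrams, vocabulary = train_bigram_model(tokens)
p_word_given_the = bigram_probability("the", "word", contexts, bigrams, vocabulary)
```

With smoothing, unseen bigrams still receive a small non-zero probability, which matters for historical text where spelling variation makes many word pairs rare.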

Furthermore, some useful tools for processing historical text are found in the Tools section of this page.

Download Historical Corpora

Czech

The following corpora are currently available for historical Czech:

  • Medieval Charter Sections Corpus (charters)
  • The diachronic section of the Czech National Corpus (DIAKORP)
  • Selection of books from the Gutenberg Project (Gutenberg)

Name | Time Period | Genre(s) | Download (Text, Token, Norm, Morph, Syntax, Info, All) | Source | Licence
charters | 1310–1346 | charters | [txt] [tok] [xml] [readme] [all] | www | www
DIAKORP | 1350–1939 | mixed | [txt] [tok] [readme] [all] | www | www
Gutenberg | 1890–1897 | fiction | [txt] [tok] [readme] [all] | www | www

You may also download all Czech corpora files (including readme files) here: all-czech-corpora.zip
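Once a per-language archive such as all-czech-corpora.zip has been downloaded, it can help to see which file types it contains before processing anything. The Python sketch below groups the archive members by file suffix using only the standard library. The function name is hypothetical, and the internal layout and file extensions of the archives are assumptions, since they are not documented on this page.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def group_members_by_suffix(archive):
    """Map each file suffix (e.g. '.txt') to the archive members carrying it.

    `archive` may be a path to a downloaded zip file or a binary file object.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            groups[PurePosixPath(name).suffix].append(name)
    return dict(groups)
```

Assuming the archive has been saved in the working directory, a call such as `group_members_by_suffix("all-czech-corpora.zip")` returns a dictionary from suffix to member names, which makes it easy to pick out, for example, only the plain-text files.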

Dutch

The following corpora are currently available for historical Dutch:

  • The EDGeS Diachronic Bible Corpus (EDGeS)
  • Selection of books from the Gutenberg Project (Gutenberg)
  • The Compilation Corpus Historical Dutch (Compilation)

Name | Time Period | Genre(s) | Download (Text, Token, Norm, Anno, Info, All) | Source | Licence
EDGeS | 1360–1939 | bible | [txt] [tok] [txt] [readme] [all] | www | www
Gutenberg | 14nn–1875 | fiction | [txt] [tok] [readme] [all] | www | www
Compilation | 1236–1938 | chancellery, narrative | [readme] | www | Free for research

You may also download all Dutch corpora files (including readme files) here: all-dutch-corpora.zip

English

The following corpora are currently available for historical English:

  • The Corpus of Late Modern English Texts, version 3.1 (CLMET)
  • The EDGeS Diachronic Bible Corpus (EDGeS)
  • The Lampeter Corpus of Early Modern English Tracts (lampeter)

Name | Time Period | Genre(s) | Download (Text, Token, Norm, Anno, Info, All) | Source | Licence
CLMET | 1710–1920 | mixed | [txt] [tok] [txt] [readme] [all] | www | www
EDGeS | 1395–1890 | bible | [txt] [tok] [txt] [readme] [all] | www | www
lampeter | 1640–1740 | tracts | [txt] [tok] [readme] [all] | www | www

You may also download all English corpora files (including readme files) here: all-english-corpora.zip

French

The following corpora are currently available for historical French:

  • Paris speech in the past (Paris)
  • Syntactic Reference Corpus of Medieval French (SRCMF)

Name | Time Period | Genre(s) | Download (Text, Token, Norm, Ling, Info, All) | Source | Licence
Paris | 1296–1790 | vernacular speech, tax-rolls | [txt] [tok] [readme] [all] | www | www
SRCMF | 842–1325 |  | [txt] [tok] [conll] [readme] [all] | www | www

You may also download all French corpora files (including readme files) here: all-french-corpora.zip

German

The following corpora are currently available for historical German:

  • Deutsches TextArchiv (DTA)
  • The EDGeS Diachronic Bible Corpus (EDGeS)
  • The Nottingham Corpus of Early Modern German Midwifery and Women's Medicine (GeMi)
  • GerManC
  • German Literary History (LitHist)
  • Reference Corpus of Middle High German (ReM)
  • Reference Corpus of Middle Low German/Low Rhenish (ReN)
  • Register in Diachronic German Science (Ridges)

Name | Time Period | Genre(s) | Download (Text, Token, Norm, Anno, Info, All) | Source | Licence
DTA | 1600–1899 | mixed | [txt] [tok] [readme] [all] | www | www
EDGeS | 1460–1871 | bible | [txt] [tok] [txt] [readme] [all] | www | www
GeMi | 1500–1690 | medicine | [txt] [tok] [readme] [all] | www | www
GerManC | 1654–1799 | mixed | [txt] [tok] [conll] [readme] [all] | www | www
LitHist | 1790–1829 | literature | [txt] [tok] [conll] [readme] [all] | www | www
ReM | 1050–1350 | mixed | [txt] [tok] [xml] [readme] [all] | www | www
ReN | 1200–1650 | mixed | [txt] [tok] [xml] [readme] [all] | www | www
Ridges | 1482–1914 | science | [txt] [tok] [txt] [conll] [readme] [all] | www | www

You may also download all German corpora files (including readme files) here: all-german-corpora.zip

Greek

The following corpora are currently available for Greek:

  • Ancient Greek and Latin Dependency Treebank (AGLDT)
  • Perseus Digital Library (Perseus)
  • Proiel Treebank (Proiel)

Name | Time Period | Genre(s) | Download (Text, Token, Norm, Morph, Syntax, Info, All) | Source | Licence
AGLDT |  | mixed | [txt] [tok] [xml] [xml] [readme] [all] | www | www
Perseus |  | mixed | [txt] [tok] [readme] [all] | www | www
Proiel |  | mixed | [txt] [tok] [conll] [conll] [readme] [all] | www | www

You may also download all Greek corpora files (including readme files) here: all-greek-corpora.zip

Hungarian

The following corpora are currently available for historical Hungarian:

  • Hungarian Generative Diachronic Syntax (HGDS)

Name | Time Period | Genre(s) | Download (Text, Token, Norm, Morph, Syntax, Info, All) | Source | Licence
HGDS | 1440–1539 | codices | [txt] [tok] [txt] [conll] [conll] [readme] [all] | www | free

Icelandic

The following corpora are currently available for historical Icelandic:

  • Icelandic Parsed Historical Corpus (IcePaHC)

Name | Time Period | Genre(s) | Download (Text, Token, Norm, Morph, Syntax, Info, All) | Source | Licence
IcePaHC | 1150–2008 | mixed | [txt] [tok] [txt] [txt] [txt] [readme] [all] | www | www

Italian

The following corpora are currently available for historical Italian:

  • Selection of books from the Gutenberg Project (Gutenberg)

Name | Time Period | Genre(s) | Download (Text, Token, Norm, Morph, Syntax, Info, All) | Source | Licence
Gutenberg | 1300–1897 | books | [txt] [tok] [readme] [all] | www | www

Latin

The following corpora are currently available for Latin:

  • Ancient Greek and Latin Dependency Treebank (AGLDT)
  • Medieval Charter Sections Corpus (charters)
  • Corpus Corporum (not available for download, but included in the language models)
  • Late Latin Charter Treebank 1 (LLCT1)
  • Late Latin Charter Treebank 2 (LLCT2)
  • Perseus Digital Library (Perseus)
  • Proiel Treebank (Proiel)
  • Index Thomisticus Treebank (Thomisticus)

Name | Time Period | Genre(s) | Download (Text, Token, Norm, Morph, Syntax, Info, All) | Source | Licence
AGLDT |  | mixed | [txt] [tok] [xml] [xml] [readme] [all] | www | www
charters | 1310–1346 | charters | [txt] [tok] [xml] [readme] [all] | www | www
Corpus Corporum | 100–1200 | mixed | [readme] | www | www
LLCT1 |  | charters | [txt] [tok] [xml] [xml] [readme] [all] | www | www
LLCT2 |  | charters | [txt] [tok] [conll] [conll] [readme] [all] | www | www
Perseus |  | mixed | [txt] [tok] [readme] [all] | www | www
Proiel |  | mixed | [txt] [tok] [conll] [conll] [readme] [all] | www | www
Thomisticus | 1225–1274 | mixed | [txt] [tok] [conll] [conll] [readme] [all] | www | www

You may also download all Latin corpora files (including readme files) here: all-latin-corpora.zip

Polish

The following corpora are currently available for historical Polish:

  • Middle Polish Diachrone Lemmatised Corpus (PolDiLemma)

Name | Time Period | Genre(s) | Download (Text, Token, Norm, Morph, Syntax, Info, All) | Source | Licence
PolDiLemma | 1567–1800 | mixed | [tok] [txt] [readme] [all] | www | www

You may also download all Polish corpora files (including readme files) here: all-polish-corpora.zip

Portuguese

The following corpora are currently available for historical Portuguese:

  • Tycho Brahe Parsed Corpus of Historical Portuguese (Tycho)

Name | Time Period | Genre(s) | Download (Text, Token, Norm, Morph, Syntax, Info, All) | Source | Licence
Tycho | 1380–1881 | mixed | [readme] | www | www

Russian

The following corpora are currently available for historical Russian:

  • Middle Russian Corpus (RNC)

Name | Time Period | Genre(s) | Download (Text, Token, Norm, Anno, Info, All) | Source | Licence
RNC |  |  | [txt] [tok] [conll] [readme] [all] | www | www

Slovene

The following corpora are currently available for historical Slovene:

  • Digital Library (DigLib)
  • Reference corpus of historical Slovene (RefCorpus)
  • Lexicon of historical Slovene (lex)

Name | Time Period | Genre(s) | Download (Text, Token, Norm, Morph, Syntax, Info, All) | Source | Licence
DigLib | 1584–1918 | mixed | [txt] [tok] [txt] [readme] [all] | www | www
RefCorpus | 1584–1899 | mixed | [txt] [tok] [txt] [txt] [readme] [all] | www | www
lex | 1584–1918 | lexicon | [txt] [readme] | www | www

You may also download all Slovene corpora files (including readme files) here: all-slovene-corpora.zip

Spanish

The following corpora are currently available for historical Spanish:

  • IMPACT-es diachronic corpus, BVC-section (IMPACT BVC)
  • IMPACT-es diachronic corpus, GT-section (IMPACT GT)

Name | Time Period | Genre(s) | Download (Text, Token, Norm, Morph, Syntax, Info, All) | Source | Licence
IMPACT BVC | 1481–1962 | mixed | [tok] [txt] [readme] [all] | www | www
IMPACT GT | 1543–1748 | mixed | [txt] [tok] [readme] [all] | www | www

You may also download all Spanish corpora files (including readme files) here: all-spanish-corpora.zip

Swedish

The following corpora are currently available for historical Swedish:

  • The EDGeS Diachronic Bible Corpus (EDGeS)
  • Fornsvenska Textbanken (Fornsvenska)
  • Selection of books from the Gutenberg Project (Gutenberg)
  • Texts from the Gender and Work project (GaW)
  • Protocols from the Academic Consistory of Uppsala University (Konsistoriet)

Name | Time Period | Genre(s) | Download (Text, Token, Norm, Anno, Info, All) | Source | Licence
EDGeS | 1703–1917 | bible | [txt] [tok] [txt] [readme] [all] | www | www
Fornsvenska | 1350–1758 | mixed | [txt] [tok] [readme] [all] | www | www
GaW | 1527–1812 | court, church | [txt] [tok] [txt] [readme] [all] | www | Free for research
Gutenberg | 1789–1902 | books | [txt] [tok] [readme] [all] | www | www
Konsistoriet | 1624–1699 | protocols | [txt] [tok] [readme] [all] | www | Open Access

You may also download all Swedish corpora files (including readme files) here: all-swedish-corpora.zip




For questions or comments, or if there are corpora that you would like to add to this page, don't hesitate to contact us:

Eva Pettersson, Department of Linguistics and Philology, Uppsala University, eva.pettersson@lingfil.uu.se
Beáta Megyesi, Department of Linguistics and Philology, Uppsala University, beata.megyesi@lingfil.uu.se