DReaM: The Dictionary/Grammar Reading Machine: Computational Tools for Accessing the World's Linguistic Heritage 2018-2020

The DReaM Project is a JPICH Digital Heritage-funded project.

Project Members

Associate Partners (APs)

Work Packages

Work Package | Description | Responsibility
WP1.1 | document scanning | Harald Hammarström and Søren Wichmann
WP1.2 | OCR and OCR postcorrection | Søren Wichmann and Shafqat Virk
WP1.3 | importing data to corpus infrastructures | Systems Developer (N. N.)
WP2.1 | digitization of dictionaries | Guillaume Segerer and PhD Student (N. N.)
WP2.2 | web interface for digital dictionaries | Guillaume Segerer and PhD Student (N. N.)
WP2.3 | dictionary App development | Guillaume Segerer and Rémy Bonnet
WP2.4 | surveys and evaluation | PhD Student (N. N.)
WP3.1 | linguistic Information Extraction | Søren Wichmann, Shafqat Virk and Harald Hammarström
WP3.2 | language Factoid Database | Søren Wichmann, Shafqat Virk and Harald Hammarström
WP3.3 | presentation of results | Harald Hammarström, Marian Klamer and Stéphane Robert

Project Publications (so far, mid-2019)

Aslam, Muhammad Irfan. (2019) Semantic frame based automatic extraction of typological information from descriptive grammars. University of Skövde MA thesis.
Foster, Daniel. (2019) Automatic Frame-Semantic Parsing for Linguistic Descriptions: Extracting typological linguistic information from unstructured text. University of Gothenburg MA thesis.
Hammarström, Harald, Shafqat Virk & Markus Forsberg. (2019) Text Mining on Grammatical Descriptions of the Languages of the World. Presentation at the Infrastructural Tensions workshop, Uppsala, 29-30 Aug 2019.
Hammarström, Harald. (2019) ¿Cuál es la gramática más extensa? Ideas computacionales para medir la cantidad de una descripción gramatical de una lengua. Presentation at the Pontificia Universidad Católica de Perú, 9 May 2019, Lima.
Virk, Shafqat Mumtaz, Azam Sheikh Muhammad, Lars Borin, Muhammad Irfan Aslam, Saania Iqbal & Nazia Khurram. (2019) Exploiting Frame-Semantics and Frame-Semantic Parsing for Automatic Extraction of Typological Information from Descriptive Grammars of Natural Languages. In Proceedings of RANLP 2019.
Virk, Shafqat, Lars Borin, Per Malm, Anju Saxena, Markus Forsberg, Harald Hammarström, M. Azam & M. Irfan. (2019) LingFN: a FrameNet for the Linguistics Domain. Presentation at the CLT Retreat, 8 May 2019.
Virk, Shafqat, Per Malm, Lars Borin & Anju Saxena. (2019) LingFN: A FrameNet for the Linguistics Domain. In Proceedings of CICLing 2019.
Wichmann, Søren, Harald Hammarström & Shafqat Virk. (2019) Information extraction of linguistic typological information from grammatical descriptions. Presentation at the Universidade Federal de Minas Gerais, 4 Nov 2019.
Wichmann, Søren & Taraka Rama. (2019) Towards unsupervised extraction of linguistic typological features from language descriptions. First Workshop on Typology for Polyglot NLP, Florence, Aug. 1, 2019 (Co-located with ACL, July 28-Aug. 2, 2019).
Wichmann, Søren. (2017) The DReaM Project: A dictionary/grammar reading machine. Presentation at the Kazan University philological faculty.


Bibliography of relevant publications (from 2017)

This bibliography is also available as a .bib file.

Bender, Emily M., Joshua Crowgey, Michael Wayne Goodman & Fei Xia. (2014) Learning Grammar Specifications from IGT: A Case Study of Chintang. In Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, 43-53. Baltimore, Maryland, USA: Association for Computational Linguistics.
Bickel, Balthasar. (2015) Distributional typology: statistical inquiries into the dynamics of linguistic diversity. In Bernd Heine & Heiko Narrog (eds.), The Oxford Handbook of Linguistic Analysis, 901-923. 2nd edn. Oxford: Oxford University Press.
Borin, Lars, Shafqat Virk & Anju Saxena. (2016) Towards a Big Data View on South Asian Linguistic Diversity. In WILDRE-3 - 3rd Workshop on Indian Language Data: Resources and Evaluation, 87-92. ELRA.
Cooper, Doug. (2014) Logistics of the Asia-Pacific Linguistic Data Warehouse. Paper presented at the Language Comparison with Linguistic Databases: RefLex and Typological Databases, 7-8 Oct 2014.
Cysouw, Michael. (2011) Typology without Types: Quantitatively inducing a Numeral Typology. Poster presented at the 9th biannual meeting of the Association for Linguistic Typology, ALT9, Hong Kong, China.
Dryer, Matthew S. (2006) Descriptive theories, explanatory theories, and basic linguistic theory. In Felix Ameka, Alan Dench & Nicholas Evans (eds.), Catching Language: Issues in Grammar Writing, 207-234. Berlin: Mouton de Gruyter.
Dryer, Matthew. (forthcoming) World Atlas of Word Order in Language. Oxford: Oxford University Press.
Evans, Nicholas & Stephen Levinson. (2009) The Myth of Language Universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences 32(5). 429-492.
Güldemann, Tom. (2010) "Sprachraum" and geography: Linguistic macro-areas in Africa. In Alfred Lameli, Roland Kehrein & Stefan Rabanus (eds.), Language and Space: An International Handbook of Linguistic Variation Volume 2: Language Mapping (Handbooks of Linguistics and Communication Science 30/2), 561-585. Berlin: Mouton de Gruyter.
Hammarström, Harald, Shafqat Mumtaz Virk & Markus Forsberg. (2017) Poor Man's OCR Post-Correction: Unsupervised Recognition of Variant Spelling Applied to a Multilingual Document Collection. In Proceedings of the Digital Access to Textual Cultural Heritage (DATeCH) conference, 71-75. Göttingen: ACM.
Hammarström, Harald, Shafqat Virk & Markus Forsberg. (2017) Extracting Grammar from Grammars: From Raw-Text Descriptions to Grammatical Characteristics of the Languages of the World. Presentation at the Computational Linguistics Seminar, Uppsala.
Hammarström, Harald, Shafqat Virk & Markus Forsberg. (2017) Automatically Filling in Grambank. Presentation at the Glottobank meeting, Waiheke.
Hammarström, Harald. (2013) Three Approaches to Prefix and Suffix Statistics in the Languages of the World. Paper presented at the Workshop on Corpus-based Quantitative Typology (CoQuaT 2013).
Harris, Zellig S. (1951) Methods in structural linguistics. Chicago: University of Chicago Press.
Himmelmann, Nikolaus. (2014) Asymmetries in the prosodic phrasing of function words: Another look at the suffixing preference. Language 90(4). 927-960.
Kamholz, David, Jonathan Pool & Susan Colowick. (2014) PanLex: Building a Resource for Panlingual Lexical Translation. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik, Iceland: European Language Resources Association (ELRA).
Littell, Patrick, Aidan Pine & Henry Davis. (2017) Waldayu and Waldayu Mobile: Modern digital dictionary interfaces for endangered languages. Association for Computational Linguistics.
Macklin-Cordes, Jayden L., Nathaniel L. Blackbourne, Thomas J. Bott, Jacqueline Cook, T. Mark Ellison, Jordan Hollis, Edith E. Kirlew, Genevieve C. Richards, Sanle Zhao & Erich R. Round. (2017) Robots who read grammars. Poster presented at CoEDL Fest 2017, Alexandra Park Conference Centre, Alexandra Headlands, QLD.
Manning, Christopher D., Prabhakar Raghavan & Hinrich Schütze. (2008) Introduction to Information Retrieval. Cambridge: Cambridge University Press.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Gregory S. Corrado & Jeffrey Dean. (2013) Distributed Representations of Words and Phrases and their Compositionality. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani & Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 26 (NIPS 2013), 3111-3119. Lake Tahoe, Nevada: Neural Information Processing Systems.
Nivre, Joakim, Željko Agić, Lars Ahrenberg & Maria Jesus Aranzabe. (2017) Universal Dependencies 2.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague.
Plank, Frank. (2009) WALS values evaluated. Linguistic Typology 13(1). 41-75.
Polyakov, Vladimir N., Valery D. Solovyev, Søren Wichmann & Oleg Belyaev. (2009) Using WALS and Jazyki Mira. Linguistic Typology 13. 137-167.
Saussure, Ferdinand de. (1916) Cours de linguistique générale. Paris: Payot.
Segerer, Guillaume. (2016) RefLex: la reconstruction sans peine. Faits de Langues 47. 201-214.
Tsunoda, Tasaku. (2005) Language Endangerment and Language Revitalization (Trends in Linguistics: Studies and Monographs 148). Berlin: Mouton de Gruyter.
Virk, Shafqat Mumtaz, Lars Borin, Anju Saxena & Harald Hammarström. (2017) Automatic Extraction of Typological Linguistic Features from Descriptive Grammars. In Kamil Ekštein & Václav Matoušek (eds.), Text, Speech, and Dialogue: 20th International Conference, TSD 2017, Prague, Czech Republic, August 27-31, 2017, Proceedings (Lecture Notes in Computer Science 10415), 111-119. Berlin: Springer.
Virk, Shafqat, Markus Forsberg & Harald Hammarström. (2017) TextCat for Language Profiling. Submitted.
Xia, Fei, William D. Lewis, Michael Wayne Goodman, Glenn Slayden, Ryan Georgi, Joshua Crowgey & Emily M. Bender. (2016) Enriching a massively multilingual database of interlinear glossed text. Language Resources and Evaluation 50(2). 1-29.