DReaM: The Dictionary/Grammar Reading Machine: Computational Tools for Accessing the World's Linguistic Heritage 2018-2020

The DReaM Project is a JPICH Digital Heritage-funded project.

Project Members

Associate Partners (APs)

Work Packages

Work Package  Description                               Responsibility
WP1.1         document scanning                         Harald Hammarström and Søren Wichmann
WP1.2         OCR and OCR postcorrection                Søren Wichmann and Shafqat Virk
WP1.3         importing data to corpus infrastructures  Systems Developer (N. N.)
WP2.1         digitization of dictionaries              Guillaume Segerer and PhD Student (N. N.)
WP2.2         web interface for digital dictionaries    Guillaume Segerer and PhD Student (N. N.)
WP2.3         dictionary App development                Guillaume Segerer and Rémy Bonnet
WP2.4         surveys and evaluation                    PhD Student (N. N.)
WP3.1         linguistic Information Extraction         Søren Wichmann, Shafqat Virk and Harald Hammarström
WP3.2         language Factoid Database                 Søren Wichmann, Shafqat Virk and Harald Hammarström
WP3.3         presentation of results                   Harald Hammarström, Marian Klamer and Stéphane Robert

Project Publications

Allassonnière-Tang, Marc, Olof Lundgren, Maja Robbers, Sandra Cronhamn, Filip Larsson, One-Soon Her, Harald Hammarström & Gerd Carling. (2021) Expansion by migration and diffusion by contact is a source to the global diversity of linguistic nominal categorization systems. Nature: Humanities and Social Sciences Communications 8(331). 1-6, 1-50.
Bonnet, Rémy and. (2020) Générateur de dictionnaires au format Android pour les langues peu dotées (Dictionary App Generator for Less Resourced Languages). In Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 4 : Démonstrations et résumés d'articles internationaux, 6-9. Nancy, France: ATALA et AFCP.
Dannélls, Dana & Shafqat Virk. (2021) A Supervised Machine Learning Approach for Post-OCR Error Detection for Historical Text. In Simon Dobnik, Richard Johansson & Peter Ljunglöf (eds.), Selected contributions from the Eighth Swedish Language Technology Conference (SLTC-2020), 25-27 November 2020, 13-20. Linköping: Linköping Electronic Press.
Foster, Daniel. (2019) Automatic Frame-Semantic Parsing for Linguistic Descriptions: Extracting typological linguistic information from unstructured text. University of Gothenburg MA thesis.
Hammarström, Harald, One-Soon Her & Marc Tang. (2021) Term-Spotting: A quick-and-dirty method for extracting typological features of language from grammatical descriptions. In Simon Dobnik, Richard Johansson & Peter Ljunglöf (eds.), Selected contributions from the Eighth Swedish Language Technology Conference (SLTC-2020), 25-27 November 2020, 27-34. Linköping: Linköping Electronic Press.
Hammarström, Harald, Shafqat Virk & Markus Forsberg. (2017) Extracting Grammar from Grammars: From Raw-Text Descriptions to Grammatical Characteristics of the Languages of the World. Presentation at the Computational Linguistics Seminar, Uppsala.
Hammarström, Harald, Shafqat Virk & Markus Forsberg. (2017) Automatically Filling in Grambank. Presentation at the Glottobank meeting, Waiheke.
Hammarström, Harald, Shafqat Virk & Markus Forsberg. (2019) Text Mining on Grammatical Descriptions of the Languages of the World. Presentation at the Infrastructural Tensions workshop, Uppsala, 29-30 Aug 2019.
Hammarström, Harald. (2019) ¿Cuál es la gramática mas extensa? Ideas computacionales para medir la cantidad de una descripción gramatical de una lengua. Presentation at the Pontificia Universidad Católica de Perú, 9 May 2019, Lima.
Hammarström, Harald. (2021) Inventory and Content Separation in Grammatical Descriptions of Languages of the World. In Gerd Berget, Mark Michael Hall, Daniel Brenn & Sanna Kumpulainen (eds.), Linking Theory and Practice of Digital Libraries: 25th International Conference on Theory and Practice of Digital Libraries, TPDL 2021, Virtual Event, September 13-17, 2021, Proceedings, 129-140. Berlin: Springer.
Hammarström, Harald. (2021) Gramfinder: Human and Machine Reading of Grammatical Descriptions of the Languages of the World. In DaSH '21: Workshop on Data Science with Human-in-the-loop (DaSH), 1-6. DBLP.
Hammarström, Harald. (2021) Measuring Prefixation and Suffixation in the Languages of the World. In Proceedings of The 3rd Workshop on Research in Computational Typology and Multilingual NLP, 81-89. Stroudsburg, PA: Association for Computational Linguistics (ACL).
Hammarström, Harald, Shafqat Mumtaz Virk & Markus Forsberg. (2017) Poor Man's OCR Post-Correction: Unsupervised Recognition of Variant Spelling Applied to a Multilingual Document Collection. In Proceedings of the Digital Access to Textual Cultural Heritage (DATeCH) conference, 71-75. Göttingen: ACM.
Hammarström, Harald. (2020) The State of Description of the World's Languages. Presentation at the University of Helsinki, 15 Jan 2020.
Malm, Per, Shafqat Mumtaz Virk, Lars Borin & Anju Saxena. (2018) LingFN: Towards a Domain-specific Linguistic FrameNet. In Tiago Timponi Torrent, Lars Borin & Collin F. Baker (eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) Workshop 5 --- International FrameNet Workshop 2018: Multilingual FrameNets and Constructicons, 37-43. Paris, France: European Language Resources Association (ELRA).
Segerer, Guillaume. (2016) RefLex: la reconstruction sans peine. Faits de Langues 47. 201-214.
Virk, Shafqat Mumtaz and. (2021) A Data-Driven Semi-Automatic Framenet Development Methodology. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 1471-1479. Held Online: INCOMA Ltd.
Virk, Shafqat Mumtaz and. (2021) A Deep Learning System for Automatic Extraction of Typological Linguistic Information from Descriptive Grammars. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 1480-1489. Held Online: INCOMA Ltd.
Virk, Shafqat Mumtaz, Lars Borin, Anju Saxena & Harald Hammarström. (2017) Automatic Extraction of Typological Linguistic Features from Descriptive Grammars. In Kamil Ekštein & Václav Matoušek (eds.), Text, Speech, and Dialogue: 20th International Conference, TSD 2017, Prague, Czech Republic, August 27-31, 2017, Proceedings (Lecture Notes in Computer Science 10415), 111-119. Berlin: Springer.
Virk, Shafqat Mumtaz, Harald Hammarström, Lars Borin, Markus Forsberg & Søren Wichmann. (2020) From Linguistic Descriptions to Language Profiles. In Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020), 23-27. Marseille: European Language Resources Association (ELRA).
Virk, Shafqat Mumtaz, Harald Hammarström, Markus Forsberg & Søren Wichmann. (2020) The DReaM Corpus: A Multilingual Annotated Corpus of Grammars for the World’s Languages. In Proceedings of The 12th Language Resources and Evaluation Conference, 871-877. Marseille, France: European Language Resources Association.
Virk, Shafqat Mumtaz, Azam Sheikh Muhammad, Lars Borin, Muhammad Irfan Aslam, Saania Iqbal & Nazia Khurram. (2019) Exploiting Frame-Semantics and Frame-Semantic Parsing for Automatic Extraction of Typological Information from Descriptive Grammars of Natural Languages. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), 1247-1256. Varna, Bulgaria: INCOMA Ltd.
Virk, Shafqat, Lars Borin, Per Malm, Anju Saxena, Markus Forsberg, Harald Hammarström, M. Azam & M. Irfan. (2019) LingFN: a FrameNet for the Linguistics Domain. Presentation at the CLT Retreat, 8 May 2019.
Virk, Shafqat, Per Malm, Lars Borin & Anju Saxena. (2019) LingFN: A FrameNet for the Linguistics Domain. In CICLing 2019: Short Oral Presentations and Poster Session, 1-12.
Wichmann, Søren, Harald Hammarström & Shafqat Virk. (2019) Information extraction of linguistic typological information from grammatical descriptions. Presentation at the Universidade Federal de Minas Gerais, 4 Nov 2019.
Wichmann, Søren & Taraka Rama. (2019) Towards unsupervised extraction of linguistic typological features from language descriptions. First Workshop on Typology for Polyglot NLP, Florence, Aug. 1, 2019 (Co-located with ACL, July 28-Aug. 2, 2019).
Wichmann, Søren. (2017) The DReaM Project: A dictionary/grammar reading machine. Presentation at the Kazan University philological faculty.
Wichmann, Søren. (2021) Pipeline for a Data-driven Network of Linguistic Terms. In Simon Dobnik, Richard Johansson & Peter Ljunglöf (eds.), Selected contributions from the Eighth Swedish Language Technology Conference (SLTC-2020), 25-27 November 2020, 66-71. Linköping: Linköping Electronic Press.
Zariquiey, Roberto, Mónica Arakaki, Javier Vera, Guido Torres, Claret Cuba, Carlos Barrientos, Aracelli García, Adriano Ingunza & Harald Hammarström. (in press) Linking endangerment databases and descriptive linguistics: an assessment of the use of terms relating to language endangerment in grammars. Language Documentation & Conservation. 27pp.

Links

Bibliography of relevant publications (from 2017)

A .bib file of this is here.

Bender, Emily M., Joshua Crowgey, Michael Wayne Goodman & Fei Xia. (2014) Learning Grammar Specifications from IGT: A Case Study of Chintang. In Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, 43-53. Baltimore, Maryland, USA: Association for Computational Linguistics.
Bickel, Balthasar. (2015) Distributional typology: statistical inquiries into the dynamics of linguistic diversity. In Bernd Heine & Heiko Narrog (eds.), The Oxford Handbook of Linguistic Analysis, 901-923. 2nd edn. Oxford: Oxford University Press.
Borin, Lars, Shafqat Virk & Anju Saxena. (2016) Towards a Big Data View on South Asian Linguistic Diversity. In WILDRE-3 - 3rd Workshop on Indian Language Data: Resources and Evaluation, 87-92. ELRA.
Cooper, Doug. (2014) Logistics of the Asia-Pacific Linguistic Data Warehouse. Paper presented at the Language Comparison with Linguistic Databases: RefLex and Typological Databases, 7-8 Oct 2014.
Cysouw, Michael. (2011) Typology without Types: Quantitatively inducing a Numeral Typology. Poster presented at the 9th biannual meeting of the Association for Linguistic Typology, ALT9, Hong Kong, China.
Dryer, Matthew S. (2006) Descriptive theories, explanatory theories, and basic linguistic theory. In Felix Ameka, Alan Dench & Nicholas Evans (eds.), Catching Language: Issues in Grammar Writing, 207-234. Berlin: Mouton de Gruyter.
Dryer, Matthew. (forthcoming) World Atlas of Word Order in Language. Oxford: Oxford University Press.
Evans, Nicholas & Stephen Levinson. (2009) The Myth of Language Universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences 32(5). 429-492.
Güldemann, Tom. (2010) "Sprachraum" and geography: Linguistic macro-areas in Africa. In Alfred Lameli, Roland Kehrein & Stefan Rabanus (eds.), Language and Space: An International Handbook of Linguistic Variation Volume 2: Language Mapping (Handbooks of Linguistics and Communication Science 30/2), 561-585. Berlin: Mouton de Gruyter.
Hammarström, Harald, Shafqat Mumtaz Virk & Markus Forsberg. (2017) Poor Man's OCR Post-Correction: Unsupervised Recognition of Variant Spelling Applied to a Multilingual Document Collection. In Proceedings of the Digital Access to Textual Cultural Heritage (DATeCH) conference, 71-75. Göttingen: ACM.
Hammarström, Harald, Shafqat Virk & Markus Forsberg. (2017) Extracting Grammar from Grammars: From Raw-Text Descriptions to Grammatical Characteristics of the Languages of the World. Presentation at the Computational Linguistics Seminar, Uppsala.
Hammarström, Harald, Shafqat Virk & Markus Forsberg. (2017) Automatically Filling in Grambank. Presentation at the Glottobank meeting, Waiheke.
Hammarström, Harald. (2013) Three Approaches to Prefix and Suffix Statistics in the Languages of the World. Paper presented at the Workshop on Corpus-based Quantitative Typology (CoQuaT 2013).
Harris, Zellig S. (1951) Methods in structural linguistics. Chicago: University of Chicago Press.
Himmelmann, Nikolaus. (2014) Asymmetries in the prosodic phrasing of function words: Another look at the suffixing preference. Language 90(4). 927-960.
Kamholz, David, Jonathan Pool & Susan Colowick. (2014) PanLex: Building a Resource for Panlingual Lexical Translation. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik, Iceland: European Language Resources Association (ELRA).
Littell, Patrick, Aidan Pine & Henry Davis. (2017) Waldayu and Waldayu Mobile: Modern digital dictionary interfaces for endangered languages. Association for Computational Linguistics.
Macklin-Cordes, Jayden L., Nathaniel L. Blackbourne, Thomas J. Bott, Jacqueline Cook, T. Mark Ellison, Jordan Hollis, Edith E. Kirlew, Genevieve C. Richards, Sanle Zhao & Erich R. Round. (2017) Robots who read grammars. Poster presented at CoEDL Fest 2017, Alexandra Park Conference Centre, Alexandra Headlands, QLD.
Manning, Christopher D., Prabhakar Raghavan & Hinrich Schütze. (2008) Introduction to Information Retrieval. Cambridge: Cambridge University Press.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Gregory S. Corrado & Jeffrey Dean. (2013) Distributed Representations of Words and Phrases and their Compositionality. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani & Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 26 (NIPS 2013), 3111-3119. Lake Tahoe, Nevada: Neural Information Processing Systems.
Nivre, Joakim, Željko Agić, Lars Ahrenberg & Maria Jesus Aranzabe. (2017) Universal Dependencies 2.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague.
Plank, Frank. (2009) WALS values evaluated. Linguistic Typology 13(1). 41-75.
Polyakov, Vladimir N., Valery D. Solovyev, Søren Wichmann & Oleg Belyaev. (2009) Using WALS and Jazyki Mira. Linguistic Typology 13. 137-167.
Saussure, Ferdinand de. (1916) Cours de linguistique générale. Paris: Payot.
Segerer, Guillaume. (2016) RefLex: la reconstruction sans peine. Faits de Langues 47. 201-214.
Tsunoda, Tasaku. (2005) Language Endangerment and Language Revitalization (Trends in Linguistics: Studies and Monographs 148). Berlin: Mouton de Gruyter.
Virk, Shafqat Mumtaz, Lars Borin, Anju Saxena & Harald Hammarström. (2017) Automatic Extraction of Typological Linguistic Features from Descriptive Grammars. In Kamil Ekštein & Václav Matoušek (eds.), Text, Speech, and Dialogue: 20th International Conference, TSD 2017, Prague, Czech Republic, August 27-31, 2017, Proceedings (Lecture Notes in Computer Science 10415), 111-119. Berlin: Springer.
Virk, Shafqat, Markus Forsberg & Harald Hammarström. (2017) TextCat for Language Profiling. Submitted.
Xia, Fei, William D. Lewis, Michael Wayne Goodman, Glenn Slayden, Ryan Georgi, Joshua Crowgey & Emily M. Bender. (2016) Enriching a massively multilingual database of interlinear glossed text. Language Resources and Evaluation 50(2). 1-29.