Ali Basirat

Master Thesis Topics

A list of topics suggested by Ali Basirat for a master thesis in language technology.

  1. Word Embeddings

    1. Jimmy Callin (2017), Word representations and machine learning models for implicit sense classification in shallow discourse parsing, Master thesis, Uppsala University
    2. Guanchen Song (2019), Multilingual word embeddings based on cross-lingual annotations, Master thesis, Uppsala University
    3. Andrew Dyer (2019), Low supervision, low corpus size, low similarity! Challenges in cross-lingual alignment of word embeddings, Master thesis, Uppsala University
    4. Adam Moss (2020), Detecting Lexical Semantic Change Using Probabilistic Gaussian Word Embeddings, Master thesis, Uppsala University
    5. Shifei Chen, Ali Basirat (2020), Cross-lingual Word Embeddings beyond Zero-shot Machine Translation, In the 8th Swedish Language Technology Conference (SLTC), Sweden
    6. Shifei Chen (2020), Cross-lingual Word Embeddings beyond Zero-shot Machine Translation, Master thesis, Uppsala University

  2. Joint POS Tagging and Dependency Parsing
    A dependency parser searches for syntactic relations between words of a sentence. Among the different features used by a dependency parser, part-of-speech (POS) tags of words are highly informative to find the correct dependency relations. However, such a positive contribution depends on the accuracy of the POS tagging. An incorrect POS tag can lead to wrong decisions by the parser and error propagation.
    POS tagging is not yet a solved problem, especially when it comes to universal POS (UPOS) tags. The macro-averaged accuracy of state-of-the-art UPOS tagging is 93.4. One way to mitigate the negative effect of erroneous POS tags on dependency parsing is to train the tagger and the parser jointly. Bohnet and Nivre (2012) show that joint learning of POS tagging and dependency parsing can improve the accuracy of both tasks.
    In this research, you are expected to explore the effect of joint POS tagging and dependency parsing on the accuracy of a neural dependency parser. Ideally, the research covers both transition-based and graph-based parsing. Note that joint tagging and parsing is computationally much harder in the graph-based case (and even intractable in the non-projective case). The project is intended as a master thesis supervised by Joakim Nivre and Ali Basirat.
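    As a rough illustration of what joint tagging and parsing can look like in the transition-based case, the sketch below runs a toy arc-standard system in which the shift transition also assigns a POS tag. It is a simplified illustration, not the exact system of Bohnet and Nivre (2012); the function name `run_joint_parser`, the action names, and the example sentence are assumptions made here, and the transition sequence is given by hand rather than predicted by a model.

```python
def run_joint_parser(words, transitions):
    """Apply a sequence of (action, argument) pairs to a sentence.

    SHIFT takes a POS tag as its argument, so tagging decisions are made
    inside the parser. LEFT and RIGHT take a dependency label.
    """
    stack, buffer = [], list(range(len(words)))
    tags, heads = {}, {}
    for action, arg in transitions:
        if action == "SHIFT":
            # Move the next word to the stack and tag it at the same time.
            index = buffer.pop(0)
            tags[index] = arg
            stack.append(index)
        elif action == "LEFT":
            # The second-topmost word becomes a dependent of the topmost word.
            dependent = stack.pop(-2)
            heads[dependent] = (stack[-1], arg)
        elif action == "RIGHT":
            # The topmost word becomes a dependent of the second-topmost word.
            dependent = stack.pop()
            heads[dependent] = (stack[-1], arg)
        else:
            raise ValueError(f"unknown action: {action}")
    return tags, heads


# Hand-written transition sequence for a toy sentence (hypothetical example).
words = ["She", "reads", "books"]
transitions = [
    ("SHIFT", "PRON"),
    ("SHIFT", "VERB"),
    ("LEFT", "nsubj"),
    ("SHIFT", "NOUN"),
    ("RIGHT", "obj"),
]
tags, heads = run_joint_parser(words, transitions)
print(tags)
print(heads)
```

    In a learned parser, a classifier would score tagged shift actions together with the arc actions, so that a tagging decision can be informed by the syntactic context built so far.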

    Bernd Bohnet and Joakim Nivre (2012), A Transition-Based System for Joint Part-of-Speech Tagging and Labeled Non-Projective Dependency Parsing, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

  3. A Philosophical Study of Religious Concepts through Cross-lingual Word Embeddings
    Translating a religious or philosophical message is a difficult task, especially when the traditions shaping the vocabulary of the target language are very different from those of the source language. It is reasonable to think that such translation would alter the message in some way, but also that it has the potential to change the target language itself by introducing new words and/or new meanings for old words. In the Early Modern period, beginning with the Portuguese colonization of Goa, Christian missionaries from Europe created a considerable amount of Christian literature in Marathi and Konkani. In the case of Goan Konkani, conversions of a broad section of the population led to the emergence of different codes of Konkani: “Christian Konkani” written in Latin script and “Hindu Konkani” written in Devanagari and Kannada scripts. Although analysis of the historical material is perhaps too difficult at this stage, as there is no digitized corpus, interesting studies can be done on modern corpora.
    This project aims at studying the effectiveness of computational linguistic methods to link Christian theological key concepts from Portuguese (or possibly Latin) to Konkani and/or Marathi. The problem can be formulated in different ways, depending on the availability of certain types of data. One way is to formulate it as a bilingual lexicon induction task, which relies only on monolingual raw data and relatively small dictionaries between the languages. In this formulation, we do not incorporate the explicit human annotations of parallel corpora, but we look for interlingual similarities between the two languages.
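    One common way to set up bilingual lexicon induction is to learn an orthogonal map between two monolingual embedding spaces from a small seed dictionary and then translate by nearest-neighbour search. The sketch below shows this with NumPy on synthetic vectors. It is only an illustration under stated assumptions: the Procrustes solution is one of several possible alignment methods and is not prescribed by the project, the function names `procrustes_map` and `induce_lexicon` are hypothetical, and the random data stands in for real pre-trained embeddings.

```python
import numpy as np


def procrustes_map(seed_src, seed_tgt):
    """Orthogonal W minimising ||seed_src @ W - seed_tgt||_F.

    Row i of seed_src and row i of seed_tgt are the vectors of one
    dictionary pair.
    """
    u, _, vt = np.linalg.svd(seed_src.T @ seed_tgt)
    return u @ vt


def induce_lexicon(src_vecs, tgt_vecs, w):
    """Index of the nearest target word (by cosine) for each source word."""
    mapped = src_vecs @ w
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    return (mapped @ tgt.T).argmax(axis=1)


# Synthetic data: the "source" space is an exact rotation of the "target"
# space, so word i in the source corresponds to word i in the target.
rng = np.random.default_rng(0)
target_vecs = rng.normal(size=(50, 8))
true_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
source_vecs = target_vecs @ true_rotation.T

# Use the first 20 word pairs as the small seed dictionary.
w = procrustes_map(source_vecs[:20], target_vecs[:20])
predicted = induce_lexicon(source_vecs, target_vecs, w)
print((predicted == np.arange(50)).mean())
```

    With real embeddings the two spaces are not exact rotations of each other, so the induced lexicon is noisy and is usually evaluated against held-out dictionary entries.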
    At a minimum, the project should include the following two analyses: 1) a comparative analysis focusing on the terms used to express Christian theological key concepts from Portuguese (or possibly Latin) to Konkani and/or Marathi; 2) a similar comparative analysis of corpora of Latin-script Konkani and Devanagari-script Konkani, respectively. Both studies could give valuable insights into semantic and other linguistic changes resulting from the translation of religio-philosophical messages (in this case, Christian). Further explorations can be discussed later.
    The project will be supervised by Ali Basirat and Pär Eliasson.

  4. Document Classification

    1. Muhammad Ammar Shadiq Ad Darman (2018), Automatic bug report assignment using multilevel recurrent neural networks, Master thesis, Uppsala University
    2. Lukas Borggren (2021), Automatic Categorization of News Articles With Contextualized Language Models, Master thesis, Linköping University

  5. Information Extraction

    1. Yaxi Zhang (2019), Named-entity recognition on social media data, Master thesis, Uppsala University
    2. Agaton Sjöberg (2021), Extracting Transaction Information from Financial Press Releases, Master thesis, Linköping University

  6. Word Embedding through Kernel PCA
    Kernel PCA adds some degree of nonlinearity to PCA. This project aims at using different types of kernels to generate word embeddings. The word embeddings are evaluated on standard NLP tasks such as part-of-speech tagging, named-entity recognition, and syntactic parsing.
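    The basic idea can be sketched in a few lines of NumPy: compute a kernel matrix between the contextual count vectors of words, centre it, and keep the leading eigenvectors as embeddings. This is a minimal sketch under assumptions made here, not the project's implementation: the RBF kernel, the `gamma` value, the log transform of the counts, the function name `kernel_pca_embeddings`, and the toy co-occurrence matrix are all illustrative choices.

```python
import numpy as np


def kernel_pca_embeddings(context_vectors, dim, gamma=1.0):
    """Embed each row of context_vectors with RBF kernel PCA."""
    x = np.asarray(context_vectors, dtype=float)
    # Pairwise squared Euclidean distances between word rows.
    sq_norms = (x ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    kernel = np.exp(-gamma * np.maximum(sq_dists, 0.0))
    # Centre the kernel matrix in feature space.
    n = kernel.shape[0]
    centering = np.eye(n) - np.full((n, n), 1.0 / n)
    kernel_centered = centering @ kernel @ centering
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(kernel_centered)
    top = np.argsort(eigvals)[::-1][:dim]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))


# Toy word-by-context co-occurrence counts (4 words, 3 context features).
counts = np.array([
    [10, 0, 2],
    [8, 1, 3],
    [0, 9, 1],
    [1, 7, 0],
])
embeddings = kernel_pca_embeddings(np.log1p(counts), dim=2, gamma=0.5)
print(embeddings.shape)
```

    Swapping the line that builds `kernel` for another kernel function (polynomial, cosine, and so on) is the main degree of freedom. For realistic vocabulary sizes the full kernel matrix is too large to store, so an approximation would be needed.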

  7. To Enrich Principal Word Vectors with Character-level Information
    This project aims at extending the principal word embedding method to encode character-level information. The character-level information can be encoded as n-grams of character sequences forming words. The statistics of character n-grams are to be concatenated to contextual word vectors from which principal word embeddings/vectors are extracted.
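    The concatenation step might look roughly as follows. This is a sketch under assumptions made here: the word boundary markers `<` and `>`, the choice of trigrams, the use of word-internal n-gram counts as the "statistics", and the names `char_ngrams` and `augment_with_ngrams` are illustrative and not taken from the principal word embedding method itself.

```python
from collections import Counter


def char_ngrams(word, n=3):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def augment_with_ngrams(context_vectors, n=3):
    """Append character n-gram counts to each word's contextual vector.

    context_vectors maps a word to a list of contextual counts.
    Returns the augmented vectors and the n-gram vocabulary that defines
    the order of the appended dimensions.
    """
    vocab = sorted({g for word in context_vectors for g in char_ngrams(word, n)})
    index = {gram: i for i, gram in enumerate(vocab)}
    augmented = {}
    for word, context in context_vectors.items():
        ngram_part = [0] * len(vocab)
        for gram, count in Counter(char_ngrams(word, n)).items():
            ngram_part[index[gram]] = count
        augmented[word] = list(context) + ngram_part
    return augmented, vocab


# Toy contextual vectors with two context features per word.
toy_vectors = {"cat": [2, 0], "cats": [1, 1]}
augmented, vocab = augment_with_ngrams(toy_vectors)
print(vocab)
print(augmented)
```

    The augmented vectors would then be passed to the same dimensionality reduction step that is used for the purely contextual vectors.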

  8. To Evaluate Principal Word Vectors for Document Representation
    The spectral word embedding methods are closely related to latent semantic analysis, a standard framework for document representation. The principal word embedding (PWE) method is a spectral method that provides features for document representation. This project aims at evaluating the principal document vectors, trained by PWE, and comparing them with other standard document representations.

  9. Linguistic Information in Word Embeddings
    This project will be in line with the papers listed here. The project aims at investigating what type of linguistic information is captured by word embeddings. By linguistic information, we mean linguistically motivated word categories such as grammatical gender, count/mass, plural/singular, etc. Some example studies have been done by previous students:

    1. Hartger Veeman, Marc Allassonniere-Tang, Aleksandrs Berdicevskis, Ali Basirat, Cross-lingual Embeddings Reveal Universal and Lineage-Specific Patterns in Grammatical Gender Assignment, In Proceedings of the SIGNLL Conference on Computational Natural Language Learning (CoNLL-2020)
    2. Hartger Veeman, Ali Basirat (2020), An exploration of the encoding of grammatical gender in word embeddings, The 8th Swedish Language Technology Conference (SLTC-2020)
    3. Hartger Veeman (2020), A comparative study of the grammatical gender systems of languages by means of analysing word embeddings, Master thesis, Uppsala University
  10. Bridge the Gap Between Lexicalized Grammars
    Lexicalized grammars such as CCG, HPSG, and LTAG associate the lexical items of a language with elementary structures. This project aims at linking different lexicalized grammars of a language. It will be connected to the papers listed here.