Uppsala universitet  

Projects in Machine Translation
Master Programme, 5 credit students (5LN718)
Bachelor students (5LN426)

Organisation
Project Work
Individual Reflection Report
Seminar Presentation
Possible Topics
Resources

Aim

The goal of the projects is
  1. to study background literature
  2. to carry out a practical assignment related to the selected topic
  3. to prepare a final report and seminar presentation describing the results

Deadlines

  • September 24: Hand in your topic preferences
  • November 5 and 8: Seminar presentations (detailed schedule TBA)
  • November 9: Hand in final group report
  • November 9: Hand in individual reflection reports
  • (December 7: backup deadline for reports)

Organisation

You will work in groups of 3-4 students. We will put the groups together based on which topics you prefer to work on. The list of topic suggestions can be found at the bottom of this page. Hand in your preferences by email to Sara by September 24 at the latest: give a ranked list of at least three different topics you could consider working on. We will try our best to accommodate everyone's wishes, but we cannot guarantee that you will get your preferred topics. If you do not hand in your preferences by September 24, you will be assigned a topic arbitrarily.

Project work

For each topic you should perform a practical project, in which you apply some of the concepts related to your topic in practice. This includes setting up and running MT systems, normally with Moses, and evaluating and comparing systems that differ in some aspect related to your topic. Each project will be assigned a supervisor, with whom you should discuss how to set up your project. The project work should be presented in a group report, written in English.

It is possible to divide the work in the group so that different people perform different experiments. However, each person in the group must set up and run at least one MT system (the systems could differ with respect to some aspect related to your topic, language pair, translation direction, etc.). You are jointly responsible for writing the report, and each person should understand and be familiar with all parts of it. It is obligatory to have at least one meeting with your supervisor where you discuss how you divide the work and show that everyone can set up and run MT systems.

Your project also includes learning more about the theoretical background of your topic: you should describe the basics of the topic and refer to the relevant literature.

In the final report, you are expected to

  • describe the background in terms of the concepts, approaches and techniques within your selected topic, including references to several journal and/or conference articles
  • describe your project and motivate your experimental setup
  • summarize, evaluate and analyze your results
  • describe possible shortcomings and ideas for improvement

As long as your experiments are well motivated, you may simply experiment with running the given tools and investigating some aspects of translation relevant to your topic. This can be done by varying parameters of the different tools, the domain of the data sets, etc. Evaluation should minimally be performed using the BLEU metric (unless otherwise specified in your topic description). Be sure to carefully analyse and discuss your results. Depending on the project topic, the number of members in the group, and your interests, you may also need or want to write some code of your own, install and use new software, or do a more involved evaluation; however, this is not in general a requirement for passing the course.

Individual reflection report

In addition to the group report, each student should also hand in a short individual reflection report. The report should be 1-2 A4 pages (not more), and should be written in English. The report should consist of two parts:

  • A description of your role in the project group and what you personally did in the project, including which MT systems you trained.
  • A reflection related to the literature: pick a recent conference article related to your topic, summarize it, and briefly discuss how your project work relates to that work. Do not pick the same article as any other group member.

Seminar Presentations

You will give one seminar presentation during the last week of the course. The goal of the presentation is to share with the class the outcome of your experiments, including the design, the different experiments you ran, the results, and an analysis of them.

It is up to the students in each group to decide how to organize the presentation. Each student should present some part of the work. All students in the group should know the contents of the whole presentation and be prepared to answer questions. It is compulsory for all students to attend all seminars. Please inform your teacher beforehand if you cannot participate. We will then find an alternative solution. The presentation should be given in English.

The schedule and length of the presentations will be announced in good time.

The seminars will be held on November 5 and 8.

Project Topics

Here is a list of project ideas and a short description of their main goals. Note that the exact descriptions are suggestions that should be discussed with your supervisor. Note that there might be fewer groups than there are topic suggestions.

Each project consists of the following parts:

  • building a baseline system (to have something to compare with)
  • building new system(s) (at least as many as the number of project members, but also depending on the topic)
  • building a system consists of training, tuning and testing it

List of Topics

  • Parameter Tuning

  • Factored SMT models
    • Project: train and compare various factored SMT models
      - include factors such as POS tags, lemmas, syntactic function
      - compare various combinations of translation and generation steps
      - tools: Moses, tagger and lemmatizer (hunpos, TreeTagger) and parsers with existing models
      - data recommendation: translated movie subtitles

  • Language Modeling and Domains
    • Project: Explore language models and their parameters and the impact of domains
      - investigate the effect of data size on translation quality
      - compare the use of in-domain versus out-of-domain data (perplexity and translation quality)
      - combinations of in-domain and out-of-domain LMs
      - tools: KenLM, SRILM, Moses
      - data recommendation: translated movie subtitles and data from other domains

  • Word alignment and Phrase-Based SMT
    • Project: Explore the impact of word alignment on SMT quality
      - different settings for GIZA++
      - different symmetrization heuristics
      - difference between alignment of wordforms, lemmas, (POS tags?)
      - other alignment tools: anymalign, (Berkeley aligner?)
      - tools: GIZA++, Moses, anymalign, TreeTagger, ...
      - data recommendation: translated movie subtitles

  • Re-ordering and SMT
    • Project: Apply and compare different re-ordering approaches
      - lexicalized re-ordering models
      - re-ordering constraints (see Moses: hybrid translation)
      - possibly pre-ordering (before training/decoding)
      - tools: Moses, external or own tools
      - data recommendation: translated movie subtitles

  • Domains and evaluation
    • Project: Explore the influences of different domains on training and test data, and evaluate through several different methods
      - Vary the domain in training, dev and test data
      - Train on mixed data or data from a single domain
      - Possibly: explore methods for domain adaptation
      - Evaluate using different automatic metrics, possibly applying significance testing
      - Evaluate using some manual or semi-automatic method
      - tools: Moses, evaluation metrics
      - data recommendation: translated movie subtitles and data from other domains

  • Negation Translation
    • Project: Explore the translations of the negation phenomenon
      - train SMT (and possibly NMT) models and analyse the translation of negation
      - propose and preferably implement some possible methods to improve the performance
      - tools: Moses (if NMT: Sockeye, Marian, OpenNMT, NeuralMonkey)
      - data recommendation: movie subtitles, Europarl and/or news commentary
      - data/test suites: you might want to create a specific test set containing instances of negation phenomena

Resources

General Resources for the projects will be listed and linked here.

Data

It is up to each project group to decide which language pair(s) and corpus (or corpora) to work on. You will then download the data needed for your project from OPUS: http://opus.nlpl.eu/ . Choose the language pair you are interested in, and download the data in Moses format, which means that it has been sentence aligned. Once you have downloaded a corpus, you need to split it into test, development, training and mono-training parts. These sets should be disjoint, i.e. not contain the same sentences (unless the corpus you use happens to have a lot of repetition), except for the training and mono-training data, which may overlap. Recommended sizes for the majority of projects are shown below. The reason for limiting the size is to keep the model sizes and run times reasonable.

Data           Used for            Recommended size
Test           Testing the system  2,000-5,000 sentences
Development    Tuning the system   2,000-5,000 sentences
Training       Training the TM     200,000 sentences
Mono-training  Training the LM     200,000-1,000,000 sentences
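The split described above can be sketched in a few lines of Python. This is only an illustration (the function name and toy data are our own, not part of the course tooling), assuming the corpus has already been read into a list of (source, target) sentence pairs:

```python
import random

def split_corpus(pairs, test_size, dev_size, train_size, seed=1):
    """Split parallel sentence pairs into disjoint test/dev/train sets.

    `pairs` is a list of (source, target) tuples, e.g. the two sides of a
    corpus in Moses format, read line by line and zipped together."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # guard against ordering effects in the corpus
    test = pairs[:test_size]
    dev = pairs[test_size:test_size + dev_size]
    train = pairs[test_size + dev_size:test_size + dev_size + train_size]
    # Mono-training data for the LM may overlap with the training data,
    # so reusing the target side of the training set is acceptable.
    mono = [target for _, target in train]
    return test, dev, train, mono

# Toy corpus of 100 sentence pairs
pairs = [("src %d" % i, "tgt %d" % i) for i in range(100)]
test, dev, train, mono = split_corpus(pairs, test_size=10, dev_size=10, train_size=80)
```

Because the slices never overlap, the test, development and training sets are disjoint by construction, which is exactly the property you need for a fair evaluation.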

You may choose any corpora, but we recommend either News Commentary or OpenSubtitles for most language pairs. If you use OpenSubtitles, the older version simply named OpenSubtitles is normally big enough, and there is no need to download the larger versions from 2016 or 2018. OpenSubtitles typically has short sentences, so if you use it, 5000 sentences is good for test and development; otherwise, 2000 sentences is normally enough, provided your corpus has sentences of a reasonable length. If you choose some other corpus, try to pick one of a reasonable size, in order to avoid having to download too much data.

Before you train your system you need to prepare your data by tokenizing, truecasing (or lowercasing) and filtering out long sentences. This process is described in the Moses tutorial (http://www.statmt.org/moses/?n=Moses.Baseline); see also below.
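As a rough illustration of what this preparation does, the sketch below mimics the three steps in plain Python, with lowercasing standing in for truecasing. The function name and the naive tokenizer are our own; for the actual projects you should use the Moses scripts (tokenizer.perl, truecase.perl, clean-corpus-n.perl) as described in the tutorial:

```python
import re

def prepare(src_lines, tgt_lines, max_len=80):
    """Naive stand-in for tokenization, lowercasing and length filtering."""
    def tokenize(line):
        # Split off punctuation as separate tokens; crude compared to tokenizer.perl
        return re.findall(r"\w+|[^\w\s]", line.lower(), re.UNICODE)
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        src_tok, tgt_tok = tokenize(src), tokenize(tgt)
        # Drop empty pairs and pairs where either side is too long
        if 0 < len(src_tok) <= max_len and 0 < len(tgt_tok) <= max_len:
            kept.append((" ".join(src_tok), " ".join(tgt_tok)))
    return kept

print(prepare(["Hello, world!"], ["Hej, världen!"]))
# → [('hello , world !', 'hej , världen !')]
```

Note that both sides of a pair are always kept or dropped together, so the corpus stays sentence aligned after filtering.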

When you evaluate your systems, it is important that the system output and the reference translation use the same tokenization and casing. It is good practice to use de-tokenized and recased data for this, together with an evaluation script that performs these tasks. But in order to simplify your evaluation, it is perfectly acceptable to instead tokenize and truecase (or lowercase) the reference data in the same way as you did for your training data, and use the multi-bleu script from the assignments.
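To see why consistent casing matters, here is a simplified sentence-level BLEU (single reference, uniform weights). This is our own illustration, not a substitute for the multi-bleu script; it shows that a casing mismatch alone already costs n-gram matches:

```python
import math
from collections import Counter

def sentence_bleu(hypothesis, reference, max_n=4):
    """Simplified single-sentence, single-reference BLEU for illustration."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matches = sum((hyp_ngrams & ref_ngrams).values())  # clipped n-gram counts
        precisions.append(matches / max(sum(hyp_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(sentence_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
# An uppercased first word alone already lowers the score:
print(sentence_bleu("The cat sat on the mat", "the cat sat on the mat"))
```

The same effect applies at corpus level: if the output is truecased but the reference is lowercased (or vice versa), the score drops for reasons that have nothing to do with translation quality.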

All choices and preparations of data should be clearly specified in your report, including which corpus you used, which language pair(s), and the exact size of your data sets.

Tools

Most projects will be based on SMT. For these, you will mainly work with the Moses SMT toolkit, parts of which you are already familiar with from the lab sessions. For all projects, you will first have to build a baseline system following the instructions given in this tutorial. This is required in order to be able to compare the results of the systems you build during the project. Remember that BLEU scores are relative: we are always interested in the improvement over a baseline system. An absolute BLEU score is not informative.

IMPORTANT! You do not have to install Moses on the university computers. At least one version of it is already installed. For Moses, and data preparation, use the paths provided in assignment 4. For language modeling you can use SRILM, paths are provided in assignment 2b.

Some projects might be based on NMT. Currently there are many NMT tools, based on different frameworks, and you can choose any one you like for your project. The most popular are Sockeye (based on MXNet), Marian (written in C++), OpenNMT (based on PyTorch), Nematus (based on Theano/TensorFlow), Tensor2Tensor (based on TensorFlow), and Neural Monkey (based on TensorFlow). Please refer to the detailed documentation and examples on their websites. We recommend Sockeye, OpenNMT, and Marian. If you choose Marian, please try to install it on your own computer; it does not run well on the lab computers.

Training (big) NMT models usually requires GPUs (graphics processing units) to accelerate the computation. Unfortunately, our lab computers do not have suitable GPUs, so remember to install the CPU versions there. If your own computer has a good NVIDIA GPU (only NVIDIA; AMD GPUs currently have poor support in machine-learning frameworks), you can try to install a GPU version instead of a CPU version.

Other tools and resources that you might need in some projects:

  • hunpos - POS tagger; pre-trained POS tagging models
  • TreeTagger - another POS tagger with many pre-trained models; includes also lemmatization!
  • MaltParser - a data-driven dependency parser generator; pre-trained parsing models for various languages (NOTE - you will need version 1.4.1 to use those models!)
  • Simple tools to convert POS-tagged data to CoNLL format (for parsing) and MaltParser output to XML trees (for tree-based SMT training) are available at /local/kurs/mt/projects/tools/:
    • tagged2conll.pl - convert TAB separated POS-tagging output to CoNLL format for parsing
    • malt2tree.pl - convert TAB separated parser output (CoNLL format) to XML format to be used with Moses syntax-based models. NOTE - This only works for projective trees!
  • efmaral - an alternative word aligner
  • The Berkeley Word Aligner
  • More links to tools