Uppsala universitet  

Projects in Machine Translation
Master Programme, 7.5 credits (5LN711)

Organisation
Project Work
Individual Reflection Report
Seminar Presentation
Possible Topics
Resources

Aim

The goal of the master projects is
  1. to study background literature and prepare a presentation for the final seminars in the MT course;
  2. to carry out a practical assignment related to the selected topic;
  3. to prepare a final report and seminar presentation describing the results.

Deadlines

  • September 24: Hand in your topic preferences
  • November 5 and 8: Seminar presentations (detailed schedule TBA)
  • November 9: Hand in final group report
  • November 9: Hand in individual reflection reports
  • (December 7: backup deadline for reports)

Organisation

You will work in groups of 3-4 students. We will put together the groups based on your wishes for which topics you prefer to work on. The list of topic suggestions can be found at the bottom of this page. Hand in your preferences by email to Sara, by September 24 at the latest. Give a ranked list of at least three different topics you could consider working on. We will try our best to accommodate everyone's wishes, but we cannot guarantee that you will get your preferred topics. If you fail to hand in your preferences by September 24, you will be assigned a topic arbitrarily.

Project work

For each topic you should carry out a practical project, in which you apply some of the concepts related to your topic in practice. This includes setting up and running MT systems, normally with Moses, and evaluating and comparing systems that differ in some aspect related to your topic. Each project will be assigned a supervisor, with whom you should discuss how to set up your project. The project work should be presented in a written group report, written in English.

It is possible to divide the work in the group so that different persons perform different experiments. It is important, however, that each person in the group sets up and runs at least one MT system (the different MT systems could differ with respect to some aspect related to your topic, language pair, translation direction, etc.). You are jointly responsible for writing the report, and each person should understand and be familiar with all parts of it. It is obligatory to have at least one meeting with your supervisor where you discuss how to divide the work, and show that everyone can set up and run MT systems.

Your project also includes learning more about the theoretical background of your topic: you should describe the basics of the topic and refer to the relevant literature.

Your project should also have a reasonable level of technical difficulty. Merely running given software, like Moses, with some different parameters and evaluating it using an automatic metric is not enough. Each project member should be involved in at least one of the following tasks (or propose an equivalent task to your supervisor):

  • Writing your own source code (can be for pre-processing, re-implementing a known algorithm, for targeted evaluation, etc)
  • Installing and learning to use some software that was not used during the labs and assignments
  • Doing targeted evaluation and/or error analysis, beyond just using automatic metrics
  • Performing and interpreting significance testing of the metric scores
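For the significance-testing task, paired bootstrap resampling is a standard approach for MT metric scores: resample the test set with replacement many times and count how often one system outscores the other. The sketch below works on per-sentence scores for simplicity (the function name is our own, and summing sentence-level scores is an approximation; corpus-level BLEU is normally recomputed for each resample):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Estimate how often system A outscores system B on resampled test sets.

    scores_a / scores_b: per-sentence metric scores for the same test
    sentences, one list per system. Returns the fraction of bootstrap
    samples in which A's total score is strictly higher than B's; a value
    close to 1.0 suggests A's advantage is not due to chance.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # Draw a resampled test set of the same size, with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples
```

Interpreting the result (e.g. requiring a win rate of at least 0.95 before claiming an improvement) is part of the task.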

In the final report, you are expected to

  • give an overview of the topic, and work related to it, including references to several journal and/or conference articles
  • describe your project and motivate your experimental setup
  • summarize, evaluate and analyze your results
  • describe possible shortcomings and ideas for improvement

Individual reflection report

In addition to the group report, each student should also hand in a short individual reflection report. The report should be 1-2 A4 pages (not more), and should be written in English. The report should consist of two parts:

  • A description of your role in the project group and what you personally did in the project, including which MT systems you trained.
  • A reflection related to the literature, where you should pick a recent conference article related to your topic, summarize it, and briefly discuss how your project work relates to that work. Do not pick the same article as any other group member.

Seminar Presentations

You will have two seminar presentations during the last week of the course. In the first presentation you will present theory related to your topic, and in the second presentation you will present the setup and results of your practical experiments.

It is up to the students in each group to decide how to organize the presentation. Each student should present some part of the work. All students in the group should know the contents of the whole presentation and be prepared to answer questions. It is compulsory for all students to attend all seminars. Please inform your teacher beforehand if you cannot participate. We will then find an alternative solution. The presentation should be given in English.

The goal of the first, theoretical presentation is to give all students an overview of the topics selected by the master students for their projects. Try to give a comprehensible introduction to the topic you have selected. Motivate the ideas and concepts and try to be as pedagogical as possible. Allow for discussion and questions.

The goal of the second presentation is to share with the class the outcome of your experiments, including the design, the different experiments you ran, the results, and an analysis of them.

The schedule for the presentations, and the length of each type of presentation, will be announced in good time.

The seminars will be held on November 5 and 8.

Project Topics

Here is a list of project ideas and a short description of their main goals. Note that the exact descriptions are suggestions that should be discussed with your supervisor. Note that there might be fewer groups than there are topic suggestions.

Each project consists of the following parts:

  • building a baseline system (to have something to compare with)
  • building new system(s) (at least as many as the number of project members, but also depending on the topic)
  • building a system consists of training, tuning and testing it

List of Topics

  • Transliteration

  • Historical Spelling Normalization
    • Seminar: NMT models for historical spelling normalization
    • Project: Apply NMT models to the historical spelling normalization task (beyond assignment 5)
      - train SMT models for historical spelling normalization (possibly exploring different options)
      - train NMT models with token pairs or segment (several consecutive tokens) pairs
      - use filtered training data (historical and modern spellings are not identical)
      - how much training data is needed for NMT/SMT?
      - tools: Sockeye, Marian, OpenNMT, NeuralMonkey, Moses
      - data recommendation: parallel historical-modern spellings

  • Factored SMT models
    • Seminar: Explain the basic concepts of factored SMT
    • Project: train and compare various factored SMT models
      - include factors such as POS tags, lemmas, syntactic function
      - compare various combinations of translation and generation steps
      - tools: Moses, tagger and lemmatizer (hunpos, TreeTagger) and parsers with existing models
      - data recommendation: translated movie subtitles

  • Re-ordering and SMT
    • Seminar: Explain different re-ordering strategies
    • Project: Apply and compare different re-ordering approaches
      - lexicalized re-ordering models
      - re-ordering constraints (see Moses: hybrid translation)
      - pre-ordering (before training/decoding)
      - tools: Moses, external or own tools
      - data recommendation: translated movie subtitles

  • Domains and evaluation
    • Seminar: Explain the impact of domain on MT and domain adaptation strategies
    • Project: Explore the influences of different domains on training and test data, and evaluate through several different methods
      - Vary the domain in training, dev and test data
      - Train on mixed data or data from a single domain
      - Possibly: explore methods for domain adaptation
      - Evaluate using different automatic metrics (possibly including significance testing)
      - Evaluate using some manual or semi-automatic method
      - tools: Moses, evaluation metrics
      - data recommendation: translated movie subtitles and data from other domains
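As an illustration of the "filtered training data" point in the historical spelling normalization topic above, a minimal filter that keeps only pairs where the two spellings actually differ might look like this (the case-folding is an assumption; discuss the exact filtering criterion with your supervisor):

```python
def filter_spelling_pairs(pairs):
    """Keep only (historical, modern) pairs whose spellings differ.

    Identical pairs teach the model little beyond copying, so one filtering
    option is to train only on the pairs that actually change. Comparing
    case-insensitively here is an illustrative choice, not a requirement.
    """
    return [(hist, mod) for hist, mod in pairs if hist.lower() != mod.lower()]
```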

Resources

General Resources for the projects will be listed and linked here.

Data

It is up to each project group to decide which language pair(s) and corpus (or corpora) to work on. You will then download the data needed for your project from OPUS: http://opus.nlpl.eu/ . Choose the language pair you are interested in, and download the data in Moses format, which means that it has been sentence aligned. Once you have downloaded a corpus, you need to split it into test, development, training and mono-training parts. These sets should be disjoint, i.e. not contain the same sentences (unless the corpus you use happens to have a lot of repetition), except for the training and mono-training data, which may overlap. Recommended sizes for the majority of projects are shown below. The sizes are limited in order to keep model sizes and running times reasonable.

Data           Used for             Recommended size
Test           Testing the system   2,000-5,000 sentences
Development    Tuning the system    2,000-5,000 sentences
Training       Training the TM      200,000 sentences
Mono-training  Training the LM      200,000-1,000,000 sentences
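Splitting a downloaded corpus into these sets can be sketched as follows (a minimal Python illustration; the function name and the shuffling seed are our own choices, and the allowed overlap between training and mono-training data is exploited by reusing the target side):

```python
import random

def split_corpus(pairs, n_test, n_dev, n_train, seed=42):
    """Split sentence-aligned (src, tgt) pairs into disjoint data sets.

    `pairs` would typically be read from the two sides of an OPUS corpus in
    Moses format, line by line. Shuffling first avoids drawing all test
    sentences from a single document. Returns test, dev and train as pair
    lists, plus mono-training data (target side only, for the LM), which is
    allowed to overlap with the training data.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    test = pairs[:n_test]
    dev = pairs[n_test:n_test + n_dev]
    train = pairs[n_test + n_dev:n_test + n_dev + n_train]
    # Mono-training data may overlap with training data, so simply reuse
    # the target side of everything that is not test or dev.
    mono = [tgt for _, tgt in pairs[n_test + n_dev:]]
    return test, dev, train, mono
```

Remember to write each part out to separate source and target files before training.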

You may choose any corpora, but we recommend either News Commentary or OpenSubtitles for most language pairs. If you use OpenSubtitles, the older version simply named OpenSubtitles is normally big enough; there is no need to download the larger versions from 2016 or 2018. OpenSubtitles typically has short sentences, so if you use it, 5,000 sentences is a good size for test and development; otherwise, 2,000 sentences is normally enough, provided your corpus has sentences of a reasonable length. If you choose some other corpus, try to download one of a reasonable size, in order to avoid downloading too much data. (If you do end up downloading large amounts of data, please remove the extra data once you have extracted suitable data sets for your project.)

Before you train your system you need to prepare your data by tokenizing, truecasing (or lowercasing) and filtering out long sentences. This process is described in the Moses tutorial (http://www.statmt.org/moses/?n=Moses.Baseline); see also below.
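The filtering step can be sketched in Python as follows (the thresholds mirror common Moses defaults for corpus cleaning, but this is an illustrative stand-in for the Moses cleaning script, not a replacement for it):

```python
def clean_parallel(pairs, max_len=80, max_ratio=9.0):
    """Filter a tokenized parallel corpus before training.

    Drops empty pairs, pairs with overly long sentences, and pairs whose
    source/target length ratio is extreme (usually a sign of misalignment).
    The thresholds are assumptions; adjust them for your data.
    """
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue                      # empty side
        if n_src > max_len or n_tgt > max_len:
            continue                      # too long
        if n_src / n_tgt > max_ratio or n_tgt / n_src > max_ratio:
            continue                      # suspicious length ratio
        kept.append((src, tgt))
    return kept
```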

When you evaluate your systems, it is important that you use the same tokenization and casing for both the system output and the reference translation. It is good practice to use de-tokenized and recased data for this, with an evaluation script that performs these tasks. To simplify your evaluation, however, it is perfectly acceptable to instead tokenize and truecase (or lowercase) the reference data in the same way as your training data, and use the multi-bleu script from the assignments.
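To make the point concrete, here is a minimal corpus-level BLEU sketch in the spirit of multi-bleu (single reference, uniform n-gram weights; the functions are our own illustration, not the actual script). Both hypothesis and reference lines must be normalized identically before scoring, or casing mismatches alone will depress the score:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Corpus-level BLEU with a single reference per sentence.

    `hyps` and `refs` are parallel lists of pre-tokenized, identically
    cased sentence strings. Returns a score in [0, 1]; multiply by 100
    for the usual presentation.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in matches:
        return 0.0  # no credit if some n-gram order has no matches at all
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = min(1.0, math.exp(1 - ref_len / hyp_len))
    return brevity * math.exp(log_prec)
```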

All choices and preparations of data should be clearly specified in your report, including which corpus you used, which language pair(s), and the exact size of your data sets.

Tools

Most projects will be based on SMT. You will then mainly work with the Moses SMT toolkit, parts of which you are already familiar with from the lab sessions. For all projects, you will first have to build a baseline system following the instructions given in this tutorial. This is required in order to be able to compare the results of the systems you build during the project. Remember that BLEU scores are relative; we are always interested in the improvement over a baseline system. An absolute BLEU score is not informative.

IMPORTANT! You do not have to install Moses on the university computers. At least one version of it is already installed. For Moses, and data preparation, use the paths provided in assignment 4. For language modelling you can use SRILM, paths are provided in assignment 2b.

Some projects might be based on NMT. There are currently many NMT tools, built on different frameworks, and you can choose any one you like for your project. The most popular NMT tools are Sockeye (based on MXNet), Marian (written in C++), OpenNMT (based on PyTorch), Nematus (based on Theano/TensorFlow), Tensor2Tensor (based on TensorFlow), and Neural Monkey (based on TensorFlow). Please refer to the detailed documentation and examples on their websites. We particularly recommend Sockeye, OpenNMT, and Marian. If you choose Marian, please try to install it on your own computer, as it does not run well on the lab machines.

Training (big) NMT models usually requires GPUs (graphics processing units) to accelerate the computation. Unfortunately, our lab computers do not support GPU versions, so remember to install the CPU versions there. If your own computer has a good NVIDIA GPU (NVIDIA only; AMD GPUs are currently not well supported for machine learning), you can try to install a GPU version instead of a CPU version.

Other tools and resources that you might need in some projects:

  • hunpos - POS tagger; pre-trained POS tagging models
  • TreeTagger - another POS tagger with many pre-trained models; includes also lemmatization!
  • MaltParser - a data-driven dependency parser generator; pre-trained parsing models for various languages (NOTE - you will need version 1.4.1 to use those models!)
  • Simple tools to convert POS-tagged data to CoNLL format (for parsing) and MaltParser output to XML trees (for tree-based SMT training) are available at /local/kurs/mt/projects/tools/:
    • tagged2conll.pl - convert TAB separated POS-tagging output to CoNLL format for parsing
    • malt2tree.pl - convert TAB separated parser output (CoNLL format) to XML format to be used with Moses syntax-based models. NOTE - This only works for projective trees!
  • efmaral - an alternative word aligner
  • The Berkeley Word Aligner
  • More links to tools