Uppsala universitet  

Projects in Machine Translation
5LN711 and 5LN718

Groups
Organization
Project Work
Seminar Presentation
Resources

Aim

In the project you will work in groups focused on one language pair involving a low-resource language and English. You should come up with a focus for your study, and train several machine translation systems for your language pair. In the project you will:
  1. study background literature
  2. describe your language and come up with a focus for your experiments
  3. propose an experimental setup for your language pair and focus
  4. carry out machine translation experiments for a given language pair
  5. prepare a final report (around 4 to 8 pages) and a seminar presentation describing your work

Note that you will have very limited data, so your resulting translations will be quite poor. This is perfectly fine. The main objective of the project is to learn the methodology of how to train MT systems and how to perform meaningful experiments. The purpose is not to produce great translations.

Deadlines

  • October 27: Seminar presentations
  • October 30: Hand in final group report
  • (November 27: backup deadline for the report)

Groups

You will work in groups of 3-4 students. We will present a number of low-resource languages you can choose from, and you will then be able to sign up for a group working on a specific language in Studentportalen. You may not pick a language that you know, or one for which you know a (closely) related language.

List of languages:

The signup in Studentportalen will open on Tuesday September 15, 13.00, and close on Friday September 18. Anyone who does not sign up will be assigned to a group by the teachers. If one of the groups ends up with only two members, we will also need to reassign someone to balance the groups.

Organization

At the beginning of the project you are expected to find and study literature on several aspects: your particular low-resource language, in order to learn about its structure; translation for your specific language pair (if available); and low-resource MT in general. Once you have learnt about your language and decided on a specific aspect to focus on, you are expected to study literature related to that aspect as well. The focus for your group can be a specific linguistic challenge for your particular language pair (like word order, word segmentation, unknown words, agreement, ...), or an issue or method related to low-resource MT in general (like network engineering geared at small data sets, cross-lingual learning, ...). This decision can be based on your literature study and/or on your interests. You may also run your baseline system(s) first and do a small evaluation/error analysis in order to find out what is problematic in the MT output. You are then expected to study literature related to your focus, in order to come up with a plan for your experiments.

Before you start with the main work, you should talk to a supervisor to see if your focus and your plan for your experiments are reasonable. This can be done either during supervision hours, or by contacting a supervisor by email.

Project Work

Each group should carry out a practical project, in which you investigate machine translation for your language pair and focus. This includes setting up and running MT systems, evaluating the systems, and comparing systems that differ in some aspect related to your focus. You may choose to use SMT, NMT or both types of systems. We recommend using the Moses and OpenNMT toolkits, which you should be familiar with from the assignments.

At this point you also need to collect and prepare the data you need for your experiments. See details below.

You can choose to work on one or both translation directions, but in most cases it is probably easiest to focus on translation into English, a language you understand. You should have at least one baseline, which could be either an SMT or an NMT system (or both types). In addition you should train at least two variants of an MT system that somehow address issues related to your focus (the exact number of systems can be discussed with your supervisors, depending on how complex your changes to the baseline are!). The systems can differ, for instance, in the settings used to train them, in the system type (NMT or SMT), in the pre-processing, or in the use of data. The change should be motivated by previous literature, an analysis of the baseline output and/or your knowledge about the languages involved. Evaluation should minimally be performed using BLEU. You may also use other methods, like metrics that target specific issues, or a small error analysis (even though this then has to be based on the English reference translations, since you do not know the other language).

Note that in order to pass the course you are not required to actually improve the translations, either overall or for your focus. You are required to put forward a hypothesis for why you thought your idea might work. If it does not work, it is enough to evaluate and analyze the results as they are and try to explain why it might not have worked. If you do get the expected results, you are of course also expected to analyze your findings.

Gongbo and Sara will supervise the projects. You should discuss your plan for the project with one of us once you have decided on what to focus on. The project work should be documented in a written group report, in English, and presented during one of the seminars.

It is possible to divide the work in the group so that different people perform different experiments. It is important, however, that each person in the group is involved in setting up and running at least one MT system. You are jointly responsible for writing the report, and each person should understand and be familiar with all parts of it. It is obligatory to have at least one meeting with Sara where you discuss how you divided the work and how each person was involved in setting up and running MT systems. This can be done either during supervision hours or by appointment.

Note that training times for machine translation systems can be long, especially for NMT. It is thus important that you start training your systems early so that they have time to finish. You then need time for evaluation, analysis, and preparing your report and presentation.

In the final report, you are expected to

  • give a brief overview of the structure and specifics of your low-resource language
  • give an overview of low-resource MT, focusing on your language pair (if applicable) and focus, including references to several journal and/or conference articles
  • describe your project and motivate your experimental setup
  • summarize, evaluate and analyze your results
  • describe possible shortcomings and ideas for improvement

Seminar Presentations

You will have one seminar presentation on October 27. The goal of the presentation is to share with the class the outcome of your experiments, including the design, the different experiments you ran, the results, and an analysis of them.

It is up to the students in each group to decide how to organize the presentation. Each student should present some part of the work. All students in the group should know the contents of the whole presentation and be prepared to answer questions. It is compulsory for all students to attend all seminars. Please inform your teacher beforehand if you cannot participate. We will then find an alternative solution. The presentation should be given in English.

The schedule for presentations will be announced. Each group has 10 minutes for its presentation, followed by 5 minutes for questions.

Resources

General resources for the projects will be listed and linked here.

Data

It is up to each project group to decide which corpus (or corpora) to use. For several languages there are suggestions of where to get data. We also recommend OPUS, from which you can download data in Moses format, which means it has already been sentence aligned. But you are free to use data from any source you can find! Once you have downloaded a corpus, you need to split it into training, development, and test parts. These sets should be disjoint, i.e. not contain the same sentences (unless the corpus you use happens to have a lot of repetition); see the sketch after the table below. Recommended maximum sizes for the majority of projects are shown below. The reason for limiting the size is to keep model sizes and running times reasonable. Note that since the focus this year is on low-resource languages, the available data may well be smaller than the maximum suggestions, which is then perfectly fine.

Data set             Used for                         Recommended size
Test                 Testing the system               2,000-5,000 sentences
Development          Tuning the system                2,000-5,000 sentences
Training (SMT)       Training the translation model   200,000 sentences
Mono-training (SMT)  Training the language model      200,000-1,000,000 sentences
Training (NMT)       Training NMT models              ~1,000,000 sentences

(If you compare SMT and NMT, use the same training data for both conditions.)
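
As an illustration, here is a minimal shell sketch of such a split for a sentence-aligned corpus in Moses format (one file per language, where line N in each file forms a translation pair). The file names (corpus.xx for the low-resource language, corpus.en for English) and the slice sizes are placeholders to adapt to your own data:

    # Shuffle both sides with the same random order so the pairs stay
    # aligned (this assumes the sentences contain no tab characters).
    paste corpus.xx corpus.en | shuf > shuffled.both
    cut -f1 shuffled.both > shuffled.xx
    cut -f2 shuffled.both > shuffled.en

    # Take disjoint slices: the first 3000 pairs for test, the next 3000
    # for development, and (up to) the recommended maximum for training.
    for lang in xx en; do
        head -n 3000 shuffled.$lang > test.$lang
        sed -n '3001,6000p' shuffled.$lang > dev.$lang
        tail -n +6001 shuffled.$lang | head -n 200000 > train.$lang
    done

Shuffling before splitting avoids test sets that are drawn from a single document or time period.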

Before you train your system you need to prepare your data by tokenizing, truecasing (or lowercasing), and filtering out long sentences. This process is described in the Moses tutorial (http://www.statmt.org/moses/?n=Moses.Baseline); see also below.
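
For reference, the corresponding commands from the Moses baseline tutorial look roughly as follows. $MOSES is assumed to point to the installed Moses directory (use the paths from assignment 2), xx again stands for your low-resource language, and the development and test sets need the same tokenization and truecasing (but no length filtering):

    # Tokenize both sides of the training data (tokenizer.perl falls back
    # to default rules for languages it has no special support for).
    $MOSES/scripts/tokenizer/tokenizer.perl -l en < train.en > train.tok.en
    $MOSES/scripts/tokenizer/tokenizer.perl -l xx < train.xx > train.tok.xx

    # Train a truecasing model per language and apply it.
    for lang in en xx; do
        $MOSES/scripts/recaser/train-truecaser.perl \
            --model truecase-model.$lang --corpus train.tok.$lang
        $MOSES/scripts/recaser/truecase.perl \
            --model truecase-model.$lang < train.tok.$lang > train.tc.$lang
    done

    # Drop sentence pairs that are empty or longer than 80 tokens.
    $MOSES/scripts/training/clean-corpus-n.perl train.tc xx en train.clean 1 80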

When you evaluate your systems, it is important that you use the same casing and tokenization for both the system output and the reference translation. It is good practice to use de-tokenized and recased data for this, with an evaluation script that performs these tasks. But in order to simplify the evaluation for this specific project, it is acceptable to instead tokenize and truecase (or lowercase) the reference data in the same way as your training data, and use the multi-bleu script from the assignments.
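
A minimal example, assuming the system output (output.tc.en) and the reference (reference.tc.en) have been tokenized and truecased identically:

    $MOSES/scripts/generic/multi-bleu.perl reference.tc.en < output.tc.en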

All choices and preparations of data should be clearly specified in your report, including which corpus you used, which language pair(s), and the exact size of your data sets, as well as your preprocessing choices, both for training and for evaluation.

Tools

If you choose to use SMT in your project, you will mainly work with the Moses SMT toolkit, parts of which you are already familiar with from the lab sessions. For all projects, you will first have to build a baseline system following the instructions given in this tutorial. This is required in order to be able to compare the results of the systems you build during the project. Remember that BLEU scores are relative: we are always interested in the improvement over a baseline system. An absolute BLEU score is not informative.

IMPORTANT! You do not have to install Moses on our Linux computers; at least one version is already installed. For Moses and data preparation, use the paths provided in assignment 2. (You also need to train a language model. You may use the SRILM toolkit that is installed on our computer system; brief instructions are included in assignment 2b from 2018. Use the options "-kndiscount" and "-interpolate" instead of "-wbdiscount", though, since you are working on reasonably large corpora.)
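
As a rough sketch with placeholder file names, language-model and baseline training then look something like this. The LM specification follows the baseline tutorial's factor:order:file:type format, where type 8 selects KenLM, which can read the ARPA file produced by SRILM directly:

    # Train a 3-gram language model with modified Kneser-Ney smoothing
    # on the monolingual English data.
    ngram-count -order 3 -kndiscount -interpolate \
        -text mono.tc.en -lm lm.en.arpa

    # Train the baseline Moses system (translating xx -> en); the LM path
    # must be absolute, and -external-bin-dir points to the word-alignment
    # binaries (GIZA++ or mgiza) of your installation.
    $MOSES/scripts/training/train-model.perl \
        -root-dir train -corpus train.clean -f xx -e en \
        -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
        -lm 0:3:$PWD/lm.en.arpa:8 -external-bin-dir $MOSES/tools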

You may also use NMT. There are currently many NMT tools, based on different frameworks, and you can choose any one you like for your project. The most popular are OpenNMT (based on PyTorch), Sockeye (based on MXNet), Marian (written in C++), Nematus (based on Theano/TensorFlow), Tensor2Tensor (based on TensorFlow) and Neural Monkey (based on TensorFlow). Please refer to the detailed documentation and examples on their websites. We recommend OpenNMT and Sockeye. Training (big) NMT models usually requires GPUs (graphics processing units) to accelerate the computation.
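
As a sketch of the basic workflow with OpenNMT-py (version 1.x command names; file names are placeholders, and without further options you get the default 2-layer LSTM model trained for 100,000 steps):

    # Build vocabularies and binarize the parallel data.
    onmt_preprocess -train_src train.xx -train_tgt train.en \
        -valid_src dev.xx -valid_tgt dev.en -save_data data/demo

    # Train with default settings (add -world_size 1 -gpu_ranks 0 for a GPU).
    onmt_train -data data/demo -save_model models/demo

    # Translate the test set with the final checkpoint.
    onmt_translate -model models/demo_step_100000.pt \
        -src test.xx -output pred.en -replace_unk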

We can use the GPU resources on UPPMAX. Here are some instructions for the Snowy GPU cluster on UPPMAX.
Applying for accounts: Please request membership in our course project g2020017 and in the group project UPPMAX 2020/2-2 (the group project gives you priority). Please apply for your account soon, as approval may take several days.
More information about Snowy: User Guide, Slurm_jobs. We have installed the OpenNMT toolkit for training NMT models. There is an example script for training NMT models on Snowy if you are only in the g2020017 project, and another example script if you are also in the UPPMAX 2020/2-2 project. For more detailed information, please refer to the OpenNMT documentation.
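
To give an idea, a Slurm batch script for a GPU job on Snowy might look like the sketch below; the partition, time limit and paths are assumptions on our part, so check the user guide and the provided example scripts for the exact settings:

    #!/bin/bash
    #SBATCH -A g2020017     # course project (or the UPPMAX 2020/2-2 project)
    #SBATCH -M snowy        # run on the Snowy cluster
    #SBATCH -p node         # partition; check the Snowy user guide
    #SBATCH --gres=gpu:1    # request one GPU
    #SBATCH -t 24:00:00     # wall-clock time limit, adjust as needed

    # Train with the pre-installed OpenNMT toolkit (placeholder paths).
    onmt_train -data data/demo -save_model models/demo \
        -world_size 1 -gpu_ranks 0

Submit the script with sbatch and check its status with squeue -u $USER.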

Other tools and resources that might be useful for some projects: