Uppsala universitet  

Projects in Machine Translation
5LN711 and 5LN718


Aim

In the project you will work in groups, each focused on one language pair. You should choose a focus for your study and train several machine translation systems for your language pair. In the project you will:
  1. study background literature
  2. decide on a linguistic focus
  3. propose an experimental setup for your language pair and focus
  4. carry out machine translation experiments for a given language pair
  5. prepare a final report (~ 4 to 8 pages) and seminar presentation describing the results

Deadlines

  • October 30: Seminar presentations
  • November 1: Hand in final group report
  • (November 29: backup deadline for the report)

Groups

You will work in groups of 3-4 students. We will put together the groups based on your language skills. Each group will work on one (or possibly two) language pairs. It is up to the students in each group to decide whether to focus on one translation direction or on both.

Below are the groups and the language pair(s) that you will focus on. Note that the direction of translation is up to each group. For groups with an additional language listed in parentheses, it is up to the group to decide whether to focus on one or two language pairs.

  • English-Chinese: Renfei, Xuemei, Xiajing
  • English-Chinese (Mandarin): Jun, Tiantian, Shifei, Adam
  • English-Russian: Natalia, Ivan, Drew
  • English-French: Marsida, Fleur, Ugo
  • English-German: Fabian, Hartger, Evangelia
  • English-Spanish (+Italian?): Giuseppe, Camilla, Rizal
  • English-Swedish (+Bengali?): Luise, Elsa, Rex, Marufa

Organization

At the beginning of the project you are expected to first find and study literature on translation for your specific language pair. Then you will decide on a focus for your group, which will typically be some linguistic challenge for your particular language pair (see some suggestions below). This decision can be based on previous work, on your knowledge of the languages you are working with, and on your interests. You are then also expected to study literature related to your focus and come up with a plan for what to do during your project. You may also run your baseline system(s) first to find out what is problematic in their output.

Some suggestions for what to focus on (you are more than welcome to propose other options!):

  • Word order
  • Word segmentation
  • Compounding
  • Multi-word expressions
  • Verb tense
  • Agreement
  • Pronoun translation
  • Negation
  • Unknown words
  • ...

Before you start with the main work, you should talk to a supervisor to check that your focus and your work plan are reasonable. This can be done either during supervision hours or by contacting a supervisor by email.

Project work

Each group should carry out a practical project in which you investigate machine translation for your language pair and focus. This includes setting up and running MT systems, evaluating the systems, and comparing systems that differ in some aspect related to your focus. You may choose to use SMT, NMT, or both types of systems. We recommend using the Moses and OpenNMT toolkits, which you should be familiar with from the course assignments.

You can choose to work on either one or both translation directions. If your group has two language pairs, you should choose just one translation direction per language pair. You should have at least one baseline, which could be either an SMT or an NMT system. In addition, you should train at least two variants of an MT system that in some way address issues related to your focus (the exact number of systems can be discussed with your supervisors, depending on how complex your changes to the baseline are!). The systems can differ, for instance, in the settings used for training or in the pre-processing. The change should be motivated by previous literature, an analysis of the baseline output, and/or your knowledge of the two languages involved. Evaluation should be performed using BLEU, as well as in some more detailed way that suits your focus. This can be a small human evaluation, the use of some metric(s) that target your focus, or an automated method of your own that targets your focus area.
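As one illustration of an automated, focus-specific measure, a group working on unknown words could report the proportion of test tokens that never occur in the training data. The following is only a minimal sketch under stated assumptions: the function name `oov_rate` is hypothetical, the input is assumed to be already tokenized (tokens separated by spaces), and other focus areas will need quite different measures.

```python
def oov_rate(train_sentences, test_sentences):
    """Return the fraction of test tokens that do not occur in the training data.

    Assumes both inputs are lists of tokenized sentences (space-separated tokens)
    that were pre-processed in the same way.
    """
    # Vocabulary of all token types seen in the training sentences.
    vocab = {tok for sent in train_sentences for tok in sent.split()}
    test_tokens = [tok for sent in test_sentences for tok in sent.split()]
    if not test_tokens:
        return 0.0
    unknown = sum(1 for tok in test_tokens if tok not in vocab)
    return unknown / len(test_tokens)


if __name__ == "__main__":
    train = ["the cat sat", "a dog ran"]
    test = ["the dog slept", "a bird sang"]
    print(f"OOV rate: {oov_rate(train, test):.2f}")
```

Comparing this rate for the baseline and for a variant with different pre-processing (for instance a different word segmentation) shows how much a change reduces the unknown-word problem before any system is trained.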

Note that in order to pass the course you are not required to actually improve the translations, either overall or for your focus. You are required to put forward a hypothesis about why you thought your idea would work. If it does not work, it is enough to evaluate and analyze the results as they stand and try to explain why it did not work.

Gongbo and Sara will supervise the projects. You should discuss your plan for the project with one of us, once you have decided on what to focus on. The project work should be presented in a written group report, written in English, and be presented during one of the seminars.

It is possible to divide the work in the group so that different persons perform different experiments. It is important that each person in the group should be involved in setting up and running at least one MT system. You are jointly responsible for writing the report, however, and each person should understand and be familiar with all parts of the report. It is obligatory to have at least one meeting with Sara where you discuss how you divide the work, and show that everyone can set up and run MT systems. This can be done either during supervision hours, or by appointment.

Note that training times for machine translation systems are long, especially for NMT. It is thus important that you start training your systems early so that they have time to finish running. You also need time afterwards for evaluation, analysis, and preparing your report and presentation.

In the final report, you are expected to

  • give an overview of MT for your language pair and focus, and work related to it, including references to several journal and/or conference articles
  • describe your project and motivate your experimental setup
  • summarize, evaluate and analyze your results
  • describe possible shortcomings and ideas for improvement

Seminar Presentations

You will have one seminar presentation during the last week of the course. The goal of the presentation is to share with the class the outcome of your experiments, including the design, the different experiments you ran, the results, and an analysis of them.

It is up to the students in each group to decide how to organize the presentation. Each student should present some part of the work. All students in the group should know the contents of the whole presentation and be prepared to answer questions. It is compulsory for all students to attend all seminars. Please inform your teacher beforehand if you cannot participate. We will then find an alternative solution. The presentation should be given in English.

Here you can find the presentation schedule. Each group should prepare a presentation with slides for 10 minutes, followed by five minutes for questions.

Resources

General Resources for the projects will be listed and linked here.

Data

It is up to each project group to decide which language pair(s) and corpus (corpora) to work on. You will then download the data needed for your project from OPUS. Choose the language pair you are interested in, and download the data in Moses format, which means that it has been sentence aligned. Once you have downloaded a corpus, you need to split it into test, development, training and mono-training parts. These sets should be disjoint, i.e. not contain the same sentences (unless the corpus you use happens to have a lot of repetition), except for the training and mono-training data, which may overlap. Recommended sizes for the majority of projects are shown below. The reason for limiting the size is to keep the model sizes and run times reasonable.

Data            Used for              Recommended size
Test            Testing the system    2000-5000 sentences
Development     Tuning the system     2000-5000 sentences
Training        Training the TM       200,000 sentences
Mono-training   Training the LM       200,000-1,000,000 sentences
NMT-training    Training NMT models   ~1,000,000 sentences
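
The split into disjoint parts could be sketched as follows. This is only an illustration under stated assumptions, not a required procedure: the function name `split_parallel` and its parameters are hypothetical, the corpus is assumed to be given as two lists of equal length (one sentence per entry, as in the Moses format files), and exact duplicate sentence pairs are removed before splitting so that no pair can end up in two sets.

```python
import random


def split_parallel(src_lines, tgt_lines, n_test, n_dev, n_train, seed=1):
    """Split a sentence-aligned corpus into disjoint test, dev and training parts.

    Returns three lists of (source, target) pairs.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    # Drop exact duplicate pairs (keeping the first occurrence of each).
    pairs = list(dict.fromkeys(zip(src_lines, tgt_lines)))
    if len(pairs) < n_test + n_dev + n_train:
        raise ValueError("not enough unique sentence pairs for the requested sizes")
    # Shuffle with a fixed seed so that the split can be reproduced.
    random.Random(seed).shuffle(pairs)
    test = pairs[:n_test]
    dev = pairs[n_test:n_test + n_dev]
    train = pairs[n_test + n_dev:n_test + n_dev + n_train]
    return test, dev, train


if __name__ == "__main__":
    src = [f"source sentence {i}" for i in range(20)]
    tgt = [f"target sentence {i}" for i in range(20)]
    test, dev, train = split_parallel(src, tgt, n_test=3, n_dev=3, n_train=10)
    print(len(test), len(dev), len(train))
```

Shuffling before splitting is a design choice: it avoids taking the test set from a single document or film, but it also breaks up document context, so taking contiguous blocks may be preferable for a focus that depends on context. The mono-training data is not handled here, since it may overlap with the training data.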

You may choose any corpora, but we recommend using either News Commentary or OpenSubtitles for most language pairs. If you use OpenSubtitles, the older version named OpenSubtitles is normally big enough, and there is no need to download the larger versions from 2016 or 2018. OpenSubtitles typically has short sentences, so if you use it, 5000 sentences is a good size for the test and development sets. For corpora with sentences of a reasonable length, 2000 sentences is normally enough. If you choose some other corpus, like the WMT shared task data, try to pick one of a reasonable size in order to avoid having to download too much data. (If you do end up downloading large amounts of data, please remove the extra data once you have extracted suitable data sets for your projects.)

Before you train your system you need to prepare your data by tokenizing, truecasing (or lowercasing) and filtering out long sentences. This process is described in the Moses tutorial (http://www.statmt.org/moses/?n=Moses.Baseline); see also below.
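
To make the filtering step concrete, here is a small sketch of what removing long sentences from a parallel corpus involves. It is not the Moses cleaning script, only an illustration: the function name `filter_long_pairs` is hypothetical, the length limits and the length-ratio limit are assumed example values, and the input is assumed to be tokenized already.

```python
def filter_long_pairs(pairs, min_len=1, max_len=80, max_ratio=9.0):
    """Keep only sentence pairs whose token counts are within the given limits.

    pairs: list of (source, target) strings with space-separated tokens.
    A pair is dropped if either side is shorter than min_len or longer than
    max_len tokens, or if one side is more than max_ratio times longer than
    the other.
    """
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
            continue
        # Written as a multiplication so that no division by zero can occur.
        if max(n_src, n_tgt) > max_ratio * min(n_src, n_tgt):
            continue
        kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    corpus = [("a short sentence", "en kort mening"), ("", "tom rad")]
    print(filter_long_pairs(corpus))
```

Both sides of a pair are always kept or dropped together, which preserves the sentence alignment. In practice the filter is normally applied only to the training data, not to the test set.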

When you evaluate your systems, it is important that you use the same tokenization and casing for both the system output and the reference translation. It is good practice to use de-tokenized and recased data for this, and to use an evaluation script that performs these tasks. But in order to simplify your evaluation, it is perfectly acceptable to instead tokenize and truecase (or lowercase) the reference data in the same way as you did for your training data, and use the multi-bleu script used in the assignments.
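
The effect of inconsistent casing can be seen with a simplified BLEU computation. The sketch below is an assumption-laden illustration and not a replacement for the multi-bleu script: it handles a single reference per sentence, uses n-grams up to length 4 with clipped counts and the standard brevity penalty, applies no smoothing, and the function names `ngrams` and `corpus_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100) with one reference per sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Counter intersection gives the clipped number of matching n-grams.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing: any zero precision gives a score of zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    hyp = ["The cat sat on the mat ."]
    ref = ["the cat sat on the mat ."]
    print(corpus_bleu(hyp, ref))  # lower score because of the casing mismatch
    print(corpus_bleu([h.lower() for h in hyp], ref))  # identical after lowercasing
```

The two sentences differ only in the casing of the first word, yet the first score is clearly below the second, because every n-gram that contains the mismatched token fails to match.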

All choices and preparations of data should be clearly specified in your report, including which corpus you used, which language pair(s), and the exact size of your data sets.

Tools

If you choose to use SMT in your project, you will mainly work with the Moses SMT toolkit, parts of which you are already familiar with from the lab sessions. For all projects, you will first have to build a baseline system following the instructions given in this tutorial. This is required in order to be able to compare the results of the systems you will build during the project. Remember that BLEU scores are relative: we are always interested in the improvement over a baseline system. An absolute BLEU score on its own is not informative.

IMPORTANT! You do not have to install Moses on the university computers; at least one version is already installed. For Moses and data preparation, use the paths provided in assignment 2. You also need to train a language model. You may use the SRILM toolkit that is installed on our computer system; brief instructions are included in assignment 2b from last year. However, use the options "-kndiscount" and "-interpolate" instead of "-wbdiscount", since you are working on reasonably large corpora.

Some projects might be based on NMT. There are currently many NMT tools, built on different frameworks, and you can choose any one you like for your project. Popular NMT tools include OpenNMT (based on PyTorch), Sockeye (based on MXNet), Marian (written in C++), Nematus (based on Theano/TensorFlow), Tensor2Tensor (based on TensorFlow), and Neural Monkey (based on TensorFlow). Please refer to the detailed documentation and examples on their websites. We recommend OpenNMT and Sockeye. Training (big) NMT models usually requires GPUs (graphics processing units) to accelerate the computation. We can use the GPU resources on Abel.

Here are some instructions for the Abel GPU cluster. Applying for an account: please choose the organization Uppsala universitet (ZIP code: 75126) and the project NN9447K: Nordic Language Processing Laboratory (project manager: Stephan Oepen). The resource you should choose is Abel. Please apply for your account soon, since it may take several days to get approved.

More information about Abel: Guide, Slurm_jobs, Queue systems. We use the OpenNMT toolkit to train our NMT models. Here is an example script file for training NMT models on Abel. For more detailed information, please refer to Slurm_jobs and the OpenNMT parameter descriptions.

Other tools and resources that you might need in some projects: