Assignment 1 - Machine translation evaluation


In this assignment session you will gain hands-on experience of using real machine translation services. You will compare the quality of different systems using both manual and automatic evaluation methods. Finally, you will assess the pros and cons of machine-translated texts.

This assignment is examined in class, on September 6, 9-12. Note that you have to attend the full session and be active in order to pass the assignment. If you miss this session, you will instead have to write a report, see the bottom of this page.


This assignment is intended to be performed in pairs, so team up with another student. If possible, team up with a student who has the same native language as you, or with someone with whom you share a good command of some language besides English. You do not have to work in the same pairs in all assignments.

Take notes during the assignment. During the last hour of the session, groups will get a chance to talk about their experiments: what they did, what surprised them, what they learned, etc. In addition, the teachers will talk to students during the session to discuss their findings.

1 - First impressions of MT quality / error analysis

In this subtask you should use a state-of-the-art translation service, Google Translate, to translate a text, which you will then evaluate. You can freely choose which languages to work with, provided that Google Translate handles them. As the target language, choose a language that you both know well (native if possible). As the source language, choose a language that both of you know, if possible (English in most cases, unless English is your target language); otherwise, choose a language that one of you knows well.

When you have decided on your language pair, find an article written in your source language. A good choice is a Wikipedia article, or an article from some news service in that language (e.g. BBC for English). Copy the main text of the article and paste it into Google Translate to translate it. Do NOT spend a long time on choosing the article.

Read the translated text and note your first impression of its quality. Then analyze the quality and the errors in more detail, identifying which types of errors occur and which types are the most frequent. Try to think about the following questions (and feel free to make additional observations!):

You should spend approximately 45 minutes on this task.

2 - Automatic vs manual evaluation

In this task we will carry out some experiments with text from two very different domains: course plans (from Luleå University) and movie subtitles. You will try methods for automatic and manual evaluation.


Create a new directory for this assignment and copy the following six files into this new work directory:

mkdir assignment1/
cd assignment1/
cp /local/kurs/mt/assignment1/data/* .

Source files:

Reference files:

Translated files:
The following files have been translated using the rule-based Convertus engine (which will be described during the guest lecture on October 15). The Convertus system specializes in course syllabi.

Translate the sentences in the Swedish source files into English twice, once with Google Translate and once with Bing Translator. Save the translation results in separate text files in your work directory.

Automatic evaluation

Tokenize the reference texts and the translated texts (Convertus, Google, and Bing), using the provided script.

perl /local/kurs/mt/mosesdecoder/scripts/tokenizer/tokenizer.perl -l LANG < raw-translation.txt > tokenized-translation.txt

Where LANG is the language in the file ('sv' for Swedish and 'en' for English), raw-translation.txt is the translation from each translation service or the reference translation, and tokenized-translation.txt is the resulting translation with tokenization.

Use the provided multi-bleu.perl script to compute BLEU scores for the translated texts, using the reference translations provided. Record the scores you obtain for each system.

perl /local/kurs/mt/mosesdecoder/scripts/generic/multi-bleu.perl tokenized-reference.txt < tokenized-translation.txt
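To make it clearer what the BLEU score measures, the computation can be sketched in a few lines of Python. This is only a minimal illustration, assuming one reference per translated line, whitespace-separated tokens (i.e. already tokenized text) and no smoothing. The function names ngrams and corpus_bleu are made up for this sketch, and it is not a replacement for multi-bleu.perl, whose scores you should report.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference line per hypothesis line."""
    matches = [0] * max_n  # clipped n-gram matches, summed over all lines
    totals = [0] * max_n   # number of n-grams in the hypotheses
    ref_len = hyp_len = 0

    for ref_line, hyp_line in zip(references, hypotheses):
        ref, hyp = ref_line.split(), hyp_line.split()
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts, hyp_counts = ngrams(ref, n), ngrams(hyp, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, a single zero precision makes the whole score zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

The function takes two lists of lines, for example the lines of a tokenized reference file and of a tokenized translation file. Note how a translation that differs from the reference in a single word already loses matches at every n-gram order, which is worth keeping in mind when you compare BLEU with your manual scores.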

Manual evaluation

Look at the English translations and evaluate around 20 lines (or as many as you have time for) from either the movie subtitles or the syllabi. Score each line for each of the Convertus, Google and Bing translations, using the following subjective assessment scale:

  1. Correct translation: 3 points
  2. Includes the main contents; however, there are grammatical problems that do not affect the understanding of the main message: 2 points
  3. Parts of the original message are lost but the main content is still understandable in its context: 1 point
  4. Unacceptable translation: 0 points

Compute the average score of your manual evaluations for each translation file. Record the scores obtained.
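If you record your manual scores as a list of numbers per system, the average can be computed with a small helper like the one below. This is just a sketch; the function name average_score is made up for illustration, and it assumes one score on the 0-3 scale per evaluated line.

```python
def average_score(scores):
    """Return the mean of per-line manual scores on the 0-3 scale."""
    if not scores:
        raise ValueError("no scores given")
    for score in scores:
        # Only the four values of the assessment scale are allowed.
        if score not in (0, 1, 2, 3):
            raise ValueError(f"score outside the 0-3 scale: {score!r}")
    return sum(scores) / len(scores)
```

For example, average_score([3, 2, 2, 1]) averages the scores of four evaluated lines. Checking that each score is on the scale catches typing mistakes before they distort the average.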


Compare the results of the manual and the automatic evaluations and think about the following questions:
  1. Do the manual and the automatic evaluations give high/low results to the same systems? Why?
  2. Can you see problems with the reference translations that may negatively influence the automatic evaluation? Do you observe acceptable translations that are not well matched with the reference and, therefore, are penalized by the automatic metrics without objective reason? Discuss possible solutions to the problems you discover.
  3. Did the domain of the texts affect the quality of the translations? Were there differences between the three translation engines? Did they make different kinds of mistakes? Why do you think this is so?
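For the first question, one simple way to compare the two evaluations is to rank the systems by each score and check whether the rankings agree. The sketch below assumes you have stored your scores in dictionaries keyed by system name; the function names are made up for illustration, and the numbers in the usage comment are placeholders, not real results.

```python
def rank_systems(scores):
    """Return the system names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def same_ranking(bleu_scores, manual_scores):
    """Return True if both evaluations order the systems identically."""
    return rank_systems(bleu_scores) == rank_systems(manual_scores)


# Usage with placeholder values (replace them with your own scores):
# bleu = {"Convertus": 1.0, "Google": 2.0, "Bing": 3.0}
# manual = {"Convertus": 1.5, "Google": 2.5, "Bing": 2.0}
# same_ranking(bleu, manual)
```

Agreement in ranking does not mean that the two scores are comparable in size, since BLEU and the 0-3 scale measure different things. With only three systems and about 20 lines, treat any agreement or disagreement as a starting point for discussion rather than as evidence.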

Wrapping up

Towards the end of the assignment session, everyone will be expected to share their findings and discuss the assignment with the full class. You should all be prepared to report your main findings and to discuss the questions asked, as well as any other interesting issues that came up during the assignment.


The assignment is supposed to be examined in class, on September 6, 9-12. You need to be present and active during the whole session.

If you fail to attend the oral session, you instead have to write an assignment report, which can be done individually or in a pair. Spend around 45 minutes on task 1, and do task 2. Then write a report where you discuss your findings, including the scores for the different evaluations in task 2. Your report should be 1-2 A4 pages. The report should be handed in via the student portal as a PDF. The deadline is October 26, 2018.

Background information

Chapter 8 in the course textbook
Original publication: Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu: BLEU: a Method for Automatic Evaluation of Machine Translation (2002)