Assignment 1 - Machine translation evaluation


In this assignment session you will gain hands-on experience of using real machine translation services. You will compare the quality of different systems using both manual and automatic evaluation methods. Finally, you will assess the pros and cons of machine-translated texts.

This assignment is examined in class, on September 2, 11-12. Before the examination you will need to spend around two hours doing the work described in this document. If you miss the examination session or are inactive, you will have to compensate for this; see the bottom of this page.


The assignment work can be performed in pairs or individually. If you want to work in a pair, preferably team up with a student who has the same native language as you, or with someone with whom you share a good command of some language besides English. It is not necessary to work in the same pairs for all assignments. Note that both students in a pair should be able to discuss the findings independently!

Take notes when performing the assignment work. During the 1-hour examination session, you will get a chance to talk about your experiments: what you did, what surprised you, what you learned, etc. In addition, the teachers will talk to students during the session to discuss their findings.

1 - First impressions of MT quality / error analysis

In this subtask you should use the state-of-the-art translation service Google Translate to translate a text, which you will then evaluate. You can freely choose which languages to work with (provided that Google Translate handles them). As the target language, choose a language that you (both) know well (native if possible). As the source language, if possible, choose a language that you (both) know (English in most cases, unless English is your target language); if that is not possible, choose a language that one of you knows well.

When you have decided on your language pair, find an article written in your source language. A good choice is a Wikipedia article, or an article from some news service in that language. Copy the main text of the article and paste it into Google Translate to translate it. Do NOT spend a long time choosing the article.

Read the translated text and note your first impression of its quality. Then analyze the quality and the errors in more detail, identifying what types of errors occur and which types are the most frequent. Try to think about the following questions (and feel free to make any additional observations!):

You should spend approximately 45 minutes on this task.

2 - Automatic vs manual evaluation

In this task we will carry out some experiments with text from two very different domains: course plans (from Luleå University) and movie subtitles. You will try methods for automatic and manual evaluation.


If you need help logging in to our computer system from your laptop, look here.

Create a new directory for this assignment and copy the following six files into this new work directory:

mkdir assignment1/
cd assignment1/
cp /local/kurs/mt/assignment1/data/* .

Source files:

Reference files:

Translated files:
The following files have been translated using the rule-based Convertus engine. The Convertus system specializes in course syllabi.

Translate the sentences in the Swedish source files into English, using Google Translate or Bing Translator. Save the translation results in separate text files in your work directory.

Automatic evaluation

Tokenize the reference texts and the translated texts (Convertus and Google/Bing), using the provided script.

perl /local/kurs/mt/mosesdecoder/scripts/tokenizer/tokenizer.perl -l LANG < raw-translation.txt > tokenized-translation.txt

Here LANG is the language of the file ('sv' for Swedish and 'en' for English), raw-translation.txt is the output from a translation service or the reference translation, and tokenized-translation.txt is the resulting tokenized text.
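To see why this step matters for BLEU, which counts matching word sequences, it can help to look at what tokenization does to a sentence. The following is only a rough sketch in Python, not the Moses tokenizer itself: simple_tokenize is a hypothetical helper that merely splits punctuation from words, whereas tokenizer.perl also applies language-specific rules. The example sentence is made up for illustration.

```python
import re
from typing import List


def simple_tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation tokens.

    Simplified illustration only (an assumption, not the Moses algorithm);
    tokenizer.perl applies further, language-specific rules.
    """
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


raw = "The course is given in English, not Swedish."
print(" ".join(simple_tokenize(raw)))
# prints: The course is given in English , not Swedish .
```

Without this step, "English," would count as a single token and would not match "English" in a reference translation, which would lower the n-gram precision for no good reason.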

Use the provided multi-bleu.perl script to compute BLEU scores for the translated texts, using the reference translations provided. Record the scores you obtain for each system.

perl /local/kurs/mt/mosesdecoder/scripts/generic/multi-bleu.perl tokenized-reference.txt < tokenized-translation.txt

The multi-bleu output will look like this:
BLEU = 31.38, 69.1/47.7/29.9/19.9 (BP=0.838, ratio=0.850, hyp_len=488, ref_len=574)

The first score is the BLEU score that we are mainly interested in. The following four scores are the 1-gram to 4-gram precisions. The scores in parentheses give the brevity penalty (BP) and information about the lengths of the hypothesis and the reference.
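To make the relationship between these numbers concrete, the sketch below recomputes the BLEU score from the figures in the sample output line. It assumes the standard BLEU definition from the original publication: the geometric mean of the four n-gram precisions, multiplied by a brevity penalty of exp(1 - ref_len / hyp_len) whenever the hypothesis is shorter than the reference. The function names are made up for this illustration.

```python
import math

# Figures copied from the sample multi-bleu output line.
precisions = [69.1, 47.7, 29.9, 19.9]  # 1-gram to 4-gram precision, in percent
hyp_len, ref_len = 488, 574


def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """1.0 if the hypothesis is at least as long as the reference,
    otherwise exp(1 - ref_len / hyp_len)."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def bleu(precisions, hyp_len, ref_len) -> float:
    """Geometric mean of the n-gram precisions times the brevity penalty."""
    if min(precisions) <= 0:
        return 0.0  # a zero precision makes the geometric mean zero
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(hyp_len, ref_len) * math.exp(log_mean)


bp = brevity_penalty(hyp_len, ref_len)
score = bleu(precisions, hyp_len, ref_len)
print(f"ratio={hyp_len / ref_len:.3f} BP={bp:.3f} BLEU={score:.2f}")
```

The ratio and brevity penalty should come out as 0.850 and 0.838, as in the sample line, and the recomputed BLEU should land very close to the reported 31.38. Tiny differences are possible because the printed precisions are rounded to one decimal. Note how a translation that is 15% shorter than the reference loses roughly 16% of its score through the brevity penalty alone.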

Manual evaluation

Look at the English translations and evaluate at least 20 text lines (more if you have time) from either the movie subtitles or the syllabi, for both the Convertus and the Google/Bing translations, using the following subjective assessment scale:

  1. Correct translation: 3 points
  2. Includes the main contents; however, there are grammatical problems that do not affect the understanding of the main message: 2 points
  3. Parts of the original message are lost but the main content is still understandable in its context: 1 point
  4. Unacceptable translation: 0 points

Compute the average score of your manual evaluations for each translation file. Record the scores obtained.
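If you keep your per-line scores in a list, the average can be computed with a few lines of Python. This is a minimal sketch: the helper name average_score and the example scores are made up for illustration and are not results from any of the systems.

```python
from statistics import mean

VALID_POINTS = {0, 1, 2, 3}  # the four levels of the assessment scale


def average_score(points):
    """Return the mean of per-line scores, rejecting values outside the scale."""
    if not points:
        raise ValueError("no scores given")
    bad = [p for p in points if p not in VALID_POINTS]
    if bad:
        raise ValueError(f"scores outside the 0-3 scale: {bad}")
    return mean(points)


# Hypothetical scores for illustration only, one per evaluated line.
example_scores = [3, 2, 2, 0, 3, 1, 2, 3]
print(f"{average_score(example_scores):.2f}")
# prints: 2.00
```

Keeping one such list per translation file makes it easy to report the averages side by side with the BLEU scores.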


Compare the results of the manual and the automatic evaluations and think about the following questions (and other questions you may have):
  1. Do the manual and the automatic evaluations give high/low results to the same systems? Why?
  2. Can you see problems with the reference translations that may negatively influence the automatic evaluation? Do you observe acceptable translations that are not well matched with the reference and, therefore, are penalized by the automatic metrics without objective reason? Discuss possible solutions to the problems you discover.
  3. Did the domain of the texts affect the quality of the translations? Were there differences between the three translation engines? Did they make different kinds of mistakes? Why do you think this is so?
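For the first question, one simple way to compare the two evaluations is to check whether they rank the systems in the same order. The sketch below assumes you have stored one BLEU score and one average manual score per system. The system names and all numbers are placeholders for illustration only and should be replaced with your own results.

```python
def rank_systems(scores):
    """Return system names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def same_ranking(bleu_scores, manual_scores):
    """True if both evaluations order the systems identically."""
    if set(bleu_scores) != set(manual_scores):
        raise ValueError("both evaluations must cover the same systems")
    return rank_systems(bleu_scores) == rank_systems(manual_scores)


# Placeholder numbers for illustration only; replace with your own results.
bleu_scores = {"system_a": 20.0, "system_b": 30.0}
manual_scores = {"system_a": 2.4, "system_b": 1.9}

print("BLEU ranking:  ", rank_systems(bleu_scores))
print("Manual ranking:", rank_systems(manual_scores))
print("Same ranking:  ", same_ranking(bleu_scores, manual_scores))
```

With only two or three systems, a ranking comparison says little statistically, but a disagreement is a good starting point for looking at individual sentences where BLEU and your own judgment diverge.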

Wrapping up

During the examination session, everyone will be expected to share their findings and discuss the assignment, in smaller groups and in the full class. You should all be prepared to report your main findings and to discuss the questions asked, as well as any other interesting issues that came up during the assignment.


The assignment will be examined in the lab rooms (Chomsky and Turing), on September 2, 11-12. You need to be present and active during the whole session.

If you fail to attend the oral session, you have to report your assignment results later instead. You can then either discuss the assignment during one of the project supervision sessions (provided that the teacher has time), or write a report (around 1-2 A4 pages) where you discuss your findings, including the scores for the different evaluations in task 2. The deadline for the compensation is October 22.

Background information

Chapter 8 in the course textbook
Original publication: Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu (2002): BLEU: a Method for Automatic Evaluation of Machine Translation.