Assignment1 - Machine translation evaluation

Aim

In this assignment session you will gain hands-on experience with real machine translation services. You will compare the quality of different systems using both manual and automatic evaluation methods. Finally, you will assess the pros and cons of machine-translated texts.

This assignment is examined in class on September 6, 9-12. Note that you have to attend the full session and be active in order to pass the assignment. If you miss this session or are inactive, you will have to compensate for this; see the bottom of this page.

Practicalities

This assignment is intended to be performed in pairs. Team up with another student in order to solve it. If possible, team up with a student who has the same native language as yourself, or with someone with whom you share a good command of some language besides English. It is not necessary to work in the same pairs during all assignments.

Take notes during the assignment. During the last hour of the session, groups will get a chance to talk about their experiments: what they did, what surprised them, what they learned, etc. In addition, the teachers will talk to students during the session to discuss their findings.

1 - First impressions of MT quality / error analysis

In this subtask you will use the state-of-the-art translation service Google Translate to translate a text, which you will then evaluate. You can freely choose which languages to work with (as long as Google Translate handles them). As the target language, choose a language that you both know well (your native language if possible). As the source language, if possible, choose a language that both of you know (English in most cases, unless English is your target language); if that is not possible, choose a language that one of you knows well.

When you have decided on your language pair, find an article written in your source language. A good choice is a Wikipedia article, or an article from some news service in that language (e.g. BBC for English). Copy the main text from the article and paste it into Google Translate to translate it. Do NOT spend a long time on choosing the article.

Look at the translated text and note your first impression of the quality. Then analyze the quality and errors in more detail, identifying which types of errors occur and which types are the most frequent. Try to think about the following questions (and feel free to make additional observations!):

You should spend approximately 45 minutes on this task.

2 - Automatic vs manual evaluation

In this task we will carry out some experiments with text from two very different domains: course plans (from Luleå University) and movie subtitles. You will try methods for automatic and manual evaluation.

Data

Create a new directory for this assignment and copy the following six files into this new work directory:

mkdir assignment1/
cd assignment1/
cp /local/kurs/mt/assignment1/data/* .

Source files:
pirates50.sv.txt
kursplan50.sv.txt

Reference files:
pirates50.en.txt
kursplan50.en.txt

Translated files:
The following files have been translated using the rule-based Convertus engine (which will be described during the guest lecture on October 15). The Convertus system specializes in course syllabi.
pirates50.BTS.en.txt
kursplan50.BTS.en.txt

Translate the sentences in the Swedish source files into English, using Google Translate or Bing Translator. Save the translation results in separate text files in your work directory.
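A consistent file naming scheme makes the following steps easier, for example pirates50.google.en.txt and kursplan50.google.en.txt for the Google Translate output (the exact names are of course up to you).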

Automatic evaluation

Tokenize the reference texts and the translated texts (Convertus and Google/Bing), using the provided script.

perl /local/kurs/mt/mosesdecoder/scripts/tokenizer/tokenizer.perl -l LANG < raw-translation.txt > tokenized-translation.txt

Here, LANG is the language of the file ('sv' for Swedish and 'en' for English), raw-translation.txt is the translation from each translation service (or the reference translation), and tokenized-translation.txt is the resulting tokenized translation.
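
For example, assuming the suggested file name pirates50.google.en.txt for the Google translation of the subtitles, that translation and the corresponding English reference could be tokenized like this (the .tok.txt suffix is just one possible convention):

perl /local/kurs/mt/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < pirates50.google.en.txt > pirates50.google.en.tok.txt
perl /local/kurs/mt/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < pirates50.en.txt > pirates50.en.tok.txt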

Use the provided multi-bleu.perl script to compute BLEU scores for the translated texts, using the reference translations provided. Record the scores you obtain for each system.

perl /local/kurs/mt/mosesdecoder/scripts/generic/multi-bleu.perl tokenized-reference.txt < tokenized-translation.txt
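
For example, with the hypothetical tokenized file names from above, the Google translation of the subtitles would be scored against the tokenized English reference like this:

perl /local/kurs/mt/mosesdecoder/scripts/generic/multi-bleu.perl pirates50.en.tok.txt < pirates50.google.en.tok.txt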

The multi-bleu output will look like this:
BLEU = 31.38, 69.1/47.7/29.9/19.9 (BP=0.838, ratio=0.850, hyp_len=488, ref_len=574)

The first score is the BLEU score, which is the one we are mainly interested in. The following four scores are the 1-gram to 4-gram precisions. The scores in parentheses give the brevity penalty and some information about the lengths of the hypothesis and the reference.
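
These numbers fit together as follows: the BLEU score is the brevity penalty multiplied by the geometric mean of the four n-gram precisions, i.e. BLEU = BP × (p1 × p2 × p3 × p4)^(1/4). For the example output above, 0.838 × (0.691 × 0.477 × 0.299 × 0.199)^(1/4) ≈ 0.314, which matches the reported 31.38 up to rounding of the displayed precisions. The brevity penalty is exp(1 - ref_len/hyp_len) when the hypothesis is shorter than the reference (here exp(1 - 574/488) ≈ 0.838) and 1 otherwise, so a system cannot boost its score by producing very short but precise translations.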

Manual evaluation

Look at the English translations and evaluate around 20 text lines (or as many as you have time for) of either the movie subtitles or the syllabi, for both the Convertus and the Google/Bing translations, using the following subjective assessment scale:

  1. Correct translation: 3 points
  2. Includes the main contents; however, there are grammatical problems that do not affect the understanding of the main message: 2 points
  3. Parts of the original message are lost but the main content is still understandable in its context: 1 point
  4. Unacceptable translation: 0 points

Compute the average score of your manual evaluations for each translation file. Record the scores obtained.
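
If you record the scores in a plain text file with one score per line (say google-scores.txt, a name used here only for illustration), a simple one-liner can compute the average:

awk '{ sum += $1 } END { print sum/NR }' google-scores.txt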

Comparison

Compare the results of the manual and the automatic evaluations and think about the following questions:
  1. Do the manual and the automatic evaluations give high/low results to the same systems? Why?
  2. Can you see problems with the reference translations that may negatively influence the automatic evaluation? Do you observe acceptable translations that are not well matched with the reference and, therefore, are penalized by the automatic metrics without objective reason? Discuss possible solutions to the problems you discover.
  3. Did the domain of the texts affect the quality of the translations? Were there differences between the translation engines? Did they make different kinds of mistakes? Why do you think this is so?

Wrapping up

Towards the end of the assignment session, everyone will be expected to share their findings and discuss the assignment with the full class. You should all be prepared to report your main findings, discuss the questions asked, and bring up any other interesting issues that came up during the assignment.

Reporting

The assignment is examined in class on September 6, 9-12. You need to be present and active during the whole session.

If you fail to attend the oral session, you instead have to do the assignment on your own and report it afterwards. Spend around 45 minutes on task 1, and do task 2. You can then either discuss the assignment during one of the project supervision sessions (provided that the teacher has time), or write a report where you discuss your findings, including the scores for the different evaluations in task 2 (around 1-2 A4 pages). The deadline for the compensation is October 25.

Background information

Chapter 8 in the course textbook
Original publication: Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu: BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL 2002.