Word-based SMT -- Part 1

This document describes the tasks you need to perform for part 1 of assignment 2. This assignment is examined in class, on September 13, 9-12. Note that you have to attend the full session and be active in order to pass the assignment. If you miss this session, you will instead have to write a report, see the bottom of this page.

The purpose of this assignment is to gain insight into how a word-based SMT system works, and into the importance of the language model (LM) and translation model (TM). You will change the probabilities in these models by hand, to explore what happens. Normally you would not do this by hand; you would train the probabilities on a corpus. This is thus somewhat artificial, but it is intended to give you a better idea of how these models work.
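To see how the LM and TM interact, it can help to recall the noisy-channel formulation that word-based SMT is built on: the decoder searches for the target sentence t that maximizes p(t) * p(s|t), where p(t) comes from the LM and p(s|t) from the TM. The sketch below is purely conceptual (it is not the course decoder, and all probabilities are invented); it scores two candidate translations of Swedish "ett hus" under a toy bigram LM and a toy word translation model with a one-to-one alignment:

```python
import math

def lm_score(words, bigram_probs):
    """Log-probability of a word sequence under a toy bigram LM."""
    score = 0.0
    for prev, cur in zip(words, words[1:]):
        score += math.log(bigram_probs.get((prev, cur), 1e-6))
    return score

def tm_score(source, target, word_probs):
    """Log-probability of source words given aligned target words
    (a one-to-one alignment is assumed for simplicity)."""
    score = 0.0
    for s, t in zip(source, target):
        score += math.log(word_probs.get((s, t), 1e-6))
    return score

# Invented probabilities for illustration only.
bigram_probs = {("a", "house"): 0.4, ("an", "house"): 0.01}
word_probs = {("ett", "a"): 0.6, ("ett", "an"): 0.4, ("hus", "house"): 0.9}

source = ["ett", "hus"]
candidates = [["a", "house"], ["an", "house"]]

# Noisy channel: pick the candidate maximizing log p(t) + log p(s|t).
best = max(candidates,
           key=lambda t: lm_score(t, bigram_probs) + tm_score(source, t, word_probs))
print(best)  # the LM strongly prefers "a house" over "an house"
```

Note how even though the TM alone gives "an" a fairly high probability, the LM tips the decision: this is exactly the kind of interaction you will be manipulating in the tasks below.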

The assignment is set up for translation both from English to Swedish and from Swedish to English. Try out both translation directions in the beginning! Later in the assignment you may choose to focus on one translation direction. Focus on translation into a language you speak well: if you are Swedish or speak Swedish well, focus on translation into Swedish; otherwise, focus on translation into English. If you do not know Swedish, there is a grammar sketch of Swedish, which outlines the main issues encountered in this assignment.

Take notes during the assignment. During the last hour of the session, groups will get a chance to talk about their experiments: what they did, what surprised them, what they learned, etc. We will also ask each group to report their final score(s) with their modified models. In addition, the teachers will talk to students during the session, to discuss their findings.


This assignment is intended to be performed in pairs. Team up with another student in order to solve it. It is not necessary to work in the same pairs during all assignments.

1 - Familiarize yourself with the system and run with uniform probabilities

The translation system is described here. Listen to the brief description of it by your teacher and/or read the description of it. You will probably have to go back to the description during the assignment as well!

Copy all the files needed for the assignment:

mkdir assignment2  
cd assignment2
cp /local/kurs/mt/assignment2/data/* .

The model files are given twice, so that you keep a copy of the original files when you start modifying them yourself.

In the given model files all probabilities are equal. This likely gives bad translations. Run the sample sentences through the translation system, study the translation suggestions, and use the automatic evaluation to explore the overall results and find out the average rank. Feel free to add more sentences if you want to explore something you find interesting. The commands for running the decoder are:

# for translation from Swedish to English

# show the translation results:
/local/kurs/mt/assignment2/simple_decoder/translate -lm lm.eng -tmw tmw.sweeng -tmf tmf.swe -o 2  -in test_meningar.swe 
# show the ranking of results:
/local/kurs/mt/assignment2/simple_decoder/translate -lm lm.eng -tmw tmw.sweeng -tmf tmf.swe -o 2  -in test_meningar.swe -eval test_meningar.eng 

# for translation from English to Swedish

# show the translation results:
/local/kurs/mt/assignment2/simple_decoder/translate -lm lm.swe -tmw tmw.engswe -tmf tmf.eng -o 2  -in test_meningar.eng 
# show the ranking of results:
/local/kurs/mt/assignment2/simple_decoder/translate -lm lm.swe -tmw tmw.engswe -tmf tmf.eng -o 2  -in test_meningar.eng -eval test_meningar.swe

The following questions are worth thinking about and discussing with your assignment partner:

2 - Manipulate the translation models

In this task you should adjust the probabilities for fertilities and word translations, to achieve better translations. You should not set any probabilities to 0, and they should all be real probabilities, i.e. 0 < p ≤ 1. In some cases you may improve some translations at the cost of others. There are problems you cannot solve by manipulating only the translation models. In the end, try to choose changes that make linguistic sense!

For the TMs to be proper probability models, they should contain proper probability distributions, i.e. the probabilities should sum to 1 for each word: the fertilities for each word should sum to 1, and the word probabilities p(s|t) should sum to 1 for each t. The given models are correct in this respect, except for rounding (which is OK), e.g. 0.33*3=0.99, not 1. Try to respect this, but we will not be strict about this issue in the current assignment.
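If you want to sanity-check your edits, the constraint above is easy to verify programmatically. The helper below assumes the model is held as a nested dict {target_word: {source_word: probability}}; the actual tmw/tmf file format may differ, and all entries here are invented for illustration. It flags every target word whose distribution does not sum to approximately 1, with a tolerance that allows rounding such as 0.33*3=0.99:

```python
def check_distributions(model, tolerance=0.02):
    """Return (word, total) pairs whose probabilities do not sum to ~1."""
    bad = []
    for target, dist in model.items():
        total = sum(dist.values())
        if abs(total - 1.0) > tolerance:
            bad.append((target, total))
    return bad

# Invented example entries (not the real model files):
tmw = {
    "a":     {"en": 0.6, "ett": 0.4},    # sums to 1.0: fine
    "an":    {"en": 0.5, "ett": 0.49},   # 0.99: within rounding tolerance
    "house": {"hus": 0.7},               # 0.7: flagged as improper
}
print(check_distributions(tmw))
```

Running this prints only the "house" entry, since its single translation probability of 0.7 leaves 0.3 of the probability mass unaccounted for.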

Start by changing one or a few things, and investigate what effect that has on the results. Try modifying both the word translation model (tmw) and the fertility model (tmf). You may compare changes that seem reasonable given your linguistic intuition with seemingly "stupid" changes. An example of a linguistically motivated change is to give a higher probability to translating "en" and "ett" as "a" than as "an", with the motivation that "a" is much more common in English than "an".

Here are some questions worth thinking about:

3 - Manipulate the language model

Go on to manipulate the language model as well, to try to further improve the translations. The given files contain 1-grams and 2-grams with equal probabilities. Try adjusting these first, and see if you can solve some problems. Again, you should not set any probabilities to 0, and they should all be real probabilities, i.e. 0 < p ≤ 1. You may also want to add n-grams that are missing, or remove ungrammatical n-grams. You might also want to add some 3-grams. If you do, remember to change the decoder's order flag to "-o 3".

An easy way to improve the translations is to add a lot of 3-grams from the test sets. This is not all that meaningful, and would just result in over-fitting. Instead, try to make some principled changes based both on the test sentences, and on your knowledge of English/Swedish. Try to think about what makes the uniform model bad.

Here are some issues worth thinking about.

In addition to changing the probabilities for n-grams in the file, you may also change the backoff weight for unknown n-grams. A very simple backoff strategy is used in the decoder. If, for example, a 3-gram is missing, it backs off to the 2-gram, but with a penalty that can be set on the command line. This penalty is simply multiplied by the 2-gram probability. If the 2-gram is missing too, it backs off to the 1-gram and multiplies it by the penalty yet another time. The backoff penalty is set on the command line with the flag "-b WEIGHT"; the default value is 0.01.
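The backoff scheme described above can be sketched in a few lines. This is an illustration of the described behaviour, not the decoder's actual code, and the probabilities are invented; each time a longer n-gram is missing, the history is shortened by one word and the penalty is multiplied in once more:

```python
def ngram_prob(words, probs, penalty=0.01):
    """Probability of the last word given its history, with simple backoff.
    `probs` maps word tuples (1-grams, 2-grams, 3-grams) to probabilities."""
    factor = 1.0
    while words:
        if tuple(words) in probs:
            return factor * probs[tuple(words)]
        words = words[1:]      # drop the oldest history word
        factor *= penalty      # apply the backoff penalty each time
    return 0.0                 # word not even in the 1-gram list

# Invented toy model: a 1-gram and a 2-gram, but no 3-grams.
probs = {("house",): 0.1, ("a", "house"): 0.4}

# The 3-gram "big a house" is missing, so we back off once:
# penalty * p(house | a) = 0.01 * 0.4
print(ngram_prob(["big", "a", "house"], probs))
```

With the default penalty of 0.01, one missing n-gram level costs a factor of 100 in probability, which is why raising or lowering "-b WEIGHT" can noticeably change which hypotheses the decoder prefers.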

You may also modify the LM and TM in parallel, since changes made in one of them will affect what would be good changes in the other model.

4 - Wrapping up

At the end of the session you will get a new command, that will evaluate your final system on a new set of secret sentences from the same domain. If your changes are very specific to the known test set, this evaluation might be bad, whereas it should hopefully be good if your changes are general!

Commands for running the blind evaluation are found below! Make sure you point to your own TM and LM files, that you set the correct n-gram order (normally 2 or 3), and that you set the backoff penalty if you changed it! If you only worked actively on one translation direction, only run this in that direction.

# for translation from Swedish to English
/local/kurs/mt/assignment2/simple_decoder/translate -lm lm.eng -tmw tmw.sweeng -tmf tmf.swe -o N  -evalBlind sweeng

# for translation from English to Swedish
/local/kurs/mt/assignment2/simple_decoder/translate -lm lm.swe -tmw tmw.engswe -tmf tmf.eng -o N  -evalBlind engswe

Then there will be a joint session where everyone shares their findings and discusses the issues of the assignment. For this it is good to keep in mind that the main purpose of this assignment is to learn more about how word based SMT models work, and the role of the LM and TM!

Each pair should:

Some general questions to think about:


The assignment is examined in class on September 13, 9-12. You need to be present and active during the whole session.

If you failed to attend the oral session, you instead have to write an assignment report, which can be done individually or in pairs. Spend around 2 hours experimenting with the weights in the TM and LM (starting with the TM before the LM) before writing your report. In the report, give the score(s) for your final system(s), compare them to the uniform system, and discuss what you did and what conclusions you can draw from your work. Also discuss at least a subset of the questions asked in the assignment text. Your report should be 1-2 A4 pages and should be handed in via the student portal as a PDF. The deadline is October 26, 2018.