In this assignment, you will use the Moses statistical machine translation system to train a phrase-based SMT system. You will tune the system with Mert, and compare performance on new sentences before and after tuning. You will also explore the dynamic programming beam search algorithm by varying some of the key parameters to the decoder. Most of you will use Moses in your projects later on, and this assignment should give you the experience that is needed for that.
This assignment is examined in class, on September 25, 9-12. Note that you have to attend the full session and be active in order to pass the assignment. If you miss this session, you will instead have to write a report, see the bottom of this page.
Take notes during the assignment. During the last hour of the session, groups will get a chance to talk about their experiments: what they did, what surprised them, what they learned, etc. In addition, the teachers will talk to students during the session to discuss their findings.
In this assignment, everyone is expected to complete at least sub tasks 1 and 2, and to attempt some part of sub task 3. You might not be able to finish all of sub task 3, though.
First, you will train a complete phrase-based SMT system to familiarise yourself with the Moses training pipeline.
Copy all the files from
/local/kurs/mt/assignment3/data to your home space.
This data is a small subset of the Swedish-English section of the Europarl corpus.
Before training a phrase-based SMT system, we often need to perform tokenization, casing normalization and corpus cleaning to obtain optimal results.
Moses provides various tools for these operations.
Have a look in
/local/kurs/mt/mosesdecoder/scripts/, particularly the
tokenizer and training folders, to get a feel for some of the available scripts.
Tokenize the training data as follows:
/local/kurs/mt/mosesdecoder/scripts/tokenizer/tokenizer.perl -l lang_id < europarl.train.lang_id > europarl.train.tk.lang_id
In the above command, you should replace
lang_id with the language codes sv and en for Swedish and English respectively.
Note that you do not have to follow the file-naming conventions described here, but you must give your output file a different name from the input file.
You should then lowercase the tokenized files as follows:
/local/kurs/mt/mosesdecoder/scripts/tokenizer/lowercase.perl < europarl.train.tk.lang_id > europarl.train.tk.lc.lang_id
Make sure that your input file here is the output from the tokenizer. Alternatives to lowercasing such as truecasing are also available in Moses; we will stick with lowercasing here for simplicity.
Finally, we will clean the corpus by removing sentences containing over 40 words. We have to be careful here: as our data is parallel, we must make sure to remove both the Swedish and English sentence, even if only one of them is too long. Luckily, the following command does this for us:
/local/kurs/mt/mosesdecoder/scripts/training/clean-corpus-n.perl europarl.train.tk.lc sv en europarl.train.tk.lc.fl 1 40
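The core idea of this cleaning step, dropping a sentence pair whenever either side falls outside the length limits, can be sketched with standard shell tools. The file names below are made up for illustration, and the real clean-corpus-n.perl script does more work (for example, it also checks sentence-length ratios):

```shell
# Build a tiny parallel "corpus": pair 1 is fine, pair 2's Swedish side
# has 50 words and must be dropped TOGETHER with its English counterpart.
long=$(printf 'ord %.0s' $(seq 1 50))
printf 'en mening\n%s\n' "$long" > toy.sv
printf 'a sentence\nshort\n' > toy.en
# Keep a pair only if BOTH sides have between 1 and 40 words.
paste toy.sv toy.en | awk -F'\t' '{
  n1 = split($1, a, " "); n2 = split($2, b, " ")
  if (n1 >= 1 && n1 <= 40 && n2 >= 1 && n2 <= 40) print
}' > kept.tsv
cut -f1 kept.tsv > toy.clean.sv   # only "en mening" survives
cut -f2 kept.tsv > toy.clean.en   # only "a sentence" survives
```

Note how the filter is applied to the pasted pair, never to one side alone, which is exactly what keeps the two files parallel.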
One of the key components of a phrase-based SMT system is a language model.
You may remember that we trained a language model with SRILM in part 2 of lab 2.
For this lab, a 5-gram English language model trained on Europarl data is available for you to use at
You do not have to copy this model to your home space in order to use it.
The Moses training pipeline consists of nine steps, all of which can be executed using the
train-model.perl script, which you will find at
You can read more about the training pipeline at http://www.statmt.org/moses/ (look at the 'Training' sub-menu on the left-hand side of the page).
The full command to train your model should look like this:
/local/kurs/mt/mosesdecoder/scripts/training/train-model.perl --corpus corpus --f sv --e en --root-dir outdir --lm 0:5:lm-file --external-bin-dir /local/kurs/mt/bin64 >logfile 2>&1
You should replace the placeholders
corpus, outdir and lm-file with the name of the corpus (e.g. europarl.train.tk.lc.fl if you followed the suggested naming conventions), an output directory which will be created to store all the output files, and the path to the language model file.
If everything goes well the training should take around one minute on our Linux system. If you experience any unexpected error messages, make sure that your output directory is empty before training, and give full paths to all files and directories.
(If you're wondering what the
2>&1 part of the command does, it redirects standard error to standard output.
Standard output and standard error are then both written to the same file by the >logfile redirection.
Note that the order of the two redirections is important.
This trick is not specific to Moses and can be used with any command-line application.)
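You can see the effect of the redirection order with a toy command (the file names here are made up):

```shell
# Right order: stdout is sent to the file first, then stderr is sent
# to wherever stdout now points (the file). Both lines end up in log.ok.
{ echo "to stdout"; echo "to stderr" >&2; } > log.ok 2>&1
# Wrong order: stderr is duplicated to the ORIGINAL stdout (the terminal)
# before stdout is redirected, so the error line never reaches log.bad.
{ echo "to stdout"; echo "to stderr" >&2; } 2>&1 > log.bad
```

Inspecting log.bad afterwards shows only the "to stdout" line, while log.ok contains both.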
Tip: If you want to see standard output/error in the terminal as they are produced, but also have them saved to a file, use the tee command.
In the above example you would replace
'>logfile 2>&1' with
'2>&1 | tee logfile'.
This can be particularly useful when executing complex commands that take more than a few seconds to run, as is the case here.
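A toy stand-in for the training command shows how 2>&1 and tee combine (logfile.demo is a made-up name):

```shell
# Both streams are merged by 2>&1 and piped to tee, which prints them
# to the terminal AND writes a copy to logfile.demo.
{ echo "normal output"; echo "an error message" >&2; } 2>&1 | tee logfile.demo
```

Afterwards, logfile.demo contains both lines, and you saw them scroll by as they were produced.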
Examine the files generated by the training process. Try to figure out what information they contain by looking at them. You may consult the Moses webpage to read about the training process. The training log may help you understand what goes on during training. Try to relate the training log output to the training pipeline on lecture slides.
The Moses configuration file (
moses.ini) contains different sections.
The [feature] section contains pointers to the phrase table file and language model file.
The configuration file also contains some feature weights.
Note that the phrase table has 4 weights, one for each feature contained in the phrase table.
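To see where these four features live, here is a made-up phrase-table line in the usual Moses ||| layout. The phrase pair, the score values and their order are purely illustrative; check the Moses documentation for the real conventions:

```shell
# source phrase ||| target phrase ||| feature scores (4 of them:
# phrase translation probabilities and lexical weights, both directions)
line='det är ||| it is ||| 0.5 0.3 0.4 0.2'
# Split on the " ||| " separators and print one score per line.
echo "$line" | awk -F' \\|\\|\\| ' '{print $3}' | tr ' ' '\n' > scores.demo
cat scores.demo
```

Each of the four numbers in the third field is weighted by one of the four phrase-table weights in moses.ini.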
Take a good look at this file and make yourself familiar with the main parameters defining a phrase-based SMT model.
Try translating some test data (which has been pre-tokenized and lowercased for you) with the Moses model you just trained:
/local/kurs/mt/mosesdecoder/bin/moses -f config-file < europarl.test.tk.lc.sv > out.en
Replace config-file with the path to the moses.ini file in your training output directory.
Have a look at the translations in out.en.
Remember that the model was trained on a small amount of data (~2000 sentences), so you will likely see many untranslated words in the output.
Use the BLEU metric to assess the quality of this translation:
/local/kurs/mt/mosesdecoder/scripts/generic/multi-bleu.perl europarl.test.tk.lc.en < out.en
What score do you get?
The translations in the previous section were obtained by running Moses with default weights for each feature of the linear model (you can see these in moses.ini).
To obtain better performance, we can tune these weights to maximise translation performance (measured with the BLEU score) on a separate development data set.
Here we will use minimum error-rate training (MERT) for this task.
You can run MERT as follows:
/local/kurs/mt/mosesdecoder/scripts/training/mert-moses.pl europarl.dev.tk.lc.sv europarl.dev.tk.lc.en /local/kurs/mt/mosesdecoder/bin/moses config-file --working-dir outdir --mertdir /local/kurs/mt/mosesdecoder/bin >logfile.mert 2>&1
You should replace config-file with the path to your moses.ini file, and
outdir with a new directory where you want Mert to store its output files.
Make sure once again to use full paths to all these files and directories.
As tuning normally requires Moses to translate the whole dev set (100 sentences) multiple times, this process may take a few minutes to run.
If you want, you can set the
--maximum-iterations option to 5 to cap the number of iterations.
Look at the output files produced by Mert.
You will notice that it produces a new configuration file for each iteration, and records the BLEU score on the development set at the top of each of these files.
The configuration file from the final iteration is copied to moses.ini in the Mert working directory.
We have now seen (hopefully) that running Mert improves performance on the development data. But what we are really hoping for is that it also increases the BLEU score on the test data. Re-run Moses on the test data using the new configuration file produced by Mert, and calculate the BLEU score on the output. How does this compare to the BLEU score Moses achieved earlier, before tuning?
For the rest of the assignment, we're going to use a larger pre-trained Swedish-English model (also trained on Europarl data). You can find the model in /local/kurs/mt/lab-moses/europarl.sv-en. Copy the ready-made moses.ini into your directory. Note that this configuration file was made with an earlier version of Moses, so it probably looks a bit different from the one you created in the previous section.
Note that you will need to tokenize and lowercase any Swedish text before translating with this model. You can use the script preproc.sh in the model directory to preprocess your test sentences.
Start the decoder with this model by entering:
/local/kurs/mt/mosesdecoder/bin/moses -f moses.ini
Try entering a few sentences and look at the translations you get. You can make up your own sentences or copy some sentences from a newspaper website such as DN or Svenska Dagbladet. You may also use sentences from the Europarl test set used above, or from the data sets in assignment 1. You can quit the decoder by pressing Control-D. Look at the BEST TRANSLATION line to see the scores. The decoder outputs the total score as well as the vector of the individual core feature scores.
You can increase the decoder's verbosity level to see what it does. If you run the decoder with the -v 2 option, it will tell you how many hypotheses were expanded, recombined, etc. With the -v 3 option, the decoder will dump information about all the hypotheses it expands to standard error. Make sure that you use short input sentences with -v 3! It is also a good idea to redirect the output to a file, as in the training commands above. The -report-segmentation option will show you how the input sentence was segmented into phrases.
Another way to gather information about how decoder parameters affect the output is by looking at n-best-lists containing the n best hypotheses in the part of the search space explored by the decoder. To generate n-best output, start the decoder with the -n-best-list file size option. This will output n-best-lists of the given size to the file you specify. Use an n-best size of around 100 to obtain a fair impression of the best output hypotheses.
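Assuming the usual Moses n-best layout (sentence id ||| hypothesis ||| feature scores ||| total model score; the hypotheses, scores and feature labels below are made up for illustration), such a list can be inspected with ordinary shell tools:

```shell
# A made-up three-entry n-best list for one input sentence.
cat > nbest.demo <<'EOF'
0 ||| this is a test ||| d: 0 lm: -10.2 tm: -5.1 w: -4 ||| -12.3
0 ||| is this a test ||| d: -2 lm: -11.5 tm: -5.1 w: -4 ||| -12.9
0 ||| this is an test ||| d: 0 lm: -14.0 tm: -4.8 w: -4 ||| -13.9
EOF
# Sort by the total model score (last field; higher = better) and show
# the best hypothesis. Splitting on '|' makes the score field number 10.
sort -t'|' -k10,10nr nbest.demo | head -1
```

Comparing how far down the list a fluent translation sits, and how close the top scores are to each other, gives a feel for how sharply the model distinguishes its hypotheses.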
Here are some options you can use to influence the search:
-s S: sets the stack size S for histogram pruning (default: 100)
-b eta: sets the beam threshold eta for threshold pruning (default: 0.00001, which effectively disables threshold pruning in most cases!)
-max-phrase-length p: segments the input into phrases of length at most p (default: 10, which is more than the maximum phrase length in our phrase table!)
-dl d: sets the distortion limit (maximum jump) to d (default: 6; 0 means no reordering permitted, -1 means unlimited)
You can also set the ttable-limit directly in
moses.ini - this affects how many translation options are loaded for each span.
You can now experiment with different settings and options, for example starting with any of the below suggestions:
Try experimenting with these options, with input sentences of varying length, and find out how they interact with the number of hypotheses expanded by the decoder and with your subjective perception of translation quality.
Take a single input sentence and consider two different decoder configurations that produce different output. Your configurations should differ substantially in several parameters.
Compare the model scores of the two translations of your test sentence. Since the models are the same, the one with the lower score is the manifestation of a search error. Let's call the system with the lower score the target system and the system with the higher score the reference system.
Use the -v 3 output of the target system to find out where the search error occurs. Then try to adjust the target system's search parameters in such a way that the better solution output by the reference system is found, while expanding as few additional hypotheses as possible. How many hypotheses does your adjusted target system expand? How many did the reference system expand?
As for finding the search error, what you should do is find out what the best solution in your reference system looks like (segmentation, phrase translations, ordering). Then look at the search log (the -v 3 output) of the target system, starting with the empty hypothesis at the beginning (number 0) and try to follow the search path that would generate the same solution. I suggest you load the search log into a text editor, so you can use the search function to search for hypothesis numbers to see how they are expanded. The search error occurs at the point where the last hypothesis that is a prefix of the correct solution stops being expanded, because it's pruned or removed from the stack in another way. Depending on how exactly you set up your target system, it may also fail to generate the best solution in the first place, e.g., because the ttable-limit or the distortion limit prevents it. Then that would be the source of the search error.
Finally, try your systems on other sentences, and see how the model scores vary. Is it as you would expect?
Towards the end of the assignment session, everyone will be expected to share their findings and discuss issues of the assignment, in the full class. You should all be prepared to report your main findings, and discuss interesting issues that came up during the assignment.
You should also have gained some impression of phrase-based SMT, and be prepared to discuss the following questions:
If you failed to attend the oral session, you instead have to write an assignment report, which can be done individually or in pairs. You should do sub tasks 1 and 2, and at least part of sub task 3. You should spend around 2.5 hours of work on the assignment. Then write a report about your experiences with Moses, including a discussion of the questions outlined in the assignment. Your report should be 1-2 A4 pages. The report should be handed in via the student portal as a PDF. The deadline is October 26, 2018.