This document describes the tasks you need to perform for part 2 of assignment 2. The task is to train language models and explore some aspects of the word-based SMT decoder. The assignment is examined in class on September 17, 13-16. Note that you have to attend the full session and be active in order to pass the assignment. If you miss this session, you will instead have to write a report; see the bottom of this page.
You should work on both translation directions, but you may choose to focus parts of your analysis on one direction; see the specific instructions below. If you do not know Swedish, there is a grammar sketch of Swedish available. For the TM you can either use a TM with uniform probabilities, as in the files given, or you can use your modified TMs from part 1.
For training LMs, we will use the SRILM toolkit. You can run it with the following command:

/local/kurs/mt/srilm/bin/i686-m64/ngram-count -wbdiscount -text CORPUS_FILE -lm LM_FILE -order ORDER

where CORPUS_FILE is your training corpus, LM_FILE is the resulting language model that you will use as input to the decoder, and ORDER is the maximum LM order you want. The -wbdiscount flag makes the program use Witten-Bell smoothing. It is suitable for small corpora, but you do not have to think about smoothing in this assignment. (Note that Witten-Bell is not a suitable smoothing method for your course projects, where you will use larger training data.)
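For example, training a trigram LM could look like this (the corpus and LM file names here are just illustrations; use whatever names you have chosen):

/local/kurs/mt/srilm/bin/i686-m64/ngram-count -wbdiscount -text corpus.all.en -lm lm3.en -order 3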
In the resulting LM files the probabilities are given as logprobs. Our decoder uses standard probabilities, but can convert from logprobs if the flag -lp is used. It is therefore important that you add this flag in this assignment. The command you run will thus need the following flags:

/local/kurs/mt/assignment2/simple_decoder/translate -lm LM_FILE -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER -lp

If you fail to use this flag your results will be strange, and higher LM orders will most likely make the results worse. If that happens, add the flag!
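As a concrete illustration, a run with a trigram English LM might look like this (the LM and TM file names are placeholders for whatever your files are called; input and output are handled as in part 1):

/local/kurs/mt/assignment2/simple_decoder/translate -lm lm3.en -tmw tm.word -tmf tm.fert -o 3 -lp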
There are two corpora for training a language model: a parallel corpus, called corpus.parallel.*, and a monolingual corpus, called corpus.mono.*, which is larger. Both corpora contain the same type of block world sentences, but in the corpus labeled parallel, the English and Swedish sentences correspond to each other line by line (which would have been necessary if we had trained a translation model, but is not necessary for language model training). Have a brief look at the corpus files to familiarize yourself with them. For training your language model, concatenate the two corpora into one large corpus for each language, as sketched below.
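A minimal sketch of the concatenation, assuming the English files end in .en and the Swedish files in .sv (check the actual file names in the assignment directory):

cat corpus.parallel.en corpus.mono.en > corpus.all.en
cat corpus.parallel.sv corpus.mono.sv > corpus.all.sv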
Take notes during the assignment. During the last hour of the session, groups will get a chance to talk about their experiments: what they did, what surprised them, what they learned, etc. In addition, the teachers will talk to students during the session to discuss their findings.
In this assignment, everyone will start by performing sub task 1. Each pair can then choose to focus on different sub tasks depending on their interests. The sub tasks take different amounts of time, so depending on which ones you choose, expect to do one or more of them in addition to sub task 1.
Run the system with different n-gram orders, starting with 1 and increasing the order until you think you can get no further improvement. Run your experiments both on the known test set and on the blind test set. Think about what is a good compromise for translation quality, and how well the decoder will be able to generalize to new data. When you are done with this part, you should decide on a reasonable n-gram order to use for each translation direction! For all remaining experiments, use the n-gram order you have chosen for each direction. A sketch of such an experiment loop is given below.
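A minimal shell sketch of such a sweep, assuming the concatenated English corpus is called corpus.all.en and the TM files tm.word and tm.fert (all names illustrative; input and output handling as in part 1):

for ORDER in 1 2 3 4 5; do
    # train an LM of the given order
    /local/kurs/mt/srilm/bin/i686-m64/ngram-count -wbdiscount \
        -text corpus.all.en -lm lm.$ORDER.en -order $ORDER
    # decode with that LM, remembering the -lp flag
    /local/kurs/mt/assignment2/simple_decoder/translate \
        -lm lm.$ORDER.en -tmw tm.word -tmf tm.fert -o $ORDER -lp
done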
In this sub task, you investigate the effect of the amount of LM training data: train LMs on differently sized parts of the corpus and compare the translation results. (Tip: use wc to count the number of sentences, and tail to pick a certain number of sentences for the smaller files, as in the sketch below.)
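A minimal sketch for producing a smaller training file (file names and sizes illustrative):

# count the sentences (one per line) in the full corpus
wc -l corpus.all.en
# keep the last 4000 sentences as a smaller training corpus
tail -n 4000 corpus.all.en > corpus.small.en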
In this sub task, you should investigate how the choice of training data sample affects the results, by training LMs of a fixed size but with different samples. Choose a training data size (e.g. 40% of the size of the full data, or several sizes, if you wish). Then repeatedly pick a different training data sample containing your specified number of sentences, train an LM, and run the decoder, in order to investigate if and by how much the translation results vary. Look both at the ranks and at the actual translations for at least some sentences! (Tip: use the Unix command wc to count the number of sentences. The command sort -R shuffles the sentences in a file, and tail can be used for picking a certain number of sentences. Use Unix pipes (|) to put commands together when needed, as in the sketch below!)
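A minimal sketch that draws one random sample of 4000 sentences (file names and size illustrative):

sort -R corpus.all.en | tail -n 4000 > corpus.sample.en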
For this sub task, you can use the following two sentences, given in Swedish with English glosses:

Kim ställer ett brandgult block på det gröna fältet
Kim puts an orange block on the green field

hon ställer 2 blåa block på ett fält
she puts 2 blue blocks on a field
The unknown words are transferred as is by the decoder; this is a standard procedure in SMT systems. Thus, focus on the translations chosen for the words surrounding the unknown words, rather than on the unknown words themselves. Run the decoder on these two sentences, or on some other sentences, and study the results. For this task it is not so meaningful to look at the average rank; instead focus on the actual translations, especially the highest-ranking option. Analyze and discuss the translations chosen for the words surrounding the unknown words, focusing particularly on what happens in the language model around these words. A quick way to check whether a word is unknown is sketched below.
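A minimal sketch for checking whether a word occurs in the training data at all (corpus name illustrative):

# prints the number of lines containing 'brandgult' as a whole word; 0 means the word is unknown
grep -cw brandgult corpus.all.sv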
Towards the end of the assignment session, everyone will be expected to share their findings and discuss issues from the assignment with the full class. You should all be prepared to report your main findings and to discuss interesting issues that came up during the assignment.
By now you should also have gained some impression of word-based SMT. Keep in mind that the system you used was limited: it did not allow reordering, and you did not train the TM. Also, the sentence types in the block world are really simple. We will thus discuss some general issues relating to word-based translation during the final joint session.
If you failed to attend the oral session, you instead have to write an assignment report, which can be done individually or in pairs. Spend around 2 hours experimenting with LMs, choosing the sub tasks you find most interesting, before writing your report. In the report, give the scores for the systems you experimented with, and discuss your findings and the final questions in the "wrapping up" section above. Your report should be 1-2 A4 pages, handed in via the student portal as a PDF. The deadline is October 26, 2018.