Word-based SMT -- part 2

This document describes the tasks you need to perform for part 2 of assignment 2. The task here is to train language models and explore some aspects of the word-based SMT decoder. This assignment is examined in class on September 17, 13-16. Note that you have to attend the full session and be active in order to pass the assignment. If you miss this session, you will instead have to write a report; see the bottom of this page.

You should work on both translation directions, but you may choose to focus parts of your analysis on one language direction; see the specific instructions below. If you do not know Swedish, there is a grammar sketch of Swedish available. For the TM you can either use a TM with uniform probabilities, as in the files given, or you can use your modified TMs from part 1.

For training LMs, we will use the SRILM toolkit. You can run it with the following command:


/local/kurs/mt/srilm/bin/i686-m64/ngram-count -wbdiscount -text CORPUS_FILE -lm LM_FILE  -order ORDER
where CORPUS_FILE is your training corpus, LM_FILE is the resulting language model that you will use as input to the decoder, and ORDER is the maximum LM order you want. The -wbdiscount flag means that the program uses Witten-Bell smoothing. It is suitable for small corpora, but you do not have to think about smoothing in this assignment. (Note that -wbdiscount is not a suitable smoothing method for your course projects, where you will use larger training data.)
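For example, assuming your English training corpus is in a file called corpus.all.en (the file names here are only placeholders), an 8-gram model could be trained like this:

/local/kurs/mt/srilm/bin/i686-m64/ngram-count -wbdiscount -text corpus.all.en -lm en.8gram.lm -order 8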

In the resulting LM files the probabilities are given as logprobs. Our decoder uses standard probabilities, but can convert from logprobs if the flag -lp is used. It is thus important that you add this flag in this assignment. The command you run in this assignment therefore needs to use the following flags:


/local/kurs/mt/assignment2/simple_decoder/translate -lm LM_FILE -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER -lp 
If you fail to use this flag, your results will be strange, and higher LM orders will most likely make the results worse. If that happens, add this flag!

There are two sets of corpora for training a language model: a parallel corpus, called corpus.parallel.*, and a larger monolingual corpus, called corpus.mono.*. Both corpora contain the same type of block world sentences, but in the corpus labeled parallel, the English and Swedish sentences correspond to each other line by line (which would have been necessary if we had trained a translation model, but is not necessary for language model training). Have a brief look at the corpus files to familiarize yourself with them. For training your language model, concatenate the two corpora into one large corpus for each language, as shown below.
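A possible way to do the concatenation on the command line, assuming the files use language suffixes such as .en and .sv (check the actual file names in the assignment directory):

cat corpus.parallel.en corpus.mono.en > corpus.all.en
cat corpus.parallel.sv corpus.mono.sv > corpus.all.sv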

Take notes during the assignment. During the last hour of the session, groups will get a chance to talk about their experiments: what they did, what surprised them, what they learned, etc. In addition, the teachers will talk to students during the session to discuss their findings.

Practicalities

This assignment is intended to be performed in pairs. Team up with another student in order to solve it. It is not necessary to work in the same pairs during all assignments.

In this assignment, everyone will start by performing sub task 1. Each pair can then choose to focus on different sub tasks depending on their interest. The sub tasks take different amounts of time, so depending on which ones you choose, expect to do one or more of them besides sub task 1.

1 - Train LMs

In the first part you will train LMs with SRILM and run the translation system with different order LMs. Train your LMs on the full concatenated corpora. You can train the LM with 8-grams, and run the translation system with the switch "-o max-ngram-order" to only use n-grams up to that order. For instance, you can run the decoder with only 3-grams if you use "-o 3", even if you trained an 8-gram model, and so on. In this task you should perform translation in both directions.

Run the system with different n-gram orders, starting with 1 and increasing the order until you think you can get no further improvements. Run your experiments both on the known test set and on the blind test set. Think about what is a good compromise for translation quality, and how well the decoder will be able to generalize to new data. When you are done with this part, you should decide on a reasonable n-gram order to use for each translation direction! For all remaining experiments, use the n-gram order you have chosen for each direction.
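As a sketch, such a series of runs could look like the following, assuming an 8-gram English LM in en.8gram.lm (placeholder names; supply the TM files and the test-set input in the same way as in part 1, since only the -o value changes between runs):

for ORDER in 1 2 3 4 5 6 7 8; do
    echo "=== n-gram order $ORDER ==="
    /local/kurs/mt/assignment2/simple_decoder/translate -lm en.8gram.lm -tmw WORD-translation-model -tmf FERTILITY-model -o $ORDER -lp
done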

Choose sub tasks

When you are done with task 1, you can choose freely from the remaining sub tasks depending on your interest. In most cases, you will be expected to perform two or more additional sub tasks, but the tasks take different amounts of time, so it depends on your choices. For all sub tasks, you can choose to focus on one translation direction, or compare results on both translation directions. If you choose one direction, it is recommended that you focus on the target language for which you have the best language skills.

2 - Linguistic investigation of n-gram order

Repeat the experiments of sub task 1, where you run the decoder on the full data with different n-gram orders. Focus on the known test set. Look in detail at the translations (not only at the ranks, as before) and see which issues are solved at each n-gram order. Which issues are still problematic, even with high orders? Are there cases where some issue is solved only for some sentences? Think about why!

3 - Explore the influence of training data size

In general, more training data tends to give better results for SMT. Investigate this by using only part of the training data for training the LM. For instance, you can create three new smaller files that contain 50%, 25%, and 12.5% of the number of sentences in the full training data, respectively; other sizes are also possible! Train LMs on these smaller files, and see the effect on the translation results. Look both at the ranks and at the actual translations for at least some sentences! (Tip: use the Unix command wc to count the number of sentences, and head and/or tail to pick a certain number of sentences for the smaller files, as sketched below.)
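One way to create the smaller files, assuming the concatenated English corpus is called corpus.all.en (adjust the names to your own files):

wc -l corpus.all.en                                                         # total number of sentences (one per line)
head -n $(( $(wc -l < corpus.all.en) / 2 )) corpus.all.en > corpus.50.en    # roughly 50%
head -n $(( $(wc -l < corpus.all.en) / 4 )) corpus.all.en > corpus.25.en    # roughly 25%
head -n $(( $(wc -l < corpus.all.en) / 8 )) corpus.all.en > corpus.12.en    # roughly 12.5%

Train one LM per file with the same ngram-count command as before, and run the decoder with each of them.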

4 - Explore the influence of training data sampling

Not all sentences are equally useful for the language model. It is important that the data you use are representative of the language (and domain) of interest. As an example, in Swedish the uter gender is more common than the neuter gender, but if your training data happens to contain more sentences with nouns in the neuter gender than in the uter gender, this will be misrepresented in the LM.

In this sub task, you should investigate this issue by training LMs of a fixed size, but with different samples. Choose a training data size (e.g. 40% of the size of the full data, or choose several sizes, if you wish). Then repeatedly pick a different training data sample containing your specified number of sentences, train an LM, and run the decoder, in order to investigate if and by how much the translation results vary. Look both at the ranks and at the actual translations for at least some sentences! (Tip: use the Unix command wc to count the number of sentences. The command shuf or sort -R shuffles the sentences in a file. head and/or tail can be used for picking a certain number of sentences. Use Unix pipes (|) to put commands together when needed, as sketched below!)
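A sketch of one sampling round, assuming a 40% sample of a corpus called corpus.all.en (names are placeholders); repeat it with new samples to see how much the results vary:

N=$(( $(wc -l < corpus.all.en) * 40 / 100 ))            # 40% of the sentences
shuf corpus.all.en | head -n $N > corpus.sample1.en     # draw a fresh random sample
/local/kurs/mt/srilm/bin/i686-m64/ngram-count -wbdiscount -text corpus.sample1.en -lm sample1.lm -order 8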

5 - Sentence boundaries

In the decoder there is a switch for using markers at the start and end of sentences. These markers can help in language modeling by giving information about which words tend to occur at the start and end of sentences. When the switch "-s" is used with the decoder, the sentence boundary markers are activated; without it they are deactivated. Sentence markers are added automatically when you create a language model with SRILM. You can see them as "<s>" and "</s>" in the trained LM file. Compare translation with and without sentence boundaries activated, and look at the effects both on ranks and on actual translations (focusing on the beginning and end of sentences!).
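To compare the two settings, you can run the decoder twice with identical arguments except for -s, for instance (LM and TM files as in your earlier runs):

/local/kurs/mt/assignment2/simple_decoder/translate -lm LM_FILE -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER -lp -s
/local/kurs/mt/assignment2/simple_decoder/translate -lm LM_FILE -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER -lp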

6 - Unknown words

For this task, you will investigate what happens with words that are unknown, that is, that are not in the TM. You can focus on the following two sentence pairs, or come up with your own examples:
Kim ställer ett brandgult block på det gröna fältet
Kim puts an orange block on the green field
hon ställer 2 blåa block på ett fält
she puts 2 blue blocks on a field

Unknown words are transferred as-is by the decoder. This is a standard procedure in SMT systems. Thus, focus on the translations chosen for the words surrounding the unknown words, rather than on the unknown words themselves. Run the decoder on these two sentences, or some other sentences, and study the results. For this task it is not so meaningful to look at the average rank; instead, focus on the actual translations, especially the highest-ranking option. Analyze and discuss the translations chosen for the words surrounding the unknown words. Focus particularly on what happens in the language model for these words.

Wrapping up

Towards the end of the assignment session, everyone will be expected to share their findings and discuss issues from the assignment with the full class. You should all be prepared to report your main findings and discuss interesting issues that came up during the assignment.

By now you should also have formed some impressions of word-based SMT. Keep in mind that the system you used was limited in that it did not allow reordering, and you did not train the TM. Also, the sentence types in the block world are really simple. We will therefore discuss some general issues relating to word-based translation during the final joint session.

Reporting

The assignment is supposed to be examined in class on September 17, 13-16. You need to be present and active during the whole session.

If you fail to attend the oral session, you instead have to write an assignment report, which can be done individually or by a pair of students. Spend around 2 hours experimenting with LMs, choosing the sub tasks you find most interesting, before writing your report. In the report, give the scores for the systems you experimented with, and discuss your findings and the final questions in the "Wrapping up" section above. Your report should be 1-2 A4 pages and should be handed in via the student portal as a PDF. The deadline is October 26, 2018.