Assignment 5 - Sockeye and NMT

Aim

In this assignment, you will use the neural machine translation (NMT) toolkit Sockeye to train an attentional encoder-decoder model that normalizes historical Swedish spellings. Although this is not a traditional translation task, the basic idea is the same as in NMT. Moreover, the normalization model is much smaller than conventional NMT models, so it takes less time to train. You will train an end-to-end neural model, use it to normalize historical spellings, and evaluate the normalization. Along the way you will learn the general settings for training an NMT model and gain a better understanding of NMT, which is currently the mainstream approach in machine translation.

This assignment is examined in class. We have two 2-hour slots for it, on October 11 and 15, 13-15. Note that you have to attend the full session and be active in order to pass the assignment. If you miss this session, you will instead have to write a report; see the bottom of this page.

Take notes during the assignment. During the last hour of the second lab session, groups will get a chance to talk about their experiments: what they did, what surprised them, what they learned, etc. In addition, the teachers will talk to students during the session to discuss their findings.

For the first lab session, I suggest you finish the experiment parts of Parts 0, 1, and 2. You should answer at least half of the listed questions in these parts. You can, however, report your findings and answers whenever you are ready.

For the second lab session, you can focus on Part 3 and answer all the questions in this assignment.

Practicalities

This assignment is intended to be performed in pairs. Team up with another student in order to solve it. It is not necessary to work in the same pairs during all assignments.

0 - Install NMT tools (Sockeye, based on MXNet)

The NMT tools are not installed on our lab machines. You need to install them by yourself.

NMT tools make training NMT models much easier, since you don't need to implement a system from scratch. In general, NMT tools are built on top of deep learning frameworks and provide interfaces for different programming languages. OpenNMT, Sockeye, Marian, Nematus, and Tensor2Tensor are popular NMT tools. TensorFlow, PyTorch, DyNet, MXNet, and Chainer are popular deep learning frameworks. Our lab machines have no advanced GPUs, so we will install the CPU versions of the toolkits.

We chose Sockeye as the NMT tool because 1) Sockeye has many model implementations, and 2) it is the easiest tool to install on our lab machines.

Install MXNet

pip3 install mxnet --user

--user is required when you install a package with a non-root account. More details about MXNet. You may need to upgrade your pip first; use pip3 install --upgrade pip --user to do so.

Install Sockeye

pip3 install sockeye --user

This will also install or upgrade the dependencies. More details about Sockeye.

1 - Training the model

Copy all the files from /local/kurs/mt/assignment5 to your home directory. The data is a set of historical and modern Swedish spelling pairs from the Gender and Work corpus.

We train character-level NMT models for normalization. The training data consists of token pairs (historical and modern spelling pairs) rather than sentence pairs.

Data preparation

In NMT, we also need to perform tokenization, casing normalization and corpus cleaning to obtain optimal results. Moses provides various tools for these operations. In this assignment, the data has been segmented into character sequences. You can use the data directly.
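As a rough illustration of what "segmented into character sequences" means, the sketch below splits a token into space-separated characters and joins it back. The function names are hypothetical, the example word is made up rather than taken from the corpus, and the file format (one token per line, characters separated by single spaces) is an assumption; check the files you copied before relying on it.

```python
def to_char_sequence(token: str) -> str:
    """Turn a token into a space-separated character sequence."""
    return " ".join(token)


def from_char_sequence(sequence: str) -> str:
    """Join a space-separated character sequence back into a token."""
    return "".join(sequence.split())


if __name__ == "__main__":
    # Illustrative word only, not taken from the assignment data.
    segmented = to_char_sequence("hafwer")
    print(segmented)
    print(from_char_sequence(segmented))
```

With this representation, the model's "words" are single characters, so the vocabulary is tiny compared to a word-level translation model.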

Train the model

Look at train.sh, change the paths, and review the parameter settings. Then, run the following command to start training:

nohup ./train.sh >log.train 2>&1 & 

The training will take about 35 minutes, so start this command first and check the settings later. Even though we train the models on CPUs, we can still speed up training if the machine has multiple threads. For Sockeye, add the following setting to specify the number of threads used for training:

export OMP_NUM_THREADS=7

On our lab machines, you can use 7 of the 8 threads. Leave one or two threads free if you are working on other tasks at the same time. The solution is from here.
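If you are unsure how many threads a machine has, a small sketch like the following can suggest a value. The helper name training_threads and the default of reserving one thread are assumptions made for illustration; the script only prints the export line and does not change your environment.

```python
import os


def training_threads(reserve: int = 1) -> int:
    """Number of threads to use for training, keeping `reserve` threads free."""
    total = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, total - reserve)


if __name__ == "__main__":
    # Print the line to copy into your shell or into train.sh.
    print(f"export OMP_NUM_THREADS={training_threads()}")
```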

Parameter settings and training logs

While the model is training, let us have a look at the parameter settings. Use the command python3 -m sockeye.train --help >help.train to write the help information to the file help.train. Read the help file carefully. It will help you review the NMT lecture notes. Here are some questions you need to think about:

The training log is written to the log file (log.train). Read the log carefully. It will tell you the overall process of training. Please think about the following questions when you are looking at the log file:

In addition to the log, the toolkit also produces many other files during training. Look at the model directory and figure out what these files are used for.

2 - Testing the model

Use the trained model to normalize the test set, and evaluate the normalizations.

Decoding

Just run the shell script decode.sh: ./decode.sh. It takes about 1 minute to generate predictions (spelling normalizations). The decoding log is in the file log.decode.

Similarly, you can get the help file for decoding by using python3 -m sockeye.translate --help >help.decode.

Evaluation

In machine translation, we usually use BLEU as the main evaluation metric. Historical spelling normalization is a different task, so we use accuracy and character error rate (CER) as evaluation metrics instead. Accuracy is a token-level metric, and CER is a character-level metric. Here is a reference for CER. For accuracy, we run this command:

python3 eval_acc.py your_model_directory/pred/test.pred test/target.seq test/source.seq error.acc

test.pred is the path to the generated normalizations, target.seq is the path to the references, source.seq is the path to the source historical spellings, and error.acc is the output file for the errors. You need to change these paths to match your setup. For CER, you may need to install python-Levenshtein first (use the command pip3 install python-Levenshtein --user), then run this command:

python3 eval_cer.py your_model_directory/pred/test.pred test/target.seq test/source.seq error.cer

You can find the errors made by this NMT-based model from error.cer. There are five columns in error.cer, representing spelling ID, historical spellings, model predictions, reference spellings, and the edit distance between predictions and references. What is the overall edit distance? What does this mean?
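To make the two metrics concrete, here is a minimal sketch of token-level accuracy and CER. It is not the code of eval_acc.py or eval_cer.py, which may differ in details. It assumes predictions and references are plain strings without separators, and that CER is the total edit distance divided by the total number of reference characters. The edit distance is implemented directly so that the example does not depend on python-Levenshtein.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def accuracy(predictions, references):
    """Share of tokens whose prediction matches the reference exactly."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def character_error_rate(predictions, references):
    """Total edit distance divided by the total number of reference characters."""
    total_edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


if __name__ == "__main__":
    # Made-up toy strings, not assignment data.
    preds = ["ab", "cd"]
    refs = ["ab", "ce"]
    print(accuracy(preds, refs))
    print(character_error_rate(preds, refs))
```

Note how the two metrics differ: a prediction that is wrong by a single character counts as fully wrong for accuracy, but contributes only one edit to CER.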

3 - Exploring the parameters

Try changing the parameter settings (such as the embedding size, the number of hidden units, the encoder/decoder networks, the initial learning rate, etc.) and retrain one or two new models. Reminder: You need to specify a new directory to store each new model.

Tip: a larger embedding size or more hidden units may give you a better result, but training will also take more time. :(

You can also try different parameter settings during decoding, such as beam size.
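To see why beam size can matter, the toy sketch below runs beam search over a tiny hand-made model. Everything in it (TOY_MODEL, the symbols, the probabilities) is hypothetical and chosen for illustration; it is not Sockeye's decoder and leaves out details such as length normalization.

```python
import math

EOS = "</s>"

# Hypothetical toy model: prefix -> probability of each next symbol.
# Any prefix not listed here ends the sequence with probability 1.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"z": 0.9, "x": 0.1},
}


def next_log_probs(prefix):
    """Log-probabilities of the next symbol given the prefix."""
    distribution = TOY_MODEL.get(prefix, {EOS: 1.0})
    return {symbol: math.log(p) for symbol, p in distribution.items()}


def beam_search(beam_size, max_len=10):
    """Return (sequence, log-probability) of the best finished hypothesis."""
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        # Extend every live hypothesis by every possible next symbol.
        candidates = [(prefix + (symbol,), score + log_p)
                      for prefix, score in beams
                      for symbol, log_p in next_log_probs(prefix).items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        # Keep only the beam_size best candidates.
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == EOS:
                finished.append((prefix[:-1], score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[0] if finished else None


if __name__ == "__main__":
    for size in (1, 2):
        sequence, score = beam_search(size)
        print(size, "".join(sequence), round(math.exp(score), 2))
```

With beam size 1 (greedy search), the decoder commits to "a" because it is the most probable first symbol and ends with "ax" (probability 0.33). With beam size 2, it keeps "b" alive as well and finds "bz" (probability 0.36). A larger beam explores more hypotheses at the cost of more computation.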

Compare the new model with the first model, and see how the new parameter settings affect the model training, the normalization result, etc.

Wrapping up

Towards the end of the assignment session, everyone will be expected to share their findings and discuss issues with the assignment in the full class. You should all be prepared to report your main findings and discuss interesting issues that came up during the assignment.

You should also have had some impressions of NMT, and be prepared to discuss the questions listed in this assignment.

Reporting

The assignment is supposed to be examined in class, on October 11 and 15, 13-15. You need to be present and active during the whole session in order to pass.

If you fail to attend the oral session, you instead have to write an assignment report, which can be done individually or in pairs. You should spend around 3.5 hours of work on the assignment, then write a report about your experiences with Sockeye, including a discussion of the questions outlined in the assignment. Your report should be 1-2 A4 pages and should be handed in via the student portal as a PDF. The deadline is October 26, 2018.