Assignment 5 - NMT

Aim

In this assignment you will use the OpenNMT toolkit to train NMT models at different granularities and with different model architectures. You are expected to become familiar with NMT settings and with using this toolkit. You will learn to use the BPE toolkit to segment text, and to submit jobs on a GPU cluster.

This assignment is examined in class, on October 1, 9-12. Note that you have to attend the full session and be active in order to pass the assignment. If you miss this session or are inactive, you will have to compensate for this, see the bottom of this page.

Practicalities

This assignment is intended to be performed in pairs. Team up with another student in order to solve it. It is not necessary to work in the same pair during all assignments.

Take notes during the assignment. During the last half hour of the session, groups will get a chance to talk about their experiments, what they did, what surprised them, what they learned, etc. In addition, the teachers will talk to students during the session, to discuss their findings.

Preliminary

In this assignment, we will use GPUs on the Abel cluster to speed up training. You need to have an account first. You can log in to Abel via ssh: ssh yourUserName@abel.uio.no

Unlike running a job on a local machine, we submit jobs to the cluster's job queue, which is managed by Slurm. This means that our jobs might have to wait before running, depending on resource availability, priority, job limits, etc.
Each job should be submitted only once per group; the other student does not need to resubmit the same job! Since there is a queueing system, you will wait longer for your jobs if you submit duplicate ones.

Here are some basic commands of the Slurm system.
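For reference, this minimal set of standard Slurm commands should cover what you need in this assignment (the job script name is the one used further down; job IDs are whatever squeue reports):

sbatch sven.word.sh       # submit a batch job script to the queue
squeue -u yourUserName    # list your own queued and running jobs
scancel jobID             # cancel a job by its ID
scontrol show job jobID   # show detailed information about a job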

Data

When you have logged in to Abel, copy all the given files: cp -rf /usit/abel/u1/gtang/assignment5 ./
Then cd assignment5

File descriptions:

Note: You need to process the data in the directory data and submit jobs from the directory assignment5.

Training NMT models

We use the OpenNMT toolkit to train NMT models.

Pre-process Data

The first step is to pre-process the data files and build the vocabulary. The processed files are more suitable for training (faster). You can simply run the script preprocess_word.sh to do the pre-processing for word-level models. It will generate processed data with sven.word as the name prefix (sven.word.train.1.pt, sven.word.valid.1.pt, sven.word.vocab.pt). You need to pre-process the data for the character-level and subword-level models as well before training.
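If you want to see what happens inside such a script: for OpenNMT-py, the pre-processing step is roughly a call like the one below. The file names and paths here are assumptions for illustration; check preprocess_word.sh for the real ones.

python preprocess.py \
    -train_src data/train.sv -train_tgt data/train.en \
    -valid_src data/valid.sv -valid_tgt data/valid.en \
    -save_data data/sven.word   # prefix of the generated .pt files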

Training Models

Now we feed the pre-processed data into OpenNMT and start training. An example training script is given. You need to revise the model name, parameter settings, etc. in the script file.
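As a rough illustration, such a training script combines a Slurm header with an OpenNMT-py training call, along these lines. The account, time limit, and resource requests below are placeholders, and the exact GPU flag depends on the OpenNMT-py version; sven.word.sh contains the settings that actually work on Abel.

#!/bin/bash
#SBATCH --job-name=sven.word
#SBATCH --account=YourProjectAccount   # placeholder: use the course's project account
#SBATCH --time=08:00:00
#SBATCH --mem-per-cpu=8G
#SBATCH --gres=gpu:1                   # request one GPU

# older OpenNMT-py versions select the GPU with -gpuid,
# newer ones with -world_size 1 -gpu_ranks 0
python train.py -data data/sven.word -save_model models/sven.word -gpuid 0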

You can submit the jobs to train word-level and character-level models:
sbatch sven.word.sh #word-level
sbatch sven.char.sh #character-level

We use a very basic method to train our character-level models: we split words into characters and feed them into the model. Each character is viewed as a word, and the white spaces are replaced with a special token "^".
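One way to produce data in this format is a one-liner like the following (the file names are illustrative; the given data may already be split for you):

# mark word boundaries with "^", then put a space after every character
sed -e 's/ /^/g' -e 's/./& /g' -e 's/ $//' train.sv > train.char.sv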

Evaluation

We can use the trained models to translate the test set and evaluate them automatically using the BLEU score. (The commands are included in the script file.) Note that we need to merge the characters back into words before the evaluation. We use BLEU as the only automatic evaluation metric, but it should be taken as a reference only; you can also evaluate the results manually.
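The merging step is just the reverse of the character splitting above. A sketch, assuming the output file names below and scoring with the standard Moses multi-bleu.perl script (the provided script may use a different BLEU implementation):

# delete the spaces between characters, then turn "^" back into spaces
sed -e 's/ //g' -e 's/\^/ /g' pred.char.en > pred.en
perl multi-bleu.perl ref.en < pred.en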

When you have submitted the jobs, please:

Subword NMT models

In this section, we will explore subword-level models.

Subword Segmentation

We use the BPE toolkit (subword-nmt) to segment the tokenized data into subwords. An example script (process_bpe.sh) is also provided. You need to change the directory name or the number of subword operations (the subword vocabulary size). Note that we already have the tokenized data here; we usually need to tokenize the text before applying BPE segmentation.
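For orientation, the two subword-nmt steps inside such a script look roughly like this. The operation count of 10000 and the file names are just examples; process_bpe.sh sets the actual values.

subword-nmt learn-bpe -s 10000 < train.tok.sv > bpe.codes.sv          # learn the merge operations
subword-nmt apply-bpe -c bpe.codes.sv < train.tok.sv > train.bpe.sv   # segment the text with them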

Training and Evaluation

Training subword-level models is the same as training models at the word/character level. We first pre-process the segmented subword-level data with preprocess_bpe.sh. Then we submit the subword-level job (sbatch sven.bpe.sh). However, we need to post-process the "@@" tokens that were introduced during BPE segmentation before we conduct the evaluation.
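Removing the separators is a one-line sed command; this is the standard recipe from the subword-nmt documentation, with illustrative file names:

sed -r 's/(@@ )|(@@ ?$)//g' pred.bpe.en > pred.en   # strip the "@@" subword separators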

Please answer these questions when you have finished the preceding parts:

Wrapping up

Towards the end of the assignment session, everyone is expected to be familiar with training NMT models and using the OpenNMT toolkit, and to share their findings and discuss the assignment with the full class. You should all be prepared to report your main findings, discuss the questions asked, and raise any other interesting issues that came up during the assignment.

Reporting

The assignment is examined in class, on October 1, 9-12. You need to be present and active during the whole session.

If you fail to attend the oral session, you instead have to do the assignment on your own and report it afterwards. You can then either discuss the assignment during one of the project supervision sessions (given that the teacher has time), or write a report where you discuss your findings (around 1-2 A4 pages). The deadline for this compensation is October 25.