Assignment 5 - NMT
In this assignment you will use the OpenNMT toolkit to train NMT models
at different granularities, with different model architectures.
You are expected to become familiar with NMT settings and with using this toolkit.
You should learn to use a BPE toolkit to segment text.
You will learn to submit jobs on a GPU cluster.
This assignment is examined in class, on October 1, 9-12. Note that you have to attend the full session and be active in order to pass the assignment. If you miss this session or are inactive, you will have to compensate for this, see the bottom of this page.
This assignment is intended to be performed in pairs. Team up with another student in order to solve it. It is not necessary to work in the same pair during all assignments.
Take notes during the assignment. During the last half hour of the session, groups will get a chance to talk about their experiments, what they did, what surprised them, what they learned, etc. In addition, the teachers will talk to students during the session, to discuss their findings.
In this assignment, we will use GPUs on the Abel cluster to speed up training.
You need to have an account first. You can log in to Abel via ssh.
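A typical login looks like the following (abel.uio.no is the usual Abel login address; replace YourUserName with your own account name):

```shell
# Log in to the Abel cluster via ssh.
ssh YourUserName@abel.uio.no
```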
Unlike running a job on a local machine, we submit jobs to the cluster's job queue, which is managed by Slurm.
This means that our jobs might have to wait before running, depending on resource availability, priority, job limits, etc.
Each job should be submitted only once per group; the other student does not need to resubmit the same job. Since there is a queueing system, duplicate jobs will only make you wait longer.
Here are some basic commands of the Slurm system.
- Job scripts are submitted with the sbatch command:
sbatch YourJobScript. The sbatch command returns a jobid, an id number that identifies the submitted job. The job will wait in the job queue until there are free compute resources it can use.
A job in that state is said to be pending (PD). When it has started, it is called running (R).
Any output (stdout or stderr) of the job script will be written to a file called slurm-jobid.out in the directory where you ran sbatch, unless otherwise specified.
- You can cancel running or pending (waiting) jobs with scancel.
- You can inspect the job queue via squeue.
squeue -l -u YourUserName will list all your submitted jobs.
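The Slurm commands above can be summarized as a small cheat sheet (the job script name is an example; jobid is the number that sbatch printed for your job):

```shell
sbatch sven.word.sh        # submit a job; prints "Submitted batch job <jobid>"
squeue -l -u YourUserName  # list your jobs (PD = pending, R = running)
scancel <jobid>            # cancel a pending or running job
less slurm-<jobid>.out     # inspect the job's output so far
```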
When you have logged in to Abel, copy all the given files:
cp -rf /usit/abel/u1/gtang/assignment5 ./
Note: You need to process data in the directory data and submit jobs in the directory assignment5.
- data/sven.word.*: tokenized data
- data/sven.char.*: character-level data
- data/process_bpe.sh: segment words to subwords
- data/preprocess_word.sh: preprocess data for word level models
- data/preprocess_bpe.sh: preprocess data for subword level models
- data/preprocess_char.sh: preprocess data for character level models
- help.preprocess: OpenNMT help file for preprocessing data
- help.train: OpenNMT help file for training
- help.translate: OpenNMT help file for translating
- sven.word.sh: job script to train word-level NMT models
- sven.char.sh: job script to train character-level NMT models
- sven.bpe.sh: job script to train subword-level NMT models
- multi-bleu.perl: used for computing BLEU scores
Training NMT models
We use the OpenNMT toolkit to train NMT models.
The first step is to pre-process the data files and build the vocabulary. The processed files are more suitable for training (faster).
You can just run the script preprocess_word.sh to do the pre-processing for word-level models.
It will generate processed data using sven.word as the name prefix (sven.word.train.1.pt, sven.word.valid.1.pt, sven.word.vocab.pt).
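For orientation, a word-level preprocessing call of the kind wrapped by preprocess_word.sh roughly follows the OpenNMT-py pattern below. This is only a sketch: the file names and the .sv/.en extensions are assumptions, so check the actual script and help.preprocess for the real flags.

```shell
# Hypothetical sketch of the preprocessing step (see preprocess_word.sh
# and help.preprocess for the actual flags and file names).
python preprocess.py \
    -train_src data/sven.word.train.sv -train_tgt data/sven.word.train.en \
    -valid_src data/sven.word.valid.sv -valid_tgt data/sven.word.valid.en \
    -save_data data/sven.word
```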
You need to preprocess data for character-level and subword-level models as well before training.
Now we feed the pre-processed data into OpenNMT and start training. An example training script is given. You need to revise the model name, parameter settings, etc., in the script file.
You can submit the jobs to train word-level and character-level models:
sbatch sven.word.sh #word-level
sbatch sven.char.sh #character-level
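The parameter settings you are asked to revise live inside these job scripts; a typical OpenNMT-py training call has roughly the shape below. The flag values are illustrative, not the ones in the provided scripts, and help.train lists the authoritative set of options.

```shell
# Illustrative OpenNMT-py training call (values are examples only;
# see sven.word.sh and help.train for the actual settings).
python train.py \
    -data data/sven.word -save_model models/sven.word \
    -layers 2 -rnn_size 500 -word_vec_size 500 \
    -train_steps 100000 -gpu_ranks 0
```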
We use a very basic method to train our character-level models. We split words into characters and feed them into the model. Each character is treated as a word, and the white spaces between words are replaced with a special token "^".
We can use the trained models to translate the test set and evaluate them automatically with the BLEU score (the commands are included in the script files). Note that we need to merge the characters back into words before the evaluation.
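The splitting and merging can be done with simple stream editing; here is a self-contained toy illustration of the scheme described above (the example sentence is made up):

```shell
# Split a tokenized line into characters, marking word boundaries with "^",
# then merge it back. "the cat" stands in for a line of the real data.
echo "the cat" | sed 's/ /^/g; s/./& /g; s/ $//'
# -> t h e ^ c a t
echo "t h e ^ c a t" | sed 's/ //g; s/\^/ /g'
# -> the cat
```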
We only use the BLEU score as the evaluation metric, and only for reference. You can also evaluate the results manually.
When you have submitted the jobs, please:
- read the help.train file, train another NMT model with different settings, and explain why you made these changes.
- read the help.translate file, try to generate attention distributions (weights) using the word-level models, and discuss your findings.
Subword NMT models
In this section, we will explore subword-level models.
We use the BPE toolkit (subword-nmt) to segment the tokenized data into subwords. An example script (process_bpe.sh) is also provided. You need to change the directory name or the number of subword operations (the subword vocabulary size). Note that we already have tokenized data here; normally, the text has to be tokenized before applying BPE segmentation.
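If you want to see what process_bpe.sh is doing, the subword-nmt toolkit is normally driven by a learn/apply pair of commands roughly like these (the operation count and file names below are examples, not the script's actual values):

```shell
# Learn a BPE model with e.g. 10000 merge operations, then apply it to a file.
# The -s value and file names are illustrative; see process_bpe.sh.
subword-nmt learn-bpe -s 10000 < data/sven.word.train.sv > bpe.codes
subword-nmt apply-bpe -c bpe.codes < data/sven.word.train.sv > data/sven.bpe.train.sv
```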
Training and Evaluation
Training subword-level models is the same as training models at the word/character level. We first pre-process the segmented subword-level data with preprocess_bpe.sh. Then we submit the subword-level job (sbatch sven.bpe.sh). However, before the evaluation we need to post-process the "@@" tokens that the BPE segmentation introduces.
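Removing the "@@" continuation markers is a one-line stream edit; here is a self-contained toy example (the segmented line is made up):

```shell
# BPE marks non-final subword units with a trailing "@@ ";
# deleting that marker restores the original words.
echo "sub@@ word seg@@ men@@ tation" | sed 's/@@ //g'
# -> subword segmentation
```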
- Please have a look at the subword vocabulary and discuss it with your classmates.
- Please revise the subword vocabulary size (bpe_operations) and see how the results change.
Please answer these questions when you have finished the preceding parts:
- What is your general impression of the models at the different granularities?
- Which model (word, subword, or character level) is the largest, which is the smallest, and why?
- What is your most interesting finding?
Towards the end of the assignment session, everyone will be expected to be familiar with training NMT models and with using the OpenNMT toolkit, to share their findings, and to discuss the assignment with the full class. You should all be prepared to report your main findings and to discuss the questions asked, as well as any other interesting issues that came up during the assignment.
The assignment is supposed to be examined in class, on October 1, 9-12. You need to be present and active during the whole session.
If you fail to attend the oral session, you instead have to do the assignment on your own and report it afterwards. You can then either discuss the assignment during one of the project supervision sessions (given that the teacher has time), or write a report where you discuss your findings (around 1-2 A4 pages). The deadline for this compensation is October 25.