Assignment 3 - Language models

Aim

In this assignment session you will train both N-gram and neural language models (LMs). You will compare the quality of different LMs and explore their differences. Finally, you will assess the pros and cons of different LMs.

This assignment is examined in class, on September 18, 13-16. Note that you have to attend the full session and be active in order to pass the assignment. If you miss this session or are inactive, you will have to compensate for this, see the bottom of this page.

Practicalities

This assignment is intended to be performed in pairs. Team up with another student in order to solve it. It is not necessary to work in the same pair during all assignments.

Take notes during the assignment. During the last half hour of the session, groups will get a chance to talk about their experiments: what they did, what surprised them, what they learned, etc. In addition, the teachers will talk to students during the session to discuss their findings.

The N-gram part is based on Lab 4 (Language modeling) in the NLP course. The neural part is based on the official examples from PyTorch.

Data

Create a new directory for this assignment and copy all the files into this new work directory:

mkdir assignment3/
cd assignment3/
cp /local/kurs/mt/assignment3/* .

Files description:


1 - N-gram LMs

Training

In this subtask you should use the toolkit SRILM to train LMs on holmes.txt. We need to run a command like the following:

ngram-count -order 2 -text holmes.txt -addsmooth 0.01 -lm bigram-add

This command tells SRILM to count bigrams (-order 2) in the training file (-text holmes.txt), to apply additive smoothing (-addsmooth 0.01), and to create a language model and save it in a file called bigram-add (-lm bigram-add). By varying parameters such as the n-gram order and the smoothing method, we can create different language models. Note that you can use ngram-count -help to get a description of the available parameters.
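For example, a trigram model with Kneser-Ney smoothing could be trained as follows (the output name trigram-kn is just a suggestion):

ngram-count -order 3 -text holmes.txt -kndiscount -lm trigram-kn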

Evaluating

To evaluate a language model created as above on his-last-bow.txt, we run:

ngram -order 2 -lm bigram-add -ppl his-last-bow.txt

This command tells SRILM to use a bigram model (-order 2) based on the previously stored model (-lm bigram-add) to estimate the perplexity of the test file (-ppl his-last-bow.txt). The output generated by this command should be something like:

file his-last-bow.txt: 5080 sentences, 81393 words, 2729 OOVs
0 zeroprobs, logprob= -177261 ppl= 130.829 ppl1= 179.224

This tells us that the test file contained 5080 sentences and 81393 words, of which 2729 were out-of-vocabulary (OOV), meaning that they did not occur in the training set. The model did not find any zero probabilities (thanks to the smoothing), the total log probability was -177261, and the perplexity was 130.829. Remember that perplexity, as well as the related entropy measure, tells us how surprised or confused the model is when seeing the test data, so lower perplexity is better. The second perplexity measure (ppl1) is computed without end-of-sentence tokens and can be ignored for now.
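As a sanity check (assuming the usual SRILM convention, with base-10 log probabilities and zeroprobs, here zero, excluded from the count): ppl = 10^(-logprob / (words - OOVs + sentences)) = 10^(177261 / (81393 - 2729 + 5080)) ≈ 130.8, while ppl1 = 10^(-logprob / (words - OOVs)) = 10^(177261 / (81393 - 2729)) ≈ 179.2.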

You should spend approximately 40 minutes on this task.

2 - Neural LMs

In this task we will train some neural LMs using PyTorch. Training a neural LM takes much longer without a GPU, so please do not try to train a very big model (larger embedding size, more hidden units, etc.) during the class, since it would take too much time.

Installing PyTorch:

python3 -m pip install torch==1.2.0+cpu torchvision==0.4.0+cpu -f https://download.pytorch.org/whl/torch_stable.html --user

Note: the "--user" flag installs the packages locally for your own user account, which is useful when you do not have system-wide installation rights.

Verification of PyTorch: run import torch; x = torch.rand(5, 3); print(x) in python3. If you get a tensor in the output, it means that you have installed PyTorch successfully!

Training and Evaluation

Use the following command to train a neural LM.

nohup python3 nlm.py --epochs 5 --tied --model LSTM --data ./ --save model_epo5.pt --nhid 200 --emsize 200 >log.nlm.epo5 2>&1 &

Wrapping a command in nohup ... >output.file 2>&1 & makes it run in the background and redirects all of its output to the output file. (You can use top -c -u your_user_name to check on the job.) The command above trains a neural LM using LSTM RNNs (--model LSTM), with both the number of hidden units and the size of the word embeddings set to 200 (--nhid 200 --emsize 200), and with the word embedding and softmax weights tied together, i.e. shared (--tied). The model will be trained for 5 epochs (--epochs 5). Please use python3 nlm.py --help to get more detailed descriptions of the parameters. You could try to change the following settings: learning rate, number of epochs, embedding size, number of hidden units, type of RNN unit (LSTM --> GRU), and training without "--tied"; see the example below.
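For instance, a smaller GRU model without tied weights could be trained like this (the model and log file names are chosen here just for illustration):

nohup python3 nlm.py --epochs 5 --model GRU --data ./ --save model_gru_epo5.pt --nhid 100 --emsize 100 >log.nlm.gru.epo5 2>&1 &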

When the training is finished, you will also get the evaluation results on the other three test sets: his-last-bow.txt, lost-word.txt, and other-authors.txt.

Generation

We can use python3 generate.py --checkpoint model_epo5.pt --data ./ --outf generated.txt to generate some new sentences sampled from the trained neural LM and export them to generated.txt (--outf generated.txt). Sentences are separated by end-of-sentence tokens. Please check the file and discuss the quality of these generated sentences.
Please use different output file names when you generate from different trained models, as in the example below. Use kill -s 9 PID to kill a running job. You can find the PID using the "top -u your_user_name -c" command.
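For example, to sample from the GRU model suggested above (file names again chosen just for illustration):

python3 generate.py --checkpoint model_gru_epo5.pt --data ./ --outf generated_gru.txt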

Online Neural LMs

Here we explore LMs' ability to generate text. Write With Transformer provides some state-of-the-art LMs for generating sentences. (The Transformer is a more advanced model than RNNs in NLP; we will learn about it later.) Please discuss your findings with your classmates.

Wrapping up

Towards the end of the assignment session, everyone will be expected to share their findings and discuss the assignment, in the full class. You should all be prepared to report your main findings, and discuss the questions asked, and any other interesting issues that came up during the assignment.

Reporting

The assignment is supposed to be examined in class, on September 18, 13-16. You need to be present and active during the whole session.

If you fail to attend the oral session, you instead have to do the assignment on your own and report it afterwards. Spend around 45 minutes on task 1, and do task 2. You can then either discuss the assignment during one of the project supervision sessions (given that the teacher has time), or write a report where you discuss your findings (around 1-2 A4 pages). The deadline for the compensation is October 25.