In this assignment, you will have the opportunity to get familiar with Docent, a document-level decoder for phrase-based SMT developed by the computational linguistics group at Uppsala University. You will learn how to run the decoder and explore its most important parameters. You will also (hopefully) gain some insight into the influence of different models during the SMT decoding process.
This assignment is examined in class, on October 1, 9-12. Note that you have to attend the full session and be active in order to pass the assignment. If you miss this session, you will instead have to write a report, see the bottom of this page.
Take notes during the assignment. During the last hour of the session, groups will get a chance to talk about their experiments: what they did, what surprised them, what they learned, etc. In addition, the teachers will talk to students during the session to discuss their findings.
In this assignment, everyone will start by performing sub task 1. Each pair can then choose to focus on different sub tasks depending on their interests. The sub tasks take different amounts of time, so depending on which you choose, expect to do one or more of them in addition to sub task 1.
Docent is a decoder for phrase-based SMT that translates complete documents, which makes it possible to create feature models with access to the entire document context, including the translation currently proposed by the MT system. It is based on local search with hill climbing rather than the dynamic programming algorithm, known as stack decoding, that is used in most other SMT decoders.
Docent is open source software and is released on GitHub. There is also some documentation on the GitHub wiki: https://github.com/chardmeier/docent/wiki.
The search algorithm implemented in Docent is described in the following publication:
Hardmeier, C., Nivre, J. and Tiedemann, J. Document-Wide Decoding for Phrase-Based Statistical Machine Translation. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1179-1190, Association for Computational Linguistics, 2012.
The software itself is described in a system demonstration paper:
Hardmeier, C., Stymne, S., Tiedemann, J. and Nivre, J. Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 193-198, Association for Computational Linguistics, 2013.
In this assignment, we're going to work with two document-level features, TTR (type-token ratio) and OVIX, designed to do joint text simplification and translation. You can find details about them in our Nodalida paper:
Stymne, S., Tiedemann, J., Hardmeier, C. and Nivre, J. Statistical Machine Translation with Readability Constraints. In: Proceedings of Nodalida 2013, pages 375-386, NEALT, 2013.
The Docent decoder potentially creates many output files, so we recommend that you start by creating a working directory and make sure that the current path of your shell is set to that working directory whenever you run Docent. You can delete the files in this directory after completing the assignment, and also during the assignment if you find that you are creating a large number of files.
Start by familiarizing yourself with the file formats used by Docent. The setup of the decoder, including the description of the feature models, their weights, the search parameters, etc., is specified in an XML configuration file, whose format is described on the Wiki page.
You will find a working configuration file for a Swedish-English system in /local/kurs/mt/lab-docent/config.xml. Copy this file to your newly created work directory. The configuration file refers to a number of other files such as the phrase table and language model. You needn't copy those, it's enough to have your own copy of the configuration to work on. Take a good look at the configuration file and compare its contents with the description on the Wiki page. Note that some models and weights have been commented out by adding <!-- and --> markers.
The input text for a document-level decoder also requires a special encoding, because information about document boundaries must be retained. The standard input format used by Docent is an XML-based format called NIST-XML that is commonly used by MT competitions. We've provided an input file for you to work with. You can find it in /local/kurs/mt/lab-docent/testset.xml. The test set contains two small excerpts from Europarl and a newspaper article from Dagens Nyheter (3 June 2013). Take a look at the file. Note that the text in the file is tokenised.
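For reference, a NIST-XML source file has roughly the following shape. This is a sketch based on the general NIST MT evaluation format; the exact attribute values in the provided testset.xml will differ, so check the actual file:

```xml
<srcset setid="lab-docent" srclang="Swedish">
  <doc docid="europarl-1">
    <seg id="1"> detta är den första meningen . </seg>
    <seg id="2"> här kommer den andra meningen . </seg>
  </doc>
  <doc docid="dn-2013-06-03">
    <seg id="1"> en ny artikel börjar i ett nytt doc-element . </seg>
  </doc>
</srcset>
```

The important point is that each doc element delimits one document, which is what lets a document-level decoder know where the document boundaries are.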
Here's how to invoke the decoder:
/local/kurs/mt/bin64/detailed-docent -b burnin -i sampleInterval -x maxSteps \
    -n /local/kurs/mt/lab-docent/testset.xml config.xml outstem
Here, config.xml is your copy of the configuration file, and all files created will have names starting with outstem. The other three parameters, burnin, sampleInterval and maxSteps, are iteration counts. The decoder will first run for burnin iterations without creating any output. Then it will dump its current state to a file and continue running, creating a new file every sampleInterval iterations. After maxSteps iterations, it will stop.
To begin with, try running the decoder for a small number of iterations and see what happens. For this assignment, we recommend that you generally set the burnin period to 0. You could start by running the decoder for 10000 iterations and sample every 1000 iterations or so. 10000 iterations will not be enough to create good translations, but it will give you an impression of how the decoder works and how long decoding takes. Then you can gradually raise the maxSteps parameter; remember to adjust sampleInterval so that the total number of dumps produced remains reasonable. Try to find out how many iterations you need to run Docent before additional running time no longer gives you a noticeable improvement.
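As a sanity check on your parameter choices, you can estimate the number of dump files from the description above. This is a sketch under the assumption that one dump is written immediately after the burn-in and one every sampleInterval iterations thereafter:

```python
def expected_dumps(burnin, sample_interval, max_steps):
    """Rough estimate of how many state dumps Docent writes:
    one right after the burn-in period, then one every
    sample_interval iterations until max_steps is reached."""
    if max_steps < burnin:
        return 0
    return 1 + (max_steps - burnin) // sample_interval

# Suggested starting point: no burn-in, 10000 iterations,
# sampling every 1000 iterations.
print(expected_dumps(0, 1000, 10000))  # 11
```

When you raise maxSteps, this tells you how much to raise sampleInterval to keep the number of files manageable.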
Take a look at the modifications made by the decoder between the various sampling points. In /local/kurs/mt/lab-docent, you will find a script called compare.sh that helps you compare two output files by showing just the output lines that are different.
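The provided compare.sh is what you should use; purely as an illustration of the idea, the comparison can be sketched in a few lines of Python (a hypothetical reimplementation, not the actual script):

```python
def changed_lines(file_a, file_b):
    """Print only the lines that differ between two decoder
    output files, with 1-based line numbers."""
    with open(file_a) as fa, open(file_b) as fb:
        for num, (a, b) in enumerate(zip(fa, fb), start=1):
            if a != b:
                print(f"{num}< {a.rstrip()}")
                print(f"{num}> {b.rstrip()}")
```

The lines that changed between two sampling points show you which sentences the hill climber modified, and therefore which operations were accepted, in that interval.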
Once you are done with sub task 1, you can choose freely between sub tasks 2-4. You may have time to do one or more of them.
The configuration file contains disabled entries for two simplification models, based on measures sometimes used for readability estimation: TTR and OVIX. TTR (type-token ratio) is the ratio between the number of types, i.e. unique words, and the number of tokens, i.e. the total number of words, in a text. A text with a low type-token ratio has less lexical variation than a text with a high type-token ratio. OVIX is a reformulation of the type-token ratio that is less sensitive to document length, which can be a problem with the type-token ratio in some contexts. For the OVIX formula, see the Nodalida paper cited above. Note that the type-token ratio only affects one aspect related to readability; there are many other aspects that are not treated by these models.
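To get a feeling for what these models measure, here is a small sketch of how TTR and OVIX can be computed for a tokenised text. The OVIX formula below is the standard one from the readability literature; check the Nodalida paper for the exact variant used in Docent:

```python
import math

def ttr(tokens):
    """Type-token ratio: number of unique words divided by
    the total number of words."""
    return len(set(tokens)) / len(tokens)

def ovix(tokens):
    """OVIX, a length-compensated reformulation of the
    type-token ratio (undefined if every token is unique)."""
    n = len(tokens)       # tokens: total number of words
    v = len(set(tokens))  # types: number of unique words
    return math.log(n) / math.log(2 - math.log(v) / math.log(n))

tokens = "to be or not to be".split()
print(round(ttr(tokens), 2))  # 0.67 -- 4 types / 6 tokens
```

A model that rewards low TTR or OVIX pushes the decoder towards reusing the same words, which is the intended simplification effect.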
a) Try enabling the simplification features, individually or in combination, and run the decoder to see what happens. Remember that you also have to enable the corresponding entry in the weights section whenever you enable a model. Try varying the feature weights for the simplification features. High weights for the simplification models may have the effect of producing excessively long translations. If you encounter this problem, try increasing the word penalty weight. Otherwise, you should leave the weights of the baseline features constant.
b) Compare the output produced by the translation system. What happens if you run the decoder with a very high weight for the TTR and/or the OVIX model? What if you use a similar weight as for the other models? Can you find a weight setting that achieves simplification without messing up the translations? Provide example translations.
Now let's take a closer look at what the decoder actually does. Remember that the local search process modifies document states by applying certain state operations. The operations available to the decoder are listed in the <state-generator> section of the configuration file. Each operation has a weight that specifies how often it will be attempted relative to the other operations. Some operations also have other parameters. The various decay parameters control draws from a geometric distribution. They should be between 0 and 1, and the higher the decay parameter, the more likely high numbers are to be drawn (i.e., longer chunks are moved, or chunks are moved across longer distances). The decoder reports at regular intervals how many operations of each type were attempted and how many of them were accepted because they improved the score.
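The geometric draw controlled by a decay parameter can be illustrated with a small sketch. The exact parameterisation inside Docent is an assumption here; the point is the qualitative behaviour described above, namely that a higher decay value makes large draws (longer or farther moves) more likely:

```python
import random

def geometric_draw(decay, rng):
    """Draw k >= 0 with probability (1 - decay) * decay**k.
    Higher decay makes large k more likely; the expected
    value is decay / (1 - decay)."""
    k = 0
    while rng.random() < decay:
        k += 1
    return k

rng = random.Random(1)
for decay in (0.3, 0.9):
    draws = [geometric_draw(decay, rng) for _ in range(10000)]
    print(f"decay={decay}: mean draw {sum(draws) / len(draws):.2f}")
```

With decay 0.3 the mean draw is around 0.4, while with decay 0.9 it is around 9, so an operation with a high decay value would typically move much longer chunks, or move them much farther.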
Try experimenting with different sets of operations and different parameters. How is the translation quality affected?
If you want more information about what the operations are doing, you can pass the debug option -d component to the decoder. Possible values of component include ChangePhraseTranslationOperation, SwapPhrasesOperation, ResegmentOperation and SimulatedAnnealing. The first three output information about the operations proposed, and the last one will tell you whether an operation was accepted or rejected. Note that this can write huge amounts of information to stderr, so it's best to redirect the output to a file; you may also find the format a bit difficult to understand.
Don't hesitate to ask for help if you feel lost.
Try experimenting with different model weights. How is the translation quality affected?
Towards the end of the assignment session, everyone will be expected to share their findings and discuss the assignment, in the full class. You should all be prepared to report your main findings, and any other interesting issues that came up during the assignment.
If you failed to attend the oral session, you instead have to write an assignment report, which can be done individually or in pairs. Start by doing sub task 1. Then spend around 1.5 hours on one or more of sub tasks 2-4, and write a report where you discuss your findings. Your report should be 1-2 A4 pages and should be handed in as a PDF via the student portal. The deadline is October 26, 2018.