Assignment 4: Cross-Lingual Dependency Parsing
In this assignment you will try out cross-lingual parsing with a state-of-the-art neural parser. The parser you will use is uuparser, developed by the Uppsala parsing team, with Miryam de Lhoneux as the main developer. uuparser is similar to the Kiperwasser and Goldberg parser you read about in literature seminar 2. The variant you will use is the transition-based parser.
The goal of this assignment is to see how parsing for a low-resource language (which you may simulate) can be aided by using data from another language as well. To start with, pick three languages and treebanks:
- A target language treebank (TGT)
This is the language that you are attempting to parse. It is meant to be treated as a low-resource language, but it is OK to use a high-resource language and simulate such a scenario by limiting the training data. It is helpful, but not required, to choose a target language that you know.
- A support language treebank that you believe will be good (GSUP)
This should be a language with more resources than your target language, which you think might help in parsing the target language, for instance because it is (closely) related, shares some important linguistic features, is a contact language, or for some other reason.
- A support language treebank that you do not believe will be good (or at least not as good as GSUP) (BSUP)
This should be a language with more resources than your target language, which you think will not help in parsing the target language, for instance because it is unrelated, has different important linguistic features, uses a different script, or for some other reason(s).
For cases where a language has more than one treebank, you can pick any that fulfils the criteria. If possible you can try to match genres, but that is not required; focus more on language choice than treebank choice.
Make sure that your target language has at least 200 sentences in total in its training and development data, and that your support languages have at least 600 sentences training data each.
You should then run the following five experiments. Detailed instructions on the commands and on how to handle the data are given below.
- Experiment 1: Train a monolingual parsing model on 100 sentences from TGT and record the scores on the TGT development set that are reported during training.
- 0-shot transfer
  - Experiment 2a: Train a monolingual model on 500 sentences from GSUP and evaluate it on the TGT development set, using the model from the best iteration.
  - Experiment 2b: Train a monolingual model on 500 sentences from BSUP and evaluate it on the TGT development set, using the model from the best iteration.
- few-shot transfer
  - Experiment 3a: Train a multilingual model on 100 sentences from TGT and 500 sentences from GSUP and record the scores on the TGT development set that are reported during training.
  - Experiment 3b: Train a multilingual model on 100 sentences from TGT and 500 sentences from BSUP and record the scores on the TGT development set that are reported during training.
Note that in all cases you are using a limited amount of data, in order to keep the run time of your experiments reasonable. In a real setting it is quite likely that you would use more data for the support languages. Also note that in this assignment, your two support languages do not actually need to have more data than the target language, but this can be simulated by not using all available data.
Motivate your choice of languages briefly in the report.
Evaluation
You should evaluate your results in three parts.
- Present the UAS and LAS scores on the TGT development set for the best iteration of each of your five systems. For experiments 3a and 3b, give both the scores for the epoch with the best average development score over TGT and SUP, and the scores for the epoch with the best TGT score.
- Draw learning curves showing how the scores on the TGT development set develop over the epochs for systems 1, 3a, and 3b. Preferably, draw all three curves in the same plot.
- Do a small qualitative evaluation where you compare the errors for a few sentences that have different parses across (a subset of) systems. (Depending on how many errors and differences between systems there are, how long your sentences are, how many of your five systems you choose to focus on, and how detailed your discussion is, somewhere around 1-5 sentences is suitable.)
For the qualitative evaluation you may use MaltEval, which you have already used in the NLP course. Note, however, that you will need to convert to CoNLL-X format for it to work, which can be done with this script:
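Roughly, this conversion amounts to dropping CoNLL-U comment lines, multiword-token ranges (IDs like 1-2), and empty nodes (IDs like 1.1), since CoNLL-X tools expect only plain numbered token lines. A minimal shell sketch of that idea (an approximation, not the course script; it does not remap column contents, and the file names are just placeholders):

grep -v '^#' parsed.conllu | awk -F'\t' 'NF==0 || $1 ~ /^[0-9]+$/' > parsed.conll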
In addition, also report the UAS and LAS scores for your two support languages, both when trained alone (experiments 2a and 2b) and when trained together with your target language (experiments 3a and 3b). Is there a difference between the good-fit and bad-fit support languages?
Data
Use the Universal Dependencies (UD) data, version 2.5, in this assignment. The data is available on the Linux system at /corpora/ud/ud-treebanks-v2.5. You need to prepare your own directory for the three languages you are interested in, and copy the relevant parts of the data there. Note that you should keep the naming convention, i.e. the folder for each language should have the same name as in the original structure (e.g. UD_Swedish-LinES), and the training and development files should also have the same names (e.g. sv_lines-ud-train.conllu and sv_lines-ud-dev.conllu), but their sizes should be modified. You do not need to copy the test files, or any additional files, since you will not use them.
For the TGT language, you should have a training set of 100 sentences and a development set of at least 100 sentences. If your language has both these sets, copy the first 100 sentences of the train set and keep the full development set. If your language does not have a development set, copy 100 sentences from the train set to your train set, and another 100 sentences to your development set.
For the two support languages, follow the same procedure as for the TGT language, but the train set should have 500 sentences.
To select the first N sentences from a CoNLL-U file, you can use the following script:
/local/kurs/parsing/assign4/select-n-conllu-sentences.perl N input-file output-file
where input-file is the original CoNLL-U file you are reading from, output-file is the file you write to, and N is the number of sentences you want to copy.
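For example, preparing a Swedish-LinES treebank as the target might look as follows (using ~/assign4/data as your own data directory is just an illustration; the source paths follow the Data section above):

mkdir -p ~/assign4/data/UD_Swedish-LinES
/local/kurs/parsing/assign4/select-n-conllu-sentences.perl 100 /corpora/ud/ud-treebanks-v2.5/UD_Swedish-LinES/sv_lines-ud-train.conllu ~/assign4/data/UD_Swedish-LinES/sv_lines-ud-train.conllu
cp /corpora/ud/ud-treebanks-v2.5/UD_Swedish-LinES/sv_lines-ud-dev.conllu ~/assign4/data/UD_Swedish-LinES/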
The parser, uuparser, is available on the Linux computer system. You can run it with the command uuparser. Treebanks should be specified by their ISO id, i.e. the short name for each treebank that is used in the names of the data files (for instance sv_lines or en_ewt).
The current version of one of the specification files used by the parser covers UD 2.2. If the parser cannot find your treebank, that is because you are using treebanks newer than 2.2. In that case you need to add the following flag to all commands:
To train uuparser for a single language, use the command:
uuparser --outdir [results directory] --datadir [your directory of UD files with the structure UD_**/iso_id-ud-train/dev.conllu] --include [treebank to train on denoted by its ISO id] --disable-rlmost
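For example, training the 100-sentence target model for experiment 1, with sv_lines as the target and the data directory from the example above (all directory names are illustrative):

uuparser --outdir ~/assign4/exp1 --datadir ~/assign4/data --include sv_lines --disable-rlmost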
To test uuparser on a different language than the one the model was trained on (for experiments 2a and 2b):
uuparser --predict --outdir [results directory] --modeldir [model directory in the form model_dir/SUP-iso-id] --datadir [directory of UD files with the structure UD_**/iso_id-ud-train/dev.conllu] --multiling --include [TGT-iso-id:SUP-iso-id]
The parser automatically chooses the model from the epoch with the best development score. The model directory should be the one specified when training the parser, followed by the treebank name. Check that this directory contains the model file "barchybrid.model". Note that the include flag needs to specify the ISO id of the treebank you want to test on and the ISO id of the treebank the model was trained on, with a colon in between.
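For example, if the experiment 2a model was trained with --outdir ~/assign4/exp2a on en_ewt as GSUP and sv_lines is the target (all names illustrative), the evaluation command might look like this:

uuparser --predict --outdir ~/assign4/exp2a-on-tgt --modeldir ~/assign4/exp2a/en_ewt --datadir ~/assign4/data --multiling --include sv_lines:en_ewt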
To run uuparser for multiple languages, use the command:
uuparser --outdir [results directory] --datadir [your directory of UD files with the structure UD_**/iso_id-ud-train/dev.conllu] --disable-rlmost --include ["treebanks to train on denoted by their ISO ids"] --multiling
Note that you need to have quotes around the treebanks when you have more than one treebank!
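For example, experiment 3a with sv_lines as the target and en_ewt as GSUP might look like this (directory names again illustrative):

uuparser --outdir ~/assign4/exp3a --datadir ~/assign4/data --disable-rlmost --include "sv_lines en_ewt" --multiling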
Note: use different output directories for the different experiments, so that you do not risk the parser overwriting previous output that you may still need. Also note that it takes some time to run these experiments. Each experiment should run for the default 30 epochs. Each epoch probably takes less than a minute for 100 sentences, and somewhere around 3-6 minutes for 500-600 sentences, so a monolingual TGT run finishes in under half an hour, while the 500- and 600-sentence runs can take between 1.5 and 3 hours each. Take this into account when planning your time, since you will have to wait a while for your results!
In all experiments, use the default settings for uuparser, except for the flag --disable-rlmost, which disables what is called the extended feature set, i.e. information about the children of items on the stack.
For Distinction (VG)
For distinction, you have to design and motivate a set of experiments where you investigate one of the following issues:
- Try out other support languages than the two for the basic assignment. Make your choice in some principled way. Discuss the effects of this choice.
- Vary the size of the training data for the target and/or support language, and discuss the effects.
- Try some variants of the sizes of the different components of the neural network, or vary other hyperparameters. Look at the code to see what you can vary. Discuss the results and motivate your experiments.
- If any of your languages have more than one treebank, explore the effect of using data from different treebanks, and possibly also of using different parts of the treebanks.
- Experiment with using more than one support language at a time, by training models with three or more languages in them (multi-source models). Think carefully about the size of the data here.
It is also possible to earn a VG by exploring some other issue. If you want to do so, it is obligatory to contact Sara beforehand, in order to get your specific extension approved. If you pursue an individual idea without approval, no VG grade will be granted.
Note that, due to the time it takes to run experiments, you are not required to run many more experiments: up to five experiments beyond the G task should normally be enough. For the same reason, you are not required to use much more data than in the proposed experiments. Carefully design the experiments you run, though, so that they make sense and can be analysed in a useful way! In all cases you should discuss and motivate your experimental design.
Note that doing a VG task is no guarantee of earning a VG. The full assignment must also be performed with an overall high quality.
Submit your work by uploading a PDF report describing your experiments and discussing your findings.
Please do not hesitate to post questions in the discussion forum or contact Sara via email in case you encounter any problems.