Syntactic Analysis (5LN455): Assignment 4
In this assignment, you will learn how to use a state-of-the-art system for dependency parsing (MaltParser) and evaluate its performance.
The assignment is structured into smaller tasks; detailed instructions on each of these tasks are given below. These instructions also specify how to report your work on that task in the lab report.
To get started, go to the MaltParser website and download the latest release of the parser, either in tar.gz or in zip format. Then, follow the installation instructions. Once you have tested your installation and know that everything works, read the Start Using MaltParser section of the User Guide. In that section you will learn how to train a parsing model on a data set (the training data), and use that model to parse unseen data (the testing data).
Please note that the testing data serves two purposes at the same time: It contains the sentences (tagged with part-of-speech information) that the parser should assign dependency analyses to, and it also contains gold-standard analyses that you can use to evaluate the performance of the parser. These gold-standard analyses are not visible to the parser during parsing. (If they were, the parser could just assign the gold-standard analysis to the sentence and would receive perfect score.)
Task 1: Train a Baseline Model
Your first task is to train a useful parsing model on realistic data. The workflow for this is exactly the same as the one that you used in the setup phase; the only things that change are the data files used for training and testing.
Note that training a model with this data will take quite a bit longer than training the dummy model from the setup phase.
Reporting: Report the time it took to train and parse with the model, as well as the hardware configuration of the computer that you used for this experiment (processor type, amount of memory).
Task 2: Evaluate the Baseline Model
Now that you have trained a parsing model and used it to parse the testing data, your next task is to evaluate the performance of your system. For this you will use two measures: labelled attachment score (LAS) and labelled exact match (LEM). In both cases you compare the parser’s output to the testing data.
You can read more about LAS and LEM in section 6.1 of the KMN book. You should implement the word-based version of LAS.
While there are several tools available for computing LAS and LEM, you are asked to implement your own evaluator. You can use any programming language you want; the only requirement is that the evaluator should be callable from the command line. It should accept exactly two arguments: the file with the gold-standard data, and the file with the system output. For example, you should be able to do something like this:
% python depEval.py swedish_dep_dev.conll out.conll
Total number of edges: 9339
Number of correct edges: 6678
Total number of sentences: 494
Number of correct sentences: 86
In order to write the evaluator, you need to know the format of the data files; please see this page for detailed information.
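To make the expected behaviour concrete, here is a minimal sketch of such an evaluator in Python. It assumes the CoNLL-X column layout (tab-separated, HEAD in column 7, DEPREL in column 8) and counts an edge as correct only if both head and label match the gold standard; the labels in the printout mirror the example output above, and the final score line is an optional extra.

```python
import sys

def read_sentences(path):
    """Yield each sentence of a CoNLL file as a list of column lists."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                sentence.append(line.rstrip("\n").split("\t"))
            elif sentence:
                yield sentence
                sentence = []
    if sentence:
        yield sentence

def evaluate(gold_path, sys_path):
    """Word-based LAS/LEM counts: an edge is correct if both HEAD
    (column 7) and DEPREL (column 8) match the gold standard; a
    sentence is correct if all of its edges are."""
    edges = correct_edges = sentences = correct_sentences = 0
    for gold, pred in zip(read_sentences(gold_path), read_sentences(sys_path)):
        sentences += 1
        sentence_correct = True
        for g, p in zip(gold, pred):
            edges += 1
            if g[6] == p[6] and g[7] == p[7]:
                correct_edges += 1
            else:
                sentence_correct = False
        if sentence_correct:
            correct_sentences += 1
    return edges, correct_edges, sentences, correct_sentences

if __name__ == "__main__" and len(sys.argv) == 3:
    e, ce, s, cs = evaluate(sys.argv[1], sys.argv[2])
    print("Total number of edges:", e)
    print("Number of correct edges:", ce)
    print("Total number of sentences:", s)
    print("Number of correct sentences:", cs)
    print("LAS: %.4f  LEM: %.4f" % (ce / e, cs / s))
```

LAS is then correct edges divided by total edges, and LEM is correct sentences divided by total sentences.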
Reporting: Include the code for your evaluator in the lab report (or send it by email in case the code is more than a page). Also give the LAS and LEM score for your baseline system from task 1.
Task 3: Selecting a Good Parsing Algorithm
MaltParser supports several parsing algorithms; these are described in the Parsing Algorithm section of the User Guide. Your next task is to select the best algorithm for the data at hand, where the ‘best’ algorithm is the one that gives the highest score on LAS or LEM. The two metrics may rank the systems slightly differently, which means there is more than one reasonable way of deciding which system is ‘best’. To make this choice, you need to train a separate parsing model for each algorithm, use it to parse the testing data, and evaluate the performance of the parser as in Task 2. You may restrict your search to the algorithms in the Nivre and Stack families. For the Nivre family you should try at least some combinations of the additional arguments that can be used. For the projective algorithms (in both families), try at least one type of pseudo-projective parsing (use the argument -pp). In total, there should be at least 10 different combinations of algorithms and arguments. You may also try other options if you wish.
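Since each of the ten-plus configurations needs its own train/parse cycle, it can help to generate the command lines systematically. The sketch below only builds the commands, it does not run them; the option names (-c, -i, -o, -m, -a, -pp) and the algorithm names are my reading of the User Guide, and the jar and data file names are placeholders, so verify everything against the Parsing Algorithm section before running.

```python
# Builds (but does not run) MaltParser command lines for a set of
# candidate configurations.  Option and algorithm names are taken
# from the MaltParser User Guide as I recall them; verify them, and
# replace the placeholder jar and data file names with your own.
JAR = "maltparser-1.8.1.jar"           # adjust to your downloaded version

CONFIGS = [                             # (algorithm, extra arguments)
    ("nivreeager", []),
    ("nivreeager", ["-pp", "head"]),
    ("nivrestandard", []),
    ("nivrestandard", ["-pp", "head"]),
    ("stackproj", []),
    ("stackproj", ["-pp", "head"]),
    ("stackeager", []),
    ("stacklazy", []),
]

def train_cmd(name, algorithm, extra, train_file):
    """Command line for training one model under configuration `name`."""
    return ["java", "-jar", JAR, "-c", name, "-i", train_file,
            "-m", "learn", "-a", algorithm] + extra

def parse_cmd(name, test_file, out_file):
    """Command line for parsing the testing data with a trained model."""
    return ["java", "-jar", JAR, "-c", name, "-i", test_file,
            "-o", out_file, "-m", "parse"]

if __name__ == "__main__":
    for i, (algorithm, extra) in enumerate(CONFIGS):
        name = "model_%02d" % i
        print(" ".join(train_cmd(name, algorithm, extra, "train.conll")))
        print(" ".join(parse_cmd(name, "test.conll", name + "_out.conll")))
```

Generating the commands in one place also gives you a record of exactly which configurations went into your comparison table.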
Reporting: Report the LAS and LEM scores for all algorithms and combinations of arguments that you tried, and write down which algorithm you picked in the end.
Task 4: Feature Engineering
MaltParser processes a sentence from left to right, and at each point takes one of a small set of possible transitions. (Details will be presented in the lectures.) In order to predict the next action, MaltParser uses feature models. Up to now, you have been using the baseline feature model for whatever algorithm you were experimenting with. However, one can often do much better than that.
Your next task is to improve the feature model by exploiting the fact that the training and testing data contain morphological features such as case, tense, and definiteness. These are specified in the FEATS column (column 6) of the CoNLL format. Here is an example:
7 hemmet _ NOUN NN NEU|SIN|DEF|NOM 6 PA _ _
This line specifies that the word hemmet has the grammatical gender neuter (NEU), the grammatical number singular (SIN), is marked as definite (DEF), and has the grammatical case nominative (NOM). This is useful information during parsing.
Read the Feature Model section of the user guide to find out how to extract the value of the FEATS column and split it into a set of atomic features using the delimiter | (pipe). Then, create a copy of the file that holds the feature model used by the algorithm that you selected in Task 3 and make the necessary modifications. Finally, train a new parsing model with the extended feature model, use it to parse the testing data, and evaluate its performance.
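For orientation, the splitting itself is trivial; MaltParser performs it internally via its Split feature function, so you only edit the feature-model file, not code. The snippet below just illustrates what the parser gets to see after splitting; the XML line in the comment is my recollection of the specification syntax and should be checked against the Feature Model section.

```python
# The FEATS column packs several atomic morphological features into
# one string, delimited by "|".  MaltParser decomposes it internally
# (via its Split feature function); this only illustrates the result.
feats_column = "NEU|SIN|DEF|NOM"
atomic_features = feats_column.split("|")
print(atomic_features)   # ['NEU', 'SIN', 'DEF', 'NOM']

# In the feature-model file, the corresponding specification looks
# roughly like the line below; check the exact syntax and the token
# address against the Feature Model section of the User Guide:
#   <feature>Split(InputColumn(FEATS, Stack[0]), |)</feature>
```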
Note: The token addresses for which you extract the FEATS features depend on the algorithm family: with an algorithm from the Nivre family you extract them for tokens on Input and Stack, whereas with an algorithm from the Stack family you use Stack and Lookahead instead.
Reporting: Write down the lines that you added to the baseline feature model, and how this affected the LAS and LEM of the parser relative to the score that you got for Task 3. (Note that you need to retrain the parser with the new feature model in order to see changes.)
Task 5: Error analysis
In addition to standard evaluation using metrics like LAS, it can be useful to do some more detailed error analysis of the system output. In this task you will explore two ways of doing error analysis: automatically and manually. You will apply both methods to the output of your chosen system from Task 3 and of your system from Task 4.
Your first subtask is to write a program that performs a simple form of error analysis, to find out which categories are most often confused. For each erroneous link, your program should store the predicted label and the gold-standard label, for instance "SS-SP". You only have to care about the label, not about the predicted head. Your program should count each type of confusion and output the top fifteen confusions together with how many times they occur. Again, you may use any programming language; your program should be callable from the command line, taking the files with the gold standard and the system output as arguments.
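A minimal sketch of such a confusion counter, again assuming the CoNLL-X layout with DEPREL in column 8; the script structure and the "predicted-gold" label format are my own choices, not a requirement:

```python
import sys
from collections import Counter

def label_confusions(gold_path, sys_path, n=15):
    """Count predicted-vs-gold DEPREL (column 8) confusions for every
    token whose label is wrong, ignoring the predicted head, and
    return the n most frequent confusions with their counts."""
    confusions = Counter()
    with open(gold_path, encoding="utf-8") as gf, \
         open(sys_path, encoding="utf-8") as sf:
        for gold_line, sys_line in zip(gf, sf):
            gold_cols = gold_line.split("\t")
            sys_cols = sys_line.split("\t")
            if len(gold_cols) < 8:               # sentence boundary
                continue
            if sys_cols[7] != gold_cols[7]:      # wrong label
                confusions["%s-%s" % (sys_cols[7], gold_cols[7])] += 1
    return confusions.most_common(n)

if __name__ == "__main__" and len(sys.argv) == 3:
    for confusion, count in label_confusions(sys.argv[1], sys.argv[2]):
        print(confusion, count)
```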
Run your program on your selected system from Task 3 and on your system from Task 4, and compare the results. Briefly discuss the differences, or the lack of differences. To aid your discussion, you can use the list of labels.
For your second subtask, find a sentence that receives different trees from your selected system from Task 3 and your system from Task 4. Write down the line number where the sentence you pick starts. Draw the trees for the sentence as parsed by the two systems (if the sentence is very long, you may choose to draw only a relevant part of each tree). Discuss the difference(s) and why you think they happen.
Reporting: Hand in your program and include the result of the automatic analysis, the trees and a discussion of the questions in your report. You may draw the trees by hand, in which case you can leave the trees in Sara's postbox (number 108).
Task 6: Gold-Standard Tags Versus Predicted Tags
You only need to work on this task in case you want to get the grade Pass With Distinction (VG).
In the training and testing data that you have been using up to now, the part-of-speech (POS) tags are gold-standard tags, in the sense that they were assigned manually. In this task you will be exploring what happens in the more realistic scenario where the tags are assigned automatically.
Your specific task is to produce alternative versions of the training and testing data where the gold-standard POS tags have been replaced with automatically-assigned tags. To obtain these, you can use Hunpos, a state-of-the-art part-of-speech tagger. Proceed as follows:
- Download and install Hunpos on your computer.
- Read the User Manual to learn how to use Hunpos.
- Use the training data for the parser to produce training data for the tagger.
- Train the tagger on this training data.
- Use the trained tagger to tag the sentences in the parser data (training and testing).
- Produce new parser data by replacing the gold standard tags in the original data with the automatic tags.
- Re-train the parser using the new data.
- Re-evaluate the parser. Try two scenarios: one where automatic tags are used in both training and testing, and a mixed scenario where you use a model trained on gold tags and test on automatic tags.
- Consider the results and think about possible reasons for them.
In order to succeed with these tasks, you may be tempted to write some code that can modify CoNLL files. However, all manipulations can also be done using standard Unix commands such as cut and paste.
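If you do prefer code over Unix tools, the two conversions in the list above (CoNLL to tagger training data, and splicing automatic tags back into a CoNLL file) can be sketched in Python as below. The assumed Hunpos format (word and tag separated by a tab, one token per line, blank line between sentences) and the choice to replace the POSTAG column are assumptions to verify against the Hunpos User Manual and against which column(s) your feature model actually reads.

```python
def conll_to_hunpos_training(conll_path, out_path):
    """Turn parser training data into tagger training data: one
    "FORM<TAB>POSTAG" pair per line, blank line between sentences
    (the format I assume Hunpos expects; check its User Manual)."""
    with open(conll_path, encoding="utf-8") as conll, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in conll:
            if not line.strip():
                out.write("\n")
            else:
                cols = line.rstrip("\n").split("\t")
                out.write(cols[1] + "\t" + cols[4] + "\n")

def retag_conll(conll_path, tagged_path, out_path):
    """Splice tagger output ("word<TAB>tag" per token, blank line
    between sentences) back into a CoNLL file, replacing the POSTAG
    column.  Adjust the column index if your feature model reads a
    different POS column."""
    with open(conll_path, encoding="utf-8") as conll, \
         open(tagged_path, encoding="utf-8") as tagged, \
         open(out_path, "w", encoding="utf-8") as out:
        for conll_line, tag_line in zip(conll, tagged):
            if not conll_line.strip():           # boundary in both files
                out.write("\n")
                continue
            cols = conll_line.rstrip("\n").split("\t")
            word, tag = tag_line.rstrip("\n").split("\t")[:2]
            assert cols[1] == word, "token mismatch: %s / %s" % (cols[1], word)
            cols[4] = tag                        # POSTAG <- automatic tag
            out.write("\t".join(cols) + "\n")
```

The assertion guards against the two files drifting out of alignment, which is the most common failure mode in this kind of column splicing.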
Reporting: Report the LAS of the parser trained and tested on data with automatically-assigned tags. Describe the conclusions that you draw from your results. You can write your report in Swedish or English.
Submission and Grading
Submit your lab report, and the code for your scripts by email to Sara. Make sure that you have your name in the report and code files.
The assignment will be graded mainly on the basis of your written lab report. Note that doing the VG task is not in itself enough to get a VG; the full report must also be of a high standard.
The submission deadline for this assignment is January 17, 2016.
Please do not hesitate to contact me as soon as possible either personally or via email in case you encounter any problems.