Assignment: Dependency Parsing
Note: master students only!
In this assignment, you are going to implement some key components of a transition-based
dependency parser and analyze its behavior. Starter code can be found in
A transition system for dependency parsing consists of a set of configurations (or parser states) and a set of transitions (or parse actions) for moving between configurations. You are going to implement the arc-eager transition system (see section 3.4.1 in Kübler et al.), where configurations consist of a stack, a buffer and an arc set, all represented as lists. Tokens are represented by integers corresponding to sentence positions (with 0 for the special root node), and dependency arcs are represented as triples (h, d, l), where h is the head token, d the dependent token, and l the dependency label.
Given the sentence "the cat sits on the mat today", the following is a possible configuration:
stack = [0, 2] buffer = [3, 4, 5, 6, 7] arcs = [(2, 1, "det")]This configuration has the root and "cat" on the stack, "sits", "on", "the", "mat" and "today" in the buffer, The arc set contains a single dependency arc connecting the head word "cat" to the dependent word "the" with the dependency relation "det". Make sure you understand how configurations are represented before you move on.
Let us call the first word in the stack top and the first word in the buffer next. The transitions of the arc-eager system can be described as follows:
SH = move next from the buffer to the stack RE = remove top from the stack (RA, label) = add (top, next, label) to the arc set; move next from the buffer to the stack (LA, label) = add (next, top, label) to the arc set; remove top from the stack
Starter code: The file
transition.py contains a dummy implementation
of a transition-based parser that parses the sentence "the cat sits on the mat today" by initializing
the parser configuration to:
stack =  buffer = [1, 2, 3, 4, 5, 6, 7] arcs = 
performing the following sequence of transitions:
SH (LA, "det") SH (LA, "nsubj") SH SH SH (LA, "det") (LA, "case") (RA, "nmod") RE (RA, "nmod")
and printing the resulting dependency tree. You run the parser as follows:
python transition.pyThe method
transition(trans, stack, buffer, arcs)is only a stub, which only implements the
SHtransition, and your task is to implement the correct behavior for all four transitions. You can test your implementation by verifying that the parser produces the correct dependency tree for the sentence.
Training a transition-based dependency parser means training a model for predicting the next transition given the current configuration. To do this, we first need to construct an oracle, a function that correctly predicts the next transition when given access to the correct dependency tree. You can find a definition of a static oracle for the arc-eager system in Goldberg and Nivre, CoLING 2012, Algorithm 1.Starter code: The file
oracle.pyimplements an oracle parser that reads a set of syntactically annotated sentences from a file and tries to derive the correct trees by following the predictions of an
oracle()method. This method is a stub that always predicts
SH, and your task is to implement a better oracle. In addition, you have to replace the
transition()method by your own implementation from above. You can test your implementation on the sentence "she gave the cat some food" by running:
python3 oracle.py < example.tabor
python3 oracle.py tab < example.tabif you want to see the result in the same tab-based format as the input. The result in the tab-ased format should be the same as in the input file
Testing on a real treebank
When you are satisfied that your oracle works as it should, it is time to test it on some real treebank data. For this purpose, we will use the development set of the English UD treebank in a simplified format. The first sentence looks like this:
From ADP 3 case the DET 3 det AP PROPN 4 nmod comes VERB 0 root this DET 6 det story NOUN 4 nsubj : PUNCT 4 punct
The four columns contain tokens, part-of-speech tags, syntactic heads and dependency labels, respectively. To run your oracle parser on the whole development set and make it print the derived trees in the same format, you can run:
python3 oracle.py tab < en-ud-dev.tab > en-ud-dev.out
Note the extra argument
tab to change the output format from dependency tuples to the column-based format.
Ideally, the trees output by your oracle parser should be identical to the original trees from the treebank, because we want to use the oracle to guide a machine learning algorithm. However, if you compare the two sets of trees, you are likely to find some discrepancies. Your final task is to figure out what causes these discrepancies and discuss how this problem could be handled. No implementation is needed for this part.
For Distinction (VG)
For distinction, you have to do one of the following:
- Implement a different transition system, for example, arc-standard.
- Implement a projectivization algorithm for dependency trees.
- Implement a dynamic oracle for the arc-eager transition system.
Report by uploading a zip or gz document to studentportalen containing the code file
oracle.py and a report discussing the issues found when applying your code on a real treebank, and any additional files that might be needed to report on a VG task.
Please do not hesitate to contact Sara as soon as possible either personally or via email in case you encounter any problems.
HistoryThis assignment was originally developed Joakim Nivre in 2016
Modified by Sara Stymne, 2017