In this assignment, you are going to implement a PCFG parser using the CKY
algorithm and evaluate it using treebank data. Starter code can be
- Training data
train.dat- original trees
- Development data, used to evaluate your parser
dev.dat- original trees
dev.raw- only the sentences, used as input to the parser
The data can be found in
The CKY algorithm only accepts grammars in Chomsky Normal Form (CNF). Before we start extracting grammars from treebanks, we therefore need methods for converting trees into CNF. (Normally, one would also want to convert parser output in CNF back to the original tree format for evaluation purposes, but we will omit this step here and evaluate on the CNF trees instead.)
The trees in the treebank have been preprocessed to simplify the annotation (removing empty nodes and function tags) and each tree is represented as a list:
[ROOT, Child1, ..., Childn]where
ROOTis the root symbol and each
Childiis itself a tree represented as a list. Here is a simple example, with indentation to show the tree structure:
["S", ["NP", ["NNP", "Pierre"], ["NNP", "Vinken"]], ["VP", ["MD", "will"], ["ADVP", ["RB", "soon"]] ["VP", ["VB", "join"], ["NP", ["DT", "the"], ["NN", "board"]]]] [".", "."]]
Conversion: To convert a tree to CNF, you must (a) eliminate n-ary branching subtrees (for n > 2) by inserting additional nodes, and (b) eliminate unary branching by merging nodes. For our example tree, this gives the following transform, where new nodes are marked in red:
["S", ["NP", ["NNP", "Pierre"], ["NNP", "Vinken"]], ["S|NP", ["VP", ["MD", "will"], ["VP|MD", ["ADVP+RB", "soon"]] ["VP", ["VB", "join"], ["NP", ["DT", "the"], ["NN", "board"]]]]], [".", "."]]]
A|B is used for a new node in a subtree
A and left sibling
B. (How much
of the previously generated subtrees is encoded in the new label is
often referred to as the markovization order. I suggest you keep it
simple and only record the immediately preceding sibling, which
corresponds to a first-order markovization, but you are free to
experiment with this if you wish.)
A+B is used for a node that results from
A node and a
B node in a unary
chain. (Although it is possible to apply a markovization scheme here
as well, I recommend you to encode the complete path.)
Starter code: The file
cnf.py contains a stub
for the method
to_cnf() together with additional code
that can be used to read and write trees and check that the output is
indeed in CNF. Your task is to complete the implementation of
Evaluation: To test your implementation, run the main program
cnf.py as follows:
python3 cnf.py < infile > outfileThe program checks that
to_cnf()converts all trees in
infileto CNF and preserves all the words of the sentence. Converted trees are printed to
You run the program
pcfg.py as follows:
python3 pcfg.py treebankfile grammarfile
The program extracts a PCFG from the trees in
(which must be in CNF) and stores the grammar in
in the following format:
["Q1", sym, word, p] = unary rule sym --> word with probability p
["Q2", sym, y1, y2, p] = binary rule sym --> y1 y2 with probability p
["WORDS", wordlist] = list of all "well known words"
The treebank file should contain one tree per line, as in
Now that you have a PCFG in CNF, the next step is to implement a parser that can use the grammar to analyze new sentences using the CKY algorithm.
Starter code: The file
parser.py implements a parser that can
load a PCFG in the format created above and parse a set of raw (but tokenized) sentences, one per line, into CNF trees. Your task is to implement the method
CKY for finding the most probable parse and the method
backtrace for extracting the corresponding tree using backpointers. Here are a couple of things to note:
- In this implementation of CKY, the sentence has to be represented as a list of pairs
(norm, word), where
wordis the word occurring in the input sentence and
normis either the same word, if it is a known word according to the grammar, or the string
normshould be used for grammar lookup but
wordshould be used in the output tree.
- Binary rules that expand a nonterminal
symare stored in
- Rule probabilities are stored in
q1[sym, word](for unary rules) and
q2[sym, y1, y2](for binary rules).
- All non-terminal symbols are stored in
Evaluation: To test your implementation, run the main program of
parser.py as follows:
python3 parser.py grammarfile < infile > outfile
The grammar file should be the grammar extracted from your converted training data in previous assignments. The infile should be the raw sentences in the dev set.
The program parses the sentences in
infile and prints the parser trees
outfile. The infile should contain one sentence per line with space-separated tokens, as in the file
dev.raw. Also make sure that the file
tokenizer.py is available in the same directory.
The parser is likely to be very slow, so do not be surprised if it takes half an hour to parse the dev set. It is advisable to use only a few short sentences during development, and possibly a smaller grammar, to speed up testing. In the end you should parse the full dev set, though.
After parsing the sentences in the dev set, the final step is to evaluate the accuracy of the parser. No implementation is needed for this. Just run the evaluator as follows:
python3 eval.py goldfile outfile
goldfile should contain the dev sentences with binarized trees (produced in part 1) and the outfile is the outfile from your parser. The parser is not likely to be very accurate, so do not be surprised if
your results are quite far from the state of the art. You should expect the F1-score of the parser to be somewhere around .65-.75. If it is much lower there is most probably an error somewhere.
For distinction, you have to improve the implementation in one of the following ways and write a report about your additional modifications and results:
- Investigate how vertical and horizontal markovization affects parsing and discuss the implications of your results.
- Improve parsing accuracy through a better treatment of rare words. Discuss and motivate your chosen treatment.
- The parser is very slow. Investigate ways to improve its speed. Motivate and evaluate your different ideas for speed improvements. To pass this assignment you should either improve the speed substantially and have reasonable explanations of why, or attempt to improve the speed without succeeding, but provide an insightful discussion about the reasons for this and for why you tried the methods you did.
- Extend the parser to also handle unit productions (unary grammar rules). This is not part of CNF, but this functionality is often added to CKY. In order to demonstrate that it works, you also need to create a (very small) pcfg treebank that uses unit productions. You should also discuss the pros and cons of allowing unit productions in CKY as opposed to converting them to CNF. If your implementation cannot handle chains of unit productions, discuss how this can be added to it.
It is also possible to earn a VG by exploring some other issue. If you want to do so, it is obligatory to contact Sara beforehand, in order to get your specific extension approved. If you pursue an individual idea without approval, no VG grade will be granted.
Note that doing a VG task is no guarantee for earning a VG. The full assignment must also be performed with an overall high quality.
Report by uploading a zip or gz document to studentportalen containing a file with your name, the f1-score and time, the code (
cnf.py), and possibly any files needed for reporting the VG-assignment if you choose to do one.
Please do not hesitate to contact Ali as soon as possible either personally or via email in case you encounter any problems.
Modified by Sara Stymne, 2017