Syntactic Analysis (5LN455): Assignment 2
Note: bachelor students only!
In this assignment, you will build a PCFG-parser based on the CKY algorithm (J&M Section 13.4.1, Section 14.2). This corresponds to Exercise 14.1 in J&M.
Code and Data
The code can be found in
/local/kurs/parsing/pcfg_starter_code.tar.gz. The code consists of 4 python files. You only have to modify the file
parser.py. There are also 5 data and grammar files that can be found in
The CKY algorithm only accepts grammars in Chomsky Normal Form (CNF). Before we start extracting grammars from treebanks, we therefore need methods for converting trees into CNF. (Normally, one would also want to convert parser output in CNF back to the original tree format for evaluation purposes, but we will omit this step here and evaluate on the CNF trees instead.) In the bachelor assignment the conversion and grammar extraction is already performed for you.
Treebank data: The data to be used for training and evaluation comes from the Penn Treebank and consists of a training set of about 40,000 sentences(section 02-21) and a development set of 100 sentences (from section 00). The trees have been preprocessed to simplify the annotation (removing empty nodes and function tags) and each tree is represented as a list:
[ROOT, Child1, ..., Childn]where
ROOTis the root symbol and each
Childiis itself a tree represented as a list. The trees have already been converted to CNF for you. Here is an example with indentation to show the tree structure::
["S", ["NP", ["NNP", "Pierre"], ["NNP", "Vinken"]], ["S|NP", ["VP", ["MD", "will"], ["VP|MD", ["ADVP+RB", "soon"]] ["VP", ["VB", "join"], ["NP", ["DT", "the"], ["NN", "board"]]]]], [".", "."]]]
A|B is used for a new node in a subtree
A and left sibling
A+B is used for a node that results from
A node and a
B node in a unary
Based on the training data in CNF, a PCFG has been extracted for you. This was done by reading the treebank one tree at a time, and counting the frequency of nonterminals, words, unary and binary rules in that tree.
The format of the grammar is:
["Q1", sym, word, p] = unary rule sym --> word with probability p
["Q2", sym, y1, y2, p] = binary rule sym --> y1 y2 with probability p
["WORDS", wordlist] = list of all "well known words"
Note that words that occur 5 times or more are stored as "well known words" in
the grammar. Words with lower frequency are mapped to the symbol
_RARE_ as a simple form of smoothing.
- Training data (not needed to solve the assignment)
train.dat- original trees
train.dat.cnf- trees converted to CNF
- Development data, used to evaluate your parser
dev.dat- original trees
dev.dat.cnf- trees converted to CNF, used for evaluation
dev.raw- only the sentences, used as input to the parser
grammar.pcfg- PCFG trained on the training data in CNF
CKY Parsing Assignment
Your assignment is to implement a parser that can use the grammar to analyze new sentences using the CKY algorithm.
Starter code: The file
parser.py implements a parser that can
load a PCFG in the format described above and parse a set of raw (but tokenized) sentences, one per line, into CNF trees. Your task is to implement the method
CKY for finding the most probable parse and the method
backtrace for extracting the corresponding tree using backpointers. Make sure that the trees that you create have the same format as in the provided data files. Here are a couple of things to note:
- In this implementation of CKY, the sentence is represented as a list of pairs
(norm, word), where
wordis the word occurring in the input sentence and
normis either the same word, if it is a known word according to the grammar, or the string
normshould be used for grammar lookup but
wordshould be used in the output tree.
- You will need to use the following variables from the
- Binary rules that expand a nonterminal
Xare stored in
- Rule probabilities are stored in
q1[sym, word](for unary rules) and
q2[sym, y1, y2](for binary rules).
- All non-terminal symbols are stored in
- Binary rules that expand a nonterminal
Evaluation: To test your implementation, run the main program of
parser.py as follows:
python3 parser.py grammar.pcfg < INFILE > YOUR_OUTFILE
The program parses the sentences in
INFILE and prints the parse trees
YOUR_OUTFILE. The infile should contain one sentence per line with space-separated tokens, as in the file
dev.raw. Also make sure that the file
tokenizer.py is available in the same directory.
The parser is likely to be very slow, so do not be surprised if it takes half an hour to parse the dev set. In order to test your parser in shorter time it is highly recommended that you create a small test file with a few simple sentences, and possibly also a small pcfg file. In order to do this, look at the format of the existing files.
After parsing the sentences in the dev set, the final step is to evaluate the accuracy of the parser. Just run the evaluator as follows:
python3 eval.py dev.dat.cnf YOUR_OUTFILE
You should expect the F1-score of the parser to be somewhere around .65-.75. If it is much lower there is most probably an error somewhere.
Submission and Grading
In order to pass this assignment (Godkänt), you will need to submit an implementation of a working parser, and report its performance in terms of its overall F1-Score and runtime in seconds, as printed by the software provided.
In order to pass this assignment with distinction (Väl godkänt), I expect you to show ability to analyse your work, synthesize new ideas, and/or critically assess your implementation. Below are some suggestions for extensions to the assignment that will let you demonstrate these abilities:
Extend the parser to also handle unit productions (unary grammar rules). This is not part of CNF, but this functionality is often added to CKY. In order to demonstrate that it works, you also need to create a (very small) pcfg treebank that uses unit productions. In the pcfg
Q3can be used for unary grammar rules, e.g.
["Q3", "NP", "PRON", 0.0082]You should also discuss the pros and cons of allowing unit productions in CKY as opposed to converting them to CNF. Discuss your implementation and results, and if your implementation cannot handle chains of unit productions, discuss how this can be added to it.
Implement the conversion to CNF. Note that you also have to extract a new grammar on your converted treebank. Compare a few (two is enough) different schemes for markovization. See the master assignment for information on markovization and how to integrate this with the existing code and extract a new grammar. Discuss your findings in a short report.
Suggest a way in which your working parser can be improved with respect to accuracy. Implement this suggestion, and discuss the result, or come up with a concrete plan for doing so and write a report describing and discussing your proposal.
The parser is very slow. Investigate ways to improve its speed. Motivate and evaluate your different ideas for speed improvements. To pass this assignment you should either improve the speed substantially and have reasonable explanations of why, or attempt to improve the speed without succeeding, but provide an insightful discussion about the reasons for this.
Compare your parser with an existing system from the parsing literature in the form of a one-page essay.
It is also possible to earn a VG by exploring some other issue. If you want to do so, it is obligatory to contact Sara beforehand, in order to get the extension approved. If you pursue an individual idea without approval, no VG grade will be granted.
Also note that doing a VG task is no guarantee for earning a VG. The full assignment must also be performed with an overall high quality.
Report by uploading a zip or gz document to studentportalen containing a file with your name, the f1-score and time, the code (
parser.py), and possibly any files needed for reporting the VG-assignment if you choose to do one.
Please do not hesitate to contact me as soon as possible either personally or via email in case you encounter any problems.
HistoryThis assignment was originally developed Joakim Nivre in 2016
Modified by Sara Stymne, 2016/2017