<sentence id="2" user="malt" date="">
  <word id="1" form="Genom" postag="pp" head="3" deprel="ADV"/>
  <word id="2" form="skattereformen" postag="nn.utr.sin.def.nom" head="1" deprel="PR"/>
  <word id="3" form="införs" postag="vb.prs.sfo" head="0" deprel="ROOT"/>
  <word id="4" form="individuell" postag="jj.pos.utr.sin.ind.nom" head="5" deprel="ATT"/>
  <word id="5" form="beskattning" postag="nn.utr.sin.ind.nom" head="3" deprel="SUB"/>
  <word id="6" form="(" postag="pad" head="5" deprel="IP"/>
  <word id="7" form="särbeskattning" postag="nn.utr.sin.ind.nom" head="5" deprel="APP"/>
  <word id="8" form=")" postag="pad" head="5" deprel="IP"/>
  <word id="9" form="av" postag="pp" head="5" deprel="ATT"/>
  <word id="10" form="arbetsinkomster" postag="nn.utr.plu.ind.nom" head="9" deprel="PR"/>
  <word id="11" form="." postag="mad" head="3" deprel="IP"/>
</sentence>

The tagsets used for parts-of-speech and dependency relations must be specified in the header of the XML document. An example document can be found here. An XML schema for Malt-XML treebanks can be found here.
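As an illustration of how the word attributes fit together, the following minimal Python sketch reads a single sentence element with the standard library and looks up the root word (the one whose head is 0). The function name `read_sentence` and the abbreviated three-word sample are assumptions made for this example; a full Malt-XML document also has a header, which is not handled here.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sentence above (first three words only).
SENTENCE = """
<sentence id="2" user="malt" date="">
  <word id="1" form="Genom" postag="pp" head="3" deprel="ADV"/>
  <word id="2" form="skattereformen" postag="nn.utr.sin.def.nom" head="1" deprel="PR"/>
  <word id="3" form="införs" postag="vb.prs.sfo" head="0" deprel="ROOT"/>
</sentence>
"""

def read_sentence(xml_text):
    """Return one dict per <word>, with id and head converted to int."""
    sentence = ET.fromstring(xml_text)
    words = []
    for w in sentence.findall("word"):
        words.append({
            "id": int(w.get("id")),
            "form": w.get("form"),
            "postag": w.get("postag"),
            "head": int(w.get("head")),
            "deprel": w.get("deprel"),
        })
    return words

words = read_sentence(SENTENCE)
# The root of the dependency tree is the word whose head is 0.
root = next(w for w in words if w["head"] == 0)
print(root["form"])
```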
Malt-TAB is a text-based representation, which is mainly used by MaltParser. Malt-TAB contains a subset of the features in Malt-XML, and attributes are implicitly defined by their position. Each word is represented on one line, with attribute values being separated by tabs. The required order of attributes is as follows:
form (required) < postag (required) < head (optional) < deprel (optional)
Although head and deprel are optional, they must either both be included or both be omitted. (Normally, all four columns are present in the input when training the parser and in the output when parsing, while only form and postag are present in the input when parsing.) Please note also that the id attribute is not represented explicitly at all. Words in a sentence are separated by one newline; sentences are separated by one additional newline. A dependency tree for the Swedish sentence "Genom skattereformen införs individuell beskattning (särbeskattning) av arbetsinkomster." can be represented as follows:

Genom	pp	3	ADV
skattereformen	nn.utr.sin.def.nom	1	PR
införs	vb.prs.sfo	0	ROOT
individuell	jj.pos.utr.sin.ind.nom	5	ATT
beskattning	nn.utr.sin.ind.nom	3	SUB
(	pad	5	IP
särbeskattning	nn.utr.sin.ind.nom	5	APP
)	pad	5	IP
av	pp	5	ATT
arbetsinkomster	nn.utr.plu.ind.nom	9	PR
.	mad	3	IP
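The column rules described above can be sketched as a small reader. This is a minimal illustration, not MaltParser's own code: the function name `parse_malt_tab` is hypothetical, and the second sentence in the sample (two words with only form and postag) is made up to show the two-column case.

```python
def parse_malt_tab(text):
    """Parse Malt-TAB text into a list of sentences.

    Each sentence is a list of dicts. The id is not stored in the file,
    so it is assigned from the word's position (starting at 1).
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        # head and deprel must be both present or both absent.
        if len(cols) not in (2, 4):
            raise ValueError(f"expected 2 or 4 columns, got {len(cols)}: {line!r}")
        word = {"id": len(current) + 1, "form": cols[0], "postag": cols[1]}
        if len(cols) == 4:
            word["head"] = int(cols[2])
            word["deprel"] = cols[3]
        current.append(word)
    if current:
        sentences.append(current)
    return sentences

sample = (
    "Genom\tpp\t3\tADV\n"
    "skattereformen\tnn.utr.sin.def.nom\t1\tPR\n"
    "införs\tvb.prs.sfo\t0\tROOT\n"
    "\n"
    "(\tpad\n"
    ")\tpad\n"
)
sentences = parse_malt_tab(sample)
print(len(sentences), sentences[0][2])
```

Rejecting lines with one or three columns enforces the rule that head and deprel are either both included or both omitted.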
An example document can be found here.
For interchange purposes we have defined a conversion from Malt-XML to Nordic Treebank Network TIGER-XML. The above sentence will get the following representation in NTN TIGER-XML:
<s id="s2">
  <graph root="p2_3">
    <terminals>
      <t id="w2_1" form="Genom" postag="pp"/>
      <t id="w2_2" form="skattereformen" postag="nn.utr.sin.def.nom"/>
      <t id="w2_3" form="införs" postag="vb.prs.sfo"/>
      <t id="w2_4" form="individuell" postag="jj.pos.utr.sin.ind.nom"/>
      <t id="w2_5" form="beskattning" postag="nn.utr.sin.ind.nom"/>
      <t id="w2_6" form="(" postag="pad"/>
      <t id="w2_7" form="särbeskattning" postag="nn.utr.sin.ind.nom"/>
      <t id="w2_8" form=")" postag="pad"/>
      <t id="w2_9" form="av" postag="pp"/>
      <t id="w2_10" form="arbetsinkomster" postag="nn.utr.plu.ind.nom"/>
      <t id="w2_11" form="." postag="mad"/>
    </terminals>
    <nonterminals>
      <nt id="p2_1" form="Genom" postag="pp" >
        <edge idref="w2_1" label="--"/>
        <edge idref="p2_2" label="PR"/>
      </nt>
      <nt id="p2_2" form="skattereformen" postag="nn.utr.sin.def.nom" >
        <edge idref="w2_2" label="--"/>
      </nt>
      <nt id="p2_3" form="införs" postag="vb.prs.sfo" >
        <edge idref="w2_3" label="--"/>
        <edge idref="p2_1" label="ADV"/>
        <edge idref="p2_5" label="SUB"/>
        <edge idref="p2_11" label="IP"/>
      </nt>
      <nt id="p2_4" form="individuell" postag="jj.pos.utr.sin.ind.nom" >
        <edge idref="w2_4" label="--"/>
      </nt>
      <nt id="p2_5" form="beskattning" postag="nn.utr.sin.ind.nom" >
        <edge idref="w2_5" label="--"/>
        <edge idref="p2_4" label="ATT"/>
        <edge idref="p2_6" label="IP"/>
        <edge idref="p2_7" label="APP"/>
        <edge idref="p2_8" label="IP"/>
        <edge idref="p2_9" label="ATT"/>
      </nt>
      <nt id="p2_6" form="(" postag="pad" >
        <edge idref="w2_6" label="--"/>
      </nt>
      <nt id="p2_7" form="särbeskattning" postag="nn.utr.sin.ind.nom" >
        <edge idref="w2_7" label="--"/>
      </nt>
      <nt id="p2_8" form=")" postag="pad" >
        <edge idref="w2_8" label="--"/>
      </nt>
      <nt id="p2_9" form="av" postag="pp" >
        <edge idref="w2_9" label="--"/>
        <edge idref="p2_10" label="PR"/>
      </nt>
      <nt id="p2_10" form="arbetsinkomster" postag="nn.utr.plu.ind.nom" >
        <edge idref="w2_10" label="--"/>
      </nt>
      <nt id="p2_11" form="." postag="mad" >
        <edge idref="w2_11" label="--"/>
      </nt>
    </nonterminals>
  </graph>
</s>
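The pattern visible in this example can be sketched in a few lines of Python. The rules below are inferred from the example only and are not taken from the MaltConverter source: every word yields a terminal `w<s>_<i>` and a nonterminal `p<s>_<i>`, the nonterminal's first edge points to its own terminal with the label `--`, and one further edge is added per dependent in word order. The function name `to_ntn_tiger` is hypothetical.

```python
import xml.etree.ElementTree as ET

# (form, postag, head, deprel) for the example sentence; ids are positional.
WORDS = [
    ("Genom", "pp", 3, "ADV"),
    ("skattereformen", "nn.utr.sin.def.nom", 1, "PR"),
    ("införs", "vb.prs.sfo", 0, "ROOT"),
    ("individuell", "jj.pos.utr.sin.ind.nom", 5, "ATT"),
    ("beskattning", "nn.utr.sin.ind.nom", 3, "SUB"),
    ("(", "pad", 5, "IP"),
    ("särbeskattning", "nn.utr.sin.ind.nom", 5, "APP"),
    (")", "pad", 5, "IP"),
    ("av", "pp", 5, "ATT"),
    ("arbetsinkomster", "nn.utr.plu.ind.nom", 9, "PR"),
    (".", "mad", 3, "IP"),
]

def to_ntn_tiger(sent_id, words):
    """Build an <s> element; head 0 marks the root of the graph."""
    s = ET.Element("s", id=f"s{sent_id}")
    root_id = next(i for i, w in enumerate(words, 1) if w[2] == 0)
    graph = ET.SubElement(s, "graph", root=f"p{sent_id}_{root_id}")
    terminals = ET.SubElement(graph, "terminals")
    nonterminals = ET.SubElement(graph, "nonterminals")
    for i, (form, postag, _head, _deprel) in enumerate(words, 1):
        ET.SubElement(terminals, "t", id=f"w{sent_id}_{i}", form=form, postag=postag)
        nt = ET.SubElement(nonterminals, "nt", id=f"p{sent_id}_{i}",
                           form=form, postag=postag)
        # First edge: the word's own terminal, with the label "--".
        ET.SubElement(nt, "edge", idref=f"w{sent_id}_{i}", label="--")
        # Remaining edges: one per dependent of word i, in word order.
        for j, (_, _, dep_head, dep_rel) in enumerate(words, 1):
            if dep_head == i:
                ET.SubElement(nt, "edge", idref=f"p{sent_id}_{j}", label=dep_rel)
    return s

s = to_ntn_tiger(2, WORDS)
print(ET.tostring(s, encoding="unicode"))
```

The inner loop over all words is quadratic in sentence length, which keeps the sketch short; a real converter would more likely group dependents by head first.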
An example document can be found here.
MaltConverter is a command-line program for converting between the dependency treebank representation formats Malt-XML, Malt-TAB and TIGER-XML (NTN). It can also map attribute names and tagsets.
To run MaltConverter you need the Java VM (tested with JRE 1.4.1).
Usage: java -jar MaltConverter.jar <conversion> <mapfile> <infile> <outfile>
Parameter | Description |
---|---|
conversion | Specifies the conversion (e.g. malt2tiger). See table below for available conversions. |
mapfile | Path to the XML document which describes the mapping of attribute names and tagsets. See example below. |
infile | The path to the source file which will be converted. |
outfile | The path to the destination file where the output will be saved. |
The table below lists the available conversions. In the table, malt stands for Malt-XML and tab for Malt-TAB. Note that it is possible to convert from Malt-XML to Malt-XML with malt2malt, which allows mapping of tagsets and removal of attributes. When converting to Malt-TAB, tagset files will be created which can be used by MaltParser.
Conversion | From | To |
---|---|---|
tiger2malt | TIGER-XML | Malt-XML |
tiger2tab | TIGER-XML | Malt-TAB |
malt2tiger | Malt-XML | TIGER-XML |
malt2tab | Malt-XML | Malt-TAB |
malt2malt | Malt-XML | Malt-XML |
tab2malt | Malt-TAB | Malt-XML |
tab2tiger | Malt-TAB | TIGER-XML |
tab2tab | Malt-TAB | Malt-TAB |
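For scripted use, the conversion names in the table can be checked before the program is started. The sketch below only assembles the argument list from the usage line; the helper name `build_command` and the file names are hypothetical, and the list would still have to be passed to something like `subprocess.run` to actually run the converter.

```python
FORMATS = {"tiger": "TIGER-XML", "malt": "Malt-XML", "tab": "Malt-TAB"}

# Every source/target pair except tiger2tiger, matching the table above.
CONVERSIONS = {
    f"{src}2{dst}": (FORMATS[src], FORMATS[dst])
    for src in FORMATS
    for dst in FORMATS
    if (src, dst) != ("tiger", "tiger")
}

def build_command(conversion, mapfile, infile, outfile, jar="MaltConverter.jar"):
    """Return the argument list for one MaltConverter run."""
    if conversion not in CONVERSIONS:
        raise ValueError(f"unknown conversion: {conversion!r}")
    return ["java", "-jar", jar, conversion, mapfile, infile, outfile]

# Hypothetical file names, for illustration only.
cmd = build_command("malt2tab", "map.xml", "treebank.xml", "treebank.tab")
print(" ".join(cmd))
```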
A mapping file must always be specified. It has the following form:

<?xml version="1.0" encoding="ISO-8859-1"?>
<mapping id="Talbanken">
  <annotation>
    <feature from="form" to="form"/>
    <feature from="postag" to="postag">
      <value from="ab"/>
      <value from="ab.kom"/>
      <value from="ab.pos"/>
      ...
      <value from="vb.sup.akt.mod"/>
      <value from="vb.sup.sfo"/>
    </feature>
    <feature from="head" to=""/>
    <feature from="deprel" to="edgelabel">
      <value from="ROOT" to="--"/>
      <value from="ADV"/>
      ...
      <value from="XX"/>
    </feature>
  </annotation>
</mapping>
The mapping file consists of a sequence of features (or attributes) with a mapping from an attribute name to an attribute name. If the attribute names are the same, the to value should be identical to the from value. If the to value is the empty string, the attribute will be suppressed in the output.
For tagset values, the identity map can be achieved by simply excluding the to attribute (but the empty string is not allowed as a value of the to attribute).
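To make the mapping rules concrete, here is a minimal Python sketch that loads a small mapping and applies it to one word. It is an illustration of the rules as stated, not MaltConverter's implementation. The function names and the shortened mapping are assumptions, and so is the treatment of cases the text does not cover: attributes without a feature entry and values that are not listed are passed through unchanged.

```python
import xml.etree.ElementTree as ET

# Shortened mapping for illustration; a real file lists the full tagsets.
MAPPING = """
<mapping id="example">
  <annotation>
    <feature from="form" to="form"/>
    <feature from="postag" to="postag">
      <value from="pp"/>
    </feature>
    <feature from="head" to=""/>
    <feature from="deprel" to="edgelabel">
      <value from="ROOT" to="--"/>
      <value from="ADV"/>
    </feature>
  </annotation>
</mapping>
"""

def load_mapping(xml_text):
    """Return {source_name: (target_name, {source_value: target_value})}."""
    mapping = {}
    for feature in ET.fromstring(xml_text).iter("feature"):
        values = {}
        for v in feature.findall("value"):
            # A missing to attribute on a value means the identity map.
            values[v.get("from")] = v.get("to", v.get("from"))
        mapping[feature.get("from")] = (feature.get("to"), values)
    return mapping

def apply_mapping(mapping, word):
    """Rename attributes and map tag values for one word (a dict)."""
    out = {}
    for name, value in word.items():
        # Assumption: attributes without a feature entry are kept as-is.
        target, values = mapping.get(name, (name, {}))
        if target == "":
            continue  # empty to value: the attribute is suppressed
        # Assumption: values not listed in the mapping are kept as-is.
        out[target] = values.get(value, value)
    return out

word = {"form": "införs", "postag": "vb.prs.sfo", "head": "0", "deprel": "ROOT"}
mapped = apply_mapping(load_mapping(MAPPING), word)
print(mapped)
```

With this mapping, head is dropped, deprel is renamed to edgelabel, and the value ROOT becomes `--`.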