SPARK: A Shallow Parser for Swedish
SPARK is a shallow parser developed for the robust analysis of the
constituent structure of Swedish. The parser is based on a context-free
grammar for Swedish and on Earley's algorithm, as implemented in Python by
John Aycock (SPARK-0.6.1).
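To illustrate the general technique, here is a minimal sketch of an Earley
recognizer over a sequence of tags. It is not Aycock's implementation and it
does not use the SPARK grammar: the function name, the toy grammar, and the
coarse tag names (DET, ADJ, NOUN, PREP) are all assumptions made for this
example, and the grammar is assumed to contain no empty rules.

```python
def earley_recognize(tags, grammar, start):
    """Return True if the tag sequence can be derived from `start`.

    `grammar` maps a nonterminal to a list of right-hand sides (tuples).
    Any symbol that is not a key of `grammar` is treated as a terminal.
    Assumes the grammar has no empty right-hand sides.
    """
    # An item is (lhs, rhs, dot, origin); chart[i] holds items ending at i.
    chart = [set() for _ in range(len(tags) + 1)]
    chart[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i in range(len(tags) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            new_items = []
            if dot == len(rhs):
                # Completer: advance every item that was waiting for lhs.
                new_items = [(l, r, d + 1, o)
                             for (l, r, d, o) in chart[origin]
                             if d < len(r) and r[d] == lhs]
            elif rhs[dot] in grammar:
                # Predictor: expand the nonterminal after the dot.
                new_items = [(rhs[dot], r, 0, i) for r in grammar[rhs[dot]]]
            elif i < len(tags) and tags[i] == rhs[dot]:
                # Scanner: the terminal after the dot matches the next tag.
                chart[i + 1].add((lhs, rhs, dot + 1, origin))
            for item in new_items:
                if item not in chart[i]:
                    chart[i].add(item)
                    agenda.append(item)
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[len(tags)])


# Hypothetical toy grammar, for illustration only.
TOY_GRAMMAR = {
    "NP": [("NOUN",), ("ADJ", "NP"), ("DET", "NP")],
    "PP": [("PREP", "NP")],
}

print(earley_recognize(["PREP", "ADJ", "NOUN"], TOY_GRAMMAR, "PP"))
```

A full parser would also record how each item was completed, so that the
constituent structure can be read off the chart; the sketch only answers
whether the sequence is derivable.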
The input file must be tokenized, and each sentence must begin on a new
line. The words must be annotated with the Swedish version of the
PAROLE tags. (A pre-trained tagger is available for this purpose:
the Trigrams'n'Tags tagger developed by T. Brants (2000), trained on
Swedish.) The word and the tag must be separated by '/', as shown in the
following example:
Smygrustning/NCUSN@IS av/SPS raketvapen/NCNPN@IS
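The input format can be read with a few lines of Python. The following is
a minimal sketch, not part of SPARK itself: the function name
parse_tagged_sentence is hypothetical, and splitting on the last '/' is an
assumption made so that a word containing '/' is kept intact.

```python
def parse_tagged_sentence(line):
    """Split one tokenized, tagged sentence into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last '/' so that a word containing '/' keeps it.
        word, sep, tag = token.rpartition("/")
        if not sep or not word or not tag:
            raise ValueError(f"token is not in word/TAG form: {token!r}")
        pairs.append((word, tag))
    return pairs


sentence = "Smygrustning/NCUSN@IS av/SPS raketvapen/NCNPN@IS"
for word, tag in parse_tagged_sentence(sentence):
    print(word, tag)
```

Checking the format before parsing can make it easier to locate tagging or
tokenization problems in the input file.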
The output of the parser can be represented as tagged annotation, or as
with category names as labels, and with morphologically analysed words,
as indented structures, or as a tree representation (one sentence at a
time, as a PostScript file).
The analytic scheme is based on a phrase structure framework. Eight types
of phrases are used. Some of the categories correspond to chunks, e.g. AP,
AdvP, and verb clusters. Other categories are designed to be able to
handle arguments on the right-hand side of the phrasal head, or to
handle co-ordinated phrases, such as maximal adjective phrases. The
phrase categories are listed below, each followed by a brief
explanation and an example.
Adverb Phrase (AdvP) consists of adverbs that can modify
expressions or verbs. e.g. very
Minimal Adjective Phrase (APMIN) consists of the adjectival head and its
possible modifiers, e.g. AdvP and/or a prepositional phrase. e.g. very
Maximal Adjective Phrase (APMAX) includes more than one AP with a
or a conjunction in between. e.g. very interesting and nice
Noun Phrase (NP) may include the head noun and its modifiers to the left,
e.g. determiners, nouns in genitive, possessive pronouns, numerical
expressions, APMIN, APMAX, and/or compound nouns. Thus, possessive
expressions do not split an NP into two noun phrases, as they do in the
CoNLL-2000 shared task on chunking. e.g. Pilger's very interesting and
Prepositional Phrase (PP) consists of one or several prepositions
delimited by a conjunction and one or several NPs, or in
elliptical expressions an AP only. e.g. about politics
Verb Cluster (VC) consists of a continuous verb group belonging to
the same verb phrase without any intervening constituents like NP or
AdvP. e.g. would have been
Infinitive Phrase (InfP) includes an infinitive verb together with
particle and may contain AdvP and/or verbal particles. e.g. to go out
Numeral Expression (NumP) consists of numerals with their possible
modifiers, for example AP or AdvP. e.g. several thousands
Please note that the grammatical categories represent neither
clauses, such as relative clauses, nor sentences. There are also other
versions of the parser that detect maximal projections
of noun phrases, including relative clauses. If you are interested,
please contact me.
SPARKchunk is publicly available and it is free to use, copy, modify,
merge, publish, distribute, and/or sublicense copies of the Software.
The software is provided "AS IS". Please give appropriate references,
A faster Java version, implemented by Martin Johansson and Pontus Lindström, is available for download.
Aycock, J. 1998. Compiling Little Languages in Python. In Proceedings
of the 7th International Python Conference.
Megyesi, B. 2002. Data-Driven Syntactic Analysis -
Methods and Applications for Swedish. Ph.D. thesis. Department of
Speech, Music and Hearing, KTH, Stockholm, Sweden.