SPARK: A Shallow Parser for Swedish

Beata Megyesi

Description

SPARK is a shallow parser developed for the robust analysis of the constituent structure of Swedish. The parser is based on a context-free grammar for Swedish and on the implementation of Earley's algorithm in Python by John Aycock (SPARK-0.6.1).

Format

The input file has to be tokenized, and each sentence has to begin on a new line. The words have to be annotated with the Swedish version of the PAROLE tags. (A tagger already trained on Swedish exists for this purpose: the Trigrams'n'Tags (TnT) tagger developed by T. Brants (2000).) The word and the tag have to be separated by '/', i.e. each token has the form word/tag.

The output of the parser can be represented as tagged annotation, as labelled bracketings with category names as labels and with morphologically analysed words, as indented structures, or as a tree representation (one sentence at a time, as a PostScript file).

The analytic scheme is based on a phrase structure framework. Eight types of phrases are used. Some of the categories correspond to chunks, e.g. AP, AdvP, and verb clusters. Other categories are designed to handle arguments on the right-hand side of the phrasal head, or to handle co-ordinated phrases, such as maximal adjective phrases. The phrase categories are listed below, each followed by a brief explanation and an example.
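
The input format described above (one sentence per line, tokens of the form word/tag) can be read with a few lines of Python. This is only a minimal sketch and not part of SPARK itself: the function name `read_tagged_sentences` is hypothetical, the tags `TAG1` to `TAG3` are placeholders rather than real PAROLE tags, and splitting at the last '/' of a token is an assumption made so that words containing '/' survive.

```python
def read_tagged_sentences(lines):
    """Turn lines of 'word/tag' tokens into lists of (word, tag) pairs.

    Assumption: the tag follows the LAST '/' of a token, so a word
    that itself contains '/' is kept whole.
    """
    sentences = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        pairs = []
        for token in tokens:
            word, sep, tag = token.rpartition("/")
            if not sep or not word or not tag:
                raise ValueError(f"expected word/tag, got {token!r}")
            pairs.append((word, tag))
        sentences.append(pairs)
    return sentences


# TAG1..TAG3 are placeholders standing in for real PAROLE tags.
example = ["boken/TAG1 är/TAG2 intressant/TAG3", ""]
print(read_tagged_sentences(example))
```

A check of this kind, run before parsing, makes malformed tokens (a missing tag or a missing separator) fail early with a clear message.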

Categories

  • Adverb Phrase (AdvP) consists of adverbs that can modify adjectives, numerical expressions or verbs. Example: very
  • Minimal Adjective Phrase (APMIN) consists of the adjectival head and its possible modifiers, e.g. AdvP and/or prepositional phrase. Example: very interesting
  • Maximal Adjective Phrase (APMAX) includes more than one AP with a delimiter or a conjunction in between. Example: very interesting and nice
  • Noun Phrase (NP) may include the head noun and its modifiers to the left, e.g. determiners, nouns in genitive, possessive pronouns, numerical expressions, APMIN, APMAX, and/or compound nouns. Thus, possessive expressions do not split an NP into two noun phrases, as they do in the CoNLL-2000 shared task on chunking. Example: Pilger's very interesting and nice book
  • Prepositional Phrase (PP) consists of one or several prepositions, delimited by a conjunction, followed by one or several NPs or, in elliptical expressions, by an AP only. Example: about politics
  • Verb Cluster (VC) consists of a continuous verb group belonging to the same verb phrase without any intervening constituents like NP or AdvP. Example: would have been
  • Infinitive Phrase (InfP) includes an infinitive verb together with the infinitive marker and may contain AdvP and/or verbal particles. Example: to go out
  • Numeral Expression (NumP) consists of numerals with their possible modifiers, for example AP or AdvP. Example: several thousands

Please note that the grammatical categories represent neither clauses, such as relative clauses, nor sentences. There are also other versions of the parser which include detection of maximal projections of noun phrases, including relative clauses. If you are interested, please contact me.
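
The way the categories nest can be illustrated with the NP example from the list, "Pilger's very interesting and nice book". The sketch below is an illustration only: the bracket notation and the function name `bracket` are hypothetical and not necessarily SPARK's actual output format, and the analysis of the example is one reading of the category definitions (an NP containing an APMAX, which co-ordinates two APMINs).

```python
def bracket(node):
    """Render a tree as a labelled bracketing string.

    A node is either a plain word (str) or a (label, children) tuple.
    """
    if isinstance(node, str):
        return node
    label, children = node
    return "[" + label + " " + " ".join(bracket(c) for c in children) + "]"


# Assumed analysis, derived from the category descriptions only.
np_example = (
    "NP",
    [
        "Pilger's",
        ("APMAX", [
            ("APMIN", [("AdvP", ["very"]), "interesting"]),
            "and",
            ("APMIN", ["nice"]),
        ]),
        "book",
    ],
)

print(bracket(np_example))
# [NP Pilger's [APMAX [APMIN [AdvP very] interesting] and [APMIN nice]] book]
```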

License

SPARKchunk is publicly available and it is free to use, copy, modify, merge, publish, distribute, and/or sublicense copies of the Software. The software is provided "AS IS". Please give appropriate references, see below.

Download

Download SPARKchunk.
Download a faster Java version implemented by Martin Johansson and Pontus Lindström.

References

Aycock, J. 1998. Compiling Little Languages in Python. In Proceedings of the 7th International Python Conference.

Megyesi, B. 2002. Data-Driven Syntactic Analysis - Methods and Applications for Swedish. Ph.D. thesis. Department of Speech, Music and Hearing, KTH, Stockholm, Sweden.