Nordic Treebank Network
The goal of this network, funded by the Nordic Language Technology Program, is to
promote research related to treebanks in the Nordic countries. Besides the network
website, our main channel of communication is the Nordic treebank mailing list
and archives
(nordic-treebank@iln.uio.no).
Participants
- Växjö University, School of Mathematics
and Systems Engineering. Contact: Joakim Nivre
(network coordinator).
- Copenhagen Business School,
Department of Computational Linguistics. Contact: Matthias Trautner Kromann.
- CSC
Scientific Computing Ltd,
The Language Bank of Finland. Contact: Manne
Miettinen.
- Göteborg University, Department
of Linguistics.
Contact: Jens Allwood.
- Göteborg University,
Department of Swedish, Natural Language Processing Section. Contact:
Lars Borin.
- KTH, Department of Speech, Music and
Hearing.
Contact: Beata Megyesi.
- NTNU, Department of Linguistics.
Contact: Torbjørn Nordgård.
- Stockholm University, Department of
Linguistics.
Contact: Martin Volk.
- University of Bergen,
Department of Linguistics. Contact:
Koenraad de Smedt.
- University of Helsinki, Department of
General Linguistics.
Contact: Kimmo
Koskenniemi.
- University of Iceland. Contact: Eirikur Rögnvaldsson.
- University of Oslo, The
Text Laboratory.
Contact: Janne Bondi
Johannessen.
- University of Southern Denmark,
Institute of Language and Communication, VISL project.
Contact: Eckhard
Bick.
- University of Tartu, Institute of Computer
Science.
Contact: Heli Uibo.
- Uppsala University, Department of Linguistics.
Contact: Eva Ejerhed.
Network Activities
- Network meetings:
- First general meeting: 17-18 September 2003, Fefor, Norway:
Minutes.
- Internal workshop: 5-7 March 2004, Stockholm, Sweden: Minutes.
- Second general meeting: 8-10 September 2004, Tartu, Estonia: Minutes.
- International workshops:
- The Second Workshop on Treebanks
and Linguistic Theories, 14-15 November 2003, Växjö.
- The Third Workshop on Treebanks and Linguistic Theories, 10-11 December 2004, Tübingen.
- PhD Courses and summer schools:
- Treebanks: Formats, Tools and Usage, 1-5 March 2004, Stockholm University.
- Empirical Methods in Natural Language Processing,
9-14 August 2004, Tartu, Estonia.
- Nordic parallel treebank (password protected for copyright reasons).
- Documentation of projects, tools and resources:
- Treebank representation formats
- Treebank tools (creation, usage, evaluation)
- Projects and resources
at network sites
Copenhagen Business School
- Danish Dependency
Treebank (DDT). The DDT currently contains 100.000 words of
syntactically annotated text from the Danish PAROLE corpus. The
underlying linguistic principles are documented in the
tagging manual
for DDT.
- DTAG dependency
annotation tool. DTAG is a linguistic command language
interpreter for editing, searching, comparing and displaying
dependency-annotated texts within a treebank.
- Discontinuous
Grammar. The purpose of this project is to create a linguistic
theory for discontinuous (non-projective) dependency grammars with
a cost component, and to design heuristic algorithms for (1)
serial parsing with repair, and (2) learning a massively ambiguous
dependency lexicon from a treebank.
CSC Scientific Computing
- Finnish
text collection. The Language Bank of Finland provides access to
roughly 180 million
words of Finnish text, mainly newspaper articles. All the texts are
encoded in XML. 60% of the texts have been automatically analyzed
morphosyntactically with Kielikone's Textmorfo analyzer.
- Finland-Swedish text collection. The Language Bank of Finland provides
access to roughly 32 million words of Finland-Swedish text (Swedish text
produced by Swedish-speaking Finns), mainly newspaper articles. All the
texts are encoded in XML and have been automatically analyzed
morphosyntactically with Lingsoft's SWECG analyzer.
- Lemmie is
a corpus query tool with a web user interface. It is based on an
object-oriented Perl API and allows the user to query texts in the Finnish and
Finland-Swedish text collections.
Göteborg University - Linguistics
- Spoken Language and
Semantics
This research group has collected several corpora, mainly of spoken
language. The main corpus is called GSLC (Göteborg Spoken Language
Corpus), and it is an incrementally growing corpus of spoken language
from different social activities.
- NordTalk
The NorFA network on corpus-based research on spoken language:
cooperation between general linguistics, computational linguistics,
language technology and speech technology in the Nordic countries.
Göteborg University - Swedish
- Språkbanken -- the Bank of
Swedish at Göteborg University is a constantly growing and evolving
national repository of annotated corpora and lexical resources for
Swedish, including some syntactically annotated material and also
parallel corpora. Many of Språkbanken's resources are searchable with a
web interface.
KTH
- Data-driven
Morpho-Syntactic and Shallow Syntactic Analysis of Swedish. The purpose of
this project is to automatically model the morpho-syntactic and constituent
structure of Swedish using various data-driven learning algorithms, and to
combine these by means of ensemble techniques.
- Swedish
Treebank. The goal of this project is to build a large-scale treebank for
Swedish, containing both spoken and written language data, and annotated with
both constituent structure and dependency structures. The project is a
cooperation between six Swedish universities.
- The GROG project: Boundaries
and
Groupings - The Structuring of Speech in Different Communicative
Situations
The purpose of this project is to model the structuring of speech in terms of
prosodic boundaries and groupings in various communicative situations. The
modeling aims at a structured and optimized description of the relations
between
the acoustic and linguistic structure on the one hand, and the
prosodic/perceptual annotations on the other.
- Learning
Lexical Semantic Relations for Swedish. The aim of this project is to
automatically learn information about lexical semantic relations for
Swedish data. The aim is also to introduce and evaluate the use of this
lexical knowledge in communicative systems.
- Open
source software. Open source tools developed at CTT, TMH, KTH.
- Speecon: Speech-driven Interfaces for
Consumer Devices. SPEECON is a project focusing on collecting linguistic
data for speech recogniser training. SPEECON is funded as a shared-cost project
under Human Language Technologies (HLT), which is part of the European
Commission's Information Society Technologies (IST) Programme. The partners
will collect speech data for 18 languages or dialectal zones, including most of
the languages spoken in the EU. TMH, KTH is a subcontractor and prepared the
recordings of the Swedish data. See also the Speechdat project, which consists
of a series of speech data collection projects funded by the European Union.
http://www.speechdat.org/
NTNU
Stockholm University
- Converting SynTag to XML for TIGER-Search. SynTag is a Swedish treebank dating back to 1986. Its annotation comprises part-of-speech tags and predicate-argument structures. In this project we convert the predicate-argument structure into a tree structure that follows the TIGER-XML coding scheme. This will allow the SynTag treebank to be loaded into TIGER-Search, a powerful treebank search tool.
- Adapting the TIGER annotation guidelines to Swedish. This project aims at investigating to what degree the annotation format and guidelines developed in the TIGER (and NEGRA) project for the German treebank are applicable to Swedish. The project will result in 100 consistently annotated Swedish sentences in the spirit of TIGER/NEGRA and a detailed description of the advantages and disadvantages of using the TIGER/NEGRA guidelines for Swedish (with suggestions for alternative annotations for specific cases).
University of Bergen - Section for linguistic studies and Language
technology group at AKSIS
- TREPIL
investigates methods for building a Norwegian treebank
semi-automatically. The project focuses on rich syntactic and semantic
annotation. The goal is not to build a full scale treebank, but to
acquire the necessary know-how, including linguistic principles,
database design, resources, tools, evaluation, etc. for building a
treebank with multiple structures for each sentence. The approach is
tested on a small scale treebank. The project is coordinated by Prof.
Koenraad de Smedt and is administered by AKSIS. Financed by the
Norwegian Research Council under the KUNSTI program, the project runs
from April 2004 until March 2007 and cooperates actively with other
research groups, in particular the language technology group at PARC
(Palo Alto), the TIGER-project (Stuttgart), and the Nordic treebank network.
University of Helsinki
- The University of Helsinki
Language Corpus Server (UHLCS). Computer corpora of more than 50
languages, including samples of minority languages and extensive corpora
representing different text types. In 2000, the corpora of the Uralic,
Turkic, Tungusic, Mongolic, Palaeo-Siberian, Iranian and Caucasian
languages.
University of Oslo
- PaNoLa -
Parsing Nordic Languages. The project, coordinated by the
University of Southern Denmark, Odense, aims at enhancing the Nordic
element within the VISL system. Interactive syntax learning and grammar
computer games will be available for Norwegian, Swedish and Danish.
- The Nomen
Nescio project - an automatic name recogniser for Norwegian, Swedish and
Danish. The project, coordinated in Oslo, aims at developing an
automatic named entity recogniser for Norwegian, Swedish and Danish.
- Oslo-Bergen-taggeren
The Oslo-Bergen tagger is based on the Constraint Grammar formalism, and
is used to tag the Oslo Corpus of Tagged Norwegian Texts.
- The Big
Brother corpus. This corpus is under development at the
University of Oslo. It is based on the first Norwegian series of the
reality show Big Brother, and will be a spoken language corpus with
transcribed text, sound and video.
- GREI
The aim of the GREI project has been to extend the number of Norwegian
sentences on the VISL site at the University of Southern Denmark, and to
improve the user guidelines for Norwegian pupils, all in
cooperation with Norwegian teachers.
- LOGON. LOGON is a
large project, coordinated by the University of Oslo, aiming at
automatic translation from Norwegian to English. Part of the project
will be a parallel tourist corpus, of which a prototype can be viewed here.
- LDB corpus
The LDB corpus is coordinated by the Department of Lexicography at
the University of Oslo, as a resource for lexicographic research.
Click for a search example.
- Usenet corpus for
Norwegian.
This 140 million word corpus of Norwegian, developed at
the University of Oslo, is drawn from Usenet news postings.
- Opus -
Open Source Parallel Corpus. OPUS contains translated texts from
the web that have been converted, aligned and tagged; at the moment it
comprises 30,000,000 words in 60 languages. Developed at the Universities
of Uppsala and Oslo.
- The
Oslo Corpus of Tagged Norwegian Texts. Its user interface is
web-based. A wide variety of search options are available for Bokmål and
Nynorsk in this 20 million word corpus.
- A statistical tagger for
Norwegian. This tagger was developed by the University of Oslo
for Inxight, Grenoble, but since then a number of other types of
statistical taggers have also been tested.
- A corpus
of spoken Norwegian language. This corpus is still at the
planning stage; its realisation depends on funding.
Planned jointly by the Universities of Oslo and Bergen.
University of Southern Denmark
- Arboretum
(Danish treebank). The Arboretum project, launched in 2002, aims at
building a large, multi-format Danish treebank from automatically parsed
and manually revised running text data (10 million words). Analyses are
based on a Constraint Grammar parser (DanGram) with various lexical
resources, and further processed with PSG rules and dependency
transformers. About 100.000 words have been manually revised in both
formats.
- Korpus 90/2000. Search interface
for the treebank's corpus database, in both text and CG-annotated
format. Korpus 90/2000 was compiled by DSL and
annotated by Eckhard Bick (VISL) in 2002.
- Arboretum search
interface for text, form/function and structural searches. With
a different interface, the same data can be accessed as part of VISL's
internet-based grammar teaching initiative.
- DanGram
parser. On-line access to a live running-text CG tagger and syntactic
tree parser, with graphical tree visualisation. The info section
explains the tags and categories used in the Arboretum treebank and
other VISL corpus projects for Danish.
University of Tartu
- Text corpora
(written Estonian)
1) A corpus for different periods (from 1890s to 1990s) marked up to the
sentence level, total 3 million words
2) Morphologically and syntactically annotated and disambiguated corpora
(300 000 and 100 000 words, respectively)
3) Estonian news agencies current text corpus (about 23 million words).
- Corpus of
Spoken Estonian. The corpus is transcribed using the transcription
conventions of conversation analysis (CA). The goal is to collect various
types of spoken language: everyday and institutional conversation,
spontaneous and planned speech, monologues and dialogues, face-to-face
interaction and media texts. 659 texts (403,000 text units).
- The database
of fixed expressions. Currently contains 20,794 records; the majority of
the expressions (17,493) are phrasal verbs.
- Morphological analyzer. A
morphological analyzer and generator for practical applications (hyphenator,
spelling checker) have been developed. They are based on the Concise
Morphological Dictionary of Estonian (46 000 lexical units).
Recall 99%. The morphological tools are usable over the WWW.
- Constraint Grammar shallow
syntactic parser of Estonian. The constraint grammar of Estonian includes
1240 constraints for morphological disambiguation, 180 syntactic mapping
rules and 1118 syntactic constraints. Recall 83-90 %, less than 3,5 %
errors. Several experiments for practical applications have been carried out
on the basis of the parser: noun phrase extractor, automatic summarizer,
grammar checker.
- Annotation tool. A user interface for
syntactic hand-annotation and disambiguation. The goal of the tool is to
avoid typos usually made by hand-annotators. The user can select the correct
tag by a mouse-click.
- Instructions for
syntactic annotation. The instructions include a description of the
syntactic tags used by the CG syntactic parser of Estonian, with examples.
- Syntactically analyzed
and disambiguated text corpus. The corpus is intended as a test corpus
for the Constraint Grammar syntactic parser of Estonian. The goal is to
correctly annotate 200 000 words of written texts (mostly fiction + some
newspapers).
Uppsala University
- Projects
- KOMA: Corpus-based machine translation
- MATS: Methodology and application of a translation system
- Scaniakorpusprojektet: Multilingual document management
- ETAP: Organisation and usage of parallel corpora
- FASTY: Faster typing for disabled persons
- Scarrie: Computer support for Scandinavian proof-reading
- MIA: Multra at work (machine translation)
- PLUG: Parallel corpus project at LiU, UU and GU
- Corpora
- Tools
Växjö University
- Swedish Treebank
The goal of this project is to build a large-scale treebank for Swedish,
containing both spoken and written language data, and annotated with both
constituent structure and dependency structures. The project is a cooperation
between six Swedish universities.
- Stochastic Dependency
Grammars for Natural Language Parsing. The purpose of this project is to study stochastic
dependency grammars from three different perspectives: the formal perspective, the machine
learning perspective, and the language technology perspective. Treebanks play a central
role in the latter two areas.