English-Swedish-Turkish Parallel Corpus
The Swedish Research Council and the Faculty of Languages at Uppsala University
We present the English-Swedish-Turkish parallel corpus, together with the
automatic procedure and the tools that we use to build the
corpus efficiently. The method presented below can be transferred to
the building of other parallel corpora.
We build the corpus automatically by using a basic language resource kit (BLARK) for Swedish and Turkish, and tools for the automatic alignment and correction of data. We choose tools that are user-friendly, understandable and easy to learn for people with limited computer skills, thereby allowing researchers and students to use, align and correct the corpus data by themselves.
Corpus Annotation Procedure
The text is processed automatically with tools that make annotation, alignment and manual correction easy and straightforward. The following steps give an overview of the annotation procedure and the tools involved.
1. Preprocessing: cleaning up the original files, partly manually
For example, rtf, doc and pdf documents are converted to plain text files. When the original is a pdf file, we scan and proofread the material, and correct it where necessary.
The texts are encoded by using UTF-8 (Unicode). The plain text files are then processed by using various tools presented below.
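The re-encoding step can be sketched as follows. This is a minimal illustration, not the conversion tool used in the project: the function names and the source encodings (ISO-8859-9 for Turkish, Latin-1 for Swedish) are assumptions, since the encodings of the original files are not stated here.

```python
from pathlib import Path


def reencode(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and return UTF-8 bytes."""
    text = data.decode(source_encoding)
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")


def convert_file(src: Path, dst: Path, source_encoding: str) -> None:
    """Write a UTF-8 copy of the file src to dst (hypothetical helper)."""
    dst.write_bytes(reencode(src.read_bytes(), source_encoding))
```

The source encoding has to be known or guessed per file; decoding with the wrong one silently corrupts characters such as ı, ş, å and ö, which is one reason the proofreading step matters.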
2. Corpus Markup, linguistic analysis, and alignment
We use a graphical interface, UplugConnector, based on the Uplug Toolkit (Tiedemann, 2003), which is a collection of tools for processing corpus data. Uplug can be used for sentence splitting, tokenization, tagging and parsing with external taggers and parsers, and for paragraph, sentence and word alignment.
The clean plain text files are processed to mark up the data, to annotate it with morpho-syntactic features, and to align the text on the paragraph, sentence and word level.
The figure below illustrates the UplugConnector Interface.
The user can optionally give the location of the source and target files, decide where the output should be saved, and specify the encoding for the input and output files. For the markup, basic structural markup, sentence segmentation, and tokenization are available. In the toolkit, the user can also call for sentence and word aligners, and their visualization tools.
Further, the UplugConnector GUI has been constructed so that calls to new scripts outside Uplug can be included for complementary analysis when such needs arise. The user can easily access another resource if the available ones do not fit his/her needs, for example an external tokenizer, sentence splitter, tagger, or parser.
Each part of the corpus is clearly marked and annotated. We use the XML Corpus Encoding Standard (XCES) for the annotation format. The plain text files are processed by various tools. The sentence splitter is used to break the text into sentences, and the texts are tokenized.
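To make the structural markup concrete, the sketch below splits a text into sentences, tokenizes them, and wraps the result in XML. It is an illustration under stated assumptions: the regular expressions are deliberately naive, and the element names (`p`, `s`, `w`) and the id scheme are simplified stand-ins, not the exact XCES output of Uplug.

```python
import re
import xml.etree.ElementTree as ET

# Naive rules (assumptions): a sentence ends at ., ! or ? followed by whitespace;
# a token is a run of word characters or a single punctuation mark.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
TOKEN = re.compile(r"\w+|[^\w\s]")


def split_sentences(text):
    """Return the sentences of text as a list of strings."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence):
    """Return the word and punctuation tokens of one sentence."""
    return TOKEN.findall(sentence)


def to_xml(text):
    """Wrap sentences and tokens in <s> and <w> elements with ids."""
    root = ET.Element("p", id="p1")
    for i, sentence in enumerate(split_sentences(text), start=1):
        s = ET.SubElement(root, "s", id=f"s{i}")
        for j, token in enumerate(tokenize(sentence), start=1):
            w = ET.SubElement(s, "w", id=f"w{i}.{j}")
            w.text = token
    return ET.tostring(root, encoding="unicode")
```

A real sentence splitter also has to handle abbreviations, numbers and quotations, which is why the toolkit allows an external splitter or tokenizer to be plugged in.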
Once the sentences and tokens are identified and structurally marked, they are linguistically analyzed. First, we annotate the words morphologically. For Swedish, we use the Trigrams'n'Tags (TnT) tagger (Brants, 2000) trained on Swedish (Megyesi, 2002). The Turkish material is morphologically analyzed and disambiguated using a Turkish analyzer developed by Kemal Oflazer (1994) and a disambiguator developed by Deniz Yüret and Ferhan Türe (2006). Then, the part-of-speech tagged texts are sent to the MaltParser dependency parser (Nivre and Hall, 2005) trained for Swedish and Turkish.
Swedish structural markup, tagged and parsed in xml format
Turkish structural markup, tagged and parsed in xml format
3. Sentence alignment, its visualization and correction with ISA
We use standard techniques for the establishment of links between source and target language segments. Paragraphs and sentences are aligned by using the length-based approach developed by Gale and Church (1993). The aligned sentences are stored in XML format. As the XML representation of the linking is not user friendly, and the automatic alignment has to be corrected manually, we use the graphical interface ISA developed by Jörg Tiedemann (2003), as illustrated below.
Swedish Turkish sentence aligned text in xml and html
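The length-based approach can be illustrated with a small dynamic program over sentence lengths in characters. This is a simplified sketch, not the aligner shipped with Uplug: the prior probabilities and the variance constant are the values we recall from Gale and Church (1993) and should be treated as assumptions, and the variance term uses the mean of the two lengths so that an empty side does not cause a division by zero.

```python
import math

# Bead types (source sentences, target sentences) with prior probabilities.
# Values recalled from Gale and Church (1993); treat them as assumptions.
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C, S2 = 1.0, 6.8  # expected length ratio and variance per character


def match_cost(len_src, len_tgt, prior):
    """Negative log probability of pairing two segments of these lengths."""
    mean = (len_src + len_tgt / C) / 2.0
    delta = 0.0 if mean == 0 else (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-tailed probability of a deviation at least this large.
    p_delta = max(math.erfc(abs(delta) / math.sqrt(2.0)), 1e-300)
    return -math.log(prior) - math.log(p_delta)


def align(src_lens, tgt_lens):
    """Return beads as (source indices, target indices) with minimal total cost."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = match_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]), prior)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the bead sequence.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

For two lists of sentences `src` and `tgt`, the call would be `align([len(s) for s in src], [len(t) for t in tgt])`. Because only lengths are used, the method makes mistakes where translations are unusually free, which is why the output is corrected manually in ISA.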
4. Visualization of the sentence pairs with morphological analysis
To display the sentence output from ISA after manual correction of the alignment, together with the morphological analysis, a script called LinkViz was developed, which uses the structural XML parser Hpricot (2006). It takes as input the tagged XML files together with the sentence alignment results produced by ISA and generates an HTML file that displays the aligned sentences together with the linguistic information for each word, shown in pop-up windows.
The visualization tool makes it easier for students and researchers to study the morphological annotation of the words and the structures chosen in translation than the structurally marked-up version of the corpus does.
Swedish Turkish sentence aligned tagged and parsed text in xml
Swedish Turkish sentence aligned tagged and parsed in html
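The idea behind this kind of visualization can be sketched in a few lines. The code below is not LinkViz, which is a Ruby script built on Hpricot; it is a Python illustration that assumes the aligned, tagged sentences have already been read into lists of (word form, analysis) pairs, and it uses the HTML `title` attribute as a simple stand-in for the pop-up windows. The function names are hypothetical.

```python
from html import escape


def token_span(form, analysis):
    """Render one word so that its analysis appears when the mouse hovers over it."""
    return f'<span title="{escape(analysis)}">{escape(form)}</span>'


def render_pairs(pairs):
    """Render aligned sentence pairs as a two-column HTML table.

    pairs is a list of (source_tokens, target_tokens), where each token
    is a (form, analysis) tuple.
    """
    rows = []
    for source, target in pairs:
        source_html = " ".join(token_span(f, a) for f, a in source)
        target_html = " ".join(token_span(f, a) for f, a in target)
        rows.append(f"<tr><td>{source_html}</td><td>{target_html}</td></tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

Escaping both the word form and the analysis matters here, since morphological tags may contain characters such as `<`, `&` or quotation marks that would otherwise break the generated page.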
5. Word alignment and its visualization
As the next step, words and phrases are aligned using the clue alignment approach (Tiedemann, 2003), and the toolbox for statistical machine translation GIZA++ (Och and Ney, 2003), also implemented in Uplug.
To visualize the word alignment in a simple way, a new script for HTML-visualization of the word alignment results was included in the UplugConnector. This takes as input the text file with word link information produced by Uplug and shows the word-pair frequencies. This visualization serves as a bilingual lexicon created from the source and target language data.
Swedish Turkish word aligned text in html
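Turning word links into a frequency-sorted bilingual word list can be sketched as below. The input format, a list of (source word, target word) tuples, is an assumption for illustration; the actual link file produced by Uplug has its own format and would need to be parsed first.

```python
from collections import Counter


def pair_frequencies(links):
    """Count how often each (source word, target word) link occurs."""
    return Counter(links)


def lexicon(links, min_count=1):
    """Return (source, target, count) triples, most frequent first."""
    freq = pair_frequencies(links)
    entries = [(s, t, n) for (s, t), n in freq.items() if n >= min_count]
    # Sort by descending count, then alphabetically for a stable listing.
    return sorted(entries, key=lambda e: (-e[2], e[0], e[1]))
```

Word forms are counted as they are, without lowercasing, because generic case folding does not handle the Turkish dotted and dotless i correctly. Raising `min_count` filters out one-off links, which are more likely to be alignment errors.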
6. Search tool for word forms
A subset of the corpus bitexts is stored in a database format together with the sentence linkage information. This facilitates online lookup of word forms, yielding matching sentence pairs. A search can be specified for either language and for a full or partial word match.
The search facility is useful for students when exploring specific translations and word variant occurrences.
The illustrative texts above come from Orhan Pamuk's book "The White Castle" and Jostein Gaarder's novel "Sophie's World".
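A lookup of this kind can be sketched with an in-memory SQLite database. The table layout, the column names `sv` and `tr`, and the function names are assumptions made for the example, not the schema of the actual search tool; matching is case-sensitive to keep the sketch short.

```python
import re
import sqlite3


def build_db(pairs):
    """Store aligned (Swedish, Turkish) sentence pairs in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bitext (id INTEGER PRIMARY KEY, sv TEXT, tr TEXT)")
    conn.executemany("INSERT INTO bitext (sv, tr) VALUES (?, ?)", pairs)
    return conn


def search(conn, word, lang="sv", partial=False):
    """Return sentence pairs whose lang side contains word.

    With partial=False the word must match a whole token; with
    partial=True any substring match counts.
    """
    if lang not in ("sv", "tr"):  # whitelist, since the column name is interpolated
        raise ValueError("lang must be 'sv' or 'tr'")
    rows = conn.execute(
        f"SELECT sv, tr FROM bitext WHERE instr({lang}, ?) > 0 ORDER BY id",
        (word,),
    ).fetchall()
    if partial:
        return rows
    column = 0 if lang == "sv" else 1
    return [row for row in rows if word in re.findall(r"\w+", row[column])]
```

A partial search for a stem, for example `search(conn, "kitap", lang="tr", partial=True)`, is a simple way to collect inflected variants of a Turkish word together with their Swedish translations.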
References
Thorsten Brants. 2000. TnT − A Statistical Part-of-Speech Tagger. In Proceedings of the 6th Applied Natural Language Processing Conference. Seattle, USA.
Gülsen Eryigit, Joakim Nivre, and Kemal Oflazer. 2006. The incremental use of morphological information and lexicalization in data-driven dependency parsing. In Proceedings of the 21st International Conference on the Computer Processing of Oriental Languages, 498-507.
William A. Gale, and Kenneth W. Church. 1993. A Program for Aligning Sentences in Bilingual Corpora. Computational Linguistics, 19(1), 75-102.
Hpricot. A Fast, Enjoyable HTML and XML Parser for Ruby http://code.whytheluckystiff.net/hpricot/ 2006.
Beata Megyesi. 2002. Data-Driven Syntactic Analysis − Methods and Applications for Swedish. PhD Thesis. Kungliga Tekniska Högskolan. Sweden.
Joakim Nivre and Johan Hall. 2005. MaltParser: A Language-Independent System for Data-Driven Dependency Parsing. In Proceedings of the Fourth Workshop on Treebanks and Linguistic Theories, Barcelona, 9-10 December 2005, 137-148.
Joakim Nivre, Johan Hall, Jens Nilsson, Gülsen Eryigit, and Svetoslav Marinov. 2006. Labeled Pseudo-Projective Dependency Parsing with Support Vector Machines. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL), 221-225.
Kemal Oflazer. 1994. Two-level Description of Turkish Morphology. Literary and Linguistic Computing, 9(2).
Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1), 19-51.
Jörg Tiedemann. 2003. Recycling Translations − Extraction of Lexical Data from Parallel Corpora and their Applications in Natural Language Processing. PhD Thesis. Uppsala University.
Jörg Tiedemann. 2004. Word to word alignment strategies. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004). Geneva, Switzerland, August 23-27.
Jörg Tiedemann. 2005. Optimisation of Word Alignment Clues. In Journal of Natural Language Engineering, Special Issue on Parallel Texts, Rada Mihalcea and Michel Simard, Cambridge University Press.
Jörg Tiedemann. 2006. ISA & ICA – Two Web Interfaces for Interactive Alignment of Bitext. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06). Genoa, Italy.
Deniz Yüret and Ferhan Türe. 2006. Learning morphological disambiguation rules for Turkish. In Proceedings of HLT NAACL'06, pages 328-334, New York, NY.
Publications
Megyesi, B. and Dahlqvist, B. 2007. The Swedish-Turkish Parallel Corpus and Tools for its Creation. To appear in Proceedings of NoDaLida 2007. May 24-26 2007, Tartu, Estonia.
Megyesi, B. B., Sågvall Hein, A., and Csató Johansson, É. 2006. Building a Swedish-Turkish Parallel Corpus. In Proceedings of Language Resources and Evaluation Conference. May 22-28, 2006, Genoa, Italy.