PLUG - Corpus: A bilingually aligned text corpus


In work package 1, a bilingual, sentence-aligned corpus is to be compiled. The corpus shall include the following language pairs: Swedish-English (sv<->en), Swedish-German (sv<->de), and Swedish-Italian (sv<->it).

The corpus shall contain texts from 3 different genres: political texts, technical texts, and fiction. Corpus data will be taken from different sources that are available at the 3 departments taking part in the PLUG project.

An XML data definition (plugXML.dtd) was chosen to encode the corpus.

A search interface to the corpus is available with restricted access. An unrestricted demo version is also available.

The tables below describe some characteristics of the current version of the PLUG corpus.


PLUG corpus files
-----------------



All files are encoded in XML using plugXML.dtd.

complete corpus:	2188079 words
sv<->en:		1169165 words
sv<->de:		 525278 words
sv<->it:		 493636 words


filename		origin
------------------------------------------------------------------
ensvtacc.xml		manual texts for MS Access
ensvtxl.xml		manual texts for MS Excel
sventscan.xml		collection of truck maintenance manuals from Scania
ensvtscan.xml		collection of truck maintenance manuals from Scania
svdetscan.xml		collection of truck maintenance manuals from Scania
svittscan.xml		collection of truck maintenance manuals from Scania

ensvfbell.xml		'Viking P. for the Jewish Publication Society
			of America' by Saul Bellow
ensvfgord.xml		'A Guest of Honour' by Nadine Gordimer
svitfbio.xml		'En biodlares död' by Lars Gustafsson
svitfkak.xml		'En kakelsättares eftermiddag' by Lars Gustafsson

svenpeu.xml		collection of EU texts (taken from the PEDANT corpus)
svdepeu.xml		collection of EU texts (taken from the PEDANT corpus)
svitpfut.xml		'Future noise policy - European Commission Green Paper'
				EU text (taken from the PEDANT corpus)
svenprf.xml		declarations from the Swedish government
svdeprf.xml		declarations from the Swedish government



political texts:	410408 words

filename		words		bytes		languages
------------------------------------------------------------------
svenpeu.xml		186111		1584273		sv->en
svdepeu.xml		180312		1657251		sv->de
svenprf.xml		  8011		  86180		sv->en
svdeprf.xml		  7778		  90486		sv->de
svitpfut.xml		 28196		 255892		sv->it


technical texts:	1353740 words

filename		words		bytes		languages
------------------------------------------------------------------
ensvtacc.xml		163173		1606582		en->sv
ensvtxl.xml		124961		1330559		en->sv
sventscan.xml		187830		3370698		sv->en
ensvtscan.xml		197459		3508625		en->sv
svdetscan.xml		337188		6052332		sv->de
svittscan.xml		343129		7139947		sv->it


fiction:		423931 words

filename		words		bytes		languages
------------------------------------------------------------------
ensvfbell.xml		132066		1169471		en->sv
ensvfgord.xml		169554		1423741		en->sv
svitfbio.xml		 55882		 501451		sv->it
svitfkak.xml		 66429		 629647		sv->it
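
As a consistency check, the per-file word counts in the three tables can be summed and compared with the genre totals, the language-pair totals, and the complete corpus size. The snippet below is only a sketch of that arithmetic. All numbers are copied from the tables, and the variable names are chosen here for illustration.

```shell
#!/bin/sh
# Per-file word counts copied from the genre tables.
political=$((186111 + 180312 + 8011 + 7778 + 28196))
technical=$((163173 + 124961 + 187830 + 197459 + 337188 + 343129))
fiction=$((132066 + 169554 + 55882 + 66429))
total=$((political + technical + fiction))

# The same files regrouped by language pair (either translation direction).
sv_en=$((186111 + 8011 + 163173 + 124961 + 187830 + 197459 + 132066 + 169554))
sv_de=$((180312 + 7778 + 337188))
sv_it=$((28196 + 343129 + 55882 + 66429))

echo "political: $political"
echo "technical: $technical"
echo "fiction:   $fiction"
echo "total:     $total"
echo "sv<->en:   $sv_en"
echo "sv<->de:   $sv_de"
echo "sv<->it:   $sv_it"
```

The sums come out to the figures stated on this page: 410408, 1353740 and 423931 words per genre, 1169165, 525278 and 493636 words per language pair, and 2188079 words overall.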



All word counts were computed using the following command, which strips XML tags before counting words:
cat {files} | sed 's/<[^>]*>//g' | wc -w
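
To see what the counting command does, it can be tried on a small sample. The following is a minimal sketch: the function name count_words and the sample file are made up for illustration, and the element names in the sample are assumptions, not taken from plugXML.dtd.

```shell
#!/bin/sh
# Hypothetical wrapper around the command quoted above:
# delete everything that looks like an XML tag, then count the remaining words.
count_words() {
    cat "$@" | sed 's/<[^>]*>//g' | wc -w
}

# Made-up two-sentence sample; the markup is illustrative only.
sample=$(mktemp)
printf '%s\n' \
    '<s id="1">This sentence has five words.</s>' \
    '<s id="2">Den har fyra ord.</s>' > "$sample"

# Some wc implementations pad the number with spaces, so strip them.
words=$(count_words "$sample" | tr -d ' ')
echo "$words"

rm -f "$sample"
```

Because sed processes its input line by line, the pattern only removes tags that open and close on the same line. A tag broken across two lines would leave its contents in the word count.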



last update: 06/18/1998
comments to joerg AT stp.ling.u.se