In work package 1, a bilingual, sentence-aligned corpus has to be compiled. The corpus shall include the following language pairs:
The tables below describe some characteristics of the current version of the PLUG corpus.
PLUG corpus files
-----------------

all files are encoded in XML using the plugXML.dtd

complete corpus: 2188079 words
    sv<->en:     1169165 words
    sv<->de:      525278 words
    sv<->it:      493636 words

filename         origin
------------------------------------------------------------------
ensvtacc.xml     manual texts for MS Access
ensvtxl.xml      manual texts for MS Excel
sventscan.xml    collection of truck maintenance manuals from Scania
ensvtscan.xml    collection of truck maintenance manuals from Scania
svdetscan.xml    collection of truck maintenance manuals from Scania
svittscan.xml    collection of truck maintenance manuals from Scania
ensvfbell.xml    'Viking P. for the Jewish Publication Society of America'
                 by Saul Bellow
ensvfgord.xml    'A Guest of Honour' by Nadine Gordimer
svitfbio.xml     'En biodlares död' by Lars Gustafsson
svitfkak.xml     'En kakelsättares eftermiddag' by Lars Gustafsson
svenpeu.xml      collection of EU texts (taken from the PEDANT corpus)
svdepeu.xml      collection of EU texts (taken from the PEDANT corpus)
svitpfut.xml     'Future noise policy - European Commission Green Paper'
                 EU text (taken from the PEDANT corpus)
svenprf.xml      declarations from the Swedish government
svdeprf.xml      declarations from the Swedish government

political texts: 410408 words

filename         size in words   size in bytes   languages
------------------------------------------------------------------
svenpeu.xml             186111         1584273   sv->en
svdepeu.xml             180312         1657251   sv->de
svenprf.xml               8011           86180   sv->en
svdeprf.xml               7778           90486   sv->de
svitpfut.xml             28196          255892   sv->it

technical texts: 1353740 words

filename         size in words   size in bytes   languages
------------------------------------------------------------------
ensvtacc.xml            163173         1606582   en->sv
ensvtxl.xml             124961         1330559   en->sv
sventscan.xml           187830         3370698   sv->en
ensvtscan.xml           197459         3508625   en->sv
svdetscan.xml           337188         6052332   sv->de
svittscan.xml           343129         7139947   sv->it

fiction: 423931 words

filename         size in words   size in bytes   languages
------------------------------------------------------------------
ensvfbell.xml           132066         1169471   en->sv
ensvgord.xml            169554         1423741   en->sv
svitfbio.xml             55882          501451   sv->it
svitfkak.xml             66429          629647   sv->it

(all word counts were computed using the following command:

    cat {files} | sed 's/<[^>]*>//g' | wc -w
)
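
The word-count command above reports one total for all files passed to it. To get the per-file figures shown in the tables, the same pipeline could be wrapped in a small script along the following lines. This is only a sketch: the function names strip_tags and count_words are illustrative and not part of the PLUG tools. Note that sed works line by line, so a tag that spans several lines is not removed, and its contents would be counted as words.

```shell
#!/bin/sh
# Sketch only: strip_tags and count_words are hypothetical helper names,
# not part of the PLUG distribution.

# Remove everything between '<' and '>' on each line
# (same sed expression as in the original command).
strip_tags() {
    sed 's/<[^>]*>//g'
}

# Print "<filename> <word count>" for every file given as an argument.
count_words() {
    for f in "$@"; do
        # tr removes the padding that some wc implementations add
        n=$(strip_tags < "$f" | wc -w | tr -d ' ')
        printf '%s %s\n' "$f" "$n"
    done
}
```

With these functions loaded into the shell, calling count_words svenpeu.xml svdepeu.xml would print one line per file, which can then be compared against the "size in words" column.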