Base vocabulary pool
Description
A base vocabulary pool is a ranked list of the most generally used lemmas, their wordforms, adjusted frequency, and contribution, according to a lemmatised and categorised corpus. From this pool, base vocabularies for various application needs can be extracted.
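As a rough illustration of what extracting a base vocabulary from the pool could look like, here is a minimal Python sketch. The record layout (`PoolEntry`), the function name `extract_base_vocabulary`, and the toy entries with their numbers are all assumptions made up for this example; they do not reflect the actual format of the files in the package.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class PoolEntry:
    """One pool entry; this field layout is hypothetical."""
    lemma: str
    category: str              # e.g. a part-of-speech label
    adjusted_frequency: float


def extract_base_vocabulary(
    pool: Iterable[PoolEntry],
    size: Optional[int] = None,
    categories: Optional[Set[str]] = None,
) -> List[PoolEntry]:
    """Return the top-ranked entries, optionally restricted to some categories."""
    selected = [e for e in pool if categories is None or e.category in categories]
    # Rank by adjusted frequency, highest first; ties keep their input order.
    selected.sort(key=lambda e: e.adjusted_frequency, reverse=True)
    return selected if size is None else selected[:size]


# Toy entries with invented numbers, not taken from the package data.
toy_pool = [
    PoolEntry("och", "conjunction", 900.0),
    PoolEntry("vara", "verb", 850.0),
    PoolEntry("hus", "noun", 40.0),
    PoolEntry("bil", "noun", 35.0),
    PoolEntry("springa", "verb", 12.0),
]

nouns = extract_base_vocabulary(toy_pool, size=1, categories={"noun"})
print([e.lemma for e in nouns])
```

Different application needs would then correspond to different choices of size and category filter over the same pool.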
More details on the base vocabulary pool can be found here:
- Eva Forsbom. 2006. A Swedish Base Vocabulary Pool. Presentation at the Swedish Language Technology Conference. Göteborg, October 27-28. (Extended abstract pdf.)
- Eva Forsbom. 2006. Deriving a base vocabulary pool from the Stockholm-Umeå Corpus. Term paper for the NGSLT course Soft Computing; a more detailed version of the above. (pdf.)
The BaseVocabulary package includes a Swedish base vocabulary, based on the Stockholm-Umeå Corpus, and an English base vocabulary, based on the Susanne corpus, together with scripts for creating the base vocabularies and computing various frequency and dispersion measures. The scripts were written solely for the purpose of the paper and have been tested only on Linux (kernel 2.4.22 on Mandrake 9.2, and kernel 2.6.14 on Fedora Core 4). The base vocabularies, however, are plain text files and can be viewed in any editor.
I have used the base vocabulary pool in the following projects:
- Feature extraction for genre classification
- Feature combination for genre classification
- Inducing baseform models from a Swedish vocabulary pool
- Morphological Classification of Swedish Words using Memory-Based Learning
And elsewhere:
- Elena Volodina. 2008. FROM CORPUS TO LANGUAGE CLASSROOM: reusing Stockholm Umeå Corpus in a vocabulary exercise generator SCORVEX. Master's thesis, Language Technology Programme, University of Gothenburg, May. (pdf)
License
The package is licensed under the GNU General Public License.
Download
Download the BaseVocabulary package (a gzipped tar archive). Unpack it with tar -xzf BaseVocabulary.tgz. Follow the instructions in the BaseVocabulary/README file.
Files
- README - information, installation instructions, example runs, etc.
- bin/ - scripts
  - adjusted_frequency.pl - Perl script for computing Adjusted Frequency
  - basevoc_suc.sh - Shell script for deriving base vocabulary from SUC
  - basevoc_susanne.sh - Shell script for deriving base vocabulary from Susanne
  - conflate_suc.map - Conflation map for SUC (used by normalise_and_count.pl)
  - conflate_susanne.map - Conflation map for Susanne (used by normalise_and_count.pl)
  - contribution.pl - Perl script for computing Contribution
  - correct_suc.pl - Perl script for correcting "anomalies" in SUC annotation
  - correct_susanne.pl - Perl script for correcting "anomalies" in Susanne annotation
  - dispersion_and_fmod.pl - Perl script for computing Dispersion and Modified Frequency
  - merge_wordforms.pl - Perl script for merging wordform info into lemma base vocabulary
  - normalise_and_count.pl - Perl script for normalising, disambiguating, and counting
  - susanne2data.pl - Perl script for extracting token info from Susanne file
  - xces2r.xsl - XSLT template for extracting token info from XCES file (SUC)
- data/ - base vocabulary pools
  - SUC_basevoc - A Swedish base vocabulary based on Stockholm-Umeå Corpus 2
  - Susanne_basevoc - An English base vocabulary based on Susanne (R5) corpus
- doc/ - documentation for scripts
- gpl.txt - GNU General Public License
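The measures named in the script list (adjusted frequency, dispersion, modified frequency) can be illustrated with a short Python sketch. This page does not state which formulas the Perl scripts implement, so the sketch assumes commonly cited definitions: Rosengren's adjusted frequency, Juilland's dispersion D, and Juilland's usage coefficient U = D × f as the modified frequency. The function names are made up for this example and are not the package's interface; consult the papers and doc/ for the definitions actually used.

```python
import math
from typing import Sequence


def adjusted_frequency(freqs: Sequence[float], sizes: Sequence[float]) -> float:
    """Assumed definition (Rosengren): (sum_i sqrt(d_i * f_i)) ** 2,
    where f_i is the frequency in corpus part i and d_i is that
    part's share of the whole corpus."""
    total_size = sum(sizes)
    return sum(math.sqrt(s / total_size * f) for f, s in zip(freqs, sizes)) ** 2


def juilland_dispersion(freqs: Sequence[float]) -> float:
    """Assumed definition (Juilland): D = 1 - V / sqrt(n - 1) for n equally
    sized parts, where V is the coefficient of variation of the frequencies."""
    n = len(freqs)
    if n < 2:
        return 0.0
    mean = sum(freqs) / n
    if mean == 0:
        return 0.0
    sd = math.sqrt(sum((f - mean) ** 2 for f in freqs) / n)
    return 1.0 - (sd / mean) / math.sqrt(n - 1)


def modified_frequency(freqs: Sequence[float]) -> float:
    """Assumed definition (Juilland's U): dispersion times total frequency."""
    return juilland_dispersion(freqs) * sum(freqs)


# Two toy words with the same total frequency (20) over four equal parts.
evenly_spread = [5, 5, 5, 5]
clumped = [20, 0, 0, 0]
part_sizes = [100, 100, 100, 100]

for name, freqs in [("evenly spread", evenly_spread), ("clumped", clumped)]:
    print(name,
          round(adjusted_frequency(freqs, part_sizes), 3),
          round(juilland_dispersion(freqs), 3),
          round(modified_frequency(freqs), 3))
```

Under these definitions, a word spread evenly over the corpus parts keeps its full frequency, while a word concentrated in a single part is scaled down sharply, which is what makes such measures useful for ranking generally used lemmas.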
Requirements
The base vocabularies are plain text files and can be viewed in any editor.
The scripts, written in Perl (5.005) and XSLT, were all developed in a Linux environment using standard modules. They are probably portable to other environments (sorry, I have no way of testing), except for the shell scripts basevoc_suc.sh and basevoc_susanne.sh, which serve as glue scripts that run the other scripts in batch. (Use them as examples rather than as turnkey scripts.)
The SUC corpus* can be obtained, subject to a license, from http://www.ling.su.se/dali/suc/suc2.0_info.html. The original corpus files can be converted from SGML format to valid XML format with parole2xml.pl.
* Stockholm-Umeå Corpus, version 2, 2002, Stockholm University, Department of Linguistics and Umeå University, Department of Linguistics.
The Susanne (R5) corpus can be downloaded from http://www.grsampson.net/RSue.html. Its annotation scheme and corpus compilation (excerpts from the Brown corpus) are described in the following book: Geoffrey Sampson. 1995. English for the Computer: The SUSANNE Corpus and analytic scheme. Clarendon Press, Oxford. ISBN 0-19-824023-6.
Version history
- xces2r.xsl refactored and URL to SUC updated 2009-05-10
- Package created 2006-08-04
- (First script created 2005-01-10)