Corpus Lexicon Reducer


Purpose

Reduces the lexicon of a corpus. Every word in the corpus that is not in the allowed-words list, and/or is in the forbidden-words (stopwords) list, is removed and replaced by the '^' sign. The lexicon of the corpus is thereby reduced to the words in the allowed list (or to everything except the words in the forbidden list) plus the '^' sign. Reducing the lexicon to a given list of words is useful when the corpus is used in the context of WordNet and the position information of the WordNet-only words needs to be kept. Removing stopwords (forbidden words) is useful for some IR tasks. Note that the forbidden list overrules the allowed list. Comparisons are done in lowercase.
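The replacement rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the tool's actual Perl code: the function name reduce_lexicon and its parameters are hypothetical, and the sketch assumes that kept words retain their original casing (the text only states that comparisons are done in lowercase).

```python
def reduce_lexicon(tokens, allowed=None, forbidden=None, placeholder="^"):
    """Replace every token that is forbidden, or not allowed, by the placeholder.

    Hypothetical helper illustrating the documented rule; not part of the tool.
    allowed=None means "no allowed list", so only the forbidden list applies.
    """
    # Comparisons are done in lowercase, so normalise both lists once.
    allowed_lc = {w.lower() for w in allowed} if allowed is not None else None
    forbidden_lc = {w.lower() for w in forbidden} if forbidden else set()

    reduced = []
    for token in tokens:
        key = token.lower()
        if key in forbidden_lc:
            # The forbidden list overrules the allowed list.
            reduced.append(placeholder)
        elif allowed_lc is not None and key not in allowed_lc:
            reduced.append(placeholder)
        else:
            # Assumption: a kept word is written back unchanged.
            reduced.append(token)
    return reduced


words = ["The", "Cat", "sat", "on", "the", "mat"]
print(reduce_lexicon(words, allowed=["cat", "mat", "the"], forbidden=["the"]))
```

Because every removed word leaves a '^' behind, the reduced corpus has the same number of positions as the original, which is what keeps the position information of the remaining words intact.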

The configfile is configcorpuslexiconreducer.pl

Synopsis

./corpuslexiconreducer.pl
[-c <corpus>] [-f <fromsubdir>] | [-fd <fromdir>] [-tc <tocorpus>] [-t <tosubdir>] | [-td <targetdir>] [-o <stopwordsfile>] | [-od <stopwordsdirfile>] [-a <allowedwordsfile>] | [-ad <allowedwordsdirfile>] [-ps <partsubdir>] [-p <part>] | [-pd <partdir>] [-pf <divprtfile> [...]] [-dr] [-? = -h = -help = --help] [-v [<verboselvl>]]

-c
the corpus that needs to be reduced
-f
subdir below the corpus where the compoundified/wordrow version of the corpus can be found (defaults to compoundified if none is specified and if it's not changed in the configfile)
-fd
full path to the wordrow-formatted (compoundified) corpus
-tc
target-corpus. If none is set the source corpus is used
-t
the subdir relative to the corpusdir in which the reduced corpus should be stored. Note that this option is only possible if the corpus is given with the -c or -tc option (not the full path with the -td option)
-td
full path to the place where the reduced corpus should be stored
-o
the stopwordsfile (forbiddenwordsfile) (within taskData/lists/ normally). A default might be set!
-od
full path to the forbiddenwordsfile including the filename
-a
the allowedwordsfile (within taskData/lists/ normally). A default might be set!
-ad
full path to the allowedwordsfile including the filename
-ps
subdir in which the divisions are
-p
the name of a division to use as a part. The file names in all the divprt files in the division-dir are added together and used as the list of files to use, unless you specify one or more divprt files explicitly using the -pf option
-pd
full path to the division-dir to use as a part. This option causes the -p option to be ignored
-pf
one or more divprt files to use as the part instead of all files in the division
-dr
dry-run. Nothing is written or deleted, only reading and reporting is done
-v
the level of verbosity, default verboselevel = 2, available levels: 0,1,2,3
-?
(and equivalents) prints help: the purpose and the synopsis
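How the -p/-pd and -pf options combine into a single list of files can be sketched as follows. This is a hedged illustration only: the function name part_file_names is hypothetical, and the sketch assumes that a divprt file is plain text with one file name per line and that every regular file in the division-dir is a divprt file. Neither assumption is confirmed by this document.

```python
from pathlib import Path


def part_file_names(division_dir, divprt_files=None):
    """Collect the file names that make up a part.

    Hypothetical helper. Assumes each divprt file lists one file name per line.
    """
    division = Path(division_dir)
    if divprt_files:
        # -pf given: use only the explicitly named divprt files.
        sources = [division / name for name in divprt_files]
    else:
        # No -pf: use all divprt files in the division-dir.
        sources = sorted(p for p in division.iterdir() if p.is_file())

    names = []
    for source in sources:
        for line in source.read_text().splitlines():
            line = line.strip()
            if line:  # skip blank lines
                names.append(line)
    return names
```

With this sketch, part_file_names(division_dir) corresponds to passing only -p or -pd, and part_file_names(division_dir, ["some.divprt"]) corresponds to adding -pf, where some.divprt is a made-up file name.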

NOTE:
If no from-dir and to-dir are given, the defaults in the config file are used.

WARNING:
A default allowedwordsfile and forbiddenwordsfile might be set in the config file. If so, the corpus is reduced with those lists even when you pass neither -a nor -o, which might be unwanted.

You can download (or look at the sources of) CorpusLexiconReducer [here]. To run it you will also need [the config file] and the [fiauimenre library]. You can also get the entire tool-package (containing the newest version of all fiauimenre tools and the library) [in one download].
Part of the LogiLogi Network: The LogiLogi Foundation - LogiLogi.org - OgOg.org
This is an old version for archival purposes, see www.LogiLogi.org for the current version.
Document last modified Fri, 29 Jul 2005 04:11:19
All content is available under the GNU Free Documentation License. The LogiLogi-system is under the GPL