Corpus Tagset Reducer
- Purpose
Reduces the tagset of a corpus. All tags in the tagrow files of a corpus that are in a list of forbidden tags and/or are not in a list of allowed tags are removed, optionally together with the accompanying word in the wordrow files. We used this tool to remove pauses from our corpus for the fiauimenre research.
The configfile is configcorpustagsetreducer.pl
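The reduction described above can be illustrated with a minimal Python sketch. It is not the tool's actual code (the tool is a Perl script). It assumes that one tagrow line and the matching wordrow line hold the same number of items and are aligned by position; that file layout, the function name reduce_rows, and the example tag names are all hypothetical.

```python
from typing import Iterable, List, Optional, Set, Tuple


def reduce_rows(
    tags: List[str],
    words: List[str],
    allowed: Optional[Iterable[str]] = None,
    forbidden: Optional[Iterable[str]] = None,
    remove_words: bool = True,
) -> Tuple[List[str], List[str]]:
    """Drop tags that are forbidden and/or not allowed (hypothetical helper).

    Assumption: tags[i] belongs to words[i], i.e. the tagrow and wordrow
    lines are aligned by position.
    """
    if len(tags) != len(words):
        raise ValueError("tagrow and wordrow lines must have the same length")

    allowed_set: Optional[Set[str]] = set(allowed) if allowed is not None else None
    forbidden_set: Set[str] = set(forbidden) if forbidden is not None else set()

    def keep(tag: str) -> bool:
        # A tag survives only if it is not forbidden and, when an allowed
        # list is given, it appears in that list.
        if tag in forbidden_set:
            return False
        if allowed_set is not None and tag not in allowed_set:
            return False
        return True

    kept_tags = [t for t in tags if keep(t)]
    if remove_words:
        # Default behaviour: the accompanying words are removed as well.
        kept_words = [w for t, w in zip(tags, words) if keep(t)]
    else:
        # Corresponds to the idea behind -nw: leave the words untouched.
        kept_words = list(words)
    return kept_tags, kept_words


# Example with made-up tags: remove pauses from one line.
example_tags = ["DET", "NOUN", "PAUSE", "VERB"]
example_words = ["the", "dog", "<pause>", "barks"]
print(reduce_rows(example_tags, example_words, forbidden={"PAUSE"}))
```

Note that keeping the words while removing tags (remove_words=False) leaves the two rows with different lengths, so any later step that relies on positional alignment would need to account for that.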
- Synopsis
./corpustagsetreducer.pl
- [-c <corpus>] [-f <fromsubdir>] | [-fd <fromdir>] [-w <wordrowsubdir>] | [-wd <wordrowdir>] | [-nw] [-tc <tocorpus>] [-t <tosubdir>] | [-td <targetdir>] [-o <stopwordsfile>] | [-od <stopwordsdirfile>] [-a <allowedtagsfile>] | [-ad <allowedtagsdirfile>] [-ps <partsubdir>] [-p <part>] | [-pd <partdir>] [-pf <divprtfile> [...]] [-dr] [-? = -h = -help = --help] [-v [<verboselvl>]]
- -c
- the corpus whose tagset needs to be reduced
- -f
- subdir below corpus where the tagrow version of the corpus can be found (defaults to tagrow if none is specified and if it's not changed in the configfile)
- -fd
- full path to the tagrow-formatted corpus
- -w
- subdir below corpus where the wordrow version of the corpus can be found (defaults to wordrow if none is specified and if it's not changed in the configfile)
- -wd
- full path to the wordrow-formatted corpus
- -nw
- don't remove the accompanying words for tags from the wordrow files
- -tc
- the target corpus. If none is set, the source corpus is used
- -t
- the subdir relative to the corpusdir in which the reduced corpus tagrow files should be stored. Note that this option is only possible if the corpus is given with the -c or -tc option (not the full path with the -td option)
- -td
- full path to the place where the reduced tagrow files should be stored
- -r
- the subdir relative to the corpusdir in which the reduced wordrow files should be stored. Note that this option is only possible if the corpus is given with the -c or -tc option (not the full path with the -td option)
- -rd
- full path to the place where the reduced wordrow files should be stored
- -o
- the forbidden tags file (within taskData/lists/ normally). Note that a default might be set!
- -od
- full path to the forbiddentagsfile including the filename
- -a
- the allowedtagsfile (within taskData/lists/ normally). A default might be set!
- -ad
- full path to the allowedtagsfile including the filename
- -ps
- subdir in which the divisions are
- -p
- the name of a division to use as a part. The file names listed in all the divprt files in the division-dir are combined and used as the list of files to process, unless you specify one or more divprt files explicitly using the -pf option
- -pd
- full path to the division-dir to use as a part. This option causes the -p option to be ignored
- -pf
- one or more divprt files to use as the part instead of all files in the division
- -dr
- dry-run. Nothing is written or deleted, only reading and reporting is done
- -v
- the level of verbosity (default: 2; available levels: 0, 1, 2, 3)
- -?
- (and equivalents) prints help: the purpose and the synopsis
NOTE:
- If no from- and to-dirs are given, the defaults in the config file are used
WARNING:
- A default allowedtagsfile and forbiddentagsfile might be set in the config file. If so, the tagset is reduced according to those defaults when you do not specify your own files, which might be unwanted
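Because defaults from the config file may apply, a dry run (-dr) is a cautious way to see what would be reduced before anything is written. The Python sketch below only assembles such a command line from the options documented above; the helper name build_command, the corpus name mycorpus, and the file name pauses.txt are hypothetical, and the sketch does not run the tool.

```python
import shlex
from typing import List, Optional


def build_command(
    corpus: str,
    allowed_tags_file: Optional[str] = None,
    forbidden_tags_file: Optional[str] = None,
    keep_words: bool = False,
    dry_run: bool = True,
    verbosity: int = 2,
) -> List[str]:
    """Assemble an argument list for corpustagsetreducer.pl (hypothetical helper)."""
    cmd = ["./corpustagsetreducer.pl", "-c", corpus]
    if allowed_tags_file is not None:
        cmd += ["-a", allowed_tags_file]    # allowedtagsfile
    if forbidden_tags_file is not None:
        cmd += ["-o", forbidden_tags_file]  # forbidden tags file
    if keep_words:
        cmd.append("-nw")                   # do not remove the accompanying words
    if dry_run:
        cmd.append("-dr")                   # only read and report
    cmd += ["-v", str(verbosity)]           # documented levels: 0, 1, 2, 3
    return cmd


# Example: a dry run that would remove the tags listed in a made-up file.
print(shlex.join(build_command("mycorpus", forbidden_tags_file="pauses.txt")))
# prints: ./corpustagsetreducer.pl -c mycorpus -o pauses.txt -dr -v 2
```

Once the dry-run report looks right, the same argument list without -dr (dry_run=False) would be the one to run for real.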
You can download (or look at the sources of) CorpusTagsetReducer [here]. To run it you will also need [the config file] and the [fiauimenre library]. You can also get the entire tool package (containing the newest version of all fiauimenre tools and the library) [in one download].