Corpus Divider
- Purpose
Takes one or more dimensions (and optionally ranges) as its arguments and creates lists of file names for each category within those dimensions.
CorpusDivider is used for supplying the groups (lists of files) between which the permutations are calculated later on.
The config file is configcorpusdivider.pl.
- Synopsis
./corpusdivider.pl
- [-c <corpus>] [-f <fromsubdir>] | [-fd <fromdir>] [-d <nameofdivtocreate>] [-tc <tocorpus>] [-t <tosubdir>] | [-td <targetdir>] [-l <libshortname>] | [-lf <libfullname>] [-ll] [-la <extralibargs>] [-i <dimensionname> [ '[' ] [ [<|>=]<census>[@cat1] ] [<anothercensus> [...]] [ ']' ][@cat2]] [-i <anotherdimnm> [<census>]]...] [-s [<yesdivs>]] [-ps <partsubdir>] [-p <part>] | [-pd <partdir>] [-pf <divprtfile> [...]] [-dr] [-? = -h = -help = --help] [-v [<verboselvl>]]
- -c
- the corpus to work on
- -f
- the subdir in which the metadata can be found
- -fd
- full path to the metadata to work on (conflicts with -c and -f)
- -d
- the name of the division that should be created (required, even if -td is set)
- -tc
- target-corpus where the divprt file should be created. If none is set the source corpus is used
- -t
- the subdir in which (within a subdir with the name of the division-dir) the divprt files should be stored
- -td
- full path to the place where the division should be stored (conflicts with -t)
- -l
- the short name of the meta-reader-library to use (conflicts with -lf)
- -lf
- the meta-reader-library to use (conflicts with -l)
- -ll
- list the installed meta-reader-libraries (exits the program immediately)
- -la
- extra args to hand over to the meta-reader-lib
- -i
- a dimension to divide the corpus along. Can be specified multiple times. One or more censi can also be specified, which are used to categorize the values along the dimension. If the values within the dimension "mark" are 1..10, and it is the only dimension the corpus is separated along, then the corpus will be separated into ten categories (if each mark is present in the meta-data). If the censi 0 and 5.5 are specified, then the corpus will be divided into two categories (0 up to but not including 5.5, and >= 5.5). It is also possible to cut off a part of the corpus using '<' and '>='. Categories can also be made by grouping censi using brackets ([census census]) or @-labels (census@cat1 or [census census]@cat2). If one of the censi in a dimension is not numerical (contains non-digits), values on that dimension that are not given as censi are ignored
- -s
- shows the number of files in each of the dimensions and censi specified with -i. Shows all dimensions and the number of files in them if no dimensions are specified. Shows the number of files in each combination if -s 1 is given. Note that nothing is saved when this option is given
- -ps
- subdir in which the divisions are
- -p
- the name of a previous division to use as a part. The file-names in all the divprt-files in the division-dir are added and used as the list of files to use unless you specify one or more explicitly using the -pf option
- -pd
- full path to the division-dir to use as a part. This option causes the -p option to be ignored
- -pf
- one or more divprt files to use as the part instead of all files in the division
- -dr
- dry-run. Nothing is written or deleted, only reading and reporting is done
- -v
- the level of verbosity, default verboselevel = 1, available levels: 0,1,2,3
- -?
- (and equivalents) prints help: the purpose and the synopsis
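The census behaviour described under -i can be sketched as follows. This is only an illustration in Python, not the code of corpusdivider.pl: the function names, the file names and the dictionary of marks are hypothetical, and the handling of values below the lowest census (they are left out here) is an assumption that the description above does not settle.

```python
from bisect import bisect_right
from collections import defaultdict


def categorize(value, censi):
    """Return the census that opens the category `value` falls in, or None."""
    bounds = sorted(censi)
    idx = bisect_right(bounds, value)
    if idx == 0:
        # Assumption: values below the lowest census are left out.
        return None
    return bounds[idx - 1]


def divide(marks_by_file, censi=None):
    """Group file names per category along one numerical dimension.

    Without censi every distinct value is its own category; with censi
    each census is the inclusive lower bound of a category.
    """
    groups = defaultdict(list)
    for name, mark in sorted(marks_by_file.items()):
        key = mark if not censi else categorize(mark, censi)
        if key is not None:
            groups[key].append(name)
    return dict(groups)


# Hypothetical meta-data: ten files with the marks 1..10.
marks = {f"essay{m:02d}.txt": m for m in range(1, 11)}

per_value = divide(marks)               # one category per mark
two_way = divide(marks, censi=[0, 5.5])  # 0 up to 5.5, and >= 5.5
```

With these hypothetical marks, `per_value` holds ten lists of one file each, while `two_way` holds two lists keyed by the censi 0 and 5.5.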
NOTE:
- If no meta-dir or to-file is given, the corresponding default from the config file is used
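The option conflicts listed above and the fallback to defaults can be summarised in a small sketch. It is written in Python purely for illustration and is not taken from corpusdivider.pl: options are represented as a dictionary keyed by flag name without the dash, the defaults are assumed to use the same keys, and all function names are hypothetical.

```python
# Conflicting pairs as stated in the option descriptions.
CONFLICTS = [("fd", "c"), ("fd", "f"), ("td", "t"), ("l", "lf")]

SOURCE_OPTS = ("c", "f", "fd")    # where the meta-data is read from
TARGET_OPTS = ("tc", "t", "td")   # where the division is written to


def check_options(opts):
    """Return a list of problems found in a dict of option name -> value."""
    problems = []
    if "d" not in opts:
        problems.append("-d is required")
    for a, b in CONFLICTS:
        if a in opts and b in opts:
            problems.append(f"-{a} conflicts with -{b}")
    return problems


def effective_part(opts):
    """-pd takes precedence: when it is set, -p is ignored."""
    if "pd" in opts:
        return ("pd", opts["pd"])
    if "p" in opts:
        return ("p", opts["p"])
    return None


def apply_defaults(opts, defaults):
    """Fill in source or target defaults only when none were given.

    Assumption: defaults are applied per group, so that a default never
    introduces a conflict with an option given on the command line.
    """
    merged = dict(opts)
    for group in (SOURCE_OPTS, TARGET_OPTS):
        if not any(name in opts for name in group):
            merged.update({k: v for k, v in defaults.items() if k in group})
    return merged
```

For example, `check_options({"d": "bymark", "fd": "/meta", "c": "corp"})` reports the conflict between -fd and -c, and `apply_defaults` leaves an explicitly given -td untouched while still filling in a default corpus.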
You can download (or look at the sources of) CorpusDivider [here]. To run it you will also need [the config file] and the [fiauimenre library]. You can also get the entire tool-package (containing the newest version of all fiauimenre tools and the library) [in one download]