Corpus Splitter
- Purpose
Splits a corpus into smaller pieces, stored in separate files. Whether a corpus can be split with this tool depends on the availability of a splitter-lib, a library that does the actual splitting. Not all corpora can or need to be split. To use just a part of a corpus that can already be expressed in terms of files that are already separate, use the CorpusDivider instead of this tool.
Use -ll to list the splitter-libs
The configfile is configcorpussplitter.pl
The following splitter-libs are available:
- Synopsis
./corpussplitter.pl
- [-c <corpus>] [-f <fromsubdir>] | [-fd <fromdir>] [-tc <tocorpus>] [-t <tosubdir>] | [-td <targetdir>] [-l <striplibshortname>] | [-lf <striplibfullname>] [-ll] [-la <extrastriplibargs>] [-ps <partsubdir>] [-p <part>] | [-pd <partdir>] [-pf <divprtfile> [...]] [-dr] [-? = -h = -help = --help] [-v [<verboselvl>]]
- -c
- the corpus to work on
- -f
- the subdir below the corpus-dir in which the corpus that is to be split can be found (defaults to rawunsplit if none is specified). The default way of using this tool is to rename the raw dir to rawunsplit, send the split corpus to the new raw dir, and use that as the new raw corpus
- -fd
- the full path to the stuff to split (conflicts with the -c and -f options)
- -tc
- target-corpus. If none is set the source corpus is used
- -t
- the subdir in which the split corpus should be stored (defaults to raw if none is specified)
- -td
- full path to the dir in which the split corpus should be stored
- -l
- the short name of the split-library to use (conflicts with -lf)
- -lf
- the split-library to use (conflicts with -l)
- -ll
- list the installed split-libraries (exits the program immediately)
- -la
- extra args to hand over to the split-lib
- -ps
- subdir in which the divisions are
- -p
- the name of a division to use as a part. The file names in all the divprt files in the division-dir are combined and used as the list of files to use, unless you specify one or more divprt files explicitly using the -pf option
- -pd
- full path to the division-dir to use as a part. This option causes the -p option to be ignored
- -pf
- one or more divprt files to use as the part instead of all files in the division
- -dr
- dry-run. Nothing is written or deleted, only reading and reporting is done
- -v
- the level of verbosity, default verboselevel = 2, available levels: 0,1,2,3
- -?
- (and equivalents) prints help: the purpose and the synopsis
NOTE:
- If no from and to-dirs are given the defaults in the config file are used
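A minimal sketch of the default workflow described under -f, run as a dry run first. The corpus name `mycorpus` and the splitter-lib short name `mysplitter` are hypothetical placeholders; use -ll to find the names that are actually installed. The -f and -t values are spelled out here only for clarity, since they match the documented defaults.

```shell
#!/bin/sh
# Hypothetical values: replace with a real corpus name and a
# splitter-lib short name as reported by ./corpussplitter.pl -ll
corpus="mycorpus"
splitlib="mysplitter"

# -dr makes this a dry run: nothing is written or deleted, only reported.
# -v 3 selects the highest documented verbosity level.
cmd="./corpussplitter.pl -c $corpus -f rawunsplit -t raw -l $splitlib -dr -v 3"
echo "$cmd"

# Only invoke the tool if it is present in the current directory.
if [ -x ./corpussplitter.pl ]; then
    $cmd
fi
```

Once the dry-run report looks right, the same command without -dr should perform the actual split.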
You can download (or look at the sources of) CorpusSplitter [here]. To run it you will also need [the config file] and the [fiauimenre library]. You can also get the entire tool-package (containing the newest version of all fiauimenre tools and the library) [in one download].