Corpus Sampler
- Purpose
CorpusSampler takes a random sample from a corpus. You can specify the requested number of words and/or lines. Along with the sample, a .meta file is written that records, for each sampled line, the file and the line number it was originally taken from (useful for checking purposes).
If a sample with the requested number of lines and words cannot be drawn, CorpusSampler exits with an error after maxnrofruns runs.
The config file is configcorpussampler.pl.
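The retry behaviour described above can be sketched roughly as follows. This is not the CorpusSampler source (which is Perl) but a minimal Python illustration under stated assumptions: the function name `sample_corpus`, the `(filename, line_number, text)` record layout, whitespace-based word counting, and the requirement that the word count match exactly are all hypothetical choices made for the example.

```python
import random


def sample_corpus(records, nroflines=None, nrofwords=None, maxnrofruns=100, seed=None):
    """Draw a random sample from (filename, line_number, text) records.

    Returns (sample_lines, meta), where meta[i] is the (filename, line_number)
    that sample_lines[i] was taken from. Raises RuntimeError when no run within
    maxnrofruns satisfies the request.
    """
    if nroflines is None and nrofwords is None:
        raise ValueError("request a number of lines, a number of words, or both")
    rng = random.Random(seed)
    for _ in range(maxnrofruns):
        order = list(records)
        rng.shuffle(order)  # one "run" = one fresh random ordering
        if nroflines is not None:
            picked = order[:nroflines]
        else:
            # Words only: keep adding lines until the word target is reached.
            picked, total = [], 0
            for record in order:
                if total >= nrofwords:
                    break
                picked.append(record)
                total += len(record[2].split())
        # Assumption: a "word" is a whitespace-separated token.
        words = sum(len(text.split()) for _, _, text in picked)
        lines_ok = nroflines is None or len(picked) == nroflines
        words_ok = nrofwords is None or words == nrofwords
        if lines_ok and words_ok:
            sample_lines = [text for _, _, text in picked]
            meta = [(filename, lineno) for filename, lineno, _ in picked]
            return sample_lines, meta
    raise RuntimeError(f"no suitable sample found after {maxnrofruns} runs")
```

The `meta` list plays the role of the .meta file: entry i names the source file and line number of sampled line i, so a sample can be checked against the original corpus.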
- Synopsis
./corpussampler.pl [-c <corpus>] [-f <fromsubdir>] | [-fd <fromdir>] [-tc <tocorpus>] [-t <tosubdir>] | [-td <targetdir>] [-l <nroflines>] [-w <nrofwords> [-u]] [-x <maxnrofruns>] [-ps <partsubdir>] [-p <part>] | [-pd <partdir>] [-pf <divprtfile> [...]] [-dr] [-? = -h = -help = --help] [-v [<verboselvl>]]
- -c
- the corpus to work on
- -f
- the subdir in which the files to sample can be found (defaults to refined)
- -fd
- full path to the files to sample from
- -tc
- the name the sample corpus should have. If none is set, the source corpus is used, so take care to supply a suitable tosubdir in that case
- -t
- the subdir in which the sample should be stored (defaults to refined)
- -td
- the full path in which the sample should be stored (conflicts with -tc and -t)
- -m
- the subdir in which the metafile should be stored (defaults to samplemeta)
- -md
- full path to the place where the metafile should be stored
- -l
- the number of lines you want your random sample to have
- -w
- the number of words you want in your random sample. If you ask for both a certain number of lines and a certain number of words, it keeps randomizing until it has found a random set containing the requested number of lines and the requested number of words. It reports an error if the request cannot be fulfilled within the maximum number of runs (see -x)
- -u
- count punctuation as words
- -x
- the maximum number of runs before giving up
- -ps
- subdir in which the divisions are
- -p
- the name of a division to use as a part. The file names in all the divprt files in the division dir are combined and used as the list of files to use, unless you specify one or more divprt files explicitly using the -pf option
- -pd
- full path to the division-dir to use as a part. This option causes the -p option to be ignored
- -pf
- one or more divprt files to use as the part instead of all files in the division
- -dr
- dry-run. Nothing is written or deleted, only reading and reporting is done
- -v
- the level of verbosity. The default verbosity level is 2; available levels: 0, 1, 2, 3
- -?
- (and equivalents) prints help: the purpose and the synopsis
NOTE:
- If no dirs or levels are given, the defaults in the config file are used
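To make the effect of the -u option concrete, here is a small Python sketch of word counting with and without punctuation. It is only an illustration: the function `count_words` is hypothetical, and it assumes that the corpus is already tokenized so that punctuation marks appear as separate whitespace-delimited tokens. How CorpusSampler itself tokenizes lines is not stated here.

```python
import string


def count_words(line, count_punctuation=False):
    """Count whitespace-separated tokens in a line.

    Assumption: a token made up only of ASCII punctuation characters is a
    punctuation token; it is counted only when count_punctuation is True
    (the behaviour this sketch associates with -u).
    """
    tokens = line.split()
    if count_punctuation:
        return len(tokens)
    return sum(
        1 for token in tokens
        if not all(ch in string.punctuation for ch in token)
    )
```

Under these assumptions, a tokenized line such as `Hello , world !` counts as two words by default and as four when punctuation is counted, which changes how many lines are needed to reach a requested number of words.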
You can download (or look at the sources of) CorpusSampler [here]. To run it you will also need [the config file] and the [fiauimenre library]. You can also get the entire tool package (containing the newest version of all fiauimenre tools and the library) [in one download].