Tag Sample Finder
- Purpose
Finds samples of a given n-gram. In the normal use case the n-gram is an n-gram of POS tags that is looked up in the corpus and shown together with the matching words and/or a left- or right-side context. This gives a better picture of the n-gram in its context (note that results are taken from the entire corpus, not only from the subcorpus for which the n-gram is typical).
WARNING: The wordrow and the tagrow must be correctly aligned: there may be no words without tags and no tags without words.
NOTE: A samplesetname (-s) must be set if no manual n-gram (-n) is provided.
The config file is configtagsamplefinder.pl
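The search described above can be sketched as follows. This is a minimal Python illustration, not the tool's own (Perl) implementation: the function name `find_samples`, the representation of a sentence as two parallel lists, the example tags, and the reading of the left and right context as a number of tokens are all assumptions made for the example.

```python
from typing import List, Tuple

Sample = Tuple[List[str], List[str], List[str]]


def find_samples(words: List[str], tags: List[str], ngram: List[str],
                 left: int = 0, right: int = 0) -> List[Sample]:
    """Return (left context, matched words, right context) for every
    position where `ngram` occurs in `tags`.

    Hypothetical helper: assumes one word per tag, in the same order.
    """
    # The alignment requirement from the WARNING: no words without tags
    # and no tags without words.
    if len(words) != len(tags):
        raise ValueError("wordrow and tagrow are not aligned")

    n = len(ngram)
    if n == 0:
        return []

    samples: List[Sample] = []
    for i in range(len(tags) - n + 1):
        if tags[i:i + n] == ngram:
            samples.append((
                words[max(0, i - left):i],    # left-side context
                words[i:i + n],               # words under the n-gram
                words[i + n:i + n + right],   # right-side context
            ))
    return samples


# Example data (invented for illustration).
words = ["the", "old", "dog", "saw", "a", "cat"]
tags = ["DET", "ADJ", "N", "V", "DET", "N"]
for sample in find_samples(words, tags, ["DET", "N"], left=1, right=1):
    print(sample)
```

Because the words are picked by position, a single missing word or tag shifts every later sample, which is why the sketch refuses to run on rows of unequal length.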
- Synopsis
./tagsamplefinder.pl
- [-c <corpus>] [-f <fromsubdir>] | [-fd <fromdir>] [-tc <tocorpus>] [-t <tosubdir>] | [-td <targetdir>] [-h <tohumtargetdir>] | [-hd <humtargetdir>] [-a <tasksubdir>] | [-ad <taskdir>] [-s <samplesetname>] [-af <tasksubdirfile>] | [-n <manualngram>] [-w <wordrowsubdir>] | [-wd <wordrowdir>] | [-nw] [-l <left>] | [-r <right>] [-m <maxnrofsamples> [-ra]] [-i <ignoretagspattern>] [-ps <partsubdir>] [-p <part>] | [-pd <partdir>] [-pf <divprtfile> [...]] [-dr] [-? = -h = -help = --help] [-v [<verboselvl>]]
- -c
- the corpus to work on
- -f
- the name of the subdir under the corpus in which the tagrow files can be found
- -fd
- full path to the tagrow-set to work on (conflicts with -f)
- -tc
- target-corpus. If none is set the source corpus is used
- -t
- the name of the subdir in which the found samples are to be stored in a computer-interpretable format
- -td
- full path to the dir in which the found samples will be stored. This option causes the -t option to be ignored
- -h
- the name of the subdir in which the found samples are to be stored in a human-interpretable format
- -hd
- full path to the dir in which the human-readable version of the found samples will be stored. This option causes the -h option to be ignored
- -a
- the subdir in which a list of n-grams to find samples for, can be found
- -ad
- full path to the dir where the n-grams are to be found (invalidates -a)
- -s
- the name of the sample set to use. The result files are found and stored in a subdir with this name below the from(sub)dir and the (to)(hum)targetdir. Needs to be set when no manual n-gram is given (this setting corresponds to the -pn setting in, for example, permstat-resultselector)
- -af
- the name of the file in which the n-grams are to be found. If no filename is given, all files in the n-gram dir are used as n-gram files (and as sample names)
- -n
- a single n-gram, specified on the command line, to look for (conflicts with -a and -ad, and with -t, -td, -h, -hd). Output is given on screen.
- -w
- the subdir of the corpusData dir in which the wordrow (used for printing the words of the tagged sentence) data can be found. Note that word printing only works if the wordrow and the tagrow files are aligned correctly
- -wd
- full path to the dir where the wordrow files can be found
- -nw
- don't use wordrows, so no words are printed. Invalidates the -w and -wd settings
- -l
- the left-side context that is wanted
- -r
- the right-side context that is wanted
- -m
- set a maximum to the number of samples that you want to find
- -ra
- make sure the samples chosen are randomly selected (given together with -m)
- -i
- a regexp pattern that matches the tags that should be ignored when looking for samples. Useful for searching for the reduced tagset
- -ps
- subdir in which the divisions are
- -p
- the name of a division to use as a part. The file names in all the divprt files in the division-dir are added and used as the list of files to use, unless you specify one or more explicitly using the -pf option
- -pd
- full path to the division-dir to use as a part. This option causes the -p option to be ignored
- -pf
- one or more divprt files to use as the part instead of all files in the division
- -dr
- dry-run. Nothing is written or deleted, only reading and reporting is done
- -v
- the level of verbosity, default verboselevel = 2, available levels: 0,1,2,3
- -?
- (and equivalents) prints help: the purpose and the synopsis
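The effect of the -i, -m and -ra options can be illustrated with a small Python sketch. It is not taken from the tool's source: the helper names `reduce_tags` and `limit_samples`, the `seed` parameter, the unanchored pattern match, and the choice to keep the first samples found when no random selection is requested are assumptions made for the example.

```python
import random
import re
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def reduce_tags(tags: Sequence[str],
                ignore_pattern: str) -> Tuple[List[str], List[int]]:
    """Drop every tag that matches `ignore_pattern` (as with -i).

    Returns the remaining tags plus their original positions, so that
    a match on the reduced row can be mapped back to the wordrow.
    Assumption: the pattern is applied as an unanchored search.
    """
    ignore = re.compile(ignore_pattern)
    kept = [(i, t) for i, t in enumerate(tags) if not ignore.search(t)]
    return [t for _, t in kept], [i for i, _ in kept]


def limit_samples(samples: Sequence[T], max_samples: Optional[int],
                  randomize: bool = False,
                  seed: Optional[int] = None) -> List[T]:
    """Cap the number of samples (as with -m), optionally at random (-ra)."""
    if max_samples is None or len(samples) <= max_samples:
        return list(samples)
    if randomize:
        # Draw without replacement so that no sample is reported twice.
        return random.Random(seed).sample(list(samples), max_samples)
    # Assumption: without random selection the first samples are kept.
    return list(samples[:max_samples])


# Example data (invented for illustration).
reduced, positions = reduce_tags(["DET", "PUNCT", "ADJ", "N"], r"^PUNCT$")
print(reduced, positions)
print(limit_samples(["s1", "s2", "s3", "s4"], 2))
```

Keeping the original positions next to the reduced tag row matters because the words still have to be looked up in the unreduced wordrow.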
NOTE:
- If no from-dir, division-dir and/or to-file is given, the corresponding defaults in the config file are used
You can download (or look at the sources of) TagSampleFinder [here]. To run it you will also need [the config file] and the [fiauimenre library]. You can also get the entire tool-package (containing the newest version of all fiauimenre tools and the library) [in one download]