Tag Sample Finder


Purpose

Finds samples of a given n-gram. In the normal use case the n-gram is a sequence of POS-tags that is looked up in the corpus and reported together with the corresponding words and/or a left- or right-side context. This gives a better picture of the n-gram in its context. Note that results are taken from the entire corpus, not only from the subcorpus for which the n-gram is typical.
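The lookup described above can be sketched in Python. This is only an illustration, not the tool's actual Perl implementation: the function name, the in-memory list representation, the example tags and the reading of left/right context as a number of tokens are all assumptions.

```python
def find_ngram_samples(tags, words, ngram, left=0, right=0):
    """Return (left_context, match, right_context) word triples for each hit.

    Assumes tags and words are aligned lists of equal length, and that
    left/right are context sizes counted in tokens (an assumption).
    """
    n = len(ngram)
    samples = []
    for start in range(len(tags) - n + 1):
        if tags[start:start + n] == ngram:
            end = start + n
            samples.append((
                words[max(0, start - left):start],  # left-side context
                words[start:end],                   # words under the n-gram
                words[end:end + right],             # right-side context
            ))
    return samples


# Hypothetical tagged sentence; the tag names are made up for the example.
tags = ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"]
words = ["the", "old", "dog", "chased", "the", "cat"]

for left_ctx, match, right_ctx in find_ngram_samples(
        tags, words, ["ADJ", "NOUN"], left=1, right=1):
    print(" ".join(left_ctx), "[" + " ".join(match) + "]", " ".join(right_ctx))
```

The words can only be sliced with the tag positions because both lists are indexed identically, which is why the alignment of wordrow and tagrow matters.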

WARNING: The wordrow and the tagrow must be correctly aligned: there may be no words without tags and no tags without words.
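A quick way to verify this before running the finder might look like the following Python sketch. It assumes, purely for illustration, that each row is one line of whitespace-separated tokens; the real wordrow and tagrow file formats are not described here, and the function name is hypothetical.

```python
def misaligned_rows(word_lines, tag_lines):
    """Return the indices of rows whose word count and tag count differ.

    Assumes one row per line with whitespace-separated tokens
    (an assumption, not the documented file format).
    """
    if len(word_lines) != len(tag_lines):
        raise ValueError("wordrow and tagrow have a different number of rows")
    return [
        i
        for i, (word_line, tag_line) in enumerate(zip(word_lines, tag_lines))
        if len(word_line.split()) != len(tag_line.split())
    ]


# Hypothetical data: the second row has a word without a tag.
word_lines = ["the dog barks", "a cat sleeps here"]
tag_lines = ["DET NOUN VERB", "DET NOUN VERB"]
print(misaligned_rows(word_lines, tag_lines))
```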

NOTE: A samplesetname must be set if no manual ngram is provided

The configfile is configtagsamplefinder.pl

Synopsis

./tagsamplefinder.pl
[-c <corpus>] [-f <fromsubdir>] | [-fd <fromdir>] [-tc <tocorpus>] [-t <tosubdir>] | [-td <targetdir>] [-h <tohumtargetdir>] | [-hd <humtargetdir>] [-a <tasksubdir>] | [-ad <taskdir>] [-s <samplesetname>] [-af <tasksubdirfile>] | [-n <manualngram>] [-w <wordrowsubdir>] | [-wd <wordrowdir>] | [-nw] [-l <left>] | [-r <right>] [-m <maxnrofsamples> [-ra]] [-i <ignoretagspattern>] [-ps <partsubdir>] [-p <part>] | [-pd <partdir>] [-pf <divprtfile> [...]] [-dr] [-? = -h = -help = --help] [-v [<verboselvl>]]

-c
the corpus to work on
-f
the name of the subdir under the corpus in which the tagrow files can be found
-fd
full path to the tagrow-set to work on (conflicts with -f)
-tc
target-corpus. If none is set the source corpus is used
-t
the name of the subdir in which the found samples are to be stored in a computer-interpretable format
-td
full path to the dir in which the found samples will be stored. This option causes the -t option to be ignored
-h
the name of the subdir in which the found samples are to be stored in a human-interpretable format
-hd
full path to the dir in which the human-readable version of the found samples will be stored. This option causes the -h option to be ignored
-a
the subdir in which a list of n-grams to find samples for, can be found
-ad
full path to the dir where the n-grams are to be found (invalidates -a)
-s
the name of the sample-set to use. The result files are looked up and stored in a subdir with this name, below the from(sub)dir and the (to)(hum)targetdir. Must be set unless a manual n-gram is given with -n (this setting corresponds to the -pn setting in, for example, permstat-resultselector)
-af
the name of the file in which the n-grams are to be found. If no filename is given, all files in the ngram-dir are used as ngram-files (and as sample-names)
-n
a single n-gram, specified on the command line, to look for (conflicts with -a and -ad, and with -t, -td, -h, -hd). Output is given on-screen
-w
the subdir of the corpusData dir in which the wordrow data (used for printing the words of the tagged sentence) can be found. Note that word-printing only works if the wordrow and tagrow files are aligned correctly
-wd
full path to the dir where the wordrow files can be found
-nw
don't use wordrows, so no words are printed. Invalidates the -w and -wd settings
-l
the left-side context that is wanted
-r
the right-side context that is wanted
-m
set a maximum to the number of samples that you want to find
-ra
make sure the samples chosen are randomly selected
-i
a regexp pattern that matches the tags that should be ignored when looking for samples. Useful for searching on the reduced tagset
-ps
subdir in which the divisions are
-p
the name of a division to use as a part. The file names in all the divprt-files in the division-dir are added and used as the list of files to use, unless you specify one or more explicitly using the -pf option
-pd
full path to the division-dir to use as a part. This option causes the -p option to be ignored
-pf
one or more divprt files to use as the part instead of all files in the division
-dr
dry-run. Nothing is written or deleted, only reading and reporting is done
-v
the level of verbosity, default verboselevel = 2, available levels: 0,1,2,3
-?
(and equivalents) prints help: the purpose and the synopsis
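How the -i, -m and -ra options might interact can be illustrated with a small Python sketch. It is not taken from the tool's source. It assumes that tags matching the ignore pattern are dropped (together with their words) before matching, that the pattern is applied unanchored as in a Perl match, and that without -ra the first samples found are kept. All function and parameter names are hypothetical.

```python
import random
import re


def strip_ignored(tags, words, ignore_pattern):
    """Drop every tag (and its aligned word) that matches ignore_pattern."""
    pattern = re.compile(ignore_pattern)
    kept = [(t, w) for t, w in zip(tags, words) if not pattern.search(t)]
    return [t for t, _ in kept], [w for _, w in kept]


def limit_samples(samples, max_samples=None, randomize=False, rng=None):
    """Keep at most max_samples samples, optionally chosen at random."""
    if max_samples is None or len(samples) <= max_samples:
        return list(samples)
    if randomize:
        rng = rng or random.Random()
        return rng.sample(list(samples), max_samples)
    return list(samples[:max_samples])  # assumed default: first ones found


# Hypothetical example: ignore punctuation tags, then keep two samples.
tags, words = strip_ignored(
    ["DET", "PUNCT", "NOUN"], ["the", ",", "dog"], r"^PUNCT$")
print(tags, words)
print(limit_samples(["s1", "s2", "s3", "s4"], max_samples=2))
```

Random selection only makes a difference when more samples are found than the maximum allows, which matches the synopsis, where -ra appears only as a sub-option of -m.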

NOTE:
If no from-dir, division-dir and/or to-file is given, the default(s) from the config file are used.

You can download (or look at the sources of) TagSampleFinder [here]. To run it you will also need [the config file] and the [fiauimenre library]. You can also get the entire tool-package (containing the newest version of all fiauimenre tools and the library) [in one download]
Document last modified Tue, 10 Oct 2006 18:16:57
All content is available under the GNU Free Documentation License. The LogiLogi-system is under the GPL