Tag Same Statter


Purpose

Takes two sets of tagrow files (files containing tags separated by spaces, in sentences separated by newlines). In the normal use case one set is tagged manually and the other is not. The two sets are aligned within each line (across lines the data should already be aligned!) and then the percentage of tags that don't match (extra, missing or wrong) is reported. With -n the width of the n-grams to check can be given; the check is then done at the n-gram level.
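
As an illustration of the comparison described above, here is a minimal sketch in Python (not the tool's own Perl code). It assumes that set 1 is the reference, that alignment can be approximated with the standard difflib.SequenceMatcher, and that the percentage is taken over the number of n-grams in the set-1 line. The names ngrams and mismatch_stats are hypothetical, and the actual alignment method of tagsamestatter.pl may differ.

```python
from difflib import SequenceMatcher


def ngrams(tags, n):
    """Return the list of n-grams (as tuples) over a tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def mismatch_stats(line1, line2, n=1):
    """Align two tag rows and count n-grams that are wrong, missing or extra.

    Assumption: line1 is the reference (set 1); 'missing' means present in
    line1 only, 'extra' means present in line2 only.
    """
    a = ngrams(line1.split(), n)
    b = ngrams(line2.split(), n)
    wrong = missing = extra = 0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            # Pair substituted units one-to-one; any surplus is missing or extra.
            paired = min(i2 - i1, j2 - j1)
            wrong += paired
            missing += (i2 - i1) - paired
            extra += (j2 - j1) - paired
        elif op == "delete":
            missing += i2 - i1
        elif op == "insert":
            extra += j2 - j1
    total = len(a)
    mismatches = wrong + missing + extra
    # Assumed denominator: number of n-grams in the reference line.
    percent = 100.0 * mismatches / total if total else 0.0
    return {"wrong": wrong, "missing": missing, "extra": extra,
            "total": total, "percent": percent}
```

For example, mismatch_stats("DET NOUN VERB", "DET ADJ VERB") counts one wrong tag out of three, while the same pair with n=2 shares no bigram, so both bigrams count as wrong.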

The config file is configtagsamestatter.pl.

Synopsis

./tagsamestatter.pl
[-c <corpus>] [-f1 <fromsubdir1>] [-f2 <fromsubdir2>] | [-fd1 <fromdir1>] [-fd2 <fromdir2>]
[-tc <tocorpus>] [-t <tosubdir>] | [-td <targetdir>]
[-s <statresults>]
[-w <wordrowsubdir>] | [-wd <wordrowdir>] | [-nw]
[-h <humtargetsubdir>] | [-hd <humtargetdir>]
[-n <width> [<width> [...]]] | [-n <width>..<tillwidth>]
[-i <ignoretagspattern>]
[-ps <partsubdir>] [-p <part>] | [-pd <partdir>] [-pf <divprtfile> [...]]
[-dr] [-? = -h = -help = --help] [-v [<verboselvl>]]

-c
the corpus to work on
-f1
the name of the subdir under the corpus in which the tagrow files of set 1 can be found
-f2
same as -f1 but for set 2
-fd1
full path to the set1-corpus to work on (conflicts with -f1 and -f2)
-fd2
same as -fd1 but for set 2
-tc
target-corpus. If none is set the source corpus is used
-t
the name of the subdir in which the results are to be stored
-td
full path to the dir in which the statresults will be stored. This option causes the -t option to be ignored
-s
the name of the statresults. The result-files will be stored inside a sub-dir that gets this name (<corpus> / <subdir> / <statresults>)
-h
the subdir in which humtarget-specific things like the wrongs-file (a list of differences between the sets), or the ignored & prettyprint files should be stored
-hd
full path to the dir where the humtarget-files should be stored
-w
the subdir of the corpusData dir in which the wordrow (used for printing the middle line, normally the words of the tagged sentence) data can be found. Note that middle-printing only works if the middle line contains the same number of elements as the shortest of the two other lines
-wd
full path to the dir where the wordrow files can be found
-nw
don't print the middle line of words. Invalidates the -w and -wd settings
-n
the width(s) of the n-grams to use for the samestatting. A range can be specified like 1..4. Alternatively, the desired widths can be listed separately
-i
a reg-exp pattern that matches the tags that should be ignored in the comparison
-ps
subdir in which the divisions are
-p
the name of a division to use as a part. The file names in all the divprt files in the division-dir are collected and used as the list of files to process, unless you specify one or more divprt files explicitly using the -pf option
-pd
full path to the division-dir to use as a part. This option causes the -p option to be ignored
-pf
one or more divprt files to use as the part instead of all files in the division
-dr
dry-run. Nothing is written or deleted, only reading and reporting is done
-v
the level of verbosity (default: 2; available levels: 0, 1, 2, 3)
-?
(and equivalents) prints help: the purpose and the synopsis
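
To make the -n and -i options more concrete, here is a small Python sketch (again not the tool's own code) of how the width arguments might be expanded and how ignored tags might be filtered out before comparison. It assumes that a range such as 1..4 is inclusive at both ends and that the ignore pattern is matched anywhere in a tag. The function names parse_widths and drop_ignored are hypothetical.

```python
import re


def parse_widths(args):
    """Expand -n arguments such as ["1..4"] or ["1", "3", "5"] into sorted widths."""
    widths = set()
    for arg in args:
        m = re.fullmatch(r"(\d+)\.\.(\d+)", arg)
        if m:
            # Assumption: the range includes both end points.
            lo, hi = int(m.group(1)), int(m.group(2))
            widths.update(range(lo, hi + 1))
        else:
            widths.add(int(arg))
    return sorted(widths)


def drop_ignored(tags, pattern):
    """Remove tags matching the ignore pattern before the comparison.

    Assumption: the pattern may match anywhere in the tag (search, not full match).
    """
    ignore = re.compile(pattern)
    return [t for t in tags if not ignore.search(t)]
```

With these definitions, parse_widths(["1..4"]) yields the widths 1 through 4, and drop_ignored(["DET", "PUNCT", "NOUN"], "^PUNCT$") leaves only DET and NOUN.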

NOTE:
If no from-dir, division-dir or to-file is given, the corresponding default from the config file is used.

You can download (or look at the sources of) TagSameStatter [here]. To run it you will also need [the config file] and the [fiauimenrelibrary]. You can also get the entire tool-package (containing the newest version of all fiauimenre tools and the library) [in one download]
Part of the LogiLogi Network: The LogiLogi Foundation - LogiLogi.org - OgOg.org
This is an old version for archival purposes, see www.LogiLogi.org for the current version.
Document last modified Fri, 29 Jul 2005 07:48:29
All content is available under the GNU Free Documentation License. The LogiLogi-system is under the GPL