Row Statter
- Purpose
Analyses a row-set, reporting the number of tags/words, the frequency of each tag/word found in the set, the average sentence/row length, and the average row length before and after each word/tag type. The frequency and sentence/row-length calculations are made both per file and in total, for different n-gram widths. The row lengths before and after specific tags are not calculated per file, since that would not be very useful and would cost a lot of memory.
Tags or words can be ignored with -i if desired.
The configfile is configrowstatter.pl
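The statistics described above can be illustrated with a small sketch. It is written in Python rather than the tool's Perl and is not RowStatter's actual implementation. The function names `ngram_counts` and `length_stats` are hypothetical, and the sketch assumes that a row is one line of whitespace-separated tokens and that the "length before/after" a token means the number of tokens preceding/following it in its row.

```python
from collections import Counter, defaultdict


def ngram_counts(rows, width):
    """Count every n-gram of the given width over all rows (assumed definition)."""
    counts = Counter()
    for row in rows:
        tokens = row.split()
        for i in range(len(tokens) - width + 1):
            counts[tuple(tokens[i:i + width])] += 1
    return counts


def length_stats(rows):
    """Return (average row length, {token: (avg tokens before, avg tokens after)})."""
    total_len = 0
    n_rows = 0
    before = defaultdict(list)
    after = defaultdict(list)
    for row in rows:
        tokens = row.split()
        if not tokens:
            continue  # skip empty rows so they do not distort the average
        n_rows += 1
        total_len += len(tokens)
        for i, tok in enumerate(tokens):
            before[tok].append(i)                    # tokens preceding this occurrence
            after[tok].append(len(tokens) - i - 1)   # tokens following this occurrence
    avg_row = total_len / n_rows if n_rows else 0.0
    per_token = {
        tok: (sum(before[tok]) / len(before[tok]), sum(after[tok]) / len(after[tok]))
        for tok in before
    }
    return avg_row, per_token


# Tiny made-up tag rows, for illustration only.
rows = ["DET NOUN VERB", "DET NOUN"]
unigrams = ngram_counts(rows, 1)
bigrams = ngram_counts(rows, 2)
avg_row, per_token = length_stats(rows)
```

Calling `ngram_counts` once per requested width mirrors the idea of reporting frequencies for several n-gram widths; running both functions on the rows of a single file rather than the whole set would give the per-file variant.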
- Synopsis
./rowstatter.pl
- [-c <corpus>] [-f <fromsubdir>] | [-fd <fromdir>] [-tc <tocorpus>] [-t <tosubdir>] | [-td <resultsdir>] [-s <rowstatresults>] [-h <humtargetsubdir>] | [-hd <humtargetdir>] [-n <width> [<width> [...]]] | [-n <width>..<tillwidth>] [-wo] [-ta] [-np] [-nl] [-nc] [-nd] [-i <ignorepattern>] [-ps <partsubdir>] [-p <part>] | [-pd <partdir>] [-pf <divprtfile> [...]] [-dr] [-? = -h = -help = --help] [-v [verboselvl]]
- -c
- the corpus to work on
- -f
- the name of the subdir under the corpus in which the row files can be found (NOTE! Defaults to wordrow within the corpusData dir if none is set)
- -cd
- full path to the corpus to work on
- -t
- the name of the subdir in which the rowstatresults are to be stored
- -s
- the name of the rowstatresults. The results-files will be stored inside a sub-dir that gets this name (<corpus> / <subdir> / <rowstatresults>)
- -td
- full path to the dir in which the rowstatresults-files will be stored.
- -h
- the subdir in which humtarget-specific SPSS-ready data files should be stored
- -hd
- full path to the dir where the humtarget-files should be stored. This option causes the -h option to be ignored
- -n
- the width(s) of the n-grams to calculate rowstatresults for. A range can be specified like 1..4. Alternatively, the desired widths can be listed separately
- -wo
- words, the datadir will default to corpusData, the subdir will default to wordrow
- -ta
- tags, the datadir will default to taggingData, the subdir will default to tagrow
- -np
- no count per file. Normally counts are also made and stored per file, so it is possible to tell how frequent each item is in each file, and how long the rows, and the rows before/after each tag/word, are on average. Storing this takes quite some disk space if your corpus is large and has many small files, so if you don't need the info and do need the disk space, disable it with this setting
- -nl
- No lengths: do not calculate or store average sentence/row lengths, or the lengths of the row before or after a certain tag/word
- -nc
- No counts: no counts are kept of words/tags. Note that specifying both -nl and -nc at the same time effectively causes the program to do nothing except create a few dirs
- -nd
- Don't delete previous results
- -i
- a reg-exp pattern that matches the words or tags that should be ignored in the results
- -ps
- subdir in which the divisions are
- -p
- the name of a division to use as a part. The file names in all the divprt files in the division-dir are collected and used as the list of files to process, unless you specify one or more divprt files explicitly using the -pf option
- -pd
- full path to the division-dir to use as a part. This option causes the -p option to be ignored
- -pf
- one or more divprt files to use as the part instead of all files in the division
- -v
- the level of verbosity, default verboselevel = 2, available levels: 0,1,2,3
- -?
- (and equivalents) prints help: the purpose and the synopsis
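As an illustration of how the -n and -i arguments listed above could be interpreted, here is a minimal Python sketch. It is not taken from rowstatter.pl: the helper names `parse_widths` and `filter_ignored` are hypothetical, the range `1..4` is assumed to be inclusive at both ends, and the ignore pattern is assumed to be applied as a search on each individual word or tag.

```python
import re


def parse_widths(args):
    """Turn -n arguments such as ['1..4'] or ['1', '3'] into a sorted list of widths."""
    widths = set()
    for arg in args:
        m = re.fullmatch(r"(\d+)\.\.(\d+)", arg)
        if m:
            # Range form <width>..<tillwidth>, assumed inclusive on both ends.
            start, end = int(m.group(1)), int(m.group(2))
            widths.update(range(start, end + 1))
        else:
            # Single width given on its own.
            widths.add(int(arg))
    return sorted(widths)


def filter_ignored(tokens, ignore_pattern=None):
    """Drop tokens matching the ignore pattern (assumed per-token regex search)."""
    if ignore_pattern is None:
        return list(tokens)
    pat = re.compile(ignore_pattern)
    return [t for t in tokens if not pat.search(t)]
```

With these assumptions, `parse_widths(["1..4"])` and `parse_widths(["1", "2", "3", "4"])` select the same widths, and anchoring the pattern (for example `^PUNC$`) restricts the filter to exact matches.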
NOTE:
- If no from-dir, division-dir and/or to-file is given, the corresponding default(s) from the config file are used
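The precedence between full-path options, subdir options and config defaults might be sketched as follows. This is a Python illustration under stated assumptions, not the tool's code: the function `resolve_dir` and its parameter names are hypothetical, and the rule that an incomplete corpus/subdir pair falls back to the config default is an assumption.

```python
import os


def resolve_dir(full_path=None, corpus=None, subdir=None, name=None, default=None):
    """Pick a directory: explicit full path, else corpus/subdir[/name], else default."""
    if full_path:
        # A full-path option (such as -td, -hd or -pd) wins over the subdir option.
        return full_path
    if corpus and subdir:
        # Build <corpus>/<subdir>/<name>, leaving out the name when none is given.
        parts = [p for p in (corpus, subdir, name) if p]
        return os.path.join(*parts)
    # Nothing usable on the command line: use the config-file default (assumed).
    return default
```

For example, `resolve_dir(corpus="mycorpus", subdir="stats", name="run1")` yields the `<corpus> / <subdir> / <rowstatresults>` layout described for the -s option, while passing `full_path` causes the other arguments to be ignored.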
You can download (or look at the sources of) RowStatter [here]. To run it you will also need [the config file] and the [fiauimenre library]. You can also get the entire tool-package (containing the newest version of all fiauimenre tools and the library) [in one download]