Ngram Permutator


Purpose

Takes a set of groups of tagrow files (files containing tags separated by spaces, in sentences separated by newlines; the groups are provided by CorpusDivider), permutes the lines between the groups, and counts the n-grams in each permuted group (within a given range of widths). The results are written to a set of permutation-files within n-gram subdirs within the permutationdir, one permutation-file per group. Each file is a table: the n-grams found, separated by tabs, with the counts below them, each permutation on its own line. This data can be read using the PermutationDataReader modules.
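The procedure described above can be sketched in Python as follows. This is only an illustration, not the tool's actual Perl code: the function names are made up for this example, and it assumes that every group keeps its original number of sentences after shuffling and that the table has one header row of n-grams followed by one row of counts per permutation.

```python
import random
from collections import Counter


def ngrams(tags, width):
    """Return all n-grams of the given width within one sentence (a list of tags)."""
    return [tuple(tags[i:i + width]) for i in range(len(tags) - width + 1)]


def count_ngrams(sentences, width):
    """Count the n-grams of one width over a list of sentences (strings of tags)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), width))
    return counts


def permute_groups(groups, rng):
    """Pool all sentences, shuffle them, and deal them back out.

    Assumption: each group keeps its original number of sentences.
    """
    pooled = [sentence for group in groups for sentence in group]
    rng.shuffle(pooled)
    permuted, start = [], 0
    for group in groups:
        permuted.append(pooled[start:start + len(group)])
        start += len(group)
    return permuted


def permutation_counts(groups, widths, nr_of_permutations, seed=0):
    """Return result[width][group_index]: a list of Counters, one per permutation.

    The same shuffled assignment is reused for every width.
    """
    rng = random.Random(seed)
    result = {width: [[] for _ in groups] for width in widths}
    for _ in range(nr_of_permutations):
        permuted = permute_groups(groups, rng)
        for width in widths:
            for index, sentences in enumerate(permuted):
                result[width][index].append(count_ngrams(sentences, width))
    return result


def format_table(counters):
    """Render one group's counts as a tab-separated table (assumed layout):
    a header row of n-grams, then one row of counts per permutation."""
    header = sorted(set().union(*counters))
    lines = ["\t".join(" ".join(ngram) for ngram in header)]
    for counter in counters:
        lines.append("\t".join(str(counter[ngram]) for ngram in header))
    return "\n".join(lines)
```

For example, permutation_counts([["A B C", "B C"], ["C A", "A B"]], [1, 2], 5) gives, for both widths and both groups, five Counters, and format_table turns one such list into the text of one permutation-file.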

The config file is configngrampermutator.pl.

Synopsis

./ngrampermutator.pl
[-c <corpus>] [-f <fromsubdir>] | [-fd <fromdir>] [-tc <tocorpus>] [-t <tosubdir>] | [-td <targetdir>] [-pn <permutation-name>] [-n <width> [<width> [...]]] | [-n <width>..<tillwidth>] [-np <nrofpermutations>] [-s] [-m <minimumnumberofoccurrences>] [-ds <divsubdir>] [-d <division>] | [-dd <divisiondir>] [-df <divprtfile> [...]] [-dr] [-? = -h = -help = --help] [-v [<verboselvl>]]

-c
the corpus to work on
-f
the name of the subdir under the corpus in which the tagrow files can be found
-fd
full path to the corpus to work on
-tc
the target-corpus. If none is set the (source) corpus is used
-t
the name of the subdir in which the permutations are to be stored
-td
full path to the dir in which the permutation-files will be stored. This option causes the -t and -pn options to be ignored
-pn
the name of the permutation. The permutation-files will be stored inside a sub-dir that gets this name (<corpus> / <subdir> / <permutation-name>)
-n
the width(s) of the n-grams to use for the permutation. A range can be specified, like 1..4. Alternatively, the desired widths can be listed separately. Note that the same permutation is used for all the different n-gram widths, so each run really performs one permutation-test. NOTE: This option does not accept ranges if called with the -s option
-np
the number of permutations to make (> 5000 is advised)
-s
permute the individual n-grams, not the sentences
-m
the minimum number of times an n-gram needs to be present in the entire corpus to be taken into consideration
-ds
subdir in which the divisions are
-d
the name of a division to use. The file-names in all the divprt-files in the division-dir are used as the groups unless you specify one or more explicitly using the -df option
-dd
full path to the division-dir. This option causes the -d option to be ignored
-df
one or more divprt files to use instead of all files in the division
-dr
dry-run. Nothing is written or deleted, only reading and reporting is done
-v
the level of verbosity, default verboselevel = 2, available levels: 0,1,2,3
-?
(and equivalents) prints help: the purpose and the synopsis
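As an illustration of the -n and -m options listed above, the following Python sketch shows one way the width arguments and the minimum-occurrence threshold could be handled. It is not taken from the tool's Perl source: the function names are hypothetical, and it assumes that a range such as 1..4 includes both end points and that an n-gram is kept when its total count is at least the minimum.

```python
import re


def parse_widths(args, single_ngram_mode=False):
    """Turn the values given to -n into a sorted list of n-gram widths.

    Accepts a range such as "1..4" (assumed inclusive) or separate widths
    such as "1", "2", "4". Ranges are rejected when -s is in effect.
    """
    widths = set()
    for arg in args:
        match = re.fullmatch(r"(\d+)\.\.(\d+)", arg)
        if match:
            if single_ngram_mode:
                raise ValueError("ranges are not accepted together with -s")
            low, high = int(match.group(1)), int(match.group(2))
            if low > high:
                raise ValueError("range start exceeds range end: " + arg)
            widths.update(range(low, high + 1))
        elif re.fullmatch(r"\d+", arg):
            widths.add(int(arg))
        else:
            raise ValueError("not a width or a range: " + arg)
    if not widths or min(widths) < 1:
        raise ValueError("widths must be positive integers")
    return sorted(widths)


def filter_by_minimum(corpus_counts, minimum):
    """Keep only the n-grams that occur at least `minimum` times in the
    entire corpus (the role of -m, under the assumption stated above)."""
    return {ngram: count for ngram, count in corpus_counts.items()
            if count >= minimum}
```

With these definitions, parse_widths(["1..4"]) and parse_widths(["1", "2", "3", "4"]) yield the same list of widths, while parse_widths(["1..4"], single_ngram_mode=True) raises an error, mirroring the restriction noted for -s.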

NOTE:
If no from-dir, division-dir and/or to-file is given, the corresponding default(s) in the config file are used

The sources of NGramPermutator are available for download. To run it you will also need the config file and the fiauimenre library. The entire tool-package (containing the newest version of all fiauimenre tools and the library) is also available in one download.
Part of the LogiLogi Network: The LogiLogi Foundation - LogiLogi.org - OgOg.org
This is an old version for archival purposes, see www.LogiLogi.org for the current version.
Document last modified Fri, 10 Nov 2006 14:40:44
All content is available under the GNU Free Documentation License. The LogiLogi-system is under the GPL