Corpus Refiner


Purpose

Refines a corpus. The kind of refinement depends on the refiner-lib that is specified on the command line. Libs can be specified by their full names or, if they are installed properly, by their short names. The CorpusRefiner should only be run after unwanted elements have been stripped from the corpus.

What the CorpusRefiner actually does depends on the refiner-lib, but usually it is sentence splitting (each sentence on its own line), adding whitespace around punctuation, and stripping superfluous whitespace. The changed files are saved into a different dir (the default can be set in the config file).
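
The typical refinement steps described above can be sketched as follows. This is only a minimal illustration in Python, not the code of any actual refiner-lib; the function name refine and the set of punctuation marks are assumptions made for the example.

```python
import re

def refine(text: str) -> str:
    """Hypothetical sketch: one sentence per line, punctuation set apart."""
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Put exactly one space before and after each punctuation mark
    # (the set of marks is an assumption for this example).
    text = re.sub(r"\s*([.,;:!?])\s*", r" \1 ", text)
    # Break the line after sentence-final punctuation.
    text = re.sub(r"([.!?]) ", r"\1\n", text)
    # Strip leading and trailing whitespace and drop empty lines.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)

print(refine("Hello, world.  This is  a test!"))
```

The sketch is deliberately naive: it also splits after abbreviations such as "e.g." and inside decimal numbers, which a real refiner-lib would presumably have to handle.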

Use -ll to list the refiner-libs

The configfile is configcorpusrefiner.pl

The following refiner-libs are available:

Synopsis

./corpusrefiner.pl
[-c <corpus>] [-f <fromsubdir>] | [-fd <fromdir>] [-tc <tocorpus>] [-t <tosubdir>] | [-td <targetdir>] [-l <refinerlibshortname>] | [-lf <refinerlib>] [-ll] [-la <extrarefinerlibargs>] [-a <actionlist>] [-ps <partsubdir>] [-p <part>] | [-pd <partdir>] [-pf <divprtfile> [...]] [-dr] [-? = -h = -help = --help] [-v [<verboselvl>]]

-c
the corpus to work on
-f
the subdir below the corpus-dir in which the stripped corpus can be found (defaults to stripped if none is specified)
-fd
the full path to the stuff to work on (conflicts with -c and -f)
-tc
target-corpus. If none is set the source corpus is used
-t
the subdir in which the refined corpus should be stored (defaults to refined if none is specified)
-td
full path to the dir in which the refined corpus should be stored
-l
the short name of the refiner-library to use (conflicts with -lf)
-lf
the refiner-library to use (conflicts with -l)
-ll
list the installed refiner-libraries (exits the program immediately)
-la
extra args to hand over to the refiner-lib
-a
a detailed specification of the actions to perform, in this case what to change during the refinement stage and what not, in the form of a string in which each token stands for a thing to change/refine. Have a look at the description of the refiner-lib you use for info on which actionlist settings it supports
-ps
subdir in which the divisions are
-p
the name of a division to use as a part. The file names in all the divprt-files in the division-dir are added together and used as the list of files to use, unless you specify one or more divprt-files explicitly using the -pf option
-pd
full path to the division-dir to use as a part. This option causes the -p option to be ignored
-pf
one or more divprt files to use as the part instead of all files in the division
-dr
dry-run. Nothing is written or deleted, only reading and reporting is done
-v
the level of verbosity, default verboselevel = 2, available levels: 0,1,2,3
-?
(and equivalents) prints help: the purpose and the synopsis
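
How the -p, -pd and -pf options select the files of a part might look roughly like this. The sketch rests on several assumptions that the description above does not confirm: that every regular file in the division-dir is a divprt-file, that a divprt-file lists one file name per line, and that the names given with -pf are relative to the division-dir. The function name part_file_list is hypothetical.

```python
import os

def part_file_list(division_dir, divprt_files=None):
    """Hypothetical sketch: collect the file names that make up a part."""
    if divprt_files:
        # -pf given: read only the explicitly named divprt-files
        # (assumed to be relative to the division-dir).
        paths = [os.path.join(division_dir, name) for name in divprt_files]
    else:
        # No -pf: treat every regular file in the division-dir as a
        # divprt-file (assumption), in sorted order.
        paths = [
            os.path.join(division_dir, name)
            for name in sorted(os.listdir(division_dir))
        ]
        paths = [path for path in paths if os.path.isfile(path)]
    names = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            # Assumed format: one corpus file name per line.
            names.extend(line.strip() for line in handle if line.strip())
    return names
```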

NOTE:
If no from-dir and to-dir are given, the defaults in the config file are used.
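
The relation between the -c, -f, -fd, -tc, -t and -td options can be illustrated with a small Python sketch. It is not taken from corpusrefiner.pl. The corpora_root parameter is hypothetical, as is the assumption that a corpus-dir is simply that root plus the corpus name, and that -td takes precedence over -tc and -t. Only the documented defaults "stripped" and "refined" are modelled; any other settings from the config file are left out.

```python
import posixpath

def resolve_dirs(corpora_root, corpus=None, fromsubdir=None, fromdir=None,
                 tocorpus=None, tosubdir=None, targetdir=None):
    """Hypothetical sketch: return (source_dir, target_dir) for one run."""
    # -fd is a full path, so it cannot be combined with -c or -f.
    if fromdir is not None and (corpus is not None or fromsubdir is not None):
        raise ValueError("-fd conflicts with -c and -f")

    if fromdir is not None:
        source = fromdir
    elif corpus is not None:
        # -f defaults to "stripped" if none is specified.
        source = posixpath.join(corpora_root, corpus, fromsubdir or "stripped")
    else:
        raise ValueError("no source: give -c or -fd")

    if targetdir is not None:
        # Assumption: a full target path overrides -tc and -t.
        target = targetdir
    else:
        # If no target corpus is set, the source corpus is used.
        target_corpus = tocorpus or corpus
        if target_corpus is None:
            raise ValueError("no target: give -tc or -td")
        # -t defaults to "refined" if none is specified.
        target = posixpath.join(corpora_root, target_corpus, tosubdir or "refined")
    return source, target
```

For example, resolve_dirs("/data", corpus="news") yields the pair ("/data/news/stripped", "/data/news/refined"), where "/data" and "news" are made-up names.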

You can download (or look at the sources of) CorpusRefiner [here]. To run it you will also need [the config file] and the [fiauimenre library]. You can also get the entire tool package (containing the newest version of all fiauimenre tools and the library) [in one download].
Part of the LogiLogi Network: The LogiLogi Foundation - LogiLogi.org - OgOg.org
This is an old version for archival purposes, see www.LogiLogi.org for the current version.
Document last modified Fri, 29 Jul 2005 04:02:53
All content is available under the GNU Free Documentation License. The LogiLogi-system is under the GPL