Corpus Compoundify


Purpose

Compoundifies a corpus: every sequence of words that is joined by underscores (_) in the compounds file is joined the same way in the output of this program. Compoundifying is necessary when the corpus is used together with databases that contain such compounds, like WordNet. Comparisons are done in lowercase.
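
As a rough illustration of the idea (not the tool's actual Perl implementation), the joining step might look like the Python sketch below. It assumes the compounds file lists one compound per line, that the longest match wins, and that the original casing of the corpus tokens is kept in the output; none of these details are confirmed by this page, and the function names are hypothetical.

```python
def load_compounds(lines):
    """Collect compounds as lowercase word tuples (assumed: one compound per line)."""
    compounds = set()
    for line in lines:
        entry = line.strip()
        if "_" in entry:
            compounds.add(tuple(entry.lower().split("_")))
    return compounds


def compoundify(tokens, compounds):
    """Join runs of tokens that match a listed compound, trying the longest run first."""
    max_len = max((len(key) for key in compounds), default=1)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest possible run first, down to runs of two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])  # lowercase comparison
            if key in compounds:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No compound starts at this position; copy the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


compounds = load_compounds(["hot_dog", "New_York", "ice_cream_cone"])
print(compoundify("I ate a Hot dog in new york".split(), compounds))
# ['I', 'ate', 'a', 'Hot_dog', 'in', 'new_york']
```

Because both sides are lowercased before comparing, "Hot dog" matches the entry hot_dog even though the casing differs.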

WARNING:
Compoundifying a set of wordrow files from a tagged corpus leaves them unaligned, unless the tool that uses them together (temporarily) joins the compounds again or does something equivalent.
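
To make the alignment problem concrete, here is a small Python sketch. It assumes a word row and a tag row that were parallel (one tag per word) before compoundifying, and shows one possible "equivalent thing": temporarily splitting the compounds in the word row so the two rows line up again. The example tokens and tags, and the helper name, are made up for illustration.

```python
def split_compounds(tokens):
    """Undo compoundifying by splitting each token on underscores."""
    parts = []
    for token in tokens:
        parts.extend(token.split("_"))
    return parts


words = ["I", "ate", "a", "hot_dog"]      # compoundified word row: 4 tokens
tags = ["PRP", "VBD", "DT", "JJ", "NN"]   # untouched tag row: 5 tags

# The rows no longer have the same length, so pairing them by position is wrong.
print(len(words) == len(tags))            # False

# Splitting the compounds again restores the one-to-one pairing.
restored = split_compounds(words)
print(list(zip(restored, tags)))
```

Note that this simple split would also break up tokens that contained an underscore before compoundifying, so a real consumer may need to track which joins were introduced.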

The configfile is configcorpuscompoundify.pl

Synopsis

./corpuscompoundify.pl
[-c <corpus>] [-f <fromsubdir>] | [-fd <fromdir>] [-tc <tocorpus>] [-t <tosubdir>] | [-td <targetdir>] [-o <compoundsfile>] | [-od <compoundsdirfile>] [-ps <partsubdir>] [-p <part>] | [-pd <partdir>] [-pf <divprtfile> [...]] [-dr] [-? = -h = -help = --help] [-v [<verboselvl>]]

-c
the corpus that needs to be compoundified
-f
subdir below corpus where the wordrow version of the corpus can be found (defaults to wordrow if none is specified and if it's not changed in the configfile)
-fd
full path to the wordrow-formatted corpus
-tc
target-corpus. If none is set the source corpus is used
-t
the subdir relative to the corpusdir in which the compoundified corpus should be stored. Note that this option is only possible if the corpus is given with the -c or -tc option (not as a full path with the -td option)
-td
full path to the place where the compoundified corpus should be stored
-o
the compoundsfile (within taskData/lists/ normally)
-od
full path to the compoundsfile including the filename
-ps
subdir in which the divisions are
-p
the name of a division to use as a part. The file names in all the divprt files in the division-dir are added together and used as the list of files to process, unless you specify one or more divprt files explicitly using the -pf option
-pd
full path to the division-dir to use as a part. This option causes the -p option to be ignored
-pf
one or more divprt files to use as the part instead of all files in the division
-dr
dry-run. Nothing is written or deleted, only reading and reporting is done
-v
the level of verbosity, default verboselevel = 2, available levels: 0,1,2,3
-?
(and equivalents) prints help: the purpose and the synopsis
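
The way -p and -pf select files can be pictured with the Python sketch below. It only models the selection rule described above; the division is represented as an in-memory mapping from divprt file name to the file names it lists, which is an assumption made for illustration (the on-disk format of divprt files is not described here), and all names in the example are hypothetical.

```python
def files_for_part(division, chosen_divprt=None):
    """Return the files to process for a part.

    division: mapping of divprt file name -> list of corpus file names (assumed shape).
    chosen_divprt: divprt names given with -pf, or None to use the whole division.
    """
    names = chosen_divprt if chosen_divprt else sorted(division)
    files = []
    for name in names:
        files.extend(division[name])
    return files


division = {
    "train.divprt": ["a.txt", "b.txt"],
    "test.divprt": ["c.txt"],
}

print(files_for_part(division))                    # all divprt files in the division
print(files_for_part(division, ["train.divprt"]))  # only the one named with -pf
```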

NOTE:
If no from-dir and to-dir are given, the defaults in the config file are used.

You can download (or look at the sources of) CorpusCompoundify [here]. To run it you will also need [the config file] and the [fiauimenre library]. You can also get the entire tool-package (containing the newest version of all fiauimenre tools and the library) [in one download]
Document last modified Fri, 29 Jul 2005 04:10:51
All content is available under the GNU Free Documentation License. The LogiLogi-system is under the GPL