Purpose

Prepares a corpus for the TnT tagger. It should only be run on a corpus that is in the refined format. The corpus may be tagged (as in the output of the CorpusRefiner, where tags and words are separated by '__'). It puts all words and punctuation in the first column and all tags in the second column (unless run with -nt, in which case the tags are removed). The converted files are saved to a different dir (the default can be set in the config file).
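The column conversion described above can be illustrated with a minimal Python sketch. It is not the tool's own Perl code. It assumes that tokens on a refined line are separated by whitespace, that each token reads word__tag (the order is an assumption), and that the two output columns are separated by a tab. The function name refined_to_tnt and its parameters are hypothetical.

```python
def refined_to_tnt(line, keep_tags=True, sep="__"):
    """Turn one refined line into rows with one token per row.

    Assumptions (not confirmed by the tool's documentation):
    tokens are whitespace-separated, written as word<sep>tag,
    and the output columns are tab-separated.
    """
    rows = []
    for token in line.split():
        # Split on the last separator so a word containing '__' stays whole.
        word, found, tag = token.rpartition(sep)
        if not found:
            # No separator: the token is untagged.
            word, tag = token, ""
        if keep_tags and tag:
            rows.append(f"{word}\t{tag}")
        else:
            # Mirrors -nt: only the word column is kept.
            rows.append(word)
    return rows
```

Calling refined_to_tnt("The__DT dog__NN") gives one row per token with the word first and the tag second; passing keep_tags=False drops the tag column, as -nt is described to do.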

The config file is configcorpus2tnt.pl.

Synopsis

./corpus2tnt.pl
[-c <corpus>] [-f <fromsubdir>] | [-fd <fromdir>] [-tc <tocorpus>] [-t <tosubdir>] | [-td <targetdir>] [-nt] [-l [<linkdir>]] [-ps <partsubdir>] [-p <part>] | [-pd <partdir>] [-pf <divprtfile> [...]] [-dr] [-? = -h = -help = --help] [-v [<verboselvl>]]

-c
the refined (& tagged) corpus
-f
subdir below corpus where the refined (& tagged) corpus can be found (defaults to refined if none is set & it's not changed in the config file)
-fd
full path to the data to work on (conflicts with -c and -f)
-tc
target-corpus where the tnt-ready stuff should be stored. If none is set the source corpus is used
-t
the subdir, relative to the corpus dir, in which the TnT-ready corpus should be stored. Note that this option is only possible if the corpus is given with the -c option (not as a full path with the -fd option). If none is specified, it defaults to origtagged when the corpus is tagged and -nt is not specified, and to untagged otherwise
-td
full path to the place where the TnT-ready corpus should be stored
-nt
the corpus is untagged or it is tagged but the existing tags must be removed (makes the subdir default to untagged)
-l
the name of the dir that should be added as a symlink pointing to the dirtree (give no value to use the one from the config file, normally 'tagged')
-ps
subdir in which the divisions are
-p
the name of a previous division to use as a part. The file-names in all the divprt-files in the division-dir are added and used as the list of files to use unless you specify one or more explicitly using the -pf option
-pd
full path to the division-dir to use as a part. This option causes the -p option to be ignored
-pf
one or more divprt files to use as the part instead of all files in the division
-dr
dry-run. Nothing is written or deleted, only reading and reporting is done
-v
the level of verbosity, default verboselevel = 1, available levels: 0,1,2,3
-?
(and equivalents) prints help: the purpose and the synopsis

NOTE:
If no from- and to-dirs are given, the defaults in the config file are used.
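As an illustration of the default rule for the target subdir (see -t and -nt), here is a minimal Python sketch. It is not taken from the tool's source: the function name and its parameters are hypothetical, and how the tool itself detects whether a corpus is tagged is not stated here, so that is passed in as a plain flag. The config file may override these defaults.

```python
def default_target_subdir(corpus_is_tagged, no_tags_flag, explicit=None):
    """Pick the subdir for the TnT-ready corpus.

    explicit         -- value given with -t, or None if -t was not used
    corpus_is_tagged -- whether the source corpus carries tags (assumed input)
    no_tags_flag     -- whether -nt was given
    """
    if explicit is not None:
        # An explicit -t value always wins over the defaults.
        return explicit
    if corpus_is_tagged and not no_tags_flag:
        return "origtagged"
    # Untagged corpus, or tags removed with -nt.
    return "untagged"
```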

You can download (or look at the sources of) Corpus2TnT [here]. To run it you will also need [the config file] and the [fiauimenre library]. You can also get the entire tool-package (containing the newest version of all fiauimenre tools and the library) [in one download]
Part of the LogiLogi Network: The LogiLogi Foundation - LogiLogi.org - OgOg.org
This is an old version for archival purposes, see www.LogiLogi.org for the current version.
Document last modified Fri, 29 Jul 2005 06:53:46
All content is available under the GNU Free Documentation License. The LogiLogi-system is under the GPL