Corpus ST penn Stripper


Short name: penn

The CorpusSTpennStripper library strips the penn (Penn Treebank) corpus. The available actionlist options are the characters of 'mtrdpn'; 'd' and 'p' are not implemented yet.

Each actionlist character stands for a yes/no action:

'm' strips mess ('==='s, everything after '@'s, and '[]'s)
't' strips tags (everything directly after a '/')
'r' replaces the '/'s that separate tags and words with '__'s
'd' removes dots that are not separate entities (as in acronyms, or in words like Corp.); not implemented yet
'p' replaces % with percent, 35 with number, etc.; not implemented yet
'n' removes anything that is not a normal word, a tag, or punctuation

If no actionlist is given, it will do all actions by default.
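
The actions described above can be illustrated with a small sketch. This is not the CorpusSTpennStripper code: the function name strip_penn, the order in which the actions run, and the exact meaning given to 'm' and 'n' (in particular, that numbers do not count as normal words) are assumptions made for illustration. 'd' and 'p' are left out, as they are in the library.

```python
import re

# Assumed set of characters that count as punctuation for the 'n' action.
PUNCT = set(".,;:!?\"'`()-")
# Assumed shape of a "normal word": letters, optionally joined by ' . or -,
# with an optional trailing dot (so "Corp." and "U.S." still count).
WORD_RE = re.compile(r"[A-Za-z]+(?:['.-][A-Za-z]+)*\.?")


def strip_penn(text, actions="mtrn"):
    """Hypothetical sketch of the 'm', 't', 'r' and 'n' actions."""
    out = []
    for line in text.splitlines():
        if "m" in actions:
            line = line.split("@", 1)[0]        # drop everything after an '@'
            line = re.sub(r"={3,}", " ", line)  # drop '===' separator runs
            line = line.replace("[", " ").replace("]", " ")
        for token in line.split():
            # Split on the last '/', which separates the word from its tag.
            word, sep, tag = token.rpartition("/")
            if not sep:
                word, tag = token, ""
            if "n" in actions:
                is_word = WORD_RE.fullmatch(word) is not None
                is_punct = bool(word) and all(c in PUNCT for c in word)
                if not (is_word or is_punct):
                    continue                    # not a word or punctuation
            if tag and "t" not in actions:
                joiner = "__" if "r" in actions else "/"
                out.append(word + joiner + tag)
            else:
                out.append(word)                # 't' given, or no tag present
    return " ".join(out)


sample = "[ Pierre/NNP Vinken/NNP ]\n,/,\n[ 61/CD years/NNS ]\nold/JJ ./.\n======"
print(strip_penn(sample, "mtn"))  # Pierre Vinken , years old .
print(strip_penn(sample, "mr"))
```

In this sketch 'r' has no visible effect when 't' is also given, because the tags are already gone by the time the separator would be replaced.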
This is an old version for archival purposes, see www.LogiLogi.org for the current version.
Document last modified Fri, 29 Jul 2005 06:58:42
All content is available under the GNU Free Documentation License. The LogiLogi-system is under the GPL