Corpus ST penn Stripper

Short name: penn

The CorpusSTpennStripper library strips the penn (Penn Treebank) corpus. The available actionlist options are 'mtrdpn'; 'd' and 'p' are not implemented yet. Each actionlist character stands for a yes/no action:

'm': strip mess ('===' runs, everything after an '@', and '[' and ']' brackets)
't': strip tags (everything directly after a '/')
'r': replace the '/' that separates a word from its tag with '__'
'd': remove dots that are not separate entities (as in acronyms, or in words like Corp.)
'p': replace '%' with 'percent', a number such as 35 with 'number', etc.
'n': remove anything that is not a normal word, a tag, or punctuation

If no actionlist is given, all actions are performed by default.
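The 'm', 't' and 'r' actions can be illustrated with a small Python sketch. This is not the library's own code or interface: the function name strip_penn, its signature, and the input sample are hypothetical. It also makes three assumptions the description above does not settle: "everything after an '@'" is read as "up to the end of the token", 't' takes precedence over 'r' when both are requested, and 'd', 'p' and 'n' are accepted but ignored.

```python
import re


def strip_penn(text, actions="mtrdpn"):
    """Hypothetical sketch of the 'm', 't' and 'r' actions.

    'd', 'p' and 'n' are accepted but ignored in this sketch.
    """
    if "m" in actions:
        text = re.sub(r"={3,}", " ", text)   # runs of '=' used as separators
        text = re.sub(r"@\S*", "", text)     # assumption: '@' plus the rest of the token
        text = re.sub(r"[\[\]]", " ", text)  # chunk brackets

    result = []
    for token in text.split():
        # Split on the last '/', so the tag is whatever directly follows it.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No '/' in the token: keep it unchanged.
            result.append(token)
        elif "t" in actions:
            # Assumption: stripping tags wins over replacing the separator.
            result.append(word)
        elif "r" in actions:
            result.append(word + "__" + tag)
        else:
            result.append(token)
    return " ".join(result)


sample = "======\n[ Pierre/NNP Vinken/NNP ]\n,/,\n[ 61/CD years/NNS ]\nold/JJ"
print(strip_penn(sample, "mt"))
print(strip_penn(sample, "mr"))
```

With the sample above, the "mt" call yields the bare words and punctuation, while the "mr" call keeps each tag joined to its word by '__'.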
