Spam Research


This page collects the papers I wrote and the materials I used in my SPAM research.

The tools (which are GPLed and written in Perl) may be especially interesting for anyone planning research on textual material that can be divided into small parts along a time series.

Among them are tools for fully automated trend and style analysis (their output is ready to read into statistical software such as SPSS). There are also some advanced word-counting tools and some tools for creating simple lexicon files. Each tool performs one small task, which keeps the set flexible. The tools for each research task are joined together by a Perl run-script called sgoal.pl.
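
As an illustration of the kind of small word-counting step described above, here is a minimal sketch in Python (not the Perl tools themselves). It assumes the mails are available as (period, text) pairs; the function names `count_words_per_period` and `to_csv` and the sample data are hypothetical and not taken from the package. It counts words per period and writes a CSV table that statistical software such as SPSS should be able to import.

```python
import csv
import io
import re
from collections import Counter, defaultdict

# Very simple tokenizer: lowercase letters and apostrophes only (an assumption).
WORD_RE = re.compile(r"[a-z']+")


def count_words_per_period(mails):
    """Return {period: Counter of words} for an iterable of (period, text) pairs."""
    counts = defaultdict(Counter)
    for period, text in mails:
        counts[period].update(WORD_RE.findall(text.lower()))
    return counts


def to_csv(counts, words):
    """Return CSV text with one row per period and one column per requested word."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["period"] + list(words))
    for period in sorted(counts):
        writer.writerow([period] + [counts[period][w] for w in words])
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical sample data, only to show the shape of the input.
    sample = [
        ("1997-01", "Make money fast"),
        ("1997-01", "Free money"),
        ("1997-02", "free FREE offer"),
    ]
    print(to_csv(count_words_per_period(sample), ["money", "free"]))
```

Keeping counting and output in separate functions mirrors the idea of giving each tool one small task, so that either step can be swapped out.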

The tools can be [downloaded here] (34MB)

The SPAM research projects I have done so far are:

The SPAM Trend Research:
It is about the trends in SPAM from 1997 to 2003 (on a corpus of 70,000 mails). The two semantic clusters Sex and Money were compared. For this research I developed a formula, called Sformula, that does query expansion for trend analysis.

[Download paper] - it's in Dutch and it's a draft...
[Download package] (17MB) - contains various result-files and the paper.
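
To make the idea of comparing two semantic clusters over time concrete, here is a minimal sketch in Python. It is not Sformula and does no query expansion: it simply uses fixed, hypothetical seed word lists for each cluster and reports, per year, the share of mails that contain at least one seed word. The names `CLUSTERS` and `cluster_share_per_year` and the seed words are assumptions made for illustration only.

```python
import re
from collections import defaultdict

# Very simple tokenizer: lowercase letters and apostrophes only (an assumption).
WORD_RE = re.compile(r"[a-z']+")

# Hypothetical seed words; the real clusters are defined in the paper.
CLUSTERS = {
    "sex": {"sex", "adult", "xxx"},
    "money": {"money", "cash", "loan"},
}


def cluster_share_per_year(mails, clusters=CLUSTERS):
    """Return {year: {cluster: share of mails mentioning a seed word}}.

    `mails` is an iterable of (year, text) pairs.
    """
    hits = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for year, text in mails:
        totals[year] += 1
        words = set(WORD_RE.findall(text.lower()))
        for name, seeds in clusters.items():
            if words & seeds:  # at least one seed word occurs in this mail
                hits[year][name] += 1
    return {
        year: {name: hits[year][name] / totals[year] for name in clusters}
        for year in sorted(totals)
    }
```

Calling `cluster_share_per_year` on a list of (year, text) pairs gives one row of shares per year, which can then be plotted as a line graph per cluster.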

The SPAM Style Research:
Different aspects of style are counted in SPAM from 1997 until April 2004. These results were interpreted using Principal Component Analysis and plain old line graphs. Again the two semantic clusters Sex and Money were used, but this time I investigated their styles.

Download paper [pdf] [doc] [sxw] - English
[Download package] (1.9MB) - contains the paper and various result-files, many of which are SPSS tables and graphs.

The Tweak-Test:
This one is more about Sformula and WordNet than about SPAM, but it's applied to the SpamCorpus. Sformula doesn't score badly. The paper does uncover two small mistakes in Sformula, though.

Download paper [pdf] [doc] [sxw] - Dutch
[Download package] (3.2MB) - contains the paper and various result-files, many of which are SPSS tables and graphs.


Other interesting downloads
A lexicon containing lists of adjectives, adverbs, verbs, nouns, auxiliaries and proper names. It's OK, but it's not perfect.

[Download lexicon] (1.3MB)

Statistics about the word counts and about the number of mails in the corpus.

[Download stats]

Note that I can't offer the SpamCorpus as a download, because spammers could use it to tune their mails against spam filters.

I got the spams from [Paul Wouters], who asked me not to give them to anyone else...
This is an old version for archival purposes, see www.LogiLogi.org for the current version.
Document last modified Mon, 04 Oct 2004 13:18:54
All content is available under the GNU Free Documentation License. The LogiLogi-system is under the GPL