Anti-Spam Methods – State-of-the-Art
W. Gansterer, M. Ilger, P. Lechner, R. Neumayer, J. Strauß
Institute of Distributed and Multimedia Systems
Faculty of Computer Science
University of Vienna, Austria
March 2005
This report summarizes the results of Phase 1 of the project FA 384018 “Spamabwehr”
of the Institute of Distributed and Multimedia Systems at the University of Vienna,
funded by Mobilkom Austria, UPC Telekabel and Internet Service Providers Austria
(ISPA).
We would like to thank Mobilkom Austria, UPC Telekabel and Internet Service
Providers Austria (ISPA) for their support which made this research project possible.
We also would like to express our gratitude to all those commercial vendors of anti-spam tools who provided us with their products for experimental investigations, as well as to the volunteers who provided us with private e-mail messages for testing purposes.
Copyright:
© 2005 by University of Vienna. All rights reserved. No part of this publication may be reproduced or distributed in
any form or by any means without the prior permission of the authors. The Institute of Distributed and Multimedia
Systems at the University of Vienna does not guarantee the accuracy, adequacy or completeness of any information
and is not responsible for any errors or omissions or the result obtained from the use of such information.
Note: Experimental data not to be used for ranking purposes
Since the objective of this report was the analysis of existing methodology and not a comprehensive and detailed
evaluation or comparison of available anti-spam products/tools, the results of our experiments must not be interpreted
as a “ranking”. In order to produce a sound basis for a rigorous “ranking” of various anti-spam products/tools more
effort has to be spent on defining comparable parameter settings and on fine tuning.
About the Authors
Project “Spamabwehr” was launched in summer 2004 at the Department of Computer
Science (Distributed Systems group) which, due to internal restructuring at the
University of Vienna, became the new Institute of Distributed and Multimedia Systems
at the Faculty of Computer Science.
The team:
Dr. Wilfried Gansterer (project leader),
Michael Ilger, Peter Lechner, Robert Neumayer and Jürgen Strauß.
From left to right: J. Strauß, M. Ilger, P. Lechner, W. Gansterer, R. Neumayer
Contact for Project “Spamabwehr”:
phone: +43-1-4277-39650
e-mail: Each team member can be contacted at [email protected]
The institution:
The Faculty of Computer Science (Fakultät für Informatik) is currently led by Dean
Prof. Dr. Günter Haring. The Institute of Distributed and Multimedia Systems, headed
by Prof. DDr. Gerald Quirchmayr, is one of the institutes within this faculty.
Institute of Distributed and Multimedia Systems
University of Vienna
Lenaugasse 2/8, A-1080 Vienna (Austria)
Table of Contents

EXECUTIVE SUMMARY
1. INTRODUCTION
1.1. WHAT IS “SPAM”?
1.2. STATISTICAL DATA
1.2.1. Total Amount of Spam
1.2.2. Sources of Spam
1.2.3. Content of Spam
1.3. THE ECONOMIC BACKGROUND
1.3.1. Why Spam?
1.3.2. Damage Caused by Spam
1.3.3. Conclusion
1.4. THE TECHNICAL BACKGROUND
1.4.1. Simple Mail Transfer Protocol
1.4.2. Internet Message Format
1.4.3. Spammers’ Techniques
2. ANTI-SPAM METHODS
2.1. QUALITY CRITERIA FOR ANTI-SPAM METHODS
2.2. SENDER SIDE (=PRE-SEND) METHODS
2.2.1. Increasing Sender Costs
2.2.2. Increasing Spammers’ Risk
2.3. RECEIVER SIDE (=POST-SEND) METHODS
2.3.1. Approaches Based on Source of Mail
2.3.2. Approaches Based on Content
2.3.3. Using Source and Content
2.4. SENDER AND RECEIVER SIDE
2.4.1. IM 2000
2.4.2. AMTP
3. PRODUCTS AND TOOLS
3.1. OVERVIEW
3.1.1. Quality Criteria
3.1.2. Comparisons of Anti-Spam Software
3.2. COMMERCIAL PRODUCTS
3.2.1. Symantec Brightmail Anti-Spam
3.2.2. Kaspersky Anti-Spam
3.2.3. SurfControl E-Mail Filter for SMTP
3.2.4. Symantec Mail Security for SMTP
3.2.5. Borderware MXtreme Mail Firewall
3.2.6. Ikarus mySpamWall
3.2.7. Spamkiss
3.3. OPEN SOURCE
3.3.1. SpamAssassin
3.3.2. CRM 114
3.3.3. Bogofilter
4. PERFORMANCE EVALUATION
4.1. SOURCES FOR OUR OWN SAMPLES
4.1.1. University of Vienna
4.1.2. Mobilkom Austria
4.1.3. UPC Telekabel
4.2. TEST SAMPLE DESCRIPTION
4.2.1. Our Test Sample
4.2.2. SpamAssassin Test Sample
4.3. EXPERIMENTAL SETUP
4.3.1. Windows Test Process
4.3.2. Linux Test Process
5. EXPERIMENTAL RESULTS
5.1. OUR TEST SAMPLE
5.1.1. Commercial Products
5.1.2. Open Source Tools
5.1.3. Conclusion
5.2. SPAMASSASSIN TEST SAMPLE
5.2.1. Commercial Products
5.2.2. Open Source Tools
5.2.3. Conclusion
6. CONCLUSION
6.1. METHODS
6.2. EXPERIMENTS
7. LIST OF FIGURES
8. LIST OF TABLES
9. INDEX
10. BIBLIOGRAPHY
Executive Summary
This report summarizes the findings and results of the first phase of the project “FA
384108 Spam-Abwehr” (“Spam-Defense”) which was launched in July 2004 at the
Department of Computer Science and Business Informatics at the University of Vienna
and is supported by Mobilkom Austria, UPC Telekabel and Internet Service Providers
Austria (ISPA).
This document is structured as follows.
(1) Section 1 provides an introduction into the topic by discussing definitions,
summarizing recent spam statistics, and reviewing the relevant economic and
technical background.
(2) Section 2 summarizes the “state-of-the-art” of methods for detecting and
avoiding spam. Although this is an extremely difficult task in general – not
only due to the nature of the problem and due to the enormous dynamics of
current research and development activities, but also due to restricted access
to information about proprietary methods – we were able to cover the most
important approaches at a methodological, scientifically oriented level. We
tried to be as comprehensive as possible within the given space and time
limitations, and most of the important methods are covered briefly (without
going into details).
Our survey of methods is based on a categorization of anti-spam methods
(see Figure 5), which we developed and which provides an abstract
framework for all existing anti-spam approaches.
(3) Section 3 contains an overview and a short description of a few anti-spam
products and tools which we had access to. This comprises commercial
products (Symantec-Brightmail, Kaspersky, SurfControl, Borderware, Ikarus)
as well as important open source tools (SpamAssassin, CRM 114,
Bogofilter).
(4) Section 4 summarizes the setup we used for experimenting with the products
and tools mentioned above. In particular, it describes the two test sets
containing spam and ham messages (one of them we collected ourselves from
various sources, the other one is publicly available) and the hardware we
used.
(5) Section 5 summarizes our experimental results in detail. Detection rates and
false positive rates are given for each of the products and tools used.
Since the goal of this report was an analysis of existing methodology, and not
a comprehensive and detailed evaluation or comparison of anti-spam
products/tools available, the results of our experimental evaluation must not
be interpreted as a “ranking”. In order to produce a rigorous ranking, we
would have to use a wider variety of test sets, we would have to spend much
more effort on tuning the products and tools (which can be an enormously
time consuming task), we would have to monitor their performance over a
longer period of time, and we also would have to take into account other
properties beyond detection rates (such as user-friendliness, administrative
overhead, etc.). Thus, the results quoted should be considered approximations
of the performance achievable with the respective tools.
In many cases, our results are reasonably good indications of the performance to be expected – experience shows that even with greater tuning effort, detection rates usually cannot be expected to increase much.
(6) Finally, Section 6 summarizes our findings and conclusions.
Table 1 provides a compact overview of the products and tools we experimented with and of some of the most common anti-spam methods, indicating which product/tool uses which method.

[Table 1 is a matrix pairing each product/tool with the methods it uses; the individual matrix entries are omitted here. Products/tools: Symantec Brightmail AntiSpam, Kaspersky AntiSpam, SurfControl E-Mail Filter, Symantec Mail Security for SMTP, MXtreme Mail Firewall, Ikarus mySpamWall, SpamAssassin, CRM 114, Bogofilter. Attributes and methods: Commercial/Open Source, Service, Operating System, Appliance, proprietary methods, Whitelist, Blacklist, SPF, Challenge/Response, Token based Challenge/Response, Greylist, Fingerprint, Bayes, Neural Networks, DCC, Pyzor, Razor, URL Whitelist, URL Blacklist, Static techniques (Keywords, ..), Digital Signature, Hashcash, SVM.]

Table 1: Products/tools considered, methods used by these, further remarks see page 2

Legend: C…Commercial, O…Open Source, W…Windows, L…Linux, Prop…Proprietary
1. Introduction
Among the strengths of electronic communications media such as electronic mail (e-mail) are the relatively low transmission costs, high reliability and generally fast delivery. Electronic messaging is not only cheap and fast; it is also easy to automate. Obviously, these properties also make it very attractive for commercial advertising
purposes, and in recent years we have experienced a development where electronic
messaging is abused by flooding users’ mailboxes with unsolicited messages.
Spamming is the act of sending unsolicited (commercial) electronic messages in
bulk, and the word spam has become the synonym for such messages. This word is
originally derived from spiced ham (luncheon meat), which is a registered trademark of
Hormel Foods Corporation [1]. Monty Python’s Flying Circus used the term spam in
the so-called “spam sketch” as a synonym for frequent occurrence – and someone adopted this for unsolicited mass mail. Based on the origin of the word spam, all other (desired) e-mail is called ham. Official terminology for spam, unsolicited bulk e-mail (UBE) or unsolicited commercial e-mail (UCE), is introduced in Section 1.1.
The most common purpose for spamming is advertising. Offered goods range from
pornography, computer software and medical products to credit card accounts,
investments and university diplomas. Many of these products have an ill-reputed or
questionable legal nature. The main motivation for spamming is commercial profit. As
we mentioned above, the costs for sending millions of spam mail messages are very
low. In order to make good profit, it suffices if only a very small fraction (0.1% or even
less) of the spam e-mail messages sent out are replied to and lead to business transactions.
Spam has severe negative effects on e-mail users. Obviously, it consumes computer,
storage and network resources as well as human time and attention to dismiss unwanted
messages. Moreover, it has various indirect effects which are very difficult to account
for – the spectrum ranges from measurable costs like spam filter software and administration to non-measurable costs like a lost e-mail (expensive for a business, not that expensive for a private person).
We can distinguish five different types of spam: Beyond e-mail spam, there is
messaging spam (often called spim – spam using instant messaging), newsgroup spam
(excessive multiple postings in newsgroups), mobile phone spam (text messages), and
Internet telephony spam (via voice over IP).
Perspective Taken. In this report, our focus is on summarizing the state-of-the-art in
methods and techniques for detecting and avoiding e-mail spam, because it is the most
common kind of spamming on the Internet. In the future, other types of spam, in particular spam sent to all kinds of mobile devices (cellular phones, PDAs, etc.), will become more of a problem. Although some technical aspects in this context will require further investigation, most of the methods discussed in this report will also be of relevance there.
Moreover, we emphasize approaches suitable at the server side (ISP) as opposed to
methods designed for the client side (individual user), since an important goal is to find
centralized solutions suitable for ISPs. In such a context, user feedback – if feasible –
can be one way to control and improve quality, but it should not be an integral part of
anti-spam methods.
We have also experimented with a few implementations of commonly available
methods in order to illustrate their performance in practice. Our goal was not to produce
a complete survey, evaluation, comparison or even “ranking” of existing anti-spam
products. The tools considered in the evaluation were picked based on availability and
on the methods they implement – we cannot and do not intend to claim any form of
completeness in terms of anti-spam products or tools considered in this report.
Synopsis. In the remainder of this section, we provide a more detailed overview of
background information related to the spam topic: official terminology and definitions
of “spam”, statistics about spam, the technical background relevant to the phenomenon
“spam”, and finally the economic background that explains the motivation of
spammers.
In Chapter 2, we outline the most important anti-spam methods based on a
classification we have developed. In Chapter 3, we survey a few existing commercial
and non-commercial tools. In Chapter 4 we discuss how the performance of anti-spam
methods and software can be evaluated, and we describe the setup of our experimental
evaluation. In Chapter 5 we summarize and interpret the results of our tests with two
different samples (our own test set and a publicly available test set) and in Chapter 6,
we summarize our conclusions.
1.1. What is “Spam”?
First, we need to carefully define some central terminology. As mentioned before, the
commonly used word “spam” was derived from a completely different context. In this
section, we summarize official and/or more technical terminology used for spam, such
as unsolicited commercial e-mail (UCE) or unsolicited bulk e-mail (UBE).
Unsolicited Commercial E-Mail (UCE): “E-Mail containing commercial
information that has been sent to a recipient who did not ask to receive it” [2], or:
“Unsolicited e-mail is advertising material sent by e-mail without the recipient either
requesting such information or otherwise explicitly expressing an interest in the
material advertised.” [3]
Unsolicited Bulk E-mail (UBE): “E-Mail with substantially identical content sent
to many recipients who did not ask to receive it. Almost all UBE is also UCE.” [2], or:
“Unsolicited Bulk E-Mail, or UBE, is Internet mail (‘e-mail’) that is sent to a group of
recipients who have not requested it. A mail recipient may have at one time asked a
sender for bulk e-mail, but then later asked that sender not to send any more e-mail or
otherwise not have indicated a desire for such additional mail; hence any bulk e-mail
sent after that request was received is also UBE.” [4]
In our opinion, no e-mail that is solicited can be considered spam. However, there
may be spam which is not sent out in bulk or which does not involve (direct)
commercial interest. Ultimately, classification of an e-mail message as spam often
becomes a highly subjective decision and it is very difficult – if not impossible – to
establish common criteria covering a wide range of affected users. Nevertheless, based
on the statements mentioned above, we identify three central features, which we
consider defining properties of spam (not always all three of them have to apply):
1. It is unsolicited, that is, mail the receiver did not request.
2. It is sent out in bulk, that is, to many recipients.
3. Usually, there is commercial interest involved, for example, interest in
advertising (and selling) some product.
Two more relevant technical terms have been established for the special context of
newsgroup postings:
Excessive Multi-Posting (EMP): “Excessive Multi-Posting (EMP) refers to sending
the same (or nearly the same) message, one by one, to multiple newsgroups. Multiple
posting is almost never recommended because (a) multiple messages are better sent by
cross-posting and (b) follow-ups will be posted in different newsgroups.” [3]
Excessive Cross-Posting (ECP): “Excessive Cross-Posting (ECP) refers to sending
a message to many newsgroups all at once. Sometimes, if a message could belong in
more than one group, it can be useful to cross-post. However, in this case, an
appropriate ‘Follow up-To:’ header can ensure that the discussion continues only in
one designated newsgroup. Cross-posting to too many groups is considered spam,
especially if no ‘Follow up-To:’ header is included.” [3]
1.2. Statistical Data
Only ten years ago, the spam problem did not exist. In the last one or two years it has
become a major concern not only for private users, but also for businesses due to the
potential economic and commercial damage. In this section, we try to give a short
overview of current spam statistics to illustrate the development and the dynamics of
the issue. Unfortunately, all available statistics have one major disadvantage – they are
outdated quickly.
There are many sources of information about statistics on the development of spam,
such as [5] or [6].
1.2.1. Total Amount of Spam
The amount of spam sent over the Internet has been rising dramatically in recent years
and no decline is to be expected in the near future. This is clearly illustrated by data
about the share of spam in the total number of e-mail messages sent. In 2000, only 7%
of the messages sent worldwide were spam [7], whereas in 2002 already 40% were
spam [8]. The current percentage is estimated to be around 65% worldwide with much
higher estimates for some regions (for example, up to 90% for the USA). According to
that trend, some pessimists even predict the end of the e-mail infrastructure for 2007
[9].
Until July 2004, the anti-spam software developer Brightmail published monthly
statistics about spam, as shown in Figure 1.
Figure 1: Percentage of e-mail identified as spam, June 2004 [10] (no newer data available)
Messagelabs [11] published Figure 2, which reflects the interaction between
legislative measures / law enforcement and the percentage of spam in e-mail scanned.
Figure 2: Interaction of legislative measures, law enforcement and percentage of spam [11]
It is clearly visible that the percentage of spam in all e-mail messages sent still has
an increasing trend, but it also tends to react significantly to the introduction of new
legislative measures and to legal actions taken against spammers. This interpretation
also has to be seen in the light of the conjecture that 80% of the spam sent worldwide
comes from very few (roughly 200) distinct spammers [12].
1.2.2. Sources of Spam
Having illustrated that the percentage of spam in e-mail messages sent worldwide
increases rapidly, we also tried to collect some statistics about where the spam comes
from, both in terms of geographic regions as well as in terms of Internet domains. A
ranking of the most spam-producing countries (as of 24th August 2004) according to
the anti-virus and anti-spam vendor Sophos [13] is shown in Table 2:
1.  United States        42.53%
2.  South Korea          15.42%
3.  China (& Hong Kong)  11.62%
4.  Brazil                6.17%
5.  Canada                2.91%
6.  Japan                 2.87%
7.  Germany               1.28%
8.  France                1.24%
9.  Spain                 1.16%
10. United Kingdom        1.15%
11. Mexico                0.98%
12. Taiwan                0.91%
    Others               11.76%

Table 2: The top twelve sources of spam, geographically [13]
For collecting the data summarized in this table, researchers used honey pots1 to collect spam. It is interesting to note that, compared to the data from February 2004, Canada reduced its rate from 6.8% to 2.9%, whereas South Korea tripled its rate. In general, about 40% of the world’s spam is sent out from “zombie computers”2 [8].
Figure 3, provided by the anti-spam vendor Commtouch [14], illustrates which domains spammers prefer for sending out spam.
1 The term “honeypot” (spam trap) refers to an e-mail address never published to humans. Any e-mail sent to such addresses has to be spam.
2 The term “zombie computer” refers to a computer infected with viruses of all kinds and misused for sending out spam.
Figure 3: The top ten sources of spam (domains) [14]
Postini [15] provides some interesting statistics investigating the sources of spam
and of directory harvest attacks3. On their graphic illustration, Austria seems to be
among the hotspots of spammers’ activity. Upon closer examination, it turns out that
the visual impression is due to the three following entries [15]:
48.22   16.37   AT VIENNA WIEN (state) RIPE   ev_dictatk   20881
48.22   16.37   AT VIENNA WIEN (state) RIPE   ev_spamatk   26
48.22   16.37   AT VIENNA WIEN (state) RIPE   ev_dictatk   211
Whereas the first three entries in each row specify the location (latitude, longitude, additional location information), the last two entries describe the event type and the intensity of the attack. That means there were 26 spam attacks (=ev_spamatk) and more than 20,000 directory harvest attacks (=ev_dictatk) originating from Austria during the last six months.
1.2.3. Content of Spam
As we will illustrate in detail in Section 1.3, the central motivation for sending out spam
is to make money and profit. Spammers use the direct marketing approach and make their money mostly through marketing itself, only rarely also through direct selling. Figure 4 shows a
categorization of spam in terms of its content based on data from Brightmail [10].
3 A directory harvest attack is the theft of confidential e-mail directory information, for example of lists of e-mail addresses of all employees of an organization.
Figure 4: Spam categorized in terms of content (data from July 2004 [10]): Products 25%, Financial 18%, Adult 15%, Scams 9%, Health 8%, Other 6%, Internet 5%, Fraud 5%, Leisure 4%, Spiritual 3%, Political 2%
1.3. The Economic Background
In this section, we give a short overview of the economic business model of spammers. There is a simple reason for the recent dramatic rise in the amount of spam
sent – it seems to be a relatively easy way to make money. We will outline why
spamming can be so profitable. Later, in Section 2.2, some approaches trying to reduce
this motivation and thus fighting the spam problem “at its source” will be discussed.
1.3.1. Why Spam?
There are many sources of information dealing with the reasons why spam is around, for example [16].
Spam is sent out by companies and by individuals, but primarily for a single reason –
to make profit using a new form of direct marketing. Classical direct marketing, using
methods such as brochures, TV and radio spots, telephone calls, doorstep sales, etc. has
been used for a long time. For these marketing methods, the costs associated with every
step in this process are significant. More importantly, those costs for direct marketing
increase proportionally with the number of potential customers reached and revenue is
only created by selling real products or services. In this classical approach, fraud is largely excluded because an initial investment in advertising is necessary in order to make money down the line.
With the availability of e-mail communication, new direct marketers were able to
reduce the costs for direct marketing to a negligible amount in proportion to the number
of potential customers reached. This increases the margin of profit considerably. In the
following, we describe this business model (spammers’ costs and revenues) in more
detail.
1.3.1.1. Cost Factors
The following list summarizes a few of the cost factors characteristic for spamming
businesses.
Product: Most of the spammers do not sell anything to the recipients of spam – they
are just acting as marketers (thus, spammers do not have any investments for actually
purchasing products).
Marketing Material: The creation of an e-mail does not need any highly specialized
software or knowledge. Thus, producing the marketing material is very cheap, and –
one of the most important differences to classical marketing – the costs for sending out
marketing material do not increase proportionally with the number of potential
customers reached.
Spam Tools: Tools for generating and sending out millions of personalized e-mail messages are available, very inexpensive (often even free) and easy to use.
Address Harvesting Tools: Tools for collecting addresses automatically on the Internet can be downloaded [17][18], and addresses can be bought or rented [19].
Although the price for e-mail addresses depends on their “quality”, they are generally
quite inexpensive.
The Spam Campaign: Set up an Internet connection (for example, a free trial
account), send out millions of messages from this account in a short period, and move
to the next ISP for getting a new (free) account.
Other Costs: These include hardware and maintenance costs, but may also include
costs for responding to interested buyers (automated, in order to avoid personal
interaction, for example, via a Web interface).
1.3.1.2. Profit Factors
In the following, we list some of the most important sources of income and profit for
spammers.
Direct Income: The most common form of income for spammers is that they act as
marketing companies and are paid for marketing campaigns.
Web Banner Revenues: In many cases, spammers get revenue for every visit to a Web site advertised in a spam e-mail.
Validation of Contact Information: Another source of income for spammers is to
validate e-mail addresses (for example, responses to “unsubscribe here” invitations in
spam messages) and to sell this information to other spammers or direct marketing
companies.
Sell Spam Business Models: The above is a special case of a more general concept
where spammers sell the information collected from responses to spam messages to
others.
Scams: In many cases, spam messages are hidden attempts to find out personal or
access information (“phishing”), such as credit card information, bank account
information, etc., which can then be used for criminal activities (theft, illegal
investment, etc.). Other kinds of scam could be: dubious job offers, ponzi schemes4,
Internet gambling, auctions, sexual offers and pre-paid purchase orders with no supply
of the ordered goods [20].
Product Selling: Only a minority of companies who send out spam are also selling
the advertised products themselves.
1.3.1.3. An Example
The following example gives an impression how spammers’ businesses operate. The
description is based on the interview [21] with an anonymous spammer who runs a
rather small-scale operation.
The spammer used an account at Send-safe [22], which allowed him to send out
400,000 e-mail messages via open proxies for US$ 50. On average, he sent out
approximately 61,000 e-mail messages per day. The recipients were taken from a CD
containing 4,000,000 e-mail addresses, which he bought for 300 Euros from [23]. It
turned out that only 56% of the addresses on the CD were syntactically correct, and
25% of these bounced due to full or out-of-use mailboxes. For the Web site referred to in his spam, he used bulletproof hosting5 in China via Worldsoftwarehouse [24], which charges Euro 125 per month. He used a link counter [25] to get an idea of how many persons viewed his e-mail (by counting how often it is opened in an e-mail client). On average, about 30 persons per day ordered 2.5 units of the product offered.
Table 3 summarizes this operation for a typical month (one month = 30 days, prices
are given in Euro6).
Quantity of E-Mail:
Mail sent                        61,000*30    1,830,000 (100%)
Mail viewed by user              19,136*30      574,080 (31.37%)
Visitors on spamvertized site       359*30       10,770 (0.59%)
People ordering                      22*30          660 (0.036%)
4 An investment swindle in which some early investors are paid off with money put up by later ones in order to encourage more and bigger risks.
5 The term bulletproof is used to indicate that nothing can shut down the hosting service. Such services can enable the sending of spam without the threat of Web site account cancellation.
6 For this data, the conversion rate 1 US$ = 0.75913 Euro was used.
Fixed costs:
E-Mail addresses     300                                   -300.--
Hosting cost         125                                   -125.--
Link counter         19.90*0.75913                          -15.11
Variable costs:
Open proxy cost      50/400,000*1,830,000*0.75913          -173.65
Purchase of goods    660*2.5*2.95                        -4,867.50
Wrapping fee         660*2.5*0.50                          -825.--
Total cost                                               -6,306.26
Revenue:
Sales revenue        660*2.5*8                           13,200.--
Result (monthly profit before taxes):                     6,893.74

Table 3: Cost-profit equation of a spammer (simplified, monthly basis)
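For illustration, the bottom line of Table 3 can be recomputed in a few lines of Python; this is a sketch of the arithmetic only, with variable names of our own choosing and all prices taken from the table above.

# Recompute the simplified monthly cost-profit calculation of Table 3.
USD_TO_EUR = 0.75913                 # conversion rate quoted in footnote 6
DAYS = 30                            # one month = 30 days

mails_sent = 61_000 * DAYS           # 1,830,000 messages per month
units = 22 * DAYS * 2.5              # 660 orders of 2.5 units each

fixed_costs = 300 + 125 + 19.90 * USD_TO_EUR         # addresses, hosting, link counter
open_proxy = 50 / 400_000 * mails_sent * USD_TO_EUR  # US$ 50 per 400,000 messages
variable_costs = open_proxy + units * 2.95 + units * 0.50  # goods and wrapping

profit = units * 8 - (fixed_costs + variable_costs)
print(f"monthly profit before taxes: {profit:,.2f} Euro")  # 6,893.74 Euro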
Although some costs like computer hard- and software, Internet access costs and
taxes are not included here, this simple example shows that spamming is highly
profitable.
To get an impression of really large operations where millions of spam messages are
sent per day, see [26].
1.3.2. Damage Caused by Spam
This huge amount of spam not only occupies enormous resources, it obviously also
causes a dramatic loss of productivity. Whereas individuals only lose bandwidth and
time (which is hard to quantify), companies potentially lose much more. Ferris
Research [27] estimated productivity losses for US corporations at 10 billion dollars for 2003 [28]. The European Union estimated a loss of productivity of 2.5 billion Euros for 2003 [7]. Various tools to gauge the losses for one’s own company are available online (see, for example, [29] or the simpler version [30]).
1.3.3. Conclusion
As illustrated before, e-mail communication currently provides an excellent means for
spammers to make high profits from sending out spam and related activities. All other
Internet users, including individual users, businesses and ISPs, suffer from all the
damaging side effects of spamming activities, reaching from mailboxes filled with junk
mail, which threatens the usefulness of e-mail as a means of communication, to all sorts
of other costs in terms of bandwidth usage, storage requirements, and last, but not least,
manpower required to fight spam.
It seems obvious that advanced approaches for fighting the spam problem must
include strategies to make spamming less attractive – not only by increasing the risks
for spammers through stricter legal regulations, but also by harming spammers’
business model, that is, by decreasing the potential margins of profit. In Section 2.2 we
will discuss some approaches in this direction in detail. For best results, the
infrastructure of the current Internet should be changed partly – and a lot of further
work is to be done – see also [31].
1.4. The Technical Background
The rules for transmitting an e-mail message and the composition of the message itself
are defined in several protocol standards. In this section, we discuss the two basic
underlying protocols, the Simple Mail Transfer Protocol (SMTP, RFC2821 [32]) and
the Internet Message Format (RFC2822 [33]).
The main objective of SMTP is to support reliable and efficient mail transfer. The
Internet Message Format defines the structure of e-mail messages.
1.4.1. Simple Mail Transfer Protocol
SMTP was originally developed in 1982 (RFC821 [34]), and has been consolidated and
updated in 2001 [32]. SMTP is independent of the particular transmission subsystem
and only requires a reliable ordered data stream channel. In the Internet Protocol Stack
[35] it is located at the application layer (layer 4) and uses TCP for data transmission.
An e-mail message usually consists of three different parts – the SMTP envelope, the
header and the body. SMTP specifies a set of commands to transmit an e-mail message
between an SMTP client and an SMTP server. The exchange of these commands
between the client and the server forms the SMTP envelope and is known as the so-called SMTP dialogue. A minimum SMTP implementation consists of nine commands.
There is also a service extension model that permits the client and server to agree to
utilize shared functionality beyond the original SMTP requirements. Table 4 shows a
typical communication scenario.
#   Station   Command and meaning
1   Server:   <wait for connection on TCP port 25>
2   Client:   <open connection to server>
3   Server:   220 receiving.server.com ESMTP server ready.
4   Client:   helo sending.server.com
5   Server:   250 Hello, sending.server.com
6   Client:   mail from: [email protected]
7   Server:   250 Sender ok, send RCPTs
8   Client:   rcpt to: [email protected]
9   Server:   250 [email protected]
10  Client:   data
11  Server:   354 Start mail input; end with <CRLF>.<CRLF>
12  Client:   mail text…
13  Client:   .
14  Server:   250 2.6.0 [email protected] Queued mail for delivery
15  Client:   quit
16  Server:   221 receiving.server.com Service closing transmission channel

Table 4: Typical SMTP dialogue
The first step in the SMTP dialogue (lines 1-3) establishes the connection initiated
by the client. The standard SMTP port is 25. After connection establishment, the server
replies with code 220 (service ready, line 3). Only the reply code is relevant for the communication; the text after it can vary. Now the client sends a “helo” command (line 4).
In line 5, the server replies with code 250 (requested mail action okay) finishing the
SMTP handshake. After specifying sender and recipient (lines 6-9), the client uses the
“data” command (line 10) to tell the server that now the message itself will be
transferred. The server acknowledges (line 11) whereupon the client specifies the
content of the message (line 12). The end of the message is indicated by a “.” (line 13).
After receipt, an acknowledgement is sent to the client including an internal message
number assigned by the server (line 14). At the end of the communication, the client
sends a “quit” command (line 15) to close the transmission channel. The server
confirms with code 221 (Service closing transmission channel, line 16). With the
exception of the IP address of the client, any information provided by the client within
the SMTP dialogue can be forged (cf. [36]).
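To make the dialogue of Table 4 concrete, the following Python sketch replays it over a raw TCP socket. The host name and addresses are placeholders modelled on the table, so a reachable SMTP server accepting them is assumed, and multi-line replies are read naively.

import socket

HOST = "receiving.server.com"   # placeholder; an accessible SMTP server is assumed

def chat(sock, command=None):
    # Send one SMTP command (if given) and print the server's reply line.
    if command is not None:
        sock.sendall(command.encode("ascii") + b"\r\n")
    print(sock.recv(1024).decode("ascii", "replace").rstrip())

with socket.create_connection((HOST, 25)) as s:
    chat(s)                                               # 220 service ready
    chat(s, "helo sending.server.com")                    # 250 hello
    chat(s, "mail from: <sender@sending.server.com>")     # 250 sender ok
    chat(s, "rcpt to: <recipient@receiving.server.com>")  # 250 recipient ok
    chat(s, "data")                                       # 354 start mail input
    chat(s, "mail text\r\n.")                             # 250 queued for delivery
    chat(s, "quit")                                       # 221 closing channel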
1.4.2. Internet Message Format
So far, we have discussed the communication scenario between the client and the server
and not the message itself. The Internet Message Format [33] specifies how an e-mail
message is composed. Generally, an e-mail consists of a header and a body.
Message Header. All header fields have the same general syntactic structure: A field
name, followed by a colon, followed by the field body. The header fields can be
grouped into “originator fields”, “destination address fields”, “identification fields”,
“information fields”, “resent fields”, “trace fields” and “optional fields”. The “trace
fields” are also discussed in [32].
Table 5 summarizes the most important header fields in the Internet Message
Format.
Originator fields:
From:         Specifies the author of the message
Sender:       Sender of the message
Reply-To:     Reply address

Destination address fields:
To:           Primary recipient(s)
CC:           Other recipients
BCC:          Blind carbon copy (addresses are not submitted to the other recipients)

Identification fields:
Message-ID:   Unique message identifier

Information fields:
Subject:      Subject of the message

Trace fields:
Return-path:  The address to which messages indicating non-delivery or other mail system failures are sent.
Received:     When an SMTP server accepts a message, either for relaying or for delivery, it inserts a trace record including the sending and the receiving host and arrival date and time of the message.

Table 5: The most important header fields in the Internet Message Format [33]
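As a small illustration, the header fields of Table 5 can be inspected with the email package from the Python standard library; the message below is a made-up example.

from email import message_from_string

raw = """\
From: Alice <alice@example.org>
To: Bob <bob@example.org>
CC: Carol <carol@example.org>
Subject: Meeting
Message-ID: <12345@example.org>
Received: from mail.example.org by mx.example.net; Tue, 1 Mar 2005 10:00:00 +0100

See you at noon.
"""

msg = message_from_string(raw)
print(msg["From"])              # originator field
print(msg["Message-ID"])        # unique message identifier
print(msg.get_all("Received"))  # trace records, one per relaying hop
print(msg.get_payload())        # the message body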
Message Body. The second part of a message, called message body, contains the
information itself and, if structured, is defined according to the MIME-Protocol
(Multipurpose Internet Mail Extension).
The message transfer between the original sender and the final recipient can occur in
a single connection or in a series of hops through intermediary systems. The relaying of
messages from unknown sources to unknown destinations causes one of the biggest
problems of today’s mail traffic because spammers often use open relays7 for
transmitting their mail.
In Section 2.3.1 we will discuss existing methods for identifying spam based on
header information in detail. Additional detailed information is also given in the
diploma thesis [36].
1.4.3. Spammers’ Techniques
Without being able to claim completeness, we briefly mention a few common
techniques used by spammers to hide their tracks. Most of them do not send out e-mail
messages through their ISP, but instead they try to connect to the destination mail
server directly or to use open relays. This, and the possibility to forge almost all information in the header as discussed above, makes it nearly impossible to trace a
spammer’s real address or even his identity.
There are many other techniques spammers use to mislead or bypass filters. Again,
we only give a very brief survey here; many of those techniques will be mentioned
again in the context of the respective anti-spam method in Chapter 2. One important
technique, though very demanding in terms of processing power, is the receiver personalization of every message (no BCC, every receiver gets his “own” e-mail) in order to obscure bulk mailing. Less processor-intensive is the randomization of the
7 SMTP or ESMTP server that provides everyone with unrestricted relaying services.
subject field and the “From:” address line. Other techniques commonly used are forging
the Message-ID, omitting the “To:” header, or adding random words and strings to a
message in order to mislead Bayes filters.
Table 6 shows some of the main techniques used by spammers and how their
approach has shifted in the last two years in response to the development of anti-spam
methods.
Table 6: Adaptation of spammers’ techniques to development of filtering techniques [37]
2. Anti-Spam Methods
In recent years, a vast number of methods and techniques for coping with the spam
problem have been proposed and developed, ranging from legal countermeasures to
very technical approaches. This is also reflected in a large amount of publications on
that topic. In order to bring some structure into this enormous amount of information
we are introducing a categorization of anti-spam methods, shown in Figure 5.
Our basic distinction is between methods “acting” before an e-mail is sent out (“pre-send”), methods “acting” after the message has been sent out (“post-send”), and new
regulations “acting” during the transfer of an e-mail (new protocols for mail transfer).
This comprises virtually all existing approaches, ranging from attempts to decrease the
amount of spam sent out to approaches based on text analysis and classification
methods applied to a received e-mail.
Figure 5: Categorization of anti-spam methods
In this chapter, we will discuss all these methods in detail. We will also point out
relations between relevant techniques, and evaluate them from the perspective taken in
this study.
2.1. Quality Criteria for Anti-Spam Methods
Before we can discuss evaluations of anti-spam methods, we need to carefully define
the quality criteria available for such an evaluation. All anti-spam methods which
involve the process of deciding whether an (incoming or outgoing!) e-mail message is
spam or not can be viewed as binary classifiers. This includes almost all available
methods (in Figure 5, most “pre-send” methods, except for the methods contained in
“increase sender risks”, and all “post-send” methods, but not the approaches contained
in “new protocols”). For these categories of anti-spam methods, central concepts and
important terms can be “borrowed” from the areas (text) classification and data mining,
as illustrated in the following. Thus, for these types of methods there is a welldeveloped and clearly defined framework for evaluating their performance, as
summarized in the following.
In order to evaluate the quality of an anti-spam method (which can be viewed as a
binary classifier), we let it classify the members of a given set of e-mail messages (test
set) into two groups, in our case spam and ham, more generally, positives and
negatives.8 There is no single concept like “overall correctness” to measure the
performance of a binary classifier (for two classes). Assuming that the correct
classification of the test set is known, we can count how many of the messages in the
test set have been classified correctly (true positives and true negatives) and how many
of these messages have been misclassified by the anti-spam method under
consideration. This gives us a first set of (absolute) quality criteria: true/false positives
and true/false negatives.
In the context of anti-spam methods (and in all upcoming parts of this report) we
follow the widespread conventions to use the term “positives” for denoting spam
messages, and the term “negatives” for denoting ham messages. Consequently, any
message will be classified as “positive” (spam) or “negative” (ham) by the anti-spam
method. If this message actually is spam, but it was (wrongly) classified as negative, it
is called a “false negative”. If it actually is ham, but it was (wrongly) classified as
positive, it is called a “false positive”.
Table 7 summarizes this concept. Each row corresponds to the known type of a
message, and each column denotes the class assigned by a binary classifier. According
to the table, a positive can be either a true positive, a spam message classified as spam,
or a false positive, a ham message classified as spam. On the other hand, a ham
message assigned to the ham group is a true negative, whereas a spam message that is
classified as ham is a false negative.
Message / classified as    Spam (positives)    Ham (negatives)
Spam                       true positive       false negative
Ham                        false positive      true negative

Table 7: Quality metrics of binary classifiers for the spam problem
8 Which of the two given classes is denoted as positives or negatives is up to the beholder.
Based on these quantities, relative quality criteria of a binary classifier can be
defined:
sensitivity = true positives / (true positives + false negatives)

and

specificity = true negatives / (true negatives + false positives).
Both of these quality metrics are between zero and one (often quoted as a
percentage), and each of them measures the correctness per class. The sensitivity of a
spam classifier is the proportion of all spam messages that are classified as spam.
The closer to one the value of the sensitivity, the more spam is classified correctly.
Specificity denotes the correctness for the negatives or ham, respectively.
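As a sketch, the quantities of Table 7 and the two metrics can be computed from a labeled test set as follows; the labels and predictions are purely illustrative.

def sensitivity_specificity(actual, predicted):
    # Count the four cases of Table 7, with spam as positive and ham as negative.
    tp = sum(a == "spam" and p == "spam" for a, p in zip(actual, predicted))
    fn = sum(a == "spam" and p == "ham" for a, p in zip(actual, predicted))
    tn = sum(a == "ham" and p == "ham" for a, p in zip(actual, predicted))
    fp = sum(a == "ham" and p == "spam" for a, p in zip(actual, predicted))
    return tp / (tp + fn), tn / (tn + fp)

actual    = ["spam", "spam", "spam", "ham", "ham"]   # known types
predicted = ["spam", "spam", "ham", "ham", "spam"]   # classifier output
print(sensitivity_specificity(actual, predicted))    # (0.666..., 0.5)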
Theoretically, it is possible to achieve 100 per cent sensitivity and specificity
(imagine a human classifying blue and red objects). However, in the practice of anti-spam methods, there is often a tradeoff between achieving low false positive and high
true positive rates, that is, the goal not to classify any legitimate mail as spam (because
they might otherwise never reach their destination) can often only be achieved at the
cost of an increased rate of false negatives.
2.2. Sender Side (=Pre-Send) Methods
In this section, we discuss methods “acting” before an e-mail is sent out (“pre-send”
methods), that is, the leftmost part in the categorization introduced in Figure 5. The
underlying idea of pre-send methods is to discourage sending of spam, in some sense to
fight the problem before it occurs. We can distinguish two basic approaches in this
category: strategies to increase the costs for sending e-mail, which harm spammers’
business model as discussed in Section 1.3, and strategies to increase the risks for
sending UBE/UCE in the form of stricter legal regulations and stricter enforcement of
these regulations.
2.2.1. Increasing Sender Costs
Many interesting solutions for stopping spam e-mail are economically motivated. The
main idea is to make the spammers’ business model unprofitable. Two concrete
strategies have been developed to achieve this – delaying the sending of each e-mail
(technical solution) or introducing monetary fees for each e-mail (money-based
solution). However, conceptually, the approach of increasing sender costs comprises
not only monetary fees and costs in terms of time delays, but also other abstract costs.
2.2.1.1. Technical Solutions
Most technical solutions are based on CPU time: The sender of an e-mail is required to
compute a moderately expensive function – a so-called pricing function – before the e-mail is actually sent. Since in general e-mail is not expected to be a medium for real
time communication, such a moderate delay for each e-mail is expected not to have any
significance for the average regular e-mail user, who may in most cases not send much
more than 20-50 e-mail messages a day, but it is very disturbing for a spammer,
because it reduces the number of potential customers reached per unit of time (for
details, see [31]).
Since there is no need to change the SMTP protocol, it is easy to install such a
system with a pricing function. There is a major drawback of this approach, though – the lack of fairness of most pricing functions found so far. Ideally, a pricing function
system should be “fair” in the sense that the delay it causes is independent of the
hardware of the computer system. Many solutions have been proposed, for example,
CPU-bound functions, memory-bound functions, or Turing-tests [38]. Especially
different ways of using memory-bound functions currently receive a lot of attention
[39][40]. An example for a Turing-type test based on human interaction is mentioned in
Section 2.3.1 (SFM). However, finding a pricing function that leads to at least comparable delays on old, slow computers and on the latest hardware remains an open question.
In the following, we take a closer look at a relatively widespread and well-known representative of technical solutions to increasing sender costs – Hashcash, which is based on a CPU-bound function.
Hashcash [41] is a software plug-in for mail clients which adds Hashcash stamps to sent e-mail. Adding a Hashcash stamp means inserting a line starting with “X-Hashcash:” into the header of a message, as shown in Table 8:
FROM:        someone <[email protected]>
TO:          max mustermann <[email protected]>
Subject:     test Hashcash
Date:        xx.xx.200x
X-Hashcash:  0:030626:[email protected]:6470e06d773e05a8

Table 8: A typical X-Hashcash header
In order to create a Hashcash stamp, the resource CPU time needs to be “spent” (on
an average desktop computer, a few seconds). One stamp is required for each individual
recipient (even if it is sent as BCC) and it indicates the degree of difficulty of a task
performed in order to “spend” CPU time. It is expected that the more difficult this task is (and thus, the more CPU time is spent) for an e-mail, the less likely this e-mail is spam. Thus, Hashcash stamps can be used as (part of) a criterion whether to accept an
e-mail message or not.
Technically, the tasks used by Hashcash are based on hash functions, more
specifically on so-called partial hash-collisions. A hash function H is a cryptographic
function for which it is supposedly hard to find two inputs that produce the same
output. A collision occurs if two inputs do produce the same output: H(x) == H(y)
although x != y.
Common hash functions, like for example MD5 or SHA1, are designed to be
collision resistant (it is very hard to find SHA1(x) == SHA1(y) where x != y). For
common hash functions, computing a full collision is almost impossible, but partial
collisions can be found more easily. In contrast to a full collision, where all bits of, for
example, SHA1(x) must match SHA1(y), for a k-bit partial collision only the k most
significant bits of SHA1(x) and SHA1(y) have to match. On a 400 MHz PII, a 16-bit
partial collision for SHA1 can be computed in about one third of a second, whereas
computing a 32-bit collision would take seven hours.
Hashcash uses the recipient’s mail address and the current date as inputs for the
hash-collision.
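A minimal sketch of minting such a stamp in Python: rather than colliding two arbitrary hashes, the stamp is varied by a counter until its SHA-1 hash has k leading zero bits (a partial collision with the all-zero output). The stamp format mirrors the X-Hashcash line of Table 8; the recipient address is a placeholder.

import hashlib
from datetime import date
from itertools import count

def leading_zero_bits(digest):
    # Count the leading zero bits of a hash digest.
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()   # zero bits in the first non-zero byte
        break
    return bits

def mint_stamp(recipient, k=16):
    # Try counters until the SHA-1 hash of the stamp has k leading zero bits.
    day = date.today().strftime("%y%m%d")
    for counter in count():
        stamp = f"0:{day}:{recipient}:{counter:x}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= k:
            return stamp

print(mint_stamp("recipient@example.org"))   # takes a moment, by design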
2.2.1.2. Money Based Solutions
The basic idea behind money-based solutions is to “pay” some amount of (possibly
symbolic) currency (micro payment) for each e-mail to be sent. The idea is that an e-mail is more likely to be ham the higher the amount paid for its delivery. In the
following, we describe a concrete proposal for implementing this idea.
The Lightweight Currency Protocol (LCP) [42] is a relatively simple mechanism
which allows organizations to issue their own generic currency. The issuer (for
example, an ISP) and the currency holder (the user) both generate a public/private key
pair. LCP is a request/response protocol where the issuer of the currency is the server
and the holder of the currency is the client.
Based on this protocol, Turner et al. [42] also propose a payment mechanism where
servers require payment for accepting incoming messages. The mail transfer agent is
responsible for organizing the payment, so the client is not involved. Currently, delivery
costs per e-mail message are estimated to be about 0.01 US cent (which corresponds to
US$ 100.- for 1,000,000 e-mail messages). Even if the price was raised to 1 US cent per
e-mail (which corresponds to US$ 10,000 for 1,000,000 messages), sending e-mail
would still be very cheap compared to sending snail mail (which costs more than 20 US
cents per letter).
The implementation of this payment mechanism based on LCP proceeds as follows:
A user spends a particular amount of his currency by sending a transfer-funds message
to the issuer, who in turn identifies the recipient’s public key, the amount and a
transaction-id. If the sender has a sufficient balance of funds, the issuer will debit the
sender’s account by the amount requested and credit the account of the recipient. The
recipient verifies the payment and the issuer responds with an account activity
statement.
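The message formats of LCP itself are defined in [42]; the following toy sketch merely illustrates the issuer-side bookkeeping just described, and all names in it are hypothetical.

```python
class LCPIssuer:
    """Toy model of an LCP currency issuer: balances are tracked per
    holder public key, and transfer-funds requests debit the sender
    and credit the recipient."""

    def __init__(self):
        self.balances = {}  # public key -> balance in currency units

    def transfer_funds(self, sender_key, recipient_key, amount, txn_id):
        if self.balances.get(sender_key, 0) < amount:
            return {"txn": txn_id, "status": "insufficient-funds"}
        self.balances[sender_key] -= amount
        self.balances[recipient_key] = (
            self.balances.get(recipient_key, 0) + amount)
        # The response plays the role of the account activity statement.
        return {"txn": txn_id, "status": "ok",
                "sender-balance": self.balances[sender_key]}

issuer = LCPIssuer()
issuer.balances["alice-pubkey"] = 100
print(issuer.transfer_funds("alice-pubkey", "bob-pubkey", 1, "txn-0001"))
```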
Each user can earn some quota of a currency, and spammers would be forced to make investments to purchase credits using real-world money, which narrows their profit margin. Since the costs for sending e-mail messages increase linearly with the number of messages sent in this model, spammers would be forced to increase their rate of return and thus focus their efforts on recipients with a high probability of generating revenue (which contradicts the current business model of spammers illustrated in Section 1.3).
2.2.1.3. Strengths and Weaknesses
Generally speaking, methods for increasing sender costs and thus harming the business model of spammers are a very interesting and promising approach to addressing the spam problem. In contrast to many other approaches, which tend to focus on the “symptoms” only, they try to fight the problem at its “root” and consequently avoid the resource demands common to all approaches that act later in the spamming process. Moreover, they are not user-specific and technically very accessible to ISPs and e-mail providers, and thus they fit very naturally into concepts suitable from the perspective of an ISP. However, there are still a few important shortcomings, which lead us to believe that these methods alone will not suffice, but rather will have to be integrated and combined with other approaches in “multibarrier concepts”.
In the area of technical solutions, one of the main open questions is how to adapt
pricing functions to different hardware. Whereas CPU-bound pricing functions (such as
Hashcash) suffer from a possible unfairness due to differences in processing speeds
among different types of computer systems, some experts expect memory-bound
pricing functions [39] to be less sensitive to this problem.
For an evaluation of technical solutions for increasing sender costs, it is
indispensable to carefully analyze the economic basis of a spammer’s business. If a
reduction from 10 million e-mail messages sent per day to (for example) 100,000 e-mail
messages destroys the business model of a spammer, then methods based on increasing
sender costs could be the “perfect” anti-spam method.
A simple example illustrates the promise of this approach as well as its potential shortcomings: if, due to a pricing function, it takes ten seconds to process an outgoing e-mail, one computer can send at most 8,640 e-mail messages in 24 hours. Without a pricing function, an estimated 2-3 billion spam messages can be sent per day – in order to achieve this output with the pricing function of this example, the spammer would need 250,000 to 375,000 computers. However, if pricing functions are needed such that even an average user can send out only two e-mail messages per hour and if users with old hardware experience even bigger delays, then this approach renders itself useless because it limits the effectiveness of e-mail as a communication medium in general.
Careful discussion of central questions (optimum type of pricing function, optimum
delay, etc.) at a scientific level is beyond the scope of this report, but is included in a
diploma thesis currently under preparation [31].
The main problems of money-based solutions are the relatively high administration
overhead and the fact that the very popular free e-mail accounts do not fit into this
strategy.
The last point leads to a potential general weakness of all current approaches to increasing sender costs – their success would require some degree of coordination among providers of e-mail services and the commitment of at least a significant part of those providers worldwide. If only a minority of e-mail services worldwide adopts policies for increasing sender costs, spammers will simply evade the obstacle and pick providers who do not implement such policies.
Similar to the careful analysis of the pricing functions still required for technical
solutions to increasing sender costs, the optimal fee structure in money-based solutions
still needs to be investigated carefully. The concept will be considered for practical
application only if it can be shown how to set it up in practice such that the amount of
spam is reduced significantly without burdening regular e-mail users too much.
2.2.2. Increasing Spammers’ Risk
A more focused approach than increasing the costs for everybody sending e-mail is to increase the risk spammers face due to their activity. This requires the introduction of new legal provisions for UBE/UCE and the development of case law and of an infrastructure to enforce these provisions.
2.2.2.1. Legal Provisions
The European Union and the United States of America have both decided to enact a legal basis for criminal prosecution of senders of UBE. Detailed information on the different legal systems of the United States, the European Union and other countries is available at [43] and in the summary given by Sabadello [44]. Since jurisdiction is not our area of expertise, we will only give a very short overview in this section.
Opt-In vs. Opt-Out. Generally speaking, one can distinguish between an opt-in and an
opt-out system for anti-spam regulations. Opt-in means that nobody is allowed to send
UBE unless the receiver has explicitly agreed to receive such messages. In an opt-out
system, anybody is allowed to send UBE to anybody else as long as the receiver has the
possibility to opt out at any time he wants, that is, to declare that he does not want to
receive such messages any more.
USA. The United States of America implemented the CAN-SPAM act [45] on January 1, 2004. This is an opt-out system. It is very contested because (like any opt-out system) the act of opting out gives a spammer the possibility to verify that a mail address is valid. Consequently, if the receiver tries to opt out via an automated mechanism offered in the message, he may receive even more spam afterwards because his e-mail address could be “verified”.
Despite this potential weakness, the CAN-SPAM Act also provides a basis to deal with some major problems of unsolicited bulk e-mail: It is the basis for criminal prosecution of header-forging and of relaying commercial mail through open proxies or through other infrastructure used for concealing the sender’s identity, and it also prohibits address harvesting and dictionary attacks.
European Union. The European Union decided to implement an opt-in system. In June 2000, the European Parliament passed the directive on electronic commerce [46] and in June 2002 the directive on privacy and electronic communication [47], which form the basis for legal action. Consequences for sending UBE are not covered in these directives; it is incumbent upon the individual member states of the European Union to define them.
Austria. The current legal situation in Austria distinguishes between private individuals and companies. According to § 107 of the bill on Telecommunication [48], sending UBE to private individuals is not allowed without the prior agreement of the individual (opt-in).
more than 50 recipients. The situation for companies is completely different. In general,
it is allowed to send UBE to companies as long as the recipient has a possibility to opt
out (similar to the CAN-SPAM act). Moreover, the bill on Electronic Commerce [49]
introduced the maintenance of a so-called “Robinson List” containing all individuals
and companies that in no case want to receive UBE. This list has to be taken into
consideration even when the delivery of mail would be allowed by the bill on
Telecommunication.
2.2.2.2. Strengths and Weaknesses
Practical experience shows that, although there is some deterrent effect (cf. Figure 2), neither the legal framework of the United States nor that of the European Union will be able to completely solve the spam problem. Spammers do not care too much about any legal consequences because they can easily hide their identity or even move their operations to other countries where no legal basis for prosecution exists. Similar to approaches for increasing sender costs, legal action against spammers requires a much higher degree of coordination among countries worldwide than is currently the case.
2.3. Receiver Side (=Post-Send) Methods
In this category, we group all approaches that “act” at the receiver side after an e-mail has been sent. In contrast to the pre-send methods summarized in Section 2.2, which are proactive, the post-send methods tend to be reactive. They can be divided into three groups – methods based on the source of the e-mail, methods based on the content of the e-mail, and methods using both source and content.
2.3.1. Approaches Based on Source of Mail
In this section, we will illustrate the most important methods focusing on the source of an e-mail. This information is in most cases the client’s IP address, the sender’s domain or the full mailbox. It is usually extracted from the SMTP dialogue or the message itself. On this basis, it is possible to classify the source in three different ways. The first way is to decide whether the sender is a good or a bad one. Another possibility is to verify whether a source is authorized to use a claimed identity. The third method is to verify that the sender is willing to invest some additional effort to contact the receiver. In detail we discuss:
• Blacklists, whitelists (good/bad sender)
• Sender Policy Framework, Caller ID, Sender ID, DomainKeys (legitimate/non-legitimate sender)
• Greylists, ChoiceMail and SFM (challenge-response systems)
2.3.1.1. Is the Sender Good/Bad?
A blacklist is a database containing IP addresses or domain names that are suspected of sending spam. Any message coming from a domain or IP address appearing on a blacklist will be blocked. There are two main types of blacklists – real-time blacklists (RBL) with a centralized or distributed database and blacklists self-maintained by administrators (domain-level blacklists). The most effective way is using RBLs maintained by third-party organizations (for example [50]), which are usually DNS (Domain Name System) based. Figure 6 illustrates the basic principle of blacklists.
[Figure 6 depicts the four steps of the scenario: (1) the SMTP client (IP: 101.105.32.23) connects to the SMTP server (mysmtpserver.com); (2) the server performs a DNS lookup for 23.32.105.101.rbl.org; (3) rbl.org answers 127.0.0.2; (4) the server disconnects.]
Figure 6: Typical scenario for a blacklist
The SMTP client with the IP address 101.105.32.23 connects to mysmtpserver.com. Before the SMTP dialogue starts, the SMTP server records the IP address from the TCP connection and performs a DNS lookup at rbl.org. Rbl.org returns the reply code “127.0.0.2”, which indicates that the SMTP client is a source of UBE, whereupon the SMTP server aborts the connection to the client.
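This lookup can be sketched in a few lines of Python; the blacklist zone is a placeholder, and real deployments would consult one of the RBLs mentioned above [50].

```python
import socket

def dnsbl_listed(client_ip: str, zone: str = "rbl.example.org") -> bool:
    """Check an SMTP client's IP address against a DNS-based blacklist:
    the octets are reversed, prepended to the blacklist zone, and an
    answer in 127.0.0.0/8 indicates a listed (blocked) address."""
    query = ".".join(reversed(client_ip.split("."))) + "." + zone
    try:
        return socket.gethostbyname(query).startswith("127.0.0.")
    except socket.gaierror:
        return False  # NXDOMAIN: address is not listed

# The scenario of Figure 6:
if dnsbl_listed("101.105.32.23"):
    print("554 rejected: client listed as UBE source")
```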
A whitelist is a database containing IP addresses or domains that for sure do not send spam. Any message coming from a source appearing on a whitelist will bypass filtering. A whitelist is usually proprietary, but there are also global whitelists containing organizations that have committed not to send spam. These lists are usually controlled by a third-party organization (for example Habeas [51], Bonded Sender Program [52], Brightmail Safe List).
2.3.1.2. Is the Sender Legitimate?
Important efforts have also focused on developing methods and techniques for
determining whether the sender of an e-mail can be authenticated or whether he is
legitimate. This includes various kinds of policy frameworks [53] or digital signatures
[54].
The underlying idea is, on the one hand, that spammers do not want to be authenticated in order to avoid criminal prosecution and, on the other hand, that – for the same reason – spammers tend to fake header information in their e-mail (cf. Section 1.4), which may lead to inconsistent information (for example, the purported sender is not authorized for the purported sending mail server).
In the following, we will briefly summarize the most important techniques in this
area. They were originally submitted as proposals to the Internet Engineering Task
Force (IETF, www.ietf.org).
The following proposals for anti-spam standards for coping with the spam problem were submitted to the IETF in 2004:
• Caller ID: proposed by Microsoft, Sendmail, Amazon.com and Brightmail
• DomainKeys: proposed by Yahoo
• Sender Policy Framework (SPF): supported by AOL, GMX
SPF, Sender-ID and DomainKeys are concepts to eliminate the possibility of domain spoofing. The protocols try to leave the common transmission process unaffected and interoperate with SMTP in order to support their distribution and acceptance. All described protocols have in common that they use DNS for verifying e-mail. This causes additional network traffic, because every received e-mail must be checked against the domain specified in the e-mail address.
The Sender Policy Framework (SPF) [55], developed by Meng Wong and Mark Lentczner, uses the “MAIL FROM:” identity of the SMTP dialogue to verify the sender’s domain. This allows rejecting mail already within the SMTP dialogue. The protocol is a hybrid of the Designated Mailers Protocol [56] and the Reverse MX Protocol [57]. An SPF record designates the outbound SMTP servers of the sender’s domain. When an SMTP client connects to a mail exchanger, the server looks for an SPF record in the DNS tree of the claimed sender domain. If the result received from the DNS query contains the IP address of the client, the sender is authorized to use the domain in the “MAIL FROM:” argument. If not, the domain was spoofed.
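A much simplified illustration of this check is sketched below, assuming the third-party Python package dnspython for the DNS query. Only the literal “ip4” mechanism is handled here; a real SPF implementation has to evaluate all mechanisms (“a”, “mx”, “include”, “redirect”, ...) as specified in [55].

```python
import dns.resolver  # third-party package "dnspython"

def spf_authorizes(client_ip: str, mail_from_domain: str) -> bool:
    """Fetch the TXT records of the claimed sender domain and test
    whether the connecting client is listed via an ip4 mechanism."""
    try:
        answers = dns.resolver.resolve(mail_from_domain, "TXT")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return False  # no SPF record published
    for record in answers:
        txt = b"".join(record.strings).decode()
        if txt.startswith("v=spf1"):
            return f"ip4:{client_ip}" in txt
    return False

# Reject within the SMTP dialogue if the domain is spoofed:
# spf_authorizes("192.0.2.25", "example.org")
```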
The Caller-Id [58] concept, developed by Microsoft, realizes the same idea but uses the so-called “purported responsible address” (PRA) for verification. The purported responsible address refers to the mailbox that has directly initiated the transmission process. It is determined by inspecting the header of the message. For example, if the header contains a “From:” field and a “Sender:” field, the PRA is extracted from the “Sender:” field [59]. Both Caller-Id and SPF suffer from the fact that, when mail forwarding systems and mailing lists are involved in the transmission process, the IP address of the client often cannot be mapped to the domain of the sender. Therefore, additional concepts like SRS [60] (Sender Rewriting Scheme) must be implemented.
The Sender-Id [61] framework is the result of a merger between Caller-Id and SPF.
DomainKeys [62] also uses DNS, but the verification process works via digital signatures instead of IP addresses. The sending side of this variant consists of two steps:
• Set up: In this first step, the domain owner generates a public/private key pair. This key pair is used for signing all outgoing mail. The DNS holds the public key, and the private key is located at the outbound mail server.
• Signing: When an end user sends an e-mail, a digital signature is generated with the private key by the DomainKeys-enabled mail system.
For verifying an e-mail on the receiver side, three steps are necessary:
• Preparing: The DomainKeys-enabled system on the receiver side extracts the signature and the claimed “From:” domain from the e-mail headers and fetches the public key from the DNS for the claimed “From:” domain.
• Verifying: With the public key obtained from the DNS, the receiving e-mail system verifies whether the signature was generated by the matching private key.
• Delivering: If the e-mail was successfully verified, the message is delivered into the receiver’s inbox.
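The following sketch illustrates the two sending-side steps and the verification step with plain RSA signatures, using the third-party Python package “cryptography”. The actual DomainKeys wire format (header canonicalization, base64 encoding, DNS record syntax) defined in [62] is omitted, and the DNS fetch of the public key is simulated by using the generated key pair directly.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Set up: the domain owner generates a key pair; the public key would
# be published in DNS, the private key stays at the outbound server.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

message = b"From: [email protected]\r\n\r\nHello!"

# Signing: the DomainKeys-enabled outbound system signs the message.
signature = private_key.sign(message, padding.PKCS1v15(), hashes.SHA256())

# Verifying: the receiver fetches the public key for the claimed
# "From:" domain (simulated here) and checks the signature; verify()
# raises InvalidSignature if the message was tampered with.
public_key.verify(signature, message, padding.PKCS1v15(), hashes.SHA256())
print("signature valid: deliver to inbox")
```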
2.3.1.3. Challenge-Response
Challenge-response systems initially block or hold e-mail from unknown senders. The
senders are notified of the blocking, then required to prove they are human by taking a
“quasi-Turing test”. If they pass, the e-mail is delivered [63].
Challenge-response systems (also called reverse-whitelists or permission-based filters) maintain a list of permitted senders. E-mail from a new sender is temporarily held without delivery, and the sender gets an e-mail with a challenge back. This challenge can be clicking on a URL or replying to the e-mail. If spammers use fake sender e-mail addresses, they will never receive the challenge, and if they use real e-mail addresses, they will never be able to reply to all challenges in a certain amount of time.
Unfortunately, there are some limitations to this approach. If both communication partners use challenge-response systems, they will not be able to communicate with each other. Another shortcoming is that automated systems or mailing lists cannot respond to a challenge (for example, if a friend forwards you an interesting newsletter). The third problem is character recognition or pattern matching – these security challenge features are easy to bypass. And finally, a spammer may forge the e-mail address of a legitimate user.
There are many implementations of challenge-response systems – we take a closer
look at three different types – greylists, a human interaction system called ChoiceMail
and a subscription mail server called SFM (Spam Free Mail).
Greylisting [64] is an aggressive method for blocking spam. It exploits the fact that spam delivery is usually not failure-tolerant: because spammers often do not know whether their recipient addresses actually exist, they do not try to resend messages if an error occurs during the transmission process.
When a client connects to an SMTP server using a greylist, the server records the following information:
1. The IP address of the host attempting the delivery
2. The envelope sender address
3. The envelope recipient address
The server then compares this triplet to a local database. If no record matches, the message is refused with a “temporary failure” response and the triplet is stored. Usually, RFC-compliant MTAs try to resend such a message within a certain period of time. When the message is received a second time within a specified time slot (normally after a blocking timestamp and before the expiration date of the triplet), the message is delivered.
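A minimal sketch of this triplet logic follows; the time constants are illustrative, and a production implementation would also purge expired triplets and persist the database.

```python
import time

GREYLIST = {}        # (client_ip, env_from, env_rcpt) -> first seen
BLOCK_WINDOW = 300   # seconds a triplet must wait before a retry counts
EXPIRY = 4 * 3600    # seconds until an unconfirmed triplet expires

def greylist_response(client_ip, env_from, env_rcpt):
    """Return the SMTP response for a delivery attempt: temporary
    failure on first contact, acceptance on a timely retry."""
    triplet = (client_ip, env_from, env_rcpt)
    now = time.time()
    first_seen = GREYLIST.get(triplet)
    if first_seen is None or now - first_seen > EXPIRY:
        GREYLIST[triplet] = now  # new (or expired) triplet: store it
        return "451 4.7.1 greylisted, please try again later"
    if now - first_seen < BLOCK_WINDOW:
        return "451 4.7.1 greylisted, please try again later"
    return "250 OK"  # resent within the allowed time slot
```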
ChoiceMail [65] is available in different editions (free for home use, Server edition and Enterprise edition) and uses a challenge-response system.
If ChoiceMail cannot identify a message after checking it against your whitelist,
blacklist and any rules you are using, it sends a “registration request” to the sender (see
Figure 7).
Figure 7: Sender registration ChoiceMail
This short e-mail directs the sender to a Web page where he will be asked for his or
her name, e-mail address and reason for contacting you. The sender also will be asked
to fill in a code that appears on the screen as a graphic, something a person can do
easily but a computer cannot do at all.
This simple process eliminates almost all junk e-mail for two reasons. First,
spammers usually use invalid reply addresses and therefore never receive the
registration request. Second, spammers depend on automation, and the registration
response cannot be automated. The registration feature can be turned off.
SFM [66] is a subscription e-mail server whose service can be viewed as an extension of a traditional e-mail service. It allows the (mostly automatic) creation of multiple addresses/aliases for yourself, each restricted to a narrow population of legitimate senders.
The principles of operation are very simple. There are two types of dynamic addresses: publishable (=master) and personal (aliases). An alias is intentionally restricted to a single contact or a group. If someone is trying to contact you for the first time, he sends an e-mail to your master address. This message never reaches its destination; instead, the sender gets a challenge like this:
Dear Human Being:
To reach the recipient, please use this address:
For more information, and also if you cannot see the image
that has arrived with this message, please follow THIS
LINK.
The purpose of this procedure is to eliminate Email abuse
Figure 8: Challenge of SFM
A new alias remains open for a predetermined amount of time, and during this time anyone can use it to send you messages. Whenever this happens, the sender’s address is added to the alias’s personalization list. Once the alias is closed, it will only accept e-mail from senders on the personalization list.
When a message is sent through the server, it locates the proper alias personalized to the recipient or, if no such alias is available, generates a new alias personalized to the recipient on the fly, and forwards the message to the recipient, substituting the alias for your sender address.
2.3.1.4. Strengths and Weaknesses
Low resource requirements and ease of maintenance are the two main benefits of blacklists. Any spam message can be rejected before it is downloaded. Another advantage is that some spammers automatically remove e-mail addresses from their lists if an e-mail is rejected. Only a few configuration changes are necessary inside the server software.
A big disadvantage is the lack of granularity – either all of the e-mail from a given host is accepted, or all of it is rejected. Some spammers try to hide behind big ISPs and use Hotmail or AOL accounts for spamming (see Figure 3). One big problem of blacklists is the possible refusal of legitimate mail, because blacklists are often poorly maintained and not up-to-date.
Whitelists have limitations similar to those of blacklists. If a spammer spoofs a whitelisted address, he will get through. Whitelists must be updated regularly, which takes some time; overall, black- and whitelists typically stop only around 10% of spam [67].
Missing authentication of e-mail senders is one of the biggest weaknesses of the
current mail infrastructure but sender authentication alone will not solve the spam
problem. Sending unsolicited bulk e-mail would still be possible. The establishment of
a central authentication authority would be required for a secure environment but this
does not seem to be realistic.
One big disadvantage of challenge-response systems is that some MTAs do not
redeliver messages. Therefore, it is recommended to maintain a whitelist containing
these servers. Another general problem of challenge-response systems is the increased
mail traffic.
General Limitations of Header Based Approaches
The header of an e-mail includes various pieces of information about the sender and the mail infrastructure involved in the transmission process. Generally, any information given in the SMTP dialogue and the header can be forged, because no integrity checks and authentication mechanisms are defined in standard SMTP. The only reliable information is the IP address of the client. Spammers often forge header entries of an e-mail in order to inhibit the backtracking of their messages and keep their identity secret. There is no reason to forge a header entry other than to conceal one’s identity. The following analysis gives an overview of what can be forged and focuses only on entries that give information about the sender or the mail infrastructure involved.
“Return-Path:”: The Return-Path is just a record of the argument specified in the “MAIL FROM:” command during the SMTP dialogue. If that argument is forged, the Return-Path is not trustworthy either.
“Received:” Lines: “Received:” lines are the most important header entries for backtracking messages and for fixing bugs in a mail environment. RFC-compliant MTAs must prepend a “Received:” line to messages that are not routed in a private area. Like any other information, they can be forged easily, in a way that makes it impossible to distinguish whether they are manipulated or unaltered. If a spammer uses an open proxy, there is no reason to forge any “Received:” line, because the IP address of the spammer does not appear in the message. For the receiver’s purposes, only the lines processed by the receiver’s own MTAs can be trusted.
“Message-Id:”: If the Message-Id is structured as recommended in RFC 2822, it contains the sender’s domain. Spammers can use any domain to conceal their identity. Due to various mail-routing scenarios and the possibility of forgery, it is not possible to attach semantics to the domain part of the Message-Id.
“Date:”: Inconsistent “Date:” fields can be ascribed to various scenarios. They can result from forgery as well as from different time zones of sender and receiver. In addition, a bad configuration of the processing MTAs is possible, so a message cannot be classified as spam merely because an inconsistency is detected.
“From:”, “Sender:”, “Reply-to:”, “To:”: These fields can be forged easily. Only a few syntactic checks can be performed to verify that the entry in such a field may represent a valid mailbox.
The following example illustrates that, in most cases, it is impossible to identify a forged header.
Return-Path: <[email protected]>
Received: from mx6.univie.ac.at (mx6.univie.ac.at [131.130.1.49])
by atat.at (8.12.10/8.12.10) with SMTP id i7I5LSaT007420
for <[email protected]>; Wed, 18 Aug 2004 06:21:29 GMT
Received: from pakistan.com (pakistan.com [222.65.113.88])
by mx6.univie.ac.at (8.12.10/8.12.10) with SMTP id i7I5D126028130
for <[email protected]>; Wed, 18 Aug 2004 08:13:28 +0200
Message-ID: <[email protected]>
Date: Wed, 18 Aug 2004 15:14:18 +0900
From: "jamison tevlin" <[email protected]>
To: "Thornaper Maraja" <[email protected]>
Figure 9: Example of a forged header
Figure 9 shows the header of an unsolicited bulk e-mail which originated at “pakistan.com” and was sent to “xyz.at”. The source of the message was apparently the IP address 222.65.113.88 reported by the mail relay “mx6.univie.ac.at”. The reverse DNS lookup agrees with the client identification, which also appears in the Message-Id, the Return-Path and the “From:” address. An analysis of the timestamps also gives a consistent picture with respect to the different time zones of sender and receiver. The header seems to contain consistent information. The fact is, however, that the mail client at “xyz.at” can only trust the recorded IP address 131.130.1.49. Any other information given in the header can be forged. For example, it is possible that “mx6.univie.ac.at” is an open relay. In this case, the IP address 222.65.113.88 would most likely be the original source of the message. Whether the sender’s domain is indeed “pakistan.com” cannot be determined. On the other hand, it is also possible that all given information is correct.
A consistent header is no indication of legitimate mail and, because of the various scenarios occurring within the mail distribution process, the same holds for an inconsistent header. Plausibility checks can only detect very simple forgeries and cannot be used for efficient spam detection.
2.3.2. Approaches Based on Content
This section covers techniques used to analyze an e-mail message (or, more generally, a text document) according to its content. The task is not to fully understand a text’s meaning but rather to find significant features, such as word frequencies. Simple approaches like keyword matching are introduced, as well as more elaborate approaches that combine simple methods commonly used in the area of text information retrieval.
2.3.2.1. Static Techniques
Keyword based approaches involve simple searches of the body and/or the subject line
of a message for specific keywords and phrases like “Viagra”, “Cialis” or “get this for
free”. If these words or phrases appear, this fact is used as an indicator for spam. The
three main types of keyword based matching are described below.
Keyword Based: Search for words or phrases that match exactly. For example,
“Viagra” only matches “Viagra”.
Pattern Matching: Covers simple variations by mixing constant text and flexible components like wildcards, case (in)sensitivity, and number of occurrences. This kind of pattern matching is based on regular expressions [68]. For example, “V*i*a*g*r*a” matches “Viagra”, “V.i.a.g.r.a”, “Vviiaaggrraa”, ...
Rule Based: Rules are more complex constructs a message can be checked against. For instance, the rule “Mentions Generic Viagra” detects whether generic Viagra is a main topic in a given message (via several regular expressions). It is common practice to assign a certain value to each rule and to sum up those values to compute an overall spam rating (see Section 3.3.1). A sketch of all three variants follows below.
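The following sketch illustrates all three variants; the keywords, patterns and score values are purely illustrative.

```python
import re

KEYWORDS = {"Viagra", "Cialis"}  # keyword based: exact matches only

# Pattern matching: one regular expression covers obfuscations such
# as "V.i.a.g.r.a" or "Vviiaaggrraa".
VIAGRA = re.compile(r"v+\W*i+\W*a+\W*g+\W*r+\W*a+", re.IGNORECASE)

# Rule based: (pattern, value) pairs whose values are summed up.
RULES = [
    (VIAGRA, 2.5),
    (re.compile(r"get\s+this\s+for\s+free", re.IGNORECASE), 1.0),
]

def spam_rating(body: str) -> float:
    rating = sum(2.0 for word in KEYWORDS if word in body)
    rating += sum(value for pattern, value in RULES if pattern.search(body))
    return rating

print(spam_rating("Get generic V.i.a.g.r.a - get this for free!"))
```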
2.3.2.2. URL Analysis
URL analysis in its simplest form means white- or blacklisting of URLs (compare
Section 2.3.1). However, approaches that are more sophisticated have been developed,
as the one explained in this section that combines several techniques.
Filtering Spam Using Search Engines [69]. An approach for filtering spam using search engines like Google and Yahoo has been developed at the Georgia Institute of Technology. The key idea is to filter spam according to the URLs (and their content) that occur in an e-mail message (for example, whether they link to Web sites a user might be interested in or not). This is done by categorizing URLs via search engines as well as by using Bayesian classifiers on Web site content to define a user’s interest (in terms of keywords resulting from the Bayesian analysis). The approach distinguishes categorized URLs, which have already been indexed by a search engine, and uncategorized URLs, which are not listed in any Web directory.
Such a system has to be trained. The first training step is to make a list of acceptable categories (to define the categories a user is interested in). For this purpose, URLs are extracted from legitimate mail messages in the user’s mailbox, which are then classified through search engines. The content of the Web sites is also retrieved from the search engines’ caches and used to train a Bayesian classifier. Legitimate URLs, that is, URLs that occur in the user’s messages but cannot be found in a Web directory, are whitelisted: a regular expression is created for each URL, resulting in a set of regular expressions Aregex that represent legitimate URLs. At the end of the training process the user is able to edit and verify the training results. After the training phase there should be a set of legitimate categories, called Acategories, and a set of regular expressions, called Aregex, that map the user’s preferences (or a list of URLs to be accepted).
After training, the system is ready to classify mail. Figure 10 depicts the classification process in detail. A message that does not contain any URLs is classified as ham. If a message contains URLs, every URL is processed. If an URL matches a regular expression in Aregex, if its category is in the Acategories set, or if the URL was previously classified as legitimate, it is not considered any further. The remaining set of URLs, called Ur, includes only categorized URLs with categories not in Acategories and uncategorized URLs never seen before in legitimate messages that do not match any of the regular expressions in Aregex. If an URL has a category not in Acategories, the message is classified as spam. For each uncategorized URL remaining in Ur, the content it refers to is evaluated through the output of the Bayesian classifier.
[Figure 10 depicts this process as a flowchart: an incoming message is scanned for URLs (no URLs: ham); the set Ur is built; each Ui ∈ Ur with a category not in Acategories leads to a spam classification; for all uncategorized Ui ∈ Ur a simple Bayesian classifier is run on the content behind Ui, and its output is compared against a spam threshold (below: ham, above: spam).]
Figure 10: URL analysis based on [69]
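The decision logic of Figure 10 can be sketched as follows. The helpers lookup_category and bayes_is_spam are hypothetical stand-ins for the search-engine directory query and the Bayesian classifier run on the page behind a URL.

```python
def classify_by_urls(urls, a_regex, a_categories,
                     lookup_category, bayes_is_spam):
    """Sketch of the flow in Figure 10: whitelisted or acceptable URLs
    are skipped, badly categorized URLs mean spam, and uncategorized
    URLs are deferred to a Bayesian content check."""
    if not urls:
        return "ham"
    uncategorized = []  # the set Ur of remaining uncategorized URLs
    for url in urls:
        if any(rx.match(url) for rx in a_regex):
            continue  # matches Aregex: previously whitelisted
        category = lookup_category(url)
        if category is not None:
            if category in a_categories:
                continue  # acceptable category
            return "spam"  # categorized, but not acceptable
        uncategorized.append(url)
    if any(bayes_is_spam(url) for url in uncategorized):
        return "spam"
    return "ham"
```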
2.3.2.3. Authentication
The fundamental problem with spam messages is that it is hard to tell whether a message is spam or not. The obvious answer to this problem is that there has to be a way to recognize non-spam messages. Whitelists have been a method of choice for a long time, but they cannot solve one important problem: the e-mail protocol currently used does not provide any security features (cf. Section 1.4). This means that anybody may use more or less anything as a sender’s address.
Digital Signature, Encryption of E-Mail
A working public key infrastructure would solve this problem. If every message was
signed with a private key, there would be no problem to authenticate all senders.
Unfortunately, there is one rather big problem with this solution. Currently only a very
small number of e-mail users have a valid certificate. Creating an infrastructure, which
allows every e-mail user to use digital signatures, would be a big challenge.
Generally, there are two different options for a public key infrastructure: either there is one single root certification authority, which means that there is a heavy burden on this central authority, or there are many different certification authorities, each of which has to be trusted. A crafty spammer might start his own certification authority, would then be able to sign all his messages, and could therefore get past this security measure with relative ease.
In many cases, the sender needs to fetch a public key for each recipient, and it would be necessary to integrate all common encryption systems (PGP, X.509, ...) into all mail clients (which, for example, is currently not accepted by Microsoft for its Internet Explorer).
Even if this sounds rather ineffective, it still adds a certain amount of work to the spammer’s plan of sending large amounts of e-mail messages. Whether this adds enough work to harm the spammers’ business model is a question that is currently unanswered.
2.3.2.4. Strengths and Weaknesses
Static techniques are useful to some extent at the individual or even corporate level. However, the word “Viagra” may be of interest to a physician or pharmacist; thus keyword-based filtering cannot be used as a general solution. Performance may be the main advantage of these primitive approaches, but a drawback is the need to keep the keywords up to date.
At first glance, URL analysis seems to be promising. Taking a closer look reveals a
couple of drawbacks, though. Doing multiple queries in a search engine, or even
running a Bayes classifier may require a lot of time. This can lead to a point where
denial-of-service attacks based on messages containing vast amounts of URLs paralyze
a complete e-mail service.
As already discussed in Section 2.3.1.4, missing authentication is one of the biggest weaknesses of the current mail infrastructure, but authentication alone will not solve the spam problem, and the establishment of a central authentication authority does not seem to be realistic.
2.3.3. Using Source and Content
In many cases, information from the body or the header of an e-mail alone is not enough for a classification. Mass mailer detection in particular needs as much information as possible in order to compare messages as thoroughly as possible. In this section we take a look at different technologies using this approach.
2.3.3.1. Fingerprints, Signatures, Checksums
Digital fingerprint: a value calculated from the content of other data that changes if the data upon which it is based changes.
Checksum: A checksum is a value computed by adding together all the numbers in the input data. It is the simplest form of a digital fingerprint – its problem is that reordering the numbers in the document does not change the checksum value.
Cyclic Redundancy Checks (CRCs) are more reliable than checksums; they normally reflect even minor changes to the input data, but it is relatively easy to generate a completely different file that produces the same CRC value.
Hash algorithms and message digests: “one-way hash algorithms” produce a “hash” value b from an input a such that it is easy to compute b from a, but very difficult (or impossible) to compute a given only b (compare Section 2.2.1). Two well-known hash algorithms are MD5 and SHA.
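As a simple illustration, the following sketch computes a whitespace-insensitive SHA-1 body fingerprint of the kind the systems described below build upon; the normalization is deliberately naive compared to real “fuzzy” checksums.

```python
import hashlib

def body_checksum(body: str) -> str:
    """SHA-1 fingerprint of a message body, ignoring white-space."""
    normalized = "".join(body.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

a = "Buy now!   Limited \t offer."
b = "Buy now! Limited offer."   # same content, different spacing
assert body_checksum(a) == body_checksum(b)
print(body_checksum(a))
```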
DCC, Pyzor, Vipul’s Razor
Checksum-based spam filtering is a method for detecting spam by simply auditing how often a received message has been sent (to other users). It uses a client-server architecture where the client calculates a checksum of an incoming message and sends it to a server, which looks for exact matches in its database and returns an indicator (for example, the number of times that the message has already been reported). According to a user-defined policy, the client then decides whether the message is spam or ham.
The most popular implementations of this concept are the Distributed Checksum Clearinghouse (DCC [70]) from Rhyolite Software and Vipul’s Razor [71]. DCC and Vipul’s Razor differ in the way messages are reported to the server.
A DCC client reports the checksums of any incoming message to the DCC server; thus, DCC basically enables mass mailer detection. It does not decide whether a message is spam or not. DCC just reports how many copies of a message have already been received. For this reason, clients have to maintain a whitelist including senders of solicited bulk mail.
Table 9 shows the parts of a message DCC computes checksums for.

Checksum     Description
IP           Address of SMTP client
Env_From     SMTP envelope value
From         SMTP header line
Message-ID   SMTP header line
Received     Last Received: header line in the SMTP message
Substitute   SMTP header line chosen by the DCC client, prefixed with the name of the header
Body         SMTP body ignoring white-space
Fuz1         Filtered or "fuzzy" body checksum
Fuz2         Another filtered or "fuzzy" body checksum

Table 9: DCC checksums
The most remarkable types are the fuzzy values, which prevent spammers from avoiding registration by DCC by including random characters in their spam messages. It is deliberately not publicly documented how the fuzzy checksums are computed (namely, to keep this secret from spammers). A typical response from a DCC server for a given (spam) message is described by Table 10.
Checksum     Computed Value                        Number of occurrences
From         d281eaa1 6bc43403 88d3a2cc dd3580ab   –
Message-ID   57aef887 c9d8748b 2b887907 5c43f751   –
Received     39c14d6c 7d1ca91b 2fe6f855 b921493d   –
Body         8e96458f 0843008a b6324d60 1cac9dc6   10
Fuz1         6b995215 1bde79e1 de9dc27f c6eadf5b   10
Fuz2         00000000 00000000 00000000 00000000   –

Table 10: Example of a DCC record
The response lists the computed checksums for each part of the message (Fuz2 has no value here because this checksum requires a certain message length, which was not given in this example) and the number of registered occurrences (if any). The client (for instance, a spam filter using DCC, such as SpamAssassin) can then handle the message according to the registration count.
Vipul’s Razor follows a different approach. In this system, the user himself reports a message to the server, so the server’s database should only contain checksums of confirmed spam. Therefore, in contrast to DCC, Razor is a tool for spam detection. One problem appearing here is the trustworthiness of the report’s sender. With the new version Razor 2.0 this problem has been addressed, because every user needs a generated key for signing. The server looks in the database for reports agreeing with the vote received; the higher the agreement with other reports, the higher the reliability of the sender [70].
The above-mentioned Pyzor is an implementation of Razor written in a different programming language (Python). Technically it is nearly the same as Razor, but it is open source software [72].
2.3.3.2. Classification Methods
The main idea behind the use of classification methods in spam filtering is to find a suitable (computer-readable) representation for mail messages and to classify them as spam or ham. This representation is compared to training data and assigned to a class based on various techniques that are briefly described in the following. Some of the relevant technologies originate from the areas of text information retrieval and text analysis.
Representation of Texts
Text analysis is mainly based on words or tokens that occur in the documents of the text collection used. The task is not to fully understand a text’s meaning but rather to extract relevant tokens. Tokens can be entire words, phrases, or n-grams (overlapping tokens consisting of n characters). Although this approach might miss some of the information content, it has clear performance advantages and is independent of the text’s language. Several models exist for text representation based on those tokens/terms; the most common ones are listed below.
Term frequency: A message is represented by the number of occurrences of terms
(the more often a word occurs in a message the higher the value for this word).
TF-IDF (Term Frequency Inverse Document Frequency) Representation: All occurrences of each term in a document collection are registered. Further, the number of documents each term occurs in is computed. Then each document is represented by the terms included in this document (term frequency), weighted by the inverse of the number of documents they occur in (inverse document frequency) [73].
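A minimal sketch of the TF-IDF representation, using the common logarithmic variant of the inverse document frequency:

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute a TF-IDF vector for each tokenized document: term
    frequency weighted by the log inverse document frequency."""
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    return [{term: count * math.log(n / doc_freq[term])
             for term, count in Counter(doc).items()}
            for doc in documents]

docs = [["cheap", "viagra", "cheap"],
        ["meeting", "agenda"],
        ["cheap", "offer"]]
print(tf_idf(docs)[0])  # "viagra" outweighs the more common "cheap"
```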
Training Models
Training can denote a simple storing of examples or involve more sophisticated and time-consuming methods, which are particularly important when token frequencies shall be kept up-to-date. According to [74], there are three major training methods: TEFT, TOE and TuM.
TEFT (Training on Everything): every message is used to update the database.
TOE (Train on Error): only messages that were incorrectly classified are used for training (usually after an initial corpus training). An advantage is the dynamic handling of errors; the downside is the amount of human interaction needed (to find false classifications).
TuM (Train until Mature) [75]: provides a hybrid between TEFT and TOE. TuM will train the individual tokens in a message only up to the point where they have reached maturity (for instance, 25 hits per token). New types of training data are still trained, as are immature tokens. TuM trains all tokens whenever an error is being retrained. Therefore, it has both advantages – a balance between volatility and static data, and the ability to adapt to new types of e-mail.
Distance Measures
The similarity between a query message and the messages in the training sets is
measured via distance functions. The query vector (consisting of term frequencies) is
compared to the examples in a training set so that one or more most similar vectors can
be found (for example, the similarity between an incoming mail message and a ham or
spam training set). Examples for distance measures are:
Euclidean Distance: $d(x,y) = \sqrt{\sum_i |x_i - y_i|^2}$

Mahalanobis Distance: $d(x,y) = \sqrt{(x-y)^t \, C^{-1} \, (x-y)}$

Cosine: $d(x,y) = \cos(x,y)$
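The Euclidean and cosine measures can be sketched directly; the Mahalanobis distance additionally requires the inverse covariance matrix C⁻¹ of the training data and is omitted here.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

query, example = [1, 0, 2], [1, 1, 1]
print(euclidean(query, example), cosine(query, example))
```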
Classification Decision
After the computation of distance measures the query vector has to be assigned to a
class according to the training set, that is, it tags a given message as either ham or spam
according to a spam and a ham training corpus. It is not always the best choice to base
the decision whether a query vector belongs to one class or not on the one most similar
vector in the training set only.
Many different algorithms and models for classification tasks have been developed, most of them following the procedure just presented, differing slightly in one detail or another. The methods presented in the following give an overall idea of existing technologies, but the list does not claim completeness.
2.3.3.3. Bayes Filter
Although the application of Bayesian analysis to spam is rather new, Bayesian logic was actually first published by the Royal Society in 1763 and goes back to Thomas Bayes (born 1702 in London).
In basic terms, Bayes’ Formula allows us to determine the probability of an event occurring based on the probabilities of two or more independent events. The general formula is written as:

$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{k} P(B \mid A_j)\,P(A_j)}$
In a Bayesian filter scenario, text is represented by significantly positive or negative
words (tokens), that is, typical spam or ham words. At first, lists of “good” and “bad”
words are computed from a training set of positive and negative examples. The output
is two lists containing spam and ham probabilities for all tokens (complete words in a
classic Bayesian filter). Spam probabilities for tokens are calculated using:
- the frequency of the token in the spam database
- the frequency of the token in the non-spam database
- the number of spam messages stored (in each database)
Any incoming e-mail is now represented by the most important tokens from these
lists, either “most positive” or “most negative”. The overall spam probability is defined
as the joint probability of independent events (the tokens).
Assuming that the variables a, b and c represent spam probabilities for three different
tokens, the total spam probability of a message is equal to:
$\frac{abc}{abc + (1-a)(1-b)(1-c)}$
The decision whether a message is treated as spam or ham is based on this overall
spam probability (via a simple threshold function).
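A minimal sketch of this combination step; the token probabilities and the decision threshold are illustrative.

```python
def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities into an overall message
    probability, assuming independent tokens (as in the formula above)."""
    p_spam = p_ham = 1.0
    for p in token_probs:
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)

# Three tokens with spam probabilities a, b and c:
score = combined_spam_probability([0.99, 0.95, 0.20])
print("spam" if score > 0.9 else "ham", round(score, 4))
```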
Bayesian filters use a variety of different tokens; a few are listed below [74]:
Standard Bayes: Each word is a token – this method is used in most spam filter programs. Text is not preprocessed at all (everything is used as a token, including header info, JavaScript, etc.).
Token Grab Bag: A sliding window of five words is moved across the input text. All combinations of those five words are taken in an order-insensitive way – every combination is a feature.
Token Sequence Sensitive: A sliding window of five words is moved across the input text. All combinations of word deletions are applied (except that the first word in the window is never deleted), and the resulting sequence-sensitive set of words is used as a feature.
Sparse Binary Polynomial Hashing with Bayesian Chain Rule (SBPH/BCR): A sliding window of five words is moved across the input. A feature is the sequence-and-spacing-sensitive set of all possible combinations of those five words, except that the first word in the window is always included.
Peaking Sparse Binary Polynomial Hashing: This is similar to SBPH/BCR, except that for each window position, only the feature with the highest or lowest probability (furthest from 0.5) is used. The other features generated at that window position are disregarded. This is an attempt to “decouple” the sequences of words in a text and make the Bayesian chain rule more appropriate.
Markovian matching: This is similar to Sparse Binary Polynomial Hashing, but the individual features are given variable weights. The weights increase quadratically with the length of the feature, so that a feature that contains more words than any of its sub-features can outweigh all of its sub-features combined.
2.3.3.4. Support Vector Machines (SVM)
The Support Vector Machines model, introduced by Vapnik [76][77], has proven to be a powerful classification algorithm and is used in many categorization tasks, including text categorization. The main idea is to map the input data into a high-dimensional feature space and to separate this data by the hyperplane that provides the highest margin between the two classes. If classes are not linearly separable, SVMs make use of so-called kernels (convolution functions) to transform the initial feature space into another one where a separating hyperplane exists.
2.3.3.5. K-Nearest Neighbor (K-NN)
A query is compared to all samples in the training set (according to a distance function; Euclidean distance is very common). The query is assigned to the class that most of the k nearest neighbors (the k most similar vectors) belong to. For instance, if a message’s five nearest neighbors consist of two spam messages and three hams, the message is classified as ham. K-Nearest Neighbor is an example of the decision component of a classification system.
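A minimal sketch of this decision step; the vectors, labels and the choice k = 3 are illustrative.

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(query, training_set, k=5):
    """Assign `query` to the majority class among its k nearest
    neighbors; `training_set` is a list of (vector, label) pairs."""
    neighbors = sorted(training_set,
                       key=lambda item: euclidean(query, item[0]))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

training = [([1.0, 0.0], "spam"), ([0.9, 0.1], "spam"),
            ([0.0, 1.0], "ham"), ([0.1, 0.9], "ham"), ([0.2, 1.0], "ham")]
print(knn_classify([0.8, 0.2], training, k=3))  # -> "spam"
```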
2.3.3.6. Neural Networks
Another technique widely used for classification and pattern recognition tasks is feed-forward neural networks. Neural networks differ from other approaches because of their extensive training phase and their heuristic way of initialization. Far more resources are needed for training than for the actual classification – in contrast to the K-NN algorithm, where training only means storing vectors and classification includes the costly comparison to all training examples [78].
2.3.3.7. Strengths and Weaknesses
Bayesian filters offer a good method to detect spam messages. They represent a content-based solution that is easy to implement. Disadvantages are the need for permanent filter training, limited applicability for ISPs, potential counterattacks from spammers (insertion of random words – see [79]) and possible performance problems.
One big advantage of checksum based systems is the low rate of false positives.
False positives can only occur when a message has been sent many times or the
checksums of different messages are accidentally the same (a very small chance).
Approaches that are based on fuzzy checksums can be used to detect messages that
contain random words (often used by spammers to bypass keyword based filtering).
Many articles discuss the advantages and disadvantages of different classification
methods, distance functions and methods for text representations. Many of those
approaches are best used within a specific domain or problem. K-Nearest Neighbor is a
rather simple approach, but it has performance advantages at the time of decision
making. There is a vast body of literature on comparing different kinds of Bayes filters
(see, for example, [80]) to other types of filters currently available.
The neural network based approaches and support vector machines need an extensive training phase and, because of their heuristic initialization, do not allow one to draw conclusions about why a specific message is classified as ham or spam (they are a black box). On the other hand, classification itself is faster, which may be an important advantage, although the main performance cost of spam detection lies in analyzing the message itself.
One of the first papers about the use of SVMs to classify spam messages was published in 1999 [81]. Another paper [82], proposing a similar approach, was published in 2001. So far, the application of SVMs as well as of K-NN and its variations to the spam problem has been discussed numerous times [83]. However, due to their quite recent development and rather complex implementation, SVMs are rarely used in commercial anti-spam systems at the moment. It is very important to take into account the data pre-processing and training phase, as they are a crucial part of the classification process. All methods discussed here tend to obtain good results only if they are trained on a regular basis. Regular training is essential for keeping the performance of a content filter at a satisfactory level and for avoiding a significant performance decrease over time.
A different solution, offered by some anti-spam software vendors, is to send out
training updates in regular intervals, taking away the burden of maintaining databases
and/or training sets from the end user. Although this avoids the individual effort and the
bias caused by misclassification, it tends to decrease classification performance because
it cannot account for individual users’ preferences.
2.4. Sender and Receiver Side
The previous sections described countermeasures that become effective once the e-mail has already been sent. In this section we summarize methods that affect both sender and receiver. This comprises suggestions for new e-mail transfer protocols, two of which are mentioned here. It cannot be expected that they will be implemented and used in the near future.
2.4.1. IM 2000
IM 2000 has been developed by D. J. Bernstein, the creator of qmail. Today’s Internet mail infrastructure implements a push system: the sender’s cost of sending a message to thousands of recipients is nearly zero. IM 2000 proposes a pull mechanism where the messages are stored at the sender’s side. This concept has several ramifications for the mail infrastructure [84][85]:
• Each message is stored under the sender's disk quota at the sender's ISP. ISPs accept messages only from authorized local users.
• The sender's ISP, rather than the receiver's ISP, is the always-online post office from which the receiver picks up the message.
• The message is not copied to a separate outgoing mail queue. The sender's archive is the outgoing mail queue.
• The message is not copied to the receiver's ISP. All the receiver needs is a brief notification that a message is available.
• After downloading a message from the sender's ISP, the receiver can efficiently confirm success. The sender's ISP can periodically retransmit notifications until it receives a confirmation. The sender can check for confirmation. There is no need for bounces.
• Recipients can check on occasion for new messages in archives that interest them. There is no need for mailing-list subscriptions.
The deployment of IM 2000 would require a global overhaul of the mail infrastructure. The proposed solution provides quite complicated mechanisms for admittedly difficult problems. A global deployment of an implementation is unlikely anytime in the near future [86].
2.4.2. AMTP
The Authenticated Mail Transfer Protocol (AMTP [87]) is currently specified in an Internet-Draft; the last version was submitted to the Internet Engineering Task Force on April 26, 2004. AMTP enables trusted relationships between entities operating Mail Transfer Agents. It works over TLS, analogous to SSL for Web servers. Both client and server must present valid X.509 certificates, each signed by a trusted Certificate Authority (CA), in order to begin a transaction. AMTP also provides a mechanism to publish concisely defined policies. This allows the parties in the trusted relationship to hold each other responsible for operating their servers within the constraints of agreed-upon rules. AMTP inherits the specification of SMTP and builds upon it. By operating on a different TCP port, AMTP can run in parallel with SMTP. It is hoped that this supports an easy and smooth adoption [87].
3. Products and Tools
This chapter describes the products that were used in our experiments. In the first
section we give some general information about anti-spam software. Afterwards we
discuss in detail the characteristics of the commercial and open source spam filters
tested.
3.1. Overview
The following section gives an overview of anti-spam software. First, some quality criteria for product reviews are mentioned. Then we suggest online resources that help in finding the right choice within the wide variety of available solutions.
3.1.1. Quality Criteria
To defeat spam, many approaches have been developed and realized by the anti-spam industry. The market provides free applications for home users as well as applications for an enterprise-wide anti-spam policy. The systems’ qualities differ in many ways, and it is sometimes not obvious which product should be deployed. Some important criteria that should be considered are:
Usability: Deploying anti-spam software can be a very cost-intensive task. Besides acquisition costs and license fees, staff for maintaining the application is needed. The effort for configuring and training a system is one of the most important criteria; the less user interaction is needed, the better.
Ease of Integration: The integration of a spam filter into the existing IT infrastructure is another main point. Questions such as additional hardware costs or interoperability with existing mail servers and operating systems have to be taken into consideration.
Processing Speed: The processing speed needed depends on the mail volume received. If the number of messages received exceeds the capabilities of the spam filter, mail could be lost due to congestion. The processing speed depends mainly on the methods used for analyzing the incoming messages.
Detection Rate: The detection rate is definitely the most important criterion. It must
be pointed out that this refers to the detection rates of spam as well as ham. A good
spam filter must have a very low rate of false positives and on the other hand detect as
many spam messages as possible.
3.1.2. Comparisons of Anti-Spam Software
For our report, we tried to give a short summary of the methods and tools available for spam detection. In the following section, we describe two online tools which give a very good and regularly updated overview of anti-spam tools.
One is the anti-spam buyer’s guide available at NetworkWorldFusion [88]. Registration is required first (no fees); then you can search through the buyer’s guide, choosing between server-based and client-based products and anti-spam services. A very good feature is the so-called Compare-O-Matic, where two or more products can be compared (different features for comparison can be chosen).
The second tool can be found at Spamotomy [89]. You can choose among all kinds of solutions – desktop software, server software, hosted services and disposable addresses; for every tool there is a short summary and a brief description of the methods used.
3.2. Commercial products
The main topic of this section is the description of the functionality of various
commercial anti-spam products – most of them have been used in our experiments (see
Chapter 5 for the results). Further information about the products can be found at the
vendors’ homepages. Due to their commercial status, the inner workings of some
methods are not fully disclosed to the public.
In particular, we describe the following commercial products here (the notation in
brackets is the one used in Chapter 5): SurfControl E-Mail Filter for SMTP ( = Product
1), Symantec Brightmail Anti-Spam ( = Product 2), Symantec Mail Security for SMTP
( = Product 3), Kaspersky Anti-Spam ( = Product 4), Borderware MXtreme Mail
Firewall ( = Product 5) and Ikarus mySpamWall ( = Product 6).
3.2.1. Symantec Brightmail Anti-Spam
Symantec Brightmail Anti-Spam (Vers. 6.0) [90] offers complete, server-side anti-spam
and anti-virus protection. It can be run on Windows 2000 Server (SP2) or higher,
Linux (Red Hat ES/AS 3.0) and Solaris (8 or 9) – we only tested the Windows version.
The product is updated online to ensure that the latest virus and spam patterns are
installed.
Brightmail Anti-Spam software filters e-mail in four basic ways:
• It filters and classifies e-mail.
• It cleans viruses from e-mail.
• Content filters can be tailored specifically to the needs of an organization.
• The Allowed Senders List and the Blocked Senders List filter messages
based on available sender information. Own lists or third-party lists can be
used.
Figure 11 shows the typical processing path of Symantec Brightmail Anti-Spam.
Figure 11: Typical processing path of Symantec Brightmail Anti-Spam
Available Methods
Methods based on source of e-mail: Brightmail Anti-Spam supports whitelists and
blacklists. The filter treats mail coming from an address or connection in the whitelist
as legitimate mail. Policies can be set up to configure a variety of actions, performed on
incoming e-mail, including deletion, forwarding, and subject line modification.
Brightmail Anti-Spam also provides three preconfigured lists to handle e-mail
messages:
• Open Proxy List: IP addresses that are open proxies (often used by spammers).
• Safe List: IP addresses from which virtually no outgoing e-mail is spam.
• Suspect List: IP addresses from which virtually all of the outgoing e-mail is spam.
Methods based on fingerprints: Brightmail Anti-Spam uses its own Checksum
Database, the so called BrightSig technology. It is the cornerstone of Symantec’s
signature technology. The technology characterizes spam attacks using proprietary
fuzzy algorithms. The resulting signatures are added to a database of known spam.
Other content based methods: When evaluating whether messages are spam or not,
Brightmail Anti-Spam calculates a spam score from 1 to 100 for each message, based
on techniques such as pattern matching and heuristic analysis. If an e-mail’s score is in
the range from 90 to 100 it is considered spam; if the score is below 25 it is considered
ham. E-mails with scores between 25 and 90 are suspected spam. For more aggressive
filtering, the thresholds can be varied, and it is possible to specify different actions for
messages identified as suspected spam or spam based on different filtering policies.
Brightmail Anti-Spam allows creating custom filters based on keywords and phrases
found in specific areas of a message.
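As an illustration of this three-level scoring scheme, the following sketch classifies a
message by its score. The thresholds 25 and 90 are the defaults described above; the
class and method names are our own and not part of the product:

```java
// Sketch of a three-way threshold classification (names are ours, not Symantec's).
public class ScoreClassifier {
    enum Verdict { HAM, SUSPECTED_SPAM, SPAM }

    private final int hamThreshold;   // scores below this count as ham (default 25)
    private final int spamThreshold;  // scores at or above this count as spam (default 90)

    public ScoreClassifier(int hamThreshold, int spamThreshold) {
        this.hamThreshold = hamThreshold;
        this.spamThreshold = spamThreshold;
    }

    public Verdict classify(int score) {
        if (score >= spamThreshold) return Verdict.SPAM;
        if (score < hamThreshold)   return Verdict.HAM;
        return Verdict.SUSPECTED_SPAM; // scores in between: suspected spam
    }

    public static void main(String[] args) {
        ScoreClassifier c = new ScoreClassifier(25, 90);
        System.out.println(c.classify(95)); // SPAM
        System.out.println(c.classify(50)); // SUSPECTED_SPAM
        System.out.println(c.classify(10)); // HAM
    }
}
```

Varying the two constructor arguments corresponds to the more or less aggressive
filtering mentioned above.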
Other Features: The language a message is written in can be determined. By default,
Symantec Brightmail Anti-Spam treats all languages equally, but it is possible to
classify messages according to their language. It is also possible to filter out oversized
messages or messages with specific attachments. When configured for anti-virus
filtering, Brightmail Scanners detect viruses in e-mail as it enters the mail system.
When one or more viruses are detected, the anti-virus policies take effect.
User’s View: Symantec Brightmail Anti-Spam provides a set of basic features that
cannot be disabled to ensure protection against spam. These features are the Spam
Scoring and the Suspect List within the Reputation Service. The only way to take the
whole product offline is to disable the services in the Microsoft Services Console.
The product is easy to handle and provides a comfortable Web interface for
administration and Spam Scoring configuration. The processing speed is at a very high
level. Anti-spam and anti-virus definitions are updated regularly so there is no effort for
maintaining the product. It offers many good features and is a good combination of
anti-spam and anti-virus protection, but there are disadvantages too. The information
provided in the logging section (statistics about status information and classification
results) is only updated every hour. Real-time monitoring of the processing status is
therefore impossible.
3.2.2. Kaspersky Anti-Spam
Both Kaspersky Anti-Spam 2.0 Enterprise Edition and ISP Edition [91] must be run
under Linux or FreeBSD 4.x and plugged into an existing mail server (the most
common Unix mail servers like postfix, qmail, etc. are supported).
Available Methods
Methods based on source of e-mail: Kaspersky Anti-Spam 2.0 supports filtering mail
by officially blacklisted addresses as well as using local black- and whitelists created by
administrators. Furthermore a heuristic analysis checks some of the formal attributes of
an e-mail, such as sender’s address, recipient’s address, sender's IP address, size of
message, and format of message.
Classification methods: A lexicographical comparison searches for words and
phrases typically used by spammers. Additionally, tricks like hidden text or special
HTML tags, which are often used by spammers, are taken into account.
Methods based on fingerprints: The so called “SpamTest” technology compares
incoming mail against sample spam signatures (comparison of their lexical content and
detection of regular expressions). Those signatures are updated automatically on a
regular basis.
Other content based methods: The content of each message can be categorized by
Kaspersky Anti-Spam. In this context, “content” refers to the body of an e-mail,
excluding subject and header. Moreover, conditions can refer to non-formal attributes
of a message, i.e., the results of the content filtering. Therefore, the classical rule-based
approach is combined with content analysis. Kaspersky uses two basic methods to
detect messages with "suspicious" content:
• checking against sample messages (by comparison of their lexical content)
• detection of regular expressions – words and word combinations
A message can be assigned to several content categories according to the results of
content analysis (obscene, formal, probable spam, making money fast, ...). Those
assignments can also be used in the filtering-rules.
Additionally, every message is processed via filtering rules. Every rule includes one
or more conditions that involve an analysis of the message – only if all the conditions of
a rule are met is the action of that rule applied. Such conditions include tests for:
sender’s IP-address, sender’s e-mail address and message size.
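The following sketch illustrates such conjunctive rules: the action of a rule is applied
only if all of its conditions match. All names and the example conditions are our own
illustration and do not reflect Kaspersky’s implementation:

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of conjunctive filtering rules (all names are ours).
class Message {
    String senderIp, senderAddress;
    long sizeBytes;
    Message(String ip, String addr, long size) {
        senderIp = ip; senderAddress = addr; sizeBytes = size;
    }
}

class Rule {
    final String action;                       // e.g. "mark-probable-spam"
    final List<Predicate<Message>> conditions; // all must hold
    Rule(String action, List<Predicate<Message>> conditions) {
        this.action = action; this.conditions = conditions;
    }
    boolean matches(Message m) {
        return conditions.stream().allMatch(c -> c.test(m));
    }
}

public class RuleEngine {
    public static void main(String[] args) {
        Rule r = new Rule("mark-probable-spam", List.of(
            m -> m.senderIp.startsWith("10.0."),           // sender's IP address
            m -> m.senderAddress.endsWith("@example.com"), // sender's address
            m -> m.sizeBytes > 100_000));                  // message size
        Message m = new Message("10.0.0.5", "bulk@example.com", 200_000);
        if (r.matches(m)) System.out.println("apply action: " + r.action);
    }
}
```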
User’s View: Kaspersky Anti-Spam adds a header to each processed e-mail,
containing information about the overall score (spam, probable-spam, not detected).
The configuration is done via the Web interface. The tested configuration
uses the standard common profile (without RBL and DNS checks).
Administrators can switch between several standard rule sets (called common
profiles, valid for all users) or create new ones. In addition certain rules can be added on
a per-user basis. The rule-based approach is quite powerful, that is, it allows many
settings (the standard profile includes 34 rules – most of them handle header
substitution and modification).
The installation of Kaspersky Anti-Spam worked flawlessly, configuration via the
Web configurator is rather easy when switching between standard profiles (but gets
very complex when creating a new profile consisting of new rules). Furthermore, new
sample messages can be added to the predefined categories to improve the classification
performance.
A comprehensive evaluation of Kaspersky Anti-Spam is not easy because of the vast
number of configuration possibilities. Processing speed seems to be good. It is not
completely clear how important the content analysis is for classification.
3.2.3. SurfControl E-Mail Filter for SMTP
SurfControl E-Mail Filter for SMTP, Version 4.7 [92] is essentially a Simple Mail
Transfer Protocol host, which works with all SMTP mail servers, including Microsoft
Exchange, Lotus Notes Domino and Novell GroupWise. Like Symantec Mail Security
for SMTP it is a gateway and therefore the processing path is the same (see Figure 12)
and it filters both outgoing and incoming e-mail. The SurfControl E-Mail Filter is a
commercial product and runs on a Windows 2000 Server (Service Pack 4) or higher.
The core SurfControl E-Mail Filter solution consists of the following software
components:
• Message Administrator: Allows the user to review and act on delayed and
isolated messages, together with querying the various system logs.
• E-Mail Filter Administrator: Enables controlling the E-Mail Filter remotely
via a Web browser.
• E-Mail Monitor: Provides a window onto the progress of individual messages
through the E-Mail Filter.
• Rules Administrator: Enables setting up rules to monitor and/or block
messages.
• Scheduler: This is the interface to automate repetitive tasks, such as receiving
updates from SurfControl's anti-spam database.
Available Methods
Methods based on source of e-mail: The SurfControl E-Mail filter allows creating a
whitelist database by entering information of known individuals. Like the other
products, the Surf Control E-Mail filter supports custom and real time blacklists.
Classification Methods: The Virtual Learning Agent (VLA) is a content
development tool that can be trained to understand and recognize specific proprietary
content. The VLA uses neural network technology with trained strings, allowing
user-defined content to be learned.
Other content based methods: SurfControl uses its Anti-Spam Agent that
automatically detects and deals with common non-business or high-risk e-mail, such as
humorous graphics, chain letters, hoaxes and jokes. It is continuously updated by
SurfControl to maintain accuracy and quality. The filter also enables Boolean searches
to check for words, combinations of words or pairs of words within a message. There is
also a library of dictionaries to detect e-mail content that an organization may want to
avoid. These dictionaries contain words associated with different aspects of unwanted
content, for example adult material, hate speech and gambling.
Other Features: It is possible to remove active HTML content from the body of
e-mail messages. Active content is code that automatically installs and runs on your
computer, such as scripts or ActiveX Controls [93]. SurfControl E-Mail Filter can
detect various routing relay techniques and deny e-mails that have been forwarded or
routed. File attachments and messages can be blocked if they do not comply with the
MIME standard or exceed a specific size. Looping messages between two or more
e-mail servers and messages that exceed a specific number of recipients can be detected
and removed. SurfControl E-Mail Filter also has an image recognition tool that scans
graphics files for explicit adult content.
For anti-virus protection an agent is available that helps to protect the system by
deleting viruses and cleaning infected files when they occur. It uses the McAfee
Olympus Anti-Virus engine to detect files that could damage a system.
User’s View: The entire configuration of the product is up to the user. There are no
preconfigured settings available. It is possible to turn all features off so that the
SurfControl E-Mail Filter just acts as a simple SMTP gateway. The available settings
are very extensive, and some time is needed to get an overview of the functionality of
the product.
The product provides a good user interface to handle the functionality and the
components are clearly arranged. The SurfControl E-Mail Monitor allows a real time
supervision of the processing state. It is possible to integrate external code by using the
External Program Plug-In.
The processing speed of the SurfControl E-Mail Filter is far behind the other tested
products. The Virtual Image Agent does not seem to work at a sufficient level, because
it blocks harmless pictures even at the lowest sensitivity level. The Virtual Learning
Agent only supports pure text files, so training the agent is very time-consuming. The
Loop Detection also seems not to work correctly, because it blocks messages that are
already tagged with an "X-Spam" flag, which is not a significant sign of a looping
message. Some of the dictionaries provided do not seem to be useful because they
include words appearing in nearly every message (for example "html").
3.2.4. Symantec Mail Security for SMTP
Symantec Mail Security for SMTP, Version 4.0.0.59 [94] is a Simple Mail Transfer
Protocol server that processes incoming and outgoing e-mail before passing it on to a
local or remote mail server. The software is a commercial product and requires
Windows Server 2000 with Service Pack 4 or higher as operating system. There is also
a release for Solaris 8 or 9 that was not tested. The Symantec Mail Security for SMTP
is updated online to ensure that the latest virus and spam patterns are installed.
It can be configured to protect a network in the following ways:
• Virus protection
• Block spam
• Prevent the relaying of spam for another host
Figure 12: Typical processing path of Symantec Mail Security for SMTP
Available Methods
Methods based on source of e-mail: To limit potential spam, Symantec Mail Security
can support up to three real time blacklists. There is also the ability to block e-mail by a
custom blacklist (which contains the sender’s address or domain). Domains and e-mail
addresses that shall bypass the heuristic and blacklist detection can be added to a
whitelist. There is also an auto-generating whitelist feature that, if enabled, adds all
domains of outgoing messages that are not in the local routing list.
Classification Methods: The heuristic anti-spam engine is based on neural networks
and performs an analysis on the entire incoming mail message, looking for key
characteristics of spam. It weighs its findings against key characteristics of legitimate
e-mail and assigns a spam score (1-100) showing the spam probability – a high score
means a high spam probability. This score, in conjunction with the engine sensitivity
level (1=low, 5=high), determines whether a message is considered spam. Details of
this method were not made accessible to us.
Other content based methods: The Symantec Mail Security for SMTP allows
defining spam rules to be used for processing the message body. Each rule consists of
one or more terms that can be combined using AND, OR, and NOT operators. For
example, the rule "top secret" OR "confidential" triggers if one of these terms appears
in the message body.
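A minimal sketch of such term rules, built around the "top secret" OR "confidential"
example above (the helper names and the additional NOT example are ours):

```java
import java.util.function.Predicate;

// Hypothetical sketch of body rules combined with AND, OR and NOT.
public class TermRules {
    static Predicate<String> contains(String term) {
        return body -> body.toLowerCase().contains(term.toLowerCase());
    }

    public static void main(String[] args) {
        // the rule from the text: "top secret" OR "confidential"
        Predicate<String> rule =
            contains("top secret").or(contains("confidential"));
        // a hypothetical rule using NOT as well
        Predicate<String> stricter =
            rule.and(contains("unsubscribe").negate());

        String body = "This offer is strictly Confidential.";
        System.out.println(rule.test(body));     // true
        System.out.println(stricter.test(body)); // true
    }
}
```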
Other Features: Symantec Mail Security for SMTP allows blocking messages by
message size, by subject line or by file name. It is also possible to drop messages that
exceed various container limits, such as the file size, the cumulative size or the number
of nested containers. Functionality for handling encrypted container files is included.
Relay restrictions can be configured within Symantec Mail Security for SMTP so that it
refuses to deliver e-mail that has a source outside of the organization.
The anti-virus scanning feature tries to detect virus infected e-mail. New or unknown
viruses can be detected through a heuristic method. The sensitivity of this feature is
variable. Another component of the anti-virus policy is the Mass Mailer Cleanup that
deletes mass-mail or worm-infected messages. The service searches for a match between
virus name patterns and the signature returned by the anti-virus scan. If a match is
detected, the message is dropped.
User’s View: The complete configuration of the product is up to the user. There are
no preconfigured settings available. It is possible to turn off all features so that
Symantec Mail Security for SMTP just acts as a simple SMTP-gateway. The delivery
of messages can be fully stopped and all messages can be rejected to set the Symantec
Mail Security for SMTP offline.
The product is easy to handle and provides a comfortable Web interface for
administration. The auto-generated whitelist is a useful feature saving time for editing
the list. The reporting function allows a good supervision of the processing state.
Reports are always up-to-date and include most of the relevant information. The
processing speed of the Symantec Mail Security for SMTP is at a very high level.
The most important spam detecting feature, the heuristic spam detection, does not
provide a sufficient detection rate. The effort for creating spam and content rules is too
high in relation to the expected increase in the spam detection rate. There is no ability to
manage probabilities for words or combinations of words appearing in an e-mail
message. The latest online update of the spam patterns file dates back to 2004-04-18;
since then, the Live Update functionality seems to have had no effect at all.
3.2.5. Borderware MXtreme Mail Firewall
The tested version MX-400 combines three functionalities – MTA, e-mail gateway and
firewall. It has its own operating system, S-Core OS, which is a Unix system based on
FreeBSD. In contrast to the other solutions, MXtreme is hardware based.
Available Methods
Methods based on source of e-mail: The MXtreme Mail Firewall supports blacklists
(custom as well as real time) and whitelists. These lists can be specified on the user or
system level.
Methods based on fingerprints: The MXtreme Mail Firewall uses DCC for spam
detection.
Classification Methods: The MXtreme Mail Firewall uses a so called Statistical
Token Analyzer (STA) to identify spam based on statistical analysis of mail content.
STA is based on Bayesian filtering; it is not publicly documented how far it differs
from a classical Bayes filter (a detailed description of Bayesian filtering can be found in
Section 2.3.3).
STA uses three sources of data to build its database: the initial tables supplied by
BorderWare based on analysis of known spam, tables derived from an analysis of local
legitimate mail (“local learning” or “training”), and mail identified as “bulk” by DCC,
which is analyzed to provide an example of local spam.
Other content based methods: The MXtreme Mail Firewall supports pattern based
filtering. Filters can be specified using simple English terms such as “contains” and
“matches” or using regular expressions. These filters are processed in the order of their
priority.
User’s View: The configuration is largely up to the administrator. After activation of
the anti-spam feature every single method can be activated and modified separately. It
is possible to filter mail using Brightmail. If so, RBL, DCC and STA are disabled by
default.
The product is easy to handle and provides a comfortable Web interface for
administration. Its main advantages are that it combines firewall and anti-spam
functionality and that it is rather easy to maintain. DCC and Statistical Token Analysis
performed quite well, although the classification performance may differ in production
use due to the chosen training policy.
3.2.6. Ikarus mySpamWall
Unlike the other products, the Ikarus mySpamWall [95] is a service running in a service
center, and cannot be installed on a computer locally. Ikarus Software calls this a
“managed security service”. As it is a service, no information about the operating
system is available. Maintenance is reduced to a minimum.
Available methods
Methods based on source of e-mail: Ikarus mySpamWall provides a full set of
blacklists, whitelists and a greylist. The greylist in particular is a very important feature:
it not only performs checks on the sender’s IP address and hostname (to reduce
dangers originating from private broadband accounts), but also performs several checks
on header information and the message body before finally accepting a message. A
sketch of the basic greylisting idea is given below.
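The sketch uses the common triplet of client IP, sender and recipient and a fixed retry
delay; both the triplet scheme and the delay are generic assumptions, and the additional
header and body checks of mySpamWall are not modeled:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// A minimal greylisting sketch (generic technique, not Ikarus' implementation).
public class Greylist {
    static final Duration DELAY = Duration.ofMinutes(15); // assumed retry delay
    final Map<String, Instant> firstSeen = new HashMap<>();

    /** SMTP-style decision for one delivery attempt. */
    String check(String ip, String sender, String recipient) {
        String triplet = ip + "|" + sender + "|" + recipient;
        Instant now = Instant.now();
        Instant seen = firstSeen.putIfAbsent(triplet, now); // remember first attempt
        if (seen == null || now.isBefore(seen.plus(DELAY)))
            return "451 4.7.1 greylisted, please retry later";
        return "250 OK"; // legitimate MTAs retry; most spam engines do not
    }

    public static void main(String[] args) {
        Greylist g = new Greylist();
        System.out.println(g.check("203.0.113.7", "a@x.org", "b@y.org")); // rejected
        System.out.println(g.check("203.0.113.7", "a@x.org", "b@y.org")); // still delayed
    }
}
```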
Classification methods: Classification is based on two main technologies. On the one
hand keywords are checked, and on the other hand a Bayesian filter is used. The user
can influence neither of these checks directly.
Methods based on fingerprints: No global fingerprint services like DCC are used in
this product. All incoming messages are classified and the ones classified as spam are
used to create tokens for future message classification.
Other features: Ikarus mySpamWall supports technologies like HashCash and
incorporates some additional features that are available through the tight integration of
the e-mail gateway and the spam filter engine. These features include an Anti Spoofing
Engine which checks if a sender’s address really exists (at least if it is a local address)
or an automatic whitelisting for frequent senders. Furthermore, IP addresses are banned
automatically for a certain time if they cause a certain number of errors. Usually this
product is sold in a package with an anti-virus engine to provide full protection.
User’s View: Generally speaking Ikarus mySpamWall is very easy to use. It has a
Web interface that allows a simple setup of threshold values for possible spam and
spam. Optionally, an advanced interface allows the addition of simple rules to blacklist
or whitelist certain senders, receivers, subject lines or content using simple regular
expressions.
From our point of view it seems generally a good idea to offer spam protection for
companies as a service. This is a good solution for rather small companies that cannot
afford an IT department to deal with the spam problem.
Especially the fact that this product offers a complete integration of a mail transfer
agent and an anti-spam solution seems to be a big advantage compared to most of the
others. This advantage is used in a very potent greylist that is able to detect many spam
messages without the risk of creating “real” false positives, as messages can always be
sent again and then be delivered.
3.2.7. Spamkiss
Spamkiss [96], like some other projects, aims at a goal completely different from
that of regular spam filters. While regular spam filters accept the fact that spam exists
and try to eliminate it after it is sent, these approaches try to make sending out spam
mail as expensive as possible (compare Section 2.2.1). These expenses should sooner or
later make spamming less attractive commercially.
The technology involved in this approach consists of a combination of two different
methods: A whitelist and a challenge-response protocol.
The challenge-response protocol is used for the initial contact. The user’s mail
address is modified with a random token, which is only valid for a certain period. This
modified e-mail address has to be used for the initial contact only. After that, the
sender’s address is added to the user’s whitelist. At this point e-mail communication
continues just as it does now. Both partners can communicate as they want, without
adding any new random tokens to mail addresses.
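The following sketch illustrates such token-modified addresses. We assume a
plus-addressing syntax and a validity period of one week; both are our own
assumptions, as the exact token format of Spamkiss is not documented here:

```java
import java.security.SecureRandom;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch of token-modified addresses for the initial contact.
public class TokenAddress {
    private static final SecureRandom RNG = new SecureRandom();

    // generate a short random token, e.g. "3fa2b1c4"
    static String currentToken() {
        byte[] b = new byte[4];
        RNG.nextBytes(b);
        StringBuilder sb = new StringBuilder();
        for (byte x : b) sb.append(String.format("%02x", x & 0xff));
        return sb.toString();
    }

    // "alice@example.org" -> "alice+3fa2b1c4@example.org" (assumed syntax)
    static String tokenized(String address, String token) {
        int at = address.indexOf('@');
        return address.substring(0, at) + "+" + token + address.substring(at);
    }

    public static void main(String[] args) {
        String token = currentToken();
        Instant validUntil = Instant.now().plus(7, ChronoUnit.DAYS); // assumed period
        System.out.println(tokenized("alice@example.org", token)
                + " (valid until " + validUntil + ")");
    }
}
```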
The way in which the tokens are distributed is the major difference between systems
using this approach and is similar to key distribution in a public key infrastructure. As
soon as the key is exchanged (or in this case the token), communication is no problem
at all. Spamkiss offers a simple exchange by either personally talking to each other (or
telling your partner how to modify your address for the first contact), or by getting the
currently used token from a server. A system very similar to Spamkiss, called SFM
[66] (Spam Free E-Mail service, see Section 2.3.1), is used at the Computing Science
Department at the University of Alberta, Canada.
To make sure that the current token cannot be harvested by a computer, a distorted
image is used. This technology is often also used to access free services like stock
quotes, which should not be machine-readable. The general idea is rather simple:
distorted images cannot be used by a computer (at the moment), as text recognition
software cannot read them. The human brain, on the other hand, is easily able to
recognize the letters, even though they are distorted and have a fancy background
pattern.
Available Methods
Methods based on source of e-mail: Spamkiss’ main feature is a whitelist that
temporarily blocks all messages from senders who are not yet included.
Other features: As all senders have to be on the whitelist, a special dialog is available
which allows users to add themselves to the whitelist; this dialog requires human
interaction.
This kind of technology promises that no more legitimate messages will be lost, as
all messages generated by a human being (who reads replies to his address) are
delivered, or bounced with a request to send it again to a different, modified address.
This modified address simply consists of the original address and the current token.
The approach of stopping spam messages this early seems like a good idea at first
glance. Messages are not delivered at the first attempt. Later on, messages that are
legitimate are delivered, and those that are not are not accepted by the mail server,
telling the sender to fetch the current token first. If a token is compromised, which
means that the combination of token and e-mail address ends up on a spammer’s list, it
simply gets changed, without affecting communication with those already added to the
whitelist.
Several major questions seem to be unanswered so far. On the one hand it might still
pose a problem to handle automatically generated messages, and on the other hand
bouncing back a lot of messages may increase network traffic considerably.
Taking a closer look reveals a lot of additional work. Users should know their
tokens, so they can give them to future communication partners. After this initial
contact, the sender’s address gets stored in the whitelist. This solution may work in
many cases, but there are many situations which may cause problems in this kind of
environment. Many users have several e-mail addresses, which means that a user who
uses different sender addresses has to go through the initial process several times. In
addition, it is often hard to tell who will be the sender of messages delivered by a
mailing list, or the exact sending address of messages automatically generated by an
online store. Checks are basically performed on the sender’s address
only. This means that someone who knows the addresses of your trusted partners may
easily send you any kind of message.
3.3. Open Source
This section describes the characteristics of the open source filters tested. Because of
their open source character, documentation and code quality vary. On the other hand,
almost all products offer a lot of information on their Web sites.
3.3.1. SpamAssassin
SpamAssassin [97] is written in Perl and is a project of the Apache Software Foundation. The
primary target platforms are Unix operating systems. Some Windows products
available use SpamAssassin, though they are not open source. SpamAssassin extracts
different features from incoming e-mail messages. This analysis is done through
so-called tests, which examine the header, body or full text of an e-mail (a full listing
can be found in [98]). SpamAssassin can be configured to include RBL checks (see
Section 2.3.1) and a Bayesian classifier. The overall rating of a message is computed by
adding the scores of the individual text-analysis tests, the results of the Bayesian
filtering and the results from the distributed hash databases. Hence SpamAssassin
never relies on one single technique. After testing, a header containing the overall
score and a spam mark is added to all processed e-mail messages. The messages can be
classified according to this mark or score (probably spam if the mark is present,
certainly spam above a certain threshold).
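The following sketch shows how messages can be sorted by these headers. X-Spam-Flag
and X-Spam-Status are the headers SpamAssassin actually adds; the stricter "certainly
spam" threshold of 10.0 is our own illustrative choice:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of classifying mail by the headers SpamAssassin adds.
public class HeaderSort {
    static final Pattern SCORE = Pattern.compile("score=(-?\\d+(?:\\.\\d+)?)");
    static final double CERTAIN = 10.0; // assumed stricter threshold

    static String classify(String spamFlag, String spamStatus) {
        Matcher m = SCORE.matcher(spamStatus == null ? "" : spamStatus);
        double score = m.find() ? Double.parseDouble(m.group(1)) : 0.0;
        if (score >= CERTAIN) return "certainly-spam";
        if ("YES".equalsIgnoreCase(spamFlag)) return "probably-spam";
        return "ham";
    }

    public static void main(String[] args) {
        System.out.println(classify("YES", "Yes, score=12.3 required=4.0")); // certainly-spam
        System.out.println(classify("YES", "Yes, score=5.1 required=4.0"));  // probably-spam
        System.out.println(classify(null, "No, score=1.2 required=4.0"));    // ham
    }
}
```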
Available Methods
Methods based on source of e-mail: SpamAssassin analyses the message header
using regular expressions to identify forgeries and manipulations. SpamAssassin
permits defining proprietary whitelists as well as consulting public ones like Habeas
[51] and Bonded Sender Program [52]. In addition, many useful existing blacklists,
such as mail-abuse.org, ordb.org, SURBL, and others are supported. The latest release
of SpamAssassin (3.0) supports the Sender Policy Framework for verifying the senders’
domain.
Methods based on fingerprints: Vipul's Razor, Pyzor and DCC are supported to
block spam and bulk mail.
Classification Methods: SpamAssassin uses a Bayesian-like probability-analysis
classification, so that a user can train it to recognize mail messages similar to a
training set [99]. Many of SpamAssassin’s tests aim at static patterns in the text of mail
messages.
User’s View: The user can specify which rules to use and whether Bayesian
methods should be used or not. Furthermore, one can specify which features should be
computed and which online resources should be consulted (DCC, Razor, Pyzor). The
user can also specify which RBL should be used (if any), therefore SpamAssassin is
very adjustable to the user's needs. SpamAssassin also offers an auto-learn function,
which uses all messages below and above certain scores (mail that is very clearly
classified as spam or ham) as learning input for the Bayesian classifier. The auto-learn
function is not considered in our tests; all training is done prior to testing.
We experienced no problems during configuration. SpamAssassin is a very
comprehensive tool, combining more than 600 different tests. SpamAssassin seems to
be particularly designed to keep the number of false positives low, and it gives a good
idea of what features of an e-mail message can be computed.
3.3.2. CRM 114
CRM114 [100] (Controllable Regex Mutilator, concept # 114) is written in C and
available as open source. Several Unix operating systems are supported.
Available Methods
Methods based on source of e-mail: CRM 114 supports personal white- and
blacklists.
Classification Methods: CRM 114 uses a Markovian discriminator to differentiate
between ham and spam messages. Incoming text is matched against Hidden Markov
Models [101] of the training corpora (another probabilistic model, related to Bayesian
filtering). CRM 114 uses phrases containing several words and assigns higher weights
to longer phrases, whereas Bayesian filtering usually uses phrases of length one (a
single word).
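The following sketch illustrates the difference: instead of single words, phrases of up to
four words are generated as features and weighted by their length. CRM 114's actual
feature generation and weighting differ in detail; this is only a simplified illustration
with names of our own:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of phrase features with length-dependent weights.
public class PhraseFeatures {
    static final int MAX_LEN = 4; // assumed maximum phrase length

    static List<String> phrases(String[] words) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder phrase = new StringBuilder(words[start]);
            out.add(weighted(phrase.toString(), 1));
            for (int len = 2; len <= MAX_LEN && start + len <= words.length; len++) {
                phrase.append(' ').append(words[start + len - 1]);
                out.add(weighted(phrase.toString(), len));
            }
        }
        return out;
    }

    // assumed weighting: the weight grows with the phrase length
    static String weighted(String phrase, int len) {
        return phrase + " [w=" + (1 << (len - 1)) + "]";
    }

    public static void main(String[] args) {
        phrases("buy cheap pills now".split(" ")).forEach(System.out::println);
    }
}
```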
Incoming mail is piped to CRM114 by the local MDA; CRM114 adds a new header
containing the spam score to each message.
User’s View: The most important and resource-intensive aspect of CRM114 is
training. The configuration is not very user-friendly and gives kind of a “not yet
finished” impression. Bulk training takes a very long time even for a small number of
messages. CRM114 is best used on a per-user basis to personalize the training sets (like
other statistical approaches).
Installation and training are a bit tricky, but the results are rather good. The
recommended training method is to only use false classifications as training input,
whereas we used bulk training (training on errors is much easier on the individual
level).
3.3.3. Bogofilter
Bogofilter [102] is a "Paul Graham based" Bayesian spam filter. The application is
written in C and available as open source (several Unix operating systems are
supported).
Available Methods
Statistical Methods: Bogofilter classifies incoming mail as ham or spam based on a
statistical analysis of the message's header and content (body). Each token (word) in an
incoming mail is checked against a good and a bad wordlist (words typical for ham or
spam, respectively). These wordlists are generated by training the filter with both a
ham and a spam corpus in order to find typical ham and spam words. The spam
probabilities of the individual tokens are then combined using the inverse chi-square
function (a statistical test).
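The combination step can be sketched as follows. We implement the inverse chi-square
(Fisher) combination in the common Robinson variant; Bogofilter's actual constants
and its estimation of the per-token probabilities differ in detail:

```java
// Sketch of combining per-token spam probabilities with the inverse
// chi-square (Fisher) method; illustrative, not Bogofilter's exact code.
public class ChiSquareCombine {
    // P(chi-square with 2n degrees of freedom >= x); closed form for even
    // degrees of freedom: exp(-x/2) * sum_{i=0}^{n-1} (x/2)^i / i!
    static double chi2Q(double x, int n) {
        double m = x / 2.0, sum = 1.0, term = 1.0;
        for (int i = 1; i < n; i++) {
            term *= m / i;
            sum += term;
        }
        return Math.min(1.0, Math.exp(-m) * sum);
    }

    // Combined "spamminess": values near 1 indicate spam, near 0 indicate ham.
    static double spamminess(double[] p) {
        double lnSpam = 0, lnHam = 0;
        for (double pi : p) {
            lnSpam += Math.log(pi);       // evidence for spam
            lnHam  += Math.log(1.0 - pi); // evidence for ham
        }
        double s = chi2Q(-2 * lnSpam, p.length);
        double h = chi2Q(-2 * lnHam, p.length);
        return (s - h + 1.0) / 2.0;
    }

    public static void main(String[] args) {
        double[] tokenProbs = {0.99, 0.95, 0.80, 0.30}; // per-token spam probabilities
        System.out.printf("spamminess = %.3f%n", spamminess(tokenProbs)); // ~0.96
    }
}
```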
Bogofilter is called by the MDA (mail delivery agent) and computes tokens and
finally spam or ham probabilities for every incoming message. It adds an X-Bogosity
header, containing the ham and spam scores of the message, to each e-mail. The
messages can be moved to their designated folders according to this header.
Users have to train Bogofilter with both a ham and a spam corpus. After that initial
training Bogofilter is ready to classify mail. It is recommended to retrain on a regular
basis to adjust to changes in mail messages.
User’s View: The user can choose between two (spam, ham) or three classes (spam,
probably spam, ham) and is responsible for the training. Moreover the threshold values
can be specified.
The results of Bogofilter are of particular interest because it is the only Bayes-only
application included in our test series. In our experience Bogofilter works without
problems (installation and usage).
4. Performance Evaluation
The previous chapters described available tools and the most relevant methods used for
spam detection. To experiment with the mentioned tools it is very important to have a
comparable test set. Existing evaluations that can be found in the scientific literature use
either publicly available spam and ham samples (like the Ling spam corpus and PU1,
both proposed in [103], or the SpamAssassin sample [104]) or self-collected samples,
which are usually more up-to-date, but in some cases not publicly available. We
decided to test the tools with our own collected sample, as it is more up-to-date, and
also with the SpamAssassin sample because of its public availability. In this chapter we
describe the source and the composition of our own test sample and the hardware and
software configuration used for the tests.
4.1. Sources for Our Own Samples
Collecting spam messages seems quite simple – a brief look in everyone’s inbox is
enough – but is a collection of a few inboxes enough for a representative sample? We
decided to ask our partners to collect spam to be as representative as possible.
The collection of ham messages is much harder due to legal preconditions. It is not
allowed to use ham messages without the consent of both the receiver and the sender,
so we depended on volunteers and on our own e-mail.
The following chapter describes the situation of our partners Mobilkom Austria,
UPC Telekabel and the University of Vienna.
4.1.1. University of Vienna
The ZID, which is responsible for e-mail and all other IT services at the University of
Vienna, was our first contact point. Getting spam messages was easy, because the ZID
runs its own spam filter and collects spam via spam traps. We got our own spam e-mail
account that is filled by those honeypots with approximately 1,500 spam messages
coming in every day.
Ham collection was difficult, but some volunteers made their inboxes available.
Therefore we took our own messages and those of the volunteers – special thanks to
Mr. Hatz, Ms. Marosi, Ms. Khan, Ms. Thanheiser and Mr. Bobrowski.
4.1.2. Mobilkom Austria
Mobilkom Austria provides a message store containing spam messages as well as ham
messages which were falsely classified (“false positives”) by the mail filter currently
used.
The repository currently holds about 100 false positives and roughly 3,600 spam
messages. Several employees had the possibility to move messages to the respective
folders, but unfortunately only very few of them did so. The largest part of the
messages was forwarded to these folders. This means that their original headers and
envelopes were lost. Moreover, most messages were forwarded as inline messages. This
implies that their message body was changed too.
Some of the messages were forwarded as attachments leaving all the important
information (header, body) unchanged. However, as all messages provided by
Mobilkom Austria are in a folder located behind a corporate firewall, accessing them is
only possible through a Web based service. This makes it quite difficult to retrieve
messages.
All messages provided by Mobilkom Austria have to be checked manually, and
every single message that can be used for the tests (i.e., that is still unchanged) has to
be retrieved from the Microsoft Outlook compatible store and moved to an IMAP
folder. This causes significant overhead and creates a complicated environment, so we
decided not to use this data in our early experiments, which are based on relatively
large amounts of data.
4.1.3. UPC Telekabel
UPC Telekabel opened 10 test accounts for our project. Five were used as spam traps
and the other five were used to subscribe to legitimate newsletters. Newsletters are very
important for a representative sample, because they have features similar to spam
messages.
4.2. Test sample description
In this chapter we characterize the two test samples which we used in our experiments.
4.2.1. Our Test Sample
Our sample was created on the 25th of August 2004. It consists of 1,500 spam messages
and 1,500 non-spam messages, both collected from different sources.
Spam sources:
• 1,382 (ZID, collected via spam traps, 18.08.2004)
• 22 (Spanish messages from W. Strauss, 16.07.2004 – 19.07.2004)
• 44 (Department, Thanheiser, Khan, Hatz, 18.08.2004)
• 52 (W. Gansterer, 09.03.2004 – 05.08.2004)
Ham sources:
• 593 (Hatz, 01.01.2004 – 19.08.2004)
• 245 (Strauss, 23.09.2002 – 15.08.2004)
• 181 (Department, 20.06.2004 – 20.08.2004)
• 12 (Thanheiser, 04.08.2004 – 20.08.2004)
• 44 (Khan, 06.08.2004 – 20.08.2004)
• 36 (Marosi, 04.08.2004 – 13.08.2004)
• 244 (Ilger, 01.07.2004 – 22.08.2004)
• 145 (Newsletter account Chello, 16.08.2004 – 20.08.2004)
Sample size for 1,500 messages: ham 57.745 MB, spam 6.441 MB
Sample size for 1,000 messages: ham 33.845 MB, spam 4.450 MB
4.2.2. SpamAssassin Test Sample
The original SpamAssassin sample consists of different parts [104]. We took Spam2
(20030228_spam_2.tar.bz2) with 1,397 mail messages and Easy Ham 2
(20030228_easy_ham_2.tar.bz2) with 1,400 messages, which roughly equals our own
test sample in size.
Sample size for 1,000 messages: ham 4.17 MB, spam 6.26 MB
4.3. Experimental Setup
Our test infrastructure consists of several software and hardware components, primarily
a message store containing the spam and ham messages and two server machines for
testing the Windows and Linux products. Table 11 gives a short overview of the
hardware and software we use.

                          Message Store            Windows                  Linux
                          Configuration            Configuration            Configuration
Hardware Platform         AMD Athlon64 3000+,      Pentium 4, 2.8 GHz,      AMD Athlon64 3000+,
                          512 MB RAM               1024 MB RAM              1024 MB RAM
Operating System          Windows 2000 Server,     Windows Server 2003,     SuSE Linux 9.1
                          Service Pack 4           Standard Edition
SMTP-Server               Mercury Mail Server,     Microsoft SMTP-Service   Postfix SMTP-Server,
                          Version 4.01a                                     Version 2.0.19
Other Mail Distribution   __                       Self-written Java        Fetchmail, Version 6.2.5,
Software                                           Application, Microsoft   Procmail, Version 3.22
                                                   POP3 Service

Table 11: Hardware and software configuration
Each test set is stored in its own IMAP folder and can be accessed remotely. Due to
the different architectural characteristics of the Linux and Windows spam filters, we
decided to implement different ways for testing them. (The configuration of the
MXtreme Mail Firewall is not included in Table 11 because it is a hardware filter and
runs its own operating system; its test process is similar to the Windows test process,
and we also use our Java application to deliver the messages to the MXtreme.)
4.3.1. Windows Test Process
We use a small Java application for fetching the messages out of the IMAP store and
sending them via SMTP directly to the spam filters. The spam filters work on the
standard SMTP port (port 25). For each test run only one filter is active. All tested
products are configured as gateways (except Symantec Brightmail Anti-Spam, which
works directly in conjunction with the Microsoft SMTP Service and is not a separate
gateway) and forward the messages to the Microsoft SMTP Service, which finally
delivers the e-mail to the according mailboxes.
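The following sketch shows the basic structure of such a test driver using the JavaMail
API: messages are fetched from an IMAP folder and relayed via SMTP to the filter
under test on port 25. Host names, credentials and the folder name are placeholders,
and the sketch is a simplified illustration rather than our actual application:

```java
import java.util.Properties;
import javax.mail.*;
import javax.mail.internet.MimeMessage;

// Simplified test driver: fetch stored messages via IMAP, relay them via SMTP.
public class TestDriver {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("mail.smtp.host", "filter.example.org"); // filter under test
        props.put("mail.smtp.port", "25");
        Session session = Session.getInstance(props);

        Store store = session.getStore("imap");
        store.connect("imap.example.org", "testuser", "secret"); // placeholders
        Folder folder = store.getFolder("spam-sample");
        folder.open(Folder.READ_ONLY);

        for (Message m : folder.getMessages()) {
            // re-send each stored message unchanged; the filter forwards it
            // to the mailbox given in the recipient headers
            MimeMessage copy = new MimeMessage((MimeMessage) m);
            Transport.send(copy);
        }
        folder.close(false);
        store.close();
    }
}
```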
4.3.2. Linux Test Process
The process for testing the filters running under Linux is slightly different. The
messages are fetched through Fetchmail, a remote mail forwarding utility, which
delivers them to a Postfix mail server. Between the SMTP server and the spam filters
we plugged in Procmail, which allows us to have multiple filters active during a test
run.
Procmail pipes the messages based on their recipient address to the different spam
filters. Every filter is assigned to a designated mailbox. After analysis the messages are
sent back to Procmail and finally delivered to the destination mailboxes and made
accessible via IMAP for evaluation (see Figure 13).
Figure 13: Windows test process vs. Linux test process
5. Experimental Results
The following chapter summarizes our experiments with various anti-spam tools. This
includes open source tools as well as a small selection of commercial products.
At this point we need to emphasize again that the goal of our experiments was not to
thoroughly evaluate or compare various commercial products. Instead, we focused on
analyzing individual methods. As a consequence, our experimental results cannot be
used as the basis of an evaluation or comparison (“ranking”) of commercial products.
Chapter 5.1 outlines the results achieved with our own test sample; Chapter 5.2
describes the results for the SpamAssassin sample. In many cases, there is a vast
number of configuration options. With our limited resources it was not possible to
determine the optimal configuration in terms of performance for each tool.
Consequently, we normally used the standard (default) setup. If simple choices had to
be made, we tried to minimize the number of false positives while keeping the spam
detection rate as high as possible.
5.1. Our Test Sample
Both samples – ham and spam – consist of 1,500 messages; for further details see
Chapter 4. The tools are grouped according to their availability status – commercial or
open source. For products without a training ability, we took all 1,500 messages for
testing. For all tested software with a training ability, we divided the test samples into
three randomly chosen parts of equal size 500. To get results that are less dependent on
a particular training set, we took one part for training and the other two parts for
testing. We repeated this test three times with alternating training sets and then divided
the results by three (so the data in the tables is rounded). This process is called cross
validation (in our case three-fold cross validation); see [78] for details.
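The procedure can be sketched as follows; the evaluate step stands for training a filter
on one part and measuring it on the other two, and all names are our own:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of three-fold cross validation as described above.
public class CrossValidation {
    public static void main(String[] args) {
        List<Integer> sample = new ArrayList<>();
        for (int i = 0; i < 1500; i++) sample.add(i); // stand-in for messages
        Collections.shuffle(sample);                  // random split

        int folds = 3, foldSize = sample.size() / folds;
        double sum = 0;
        for (int f = 0; f < folds; f++) {
            List<Integer> train = sample.subList(f * foldSize, (f + 1) * foldSize);
            List<Integer> test = new ArrayList<>(sample);
            test.removeAll(train);           // the two remaining parts
            sum += evaluate(train, test);    // e.g. spam detection rate
        }
        System.out.printf("averaged result: %.3f%n", sum / folds);
    }

    // placeholder: train a filter on 'train' and measure it on 'test'
    static double evaluate(List<Integer> train, List<Integer> test) {
        return 0.9; // dummy value
    }
}
```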
5.1.1. Commercial Products
The following chapter describes the results for the tested commercial products in detail.
An overview is given in Figure 14.
Figure 14: Results for tested commercial products – our test sample.
The experiments summarized in this figure must not be used as the basis for a ranking of different (commercial or non
commercial) antispam tools, because they were not designed for this purpose. Although the test data used is identical, due to
limited resources no intense efforts could be made to fine-tune any of the tools (cf. page 2).
5.1.1.1. Product 1
Tested Version: Version 4.7

Product 1                       ham sample          spam sample
total mail sent                 1,500               1,500
total mail received             1,500               1,500
classified as ham               1,499 (99.93%)      200 (13.33%)
classified as suspected spam    not available       not available
classified as spam              1 (0.07%)           1,300 (86.67%)
date                            25.08.2004          25.08.2004

Table 12: Product 1 – results our test sample
Product 1 (alternative
configuration)                  ham sample          spam sample
total mail sent                 1,500               1,500
total mail received             1,500               1,500
classified as ham               1,416 (94.4%)       180 (12%)
classified as suspected spam    not available       not available
classified as spam              84 (5.6%)           1,320 (88%)
date                            25.08.2004          25.08.2004

Table 13: Product 1 (alternative configuration) – results our test sample
5.1.1.2. Product 2
Tested Version: Version 6.0

Product 2                       ham sample          spam sample
total mail sent                 1,500               1,500
total mail received             1,500               1,500
classified as ham               1,498 (99.87%)      123 (8.2%)
classified as suspected spam    2 (0.13%)           2 (0.13%)
classified as spam              0 (0%)              1,375 (91.67%)
date                            25.08.2004          25.08.2004

Table 14: Product 2 – results our test sample
5.1.1.3. Product 3
Tested Version: Version 4.0.0.59

Product 3                       ham sample          spam sample
total mail sent                 1,500               1,500
total mail received             1,500               1,500
classified as ham               1,349 (89.93%)      651 (43.4%)
classified as suspected spam    124 (8.27%)         271 (18.07%)
classified as spam              27 (1.8%)           578 (38.53%)
date                            25.08.2004          25.08.2004

Table 15: Product 3 – results our test sample
5.1.1.4. Product 4
Tested Version: Version 2.0

Product 4                       ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               951 (95.1%)         570 (57%)
classified as suspected spam    47 (4.7%)           77 (7.7%)
classified as spam              2 (0.2%)            353 (35.3%)
date                            04.10.2004          04.10.2004

Table 16: Product 4 – results our test sample
5.1.1.5. Product 5
Tested Version: Version 4.0

Product 5                       ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               997 (99.7%)         49 (4.9%)
classified as suspected spam    not used            not used
classified as spam              3 (0.3%)            951 (95.1%)
date                            19.10.2004          19.10.2004

Table 17: Product 5 – results our test sample
Product 5 (alternative
configuration)                  ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               1,000 (100%)        85 (8.5%)
classified as suspected spam    not used            not used
classified as spam              0 (0%)              915 (91.5%)
date                            19.10.2004          19.10.2004

Table 18: Product 5 (alternative configuration) – results our test sample
5.1.1.6. Product 6
Tested Version: version of November 30, 2004

Product 6                       ham sample          spam sample
total mail sent                 1,500               1,500
total mail received             1,490               1,465
                                (10 greylisted)     (8 greylisted, 27 blocked with
                                                    error code 533: malformed
                                                    sender address)
classified as ham               1,472 (98.8%)       95 (6.49%)
classified as suspected spam    18 (1.2%)           24 (1.64%)
classified as spam              0 (0%)              1,346 (91.88%)
date                            30.11.2004          30.11.2004

Table 19: Product 6 – results our test sample
5.1.2. Open Source Tools
The following chapter describes the results for the tested open source tools in detail. An
overview is given in Figure 15.
Figure 15: Results for tested open source tools – our test sample
The experiments summarized in this figure must not be used as the basis for a ranking of different (commercial or non
commercial) antispam tools, because they were not designed for this purpose. Although the test data used is identical, due to
limited resources no intense efforts could be made to fine-tune any of the tools (cf. page 2).
5.1.2.1. SpamAssassin
Tested Version: SpamAssassin, Version 2.64 and Version 3.0
Operating System: SuSE Linux 9.1
Category: Open source
Training Necessity: Yes

Parameter Settings 1: SpamAssassin standard (2.64), spam threshold = 4, Bayes
disabled, network tests enabled

SpamAssassin                    ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               986 (98.6%)         300 (30.0%)
classified as suspected spam    14 (1.4%)           279 (27.9%)
classified as spam              0 (0.0%)            421 (42.1%)
date                            04.10.2004          04.10.2004

Table 20: SpamAssassin standard (2.64) – results our test sample
Parameter Settings 2: SpamAssassin low (2.64), spam threshold = 4, Bayes disabled,
network tests disabled

SpamAssassin                    ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               987 (98.7%)         375 (37.5%)
classified as suspected spam    13 (1.3%)           314 (31.4%)
classified as spam              0 (0.0%)            311 (31.1%)
date                            04.10.2004          04.10.2004

Table 21: SpamAssassin low (2.64) – results our test sample
Parameter Settings 3: SpamAssassin Bayes (2.64), spam threshold = 4, Bayes enabled,
network tests enabled

SpamAssassin                    ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               1,000 (100.0%)      19 (1.9%)
classified as suspected spam    0 (0.0%)            303 (30.3%)
classified as spam              0 (0.0%)            678 (67.8%)
date                            04.10.2004          04.10.2004

Table 22: SpamAssassin Bayes (2.64) – results our test sample
Parameter Settings 4: SpamAssassin (3.0), spam threshold = 4, Bayes enabled,
network tests enabled

SpamAssassin                    ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               1,000 (100.0%)      32 (3.2%)
classified as suspected spam    0 (0.0%)            80 (8%)
classified as spam              0 (0.0%)            888 (88.8%)
date                            04.10.2004          04.10.2004

Table 23: SpamAssassin (3.0) – results our test sample
5.1.2.2. Bogofilter
Tested Version: Bogofilter, Version 0.92.2
Operating System: SuSE Linux 9.1
Category: Open source
Training Necessity: Yes
Parameter Settings: suspected spam threshold = 0.45, spam threshold = 0.99

Bogofilter                      ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               976 (97.6%)         3 (0.3%)
classified as suspected spam    24 (2.4%)           103 (10.3%)
classified as spam              0 (0.0%)            894 (89.4%)
date                            04.10.2004          04.10.2004

Table 24: Bogofilter – results our test sample
5.1.2.3. CRM 114
Tested Version: CRM 114, Version 20040327-BlameStPatrik [tre-0.6.6]
Operating System: SuSE Linux 9.1
Category: Open source
Training Necessity: Yes

Parameter Settings 1: self-trained – trained with parts of our own e-mail messages

CRM 114                         ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               995 (99.5%)         21 (2.1%)
classified as suspected spam    not available       not available
classified as spam              5 (0.5%)            979 (97.9%)
date                            04.10.2004          04.10.2004

Table 25: CRM 114 self-trained – results our test sample

Parameter Settings 2: pre-trained – pre-trained configuration files were used

CRM 114                         ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               591 (59.1%)         131 (13.1%)
classified as suspected spam    not available       not available
classified as spam              409 (40.9%)         869 (86.9%)
date                            04.10.2004          04.10.2004

Table 26: CRM 114 pre-trained – results our test sample
5.1.3. Conclusion
A look at the results of our experiments with our own test sample (Figure 16) shows
that most products have quite comparable detection rates. In particular, the false
positive rate can be kept low with most products, while the spam recognition rate is
usually around 90 percent. It is remarkable that there is no big difference between the
detection rates of the open source tools and the commercial products.
Furthermore, the products which were tested in multiple configurations show that
the activation of additional features or the use of a training feature can have a
significant influence on the performance. The best example of this is SpamAssassin,
which was tested in four different configurations with an increasing number of features
activated.
Figure 16: Results for all tested products (our test sample)
The experiments summarized in this figure must not be used as the basis for a ranking of different (commercial or non
commercial) antispam tools, because they were not designed for this purpose. Although the test data used is identical, due to
limited resources no intense efforts could be made to fine-tune any of the tools (cf. page 2).
5.2. SpamAssassin Test Sample
The spam sample was divided into two parts (397 mail messages for training and 1,000
e-mail messages for testing). The ham sample was also divided into two parts (400
e-mail messages for training and 1,000 e-mail messages for testing). For the training
set, we always took the first 397 (or 400) e-mail messages according to their
timestamps.
5.2.1. Commercial Products
The following chapter describes the results for the tested commercial products in detail.
An overview is given in Figure 17.
Figure 17: Results for tested commercial products – SpamAssassin sample
The experiments summarized in this figure must not be used as the basis for a ranking of different (commercial or non
commercial) antispam tools, because they were not designed for this purpose. Although the test data used is identical, due to
limited resources no intense efforts could be made to fine-tune any of the tools (cf. page 2).
5.2.1.1. Product 1
Tested Version: Version 4.7

Product 1                       ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               990
reason for missing e-mails      –                   Java exception (due to
                                                    invalid message format)
classified as ham               999 (99.9%)         220 (22.22%)
classified as suspected spam    0 (0.0%)            0 (0.00%)
classified as spam              1 (0.1%)            770 (77.78%)
date                            05.09.2004          05.09.2004

Table 27: Product 1 – results SpamAssassin test sample
Product 1 (alternative
configuration)                  ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               990
reason for missing e-mails      –                   Java exceptions (due to
                                                    invalid message format)
classified as ham               965 (96.5%)         175 (17.68%)
classified as suspected spam    0 (0.0%)            0 (0.0%)
classified as spam              35 (3.5%)           815 (82.32%)
date                            05.09.2004          05.09.2004

Table 28: Product 1 (alternative configuration) – results SpamAssassin test sample
5.2.1.2. Product 2
Tested Version: Version 6.0

Product 2                       ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               985
reason for missing e-mails      –                   Java exception (due to
                                                    invalid message format)
classified as ham               999 (99.9%)         416 (42.23%)
classified as suspected spam    0 (0.0%)            40 (4.06%)
classified as spam              1 (0.1%)            529 (53.71%)
date                            05.09.2004          05.09.2004

Table 29: Product 2 – results SpamAssassin test sample
5.2.1.3. Product 3
Tested Version: Version 4.0.0.59

Product 3                       ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               989
reason for missing e-mails      –                   Java exception (due to
                                                    invalid message format)
classified as ham               998 (99.8%)         393 (39.74%)
classified as suspected spam    2 (0.2%)            118 (11.93%)
classified as spam              0 (0.0%)            478 (48.33%)
date                            05.09.2004          05.09.2004

Table 30: Product 3 – results SpamAssassin test sample
5.2.1.4. Product 4
Tested Version: Version 2.0

Product 4                       ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               985 (98.5%)         166 (16.6%)
classified as suspected spam    15 (1.5%)           62 (6.2%)
classified as spam              0 (0.0%)            772 (77.2%)
date                            28.09.2004          28.09.2004

Table 31: Product 4 – results SpamAssassin test sample
5.2.1.5. Product 5
Tested Version: Version 4.0

Product 5                       ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               987 (98.7%)         140 (14.0%)
classified as suspected spam    not used            not used
classified as spam              13 (1.3%)           860 (86.0%)
date                            25.10.2004          25.10.2004

Table 32: Product 5 – results SpamAssassin test sample
Product 5 (alternative
configuration)                  ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               984 (98.4%)         30 (3.0%)
classified as suspected spam    not used            not used
classified as spam              16 (1.6%)           970 (97.0%)
date                            25.10.2004          25.10.2004

Table 33: Product 5 (alternative configuration) – results SpamAssassin test sample
5.2.1.6. Product 6
Tested Version: version of November 30, 2004

Product 6                       ham sample          spam sample
total mail sent                 1,000               991 (Java exception due to
                                                    invalid message format)
total mail received             1,000               989 (2 greylisted)
classified as ham               968 (96.8%)         31 (3.14%)
classified as suspected spam    30 (3.0%)           111 (11.22%)
classified as spam              2 (0.2%)            847 (85.64%)
date                            15.12.2004          15.12.2004

Table 34: Product 6 – results SpamAssassin test sample
5.2.2. Open Source Tools
The following chapter describes the results for the tested open source tools in detail. An
overview is given in Figure 18.
Figure 18: Results for tested open source tools – SpamAssassin sample
The experiments summarized in this figure must not be used as the basis for a ranking of different (commercial or non
commercial) antispam tools, because they were not designed for this purpose. Although the test data used is identical, due to
limited resources no intense efforts could be made to fine-tune any of the tools (cf. page 2).
5.2.2.1. SpamAssassin
Tested Version: SpamAssassin, Version 2.64 and Version 3.0
Operating System: SuSE Linux 9.1
Category: Open source
Training Necessity: Yes

Parameter Settings 1: SpamAssassin standard (2.64), spam threshold = 4, Bayes
disabled, network tests enabled

SpamAssassin                    ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               995 (99.5%)         113 (11.3%)
classified as suspected spam    5 (0.5%)            217 (21.7%)
classified as spam              0 (0.0%)            670 (67.0%)
date                            28.09.2004          28.09.2004

Table 35: SpamAssassin standard – results SpamAssassin test sample
Parameter Settings 2: SpamAssassin low (2.64), spam threshold = 4, Bayes disabled,
network tests disabled

SpamAssassin                    ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               996 (99.6%)         134 (13.4%)
classified as suspected spam    4 (0.4%)            235 (23.5%)
classified as spam              0 (0.0%)            631 (63.1%)
date                            28.09.2004          28.09.2004

Table 36: SpamAssassin low – results SpamAssassin test sample
Parameter Settings 3: SpamAssassin Bayes (2.64), spam threshold = 4, Bayes enabled, network tests enabled
SpamAssassin                            ham sample       spam sample
total mail sent                         1,000            1,000
total mail received                     1,000            1,000
classified as ham                       997 (99.7%)      54 (5.4%)
classified as suspected spam            3 (0.3%)         148 (14.8%)
classified as spam                      0 (0.0%)         798 (79.8%)
date                                    28.09.2004       28.09.2004
Table 37: SpamAssassin Bayes – results SpamAssassin test sample
Parameter Settings 4: SpamAssassin (3.0), spam threshold = 4, Bayes enabled, network tests enabled
SpamAssassin                            ham sample       spam sample
total mail sent                         1,000            1,000
total mail received                     1,000            1,000
classified as ham                       1,000 (100.0%)   109 (10.9%)
classified as suspected spam            0 (0.0%)         317 (31.7%)
classified as spam                      0 (0.0%)         574 (57.4%)
date                                    28.09.2004       28.09.2004
Table 38: SpamAssassin 3.0 – results SpamAssassin test sample
5.2.2.2. Bogofilter

Tested Version: Bogofilter, Version 0.92.2
Operating System: SuSE Linux 9.1
Category: Open source
Training Necessity: Yes

Parameter Settings: suspected spam threshold = 0.45, spam threshold = 0.99
Bogofilter                              ham sample       spam sample
total mail sent                         1,000            1,000
total mail received                     1,000            1,000
classified as ham                       964 (96.4%)      26 (2.6%)
classified as suspected spam            36 (3.6%)        464 (46.4%)
classified as spam                      0 (0.0%)         510 (51.0%)
date                                    28.09.2004       28.09.2004
Table 39: Bogofilter – results SpamAssassin test sample
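Bogofilter condenses each message into a single spamicity score between 0 and 1 and applies the two thresholds above. The following sketch combines assumed per-token spam probabilities in a naive Bayesian fashion; the actual tool uses Robinson's more robust combination method, so this is a simplification:

    def spamicity(token_probs):
        """Naively combine per-token probabilities p(spam | token)."""
        p_spam, p_ham = 1.0, 1.0
        for p in token_probs:
            p_spam *= p
            p_ham *= 1.0 - p
        return p_spam / (p_spam + p_ham)

    def classify(score, suspect=0.45, spam=0.99):  # thresholds from above
        if score >= spam:
            return "spam"
        return "suspected spam" if score >= suspect else "ham"

    # Three spammy tokens push the score above 0.99 -> "spam"
    print(classify(spamicity([0.99, 0.90, 0.80])))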
5.2.2.3. CRM 114

Tested Version: CRM 114, Version 20040327-BlameStPatrik [tre-0.6.6]
Operating System: SuSE Linux 9.1
Category: Open source
Training Necessity: Yes

Parameter Settings 1: self-trained – trained with parts of our own e-mail messages
CRM 114                                 ham sample       spam sample
total mail sent                         1,000            1,000
total mail received                     1,000            1,000
classified as ham                       995 (99.5%)      288 (28.8%)
classified as suspected spam            0 (0.0%)         0 (0.0%)
classified as spam                      5 (0.5%)         712 (71.2%)
date                                    28.09.2004       28.09.2004
Table 40: CRM 114 self-trained – results SpamAssassin test sample
Parameter Settings 2: pre-trained – pre-trained configuration files were used
CRM 114                                 ham sample       spam sample
total mail sent                         1,000            1,000
total mail received                     1,000            1,000
classified as ham                       973 (97.3%)      208 (20.8%)
classified as suspected spam            0 (0.0%)         0 (0.0%)
classified as spam                      27 (2.7%)        792 (79.2%)
date                                    28.09.2004       28.09.2004
Table 41: CRM 114 pre-trained – results SpamAssassin test sample
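The only difference between Tables 40 and 41 is where the training data came from. The sketch below illustrates such a train-then-classify workflow with a simple word-frequency model as a stand-in; CRM 114 itself matches far more elaborate sparse multi-word patterns, so this is not its actual algorithm:

    from collections import Counter

    spam_words, ham_words = Counter(), Counter()

    def train(text, is_spam):
        (spam_words if is_spam else ham_words).update(text.lower().split())

    def classify(text):
        score = 0.0
        for word in text.lower().split():
            s = spam_words[word] + 1          # add-one smoothing
            h = ham_words[word] + 1
            score += (s - h) / (s + h)        # > 0: spam evidence, < 0: ham
        return "spam" if score > 0 else "ham"

    # "Self-trained" corresponds to calling train() on one's own messages;
    # "pre-trained" to loading counts produced elsewhere.
    train("cheap pills buy now", True)
    train("meeting agenda attached", False)
    print(classify("buy cheap pills"))        # -> "spam"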
5.2.3. Conclusion
A comparison of the results with the SpamAssassin test sample (Figure 19) shows a
bigger difference between the products than with our own test sample. The reason for
this might be that the messages included in this sample are rather old and therefore may
not be included any more in modern signature databases. Another explanation could be
that the general properties of spam messages changed and modern filters therefore
cannot recognize old spam. We see that the best implementations can achieve a spam
recognition rate of 80 percent or more with virtually no false positives.
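These rates follow directly from the table counts: the recognition (true positive) rate is the fraction of the spam sample classified as spam, and the false positive rate is the fraction of the ham sample classified as spam. A short computation, using the counts of Table 32 as an example:

    # Counts taken from Table 32 (Product 5)
    spam_as_spam, spam_total = 860, 1000
    ham_as_spam, ham_total = 13, 1000

    tp_rate = spam_as_spam / spam_total   # 0.860 -> 86.0% recognition rate
    fp_rate = ham_as_spam / ham_total     # 0.013 ->  1.3% false positives
    print(f"recognition rate: {tp_rate:.1%}, false positive rate: {fp_rate:.1%}")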
Figure 19: Results for all tested products – SpamAssassin sample
The experiments summarized in this figure must not be used as the basis for a ranking of different (commercial or non-commercial) anti-spam tools, because they were not designed for this purpose. Although the test data used is identical, due to limited resources no intensive effort could be made to fine-tune any of the tools (cf. page 2).
6. Conclusion
In summary, we can state the following observations with respect to anti-spam methods and the available tools.
6.1. Methods
Most tools and products currently available focus on what we call "post-send" methods in the categorization of anti-spam methods presented in this report (Figure 5). Their focus is on detecting and filtering out spam. Although they achieve acceptable detection and false positive rates, as our experiments show, many of these methods (especially classical filters) have serious drawbacks:
• They often require considerable effort to react to changes in the types of spam sent (new rules, training, etc.), and their performance tends to degrade relatively quickly if they are not well "maintained".
• They are often most effective on an individualized, personal basis, which is undesirable from the point of view of an ISP.
• They are usually unable to reduce the waste of resources caused by spam e-mail (network bandwidth, storage capacity, etc.).
• They are often "one step behind" the spammers' tricks.
Nevertheless, there are some interesting newer approaches in this category, which we grouped under "classification methods". They are motivated by more general techniques from text classification and data mining and are sometimes algorithmically quite sophisticated. In our opinion they have the potential to overcome some of these drawbacks. So far, however, they are mostly discussed at an academic level and are often not mature enough for practical use. Investigating approaches of this type in greater detail will be a major focus of the next project phase.
The situation is quite different for "pre-send" methods. In theory, they promise substantial progress on the spam problem because they target its source (the commercial motivation) rather than (only) fighting the symptoms – the idea is to prevent spam rather than to detect and filter it out. Unfortunately, these approaches also have important disadvantages: they tend to require a large administrative overhead and, more importantly, their success strongly depends on worldwide agreement to deploy them. This holds both for proposals to increase the cost of sending e-mail and for legal regulations prohibiting the sending of UBE and UCE. It is, of course, unrealistic to expect e-mail providers worldwide to commit to common policies within the next few years. Since national or regional boundaries do not exist on the Internet, we conclude that pre-send approaches will not "solve" the problem in the current situation, either.
Beyond these two big categories there are also more "radical" approaches, such as new protocols for e-mail transfer (replacing SMTP), or the view that we need to shift to a paradigm where ham is filtered in instead of spam being filtered out. Although each of these ideas has some merit, their widespread applicability in practice cannot be expected in the near future, and certainly not for an ISP or in any other commercial context.
Our analysis of the situation leads us to the conclusion that there is considerable potential for significant improvement of existing methods. Moreover, to achieve the best results, a multi-layered approach with several "defense lines" seems to be required; one possible structure is sketched below. Details will be investigated in the next phase of our project.
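The following is a minimal sketch of how such "defense lines" could be chained, with cheap checks deciding early and the expensive content analysis running last; the individual layers are toy stand-ins, not a recommended configuration:

    KNOWN_SPAM_IPS = {"192.0.2.1"}         # e.g. fed from a blacklist
    SPAMMY_WORDS = {"viagra", "casino"}    # toy stand-in for a content filter

    def blacklist_layer(msg):
        return "spam" if msg["ip"] in KNOWN_SPAM_IPS else None

    def content_layer(msg):
        hits = sum(w in SPAMMY_WORDS for w in msg["body"].lower().split())
        return "spam" if hits > 0 else "ham"   # last layer always decides

    def pipeline(msg, layers=(blacklist_layer, content_layer)):
        for layer in layers:
            verdict = layer(msg)
            if verdict is not None:            # first decisive layer wins
                return verdict
        return "ham"

    print(pipeline({"ip": "203.0.113.5", "body": "win big at our casino"}))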
6.2. Experiments
As indicated in the beginning, our goal was to evaluate anti-spam methods, not to compare products or tools. It would have been beyond the scope and resources of this project to tune each tool for its best possible performance. In most cases we used a more or less standard configuration, and where simple choices had to be made, we tried to maximize the rate of true positives at the lowest possible rate of false positives. Although the measured performance must therefore be interpreted as an approximation, no major improvements or new insights are to be expected from tuning the parameter settings.
In summary, the experiments show three results:
• For almost zero false positives, some of the tools detect significantly more than
90% of the spam messages.
• The rate of true positives varies strongly across the tools (in some cases it drops to only 30%).
• In general, there was no significant difference between the detection rates of commercial products and those of open source tools. However, this comparison does not take into account other important features, such as user friendliness and support.
These experimental results again indicate that there is substantial room for
improvement, which we will investigate actively in the next phase of this project.
7. List of Figures
Figure 1: Percentage of e-mail identified as spam, June 2004 [10] (no newer data available).................... 12
Figure 2: Interaction of legislative measures, law enforcement and percentage of spam [11] .................. 12
Figure 3: The top ten sources of spam (domains) [14] ............................................................................. 14
Figure 4: Spam categorized in terms of content (data from [10]) .............................................................. 15
Figure 5: Categorization of anti-spam methods........................................................................................ 23
Figure 6: Typical scenario for a blacklist ................................................................................................. 31
Figure 7: Sender registration ChoiceMail ................................................................................................ 34
Figure 8: Challenge of SFM...................................................................................................................... 35
Figure 9: Example of a forged header ...................................................................................................... 37
Figure 10: URL analysis based on [69] .................................................................................................... 39
Figure 11: Typical processing path of Symantec Brightmail Anti-Spam .................................................. 51
Figure 12: Typical processing path of Symantec Mail Security for SMTP ............................................... 55
Figure 13: Windows test process vs. Linux test process............................................................................ 66
Figure 14: Results for tested commercial products – our test sample....................................................... 68
Figure 15: Results for tested open source tools – our test sample ............................................................ 72
Figure 16: Results for all tested products (our test sample) ..................................................................... 76
Figure 17: Results for tested commercial products – SpamAssassin sample ............................................ 77
Figure 18: Results for tested open source tools – SpamAssassin sample.................................................. 81
Figure 19: Results for all tested products – SpamAssassin sample........................................................... 86
8. List of Tables
Table 1: Products/tools considered, methods used by these, further remarks see page 2 ........ 8
Table 2: The top twelve sources of spam, geographically [13] ........ 13
Table 3: Cost-profit equation of a spammer (simplified, monthly basis) ........ 18
Table 4: Typical SMTP dialogue ........ 20
Table 5: The most important header fields in the Internet Message Format [33] ........ 21
Table 6: Adaptation of spammers' techniques to development of filtering techniques ........ 22
Table 7: Quality metrics of binary classifiers for the spam problem ........ 24
Table 8: A typical X-Hashcash header ........ 26
Table 9: DCC checksums ........ 42
Table 10: Example of a DCC record ........ 42
Table 11: Hardware and software configuration ........ 65
Table 12: Product 1 – results our test sample ........ 68
Table 13: Product 1 (alternative configuration) – results our test sample ........ 69
Table 14: Product 2 – results our test sample ........ 69
Table 15: Product 3 – results our test sample ........ 69
Table 16: Product 4 – results our test sample ........ 70
Table 17: Product 5 – results our test sample ........ 70
Table 18: Product 5 (alternative configuration) – results our test sample ........ 71
Table 19: Product 6 – results our test sample ........ 71
Table 20: SpamAssassin standard (2.64) – results our test sample ........ 72
Table 21: SpamAssassin low (2.64) – results our test sample ........ 73
Table 22: SpamAssassin Bayes (2.64) – results our test sample ........ 73
Table 23: SpamAssassin (3.0) – results our test sample ........ 74
Table 24: Bogofilter – results our test sample ........ 74
Table 25: CRM 114 self-trained – results our test sample ........ 75
Table 26: CRM 114 pre-trained – results our test sample ........ 75
Table 27: Product 1 – results SpamAssassin test sample ........ 77
Table 28: Product 1 (alternative configuration) – results SpamAssassin test sample ........ 78
Table 29: Product 2 – results SpamAssassin test sample ........ 78
Table 30: Product 3 – results SpamAssassin test sample ........ 79
Table 31: Product 4 – results SpamAssassin test sample ........ 79
Table 32: Product 5 – results SpamAssassin test sample ........ 80
Table 33: Product 5 (alternative configuration) – results SpamAssassin test sample ........ 80
Table 34: Product 6 – results SpamAssassin test sample ........ 81
Table 35: SpamAssassin standard – results SpamAssassin test sample ........ 82
Table 36: SpamAssassin low – results SpamAssassin test sample ........ 82
Table 37: SpamAssassin Bayes – results SpamAssassin test sample ........ 83
Table 38: SpamAssassin 3.0 – results SpamAssassin test sample ........ 83
Table 39: Bogofilter – results SpamAssassin test sample ........ 84
Table 40: CRM 114 self-trained – results SpamAssassin test sample ........ 84
Table 41: CRM 114 pre-trained – results SpamAssassin test sample ........ 85
9. Index
A
Address Harvesting Tools............................................................................................................. 16
AMTP ........................................................................................................................................... 48
B
Bayes Filter................................................................................................................................... 44
blacklist......................................................................................................................................... 31
Bogofilter...................................................................................................................................... 62
Borderware MXtreme Mail Firewall ............................................................................................ 57
C
Caller-Id ....................................................................................................................................... 32
CAN-SPAM ................................................................................................................................. 16
Challenge-Response ..................................................................................................................... 33
ChoiceMail ................................................................................................................................... 34
CRM 114 ...................................................................................................................................... 61
D
DCC .............................................................................................................................................. 41
Digital Signature........................................................................................................................... 40
DomainKeys ................................................................................................................................. 33
E
Excessive Cross-Posting ............................................................................................................... 11
Excessive Multi-Posting ............................................................................................................... 11
G
Greylist ......................................................................................................................................... 34
H
Hashcash....................................................................................................................................... 26
I
Ikarus mySpamWall ........................................................................................................ 57
IM 2000 ........................................................................................................................................ 48
Internet Message Format ................................................................................................ 20
K
Kaspersky Anti-Spam ..................................................................................................... 52
Keyword Based............................................................................................................................. 38
K-Nearest ..................................................................................................................................... 46
L
Lightweight Currency Protocol..................................................................................................... 27
N
Neural Networks........................................................................................................................... 46
P
Pattern Matching........................................................................................................................... 38
Pyzor............................................................................................................................................. 41
R
Rule Based.................................................................................................................................... 38
S
Sender-Id ....................................................................................................................... 33
Sender Policy Framework .............................................................................................. 32
SFM ............................................................................................................................... 35
Simple Mail Transfer Protocol....................................................................................... 16
Spam Tools .................................................................................................................... 16
SpamAssassin ................................................................................................................ 60
Spamkiss........................................................................................................................ 58
Support Vector Machines .............................................................................................. 46
SurfControl E-Mail Filter for SMTP ............................................................................. 53
Symantec Brightmail Anti-Spam................................................................................... 50
Symantec Mail Security for SMTP................................................................................ 55
U
Unsolicited Bulk E-mail ............................................................................................................... 10
Unsolicited Commercial E-Mail ................................................................................................... 10
URL Analysis ............................................................................................................................... 38
V
Vipul’s razor ................................................................................................................................. 41
W
whitelist ........................................................................................................................................ 31
10. Bibliography
[1]
Hormel Food Corporation.
http://www.hormel.com
[2]
A. Amor, J. Martin: “Civic Networking: The Next Generation”, 1998.
http://www.more.net
[3]
REDNET, Networking and Internet, British ISP since 1992.
http://www.red.net/support/resourcecentre/mail/email-aup.php
[4]
P. Hofmann: “Unsolicited Bulk E-mail: Definitions and Problems“, October 5, 1997.
http://www.imc.org/ube-def.html
[5]
David Madigan: “Statistics and the War on Spam (A Guide to the Unknown)”, 2004.
http://www.stat.rutgers.edu/~madigan/PAPERS/sagtu.pdf
[6]
Spamlinks, A lot of useful and up-to-date anti-spam links.
http://spamlinks.net/stats.htm
[7]
Mitteilung der Kommission an das Europäische Parlament über unerbetene Werbenachrichten [Communication from the Commission to the European Parliament on unsolicited commercial communications], January 22, 2004.
http://europa.eu.int/information_society/topics/ecomm/doc/useful_information/library/communic_reports/spam/spam_com_2004_28_de.pdf
[8]
News on ORF, August 24, 2004.
http://futurezone.orf.at/futurezone.orf?read=detail&id=245906
[9]
Der Standard, online report, December 30, 2004.
http://derstandard.at/?url=/?id=1857456
[10]
Anti-Spam Software Vendor Brightmail, Spam percentage June 2004.
http://www.brightmail.com
[11]
Anti-Spam Software Vendor MessageLabs, Spam Statistic November 2004.
http://www.messagelabs.com/emailthreats/default.asp#
[12]
The Spamhaus Project, List of the 200 biggest spammers called ROKSO.
http://www.spamhaus.org/rokso/
[13]
Anti-Spam and Virus Software Vendor Sophos, Dirty Dozen, the 12 most spamming countries.
http://www.sophos.com
[14]
Anti-Spam Software Vendor Commtouch.
http://www.commtouch.com
[15]
Anti-Spam and Virus Software Vendor Postini.
http://www.postini.com
[16]
Center for Democracy and Technology: “Why am I getting all this spam? Unsolicited
commercial e-mail six month report”, March 2003.
http://www.cdt.org/speech/spam/030319spamreport.shtml
[17]
Mailutilities, Advanced E-Mail Extractor.
http://www.mailutilities.com/aee/
[18]
MTI Software, Atomic Harvester.
http://www.desktopserver.com/atomic.htm
[19]
E-Mail Marketing Software, Mail utilities for Internet business and e-commerce.
http://www.massmailsoftware.com/extractweb/purchase-email-addresses.htm
[20]
Arbeiterkammer Österreich: “Die 8 häufigsten Arten von Internetbetrug“ [The eight most common types of Internet fraud], 2004.
http://www.arbeiterkammer.at/www-192-IP-2839-AD-2799.html
[21]
Rejo Zenger: “Confession for two: a spammer spills it all”.
http://rejo.zenger.nl/abuse/1085493870.php
[22]
Send-safe real anonymous mailer.
http://www.send-safe.com/index.php
[23]
Ronald van der Wal.
http://www.spamvrij.nl/lijsten/bedrijf.php?idbedrijf=466
[24]
Worldsoftwarehouse, Bullet proof hosting service.
www.worldsoftwarehouse.com
[25]
Professional link counter tool, November 15, 2004.
http://www.linkcounter.be
[26]
Urteil gegen US-Spammer Jeremy Jaynes [Verdict against US spammer Jeremy Jaynes], November 15, 2004.
http://www.silicon.de/cpo/news-antivirus/detail.php?nr=17568
[27]
Ferris Research.
http://www.ferris.com/
[28]
Heise News: “Spam belastet Europas Unternehmen” [Spam burdens Europe’s businesses], January 4, 2003.
http://www.heise.de/newsticker/meldung/33417
[29]
M. Gibbs: “Spam Cost Model”, NetworkWorldFusion.
http://www.gibbs.com/msg/
[30]
Ikarus Software (Spam Wall and Anti-Virus).
http://www.mymailwall.at/spamcal.html
[31]
Jürgen Strauß: “Analyse betriebswirtschaftlich orientierter Lösungsansätze für die Spamproblematik“ [Analysis of business-oriented approaches to the spam problem], Diplomarbeit (diploma thesis), Institute of Distributed and Multimedia Systems, Faculty of Computer Science, University of Vienna, 2005 (in preparation).
[32]
J. Klensin: “RFC2821: Simple Mail Transfer Protocol”, April 2001.
ftp://ftp.rfc-editor.org/in-notes/rfc2821.txt
[33]
P. Resnick: “RFC2822: Internet Message Format”, April 2001.
ftp://ftp.rfc-editor.org/in-notes/rfc2822.txt
[34]
J. Postel: “RFC821: Simple Mail Transfer Protocol”, August 1982.
ftp://ftp.rfc-editor.org/in-notes/rfc821.txt
[35]
J. Postel: “Internet Protocol”, September 1981.
ftp://ftp.rfc-editor.org/in-notes/rfc791.txt
[36]
Peter Lechner: “Das Simple Mail Transfer Protokoll und die Spamproblematik“ [The Simple Mail Transfer Protocol and the spam problem], Diplomarbeit (diploma thesis), Institute of Distributed and Multimedia Systems, Faculty of Computer Science, University of Vienna, 2005 (in preparation).
[37]
G. Hulten et al.: “Trends in Spam Products and Methods”, Microsoft Research, 2004.
www.ceas.cc/papers-2004/165.pdf
[38]
A. Birrell et al.: “The Penny Black Project”.
http://research.microsoft.com/research/sv/PennyBlack/
[39]
C. Dwork, A. Goldberg, and M. Naor: "On Memory-Bound Functions for Fighting Spam",
Proceedings of the 23rd Annual International Cryptology Conference (CRYPTO 2003), August
2003.
[40]
M. Abadi, M. Burrows, M. Manasse, and T. Wobber: "Moderately Hard, Memory-bound
Functions", Proceedings of the 10th Annual Network and Distributed System Security
Symposium, February 2003.
[41]
Hashcash, A denial-of-service counter measure tool.
http://www.hashcash.org/
[42]
D. Turner, D. Havey: “Controlling spam through Lightweight Currency”, November 4, 2003.
http://ftp.csci.csusb.edu/turner/papers/turner_spam.pdf
[43]
D. Sorkin: “Overview over the most important anti-spam laws”, December 2003.
www.spamlaws.com
[44]
Andreas Sabadello: “Schutz vor unerwünschten E-Mails” [Protection against unsolicited e-mails], seminar paper (criminology seminar), University of Vienna, summer semester 2004.
[45]
Federal Law in USA, CAN-SPAM Act of 2003, January 1, 2004.
http://www.spamlaws.com/federal/108s877.html
[46]
DIRECTIVE 2000/31/EC OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of
8 June 2000 on certain legal aspects of information society services, in particular electronic
commerce, in the Internal Market (Directive on electronic commerce).
http://www.spamlaws.com/docs/2000-31-ec.pdf
[47]
DIRECTIVE 2002/58/EC OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of
12 July 2002 concerning the processing of personal data and the protection of privacy in the
electronic communications sector (Directive on privacy and electronic communications).
http://www.spamlaws.com/docs/2002-58-ec.pdf
[48]
Bill on Telecommunication, TKG 2003.
http://www.rtr.at/web.nsf/deutsch/Telekommunikation_Telekommunikationsrecht_TKG+2003
[49]
Bill on E-Commerce (ECG), BGBl I Nr. 152/2001.
http://www.Internet4jurists.at/gesetze/bg_e-commerce01.htm
[50]
Spamhaus, The Spamhaus Project.
http://www.spamhaus.org
[51]
Habeas, Sender Warranted E-Mail, 2004.
http://www.habeas.com
[52]
Bonded Sender Program, IronPort.
http://www.bondedsender.com
[53]
John Ioannidis: “Fighting Spam by Encapsulating Policy in E-Mail Addresses”, Proceedings of
Network and Distributed Systems Security Conference (NDSS), 2003.
[54]
T. Tompkins, D. Handley: "Giving e-mail back to the users: Using digital signatures to solve the
spam problem”, First Monday, 8(9), September 2003.
http://firstmonday.org/issues/issue8_9/tompkins/index.html
[55]
Sender Policy Framework.
www.spf.pobox.com
[56]
G. Fecyk: “Designated Mailers Protocol”, December 2003.
http://www.pan-am.ca/dmp/draft-fecyk-dmp-01.txt
[57]
Hadmut Danisch: “Reverse MX“.
http://www.danisch.de/work/security/antispam.html
[58]
Microsoft: “Caller ID for E-Mail Technical Specification: The Next Step to Deterring Spam”,
February 12, 2004.
http://www.microsoft.com/downloads/details.aspx?FamilyID=9a9e8a28-3e85-4d07-9d0f6daeabd3b71b&displaylang=en
[59]
J. Lyon: “Purported Responsible Address in E-Mail Messages Specification”, October 2004.
http://www.microsoft.com/downloads/details.aspx?familyid=f8e9cb40-cc7c-46d6-8cd13a86a46546d5&displaylang=en
[60]
IC Group Inc.: “Sender Rewriting Scheme”.
http://spf.pobox.com/srs.html
[61]
Microsoft Corporation: “Sender ID”.
http://www.microsoft.com/senderid
[62]
Yahoo! Inc.: “DomainKeys”.
http://antispam.yahoo.com/domainkeys
[63]
T. Loder, M.V. Alstyne, R. Wash: “An economic answer to unsolicited communication”, ACM
2004.
[64]
E. Harris: “The Next Step in the Spam Control War: Greylisting”, August 28, 2003.
http://projects.puremagic.com/greylisting/whitepaper.html
[65]
DigiPortal Software Inc.: “Choice Mail, A Spam Blocker – Not just a spam filter”.
http://www.digiportal.com
[66]
P. Gburzynski: “Spam-Free E-Mail Service”.
http://sfm.cs.ualberta.ca/
[67]
MXLogic – Spam Classification Techniques.
http://www.mxlogic.com
[68]
Jeffrey E.F. Friedl: “Mastering Regular Expressions, Powerful Techniques for Perl and Other
Tools”, ISBN: 1-56592-257-3, O'Reilly, January, 1997.
[69]
Oleg Kolesnikov, Wenke Lee, Richard Lipton: “Filtering Spam Using Search Engines”, 2003.
http://www.cc.gatech.edu/~ok/
[70]
Karl A. Krueger: “The Spam Battle 2002: A Tactical Update”, SANS GSEC Practical, v1.4,
September 2002.
http://www.sans.org/rr/whitepapers/email/589.php and http://www.rhyolite.com/anti-spam/dcc/
[71]
Vipul’s Razor: “A distributed, collaborative, spam detection and filtering network”, December 3,
2004.
http://razor.sourceforge.net/
[72]
Pyzor.
http://pyzor.sourceforge.net/
[73]
G. Salton, C. Buckley: “Term Weighting Approaches in Automatic Text Retrieval”, Information
Processing and Management, 24:513–523, 1988.
[74]
W. Yerazunis: “The Spam Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past
It”, MIT Spam Conference 2004.
[75]
J. Zdziarski: “Controlling Filter Complexity: Statistical-Algorithmic Hybrid Classification”.
[76]
Corinna Cortes, Vladimir Vapnik: “Support-vector networks”, Machine Learning, 20(3):273–297, November 1995.
[77]
Vladimir Vapnik: “The Nature of Statistical Learning Theory”, Springer-Verlag, Heidelberg,
Germany, 1995.
[78]
Christopher M. Bishop: “Neural Networks for Pattern Recognition”, Oxford University Press,
1995.
[79]
G. Wittel, S. Wu, U. Davis: “On Attacking Statistical Spam Filters”, CEAS 2004.
http://www.ceas.cc/papers-2004/slides/170.pdf
[80]
K. Eide: “Winning the war on spam: Comparison of bayesian spam filters”, August 2003.
http://home.dataparty.no/kristian/reviews/bayesian/
[81]
A. Kolcz, J. Alspector: “SVM based filtering of e-mail spam with content-specific
misclassification costs”, In Proceedings of the TextDM'01 Workshop on Text Mining – held at
the 2001 IEEE International Conference on Data Mining, 2001.
[82]
H. Drucker, Donghui Wu, and V.N. Vapnik: “Support vector machine for spam categorization”, IEEE Transactions on Neural Networks, 10(5):1048–1054, September 1999.
[83]
Ana Cardoso-Cachopo, L. Oliveira: “An Empirical Comparison of Text Categorization Methods”, SPIRE 2003, LNCS 2857, pp. 183–196, 2003.
[84]
D.J. Bernstein: “Internet Mail 2000“.
http://cr.yp.to/im2000.html
[85]
Jonathan de Boyne Pollard: “Fleshing out IM2000”.
http://homepages.tesco.net./%7EJ.deBoynePollard/Proposals/IM2000/
[86]
Shane Hird: “Technical Solutions for Controlling Spam”.
http://security.dstc.edu.au/papers/technical_spam.pdf
[87]
W. Weinman: “Authenticated Mail Transfer Protocol”.
http://amtp.bw.org/docs/draft-weinman-amtp-03.txt and http://amtp.bw.org/
[88]
NetworkWorldFusion, Buyers guide.
http://www.nwfusion.com/bg/2003/spam/index.jsp
[89]
Spamotomy – a comparison tool.
http://www.spamotomy.com
[90]
Installation Guide and Administration Guide for Brightmail Anti-Spam, Version 6.0 (Document
Version 1.0).
[91]
Kaspersky Anti Spam 2.0 Enterprise Edition Manual.
http://www.kaspersky.com/de/downloads?chapter=146440562&downlink=149404921
[92]
SurfControl E-Mail Filter for SMTP: Administrator’s Guide (Version 4.7 created September
2003).
[93]
Definitions for active content found in GOOGLE.
http://www.google.at/search=define:active+content
[94]
Symantec Mail Security for SMTP: Administration Guide (Documentation Version 4.0).
[95]
Ikarus Software, Managed Security Services, My Mail Wall.
http://www.mymailwall.at/
[96]
Spamkiss, Anti-Spam Software.
http://www.spamkiss.com/
[97]
Homepage SpamAssassin.
http://spamassassin.apache.org, http://wiki.apache.org/spamassassin/
[98]
Tests performed by SpamAssassin.
http://spamassassin.apache.org/tests_3_0_x.html
[99]
SpamAssassin Bayes Frequently Asked Questions.
http://wiki.apache.org/spamassassin/BayesFaq
[100]
CRM 114, Bayesian Classifier.
http://crm114.sourceforge.net/
[101]
Paolo Frasconi, Giovanni Soda, and Alessandro Vullo: “Hidden Markov Models for Text Categorization in Multi-Page Documents”, Journal of Intelligent Information Systems, 18(2–3):195–217, 2002.
[102]
Bogofilter, Bayesian Classifier.
http://www.bogofilter.org
[103]
Ling-Spam and PU1 spam corpora.
http://www.iit.demokritos.gr/skel/i-config/downloads/
[104]
SpamAssassin test samples (public corpus).
http://www.spamassassin.org/publiccorpus