Anti-Spam Methods – State-of-the-Art

W. Gansterer, M. Ilger, P. Lechner, R. Neumayer, J. Strauß
Institute of Distributed and Multimedia Systems
Faculty of Computer Science
University of Vienna, Austria

March 2005

This report summarizes the results of Phase 1 of the project FA 384018 "Spamabwehr" of the Institute of Distributed and Multimedia Systems at the University of Vienna, funded by Mobilkom Austria, UPC Telekabel and Internet Service Providers Austria (ISPA). We would like to thank Mobilkom Austria, UPC Telekabel and Internet Service Providers Austria (ISPA) for their support, which made this research project possible. We also would like to express our gratitude to all those commercial vendors of anti-spam tools who provided us with their products for experimental investigation, as well as to the volunteers who provided us with private e-mail messages for testing purposes.

Copyright: © 2005 by University of Vienna. All rights reserved. No part of this publication may be reproduced or distributed in any form or by any means without the prior permission of the authors. The Institute of Distributed and Multimedia Systems at the University of Vienna does not guarantee the accuracy, adequacy or completeness of any information and is not responsible for any errors or omissions or the results obtained from the use of such information.

Note: Experimental data not to be used for ranking purposes. Since the objective of this report was the analysis of existing methodology, and not a comprehensive and detailed evaluation or comparison of available anti-spam products/tools, the results of our experiments must not be interpreted as a "ranking". In order to produce a sound basis for a rigorous "ranking" of various anti-spam products/tools, more effort would have to be spent on defining comparable parameter settings and on fine tuning.

About the Authors

Project "Spamabwehr" was launched in summer 2004 at the Department of Computer Science (Distributed Systems group) which, due to internal restructuring at the University of Vienna, became the new Institute of Distributed and Multimedia Systems at the Faculty of Computer Science.

The team: Dr. Wilfried Gansterer (project leader), Michael Ilger, Peter Lechner, Robert Neumayer and Jürgen Strauß.

[Photo. From left to right: J. Strauß, M. Ilger, P. Lechner, W. Gansterer, R. Neumayer]

Contact for Project "Spamabwehr": phone: +43-1-4277-39650; e-mail: each team member can be contacted at [email protected]

The institution: The Faculty of Computer Science (Fakultät für Informatik) is currently led by Dean Prof. Dr. Günter Haring. The Institute of Distributed and Multimedia Systems, headed by Prof. DDr. Gerald Quirchmayr, is one of the institutes within this faculty.

Institute of Distributed and Multimedia Systems
University of Vienna
Lenaugasse 2/8, A-1080 Vienna (Austria)

Table of Contents

Executive Summary
1. Introduction
1.1. What is "Spam"?
1.2. Statistical Data
1.2.1. Total Amount of Spam
1.2.2. Sources of Spam
1.2.3. Content of Spam
1.3. The Economic Background
1.3.1. Why Spam?
1.3.2. Damage Caused by Spam
1.3.3. Conclusion
1.4. The Technical Background
1.4.1. Simple Mail Transfer Protocol
1.4.2. Internet Message Format
1.4.3. Spammers' Techniques
2. Anti-Spam Methods
2.1. Quality Criteria for Anti-Spam Methods
2.2. Sender Side (=Pre-Send) Methods
2.2.1. Increasing Sender Costs
2.2.2. Increasing Spammers' Risk
2.3. Receiver Side (=Post-Send) Methods
2.3.1. Approaches Based on Source of Mail
2.3.2. Approaches Based on Content
2.3.3. Using Source and Content
2.4. Sender and Receiver Side
2.4.1. IM 2000
2.4.2. AMTP
3. Products and Tools
3.1. Overview
3.1.1. Quality Criteria
3.1.2. Comparisons of Anti-Spam Software
3.2. Commercial Products
3.2.1. Symantec Brightmail Anti-Spam
3.2.2. Kaspersky Anti-Spam
3.2.3. SurfControl E-Mail Filter for SMTP
3.2.4. Symantec Mail Security for SMTP
3.2.5. Borderware MXtreme Mail Firewall
3.2.6. Ikarus mySpamWall
3.2.7. Spamkiss
3.3. Open Source
3.3.1. SpamAssassin
3.3.2. CRM 114
3.3.3. Bogofilter
4. Performance Evaluation
4.1. Sources for Our Own Samples
4.1.1. University of Vienna
4.1.2. Mobilkom Austria
4.1.3. UPC Telekabel
4.2. Test Sample Description
4.2.1. Our Test Sample
4.2.2. SpamAssassin Test Sample
4.3. Experimental Setup
4.3.1. Windows Test Process
4.3.2. Linux Test Process
5. Experimental Results
5.1. Our Test Sample
5.1.1. Commercial Products
5.1.2. Open Source Tools
5.1.3. Conclusion
5.2. SpamAssassin Test Sample
5.2.1. Commercial Products
5.2.2. Open Source Tools
5.2.3. Conclusion
6. Conclusion
6.1. Methods
6.2. Experiments
7. List of Figures
8. List of Tables
9. Index
10. Bibliography

Executive Summary

This report summarizes the findings and results of the first phase of the project "FA 384108 Spam-Abwehr" ("Spam Defense"), which was launched in July 2004 at the Department of Computer Science and Business Informatics at the University of Vienna and is supported by Mobilkom Austria, UPC Telekabel and Internet Service Providers Austria (ISPA). This document is structured as follows.

(1) Section 1 provides an introduction to the topic by discussing definitions, summarizing recent spam statistics, and reviewing the relevant economic and technical background.

(2) Section 2 summarizes the "state-of-the-art" of methods for detecting and avoiding spam. Although this is an extremely difficult task in general – not only due to the nature of the problem and the enormous dynamics of current research and development activities, but also due to restricted access to information about proprietary methods – we were able to cover the most important approaches at a methodological, scientifically oriented level. We tried to be as comprehensive as possible within the given space and time limitations, and most of the important methods are covered briefly (without going into details). Our survey of methods is based on a categorization of anti-spam methods (see Figure 5), which we developed and which provides an abstract framework for all existing anti-spam approaches.

(3) Section 3 contains an overview and a short description of a few anti-spam products and tools which we had access to. This comprises commercial products (Symantec Brightmail, Kaspersky, SurfControl, Borderware, Ikarus) as well as important open source tools (SpamAssassin, CRM 114, Bogofilter).

(4) Section 4 summarizes the setup we used for experimenting with the products and tools mentioned above. In particular, it describes the two test sets containing spam and ham messages (one of them we collected ourselves from various sources, the other one is publicly available) and the hardware we used.

(5) Section 5 summarizes our experimental results in detail. Detection rates and false positive rates are given for each of the products and tools used.
Since the goal of this report was an analysis of existing methodology, and not a comprehensive and detailed evaluation or comparison of the anti-spam products/tools available, the results of our experimental evaluation must not be interpreted as a "ranking". In order to produce a rigorous ranking, we would have to use a wider variety of test sets, we would have to spend much more effort on tuning the products and tools (which can be an enormously time consuming task), we would have to monitor their performance over a longer period of time, and we would also have to take into account other properties beyond detection rates (such as user-friendliness, administrative overhead, etc.). Thus, the results quoted should be considered approximations of the performance achievable with the respective tools. In many cases, our results are reasonably good indications of the performance to be expected – experience shows that even with higher tuning efforts the detection rates usually cannot be expected to increase much.

(6) Finally, Section 6 summarizes our findings and conclusions.

Table 1 provides a compact overview of the products and tools we experimented with and of some of the most common anti-spam methods, indicating which product/tool uses which method.

[Table 1: Products/tools considered and the methods used by these (for further remarks see the note in the front matter). Products/tools: Symantec Brightmail AntiSpam, Kaspersky AntiSpam, SurfControl E-Mail Filter, Symantec Mail Security for SMTP, MXtreme Mail Firewall, Ikarus mySpamWall, SpamAssassin, CRM 114, Bogofilter. For each product/tool, the table records whether it is commercial or open source, whether it runs as a service, on Windows or Linux, or as a proprietary appliance, and which of the following methods it uses: whitelist, blacklist, SPF, challenge/response, token-based challenge/response, greylist, fingerprint, proprietary methods, Bayes, neural networks, DCC, Pyzor, Razor, URL whitelist, URL blacklist, static techniques (keywords, ...), digital signature, Hashcash, SVM. Legend: C…Commercial, O…Open Source, W…Windows, L…Linux, Prop…Proprietary]

1. Introduction

Among the strengths of electronic communication media such as electronic mail (e-mail) are the relatively low transmission costs, high reliability and generally fast delivery. Electronic messaging is not only cheap and fast; it is also easy to automate. These properties obviously make it very attractive for commercial advertising purposes as well, and in recent years we have experienced a development where electronic messaging is abused by flooding users' mailboxes with unsolicited messages.

Spamming is the act of sending unsolicited (commercial) electronic messages in bulk, and the word spam has become the synonym for such messages. The word is originally derived from spiced ham (luncheon meat), which is a registered trademark of Hormel Foods Corporation [1]. Monty Python's Flying Circus used the term spam in the so-called "spam sketch" as a synonym for frequent occurrence – and someone adopted this for unsolicited mass mail. Based on the origin of the word spam, all other (desired) e-mail is called ham. Official terminology for spam, unsolicited bulk e-mail (UBE) or unsolicited commercial e-mail (UCE), is introduced in Section 1.1.

The most common purpose of spamming is advertising. Offered goods range from pornography, computer software and medical products to credit card accounts, investments and university diplomas. Many of these products are of ill repute or of questionable legality. The main motivation for spamming is commercial profit.
As mentioned above, the costs for sending millions of spam messages are very low. In order to make a good profit, it suffices if only a very small fraction (0.1% or even less) of the spam messages sent out are replied to and lead to business transactions.

Spam has severe negative effects on e-mail users. Obviously, it consumes computing, storage and network resources as well as human time and attention to dismiss unwanted messages. Moreover, it has various indirect effects which are very difficult to account for – the spectrum ranges from measurable costs, like spam filter software and administration, to costs that are hard to measure, like a lost e-mail message (expensive for a business, less so for a private person).

We can distinguish five different types of spam: beyond e-mail spam, there is messaging spam (often called spim – spam using instant messaging), newsgroup spam (excessive multiple postings in newsgroups), mobile phone spam (text messages), and Internet telephony spam (via Voice over IP).

Perspective Taken. In this report, our focus is on summarizing the state-of-the-art in methods and techniques for detecting and avoiding e-mail spam, because it is the most common kind of spamming on the Internet. In the future, other types of spam, in particular spam to all kinds of mobile devices (cellular phones, PDAs, etc.), will become more of a problem. Although some technical aspects will require further investigation in this context, most of the methods discussed in this report will also be of relevance there. Moreover, we emphasize approaches suitable for the server side (ISP) as opposed to methods designed for the client side (individual user), since an important goal is to find centralized solutions suitable for ISPs. In such a context, user feedback – if feasible – can be one way to control and improve quality, but it should not be an integral part of anti-spam methods.

We have also experimented with a few implementations of commonly available methods in order to illustrate their performance in practice. Our goal was not to produce a complete survey, evaluation, comparison or even "ranking" of existing anti-spam products. The tools considered in the evaluation were picked based on availability and on the methods they implement – we cannot and do not intend to claim any form of completeness in terms of anti-spam products or tools considered in this report.

Synopsis. In the remainder of this section, we provide a more detailed overview of background information related to the spam topic: official terminology and definitions of "spam", statistics about spam, the technical background relevant to the phenomenon "spam", and finally the economic background that explains the motivation of spammers. In Chapter 2, we outline the most important anti-spam methods based on a classification we have developed. In Chapter 3, we survey a few existing commercial and non-commercial tools. In Chapter 4, we discuss how the performance of anti-spam methods and software can be evaluated, and we describe the setup of our experimental evaluation. In Chapter 5, we summarize and interpret the results of our tests with two different samples (our own test set and a publicly available test set), and in Chapter 6, we summarize our conclusions.

1.1. What is "Spam"?

First, we need to carefully define some central terminology. As mentioned before, the commonly used word "spam" was derived from a completely different context.
In this section, we summarize the official and/or more technical terminology used for spam, such as unsolicited commercial e-mail (UCE) and unsolicited bulk e-mail (UBE).

Unsolicited Commercial E-Mail (UCE): "E-Mail containing commercial information that has been sent to a recipient who did not ask to receive it" [2], or: "Unsolicited e-mail is advertising material sent by e-mail without the recipient either requesting such information or otherwise explicitly expressing an interest in the material advertised." [3]

Unsolicited Bulk E-Mail (UBE): "E-Mail with substantially identical content sent to many recipients who did not ask to receive it. Almost all UBE is also UCE." [2], or: "Unsolicited Bulk E-Mail, or UBE, is Internet mail ('e-mail') that is sent to a group of recipients who have not requested it. A mail recipient may have at one time asked a sender for bulk e-mail, but then later asked that sender not to send any more e-mail or otherwise not have indicated a desire for such additional mail; hence any bulk e-mail sent after that request was received is also UBE." [4]

In our opinion, no e-mail that is solicited can be considered spam. However, there may be spam which is not sent out in bulk or which does not involve (direct) commercial interest. Ultimately, the classification of an e-mail message as spam often becomes a highly subjective decision, and it is very difficult – if not impossible – to establish common criteria covering a wide range of affected users. Nevertheless, based on the statements quoted above, we identify three central features which we consider defining properties of spam (not always do all three of them have to apply):

1. It is unsolicited, that is, mail the receiver did not request.
2. It is sent out in bulk, that is, to many recipients.
3. Usually, there is commercial interest involved, for example, interest in advertising (and selling) some product.

Two more relevant technical terms have been established for the special context of newsgroup postings:

Excessive Multi-Posting (EMP): "Excessive Multi-Posting (EMP) refers to sending the same (or nearly the same) message, one by one, to multiple newsgroups. Multiple posting is almost never recommended because (a) multiple messages are better sent by cross-posting and (b) follow-ups will be posted in different newsgroups." [3]

Excessive Cross-Posting (ECP): "Excessive Cross-Posting (ECP) refers to sending a message to many newsgroups all at once. Sometimes, if a message could belong in more than one group, it can be useful to cross-post. However, in this case, an appropriate 'Followup-To:' header can ensure that the discussion continues only in one designated newsgroup. Cross-posting to too many groups is considered spam, especially if no 'Followup-To:' header is included." [3]

1.2. Statistical Data

Only ten years ago, the spam problem did not exist. In the last one or two years, it has become a major concern not only for private users, but also for businesses, due to the potential economic and commercial damage. In this section, we try to give a short overview of current spam statistics to illustrate the development and the dynamics of the issue. Unfortunately, all available statistics have one major disadvantage – they become outdated quickly. There are many sources of information about statistics on the development of spam, such as [5] or [6].

1.2.1. Total Amount of Spam

The amount of spam sent over the Internet has been rising dramatically in recent years, and no decline is to be expected in the near future.
This is clearly illustrated by data about the share of spam in the total number of e-mail messages sent. In 2000, only 7% of the messages sent worldwide were spam [7], whereas in 2002 already 40% were spam [8]. The current percentage is estimated to be around 65% worldwide, with much higher estimates for some regions (for example, up to 90% for the USA). Based on that trend, some pessimists even predict the end of the e-mail infrastructure for 2007 [9]. Until July 2004, the anti-spam software developer Brightmail published monthly statistics about spam, as shown in Figure 1.

[Figure 1: Percentage of e-mail identified as spam, June 2004 (no newer data available) [10]]

Messagelabs [11] published Figure 2, which reflects the interaction between legislative measures / law enforcement and the percentage of spam in e-mail scanned.

[Figure 2: Interaction of legislative measures, law enforcement and percentage of spam [11]]

It is clearly visible that the percentage of spam in all e-mail messages sent still has an increasing trend, but it also tends to react significantly to the introduction of new legislative measures and to legal actions taken against spammers. This interpretation also has to be seen in the light of the conjecture that 80% of the spam sent worldwide comes from very few (roughly 200) distinct spammers [12].

1.2.2. Sources of Spam

Having illustrated that the percentage of spam in e-mail messages sent worldwide increases rapidly, we also tried to collect some statistics about where the spam comes from, both in terms of geographic regions and in terms of Internet domains. A ranking of the most spam-producing countries (as of 24 August 2004) according to the anti-virus and anti-spam vendor Sophos [13] is shown in Table 2:

 1. United States        42.53%
 2. South Korea          15.42%
 3. China (& Hong Kong)  11.62%
 4. Brazil                6.17%
 5. Canada                2.91%
 6. Japan                 2.87%
 7. Germany               1.28%
 8. France                1.24%
 9. Spain                 1.16%
10. United Kingdom        1.15%
11. Mexico                0.98%
12. Taiwan                0.91%
    Others               11.76%

Table 2: The top twelve sources of spam, geographically [13]

For collecting the data summarized in this table, researchers used honeypots (spam traps: e-mail addresses never published to humans, so that any e-mail sent to such addresses has to be spam). It is interesting to note that, compared to the data from February 2004, Canada reduced its rate from 6.8% to 2.9%, whereas South Korea tripled its rate. In general, about 40% of the world's spam is sent out from "zombie computers" (computers infected with viruses of all kinds and misused for sending out spam) [8]. Figure 3, provided by the anti-spam vendor Commtouch [14], illustrates from which domains spammers prefer to send out spam.

[Figure 3: The top ten sources of spam (domains) [14]]

Postini [15] provides some interesting statistics investigating the sources of spam and of directory harvest attacks (the theft of confidential e-mail directory information, for example of lists of e-mail addresses of all employees of an organization). On their graphic illustration, Austria seems to be among the hotspots of spammers' activity. Upon closer examination, it turns out that the visual impression is due to the following three entries [15]:

48.22  16.37  AT VIENNA WIEN (state) RIPE  ev_dictatk  20881
48.22  16.37  AT VIENNA WIEN (state) RIPE  ev_spamatk     26
48.22  16.37  AT VIENNA WIEN (state) RIPE  ev_dictatk    211

In each entry, the first three fields specify the location (latitude, longitude, additional location information), whereas the last two fields describe the event type and the intensity of the attack.
That means there were 26 spam attacks (ev_spamatk) and more than 20,000 directory harvest attacks (ev_dictatk) originating from Austria during the last six months.

1.2.3. Content of Spam

As we will illustrate in detail in Section 1.3, the central motivation for sending out spam is to make money. Spammers follow the direct marketing approach; they mostly earn their money by marketing, and only rarely by direct selling. Figure 4 shows a categorization of spam in terms of its content, based on data from Brightmail [10].

[Figure 4: Spam categorized in terms of content (data from July 2004 [10]): Products 25%, Financial 18%, Adult 15%, Scams 9%, Health 8%, Other 6%, Internet 5%, Fraud 5%, Leisure 4%, Spiritual 3%, Political 2%]

1.3. The Economic Background

In this section, we give a short overview of the economic business model of spammers. There is a simple reason for the recent dramatic rise in the amount of spam sent – it seems to be a relatively easy way to make money. We will outline why spamming can be so profitable. Later, in Section 2.2, some approaches trying to reduce this motivation and thus fighting the spam problem "at its source" will be discussed.

1.3.1. Why Spam?

There are many sources of information dealing with the reasons why spam is around, for example [16]. Spam is sent out by companies and by individuals, but primarily for a single reason – to make profit using a new form of direct marketing.

Classical direct marketing, using methods such as brochures, TV and radio spots, telephone calls, doorstep sales, etc., has been used for a long time. For these marketing methods, the costs associated with every step in the process are significant. More importantly, the costs for classical direct marketing increase proportionally with the number of potential customers reached, and revenue is only created by selling real products or services. In this classical approach, fraud is largely excluded because an initial investment in advertising is necessary in order to make money down the line.

With the availability of e-mail communication, new direct marketers were able to reduce the costs of direct marketing to an amount that is negligible in proportion to the number of potential customers reached. This increases the margin of profit considerably. In the following, we describe this business model (spammers' costs and revenues) in more detail.

1.3.1.1. Cost Factors

The following list summarizes a few of the cost factors characteristic of spamming businesses.

Product: Most spammers do not sell anything to the recipients of spam – they are just acting as marketers (thus, spammers do not have to invest in actually purchasing products).

Marketing Material: The creation of an e-mail does not require any highly specialized software or knowledge. Thus, producing the marketing material is very cheap, and – one of the most important differences to classical marketing – the costs for sending out marketing material do not increase proportionally with the number of potential customers reached.

Spam Tools: Tools for generating and sending out millions of personalized e-mail messages are available, very inexpensive (often even free) and easy to use.
Address Harvesting Tools: Tools for collecting addresses automatically on the Internet can be downloaded [17][18], and addresses can be bought or rented [19]. Although the price of e-mail addresses depends on their "quality", they are generally quite inexpensive.

The Spam Campaign: Set up an Internet connection (for example, a free trial account), send out millions of messages from this account within a short period, and move on to the next ISP to get a new (free) account.

Other Costs: These include hardware and maintenance costs, but may also include costs for responding to interested buyers (automated, in order to avoid personal interaction, for example, via a Web interface).

1.3.1.2. Profit Factors

In the following, we list some of the most important sources of income and profit for spammers.

Direct Income: The most common form of income for spammers is that they act as marketing companies and are paid for marketing campaigns.

Web Banner Revenues: In many cases, spammers get revenue for every visit to a Web site advertised in a spam e-mail.

Validation of Contact Information: Another source of income for spammers is to validate e-mail addresses (for example, via responses to "unsubscribe here" invitations in spam messages) and to sell this information to other spammers or direct marketing companies.

Selling Spam Business Models: The above is a special case of a more general concept where spammers sell the information collected from responses to spam messages to others.

Scams: In many cases, spam messages are hidden attempts to find out personal or access information ("phishing"), such as credit card information, bank account information, etc., which can then be used for criminal activities (theft, illegal investment, etc.). Other kinds of scam include dubious job offers, Ponzi schemes (investment swindles in which some early investors are paid off with money put up by later ones in order to encourage more and bigger risks), Internet gambling, auctions, sexual offers and pre-paid purchase orders with no supply of the ordered goods [20].

Product Selling: Only a minority of the companies who send out spam also sell the advertised products themselves.

1.3.1.3. An Example

The following example gives an impression of how spammers' businesses operate. The description is based on an interview [21] with an anonymous spammer who runs a rather small-scale operation.

The spammer used an account at Send-safe [22], which allowed him to send out 400,000 e-mail messages via open proxies for US$ 50. On average, he sent out approximately 61,000 e-mail messages per day. The recipients were taken from a CD containing 4,000,000 e-mail addresses, which he bought for 300 Euros from [23]. It turned out that only 56% of the addresses on the CD were syntactically correct, and 25% of these bounced due to full or out-of-use mailboxes. As the Web site referred to in his spam, he used bulletproof hosting in China via Worldsoftwarehouse [24], which charges 125 Euros per month ("bulletproof" indicates that nothing can shut down the hosting service; such services enable the sending of spam without the threat of Web site account cancellation). He used a link counter [25] to get an idea of how many persons viewed his e-mail (by counting how often it is opened in an e-mail client). On average, about 30 persons per day ordered 2.5 units of the product offered.

Table 3 summarizes this operation for a typical month (one month = 30 days; prices are given in Euros, using the conversion rate 1 US$ = 0.75913 Euro).

Quantity of e-mail:
  Mail sent                      61,000*30                      1,830,000 (100%)
  Mail viewed by user            19,136*30                        574,080 (31.37%)
  Visitors on spamvertized site     359*30                         10,770 (0.59%)
  People ordering                    22*30                            660 (0.036%)

Fixed costs:
  E-mail addresses               300                               -300.00
  Hosting cost                   125                               -125.00
  Link counter                   19.90*0.75913                      -15.11
Variable costs:
  Open proxy cost                50/400,000*1,830,000*0.75913      -173.65
  Purchase of goods              660*2.5*2.95                    -4,867.50
  Wrapping fee                   660*2.5*0.50                      -825.00
Total cost:                                                      -6,306.26
Revenue:
  Sales revenue                  660*2.5*8                       13,200.00
Result (monthly profit before taxes):                             6,893.74

Table 3: Cost-profit equation of a spammer (simplified, monthly basis)
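The arithmetic behind Table 3 can be reproduced in a few lines. The following Python sketch recomputes the monthly figures from the interview data; the variable names are ours, and the same simplifications apply (hardware, Internet access and taxes are ignored):

```python
# Recompute the simplified monthly cost-profit equation of Table 3.
# All figures are taken from the interview [21]; variable names are ours.
DAYS = 30
USD_TO_EUR = 0.75913

mails_sent = 61_000 * DAYS            # 1,830,000 messages per month
units_sold = 22 * DAYS * 2.5          # 660 orders of 2.5 units each

fixed_costs = (
    300.00                            # CD with 4,000,000 e-mail addresses
    + 125.00                          # bulletproof hosting per month
    + 19.90 * USD_TO_EUR              # link counter
)
variable_costs = (
    50 / 400_000 * mails_sent * USD_TO_EUR   # open proxy sending cost
    + units_sold * 2.95                      # purchase of goods
    + units_sold * 0.50                      # wrapping fee
)
revenue = units_sold * 8.00           # sales revenue, 8 Euros per unit

print(f"total cost: {fixed_costs + variable_costs:9.2f} EUR")  # ~ 6,306.26
print(f"revenue:    {revenue:9.2f} EUR")                       #  13,200.00
print(f"profit:     {revenue - fixed_costs - variable_costs:9.2f} EUR")  # ~6,893.74
```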
Although some costs, like computer hardware and software, Internet access costs and taxes, are not included here, this simple example shows that spamming is highly profitable. To get an impression of really large operations, where millions of spam messages are sent per day, see [26].

1.3.2. Damage Caused by Spam

This huge amount of spam not only occupies enormous resources, it obviously also causes a dramatic loss of productivity. Whereas individuals only lose bandwidth and time (which is hard to quantify), companies potentially lose much more. Ferris Research [27] estimated the productivity losses for US corporations at 10 billion dollars for 2003 [28]. The European Union estimated a loss of productivity of 2.5 billion Euros for 2003 [7]. Various tools to gauge the losses for one's own company are available online (see, for example, [29] or the simpler version [30]).

1.3.3. Conclusion

As illustrated before, e-mail communication currently provides an excellent means for spammers to make high profits from sending out spam and related activities. All other Internet users, including individual users, businesses and ISPs, suffer from the damaging side effects of spamming activities, ranging from mailboxes filled with junk mail, which threatens the usefulness of e-mail as a means of communication, to all sorts of other costs in terms of bandwidth usage, storage requirements, and, last but not least, the manpower required to fight spam.

It seems obvious that advanced approaches for fighting the spam problem must include strategies to make spamming less attractive – not only by increasing the risks for spammers through stricter legal regulations, but also by harming spammers' business model, that is, by decreasing the potential margins of profit. In Section 2.2 we will discuss some approaches in this direction in detail. For best results, the infrastructure of the current Internet would have to be changed in part – and a lot of further work remains to be done – see also [31].

1.4. The Technical Background

The rules for transmitting an e-mail message and the composition of the message itself are defined in several protocol standards. In this section, we discuss the two basic underlying standards, the Simple Mail Transfer Protocol (SMTP, RFC 2821 [32]) and the Internet Message Format (RFC 2822 [33]). The main objective of SMTP is to support reliable and efficient mail transfer. The Internet Message Format defines the structure of e-mail messages.

1.4.1. Simple Mail Transfer Protocol

SMTP was originally developed in 1982 (RFC 821 [34]) and was consolidated and updated in 2001 [32]. SMTP is independent of the particular transmission subsystem and only requires a reliable ordered data stream channel. In the Internet protocol stack [35] it is located at the application layer (layer 4) and uses TCP for data transmission.
An e-mail message usually consists of three different parts – the SMTP envelope, the header and the body. SMTP specifies a set of commands to transmit an e-mail message between an SMTP client and an SMTP server. The exchange of these commands between the client and the server forms the SMTP envelope and is known as the so-called SMTP dialogue. A minimum SMTP implementation consists of nine commands. There is also a service extension model that permits the client and server to agree to utilize shared functionality beyond the original SMTP requirements. Table 4 shows a typical communication scenario.

 #  Station  Command and meaning
 1  Server:  <wait for connection on TCP port 25>
 2  Client:  <open connection to server>
 3  Server:  220 receiving.server.com ESMTP server ready.
 4  Client:  helo sending.server.com
 5  Server:  250 Hello, sending.server.com
 6  Client:  mail from: [email protected]
 7  Server:  250 Sender ok, send RCPTs
 8  Client:  rcpt to: [email protected]
 9  Server:  250 [email protected]
10  Client:  data
11  Server:  354 Start mail input; end with <CRLF>.<CRLF>
12  Client:  mail text…
13  Client:  .
14  Server:  250 2.6.0 [email protected] Queued mail for delivery
15  Client:  quit
16  Server:  221 receiving.server.com Service closing transmission channel

Table 4: Typical SMTP dialogue

The first step in the SMTP dialogue (lines 1-3) establishes the connection initiated by the client. The standard SMTP port is 25. After connection establishment, the server replies with code 220 (service ready, line 3). Relevant for the communication is just the reply code; the text behind it can vary. Now the client sends a "helo" command (line 4). In line 5, the server replies with code 250 (requested mail action okay), finishing the SMTP handshake. After specifying sender and recipient (lines 6-9), the client uses the "data" command (line 10) to tell the server that now the message itself will be transferred. The server acknowledges (line 11), whereupon the client transmits the content of the message (line 12). The end of the message is indicated by a "." on a line of its own (line 13). After receipt, an acknowledgement is sent to the client, including an internal message number assigned by the server (line 14). At the end of the communication, the client sends a "quit" command (line 15) to close the transmission channel. The server confirms with code 221 (service closing transmission channel, line 16).

With the exception of the IP address of the client, any information provided by the client within the SMTP dialogue can be forged (cf. [36]).
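The dialogue of Table 4 translates almost literally into code. The following Python sketch speaks raw SMTP over a TCP socket; the server and mailbox names are the placeholders from Table 4, and error handling, ESMTP extensions and dot-stuffing are omitted. Note that nothing stops a client from putting an arbitrary address into the "mail from:" command – precisely the forgeability mentioned above:

```python
import socket

def smtp_send(server: str, sender: str, recipient: str, text: str) -> None:
    """Minimal SMTP client following the dialogue of Table 4
    (illustration only: no error handling, no extensions)."""
    sock = socket.create_connection((server, 25))            # lines 1-2
    f = sock.makefile("rw", encoding="ascii", newline="")

    def expect(code: str) -> None:
        reply = f.readline()            # only the reply code matters
        assert reply.startswith(code), reply

    def send(command: str) -> None:
        f.write(command + "\r\n")
        f.flush()

    expect("220")                                            # line 3
    send("helo sending.server.com"); expect("250")           # lines 4-5
    send("mail from: " + sender); expect("250")              # lines 6-7 (unverified!)
    send("rcpt to: " + recipient); expect("250")             # lines 8-9
    send("data"); expect("354")                              # lines 10-11
    send(text); send(".")                                    # lines 12-13
    expect("250")                                            # line 14
    send("quit"); expect("221")                              # lines 15-16
    sock.close()

# smtp_send("receiving.server.com", "[email protected]",
#           "[email protected]", "mail text")
```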
1.4.2. Internet Message Format

So far, we have discussed the communication scenario between the client and the server, but not the message itself. The Internet Message Format [33] specifies how an e-mail message is composed. Generally, an e-mail consists of a header and a body.

Message Header. All header fields have the same general syntactic structure: a field name, followed by a colon, followed by the field body. The header fields can be grouped into "originator fields", "destination address fields", "identification fields", "information fields", "resent fields", "trace fields" and "optional fields". The "trace fields" are also discussed in [32]. Table 5 summarizes the most important header fields in the Internet Message Format.

Originator fields:
  From:         author of the message
  Sender:       sender of the message
  Reply-To:     reply address
Destination address fields:
  To:           primary recipient(s)
  CC:           other recipients
  BCC:          blind carbon copy (addresses are not submitted to the other recipients)
Identification fields:
  Message-ID:   unique message identifier
Information fields:
  Subject:      subject of the message
Trace fields:
  Return-Path:  the address to which messages indicating non-delivery or other
                mail system failures are sent
  Received:     when an SMTP server accepts a message, either for relaying or for
                delivery, it inserts a trace record including the sending and the
                receiving host and the arrival date and time of the message

Table 5: The most important header fields in the Internet Message Format [33]

Message Body. The second part of a message, called the message body, contains the information itself and, if structured, is defined according to the MIME protocol (Multipurpose Internet Mail Extensions). The message transfer between the original sender and the final recipient can occur over a single connection or in a series of hops through intermediary systems. The relaying of messages from unknown sources to unknown destinations causes one of the biggest problems of today's mail traffic, because spammers often use open relays (SMTP or ESMTP servers that provide unrestricted relaying services to everyone) for transmitting their mail. In Section 2.3.1 we will discuss existing methods for identifying spam based on header information in detail. Additional detailed information is also given in the diploma thesis [36].

1.4.3. Spammers' Techniques

Without being able to claim completeness, we briefly mention a few common techniques used by spammers to hide their tracks. Most of them do not send out e-mail messages through their ISP; instead, they try to connect to the destination mail server directly or to use open relays. This, and the possibility to forge almost any information in the header as discussed above, makes it nearly impossible to trace a spammer's real address or even his identity.

There are many other techniques spammers use to mislead or bypass filters. Again, we only give a very brief survey here; many of these techniques will be mentioned again in the context of the respective anti-spam method in Chapter 2. One important technique, albeit one that consumes a lot of processing power, is the personalization of every message for each receiver (no BCC; every receiver gets his "own" e-mail) in order to obscure bulk mailing. Less processing-intensive is the randomization of the subject field and the "From:" address line. Other techniques commonly used are forging the Message-ID, omitting the "To:" header, or adding random words and strings to a message in order to mislead Bayes filters. Table 6 shows some of the main techniques used by spammers and how their approach has shifted in the last two years in response to the development of anti-spam methods.

[Table 6: Adaptation of spammers' techniques to the development of filtering techniques [37]]
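Since the "Received:" trace records are the only header fields added by the receiving infrastructure itself, they are the natural starting point for the source-based methods discussed later in Section 2.3.1. The following Python sketch extracts them from a raw message with the standard email library; the sample message and its host names are invented:

```python
import email
from email import policy

# An invented sample message; in a real filter this would be the raw mail.
raw = """\
Received: from relay.example.net (relay.example.net [203.0.113.7])
\tby mx.receiving.example (Postfix); Tue, 15 Mar 2005 10:11:12 +0100
Received: from [198.51.100.23] (unknown)
\tby relay.example.net; Tue, 15 Mar 2005 10:10:55 +0100
From: someone <[email protected]>
To: max mustermann <[email protected]>
Subject: test
Message-ID: <[email protected]>

mail text
"""

msg = email.message_from_string(raw, policy=policy.default)

# Walk the trace records top-down: the topmost "Received:" header was added
# by the last (our own) server and is the only one we can really trust;
# everything below it may have been forged by the sender.
for hop, received in enumerate(msg.get_all("Received", []), start=1):
    print(f"hop {hop}: {' '.join(received.split())}")
```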
2. Anti-Spam Methods

In recent years, a vast number of methods and techniques for coping with the spam problem have been proposed and developed, ranging from legal countermeasures to very technical approaches. This is also reflected in a large number of publications on the topic. In order to bring some structure into this enormous amount of information, we introduce a categorization of anti-spam methods, shown in Figure 5.

Our basic distinction is between methods "acting" before an e-mail is sent out ("pre-send"), methods "acting" after the message has been sent out ("post-send"), and new regulations "acting" during the transfer of an e-mail (new protocols for mail transfer). This comprises virtually all existing approaches, ranging from attempts to decrease the amount of spam sent out to approaches based on text analysis and classification methods applied to a received e-mail.

[Figure 5: Categorization of anti-spam methods]

In this chapter, we will discuss all these methods in detail. We will also point out relations between relevant techniques, and evaluate them from the perspective taken in this study.

2.1. Quality Criteria for Anti-Spam Methods

Before we can discuss evaluations of anti-spam methods, we need to carefully define the quality criteria available for such an evaluation. All anti-spam methods which involve the process of deciding whether an (incoming or outgoing!) e-mail message is spam or not can be viewed as binary classifiers. This includes almost all available methods (in Figure 5, most "pre-send" methods, except for the methods contained in "increase sender risks", and all "post-send" methods, but not the approaches contained in "new protocols"). For these categories of anti-spam methods, central concepts and important terms can be "borrowed" from the areas of (text) classification and data mining, as illustrated in the following. Thus, for these types of methods there is a well-developed and clearly defined framework for evaluating their performance, as summarized in the following.

In order to evaluate the quality of an anti-spam method (which can be viewed as a binary classifier), we let it classify the members of a given set of e-mail messages (test set) into two groups, in our case spam and ham, or, more generally, positives and negatives (which of the two given classes is denoted as positives or negatives is up to the beholder). There is no single concept like "overall correctness" to measure the performance of a binary classifier (for two classes). Assuming that the correct classification of the test set is known, we can count how many of the messages in the test set have been classified correctly (true positives and true negatives) and how many of these messages have been misclassified by the anti-spam method under consideration. This gives us a first set of (absolute) quality criteria: true/false positives and true/false negatives.

In the context of anti-spam methods (and in all upcoming parts of this report) we follow the widespread convention of using the term "positives" to denote spam messages and the term "negatives" to denote ham messages. Consequently, any message will be classified as "positive" (spam) or "negative" (ham) by the anti-spam method. If this message actually is spam, but it was (wrongly) classified as negative, it is called a "false negative". If it actually is ham, but it was (wrongly) classified as positive, it is called a "false positive". Table 7 summarizes this concept. Each row corresponds to the known type of a message, and each column denotes the class assigned by a binary classifier. According to the table, a positive can be either a true positive, a spam message classified as spam, or a false positive, a ham message classified as spam. On the other hand, a ham message assigned to the ham group is a true negative, whereas a spam message that is classified as ham is a false negative.
Message / classified as   Spam (positives)   Ham (negatives)
Spam                      true positive      false negative
Ham                       false positive     true negative

Table 7: Quality metrics of binary classifiers for the spam problem

Based on these quantities, relative quality criteria of a binary classifier can be defined:

  sensitivity = true positives / (true positives + false negatives)

and

  specificity = true negatives / (true negatives + false positives).

Both of these quality metrics are between zero and one (often quoted as a percentage), and each of them measures the correctness per class. The sensitivity of a spam classifier is the proportion of all spam messages that are classified as spam. The closer the value of the sensitivity is to one, the more spam is classified correctly. Specificity denotes the corresponding correctness for the negatives, the ham messages. Theoretically, it is possible to achieve 100 percent sensitivity and specificity (imagine a human classifying blue and red objects). However, in the practice of anti-spam methods, there is often a tradeoff between achieving low false positive and high true positive rates; that is, the goal of not classifying any legitimate mail as spam (because it might otherwise never reach its destination) can often only be achieved at the cost of an increased rate of false negatives.
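These definitions translate directly into code. As a small illustration, the following Python sketch (function and label names are ours) computes both metrics from a list of (actual, predicted) pairs:

```python
def sensitivity_specificity(pairs):
    """Compute (sensitivity, specificity) from (actual, predicted) label
    pairs, where each label is "spam" (positive) or "ham" (negative)."""
    tp = sum(1 for actual, pred in pairs if actual == "spam" and pred == "spam")
    fn = sum(1 for actual, pred in pairs if actual == "spam" and pred == "ham")
    tn = sum(1 for actual, pred in pairs if actual == "ham" and pred == "ham")
    fp = sum(1 for actual, pred in pairs if actual == "ham" and pred == "spam")
    return tp / (tp + fn), tn / (tn + fp)

# A toy test set: four messages, one spam missed, no ham misclassified.
results = [("spam", "spam"), ("spam", "ham"), ("ham", "ham"), ("ham", "ham")]
sens, spec = sensitivity_specificity(results)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")  # 0.50, 1.00
```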
2.2. Sender Side (=Pre-Send) Methods

In this section, we discuss methods "acting" before an e-mail is sent out ("pre-send" methods), that is, the leftmost part of the categorization introduced in Figure 5. The underlying idea of pre-send methods is to discourage the sending of spam, in some sense to fight the problem before it occurs. We can distinguish two basic approaches in this category: strategies to increase the costs of sending e-mail, which harm spammers' business model as discussed in Section 1.3, and strategies to increase the risks of sending UBE/UCE in the form of stricter legal regulations and stricter enforcement of these regulations.

2.2.1. Increasing Sender Costs

Many interesting solutions for stopping spam e-mail are economically motivated. The main idea is to make the spammers' business model unprofitable. Two concrete strategies have been developed to achieve this – delaying the sending of each e-mail (technical solution) or introducing monetary fees for each e-mail (money-based solution). However, conceptually, the approach of increasing sender costs comprises not only monetary fees and costs in terms of time delays, but also other abstract costs.

2.2.1.1. Technical Solutions

Most technical solutions are based on CPU time: the sender of an e-mail is required to compute a moderately expensive function – a so-called pricing function – before the e-mail is actually sent. Since e-mail is generally not expected to be a medium for real-time communication, such a moderate delay for each e-mail is expected not to have any significance for the average regular e-mail user, who in most cases may not send much more than 20-50 e-mail messages a day, but it is very disturbing for a spammer, because it reduces the number of potential customers reached per unit of time (for details, see [31]). Since there is no need to change the SMTP protocol, it is easy to install such a system with a pricing function.

There is a major drawback of this approach, though – the lack of fairness of most pricing functions found so far. Ideally, a pricing function system should be "fair" in the sense that the delay it causes is independent of the hardware of the computer system. Many solutions have been proposed, for example, CPU-bound functions, memory-bound functions, or Turing tests [38]. Especially different ways of using memory-bound functions currently receive a lot of attention [39][40]. An example of a Turing-type test based on human interaction is mentioned in Section 2.3.1 (SFM). However, so far it remains an open problem to find a pricing function which leads to at least comparable delays on old, slow computers and on the latest hardware.

In the following, we take a closer look at a relatively widespread and well-known representative of the technical solutions for increasing sender costs – Hashcash, which is based on a CPU-bound function. Hashcash [41] is a software plug-in for mail clients which adds Hashcash stamps to sent e-mail. Adding a Hashcash stamp means inserting a line starting with "X-Hashcash:" into the header of a message, as shown in Table 8:

FROM: someone <[email protected]>
TO: max mustermann <[email protected]>
Subject: test Hashcash
Date: xx.xx.200x
X-Hashcash: 0:030626:[email protected]:6470e06d773e05a8

Table 8: A typical X-Hashcash header

In order to create a Hashcash stamp, the resource CPU time needs to be "spent" (on an average desktop computer, a few seconds). One stamp is required for each individual recipient (even if the message is sent as BCC), and it indicates the degree of difficulty of a task performed in order to "spend" CPU time. It is expected that the more difficult this task is (and thus, the more CPU time is spent) for an e-mail, the less likely this e-mail is spam. Thus, Hashcash stamps can be used as (part of) a criterion for whether to accept an e-mail message or not.

Technically, the tasks used by Hashcash are based on hash functions, more specifically on so-called partial hash collisions. A hash function H is a cryptographic function for which it is supposedly hard to find two inputs that produce the same output. A collision occurs if two inputs do produce the same output: H(x) == H(y) although x != y. Common hash functions, like for example MD5 or SHA1, are designed to be collision resistant (it is very hard to find SHA1(x) == SHA1(y) where x != y). For common hash functions, computing a full collision is almost impossible, but partial collisions can be found more easily. In contrast to a full collision, where all bits of, for example, SHA1(x) must match SHA1(y), for a k-bit partial collision only the k most significant bits of SHA1(x) and SHA1(y) have to match. On a 400 MHz PII, a 16-bit partial collision for SHA1 can be computed in about one third of a second, whereas computing a 32-bit collision would take seven hours. Hashcash uses the recipient's mail address and the current date as inputs for the hash collision.
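The principle is easy to demonstrate. The following Python sketch mints and verifies a simplified stamp by searching for an input whose SHA1 digest starts with k zero bits (a partial collision with a fixed target); it illustrates only the proof-of-work idea and does not reproduce the exact Hashcash stamp format:

```python
import hashlib
from datetime import date
from itertools import count

def mint_stamp(recipient: str, bits: int = 16) -> str:
    """Search for a counter such that SHA1(stamp) starts with `bits` zero
    bits. On average this takes about 2**bits hash evaluations
    (simplified stamp format, not the official Hashcash one)."""
    prefix = f"{date.today():%y%m%d}:{recipient}"
    for counter in count():
        stamp = f"{prefix}:{counter:x}"
        digest = hashlib.sha1(stamp.encode()).digest()
        if int.from_bytes(digest[:4], "big") >> (32 - bits) == 0:
            return stamp

def verify_stamp(stamp: str, bits: int = 16) -> bool:
    """Verification costs a single hash evaluation, so it is cheap for
    the receiving mail server."""
    digest = hashlib.sha1(stamp.encode()).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - bits) == 0

stamp = mint_stamp("[email protected]", bits=16)   # ~65,000 hashes on average
print(stamp, verify_stamp(stamp, bits=16))
```

This asymmetry is what makes the scheme attractive for the receiver side: checking a stamp costs one hash evaluation, whereas minting one costs on the order of 2^k evaluations per recipient.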
2.2.1.2. Money-Based Solutions

The basic idea behind money-based solutions is to "pay" some amount of a (possibly symbolic) currency (micro payment) for each e-mail to be sent. The idea is that an e-mail is the more likely to be ham the higher the amount paid for its delivery. In the following, we describe a concrete proposal for implementing this idea.

The Lightweight Currency Protocol (LCP) [42] is a relatively simple mechanism which allows organizations to issue their own generic currency. The issuer (for example, an ISP) and the currency holder (the user) both generate a public/private key pair. LCP is a request/response protocol where the issuer of the currency is the server and the holder of the currency is the client. Based on this protocol, Turner et al. [42] also propose a payment mechanism where servers require payment for accepting incoming messages. The mail transfer agent is responsible for organizing the payment, so the client is not involved.

Currently, delivery costs per e-mail message are estimated to be about 0.01 US cent (which corresponds to US$ 100 for 1,000,000 e-mail messages). Even if the price were raised to 1 US cent per e-mail (which corresponds to US$ 10,000 for 1,000,000 messages), sending e-mail would still be very cheap compared to sending snail mail (which costs more than 20 US cents per letter).

The implementation of this payment mechanism based on LCP proceeds as follows: a user spends a particular amount of his currency by sending a transfer-funds message to the issuer which identifies the recipient's public key, the amount and a transaction id. If the sender has a sufficient balance of funds, the issuer debits the sender's account by the amount requested and credits the account of the recipient. The recipient verifies the payment, and the issuer responds with an account activity statement.

Each user can earn some quota of a currency, whereas spammers would be forced to purchase credits with real-world money, which narrows their margin of profit. Since in this model the costs for sending e-mail messages increase linearly with the number of messages sent, it is expected that spammers would be forced to increase their rate of return and would thus need to focus their efforts on recipients with a high probability of revenue (which contradicts the current business model of spammers illustrated in Section 1.3).
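The accounting step of this exchange can be sketched as follows; this toy model is our own illustration (class and method names invented), and the key handling and signatures that LCP [42] builds on are omitted entirely:

```python
class Issuer:
    """Toy model of an LCP-style currency issuer: keeps account balances
    and settles transfer-funds requests (signatures omitted)."""

    def __init__(self):
        self.balances = {}          # public key (here: a string) -> funds
        self.processed = set()      # transaction ids, to prevent replays

    def transfer_funds(self, sender: str, recipient: str,
                       amount: int, tx_id: str) -> bool:
        if tx_id in self.processed or self.balances.get(sender, 0) < amount:
            return False            # insufficient funds or replayed request
        self.balances[sender] -= amount
        self.balances[recipient] = self.balances.get(recipient, 0) + amount
        self.processed.add(tx_id)
        return True                 # stands in for the account statement

issuer = Issuer()
issuer.balances["alice-pubkey"] = 100
# The sending MTA pays 1 unit per message; the receiving MTA verifies it.
ok = issuer.transfer_funds("alice-pubkey", "mx-pubkey", 1, "tx-0001")
print(ok, issuer.balances)   # True {'alice-pubkey': 99, 'mx-pubkey': 1}
```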
A simple example illustrates the promise of this approach as well as its potential shortcomings: if, due to a pricing function, it takes ten seconds to process an outgoing e-mail, one computer can send at most 8,640 e-mail messages in 24 hours (86,400 seconds divided by 10). Without a pricing function, an estimated 2-3 billion spam messages can be sent per day – in order to achieve this output with the pricing function of this example, a spammer would need on the order of 250,000 to 375,000 computers. However, if pricing functions are needed that allow even an average user to send out only two e-mail messages per hour, and if users with old hardware experience even bigger delays, then this approach renders itself useless because it limits the effectiveness of e-mail as a communication medium in general. A careful discussion of the central questions (optimum type of pricing function, optimum delay, etc.) at a scientific level is beyond the scope of this report, but is included in a diploma thesis currently under preparation [31].

The main problems of money-based solutions are the relatively high administration overhead and the fact that the very popular free e-mail accounts do not fit into this strategy. The last point leads to a potential general weakness of all current approaches to increasing sender costs – their success would require some degree of coordination among providers of e-mail services and the commitment of at least a significant part of those providers worldwide. If only a minority of e-mail services worldwide adopts policies for increasing sender costs, spammers will simply evade the obstacle and pick providers who do not implement such policies. Just as the pricing functions of technical solutions still require careful analysis, the optimal fee structure of money-based solutions still needs to be investigated. The concept will be considered for practical application only if it can be shown how to set it up in practice such that the amount of spam is reduced significantly without burdening regular e-mail users too much.

2.2.2. Increasing Spammers’ Risk

A more focused approach than increasing the costs for everybody sending e-mail is to increase the risk spammers face due to their activity. This requires the introduction of new legal provisions for UBE/UCE and the development of case law and of an infrastructure to enforce these provisions.

2.2.2.1. Legal Provisions

The European Union and the United States of America have both decided to enact a legal basis for criminal prosecution of senders of UBE. Detailed information on the different legal systems of the United States, the European Union and other countries is available at [43] and in the summary given by Sabadello [44]. Since legal matters are not our area of expertise, we will only give a very short overview in this section.

Opt-In vs. Opt-Out. Generally speaking, one can distinguish between an opt-in and an opt-out system for anti-spam regulations. Opt-in means that nobody is allowed to send UBE unless the receiver has explicitly agreed to receive such messages. In an opt-out system, anybody is allowed to send UBE to anybody else as long as the receiver has the possibility to opt out at any time, that is, to declare that he does not want to receive such messages any more.

USA. The CAN-SPAM Act [45] took effect in the United States of America on January 1, 2004. It establishes an opt-out system.
The act is heavily contested, because (like any opt-out system) the act of opting out gives a spammer the possibility to verify that a mail address is valid. Consequently, if the receiver tries to opt out via an automated mechanism offered in the message, he may receive even more spam afterwards because his e-mail address could be “verified”. Despite this potential weakness, the CAN-SPAM Act also provides a basis for dealing with some major problems of unsolicited bulk e-mail: it is the basis for criminal prosecution of header forging and of relaying commercial mail through open proxies or other infrastructure used to conceal the sender’s identity, and it also prohibits address harvesting and dictionary attacks.

European Union. The European Union decided to implement an opt-in system. In June 2000, the European Parliament passed the directive on electronic commerce [46] and in June 2002 the directive on privacy and electronic communication [47], which form the basis for legal action. Consequences for sending UBE are not covered in these directives; it is incumbent upon the individual member states of the European Union to define them.

Austria. The current legal situation in Austria distinguishes between private individuals and companies. According to § 107 of the Telecommunications Act [48], sending UBE to private individuals is not allowed without the prior consent of the individual (opt-in). This covers commercial e-mail as well as any other e-mail with more than 50 recipients. The situation for companies is completely different: in general, it is allowed to send UBE to companies as long as the recipient has a possibility to opt out (similar to the CAN-SPAM Act). Moreover, the Act on Electronic Commerce [49] introduced the maintenance of a so-called “Robinson List” containing all individuals and companies that in no case want to receive UBE. This list has to be taken into consideration even when delivery would otherwise be allowed by the Telecommunications Act.

2.2.2.2. Strengths and Weaknesses

Practical experience shows that, although there is some deterrent effect (cf. Figure 2), neither the legal framework of the United States nor that of the European Union will be able to solve the spam problem completely. Spammers do not care too much about legal consequences because they can easily hide their identity or even move their operations to other countries where no legal basis for prosecution exists. Similar to approaches for increasing sender costs, legal action against spammers requires a much higher degree of worldwide coordination among countries than currently exists.

2.3. Receiver Side (=Post-Send) Methods

In this category, we group all approaches which “act” at the receiver side after an e-mail has been sent. In contrast to the pre-send methods summarized in Section 2.2, which are proactive, the post-send methods tend to be reactive. They can be divided into three groups – methods based on the source of the e-mail, methods based on the content of the e-mail, and methods using both source and content.

2.3.1. Approaches Based on Source of Mail

In this section, we illustrate the most important methods focusing on the source of an e-mail. In most cases this information is the client’s IP address, the sender’s domain or the full mailbox address; it is usually extracted from the SMTP dialogue or the message itself. On this basis, it is possible to classify the source in three different ways.
The first way is to decide whether the sender is a good or a bad one. Another possibility is to verify whether a source is authorized to use a claimed identity. The third way is to verify that the sender is willing to invest some additional effort to contact the receiver. In detail we discuss:

• Blacklists, whitelists (good/bad sender)
• Sender Policy Framework, Caller ID, Sender ID, DomainKeys (legitimate/illegitimate sender)
• Greylists, ChoiceMail and SFM (challenge-response systems)

2.3.1.1. Is the Sender Good/Bad?

A blacklist is a database containing IP addresses or domain names that are suspected of sending spam. Any message coming from a domain or IP address appearing on a blacklist will be blocked. There are two main types of blacklists – real-time blacklists (RBLs) with a centralized or distributed database, and blacklists self-maintained by administrators (domain-level blacklists). The most effective way is using RBLs maintained by third-party organizations (for example [50]), which are usually DNS (Domain Name System) based. Figure 6 illustrates the basic principle of blacklists.

Figure 6: Typical scenario for a blacklist

The SMTP client with the IP address 101.105.32.23 connects to mysmtpserver.com. Before the SMTP dialogue starts, the SMTP server records the IP address from the TCP connection and performs a DNS lookup for 23.32.105.101.rbl.org (the octets of the client’s IP address in reverse order, prepended to the blacklist’s domain). Rbl.org returns the reply code “127.0.0.2”, which indicates that the SMTP client is a source of UBE, whereupon the SMTP server aborts the connection to the client.

A whitelist is a database containing IP addresses or domains that are known not to send spam. Any message coming from a source appearing on a whitelist will bypass filtering. A whitelist is usually proprietary, but there are also global whitelists containing organizations that signed up not to send spam. These lists are usually maintained by third-party organizations (for example Habeas [51], Bonded Sender Program [52], Brightmail Safe List).
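The lookup logic of Figure 6 is simple enough to sketch in a few lines of Python (a hedged illustration using only the standard library; rbl.example.org is a placeholder zone, not a real blacklist):

import socket

def is_blacklisted(client_ip, rbl_zone="rbl.example.org"):
    """Check an SMTP client against a DNS-based blacklist.
    The client IP's octets are reversed and prepended to the RBL zone;
    an answer such as 127.0.0.2 signals a listing, while NXDOMAIN
    means the address is not listed."""
    query = ".".join(reversed(client_ip.split("."))) + "." + rbl_zone
    try:
        socket.gethostbyname(query)  # resolves only if the IP is listed
        return True
    except socket.gaierror:
        return False

# Example from Figure 6: 101.105.32.23 -> 23.32.105.101.rbl.example.org
if is_blacklisted("101.105.32.23"):
    print("reject connection")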
2.3.1.2. Is the Sender Legitimate?

Important efforts have also focused on developing methods and techniques for determining whether the sender of an e-mail can be authenticated or whether he is legitimate. This includes various kinds of policy frameworks [53] and digital signatures [54]. The underlying idea is, on the one hand, that spammers do not want to be authenticated in order to avoid criminal prosecution, and on the other hand that – for the same reason – spammers tend to fake header information in their e-mail (cf. Section 1.4), which may lead to inconsistent information (for example, the claimed sender is not authorized to use the claimed sending mail server). In the following, we briefly summarize the most important techniques in this area. They were originally submitted as proposals for anti-spam standards to the Internet Engineering Task Force (IETF, www.ietf.org) in 2004:

• Caller ID: proposed by Microsoft, Sendmail, Amazon.com and Brightmail
• DomainKeys: proposed by Yahoo
• Sender Policy Framework (SPF): supported by AOL, GMX

SPF, Sender-ID and DomainKeys are concepts to eliminate the possibility of domain spoofing. The protocols try to leave the common transmission process unaffected and to interoperate with SMTP in order to support their distribution and acceptance. All described protocols use DNS for verifying e-mail. This increases network traffic, because for every received e-mail the domain given in the e-mail address must be queried.

The Sender Policy Framework (SPF) [55], developed by Meng Wong and Mark Lentczner, uses the “MAIL FROM:” identity of the SMTP dialogue to verify the sender’s domain. This allows rejecting mail already within the SMTP dialogue. The protocol is a hybrid of the Designated Mailers Protocol [56] and the Reverse MX Protocol [57]. An SPF record designates the outbound SMTP servers of the sender’s domain. When an SMTP client connects to a mail exchanger, the server looks for an SPF record in the DNS tree of the claimed sender domain. If the result received from the DNS query contains the IP address of the client, the sender is authorized to use the domain in the “MAIL FROM:” argument. If not, the domain was spoofed.

The Caller-ID [58] concept, developed by Microsoft, implements a similar idea but uses the so-called purported responsible address (PRA) for verification. The purported responsible address refers to the mailbox that has directly initiated the transmission process. It is determined by inspecting the header of the message. For example, if the header contains a “From:” field and a “Sender:” field, the PRA is extracted from the “Sender:” field [59]. Both Caller-ID and SPF suffer from the fact that when mail forwarding systems or mailing lists are involved in the transmission process, the IP address of the client often cannot be mapped to the domain of the sender. Therefore, additional concepts like SRS [60] (Sender Rewriting Scheme) must be implemented. The Sender-ID [61] framework is the result of a merger between Caller-ID and SPF.

DomainKeys [62] also uses DNS, but the verification process works via digital signatures instead of IP addresses. The sending side of this variant consists of two steps:

• Set up: In this first step, the domain owner generates a public/private key pair, which is used for signing all outgoing mail. The DNS holds the public key, and the private key is located at the outbound mail server.
• Signing: When an end user sends an e-mail, the DomainKeys-enabled mail system generates a digital signature with the private key.

For verifying an e-mail on the receiver side, three steps are necessary:

• Preparing: The DomainKeys-enabled system on the receiver side extracts the signature and the claimed “From:” domain from the e-mail headers and fetches the public key from the DNS for the claimed “From:” domain.
• Verifying: Using the public key fetched from the DNS, the receiving e-mail system verifies whether the signature was generated by the matching private key.
• Delivering: If the e-mail was successfully verified, the message is delivered into the receiver’s inbox.
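A highly simplified SPF-style check can be sketched as follows (assuming the third-party dnspython package; real SPF evaluation also handles a:, mx:, include:, redirect= and CIDR ranges, all of which are omitted here):

import dns.resolver  # third-party package "dnspython" (assumed available)

def spf_permits(client_ip, mail_from_domain):
    """Accept the client if its IP appears in an ip4: mechanism of the
    claimed MAIL FROM: domain's SPF record (toy version)."""
    try:
        answers = dns.resolver.resolve(mail_from_domain, "TXT")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return False
    for record in answers:
        txt = b"".join(record.strings).decode()
        if txt.startswith("v=spf1"):
            return ("ip4:" + client_ip) in txt.split()
    return False  # no SPF record published

# A mail exchanger could reject within the SMTP dialogue:
# if not spf_permits("101.105.32.23", "example.com"): reply "550 ..."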
2.3.1.3. Challenge-Response

Challenge-response systems initially block or hold e-mail from unknown senders. The senders are notified of the blocking and then required to prove they are human by taking a “quasi-Turing test”. If they pass, the e-mail is delivered [63]. Challenge-response systems (also called reverse whitelists or permission-based filters) maintain a list of permitted senders. E-mail from a new sender is temporarily held without delivery, and the sender receives an e-mail with a challenge. This challenge can be clicking on a URL or replying to the e-mail. If spammers use fake sender e-mail addresses, they will never receive the challenge, and if they use real e-mail addresses, they will not be able to answer all challenges within a reasonable amount of time.

Unfortunately, there are some limitations to this approach. If both communication partners use challenge-response systems, they will not be able to communicate with each other. Another shortfall is that automated systems or mailing lists cannot respond to a challenge (for example, when a friend signs you up for an interesting newsletter). The third problem is character recognition and pattern matching – graphical challenges can often be bypassed automatically with such techniques. And finally, a spammer may forge the e-mail address of a legitimate user.

There are many implementations of challenge-response systems – we take a closer look at three different types: greylists, a human interaction system called ChoiceMail, and a subscription mail server called SFM (Spam Free Mail).

Greylisting [64] is an aggressive method for blocking spam. It exploits the fact that spam delivery is not failure tolerant: because spammers often do not know whether their recipient addresses actually exist, they do not try to resend messages if an error occurs during the transmission process. When a client connects to an SMTP server using a greylist, the server records the following information:

1. The IP address of the host attempting the delivery
2. The envelope sender address
3. The envelope recipient address

The server then compares this triplet to a local database. If no record matches, the message is refused with a “temporary failure” response and the triplet is stored. RFC-compliant MTAs usually try to resend the message within a certain period of time. When the message is received a second time within the specified time slot (normally after an initial blocking period and before the expiration date of the triplet), the message will be delivered.
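The core of a greylist fits into a few lines; the following in-memory Python sketch is illustrative only (real implementations persist the triplets and run inside the MTA; the two timeouts are example values, not prescribed by [64]):

import time

BLOCK_SECONDS = 300          # example: retries accepted after 5 minutes
EXPIRE_SECONDS = 36 * 3600   # example: triplets forgotten after 36 hours
triplets = {}                # (client_ip, env_from, env_rcpt) -> first seen

def check(client_ip, env_from, env_rcpt):
    """Return an SMTP response for this delivery attempt."""
    key = (client_ip, env_from, env_rcpt)
    now = time.time()
    first_seen = triplets.get(key)
    if first_seen is None or now - first_seen > EXPIRE_SECONDS:
        triplets[key] = now                            # new triplet
        return "450 temporary failure, please retry"
    if now - first_seen < BLOCK_SECONDS:
        return "450 temporary failure, please retry"   # retried too early
    return "250 OK"  # RFC-compliant MTAs retry and get through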
ChoiceMail [65] is available in different editions (free for home use, a Server edition and an Enterprise edition) and uses a challenge-response system. If ChoiceMail cannot identify a message after checking it against your whitelist, blacklist and any rules you are using, it sends a “registration request” to the sender (see Figure 7).

Figure 7: ChoiceMail sender registration

This short e-mail directs the sender to a Web page where he or she will be asked for name, e-mail address and reason for contacting you. The sender will also be asked to fill in a code that appears on the screen as a graphic – something a person can do easily but a computer cannot do at all. This simple process eliminates almost all junk e-mail for two reasons. First, spammers usually use invalid reply addresses and therefore never receive the registration request. Second, spammers depend on automation, and the registration response cannot be automated. The registration feature can be turned off.

SFM [66] is a subscription e-mail server whose service can be viewed as an extension of your traditional e-mail service. It allows the (mostly automatic) creation of multiple addresses/aliases of your mailbox, each restricted to a narrow population of legitimate senders. The principle of operation is simple. There are two types of dynamic addresses: publishable (= master) addresses and personal ones (aliases). An alias is intentionally restricted to a single contact or a group. If someone tries to contact you for the first time, he sends an e-mail to your master address. This message never reaches its destination; instead, the sender gets a challenge like the one in Figure 8:

Dear Human Being: To reach the recipient, please use this address: [address shown as an image]. For more information, and also if you cannot see the image that has arrived with this message, please follow THIS LINK. The purpose of this procedure is to eliminate e-mail abuse.

Figure 8: Challenge of SFM

A new alias remains open for a predetermined amount of time, and during this time anyone can use it to send you messages. Whenever this happens, the sender’s address is added to the alias’s personalization list. After the alias closes, it will only accept e-mail from senders on that list. When an outgoing message is sent through the server, the server locates the alias personalized to the recipient or, if no such alias is available, generates a new one on the fly and forwards the message with the alias substituted for your sender address.

2.3.1.4. Strengths and Weaknesses

Low resource requirements and ease of maintenance are the two main benefits of blacklists. Spam messages can be rejected before they are downloaded. Another big advantage is that some spammers automatically remove e-mail addresses from their lists if an e-mail is rejected. Only a few configuration changes are necessary inside the server software. A big disadvantage is the lack of granularity – either all of the e-mail from a given host is accepted, or all of it is rejected. Some spammers try to hide behind big ISPs and use Hotmail or AOL accounts for spamming (see Figure 3). One big problem of blacklists is the possible refusal of legitimate mail, because blacklists are often poorly maintained and not up to date.

Whitelists have similar limitations. If a spammer spoofs a whitelisted address, he will get through. Both kinds of lists must be updated regularly, which takes time, and black- and whitelists typically stop only around 10% of spam [67].

Missing authentication of e-mail senders is one of the biggest weaknesses of the current mail infrastructure, but sender authentication alone will not solve the spam problem: sending unsolicited bulk e-mail would still be possible. The establishment of a central authentication authority would be required for a secure environment, but this does not seem to be realistic.

One big disadvantage of challenge-response systems is that some MTAs do not redeliver messages. Therefore, it is recommended to maintain a whitelist containing these servers. Another general problem of challenge-response systems is the increased mail traffic.

General Limitations of Header Based Approaches

The header of an e-mail includes various pieces of information about the sender and the mail infrastructure involved in the transmission process. Generally, any information given in the SMTP dialogue and the header can be forged, because standard SMTP defines no integrity checks or authentication mechanisms. The only reliable information is the IP address of the client. Spammers often forge the header entries of an e-mail to inhibit the backtracking of their messages and keep their identity secret; there is no other reason to forge a header entry but to conceal one’s identity. The following analysis gives an overview of what can be forged and focuses only on entries that carry information about the sender or the mail infrastructure involved.

“Return-Path:”: The Return-Path is just a record of the argument specified in the “MAIL FROM:” command during the SMTP dialogue. If that argument is forged, the Return-Path is not trustworthy either.
“Received:” Lines: “Received:” lines are the most important header entries for backtracking messages and for fixing bugs in a mail environment. RFC-compliant MTAs must prepend a “Received:” line for messages that are not routed in a private area. Like any other header information, they can be forged easily, in a way that makes it impossible to distinguish manipulated from unaltered entries. If a spammer uses an open proxy, there is no reason to forge any “Received:” line, because the IP address of the spammer does not appear in the message. From the receiver’s perspective, you can only trust the lines that were added by your own MTAs.

“Message-Id:”: If the Message-Id is structured as recommended in RFC 2822, it contains the sender’s domain. Spammers can use any domain to conceal their identity. Due to various mail-routing scenarios and the possibility of forgery, it is not possible to attach semantics to the domain part of the Message-Id.

“Date:”: Inconsistent “Date:” fields can be ascribed to various scenarios. They can result from a forgery, from different time zones of sender and receiver, or from badly configured MTAs, so a message cannot be classified as spam merely because an inconsistency is detected.

“From:”, “Sender:”, “Reply-To:”, “To:”: These fields can be forged easily. Only a few syntactical checks can be performed to verify that the entry in such a field may represent a valid mailbox.

The following example illustrates that it is impossible to identify a forged header in most cases.

Return-Path: <[email protected]>
Received: from mx6.univie.ac.at (mx6.univie.ac.at [131.130.1.49]) by atat.at (8.12.10/8.12.10) with SMTP id i7I5LSaT007420 for <[email protected]>; Wed, 18 Aug 2004 06:21:29 GMT
Received: from pakistan.com (pakistan.com [222.65.113.88]) by mx6.univie.ac.at (8.12.10/8.12.10) with SMTP id i7I5D126028130 for <[email protected]>; Wed, 18 Aug 2004 08:13:28 +0200
Message-ID: <[email protected]>
Date: Wed, 18 Aug 2004 15:14:18 +0900
From: "jamison tevlin" <[email protected]>
To: "Thornaper Maraja" <[email protected]>

Figure 9: Example of a forged header

Figure 9 shows the header of an unsolicited bulk e-mail which originated at “pakistan.com” and was sent to “xyz.at”. The source of the message was apparently the IP address 222.65.113.88, reported by the mail relay “mx6.univie.ac.at”. The reverse DNS lookup is consistent with the client identification, which also appears in the Message-Id, the Return-Path and the “From:” address. An analysis of the timestamps also gives a consistent picture with respect to the different time zones of sender and receiver. The header thus seems to contain consistent information. The fact is, however, that the mail client at “xyz.at” can only trust the recorded IP address 131.130.1.49; any other information given in the header can be forged. For example, it is possible that “mx6.univie.ac.at” is an open relay, in which case the IP address 222.65.113.88 would most likely be the original source of the message. Whether the sender’s domain is indeed “pakistan.com” cannot be verified. On the other hand, it is also possible that all given information is correct.

A consistent header is no proof of legitimate mail. Because of the various scenarios occurring in the mail distribution process, the same holds for an inconsistent header. Plausibility checks can only catch very simple forgeries and cannot be used for efficient spam detection.
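In practice, this means a receiver-side tool should anchor its analysis at the “Received:” line added by its own infrastructure. The following Python sketch (standard library only; own_hosts is a hypothetical list of MTAs under our control, and the regular expression is a rough heuristic) extracts the only trustworthy datum, the client IP recorded by our own MTA:

import email
import re

def first_trusted_client_ip(raw_message, own_hosts=("atat.at",)):
    """Return the bracketed client IP from the first Received: line
    that was prepended by one of our own MTAs; everything below that
    line in the header may be forged."""
    msg = email.message_from_string(raw_message)
    for line in msg.get_all("Received", []):
        if any("by " + host in line for host in own_hosts):
            match = re.search(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]", line)
            if match:
                return match.group(1)
    return None  # no Received: line from a host we trust

Applied to the header of Figure 9, this would return 131.130.1.49 – the only piece of information the receiver can rely on.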
2.3.2. Approaches Based on Content

This section covers techniques used to analyze an e-mail message (or, more generally, a text document) according to its content. The task is not to fully understand a text’s meaning but rather to find significant features, such as word frequencies. Simple approaches like keyword matching are introduced, as well as more elaborate approaches that combine simple methods commonly used in the area of text information retrieval.

2.3.2.1. Static Techniques

Keyword-based approaches involve simple searches of the body and/or the subject line of a message for specific keywords and phrases like “Viagra”, “Cialis” or “get this for free”. If these words or phrases appear, this fact is used as an indicator for spam. The three main types of keyword-based matching are described below.

Keyword Based: Search for words or phrases that match exactly. For example, “Viagra” only matches “Viagra”.

Pattern Matching: Covers simple variations by mixing constant text and flexible components like wildcards, case (in)sensitivity and number of occurrences. This kind of pattern matching is based on regular expressions [68]. For example, “V*i*a*g*r*a” matches “Viagra”, “V.i.a.g.r.a”, “Vviiaaggrraa”, ...

Rule Based: Rules are more complex constructs a message can be checked against. For instance, the rule “Mentions Generic Viagra” detects whether generic Viagra is a main topic of a given message (via several regular expressions). It is common practice to assign a certain value to each rule and to sum up those values to compute an overall spam rating (see Section 3.3.1).

2.3.2.2. URL Analysis

URL analysis in its simplest form means white- or blacklisting of URLs (compare Section 2.3.1). However, more sophisticated approaches have been developed, such as the one explained in this section, which combines several techniques.

Filtering Spam Using Search Engines [69]. An approach for filtering spam using search engines like Google and Yahoo has been developed at the Georgia Institute of Technology. The key idea is to filter spam according to the URLs (and their content) that occur in an e-mail message (for example, whether they link to Web sites a user might be interested in or not). This is done by categorizing URLs via search engines as well as by using Bayesian classifiers on Web site content to define a user’s interest (in terms of keywords resulting from the Bayesian analysis). The approach distinguishes between categorized URLs, which have already been indexed by a search engine, and uncategorized URLs, which are not listed in any Web directory.

Such a system has to be trained. The first training step is to make a list of acceptable categories (to define the categories a user is interested in). For this purpose, URLs are extracted from legitimate mail messages in the user’s mailbox and classified through search engines. The content of the Web sites is also retrieved from the search engines’ caches and used to train a Bayesian classifier. Legitimate URLs, that is, URLs that occur in the user’s messages but cannot be found in a Web directory, are whitelisted: a regular expression is created for each URL, resulting in a set of regular expressions Aregex that represents legitimate URLs. At the end of the training process the user is able to edit and verify the training results. After the training phase there should be a set of legitimate categories, called Acategories, and a set of regular expressions, called Aregex, that together map the user’s preferences (in effect, a list of URLs to be accepted).
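The whitelisting step of the training phase might look as follows (a toy Python sketch; the escaping rule is our own simplification, not the one used in [69]):

import re

def build_aregex(legitimate_urls):
    """Build the Aregex set from URLs found in ham messages: one
    regular expression per URL, accepting arbitrary trailing paths."""
    return [re.compile(re.escape(url) + r"(/\S*)?$")
            for url in legitimate_urls]

aregex = build_aregex(["http://www.univie.ac.at", "http://example.org/news"])
print(any(p.match("http://www.univie.ac.at/research") for p in aregex))  # True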
After training, the system is ready to classify mail. Figure 10 depicts the classification process in detail. A message that does not contain any URLs is classified as ham. If a message contains URLs, every URL is processed: if a URL matches a regular expression in Aregex, if its category is in the Acategories set, or if the URL was previously classified as legitimate, it is not considered any further. The remaining set of URLs, called Ur, includes only categorized URLs whose categories are not in Acategories and uncategorized URLs that have never been seen in legitimate messages and do not match any of the regular expressions in Aregex. If a URL has a category not in Acategories, the message is classified as spam. For each uncategorized URL remaining in Ur, the content it refers to is evaluated through the output of a Bayesian classifier; if the output exceeds the spam threshold, the message is classified as spam, otherwise as ham.

Figure 10: URL analysis based on [69]
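The decision logic of Figure 10 condenses into a short function (a sketch under the assumption that lookup_category and bayes_score are hypothetical helpers standing in for the search-engine query and the trained Bayesian classifier):

def classify_by_urls(message_urls, aregex, acategories,
                     lookup_category, bayes_score, threshold=0.9):
    """Classify a message as 'spam' or 'ham' from its URLs alone."""
    if not message_urls:
        return "ham"
    ur = []
    for url in message_urls:
        if any(p.match(url) for p in aregex):
            continue                     # whitelisted during training
        category = lookup_category(url)  # None for uncategorized URLs
        if category is None:
            ur.append(url)               # decide later via content
        elif category not in acategories:
            return "spam"                # a category the user rejects
    if any(bayes_score(url) > threshold for url in ur):
        return "spam"
    return "ham"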
2.3.2.3. Authentication

The problem with spam messages is that it is hard to tell whether a message is spam or not. The obvious answer to this problem is that there has to be a way to recognize non-spam messages. Whitelists have been a method of choice for a long time, but they cannot solve one important problem: the e-mail protocol currently used does not provide any security features (cf. Section 1.4). This means that anybody may use more or less anything as a sender’s address.

Digital Signature, Encryption of E-Mail

A working public key infrastructure would solve this problem. If every message were signed with a private key, authenticating all senders would be no problem. Unfortunately, there is one rather big problem with this solution: currently only a very small number of e-mail users have a valid certificate. Creating an infrastructure which allows every e-mail user to use digital signatures would be a big challenge. Generally, there are two different options for a public key infrastructure: either there is one single root certification authority, which places a heavy burden on this central authority, or there are many different certification authorities, which means that each and every one of them must be trusted. A crafty spammer might start his own certification authority, would thus be able to sign all his messages, and could get past this security measure with relative ease. In many cases the sender needs to fetch a public key for each recipient, and it would be necessary to integrate all common encryption systems (PGP, X.509, ...) into all mail clients (which, for example, is currently not accepted by Microsoft for its Internet Explorer). Even if this sounds rather ineffective, it still adds a certain amount of work to the spammer’s plan of sending large amounts of e-mail messages. Whether it adds enough work to harm the spammers’ business model is a question that has not been answered yet.

2.3.2.4. Strengths and Weaknesses

Static techniques are useful to some extent at the individual or even corporate level. However, the word “Viagra” may be of interest to a physician or pharmacist; thus, keyword-based filtering cannot serve as a general solution. Performance may be the main advantage of these primitive approaches; a drawback is the need to keep the keyword lists up to date.

At first glance, URL analysis seems promising. A closer look reveals a couple of drawbacks, though. Performing multiple queries in a search engine, or even running a Bayesian classifier, may require a lot of time. This can lead to a point where denial-of-service attacks based on messages containing vast amounts of URLs paralyze a complete e-mail service.

As already discussed in Section 2.3.1.4, missing authentication is one of the biggest weaknesses of the current mail infrastructure, but authentication alone will not solve the spam problem, and the central authentication authority required for a secure environment does not seem realistic.

2.3.3. Using Source and Content

In many cases, information from the body or the header of an e-mail alone is not enough for a classification. Especially mass-mailer detection needs as much information as possible in order to compare messages as thoroughly as possible. In this section we take a look at different technologies using this approach.

2.3.3.1. Fingerprints, Signatures, Checksums

Digital fingerprint: a value calculated from the content of other data that changes if the data upon which it is based changes.

Checksum: a value computed by adding together all the numbers in the input data. It is the simplest form of a digital fingerprint; its problem is that reordering the numbers in the document does not change the checksum value.

Cyclic redundancy checks (CRCs) are more reliable than checksums – they normally reflect even minor changes to the input data – but it is relatively easy to generate a completely different file that produces the same CRC value.

Hash algorithms and message digests: “one-way hash algorithms” produce a hash value b from an input a such that it is easy to compute b from a, but very difficult (or practically impossible) to compute a if you only have b (compare Section 2.2.1). Two well-known hash algorithms are MD5 and SHA.

DCC, Pyzor, Vipul’s Razor

Checksum-based spam filtering is a method for detecting spam by simply auditing how often a received message has been sent (to other users). It is a client-server architecture in which the client calculates a checksum of an incoming message and sends it to a server, which looks for exact matches in its database and returns an indicator (for example, the number of times that the message has already been reported). According to a user-defined policy, the client then decides whether the message is spam or ham. The most popular implementations of this concept are the Distributed Checksum Clearinghouse (DCC [70]) from Rhyolite Software and Vipul’s Razor [71]. DCC and Vipul’s Razor differ in the way messages are reported to the server. A DCC client reports the checksums of every incoming message to the DCC server; DCC thus basically enables mass-mailer detection. It does not decide whether a message is spam or not – it just reports how many copies of a message have already been received. For this reason, clients have to maintain a whitelist including senders of solicited bulk mail. Table 9 shows the parts of a message DCC computes checksums for.
Checksum     Description
IP           Address of the SMTP client
Env_From     SMTP envelope value
From         SMTP header line
Message-ID   SMTP header line
Received     Last “Received:” header line in the SMTP message
Substitute   SMTP header line chosen by the DCC client, prefixed with the name of the header
Body         SMTP body, ignoring white-space
Fuz1         Filtered or “fuzzy” body checksum
Fuz2         Another filtered or “fuzzy” body checksum

Table 9: DCC checksums

The most remarkable types are the fuzzy values, which prevent spammers from avoiding registration by DCC through including random characters in their spam messages. Besides, it is not entirely clear how the fuzzy checksums are computed (namely, to keep this secret from spammers). A typical response from a DCC server for a given (spam) message is shown in Table 10.

Checksum     Computed Value                           Number of occurrences
From         d281eaa1 6bc43403 88d3a2cc dd3580ab
Message-ID   57aef887 c9d8748b 2b887907 5c43f751
Received     39c14d6c 7d1ca91b 2fe6f855 b921493d
Body         8e96458f 0843008a b6324d60 1cac9dc6     10
Fuz1         6b995215 1bde79e1 de9dc27f c6eadf5b     10
Fuz2         00000000 00000000 00000000 00000000

Table 10: Example of a DCC record

The response lists the computed checksum for each part of the message (Fuz2 has no value here because this checksum requires a certain message length, which was not reached in this example) and the number of registered occurrences (if any). The client (for instance, a spam filter using DCC, such as SpamAssassin) can then handle the message according to the registration count.

Vipul’s Razor follows a different approach. Within this system, the user himself reports a message to the server, so the server’s database should only contain checksums of approved spam. Therefore, in contrast to DCC, Razor is a tool for spam detection. One problem appearing here is the trustworthiness of the report’s sender. With the new version, Razor 2.0, this problem has been addressed, because every user needs a generated key for signing. The server looks in its database for reports agreeing with the vote received; the higher the agreement with other reports, the higher the reliability of the sender [70].

The above-mentioned Pyzor is a reimplementation of Razor in a different programming language (Python). Technically it is nearly the same as Razor, but it is open-source software [72].
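The client-server interaction reduces to a counting service; the following minimal Python sketch is our own toy, not the real DCC or Razor protocols (here the “server” is just a local dictionary):

import hashlib

server_counts = {}   # body checksum -> number of reports

def report_and_count(body):
    """Report a message body and return how often it has been seen.
    Like DCC's body checksum, white-space is ignored; a real client
    would additionally compute fuzzy checksums so that random padding
    does not change the fingerprint."""
    normalized = "".join(body.split())
    digest = hashlib.sha1(normalized.encode()).hexdigest()
    server_counts[digest] = server_counts.get(digest, 0) + 1
    return server_counts[digest]

BULK_THRESHOLD = 10   # example policy, not a DCC default
if report_and_count("Buy now!   Buy now!") >= BULK_THRESHOLD:
    print("treat as bulk mail (unless the sender is whitelisted)")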
2.3.3.2. Classification Methods

The main idea behind the use of classification methods in spam filtering is to find a suitable (computer-readable) representation for mail messages and to classify the messages as spam or ham. The representation is compared to training data and assigned to a class based on various techniques that are briefly described in the following. Some of the relevant technologies originate from the areas of text information retrieval and text analysis.

Representation of Texts

Text analysis is mainly based on words or tokens that occur in the documents of the text collection used. The task is not to fully understand a text’s meaning but rather to extract relevant tokens. Tokens can be entire words, phrases, or n-grams (overlapping tokens consisting of n characters). Although this approach might miss some of the information content, it has clear performance advantages and is independent of the text’s language. Several models exist for text representation based on those tokens/terms; the most common ones are listed below.

Term frequency: A message is represented by the number of occurrences of terms (the more often a word occurs in a message, the higher the value for this word).

TF-IDF (Term Frequency Inverse Document Frequency) representation: All occurrences of each term in a document collection are registered. Further, the number of documents each term occurs in is computed. Each document is then represented by the terms it contains (term frequency), weighted inversely by the number of documents the term occurs in (inverse document frequency) [73]. A common weighting is $w_{t,d} = tf_{t,d} \cdot \log(N / df_t)$, where $N$ is the number of documents and $df_t$ the number of documents containing term $t$.

Training Models

Training can denote a simple storing of examples or involve more sophisticated and time-consuming methods; this is particularly important when token frequencies shall be kept up to date. According to [74], there are three major training methods: TEFT, TOE and TuM.

TEFT (Train on Everything): every message is used to update the database.

TOE (Train on Error): only messages that were incorrectly classified are used for training (usually after an initial corpus training). An advantage is the dynamic handling of errors; the downside is the amount of human interaction needed (to find false classifications).

TuM (Train until Mature) [75]: a hybrid between TEFT and TOE. TuM trains the individual tokens of a message only until they have reached maturity (for instance, 25 hits per token). New types of training data and immature tokens are still trained, and TuM trains all tokens whenever an error is retrained. Therefore, it has both advantages – a balance between volatility and static data, and the ability to adapt to new types of e-mail.

Distance Measures

The similarity between a query message and the messages in the training sets is measured via distance functions. The query vector (consisting of term frequencies) is compared to the examples in a training set so that one or more most similar vectors can be found (for example, the similarity between an incoming mail message and a ham or spam training set). Examples of distance measures are:

Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$

Mahalanobis distance: $d(x, y) = \sqrt{(x - y)^t C^{-1} (x - y)}$

Cosine: $d(x, y) = \cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$

Classification Decision

After the computation of distance measures, the query vector has to be assigned to a class according to the training set; that is, a given message is tagged as either ham or spam according to a spam and a ham training corpus. It is not always the best choice to base the decision whether a query vector belongs to one class or not on the single most similar vector in the training set. Many different algorithms and models for classification tasks have been developed, most of them following the procedure just presented and differing slightly in one detail or another. The methods presented in the following give an overall idea of existing technologies, but the list does not claim completeness.
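The whole pipeline – representation, distance measure, decision – fits into a few lines of Python (a toy sketch with whitespace tokenization and a 1-nearest-neighbor decision; all names are our own):

import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def nearest_class(message, spam_corpus, ham_corpus):
    """Assign the message to the class of its most similar training example."""
    query = Counter(message.lower().split())
    best_label, best_sim = "ham", -1.0
    for label, corpus in (("spam", spam_corpus), ("ham", ham_corpus)):
        for doc in corpus:
            sim = cosine(query, Counter(doc.lower().split()))
            if sim > best_sim:
                best_label, best_sim = label, sim
    return best_label

print(nearest_class("cheap viagra now", ["buy cheap viagra"], ["meeting at noon"]))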
2.3.3.3. Bayes Filter

Although the application of Bayesian analysis to spam is rather new, the underlying logic was first published by the Royal Society in 1763 and goes back to Thomas Bayes (born 1702 in London). In basic terms, Bayes’ formula allows us to determine the probability of an event occurring based on the probabilities of two or more independent events. The general formula is written as:

$$P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_{j=1}^{k} P(B \mid A_j)\, P(A_j)}$$

In a Bayesian filter scenario, text is represented by significantly positive or negative words (tokens), that is, typical spam or ham words. At first, lists of “good” and “bad” words are computed from a training set of positive and negative examples. The output is two lists containing spam and ham probabilities for all tokens (complete words in a classic Bayesian filter). Spam probabilities for tokens are calculated using:

- the frequency of the token in the spam database
- the frequency of the token in the non-spam database
- the number of messages stored in each database

Any incoming e-mail is now represented by the most important tokens from these lists, either “most positive” or “most negative”. The overall spam probability is defined as the joint probability of the independent events (the tokens). Assuming that the variables a, b and c represent the spam probabilities of three different tokens, the total spam probability of a message is equal to:

$$\frac{abc}{abc + (1-a)(1-b)(1-c)}$$

The decision whether a message is treated as spam or ham is based on this overall spam probability (via a simple threshold function).
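This combination rule generalizes directly to any number of tokens; a minimal Python sketch (real filters additionally smooth rare tokens and keep probabilities away from exactly 0 and 1, which is omitted here):

def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities a, b, c, ... into
    abc... / (abc... + (1-a)(1-b)(1-c)...), as in the formula above."""
    p_spam, p_ham = 1.0, 1.0
    for p in token_probs:
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)

# tokens with spam probabilities 0.99, 0.9 and 0.2:
print(combined_spam_probability([0.99, 0.9, 0.2]))  # ~0.996
# a threshold function then tags the message, e.g. spam if > 0.9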
Bayesian filters use a variety of different token definitions, a few of which are listed below [74].

Standard Bayes: Each word is a token – this method is used in most spam filter programs. Text is not preprocessed at all (everything is used as a token, including header information, JavaScript, etc.).

Token Grab Bag: A sliding window of five words is moved across the input text. All combinations of those five words are taken in an order-insensitive way – every combination is a feature.

Token Sequence Sensitive: A sliding window of five words is moved across the input text. All combinations of word deletions are applied (except that the first word in the window is never deleted), and each resulting sequence-sensitive set of words is used as a feature.

Sparse Binary Polynomial Hashing with Bayesian Chain Rule (SBPH/BCR): A sliding window of five words is moved across the input. A feature is the sequence-and-spacing-sensitive set of all possible combinations of those five words, except that the first word in the window is always included.

Peaking Sparse Binary Polynomial Hashing: Similar to SBPH/BCR, except that for each window position only the feature with the highest or lowest probability (furthest from 0.5) is used; the other features generated at that window position are disregarded. This is an attempt to “decouple” the sequences of words in a text and make the Bayesian chain rule more appropriate.

Markovian matching: Similar to Sparse Binary Polynomial Hashing, but the individual features are given variable weights. The weights increase quadratically with the length of the feature, so that a feature containing more words than any of its sub-features can outweigh all of its sub-features combined.

2.3.3.4. Support Vector Machines (SVM)

The Support Vector Machine model, introduced by Vapnik [76][77], has proven to be a powerful classification algorithm and is used in many categorization tasks, including text categorization. The main idea is to map the input data into a high-dimensional feature space and to separate the data by the hyperplane that provides the largest margin between the two classes. If the classes are not linearly separable, SVMs make use of so-called kernels (convolution functions) to transform the initial feature space into another one where a separating hyperplane exists.

2.3.3.5. K-Nearest Neighbor (K-NN)

A query is compared to all samples in the training set (according to a distance function; Euclidean distance is very common). The query is assigned to the class that most of the k nearest neighbors (the k most similar vectors) belong to. For instance, if a message’s five nearest neighbors consist of two spam messages and three hams, the message is classified as ham. K-Nearest Neighbor is an example of the decision part of a classification system.

2.3.3.6. Neural Networks

Another technique widely used for classification and pattern recognition tasks is feedforward neural networks. Neural networks differ from other approaches because of their extensive training phase and their heuristic way of initialization. Far more resources are needed for training than for actual classification – in contrast to the K-NN algorithm, where training only means storing vectors while classification includes the costly comparison to all training examples [78].

2.3.3.7. Strengths and Weaknesses

Bayesian filters offer a good method for detecting spam messages. They represent a content-based solution that is easy to implement. Disadvantages are the need for permanent filter training, limited applicability for ISPs, potential counterattacks from spammers (insertion of random words – see [79]) and possible performance problems.

One big advantage of checksum-based systems is the low rate of false positives. False positives can only occur when a message has been sent many times or when the checksums of different messages are accidentally the same (a very small chance). Approaches based on fuzzy checksums can be used to detect messages that contain random words (often used by spammers to bypass keyword-based filtering).

Many articles discuss the advantages and disadvantages of different classification methods, distance functions and methods of text representation. Many of those approaches are best suited to a specific domain or problem. K-Nearest Neighbor is a rather simple approach; it requires hardly any training effort but pays for this with costly comparisons at the time of decision making. There is a vast body of literature comparing different kinds of Bayes filters (see, for example, [80]) to other types of filters currently available. Neural network based approaches and support vector machines need an extensive training phase and, because of their heuristic initialization, do not allow conclusions to be drawn about their decisions (they are a black box; the user does not know why a specific message is classified as ham or spam). On the other hand, classification itself is faster, which may be an important advantage, although the main performance problem of spam detection is the message analysis itself.

One of the first papers about the use of SVMs to classify spam messages was published in 1999 [81]. Another paper [82], proposing a similar approach, was published in 2001. So far, the application of SVMs as well as K-NN and its variations to the spam problem has been discussed numerous times [83]. However, due to their quite recent development and rather complex implementation, SVMs are rarely used in commercial anti-spam systems at the moment.

It is very important to take the data pre-processing and training phase into account, as they are a crucial part of the classification process. All methods discussed here tend to obtain good results only if they are trained on a regular basis. Regular training is essential for keeping the performance of a content filter at a satisfactory level and for avoiding a significant performance decrease over time. A different solution, offered by some anti-spam software vendors, is to send out training updates at regular intervals, taking the burden of maintaining databases and/or training sets away from the end user.
Although this avoids the individual effort and the bias caused by misclassification, it tends to decrease classification performance because it cannot account for individual users’ preferences.

2.4. Sender and Receiver Side

The sections above described countermeasures that become effective once an e-mail has already been sent. In this section we summarize methods that affect both sender and receiver. This comprises suggestions for new e-mail transfer protocols, two of which are mentioned here. It cannot be expected that they will be implemented and used in the near future.

2.4.1. IM 2000

IM 2000 has been designed by D. J. Bernstein, the creator of qmail. Today’s Internet mail infrastructure implements a push system, in which the sender’s cost of sending a message to thousands of recipients is nearly zero. IM 2000 proposes a pull mechanism in which messages are stored at the sender’s side. This concept has several ramifications for a new infrastructure [84][85]:

• Each message is stored under the sender’s disk quota at the sender’s ISP. ISPs accept messages only from authorized local users.
• The sender’s ISP, rather than the receiver’s ISP, is the always-online post office from which the receiver picks up the message.
• The message is not copied to a separate outgoing mail queue. The sender’s archive is the outgoing mail queue.
• The message is not copied to the receiver’s ISP. All the receiver needs is a brief notification that a message is available.
• After downloading a message from the sender’s ISP, the receiver can efficiently confirm success. The sender’s ISP can periodically retransmit notifications until it receives a confirmation. The sender can check for confirmation. There is no need for bounces.
• Recipients can check on occasion for new messages in archives that interest them. There is no need for mailing-list subscriptions.

The deployment of IM 2000 would require global adoption of a new mail infrastructure. The proposed solutions provide quite complicated mechanisms for admittedly difficult problems, given the requirements. A global deployment of an implementation is unlikely anytime in the near future [86].

2.4.2. AMTP

The Authenticated Mail Transfer Protocol (AMTP [87]) is currently specified in an Internet-Draft; the last version was submitted to the Internet Engineering Task Force on April 26, 2004. AMTP enables trusted relationships between entities operating mail transfer agents. This works over TLS, similar to SSL for Web servers: both client and server must present valid X.509 certificates, each signed by a trusted certification authority (CA), in order to begin a transaction. AMTP also provides a mechanism to publish concisely defined policies. This allows the parties in the trusted relationship to hold each other responsible for operating their servers within the constraints of agreed-upon rules. AMTP inherits the specification of SMTP and builds upon it. By operating on a different TCP port, AMTP can run in parallel with SMTP. It is hoped that this supports an easy and smooth adoption [87].

3. Products and Tools

This chapter describes the products that were used in our experiments. In the first section we give some general information about anti-spam software. Afterwards we discuss in detail the characteristics of the commercial and open-source spam filters tested.

3.1. Overview

The following section gives an overview of anti-spam software. First, some quality criteria for product evaluations are mentioned.
Then we point to online resources that help in finding the right choice within the wide variety of available solutions.

3.1.1. Quality Criteria

To defeat spam, many approaches have been developed and realized by the anti-spam industry. The market provides free applications for home users as well as applications for an enterprise-wide anti-spam policy. The systems’ qualities differ in many ways, and it is sometimes hard to determine which product should be deployed. Some important criteria that should be considered are:

Usability: Deploying anti-spam software can be a very cost-intensive task. Besides acquisition costs and license fees, staff for maintaining the application is needed. The effort for configuring and training a system is one of the most important criteria; the less user interaction is needed, the better.

Ease of Integration: The integration of a spam filter into the existing IT infrastructure is another main point. Questions such as additional hardware costs and interoperability with existing mail servers and operating systems have to be taken into consideration.

Processing Speed: The processing speed needed depends on the mail volume received. If the number of messages received exceeds the capabilities of the spam filter, mail could be lost due to congestion. The processing speed depends mainly on the methods used for analyzing the incoming messages.

Detection Rate: The detection rate is definitely the most important criterion. It must be pointed out that this refers to the detection rates of spam as well as of ham. A good spam filter must have a very low rate of false positives and, on the other hand, detect as many spam messages as possible.

3.1.2. Comparisons of Anti-Spam Software

In our report we can only give a short summary of the methods and tools available for spam detection. In the following, we therefore describe two online tools which give a very good and regularly updated overview of anti-spam tools.

The first is the anti-spam buyer’s guide at NetworkWorldFusion [88]. Registration is required (free of charge); then the buyer’s guide [88] can be searched, choosing between server-based products, client-based products and anti-spam services. A very useful feature is the so-called Compare-o-matic, where two or more products can be compared (different features for comparison can be chosen).

The second tool can be found at Spamotomy [89]. One can choose between all kinds of solutions – desktop software, server software, hosted services and disposable addresses; for every tool there is a short summary and a brief description of the methods used.

3.2. Commercial Products

The main topic of this section is the description of the functionality of various commercial anti-spam products – most of them have been used in our experiments (see Chapter 5 for the results). Further information about the products can be found on the vendors’ homepages. Due to their commercial nature, the functionality of some methods is not fully disclosed publicly. In particular, we describe the following commercial products here (the notation in brackets is the one used in Chapter 5): SurfControl E-Mail Filter for SMTP (= Product 1), Symantec Brightmail Anti-Spam (= Product 2), Symantec Mail Security for SMTP (= Product 3), Kaspersky Anti-Spam (= Product 4), Borderware MXtreme Mail Firewall (= Product 5) and Ikarus mySpamWall (= Product 6).

3.2.1. Symantec Brightmail Anti-Spam

Symantec Brightmail Anti-Spam (Version
6.0) [90] offers complete server-side anti-spam and anti-virus protection. It can be run on Windows 2000 Server (SP2) or higher, Linux (Red Hat ES/AS 3.0) and Solaris (8 or 9) – we only tested the Windows version. The product is updated online to ensure that the latest virus and spam patterns are installed. Brightmail Anti-Spam filters e-mail in four basic ways:

• It filters and classifies e-mail.
• It cleans viruses from e-mail.
• Content filters can be tailored specifically to the needs of an organization.
• The Allowed Senders List and the Blocked Senders List filter messages based on available sender information; own lists or third-party lists can be used.

Figure 11 shows the typical processing path of Symantec Brightmail Anti-Spam.

Figure 11: Typical processing path of Symantec Brightmail Anti-Spam

Available Methods

Methods based on source of e-mail: Brightmail Anti-Spam supports whitelists and blacklists. The filter treats mail coming from an address or connection in the whitelist as legitimate mail. Policies can be set up to configure a variety of actions performed on incoming e-mail, including deletion, forwarding and subject line modification. Brightmail Anti-Spam also provides three preconfigured lists for handling e-mail messages:

• Open Proxy List: IP addresses that are open proxies (often used by spammers).
• Safe List: IP addresses from which virtually no outgoing e-mail is spam.
• Suspect List: IP addresses from which virtually all outgoing e-mail is spam.

Methods based on fingerprints: Brightmail Anti-Spam uses its own checksum database, the so-called BrightSig technology. It is the cornerstone of Symantec’s signature technology. The technology characterizes spam attacks using proprietary fuzzy algorithms, and the resulting signatures are added to a database of known spam.

Other content based methods: When evaluating whether messages are spam or not, Brightmail Anti-Spam calculates a spam score from 1 to 100 for each message, based on techniques such as pattern matching and heuristic analysis. If an e-mail’s score is in the range from 90 to 100, it is considered spam; if the score is below 25, it is considered ham. E-mails with scores between 25 and 90 are suspected spam. For more aggressive filtering, the thresholds can be varied, and it is possible to specify different actions for messages identified as suspected spam or spam, based on different filtering policies. Brightmail Anti-Spam also allows creating custom filters based on keywords and phrases found in specific areas of a message.

Other Features: The language a message is written in can be determined. By default, Symantec Brightmail Anti-Spam treats all languages equally, but it is possible to classify messages according to their language. It is also possible to filter out oversized messages or messages with specific attachments. When configured for anti-virus filtering, Brightmail Scanners detect viruses in e-mail as it enters the mail system. When one or more viruses are detected, the anti-virus policies take effect.

User’s View: Symantec Brightmail Anti-Spam provides a set of basic features that cannot be disabled, to ensure protection against spam. These features are the Spam Scoring and the Suspect List within the Reputation Service. The only way to take the whole product offline is to disable the services in the Microsoft Services Console. The product is easy to handle and provides a comfortable Web interface for administration and Spam Scoring configuration.
Other Features: The language a message is written in can be determined. By default, Symantec Brightmail Anti-Spam treats all languages equally, but it is possible to classify messages according to their language. It is also possible to filter out oversized messages or messages with specific attachments. When configured for anti-virus filtering, Brightmail Scanners detect viruses in e-mail as it enters the mail system. When one or more viruses are detected, the anti-virus policies take effect.

User's View: Symantec Brightmail Anti-Spam provides a set of basic features that cannot be disabled, to ensure protection against spam. These features are the Spam Scoring and the Suspect List within the Reputation Service. The only way to take the whole product offline is to disable the services in the Microsoft Services Console. The product is easy to handle and provides a comfortable Web interface for administration and Spam Scoring configuration. The processing speed is at a very high level. Anti-spam and anti-virus definitions are updated regularly, so there is little effort for maintaining the product. It offers many good features and is a good combination of anti-spam and anti-virus protection, but there are disadvantages too: the information provided in the logging section (statistics about status information and classification results) is only updated every hour, so real time monitoring of the processing status is impossible.

3.2.2. Kaspersky Anti-Spam

Both Kaspersky Anti-Spam 2.0 Enterprise Edition and ISP Edition [91] must be run under Linux or FreeBSD 4.x and plugged into an existing mail server (the most common Unix mail servers such as postfix, qmail, etc. are supported).

Available Methods

Methods based on source of e-mail: Kaspersky Anti-Spam 2.0 supports filtering mail by officially blacklisted addresses as well as using local black- and whitelists created by administrators. Furthermore, a heuristic analysis checks some of the formal attributes of an e-mail, such as sender's address, recipient's address, sender's IP address, size of message, and format of message.

Classification methods: A lexicographical comparison searches for words and phrases typically used by spammers. Additionally, telltale constructs such as hidden text or special HTML tags, which are often used by spammers, are taken into account.

Methods based on fingerprints: The so-called "SpamTest" technology compares incoming mail against sample spam signatures (comparison of their lexical content and detection of regular expressions). These signatures are updated automatically on a regular basis.

Other content based methods: The content of each message can be categorized by Kaspersky Anti-Spam. In this context, "content" refers to the body of an e-mail, excluding subject and header. Moreover, conditions can refer to non-formal attributes of a message, i.e., to the results of the content filtering. The classical rule based approach is thus combined with content analysis. Kaspersky uses two basic methods to detect messages with "suspicious" content:

• checking against sample messages (by comparison of their lexical content)
• detection of regular expressions – words and word combinations

According to the results of the content analysis, a message can be assigned to several content categories (obscene, formal, probable spam, making money fast, ...). These assignments can also be used in the filtering rules. Additionally, every message is processed via filtering rules. Every rule includes one or more conditions that involve an analysis of the message – only if all the conditions of a rule are met is the action of that rule applied. Such conditions include tests for the sender's IP address, the sender's e-mail address and the message size.
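The all-conditions-must-match semantics of these rules can be sketched as follows; the data structures and the example rule are our own illustration, not Kaspersky's implementation:

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Rule:
        # predicates over a message, e.g. on sender IP, sender address, size
        conditions: List[Callable[[dict], bool]]
        action: Callable[[dict], None]

    def apply_rules(message: dict, rules: List[Rule]) -> None:
        for rule in rules:
            # a rule fires only if ALL of its conditions are met
            if all(cond(message) for cond in rule.conditions):
                rule.action(message)

    # hypothetical example: tag large messages from a given network
    big_from_net = Rule(
        conditions=[lambda m: m["size"] > 100_000,
                    lambda m: m["sender_ip"].startswith("192.0.2.")],
        action=lambda m: m.setdefault("tags", []).append("suspicious"),
    )

A profile in the sense described above would then simply be an ordered list of such rules, applied to every incoming message.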
User's View: Kaspersky Anti-Spam adds a header to each processed e-mail, containing the overall verdict (spam, probable-spam, not detected). The product is configured via the Web interface. The tested configuration uses the standard common profile (without RBL and DNS checks). Administrators can switch between several standard rule sets (called common profiles, valid for all users) or create new ones. In addition, certain rules can be added on a per user basis. The rule based approach is quite powerful in that it allows many settings (the standard profile includes 34 rules – most of them handle header substitution and modification). The installation of Kaspersky Anti-Spam worked flawlessly; configuration via the Web configurator is rather easy when switching between standard profiles, but gets very complex when creating a new profile consisting of new rules. Furthermore, new sample messages can be added to the predefined categories to improve the classification performance. A comprehensive evaluation of Kaspersky Anti-Spam is not easy because of the vast number of configuration possibilities. Processing speed seems to be good. It is not completely clear how important the content analysis is for the classification.

3.2.3. SurfControl E-Mail Filter for SMTP

SurfControl E-Mail Filter for SMTP, Version 4.7 [92] is essentially a Simple Mail Transfer Protocol host, which works with all SMTP mail servers, including Microsoft Exchange, Lotus Notes Domino and Novell GroupWise. Like Symantec Mail Security for SMTP it is a gateway, so the processing path is the same (see Figure 12), and it filters both outgoing and incoming e-mail. The SurfControl E-Mail Filter is a commercial product and runs on Windows 2000 Server (Service Pack 4) or higher.

The core SurfControl E-Mail Filter solution consists of the following software components:

• Message Administrator: allows the user to review and act on delayed and isolated messages, and to query the various system logs.
• E-Mail Filter Administrator: enables the user to control the E-Mail Filter remotely via a Web browser.
• E-Mail Monitor: provides a window onto the progress of individual messages through the E-Mail Filter.
• Rules Administrator: enables the user to set up rules to monitor and/or block messages.
• Scheduler: the interface for automating repetitive tasks, such as receiving updates from SurfControl's anti-spam database.

Available Methods

Methods based on source of e-mail: The SurfControl E-Mail Filter allows creating a whitelist database by entering information about known individuals. Like the other products, it supports custom and real time blacklists.

Classification Methods: The Virtual Learning Agent (VLA) is a content development tool that can be trained to understand and recognize specific proprietary content. The VLA uses neural network technology with trained strings, allowing user-defined content to be learned.

Other content based methods: SurfControl uses its Anti-Spam Agent, which automatically detects and deals with common non-business or high-risk e-mail, such as humorous graphics, chain letters, hoaxes and jokes. It is continuously updated by SurfControl to maintain accuracy and quality. The filter also enables Boolean searches to check for words, combinations of words or pairs of words within a message. There is also a library of dictionaries to detect e-mail content that an organization may want to avoid. These dictionaries contain words associated with different aspects of unwanted content, for example adult material, hate speech and gambling.
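The dictionary and Boolean search facilities described above boil down to simple term matching. The following sketch is purely illustrative – the category name and word list are made up:

    # Hypothetical dictionary in the spirit of SurfControl's word lists.
    GAMBLING = {"casino", "jackpot", "betting", "roulette"}

    def dictionary_hits(body: str, dictionary: set) -> int:
        """Count how many words of the message appear in the dictionary."""
        return sum(word in dictionary for word in body.lower().split())

    def boolean_match(body: str, all_of=(), any_of=()) -> bool:
        """AND/OR term search, as in the Boolean searches described above."""
        text = body.lower()
        return all(t in text for t in all_of) and \
               (not any_of or any(t in text for t in any_of))

Obviously, a dictionary containing near-ubiquitous words (such as "html") will hit nearly every message – a problem we in fact observed in practice (see the user's view below).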
Other Features: It is possible to remove active HTML content from the body of e-mail messages. Active content is code that automatically installs and runs on a computer, such as scripts or ActiveX Controls [93]. SurfControl E-Mail Filter can detect various routing relay techniques and deny e-mail that has been forwarded or routed. File attachments and messages can be blocked if they do not comply with the MIME standard or exceed a specific size. Looping messages between two or more e-mail servers, as well as messages that exceed a specific number of recipients, can be detected and removed. SurfControl E-Mail Filter also has an image recognition tool that scans graphics files for explicit adult content. For anti-virus protection an agent is available that helps to protect the system by deleting viruses and cleaning infected files. It uses the McAfee Olympus Anti-Virus engine to detect files that could damage a system.

User's View: The entire configuration of the product is up to the user; there are no preconfigured settings available. It is possible to turn all features off so that the SurfControl E-Mail Filter just acts as a simple SMTP gateway. The settings available to the user are very extensive, and some time is needed to get an overview of the product's functionality. The product provides a good user interface, and the components are clearly arranged. The SurfControl E-Mail Monitor allows real time supervision of the processing state. It is possible to integrate external code by using the External Program Plug-In. The processing speed of the SurfControl E-Mail Filter is far behind the other tested products. The Virtual Image Agent does not seem to work at a sufficient level, because it blocks harmless pictures even at the lowest sensitivity level. The Virtual Learning Agent only supports pure text files, so training the agent is very time-consuming. The Loop Detection also does not seem to work correctly: it blocks messages that are merely tagged with an "X-Spam" flag, which is not a significant sign of a looping message. Some of the dictionaries provided do not seem useful because they include words appearing in nearly every message (for example "html").

3.2.4. Symantec Mail Security for SMTP

Symantec Mail Security for SMTP, Version 4.0.0.59 [94] is a Simple Mail Transfer Protocol server that processes incoming or outgoing e-mail before sending it to a local or remote mail server. The software is a commercial product and requires Windows 2000 Server with Service Pack 4 or higher as operating system. There is also a release for Solaris 8 or 9, which we did not test. Symantec Mail Security for SMTP is updated online to ensure that the latest virus and spam patterns are installed. It can be configured to protect a network in the following ways:

• Virus protection
• Blocking spam
• Preventing the relaying of spam for another host

Figure 12: Typical processing path of Symantec Mail Security for SMTP

Available Methods

Methods based on source of e-mail: To limit potential spam, Symantec Mail Security can consult up to three real time blacklists. E-mail can also be blocked via a custom blacklist (which contains the sender's address or domain). Domains and e-mail addresses that shall bypass the heuristic and blacklist detection can be added to a whitelist. There is also an auto-generating whitelist feature that, if enabled, adds all domains of outgoing messages that are not in the local routing list.

Classification Methods: The heuristic anti-spam engine is based on neural networks and analyzes the entire incoming mail message, looking for key characteristics of spam. It weighs its findings against key characteristics of legitimate e-mail and assigns a spam score (1–100) expressing the spam probability – a high score means a high spam probability. This score, in conjunction with the engine sensitivity level (1 = low, 5 = high), determines whether a message is considered spam. Details of this method were not made accessible to us.
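Since the internals were not disclosed, we can only illustrate how such a score/sensitivity pair might interact. In the following sketch the threshold values are purely our own assumption, not Symantec's actual mapping; it merely shows the plausible mechanics of a 1–100 score checked against a sensitivity-dependent cutoff:

    # Assumed mapping: a higher sensitivity level lowers the spam threshold.
    # These numbers are invented for illustration only.
    THRESHOLDS = {1: 90, 2: 80, 3: 70, 4: 60, 5: 50}

    def is_spam(score: int, sensitivity: int = 3) -> bool:
        # score: heuristic spam score in 1..100 (its computation is proprietary)
        return score >= THRESHOLDS[sensitivity]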
Other content based methods: Symantec Mail Security for SMTP allows defining spam rules to be used for processing the message body. Each rule consists of one or more terms that can be combined using AND, OR, and NOT operators. For example, the rule "top secret" OR "confidential" triggers if one of these terms appears in the message body.

Other Features: Symantec Mail Security for SMTP allows blocking messages by message size, by subject line or by file name. It can also drop messages that exceed various container limits, such as the file size, the cumulative size or the number of nested containers. Functionality for handling encrypted container files is included. Relay restrictions can be configured so that Symantec Mail Security for SMTP refuses to deliver e-mail originating outside the organization. The anti-virus scanning feature tries to detect virus infected e-mail; new or unknown viruses can be detected through a heuristic method whose sensitivity is adjustable. Another component of the anti-virus policy is the Mass Mailer Cleanup, which deletes mass mail or worm infected messages. The service searches for a match between virus name patterns and the signature returned by the anti-virus scan; if a match is detected, the message is dropped.

User's View: The complete configuration of the product is up to the user; there are no preconfigured settings available. It is possible to turn off all features so that Symantec Mail Security for SMTP just acts as a simple SMTP gateway. To take it offline, the delivery of messages can be fully stopped and all messages can be rejected. The product is easy to handle and provides a comfortable Web interface for administration. The auto-generated whitelist is a useful feature that saves time otherwise spent editing the list. The reporting function allows good supervision of the processing state; reports are always up-to-date and include most of the relevant information. The processing speed of Symantec Mail Security for SMTP is at a very high level. However, the most important spam detecting feature, the heuristic spam detection, does not provide a sufficient detection rate. The effort for creating spam and content rules is too high in relation to the expected increase in the spam detection rate. There is no way to manage probabilities for words or combinations of words appearing in an e-mail message. The latest online update of the spam patterns file dates back to 2004-04-18; since then the Live Update functionality seems to have had no effect at all.

3.2.5. Borderware MXtreme Mail Firewall

The tested version MX-400 combines three functionalities – MTA, e-mail gateway and firewall. It has its own operating system, S-Core OS, a Unix system based on FreeBSD. Unlike the other solutions, MXtreme is hardware based.

Available Methods

Methods based on source of e-mail: The MXtreme Mail Firewall supports blacklists (custom as well as real time) and whitelists. These lists can be specified on the user or system level.

Methods based on fingerprints: The MXtreme Mail Firewall uses DCC for spam detection.

Classification Methods: The MXtreme Mail Firewall uses a so-called Statistical Token Analyzer (STA) to identify spam based on a statistical analysis of the mail content. STA is based on Bayesian filtering; it has not been made public to what extent it differs from a classical Bayes filter (a detailed description of Bayesian filtering can be found in Section 2.3.3).
STA uses three sources of data to build its database: the initial tables supplied by BorderWare, based on an analysis of known spam; tables derived from an analysis of local legitimate mail ("local learning" or "training"); and mail identified as "bulk" by DCC, which is analyzed to provide examples of local spam.

Other content based methods: The MXtreme Mail Firewall supports pattern based filtering. Filters can be specified using simple English terms such as "contains" and "matches" or using regular expressions. These filters are processed in the order of their priority.

User's View: The configuration is entirely up to the administrator. After activation of the anti-spam feature, every single method can be activated and modified separately. It is possible to filter mail using Brightmail; if so, RBL, DCC and STA are disabled by default. The product is easy to handle and provides a comfortable Web interface for administration. Its main advantages are that it combines firewall and anti-spam functionality and that it is rather easy to maintain. DCC and the Statistical Token Analyzer performed quite well, although the classification performance may differ in production use, depending on the chosen training policy.

3.2.6. Ikarus mySpamWall

Unlike the other products, Ikarus mySpamWall [95] is a service running in a service center and cannot be installed on a computer locally. Ikarus Software calls this a "managed security service". As it is a service, no information about the operating system is available, and maintenance is reduced to a minimum.

Available methods

Methods based on source of e-mail: Ikarus mySpamWall provides a full set of blacklists, whitelists and a greylist. The greylist in particular is a very important feature, as it does not only perform checks on the sender's IP address and hostname (to reduce dangers originating from private broadband accounts), but also performs several checks on header information and the message body before finally accepting a message.
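The core idea behind greylisting can be sketched as follows – a minimal, classic triplet-based variant; Ikarus' greylist performs additional header and body checks, as described above, and the delay value below is our own assumption:

    import time

    seen = {}          # (ip, sender, recipient) -> timestamp of first attempt
    RETRY_DELAY = 300  # assumed minimum delay in seconds before acceptance

    def greylist_check(ip: str, sender: str, recipient: str) -> str:
        key = (ip, sender, recipient)
        now = time.time()
        if key not in seen:
            seen[key] = now
            return "450 greylisted, try again later"  # temporary failure
        if now - seen[key] >= RETRY_DELAY:
            return "250 accepted"  # well-behaved MTAs retry; most spamware does not
        return "450 greylisted, try again later"

Since a temporary failure only defers delivery, a legitimate sender's MTA will retry and eventually get through – which is why greylisting avoids "real" false positives, as noted below.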
Classification methods: Classification is based on two main technologies: on the one hand keywords are checked, and on the other hand a Bayesian filter is used. The user can influence neither of these checks directly.

Methods based on fingerprints: No global fingerprint services like DCC are used in this product. All incoming messages are classified, and the ones classified as spam are used to create tokens for future message classification.

Other features: Ikarus mySpamWall supports technologies like HashCash and incorporates some additional features that are made possible by the tight integration of the e-mail gateway and the spam filter engine. These features include an Anti Spoofing Engine, which checks whether a sender's address really exists (at least if it is a local address), and automatic whitelisting of frequent senders. Furthermore, IP addresses are banned automatically for a certain time if they cause a certain number of errors. Usually this product is sold in a package with an anti-virus engine to provide full protection.

User's View: Generally speaking, Ikarus mySpamWall is very easy to use. It has a Web interface that allows a simple setup of threshold values for possible spam and spam. Optionally, an advanced interface allows the addition of simple rules to blacklist or whitelist certain senders, receivers, subject lines or content using simple regular expressions. From our point of view it seems generally a good idea to offer spam protection for companies as a service. It is a good solution for rather small companies that cannot afford an IT department to deal with the spam problem. The fact that this product offers a complete integration of a mail transfer agent and an anti-spam solution seems to be a big advantage compared to most of the others. This advantage is exploited in a very potent greylist that is able to detect many spam messages without the risk of creating "real" false positives, as messages can always be sent again and will then be delivered.

3.2.7. Spamkiss

Spamkiss [96], like some other projects, aims at a goal completely different from that of regular spam filters. While regular spam filters accept the fact that spam exists and try to eliminate it after it has been sent, these approaches try to make sending out spam as expensive as possible (compare Section 2.2.1). These expenses should sooner or later make spamming commercially unattractive.

The technology involved in this approach is a combination of two different methods: a whitelist and a challenge-response protocol. The challenge-response protocol is used for the initial contact: the user's mail address is modified with a random token, which is only valid for a certain period. This modified e-mail address has to be used for the initial contact only. After that, the sender's address is added to the user's whitelist, and e-mail communication continues just as it does now: both partners can communicate as they want, without adding any new random tokens to mail addresses. The way in which the tokens are distributed is the major difference between systems using this approach; it is similar to key distribution in a public key infrastructure. As soon as the key (or in this case the token) is exchanged, communication is no problem at all. Spamkiss offers a simple exchange, either by personally talking to each other (i.e., telling the partner how to modify the address for the first contact) or by getting the currently used token from a server.
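How such a time-limited token could be derived is sketched below. This is our own hypothetical construction (an HMAC over the address and a validity window) – the sources do not specify Spamkiss' actual scheme:

    import hashlib
    import hmac
    import time

    SECRET = b"per-user secret key"   # assumption: one secret per protected user

    def current_token(address: str, period: int = 7 * 86400) -> str:
        """Token valid within one `period`-long window (one week here)."""
        window = int(time.time() // period)
        data = f"{address}:{window}".encode()
        return hmac.new(SECRET, data, hashlib.sha256).hexdigest()[:8]

    def contact_address(user: str, domain: str) -> str:
        """Modified address a new correspondent must use for first contact."""
        token = current_token(f"{user}@{domain}")
        return f"{user}+{token}@{domain}"

With such a construction, a compromised token simply becomes invalid once its window expires, which matches the behavior described below.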
A system very similar to Spamkiss is used at the Computing Science Department of the University of Alberta, Canada. It is called SFM [66] (Spam Free E-Mail service, see Section 2.3.1). To make sure that the current token cannot be harvested by a computer, a distorted image is used. This technique is also often used to protect free services, such as stock quotes, that should not be machine-readable. The general idea is rather simple: distorted images cannot (at the moment) be read by text recognition software, whereas the human brain easily recognizes the letters, even though they are distorted and placed on a fancy background pattern.

Available Methods

Methods based on source of e-mail: Spamkiss' main feature is a whitelist that temporarily blocks all messages from senders who are not yet included.

Other features: As all senders have to be on a whitelist, a special dialog is available which allows users to add themselves to the whitelist, involving human interaction. This kind of technology promises that no more legitimate messages will be lost: all messages generated by a human being (who reads replies to his address) are delivered, or bounced with a request to resend them to a different, modified address. This modified address simply consists of the original address and the current token. The approach of stopping spam messages this early seems like a good idea at first glance. Messages are not delivered at the first attempt. Later on, legitimate messages are delivered, while all others are not accepted by the mail server, which tells the sender to fetch the current token first. If a token is compromised – i.e., the combination of token and e-mail address ends up on a spammer's list – it simply gets changed, without affecting communication with those already on the whitelist.

Several major questions seem to be unanswered so far. On the one hand it might still pose a problem to handle automatically generated messages, and on the other hand bouncing back a lot of messages may increase network traffic considerably.

A closer look reveals a lot of additional work. Users have to know their tokens so they can give them to future communication partners. After this initial contact the sender's address gets stored in the whitelist. This solution may work in many cases, but several situations may cause problems in this kind of environment. Many users have several e-mail addresses, which means that a user who uses different sender addresses has to go through the initial process several times. In addition, it is often hard to tell who will be the sender of messages delivered by a mailing list, or the exact sender address of messages automatically generated by an online store. Furthermore, checks are basically performed on the sender's address only, which means that someone who knows the addresses of your trusted partners can easily send you any kind of message.

3.3. Open Source

This section describes the characteristics of the open source filters we tested. Owing to their open source character, documentation and code quality vary. On the other hand, almost all products offer a lot of information on their Web sites.

3.3.1. SpamAssassin

SpamAssassin [97] is written in Perl and is part of the Apache Software Foundation. The primary target platforms are Unix operating systems; some Windows products use SpamAssassin, though they are not open source. SpamAssassin extracts different features from incoming e-mail messages. This analysis is done through so-called tests, which examine the header, body or full text of an e-mail (a full listing can be found at [98]). SpamAssassin can be configured to include RBL checks (see Section 2.3.1) and a Bayesian classifier. The overall rating of a message is computed from the values for the different results of the text analysis plus the results of the Bayesian filtering plus the results from the distributed hash databases – hence SpamAssassin never relies on a single technique. After testing, a header containing the overall score and a spam mark is added to each processed e-mail message. The messages can then be classified according to this mark or score (probably-spam if the mark is present, certainly-spam above a certain threshold).
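The additive scheme just described can be sketched as follows. The test names and weights below are invented for illustration and are not actual SpamAssassin rules; in the real system hundreds of tests contribute:

    # Illustrative test weights (not real SpamAssassin rule names or scores).
    TEST_SCORES = {
        "FORGED_RECEIVED": 2.5,
        "SUSPICIOUS_SUBJECT": 1.2,
        "BAYES_HIGH": 3.5,      # stands in for the Bayesian classifier's verdict
        "LISTED_IN_DCC": 2.0,   # stands in for the distributed hash databases
    }

    def overall_score(fired_tests) -> float:
        return sum(TEST_SCORES[name] for name in fired_tests)

    def classify(total: float, threshold: float = 5.0) -> str:
        # the mark/threshold scheme described above
        return "spam" if total >= threshold else "ham"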
Available Methods

Methods based on source of e-mail: SpamAssassin analyzes the message header using regular expressions to identify forgeries and manipulations. SpamAssassin permits defining custom whitelists as well as consulting public ones like Habeas [51] and the Bonded Sender Program [52]. In addition, many useful existing blacklists, such as mail-abuse.org, ordb.org, SURBL, and others are supported. The latest release of SpamAssassin (3.0) supports the Sender Policy Framework for verifying the sender's domain.

Methods based on fingerprints: Vipul's Razor, Pyzor and DCC are supported to block spam and bulk mail.

Classification Methods: SpamAssassin uses a Bayesian-like kind of probability-analysis classification, so that a user can train it to recognize mail messages similar to a training set [99]. Many of SpamAssassin's tests aim at static patterns in the text of mail messages.

User's View: The user can specify which rules to use and whether Bayesian methods should be used or not. Furthermore, one can specify which features should be computed and which online resources should be consulted (DCC, Razor, Pyzor). The user can also specify which RBL (if any) should be used; SpamAssassin is therefore very adjustable to the user's needs. SpamAssassin also offers an auto-learn function, which uses all messages below and above certain scores (mail that is very clearly classified as spam or ham) as learning input for the Bayesian classifier. The auto-learn function is not considered in our tests; all training is done prior to testing. We experienced no problems during configuration. SpamAssassin is a very comprehensive tool, because it combines more than 600 different tests. SpamAssassin seems to be particularly designed to keep false positives low, and it gives a good idea of what features of an e-mail message can be computed.

3.3.2. CRM 114

CRM114 [100] (Controllable Regex Mutilator, concept # 114) is written in C and available as open source. Several Unix operating systems are supported.

Available Methods

Methods based on source of e-mail: CRM 114 supports personal white- and blacklists.

Classification Methods: CRM 114 uses a Markovian discriminator to differentiate between ham and spam messages. Incoming text is matched against Hidden Markov Models [101] of the training corpora (another probabilistic model, like Bayesian filtering). CRM 114 uses multi-word phrases and assigns higher weights to longer phrases, whereas Bayesian filtering usually uses phrases of length one (single words). Incoming mail is piped to CRM114 by the local MDA, and CRM114 adds a new header containing the spam score to each message.

User's View: The most important and resource intensive aspect of CRM114 is training. The configuration is not very user friendly and makes a somewhat unfinished impression. Bulk training takes a very long time, even for a small number of messages. Like other statistical approaches, CRM114 is best used on a per user basis to personalize the training sets. Installation and training are a bit tricky, but the results are rather good. The recommended training method is to use only false classifications as training input, whereas we used bulk training (training on errors is much easier on the individual level).

3.3.3. Bogofilter

Bogofilter [102] is a "Paul Graham based" Bayesian spam filter. The application is written in C and available as open source (several Unix operating systems are supported).

Available Methods

Statistical Methods: Bogofilter classifies incoming mail as ham or spam based on a statistical analysis of the message's header and content (body). Each token (word) in an incoming mail is checked against a good and a bad wordlist (containing typical ham and spam words, respectively). These wordlists are generated by training the filter with both a ham and a spam corpus in order to find typical ham and spam words. The spam probabilities of the individual tokens are combined using the inverse chi-square function (a statistical test).
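The chi-square combination can be illustrated as follows. This is a simplified rendering of the underlying idea (a Fisher/Robinson style combination); Bogofilter's actual computation involves further corrections, and the token probabilities are assumed to lie strictly between 0 and 1:

    import math

    def chi2_sf(x: float, dof: int) -> float:
        """Survival function of the chi-square distribution for even dof,
        using the closed form exp(-x/2) * sum (x/2)^k / k!."""
        m = x / 2.0
        term = total = math.exp(-m)
        for k in range(1, dof // 2):
            term *= m / k
            total += term
        return min(total, 1.0)

    def bogosity(token_probs) -> float:
        """Combine per-token spam probabilities (each in (0, 1)) into a
        single indicator: near 1 means spam, near 0 means ham."""
        n = len(token_probs)
        s = chi2_sf(-2.0 * sum(math.log(p) for p in token_probs), 2 * n)
        h = chi2_sf(-2.0 * sum(math.log(1.0 - p) for p in token_probs), 2 * n)
        return (1.0 + s - h) / 2.0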
Bogofilter is invoked by the MDA (mail delivery agent) and computes tokens and finally spam or ham probabilities for every incoming message. It adds an X-Bogosity header containing the ham and spam scores to every e-mail it processes; messages can then be moved to their designated folders according to this header. Users have to train Bogofilter with both a ham and a spam corpus. After that initial training, Bogofilter is ready to classify mail. It is recommended to retrain on a regular basis to adjust to changes in mail messages.

User's View: The user can choose between two classes (spam, ham) or three classes (spam, probably spam, ham) and is responsible for the training. Moreover, the threshold values can be specified. The results of Bogofilter are of particular interest because it is the only Bayes-only application included in our test series. In our experience Bogofilter works without problems (installation and usage).

4. Performance Evaluation

The previous chapters described the available tools and the most relevant methods used for spam detection. For experiments with these tools it is very important to have a comparable test set. Existing evaluations in the scientific literature use either publicly available spam and ham samples (like the Ling spam corpus and PU1, both proposed in [103], or the SpamAssassin sample [104]) or self-collected samples, which are usually more up-to-date but in some cases not publicly available. We decided to test the tools with our own collected sample, as it is more up-to-date, and also with the SpamAssassin sample because of its public availability. In this chapter we describe the source and the composition of our own test sample and the hard- and software configuration used for the tests.

4.1. Sources for Our Own Samples

Collecting spam messages seems quite simple – a brief look into anyone's inbox is enough – but is a collection of a few inboxes enough for a representative sample? We decided to ask our partners to collect spam in order to be as representative as possible. The collection of ham messages is much harder due to legal preconditions: it is not allowed to use ham messages without the consent of both the receiver and the sender, so we depended on volunteers and on our own e-mail. The following sections describe the situation at our partners Mobilkom Austria and UPC Telekabel and at the University of Vienna.

4.1.1. University of Vienna

The ZID, which is responsible for e-mail transactions and all other IT services at the University of Vienna, was our first contact point. Getting spam messages was easy, because the ZID runs its own spam filter and collects spam via spam traps. We got our own spam e-mail account that is filled by those honeypots with approximately 1,500 incoming spam messages per day. Ham collection was difficult, but some volunteers made their inboxes available, so we took our own messages and those of the volunteers – special thanks to Mr. Hatz, Ms. Marosi, Ms. Khan, Ms. Thanheiser and Mr. Bobrowski.

4.1.2. Mobilkom Austria

Mobilkom Austria provides a message store containing spam messages as well as ham messages which were falsely classified ("false positives") by the mail filter currently in use. The repository currently holds about 100 false positives and roughly 3,600 spam messages. Several employees had the possibility to move messages to the respective folders, but unfortunately only very few of them did so. The largest part of the messages was forwarded to these folders, which means that their original headers and envelopes were lost. Moreover, most messages were forwarded as inline messages.
This implies that their message bodies were changed too. Some of the messages were forwarded as attachments, leaving all the important information (header, body) unchanged. However, as all messages provided by Mobilkom Austria are stored in a folder located behind a corporate firewall, they can only be accessed through a Web based service, which makes retrieving messages quite difficult. All messages provided by Mobilkom Austria have to be checked manually, and every single message that can be used for the tests (i.e., that is still unchanged) has to be retrieved from the Microsoft Outlook compatible store and moved to an IMAP folder. This causes a significant overhead and creates a complicated environment, so we decided not to use this data in our early experiments, which are based on relatively large amounts of data.

4.1.3. UPC Telekabel

UPC Telekabel opened 10 test accounts for our project. Five were used as spam traps and the other five were used to subscribe to legitimate newsletters. Newsletters are very important for a representative sample, because they share many features with spam messages.

4.2. Test Sample Description

In this chapter we characterize the two test samples used in our experiments.

4.2.1. Our Test Sample

Our sample was created on 25 August 2004. It consists of 1,500 spam messages and 1,500 non-spam messages, both collected from different sources.

Spam sources:
• 1,382 (ZID, collected via spam traps, 18.08.2004)
• 22 (Spanish messages from W. Strauss, 16.07.2004 – 19.07.2004)
• 44 (Department, Thanheiser, Khan, Hatz, 18.08.2004)
• 52 (W. Gansterer, 09.03.2004 – 05.08.2004)

Ham sources:
• 593 (Hatz, 01.01.2004 – 19.08.2004)
• 245 (Strauss, 23.09.2002 – 15.08.2004)
• 181 (Department, 20.06.2004 – 20.08.2004)
• 12 (Thanheiser, 04.08.2004 – 20.08.2004)
• 44 (Khan, 06.08.2004 – 20.08.2004)
• 36 (Marosi, 04.08.2004 – 13.08.2004)
• 244 (Ilger, 01.07.2004 – 22.08.2004)
• 145 (Newsletter account Chello, 16.08.2004 – 20.08.2004)

Sample size for 1,500 messages: ham 57.745 MB, spam 6.441 MB
Sample size for 1,000 messages: ham 33.845 MB, spam 4.450 MB

4.2.2. SpamAssassin Test Sample

The original SpamAssassin sample consists of different parts [104]. We took Spam 2 (20030228_spam_2.tar.bz2) with 1,397 mail messages and Easy Ham 2 (20030228_easy_ham_2.tar.bz2) with 1,400 messages, which roughly equals our own test sample in size.

Sample size for 1,000 messages: ham 4.17 MB, spam 6.26 MB

4.3. Experimental Setup

Our test infrastructure consists of several software and hardware components, primarily a message store containing the spam and ham messages and two server machines for testing the Windows and Linux products. Table 11 gives a short overview of the hard- and software we use.

                        Message Store               Windows Configuration       Linux Configuration
Hardware Platform       AMD Athlon64 3000+,         Pentium 4, 2.8 GHz,         AMD Athlon64 3000+,
                        512 MB RAM                  1024 MB RAM                 1024 MB RAM
Operating System        Windows 2000 Server,        Windows Server 2003,        SuSE Linux 9.1
                        Service Pack 4              Standard Edition
SMTP-Server             Mercury Mail,               Microsoft SMTP-Service      Postfix SMTP-Server,
                        Version 4.01a                                           Version 2.0.19
Mail Distribution       –                           Self written Java           Fetchmail, Version 6.2.5,
Software                                            application, Microsoft      Procmail, Version 3.22
                                                    POP3 Service

Table 11: Hardware and software configuration

Each test set is stored in its own IMAP folder and can be accessed remotely.
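As an illustration of how the stored messages travel to the filters, the following is a hypothetical Python stand-in for the self-written Java application listed in Table 11; host names, credentials and folder names are placeholders:

    import imaplib
    import smtplib

    # connect to the IMAP message store (placeholder host/credentials)
    store = imaplib.IMAP4("imap.example.org")
    store.login("testuser", "password")
    store.select("spam-sample")

    # replay every stored message to the filter's SMTP port (port 25)
    _, data = store.search(None, "ALL")
    with smtplib.SMTP("filter-under-test.example.org", 25) as relay:
        for num in data[0].split():
            _, msg = store.fetch(num, "(RFC822)")
            raw = msg[0][1]  # the raw RFC 822 message bytes
            relay.sendmail("[email protected]",
                           ["[email protected]"], raw)
    store.logout()

Replaying the stored messages over SMTP, rather than copying files, ensures that every filter sees the samples exactly as a mail server would.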
Due to the different architectural characteristics of the Linux and Windows spam filters, we implemented different procedures for testing them:9

9 The configuration of the MXtreme Mail Firewall is not included here because it is a hardware filter and runs its own operating system. Its test process is similar to the Windows test process; we also use our Java application to deliver the messages to the MXtreme.

4.3.1. Windows Test Process

We use a small Java application to fetch the messages out of the IMAP store and send them via SMTP directly to the spam filters. The spam filters work on the standard SMTP port (port 25). For each test run only one filter is active. All tested products are configured as gateways10 and forward the messages to the Microsoft SMTP-Service, which finally delivers the e-mail to the corresponding mailboxes.

10 Symantec Brightmail Anti-Spam works directly in conjunction with the Microsoft SMTP Service and is not a separate gateway.

4.3.2. Linux Test Process

The process for testing the filters running under Linux is slightly different. The messages are fetched through Fetchmail, a remote mail forwarding utility, which delivers them to a Postfix mail server. Between the SMTP server and the spam filters we plugged in Procmail, which allows us to have multiple filters active during a test run. Procmail pipes the messages, based on their recipient address, to the different spam filters; every filter is assigned a designated mailbox. After analysis the messages are sent back to Procmail, delivered to the destination mailboxes, and made accessible via IMAP for evaluation (see Figure 13).

Figure 13: Windows test process vs. Linux test process

5. Experimental Results

The following chapter summarizes our experiments with various anti-spam tools, including open source tools as well as a small selection of commercial products. At this point we need to emphasize again that the goal of our experiments was not to thoroughly evaluate or compare various commercial products. Instead, we focused on analyzing individual methods. As a consequence, our experimental results cannot be used as the basis of an evaluation or comparison ("ranking") of commercial products. Chapter 5.1 outlines the results achieved with our own test sample, Chapter 5.2 describes the results for the SpamAssassin sample.

In many cases there is a vast number of configuration options. With our limited resources it was not possible to determine the optimal configuration in terms of performance for each tool. Consequently, we normally used the standard (default) setup. If simple choices had to be made, we tried to minimize the number of false positives while keeping the spam detection rate as high as possible.

5.1. Our Test Sample

Both samples – ham and spam – consist of 1,500 messages; for further details see Chapter 4. The tools are grouped according to their availability status – commercial or open source. For products without a training ability, we took all 1,500 messages for testing. For all tested software with a training ability, we divided the test samples into three randomly chosen parts of equal size (500 messages each). To get results that are less dependent on a particular training set, we took one part for training and the other two parts for testing. We repeated this test three times with alternating training sets and averaged the results over the three runs (hence the rounded numbers in the tables). This process is called cross validation – in our case three-fold cross validation; see [78] for details.
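A minimal sketch of this rotation, assuming a callback train_and_test(train, test) that runs one experiment and returns a detection rate (the function names are ours, for illustration):

    import random

    def three_fold(messages, train_and_test):
        """Train on one third, test on the other two thirds, rotate, average."""
        random.shuffle(messages)
        k = len(messages) // 3
        folds = [messages[:k], messages[k:2 * k], messages[2 * k:3 * k]]
        results = []
        for i in range(3):
            train = folds[i]
            test = folds[(i + 1) % 3] + folds[(i + 2) % 3]
            results.append(train_and_test(train, test))
        return sum(results) / 3.0  # average over the three runs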
5.1.1. Commercial Products

The following sections describe the results for the tested commercial products in detail. An overview is given in Figure 14.

Figure 14: Results for tested commercial products – our test sample. The experiments summarized in this figure must not be used as the basis for a ranking of different (commercial or non commercial) anti-spam tools, because they were not designed for this purpose. Although the test data used is identical, due to limited resources no intense efforts could be made to fine-tune any of the tools (cf. page 2).

5.1.1.1. Product 1

Tested Version: Version 4.7

Product 1                       ham sample          spam sample
total mail sent                 1,500               1,500
total mail received             1,500               1,500
classified as ham               1,499 (99.93%)      200 (13.33%)
classified as suspected spam    not available       not available
classified as spam              1 (0.07%)           1,300 (86.67%)
date                            25.08.2004          25.08.2004

Table 12: Product 1 – results for our test sample

Product 1 (alternative          ham sample          spam sample
configuration)
total mail sent                 1,500               1,500
total mail received             1,500               1,500
classified as ham               1,416 (94.4%)       180 (12%)
classified as suspected spam    not available       not available
classified as spam              84 (5.6%)           1,320 (88%)
date                            25.08.2004          25.08.2004

Table 13: Product 1 (alternative configuration) – results for our test sample

5.1.1.2. Product 2

Tested Version: Version 6.0

Product 2                       ham sample          spam sample
total mail sent                 1,500               1,500
total mail received             1,500               1,500
classified as ham               1,498 (99.87%)      123 (8.2%)
classified as suspected spam    2 (0.13%)           2 (0.13%)
classified as spam              0 (0%)              1,375 (91.67%)
date                            25.08.2004          25.08.2004

Table 14: Product 2 – results for our test sample

5.1.1.3. Product 3

Tested Version: Version 4.0.0.59

Product 3                       ham sample          spam sample
total mail sent                 1,500               1,500
total mail received             1,500               1,500
classified as ham               1,349 (89.93%)      651 (43.4%)
classified as suspected spam    124 (8.27%)         271 (18.07%)
classified as spam              27 (1.8%)           578 (38.53%)
date                            25.08.2004          25.08.2004

Table 15: Product 3 – results for our test sample

5.1.1.4. Product 4

Tested Version: Version 2.0

Product 4                       ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               951 (95.1%)         570 (57%)
classified as suspected spam    47 (4.7%)           77 (7.7%)
classified as spam              2 (0.2%)            353 (35.3%)
date                            04.10.2004          04.10.2004

Table 16: Product 4 – results for our test sample

5.1.1.5. Product 5

Tested Version: Version 4.0

Product 5                       ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               997 (99.7%)         49 (4.9%)
classified as suspected spam    not used            not used
classified as spam              3 (0.3%)            951 (95.1%)
date                            19.10.2004          19.10.2004

Table 17: Product 5 – results for our test sample

Product 5 (alternative          ham sample          spam sample
configuration)
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               1,000 (100%)        85 (8.5%)
classified as suspected spam    not used            not used
classified as spam              0 (0%)              915 (91.5%)
date                            19.10.2004          19.10.2004

Table 18: Product 5 (alternative configuration) – results for our test sample

5.1.1.6. Product 6

Tested Version: version of November 30, 2004

Product 6                       ham sample          spam sample
total mail sent                 1,500               1,500
total mail received             1,490               1,465
                                (10 greylisted)     (8 greylisted, 27 blocked11)
classified as ham               1,472 (98.8%)       95 (6.49%)
classified as suspected spam    18 (1.2%)           24 (1.64%)
classified as spam              0 (0%)              1,346 (91.88%)
date                            30.11.2004          30.11.2004

Table 19: Product 6 – results for our test sample

11 Error code: 533 Malformed Sender Address
5.1.2. Open Source Tools

The following sections describe the results for the tested open source tools in detail. An overview is given in Figure 15.

Figure 15: Results for tested open source tools – our test sample. The experiments summarized in this figure must not be used as the basis for a ranking of different (commercial or non commercial) anti-spam tools, because they were not designed for this purpose. Although the test data used is identical, due to limited resources no intense efforts could be made to fine-tune any of the tools (cf. page 2).

5.1.2.1. SpamAssassin

Tested Version: SpamAssassin, Version 2.64 and Version 3.0
Operating System: SuSE Linux 9.1
Category: Open source
Training Necessity: Yes

Parameter Settings 1: SpamAssassin standard (2.64), Spam threshold = 4, Bayes disabled, Network tests enabled

SpamAssassin                    ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               986 (98.6%)         300 (30.0%)
classified as suspected spam    14 (1.4%)           279 (27.9%)
classified as spam              0 (0.0%)            421 (42.1%)
date                            04.10.2004          04.10.2004

Table 20: SpamAssassin standard (2.64) – results for our test sample

Parameter Settings 2: SpamAssassin low (2.64), Spam threshold = 4, Bayes disabled, Network tests disabled

SpamAssassin                    ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               987 (98.7%)         375 (37.5%)
classified as suspected spam    13 (1.3%)           314 (31.4%)
classified as spam              0 (0.0%)            311 (31.1%)
date                            04.10.2004          04.10.2004

Table 21: SpamAssassin low (2.64) – results for our test sample

Parameter Settings 3: SpamAssassin Bayes (2.64), Spam threshold = 4, Bayes enabled, Network tests enabled

SpamAssassin                    ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               1,000 (100.0%)      19 (1.9%)
classified as suspected spam    0 (0.0%)            303 (30.3%)
classified as spam              0 (0.0%)            678 (67.8%)
date                            04.10.2004          04.10.2004

Table 22: SpamAssassin Bayes (2.64) – results for our test sample

Parameter Settings 4: SpamAssassin (3.0), Spam threshold = 4, Bayes enabled, Network tests enabled

SpamAssassin                    ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               1,000 (100.0%)      32 (3.2%)
classified as suspected spam    0 (0.0%)            80 (8%)
classified as spam              0 (0.0%)            888 (88.8%)
date                            04.10.2004          04.10.2004

Table 23: SpamAssassin (3.0) – results for our test sample

5.1.2.2. Bogofilter

Tested Version: Bogofilter, Version 0.92.2
Operating System: SuSE Linux 9.1
Category: Open source
Training Necessity: Yes
Parameter Settings: Suspected spam threshold = 0.45, Spam threshold = 0.99

Bogofilter                      ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               976 (97.6%)         3 (0.3%)
classified as suspected spam    24 (2.4%)           103 (10.3%)
classified as spam              0 (0.0%)            894 (89.4%)
date                            04.10.2004          04.10.2004

Table 24: Bogofilter – results for our test sample
5.1.2.3. CRM 114

Tested Version: CRM 114, Version 20040327-BlameStPatrik [tre-0.6.6]
Operating System: SuSE Linux 9.1
Category: Open source
Training Necessity: Yes

Parameter Settings 1: self-trained – trained with parts of our own e-mail messages

CRM 114                         ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               995 (99.5%)         21 (2.1%)
classified as suspected spam    not available       not available
classified as spam              5 (0.5%)            979 (97.9%)
date                            04.10.2004          04.10.2004

Table 25: CRM 114 self-trained – results for our test sample

Parameter Settings 2: pre-trained – pre-trained configuration files were used

CRM 114                         ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               591 (59.1%)         131 (13.1%)
classified as suspected spam    not available       not available
classified as spam              409 (40.9%)         869 (86.9%)
date                            04.10.2004          04.10.2004

Table 26: CRM 114 pre-trained – results for our test sample

5.1.3. Conclusion

A look at the results of our experiments with our own test sample (Figure 16) shows that most products achieve quite comparable detection rates. In particular, the false positive rate can be kept low with most products, while the spam recognition rate is usually around 90 percent. It is remarkable that there is no big difference between the detection rates of the open source tools and those of the commercial products. Furthermore, the products that were tested in multiple configurations show that activating additional features or using a training feature can have a significant influence on the performance. The best example of this is SpamAssassin, which was tested in four different configurations with an increasing number of features activated.

Figure 16: Results for all tested products (our test sample). The experiments summarized in this figure must not be used as the basis for a ranking of different (commercial or non commercial) anti-spam tools, because they were not designed for this purpose. Although the test data used is identical, due to limited resources no intense efforts could be made to fine-tune any of the tools (cf. page 2).

5.2. SpamAssassin Test Sample

The spam sample was divided into two parts (397 mail messages for training and 1,000 e-mail messages for testing). The ham sample was also divided into two parts (400 e-mail messages for training and 1,000 e-mail messages for testing). For the training set, we always took the first 397 (or 400) e-mail messages according to their timestamps.

5.2.1. Commercial Products

The following sections describe the results for the tested commercial products in detail. An overview is given in Figure 17.

Figure 17: Results for tested commercial products – SpamAssassin sample. The experiments summarized in this figure must not be used as the basis for a ranking of different (commercial or non commercial) anti-spam tools, because they were not designed for this purpose. Although the test data used is identical, due to limited resources no intense efforts could be made to fine-tune any of the tools (cf. page 2).
5.2.1.1. Product 1

Tested Version: Version 4.7

Product 1                       ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               990
reason for missing messages     –                   java exception12
classified as ham               999 (99.9%)         220 (22.22%)
classified as suspected spam    0 (0.0%)            0 (0.00%)
classified as spam              1 (0.1%)            770 (77.78%)
date                            05.09.2004          05.09.2004

Table 27: Product 1 – results for the SpamAssassin test sample

12 Due to invalid message format

Product 1 (alternative          ham sample          spam sample
configuration)
total mail sent                 1,000               1,000
total mail received             1,000               990
reason for missing messages     –                   java exceptions12
classified as ham               965 (96.5%)         175 (17.68%)
classified as suspected spam    0 (0.0%)            0 (0.0%)
classified as spam              35 (3.5%)           815 (82.32%)
date                            05.09.2004          05.09.2004

Table 28: Product 1 (alternative configuration) – results for the SpamAssassin test sample

5.2.1.2. Product 2

Tested Version: Version 6.0

Product 2                       ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               985
reason for missing messages     –                   java exception13
classified as ham               999 (99.9%)         416 (42.23%)
classified as suspected spam    0 (0.0%)            40 (4.06%)
classified as spam              1 (0.1%)            529 (53.71%)
date                            05.09.2004          05.09.2004

Table 29: Product 2 – results for the SpamAssassin test sample

13 Due to invalid message format

5.2.1.3. Product 3

Tested Version: Version 4.0.0.59

Product 3                       ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               989
reason for missing messages     –                   java exception14
classified as ham               998 (99.8%)         393 (39.74%)
classified as suspected spam    2 (0.2%)            118 (11.93%)
classified as spam              0 (0.0%)            478 (48.33%)
date                            05.09.2004          05.09.2004

Table 30: Product 3 – results for the SpamAssassin test sample

14 Due to invalid message format

5.2.1.4. Product 4

Tested Version: Version 2.0

Product 4                       ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               985 (98.5%)         166 (16.6%)
classified as suspected spam    15 (1.5%)           62 (6.2%)
classified as spam              0 (0.0%)            772 (77.2%)
date                            28.09.2004          28.09.2004

Table 31: Product 4 – results for the SpamAssassin test sample

5.2.1.5. Product 5

Tested Version: Version 4.0

Product 5                       ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               987 (98.7%)         140 (14.0%)
classified as suspected spam    not used            not used
classified as spam              13 (1.3%)           860 (86.0%)
date                            25.10.2004          25.10.2004

Table 32: Product 5 – results for the SpamAssassin test sample

Product 5 (alternative          ham sample          spam sample
configuration)
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               984 (98.4%)         30 (3.0%)
classified as suspected spam    not used            not used
classified as spam              16 (1.6%)           970 (97.0%)
date                            25.10.2004          25.10.2004

Table 33: Product 5 (alternative configuration) – results for the SpamAssassin test sample

5.2.1.6. Product 6

Tested Version: version of November 30, 2004

Product 6                       ham sample          spam sample
total mail sent                 1,000               991 (java exception15)
total mail received             1,000               989 (2 greylisted)
classified as ham               968 (96.8%)         31 (3.14%)
classified as suspected spam    30 (3.0%)           111 (11.22%)
classified as spam              2 (0.2%)            847 (85.64%)
date                            15.12.2004          15.12.2004

Table 34: Product 6 – results for the SpamAssassin test sample

15 Due to invalid message format

5.2.2. Open Source Tools

The following sections describe the results for the tested open source tools in detail. An overview is given in Figure 18.
Figure 18: Results for tested open source tools – SpamAssassin sample. The experiments summarized in this figure must not be used as the basis for a ranking of different (commercial or non commercial) anti-spam tools, because they were not designed for this purpose. Although the test data used is identical, due to limited resources no intense efforts could be made to fine-tune any of the tools (cf. page 2).

5.2.2.1. SpamAssassin

Tested Version: SpamAssassin, Version 2.64 and Version 3.0
Operating System: SuSE Linux 9.1
Category: Open source
Training Necessity: Yes

Parameter Settings 1: SpamAssassin standard (2.64), Spam threshold = 4, Bayes disabled, Network tests enabled

SpamAssassin                    ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               995 (99.5%)         113 (11.3%)
classified as suspected spam    5 (0.5%)            217 (21.7%)
classified as spam              0 (0.0%)            670 (67.0%)
date                            28.09.2004          28.09.2004

Table 35: SpamAssassin standard – results for the SpamAssassin test sample

Parameter Settings 2: SpamAssassin low (2.64), Spam threshold = 4, Bayes disabled, Network tests disabled

SpamAssassin                    ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               996 (99.6%)         134 (13.4%)
classified as suspected spam    4 (0.4%)            235 (23.5%)
classified as spam              0 (0.0%)            631 (63.1%)
date                            28.09.2004          28.09.2004

Table 36: SpamAssassin low – results for the SpamAssassin test sample

Parameter Settings 3: SpamAssassin Bayes (2.64), Spam threshold = 4, Bayes enabled, Network tests enabled

SpamAssassin                    ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               997 (99.7%)         54 (5.4%)
classified as suspected spam    3 (0.3%)            148 (14.8%)
classified as spam              0 (0.0%)            798 (79.8%)
date                            28.09.2004          28.09.2004

Table 37: SpamAssassin Bayes – results for the SpamAssassin test sample

Parameter Settings 4: SpamAssassin (3.0), Spam threshold = 4, Bayes enabled, Network tests enabled

SpamAssassin                    ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               1,000 (100.0%)      109 (10.9%)
classified as suspected spam    0 (0.0%)            317 (31.7%)
classified as spam              0 (0.0%)            574 (57.4%)
date                            28.09.2004          28.09.2004

Table 38: SpamAssassin 3.0 – results for the SpamAssassin test sample

5.2.2.2. Bogofilter

Tested Version: Bogofilter, Version 0.92.2
Operating System: SuSE Linux 9.1
Category: Open source
Training Necessity: Yes
Parameter Settings: Suspected spam threshold = 0.45, Spam threshold = 0.99

Bogofilter                      ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               964 (96.4%)         26 (2.6%)
classified as suspected spam    36 (3.6%)           464 (46.4%)
classified as spam              0 (0.0%)            510 (51.0%)
date                            28.09.2004          28.09.2004

Table 39: Bogofilter – results for the SpamAssassin test sample
5.2.2.3. CRM 114

Tested Version: CRM 114, Version 20040327-BlameStPatrik [tre-0.6.6]
Operating System: SuSE Linux 9.1
Category: Open source
Training Necessity: Yes

Parameter Settings 1: self-trained – trained with parts of our own e-mail messages

CRM 114                         ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               995 (99.5%)         288 (28.8%)
classified as suspected spam    0 (0.0%)            0 (0.0%)
classified as spam              5 (0.5%)            712 (71.2%)
date                            28.09.2004          28.09.2004

Table 40: CRM 114 self-trained – results for the SpamAssassin test sample

Parameter Settings 2: pre-trained – pre-trained configuration files were used

CRM 114                         ham sample          spam sample
total mail sent                 1,000               1,000
total mail received             1,000               1,000
classified as ham               973 (97.3%)         208 (20.8%)
classified as suspected spam    0 (0.0%)            0 (0.0%)
classified as spam              27 (2.7%)           792 (79.2%)
date                            28.09.2004          28.09.2004

Table 41: CRM 114 pre-trained – results for the SpamAssassin test sample

5.2.3. Conclusion

A comparison of the results for the SpamAssassin test sample (Figure 19) shows a bigger difference between the products than for our own test sample. One reason might be that the messages in this sample are rather old and therefore may no longer be included in modern signature databases. Another explanation could be that the general properties of spam messages have changed, so that modern filters cannot recognize old spam. We see that the best implementations can achieve a spam recognition rate of 80 percent or more with virtually no false positives.

Figure 19: Results for all tested products – SpamAssassin sample. The experiments summarized in this figure must not be used as the basis for a ranking of different (commercial or non commercial) anti-spam tools, because they were not designed for this purpose. Although the test data used is identical, due to limited resources no intense efforts could be made to fine-tune any of the tools (cf. page 2).

6. Conclusion

In summary, we can state the following observations with respect to anti-spam methods and the available tools.

6.1. Methods

Most tools and products currently available focus on what we call "post-send" methods in the categorization of anti-spam methods presented in this report (Figure 5). Their focus is on detecting spam and filtering it out. Although they are able to achieve acceptable detection and false positive rates, as our experiments show, many of those methods (especially classical filters) have serious drawbacks: They often require a lot of effort for reacting to changes in the types of spam sent (new rules, training, etc.), and their performance tends to decrease relatively fast if they are not "maintained" well; they are often most useful on an individualized, personal basis, which is an undesirable property from the point of view of an ISP; they are usually unable to reduce the waste of resources caused by spam (network bandwidth, storage capacity, etc.); and they are often "one step behind" the spammers' tricks.

Nevertheless, there are some interesting newer approaches in this category, which we grouped under "classification methods". They are motivated by more general techniques from the areas of text classification and data mining, and they are sometimes algorithmically quite evolved. In our opinion they have the potential to overcome some of these drawbacks. However, so far they are mostly discussed at an academic level and are often not mature enough for practical use.
It will be a major focus of the next project phase to investigate approaches of this type in greater detail.

The situation is quite different with "pre-send" methods. In theory, they have great potential for progress on the spam problem, mostly because they target the source of the problem (the commercial motivation) rather than (only) fighting the symptoms – the idea here is to prevent spam rather than to detect it and filter it out. Unfortunately, these approaches also have important disadvantages: they tend to require a large administrative overhead and, more importantly, their success depends strongly on a worldwide agreement to deploy them – this holds for proposals to increase the cost of sending e-mail as well as for legal regulations prohibiting the sending of UBE and UCE. It is of course unrealistic that e-mail providers worldwide will commit to common policies within the next few years. Since national or regional boundaries do not exist on the Internet, we conclude that pre-send approaches will not "solve" the problem in the current situation, either.

Beyond these two big categories of methods there are also some more "radical" approaches, such as new protocols for e-mail transfer (replacing SMTP) or the view that we should shift to a paradigm of filtering ham in instead of filtering spam out. Although each of these ideas certainly has some merit, their widespread adoption in practice cannot be expected in the near future, and certainly not for an ISP or in any other commercial context.

Our careful analysis of the situation leads us to the conclusion that there is some potential for significant improvements of existing methods. Moreover, in order to achieve the best results, a multi-layered approach with several "defense lines" seems to be required. Details will be investigated in the next phase of our project.

6.2. Experiments

As indicated in the beginning, our goal was to evaluate anti-spam methods, not to compare products or tools. It would have been beyond the scope and resources of this project to tune the tools we experimented with in order to achieve the best possible performance for each of them. In most cases we used more or less a standard configuration, and where simple choices had to be made, we tried to maximize the rate of true positives at the lowest possible rate of false positives. Although the experimental performance achieved has to be interpreted as an approximation for this reason, no major improvements or new insights are to be expected from tuning parameter settings.

In summary, the experiments show three results (the sketch below illustrates how the underlying rates are computed):

• For almost zero false positives, some of the tools detect significantly more than 90% of the spam messages.
• This rate of true positives varies quite strongly across the various tools (in some cases it goes down to only 30%).
• In general, there was no significant difference between the detection rates of commercial products and the performance of open source tools. However, this does not take into account other important features, such as user friendliness, support, etc.

These experimental results again indicate that there is substantial room for improvement, which we will investigate actively in the next phase of this project.
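As a minimal illustration of how the rates quoted above are obtained, the following fragment computes the false positive rate (ham misclassified as spam) and the detection rate (spam correctly classified as spam). The numbers are taken from Table 37; the function name and structure are our own, not part of any tested tool.

    # Minimal sketch of the quality metrics underlying the experiments:
    # false positive rate on the ham sample, detection rate on the spam sample.

    def rates(ham_total, ham_classified_spam, spam_total, spam_classified_spam):
        """Return (false positive rate, detection rate) in percent."""
        fp_rate = 100.0 * ham_classified_spam / ham_total
        detection_rate = 100.0 * spam_classified_spam / spam_total
        return fp_rate, detection_rate

    # Values from Table 37 (SpamAssassin Bayes, SpamAssassin test sample):
    # 0 of 1,000 ham messages and 798 of 1,000 spam messages classified as spam.
    fp, tp = rates(1000, 0, 1000, 798)
    print("false positives: %.1f%%, detection rate: %.1f%%" % (fp, tp))  # 0.0%, 79.8%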
7. List of Figures

Figure 1: Percentage of e-mail identified as spam, June 2004 [10] (no newer data available)
Figure 2: Interaction of legislative measures, law enforcement and percentage of spam [11]
Figure 3: The top ten sources of spam (domains) [14]
Figure 4: Spam categorized in terms of content (data from [10])
Figure 5: Categorization of anti-spam methods
Figure 6: Typical scenario for a blacklist
Figure 7: Sender registration ChoiceMail
Figure 8: Challenge of SFM
Figure 9: Example of a forged header
Figure 10: URL analysis based on [69]
Figure 11: Typical processing path of Symantec Brightmail Anti-Spam
Figure 12: Typical processing path of Symantec Mail Security for SMTP
Figure 13: Windows test process vs. Linux test process
Figure 14: Results for tested commercial products – our test sample
Figure 15: Results for tested open source tools – our test sample
Figure 16: Results for all tested products (our test sample)
Figure 17: Results for tested commercial products – SpamAssassin sample
Figure 18: Results for tested open source tools – SpamAssassin sample
Figure 19: Results for all tested products – SpamAssassin sample
8. List of Tables

Table 1: Products/tools considered and the methods they use; for further remarks see page 2
Table 2: The top twelve sources of spam, geographically [13]
Table 3: Cost-profit equation of a spammer (simplified, monthly basis)
Table 4: Typical SMTP dialogue
Table 5: The most important header fields in the Internet Message Format [33]
Table 6: Adaptation of spammers' techniques to the development of filtering techniques
Table 7: Quality metrics of binary classifiers for the spam problem
Table 8: A typical X-Hashcash header
Table 9: DCC checksums
Table 10: Example of a DCC record
Table 11: Hardware and software configuration
Table 12: Product 1 – results our test sample
Table 13: Product 1 (alternative configuration) – results our test sample
Table 14: Product 2 – results our test sample
Table 15: Product 3 – results our test sample
Table 16: Product 4 – results our test sample
Table 17: Product 5 – results our test sample
Table 18: Product 5 (alternative configuration) – results our test sample
Table 19: Product 6 – results our test sample
Table 20: SpamAssassin standard (2.64) – results our test sample
Table 21: SpamAssassin low (2.64) – results our test sample
Table 22: SpamAssassin Bayes (2.64) – results our test sample
Table 23: SpamAssassin (3.0) – results our test sample
Table 24: Bogofilter – results our test sample
Table 25: CRM 114 self-trained – results our test sample
Table 26: CRM 114 pre-trained – results our test sample
Table 27: Product 1 – results SpamAssassin test sample
Table 28: Product 1 (alternative configuration) – results SpamAssassin test sample
Table 29: Product 2 – results SpamAssassin test sample
Table 30: Product 3 – results SpamAssassin test sample
Table 31: Product 4 – results SpamAssassin test sample
Table 32: Product 5 – results SpamAssassin test sample
Table 33: Product 5 (alternative configuration) – results SpamAssassin test sample
Table 34: Product 6 – results SpamAssassin test sample
Table 35: SpamAssassin standard – results SpamAssassin test sample
Table 36: SpamAssassin low – results SpamAssassin test sample
Table 37: SpamAssassin Bayes – results SpamAssassin test sample
Table 38: SpamAssassin 3.0 – results SpamAssassin test sample
Table 39: Bogofilter – results SpamAssassin test sample
Table 40: CRM 114 self-trained – results SpamAssassin test sample
Table 41: CRM 114 pre-trained – results SpamAssassin test sample

9. Index

A
Address Harvesting Tools
AMTP

B
Bayes Filter
blacklist
Bogofilter
Borderware MXtreme Mail Firewall

C
Caller-Id
CAN-SPAM
Challenge-Response
ChoiceMail
CRM 114

D
DCC
Digital Signature
DomainKeys

E
Excessive Cross-Posting
Excessive Multi-Posting

G
Greylist

H
Hashcash

I
Ikarus mySpamWall
IM 2000
Internet Message Format

K
Kaspersky Anti-Spam
Keyword Based
K-Nearest

L
Lightweight Currency Protocol

N
Neural Networks

P
Pattern Matching
Pyzor

R
Rule Based

S
Sender-Id
Sender Policy Framework
SFM
Simple Mail Transfer Protocol
SpamAssassin
Spamkiss
Spam Tools
SurfControl E-Mail Filter for SMTP
Support Vector Machines
Symantec Brightmail Anti-Spam
Symantec Mail Security for SMTP

U
Unsolicited Bulk E-mail
Unsolicited Commercial E-Mail
URL Analysis

V
Vipul's Razor

W
whitelist

10. Bibliography

[1] Hormel Food Corporation. http://www.hormel.com
[2] A. Amor, J. Martin: "Civic Networking: The Next Generation", 1998. http://www.more.net
[3] REDNET, Networking and Internet, British ISP since 1992. http://www.red.net/support/resourcecentre/mail/email-aup.php
[4] P. Hoffman: "Unsolicited Bulk E-mail: Definitions and Problems", October 5, 1997. http://www.imc.org/ube-def.html
[5] David Madigan: "Statistics and the War on Spam (A Guide to the Unknown)", 2004. http://www.stat.rutgers.edu/~madigan/PAPERS/sagtu.pdf
[6] Spamlinks, a collection of useful and up-to-date anti-spam links. http://spamlinks.net/stats.htm
[7] Mitteilung der Kommission an das europäische Parlament über unerbetene Werbenachrichten (Communication from the Commission to the European Parliament on unsolicited commercial communications), January 22, 2004. http://europa.eu.int/information_society/topics/ecomm/doc/useful_information/library/communic_reports/spam/spam_com_2004_28_de.pdf
[8] News on ORF, August 24, 2004. http://futurezone.orf.at/futurezone.orf?read=detail&id=245906
[9] Der Standard online report, December 30, 2004. http://derstandard.at/?url=/?id=1857456
[10] Anti-spam software vendor Brightmail, spam percentage June 2004. http://www.brightmail.com
[11] Anti-spam software vendor MessageLabs, spam statistics November 2004. http://www.messagelabs.com/emailthreats/default.asp#
[12] The Spamhaus Project, list of the 200 biggest spammers (ROKSO).
http://www.spamhaus.org/rokso/
[13] Anti-spam and virus software vendor Sophos, "Dirty Dozen", the 12 most spamming countries. http://www.sophos.com
[14] Anti-spam software vendor Commtouch. http://www.commtouch.com
[15] Anti-spam and virus software vendor Postini. http://www.postini.com
[16] Center for Democracy and Technology: "Why am I getting all this spam? Unsolicited commercial e-mail six month report", March 2003. http://www.cdt.org/speech/spam/030319spamreport.shtml
[17] Mailutilities, Advanced E-Mail Extractor. http://www.mailutilities.com/aee/
[18] MTI Software, Atomic Harvester. http://www.desktopserver.com/atomic.htm
[19] E-Mail Marketing Software, mail utilities for Internet business and e-commerce. http://www.massmailsoftware.com/extractweb/purchase-email-addresses.htm
[20] Arbeiterkammer Österreich: "Die 8 häufigsten Arten von Internetbetrug" (The eight most common types of Internet fraud), 2004. http://www.arbeiterkammer.at/www-192-IP-2839-AD-2799.html
[21] Rejo Zenger: "Confession for two: a spammer spills it all". http://rejo.zenger.nl/abuse/1085493870.php
[22] Send-safe real anonymous mailer. http://www.send-safe.com/index.php
[23] Ronald van der Wal. http://www.spamvrij.nl/lijsten/bedrijf.php?idbedrijf=466
[24] Worldsoftwarehouse, bullet-proof hosting service. www.worldsoftwarehouse.com
[25] Professional link counter tool, November 15, 2004. http://www.linkcounter.be
[26] Verdict against US spammer Jeremy Jaynes (German news item), November 15, 2004. http://www.silicon.de/cpo/news-antivirus/detail.php?nr=17568
[27] Ferris Research. http://www.ferris.com/
[28] Heise News: "Spam belastet Europas Unternehmen" (Spam burdens Europe's businesses), January 4, 2003. http://www.heise.de/newsticker/meldung/33417
[29] M. Gibbs: "Spam Cost Model", NetworkWorldFusion. http://www.gibbs.com/msg/
[30] Ikarus Software (spam wall and anti-virus). http://www.mymailwall.at/spamcal.html
[31] Jürgen Strauß: "Analyse betriebswirtschaftlich orientierter Lösungsansätze für die Spamproblematik" (Analysis of business-oriented approaches to the spam problem), Diplomarbeit, Institut für Verteilte und Multimediale Systeme, Fakultät für Informatik, Universität Wien, 2005 (in preparation).
[32] J. Klensin: "RFC 2821: Simple Mail Transfer Protocol", April 2001. ftp://ftp.rfc-editor.org/in-notes/rfc2821.txt
[33] P. Resnick: "RFC 2822: Internet Message Format", April 2001. ftp://ftp.rfc-editor.org/in-notes/rfc2822.txt
[34] J. Postel: "RFC 821: Simple Mail Transfer Protocol", August 1982. ftp://ftp.rfc-editor.org/in-notes/rfc821.txt
[35] J. Postel: "RFC 791: Internet Protocol", September 1981. ftp://ftp.rfc-editor.org/in-notes/rfc791.txt
[36] Peter Lechner: "Das Simple Mail Transfer Protokoll und die Spamproblematik" (The Simple Mail Transfer Protocol and the spam problem), Diplomarbeit, Institut für Verteilte und Multimediale Systeme, Fakultät für Informatik, Universität Wien, 2005 (in preparation).
[37] G. Hulten et al.: "Trends in Spam Products and Methods", Microsoft Research, 2004. www.ceas.cc/papers-2004/165.pdf
[38] A. Birrell et al.: "The Penny Black Project". http://research.microsoft.com/research/sv/PennyBlack/
[39] C. Dwork, A. Goldberg, M. Naor: "On Memory-Bound Functions for Fighting Spam", Proceedings of the 23rd Annual International Cryptology Conference (CRYPTO 2003), August 2003.
[40] M. Abadi, M. Burrows, M. Manasse, T. Wobber: "Moderately Hard, Memory-bound Functions", Proceedings of the 10th Annual Network and Distributed System Security Symposium, February 2003.
[41] Hashcash, a denial-of-service countermeasure tool. http://www.hashcash.org/
[42] D. Turner, D. Havey: "Controlling Spam through Lightweight Currency", November 4, 2003. http://ftp.csci.csusb.edu/turner/papers/turner_spam.pdf
[43] D. Sorkin: Overview of the most important anti-spam laws, December 2003. www.spamlaws.com
[44] Andreas Sabadello: "Schutz vor unerwünschten E-Mails" (Protection against unwanted e-mails), Diplomandenseminararbeit, Seminar aus Kriminologie, Universität Wien, Sommersemester 2004.
[45] Federal law in the USA, CAN-SPAM Act of 2003, January 1, 2004. http://www.spamlaws.com/federal/108s877.html
[46] Directive 2000/31/EC of the European Parliament and of the Council of 8 June 2000 on certain legal aspects of information society services, in particular electronic commerce, in the Internal Market (Directive on electronic commerce). http://www.spamlaws.com/docs/2000-31-ec.pdf
[47] Directive 2002/58/EC of the European Parliament and of the Council of 12 July 2002 concerning the processing of personal data and the protection of privacy in the electronic communications sector (Directive on privacy and electronic communications). http://www.spamlaws.com/docs/2002-58-ec.pdf
[48] Bill on telecommunication, TKG 2003. http://www.rtr.at/web.nsf/deutsch/Telekommunikation_Telekommunikationsrecht_TKG+2003
[49] Bill on e-commerce (ECG), BGBl I Nr. 152/2001. http://www.Internet4jurists.at/gesetze/bg_e-commerce01.htm
[50] Spamhaus, The Spamhaus Project. http://www.spamhaus.org
[51] Habeas, Sender Warranted E-Mail, 2004. http://www.habeas.com
[52] Bonded Sender Program, IronPort. http://www.bondedsender.com
[53] John Ioannidis: "Fighting Spam by Encapsulating Policy in E-Mail Addresses", Proceedings of the Network and Distributed Systems Security Conference (NDSS), 2003.
[54] T. Tompkins, D. Handley: "Giving e-mail back to the users: Using digital signatures to solve the spam problem", First Monday, 8(9), September 2003. http://firstmonday.org/issues/issue8_9/tompkins/index.html
[55] Sender Policy Framework. www.spf.pobox.com
[56] G. Fecyk: "Designated Mailers Protocol", December 2003. http://www.pan-am.ca/dmp/draft-fecyk-dmp-01.txt
[57] Hadmut Danisch: "Reverse MX". http://www.danisch.de/work/security/antispam.html
[58] Microsoft: "Caller ID for E-Mail Technical Specification: The Next Step to Deterring Spam", February 12, 2004. http://www.microsoft.com/downloads/details.aspx?FamilyID=9a9e8a28-3e85-4d07-9d0f6daeabd3b71b&displaylang=en
[59] J. Lyon: "Purported Responsible Address in E-Mail Messages Specification", October 2004. http://www.microsoft.com/downloads/details.aspx?familyid=f8e9cb40-cc7c-46d6-8cd13a86a46546d5&displaylang=en
[60] IC Group Inc.: "Sender Rewriting Scheme". http://spf.pobox.com/srs.html
[61] Microsoft Corporation: "Sender ID". http://www.microsoft.com/senderid
[62] Yahoo! Inc.: "DomainKeys". http://antispam.yahoo.com/domainkeys
[63] T. Loder, M. Van Alstyne, R. Wash: "An economic answer to unsolicited communication", ACM, 2004.
[64] E. Harris: "The Next Step in the Spam Control War: Greylisting", August 28, 2003. http://projects.puremagic.com/greylisting/whitepaper.html
[65] DigiPortal Software Inc.: "ChoiceMail, A Spam Blocker – Not Just a Spam Filter". http://www.digiportal.com
[66] P. Gburzynski: "Spam-Free E-Mail Service". http://sfm.cs.ualberta.ca/
[67] MXLogic, spam classification techniques. http://www.mxlogic.com
[68] Jeffrey E.F. Friedl: "Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools", O'Reilly, January 1997. ISBN 1-56592-257-3.
[69] Oleg Kolesnikov, Wenke Lee, Richard Lipton: "Filtering Spam Using Search Engines", 2003. http://www.cc.gatech.edu/~ok/
[70] Karl A. Krueger: "The Spam Battle 2002: A Tactical Update", SANS GSEC Practical, v1.4, September 2002.
http://www.sans.org/rr/whitepapers/email/589.php and http://www.rhyolite.com/anti-spam/dcc/
[71] Vipul's Razor: "A distributed, collaborative, spam detection and filtering network", December 3, 2004. http://razor.sourceforge.net/
[72] Pyzor. http://pyzor.sourceforge.net/
[73] G. Salton, C. Buckley: "Term Weighting Approaches in Automatic Text Retrieval", Information Processing and Management, 24(5):513-523, 1988.
[74] W. Yerazunis: "The Spam Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It", MIT Spam Conference, 2004.
[75] J. Zdziarski: "Controlling Filter Complexity: Statistical-Algorithmic Hybrid Classification".
[76] Corinna Cortes, Vladimir Vapnik: "Support-vector networks", Machine Learning, 20(3):273-297, November 1995.
[77] Vladimir Vapnik: "The Nature of Statistical Learning Theory", Springer-Verlag, Heidelberg, Germany, 1995.
[78] Christopher M. Bishop: "Neural Networks for Pattern Recognition", Oxford University Press, 1995.
[79] G. Wittel, S.F. Wu: "On Attacking Statistical Spam Filters", CEAS 2004. http://www.ceas.cc/papers-2004/slides/170.pdf
[80] K. Eide: "Winning the war on spam: Comparison of Bayesian spam filters", August 2003. http://home.dataparty.no/kristian/reviews/bayesian/
[81] A. Kolcz, J. Alspector: "SVM-based filtering of e-mail spam with content-specific misclassification costs", Proceedings of the TextDM'01 Workshop on Text Mining, held at the 2001 IEEE International Conference on Data Mining, 2001.
[82] H. Drucker, Donghui Wu, V.N. Vapnik: "Support vector machines for spam categorization", IEEE Transactions on Neural Networks, 10(5):1048-1054, September 1999.
[83] Ana Cardoso-Cachopo, L. Oliveira: "An Empirical Comparison of Text Categorization Methods", SPIRE 2003, LNCS 2857, pp. 183-196, 2003.
[84] D.J. Bernstein: "Internet Mail 2000". http://cr.yp.to/im2000.html
[85] Jonathan de Boyne Pollard: "Fleshing out IM2000". http://homepages.tesco.net./%7EJ.deBoynePollard/Proposals/IM2000/
[86] Shane Hird: "Technical Solutions for Controlling Spam". http://security.dstc.edu.au/papers/technical_spam.pdf
[87] W. Weinman: "Authenticated Mail Transfer Protocol". http://amtp.bw.org/docs/draft-weinman-amtp-03.txt and http://amtp.bw.org/
[88] NetworkWorldFusion, buyers guide. http://www.nwfusion.com/bg/2003/spam/index.jsp
[89] Spamotomy, a comparison tool. http://www.spamotomy.com
[90] Installation Guide and Administration Guide for Brightmail Anti-Spam, Version 6.0 (Document Version 1.0).
[91] Kaspersky Anti-Spam 2.0 Enterprise Edition Manual. http://www.kaspersky.com/de/downloads?chapter=146440562&downlink=149404921
[92] SurfControl E-Mail Filter for SMTP: Administrator's Guide (Version 4.7, created September 2003).
[93] Definitions for "active content" found in Google. http://www.google.at/search=define:active+content
[94] Symantec Mail Security for SMTP: Administration Guide (Documentation Version 4.0).
[95] Ikarus Software, Managed Security Services, My Mail Wall. http://www.mymailwall.at/
[96] Spamkiss, anti-spam software. http://www.spamkiss.com/
[97] SpamAssassin homepage. http://spamassassin.apache.org, http://wiki.apache.org/spamassassin/
[98] Tests performed by SpamAssassin. http://spamassassin.apache.org/tests_3_0_x.html
[99] SpamAssassin Bayes Frequently Asked Questions. http://wiki.apache.org/spamassassin/BayesFaq
[100] CRM 114, Bayesian classifier. http://crm114.sourceforge.net/
[101] Paolo Frasconi, Giovanni Soda, and Alessandro Vullo: "Hidden Markov Models for Text Categorization in Multi-Page Documents", J. Intell. Inf. Syst.,
18(2-3):195-217, 2002.
[102] Bogofilter, Bayesian classifier. http://www.bogofilter.org
[103] Ling-Spam and PU1 spam corpora. http://www.iit.demokritos.gr/skel/i-config/downloads/
[104] SpamAssassin test samples. http://www.spamassassin.org/publiccorpus