Proofpoint MLX: Machine Learning to Fight Spam

Proofpoint MLX Whitepaper
Machine learning to
beat spam today...
and tomorrow
Proofpoint, Inc.
892 Ross Drive
Sunnyvale, CA 94089
P 408 517 4710
F 408 517 4711
www.proofpoint.com
Mounting an effective defense against spam
requires detection techniques that evolve as
quickly as the attacks themselves. Without the
ability to adapt automatically to new types of
threats, anti-spam defenses will always remain a
step behind the spammers.
Proofpoint MLX™ technology uses advanced
machine learning techniques to provide
comprehensive spam detection that guards
against the spam threats of today, as well
as tomorrow. Proofpoint MLX continuously
analyzes millions of messages and automatically
adjusts its detection algorithms to identify even
the newest, most cunning types of attacks.
Proofpoint MLX provides accurate, adaptive,
and continuous protection against spam without
requiring manual tuning or administrator
intervention.
Contents
Executive Summary
Why Does MLX Matter?
The Need for Machine Learning
Using Machine Learning to Beat Spam
Machine Learning in Action: Proofpoint MLX
Proofpoint MLX Spam Detection Process
Recent Spam Trends and Emerging Threats
Understanding the “Perception Crisis” in Spam Effectiveness
Winning the Battle Against Image- and Attachment-based Spam
Proofpoint Attack Response Center
Conclusion
Additional Resources
About Proofpoint, Inc.
Proofpoint MLX: Machine Learning to Beat Spam Today and Tomorrow
Executive Summary
As spammers employ increasingly sophisticated techniques to avoid detection, simplistic anti-spam solutions are leaving enterprises vulnerable to lost productivity, lost communications, malware attacks, data
theft, and financial loss. Clearly, a new approach is needed to defend corporate messaging infrastructures
and to reclaim email’s value as a business communications medium.
Mounting an effective defense against spam requires detection techniques that can evolve as quickly as
the attacks themselves. Without the ability to automatically adapt to detect new types of threats, an anti-spam solution will always be a step behind the spammers.
Proofpoint MLX™ technology leverages patent-pending machine learning techniques to provide a revolutionary spam detection system. The Proofpoint solution employs a full range of classification methods,
from legacy approaches such as heuristics and Bayesian analysis to state-of-the-art machine learning
algorithms (such as those used in genomic sequence analysis) and proprietary image analysis methods.
Analyzing millions of messages each day, Proofpoint MLX automatically adjusts its detection algorithms to
identify even the newest spam attacks without manual tuning or administrator intervention. As a result,
Proofpoint MLX is able to provide continuous spam detection and content filtering with a very high degree
of accuracy–typically on the order of 99.8% or higher.
Unlike other anti-spam solutions, the Proofpoint platform delivers anti-spam defenses that don’t degrade
over time. By adapting continuously to the changing nature of spam attacks, Proofpoint MLX ensures that
enterprises always benefit from the latest anti-spam defenses, even if those defenses are only a few hours old.
Proofpoint MLX technology protects corporate infrastructure against the spam threats of today, as well
as tomorrow.
Proofpoint MLX technology is available in Proofpoint’s SaaS email security solutions including Proofpoint
ENTERPRISE™ and Proofpoint SHIELD™ as well as in Proofpoint’s on-premises solutions including the
Proofpoint Messaging Security Gateway™ email security appliance.
Why Does MLX Matter?
Proofpoint’s MLX-based solutions provide the most effective spam detection available today:
o Accurate: Proofpoint’s machine learning technology, based on techniques such as logistic regression, provides the foundation for a powerful, adaptive anti-spam solution capable of analyzing all types of message features, examining more than one million different attributes to accurately differentiate between spam and valid messages.
o Decisive: Traditional anti-spam solutions evaluate a limited number of attributes and are unable
to decisively classify spam, which leads to a low rate of effectiveness and a high rate of false positives.
MLX ensures that Proofpoint’s solutions will remain effective against the tactics spammers try to employ tomorrow:
o Predictive: Continuously-evolving spamming techniques can only be countered by a predictive solution capable of learning and self-adjusting. Traditional reactive approaches just can’t keep pace.
o Adaptive: Proofpoint’s MLX-based solutions automatically adapt to counter new threats. As more data from both valid email and spam is added to the machine learning model, the system identifies and weights relevant attributes to automatically tune the classification process. The result is a system that is just as effective at identifying tomorrow’s spam as it is at identifying spam today.
New in This Version: Updated Content
This updated version of Proofpoint MLX: Machine Learning to Beat Spam Today and Tomorrow, published March 2010, describes how MLX technology is being used to fight the latest forms of spam, including blended threats, phishing attacks, and spam related to social media sites.
Proofpoint is the only vendor that has successfully combined machine learning techniques with traditional
approaches to achieve near-perfect spam detection. Ongoing efforts by Proofpoint’s Attack Response
Center scientists secure Proofpoint’s position as a technology pioneer and industry leader in the fight
against spam.
This whitepaper explains the key concepts, technologies and benefits associated with Proofpoint MLX
technology.
Evolution of Techniques for Fighting Spam
Summary
Spamming techniques and anti-spam techniques have evolved on parallel paths. First-generation solutions were effective against “static” spam attacks, but spammers quickly developed randomization, obfuscation and new delivery strategies to bypass basic spam filters. Second-generation solutions incorporated simple heuristics in response, but these systems typically require a large amount of administration to stay effective. Proofpoint’s third-generation solution applies sophisticated machine learning techniques to deliver high accuracy without the administrative overhead and technology weaknesses of older techniques.
Traditional anti-spam solutions are reactive: they compare new messages to known spam, simply looking for words, phrases and other attributes previously encountered in spam, and flag messages from “known” spammers. These technologies cannot adapt quickly enough to detect new threats and are losing ground against increasingly sophisticated spam attacks.
Proofpoint MLX machine learning technology provides the next generation in spam detection today: a highly effective, intelligent solution that can adapt to detect new types of spam with minimal intervention.
[Figure 1 diagram: three generations of anti-spam technology, plotted by increasing sophistication over time.]
1st Generation: Basic Filtering
o Techniques: signature-based; challenge/response; text pattern matching; RBLs, collaborative
o Results: low false-positives; low effectiveness; easily fooled by evolving techniques
2nd Generation: Heuristics/Bayesian
o Techniques: linear models; simple word match; heuristic rules
o Results: high false-positives; high administration; effectiveness decays over time
3rd Generation: MLX Machine Learning
o Techniques: logistic regression; support vector machines; integrated reputation
o Results: immune to evolving attacks; high effectiveness without decay; low false-positives; low administration
Figure 1: Evolution of spam detection.
The Need for Machine Learning
Defending messaging systems against spammers requires an intelligent system that can adapt automatically as the attackers’ techniques evolve. Unlike yesterday’s anti-spam technologies, Proofpoint’s MLX
technology counters new spam techniques as they emerge, defending messaging systems against the
threats of today as well as tomorrow.
A Brief History of Anti-spam Technologies: First Generation Solutions
In the early days of the spam epidemic—before the introduction of enterprise anti-spam solutions—spammers used simple, straightforward techniques to deliver spam. Spam messages were typically simple text
or HTML messages that were mass mailed over sustained periods of time. Given the “static” nature of
this spam, first-generation technologies such as signatures and RBLs (Real-time Block Lists) were able to
detect and stop attacks on a reactive basis. Companies like Symantec and others originally used signature-based techniques, very similar to the way anti-virus products work. But spammers quickly developed techniques for randomizing multiple parts of their messages (maintaining the core message while changing its signature) to thwart detection by signature-based systems.
Similarly, RBL techniques rely on understanding the quality and volume of messages associated with a
given sender’s IP address by gathering information over substantial periods of time. Again, in the early days
of sustained spam campaigns, RBLs were reasonably effective after the initial attack was recognized. The
problem today is that spammers rotate IP addresses frequently and often use hijacked machines (so-called
“zombie” or “botnet” machines) to send small bursts of spam from an ever-changing array of locations.
Overall, first-generation approaches have a low rate of effectiveness against spam because they are easily
defeated by randomization and obfuscation strategies. On the positive side, first-generation solutions did
not introduce very many false positives (valid messages incorrectly marked as spam).
Second Generation Anti-spam Solutions: Heuristic and Bayesian Approaches
To address the increasing frequency and sophistication of spam attacks, second-generation anti-spam
vendors (such as CipherTrust, Sophos and Postini) employed heuristics and Bayesian techniques—often in
combination with certain first generation technologies—in an attempt to create systems that deliver more
proactive, resilient defenses against spam. Heuristics are “rules of thumb” that attempt to make a judgment on whether an email is spam or not, based on a small number of “spammy” attributes. The problem
is that rules of thumb are not always accurate and can be easily fooled by spammers—especially since
most products taking this approach were based on open-source technologies that are readily available to
spammers.
The introduction of Bayesian techniques began the trend toward using more sophisticated analytic techniques rather than general rules of thumb to identify spam. Bayesian solutions use statistical analysis to
look for individual attributes that might indicate whether an email is valid or spam. But this relatively basic
statistical approach falls short due to its inability to understand the relationship between attributes. Bayesian systems can often be fooled simply by adding unrelated valid-looking text to a spam message.
Overall, second-generation systems based on heuristics or Bayesian weighting were more effective at
catching spam than earlier solutions, but this higher effectiveness came at a cost. Second generation systems suffer from a high rate of false positives and require a substantial amount of ongoing administration
to stay effective.
Third Generation Anti-spam Solutions: Proofpoint MLX Machine Learning
Cognizant of the evolution of both spam and anti-spam techniques, Proofpoint developed a solution that
uses far more advanced machine learning techniques. This advanced, statistical approach is more
predictive and resilient than previous generation solutions. Proofpoint MLX offers both high effectiveness
and low occurrence of false positives while requiring very little ongoing administration to stay effective.
Yesterday’s Anti-spam Technologies

Technique: Spam Signature Detection
Description: Compare messages to known spam
Limitations:
o Minor modifications thwart detection, and spammers know this
o Cannot detect new threats, always a step behind spammers
o Poor effectiveness against image-based spam

Technique: Challenge-Response
Description: Require sender to respond
Limitations:
o Challenge is offensive in business context
o Misclassifies valid, automatically-generated email

Technique: Text-pattern Matching
Description: Search for spam keywords such as Viagra or enlargement
Limitations:
o Difficult to manage large keyword lists
o Simplicity leads to high false positive rates
o Effectiveness plummets as new types of spam emerge

Technique: Heuristics or Naive Bayesian
Description: Apply rules of thumb to assign a spam score
Limitations:
o Based on a small number of independently assessed attributes
o High administration overhead for manually-tuned systems
o Naive Bayes models ignore attribute dependencies and systematically under- or over-estimate spam probabilities
o Easily defeated by today’s text- and image-based spam obfuscation techniques

Technique: Community Resources
Description: Check messages against RBLs and other public anti-spam resources
Limitations:
o Spammers use this information to thwart detection
o Network queries are time intensive, reducing performance in enterprise-wide usage

Using Machine Learning to Beat Spam
Proofpoint MLX technology leverages advanced machine learning techniques to automate the generation of large-scale statistical models for spam and content filtering. Employing a full range of machine learning techniques enables Proofpoint to analyze many millions of messages per day and distill them down to more than 1 million different attributes that reflect the underlying characteristics of spam. The resulting statistical model provides formulas for combining message attributes and weights to estimate the probability that a particular message is spam. With this model in place, the Proofpoint platform can classify messages with a high degree of confidence to maintain high effectiveness rates and a very low occurrence of false positives.
The probability that a message is spam is estimated by applying statistical techniques such as logistic regression. To minimize classification errors, Proofpoint employs proprietary techniques to train the system to accurately determine whether or not new cases should be classified as spam. To avoid “overfitting” the model to the training data and maximize its accuracy for new data, Proofpoint performs large-scale cross-validation testing using data from a wide variety of sources.
About Logistic Regression
Proofpoint’s statistical models combine attributes and weights to generate an estimate of the probability that a particular message is spam. Logistic regression is one technique used to build these classification models. Logistic regression provides a way to predict a discrete outcome such as group membership from a set of variables that can be continuous, discrete, dichotomous or a mix.
Logistic regression is a Bayesian technique: the most likely model is inferred from a combination of observed attributes and previous data. Instead of making the Naive-Bayes assumption that each attribute is conditionally independent, logistic regression provides a mechanism for taking interdependencies into account.
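As a rough illustration of how a logistic model combines attributes and weights into a spam probability, consider the minimal sketch below. The attribute names, weights and bias are invented for this example and are not taken from Proofpoint MLX; only the general form (a sigmoid applied to a weighted sum) is standard logistic regression.

```python
import math

def spam_probability(weights, bias, attributes):
    """Logistic model: sigmoid of the bias plus the weights of the triggered attributes."""
    z = bias + sum(weights.get(name, 0.0) for name in attributes)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights, invented for illustration only.
weights = {
    "phrase:click here": 2.1,
    "header:forged_received": 1.7,
    "sender:known_contact": -3.0,
}
bias = -1.0

# Two spam-like attributes push the probability up.
spam_like = spam_probability(weights, bias, {"phrase:click here", "header:forged_received"})

# The same phrase from a known contact is pulled back toward "valid".
valid_like = spam_probability(weights, bias, {"phrase:click here", "sender:known_contact"})

print(f"spam-like message:  {spam_like:.3f}")
print(f"valid-like message: {valid_like:.3f}")
```

Multiplying the probability by 100 would give a score on a 0-100 scale like the one used later in this paper, although the actual MLX scoring function is not described here.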
The Importance of Attribute Dependencies – What Naive-Bayes Models Ignore
Because machine learning models take into account the incremental impacts of different spam attributes
and the dependencies between attributes, the system can very accurately classify messages.
For example, if the phrases “Want to stop Snoring?” and “Get a good night’s sleep!!!” appear in an email,
the marginal spam effect of the second phrase is lessened so that the likelihood that a message is spam is
not overestimated. Proofpoint’s proprietary spam analysis techniques use “supervised learning techniques”
to understand the subtle differences between valid messages and spam. As a result, Proofpoint MLX technology can accurately classify valid messages that might otherwise be confused with spam.
Suppose a user receives an email from a colleague that says, “Bob, did you see the spam message about getting a good night’s sleep?” The system will recognize that the message is valid because other attributes are more important than the fact that it contains the common spam phrase “good night’s sleep”.
Standard anti-spam systems are not able to detect these subtleties. For example, suppose a user receives
the following email from his doctor:
Dear Bob,
Hope you did get a good night’s sleep after your treatment. Did you sleep well and did you stop
snoring? It may take a few days for the medicine to kick in. Let me know if you have any questions. – Dr. Smith
A Naive-Bayes classifier trained on the following spam will incorrectly classify the doctor’s message as
spam:
Did you sleep well last night?? Get a good night’s sleep! Stop Snoring Today! Click here to learn
more!
Because the phrases “Did you sleep well” and “Get a good night’s sleep” have appeared together in spam before, and the Naive-Bayes classifier scores all attributes independently, the overlapping evidence from these related phrases is effectively counted twice. As a result, the classifier overestimates the probability that the message is spam and mistakenly classifies the doctor’s legitimate email as spam.
In contrast, Proofpoint MLX classifiers recognize that attributes sometimes appear together and the system takes these dependent attributes into account, resulting in a more accurate assessment of the “spam
content” of each message. As a result, Proofpoint MLX is able to correctly conclude that the doctor’s message is valid.
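The double-counting effect can be reproduced on a toy data set. The sketch below is an illustration only: the eight training rows are invented, the two phrase attributes always occur together, and the logistic model is fitted with plain gradient descent rather than any Proofpoint technique. In this data, 3 of the 4 messages containing both phrases are spam, so the empirical spam rate for such messages is 0.75.

```python
import math

# Toy data, invented for illustration: two phrases that always co-occur.
# Each row is (phrase_a_present, phrase_b_present, is_spam).
DATA = [(1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0),
        (0, 0, 1), (0, 0, 0), (0, 0, 0), (0, 0, 0)]

def naive_bayes_prob(a, b):
    """P(spam | a, b) under the Naive-Bayes independence assumption."""
    def cond(col, value, label):
        rows = [r for r in DATA if r[2] == label]
        return sum(1 for r in rows if r[col] == value) / len(rows)
    prior_spam = sum(r[2] for r in DATA) / len(DATA)
    spam = prior_spam * cond(0, a, 1) * cond(1, b, 1)
    valid = (1 - prior_spam) * cond(0, a, 0) * cond(1, b, 0)
    return spam / (spam + valid)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(steps=5000, lr=0.5):
    """Fit a bias and two weights by batch gradient descent on the log loss."""
    bias = wa = wb = 0.0
    n = len(DATA)
    for _ in range(steps):
        g_bias = g_a = g_b = 0.0
        for a, b, y in DATA:
            err = sigmoid(bias + wa * a + wb * b) - y
            g_bias += err
            g_a += err * a
            g_b += err * b
        bias -= lr * g_bias / n
        wa -= lr * g_a / n
        wb -= lr * g_b / n
    return bias, wa, wb

nb_estimate = naive_bayes_prob(1, 1)
bias, wa, wb = fit_logistic()
lr_estimate = sigmoid(bias + wa + wb)

print(f"Naive Bayes estimate:         {nb_estimate:.3f}")
print(f"Logistic regression estimate: {lr_estimate:.3f}")
```

Naive Bayes multiplies the evidence from both phrases as if they were independent and lands well above the empirical rate, while the fitted logistic model splits the weight between the two redundant attributes and stays close to it.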
Attribute Dependencies
Because attributes associated with spam often have complex relationships and dependencies, taking those
dependencies into account is critical for accurate spam detection.
Heuristics and Naive-Bayes classifiers evaluate each spam attribute independently—they cannot take into
account dependencies between attributes. Because these systems assume that all attributes are conditionally independent, the benefits of considering a larger number of attributes are overwhelmed by the
proportional increases in the missing dependencies. This severely limits the number of attributes that these
systems can evaluate. They reach a point where adding attributes actually degrades their ability to make
accurate classifications.
In contrast, Proofpoint’s MLX classifiers accurately model attribute dependencies, enabling the system to
analyze more than one million high-quality attributes selected from a pool of many millions. As new attributes are identified, Proofpoint scientists use the latest machine learning techniques, such as information
gain analysis, to ensure that only the most useful attributes are processed by the MLX engine, ensuring the
highest levels of performance and accuracy at all times.
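Information gain measures how much knowing an attribute reduces uncertainty about whether a message is spam. The following is a minimal sketch of the standard entropy-based calculation on invented data; how Proofpoint actually ranks and selects attributes is not described in this paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on the feature's values."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Invented example: 1 = spam, 0 = valid.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
informative = [1, 1, 1, 1, 0, 0, 0, 0]    # present exactly in the spam messages
uninformative = [1, 0, 1, 0, 1, 0, 1, 0]  # equally common in spam and valid mail

gain_informative = information_gain(informative, labels)
gain_uninformative = information_gain(uninformative, labels)

print(f"informative attribute:   {gain_informative:.3f} bits")
print(f"uninformative attribute: {gain_uninformative:.3f} bits")
```

Attributes can then be sorted by gain and only the highest-ranked ones kept, which is one common way to limit a model to its most useful inputs.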
The ability of Proofpoint MLX to analyze many times the number of attributes considered by traditional
systems results in a highly-effective solution that accurately detects spam while maintaining a low incidence of false positives.
Estimating the Probability that a Message is Spam
To estimate the probability that a message is spam, Proofpoint uses logistic regression to define a statistical model that represents the complicated dependencies observed among spam attributes. Unlike Naive-Bayes classifiers, which evaluate each attribute independently, logistic regression enables smart scoring that leverages the knowledge that certain spam attributes commonly appear together. This not only increases the classifier’s effectiveness at identifying spam, it enables it to differentiate between spam and valid messages much more accurately.
Logistic regression calculates the incremental impact each attribute has on a message’s spam score. A
weight is assigned to each attribute to represent its net effect after the effects of other attributes are taken
into account. Sets of attributes that are known to be dependent on one another are weighted accordingly
and redundant attributes receive less weight, ensuring that the probability that a message is spam is not
over- or underestimated.
Because each attribute’s effect is modeled in relation to other attributes, gaps in the model can be filled
by intersecting existing attributes to create new ones. In systems that evaluate attributes independently,
continuing to add attributes can actually cause a degradation in the classifier’s accuracy and effectiveness.
However, adding helper attributes to a logistic regression classifier produces a better model with more
predictive power.
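One simple way to picture “intersecting existing attributes” is to add a new attribute that fires only when two existing attributes appear together, so that the model can learn a separate weight for the combination. The sketch below assumes a set-of-strings representation and hypothetical attribute names; it is not a description of the MLX attribute pipeline.

```python
def add_helper_attributes(attributes, helper_pairs):
    """Return the attribute set extended with conjunctions of co-occurring attributes."""
    extended = set(attributes)
    for first, second in helper_pairs:
        if first in attributes and second in attributes:
            extended.add(first + " AND " + second)
    return extended

# Hypothetical pair worth modeling jointly: a spam-like phrase from an unknown sender.
HELPER_PAIRS = [("phrase:good night's sleep", "sender:unknown")]

colleague = add_helper_attributes(
    {"phrase:good night's sleep", "sender:known_contact"}, HELPER_PAIRS)
stranger = add_helper_attributes(
    {"phrase:good night's sleep", "sender:unknown"}, HELPER_PAIRS)

print(sorted(colleague))
print(sorted(stranger))
```

Only the second message gains the helper attribute, which lets a classifier weight the phrase differently depending on who sent it.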
Minimizing Classification Errors
An anti-spam solution must be able to accurately classify messages—it must effectively block spam while
avoiding false positives. To achieve this, Proofpoint employs statistical techniques and a set of training
examples to determine whether or not new cases should be classified as spam.
Machine Learning in Action: Proofpoint MLX
Through its pioneering research into the application of tried-and-true statistical techniques to the problem of spam, and its continued focus on the security and messaging needs of large enterprises, Proofpoint
has developed a highly configurable message-processing platform that provides a comprehensive defense
against spam, viruses, and other messaging threats.
Proofpoint’s advanced machine learning classifiers and enterprise-strength platform enable Proofpoint
solutions such as Proofpoint ENTERPRISE Protection to synthesize large amounts of data, to analyze millions of message characteristics, and to classify messages with a very high degree of confidence, resulting
in a high rate of effectiveness and a very low rate of false positives.
Powered by Proofpoint MLX machine learning technology, the Proofpoint platform provides the most effective anti-spam solution available. By leveraging the best of first and second generation spam-detection
techniques, applying state-of-the-art MLX classifiers and adapting to enterprise-specific message characteristics and policies, Proofpoint solutions keep pace with emerging message threats and changing corporate needs.
The Proofpoint system is an enterprise-grade platform designed from the ground up to ensure high availability and performance, minimize management overhead and integrate seamlessly with existing enterprise
management tools.
Maximum Protection Today: High Confidence
The large number of attributes that the Proofpoint platform is able to analyze ensures that messages can
be classified with a high degree of confidence.
Proofpoint’s advanced classifiers enable the system to classify messages decisively: most messages score very high or very low, with only 1.5% falling between 20 and 80 on a scale of 0-100. Competitors’ products are often unsure how to classify messages, and upwards of 40% of messages typically receive scores between 20 and 80.
The image in Figure 2, below, is an actual screenshot from a Proofpoint customer deployment. This report
shows how Proofpoint MLX scored inbound messages over a 24-hour period. Notice that 99% of the
emails are confidently scored either very high (indicating that they are spam) or very low (indicating that
they are valid email).
As a result, this customer has found that they can discard messages with a score of 80 or greater (the red
region in Figure 2), automatically eliminating 98% of their spam, with zero false positives. These messages,
which represent an amazing 80% of the company’s total inbound email volume, are rejected and discarded
right at the enterprise gateway, substantially reducing the burden on downstream mail servers, storage
systems and network bandwidth.
A very small number of messages (about 1% of total email traffic, indicated by the yellow region in Figure
2) score between 45 and 80. These “probable spam” messages are held in a quarantine and added to email
digests that are sent to end users on a periodic basis. This policy blocks the last 2% of spam without the
risk of losing any legitimate email messages.
Lastly, the remaining 19% of this company’s original email stream gets confidently delivered as valid messages (the green region in Figure 2).
So in this case, Proofpoint correctly and confidently identifies more than 80% of the company’s inbound
email volume as spam without the need for administrators to constantly manage the solution—Proofpoint
MLX does the work.
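The customer policy described above can be expressed as a simple threshold rule. This is a sketch of that one policy only: the thresholds 45 and 80 come from the example in the text, while the function and parameter names are made up for illustration and do not reflect Proofpoint's configuration interface.

```python
def disposition(score, quarantine_at=45, discard_at=80):
    """Map a 0-100 spam score to an action using the example customer's thresholds."""
    if not 0 <= score <= 100:
        raise ValueError("spam score must be between 0 and 100")
    if score >= discard_at:
        return "discard"       # red region: rejected at the gateway
    if score >= quarantine_at:
        return "quarantine"    # yellow region: held and added to the user's digest
    return "deliver"           # green region: delivered as valid email

for example_score in (3, 45, 79, 80, 100):
    print(example_score, disposition(example_score))
```

Because so few messages fall in the middle band, the quarantine stays small even though the discard threshold is aggressive.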
Confident Scoring Enables Decisive Action Against Spam
Summary
Actual 24-hour spam score distribution from one of Proofpoint’s customer sites. The confident scoring and high accuracy provided by Proofpoint MLX have allowed this customer to adopt aggressive policies against spam. The green, yellow and red highlights graphically illustrate this customer’s actual spam policies. Most notably, a full 80% of the company’s inbound email is discarded as spam, with zero false positives, right at the email gateway.
o Deliver (valid email): 19% of total mail volume
o Quarantine (suspect email): 1% of total mail volume, 2% of spam volume
o Discard (spam): 80% of total mail volume, 98% of spam volume
Figure 2: Was it really spam? Proofpoint MLX’s decisive classification eliminates the uncertainty.
Typical Spam Score Distributions for Competing Products
Summary
Less sophisticated systems score spam with far less certainty than Proofpoint MLX. In this case, a large percentage of messages have scores in the middle range.
Such systems are faced with a quandary: Should messages that fall in this middle range be delivered to end users (reducing the system’s effectiveness)? Or should this large volume of messages be quarantined (generating a large number of false positives)?
[Figure 3 chart: percentage of messages (vertical axis, 0 to 100) by spam score (horizontal axis, 0 to 100), with the middle range annotated “Is this Spam... or Valid Mail?”]
Figure 3: Competing solutions are often unsure whether messages are spam or not.
Systems that are unable to decisively classify spam (Figure 3, above) are left with a difficult dilemma: should the messages that fall in the middle of the scoring range be sent to the user as valid email, or blocked as spam? Sending the messages to the user will lower the overall spam detection rate, greatly reducing the solution’s effectiveness. On the other hand, blocking the messages will cause a spike in false positives, which can be very detrimental to users.
Clearly, when messages cannot be classified decisively, there are no good options. The ability of Proofpoint
MLX to classify messages with a high degree of confidence eliminates this dilemma and greatly improves
the system’s overall effectiveness while maintaining a low rate of false positives.
Maximum Protection Tomorrow: The Learning Cycle
Unlike traditional anti-spam tools whose effectiveness quickly degrades as spammers change their tactics
to thwart detection, Proofpoint MLX is capable of learning and automatically adjusting to detect new
threats. As more data from both valid email and spam is added to the statistical model, the system identifies and weights relevant attributes to tune the classification process. The result is a system that is just as
effective at identifying tomorrow’s spam as it is at identifying today’s.
Proofpoint MLX Spam Detection at a Glance
[Figure 4 diagram, with the following labels. Inputs: Multilingual Commercial Spam, Multilingual Pornographic Spam, Multilingual Valid Email. Automated Machine Learning: 200,000+ Attribute Identification, Model Creation, MLX Spam Engine. Version Independent MLX Updates. Customer Site, Spam Detection Module: MLX Spam Engine, Spam Scoring, Feedback. Classifications: Valid Email, Likely Spam, Spam/Phish, Adult Spam. Apply Policies: Deliver, Quarantine & Add to Digest, Delete.]
Figure 4: Proofpoint MLX spam engine creation and the spam detection process at the customer site.
Defeating Spammers’ Obfuscation and Randomization Tactics
Obfuscation, such as when spammers use variant spellings of “Viagra” or camouflage HTML text, is a very popular strategy that spammers use to deceive spam filters.
Proofpoint researchers have developed new machine learning techniques that allow Proofpoint MLX to rapidly and accurately identify obfuscated text and differentiate intentional obfuscations from legitimate spelling errors.
Proofpoint’s backend message processing systems also use machine learning techniques to automatically detect and “learn” new obfuscations of “spammy” words that are observed. For example, if the system observes a large number of variant spellings within a short time, those obfuscations are automatically added to the next Proofpoint MLX engine update.
These techniques have a significant positive effect on anti-spam effectiveness with zero impact on false positive performance.
The predictive nature of Proofpoint MLX is also highly resistant to randomization and “hash busting” techniques commonly used by spammers to bypass signature-based spam filters.
Proofpoint MLX Spam Detection Process
The MLX detection process (Figure 4, above) begins at the Proofpoint Attack Response Center, where
scientists and engineers build and refine mathematical models that represent Internet spam. These models
are constantly updated and delivered to customers to ensure their messaging infrastructures stay ahead of
the latest spam attacks.
Proofpoint examines every aspect of incoming messages, from the sender’s IP address, to the message
envelope, headers, and structure, and finally the content and formatting of the message’s attachments and
the message itself. At any given time, more than one million possible attributes—representing both content
and structural components—may be taken into consideration. A typical message may trigger more than
300 MLX attributes.
Every email message can be broken down into three main components:
o The Message Envelope: The envelope contains information used by Mail Transfer Agents to route the message. Spammers can change the message envelope by capturing open relays or by planting zombies on unsuspecting computers and using them to send spam with a “valid” email address. Proofpoint MLX catches envelope-based spammer tricks.
o The Message Headers: Headers are key-value pairs that provide source and routing information for the message, along with other meta information such as the message sender, subject, and recipients. Headers are often spoofed by spammers. Proofpoint MLX catches header-based spammer tricks.
o The Message Body and Attachments: The actual content of the message. Spammers often obfuscate the text of the message body using HTML and other encoding tricks, in an attempt to exploit first- and second-generation spam filters. MLX catches message-body-based spammer tricks. It also examines the content of attachments, looking for similar characteristics.
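The three components listed above can be pictured with Python's standard `email` package. This is a sketch under stated assumptions: the envelope values are supplied separately because they travel in the SMTP session rather than in the message text, the sample message and addresses are invented, and the function name is hypothetical. The sketch handles only a simple single-part message.

```python
from email import message_from_string

def split_message(raw_message, mail_from, rcpt_to, client_ip):
    """Split a message into envelope, headers and body for separate analysis."""
    msg = message_from_string(raw_message)
    return {
        # Envelope data comes from the SMTP transaction, not from the message text.
        "envelope": {"client_ip": client_ip, "mail_from": mail_from, "rcpt_to": rcpt_to},
        # dict() keeps one value per header name; repeated headers are collapsed.
        "headers": dict(msg.items()),
        # For a single-part message the payload is the body text.
        "body": msg.get_payload(),
    }

RAW = (
    "From: Dr. Smith <smith@example.com>\n"
    "To: bob@example.com\n"
    "Subject: Follow-up\n"
    "\n"
    "Hope you did get a good night's sleep after your treatment.\n"
)

parts = split_message(RAW, "smith@example.com", ["bob@example.com"], "192.0.2.10")
print(parts["envelope"])
print(parts["headers"])
print(parts["body"].strip())
```

Keeping the envelope separate from the headers matters because the two can disagree, and a mismatch between them is itself a potentially useful attribute.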
The first level of screening examines the network stream to identify the source of each incoming email.
The system then performs an in-depth contextual analysis of the email, from distilling the email’s linguistic
structure to normalizing permutations of words (permutations are a common exploit whereby spammers
may replace ‘Viagra’ with a term like ‘v1a<b>g</b>gra’). Once the contextual analysis is complete, the
system evaluates the message according to the preferences set by both the end user and administrator.
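The kind of normalization mentioned above can be sketched with a few regular expressions. This is a deliberately crude illustration, not Proofpoint's method: the character substitution table is an assumption, and collapsing repeated letters would also alter legitimate words, which is exactly why the paper describes learned techniques for telling obfuscation apart from ordinary spelling errors.

```python
import re

# Assumed substitutions for common character swaps; illustrative only.
LOOKALIKES = str.maketrans({"1": "i", "0": "o", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"})

def normalize_token(raw):
    """Reduce an obfuscated token to a canonical lowercase letter sequence."""
    text = re.sub(r"<[^>]+>", "", raw)        # drop inline HTML tags such as <b>...</b>
    text = text.lower().translate(LOOKALIKES)  # undo digit and symbol lookalikes
    text = re.sub(r"[^a-z]", "", text)        # remove separators such as '-' or '.'
    return re.sub(r"(.)\1+", r"\1", text)     # collapse runs of a repeated letter

for token in ("Viagra", "v1a<b>g</b>gra", "V-I-A-G-R-A"):
    print(token, "->", normalize_token(token))
```

All three variants reduce to the same string, so a single attribute can cover them instead of one keyword per spelling.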
The results of this in-depth analysis are fed into Proofpoint MLX’s advanced classifiers to determine the
appropriate disposition for the message. On its own, no single test classifies a message as spam. But by
taking all attributes of a message into account, Proofpoint’s advanced classifiers categorize each message
with a high degree of certainty to accurately identify spam and minimize false positives.
Phishing Attacks and Pornographic Spam
Because of the mathematical foundation of MLX, its models can be easily adapted to subcategories of
spam. Two common email attacks are phishing or scam attacks and pornographic spam.
A phishing attack is a type of fraud. Phishing email looks like a legitimate message from a business familiar
to the recipient—typically a bank or a well-known online brand such as Amazon or PayPal—but is actually
a fraudulent attempt to extract personal identity or financial information. Thinking the message is valid,
the recipient posts personal account information. This information is then collected by the sender and
used illegally. Proofpoint has applied its machine learning algorithms to detecting phishing attacks, thereby
ensuring they are blocked from end users’ inboxes. To aid in combatting phishing, Proofpoint accurately
identifies spam in email attachments including, but not limited to PDF, ZIP, XLS, DOC and RTF file types.
(Phishing is discussed in more detail later in this paper.)
In addition to phishing attacks, organizations are being bombarded with pornographic spam, exposing
them to risk, liability and embarrassment. Administrators need tools to enforce a policy of zero-tolerance
for pornographic spam. Proofpoint not only detects pornographic spam with a high degree of accuracy
using MLX technology, but also allows administrators to define separate, more aggressive policies for pornographic spam. Each message analyzed by Proofpoint MLX is assigned a general spam score as well as an
adult spam score, enabling each type of message to be handled differently. For example, an organization
might configure its policies to delete all pornographic spam, while quarantining non-pornographic spam.
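The two-score policy in this example could be expressed as the following sketch. The 0 to 100 score range, the threshold values, and the function name `disposition` are assumptions chosen for illustration.

```python
def disposition(spam_score, adult_score, spam_threshold=50, adult_threshold=30):
    """Pick an action from two independent scores (assumed range 0-100).

    The adult threshold is lower than the general one to model a separate,
    more aggressive policy for pornographic spam.
    """
    if adult_score >= adult_threshold:
        return "delete"
    if spam_score >= spam_threshold:
        return "quarantine"
    return "deliver"


print(disposition(spam_score=85, adult_score=5))   # quarantine
print(disposition(spam_score=85, adult_score=60))  # delete
print(disposition(spam_score=10, adult_score=2))   # deliver
```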
Reputation Analysis for IP Addresses and URLs
New types of attributes are continually being assessed and added to Proofpoint MLX. For example,
Proofpoint MLX spam engine updates include a dynamic set of attributes that represent reputation scores
associated with the IP addresses of various senders.
The Proofpoint Attack Response Center continually examines large volumes of Internet mail, external spam
block lists and data from Proofpoint partners and customers to identify IP addresses that are commonly
used to send spam. This ever-changing list of spam servers, suspected spam domains, botnet and “zombie”
machines is constantly updated and automatically incorporated into the MLX spam engine updates that
are delivered to Proofpoint customers on a regular basis. MLX performs this same reputation analysis on
the URLs included in messages, ensuring that dangerous addresses are caught, even if the message is sent
from a trusted domain.
Fighting Image-based Spam
In yet another attempt to bypass
less sophisticated spam filters, an
increasing amount of spam is now
being sent with the spam “payload”
contained in an attached image,
sometimes accompanied by randomized text.
The Proofpoint MLX spam engine
includes algorithms to detect and
block image-based spam and image-based obfuscation techniques,
which competing solutions cannot
accurately catch.
See “Recent Spam Trends and
Emerging Threats,” below for more
information on this problematic new
form of spam.
MLX Speaks Your Language
As the volume of non-English language spam increases, Proofpoint’s
machine learning engine is continually being trained to identify spam in
a wide variety of European and Asian
languages.
Combined with the real-time, local reputation data generated by the MLX Dynamic Reputation™ features
of each Proofpoint server and other message attributes, Proofpoint MLX can make intelligent decisions
about which messages and connections to block or throttle without the negative performance impact of
constant network blocklist (DNSBL) lookups. Proofpoint MLX also uses similar techniques to identify and
block malicious URLs contained in spam and phishing messages.
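As a rough sketch of how reputation data might become message attributes, the example below checks a sender IP and the hosts of embedded URLs against locally held lists, with no network lookup per message. The list contents, the URL pattern, and the attribute names are hypothetical.

```python
import ipaddress
import re
from urllib.parse import urlparse

# Hypothetical local copies of reputation data delivered in engine updates
BAD_IPS = {ipaddress.ip_address("203.0.113.7")}
BAD_HOSTS = {"malware.example"}

URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def reputation_attributes(sender_ip, body):
    """Return reputation-derived attributes for one message."""
    hosts = {urlparse(url).hostname for url in URL_RE.findall(body)}
    return {
        "sender_ip_listed": ipaddress.ip_address(sender_ip) in BAD_IPS,
        # True even when the sending IP itself has a clean reputation
        "url_host_listed": bool(hosts & BAD_HOSTS),
    }


attrs = reputation_attributes(
    "198.51.100.1", "Click http://malware.example/offer now"
)
print(attrs)
```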
Bounce Management
Proofpoint’s anti-spam solutions also support Bounce Address Tag Validation (BATV, a draft specification
submitted to the IETF) in order to combat backscatter, which occurs when a spammer spoofs a legitimate
address resulting in a barrage of non-delivery reports (NDRs) directed at the legitimate user. Customers
who secure their outbound email through Proofpoint solutions can take advantage of BATV tagging of
outbound messages to detect and block backscatter from forged addresses. In addition, all Proofpoint customers can take advantage of MLX to differentiate between valid and backscatter NDRs and to configure
the aggressiveness with which MLX rejects backscatter messages.
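The idea behind BATV tagging can be sketched as follows. The layout is loosely modeled on the `prvs` tag described in the BATV draft (a key digit, a three-digit expiry day, and a truncated HMAC). The secret, the seven-day lifetime, and passing the day number as a plain integer are simplifying assumptions, and this is not Proofpoint's implementation.

```python
import hashlib
import hmac

SECRET = b"example-secret"  # hypothetical signing key
LIFETIME_DAYS = 7           # assumed validity window for a tag


def _mac(key_digit, expiry, address):
    msg = f"{key_digit}{expiry:03d}{address}".encode()
    return hmac.new(SECRET, msg, hashlib.sha1).hexdigest()[:6]


def tag_sender(address, today, key_digit=0):
    """Rewrite the envelope sender of an outbound message."""
    expiry = (today + LIFETIME_DAYS) % 1000
    return f"prvs={key_digit}{expiry:03d}{_mac(key_digit, expiry, address)}={address}"


def is_valid_bounce(recipient, today):
    """Accept a bounce only if its recipient carries a valid, unexpired tag."""
    if not recipient.startswith("prvs="):
        return False
    try:
        tag, address = recipient[5:].split("=", 1)
    except ValueError:
        return False
    if len(tag) != 10 or not tag[:4].isdigit():
        return False
    key_digit, expiry, mac = int(tag[0]), int(tag[1:4]), tag[4:]
    if (expiry - today) % 1000 > LIFETIME_DAYS:
        return False  # tag has expired
    return hmac.compare_digest(mac, _mac(key_digit, expiry, address))


tagged = tag_sender("alice@example.com", today=100)
print(is_valid_bounce(tagged, today=103))               # True
print(is_valid_bounce("alice@example.com", today=103))  # False
```

A non-delivery report addressed to an untagged or expired address was not triggered by mail that actually left the organization, so it can be treated as backscatter.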
Obfuscation Detection
Proofpoint has developed a proprietary machine learning model to identify obfuscated words (which can
be a strong indicator of spam) and to differentiate intentional obfuscations from unintentional obfuscations (such as spelling errors).
Natural Language Tests
MLX performs context-aware analysis to derive meaning from text and to identify signature-busting language often used in spam email.
Using Machine Learning for
Connection Management
Proofpoint’s connection management component—Proofpoint Dynamic Reputation™—also uses machine learning techniques. Powered
by a combination of local, dynamic
reputation analysis and global/network reputation data, Proofpoint’s
“netMLX” service uses machine
learning technologies to assess the
reputation of IP addresses, in order to block or throttle incoming
email connections from malicious
or spammy IP addresses. While the
analysis and computation is done
“in the cloud,” netMLX can be queried by any Proofpoint deployment
(whether deployed as SaaS or on-premises).
netMLX analyzes thousands of traffic properties, domain characteristics,
and other features. To identify the
most statistically relevant features
to use in classification, Proofpoint
uses a proprietary method known
as discretization, which ensures that
Proofpoint Dynamic Reputation operates at maximum efficiency from
a computation and accuracy standpoint.
Data is sourced from Proofpoint’s
worldwide collection of honeypots
and our customer base. The properties (or “features”) analyzed are fed
into a multi-stage classifier, meaning
that different properties are analyzed
using different machine learning algorithms and then sent to a final
classifier to compute a “reputation”
score for each incoming IP address.
Based on this score, customers can
enforce various connection management policies. For example, connections from malicious IPs can be rejected or throttled.
Note that these IP reputation scores
are distinct from the spam score that
customers also have control over at
the policy level.
The whole process occurs in real time;
the time it takes for a new IP address to be identified, analyzed and
for data on that IP to be available to
Proofpoint customers is on the order
of one minute.
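The flow described in this sidebar (discretize features, score them in separate first-stage models, combine the results in a final classifier, then apply a connection policy) can be sketched generically. Every number below is invented for illustration: the bin edges, the lookup tables that stand in for trained models, the combining weights, and the thresholds. The binning shown is ordinary discretization, not Proofpoint's proprietary method.

```python
import math
from bisect import bisect_right

# Hypothetical bin edges for two connection features
VOLUME_EDGES = [10, 100, 1000]      # messages per hour from the IP
INVALID_RCPT_EDGES = [0.05, 0.25]   # fraction of invalid recipients

# Stage 1: stand-ins for trained models, one probability per bin
VOLUME_MODEL = [0.05, 0.20, 0.60, 0.90]
INVALID_RCPT_MODEL = [0.10, 0.50, 0.95]


def discretize(value, edges):
    """Map a continuous value to the index of its bin."""
    return bisect_right(edges, value)


def reputation_score(msgs_per_hour, invalid_rcpt_ratio):
    """Return a 0-100 score; higher means more likely malicious."""
    p_volume = VOLUME_MODEL[discretize(msgs_per_hour, VOLUME_EDGES)]
    p_rcpt = INVALID_RCPT_MODEL[discretize(invalid_rcpt_ratio, INVALID_RCPT_EDGES)]
    # Stage 2: a logistic combination of the first-stage outputs
    z = 3.0 * p_volume + 3.0 * p_rcpt - 3.0
    return 100 / (1 + math.exp(-z))


def connection_policy(score, reject_at=80, throttle_at=50):
    if score >= reject_at:
        return "reject"
    if score >= throttle_at:
        return "throttle"
    return "accept"


for features in [(5, 0.0), (500, 0.10), (5000, 0.50)]:
    score = reputation_score(*features)
    print(features, round(score, 1), connection_policy(score))
```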
International Language Analysis
MLX has the ability to parse both single- and double-byte text with appropriate machine learning models
for different languages (e.g., English, Japanese, Chinese, etc.). Ongoing training in these languages ensures
high effectiveness against non-English language spam.
Recent Spam Trends and Emerging Threats
In spite of improved defenses and several high-profile criminal prosecutions, spam continues to plague
organizations of all sizes. Closely monitoring spam attacks against its own customers, Proofpoint found
that, for most enterprises, spam volumes rose between 150% and 400% last year.
And enterprises are feeling the impact: in 2009, spam is estimated to have cost U.S. organizations $42 billion. Worldwide, the cost was $130 billion. Lost productivity accounts for roughly 85% of these costs, with
the remainder primarily covering IT support costs, such as help desk salaries.1 But spam jeopardizes much
more than employee productivity and IT budgets: it’s increasingly implicated in security attacks, which can
lead to data breaches, regulatory fines, lost business, and other damages.
Clearly, spam remains a serious threat to the enterprise. And it’s complicating IT’s adoption of other new
technologies, such as social media platforms for cross-departmental collaboration.
Currently, six trends are making spam an especially difficult and urgent problem for enterprises. These six
trends are:
o The use of increasingly sophisticated botnets as the primary means of delivering spam
o The ongoing use of image-based and other forms of attachment-based spam to evade
detection
o The growing sophistication of phishing attacks
o The increase in blended threats, which combine email with other technologies, such as Web
sites or multimedia files
o The use of social media sites and tools to deliver spam and to steal data
o The rise of cybercriminal syndicates intent on stealing funds and data; hackers are no longer
satisfied with causing network outages or data loss; now they’re after confidential data they
can resell or exploit to steal funds or to blackmail enterprises
To assess the technical approach taken by any anti-spam technology, it’s important to understand the
nature and impact of these trends.
The Rise of Botnets
Robot networks, or botnets (also called zombie networks), consist of network-connected machines that
have been compromised by spyware or malware. These compromised machines are used by malware writers to send spam (or viruses) on their behalf and to launch other types of network attacks. Rather than
sending spam directly from a server to a set of organizations, spammers use botnets to send spam indirectly. Each node in the botnet is responsible for sending a fraction of the spam campaign. Proofpoint
estimates that more than 75% of all spam attacks are now sent using botnets. The impact of using botnets
as a spamming tool is two-fold:
o Quick and intense attacks: The spammer has precise control over the launch and duration of the
attack. By sending commands over an IRC (Internet Relay Chat or similar communication) channel,
the spammer can turn an attack on or off. Attacks now last under an hour and deliver huge volumes
of spam during that period.
o Multiple IP addresses: Instead of having just one or two sending IP addresses for a spam campaign, a botnet allows a spammer to send spam from hundreds, or even thousands, of IP addresses. The use of these techniques has made centralized reputation services (which block spam
based on the sender’s IP address) much less effective as a spam-fighting tool.
The rapid proliferation of botnets has made it possible for spammers to send an ever-increasing volume of
spam. Botnets make sending large quantities of spam “cheap,” because spammers are able to tap into large
pools of computing and network resources virtually free (or for a very low price, as when botnet controllers
rent out their botnets to other spammers).
This same “economy of scale” has also made it possible for spammers to send more resource-intensive
types of spam, such as image-based, highly personalized or highly randomized spam. For example, the
spam “payload” may be delivered as an attached image (or other document type), sometimes accompanied
by large amounts of text. Both images and text are typically randomized or obfuscated in an attempt to
defeat both signature- and heuristics-based spam filtering techniques, as described later in this paper.
Image-based and Attachment-based Spam
Image-based spam has “graduated” from an annoyance to being one of the core tricks used by spammers.
In fact, image-based spam represents a whole collection of new techniques (some of which are described in
more detail below) that leverage message attachments to deliver spam messages. From month-to-month
the volume of image-based spam (or spam that uses other attachments such as Word or PDF files) varies
widely, but spam campaigns that use these techniques continue to be a common problem.
Spammers have always used images in their spam. Spammers used to insert images referenced on remote
servers through URLs. But given the lower cost of bandwidth and availability of the massive processing
power of botnets, spammers are now attaching images and other types of files directly to their messages.
This has the advantage of tricking many spam filters. Proofpoint has invested heavily in developing and
delivering technologies that block these difficult-to-detect messages.
The rise of image-based spam was enabled primarily by the proliferation of botnets. Today, botnets send
large amounts of image-based spam where each individual message is randomized and obfuscated using
a variety of different techniques.
Why Traditional Detection Technologies Fail to Detect Image-based Spam
Most significantly, image-spam is very hard to detect using traditional first- and second-generation technologies such as signature techniques and Bayesian filtering. Furthermore, since much of the image-based
spam comes from botnets, it is difficult to detect using traditional reputation services, as the source IP addresses continually shift.
Spam technologies that worked well in the past saw their effectiveness decline when these spam techniques came into widespread use. The image-based spam techniques described later in this paper serve
to reduce the effectiveness of filters, and thus increase the amount of spam that gets through to users’
inboxes.
Image-based spam techniques wreak havoc against many of the most common anti-spam technologies:
o Signature-based detection: Because the images used in each individual spam message are obfuscated, each message is unique. That is, each spam email contains a “new” image that doesn’t match
any known signature. Therefore, signature-based engines are fooled by the spam and it is delivered
to end users.
o Reputation-based detection: The use of botnets allows image-based spam to be sent from an
ever-changing or “rotating” set of IP addresses. Most of the nodes in a botnet have no reputation
rating at all (either positive or negative) and are used for sending messages in such a way that they
avoid detection by reputation systems. Relying on the reputation of a sending IP address to catch
image-based spam does not work effectively as an enterprise scale solution.
o Bayesian-based detection: Many image-based spam messages are also accompanied by text
(which may be visible or invisible) that looks “legitimate” to many types of spam filters. Simple
Bayesian filters are unable to “see” and analyze the image. Instead, they rely on the text, which appears legitimate. Due to a limitation in the way Bayesian systems perform their computation, the
final spam score for this analysis results in the email being misclassified as legitimate mail. These
techniques are known as Bayesian-busting.
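The Bayesian-busting effect described in the last bullet can be demonstrated with a toy naive Bayes combiner. The per-token probabilities are invented for illustration, and real filters differ in detail.

```python
import math

# Hypothetical learned values of P(spam | token)
TOKEN_SPAM_PROB = {
    "viagra": 0.99, "free": 0.90,
    "meeting": 0.05, "report": 0.10, "thanks": 0.08, "schedule": 0.07,
}


def bayes_spam_probability(tokens, default=0.4):
    """Combine per-token evidence under the naive independence assumption."""
    log_odds = 0.0
    for token in tokens:
        p = TOKEN_SPAM_PROB.get(token, default)
        log_odds += math.log(p) - math.log(1 - p)
    return 1 / (1 + math.exp(-log_odds))


# Text spam: the payload words are visible to the filter
text_spam = ["viagra", "free"]
# Image spam: the payload is inside the image, so the filter only sees
# the innocuous padding text that accompanies it
image_spam_text = ["meeting", "report", "thanks", "schedule"]

print(bayes_spam_probability(text_spam) > 0.9)        # True
print(bayes_spam_probability(image_spam_text) < 0.1)  # True
```

Because each token multiplies into the overall odds, a handful of innocuous words is enough to pull the score far below any reasonable threshold when the filter cannot see the image.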
The Growing Sophistication of Phishing Attacks
Phishing attacks are email attacks that impersonate an email from a trusted site, such as a bank, brokerage, or social media site, in order to lure the recipient into clicking on a link or giving away confidential
information. Phishing attacks can be used to deliver ads to users, to steal users’ login credentials and other
information, or to infect the recipient’s computer system with malware. Many attacks lull users into clicking to learn more about topical events, such as elections, major sporting events, natural disasters, and news
about celebrities. A typical phishing attack would consist of a spam message, purportedly from a major
bank, telling the recipient that his or her account requires immediate action, and that he or she should click
on the link and login, thereby allowing hackers to harvest the recipient’s login credentials.
Using Machine Learning for
Connection Management
(continued)
Proofpoint uses connection level
sender reputation to significantly
reduce CPU load by limiting the
amount of email content that each
system must analyze.
While Proofpoint Dynamic Reputation has a positive impact on antispam performance, enabling this
feature is not required to achieve the
99.8% anti-spam effectiveness typically delivered by Proofpoint MLX.
Rather, the goal of Proofpoint Dynamic Reputation is to mitigate the
impact of sudden bursts of email
traffic, denial-of-service attacks
(both direct and indirect) and the
vast numbers of email connections
caused by spam campaigns.
In typical production enterprise deployments, Proofpoint Dynamic
Reputation identifies 70%-80% of
inbound connections as malicious,
based on global netMLX reputation
scoring. Local reputation analysis will
often identify an additional 10% of
connections as malicious.
Note that these statistics are for
connections, not messages. One cannot know how many email messages
were in a given connection if it is not
accepted, but on average there are
multiple messages per spammy or
malicious connection. The net result
is that, for every connection blocked,
multiple messages and the payloads
they carry are blocked.
Since Proofpoint does not rely exclusively (or even heavily) on reputation for anti-spam effectiveness,
Proofpoint Dynamic Reputation’s
aggressiveness is tuned to ensure
sufficiently high block rates while
eliminating false positives. Reputation scores are maintained on individual IP addresses in order to avoid
false positives caused by blocking
overly broad IP ranges.
The automated machine learning processes at the heart of Proofpoint’s
connection management system are designed to avoid false positives
by including significant amounts of valid email connections in the
training data to ensure that the engine recognizes both the
characteristics of valid mail and the characteristics of malicious
connections.
Because this training process is performed continually over a constantly
changing global dataset, the system is “self-annealing.” A major
differentiator of Proofpoint Dynamic Reputation is that IP scoring is
highly dynamic and automated, whereas other reputation systems require
senders to call or write to the vendor if they have been incorrectly
added to their system’s “blocklist”.
While Proofpoint also provides online feedback mechanisms in the
event that senders need to request removal, IP addresses that are
identified as malicious and then return to normal behavior quickly
return to “good” reputation scores and are automatically “unblocked” by
Proofpoint Dynamic Reputation.
Phishing attacks constitute a small fraction of spam attacks, but it’s an expensive fraction. Phishing attacks
are estimated to have cost the U.S. economy more than $8.4 billion in 2009. Of that money, about $5.8
billion was in direct monetary losses; in other words, phishers stole nearly $6 billion from bank accounts
and other reserves, using login credentials and other information obtained fraudulently through phishing.
In addition, phishing attacks have tarnished more than 2,000 brands.2
Building on their successes, phishers are refining their attacks and focusing on high net-worth individuals
and institutions. So called spear phishing is a highly targeted form of phishing in which the email message
appears to be more targeted and personal. For example, instead of sending 10,000 users a spam message
that pretends to be a security notification from a major bank, a spear-phishing attack might send only 50
messages to select individuals whose data, assets, or login credentials are especially valuable. Because the
spam email contains personal information that a mass emailing would not, the recipient is more likely to
trust it, click on a dangerous link, and fall prey to the attack.
Phishing attacks are an especially pernicious and increasingly common form of spam. Instead of delivering bothersome ads or malware that may shut down a computer, they can lead to identity theft or even a
transfer of funds with serious, long-term repercussions.
Blended Threats
A blended threat is a security attack that achieves its ends by combining two or more attack vectors, such
as email and Web sites. Blended threats often use spam to initiate contact with a victim. For example, a
blended threat attack might consist of an email message that includes a link to a tampered Web site; when
recipients click on a link in the message, their browsers navigate to a site infected with malware, which the
browsers then automatically load, thereby infecting the recipients’ computers. Because the spam message itself doesn’t contain malware, it stands a good chance of getting past most types of email and virus
filters.
Comparing the link to a known list of bad sites might not work either. The infected Web site might be
new and hence not yet included on any malware blacklists. Or it may be a legitimate site, such as a popular
news site or shopping site, which hackers have managed to infect with malware using an attack technique
such as SQL injection.
Blended threats can be quite elaborate, involving a series of seemingly unrelated leaps between the initial
contact with the victim (usually in the form of a spam message) and the ultimate theft of data or injection of malware. For example, an attack might begin with a spoofed email message that appears to be an
automatically generated message from Facebook, perhaps from the security team at Facebook or from a
friend of the recipient. The recipient trusts Facebook and clicks on the link, which appears to lead to content posted on Google Reader, another Internet brand the user is likely to trust. The Google Reader content,
in turn, might link to a video seemingly posted on YouTube, but in fact posted on a malware site
cleverly designed to resemble YouTube. Playing the video unleashes a malware attack that takes advantage
of security vulnerabilities in some older versions of Flash.
Attacks such as these rely on the email recipient’s trust in personal friends and in the brands of the biggest
Internet properties. Like the simpler, two-step attack described above, they begin with an email message that
itself doesn’t include any malware. But within a matter of seconds (the length of time that a trusting user
might spend clicking through the various links supposedly hosted on these well-known sites), the recipient’s computer is infected with malware.
In addition to infecting systems with malware, blended threats are being used by hackers to harvest login
credentials and confidential financial data such as Social Security numbers and credit card information.
Threats from Social Media
The social-media spam floodgates have opened: in 2009, 50% of enterprises reported receiving spam from
social media networks. More than a third of enterprises say they have received malware attacks from
social media networks.3 A recent survey by Proofpoint found that more than a quarter of US workers say
that social-media network notifications account for about 5% (1 in 20) of the emails they receive at work.
More than 10% of US workers say that such notifications account for about 10% (1 in 10) of the emails
they receive at work.
The growth of spam that appears to be from social media networks isn’t too surprising given the surging
popularity of these networks and related Web 2.0 communications platforms, such as Twitter. The largest
social media community, Facebook, has grown to 350 million users.4 Users are spending more time on social media–82% more time year over year by December 2009.5 Social media networks now account for 11%
of all U.S. Internet traffic.6 In December 2009, Facebook became the top referrer site to news portals such
as Yahoo and MSN.7 (Such referrals make social media sites ideal environments for phishing attacks.)
A couple of related trends are amplifying the vulnerability of social media users to spam and to phishing.
First, social media marketing is on the upswing. Seventy percent of marketers plan to increase their social-media marketing budgets in 2010.8 This re-allocation of marketing funds will lead to a sizeable increase
in email solicitations from well-known brands and social media sites. This increase will make it harder
for users to distinguish legitimate social-media marketing campaigns from spam campaigns that link to
malware or phishing sites.
Second, enterprises are adopting social media platforms such as SharePoint and Jive to improve communication and collaboration internally. This adoption makes employees more comfortable posting social
media content and following links, and more trusting of social media communications (including email
notifications) in general.
Spammers and cybercriminals are taking advantage of the popularity of social media networks to create
new types of spam, phishing attacks, and malware traps. Many social networks such as LinkedIn send invitations by email. Others, such as Facebook, send email notifications when users post comments. Spammers
and phishers are spoofing these messages to trick users into viewing ads, revealing login credentials, or
downloading malware. Phishers have even begun harvesting users’ contact lists on networks such as Facebook to make spam messages appear more credible. Many users will automatically trust a message
that includes names and photographs of friends. They assume that only the network administrators could
compile this information.
Social media spam is particularly well suited to blended threats. Because social media sites are filled with
links to videos, photos, and audio clips, many users don’t think twice about clicking on a link that supposedly leads to trustworthy content. Users don’t realize that videos can contain malware, or that even
trustworthy sites can be injected with malware.9
Social media networks like Facebook that support custom applications pose additional threats. Few networks screen these applications for security vulnerabilities or malware infections. Many users blindly trust
these applications, naively assuming that they’re harmless fun. Unfortunately, some of these applications
include Trojans or keyloggers. Some post bogus messages, soliciting the user’s friends to click on links and
further propagate the attack.
Social applications create dynamic opportunities for distributing malware or harvesting confidential data.
Spam messages promoting these applications are an invitation to trouble.
Spam and Crime
Ten years ago, most malware authors had relatively simple goals: they wanted to demonstrate technical
prowess, and they wanted to rebel against authority. To achieve these goals, hackers programmed increasingly sophisticated worms and viruses that slipped through the defenses of major corporations and caused
millions of dollars of damage worldwide. But the malevolence stopped there. Early hackers didn’t try to
extort money from their victims. They didn’t use Trojans or rootkits to steal login credentials for financial
gain. Their motivation seemed to be egotism, not greed.
Over the past decade, the nature of malware attacks and other spam-related crimes has changed dramatically. Cybercrime has become big money. Criminal syndicates around the globe are involved in creating and
renting botnets, and breaking into financial institutions and siphoning funds. Hackers are no longer content simply congesting networks; now they want financial gain from their nefarious labors.
An example of this change is the ongoing development of ransomware, a type of malware that first appeared around 2006. When a ransomware Trojan springs into action, it encrypts files on the user’s local
hard drive and demands that the user wire funds to a foreign account in order to have the files restored.10 A
recent variant shuts down the infected computer’s Internet access, then orders the victim to send a text
message, which turns out to be exorbitantly expensive, to a special number to receive the decryption key.11
Other new spam payloads may modify the DNS settings on a victim’s computer and direct users to fraudulent dating or gambling sites.
Outbound Spam Detection
Increasingly, organizations of all
types are concerned about preventing their networks from contributing to the global spam problem and
want to ensure that no machines on
their network are sending spam or
other forms of malicious email. Even
a single botnet-infected machine on
an organization’s internal network
can generate massive amounts of
spam, quickly causing their pool of
IP addresses to be blacklisted.
But most spam filtering solutions
rely on a combination of reputation
scores and content filtering to identify and stop spam and rely heavily
on the reputation scoring component to ensure accuracy.
While this approach may work satisfactorily for the inbound email stream,
where the reputation of sending IPs
can be easily monitored and tracked,
this approach is ineffective in addressing outbound spam, where one
must examine the content and be
able to make an accurate determination of “spammyness” based solely
on non-reputational factors.
In addition, some anti-spam solutions do not even support the ability
to scan the outbound email stream
for spam content.
In contrast, Proofpoint’s anti-spam
and anti-virus technology can be applied to both inbound and outbound
email streams.
And because Proofpoint MLX technology detects spam with extremely
high accuracy without relying on IP
reputation information, it is also
highly effective at outbound spam
detection, in contrast to solutions
that depend heavily on reputation
scoring.
The growth of cybercrime and increasingly malicious payloads makes spam a security issue too urgent to
ignore. Spam jeopardizes not only employee productivity, IT asset availability, and business continuity; it
also poses a direct threat to the financial wealth of enterprises and their employees.
Understanding the “Perception Crisis” in
Spam Effectiveness
New spamming techniques and ongoing increases in the sheer volume of spam being sent, combined with
the overall increase in legitimate email messages, have had a predictable but unwelcome result. Many email
users who had become accustomed to spam-free inboxes find themselves now receiving a noticeable and
often annoying amount of spam.
End users perceive that effectiveness has declined and this often leads to an increase in end-user complaints and helpdesk calls related to spam. In addition, email administrators may find themselves spending
more and more time trying to stem the flow of spam.
In some cases, anti-spam solutions are unable to keep up with spammer sophistication, but in others, the
spam filter effectiveness has remained constant—or even improved—but the higher overall volume of email
results in an increase in the absolute number of spam messages making it through the filter.
Let’s take a closer look at the relationship between rising inbound email volumes, spam filter effectiveness
and end-user perceptions about the quality of your existing spam filtering solution. The following scenarios
will help explain why spam has once again become a critical enterprise IT issue and why organizations now
require anti-spam solutions with extremely high accuracy.
Scenario 1: The “Good Old Days”
Think “way back” to the time before botnets. You’ve recently replaced your first-generation anti-spam
solution with a new solution that provides what seems like an incredible effectiveness of 94%, with a
negligible number of false positives (i.e., legitimate messages inadvertently marked as spam). Your spam
volume is 500,000 messages per day. Your 5,000 end users aren’t complaining, as they very rarely receive
any spam at all (i.e., false negatives are also negligible). This is a huge improvement over your previous anti-spam solution, which had a 90% effectiveness rate. You are feeling good about the vendor you chose, and
you focus your attention on other projects.
A quick calculation of your new spam solution looks like this:
Variable | Metric                                                                     | Value   | Formula
A        | Daily spam volume                                                          | 500,000 |
B        | New vendor’s effectiveness                                                 | 94%     | (with negligible FPs)
C        | Number of end users                                                        | 5,000   |
D        | Spam being blocked at gateway                                              | 470,000 | =A*B
E        | Spam not blocked, that gets through the gateway and hits your mail servers | 30,000  | =A-D
F        | Average number of spam messages/day that get to an end user’s inbox        | 6       | =E/C
A quick calculation of the situation before you purchased the new solution looks like this. We assume that
the volume and number of employees stayed the same:
Variable | Metric                                                                     | Value   | Formula
G        | Old vendor’s effectiveness                                                 | 90%     | (with some FPs)
H        | Spam being blocked at gateway                                              | 450,000 | =A*G
I        | Spam not blocked, that gets through the gateway and hits your mail servers | 50,000  | =A-H
J        | Average number of spam messages/day that get to an end user’s inbox        | 10      | =I/C
So the results are as follows:
Solution     | False Negatives in End User’s Inbox | Load on Exchange Servers
Old solution | 10                                  | 50,000
New solution | 6                                   | 30,000
Your new vendor shows quite an improvement over the old vendor. The research that went into your spam
solution has paid off immediately. Not only are your end users happier, but the entire mail team is as well,
as you have “gained” capacity on your Exchange servers.
Scenario 2: Today’s Challenges
Fast forward to today. Things are different. The rise of botnets has resulted in a higher volume of spam hitting your gateway. You run some reports and see that inbound email volumes have increased from 500,000
spam messages per day to 2 million spam messages per day. Though it sounds extreme, this type of increase is pretty typical, as discussed previously.
Furthermore, because spammers are now sending more sophisticated, attachment-based spam campaigns,
your gateway spam solution’s effectiveness might have declined a bit. But let’s be conservative and give
your “new” vendor the benefit of the doubt: we’ll assume that effectiveness has actually improved by one percentage point,
to 95% average effectiveness.
Even so, your 5,000 end users are now complaining, because it looks to them like anti-spam effectiveness
has declined. This can’t be, you tell your team. We just spent time and money evaluating and deploying a
solution a year ago!
Let’s look at the numbers again to better understand what’s going on:
Variable | Metric                                                                     | Value     | Formula
A        | Daily spam volume                                                          | 2,000,000 |
B        | New vendor’s effectiveness                                                 | 95%       |
C        | Number of end users                                                        | 5,000     |
D        | Spam being blocked at gateway                                              | 1,900,000 | =A*B
E        | Spam not blocked, that gets through the gateway and hits your mail servers | 100,000   | =A-D
F        | Average number of spam messages/day that get to an end user’s inbox        | 20        | =E/C
So the results are as follows:
Solution                           | False Negatives in End User’s Inbox | Load on Exchange Servers
Old solution                       | 10                                  | 50,000
New solution (when first deployed) | 6                                   | 30,000
New solution (one year later)      | 20                                  | 100,000
No wonder the helpdesk phone is ringing off the hook. There are several observations to make here:
o The typical end user perceives a more than threefold increase in the amount of spam in their inbox.
o Furthermore, they complain that they are getting more spam than they did with the old anti-spam solution. Why did you make the wrong selection, they ask you.
o Your Exchange mail servers are also straining under increased load, and you are spending valuable resources trying to keep them performing well.
o All of this occurs even though your anti-spam effectiveness at the gateway increased from 94% to 95%. You are actually blocking about four times as much spam as you did before (1,900,000 versus 470,000 messages per day)!
You decide to solve this situation by educating your users through newsletters, seminars, etc. This certainly
has a positive effect, and is in fact a best practice to follow. Your end users empathize with your situation.
They tell you they understand what is going on, but that 20 spam messages per day is a bit too much for
them to handle. Can’t you do anything about it?
Scenario 3: The Need for Extreme Anti-spam Effectiveness
In the final scenario, let’s look at what it would take to solve this problem with technology. Your end users
are complaining due to a perceived decrease in effectiveness and the rising number of spam emails in their
inboxes. You can’t control the spam volumes hitting your organization, but you can deploy a system with
increased effectiveness. “Good” just won’t cut it, though: an extraordinarily high effectiveness in the range of
98% to 99% is required.
Let’s look again at the numbers:
Variable | Metric                                                                     | Value            | Formula
A        | Daily spam volume                                                          | 2,000,000        |
B1       | Minimum new effectiveness                                                  | 98%              |
B2       | Extreme effectiveness                                                      | 99%              |
C        | Number of end users                                                        | 5,000            |
D1 to D2 | Spam being blocked at gateway                                              | 1.96M to 1.98M   | =A*B1 to =A*B2
E1 to E2 | Spam not blocked, that gets through the gateway and hits your mail servers | 20,000 to 40,000 | =A-D2 to =A-D1
F1 to F2 | Average number of spam messages/day that get to end user’s inbox           | 4 to 8           | =E2/C to E1/C
So the results are as follows:
Solution                           | False Negatives in End User’s Inbox | Load on Exchange Servers
Old solution                       | 10                                  | 50,000
New solution (when first deployed) | 6                                   | 30,000
New solution (one year later)      | 20                                  | 100,000
Solution with 98% Effectiveness    | 8                                   | 40,000
Solution with 99% Effectiveness    | 4                                   | 20,000
As you can see, it is possible to get the situation under control and back to spam levels comparable to the
“good old days,” but it requires a solution with extremely high effectiveness. Anti-spam effectiveness must
be, at a minimum, 98%... and 99% or better effectiveness is clearly preferable.
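The arithmetic behind the three scenarios can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and exact fractions are used so that the results match the tables without rounding noise. The second function inverts the calculation to ask what effectiveness would be needed to hold each inbox to a target number of spam messages per day.

```python
from fractions import Fraction

def spam_per_user(daily_spam, effectiveness_pct, users):
    """Average spam messages/day reaching each inbox: F = (A - A*B) / C."""
    blocked = daily_spam * Fraction(effectiveness_pct, 100)  # D = A*B
    passed = daily_spam - blocked                            # E = A-D
    return passed / users                                    # F = E/C

def required_effectiveness(daily_spam, users, target_per_user):
    """Effectiveness (as a fraction) at which each inbox receives exactly the target."""
    return 1 - Fraction(target_per_user * users, daily_spam)

print(spam_per_user(500_000, 94, 5_000))      # Scenario 1, new solution
print(spam_per_user(2_000_000, 95, 5_000))    # Scenario 2, one year later
print(float(required_effectiveness(2_000_000, 5_000, 6)))
```

Under these assumptions, getting back to the six messages per user per day of Scenario 1 at a volume of 2 million messages per day would take 98.5% effectiveness, which sits inside the 98% to 99% range discussed above.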
This level of anti-spam effectiveness is possible, but only by using the most advanced technologies such as
Proofpoint MLX, which is continually being enhanced to incorporate innovative new anti-spam techniques
and intelligently responds to evolving spam conditions.
Winning the Battle Against Image- and Attachment-based Spam
Because so much of today’s spam volume increase is due to image- and attachment-based spam being
sent by botnets, the best performing anti-spam solutions are those that correctly identify and block such
messages. In today’s environment, effective protection against attachment-based spam is a fundamental
requirement for successful anti-spam solutions.
The use of attachments and images in spam is not new. Spammers switched from using referenced URLs
for images to embedding images for two reasons. First, mail clients have become smarter, refusing to
download images unless they are explicitly told to do so. For example, Microsoft Outlook and Google’s
Gmail do not display images that are referenced through URLs, unless the user has clicked on a button or
configured a setting, thereby ordering the client to download the image. Second, the cost of computing
resources available to spammers has declined, directly through Moore’s Law and indirectly through the use
of botnets.
Let’s take a closer look at some of the techniques used in image-based spam to better understand the sort
of randomization, “personalization” and obfuscation techniques used in nearly all spam campaigns today.
Image-Based Spam Obfuscation and Evasion Methods
Today’s approach of embedding images has proven to be successful, especially against Bayesian-based and
signature-based products. Many spam filters are not able to correctly classify these messages.
The illustration below shows five images that were extracted from five different spam messages. They are clearly
all from the same spam campaign and, at first glance, the images appear identical.
Figure 5: To the human eye, these spam images are identical... But hidden differences confuse spam
filters.
However, if we take a digital “signature” of each image—using the same sort of technique that a simple
spam filter would use to compare the images—we find that each of them has a different electronic signature, as follows:
Image | Image Signature
1     | 122413e0682085f68c2b947a53af02cc
2     | 28de627c92a20b1043deebfa5f7715f8
3     | 6280188bd69ab41fd9764df2a10978f5
4     | 6e8a670f65570b1daf52dd3ae10c3a4c
5     | e3bdd4b0073a502544df4f07647764db
To an unsophisticated, signature-based spam filter, these images are completely different!
That’s no comfort to your end users, however. One of them might receive this “same” spam five times over
the course of five days and call your help desk asking, “This message is obviously spam, and I keep getting
it! Why isn’t the filter catching it?”
A signature-based anti-spam filter might identify the first image as spammy based on a submission from
an end user, but as it continues to look for that exact image, it will never again see an exact match. Each image-based spam is subtly different. By randomizing or obfuscating the image used in each individual spam
email, the spammers are able to successfully bypass simple filtering methods. It’s impossible for a filter to
predict all obfuscations of an image using signature-based approaches or simple Bayesian technology.
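A minimal sketch of why exact signatures fail follows. The whitepaper does not name the signature algorithm; the 32-hex-digit values in the table are the length of a 128-bit digest, so MD5 is assumed here purely for illustration. The byte string stands in for an image file, and a single flipped bit stands in for one altered border pixel.

```python
import hashlib

def exact_signature(image_bytes: bytes) -> str:
    """Digest of the raw bytes; MD5 is an assumption chosen for illustration."""
    return hashlib.md5(image_bytes).hexdigest()

# Stand-in for an image file: any byte string works for the demonstration.
original = bytes(range(256)) * 16
variant = bytearray(original)
variant[100] ^= 0x01  # flip a single bit, e.g. one randomized border pixel

sig_a = exact_signature(original)
sig_b = exact_signature(bytes(variant))
print(sig_a)
print(sig_b)
print("match" if sig_a == sig_b else "no match")
```

Because a cryptographic digest is designed to change completely when any input bit changes, the two signatures share nothing useful, even though the two "images" differ by one bit.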
But back to those images: How is it that each of them is unique? Let’s look at just a few of the techniques
spammers are using to confuse spam filters using images.
Randomized Image Borders
A thin border of different colors and pixel width is automatically placed (e.g., auto-generated by spamming software) around the image. The Proofpoint Attack Response Center has seen this technique used
frequently with “pump and dump” spam stock pitches. For example, the border that we’ve zoomed in on
in Figure 6 is unique for each individual spam message.
Figure 6: A stock pitch image-based spam with randomized border. Each individual spam message
sent as part of this campaign includes an image with a unique, pixelated border, making the images
resistant to detection by simple signatures.
Figure 7: More image-based stock pitch spam. This image shows multiple obfuscation techniques
being used in the same spam message. The spam’s “payload” is the information contained in the image
at the top of the message. Taking a closer look at this image, we see that it has a very subtle background pattern of randomized lines. The small circle shows a zoomed in area where we’ve used image
enhancement to reveal the random pattern of lines. The larger circle shows another area of the spam
where we’ve zoomed in on some randomized pixels, showing another technique the spammer has used
to make image de-obfuscation more difficult. Finally, note that the spammer has combined image- and
text-based obfuscation techniques by including randomized text to bypass text signatures and less
sophisticated Bayesian filters.
Randomized Pixels Between Paragraphs
In this technique, randomized pixels or other types of high-frequency noise are inserted in between paragraphs (see large circular highlight in Figure 7, above). This is an attempt by spammers to make image
de-obfuscation more difficult.
There are many variants on this technique whereby visible, apparently random, distortions are added to
individual spam images to make them unique so they can slip past signature-based filters.
Randomized Colortable Entries to Obfuscate the Image
Since most image-based spam messages don’t need many colors, unused color map entries are another
place where spammers can insert invisible obfuscations. For example, the spammer enters random values
into all of the unused color map entries for each individual spam message. This maintains the visual integrity of the image, but changes its invisible structure and makes the actual content of each image unique.
In the figures above, fewer than 10 colors are actually used in the visible image. If this is a 256-color GIF
image, the spammer then has 246 unused color map entries (i.e., 256 - 10), or 738 bytes at three bytes per entry, that can be safely randomized while still
ensuring that the image “looks” identical.
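From the defender’s side, one way to neutralize this trick is to blank out the unused color map entries before computing a signature. The sketch below is an assumption-laden illustration, not Proofpoint’s method: it reads the global color map location from the GIF header (each entry is three bytes: red, green, blue), takes the set of color indices actually referenced by pixels as an input that a real decoder would have to supply, and uses a synthetic, non-decodable byte string in place of a real GIF. All function names are hypothetical.

```python
import hashlib
from typing import Iterable, Tuple

GCT_OFFSET = 13  # 6-byte signature/version + 7-byte logical screen descriptor

def global_color_table(gif: bytes) -> Tuple[int, int]:
    """Return (byte offset, number of entries) of the global color map."""
    if gif[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    packed = gif[10]
    if not packed & 0x80:  # high bit set means a global color map is present
        return GCT_OFFSET, 0
    return GCT_OFFSET, 2 ** ((packed & 0x07) + 1)

def normalized_signature(gif: bytes, used_indices: Iterable[int]) -> str:
    """Hash the file after zeroing color entries that no pixel references."""
    offset, entries = global_color_table(gif)
    used = set(used_indices)
    data = bytearray(gif)
    for index in range(entries):
        if index not in used:
            start = offset + 3 * index  # each entry is 3 bytes (R, G, B)
            data[start:start + 3] = b"\x00\x00\x00"
    return hashlib.md5(bytes(data)).hexdigest()

def fake_gif(filler: int) -> bytes:
    """Synthetic GIF prefix: 10 real colors, 246 unused entries set to `filler`."""
    header = b"GIF89a" + bytes([16, 0, 16, 0, 0xF7, 0, 0])
    palette = bytes(range(30)) + bytes([filler]) * (246 * 3)
    return header + palette + b"raster-data-placeholder"

a, b = fake_gif(0x11), fake_gif(0x22)
print(hashlib.md5(a).hexdigest() == hashlib.md5(b).hexdigest())                 # raw signatures
print(normalized_signature(a, range(10)) == normalized_signature(b, range(10)))  # normalized
```

The two synthetic files differ only in their 246 unused entries, so their raw digests differ while their normalized digests agree.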
As shown in Figure 8, even something as simple as the GIF format offers plenty of opportunities for such
mischief. For example, randomized data can also be inserted into the GIF terminator part of the file. Alternatively, randomized borders, lines, pixels or other graphical features can be inserted into the visible portion
of the image.
GIF Signature
Screen Descriptor
Global Color Map: Randomizations can be placed in unused color map entries (typically invisible)
Image Descriptor
Local Color Map: Randomizations can be placed in unused color map entries (typically invisible)
Raster (Image) Data: Many types of visible randomizations and obfuscations can be inserted here
GIF Terminator: Randomizations can also be inserted here (invisible)
Figure 8: This is a schematic of the GIF file format. Note that, even in this relatively simple graphical
file format, there are many opportunities for inserting randomizations and obfuscations.
Animated GIF with Embedded Spam Image
Another interesting technique that has emerged recently is the inclusion of a spam payload as a single
frame in an animated GIF. The other frames serve as “red herrings” to confuse anti-spam software. When
the email recipient views the image, the frames with small time values quickly cycle through, and the user
is presented with the image with the longest time sequence (which is, of course, the “spammy” image).
See Figure 9, below.
A Spammy Animated GIF
Frame 1: Displayed for 0.1 Sec
Frame 2 (Spam Payload): Displayed for 250 Sec
Frame n...: Displayed for 0.1 Sec
Figure 9: Animated GIFs can be used to “hide” a spam payload from most spam filters. Decoy images display for an almost imperceptible amount of time while the payload image is displayed for long
periods. Advanced forms of analysis are required to properly identify such files as spam.
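One simple temporal check suggested by this pattern is to ask whether a single frame accounts for nearly all of the animation’s display time. The sketch below assumes the per-frame delays have already been extracted by a GIF decoder; the function name and the 95% threshold are hypothetical choices for illustration and are not taken from Proofpoint MLX.

```python
def dominant_frame(delays_sec, share_threshold=0.95):
    """Return the index of a frame that dominates display time, or None."""
    total = sum(delays_sec)
    if len(delays_sec) < 2 or total <= 0:
        return None  # a single-frame image has no decoys to hide behind
    longest = max(range(len(delays_sec)), key=lambda i: delays_sec[i])
    if delays_sec[longest] / total >= share_threshold:
        return longest
    return None

print(dominant_frame([0.1, 250, 0.1]))   # timing pattern from the figure
print(dominant_frame([0.1, 0.1, 0.1]))   # an ordinary, evenly paced animation
```

A filter could then focus its image analysis on the dominant frame rather than on the decoys. On its own, such a timing check would only be one weak signal among many.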
Image Segmentation
This image-based spam technique obfuscates an image by breaking up the base image into a random
assortment of smaller images. Each spam campaign will use the same image, but the sub-images are different. Automated software disassembles the image into its random parts and composes the HTML code
that holds them together. Of course, to most anti-spam filters, each individual image looks entirely different from the base image, and from every other sub-image found in individual emails that are part of the
spam campaign.
Panels: Base Image; Image subdivided into 7 sub-images; Image subdivided into 12 sub-images
Figure 10: Image spam payloads are sometimes broken up into randomly sized sub images that display properly when presented in a mail client, thanks to clever HTML coding.
OCR-resistant Images
This technique uses animated GIFs to break up an image into different overlapping frames of “broken”
text. All frames, except for the last, quickly cycle through. Each ensuing frame after the first adds pixels to
complete the words in the image. The ensuing frames also contain transparent colors to ensure that the
frames underneath are visible. Only the parts required to complete the image are transparent. When all
frames have cycled through, they appear stacked on top of each other, and thus the image becomes legible,
as shown in Figure 11, next page. This technique is designed to make the text in the image resistant to spam
filters that use OCR (optical character recognition).
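The stacking effect can be illustrated with a toy model in which each "pixel" is a single character and a period marks a transparent pixel. This is a deliberately simplified sketch (real frames are bitmaps, and the function name is hypothetical), but it shows why any one frame is unreadable on its own while the composited result is legible.

```python
TRANSPARENT = "."

def composite(frames, width, height):
    """Draw frames in order onto a blank canvas, skipping transparent pixels."""
    canvas = [[" "] * width for _ in range(height)]
    for frame in frames:  # later frames are drawn on top of earlier ones
        for y, row in enumerate(frame):
            for x, pixel in enumerate(row):
                if pixel != TRANSPARENT:
                    canvas[y][x] = pixel
    return ["".join(row) for row in canvas]

frame_1 = ["B.Y.N.W"]  # "broken" text: every other character is missing
frame_2 = [".U...O."]  # fills in the gaps, transparent everywhere else

print(composite([frame_1], 7, 1))           # what a per-frame OCR pass would see
print(composite([frame_1, frame_2], 7, 1))  # what the recipient ends up seeing
```

A filter that runs OCR on frames individually sees only fragments; one that composites the frames first, as a mail client does, recovers the full text.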
Optical character recognition is still an extremely compute-intensive technique, and Proofpoint’s research
in this area has revealed that OCR itself is only minimally effective in the fight against image-based spam.
Current versions of Proofpoint MLX do not use literal OCR techniques. However, Proofpoint has developed
proprietary, OCR-like techniques for analyzing image color spaces and other image attributes that help
MLX differentiate between valid and spammy images in a highly efficient (i.e., minimal CPU overhead)
manner.
Combining Image-based and Text-based Techniques
Spammers typically execute their attacks using a combination of techniques in the same campaign. With
the flexibility available to them with image-based spam, they are able to evade many first and second
generation filters.
Refer back to Figure 7 and you’ll see that this spam combines at least three different obfuscation techniques. The top part of the spam is an embedded image. The image uses two of the obfuscation techniques
discussed earlier. First, the image has random pixels between paragraphs. Second, it incorporates randomly spaced lines in the image background that serve to further differentiate it from the base image. The third
technique used is a text technique. Underneath the image is a large amount of “Bayesian busting” text that
is likely to be interpreted as legitimate content.
This approach of mixing legitimate-sounding language with an image-based spam payload is a sophisticated but very common way to bypass Bayesian- and signature-based filters in one fell swoop. Not only is
each image different, but most Bayesian filters will overcompensate in their scoring by treating the legitimate-sounding language as a “clue” that this email is legitimate.
Winning the Battle Against Image-based Spam
The most effective techniques against image-based spam are those that are machine learning based.
Proofpoint continues to be at the forefront in the battle against image-based spam—from both primary
research and practical development perspectives. Proofpoint’s MLX machine learning technology applies
artificial intelligence and advanced image analysis methods to the problem of correctly identifying image-based spam.
Proofpoint’s MLX-based image spam detection technologies protect against all the techniques mentioned
above as well as many other classes of images. Just a few of the patent-pending image-based analysis
techniques used in Proofpoint MLX include:
o Fuzzy matching for obfuscated images: Proofpoint MLX detects obfuscated spam images by
using techniques that mimic the way human beings perceive spam. Proofpoint has developed a
variety of highly effective, but minimally compute intensive techniques for stripping out randomized
borders, ignoring high-frequency randomized noise and analyzing image colormap entries to “see
through” obfuscation tricks used by today’s image spammers.
o Dynamic spam image detection: Proofpoint software and appliances work locally to analyze incoming messages that contain images and track the number of similar (or similarly obfuscated) images. If the volume of these images exceeds a certain threshold, Proofpoint MLX classifies the image
as “spammy,” similar to the way that Proofpoint MLX Dynamic Reputation monitors for malicious
IP-level connections.
o Animated GIF spam detection: Proofpoint MLX analyzes the structural and temporal attributes
of animated images to help identify those with spam characteristics.
o Dynamic botnet protection: Proofpoint MLX Dynamic Reputation continually profiles IP-level
connections and source IP addresses, monitoring for activity characteristic of botnets. When botnet
IPs are detected, Proofpoint MLX automatically rejects image-based and other types of spam from
those sources.
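The first two techniques in the list above can be sketched together. The code below is a simplified illustration under stated assumptions, not Proofpoint’s patent-pending implementation: the fingerprint crops a fixed-width border, averages the remaining pixels into a coarse grid so that isolated noise pixels wash out, and thresholds each cell against the mean; the tracker then counts fingerprints and flags one once its volume exceeds a threshold. The image size, border width, grid size and all names are hypothetical.

```python
from collections import Counter

def fuzzy_fingerprint(pixels, border=2, grid=3):
    """Coarse bit string for a square grayscale image (rows of 0-255 values)."""
    inner = [row[border:len(row) - border]
             for row in pixels[border:len(pixels) - border]]  # strip the border
    block = len(inner) // grid
    averages = []
    for by in range(grid):
        for bx in range(grid):
            cells = [inner[y][x]
                     for y in range(by * block, (by + 1) * block)
                     for x in range(bx * block, (bx + 1) * block)]
            averages.append(sum(cells) / len(cells))  # averaging damps pixel noise
    mean = sum(averages) / len(averages)
    return "".join("1" if value > mean else "0" for value in averages)

class ImageVolumeTracker:
    """Flags a fingerprint once it has been seen more than `threshold` times."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()

    def observe(self, fingerprint):
        self.counts[fingerprint] += 1
        return self.counts[fingerprint] > self.threshold

def make_image(border_value, noise_at, noise_value):
    """16x16 test image: dark stripe on the left, custom border, one noise pixel."""
    image = [[border_value] * 16 for _ in range(16)]
    for y in range(2, 14):
        for x in range(2, 14):
            image[y][x] = 0 if x < 6 else 255
    noise_y, noise_x = noise_at
    image[noise_y][noise_x] = noise_value
    return image

variant_1 = make_image(border_value=200, noise_at=(7, 7), noise_value=0)
variant_2 = make_image(border_value=17, noise_at=(3, 3), noise_value=255)
print(fuzzy_fingerprint(variant_1), fuzzy_fingerprint(variant_2))
```

Both variants reduce to the same fingerprint despite different borders and noise pixels, so a tracker fed these fingerprints counts them as one image rather than as unrelated ones.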
An OCR-resistant Animated GIF Spam
Frame 1: Contains “broken” text with transparent background
Frame 2: Contains “broken” text that displays through the transparent regions of Frame 1
Visible Result: Animated frames combine visually to reveal the spam “payload”
Figure 11: An animated GIF image-based spam that uses an advanced technique to evade filters that
use optical character recognition (OCR), which attempt to extract the text payload from an image. In
this example, the spam payload is broken up into separate frames of an animated GIF. By themselves,
each frame consists of only fractional “broken” text which can’t be read by OCR. As the animation
plays, the frames combine to form the human-readable spam payload (yet another pump and dump
stock pitch).
Of course, these image-specific techniques work hand-in-hand with the hundreds of thousands of other
message attributes analyzed by Proofpoint MLX. As Proofpoint’s automated machine learning systems
and Proofpoint Attack Response Center staff identify new image-based spamming techniques and other
threats, MLX engine updates are automatically delivered to customers’ local Proofpoint servers. These
updates are automatically and immediately available—without requiring any administrator intervention,
manual updates or system upgrades—ensuring that your organization is always protected against the latest threats.
Ongoing MLX Research and Development for Attachment-based Spam
In order to maintain its edge on spam detection, Proofpoint continues to invest heavily in research and
development.
Proofpoint’s spam research arm, the Proofpoint Attack Response Center, has developed sophisticated machine learning based “agents” that can reliably identify new classes of image-, attachment- and text-based
spam. These predictive systems are also capable of automatically responding to new threats as they appear,
automatically creating new versions of the Proofpoint MLX engine, testing new algorithms for effectiveness and delivering those enhancements to customer sites, without the need for human intervention.
This technology, which is unique to Proofpoint, enables extremely rapid response to new forms of spam
and optimizes the number of updates delivered to Proofpoint SaaS services and local Proofpoint servers in
such a way that they have zero negative impact on performance.
Proofpoint’s backend systems employ a wide variety of machine learning techniques, including methods
that are especially helpful in the identification of attachment-based spam:
o Automated Image Extraction Threshold Analysis: A technique that automatically detects images being used in new spam campaigns. It works by looking at high frequency variations of an
image. If a large number of images with subtle differences are detected, that image is added as a
potential “spammy” image and included in the MLX engine. This detection and automatic retraining
of the MLX engine is performed in real time.
o Predominant Correlation: Information gain is a technique used to identify the very best attributes (or clues) to use in detecting spam versus valid mail. From the millions of available attributes,
information gain selects those that are most valuable. Proofpoint has taken this technique a step
further with the introduction of predominant correlation-based attribute selection. This technique
allows Proofpoint MLX to identify attributes that are redundant and automatically remove them,
ensuring that only the most effective indicators of spam are considered. This intelligent approach
to attribute analysis maximizes effectiveness (the system’s ability to accurately detect spam) and
performance (the system’s ability to rapidly process messages) at the same time.
o URL Analysis Techniques: Proofpoint’s backend systems perform statistical analyses of URLs
from Proofpoint honeypots and customer sites, coupled with correlative analysis of URLs and the
IP addresses hosting them. By using advanced network analysis techniques, Proofpoint MLX can
determine if a sending IP address is associated with a known malicious URL or suspicious ISP and
use these associations as a strong indicator of spam.
o Clustering/Automation: Proofpoint MLX uses advanced automation that clusters messages in
large data sets and extracts common elements, speeding up the process of identifying new spam attributes. This results in higher effectiveness and faster responses to new spam attacks of all types.
o Hadoop MapReduce Processing of Very Large Data Sets: Proofpoint is using cutting-edge
technologies such as Apache Hadoop MapReduce to process extremely large amounts of data using
distributed computing resources. These distributed computing capabilities enable Proofpoint to analyze statistics and trends more quickly and comprehensively, which in turn strengthens Proofpoint
MLX’s ability to respond quickly to new spam campaigns.
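The attribute selection idea described under Predominant Correlation can be sketched as follows. Information gain is computed in the standard way, as the reduction in label entropy once an attribute’s value is known. The redundancy step is an assumption: it follows the published Fast Correlation-Based Filter approach, which uses symmetrical uncertainty and the notion of predominant correlation, and the whitepaper does not confirm that MLX works this way. The attribute names and data are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def information_gain(attribute, labels):
    """H(labels) - H(labels | attribute)."""
    n = len(labels)
    conditional = 0.0
    for value in set(attribute):
        subset = [y for x, y in zip(attribute, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

def symmetrical_uncertainty(a, b):
    """Information gain normalized to the range 0..1."""
    denominator = entropy(a) + entropy(b)
    return 0.0 if denominator == 0 else 2 * information_gain(a, b) / denominator

def select_attributes(attributes, labels):
    """Keep relevant attributes that no stronger kept attribute makes redundant."""
    relevance = {name: symmetrical_uncertainty(values, labels)
                 for name, values in attributes.items()}
    kept = []
    for name in sorted(attributes, key=relevance.get, reverse=True):
        if relevance[name] <= 1e-12:
            continue  # carries no information about spam versus valid mail
        if all(symmetrical_uncertainty(attributes[name], attributes[other])
               < relevance[name] for other in kept):
            kept.append(name)
    return kept

labels = [1, 1, 1, 1, 0, 0, 0, 0]                    # 1 = spam, 0 = valid
attributes = {
    "has_inline_image": [1, 1, 1, 0, 0, 0, 0, 0],    # informative
    "has_gif_part":     [1, 1, 1, 0, 0, 0, 0, 0],    # duplicate of the above
    "even_length":      [1, 0, 1, 0, 1, 0, 1, 0],    # unrelated to the label
}
print(select_attributes(attributes, labels))
```

In this toy data set the duplicate attribute is dropped as redundant and the unrelated one is dropped as uninformative, leaving a single indicator.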
Proofpoint Attack Response Center
Talking about machine learning is relatively easy. Developing an enterprise-class anti-spam solution that
effectively leverages machine learning techniques and the best traditional techniques requires a major R&D
investment and a world-class team.
The Proofpoint Attack Response Center is a collection of dedicated professionals and automated systems
that monitor the Internet for new spam attacks, virus outbreaks and other anomalous activities. Data from
the Proofpoint Attack Response Center is used to refine Proofpoint MLX models, develop new machine
learning-based security technologies, power new services such as Proofpoint Zero-Hour Anti-Virus and
address emerging threats. The updates created by Proofpoint researchers and their automated systems are
automatically delivered to Proofpoint customer sites via the Proofpoint Dynamic Update service, ensuring
that the most accurate statistical models and machine learning classifiers are always used.
A world-class team of scientists has been assembled for the Proofpoint Attack Response Center, unparalleled in its cross-disciplinary depth and breadth. The team consists of researchers and engineers with deep
roots across several relevant disciplines including machine learning, statistics, natural language processing,
information classification, messaging and security.
The Proofpoint Attack Response Center brings together the expertise and resources necessary to ensure
that Proofpoint solutions continue to set the standards that the rest of the industry strives to match.
Conclusion
The Proofpoint MLX system is continually training to detect the latest forms of spam. Information is
fed back into the system to enable it to automatically tune its spam attributes, statistical processes and
classifications. Rather than relying on any one technology, the MLX engine dynamically chooses the most
effective set of attributes and models to process each message.
Proofpoint MLX technology:
o Continuously adapts: to detect new types of spam without manual intervention—the system’s
ability to identify spam does not degrade as spammers change their tactics.
o Employs next generation machine learning techniques: including logistic regression and information gain techniques to build large-scale statistical models that accurately represent dependencies among spam attributes and delineate the boundary between spam and valid messages.
o Includes image- and attachment-specific machine learning techniques: to accurately identify even the most sophisticated spam messages. Proofpoint continues to identify the latest attachment-based spamming techniques and has built technology to handle these threats proactively and
predictably. As new techniques emerge, Proofpoint delivers the latest spam detection technologies
to customers automatically.
o Analyzes more than 1,000,000 spam attributes: including message envelope and header characteristics as well as the actual message and attachment content to accurately classify messages
and ensure a low rate of false positives.
o Ensures the maximum protection today and improves in performance: even as spam evolves.
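For readers unfamiliar with logistic regression, the following toy sketch shows the general idea of learning a weighted combination of attributes that separates spam from valid mail. It is a textbook gradient-descent implementation on four made-up messages with two hypothetical attributes, and says nothing about how the MLX engine itself is built or trained.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def spam_probability(weights, bias, x):
    """Logistic model: probability that attribute vector x is spam."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

def train(samples, labels, epochs=500, rate=0.5):
    """Batch gradient descent on the mean logistic loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = spam_probability(weights, bias, x) - y
            grad_b += error
            for i, v in enumerate(x):
                grad_w[i] += error * v
        weights = [w - rate * g / n for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b / n
    return weights, bias

# Hypothetical attributes: [body is a single image, message has a reply header]
samples = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = valid

weights, bias = train(samples, labels)
print(round(spam_probability(weights, bias, [1, 0]), 3))
print(round(spam_probability(weights, bias, [0, 1]), 3))
```

The trained model assigns a high spam probability to messages with the first attribute and a low one to messages without it; a production system would apply the same principle across a far larger attribute set.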
Notes
1. Ferris Research: http://www.ferris.com/?p=322011.
2. “The True Corporate and Consumer Cost of Phishing,” http://blog.epostmarks.com/team-blog/2009/4/4/the-truecorporate-and-consumer-cost-of-phishing.html
3. http://news.cnet.com/8301-1009_3-10445723-83.html
4. Facebook’s announcement that they have 350 million members: http://blog.facebook.com/blog.php?post=190423927130
5. http://blog.nielsen.com/nielsenwire/global/led-by-facebook-twitter-global-time-spent-on-social-media-sites-up-82year-over-year/
6. http://socialmediaatwork.com/2010/02/11/social-networks-now-account-for-11-of-us-traffic/
7. http://socialmediaatwork.com/2010/02/16/facebook-now-drives-more-traffic-to-web-sites-than-google/
8. http://www.mediapost.com/publications/?fa=Articles.showArticle&art_aid=121930
9. For an example of a multimedia attack distributed through a social media network, see http://www.mashable.com/2009/10/01/new-facebook-attack/
10. https://www.networkworld.com/news/2009/121509-10-predictions-for-2010-kaminsky.html?page=2
11. http://blogs.zdnet.com/security/?p=4996&tag=content;col1
Additional Resources
To learn more about Proofpoint MLX technology and Proofpoint’s email security solutions, please consult
the following online resources.
Webinar Replay: Defend Against Blended Threats
Blended Web and email threats are becoming increasingly complex and represent a huge potential risk
to your organization, your customers and your employees. Watch this webinar replay featuring Web and
email security experts from Blue Coat and Proofpoint and learn how you can:
o Defend your organization from email and Web threats, minimizing threats.
o Protect confidential corporate data and personal information.
o Use real-time intelligence and reputation data to safeguard your organization from malicious content.
Register to view this web seminar replay by visiting:
http://www.proofpoint.com/id/blended-threats-webinar-0309/index.php
Learn More About SaaS Email Security and Email Archiving
Learn more about how Proofpoint’s Software-as-a-Service email security and email archiving solutions
deliver maximum security at the lowest total cost of ownership. Download two Osterman Research whitepapers, Using SaaS to Reduce the Costs of Email Security and Email Archiving: Realizing the Cost Savings and
Other Benefits from SaaS, by visiting:
http://www.proofpoint.com/tco
Research Paper on Spam Obfuscation Tactics
Proofpoint’s research activities are ongoing and Proofpoint researchers regularly publish papers on their
research. For example, the original research behind some of the obfuscation detection techniques described
in this whitepaper is described in the following paper that Proofpoint scientists presented at the 2006
Virus Bulletin conference. Download a copy of this paper by visiting:
http://www.proofpoint.com/id/vb2006/index.php
Proofpoint Online Resource Center
Find the latest Proofpoint datasheets, whitepapers, webinars and more in the Proofpoint Resource Center.
Please visit:
http://www.proofpoint.com/resources
Proofpoint Email Security Blog
Stay up to date on the latest trends in email security by subscribing to our email security blog or by following us on Twitter:
http://blog.proofpoint.com
http://www.twitter.com/Proofpoint_Inc
About Proofpoint, Inc.
Proofpoint provides unified email security, data loss prevention and email archiving solutions that help
enterprises, universities, government organizations and ISPs defend against spam and viruses, prevent leaks
of confidential and private information, encrypt sensitive emails and comply with regulations that affect
email use. Proofpoint’s products are controlled by a single management and policy console and are powered by Proofpoint MLX™ technology, an advanced machine learning system developed by Proofpoint
scientists and engineers. Proofpoint solutions can be deployed in hosted service, hardware appliance, virtual appliance, software and hybrid models, for maximum flexibility and scalability. For more information,
please visit http://www.proofpoint.com.
US Worldwide
Headquarters
Proofpoint, Inc.
892 Ross Drive
Sunnyvale, CA 94089
United States
Tel +1 408 517 4710
US Utah Satellite
Office
Proofpoint, Inc.
13997 South Minuteman
Drive, Suite 320
Draper, UT 84020
United States
Tel +1 801 748 4610
Asia Pacific
Proofpoint APAC
5th Floor, Q.House
Convent Bldg.
38 Convent Road,
Silom, Bangrak
Bangkok 10500,
Thailand
Tel +66 2 632 2997
EMEA
Proofpoint, Ltd.
The Oxford Science Park
Magdalen Centre
Robert Robinson Avenue
Oxford, UK
OX4 4GA
Tel +44 (0) 870 803 0704
Japan
Proofpoint Japan K.K.
BUREX Kojimachi
Kojimachi 3-5-2,
Chiyoda-ku
Tokyo, 102-0083
Japan
Tel +81 3 5210 3611
Canada
Proofpoint Canada
210 King Street East,
Suite 300
Toronto, Ontario,
M5A 1J7
Canada
Tel +1 647 436 1036
Mexico
Proofpoint Mexico
Salaverry 1199
Col. Zacatenco
CP 07360
México D.F.
Tel: +52 55 5905 5306
www.proofpoint.com
©2010 Proofpoint, Inc. All rights reserved. 03/10 Rev A