Automating Malware Detection by Inferring Intent

Transcription

Automating Malware Detection by Inferring Intent
Weidong Cui
Electrical Engineering and Computer Sciences
University of California at Berkeley
Technical Report No. UCB/EECS-2006-115
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-115.html
September 14, 2006
Copyright © 2006, by the author(s).
All rights reserved.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission.
Automating Malware Detection by Inferring Intent
by
Weidong Cui
B.E. (Tsinghua University) 1998
M.E. (Tsinghua University) 2000
M.S. (University of California, Berkeley) 2003
A dissertation submitted in partial satisfaction
of the requirements for the degree of
Doctor of Philosophy
in
Engineering-Electrical Engineering and Computer Sciences
in the
GRADUATE DIVISION
of the
UNIVERSITY OF CALIFORNIA, BERKELEY
Committee in charge:
Professor Randy H. Katz, Chair
Dr. Vern Paxson
Professor David A. Wagner
Professor John Chuang
Fall 2006
Automating Malware Detection by Inferring Intent
Copyright © 2006 by Weidong Cui
Abstract
Automating Malware Detection by Inferring Intent
by
Weidong Cui
Doctor of Philosophy in Engineering-Electrical Engineering and Computer Sciences
University of California, Berkeley
Professor Randy H. Katz, Chair
An increasing variety of malware like worms, spyware and adware threatens both personal and
business computing. Modern malware has two features: (1) malware evolves rapidly; (2) self-propagating malware can spread very fast. These features lead to a strong need for automatic actions
against new unknown malware. In this thesis, we aim to develop new techniques and systems to
automate the detection of new unknown malware because detection is the first step for any reaction.
Since there is no single panacea that could be used to detect all malware in every environment,
we focus on one important environment, personal computers, and one important type of malware,
computer worms.
To tackle the problem of automatic malware detection, we face two fundamental challenges:
false alarms and scalability. We take two new approaches to solve these challenges. To minimize
false alarms, our approach is to infer the intent of the user or the adversary (the malware author), because most benign software running on personal computers is user-driven, and the authors behind different kinds of malware have distinct intent. To achieve early detection of fast-spreading Internet worms,
we must monitor the Internet from a large number of vantage points, which leads to the scalability
problem—how to filter repeated probes. Our approach is to leverage protocol-independent replay
of application dialog, a new technology which, given examples of an application session, can mimic
both the initiator and responder sides of the session for a wide variety of application protocols
without requiring any specifics about the particular application it mimics. We use replay to filter
frequent multi-stage attacks by replaying the server side responses.
To evaluate the effectiveness of our new approaches, we develop the following systems:
(1) BINDER, a host-based detection system that can detect a wide class of malware on personal computers by identifying extrusions, malicious outbound network requests which the
user did not intend; (2) GQ, a large-scale, high-fidelity honeyfarm system that can capture Internet worms by analyzing in real-time the scanning probes seen on a quarter million Internet addresses, with emphases on isolation, stringent control, and wide coverage.
Professor Randy H. Katz
Dissertation Committee Chair
To my parents and Li.
Contents

List of Figures
List of Tables
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Problem Definition and Challenges
  1.3 Contributions and Structure of This Thesis
2 Related Work
  2.1 Intrusion Detection
    2.1.1 Host-based Intrusion Detection
    2.1.2 Network-based Intrusion Detection
  2.2 Internet Epidemiology and Defenses
    2.2.1 Overview
    2.2.2 Network Telescopes
    2.2.3 Honeypots
    2.2.4 Defenses
  2.3 Summary
3 Extrusion-based Malware Detection for Personal Computers
  3.1 Motivation
  3.2 Design
    3.2.1 Overview
    3.2.2 Inferring User Intent
    3.2.3 Extrusion Detection Algorithm
    3.2.4 Detecting Break-Ins
  3.3 Implementation
    3.3.1 BINDER Architecture
    3.3.2 Windows Prototype
  3.4 Evaluation
    3.4.1 Methodology
    3.4.2 Real World Experiments
    3.4.3 Controlled Testbed Experiments
  3.5 Limitations
  3.6 Summary
4 Protocol Independent Replay of Application Dialog
  4.1 Motivation
  4.2 Overview
    4.2.1 Goals
    4.2.2 Challenges
    4.2.3 Terminology
    4.2.4 Assumptions
  4.3 Design
    4.3.1 Preparation Stage
    4.3.2 Replay Stage
    4.3.3 Design Issues
  4.4 Evaluation
    4.4.1 Test Environment
    4.4.2 Simple Protocols
    4.4.3 Blaster (TFTP)
    4.4.4 FTP
    4.4.5 NFS
    4.4.6 Randex (CIFS/SMB)
    4.4.7 Discussion
  4.5 Summary
5 GQ: Realizing a Large-Scale Honeyfarm System
  5.1 Motivation
  5.2 Design
    5.2.1 Overview and Architecture
    5.2.2 Network Input
    5.2.3 Filtering
    5.2.4 Replay Proxy
    5.2.5 Containment and Redirection
    5.2.6 Honeypot Management
  5.3 Implementation
    5.3.1 Replay Proxy
    5.3.2 Isolation
    5.3.3 Honeypot Management
  5.4 Evaluation
    5.4.1 Background Radiation and Scalability Analysis
    5.4.2 Evaluating the Replay Proxy
    5.4.3 Capturing Worms
  5.5 Summary
6 Conclusions and Future Work
  6.1 Thesis Summary
  6.2 Future Work
Bibliography
List of Figures

3.1 An example of events generated by browsing a news web site.
3.2 A time diagram of events in Figure 3.1.
3.3 BINDER's architecture.
3.4 CDF of the DNS lookup time on an experimental computer.
3.5 A stripped list of events logged during the break-in of adware Spydeleter.
3.6 The controlled testbed for evaluating BINDER with real world malware.
4.1 The infection process of the Blaster worm.
4.2 Steps in the preparation stage.
4.3 A sample dialog for PASV FTP.
4.4 The message format for the toy Service Port Discovery protocol.
4.5 The primary and secondary dialogs for requests (left) and responses (right) using the toy SPD protocol. RolePlayer first discovers endpoint-address (bold italic) and argument (italic) fields, then breaks the ADUs into segments (indicated by gaps), and discovers length (gray background) and possible cookie (bold) fields.
4.6 Initiator-based and responder-based SPD replay dialogs.
4.7 Steps in the replay stage.
4.8 The application-level conversation of W32.Randex.D.
4.9 A portion of the application-level conversation of W32.Randex.D. Endpoint-address fields are in bold-italic, cookie fields are in bold, and length fields are in gray background. In the initiator-based and responder-based replay dialogs, updated fields are in black background.
5.1 GQ's general architecture.
5.2 The replay proxy's operation. It begins by matching the attacker's messages against a script. When the attacker deviates from the script, the proxy turns around and uses the client side of the dialog so far to bring a honeypot server up to speed, before proxying (and possibly modifying) the communication between the honeypot and attacker. The shaded rectangles show in abstract terms portions of messages that the proxy identifies as requiring either echoing in subsequent traffic (e.g., a transaction identifier) or per-connection customization (e.g., an embedded IP address).
5.3 Message formats for the communication between the GQ controller and the honeypot manager. "ESX Server Info" is a string with the ESX server hostname, the ESX server control port number, the user name, the password, and the VM config file name, separated by tab keys.
5.4 TCP scans before scan filtering.
5.5 TCP scans after scan filtering.
5.6 Effect of scan filter cutoff on proportion of probes that pass the prefilter.
5.7 Effect of the number of honeypots on proportion of prefiltered probes that can engage a honeypot.
List of Tables

3.1 Summary of collected traces in real world experiments.
3.2 Parameter selection for D_old, D_new and D_prev.
3.3 Break-down of false alarms according to causes.
3.4 Email worms tested in the controlled testbed.
3.5 The impact of D_old^upper / D_new^upper on BINDER's performance of false negatives.
4.1 Definitions of different types of dynamic fields.
4.2 Summary of evaluated applications and the number of dynamic fields in data received or sent by RolePlayer during either initiator or responder replay.
5.1 Virtual machine image types installed in GQ.
5.2 Summary of protocol trees.
5.3 Part 1 of summary of captured worms (worm names are reported by Symantec Antivirus).
5.4 Part 2 of summary of captured worms (worm names are reported by Symantec Antivirus).
Acknowledgements
As I am about to wrap up my work and life in Berkeley, I look back at these years and realize that I have been truly fortunate to receive help and support from many people. Without them, I would not have been able to finish my Ph.D. dissertation.
First and foremost, I would like to thank my advisor, Professor Randy Katz, for giving me a
chance to work with him and guiding me through these years. Randy's advice and support were critical for my Ph.D. study. His technical advice taught me how to conduct cutting-edge research; his patience and trust gave me the confidence to overcome the hurdles in my research; his unrestricted support led me to my choice of network security research; and his care for and responsibility to students showed me what lies beyond being a researcher.
I would like to thank my de facto co-advisor, Dr. Vern Paxson, for taking me on to work with
him in the CCIED project. I have learned tremendously from him: his high standard for research,
exceptional diligence, efficient and effective communication, curiosity for new problems, and understanding and consideration for students. My work experience with Randy and Vern will always
inspire me to do my best in my career.
I would also like to thank Professors David Wagner and John Chuang for serving on my qualifying exam and dissertation committee. Their early feedback on my thesis proposal was very helpful in formulating my research plan. I am also very grateful to them for reading and commenting
on my thesis.
I was also very fortunate to work with Professor Ion Stoica, Dr. Wai-tian (Dan) Tan, and
Dr. Nicholas Weaver. Ion helped me start my first research project at Berkeley and has always
been available to offer his advice on both my technical and career questions. I did my industrial
internship under the mentoring of Dan at HP Labs, who led me to enjoy the fun of system hacking
and has been a great mentor and friend. Nick has generously helped me with my quals and research
at ICSI. I want to thank all of them for their advice and help. I would also like to thank Professor
Anthony Joseph for his technical and career advice.
I would like to thank my colleagues in CCIED for their insightful comments and intriguing
discussion: Professors Stefan Savage, Geoffrey Voelker, and Alex Snoeren, Dr. Mark Allman,
Dr. Robin Sommer, Martin Casado, Jay Chen, Christian Kreibich, Justin Ma, Michael Vrable, and
Vinod Yegneswaran.
My colleagues in Berkeley have always been there to help me. Helen Wang generously shared with me her Ph.D. research experience in my first year. Adam Costello's feedback prompted me to rethink my backup-path allocation algorithm in my first networking research paper.
Sridhar Machiraju and Matthew Caesar were fantastic partners in our collaboration. I want to thank
all of them for their help. I would also like to thank others who have served to comment on my
ideas and review my paper drafts: Sharad Agarwal, Mike Chen, Yan Chen, Chen-Nee Chuah, Yanlei Diao, Yitao Duan, Rodrigo Fonseca, Ling Huang, Jayanthkumar Kannan, Chris Karlof, Karthik
Lakshminarayanan, Yaping Li, Morley Mao, Ana Sanz Merino, David Oppenheimer, George Porter,
Bhaskaran Raman, Ananth Rao, Mukund Seshadri, Jimmy Shih, Lakshminarayanan Subramanian,
Mel Tsai, Kai Wei, Alec Woo, Fang Yu, and Shelley Zhuang.
My life in Berkeley would have been much less fun without my Chinese friends. Zhanfeng Jia
has given me tremendous help since my first day in Berkeley. Fei Lin gave me her generous help
in my first year. Wei Liu shared with me his love for sports and beer through his years in Berkeley.
Jinwen Xiao has always been available to offer her advice and help whenever I asked. I want to
thank all of them for their help. I am also very grateful to Minghua Chen, Yanmei Li, Jinhui Pan,
Na Pi, Xiaofeng Ren, Chen Shen, Xiaotian Sun, Wei Wei, Rui Xu, Guang Yang, Jing Yang, Lei
Yuan, Min Yue, Jianhui Zhang, Haibo Zeng, Wei Zheng, and Xu Zou, for all the joy they have
brought to me in these years.
Finally, I am extremely grateful to my parents and my wife. Without their selfless love and support, I would not have been able to reach this point in my life. My father and mother taught me the importance
of being honest, optimistic, confident, and hard-working. They not only gave me a life but also led
me to the path to explore and enjoy all the adventures in life. My wife, Li Yin, is my best colleague
and friend. Her love, care, support, and encouragement have been and will be the source of my
strength.
Chapter 1
Introduction
1.1 Motivation
An increasing variety of malware like worms, spyware and adware threatens both personal and
business computing [32, 86]. Since 2001, large-scale Internet worm outbreaks have not only compromised hundreds of thousands of computer systems [56, 54, 97] but also slowed down many parts
of the Internet [54]. The Code Red worm [56] exploited a single Microsoft IIS server vulnerability,
took 14 hours to infect the 360,000 vulnerable hosts, and launched a flooding attack against the
White House web server. Using only a single UDP packet for infection, Slammer [54] took about
10 minutes to infect its vulnerable population and caused severe congestion in both local and global
networks. Spyware [73] jeopardizes computer users by disclosing their personal information to third
parties. Remotely controlled “bot” (short for robot, an automated software program that can execute
certain commands when it receives a specific input) networks of compromised systems are growing
quickly [13]. The bots have been used for spam forwarding, distributed denial of service attacks,
and spyware. In short, modern threats are causing severe damage to today’s computer world.
Before we describe the problem this thesis tackles, we first look at the features of modern
malware.
• Malware evolves rapidly. In a recent study [86], Symantec reported that it detected more
than 21,000 new Win32 malware variants in 2005, which is one order of magnitude larger
than the number of variants seen in 2003. In addition to the volume increase, we also see
more attacks against personal computers that exploit not only buffer overflow vulnerabilities but also user misbehavior in using web browsers, instant messaging, and peer-to-peer software.
• Self-propagating malware can spread very rapidly. While it took 14 hours for the Code
Red worm to infect its vulnerable hosts, it took only 10 minutes for the Slammer worm to do
so. Staniford et al. presented several techniques to accelerate a worm’s spread in [83, 82].
Their simulations showed that flash worms, which propagate over a pre-constructed tree for
efficient spreading, can complete their propagation in less than a second for single packet
UDP worms and only a few seconds for small TCP worms. In another study [57], Moore et
al. demonstrated that effective worm containment requires a reaction time under 60 seconds.
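The urgency of these reaction times follows from the standard random-scanning worm model used in the literature cited above (a textbook illustration, not a contribution of this thesis): infection grows roughly logistically,

    \frac{da}{dt} = K\,a\,(1-a), \qquad a(t) = \frac{e^{K(t-T)}}{1+e^{K(t-T)}},

where a(t) is the fraction of vulnerable hosts infected at time t, K is the initial compromise rate (the scan rate times the density of vulnerable hosts in the scanned address space), and T marks the start of the outbreak. For a Slammer-class scan rate, K is large enough that the population goes from a handful of infections to near-saturation within minutes, which is why a reaction time measured in seconds rather than hours is required.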
These features lead to a strong need for fast, automatic action against new unknown malware.
In the next section (Section 1.2), we describe the research problems we attempt to tackle in this
thesis and discuss the research challenges we face. In Section 1.3, we present our contributions and
the structure of this thesis.
1.2 Problem Definition and Challenges
To fight against new unknown malware, there are two potential approaches: proactive and reactive. Proactive approaches focus on eliminating vulnerabilities to prevent any malware. Reactive
approaches concern actions taken after malware is unleashed. We believe that there is a long way to go before we can achieve zero attacks. Thus we follow reactive approaches in this thesis. To react
to a new malicious attack, the first step is to detect the new malware. Since there are more and more
malware variants and self-propagating malware can spread very rapidly, we need fast, automatic
detection. In this thesis, we aim to develop new techniques and systems to automate the detection
of new unknown malware.
To tackle this problem, we face two fundamental challenges.
• False Alarms: Detecting new unknown malware means we cannot rely on signatures. Instead, we must detect it by recognizing certain patterns possessed solely by some malware. In this thesis, we focus on patterns one can infer by monitoring anomalous behavior. Anomaly-based intrusion detection techniques have been studied for many years [4, 18], but they are not widely deployed in practice due to their high false alarm rates. Axelsson [5] used the base-rate fallacy phenomenon to show that, when intrusions are rare, the false alarm rate (i.e., the probability that a benign event causes an alarm) has to be very low to achieve an acceptable Bayesian detection rate (i.e., the probability that an alarm implies an intrusion). In automatic detection and reaction, we can tolerate even fewer false alarms because actions taken on false alarms would interrupt normal communication and computation.
To minimize false alarms, the approach in this thesis is to infer the intent of the user or the adversary (the malware author).
– User Intent: User intent is a unique characteristic of personal computers. Most benign software running on personal computers shares a common feature: its activity is user-driven. Thus, if we can infer user intent, we can detect break-ins of new unknown
malware by identifying activity that the user did not intend. In other words, instead
of attempting to model anomalous behavior, we model the normal behavior of benign
software: their activity follows user intent. Since users of personal computers usually
use input devices such as keyboards and mice to control their computers, we assume
that user intent can be inferred from user-driven activity such as key strokes and mouse
clicks. Software activity may include network requests, file access, or system calls. We
focus on network activity in this thesis. The idea is that an outbound network connection
that is not requested by a user or cannot be traced back to some user request is treated
as likely to be malicious. A large class of malware makes such malicious outbound
network connections either for self-propagation (worms) or to disclose user information
(spyware/adware). We refer to these malicious, user-unintended, outbound network
connections as extrusions. Therefore, we can detect new unknown malware on personal
computers by identifying extrusions. In Chapter 3, we present the details of the extrusion
detection algorithm and a prototype system we built.
– Adversary Intent: Authors of malware have unique intent. For example, worm authors
intend to spread their malicious code to vulnerable systems via self-propagation; spyware authors intend to collect private information on victim machines; bot authors intend to control their bots and execute various malicious tasks on them. In our work, we
focus on computer worms. Given the intent of worm authors, we detect new worms by
identifying self-propagation. Our basic approach is to first let a machine become infected, and then let the first machine infect another one, and so on. By doing so, we can
observe a chain of self-propagation if the first machine is compromised by a computer
worm. We have never seen a false alarm by using this approach (see Section 5.4.3).
• Scalability: We must achieve early detection of fast spreading Internet worms for any effective defense. To do so, we must monitor the Internet from a large number of vantage points.
This leads to the scalability problem—how to analyze the large amount of data efficiently in
real-time.
A central tool that has emerged recently for detecting Internet worm outbreaks is the
honeyfarm [80], a large collection of honeypots fed Internet traffic by a network telescope [6, 34, 58, 96]. In our work, we use high-fidelity honeypots to analyze in real-time
the thousands of scanning probes per minute seen on the large address space of our network
telescope. To do so by using a small number of honeypots, we leverage extensive filtering and
engage honeypots dynamically. In extensive filtering, we filter scanning probes in multiple
stages: (1) the scan filter limits the total number of distinct telescope addresses that engage a
remote source; (2) the first-packet filter drops a connection if the first data packet unambiguously identifies the attack as a known threat; (3) the replay proxy filters frequent multi-stage
attacks by replaying the server-side responses until the point where it can identify the attack (a simplified sketch of this filtering pipeline follows at the end of this list).
The replay proxy leverages a new technology—protocol-independent replay of application
dialog—we develop in this thesis. In replay, we use a script automatically derived from one or
two examples of a previous application session as the basis for mimicking either the client or
the server side of the session in a subsequent instance of the same transaction. The idea is that
we can represent each previously seen attack using a script that describes the network-level
dialog (both client and server messages) corresponding to the attack. A key property of such
scripts is that we can automatically extract them from samples of two dialogs corresponding
to a given attack, without any knowledge of the semantics of the application protocol used in
the attack. In Chapter 4, we present the details of protocol-independent replay. In Chapter 5,
we describe the design and implementation of our honeyfarm system.
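To make the multi-stage filtering concrete, the following is a deliberately simplified sketch of the decision flow. Every name, threshold, and trivial matcher in it is invented for illustration; the actual filters, their parameters, and the replay proxy are described in Chapter 5.

    # Deliberately simplified sketch of multi-stage filtering in front of a
    # honeyfarm. Names, thresholds, and the trivial matchers are placeholders.

    KNOWN_FIRST_PACKETS = {b"SLAMMER-PROBE"}      # placeholder signature set
    REPLAY_SCRIPTS = {b"KNOWN-MULTISTAGE-HELLO"}  # placeholder script index

    state = {"engaged": {}}                        # src_ip -> set of dst_ips

    def handle_probe(src_ip, dst_ip, first_payload, scan_cutoff=10):
        # Stage 1: scan filter -- each source may engage at most `scan_cutoff`
        # distinct telescope addresses; probes beyond that are dropped.
        engaged = state["engaged"].setdefault(src_ip, set())
        if dst_ip not in engaged and len(engaged) >= scan_cutoff:
            return "dropped by scan filter"
        engaged.add(dst_ip)

        # Stage 2: first-packet filter -- drop the connection if the first
        # data packet unambiguously identifies a known threat.
        if first_payload in KNOWN_FIRST_PACKETS:
            return "dropped by first-packet filter"

        # Stage 3: replay proxy -- scripted server-side responses absorb
        # previously seen multi-stage attacks without using a honeypot.
        if first_payload in REPLAY_SCRIPTS:
            return "absorbed by replay proxy"

        # Stage 4: only the remaining traffic engages a high-fidelity honeypot.
        return "forwarded to honeypot"

    print(handle_probe("198.51.100.7", "192.0.2.1", b"never-seen-before"))

In this sketch only traffic that survives all three filters consumes a honeypot, which is what allows a small number of honeypots to cover a large monitored address space.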
In the next section, we highlight our contributions and describe the structure of this thesis.
1.3 Contributions and Structure of This Thesis
In this thesis, we tackle the problem of automating detection of new unknown malware. Because
there are many different kinds of malware and they infect computer systems in many different
environments, there is no single panacea that can detect all malware in every environment. We
recognize this and focus on one important environment (personal computers) and one important
type of malware (computer worms). We develop the following algorithms and systems:
• Extrusion Detection: Our algorithm is based on the idea of inferring user intent. It detects
new unknown malware on personal computers by identifying extrusions, malicious outbound
network requests which the user did not intend. We implement BINDER, a host-based detection system for Windows that realizes this algorithm and can detect a wide class of malware
on personal computers, including worms, spyware, and adware, with few false alarms.
• Protocol-Independent Replay: We develop RolePlayer, a system which, given examples
of an application session, can mimic both the initiator and responder sides of the session
for a wide variety of application protocols. A key property of RolePlayer is that it operates
in an application-independent fashion: the system does not require any specifics about the
particular application it mimics. We can potentially use such replay for recognizing malware
variants, determining the range of system versions vulnerable to a given attack, testing defense
mechanisms, and filtering multi-stage attacks (a simplified sketch of script-driven replay follows this list).
• GQ Honeyfarm System: We develop GQ, a large-scale honeyfarm system that ensures high-fidelity honeypot operation, efficiently discards the incessant Internet “background radiation”
that has only nuisance value when looking for new forms of activity, and devises and enforces
an effective “containment” policy to ensure that the detected malware does not inflict external
damage or skew internal analyses. GQ leverages aggressive filtering, including a technique
based on protocol-independent replay.
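To give a flavor of how script-driven replay can work, the following is a hypothetical sketch. The data structures and field kinds are invented here; Chapter 4 presents the actual design and how dynamic fields are discovered by comparing example sessions. The idea is that a recorded dialog is stored as a sequence of application data units (ADUs) in which certain byte ranges are marked as dynamic fields, and replay rewrites those fields, substituting current endpoint addresses, echoing cookies observed from the peer, and recomputing lengths, while copying everything else verbatim.

    # Hypothetical sketch of script-driven replay in the spirit of RolePlayer.
    # The classes and field kinds are invented for illustration.

    from dataclasses import dataclass, field

    @dataclass
    class DynField:
        offset: int              # byte offset of the field within the ADU
        size: int
        kind: str                # "endpoint_addr", "cookie", or "length"
        covers: tuple = (0, 0)   # for "length": the byte range it measures

    @dataclass
    class ScriptADU:
        data: bytes                                 # bytes from the example session
        fields: list = field(default_factory=list)  # dynamic fields in this ADU

    def render(adu, our_addr, cookies):
        """Rewrite a recorded ADU's dynamic fields for the current session."""
        out = bytearray(adu.data)
        for f in adu.fields:
            if f.kind == "endpoint_addr":
                value = our_addr                    # substitute the live address bytes
            elif f.kind == "cookie":
                # echo whatever value the peer used in this session, if seen
                value = cookies.get(f.offset, adu.data[f.offset:f.offset + f.size])
            else:                                   # "length"
                start, end = f.covers
                value = (end - start).to_bytes(f.size, "big")
            out[f.offset:f.offset + f.size] = value[:f.size].rjust(f.size, b"\x00")
        return bytes(out)

    # Example: a message whose first four bytes carry the sender's address.
    adu = ScriptADU(b"\x0a\x00\x00\x01HELLO", [DynField(0, 4, "endpoint_addr")])
    print(render(adu, b"\xc0\x00\x02\x07", {}))     # -> b'\xc0\x00\x02\x07HELLO'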
The rest of this thesis is organized as follows. In Chapter 2, we discuss the related work in
the areas of intrusion detection and Internet epidemiology and defense. It provides the background
from which this thesis is developed.
In Chapter 3, we describe the design of the extrusion detection algorithm and the implementation and evaluation of the BINDER prototype system. We present the results of both real-world and
controlled-testbed experiments to demonstrate how effectively the system detects a wide class of malware and controls false alarms. Our limited user study indicates that BINDER limits the number of false alarms to at most five over four weeks on each computer and that the false positive rate is less than 0.03%. Our controlled-testbed experiments show that BINDER can successfully detect the Blaster
worm and 22 different email worms (collected on a departmental email server over one week).
In Chapter 4, we describe the design and implementation of the RolePlayer system that can
mimic both the client and server sides of an application session for a wide variety of application
protocols without knowing any specifics of the application it mimics. We present the evaluation of
RolePlayer with a variety of network applications as well as the multi-stage infection processes of
some Windows worms. Our evaluations show that RolePlayer successfully replays the client and
server sides for NFS, FTP, and CIFS/SMB file transfers as well as the infection processes of the
Blaster and W32.Randex.D worms.
In Chapter 5, we describe the design and implementation of the GQ honeyfarm system which
we built to detect new worm outbreaks by analyzing in real-time the scanning probes seen on a
quarter million Internet addresses. We report on preliminary experiences with operating the system
at scale. We monitor more than a quarter million Internet addresses and have detected 66 distinct
worms in four months of operations, which far exceeds that previously achieved for a high-fidelity
honeyfarm.
Finally, in Chapter 6, we summarize our work and contributions, and discuss directions for
future work.
Chapter 2
Related Work
In this chapter, we discuss the related work in the areas of intrusion detection and Internet
epidemiology and defense. Instead of covering every piece of work in these areas, we focus on
the most relevant work and aim to provide the background from which we developed this thesis.
We start with intrusion detection (see Section 2.1), describing techniques proposed in the areas
of host-based intrusion detection (see Section 2.1.1) and network-based intrusion detection (see
Section 2.1.2). Then we describe the research efforts made in the area of Internet epidemiology
and defense (see Section 2.2), which motivate our work on developing a large-scale, high-fidelity
honeyfarm system.
2.1 Intrusion Detection
Research on intrusion detection has a long history since Anderson’s [4] and Denning’s [18]
seminal work. Prior work on intrusion detection can be roughly classified along two dimensions:
detection method (misuse-based vs. anomaly-based) and audit data source (host-based vs. network-based). Misuse-based (also known as signature-based or rule-based) intrusion detection systems use
a set of rules or signatures to check if intrusion happens, while anomaly-based intrusion detection
systems attempt to detect anomalous behavior by observing a deviation from the previously learned
normal behavior of the system or the users. Host-based intrusion detection systems leverage audit
data collected from end hosts, while network-based intrusion detection systems monitor and analyze
network traffic.
2.1.1 Host-based Intrusion Detection
Early work on anomaly-based host intrusion detection was focused on analyzing system calls.
In [23], Forrest et al. modeled processes as normal patterns of short sequences of system calls and
used the percentage of mismatches to detect anomalies. Their work was based on two assumptions:
(1) the sequence of system calls executed by a program is locally consistent during normal operation, and (2) some unusual short sequence of system calls will be executed when a security hole in a
program is exploited. Ko et al. [38] proposed to use a specification language to define the intended
behavior of every program. The adoption of their scheme is limited because of the low degree of
automation. To detect intrusions, Wagner and Dean [100] used static analysis to learn “program
intent” and then compared the runtime behavior of a program with its intent. They proposed three
models to describe program intent: call graph, stack and digraph. The advantages of their techniques are: a high degree of automation; protection against a large class of attacks; the elimination
of false alarms. The limitation is that program source code is required. Sekar et al. [75] proposed
to train a finite-state automaton using a trace of system calls, program counter, and stack information, and to monitor system calls in real time to check whether they follow the automaton. Compared with
Wagner and Dean’s work, this method does not require program source code but may have false
alarms due to the incompleteness of system call traces. Liao and Vemuri [48] applied text categorization techniques to intrusion detection by making an analogy between processes and documents
and between system calls and words. They used a k-Nearest Neighbor classification algorithm to
classify processes based on system calls. In [101], Wagner and Soto showed that it is possible for
attackers to evade host-based anomaly intrusion detection systems that use system calls. Besides
system calls, the behavior of storage [65] and file [113] systems has also been used for intrusion
detection. Pennington et al. [65] proposed to monitor storage activity to detect intrusions using a set
of predetermined rules. Their approach has limited ability to detect new attacks due to the usage of
rules. In [113], Xie et al. proposed to detect intrusions by identifying “simultaneous” creations of
new files on multiple machines in a homogeneous cluster. They correlated the file system activity
on a coarse granularity of one day, which slows the detection.
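As a concrete illustration of the short-sequence approach of Forrest et al. described at the start of this section, the following is a minimal sketch, not their implementation; the window length and threshold are arbitrary placeholders.

    # Minimal sketch of system-call short-sequence anomaly detection in the
    # spirit of Forrest et al.; the window length k and the 0.2 threshold
    # are illustrative, not the values used in their work.

    def normal_profile(training_traces, k=6):
        """Collect all length-k system-call sequences seen during normal runs."""
        profile = set()
        for trace in training_traces:
            for i in range(len(trace) - k + 1):
                profile.add(tuple(trace[i:i + k]))
        return profile

    def mismatch_rate(trace, profile, k=6):
        """Fraction of length-k windows in a trace never seen during training."""
        windows = [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]
        if not windows:
            return 0.0
        misses = sum(1 for w in windows if w not in profile)
        return misses / len(windows)

    # Example: an exploited process often issues an unusual burst of calls.
    normal = normal_profile([["open", "read", "mmap", "read", "close", "exit"] * 5])
    suspect = ["open", "read", "execve", "socket", "connect", "write", "close"]
    print(mismatch_rate(suspect, normal) > 0.2)   # True -> raise an anomaly alarm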
Anomaly-based host intrusion detection techniques are not widely deployed in practice. A key
obstacle is their high false alarm rate. Axelsson [5] used the base-rate fallacy phenomenon to show
that, when intrusions are rare, the false alarm rate has to be very low to achieve an acceptable
Bayesian detection rate, i.e., the probability that an alarm implies an intrusion. In contrast, signature-based host intrusion detection systems are widely adopted in practice. Commercial products like
Symantec [87] and ZoneAlarm [120] protect hosts against intrusions by filtering known attacks, but
they cannot detect new ones.
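The base-rate effect follows directly from Bayes' rule. Writing I for an intrusion and A for an alarm, the Bayesian detection rate is

    P(I \mid A) = \frac{P(A \mid I)\,P(I)}{P(A \mid I)\,P(I) + P(A \mid \neg I)\,P(\neg I)}.

As an illustrative example with made-up numbers (not Axelsson's), suppose a detector never misses an intrusion, P(A | I) = 1, intrusions occur in one event out of 10^5, and the false alarm rate is a seemingly small 10^-3. Then P(I | A) is roughly 10^-5 / (10^-5 + 10^-3), about 1%, so roughly 99 out of 100 alarms are false; the false alarm rate must drop to about 10^-5 before even half of the alarms correspond to real intrusions.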
Recently, dynamic data flow analysis [62, 15] has been used to detect attacks that overwrite a
buffer with data received from untrusted sources (e.g., buffer overflow attacks). These approaches
have very good detection accuracy, but high performance overhead limits their deployment.
In summary, previous host-based intrusion detection systems have the following limitations.
Signature-based solutions have low false alarm rates but do not perform well on detecting new
attacks. Anomaly-based solutions can detect new attacks but have high false alarm rates. Moreover,
anomaly-based solutions that use system calls are likely to fail to detect new kinds of attacks like
spyware and email viruses because such malware usually runs as a new program whose behavior in terms of system calls may be similar to that of benign programs.
2.1.2 Network-based Intrusion Detection
The first network intrusion detection system (NIDS) was developed by Snapp et al. in [78]. Lee
and Stolfo [44, 45] proposed a framework for constructing features and models for intrusion detection. Their key idea is to use data mining techniques to compute frequent patterns, select features,
and then apply classification algorithms to learn detection models. To detect network intrusions,
Sekar et al. [76] proposed a specification language to combine traffic statistics with state-machine
specifications of network protocols. In [92], Thottan and Ji used a statistical signal processing technique based on abrupt change detection to detect network anomalies. Snort [71] is a signature-based
lightweight network intrusion detection system. Paxson [64] developed Bro, a system for detecting
network intrusions in real-time. Bro applies the design philosophy of separating mechanisms from
policies by separating the event engine and the policy script interpreter. The former is for processing
packets at the network/application level, while the latter is for checking events according to policies and generating real-time alarms. Ptacek and Newsham [68] presented three classes of attacks
against network intrusion detection systems: insertion (attackers send packets that are only seen by
the NIDS but not the end system), evasion (attackers send packets that create ambiguities for the
NIDS), and denial of service attacks. Handley et al. [26] studied the evasion problem and introduced a traffic normalizer that patches up the packet stream to eliminate potential ambiguities. To
detect new unknown attacks exploiting known vulnerabilities, Wang et al. [102] developed Shield,
which prevents vulnerability-specific, exploit-generic attacks by using network filters built based on
known vulnerabilities.
In summary, anomaly-based network intrusion detection suffers from false alarms while
signature-based solutions do not reliably detect new unknown attacks.
2.2 Internet Epidemiology and Defenses
2.2.1 Overview
Since 2001, Internet worm outbreaks have caused severe damage that affected tens of millions
of individuals and hundreds of thousands of organizations. The Code Red worm of 2001 [56]
was the first outbreak in today’s global Internet after the Morris worm [20, 79] in 1988. It not
only compromised 360,000 hosts with a Web server vulnerability but also attempted to launch a
distributed denial of service attack against a government web server from those hosts. The Blaster
worm [97] exploited a vulnerability of a service running on millions of personal computers. The
Slammer worm [54] used only a single UDP packet for infection and took just 10 minutes to infect
its vulnerable population. The excessive scan traffic generated by Slammer slowed down many
parts of the Internet. Also using a single UDP packet for infection, the Witty worm [55] exploited
a vulnerability, which had been publicized only one day earlier, in commercial intrusion detection
software. It was the first to carry a destructive payload that overwrote random disk blocks. It
also started in an organized manner with an order of magnitude more ground-zero hosts than any
previous worm, and apparently was targeted at military bases. Weaver et al. provided a taxonomy
of computer worms in [104].
Staniford et al. presented several techniques to accelerate a worm’s spread in [83, 82]. They
described the design of flash worms, in which the worm author collects a list of vulnerable hosts,
constructs an efficient spread tree, and encodes it in the worm before propagating it. Their
simulations showed that flash worms can complete their spread in less than a second for single
packet UDP worms and only a few seconds for small TCP worms. In another study [57], Moore et
al. demonstrated that effective worm containment requires a reaction time under 60 seconds. These
results suggest that automated early detection of new worm outbreaks is essential for any effective
defense.
2.2.2 Network Telescopes
Understanding the behavior of Internet-scale worms requires broad visibility into their workings. Network telescopes, which work by monitoring traffic sent to unallocated portions of the IP
address space, have emerged as a powerful tool for this purpose [58], enabling analysis of remote
denial-of-service flooding attacks [59], botnet probing [116], and worm outbreaks [42, 54, 55, 56].
However, the passive nature of network telescopes limits the richness of analysis we can perform
with them; often, all we can tell of a source is that it is probing for a particular type of server, but
not what it will do if it finds such a server. We can gain much richer information by interacting
with the sources. One line of research in this regard has been the use of lightweight mechanisms
that establish connections with remote sources but process the data received from them syntactically
rather than with full semantics. For example, iSink [117] uses a stateless active responder to generate
response packets to incoming traffic, enabling it to discriminate between different types of attacks
by checking the response payloads. The lightweight responder of the Internet Motion Sensor [6]
bases its analysis on payload signatures. The responder acknowledges TCP SYN packets to elicit
the first data segment the source will send, using a cache of packet payload checksums to detect
matches to previously seen activity. This matching, however, lacks sufficient power to identify
new probing that deviates from previously seen activity only in subsequent data packets, nor can it
detect activity that is semantically equivalent to previous activity but differs in checksum due to the
presence of message fields that do not affect the semantics of the probe (e.g., IP addresses embedded
in data). Pang et al. used a detailed application-level responder to study the characteristics of
Internet “background radiation” [63]. The work shows that a large proportion of such probing
cannot in fact be distinguished using lightweight first-packet techniques such as those in [6, 117]
due to the prevalence of protocols (such as Windows NetBIOS and CIFS) that engage in extensive
setup exchanges.
In summary, network telescopes provide early and broad visibility into a worm’s spread, but
have little or no capability of actively responding to probes. This makes them prone to false alarms when they are used for fast detection of novel worms.
2.2.3 Honeypots
A honeypot is a vulnerable network decoy whose value lies in being probed, attacked, or compromised for the purposes of detecting the presence, techniques, and motivations of an attacker [81].
Honeypots can run directly on a host machine (“bare metal”) or within virtual machine environments
that provide isolation and control over the execution of processes within the honeypot. Honeypots can also be classified as low-interaction, medium-interaction or high-interaction (alternatively,
low-fidelity or high-fidelity). Low-interaction honeypots can emulate a variety of services that the
attackers can interact with. Medium-interaction honeypots provide more functionality than low-interaction honeypots but still do not have a full, productive operating system. One example is the
use of chroot in a UNIX environment. High-interaction honeypots give the attackers access to a
real operating system with few restrictions.
In [66], Provos proposed Honeyd, a scalable virtual honeypot framework. Honeyd can personalize TCP response to deceive Nmap [51] (a free open source utility for network exploration or
security auditing, which can be used to determine what operating systems and OS versions a host
is running) and create arbitrary virtual routing topologies. Honeyd can be used for detecting worms
and distracting adversaries. However, low-interaction honeypots’ ability to detect novel worms is
limited by their emulation capability, in terms of the number of services they can emulate, and by their inability to emulate unknown vulnerabilities. High-interaction honeypots behave exactly as
a real host and therefore allow accurate, in situ examinations of a worm’s behavior. Moreover,
they can be used to detect new attacks against even unknown vulnerabilities in real operating systems [29]. However, they also require more protection in containment and isolation.
Advances in Virtual Machine Monitors such as VMware [95], Xen [7], and User-Mode Linux
(UML) [19] have made it tractable to deploy and manage high-interaction virtual honeypots. The
Honeynet project [29] proposed an architecture for high-interaction honeypots to contain and analyze attackers in the wild, focusing on two main components: data control and data capture. The
data control component prevents attackers from using the Honeynet to attack or harm other systems.
It limits outbound connections based on configurable policies and uses Snort Inline [53] to block
known attacks. The data capture component (also known as Sebek [30]) logs all of the attacker’s
activity and transmits data to a safe machine without the attacker knowing it. In [17], Dagon and
colleagues presented HoneyStat, a system that uses high-interaction honeypots running on VMware
GSX servers to monitor memory, disk and network activity on the guest OS. HoneyStat uses logit
regression [106] for causality analysis to detect a worm when discovering that a single network
event causes all subsequent memory or disk events.
By themselves, honeypots serve as poor “early warning” detection of new worm outbreaks because of the low probability that any particular honeypot will be infected early in a worm’s growth.
Conceptually, one might scatter a large number of well-monitored honeypots across the Internet
to address this limitation, but this raises major issues of management, cost, and liability. Spitzner
presented the idea of “Honeypot Farms” to address these issues [80]. These work by deploying a
collection of honeypots in a single, centralized location, where they receive suspicious traffic redirected from distributed network locations. Jiang and Xu presented a VM-based honeyfarm system,
Collapsar, which uses UML to capture and forward packets via Generic Routing Encapsulation
(GRE) tunnels and Snort-Inline to contain outbound traffic. However, since in Collapsar each honeypot has a single IP address, the probability of early detection is still low due to the limited “cross
section” presented by the honeypots.
In summary, we still lack solutions that can automatically detect and analyze Internet worm
outbreaks in seconds. In this thesis, we develop a large-scale high-fidelity honeyfarm system to
tackle this challenge.
2.2.4 Defenses
There have been many research efforts on detecting scanning worms. When a host is compromised by a scanning worm, it will try to spread by scanning other hosts, a large fraction of which may not exist. Williamson [107] presented a virus throttle, a rate-limiting solution that throttles the rate of new connections made by a compromised host. The virus throttle can detect and
contain fast scanning worms. Zou et al. [121] proposed a new architecture for detecting scanning
worms. Their basic idea is to detect the trend, not the rate of monitored illegitimate scan traffic
because at the beginning of its spreading, a scanning worm propagates almost exponentially with
a constant, positive rate [122]. They used a Kalman filter to estimate the parameter modeling the rate at which victims increase. If that parameter stabilizes and oscillates slightly around a positive constant value, it indicates a new worm outbreak. Based on the theory of sequential hypothesis testing, Jung et
al. [35] proposed an on-line port scan detection algorithm called Threshold Random Walk (TRW).
In practice, TRW requires only four or five connection attempts to detect a malicious remote host.
In [74], Schechter et al. proposed a hybrid approach that integrates reverse sequential hypothesis
testing and credit-based connection rate limiting. Weaver et al. [105] proposed a solution based
on TRW to detect and contain scanning worms in an enterprise environment at very high speed by
leveraging hardware acceleration. They showed that their techniques can stop a scanning host after
fewer than ten scans with a very low false positive rate.
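The core of TRW can be sketched as a sequential likelihood-ratio test. The sketch below is a simplified illustration; the probabilities and thresholds are placeholders, not the values derived in [35]. Each connection attempt from a remote host either succeeds or fails, failures are far more likely for scanners than for benign hosts, and the test stops as soon as the accumulated likelihood ratio crosses an upper or lower threshold.

    # Simplified sketch of a TRW-style sequential hypothesis test; the
    # probabilities and thresholds below are illustrative only.

    def trw(outcomes, p_fail_benign=0.2, p_fail_scanner=0.8,
            eta_scanner=100.0, eta_benign=0.01):
        """outcomes: iterable of booleans, True = the connection attempt failed."""
        ratio = 1.0
        for failed in outcomes:
            if failed:
                ratio *= p_fail_scanner / p_fail_benign      # evidence for "scanner"
            else:
                ratio *= (1 - p_fail_scanner) / (1 - p_fail_benign)
            if ratio >= eta_scanner:
                return "scanner"
            if ratio <= eta_benign:
                return "benign"
        return "undecided"

    # A host whose first few probes all fail is flagged after about four attempts
    # (each failure multiplies the ratio by 0.8/0.2 = 4, so four failures give 256).
    print(trw([True, True, True, True]))   # -> "scanner"

In the actual algorithm the two thresholds are derived from the desired detection and false positive rates, which is why so few observations suffice in practice.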
Email worms like SoBig and MyDoom have caused severe damage [112]. Several solutions
have been proposed to defend against email worms. Gupta and Sekar [25] proposed an anomaly-based approach that detects increases in traffic volume over what was observed during the training
period. Their assumption was that an email virus attack would overwhelm mail servers and clients
with a large volume of email traffic. In [31], Hu and Mok proposed a framework based on behavior skewing and cordoning. Specifically, they set up bait email addresses (behavior skewing),
intercept SMTP messages to them (behavior monitoring), and forward suspicious emails to their
own SMTP server (cordoning). Based on conventional epidemiology, Xiong [114] proposed an attachment chain tracing scheme that detects email worm propagation by identifying the existence of
transmission chains in the network.
It is important to generate attack signatures of previously unknown worms so that signature-based intrusion detection systems can use the signatures to protect against these worms. Several
automated solutions have been proposed in the past. Kreibich and Crowcroft [41] proposed Honeycomb, which uses Honeyd as the frontend. After merging connections, Honeycomb compares them
horizontally (the nth incoming messages of all connections) and vertically (all incoming messages
of two connections) and uses a Longest Common Substring algorithm to find attack signatures.
In [77], Singh et al. developed Earlybird, a worm fingerprinting system that uses content sifting
to generate worm signatures. Content sifting is based on two anomalies of worm traffic: (1) some
content in a worm’s exploit is invariant; (2) the invariant content appear in flows from many source
to many destinations. By using multi-resolution bitmaps and multistage filters, Earlybird can maintain the necessary data structures at very high speed and with low memory overhead. Kim and
Karp [36] proposed Autograph, which automatically generates worm signatures by analyzing the
prevalence of portions of flow payloads. Autograph uses content-based payload partitioning to divide payloads of suspicious flows from scanning sources into variable-length content blocks, then
applies a greedy algorithm to find prevalent content blocks as signatures. None of these solutions works with polymorphic worms, whose content may be encoded into successive, different byte
strings. In [61], Newsome et al. proposed Polygraph, which automatically generates signatures
for polymorphic worms. Polygraph leverages a key assumption that multiple invariant substrings
must often be present in all variants of a worm payload for the worm to function properly. These
substrings typically correspond to protocol framing, return addresses, and poorly obfuscated code.
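As a rough illustration of the content-sifting idea, the following is a sketch under simplifying assumptions; Earlybird itself uses hashed fingerprints, multi-resolution bitmaps, and multistage filters rather than the exact sets used here. The sketch counts, for each fixed-length payload substring, how often it recurs and between how many distinct sources and destinations, and reports substrings that are both prevalent and widely dispersed.

    # Rough sketch of content sifting (prevalence + address dispersion);
    # the chunk size and thresholds are illustrative placeholders.

    from collections import defaultdict

    def sift(packets, chunk=40, min_count=3, min_src=3, min_dst=3):
        """packets: iterable of (src_ip, dst_ip, payload_bytes)."""
        count = defaultdict(int)
        sources = defaultdict(set)
        dests = defaultdict(set)
        for src, dst, payload in packets:
            for i in range(0, max(len(payload) - chunk + 1, 0)):
                piece = payload[i:i + chunk]
                count[piece] += 1
                sources[piece].add(src)
                dests[piece].add(dst)
        return [piece for piece in count
                if count[piece] >= min_count
                and len(sources[piece]) >= min_src
                and len(dests[piece]) >= min_dst]

    # Example: the same exploit payload arriving from several sources.
    payload = b"\x90" * 30 + b"EXPLOIT-CORE-BYTES-THAT-REPEAT-ACROSS-FLOWS"
    probes = [("src%d" % i, "dst%d" % i, payload) for i in range(4)]
    print(len(sift(probes)) > 0)    # True: prevalent, widely dispersed content found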
Some worm behavior signatures [21] can be used to detect unknown worms. Network traffic
patterns of worm behaviors include (1) sending similar data from one machine to the next, (2) tree-like propagation and reconnaissance, and (3) changing a server into a client. However, it is not clear
how the behavioral approach can be deployed in practice because it is difficult to collect necessary
information.
2.3 Summary
In this chapter, we reviewed the related work in the areas of intrusion detection and Internet
epidemiology and defense. We observe that anomaly detection techniques are not widely deployed
in practice due to their high false alarm rates. We also see that, in the war against Internet worms, it
is critical to achieve fast automatic detection of new outbreaks. The previous work described in
this chapter sets the stage for our work. In the next chapter, we present our work on detecting
new unknown malware on personal computers by capturing user-unintended malicious outbound
network connections.
Chapter 3
Extrusion-based Malware Detection for
Personal Computers
The main goal of our work is to develop new techniques and system architectures to automate
the process of malware detection. In this chapter, we describe how we can infer user intent to
automatically detect a wide class of new unknown malware on personal computers without any
prior knowledge. In Chapter 4 and Chapter 5, we will describe how we can infer adversary intent to
detect fast spreading Internet worms automatically and quickly.
3.1 Motivation
Many research efforts [64, 71, 102] and commercial products [87, 120] have focused on preventing break-ins by filtering either known malware or unknown malware exploiting known vulnerabilities. To protect computer systems from rapidly evolving malware, these solutions have two
requirements. First, a central entity must rapidly generate signatures of new malware after it is
detected. Second, distributed computer systems must download and apply these signatures to their
local databases before they are attacked. However, these requirements can leave computer systems temporarily unprotected from newly emerging malware. Symantec [86] estimated that 49 days elapsed on average between
the publication of a vulnerability and the release of an associated patch in 2005. On the other hand,
the average time for exploit development was less than 7 days in 2005. In particular, worms can
propagate much more rapidly than humans can respond in terms of generation and distribution of
signatures [83]. Therefore, we need schemes to detect new unknown malware when the signatures
are not available.
Anomaly-based intrusion detection has been studied for detecting new unknown malware. [28]
used short sequences of system calls executed by running processes to detect anomalies caused by
intrusions. [45] proposed a data mining framework for constructing features and training models
of network traffic for intrusion detection. [119] proposed a scheme based on the distinctive timing
characteristics of interactive traffic for detecting compromised stepping stones. Recent work of [47]
correlated simultaneous anomalous behaviors from different perspectives to improve the accuracy
of intrusion detection. The performance of anomaly-based approaches is limited in practice due to
their high false positive rate [5].
In short, it is highly desirable to develop new anomaly-based malware detection techniques that
can achieve a low false positive rate and thus be useful in practice. We tackle this problem on a specific but
important platform—personal computers (i.e., a computer that is used locally by a single user at any
time).
A unique characteristic of personal computers is user intent. Most benign software running on
personal computers shares a common feature, that is, its activity is user driven. Thus, if we can
infer user intent, we can detect break-ins of unknown malware by identifying activity which the
user did not intend. Such activity can be network requests, file access, or system calls. In this thesis,
we focus on network activity because a large class of malware makes malicious outbound network
connections either for self-propagation (worms) or to disclose user information (spyware/adware).
Such outbound connections are suspicious when they are uncorrelated with user activity. We refer
to these malicious, user-unintended outbound connections as extrusions.¹ Thus we can detect new
unknown malware on personal computers by identifying extrusions.
Prior to our work, user intent has been used to set the access control policy (also known as
“designating authority”) in a GUI (Graphical User Interface) system [85, 115, 84]. Based on the
Principle of Least Authority (or Privilege), the basic idea of these research efforts is to designate
¹ Extrusion is also defined as unauthorized transfer of digital assets in some other context [9].
the necessary authority to a running program based on user actions it receives directly or indirectly.
Unlike the prior work, our goal is to detect malware on personal computers that run a GUI, and
our basic idea is that a running program that receives user input directly or indirectly may generate
outbound network requests.
In the remainder of this chapter, we first describe the extrusion detection algorithm (see Section 3.2). Then, we present the architecture and implementation of BINDER (Break-IN DEtectoR),
a host-based malware detection system that realizes the extrusion detection algorithm (see Section 3.3). Next, we demonstrate our evaluation methodology and show that BINDER can detect
a wide class of malware with few false alarms (see Section 3.4). In the end, we discuss the limitations caused by potential attack techniques (see Section 3.5) and summarize this chapter (see
Section 3.6).
3.2 Design
In this section, we first present the design goals and rationale for our approach. Then, we
use an example to demonstrate how to infer user intent. Next, we describe the extrusion detection
algorithm in detail. We conclude this section by discussing why outbound connections of a wide
class of malware can be detected as extrusions.
3.2.1 Overview
Our goal is to automatically detect break-ins of new unknown malware on personal computers.
Our design should achieve:
• Minimal False Alarms: This is critical for any automatic intrusion detection system.
• Generality: It should work for a large class of malware without the need for signatures, and
regardless of how the malware infects the system.
• Small Overhead: It must not use intrusive probing or adversely affect the performance of the
host computer.
We achieve these design goals by leveraging a unique characteristic on personal computers, that
is, user intent. Our key observation is that outbound network connections from a compromised personal computer can be classified into three categories: user-intended, user-unintended benign, and
user-unintended malicious (referred to as extrusions). Since many kinds of malware such as worms,
spyware and adware make malicious outbound network connections either for self-propagation or
to disclose user information, we can detect break-ins of these kinds of malware by identifying their
extrusions. In the rest of this chapter, we use connections and outbound connections interchangeably unless otherwise specified. We do not consider inbound connections (initiated by a remote
computer) because firewalls are designed to filter such traffic. In Chapter 5, we will describe how
we use honeypots to inspect incoming malicious traffic to detect worms.
Since users of personal computers usually use input devices such as keyboards and mice to
control their computers, we assume that user intent can be inferred from user-driven activities.
We also assume malware runs as standalone processes; our current design cannot detect malware that hides in other processes by exploiting the techniques proposed in [70]. We look at this
as one of the limitations of today’s operating systems. We will discuss in detail attack techniques
and potential solutions in Section 3.5.
To detect extrusions, we need to determine if an outbound connection is user-intended. We
infer user intent by correlating outbound network connections with user-driven input at the process
level. Our key assumption is that outbound network connections made by a process that receives
user input a short time ago is user-intended. We also treat repeated connections (from the same
process to the same destination host or IP address within a given time window; see Section 3.2.3
for more details) as user-intended as long as the first one is user-intended. By doing this, we can
handle the case of automatically refreshing web pages and polling emails. Among user-unintended
outbound connections, we use a small whitelist to differentiate benign traffic from malicious traffic.
The whitelist covers three kinds of programs: system daemons, applications automatically checking
updates, and network applications automatically started by the operating system. Actual rules are
specific to each operating system and may become user specific.
In the rest of this section, we first use an example to demonstrate how user intended connections
may be initiated. Next, we describe the extrusion detection algorithm. Finally, we discuss how
malware can be detected by this algorithm.
3.2.2 Inferring User Intent
The following example demonstrates how user-intended connections are initiated. Let us assume that a user opens an Internet Explorer (IE) window, goes to a news web site, then leaves the
window idle to answer a phone call. We consider three kinds of events: user events (user input), process events (process start and process finish), and network events (connection request, data arrival
and domain name lookup). A list of events generated by these user actions is shown in Figure 3.1.
In this example, new connections are generated in the following four cases.
• Case I: When the user opens IE by double-clicking its icon on My Desktop in Windows, the
shell process explorer.exe (PID=1664) of Windows receives the user input (Event 1-2), and
then starts the IE process (Event 3). After the domain name (www.cs.berkeley.edu) of the
default homepage is resolved (Event 4), the IE process makes a connection to it to download
the homepage (Event 5). This connection of IE is triggered by the user input received by its parent process, explorer.exe.
• Case II: After the user clicks a bookmark of news.yahoo.com in the IE window (Event 8),
the domain name is resolved as 66.218.75.230 (Event 9). Then the IE process makes a
connection to it to download the HTML file (Event 10). This connection is triggered by the
user input of the same process.
• Case III: After receiving the HTML file in 4 packets (Event 11-14), IE goes to retrieve two
image files from us.ard.yahoo.com and us.ent4.yimg.com. IE makes connections to them
(Event 16,18) after the domain names are resolved (Event 15,17). These two connections are
triggered by the HTML data arrivals of the same process.
• Case IV: According to a setting in the web page, IE starts refreshing the web page for updated
1 09:32:33, PID=1664 (user input)
2 09:32:33, PID=1664 (user input)
3 09:32:35, PID=2573, PPID=1664, NAME="C:\...\iexplore.exe" (process start)
4 09:32:39, HOST=www.cs.berkeley.edu, IP=169.229.60.105 (DNS lookup)
5 09:32:39, PID=2573, LPORT=5354, RIP=169.229.60.105, RPORT=80 (connection request)
6 09:32:40, PID=2573, LPORT=5354, RIP=169.229.60.105, RPORT=80 (data arrival)
7 09:32:40, PID=2573, LPORT=5354, RIP=169.229.60.105, RPORT=80 (data arrival)
8 09:32:43, PID=2573 (user input)
9 09:32:45, HOST=news.yahoo.com, IP=66.218.75.230 (DNS lookup)
10 09:32:45, PID=2573, LPORT=5357, RIP=66.218.75.230, RPORT=80 (connection request)
11 09:32:47, PID=2573, LPORT=5357, RIP=66.218.75.230, RPORT=80 (data arrival)
12 09:32:47, PID=2573, LPORT=5357, RIP=66.218.75.230, RPORT=80 (data arrival)
13 09:32:48, PID=2573, LPORT=5357, RIP=66.218.75.230, RPORT=80 (data arrival)
14 09:32:48, PID=2573, LPORT=5357, RIP=66.218.75.230, RPORT=80 (data arrival)
15 09:32:50, HOST=us.ard.yahoo.com, IP=216.136.232.142,216.136.232.143 (DNS lookup)
16 09:32:50, PID=2573, LPORT=5359, RIP=216.136.232.142, RPORT=80 (connection request)
17 09:32:51, HOST=us.ent4.yimg.com, IP=192.35.210.205,192.35.210.199 (DNS lookup)
18 09:32:51, PID=2573, LPORT=5360, RIP=192.35.210.205, RPORT=80 (connection request)
19 09:32:52, PID=2573, LPORT=5359, RIP=216.136.232.142, RPORT=80 (data arrival)
20 09:32:52, PID=2573, LPORT=5360, RIP=192.35.210.205, RPORT=80 (data arrival)
21 09:32:53, PID=2573, LPORT=5359, RIP=216.136.232.142, RPORT=80 (data arrival)
22 09:32:53, PID=2573, LPORT=5360, RIP=192.35.210.205, RPORT=80 (data arrival)
23 09:43:01, HOST=news.yahoo.com, IP=66.218.75.230 (DNS lookup)
24 09:43:01, PID=2573, LPORT=5357, RIP=66.218.75.230, RPORT=80 (connection request)
25 09:43:02, PID=2573, LPORT=5357, RIP=66.218.75.230, RPORT=80 (data arrival)
26 09:43:02, PID=2573, LPORT=5357, RIP=66.218.75.230, RPORT=80 (data arrival)
......
Figure 3.1. An example of events generated by browsing a news web site.
news 10 minutes later. This connection (Event 24) repeats the previous connection (Event 10)
by connecting to the same host or Internet address.
It is natural for a user input or data arrival event to trigger a new connection in the same process
(Case II and III). It is also normal to repeat a recent connection in the same process (Case IV). Note
that connections of email clients repeatedly polling for email fall into this case. However, Case I implies that it is necessary to correlate events among processes. In general, a user-intended connection
must be triggered by one of the rules below:
• Intra-process rule: A connection of a process may be triggered by a user input, data arrival or
connection request event of the same process.
• Inter-process rule: A connection of a process may be triggered by a user input or data arrival
event of another process.
To verify if a connection is triggered by the intra-process rule, we just need to monitor all user
and network activities of each single process. However, we need to monitor all possible communications among processes to verify if a connection is triggered by the inter-process rule. In our
current design, we only consider communications from a parent process to its direct child process
and use the following parent-process rule to approximate the inter-process rule. In other words, we
do not allow cascading in applying the parent-process rule.
• Parent-process rule: A connection of a process may be triggered by a user input or data arrival
event received by its direct parent process before it is created.
3.2.3 Extrusion Detection Algorithm
[Diagram omitted: events 1 to 26 from Figure 3.1 on a time line, with Dnew spanning Events 1 to 5, Dold spanning Events 8 to 10 and Events 14 to 16, and Dprev spanning Events 10 to 24.]
Figure 3.2. A time diagram of events in Figure 3.1.
The extrusion detection algorithm must decide if a connection is user-intended and if it is in the
whitelist. Its main idea is to limit the delay from a triggering event to a connection request event.
Note that, for data arrival events, we only consider those of user-intended connections. There are
three possible delays for a connection request. In Figure 3.2, we show them in a time diagram of
events in Figure 3.1. For a connection request made by process P , we define the three delays as
follows:
• Dnew : The delay since the last user input or data arrival event received by the parent process
of P before P is created. It is the delay from Event 1 to 5 in Figure 3.2.
• Dold : The delay since the last user input or data arrival event received by P . It is the delay
from Event 8 to 10 and from Event 14 to 16 in Figure 3.2.
• Dprev : The delay since the last connection request to the same host or IP address made by P .
It is the delay from Event 10 to 24 in Figure 3.2.
Whereas Dold reflects only the reaction time of a process, Dnew also includes the time to load the process. For user-intended connections, Dold and Dnew are on the order of seconds while Dprev is on the order of minutes. Depending on how a user-intended connection is initiated, at least one of the three delays must be less than its upper bound (defined as Dnew^upper, Dold^upper, and Dprev^upper). Note that these delays depend on the hardware, operating system, and user behavior. Given a host computer system, we must train the BINDER system to find the correct upper bounds. In Section 3.4.2 we will discuss how to choose these upper bounds.
In the design of the extrusion detection algorithm, we assume that we can learn rules from
previous false alarms. Each rule includes an application name (the image file name of a process
with the complete path) and a remote host (host name and/or IP address). The existence of a rule
means that any connection to that host made by a process of that application is not considered to be
an extrusion.
Given a connection request, the detection algorithm works as follows:
• If it is in the rule set of previous false alarms, then return;
• If it is in the whitelist, then return;
• If Dprev exists and is less than Dprev^upper, then return;
• If Dnew exists and is less than Dnew^upper, then return;
• If Dold exists and is less than Dold^upper, then return;
• Otherwise, it is an extrusion.
After detecting an extrusion, we can either drop the connection or raise an alarm with related
information such as the process ID, the image file name, and the connection information. Studying
the tradeoff between different reactions is beyond the scope of this chapter.
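To make the decision procedure concrete, the following Python fragment sketches the checks above. It is an illustration rather than BINDER’s actual implementation; the function and argument names are invented for exposition, the whitelist and false-alarm checks are assumed to be computed elsewhere, and each delay is passed as None when the corresponding event has never occurred.

    def is_extrusion(d_old, d_new, d_prev, in_false_alarm_rules, in_whitelist,
                     d_old_upper, d_new_upper, d_prev_upper):
        """Classify a single connection request; each delay is None if undefined."""
        if in_false_alarm_rules:          # matches a rule learned from a previous false alarm
            return False
        if in_whitelist:                  # system daemon, updater, or auto-started application
            return False
        if d_prev is not None and d_prev < d_prev_upper:
            return False                  # repeats a recent connection to the same host/IP
        if d_new is not None and d_new < d_new_upper:
            return False                  # parent received input shortly before process creation
        if d_old is not None and d_old < d_old_upper:
            return False                  # this process recently received user input or data
        return True                       # user-unintended: report as an extrusion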
3.2.4 Detecting Break-Ins
We just described the extrusion detection algorithm. Next, we discuss why outbound connections of worms, spyware and adware can be detected as extrusions. Unlike worms, spyware
and adware cannot propagate themselves and thus require user input to infect a computer system.
Worms can be classified as self-activated like Blaster [97] or user-activated like email worms. The
latter also requires user input to infect a personal computer.
When the malware receives user input for its break-in, its connections shortly after the break-in may be masked by user activity. Thus we may not be able to detect these initial extrusions.
However, we can detect a break-in as long as we can capture the first non-user triggered connection.
In Section 3.4, we will show that we successfully detected the adware Spydeleter [3] and 22 email
worms shortly after their break-ins.
When the malware runs without user input, we can easily detect its first outbound connection
as an extrusion. This is because the malware runs as a background process and does not receive any
user input. So Dold , Dnew and Dprev of the first connection do not exist. Additionally, we clear the
user input history after a personal computer is restarted. So even for malware that received user input for its break-in, its first connection after the victim system is restarted is guaranteed to be detected as an extrusion.
In summary, BINDER can detect a large class of malware such as worms, spyware and adware that (1) run as background processes, (2) do not receive any user-driven input, and (3) make
outbound network connections.
3.3 Implementation
In the previous section, we described the extrusion detection algorithm. In this section, we describe the architecture and implementation of BINDER (Break-IN DEtectoR), a host-based detection system which, realizing the extrusion detection algorithm, can detect break-ins of new unknown
malware on personal computers.
3.3.1 BINDER Architecture
[Diagram omitted: the User Monitor, Process Monitor, and Network Monitor sit above the operating system kernel and report events to the Extrusion Detector.]
Figure 3.3. BINDER’s architecture.
As a host-based system, BINDER must have small overhead so that it does not affect the performance of the host computer. Therefore, BINDER takes advantage of passive monitoring. To detect
extrusions, BINDER correlates information across three sources: user-driven input, processes, and
network traffic. Therefore, there are four components in a BINDER system: User Monitor, Process Monitor, Network Monitor, and Extrusion Detector. The architecture of BINDER is shown in
Figure 3.3. The first three components independently collect information from the operating system
(OS) passively in real time and report user, process, and network events to the Extrusion Detector. APIs for real-time monitoring are specific to each operating system. Since BINDER is built
to demonstrate the feasibility and effectiveness of the extrusion detection technique, we implement
BINDER in the application domain rather than in the kernel, both for ease of implementation and because of a practical limitation (i.e., no access to Windows source code). We describe the implementation on the Windows operating system in the second part of this section. In the following, we explain the functionality
and interface of these components that are general to all operating systems.
The User Monitor is responsible for monitoring user input and reporting user events to the
Extrusion Detector. It reports a user input event when it observes the user clicking the mouse or hitting a
key. A user input event has two components: the time when it happens and the ID of the process
that receives this user input. This mapping between a user input and a process is provided by the
operating system. So the User Monitor does not rely on the Process Monitor for such information.
Since a user input event has only the time information and the Extrusion Detector only stores the
last user input event, BINDER avoids storing or leaking privacy-sensitive information.
When a process is created or stopped, the Process Monitor correspondingly reports to the Extrusion Detector two types of process events: process start and process finish. A process start event
includes the time, the ID of the process itself, its image file name, and the ID of the parent process.
A process finish event has only the time and the process ID.
[Plot omitted: cumulative distribution function (CDF, 0 to 1) of DNS lookup time (0 to 10 seconds).]
Figure 3.4. CDF of the DNS lookup time on an experimental computer.
The Network Monitor audits network traffic and reports network events. For the sake of detecting extrusions, it reports three types of network events: connection request, data arrival and domain
name lookup. For connection request events, the Network Monitor checks TCP SYN packets and
UDP packets. A data arrival event is reported when an inbound TCP or UDP packet with non-empty
payload is received on a normal outbound connection. Note that the direction of a connection is
determined by the direction of the first TCP SYN or UDP packet of this connection. The Network
Monitor also parses DNS lookup packets. It associates a successful DNS lookup with a following
connection request to the same remote IP address as returned in the lookup. This is important because DNS lookup may take significant time between a user input and the corresponding connection
request. The Cumulative Distribution Function (CDF) of 2,644 DNS lookup times on one of the experimental computers (see Section 3.4.2 for details of the experiments) is shown in Figure 3.4. We
can see that about 8% of DNS lookups take more than 2 seconds. A connection request event has five
components: the time, the process ID, the local transport port number, the remote IP address and
the remote transport port number. Note that the time is the starting time of its DNS lookup if there is one, or of the connection request itself otherwise. The mapping between network traffic and processes is provided by
the operating system. A data arrival event has the same components as a connection request event
except that its time is the time when the data packet is received. A domain name lookup event has
the time, the domain name for lookup, and a list of IP addresses mapping to it.
Except for domain name lookup results that are shared among all processes, the Extrusion
Detector organizes events based on processes and maintains a data record for each process. A
process data record has the following members: the process ID, the image file name, the parent
process ID, the time of the last user input event, the time of the last data arrival event, and all the
previous normal connections. When a process start event is received, a process data record is created
with the process ID, the image file name and the parent process ID. The time of the last user input
event is updated when a user input event of the process is reported. Similarly, the time of the last
data arrival is updated when a data arrival event is received. A process data record is closed when
its corresponding process finish event is received. All process records are cleared when the system
is shut down. The size of the event database is small because the number of simultaneous processes
on a personal computer is usually less than 100. Based on all the information of user, process and
network events, the Extrusion Detector implements the extrusion detection algorithm.
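As an illustration of this bookkeeping, the per-process record and the derivation of the three delays might look as follows in Python. The field and function names are assumptions made for exposition, not BINDER’s actual data structures; times are plain timestamps in seconds.

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class ProcessRecord:
        pid: int                                     # process ID
        image: str                                   # image file name with full path
        ppid: int                                    # parent process ID
        parent_trigger_time: Optional[float] = None  # last user input/data arrival received by the
                                                     # parent before this process was created (Dnew)
        last_user_input: Optional[float] = None      # time of the last user input event (Dold)
        last_data_arrival: Optional[float] = None    # last data arrival on a normal connection (Dold)
        normal_connections: Dict[str, float] = field(default_factory=dict)  # remote -> last request

    def delays_for_request(rec, remote, now):
        """Compute (Dold, Dnew, Dprev) for a new connection request; None means undefined."""
        triggers = [t for t in (rec.last_user_input, rec.last_data_arrival) if t is not None]
        d_old = now - max(triggers) if triggers else None
        d_new = now - rec.parent_trigger_time if rec.parent_trigger_time is not None else None
        d_prev = now - rec.normal_connections[remote] if remote in rec.normal_connections else None
        return d_old, d_new, d_prev

The resulting delays can then be compared against the per-user upper bounds as in the decision sketch of Section 3.2.3.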
3.3.2 Windows Prototype
BINDER’s general design (see Figure 3.3) can be implemented in all modern operating systems. Since computers running Windows operating systems are the largest community targeted by malware [86], we implemented a prototype of BINDER for Windows 2000/XP. Given current Windows systems’ limitations [70], our Windows prototype does not provide a bulletproof solution for break-in detection, although it does demonstrate the effectiveness of the extrusion detection technique in detecting a large class of existing malware. Though this prototype is implemented in the application space, a BINDER system can run in the kernel space if it is adopted in practice. Widespread availability of Trusted Computing-related, virtual machine-based protection [33] or similar isolation techniques is necessary to turn BINDER into a robust production system.
The User Monitor is implemented with the Windows Hooks API [108]. It uses three hook procedures, KeyboardProc, MouseProc and CBTProc. KeyboardProc is used to monitor keyboard
events while MouseProc is used to monitor mouse events. MouseProc can provide the information of which window will receive a mouse event. Since KeyboardProc cannot provide the same
information for a keyboard event, we use CBTProc to monitor events when a window is about to receive the keyboard focus. After determining which window will receive a user input event, the User
Monitor uses the procedure GetWindowThreadProcessId to get the process ID of the window.
The Process Monitor is implemented based on the built-in Security Auditing on Windows
2000/XP [109]. By turning on the local security policy of auditing process tracking (Computer Configuration/Windows Settings/Security Settings/Local Policies/Audit Policy/Audit
process tracking), the Windows operating system can audit detailed tracking information for process start and finish events. The Process Monitor uses psloglist [67] to parse the security event log
and generates process start and process finish events.
The Network Monitor is implemented based on TDIMon [91] and WinDump [110], which
requires WinPcap [111]. TDIMon monitors activity at the Transport Driver Interface (TDI) level
of networking operations in the operating system kernel. It can provide all network events associated
with the process information. Since TDIMon does not save complete DNS packets, the Network
Monitor uses WinDump for this purpose. Based on the information collected by TDIMon and
DNS packets stored by WinDump, the Network Monitor reports network events to the Extrusion
Detector.
It is straightforward to implement the extrusion detection algorithm based on the information
stored in the process data record in the Extrusion Detector. Here we focus on the whitelisting mechanism in our Windows implementation. The whitelist in our current implementation has 15 rules.
These cover three kinds of programs: system daemons, software updates and network applications
automatically started by Windows. A rule for system daemons has only a program name (with a
full path). Processes of the program are allowed to make connections at any time. In our current
implementation, we have five system daemons including System, spoolsv.exe, svchost.exe, services.exe and lsass.exe. Note that the traffic of network protocols such as DHCP, ARP, and NTP
is initiated by the system daemons. A rule for software updates has both a program name and an
update web site. Processes of the program are allowed to connect to the update web site at any
time. In this category, we now have six rules that cover Symantec, Sygate, ZoneAlarm, Real Player,
Microsoft Office, and Mozilla. For network applications automatically started by Windows when it
starts, we currently have four rules for messenger programs of MSN, Yahoo!, AOL, and ICQ. These
programs are allowed to make connections at any time. In the future, we need to include a special
rule regarding wireless network status change. For example, an email client on a laptop computer
may start sending pre-written emails right after the laptop is connected to the wireless network in a
hot spot. In addition, we need to study the behavior of peer-to-peer software applications to decide
if it is necessary to treat them as system daemons to avoid false positives.
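One plausible encoding of these three kinds of rules is sketched below in Python. The program paths and update hosts are hypothetical placeholders rather than the prototype’s actual 15 rules, which are specific to each Windows installation.

    # Hypothetical whitelist contents; real rules are OS- and installation-specific.
    SYSTEM_DAEMONS = {
        r"C:\WINDOWS\system32\svchost.exe",        # placeholder entries for the daemon rules
        r"C:\WINDOWS\system32\services.exe",
    }
    UPDATE_RULES = {
        # program (full path) -> update site it may contact at any time
        r"C:\Program Files\mozilla.org\Mozilla\mozilla.exe": "update.example.org",
    }
    AUTO_STARTED_APPS = {
        r"C:\Program Files\Messenger\msmsgs.exe",  # placeholder for the messenger rules
    }

    def in_whitelist(image, remote_host):
        """True if the connection is allowed by a daemon, update, or auto-start rule."""
        if image in SYSTEM_DAEMONS or image in AUTO_STARTED_APPS:
            return True
        return UPDATE_RULES.get(image) == remote_host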
Managing the whitelist for an average user is very important. Rules for system daemons usually
do not change until the operating systems are upgraded. Since the number of software applications
that require regular updates is small and does not change very often, the rules for software updates can
be updated by some central entity that adopts BINDER. Though rules in the last category have to be
configured individually for each system, we believe some central entity can provide help by maintaining a list of applications that fall into this category. A mechanism similar to PeerPressure [103]
may be used to help an average user configure her own whitelist.
3.4 Evaluation
We evaluated BINDER on false positives and false negatives in two environments. First, we
installed it on six Windows computers used by different volunteers for their daily work, and collected traces over five weeks since September 7th, 2004. Second, in a controlled testbed based on
the Click modular router [39] and VMware Workstation [95], we tested BINDER with the Blaster
worm and 22 different email worms collected on a departmental email server over one week since
October 7th, 2004.
3.4.1 Methodology
The most important design objective of BINDER is to minimize false alarms while maximizing
detected extrusions. In our experiments, we use the number of false alarms rather than the false positive rate to evaluate BINDER. This is because users who respond to alarms are more sensitive to the
absolute number than to a relative rate. BINDER detects extrusions on a per-connection basis.
However, when we count the number of false alarms, we do not use the number of misclassified
normal connections directly. This is because a false alarm covers a series of consecutive connection
requests (e.g., after the first misclassified normal connection is corrected, consecutive connection
requests that repeat it by connecting to the same host or IP address will be treated as normal).
Therefore, for misclassified normal connections, we merge them into groups, in which the first false
alarm covers the rest, and count each group as one false alarm. When we evaluate BINDER on
false negatives, we check if and how fast it can detect a break-in. The real world experiments are
used to evaluate BINDER for both false positives and false negatives, while the experiments in the
controlled testbed are only for false negatives.
To evaluate BINDER with different values for the three parameters Dold^upper, Dnew^upper and Dprev^upper, we use offline, trace-based analysis in all experiments.
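For concreteness, one plausible reading of the grouping rule described above is sketched below in Python: the first misclassified connection from a process to a given host covers later repetitions, so each group counts as a single false alarm. This is an illustration of the counting convention, not the exact bookkeeping used in the experiments.

    def count_false_alarms(misclassified):
        """misclassified: chronologically ordered (pid, remote) pairs flagged on benign traffic."""
        covered = set()     # (process, remote host) pairs already covered by an earlier alarm
        alarms = 0
        for pid, remote in misclassified:
            if (pid, remote) not in covered:
                covered.add((pid, remote))
                alarms += 1
        return alarms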
3.4.2 Real World Experiments
Table 3.1. Summary of collected traces in real world experiments.

User  Machine  OS     Days  User Events  Process Events  Network Apps  TCP Conns
A     Desktop  WinXP   27      35,270         5,048           33         33,480
B     Desktop  WinXP   26      80,497        12,502           35         15,450
C     Desktop  WinXP   23      24,781         7,487           55         36,077
D     Laptop   Win2K   23      99,928         8,345           28          9,784
E     Laptop   WinXP   13       8,630         2,448           21         10,210
F     Laptop   WinXP   12      20,490         5,402           20          7,592
To evaluate the performance of BINDER in a real world environment, we installed it on six
Windows computers used by different volunteers for their daily work, and collected traces over five
weeks. We collected traces of user input, process information, and network traffic from the six
computers. A summary of the collected traces is shown in Table 3.1. On one hand, these computers
were used for daily work, so the traces are realistic. On the other hand, our experimental population
is small because it is difficult to convince users to submit their working environment to experimental
software. However, from the summary of the collected traces in Table 3.1, we see that they have
several distinct combinations of hardware, operating system, and user behavior. For real world
experiments, we discuss parameter selection and then analyze the performance of our approach on
false positives and false negatives. Due to the limitation of data collection, we train BINDER and
evaluate it on the same data set.
Parameter Selection
Table 3.2. Parameter selection for Dold, Dnew and Dprev.

               90% (sec)                     95% (sec)                     99% (sec)
User   Dold  Dnew  Dprev  # of FAs    Dold  Dnew  Dprev  # of FAs    Dold  Dnew  Dprev  # of FAs
A       18    11    142      15        33    15    752       5        79    21   4973       3
B       15    12     64      23        28    21    260       7        79    22   3329       5
C       14    14     28      20        25    15    134       5        74    33   2272       1
D       16    81    213       3        33    81    715       3        85    81   4611       2
E       19    12    539       5        32    14    541       4        93    90   4216       3
F       14     8     80      10        27    13    265       5        79    31   3633       2
In this section, we discuss how to choose values for the three parameters Dold^upper, Dnew^upper and Dprev^upper. The goal of parameter selection is to make the upper bounds as tight as possible under the condition that the number of false alarms is acceptable. We assume the rules of whitelisting described in Section 3.3.2 are fixed. The performance metric is the number of false alarms. Based on the real-world traces, we calculate Dold, Dnew and Dprev for all connection request events for every user. Then we take the 90th, 95th and 99th percentiles of all three parameters and calculate the number of false alarms for each percentile. The results are shown in Table 3.2.
From Table 3.2 we can see that Dold^upper, Dnew^upper and Dprev^upper must be different for different users because they depend on computer speed and user patterns. Thus, they should be selected on a per-user basis. We can also see that the performance of taking the 90th percentile is not acceptable and the improvement from the 95th percentile to the 99th percentile is small. Therefore, the parameters can be selected by choosing some value around the 95th percentile or according to the user’s preference. For conservative users, we should choose smaller values. The percentiles can be obtained by training BINDER over a period of virus-free time. Without training, these parameters can also be chosen (in some range) based on the user’s preference: Dold^upper and Dnew^upper can take values between 10 and 60 seconds, while Dprev^upper can take values between 600 and 3600 seconds. Note that Dold^upper can be greater than Dnew^upper because the reaction time from a user input or data arrival event to a connection request event depends on the instantaneous state of a computer. It remains a challenge for future research to investigate effective solutions for parameter selection.
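A simple way to derive such per-user upper bounds from a training trace is to take a chosen percentile of the observed delays, as in the Python sketch below. The nearest-rank percentile computation and the parameter names are assumptions for illustration, not the tooling actually used in these experiments.

    import math

    def percentile(samples, p):
        """Return the p-th percentile (nearest rank) of a non-empty list of delays in seconds."""
        s = sorted(samples)
        idx = max(0, math.ceil(p / 100.0 * len(s)) - 1)
        return s[idx]

    def select_upper_bounds(d_old_samples, d_new_samples, d_prev_samples, p=95):
        """Pick Dold^upper, Dnew^upper and Dprev^upper as the p-th percentile of training delays."""
        return (percentile(d_old_samples, p),
                percentile(d_new_samples, p),
                percentile(d_prev_samples, p))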
False Positives
Table 3.3. Break-down of false alarms according to causes.

User  Inter-Process  Whitelist  Collection  Total
A           2            1          0         3
B           4            1          0         5
C           1            0          0         1
D           0            1          1         2
E           1            1          1         3
F           0            1          1         2
By choosing parameters correctly, we expect to achieve few false alarms. From Table 3.2 we
can see that there were at most five false alarms (over 26 days) for each computer by choosing the
99th percentile. The false positive rate (i.e., the probability that a benign event causes an alarm) is
0.03%. We manually checked these remaining false alarms and found that they were caused by one
of three reasons:
• Incomplete information about inter-process event sharing. For example, four of the five false
alarms of User B were caused by this. We observe that PowerPoint followed IE to connect
to the same IP address while the parent process of the PowerPoint process was the Windows
shell. We hypothesize this is due to the usage of Windows API ShellExecute (which was
called by IE when the user clicked on a link of a PowerPoint file).
• Incomplete whitelisting. For example, connections made by Windows Application Layer
Gateway Service were (incorrectly) treated as extrusions.
• Incomplete trace collection. BINDER was accidentally turned off by a user in the middle of
trace collection. We gave users the control of turning on/off BINDER in case they thought
BINDER was responsible for bad performance, etc.
A break-down of the false alarms is shown in Table 3.3. We can see that a better-engineered
BINDER can eliminate false alarms in the Whitelist and Collection columns in Table 3.3, resulting
in much lower false positives. By extending BINDER to consider more inter-process communications than those between parent and child processes, we can decrease the false alarms in the Inter-Process column.
False Negatives
1 14:40:10, PID=2368, PPID=240, NAME="C:\...\iexplore.exe" (process start)
2 14:40:15, PID=2368 (user input)
3 14:40:24, PID=2368 (user input)
4 14:40:24, PID=2368, LPORT=1054, RIP=12.34.56.78, RPORT=80 (connection request)
5 14:40:24, PID=2368, LPORT=1054, RIP=12.34.56.78, RPORT=80 (data arrival)
......
6 14:40:28, PID=2552, PPID=960, NAME="C:\...\mshta.exe" (process start)
7 14:40:29, PID=2552, LPORT=1066, RIP=87.65.43.21, RPORT=80 (connection request)
7 14:40:29, PID=2552, LPORT=1066, RIP=87.65.43.21, RPORT=80 (data arrival)
......
8 14:40:34, PID=2896, PPID=2552, NAME="C:\...\ntvdm.exe" (process start)
9 14:40:35, PID=2988, PPID=2896, NAME="C:\...\ftp.exe" (process start)
10 14:40:35, PID=2988, LPORT=1068, RIP=44.33.22.11, RPORT=21 (connection request)
......
Figure 3.5. A stripped list of events logged during the break-in of adware Spydeleter.
In the real world experiments, among the six computers, one was infected by the adware
Gator [2] and CNSMIN [1] and another one was infected by the adware Gator and Spydeleter [3].
In particular, the second computer was compromised by Spydeleter when BINDER was running.
BINDER can successfully detect the adware Gator and CNSMIN because they do not have any
user input in history. In the following, we demonstrate how BINDER can detect the break-in of
Spydeleter right after it compromised the victim computer.
In Figure 3.5, we show a stripped list of events logged during the break-in of the adware Spydeleter. Note that all IP addresses are anonymized. Two related processes not shown in the list are
a process of explorer.exe with PID 240 and a process of svchost.exe with PID 960. After IE is
opened, a user connects to a site with IP 12.34.56.78. The web page has code to exploit a vulnerability in mshta.exe that processes .HTA files. After mshta.exe is infected by the malicious .HTA
file that is downloaded from 87.65.43.21, it starts a series of processes of ntvdm.exe that provides
an environment for a 16-bit process to execute on a 32-bit platform. Then, a process of ntvdm.exe
starts a process of ftp.exe that makes a connection request to 44.33.22.11. Since the prototype of
BINDER does not have complete information for verifying if a connection is triggered according
to the inter-process rule (see Section 3.2), the connection made by mshta.exe is detected as an extrusion. This is because its parent process is svchost.exe rather than iexplore.exe, though it is the
latter process that triggers its creation. If BINDER had complete information for inter-process event
sharing, it would detect the connection request made by ftp.exe as an extrusion. This is because
we do not allow cascading in applying the parent-process rule (this may cause false positives, but
we leave it for future research to study this tradeoff) and neither the process of ftp.exe nor its parent process ntvdm.exe has any user input or data arrival event in its history. So Dold, Dnew and Dprev for the connection request made by ftp.exe do not exist. This connection is used to download malicious code. Therefore, BINDER’s detection, combined with appropriate reactions, could have stopped the adware from infecting the computer. Note that the three parameters Dold^upper, Dnew^upper and Dprev^upper do not affect BINDER’s detection of Spydeleter here.
3.4.3 Controlled Testbed Experiments
Our real world experiments show that BINDER has a small false positive rate. Since the number
of break-ins in our real world experiments is very limited, to evaluate BINDER’s performance on
false negatives, we need to test BINDER with more real world malware. To do so, we face a
challenge of repeating break-ins of real world malware without unwanted damage. We tackle this
problem by building a controlled testbed. Next, we describe our controlled testbed and present
experimental results on 22 email worms and the Blaster worm.
Controlled Testbed
We developed a controlled testbed using the Click modular router [39] and VMware Workstation [95]. We chose Click for its powerful modules [40] for packet filtering and Network Address
and Port Translation (NAPT). In addition, we implemented a containment module in Click that can
pass, redirect or drop outbound connections according to predefined policies. The advantage of
using VMware is that we can discard an infected system and get a new one just by copying a few
files. VMware also provides virtual private networks on the host system. The testbed is shown in
Figure 3.6.
[Diagram omitted: a Windows VM behind a Click router (Containment, NAPT1) on Linux Host1 and a Linux VM behind a Click router (NAPT2) on Linux Host2, with SMTP traffic redirected from the first host to the second and external interfaces facing the Internet.]
Figure 3.6. The controlled testbed for evaluating BINDER with real world malware.
In the testbed, we have two Linux hosts running VMware Workstation. On the first host, we
have a Windows virtual machine (VM) in a host-only private network. This VM is used for executing malicious code attached in email worms. The Click router on this host includes a containment
module and a NAPT module. The containment policy on this router allows DNS traffic to pass through
and redirects all SMTP traffic (to port 25) to another Linux host. The second Linux host has a Linux
VM running the eXtremail server [22] in a host-only private network. The email server is configured
to accept all relay requests. The Click router on this host also has a NAPT module that guarantees
the email server can only receive inbound SMTP connections. Thus, all malicious emails are contained in the email server. This controlled testbed enables us to repeat the whole break-in and
propagation process of email worms.
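The containment policy itself is simple. The Python sketch below expresses its per-packet logic; the actual module is a Click element, and the mail-sink address and the default-drop behavior shown here are illustrative assumptions.

    MAIL_SINK = ("10.0.1.2", 25)   # hypothetical address of the contained mail server VM

    def containment_action(proto, dst_port):
        """Decide what to do with an outbound packet from the infected Windows VM."""
        if dst_port == 53:                     # let DNS queries through so worms can resolve names
            return ("pass", None)
        if proto == "tcp" and dst_port == 25:  # redirect all SMTP to the sacrificial mail server
            return ("redirect", MAIL_SINK)
        return ("drop", None)                  # keep everything else inside the testbed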
Experiments with Email Worms
Table 3.4. Email worms tested in the controlled testbed.

W32.Beagle.AB@mm    W32.Beagle.AC@mm
W32.Beagle.AG@mm    W32.Beagle.AR@mm
W32.Beagle.M@mm     W32.Bugbear.B@mm
W32.Erkez.B@mm      W32.Mydoom.L@mm
W32.Mydoom.M@mm     W32.Netsky.AB@mm
W32.Netsky.AD@mm    W32.Netsky.B@mm
W32.Netsky.C@mm     W32.Netsky.D@mm
W32.Netsky.F@mm     W32.Netsky.K@mm
W32.Netsky.P@mm     W32.Netsky.Q@mm
W32.Netsky.S@mm     W32.Netsky.W@mm
W32.Netsky.Z@mm     W32.Swen.A@mm
We obtained email worms from two sources. First, we set up our own email server and published an email address to USENET groups. This resulted in the email worm W32.Swen.A@mm
(we use Symantec’s naming rule [88]) being sent to us. Second, we were fortunate to convince the
system administrators of our department email server to give us 1,843 filtered virus email attachments that were collected over the week starting on October 7th, 2004. We used Symantec Norton
Antivirus [87] to scan these attachments and recognized 27 unique email worms. Among them, we
used 21 email worms because the rest of them were encrypted with a password. Thus, we tested 22
different real world email worms shown in Table 3.4.
For each email worm, we manually start it on a Windows virtual machine that has a file of
10 real email addresses, let it run for 10 minutes (with user input in history), and then restart the
virtual machine to run another 10 minutes (without user input in history). We analyze BINDER’s
performance using the traces collected during the two 10 minute periods. We choose 10 minutes
because it is long enough for email worms to scan the hard disk, find email addresses, and send
malicious emails to them.
Our results show that BINDER successfully detects break-ins of all 22 email worms in the
second 10 minute period by identifying the very first malicious outbound connection. In the rest of
this section, we focus on BINDER’s performance in the first 10 minute period.
Table 3.5. The impact of Dold^upper/Dnew^upper on BINDER’s performance of false negatives.

Dold^upper = Dnew^upper (sec)    10   20   30   40   50   60
Num of email worms detected      22   21   21   19   17   15

According to our discussion on parameter selection, Dold^upper and Dnew^upper usually take values between 10 and 60 seconds, while Dprev^upper usually takes values between 600 and 3600 seconds. Since our traces are 10 minutes long, the parameter Dprev^upper does not affect BINDER’s performance on false negatives. So we study the impact of Dold^upper and Dnew^upper on BINDER’s performance of false negatives. In Table 3.5, we show the number of email worms detected by BINDER when Dold^upper and Dnew^upper take the same given value between 10 and 60 seconds. Only one email worm (W32.Swen.A@mm) is missed when we take 30 seconds for Dold^upper and Dnew^upper. This is because
the first connection is detected as user-intended due to the user input and all following connections
repeat the first one by connecting to the same Internet address.
Experiments with Blaster
We test BINDER with the Blaster worm [97]. In this experiment, we run two Windows XP VMs
A and B in a private network. We run BINDER on VM B and run msblast.exe on VM A. Blaster
on VM A scans the network, finds VM B and infects it. By analyzing the infection trace collected
by BINDER, we see that BINDER detects the first outbound connection made by a process tftp.exe
as an extrusion. This is because neither the process itself nor its parent process cmd.exe receives
any user input. Thus we can successfully detect Blaster in this case even before the worm itself is
transferred over by TFTP.
3.5 Limitations
Our limited user study shows that BINDER controls the number of false alarms to at most
five over four weeks on each computer. We also show that BINDER successfully detects break-ins
of the adware Gator, CNSMIN, and Spydeleter, the Blaster worm, and 22 email worms. However,
BINDER is far from a complete system. It is built to verify that user intent can be a simple and
effective detector of a large class of malware with a very low false positive rate. We devote this
section to discussions of potential attacks against BINDER if its scheme is known to adversaries.
Though we try to investigate all possible attacks against BINDER, we cannot argue that we have
considered all of its possible vulnerabilities. Potential attacks include:
• Direct attack: subvert BINDER on the compromised system;
• Hiding inside other processes: inject malicious code into other processes;
• Faking user input: use APIs provided by the operating system to generate synthesized actions.
• Tricking the user to input: trick users to click on pop-up windows or a transparent overlay
window that intercepts all user input.
• Exploiting the whitelist: replace the executables of programs in the whitelist with a tweaked
one;
• Exploiting user input in history: when a malicious process is allowed to make one outbound
connection due to user input (e.g., a user opens a malicious email attachment), it can evade
BINDER’s detection by making that connection to a collusive remote site to keep receiving data. This would make BINDER think any new connections made by this process are
triggered by those data arrivals.
• Covert channels: a very tricky attack is to have a legitimate process make connections and
use them as a covert channel to leak information. For example, spyware can have an existing
IE process download a web page of a tweaked hyperlink by using some API provided by the
Windows shell right after a user clicks on the IE window of the same process. A collusive
remote server can get private information from the tweaked hyperlink.
Direct attack is a general limitation of all end-host software (e.g., antivirus, personal firewalls [120], and virus throttles [107] that attempt to limit outgoing attacks). Widespread availability of Trusted Computing-related, virtual machine-based protection [33] or similar isolation techniques is necessary to turn BINDER or any of these other systems into robust production systems.
The attacks of hiding inside other processes, faking user input, tricking users to input, and
exploiting whitelisting are inherent to the limitations of today’s operating systems. The effectiveness
of BINDER on malware detection implies pressing requirements for isolation among processes and
trustworthy user input in next-generation operating systems. Possible incomplete solutions for these
attacks might be to monitor corresponding system APIs or to verify the integrity of programs listed
in the whitelist. Even without a bulletproof solution for today’s operating system, we believe a
deployed BINDER system can raise the bar for adversaries significantly.
For the attacks of exploiting user input in history, a possible solution is to add more constraints
on how a user intended connection may be triggered. This requires more research work in the future.
For the attack of covert channels, possible solutions are discussed in [10].
3.6 Summary
In this chapter, we presented the design of a novel extrusion detection algorithm and the implementation of a host-based detection system which, realizing the extrusion detection algorithm,
can detect break-ins of a wide class of malware, including worms, spyware and adware, on personal
computers by identifying their extrusions. Our main contributions are:
• We take advantage of a unique characteristic of personal computers—user intent. Our evaluations suggest that user intent is a simple and effective detector for a large class of malware
with few false alarms.
• The controlled testbed based on the Click modular router and VMware Workstation enables
us to repeat the whole break-in and propagation process of email worms without worrying
about unwanted damage.
BINDER is very effective at detecting break-ins of malware on personal computers, but it also
has limitations:
• Leveraging user intent, it only works for personal computers.
• Running solely on a single host, it is poor at providing “early warning” of new worm outbreaks.
• It needs more extensive evaluation for actual deployment.
To address these limitations, in the next two chapters we describe how we can infer adversary
intent on a large-scale honeypot system to detect fast spreading Internet worms automatically and
quickly.
Chapter 4
Protocol Independent Replay of
Application Dialog
In the previous chapter, we described a novel extrusion detection algorithm and the BINDER
system for detecting break-ins of new unknown malware on personal computers. BINDER holds
promise to be effective on detecting a wide class of malware with few false alarms, but it would be
difficult to scale to provide “early warning” of new Internet worm outbreaks. For early automatic
detection of new worm outbreaks, we built a large-scale, high-fidelity “honeyfarm” system that analyzes in real-time the scanning probes seen on a quarter million Internet addresses. To achieve such
a scale, the honeyfarm system leverages a novel technique of protocol-independent replay of application dialog to winnow down the hundreds of probes per second seen by the honeyfarm to a level
that it can process. Before describing the honeyfarm system in detail in Chapter 5, we present in this
chapter the design of this novel technique and an evaluation over a variety of real-world network applications. This chapter is organized as follows. After motivating our work (see Section 4.1), we first
present an overview of design goals, challenges, terminology, and assumptions (see Section 4.2).
Then, we describe our replay system in detail and discuss related design issues (see Section 4.3).
Next, we describe our implementation, evaluate our system with real-world network applications,
and discuss the limitations (see Section 4.4). We demonstrate that our system successfully replays
the client and server sides for real-world network applications such as NFS, FTP, and CIFS/SMB
file transfers as well as the multi-stage infection processes of real-world worms such as Blaster and
W32.Randex.D. Finally, we summarize in Section 4.5.
4.1 Motivation
In a number of different situations it would be highly useful if we could cheaply “replay” one
side of an application session in a slightly different context. For example, consider the problem of
receiving probes from a remote host and attempting to determine whether the probes reflect a new
type of malware or an already known attack. Different attacks that exploit the same vulnerability
often conduct the same application dialog prior to finally exposing their unique malicious intent.
Thus, to coax from these attackers their final intent requires engaging them in an initial dialog,
either by implementing protocol-specific application-level responders [63, 66] or by deploying high-interaction honeypot systems [29, 89] that run the actual vulnerable services. Both approaches are
expensive in terms of development or management overhead.
On the other hand, much of the dialog required to engage with the remote source follows the
same “script” as seen in the past, with only very minor variants (different hostnames, IP addresses,
port numbers, command strings, or session cookies). If we could cheaply reproduce one side of the
dialog, we could directly tease out the remote source’s intent by efficiently driving it through the
routine part of the dialog until it reaches the point where, if it is indeed something new, it will reveal
its distinguishing nature by parting from the script.
A powerful example concerns the use of replay to construct proxies in our honeyfarm system.
Suppose in the first example above that not only do we wish to determine whether an incoming
probe reflects a new type of attack or a known type, but we also want to filter the attack if it is a
known type but allow it through if it is a new type. We could use replay to efficiently do so in two
steps (see Figure 5.2 in Section 5.2.4 for an illustration). First, we would replay the targeted server’s
behavior in response to the probe’s initial activity (e.g., setting up a Windows SMB RPC call) until
we reach a point where a new attack variant will manifest itself (e.g., by making a new type of SMB
call or by sending over a previously unseen payload for an existing type of call). If the remote host
at this point proves to lack novelty (we see the same final step in its activity as we have seen before),
then we drop the connection. However, if it reveals a novel next step, then at this point we would
like it to engage a high-interaction honeypot server so we can examine the new attack. To do so,
though, we must bring the server “up to speed” with respect to the application dialog in which the
remote host is already engaged. We can do so by using replay again, this time replaying the remote
host’s previous actions to the new server so that the two systems become synchronized, after which
we can allow the remote host to proceed with its probing. If we perform such proxying correctly, the
remote host will never know that it was switched from one type of responder (our initial replayer)
to another (the high-interaction honeypot).
Another example comes from trying to determine the equivalent for malware of a “toxicology
spread” for a biological pathogen. That is, given only an observed instance of a successful attack
against a given type of server (i.e., particular OS version and server patch level), how can we determine what other server/OS versions are also susceptible? If we have instances of other possible
versions available, then we could test their vulnerability by replaying the original attack against
them, provided the replay again takes into account the natural session variants such as differing
hostnames, IP addresses, and session cookies. Similarly, we could use replay to feed possible attacks into more sophisticated analysis engines (e.g., [62] and [15]).
We can also use lightweight replay to facilitate testing of network systems. For example, when
developing or configuring a new security mechanism it can be very helpful if we can easily repeat
attacks against the system to evaluate its response. Historically, this may require a complex testbed
to repeatedly run a piece of malware in a safe, restricted fashion. Armed with a replay system,
however, we could capture a single instance of an attack and then replay it against refinements of
the security mechanism without having to reestablish an environment for the malware to execute
within.
We can generalize such repeated replay to tasks of stress-testing, evaluating servers, or conducting large-scale measurements. A replay system could work as a low-cost client by replaying
application dialogs with altered parameters in part of the dialog. For example, by dynamically replacing the receiver’s email address in an SMTP dialog, we could use replay to create numerous
SMTP clients that send the same email to different addresses. We could use such a capability for
both debugging and performance testing, without needing to either create specialized clients or
invoke repeated instances of a computationally expensive piece of client software.
The programming and operating systems communities have studied the notion of replaying program execution for a number of years. Debugging was their targeted application for replay. LeBlanc
and Mellor-Crummey [43] studied how to leverage replay to debug parallel programs. Russinovich
and Cogswell [72] presented the repeatable scheduling algorithm for recording and replaying the
execution of shared-memory applications. King et al. [37] described a time-traveling virtual machine for debugging operating systems. Time travel allows a debugger to go back and forth arbitrarily through the execution history and to replay arbitrary segments of the past execution. Recently,
Geels et al. [24] developed a replay debugging tool for distributed C/C++ applications. However,
we know of little literature discussing general replay of network activity at the application level. The
existing work has instead focused on incorporating application-specific semantics. Honeyd [66], a
virtual honeypot framework, supports emulating arbitrary services that run as external applications
and receive/send data from/to stdin/stdout.
Libes’ expect tool includes the same notion as our work of automatically following a “script”
of expected interactive dialogs [49]. A significant difference of our work, however, is that we focus
on generating the script automatically from previous communications.
A number of commercial security products address replaying network traffic [16, 52]. These
products however appear limited to replay at the network or transport layer, similar to the Monkey
and Tcpreplay tools [12, 90]. The Flowreplay tool uses application-specific plug-ins to support
application-level replay [93].
To achieve protocol-independent replay, we develop a new system named RolePlayer. Our
work leverages the Needleman-Wunsch algorithm [60], widely used in bioinformatics research, to
locate fields that have changed between one example of an application session and another. In this
regard, our approach is similar to the recent Protocol Informatics project [8], which attempts to
identify protocol fields in unknown or poorly documented network protocols by comparing a series
of samples using sequence alignment algorithms.
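To make the alignment step concrete, the sketch below gives a plain Needleman-Wunsch implementation in Python. It is a textbook version for illustration, not RolePlayer’s code: it aligns two byte strings (say, the same protocol message captured in two sessions) and reports the positions where they differ, which are candidates for dynamic fields such as hostnames, addresses, and cookies.

    def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
        """Globally align two byte strings and return the list of aligned (byte, byte) pairs."""
        n, m = len(a), len(b)
        score = [[0] * (m + 1) for _ in range(n + 1)]   # score[i][j]: best score of a[:i] vs b[:j]
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
        aligned, i, j = [], n, m                        # trace back to recover the alignment
        while i > 0 or j > 0:
            if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + \
                    (match if a[i - 1] == b[j - 1] else mismatch):
                aligned.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
            elif i > 0 and score[i][j] == score[i - 1][j] + gap:
                aligned.append((a[i - 1], None)); i -= 1
            else:
                aligned.append((None, b[j - 1])); j -= 1
        return list(reversed(aligned))

    def differing_positions(aligned):
        """Indices of aligned positions whose bytes differ, i.e., likely variable protocol fields."""
        return [k for k, (x, y) in enumerate(aligned) if x != y]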
Concurrent with our RolePlayer work, Leita et al. proposed ScriptGen [46], a system for automatically generating honeyd [66] scripts without requiring prior knowledge of application protocols.
They presented a novel region analysis algorithm to identify and merge fields in an application protocol. While these results hold promise for a number of applications, it remains a challenge to use
the approach for completing interactions with attacks using complex protocols such as SMB.
Before describing the design of RolePlayer, we first present an overview of design goals, challenges, terminology, and assumptions in the next section.
4.2 Overview
In this section, we first describe our design goals. Then, we discuss the challenges in implementing protocol-independent replay. Finally, we present the terminology and design assumptions.
4.2.1 Goals
In developing our replay system, RolePlayer, we aim for the system to achieve several important
goals:
• Protocol independence. The system should not need any application-specific customization,
so that it works transparently for a large class of applications, including both as a client and
as a server.
• Minimal training. The system should be able to mimic a previously seen type of application
dialog given only a small number of examples.
• Automation. Given such examples, the system should operate correctly without requiring
any manual intervention.
4.2.2 Challenges
In some cases, replaying is trivial to implement. For example, each attack by the Code Red
worm sends exactly the same byte stream over the network to a target. However, successfully replaying an application dialog can be much more complicated than simply parroting the stream.
[Figure 4.1: diagram of the Blaster infection process. The Attacker compromises the Victim and opens a shell, then issues "tftp -i xx.xx.xx.xx GET msblast.exe"; the Victim downloads msblast.exe via TFTP; the Attacker then starts msblast.exe.]
Figure 4.1. The infection process of the Blaster worm.
Consider for example the Blaster worm of August 2003, which exploited a DCOM RPC vulnerability in Windows by attacking port 135/tcp. For Blaster, if a compromised host (A) finds a new vulnerable host (V) via its random scanning, then the following infection process occurs (see Figure 4.1).
1. A opens a connection to 135/tcp on V and sends three packets with payload sizes of 72, 1460,
and 244 bytes. These packets compromise V and open a shell listening on 4444/tcp.
2. A opens a connection to 4444/tcp and issues “tftp -i xx.xx.xx.xx GET msblast.exe” where
“xx.xx.xx.xx” is A’s IP address.
3. V sends a request back to 69/udp on A to download msblast.exe via TFTP.
4. A issues commands via 4444/tcp to start msblast.exe on host V .
This example illustrates a number of challenges for implementing replay.
1. An application session can involve multiple connections, with the initiator and responder
switching roles as both client and server, which means RolePlayer must be able to run as
client and server simultaneously.
2. While replaying an application dialog we sometimes cannot coalesce data. For example, if
the replayer attempts to send the first Blaster data unit as a single packet, we observe two
packets with sizes of 1460 and 316 bytes due to Ethernet framing. For reasons we have been
unable to determine, these packets do not compromise vulnerable hosts. Thus, RolePlayer
must consider both application data units and network framing. (It appears that there is a
race condition in Blaster’s exploitation process—A connects to 4444/tcp before the port is
opened on V . If we do not consider the network framing, it increases the likelihood of the
race condition. Accommodating such timing issues in replay remains for future work.)
3. Endpoint addresses such as IP addresses, port numbers and hostnames may appear in application data. For example, the IP address of A appears in the TFTP download command. This
requires RolePlayer to find and update endpoint addresses dynamically.
4. Endpoint addresses (especially names but also IP addresses, depending on formatting) can
have variable lengths, and thus the data size of a packet or application data unit can differ
between dialogs. This creates two requirements for RolePlayer: deciding if a received application data unit is complete, particularly when its size is smaller than expected; and changing
the value of such length fields when replaying them with different endpoint addresses.
5. Some applications use “cookie” fields to record session state. For example, the process ID
of the client program is a cookie field for the Windows CIFS protocol, while the file handle
is one in the NFS protocol. Therefore, RolePlayer must locate and update these fields during
replay.
6. We observe that non-zero padding up to 3 bytes may be inserted into an application data unit,
which we must accommodate without confusion during replay.
4.2.3 Terminology
We will use the term application session to mean a fixed series of interactions between two
hosts that accomplishes a specific task (e.g., uploading a particular file into a particular location).
The term application dialog refers to a recorded instance of a particular application session. The
host that starts a session is the initiator. The initiator contacts the responder. (We avoid “client”
and “server” here because for some applications the two endpoints assume both roles at different
times.) We want RolePlayer to be able to correctly mimic both initiators and responders. In doing
so, it acts as the replayer, using previous dialog(s) as a guide in communicating with the remote live
peer that runs the actual application program.
An application session consists of a set of TCP and/or UDP connections, where a UDP “connection” is a pair of unidirectional UDP flows that are matched on source/destination IP address/port
number. In a connection, an application data unit (ADU) is a consecutive chunk of application-level data sent in one direction, which spans one or more packets. RolePlayer only cares about
application-level data, ignoring both network and transport-layer headers when replaying a session.
Table 4.1. Definitions of different types of dynamic fields.

    Type              Definition
    endpoint-address  hostnames, IP addresses, transport port numbers
    length            1 or 2 bytes reflecting the length of either the ADU or a subsequent dynamic field
    cookie            session-specific opaque data that appears in ADUs from both sides of the dialog
    argument          fields that customize the meaning of a session
    don't-care        opaque fields that appear in only one side of the dialog
Within an ADU, a field is a byte sequence with semantic meaning. A dynamic field is a field that
potentially changes between different dialogs. We classify dynamic fields into five types: endpoint-address, length, cookie, argument, and don’t-care (see Table 4.1 for their definitions). Argument
fields (e.g., the destination directory for a file transfer or the domain name in a DNS lookup) are
only relevant for replay if we want to specifically alter them; for don’t-care fields, we ignore the
difference if the value for them we receive from a live peer differs from the original, and we send
them verbatim if communicating them to a live peer.
To replay a session, we need at least one sample dialog—the primary application dialog—to use
as a reference. We may also need an additional secondary dialog for discovering dynamic fields,
particularly length fields. Finally, we refer to the dialog generated during replay as the replay dialog.
4.2.4 Assumptions
Given the terminology, we can frame our design assumptions as follows.
1. We have available primary and (when needed) secondary dialogs that differ enough to disclose
the dynamic fields, but are otherwise the same in terms of the application semantics.
2. We assume that the live peer is configured suitably similar to its counterpart in the dialogs.
3. We assume knowledge of some standard network protocol representations, such as embedded IP addresses being represented either as a four-byte integer or in dotted or comma-separated format and embedded transport port numbers as two-byte integers. (We currently
consider network byte order only, but it is an easy extension to accommodate either endianness.) Note that we do not assume the use of a single format, but instead encode into
RolePlayer each of the possible formats.
4. We assume the domain names of the participating hosts in sample dialogs can be provided
by the user if required for successful replay (see Section 4.4.7 for more discussion on this
assumption).
5. We assume that the application session does not include time-related behavior (e.g., wait 30
seconds before sending the next message) and does not require encryption or cryptographic
authentication.
We will show that these assumptions do not impede RolePlayer from replaying a wide class of
application protocols (see Section 4.4).
4.3 Design
The basic idea of RolePlayer is straightforward: given an example or two of an application
session, locate the dynamic fields in the ADUs and adjust them as necessary before sending the
ADUs out from the replayer. Since some dynamic fields, such as length fields, can only be found
by comparing two existing sample application dialogs, we split the work of RolePlayer into two
stages: preparation and replay. During preparation, RolePlayer first searches for endpoint-address
and argument fields in each sample dialog, then searches for length fields and possible cookie fields
by comparing the primary and secondary dialogs. During replay, it first searches for new values of
dynamic fields by comparing received ADUs with the corresponding ones in the primary dialog,
then updates them with the new values. In this section, we describe both stages in detail.
Before proceeding, we note that a particularly important issue concerns dynamic ports. RolePlayer needs to determine when to initiate or accept new connections, and in particular must recognize additional ports that an application protocol dynamically specifies. To do so, RolePlayer
detects stand-alone ports and IP address/port pairs during its search process, and matches these with
subsequent connection requests. This enables the system to accommodate features such as FTP’s
use of dynamic ports for data transfers, and portmapping as used by SunRPC.
4.3.1 Preparation Stage
[Figure 4.2: flowchart of the preparation stage. RolePlayer parses the application dialogs, reads the given hostnames and arguments, and searches for endpoint-address and argument fields in the primary dialog; if a secondary dialog exists, it also searches for endpoint-address and argument fields in the secondary dialog, splits the ADUs into segments based on the found fields, and searches for length, cookie, and don't-care fields.]
Figure 4.2. Steps in the preparation stage.
In the preparation stage, RolePlayer needs to parse network traces of the application dialogs and
search for the dynamic fields within. For its processing we may also need to inform RolePlayer of
the hostnames of both sides of the dialog, and any application arguments of interest, as these cannot
be inferred from the dialog. These steps of the preparation stage are shown in Figure 4.2.
Parsing Application Dialogs
[Figure 4.3: a PASV FTP dialog between an FTP client and an FTP server. The client sends "Request: RETR setup.exe"; the server responds "150 Opening BINARY mode data connection for setup.exe"; the client opens the FTP data connection and the file data is transferred; the server then responds "226 File send ok".]
Figure 4.3. A sample dialog for PASV FTP.
RolePlayer organizes dialogs in terms of both ADUs and data packets. It uses data packets
as the unit for sending and receiving data and ADUs as the unit for manipulating dynamic fields.
Note that data packets may interleave with ADUs. For example, FTP sessions use two concurrent
connections, one for data transfer and one for control (see Figure 4.3). RolePlayer needs to honor
the ordering between these as revealed by the packet sequencing on the wire, such as ensuring that
a file transfer on the data channel completes before sending the “226 Transfer complete” on the
control channel. Accordingly, we use the SYN, FIN and RST bits in the TCP header to delimit the
beginning and end of each connection, so we know the correct temporal ordering of the connections
within a session.
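As a small illustration of this convention, the following sketch (our own, assuming direct access to the TCP flags byte; not RolePlayer's actual code) classifies how a packet delimits a connection:

    #include <stdint.h>

    /* TCP flag bits as they appear in the TCP header. */
    #define TCP_FIN 0x01
    #define TCP_SYN 0x02
    #define TCP_RST 0x04

    /* How a packet delimits a connection: SYN marks the beginning,
     * FIN or RST marks the end, anything else is ordinary data. */
    enum conn_event { CONN_OPEN, CONN_CLOSE, CONN_DATA };

    static enum conn_event classify_tcp_flags(uint8_t flags)
    {
        if (flags & TCP_SYN)
            return CONN_OPEN;
        if (flags & (TCP_FIN | TCP_RST))
            return CONN_CLOSE;
        return CONN_DATA;
    }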
Searching for Dynamic Fields
RolePlayer next searches for dynamic fields in the primary dialog, and also by comparing it
with the secondary dialog (if available). The searching process contains a number of subtle steps
and considerations (we discuss some design issues in detail in Section 4.3.3). We illustrate these
using replay of a fictitious toy protocol, Service Port Discovery (SPD), as a running example.
    Request:  LEN-0 | TYPE | SID | LEN-1 | HOSTNAME | LEN-2 | SERVICE
    Response: LEN-0 | TYPE | SID | LEN-1 | IP-PORT

Figure 4.4. The message format for the toy Service Port Discovery protocol.
In SPD, a client sends a request message, carrying the client’s hostname and a service name, to
a server to ask for the port number of that service. The server replies with an IP address and port
number expressed in the comma-separated syntax used by FTP.
These two messages have the formats shown in Figure 4.4. Requests have 7 fields: LEN-0 (1 byte) holds the length of the message. TYPE (1 byte) indicates the message type, with a value of 1 indicating a request and 2 a response message. SID (2 bytes) is a session/transaction identifier, which the server must echo in its response. LEN-1 (1 byte) stores the length of the client hostname, HOSTNAME, which the server logs. LEN-2 (1 byte) stores the length of the service name, SERVICE.
Responses have 5 fields. LEN-0, TYPE and SID have the same meanings as in requests. LEN-1
(1 byte) stores the length of the IP-PORT field, which holds the IP address and port number of the
requested service in a comma-separated format. For example, 1.2.3.4:567 is expressed as “0x31
0x2c 0x32 0x2c 0x33 0x2c 0x34 0x2c 0x32 0x2c 0x35 0x35”.
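To make the byte layout concrete, the following sketch (our illustration only; the function name and error handling are not part of RolePlayer) serializes an SPD request according to the format just described:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Pack a toy SPD request: LEN-0 | TYPE | SID | LEN-1 | HOSTNAME | LEN-2 | SERVICE.
     * Returns the total message length, or 0 if it does not fit.
     * (Illustrative only; the SID is written in network byte order.) */
    static size_t spd_pack_request(uint8_t *buf, size_t buflen, uint16_t sid,
                                   const char *hostname, const char *service)
    {
        size_t hlen = strlen(hostname), slen = strlen(service);
        size_t total = 1 + 1 + 2 + 1 + hlen + 1 + slen;

        if (total > buflen || total > 255 || hlen > 255 || slen > 255)
            return 0;
        buf[0] = (uint8_t)total;        /* LEN-0: length of the whole message  */
        buf[1] = 1;                     /* TYPE: 1 = request, 2 = response     */
        buf[2] = (uint8_t)(sid >> 8);   /* SID: session/transaction identifier */
        buf[3] = (uint8_t)(sid & 0xff);
        buf[4] = (uint8_t)hlen;         /* LEN-1: length of HOSTNAME           */
        memcpy(&buf[5], hostname, hlen);
        buf[5 + hlen] = (uint8_t)slen;  /* LEN-2: length of SERVICE            */
        memcpy(&buf[6 + hlen], service, slen);
        return total;
    }

    int main(void)
    {
        uint8_t msg[64];
        /* "host01" and "printer" yield a 19-byte request (as in Figure 4.5). */
        printf("%zu\n", spd_pack_request(msg, sizeof msg, 0x1314, "host01", "printer"));
        return 0;
    }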
    Primary request:    19 1 1314 6 host01 7 printer
    Primary response:   18 2 1314 13 10,2,3,4,2,55
    Secondary request:  16 1 0300 7 hosttwo 3 map
    Secondary response: 19 2 0300 14 10,4,5,6,2,142

Figure 4.5. The primary and secondary dialogs for requests (left) and responses (right) using the toy SPD protocol. RolePlayer first discovers endpoint-address (bold italic) and argument (italic) fields, then breaks the ADUs into segments (indicated by gaps), and discovers length (gray background) and possible cookie (bold) fields.
We note that this protocol is sufficiently simple that some of the operations we illustrate appear
trivial. However, the key point is that exactly the same operations also work for much more complex
protocols (such as Windows SMB/CIFS). Figure 4.5 shows two SPD dialogs, each consisting of
a request (left) and response (right). For RolePlayer to process these, we must inform it of the
embedded hostnames (“host01”, “hosttwo”) and arguments (“printer”, “map”), though we do
not need to specify their locations in the protocol. If we did not specify the arguments, they would
instead be identified and treated as don’t-care fields. We do not need to inform RolePlayer of the
hostname of a live peer because RolePlayer can automatically find it as the byte subsequence aligned
with the hostname in the primary dialog if it appears in an ADU from the live peer. In addition, we
do not inform RolePlayer of the embedded transaction identifiers (1314, 0300), length fields, IP
addresses (10.2.3.4, 10.4.5.6), or port numbers (567 = 2 · 256 + 55, 654 = 2 · 256 + 142).
The naive way to search for dynamic fields would be to align the byte sequences of the corresponding ADUs and look for subsequences that differ. However, we need to treat endpoint-address
and argument fields as a whole; for example, we do not want to decide that one difference between
the primary and secondary dialogs is changing “01” in the first to “two” in the second (the tail end
of the hostnames). Similarly, we want to detect that 13 and 14 in the replies are length fields and
not elements of the embedded IP addresses that changed. To do so, we proceed as follows:
1. Search for endpoint-address and argument fields in both dialogs by finding matches of presentations of their known values. For example, we will find “host01” as an endpoint-address
field and “printer” as an argument field in the primary’s request.
We consider seven possible presentations and their Unicode [94] equivalents for endpoint
addresses, and one presentation and its Unicode equivalent for arguments. For example, for
the primary’s reply we know from the packet headers in the primary trace that the server’s
IP address is 10.2.3.4, in which case we search for: the binary equivalent (0x0A020304);
ASCII dotted-quad notation (“10.2.3.4”); and comma-separated octets (“10,2,3,4”). The latter
locates the occurrence of the address in the reply. (If the server’s address was something
different, then we would not replace “10,2,3,4” in the replayed dialog.)
2. If we have a secondary dialog, then RolePlayer splits each ADU into segments based on the
endpoint-address and argument fields found in the previous step.
3. Finally, RolePlayer searches for length, cookie, and don’t-care fields by aligning and comparing each pair of data segments. By “alignment” here we mean application of the Needleman-Wunsch algorithm [60], which efficiently finds the minimal set of differences between two
byte sequences subject to constraints; see Section 4.3.3 below for discussion. An important
point is that at this stage we do not distinguish between cookie fields and don’t-care fields.
Only during the actual subsequent live session will we see whether these fields are used in
a manner consistent with cookies (which need to be altered during replay) or don’t-care’s
(which shouldn’t). See Section 4.3.2 for the process by which we make this decision.
In the SPD example (Figure 4.5), aligning and comparing the five-byte initial segments in the
primary and secondary requests results in the discovery of two pairs of length fields (19 vs.
16, and 6 vs. 7) and one cookie field (1314 vs. 0300). To find these, RolePlayer first checks
if a pair of differing bytes (or differing pairs of bytes) are consistent with being length fields,
i.e., their numeric values differ by one of: (1) the length difference of the whole ADU, (2) the
length difference of an endpoint-address or argument field that comes right after the length
fields, or (3) the double-byte length difference of these, if the subsequent field is in Unicode
format (a minimal sketch of this check appears after this list). For example, RolePlayer finds 19 and 16 as length fields because their difference
matches the length difference of the request messages, while the difference between 6 and 7
matches the length difference of the client hostnames. (Note that, to accommodate Unicode,
we merge any two consecutive differing byte sequences if there is only one single zero byte
between them.)
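The length-field test described in step 3 reduces to comparing numeric differences; the sketch below is our own illustration of that check (the function and parameter names are hypothetical), applied to the SPD example:

    #include <stdbool.h>
    #include <stdio.h>

    /* Is a pair of differing byte values, taken from aligned positions in the
     * primary and secondary dialogs, consistent with being a length field?
     * adu_len_*       : total lengths of the two corresponding ADUs
     * next_field_len_*: lengths of the endpoint-address or argument field that
     *                   immediately follows the candidate bytes (0 if none)   */
    static bool looks_like_length_field(int primary_val, int secondary_val,
                                        int adu_len_p, int adu_len_s,
                                        int next_field_len_p, int next_field_len_s)
    {
        int d = primary_val - secondary_val;
        if (d == adu_len_p - adu_len_s)
            return true;                          /* (1) whole-ADU length difference    */
        if (next_field_len_p && d == next_field_len_p - next_field_len_s)
            return true;                          /* (2) following field's difference   */
        if (next_field_len_p && d == 2 * (next_field_len_p - next_field_len_s))
            return true;                          /* (3) doubled, for Unicode fields    */
        return false;
    }

    int main(void)
    {
        /* SPD: 19 vs. 16 matches the request-length difference (condition 1);
         * 6 vs. 7 matches the hostname-length difference (condition 2).       */
        printf("%d %d\n",
               looks_like_length_field(19, 16, 19, 16, 6, 7),
               looks_like_length_field(6, 7, 19, 16, 6, 7));
        return 0;
    }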
4.3.2 Replay Stage
With the preparation complete, RolePlayer can communicate with a live peer (the remote application program that interacts with RolePlayer). To do so, it uses the primary application dialog
plus the discovered dynamic fields as its “script,” allowing it to replay either the initiator or the
responder. Figures 4.6(a) and 4.6(b) give examples of creating an initial request and responding to
a request from a live peer in SPD, respectively. To construct these requires several steps, as shown
in Figure 4.7.
Deciding Packet Direction
We read the next data packet from the script to see whether we now expect to receive a packet
from the live peer or send one out.
    Scripted (primary) request:  19 1 1314 6 host01 7 printer
    Generated request:           18 1 1314 4 newh 8 schedule

(a) The scripted (primary) dialog (top) and RolePlayer’s generated dialog (bottom) for an SPD request for which we have instructed it to use a different hostname and service. The fields in black background reflect those updated to account for the modified request and the automatically updated length fields.

    Scripted request:   19 1 1314 6 host01 7 printer
    Scripted reply:     18 2 1314 13 10,2,3,4,2,55
    Live peer request:  18 1 1616 5 host4 7 printer
    Generated reply:    19 2 1616 14 10,21,8,5,2,55

(b) The same for constructing an SPD reply to a live peer that sends a different transaction ID in their request, and for which the replayer is running on a different IP address. Note that the port in the reply stays constant.

Figure 4.6. Initiator-based and responder-based SPD replay dialogs.
[Figure 4.7: flowchart of the replay stage. RolePlayer reads the next packet from the script (finishing when none remain) and decides whether to send or receive it; when sending, it updates the dynamic fields in the ADU if this is the ADU's first packet and then sends the packet; when receiving, it receives the packet and, if it is the ADU's last packet, finds the dynamic fields in it.]
Figure 4.7. Steps in the replay stage.
Receiving Data
If we expect to receive a packet, we read data from the specific connection. Doing so requires
deciding whether the received data is complete (i.e., equivalent to the corresponding data in the
script). To do so, we use the alignment algorithm (Section 4.3.3) with the constraint that the match
should be weighted to begin at the same point (i.e., at the beginning of the received data). If it yields
a match with no trailing “gap” (i.e., it did not need to pad the received data to construct a good
match), then we consider that we have received the expected data. Otherwise, we wait for more data
to arrive.
After receiving a complete ADU, we compare it with the corresponding one in the script to locate dynamic fields. This additional search is necessary for two reasons. First, RolePlayer may need
to find new values of endpoint addresses or arguments. Second, cookie fields found in the replay
stage may differ from those found in the preparation stage due to accidental agreement between the
primary and secondary dialogs.
We can apply the techniques used in the preparation stage to find dynamic fields, but with the
major additional challenge that now for the live dialog (the ongoing dialog with the live peer) we
do not have access to the endpoint addresses and application arguments. While the script provides
us with guidance as to the existence of these fields, they are often not in easily located positions
but instead surrounded by don’t-care fields. The difficult task is to pinpoint the fields that need to
change in the midst of those that do not matter. If by mistake this process overlaps an endpoint-address or argument field with a don’t-care field, then this will likely substitute incorrect text for
the replay. However, we can overcome these difficulties by applying the alignment algorithm with
pairwise constraints (Section 4.3.3). Doing so, we find that RolePlayer correctly identifies the fields
with high reliability.
After finding the endpoint-address and argument fields, we then use the same approach as for
the preparation stage to find length, cookie and don’t-care fields. We save the data corresponding
to newly found endpoint-address, argument, and cookie fields for future use in updating dynamic
fields.
Sending Data
When the script calls for us to send a data packet, we check whether it is the first one in an
ADU. If so, we update any dynamic fields in it and packetize it. The updated values come from
three sources: (1) analysis of previous ADUs; (2) the IP address and transport port numbers on
the replayer; (3) user-specified (for argument fields). After updating all other fields, we then adjust
length fields.
This still leaves the cookie fields for updating. RolePlayer only updates cookie fields reactively,
i.e., to reflect changes first introduced by the live peer. So, for example, in Figure 4.6(a) we do not
change the transaction ID when sending out the request, but we do change it in the reply shown in
Figure 4.6(b).
To update cookie values altered by the live peer, we search the ADU we are currently constructing for matches to cookie fields we previously found by comparing received ADUs with the script.
However, some of these identified cookie fields may in fact not be true cookies (and thus should
not be reflected in our new ADU), for three reasons: some Windows applications use non-zero-byte
padding; sometimes a single, long field becomes split into multiple, short cookie fields due to partial matches within it; and messages such as FTP server greetings can contain inconsistent (varying)
data (e.g., the FTP software name and version).
Thus, another major challenge is to determine robustly which cookie matches we should actually update. We studied several popular protocols and found that cookie fields usually appear in
the same context (i.e., with the same fields preceding and trailing them). Also, the probability of a
false match to an N -byte cookie field is very small when N is large (e.g., when N ≥ 4). Hence, to
determine if we should update a cookie match, we check four conditions, requiring at least two to
hold.
1. Does the byte sequence have at least four bytes? This condition reflects the fact that padding
fields are usually less than four bytes in length because they are used to align an ADU to a
32-bit boundary.
2. Does the byte sequence overlap a potential cookie field found in the preparation stage? (If
there is no secondary dialog, this condition will always be false, because we only find cookie
fields during preparation by comparing the primary and secondary dialog.) The intuition here
is that the matched byte sequence is more likely to be a correct one since it is part of a potential
cookie field.
3. Is the prefix of the byte sequence consistent with that of the matched cookie field? The prefix
is the byte sequence between the matched cookie field and the preceding non-cookie dynamic
field (or the beginning of the ADU). For prefixes exceeding 4 bytes, we consider only the last
four bytes (next to the targeted byte sequence). For empty prefixes, if the non-cookie dynamic
fields are the same type (or it is the beginning of the ADU) then the condition holds. This
condition matches cookie fields with the same leading context.
4. Is the suffix consistent? The same as the prefix condition but for the trailing context.
In the SPD example (Figure 4.5), byte sequences of 1314 in the request and response messages
in the primary dialog are cookie fields because they meet the second and fourth condition above.
By the second condition, they both overlap 0300 in the secondary dialog. By the fourth condition,
they are both trailed by a length field. On the other hand, the first and third conditions are not true
because they have only two bytes and their prefix fields are 1 and 2, respectively.
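Given boolean results for the four checks, the decision itself is a simple two-of-four vote. The following sketch is our own rendering of that rule (the struct and function names are hypothetical):

    #include <stdbool.h>
    #include <stdio.h>

    /* Results of the four cookie-match checks described above. */
    struct cookie_checks {
        bool at_least_four_bytes;      /* (1) candidate is at least four bytes long      */
        bool overlaps_prepared_cookie; /* (2) overlaps a cookie found during preparation */
        bool prefix_consistent;        /* (3) same leading context                       */
        bool suffix_consistent;        /* (4) same trailing context                      */
    };

    /* Update the matched cookie only if at least two of the four conditions hold. */
    static bool should_update_cookie(const struct cookie_checks *c)
    {
        int votes = c->at_least_four_bytes + c->overlaps_prepared_cookie +
                    c->prefix_consistent + c->suffix_consistent;
        return votes >= 2;
    }

    int main(void)
    {
        /* SPD SID field: only conditions (2) and (4) hold, so it is updated. */
        struct cookie_checks sid = { false, true, false, true };
        printf("%d\n", should_update_cookie(&sid));
        return 0;
    }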
4.3.3 Design Issues
Sequence Alignment
The cornerstone of our approach is to compare two byte streams (either a primary dialog and a
secondary dialog, or a script and the ADUs we receive from a live peer) to find the best description
of their differences. The whole trick here is what constitutes “best”. Because we strive for an
application-independent approach, we cannot use the semantics of the underlying protocol to guide
the matching process. Instead we turn to generic algorithms that compare two byte streams using
customizable, byte-level weightings for determining the significance of differences between the two.
The term used for the application of these algorithms is “alignment”, since the goal is to find
which subsequences within two byte streams should be considered as counterparts. The Needleman-Wunsch algorithm [60] we use is parameterized in terms of weights reflecting the value associated
with identical characters, differing characters, and missing characters (“gaps”). The algorithm then
uses dynamic programming to find an alignment between two byte streams (i.e., where to introduce,
remove, or transform characters) with maximal weight.
We use two different forms of sequence alignment, global and semi-global. Global refers to
matching two byte streams against one another in their entirety, and is done using the standard
Needleman-Wunsch algorithm. Semi-global reflects matching one byte stream as a prefix or suffix
of the other, for which we use a modified Needleman-Wunsch algorithm (see below).
When considering possible alignments, the algorithm assigns different weightings for each
aligned pair of characters depending on whether the characters agree, disagree, or one is a gap.
Let the weight be m if they agree, n if they disagree, and g if one is a gap. The total score for a possible alignment is then the sum of the corresponding weights. For example, given abcdf and acef,
with weights m = 2, n = −1, g = −2, the optimal alignment (which the algorithm is guaranteed
to find) is abcdf with a-cef, with score m + g + m + n + m = 3, where “-” indicates a gap.
To compute semi-global alignments of matching one byte stream as prefix of the other, we
modify the algorithm to ignore trailing gap penalties. For example, given two strings ab and abcb,
we obtain the same global alignment score of 2m + 2g for the alignments ab-- with abcb, versus
a--b with abcb. But for semi-global alignment, the similarity score is 2m for the first and 2m + 2g
for the second, so we prefer the first since g takes a negative value.
The quality of sequence alignment depends critically on the particular parameters (weightings)
we use. For our use, the major concern is deciding how many gaps to allow in order to gain a match.
For example, when globally aligning ab with bc, two possible alignment results are ab with bc
(score n + n) or ab- with -bc (score g + m + g). The three parameters will have a combined linear
relationship (since we add combinations of them linearly to obtain a total score), so we proceed by
fixing n and g (to 0 and -1, respectively), and adjusting m for different situations.
For global alignment—which we use to align two sequences before comparing them and locating length, cookie, and don’t-care fields—we set m = 1 to avoid alignments like ab- with -bc.
For semi-global alignment—used during replay to decide whether received data is complete—we
set m to the length difference of the two sequences. The notion here is that if the last character
of the received data matches the last one in the ADU from the script, then m is large enough to
offset the gap penalty caused by aligning the characters together. However, if the received data is
indeed incomplete, its better match to only the first part of the ADU will still win out. Using the
semi-global alignment example above, we will set m = 2 (due to the length of ab vs. abcb), and
hence still find the best alignment with abcb to be ab-- rather than a--b.
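For concreteness, here is a minimal sketch of the scoring computation (our illustration, not the actual RolePlayer implementation): the standard Needleman-Wunsch recurrence with weights m, n, and g, plus a semi-global variant that ignores trailing gap penalties. It computes only the score, not the alignment itself, and its output reproduces the example scores discussed above.

    #include <stdio.h>
    #include <stdlib.h>

    /* Needleman-Wunsch alignment score of byte sequences a and b.
     * m: weight for identical bytes, n: for differing bytes, g: for a gap.
     * If semiglobal is non-zero, gaps used to pad the end of a against the tail
     * of b carry no penalty, i.e., a may match a prefix of b for free. */
    static int nw_score(const unsigned char *a, int la,
                        const unsigned char *b, int lb,
                        int m, int n, int g, int semiglobal)
    {
        int i, j, best;
        int *dp = malloc((size_t)(la + 1) * (size_t)(lb + 1) * sizeof *dp);
    #define DP(x, y) dp[(x) * (lb + 1) + (y)]

        if (!dp)
            abort();
        for (j = 0; j <= lb; j++)
            DP(0, j) = j * g;                                /* leading gaps in a */
        for (i = 1; i <= la; i++) {
            DP(i, 0) = i * g;                                /* leading gaps in b */
            for (j = 1; j <= lb; j++) {
                int s = DP(i - 1, j - 1) + (a[i - 1] == b[j - 1] ? m : n);
                if (DP(i - 1, j) + g > s) s = DP(i - 1, j) + g;   /* gap in b */
                if (DP(i, j - 1) + g > s) s = DP(i, j - 1) + g;   /* gap in a */
                DP(i, j) = s;
            }
        }
        best = DP(la, lb);
        if (semiglobal)                        /* ignore trailing gap penalties */
            for (j = 0; j <= lb; j++)
                if (DP(la, j) > best)
                    best = DP(la, j);
    #undef DP
        free(dp);
        return best;
    }

    int main(void)
    {
        /* Global alignment of "abcdf" and "acef" with m=2, n=-1, g=-2: score 3. */
        printf("%d\n", nw_score((const unsigned char *)"abcdf", 5,
                                (const unsigned char *)"acef", 4, 2, -1, -2, 0));
        /* "ab" vs. "abcb" with n=0, g=-1, and m set to the length difference (2):
         * the global score is 2m + 2g = 2, the semi-global score is 2m = 4.      */
        printf("%d %d\n",
               nw_score((const unsigned char *)"ab", 2,
                        (const unsigned char *)"abcb", 4, 2, 0, -1, 0),
               nw_score((const unsigned char *)"ab", 2,
                        (const unsigned char *)"abcb", 4, 2, 0, -1, 1));
        return 0;
    }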
Finally, we make a powerful refinement to the Needleman-Wunsch algorithm for pinpointing
endpoint-address and/or argument fields in a received ADU: we modify the algorithm to work with
a pairwise constraint matrix. The matrix specifies whether the ith element of the first sequence
can or cannot be aligned with the jth element of the second sequence. We dynamically generate
this matrix based on the structure of the data from the primary dialog. For example, if the data
includes an endpoint-address field represented as a dotted-quad IP address, then we add entries in
the matrix prohibiting those digits from being matched with non-digits in the second data stream,
and prohibiting the embedded “.”s from being matched to anything other than “.”s. This modification
significantly improves the efficacy of the alignment algorithm in the presence of don’t-care fields.
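The constraint matrix itself is straightforward to generate once the primary dialog's field positions are known. The sketch below (our own; the interface is hypothetical) marks, for a dotted-quad endpoint-address field, which byte pairs the alignment is permitted to match:

    #include <ctype.h>
    #include <stdbool.h>

    /* Fill a pairwise constraint matrix for aligning a primary ADU against
     * received data.  Within a dotted-quad endpoint-address field occupying
     * bytes [addr_off, addr_off + addr_len) of the primary ADU, digits may only
     * be aligned with digits and '.' may only be aligned with '.'.
     * allowed is plen x rlen, row-major: allowed[i * rlen + j] says whether
     * primary byte i may be aligned with received byte j. */
    static void build_constraints(const unsigned char *primary, int plen,
                                  const unsigned char *received, int rlen,
                                  int addr_off, int addr_len, bool *allowed)
    {
        for (int i = 0; i < plen; i++) {
            bool in_addr = i >= addr_off && i < addr_off + addr_len;
            for (int j = 0; j < rlen; j++) {
                bool ok = true;
                if (in_addr && primary[i] == '.')
                    ok = received[j] == '.';
                else if (in_addr && isdigit(primary[i]))
                    ok = isdigit(received[j]) != 0;
                allowed[i * rlen + j] = ok;
            }
        }
    }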
Removing Overlap
When we don’t have a secondary dialog, and search for matches of cookie fields in an ADU
in the primary dialog (so that we can update them with new values), sometimes we find fields that
overlap with one another. We remove these using a greedy algorithm. For each overlapping dynamic
field, we set the penalty for removing it as the number of bytes we would lose from the union set of
all dynamic fields. (So, for example, a dynamic field that is fully embedded within another dynamic
field has a penalty of 0.) We then select the overlapping field with the least penalty, remove it, and
repeat the process until there is no overlap.
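The greedy removal can be sketched as follows (our own illustration; the field representation is hypothetical):

    #include <stdbool.h>
    #include <stdio.h>

    /* A candidate dynamic field occupying bytes [start, start + len) of an ADU. */
    struct field { int start, len; bool removed; };

    static bool overlaps(const struct field *a, const struct field *b)
    {
        return a->start < b->start + b->len && b->start < a->start + a->len;
    }

    /* Penalty for removing field k: bytes of k covered by no other kept field,
     * i.e., bytes that would disappear from the union of all dynamic fields. */
    static int removal_penalty(const struct field *f, int nf, int k)
    {
        int penalty = 0;
        for (int pos = f[k].start; pos < f[k].start + f[k].len; pos++) {
            bool covered = false;
            for (int i = 0; i < nf; i++)
                if (i != k && !f[i].removed &&
                    pos >= f[i].start && pos < f[i].start + f[i].len)
                    covered = true;
            if (!covered)
                penalty++;
        }
        return penalty;
    }

    /* Repeatedly remove the overlapping field with the least penalty until no
     * two kept fields overlap. */
    static void remove_overlaps(struct field *f, int nf)
    {
        for (;;) {
            int best = -1, best_penalty = 0;
            for (int i = 0; i < nf; i++) {
                if (f[i].removed)
                    continue;
                for (int j = 0; j < nf; j++) {
                    if (j == i || f[j].removed || !overlaps(&f[i], &f[j]))
                        continue;
                    int p = removal_penalty(f, nf, i);
                    if (best < 0 || p < best_penalty) {
                        best = i;
                        best_penalty = p;
                    }
                    break;
                }
            }
            if (best < 0)
                return;                      /* no overlapping fields remain */
            f[best].removed = true;
        }
    }

    int main(void)
    {
        /* The second field lies entirely within the first, so its penalty is 0
         * and the greedy pass removes it. */
        struct field f[] = { { 0, 8, false }, { 2, 4, false } };
        remove_overlaps(f, 2);
        printf("%d %d\n", f[0].removed, f[1].removed);   /* prints: 0 1 */
        return 0;
    }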
Handling Large ADUs
ADUs can be very large, such as when transferring a large data item using a single Windows
CIFS, NFS or FTP message. RolePlayer cannot ignore these ADUs when searching for dynamic
fields because there may exist dynamic fields embedded within them—generally at the beginning.
For example, an NFS “READ Reply” response comes with both the read status and the corresponding file data, and includes a cookie field containing a transaction identifier. However, the complexity
of sequence alignment is O(MN) for sequences of lengths M and N, making its application intractable for large sequences. RolePlayer avoids this problem by considering only fixed-size byte
sequences at the beginning of large ADUs (we take the first 1024 bytes in our current implementation).
4.4 Evaluation
    Protocol  Initiator Program  Responder Program  # Connections  # ADUs  # Initiator Fields  # Responder Fields
                                                                           (received / sent)   (received / sent)
    SMTP      manual             Sendmail           1              13       22 / 3               3 / 3
    DNS       nslookup           BIND               1              2         8 / 0               0 / 1
    HTTP      wget               Apache             1              2        10 / 0               0 / 0
    TFTP      W32.Blaster        Windows XP         3              34        5 / 1               1 / 1
    FTP       wget               ProFTPD            2              18       12 / 0               0 / 2
    NFS       mount              Linux Fedora NFS   9              36       34 / 12             46 / 23
    CIFS      W32.Randex.D       Windows XP         6              86      101 / 65             80 / 63

Table 4.2. Summary of evaluated applications and the number of dynamic fields in data received or sent by RolePlayer during either initiator or responder replay.
In the previous section, we described the design of RolePlayer. We implemented RolePlayer
in 7,700 lines of C code under Linux, using libpcap [50] to store traffic and the standard socket
API to generate traffic. Thus, RolePlayer only needs root access if it needs to send traffic from a
privileged port. RolePlayer tries to packetize data on TCP connections in the same fashion as seen in the primary/secondary dialogs. However, the socket API does not directly support such control. In our implementation, RolePlayer calls send and sendto to dispatch data segments at the same boundaries as seen in those dialogs, which usually achieves the expected packetization. We tested the system on a variety of protocols widely used in malicious attacks and network applications. Table 4.2 summarizes
our test suite. RolePlayer can successfully replay both the initiator and responder sides of all of
these dialogs. In the remainder of this section, we first describe our test environment, then present
the evaluation results, and finally discuss the limitations of RolePlayer.
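As a rough sketch of the per-segment dispatch just described (our own illustration, not RolePlayer's actual code), the replayer can issue one send call per recorded segment boundary over a connected TCP socket; with TCP this usually, though not always, yields the same on-the-wire framing as the sample dialog:

    #include <sys/socket.h>
    #include <sys/types.h>

    /* Send an ADU over a connected TCP socket, issuing one send() per segment
     * boundary recorded in the sample dialog.  seg_len[] holds the observed
     * payload sizes (e.g., 72, 1460 and 244 bytes for the first Blaster data
     * unit).  Returns 0 on success, -1 on error. */
    static int send_segments(int fd, const unsigned char *adu,
                             const size_t *seg_len, int nseg)
    {
        size_t off = 0;
        for (int i = 0; i < nseg; i++) {
            size_t sent = 0;
            while (sent < seg_len[i]) {          /* retry on partial sends */
                ssize_t r = send(fd, adu + off + sent, seg_len[i] - sent, 0);
                if (r < 0)
                    return -1;
                sent += (size_t)r;
            }
            off += seg_len[i];
        }
        return 0;
    }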
4.4.1 Test Environment
We conducted our evaluation in an isolated testbed consisting of a set of nodes running on
VMware Workstation [95] interconnected using software based on Click [39]. We used VMware
Workstation’s support for multiple guest operating systems and private networks between VM in62
stances to construct different, contained test configurations, with the Click system redirecting malware scans to our chosen target systems. We gave each running VM instance a distinct IP address
and hostname, and used non-persistent virtual disks to allow recovery from infection. For the tests
we used both Windows XP Professional and Fedora Core 3 images, to verify that replay works for
both Windows and Linux services. RolePlayer itself ran on the Linux host system rather than within
a virtual machine, enabling it to communicate with any virtual machine on the system.
4.4.2 Simple Protocols
Our simplest tests were for SMTP, DNS, and HTTP replay. For testing SMTP, we replayed an
email dialog with RolePlayer changing the recipient’s email address (an “argument” dynamic field).
In one instance, RolePlayer itself made this change as the session initiator; in the other, it played
the role of the responder (SMTP server).
For DNS, RolePlayer correctly located the transaction ID embedded within requests, updating it
when replaying the responder (DNS server). The HTTP dialog was similarly simple, being limited
to a single request and response. Since the request did not contain a cookie field, replaying trivially
consisted of purely resending the same data as seen originally (though RolePlayer found some
don’t-care fields in the HTTP header of the response message).
4.4.3 Blaster (TFTP)
As discussed in Section 4.2, Blaster [97] (see Figure 4.1) attacks its victim using a DCOM RPC
vulnerability. After attacking, it causes the victim to initiate a TFTP transfer back to the attacker to
download an executable, which it then instructs the victim to run. A Blaster attack session does not
contain any length fields or hostnames, so we can replay it without needing a secondary application
dialog or hostname information.
RolePlayer can replay both sides of the dialog. As an initiator, we successfully infected a remote
host with Blaster. As a fake victim, we tricked a live copy of Blaster into going through the full set
of infection steps when it probed a new system.
When replaying the initiator, RolePlayer found five dynamic fields in received ADUs. Of these,
it correctly deemed four as don’t-care fields. These were each part of a confirmation message,
specifying information such as data transfer size, time, and speed. The single dynamic field found
and updated was the IP address of the initiator, necessary for correct operation of the injected TFTP
command.
When replaying the responder, RolePlayer found only a single dynamic field, the worm’s IP
address, again necessary to correctly create the TFTP channel.
4.4.4 FTP
To test FTP replay (see Figure 4.3 for a sample dialog of PASV FTP), we used the
wget utility as the client program to connect to two live FTP servers, fedora.bu.edu and mirrors.xmission.com.
We collected two sample application dialogs using the command wget
ftp://ftp-server-name/fedora/core/3/i386/os/Fedora/RPMS/crontabs-1.10-7.noarch.rpm. In
both cases, wget used passive FTP, which uses dynamically created ports on the server side.
When acting as the initiator, we replayed the fedora.bu.edu dialog over a live session to mirrors.xmission.com, and vice versa. There were no length fields, so we did not need a secondary
dialog (nor hostnames). In both cases, RolePlayer successfully downloaded the file.
The system found twelve dynamic fields in the ADUs it received. Among them, the only two
meaningful ones were endpoint-address fields: the server’s IP address and the port of the FTP data-transfer channel. The rest arose due to differences in the greeting messages and authentication
responses. RolePlayer recognized these as don’t-care’s and did not update them.
When replaying the responder, wget successfully downloaded the file from a fake RolePlayer
server pretending to be either fedora.bu.edu or mirrors.xmission.com, with the same two endpoint-address fields updated in ADUs sent by the replayer.
We tested support for argument fields by specifying the filename crontabs-1.10-7.noarch.rpm
as an argument.
When replaying the initiator, we replaced this with pyorbit-devel-2.0.1-
1.i386.rpm, another file in the same directory. RolePlayer successfully downloaded the new file
from fedora.bu.edu using the script from mirrors.xmission.com. Since the two files are completely
different, the system found many don’t-care fields. None of these affected the replay because they
did not meet the conditions for updating cookie fields. We also confirmed that RolePlayer can replay
non-passive FTP dialogs successfully.
4.4.5 NFS
We tested the NFS protocol running over SunRPC using two different NFS servers. We used
the series of commands mount, ls, cp, umount to mount an NFS directory, copy a file from it to
a local directory, and unmount it. We used one NFS server for collecting the primary application
dialog and a second as the target for replaying the initiator. The NFS session consisted of nine TCP
connections, including interactions with three daemons running on the NFS server: portmap, nfs,
and mountd. The first two ran on 111/tcp and 2049/tcp, while mountd used a dynamic service port.
As is usually the case with NFS access, in the session both of the latter two ports were found via
RPCs to portmap.
When replaying the initiator, RolePlayer found 34 dynamic fields in received ADUs, and
changed 12 fields in the ADUs it sent. When replaying the responder, RolePlayer found 46 dynamic
fields in received ADUs, and changed 23 fields in the ADUs it sent. The cookie fields concerned
RPC call IDs.
RolePlayer successfully replayed both the initiator side (receiving the directory listing and then
the requested file) and the responder side (sending these to a live client, which correctly displayed
the listing and copied the file).
4.4.6 Randex (CIFS/SMB)
To test RolePlayer’s ability to handle a complex protocol while interacting with live malware,
we used the W32.Randex.D worm [98]. This worm scans the network for SMB/CIFS shares with
weak administrator passwords. To do so, it makes numerous SMB RPC calls (see Figure 4.8, reproduced with permission from [63]). When it finds an open share, it uploads a malicious executable
msmsgri32.exe. In our experiments, we configured the targeted Windows VM to accept blank
passwords, and turned off its firewall so it would accept traffic on ports 135/tcp, 139/tcp, 445/tcp,
137/udp, and 138/udp.
[Figure 4.8: the SMB/CIFS message sequence of the W32.Randex.D conversation. First session (enumerate accounts via the SAMR named pipe): SMB Negotiate Protocol Request/Response; SMB Session Setup AndX Request/Response; SMB Tree Connect AndX Request (Path: \\XX.128.18.16\IPC$) / Response; SMB NT Create AndX Request (Path: \samr) / Response; DCERPC Bind (call_id: 1, UUID: SAMR) / Bind_ack; SAMR Connect4 request/reply; SAMR EnumDomains request/reply; SAMR LookupDomain request/reply; SAMR OpenDomain request/reply; SAMR EnumDomainUsers request. The worm then starts another session, connects to the SRVSVC pipe, and issues a NetRemoteTOD (get remote Time of Day) request: SMB Negotiate Protocol Request/Response; SMB Session Setup AndX Request/Response; SMB Tree Connect AndX Request (Path: \\XX.128.18.16\IPC$) / Response; SMB NT Create AndX Request (Path: \srvsvc) / Response; DCERPC Bind (call_id: 1, UUID: SRVSVC) / Bind_ack (call_id: 1); SRVSVC NetrRemoteTOD request/reply; SMB Close request/Response. Finally it connects to the ADMIN$ share and writes the file: SMB Tree Connect AndX Request (Path: \\XX.128.18.16\ADMIN$) / Response; SMB NT Create AndX Request (Path: \system32\msmsgri32.exe) / Response (FID: 0x74ca); SMB Transaction2 SET_FILE_INFORMATION Request/Response; SMB Transaction2 QUERY_FS_INFORMATION Request/Response; SMB Write Request; ....]
Figure 4.8. The application-level conversation of W32.Randex.D.
To collect sample application dialogs, we manually launched a malware executable from another Windows VM, redirecting its scans to the targeted Windows VM and recording the traffic using
tcpdump. We stored two attacks because replaying CIFS requires a secondary application dialog
to locate the numerous length fields.
There are six connections in W32.Randex.D’s attack, all started by the initiator. Of these, two
are connections to 139/tcp, which are reset by the initiator immediately after it receives a SYN-ACK
from the responder. One connects to 80/tcp (HTTP), reset by the responder because the victim did
not run an HTTP server. The remaining three connections are all to 445/tcp. The worm uses the first
of these to detect a possible victim; it does not transmit any application data on this connection. The
worm uses the second to enumerate the user account list on the responder via the SAMR named pipe.
The final connection uploads the malicious executable to \Admin$\system32\msmsgri32.exe
via the Admin named pipe.
When replaying the initiator, RolePlayer found 101 dynamic fields in received ADUs, and
changed 65 fields in the ADUs it sent. When replaying the responder, it found 80 dynamic fields
in received ADUs, and changed 63 fields in the ADUs it sent. (The difference in the number of
fields is because some dynamic fields remain the same when they come from the replayer rather
than the worm. For example, the responder chooses the context handle of the SAMR named pipe;
when replaying the responder, the replayer just uses the same context handle as in the primary application dialog.) Considering ADUs in both directions, there were 21 endpoint-address fields, 76
length fields, and 32 cookie fields. The cookie fields reflect such information as the context handles
in SAMR named pipes and the client process IDs.
As with our Blaster experiment, RolePlayer successfully infected a live Windows system with
W32.Randex.D when replaying the initiator side, and, when replaying the responder, successfully
drove a live, attacking instance of the worm through its full set of infection steps.
To demonstrate the function of RolePlayer, we select six consecutive messages from the
conversation of W32.Randex.D (shown in bold-italic in Figure 4.8), consisting of SAMR Connect4 request/response, SAMR EnumDomains request/response, and SAMR LookupDomain request/response. We show the content of these messages from the primary, secondary, initiator-based
    A->V N1|180|S26|0388|S7|96|S20|96|S8|113|S16|R8|96|R6|72|R4|0014CCB8|18|M4|18|M4|144.165.114.119|M20
    A<-V N4|S26|0388|S28|R24|M16|0000000092F3E82470FDD91195F8000C295763F7|M4
    A->V N4|S26|0388|S56|R24|0000000092F3E82470FDD91195F8000C295763F7|M8
    A<-V N1|180|S26|0388|S7|124|S8|124|S6|125|S1|R8|124|DECRPC-6|100|R4|M4|000B0BB0|M4|000B27A0|M8|8|10|000B8D18|M8|000BC610|5|M4|4|hone|M36
    A->V N1|156|S26|0388|S7|72|S20|72|S8|89|S16|R8|72|R6|48|R4|M4|0000000092F3E82470FDD91195F8000C295763F7|8|10|001503F8|5|M4|4|hone
    A<-V N4|S26|0388|S28|R24|000B27A0|M32

(a) Primary Dialog

    A->V N1|172|S26|0474|S7|88|S20|88|S8|105|S16|R8|88|R6|64|R4|0014CCB8|14|M4|14|M4|48.196.8.48|M20
    A<-V N4|S26|0474|S28|R24|M16|000000006093917586FDD91195F8000C294A478F|M4
    A->V N4|S26|0474|S56|R24|000000006093917586FDD91195F8000C294A478F|M8
    A<-V N1|184|S26|0474|S7|128|S8|128|S6|129|S1|R8|128|DECRPC-6|104|R4|M4|000B0BB0|M4|000B6380|M8|12|14|000B76C0|M8|000C9FA8|7|M4|6|host02|M36
    A->V N1|160|S26|0474|S7|76|S20|76|S8|89|S16|R8|76|R6|52|R4|M4|000000006093917586FDD91195F8000C294A478F|12|14|001503F8|7|M4|6|host02
    A<-V N4|S26|0474|S28|R24|000B27A0|M32

(b) Secondary Dialog

    A->V N1|176|S26|0388|S7|92|S20|92|S8|109|S16|R8|92|R6|68|R4|0014CCB8|16|M4|16|M4|192.168.170.3|M20
    A<-V N4|S26|0388|S28|R24|M16|0000000018B30AD10BFDD91195F8000C293573E4|M4
    A->V N4|S26|0388|S56|R24|0000000018B30AD10BFDD91195F8000C293573E4|M8
    A<-V N1|188|S26|0388|S7|132|S8|132|S6|133|S1|R8|132|DECRPC-6|108|R4|M4|000B0BB0|M4|000B9358|M8|16|18|000BEF40|M8|000B6BA0|9|M4|8|hostpeer|M36
    A->V N1|164|S26|0388|S7|80|S20|80|S8|89|S16|R8|80|R6|56|R4|M4|0000000018B30AD10BFDD91195F8000C293573E4|16|18|001503F8|9|M4|8|hostpeer
    A<-V N4|S26|0388|S28|R24|000BEF40|M32

(c) Initiator-based Replay Dialog

    A->V N1|176|S26|0608|S7|92|S20|92|S8|109|S16|R8|92|R6|68|R4|0014CCB8|16|M4|16|M4|169.91.250.93|M20
    A<-V N4|S26|0608|S28|R24|M16|0000000092F3E82470FDD91195F8000C295763F7|M4
    A->V N4|S26|0608|S56|R24|0000000092F3E82470FDD91195F8000C295763F7|M8
    A<-V N1|180|S26|0608|S7|124|S8|124|S6|125|S1|R8|124|DECRPC-6|100|R4|M4|000B0BB0|M4|000B27A0|M8|8|10|000B8D18|M8|000BC610|5|M4|4|hone|M36
    A->V N1|156|S26|0608|S7|72|S20|72|S8|89|S16|R8|72|R6|48|R4|M4|0000000092F3E82470FDD91195F8000C295763F7|8|10|00150A88|5|M4|4|hone
    A<-V N4|S26|0608|S28|R24|000B27A0|M32

(d) Responder-based Replay Dialog

Figure 4.9. A portion of the application-level conversation of W32.Randex.D. Endpoint-address fields are in bold-italic, cookie fields are in bold, and length fields are in gray background. In the initiator-based and responder-based replay dialogs, updated fields are in black background.
replay, and responder-based replay dialogs in Figure 4.9. To fit each message more compactly, we
present them in the following format:
1. We split each message based on dynamic fields.
2. XY means we skipped Y bytes from protocol X (N for NetBIOS, S for SMB, R for DCERPC, M for Security Account Manager), since they do not change between different dialogs.
These represent the fixed fields as part of the dialog.
3. We present endpoint-address fields such as IP addresses and hostnames in ASCII format in
bold-italic. Three hostnames appear in the messages: “hone”, “host02”, and “hostpeer”.
4. We show length fields in decimal, with a gray background. For example, “180” and “96” in
the first message of the primary dialog are length fields.
5. We show cookie fields and don’t-care fields, including the client process IDs and the SAMR
context handles, in octets. For example, “0388” in the first message of the primary dialog
is a client process ID. For convenience, we highlight in bold the cookies in the primary and
secondary dialog which require dynamic updates during the replay process.
Note that RolePlayer located two cookie fields from the SAMR context handles because part of
the handles didn’t change between dialogs (e.g., the middle portion was constant in all the dialogs).
4.4.7 Discussion
From the experiments we can see that it is necessary to locate and update all dynamic fields—
endpoint-address, cookie, and length fields—for replaying protocols successfully, while argument
fields are important for extending RolePlayer’s functionality. TFTP, FTP, and NFS require correct
manipulation of endpoint-address fields. DNS, NFS, and CIFS also rely on correct identification
of cookie fields. CIFS has numerous length fields within a single application dialog. Leveraging
argument fields, RolePlayer can work as a low-cost client for SMTP, FTP, or DNS.
While RolePlayer can replay a wide class of application protocols, its coverage is not universal.
In particular, it cannot accommodate protocols with time-dependent state, nor those that use cryptographic authentication or encrypted traffic, although we can envision dealing with the latter by
introducing application-specific extensions to provide RolePlayer with access to a session’s cleartext dialog. Another restriction is that the live peer with which RolePlayer engages must behave
in a fashion consistent with the “script” used to configure RolePlayer. This requirement is more
restrictive than simply that the live peer follows the application protocol: it must also follow the
particular path present in the script.
Since RolePlayer keeps some dynamic fields unchanged as in the primary dialog, it is possible
for an adversary to detect the existence of a running RolePlayer by checking if certain dynamic fields
are changed among different sessions. For example, RolePlayer will always open the same port for
the data channel when replaying the responder of an FTP dialog, and it will use the same context
handles in SAMR named pipes when replaying the responder of a CIFS dialog. Another possible
way to detect RolePlayer is to discover inconsistencies between the operating system the application
should be running on versus the operating system RolePlayer is running on, by fingerprinting packet
headers [118]. In the future, we plan to address these problems by randomizing certain dynamic
fields and by manipulating packet headers to match the expected operating system.
Many applications of replay require automatic replay without human intervention. However,
this causes a problem for our assumption that the domain names of the participating hosts in the example dialogs can be provided by the user. In fact, this assumption may be eliminated by clustering
consecutive printable ASCII bytes in a message to find domain names. Note that we don’t need to
know the semantic meaning of the printable ASCII strings because in RolePlayer we use domain
names syntactically, for splitting a message into segments and identifying length fields.
To locate length and (sometimes) cookie fields, RolePlayer needs a secondary dialog. In some
cases, it is not trivial to automatically identify two example dialogs that realize the same application
session or malicious attack. In addition, RolePlayer as presented in this chapter can only handle one
script at a time. In the next chapter, we will present the replay proxy, which extends the functionality
of RolePlayer to process multiple scripts concurrently. We will also describe how we find two
example dialogs for the same malicious attack.
4.5 Summary
We have presented RolePlayer, a system that, given examples of an application session, can
mimic both the initiator and responder sides of the session for a wide variety of application protocols. We can potentially use such replay for recognizing malware variants, determining the range of
system versions vulnerable to a given attack, testing defense mechanisms, and filtering multi-step
attacks. However, while for some application protocols replay can be essentially trivial—just resend
the same bytes as recorded for previously seen examples of the session—for other protocols replay
can require correctly altering numerous fields embedded within the examples, such as IP addresses,
hostnames, port numbers, transaction identifiers and other opaque cookies, as well as length fields
that change as these values change.
We might therefore conclude that replay inevitably requires building into the replayer specific
knowledge of the applications it can mimic. However, one of the key properties of RolePlayer is that
it operates in an application-independent fashion: the system does not require any specifics about
the particular application it mimics. It instead uses extensions of byte-stream alignment algorithms
from bioinformatics to compare different instances of a session to determine which fields it must
change to successfully replay one side of the session. To do so, it needs only two examples of the
particular session; in some cases, a single example suffices.
RolePlayer’s understanding of network protocols is very limited—just knowledge of a few low-level syntactic conventions, such as common representations of IP addresses and the use of length
fields to specify the size of subsequent variable-length fields. (The only other information RolePlayer requires—depending on the application protocol—is context for the example sessions, such
as the domain names of the participating hosts and specific arguments in requests or responses if
we wish to change these when replaying.) This information suffices for RolePlayer to heuristically
detect and adjust network addresses, ports, cookies, and length fields embedded within the session,
including for sessions that span multiple, concurrent connections.
We have successfully used RolePlayer to replay both the initiator and responder sides for a
variety of network applications, including: SMTP, DNS, HTTP; NFS, FTP and CIFS/SMB file
transfers; and the multi-stage infection processes of the Blaster and W32.Randex.D worms. The
latter require correctly engaging in connections that, within a single session, encompass multiple
protocols and both client and server roles.
Based on RolePlayer, we implemented a replay proxy for our honeyfarm system. This proxy
can filter known multi-step attacks by replaying the server-side responses and allow new attacks
through without dropping the ongoing ones by replaying the incomplete dialog to a high-fidelity
honeypot. In the next chapter, we describe our honeyfarm system with the replay proxy in detail.
Chapter 5
GQ: Realizing a Large-Scale Honeyfarm
System
In the previous chapter, we described RolePlayer, a system which, given one or two examples
of an application session, can mimic both the client side and the server side of the session for a wide
variety of application protocols. Based on RolePlayer, we develop a replay proxy to filter previously
seen attacks in our honeyfarm system. In this chapter, we describe the design, implementation, and
evaluation of our honeyfarm system.
5.1 Motivation
In Section 2.2, we reviewed the related work in the area of Internet epidemiology and defenses
in detail. In this section, we briefly recall our discussion there to motivate our work.
Since 2001, Internet worm outbreaks have caused severe damage that affected tens of millions
of individuals and hundreds of thousands of organizations. The Code Red worm [56], the first
outbreak in today’s global Internet after the Morris worm [20, 79] in 1988, compromised 360,000
vulnerable hosts by exploiting a Web server vulnerability and launched a distributed denial of service attack
against a government web server from those hosts. The Blaster worm [97] was the first outbreak
that exploited a vulnerability of a service running on millions of personal computers. The Slammer
worm [54] used only a single UDP packet for infection and took only 10 minutes to infect its
vulnerable population. Also using a single UDP packet for infection, the Witty worm [55] exploited a vulnerability in commercial intrusion detection software that had been publicized only one day earlier, and was the first to carry a destructive payload, overwriting random disk blocks. Prior
study [57, 83, 82] suggests that automated early detection of new worm outbreaks is essential for
any effective defense.
Honeypots, as vulnerable network decoys, can be used to detect and analyze the presence, techniques, and motivations of an attacker [81]. By themselves, however, honeypots serve as poor “early
warning” detection of new worm outbreaks because of the low probability that any particular honeypot will be infected early in a worm’s growth. On the other hand, network telescopes, which work
by monitoring traffic sent to unallocated portions of the IP address space, have been a powerful tool
for understanding the behavior of Internet worm outbreaks at scale [58]. Combining them, a central
tool that has emerged recently for detecting Internet worm outbreaks is the honeyfarm [80], a large
collection of honeypots fed Internet traffic by a network telescope [6, 34, 58, 96]. The honeypots
interact with remote sources whose probes strike the telescope, using either “low fidelity” mimicry
to characterize the intent of the probe activity [6, 63, 66, 117], or with “high fidelity” full execution
of incoming service requests in order to purposefully become infected as a means of assessing the
malicious activity in detail [17, 29, 34, 96].
Considerable strides have been made in deploying low-fidelity honeyfarms at large scale [6,
117], but high-fidelity systems face extensive challenges, due to the much greater processing required by the honeypot systems, and the need to safely contain hostile, captured contagion while
still allowing it enough freedom of execution to fully reveal its operation. Recent work has seen
significant advances in some facets of building large-scale, high-interaction honeyfarms, such as
making much more efficient use of Virtual Machine (VM) technology [96], but these systems have
yet to see live operation at scale.
To achieve automated early detection of Internet worm outbreaks, we develop a large-scale,
high-fidelity honeyfarm system named GQ. Our work combines clusters of high-fidelity honeypots
with the large address spaces managed by network telescopes. By leveraging extensive filtering
and engaging honeypots dynamically, we can use a small number of honeypots to inspect untrusted
traffic sent to a large number of IP addresses, with goals of automatically detecting Internet worm
outbreaks within seconds, and facilitating a wide variety of analyses on detected worms, such as
extracting (vulnerability, attack, and behavioral) signatures, and testing whether these signatures
can correctly detect repeated attacks by deploying them within the controlled environment of the
honeyfarm. Architecturally, this is similar in spirit to our UCSD colleagues’ work on the Potemkin
system, though that effort particularly emphasizes developing powerful VM technology for running
large numbers of servers optimized for worm detection [96].
In the remainder of this chapter, we first describe GQ’s design (see Section 5.2) and implementation (see Section 5.3). Then we report on our preliminary experiences with operating the system
to monitor more than a quarter million addresses, during which we captured 66 distinct worms over
the course of four months (see Section 5.4). Finally, we finish with a summary of our work (see
Section 5.5).
5.2 Design
In this section we present GQ’s design, beginning with an overview of the design goals and
system’s architecture. We then discuss management of the incoming traffic stream: obtaining input
from the network telescope, prefiltering this stream with lightweight mechanisms, and further filtering it using a replay proxy. We follow this with how we address the thorny problem of containment,
detailing the containment and redirection policy we have implemented that endeavors to strike a
tenable balance between allowing contagion executing within the honeyfarm to exhibit its complete
infection (and possibly self-propagation) process, while preventing it from damaging external hosts.
We finish with a discussion of issues regarding the management of the honeypots.
5.2.1 Overview and Architecture
In developing our honeyfarm system, we aim for the system to achieve several important goals:
• High Fidelity: The system should run fully functional servers to detect new attacks against
even unknown vulnerabilities.
• Scalability: The system should be able to analyze in real-time the scanning probes seen on a large number of Internet addresses (e.g., a quarter million addresses).
• Isolation: The system should isolate the network communications of every single honeypot.
• Stringent Control: The system should prevent honeypots from attacking external hosts.
• Wide Coverage: The system should run honeypots with a wide range of configurations with respect to operating systems and services.

Figure 5.1. GQ's general architecture.
As Figure 5.1 portrays, GQ’s design emphasizes high-level modularity, consisting of a front-end
controller and back-end honeypots, mediated by a honeypot manager. This structure allows the bulk
of our system to remain honeypot-independent: although we currently use VMware ESX server
for our honeypot substrate, we intend to extend the system to include both Xen-based honeypots
using the technology developed for Potemkin [96] and a set of “bare metal” honeypots that execute
directly on hardware rather than in a VM environment, so we can detect malware that suppresses its
behavior when operating inside a VM.
The front-end controller supports data collection from multiple sources, including both direct
network connections and GRE tunnels. The latter allow us to topologically separate the honeyfarm
from the network telescope, and potentially to scatter numerous “wormholes” around the Internet
to obtain diverse sensor deployment. These features are desirable both for masking the honeyfarm’s
scope from attacker discovery and to ensure visibility into a broad spectrum of the Internet address
space, as previous work has shown large variations in the probing seen at different locations in the
network [14]. The controller also performs extensive filtering, using both simple mechanisms and
the more sophisticated replay proxy.
The front-end also performs Network Address Translation (NAT). Rather than require that each
honeypot be dynamically reconfigured to have an internal address equivalent to the external telescope address for which it is responding, we use NAT to perform this rewriting. However, if the
honeypot itself supports dynamic address reconfiguration (e.g., [96]), the front end simply skips the
NAT.
The final function of the front-end is containment and redirection. When a honeypot makes
an outgoing request, GQ must decide whether to allow it out to the public Internet. Based on
the policy in place, the front-end either allows the connection to proceed, redirects it to another
honeypot, or blocks it. Redirection to another honeypot is the means by which we can directly detect
a worm’s self-propagating behavior (namely if, once contacted, the new target honeypot itself then
initiates network activity). However, many types of malware require multi-stage infection processes
for which the initial infection will fail to fully establish (and thus fail to exhibit subsequent self-propagation) if we do not allow some preliminary connections out to the Internet. Finding the right
balance between these two takes careful consideration, as we discuss below.
5.2.2 Network Input
The initial input to GQ is a stream of network connection attempts. GQ can process two different
forms of input: a directly routed address range, and GRE-encapsulated tunnels. A direct-routed
range simply is a local Ethernet or other network device connected to one or more subnets, while a
GRE tunnel requires a remote peer.
Supporting direct connections gives a simple and efficient deployment option. Supporting GRE
gives the opportunity to decouple the honeyfarm system from the telescope’s address space, and can
ultimately serve as a bridge between multiple honeyfarms.
5.2.3 Filtering
We term the initial stage of processing prefiltering, which aims to use a few simple rules
to eliminate probe sources likely of low interest. Prefiltering has several components. First, GQ
limits the total number of distinct telescope addresses that engage a given remote source to a small
number N. The assumption here is that a modest N suffices to identify a source engaged in a new or
otherwise interesting form of activity. This filter can greatly reduce the volume of traffic processed
by the honeyfarm because of the very common phenomenon of a single remote source engaging in
heavy scanning of the telescope’s address space.
We note that there is no optimal value for N. Too low, and we will miss processing sources
that engage in diverse scanning patterns that send different types of probes to different subsequent
addresses. Too high, and we burden the honeypots with a great deal of redundant traffic.
In principle, knowledge of N also allows an adversary to design their probes to escape detection
by the honeyfarm. However, to do so the adversary also needs to know the address space over which
we apply the threshold of N, and if they know that, they can more directly escape detection.
As a way to offset both these considerations, GQ’s architecture supports sampling of the probe
stream: taking a small, randomly selected subset of the stream and accepting it for processing
without applying any of our filtering criteria (see “Filter Bypass” in Figure 5.1). We also use this
mechanism for sources deemed “interesting” during the filtering stages, such as those that deviate
from the replay proxy’s scripts (see Section 5.2.4).
Our current implementation of this prefilter maintains state for each IP address that contacts
our address ranges. Since we require 8 × N bytes for each IP address, we need only 400 MB of memory for 10 million hosts when N is taken to be five. To date, we have not seen significant
memory issues with keeping complete state.
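To make the bookkeeping concrete, the following is a minimal C++ sketch of such a prefilter, assuming a hypothetical ScanPrefilter class; the operational filter also ages its per-source state over the configured time window and honors the filter bypass, both of which we omit here.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <unordered_set>

    // Sketch of the scan prefilter: each remote source may engage at most N
    // distinct telescope addresses; probes to further addresses are dropped.
    class ScanPrefilter {
    public:
        explicit ScanPrefilter(std::size_t n) : n_(n) {}

        // Returns true if the probe from 'src' to telescope address 'dst'
        // should be passed on to the rest of the honeyfarm.
        bool Pass(uint32_t src, uint32_t dst) {
            auto &dsts = seen_[src];
            if (dsts.count(dst)) return true;      // address already engaged
            if (dsts.size() >= n_) return false;   // source used up its budget of N
            dsts.insert(dst);                      // roughly 8 x N bytes per source
            return true;
        }

    private:
        std::size_t n_;
        std::unordered_map<uint32_t, std::unordered_set<uint32_t>> seen_;
    };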
GQ uses a second prefilter, too, taken from the work of Bailey et al. [6]. After acknowledging
an initial SYN packet, we examine the first data packet sent by the putative attacking host. If this
packet unambiguously identifies the attack as a known and classified threat, the filter drops the
connection. While configuration of these filters requires manual assessment (since many attacks
cannot be unambiguously identified by their first data packet), we can use this mechanism to weed
out background radiation from endemic sources such as Code Red and Slammer.
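The following C++ sketch illustrates the shape of this first-payload check; the KnownThreat entries and the simple prefix test are placeholders for the manually configured classifications, not the actual signatures.

    #include <string>
    #include <vector>

    // Sketch of the first-payload prefilter (after Bailey et al.): once the TCP
    // handshake completes, the first data packet is checked against manually
    // configured patterns for known, endemic attacks (e.g., Code Red, Slammer).
    struct KnownThreat {
        std::string name;
        std::string prefix;   // placeholder: bytes the first payload must start with
    };

    // Returns true if the connection should be dropped as already classified.
    bool DropOnFirstPayload(const std::string &payload,
                            const std::vector<KnownThreat> &threats) {
        for (const auto &t : threats) {
            if (payload.size() >= t.prefix.size() &&
                payload.compare(0, t.prefix.size(), t.prefix) == 0)
                return true;   // unambiguously a known threat: drop it
        }
        return false;          // pass it on toward the replay proxy / honeypots
    }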
Note that our use of TCP header rewriting when instantiating our honeypots greatly simplifies
use of this second prefilter. Without it, we would have to allocate a honeypot upon the initial SYN
(if it passes the scan-N prefilter) so it could complete the SYN handshake, agree upon sequence
numbers, and elicit the first data packet, just in case that packet proved of interest and we wished
to continue the connection. We could not proxy for the beginning of the connection because any
honeypot we might wind up instantiating later to handle the remainder of the connection would not
have knowledge of the correct sequence numbers.
Additionally, we include a final “safety” filter, used only for filtering outbound traffic. Its role
is to limit external traffic from the honeyfarm if any other component fails. This is a more relaxed
but simpler policy than we implement in the rest of the honeyfarm, as the point is to assure—via
independent operation—that any failure of our more complicated containment mechanism will not
have a deleterious external effect.
5.2.4 Replay Proxy
Something quite important to realize (which we quantify in Section 5.4) is that the simple
prefilters discussed in the previous section lack the power to identify many instances of known
attacks. As an extreme example, the W32.Femot worm goes through 90 pairs of message exchanges
before we can distinguish one variant from another, because it is only at that point that the attack
reveals the precise code it injects into the victim. Even for well-known worms such as Blaster, the
first-packet filter will fail to recognize new and potentially interesting variants, because these will
not manifest until later. Yet creating a custom filter for each significant known worm is
quite impractical due to their prevalence.
To filter these frequent multi-stage attacks, we employ technology developed for RolePlayer,
our application-independent replay system discussed in Chapter 4. RolePlayer uses a script automatically derived from one or two examples of a previous application session as the basis for
mimicking either the client or the server side of the session in a subsequent instance of the same
transaction. The idea is that we can represent each previously seen attack using a script that describes the network-level dialog (both client and server messages) corresponding to the attack. A
key property of such scripts is that RolePlayer can automatically extract them from samples of two
dialogs corresponding to a given attack, without having to code into RolePlayer any knowledge of
the semantics of the application protocol used in the attack. We leverage this feature of RolePlayer
to build up a set of scripts with which we configure our replay proxy.
The replay proxy extends RolePlayer’s functionality in three substantial ways:
• Processing multiple scripts concurrently;
• Recognizing when a source deviates from this set of scripts;
• Serving as an intermediary between a source and a high-fidelity honeypot once the script no
longer applies.
The proxy first plays the role of the server side of the dialog, replying to incoming traffic from a
remote source. By matching the source’s messages against multiple scripts, the proxy can perform
a fine-grained classification of known attacks without relying on application-specific responders. If
the source never deviates from the set of scripts, then we have previously seen its activity, and we
can drop the connection.
If the dialog deviates from the scripts, however, the proxy has found something that is in some
fashion new. At this point it needs to facilitate cutting the remote source over from the faux dialog
with which it has engaged so far to a continuation of that dialog with a high-fidelity honeypot.
Doing so requires bringing the honeypot “up to speed.” That is, as far as the source is concerned,
it is already deeply engaged with a server. We need to synch the server up to this same point in the
dialog so that the rest of the interaction can continue with full fidelity.
To do so, the proxy replays the dialog as it has received it so far back to the honeypot server;
this time, the proxy plays the role of the client rather than the server’s role. Once it has brought
the honeypot up to speed, the source can then continue interacting with the honeypot. However, the
replay proxy needs to remain engaged in the dialog as an intermediary, translating the IP addresses,
sequence numbers, and, most importantly, dynamic application fields (e.g., IP addresses embedded in application messages, or opaque "cookie" fields). Figure 5.2 shows the replay proxy's operation.

Figure 5.2. The replay proxy's operation. It begins by matching the attacker's messages against a script. When the attacker deviates from the script, the proxy turns around and uses the client side of the dialog so far to bring a honeypot server up to speed, before proxying (and possibly modifying) the communication between the honeypot and attacker. The shaded rectangles show in abstract terms portions of messages that the proxy identifies as requiring either echoing in subsequent traffic (e.g., a transaction identifier) or per-connection customization (e.g., an embedded IP address).
For new attacks, there exists a risk that the replay proxy might interfere with their execution
because its knowledge is limited to scripts generated from previous attacks. Since the proxy has
no semantic understanding of the communication relayed between the attacker and the honeypot,
it cannot recognize if such interference occurs. Thus, the proxy marks the source of each attack
it attempts to pass along to a backend honeypot as “interesting”, such that any further connections
from that source will be routed directly to a honeypot server, bypassing the replay proxy.
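The sketch below captures this per-source control flow in C++, reduced to its essentials: the Script structure and the exact-match test stand in for RolePlayer's field-aware scripts and matching, and the address, port, and cookie translation performed while relaying is omitted (see Section 5.3.1).

    #include <cstddef>
    #include <string>
    #include <vector>

    // Sketch of the replay proxy's handling of one remote source. A Script pairs
    // each expected attacker ADU with the canned server reply to play back.
    enum class ProxyState { MatchingScript, Proxying, Dropped };

    struct Script {
        std::vector<std::string> client_adus;   // expected attacker messages
        std::vector<std::string> server_adus;   // responses the proxy plays back
        std::size_t pos = 0;                    // how far the dialog has matched
    };

    ProxyState HandleAdu(ProxyState state, const std::string &adu, Script &script,
                         std::vector<std::string> &to_source,
                         std::vector<std::string> &to_honeypot) {
        if (state != ProxyState::MatchingScript) {
            to_honeypot.push_back(adu);          // already cut over: just relay
            return state;
        }
        if (script.pos < script.client_adus.size() &&
            adu == script.client_adus[script.pos]) {
            to_source.push_back(script.server_adus[script.pos]);  // play server role
            if (++script.pos == script.client_adus.size())
                return ProxyState::Dropped;      // dialog fully known: filter it out
            return ProxyState::MatchingScript;
        }
        // Deviation: bring a honeypot "up to speed" by replaying the client side
        // of the dialog seen so far, then relay the new message and keep proxying.
        for (std::size_t i = 0; i < script.pos; i++)
            to_honeypot.push_back(script.client_adus[i]);
        to_honeypot.push_back(adu);
        return ProxyState::Proxying;
    }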
The replay proxy is one of GQ’s most significant components. We will discuss some implementation considerations for it in Section 5.3.1.
5.2.5 Containment and Redirection
Another important design element concerns monitoring and enforcing the containment policy:
i.e., as compromised honeypots initiate outbound network connections, do we allow these to traverse
into the public Internet, redirect them to a fake respondent running inside the honeyfarm, or block
them entirely? As noted above, the fundamental tension here is the need to balance allowing contagion to exhibit its complete infection process (necessary for identifying self-propagation) against preventing the contagion from damaging external hosts.
To do so, we work through the following policy, in the given order (a sketch of the decision logic follows the list):
1. If an outbound request would exceed the globally enforced, per-instantiated-honeypot rate
limit, drop it. This acts as a general safety feature for containment, complementary to, but
independent of, the “safety filter” previously discussed.
2. If an outbound DNS request originates from the DHCP/DNS-server honeypot, allow it. We
configure GQ’s honeypots with the address of a DHCP/DNS server that itself runs on a GQ
honeypot. We allow that one node to make outbound DNS requests.
3. If an outbound DNS request originates from a regular honeypot, redirect it to the DHCP/DNS
honeypot. Additionally, we consider such requests as suspicious, since the honeypot should
have originally sent them to the DHCP/DNS honeypot according to the configuration.
4. If an outbound request originates from a honeypot that has not been probed, drop it. If a
honeypot creates spontaneous activity (such as automatic update requests or similar tasks),
we simply block this activity. This serves both as a safety net and to ensure we know the
exact behavior of the honeypot.
5. If an outbound request targets the address of the source that caused GQ to instantiate the
honeypot, allow it. We allow the honeypot to always communicate with the source of the
initial attack, to enable it to complete some forms of multi-stage infections and to participate
in simple “phone home” confirmations.
6. If an outbound request is the first outbound connection to a host other than the infecting
source, allow it. Many worms use a two-stage infection process, where an initial infection
then fetches the bulk of the contagion from a second remote site, or “phones home” to register
the successful infection.1
1 For example, we received notification that three of our telescope addresses appeared in a list produced upon the arrest of the suspected operators of the toxbot botnet [99]. Unlike many erroneous notifications we receive due to our use of the address space, this one was correct: our honeypots had indeed become infected by toxbot, and our containment policy correctly allowed them to contact the botnet's control infrastructure.
7. If the destination port is 21/tcp, 80/tcp, 443/tcp, or 69/udp AND the port was not the one
accessed when the original remote source probed the honeypot AND the outbound request
is below a configurable threshold of how many such extra-port requests to allow, allow it.
Again, we do this to enable multi-stage infections and their control traffic, while attempting
to limit the probability of an escaped infection.
8. If the outbound request is a DHCP address-request broadcast, redirect it to the DHCP/DNS
honeypot.
9. If a clean honeypot is available, redirect the outbound request to it. Note that this step is
critical for detecting self-propagating behavior, and highlights the importance of not running
the honeyfarm at full capacity.
10. Otherwise, drop the outbound request.
In addition, when this process yields a decision to redirect, GQ consults a configurable redirection limit (computed on a per-instantiated-honeypot basis). If this connection would cause the
honeypot to exceed the limit, we drop the connection instead, to prevent runaway honeypots from
consuming excessive honeyfarm resources.
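A sketch of this decision logic in C++, with illustrative field names and limit values (the actual rate, extra-port, and redirection limits are configuration parameters whose operational settings we do not restate here):

    #include <cstdint>

    // Sketch of the containment/redirection decision for one outbound request
    // from an instantiated honeypot.
    enum class Verdict { Allow, RedirectToDnsVm, RedirectToCleanHoneypot, Drop };

    struct OutboundRequest {
        uint32_t dst;                  // destination address
        uint16_t dst_port;
        bool is_tcp, is_dns, is_dhcp_broadcast;
        bool from_dns_vm;              // did the DHCP/DNS honeypot issue it?
    };

    struct HoneypotState {
        bool probed = false;                // has an attacker engaged it yet?
        uint32_t infecting_source = 0;      // source that caused instantiation
        uint16_t probed_port = 0;           // port the attacker originally hit
        bool used_first_external = false;   // rule 6: one extra external contact
        int extra_port_requests = 0;        // rule 7 budget used so far
        int redirections = 0;               // per-honeypot redirection count
        int outbound_recent = 0;            // counter behind the rate limit
    };

    Verdict Decide(const OutboundRequest &r, HoneypotState &s,
                   bool clean_honeypot_available,
                   int rate_limit = 100, int extra_port_limit = 3,
                   int redirect_limit = 50) {
        if (++s.outbound_recent > rate_limit) return Verdict::Drop;            // rule 1
        if (r.is_dns && r.from_dns_vm) return Verdict::Allow;                  // rule 2
        if (r.is_dns) return Verdict::RedirectToDnsVm;                         // rule 3 (suspicious)
        if (!s.probed) return Verdict::Drop;                                   // rule 4
        if (r.dst == s.infecting_source) return Verdict::Allow;                // rule 5
        if (!s.used_first_external) {                                          // rule 6
            s.used_first_external = true;
            return Verdict::Allow;
        }
        bool download_port = (r.is_tcp && (r.dst_port == 21 || r.dst_port == 80 ||
                                           r.dst_port == 443)) ||
                             (!r.is_tcp && r.dst_port == 69);
        if (download_port && r.dst_port != s.probed_port &&
            s.extra_port_requests++ < extra_port_limit)
            return Verdict::Allow;                                             // rule 7
        if (r.is_dhcp_broadcast) return Verdict::RedirectToDnsVm;              // rule 8
        if (clean_honeypot_available && s.redirections++ < redirect_limit)
            return Verdict::RedirectToCleanHoneypot;                           // rule 9 + limit
        return Verdict::Drop;                                                  // rule 10
    }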
It is while assessing containment/redirection that GQ also makes a determination, from a network perspective, that we have spotted a new worm. We consider a system as infected with a worm
if, when redirected to another honeypot, it causes that honeypot to begin initiating connections.
Thus, from a network perspective, a worm is a program which, when running on one honeypot,
can cause other honeypots to change state sufficiently that they also begin generating connection
requests. This definition gives us a behavioral way to separate an infection that causes non-self-propagating network activity from one that self-propagates at least one additional step (thus distinguishing worms from botnet "autorooters" that compromise a machine and perhaps download
configurable code to it, but do not cause it to automatically continue the process).
Redirection within the honeyfarm also allows us to perform malware classification. As the
worm continues to generate requests, we redirect these to honeypots with differing configurations,
including patch levels and OS variants. Doing so automatically creates a system vulnerability profile, determining which systems and configurations the worm can exploit.
The policy engine also manages the mapping of external (telescope) addresses to honeypots.
Currently, when doing so it allocates different honeypots to distinct attackers, even if those attackers
happen to target the same telescope address. Doing so maintains GQ’s analysis of exactly what
remote traffic infected a given honeypot, but also makes the honeyfarm vulnerable to fingerprinting
by attackers.
A final facet of containment concerns internal containment. We need to ensure that infected
honeypots cannot directly contact any other honeypot in the honeyfarm without our control system
mediating the interaction. To do so, we require that all honeypot-to-honeypot communication pass
through the control system, which also allows us to maintain each honeypot behind a NAT device.
Note that the load it adds to the control system is trivial compared with the scan probes from the
network telescopes.
We use VLANs as the primary technology for internal containment. This imposes a requirement that the back-end system running the honeypot VM can terminate a VLAN. We assign each
honeypot to its own VLAN, with the GQ controller on a VLAN trunk. The controller can then
rewrite packets on a per-VLAN as well as per-IP-address basis, enabling fine-grained isolation and
control of all honeypot traffic, even broadcast and related traffic.
5.2.6 Honeypot Management
To maintain modularity, the GQ honeypot manager (see Figure 5.1) isolates management of
honeypot configurations. The controller keeps an inventory and state of the different available honeypots, but defers direct operation of the VM honeypots to the honeypot manager, which is responsible for starting and resetting each honeypot. Only the honeypot manager needs to understand the
VM or bare-metal technology used to implement a given honeypot. By doing this, we not only
avoid the risk of inconsistency between the controller and the (stateless) honeypot manager, but also
make the controller independent of the back-end implementation.
In the large, the GQ controller simply monitors the honeypots and responds to network requests.
When the manager restarts a honeypot, the honeypot exists in a newly born state until requesting
its IP address through DHCP. After the honeypot receives an address assignment, the controller
waits until it sees the system generate some network traffic (which it inevitably does as part of its
startup). At this point the controller waits ten seconds (to enable remaining services to start up and
to avoid startup transients) before considering the honeypot available. Once available, the controller
can now allocate the honeypot to serve incoming traffic. When later the controller decides that the
instantiated honeypot no longer has value, it instructs the manager to destroy the instance and restart
the honeypot in a reinitialized state.
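The controller's view of a honeypot's life cycle can be summarized as a small state machine; in the sketch below only the ten-second settling delay comes from the description above, and the phase names are illustrative.

    #include <cstdint>

    // Sketch of the controller's per-honeypot life cycle.
    enum class Phase { NewlyBorn, WaitingForTraffic, Settling, Available };

    struct HoneypotRecord {
        Phase phase = Phase::NewlyBorn;
        uint64_t settle_until = 0;   // time (in seconds) at which settling ends
    };

    void Advance(HoneypotRecord &h, bool dhcp_done, bool traffic_seen, uint64_t now) {
        switch (h.phase) {
        case Phase::NewlyBorn:             // waiting for the DHCP address assignment
            if (dhcp_done) h.phase = Phase::WaitingForTraffic;
            break;
        case Phase::WaitingForTraffic:     // waiting for the first startup traffic
            if (traffic_seen) { h.phase = Phase::Settling; h.settle_until = now + 10; }
            break;
        case Phase::Settling:              // let remaining services finish starting
            if (now >= h.settle_until) h.phase = Phase::Available;
            break;
        case Phase::Available:             // allocation, use, and teardown handled elsewhere
            break;
        }
    }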
5.3 Implementation
The GQ system is implemented in C (the replay proxy) and C++ (the Click modules), totaling about 14,000 lines. In this section we describe the most significant issues that arose when realizing
the GQ system as an implementation of the architecture presented in the previous section. These
issues concerned the replay proxy system, the allocation and isolation mechanisms, and the VMware
virtual machine management, which we detail in this section.
5.3.1 Replay Proxy
The replay proxy plays a central role in our architecture. Our current system takes a series of
scripts for different attacks, with two sample dialogs (a “primary” dialog and a “secondary” dialog,
in the terminology of Section 4.2.3, necessary for locating some types of dynamic fields) for each
script. The main challenges for implementing the replay proxy are generating scripts for newly seen
attacks and determining when a host has deviated from a script.
To generate a script for a newly seen worm attack, we take two different approaches depending
on whether the attack succeeded or failed. For successful attacks, we can use the instances of
infections redirected inside the honeyfarm for the primary and secondary dialogs. In this case, we
also have control over replay-sensitive fields such as host names and IP addresses, which RolePlayer
needs to know about to construct an accurate script.
For unsuccessful attacks, however, we have only one example of the application dialog corresponding to the attack. In this case, we face the difficult challenge of trying—without knowledge
of the application’s semantics—to find another instance of the exact same attack (as opposed to
merely a variant or an attack that has some elements in common). We leverage a heuristic to tackle
this problem: we assume that two very similar attacks from the same attacking host are in fact very
likely the same attack, and we can therefore use the pair as the primary and secondary dialogs required by RolePlayer. We test whether two candidate attacks from the same host are indeed “very
similar” by checking the number of Application Data Units (ADUs) in the two dialogs, their sizes,
and the dynamic fields we locate within them. We require that the number of ADUs be the same, that the difference in ADU sizes be quite small, and that the dynamic fields be consistent across the two dialogs.
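A sketch of this similarity test follows, with an illustrative size threshold (the operational value is not restated here).

    #include <cstddef>
    #include <string>
    #include <vector>

    // Sketch of the "same attack" test for two dialogs from one source. An ADU is
    // reduced to its size and the dynamic fields located in it.
    struct AduSummary {
        std::size_t size;
        std::vector<std::string> dynamic_fields;   // e.g., embedded addresses, cookies
    };

    bool LikelySameAttack(const std::vector<AduSummary> &a,
                          const std::vector<AduSummary> &b,
                          std::size_t max_size_diff = 16) {
        if (a.size() != b.size()) return false;                 // same number of ADUs
        for (std::size_t i = 0; i < a.size(); i++) {
            std::size_t diff = a[i].size > b[i].size ? a[i].size - b[i].size
                                                     : b[i].size - a[i].size;
            if (diff > max_size_diff) return false;             // ADU sizes must be close
            if (a[i].dynamic_fields.size() != b[i].dynamic_fields.size())
                return false;                                   // fields must be consistent
        }
        return true;   // usable as the primary/secondary dialogs for a new script
    }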
We use this technique to construct a corpus of known types of activity automatically. After a
single infection by a worm, we want to automatically inform the replay proxy of the worm's attack to allow it to filter out subsequent probes from the same worm (whose volume could grow very rapidly if the worm spreads quickly), to avoid tying up the honeyfarm with uninteresting additional
worm traffic. We also benefit significantly from having scripts to filter out uninteresting instances
of “background radiation”.
To efficiently filter out known activity, rather than comparing an instance with all of the scripts
at each step, we merge the scripts into a tree structure. For N scripts, doing so reduces the computational complexity from O(N) to O(log N). Constructing the tree requires comparing ADUs
from different scripts. We decide that two ADUs from different scripts are the same if the match of
dynamic fields between them is at least as good as that between the primary and secondary dialogs
used to locate the fields in each script in the first place.
In online filtering, we use the same approach to check if a received ADU matches the ADU of
a node in the tree. If not, it represents a deviation from our collection of scripts.
When we compare an ADU with a set of sibling nodes, it is possible that more than one of the
nodes matches. When this happens, we choose the node that minimizes the number of don’t-care
fields (opaque data segments that appear in only one side of the attacks; see Chapter 4 for more
details) identified in the compared node.
To avoid comparing a received ADU with every node in a large set of sibling nodes (which
can arise for some common attack vectors like CIFS/SMB exploits), we use a Most Recently Used
(MRU) technique to control the number of nodes compared. The intuition behind this technique is
to leverage the locality of incoming probes. Given a set of sibling nodes, we associate with each
node a time reflecting when it last matched a received ADU. We compare newly received ADUs
with those nodes having the most recent times first, to see if we can quickly find a close match.
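The following sketch combines the tree matching with the MRU ordering; NodeMatches() is a placeholder for RolePlayer's field-aware comparison, and the limit on compared siblings is illustrative.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Sketch of matching one received ADU against the children of the current
    // script-tree node, scanning siblings in most-recently-used order and
    // breaking ties by the number of don't-care fields.
    struct TreeNode {
        std::string adu;                  // representative ADU for this step
        int dont_care_fields = 0;         // opaque segments ignored when matching
        uint64_t last_match_time = 0;     // for the MRU ordering
        std::vector<TreeNode *> children;
    };

    // Placeholder for RolePlayer's field-aware comparison; exact match here.
    bool NodeMatches(const TreeNode &n, const std::string &adu) { return n.adu == adu; }

    TreeNode *MatchChild(TreeNode *node, const std::string &adu, uint64_t now,
                         std::size_t max_compared = 32) {
        // Most recently matched children first; compare at most max_compared of them.
        std::sort(node->children.begin(), node->children.end(),
                  [](const TreeNode *a, const TreeNode *b) {
                      return a->last_match_time > b->last_match_time;
                  });
        TreeNode *best = nullptr;
        std::size_t tried = 0;
        for (TreeNode *c : node->children) {
            if (tried++ == max_compared) break;
            if (NodeMatches(*c, adu) &&
                (best == nullptr || c->dont_care_fields < best->dont_care_fields))
                best = c;   // prefer the match with the fewest don't-care fields
        }
        if (best != nullptr) best->last_match_time = now;
        return best;        // nullptr signals a deviation from all known scripts
    }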
The replay proxy also supports detection of attack variants for attacks that involve multiple
connections. For example, if a new Blaster worm uses the same buffer overflow code as the old
one, but changes the contents of the executable file then transferred over TFTP, the replay proxy
can correctly follow the dialog up through the TFTP connection, retrieve the executable file, notice
it differs from the script at that point, and proceed to replay this full set of activity to a honeypot
server in order to bring it up to speed to become infected by the attack variant.
A final detail is that when receiving data the proxy must determine when it has received a
complete ADU. Since it may interact with unknown attacks, it cannot always directly tell when this
occurs, so we use a timeout (currently set to five seconds) to ensure it reaches a decision. If the
source has not sent any further packets when the timeout expires, we assume the data received until
that point comprises an ADU. Since the timeout adds extra latency for detecting a new worm, we
plan to improve the design of the replay proxy to eliminate it by comparing incomplete ADUs with
the scripts.
Our replay proxy implementation comprises 8,600 lines of C code. It currently uses the socket
APIs to communicate with the attackers and the honeypots. This limits its performance and makes
it unwieldy to implement the address and sequence number translation it needs to perform. We
plan to replace the socket APIs with a user-level transport layer stack (using raw sockets) and port
the replay proxy to Click. Doing so will let us closely integrate the replay proxy with the other
components of the honeyfarm controller.
Currently, a single-threaded instance of the replay proxy can process about 100 concurrent
probes per second. To use the proxy operationally, we must improve its performance so that it can
process significantly more concurrent connections. In our near-term future work, we aim to solve
this problem from three perspectives: (1) improve the replay algorithm to reduce the computational
cost; (2) leverage multiple threads/processes/machines; (3) convert its network input/output to use
raw sockets. Our general goal is for the overhead of the replay proxy processing for an incoming
probe to be much less than that for launching a virtual machine honeypot.
5.3.2 Isolation
Another critical component for our system implementation is maintaining rigorous isolation of
all honeypots. The honeypots need to contact other honeyfarm systems—in particular, DHCP and
DNS servers—but we need absolute control over these and any other interactions. Additionally, we
need to keep our control traffic isolated from honeypot traffic.
To do so, we use VLANs and a VLAN-aware HP ProCurve 2800 series managed Ethernet
switch [27]. We use a “base” VLAN with no encapsulation for a port on the controller and a port
on each VMware ESX server. This LAN carries the management traffic for all the ESX servers,
including access to our file server. By using an entirely separate (virtual) network with a switch
capable of enforcing the separation, we can prevent even misbehaving honeypots from saturating
our control plane.2
Equally important is maintaining isolation among honeypot traffic. Each individual honeypot
should only communicate with authorized systems, which we accomplish using VLAN tagging and
the Virtual Switch Tagging (VST) mode provided by the VMware ESX server. In this mode, each
virtual switch in the ESX server will tag (untag) all outbound (inbound) Ethernet frames, leaving
VLAN tagging transparent to the honeypots. This is important because we want to repeatedly
use a single virtual machine image file for multiple running instances, and thus we want to avoid
embedding any VLAN-related state in the image file. With this approach, the only untagged link is
between the honeypot and the virtual switch; we achieve isolation for this segment by dedicating a
port group to each honeypot (the port group number is the same as the VLAN identifier assigned to
the honeypot). Thus, the GQ controller’s interface to the honeypot network receives VLAN-tagged
packets, with each honeypot on a different VLAN.
We also need to consider VLAN tagging/untagging for the controller. We implemented a VLAN
manager in the controller, which maintains the mapping between IP addresses and VLAN IDs, and
2 We are susceptible to bugs in the switch, however, if the VLAN encapsulation fails. We could instead use two separate switches if this proves problematic.
adds/removes VLAN tags before the Ethernet frames leave/enter the GQ controller. This approach
allows us to completely abstract the isolation mechanism from the rest of the system.
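A minimal sketch of the bookkeeping the VLAN manager maintains; the actual 802.1Q tag insertion and removal on Ethernet frames, which the controller performs, is omitted here.

    #include <cstdint>
    #include <unordered_map>

    // Sketch of the VLAN manager's state: a bidirectional mapping between
    // honeypot IP addresses and VLAN identifiers. Tagging a frame toward a
    // honeypot uses VlanFor(); untagging an inbound frame uses IpFor().
    class VlanManager {
    public:
        void Assign(uint32_t honeypot_ip, uint16_t vlan_id) {
            ip_to_vlan_[honeypot_ip] = vlan_id;
            vlan_to_ip_[vlan_id] = honeypot_ip;
        }
        uint16_t VlanFor(uint32_t honeypot_ip) const {      // 0 = unknown
            auto it = ip_to_vlan_.find(honeypot_ip);
            return it == ip_to_vlan_.end() ? 0 : it->second;
        }
        uint32_t IpFor(uint16_t vlan_id) const {            // 0 = unknown
            auto it = vlan_to_ip_.find(vlan_id);
            return it == vlan_to_ip_.end() ? 0 : it->second;
        }
    private:
        std::unordered_map<uint32_t, uint16_t> ip_to_vlan_;
        std::unordered_map<uint16_t, uint32_t> vlan_to_ip_;
    };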
Additionally, the VLAN manager has a “default route” for forwarding DHCP packets to the
dedicated DHCP/DNS VM system. Rather than running these servers on the controller, where they
might be susceptible to attack by a worm, we use a separate Linux honeypot to provide DHCP and
DNS service for the honeypots.
5.3.3 Honeypot Management
Table 5.1. Virtual machine image types installed in GQ.
Windows 2000 Professional with No Patches
Windows 2000 Professional with All Patches
Windows 2000 Server with Service Pack 4
Windows XP Professional with No Patches
Windows XP Professional with All Patches
Windows 2000 Professional with Service Pack 4
Windows 2000 Server with No Patches
Windows 2000 Server with All Patches
Windows XP Professional with Service Pack 2
Fedora Core 4 with All Patches
We have manually installed ten different virtual machine images for both Windows and Linux
(see Table 5.1). For ease of initial deployment, we operate only a subset of these which together
cover a wide range of vulnerabilities in Windows systems.
Figure 5.3. Message formats for the communication between the GQ controller and the honeypot manager. The request (from the GQ controller to the honeypot manager) carries MESSAGE SIZE, VM Action, VM IP Address, and ESX Server Info; the response (from the honeypot manager to the GQ controller) carries MESSAGE SIZE, VM Action, VM IP Address, and VM State. "ESX Server Info" is a string with the ESX server hostname, the ESX server control port number, the user name, the password, and the VM config file name, separated by tab keys.
We implemented the honeypot manager in C using the (undocumented other than by header
files) vmcontrol interface provided by VMware for remotely controlling virtual machines running
on VMware ESX servers. The communication between the honeyfarm controller and the honeypot manager follows a simple application-level protocol (see Figure 5.3) by which the controller
instructs the manager what actions to take on what virtual machines, and the manager informs the
controller of the results of those actions.
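In C++ terms, the fixed-width portion of these messages can be pictured as the following structures, a sketch based on Figure 5.3; the byte order on the wire and the exact framing of the trailing string are assumptions.

    #include <cstdint>
    #include <string>

    // Sketch of the controller<->manager messages of Figure 5.3. Field widths
    // follow the figure (four bytes each).
    struct ManagerRequest {            // GQ controller -> honeypot manager
        uint32_t message_size;         // total length of the message in bytes
        uint32_t vm_action;            // which action to take on the virtual machine
        uint32_t vm_ip_address;        // which honeypot the action applies to
        std::string esx_server_info;   // "host\tport\tuser\tpassword\tconfig-file"
    };

    struct ManagerResponse {           // honeypot manager -> GQ controller
        uint32_t message_size;
        uint32_t vm_action;            // echoes the requested action
        uint32_t vm_ip_address;
        uint32_t vm_state;             // resulting state of the virtual machine
    };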
5.4 Evaluation
We began running GQ operationally in late 2005. The initial deployment we operate consists
of four VMware ESX servers running 24 honeypots in three different virtual machine configurations, including unpatched Windows XP Professional, unpatched Windows 2000 Server, and a
fully patched Windows XP Professional with an insecure configuration and weak passwords, since
these three configurations provide good coverage of vulnerabilities (the vulnerabilities in unpatched Windows 2000 Professional are a subset of those in unpatched Windows 2000 Server). Thus, GQ
can currently capture worms that exploit Windows vulnerabilities on 80/tcp, 135/tcp, 139/tcp, and
445/tcp, but not other types of worms.
For our evaluation, we first examine the level of background radiation our network telescope
captures and analyze the effectiveness of scan filtering for reducing this volume. We then evaluate
the correctness and performance of the replay proxy. Finally, we present the worms GQ captured
over the course of four months of operation.
5.4.1 Background Radiation and Scalability Analysis
Our network telescope currently captures the equivalent of a /14 network block (262,144 addresses), plus an additional “hot” /23 network (512 addresses). This latter block lies just above one
of the private-address allocations. Any malware that performs sequential scanning using its local
address as a starting point (e.g., Blaster [97]) and runs on a host assigned an address within the
private-address block will very rapidly probe the adjacent public-address block, sending traffic to
our telescope. Thus, this particular telescope feed provides magnified visibility into the activity of such malware.

Figure 5.4. TCP scans before scan filtering: (a) TCP scans to the /14 network without any filtering; (b) TCP scans to the /23 network without any filtering. (Both panels plot the number of different TCP scans per minute over a two-week period.)

Figure 5.5. TCP scans after scan filtering: (a) TCP scans to the /14 network after scan filtering; (b) TCP scans to the /23 network after scan filtering. (Both panels plot the number of different TCP scans per minute over the same two-week period.)

Figure 5.6. Effect of scan filter cutoff on proportion of probes that pass the prefilter. (Curves show the /14 and /23 networks with the cutoff applied over 10-minute, 1-hour, 4-hour, and 8-hour windows; the x-axis is the filter size and the y-axis the percentage of probes passing the prefilter.)

Figure 5.7. Effect of the number of honeypots on proportion of prefiltered probes that can engage a honeypot. (Curves correspond to honeypot life cycles of 10 seconds, 1 minute, 2 minutes, and 5 minutes; the x-axis is the number of honeypots and the y-axis the percentage of probes served by a honeypot.)
We analyzed two weeks of network traffic recorded during honeyfarm operation. We find a huge
bias in background radiation. Figures 5.4(a) and 5.4(b) show the number of TCP scans per minute
seen by each telescope feed. Here, a “scan” means a TCP SYN from a distinct source-destination
pair (any duplicate TCP SYNs that arrive within 60 seconds are considered as one). We see that
despite being 1/512th the size, the /23 network sees several times more background radiation than
does the /14 network, dramatically illustrating the opportunity it provides to capture probes from
sequentially scanning worms early during their propagation.
Figures 5.5(a) and 5.5(b) show the reduction in the raw probing achieved by prefiltering the
traffic to remove scans. Our prefilter’s current configuration allows each source to scan at most
N = 4 different telescope addresses within an hour. We picked a cutoff of 4 because we currently
run three different honeypot configurations, and the GQ controller allocates honeypots with different
configurations to the same source when it probes additional telescope addresses; we then chose
N = 4 rather than N = 3 to add some redundancy.
This filtering has the greatest effect on traffic from the /23 network, which by its nature sees
numerous intensive scans, for which the prefilter discards all but the first few probes. As a result,
after prefiltering the honeyfarm as a whole receives over a dozen attack probes/sec.
We used trace-based analysis to study the impact of different parameters for the scan filtering.
Figure 5.6 shows, for the /14 and /23 networks, the proportion of the incoming traffic that passes the prefiltering stage for a given setting of N, the filter cutoff. The cutoff specifies the number of different telescope addresses that a remote source is allowed to probe within a given period of time (as shown by
the different curves). For the /23 network, the pass-through proportion remains around 5%. For the
/14 network, it is considerably higher, but prefiltering still reaps significant benefits in reducing the
initial load.
However, even with aggressive prefiltering, more than eight attack probes pass the filtering
every second. This level rapidly consumes the available honeypots. In our current operation, we
use 20 of the 24 honeypots to engage remote attackers, and the other four for internal redirection.
Since engaging an attacker takes a considerable amount of time (including restarting the honeypot
afresh after finishing), the honeyfarm is always saturated. To investigate the number of honeypots
required in the absence of further filtering, we conducted a trace-based analysis using different
mean honeypot “life cycles”, where the life cycle consists of the attack engagement time plus the
restart time. We varied the life cycle from ten seconds up to five minutes (note that Xen-like virtual
machines may take only seconds [96] to restart, while a full restart for a VMware virtual machine
may take a full minute).
Figure 5.7 shows that once the life cycle exceeds one minute, we would need over 1,000 honeypots to engage 90% of the attacks that pass the prefiltering stage. Thus we need some mechanism to
filter down the attacks. To do so, we introduce the second stage of filtering provided by our replay
proxy, which we assess next.
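As a rough consistency check, with roughly eight probes per second passing the prefilter and a one-minute life cycle, the average number of simultaneously engaged honeypots would already be about 8 × 60 ≈ 480; serving 90% of a bursty arrival process therefore plausibly requires the 1,000-plus honeypots indicated by Figure 5.7.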
5.4.2 Evaluating the Replay Proxy
In this section we assess the correctness and efficacy of the replay proxy as a means for further
reducing the volume of probes that the honeypots must process.
To assess correctness, we extracted 66 distinct worm attacks that successfully infected the honeypots and exhibited self-propagation inside the honeyfarm (detailed in Section 5.4.3). For each
distinct attack, we validated the proxy’s proper operation as follows:
1. We verified that the proxy indeed successfully replays the attack to compromise another vulnerable honeypot.
2. We ran the proxy with an incomplete script for the attack that only specifies part of the attack
activity. We then launched the attack against the honeyfarm to confirm that the proxy would
correctly follow the script through its end, and upon reaching the end would then successfully
bring a vulnerable honeypot “up to speed” and mediate the continuation of the attack so that
infection occurred.
3. We ran the proxy with a complete script of the attack to confirm that it successfully filtered
further instances of the attack.
Of these, the first two are particularly critical, because incorrect operation can lead to either
underestimating which other system configurations a given worm can infect (since we use replay to
generate the test instances for constructing these vulnerability profiles), or failing to detect a novel
attack (because the proxy does not manage to successfully transfer the new activity to a honeypot).
Incorrect operation in the third case results in greater honeyfarm load, but does not undermine
correctness.
Table 5.2. Summary of protocol trees.

Port      Nodes   Max Depth   Max Breadth   Pairs of Examples
80/tcp       28           5             7                  39
135/tcp     548         176            13                 914
139/tcp     312          55            13                  63
445/tcp   3,977         104            42               1,551
Because we have not yet integrated the replay proxy into the operational honeyfarm, to assess
its filtering efficacy we constructed a set of scripts for it to follow and then ran those against attacks
drawn from traces of GQ’s operation. To do so, we first extracted 42,004 attack events for which GQ
allocated a honeypot. (Note that each attack might involve multiple connections.) We then followed
the approach described in Section 5.3.1 to automatically find 2,572 pairs of matching attacks. Using
these pairs as the primary and secondary dialogs, we generated scripts for each and constructed four
protocol trees for filtering, one for each of the four services accessed in the attack traces: 80/tcp,
135/tcp, 139/tcp, and 445/tcp. Table 5.2 shows the properties of the resulting trees.
Using this set of scripts to process the remaining 36,860 attack events, the proxy filters out 76%
of them, indicating that operating it would offload the honeypots by a factor of four, a major gain.
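(Filtering out 76% of the attack events leaves about 24% for the honeypots, i.e., a load reduction of roughly 1/0.24 ≈ 4.2, which is where the factor-of-four estimate comes from.)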
Finally, it is difficult to directly assess the false positive rate (attacks that should not have been
filtered out but were), because doing so entails determining the equivalence of the actual operational
semantics of different sets of server requests. However, we argue that the rate should be low because
we take a conservative approach by requiring that filtered activity must match all dynamic fields in
a script, including large ADUs that correspond to buffer overruns or bulk data transfers.
Table 5.3. Part 1 of summary of captured worms (worm names are reported by Symantec Antivirus).

Executable Name   Size (B)  MD5Sum       Worm Name                # Events  # Conns  Time (s)
a####.exe            10366  7a67f7c8...  W32.Zotob.E                     4        3      29.0
a####.exe            10878  bf47cfe2...  W32.Zotob.H                     9        3      25.2
a####.exe            25726  62697686...  Quarantined but no name         1        3     223.2
cpufanctrl.exe      191150  1737ec9a...  Backdoor.Sdbot                  1        4     111.2
chkdisk32.exe        73728  27764a5d...  Quarantined but no name         1        4     134.7
dllhost.exe          10240  53bfe15e...  W32.Welchia.Worm              297   4 or 6      24.5
enbiei.exe           11808  d1ee9d2e...  W32.Blaster.F.Worm              1        3      28.9
msblast.exe           6176  5ae700c1...  W32.Balster.Worm                1        3      43.8
lsd                  18432  17028f1e...  W32.Poxdar                     11        8      32.4
NeroFil.EXE          78480  5ca9a953...  W32.Spybot.Worm                 1        5     237.5
sysmsn.exe           93184  5f6c8c40...  W32.Spybot.Worm                 3        3      79.6
MsUpdaters.exe      107008  aa0ee4b0...  W32.Spybot.Worm                 1        5      57.0
ReaIPlayer.exe      120320  4995eb34...  W32.Spybot.Worm                 2        5      95.4
WinTemp.exe         209920  9e74a7b4...  W32.Spybot.Worm                 1        5     178.4
wins.exe            214528  7a9aee7b...  W32.Spybot.Worm                 1        5     118.2
msnet.exe           238592  6355d4d5...  W32.Spybot.Worm                 1        7     189.4
MSGUPDATES.EXE      241152  65b401eb...  W32.Spybot.Worm                 2        5     125.3
ntsf.exe            211968  5ac5998e...  Quarantined but no name         1        5     459.4
scardsvr32.exe       33169  1a570b48...  W32.Femot.Worm                  4        3      46.2
scardsvr32.exe       34304  b10069a8...  W32.Femot.Worm                  1        3      66.5
scardsvr32.exe       34816  ba599948...  W32.Femot.Worm                 55        3      96.6
scardsvr32.exe       35328  617b4056...  W32.Femot.Worm                  2        3     179.6
scardsvr32.exe       36864  0372809c...  W32.Femot.Worm                  1        5      49.3
scardsvr32.exe       39689  470de280...  W32.Femot.Worm                  4        3      41.4
scardsvr32.exe       40504  23055595...  W32.Femot.Worm                  1        3      41.1
scardsvr32.exe       43008  ff20f56b...  W32.Valla.2048                  1        5      32.2
scardsvr32.exe       66374  f7a00ef5...  Quarantined but no name         1        7      54.8
scardsvr32.exe      205562  87f9e3d9...  W32.Pinfi                       1        3     180.8
5.4.3 Capturing Worms
In four months of operation, GQ inspected about 260,000 attacks, among which 2,792 attacks
were worms and 32,884 attacks were not worms but successfully compromised honeypots (we observed suspicious outbound network requests from the honeypots). Overall, GQ automatically captured 66 distinct worms belonging to 14 different families. Tables 5.3 and 5.4 (we split the table into
two parts due to the large table size) summarize the different types, where a worm type corresponds
to a unique MD5 checksum for the worm’s executable code (note that all the worms we captured
involved multiple connections and had buffer overflow or password guessing separated from their
executables). Most of these executables have a name directly associated with them because they are
uploaded as files by the worm’s initial exploit (see below). The table also gives the size in bytes of
the executable and how many times GQ captured an instance of the worm.
To identify the worms, we uploaded the malicious executables to a Windows virtual machine
on which we installed Symantec Antivirus. While its identification is incomplete (and some of the names appear less convincing), since we lack access to a large worm corpus we include the names for completeness.

Table 5.4. Part 2 of summary of captured worms (worm names are reported by Symantec Antivirus).

Executable Name          Size (B)  MD5Sum       Worm Name          # Events  # Conns  Time (s)
x.exe                        9343  986b5970...  W32.Korgo.Q              17        2       6.6
x.exe                        9344  d6df3972...  W32.Korgo.T               7        2       9.5
x.exe                        9353  7d99b0e9...  W32.Korgo.V             102        2       6.0
x.exe                        9359  a0139d7a...  W32.Korgo.W              31        2       5.9
x.exe                        9728  c05385e6...  W32.Korgo.Z              20        2       6.6
x.exe                       11391  7f60162c...  W32.Korgo.S             169        2       6.6
x.exe                       11776  c0610a0d...  W32.Korgo.S              15        2       8.6
x.exe                       13825  0b80b637...  W32.Korgo.V               2        2      24.4
x.exe                       20992  31385818...  W32.Licum                 2        2       7.9
x.exe                       23040  e0989c83...  W32.Korgo.S               3        2      10.4
x.exe                      187348  384c6289...  W32.Pinfi                 1        2     329.7
x.exe                      187350  a4410431...  W32.Korgo.V               6        2      11.3
x.exe                      187352  b3673398...  W32.Pinfi                 5        2      20.1
x.exe                      187354  c132582a...  W32.Pinfi                 5        2      24.9
x.exe                      187356  d586e6c2...  W32.Pinfi                 2        2      27.5
x.exe                      187358  2430c64c...  W32.Korgo.V               1        2      27.5
x.exe                      187360  eb1d07c1...  W32.Pinfi                 1        2      63.1
x.exe                      187392  2d9951ca...  W32.Korgo.W               1        2      76.1
x.exe                      189400  7d195c0a...  W32.Korgo.S               1        2      18.0
x.exe                      189402  c03b5262...  W32.Pinfi                 1        2      58.2
x.exe                      189406  4957f2e3...  W32.Korgo.S               1        2     210.9
xxxx...x                    46592  a12cab51...  Backdoor.Berbew.N       844        2       9.4
xxxx...x                    56832  b783511e...  W32.Info.A               34        2       7.2
xxxx...x                    57856  ab5e47bf...  Trojan.Dropper          685        3      10.0
xxxx...x                   224218  d009d6e5...  W32.Pinfi                 1        3      32.5
xxxx...x                   224220  af79e0c6...  W32.Pinfi                 3        2      34.2
n/a                         10240  7623c942...  W32.Korgo.C               3        2       4.8
n/a                         10752  1b90cc9f...  W32.Korgo.L               1        2       7.0
n/a                         10752  32a0d7d0...  W32.Korgo.G               8        2       4.1
n/a                         10752  ab7ecc7a...  W32.Korgo.N               2        2       5.3
n/a                         10752  d175bad0...  W32.Korgo.G               3        2       5.4
n/a                         10752  d85bf0c5...  W32.Korgo.E               1        2       5.6
n/a                         10752  b1e7d9ba...  W32.Korgo.gen             1        2       5.0
n/a                         10879  042774a2...  W32.Korgo.I              15        2       4.3
n/a                         11264  a36ba4a2...  W32.Korgo.I               1        2       5.4
22 distinct files             n/a  n/a          W32.Muma.A                2        7     186.7
3 distinct files              n/a  n/a          W32.Muma.B                2        7     208.9
26/27/28 distinct files       n/a  n/a          BAT.Boohoo.Worm           4       72     384.9
We captured not only buffer-overflow worms but also those that exploit weak passwords.
All of those worm attacks required multiple connections to complete (as many as 72 for
BAT.Boohoo.Worm), as listed in the penultimate column, and involved a data transfer channel separate from the exploit connection to upload the malicious executable to the victim. This
highlights the weakness of using “first payload” techniques for filtering out known attacks from
background radiation: a large number of attacks only distinguish themselves as novel or known
after first engaging in significant initial activity.
The table also shows the (minimum) detection time for each worm. We measure detection
time as the interval between when the first scan packet arrives at the honeyfarm and when a second honeypot (infected by redirection from the first honeypot that served the request itself) attempts to make an outbound connection, which provides proof that we have captured code that self-propagates. Detection time depends on many factors: end-host response delay, network latency,
execution time of the malware within the honeypot, and redirection honeypot availability. We inspect network traces by comparing the times when a packet is read by the GQ controller and when
it is forwarded to a virtual machine by it. We find that the delay due to processing by the GQ controller is essentially instantaneous, i.e., the clock (which has a granularity of one microsecond) did not
advance during processing by the controller.
While the times given in Tables 5.3 and 5.4 seem high, often the culprit is a slow first stage during
which the remote source initially infects a honeypot. When we tested GQ’s detection process by
releasing Code Red and Blaster within it, the detection time was around one second.
However, this detection time measures only the latency between when a worm's probe is admitted by the honeyfarm and when the honeyfarm detects it as a worm. What matters more for the prospect of automatic response and containment is the overall detection time of a new worm outbreak: the latency between the outbreak's first spread attempt and its detection by the honeyfarm. Although we could apply the techniques used in [69] to estimate the latency between a worm outbreak's first spread attempt and the arrival of its first probe at our quarter-million-address network telescope, we still cannot estimate the overall detection time GQ may experience, because in the current operation only a small fraction of the probes that hit our network telescope are inspected by GQ: the replay proxy is not yet integrated, and the number of virtual machine honeypots is limited. The biggest challenge for future work is to remove these limitations so that all probes sent to the quarter million addresses can be inspected in real time. Once that is done, we can study whether the overall detection time GQ achieves is low enough to enable automatic response and containment.
5.5 Summary
Recently, great interest has arisen in the construction of honeyfarms: large pools of honeypots
that interact with probes received over the Internet to automatically determine the nature of the
probing activity, especially whether it signals the onset of a global worm outbreak.
Building an effective honeyfarm raises many challenging technical issues. These include ensuring high-fidelity honeypot operation; efficiently discarding the incessant Internet “background
radiation” that has only nuisance value when looking for new forms of activity; and devising and
policing an effective “containment” policy to ensure that captured malware does not inflict external
damage or skew internal analyses.
In this chapter we presented GQ, a high-fidelity honeyfarm system designed to meet these challenges. GQ runs fully functional servers across a range of operating systems known to be prime
targets for worms (especially various flavors of Microsoft Windows), confining each server to a virtual machine to maintain full control over its operation once compromised. To cope with load, GQ
leverages aggressive filtering, including a technique based on application-independent “replay” that
holds promise to quadruple GQ’s effective capacity.
Among honeyfarm efforts, our experiences with GQ are singular in that the scale at which
we have operated it—monitoring more than a quarter million Internet addresses, and capturing
66 distinct worms in four months of operation—far exceeds that previously achieved for a high-fidelity honeyfarm. While much remains to be done to push the system further in terms of capacity and
automated, efficient operation, its future as a first-rate tool for analyzing Internet-scale epidemics
appears definite.
Chapter 6
Conclusions and Future Work
We start this chapter with a summary of our contributions in Section 6.1 and finish it and this
thesis with a discussion of directions for future work in Section 6.2.
6.1 Thesis Summary
In this thesis, we tackle the problem of automating detection of new unknown malware with
our focus on an important environment, personal computers, and an important kind of malware,
computer worms. We face two fundamental challenges: false alarms and scalability. To minimize
false alarms, our approach is to infer the intent of user or adversary (the malware author). We infer
user intent on personal computers by monitoring user-driven activity such as key strokes and mouse
clicks. To infer the intent of worm authors, we leverage honeypots to detect self-propagation by first
letting a honeypot become infected, and then letting the first honeypot infect another one, and so
on. For the scalability problem—how to handle a huge number of repeated probes—our approach
is to leverage a new technology, protocol-independent replay of application dialog, to filter frequent
multi-stage attacks by replaying the server-side responses. Our main contributions are summarized as
follows.
1. BINDER: an Extrusion-based Break-In Detector for Personal Computers.
We designed a novel extrusion detection algorithm to detect break-ins of new unknown malware on personal computers. To detect extrusions, we first assume that user intent is implied
by user-driven input such as key strokes and mouse clicks. We infer user intent by correlating
outbound network connections with user-driven input at the process level, and use whitelisting to detect user-unintended but benign connections generated by system daemons. By doing
so, we can detect a large class of malware such as worms, spyware and adware that (1) run as
background processes, (2) do not receive any user-driven input, and (3) make outbound network connections. We implemented BINDER, a host-based detection system for Windows, to
realize this algorithm. We evaluated BINDER on six computers used by different volunteers
for their daily work over five weeks. Our limited user study indicates that BINDER limits
the number of false alarms to at most five over four weeks on each computer, and that the false
positive rate is less than 0.03%. To evaluate BINDER’s capability of detecting break-ins, we
built a controlled testbed using the Click modular router and VMware Workstation. We tested
BINDER with the Blaster worm and 22 different email worms collected on a departmental
email server over one week and showed that BINDER successfully detected the break-ins caused
by all these worms. From this work, we demonstrated that user intent, a unique characteristic
of personal computers, is a simple and effective detector for a large class of malware with few
false alarms; a minimal sketch of the correlation step appears after this list.
2. RolePlayer: Protocol-Independent Replay of Application Dialog.
We designed and implemented RolePlayer, a system which, given examples of an application
session, can mimic both the client side and the server side of the session for a wide variety
of application protocols. A key property of RolePlayer is that it operates in an application-independent fashion: the system does not require any specifics about the particular application
it mimics. It instead uses byte-stream alignment algorithms from bioinformatics to compare
different instances of a session to determine which fields it must change to successfully replay
one side of the session. Drawing only on knowledge of a few low-level syntactic conventions
such as representing IP addresses using “dotted quads”, and contextual information such as
the domain names of the participating hosts, RolePlayer can heuristically detect and adjust
network addresses, ports, cookies, and length fields embedded within the session, including
sessions that span multiple, concurrent connections on dynamically assigned ports. We successfully used RolePlayer to replay both the client and server sides for a variety of network
applications, including NFS, FTP, and CIFS/SMB file transfers, as well as the multi-stage
infection processes of the Blaster and W32.Randex.D worms. We can potentially use such
replay for recognizing malware variants, determining the range of system versions vulnerable to a given attack, testing defense mechanisms, and filtering multi-step attacks. From this
work, we demonstrated that it is possible to achieve automatic network protocol analysis for
certain problems; a sketch of the field-inference step appears after this list.
3. GQ: a Large-Scale, High-Fidelity Honeyfarm System.
We designed, implemented, and deployed GQ, a large-scale, high-fidelity honeyfarm system
that can detect new worm outbreaks by analyzing in real time the scanning probes seen at a
quarter million Internet addresses. In GQ, we overcame many technical challenges, including
ensuring high-fidelity honeypot operation; efficiently discarding the incessant Internet background radiation that has only nuisance value when looking for new forms of activity; and
devising and policing an effective containment policy to ensure that captured malware does
not inflict external damage or skew internal analyses. GQ leverages aggressive filtering, including a technique based on application-independent replay that holds promise to quadruple
GQ’s effective capacity. Among honeyfarm efforts, our experiences with GQ are singular in
that the scale at which we have operated it—monitoring more than a quarter million Internet
addresses, and capturing 66 distinct worms in four months of operation—far exceeds that
previously achieved for a high-fidelity honeyfarm.
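To make the correlation step summarized in item 1 concrete, the following minimal Python sketch illustrates the idea under simplifying assumptions; the function names, the 30-second correlation window, and the whitelist entry are hypothetical placeholders rather than BINDER's actual implementation, which monitors Windows events and applies richer rules.

    # Minimal sketch of extrusion detection: an outbound connection is suspicious
    # if its process is not whitelisted and has received no recent user input.
    # All names, the window, and the whitelist below are illustrative assumptions.
    DELAY_WINDOW = 30.0                  # assumed correlation window (seconds)
    DAEMON_WHITELIST = {"svchost.exe"}   # assumed benign background processes

    last_user_input = {}                 # pid -> time of last keystroke or mouse click

    def record_user_input(pid, timestamp):
        """Update the per-process record when a keystroke or mouse click arrives."""
        last_user_input[pid] = timestamp

    def is_extrusion(pid, process_name, connection_time):
        """Return True if an outbound connection looks user-unintended."""
        if process_name in DAEMON_WHITELIST:
            return False
        last = last_user_input.get(pid)
        return last is None or connection_time - last > DELAY_WINDOW

    # Example: a process with recent user input is not flagged; one without is.
    record_user_input(1001, 100.0)
    print(is_extrusion(1001, "iexplore.exe", 105.0))   # False: follows user input
    print(is_extrusion(2002, "sysupd.exe", 105.0))     # True: no user input ever

Whether a fixed delay window of this form suffices depends on how long benign applications defer their network activity after user input; the deployed BINDER rules are described earlier in the thesis.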
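The field-inference step of RolePlayer (item 2) can be sketched similarly. The snippet below is only an illustration: Python's difflib stands in for the Needleman-Wunsch-style byte-stream alignment [60] that RolePlayer actually uses, and the two "captures" are invented. It shows how the regions where two recordings of the same dialog differ point to the dynamic fields that must be rewritten during replay.

    # Illustrative only: difflib substitutes for bioinformatics-style alignment,
    # and the two captures below are fabricated examples.
    from difflib import SequenceMatcher

    def candidate_dynamic_fields(capture_a: bytes, capture_b: bytes):
        """Align two instances of the same dialog and return the byte ranges in
        capture_a that differ; such ranges are candidates for fields (addresses,
        ports, cookies, lengths) that must be adjusted during replay."""
        matcher = SequenceMatcher(a=capture_a, b=capture_b, autojunk=False)
        return [(i1, i2, capture_a[i1:i2])
                for tag, i1, i2, j1, j2 in matcher.get_opcodes()
                if tag != "equal"]

    first = b"USER alice\r\nHOST 10.0.0.1\r\nCOOKIE 42f1\r\n"
    second = b"USER alice\r\nHOST 10.0.0.2\r\nCOOKIE 9c3e\r\n"
    print(candidate_dynamic_fields(first, second))   # differing spans: host digit, cookie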
6.2 Future Work
To win the war against malicious attackers, we must keep creating new defense mechanisms in
the future. The work in this thesis sets the stage for follow-on work in a number of areas.
1. Improving the performance of the replay technology. From our discussion in Chapters 4 and 5, we can see that the replay technology has many potential applications. However, the replay proxy has not been integrated into the GQ honeyfarm operationally due to
its performance limitation (a single-threaded instance of the replay proxy can process about
100 concurrent probes per second). We plan to study this problem from three perspectives:
(1) improving the replay algorithm to reduce the computational cost; (2) leveraging multiple
threads/processes/machines; and (3) converting its network input/output to use raw sockets (a sketch of approach (2) appears after this list). To
improve the performance of the replay technology fundamentally, we need to have a better
understanding of network application protocols (so that we can compare an ADU with replay
“scripts” more efficiently). We need to develop new technologies for automatic inference of
network application protocols. RolePlayer compares two examples of an application session
to infer dynamic fields. We need to investigate what level of understanding we can achieve
by analyzing hundreds of thousands of examples of a network application in aggregate.
2. Enriching the functionality of the honeyfarm technology. In this thesis, we have studied
detecting scanning worms (which probe a set of addresses to identify vulnerable hosts) by
inspecting traffic from network telescopes on honeypots. We need to extend the design and
operation of the GQ honeyfarm in two directions. First, once a worm is captured, we can perform a wide variety of analyses on it using the well-engineered, fully-controlled honeyfarm.
For example, we can leverage existing technologies on vulnerability analysis and signature
generation [62, 15, 11] to generate attack and vulnerability signatures. Moreover, we can verify whether the analyses produce correct signatures by deploying them in the honeyfarm to check whether
repeated attacks can be blocked. How to integrate these technologies into the honeyfarm
efficiently remains a challenge. Second, we can use a honeyfarm system to detect and
analyze malware other than scanning worms because the operation of the honeypot cluster is
independent of the source of traffic. For example, we could feed the honeyfarm with spam
emails sent to nonexistent recipients on well-known email servers such as those owned by
universities; we could also subscribe honeyfarm nodes to application-layer networks such as
instant messaging and peer-to-peer networks. We need to investigate new approaches to feed
such data into the honeyfarm effectively.
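As a rough illustration of approach (2) in the first item above, the sketch below dispatches incoming probes to a pool of worker threads so that slow replay sessions do not block others. The handle_probe function and all names are hypothetical placeholders for the per-session replay logic, not the actual replay proxy.

    # Hypothetical sketch: parallelize probe handling with a thread pool.
    from concurrent.futures import ThreadPoolExecutor

    def handle_probe(probe):
        """Placeholder for matching a probe against replay scripts and answering it."""
        return "replayed response for " + probe

    def serve(probes, workers=8):
        # Each probe is handled in its own worker thread.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(handle_probe, probes))

    print(serve(["probe-1", "probe-2", "probe-3"]))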
In the long run, the following directions are interesting and important.
1. High-speed inline network intrusion detection with application-level replay. The replay
proxy presented in Chapter 5 is promising for filtering previously seen multi-stage malicious
attacks from a network telescope. Current inline network intrusion detection systems
are limited to signature inspection. If we could apply the replay proxy in network intrusion
detection, we could use it to reply to a suspicious request and then replay the finished conversation
against the original target host, so that the conversation can continue once the request is
recognized as benign. It remains a challenge to achieve this with few false alarms at
high speed.
2. Emulating user (mis)behavior on honeypots. Our work in this thesis shows that inferring
intent is effective for malware detection and honeypots are a powerful tool for observing
malware’s behavior. However, existing work in honeypot research doesn’t have users in the
loop. This impedes honeypots from detect social-engineering attacks that exploit user’s misbehavior. On the other hand, attackers can detect the existence of honeypots by noticing zero
user activity. So it is desirable to emulate user (mis)behavior on honeypots, but it remains a
challenge due to the daunting complexity of user behavior.
3. Virtual machines for better security. In the GQ honeyfarm, virtual machine technology
enables us to deploy and manage high-fidelity virtual honeypots. In addition to efficiency and
isolation, an interesting property of virtual machines is disposability. If we could run multiple virtual machines instead of a single persistent operating system, we could run untrusted
jobs on a separate virtual machine from those that we use for other activities. These virtual
machines must be lightweight with respect to resource consumption and boot-up speed. It
also remains a challenge to design a usage model for simultaneous virtual machines such that
it is easy for average users to migrate from today’s single persistent operating system model.
Bibliography
[1] Adware.CNSMIN. http://www.trendmicro.com/vinfo/virusencyclo/default5.asp?VName=ADW CNSMIN.A.
[2] Adware.Gator. http://securityresponse.symantec.com/avcenter/venc/data/adware.gator.html.
[3] Adware.Spydeleter. http://netrn.net/spywareblog/archives/2004/03/12/how-to-get-rid-of-spy-deleter/.
[4] James P. Anderson. Computer security threat monitoring and surveillance. Technical report,
James P. Anderson Company, Fort Washington, PA, April 1980.
[5] Stefan Axelsson. The base-rate fallacy and its implications for the difficulty of intrusion
detection. In Proceedings of the 6th ACM Conference on Computer and Communication
Security, 1999.
[6] Michael Bailey, Evan Cooke, Farnam Jahanian, Jose Nazario, and David Watson. The Internet Motion Sensor - a distributed blackhole monitoring system. In Proceedings of the
12th Annual Network and Distributed System Security Symposium, San Diego, CA, February
2005.
[7] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Proceedings of
the 19th ACM Symposium on Operating Systems Principles, October 2003.
[8] Marshall Beddoe. The protocol informatics project. http://www.baselineresearch.net/PI/.
[9] Richard Bejtlich. Extrusion Detection: Security Monitoring for Internal Intrusions. Addison-Wesley Professional, 2005.
[10] Kevin Borders and Atul Prakash. Web tap: Detecting covert web traffic. In Proceedings of
the 11th ACM Conference on Computer and Communication Security, October 2004.
[11] David Brumley, James Newsome, Dawn Song, Hao Wang, and Somesh Jha. Towards automatic generation of vulnerability-based signatures. In Proceedings of the 2006 IEEE Symposium on Security and Privacy, May 2006.
[12] Yu-Chung Cheng, Urs Holzle, Neal Cardwell, Stefan Savage, and Geoffrey Voelker. Monkey
see, monkey do: a tool for TCP tracing and replaying. In Proceedings of the 2004 USENIX
Annual Technical Conference, June 2004.
[13] CipherTrust. Zombie statistics. http://www.ciphertrust.com/resources/statistics/zombie.php,
June 2006.
[14] Evan Cooke, Michael Bailey, Z. Morley Mao, David Watson, Farnam Jahanian, and Danny
McPherson. Toward understanding distributed blackhole placement. In Proceedings of the
Second ACM Workshop on Rapid Malcode (WORM), October 2004.
[15] Manuel Costa, Jon Crowcroft, Miguel Castro, Antony Rowstron, Lidong Zhou, Lintao
Zhang, and Paul Barham. Vigilante: End-to-end containment of Internet worms. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP), Brighton,
UK, October 2005.
[16] Cybertrace. http://www.cybertrace.com/ctids.html.
[17] David Dagon, Xinzhou Qin, Guofei Gu, Wenke Lee, Julian Grizzard, John Levine, and Henry
Owen. HoneyStat: Local worm detection using honeypots. In Proceedings of the Seventh
International Symposium on Recent Advances in Intrusion Detection, September 2004.
[18] Dorothy E. Denning. An intrusion-detection model. IEEE Transactions on Software Engineering, 13(2), February 1987.
[19] Jeff Dike. User-Mode Linux. http://user-mode-linux.sourceforge.net/.
[20] Mark Eichin and Jon A. Rochlis. With microscope and tweezers: An analysis of the Internet
virus of November 1988. In Proceedings of the 1989 IEEE Symposium on Security and
Privacy, 1989.
[21] Daniel R. Ellis, John G. Aiken, Kira S. Attwood, and Scott D. Tenaglia. A behavioral approach to worm detection. In Proceedings of the Second ACM Workshop on Rapid Malcode
(WORM), October 2004.
[22] eXtremail. http://www.extremail.com/.
[23] Stephanie Forrest, Steven A. Hofmeyr, Anil Somayaji, and Thomas A. Longstaff. A sense of
self for Unix processes. In Proceedings of IEEE Symposium on Security and Privacy, 1996.
[24] Dennis Geels, Gautam Altekar, Scott Shenker, and Ion Stoica. Replay debugging for distributed applications. In Proceedings of the 2006 USENIX Annual Technical Conference,
Boston, MA, June 2006.
[25] Ajay Gupta and R. Sekar. An approach for detecting self-propagating email using anomaly
detection. In Proceedings of the Sixth International Symposium on Recent Advances in Intrusion Detection, September 2003.
[26] Mark Handley, Vern Paxson, and Christian Kreibich. Network intrusion detection: Evasion,
traffic normalization, and end-to-end protocol semantics. In Proceedings of the 10th USENIX
Security Symposium, 2001.
[27] Hewlett-Packard. http://www.hp.com/rnd/products/switches/2800 series/overview.htm.
[28] Steven A. Hofmeyr, Stephanie Forrest, and Anil Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.
[29] Honeynet. The honeynet project. http://www.honeynet.org/.
[30] Honeynet. Sebek. http://www.honeynet.org/tools/sebek/.
[31] Ruiqi Hu and Aloysius K. Mok. Detecting unknown massive mailing viruses using proactive
methods. In Proceedings of the Seventh International Symposium on Recent Advances in
Intrusion Detection, September 2004.
[32] Computer Security Institute. 2005 CSI/FBI computer crime and security survey, 2005.
[33] Intel. Intel virtualization technology. http://www.intel.com/technology/computing/vptech/,
2005.
[34] Xuxian Jiang and Dongyan Xu. Collapsar: A VM-based architecture for network attack
detention center. In Proceedings of the 13th USENIX Security Symposium, San Diego, CA,
August 2004.
[35] Jaeyeon Jung, Vern Paxson, Arthur W. Berger, and Hari Balakrishnan. Fast portscan detection
using sequential hypothesis testing. In Proceedings of the 2004 IEEE Symposium on Security
and Privacy, May 2004.
[36] Hyang-Ah Kim and Brad Karp. Autograph: Toward automated, distributed worm signature
detection. In Proceedings of the 13th USENIX Security Symposium, San Diego, CA, August
2004.
[37] Samuel T. King, George W. Dunlap, and Peter M. Chen. Debugging operating systems
with time-traveling virtual machines. In Proceedings of the 2005 USENIX Annual Technical
Conference, April 2005.
[38] Calvin Ko, Manfred Ruschitzka, and Karl Levitt. Execution monitoring of security-critical
programs in distributed systems: A specification-based approach. In Proceedings of the 1997
IEEE Symposium on Security and Privacy, May 1997.
[39] Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. The Click
modular router. ACM Transactions on Computer Systems, 18(3):263–297, August 2000.
[40] Eddie Kohler, Robert Morris, and Massimiliano Poletto. Modular components for network
address translation. In Proceedings of the Third IEEE Conference on Open Architectures and
Network Programming (OPENARCH), June 2002.
[41] Christian Kreibich and Jon Crowcroft. Honeycomb - creating intrusion detection signatures
using honeypots. In Proceedings of ACM SIGCOMM HotNets-II Workshop, November 2003.
[42] Abhishek Kumar, Vern Paxson, and Nicholas Weaver. Exploiting underlying structure for
detailed reconstruction of an Internet-scale event. In Proceedings of the 2005 ACM Internet
Measurement Conference, October 2005.
[43] T. J. LeBlanc and J. M. Mellor-Crummey. Debugging parallel programs with instant replay.
IEEE Transactions on Computers, pages 471–482, April 1987.
[44] Wenke Lee and Salvatore J. Stolfo. Data mining approaches for intrusion detection. In
Proceedings of the Seventh USENIX Security Symposium, January 1998.
[45] Wenke Lee and Salvatore J. Stolfo. A framework for constructing features and models for
intrusion detection systems. ACM Transactions on Information and System Security, 3(4),
November 2000.
[46] Corrado Leita, Ken Mermoud, and Marc Dacier. Scriptgen: an automated script generation
tool for honeyd. In Proceedings of the 21st Annual Computer Security Applications Conference, December 2005.
[47] Zhenmin Li, Jed Taylor, Elizabeth Partridge, Yuanyuan Zhou, William Yurcik, Cristina Abad,
James J. Barlow, and Jeff Rosendale. UCLog: A unified, correlated logging architecture for
intrusion detection. In the 12th International Conference on Telecommunication Systems Modeling and Analysis (ICTSM), 2004.
[48] Yihua Liao and V. Rao Vemuri. Using text categorization techniques for intrusion detection.
In Proceedings of the 11th USENIX Security Symposium, August 2002.
[49] Don Libes. expect: Curing those uncontrollable fits of interaction. In Proceedings of the
Summer 1990 USENIX Conference, pages 183–192, June 1990.
[50] libpcap. http://www.tcpdump.org/.
[51] Insecure.Com LLC. Nmap security scanner. http://www.insecure.org/nmap.
[52] McAfee Inc. McAfee Security Forensics. http://networkassociates.com/us/products/mcafee/forensics/security forensics.htm.
[53] William Metcalf. Snort inline. http://snort-inline.sourceforge.net/.
[54] David Moore, Vern Paxson, Stefan Savage, Colleen Shannon, Stuart Staniford, and Nicholas
Weaver. Inside the Slammer worm. IEEE Security and Privacy, August 2003.
[55] David Moore and Colleen Shannon. The spread of the Witty worm. IEEE Security and
Privacy, 2(4), July/August 2004.
[56] David Moore, Colleen Shannon, and Jeffery Brown. Code-Red: a case study on the spread
and victims of an Internet worm. In Internet Measurement Workshop, November 2002.
[57] David Moore, Colleen Shannon, Geoffrey Voelker, and Stefan Savage. Internet quarantine:
Requirements for containing self-propagating code. In Proceedings of the 2003 IEEE Infocom Conference, San Francisco, CA, April 2003.
[58] David Moore, Colleen Shannon, Geoffrey Voelker, and Stefan Savage. Network telescopes.
Technical report, CAIDA, April 2004.
[59] David Moore, Geoffrey Voelker, and Stefan Savage. Inferring Internet denial-of-service activity. In Proceedings of the 10th USENIX Security Symposium, 2001.
[60] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search
for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology,
48:443–453, 1970.
[61] James Newsome, Brad Karp, and Dawn Song. Polygraph: Automatically generating signatures for polymorphic worms. In Proceedings of the 2005 IEEE Symposium on Security and
Privacy, May 2005.
[62] James Newsome and Dawn Song. Dynamic taint analysis for automatic detection, analysis,
and signature generation of exploits on commodity software. In Proceedings of the 12th
Annual Network and Distributed System Security Symposium, February 2005.
[63] Ruoming Pang, Vinod Yegneswaran, Paul Barford, Vern Paxson, and Larry Peterson. Characteristics of Internet background radiation. In Proceedings of the 2004 ACM Internet Measurement Conference, October 2004.
[64] Vern Paxson. Bro: a system for detecting network intruders in real-time. Computer Networks,
31(23-24):2435–2463, 1999.
[65] Adam G. Pennington, John D. Strunk, John Linwood Griffin, Craig A.N. Soules, Garth R.
Goodson, and Gregory R. Ganger. Storage-based intrusion detection: Watching storage activity for suspicious behavior. In Proceedings of the 12th USENIX Security Symposium, August
2003.
[66] Niels Provos. A virtual honeypot framework. In Proceedings of the 13th USENIX Security
Symposium, San Diego, CA, August 2004.
[67] PsTools. http://www.sysinternals.com/ntw2k/freeware/pstools.shtml.
[68] Thomas H. Ptacek and Timothy N. Newsham. Insertion, evasion, denial of service: Eluding
network intrusion detection. Technical report, Secure Networks, Inc., January 1998.
[69] Moheeb Abu Rajab, Fabian Monrose, and Andreas Terzis. On the effectiveness of distributed
worm monitoring. In Proceedings of the 14th USENIX Security Symposium, Baltimore, MD,
August 2005.
[70] rattle.
Using process infection to bypass Windows software firewalls. http://www.phrack.org/show.php?p=62&a=13, 2004.
[71] Martin Roesch. Snort: Lightweight intrusion detection for networks. In Proceedings of the
13th USENIX Systems Administration Conference, 1999.
[72] M. Russinovich and B. Cogswell. Replay for concurrent non-deterministic shared-memory
applications. In Proceedings of the 1996 Conference on Programming Language Design and
Implementation, pages 258–266, May 1996.
[73] Stefan Saroiu, Steven D. Gribble, and Henry M. Levy. Measurement and analysis of spyware
in a university environment. In Proceedings of the First Symposium on Networked Systems
Design and Implementation, March 2004.
[74] Stuart E. Schechter, Jaeyeon Jung, and Arthur W. Berger. Fast detection of scanning worm
infections. In Proceedings of the Seventh International Symposium on Recent Advances in
Intrusion Detection, September 2004.
[75] R. Sekar, M. Bendre, D. Dhurjati, and P. Bollineni. A fast automaton-based method for
detecting anomalous program behavior. In Proceedings of the 2001 IEEE Symposium on
Security and Privacy, May 2001.
[76] R. Sekar, A. Gupta, J. Frullo, T. Shanbhag, A. Tiwari, H. Yang, and S. Zhou. Specification-based anomaly detection: a new approach for detecting network intrusions. In Proceedings
of the 2002 ACM Conference on Computer and Communications Security, November 2002.
[77] Sumeet Singh, Cristian Estan, George Varghese, and Stefan Savage. Automated worm fingerprinting. In Proceedings of the Sixth Symposium on Operating Systems Design and Implementation (OSDI), December 2004.
[78] Steven R. Snapp, James Brentano, Gihan V. Dias, Terrance L. Goan, L. Todd Heberlein, Che
Lin Ho, Karl N. Levitt, Biswanath Mukherjee, Stephen E. Smaha, Tim Grance, Daniel M.
Teal, and Doug Mansur. DIDS (Distributed Intrusion Detection System)—motivation, architecture, and an early prototype. In Proceedings of the 14th National Computer Security
Conference, Washington, DC, 1991.
[79] Eugene Spafford. An analysis of the Internet worm. In Proceedings of European Software
Engineering Conference, September 1989.
[80] Lance Spitzner. Honeypot farms. http://www.securityfocus.com/infocus/1720, August 2003.
[81] Lance Spitzner. Honeypots: Tracking Hackers. Addison-Wesley, 2003.
[82] Stuart Staniford, David Moore, Vern Paxson, and Nicholas Weaver. The top speed of flash
worms. In Proceedings of the Second ACM Workshop on Rapid Malcode (WORM), October
2004.
[83] Stuart Staniford, Vern Paxson, and Nicholas Weaver. How to 0wn the Internet in your spare
time. In Proceedings of the 11th USENIX Security Symposium, August 2002.
[84] Marc Stiegler, Alan H. Karp, Ka-Ping Yee, and Mark Miller. Polaris: Virus safe computing
for Windows XP. Technical Report HPL-2004-221, HP Labs, December 2004.
[85] Marc Stiegler and Mark Miller. E and CapDesk. http://www.combex.com/tech/edesk.html.
[86] Symantec. Symantec Internet Security Threat Report, March 2006.
[87] Symantec Norton Antivirus. http://www.symantec.com/.
[88] Symantec
Security Response - Alphabetical Threat Index. http://securityresponse.symantec.com/avcenter/venc/auto/index/indexA.html.
[89] Symantec. Decoy server product sheet. http://www.symantec.com/.
[90] Tcpreplay: Pcap editing and replay tools for *NIX. http://tcpreplay.sourceforge.net.
[91] TDIMon. http://www.sysinternals.com/ntw2k/freeware/tdimon.shtml.
[92] Marina Thottan and Chuanyi Ji. Anomaly detection in IP networks. IEEE Transactions on
Signal Processing, 51(8), August 2003.
[93] Aaron Turner. Flowreplay design notes. http://www.synfin.net/papers/flowreplay.pdf.
[94] Unicode. http://www.unicode.org/.
[95] VMware Inc. http://www.vmware.com/.
[96] Michael Vrable, Justin Ma, Jay Chen, David Moore, Erik Vandekieft, Alex C. Snoeren, Geoffrey M. Voelker, and Stefan Savage. Scalability, fidelity, and containment in the Potemkin
virtual honeyfarm. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP), October 2005.
[97] W32.Blaster.Worm. http://securityresponse.symantec.com/avcenter/venc/data/w32.blaster.
worm.html.
[98] W32.Randex.D. http://securityresponse.symantec.com/avcenter/venc/data/w32.randex.d.html.
[99] W32.Toxbot. http://securityresponse.symantec.com/avcenter/venc/data/w32.toxbot.html.
[100] David Wagner and Drew Dean. Intrusion detection via static analysis. In Proceedings of the
2001 IEEE Symposium on Security and Privacy, May 2001.
[101] David Wagner and Paolo Soto. Mimicry attacks on host-based intrusion detection systems.
In Proceedings of the Ninth ACM Conference on Computer and Communications Security,
November 2002.
[102] Helen J. Wang, Chuanxiong Guo, Daniel R. Simon, and Alf Zugenmaier.
Shield: Vulnerability-driven network filters for preventing known vulnerability exploits. In Proceedings of ACM SIGCOMM, August 2004.
[103] Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang. Automatic misconfiguration troubleshooting with PeerPressure. In Proceedings of the 2004 Symposium on
Operating Systems Design and Implementation (OSDI), San Francisco, CA, December 2004.
[104] Nicholas Weaver, Vern Paxson, Stuart Staniford, and Robert Cunningham. A taxonomy of
computer worms. In Proceedings of the First ACM Workshop on Rapid Malcode (WORM),
October 2003.
[105] Nicholas Weaver, Stuart Staniford, and Vern Paxson. Very fast containment of scanning
worms. In Proceedings of the 13th USENIX Security Symposium, San Diego, CA, August
2004.
[106] Wikipedia. Logit analysis. http://en.wikipedia.org/wiki/Logit.
[107] Matthew M. Williamson. Throttling viruses: Restricting propagation to defeat malicious
mobile code. Technical Report HPL-2002-172, HP Labs Bristol, 2002.
[108] Windows Hooks API.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winui/winui/windowsuserinterface/windowing/hooks.asp.
[109] Windows Security Auditing. http://www.microsoft.com/technet/security/prodtech/win2000/
secwin2k/09detect.mspx.
[110] WinDump. http://windump.polito.it/.
[111] WinPcap. http://winpcap.polito.it/.
[112] Cynthia Wong, Stan Bielski, Jonathan M. McCune, and Chenxi Wang. A study of massmailing worms. In Proceedings of the Second ACM Workshop on Rapid Malcode (WORM),
October 2004.
[113] Yinglian Xie, Hyang-Ah Kim, David R. O’Hallaron, Michael K. Reiter, and Hui Zhang. Seurat: A pointillist approach to anomaly detection. In Proceedings of the Seventh International
Symposium on Recent Advances in Intrusion Detection, September 2004.
[114] Jintao Xiong. ACT: Attachment chain tracing scheme for email virus detection and control.
In Proceedings of the Second ACM Workshop on Rapid Malcode (WORM), October 2004.
[115] Ka-Ping Yee. User interaction design for secure systems. In Proceedings of the Fourth
International Conference on Information and Communications Security, Singapore, 2002.
[116] Vinod Yegneswaran, Paul Barford, and Vern Paxson. Using honeynets for Internet situational
awareness. In Proceedings of ACM SIGCOMM HotNets-IV Workshop, November 2005.
[117] Vinod Yegneswaran, Paul Barford, and Dave Plonka. On the design and use of Internet sinks
for network abuse monitoring. In Proceedings of the Seventh International Symposium on
Recent Advances in Intrusion Detection, September 2004.
[118] Michal
Zalewski. P0f: A passive OS fingerprinting tool. http://lcamtuf.coredump.cx/p0f.shtml.
[119] Yin Zhang and Vern Paxson. Detecting stepping stones. In Proceedings of the 9th USENIX
Security Symposium, August 2000.
[120] ZoneAlarm. http://www.zonelabs.com/.
[121] Cliff Changchun Zou, Lixin Gao, Weibo Gong, and Don Towsley. Monitoring and early
warning for Internet worms. In Proceedings of the 10th ACM Conference on Computer and
Communications Security, October 2003.
[122] Cliff Changchun Zou, Weibo Gong, and Don Towsley. Worm propagation modeling and
analysis under dynamic quarantine defense. In Proceedings of the First ACM Workshop on
Rapid Malcode (WORM), October 2003.