EXFILD: A TOOL FOR THE DETECTION OF DATA EXFILTRATION

Transcription

EXFILD: A TOOL FOR THE DETECTION OF DATA
EXFILTRATION USING ENTROPY AND ENCRYPTION
CHARACTERISTICS OF NETWORK TRAFFIC
by
Tyrell William Fawcett
A thesis submitted to the Faculty of the University of Delaware in partial
fulfillment of the requirements for the degree of Master of Science in Electrical and
Computer Engineering
Spring 2010
© 2010 Tyrell William Fawcett
All Rights Reserved
EXFILD: A TOOL FOR THE DETECTION OF DATA
EXFILTRATION USING ENTROPY AND ENCRYPTION
CHARACTERISTICS OF NETWORK TRAFFIC
by
Tyrell William Fawcett
Approved:
W. David Sincoskie, Ph.D.
Professor in charge of thesis on behalf of the Advisory Committee
Approved:
Kenneth E. Barner, Ph.D.
Chairman of the Department of Electrical and Computer Engineering
Approved:
Michael J. Chajes, Ph.D.
Dean of the College of Engineering
Approved:
Debra Hess Norris, M.S.
Vice Provost for Graduate and Professional Education
ACKNOWLEDGMENTS
To my advisor, Dave Sincoskie: Dr. Sincoskie has given me the opportunity
to further my education at the University of Delaware. He has provided me with the
guidance needed to perform my research at a level higher than I believed I could.
He has taught me not to blindly believe things without proving them myself. His
vast experience in research and ability to see the importance of this area of research
are what led to my thesis.
To Chase Cotton: Dr. Cotton is always around and willing to brainstorm
and work through technical problems with anyone. His practical experience in the
field and willingness to perform experiments at all hours of the day have proven
invaluable to my research.
To Fouad Kiamilev: Dr. K has been gracious enough to provide me space in
his lab and access to all of his equipment during my graduate career. He
provides an enthusiastic and upbeat lab environment for CVORG. His eagerness to
learn and his encouragement to do what you love are what keep CVORG such a great
place to work.
To Charles Boncelet: Dr. Boncelet provided great advice and guidance relating to encryption and entropy calculations in my work. I am thankful to him for
sharing his experience and saving me time and effort by steering me away from unnecessary paths
in unfamiliar territory.
To all the members of CVORG: CVORG is definitely a very eclectic group
of people piled into labs together. The members are always there to help someone
in need. The brainstorming sessions in this lab lead to a lot of great ideas and
research. A lot of aspects of both my thesis work and other projects have resulted
from these brainstorming sessions and have provided me with some paths forward in
future work. I am proud to be a part of CVORG.
To Larry and Karen Steenhoek: The Steenhoeks have provided a lot of support over the last semester to my wife and me. It has made this transitional period,
and specifically the writing process, a much easier time, and for that I thank them.
To my parents, Jim and Deby Fawcett, and my sister, Sammy Fawcett: I
know that no matter what happens I will always have their support. I attribute
my successes up to now to the work ethic and morals they instilled in me. I love
them for everything they have done for me.
To my wife, Valerie Fawcett: She has been my best friend for as long as most
people can remember. She has kept my life interesting with her goofy antics, which I’m
sure will only get more entertaining in DC. She has made sacrifices allowing me to
quench my thirst for knowledge and for that I can’t thank her enough. She will be
as happy as I am to see this thesis completed, because she calls it the day she gets
her husband back! I couldn’t ask for a more supportive wife to spend the rest of my
life with.
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ABSTRACT

Chapter

1 BACKGROUND

2 INTRODUCTION
  2.1 Motivation
    2.1.1 Scenario of Security Against Infiltration
    2.1.2 System Administrators
    2.1.3 Encryption
    2.1.4 Out of the Box Firewall
    2.1.5 Login Credentials
    2.1.6 Antivirus Software / Malware Detection
    2.1.7 Properly Configured Firewall
    2.1.8 Intrusion Detection System
    2.1.9 Intrusion Prevention System
    2.1.10 Outgoing Traffic
    2.1.11 Consequences of Encryption
  2.2 Goals
  2.3 Related Work

3 NETWORK TOOLS
  3.1 Network Sniffer
  3.2 Corporate Watcher
  3.3 Network Top
  3.4 DNS Extractor
  3.5 Session Extractor
  3.6 IP Helper

4 EXFILD DESIGN AND IMPLEMENTATION
  4.1 Packet and Session Processing
    4.1.1 Packet Decoding
    4.1.2 Extract Sessions
    4.1.3 Entropy Calculation
      4.1.3.1 Scaling by Initial Values
      4.1.3.2 Scaling by Size
    4.1.4 Checking if Encryption is Expected
    4.1.5 Checking if Encryption is Present
  4.2 Tree
    4.2.1 Expected Unencrypted and Received Unencrypted
    4.2.2 Expected Encrypted and Received Encrypted
    4.2.3 Expected Encrypted and Received Unencrypted
    4.2.4 Expected Unencrypted and Received Encrypted

5 EXPERIMENTS, RESULTS, AND ANALYSIS
  5.1 Experiments
    5.1.1 Control Data Set
    5.1.2 Data Set #1
    5.1.3 Data Set #2
    5.1.4 Malware Data Sets
      5.1.4.1 Kraken Botnet
      5.1.4.2 Zeus Botnet
      5.1.4.3 Black Worm
  5.2 Results and Analysis
    5.2.1 Packet Versus Session Alerts
    5.2.2 Control Data Set
    5.2.3 Data Set #1
    5.2.4 Data Set #2
    5.2.5 Malware Data Sets
      5.2.5.1 Kraken
      5.2.5.2 Zeus
      5.2.5.3 Blackworm
    5.2.6 Data Exfiltration Detection Performance

6 CONCLUSIONS

7 FUTURE WORK
  7.1 Performance
  7.2 Handle A Network
  7.3 Application Layer Decoding of Packets
  7.4 Comparison to packet and session entropy
  7.5 Compressed File Analysis
  7.6 Behavioral Analysis
  7.7 Additional Tools

BIBLIOGRAPHY

Appendix

A EXPERIMENTS
B ENTROPY PLOTS FOR DATA SETS
C ALERTS FOR DATA SETS
D VERIFICATION OF MALWARE PACKET CAPTURES
  D.1 Kraken Packet Capture
  D.2 Zeus Packet Captures
    D.2.1 Zeus #1
    D.2.2 Zeus #2
    D.2.3 Zeus #3
  D.3 Blackworm
LIST OF FIGURES

3.1 Example Output from the Network Sniffer
3.2 Example Output from the Corporate Watcher Program (Simple)
3.3 Example Output from the Corporate Watcher Program (Complex)
3.4 Example Output from the Network Top Program
3.5 Example Output from the DNS Extractor Program
3.6 Example Output from the Session Extractor Program
3.7 GUI for the IP Helper Program
4.1 The Flow of the Processing of Packets
4.2 The Flow of the Processing of Sessions
4.3 Maximum and Minimum Entropy vs. Size of a Packet’s Payload
4.4 Maximum Entropy vs. Size of a Packet’s Payload
4.5 Maximum and Minimum Entropy Values with Different Initial Values
4.6 Scaled Maximum and Minimum Entropy Values
4.7 HTTP and HTTPS Traffic
4.8 The Four Branches of the Tree
4.9 Flow for Expected Unencrypted and Received Unencrypted Branch
4.10 Flow for the Expected Encrypted and Received Encrypted Branch
4.11 Flow for Expected Encrypted and Received Unencrypted Branch
4.12 Flow for Expected Unencrypted and Received Unencrypted Branch
5.1 Network Diagram for the Control Data Set
5.2 Network Diagram for Data Set #1
5.3 Network Diagram for Data Set #2
B.1 Packet Entropies for the Control Data Set
B.2 Session Entropies for the Control Data Set
B.3 Packet Entropies for Data Set #1 (First Plot)
B.4 Packet Entropies for Data Set #1 (Second Plot)
B.5 Session Entropies for Data Set #1
B.6 Packet Entropies for Data Set #2
B.7 Session Entropies for Data Set #2
B.8 Packet Entropies for the Kraken Data Set
B.9 Session Entropies for the Kraken Data Set
B.10 Packet Entropies for Zeus Data Set #1
B.11 Session Entropies for Zeus Data Set #1
B.12 Packet Entropies for Zeus Data Set #2
B.13 Session Entropies for Zeus Data Set #2
B.14 Packet Entropies for Zeus Data Set #3
B.15 Session Entropies for Zeus Data Set #3
B.16 Packet Entropies for the Blackworm Data Set
B.17 Session Entropies for the Blackworm Data Set
C.1 Session Alerts for the Control Data Set
C.2 Session Alerts for Data Set #2
C.3 Session Alerts for the Kraken Data Set
C.4 Session Alerts for Zeus Data Set #1
C.5 Session Alerts for Zeus Data Set #2
C.6 Session Alerts for Zeus Data Set #3
C.7 Session Alerts for the Blackworm Data Set
D.1 Sessions from the Kraken Data Set to Servers on Port 447
D.2 DNS Names Resolved from the Kraken Data Set
D.3 HTTP GET Command Downloading the Configuration File
D.4 HTTP POST Commands and HTTP OK Responses
D.5 Zeus HTTP Communications
D.6 HTTP GET Command Requesting the Configuration File
D.7 HTTP OK Response to the Request for the Configuration File
D.8 HTTP POST Commands and HTTP OK Responses
D.9 Zeus HTTP Communications
D.10 HTTP GET Command Requesting the Configuration File
D.11 HTTP OK Response to the Request for the Configuration File
D.12 HTTP POST Commands and HTTP OK Responses
D.13 Zeus HTTP Communications
D.14 Connection to Victim and Enumerating Its Network Shares
D.15 Searching for Security Vendors’ Program Folders
D.16 Copying Blackworm Executable to Victim
D.17 Copying Blackworm Executable to Victim
D.18 Deleting Startup Link from Victim
D.19 Creating a Job to Run the Blackworm Executable
LIST OF TABLES

4.1 Scalars for Max Entropy of Payloads by Size
4.2 Magic Numbers and HTTP Strings for Compressed Files
5.1 Alert Statistics for Packets and Sessions of Each Data Set
5.2 Packet Alerts from the Control Data Set
5.3 Session Alerts from the Control Data Set
5.4 Packet Alerts from Data Set #1
5.5 Packet Alerts from Data Set #2
5.6 Zeus Packet and Session Alerts #2
5.7 HTTP POST Commands in Zeus Packet Captures #2
5.8 Packet Alert Metrics
5.9 Session Alert Metrics
5.10 True and False Positive Rates for Each Data Set
A.1 Annotation of the Control Data Set
ABSTRACT
The twin goals of easy communication and privacy protection have always
been in conflict. Everyone can agree that important information such as social security numbers, credit card numbers, proprietary information, and classified government information should not be shared with untrusted and unknown entities. The
Internet makes it rather simple for an attacker to steal this information from even
security conscious users without the victims ever discovering the theft. All it takes
is one lapse in judgment and an attacker can have access to sensitive information.
Currently the computer and network security industry places its focus on
tools and techniques that are concerned with what is entering a system rather than what
is exiting a system. The industry has no good reason not to inspect outgoing traffic.
Many attacks’ success and effectiveness rely heavily on traffic exiting the computer
system. Outgoing traffic is just as important to inspect as incoming traffic, if not
more so, for detecting attacks involving theft of confidential information or interaction
between the attacker’s and the victim’s computer systems. Frequently recurring data
breaches reinforce the necessity of tools and techniques capable of alerting the users
when data is being exfiltrated from their computer systems.
This thesis explores the use of entropy characteristics of network traffic to
ascertain whether egress traffic from computer systems is encrypted. The inspection
of network traffic at the session level instead of the packet level is proposed to improve
the accuracy of the entropy values. It establishes that entropy can indeed be used
as an accurate metric of the traffic’s actual state of encryption.
The major contribution of this thesis to the field of computer and network
security is presenting a detection scheme capable of distinguishing data exfiltration
from benign traffic. The detection scheme is based on the results of the entropy
calculations and the observation of the traffic’s state of encryption. The expected
state of encryption for the session’s data link, network, transport, or application
layer protocol is found by decoding the protocol stack data fields. This expectation
is compared to the actual traffic’s observed state of encryption as measured by its
entropy. It is demonstrated that using this comparison as the base of the detection
scheme provides accurate detection of inappropriate data being exfiltrated from a
system.
The thesis produces multiple network tools that can be used for inspection of
traffic. The initial work develops tools that are capable of equipping system administrators with a better understanding of the network traffic leaving their systems.
The most important tool resulting from this research is ExFILD, which implements
the proposed detection scheme. The tool accurately alerts when it detects traffic
with suspicious payloads exiting a system.
Chapter 1
BACKGROUND
The field of computer and network security has increased in importance in
recent years. Historically (and in some cases currently) computer security has
been neglected for reasons including cost, disbelief that a system would be attacked,
and the inconvenience of managing the security systems. People involved in early
development of computers and the ARPANET (what is known as the Internet today)
wanted to share information with all of their colleagues to facilitate the development
of technology. They were not worried about people seeing their information and they
trusted their colleagues not to alter their data. Decades ago, the implementation
of security was concerned with the amount of time used on a computer and the
financial aspects, not with attacks attempting to steal or gain control of a computer
system. Today, it is hard to go a week without seeing a report of a data breach
at a government facility, financial corporation, or educational institution. The
President of the United States of America has recently issued a Cyberspace Policy
Review [1] and started pushing for more attention to be placed on protecting the
country from domestic and foreign threats. Society is beginning to take computer
security seriously and doing more to protect computers and data from threats.
Computer and network attacks come in many different forms. They range
from botnets, groups of compromised computer systems controlled by a common
attacker used for spamming, such as the Srizbi Botnet sending billions of spam
messages per day [2], to nation states attacking other nation states such as the
Russian DDOSing of Estonia’s computers in 2007 [3]. The attacker can use some
automated scripts to do most of the mundane tasks and make the decisions of how to
proceed after each step. Attacks can also be almost exclusively automated as most
botnets are. Attacks can be well known and run by script kiddies (inexperienced
attackers that only run tools created by others) or they can be sophisticated zero-day
attacks (an attack that has not been publicly released) written by an experienced
attacker. They can be very focused and target a single system or they can target a
large number of systems in the case of a worm spreading from peer to peer.
A main concern today is data confidentiality. Attacks can be divided into two
classes: those that steal confidential data and those that do not. One example of
an attack that does not need data to be extracted from the target is a DOS (Denial
of Service) attack. A DOS attack saturates the computer’s resources leaving it in a
non-functional state. Usually a DOS attack refers to saturating a network link, but
it can also refer to the saturation of the CPU cycles.
The confidential data exfiltrated from systems depends on the attacker’s
goals. In the case of an attack targeting individuals, it could be personal information such as social security numbers, credit card information, passwords, etc. For
attacks targeting nation states, extraction of confidential information on military
projects or classified information may be the objective. Attacks against financial
institutions seek money - either directly by manipulating financial transactions or
indirectly by extracting massive amounts of customer information.
A bot herder (the attacker controlling the compromised computers of a botnet) may also be interested in data in addition to the victim’s personal information.
They may also need information about the host system to be able to control the
bot and continue expanding the botnet. An example of information that could be
passed from the host to the herder would be the results from a network scan that
revealed possible new targets. Information about the host’s intranet could also be
used to decide what kind of information to extract, e.g., the host appears to be a
government-contracted engineering firm, so an attacker should look for design plans.
Data can be exfiltrated from a system in many ways. Some of the more
common ways of exfiltrating data are through botnets, viruses, worms, rootkits,
FTP, close access means such as flash drives or CDs, email, and even phishing
attacks. Many techniques for exfiltrating data are discussed in [4]. The two general
methods of exfiltration are through physical means or over a network connection.
For this thesis the working definition of physical exfiltration will be any method
where a user will have to be in the proximity of the system containing the data.
Network exfiltration will be any method where the location of the attacker does not
matter for the attack to be successful.
The physical methods are easier and are often overlooked, but require physical
access to the computer. These attacks are often insider attacks, because it can
be hard to get physical access to these systems if the attacker is not trusted by
the owner or does not work for the organization where the computer is physically
located. Alternatively, physical attack is often employed by officials executing legal
search warrants, or covertly by individuals seeking illegal access to systems, the
so-called “black bag job.”
In the case of an insider attack, an employee can plug a flash drive into
a computer and copy the files they want onto it and then take it home. The
employee has now transferred the data offsite and can do anything he would like
with it. Physically exfiltrating data is even easier when an employee is issued a
laptop for work. The data on the laptop leaves the site every day and also gives
the employee access to the network from home, usually through a VPN connection,
allowing the employee to copy data off of the system in the safety of their home. Some
other physical methods of stealing data are burning it to a CD/DVD, printing the
information, leaking it through lights and diodes as demonstrated in [5], cameras
recording monitors and keyboards, and transferring it over a wireless channel.
The network methods require some more knowledge, but the attacker can
be anywhere in the world. The possibilities for exfiltrating data over a network
are endless. Only an attacker’s creativity and desire for stealthiness limit the
methods that can be devised to get data out of a computer. Exfiltrating data can
be as easy as emailing the information out or simply opening a connection to a
computer owned by an attacker. Covert channel data exfiltration methods such as
flipping bits in the TCP header or by controlling the timing of packets can be more
complicated [4]. The type of method used to get the data out of the system
depends on the system and its configuration settings. Email is an easy way to get
data out of a system without causing suspicion, because a user sends multiple emails
a day, many with attachments.
A lot of information exfiltrated is the result of botnets, worms, and viruses.
When a computer system is infected, backdoors, rootkits, and hidden FTP servers
are installed, as well as impromptu servers set up by programs such as Netcat [6].
These installed binaries make it possible for the attacker to exfiltrate information
from a computer system. Botnets such as Torpig and Conficker have received an
increased amount of media attention. Torpig was estimated to have infected a little
more than 182,000 hosts [7] and Conficker was estimated to have infected almost 9
million hosts [8].
Many people become worried that a virus, worm, or botnet has infected their
computer. It is not the fact that a computer system is infected that should worry
users, but rather what tasks the attacker performs with the infected computer.
For example, if an attacker uses one hundred compromised hosts to launch a DDoS
(Distributed Denial of Service) attack against another host, it would have a rather
small impact on each of the compromised hosts. It would only result in a loss in
bandwidth for a relatively short time for the compromised hosts. It will however
have a substantial impact on the target computer of the DDoS. On the other hand
if the attack is similar to that of Torpig’s, which is used to steal information ranging
from passwords to credit card numbers [7], then the victims should be worried. Loss
of bandwidth is insignificant compared to the impact of identity theft. Users should
be concerned about being infected and take steps to clean the system, but what
they are infected with and what it does should be of even greater concern.
The infinite number of ways to extract data is what makes detecting it a
difficult problem. In general, defense is hard and offense is easy, because an attacker
only needs one hole for an attack to work, while the defender needs to protect
against an infinite number of attacks. One mistake or oversight by a defender could
open a door for the attacker. Many systems are protected with software to regulate
incoming data such as firewalls, anti-virus software, intrusion detection systems,
and intrusion prevention systems, but there is not as much emphasis on controlling
outbound traffic.
Exfiltration detection tools are typically very specific about what they want to detect. The
only exception to this may be some tools that use self-learning or
artificial intelligence algorithms; however, these tools may also be very focused.
DNStTrap [9] is a tool developed by Jhind using artificial intelligence to detect exfiltration through DNS tunneling. DNS tunnels are created by passing information
within the DNS names of hosts. DNStTrap extracts traits from the leftmost subdomain of the domain name, e.g., cics is the leftmost subdomain of cics.udel.edu.
One of the more important traits is the similarity of the leftmost subdomains to
the other leftmost subdomains. The idea is that if a user is browsing the Internet, the user may visit cics.udel.edu 5 times a day; however, if a user is exfiltrating
data it would look more like somethinginteresting.udel.edu, exfiltrateddata.udel.edu,
ssn.udel.edu, creditcardnumber.udel.edu, and password.udel.edu. DNStTrap will
then use the similarity of the domains along with other traits to decide if it is a
DNS tunnel.
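To make the idea concrete, the following short Perl sketch groups queried names by their parent domain and flags parents that see an unusually large fraction of distinct leftmost labels. It is only an illustration of the heuristic described above, not DNStTrap's actual code; the 0.8 threshold and the inline sample names are assumptions made for the example.

#!/usr/bin/perl
use strict;
use warnings;

# Count distinct leftmost labels per parent domain; many distinct labels
# relative to the number of queries suggests a possible DNS tunnel.
my %labels_seen;   # parent domain => { leftmost label => 1 }
my %query_count;   # parent domain => number of queries

while (my $name = <DATA>) {
    chomp $name;
    my ($leftmost, $parent) = $name =~ /^([^.]+)\.(.+)$/ or next;
    $labels_seen{$parent}{$leftmost} = 1;
    $query_count{$parent}++;
}

for my $parent (sort keys %query_count) {
    my $distinct = scalar keys %{ $labels_seen{$parent} };
    my $ratio    = $distinct / $query_count{$parent};
    printf "%-15s queries=%d distinct=%d ratio=%.2f%s\n",
        $parent, $query_count{$parent}, $distinct, $ratio,
        $ratio > 0.8 ? "  <-- possible DNS tunnel" : "";
}

__DATA__
cics.udel.edu
cics.udel.edu
cics.udel.edu
ssn.evil.example
creditcardnumber.evil.example
password.evil.example
exfiltrateddata.evil.example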
With many tools having a very narrow focus and few offering a broad
overview for detecting exfiltration, it appears the next logical step would be to
develop a tool or framework that looks at a broad space with multiple detection
modules that can be expanded.
by the DoD (Department of Defense) named Interrogator [10]. Interrogator is implemented in a way that allows for addition and removal of tools as desired. It is
able to glue together many different tools to achieve an overall picture of a network’s
status. CloudAV [11] is another implementation of a framework bringing together
tools. It combines multiple antivirus engines into to a system that resides in a cloud
to improve virus detection. Interrogator was designed for intrusion detection and
CloudAV is targeting detection of malware. A tool that uses the same principles as
the Interrogator and CloudAV that detects data being exfiltrated would be valuable.
Antivirus products and IDS (intrusion detection systems) use signatures of
malware for detection. Signature-based tools are very effective against malware
that has been seen before, but they are not effective against zero-day threats. The
industry has known that signature-based schemes are not a foolproof solution for
protecting a computer from viruses; however, they are a good and easy way to protect against known threats. Executables, payloads, or exploits can be run through
a program that will return a signature and then distribute the signature in an update to the database. The process of making signatures and distributing them is
simple and works well. However, attackers can get around signature detection by
altering sections of the sample just enough to create a different signature or by using
polymorphic code. The future of detection is in anomaly and behavioral techniques.
These schemes would be customized based on the different traits of the host or the
network. These schemes would be self-learning schemes that would detect unusual
events for a system or network and flag them so it can be decided whether the event is normal
or suspicious. These systems learn the normal and suspicious traffic from training
data given to the system. Detection tools that are signature-based will continue to
be used as a baseline, but behavioral schemes will need to be incorporated also to
detect the more complex attacks.
Chapter 2
INTRODUCTION
2.1 Motivation
Many security tools and strategies emphasize detecting and preventing
attackers from breaking into a system. It is very intuitive that the focus is on intrusion.
Because no intrusion detection system is perfect, a user cannot be content with just
preventing an attacker from breaking into a system. A user has to worry about
an attacker preventing communication with the system or intercepting information
leaving the system and using the information maliciously. If these attacks are ignored, a user could theoretically be secure by preventing an attacker from entering
the system. The problem with this argument is that there are many vulnerabilities
or holes to get into a system. A user must defend against them all to stay secure,
while an attacker only needs one to get into the system. The vulnerability the
attacker uses can be a zero-day vulnerability, which would leave a user with little
chance of protecting against the first time it is used. It is also conceivable that
a vulnerability is pre-installed or designed in, with an attacker simply sending an
otherwise innocuous trigger to activate the exploit.
Many security plans do not include a strategy or the tools to mitigate the
possible risks that appear after an intrusion. It is bad enough that the security plans
do not have techniques to prevent risks such as stealing sensitive information, but
not having a way to detect that the sensitive information is leaving the system is even
worse. An attacker could potentially steal information from a system indefinitely
with no inspection of the outgoing traffic.
The problem is present all throughout society, including organizations as sophisticated as large enterprises and nation states. The data breach of the Joint Strike
Fighter project is an excellent example of an enormous amount of data leaving a network before being noticed. According to [12], attackers were able to exfiltrate several
terabytes of data about the Pentagon’s Joint Strike Fighter project. Heartland is a
company that provides credit card payment processing services. It was reported in early 2009 by
[13] that attackers may have obtained access to more than 100 million credit cards,
and it was later reported by [14] that it was more than 130 million credit cards. It
is easy to see that the organizations that are entrusted with sensitive information
do not have tools and techniques to detect data leaving their networks in a timely
manner.
2.1.1 Scenario of Security Against Infiltration
Let’s look at the general strategy of the security of computer systems and
networks while relating it to security used on a battlefield scenario. Once a group of
soldiers make the decision of where to construct base camp, the first thing they will
do is set up a perimeter. To begin with it may be a couple of soldiers patrolling the
perimeter while the remaining soldiers start the other duties, i.e. set up a command
post, create a strategy to bring in more equipment and troops, and lay out the
blueprint of the camp. During the planning phases and from this point forward the
base will need secure lines of communication within the base and more importantly
with friendly troops outside of the base. The secure communications will allow the
base to communicate with the confidence that the messages it is passing will not be
deciphered by the enemy’s intelligence organization and used against the base.
After the initial blueprint has been completed implementation of the plan will
begin. The soldiers patrolling the perimeter may be given some aid in controlling
their boundaries with a primitive barbed wire fence. The fence serves the purpose of
slowing down ground troops, but not stopping long range attacks or heavy machinery
such as tanks from driving right over it. Space will be left in the barbed wire and
trenches to allow for troops to enter and exit the base. A guard shack will be added
at these entrance points. The guards will not let people enter without the proper
credentials, i.e. a military issued identification card and a record of a soldier’s
orders to enter or exit the base. The military police will be responsible for checking
everything entering the base for suspicious and banned objects. They will also be
responsible for keeping order within the base and preventing any action that could
threaten the base from the inside.
The next step would be to add a large reinforced wall for better protection
from ground troops, stopping soldiers from easily entering on foot, and being a
much larger deterrent to heavy machinery. The base now has a good foundation of
protection against ground troops, but can still be caught off guard when a battle
ensues. The base will build watchtowers along the walls to see in every direction.
They will then set up radar systems to facilitate the monitoring of the area around
them. Finally they will construct more advanced defenses. These defenses will
include turrets and stationary artillery along with antiaircraft guns to protect the
base with higher-powered weapons. The battlefield scenario can be directly related
to the security of a computer system.
2.1.2 System Administrators
Starting from the beginning of the scenario, the first aspect to discuss is the
patrolling troops. The patrolling troops will continually move from place to place
looking for anything suspicious. Their function is very much like that of a system
administrator. The early phase in the construction can be related to turning on a
system for the first time and beginning to configure it. The system administrator is
the main line of defense at this point. Now as the base slowly builds up to its final
design, the patrolling troops will not be as heavily relied upon for the base’s safety.
They will still be needed, but will have much more support from the watchtowers,
radars, and the equipment providing heavier firepower. The system administrator
receives support from the antivirus software, IDS, IPS, and other tools. They
still have to respond to the alerts and logs from these systems.
2.1.3 Encryption
Secure lines of communication at the base and secure communications in
a network can be handled in the same manner and with the same mechanisms.
The idea is that it does not matter if communications are intercepted as long as
an attacker cannot decipher them. A main difference is that the base might care
more about people being able to intercept the message and figure out details such
as the sender, the recipient, when it was sent, and much more. This is because
this information might divulge strategic information, i.e. if one country suddenly
increases communications with another, it may mean they have formed an alliance.
Computer systems however send login credentials or banking information over the
line encrypted while the headers are in the clear. If users were worried about people
knowing with whom they are talking, they would use tools such as Tor [15] to
anonymize themselves. According to [16], there are hundreds of thousands of Tor
users each day. Hundreds of thousands of users would appear to be a lot, but when
compared to the total number of Internet users (1,802,330,457 users [17]) it is at
most only about 0.0554 percent.
2.1.4 Out of the Box Firewall
The barbed wire fence is the next line of defense initiated at the base. The
fence can be correlated to a default firewall being initially installed. The firewall can
provide a strong enough defense to slow down attackers, but can also be subverted.
For instance, if the firewall lets all DNS traffic into the system, an attacker can send
malformed DNS packets that cause unwanted actions on the system, assuming there
is some vulnerability the attacker can exploit.
The barbed wire fence can also prevent or slow down desired activities. Consider the entrances and exits to the base to be the ports that are open on the firewall.
An example of this is a large base that only has an opening at the north end of the
base, but a critical mission that is time sensitive needs to head south immediately.
The mission would have to head north first to exit the base and then head south
wasting time.
The same scenario can be related to the firewall’s actions. A user wants to
log in to a bank’s website to check a bank account. The user will need to send traffic
on port 443 (HTTPS) to be able to securely access the bank account. If the firewall
only allows port 80 (HTTP) to enter and exit the network, the user would be forced
to use HTTP instead of HTTPS, the secure version of HTTP. The user would only
be able to do this if the bank allowed them to communicate insecurely. Preferably
the bank would not allow insecure communications with the user to prevent the user
from passing sensitive information in cleartext over the Internet.
2.1.5 Login Credentials
The guard shack can be related to the oldest and most commonly used form
of security in computer systems. The process of logging in to a system is easily
related to the guard shack. At the guard shack, soldiers would have to show a
military issued identification to verify their identity. The guards would then check
against some orders to verify that these soldiers have the authority to enter or leave
the base. On a computer system a user would first have to log in to the system
using their user name and their login credentials such as a password or biometric
measurement. After the computer has verified the user’s identity, it now needs to
decide whether or not the user has the proper permissions to access this computer.
Enterprise systems face this decision quite often. They have thousands of users
whose credentials can be used on several multipurpose computers, but not all users
will have access to the system with the database of salary information.
2.1.6 Antivirus Software / Malware Detection
Now that there are guarded entrances controlling everyone getting in, there
need to be people responsible for keeping order within the base. The military police
would pick up this responsibility. They would investigate any suspicious activity and
mitigate any threats from inside the base. Antivirus software has responsibilities
that line up nicely with the military police’s responsibilities. The antivirus software
is responsible for scanning the system for any items that could be detrimental to the
system. The antivirus software uses signatures of malicious software to determine
if a computer system is infected by malware. An actively scanning antivirus system
will set off an alarm when a virus is downloaded. Depending on the user defined
options, the antivirus software will either delete the virus or quarantine it. This
could be represented as the military police catching an attacker entering the base
with a concealed bomb and intent to detonate it within the base. The military police
could detain the person or open fire on the attacker if the situation deteriorates. The
antivirus software will also watch the system memory and instruction calls for any
unusual behavior, just as the military police patrol the base for disorderly conduct
amongst the soldiers.
2.1.7 Properly Configured Firewall
The base would eventually build a permanent large reinforced wall to
protect the base. The wall would be well planned to include entrances and exits of
strategic value. The design should be one that provides the base with the greatest
protection possible, but does not hinder the base’s ability to successfully operate.
The reinforced wall is analogous to a firewall that a system
administrator has taken the time to properly configure using the characteristics of the network.
A properly configured firewall will take into account all of the services that
are authorized to run on the network and communicate with systems outside of the
network. The configured firewall will be as specific as only allowing inbound traffic
on port 80 with a known web server as its destination. The more thought put into
configuring the firewall, the better it is able to protect the system from unknown
threats. If the firewall is configured correctly, it will be similar to whitelisting.
It will not be listing all of the unwanted traffic and blocking it, it will be blocking
everything and only allowing whitelisted traffic through it. The whitelisting strategy
is more effective than blacklisting, because the system administrator should know
all of the normal traffic and may not know all of the malicious traffic.
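As a toy illustration of that default-deny idea, the short Perl fragment below allows only explicitly whitelisted protocol/port pairs and drops everything else. It is a conceptual sketch only; real firewalls enforce rules in the kernel or on dedicated hardware, and the specific entries in the whitelist are assumptions made for the example.

#!/usr/bin/perl
use strict;
use warnings;

# Default-deny: anything not explicitly whitelisted is dropped.
my %whitelist = (
    'tcp/80'  => 'inbound HTTP to the known web server',
    'tcp/443' => 'outbound HTTPS',
    'udp/53'  => 'DNS lookups',
);

sub verdict {
    my ($proto, $port) = @_;
    my $key = lc($proto) . "/$port";
    return exists $whitelist{$key}
        ? "ALLOW ($whitelist{$key})"
        : "DROP (not whitelisted)";
}

print "tcp/443:  ", verdict('tcp', 443),  "\n";
print "tcp/6667: ", verdict('tcp', 6667), "\n";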
2.1.8 Intrusion Detection System
An IDS is a system that looks at the network traffic (usually focused on
incoming traffic) and attempts to detect attacks being made over the network.
Snort [18] is a very common open-source implementation of an IDS. An IDS is
similar to an antivirus application in respect to inspecting the system or network
for anything malicious. The system administrator should customize the IDS rules
to the system or network that it is protecting.
A network intrusion detection system (NIDS) only inspects the network traffic and extracts characteristics that will be used to decide whether the traffic is
malicious. A host-based intrusion detection system (HIDS) inspects all of the actions on a system to determine if malicious actions are being performed. An IDS
can implement signature-based detection or anomaly-based detection. Anomaly-based detection inspects the network traffic and compares it to the behavior that is
considered normal for that system or network.
The radar and watchtowers have the same responsibilities as an IDS, although
radars have a much broader scope. Radars watch activities happening outside of
the base also. Most single hosts or small networks do not run intrusion detection
systems due to the time and resources that must be committed. An out of the
box implementation of an IDS will catch a lot of malicious events, but just like a
firewall it will be much more effective if time is spent on configuring it properly for
the network. An IDS can use a lot of system resources if there is a lot of traffic
to be processed and if there are a lot of checks being performed on the traffic. It
may leave the system in a state where it is not practical to use it for day-to-day
activities.
2.1.9 Intrusion Prevention System
An intrusion prevention system, or IPS, monitors traffic for intrusions and
will attempt to mitigate any intrusions automatically. It is basically a reactive IDS.
The advantage to an IPS over an IDS is that it can reject any traffic involved with an
attack at the time of detection, just like the turrets or anti-aircraft guns would do to
attackers who were breaching the base’s perimeter. An IPS will focus on incoming
traffic and its contents. In some cases it will look at outgoing traffic for a signature
match. A downside to an IPS system is that it can treat legitimate traffic as if it
is malicious and drop it. An example of this would be a signature-based detection
rule that is too generic and matches both legitimate and malicious traffic. The IPS
will then react to normal traffic with undesired actions.
2.1.10 Outgoing Traffic
Looking over the battlefield scenario and relating it to the security of a
computer or network illustrates that the layers of each are akin to the other. The
interesting thing to note about both examples is that there are not many precautions being taken to detect or prevent data from being exfiltrated from the base or
the computer system. The firewall will help prevent outgoing traffic that has been
denied in the rules; however, an attacker can use traffic allowed by the firewall to
easily transport data out of the computer. The IDS and IPS can look at outgoing
traffic, but by definition they are more sophisticated in detecting intrusions than
they are in detecting exfiltrations. The system administrators would have to spend
extra time configuring the IDS and IPS to handle the outgoing traffic. The same
situation is prevalent in the battlefield scenario.
2.1.11 Consequences of Encryption
Encryption is the standard for sending sensitive data between systems through
unknown and untrusted systems. It is a great defense against messages being intercepted in between those systems; however, it can easily be used against a system.
The point of encryption is to be able to obfuscate the data for storage or transport
and then be able to return the encrypted data to its original cleartext form. An
attacker can use this obfuscation against a user and encrypt all of the data
sent out of the system, thus making it hard to detect. The outgoing traffic
can be monitored leaving the network, but the user does not know what is in the
payload. The user will be able to observe where the packet claims to be going, how
it is leaving, the amount of data leaving, the timing information, and other miscellaneous header information. The users can use this information to separate out the
traffic they feel comfortable allowing to exit the system and the traffic that looks
suspicious and they would not want leaving their systems. A system administrator
can learn a lot about traffic by knowing if the packets are encrypted in addition
to all of the header information. Analyzing traffic without knowing the data or
payload of the packets can be counterintuitive for system administrators, but it will
be shown in this thesis that it is very effective.
2.2 Goals
Too many small businesses, large businesses, government organizations, and
average computer users in society are losing sensitive information each day. Computer and network security needs more work in the area of detecting attackers exfiltrating data from victims’ computers. Richard Bejtlich described a good first step
on his blog called TaoSecurity. He is the Director of Incident Response at General
Electric. “[D]evelop tools and techniques to describe what is happening on the network. ... Without understanding what is happening, we can’t decide if the activity
is normal, suspicious, or malicious.” [19] Creating tools that help a user or system
administrator understand the network’s ground truth would be invaluable. System
administrators can look at IDS and firewall logs all day, but if they do not understand what their network should be doing then they can misdiagnose the meaning
and importance of each message. The first goal of this thesis is to create tools to
help users and system administrators to better understand their network. The tools
developed will be discussed in Chapter 3.
System administrators and users who are comfortable with looking through
logs and reacting to them appropriately can still use a lot of resources looking
through the logs. The time can be wasted on many simple decisions that build
off of each other. The simple decisions are usually cut and dried and should be
automated. The next goal of this thesis is to accumulate all of these simple decisions
and automate them for the administrators. The results should be in a format that
will allow the administrator to quickly inspect and observe if there is traffic that
needs some attention. It will show the results both in formatted text that can be parsed and
graphically for a fast inspection. The tool will only handle outgoing traffic. It will
detect exfiltration, but will not attempt to prevent it in any manner.
ExFILD is the tool that is developed in this thesis. The development of
ExFILD will show that entropy calculations of a packet or session’s payload can
be used to characterize the payload’s observed state of encryption. It will focus on
decisions that follow from the determination of whether traffic is encrypted and
if the traffic was expected to be encrypted. The design of ExFILD will be described
in Chapter 4 and its results in Chapter 5.
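As a preview of the entropy calculation that Chapter 4 builds on, the short Perl sketch below computes the Shannon entropy of a payload in bits per byte. It is a minimal illustration rather than ExFILD's actual implementation; the sample strings are made up, and real traffic requires the scaling by payload size discussed in Chapter 4 before the values can be compared against a threshold.

#!/usr/bin/perl
use strict;
use warnings;

# Shannon entropy of a payload in bits per byte. Values near 8 suggest
# encrypted or compressed data; plaintext protocols typically score lower.
sub shannon_entropy {
    my ($payload) = @_;
    my $len = length $payload;
    return 0 unless $len;

    my %count;
    $count{$_}++ for split //, $payload;

    my $entropy = 0;
    for my $c (values %count) {
        my $p = $c / $len;
        $entropy -= $p * log($p) / log(2);
    }
    return $entropy;
}

# Made-up sample payloads: a cleartext HTTP request versus random bytes
# standing in for ciphertext.
my $cleartext = "GET /index.html HTTP/1.1\r\nHost: www.example.com\r\n\r\n";
my $random    = join '', map { chr int rand 256 } 1 .. 1024;

printf "cleartext entropy: %.3f bits/byte\n", shannon_entropy($cleartext);
printf "random entropy:    %.3f bits/byte\n", shannon_entropy($random);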
2.3 Related Work
Research has been performed in the area of detecting data exfiltration.
Researchers from the University of California at Berkeley have created a system
named Glavlit [20] to prevent data from being exfiltrated. Their system uses a
whitelisting approach. The system has two main parts, referred to
as a guard system and a warden system. The guard will not let any traffic out of
the network without being approved by the warden. The warden is in charge of
approving all of the traffic. The approval process can be automated with content
matching or system administrators can manually approve each piece of data by
hand.
Researchers from University of California at Davis and Sandia National Laboratories developed a framework to detect data exfiltration called SIDD, Sensitive
Information Dissemination Detection [21]. SIDD has three major components to
it: application identification, content detection, and covert channel detection. The
components are inline and each component can cause an object to exit the chain
and assign some action to be taken. The application identification component tries
to determine the application of the traffic and then it uses a policy to determine if it
should be allowed. The content detection component checks traffic for data that has
been labeled as sensitive. The searching for the sensitive data is signature-based.
The last component handles covert channel detection. The covert channel detector
focuses on digital audio channels. Steganalysis described in [21] is used to generate
characteristics and decide whether there is a covert channel. The work presented in
this thesis will be complementary to the efforts described in this section.
Chapter 3
NETWORK TOOLS
System administrators use a surprisingly large number of tools in a day to
monitor and diagnose the computer systems and networks under their responsibility.
The number of tools can be attributed to the number of different tasks they perform
and the different types of systems they maintain. A system administrator could
be in charge of setting up a printer server, debugging a DNS server, or diagnosing
if a recent failure of a web server was due to a hardware failure or a successful
attack. They could also be dealing with an Apple computer used by marketing, a
Windows computer used by an engineer, and a UNIX system used as an IDS. System
administrators will use tools that they feel comfortable with, but they can run into
the problem of not having exactly what they need. They will be forced to find or
develop another tool. Their toolbox will continually grow as they progress through
their career. Understanding and maintaining a network would be an unmanageable
task without the information obtained from these tools.
Chapter 3 will discuss tools that have been developed during the data exfiltration research for this thesis. The tools are not necessarily novel or unique, but
their functions are crucial to understanding and diagnosing a network, specifically
the outgoing traffic and watching for data being exfiltrated. The tools have all been
developed in Perl. The choice was made to develop them in Perl to make them easily
customizable for special circumstances. As mentioned earlier, system administrators
may use a tool for everything except one instance where a critical feature is not
supported. Instead of switching to another tool to obtain a feature, the Perl script
can be altered to add the feature and the administrator only needs to keep one tool
instead of multiple. Many of the fundamental ideas behind these tools are combined
to be used as the underlying framework for ExFILD. There are additional tools that
are not discussed in Chapter 3, but will be discussed in detail in Chapter 4.
3.1 Network Sniffer
The most important tool for a network analyst is a network sniffer. Tcpdump
[22] and Wireshark [23] are two of the most common network sniffers. Tcpdump is a
command line tool, while Wireshark is a GUI (graphical user interface) based tool.
A network sniffer taps into a network interface and extracts all of the packets. The
packet information will be outputted to the standard output, a GUI, or a file to be
inspected later. The sniffer will decode the different fields of a packet such as source
and destination IP addresses and ports, packet length, data, and the rest of the
header information from each of the layers. Network sniffers can be very basic and
only decode through the transport layer protocols (TCP and UDP), while others
such as Wireshark have modules that are capable of decoding the application layer
protocols. Sniffers can also perform tasks on packet captures that have been saved
to a file. The files are in the pcap format and allow a user to look at the traffic
offline.
The sniffer developed during this research is a simple one. The sniffer can
capture traffic live from a network device or read the traffic from a pcap file. The
header information will be printed out for each layer. Packets are handled at the
data link, network, and transport layers. TCP, UDP, ICMP, and IGMP are the
transport layer protocols that are supported by the sniffer. The sniffer will print out the information as far up the layers as
it can decode and then it will store the rest of the data in a variable. The variable
containing the data and the options fields is not printed out by default due to the
fields having non-printable ASCII values. The program will print out these fields if
the --nonPrintableAscii flag is set on the command line. Each layer’s information is
printed out on a new line. Each packet is separated from the other packets by a line
of ∼’s. The network sniffer will output the results of the program to the standard
output unless the --silent flag is given. If the --silent flag is passed to the program,
the results will be outputted to a file. For live captures the file will be located
at Logs/traffic_PID_TIME.log, where PID and TIME are the program ID and the
current time in seconds since the epoch, respectively. The file will be located at
Logs/PCAP_traffic.log when a pcap file is inputted, where PCAP is the name of the
pcap file. Live captures will also dump the packets into a pcap file. The file will be
located at Logs/traffic_PID_TIME.pcap.
Figure 3.1: Example Output from the Network Sniffer
A sample packet is shown in Figure 3.1. In the Network Sniffer’s output, the
tabbed lines in Figure 3.1 are part of the previous lines, but have been separated
strictly for display purposes. The first line represents the data link layer, the second
line represents the network layer, and the third line represents the transport layer.
The protocol being used for each layer is stated by the first word on each line. In
Figure 3.1, the data link layer is using Ethernet, the network layer is using IP, and
the transport layer is using TCP. The source and destination fields of the different
layers are represented by source -> destination. Depending on the layer, the source
and destination fields will be of different types. The data link layer will have the
type of MAC address, the network layer will have the type of IP address, and the
transport layer will have the type of port number. Each of the header fields following
the source and destination fields is displayed with its name and the value it holds.
Not all of the header fields will display a value. For example, the options field of an
IP header is empty when no options are set.
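To make the description above concrete, the following is a minimal sketch of the same idea: it reads a pcap file and prints one line per decoded layer. It assumes the Net::Pcap and NetPacket CPAN modules and is only illustrative; the actual sniffer's option handling, live capture mode, log files, and ICMP/IGMP decoders are omitted.

#!/usr/bin/perl
# Minimal pcap-reading sketch (illustrative only): print one line per layer
# for Ethernet, IP, and TCP, separating packets with a line of ~'s.
use strict;
use warnings;
use Net::Pcap;
use NetPacket::Ethernet;
use NetPacket::IP;
use NetPacket::TCP;

my $file = shift or die "usage: $0 capture.pcap\n";
my $err;
my $pcap = Net::Pcap::open_offline($file, \$err) or die "pcap error: $err\n";

Net::Pcap::loop($pcap, -1, \&print_packet, '');
Net::Pcap::close($pcap);

sub print_packet {
    my ($user_data, $header, $raw) = @_;
    my $eth = NetPacket::Ethernet->decode($raw);
    printf "Ethernet %s -> %s\n", $eth->{src_mac}, $eth->{dest_mac};

    if ($eth->{type} == 0x0800) {                    # IPv4
        my $ip = NetPacket::IP->decode($eth->{data});
        printf "\tIP %s -> %s len: %d\n", $ip->{src_ip}, $ip->{dest_ip}, $ip->{len};

        if ($ip->{proto} == 6) {                     # TCP
            my $tcp = NetPacket::TCP->decode($ip->{data});
            printf "\t\tTCP %d -> %d\n", $tcp->{src_port}, $tcp->{dest_port};
        }
    }
    print '~' x 60, "\n";                            # separator between packets
}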
3.2 Corporate Watcher
Corporations and government organizations have to take data exfiltration
very seriously. The corporations need to prevent proprietary information from falling
into the hands of a competitor or an attacker looking to exploit the personal information of their employees or customers. A corporation that is concerned with this
should check outgoing communication for undesired traffic. Undesired traffic could
be defined in many ways depending on the type of goods and services provided by a
corporation. For example, a company that creates classified technical documents for
a government agency would want to know all of the documents with file extensions
of .doc or .pdf leaving the network. A financial company would want to check for
emails containing their clients’ credit card numbers, social security numbers, and
other personal information.
A big threat for corporations is their own employees, also known as an insider
threat. A lot of responsibility is put onto employees to remain loyal to a company and
to not betray the company’s trust. Corporations have to be aware of the employees
with malicious intent and the ones that are careless. An employee with malicious
intent could be a disgruntled employee or an employee who was recently laid off or
fired and wants to get revenge on the company. It could also be an employee being
paid to retrieve information from the company for an outside party. The careless
employee could accidentally send an email with important proprietary information
Figure 3.2: Example Output from the Corporate Watcher Program (Simple)
to a person outside of the company or even worse a mailing list with multiple people
outside of the company.
Entities need a simple detection module to know when specified strings, such
as English words or file extensions, are leaving the network. A tool was developed
to search the outgoing traffic to see if certain words or phrases are contained in the
payload. The tool will take a file containing the strings to match against the traffic.
Any matches will be flagged for later inspection. The file will be created by the
system administrator to represent the information that should not be leaving the
network. The matching mechanism used is regular expressions.
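The matching step itself is small. The sketch below shows the general idea in Perl, assuming the payload has already been extracted from a packet; the file handling, the example patterns, and the reporting are illustrative rather than the exact implementation of the tool.

# Illustrative sketch of the Corporate Watcher matching step.
use strict;
use warnings;

# Read one regular expression per line from the file of censored strings.
sub load_censored {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!\n";
    chomp(my @patterns = <$fh>);
    close $fh;
    return grep { length } @patterns;                # skip blank lines
}

# Return the patterns that match a packet's payload, if any.
sub check_payload {
    my ($payload, @patterns) = @_;
    return grep { $payload =~ /$_/ } @patterns;
}

# Example with the patterns discussed in the text:
my @patterns = ('\.pdf', 'html');
my @hits = check_payload('GET /report.pdf HTTP/1.1 ... html ...', @patterns);
print "Flagged: @hits\n" if @hits;

In the full tool the patterns come from the administrator-maintained file, and each match is written out together with the packet's addresses and ports, as shown in the sample output below.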
A sample output is shown in Figure 3.2. The sample is a result of a system
administrator finding a PDF file being served by a web server. The administrator
would flag PDF documents by adding “.pdf” to the file of sensitive strings and
“html” to see the files being served by a web server. Every packet with .pdf or
html in its payload will be flagged. The administrator can then search the results
of the scan for the desired set of flagged words. Each line represents a packet that
has flagged words in the payload of the data. The censored words that have been
matched in the payload are listed at the beginning of the line. Commas separate
multiple matches in the results. The middle of the line displays the source and
destination IP addresses of the flagged packet and their respective ports. The words
that caused the packet to be flagged are listed at the end of the line, with any
multiples separated with a comma. The first line shows a packet that contains the
flagged strings of .pdf and html, which can represent a packet from a web server
Figure 3.3: Example Output from the Corporate Watcher Program (Complex)
serving a file of an extension .pdf. The second and third lines represent packets
that could be originating from a web server. The fourth line represents a packet
that contains the string .doc, which could represent a packet involved with serving
a word document.
The regular expressions used for matching can be more complex than the previous example matching against just text. The sample output displayed in Figure 3.3
is the result of matching against a more complex regular expression. The regular
expression used is \[1-3][0-9][0-9] Evans Hall\ and can be seen in the Matched field.
The regular expression is used to look for any packets that mention a room in Evans
Hall, in this case any Evans Hall room numbers between 100 and 399. The packets
shown are flagged because they contain the phrase “140 Evans Hall” in the payload.
The regular expression can be as complex as necessary. The regular expressions
could be for file extensions, the magic number of files, social security numbers, etc.
The administrators just need to add the part of the regular expressions between the
\’s to the file of censored expressions.
3.3 Network Top
A system administrator interested in outgoing traffic would want to know
where the traffic is destined and how much data is being sent. A tool that is similar
to the UNIX top command would be a useful resource for an administrator. The
program would show the outgoing traffic stats in real-time. The network top program
was developed in two parts, which were then combined.
Searching through pcap files while responding to an incident can be tedious
for system administrators. The logs will be full of IP addresses, which will have little
meaning to the administrator. Administrators will recognize a few IP addresses that
belong to their network or to well known organizations such as Google or Microsoft.
The first step an administrator will perform while investigating suspicious traffic is to
determine the destination of the traffic. The administrator could then run the whois
command against the IP address to gather some information on the organization
that administers the IP address. The network top program uses the DNS name to
attempt to gather this information. The network top program attempts to resolve
the DNS name of the IP address. If the DNS name can not be resolved, it will
be displayed as “N/A.” Investigating the IP address 128.175.13.63 shows a good
example of the problem being solved by the network top program. An administrator
may not recognize the IP address, but they will recognize www.udel.edu, the DNS
name that is resolved from 128.175.13.63. Another example is 64.12.26.59, which
resolves to bos-m043a-sdr4.blue.aol.com. The administrator may not know the exact
function of 64.12.26.59, but knowing AOL maintains it will bring some meaning to
the IP address. If the administrator trusts traffic going to AOL, any warnings related
to 64.12.26.59 could be ignored.
The amount of outgoing data is another useful piece of information to system
administrators. An administrator can look at the IP addresses that have the most
data being sent to it and work down the list. The administrator will ignore any of
the IP addresses that are justified in receiving that amount of data. For example, an
administrator with many Gmail users would be comfortable with a lot of outgoing
data destined to IP addresses administered by Google. The administrator may not
make it to the bottom of the list every time the logs are checked, but the risk of
losing data is less when the amount of data destined to an IP address is small.
Administrators may set a threshold of data leaving in a given amount of time that
will be tolerated. The threshold would have to be the result of a risk analysis
performed using the company’s policies.
The administrators should whitelist the known IP addresses, so they can
quickly skip to an unknown IP address. The program has a built-in feature to facilitate the process of skipping known IP addresses. The program’s -a command line
option will take a tab-delimited file containing IP addresses and corresponding labels. The labels can be anything as long as they contain no white space. The administrator
can use the labels to characterize the IP addresses as either allowed or suspicious.
The labels can also be used to describe the function of the IP address, such as University_Of_Delaware_Webserver. The matching of the IP address against those in
the input file is performed using regular expressions. Using regular expressions as
the matching mechanism allows for partial matching. An administrator may want
to label the entire range of 64.233.160.0 - 64.233.160.254 as Google. The administrator would add an entry with the IP address as “64.233.160.” and the description
of Google in the file to label the entire range of 64.233.160.0/24.
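As a rough sketch of the two lookups just described, the Perl below resolves a destination IP address to a DNS name and matches it against a file of labeled addresses. The tab-delimited file format follows the description above, while the function names and example addresses are only illustrative.

# Illustrative sketch: reverse-DNS lookup and label matching for one IP address.
use strict;
use warnings;
use Socket qw(inet_aton AF_INET);

sub resolve_name {
    my ($ip) = @_;
    my $name = gethostbyaddr(inet_aton($ip), AF_INET);
    return defined $name ? $name : 'N/A';
}

# Load "pattern<TAB>label" pairs, e.g. "64.233.160.<TAB>Google".
sub load_labels {
    my ($path) = @_;
    my @labels;
    open my $fh, '<', $path or die "cannot open $path: $!\n";
    while (my $line = <$fh>) {
        chomp $line;
        my ($pattern, $label) = split /\t/, $line, 2;
        push @labels, [$pattern, $label] if defined $label;
    }
    close $fh;
    return @labels;
}

sub label_for {
    my ($ip, @labels) = @_;
    for my $entry (@labels) {
        return $entry->[1] if $ip =~ /$entry->[0]/;   # regex allows partial matches
    }
    return '';
}

print resolve_name('128.175.13.63'), "\n";            # www.udel.edu, if the PTR record resolves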
The -a option can be very convenient for a system administrator; however,
it will lower the program’s performance. An administrator can run the program
during a meeting or at the end of the day to be inspected later if the increased
run time becomes a nuisance. Combining multiple IP addresses into a range and
labeling them together can also improve performance. It will reduce the number of
checks performed on each IP address against those in the file of labels.
The network top program can be given a command line option for a network
device or a pcap file. Giving the network device as an option will display the
outgoing traffic to the console in real-time until the program is terminated. It will
also store the results to a file named “outbound_ips_PID_TIME.log”. It will sort the
list by the destination IP addresses with the most outgoing data at the top. The
network top program can be resource intensive just like the UNIX top program, so
an administrator may not want to run it constantly. The option of giving the pcap
file will allow an administrator to inspect previous network captures that have set
Figure 3.4: Example Output from the Network Top Program
off flags for the large amounts of outgoing traffic. The results will be outputted to
a file with the name of the given pcap file concatenated with “_outbound_ips.log”.
Sample output can be seen in Figure 3.4. The results are tab-delimited with
each unique destination IP address on a separate line. The first entry of the line
is the IP address in which the outgoing traffic is destined. The second entry is
the number of bytes that have been sent in the IP packets’ payloads. A design
choice was made to not decode the packet past IP to the transport layer to help
with performance. If the administrator is looking at the results in relation to other
entries, this choice should not affect the process of searching for suspicious outgoing
traffic. The third entry will have the successfully resolved DNS name or a “N/A”
designation. The last entry will be blank or will have the description from the file
given with the -a option if there is a match. The last column will not be present if
the -a option is not given.
The sample output in Figure 3.4 displays the 10 destination IP addresses
receiving the most data from the data set explained in Section 5.1.1. Inspecting
these samples, an administrator can quickly see that the user is banking with ING,
using Live Mesh to back up some files, using a service from Google, watching a
video on YouTube, communicating with a host from the University of Delaware,
and sending data to 10.0.0.54. Assuming that the administrator is comfortable
with the system sending data to the labeled IP addresses, it leaves one suspicious
IP address to investigate further. The administrator could quickly ignore the IP
address if the 10.0.0.0/8 subnet is implemented within the internal network, but
it will require attention if it is not an internal IP address. Note the DNS name
for 64.249.81.83 is lga15s01-in-f83.1e100.net. The DNS name could concern the
administrator, because it does not give any clues to what organization administers
that IP address. However, a whois command will quickly inform the administrator
that Google maintains the IP address. The administrator can add a label to the IP
address so it can be quickly recognized next time it appears.
3.4 DNS Extractor
Another way for an administrator to find where the data is going is to inspect
the DNS queries and answers. It is rare that users will remember IP addresses to
systems on the Internet. If the IP address has a DNS name, the user will use the DNS
name instead of attempting to remember the IP address. The DNS name will need
to be resolved by the user’s computer. A packet capture of the computer’s network
traffic will contain all of the DNS queries and DNS answers for the computer. An
administrator can use these queries and answers to get a summary of what users are
doing over the network. The DNS Extractor will take a pcap file as an input. All of
the DNS answers will be extracted from the pcap. The information in the answer
will then be printed out to a file.
Sample output from the DNS extractor program is shown in Figure 3.5. The
output is displayed in a fashion very similar to a sentence written in English, so it
is very intuitive. Each line of the output represents an entry from a DNS answer.
The first IP address is the system that sent the DNS query, while the second is
the DNS server that answered the query. The IP address is being requested for the
first DNS name in the line. The last part of the line can be a DNS name or an IP
address. The reason for receiving another DNS name is that there is a CNAME for
Figure 3.5: Example Output from the DNS Extractor Program
that particular DNS name. It will then try to resolve the CNAME to an IP address.
The process will continue until the IP address is already known (currently stored in
the cache) or the name has been resolved.
If there is an organization wide security policy of not using social networks
on the computer systems, an administrator can run the tool and grep the results
for phrases such as “facebook” or “twitter” to find the users that are breaking the
policy and putting the organization at risk. Many botnets use obscure DNS names
for their command and control servers. The DNS Extractor can be used to search
for these obscure host names, and if they are found there is a good chance that the
host is infected.
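A hedged sketch of the decoding step is shown below using the Net::DNS CPAN module. It assumes the DNS reply's UDP payload and the two IP addresses have already been pulled out of the pcap (for example with the modules used in the sniffer sketch in Section 3.1), and the output wording only approximates the tool's format.

# Illustrative sketch: print the answers contained in one DNS reply.
use strict;
use warnings;
use Net::DNS::Packet;

sub print_answers {
    my ($client_ip, $server_ip, $udp_payload) = @_;
    my $reply = Net::DNS::Packet->new(\$udp_payload) or return;
    for my $rr ($reply->answer) {
        if ($rr->type eq 'A') {
            printf "%s asked %s for %s and received %s\n",
                $client_ip, $server_ip, $rr->name, $rr->address;
        }
        elsif ($rr->type eq 'CNAME') {
            printf "%s asked %s for %s and received the CNAME %s\n",
                $client_ip, $server_ip, $rr->name, $rr->cname;
        }
    }
}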
3.5 Session Extractor
A session can be described as a persistent connection between two hosts.
Sessions are defined by the two hosts’ addresses; whether they are MAC addresses,
IP addresses, or IP addresses with their respective ports depends on the type of traffic.
They can be thought of as a conversation between two people, where each word is
a packet and the complete conversation is a session. Two hosts can have multiple
sessions between them at any given time. Multiple conversations, such as speaking,
emailing, and writing notes to each other would be the same as two hosts having
multiple sessions with each other. A person speaking to two people at once is the
same as a host having a session with two different hosts.
Figure 3.6: Example Output from the Session Extractor Program
The combination of the source and destination addresses can give enough
information to decode the application of the session, and it can facilitate the system
administrator in debugging the outgoing network traffic. A common example of
determining the application of a session can be found in a session between an end
user browsing the Internet and a web server. The session would look similar to
123.45.67.89:12345 -> 98.76.54.32:80. The system administrator can look at the
session information to see a destination of port 80 and deduce that it is web traffic.
Further investigation into the IP address of the host with port 80 would reveal the
web site serving the traffic and what organization is administering it.
The Session Extractor takes a pcap file for its input. The host’s IP address
will be taken as a command line argument. If a host is not passed as an option the
program will attempt to find the appropriate host. It does this by searching for the
most common host in the pcap file, and assumes the administrator is interested in
this host. The sessions with outgoing traffic from the host will be extracted from
the pcap file. The results will be outputted to a file with the same file name as
the given pcap file with .log concatenated to the end. Only unique sessions will be
outputted to the results file. A sample of the output from the Session Extractor is
shown in Figure 3.6. The left side of the -> is the given host and its port, while
the right side is the destination host and the port it is utilizing for the session. An
administrator can now parse the files for the destination IP address to be able to
inspect the sessions and decode its function.
3.6 IP Helper
The IP Helper program is a simple GUI program that helps with the process
of creating whitelists and blacklists for a network. A screenshot of the GUI is
shown in Figure 3.7. It was created as a helper program for ExFILD. ExFILD uses
whitelists and blacklists in the decision-making processes. ExFILD will also output
the IP addresses it processes to a file to be used by the IP Helper program to create
the lists. The IP Helper program will open a file located at ../Config/ipList. It
will sort and remove the duplicate IP addresses at the beginning of the program.
It will then iterate through the IP addresses displaying the resolved DNS name (if
a DNS name could successfully be resolved) and the location of the IP address.
The location of the IP address is found using the Geo::IP [24] Perl module and the
GeoLite City [25] and GeoLite Country [26] databases. The databases contain the
information that allows IP addresses to be mapped to a geographical region.
The administrator will use the information to make the decision of how to
handle the IP address. The program has 6 buttons in total. Two of them are for
starting and exiting the program. The other four are the choices of how the IP
address can be handled. They are the whitelist, blacklist, neither, and ignore for
now buttons. The whitelist and blacklist buttons will add the IP address to the
respective files located at ../Config/whiteList and ../Config/blackList. The neither
button will add the IP address to a file located at ../Config/ignoreList.
The ignoreList file is only used for the IP Helper program. The IP addresses
are read from the ignore list and excluded from the list of IP addresses shown to
the administrator during the process. This functionality was added for when an
administrator does not want to add the IP address to the whitelist or blacklist,
but also does not want it to be displayed every time the IP Helper program is run. The
administrator would use this button to command the program to ignore this IP
address in the future. The ignore for now button is for the instances when the
Figure 3.7: GUI for the IP Helper Program
administrator is not sure what should be done with an IP address and wants it to
be shown next time the program is run. When exiting the program, the IP addresses
that have yet to be addressed by the administrator will be outputted back into the
ipList file. Because all of the duplicates were removed at the beginning of the program,
an administrator can simply run the program and then exit to shrink the ipList file
when it starts getting too large.
Chapter 4
EXFILD DESIGN AND IMPLEMENTATION
ExFILD was designed to facilitate the detection of data being exfiltrated.
It is designed to only inspect the outgoing traffic of a host. It will automate the
decisions usually made by a system administrator while searching for suspicious
traffic leaving a computer system. The foundation of ExFILD is based on the
encryption characteristics of the packets and sessions. Features will be extracted
from the network traffic during processing to be used as the input of the decision-making tree, which will be referred to as the tree in the rest of this thesis. The
processing of the network traffic will be discussed in Section 4.1 and the tree will be
discussed in Section 4.2.
4.1 Packet and Session Processing
The network traffic is processed two different times, once at the packet level
and another time at the session level. The flow for the processing of the packets
is shown in Figure 4.1 and the flow for the processing of the sessions is shown in
Figure 4.2. The processing of the packets will occur first, and the processing of the
sessions will occur after all of the packets have been processed. The two processes
are almost identical. The main differences are the inputs taken and the first two
stages in the processing of the packets. The session processing does not require
the first stage, because the processing of the packet handles the decoding and then
extracts the sessions. The individual steps of the processes will be explained in the
following sub-sections, except for the tree. The tree will be explained on its own in
Section 4.2.
Figure 4.1: The Flow of the Processing of Packets
Figure 4.2: The Flow of the Processing of Sessions
4.1.1 Packet Decoding
The decoding process is responsible for extracting key features from the pack-
ets. It supports the data link, network, and transport layer protocols. It supports
Ethernet in the data link layer, IP in the network layer, and TCP, UDP, ICMP,
and IGMP in the transport layer. The decoder does not handle the application
layer; however, it will attempt to decode the application protocol using the destination port and even the source port if needed. Many application layer protocols
are assigned to specific ports by IANA (Internet Assigned Numbers Authority) [27],
which makes it possible to match the two ports to a standard protocol. For the ports
that are not assigned to a specific application layer protocol, an administrator will
have to investigate the traffic to determine if it is a constant port for an application
layer protocol or just a randomly negotiated port. The administrator can then add
functionality for any new application layer protocols that are not yet supported.
All of the stages rely on the decoder to provide features that are extracted
from the packets and sessions. When the packet has been decoded as far up the layers
as possible, it will extract the information from the current layer. The information
extracted will be the protocol itself, the session identifier (explained in detail in
Section 4.1.2), the packet count of the session in which it belongs, the data, and the
destination IP address. The different stages of the process will use these features
to perform their functions. They may need more features such as the data length,
but the other features can be derived as needed from the features that were passed
to them. The decoder will also extract some other features for strictly statistical
purposes that are outputted to a file with the same name as the pcap file and .stats
concatenated to the end as the file extension.
4.1.2 Extract Sessions
The extracting sessions stage combines all of the packets in the same session
into one piece of data. The packets are grouped together using the session identifier
provided by the packet decoder. The session identifier consists of a string containing
the source and destination IP addresses and their respective ports. The format of the
session identifier is SRC_IP:SRC_PORT->DEST_IP:DEST_PORT, where SRC_IP
and DEST_IP are the source and destination IP addresses and SRC_PORT and
DEST_PORT are the source and destination ports. Not all session identifiers use
this format, but it is the most common session identifier due to most of the packets
in a network being TCP or UDP packets. IP packets that are not TCP or UDP have
a session identifier with a format of SRC_IP->DEST_IP. Ethernet packets will have
a session identifier with the format SRC_MAC->DEST_MAC, where SRC_MAC is
the source MAC address and DEST_MAC is the destination MAC address. ARP
packets will have a session identifier with the format of SHA:SPA->THA:TPA,
where SHA is the source hardware address, SPA is the source protocol address,
THA is the target hardware address, and TPA is the target protocol address. The
session extractor will attach the protocol, the accumulation of the data in all of the
packets from the session, and the number of packets in the sessions to the session
identifier to form a tuple to be processed. These sessions will be processed after the
processing of the packets is complete.
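A small sketch of how such identifiers can be built and used to accumulate a session is shown below; the hash keys follow the NetPacket field names used in the sniffer sketch, and the surrounding decode logic is assumed.

# Illustrative sketch: build a session identifier and accumulate session data.
use strict;
use warnings;

sub session_id {
    my ($pkt) = @_;
    if (defined $pkt->{src_port}) {                  # TCP or UDP
        return sprintf '%s:%d->%s:%d',
            $pkt->{src_ip}, $pkt->{src_port}, $pkt->{dest_ip}, $pkt->{dest_port};
    }
    if (defined $pkt->{src_ip}) {                    # other IP traffic
        return sprintf '%s->%s', $pkt->{src_ip}, $pkt->{dest_ip};
    }
    return sprintf '%s->%s', $pkt->{src_mac}, $pkt->{dest_mac};   # bare Ethernet
}

my %sessions;
sub add_to_session {
    my ($pkt, $protocol, $payload) = @_;
    my $id = session_id($pkt);
    my $s  = $sessions{$id} ||= { protocol => $protocol, data => '', packets => 0 };
    $s->{data}    .= $payload;                       # accumulate the session's payload
    $s->{packets} += 1;                              # count packets in the session
}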
Session identifiers are not far from being unique, but there is a flaw to using
these session identifiers. Using session identifiers of this form without any discrimination in the time domain allows for the possibility of two separate sessions being
grouped together. The probability of two sessions being grouped together at the
transport layer is low due to the client ports being incremented or randomized with every session. The session identifiers for IP packets with no transport
layer information are more likely to characterize multiple sessions between two hosts
as one session. Ethernet packets will also have the same problem as the IP packets,
as the same session will characterize any Ethernet packet between the same two
hosts.
4.1.3 Entropy Calculation
Data encrypted through good encryption algorithms should look random
when observed at the bit level. Theoretically the data should be completely random
so that a person should not be able to find any patterns that disseminate information
about the data that is encrypted. Entropy is used to characterize the randomness
of data. A string of all the same characters will give an entropy value of 0. The
entropy will be higher as the string becomes more random. Entropy is used to detect
whether a packet is encrypted from the randomness measurement. Good encryption
will produce very random sets of data, which means that its entropy will be higher.
A good demonstration of this is obtained by looking at the payload of a HTTPS
packet and comparing it to the payload of a HTTP packet, which is discussed more
in Section 4.1.5 and is plotted in Figure 4.7.
-\sum_{i=1}^{N} p(x_i)\,\log_2 p(x_i) \qquad (4.1)
Equation 4.1 is used to calculate the entropy of data (in our case a packet’s
payload), which was derived by C.E. Shannon in [28]. It is the sum over all possible
combinations of the probability of each time a combination occurs multiplied by the
log2 of the probability of the combination occurring. Conceptually, it will produce
a histogram of the possible bit combinations. If the distribution of the possible
combinations is even across all of the combinations, then it will have a high entropy
value. Conversely, an uneven distribution of the possible combinations will result in
a lower entropy value.
N is the number of possible combinations. N is 256 for this thesis due to the
code inspecting 8 bits at a time, meaning there are 2^8 or 256 possible combinations.
The calculation is done 8 bits, or one byte, at a time, because most computers
communicate in multiples of bytes. Also, human readable text is common in computer communications, with each character being represented by an 8-bit code known
as ASCII. The entropy calculation can range from 0 to 8. Let’s look at the example
packet with “AAAA” as its payload. The entropy calculation for the bit combination that represents ‘A’ will have the probability of 1 and log2 of the probability
will be 0. This will give an entropy value of 0 for that combination. The rest of the
combinations will have a probability of 0 meaning their respective entropy calculations will be 0. Summing the results of these entropy calculations will be 0. Now
let’s look at the opposite side of the spectrum. Assume that the packet payload
contains only one of each of the 256 ASCII characters. Every possible combination
will have an equal probability of occurring with a probability of 1/256. The result
of the log calculation will be -8. Summing all of the combinations’ calculations will
result in 8.
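As a minimal Perl sketch of Equation 4.1 applied one byte at a time (so N = 256), the routine below reproduces the two extremes just described; it is illustrative only and not ExFILD's exact implementation.

# Illustrative byte-level Shannon entropy of a payload (Equation 4.1).
use strict;
use warnings;

sub byte_entropy {
    my ($payload) = @_;
    my $len = length $payload;
    return 0 if $len == 0;

    my %count;
    $count{$_}++ for unpack 'C*', $payload;          # histogram of byte values

    my $entropy = 0;
    for my $occurrences (values %count) {
        my $p = $occurrences / $len;
        $entropy -= $p * log($p) / log(2);           # log2(p) = ln(p)/ln(2)
    }
    return $entropy;
}

print byte_entropy('AAAA'), "\n";                            # 0
print byte_entropy(join '', map { chr } 0 .. 255), "\n";     # 8, up to rounding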
Figure 4.3 shows the maximum entropy versus the size of a packet’s payload.
It is observed that the maximum entropy value is not 8 between the sizes of 0 and 256
bytes. The maximum entropy value grows as the number of possible combinations
increases forming a steep slope until it reaches the size of 256 bytes. The slope is
an artifact of the equation for Shannon’s entropy. Since the algorithm looks at 8
bits at a time, which makes 256 possible combinations, each of those combinations
must each occur at least once and the same number of times as all of the other
combinations for the entropy to be 8. If a payload has a size of 2 bytes, two of the
combinations will have a probability of 1/2, but the other 254 combinations will
have a probability of 0, which will result in an entropy result of 1 not 8. As the
size of the packet grows from 0 to 256, the entropy values approach 8, which is the
asymptote or the maximum possible entropy for this calculation.
Figure 4.3: Maximum and Minimum Entropy vs. Size of a Packet’s Payload
Figure 4.4: Maximum Entropy vs. Size of a Packet’s Payload
Another interesting artifact of Equation 4.1 is displayed in Figure 4.4. Intuition would lead a person to believe that payloads of size greater than 256 bytes
would have a maximum possible entropy value of 8. The dips in the calculated entropy disprove the assumption made from one’s intuition. The dips can be explained
succinctly with an example. Consider the scenario of a 257-byte payload, where each
possible combination occurs once, except the bit combination of “00000000” occurs
twice. The bit combination of “00000000” will have a probability of 2/257, while
all of the other combinations will have a probability of 1/257. For the entropy calculation to equate the maximum of 8, all of the combinations must have the same
probability of occurring as each other. The payload size of 257 is not evenly divisible
by 256, so it is not possible to reach the entropy of 8 with a payload size of 257.
The first dip between the sizes of 256 bytes and 512 bytes is more pronounced
than the dip between the sizes between 512 bytes and 768 bytes. The dips will
become less pronounced as the sizes of the payloads grow larger. If you look at
the transitions from 256 bytes to 257 bytes and 512 bytes to 513 bytes, the 257th
byte and the 513th bytes will decrease the maximum entropy by different amounts.
The 257th byte will leave one probability at 2/257 and the rest with a probability
of 1/257 and the 513th byte will leave one combination at 3/513 and the rest with
a probability of 2/513. In the case of the payload with the size of 257 bytes, the
difference between the probabilities of the combination from the 257th byte and
the rest of the combinations will be 2/257 - 1/257 = 1/257. In the case of the
payload with the size of 513 bytes, the difference between the probabilities of the
combination from the 513th byte and the rest of the combinations will be 3/513
- 2/513 = 1/513. The difference of 1/257 is greater than 1/513, which causes it
to have a greater impact, thus pulling the maximum possible entropy farther down
from 8.
The entropy calculations will be used to distinguish random data from
non-random data. In the work of this thesis, random data is considered encrypted
unless ExFILD finds it to be compressed. It is important to note that this determination is an estimation of encryption, and is not an exact science. The determination
will be made using a static threshold set by an administrator. Graphically, this
means a straight line will be drawn across a graph and packets or sessions above
it will be encrypted and any below it will be non-encrypted. The slope between
the payloads of size 0 and 256 is problematic for this determination. If a threshold
is set to 7, no payloads with a size less than 128 could be considered encrypted.
The solution we adopt is to scale the entropies in a way that keeps the integrity
of the relationships between payloads’ entropy values while flattening out the possible maximums. Two different solutions will be described in Section 4.1.3.1 and
Section 4.1.3.2. The dips are not a concern for the research in this thesis, because
their magnitude is small enough that it will not shift many entropy values enough
to change their observed state of encryption from encrypted to unencrypted.
4.1.3.1 Scaling by Initial Values
The first proposed solution to the problem was to initialize the number of
each combination to a value other than 0. Initializing the number of each of the
possible combinations to 1 would be equivalent to a 256-byte packet that has a
completely random payload. The 256-byte payload is where the maximum possible
entropies begin to flatten out. The idea would then be that the line of maximum
possible entropies is scaled up to 8, thus removing the steep slope. The solution was
tested using 10 different initial values starting at 0.1 and incrementing by 0.1 up to
1. Figure 4.5 shows the plots for the initial values of 0.1, 0.3, 0.7, and 1. Each plot
has two lines with the top line being the maximum possible entropy values and the
bottom one being the minimum possible entropy values.
The steep slope of maximum possible entropies has been removed from the
line. It has been normalized into a dip down from the line with the entropy value of
8. The larger the initial value is the smaller the dip becomes and thus the quicker the
maximum entropies converge on the value of 8. The solution performs the desired
normalization on the maximum entropies. A problem arises with the minimum
entropies line when using this solution. The minimum possible entropies for the
values that are not scaled are all 0 as shown in Figure 4.3. The minimum values
will start at a value of 8 and slope downward when the initial value is greater than
0. The minimum entropy values converge to 0 at a slower rate as the initial values
grow larger. Packets with small payloads will have a larger chance of being labeled
as encrypted even if they are unencrypted. The solution alleviates the problems
with the maximum values, but creates a problem with the minimum values.
(a) Initial value set to 0.1.
(b) Initial value set to 0.3.
(c) Initial value set to 0.7.
(d) Initial value set to 1.
Figure 4.5: Maximum and Minimum Entropy Values with Different Initial Values.
4.1.3.2 Scaling by Size
The solution that is implemented in ExFILD is a scaling technique based
on the size of the payload. Since the range between 0 bytes and 256 bytes is the
problematic area, only the payloads that are less than 256 bytes will be scaled using
this method. The scaling is as simple as multiplying the entropies by a scalar. The
scalar will be based on the relationship between the maximum possible entropy for
a size and 8, which is the maximum possible entropy for this entropy calculation.
The scalar is calculated using Equation 4.2.
\mathrm{scalar} = \frac{8}{\log_2(\text{payload size})} \qquad (4.2)
Equation 4.2 shows that the scalar is dependent only on the size of the payload.
Using this scalar will allow the possible entropies of payloads less than 256 bytes
to range from 0 to 8. The scalar can range from 8 down to 1. Table 4.1 shows
example scalars for milestone entropies. This scalar could be used to correct dips
for payloads larger than 256 bytes, but they are not used due to the dips not having
a significant impact and to prevent unnecessary computations.
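Continuing the sketch from Section 4.1.3, the scaling can be expressed as below; the function reuses the illustrative byte_entropy() routine and, like it, is only a sketch of the idea rather than the exact implementation.

# Illustrative scaling of Equation 4.2 for payloads shorter than 256 bytes.
sub scaled_entropy {
    my ($payload) = @_;
    my $size = length $payload;
    return 0 if $size < 2;                           # log2 is zero or undefined here
    my $entropy = byte_entropy($payload);
    $entropy *= 8 / (log($size) / log(2)) if $size < 256;   # Equation 4.2
    return $entropy;
}

# A 2-byte payload of two different bytes has raw entropy 1, which scales to 8:
printf "%.3f\n", scaled_entropy("\x00\xff");         # 8.000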
Table 4.1: Scalars for Max Entropy of Payloads by Size
Packet Size   Max Entropy   Scalar
0             0             0
1             0             0
2             1             8
4             2             4
8             3             2.667
16            4             2
32            5             1.6
64            6             1.333
128           7             1.143
256           8             1
Figure 4.6 displays the lines of maximum and minimum entropy values that
have been scaled based on the payload size. The line of maximum entropy values is
almost identical to a line with the entropy value equal to 8. It will not always have
a value of 8 due only to the multiplication of decimals and rounding. The line of
minimum entropy values is identical to a line with an entropy value of 0. The scaling
has proven to work well on the line of maximum entropy values without disturbing
the line of minimum entropy values.
Figure 4.6: Scaled Maximum and Minimum Entropy Values
4.1.4 Checking if Encryption is Expected
Encryption is the most important characteristic of network traffic for this
thesis. Knowing whether network traffic is encrypted is not enough to make a determination of whether the outgoing traffic is benign or undesired. An administrator
needs to know whether the outgoing traffic is supposed to be encrypted or not. An
example would be an administrator observes traffic between two hosts that appears
to be encrypted. The administrator is nervous about the encrypted traffic and immediately suspects that sensitive data is being exfiltrated from the network. The
administrator will start investing resources into determining the application of the
outgoing traffic. The investment of resources may be worthwhile if it is HTTP traffic that is not expected to be encrypted, but it would be less likely to be useful if
the traffic was HTTPS where it is supposed to be encrypted. The administrator
would have more justification to investigate why encrypted data is leaving in HTTP
packets than encrypted traffic leaving in HTTPS packets, since it is abnormal. This
justification is based on whether or not the traffic is encrypted and if it is expected
to be encrypted, not on other characteristics such as the IP address being in the
blacklist. Taking the other characteristics into account may justify another action
being taken.
The stage checking if encryption is expected is just as important as the stage
checking if the data is encrypted. This stage uses the most specific protocol linked
with the data to determine if encryption is expected. The application layer protocol
that is determined by the ports of the traffic is used a majority of the time. Common protocols are already supported meaning that it has already been determined
whether the protocols’ data should or should not be encrypted with standard use.
Administrators can add protocols seen on the network as needed. They will need to
determine whether encryption is expected. The protocol of the packet or session will
be checked against the supported protocols. There is a possibility that the protocol
is yet to be handled by this stage, and in this case encryption is not expected by
default. Since unsupported protocols do not expect encryption by default, protocols
do not need to be added unless encryption is expected, but adding the protocols
that do not expect encryption can be used to prevent other administrators from
having to check if it should be encrypted at a later time.
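In code this stage can be as simple as a lookup table keyed by the protocol name the decoder derived from the ports, with the default of not expecting encryption for anything unlisted. The entries below are a hypothetical starting point, not the complete list used by ExFILD.

# Illustrative lookup of whether a protocol's payload is expected to be encrypted.
my %expects_encryption = (
    HTTP  => 0,
    DNS   => 0,
    HTTPS => 1,
    SSH   => 1,
    IMAPS => 1,
);

sub encryption_expected {
    my ($protocol) = @_;
    # Unsupported protocols default to "encryption not expected".
    return exists $expects_encryption{$protocol} ? $expects_encryption{$protocol} : 0;
}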
4.1.5 Checking if Encryption is Present
Encryption is a standard for people who want to protect their data from
others. When encryption is implemented correctly it is a very effective method of
protecting data. Although encryption is very useful, it can also be used against
people, for example by attackers encrypting the data they exfiltrate, as discussed in Section 2.1.11. An administrator will know if encryption is expected from the previous
stage, and now needs to determine whether or not it is actually encrypted. The
state of encryption for a packet or session is observed using the scaled entropy from
Section 4.1.3.2. A comparison is performed between the entropies and a threshold
set by an administrator.
The comparison in this stage is trivial; however, setting the threshold is not.
As with all thresholds, an administrator does not want to set it too high or too
low. A threshold that is too low will cause false positives, and one that is too
high will cause false negatives. Assuming an administrator is more concerned with
encrypted traffic than unencrypted leaving the network, false positives will waste
the administrator’s resources investigating unencrypted traffic, while false negatives
will let encrypted traffic leave the network unobserved by the administrator. For
this thesis false positives and negatives will cause the traffic to be mishandled, so it
is important to set a threshold that produces a minimal amount of either.
The initial threshold for the program was set using the data set described in
Section 5.1.1. HTTP and HTTPS were used as the basis for setting the threshold.
The reason these two protocols were chosen is that they perform the same function,
but HTTP is not encrypted while HTTPS is encrypted. An important observation
to take away from Figure 4.7 is the separation between most of the HTTP and
HTTPS. The separation between encrypted and unencrypted traffic is the attribute
of entropy that allows for the determination of whether a payload is encrypted.
The result of averaging the entropies for all of the HTTP and HTTPS packets was
used to set the threshold. The entropies have enough separation that the average
of the packets will be a good value to delineate between the two. The threshold
was calculated to be 6.51237, which was rounded down to 6.5 for this thesis. The
threshold is represented by the straight line in Figure 4.7.
There are outliers that would be considered false positives and negatives, but
some of them will be handled in checks within the tree, explained in Section 4.2. The
Figure 4.7: HTTP and HTTPS Traffic
outliers were the motivation to input sessions through the tree along with packets
due to the belief that the small size of the payloads do not allow for an accurate
measurement of entropy. By grouping all of the packets in a session together, the
entropy for the entire session will be calculated and will normalize any outliers in
the session. In the case where the observed state of encryption does not match the
expected state, the session will be flagged and need to be investigated. ExFILD will
output both the packet and session stats, so administrators may use whichever of the
two they prefer.
Figure 4.8: The Four Branches of the Tree
4.2 Tree
The decision-making tree (which will be known as the tree throughout this
thesis) is responsible for deciding whether a packet or session might contain exfiltrated data and should be logged. All of the packets and sessions will be sent
through the tree except for traffic from the data link layer. Data link layer packets
will only travel on the local network, which for this thesis is assumed to be trusted,
so they are omitted from the tree. Figure 4.8 shows the division of the four types
of traffic being inputted into the tree. The tree branches out into four branches:
expected and received unencrypted traffic, expected unencrypted but received encrypted traffic, expected encrypted but received unencrypted traffic, and expected
and received encrypted traffic.
Each of the four branches will make a different set of decisions to decide
whether or not to flag the packet or session. The two branches where the expected
state of encryption matches the received are the less interesting cases in this thesis.
The other two branches are more interesting, because they are showing behavior
that is not expected under the standard use of a protocol. The branch where the
payload is expected to be unencrypted but is observed to be encrypted tends to be
the most intriguing branch. The four branches will be discussed in more detail in
Sections 4.2.1 - 4.2.4.
4.2.1 Expected Unencrypted and Received Unencrypted
The case where the traffic is expected to be in the clear and the traffic that
is received is not encrypted is very common because of protocols such as HTTP.
Sensitive information should not be sent over unencrypted protocols, but if it is
an administrator will be able to inspect the payloads of the packets for anything
suspicious. Figure 4.9 displays the flow for handling the traffic under this characterization. As a convention for this thesis, any decision will result in two branches
with the top being for a result of true and the bottom for a result of false.
Figure 4.9: Flow for Expected Unencrypted and Received Unencrypted Branch
The first stage, which checks the payloads for censored strings, is derived
from the Corporate Watcher tool described in Section 3.2. It will take regular
expressions from a file located at “../Config/censoredList”, and it will log an alert
for any matches to the expression. This stage is very effective for administrators with
well-defined strings that should not be leaving the network. The censoredList file
initially contained regular expressions for social security numbers and credit cards.
If credit cards are not a concern for a particular network, the regular expressions for
them should be removed. These regular expressions caused a lot of false positives
due to a lot of payloads containing large strings of numbers. The censoredList is
now empty by default. An administrator will need to add the regular expressions
as needed and tweak them to avoid false positives. False positives may not be
avoidable, so an administrator would need to make the choice of whether or not to
include them.
The destination IP addresses will then be checked against the blacklisted IP
addresses listed in the file “../Config/blackList”, and an alert will be logged for each
match. The blacklist is empty by default and during the running of ExFILD in Chapter 5. The destination addresses will then be checked against the whitelisted IP addresses in the file located at “../Config/whiteList”. If there is a match, no alert will
be logged and the tree will process the next packet or session. The default whitelist
includes some IP addresses that belong to Skype, Google, multicasting, and broadcasting. The Google and Skype IP addresses are included because their services are
run on the host that created the Control Data set. The multicast (224.0.0.251 for
mDNS and 224.0.0.252 for LLMNR) and broadcast (255.255.255.255) addresses are
included, because they are used on the local network and considered trusted.
The checks will end if there is no match at this point, but additional checks
can be added at this point for packets and sessions that are neither in the whitelist
or blacklist. The choice was to place the blacklisting stage in front of the whitelisting
stage in case an IP address is in both lists. This way the IP address will either be handled
correctly as a blacklisted IP address or, at worst, create a false positive in the logs
when it is actually supposed to be a whitelisted IP address. Switching the order of
these two stages would lead to a false negative that will not be shown in any logs
and an administrator may not catch the undesired traffic leaving the network.
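A compressed sketch of that ordering is shown below; the list loading and the alert text are placeholders, and the point is only that the blacklist check runs before the whitelist check.

# Illustrative ordering: blacklist first, then whitelist, then optional extra checks.
sub check_destination {
    my ($dest_ip, $blacklist, $whitelist) = @_;      # array refs of regex patterns
    return 'alert: blacklisted destination' if grep { $dest_ip =~ /$_/ } @$blacklist;
    return '' if grep { $dest_ip =~ /$_/ } @$whitelist;   # trusted, no alert
    return '';                                       # in neither list; further checks could go here
}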
Figure 4.10: Flow for the Expected Encrypted and Received Encrypted Branch
4.2.2 Expected Encrypted and Received Encrypted
The case where encrypted traffic was both expected and received is processed
in a very similar manner as described in Section 4.2.1. The difference is in the
first stage of the processing of the packets and sessions. The corporate watcher
program will not be effective against encrypted data, because the data has been
obfuscated. The only way an administrator would be able to use regular expressions
is if effort was put forth to decrypt the data and at that point the administrator
would most likely want to look at it by hand. The flow for processing traffic of this
characterization is shown in Figure 4.10.
The first stage will check if the data is actually a compressed file being served
over HTTP and not encrypted data. Packet payloads with compressed files will
also be random, thus having high entropy. The compressed files are found by their
magic numbers. A magic number is a set of bytes that can be found in every file of
a certain type and used to identify that file type. Table 4.2 has a list of the magic
numbers for the file types checked for in this stage.
It also contains the strings used to match it to an HTTP transfer. The HTTP
strings are used in the matching, because without some other reference to the file
there will be a lot of false positives. Magic file numbers are short in length, which
Table 4.2: Magic Numbers and HTTP Strings for Compressed Files
File Type   Extension   Magic Number      HTTP Strings
Bzip        .bz         42 5A             Content-Type: application/zip
Gzip        .gz         1F 8B             Content-Encoding: gzip
ZIP         .zip        50 4B 03 04       Content-Type: application/octet-stream
Tar         .tar        75 73 74 61 72    Content-Type: application/x-tar
leads to a lot of matches in payloads. In order to use only the magic numbers for
matching, all of the application layer protocols would need to be decoded to remove
any header information and find the beginning of the file. The Unix file program
is able to use only the magic numbers, because it is looking at the file itself and not
at data from the headers of the application layer protocols.
The stage checking for compression will be more effective when inspecting
the sessions rather than the packets, because not every packet will have the magic
numbers or the HTTP strings to match against. Even though a packet does not
have the magic file number or HTTP string, it could still be a packet containing
parts of a compressed file. All of the packets in the session will be treated as one,
so it will only need to match one packet containing this information to be able
to exclude the rest of the packets in the session. The file types in Table 4.2 are
supported by default, but more can be added as required. Future work is planned
to decompress the files and run them through the Corporate Watcher program or
other checks. This work is discussed in Section 7.5.
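As a sketch of that check for a single entry of Table 4.2, a session's assembled data can be tested for both the HTTP string and the magic number; only the gzip case is shown, and the function name is illustrative.

# Illustrative compressed-file check for the gzip row of Table 4.2.
sub looks_like_gzip_download {
    my ($data) = @_;
    return $data =~ /Content-Encoding: gzip/
        && $data =~ /\x1f\x8b/;                      # gzip magic number 1F 8B
}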
4.2.3 Expected Encrypted and Received Unencrypted
The case where the traffic is expected to be encrypted but unencrypted traffic
is received is more concerning than the cases where the received traffic is what is
expected. The traffic is automatically assumed to be suspicious once the received and
expected states of encryption do not match. It is not common for protocols to send
encrypted messages for part of the time and then switch to unencrypted or vice versa.
The processing for the case of expected encrypted but received unencrypted traffic
is similar to the case where unencrypted traffic is both expected and received, described
in Section 4.2.1. The flow for the processing of this case is shown in Figure 4.11.
The differences are the addition of the first stage, the logged alerts have different
messages that correspond to this stage, and the alert added after the whitelist check.
Figure 4.11: Flow for Expected Encrypted and Received Unencrypted Branch
The first stage is added to filter out packets that are unencrypted, but are
used for the management of the encrypted traffic. Transport Layer Security [29] is
a protocol that is used to communicate using an encrypted channel. The channel
cannot be established without unencrypted packets to initialize the stream and
negotiate the characteristics of the stream. The initialization is called a handshake.
There are two parts of this handshake that will cause an unnecessary alert. Those
alerts are the client hello message and the change cipher spec message. These are
not the only messages in the handshake that are not encrypted, but the structure of
the other messages and the values of their fields make them appear to be encrypted.
The TLS client hello message contains a list of cipher suites that the client
is capable of using for the communication. Each cipher suite is represented by two
constant byte values. The bytes are assigned in order meaning that the first cipher
is represented by the hex values 0x00 and 0x01, the second is represented by the hex
values 0x00 and 0x02, and the rest are numbered incrementally. This means that
the first byte will be repeated multiple times within the packet, which will lower the
entropy of the packet. The cipher suites that are assigned have two possible values
for the first byte, which are the hex values of 0x00 and 0xc0. The change cipher spec
is a small message that is only 6 bytes and it is a constant for TLS version 1. Three
of the bytes of this message are represented by the hex value of 0x01. With half
of the bytes the same value, the entropy will not be high enough to be considered
random.
The alerts for this case will be different from those in the case of expected
and received unencrypted traffic. The alerts in this case will be altered to include a
message stating that the expected and received characteristics of the traffic were not
the same. It is important for the administrator to be notified of this discrepancy. The
other checks may be able to provide additional information explaining the cause of
the discrepancy. The traffic received is unencrypted, which allows for the corporate
watcher to be used to search for censored strings. The blacklist and whitelist checks
will be performed as described earlier in this chapter. If the destination IP address
is not in the whitelist then an alert will be logged for the discrepancy in the expected
and received encryption characteristics.
4.2.4 Expected Unencrypted and Received Encrypted
The case that causes the most suspicion is where the traffic is expected to
be unencrypted, but the received traffic is encrypted. The discrepancy between the
traffic’s expected and received characteristics is definitely disconcerting, and the fact
that the data is encrypted and can’t be inspected is even more worrisome. Setting
up a server that initiates an encrypted channel requires additional work compared
to one that serves data over an unencrypted channel. A server that implements TLS
or SSL will require the creation of a certificate. A lazy attacker will not put forth
the extra effort to implement a server with encrypted channels. The attacker would
be more inclined to encrypt the data and then send it over an unencrypted channel.
That is why detecting the discrepancy between the entropy characteristics of the
expected and received traffic is important, otherwise there would be no indication
for this unwanted traffic leaving the network.
Figure 4.12: Flow for Expected Unencrypted and Received Encrypted Branch
Figure 4.12 displays the processing flow for the case where the outgoing network traffic is expected to be unencrypted, but the received traffic is encrypted.
The stages for this case have all been discussed in the other cases except for the
stage checking for ICMP requests and replies. The stages checking for compressed
files being served by a web server and checking if the destination IP address is in
the blacklist or whitelist have been discussed in the previous cases. The other stage
of the processing is a check to remove packets that appear to be encrypted, but are
not encrypted and are not used to maliciously exfiltrate data. The check is looking
for ICMP ping requests and reply messages. The list below contains the payloads of
these ICMP requests and replies. The x’s in the third item represent non-printable
ASCII characters.
1. abcdefghijklmnopqrstuvwabcdefghi
2. ABCDEFGHIJKLMNOPQRSTUVWABCDEFGHI
3. xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx !”#$%&’()*+,-./01234567
If ICMP ping requests and replies are found, the alerts will be suppressed. The
alerts for all of the stages will contain both the warnings for the results of that
stage’s check and a message stating that the traffic received was encrypted when it
was expected to be unencrypted.
Chapter 5
EXPERIMENTS, RESULTS, AND ANALYSIS
5.1 Experiments
ExFILD was run against multiple data sets to test how effective it is at
detecting data exfiltration. It was run against eight different data sets that were
produced on different machines and in different environments. One of the data sets was created
by performing specific tasks so that it contains a wide variety of traffic, including intentional exfiltration of data to be used as a control. Another two data sets were created by recording the normal use of two different users
on different networks. One of these data sets
was created at the University of Delaware using a university-owned computer during
business hours, and the other was created at another person's residence on their
personal computer outside of business hours. The remaining five data sets are packet captures
from pieces of malware that exfiltrate data from a network. The data sets from
malware are used to show the effectiveness of the program against actual threats.
5.1.1 Control Data Set
The Control Data Set was created to get as many types of traffic as possible
and was used to develop ExFILD. This diversity made it a good baseline for testing ExFILD. Audio and video streams of the user were recorded during the creation of the
data set. A written log was also created to document every step of the creation of
the data set. The written log can be found in Appendix A. There is an advantage
to using a data set created by hand instead of setting up a computer system and letting it monitor the user’s network traffic during normal use. The advantage is that
the creator of the data set knows exactly what actions were performed to cause the
network traffic. Using data sets that were recorded without someone documenting
the actions will make it difficult to know if the program is working correctly. The
created data set will act as the control experiment for this thesis.
The Control Data Set was created by capturing the traffic entering and exiting an Apple MacBook running Mac OS X 10.6. The MacBook was using the
Ethernet port to communicate over the network. The MacBook’s network configuration during the creation of the Control Data Set is shown in Figure 5.1. It was
connected to a Netgear switch (model GS724T) with four computers running Linux.
The switch was behind a D-Link router (model DIR-655), which had a public IP
address, meaning it can be accessed from other computers on the Internet, not only
the hosts on the local network. Knowing the network configuration will help to
understand what causes some of the alerts produced by ExFILD.
Figure 5.1: Network Diagram for the Control Data Set
The plan for the data set is to capture a large variety of different services
for a baseline to develop ExFILD. The packet capture covers a time span of 66
minutes. It includes activities that are commonly performed by users on a daily
basis. Email was checked using the web interface as well as using a mail client
that implements the IMAP protocol. Email attachments were downloaded using
the web client. Many web pages were loaded including CNN.com for the daily
news and SI.com to read the sports news for the day. Multiple RSS feeds were
accessed using Google Reader. Bank statements were checked for multiple bank
accounts. Twitter accounts were checked using both the web and desktop clients.
Videos were viewed from YouTube.com. Some other services that were run are
less common to the general public, but common for people in fields that relate to
computers and networks. SSH was used to log into other computer systems and
perform some mundane tasks. VNC was used for remote desktop administration
of another computer system on the network. A complete account of the activities
performed during the creation of this data set is included in Appendix A.
ExFILD is based on knowing the encryption characteristics of packets and
network sessions. The data set needs to have traffic that is both encrypted and
unencrypted in order to test the program adequately. The applications run during
this data set were a mix of protocols that utilize encrypted and unencrypted communications. The activities that will produce encrypted data are accessing bank
web sites through HTTPS, logging into computers using SSH, instant messaging
with colleagues, and other applications used for secure communications. Many of
the application protocols used over the Internet communicate using unencrypted
channels. The data set will include unencrypted traffic from browsing web sites
using HTTP, transferring files using FTP, resolving DNS names, and many other
application protocols.
The data set needs to have an example of data being exfiltrated within it.
Simple means of exfiltration were added to the Control Data Set. The first means of
exfiltration was transferring files between two computers using FTP, with the files
sent over an unencrypted channel. The second means of exfiltration
also transferred files between two hosts using FTP; however, the second
means used an FTP server over a channel encrypted with SSL.
Two Perl scripts were written to transfer files with one of the scripts for the
unencrypted transfer and the other for the encrypted transfer. The scripts are run to
transfer small text-only files throughout the data set. The sizes of the different files
are 100, 1,000, 4,000, and 8,041 bytes of text from the Declaration of Independence.
The reason for the files being so small is that large files will be much easier to detect
due to the increase in network traffic from the larger files. All of the files were
transferred using both scripts at a time when no other network traffic was present
to set a baseline. The files were also transferred over the encrypted channel while
web sites are loaded and while videos from YouTube are being buffered. These file
transfers will be used to show the effectiveness of ExFILD at detecting exfiltration
interwoven with other traffic.
5.1.2 Data Set #1
The first uncontrolled data set was created by a user over a three-day period.
This data set will be referred to as Data Set #1. The communications recorded are
from an Apple iMac running Mac OS X 10.5. The computer is on a university’s
network and has a public IP address, so the network is a simple one shown in Figure 5.2. The activities performed during this data set are simple and common to the
daily activities of many Internet users. Most of the traffic results from browsing web
pages on the Internet using HTTP, HTTPS, and PROXY (used to access university
sites). There are many other protocols that are observed in this data set, but they
are underlying protocols that are not directly initiated by the user. Even though
the user may not have explicitly requested that these protocols run, these services
are fundamental for other tasks the user wants to perform. Some of these protocols
include NTP for synchronizing the time, DNS to resolve names to IP addresses, and
ARP for resolving addresses on the local network. Data Set #1 gives a good overview
of basic user activities.
Figure 5.2: Network Diagram for Data Set #1
5.1.3 Data Set #2
The second uncontrolled data set was created by a different user than those of the Control
Data Set and Data Set #1. It was created over a period of six and a half hours. The
communications were recorded on a Dell laptop running Windows 7. It was using its
wireless card (802.11g) to access the Internet. The network is shown in Figure 5.3.
The laptop is behind a router with two other computers running Windows XP. The
activities directly initiated by this user are very similar to the activities performed
by the user in Data Set #1. The other activities differ due to Windows and Apple
operating systems having different services that run by default. Data Set #2 has
the NetBIOS suite used by Windows for networking. Another Windows service seen
is Teredo, which is Microsoft’s solution for using IPv6 over IPv4. It also has other
unique protocols not discussed in the previous data sets, such as SMB for sharing
files over the network and DROPBOX, which is used to back up files online. The
combination of Data Set #1 and Data Set #2 gives a good overview of outgoing
traffic for basic users.
Figure 5.3: Network Diagram for Data Set #2
5.1.4 Malware Data Sets
Packet captures of real world examples of data exfiltration are not readily
available to the public. There are a couple of explanations for the lack of public
packet captures. One explanation is that not all companies will disclose information
to the public regarding data being exfiltrated from their networks, so no information, including packet captures, will be available. Another explanation is that the
data being exfiltrated will most likely be sensitive information. A company would
not want to release a packet capture containing the data it was trying to secure.
A company could sanitize the packet captures of any sensitive data, but there is
always a chance that pieces of data were overlooked. Companies are not willing to
take a risk of putting the packet captures in the public domain with a chance that
the capture was not fully sanitized. The company also needs to be concerned about divulging sensitive information about its employees in the packet capture. A packet
capture could also contain financial information, personal emails, or other personal
information that is not related to the sanitized data. Sanitizing a packet capture
for personal information will result in a large number of items to find and remove.
The administrator will also have to decide what the employees do and do not want
other people to know about. The task will become overwhelming very quickly. For
these reasons using packet captures with real world examples of large data breaches
is not a viable option; however, packet captures from malware could be useful for
this thesis. The packet captures from the malware are not going to be large samples,
but they will contain data being exfiltrated by an attacker. Most of the data will be
for control and not sensitive data, but the malware would use the same mechanisms
of exfiltration for sensitive data as it does for control data.
5.1.4.1 Kraken Botnet
A packet capture of the Kraken botnet was obtained from the OpenPacket.org
repository [30]. The packet capture contains command and control traffic between an
infected host and a command and control server. Kraken was reverse engineered and
a description of its activities was given in [31]. Infected hosts that are part of the
Kraken botnet will attempt to communicate with command and control servers in a
large list of DNS names generated by Kraken. The domain names of the command
and control servers are listed in [32]. It will continue to look for servers on port 447
until it finds a match. The communications between the bot and the command and
control servers are encrypted using a custom encryption protocol. The data being
exfiltrated in this data set is the encrypted communications with the server. The
command and control server is using the data received from the infected computer
to direct that host’s actions. The encrypted outgoing messages will provide a good
test for ExFILD. The verification that the packet capture contains network traffic
from the Kraken botnet is performed in Appendix D.1.
5.1.4.2 Zeus Botnet
Three packet captures of the Zeus botnet were obtained from the OpenPacket.org repository ([33] [34] [35]). The packet captures contain three separate
communications between infected hosts and command and control servers. The
communications of the Zeus botnet are explained in [36]. It is explained that the
communications between the infected hosts and the command and control servers
are encrypted using RC4, a stream cipher. At first an infected host will download
a configuration file from a server. The infected host will then use HTTP POST
commands to send information to a server defined in the configuration file. The exfiltrated data includes information about the infected host and sensitive data stolen
from the host retrieved by web injection or from protected storage. The verification
that all three of the packet captures contain network traffic from Zeus is performed
in Appendix D.2.
5.1.4.3 Black Worm
A packet capture from Blackworm or Nyxem.E was obtained from the repository at pcapr.net [37]. The packet capture contains the communication from Blackworm during the search and infection of a victim. Blackworm spreads using email
attachments and network shares. A detailed description of it is given by [38]. Blackworm is using network shares as its propagation method in this packet capture. It
involves walking through all of the known network shares of the infected host. The
host will try to copy itself into one of three file locations listed in [38]. Before the victim is infected, the worm will delete any known folders relating to security vendors
from the victim machine. If the worm successfully infects a victim, the victim will
delete registry keys for known security vendors to attempt to stop the detection of
the infection. The victim will then search for other computers to infect. Blackworm
will delete many different types of files on the third of every month. The data exfiltrated from the infected host is the Blackworm executable itself. The verification
that the packet capture contains network traffic from Blackworm is performed in
Appendix D.3.
5.2 Results and Analysis

5.2.1 Packet Versus Session Alerts
The choice was made to focus on the session alerts rather than the packet
alerts. This choice was made to reduce the impact of small packets and packets
determined to have a different encryption status than the rest of the packets in a
session. The payload of a small packet can be misleading as to whether it is random
or not, because there is not enough information to confidently determine the randomness of the payload. The problem with looking at single packets is that all of
the packets in a session could be determined to be unencrypted except for one. The
single encrypted packet could just be a coincidence, but it would cause an undesired alert or a false positive. However, looking at just the session alerts and not the
packet alerts could allow for legitimate alerts to be neglected. Future work described
in 7.4 is planned to investigate this problem. The choice also has the advantage of
combining alerts from multiple packets into one if they are from the same session.
Administrators would like this fact, because they will not be overwhelmed with
repetitive alerts.
Table 5.1: Alert Statistics for Packets and Sessions of Each Data Set
Data Set     Packet Alerts   Unique Sessions   Packets Less Than 256 Bytes   Session Alerts
Control      1,011           62                881                           43
One          225             218               225                           0
Two          178             135               174                           12
Kraken       18              16                10                            16
Zeus 1       3               3                 0                             3
Zeus 2       3               3                 0                             3
Zeus 3       4               3                 0                             3
Blackworm    147             1                 0                             1
Table 5.1 displays statistics about the packet and session alerts. It is shown that
for the Control Data Set, Data Set #1, Data Set #2, and the Blackworm Data Set,
which have a large number of packet alerts, the number of session alerts is decreased
dramatically. It can also be seen that the number of small packets for the Control
Data Set, Data Set #1, and Data Set #2 closely correlates to the difference between
the packet and session alerts of each respectively. The data in Table 5.1 provides
evidence that justifies the use of session alerts over packet alerts, thus the results
will be discussed with a focus on the session level rather than the packet level. The
Control Data Set will be examined in more detail at the packet level, since its activities
were generated for development and are well known. All of the results for the data sets'
entropy values at the packet and session level are included in Appendix B. All of
the plots for the alerts will be included in Appendix C.
5.2.2 Control Data Set
The Control Data Set produced some results that were expected and some
that were unexpected. The unexpected results were due to errors in logic when
creating the data set. The controlled exfiltration methods were implemented in the
wrong direction, meaning that the files were being transported into the MacBook
instead of out of it. To mitigate this oversight the data set was run through ExFILD
with a host of 10.0.0.101 (the IP address for the MacBook), but with a filter that
removed any traffic to 10.0.0.54 and added any traffic from 10.0.0.54. The filter used
to do this is equivalent to "(src host 10.0.0.101 && !dst host 10.0.0.54) || src host
10.0.0.54". The filter essentially returns any traffic leaving 10.0.0.101 and inverts the
traffic that includes 10.0.0.54, meaning incoming traffic from 10.0.0.54 to 10.0.0.101
will now be treated as outgoing traffic from 10.0.0.101 to 10.0.0.54. The traffic from
10.0.0.54 will include the traffic from sessions for a FTP server, a SSH server, and
a VNC server. This corrected data set will be used in the results section.
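The direction correction can be pictured with the short Python sketch below. The thesis applies the correction as the pcap-style filter quoted above; the sketch only shows the same inversion logic over already-parsed packets, using simplified (src, dst, payload) tuples as a stand-in for real pcap records.

    HOST = "10.0.0.101"   # the MacBook
    PEER = "10.0.0.54"    # the FTP/SSH/VNC server

    def outgoing_view(packets):
        """Yield packets as if all traffic involving PEER were leaving HOST."""
        for src, dst, payload in packets:
            if src == HOST and dst != PEER:
                yield src, dst, payload          # genuine outgoing traffic
            elif src == PEER:
                yield HOST, PEER, payload        # inverted: treated as leaving HOST

    sample = [("10.0.0.101", "8.8.8.8", b"dns query"),
              ("10.0.0.54", "10.0.0.101", b"ftp data"),
              ("10.0.0.101", "10.0.0.54", b"filtered out")]
    print(list(outgoing_view(sample)))   # the third tuple is dropped, the second inverted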
The Control Data Set created 1,011 packet alerts and 43 session alerts as
stated in Table 5.1, and more specific stats are shown in Table 5.2 and Table 5.3.
Table 5.2: Packet Alerts from the Control Data Set

Packet Type                    Alerts   Small Packets   Unique Sessions
AIM                            2        2               2
Controlled FTP Exfiltration    503      417             45
HTTP                           7        7               7
IRC                            1        1               1
Skype                          17       14              2
SSH                            8        4               4
VNC                            473      436             1
Table 5.3: Session Alerts from the Control Data Set

Packet Type                    Alerts
Controlled FTP Exfiltration    41
Skype                          2
The first thing to note is that there are 7 HTTP, 2 AIM, and 1 IRC packet in the
set that cause alerts. All of these packets are small (less than 256 bytes), and their
sessions do not cause an alert. These packets can be the result of a payload being
too large to fit in a single packet, so the overflow of the payload will be sent in a
subsequent packet. The packets could also be small packets used to initialize and
close sessions.
SSH traffic causes 8 packet alerts. Four of these packets are smaller than
256 bytes and the other 4 packets are larger than 256 bytes. These packets are not
carrying any user data. They are used to administer the SSH session, such as the
server and client key exchanges. There are 473 packet alerts resulting from VNC
traffic. A majority of these packets, 436 to be specific, are smaller than 256 bytes.
None of the alerts caused by the SSH and VNC packets are concerning, because
their respective sessions do not cause any alerts. The choice to only look at the
session level alerts is reaffirmed by the large number of packet alerts from AIM,
HTTP, IRC, SSH, and VNC that do not cause any session alerts.
Skype also causes alerts, which is not surprising. Skype communications
are transferred over encrypted channels to many different hosts over non-standard
ports. The 17 packet alerts and 2 session alerts occur even after attempting to
whitelist it. Skype was whitelisted because it is used on the MacBook regularly, but
it should not be whitelisted if it is not used regularly on the host.1 IP addresses
maintained by Skype were added to the whitelist to remove the alerts resulting
from those communications. Although it was mentioned that Skype does not use
a standard port, it does have a configurable port on a host-by-host basis, which
will remain static unless changed by the user. That port was also added in to
ExFILD to whitelist Skype’s communications. If Skype was not whitelisted, there
would have been a total of 1,805 packet alerts and 84 session alerts resulting from
its communications. If Skype was not trusted on a network, these alerts would be
legitimate concerns, but since they are trusted in this case, the 17 packet alerts
and 2 session alerts still remaining are considered false positives. A better way
to filter out traffic from Skype would need to be developed. On the other hand,
ExFILD resulted in less than one percent (0.94%) of the Skype packets and a little
more than two percent (2.34%) of the Skype sessions causing false positives. These
percentages of false positives are not bad considering only simple whitelisting was
used to identify Skype traffic.
The remaining alerts were the result of the controlled data exfiltration. The
controlled exfiltration was performed using FTP and FTPS (FTP over SSL), which
resulted in 49 sessions. There were 4 unencrypted file transfers that resulted in a
total of 8 sessions. Each transfer had a control channel and a data channel. There
were 14 encrypted file transfers that resulted in 41 sessions. Each encrypted file
transfer had 2 control channels and a data channel (except for one failed transfer, which
resulted in only 2 control channels and no data channel). As seen in Table 5.2, the
packet alerts accounted for 45 of the FTP sessions.

1 The fact that Skype's source code has not been released, combined with its use
of encrypted sessions connected to many random hosts (a consequence of its
peer-to-peer model), makes it suspect to security-conscious administrators, since
it is indeed exfiltrating encrypted data to random destinations.
The 4 sessions not causing any packet alerts were the sessions responsible
for the transfer of the unencrypted data. This is expected behavior: FTP
traffic is expected to be unencrypted and the files are unencrypted, so they are not
flagged as suspicious. To catch these sessions an administrator would need to be inspecting
the payloads or using the censored word list. If the data set is run through ExFILD
again with a word from the Declaration of Independence in the censored word list,
these sessions will cause 4 alerts. This is assuming the word was within the first 100
characters of the Declaration of Independence. If the censored word is between the
100th and 1,000th character position, it will only cause three alerts and so on. The
data set was run through with the word “political” in the censored word list and
four additional alerts were caused by the censored word.
The four sessions that are represented in the packet alerts and not in the
session alerts are also from the unencrypted file transfers. They are the sessions
responsible for FTP control channels. The reason that the similar sessions for the
encrypted file transfers still cause alerts is that they include additional setup for the
encryption, including transferring the certificate.
All of the sessions involved with the encrypted file transfers caused alerts.
Only the 13 sessions used as the data channels should cause alerts. The 28 false
positives caused by the remaining sessions are due to the lack of application layer
decoding. The sessions will be treated as if they are normal FTP transfers not FTPS
transfers. There is no intelligence yet built into ExFILD to distinguish between the
traffic of the two protocols. The addition of this intelligence should mitigate the
problem by only inspecting the data channel and not the control sessions, whose
payloads are only used to control the channels.
The Control Data Set produced many packet alerts that were
false positives. All of the AIM, HTTP, IRC, Skype, SSH, and VNC packets were
false positives. The controlled exfiltration using FTP and FTPS connections also
caused 464 false positive alerts at the packet level. Some of these packets were the
small FTP packets that are used for the opening and closing of the FTP connections,
while the other packets were from FTPS control channels. All of these packets are
grouped with their respective sessions and produce the correct results except for the
previously mentioned Skype and FTPS control sessions. As expected, the Control
Data Set produced better results at the session level than it did at the packet level.
5.2.3 Data Set #1
Data Set #1 resulted in 225 packet alerts and 0 session alerts as shown
in Table 5.1. The activities performed during the creation of this data set were
basic and it is not surprising that it generated no session alerts. Table 5.4 displays
the breakdown of the packet alerts for this data set. A majority of the packet
alerts resulted from traffic that is expected to be encrypted, but is observed to be
unencrypted. All of the alerts from this data set are false positives and can be
attributed to the payloads being small. An interesting aspect to note about the
HTTPS packets is that they are all Client Hello messages for SSL. A check could
be added to remove these alerts. Adding additional checks for any of these alerts
would not make a difference, since their respective sessions do not cause alerts.
Table 5.4: Packet Alerts from Data Set #1
Packet Type   Alerts   Small Packets   Unique Sessions
HTTP          12       12              7
HTTPS         155      155             154
PROXY         58       58              57
5.2.4 Data Set #2
Data Set #2 created almost as many alerts as Data Set #1 with 178 packet
alerts and 12 session alerts as shown in Table 5.1. Table 5.5 shows the breakdown
of the packet alerts. The packet alerts are comprised mostly of small HTTPS and
HTTP packets with the largest payload being only 208 bytes. Once again a majority
of the HTTPS packets are Client Hello messages for SSL. The SMB packets are large,
which would be concerning if not for the fact that no session alerts result from them.
The NETBIOS-SSN traffic caused 12 session alerts from small error packets. Since
the size of each such packet is only 5 bytes, one would assume that the packets
should not cause session alerts; the other packets in the session should average
out the entropies. However, the other packets in those sessions carry no payloads; they
are just SYN-ACKs and ACKs. The session alerts that result from these
packets are false positives and need to be mitigated in some manner.
Table 5.5: Packet Alerts from Data Set #2
Packet Type    Alerts   Small Packets   Unique Sessions
HTTPS          100      100             93
HTTP           62       62              29
NETBIOS-SSN    12       12              12
SMB            4        0               1
The mitigation could be to add checks for these sets of error packets; however,
a better method would be to develop an algorithm to handle small sessions in a
better way. Handling traffic as sessions instead of packets solved this problem for
the packet level where it was more common, but doing this moved the problem to
the session level. The activities performed in this data set are similar to that of
Data Set #1 and there is no glaring reason to cause session alerts, which explains
why there are none other than the false positives.
5.2.5 Malware Data Sets

5.2.5.1 Kraken
The Kraken Data Set caused 18 packet alerts and 16 session alerts as shown
in Table 5.1. The packet alerts are from 16 unique sessions. Inspecting the packet
capture revealed that there are 16 sessions that communicate to a server on port 447,
meaning that there are 16 sessions that are responsible for exfiltrating data. The
data being exfiltrated is encrypted and uses a port that is not defined in ExFILD.
ExFILD alerted on all of the sessions responsible for data exfiltration and had no
false positives, which means it performed well against this data set.
5.2.5.2 Zeus
The three Zeus Data Sets created a total of 10 packet alerts and 9 session
alerts. Table 5.6 shows the breakdown of alerts for each data set and Table 5.7 shows
the breakdown for the unique sessions containing the exfiltrated data. Zeus Data
Sets #2 and #3 resulted in session alerts for all of the unique sessions containing the
HTTP POST commands. These two sets had a 100% detection rate with no false
positives. Zeus Data Set #1 did not produce the results expected. Three session
alerts were produced by this set, when 13 alerts should have been produced.
Table 5.6: Zeus Packet and Session Alerts #2
Data Set   Packet Alerts   Session Alerts
Zeus 1     3               3
Zeus 2     3               3
Zeus 3     4               3
The 10 sessions containing data exfiltration were examined more closely to find the
cause of the false negatives. The 10 sessions are all single packets with the same TCP
payloads. The sessions have entropy values of 6.4551 bits/byte. An observation
to note is that the threshold to determine encryption is set at 6.5 for this thesis.
Table 5.7: HTTP POST Commands in Zeus Packet Captures #2

Data Set   HTTP POST Commands   Unique Sessions
Zeus 1     13                   13
Zeus 2     3                    3
Zeus 3     4                    3
The sessions’ entropy values are pretty close to the threshold, which leads to the
question of whether there is a better way to set the threshold. One aspect that
could affect the threshold is that it was set using the traffic from the Control Data
Set, and did not take into account traffic from any of the other sets. Setting the
threshold specifically for each host could have beneficial results, but it may also
have the opposite effect. There is not enough traffic in the malware packet captures
to be able to set a threshold with confidence. The threshold was not changed, in order
to preserve the integrity of these experiments.
An experiment was run on the payloads of the sessions causing the false
negatives. The session payload was stripped of the HTTP headers, which is the same
as decoding the application layer protocol. The entropy was then calculated solely
on the data of the sessions, which returned a value of 7.5639 bits/byte. This entropy
value would have caused alerts to result from the 10 false negatives. The program
currently counts these sessions as false negatives, but they will be true positives
after the application layer protocols are decoded in the future work mentioned in
Section 7.3.
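The effect can be reproduced with the sketch below, which computes the first-order byte entropy and strips the HTTP headers before recomputing it. The sample packet is hypothetical (it is not taken from the Zeus captures); only the 6.5 bits/byte threshold is a value from this thesis, and ExFILD's own Perl implementation may differ in detail.

    import math

    def byte_entropy(data):
        """First-order Shannon entropy of a byte string, in bits per byte."""
        counts = [0] * 256
        for b in data:
            counts[b] += 1
        total = len(data)
        return -sum((c / total) * math.log2(c / total) for c in counts if c)

    def strip_http_headers(payload):
        """Return only the HTTP body, i.e. everything after the blank line."""
        head, sep, body = payload.partition(b"\r\n\r\n")
        return body if sep else payload

    THRESHOLD = 6.5   # bits/byte, the threshold used in this thesis

    packet = (b"POST /gate.php HTTP/1.1\r\n"
              b"Host: www.example.test\r\n"
              b"Content-Type: application/octet-stream\r\n\r\n" + bytes(range(256)))
    print(round(byte_entropy(packet), 4))                       # headers included
    print(round(byte_entropy(strip_http_headers(packet)), 4))   # body alone: 8.0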
5.2.5.3 Blackworm
The Blackworm Data Set caused 147 packet alerts and 1 session alert as
shown in Table 5.1. The 147 packet alerts are all from the same session. During the
verification of the data set, it was found that all of the traffic from Blackworm was
contained in a single session. ExFILD was able to detect the session responsible for
exfiltrating data, which in this case was the Blackworm executable used during the
infection of the victim. It also produced no false positives. There were 51 packets
that contained data from the Blackworm executable that did not cause alerts. These
packets are false negatives, but they will be considered true positives with the rest
of the packets at the session level.
5.2.6 Data Exfiltration Detection Performance
The metrics for each of the data sets were accumulated and are displayed
in Table 5.8 for the packet level and Table 5.9 for the session level. The metrics
were based on whether the packet or session contained exfiltrated data. A positive
means the packet or session did in fact contain inappropriate data leaving the host,
while a negative meant it only contained data that was allowed to leave the host.
Inappropriate data is defined for this thesis as any information, stored on or extractable
from a host, that the user does not want to leave the host. It does not
include egress traffic that is exclusively used to establish or control the connections
responsible for transferring data. It is important to note that some packets may be
part of a session exfiltrating data, but do not contain exfiltrated data. A session is
considered to be a positive if any of the packets within it contain exfiltrated data,
so a negative session would contain no packets containing exfiltrated data.
Table 5.8: Packet Alert Metrics

Data Set     Outgoing Packets   True Positives   False Positives   True Negatives   False Negatives
Control      47,032             39               972 [2]           46,021           0
One          89,150             0                225 [3]           88,925           0
Two          80,807             0                178 [4]           80,629           0
Kraken       205                18               0                 187              0
Zeus 1       2,628              3                0                 2,615            10 [5]
Zeus 2       48                 3                0                 45               0
Zeus 3       435                4                0                 431              0
Blackworm    308                147              0                 110              51 [6]
Table 5.9: Session Alert Metrics

Data Set     Outgoing Sessions   True Positives   False Positives   True Negatives   False Negatives
Control      2,779               13               30 [7]            2,736            0
One          9,789               0                0                 9,789            0
Two          3,348               0                12 [8]            3,336            0
Kraken       53                  16               0                 37               0
Zeus 1       72                  3                0                 59               10 [9]
Zeus 2       5                   3                0                 2                0
Zeus 3       5                   3                0                 2                0
Blackworm    10                  1                0                 9                0

Notes for Tables 5.8 and 5.9:
[2] 2 small AIM, 464 FTP setup, 7 small HTTP, 1 IRC, 17 Skype, 8 SSH, and 473 VNC packets.
[3] 12 small HTTP, 155 small HTTPS, and 58 small PROXY packets.
[4] 62 small HTTP, 100 small HTTPS, 12 small NETBIOS, and 4 SAMBA packets.
[5] 10 HTTP POST packets explained in Section 5.2.5.2.
[6] 51 NETBIOS packets transferring the worm executable.
[7] 28 FTP setup and 2 Skype sessions.
[8] 12 small NETBIOS sessions.
[9] 10 HTTP sessions containing the HTTP POST packets explained in Section 5.2.5.2.
Table 5.10: True and False Positive Rates for Each Data Set

Data Set     Packet TP Rate   Packet FP Rate   Session TP Rate   Session FP Rate
Control      1                0.0207           1                 0.0108
One          N/A              0.0025           N/A               0
Two          N/A              0.0022           N/A               0.0036
Kraken       1                0                1                 0
Zeus 1       0.2308           0                0.2308            0
Zeus 2       1                0                1                 0
Zeus 3       1                0                1                 0
Blackworm    0.7424           0                1                 0
Table 5.10 contains the calculated true positive rates and false positive rates
for each of the data sets. They were calculated for packets and sessions using
Equation 5.1 and Equation 5.2, which will allow for a comparison between their
performances.
True Positive Rate = True Positives / (True Positives + False Negatives)        (5.1)

False Positive Rate = False Positives / (False Positives + True Negatives)      (5.2)
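As a worked check of the two equations, the Control Data Set packet counts from Table 5.8 (39 true positives, 972 false positives, 46,021 true negatives, 0 false negatives) reproduce the first row of Table 5.10:

    def true_positive_rate(tp, fn):
        return tp / (tp + fn)

    def false_positive_rate(fp, tn):
        return fp / (fp + tn)

    print(true_positive_rate(39, 0))                  # 1.0
    print(round(false_positive_rate(972, 46021), 4))  # 0.0207, matching Table 5.10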
The false positive rates for all of the data sets at both the packet and session
levels are very low. The Control Data Set is the only data set that has a false
positive rate above one percent. The increased false positive rate is caused by the
large number of small VNC packets and the FTP and FTPS packets containing header
information or belonging to the control channels. Zeus Data Set #1 has a low true positive
rate, which will be improved greatly with the addition of decoding application layer
protocols. ExFILD also did not perform well against the Blackworm Data Set at
the packet level. The low true positive rate was due to the 51 packets containing
the Blackworm executable that did not cause an alert. Data Sets #1 and #2 did
not have any true positives or false negatives, so the true positive rate could not be
calculated.
The main aspect of these two tables to note is that ExFILD at the session
level performs as well as or better than it does at the packet level in every data set except
for Data Set #2. A better solution to handle very small sessions would improve the
false positive rate at the session level. The biggest increases in performance between
the packet and session level come from the Control and Blackworm Data Sets. The
increase in the true positive rate of the Blackworm Data Set is the result of the
false negatives at the packet level being grouped with their respective sessions as
true positives. The decrease in the false positive rate of the Control Data Set is the
result of the aggregation of the false positives into their respective sessions. The
increase in the performance justifies the choice to focus on alerts at the session level
rather than the packet level.
Chapter 6
CONCLUSIONS
The lack of viable options for detecting theft of information has led to an incredible amount of confidential data being extracted by attackers. The field of computer and network security needs tools and techniques capable of detecting
unwanted extraction of information from computer systems. Intrusion detection is
a mature field with thorough research performed on the techniques used, but the
research and development of techniques to detect exfiltration is still rudimentary.
The research for this thesis was started with the single goal of developing a program
capable of detecting data exfiltration. It evolved into three separate goals listed
below.
1. Create tools that will allow for a better understanding of a host’s network
traffic, specifically with relation to data exfiltration.
2. Determine if encrypted network traffic can be discerned from unencrypted
data using the entropy characteristics.
3. Develop a program that is capable of effectively detecting data exfiltration
using the encryption characteristics of network traffic.
Many tools were created to study a network and were discussed in Chapter 3.
The tools that were developed included the functionality to monitor network traffic,
report censored words, display the amount of traffic destined to each IP address, and
extract session and DNS information. The Network Top program provides a very
intuitive way to view the amount of outgoing traffic and its destination. The Session
Extractor and DNS Extractor proved to be useful in the verification of the malware
samples and determining with whom a host is communicating.
It was proven within this thesis that entropy can be used to effectively discern encrypted traffic from unencrypted traffic. The entropy calculations result in
significant separation between the encrypted and unencrypted traffic, which allows
for the determination of whether data is encrypted or unencrypted. A method of
normalizing small packets to allow for this separation was described. It was shown
that small packets and sessions can cause misleading entropy values, and further
research should be performed to produce more useful results from them. It was
also described how to separate compressed data from encrypted data. Encrypted
traffic can be identified with confidence by a simple entropy calculation as long as
its payload is at least 256 bytes.
The accomplishment of the last goal is the largest contribution from this
thesis to the field of computer and network security. It was verified in this thesis
that exfiltrated data can indeed be detected based on the expected and actual state
of encryption for a packet or session’s payload. Chapter 5 provided results that
justify this claim. Research needs to be continued in this area to improve
the accuracy of the technique. Simply implementing the decoding of the application
layer could greatly improve the performance of some of the data sets used in this
thesis.
Hopefully the research performed in this thesis and other similar work in this
research area will stimulate research on data leaving a computer system as well as on the
traffic entering the system. The field of computer and network security would greatly
benefit from continued research in techniques that process the outgoing traffic.
Chapter 7
FUTURE WORK

7.1 Performance
Before other features are added to the program, some attention should be
given to improving its performance. The program's current performance does
not make it feasible to run on a live capture, and it can take a considerable amount of
time to process large capture files. The decision to add features will need to consider the
performance hit that will be taken. If the new feature increases the effectiveness
and does not significantly affect the performance, it will be a good choice to add
the feature. If the new feature only slightly increases the effectiveness and decreases
the performance, it might not be a good choice to add it. For these reasons the
performance needs to be improved.
There are many solutions to increase performance. The first solution would
be to look through the code and optimize it by hand. The functions were created
with an emphasis on functionality and will have un-optimized segments of code.
There are a lot of comparisons being done on each packet and session. These
comparisons may be reorderable to improve the performance without changing the results. For example, in Figure 4.12, assume that there are more
whitelisted packets than there are ICMP requests. In this case it would make sense
to switch the order of the two checks. Switching the order of the two checks will
eliminate unnecessary checks, thus improving the performance. Improving performance will be the first objective of future work, because it will enable the other
future goals to be achieved while maintaining the program's usability.
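The reordering idea can be illustrated with a short sketch. The predicates and the traffic mix are hypothetical; the point is only that placing the check that suppresses the most packets first lets short-circuit evaluation skip the later, possibly costlier, check.

    def in_whitelist(pkt):
        return pkt.get("dst") in {"10.0.0.54"}    # assumed to be the more frequently true check

    def is_icmp_ping(pkt):
        return pkt.get("proto") == "icmp-echo"    # assumed to be the rarer check

    def suppress(pkt):
        # If whitelisted packets are more common than ICMP echo packets, testing
        # the whitelist first avoids running the ICMP check on most packets.
        return in_whitelist(pkt) or is_icmp_ping(pkt)

    print(suppress({"dst": "10.0.0.54", "proto": "tcp"}))   # True; the ICMP check never ran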
Another possible solution to improve performance is to port the program
into the Snort IDS. Snort is widely used in the security industry and has a lot
of developer and community support. For it to be useful it needs to be able to
perform its tasks at a high rate. The community of developers has already put
a lot of effort toward improving the performance of Snort, which could be used
to the advantage of this research. Individual parts of the entire program could
be ported to Snort. Snort could also be used to perform the decoding explained
in Section 4.1.1. Rules can be written to perform the checks of the packets and
sessions. A more interesting possibility is writing a dynamic module to perform
some of the calculations in ExFILD, specifically the entropy calculations. Porting
the algorithms in ExFILD from Perl into Snort could allow for better performance
and a more stable system.
7.2 Handle A Network
The program currently only inspects the outgoing traffic of a single host at
a time. Under normal circumstances administrators would be more interested in all
of the traffic leaving a network, not just a single host. They would only check a
single host when its activities have been identified as suspicious. If an administrator
wants to look at more than one host’s outgoing traffic, ExFILD will have to be
run multiple times with different IP addresses as the inputs. It would be useful
to add the capability of looking at an entire network’s outgoing traffic. This could
be accomplished by allowing multiple IP addresses or the network’s subnet to be
inputted to the program. ExFILD would then ignore intranet traffic between all of
the local network’s hosts and inspect all of the outgoing traffic remaining. It can
also be accomplished by altering ExFILD to be able to run on the single gateway
leaving the network or running multiple instances of ExFILD on all of the gateways
between the Internet and the intranet.
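A minimal sketch of the subnet-based version is shown below, using Python's ipaddress module; the subnet and addresses are hypothetical, and ExFILD's actual Perl implementation would differ.

    import ipaddress

    LOCAL_NET = ipaddress.ip_network("10.0.0.0/24")   # hypothetical local network

    def leaves_network(src_ip, dst_ip):
        """True only for traffic that actually exits the local network."""
        src = ipaddress.ip_address(src_ip)
        dst = ipaddress.ip_address(dst_ip)
        return src in LOCAL_NET and dst not in LOCAL_NET

    print(leaves_network("10.0.0.101", "198.51.100.7"))   # True: inspect it
    print(leaves_network("10.0.0.101", "10.0.0.54"))      # False: intranet traffic, ignored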
7.3 Application Layer Decoding of Packets
One of the improvements that can increase the accuracy of the entropy cal-
culations is to decode the packets to the application layer. For example, ExFILD
will extract the payload of an HTTP packet together with its headers, since the HTTP protocol
is not decoded. The HTTP headers are large enough to make the entropy of the
payload look unencrypted when it is actually encrypted. Decoding the application layer protocols will remove the headers from the data and perform the entropy
calculation only on the payload. Integrating ExFILD into Snort could provide some
of the decoding functionality. In [39] it is discussed that Snort has some basic decoding capabilities for application protocols such as HTTP, FTP, DNS, and SSL/TLS.
For performance reasons, it would be wise to implement only the important or most
frequent application protocols. There may be some protocols where the headers do
not affect the payloads’ entropy values enough to use resources to decode it.
7.4 Comparison to Packet and Session Entropy
As described in Section 4.1, all of the packets and sessions were processed and
run through ExFILD. The entropy values were then calculated for both the packets
and sessions. Having both values provides the opportunity to create another feature
to be used within the tree. Comparing the entropy of a session to the entropies
of each packet contained within it could lead to interesting results. It is possible
that a single unencrypted packet is being hidden in an encrypted session or vice versa.
The session entropy value may match the expected range, but individual packets
within it do not. It may be troublesome performing the comparison this way. It is
also counterproductive, because sessions are looked at to remove alerts from small
packets. It may be more useful to look for outliers in each session and perform some
analysis and checks on that packet. Performing experiments to see if there is any
relationship between the packets and the session could prove to be worthwhile and
could provide more topics of research.
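One simple form of the comparison is sketched below: look for packets whose entropy is far from the entropy of their session. The 2.0 bits/byte gap and the sample values are arbitrary illustrative choices, not results from this thesis.

    def entropy_outliers(packet_entropies, session_entropy, gap=2.0):
        """Return per-packet entropies that differ from the session entropy by more than 'gap'."""
        return [e for e in packet_entropies if abs(e - session_entropy) > gap]

    # A mostly unencrypted session hiding one packet that looks encrypted:
    print(entropy_outliers([4.4, 4.6, 7.8, 4.5], session_entropy=4.9))   # [7.8]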
7.5 Compressed File Analysis
ExFILD currently inspects traffic with highly random payloads for com-
pressed files. If a compressed file is found using the magic file numbers and an
HTTP string, it is assumed that the file is a properly compressed file. More steps
can be taken to ensure this is a compressed file and that its contents are allowed to
leave the network. The first step would be to extract the entire file from the network
traffic, which will require the application layer protocol to be completely decoded.
The next step would be to verify that the file structure matches the specification for
the file type. The most straightforward way to do this is by decompressing the file.
A file that decompresses successfully will have the correct file structure, while a file
that does not decompress successfully has an incorrect file structure. If it has an
incorrect file structure, it should automatically be marked as suspicious. If it can
be decompressed, checks should be run against its contents.
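A sketch of the decompression check is given below for the gzip case; reassembling the file from the session is assumed to happen elsewhere, and other formats such as ZIP would need their own magic numbers and handling.

    import gzip
    import zlib

    GZIP_MAGIC = b"\x1f\x8b"

    def verify_gzip(blob):
        """Return the decompressed contents, or None if the file structure is wrong."""
        if not blob.startswith(GZIP_MAGIC):
            return None                      # does not even claim to be gzip
        try:
            return gzip.decompress(blob)     # contents can now be checked for censored words
        except (OSError, EOFError, zlib.error):
            return None                      # claims to be gzip but will not decompress: suspicious

    sample = gzip.compress(b"quarterly report, nothing sensitive")
    print(verify_gzip(sample) is not None)            # True: structure is valid
    print(verify_gzip(GZIP_MAGIC + b"\x00" * 20))     # None: mark the session as suspicious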
Compressing files and sending them out of the network is a common way to
exfiltrate data, because it’s easy and the contents are rarely inspected. The checks of
the contents should include ones similar to those in the corporate watcher program
that search for censored words and files. The checks performed on the packets and
sessions could also be applied to the contents of the compressed file. The main idea
is to make sure that an attacker is not just compressing data and exfiltrating it. The
packets and sessions containing the compressed file will be marked as not suspicious
if it passes all of these checks.
7.6 Behavioral Analysis
Another interesting piece of research to pursue is taking into account the
host’s normal behavior. Each host has a normal set of applications run on them,
which will create outgoing traffic with certain sets of characteristics. The characteristics can be used to create a profile or signature for the host’s typical outgoing
traffic. A profile may be more appropriate, since a signature is usually very specific
83
and a profile tends to be broader. A single host’s profile can be used to compare
against outgoing traffic and find some characteristics that stand out from the normal use. For example consider a host that only checks email and browses the web
in normal use, and one day it begins transferring a large amount of data out of the
network. When comparing that host’s profile to the outgoing traffic it will flag the
traffic. However, comparing a FTP server’s profile to this traffic may not show any
suspicious characteristics. Integrating the host’s normal use into ExFILD could lead
to more accurate results when looking for data being exfiltrated.
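A toy sketch of such a profile is shown below, comparing a host's outgoing byte count against its historical average; the byte counts and the factor of three are hypothetical choices for illustration only.

    def exceeds_profile(history_bytes, current_bytes, factor=3.0):
        """Flag the current period if it sends far more data than the host's baseline."""
        baseline = sum(history_bytes) / len(history_bytes)
        return current_bytes > factor * baseline

    daily_bytes = [120_000, 95_000, 110_000]        # typical email and web browsing
    print(exceeds_profile(daily_bytes, 2_500_000))  # True: flag for inspection
    print(exceeds_profile(daily_bytes, 130_000))    # False: within normal use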
7.7 Additional Tools
All of the other sections for future work deal with improving the detection
of data being exfiltrated from a network. There needs to be work performed in
the area of analyzing the results of the alerts and determining what caused the
alert. The tools from Chapter 3 that were used to develop ExFILD can be used
to analyze the cause of the alerts. However, during this research it was realized
that more tools would be useful. The verification of the packet captures containing
communications for Zeus revealed a need for another useful tool. A tool that is
able to extract all of the HTTP GET and POST commands would make it easier
to identify communication channels over HTTP. In the case of Zeus, this tool could
be used to identify the name of the configuration file downloaded and the hosts who
have used HTTP POST commands to exfiltrate data. It could also show any files
that were downloaded by the attacker using HTTP GET commands, which would
help to explain what activities the attacker is performing.
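A starting point for such a tool is sketched below; it assumes the session payloads have already been reassembled (for example by the Session Extractor), handles only plain HTTP/1.x traffic, and uses a hypothetical sample payload.

    import re

    REQUEST_LINE = re.compile(rb"^(GET|POST) (\S+) HTTP/1\.[01]\r\n", re.MULTILINE)
    HOST_HEADER = re.compile(rb"^Host: (\S+)\r\n", re.MULTILINE)

    def extract_requests(payloads):
        """Yield (method, host, uri) for every GET/POST found in the payloads."""
        for data in payloads:
            host = HOST_HEADER.search(data)
            for method, uri in REQUEST_LINE.findall(data):
                yield (method.decode(),
                       host.group(1).decode() if host else "?",
                       uri.decode())

    payloads = [b"POST /gate.php HTTP/1.1\r\nHost: evil.example\r\nContent-Length: 4\r\n\r\nabcd"]
    print(list(extract_requests(payloads)))   # [('POST', 'evil.example', '/gate.php')]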
BIBLIOGRAPHY
[1] White House. Cyberspace Policy Review: Assuring a Trusted and Resilient Information and Communications Infrastructure, May 2009. http://www.whitehouse.gov/assets/documents/Cyberspace_Policy_Review_final.pdf.

[2] Joe Stewart. Top Botnets Exposed. April 2008. http://www.secureworks.com/research/threats/topbotnets/.

[3] Mark Landler and John Markoff. Digital Fears Emerge After Data Siege in Estonia. The New York Times, May 2007. http://www.nytimes.com/2007/05/29/technology/29estonia.html.

[4] Annarita Giani, Vincent H. Berk, and George V. Cybenko. Data exfiltration and covert channels. In Sensors, and Command, Control, Communications, and Intelligence (C3I) Technologies for Homeland Security and Homeland Defense V, 2006.

[5] Fouad Kiamilev, Ryan Hoover, Ray Delvecchio, Nicholas Waite, Stephen Janansky, Rodney McGee, Corey Lange, and Michael Stamat. Demonstration of Hardware Trojans. In DEFCON conference, August 2008.

[6] Netcat. http://netcat.sourceforge.net/.

[7] B. Stone-Gross, M. Cova, L. Cavallaro, B. Gilbert, M. Szydlowski, R. Kemmerer, C. Kruegel, and G. Vigna. Your Botnet is My Botnet: Analysis of a Botnet Takeover. In Proceedings of the 16th ACM Conference on Computer and Communications Security, pages 635–647, Chicago, IL, November 2009. ACM.

[8] Toni Koivunen. Calculating the Size of the Downadup Outbreak. F-Secure Weblog: News from the Lab, January 2009. http://www.f-secure.com/weblog/archives/00001584.html.

[9] Jhind. Catching DNS Tunnels with A.I. In DEFCON conference, July 2009.

[10] Kerry S. Long. Catching the Cyber Spy: ARL's Interrogator. December 2004.

[11] Jon Oberheide, Evan Cooke, and Farnam Jahanian. CloudAV: N-Version Antivirus in the Network Cloud. In Proceedings of the 17th USENIX Security Symposium, San Jose, CA, July 2008.

[12] August Cole, Yochi Dreazen, and Siobahn Gorman. Computer Spies Breach Fighter-Jet Project. The Wall Street Journal, April 2009. http://online.wsj.com/article/SB124027491029837401.html.

[13] Jaikumar Vijayan. Heartland data breach could be bigger than TJX's: This recent incident suggests cybercrooks have shifted to targeting payment processors. Computer World, January 2009. http://www.computerworld.com/s/article/9126379/Heartland_data_breach_could_be_bigger_than_TJX_s.

[14] Office of Public Affairs. Alleged International Hacker Indicted for Massive Attacks on U.S. Retail and Banking Networks: Data Related to More Than 130 Million Credit and Debit Cards Allegedly Stolen. United States Department of Justice, August 2009. http://www.justice.gov/opa/pr/2009/August/09-crm-810.html.

[15] Tor. http://www.torproject.org/.

[16] Karsten Loesing, Steven J. Murdoch, and Roger Dingledine. A Case Study on Measuring Statistical Data in Tor Anonymity Network. Accepted for publication at the Workshop on Ethics in Computer Security Research (WECSR 2010), Tenerife, Spain, January 2010.

[17] Miniwatts Marketing Group. World Internet Usage And Population Statistics. Internet World Stats, 2010. http://www.internetworldstats.com/stats.htm.

[18] Snort. http://www.snort.org/.

[19] Richard Bejtlich. Advice for Academic Researchers. TaoSecurity, February 2010. http://taosecurity.blogspot.com/2010/02/advice-for-academic-researchers.html.

[20] Nabil Schear, Carmelo Kintana, Qing Zhang, and Amin Vahdat. Glavlit: Preventing Exfiltration at Wire Speed. In Proceedings of the 5th ACM Workshop on Hot Topics in Networks (HotNets-V), Irvine, CA, November 2006.

[21] Yali Liu, Cherita Corbett, Rennie Archibald, Biswanath Mukherjee, and Dipak Ghosal. SIDD: A Framework for Detecting Sensitive Data Exfiltration by an Insider Attack. In Proceedings of the 42nd Annual Hawaii International Conference on System Science, pages 1–10, January 2009.

[22] Tcpdump. http://www.tcpdump.org/.

[23] Wireshark. http://www.wireshark.org/.

[24] Boris Zentner. Geo::IP. CPAN, 2009. http://search.cpan.org/dist/Geo-IP/lib/Geo/IP.pm.

[25] GeoLite City. Maxmind, March 2010. http://www.maxmind.com/app/geolitecity.

[26] GeoLite Country. Maxmind, March 2010. http://www.maxmind.com/app/geolitecountry.

[27] Port Numbers. IANA, March 2010. http://www.iana.org/assignments/port-numbers.

[28] C. E. Shannon. Prediction and Entropy of Printed English. The Bell System Technical Journal, 30:50–64, 1951.

[29] T. Dierks and C. Allen. The TLS Protocol Version 1.0 (RFC 2246). IETF, January 1999. http://www.ietf.org/rfc/rfc2246.txt.pdf.

[30] Michael Ligh. 12b0c78f05f33fe25e08addc60bd9b7c.pcap. OpenPacket.org, May 2008. https://www.openpacket.org/capture/grab/33.

[31] Cody Pierce. Owning Kraken Zombies, a Detailed Dissection. DVLabs, April 2008. http://dvlabs.tippingpoint.com/blog/2008/04/28/owning-kraken-zombies.

[32] Paul Royal. On the Kraken and Bobak Botnets. Damballa, April 2008. http://www.damballa.com/downloads/press/Kraken_Response.pdf.

[33] JJ Cummings. zeus-sample-1.pcap. OpenPacket.org, March 2010. https://www.openpacket.org/capture/grab/67.

[34] JJ Cummings. zeus-sample-2.pcap. OpenPacket.org, March 2010. https://www.openpacket.org/capture/grab/68.

[35] JJ Cummings. zeus-sample-3.pcap. OpenPacket.org, March 2010. https://www.openpacket.org/capture/grab/69.

[36] Doug Macdonald. Zeus: God of DIY Botnets. FortiGuard, October 2009. http://www.fortiguard.com/analysis/zeusanalysis.html.

[37] kowsik. blackworm-clean-install.pcap. pcapr.net, May 2009. http://www.pcapr.net/view/kowsik/2009/4/5/11/blackworm-clean-install.pcap.html.

[38] F-Secure Virus Information Pages: Nyxem.E. F-Secure, January 2006. http://www.f-secure.com/v-descs/nyxem_e.shtml.

[39] The Snort Project. SNORT Users Manual 2.8.5. October 2009. http://www.snort.org/assets/125/snort_manual-2_8_5_1.pdf.
Appendix A
EXPERIMENTS
Table A.1: Annotation of the Control Data Set
Wireshark
Activity
Time
-108
Start Movie
0
Start Wireshark
12
Open Safari
22
Open CNN.com
36
Click a link for SI.com
52
Go back to CNN.com
61
Click CNN link to “Terror task force arrests 2 in NY”
88
Check Gmail (Webmail)
100
Clicked on new email
111
Downloaded PDF attachment
118
Downloaded PDF attachment
129
Clicked back to inbox
132
Sign out of Gmail
139
Quit Safari
144
Open Mail.app
187
Send email
225
Open Safari
232
Open Twitter.com
245
Signing into Twitter
259
Clicked the more button on Twitter
276
Clicked the more button on Twitter again
295
Opened new tab with user1’s Twitter profile
299
Opened new tab with woot.com
305
Opened new tab with link for patrick.net
312
Clicked on Twitter DM
321
Clicked on user2’s Twitter profile
325
Clicked back (to Twitter DM)
328
Deleted DM from user2
335
Signed out of Twitter
336
Closed Twitter Tab
341
Closed tab with user1’s profile
346
Closed tab with woot.com
357
Quit Safari
366
Opened Twitteriffic
370
Loaded my mentions in Twitteriffic
372
Loaded the rest of the tweets
403
Refreshed Twitteriffic
411
Closed Twitteriffic
419
Opened Live mesh
432
Live Mesh connected
440
Live mesh begins uploading
447
Live mesh finished uploading
565
Live Mesh updated 19 files
585
Uploaded TEST FOR DATASET LIVE MESH.rtf to Live Mesh
using desktop client
602
Quit Live Mesh
624
Dropbox Opened
628
Dropbox finished updating
659
Uploaded TEST FOR DATASET DROP BOX.rtf to Dropbox using desktop client
704
Quit Dropbox
718
Opened iTunes
748
Refreshed iTunes Podcasts
756
Refreshed iTunes Podcasts
781
Closed iTunes
937
Ran unencrypted FTP 100 bytes
947
Ran unencrypted FTP 1000 bytes
960
Ran unencrypted FTP 4000 bytes
974
Ran unencrypted FTP 8041 bytes
1056
Ran encrypted FTP 100 bytes
1066
Ran encrypted FTP 1000 bytes
1074
Ran encrypted FTP 4000 bytes
1085
Ran encrypted FTP 8041 bytes
1145
Start Skype
1180
Login into Skype
1211
Skype test call
1220
Start recording playback message
1234
Start Skype playback
1244
End Skype playback
1263
End Skype call
1284
Quit Skype
1327
ssh to 10.0.0.54
1332
Entered password for 10.0.0.54
1340
ls 10.0.0.54
1370
sudo apt-get update (10.0.0.54)
1377
sudo apt-get upgrade (10.0.0.54)
1462
screen -r (10.0.0.54)
1489
exit (10.0.0.54)
1491
ssh to 10.0.0.53
1494
Entered password for 10.0.0.53
1497
w
1509
sudo apt-get update (10.0.0.53)
1516
sudo apt-get upgrade (10.0.0.53)
1581
screen -r (10.0.0.53)
1617
ssh back to host (10.0.0.53)
1624
send password to host (10.0.0.53)
1628
ls (10.0.0.53)
1645
exit host (10.0.0.53)
1650
exit 10.0.0.53
1659
netstat -an
1702
Started iChat
1760
Messaged user3
1772
Messaged user3
1783
user3 messaged me
1788
Messaged user3
1797
user3 messaged me
1798
Messaged user3
1818
user3 messaged me
1824
Messaged user3
1913
Messaged user3
1934
Started safari
1965
Browsing through bank site
1993
ping Google
1996
Stop ping
2009
Open cnn.com
2024
Open bank1 site
2039
Put in login for bank1
2052
Logged into bank account
2076
Load more entries
2090
Load transfer history
2095
Load account overview
2112
Close bank1
2125
Open bank2
2131
Reload with flash
2152
Put in user name
2168
Enter security questions
2197
Enter pin
2199
Browsing bank2
2246
Updating preferences in bank2
2335
Updated pin, rejected pin for being too short
2359
Submitted new password
2366
Logged out of bank2
2371
Quit Safari
2393
Opened chicken of the vnc
2407
Allowed connection to cotvnc
2445
VNC to 10.0.0.54
2460
Ran update manager
2499
ssh to blah
2506
ls
2529
exit ssh
2538
Close vnc connection
2542
Exit cotvnc
2554
Open Colloquy
2596
Join lug irc
2629
Message winter
2654
PM user4 user5
2692
PM user4 user5 smiley
2701
Closed pm
2706
Get info user5
2716
Get info user4
2722
Get info user6
2735
Close Winter
2740
Close irc
2755
Open safari
2766
Open Google Reader
2774
Login into Google
2784
Load UDel news
2792
Mark all read
2799
Load MAC folder
2822
Open CES power tablet in a new tab
2850
Open Deal Brothers in a new tab
2862
Open CES Dell in a new tab
2892
Close CES power tablet
2908
Close Deal Brothers
2923
Close CES Dell
2927
Open security
2953
Open 768 bit RSA in a new tab
2980
Open industry group plans cyber in a new tab
3024
Open Liquidmatrix in Google Reader
3054
Open CES security cam in a new tab
3057
Mark all read
3093
Open tech folder
3239
Open BSOD in a new tab
3298
Open Netflix slashdot.com in a new tab
3304
Ftp encrypted 100 bytes to 10.0.0.54
3336
Open a hack-a-day in new tab
3339
Ftp encrypted 1000 bytes to 10.0.0.54
3359
Close Hack-A-Day article
3354
Close slashdot.com article
3373
Close BSOD article
3386
Close security cam article
3400
Close cyber plan article
3426
Open RSA paper site
3428
Ftp encrypted 4000 bytes to 10.0.0.54
3444
Ftp encrypted 8041 bytes to 10.0.0.54
3448
Ftp encrypted 8041 bytes to 10.0.0.54
3452
Open RSA paper
Continued on Next Page. . .
96
Table A.1 – Continued
Wireshark
Activity
Time
3457
Download RSA paper
3477
Close RSA paper
3478
Close 768 bit
3484
Open tech
3490
Mark all read
3492
Open Yahoo Sports
3496
Mark all read
3498
Open Vikings
3501
Mark all read
3503
Open Vikings
3520
Mark all read
3531
Open YouTube
3549
Load Josh Wilson video
3639
Clicked more information on side bar
3714
Ftp encrypted 100 bytes to 10.0.0.54
3734
Clicked back on YouTube
3741
Clicked Music
3771
Clicked Indie Music
3788
Clicked Bon Iver - Flume video
3799
Ftp encrypted 100 bytes to 10.0.0.54
3827
Ftp encrypted 1000 bytes to 10.0.0.54
3859
Ftp encrypted 4000 bytes to 10.0.0.54
3884
Ftp encrypted 8041 bytes to 10.0.0.54
Continued on Next Page. . .
97
Table A.1 – Continued
Wireshark
Activity
Time
3899
Quit safari
3906
Open iTunes
3924
Refresh podcasts
3949
Refresh iTunes U
3962
Close iTunes
3989
Stop Wireshark
Appendix B
ENTROPY PLOTS FOR DATA SETS
Note: All packets or sessions with a payload of size 0 have been omitted from
the plots in this appendix.
Figure B.1: Packet Entropies for the Control Data Set
Figure B.2: Session Entropies for the Control Data Set
Figure B.3: Packet Entropies for Data Set #1 (First Plot)
Figure B.4: Packet Entropies for Data Set #1 (Second Plot)
Figure B.5: Session Entropies for Data Set #1
Figure B.6: Packet Entropies for Data Set #2
Figure B.7: Session Entropies for Data Set #2
Figure B.8: Packet Entropies for the Kraken Data Set
Figure B.9: Session Entropies for the Kraken Data Set
Figure B.10: Packet Entropies for Zeus Data Set #1
Figure B.11: Session Entropies for Zeus Data Set #1
Figure B.12: Packet Entropies for Zeus Data Set #2
Figure B.13: Session Entropies for Zeus Data Set #2
Figure B.14: Packet Entropies for Zeus Data Set #3
Figure B.15: Session Entropies for Zeus Data Set #3
Figure B.16: Packet Entropies for the Blackworm Data Set
Figure B.17: Session Entropies for the Blackworm Data Set
Appendix C
ALERTS FOR DATA SETS
Figure C.1: Session Alerts for the Control Data Set
Figure C.2: Session Alerts for Data Set #2
Figure C.3: Session Alerts for the Kraken Data Set
Figure C.4: Session Alerts for Zeus Data Set #1
Figure C.5: Session Alerts for Zeus Data Set #2
Figure C.6: Session Alerts for Zeus Data Set #3
Figure C.7: Session Alerts for the Blackworm Data Set
Appendix D
VERIFICATION OF MALWARE PACKET CAPTURES
D.1 Kraken Packet Capture
The packet capture came from an open repository on the Internet, which raises the
question of whether the source is credible. It therefore needs to be verified that
the packet capture actually contains traffic from the Kraken botnet. The first step
was to verify that there were conversations between the infected host and a server
serving on port 447. The packet capture was run through the Session Extractor
program from Section 3.5, which found 16 sessions matching this criterion. The
matching sessions are shown in Figure D.1.
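
As an illustration of this step, the sketch below shows one way such a port-447
check could be scripted. It is not the Session Extractor from Section 3.5; it is a
minimal re-creation that assumes the scapy library is available, that the capture is
stored in a hypothetical file named kraken.pcap, and that grouping by host pair
(rather than by full TCP session) is sufficient for a quick check.

    # Minimal sketch (not the thesis tool): list host pairs that exchange
    # TCP traffic with a server on port 447 in a packet capture.
    from collections import defaultdict
    from scapy.all import rdpcap, IP, TCP

    SUSPECT_PORT = 447            # Kraken command and control port
    sessions = defaultdict(int)   # (infected host, server) -> packet count

    for pkt in rdpcap("kraken.pcap"):   # hypothetical file name
        if IP in pkt and TCP in pkt:
            ip, tcp = pkt[IP], pkt[TCP]
            if tcp.dport == SUSPECT_PORT:
                sessions[(ip.src, ip.dst)] += 1
            elif tcp.sport == SUSPECT_PORT:
                sessions[(ip.dst, ip.src)] += 1

    for (host, server), count in sorted(sessions.items()):
        print(f"{host} -> {server}:{SUSPECT_PORT} ({count} packets)")
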
The next step was to verify that the infected host was communicating with the
hosts listed in [32]. The DNS Extractor from Section 3.4 was run against the data
set, and the results are shown in Figure D.2. The domain names hmhxnupkc.mooo.com
and bdubefoeug.yi.org appear in the list of command and control servers in [32].
Further inspection of the packet capture revealed another domain name that did
not resolve at the time: rffcteo.dyndns.org, which is also listed in [32] as a host
name for the Kraken botnet. Finally, entropy calculations were performed on all
the packets contained in the sessions from Figure D.1. All of the packets appear
to be encrypted, with entropies greater than 7.45 bits/byte. Taken together, this
information confirms with confidence that the packet capture contains traffic from
the Kraken botnet.
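
The entropy figure quoted above can be reproduced with a short byte-frequency
calculation. The sketch below is a generic Shannon entropy routine in bits per
byte, not the implementation used elsewhere in this thesis; the 7.45 bits/byte value
is simply what was observed for the Kraken session packets.

    import math
    from collections import Counter

    def payload_entropy(data: bytes) -> float:
        """Shannon entropy of a payload, in bits per byte (0.0 through 8.0)."""
        if not data:
            return 0.0
        total = len(data)
        counts = Counter(data)
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    # Encrypted or well-compressed payloads sit close to 8 bits/byte.
    print(payload_entropy(bytes(range(256)) * 4))   # 8.0, uniformly distributed bytes
    print(payload_entropy(b"A" * 1024))             # 0.0, a constant payload

Payloads that score above roughly 7.45 bits/byte, as all of the Kraken session
packets do, are consistent with encrypted traffic.
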
Figure D.1: Sessions from the Kraken Data Set to Servers on Port 447
Figure D.2: DNS Names Resolved from the Kraken Data Set
D.2 Zeus Packet Captures
The three packet captures for Zeus were obtained from the same repository as the
Kraken packet capture, so their integrity should also be checked. The verification
of these packet captures will be done by hand, since no tools have been written at
this time to extract the HTTP GET and POST commands from a packet capture.
Because all of the packet captures were submitted by the same author, we assume
that either all of the files are from Zeus or none of them are. The verification
process will be based on the “BOTNET COMMUNICATIONS” section in [36]. The
basic steps will be to search for the HTTP GET command that downloads the
configuration file, the HTTP OK response confirming the download of the
configuration file, the HTTP POST commands responsible for exfiltrating the data,
and the HTTP OK responses containing information or commands for the bot.
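
Although no extraction tool existed at the time, the request lines that the manual
check looks for can be pulled out of a capture with a few lines of code. The sketch
below is one possible approach, not a tool from this thesis; it assumes scapy, a
hypothetical file name zeus1.pcap, and that each request or status line fits within
a single TCP segment.

    # Sketch: print HTTP GET/POST request lines and HTTP response status lines
    # found in individual TCP payloads of a capture.
    from scapy.all import rdpcap, IP, TCP, Raw

    for number, pkt in enumerate(rdpcap("zeus1.pcap"), start=1):  # hypothetical file
        if IP in pkt and TCP in pkt and Raw in pkt:
            payload = bytes(pkt[Raw].load)
            if payload.startswith((b"GET ", b"POST ", b"HTTP/1.")):
                first_line = payload.split(b"\r\n", 1)[0].decode(errors="replace")
                print(f"packet {number}: {pkt[IP].src} -> {pkt[IP].dst}  {first_line}")
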
D.2.1 Zeus #1
Figure D.3 shows the packets that are responsible for the infected host downloading
the configuration file from the command and control server. Packet number 4 is the
HTTP GET command requesting the configuration file, named cfg3.bin. It is
important to note that the creator of the Zeus executable can customize the name
of the configuration file. Packet numbers 5 through 8 are responsible for downloading
the configuration file. Packet number 9 is the command and control server
acknowledging the download of the configuration file.
Figure D.4 shows the HTTP POST commands and the server’s HTTP OK responses.
Packet numbers 17 and 18 are the HTTP POST commands used to exfiltrate data
to the server. The file being sent is named stat1.php, a name that can also be
customized by the creator of the Zeus executable. Packet numbers 21 and 25 are
the HTTP OK responses from the server. Packet 21 contains a larger-than-normal
payload, meaning it most likely carries a command sent from the server.
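
The size cue used here can be written down as a simple check. The helper below is
only an illustration of the reasoning, not part of ExFILD; the 100-byte cut-off is an
assumed value for the example and was not taken from the Zeus captures.

    # Illustrative heuristic: an HTTP 200 OK whose body is unusually large for
    # a command and control channel probably carries a command rather than a
    # bare acknowledgement.
    def http_ok_carries_command(payload: bytes, body_threshold: int = 100) -> bool:
        if not payload.startswith(b"HTTP/1."):
            return False
        header, sep, body = payload.partition(b"\r\n\r\n")
        return len(body) > body_threshold
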
Figure D.3: HTTP GET Command Downloading the Configuration File
Figure D.4: HTTP POST Commands and HTTP OK Responses
Figure D.5 shows all of the HTTP communications from Zeus in this packet
capture, except for the packets carrying the files to and from the bot and the server.
It includes the HTTP GET and POST commands as well as the HTTP OK responses. Note that there are 13 HTTP POST commands exfiltrating data in this
packet capture. The packet capture shows all of the activities described in [36],
meaning that it contains traffic from Zeus.
Figure D.5: Zeus HTTP Communications
D.2.2 Zeus #2
Figure D.6 shows the HTTP GET packet that requests the configuration file, named
ribbon.tar. Figure D.7 displays the command and control server’s HTTP OK
response to the request to download the configuration file.
Figure D.6: HTTP GET Command Requesting the Configuration File
Figure D.7: HTTP OK Response to the Request for the Configuration File
Figure D.8 shows the HTTP POST commands and the server’s HTTP OK responses.
Packet numbers 54 and 57 are the HTTP POST commands used to exfiltrate data
to the server. The file being sent is named index1.php. Packet numbers 60 and 64
are the HTTP OK responses from the server. Packet 60 contains a smaller payload
than the one seen in Zeus Data Set #1, meaning it most likely does not carry a
command from the server.
Figure D.9 shows all of the HTTP communications from Zeus in this packet
capture, except for the packets carrying the files to and from the bot and the server.
It includes the HTTP GET and POST commands as well as the HTTP OK responses. Note that there are 3 HTTP POST commands exfiltrating data in this
packet capture. The packet capture shows all of the activities described in [36],
meaning that it contains traffic from Zeus.
Figure D.8: HTTP POST Commands and HTTP OK Responses
Figure D.9: Zeus HTTP Communications
D.2.3 Zeus #3
Figure D.10 shows the HTTP GET packet that requests the configuration file, named
kartos.bin.
Figure D.10: HTTP GET Command Requesting the Configuration File
Figure D.11 displays the command and control server’s HTTP OK response
to the request to download the configuration file.
Figure D.11: HTTP OK Response to the Request for the Configuration File
Figure D.12 shows the HTTP POST commands and the server’s HTTP OK responses.
Packet numbers 239 and 240 are the HTTP POST commands used to exfiltrate data
to the server. The file being sent is named youyou.php. Packet numbers 244 and
246 are the HTTP OK responses from the server. Packet 246 contains a larger
payload, meaning it most likely carries a command sent from the server.
Figure D.12: HTTP POST Commands and HTTP OK Responses
Figure D.13 shows all of the HTTP communications from Zeus in this packet
capture, except for the packets carrying the files to and from the bot and the server.
It includes the HTTP GET and POST commands as well as the HTTP OK responses. Note that there are 4 HTTP POST commands exfiltrating data in this
packet capture, but packet numbers 244 and 389 are from the same session. The
packet capture shows all of the activities described in [36], meaning that it contains
traffic from Zeus.
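
Recognizing that two packets belong to the same session, as with packet numbers
244 and 389 here, amounts to keying packets on their endpoint pairs. The sketch
below shows one direction-independent way to do that; it again assumes scapy and
a hypothetical file name zeus3.pcap, and it ignores reuse of the same port pair by
later connections.

    # Sketch: group packet numbers by TCP session, keyed on the unordered pair
    # of (address, port) endpoints so both directions map to the same session.
    from collections import defaultdict
    from scapy.all import rdpcap, IP, TCP

    sessions = defaultdict(list)
    for number, pkt in enumerate(rdpcap("zeus3.pcap"), start=1):  # hypothetical file
        if IP in pkt and TCP in pkt:
            a = (pkt[IP].src, pkt[TCP].sport)
            b = (pkt[IP].dst, pkt[TCP].dport)
            sessions[tuple(sorted((a, b)))].append(number)

    for endpoints, numbers in sessions.items():
        print(endpoints, numbers)
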
Figure D.13: Zeus HTTP Communications
D.3 Blackworm
As with the other malware packet captures, this packet capture comes from a
repository to which anyone can upload captures, so it needs to be verified that the
capture matches the description of Blackworm. The verification was performed by
hand by inspecting the packet capture. Figure D.14 shows the infected host
connecting to the victim host and enumerating all of its network shares. Packet
number 23 shows that the infected host’s name is BLACKWORM and the victim
host’s name is BLACKWORM-VICTI. Packet numbers 39 and 40 are responsible for
the enumeration of the network shares.
Figure D.15 displays the packets querying for the security vendors’ program folders.
The folders being requested match those listed in [38]. In this packet capture, none
of the folders were found on the victim host.
Figure D.14: Connection to Victim and Enumerating Its Network Shares
Figure D.15: Searching for Security Vendors’ Program Folders
Figure D.16 and Figure D.17 show the copying of the Blackworm executable to
different file locations, representing the spreading of the worm onto the victim host.
In Figure D.17 the executable is disguised as a file that may normally be present,
and Figure D.18 shows the packet that deletes the file the worm is imitating.
Figure D.16: Copying Blackworm Executable to Victim
Figure D.17: Copying Blackworm Executable to Victim
Figure D.18: Deleting Startup Link from Victim
Figure D.19 displays the infected host creating a job that will run the executable.
Once this job has been created, the executable will run at a later time and the
infection is complete. The packet capture shows the activities described in [38],
meaning that it contains traffic from Blackworm.
Figure D.19: Creating a Job to Run the Blackworm Executable