University of Mannheim
Laboratory for Dependable Distributed Systems
Diploma Thesis
Monkey-Spider:
Detecting Malicious Web Sites
Ali Ikinci
May 23, 2007
First Examiner: Prof. Dr.-Ing. Felix C. Freiling
Second Examiner: Prof. Dr.-Ing. Wolfgang Effelsberg
Advisor: Dipl.-Inform. Thorsten Holz
I hereby declare that I have written this thesis independently, that I have used no sources or aids other than those indicated, and that I have marked all quotations as such.

_________________________
Ali Ikinci
Mannheim, May 23, 2007
Abstract
Client-side attacks are on the rise, and malicious Web sites pose a serious threat to client security. There is neither a freely available, comprehensive database of threats on the Web nor sufficient freely available tools to build one. This work introduces the Monkey-Spider Internet analysis framework. Using it as a client honeypot, we portray the challenges of such an approach and evaluate our system as a high-speed, Internet-scale analysis tool for building a database of threats found in the wild. Furthermore, we present the evaluation of this system on different crawls over two months, along with the lessons learned. Additionally, alternative use cases for our framework are presented.
Zusammenfassung
Client-side attacks are on the rise. Malicious Web sites pose a serious threat to client security. There is neither a freely available database of the threats on the Web nor sufficient freely available tools to build one. This work introduces the Monkey-Spider system for analyzing the Internet; to this end, we deploy it as a client honeypot. We point out the challenges that have to be met and evaluate our system as a tool for high-performance, network-wide analysis aimed at building a database of threats in the wild. Furthermore, we present our evaluation from more than two months of operation and discuss the insights we gained during this time. In addition, we present further possible fields of application for our system.
Contents

1. Introduction
   1.1. Motivation
   1.2. Requirements Analysis
   1.3. Simplified Architecture
   1.4. Acknowledgements
2. Background
   2.1. Definition of Malware
   2.2. Types of Malware
      2.2.1. Computer Virus
      2.2.2. Computer Worm
      2.2.3. Trojan Horse
      2.2.4. Spyware and Adware
      2.2.5. Exploit
      2.2.6. Backdoor
      2.2.7. Botnet and Bots
   2.3. Detection of Malware
      2.3.1. Static Analysis
      2.3.2. Behavioral Analysis
   2.4. Honeypots vs. Honeyclients
      2.4.1. Honeypots
      2.4.2. Honeyclients
   2.5. Related Work
      2.5.1. HoneyMonkey
      2.5.2. Spycrawler
      2.5.3. SiteAdvisor
      2.5.4. HoneyC
      2.5.5. HoneyClient
      2.5.6. Shelia
      2.5.7. Capture - HPC
      2.5.8. Pezzonavante
   2.6. Spam and Malware
   2.7. Phishing
   2.8. Crawling the Web
3. Monkey-Spider
   3.1. System Architecture
   3.2. Implementation
   3.3. Seed Generation
      3.3.1. Web Search Seeding
      3.3.2. Spamtrap Seeding
      3.3.3. Blacklist Seeding
   3.4. Web Crawling
      3.4.1. Heritrix - a Web Crawler
      3.4.2. Configuring Crawl Jobs
      3.4.3. Heritrix Architecture
      3.4.4. The ARC File-Format
      3.4.5. The CDX File-Format
   3.5. Malware Detection
      3.5.1. ScanFolder
      3.5.2. Extractor
      3.5.3. Malware Scanner
      3.5.4. ClamAV
      3.5.5. CWSandbox
   3.6. Web Interface
   3.7. Malware Database
   3.8. Limitations
      3.8.1. Avoid Detection and Blacklisting
      3.8.2. Hidden Web
      3.8.3. Dynamically Created Content
      3.8.4. Independent Crawls
      3.8.5. HTTP Headers, Cookies and other Client Information
      3.8.6. Peer-to-Peer Networks
4. Evaluation
   4.1. Test Environment
   4.2. Crawls
   4.3. Performance
   4.4. Phishing Analysis
   4.5. Results
5. Future Work
   5.1. Queueing Additional Content
   5.2. Crawler Extensions
   5.3. Malware Analysis
   5.4. General Purpose
6. Summary and Conclusions
A. Database Scheme
B. JavaScript Code Obfuscation Sample
References
List of Figures

1. Simplified architecture of the Monkey-Spider system
2. Sample IDA Pro session [21]
3. Honeypots vs. honeyclients
4. The general honeyclient process chain
5. Sample SiteAdvisor site report
6. SiteAdvisor's rated Google search results
7. Example for an eBay phishing site [100] trying to steal passwords for identity theft
8. Monkey-Spider Architecture
9. Google suggestions for wallpaper
10. Comparison of search terms with Google Trends
11. The Heritrix Web interface console running a job
12. Simple CrawlController scheme [19]
13. Operation of the ScanFolder program
14. Operation of the ExtractArcfile program
15. Operation of the ClamAV program
16. The Monkey-Spider status page
17. The Monkey-Spider seeding page
18. The Monkey-Spider crawling page
19. Filetype distribution
20. Popular Google typos
21. Goole.com, a typosquatter site
22. Scheme of a hash function
List of Tables

1. Honeypot comparison
2. Comparison of different honeyclients
3. Options of the doGoogleSearch() function
4. Extractor modules contained in Heritrix
5. Contents of the CDX file-format
6. Extractor modules extraction efficiency
7. Top 20 malware types of our test crawl
8. Seed count per topic
9. Malware analysis performance per crawl
10. Topic based phishing site distribution
11. Unique phishing site types
12. Topic specific binary file maliciousness probability
13. Top 20 malware types
14. Top 20 malicious domain distribution
List of Listings

1. Application of the strings command
2. Example for a SOAP request
3. Example of a REST application
4. Our query to Google
5. Sample hosts file
6. The hostsFileSeed.sh seeder script
7. Example of an ARC-file record; the first part holds the information, the second part the content. Each ARC-file usually contains hundreds or thousands of records, depending on the aggregated file size
8. Example contents of a CDX-file with one contained file (trimmed)
9. A sample of extracted files
10. Example for a ClamAV report file
11. Demonstration of the re.split() function
12. The createvirdb.sql script to generate the Monkey-Spider database scheme
13. JavaScript code obfuscation
1. Introduction
"Between 2006 and 2010, the information added annually to the digital universe will increase more than sixfold, from 161 exabytes to 988 exabytes." [88]
The Internet is growing and evolving every day. More and more people are becoming part of the so-called Internet community. With this growth, the number of threats to these people is increasing as well. Criminals who want to destroy, cheat, con, and steal are evolving rapidly [102]. There is no comprehensive and free database for studying malicious Web sites found on the Internet. HoneyMonkey [104] and SiteAdvisor [34] are proprietary systems that take an approach towards building such research databases. However, they are not freely available for further research, and only limited access to their results is provided publicly.
In this work, we introduce the Monkey-Spider Internet analysis framework. The task we address is to build an easy-to-use infrastructure to detect and monitor malicious Web sites and to evaluate it on popular parts of the Internet. Malicious Web sites are Web sites that contain any kind of content which could be a threat to the security of the clients requesting these sites. The resulting system will hereafter be referred to as the Monkey-Spider.
There is an ongoing battle on the Internet between blackhats, the bad guys, and whitehats, the good ones [94]. The term blackhat refers to someone who compromises the security of a computer system without the permission of the owner in order to gain control over a system connected to a network.
Blackhats write malicious code and content and attack computer systems with criminal intent. The term whitehat is used in contrast to blackhat and denotes a person who is ethically opposed to the abuse of computer systems. Whitehats try to trace blackhats and unveil their activities in order to fight their threats and secure the Internet. The strategies for unveiling and tracing criminal activities on the Internet resemble criminal investigation in the real world. Techniques used are fast analysis methods, deception, infiltration, tracing, replaying criminal activities, as well as waiting and observing.
Among the best tools for the real-world analysis of blackhat activities and techniques are honeypot-based technologies.
A honeypot is an information system resource whose value lies in unauthorized or illicit use of that resource. (Lance Spitzner) [94]
Honeypots bait blackhats into malicious actions and then examine their behaviour and capture their tools. Honeypots can be distinguished as server honeypots and client honeypots. Server honeypots are computer systems offering different kinds of services, such as FTP or HTTP, to clients; they wait for clients to attack them. Client honeypots are clients requesting content from other servers and thereby prompt malicious servers to attack them. Starting from this idea, the Monkey-Spider system is used in our work as a client honeypot (or honeyclient) towards Web servers, to find, observe, and capture malicious activities on the publicly accessible Internet.
1.1. Motivation
"Speed is the essence of war. Take advantage of the enemy's unpreparedness;
travel by unexpected routes and strike him where he has taken no precautions." (Sun Tzu) [101]
Manual analyses of individually detected malware sites have appeared in recent years [66, 67]. These analyses are expensive, time-consuming, and far too few. Our aim is the automated large-scale analysis of millions of Web sites. To build an infrastructure for studying actual threats on the Internet, the resulting system should be able to answer, or help to answer, the following questions. With our evaluation in Chapter 4 we try to find answers to some of them:
• How much malware is on the Web?
• Where is that malware located?
• How are malicious Web sites related to each other?
• What fraction of binaries on the Internet are infected with malware?
• How are the threats changing over time?
• What is the topic-specific probability for an infected site?
• Which publicly unknown malware exists on the Internet?
We chose the name Monkey-Spider for our system. It stands for monkey, the automated analysis of Web sites, and spider, which is used synonymously for Web crawler, resulting in an automated analyzing crawler.
1.2. Requirements Analysis
Diving into this topic, we found it to be very interesting and to offer many directions for research. The ultimate idea would be to have a copy of the Internet Archive [23] and to do research on its huge collected content. Recognizing our available resources, those of a diploma thesis, we realized that we had to figure out requirements appropriate for the scope of this thesis. These requirements are the following:
• The most important requirement is the performance of the whole system, to achieve the analysis of millions of Web sites in a few days with few computers.
• Fast and easy implementation, to stay within the time allowed for a diploma thesis and to have enough time for an adequate evaluation.
• Free tool availability and usability.
• The resulting system should be easily extensible by other researchers.
• Generate representative results, thus analyze large portions of the Web.
• Detect zero-day exploits and publicly unknown malware.
• Generate signatures and detailed information about the examined objects and analyses.
• Queue any linked content to achieve full coverage.
Our solution proposal:
• Don't reinvent the wheel: make extensive use of available free open-source software [87] to minimize the implementation effort and facilitate extensibility of the used foreign parts.
• Keep the infrastructure free [82] so that others can build on it and do further research.
• Separate crawling and scanning to achieve high performance and scalability.
• Use a high-performance, multithreaded, archival-quality Web crawler.
• Avoid crawling overhead through advanced crawling techniques.
• Use extended link-extraction mechanisms.
• Use fast and easy malware detection methods to focus more on detection than on the full analysis of threats; therefore use a low-interaction approach.
• Focus on modularity to allow additions and further improvements.
• Use extensive prototyping, the process of quickly putting together a working model (a prototype), in order to test various aspects of a design, illustrate ideas or features, and gather early user feedback and results.
• Collect enough data to generate representative time-based and/or content-based statistics.
1.3. Simplified Architecture
"What is essential in war is victory, not prolonged operations." (Sun Tzu)
The main idea of our system is to first crawl the contents of a Web site and then to analyze the crawled content for maliciousness. Our system can be classified as a crawler-based low-interaction honeyclient. We did not want to analyze Web sites while surfing them with real Web browsers. The reason for not doing so is to have both tasks split up and optimized separately, namely browsing or crawling a Web site and analyzing or scanning the contents.
One of the main advantages of this partitioning is the ability to split up the operational parts. Hence the scanning and the crawling can be done on different machines, and the combination yields better performance. The scanning usually takes significantly more time and resources than the crawling. The crawling process can take as much processing time and memory as the Internet connection can bear and the tuning of the crawler allows.
Figure 1: Simplified architecture of the Monkey-Spider system
The Monkey-Spider is made up of two main tasks: crawling and scanning. First, a Web crawler collects resources from different parts of the Internet. After that, the collected content is checked with different malware scanners and analysis tools. Malicious content found during this investigation is preserved, and the collected information is stored in a database.
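A minimal Python sketch of this two-stage idea, not the actual Monkey-Spider implementation, could look as follows; it stands in for Heritrix with a plain urllib fetch and for the malware scanners with the ClamAV command-line scanner, and the directory and file names are assumptions made only for this illustration:

import pathlib
import subprocess
import urllib.request

CRAWL_DIR = pathlib.Path("crawled")   # assumed archive directory shared by both stages

def crawl(urls):
    """Stage 1: fetch content and archive it to disk for later offline scanning."""
    CRAWL_DIR.mkdir(exist_ok=True)
    for i, url in enumerate(urls):
        try:
            data = urllib.request.urlopen(url, timeout=10).read()
        except OSError:
            continue                          # unreachable hosts are simply skipped
        (CRAWL_DIR / "page-{}.html".format(i)).write_bytes(data)

def scan():
    """Stage 2: scan the archived content, decoupled from the crawl (here via clamscan)."""
    result = subprocess.run(
        ["clamscan", "--infected", "--no-summary", "-r", str(CRAWL_DIR)],
        capture_output=True, text=True)
    return result.stdout                      # one line per infected file

if __name__ == "__main__":
    crawl(["http://example.com/"])
    print(scan())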
1.4. Acknowledgements
First of all, I would like to thank Prof. Dr. Freiling for giving me the opportunity to work on such an interesting topic even though I had not studied in his research area but in computer networks. He has been very insightful regarding my problems and supportive in a very human way.
I would also like to thank Prof. Dr. Effelsberg for being such a kind person. He agreed to be my second examiner and allowed me to work on my thesis at the Laboratory for Dependable Distributed Systems.
I further would like to express my thanks to Dipl.-Inform. Thorsten Holz for his great support. He has given me good advice and motivated me. Even though he was often busy being productive, he was always available for a conversation, either in person or per email.
I would also like to thank my dear friends Nuh Duran and Arno Knöbl, who have helped me to amend my numerous mistakes and gave me precious advice.
Finally, I would like to thank my family for the patience and motivation they have shown me during my studies, with all their ups and downs.
Last but not least, I would like to thank all the people working on our floor. I had a great time working here and am probably going to miss this friendly atmosphere.
2. Background
Before trying to detect and analyze malware, we have to give proper answers to the following questions. What are malicious Web sites, and what are we looking for? What is malware? Which kinds of threats exist on the Internet? How is malware distributed over the Internet? How can malware be detected? The following sections try to answer these questions and introduce the reader to the matter.
In Section 2.1, we define what malicious Web sites and malware are. After that, in Section 2.2, common types and functionalities of malware are explained. Thereafter, a description of the available detection and analysis methods for malware is given in Section 2.3. Section 2.4 points out some concepts important for our work, namely honeypots, honeyclients, and their interaction levels. Then, in Section 2.5, we present related work done on this and similar topics to give an overview of the tools and results already achieved by others and to classify our work in their context. Afterwards, we explain how spam is connected with malicious code in Section 2.6. Finally, we introduce some Web crawling concepts which are important for understanding our system in Section 2.8.
2.1. Definition of Malware
Malicious Web sites contain any kind of content which could be a threat to the security of the clients requesting these sites. Malware is any piece of software that serves malicious purposes. Such software is built with malice aforethought, with the intention to damage, cheat, con, steal, or similar. Malicious Web sites can either host such malware or exploit a client to force its infection with malware. The latter Web sites are called drive-by-download sites [93]. Malware is either programmed by hand, generated with so-called malware toolkits, or generated at runtime to obfuscate itself from detection.
To infect computer systems, malware either exploits known or unknown vulnerabilities of the victim systems or is consciously or unconsciously executed by the user or the computer itself. Malware often spreads itself on the infected computer system and over computer networks to other computer systems. Many types of malware use different advanced techniques to hide their existence and obfuscate their actions. Thus they try to avoid detection in order to sustain infection and functionality as long as possible.
The aims of malicious software in general can be summed up as:
• Avoid detection: detected malware can easily be recognized by others as well, and actions can be taken against it.
• Ensure functionality: software only has a meaning as long as it functions properly and has value for the attacker.
• Ensure (wide) spreading: avoiding anything that blocks its spreading is important for the success of the malware.
2.2. Types of Malware
The more you read and learn, the less your adversary will know. (Sun Tzu)
Different types of malware make use of different operation techniques and therefore require various detection and analysis strategies. It is a Sisyphean challenge to discover new malware and to invent new technologies against malware writers, because the malware researcher always has the task of rediscovering a threat which an attacker has discovered and implemented before him. As long as there are people with the knowledge and the intention to abuse, the researcher keeps searching for threats.
The classification of malware follows specific schemes such as:
• Infection method: remote exploitation, or local execution and exploitation
• Spreading: self-replication and infection of other binaries and systems
• User interaction degree: socially forcing the user to download and install the malware, or hiding from the user
• Abuse purpose: key logging, identity theft, scam, fraud, etc.
Many modern malicious codes or contents use well-known techniques from other sorts of malicious code to achieve synergy effects for better infection, spreading, and hiding. Given a copy of the malicious code or an infected sample of a publicly known malware, it is easy to unveil its existence through malware-specific signatures. This scheme also works well on spam mails or phishing sites, which are not executable malicious code but are malicious for the user from a social point of view.
As an example, the HTML.MediaTickets.A exploitation Web site signature contains just the string that executes an exploitation script. The following line has been extracted over 400 times from the Web site http://desktopwallpaperfree.com, as shown in Table 14.
<script language="JavaScript" src="http://www.mt-download.com/mtrslib2.js"></script>
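As a minimal illustration of such string-signature matching, the following Python snippet flags any page that embeds the script reference shown above; this is a simplification and not ClamAV's actual signature format, and the file name page.html is a hypothetical, previously crawled page:

# Simplified string-signature check; the signature is the script reference shown above.
SIGNATURE = b'src="http://www.mt-download.com/mtrslib2.js"'

def matches_mediatickets(page_bytes: bytes) -> bool:
    """Return True if the page embeds the known exploitation script."""
    return SIGNATURE in page_bytes

with open("page.html", "rb") as f:   # a previously crawled page (hypothetical file name)
    print(matches_mediatickets(f.read()))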
2.2.1. Computer Virus
A program that can 'infect' other programs by modifying them to include
a possibly evolved copy of itself.
With the infection property, a virus can
spread throughout a computer system or network using the authorizations of
every user using it to infect their programs. Every program that gets infected
may also act as a virus and thus the infection grows. (Fred Cohen) [75]
A computer virus is a program that can infect a computer system without the knowledge of the user. Viruses are self-replicating and can infect other programs, which then act as a host for the virus, similar to biological viruses. Infection of other systems can only occur if an infected file is transferred to another computer system through a data transfer over a computer network or via storage media. Computer viruses are not able to infect other computer systems over a computer network on their own.
They can have different unwanted and dangerous effects on a computer, like data and even system destruction, faulty system behaviour, and at the very least the squandering of computing resources in memory and storage space.
Computer viruses are usually detected through virus signatures. Anti-virus companies and activists generate signatures for software samples which are suspected to be malicious. These signatures are bitstreams and can be considered as fingerprints, each typical of one specific computer virus. To unveil an infected host, the file system and memory of the computer system have to be scanned for these signatures.
2.2.2. Computer Worm
Computer worms are often confused with computer viruses but are a different kind of malware. Worms are computer programs with the intended ability to exploit one or more vulnerabilities of a specific system implementation. In contrast to computer viruses, which wait passively to be executed and to infect files on different systems, computer worms attack other systems to actively infect them. Computer worms do not infect other programs with their code but copy themselves onto other exploited computer systems.
Computer worms often have a secondary intention beyond their own spreading. This can be either to send spam through infected hosts or to download other malicious code to permanently take control over a computer by utilizing a backdoor or a Trojan Horse. Since computer worms actively spread and infect systems over the network, new threats often generate a heavy load on parts of the Internet and can cause great financial damage, especially for small networks. Worms can spread through computer networks utilizing emails [81], peer-to-peer networks [108], or even mobile phones [95].
Like computer viruses, computer worms are detected with signatures. In addition, a worm attack over a network can also be detected with an intrusion detection system signature. With this approach, an infection can be blocked by an intrusion detection system before it reaches the victim machine.
2.2.3. Trojan Horse
The term Trojan Horse comes from the myth of the Trojan War. It stands for an attack device which disguises its true bad intentions from its victim as something harmless and useful. In the scope of computer security, a Trojan Horse is a malicious program which comes to the victim as a useful program, or attached to a useful program, and infects the victim unnoticed. Trojan Horses do not replicate themselves or spread over the Internet automatically and thus are not categorized as viruses or worms, but they often use similar techniques to hide from the user and to disguise their actions as wanted or unproblematic system operations. They can harm the client in any way that other malware can.
2.2.4. Spyware and Adware
There is no precise definition of spyware, but every definition attempt focuses on the spying aspect. The Anti-Spyware Coalition [53] has formed a definition which comes close to the subject:
"Technologies deployed without appropriate user consent and/or implemented in ways that impair user control over: Material changes that affect their user experience, privacy, or system security; Use of their system resources, including what programs are installed on their computers; and/or Collection, use, and distribution of their personal or other sensitive information." (The Anti-Spyware Coalition)
Spyware generally has criminal or financial purposes. The criminal purpose includes, among others, identity and/or credit card information theft. The compromised systems often have a keylogger installed, which logs keystrokes and scans the hard disks and Web forms for credit card or other criminally valuable information.
The financial purpose is mostly advertising and the stealing of credentials. The compromised system's Web searches or browser history are observed to gain information about the user's interests, and customized advertisements are delivered.
In contrast to viruses or worms, spyware neither self-replicates to spread itself nor transmits the infection to other systems. Spyware is installed either with social engineering and deception techniques, through other malware, or by exploiting system vulnerabilities with prepared exploiting Web sites.
Adware is software with advertising functions, either bundled with another software or as part of a software. It is often considered disturbing, but its main feature is the consent of the user, who agrees to advertisements by installing the software. Some sorts of adware send specific information about the client to optimize the advertising functionality. Such consented spying functions sometimes lead to adware being erroneously labeled as spyware, but as long as adware stays legal in its actions, it cannot be categorized as spyware.
2.2.5. Exploit
Exploits are either malicious software or malicious content which can be used to exploit a vulnerable system in order to gain control over a computer system or to allow a privilege escalation or a denial-of-service attack. So-called drive-by-downloads are Web sites which can exploit Web clients that merely visit such a Web site. Once a victim system is exploited, an attacker can do anything on it, such as installing additional malicious software like backdoors or adware.
Exploits are often classified as remote or local exploits, which indicates whether an exploit is applicable from a remote machine over a computer network or only from the machine itself. Exploits lose their applicability as soon as the corresponding vulnerabilities are patched. Exploits which are published before a vendor patch exists are called zero-day exploits, according to the count of days between the publishing of a vendor patch and the occurrence of the exploit. Such types of exploits are rare and interesting to both blackhats and whitehats. Since vulnerabilities without a patch can pose a possible threat to all systems available on the Internet, system vendors try to be the first to know about a vulnerability and are sometimes even prepared to pay a bounty for its detection and transfer [61].
2.2.6. Backdoor
A backdoor or a rootkit is malicious code which enables an attacker to modify a victim computer system in such a way that the attacker is able to remotely access the system without the knowledge of the victim. Such code can either modify a legitimate program on the victim system or install itself to enable remote access. A successfully infected system can then be used for any other purpose and stay undetected for a very long time, because as long as the backdoor is not remotely accessed, it is likely that it will not be detected.
2.2.7. Botnet and Bots
The term bot [86] is an abbreviation of robot and stands for an automated malicious program with some similarities to a computer worm. It has the ability to automatically connect to a point on the Internet where its originator can gain control over it. Bots are infected computer systems which automatically connect to such a command-and-control server and wait for commands, thus forming a kind of distributed, remotely controllable network of bots, a so-called botnet.
Botnets often utilize Internet Relay Chat (IRC) [24] for their communication. Infected bots connect to a password-protected, encrypted, and specially prepared IRC server and become part of the botnet. Once a botnet is up and running, it can hold thousands of bots ready for abuse. Botnets are often used for distributed denial-of-service (DDoS) attacks and spamming. They have three major attributes:
• A remote control facility enabling the control of a bot by the attacker.
• The implementation of several commands, which determines the abuse capabilities of a bot.
• The spreading mechanism, which is important for the further propagation of the bot.
2.3. Detection of Malware
Malware researchers often retrieve malware samples and samples of suspicious code through the following sources:
• Mail gateways provide access to many spam mails with attached malware or links to malicious sites.
• Honeypots are victims of automatic malware attacks and are therefore often used to collect such binaries in an automated manner. Furthermore, there are specialized honeypots purely for the automatic collection of malware samples, such as Nepenthes [39].
• Intrusion detection systems can detect an attack and dump the communication to gather binaries.
• Infected systems host infected binaries.
Malware can be either publicly discovered or publicly undiscovered. Publicly discovered malware can exploit computer systems through publicly known vulnerabilities. Publicly undiscovered malware can exploit either known or unknown vulnerabilities of computer systems. Malware does not have to exploit a system at all; it just has to have malicious intentions and to be executed on the victim computer system.
Depending on this, the detection mechanisms differ in the following two ways. Known malware, including non-binary threats, can easily be detected with the classic signature database method, in which a signature is generated for every known malware. These signatures are characteristic byte sequences used to identify malicious code. For unknown malware with no existing signature, analysis tools have to be used to unveil the maliciousness of an observed binary. There are two kinds of analysis techniques for malware. The major trade-off between the two is the relation between the speed of the analysis and the level of detail of the information obtained about a sample.
2.3.1. Static Analysis
Static malware analysis is the most expensive approach, and high skills are needed for the analysis. However, this approach can deliver complete information about a binary. There are three ways to perform the static analysis of malware.
First, the binary can be examined without ever being executed. Extracted strings can deliver relevant information about the contents. The standard Unix tool strings is able to print the sequences of printable characters in files. Listing 1 shows a simple example in which strings are extracted from a small file.
Listing 1 Application of the strings command
$> cat hello.txt
hello world
$> strings hello.txt
hello world
In fact, some malware was named after unique strings found in malware samples. Such strings can show contained text, for example a message or a disclaimer of a software library. Hash functions (MD5 [58] / SHA-1 [63]) can deliver some information about the content, enabling us to compare the file with others. Additional information can be obtained if contained resources can be extracted. The Resource Hacker [46] is an example of an extraction tool able to obtain contained icons, bitmaps, etc. from Windows binaries. While easy to apply, this approach is likely to deliver only little information.
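A small Python sketch of these cheap static techniques, hashing a sample for comparison and extracting printable strings, could look as follows; it is illustrative only and reuses the hello.txt file from Listing 1:

# Illustrative only: MD5/SHA-1 fingerprints for comparing samples, plus a strings-like
# extraction of printable ASCII sequences (compare Listing 1).
import hashlib
import re

def fingerprints(path):
    data = open(path, "rb").read()
    return {"md5": hashlib.md5(data).hexdigest(),
            "sha1": hashlib.sha1(data).hexdigest()}

def printable_strings(path, min_len=4):
    data = open(path, "rb").read()
    return [m.decode() for m in re.findall(rb"[ -~]{%d,}" % min_len, data)]

print(fingerprints("hello.txt"))
print(printable_strings("hello.txt"))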
The second way is to disassemble the binary and examine the machine code. This is the most difficult and time-consuming way to analyze, but it can lead to a full understanding of the binary. The main problem of this approach is the need for very deep knowledge of the examined system and its machine language. One of the best tools for disassembling code for many architectures is the commercial IDA Pro disassembler [21]. We have observed that it is widely used by malware researchers. Figure 2 shows a sample debugging session.

Figure 2: Sample IDA Pro session [21]

The last option is the decompilation of the binary into human-readable program code. This option needs less system knowledge than disassembling, but in contrast to disassembling, decompilation is not always applicable or useful. If the original program code was compiled using the code-optimizing options of a compiler, and/or without debugging symbols, and/or with a special disassembler obfuscator, the resulting code can be either very difficult to analyze or even impossible to decompile. Even if the code is decompiled successfully, the result can be difficult to analyze, because the source code comments and even the variable names, which usually deliver crucial information about a source code, are not recovered. Nevertheless, it is far easier to analyze such code than to step through a huge machine code program. The Boomerang [4] project is an example of a popular multi-platform decompiler.
2.3.2. Behavioral Analysis
Behavioral malware analysis is faster and needs fewer skills than static analysis. Behavioral analysis often gives less and incomplete information about the examined code; nevertheless, for most purposes the incomplete information may be enough. Behavioral analysis does not analyze the software itself but its behavior in a prepared environment, using a black-box approach. To this end, the system is examined before and after the execution of a suspicious binary. All important information regarding system security is logged. System calls can be hooked to trace all actions. A system that implements modern behavioral analysis methods is the CWSandbox [8], which is described in Section 3.5.5.
Behavioral analysis is likely to miss some features of the software sample, like hidden features, password protection, or software with unfulfilled dependencies such as additional software libraries or system capabilities. Nevertheless, behavioral analysis is the best way to automatically analyze many software samples at a time and to identify their maliciousness.
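A minimal black-box sketch of this before/after comparison could look as follows; real sandboxes such as CWSandbox hook API calls instead, and the sandbox directory and the direct execution of the sample are simplifying assumptions of this illustration:

# Minimal black-box sketch of the before/after comparison described above.
import pathlib
import subprocess

def snapshot(root):
    """Map every file under root to its (size, mtime) pair."""
    state = {}
    for p in pathlib.Path(root).rglob("*"):
        if p.is_file():
            st = p.stat()
            state[str(p)] = (st.st_size, st.st_mtime)
    return state

def behavioral_diff(sample, root="/tmp/sandbox"):
    before = snapshot(root)
    # In a real setup the sample would run inside a disposable virtual machine.
    subprocess.run([sample], timeout=60)
    after = snapshot(root)
    created = sorted(set(after) - set(before))
    modified = sorted(p for p in before if p in after and before[p] != after[p])
    return created, modified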
2.4. Honeypots vs. Honeyclients
All warfare is based on deception. (Sun Tzu)
Deception is a technique used to trap criminals not only in criminal investigation but also in computer security.
Like a thief in the real world, an attacker wanting to break into a system does not want to be observed or traced and will flee as fast as she can the very moment she realizes the trap. An attacker who is fully deceived into believing that the system she prepares to attack is not observed and is safe for her to break into will probably do so. And just as in the real world a thief is a thief only if she is caught stealing, an attacker can only be criminally prosecuted if she has attacked someone's computer system or network illegally.
Another aspect of deception is the way the attacker will act on a compromised system. The actions will differ depending on how much the attacker is deceived by the compromised system. The more she is deceived and trusts in her proficiency, the more she will unveil of her abilities, techniques, and aims [73].
2.4.1. Honeypots
Honeypots [99] are dedicated deception devices whose value lies in being probed, attacked, and compromised by manual or automated attacks in order to gather information on the attacker without his knowledge. Since they are dedicated only to attack analysis and not to production use, every action on a honeypot is suspicious and illegal.
Server honeypots, which are often just called honeypots, wait for the adversary to attack and thus have to be a desirable target for an attacker in order to fulfill their task. Just as in the real world an open door may tempt a saint, such a desirable system should provide publicly known system vulnerabilities and weak security profiles, like weak or no passwords, old and unpatched system software, etc., to succeed.
Honeypots are usually classified as low-interaction or high-interaction honeypots. High-interaction honeypots are real systems providing real applications for the attacker to interact with. They are used to gather profound information about threats and attacks. Low-interaction honeypots simulate real systems and services. An attacked system has only restricted interaction capabilities and thus lacks some of its deception value.
There are some aspects which are special to honeypots in general. Honeypots are likely to be controlled by attackers, who will then try to attack other computer systems. This may be a legal issue in many countries and needs attention. To restrict the ability to attack other systems, honeypots have a containment strategy. This strategy can restrict outbound connections and thus minimize the possible damage. The containment strategy has to be carefully adjusted to keep the deception up, so that the attacker does not realize that she is being observed.
2.4.2. Honeyclients
Honeyclients, which are sometimes also called client honeypots, are the opposite of server honeypots. Honeyclients actively crawl or access the Web to search for servers that exploit the client, and thus gather information on how attackers exploit clients. Web browsers are not the only kind of client software that can be attacked, but the current focus in this work and in other honeyclients is mostly on the analysis of Web client exploitation.
One major drawback of server honeypots is their waiting for an attack. An attack happens only by chance; thus it is possible that a ready-to-attack honeypot will not be attacked for months, and it is equally possible that it is occasionally attacked by many attackers at the same time. It is not easily predictable how frequently attacks will occur on a honeypot, and thus the analyses get more complicated. In contrast to this behavior, honeyclients initiate every analysis themselves and thus control the maximum number of possible attacks. The analysis is important even if no attack occurs, because such Web sites and servers can then be classified as likely not malicious, and thus a safe Web could be mapped.
Honeyclients can also be classified as low-interaction or high-interaction honeyclients. The research done in this field is very new. High-interaction honeyclients are usually real automated Web browsers on real operating systems which interact with Web sites the way real humans would. They log as much data as possible during the attack and allow a fixed time period for an attack. Since they provide detailed information about the attack, high-interaction honeyclients are very slow and not able to scan broad parts of the Web.
Low-interaction honeyclients are often emulated Web browsers, usually Web crawlers, which offer no or only limited abilities for attackers to interact with. Low-interaction honeyclients often make use of static signature- or heuristics-based malware and attack detection and thus miss zero-day exploits and unimplemented attack types. These honeyclients are not suited for an in-depth investigation of an attacker's actions after a successful compromise, because the system is only simulated and any action beyond the initial exploitation is likely to be missed as well. In spite of these drawbacks, low-interaction honeyclients are often easy to deploy and operate and are very performant. They can be used for automated malware collection and to take a sounding of a portion of the Web. Furthermore, the containment of attacks is not a big issue, because the compromised system is not real and thus unusable for the attacker, which additionally simplifies the deployment of a low-interaction honeyclient. Figure 3 shows the operation of honeypots and honeyclients.
Figure 3: Honeypots vs. honeyclients
low-interaction:
   honeypots: simulated systems and services
   honeyclients: simulated Web clients and systems
   advantages: very fast; low system load; multi-threaded; multi-system; scalable; easy containment
   disadvantages: misses complex client exploits; improper malware analysis; no zero-day detection

high-interaction:
   honeypots: real systems and services
   honeyclients: real Web clients and systems
   advantages: zero-day exploit detection; detailed analysis; real-world systems
   disadvantages: slow; complex re-setup; heavy system load; single architecture

Table 1: Honeypot comparison
Table 1 compares these aspects of honeypot-based devices.
2.5. Related Work
Because our architecture is very similar to a honeyclient, most research done in this field can be referred to as honeyclients. Similar types of research regarding the malicious contents of the Internet have been done with the following, sometimes immature and still evolving, research projects, as shown in Table 2. Throughout all available honeyclients, a common general architecture or process chain can be observed, which consists of three serial steps depending on each other. First, a queue is filled with objects to analyze. Then a client requests the queued objects. Finally, the gathered objects are analyzed regarding their maliciousness. The queue can also be extended during the request and analysis steps. This simple process is depicted in Figure 4.
Table 2: Comparison of different honeyclients

HoneyMonkey: Microsoft Research, started 2005; classification: high-interaction; infection vector: Web client (IE); toolkit: Strider Tools FDR, GB, GK; availability: proprietary, not available; status: production/stable; analyzed threats: drive-by-downloads, zero-day exploits, malware; OS: Windows.

UW Spycrawler: University of Washington, started 2005; classification: high-interaction; infection vector: Web client (IE/Mozilla); toolkit: Lavasoft AdAware; availability: proprietary, not available; status: production/stable; analyzed threats: spyware; OS: Windows.

SiteAdvisor: McAfee, started 2005; classification: high-interaction; infection vector: Web client (any); toolkit: n/a; availability: proprietary, not available; status: production/stable; analyzed threats: drive-by-downloads, malware, spam.

HoneyC: Victoria University of Wellington, started 2006; classification: low-interaction; infection vector: Web client (crawler); toolkit: Snort; availability: Free Software; status: development/beta; OS: multi.

HoneyClient: Kathy Wang, started 2004; classification: high-interaction; infection vector: Web client (IE); toolkit: self; availability: Free Software; status: development/beta; OS: Windows.

Shelia: Vrije Universiteit Amsterdam, started 2006; classification: high-interaction; infection vector: email; toolkit: self; availability: n/a; status: development/beta; OS: Windows.

Capture-HPC: Victoria University of Wellington, started 2006; classification: high-interaction; infection vector: Web client (various); toolkit: state based; availability: Free Software; status: development/beta.

Pezzonavante: Danford/Still Secure, started 2005; classification: high-interaction; infection vector: Web client (IE/Mozilla); availability: proprietary, not available; status: development/n/a; OS: multi.

Monkey-Spider: University of Mannheim, started 2006; classification: low-interaction; infection vector: Web client (crawler), email; toolkit: ClamAV, many free and proprietary tools; availability: Free Software; status: development/beta; analyzed threats: malware, spam, phishing; OS: multi.
Figure 4: The general honeyclient process chain
2.5.1. HoneyMonkey
The Strider HoneyMonkey Exploit Detection system is a network of monkey programs running on virtual machines with different patch levels and constantly patrolling the Web to hunt for Web sites that exploit browser vulnerabilities. (The Strider HoneyMonkey Project) [104]
The HoneyMonkey is a Web browser (Internet Explorer) based high-interaction honeyclient developed at Microsoft Research in 2005. For their analysis approach they introduced the term Automated Web Patrol. The architecture consists of a chain of virtual machines with different flavors of the Windows operating system at various ascending patch levels.
Starting with the fully unpatched system, the Monkey Controller initiates a so-called monkey program which browses previously scheduled Web sites. The monkey waits for a given time. After waiting, the virtual machine is checked for possible intrusions. If an attack is detected, the Web site is revisited with the next machine in the pipeline. This re-analysis continues as long as the machines keep getting successfully exploited. If finally the last, fully patched system in the chain is also exploited, then this Web site is flagged as hosting a zero-day exploit.
The HoneyMonkey system uses a black-box approach for its analysis. Since a monkey is instructed to launch a browser instance and wait a few minutes, but is not set up to click on anything automatically, any executable file created outside the browser's temporary folder signals an exploit.
The monkeys are started with the Strider Flight Data Recorder [72], which records every single file and registry read/write. Each monkey also runs with the Strider Gatekeeper [49] to detect any hooking of Auto-Start Extensibility Points (ASEPs) that may not create executables, and with the Strider GhostBuster [50], which can detect stealth malware that hides processes and ASEP hooks. For ease of use, every monkey runs in a virtual machine queued, started, and reset by a Monkey Controller. The HoneyMonkey is also able to create a URL-level topological graph to unveil bad neighborhoods and major players in the malware community.
During their research in May/June 2005, they found that totally unpatched WinXP SP1 systems could be exploited by 752 URLs, while a fully patched WinXP SP2 had no exploits. They also claimed to have detected a zero-day exploit in July 2005 [69], but it is known that SEC Consult had found and announced that vulnerability already in mid-June 2005 [79].
The main advantage of the HoneyMonkey system is that it is able to detect zero-day exploits against fully patched Microsoft systems and can classify which exploits are applicable up to which patch version. The performance of HoneyMonkey in the identification of exploits was about two minutes per URL.
2.5.2. Spycrawler
The University of Washington Spycrawler [93] in Seattle/USA is a proprietary crawler-based high-interaction honeyclient with an event-based detection mechanism developed in 2005.
The focus of their research is on spyware, and thus they do not track other types of malware. They utilize three steps in their architecture to detect and analyze spyware and use Google search results as their starting points:
1. Determination of whether a Web object contains executable software (see the sketch after this list):
   a) Based on the Content-Type HTTP header (e.g. application/octet-stream)
   b) Based on the URL containing known executable extensions (e.g. exe, cab, or msi)
2. Downloading, installing, and executing that software within a virtual machine:
   a) Download the URL with Heritrix [19]
   b) Install and execute the binary with an automated monkey program
3. Analyzing whether step two caused a spyware infection: analyze the virtual machine with the AdAware [1] spyware scanner to detect infections.
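A possible implementation of the executable test from step one, based on the recorded Content-Type header and the URL extension, is sketched below in Python; the type and extension lists are illustrative and not UW Spycrawler's actual configuration:

# Sketch of step one: decide whether a fetched object looks like executable software.
from urllib.parse import urlparse

EXECUTABLE_TYPES = {"application/octet-stream", "application/x-msdownload"}
EXECUTABLE_EXTENSIONS = {".exe", ".cab", ".msi"}

def is_executable(url: str, content_type: str) -> bool:
    if content_type.split(";")[0].strip().lower() in EXECUTABLE_TYPES:
        return True
    path = urlparse(url).path.lower()
    return any(path.endswith(ext) for ext in EXECUTABLE_EXTENSIONS)

print(is_executable("http://example.com/setup.exe", "text/html"))   # True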
One of the main disadvantages of this approach is that it probably misses some executables, because step one is not sufficient to detect all of them. With this approach, the real amount of spyware on the Web may be underestimated, but it will not be overestimated. By comparison, our system analyzes all content. This leads to analyzing more content than needed but rules out missing executable code. Another main disadvantage is that their analysis is based on only one anti-spyware solution and only on known spyware types.
Despite these weaknesses, they generate interesting results. Their implementation with a cluster of ten nodes achieves an average turnaround of 92 seconds per executable to reset the VM, download the binary, install and execute it, and analyze it with AdAware. Their results give an insight into the topic-specific distribution of spyware on the Web and into the probability of an infected site in general. They crawled about 40 million URLs between May and October 2005 and found executable files in about 19% of the crawled content. Approximately 4% of the crawled sites had spyware-infected executables. Additionally, they found some major spyware-spreading sites linked on other sites and some major spyware types contained on most of the sites.
2.5.3. SiteAdvisor
The SiteAdvisor [34] is a proprietary high-interaction honeyclient from McAfee. The development was started in April 2005 by a group of MIT engineers, and the first results came around December 2005. In April 2006, McAfee bought SiteAdvisor and now continues their work.
SiteAdvisor is a proprietary honeyclient system; there is little information and no public documentation about the architecture of their system. Judging from their Web blog, it seems that they are using a crawler-based high-interaction honeyclient. In December 2005 they had 80 analyzing nodes in their honeypot cluster for the analysis. They have published some research results, but it is not possible to directly access their result database. They have created a system which analyzes malicious content on Web sites, drive-by-downloads, and email traffic after automatic registration on the examined sites, combined with a very active community of Internet users who personally rate and comment on the maliciousness of a Web site. Other site-specific information like online affiliates, popularity, and country is provided. All this information is presented as a site report, which has to be queried manually for each Web site. Figure 5 shows a sample site report for a malicious Web site. They have a simple safety rating system for the overall safety of a site based on the three colors of traffic lights: green is for harmless Web sites, yellow indicates a suspicious appearance, and red means an assured or rated maliciousness of a Web site.
The SiteAdvisor system also provides a browser plugin for the automatic analysis of visited and searched content. Search results for popular Web searches are highlighted according to their safety rating. The SiteAdvisor's safety ratings appear next to search results, and a small button on the browser toolbar changes color based on SiteAdvisor's safety results. Menu options on SiteAdvisor's toolbar let the user customize SiteAdvisor or see a site's detailed test results. Figure 6 shows a sample search with an enabled SiteAdvisor browser plugin.
Every now and then they also provide some malware-related statistics. In their malware report from March 2007, they found that 4.1% of all sites tested by SiteAdvisor are rated red or yellow. They point out the different maliciousness levels of top-level domains, showing a range of red- and yellow-rated pages from 0.1% of Finnish (.fi) domains to 10.1% of the domains of the islands of Tokelau (.tk). They also observe the maliciousness of generic TLDs and identify .info as the riskiest TLD with 7.5% maliciousness, followed by .com with 5.5%.

Figure 5: Sample SiteAdvisor site report
2.5.4. HoneyC
HoneyC [97] is a component-based low-interaction honeyclient developed in 2006 at the
Victoria University of Wellington, New Zealand. It is the first cross-platform open source
framework of this kind and is written in Ruby. HoneyC emulates only the essential features
of the target clients and applies signature matching for a fast static analysis of server
responses. It is based on three components, which communicate with each other over a
normal Unix pipe:
$> queuer | visitor | analysisEngine
Components can be multithreaded and are modular. The initial version of HoneyC
includes the following implementations of the three components:

• Queuer, which creates a queue of objects to analyze: a Yahoo Search Web Services [65] queuer
• Visitor, which acts as the client: a self-written Web client requesting content
• Analysis engine, which analyzes the response: a Snort [48] signature database analysis engine

Figure 6: SiteAdvisor's rated Google search results
Preliminary results [97] show a very good performance of 3 seconds for the analysis of
one Web site. The authors have estimated how long it would take to analyze the whole
publicly indexable Web with HoneyC. Their calculation shows that for approximately
52 billion Web sites with 945 petabytes of content, it would take one week, a bandwidth
of about 13.75 Gbit, and more than 257,000 instances of HoneyC to scan the whole Web.
2.5.5. HoneyClient
HoneyClient [103] is a Web browser based high-interaction honeyclient. It is the
first open source client honeypot, developed in 2004 by Kathy Wang. HoneyClient runs
on a normal Windows host. It takes snapshots of the Registry and the filesystem before
any action is taken. Right after a Web site is automatically visited, a long system scan
detects filesystem and Registry changes. Being the first freely available client honeypot,
it was not built to analyze millions of Web sites or to be very performant, but to be a
proof of concept. Accordingly, HoneyClient has a low performance of several minutes per URL.
2.5.6. Shelia
Shelia [96] is a high-interaction honeyclient in development at the Vrije Universiteit
Amsterdam, Netherlands, with a primary focus on spam mails. The authors are building
a system for automatically checking all links and attachments received by email. For each
Web site link or file received, a behavioral analysis is made to detect the maliciousness of
the resource.
Therefore the queuer component is extended to additionally queue potentially malicious content.
The striking speciality of the Shelia system is a so-called client emulator, which is responsible
for invoking a proper application to deal with each kind of file identified by the
queuer component. It is special because this system analyzes not only vulnerabilities of
one client software but also vulnerabilities of any software associated with specific file types.
It should be able to detect macro viruses and third-party software vulnerabilities, for
example a buffer overflow vulnerability in Adobe Acrobat Reader [68]. The analysis
consists of three parts which become relevant after the execution of a file or the invocation
of a file with its associated application:
• The process monitor engine monitors the actions executed by the clients and applies a containment strategy to prevent successful attacks from spreading further off the honeyclient.
• An attack detection engine determines whether an action executed by the client is considered illegal, using a DLL injection and an API hooking technique.
• An attack log engine logs as much information as possible about the attack.
2.5.7. Capture - HPC
Capture-HPC [5] is a high-interaction honeyclient developed at the Victoria University
of Wellington, New Zealand. Capture has two functional areas in its design, namely a
Capture client and a Capture server. The clients host the actual honeypots on virtual
machines; the server coordinates and controls the clients. Capture concentrates on three
aspects of high-interaction honeyclients:

• Capture is designed to be fast. State changes on the clients that indicate malicious actions are reported to the server in real time.
• Capture is designed to be scalable. The central Capture server can control numerous clients across a network.
• Capture supports different clients. The current version supports Firefox, Opera and Internet Explorer.
The scriptable server takes each URL it receives and distributes it to the clients in a
round-robin fashion while controlling the clients in terms of starting and stopping them.
The clients connect to the server on port 7070 and report back any state changes.
For this purpose each client monitors its own state while browsing a Web site for changes
on the file system, the registry, and the processes. An exclusion list of known normal system
changes is used to identify a normal state; any other operation triggers a malicious
classification of the Web server and sends this information to the Capture server. Since
the state of the client has then been changed, the client resets itself to a clean state and
retrieves new instructions from the server. If no state change has happened, the client
requests new instructions from the server and continues its browsing without resetting.
The client consists of two components, a set of kernel drivers and a user space process.
The kernel drivers provide event-based detection mechanisms for monitoring the system's
state changes. The user space process captures these changes from the kernel drivers,
filters them against the exclusion list, and communicates the remaining changes back to
the server.
An interesting feature in development is the support of non-browser clients, such as multimedia players and Microsoft Office applications.
2.5.8. Pezzonavante
Pezzonavante [77] is a high-interaction honeyclient developed by Robert Danford at
StillSecure, started in August 2005. Since it is a proprietary system and is mainly used
internally to test the security products of StillSecure, there is little information and few
statistics about the system architecture and results. Nevertheless, malicious Web sites and
code found by Pezzonavante are shared with the computer security community.
Pezzonavante utilizes a farm of honeyclients with different patch levels of operating
systems, partly running the security software from StillSecure with its highest security policy.
A central URL server coordinates every client in the farm to visit the same sites. Like
with HoneyMonkey, this approach is used to expose clients with different security settings
to the same threats. To avoid overhead and Web server downtimes, Pezzonavante uses
Squid caching.
In contrast to other systems, Pezzonavante does not do integrity checks after each visit
but in a monthly fashion. It is likely to miss some malicious sites due to the less frequent
detection, but this is outweighed by the high-speed visiting of the most malicious Web sites.
Pezzonavante uses a hybrid, asynchronous approach to attack detection utilizing, amongst
others, the following techniques:

• Osiris integrity checking
• Security tool scans (various anti-virus and anti-spyware scanners)
• Snort network IDS alerts
• Traffic analysis
• Snapshot comparisons

Pezzonavante's main attainment is the high crawling and analysis speed made possible by
the monthly intrusion checking.
According to Grimes [83], the Pezzonavante system surfed the Web in December 2005
with four Windows XP machines at a rate of 150 to 300 URLs per hour, corresponding to
about 12 to 24 seconds per Web site.
2.6. Spam and Malware
The term spam is commonly used for Unsolicited Bulk Email (UBE) or Unsolicited
Commercial Email (UCE). These email messages are often used for advertising
and phishing, but this is not the only malicious issue about spam. Recent spam waves
have shown different attempts to use spam as an infection vector for malware [78, 74].
These spam mails either have malicious code attached and trick the user into executing
the attachments, or they guide users to Web sites hosting malicious content.
Spam mails are a day-to-day threat, and so are the sites they link to.
To capture these threats, our system should also be able to analyze malicious content
linked in spam mails. The malware found this way is likely to use the latest available
techniques and could also include zero-day threats. A day-by-day observation of incoming
spam mail is important; any later observation would miss the linked malicious content.
According to the phishing attack trend report for February 2007 [84] of the Anti-Phishing
Working Group, the average time online for phishing sites has been four days. They have
also collected a total of 289 unique malware types on 3121 hosts in February alone.
2.7. Phishing
Phishing is not malware in the strict sense, but it is malicious for the surfer from a social
viewpoint. Phishing is a scam or criminal activity using social engineering techniques. It
is an attempt to fraudulently acquire sensitive information like user names, passwords,
online-banking transaction numbers, and credit card data. Every phishing attempt is
done in two steps. First a phishing mail is sent to arbitrary recipients with the same
methods with which spam mails are sent.
This mail then uses social engineering tricks to guide the victim to a previously prepared
phishing Web site claiming to be a legitimate entity, like the official homepage of the
victim's bank, the PayPal Web site, or the eBay Web site. A normal Internet user is
often not able to tell whether an email message is really from the claimed sender or
not. Figure 7 shows such a phishing Web site. Once the information is gained, phishers
either sell this information on specialized black markets [107] or use it for fraud themselves.
2.8. Crawling the Web
For a broad and high-bandwidth analysis of the entire Web or parts of it, it is obvious
that performance and scalability are the main concerns when choosing appropriate solutions.
It is possible to scan the Web while surfing through the contents, but this would lead
to poor performance because the malware scanner would only analyze a small portion of
data per Web site and would thus spawn a process per analyzed chunk of content. Since
every process creation causes considerable overhead for loading additional libraries, forking
new processes, allocating memory, etc., we try to avoid this as much as possible by batch
processing the crawled content at the end of the crawls.
Queueing
To crawl the Web one must have one or more starting points. There is no possibility to
mirror the Web directly because parts of it are dynamically added and removed all the
time. The only way of getting a good picture of it is to crawl representative portions.
In order to achieve this we have to find some starting points and explore further linked
pages. This bunch of starting URLs is called seeds throughout the rest of this work.

Figure 7: Example of an eBay phishing site [100] trying to steal passwords for identity theft

Depending on the intention of the research and the popularity of the sought material, it
is important to achieve a good hit rate and to gain representative results for our evaluation.
There are several methods of obtaining these seeds.
To surf the Web, people have two options where to start: they either start their surfing
from a previously known link or they search for new URLs on popular Web search engines
like google.com. Once they have started, they can use the links on a Web site to go further.
Trying to understand the distribution of malware on the Internet, we choose popular topics
and generate a seed out of Web searches based on these topics. Additionally, we generate
seeds based on hosts files, which collect and blacklist malicious as well as otherwise
dangerous sites. Finally, we use spam mails as seeds, which often lead to phishing and
malicious Web sites.
Scope
The crawl scope denotes how far we want to follow discovered URLs. Two parameters
determine the crawl scope. The first one is the maximum number of link hops to include,
which means URLs more than this number of links away from a seed will not be ruled
in-scope. The second one is the maximum number of transitive hops to include, meaning
URLs reached by more than this number of transitive hops will not be ruled in-scope,
even if otherwise on an in-focus site. These two parameters are decisive settings for the
hit count and the relevance. The hit count specifies how many malicious sites are found.
URL Normalization
The time of static, handwritten HTML sites with unique human-generated URLs is history.
Modern Web sites, even personal non-professional ones, are machine-generated and use the
URL as a transport layer for information. Database queries over the URL or simple data
transport mechanisms over the URL like Representational State Transfer [80] are getting
more and more popular. As a result, many syntactically different URLs point to the same
physical resource on the Web. To avoid downloading the same content multiple times, we
use URL normalization [92] techniques and thereby reduce unnecessary overhead.
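As an illustration of the idea (this is a minimal sketch, not the exact rule set of [92] nor the Heritrix rules described in Section 3.4.1), a basic normalization step could lowercase the scheme and host, drop fragments and the default port, and map an empty path to the top page:

# Minimal URL normalization sketch (illustrative only, not the Heritrix rule set)
import urlparse

def normalize(url):
    scheme, netloc, path, query, fragment = urlparse.urlsplit(url.strip())
    scheme = scheme.lower()
    netloc = netloc.lower()
    # drop an explicit default port
    if scheme == 'http' and netloc.endswith(':80'):
        netloc = netloc[:-3]
    # an empty path means the top page
    if path == '':
        path = '/'
    # fragments never reach the server, so they can be dropped
    return urlparse.urlunsplit((scheme, netloc, path, query, ''))

print normalize('HTTP://WWW.Example.COM:80?x=1#top')
# -> http://www.example.com/?x=1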
Link Extraction
To reach as much of the available content as possible, not only URLs contained in HTML
files have to be analyzed but also those in any other content. Link extraction is the process
of extracting URLs out of any kind of document, like HTML, JavaScript, PDF etc. Link
extraction is essential for the scope of our evaluation because related links are often
embedded in files other than the HTML page and are reachable for human surfers, too.
File Storage
It is the Archive's experience that it is difficult to manage hundreds of millions of small files in most existing file systems. (The Internet Archive) [54]
The first idea for storing the crawled contents is just to mirror them and to use their
URL as a folder index. This approach may be useful for standard archiving of a dozen
Web sites, but it is unusable for broad-scope crawls of millions of URLs with obfuscated
and sometimes overlong URLs. Most file systems run into problems managing that many
files. Therefore an optimized storage strategy has to be chosen with regard to the following points:
• Minimize file creation and management. Every file object creation, process creation and termination takes notable time and resources.
• Avoid filename conflicts. Request headers in URLs used as file names and overlong URLs could cause problems due to restrictions of the filesystem itself.
• Utilize data compression. Many files are likely to be accessed only once or twice, so compression is not a big performance issue. Many plaintext file formats, especially HTML, can be compressed very efficiently. Smaller files improve the overall filesystem performance due to smaller write and read operations.
3. Monkey-Spider
In this chapter, we introduce our Internet analysis framework, the Monkey-Spider. We
have used it in our research as a low-interaction honeyclient, but it is able to support
other kinds of Internet research as well. Some reasonable use cases and extensions
to be worked on in the future are shown in Chapter 5.
3.1. System Architecture
The Monkey-Spider system utilizes many existing freely available software systems. Figure 8
shows the whole system. The architecture is divided into different functional blocks,
marked as dotted or solid rounded rectangles. These blocks can either run on different
computer systems for scalability and performance reasons or on one computer system for
ease of use. The hatched modules are still in development.

Figure 8: Monkey-Spider Architecture
Queue generation
Every run begins with the Seeder block, which generates starting URLs for the crawler.
Seeds can be generated with the Web search seeders, extracted from spam mails with the
mail seeder, and/or produced by the MonitorDB seeder, which requeues previously detected
malicious sites.
The Web search seeders use the corresponding search engines, namely Yahoo, Google and MSN Search, for searching and URL extraction.
The mail seeder extracts URLs out of collected spam mails from a spamtrap.
The monitoring seeder is used to constantly reseed previously found malicious content
over time from our malware database.
Web Crawling
Heritrix [19], the Web crawler, crawls content, optionally through an interconnected Web
proxy. The Web proxy can be used to increase performance and to avoid duplicate crawling.
Heritrix queues the URLs generated by the seeders and stores the crawled content on the
file server while generating detailed log files.
Malware Analysis
The scanner extracts and analyzes the ARC files on the file system with different anti-virus
solutions and malware analysis tools. Identified malware and malicious Web sites are stored
in the malware attic directory. Information regarding the malicious content is stored in the
database. Furthermore, we copy every found binary and JavaScript file to an additional
archive directory for further research purposes.
3.2. Implementation
One of our design principles is not to reinvent the wheel. Therefore we use several Free
Software [82] packages, each of which performs well for one particular task, and glue
them together with the Python programming language [59]. As a result, we only have to
implement some minor parts ourselves and can prototype very fast, so that we can
concentrate on the evaluation of the system and on dependable results. We use Linux as
the implementation platform because Heritrix also only works on Linux. We first evaluated
Java [25] as the programming language for our framework, because it is also used by Heritrix.
We would thus have been able to write our own modules and extend the functionality of
Heritrix for our purposes. After some weeks of testing and evaluating, it became obvious
that it was not possible to produce fast prototypes for our system this way and that it
would be very time-consuming to implement everything in Java.
We then started to examine Python for the implementation. We realized that this
language is often used as a prototyping language, and many standard operations were less
complex to implement than in Java because Python has a very large and useful library
[90]. The language comes with a large standard library that covers areas such as string
processing (regular expressions, Unicode, calculating differences between files), Internet
protocols (HTTP, FTP, SMTP, XML-RPC, POP, IMAP, CGI programming), software
engineering (unit testing, logging, profiling, parsing Python code), and operating system
interfaces (system calls, file systems, TCP/IP sockets). Two additional advantages left
no doubt about using Python: on the one hand, a nearly perfect documentation of all
aspects of the language and good tutorials in the standard system documentation; on the
other hand, the many available interfaces for common tasks on the Internet, like pYsearch
[45]. With this approach it took only a few weeks to implement the major parts of the
system.
We make use of several Free Software packages for our system, namely:
• The Python programming language [59], a dynamic object-oriented programming language.
• Heritrix, the Internet Archive's open-source, extensible, Web-scale, archival-quality Web crawler project.
• Karrigell [30], a flexible Python Web framework.
• Clam anti-virus [6], a free anti-virus toolkit for Linux.
• Htmldata [20], a manipulation library for HTML/XHTML/CSS documents with Python.
• PyGoogle [43], a Python interface to the Google SOAP Search API.
• pYsearch [45], a Python interface for the Yahoo Search Web Services API.
• SOAPpy [60], a simple-to-use SOAP library for Python.
• PostgreSQL [42], a powerful, open source relational database system.
• PyGreSQL [44], a PostgreSQL module for Python.
3.3. Seed Generation
In the seed generation step we try to find suitable starting points for our crawl. The
seeds are stored in a plain text file named seeds.txt residing in each profile's directory of
the Heritrix package. It contains one URL per line and nothing else.
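A seeds.txt file might, for example, look as follows (the URLs are purely illustrative placeholders):

http://www.example.com/
http://wallpaper.example.org/downloads/
http://warez.example.net/serialz.html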
3.3.1. Web Search Seeding
We have used the Web Services APIs of the three popular search engines Google, Yahoo
and MSN Search. With these interfaces we retrieve the URLs of the corresponding Web
searches. We choose some topics like celebrity or games as starting points and search for
typical terms that are used for these topics. For example, to obtain search results for
pirate Web sites we use crackz, serialz, warez etc. After that we collect all found
URLs of one topic and use this list as the seed for our crawl.
With this method, popular pages on the searched content can be found, because people
either use a search engine as a starting point for surfing the Web or start from a previously
known URL, and they mainly use the first results of the found sites. We can also be fairly
confident that the found URLs have relevant content for the topic we are searching for.
Another advantage of Web search based seeding is that the results are always up to date.
For example, if a page has been down for some time, it won't be found among the first
results anymore, and if a new page with the right content has appeared recently, it will be
displayed in the first part of the search results. This allows us to ignore specific hosts for
our initial searches and to concentrate on the search topics in order to get representative
results in our evaluation.
To access the databases of the search engines, we make use of Remote Procedure Calls
(RPC) [85]. RPC is a technology that allows a program to execute a method on another
computer system in a shared computer network. The search engines provide access to
their RPC calls either with Simple Object Access Protocol (SOAP) [47] requests or
utilizing Representational State Transfer (REST) [80].
SOAP is a protocol for exchanging XML-based [10] messages over computer networks.
Listing 2 shows an example of a SOAP request in which the client asks for the product
details of the ID 827635. The answer is also delivered in XML format.
Listing 2 Example of a SOAP request
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <getProductDetails xmlns="http://warehouse.example.com/ws">
      <productID>827635</productID>
    </getProductDetails>
  </soap:Body>
</soap:Envelope>
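Since SOAPpy [60] is among the packages we use, such a request could be issued from Python roughly as follows. This is only a sketch: the endpoint and method are the fictitious ones from Listing 2, and the exact call style can differ between services.

# Sketch of a SOAP call with SOAPpy; endpoint and method are fictitious
from SOAPpy import SOAPProxy

server = SOAPProxy("http://warehouse.example.com/ws")
# remote methods are called like local ones; arguments are marshalled to XML
details = server.getProductDetails(productID=827635)
print details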
REST is used as a way to access Web services using the URL as a carrier for information. Listing 3 shows two examples of Remote Procedure Calls and their REST counterparts.
Listing 3 Example of a REST application
RPC functions:
1. getUser(peter)
2. listLocations()
REST URLs:
1. http://www.example.com/user/peter
2. http://www.example.com/location/
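Retrieving such a REST resource from Python requires nothing more than an HTTP request, for example with the standard urllib2 module (the URL is again the fictitious one from Listing 3):

# Fetching a REST resource is a plain HTTP GET; urllib2 is part of the
# Python standard library
import urllib2

response = urllib2.urlopen("http://www.example.com/user/peter")
print response.read()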
Google offers two services to look up and compare the popularity of search terms.
Google Suggest [15] shows similar search terms for a given search term together with the
estimated result counts, as shown in Figure 9. Google Trends [16] shows statistics for
specified search terms and is able to compare the popularity of search terms over time,
as shown in Figure 10. We use Google Suggest to find similar search terms and Google
Trends to compare the popularity of similar search terms regarding one topic.
Figure 9: Google suggestions for wallpaper

Figure 10: Comparison of search terms with Google Trends

Google SOAP Search API
When we implemented a seeder with the Google SOAP Search API [14], this was the only
way to query Google Web searches. It has meanwhile been deprecated and replaced with
the Google AJAX Search API [13]. Because of this it is not possible to retrieve new API
keys for the SOAP API anymore, but existing keys are still functional. It is a future task
to reimplement the Google seeder with the Google AJAX Search API. Nevertheless, we
are using the SOAP API with our own API key.
With the Google SOAP Search API, developers can issue search requests to Google's
index of Web pages and receive results as structured data. The search is restricted to
a maximum of 1000 results per day and 10 results per query. We have used PyGoogle
[43], which is a Python interface to the Google SOAP Search API. It provides the function
doGoogleSearch(), which is called with the search term, the index of the first desired
result, and the maximum number of results, and delivers title, URL, summary, hostname
etc. in the result class SearchResult. We take the URL from the SearchResult class and
add it to our seeds.txt file. We iterate doGoogleSearch() as often as we need results, since
only a maximum of 10 results per query can be retrieved. The maximum number of
iterations is 100, which follows from the daily restriction to 1000 search results.
doGoogleSearch()   Search Google using the SOAP API
q                  search string
start              zero-based index of first desired result
maxResults         maximum number of results to return
filter             request filtering of similar results
restrict           restrict results by country or topic
safeSearch         enables filtering of adult content
language           restricts search to given languages
inputencoding      input encoding of given search term
outputencoding     encoding for the query result
license_key        the Google API license key to use
http_proxy         proxy to use for talking to Google

Table 3: Options of the doGoogleSearch() function
Listing 4 shows the complete query code.
Listing 4 Our query to Google
# import PyGoogle
import google

# open the seed file
f = open('seeds.txt', 'w')
# do 100 queries with 10 results each
for j in range(0, 100):
    search = google.doGoogleSearch(license_key=_googleKey, q=_searchTerm,
                                   safeSearch=0, start=j*10)
    # write out each result URL, one per line
    for i in range(len(search.results)):
        url = search.results[i].URL
        f.write(url + '\n')
f.close()
Yahoo Search Web Services
We use pYsearch [45] for our Yahoo seeder, which is a Python implementation of the
Yahoo Search Web Services API [65] utilizing REST-like requests. It offers an easy-to-use
Web search class providing similar functionality and options as the Google SOAP Search
API. Like the other APIs, this service is also restricted to 1000 results per day.
MSN Search Web Service
Like the Google SOAP Search API, the MSN SOAP Search API is no longer supported
and has no reference documentation anymore; it has been replaced by the Live Search
API [32]. Nevertheless, it still works. There is no Python interface to it, but we used the
Python Web Services SOAPpy [60] implementation to access this service. The options
and results are very similar to the Google SOAP Search API, except that the search is
restricted to 250 results per day.
3.3.2. Spamtrap Seeding
Often malware is sent and spread utilizing spam. A spamtrap is an email account
established specifically to collect spam mails. Such email addresses are propagated openly
in different places on the Internet where spammers are likely to find them. Our idea is to
collect spam in a spamtrap and then to extract all URLs contained in the messages to
use them as seeds for our crawl. Therefore we set up an email account intended to be a
target for spammers. We can also use any other email address with Post Office Protocol
v3 (POP3) [41] support. We then download the spam mails with the POP3 protocol and
extract the URLs found in the messages with the htmldata library. The found URLs are
used as seeds for our crawl. Currently, we do not examine attached files like Shelia does,
but this could be a further task to be worked on.
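A stripped-down version of such a spamtrap seeder could look as follows. It is only a sketch: the mail host and credentials are placeholders, and a simple regular expression stands in for the htmldata-based extraction we actually use.

# Minimal spamtrap seeder sketch: fetch mails via POP3 and collect URLs.
# Host and credentials are placeholders; a regex replaces the htmldata parser.
import poplib
import re

URL_RE = re.compile(r'https?://[^\s"\'<>]+')

conn = poplib.POP3('pop3.example.com')
conn.user('spamtrap')
conn.pass_('secret')

urls = set()
num_messages = len(conn.list()[1])
for i in range(num_messages):
    # retr() returns (response, list_of_lines, octets)
    lines = conn.retr(i + 1)[1]
    for line in lines:
        urls.update(URL_RE.findall(line))
conn.quit()

f = open('seeds.txt', 'w')
for url in sorted(urls):
    f.write(url + '\n')
f.close()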
3.3.3. Blacklist Seeding
A blacklist is an access control mechanism: a subset of Web sites, the blacklist, is
prevented from being accessed. A common strategy for well-known bad sites is to blacklist
them in a hosts file, which is normally used to look up the IP address of a device connected
to the Internet. The hosts collected in such lists are generally known bad sites that host
some of the most malicious content. There are some Web sites which collect such hosts in
hosts files for the community, in order to protect users from malicious content and other
unwanted parts of the Web.
We have written a program to automatically download hosts files from some major
hosts file providers, like Mike's Ad Blocking Hosts file [36], strip the contained IPs, which
usually redirect to the localhost, and use the host names as seeds for our crawls. It is
obvious that these hosts carry more malicious content than common Web sites, so they do
not represent the Web as a whole, but they do represent the bad guys and bad companies
who have built a kind of business model around these practices. Hosts files generate a
high hit rate and deliver a lot of malicious content for our research.
Listing 5 Sample hosts file
127.0.0.1 localhost
127.0.0.1 www.realbannerads.com
127.0.0.1 www.virusguard.com
127.0.0.1 www.virusrescue.com
127.0.0.1 www.antispyguide.com
Listing 6 The hostsFileSeed.sh seeder script
#!/bin/sh
#### http://hostsfile.mine.nu
wget http://hostsfile.mine.nu.nyud.net:8080/Hosts.zip -O hosts.tmp.zip
unzip -f hosts.tmp.zip
cat Hosts >> hosts
rm Hosts hosts.tmp.zip
#### http://www.mvps.org/winhelp2002/hosts.htm
wget http://www.mvps.org/winhelp2002/hosts.txt -O hosts.tmp
cat hosts.tmp >> hosts
rm hosts.tmp
#### http://www.hostsfile.org/hosts.html
wget http://www.hostsfile.org/BadHosts.tar.gz
tar xfz BadHosts.tar.gz
cat BadHosts/hosts.lnx >> hosts
rm -rf BadHosts*
#### http://hphosts.mysteryfcm.co.uk/?s=Download
wget http://hphosts.mysteryfcm.co.uk/download/hosts.zip -O hosts.tmp.zip
unzip -f hosts.tmp.zip -d hoststmp
cat hoststmp/HOSTS.TXT >> hosts
rm -rf hoststmp hosts.tmp.zip
#### http://someonewhocares.org/hosts/
wget http://someonewhocares.org/hosts/hosts -O hosts.tmp
cat hosts.tmp >> hosts
#### http://www.everythingisnt.com/hosts.html
wget http://everythingisnt.com/hosts -O hosts.tmp
cat hosts.tmp >> hosts
#### filter and format the file
grep -v '#' hosts | sed '/^$/d' | awk '{print $2}' \
  | sed 's/^[ \t]*//;s/[ \t]*$//' | sed 's/\x0D$//' | sort -u > hosts.tmp
mv hosts.tmp seeds.txt
3.4. Web Crawling
Crawling is an important aspect because it determines the scope and significance of our
research. Several questions have to be answered sufficiently: Which sites do we crawl?
Which content do we crawl? How do we extract links found in the crawled content? How
deep do we want to crawl? How should we store the crawled content? What about the
crawling performance? Which statistics do we need? How important are client capabilities
like cookies and the user-agent for the server behavior?
In the following sections, we describe our crawling strategy and the Web crawler we use
for our research.
3.4.1. Heritrix - a Web Crawler
Heritrix is an archaic word for heiress, a woman who inherits.
Since our
crawler seeks to collect and preserve the digital artifacts of our culture for
the benefit of future researchers and generations, this name seemed apt.
(The Heritrix Project)
After evaluating other free software available on the Internet, we concluded that only
Heritrix meets the requirements of our research. For this purpose we examined mainly
libcurl, wget and pavuk. All of these are meant to crawl Web sites, but they are not
designed to dump the state of a Web server in its most unchanged state like an archiver
would do. Some of them also lack recursive and parallel crawling, and all of them lack a
link extraction and queueing mechanism.
Heritrix is the Internet Archive's open source, extensible, Web-scale, archival-quality and
easily customizable Web crawler project. For our development and evaluation we have
used version 1.12.x of the stable release. Some of the features and reasons which were
important for our work are the following.
Experienced Developers
The Internet Archive is a non-profit organization which was founded to build an Internet
library. This library should offer researchers, historians and scholars permanent access to
historical collections of the Internet. It was founded in 1996 by Brewster Kahle in San
Francisco, California. The Archive has received data from Alexa Internet and others. The
current collection includes texts, audio, movies and software as well as archived Web pages
starting from 1996. The Internet Archive has built the largest public Web archive to date.
While the Internet Archive gets its data from others, they decided to start a new Web
crawler project with which they could do additional crawls of specific Web sites or for
specific purposes, among others for commercial uses. Archiving since 1996, the Internet
Archive has gained great experience in Web archiving and crawling, and Heritrix is built
out of this expertise. Because of this we do not need to concentrate on the crawling
technique, the link extraction and other details, but only on the fine tuning of the crawls
for our purpose.
Policies for Different Requirements
Crawling policies are important on the one hand to improve the performance of the
overall crawl and on the other hand not to overload the crawled hosts. They are used to
avoid multiple crawls of the same content and to coordinate the crawler threads. Heritrix
supports different policies which can be configured in detail:

• A selection policy that states which pages to download
• A re-visit policy that states when to check for changes to the pages
• A politeness policy that states how to avoid overloading visited Web sites
Free Software
Heritrix is Free Software, so we can use it for our own infrastructure, which is also Free
Software. It is constantly evolving and being extended. This gives our infrastructure the
flexibility to adopt newer versions of Heritrix without the need for adaptation, because we
always use the same interface to Heritrix and get the same ARC files as output, which we
then process. Newer versions will have more and better features. Through its consistent
modularity and open API, it is very easy to extend or to rewrite the functionality of Heritrix.
Support for Use Cases
The Heritrix crawler is suitable for many use cases which place different requirements on
the configuration:

• Broad crawling: Broad crawls are large-scale and high-bandwidth crawls. Every queued host is recursively crawled. Every discovered link is queued unless it fails one of two parameters. The first is max-link-hops: URLs more than this number of links away from a seed will not be ruled in-scope. The second is max-transitive-hops: URLs reached by more than this number of transitive hops will not be ruled in-scope, even if otherwise on an in-focus site. A broad crawl tries to cover as many Web sites as possible within the given resources like time to crawl, available bandwidth and available storage.
• Focused crawling: Focused crawls are crawls with a complete coverage of a selected topic or site. The amount of crawled data depends on the crawled sites.
• Continuous crawling: Continuous crawling re-crawls already crawled content over time to monitor changes between previously determined periods.
• Experimental crawling: Heritrix can be used to experiment with different crawling strategies and to evaluate, for example, different protocols, crawling orders and archiving methods.
Logging
Heritrix provides several log files per crawl. The location of the files can be customized.
The generated files are:

• crawl.log: The main log for all crawls. For each tried URL, successful or not, a line is added to crawl.log with
  1. a logging time stamp
  2. the fetch status code
  3. the size of the downloaded document
  4. the URL
  5. a bread crumb of the discovery path for the current URL
  6. the last actual referrer
  7. the MIME type
  8. the ID of the worker thread that downloaded this document
  9. a time stamp indicating when the network fetch was begun and eventually a duration
  10. a SHA1 digest of the content
  11. the source tag inherited by this URL
  12. annotations, if any have been set, like the number of times the URL was tried
• local-errors.log: network-related problems when trying to fetch a document; can provide an insight for advanced users when other logs and/or reports contain unusual data
• progress-statistics.log: At configurable intervals a line about the progress of the crawl is written to this file, containing the following information:
  time stamp: time stamp indicating when the line was written, in ISO8601 format
  discovered: number of URLs discovered to date
  queued: number of URLs queued at the moment
  downloaded: number of URLs downloaded to date
  doc/s(avg): number of documents downloaded per second since the last snapshot (in parentheses: since the crawl began)
  KB/s(avg): amount in kilobytes downloaded per second since the last snapshot (in parentheses: since the crawl began)
  dl-failures: number of URLs that Heritrix has failed to download to date
  busy-thread: number of threads currently busy processing a URL
  mem-use-KB: amount of memory currently assigned to the Java Virtual Machine
• runtime-errors.log: captures unexpected exceptions and errors that occur during the crawl, which may be due to hardware limitations or, more often, software bugs
• uri-errors.log: logs errors in dealing with encountered URLs, usually caused by erroneous URLs. Generally, it is only of interest to advanced users trying to explain unexpected crawl behavior.
• recover.gz: a gzipped journal of all events. It can be used to restore the discovery path of crawled content.
Multi-Threaded Design
The multi-threaded design ensures optimal usage of the resources, also on multiprocessor systems.
URL-Normalization
Syntactically different URLs can represent the same content on the Web. To avoid
overhead caused by multiple downloads of the same resources, we have to use several
URL-normalization techniques. Heritrix has a dedicated module for this task and implements
several well-known and widely used URL-normalization techniques. The classes which
implement the normalization reside in org.archive.crawler.url.canonicalize. Every
canonicalization module has to inherit from the abstract
org.archive.crawler.url.canonicalize.BaseRule class. Every module can be added to a crawl.
The currently available canonicalization rules inheriting from BaseRule are:

• FixupQueryStr: Strip any trailing question mark
• LowercaseRule: Lowercases the URL
• RegexRule: General conversion rule
• StripExtraSlashes: Strip any extra slashes found in the path
• StripSessionCFIDs: Strip ColdFusion session IDs
• StripSessionIDs: Strip known session IDs
• StripUserinfoRule: Strip any 'userinfo' found on http/https URLs
• StripWWWRule: Strip any www found on http/https URLs, if they have some path/query component (content after the third slash). Top slash page URLs are left unstripped, so that Heritrix prefers crawling redundant top pages to missing an entire site only available from either the www-full or www-less host name, but not both.
Link Extraction
Intelligent link extraction modules for various file and request types are a major advantage
of Heritrix because they guarantee maximum coverage of the Web content. For every
crawled content, depending on the MIME type, Heritrix executes the corresponding
extractor modules and adds the extracted URLs to the queue.
Advanced Web Interface
Heritrix is administered via an easy-to-use Web based user interface. Crawl jobs can be
configured, reconfigured, queued, started, paused, and stopped. The UI displays the status
of the currently running crawl job, listing vitals such as documents crawled, documents
pending, megabytes downloaded, download rate, and critical exceptions, as shown in
Figure 11. Crawl job reports and logs, both for the currently running job and for jobs run
in the past, can be accessed via the UI.
JMX Interface
Heritrix also features a Java Management Extensions (JMX) interface [29], which allows
us to remotely control its behavior from our own Web interface. It does not yet offer the
full functionality provided by the Heritrix Web interface, but the main control operations
are already there.
Module Name            Description
ExtractorCSS           parsing URLs from CSS type files [7]
Extractordoc           extract href style links from Word documents
ExtractorHTML          basic link extraction from an HTML content-body, using regular expressions
ExtractorHTTP          extracts URLs from HTTP response headers
ExtractorImpliedURI    finding implied URLs inside other URLs
ExtractorJS            processes JavaScript files for strings that are likely to be crawlable URLs [27]
ExtractorPDF           processes PDF [40] files for the purpose of extracting URLs
ExtractorSWF           extracts URLs from Macromedia Flash files [33]
ExtractorUniversal     looks at the raw byte code and tries to extract anything that looks like a link
ExtractorURI           finding URLs inside other URLs
ExtractorXML           a simple extractor which finds HTTP URLs inside XML/RSS files
JerichoExtractorHTML   improved link extraction from an HTML content-body using the Jericho HTML Parser [28]

Table 4: Extractor modules contained in Heritrix
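To illustrate what the simplest of these modules do (roughly the regular-expression approach of ExtractorHTML, not Heritrix's actual implementation), a naive extractor could pull href and src attributes out of an HTML body like this:

# Naive HTML link extraction sketch; real extractors such as Heritrix's
# handle many more cases (relative URLs, JavaScript, malformed markup, ...)
import re

LINK_RE = re.compile(r'(?:href|src)\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def extract_links(html):
    return LINK_RE.findall(html)

sample = '<a href="http://www.example.com/a.html">a</a><img src="/b.gif">'
print extract_links(sample)
# -> ['http://www.example.com/a.html', '/b.gif']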
3.4.2. Configuring Crawl Jobs
Heritrix has a mature job management with jobs and profiles. For every desired crawl, a
job has to be created. Jobs can be created from scratch or based on prior existing jobs or
preconfigured profiles. Profiles can be preconfigured for every kind of crawl job.
Each job configuration and run consists of the following steps:
1. Create the job: creation from an existing profile or job
2. Configure the job: configuration of which modules are involved and of the options of the involved modules
3. Run the job
3.4.3. Heritrix Architecture
In this section, we give an overview of the Heritrix architecture, describing the general
operation and key components of the software.
Heritrix has two configuration interfaces. The main and fully functional configuration
interface is the Web interface; the second, not yet fully functional interface is the JMX
interface. Heritrix is modular: the modules to be used can be chosen and configured at
runtime over the Web interface. The consistent modular design allows us to write our
own modules as a replacement for or an extension to the Heritrix crawler. There is also a
developer's guide which introduces developers to writing their own modules.
Figure 11: The Heritrix Web interface console running a job
Heritrix consists of some core classes and pluggable modules. Core classes can be
configured but not replaced; pluggable classes can be substituted. There exists a set of
basic pluggable classes which are part of the Heritrix package.
A CrawlController class holds all classes cooperating to do a crawl and provides an
interface to the running crawl. It executes a master thread. Figure 12 shows the scheme
of the CrawlController. The CrawlController passes the URLs from the Frontier to the
ToeThreads. The Frontier is responsible for determining the next URL to process.
The Frontier also enforces the politeness policy defined in Section 3.4.1 to avoid
overloading Web servers. If a given URL complies with the politeness policy, it will be
delivered to a requesting ToeThread. Heritrix is multi-threaded, and each URL to be
processed is handled by one thread called a ToeThread. A ToeThread requests a new URL
from the Frontier, sends it through all the processors and then requests a new URL. The
processors are grouped into processor chains: the prefetch processors, the fetch processors,
the extractor processors, the write/index processors and the post-processors.
The processor chain works through the following steps for one URL:

• The URL is checked as to whether it is really in scope.
• The URL is fetched with the corresponding protocol.
• The extractor processors try to discover new URLs in the fetched content.
• The fetched content is indexed and written to disk.
• The state of the crawl is updated.
• The found URLs are checked for whether they are in scope.
• URLs found to be in scope are rescheduled to the Frontier.

Figure 12: Simple CrawlController scheme [19]

The whole crawl job ends when there is no URL left in the Frontier and no ToeThread
delivers a URL. The resulting output of a terminated job are the files generated by the
Writer processors and the logging information described in Section 3.4.1.
3.4.4. The ARC File-Format
Heritrix has different so-called Writer processors to write the crawled content to disk.
These Writer processors are designed for different purposes. The currently implemented
Writer processors comprise:

• MirrorWriterProcessor: Processor module that writes the results of successful fetches to files on disk. It writes the contents of one URL to one file on disk. The files are arranged in a directory hierarchy based on the URL paths, so hosts are mirrored like the file hierarchy that might exist on the servers.
• ARCWriterProcessor: Processor module for writing the results of successful fetches to aggregated files in the Internet Archive ARC file-format [54].
• ExperimentalWARCWriterProcessor: Experimental processor module for writing the results of successful fetches to the Web ARChive file format [64], which will be the successor of the ARC file-format.

The ARC file-format is a format to aggregate many files in one, to ease and optimize
file storage, and to increase overall filesystem performance. It was designed to meet the
following requirements:
• Self-containment: The file contains all information needed to identify and extract the aggregated objects. No other indexes or the like should be necessary.
• Multi-protocol support: Many high-level protocols other than HTTP, like DNS, FTP and HTTPS, are supported.
• Streaming: ARC files are streamable, since every content block consists of its header with information and the content itself. Every aggregated object can be streamed without the need for any parts of an ARC file other than the actual position where the object resides.
• Viability: Each record should be viable. After a record has been written, its integrity should not depend on additional, later generated indexes or the like.
Listing 7 shows an example record of an ARC file. The first part contains all information
regarding the crawled content, like the protocol, URL, timestamp, request header etc.
The first part is in plaintext and is separated from the second part by a blank line.
The second part holds the crawled content in binary format. Every crawled content is
aggregated into such ARC files of a fixed size (we are currently using 100 MB).
Listing 7 Example of an ARC-file record. The first part holds the information, the second part the content. Each ARC file usually contains hundreds or thousands of records, depending on the aggregated file sizes.
http://test.org/test/ 12.3.4.5 20070322153737 text/html 9551
HTTP/1.1 200 OK
Date: Thu, 22 Mar 2007 13:37:14 GMT
Server: Apache/1.3.37 (Unix) mod_perl/1.29
Connection: close
Content-Type: text/html; charset=utf-8
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html> <body> Hello World! </body> </html>
Every record is atomic and thus totally independent from any other part of the ARC file or any other information stored elsewhere in the crawl.
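Given the record layout from Listing 7 (a plaintext header line with URL, IP, timestamp, MIME type and length, followed by the content block), a minimal reader for an uncompressed ARC file could look as follows. This is only a sketch of the idea under that simplified layout; in practice we rely on the arcreader tool and the CDX index described in the next section.

# Minimal sketch of iterating over records in an uncompressed ARC file.
# Assumes the simplified record layout shown in Listing 7.
def read_arc_records(path):
    records = []
    f = open(path, 'rb')
    while True:
        header = f.readline()
        if not header:
            break
        if not header.strip():
            continue  # skip blank lines between records
        # header fields: URL, IP, timestamp, mimetype, length
        fields = header.strip().split(' ')
        length = int(fields[-1])
        body = f.read(length)
        records.append((fields[0], fields[3], body))
    f.close()
    return records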
The ARCWriterProcessor usually writes the crawled contents to several ARC files at a
time. ARC files that are currently being written have an additional .open suffix; those
without such a suffix are completed. The produced ARC files reside in the arcs directory
of the crawl and are named as follows:

[prefix]-[12-digit-timestamp]-[series#-padded-to-5-digits]-\
[crawler-hostname].arc.gz

For example:

Monkey-Spider-20070307182551-00422-spongbob.arc.gz
3.4.5. The CDX File-Format
The CDX file-format is used internally by the Internet Archive but is not officially
standardized. It is a plain text file holding all the information contained in an ARC file
except the contents of the crawled files. It is used as an index to an ARC file and is
generated with the arcreader tool contained in Heritrix. The following command generates
a CDX file for a given ARC file:

$> arcreader --format cdxfile Monkey-Spider-20070307182551-00422-spongbob.arc.gz

Listing 8 shows an example of a CDX file. Every line corresponds to a record in the
ARC file.
Listing 8 Example contents of a CDX file with one contained file (trimmed)
CDX b e a m s c V n g
20070307122203 0.0.0.0 filedesc://test.arc text/plain - HWP[..]U6N 0 1280 test
20070307122202 66.35.250.203 http://sf.net text/html - V24[..]H57 694 10381 test
The contained information is the same as in the first block of the ARC-file records.
Table 5 shows the meaning of the columns of the CDX file.

index   content
b       timestamp
e       IP
a       URL
m       mimetype
s       -
c       md5sum
V       file offset
n       content length
g       ARC-file

Table 5: Contents of the CDX file-format
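Because each CDX line is just a space-separated record with the columns of Table 5, it can be turned into a small index with very little code. The following sketch assumes the nine-column layout shown above; the .cdx file name is illustrative.

# Sketch: parse a CDX file into a list of dictionaries, one per ARC record.
# Assumes the nine-column "b e a m s c V n g" layout from Table 5.
FIELDS = ['timestamp', 'ip', 'url', 'mimetype', 'unused',
          'checksum', 'offset', 'length', 'arcfile']

def parse_cdx(path):
    entries = []
    for line in open(path):
        if line.startswith('CDX') or not line.strip():
            continue  # skip the header line and blank lines
        values = line.split()
        if len(values) != len(FIELDS):
            continue  # skip malformed lines
        entries.append(dict(zip(FIELDS, values)))
    return entries

for entry in parse_cdx('Monkey-Spider-20070307182551-00422-spongbob.cdx'):
    print entry['url'], entry['mimetype'], entry['checksum']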
3.5. Malware Detection
This part is the main field of operation for the research done on the crawled content.
While a crawl is still running, the crawler constantly generates compressed ARC files,
which we analyze in the meantime. The ScanFolder part works step by step through each
ARC file. After processing, each of the ARC files is deleted or recompressed.
3.5.1. ScanFolder
The Heritrix crawler generates a folder for each crawl job containing the configuration,
the log files, the runtime files, and a folder with the corresponding ARC files. The
ScanFolder program is a command line program which delegates the scanning work done
on a folder of ARC files. It creates a queue of already closed ARC files and performs
extraction and malware scanning on each of them, as shown in Figure 13.
Figure 13: Operation of the ScanFolder program
3.5.2. Extractor
The Extractor is a command line program named ExtractArcfile which operates on
exactly one ARC file at a time. Using the arcreader program of the Heritrix package,
we create a CDX file containing all the relevant information regarding the ARC file.
Before processing, the file has to be unpacked with gunzip [11]. After that, the Extractor
dumps the files contained in that ARC file into one directory of the same name, ignoring
the original folder index and filenames of the contained files. Instead, it uses the checksum
already contained in the CDX file as the filename, which yields an approximately unique
filename due to the collision resistance of hash functions. Furthermore, it uses the MIME
[38] type, also contained in the CDX file, to generate a filename extension utilizing the
mimetypes.guess_extension() function contained in the Python library. Listing 9 shows
some generated filenames.
Listing 9 A sample of extracted files
EASN4AF6NPYDXFK563WJKVFZLVE5COMN.html OZK6OENXC4AENH6JG5RWY7PCSUFQNSXP.js
ZQYZF2FDRD3BDQ6VNLWPHYUHLKT6PGYI.swf EATYUELC2PU7YWEIF3U23SFCZ2PHKRE4.gif
OZUR7BGJXKP6AHGDMAQGHPH5B6PBINIZ.exe ZRBXC2IWGFK456ADC5MVZT3DFPFLFSCJ.jpe
The files are then analyzed with the malware scanner. After the analysis, the directory is
removed and the ARC files are compressed again. The whole procedure of the
ExtractArcfile program is shown in Figure 14.

Figure 14: Operation of the ExtractArcfile program
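The filename generation itself is simple; a sketch of the idea, using the checksum and MIME type taken from the CDX record (the function name is illustrative), might look like this:

# Sketch: derive an output filename from a CDX record's checksum and MIME type
import mimetypes

def make_filename(checksum, mimetype):
    # guess_extension() returns e.g. '.html' for 'text/html', or None
    ext = mimetypes.guess_extension(mimetype) or ''
    return checksum + ext

print make_filename('EASN4AF6NPYDXFK563WJKVFZLVE5COMN', 'text/html')
# -> e.g. EASN4AF6NPYDXFK563WJKVFZLVE5COMN.html (the exact extension may vary by platform)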
Performance Measures for the Extractor Program
In the beginning, we used the arcreader program to extract the contents of an ARC file
and found that it performed very badly on more than a dozen files. We realized that it
would be unbearable to use this program for the extraction, because the extraction would
take much more time than the analysis of the files, which is the most time consuming part
of the Monkey-Spider system. We searched the Internet for other extraction tools and did
not find anything suitable for us, so we decided to program this ourselves. We called our
program ExtractArcfile2ndGen. Comparing the runtimes of both extractor programs shows
a considerable acceleration of the extraction process: ExtractArcfile2ndGen was about
50 to 70 times faster than the initial ExtractArcfile program we had used before.
3.5.3. Malware Scanner
With our approach of high-bandwidth and broad-scope crawling, we have observed our
particular crawler (described in Chapter 4) to crawl about 30 URLs/sec. With such
bandwidth we need fast and reliable tools for automatic malware analysis, because the
analysis appears to be the most time consuming part of our processing chain. Depending
on whether the malware is already known, our analysis can be separated into two stages.
The first is the detection of known malware with common malware scanners like ClamAV,
and the second is the analysis of suspicious binaries with automatic malware analysis tools
like the CWSandbox. Furthermore, the Monkey-Spider system can be extended with any
other automated analysis tool.
3.5.4. ClamAV
All the extracted files are examined with ClamAV. We do not use a Python interface for
ClamAV in order to reduce our system dependencies. The version information and the
output of ClamAV are dumped to a file called clamav.report. An example report file is
shown in Listing 10.
Listing 10 Example of a ClamAV report file
ClamAV 0.90.2-exp/3236/Sun May 13 09:23:27 2007
N2H5UALHLJ3YZ6D27ZPOKYJFLWFUNQCS.zip: Trojan.W32.HotKeysHook.A2 FOUND
3UXDRDBAO3UXZ3GT3MKIWBF77XH5MRKC.html: HTML.Phishing.Pay-6 FOUND
ES3RAP26G2C2WCUWAZI4JHPLUTTE2JZE.html: Trojan.Downloader-2388 FOUND
This file is then parsed with Python. The relevant information for detected malware
plus the version information is then stored in the database. The malicious code is stored
in a separate attic directory for further studies. The analysis directory is removed and
the ARC file is either removed or repacked, depending on the purpose of the crawl. We
use the fast option of gzip for repacking to reduce the time consumed. Figure 15 shows
the operation of the ClamAV program.
Figure 15: Operation of the ClamAV program
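A reduced sketch of this step is shown below. It assumes that the clamscan command line scanner is installed, the directory name is a placeholder, and database handling is omitted; the real program additionally records the scanner and signature versions.

# Sketch: scan an extraction directory with clamscan and parse the report.
# Assumes clamscan is installed; database handling is omitted here.
import os

def scan_directory(directory, report='clamav.report'):
    # -r: recursive, --infected: only report infected files
    os.system('clamscan -r --infected %s > %s' % (directory, report))
    detections = []
    for line in open(report):
        line = line.strip()
        if line.endswith('FOUND'):
            # line format: "<path>: <malware name> FOUND"
            path, rest = line.split(':', 1)
            detections.append((path, rest.replace('FOUND', '').strip()))
    return detections

for path, name in scan_directory('EXTRACTED_ARC_DIR'):
    print path, '->', name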
3.5.5. CWSandbox
The CWSandbox [8] is a tool to automatically analyze malware based on behavior
analysis. It is also developed at our chair, by Carsten Willems [71]. Suspicious code is
executed for a finite time in a simulated environment (the CWSandbox), where all system
calls are monitored and analyzed. The CWSandbox then generates a detailed report of
the behaviour of the code.
The CWSandbox uses three strategies to fulfill its three criteria:
• Dynamic analysis means that malware is analyzed by executing it within a simulated environment, to achieve the automation of the analysis.
• API hooking means that calls to the Windows API are re-routed to the monitoring software before the actual API code is called, and grants the effectiveness of the overall process.
• DLL code injection allows API hooking to be implemented while granting confidence in the implementation and the correctness of the reported results.

Utilizing these three strategies, the CWSandbox automatically generates reports describing,
for example, the following behaviour of the analyzed binary: file creation and modification,
Windows Registry modification, DLL utilization, memory access, process creation, and
network connections.
We are implementing a program to analyze all crawled binaries with the CWSandbox
and to synchronize the results with our database, but it is not finished yet.
3.6. Web Interface
The Monkey-Spider system is designed to be completely controlled over a Web interface.
For this we have used the Karrigell Web framework [30]. Because our Web interface is not
yet fully functional, the following description is partly a target state. User authentication
is used for access control, but the communication is not encrypted. The interface works
job-based like Heritrix and uses the jobs created by Heritrix for further processing and
storage. Our interface is split into the following main pages.
Status-Page
The status-page (Figure 16) provides an overview of the whole system. It shows
information without any other functionality. The shown information is the current seed
file, the status of the crawler, the status of the malware scanner, and the status of the
database.
Seeding-Page
The seeding-page (Figure 17) is the place where the seeder modules can be used to
generate seeds for the crawl. Seeds can be created using queries to Web search engines
with different properties. A spamtrap can be downloaded through POP3. Additionally,
URLs can be added manually, too. It can be specified whether the old seed should be
overwritten or not. Figure 17 shows the generation of a Web search based seed on the
seeding-page of the Monkey-Spider Web interface.
Crawling-Page
The crawling-page (Figure 18) is used to control Heritrix and to configure and control
crawls. The crawls are normal Heritrix crawl jobs. Crawl jobs can be generated using the
previously generated seeds; jobs can be started and running jobs can be paused. The
crawling-page shows pending, running and completed crawl jobs and is able to abort them
as well.
Figure 16: The Monkey-Spider status-page
Scanning- and Database-Page
The scanning- and the database-pages are not functional yet. The scanning-page will
provide the possibility to scan a crawl job with different malware scanners, and the results
will be accessible through the database-page, which will provide a Web interface to the
database.
3.7. Malware Database
All our results except the binary files are stored in a database. As our main database we
use PostgreSQL, but there was no special requirement for the database other than being
free software, SQL-based and performant. Any other database with the same qualities
and a Python interface could be used.
We have generated three tables for our database scheme. The malware table holds all unique found malware files. The mw_scanner table holds all specific malware scanner versions with all major and minor versions and update dates. The mw_output table holds all descriptions of found malware samples associated with one particular malware and one particular malware scanner version.
• malware: Holds a unique malicious content
  id: A unique malware-ID number
  filename: The filename of the malicious content
  url: The URL where the content was located
  checksum: The generated checksum
  size: Filesize
  date: Crawl date
  crawlname: The name of the crawl job which crawled the file
  comment: Placeholder for any additional information

• mw_scanner: Holds a unique malware scanner plus version info
  id: A unique malware scanner-ID number
  name: The name of the malware scanner
  engine_ver: The version of the malware scanner engine
  signature_ver: The version of the malware signatures
  lastupdate: The date of the last update of the signature database

• mw_output: Holds a description of a specific malware sample, providing:
  mw_id: Foreign key to malware.id
  mw_sc_id: Foreign key to mw_scanner.id
  description: The name of the malware according to the malware scanner

Figure 17: The Monkey-Spider seeding page
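As a sketch of how this scheme can be queried from Python, the following could produce a report of all detections. It assumes the PyGreSQL DB-API module pgdb and a database named monkeyspider; both are deployment-specific assumptions rather than fixed parts of the system.

import pgdb  # PyGreSQL DB-API 2.0 module

def list_detections(dbname='monkeyspider'):
    """Print every detection together with the scanner that reported it."""
    conn = pgdb.connect(database=dbname)
    cur = conn.cursor()
    cur.execute("""
        SELECT m.url, m.filename, s.name, o.description
        FROM mw_output o
        JOIN malware m    ON o.mw_id    = m.id
        JOIN mw_scanner s ON o.mw_sc_id = s.id
        ORDER BY m.url
    """)
    for url, filename, scanner, description in cur.fetchall():
        print('%s %s (%s): %s' % (url, filename, scanner, description))
    conn.close()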
Figure 18: The Monkey-Spider crawling page
3.8. Limitations
The World Wide Web is a system of interlinked hypertext documents that runs over the Internet. We regard the whole World Wide Web as the set of content available on computer systems with a unique IP address at a fixed time. We call this set the Web. The World Wide Web is not synonymous with the Internet, but we use the terms interchangeably throughout this work because they are commonly used this way. The first drawback in every research effort on this set is that it changes significantly over time and cannot be captured completely in any database or similar structure. Every machine connected to the Internet has (or should have) a unique IP address, or is assumed to have one. Not every machine connected to the Internet has to have a DNS name or anything else that is usual for a World Wide Web server, like a Web server listening on port 80 or even replying to ping requests, and it may only be reachable for a small period of time. These wobbling nodes can host malicious content and serve it. We want to access as much of this set as possible at different times, because it is not possible to dump the Web at a particular point in time. Therefore we crawl content, extract as many links as possible from this content, and try to follow these as far as useful.
3.8.1. Avoid Detection and Blacklisting
In spite of not following robots.txt files, which is a restriction mechanism for Web crawlers, malicious Web site operators could prevent honeyclients from accessing their sites by utilizing blacklists. It is therefore very important that IP address ranges of research honeyclients are not revealed to the public. Thus neither the operator nor the location of such honeyclients should be published.
Additionally, future malware could use vulnerabilities in the utilized malware analysis tools to report back the honeyclient's IP ranges or even detect the type of the honeyclient. Analogous to the blacklisting of dangerous Web sites by the Internet community, such a blacklisting of dangerous Web clients could become common among malicious Web site operators.
3.8.2. Hidden Web
Another major drawback when we crawl the Web is that we crawl all contents based on URLs which we have either started with or which we have discovered automatically somewhere. A crawler automatically traverses the Web graph and retrieves pages. Normally it only gets a portion of the Web called the publicly indexable Web [91]. But main parts of the Web belong to the so-called Hidden Web, also sometimes referred to as the Deep Web. The Hidden Web mostly consists of form-based database interfaces and/or authentication-protected information which has no static URL linked anywhere. We do not process forms or authentication mechanisms on Web sites and thus cannot access this content. According to Bergman [70], the public information on the Hidden Web was 400 to 550 times larger than the publicly indexable Web in 2001.
3.8.3. Dynamically Created Content
Code obfuscation is one of the methods that malware writers use to hide their intentions from researchers [106].
In Listing 13, an example of code obfuscation is shown. The code was downloaded from [26] and uses two obfuscation techniques. The goal is to confuse human analysts and to avoid being automatically analyzed.
The first technique is a very easy and common technique of encoding the program code as a binary sequence with standard reversible escape sequences. The unescape function easily decrypts the program code, as shown in the second block. The second obfuscation technique is a non-linear mapping based on a pseudorandom string t. Depending on the input stream an output stream is generated. Even though we can extract links out of JavaScript code, we cannot detect dynamically generated links and content or obfuscated links in JavaScript code, because we have no JavaScript interpreter and do not execute scripts during the crawl; as a result, we do not crawl the generated links.
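For the first obfuscation layer, the decoding step can be reproduced outside a browser. The following minimal Python 2 sketch uses urllib.unquote as the counterpart of JavaScript's unescape; the escaped sample string is abbreviated and hypothetical.

import re
import urllib

# Hypothetical, abbreviated sample of an escape-obfuscated script block.
escaped = 'eval(unescape("function%20w%28s%29%7B%20...%20%7D"))'

# urllib.unquote reverses the %XX escape sequences, just like JavaScript's unescape().
match = re.search(r'unescape\("([^"]*)"\)', escaped)
if match:
    decoded = urllib.unquote(match.group(1))
    print(decoded)   # -> function w(s){ ... }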
3.8.4. Independent Crawls
In general, Web crawlers do not synchronize two or more different crawls with each other to avoid duplicate data or crawling the same content more than once. The same problem is apparent in our infrastructure because our crawler does not support this either. So it is indeed the case that in our evaluation we crawl and analyze partly the same content over time and for different topics. This could make a difference for the results, since top-ranked sites like wikipedia.com or amazon.com are likely to be crawled because they are often linked on many sites.
3.8.5. HTTP Headers, Cookies and other Client Information
Depending on the Web server and the Web site programmer, Web sites often change their responses based on the client information they are able to retrieve. Some of the methods of retrieving client information are HTTP request headers, HTTP cookies, Flash cookies and client-side scripts like JavaScript. We have observed a malicious JavaScript script (http://www.mt-download.com/mtrslib2.js), probably exploiting the browser's security, which could not be downloaded directly or executed with any browser other than Internet Explorer. Due to similar behavior we are probably missing some malicious content.
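A crawler can partly work around this by presenting itself as a common browser. The following minimal sketch, which is not part of the current Monkey-Spider implementation, fetches a URL with urllib2 while sending an Internet Explorer User-Agent header; the header string and URL are only examples.

import urllib2

def fetch_as_ie(url):
    """Request a URL while pretending to be Internet Explorer on Windows XP."""
    request = urllib2.Request(url, headers={
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    })
    return urllib2.urlopen(request).read()

# Example (hypothetical URL):
# body = fetch_as_ie('http://www.example.com/')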
3.8.6. Peer-to-Peer Networks
The classic client-server network architecture is one where many client nodes connect to a few server nodes to request content, and the server delivers the requested contents to the clients. In this architecture the clients do not need to be connected to each other, and the data flows only between one client and one server. Our approach operates as a client and requests the contents from the Internet servers; other Internet clients remain unaffected.
Peer-to-peer networks are decentralized networks which often do not distinguish between two classes of networking nodes; they only have peers which are connected with each other either directly or indirectly over other peers. Content indexes and content distribution go over other peers. This network architecture can be used for almost any type of service, from file sharing to Web serving, but the vast majority of users today use peer-to-peer file-sharing networks like Gnutella [12], Kazaa [31] and Freenet [56].
Peer-to-peer file-sharing networks continue to grow in popularity. These networks are often used for file sharing and therefore are predestined to spread malware. Recent research [89, 98] on malware in peer-to-peer networks has shown that often more than fifty percent of the shared files turned out to be malicious code.
Because we crawl content which is linked somewhere and can be requested with the HTTP protocol, our crawler is not able to retrieve files from any peer-to-peer network whose content is not directly linked or indexed somewhere as an HTTP-requestable URL. Thus this type of network is out of our research scope.
4. Evaluation
We have evaluated our system over a period of three months from February till April 2007. In the beginning of February we did an initial test crawl with an adult-based topic search; this crawl will be further denoted as the first crawl. After that we did several topic-based and hosts-file-based crawls in March and April.
We have used ClamAV as the malware detector for our evaluation, since this was the only module with full functionality at the time of our analysis. In contrast to other projects like the UW Spycrawler [93], we have scanned every crawled content regardless of its type. This procedure ensures that we do not overlook malicious content, but creates a performance trade-off for scanning unnecessary content. In contrast to other anti-virus solutions, the ClamAV anti-virus scanner has additional detection capabilities for phishing and Trojan-downloading Web sites. Additionally, we have observed that some portion of executables, possibly malicious ones, are neither marked as executables in their mimetype specifications nor have a corresponding file extension, e.g. .exe or .msi. Because of these advantages we decided to accept the slower performance of the system and scan all crawled content.
The evaluation system was one standard PC utilizing a high-bandwidth Internet connection at the university.
4.1. Test Environment
We have set up a PC with an Intel Pentium 4 CPU with 3.00 GHz and 2048 KB L2 cache. It is a hyperthreading-enabled standard processor equipped with 2 GB RAM. The system has three hard disks, one with 80 GB and two with 750 GB capacity. The 80 GB disk is used for the operating system. For the huge storage needs of the crawls and the analysis, the two 750 GB disks were put together as a logical volume with 1.5 TB.
The operating system is a standard Ubuntu [62] version 6.10 installation with the additional programs needed for the Monkey-Spider. Like most Linux systems, the Ubuntu system already contains the Python interpreter. To install the Monkey-Spider system we take the following steps: installing and configuring the PostgreSQL database server, the Sun Java Runtime Environment, the Heritrix Web crawler, the Karrigell Web framework, the Clam anti-virus scanner in the most recent version, the PyGoogle Python package, the pYsearch Python package, the Python Web Services package, and the PyGreSQL Python package.
Finally, the Monkey-Spider program is installed and configured. After starting Karrigell, Monkey-Spider is ready to be accessed over its Web interface at http://localhost:8081. It has to be noted that this communication is password protected but not encrypted, so any system in between could gather the password and other sensitive information.
4.2. Crawls
We have initiated broad-scope crawls based on topic-specific Web searches and on aggregated hosts files. Being inexperienced with Web crawling, we have specified huge values for our broad-scope crawls. The max-link-hops value was 100 and the max-trans-hops value was 30. With this scope, a theoretical depth of 100^30 = 10^60 domains and unlimited links within a domain, the crawls could be able to crawl major parts of the Web, even though there are not 10^60 domains on the net and some pages are not linked on any other Web site. Therefore the crawls were not executed to the end, and thus we ended the crawls manually. The effect of this can be clearly seen in the mimetype distribution of the complete crawls in Figure 19.
Figure 19: Filetype distribution
More than half of the downloaded files are Web sites, followed by images. If the hosts were completely crawled, the share of the other file types would have to be higher than this. Another interesting observation is the third rank of the no-type category, depicting a lack of mimetype information in the response. This is an additional endorsement for our approach of malware scanning of all downloaded files rather than only considering executable files, e.g. application/octet-stream.

A First Test Crawl
We have done a first test crawl based on adult search topics in February 2007 with a starting seed of 2205 URLs. We have done this to revise our settings and optimize the crawler performance. During our crawl we have observed some performance bottlenecks depending on the configuration of the crawls. The count of maximum available Toe-Threads and the total amount of maximum usable memory had to be adjusted. The Toe-Threads are the active crawl threads and are responsible for checking, crawling, and analyzing a URL. The initial setting for this option was 250. This value can be sufficient with about 250 MB of usable main memory as occupied by the Java Virtual Machine (JVM). We adjusted the possible memory load for the JVM to 1500 MB and found 600 Toe-Threads to be a good trade-off for our configuration. We achieved an average download rate of about 1 MB/sec and crawled 26 documents/sec.
The Extractor modules work relatively efficiently. Table 6 shows the extraction efficiency for the initial crawl.
module               handled objects   extracted links   links/object
ExtractorHTTP                6487350            735538           0.11
ExtractorHTML                4132168         345231030          83.55
ExtractorCSS                   19881             88307           4.44
ExtractorJS                    21658            162291           7.49
ExtractorSWF                   17921             11117           0.62
ExtractorXML                   35165           1638260          46.59
ExtractorURL                 6506489           6776544           1.04
ExtractorUniversal           1901858           1205772           0.63

Table 6: Extractor modules extraction efficiency
It has to be considered that only discovered links which pass the requirements are queued for crawling. These requirements include, among others, whether a URL is already queued or downloaded, whether a URL is still unique after normalization, whether a URL is in the scope of the crawl, etc.
We have observed some interesting aspects during our first extensive test crawl, like discovering mainly one type of malware, Trojan.Downloader.HTML.Codebase, as shown in Table 7, which lists the type and count of malware found during our first crawl.
name                              occurrence   name                       occurrence
Trojan.Downloader.HTML.Codebase         3634   Worm.Sobig.E                        8
Trojan.Downloader-2388                   224   W32.Elkern.C                        8
Worm.Yaha.G-2                             92   Worm.Sobig.F-1                      7
Dialer-891                                64   Phishing.Email.SSL-Spoof            7
Trojan.Vundo-181                          54   Worm.Mydoom.M                       6
Dialer-715                                41   Worm.Bagle.Z                        6
Trojan.Downloader.JS.IstBar.A-2           20   Worm.Bagle.Gen-vbs                  6
Worm.Sircam                               17   Worm.Bagle.AP                       6
Adware.Casino-3                           15   Trojan.JS.RJump                     6
Worm.SomeFool.P                           11   Exploit.HTML.IFrame                 6

Table 7: Top 20 malware types of our test crawl
After analyzing the malware database, we have detected that this malware is one particular Trojan downloader distributed almost entirely from the same Web site (http://99calze.com/Galleria_PopUpFoto.php) with different GET values. We have observed 3471 samples from this site. We suppose that this is an error of the Web site programmer, because such an obfuscation would not make sense, but the site itself is malicious.
Topic Based Crawls
After the optimizations we have executed five topic-based crawls during March and April for the topics adult, pirate, celebrity, games, and wallpaper. To generate starting seeds we used synonymous and commonly used terms for these topics utilizing Google Suggest and Google Trends. For example, the topic wallpaper was searched for with "free wallpapers", "wallpapers", etc.
Hosts File Crawl
After that we have aggregated hosts files from six major hosts file providers into one seed file with a total of 8713 URLs in April. Table 8 shows the seed count per topic, the discovered links, the queued links, and the actually downloaded links. We have observed a total rate of 6.48 discovered and queued links per crawled content.
topic        initial seed count   discovered links      queued    downloaded
adult                      1867           10566820     8931291       1537145
pirate                     2324           21592129    18875457       2547330
celebrity                  1405           20738435    17542649       2855300
games                      2367           33600204    27893560       5215394
wallpaper                  2781           23943186    18860725       4646027
hosts file                 8713           19159246    15725376       3204560

Table 8: Seed count per topic
4.3. Performance
We have observed an average analysis rate of 0.048 seconds per downloaded content and an average analysis rate of 2.35 seconds per downloaded and compressed MB. Table 9 shows the runtime of the malware analysis in terms of the runtime of the ScanFolder program.
topic        secs         hours    content count   MB          secs/content   secs/MB
pirate       116416.21     32.34         2547330    54182.43         0.0457      2.15
celebrity    138246.53     38.4          2855300    41877.09         0.0484      3.3
game         279741.43     77.71         5215394   168713.89         0.0536      1.66
wallpaper    224220.8      62.28         4646027   150598.52         0.0483      0.92
hosts file   138727.65     38.54         3204560    60423.33         0.0433      3.71
total/avg.   897352.62    249.26        18468611   475795.26         0.0479      2.35

Table 9: Malware analysis performance per crawl
To increase the performance of the overall analysis process we have searched for possible redundancies and found one minor optimization. The ARCWriter processor writes compressed ARC-files as output of the Heritrix crawls. Figure 14 shows the operation of the ExtractARCFile program. Prior to malware detection, every ARC-file is uncompressed and afterwards compressed again. While speeding up the decompression is not possible without great effort, it is very easy to speed up the compression, because the compression ratio of the processed file is not crucial to us. As the ARC-files are initially compressed using gzip, we also use gzip to compress the files again. We have compared the time in which an average-size ARC-file (about 100 MB compressed) is compressed with the normal gzip operation and with the additional gzip --fast option. It took 13.357 s to compress such a file with gzip and 9.697 s with the gzip --fast option enabled. Applying this to our previously found average of 2.35 seconds per compressed MB results in about 2.31 seconds per compressed MB, a performance gain of 2 %. Completely forgoing the recompression of the processed ARC-files results in 2.22 seconds per compressed MB, a performance gain of 5.3 %. But not compressing the files could produce ARC-files with, in the worst case, about double the size and thus become critical for the system storage requirements. A final solution could be the deletion of processed ARC-files, at the cost of losing the crawled content.
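The timing comparison can be reproduced with a small script; the following sketch (the ARC-file name is only an example) times gzip with and without the --fast option on a copy of an uncompressed ARC-file.

import shutil
import subprocess
import time

def time_gzip(path, options):
    """Compress a copy of path with the given gzip options and return the elapsed seconds."""
    copy = path + '.tmp'
    shutil.copyfile(path, copy)
    start = time.time()
    subprocess.call(['gzip'] + options + [copy])
    return time.time() - start

if __name__ == '__main__':
    arc = 'IAH-20070301120000-00000.arc'   # example name of an uncompressed ARC-file
    print('gzip        : %.3f s' % time_gzip(arc, []))
    print('gzip --fast : %.3f s' % time_gzip(arc, ['--fast']))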
4.4. Phishing Analysis
Since ClamAV has a very good phishing detection engine, we have found a total of 27 different types of phishing sites out of a total of 32 phishing detections in our crawls. The most interesting aspect is the distribution of the samples. Table 10 shows the distribution of the found sites.
topic        phishing sites   unique sites
adult                     1              1
pirate                   28             23
celebrity                 0              0
games                     0              0
wallpaper                 1              1
hosts file                2              2
total                    32             27

Table 10: Topic based phishing site distribution
Table 11 shows the different detected phishing site types and their occurrence.
Further analysis of the pirate-based phishing sites has unveiled that eight of the found sites are located on the domain antiphishing.org, which is a known anti-phishing site. It is obvious that they host the phishing sites for educational purposes.
Eventually, all of the 32 found sites have turned out to be phishing-awareness educational content hosted with no malicious background. This leads us to the conclusion that phishing sites are normally not linked on any other Web site and are not available for more than a few days. Furthermore, if phishing sites are only up for a few days and the signature database of ClamAV has to be updated manually on a daily basis, it is likely that phishing sites will be missed due to the lack or the delay of signatures. Therefore another strategy to detect phishing sites has to be considered.
occurrence   name                         occurrence   name
4            HTML.Phishing.Bank-3         1            HTML.Phishing.Bank-581
2            HTML.Phishing.Auction-28     1            HTML.Phishing.Bank-621
2            HTML.Phishing.Bank-573       1            HTML.Phishing.Bank-680
1            HTML.Phishing.Acc-10         1            HTML.Phishing.Card-3
1            HTML.Phishing.Auction-146    1            HTML.Phishing.Card-45
1            HTML.Phishing.Auction-20     1            HTML.Phishing.Pay-1
1            HTML.Phishing.Auction-47     1            HTML.Phishing.Pay-131
1            HTML.Phishing.Auction-74     1            HTML.Phishing.Pay-131
1            HTML.Phishing.Bank-159       1            HTML.Phishing.Pay-19
1            HTML.Phishing.Bank-162       1            HTML.Phishing.Pay-2
1            HTML.Phishing.Bank-214       1            HTML.Phishing.Pay-35
1            HTML.Phishing.Bank-277       1            HTML.Phishing.Pay-6
1            HTML.Phishing.Bank-44        1            Phishing.Email
1            HTML.Phishing.Bank-557

Table 11: Unique phishing site types
4.5. Results
We know only little about the distribution of malicious Web sites, but our results show an interesting topic-specific maliciousness probability regarding binary files. Table 12 shows the maliciousness per topic.
topic        maliciousness in %
pirate                      2.6
wallpaper                   2.5
hosts file                  1.7
games                       0.3
celebrity                   0.3
adult                       0.1
total                       1.0

Table 12: Topic specific binary file maliciousness probability
This result can be compared to the result of the SiteAdvisor malware analysis report from March 2007 [76]. They have found that 4.1 % of all Web sites have been identified as malicious. The gap between their result and our result of 1.0 % per crawled binary can mainly be explained by the following differences:
• SiteAdvisor has additional maliciousness indicators: user feedback, drive-by downloads, and spam mail analysis. This approach leads to a higher maliciousness rate.
• Our approach lacks correctness due to our incomplete crawls.
• We could only generate results for the overall probability regarding binary files and thus miss other non-binary samples due to some shortcomings in our crawls.
• The whole Web, which is targeted by the SiteAdvisor, is likely to cover more topics.
• Users of the SiteAdvisor system can report additional pages to be analyzed. Therefore it is likely that the SiteAdvisor system will discover malicious Web sites which are not linked on any other site and thus cannot be found with our approach.
Furthermore, we have discovered a total of 93 different malware types in our crawls. Table 13 shows the top 20 malware types found during our analysis.
name                  occurrence   name                   occurrence
HTML.MediaTickets.A          487   W32.Parite.B                    6
Trojan.Aavirus-1              92   HTML.Phishing.Bank-3            4
Trojan.JS.RJump               91   Exploit.HTML.IFrame             3
Adware.Casino-3               22   Joke.Stressreducer-3            3
Adware.Trymedia-2             12   Trojan.URLspoof.gen             3
Adware.Casino                 10   Adware.Hotbar-2                 2
Worm.Mytob.FN                  9   Adware.Webhancer-10             2
Dialer-715                     8   ClamAV-Test-File                2
Adware.Casino-5                7   Dialer-166                      2
Trojan.Hotkey                  6   Eicar-Test-Signature            2

Table 13: Top 20 malware types
We have identified a total of 50 unique domains as the source of all found malicious content. Table 14 shows the top 20 malicious domains found during our crawls.
count   domain                     count   domain
487     desktopwallpaperfree.com       8   3dlez.com
 92     waterfallscenes.com            8   antiphishing.org
 91     pro.webmaster.free.fr          7   cyervo.iespana.es
 15     astalavista.com                7   luckyblackjack.com
 14     bunnezone.com                  7   packetstormsecurity.org
 12     oss.sgi.com                    6   downloadcheats.net
 12     ppd-files.download.com         6   themerboy.com
 11     888casino.com                  4   kit.carpediem.fr
 11     888.com                        3   boingboing.net
 10     bigbenbingo.com                3   honeynet.org

Table 14: Top 20 malicious domain distribution
Manual URL analysis has shown us some false positives and other suspicious domains. For example, pro.webmaster.free.fr seems to have a kind of so-called spider trap. This is a mechanism to trap Web spiders, aka Web crawlers. Such sites automatically generate infinitely many URLs to keep a Web crawler busy with a needless task.
Some of the detected domains are known non-malicious sites with malware-research-related topics. Therefore they provide some malware samples and our system detects them as malicious. Examples are:
• oss.sgi.com, a serious open-source development site of Silicon Graphics Incorporated.
• antiphishing.org, the official site of the Anti-Phishing Working Group.
• boingboing.net, a serious community page with a phishing mail sample in a post.
• honeynet.org, the site of the Honeynet Project, dedicated to fighting hackers.
These false positives show us some problems of our approach.
Known Non-malicious Web sites
The fact that a site can be non-malicious despite hosting known malicious code for research or other purposes is not recognizable with our system. It appears to us that only two mechanisms can avoid such false positives. The first is the use of a rating-like system, as SiteAdvisor uses, in which the Web site admin and users can rate and describe a Web site's maliciousness.
The second is a previously generated whitelist of known serious Web sites. This could work like a community-generated blacklist, or like a new registrar-like system in which every serious Web site can declare its seriousness to a whitelist registrar to avoid being blacklisted.
Our system has none of these but could easily be extended with a whitelisting system.
Problems with our crawl method
In comparison, the UW Spycrawler [93] system crawls a fixed number of URLs on specific hosts and crawls the whole content of a host. We did broad-scope crawls and ended them early, knowing that our system could neither store nor process as much data. With this approach we have seen that the crawler gets mainly HTML sites and focuses on link extraction rather than on downloading the whole content. This may be why we only discover a minor part of the contents on the crawled sites. This leads to few executable files and thus to very few malicious files.
As a solution, later crawls should be fine-tuned not to exceed the available resources of the research system, and thus the max-link-hops and max-trans-hops options should be set to a very small value.
5. Future Work
The main task of our work was to implement the Monkey-Spider system in the first place and then to evaluate it with representative crawls of the Internet. Since our time was restricted to six months and I am the only one implementing and evaluating the system, we could not fully complete our plans. The operation chain is functional, but the integration into the Web interface, the integration of malware scanners other than ClamAV, and thus a first release candidate for the public were not feasible for us. Furthermore, during the implementation and research we obtained new ideas for extensions and other use cases for our system. In the following Sections we want to introduce a few of these ideas.
5.1. Queueing Additional Content
Since the outcome of the research fully depends on the probed content, our initial probe queue is an important issue for our results. Extending the queue will broaden the scope of the research and/or refine and improve our results.
Google AJAX Search Seeder and Microsoft Live Search Seeder Integration
As the Google SOAP Search API is no longer supported and no API keys are available, it is one of the major tasks to reimplement the Google seeder with the new Google AJAX Search API. The good news is that the new API does not have restrictions on the query. With this new API we could build huge seeds which feature all search results for a special topic and do domain-based crawling to keep the focus on malware totally based on a topic. This approach would give us the most representative evaluation of the Internet.
Microsoft has stopped the support for their MSN SOAP Search API and started a new Search API, the Microsoft Live Search API, which is also SOAP-based and nearly the same. This API should also be integrated into the Monkey-Spider. The main advantage is the increased result rate per day, from 250 to 1000 results.
Automated Blacklist Seeder
We have written a blacklist seeder script to generate seeds from a hard-coded hosts file list. As a future task it should be possible to upload a new hosts file through the Monkey-Spider Web interface and to control the blacklist seeder from the Web interface.
Monitoring Seeder
One of our major drawbacks is the lack of a monitoring seeder. Since our Web interface is not fully functional yet, a monitoring seeder could not be implemented. Thus an automatic monitoring of detected malware is currently not supported. The monitoring seeder could queue previously detected malicious Web sites which have exceeded a given time period and generate seeds to be recrawled.
JavaScript Execution Seeder
Often JavaScript code is used to generate links to other executable content, which is likely to be malicious since this is an obfuscation technique to veil malicious actions. While our crawler extracts links out of crawled content, including JavaScript files, it does not execute them and thus misses some possibly interesting malicious content. As a further improvement, a JavaScript execution seeder could be implemented which interprets JavaScript code in a secure sandbox-like environment and extracts generated links. These links can then be additionally queued and analyzed with our system.
Typosquatting Seeder
Misspelled domain names of popular domains, so-called typos (typographical errors), are an important source of maliciousness on the Internet. Normally, such domains cannot have meaningful content, except for a coincidental match with other serious sites. The practice of using typos to misuse such traffic is commonly known as typosquatting. Malware writers want to spread their code as fast and as wide as possible, so they register common typos of real existing sites and use this unintended traffic to compromise the clients [105, 67].
As an addition, our seeder could automatically generate typos of a given list of popular Web sites and use them as seeds; a simple generator is sketched below. Microsoft has developed such an automatic typo generator called the Strider URL Tracer with Typo-Patrol [51].

Figure 20: Popular Google typos

Figure 20 shows common typos for Google.com. Figure 21 shows Goole.com, one of the squatter sites, with lots of advertising.
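A minimal illustration of how such a seeder could derive candidate typo domains follows; the generation rules are simple examples and not the algorithm used by Typo-Patrol.

def typo_domains(domain):
    """Generate simple single-character typos (omissions, duplications, swaps) of a domain name."""
    name, dot, tld = domain.rpartition('.')
    typos = set()
    for i in range(len(name)):
        typos.add(name[:i] + name[i+1:] + dot + tld)           # omit one character
        typos.add(name[:i] + name[i] + name[i:] + dot + tld)   # duplicate one character
        if i + 1 < len(name):
            swapped = name[:i] + name[i+1] + name[i] + name[i+2:]
            typos.add(swapped + dot + tld)                      # swap adjacent characters
    typos.discard(domain)
    return sorted(typos)

# Example: typo_domains('google.com') contains 'gogle.com', 'goole.com' and 'goolge.com'.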
5.2. Crawler Extensions
Some drawbacks in the crawler can have a negative effect on the performance, scope, and representativeness of the crawls. The following extensions try to avoid some of them.
Mail Crawler
Compared to Shelia, our mail seeder lacks the processing of malware attached to spam mails and thus misses it. In addition to Heritrix, a new mail crawler could create ARC-files out of mail attachments and include them in the analysis process.
Figure 21: Goole.com, a typosquatter site
WARC Integration
Having been the standard archive format of the Internet Archive for over eleven years, the ARC format is going to be replaced by a new file format, which is a product of the proposals of many national libraries and the Internet Archive as part of the International Internet Preservation Consortium [22]. The Web ARChive (WARC) file format [64] covers all features of the current ARC-file format and adds new functionality, better standardization, more record types, and ease of use. With it, our malware database could provide more information and better statistics regarding the connection of malicious neighbourhoods.
Crawl Synchronisation
Considering the fact that two crawl jobs are not in the least synchronized with each other, we know that it is possible, and especially in broad-scope crawls very likely, that a part of the crawled content will be unnecessarily recrawled. This recrawling can be partly accelerated with a Web proxy used as the Internet gateway. There is another way of completely avoiding this overhead. The DeDuplicator [55] is an add-on module for Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls. It operates as a processor in the processing chain of Heritrix and considers all crawls in a series of crawls. When the DeDuplicator encounters a duplicate URL to be crawled, the processing moves straight to the PostProcessing chain instead of to any processor which should be skipped if a duplicate is detected. As an extension, the DeDuplicator module could be integrated into the Monkey-Spider system to avoid such overhead.
Explicit Top Rank Crawls
A further problem regarding duplicate crawling is the problem of recrawling top-ranked Web sites on every crawl. When we perform broad crawls, top-ranked Web sites like Wikipedia.org, Amazon.com, and YouTube.com are most likely to be linked frequently on common Web sites. Thus these sites will more likely than not be recrawled. These sites falsify our topic-based research results by yielding more clean Web sites than there should be.
Writing a new processor module for Heritrix which can exclude given top-ranked sites is our solution to this problem. In this approach, we propose to generate, manually or automatically, a top-rank site list with about a thousand sites, for example out of the Alexa Top Sites [2], and use this list as an exclude rule with a blacklist-like processor (sketched below). This blacklist processor should avoid crawling content from any link hosted on the blacklisted top-ranked servers. In addition, we could crawl only the top-ranked site list itself to gather information about their maliciousness.
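Although such a processor would have to be written in Java to plug into Heritrix, the filtering logic itself is simple. The following minimal sketch illustrates the idea in Python; the host list is only a placeholder for a list derived from the Alexa Top Sites.

from urlparse import urlparse   # Python 2; urllib.parse provides the same in Python 3

# Placeholder list; in practice this would hold about a thousand top-ranked domains.
TOP_RANKED = set(['wikipedia.org', 'amazon.com', 'youtube.com'])

def is_blacklisted(url):
    """Return True if the URL's host lies within a blacklisted top-ranked domain."""
    host = urlparse(url)[1].split(':')[0].lower()   # netloc without the port
    return any(host == d or host.endswith('.' + d) for d in TOP_RANKED)

# Example: is_blacklisted('http://en.wikipedia.org/wiki/Malware') returns True.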
P2P Crawler
As mentioned before in Section 3.8.6, peer-to-peer networks are currently out of our scope. Since Heritrix was built to crawl Web sites, it would not be a good idea to extend Heritrix to support peer-to-peer networks. Instead, we propose to write a peer-to-peer client per underlying peer-to-peer protocol that generates ARC-files as output. These ARC-files could then be examined in exactly the same way as the other ARC-files. These peer-to-peer crawlers could use existing peer-to-peer client libraries like libgnutella of the gtk-gnutella [18] project to minimize the implementation effort. A similar crawling strategy has to be developed to find representative results for topic-based searches in peer-to-peer networks.
5.3. Malware Analysis
Exploit DB comparison
There are some publicly available exploit databases on the net, like Milw0rm [37] or Metasploit [35]. Such databases contain exploit code for known vulnerabilities and shell code to be used with exploits to execute specific operations on an attacked machine. As an addition, we could generate signatures of such publicly available exploits and shell codes and examine our crawled content with them. Found sites should be at least suspicious or hacker Web sites and thus lead to other likely interesting content.
Other Malware Scanners
The ClamAV scanner is likely to miss much publicly known malware and thus to generate false negatives. To achieve a better detection rate, an integration of more malware scanners, including anti-virus and anti-spyware solutions, would be desirable.
Containment strategy
We have built the Monkey-Spider system without a containment strategy, and our code is not written with regard to security issues, due to the lack of knowledge and time. To prevent the honeyclient from being owned and possibly used for malicious actions, a containment strategy has to be developed. Especially vulnerabilities of common malware scanners are likely to be exploited. Another concern for the future is a possible exploitation of the Monkey-Spider software itself due to its free availability.
Web Db Query Interface
In addition to standard queries to the malware database, like topic-based malware statistics, a general Web SQL interface to the underlying database could provide a powerful and efficient, but insecure, addition to the Monkey-Spider system.
Logfile Analysis
During our data collection for the evaluation it became clear that much important overall information can only be obtained with standardized logfile evaluations. We therefore recommend an additional module for the automated, script-based analysis of the provided log files. Such an effort could also contain some graphical evaluation extensions.
5.4. General Purpose
Alternative and extended use cases for the Monkey-Spider system open a new perspective on widespread analysis of the Web in general. Admitting that such analysis systems may already exist, we present our system as a freely available Internet analysis framework. In the following Sections we want to present some ideas for such analyses, which could be realized with very little implementation effort.
Internet File Collection
Since our system extracts files contained in ARC-files generated with Heritrix, we could denote this behavior as dumping parts of the Internet. Extracted files can then be used to build collections of documents, programs, photos, audio, and video. However, the copyright of the downloaded material has to be considered before reusing copyrighted material in any way. Therefore, only hosts with public domain material could be crawled.
Criminal Investigation
Criminal investigation on the Internet needs good tools to find traces of crime in order to prosecute and sentence such doings. For this approach, only multimedia files could be extracted and manually examined with multi-file indexing and visualizing tools to unveil criminal activities. The manual analysis of the multimedia content is necessary because only humans can recognize crime. Imaginable use cases can be the disclosure of child sexual abuse, sexual assault in general, terrorism, and assault in general.
Statistics
In combination with the dumped files and the huge amount of logging information, it is possible to generate any kind of content-specific statistics:
• How is the distribution of files on the Internet?
• Which flavors of a filetype are used, for example file versions, file sizes, meta information etc.?
• Language statistics.
• Web authoring statistics, used tags etc., for example like the Google Web Authoring Statistics [17].
Such statistics could then be used by companies to forecast, for example, the popularity of a program or a filetype.
Copyright Infringement
Copyright infringement is the most widespread crime for which the Internet is utilized. Automatically observing millions of files in a short period of time can unveil minor and major suppliers of copyrighted material and thus support the fight against copyright infringement. Several techniques can be used for the analysis:
• Signature and watermark scanning as well as analysis of copyrighted material like photos, audio files, video files, e-books, and documents: Holders of copyrighted material can provide signatures and/or digital watermarks of their material. Signatures can only be compared to unmodified material, in contrast to digital watermarks, whose advantage is the applicability to modified and/or falsified content, like resized pictures or resampled audio files.
• Keyword scanning and data-mining techniques on file names and other metadata on the Web site or in the file itself, like in the Exchangeable Image File Format (EXIF), the Tagged Image File Format (TIFF), and in ID3 on MP3 audio etc. Data mining and keywords can be used to gather information to identify an infringement; for example, many infringements are not directly accessible but the filename unveils the content.
Cryptographical Hash Function Collision Detection
A hash function is a function that generates a fixed-length output for an arbitrary-length input. The output serves as a fixed-length fingerprint for any given input. Hash functions have widespread use cases in computer systems, for example as database indexes, integrity checksums, and in security applications.
Cryptographic hash functions are hash functions with additional security properties to be used in different security applications like authentication and message integrity indicators. Cryptographic hash functions must provide additional properties to warrant security aspects. As a security property it is important that it should be computationally infeasible to find a collision for a secure cryptographic hash function. A collision is a case where two different inputs generate the same output. Computationally feasible refers to any method which is faster than a brute-force attack, which tries every possible input combination.
The process of finding two arbitrary values whose hashes collide is called a collision attack; the process of finding one arbitrary value whose hash collides with another, given hash is called a preimage attack. A successful preimage attack is a much more serious break than a successful collision attack.

Figure 22: Scheme of a hash-function

The two most commonly used hash functions in computer systems are MD5 [58] and SHA-1 [63]. Since Heritrix uses SHA-1 for its checksum generation per given URL, the generated checksums of millions of Web sites could be examined to detect a collision for SHA-1. Such a system could be used as a proof of concept (PoC) for collision effects on the SHA-1 function and could be a paradigm for researchers on this subject. Additionally, the org.archive.io.ArchiveRecord class of the Heritrix package could be modified to examine other hash functions, e.g. the MD5 hash function.
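A sketch of the detection step, assuming the SHA-1 digests, URLs and content sizes have already been extracted from the crawl records (the record format is an assumption), could look like this:

import hashlib

def sha1_of_file(path):
    """Compute the hex SHA-1 digest of one downloaded content file."""
    return hashlib.sha1(open(path, 'rb').read()).hexdigest()

def find_collisions(records):
    """records: iterable of (sha1_hexdigest, url, content_length) tuples.
    Report pairs that share a digest but differ in size, which would hint at a
    hash collision rather than a plain duplicate download."""
    seen = {}
    for digest, url, length in records:
        if digest in seen and seen[digest][1] != length:
            print('possible collision: %s <-> %s' % (seen[digest][0], url))
        seen.setdefault(digest, (url, length))
    return seen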
Wordlist Generation
A dictionary attack is a technique for defeating a cipher or authentication method by guessing a decryption key or password by trying different words out of a dictionary (commonly called a wordlist). These attacks are computationally much more feasible than plain brute-force attacks, in which every possible combination is tried to guess a password. Dictionary attacks are likely to succeed because generally most people have a tendency to choose passwords which are easy to remember and choose words taken from their native language.
The success of such an attack depends on the quality of the wordlist used. The more comprehensive it is, the greater is the possibility of a successful attack. Wordlists can be comprehensive by means of contained languages, names, trademarks, and other special character combinations only used by a small portion of people, like slang [9], dialects [3], special abbreviations [52] etc., and special writing styles like the hacker writing style [57]. The ultimate wordlist in this context is the Internet itself. Therefore it would be another interesting idea to use the Monkey-Spider as a wordlist generator. Strings could easily be extracted out of every crawled content. This could easily be achieved, for example, with the re module from the Python standard library.
Listing 11 Demonstration of the re.split() function
>>> import re
>>> a = " KIEV (AFX) - Yushchenko said that his country "
>>> re.split('\W+', a)
['', 'KIEV', 'AFX', 'Yushchenko', 'said', 'that', 'his', 'country', '']
Additionally, the occurrence count of every word could be stored to provide a popularity rank. The generated wordlist could then be sorted according to the previously generated rank. We claim that it could be empirically shown that such a sorted wordlist (with maybe millions of words) would provide the best wordlist that could be used for any dictionary attack.
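Building on Listing 11, a minimal sketch of such a frequency-ranked wordlist generator (reading from plain text files previously extracted from a crawl; the file name is only an example) could be:

import re

def ranked_wordlist(paths):
    """Count word occurrences over the given text files and return the words
    sorted by descending popularity."""
    counts = {}
    for path in paths:
        text = open(path, 'rb').read().decode('latin-1', 'replace')
        for word in re.split(r'\W+', text):
            if word:
                counts[word] = counts.get(word, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)

# Example (hypothetical file name):
# for word in ranked_wordlist(['extracted_page.txt'])[:100]:
#     print(word)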
6. Summary and Conclusions
"The more you read and learn, the less your adversary will know." (Sun
Tzu)
A virtual combat for money and glory is in full progress. We wanted the reader to learn
his weakness and prepare for the counterstrike. With this, work we have concentrated
our eorts to the consistent rising client side threats on the Internet. At the same time,
we have observed the importance of mapping malicious Web sites and giving researchers
the ability to repeat and expand similar research options.
We first showed how Web sites can be a threat to client security. Thereafter, we introduced the reader to different malware types and malicious content from a social viewpoint. Meanwhile, the concept and importance of honeypot-based computer security devices have been shown and classified. Afterwards, similar work on this topic has been presented and compared to our work.
As the next step, we introduced our solution, the Monkey-Spider Internet analysis framework, and explained its implementation and application as a client honeypot. We made several design decisions and justified them to the reader. Making use of different other freely available systems, we showed the reasons for choosing particular flavors of such. After introducing Web crawling techniques, we introduced our Web interface to control the Monkey-Spider system. Meanwhile our malware detection approach has been shown and criticized. Thereafter, some important limitations of our approach have been shown and solution ideas have been suggested to overcome some of the mentioned issues.
Finally, we applied our system in the real world. We learned some good lessons and found interesting results which we shared with the reader. One of the most important lessons concerned the crawling strategy: all aspects of the crawling method have to be considered and evaluated precisely to achieve the desired results. To correctly categorize a Web site or domain as malicious to client security, all the publicly reachable content has to be examined.
Eventually, we pointed out other use cases for our framework and showed its applicability to a wide variety of Internet analysis tasks.
We conclude that after many years of server security research we need to concentrate more on client security. With the rise and abuse of botnets, the danger of being abused by an attacker has become greater than ever. The continuance of the Web as we know it is at stake if the power to decommission network nodes is in the hands of a bunch of criminals controlling botnets. An easy way of avoiding the threats of malicious Web sites has to be the automatic avoidance of these dangerous parts of the Web in an easy-to-use and error-free manner. SiteAdvisor is an example of an interesting approach utilizing the power of the Internet community, but it is neither free nor sufficient for these needs.
Therefore, we propose an independent Web site security task force (WSSTF) of voluntary individuals and companies which could enable Web security starting from the Internet service provider level, to prevent clients from becoming the victims of such attacks, although this could raise legal and/or Internet censorship issues that would have to be overcome.
A. Database Scheme
Listing 12 createvirdb.sql script to generate the Monkey-Spider database scheme
-- db scheme for the Monkey-Spider
DROP TABLE mw_output;
DROP TABLE malware;
DROP TABLE mw_scanner;
CREATE TABLE malware (
    id int PRIMARY KEY,        --unique malware-id
    filename varchar(40),      --filename of the malware
    url text,                  --url where it was crawled
    checksum varchar(32),      --checksum
    size int,                  --filesize
    date timestamp,            --date when the file was crawled
    job varchar(20),           --name of the crawl job
    comment varchar(20)        --comment
);

CREATE TABLE mw_scanner (
    id int PRIMARY KEY,          --unique malware scanner-id
    name varchar(20),            --name of the malware scanner
    engine_ver varchar(40),      --the engine version
    signature_ver varchar(40),   --the signature version
    lastupdate timestamp,        --date when the mw-scanner was last updated
    UNIQUE (engine_ver,signature_ver)
);

CREATE TABLE mw_output (
    mw_id int REFERENCES malware(id),        --foreign key to malware.id
    mw_sc_id int REFERENCES mw_scanner(id),  --foreign key to mw_scanner.id
    description varchar(20),                 --description of found malware
    PRIMARY KEY (mw_id,mw_sc_id)             --primary key, both foreign keys
);
B. JavaScript Code Obfuscation Sample
Listing 13 JavaScript code obfuscation
// the whole script in the escaped binary form
// Block one: the escape sequence is abbreviated for clarity
<!--
eval(unescape("function w%28s%2[...] %3B");
//-->
// Block two: the evaluated obfuscation function
<!--
function w(s)
{
t = "whmSrp;^s";
o = new String;
l = s.length;
for (i = 0; i < l; i++)
{
n = t.indexOf(s.charAt(i));
if (n == -1) {
// charAt() gets the indexed
// character of the string
o += s.charAt(i);
continue;
}
if (n >= 0) {
// fromCharCode() gets the character
// out of the Unicode index
o += String.fromCharCode(10);
continue;
}
}
document.write(o);
}
//-->
References
[1] AdAware.
http://www.lavasoft.com/.
[2] Alexa Top 500 rank.
http://www.alexa.com/site/ds/top_sites?ts_mode=global&lang=none.
[3] Black Country Dialect.
http://www.sedgleymanor.com/dictionaries/dialect.html.
[4] Boomerang Decompiler.
http://boomerang.sourceforge.net/.
[5] Capture-HPC.
http://www.nz-honeynet.org/cabout.html.
[6] Clam AntiVirus a GPL anti-virus toolkit for UNIX.
http://www.clamav.net/.
[7] CSS Specification.
http://www.w3.org/TR/CSS21/.
[8] CWSandbox - Behavior-based Malware Analysis.
http://www.cwsandbox.org/.
[9] Dictionary of English slang and colloquialisms of the UK.
http://www.peevish.co.uk/slang/.
[10] Extensible Markup Language (XML).
http://www.w3.org/XML/.
[11] GNU zip.
http://www.gzip.org/.
[12] Gnutella Protocol Specication.
http://rfc-gnutella.sourceforge.net/developer/stable/index.html.
[13] Google AJAX Search API.
http://code.google.com/apis/ajaxsearch/.
[14] Google SOAP Search API.
http://code.google.com/apis/soapsearch/.
[15] Google Suggest.
http://www.google.com/webhp?complete=1&hl=en.
[16] Google Trends.
http://www.google.de/trends.
[17] Google Web Authoring Statistics.
http://code.google.com/webstats/index.html.
[18] Gtk-Gnutella a gnutella network client.
http://gtk-gnutella.sourceforge.net.
[19] Heritrix, the Internet Archive's open-source, extensible, web-scale, archival-quality
web crawler project.
http://crawler.archive.org/.
[20] Htmldata, a Python library to manipulate HTML/XHTML/CSS documents.
http://oregonstate.edu/~barnesc/htmldata/.
[21] IDA Pro disassembler.
http://www.datarescue.com/idabase/.
[22] International Internet Preservation Consortium (IIPC).
http://netpreserve.org.
[23] Internet Archive.
http://www.archive.org.
[24] Internet Relay Chat Protocol Specication.
http://tools.ietf.org/html/rfc1459.
[25] Java programming language.
http://java.sun.com/.
[26] JavaScript obfuscation example.
http://idesitv.com/starone1-dkcj3s.htm.
[27] JavaScript Specification.
http://www.ecma-international.org/publications/standards/Ecma-262.htm.
[28] Jericho HTML Parser.
http://jerichohtml.sourceforge.net/doc/index.html.
[29] JMX Specification.
http://java.sun.com/javase/technologies/core/mntr-mgmt/javamanagement/.
[30] Karrigell, a flexible Python Web framework.
http://karrigell.sourceforge.net/.
[31] Kazaa Network.
http://www.kazaa.com.
[32] Live Search API Reference.
http://msdn2.microsoft.com/en-us/library/bb266180.aspx.
[33] Macromedia Flash File Format (SWF).
http://www.adobe.com/licensing/developer/fileformat/faq/.
[34] McAfee SiteAdvisor.
http://www.siteadvisor.com/.
[35] Metasploit.
http://www.metasploit.com/.
[36] Mike's Ad Blocking Hosts le.
http://everythingisnt.com/hosts.html.
[37] Milw0rm.
http://www.milw0rm.com/.
[38] Multipurpose Internet Mail Extensions (MIME).
http://www.ietf.org/rfc/rfc2045.txt.
[39] Nepenthes.
http://nepenthes.mwcollect.org/.
[40] PDF Specification.
http://www.adobe.com/devnet/pdf/pdf_reference.html.
[41] Post Oce Protocol - Version 3.
http://tools.ietf.org/html/rfc1939.
[42] PostgreSQL, a powerful, open source relational database system.
http://www.postgresql.org/.
[43] PyGoogle, a Python Interface to the Google SOAP Search API.
http://pygoogle.sourceforge.net/.
[44] PyGreSQL, a PostgreSQL module for Python.
http://www.pygresql.org/.
[45] pYsearch, a Python interface for the Yahoo Search API.
http://pysearch.sourceforge.net/.
[46] Resource Hacker.
http://www.angusj.com/resourcehacker/.
[47] Simple Object Access Protocol (SOAP).
http://www.w3.org/TR/soap/.
[48] Snort.
http://www.snort.org/.
[49] Strider Gatekeeper Spyware Management: Beyond Signature-based Approach.
http://research.microsoft.com/spyware/.
[50] Strider GhostBuster Rootkit Detection.
http://research.microsoft.com/rootkit/.
[51] Strider URL Tracer.
http://research.microsoft.com/URLTracer/.
[52] The Acronym Finder.
http://www.acronymfinder.com/.
[53] The Anti-Spyware Coalition.
http://www.antispywarecoalition.org/.
[54] The ARC le-format.
http://www.archive.org/web/researcher/ArcFileFormat.php.
[55] The DeDuplicator Heritrix add-on module.
http://deduplicator.sourceforge.net/.
[56] The Free Network Project.
http://freenetproject.org/.
[57] The Jargon File.
http://www.catb.org/~esr/jargon/.
[58] The MD5 Message-Digest Algorithm.
http://tools.ietf.org/html/rfc1321.
[59] The Python Programming Language.
http://python.org/.
[60] The Python Web Services Project.
http://pywebsvcs.sourceforge.net/.
[61] The Zero Day Initiative.
http://www.zerodayinitiative.com/.
[62] Ubuntu Linux.
http://www.ubuntu.com/.
[63] US Secure Hash Algorithm 1 (SHA1).
http://tools.ietf.org/html/rfc3174.
[64] WARC specification.
http://archive-access.svn.sourceforge.net/viewvc/*checkout*/archive-access/branches/gjm_warc_0_12/warc/warc_file_format.html.
[65] Yahoo Search Web Services.
http://developer.yahoo.com/search/.
[66] Follow the Money; or, why does my computer keep getting infested with spyware?
2004.
http://tacit.livejournal.com/125748.html.
[67] Googkle.com installed malware by exploiting browser vulnerabilities. 2005.
http://www.f-secure.com/v-descs/googkle.shtml.
[68] Adobe Security Advisories. Buffer Overflow Vulnerability in Adobe Acrobat, July 2006.
http://www.adobe.com/support/security/bulletins/apsb06-09.html.
[69] Microsoft Security Advisory. A COM object (Javaprxy.dll) could cause Internet Explorer to unexpectedly exit, July 2005.
http://www.microsoft.com/technet/security/advisory/903144.mspx.
[70] Michael K. Bergman. The 'Deep' Web: Surfacing Hidden Value. 2001.
http://brightplanet.com/resources/details/deepweb.html.
[71] Carsten Willems, Thorsten Holz, and Felix Freiling. CWSandbox: Towards Automated Dynamic Binary Analysis. 2006.
[72] Chad Verbowski, Emre Kıcıman, Brad Daniels, Arunvijay Kumar, Yi-Min Wang, Roussi Roussev, Shan Lu, and Juhan Lee. Flight Data Recorder: Monitoring Persistent-State Interactions to Improve Systems Management. In 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI '06), 2006.
[73] William Cheswick. An evening with Berferd. 1998.
[74] Andy Cianciotto. Spam Attack: Zipped Trojan. April 2007.
http://www.symantec.com/enterprise/security_response/weblog/2007/04/
trojanpeacomm_spam_run.html.
[75] Fred Cohen. A Computer Virus. 1984.
http://all.net/books/virus/part2.html.
[76] Shane Keats and Dan Nunes. Mapping the Mal Web. 2007.
http://www.siteadvisor.com/studies/map_malweb_mar2007.html.
[77] Robert Danford. 2nd Generation Honeyclients. SANSFIRE 2006 Presentation.
http://handlers.dshield.org/rdanford/pub/Honeyclients_Danford_SANSfire06.pdf.
[78] Alex Eckelberry. Spam with malware links. April 2007.
http://sunbeltblog.blogspot.com/2007/04/spam-with-malware-links.html.
[79] Bernhard Müller and Martin Eiszner. IE6 javaprxy.dll COM instantiation heap corruption vulnerability. June 2005.
http://www.sec-consult.com/184.html.
[80] Roy Thomas Fielding. Architectural Styles and the Design of Network-based Software Architectures. PhD thesis, University of California, Irvine, 2000.
[81] Matthew Fordahl. Latest e-mail worm spreading fast. January 2004.
http://www.securityfocus.com/news/7905.
[82] The Free Software Foundation. The Free Software Definition.
http://www.gnu.org/philosophy/free-sw.html.
[83] Roger A. Grimes. Tracking malware with honeyclients, April 2006.
http://www.infoworld.com/article/06/04/14/77378_16OPsecadvise_1.html.
[84] Anti-Phishing Working Group. Phishing Attack Trends Report - February 2007.
2007.
http://www.antiphishing.org/phishReportsArchive.html.
[85] Network Working Group. RPC: Remote Procedure Call Protocol Specification Version 2, 1988.
http://tools.ietf.org/html/rfc1057.
[86] Thorsten Holz. A Short Visit to the Bot Zoo. IEEE Security and Privacy, 3(3):76-79, 2005.
[87] The Open Source Initiative. The Open Source Definition.
http://opensource.org/docs/osd.
[88] John F. Gantz, David Reinsel, Christopher Chute, Wolfgang Schlichting, John McArthur, Stephen Minton, Irida Xheneti, Anna Toncheva, and Alex Manfrediz. The Expanding Digital Universe: A Forecast of Worldwide Information Growth Through 2010. March 2007.
[89] Andrew Kalafut, Abhinav Acharya, and Minaxi Gupta. A study of malware in peer-to-peer networks. Pages 327-332, 2006.
[90] A.M. Kuchling. Python Advanced Library: Batteries Included Philosophy.
http://www.python.org/dev/peps/pep-0206/.
[91] Steve Lawrence and C. Lee Giles. Searching the World Wide Web. Science, 280(5360):98-100, 1998.
citeseer.ist.psu.edu/lawrence98searching.html.
[92] Sang Ho Lee, Sung Jin Kim, and Seok-Hoo Hong. On URL Normalization. 2005.
[93] Alexander Moshchuk, Tanya Bragin, Steven D. Gribble, and Henry M. Levy. A Crawler-based Study of Spyware on the Web. In Proceedings of the 13th Annual Network and Distributed Systems Security Symposium (NDSS 2006), San Diego, CA, February 2006.
[94] The Honeynet Project. Know Your Enemy, Second Edition: Learning about Security Threats (2nd Edition). Pearson Education, 2004.
[95] Paul Roberts. New, virulent Cabir mobile phone worms spotted, December 2004.
http://www.infoworld.com/article/04/12/28/HNcabir_1.html.
[96] Joan Robert Rocaspana. SHELIA: A Client Honeypot for Client-Side Attack Detection. 2007.
[97] C. Seifert. HoneyC - The Low-Interaction Client Honeypot. 2006.
citeseer.ist.psu.edu/seifert06honeyc.html.
[98] Seungwon Shin, Jaeyeon Jung, and Hari Balakrishnan. Malware prevalence in the KaZaA file-sharing network. Pages 333-338, 2006.
[99] Lance Spitzner.
Honeypots: Tracking Hackers.
Addison-Wesley Professional,
September 2002.
[100] Dave Taylor. How do I know if an eBay message is phishing?, January 2006.
http://www.askdavetaylor.com/how_do_i_know_if_an_ebay_message_is_
phishing.html.
[101] Sun Tzu.
The Art of War.
http://www.gutenberg.org/etext/132.
[102] Dan Verton. Internet fraud expanding, security experts warn. 2003.
http://www.computerworld.com/securitytopics/security/cybercrime/
story/0,10801,78551,00.html?SKC=cybercrime-78551.
[103] Kathy Wang. Honeyclient Development Project.
http://www.honeyclient.org/.
[104] Yi-Min Wang, Doug Beck, Xuxian Jiang, Roussi Roussev, Chad Verbowski, Shuo Chen, and Samuel T. King. Automated Web Patrol with Strider HoneyMonkeys: Finding Web Sites That Exploit Browser Vulnerabilities. In NDSS. The Internet Society, 2006.
http://www.isoc.org/isoc/conferences/ndss/06/proceedings/html/2006/papers/honeymonkeys.pdf.
[105] Aaron Weber. Amusement Park Typosquatters Install Toolbar via Exploit. 2007.
http://blog.siteadvisor.com/2007/04/amusement_park_typosquatters_i.shtml.
[106] Davey Winder. Dynamic code obfuscation will dominate 2007 malware map. January 2007.
http://www.daniweb.com/blogs/entry1225.html.
[107] Tom Zeller. Black Market in Stolen Credit Card Data Thrives on Internet. The New York Times, June 21, 2005.
[108] Lidong Zhou, Lintao Zhang, Frank McSherry, Nicole Immorlica, Manuel Costa, and Steve Chien. A First Look at Peer-to-Peer Worms: Threats and Defenses. In IPTPS, pages 24-35, 2005.