University of Mannheim
Laboratory for Dependable Distributed Systems

Diploma Thesis

Monkey-Spider: Detecting Malicious Web Sites

Ali Ikinci
May 23, 2007

First Examiner: Prof. Dr.-Ing. Felix C. Freiling
Second Examiner: Prof. Dr.-Ing. Wolfgang Effelsberg
Advisor: Dipl.-Inform. Thorsten Holz

Declaration: I hereby declare that I have written this thesis independently, that I have used no sources or aids other than those indicated, and that I have marked all quotations as such.

Ali Ikinci, Mannheim, May 23, 2007

Abstract

Client-side attacks are on the rise, and malicious Web sites pose a serious threat to client security. There is neither a freely available, comprehensive database of threats on the Web nor a sufficient set of freely available tools to build one. This work introduces the Monkey-Spider Internet analysis framework. Using it as a client honeypot, we describe the challenges of such an approach and evaluate our system as a high-speed, Internet-scale analysis tool for building a database of threats found in the wild. Furthermore, we present the evaluation of the system on different crawls carried out over two months, along with the lessons learned, and outline alternative use cases for the framework.

Zusammenfassung

Client-side attacks are on the rise, and malicious Web sites pose a serious threat to client security. There is neither a freely available database of the threats on the Web nor a sufficient set of freely available tools to build one. This thesis introduces the Monkey-Spider system for analyzing the Internet and deploys it as a client honeypot. We point out the challenges involved and evaluate our system as a tool for high-performance, network-wide analysis aimed at building a database of threats found in practice. Furthermore, we present our evaluation covering more than two months, discuss the insights gained during this time, and describe additional possible fields of application for our system.

Contents

1. Introduction
   1.1. Motivation
   1.2. Requirements Analysis
   1.3. Simplified Architecture
   1.4. Acknowledgements
2. Background
   2.1. Definition of Malware
   2.2. Types of Malware
      2.2.1. Computer Virus
      2.2.2. Computer Worm
      2.2.3. Trojan Horse
      2.2.4. Spyware and Adware
      2.2.5. Exploit
      2.2.6. Backdoor
      2.2.7. Botnet and Bots
   2.3. Detection of Malware
      2.3.1. Static Analysis
      2.3.2. Behavioral Analysis
   2.4. Honeypots vs. Honeyclients
      2.4.1. Honeypots
      2.4.2. Honeyclients
   2.5. Related Work
      2.5.1. HoneyMonkey
      2.5.2. Spycrawler
      2.5.3. SiteAdvisor
      2.5.4. HoneyC
      2.5.5. HoneyClient
      2.5.6. Shelia
      2.5.7. Capture-HPC
      2.5.8. Pezzonavante
   2.6. Spam and Malware
   2.7. Phishing
   2.8. Crawling the Web
3. Monkey-Spider
   3.1. System Architecture
   3.2. Implementation
   3.3. Seed Generation
      3.3.1. Web Search Seeding
      3.3.2. Spamtrap Seeding
      3.3.3. Blacklist Seeding
   3.4. Web Crawling
      3.4.1. Heritrix - a Web Crawler
      3.4.2. Configuring Crawl Jobs
      3.4.3. Heritrix Architecture
      3.4.4. The ARC File-Format
      3.4.5. The CDX File-Format
   3.5. Malware Detection
      3.5.1. ScanFolder
      3.5.2. Extractor
      3.5.3. Malware Scanner
      3.5.4. ClamAV
      3.5.5. CWSandbox
   3.6. Web Interface
   3.7. Malware Database
   3.8. Limitations
      3.8.1. Avoid Detection and Blacklisting
      3.8.2. Hidden Web
      3.8.3. Dynamically Created Content
      3.8.4. Independent Crawls
      3.8.5. HTTP Headers, Cookies and other Client Information
      3.8.6. Peer-to-Peer Networks
4. Evaluation
   4.1. Test Environment
   4.2. Crawls
   4.3. Performance
   4.4. Phishing Analysis
   4.5. Results
5. Future Work
   5.1. Queueing Additional Content
   5.2. Crawler Extensions
   5.3. Malware Analysis
   5.4. General Purpose
6. Summary and Conclusions
A. Database Scheme
B. JavaScript Code Obfuscation Sample
References

List of Figures

1. Simplified architecture of the Monkey-Spider system
2. Sample IDA Pro session [21]
3. Honeypots vs. honeyclients
4. The general honeyclient process chain
5. Sample SiteAdvisor site report
6. SiteAdvisor's rated Google search results
7. Example of an eBay phishing site [100] trying to steal passwords for identity theft
8. Monkey-Spider architecture
9. Google suggestions for wallpaper
10. Comparison of search terms with Google Trends
11. The Heritrix Web interface console running a job
12. Simple CrawlController scheme [19]
13. Operation of the ScanFolder program
14. Operation of the ExtractArcfile program
15. Operation of the ClamAV program
16. The Monkey-Spider status page
17. The Monkey-Spider seeding page
18. The Monkey-Spider crawling page
19. Filetype distribution
20. Popular Google typos
21. Goole.com, a typosquatter site
22. Scheme of a hash function
List of Tables

1. Honeypot comparison
2. Comparison of different honeyclients
3. Options of the doGoogleSearch() function
4. Extractor modules contained in Heritrix
5. Contents of the CDX file format
6. Extractor modules' extraction efficiency
7. Top 20 malware types of our test crawl
8. Seed count per topic
9. Malware analysis performance per crawl
10. Topic-based phishing site distribution
11. Unique phishing site types
12. Topic-specific binary file maliciousness probability
13. Top 20 malware types
14. Top 20 malicious domain distribution

List of Listings

1. Application of the strings command
2. Example of a SOAP request
3. Example of a REST application
4. Our query to Google
5. Sample hosts file
6. The hostsFileSeed.sh seeder script
7. Example of an ARC file record; the first part holds the metadata, the second part the content. Each ARC file usually contains hundreds or thousands of records, depending on the aggregated file size
8. Example contents of a CDX file with one contained file (trimmed)
9. A sample of extracted files
10. Example of a ClamAV report file
11. Demonstration of the re.split() function
12. The createvirdb.sql script to generate the Monkey-Spider database scheme
13. JavaScript code obfuscation

1. Introduction

"Between 2006 and 2010, the information added annually to the digital universe will increase more than six fold, from 161 exabytes to 988 exabytes." [88]

The Internet is growing and evolving every day, and more and more people are becoming part of the so-called Internet community. With this growth, the number of threats to these people is increasing as well. Criminals who want to destroy, cheat, con, and steal are evolving rapidly [102]. Yet there is no comprehensive and free database for studying malicious Web sites found on the Internet. HoneyMonkey [104] and SiteAdvisor [34] are proprietary systems that take such an approach to building research databases, but they are not freely available for further research, and only limited access to their results is publicly provided.

In this work, we introduce the Monkey-Spider Internet analysis framework, which was developed as part of this thesis. The task we address is to build an easy-to-use infrastructure to detect and monitor malicious Web sites and to evaluate it on popular parts of the Internet. Malicious Web sites are Web sites that contain any kind of content which could be a threat to the security of the clients requesting these sites. The resulting system will be referred to as the Monkey-Spider.

There is an ongoing battle on the Internet between blackhats and whitehats, the bad guys and the good ones [94]. The term blackhat is used for someone who compromises the security of a computer system without the permission of the owner in order to gain control over a system connected to a network. Blackhats write malicious code and content and attack computer systems with criminal intentions.
The term whitehat is used in contrast to blackhat and denotes a person who is ethically opposed to the abuse of computer systems. Whitehats try to trace blackhats and unveil their activities in order to counter their threats and secure the Internet. The strategies for unveiling and tracing criminal activities on the Internet resemble criminal investigation in the real world: the techniques used include fast analysis methods, deception, infiltration, tracing, replaying criminal activities, as well as waiting and observing.

Some of the best tools for real-world analysis of blackhat activities and techniques are honeypot-based technologies.

"A honeypot is an information system resource whose value lies in unauthorized or illicit use of that resource." (Lance Spitzner) [94]

Honeypots bait blackhats into malicious actions and then examine their behavior and capture their tools. Honeypots can be divided into server honeypots and client honeypots. Server honeypots are computer systems serving different kinds of services, such as FTP or HTTP, to clients; they wait for clients to attack them. Client honeypots are clients requesting content from other servers and thus provoke malicious servers into attacking them. Starting from this idea, the Monkey-Spider system is used in our work as a client honeypot (or honeyclient) towards Web servers, to find, observe, and capture malicious activities on the publicly accessible Internet.

1.1. Motivation

"Speed is the essence of war. Take advantage of the enemy's unpreparedness; travel by unexpected routes and strike him where he has taken no precautions." (Sun Tzu) [101]

Manual analyses of individually detected malware sites have appeared in recent years [66, 67]. These analyses are expensive, time-consuming, and far too few. Our aim is the automatic large-scale analysis of millions of Web sites. To build an infrastructure for studying current threats on the Internet, the resulting system should be able to answer, or help to answer, the following questions. With our evaluation in Chapter 4 we try to find answers to some of them:

• How much malware is on the Web?
• Where is that malware located?
• How are malicious Web sites related to each other?
• What fraction of binaries on the Internet is infected with malware?
• How are the threats changing over time?
• What is the topic-specific probability of an infected site?
• Which publicly unknown malware exists on the Internet?

We chose the name Monkey-Spider for our system: it combines the spider monkey, standing for the automated analysis of Web sites, with spider, which is used synonymously for Web crawlers, resulting in an automated analyzing crawler.

1.2. Requirements Analysis

Diving into this topic, we found it to be very interesting and extremely open to further research. The ultimate idea would be to have a copy of the Internet Archive [23] and to do research on its huge collected content. Given the funds available for a diploma thesis, we had to work out requirements appropriate for the scope of this work. These requirements are the following:

• The most important requirement is the performance of the whole system, in order to achieve the analysis of millions of Web sites in a few days with few computers.
• Fast and easy implementation, to stay within the time allowed for a diploma thesis and to have enough time for an adequate evaluation.
• Free tool availability and usability.
• The resulting system should be easily extensible for other researchers.
• Generate representative results, and thus analyze large portions of the Web.
• Detect zero-day exploits and publicly unknown malware.
• Generate signatures and detailed information about the examined objects and analyses.
• Queue any linked content to achieve full coverage.

Our solution proposal:

• Don't reinvent the wheel: make extensive use of available free and open source software [87] to minimize the implementation effort and to benefit from the extensibility of the reused components.
• Keep the infrastructure free [82] so that others can build on it and do further research.
• Separate crawling and scanning to achieve high performance and scalability.
• Use a high-performance, multithreaded, archival-quality Web crawler.
• Avoid crawling overhead through advanced crawling techniques.
• Use extended link extraction mechanisms.
• Use fast and easy malware detection methods, focusing more on detection than on the full analysis of threats; therefore, use a low-interaction approach.
• Focus on modularity to allow additions and further improvements.
• Use extensive prototyping, i.e., the process of quickly putting together a working model (a prototype) in order to test various aspects of a design, illustrate ideas or features, and gather early user feedback and results.
• Collect enough data to generate representative time-based and/or content-based statistics.

1.3. Simplified Architecture

"What is essential in war is victory, not prolonged operations." (Sun Tzu)

The main idea of our system is to first crawl the contents of a Web site and then analyze the crawled content for maliciousness. Our system can therefore be classified as a crawler-based low-interaction honeyclient. We did not want to analyze Web sites while surfing them with real Web browsers; the reason is to split up and optimize both tasks separately, namely browsing or crawling a Web site and analyzing or scanning its contents.

One of the main advantages of this partitioning is the ability to split up the operational parts: the scanning and the crawling can be done on different machines and thus be combined in a more performant way. The scanning usually takes significantly more time and resources than the crawling, while the crawling process can take as much processing time and memory as the Internet connection can bear and the tuning of the crawler allows.

Figure 1: Simplified architecture of the Monkey-Spider system

The Monkey-Spider is made up of two main tasks, crawling and scanning. First, a Web crawler collects resources from different parts of the Internet. After that, the collected content is checked with different malware scanners and analysis tools. Malicious content found during the investigation is stored, and the collected information is recorded in a database.
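To make this split concrete, the following minimal Python sketch shows one way the two stages could be decoupled through the file system: a crawler stage archives fetched objects in a spool directory, and an independent scanner stage later runs ClamAV over that directory. The directory layout and all function names are assumptions made for illustration only; this is not the actual Monkey-Spider code.

# Minimal sketch of the two-stage crawl-then-scan pipeline (illustrative only;
# directory layout and function names are assumptions, not Monkey-Spider code).
import hashlib
import os
import subprocess
import urllib.request

SPOOL_DIR = "spool"          # crawler output, scanner input

def crawl(urls):
    """Stage 1: fetch every URL and archive the raw response on disk."""
    os.makedirs(SPOOL_DIR, exist_ok=True)
    for url in urls:
        try:
            data = urllib.request.urlopen(url, timeout=10).read()
        except OSError:
            continue  # unreachable hosts are simply skipped
        name = hashlib.sha1(url.encode()).hexdigest()
        with open(os.path.join(SPOOL_DIR, name), "wb") as f:
            f.write(data)

def scan():
    """Stage 2: run a malware scanner over everything the crawler archived."""
    # clamscan is ClamAV's command-line scanner; --infected prints only hits.
    result = subprocess.run(["clamscan", "--infected", "--recursive", SPOOL_DIR],
                            capture_output=True, text=True)
    return result.stdout

if __name__ == "__main__":
    crawl(["http://example.com/"])
    print(scan())

Because the two stages only share a directory, they can run on different machines and at different speeds, which is exactly the property motivated above.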
1.4. Acknowledgements

First, I would like to thank Prof. Dr. Freiling for giving me the opportunity to work on such an interesting topic even though I had not studied in his research area but in computer networks. He showed great insight into my problems and supported me in a very humane fashion. I would also like to thank Prof. Dr. Effelsberg for being such a kind person; he agreed to be my second examiner and allowed me to work on my thesis at the Laboratory for Dependable Distributed Systems. I further wish to thank Dipl.-Inform. Thorsten Holz for his great support. He gave me good advice and motivated me, and even though he was often busy being productive, he was always available for a conversation, either in person or by email. I would also like to thank my dear friends Nuh Duran and Arno Knöbl, who helped me correct my numerous mistakes and gave me precious advice. Finally, I would like to thank my family for the patience and motivation they have shown me during my studies with all their ups and downs, and all the people working on our floor: I had a great time working here and will probably miss this friendly atmosphere.

2. Background

Before trying to detect and analyze malware, we have to give proper answers to the following questions: What are malicious Web sites and what are we looking for? What is malware? Which kinds of threats are there on the Internet? How is malware distributed over the Internet? How can malware be detected? The following sections try to answer these questions and introduce the reader to the subject. In Section 2.1 we define what malicious Web sites and malware are. After that, in Section 2.2, common types and functionalities of malware are explained. A description of the available detection and analysis methods for malware follows in Section 2.3. Section 2.4 introduces some concepts that are important for our work, namely honeypots, honeyclients, and their interaction levels. We then present related work on this and similar topics in Section 2.5 to give an overview of the tools and results already achieved by others, and to place our work in this context. Afterwards we explain how spam is related to malicious code in Section 2.6 and discuss phishing in Section 2.7. Finally, we introduce some Web crawling concepts in Section 2.8 which are important for understanding our system.

2.1. Definition of Malware

Malicious Web sites contain any kind of content which could be a threat to the security of the clients requesting these sites. Malware is any piece of software that serves malicious purposes; it is built with malice aforethought to damage, cheat, con, steal, or with similar intentions. Malicious Web sites can either host such malware or exploit a client to force its infection with malware. The latter Web sites are called drive-by-download sites [93]. Malware is either programmed directly, generated with so-called malware toolkits, or generated at runtime to obfuscate itself from detection. To infect computer systems, malware exploits either known or unknown vulnerabilities of the victim system, or it is consciously or unconsciously executed by the user or the computer itself. Malware often spreads on the infected computer system and over computer networks to other computer systems. Many types of malware use advanced techniques to hide their existence and obfuscate their actions; in this way they try to avoid detection in order to maintain infection and functionality as long as possible. The aims of malicious software in general can be summed up as:

• Avoid detection: detected malware can easily be recognized by others as well, and countermeasures can be taken against it.
• Ensure functionality: a piece of software only has value for the attacker as long as it functions properly.
• Ensure (wide) spreading: avoiding anything that blocks further spreading is important for the success of the malware.

2.2. Types of Malware

"The more you read and learn, the less your adversary will know." (Sun Tzu)

Different types of malware make use of different operation techniques and therefore require various detection and analysis strategies.
It is a Sisyphean challenge to discover new malware and to invent new technologies against malware writers, because the malware researcher always has to rediscover a threat which an attacker has discovered and implemented before him. As long as there are people with the knowledge and the intention to abuse, the researcher keeps searching for threats. Malware is classified according to schemes such as:

• Infection method: remote exploitation, or local execution and exploitation
• Spreading: self-replication and infection of other binaries and systems
• Degree of user interaction: socially forcing the user to download and install the malware, or hiding from the user
• Abuse purpose: key logging, identity theft, scam, fraud, etc.

Many modern malicious codes or contents combine well-known techniques from other kinds of malicious code to achieve synergy effects for better infection, spreading, and hiding. Given a copy of the malicious code or an infected sample of a publicly known malware, it is easy to unveil its existence through malware-specific signatures. This scheme also works well on spam mails or phishing sites, which are not executable malicious code but are malicious for the user from a social point of view. As an example, the HTML.MediaTickets.A signature for exploitation Web sites contains just the string that executes an exploitation script. The following line has been extracted from the Web site http://desktopwallpaperfree.com more than 400 times, as shown in Table 14:

<script language="JavaScript" src="http://www.mt-download.com/mtrslib2.js"></script>

2.2.1. Computer Virus

"A program that can 'infect' other programs by modifying them to include a possibly evolved copy of itself. With the infection property, a virus can spread throughout a computer system or network using the authorizations of every user using it to infect their programs. Every program that gets infected may also act as a virus and thus the infection grows." (Fred Cohen) [75]

A computer virus is a program that can infect a computer system without the knowledge of the user. Viruses are self-replicating and can infect other programs, which then act as hosts for the virus, similar to biological viruses. Infection of other systems can only occur if an infected file is transferred to another computer system through a data transfer over a computer network or via storage media; computer viruses are not able to infect other computer systems over a network on their own. They can have different unwanted and dangerous effects on a computer, such as data or even system destruction, faulty system behavior, or at the very least a waste of computing resources in memory and storage space.

Computer viruses are usually detected through virus signatures. Anti-virus companies and activists generate signatures for software samples which are suspected to be malicious. These signatures are bitstreams and can be considered fingerprints that are characteristic of one specific computer virus. To unveil an infection, the file system and memory of the computer system have to be scanned for these signatures.
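In its simplest form, such signature matching is nothing more than searching files for characteristic byte sequences. The following toy Python sketch illustrates the idea using the mt-download.com script reference quoted above as its only signature; real anti-virus engines use far richer signature formats, and the function names here are invented for the example.

# Toy signature scanner (illustrative only; real scanners use far richer
# signature formats than plain substrings).
SIGNATURES = {
    # name                   characteristic byte sequence (see the example above)
    "HTML.MediaTickets.A": b"http://www.mt-download.com/mtrslib2.js",
}

def scan_file(path):
    """Return the names of all signatures found in the given file."""
    with open(path, "rb") as f:
        data = f.read()
    return [name for name, pattern in SIGNATURES.items() if pattern in data]

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        hits = scan_file(path)
        print(path, "->", hits if hits else "clean")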
2.2.2. Computer Worm

Computer worms are often confused with computer viruses but are a different kind of malware. Worms are computer programs with the built-in ability to exploit one or more vulnerabilities of a specific system implementation. In contrast to computer viruses, which wait passively to be executed and to infect files on different systems, computer worms actively attack other systems in order to infect them. Computer worms do not infect other programs with their code but copy themselves onto other exploited computer systems. They often have a secondary intention besides their own spreading: this can be sending spam through infected hosts or downloading other malicious code to permanently take control over a computer by means of a backdoor or a Trojan Horse. Since computer worms actively spread and infect systems over the network, new threats often generate a heavy load on parts of the Internet and can cause great financial damage, especially for small networks. Worms can spread through computer networks utilizing email [81], peer-to-peer networks [108], or even mobile phones [95].

Like computer viruses, computer worms are detected with signatures. In addition, a worm attack over a network can be detected with an intrusion detection system signature; with this approach an infection can be prevented by an intrusion detection system before the worm reaches the victim machine.

2.2.3. Trojan Horse

The term Trojan Horse comes from the myth of the Trojan War. It stands for an attack device which veils its true, bad intentions and appears harmless and useful to its victim. In computer security, a Trojan Horse is a malicious program which is presented to the victim as a useful program, or is attached to a useful program, and infects the victim unnoticed. Trojan Horses do not replicate themselves or spread over the Internet in an automated manner and are therefore not categorized as viruses or worms, but they often use similar techniques to hide from the user and to disguise their actions as wanted or unproblematic system operations. They can harm the client in any way that other malware can.

2.2.4. Spyware and Adware

There is no precise definition of spyware, but every definition attempt focuses on the spying aspect. The Anti-Spyware Coalition [53] has formed a definition which comes close to the subject:

"Technologies deployed without appropriate user consent and/or implemented in ways that impair user control over: Material changes that affect their user experience, privacy, or system security; Use of their system resources, including what programs are installed on their computers; and/or Collection, use, and distribution of their personal or other sensitive information." (The Anti-Spyware Coalition)

Spyware generally has criminal or financial purposes. The criminal purposes include, among others, identity and/or credit card information theft. The compromised systems often have a keylogger installed, which logs keystrokes and scans the hard disks and Web forms for credit card data or other criminally valuable information. The financial purpose is mostly advertising and stealing credentials: the compromised system's Web searches or browser history are observed to gain information about the user's interests, and customized advertisements are delivered. In contrast to viruses or worms, spyware does not self-replicate to spread itself or transmit the infection to other systems. Spyware is installed either with social engineering and deception techniques, through other malware, or by exploiting system vulnerabilities with specially prepared exploiting Web sites.

Adware is software with advertising functions, either bundled with another software product or part of one. It is often considered a nuisance, but its main characteristic is the consent of the user, who agrees to advertisements by installing the software. Some kinds of adware send specific information about the client to optimize the advertising functionality.
Such consented spying functions sometimes lead to adware being erroneously labeled as spyware, but as long as adware stays legal in its actions it cannot be categorized as spyware.

2.2.5. Exploit

Exploits are malicious software or malicious content which can be used to attack a vulnerable system in order to gain control over it, to allow privilege escalation, or to mount a denial-of-service attack. So-called drive-by-downloads are Web sites which can exploit Web clients that merely visit them. Once a system is exploited, an attacker can do anything on it, such as installing additional malicious software like backdoors or adware. Exploits are often classified as remote or local, which indicates whether an exploit is applicable from a remote machine over a computer network or only from the machine itself. Exploits lose their applicability as soon as the corresponding vulnerabilities are patched. Exploits which are published before a vendor patch exists are called zero-day exploits, after the number of days between the publication of a vendor patch and the appearance of the exploit. Such exploits are rare and interesting to both blackhats and whitehats. Since unpatched vulnerabilities can pose a threat to all affected systems on the Internet, system vendors try to be the first to know about a vulnerability and are sometimes even prepared to pay a bounty for its discovery and disclosure [61].

2.2.6. Backdoor

A backdoor or rootkit is malicious code which enables an attacker to modify a victim computer system in such a way that the attacker can access the system remotely without the knowledge of the victim. Such code can either modify a legitimate program on the victim system or install itself to enable remote access. A successfully infected system can then be used for any purpose and stay undetected for a very long time, because as long as the backdoor is not accessed remotely it is unlikely to be detected.

2.2.7. Botnet and Bots

The term bot [86] is an abbreviation of robot and stands for an automated malicious program with some similarities to a computer worm. It has the ability to automatically connect to a point on the Internet where its originator can gain control over it. Bots are infested computer systems which automatically connect to such a command-and-control server and wait for commands, thus forming a kind of distributed, remotely controllable network of bots, a so-called botnet. Botnets often utilize Internet Relay Chat (IRC) [24] for their communication: infested bots connect to a password-protected, encrypted, specially prepared IRC server and become part of the botnet. Once a botnet is up and running, it can hold thousands of bots ready for abuse. Botnets are often used for distributed denial-of-service (DDoS) attacks and spamming. They have three major attributes:

• A remote control facility enabling the attacker to control a bot.
• The implementation of several commands, which determines the abuse capabilities of a bot.
• The spreading mechanism, which is important for the further propagation of the bot.

2.3. Detection of Malware

Malware researchers usually obtain malware samples and samples of suspicious code from the following sources:

• Mail gateways provide access to many spam mails with attached malware or links to malicious sites.
• Honeypots are victims of automatic malware attacks and are therefore often used to collect such binaries in an automated manner. Furthermore, there are specialized honeypots, such as Nepenthes [39], built solely for the automatic collection of malware samples.
• Intrusion detection systems can detect an attack and dump the communication to capture binaries.
• Infected systems host infected binaries.

Malware can be either publicly discovered or publicly undiscovered. Publicly discovered malware exploits computer systems through publicly known vulnerabilities, while publicly undiscovered malware can exploit either known or unknown vulnerabilities. Malware does not have to exploit a system at all; it merely has to have malicious intentions and to be executed on the victim computer system. Depending on this, the detection mechanisms differ in the following two ways. Known malware, including non-binary threats, can easily be detected with the classic signature database method, in which a signature is generated for every known malware sample; these signatures are characteristic byte sequences that identify malicious code. For unknown malware with no existing signature, analysis tools have to be used to unveil the maliciousness of an observed binary. There are two kinds of analysis techniques for malware; the major trade-off between them is the relation between the speed of the analysis and the level of detail of the information gained about a sample.

2.3.1. Static Analysis

Static malware analysis is the most expensive approach and requires a high level of skill, but it can deliver complete information about a binary. There are three ways to perform static analysis of malware.

First, the binary can be examined without ever being executed. Extracted strings can deliver relevant information about the contents: the standard Unix tool strings prints the runs of printable characters found in files. Listing 1 shows a simple example in which strings are extracted from a file containing two strings.

Listing 1: Application of the strings command

$> cat hello.txt
hello
world
$> strings hello.txt
hello
world

In fact, some malware has been named after unique strings found in malware samples. Such strings can reveal contained text, for example a message or the disclaimer of a software library. Hash functions (MD5 [58] / SHA-1 [63]) can deliver some information about the content, enabling us to compare a file with others. Additional information can be obtained if embedded resources can be extracted; the Resource Hacker [46] is an example of an extraction tool able to obtain embedded icons, bitmaps, etc. from Windows binaries. While this approach is easy to apply, it is likely to deliver only limited information.
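To tie this first static-analysis step together, the following small Python sketch computes the MD5 and SHA-1 hashes of a sample and extracts printable strings, roughly reproducing what the strings command from Listing 1 does. It is a simplified illustration under our own naming, not part of the toolchain described in this thesis.

# Quick static triage of a sample: cryptographic hashes plus printable strings
# (a rough Python equivalent of the Unix strings tool shown in Listing 1).
import hashlib
import re
import sys

def hashes(path):
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()

def printable_strings(path, min_len=4):
    """Yield runs of at least min_len printable ASCII characters."""
    with open(path, "rb") as f:
        data = f.read()
    for match in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        yield match.group().decode("ascii")

if __name__ == "__main__":
    for path in sys.argv[1:]:
        md5, sha1 = hashes(path)
        print(f"{path}: MD5={md5} SHA1={sha1}")
        for s in printable_strings(path):
            print(" ", s)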
The second way is to disassemble the binary and examine the machine code. This is the most difficult and time-consuming way to analyze a sample, but it can lead to a full understanding of the binary. The main problem of this approach is that it requires very deep knowledge of the examined system and its machine language. One of the best tools for disassembling code for many architectures is the commercial IDA Pro disassembler [21]; we have observed that it is widely used by malware researchers. Figure 2 shows a sample debugging session.

Figure 2: Sample IDA Pro session [21]

The last option is the decompilation of the binary into human-readable program code. This option requires less system knowledge than disassembling, but in contrast to disassembling, decompilation is not always applicable or useful. If the original program code was compiled using the code-optimizing options of a compiler, without debugging symbols, or with a special disassembler obfuscator, the resulting code can be very difficult to analyze or even impossible to decompile. Even if a binary is decompiled successfully, the resulting code can be hard to analyze, because the source code comments and even the variable names, which usually deliver crucial information about a source code, are not recovered. Nevertheless, it is far easier to analyze such code than to step through a huge machine code program. The Boomerang [4] project is an example of a popular multi-platform decompiler.

2.3.2. Behavioral Analysis

Behavioral malware analysis is faster and requires fewer skills than static analysis, but it often gives less and incomplete information about the examined code. Nevertheless, for most purposes the incomplete information may be enough. Behavioral analysis does not analyze the software itself but its behavior in a prepared environment, using a black-box approach: a system is examined before and after the execution of a suspicious binary, and all security-relevant information is logged. System calls can be hooked to trace all actions. A system that implements modern behavioral analysis methods is the CWSandbox [8], which is described in Section 3.5.5. Behavioral analysis is likely to miss some features of the software sample, such as hidden features, password protection, or software with unfulfilled dependencies like additional software libraries or system capabilities. Nevertheless, behavioral analysis is the best way to automatically analyze many software samples at a time and to identify their maliciousness.

2.4. Honeypots vs. Honeyclients

"All warfare is based on deception." (Sun Tzu)

Deception is a technique used to trap criminals, not only in criminal investigation but also in computer security. Like a thief in the real world, an attacker wanting to break into a system does not want to be observed or traced and will flee as fast as she can the moment she realizes the trap. An attacker who is fully deceived into believing that the system she is about to attack is unobserved and safe to break into will probably do so. And just as, in the real world, a thief is only a thief if she is caught stealing, an attacker can only be criminally prosecuted if she has attacked someone's computer system or network illegally. Another aspect of deception is the way the attacker acts on a compromised system: her actions will differ depending on how much she is deceived by the compromised system. The more she is deceived and trusts in her proficiency, the more she will reveal of her abilities, techniques, and aims [73].

2.4.1. Honeypots

Honeypots [99] are dedicated deception devices, whose value lies in being probed, attacked, and compromised by manual or automated attacks in order to gather information about the attacker without his knowledge. Because they are dedicated solely to attack analysis and not to production use, every action on a honeypot is suspicious and illegitimate. Server honeypots, which are often simply called honeypots, wait for the adversary to attack and thus have to be a desirable target for an attacker to fulfill their task. Just as, in the real world, an open door may tempt a saint, such a desirable system should provide publicly known system vulnerabilities and weak security profiles, such as weak or no passwords and old, unpatched system software, in order to succeed.
Honeypots are usually classified as low-interaction or high-interaction honeypots. High-interaction honeypots are real systems providing real applications for the attacker to interact with; they are used to gather profound information about threats and attacks. Low-interaction honeypots simulate real systems and services; an attacked system has only restricted interaction capabilities and thus lacks some of the deception value.

Some aspects are special to honeypots in general. Honeypots are likely to be controlled by attackers who will try to attack other computer systems, which may be a legal issue in many countries and needs attention. To restrict the ability to attack other systems, honeypots employ a containment strategy, which can restrict outbound connections and thus minimize the possible damage. The containment strategy has to be carefully adjusted to keep the deception up, so that the attacker does not realize that she is being observed.

2.4.2. Honeyclients

Honeyclients, sometimes also called client honeypots, are the opposite of server honeypots. Honeyclients actively crawl or access the Web to search for servers that exploit the client, and thus gather information on how attackers exploit clients. Web browsers are not the only kind of client software that can be attacked, but the current focus in this work, as in other honeyclients, is mostly on the analysis of Web client exploitation.

One major drawback of server honeypots is that they wait for an attack. An attack occurs only by chance, so it is possible that a ready-to-attack honeypot will not be attacked for months, and it is just as possible that it is attacked by many attackers at the same time. It is not easy to predict how frequently attacks will occur on a honeypot, which complicates the analyses. In contrast, honeyclients initiate every analysis themselves and thus control the maximum number of possible attacks. The analysis is valuable even if no attack occurs, because such Web sites and servers can then be classified as probably not malicious, and thus a safe part of the Web could be mapped.

Honeyclients can also be classified as low-interaction or high-interaction honeyclients; research in this field is very new. High-interaction honeyclients are usually real, automated Web browsers on real operating systems which interact with Web sites as real humans would. They log as much data as possible during the attack and allow a fixed time period for an attack. While they provide detailed information about an attack, high-interaction honeyclients are very slow and not able to scan broad parts of the Web. Low-interaction honeyclients are often emulated Web browsers, usually Web crawlers, which offer attackers no or only limited abilities to interact with them. Low-interaction honeyclients often rely on static signature-based or heuristics-based malware and attack detection, and therefore miss zero-day exploits and unimplemented attack types. These honeyclients are not suited for an in-depth investigation of an attacker's actions after a successful compromise, because the system is only simulated and any action beyond the initial exploitation is likely to be missed as well. In spite of these drawbacks, low-interaction honeyclients are easy to deploy and operate and very performant. They can be used for automatic malware collection and to take a sounding of a portion of the Web.
Furthermore, the containment of attacks is not a big issue, because the compromised system is not real and thus unusable for the attacker, which additionally simplifies the deployment of a low-interaction honeyclient. Figure 3 shows the operation of honeypots and honeyclients.

Figure 3: Honeypots vs. honeyclients

Table 1 compares the different aspects of honeypot-based devices.

Table 1: Honeypot comparison

                 low-interaction                       high-interaction
honeypots        simulated systems and services        real systems and services
honeyclients     simulated Web clients and systems     real Web clients and systems
advantages       very fast; low system load;           zero-day exploit detection;
                 multi-threaded; multi-system;         detailed analysis;
                 scalable; easy containment            real-world systems
disadvantages    misses complex client exploits;       slow; complex re-setup;
                 improper malware analysis;            heavy system load;
                 no zero-day detection                 single architecture

2.5. Related Work

Because our architecture is very similar to a honeyclient, most research done in this field can be referred to as honeyclients. Similar research on the malicious contents of the Internet has been done in the following, sometimes immature and still evolving, research projects, which are compared in Table 2.

Table 2: Comparison of different honeyclients

HoneyMonkey: high-interaction; proprietary, not available; toolkit: Strider tools (FDR, GB, GK); infection vector: Web client (IE); started 2005 by Microsoft Research; status: production/stable; analyzed threats: drive-by-downloads, zero-day exploits, malware; analysis OS: Windows.
UW Spycrawler: high-interaction; proprietary, not available; toolkit: Lavasoft AdAware; infection vector: Web client (IE/Mozilla); started 2005 by the University of Washington; status: production/stable; analyzed threats: spyware; analysis OS: Windows.
SiteAdvisor: high-interaction; proprietary, not available; toolkit: n/a; infection vector: Web client (any); started 2005, now McAfee; status: production/stable; analyzed threats: drive-by-downloads, malware, spam; analysis OS: Windows.
HoneyC: low-interaction; Free Software; toolkit: Snort; infection vector: Web client (crawler); started 2006 at the Victoria University of Wellington; status: development/beta; analyzed threats: malware; analysis OS: multi-OS.
HoneyClient: high-interaction; Free Software; toolkit: self-written; infection vector: Web client (IE); started 2004 by Kathy Wang; status: development/beta; analyzed threats: malware, zero-day exploits; analysis OS: Windows.
Shelia: high-interaction; availability: n/a; toolkit: self-written; infection vector: email; started 2006 at the Vrije Universiteit Amsterdam; status: development/beta; analyzed threats: malware, zero-day exploits; analysis OS: Windows.
Capture-HPC: high-interaction; Free Software; toolkit: state-based; infection vector: Web client (various); started 2006 at the Victoria University of Wellington; status: development/beta; analyzed threats: drive-by-downloads, zero-day exploits; analysis OS: multi-OS.
Pezzonavante: high-interaction; proprietary, not available; toolkit: many free and proprietary tools; infection vector: Web client (IE/Mozilla); started 2005 by Danford/StillSecure; status: development; analyzed threats: drive-by-downloads, malware, zero-day exploits; analysis OS: multi-OS.
Monkey-Spider: low-interaction; Free Software; toolkit: ClamAV; infection vector: Web client (crawler), email; started 2006 at the University of Mannheim; status: development/beta; analyzed threats: malware, spam, phishing; analysis OS: multi-OS.

Throughout all available honeyclients, a common general architecture or process chain can be observed, consisting of three serial steps that depend on each other: first, a queue is filled with objects to analyze; then a client requests the queued objects; finally, the gathered objects are analyzed with regard to their maliciousness. The queue can also be extended during the request and analysis steps. This simple process is depicted in Figure 4.

Figure 4: The general honeyclient process chain
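The three-step chain can be expressed in a few lines of Python. The sketch below is a deliberately naive illustration of the queue-visit-analyze loop; all names are placeholders, and the analysis step is a toy stand-in rather than the interface of any of the systems compared in Table 2.

# Naive illustration of the generic honeyclient process chain:
# fill a queue, visit each object, analyze the response.
from collections import deque
import urllib.request

def analyze(content):
    """Placeholder analysis step: flag responses containing a toy pattern."""
    return b"mt-download.com" in content   # stands in for a real scanner

def honeyclient(seed_urls):
    queue = deque(seed_urls)        # step 1: queue of objects to analyze
    results = {}
    while queue:
        url = queue.popleft()
        try:
            content = urllib.request.urlopen(url, timeout=10).read()  # step 2: visit
        except OSError:
            continue
        results[url] = analyze(content)                               # step 3: analyze
        # the queue may be extended here with links found in the response
    return results

if __name__ == "__main__":
    print(honeyclient(["http://example.com/"]))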
2.5.1. HoneyMonkey

"The Strider HoneyMonkey Exploit Detection system is a network of monkey programs running on virtual machines with different patch levels and constantly patrolling the Web to hunt for Web sites that exploit browser vulnerabilities." (The Strider HoneyMonkey Project) [104]

HoneyMonkey is a Web browser (Internet Explorer) based high-interaction honeyclient developed at Microsoft Research in 2005. For their analysis approach the authors introduced the term Automated Web Patrol. The architecture consists of a chain of virtual machines running different flavors of the Windows operating system at various ascending patch levels. Starting with the fully unpatched system, the Monkey Controller initiates a so-called monkey program which browses previously scheduled Web sites. The monkey waits for a given time; after waiting, the virtual machine is checked for possible intrusions. If an attack is detected, the Web site is revisited with the next machine in the pipeline, and this reanalysis continues as long as the machines keep being successfully exploited. If the last, fully patched system in the chain is also exploited, the Web site is singled out as hosting a zero-day exploit.

The HoneyMonkey system uses a black-box approach for its analysis. Since a monkey is instructed to launch a browser instance and wait a few minutes, but is not set up to click on anything, any executable file created outside the browser's temporary folder signals an exploit. The monkeys are started with the Strider Flight Data Recorder [72], which records every single file and registry read/write. Each monkey also runs with the Strider Gatekeeper [49], to detect any hooking of Auto-Start Extensibility Points (ASEPs) that may not create executables, and with the Strider GhostBuster [50], which can detect stealth malware that hides processes and ASEP hooks. For ease of use, every monkey runs in a virtual machine queued, started, and reset by a Monkey Controller. HoneyMonkey is also able to create a URL-level topological graph to unveil bad neighborhoods and major players in the malware community.

During their research in May/June 2005, the authors found that totally unpatched Windows XP SP1 systems could be exploited by 752 URLs, while a fully patched Windows XP SP2 had no exploits. They also claimed to have detected a zero-day exploit in July 2005 [69], but it is known that SEC Consult had found and announced that vulnerability already in mid-June 2005 [79]. The main advantage of the HoneyMonkey system is that it is able to detect zero-day exploits against fully patched Microsoft systems and can classify which exploits are applicable up to which patch level. The performance of HoneyMonkey in identifying exploits was about two minutes per URL.

2.5.2. Spycrawler

The University of Washington Spycrawler [93] (Seattle/USA) is a proprietary crawler-based high-interaction honeyclient with an event-based detection mechanism, developed in 2005. The focus of this research is on spyware; other types of malware are not pursued. The architecture uses three steps to detect and analyze spyware and takes Google search results as starting points:

1. Determine whether a Web object contains executable software (see the sketch after this list):
   a) based on the Content-Type HTTP header (e.g., application/octet-stream)
   b) based on the URL containing known executable extensions (e.g., exe, cab, or msi)
2. Download, install, and execute that software within a virtual machine:
   a) download the URL with Heritrix [19]
   b) install and execute the binary with an automated monkey program
3. Analyze whether step two caused a spyware infection: analyze the virtual machine with the AdAware [1] spyware scanner to detect infections.
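Step one of this pipeline boils down to a simple heuristic. The Python sketch below shows one way such a check could look; the MIME types and extensions listed are only the examples named above plus common guesses, not Spycrawler's actual rule set.

# Heuristic from step 1: decide whether a crawled Web object is likely to be
# executable software, based on its Content-Type header or its URL extension.
# The lists below are illustrative guesses, not Spycrawler's real configuration.
EXECUTABLE_MIME_TYPES = {
    "application/octet-stream",
    "application/x-msdownload",
}
EXECUTABLE_EXTENSIONS = (".exe", ".cab", ".msi")

def looks_executable(url, content_type=""):
    """Return True if the object should go to the install-and-observe stage."""
    if content_type.split(";")[0].strip().lower() in EXECUTABLE_MIME_TYPES:
        return True
    return url.lower().split("?")[0].endswith(EXECUTABLE_EXTENSIONS)

if __name__ == "__main__":
    print(looks_executable("http://example.com/setup.exe"))                        # True
    print(looks_executable("http://example.com/dl", "application/octet-stream"))   # True
    print(looks_executable("http://example.com/index.html", "text/html"))          # False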
One of the main disadvantages of this approach is that some executables are probably missed, because step one is not sufficient to detect all of them. With this approach the real amount of spyware on the Web may therefore be underestimated, but it will not be overestimated. By comparison, our system analyzes all content; this means analyzing more content than needed, but it rules out missing executable code. Another main disadvantage is that the analysis is based on only one anti-spyware solution and only on known spyware types.

Despite these weaknesses, they generate interesting results. Their implementation on a cluster of ten nodes achieves an average turnaround of 92 seconds per executable: resetting the VM, downloading the binary, installing and executing it, and analyzing it with AdAware. Their results give an insight into the topic-specific distribution of spyware on the Web and into the probability of an infected site in general. They crawled about 40 million URLs in May and October 2005 and found executable files in about 19% of the crawled content. Approximately 4% of the crawled sites had spyware-infected executables. Additionally, they found some major spyware-spreading sites linked from other sites and some major spyware types contained on most of the sites.

2.5.3. SiteAdvisor

SiteAdvisor [34] is a proprietary high-interaction honeyclient from McAfee. Development started in April 2005 with a group of MIT engineers, and the first results appeared around December 2005. In April 2006 McAfee bought SiteAdvisor and now continues the work. There is little information and no public documentation about the architecture of the system; judging from their Web blog, it appears to be a crawler-based high-interaction honeyclient. In December 2005 they had 80 analyzing nodes in their honeypot cluster. They have published some research results, but it is not possible to directly access their result database. They have created a system which analyzes malicious content on Web sites, drive-by-downloads, and the email traffic received after automatic registration on the examined sites, complemented by a very active community of Internet users who personally rate and comment on the maliciousness of Web sites. Other site-specific information, such as online affiliates, popularity, and country, is provided as well. All this information is compiled into a site report, which has to be queried manually for each Web site. Figure 5 shows a sample site report for a malicious Web site.

Figure 5: Sample SiteAdvisor site report

SiteAdvisor uses a simple safety rating for the overall safety of a site, based on the three colors of traffic lights: green stands for harmless Web sites, yellow signals a suspicious appearance, and red means an assured or community-rated maliciousness of a Web site. The SiteAdvisor system also provides a browser plugin for the automatic assessment of visited and searched content. Search results for popular Web searches are highlighted according to their safety rating: SiteAdvisor's safety ratings appear next to search results, and a small button on the browser toolbar changes color based on SiteAdvisor's safety results.
Menu options on SiteAdvisor's toolbar let the user customize SiteAdvisor or see a site's detailed test results. Figure 6 shows a sample search with an enabled SiteAdvisor browser plugin.

Figure 6: SiteAdvisor's rated Google search results

Every now and then they also publish malware-related statistics. In their malware report from March 2007 they found that 4.1% of all sites tested by SiteAdvisor are rated red or yellow. They point out the different maliciousness levels of top-level domains, showing a range of red and yellow rated pages from 0.1% of Finnish (.fi) domains to 10.1% of the domains of the islands of Tokelau (.tk). They also observe the maliciousness of generic TLDs and identify .info as the riskiest TLD, with 7.5% of its pages rated as malicious, followed by .com with 5.5%.

2.5.4. HoneyC

HoneyC [97] is a component-based low-interaction honeyclient developed in 2006 at the Victoria University of Wellington, New Zealand. It is the first cross-platform open source framework of this kind and is written in Ruby. HoneyC emulates only the essential features of the target clients and applies signature matching for a fast static analysis of server responses. It is based on three components, which communicate with each other over a normal Unix pipe:

$> queuer | visitor | analysisEngine

Components can be multithreaded and are modular. The initial version of HoneyC includes the following implementations of the three components:

• Queuer, which queues objects to analyze: a Yahoo Search Web Services [65] queuer
• Visitor, which acts as the client: a self-written Web client requesting content
• Analysis engine, which analyzes the response: a Snort [48] signature database analysis engine

Preliminary results [97] show a very good performance of 3 seconds for the analysis of one Web site. The authors estimated how long it would take to analyze the whole publicly indexable Web with HoneyC: for approximately 52 billion Web sites with 945 petabytes of content, it would take one week, a bandwidth of about 13.75 Gbit/s, and more than 257,000 instances of HoneyC.

2.5.5. HoneyClient

HoneyClient [103] is a Web browser based high-interaction honeyclient. It is the first open source client honeypot, developed in 2004 by Kathy Wang. HoneyClient runs on a normal Windows host. It takes snapshots of the Registry and the file system before any action is taken; right after a Web site is automatically visited, a long system scan detects file system and Registry changes. Being the first freely available client honeypot, it was not built to analyze millions of Web sites or to be very performant, but to be a proof of concept. HoneyClient has a comparatively poor performance of several minutes per URL.

2.5.6. Shelia

Shelia [96] is a high-interaction honeyclient in development at the Vrije Universiteit Amsterdam, Netherlands, with a primary focus on spam mails. The authors are building a system for automatically checking all links and attachments received by email. For each Web site link or file received, a behavioral analysis is made to detect the maliciousness of the resource; the queuer component is therefore extended to queue this additional content.
It is special because this system analyzes not only vulnerabilities of one client software but also vulnerabilities of any software associated with specific file types. It should be able to detect macro viruses and third-party software vulnerabilities, for example a buffer overflow vulnerability in Adobe Acrobat Reader [68]. The analysis consists of three parts which become relevant after the execution of a file or the invocation of a file with its associated application:

• The process monitor engine monitors the actions executed by the clients and applies a containment strategy to prevent successful attacks from spreading further off the honeyclient
• An attack detection engine determines whether an action executed by the client is considered illegal, using a DLL injection and an API hooking technique
• An attack log engine logs as much information as possible about the attack

2.5.7. Capture-HPC

Capture-HPC [5] is a high-interaction honeyclient developed at the Victoria University of Wellington/New Zealand. Capture has two functional areas in its design, namely a Capture client and a Capture server. The clients host the actual honeypot on a virtual machine, while the server coordinates and controls the clients. Capture concentrates on three aspects of high-interaction honeyclients:

• Capture is designed to be fast. State changes on the clients that indicate malicious actions are reported to the server in real time.
• Capture is designed to be scalable. The central Capture server can control numerous clients across a network.
• Capture supports different clients. The current version supports Firefox, Opera and Internet Explorer.

The scriptable server takes each URL it receives and distributes them to all clients in a round-robin fashion while controlling the clients in terms of starting and stopping them. The clients connect to the server on port 7070 and report back any state changes. Each client monitors its own state while browsing a Web site for changes on the file system, the registry, and the processes. An exclusion list of known normal system changes is used to identify a normal state; any other operation triggers a malicious classification of the Web server, and this information is sent to the Capture server. Since the state of the client has then been changed, the client resets itself to a clean state and retrieves new instructions from the server. If no state change has happened, the client requests new instructions from the server and continues its browsing without resetting.

The client consists of two components, a set of kernel drivers and a user space process. The kernel drivers have event-based detection mechanisms for monitoring the system's state changes. The user space process captures the system changes from the kernel drivers, filters them against the exclusion list, and communicates them back to the server. An interesting feature in development is the support of non-browser clients, such as multimedia players and Microsoft Office applications.

2.5.8. Pezzonavante

Pezzonavante [77] is a high-interaction honeyclient developed by Robert Danford at StillSecure, started in August 2005. Since it is a proprietary system and is mainly used internally to test the security products of StillSecure, there is little information and few statistics about the system architecture and results. Nevertheless, malicious Web sites and code found by Pezzonavante are shared with the computer security community.
Pezzonavante utilizes a farm of honeyclients with different patch levels of operating systems and partly the security software from StillSecure with its highest security policy. A central URL server coordinates every client in the farm to visit the same sites. Like with HoneyMonkey, this approach is used to expose clients with different security settings to the same threats. To avoid overhead and Web server downtimes, Pezzonavante uses Squid caching. In contrast to other systems, Pezzonavante does not do integrity checks after each visit but in a monthly fashion. It is likely to miss some malicious sites due to this lack of detection, but this is outweighed by the high-speed detection of the most malicious Web sites. Pezzonavante uses a hybrid, asynchronous approach to attack detection utilizing, among others, the following techniques:

• Osiris integrity checking
• Security tool scans (various anti-virus and anti-spyware scanners)
• Snort network IDS alerts
• Traffic analysis
• Snapshot comparisons

Pezzonavante's main attainment is the possible high-speed detection and analysis due to the monthly intrusion checking. According to Grimes [83], the Pezzonavante system surfed the Web in December 2005 with four Windows XP machines at a rate of 150 to 300 URLs per hour, resulting in about 24 to 12 seconds per Web site.

2.6. Spam and Malware

The term spam is commonly used for Unsolicited Bulk Email (UBE) or Unsolicited Commercial Email (UCE). This type of email message is often used for advertising and phishing, but this is not the only malicious issue about spam. Recent spam waves have shown different attempts to use spam as an infection vector for malware [78, 74]. These spam mails either have malicious code attached and trick the user into executing the attachments, or they guide users to Web sites hosting malicious content. Spam mails are day-to-day threats, and so are the sites they link to. To capture these threats our system should also be able to analyze malicious content linked in spam mails. The malware found this way is likely to use the latest available techniques and could also include zero-day threats. A day-by-day observation of incoming spam mail is important; any later observation would miss the linked malicious content. According to the phishing attack trend report for February 2007 [84] of the Anti-Phishing Working Group, the average time online for phishing sites has been four days. They have also collected a total of 289 unique malware types on 3121 hosts in February alone.

2.7. Phishing

Phishing is not considered malware, but it is malicious for the surfer from a social viewpoint. Phishing is a scam or criminal activity using social engineering techniques. It is an attempt to fraudulently acquire sensitive information like user names, passwords, online-banking transaction numbers, and credit card data. Every phishing attempt is done in two steps. First a phishing mail is sent to arbitrary recipients with the same methods as spam mails. This mail then uses social engineering tricks to guide the victim to a previously prepared phishing Web site claiming to be a serious entity like the official homepage of the victim's bank, the PayPal Web site, or the eBay Web site. A normal Internet user is often not able to tell whether an email message is really from the claimed sender or not. Figure 7 shows such a phishing Web site. Once the information is gained, phishers either sell this information on specialized black markets [107] or use it to scam the victims themselves.
2.8. Crawling the Web

For a broad and high-bandwidth analysis of the entire Web or parts of it, it is obvious that performance and scalability are the main concerns for finding appropriate solutions. It would be possible to scan the Web while surfing through the contents, but this would lead to poor performance because the malware scanner would have to analyze only a small portion of data per Web site and thus a process would be spawned per analyzed bunch of content. Since every process creation generates a huge overhead through loading additional libraries, forking new processes, allocating memory etc., we try to avoid this as much as possible by batch-processing the crawled content at the end of the crawls.

Queueing — To crawl the Web one must have one or more starting points. There is no possibility to mirror the Web directly because parts of it are constantly added and removed. The only way of getting a good picture of it is to crawl representative portions. In order to achieve this we have to find some starting points and explore further linked pages. This set of starting URLs is called seeds throughout the rest of this work. Depending on the intention of the research and the popularity of the found material, it is important to achieve a good hit rate and to gain representative results for our evaluation. There are several methods of obtaining these seeds. To surf the Web, people have two options where to start from: they either start their surfing from a previously known link or search for new URLs on popular Web search engines like google.com. Once they have started, they can use the links on a Web site to go further. Trying to understand the distribution of malware on the Internet, we choose popular topics and generate a seed out of Web searches based on these topics. Additionally, we generate seeds based on hosts files, which collect and blacklist malicious as well as otherwise dangerous sites. Finally, we use spam mails as seeds, which often lead to phishing and malicious Web sites.

Figure 7: Example of an eBay phishing site [100] trying to steal passwords for identity theft

Scope — The crawl scope denotes how far we want to go in discovered URLs. Two parameters determine the crawl scope. The first one is the maximum link hops to include, which means URLs more than this number of links from a seed will not be ruled in-scope. The second one is the maximum transitive hops to include, meaning URLs reached by more than this number of transitive hops will not be ruled in-scope, even if otherwise on an in-focus site. These two parameters are decisive settings for the hit count and the relevance. The hit count specifies how many malicious sites are found.

URL Normalization — The times when Web sites were static, handwritten HTML pages with unique, human-generated URLs are history. Modern Web sites, even personal, unprofessional ones, are machine-generated and use the URL as a transport layer for information. Database queries over the URL or simple data transport mechanisms over the URL like Representational State Transfer [80] are getting more and more popular. As a result, many syntactically different URLs point to the same physical resource on the Web. To avoid the unnecessary overhead of downloading the same content multiple times, we use URL normalization [92] techniques, as sketched below.
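The following is a minimal sketch of typical URL normalization steps (lowercasing the scheme and host, stripping default ports and fragments); it illustrates the general technique and is not the exact rule set used by our crawler.

from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Apply a few common URL normalization rules."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    netloc = netloc.lower()
    # strip default ports
    if (scheme, netloc.split(':')[-1]) in (('http', '80'), ('https', '443')):
        netloc = netloc.rsplit(':', 1)[0]
    # normalize an empty path and drop the fragment
    if not path:
        path = '/'
    return urlunsplit((scheme, netloc, path, query, ''))

# example: both variants map to the same canonical form
print(normalize_url('HTTP://Example.COM:80/index.html#top'))
print(normalize_url('http://example.com/index.html'))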
Link Extraction — To reach as much of the available content as possible, not only the URLs contained in HTML files have to be analyzed but also those in any other content. Link extraction is the process of extracting URLs out of any kind of document, like HTML, JavaScript, PDF etc. Link extraction is essential for the scope of our evaluation because related links are often embedded in files other than the HTML site and are reachable for human surfers, too.

File Storage —

It is the Archive's experience that it is difficult to manage hundreds of millions of small files in most existing file systems. (The Internet Archive) [54]

The first idea for storing the crawled contents is just to mirror them and to use their URL as a folder index. This approach may be useful for standard archiving of a dozen Web sites, but it is unusable for broad-scope crawls of millions of URLs with obfuscated and sometimes overlong URLs. Most file systems will get into trouble managing that many files. Therefore an optimized storage strategy has to be considered regarding:

• Minimize file creation and management. Every file object creation, process creation and termination takes notable time and resources.
• Avoid filename conflicts. Request headers in URLs as file names and overlong URLs could cause problems with some file systems due to restrictions of the file system itself.
• Utilize data compression. Many files are likely to be accessed only once or twice, so compression is not a big performance issue. Many plaintext file formats and especially HTML can be compressed very efficiently. Smaller files improve the overall file system performance due to smaller write and read operations.

3. Monkey-Spider

In this Chapter, we introduce our Internet analysis framework: the Monkey-Spider. We have used it in our research as a low-interaction honeyclient, but it is able to support other kinds of Internet research as well. Some reasonable use cases and extensions to be worked on in the future are shown in Chapter 5.

3.1. System Architecture

The Monkey-Spider system utilizes many existing freely available software systems. Figure 8 shows the whole system.

Figure 8: Monkey-Spider Architecture

The architecture is divided into different functional blocks, marked as dotted or drawn-through rounded rectangles. These blocks can either run on different computer systems for scalability and performance reasons or on one computer system for ease of use. The hatched modules are still in development.

Queue generation — Every run begins with the Seeder block, which generates starting URLs for the crawler. One can either generate seeds with the Web search seeders and/or generate seeds out of spam mails with the mail seeder and/or use the MonitorDB seeder to requeue previously detected malicious sites. The Web search seeders use the corresponding search engines, namely Yahoo, Google and MSN Search, for searching and URL extraction. The mail seeder extracts URLs out of collected spam mails from a spamtrap. The monitoring seeder is used to constantly reseed previously found malicious content over time from our malware database.

Web Crawling — Heritrix [19], the Web crawler, crawls content, optionally through an interconnected Web proxy. The Web proxy can be used to increase performance and avoid duplicate crawling. Heritrix queues the URLs generated by the seeders and stores the crawled contents on the file server while generating detailed log files.

Malware Analysis — The scanner extracts and analyzes the ARC-files on the file system with different anti-virus solutions and malware analysis tools. Identified malware and malicious Web sites are stored in the malware attic directory.
Information regarding the malicious content is stored in the database. Furthermore, we copy every found binary and JavaScript file to an additional archive directory for further research purposes.

3.2. Implementation

One of our design principles is not to reinvent the wheel. Therefore we use several Free Software [82] packages which each do well on one particular subject and glue them together with the Python programming language [59]. As a result we only had to implement some minor parts as prototypes very fast, so that we could concentrate on the evaluation of the system and on dependable results. We use Linux as the implementation platform because Heritrix also only works on Linux. We first evaluated Java [25] as the programming language for our framework, because it is also used by Heritrix. We would thus be able to write our own modules and extend the functionality of Heritrix for our purpose. After some weeks of testing and evaluating, it became obvious that it was not possible to produce fast prototypes this way and that it would be very time-consuming to implement everything in Java. We then started to examine Python for the programming. We realized that this language is often used as a prototyping language and that many standard operations are less complex to implement than in Java, because Python has a very large and useful library [90]. The language comes with a large standard library that covers areas such as string processing (regular expressions, Unicode, calculating differences between files), Internet protocols (HTTP, FTP, SMTP, XML-RPC, POP, IMAP, CGI programming), software engineering (unit testing, logging, profiling, parsing Python code), and operating system interfaces (system calls, file systems, TCP/IP sockets). Two additional advantages left no doubt about using Python: one is the nearly perfect documentation of all aspects of the language and the good tutorials in the standard system documentation, the other is the number of available interfaces to common tasks on the Internet, like pYsearch [45]. With this approach it took only a few weeks to implement the major parts of the system.

We make use of several Free Software packages for our system, namely:

• The Python programming language [59], a dynamic object-oriented programming language.
• Heritrix, the Internet Archive's open source, extensible, Web-scale, archival-quality Web crawler project.
• Karrigell [30], a flexible Python Web framework.
• Clam anti-virus [6], a free anti-virus toolkit for Linux.
• Htmldata [20], a manipulation library for HTML/XHTML/CSS documents with Python.
• PyGoogle [43], a Python interface to the Google SOAP Search API.
• pYsearch [45], a Python interface for the Yahoo Search Web Services API.
• SOAPpy [60], a simple to use SOAP library for Python.
• PostgreSQL [42], a powerful, open source relational database system.
• PyGreSQL [44], a PostgreSQL module for Python.

3.3. Seed Generation

In the step of seed generation we try to find some starting points for our crawl. The seeds are stored in a plain text file named seeds.txt residing in each profile's directory of the Heritrix package. It contains one URL per line and nothing else; a minimal sketch of assembling such a seed file is shown below.
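The sketch combines the URL lists produced by the individual seeders into one seed file; the input file names are hypothetical, only the one-URL-per-line format is taken from the description above.

# combine the URL lists produced by the individual seeders (hypothetical file names),
# remove duplicates and write the Heritrix seed file: one URL per line, nothing else
seed_sources = ['websearch_urls.txt', 'spamtrap_urls.txt', 'hosts_urls.txt']

seeds = set()
for name in seed_sources:
    with open(name) as source:
        for line in source:
            url = line.strip()
            if url:
                seeds.add(url)

with open('seeds.txt', 'w') as out:
    for url in sorted(seeds):
        out.write(url + '\n')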
3.3.1. Web Search Seeding

We have used the Web Services APIs of the three popular search engines Google, Yahoo and MSN Search. With these interfaces we retrieve the URLs of the corresponding Web searches. We choose some topics like celebrity or games as starting points and search for typical terms used for these topics. For example, to populate search results for pirate Web sites we use crackz, serialz, warez etc. After that we collect all found URLs of one topic and use this list as seeds for our crawl. With this method, popular pages on the searched content can be found, because people either use a search engine as a starting point for surfing the Web or start from a previously known URL, and they mainly use the first results of the found sites. We can thus be fairly sure that the found URLs have the right content for the topic we are searching for. Another advantage of Web search based seeding is that the results always reflect changes over time. For example, if a page has been down for some time it won't be found in the first results any more, and if a new page with the right content has appeared recently it will be displayed in the first part of the search results. This allows us to ignore specific hosts for our initial searches and to concentrate on the search topics to get representative results in our evaluation.

To access the databases of the search engines we make use of Remote Procedure Calls (RPC) [85]. RPC is a technology that allows a program to execute a method on another computer system in a shared computer network. The search engines provide access to their RPC calls either with Simple Object Access Protocol (SOAP) [47] requests or utilizing Representational State Transfer (REST) [80]. SOAP is a protocol for exchanging XML-based [10] messages over computer networks. Listing 2 shows an example of a SOAP request in which the client asks for the product details for the ID 827635. The answer is also delivered in the XML format.

Listing 2 Example of a SOAP request

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <getProductDetails xmlns="http://warehouse.example.com/ws">
      <productID>827635</productID>
    </getProductDetails>
  </soap:Body>
</soap:Envelope>

REST is used as a way to access Web services using the URL as a carrier for information. Listing 3 shows two examples of Remote Procedure Calls and their REST counterparts.

Listing 3 Example of a REST application

RPC functions:
1. getUser(peter)
2. listLocations()

REST URLs:
1. http://www.example.com/user/peter
2. http://www.example.com/location/

Google offers two services to look up and compare the popularity of search terms. Google Suggest [15] shows similar search terms for a given search term together with the possible result count, as shown in Figure 9. Google Trends [16] shows statistics for specified search terms and is able to compare the popularity of search terms over time, as shown in Figure 10. We use Google Suggest to find similar search terms and Google Trends to compare the popularity of similar search terms regarding one topic.

Figure 9: Google suggestions for wallpaper

Figure 10: Comparison of search terms with Google Trends

Google SOAP Search API — When we implemented a seeder with the Google SOAP Search API [14], this was the only way to query Google Web searches. Meanwhile it became deprecated and was replaced by the Google AJAX Search API [13]. Because of this it is not possible to retrieve new API keys to access Google with the SOAP API anymore, but existing keys are still functional. It is a future task to reimplement the Google seeder with the Google AJAX Search API. Nevertheless we are using the SOAP API with our own API key.
With the Google SOAP Search API developers can issue search requests to Google's index of Web pages and receive results as structured data. The search is restricted to a maximum of 1000 results per day and 10 results per query. We have used PyGoogle [43], which is a Python implementation of the Google SOAP Search API. It provides the function doGoogleSearch(), which is called with the search term, the index of the first desired result, and the maximum number of results, and delivers title, URL, summary, hostname etc. in the result class SearchResult. We take the URL from the SearchResult class and add it to our seeds.txt file. We iterate doGoogleSearch() as often as we need results, since only a maximum of 10 results per query can be retrieved. The maximum number of iterations is 100, resulting from the daily restriction to 1000 search results.

doGoogleSearch(): Search Google using the SOAP API
q: search string
start: zero-based index of first desired result
maxResults: maximum number of results to return
filter: request filtering of similar results
restrict: restrict results by country or topic
safeSearch: enables filtering of adult content
language: restricts search to given languages
inputencoding: input encoding of given search term
outputencoding: encoding for the query result
license_key: the Google API license key to use
http_proxy: proxy to use for talking to Google

Table 3: Options of the doGoogleSearch() function

Listing 4 shows the complete query code.

Listing 4 Our query to Google

# import PyGoogle
import google

# open seed file
f = open('seeds.txt', 'w')

# do 100 queries with 10 results each
for j in range(0, 100):
    search = google.doGoogleSearch(license_key=_googleKey, q=_searchTerm,
                                   safeSearch=0, start=j*10)
    # write out each result URL, one per line
    for i in range(len(search.results)):
        a = search.results[i].URL
        f.write(a + '\n')
f.close()

Yahoo Search Web Services — We use pYsearch [45] for our Yahoo seeder, which is a Python implementation of the Yahoo Search Web Services API [65] utilizing REST-like requests. It offers an easy-to-use Web search class providing similar functionality and options as the Google SOAP Search API. Like the other APIs, this service is also restricted to 1000 results per day.

MSN Search Web Service — Like the Google SOAP Search API, the MSN SOAP Search API is no longer supported and has no reference documentation any more; it has been replaced by the Live Search API [32]. Nevertheless it still works. There is no Python interface to it, but we used the Python Web Services SOAP [60] implementation to access this service. The options and results are very similar to the Google SOAP Search API, except that the search is restricted to 250 results per day.

3.3.2. Spamtrap Seeding

Often malware is sent and spread utilizing spam. A spamtrap is an email account established specifically to collect spam mails. Such email addresses are propagated openly in different places on the Internet where spammers are likely to find them. Our idea is to collect spam in a spamtrap and then to extract all the URLs contained in the messages to use them as seeds for our crawl. Therefore we set up an email account intended to be a target for spammers. We can also use any other email address with Post Office Protocol v3 (POP3) [41] support. We then download the spam mails with the POP3 protocol and extract the URLs found in the messages with the htmldata library. The found URLs are used as seeds for our crawl. Currently, we do not examine attached files like Shelia does, but this could be a further task to be worked on.
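A minimal sketch of such a spamtrap seeder is shown below. The host name and credentials are placeholders, and a simple regular expression stands in for the URL extraction that is actually done with the htmldata library.

import poplib
import re

# placeholders -- not the real spamtrap account
HOST, USER, PASSWORD = 'pop3.example.com', 'spamtrap', 'secret'
URL_PATTERN = re.compile(r'https?://[^\s"\'<>]+')

def fetch_spamtrap_urls():
    """Download all messages from the spamtrap and return the URLs they contain."""
    mailbox = poplib.POP3(HOST)
    mailbox.user(USER)
    mailbox.pass_(PASSWORD)
    urls = set()
    count = len(mailbox.list()[1])
    for number in range(1, count + 1):
        lines = mailbox.retr(number)[1]
        body = b'\n'.join(lines).decode('utf-8', 'replace')
        urls.update(URL_PATTERN.findall(body))
    mailbox.quit()
    return sorted(urls)

if __name__ == '__main__':
    with open('seeds.txt', 'w') as seeds:
        for url in fetch_spamtrap_urls():
            seeds.write(url + '\n')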
3.3.3. Blacklist Seeding

A blacklist is an access control mechanism: a subset of a set of Web sites, the blacklist, is prevented from being accessed. A common strategy against generally known bad sites is to blacklist them in a hosts file, which is used to look up the IP address of a device connected to the Internet. These commonly listed hosts are generally known bad sites that host some of the most malicious content. There are some Web sites which collect such hosts in hosts files for the community, in order to protect users from malicious content and other unwanted parts of the Web. We have written a program to automatically download hosts files from some major hosts-file providers, like Mike's Ad Blocking Hosts file [36], strip the IPs contained in them, which usually redirect to the localhost, and use the host names as seeds for our crawls. It is obvious that these hosts have more malicious content than common Web sites, so they do not represent the Web as a whole, but they do represent the bad guys and bad companies who have built a kind of business model around these practices. Hosts files generate a high hit rate and deliver many malicious contents for our research.

Listing 5 Sample hosts file

127.0.0.1 localhost
127.0.0.1 www.realbannerads.com
127.0.0.1 www.virusguard.com
127.0.0.1 www.virusrescue.com
127.0.0.1 www.antispyguide.com

Listing 6 The hostsFileSeed.sh seeder script

#!/bin/sh
#### http://hostsfile.mine.nu
wget http://hostsfile.mine.nu.nyud.net:8080/Hosts.zip -O hosts.tmp.zip
unzip -f hosts.tmp.zip
cat Hosts >> hosts
rm Hosts hosts.tmp.zip
#### http://www.mvps.org/winhelp2002/hosts.htm
wget http://www.mvps.org/winhelp2002/hosts.txt -O hosts.tmp
cat hosts.tmp >> hosts
rm hosts.tmp
#### http://www.hostsfile.org/hosts.html
wget http://www.hostsfile.org/BadHosts.tar.gz
tar xfz BadHosts.tar.gz
cat BadHosts/hosts.lnx >> hosts
rm -rf BadHosts*
#### http://hphosts.mysteryfcm.co.uk/?s=Download
wget http://hphosts.mysteryfcm.co.uk/download/hosts.zip -O hosts.tmp.zip
unzip -f hosts.tmp.zip -d hoststmp
cat hoststmp/HOSTS.TXT >> hosts
rm -rf hoststmp hosts.tmp.zip
#### http://someonewhocares.org/hosts/
wget http://someonewhocares.org/hosts/hosts -O hosts.tmp
cat hosts.tmp >> hosts
#### http://www.everythingisnt.com/hosts.html
wget http://everythingisnt.com/hosts -O hosts.tmp
cat hosts.tmp >> hosts
#### filter and format the file
grep -v '#' hosts | sed '/^$/d' | awk '{print $2}' \
  | sed 's/^[ \t]*//;s/[ \t]*$//' | sed 's/\x0D$//' | sort -u > hosts.tmp
mv hosts.tmp seeds.txt

3.4. Web Crawling

Crawling is an important aspect because it is responsible for the scope and significance of our research. There are several questions which have to be answered sufficiently: Which sites do we crawl? Which content do we crawl? How do we extract links found in the crawled contents? How deep do we want to crawl? How should we store the crawled content? What about the crawling performance? Which statistics do we need? How important are client capabilities like cookies and the user-agent for the server behavior? In the following Sections, we describe our crawling strategy and the Web crawler we use for our research.

3.4.1. Heritrix - a Web Crawler

Heritrix is an archaic word for heiress, a woman who inherits. Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.
(The Heritrix Project)

After evaluating other free software available on the Internet, we concluded that only Heritrix meets the requirements of our research. For this purpose we mainly examined libcurl, wget and pavuk. All of these are meant to crawl Web sites but are not designed to dump the state of a Web server in its most unchanged form like an archiver would do. Some of them also lack recursive and parallel crawling, and all of them lack a link extraction and queueing mechanism. Heritrix is the Internet Archive's open source, extensible, Web-scale, archival-quality and easily customizable Web crawler project. For our development and evaluation we have used version 1.12.x of the stable release. Some of the features and reasons which were important for our work are the following.

Experienced Developers — The Internet Archive is a non-profit organization which was founded to build an Internet library. This library should be able to offer permanent access for researchers, historians and scholars to historical collections of the Internet. It was founded in 1996 by Brewster Kahle in San Francisco, California. They have received data from Alexa Internet and others. The actual collection includes texts, audio, movies and software as well as archived Web pages starting from 1996. The Internet Archive has built the largest public Web archive to date. While the Internet Archive gets its data from others, they decided to start a new Web crawler project with which they could do additional crawls of specific Web sites or for specific purposes, among others for commercial uses. Having started archiving in 1996, the Internet Archive gained great experience in Web archiving and crawling, and Heritrix is built out of this expertise. With this expertise we do not need to concentrate on the crawling technique, the link extraction and other details but only on the fine-tuning of the crawls for our purpose.

Policies for Different Requirements — Crawling policies are important on the one hand to improve the performance of the overall crawl and on the other hand not to overload the crawled hosts. They are used to avoid multiple crawls of the same content and to coordinate the crawler threads. Heritrix supports different policies which can be configured in detail:

• A selection policy that states which pages to download
• A re-visit policy that states when to check for changes to the pages
• A politeness policy that states how to avoid overloading visited Web sites

Free Software — Heritrix is Free Software, so we can use it for our own infrastructure, which is also Free Software. It is constantly evolving and being extended; this gives our infrastructure the flexibility to adopt newer versions of Heritrix without the need for adaptation, because we always use the same interface to Heritrix and get the same ARC-files as output which we process. Newer versions will have more and better features. Through its consistent modularity and open API, it is very easy to extend or rewrite the functionality of Heritrix.

Support for Use Cases — The Heritrix crawler is suitable for many use cases which have different requirements on the configuration:

• Broad crawling: Broad crawls are large-scale and high-bandwidth crawls. Every queued host is recursively crawled. Every discovered link is queued except links failing two parameters. The first is max-link-hops: URLs more than this number of links from a seed will not be ruled in-scope.
The second is max-transitive-hops: URLs reached by more than this number of transitive hops will not be ruled in-scope, even if otherwise on an in-focus site. A broad crawl tries to get as many Web sites as possible within the given resources like time to crawl, available bandwidth and available storage.
• Focused crawling: Focused crawls are crawls with a complete coverage of a selected topic or site. The amount of crawled data depends on the crawled sites.
• Continuous crawling: Continuous crawling re-crawls already crawled content over time to monitor changes between previously determined periods.
• Experimental crawling: Heritrix can be used to experiment with different crawling strategies and to evaluate, for example, different protocols, crawling orders and archiving methods.

Logging — Heritrix provides several log files per crawl. The location of the files can be customized. The generated files are:

• crawl.log: The main log for all crawls. For each tried URL, successful or not, a line is added to crawl.log (a sketch of how such a line can be parsed follows this list) with
1. a logging time stamp
2. the fetch status code
3. the size of the downloaded document
4. the URL
5. a bread crumb of the discovery path for the current URL
6. the last actual referrer
7. the mime type
8. the ID of the worker thread that downloaded this document
9. a time stamp indicating when a network fetch was begun and eventually a duration
10. a SHA1 digest of the content
11. the source tag inherited by this URL
12. annotations, if any have been set, like the number of times the URL was tried
• local-errors.log: network-related problems while trying to fetch a document; it can provide insight to advanced users when other logs and/or reports contain unusual data
• progress-statistics.log: At configurable intervals a line about the progress of the crawl is written to this file, containing the following information:
time stamp: Time stamp indicating when the line was written, in ISO8601 format.
discovered: Number of URLs discovered to date.
queued: Number of URLs queued at the moment.
downloaded: Number of URLs downloaded to date.
doc/s(avg): Number of documents downloaded per second since the last snapshot. In parentheses since the crawl began.
KB/s(avg): Amount in kilobytes downloaded per second since the last snapshot. In parentheses since the crawl began.
dl-failures: Number of URLs that Heritrix has failed to download to date.
busy-thread: Number of threads currently busy processing a URL.
mem-use-KB: Amount of memory currently assigned to the Java Virtual Machine.
• runtime-errors.log: captures unexpected exceptions and errors that occur during the crawl; these may be due to hardware limitations or, more often, software bugs
• uri-errors.log: logs errors in dealing with encountered URLs, usually caused by erroneous URLs. Generally it is only of interest to advanced users trying to explain unexpected crawl behavior.
• recover.gz: a gzipped journal of all events. It can be used to restore the detection path of a crawled content.
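As an illustration, the following sketch splits a crawl.log line into the fields listed above. It assumes the fields are whitespace-separated and appear in exactly that order, and the example line is made up.

# field names as documented above for crawl.log
FIELDS = ['timestamp', 'status', 'size', 'url', 'discovery_path', 'referrer',
          'mime_type', 'thread', 'fetch_time', 'digest', 'source_tag', 'annotations']

def parse_crawl_log_line(line):
    """Map one whitespace-separated crawl.log line onto the documented field names."""
    values = line.split(None, len(FIELDS) - 1)
    return dict(zip(FIELDS, values))

# made-up example line
example = ('2007-03-07T18:25:51.123Z 200 9551 http://test.org/test/ LLL '
           'http://test.org/ text/html #042 20070307182550+120 '
           'sha1:EASN4AF6NPYDXFK563WJKVFZLVE5COMN - -')
print(parse_crawl_log_line(example)['url'])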
Multi-Threaded Design — A multi-threaded design grants the optimal usage of the resources, also on multiprocessor systems.

URL Normalization — Syntactically different URLs can represent the same content on the Web. To avoid overhead caused by multiple downloads of the same resources, we have to use several URL normalization techniques. Heritrix has a dedicated module for this task and implements several well-known and widely used URL normalization techniques. The package which implements the normalization is called org.archive.crawler.url.canonicalize. Every canonicalization module has to inherit the abstract org.archive.crawler.url.canonicalize.BaseRule class, and every module can be added to a crawl. The currently available canonicalization rules inheriting from BaseRule are:

• FixupQueryStr: Strip any trailing question mark
• LowercaseRule: Lowercases the URL
• RegexRule: General conversion rule
• StripExtraSlashes: Strip any extra slashes found in the path
• StripSessionCFIDs: Strip ColdFusion session IDs
• StripSessionIDs: Strip known session IDs
• StripUserinfoRule: Strip any 'userinfo' found on http/https URLs
• StripWWWRule: Strip any www found on http/https URLs if they have some path/query component (content after the third slash). Top slash page URLs are left unstripped, so that Heritrix prefers crawling redundant top pages to missing an entire site that is only available from either the www-full or the www-less host name, but not both.

Link Extraction — Intelligent link extraction modules for various file and request types are a major advantage of Heritrix because they guarantee a maximum coverage of the Web content. For every crawled content, and depending on the mime type, Heritrix executes the corresponding extractor modules and adds the extracted URLs to the queue. The available extractor modules are listed in Table 4.

Advanced Web Interface — Heritrix is administered via an easy-to-use Web-based user interface. Crawl jobs can be configured, reconfigured, queued, started, paused, and stopped. The UI displays the status of the currently running crawl job, listing vitals such as documents crawled, documents pending, megabytes downloaded, download rate, and critical exceptions, as shown in Figure 11. Crawl job reports and logs, both for the currently running job and for jobs run in the past, can be accessed via the UI.

JMX Interface — Heritrix also features a Java Management Extensions interface [29] which allows us to remotely control its behavior over our own Web interface. It does not yet offer the full functionality provided by the Heritrix Web interface, but the main control operations are already available.

ExtractorCSS: parses URLs from CSS type files [7]
Extractordoc: extracts href style links from Word documents
ExtractorHTML: basic link extraction from an HTML content-body, using regular expressions
ExtractorHTTP: extracts URLs from HTTP response headers
ExtractorImpliedURI: finds implied URLs inside other URLs
ExtractorJS: processes JavaScript files for strings that are likely to be crawlable URLs [27]
ExtractorPDF: processes PDF files [40] for the purpose of extracting URLs
ExtractorSWF: extracts URLs from Macromedia Flash files [33]
ExtractorUniversal: looks at the raw byte code and tries to extract anything that looks like a link
ExtractorURI: finds URLs inside other URLs
ExtractorXML: a simple extractor which finds HTTP URLs inside XML/RSS files
JerichoExtractorHTML: improved link extraction from an HTML content-body using the Jericho HTML Parser [28]

Table 4: Extractor modules contained in Heritrix

3.4.2. Configuring Crawl Jobs

Heritrix has a mature job management with jobs and profiles. For every desired crawl a job has to be generated. Jobs can be generated from scratch or be based on prior existing jobs or preconfigured profiles. Profiles can be preconfigured for every kind of crawl job. Each job configuration and run consists of the following steps:

1. Create a job: creation out of an existing profile or job
2. Configure the job: configuration of which modules are involved and of the options of the involved modules
3. Run the job

3.4.3. Heritrix Architecture

In this Section, we give an overview of the Heritrix architecture, describing the general operation and the key components of the software. Heritrix has two configuration interfaces: the main and fully functional configuration interface is the Web interface, and the second, not yet fully functional interface is the JMX interface.

Figure 11: The Heritrix Web interface console running a job

Heritrix is modular. The modules to be used can be chosen and configured at runtime over the Web interface. The consistent modular design allows us to write our own modules as a replacement for or an extension to the Heritrix crawler. There is also a developer's guide which introduces developers to writing their own modules.

Heritrix consists of some core classes and pluggable modules. Core classes can be configured but not replaced, while pluggable classes can be substituted. There exists a set of basic pluggable classes which are part of the Heritrix package. A CrawlController class holds all classes cooperating to do a crawl and provides an interface to the running crawl; it corresponds to the master thread. Figure 12 shows the scheme of the CrawlController. The CrawlController passes the URLs from the Frontier to the ToeThreads. The Frontier is responsible for determining the next URL to process and implements the politeness policy defined in Section 3.4.1 to avoid overloading Web servers. If a given URL complies with the politeness policy, it is delivered to a requesting ToeThread. Heritrix is multi-threaded, and each URL to be processed is handled by one thread, called a ToeThread. A ToeThread requests a URL from the Frontier, sends it through all the processors and then requests a new URL. The processors are grouped in processor chains: the pre-fetch processors, the fetch processors, the extractor processors, the write/index processors and the post-processors. The processor chain works through the following steps for one URL:

• The URL is checked whether it is really in scope.
• The URL is fetched with the corresponding protocol.
• The extractor processors try to discover new URLs in the fetched content.
• The fetched content is indexed and written to disk.
• The state of the crawl is updated.
• The found URLs are checked whether they are in scope.
• URLs which are found to be in scope are rescheduled to the Frontier.

The whole crawl job ends when there is no URL left in the Frontier and no ToeThread delivers a URL. The resulting output of a terminated job are the files generated by the Writer processors and the logging information described in Section 3.4.1.

Figure 12: Simple CrawlController scheme [19]

3.4.4. The ARC File-Format

Heritrix has different so-called Writer processors to write the crawled content to disk. These Writer processors are designed for different purposes. The currently implemented Writer processors are:

• MirrorWriterProcessor: Processor module that writes the results of successful fetches to files on disk. It writes the contents of one URL to one file on disk. The files are arranged in a directory hierarchy based on the URL paths, so hosts are mirrored like the file hierarchy that might exist on the servers.
• ARCWriterProcessor: Processor module for writing the results of successful fetches to aggregated files in the Internet Archive ARC file-format [54].
• ExperimentalWARCWriterProcessor: Experimental processor module for writing the results of successful fetches to the Web ARChive file format [64], which will be the successor of the ARC file-format.

The ARC file-format is a format that aggregates many files into one, to ease and optimize file storage and to increase the overall file system performance. It was designed to meet the following requirements:

• Self-containment: The file contains all information regarding the aggregated objects needed to identify and extract them. No other indexes or the like should be necessary.
• Multi-protocol support: Many high-level protocols other than HTTP, like DNS, FTP and HTTPS, are supported.
• Streaming: ARC-files are streamable, since every content block consists of its header with information and the content itself. Every aggregated object can be streamed without the need for other parts of the ARC-file than the actual position where the object resides.
• Viability: Each record should be viable. After a record has been written, its integrity should not depend on additional, later generated indexes or the like.

Listing 7 shows an example record of an ARC-file. The first part contains all information regarding a crawled content, like the protocol, URL, timestamp, request header etc.; it is in plaintext and is separated from the second part by a blank line. The second part holds the crawled content in binary format. Every crawled content is aggregated in such ARC-files of a fixed size (we are currently using 100 MB).

Listing 7 Example of an ARC-file record; the first part holds the information, the second part the content. Each ARC-file usually contains hundreds or thousands of records depending on the aggregated file sizes

http://test.org/test/ 12.3.4.5 20070322153737 text/html 9551
HTTP/1.1 200 OK
Date: Thu, 22 Mar 2007 13:37:14 GMT
Server: Apache/1.3.37 (Unix) mod_perl/1.29
Connection: close
Content-Type: text/html; charset=utf-8

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<body>
Hello World!
</body>
</html>

Every record is atomic and thus totally independent from any other parts of the ARC-file or any other information stored elsewhere in the crawl. The ARCWriterProcessor usually writes the crawled contents to a bunch of ARC-files at a time. ARC-files that are currently being written have an additional .open suffix; those without such a suffix are completed. The produced ARC-files reside in the arcs directory of the crawl and are named as follows:

[prefix]-[12-digit-timestamp]-[series#-padded-to-5-digits]-[crawler-hostname].arc.gz

For example: Monkey-Spider-20070307182551-00422-spongbob.arc.gz

3.4.5. The CDX File-Format

The CDX file-format is internally used by the Internet Archive but not officially standardized. It is a plain text file holding all the information contained in an ARC-file without the contents of the crawled files. It is used as an index to an ARC-file and is generated with the arcreader tool contained in Heritrix. The following invocation generates a CDX-file for the given ARC-file:

$> arcreader --format cdxfile Monkey-Spider-20070307182551-00422-spongbob.arc.gz

Listing 8 shows an example of a CDX-file. Every line corresponds to a record in the ARC-file.

Listing 8 Example contents of a CDX-file with one contained file (trimmed)

CDX b e a m s c V n g
20070307122203 0.0.0.0 filedesc://test.arc text/plain - HWP[..]U6N 0 1280 test
20070307122202 66.35.250.203 http://sf.net text/html - V24[..]H57 694 10381 test

The contained information is the same as in the first block of the ARC-file records.
Table 5 shows the meaning of the columns of the CDX-file.

b: timestamp
e: IP
a: URL
m: mimetype
s: -
c: md5sum
V: file offset
n: content length
g: ARC-file

Table 5: Contents of the CDX file-format

3.5. Malware Detection

This part is the main operation field for the research done on the crawled content. While the crawling is still running, the crawler constantly generates compressed ARC-files, which we analyze in the meantime. The ScanFolder part works step by step through each ARC-file. After the processing, each of the ARC-files is deleted or recompressed.

3.5.1. ScanFolder

The Heritrix crawler generates a folder for each generated crawl job, containing the configuration, the log files, the runtime files, and a folder with the corresponding ARC-files. The ScanFolder program is a command line program which delegates the scanning work done on a folder with ARC-files. Therefore it creates a queue of already closed ARC-files and performs extraction and malware scanning on each of them, as shown in Figure 13.

Figure 13: Operation of the ScanFolder program

3.5.2. Extractor

The Extractor is a command line program named ExtractArcfile which operates on exactly one ARC-file at a time. Using the arcreader program of the Heritrix package, we create a CDX-file containing all the relevant information regarding the ARC-file. Before processing, the file has to be unpacked with gunzip [11]. After that the Extractor dumps the files contained in that ARC-file to one likewise named directory, ignoring the original folder index and filenames of the contained files. It uses the checksum already contained in the CDX-file as the filename, and thus obtains an approximately unique filename due to the non-collision requirement of hash functions. Furthermore, it uses the MIME [38] type, also contained in the CDX-file, to generate a filename extension utilizing the mimetypes.guess_extension() function contained in the Python library (a small sketch of this mapping is given at the end of this Section). Listing 9 shows some generated filenames.

Listing 9 A sample of extracted files

EASN4AF6NPYDXFK563WJKVFZLVE5COMN.html
OZK6OENXC4AENH6JG5RWY7PCSUFQNSXP.js
ZQYZF2FDRD3BDQ6VNLWPHYUHLKT6PGYI.swf
EATYUELC2PU7YWEIF3U23SFCZ2PHKRE4.gif
OZUR7BGJXKP6AHGDMAQGHPH5B6PBINIZ.exe
ZRBXC2IWGFK456ADC5MVZT3DFPFLFSCJ.jpe

The files are then analyzed with the malware scanner. After the analysis the directory is removed and the ARC-files are compressed again. The whole procedure of the ExtractArcfile program is shown in Figure 14.

Figure 14: Operation of the ExtractArcfile program

Performance Measures for the Extractor Program — In the beginning, we used the arcreader program to extract the contents of an ARC-file and found that it performed very badly on more than a dozen files. We realized that it would be unbearable to use this program for the extraction, because the extraction would take much more time than the analysis of the files, which is the most time-consuming part of the Monkey-Spider system. We searched for other extraction tools on the Internet and did not find anything suitable, so we decided to program this ourselves. We called our program ExtractArcfile2ndGen. Comparing the runtimes of both extractor programs shows a clear acceleration of the extraction process: ExtractArcfile2ndGen was about 50 to 70 times faster than the initial ExtractArcfile program we had used before.
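To illustrate the extraction step, the following is a minimal sketch of how a CDX line can be mapped onto the fields of Table 5 and how a file name can be derived from the checksum and the MIME type with mimetypes.guess_extension(). The field order is taken from Table 5; the fallback extension and the example line are assumptions.

import mimetypes

# column order as given in Table 5
CDX_FIELDS = ['timestamp', 'ip', 'url', 'mimetype', 'status',
              'checksum', 'offset', 'length', 'arcfile']

def record_filename(cdx_line):
    """Build an output file name from the checksum and the guessed extension."""
    record = dict(zip(CDX_FIELDS, cdx_line.split()))
    extension = mimetypes.guess_extension(record['mimetype']) or '.bin'
    return record['checksum'] + extension

# made-up example CDX line (shortened checksum)
line = '20070307122202 66.35.250.203 http://sf.net text/html - V24H57 694 10381 test'
print(record_filename(line))   # e.g. V24H57.html, depending on the platform's MIME map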
3.5.3. Malware Scanner

With our approach of high-bandwidth and broad-scope crawling, we have observed our particular crawler (described in Chapter 4) to crawl about 30 URLs/sec. With such bandwidth we need fast and reliable tools for automatic malware analysis, because the analysis seems to be the most time-consuming part of our processing chain. Depending on the status of existing malware, our analysis can be separated into two stages. The first is the detection of known malware with common malware scanners like ClamAV, and the second is the analysis of suspicious binaries with automatic malware analysis tools like the CWSandbox. Furthermore, the Monkey-Spider system can be extended with any other automated analysis tool.

3.5.4. ClamAV

All extracted files are examined with ClamAV. We do not use a Python interface for ClamAV, in order to reduce our system dependencies. The version information and the output of ClamAV are dumped to a file called clamav.report, which is then parsed with Python as sketched below. An example report file with detected malware is shown in Listing 10.

Listing 10 Example of a ClamAV report file

ClamAV 0.90.2-exp/3236/Sun May 13 09:23:27 2007
N2H5UALHLJ3YZ6D27ZPOKYJFLWFUNQCS.zip: Trojan.W32.HotKeysHook.A2 FOUND
3UXDRDBAO3UXZ3GT3MKIWBF77XH5MRKC.html: HTML.Phishing.Pay-6 FOUND
ES3RAP26G2C2WCUWAZI4JHPLUTTE2JZE.html: Trojan.Downloader-2388 FOUND

The relevant information for detected malware plus the version information is then stored in the database. The malicious code is stored in a separate attic directory for further studies. The analysis directory is removed and the ARC-file is either removed or repacked, depending on the purpose of the crawl. We use the fast option of gzip for repacking to reduce the time consumed. Figure 15 shows the operation of the ClamAV program.

Figure 15: Operation of the ClamAV program
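The following is a minimal sketch of parsing such a report, assuming the line layout shown in Listing 10 (a version header followed by "filename: signature FOUND" lines); the database step is only hinted at.

def parse_clamav_report(path):
    """Return the scanner version line and a list of (filename, signature) pairs."""
    detections = []
    with open(path) as report:
        version = report.readline().strip()          # e.g. "ClamAV 0.90.2-exp/3236/..."
        for line in report:
            line = line.strip()
            if not line.endswith(' FOUND'):
                continue
            filename, rest = line.split(': ', 1)
            signature = rest[:-len(' FOUND')]
            detections.append((filename, signature))
    return version, detections

version, detections = parse_clamav_report('clamav.report')
for filename, signature in detections:
    # here the detection would be written to the malware database
    print(version, filename, signature)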
3.5.5. CWSandbox

The CWSandbox [8] is a tool to automatically analyze malware based on behavior analysis. It is also developed at our chair, by Carsten Willems [71]. Suspicious code is executed for a finite time in a simulated environment (the CWSandbox) where all system calls are monitored and analyzed. The CWSandbox then generates a detailed report of the behavior of the code. The CWSandbox uses three strategies to fulfill its three criteria:

• Dynamic analysis means that malware is analyzed by executing it within a simulated environment, to achieve the automation of the analysis.
• API hooking means that calls to the Windows API are re-routed to the monitoring software before the actual API code is called, which grants the effectiveness of the overall process.
• DLL code injection allows API hooking to be implemented while granting confidence in the implementation and the correctness of the reported results.

Utilizing these three strategies, the CWSandbox automatically generates reports describing, for example, the following behavior of the analyzed binary: file creation and modification, Windows Registry modification, DLL utilization, memory access, process creation, and network connections. We are implementing a program to analyze all crawled binaries with the CWSandbox and to synchronize the results with our database, but it is not finished yet.

3.6. Web Interface

The Monkey-Spider system is designed to be completely controlled over a Web interface. For this we have used the Karrigell Web framework [30]. Because our Web interface is not fully functional yet, the following specification partly describes a target state. User authentication is used for access control, but the communication is not encrypted. The interface works job-based like Heritrix and uses the jobs created by Heritrix for further processing and storage. Our interface is split into the following main pages.

Status-Page — The status-page, shown in Figure 16, is an overview of the whole system. It shows information without any other functionality. The shown information is the current seed file, the status of the crawler, the status of the malware scanner, and the status of the database.

Figure 16: The Monkey-Spider status-page

Seeding-Page — The seeding-page, shown in Figure 17, is the place where the seeder modules can be used to generate seeds for the crawl. Seeds can be created using queries to Web search engines with different properties. A spamtrap can be downloaded through POP3. Additionally, URLs can also be added manually. It can be specified whether the old seed file should be overwritten or not. Figure 17 shows the generation of a Web search based seed on the seeding-page of the Monkey-Spider Web interface.

Figure 17: The Monkey-Spider seeding page

Crawling-Page — The crawling-page, shown in Figure 18, is used to control Heritrix and to configure and control crawls. The crawls are normal Heritrix crawl jobs. Crawl jobs can be generated using the previously generated seeds. Jobs can be started, and running jobs can be paused. The crawling-page shows pending, running and completed crawl jobs and is able to abort them as well.

Figure 18: The Monkey-Spider crawling page

Scanning- and Database-Page — The scanning- and the database-pages are not functional yet, but the scanning-page will provide the possibility to scan a crawl job with different malware scanners, and the results will be accessible through the database-page, which provides a Web interface to the database.

3.7. Malware Database

All our results except the binary files are stored in a database. As our main database we use PostgreSQL, but there was no special requirement for the database other than being free software, SQL-based and performant. Any other database having the same qualities and a Python interface could be used. We have generated three tables for our database scheme; a sketch of corresponding table definitions follows the list below. The malware table holds all unique found malware files. The mw_scanner table holds all specific malware scanner versions with all major and minor versions and update dates. The mw_output table holds all descriptions of found malware samples, associating one particular malware with one particular malware scanner version.

• malware: Holds a unique malicious content
  id: A unique malware-ID number
  filename: The filename of the malicious content
  url: The URL where the content was located
  checksum: The generated checksum
  size: File size
  date: Crawl date
  crawlname: The name of the crawl job which crawled the file
  comment: Placeholder for any additional information
• mw_scanner: Holds a unique malware scanner plus version info
  id: A unique malware scanner-ID number
  name: The name of the malware scanner
  engine_ver: The version of the malware scanner
  signature_ver: The version of the malware signatures
  lastupdate: The date of the last update of the signature database
• mw_output: Holds a description of a specific malware sample, providing:
  mw_id: Foreign key to malware.id
  mw_sc_id: Foreign key to mw_scanner.id
  description: The name of the malware according to the malware scanner
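The following is a minimal sketch of how these three tables could be created through the classic PyGreSQL interface; the column types and the database name are assumptions, only the table and column names are taken from the scheme above.

import pg  # classic PyGreSQL interface

# database name is an assumption
db = pg.DB(dbname='monkeyspider')

db.query("""CREATE TABLE mw_scanner (
    id            serial PRIMARY KEY,
    name          text,
    engine_ver    text,
    signature_ver text,
    lastupdate    date)""")

db.query("""CREATE TABLE malware (
    id        serial PRIMARY KEY,
    filename  text,
    url       text,
    checksum  text,
    size      integer,
    date      timestamp,
    crawlname text,
    comment   text)""")

db.query("""CREATE TABLE mw_output (
    mw_id       integer REFERENCES malware(id),
    mw_sc_id    integer REFERENCES mw_scanner(id),
    description text)""")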
3.8. Limitations

The World Wide Web is a system of interlinked hypertext documents that runs over the Internet. We regard the whole World Wide Web as the set of content available on computer systems with a unique IP address at a fixed point in time, and we call this set the Web. The World Wide Web is not synonymous with the Internet, but we use it that way throughout this work because it is commonly used this way. The first drawback for every research on this set is that it changes significantly over time and cannot be captured in any database or the like. Every machine connected to the Internet has (or should have) a unique IP address or is assumed to have one. Not every machine connected to the Internet has to have a DNS name or anything else that is usual for a World Wide Web server, like a running Web server on port 80, or even has to reply to ping requests, and it may only be reachable for a small period of time. These wobbling nodes can have malicious content and serve it. We want to access as much of this set as possible at different times, because it is not possible to dump the Web at a particular point in time. Therefore we crawl content, extract as many links as possible from this content, and try to follow them as far as useful.

3.8.1. Avoid Detection and Blacklisting

Besides not following robots.txt files, which are a restriction mechanism for Web crawlers, malicious Web site operators could prevent honeyclients from accessing their sites by utilizing blacklists. It is therefore very important that the IP address ranges of research honeyclients are not revealed to the public. Thus neither the operator nor the location of such honeyclients should be published. Additionally, future malware could use vulnerabilities in the utilized malware analysis tools to report back the honeyclient's IP ranges or even detect the type of the honeyclient. Analogous to the blacklisting of dangerous Web sites by the Internet community, such a blacklisting of dangerous Web clients could become common among malicious Web site operators.

3.8.2. Hidden Web

Another major drawback is that we crawl all contents based on URLs which we have either started with or which we have discovered automatically somewhere. A crawler automatically traverses the Web graph and retrieves pages. Normally it only gets a portion of the Web called the publicly indexable Web [91]. But main parts of the Web belong to the so-called Hidden Web, also sometimes referred to as the Deep Web. The Hidden Web mostly consists of form-based database interfaces and/or authentication-based information which has no static URL linked anywhere. We do not process forms or authentication mechanisms on Web sites and thus cannot access this content. According to Bergman [70], the public information on the Hidden Web was 400 to 550 times larger than the publicly indexable Web in 2001.

3.8.3. Dynamically Created Content

Code obfuscation is one of the methods that malware writers use to hide their intentions from researchers [106]. In Listing 13, an example of code obfuscation is shown. The code was downloaded from [26] and uses two obfuscation techniques. The goal is to confuse human analyzers and to avoid being automatically analyzed. The first technique is a very easy and common technique of mapping the program code to a binary sequence with standard reversible escape sequences. The unescape function easily decrypts the program code, as shown in the second block. The second obfuscation technique is a non-linear mapping based on a pseudorandom string t. Depending on the input stream, an output stream is generated.
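The first technique can be reversed offline without executing the script. The following is a minimal sketch, using a made-up escape-encoded snippet and Python's urllib.parse.unquote as a stand-in for the JavaScript unescape function:

from urllib.parse import unquote

# made-up example of escape-encoded script code (not taken from the sample in Listing 13)
encoded = '%64%6f%63%75%6d%65%6e%74%2e%77%72%69%74%65%28%27%68%69%27%29'

# unquote reverses the %XX escape sequences just like JavaScript's unescape
print(unquote(encoded))   # -> document.write('hi')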
Even though we can extract links from JavaScript code, we cannot detect dynamically generated links and content or obfuscated links in JavaScript code, because we have no JavaScript interpreter and do not execute scripts during the crawl, and as a result we do not crawl the generated links.

3.8.4. Independent Crawls

In general, Web crawlers do not have a synchronization between two or more different crawls to avoid duplicate data or crawling the same content more than once. The same problem is apparent with our infrastructure, because our crawler does not support this either. So it is indeed the case that in our evaluation we crawl and analyze partly the same content over time and for different topics. This could make a difference for the results, since top-ranked sites like wikipedia.com or amazon.com are likely to be crawled because they are often linked on many sites.

3.8.5. HTTP Headers, Cookies and other Client Information

Depending on the Web server and the Web site programmer, Web sites often change their responses based on the client information they are able to retrieve. Some of the methods of retrieving client information are HTTP request headers, HTTP cookies, Flash cookies, and client-side scripts like JavaScript. We have observed a malicious JavaScript script (http://www.mt-download.com/mtrslib2.js), probably exploiting browser security, which could not be downloaded directly or executed with any browser other than Internet Explorer. Because of similar behavior we are probably missing some malicious content.

3.8.6. Peer-to-Peer Networks

The classic client-server network architecture is one where many client nodes connect to a few server nodes to request content, and the server delivers/serves the requested content to the clients. In this architecture the clients need not be connected to each other, and the data flows only from one client to one server and vice versa. Our approach operates as a client and requests content from Internet servers; other Internet clients remain unaffected. Peer-to-peer networks are decentralized networks which often do not distinguish between two classes of networking nodes; they only have peers, which are connected with each other either directly or indirectly over other peers. Content indexes and content distribution go over other peers. This network architecture can be used for almost any type of service, from file sharing to Web serving. But the vast majority of users today use peer-to-peer file-sharing networks like Gnutella [12], Kazaa [31], and Freenet [56]. Peer-to-peer file-sharing networks enjoy ever-growing popularity. These networks are often used for file sharing and therefore are predestined to spread malware. Recent research [89, 98] on malware in peer-to-peer networks has shown that often more than fifty percent of the examined content turned out to be malicious code. Because we only crawl content which is linked somewhere and can be requested with the HTTP protocol, our crawler is not able to retrieve files from any peer-to-peer network whose content is not directly linked or indexed somewhere as an HTTP-requestable URL. Thus this type of network is out of our research scope.

4. Evaluation

We have evaluated our system over a period of three months from February till April 2007. In the beginning of February we did an initial test crawl with an adult-based topic search; this crawl will be further denoted as the first crawl. After that we did several topic-based and hosts-file-based crawls in March and April.
We have used ClamAV as the malware detector for our evaluation system, since this was the only fully functional module at the time of our analysis. In contrast to other projects like the UW Spycrawler [93], we have scanned every crawled content regardless of its type. This procedure assures that we do not overlook malicious content, but it creates a performance trade-off because unnecessary content is scanned as well. In contrast to other anti-virus solutions, the ClamAV anti-virus scanner has additional detection capabilities for phishing and Trojan-downloading Web sites. Additionally, we have observed that some portion of executables, possibly malicious ones, are neither marked as executables in their mimetype specifications nor have a corresponding file extension, e.g. .exe or .msi. Because of these advantages we decided to accept the slower performance of the system and scan all crawled content. The evaluation system was one standard PC utilizing a high-bandwidth Internet connection at the university.

4.1. Test Environment

We have set up a PC with an Intel Pentium 4 CPU at 3.00 GHz with 2048 KB L2 cache. It is a standard hyperthreading-enabled processor, and the machine is equipped with 2 GB RAM. The system has three hard disks: one with 80 GB and two with 750 GB capacity. The 80 GB disk is used for the operating system. For the huge storage needs of the crawls and analysis, the two 750 GB disks were put together as a logical volume with 1.5 TB. The operating system is a standard Ubuntu [62] version 6.10 installation with the additional programs needed for the Monkey-Spider. Like most Linux systems, the Ubuntu system already contains the Python interpreter. To install the Monkey-Spider system we take the following steps: installing and configuring the PostgreSQL database server, the Sun Java Runtime Environment, the Heritrix Web crawler, the Karrigell Web framework, the Clam anti-virus scanner in its most recent version, the PyGoogle Python package, the pYsearch Python package, the Python Web Services package, and the PyGreSQL Python package. Finally, the Monkey-Spider program itself is installed and configured. After starting Karrigell, Monkey-Spider is ready to be accessed over its Web interface at http://localhost:8081. It has to be noted that this communication is password protected but not encrypted, so any system in between could gather the password and other sensitive information.

4.2. Crawls

We have initiated broad-scope crawls based on topic-specific Web searches and on aggregated hosts files. Being inexperienced with Web crawling, we specified huge values for our broad-scope crawls: the max-link-hops value was 100 and the max-trans-hops value was 30. With this scope (a theoretical depth of 100 link hops, on the order of 10^59 reachable domains, and unlimited links within a domain) the crawls could reach major parts of the Web, even though there are not 10^59 domains on the Net and some pages are not linked on any other Web site. Such crawls therefore never run to completion, and we ended them manually. The effect of this can clearly be seen in the mimetype distribution of the complete crawls in Figure 19.

Figure 19: Filetype distribution

More than half of the downloaded files are Web pages, followed by images. If the hosts had been crawled completely, the share of the other file types would have been higher. Another interesting observation is the third rank of the no-type category, indicating a lack of mimetype information in the response.
This is an additional endorsement for our approach of scanning all downloaded files for malware rather than only considering executable files, e.g. content of type application/octet-stream.

A First Test Crawl

We did a first test crawl based on adult search topics in February 2007 with a starting seed of 2205 URLs. We did this to revise our settings and optimize the crawler performance. During our crawl we observed some performance bottlenecks depending on the configuration of the crawls. The count of Toe-Threads and the total amount of maximum usable memory had to be adjusted. The Toe-Threads are the active crawl threads and are responsible for checking, crawling, and analyzing a URL. The initial setting for this option was 250. This value can be sufficient with about 250 MB of usable main memory, as occupied by the Java Virtual Machine (JVM). We adjusted the possible memory load for the JVM to 1500 MB and found 600 Toe-Threads to be a good trade-off for our configuration. We achieved an average download rate of about 1 MB/sec and crawled 26 documents/sec. The Extractor modules work relatively efficiently. Table 6 shows the extraction efficiency for the initial crawl.

module               handled objects   extracted links   links/object
ExtractorHTTP                6487350            735538           0.11
ExtractorHTML                4132168         345231030          83.55
ExtractorCSS                   19881             88307           4.44
ExtractorJS                    21658            162291           7.49
ExtractorSWF                   17921             11117           0.62
ExtractorXML                   35165           1638260          46.59
ExtractorURL                 6506489           6776544           1.04
ExtractorUniversal           1901858           1205772           0.63

Table 6: Extractor modules extraction efficiency

It has to be kept in mind that only discovered links which pass certain requirements are queued for crawling. These requirements include, among others, whether a URL is already queued or downloaded, whether a URL is still unique after normalization, and whether a URL is in the scope of the crawl. We observed some interesting aspects during our first extensive test crawl, like discovering mainly one type of malware, Trojan.Downloader.HTML.Codebase, as shown in Table 7, which lists the types and counts of the malware found during our first crawl.

name                              occurrence   name                       occurrence
Trojan.Downloader.HTML.Codebase         3634   Worm.Sobig.E                        8
Trojan.Downloader-2388                   224   W32.Elkern.C                        8
Worm.Yaha.G-2                             92   Worm.Sobig.F-1                      7
Dialer-891                                64   Phishing.Email.SSL-Spoof            7
Trojan.Vundo-181                          54   Worm.Mydoom.M                       6
Dialer-715                                41   Worm.Bagle.Z                        6
Trojan.Downloader.JS.IstBar.A-2           20   Worm.Bagle.Gen-vbs                  6
Worm.Sircam                               17   Worm.Bagle.AP                       6
Adware.Casino-3                           15   Trojan.JS.RJump                     6
Worm.SomeFool.P                           11   Exploit.HTML.IFrame                 6

Table 7: Top 20 malware types of our test crawl

After analyzing the malware database, we detected that this malware is one particular Trojan downloader distributed almost entirely from the same Web site (http://99calze.com/Galleria_PopUpFoto.php) with different GET values. We observed 3471 samples from this site. We suppose that this is an error of the Web site programmer, because such an obfuscation would not make sense, but the site itself is malicious.

Topic Based Crawls

After the optimizations we executed five topic-based crawls during March and April for the topics adult, pirate, celebrity, games, and wallpaper. To generate starting seeds we used synonymous and commonly used terms for each topic, utilizing Google Suggest and Google Trends. For example, the topic wallpaper was searched for with "free wallpapers", "wallpapers", etc.

Hosts File Crawl

After that, in April, we aggregated hosts files from six major hosts file providers into one seed file with a total of 8713 URLs.
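The aggregation itself is straightforward. The following sketch shows one possible way to turn such hosts files, whose lines usually look like "127.0.0.1 badsite.example", into seed URLs; it is not the actual blacklist seeder of the Monkey-Spider, and the file names are made up.

def hosts_to_seeds(paths):
    """Collect host names from aggregated hosts files and return them as seed URLs."""
    seeds = set()
    for path in paths:
        for line in open(path):
            line = line.split('#', 1)[0].strip()      # drop comments and whitespace
            if not line:
                continue
            host = line.split()[-1]                   # last column is the host name
            if host not in ('localhost', 'localhost.localdomain', 'broadcasthost'):
                seeds.add('http://' + host + '/')
    return sorted(seeds)

if __name__ == '__main__':
    for url in hosts_to_seeds(['hosts-provider1.txt', 'hosts-provider2.txt']):
        print(url)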
Table 8 shows the seed count per topic, the discovered links, the queued links, and the actually downloaded links. We observed an overall rate of 6.48 discovered links per crawled item.

topic        initial seed count   discovered links       queued   downloaded
adult                      1867           10566820      8931291      1537145
pirate                     2324           21592129     18875457      2547330
celebrity                  1405           20738435     17542649      2855300
games                      2367           33600204     27893560      5215394
wallpaper                  2781           23943186     18860725      4646027
hosts file                 8713           19159246     15725376      3204560

Table 8: Seed count per topic

4.3. Performance

We observed an average analysis ratio of 0.048 seconds per downloaded content and an average analysis ratio of 2.35 seconds per downloaded and compressed MB. Table 9 shows the runtime of the malware analysis in terms of the runtime of the ScanFolder program.

topic            secs     hours   content count          MB   sec/content   sec/MB
pirate       116416.21    32.34         2547330    54182.43        0.0457     2.15
celebrity    138246.53    38.40         2855300    41877.09        0.0484     3.30
game         279741.43    77.71         5215394   168713.89        0.0536     1.66
wallpaper    224220.80    62.28         4646027   150598.52        0.0483     0.92
hosts file   138727.65    38.54         3204560    60423.33        0.0433     3.71
total/avg.   897352.62   249.26        18468611   475795.26        0.0479     2.35

Table 9: Malware analysis performance per crawl

To increase the performance of the overall analysis process we searched for possible redundancies and found one minor optimization. The ARCWriter processor writes compressed ARC-files as the output of the Heritrix crawls. Figure 14 shows the operation of the ExtractARCFile program. Prior to malware detection, every ARC-file is uncompressed, and at the end it is compressed again. While speeding up the decompression is not possible without great effort, it is very easy to speed up the compression, because the compression ratio of the processed file is not crucial to us. As the ARC-files are initially compressed using gzip, we also use gzip to compress the files again. We compared the time in which an average-sized ARC-file (about 100 MB compressed) is compressed with the normal gzip operation and with the additional gzip --fast option. It took 13.357 s to compress such a file with gzip and 9.697 s with the gzip --fast option enabled. Applying this to our previously found average of 2.35 seconds per compressed MB results in about 2.31 seconds per compressed MB, a performance gain of 2 %. Completely omitting the compression of the processed ARC-files results in 2.22 seconds per compressed MB, a performance gain of 5.3 %. But not compressing the files could, in the worst case, produce ARC-files of about double the size and thus become critical for the system's storage requirements. A final option would be the deletion of processed ARC-files altogether, at the cost of losing the crawled content.
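The effect is easy to reproduce outside of the processing chain. The following sketch recompresses an already extracted ARC-file once with gzip's default compression level and once with the fastest level, which corresponds roughly to plain gzip versus gzip --fast; the file name is made up.

import gzip
import time

def time_compress(path, level):
    """Compress the file at 'path' with the given gzip compression level and return the runtime."""
    data = open(path, 'rb').read()          # an already extracted (uncompressed) ARC-file
    start = time.time()
    out = gzip.open(path + '.recompressed.gz', 'wb', compresslevel=level)
    out.write(data)
    out.close()
    return time.time() - start

if __name__ == '__main__':
    arc = 'IAH-20070301-example.arc'        # file name is made up
    print('default (like plain gzip):  %.3f s' % time_compress(arc, 6))
    print('fastest (like gzip --fast): %.3f s' % time_compress(arc, 1))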
4.4. Phishing Analysis

Since ClamAV has a very good phishing detection engine, we found 27 different types of phishing sites among a total of 32 phishing pages in our crawls. The most interesting aspect is the distribution of the samples. Table 10 shows the distribution of the found sites.

topic        phishing sites   unique sites
adult                     1              1
pirate                   28             23
celebrity                 0              0
games                     0              0
wallpaper                 1              1
hosts file                2              2
total                    32             27

Table 10: Topic based phishing site distribution

Table 11 shows the different detected phishing sites and their occurrence. Further analysis of the pirate-based phishing sites unveiled that eight of the found sites are located on the domain antiphishing.org, which is a known anti-phishing site. It is obvious that they host the phishing sites for educational purposes. Eventually, all of the 32 found sites turned out to be phishing-awareness educational content hosted without a malicious background. This brings us to the conclusion that phishing sites are normally not linked on any other Web site and are not available for more than a few days. Furthermore, if phishing sites are only up for a few days and the signature database of ClamAV has to be updated manually on a daily basis, it is likely that phishing sites will be missed due to missing or delayed signatures. Therefore, another strategy to detect phishing sites has to be considered.

name                         occurrence
HTML.Phishing.Bank-3                  4
HTML.Phishing.Auction-28              2
HTML.Phishing.Bank-573                2
HTML.Phishing.Acc-10                  1
HTML.Phishing.Auction-146             1
HTML.Phishing.Auction-20              1
HTML.Phishing.Auction-47              1
HTML.Phishing.Auction-74              1
HTML.Phishing.Bank-159                1
HTML.Phishing.Bank-162                1
HTML.Phishing.Bank-214                1
HTML.Phishing.Bank-277                1
HTML.Phishing.Bank-44                 1
HTML.Phishing.Bank-557                1
HTML.Phishing.Bank-581                1
HTML.Phishing.Bank-621                1
HTML.Phishing.Bank-680                1
HTML.Phishing.Card-3                  1
HTML.Phishing.Card-45                 1
HTML.Phishing.Pay-1                   1
HTML.Phishing.Pay-131                 1
HTML.Phishing.Pay-131                 1
HTML.Phishing.Pay-19                  1
HTML.Phishing.Pay-2                   1
HTML.Phishing.Pay-35                  1
HTML.Phishing.Pay-6                   1
Phishing.Email                        1

Table 11: Unique phishing site types

4.5. Results

We know only little about the distribution of malicious Web sites, but our results show an interesting topic-specific maliciousness probability regarding binary files. Table 12 shows the topic maliciousness.

topic        maliciousness in %
pirate                      2.6
wallpaper                   2.5
hosts file                  1.7
games                       0.3
celebrity                   0.3
adult                       0.1
total                       1.0

Table 12: Topic-specific binary file maliciousness probability

This result can be compared to the result of the SiteAdvisor malware analysis report of March 2007 [76]. They found that 4.1 % of all Web sites have been identified as malicious. The gap between their result and our result of 1.0 % per crawled binary can mainly be explained by the following differences:

• SiteAdvisor has additional maliciousness indicators: user feedback, drive-by downloads, and spam mail analysis. This approach leads to a higher maliciousness rate.
• Our approach lacks correctness due to our incomplete crawls.
• We could only generate results for the overall probability regarding binary files and thus miss other non-binary samples due to some shortcomings in our crawls.
• The whole Web, which is targeted by SiteAdvisor, is likely to cover more topics.
• Users of the SiteAdvisor system can report additional pages to be analyzed. Therefore it is likely that the SiteAdvisor system will discover malicious Web sites which are not linked on any other site and thus cannot be found with our approach.

Furthermore, we discovered a total of 93 different malware types in our crawls. Table 13 shows the top 20 malware types found during our analysis.

name                   occurrence   name                    occurrence
HTML.MediaTickets.A           487   W32.Parite.B                     6
Trojan.Aavirus-1               92   HTML.Phishing.Bank-3             4
Trojan.JS.RJump                91   Exploit.HTML.IFrame              3
Adware.Casino-3                22   Joke.Stressreducer-3             3
Adware.Trymedia-2              12   Trojan.URLspoof.gen              3
Adware.Casino                  10   Adware.Hotbar-2                  2
Worm.Mytob.FN                   9   Adware.Webhancer-10              2
Dialer-715                      8   ClamAV-Test-File                 2
Adware.Casino-5                 7   Dialer-166                       2
Trojan.Hotkey                   6   Eicar-Test-Signature             2

Table 13: Top 20 malware types

We identified a total of 50 unique domains as the source of all found malicious content. Table 14 shows the top 20 malicious domains found during our crawls.
count   domain                      count   domain
  487   desktopwallpaperfree.com        8   3dlez.com
   92   waterfallscenes.com             8   antiphishing.org
   91   pro.webmaster.free.fr           7   cyervo.iespana.es
   15   astalavista.com                 7   luckyblackjack.com
   14   bunnezone.com                   7   packetstormsecurity.org
   12   oss.sgi.com                     6   downloadcheats.net
   12   ppd-files.download.com          6   themerboy.com
   11   888casino.com                   4   kit.carpediem.fr
   11   888.com                         3   boingboing.net
   10   bigbenbingo.com                 3   honeynet.org

Table 14: Top 20 malicious domain distribution

Manual URL analysis has shown us some false positives and other suspicious domains. For example, pro.webmaster.free.fr seems to contain a so-called spider trap. This is a mechanism to trap Web spiders, aka Web crawlers: such sites automatically generate infinitely many URLs to keep a Web crawler busy with a needless task. Some of the listed domains are known non-malicious sites with malware-research-related topics. Therefore they provide some malware samples, and our system detects them as malicious. Examples are:

• oss.sgi.com, a serious open-source development site of Silicon Graphics Incorporated.
• antiphishing.org, the official site of the Anti-Phishing Working Group.
• boingboing.net, a serious community page with a phishing mail sample in a post.
• honeynet.org, the site of the Honeynet Project, dedicated to fighting hackers.

These false positives show us some problems of our approach.

Known Non-malicious Web sites

The fact that a site can be non-malicious even though it hosts known malicious code for research or other purposes is not recognizable with our system. We see only two mechanisms that can avoid such false positives. The first is the use of a rating-like system, as SiteAdvisor uses it, in which the Web site admin and users can rate and describe a Web site's maliciousness. The second is a previously generated whitelist of known serious Web sites. This could work like a community-generated blacklist or like a new registrar-like system in which every serious Web site can declare its legitimacy to a whitelist registrar to avoid being blacklisted. Our system has none of these but could easily be extended with a whitelisting system, as sketched below.
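A minimal sketch of such a whitelist check, assuming one known-good domain per line in a plain text file (the file name is made up); detections on whitelisted domains could then be reported separately instead of being counted as malicious.

def load_whitelist(path):
    """Read one known-good domain per line, e.g. 'honeynet.org'."""
    return set(line.strip().lower() for line in open(path) if line.strip())

def is_whitelisted(host, whitelist):
    """True if the host equals a whitelisted domain or is one of its subdomains."""
    host = host.lower()
    return any(host == d or host.endswith('.' + d) for d in whitelist)

if __name__ == '__main__':
    wl = load_whitelist('whitelist.txt')             # file name is made up
    print(is_whitelisted('www.honeynet.org', wl))    # True if honeynet.org is listed
    print(is_whitelisted('99calze.com', wl))         # False unless explicitly whitelisted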
Problems with our crawl method

In comparison to our approach, the UW Spycrawler [93] crawls a fixed number of URLs on specific hosts and retrieves the whole content of a host. We did broad-scope crawls and ended them early, knowing that our system could neither store nor process that much data. With this approach we have seen that the crawler gets mainly HTML sites and focuses on link extraction rather than on downloading the whole content. Maybe this is why we only discover a minor part of the content on the crawled sites. This leads to few executable files and thus to very few malicious files. As a solution, later crawls should be fine-tuned not to exceed the available resources of the research system, and thus the max-link-hops and max-trans-hops options should be set to a very small value.

5. Future Work

The main task of our work was to implement the Monkey-Spider system in the first place and then to evaluate it with representative crawls of the Internet. Since our time was restricted to six months and I am the only one who implemented and evaluated the system, we could not fully complete our plans. The operation chain is functional, but the integration into the Web interface, the integration of malware scanners other than ClamAV, and thus a first release candidate for the public were not feasible for us. Furthermore, during the implementation and research we obtained new ideas for extensions and other use cases for our system. In the following sections, we want to introduce a few of these ideas.

5.1. Queueing Additional Content

Since the outcome of the research fully depends on the probed content, our initial probe queue is an important issue for our results. Extending the queue will broaden the scope of the analysis and/or refine and improve our results.

Google AJAX Search Seeder and Microsoft Live Search Seeder Integration

As the Google SOAP Search API is no longer supported and no API keys are available, one of the major tasks is to reimplement the Google seeder with the new Google AJAX Search API. The good news is that the new API does not have restrictions on the queries. With this new API we could build huge seeds which contain all search results for a specific topic and do domain-based crawling to keep the focus on malware entirely based on a topic. This approach would give us the most representative evaluation of the Internet. Microsoft has stopped the support for their MSN SOAP Search API and started a new search API, the Microsoft Live Search API, which is also SOAP-based and nearly the same. This API should also be integrated into the Monkey-Spider. The main advantage is the increased result rate per day, from 250 to 1000 results.

Automated Blacklist Seeder

We have written a blacklist seeder script to generate seeds from a hard-coded hosts file list. As a future task it should be possible to upload a new hosts file through the Monkey-Spider Web interface and to control the blacklist seeder from the Web interface.

Monitoring Seeder

One of our major drawbacks is the lack of a monitoring seeder. Since our Web interface is not fully functional yet, a monitoring seeder could not be implemented. Thus an automatic monitoring of detected malware is currently not supported. The monitoring seeder could queue previously detected malicious Web sites which have exceeded a given time period and generate seeds to be recrawled.

JavaScript Execution Seeder

Often JavaScript code is used to generate links to other executable content, which is likely to be malicious since this is an obfuscation technique to veil malicious actions. While our crawler extracts links out of crawled content, including JavaScript files, it does not execute them and thus misses some possibly interesting malicious content. As a further improvement, a JavaScript execution seeder could be implemented which interprets JavaScript code in a secure sandbox-like environment and extracts the generated links. These links could then be additionally queued and analyzed with our system.

Typosquatting Seeder

Misspelled domain names of popular domains, so-called typos (typographical errors), are an important source of maliciousness on the Internet. Normally, such domains do not have meaningful content, except for a coincidental match with other serious sites. The practice of registering typos to misuse such stray traffic is commonly known as typosquatting. Malware writers want to spread their code as fast and as widely as possible, so they register common typos of real existing sites and use this unintended traffic to compromise clients [105, 67]. As an addition, our seeder could automatically generate typos of a given list of popular Web sites and use them as seeds. Microsoft has developed such an automatic typo generator, called the Strider URL Tracer with Typo-Patrol [51]. Figure 20 shows common typos for Google.com.

Figure 20: Popular Google typos
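A typosquatting seeder could start from very simple transformations. The following sketch generates a few classic typo candidates (omitted, doubled, and swapped characters) for a given domain; real typo generators like the Strider URL Tracer use considerably more elaborate models.

def typo_candidates(domain):
    """Generate simple typosquatting candidates for a domain like 'google.com'."""
    name, tld = domain.rsplit('.', 1)
    typos = set()
    for i in range(len(name)):
        typos.add(name[:i] + name[i+1:] + '.' + tld)                # omitted character, e.g. gogle.com
        typos.add(name[:i] + name[i] * 2 + name[i+1:] + '.' + tld)  # doubled character, e.g. gooogle.com
        if i + 1 < len(name):                                       # swapped characters, e.g. googel.com
            typos.add(name[:i] + name[i+1] + name[i] + name[i+2:] + '.' + tld)
    typos.discard(domain)
    return sorted(typos)

if __name__ == '__main__':
    for candidate in typo_candidates('google.com'):
        print('http://' + candidate + '/')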
Figure 21 shows Goole.com, one of the squatter sites, with lots of advertising.

Figure 21: Goole.com, a typosquatter site

5.2. Crawler Extensions

Some drawbacks of the crawler can have a negative effect on the performance, scope, and representativeness of the crawls. The following extensions try to avoid some of them.

Mail Crawler

Compared to Shelia, our mail seeder lacks the processing of malware attached to spam mails and thus misses it. In addition to Heritrix, a new mail crawler could create ARC-files out of mail attachments and include them in the analysis process.

WARC Integration

Having been the standard archive format of the Internet Archive for over eleven years, the ARC format is about to be replaced by a new file format, which is a product of the proposals of many national libraries and the Internet Archive as part of the International Internet Preservation Consortium [22]. The Web ARChive (WARC) file format [64] covers all features of the current ARC file format and adds new functionality, better standardization, more record types, and ease of use. Thus our malware database could provide more information and better statistics regarding the connection of malicious neighbourhoods.

Crawl Synchronisation

Considering the fact that two crawl jobs are not in the least synchronized with each other, we know that it is possible and, especially in broad-scope crawls, very likely that a part of the crawled content will be unnecessarily recrawled. This recrawling can be partly accelerated with a Web proxy used as the Internet gateway. There is another way of completely avoiding this overhead. The DeDuplicator [55] is an add-on module for Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls. It operates as a processor in the processing chain of Heritrix and considers all crawls in a series of crawls. When the DeDuplicator encounters a duplicate URL to be crawled, the processing moves straight to the PostProcessing chain, skipping every processor that should not run for a detected duplicate. As an extension, the DeDuplicator module could be integrated into the Monkey-Spider system to avoid such overhead.
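The DeDuplicator itself is a Java processor inside Heritrix's processing chain; purely to illustrate the underlying idea, the following Python sketch keeps a persistent index of content digests shared between crawl jobs and reports whether a piece of content was already processed. The index file name is made up.

import hashlib
import shelve

def already_seen(data, index_path='crawl-digests.db'):
    """Return True if content with the same SHA-1 digest was processed in an earlier crawl."""
    digest = hashlib.sha1(data).hexdigest()
    index = shelve.open(index_path)           # persists across crawl jobs
    try:
        if digest in index:
            return True
        index[digest] = True
        return False
    finally:
        index.close()

if __name__ == '__main__':
    page = b'<html>example page content</html>'
    print(already_seen(page))   # False on the first call, True on every later call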
Explicit Top Rank Crawls

A further problem regarding duplicate crawling is the problem of recrawling top-ranked Web sites on every crawl. When we perform broad crawls, top-ranked Web sites like Wikipedia.org, Amazon.com, and YouTube.com are most likely to be linked frequently on common Web sites. Thus these sites will more likely than not be recrawled. These sites falsify our topic-based research results by yielding more clean Web sites than there should be. Writing a new processor module for Heritrix which can exclude given top-ranked sites is our solution to this problem. In this approach, we propose to generate, manually or automatically, a top-rank site list with about a thousand sites, for example out of the Alexa Top Sites [2], and to use this list as an exclude rule with a blacklist-like processor. This blacklist processor should avoid crawling content from any link hosted on the blacklisted top-ranked servers. Additionally, we could crawl only the top-ranked site list to gather information about their maliciousness.

P2P Crawler

As mentioned before in Chapter 3.8.6, peer-to-peer networks are currently out of our scope. Since Heritrix was built to crawl Web sites, it would not be a good idea to extend Heritrix to support peer-to-peer networks. In contrast, we propose to write a peer-to-peer client per underlying peer-to-peer protocol that generates ARC-files as output. These ARC-files could then be examined in exactly the same way as the other ARC-files. These peer-to-peer crawlers could use existing peer-to-peer client libraries, like libgnutella of the gtk-gnutella [18] project, to minimize the implementation effort. A similar crawling strategy has to be developed to find representative results for topic-based searches in peer-to-peer networks.

5.3. Malware Analysis

Exploit DB Comparison

There are some publicly available exploit databases on the Net, like Milw0rm [37] or Metasploit [35]. Such databases contain exploit code for known vulnerabilities and shellcode to be used with exploits to execute specific operations on an attacked machine. As an addition, we could generate signatures of such publicly available exploits and shellcodes and examine our crawled content with them. Found sites would be at least suspicious or hacker Web sites and thus lead to other likely interesting content.

Other Malware Scanners

The ClamAV scanner is likely to miss a lot of publicly known malware and thus generate false negatives. To achieve a better detection rate, an integration of more malware scanners, including anti-virus and anti-spyware solutions, would be desirable.

Containment Strategy

We have built the Monkey-Spider system without a containment strategy, and our code is not written with regard to security issues, due to the lack of knowledge and time. To keep the honeyclient from being owned and maybe used for malicious actions, a containment strategy has to be developed. Especially vulnerabilities of common malware scanners are likely to be exploited. Another future risk is a possible exploitation of the Monkey-Spider software itself due to its free availability.

Web DB Query Interface

In addition to standard queries to the malware database, like topic-based malware statistics, a general Web SQL interface to the underlying database could provide a powerful and efficient, but insecure, addition to the Monkey-Spider system.

Logfile Analysis

During our data collection for the evaluation it became clear that much important overall information can only be obtained with standardized logfile evaluations. We therefore recommend an additional module for the automated, script-based analysis of the provided log files. Such an effort could also contain some graphical evaluation extensions.

5.4. General Purpose

Alternative and extended use cases for the Monkey-Spider system open a new perspective on widespread analysis of the Web in general. Admitting that such analysis systems may already exist, we present our system as a freely available Internet analysis framework. In the following sections we want to present some ideas for such analyses, which could be realized with very little implementation effort.

Internet File Collection

Since our system extracts the files contained in the ARC-files generated with Heritrix, we could denote this behavior as dumping parts of the Internet. Extracted files can then be used to build collections of documents, programs, photos, audio, and video. However, the copyright of the downloaded material has to be considered before reusing copyrighted material in any way. Therefore, only hosts with public-domain material could be crawled.

Criminal Investigation

Criminal investigation on the Internet needs good tools to find traces of crime in order to prosecute and sentence the offenders.
For this approach, only multimedia files could be extracted and manually examined with multi-file indexing and visualization tools to unveil criminal activities. The manual analysis of the multimedia content is necessary because only humans can recognize crime. Imaginable use cases are the disclosure of child sexual abuse, sexual assault in general, terrorism, and assault in general.

Statistics

In combination with the dumped files and the huge amount of logging information, it is possible to generate any kind of content-specific statistics:

• How is the distribution of files on the Internet?
• Which flavors of a filetype are used, for example file versions, file sizes, meta information, etc.?
• Language statistics.
• Web authoring statistics, used tags, etc., for example like the Google Web Authoring Statistics [17].

Such statistics could then be used by companies to forecast, for example, the popularity of a program or a filetype.

Copyright Infringement

Copyright infringement is the most widespread crime for which the Internet is utilized. Automatically observing millions of files in a small amount of time can unveil minor and major suppliers of copyrighted material and thus support the fight against copyright infringement. Several techniques can be used for the analysis:

• Signature and watermark scanning as well as analysis of copyrighted material like photos, audio files, video files, e-books, and documents: holders of copyrighted material can provide signatures and/or digital watermarks of their material. Signatures can only be compared to unmodified material, in contrast to digital watermarks, whose advantage is their applicability to modified and/or falsified content, like resized pictures or resampled audio files.
• Keyword scanning and data-mining techniques on file names and other metadata on the Web site or in the file itself, like in the Exchangeable Image File Format (EXIF), the Tagged Image File Format (TIFF), and in ID3 on MP3 audio, etc. Data mining and keywords can be used to gather information to identify an infringement; for example, many infringements are not directly accessible, but the filename unveils the content.

Cryptographical Hash Function Collision Detection

A hash function is a function that generates a fixed-length output for an arbitrary-length input. The output serves as a fixed-length fingerprint for any given input. Hash functions have widespread use cases in computer systems, for example as database indexes, integrity checksums, and in security applications. Cryptographic hash functions are hash functions with additional security properties to be used in different security applications like authentication and message integrity indicators. Cryptographic hash functions must provide additional properties to warrant these security aspects. As a security property it is important that it is computationally infeasible to find a collision for a secure cryptographic hash function. A collision is the case when two different inputs generate the same output. Computationally feasible refers to any method which is faster than a brute-force attack that tries every possible input combination. The process of finding two arbitrary values whose hashes collide is called a collision attack; the process of finding one arbitrary value whose hash collides with another, given hash is called a preimage attack. A successful preimage attack is a much more serious break than a successful collision attack. The two most commonly used hash functions in computer systems are MD5 [58] and SHA-1 [63].

Figure 22: Scheme of a hash-function

Since Heritrix uses SHA-1 for its checksum generation per given URL, the generated checksums of millions of Web sites could be examined to detect a collision for SHA-1. Such a system could be used as a proof of concept (PoC) for collision effects on the SHA-1 function and could be a paradigm for researchers on this subject. Additionally, the org.archive.io.ArchiveRecord class of the Heritrix package can be modified to examine other hash functions, e.g. the MD5 hash function.
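A minimal sketch of the idea, operating on files already extracted from the ARC archives rather than on Heritrix's recorded checksums: it groups files by their SHA-1 digest and reports any digest that is shared by files with differing contents. For brevity the sketch keeps the file contents in memory, which a real large-scale scan would avoid.

import hashlib
import os
import sys

def find_collisions(root):
    """Group files under 'root' by SHA-1 digest and report digests shared by differing contents."""
    by_digest = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            data = open(path, 'rb').read()
            by_digest.setdefault(hashlib.sha1(data).hexdigest(), []).append((path, data))
    for digest, entries in by_digest.items():
        if len(set(data for _, data in entries)) > 1:     # same digest, different bytes
            print('SHA-1 collision for digest %s:' % digest)
            for path, _ in entries:
                print('    %s' % path)

if __name__ == '__main__':
    find_collisions(sys.argv[1] if len(sys.argv) > 1 else '.')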
Wordlist Generation

A dictionary attack is a technique for defeating a cipher or an authentication method by guessing a decryption key or password through trying different words out of a dictionary (commonly called a wordlist). These attacks are computationally much more feasible than plain brute-force attacks, in which every possible combination is tried to guess a password. Dictionary attacks are likely to succeed because most people tend to choose passwords which are easy to remember and pick words taken from their native language. The success of such an attack depends on the quality of the wordlist used: the more comprehensive it is, the greater is the possibility of a successful attack. Wordlists can be comprehensive with respect to the contained languages, names, trademarks, and other special character combinations only used by a small portion of people, like slang [9], dialects [3], special abbreviations [52], etc., and special writing styles like the hacker writing style [57]. The ultimate wordlist in this context is the Internet itself. Therefore it would be another interesting idea to use the Monkey-Spider as a wordlist generator. Strings could easily be extracted from every crawled content. This could be achieved, for example, with the re module from the Python standard library.

Listing 11 Demonstration of the re.split() function
>>> a = " KIEV (AFX) - Yushchenko said that his country "
>>> re.split('\W+', a)
['', 'KIEV', 'AFX', 'Yushchenko', 'said', 'that', 'his', 'country', '']

Additionally, the occurrence count of every word could be stored to provide a popularity rank. The generated wordlist could then be sorted according to this rank. We claim that it could be shown empirically that such a sorted wordlist (with maybe millions of words) would be the best wordlist available for any dictionary attack.
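Building on the re.split() call from Listing 11, the following sketch counts word occurrences over a set of crawled text files and returns the words sorted by popularity; the input files are assumed to be plain text already extracted from the archives.

import re
import sys

def ranked_wordlist(paths):
    """Count word occurrences in the given text files and return words sorted by popularity."""
    counts = {}
    for path in paths:
        text = open(path, errors='replace').read()
        for word in re.split(r'\W+', text):
            if word:
                counts[word] = counts.get(word, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)

if __name__ == '__main__':
    for word in ranked_wordlist(sys.argv[1:])[:100]:    # print the 100 most popular words
        print(word)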
6. Summary and Conclusions

"The more you read and learn, the less your adversary will know." (Sun Tzu)

A virtual combat for money and glory is in full progress. We wanted the reader to learn his weaknesses and prepare for the counterstrike. With this work, we have concentrated our efforts on the consistently rising client-side threats on the Internet. At the same time, we have observed the importance of mapping malicious Web sites and of giving researchers the ability to repeat and expand similar research. We first showed how Web sites can be a threat to client security. Thereafter, we introduced the reader to different malware types and malicious content from a social viewpoint. Meanwhile, the concept and importance of honeypot-based computer security devices have been shown and classified. Afterwards, similar work on this topic has been presented and compared to our work. As the next step, we introduced our solution, the Monkey-Spider Internet analysis framework, and explained its implementation and application as a client honeypot. We made several design decisions and justified them to the reader. Making use of different other freely available systems, we showed the reasons for choosing particular flavors of them. After introducing Web crawling techniques, we introduced our Web interface to control the Monkey-Spider system. Meanwhile, our malware detection approach has been shown and criticized. Thereafter, some important limitations of our approach have been shown, and solution ideas have been suggested to overcome some of the mentioned issues. Finally, we applied our system in the real world. We learned some good lessons and found interesting results, which we shared with the reader. One of the most important lessons concerned the crawling strategy. All aspects of the crawling method have to be considered and evaluated precisely to achieve the desired results. To correctly categorize a Web site or domain as malicious to client security, all the publicly reachable content has to be examined. Eventually, we pointed out other use cases for our framework and showed its applicability to a wide variety of Internet analysis tasks.

We conclude that after many years of server security research we need to concentrate more on client security. With the rise and abuse of botnets, the danger of being abused by an attacker has become greater than ever. The continuance of the Web as we are used to knowing it is at stake if the power to decommission network nodes is in the hands of a bunch of criminals controlling botnets. An easy way of avoiding the threats of malicious Web sites has to be the automatic avoidance of these dangerous parts of the Web in an easy-to-use and error-free manner. SiteAdvisor is an example of an interesting approach utilizing the power of the Internet community, but it is neither free nor sufficient for these needs. Therefore, we propose an independent Web site security task force (WSSTF) of voluntary individuals and companies which could enable Web security starting from the Internet service provider level, to prevent clients from becoming the victims of such attacks, although this could raise legal and/or Internet censorship issues that would have to be overcome.

A. Database Scheme

Listing 12 createvirdb.sql script to generate the Monkey-Spider database scheme

-- db scheme for the Monkey-Spider
DROP TABLE mw_output;
DROP TABLE malware;
DROP TABLE mw_scanner;

CREATE TABLE malware (
    id int PRIMARY KEY,      -- unique malware-id
    filename varchar(40),    -- filename of the malware
    url text,                -- url where it was crawled
    checksum varchar(32),    -- checksum
    size int,                -- filesize
    date timestamp,          -- date when the file was crawled
    job varchar(20),         -- name of the crawl job
    comment varchar(20)      -- comment
);

CREATE TABLE mw_scanner (
    id int PRIMARY KEY,         -- unique malware scanner-id
    name varchar(20),           -- name of the malware scanner
    engine_ver varchar(40),     -- the engine version
    signature_ver varchar(40),  -- the signature version
    lastupdate timestamp,       -- date when the mw-scanner was last updated
    UNIQUE (engine_ver, signature_ver)
);

CREATE TABLE mw_output (
    mw_id int REFERENCES malware(id),        -- foreign key to malware.id
    mw_sc_id int REFERENCES mw_scanner(id),  -- foreign key to mw_scanner.id
    description varchar(20),                 -- description of found malware
    PRIMARY KEY (mw_id, mw_sc_id)            -- primary key: both foreign keys
);

B. JavaScript Code Obfuscation Sample

Listing 13 JavaScript code obfuscation

// the whole script in the escaped binary form
// Block one: the escape sequence is abbreviated for clarity
<!--
eval(unescape("function w%28s%2[...]%3B"));
//-->
%3B"); //--> // Block two: the evaluated obfuscation function <!-function w(s) { t = "whmSrp;^s"; o = new String; l = s.length; for (i = 0; i < l; i++) { n = t.indexOf(s.charAt(i)); if (n == -1) { // charAt() gets the indexed // character of the string o += s.charAt(i); continue; } if (n >= 0) { // fromCharCode() gets the character // out of the Unicode index o += String.fromCharCode(10); continue; } } document.write(o); } //--> 73 References References [1] AdAware. http://www.lavasoft.com/. [2] Alexa Top 500 rank. http://www.alexa.com/site/ds/top_sites?ts_mode=global&lang=none. [3] Black Country Dialect. http://www.sedgleymanor.com/dictionaries/dialect.html. [4] Boomerang Decompiler. http://boomerang.sourceforge.net/. [5] Capture-HPC. http://www.nz-honeynet.org/cabout.html. [6] Clam AntiVirus a GPL anti-virus toolkit for UNIX. http://www.clamav.net/. [7] CSS Specication. http://www.w3.org/TR/CSS21/. [8] CWSandbox - Behavior-based Malware Analysis. http://www.cwsandbox.org/. [9] Dictionary of English slang and colloquialisms of the UK. http://www.peevish.co.uk/slang/. [10] Extensible Markup Language (XML). http://www.w3.org/XML/. [11] GNU zip. http://www.gzip.org/. [12] Gnutella Protocol Specication. http://rfc-gnutella.sourceforge.net/developer/stable/index.html. [13] Google AJAX Search API. http://code.google.com/apis/ajaxsearch/. [14] Google SOAP Search API. http://code.google.com/apis/soapsearch/. [15] Google Suggest. http://www.google.com/webhp?complete=1&hl=en. [16] Google Trends. http://www.google.de/trends. 75 References [17] Google Web Authoring Statistics. http://code.google.com/webstats/index.html. [18] Gtk-Gnutella a gnutella network client. http://gtk-gnutella.sourceforge.net. [19] Heritrix, the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. http://crawler.archive.org/. [20] Htmldata, a Python library to manipulate HTML/XHTML/CSS documents. http://oregonstate.edu/~barnesc/htmldata/. [21] IDA Pro disassembler. http://www.datarescue.com/idabase/. [22] International Internet Preservation Consortium (IIPC). http://netpreserve.org. [23] Internet Archive. http://www.archive.org. [24] Internet Relay Chat Protocol Specication. http://tools.ietf.org/html/rfc1459. [25] Java programming language. http://java.sun.com/. [26] JavaScript obfuscation example. http://idesitv.com/starone1-dkcj3s.htm. [27] JavaScript Specication. http://www.ecma-international.org/publications/standards/Ecma-262. htm. [28] Jericho HTML Parser. http://jerichohtml.sourceforge.net/doc/index.html. [29] JMX Specication. http://java.sun.com/javase/technologies/core/mntr-mgmt/ javamanagement/. [30] Karrigell, a exible Python Web framework. http://karrigell.sourceforge.net/. [31] Kazaa Network. http://www.kazaa.com. [32] Live Search API Reference. http://msdn2.microsoft.com/en-us/library/bb266180.aspx. 76 References [33] Macromedia Flash File Format (SWF). http://www.adobe.com/licensing/developer/fileformat/faq/. [34] McAfee SiteAdvisor. http://www.siteadvisor.com/. [35] Metasploit. http://www.metasploit.com/. [36] Mike's Ad Blocking Hosts le. http://everythingisnt.com/hosts.html. [37] Milw0rm. http://www.milw0rm.com/. [38] Multipurpose Internet Mail Extensions (MIME). http://www.ietf.org/rfc/rfc2045.txt. [39] Nepenthes. http://nepenthes.mwcollect.org/. [40] PDF Specication. http://www.adobe.com/devnet/pdf/pdf_reference.html. [41] Post Oce Protocol - Version 3. http://tools.ietf.org/html/rfc1939. [42] PostgreSQL, a powerful, open source relational database system. 
[43] PyGoogle, a Python interface to the Google SOAP Search API. http://pygoogle.sourceforge.net/.
[44] PyGreSQL, a PostgreSQL module for Python. http://www.pygresql.org/.
[45] pYsearch, a Python interface for the Yahoo Search API. http://pysearch.sourceforge.net/.
[46] Resource Hacker. http://www.angusj.com/resourcehacker/.
[47] Simple Object Access Protocol (SOAP). http://www.w3.org/TR/soap/.
[48] Snort. http://www.snort.org/.
[49] Strider Gatekeeper Spyware Management: Beyond Signature-based Approach. http://research.microsoft.com/spyware/.
[50] Strider GhostBuster Rootkit Detection. http://research.microsoft.com/rootkit/.
[51] Strider URL Tracer. http://research.microsoft.com/URLTracer/.
[52] The Acronym Finder. http://www.acronymfinder.com/.
[53] The Anti-Spyware Coalition. http://www.antispywarecoalition.org/.
[54] The ARC file-format. http://www.archive.org/web/researcher/ArcFileFormat.php.
[55] The DeDuplicator Heritrix add-on module. http://deduplicator.sourceforge.net/.
[56] The Free Network Project. http://freenetproject.org/.
[57] The Jargon File. http://www.catb.org/~esr/jargon/.
[58] The MD5 Message-Digest Algorithm. http://tools.ietf.org/html/rfc1321.
[59] The Python Programming Language. http://python.org/.
[60] The Python Web Services Project. http://pywebsvcs.sourceforge.net/.
[61] The Zero Day Initiative. http://www.zerodayinitiative.com/.
[62] Ubuntu Linux. http://www.ubuntu.com/.
[63] US Secure Hash Algorithm 1 (SHA1). http://tools.ietf.org/html/rfc3174.
[64] WARC specification. http://archive-access.svn.sourceforge.net/viewvc/*checkout*/archive-access/branches/gjm_warc_0_12/warc/warc_file_format.html.
[65] Yahoo Search Web Services. http://developer.yahoo.com/search/.
[66] Follow the Money; or, why does my computer keep getting infested with spyware? 2004. http://tacit.livejournal.com/125748.html.
[67] Googkle.com installed malware by exploiting browser vulnerabilities. 2005. http://www.f-secure.com/v-descs/googkle.shtml.
[68] Adobe Security Advisories. Buffer Overflow Vulnerability in Adobe Acrobat, July 2006. http://www.adobe.com/support/security/bulletins/apsb06-09.html.
[69] Microsoft Security Advisory. A COM object (Javaprxy.dll) could cause Internet Explorer to unexpectedly exit, July 2005. http://www.microsoft.com/technet/security/advisory/903144.mspx.
[70] Michael K. Bergman. The 'Deep' Web: Surfacing Hidden Value. 2001. http://brightplanet.com/resources/details/deepweb.html.
[71] Carsten Willems, Thorsten Holz, and Felix Freiling. CWSandbox: Towards Automated Dynamic Binary Analysis. 2006.
[72] Chad Verbowski, Emre Kıcıman, Arunvijay Kumar, Brad Daniels, Shan Lu, Juhan Lee, Yi-Min Wang, and Roussi Roussev. Flight Data Recorder: Monitoring Persistent-State Interactions to Improve Systems Management. In 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI '06), 2006.
[73] William Cheswick. An Evening with Berferd. 1998.
[74] Andy Cianciotto. Spam Attack: Zipped Trojan. April 2007. http://www.symantec.com/enterprise/security_response/weblog/2007/04/trojanpeacomm_spam_run.html.
[75] Fred Cohen. A Computer Virus. 1984. http://all.net/books/virus/part2.html.
[76] Shane Keats and Dan Nunes. Mapping the Mal Web. 2007. http://www.siteadvisor.com/studies/map_malweb_mar2007.html.
[77] Robert Danford. 2nd Generation Honeyclients. SANSFIRE 2006 Presentation. http://handlers.dshield.org/rdanford/pub/Honeyclients_Danford_SANSfire06.pdf.
[78] Alex Eckelberry. Spam with malware links. April 2007. http://sunbeltblog.blogspot.com/2007/04/spam-with-malware-links.html.
[79] Bernhard Müller and Martin Eiszner. IE6 javaprxy.dll COM instantiation heap corruption vulnerability. June 2005. http://www.sec-consult.com/184.html.
[80] Roy Thomas Fielding. Architectural styles and the design of network-based software architectures. PhD thesis, 2000.
[81] Matthew Fordahl. Latest e-mail worm spreading fast. January 2004. http://www.securityfocus.com/news/7905.
[82] The Free Software Foundation. The Free Software Definition. http://www.gnu.org/philosophy/free-sw.html.
[83] Roger A. Grimes. Tracking malware with honeyclients, April 2006. http://www.infoworld.com/article/06/04/14/77378_16OPsecadvise_1.html.
[84] Anti-Phishing Working Group. Phishing Attack Trends Report - February 2007. 2007. http://www.antiphishing.org/phishReportsArchive.html.
[85] Network Working Group. RPC: Remote Procedure Call Protocol Specification Version 2, 1988. http://tools.ietf.org/html/rfc1057.
[86] Thorsten Holz. A Short Visit to the Bot Zoo. IEEE Security and Privacy, 3(3):76-79, 2005.
[87] The Open Source Initiative. The Open Source Definition. http://opensource.org/docs/osd.
[88] John F. Gantz, David Reinsel, Christopher Chute, Wolfgang Schlichting, John McArthur, Stephen Minton, Irida Xheneti, Anna Toncheva, and Alex Manfrediz. The Expanding Digital Universe: A Forecast of Worldwide Information Growth Through 2010. March 2007.
[89] Andrew Kalafut, Abhinav Acharya, and Minaxi Gupta. A study of malware in peer-to-peer networks. pages 327-332, 2006.
[90] A. M. Kuchling. Python Advanced Library: Batteries Included Philosophy. http://www.python.org/dev/peps/pep-0206/.
[91] Steve Lawrence and C. Lee Giles. Searching the World Wide Web. Science, 280(5360):98-100, 1998. citeseer.ist.psu.edu/lawrence98searching.html.
[92] Sang Ho Lee, Sung Jin Kim, and Seok-Hoo Hong. On URL Normalization. 2005.
[93] Alexander Moshchuk, Tanya Bragin, Steven D. Gribble, and Henry M. Levy. A Crawler-based Study of Spyware on the Web. In Proceedings of the 13th Annual Network and Distributed Systems Security Symposium (NDSS 2006), San Diego, CA, February 2006.
[94] The Honeynet Project. Know Your Enemy, Second Edition: Learning about Security Threats (2nd Edition). Pearson Education, 2004.
[95] Paul Roberts. New, virulent Cabir mobile phone worms spotted, December 2004. http://www.infoworld.com/article/04/12/28/HNcabir_1.html.
[96] Joan Robert Rocaspana. SHELIA: A Client Honeypot for Client-Side Attack Detection. 2007.
[97] C. Seifert. HoneyC - The Low-Interaction Client Honeypot. 2006. citeseer.ist.psu.edu/seifert06honeyc.html.
[98] Seungwon Shin, Jaeyeon Jung, and Hari Balakrishnan. Malware prevalence in the KaZaA file-sharing network. pages 333-338, 2006.
[99] Lance Spitzner. Honeypots: Tracking Hackers. Addison-Wesley Professional, September 2002.
[100] Dave Taylor. How do I know if an eBay message is phishing?, January 2006. http://www.askdavetaylor.com/how_do_i_know_if_an_ebay_message_is_phishing.html.
[101] Sun Tzu. The Art of War. http://www.gutenberg.org/etext/132.
[102] Dan Verton. Internet fraud expanding, security experts warn. 2003. http://www.computerworld.com/securitytopics/security/cybercrime/story/0,10801,78551,00.html?SKC=cybercrime-78551.
[103] Kathy Wang. Honeyclient Development Project. http://www.honeyclient.org/.
[104] Yi-Min Wang, Doug Beck, Xuxian Jiang, Roussi Roussev, Chad Verbowski, Shuo Chen, and Samuel T. King. Automated Web Patrol with Strider HoneyMonkeys: Finding Web Sites That Exploit Browser Vulnerabilities. In NDSS. The Internet Society, 2006. http://www.isoc.org/isoc/conferences/ndss/06/proceedings/html/2006/papers/honeymonkeys.pdf.
[105] Aaron Weber. Amusement Park Typosquatters Install Toolbar via Exploit. 2007. http://blog.siteadvisor.com/2007/04/amusement_park_typosquatters_i.shtml.
[106] Davey Winder. Dynamic code obfuscation will dominate 2007 malware map. January 2007. http://www.daniweb.com/blogs/entry1225.html.
[107] Tom Zeller. Black Market in Stolen Credit Card Data Thrives on Internet. The New York Times, June 21, 2005.
[108] Lidong Zhou, Lintao Zhang, Frank McSherry, Nicole Immorlica, Manuel Costa, and Steve Chien. A First Look at Peer-to-Peer Worms: Threats and Defenses. In IPTPS, pages 24-35, 2005.