CyberEagle: Automated Discovery, Attribution, Analysis and Risk
Summer 2011 Company Meeting

CyberEagle: Automated Discovery, Attribution, Analysis and Risk Assessment of Information Security Threats

Saby Saha, Narus
Lei Liu, Michigan State University
Prakash Mandayam, Michigan State University

Narus Company Confidential

CyberEagle
• Motivation and Challenges
• Project Layout
• Architecture
• Statistical Machine Learning/Data Mining
• Results
• Conclusion & Future Work

Increasing Security Threats
Zeus: 3.6 million machines [HTML Injection]
Koobface: 2.9 million machines [Social Networking Sites]
TidServ: 1.5 million machines [Email spam attachment]
• Continuous and increased attacks on infrastructure
• Threats to business, national security
• Huge financial stake (Conficker: 10 million machines, loss $9.1 billion)
• Attacks are becoming more advanced and sophisticated
• Honeypots, IDS/IPS, Email/IP Reputation Systems are inadequate

More Sophisticated Attacks

Host Based Security
• Complete monitoring of end-host behavior and the state of the system
• Often analyzes a malware program in a controlled environment to build a model of its behavior
• Pros
  – Information rich view: high detection rate with low false positives
  – Reverse engineer the properties of the threat
• Cons
  – After-the-fact approach
    • Requires malicious code for analysis
  – Fails to identify evolved threats
  – Not effective at identifying zero-day threats

Network Security
• Firewall systems
• IDS/IPS
• Network behavior anomaly detection (NBAD)
• Pros
  – Complete macro view of the network
  – With knowledge of good traffic it can identify anomalies
  – Able to identify new threats as anomalies
• Cons
  – Generates a large number of false positives
  – Unsupervised approach, lacks ground truth

Bringing Them Together
• Leverage the advantages of both approaches
• Host security tags flows with threat signatures
  – Generates ground truth for threats associated with flows
• Network security can learn a rich statistical model for all threats using the flow data tagged with ground truth
• Develop a comprehensive end-to-end data security system for real-time discovery, analysis, and risk assessment of security threats

Enhanced Comprehensive Security System
• Discover common and persistent behavioral patterns for all security threats
  – Even when sessions are encrypted (where IDS/IPS fails)
• Generate precise threat alerts in real time
  – Reduce the false positive rate
• Identify new threats that have some similarities with previous ones
  – Newly evolved version of a threat
  – New threat with a similar behavioral pattern
• Inform the host-security system about newly identified threats

System Overview
[Diagram stages: Model Generation, Classification, Validation, Assessment]

Information Flow
• Extract set of transport layer features
• Generation of statistical models
• Flush out model to streaming classification path
• Redirect packets matching model to binary analysis module
• Extract executable and execute executable
• Analysis of information touched
• Assess the risk
• Increase confidence and alert

Supervised Threat Classification
• Data
  – Network flow features
• Kernel
  – Define similarity between different flows
• Classifier
  – Binary, to separate good from bad
  – Multiclass, to further separate bad flows
• Scalability issues
  – Hierarchy

Challenges
• Irregular data
  – Missing values
  – Imbalanced data
  – Heterogeneous
  – Non-applicable features
• Large number of classes (number of threats reaches hundreds of thousands)
• New classes
• Noise in the data
• All threat classes may not be captured
• Minimize false positives

Preprocessing
• Normalization
• Deal with missing values
  – Case deletion method
  – Mean imputation
    • Over all classes
    • Each individual class
  – Median imputation
    • Over all classes
    • Each individual class

Classifier Framework
[Diagram labels: Flows; SNORT; Bad Flows (76 different classes); 13935 Flows; 44427 Flows; Class 1 Shellcode; Class 2 Spambot_Proxy_Control_Channel; …; Class 76 Exploit_Suspected_PHP_Injection_Attack; Unknown Flows; Supervised Classifier; Learning/Training; Macro-Level Classifier; Unknown; Bad; Micro-Level Classifier; CL_A, CL_B, …, CL_N]

Binary Classifier Results
• Kernel Learning
• Biased SVM performance comparison with different kernels

Metric           Linear Kernel   RBF Kernel   Poly Kernel
Precision good   79.75           87.46        78.70
Recall good      87.07           90.42        97.79
F1 good          83.25           88.9347      87.2126
Precision bad    79.75           69.33        79.78
Recall bad       37.17           62.55        24.81
F1 bad           42.74           65.7657      34.8495
Accuracy         74.08           83.26        78.79
G-mean           56.89           75.21        49.25

Binary Classifier Results
• Parameter selection for Biased SVM with RBF Kernel
  – With gamma = 10 and C+/C- = 0.5, best F1_bad = 0.6494
  – With gamma = 10 and C+/C- = 0.55, best F1_bad = 0.657657

Binary Classifier Results
• F1 bad comparison of the methods for the binary classifier
[Bar charts: "F1 bad comparison without noise" and "F1 bad comparison with noise"; axis labels: KNN, Biased SVM, SMO, Decision Tree, Bagging, Adaboost]
F1 best performance with/without noise: 79.07/88.7 %

Preprocessing (Multiclass)
• Tree based generated features
  – For each class k, do
    • Repeat c times
      – Collect samples from class k, label them +1
      – Collect samples from the complement of class k (k^c), label them -1
      – Build a regression tree on the above binary data
      – Store the tree as T_ik
    • End
  – End
• Example: Tree based features

Original features

Home owner   Marital status   Annual income   Number of children   Age
-            married          125K            -                    41
No           Not married      70K             N/A                  22
No           -                59K             1                    55
yes          Not married      -               N/A                  23
yes          married          100K            1                    -

After transformation

Tree 1     Tree 2     Tree 3     Tree 4     Tree 5
-0.25      -0.25      -1         0.5        -0.25
-1         -1         0.2        0.714286   -0.33333
-0.5       -0.5       1          0.5        -0.5
-1         -0.33333   1          0.25       0.777778
-0.14286   -0.14286   0.142857   -1         -0.14286

Preprocessing
• Multiclass results comparison with
  – Original features
  – Tree based generated features

Average performance of 6 majority classes
[Bar chart: precision, recall and F1 with original features vs. tree based features. The plotted averages match the column means of the table below: 76.40, 77.43, 76.84 for the first column group and 80.21, 81.75, 80.93 for the second.]

Performance of 6 majority classes
[The slide labels the two column groups "Tree based features" and "Original Features"; the transcript does not preserve which label belongs to which group.]

Class ID   Precision   Recall   F1      Precision   Recall   F1
24         77.65       78.30    77.97   86.12       88       87.05
25         63.62       70.02    66.67   79.3        82       80.63
28         99.36       99.70    99.53   100         100      100
48         82.16       73.95    77.84   79.68       77.9     78.78
68         69.05       71.38    70.20   67.7        76       71.61
76         66.58       71.23    68.83   68.45       66.6     67.51

Multi-class Classification
• Identify individual threats
• Identify new classes and provide properties
• Classifiers
  – K-Nearest Neighbor
    • No training involved
    • Computationally intensive for testing
  – Ensemble methods
    • Fail to scale up to a huge number of classes
  – Sphere-based SVM
    • Encapsulate each class in a hypersphere
    • Transform data into an appropriate space such that they cluster into a single cohesive unit

Building Kernel
• Let (X_i, Y_i) be the data points, where Y_i ∈ {+1, -1}
• Construct ground truth kernel K
  – K_ij = Y_i Y_j
• Now learn a parametric kernel f_θ such that
  – K_ij ≈ f_θ(X_i, X_j)

Home owner   Marital status   Annual income   Number of children   Age   Y
-            married          125K            -                    41    +1
No           Not married      70K             N/A                  22    +1
No           -                59K             1                    55    +1
yes          Not married      -               N/A                  23    -1
yes          married          100K            1                    -     -1
-            Married          -               2                    32    -1

K = y y^T

      1    2    3    4    5    6
1    +1   +1   +1   -1   -1   -1
2    +1   +1   +1   -1   -1   -1
3    +1   +1   +1   -1   -1   -1
4    -1   -1   -1   +1   +1   +1
5    -1   -1   -1   +1   +1   +1
6    -1   -1   -1   +1   +1   +1

Once θ is learned, it can be applied to the test set.

Kernel for Multi Class
• For each class we do the following
  – Collect samples belonging to the class and label them +1
  – Collect samples from the rest of the data and label them -1
  – Build a separate kernel for each class: K_ij ≈ f_θ(X_i, X_j)

      1    2    3    4    5    6
1    +1   +1   +1   -1   -1   -1
2    +1   +1   +1   -1   -1   -1
3    +1   +1   +1   -1   -1   -1
4    -1   -1   -1   +1   +1   +1
5    -1   -1   -1   +1   +1   +1
6    -1   -1   -1   +1   +1   +1

Boosted Trees for Kernel Learning
[Diagram: y1 = output of tree 1 and y2 = output of tree 2, each a column vector over the six samples]

Kernel matrix for tree 1

      1    2    3    4    5    6
1    +1   -1   -1   +1   -1   +1
2    -1   +1   +1   -1   +1   -1
3    -1   +1   +1   -1   +1   -1
4    +1   -1   -1   +1   -1   +1
5    -1   +1   +1   -1   +1   -1
6    +1   -1   -1   +1   -1   +1

Ground truth kernel K

      1    2    3    4    5    6
1    +1   +1   +1   -1   -1   -1
2    +1   +1   +1   -1   -1   -1
3    +1   +1   +1   -1   -1   -1
4    -1   -1   -1   +1   +1   +1
5    -1   -1   -1   +1   +1   +1
6    -1   -1   -1   +1   +1   +1

Kernel matrix for tree 2

      1    2    3    4    5    6
1    +1   -1   +1   +1   -1   +1
2    -1   +1   -1   +1   +1   -1
3    -1   +1   +1   -1   +1   -1
4    +1   -1   -1   +1   -1   +1
5    -1   +1   +1   -1   +1   -1
6    +1   -1   -1   +1   -1   +1

Multi class Results
Spheres require only K = 6 comparisons (K = number of classes), whereas KNN requires N comparisons.
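
The kernel construction from the Building Kernel and Boosted Trees for Kernel Learning slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `outer_kernel` and `agreement` are hypothetical, the vector `y1` is an assumed tree output (chosen, up to a global sign, so that its outer product reproduces the "Kernel matrix for tree 1" above), and the entry-wise agreement score is a simple stand-in for whatever fitting criterion the authors used to learn θ.

```python
def outer_kernel(y):
    """Return K with K[i][j] = y[i] * y[j] for a list of +/-1 values."""
    return [[yi * yj for yj in y] for yi in y]


def agreement(a, b):
    """Fraction of entries on which two equally sized matrices agree."""
    n = len(a)
    same = sum(a[i][j] == b[i][j] for i in range(n) for j in range(n))
    return same / (n * n)


# Labels from the Building Kernel slide: samples 1-3 are +1, samples 4-6 are -1.
y = [+1, +1, +1, -1, -1, -1]
K = outer_kernel(y)  # ground truth kernel, K_ij = Y_i * Y_j

# Assumed tree output (hypothetical, up to a global sign): its outer product
# equals the "Kernel matrix for tree 1" shown on the slide.
y1 = [+1, -1, -1, +1, -1, +1]
K1 = outer_kernel(y1)

# Print the ground truth kernel row by row, then compare it with tree 1.
for row in K:
    print(" ".join(f"{v:+d}" for v in row))
print("agreement with tree 1 kernel:", round(agreement(K, K1), 3))
```

A single tree only partly matches the ground truth kernel, which is consistent with the slides combining several trees rather than relying on one.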

Classification + New Class Detection
• Build a separate kernel for each class
• Find transformation to separate class x from rest of data
• Find transformation to separate class + from rest of data
• Find transformation to separate class - from rest of data
• Find transformation to separate class ^ from rest of data

New Class Generation

Conclusion
• CyberEagle: An enhanced comprehensive security system
  – Bringing host and network security together to fight security threats
• Identify threats that IDS/IPS fails to detect (encrypted, evolved)
• Identify new threats at the earliest stage
• Generate signatures for the new threats and alert the host security system in an automated way

Future Work
• Improve classification accuracy
• Scale up to a huge number of classes
• Reduce computation during classification
  – Learn class hierarchy
  – Increase speed without sacrificing accuracy
• Validate with diverse data
• Reputation analysis of the IP addresses
• Online update of the classifier
• MapReduce implementations

Summer 2011 Company Meeting
Thank You
Prakash, Lei, Saby
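
As a rough illustration of the Classification + New Class Detection idea in the slides, the sketch below encloses each class in a sphere and flags a point that falls outside every sphere as a candidate new class. It is not the sphere-based SVM from the talk: the spheres here are a simplification (class centroid plus the largest distance to a training point), no learned kernel or transformation is applied, and the function names and toy data are hypothetical. It does show why classification needs one distance computation per class (K in total) rather than one per training sample.

```python
import math


def fit_spheres(samples, labels):
    """Return {label: (center, radius)} with one enclosing sphere per class."""
    spheres = {}
    for label in set(labels):
        points = [x for x, l in zip(samples, labels) if l == label]
        dim = len(points[0])
        # Centroid of the class; a simplification of a learned sphere center.
        center = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        # Radius just large enough to contain every training point of the class.
        radius = max(math.dist(center, p) for p in points)
        spheres[label] = (center, radius)
    return spheres


def classify(x, spheres):
    """Return the class whose sphere contains x, or None for a candidate new class."""
    best, best_margin = None, None
    for label, (center, radius) in spheres.items():  # one comparison per class
        margin = math.dist(center, x) - radius       # <= 0 means x lies inside
        if margin <= 0 and (best_margin is None or margin < best_margin):
            best, best_margin = label, margin
    return best


# Toy two-class data (hypothetical values, not taken from the talk).
samples = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
labels = ["A", "A", "A", "B", "B", "B"]
spheres = fit_spheres(samples, labels)

print(classify((0.5, 0.5), spheres))   # inside the sphere of class A
print(classify((50, 50), spheres))     # outside every sphere
```

In this toy setup a point far from both classes returns `None`, which is where a new-class generation step would take over.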