URI Anomaly Detection using Similarity Metrics
Tel-Aviv University
Raymond and Beverly Sackler Faculty of Exact Sciences
School of Computer Science

URI Anomaly Detection using Similarity Metrics

This thesis is submitted as partial fulfillment of the requirements towards the M.Sc. degree in the School of Computer Science, Tel-Aviv University

By Saar Yahalom

The research work in this thesis has been conducted under the supervision of Prof. Amir Averbuch

May 2008

ACKNOWLEDGEMENT

I would like to express my sincere gratitude and appreciation to my advisor, Professor Amir Averbuch, for providing me with the unique opportunity to work in the research area of machine learning and security, for his expert guidance and mentorship, and for his help in completing this thesis. Gil David offered much-appreciated advice, support and encouragement throughout the research process, and especially during the experimentation stage; his experience and knowledge helped me save valuable time. I would like to thank the other lab members, Neta, Alon and Shachar, for their help throughout the research and for providing me with a good and enriching environment. Finally, I would like to thank the friends at Tel Aviv University who made my work here fun, and my girlfriend Shlomit, whose support and love helped me to start this degree.

DEDICATION

This thesis is dedicated to my parents, who have always supported me and given the right advice whenever I needed it. Their hard work, love and patience have motivated and inspired me throughout my life.

Abstract

Web servers play a major role in almost all businesses today. Since they are publicly accessible, they are the most exposed servers to attacks and intrusion attempts.
Web server administrators deploy Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS) to protect against these threats. Most IDS rely on signatures to detect attacks. The major drawbacks of signature-based IDS are their inability to cope with "zero day" attacks (new, unknown attacks) and the constant need to update the signature database in order to keep up with the emergence of new threats. This is a known problem, and the security community has realized that an IDS which can detect anomalies (attacks) that deviate from normal web site behavior is needed. In this work, two anomaly detection algorithms are presented. The algorithms are based on similarity metrics, which are used to measure the relation between normal and abnormal data. The algorithms were tested on web server data with two types of similarity metrics: a compression-based similarity metric called NCD, and a similarity metric used in the field of Natural Language Processing called N-gram similarity. The algorithms exhibit a high detection rate together with high performance and can complement a signature-based IDS to provide better security.

Contents

1 Introduction
2 Technical Background
3 Related work on similarity metrics
  3.1 NCD for Similarity and Clustering Algorithm
    3.1.1 Anomaly Detection
    3.1.2 Clustering
  3.2 N-Gram Similarity
4 Anomaly detection algorithms that use similarity metrics
  4.1 S-Score Algorithm Description
    4.1.1 Algorithm discussion
    4.1.2 Algorithm Disadvantages
  4.2 Improving the Detection Efficiency and Accuracy by Clustering
    4.2.1 Choosing Representatives
  4.3 Detection
5 Experimental results for anomaly detection in URIs
  5.1 URIs Datasets
  5.2 The performance of NCD and N-Sim as URI similarity metrics
  5.3 Results of S-Score and NSC for anomaly detection
    5.3.1 S-Score Scoring Functions Comparison
    5.3.2 Performance of NSC vs. S-Score
    5.3.3 Comparison of the accuracy performance in NSC between NCD and N-Sim
    5.3.4 How a different number of representatives in NSC affects the performance
6 Previous Work on URI Anomaly Detection
  6.1 URI Anomaly Detection
  6.2 Prediction By Partial Matching (PPM) Anomaly Detection
7 Conclusion

List of Figures

5.1 The network architecture that collects the data
5.2 2D PCA plotting of table 2; the NCD similarity metric is used
5.3 2D PCA plotting of table 3; the N-Sim similarity metric is used
5.4 2D PCA plotting of table 4; the Cosine similarity metric is used
5.5 Comparison between the accuracy of the scoring functions that use the NCD metric
5.6 Correct detection vs. false negatives of the scoring functions that use the NCD metric
5.7 Comparison between the accuracy of the scoring functions when N-Sim is used
5.8 Correct detection vs. false negatives of the scoring functions when N-Sim is used
5.9 NSC vs. S-Score (NCD)
5.10 Comparison between NCD, N-Sim and a decision tree that is based on both metrics
5.11 Correct detection vs. false negatives that NCD and N-Sim produce
5.12 Performance comparison between different representative set sizes of p%
6.1 Zeta function example

List of Tables

1 PPM model after processing the string abracadabra (maximum order 2). c counts the substring appearances and p is the substring probability.
2 The NCD distance matrix values. Values that are close to zero mean higher similarity. The highlighted numbers, which are less than 0.5, have good correspondence to the URI groups.
3 The N-Sim distance matrix values. Values that are close to zero mean higher similarity. The highlighted numbers, which are less than 0.5, have almost perfect correspondence to the URI groups.
4 The Cosine similarity distance matrix values. Values that are close to zero mean higher similarity. The highlighted numbers, which are less than 0.4, do not correspond to the URI groups. Hence, this metric is not suitable.

1 Introduction

Web servers play a major role in almost all businesses today. They either house the commercial site, connecting its customers to the business data, or form part of a bigger infrastructure that provides web services. Since they are publicly accessible, they are usually the most exposed servers to attacks and intrusion attempts. In 2007, theft of personal data more than tripled, and 62% of the cases involved hacking of web sites¹. In order to fight these attacks, web server administrators deploy Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS). The best-known IDS is Snort², which is a signature-based IDS.
The problems with signature-based systems are "zero day" attacks (new, unknown attacks) and the constant need to update the signature database in order to keep up with the emergence of new threats. Creating a signature that will not cause too many false alarms on legitimate requests is not a trivial task, and it leaves the web server administrator dependent solely on signature subscription services. The security community is aware of these problems, and calls have been made to construct an IDS that can spot anomalies based on normal web site behavior while being independent of signatures. Such a system can either work in parallel with or complement a signature-based IDS to provide tighter security. This work describes an anomaly detection algorithm designed to target these problems. We use the notion of a "similarity metric" in order to measure the similarity of web requests to a known set of normal data that was collected from a web site. Two similarity metrics are used: a compression-based metric called Normalized Compression Distance (NCD), which is closely related to Kolmogorov complexity, and the N-Gram Similarity (N-Sim), which has been used mainly in Natural Language Processing (NLP) to find similar words. The presented algorithm can handle a real-world scenario in terms of performance and false alarm rate. A two-stage algorithm is proposed. Initially, it constructs a set of normal data using clustering and similarity measures; later, it uses the constructed dataset to detect anomalous behavior efficiently and accurately.

¹ USA Today, http://www.usatoday.com/tech/news/computersecurity/infotheft/2007-12-09-data-theft N.htm
² http://www.snort.org

The work has the following structure: In section 2, a technical background is given on topics that are relevant to this research. The reader is referred to this section for an explanation of the methods that have been used in our anomaly detection algorithms.
Related work on similarity metrics, anomaly detection and clustering is presented in section 3. These works use the same similarity metrics that are used later in the experiments section. A detailed explanation of the two anomaly detection algorithms that were developed is given in section 4. Experimental results on URI anomaly detection are presented in section 5; in addition, that section provides a detailed performance comparison between two detection methods that use the NCD and N-Sim similarity metrics. Previous works on URI anomaly detection are reviewed in section 6.

2 Technical Background

In this section, we provide the technical background that is needed for this work.

1. Uniform Resource Identifier (URI) [3] is a compact sequence of characters that identifies an abstract or physical resource. URL is another common name for a URI. All Internet addresses comply with the URI format. When a web page is requested from a web server, a URI, which represents the requested page, is sent. ftp://ftp.is.co.za/rfc/rfc1808.txt, http://www.ietf.org/rfc/rfc2396.txt and mailto:[email protected] are typical URI examples.

2. Kolmogorov Complexity K(X), also known as the algorithmic entropy of a string X, is the length of the minimal description of the string X: K(X) = min{|d(X)|}, where d(X) is a description of the string X. d(X) can be thought of as a computer program that outputs the string X, run on a universal Turing machine. The Kolmogorov complexity is uncomputable ([9]). Since it is uncomputable, compression can be used to approximate the value of K(X). Assume that C(X) is the length in bits of the compressed version of the string X produced by a compressor C; then K(X) can be approximated by K(X) ≈ C(X) [10]. Examples of compressors are LZW (item 5 in this list), BZip2 (item 6) and PPM (item 8). This approximation is loose. Consider the description of π.
A program that computes the first 200,000 digits of π will be shorter than the compression of those 200,000 digits. Nevertheless, the idea of approximating the Kolmogorov complexity by compression is utilized in this work.

3. Similarity Metric. Similarity is a quantity that reflects the strength of the relationship between two objects. The range of this quantity is either [−1, 1], or [0, 1] in its normalized form. Distance and metric [20]: Without loss of generality, a distance needs only to operate on finite sequences of 0's and 1's, since every finite sequence over a finite alphabet can be represented by a finite binary sequence. Formally, a distance is a function D with nonnegative real values, defined on the Cartesian product X × X, such that D : X × X → R⁺. It is called a distance metric on X if for every x, y, z ∈ X:

• D(x, y) = 0 iff x = y (the identity axiom);
• D(x, z) ≤ D(x, y) + D(y, z) (the triangle inequality);
• D(x, y) = D(y, x) (the symmetry axiom).

A set X, which is provided with a metric, is called a metric space. For example, every set X has the trivial discrete metric: D(x, y) = 0 if x = y and D(x, y) = 1 otherwise. In this work, we use the notion defined in [20]: a normalized distance, or similarity distance, is a function d : Ω × Ω → [0, 1] that is symmetric (d(x, y) = d(y, x)) and satisfies, for every x ∈ {0, 1}* and every constant e ∈ [0, 1],

|{y : d(x, y) ≤ e ≤ 1}| < 2^(eK(x)+1)    (2.1)

where K(x) is the Kolmogorov complexity of x.

4. Normalized Compression Distance (NCD) [10] is defined as:

NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}    (2.2)

where C(xy) is the size of the compressed string that was produced from the concatenation of the strings x and y. The NCD satisfies the following inequalities, which define a distance metric:

(a) NCD(x, x) = 0 (the identity axiom);
(b) NCD(x, y) = NCD(y, x) (symmetry);
(c) for C(x) ≤ C(y) ≤ C(z), NCD(x, y) ≤ NCD(x, z) + NCD(z, y) (the triangle inequality).
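As a minimal sketch of how Eq. (2.2) is computed in practice, the following uses Python's zlib as a stand-in for the compressor C (any real compressor such as bzip2 or PPM could be substituted; the sample URIs are illustrative only):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    # Normalized Compression Distance, Eq. (2.2), with zlib standing
    # in for the compressor C. Smaller values mean more similar.
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"GET /index.php?id=1 HTTP/1.1" * 20
b_ = b"GET /index.php?id=2 HTTP/1.1" * 20
c = bytes(range(256)) * 3
# Similar strings compress well together, so ncd(a, b_) is small,
# while ncd(a, c) is close to 1.
```

Because real compressors are imperfect approximations of K(·), ncd(x, x) is close to, but not exactly, zero.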
In addition, NCD satisfies inequality (2.1), and in most cases NCD(x, y) ∈ [0, 1], which practically makes it a similarity metric. NCD approximates the information distance defined in [10, 20]. In other words, NCD(x, y) measures how well x describes y, or more specifically, how well the knowledge of x helps to compress y. More information can be found in [10, 20].

5. LZW Compression [28, 27] is a universal lossless data compression algorithm. The compressor builds a string dictionary from the text being compressed. The dictionary maps fixed-length codes to strings and is initialized with all single-character strings (256 entries for ASCII). The compressor reads characters one at a time, concatenating each to the current word, until the resulting word does not exist in the dictionary. At that point the new word is added to the dictionary, the code of the longest matching word is output, and the compressor starts again from the character that did not match. Being a universal compression algorithm, LZW is not optimal; as the data becomes larger, LZW asymptotically approaches optimal compression.

6. BZip2 Compression [8] uses the Burrows-Wheeler block-sorting text compression algorithm [8] and Huffman coding. This compression is considered to be better than what is achieved by LZ77/LZ78-based compressors. It approaches the performance of PPM (item 8 in this list), which is a family of statistical compressors.

7. Arithmetic Coding [14] is a method for lossless data compression. It is a form of variable-length entropy encoding that converts a string into another representation, using fewer bits for frequently used characters and more bits for infrequently used ones. Arithmetic coding encodes the entire message into a single number in [0, 1].

8.
Prediction By Partial Matching (PPM) Compression [12, 11, 7] is a data compression scheme that over the last 15 years has outperformed other lossless text compression methods. PPM is a finite-context statistical modeling technique that can be viewed as blending together several fixed-order context models to predict the next character in the input sequence. It is convenient to assume that an order-k PPM model stores a table of all subsequences up to length k that occur anywhere in the training text. For each such subsequence (or context), the frequency counts of the characters that immediately follow it are maintained. When compressing the i-th character sᵢ of a string s, the statistics associated with the previous k characters sᵢ₋ₖ, ..., sᵢ₋₁ are used to predict the current character. If no such statistic exists, a shorter context is used for the prediction. The predicted probability is used together with an arithmetic coder to store the character. Table 1 presents an example of PPM modeling (see [11]).

Order k = 2           Order k = 1          Order k = 0          Order k = -1
Prediction  c  p      Prediction  c  p     Prediction  c  p     Prediction  c  p
ab → r      2  2/3    a → b       2  2/7   → a         5  5/16  → A         1  1/|A|
   → Esc    1  1/3      → c       1  1/7   → b         2  2/16
ac → a      1  1/2      → d       1  1/7   → c         1  1/16
   → Esc    1  1/2      → Esc     3  3/7   → d         1  1/16
ad → a      1  1/2    b → r       2  2/3   → r         2  2/16
   → Esc    1  1/2      → Esc     1  1/3   → Esc       5  5/16
br → a      2  2/3    c → a       1  1/2
   → Esc    1  1/3      → Esc     1  1/2
ca → d      1  1/2    d → a       1  1/2
   → Esc    1  1/2      → Esc     1  1/2
da → b      1  1/2    r → a       2  2/3
   → Esc    1  1/2      → Esc     1  1/3
ra → c      1  1/2
   → Esc    1  1/2

Table 1: PPM model after processing the string abracadabra (maximum order 2). c counts the substring appearances and p is the substring probability.

9. N-Gram is a series of N consecutive characters in a string.
For example, the string "Example of n-grams" contains the following 2-grams: {"Ex", "xa", "am", "mp", "pl", "le", "e ", " o", "of", "f ", " n", "n-", "-g", "gr", "ra", "am", "ms"} and the following 3-grams: {"Exa", "xam", "amp", "mpl", "ple", "le ", "e o", " of", "of ", "f n", " n-", "n-g", "-gr", "gra", "ram", "ams"}. The N-gram is a common tool in NLP.

10. N-Gram Vector of a given string is a vector of size |Σ|^N, where Σ is the alphabet. Each entry in the vector is the number of occurrences of an N-gram in the original string. For example, the bi-gram vector of the string "AABCA" over the alphabet {A, B, C} is q = (1, 1, 0, 0, 0, 1, 1, 0, 0), where the entries count the occurrences of AA, AB, AC, BA, BB, BC, CA, CB and CC, respectively.

11. N-Gram Euclidean Distance is measured between strings. Each string is represented by an N-gram vector, and the distance between two strings is the Euclidean distance between their corresponding N-gram vectors:

euc(q, r) = (Σ_y (q(y) − r(y))²)^(1/2).

The 2-gram Euclidean distance between the strings s1 = "AABCA" and s2 = "BCAAB" is euc((1, 1, 0, 0, 0, 1, 1, 0, 0), (1, 1, 0, 0, 0, 1, 1, 0, 0)) = 0. We can see that even though these strings are not the same, their 2-gram distance is equal to zero. Their 1-gram distance over the same alphabet is euc((3, 1, 1), (2, 2, 1)) = √((3 − 2)² + (1 − 2)² + (1 − 1)²) = √2.

12. N-Gram Cosine Similarity is measured between strings. As with the N-gram Euclidean distance, each string is represented by an N-gram vector. The N-gram cosine similarity between two strings is the cosine similarity between their corresponding N-gram vectors, defined for two vectors q, r by

cos(q, r) = Σ_y q(y)r(y) / ((Σ_y q(y)²)^(1/2) · (Σ_y r(y)²)^(1/2)).

13. Hamming Distance between two bit strings of equal length is the number of positions at which the corresponding bits differ. It measures the minimum number of substitutions required to change one bit string into the other.
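Items 10-13 above can be sketched directly; the following is a minimal illustration (the helper names are our own, not part of the cited works), ending with the Hamming count used in the example below:

```python
import math
from itertools import product

def ngram_vector(s, n, alphabet):
    # Item 10: occurrence counts of every possible n-gram over the
    # alphabet, giving a vector of size |alphabet|**n.
    grams = ("".join(p) for p in product(alphabet, repeat=n))
    return [sum(1 for i in range(len(s) - n + 1) if s[i:i + n] == g)
            for g in grams]

def euclidean(q, r):
    # Item 11: Euclidean distance between n-gram vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, r)))

def cosine(q, r):
    # Item 12: cosine similarity between n-gram vectors.
    num = sum(a * b for a, b in zip(q, r))
    return num / (math.sqrt(sum(a * a for a in q)) *
                  math.sqrt(sum(b * b for b in r)))

def hamming(a, b):
    # Item 13: positions at which equal-length bit strings differ.
    return sum(x != y for x, y in zip(a, b))

q = ngram_vector("AABCA", 2, "ABC")  # [1, 1, 0, 0, 0, 1, 1, 0, 0]
r = ngram_vector("BCAAB", 2, "ABC")
# euclidean(q, r) == 0.0 although the strings differ (see item 11).
```
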
For example, the Hamming distance between 0101 and 1100 is two.

14. Levenshtein Distance (Edit Distance) [19] is a string metric, sometimes referred to as an edit distance. It can be thought of as a generalization of the Hamming distance. The Levenshtein distance between two strings is the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion or substitution of a single character. It is usually normalized by the length of the longer of the two compared strings; this normalized distance is called the Normalized Edit Distance (NED). NED is a similarity metric as defined earlier (item 3 in this list).

15. N-Gram Similarity and Distance (N-Sim) [16] generalizes the concept of the longest common subsequence to encompass N-grams rather than just unigrams. Given two strings X = x₁...x_k and Y = y₁...y_l, let Γ_{i,j} = (x₁...x_i, y₁...y_j) and Γⁿ_{i,j} = (x_{i+1}...x_{i+n}, y_{j+1}...y_{j+n}). The strings are divided into all possible substrings of N consecutive characters. These substrings are aligned and compared, and a score is given according to the formula:

s₁(x, y) = 1 if x = y and 0 otherwise;    s_N(Γᴺ_{i,j}) = (1/N) Σ_{u=1}^{N} s₁(x_{i+u}, y_{j+u}).

The N-gram similarity looks for an alignment of a consecutive set of substrings whose sum is the highest. N-gram similarity is defined by the following recursive definition:

S_n(Γ_{k,l}) = max{ S_n(Γ_{k−1,l}), S_n(Γ_{k,l−1}), S_n(Γ_{k−1,l−1}) + s_n(Γⁿ_{k−n,l−n}) }
S_n(Γ_{k,l}) = 0 if (k = n and l < n) or (k < n and l = n)
S_n(Γ_{n,n}) = S_n(Γ_{0,0}) = { 1 if x_u = y_u for 1 ≤ u ≤ n; 0 otherwise }.    (2.3)

The N-Sim algorithm, which is a dynamic programming solution to the recursion, is:

Algorithm 1: N-SIM1 algorithm.
Data: X - first string to compare, Y - second string to compare, S - two-dimensional matrix that holds the calculations throughout the iterations.
1  N-SIM1(X, Y)
2  begin
3    Let K ← length(X);
4    Let L ← length(Y);
     // add the substrings x₁0^(N−1) and y₁0^(N−1) to the strings X, Y.
5    for u ← 1 to N − 1 do
6      X ← x₁0 + X;
7      Y ← y₁0 + Y;
     // initialize the first row and column of S with zeros.
8    for i ← 0 to K do
9      S[i, 0] ← 0;
10   for j ← 1 to L do
11     S[0, j] ← 0;
     // calculate S_n(Γ_{k,l}).
12   for i ← 1 to K do
13     for j ← 1 to L do
14       S[i, j] ← max{ S[i − 1, j], S[i, j − 1], S[i − 1, j − 1] + s_N(Γᴺ_{i−N,j−N}) };
15   return S[K, L] / max(K, L);

In our work, we use N-SIM = 1 − N-SIM1, which serves as a similarity metric as was defined earlier (item 3 in this list).

16. K-Means Clustering [21] is a popular clustering algorithm. Given k clusters that are fixed a priori and n data points, the algorithm defines k centroids c₁, ..., c_k. Each data point is assigned to the cluster of the nearest centroid. At this point, new centroids are chosen and the data points are assigned again. The process is repeated until the selection of the centroids does not change from the previous iteration. The selection of the first centroids greatly affects the quality and the accuracy of the algorithm. The algorithm aims at minimizing the objective function J = Σ_{j=1}^{k} Σ_{i=1}^{n} ||x_i^{(j)} − c_j||².

17. Fuzzy C-Means Clustering [13] is a fuzzy clustering algorithm. It is called fuzzy because a data point can belong to more than one cluster at once. The algorithm is similar to the K-Means algorithm. Given the number of clusters C and n data points, the algorithm is based on the minimization of the following objective function: J = Σ_{i=1}^{n} Σ_{j=1}^{C} u_ij^m ||x_i − c_j||², where m is a real number greater than 1 and u_ij is called the membership function; it measures the membership degree of x_i in cluster j. u_ij and c_j are defined as:

u_ij = 1 / Σ_{k=1}^{C} (||x_i − c_j|| / ||x_i − c_k||)^(2/(m−1)),    c_j = Σ_{i=1}^{N} u_ij^m · x_i / Σ_{i=1}^{N} u_ij^m.
The iteration stopping rule is max_ij |u_ij^(k+1) − u_ij^(k)| < ε, where k is the iteration number. This stopping rule differs from that of K-Means.

18. Fuzzy C-Medoids Clustering (FCMdd) [17] is a fuzzy clustering algorithm. Its objective function is based on selecting c representatives (medoids) from the dataset in such a way that the total fuzzy dissimilarity within each cluster is minimized. The algorithm is close to the Fuzzy C-Means algorithm, but its complexity is lower and it produces a good clustering. This algorithm has a linearized version, which we will refer to as FuzzyMedoids. The linearized version picks the medoids from a set X by minimizing the fuzzy dissimilarity over the top items in each cluster. The p·|X| top items of a cluster are those that got the highest membership scores in this cluster.

Algorithm 2: FuzzyMedoids algorithm.
Data: X - data items to be clustered, c - number of clusters, r(·, ·) - distance function, X(p)_i - top p members of cluster i.
1   FuzzyMedoids(X, c, r(·, ·))
2   begin
3     Set iter = 0;
      // method II in [17]
4     Pick the initial set of medoids V = {v₁, v₂, ..., v_c} from X;
5     repeat
6       for i = 1, 2, ..., c do
7         for j = 1, 2, ..., |X| do
8           Compute memberships u_ij (Eq. 2.4) and identify the top members X(p)_i, i = 1, 2, ..., c;
9       Store the current medoids: V_old = V;
10      for i = 1, 2, ..., c compute the new medoid v_i do
11        q = argmin_{x_k ∈ X(p)_i} Σ_{j=1}^{n} u_ij^m r(x_k, x_j);
12        v_i = x_q;
13      iter = iter + 1;
14    until V_old = V or iter = MAX_ITER;

The membership function is:

u_ij = (1/r(x_j, v_i))^(1/(m−1)) / Σ_{k=1}^{c} (1/r(x_j, v_k))^(1/(m−1))    (2.4)

where m is a real number greater than 1 and r(·, ·) is a distance function.

3 Related work on similarity metrics

Related work on similarity metrics in relation to anomaly detection and clustering is described in this section. Special attention is given to [10], which demonstrates the flexibility of the NCD similarity metric.
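Several of the works surveyed below also rely on the N-Sim metric (item 15 above). As a point of reference, here is a minimal sketch of the N-SIM1 dynamic program for the unigram case N = 1, where the recursion reduces to the longest-common-subsequence length normalized by the longer string (and the padding step is a no-op):

```python
def nsim1(x: str, y: str) -> float:
    # N-SIM1 for N = 1: the recursion of Eq. (2.3) reduces to the
    # longest-common-subsequence length, normalised by max(|X|, |Y|).
    K, L = len(x), len(y)
    S = [[0] * (L + 1) for _ in range(K + 1)]
    for i in range(1, K + 1):
        for j in range(1, L + 1):
            match = 1 if x[i - 1] == y[j - 1] else 0
            S[i][j] = max(S[i - 1][j], S[i][j - 1], S[i - 1][j - 1] + match)
    return S[K][L] / max(K, L)

# The distance used in this work is then N-SIM = 1 - nsim1(x, y).
```

The general N > 1 case additionally scores partial n-gram matches via s_N, which the sketch above omits.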
3.1 NCD for Similarity and Clustering Algorithm

3.1.1 Anomaly Detection

Network Traffic Analysis [26] uses the NCD similarity metric to identify network traffic that is similar to previously known attack patterns. A Snort [25] plug-in was developed in order to calculate the NCD value of a reassembled TCP session against a database of known attack sessions. The NCD metric was also tested on clustering worm executables. The results showed that different variations of the same worms were grouped together, while regular executables were separated from the rest of the worms. This makes it possible to test efficiently whether an executable is a suspected worm, and if so, which worm family it is similar to.

Anomaly Detection for Video Surveillance Applications [1] is a system that detects anomalies in surveillance images using the NCD similarity metric. At first, the system collects a set of pictures that serves as a database of normal data. Later, a threshold is defined, and each picture for which the NCD scores above this threshold is considered an anomaly. The compression is achieved via a lossy compressor.

Masquerader Detection [4, 5] focuses on detecting those who hide their identity by impersonating other people while using their computer accounts. Masquerader detection uses the NCD similarity metric that was introduced in [4]. A user's command-line data is examined. A block of 5000 command lines serves as the database for each user. Then, blocks of 100 command lines each are tested against the user's database. This test calculates the NCD between the tested block and the database using several compressors, such as PPM, BZip, Compress and Zlib. A block whose NCD score is above a given threshold is classified as an anomaly. The results are comparable with previous work [22] on the same data that uses specific features in the data.

3.1.2 Clustering

Clustering by Compression [10] introduces the notion of NCD and its definition as a similarity metric.
The NCD metric in [10] is used as a method for hierarchical clustering that is based on a fast randomized hill-climbing heuristic of a new quartet-tree optimization criterion. It provides a wide range of clustering results. An open-source toolkit, called CompLearn [23], is used to test the performance of the clustering method and the similarity metric. Here are some of their clustering experimental results:

1. Genomics and Phylogeny: The authors reconstruct the phylogeny of eutherians (placental mammals) by comparing their whole mitochondrial genomes. The mitochondrial genomes of 20 species were taken and the NCD distance matrix was computed using different compressors (gzip, BZip2, PPMZ). Next, the quartet clustering method was used. The results matched the commonly believed morphology-supported hypothesis. In another experiment, the sequenced genome of the SARS virus was clustered with 15 similar viruses. The NCD distance matrix was calculated using the BZip2 compressor. The results were very similar to the definitive tree, based on medical-macrobio-genomics analysis, posted in the New England Journal of Medicine.

2. Language Trees: "The Universal Declaration of Human Rights" in 52 languages was clustered. The NCD distance matrix was calculated using the gzip compressor. The resulting language tree improved on a similar experiment by [2].

3. Literature: The texts of several Russian writers (Gogol, Dostojevski, Tolstoy, Bulgakov, Tsjechov), with three to four texts each, were clustered. The NCD distance matrix was calculated using a PPMZ compressor. It provided a perfect clustering of these texts.

4. Music: 36 Musical Instrument Digital Interface (MIDI) files were clustered. The NCD distance matrix was calculated using a bzip2 compressor. The results grouped together works by the same composers, with several mistakes. Overall, the results were encouraging.

5.
Optical Character Recognition: Two-dimensional images of handwritten numbers were clustered. The NCD distance matrix was calculated using a PPMZ compressor. 28 out of 30 (93%) images were clustered correctly. Later, an SVM was trained for single-digit recognition using vectors of NCD measures as features. The digit recognition accuracy was 87%, which is the current state of the art for this problem.

6. Astronomy: Observations of the microquasar GRS 1915+105, made with the Rossi X-ray Timing Explorer, were analyzed. 12 objects were clustered. The NCD distance matrix was calculated using the PPMZ compressor. The results were consistent with the classifications made by experts on these observations.

Summing up, this work ([23]) showed promising results in non-feature-based clustering. It lays the foundation for exploring the NCD metric in new fields, such as the one we are exploring, namely, anomaly detection in URIs.

3.2 N-Gram Similarity

N-Gram Similarity ([16]) is a generalized version of the Longest Common Subsequence (LCS) problem. The algorithm is described in the technical background (see item 15 in sec. 2). N-Gram Similarity was designed to provide similarity scores between words or strings. It was tested on a few datasets: genetic cognates, translational cognates and confusable drug names. Cognates are words of the same origin that belong to distinct languages, for example English "father" and German "vater". The N-Gram Similarity scores are compared to a few other scores, such as NED (see item 14 in sec. 2), DICE and LCS. The results showed a clear accuracy advantage for the N-Gram Similarity and N-Gram Distance developed by [16]. Another conclusion from these results is that N-Gram Similarity and N-Gram Distance provide almost the same accuracy.

4 Anomaly detection algorithms that use similarity metrics

In this section, we describe new algorithms that detect anomalies by using similarity metrics. Similarity metrics provide a normalized measurement in the range [0, 1].
Unnormalized metrics make it difficult to determine which items are "close" to or "far" from each other. Similarity metrics provide a solution to this issue. First, we introduce an anomaly detection method, based on similarity metrics and scoring functions, that we name S-Score. Next, we introduce a novel two-stage anomaly detection algorithm, based on clustering and similarity metrics, that overcomes the downsides of the S-Score method; it is named Nearest Similar Centroid (NSC).

4.1 S-Score Algorithm Description

The algorithm compares a tested item with a normal dataset and gives it an anomaly score. The anomaly score is calculated by a scoring function that uses a similarity metric. We assume that an item which is similar to a large portion of a normal dataset will receive a low anomaly score. The S-Score detection method performs the following:

1. An item u, which is a new arrival, is measured against a database of normal data T using a similarity measuring function m(·, ·). m(·, ·) is based on either the NCD or the N-Sim similarity metric. The results are stored in a vector V.

2. The result vector V is given a score by the scoring function s(V). Another way to describe it is that the energy stored in V is calculated by an energy function.

3. The score (energy) of the vector V is tested against a threshold. The higher the score, the more anomalous the item.

Formally, the algorithm is:

Algorithm 3: S-Score detection method.
Data: u - the item being tested, T - the normal dataset
Result: The anomaly score for u
1 S-Score(u, T)
2 begin
3     i ← 0;
4     foreach t in T do
5         V[i] ← m(u, t); i ← i + 1;
6     item_score ← s(V)/|T|;
7     return item_score;

Several similarity measuring functions m(·, ·) were tested:

• m(x, y) = NCD(x, y) - NCD similarity metric (Eq. 2.2).
• m(x, y) = N-Sim(x, y) - N-Sim similarity metric (see 15 in sec. 2).

Several scoring functions were tested:

s(X) = Σ_i |x_i| - l1 norm.    (4.1)

s(X) = Σ_i x_i / (max{X} − median{X})² - vector sum with median scaling.    (4.2)

s(X) = Σ_i x_i / (max{X} − mean{X})² - vector sum with mean scaling.    (4.3)

s(X) = Σ_i e^(−|1−x_i|) - vector sum with exponential smoothing.    (4.4)

s(X) = Σ_i [ e^(−(x_i−μ_high)²/(2σ²_high)) / (n_high·σ_high) + e^(−(x_i−μ_low)²/(2σ²_low)) / (n_low·σ_low) ] - sum of the vector with Gaussian smoothing; different scaling is used for high- and low-similarity items.    (4.5)

s(X) = Σ_i x_i² - l2 norm (energy).    (4.6)

4.1.1 Algorithm discussion

The use of scoring functions enables similarity and dissimilarity measures to be treated differently: a different weight can be applied when comparing similar and less similar items. This makes the algorithm flexible and tunable, since the scoring function can be adjusted to take problematic measures into account and to apply correcting weights to them. The results from this detection method were satisfactory, as demonstrated in section 5.3.1: (Eq. 4.6) is the best scoring function and the NCD is the best similarity measure. The time complexity of this algorithm is O(|T|), which is linear in the size of the training dataset. The reason that N-Sim did not succeed as well is its high anomaly score when short and long strings are compared. This can probably be solved by dividing the database into 3 or 4 length groups; a string would then be compared only to its length group, while a different threshold is maintained for each individual group.

4.1.2 Algorithm Disadvantages

While testing this method, three main problems surfaced:

Efficiency: the most severe problem was time performance. A large database is needed in order to produce accurate detections. This requires many comparisons with many compression operations, which are CPU intensive, so achieving satisfactory accuracy makes this detection method impractical.

Short URIs compared to long URIs: when a short string is compared to a long one, the compared string is given a high anomaly score. This is true for both the NCD and N-Sim measures, although it is less noticeable when NCD is used.
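The length bias can be demonstrated directly. N-Sim generalizes edit distance to n-grams, so a plain normalized Levenshtein distance (a simplified stand-in, not the thesis's N-Sim implementation) already shows the effect: a short benign URI compared with a long benign URI can never score below their relative length difference.

```python
def ned(s: str, t: str) -> float:
    """Normalized edit distance: Levenshtein(s, t) / max(len(s), len(t)).
    A simplified stand-in for the N-Sim distance."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                           # deletion
                         cur[j - 1] + 1,                        # insertion
                         prev[j - 1] + (s[i - 1] != t[j - 1]))  # substitution
        prev = cur
    return prev[n] / max(m, n)

short_uri = "/index.html"
long_uri = "/cnn/2008/CRIME/04/05/texas.ranch/tzpop.compound.housing.cnn.jpg"
# Converting the short URI into the long one requires at least
# len(long_uri) - len(short_uri) insertions, so the distance is high
# even though both URIs are benign.
d = ned(short_uri, long_uri)
```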
This length bias greatly affects the measurement of a string against a collection of strings of variable lengths. The results obtained with NCD are very good, due mainly to the scoring function, and could be even better if short strings received special treatment.

Noise: a large database is needed in order to achieve accurate results. The large dataset contains URIs with random sections, malformed URIs and even several attacks. This adds a lot of noise to the scoring calculations, which means that the score of normal URIs was sometimes shifted towards the anomalous end and vice versa.

4.2 Improving the Detection Efficiency and Accuracy by Clustering

The improvement idea is based on separating the detection into two stages. First, a group of "good" representatives is chosen from the training set. Then, the tested item is measured against the chosen representatives group. Two main questions arise: How do we choose representatives that can be considered "good"? How do we compare the tested items with the chosen representatives?

4.2.1 Choosing Representatives

The representatives have to be diverse while still being similar to most of the other items. To obtain these characteristics, the items have to be divided into groups according to their similarity. From each of these groups, we pick the items that are most similar to the other items in that group. Clustering is used to construct these groups. Then, items which are located in the center of each of these clusters are chosen. If the clustering is done properly, we guarantee diversity between the groups. Picking items from the center of the clusters guarantees that they will be the most similar to the rest of the items in their cluster.

There are many ways to cluster. These methods can be classified into three major groups [15]:

1. Hierarchical: Algorithms that are based on a linkage metric. Their complexity is O(n²), which rules them out for large databases.

2. Partitioning: There are several algorithms in this group. K-Means is the most famous one; others are, for example, K-Medoids, PAM, CLARA and CLARANS. Another group of algorithms are the fuzzy K-Means and K-Medoids variants, such as Fuzzy C-Means and Fuzzy Medoids. The main problem is how to determine the value of K, the number of clusters. The complexity varies from O(n²) to O(n), depending on the algorithm and its implementation.

3. Density Based: Algorithms that are based on a density function. Items are clustered together with their closest neighbors within a certain ε-neighborhood. Some examples are DBSCAN, GDBSCAN, OPTICS and DENCLUE. One advantage is that the number of clusters is not set in advance, although other parameters have to be set in advance for most of these algorithms. The complexity varies from O(n²) to O(n·log n).

Fuzzy Medoids (see 18 in section 2) was chosen because of its low complexity (O(n)). The items are clustered into K initial clusters, where K is unknown. Each cluster's homogeneity is tested, and if it is above a threshold then the cluster is re-clustered into K̃ < K subclusters, until we are left with small uniform clusters. This is in fact a top-down hierarchical clustering: in each clustering step, a node is divided into K̃ child nodes. During the clustering, very small clusters, whose size is less than the MINSIZE parameter, are considered noise and are removed from the dataset.

The initial clustering into the first K clusters can be done in several ways that utilize additional knowledge we have about the data and about the behavior of the similarity metric. For example, we used the fact that short URIs are distant from long URIs: separating the initial clusters by length groups produces better accuracy. The homogeneity of the clusters can be measured in different ways.
We chose to calculate the standard deviation of the similarity distances between the cluster items and the cluster centroid:

Homogeneity(C) = StdDev{ m(u, c) | u ∈ C, c is the centroid of C }.    (4.7)

Another tested way to calculate the cluster homogeneity, which is not described here, was to use the mean of the similarity distances between the centroid and the cluster's items. The algorithm is called Nearest Similar Centroid and will be referred to as the NSC algorithm or the NSC detection method.

Formally, the clustering algorithm can be described by:

Algorithm 4: Clustering method.
Data: T - training dataset, K - number of initial clusters, K̃ - number of sub-clusters, HLimit - homogeneity limit, NC - final number of clusters, r(·, ·) - a distance metric (e.g. NCD, N-Sim)
Result: Clusters = { (t_{1,1}, t_{1,2}, ..., t_{1,n_1}), ..., (t_{NC,1}, t_{NC,2}, ..., t_{NC,n_NC}) | t_{i,j} ∈ T, Σ_{i=1}^{NC} n_i = |T| }    (4.8)
1 ClusterTrainingSet(T, K, K̃, HLimit, r(·, ·))
2 begin
3     NonHomogeneicClusters = ∅;
4     Clusters = ∅;
5     C = FuzzyMedoids(T, K, r(·, ·));    // see 18 in sec. 2
6     repeat
7         foreach cluster ∈ C do
8             if Homogeneity(cluster) < HLimit (Eq. 4.7) then
9                 Clusters = Clusters ∪ {cluster};
10            else
11                NonHomogeneicClusters = NonHomogeneicClusters ∪ {cluster};
12        if NonHomogeneicClusters ≠ ∅ then
13            // FirstOf() gets the first item from a given set and removes it.
14            FirstCluster = FirstOf(NonHomogeneicClusters);
15            C = FuzzyMedoids(FirstCluster, K̃, r(·, ·));
16    until NonHomogeneicClusters = ∅;
17    return Clusters;

The FuzzyMedoids clustering method uses a similarity measuring function during clustering. Several similarity functions were tested: NCD, N-Gram Similarity and N-Gram Cosine Similarity. The N-Gram Cosine Similarity provided fast calculations together with acceptable clusters when N = 6. The homogeneity threshold HLimit can be thought of as the allowed percentage of deviation from the cluster's centroid.
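The splitting loop of Algorithm 4 can be sketched as follows. The thesis's Fuzzy Medoids step is not reproduced here; a minimal plain k-medoids with farthest-first seeding serves as a hypothetical stand-in, and `dist` is any normalized metric such as NCD or N-Sim. The `min_size` filter corresponds to the MINSIZE noise-removal parameter.

```python
from statistics import pstdev

def k_medoids(items, k, dist, iters=5):
    """Minimal k-medoids with farthest-first seeding: a simplified
    stand-in for the Fuzzy Medoids step used in the thesis."""
    medoids = [items[0]]
    while len(medoids) < k:
        medoids.append(max(items, key=lambda x: min(dist(x, m) for m in medoids)))
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in medoids]
        for x in items:
            best = min(range(len(medoids)), key=lambda i: dist(x, medoids[i]))
            clusters[best].append(x)
        clusters = [c for c in clusters if c]
        medoids = [min(c, key=lambda x: sum(dist(x, y) for y in c)) for c in clusters]
    return clusters

def homogeneity(cluster, dist):
    """Eq. 4.7: standard deviation of the distances between the
    cluster items and the cluster centroid (here: the medoid)."""
    c = min(cluster, key=lambda x: sum(dist(x, y) for y in cluster))
    return pstdev(dist(u, c) for u in cluster)

def cluster_training_set(T, K, K_sub, h_limit, dist, min_size=1):
    """Top-down splitting: re-cluster every non-homogeneous cluster
    into K_sub sub-clusters until all remaining clusters are uniform;
    clusters smaller than min_size are dropped as noise."""
    done, pending = [], k_medoids(list(T), K, dist)
    while pending:
        c = pending.pop()
        if len(c) < min_size:
            continue  # noise removal
        if len(c) <= K_sub or homogeneity(c, dist) < h_limit:
            done.append(c)
        else:
            parts = k_medoids(c, K_sub, dist)
            if len(parts) > 1:
                pending.extend(parts)
            else:
                done.append(c)  # cannot be split further; accept as-is
    return done
```

As a toy illustration of the length-group idea mentioned above, a `dist` based on relative length difference already separates short URIs from long ones into homogeneous clusters.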
This interpretation holds because all the distances are normalized to lie in [0,1]. We got good results when the HLimit values correspond to between 10% and 15% deviation. Finally, representatives are chosen from the clusters. The centroid of each cluster is the item which is the most similar to all the others; that is, the centroid's average distance from all the items in the cluster is minimal. With this insight we get

Centroid(C_i) = argmin_{x ∈ C_i} { (1/|C_i|) Σ_{y ∈ C_i} dist(x, y) } = argmin_{x ∈ C_i} { Σ_{y ∈ C_i} dist(x, y) }.

From each cluster, we pick the top p% centroids. This means that larger clusters will receive better representation.

Algorithm 5: Find Centroids method.
Data: C - the cluster, n - the number of centroids to find, m(·, ·) - a similarity measuring function.
Result: A list containing n centroids.
1 FindCentroids(C, n)
2 begin
3     Centroids = ∅;
4     for i = 1 to n do
5         Centroid = argmin_{x ∈ C} { Σ_{y ∈ C} m(x, y) };
6         C = C − {Centroid};
7         Centroids = Centroids ∪ {Centroid};
8     return Centroids;

The similarity measuring function m(·, ·) is either NCD or N-Sim. The centroids should be chosen using the same similarity metric that is used later in the detection method. If a different similarity metric were used instead, the chosen centroids would not represent the centers of the clusters and the classification of new items would fail.

4.3 Detection

The representatives are the centers of the generated clusters. A new item can be classified into one of the clusters by measuring its similarity to the center of each cluster. If the similarity measure of the item is within a given radius of one of the centers, the item is classified into the relevant cluster; otherwise, it is an anomaly. Since we have a score for the tested item, we can evaluate how anomalous it is. The time performance gain is due to the fact that the test is done against a small fraction of the training dataset.
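Algorithm 5's greedy selection translates almost directly into code; `m` below stands for the similarity measuring function (NCD or N-Sim in the thesis):

```python
def find_centroids(cluster, n, m):
    """Greedily pick up to n centroids: each step selects the item with
    the smallest total distance to the remaining items (Algorithm 5)."""
    remaining = list(cluster)
    centroids = []
    for _ in range(min(n, len(remaining))):
        centroid = min(remaining,
                       key=lambda x: sum(m(x, y) for y in remaining))
        remaining.remove(centroid)
        centroids.append(centroid)
    return centroids
```

For example, with absolute difference as a toy metric, `find_centroids([1, 2, 3, 100], 2, lambda a, b: abs(a - b))` picks the central items 2 and 3 and ignores the outlier 100.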
The detection accuracy is also higher because most of the noise is filtered out during the clustering process. The result is a much faster and more accurate detection algorithm. Formally:

Algorithm 6: NSC detection method.
Data: u - the item being tested, R - the representatives set, ρ - the maximum allowed radius, m(·, ·) - a similarity measuring function.
Result: The anomaly score for u
1 NSC(u, R, ρ)
2 begin
3     r = min{ m(u, rep) | rep ∈ R };
4     if r > ρ then
5         mark u as an anomaly;
6     return r;

The similarity measuring function m(·, ·) is either NCD or N-Sim.

5 Experimental results for anomaly detection in URIs

Experimental results of anomaly detection in URIs are presented. A description of the datasets is given. We show that the NCD and N-Sim similarity metrics successfully measure the similarities between URIs. Different scoring functions and their accuracies for the S-Score detection method are compared. An accuracy comparison between the S-Score and the NSC detection methods is performed. Then, we examine how the size of the representatives set affects the accuracy and performance of the NSC algorithm.

5.1 URIs Datasets

We used data from several large commercial operational networks. These networks consist of several different subnetworks. We use the web-server datasets, which contain traffic from four dozen web servers. These web servers handle thousands of requests every day. These networks are protected by several common network security tools:

• Signature-based tools;
• Anomaly-based tools;
• Firewalls and proxies;
• VPNs.

These networks suffer from thousands of attacks every day. Figure 5.1 presents the network architecture.

Figure 5.1: The network architecture that collects the data

The datasets were collected using the tcpdump program over several days. The data was collected before it was filtered by these security tools.
The resulting corpus size is several terabytes of raw data. The database is named LURI DB (Large URI DB). We constructed a dataset of all the URIs in the collected data. The dataset was then separated into normal URIs and attack URIs using several tools, such as Snort and specially crafted tools that we developed. The training dataset contains 30,000 URIs, most of them valid, while several of them are attacks and abnormal URIs that were not caught by Snort. The testing dataset consists of 20,000 URIs, of which 98% are normal URIs and 2% are attacks. All the attacks and abnormal URIs found in the dataset were injected into the testing dataset.

5.2 The performance of NCD and N-Sim as URI similarity metrics

Initially, the NCD and N-Sim similarity measures (metrics) were tested. The metrics were used to find similarities between URIs in an unsupervised manner. URI data has semantic rules but is mostly constructed of spoken words. This means that a similarity measure that works well for natural languages and text processing has a better chance. Both NCD and N-Sim have shown good results in the past [10, 16] for measuring similarities between strings, which is why they were selected for the test. Since the datasets we work on are confidential, we present an example dataset that does not belong to them. This URI dataset is taken from the www.cnn.com homepage and is given here to demonstrate the performance of NCD and N-Sim. It consists of a group of 13 URIs together with 6 web attacks. The URI list is:

1. /SPECIALS/2008/news/luxury.week/
2. /SPECIALS/2008/news/olympics/
3. /site=cnn_international...tile=5260084747021&domId=330023
4. /site=cnn_international...=homepage&page.allowcompete=yes&params.styles=fs
5. /cnn/.element/js/2.0/scripts/prototype.js
6. /cnn/.element/js/2.0/scripts/effects.js
7. /cnn/.element/js/2.0/csiManager.js
8. /cnn/.element/js/2.0/StorageManager.js
9. /cnn/.element/img/2.0/sect/main/more_rss.gif
10. /cnn/.element/img/2.0/global/red_bull.gif
11. /cnn/2008/US/04/05/airport.arrests.ap/tzpop.lax.gi.jpg
12. /cnn/2008/SHOWBIZ/04/06/heston.dead/tzpop.jpg
13. /cnn/2008/CRIME/04/05/texas.ranch/tzpop.compound.housing.cnn.jpg
14. /...
15. /_vti_bin/owssvr.dll?UL=1&ACT=4&BUILD=6551&STRMVER=4&CAPREQ=0
16. /crawlers/img/stat_m.php?p=0...%26white%3D1%26tariff%3D21%26priceHigh%3D9999
17. /auth/auth.php?smf_root_path=http://www.ricksk8.xpg.com.br/echo2.txt?
18. /cnn/2008/CRIME/04/05/texas.ranch/bush.nato.ap/tags.php?BBCodeFile=http://rpgnet.com/newrpgnet/intranet/cmd.txt?
19. /cnn/2008/SHOWBIZ/04/06/heston.dead///tags.php?BBCodeFile=http://rpgnet.com/newrpgnet/intranet/cmd.txt?

There are four groups of similar URIs. Group 1: {1, 2}, Group 2: {3, 4}, Group 3: {5, 6, 7, 8, 9, 10} and Group 4: {11, 12, 13}. The last six URIs, 14-19, are web attacks. Distance matrices between these URIs were constructed using the NCD and N-Sim similarity metrics.

Table 2: The NCD distance matrix values. Values that are close to zero mean higher similarity. The highlighted numbers, which are less than 0.5, have good correspondence to the URI groups.

       1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19
 1  0.18 0.47 0.98 0.98 0.88 0.87 0.91 0.87 0.91 0.90 0.90 0.87 0.85 0.94 0.97 0.97 0.98 0.92 0.91
 2  0.47 0.19 0.97 0.98 0.90 0.87 0.91 0.89 0.91 0.93 0.90 0.87 0.87 0.97 0.97 0.97 0.98 0.92 0.91
 3  0.98 0.97 0.15 0.39 0.95 0.96 0.96 0.95 0.95 0.96 0.93 0.95 0.96 0.99 0.98 0.93 0.97 0.91 0.91
 4  0.98 0.97 0.40 0.14 0.92 0.95 0.95 0.92 0.93 0.95 0.92 0.94 0.96 0.99 0.98 0.93 0.95 0.90 0.91
 5  0.88 0.88 0.93 0.91 0.12 0.32 0.51 0.49 0.58 0.62 0.87 0.81 0.80 0.93 0.97 0.96 0.92 0.89 0.89
 6  0.87 0.87 0.95 0.93 0.34 0.13 0.49 0.49 0.56 0.60 0.87 0.81 0.83 0.92 0.97 0.96 0.94 0.89 0.89
 7  0.88 0.85 0.95 0.93 0.49 0.49 0.15 0.29 0.60 0.62 0.87 0.83 0.85 0.94 0.97 0.97 0.97 0.91 0.91
 8  0.87 0.87 0.94 0.92 0.49 0.49 0.32 0.13 0.58 0.60 0.87 0.81 0.80 0.95 0.97 0.97 0.95 0.89 0.89
 9  0.91 0.91 0.95 0.92 0.60 0.58 0.64 0.62 0.11 0.49 0.87 0.80 0.85 0.96 0.95 0.95 0.94 0.88 0.88
10  0.90 0.93 0.96 0.94 0.64 0.62 0.64 0.64 0.47 0.14 0.90 0.83 0.87 0.95 0.97 0.96 0.97 0.89 0.90
11  0.90 0.89 0.93 0.91 0.89 0.89 0.89 0.87 0.87 0.89 0.15 0.65 0.69 0.98 0.98 0.97 0.91 0.66 0.79
12  0.87 0.89 0.94 0.93 0.83 0.81 0.87 0.85 0.78 0.81 0.63 0.13 0.61 0.96 0.97 0.96 0.94 0.79 0.80
13  0.87 0.87 0.96 0.96 0.83 0.85 0.87 0.83 0.83 0.85 0.68 0.61 0.15 0.98 0.97 0.97 0.98 0.85 0.66
14  0.97 0.97 0.99 0.99 0.95 0.95 0.97 0.95 0.93 0.95 0.98 0.96 0.98 0.40 0.98 1.00 1.00 0.99 0.98
15  0.97 0.95 0.98 0.97 0.98 0.98 0.98 0.98 0.95 0.95 1.00 0.98 1.00 0.98 0.14 0.99 0.98 0.97 0.93
16  0.94 0.96 0.96 0.98 0.97 0.95 0.96 0.97 0.97 0.97 1.02 0.99 0.99 1.00 0.99 0.15 0.94 0.92 0.92
17  0.97 0.95 0.95 0.92 0.91 0.92 0.97 0.95 0.91 0.95 0.88 0.92 0.95 0.98 1.00 0.93 0.14 0.82 0.81
18  0.91 0.91 0.90 0.88 0.90 0.90 0.91 0.88 0.88 0.89 0.65 0.80 0.85 0.98 0.99 0.92 0.84 0.13 0.38
19  0.91 0.91 0.91 0.89 0.89 0.90 0.91 0.88 0.88 0.89 0.79 0.81 0.65 0.98 0.98 0.93 0.84 0.38 0.13

Table 3: The N-Sim distance matrix values. Values that are close to zero mean higher similarity. The highlighted numbers, which are less than 0.5, have almost perfect correspondence to the URI groups.

       1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19
 1  0.01 0.30 0.96 0.95 0.79 0.79 0.80 0.82 0.81 0.80 0.83 0.80 0.75 0.93 0.91 0.96 0.90 0.89 0.89
 2  0.30 0.01 0.96 0.94 0.79 0.79 0.80 0.83 0.81 0.80 0.83 0.81 0.76 0.92 0.91 0.96 0.91 0.90 0.89
 3  0.96 0.96 0.00 0.34 0.93 0.94 0.94 0.93 0.93 0.93 0.92 0.93 0.95 0.99 0.93 0.87 0.91 0.88 0.88
 4  0.95 0.94 0.34 0.00 0.90 0.91 0.91 0.90 0.90 0.91 0.89 0.91 0.93 0.98 0.91 0.90 0.87 0.83 0.84
 5  0.79 0.79 0.93 0.90 0.00 0.21 0.39 0.39 0.48 0.47 0.79 0.73 0.72 0.94 0.91 0.95 0.85 0.87 0.86
 6  0.79 0.79 0.94 0.91 0.21 0.00 0.35 0.36 0.45 0.47 0.79 0.74 0.74 0.94 0.91 0.96 0.85 0.88 0.86
 7  0.80 0.80 0.94 0.91 0.39 0.35 0.00 0.20 0.49 0.47 0.80 0.78 0.74 0.93 0.89 0.96 0.88 0.88 0.87
 8  0.82 0.83 0.93 0.90 0.39 0.36 0.20 0.00 0.48 0.44 0.79 0.76 0.71 0.94 0.89 0.96 0.88 0.87 0.86
 9  0.81 0.81 0.93 0.90 0.48 0.45 0.49 0.48 0.00 0.36 0.79 0.73 0.72 0.95 0.88 0.94 0.86 0.86 0.85
10  0.80 0.80 0.93 0.91 0.47 0.47 0.47 0.44 0.36 0.00 0.81 0.76 0.75 0.94 0.89 0.95 0.87 0.88 0.87
11  0.83 0.83 0.92 0.89 0.79 0.79 0.80 0.79 0.79 0.81 0.00 0.55 0.59 0.96 0.89 0.94 0.80 0.61 0.71
12  0.80 0.81 0.93 0.91 0.73 0.74 0.78 0.76 0.73 0.76 0.55 0.00 0.50 0.96 0.89 0.95 0.82 0.75 0.76
13  0.75 0.76 0.95 0.93 0.72 0.74 0.74 0.71 0.72 0.75 0.59 0.50 0.00 0.95 0.89 0.96 0.86 0.79 0.62
14  0.93 0.92 0.99 0.98 0.94 0.94 0.93 0.94 0.95 0.94 0.96 0.96 0.95 0.04 0.96 0.99 0.97 0.98 0.98
15  0.91 0.91 0.93 0.91 0.91 0.91 0.89 0.89 0.88 0.89 0.89 0.89 0.89 0.96 0.00 0.95 0.89 0.93 0.92
16  0.96 0.96 0.87 0.90 0.95 0.96 0.96 0.96 0.94 0.95 0.94 0.95 0.96 0.99 0.95 0.00 0.91 0.89 0.90
17  0.90 0.91 0.91 0.87 0.85 0.85 0.88 0.88 0.86 0.87 0.80 0.82 0.86 0.97 0.89 0.91 0.00 0.75 0.75
18  0.89 0.90 0.88 0.83 0.87 0.88 0.88 0.87 0.86 0.88 0.61 0.75 0.79 0.98 0.93 0.89 0.75 0.00 0.25
19  0.89 0.89 0.88 0.84 0.86 0.86 0.87 0.86 0.85 0.87 0.71 0.76 0.62 0.98 0.92 0.90 0.75 0.25 0.00

These results were compared to the N-Gram Cosine similarity metric, a typical similarity metric used in NLP. The Cosine similarity was calculated for the 6-gram vectors of the URIs.

Table 4: The Cosine similarity distance matrix values. Values that are close to zero mean higher similarity. The highlighted numbers, which are less than 0.4, do not correspond to the URI groups. Hence, this metric is not suitable.

       1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19
 1  0.00 0.15 0.69 0.67 0.37 0.36 0.35 0.34 0.35 0.36 0.42 0.42 0.33 0.69 0.64 0.55 0.59 0.38 0.35
 2  0.15 0.00 0.60 0.62 0.31 0.33 0.35 0.40 0.31 0.39 0.34 0.36 0.25 0.79 0.58 0.54 0.59 0.37 0.33
 3  0.69 0.60 0.00 0.07 0.38 0.46 0.42 0.39 0.38 0.42 0.40 0.40 0.50 0.95 0.58 0.32 0.43 0.31 0.36
 4  0.67 0.62 0.07 0.00 0.32 0.39 0.37 0.34 0.34 0.37 0.40 0.44 0.53 0.93 0.56 0.36 0.44 0.29 0.34
 5  0.37 0.31 0.38 0.32 0.00 0.07 0.13 0.16 0.15 0.30 0.21 0.20 0.22 0.59 0.70 0.40 0.32 0.13 0.13
 6  0.36 0.33 0.46 0.39 0.07 0.00 0.08 0.15 0.12 0.29 0.29 0.32 0.29 0.59 0.72 0.46 0.43 0.21 0.20
 7  0.35 0.35 0.42 0.37 0.13 0.08 0.00 0.05 0.09 0.21 0.22 0.28 0.28 0.55 0.72 0.48 0.50 0.19 0.20
 8  0.34 0.40 0.39 0.34 0.16 0.15 0.05 0.00 0.12 0.20 0.25 0.25 0.27 0.58 0.74 0.47 0.46 0.16 0.17
 9  0.35 0.31 0.38 0.34 0.15 0.12 0.09 0.12 0.00 0.15 0.26 0.29 0.30 0.59 0.70 0.42 0.44 0.18 0.17
10  0.36 0.39 0.42 0.37 0.30 0.29 0.21 0.20 0.15 0.00 0.34 0.35 0.35 0.58 0.68 0.48 0.52 0.27 0.25
11  0.42 0.34 0.40 0.40 0.21 0.29 0.22 0.25 0.26 0.34 0.00 0.17 0.15 0.54 0.67 0.45 0.33 0.13 0.18
12  0.42 0.36 0.40 0.44 0.20 0.32 0.28 0.25 0.29 0.35 0.17 0.00 0.18 0.49 0.67 0.38 0.29 0.16 0.20
13  0.33 0.25 0.50 0.53 0.22 0.29 0.28 0.27 0.30 0.35 0.15 0.18 0.00 0.63 0.70 0.47 0.42 0.17 0.11
14  0.69 0.79 0.95 0.93 0.59 0.59 0.55 0.58 0.59 0.58 0.54 0.49 0.63 0.00 0.86 0.78 0.56 0.62 0.66
15  0.64 0.58 0.58 0.56 0.70 0.72 0.72 0.74 0.70 0.68 0.67 0.67 0.70 0.86 0.00 0.51 0.72 0.65 0.67
16  0.55 0.54 0.32 0.36 0.40 0.46 0.48 0.47 0.42 0.48 0.45 0.38 0.47 0.78 0.51 0.00 0.35 0.36 0.37
17  0.59 0.59 0.43 0.44 0.32 0.43 0.50 0.46 0.44 0.52 0.33 0.29 0.42 0.56 0.72 0.35 0.00 0.21 0.28
18  0.38 0.37 0.31 0.29 0.13 0.21 0.19 0.16 0.18 0.27 0.13 0.16 0.17 0.62 0.65 0.36 0.21 0.00 0.03
19  0.35 0.33 0.36 0.34 0.13 0.20 0.20 0.17 0.17 0.25 0.18 0.20 0.11 0.66 0.67 0.37 0.28 0.03 0.00

Next, we calculate the PCA of Tables 2, 3 and 4. The first two principal components are used to plot the tables in two dimensions.

Figure 5.2: 2D PCA plot of Table 2. The NCD similarity metric is used

Figure 5.3: 2D PCA plot of Table 3. The N-Sim similarity metric is used

Figure 5.4: 2D PCA plot of Table 4. The Cosine similarity metric is used

Figures 5.2 and 5.3 are the plots of the distance matrices. Both NCD and N-Sim produce clusters that are clearly separated from the attack URIs. In Fig. 5.4, we can see that the Cosine similarity distance matrix does not provide well-separated clusters. Several compressors were tested for the NCD metric: LZW, bzip2 and PPM. The PPM compressor provided the best results and was chosen for the NCD calculation. The NCD and N-Sim metrics show good capabilities in measuring similarities between URIs in an unsupervised manner.

5.3 Results of S-Score and NSC for anomaly detection

The results described in the rest of this section were collected by running the algorithms on the LURI DB dataset (see 5.1).

5.3.1 S-Score Scoring Functions Comparison

Six scoring functions were tested: l1-norm (Eq. 4.1), vector sum with median scaling (Eq. 4.2), vector sum with mean scaling (Eq. 4.3), vector sum with exponential smoothing (Eq. 4.4), vector sum with Gaussian smoothing (Eq. 4.5) and l2-norm (Eq. 4.6). Figure 5.5 shows the accuracy comparison between these scoring functions when the NCD is used as the similarity metric. The best accuracy together with a low false negative rate is achieved with the l2-norm and with the vector sum with mean scaling.
Leading is the l2-norm, which produced a false negative rate of 0.135%, a false positive rate of 3.75% and a total accuracy rate of 96.15%. A little behind is the vector sum with mean scaling, which produced a false negative rate of 0.155%, a false positive rate of 5.055% and a total accuracy rate of 94.7%. Figure 5.6 shows the relation between the correct detection probability and the false negative probability for these six functions when the NCD is used as the similarity metric. For all six functions, a very low false negative probability comes with a high false positive probability, which accounts for a low correct detection probability; a higher false negative probability comes with a low false positive probability and a high correct detection probability. We can use the "elbow" rule in order to choose a good balance between false negatives and correct detections for each of the functions.

Figure 5.5: Comparison between the accuracies of the scoring functions that use the NCD metric

Figure 5.6: Correct detection vs. false negatives of the scoring functions that use the NCD metric

Figure 5.7 shows the accuracy comparison between the same scoring functions when the N-Sim is used as the similarity metric. Here, the vector sum with Gaussian smoothing leads with a false negative rate of 0.125%, a false positive rate of 7.65% and a total accuracy rate of 92.225%. A close second are the vector sum with mean scaling and the l1-norm scoring functions, with nearly the same accuracy. Figure 5.8 shows the relation between the correct detection and false negative probabilities of the six functions when the N-Sim similarity metric is used. As mentioned before, the N-Sim metric produces high anomaly scores when short and long strings are compared. The N-Sim measures the length of the shortest conversion path from the first compared string to the second one.
The conversion path from a short string to a long one is at least as long as their length difference; thus, the anomaly score becomes high. All the tested scoring functions sum the similarity measures of the items. Therefore, short strings receive a high anomaly score and the N-Sim produces a high number of false positives. Summing up, both similarity metrics performed well, with the NCD about 5% more accurate. The best overall scoring functions were the l2-norm and the vector sum with mean scaling.

Figure 5.7: Comparison between the scoring functions' accuracies when N-Sim is used

Figure 5.8: Correct detection vs. false negatives of the scoring functions when N-Sim is used

5.3.2 Performance of NSC vs. S-Score

The NSC detection method has a low false negative rate of 0.085% and a correct detection rate of 99.105%. Figure 5.9 shows the comparison between the NSC detection method and the two best scoring functions. The time performance of the NSC detection method depends directly on the total number of chosen representatives: for representatives chosen as a fraction p of a training set of size |T|, we have r = p·|T|. If the S-Score detection method takes τ seconds, then the NSC detection method takes τ·p seconds. Section 5.3.4 evaluates the performance for different values of r. The second detection method achieves high accuracy with a low false negative rate because it overcomes three problems that occurred during the application of the S-Score detection method:

1. The performance is better due to a lower number of comparisons.

2. Testing for a minimum similarity distance allows a mixture of string lengths: a short string that is compared to a long string achieves a high anomaly score, which correctly indicates that it does not belong to the cluster of long strings.

3. The clustering and the search for a minimum similarity distance dramatically reduce the noise in the training data.
In our tests, the noise level dropped to zero: the representatives did not contain any abnormal URIs or attacks.

Figure 5.9: NSC vs. S-Score (NCD)

5.3.3 Comparison of the accuracy performance in NSC between NCD and N-Sim

Several methods were tested: NCD alone, N-Sim alone and a decision tree that was trained on 4000 items of NCD and N-Sim values. The results are presented in Fig. 5.10. N-Sim achieved the best accuracy, although very good accuracy was also achieved by the other two methods. The decision tree method produces the lowest false negative rate. Our assumption, as a future research goal, is that combining the two metrics with a better method than a decision tree, such as a Support Vector Machine, will produce even better results.

Figure 5.10: Comparison between NCD, N-Sim and a decision tree that is based on both metrics

Figure 5.11 shows the relation between the correct detection and false negative probabilities of NCD and N-Sim when the NSC detection method is used.

Figure 5.11: Correct detection vs. false negatives that NCD and N-Sim produce

5.3.4 How different numbers of representatives in NSC affect the performance

The time performance of the NSC detection method has a linear dependency on the number of representatives r (see section 5.3.2). The number of representatives is chosen as p% of the total training set T. Figure 5.12 presents a comparison between different representative percentages p. The most significant finding is that even p = 2% of the training data produces a very high detection rate of 98.56% with a low false negative rate of 0.085%, together with a performance boost. The performance gain, as expected from the linear dependency, is five times better than in the test where p = 10% was used.

Figure 5.12: Performance comparison between different representative set sizes p%

A simple heuristic can be added in order to retain a high detection rate while increasing time performance.
It is based on sorting the representatives by their proximity to the center of their cluster: first all the representatives which are closest to their center, then all the representatives which are second closest, and so on. If we change the NSC detection method to search for the first representative rep that complies with m(u, rep) < ρ, we get a major performance gain. As Fig. 5.12 shows, this is because 98.56% of the items get detected by 2% of the training data. Therefore, about 98% of the items will be detected quickly, and only for about 0.6% of the data it will take longer.

Algorithm 7: NSC detection method with the sort heuristic.
Data: u - the item being tested, R - the representatives set, sorted by proximity to the cluster centers, ρ - the maximum allowed radius.
Result: The anomaly score for u
1 DetectWithHeuristic(u, R, ρ)
2 begin
3     foreach rep ∈ R do
4         r = m(u, rep);    // NCD or N-Sim
5         if r < ρ then
6             return r;
7     mark u as an anomaly;

6 Previous Work on URI Anomaly Detection

Previous works on URI anomaly detection are described. Each work uses a private dataset that is unavailable to the public; therefore, it is impossible to compare the results of these different approaches.

6.1 URI Anomaly Detection

Detecting HTTP Network Attacks using Language Models: A method for network intrusion detection that is based on n-grams was developed in [24]. They propose a representation of n-grams using tries, accompanied by a novel method for comparing tries in linear time. Each URI is converted to a trie, and a k-nearest-neighbors method called Zeta is applied to detect anomalies. The issue of choosing the length of the n-gram is also addressed; they solved it by moving to a word-based model instead of an n-gram model, in which each URI is parsed into its underlying words. The results were compatible with those of the best n-gram value they had obtained in previous tests.
Figure 6.1: Zeta function example

Detecting Anomalies in HTTP Queries: A system which can analyze client queries to different server-side programs was developed in [18]. The system performs a focused analysis and builds a profile for each of the server-side applications. The system first enters a learning stage in which it builds the profiles. In the second stage, parameters from new queries are compared with the established profiles that are specific to the application being referenced. URIs which do not contain a query string are ignored. The profiles consist of an analysis of the following query attributes: length, character distribution, structural inference, range, presence or absence, and order.

6.2 Prediction By Partial Matching (PPM) Anomaly Detection

A spam filter which is based on the PPM compression scheme is described in [6, 7]. The PPM model is used to estimate the probabilities of character sequences based on previous observations. The probability of a document is the product of the probabilities of all the characters contained in the document. Each message receives a "spamminess score" that is based on these probabilities.

7 Conclusion

In this work, we presented two algorithms for URI anomaly detection that use similarity metrics. The first detection algorithm, which uses scoring functions, achieves good results while being computationally expensive. The second detection algorithm consists of two sequential stages. In the first, also termed training, representatives of a normal dataset are selected using clustering and centroid selection. In the second, anomalies are detected by comparing new data to the cluster centroids. The tests showed that the NSC algorithm provides fast performance with a very good detection rate and a very low false negative rate. Two similarity metrics, NCD and N-Sim, were used. Both exhibited similar accuracy in the NSC algorithm. The NCD metric produced better results for the S-Score algorithm.
With some performance modifications, our NSC algorithm can be used as an Intrusion Detection System (IDS) application for web servers. In the future, we plan to improve the detection stage by taking the cluster radius into account when detecting anomalies.

References

[1] C. E. Au, S. Skaff, and J. J. Clark. Anomaly detection for video surveillance applications. In ICPR '06: Proceedings of the 18th International Conference on Pattern Recognition, pages 888–891, Washington, DC, USA, 2006. IEEE Computer Society.
[2] D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping. Physical Review Letters, 88(4), January 2002.
[3] T. Berners-Lee, R. Fielding, and L. Masinter. RFC 3986: Uniform Resource Identifier (URI): Generic Syntax, January 2005.
[4] M. Bertacchini and P. I. Fierens. Preliminary results on masquerader detection using compression based similarity metrics, 2006.
[5] M. Bertacchini and P. I. Fierens. NCD based masquerader detection using enriched command lines, 2007.
[6] A. Bratko, B. Filipič, G. V. Cormack, T. R. Lynam, and B. Zupan. Spam filtering using statistical data compression models. Journal of Machine Learning Research, 7:2673–2698, 2006.
[7] A. Bratko, B. Filipič, and B. Zupan. Towards practical PPM spam filtering: Experiments for the TREC 2006 spam track. In Proc. 15th Text REtrieval Conference (TREC 2006), Gaithersburg, MD, 2006.
[8] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
[9] G. J. Chaitin. Algorithmic information theory. IBM Journal of Research and Development, 1977. Reprinted in Information, Randomness & Incompleteness: Papers on Algorithmic Information Theory, World Scientific Series in Computer Science, Vol. 8, 1987.
[10] R. Cilibrasi and P. M. B. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51:1523–1545, 2005.
[11] J. G. Cleary and W. J. Teahan. Unbounded length contexts for PPM. The Computer Journal, 40(2/3), 1997.
[12] J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, COM-32(4):396–402, April 1984.
[13] J. C. Dunn. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3:32–57, 1973.
[14] P. G. Howard and J. S. Vitter. Arithmetic coding for data compression. Technical Report DUKE–TR–1994–09, Duke University, 1994.
[15] D. A. Keim and A. Hinneburg. Clustering techniques for large data sets - from the past to the future. In KDD Tutorial Notes, pages 141–181, 1999.
[16] G. Kondrak. N-gram similarity and distance. In SPIRE, pages 115–126, 2005.
[17] R. Krishnapuram, A. Joshi, O. Nasraoui, and L. Yi. Low-complexity fuzzy relational clustering algorithms for web mining. IEEE Transactions on Fuzzy Systems, 9:595–607, August 2001.
[18] C. Kruegel and G. Vigna. Anomaly detection of web-based attacks. In CCS '03: Proceedings of the 10th ACM Conference on Computer and Communications Security, pages 251–261, New York, NY, USA, 2003. ACM.
[19] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics - Doklady, 10(8):707–710, February 1966.
[20] M. Li, X. Chen, X. Li, B. Ma, and P. M. B. Vitányi. The similarity metric. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '03), pages 863–872. Society for Industrial and Applied Mathematics, 2003.
[21] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In L. M. Le Cam and J. Neyman, editors, Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.
[22] R. A. Maxion. Masquerade detection using enriched command lines. In Proceedings of the International Conference on Dependable Systems and Networks (DSN '03), San Francisco, CA, June 2003.
[23] R. Cilibrasi, A. Cruz, and S. Wehner. The CompLearn toolkit.
[24] K. Rieck and P. Laskov. Detecting unknown network attacks using language models. In DIMVA, volume 4064 of Lecture Notes in Computer Science, pages 74–90. Springer, 2006.
[25] Snort Development Team. Snort: the open source network intrusion detection system.
[26] S. Wehner. Analyzing worms and network traffic using compression. Journal of Computer Security, 15(3):303–320, 2007.
[27] T. A. Welch. A technique for high-performance data compression. Computer, 17(6):8–19, 1984.
[28] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.

Abstract (translated from Hebrew)

Web servers are an important component of most businesses today. Since they are accessible to the general public, they are the most exposed to intrusion and attack attempts. Server administrators deploy Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS) in order to protect the servers against these threats. Most IDSs rely on signatures to identify attacks. The main drawbacks of signatures are new, unknown attacks (zero-day attacks) and the constant need to update the signature database in order to cope with new threats. This problem is well known, and there is a need to design an IDS that can detect attacks based on the normal behavior of the server, independently of signatures. In this work, two algorithms for detecting anomalies (attacks or irregular data) are presented. The algorithms are based on measuring the similarity between normal and anomalous data. The algorithms, which were tested on data collected from web servers, rely on two methods for measuring similarity: the first, called NCD, is based on compression, and the second, N-gram similarity, is taken from the field of natural language processing (NLP). The algorithms exhibited high detection capabilities together with fast performance, and they can be integrated into a signature-based IDS to achieve better security.
(Hebrew title page, translated)

Tel-Aviv University, Raymond and Beverly Sackler Faculty of Exact Sciences, School of Computer Science

URI Anomaly Detection using Similarity Metrics

This thesis is submitted as partial fulfillment of the requirements towards the M.Sc. degree in the School of Computer Science, Tel-Aviv University, by Saar Yahalom. The research work was conducted under the supervision of Prof. Amir Averbuch. Iyar 5768 (May 2008).