
URI Anomaly Detection using Similarity Metrics
Tel-Aviv University
Raymond and Beverly Sackler Faculty of Exact Science
School of Computer Science
URI Anomaly Detection using
Similarity Metrics
This thesis is submitted as partial fulfillment of the requirements
towards the M.Sc. degree in the School of Computer Science,
Tel-Aviv University
By
Saar Yahalom
May 2008
The research work in this thesis was conducted under the
supervision of Prof. Amir Averbuch
May 2008
ACKNOWLEDGEMENT
I would like to express my sincere gratitude and appreciation to my
advisor, Professor Amir Averbuch, for providing me with the unique
opportunity to work in the research area of machine learning and security,
for his expert guidance and mentorship, and for his help in completing this
thesis.
Gil David offered much-appreciated advice, support and
encouragement throughout the research process, especially during the
experimentation stage. His experience and knowledge helped me save
valuable time.
I would like to thank the other lab members Neta, Alon and Shachar
for their help throughout the research and for providing me with a good
and enriching environment.
Finally, I would like to thank my friends at Tel Aviv University who
made my work here fun, and my girlfriend Shlomit, whose support and
love helped me to start this degree.
DEDICATION
This thesis is dedicated to my parents, who have always supported me and
gave the right advice whenever I needed it. Their hard work, love and
patience have motivated and inspired me throughout my life.
Abstract
Web servers play a major role in almost all businesses today. Since they
are publicly accessible, they are the most exposed servers to attacks and intrusion attempts. Web server administrators deploy Intrusion Detection Systems
(IDS) and Intrusion Prevention Systems (IPS) to protect against these threats.
Most IDS rely on signatures to detect these attacks. The major drawbacks of
signature based IDS are their inability to cope with "zero day" attacks (new
unknown attacks) and the constant need to update the signatures database in
order to keep up with the emergence of new threats. This is a known problem,
and the security community realized that an IDS which can detect anomalies
(attacks) that deviate from normal web site behavior is needed. In this work,
two anomaly detection algorithms are presented. The algorithms are based on
similarity metrics, which are used to measure the relation between normal and
abnormal data. The algorithms were tested on web server data with two types
of similarity metrics: a compression based similarity metric called NCD,
and a similarity metric used in the field of Natural Language Processing called
N-gram similarity. The algorithms exhibit a high detection rate together with
high performance and can complement a signature based IDS to provide better
security.
Contents

1 Introduction ........................................................ 1
2 Technical Background ................................................ 2
3 Related work on similarity metrics ................................. 14
  3.1 NCD for Similarity and Clustering Algorithm .................... 14
    3.1.1 Anomaly Detection .......................................... 14
    3.1.2 Clustering ................................................. 15
  3.2 N-Gram Similarity .............................................. 17
4 Anomaly detection algorithms that use similarity metrics ........... 17
  4.1 S-Score Algorithm Description .................................. 18
    4.1.1 Algorithm discussion ....................................... 20
    4.1.2 Algorithm Disadvantages .................................... 20
  4.2 Improving the Detection Efficiency and Accuracy by Clustering .. 21
    4.2.1 Choosing Representatives ................................... 21
  4.3 Detection ...................................................... 26
5 Experimental results for anomaly detection in URIs ................. 27
  5.1 URIs Datasets .................................................. 27
  5.2 The performance of NCD and N-Sim as URI similarity metrics ..... 29
  5.3 Results of S-Score and NSC for anomaly detection ............... 36
    5.3.1 S-Score Scoring Functions Comparison ....................... 36
    5.3.2 Performance of NSC vs. S-Score ............................. 41
    5.3.3 Comparison of the accuracy performance in NSC between NCD and N-Sim 43
    5.3.4 How different number of representatives in NSC affects the performance 44
6 Previous Work on URI Anomaly Detection ............................. 46
  6.1 URI Anomaly Detection .......................................... 46
  6.2 Prediction By Partial Matching (PPM) Anomaly Detection ......... 48
7 Conclusion ......................................................... 48
List of Figures

5.1 The network architecture that collects the data .................. 28
5.2 2D PCA plotting of table 2. NCD similarity metric is used ........ 34
5.3 2D PCA plotting of table 3. N-Sim similarity metric is used ...... 35
5.4 2D PCA plotting of table 4. Cosine similarity metric is used ..... 35
5.5 Comparison between the scoring functions accuracy that use the NCD metric 37
5.6 Correct detection vs. false negatives of the scoring functions that use the NCD metric 38
5.7 Comparison between the scoring functions accuracy when N-Sim is used 40
5.8 Correct detection vs. false negatives of the scoring functions when N-Sim is used 41
5.9 NSC vs. S-Score (NCD) ............................................ 42
5.10 Comparison between NCD, N-Sim and a decision tree that is based on both metrics 43
5.11 Correct detection vs. false negatives that NCD and N-Sim produce  44
5.12 Performance comparison between different representatives sizes of p% 45
6.1 Zeta function example ............................................ 47
List of Tables

1 PPM model after processing the string abracadabra (maximum order 2). c counts the substring appearance and p is the substring probability. 7
2 The NCD distance matrix values. Values that are close to zero mean higher similarity. The highlighted numbers, which are less than 0.5, have good correspondence to the URI groups. 31
3 The N-Sim distance matrix values. Values that are close to zero mean higher similarity. The highlighted numbers, which are less than 0.5, have almost perfect correspondence to the URI groups. 32
4 The Cosine similarity distance matrix values. Values that are close to zero mean higher similarity. The highlighted numbers, which are less than 0.4, do not correspond to the URI groups. Hence, this metric is not suitable. 33
1 Introduction
Web servers play a major role in almost all businesses today. They either house the
commercial site, connecting its customers to the business data, or become part of a
bigger infrastructure that provides web services. Since they are publicly accessible,
they are usually the most exposed servers to attacks and intrusion attempts. In 2007,
theft of personal data more than tripled, and 62% of the cases involved hacking web
sites [1]. In order to fight these attacks, web server administrators deploy Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS). The best known IDS
is Snort [2], which is a signature based IDS. The problems with signature based systems
are “zero day” attacks (new unknown attacks) and the constant need to update the
signatures database in order to keep up with the emergence of new threats. Creating
a signature that will not cause too many false alarms when legitimate requests are
made is not a trivial task. This leaves the web server administrator dependent solely on
signature subscription services. The security community is aware of these problems,
and calls have been made to construct an IDS that can spot anomalies based on
normal web site behavior while being independent of signatures.
Such a system can work in parallel with, or complement, a signature based IDS
to provide tighter security.
This work describes an anomaly detection algorithm designed to target these
problems. We use the notion of a “similarity metric” in order to measure the similarity of web requests to a known set of normal data that was collected from a web
site. Two similarity metrics are used: a compression based metric called Normalized
Compression Distance (NCD), which is closely related to Kolmogorov complexity,
and the N-Gram Similarity (N-Sim), which has been used mainly in Natural
Language Processing (NLP) to find similar words. The presented algorithm can handle
a real world scenario in terms of performance and false alarm rate. A two stage
algorithm is proposed: initially, it constructs a set of normal data using clustering
and similarity measures; later, it uses the constructed dataset to detect anomalous
behavior efficiently and accurately.

[1] USA Today, http://www.usatoday.com/tech/news/computersecurity/infotheft/2007-12-09-data-theft_N.htm
[2] http://www.snort.org
The work has the following structure: In section 2, a technical background is given
on topics that are relevant to this research. The reader is referred to this section for
an explanation on methods that have been used in our anomaly detection algorithms.
Related work on similarity metrics, anomaly detection and clustering is presented in
section 3. These works use the same similarity metrics that were used later in the
experiments section. A detailed explanation of two anomaly detection algorithms,
which were developed, is given in section 4. Experimental results on URI anomaly
detection are presented in section 5. In addition, it provides a detailed performance
comparison between two detection methods that use the NCD and N-Sim similarity
metrics. Previous works on URI anomaly detection are reviewed in section 6.
2 Technical Background
In this section, we provide the technical background that is needed for this work.
1. Uniform Resource Identifier (URI) [3] is a compact sequence of characters
that identifies an abstract or physical resource. URL is another common name
for URI. All Internet addresses comply with the URI format. When a web page
is requested from a web server, a URI, which represents the requested page, is
sent. Typical URI examples are:
ftp://ftp.is.co.za/rfc/rfc1808.txt,
http://www.ietf.org/rfc/rfc2396.txt and mailto:[email protected].
2. Kolmogorov Complexity K(X), also known as the algorithmic entropy of
a string X, is the length of the minimal description of the string X:
K(X) = min{|d(X)|}, where d(X) is a description of the string X. d(X) can
be thought of as a computer program that outputs the string X, or as an input
to a universal Turing machine that outputs X. The Kolmogorov complexity is
uncomputable ([9]).
Since Kolmogorov complexity is uncomputable, compression can be used to
approximate the value of K(X). Assume that C(X) is the length in bits of the
compressed version of the string X produced by a compressor C; then K(X)
can be approximated as K(X) ≈ C(X) [10]. Examples of compressors are
Zip (item 5 in this list), BZip2 (item 6) and PPM (item 8).
This approximation is loose. Consider the description of π: a program that
computes the first 200,000 digits of π is shorter than the compression of
those 200,000 digits. Nevertheless, the idea of approximating the Kolmogorov
complexity by compression is utilized in this work.
3. Similarity Metric
Similarity is a quantity that reflects the strength of the relationship between
two objects. The range of this quantity is either [−1, 1] or [0, 1] in its normalized
form.
Distance and Metric [20]: Without loss of generality, a distance needs only
to operate on finite sequences of 0's and 1's, since every finite sequence
over a finite alphabet can be represented by a finite binary sequence.
Formally, a distance is a function D with nonnegative real values defined
on the Cartesian product X × X, i.e. D : X × X → R+. It is called
a distance metric on X if for every x, y, z ∈ X:
• D(x, y) = 0 iff x = y (the identity axiom);
• D(x, z) ≤ D(x, y) + D(y, z) (the triangle inequality);
• D(x, y) = D(y, x) (the symmetry axiom).
A set X, which is provided with a metric, is called a metric space. For example,
every set X has the trivial discrete metric: D(x, y) = 0 if x = y and D(x, y) = 1
otherwise.
In this work, we use the notion defined in [20]: a normalized distance, or
similarity distance, is a function d : Ω × Ω → [0, 1] that is symmetric, d(x, y) =
d(y, x), such that for every x ∈ {0, 1}* and every constant e ∈ [0, 1]

    |{y : d(x, y) ≤ e ≤ 1}| < 2^(eK(x)+1)    (2.1)

where K(x) is the Kolmogorov complexity of x.
4. Normalized Compression Distance (NCD) [10] is defined as:

    NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}    (2.2)

where C(xy) is the size of the compressed string that was produced from the
concatenation of the strings x and y. The NCD satisfies the following properties
that define a distance metric:
(a) NCD(x, x) = 0 (the identity axiom);
(b) NCD(x, y) = NCD(y, x) (symmetry);
(c) for C(x) ≤ C(y) ≤ C(z),
    NCD(x, y) ≤ NCD(x, z) + NCD(z, y) (the triangle inequality).
In addition, NCD satisfies inequality (2.1), and in most cases NCD(x, y) ∈
[0, 1], which practically makes it a similarity metric.
NCD approximates the information distance defined in [10, 20]. In other words,
NCD(x, y) measures how well x describes y, or more specifically, how well the
knowledge of x helps to compress y. More information can be found in [10, 20].
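As a concrete illustration, NCD from Eq. (2.2) can be approximated with any off-the-shelf compressor; the following sketch uses Python's zlib (an assumption made for illustration only; the experiments in this work use LZW, BZip2 and PPM):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    # Eq. (2.2): C(.) is the compressed length in bytes under zlib
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# similar strings score closer to 0 than unrelated ones
a = b"GET /store/item.php?id=17&sort=price"
b = b"GET /store/item.php?id=23&sort=name"
c = b"\x8f\x02\xa1" * 40
```

Here `a` and `b` are two hypothetical, nearly identical requests and `c` is unrelated binary data; ncd(a, b) comes out well below ncd(a, c).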
5. LZW Compression [28, 27] is a universal lossless data compression algorithm.
The compressor builds a string dictionary from the text being compressed.
The string dictionary maps fixed-length codes to strings. The dictionary is
initialized with all single-character strings (256 entries for ASCII). The input
is then processed as follows: characters are appended to the current word as
long as the extended word exists in the dictionary. When it does not, the code
of the current word is output, the extended word is added to the dictionary,
and the current word is reset to the last character read.
LZW is a universal compression algorithm, which means that it is not optimal.
As the data becomes larger, LZW becomes asymptotically closer to an optimal
compression.
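The dictionary-building loop described above can be sketched as follows (a minimal illustration that emits the code stream only; a real codec would also pack the codes into bits and provide the matching decompressor):

```python
def lzw_compress(data: str) -> list[int]:
    # dictionary initialized with all single-character strings
    dictionary = {chr(i): i for i in range(256)}
    word = ""
    codes = []
    for ch in data:
        extended = word + ch
        if extended in dictionary:
            word = extended               # keep growing the current word
        else:
            codes.append(dictionary[word])          # emit code of known word
            dictionary[extended] = len(dictionary)  # learn the new word
            word = ch                     # restart from the last character
    if word:
        codes.append(dictionary[word])
    return codes

# lzw_compress("ababab") -> [97, 98, 256, 256]
```

Note how the repeated "ab" is replaced by the newly learned code 256 after its first occurrence.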
6. BZip2 Compression [8] uses the Burrows-Wheeler block-sorting text compression algorithm [8] and Huffman coding. This compression is considered
to be better than what is achieved by LZ77/LZ78-based compressors. It approaches the performance of PPM (see item 8 in this list), a family of
statistical compressors.
7. Arithmetic Coding [14] is a method to achieve lossless data compression.
Arithmetic coding is a form of variable-length entropy encoding that converts
a string into a representation in which more frequently used characters are
encoded with fewer bits and infrequently used characters with more bits.
Arithmetic coding encodes the entire message into a single number in [0, 1).
8. Prediction By Partial Matching (PPM) Compression [12, 11, 7] is a
data compression scheme that over the last 15 years has outperformed other
lossless text compression methods. PPM is a finite context statistical modeling
technique that can be viewed as blending together several fixed-order context
models to predict the next character in the input sequence. It is convenient to
assume that an order-k PPM model stores a table of all subsequences up to
length k that occur anywhere in the training text. For each such subsequence
(or context), the frequency counts of the characters that immediately follow it
are maintained. When compressing the i-th character s_i of a string s, the statistics
associated with the previous k characters s_{i−k}, ..., s_{i−1} are used to predict
the current character. If no such statistic exists, a shorter context is used for
the prediction. The predicted probability is used together with an arithmetic
coder to store the character. Table 1 presents an example of PPM modeling
(see [11]).
Order k = 2          | Order k = 1         | Order k = 0       | Order k = -1
Prediction   c  p    | Prediction   c  p   | Prediction  c  p  | Prediction  c  p
ab → r       2  2/3  | a  → b       2  2/7 | → a    5  5/16    | → A    1  1/|A|
   → Esc     1  1/3  |    → c       1  1/7 | → b    2  2/16    |
ac → a       1  1/2  |    → d       1  1/7 | → c    1  1/16    |
   → Esc     1  1/2  |    → Esc     3  3/7 | → d    1  1/16    |
ad → a       1  1/2  | b  → r       2  2/3 | → r    2  2/16    |
   → Esc     1  1/2  |    → Esc     1  1/3 | → Esc  5  5/16    |
br → a       2  2/3  | c  → a       1  1/2 |                   |
   → Esc     1  1/3  |    → Esc     1  1/2 |                   |
ca → d       1  1/2  | d  → a       1  1/2 |                   |
   → Esc     1  1/2  |    → Esc     1  1/2 |                   |
da → b       1  1/2  | r  → a       2  2/3 |                   |
   → Esc     1  1/2  |    → Esc     1  1/3 |                   |
ra → c       1  1/2  |                     |                   |
   → Esc     1  1/2  |                     |                   |

Table 1: PPM model after processing the string abracadabra (maximum order 2). c
counts the substring appearance and p is the substring probability.
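The frequency counts in Table 1 can be reproduced mechanically; the sketch below builds the per-order context tables (counts only, without the escape bookkeeping or the arithmetic coder):

```python
from collections import defaultdict

def ppm_contexts(text: str, max_order: int = 2):
    # tables[k][context][char] = number of times `char` followed `context`
    tables = {k: defaultdict(lambda: defaultdict(int))
              for k in range(max_order + 1)}
    for i, ch in enumerate(text):
        for k in range(max_order + 1):
            if i >= k:
                tables[k][text[i - k:i]][ch] += 1
    return tables

t = ppm_contexts("abracadabra")
# matches Table 1: e.g. the context "ab" is followed by "r" twice
```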
9. N-Gram is a series of N consecutive characters in a string. For example, the
string "Example of n-grams" contains the following 2-grams:
{ "Ex", "xa", "am", "mp", "pl", "le", "e ", " o", "of", "f ", " n", "n-", "-g",
"gr", "ra", "am", "ms" } and the following 3-grams:
{ "Exa", "xam", "amp", "mpl", "ple", "le ", "e o", " of", "of ", "f n", " n-",
"n-g", "-gr", "gra", "ram", "ams" }. N-Gram is a common tool in NLP.
10. N-Gram Vector of a given string is a vector of size |Σ|^N, where Σ is the
alphabet. Each entry in the vector is the number of occurrences of an N-Gram
in the original string. For example, the Bi-Gram vector of the string "AABCA"
over the alphabet {A, B, C} is q = (1, 1, 0, 0, 0, 1, 1, 0, 0), where the entries
count the bigrams AA, AB, AC, BA, BB, BC, CA, CB and CC, respectively.
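A direct way to build such vectors (the alphabet and its ordering are explicit arguments here):

```python
from itertools import product

def ngram_vector(s: str, alphabet: str, n: int) -> list[int]:
    # one entry per possible n-gram, in lexicographic order of the alphabet
    counts = {"".join(g): 0 for g in product(alphabet, repeat=n)}
    for i in range(len(s) - n + 1):
        counts[s[i:i + n]] += 1
    return list(counts.values())

# the bigram vector of "AABCA" over {A, B, C}:
# ngram_vector("AABCA", "ABC", 2) -> [1, 1, 0, 0, 0, 1, 1, 0, 0]
```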
11. N-Gram Euclidean Distance is measured between strings. Each string is
represented with an N-Gram vector. The distance between two strings is the
Euclidean distance between their corresponding N-Gram vectors:

    euc(q, r) = ( Σ_y (q(y) − r(y))² )^(1/2).

The 2-Gram Euclidean distance between the strings s1 = "AABCA" and s2 =
"BCAAB" is
euc((1, 1, 0, 0, 0, 1, 1, 0, 0), (1, 1, 0, 0, 0, 1, 1, 0, 0)) = 0.
We can see that even though these strings are not the same, their 2-Gram
distance is equal to zero. Their 1-Gram distance over the same alphabet is:
euc((3, 1, 1), (2, 2, 1)) = √((3 − 2)² + (1 − 2)² + (1 − 1)²) = √2.
12. N-Gram Cosine Similarity is measured between strings. As with the N-Gram
Euclidean distance, each string is represented by an N-Gram vector. The
N-Gram cosine similarity between two strings is the cosine similarity between
their corresponding N-Gram vectors, defined for two vectors q, r by

    cos(q, r) = Σ_y q(y)r(y) / ( √(Σ_y q(y)²) · √(Σ_y r(y)²) ).
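Both vector measures from items 11 and 12 are a few lines on top of the N-Gram vectors (a plain-Python sketch):

```python
import math

def euc(q: list[int], r: list[int]) -> float:
    # N-Gram Euclidean distance between two count vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, r)))

def cos(q: list[int], r: list[int]) -> float:
    # cosine similarity; returns 0.0 for a zero vector to avoid division by zero
    dot = sum(a * b for a, b in zip(q, r))
    nq = math.sqrt(sum(a * a for a in q))
    nr = math.sqrt(sum(b * b for b in r))
    return dot / (nq * nr) if nq and nr else 0.0

# reproducing the example from item 11:
# euc([3, 1, 1], [2, 2, 1]) -> sqrt(2)
```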
13. Hamming Distance between two bit strings of equal length is the number
of positions for which the corresponding bits differ. It measures the minimum
number of substitutions required to change one bit string into another bit
string. For example, the Hamming distance between 0101 and 1100 is two.
14. Levenshtein Distance (Edit Distance) [19] is a string metric, sometimes
referred to as an edit distance. It can be thought of as a generalization
of the Hamming distance. The Levenshtein distance between two strings is
the minimum number of operations needed to transform one string into the
other, where an operation is an insertion, deletion or substitution of a single
character. It is usually normalized by the length of the longer of the two
compared strings. This normalized distance is called Normalized Edit
Distance (NED). NED is a similarity metric as defined earlier (3 in this list).
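The edit distance and its normalized form can be sketched with the classic dynamic program (one row kept at a time):

```python
def levenshtein(a: str, b: str) -> int:
    # dynamic program over prefix pairs, keeping a single row in memory
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ned(a: str, b: str) -> float:
    # Normalized Edit Distance: normalize by the length of the longer string
    return levenshtein(a, b) / max(len(a), len(b), 1)

# levenshtein("kitten", "sitting") -> 3
```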
15. N-Gram Similarity and Distance (N-Sim) [16] generalizes the concept
of the longest common subsequence to encompass N-grams rather than just
unigrams.
Given two strings X = x1 ... xk and Y = y1 ... yl, let

    Γ_{i,j} = (x1 ... xi, y1 ... yj)  and  Γ^n_{i,j} = (x_{i+1} ... x_{i+n}, y_{j+1} ... y_{j+n}).

The strings are divided into all possible sub-strings of N consecutive characters.
These sub-strings are aligned and compared. A score is given according to the
formula:

    s_N(Γ^N_{i,j}) = (1/N) Σ_{u=1}^{N} s1(x_{i+u}, y_{j+u}),  where s1(x, y) = 1 if x = y and 0 otherwise.

The N-gram similarity looks for an alignment of a consecutive set of sub-strings
whose sum is the highest. N-gram similarity is defined by the following recursion:

    S_n(Γ_{k,l}) = max{ S_n(Γ_{k−1,l}), S_n(Γ_{k,l−1}), S_n(Γ_{k−1,l−1}) + s_n(Γ^n_{k−n,l−n}) }
    S_n(Γ_{k,l}) = 0 if (k = n and l < n) or (k < n and l = n)                 (2.3)
    S_n(Γ_{n,n}) = S_n(Γ_{0,0}) = 1 if x_u = y_u for 1 ≤ u ≤ n, and 0 otherwise.

The N-Sim algorithm, which is a dynamic programming solution to the recursion, is:
Algorithm 1: N-SIM1 algorithm.
Data: X - first string to compare, Y - second string to compare.
      S - two dimensional matrix that holds the calculations throughout the iterations.
 1  N-SIM1(X, Y)
 2  begin
 3      Let K ← length(X);
 4      Let L ← length(Y);
        // prepend the sub-strings x1·0^(N−1) and y1·0^(N−1) to the strings X, Y
 5      for u ← 1 to N − 1 do
 6          X ← x1·0 + X;
 7          Y ← y1·0 + Y;
        // initialize the first row and column of S with zeros
 8      for i ← 0 to K do
 9          S[i, 0] ← 0;
10      for j ← 1 to L do
11          S[0, j] ← 0;
        // calculate Sn(Γ_{k,l})
12      for i ← 1 to K do
13          for j ← 1 to L do
14              S[i, j] ← max{ S[i−1, j], S[i, j−1], S[i−1, j−1] + s_N(Γ^N_{i−N,j−N}) };
15      return S[K, L] / max(K, L);

In our work, we use N-SIM = 1 − N-SIM1, which serves as a similarity metric
as was defined earlier (3 in this list).
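For illustration, the dynamic program can be written over n-gram start positions directly (a simplified sketch that indexes the n-grams themselves rather than performing the padding step of Algorithm 1, so boundary handling differs slightly):

```python
def s_n(x: str, y: str, i: int, j: int, n: int) -> float:
    # fraction of matching characters between x[i:i+n] and y[j:j+n]
    return sum(a == b for a, b in zip(x[i:i + n], y[j:j + n])) / n

def n_sim1(x: str, y: str, n: int = 2) -> float:
    # DP over n-gram start positions; strings shorter than n are compared whole
    kx, ky = len(x) - n + 1, len(y) - n + 1
    if kx <= 0 or ky <= 0:
        return 1.0 if x == y else 0.0
    S = [[0.0] * (ky + 1) for _ in range(kx + 1)]
    for i in range(1, kx + 1):
        for j in range(1, ky + 1):
            S[i][j] = max(S[i - 1][j], S[i][j - 1],
                          S[i - 1][j - 1] + s_n(x, y, i - 1, j - 1, n))
    return S[kx][ky] / max(kx, ky)
```

Identical strings score 1.0 and fully dissimilar strings score 0.0, matching the normalization by max(K, L) in Algorithm 1.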
16. K-Mean Clustering [21] is a popular clustering algorithm. Given k clusters
that were fixed a-priori and n data points, the algorithm defines k centroids
c1, ..., ck. Each data point is assigned to the cluster of the nearest centroid.
Then, each centroid is recomputed from the points assigned to it, and the data
points are assigned again. The process is repeated until the selection of the
centroids does not change from the previous iteration. The selection of the
first centroids greatly affects the quality and the accuracy of the algorithm.
The algorithm aims at minimizing the objective function

    J = Σ_{j=1}^{k} Σ_{i=1}^{n} ||x_i^(j) − c_j||².
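The assign-recompute loop above (Lloyd's iteration) can be sketched for 2-D points as follows (a minimal illustration; the initial centroids are sampled from the data, which, as noted, affects the result):

```python
import random

def dist2(p, q):
    # squared Euclidean distance between two points
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    # Lloyd's iteration: assign to nearest centroid, recompute means, repeat
    rnd = random.Random(seed)
    centroids = [tuple(map(float, p)) for p in rnd.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        new = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new.append(tuple(sum(c) / len(members) for c in zip(*members))
                       if members else centroids[j])
        if new == centroids:   # assignments are stable, so we converged
            break
        centroids = new
    return centroids, labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
```

On the two well-separated blobs in `pts`, the iteration converges to the obvious split regardless of which two points are sampled as initial centroids.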
17. Fuzzy C-Mean Clustering [13] is a fuzzy clustering algorithm. It is called
fuzzy because a data point can belong to more than one cluster at once. The
algorithm is similar to the K-Mean algorithm. Given C as the number of clusters
and n data points, the algorithm is based on the minimization of the following
objective function:

    J = Σ_{i=1}^{n} Σ_{j=1}^{C} u_{ij}^m ||x_i − c_j||²

where m is a real number greater than 1 and u_{ij} is called the membership
function. It measures the membership degree of x_i in the cluster j. u_{ij} and
the centroids c_j are defined as:

    u_{ij} = 1 / Σ_{k=1}^{C} ( ||x_i − c_j|| / ||x_i − c_k|| )^(2/(m−1)),
    c_j = ( Σ_{i=1}^{N} u_{ij}^m · x_i ) / ( Σ_{i=1}^{N} u_{ij}^m ).

The iteration stopping rule is max_{ij} |u_{ij}^(k+1) − u_{ij}^(k)| < ε, where k is
the iteration number; this stopping rule differs from the one used in K-Mean.
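The membership update above can be sketched for 2-D points as follows (a minimal illustration with hypothetical helper names; a point sitting exactly on a centroid is given full membership in that cluster):

```python
import math

def fcm_memberships(points, centroids, m=2.0):
    # u[i][j]: degree to which point i belongs to cluster j; each row sums to 1
    U = []
    for x in points:
        d = [math.dist(x, c) for c in centroids]
        if 0.0 in d:
            # degenerate case: the point coincides with a centroid
            U.append([1.0 if dj == 0.0 else 0.0 for dj in d])
            continue
        row = [1.0 / sum((d[j] / d[k]) ** (2 / (m - 1))
                         for k in range(len(centroids)))
               for j in range(len(centroids))]
        U.append(row)
    return U
```

A point equidistant from two centroids gets membership 0.5 in each, which is exactly the "belongs to more than one cluster" behavior described above.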
18. Fuzzy C Medoids Clustering (FCMdd) [17] is a fuzzy clustering algorithm.
The objective function is based on selecting c representatives (medoids) from
the dataset in such a way that the total fuzzy dissimilarity within each cluster
is minimized. The algorithm is close to the Fuzzy C-Means algorithm but its
complexity is lower, and it produces a good clustering.
This algorithm has a linearized version, which we will refer to as FuzzyMedoids.
The linearized version picks the medoids from a set X by minimizing the fuzzy
dissimilarity of the top items in each cluster. The p·|X| top items of a cluster
are those that got the highest membership scores in this cluster.
Algorithm 2: FuzzyMedoids algorithm.
 1  FuzzyMedoids(X, c, r(·,·))
    Data: X - data items to be clustered, c - number of clusters,
          r(·,·) - distance function, X(p)_i - top p members of cluster i.
 2  begin
 3      Set iter = 0;
        // method II in [17]
 4      Pick the initial set of medoids V = {v1, v2, ..., vc} from X;
 5      repeat
 6          for i = 1, 2, ..., c do
 7              for j = 1, 2, ..., |X| do
 8                  Compute memberships u_ij (Eq. 2.4) and identify the top members X(p)_i, i = 1, 2, ..., c;
 9          Store the current medoids: V_old = V;
10          for i = 1, 2, ..., c compute the new medoid v_i do
11              q = argmin_{x_k ∈ X(p)_i} Σ_{j=1}^{n} u_{ij}^m r(x_k, x_j);
12              v_i = x_q;
13          iter = iter + 1;
14      until V_old = V or iter = MAX_ITER;
The membership function is:

    u_{ij} = (1 / r(x_j, v_i))^(1/(m−1)) / Σ_{k=1}^{c} (1 / r(x_j, v_k))^(1/(m−1))    (2.4)

where m is a real number greater than 1 and r(·, ·) is a distance function.
3 Related work on similarity metrics
Related work on similarity metrics in relation to anomaly detection and clustering
is described in this section. Special attention is given to [10], which demonstrates the
flexibility of the NCD similarity metric.
3.1 NCD for Similarity and Clustering Algorithm

3.1.1 Anomaly Detection
Network Traffic Analysis [26] uses the NCD similarity metric to identify network
traffic that is similar to previously known attack patterns. A Snort [25]
plug-in was developed in order to calculate the NCD value of a reassembled
TCP session against a database of known attack sessions. The NCD metric
was also tested on clustering of worm executables. The results showed that
different variations of the same worms were grouped together while regular
executables were separated from the worms. This makes it possible to test
efficiently whether an executable is suspected to be a worm and, if so, which
worm family it is similar to.
Anomaly Detection for Video Surveillance Application [1] is a system that
detects anomalies in surveillance images using the NCD similarity metric. At
first, the system collects a set of pictures that serves as a database of normal
data. Later, a threshold is defined, and each picture whose NCD score is
above this threshold is considered an anomaly. The compression is achieved
via a lossy compressor.
Masquerader Detection [4, 5] focuses on detecting those who hide their identity
by impersonating other people while using their computer accounts. Masquerader
detection uses the NCD similarity metric that was introduced in [4]. A
user's command line data is examined. A block of 5000 command lines serves
as a database for each user. Then, blocks of 100 command lines each are tested
against the user's database. The test calculates the NCD between the tested
block and the database using several compressors such as PPM, BZip, Compress
and Zlib. A block whose NCD score is above a given threshold is classified
as an anomaly. The results are compatible with previous work [22] on the same
data that uses specific features in the data.
3.1.2 Clustering
Clustering by Compression [10] introduces the notion of NCD and its definition
as a similarity metric. The NCD metric in [10] is used as a method for hierarchical
clustering that is based on a fast randomized hill-climbing heuristic of a new quartet
tree optimization criterion. It provides a wide range of clustering results.
Clustering experimental results: An open-source toolkit, called CompLearn [23],
is used to test the performance of their clustering method and the similarity
metric. Here are some of their clustering experimental results:
1. Genomics and Phylogeny: The authors reconstruct the phylogeny of eutherians (placental mammals) by comparing their whole mitochondrial
genomes. The mitochondrial genome of 20 species was taken and the
NCD distance matrix was computed using different compressors (gzip,
BZip2, PPMZ). Next, the quartet clustering method was used. The results
matched the commonly believed morphology-supported hypothesis.
In another experiment, the sequenced SARS virus genome was clustered
with 15 similar viruses. The NCD distance matrix was calculated using the
BZip2 compressor. The results were very similar to the definitive tree
that is based on medical-macrobio-genomics analysis posted in the New
England Journal of Medicine.
2. Language Trees: "The Universal Declaration of Human Rights" in 52 languages
was clustered. The NCD distance matrix was calculated using the
gzip compressor. The resulting language tree improved on a similar
experiment in [2].
3. Literature: The texts of several Russian writers (Gogol, Dostojevski, Tolstoy, Bulgakov, Tsjechov) with three to four texts each were clustered.
The NCD distance matrix was calculated using a PPMZ compressor. It
provided a perfect clustering for these texts.
4. Music: 36 Musical Instrument Digital Interface (MIDI) files were clustered. The NCD distance matrix was calculated using a bzip2 compressor.
The results grouped together works from the same composers with several
mistakes. Overall the results were encouraging.
5. Optical Character Recognition: Two-dimensional images of handwritten
digits were clustered. The NCD distance matrix was calculated using
a PPMZ compressor. 28 out of 30 (93%) images were clustered correctly.
Later, an SVM was trained for single digit recognition using vectors of
NCD measures as features. The digit recognition accuracy was 87%, which
is the current state-of-the-art for this problem.
6. Astronomy: Observations of the microquasar GRS 1915+105 made with
the Rossi X-ray Timing Explorer were analyzed. 12 objects were clustered.
The NCD distance matrix was calculated using the PPMZ compressor.
The results were consistent with the classifications made by experts of
these observations.
Summing up, this work ([23]) showed promising results in non-feature based clustering.
It lays the foundation for exploring the NCD metric in new fields such as the one
we are exploring, namely, anomaly detection in URIs.
3.2 N-Gram Similarity
N-Gram Similarity ([16]) is a generalized version of the Longest Common Subsequence
(LCS) problem. The algorithm is described in the technical background (see
15 in sec. 2). N-Gram Similarity was designed to provide similarity scores between
words or strings. It was tested on a few datasets: genetic cognates, translational
cognates and confusable drug names. Cognates are words of the same origin that
belong to distinct languages, for example English father and German Vater. The
N-Gram Similarity scores are compared to a few other scores such as NED (see 14
in sec. 2), DICE and LCS. The results showed a clear accuracy advantage for the
N-Gram Similarity and N-Gram Distance developed by [16]. Another conclusion
from these results is that N-Gram Similarity and N-Gram Distance provide almost
the same accuracy.
4 Anomaly detection algorithms that use similarity metrics
In this section, we describe new algorithms that detect anomalies by using similarity
metrics. Similarity metrics provide a normalized measurement in the range [0, 1].
Unnormalized metrics make it difficult to determine which items are "close" to or "far"
from each other; similarity metrics provide a solution to this issue. First, we introduce
an anomaly detection method that is based on similarity metrics and scoring
functions, which we name S-Score. Next, we introduce a novel two-stage anomaly
detection algorithm, based on clustering and similarity metrics, which overcomes the
downsides of the S-Score method; it is named Nearest Similar Centroid (NSC).
4.1 S-Score Algorithm Description
The algorithm compares a tested item against a normal dataset and gives it an
anomaly score. The anomaly score is calculated by a scoring function that uses a
similarity metric. We assume that an item which is similar to a large portion of the
normal dataset will receive a low anomaly score.
The S-Score detection method performs the following:
1. An item u, which is a new arrival, is measured against a database of normal
data T using a similarity measuring function m(·, ·). m(·, ·) is based on either
the NCD or N-Sim similarity metrics. The results are stored in a vector V.
2. The result vector V is given a score by the scoring function s(V). Another way
to describe it is that the energy stored in V is calculated by using an energy
function.
3. The score or energy of the vector V is tested against a threshold. The higher
the score, the more anomalous the item.
Formally, the algorithm is:

Algorithm 3: S-Score detection method.
Data: u - the item being tested, T - the normal dataset
Result: The anomaly score for u
S-Score(u, T)
begin
    i ← 0
    foreach t in T do
        V[i] ← m(u, t); i ← i + 1
    item score ← s(V)/|T|
    return item score
Several similarity measuring functions m(·, ·) were tested:
• m(x, y) = N CD(x, y) - NCD similarity metric (Eq. 2.2).
• m(x, y) = N -Sim(x, y) - N-Sim similarity metric (see section 2.15).
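For intuition, the NCD (Eq. 2.2) can be sketched in a few lines. This is an illustration only: it uses zlib as a stand-in compressor, whereas section 5.2 reports that PPM gave the best results in the experiments.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance (Eq. 2.2), here with zlib as
    a stand-in compressor; the experiments in this thesis use PPM."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# similar URIs compress well together, so their NCD is lower
u1 = b"/cnn/.element/js/2.0/scripts/prototype.js"
u2 = b"/cnn/.element/js/2.0/scripts/effects.js"
u3 = b"/auth/auth.php?smf_root_path=http://www.ricksk8.xpg.com.br/echo2.txt?"
print(ncd(u1, u2) < ncd(u1, u3))
```

With a real-world compressor the values are not exactly in [0,1] (Table 2 contains a 1.02), but the relative ordering of similar versus dissimilar pairs is what the detection methods rely on.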
Several score functions were tested:

s(X) = Σ_i |x_i|   (l1 norm).   (4.1)

s(X) = 1/(max{X} − median{X})² · Σ_i x_i.   (4.2)

s(X) = 1/(max{X} − mean{X})² · Σ_i x_i.   (4.3)

s(X) = Σ_i e^(−|1 − x_i|).   (4.4)

s(X) = Σ_i [ 1/(w_high · σ_high) · e^(−(x_i − μ_high)²/(2σ²_high)) + 1/(w_low · σ_low) · e^(−(x_i − μ_low)²/(2σ²_low)) ]   (4.5)

(the sum of the vector with a Gaussian smoothing; different scaling weights w_high, w_low are used for high- and low-similarity items).

s(X) = Σ_i x_i²   (l2 norm, energy).   (4.6)
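A minimal sketch of Algorithm 3 together with two of the scoring functions above. The length-based metric m_len is a toy stand-in for NCD or N-Sim, used only so that the example is self-contained and runnable:

```python
def s_score(u, T, m, s):
    """S-Score (Algorithm 3): measure u against every item of the
    normal set T with metric m, score the result vector with s,
    and normalize by |T|."""
    V = [m(u, t) for t in T]
    return s(V) / len(T)

def l1(V):   # Eq. 4.1
    return sum(abs(v) for v in V)

def l2(V):   # Eq. 4.6 (energy), the best-performing score
    return sum(v * v for v in V)

# toy [0,1] metric for illustration only: relative length difference
def m_len(a, b):
    return abs(len(a) - len(b)) / max(len(a), len(b))

normal = ["/index.html", "/home.html", "/about.html"]
print(s_score("/index.html", normal, m_len, l2))   # low: similar to the set
print(s_score("/x" * 40, normal, m_len, l2))       # high: dissimilar
```

Plugging in a different scoring function only changes the s argument, which is what makes the method tunable.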
4.1.1  Algorithm Discussion
The use of scoring functions enables similarity and dissimilarity measures to be
treated differently: a different weight can be applied when comparing similar and
less similar items. This makes the algorithm flexible and tunable. The scoring
function can be adjusted to take problematic measures into account and to apply
correcting weights to them. The results from this detection method were satisfactory,
as demonstrated in section 5.3.1: Eq. 4.6 is the best scoring function, while
the best similarity measure is the NCD. The time complexity of this algorithm is
O(|T|), linear in the size of the training dataset.
The reason that N-Sim did not perform as well is its high anomaly
score when short and long strings are compared. This can probably be solved by
dividing the database into 3 or 4 length groups; a string is then compared only to
its own length group, while a different threshold is maintained for each group.
4.1.2  Algorithm Disadvantages
While testing this method, three main problems surfaced:
Efficiency: the most severe problem was time performance. A large database is
needed in order to produce accurate detections, which requires many comparisons
and many CPU-intensive compression operations. This makes the detection
method impractical at a satisfactory accuracy.
Short URIs compared to long URIs: when a short string is compared to a long
one, the short string is given a high anomaly score. This is true for both the
NCD and N-Sim measures, although it is less noticeable with the NCD. It
greatly affects the measurement of a string against a collection of strings of
variable lengths. The results obtained with the NCD are very good, due mainly
to the scoring function; they could be even better if short strings received
special treatment.
Noise: a large database is needed in order to achieve accurate results. The large
dataset contains URIs with random sections, malformed URIs and even several
attacks. This added a lot of noise to the scoring calculations, so the score of
normal URIs was sometimes shifted towards the anomalous end and vice versa.
4.2  Improving the Detection Efficiency and Accuracy by Clustering
The improvement is based on separating the detection into two stages. First, a
group of "good" representatives is chosen from the training set. Then, the tested
item is measured against the chosen group of representatives.
Two main questions arise: how do we choose representatives that can be classified
as "good", and how do we compare the tested items with the chosen representatives?
4.2.1  Choosing Representatives
The representatives have to be diverse while still being similar to most of the other
items. To obtain these characteristics, the items have to be divided into groups
according to their similarity. From each of these groups, we pick the items that are
most similar to other items in that group.
Clustering is used to construct these groups, and the items located at the center of
each cluster are chosen. If the clustering is done properly, diversity between the
groups is guaranteed, and picking items from the center of each cluster guarantees
that they are the most similar to the rest of the items in that cluster.
There are many ways to cluster. These methods can be classified into three major
groups [15]:
1. Hierarchical: algorithms that are based on a linkage metric. Their complexity
is O(n²), which rules them out for large databases.
2. Partitioning: there are several algorithms in this group; K-Means is the
most famous one. Others are, for example, K-Medoids, PAM, CLARA and
CLARANS. Another group of algorithms are the fuzzy K-Means and K-Medoids
variants, such as Fuzzy C-Means and Fuzzy Medoids. The main
problem is how to determine the value of K, the number of clusters.
The complexity varies from O(n²) to O(n), depending on the algorithm and its
implementation.
3. Density based: algorithms that are based on a density function. Items are
clustered together with their closest neighbors within a certain ε-neighborhood.
Some examples are DBSCAN, GDBSCAN, OPTICS and DENCLUE. One advantage
is that the number of clusters is not set in advance, although other
parameters have to be set in advance for most of these algorithms. The
complexity varies from O(n²) to O(n log n).
Fuzzy Medoids (see 18 in section 2) was chosen because of its low complexity (O(n)).
The items are clustered into K initial clusters, where K is unknown. Each cluster's
homogeneity is tested and, if it is above a threshold, the cluster is re-clustered
into K̃ < K subclusters, until we are left with small uniform clusters. This is
in effect a top-down hierarchical clustering: in each clustering step, a node is divided
into K̃ child nodes. During the clustering, very small clusters, with size less
than a MIN_SIZE parameter, are considered noise and are removed from the
dataset. The initial clustering into the first K clusters can be done in several ways
that utilize additional knowledge we have about the data and the behavior of the
similarity metric. For example, we used the fact that short URIs are distant from long
URIs; therefore, the initial clusters are separated into length groups, which produces
better accuracy. The homogeneity of a cluster can be measured in different ways.
We chose to calculate the standard deviation of the similarity distance between the
cluster items and the cluster centroid in the following way:
Homogeneity(C) = StdDev{m(u, c) | u ∈ C, c is the centroid of C}.   (4.7)
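Eq. 4.7 translates directly into code; m is any similarity metric normalized to [0,1], so the value reads as a fraction of the distance scale:

```python
from statistics import pstdev

def homogeneity(cluster, centroid, m):
    """Eq. 4.7: standard deviation of the distances between the
    cluster members and the cluster centroid. With m normalized to
    [0,1], a value of 0.1 means roughly 10% deviation."""
    return pstdev(m(u, centroid) for u in cluster)
```

A cluster whose members all sit at the same distance from the centroid has homogeneity 0.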
Another tested way to calculate the cluster homogeneity, which is not described here,
used the mean of the similarity distances between the centroid and the cluster's
items.
The algorithm is called Nearest Similar Centroid and will be referred to as the
NSC algorithm or the NSC detection method.
Formally, the clustering algorithm can be described by:

Algorithm 4: Clustering method.
Data: T - training dataset, K - number of initial clusters, K̃ - number of
sub-clusters, HLimit - homogeneity limit, NC - final number of clusters,
r(·, ·) - a distance metric (e.g. NCD, N-Sim)
Result: Clusters = {(t_{1,1}, t_{1,2}, ..., t_{1,n_1}), ..., (t_{NC,1}, t_{NC,2}, ..., t_{NC,n_NC}) | t_{i,j} ∈ T, Σ_{i=1}^{NC} n_i = |T|}   (4.8)
ClusterTrainingSet(T, K, K̃, HLimit, r(·, ·))
begin
    NonHomogeneousClusters ← ∅
    Clusters ← ∅
    C ← FuzzyMedoids(T, K, r(·, ·))   // see 18 in sec. 2
    repeat
        foreach cluster ∈ C do
            if Homogeneity(cluster) < HLimit (Eq. 4.7) then
                Clusters ← Clusters ∪ {cluster}
            else
                NonHomogeneousClusters ← NonHomogeneousClusters ∪ {cluster}
        if NonHomogeneousClusters ≠ ∅ then
            // FirstOf() gets the first item from a given set and removes it
            FirstCluster ← FirstOf(NonHomogeneousClusters)
            C ← FuzzyMedoids(FirstCluster, K̃, r(·, ·))
    until NonHomogeneousClusters = ∅
    return Clusters
The FuzzyMedoids clustering method uses a similarity measure function during
clustering. Several similarity functions were tested: NCD, N-Gram Similarity and
N-Gram Cosine Similarity. The N-Gram Cosine Similarity provided fast calculations
together with acceptable clusters for N = 6. The homogeneity threshold HLimit
can be thought of as the allowed percentage of deviation from the cluster's centroid,
since all the distances are normalized to [0,1]. We got
good results with HLimit values between 10% and 15% deviation.
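The re-clustering loop of Algorithm 4 can be sketched as follows. The split callable stands in for the Fuzzy Medoids step (any function that partitions a set into k clusters fits), and the length-based split and homogeneity used in the demo are illustrative assumptions only:

```python
from statistics import pstdev

def cluster_training_set(T, K, K_sub, h_limit, split, homogeneity, min_size=2):
    """Algorithm 4, sketched: split T into K initial clusters, keep
    homogeneous clusters, re-split the rest into K_sub sub-clusters,
    and drop clusters smaller than min_size as noise."""
    pending = split(T, K)
    clusters = []
    while pending:
        c = pending.pop(0)
        if len(c) < min_size:
            continue                      # too small: treated as noise
        if homogeneity(c) < h_limit:
            clusters.append(c)
        else:
            pending.extend(split(c, K_sub))
    return clusters

# demo stand-ins: chunk by sorted length; homogeneity = stdev of lengths
def split(items, k):
    items = sorted(items, key=len)
    n = max(1, len(items) // k)
    return [items[i:i + n] for i in range(0, len(items), n)]

uris = ["/a.js", "/b.js", "/c.js", "/d.js",
        "/img/long/path1.gif", "/img/long/path2.gif",
        "/img/long/path3.gif", "/img/long/path4.gif"]
out = cluster_training_set(uris, K=2, K_sub=2, h_limit=1.0, split=split,
                           homogeneity=lambda c: pstdev(len(u) for u in c))
print(len(out))  # 2 homogeneous clusters: short URIs and long URIs
```

The demo also mirrors the length-group initialization mentioned above, since the stand-in split sorts items by length before chunking.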
Finally, representatives are chosen from the clusters. The centroid of each cluster
is the item that is most similar to all the others, i.e. the item whose average
distance from all the items in the cluster is minimal. With this insight we get

Centroid(C_i) = argmin { (1/|C_i|) Σ_{y∈C_i} dist(x, y), x ∈ C_i } = argmin { Σ_{y∈C_i} dist(x, y), x ∈ C_i }.
From each cluster, we pick the top p% centroids. This means that larger clusters
will receive better representation.
Algorithm 5: Find Centroids method.
Data: C - the cluster, n - the number of centroids to find,
m(·, ·) - a similarity measuring function.
Result: A list containing n centroids.
FindCentroids(C, n, m(·, ·))
begin
    Centroids ← ∅
    for i = 1 to n do
        Centroid ← argmin { Σ_{y∈C} m(x, y), x ∈ C }
        C ← C − {Centroid}
        Centroids ← Centroids ∪ {Centroid}
    return Centroids
The similarity measuring function m(· , · ) is either NCD or N-Sim.
The centroids should be chosen using the same similarity metric that is used
later in the detection method. If a different similarity metric were used instead,
the chosen centroids would not represent the centers of the clusters and the
classification of new items would fail.
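Algorithm 5 can be sketched directly; the example at the end uses numbers with absolute difference as a toy metric in place of NCD or N-Sim:

```python
def find_centroids(C, n, m):
    """Algorithm 5: n times, pick the item whose summed distance to
    the rest of the cluster is minimal, then remove it so the next
    pick is the second-closest item to the center, and so on."""
    C = list(C)
    centroids = []
    for _ in range(min(n, len(C))):
        centroid = min(C, key=lambda x: sum(m(x, y) for y in C))
        C.remove(centroid)
        centroids.append(centroid)
    return centroids

# toy example: 2 is closest to the "center" of {1, 2, 3, 10}, then 3
print(find_centroids([1, 2, 3, 10], 2, lambda a, b: abs(a - b)))  # [2, 3]
```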
4.3  Detection
The representatives are the centers of the generated clusters. A new item can be
assigned to one of the clusters by measuring its similarity to the center of each
cluster. If the similarity measure of the item is within a given radius of one of the
centers, the item is classified into the relevant cluster; otherwise it is an anomaly.
Since the tested item receives a score, we can evaluate how anomalous it is.
The time performance gain is due to the fact that the test is done against a small
fraction of the training dataset. The accuracy is higher because most of the noise
is filtered out during the clustering process. The result is a much faster and more
accurate detection algorithm.
Formally:

Algorithm 6: NSC detection method.
Data: u - the item being tested, R - the representatives set,
ρ - the maximum allowed radius, m(·, ·) - a similarity measuring function.
Result: The anomaly score for u
NSC(u, R, ρ)
begin
    r ← min{m(u, rep) | rep ∈ R}
    if r > ρ then
        mark u as an anomaly
    return r
The similarity measuring function m(·, ·) is either NCD or N-Sim.
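The detection step reduces to a minimum over the representatives; a sketch of Algorithm 6 with a toy numeric metric in place of NCD or N-Sim:

```python
def nsc(u, reps, rho, m):
    """Algorithm 6: the anomaly score of u is its minimum distance
    to any representative; beyond the radius rho, u is an anomaly."""
    r = min(m(u, rep) for rep in reps)
    return r, r > rho          # (score, is_anomaly)

# toy example: numbers with absolute difference as the metric
print(nsc(6, [5, 20], 2, lambda a, b: abs(a - b)))   # (1, False): normal
print(nsc(12, [5, 20], 2, lambda a, b: abs(a - b)))  # (7, True): anomaly
```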
5  Experimental results for anomaly detection in URIs
Experimental results of anomaly detection in URIs are presented. A description of
the datasets is given. We show that the NCD and N-Sim similarity metrics successfully
measure the similarities between URIs. Different scoring functions and their
accuracies for the S-Score detection method are compared. An accuracy comparison
between the S-Score and the NSC detection methods is performed. Finally, we examine
how the size of the representatives set affects the accuracy and performance
of the NSC algorithm.
5.1  URIs Datasets
We used several large, real, commercial operational networks. These networks
consist of several different subnetworks. We use the web server datasets, which
contain traffic from four dozen web servers; these web servers handle thousands of
requests every day.
These networks are protected by several common network security tools:
• Signature-based tools;
• Anomaly-based tools;
• Firewalls and proxies;
• VPNs.
These networks suffer from thousands of attacks every day. Figure 5.1 presents
the network architecture.
Figure 5.1: The network architecture that collects the data
The datasets were collected using the tcpdump program over several days. The
data was collected before it was filtered by these security tools. The resulting corpus
size is several terabytes of raw data. The database is named LURI DB (Large URI
DB).
We constructed a dataset of all the URIs in the collected data. The dataset was
then separated into normal URIs and attack URIs using several tools, such as Snort,
and specially crafted tools that we developed.
The training dataset contains 30,000 URIs, most of which are valid, while
several are attacks and abnormal URIs that were not caught by Snort. The
testing dataset consists of 20,000 URIs: 98% normal URIs and 2%
attacks. All the attacks and abnormal URIs found in the dataset
were injected into the testing dataset.
5.2  The performance of NCD and N-Sim as URI similarity metrics
Initially, the NCD and N-Sim similarity measures (metrics) were tested. The metrics
were used for finding similarities between URIs in an unsupervised manner.
The URI data has semantic rules but is mostly constructed from natural-language
words. This means that a similarity measure that works well for natural languages
and text processing has a better chance of success. Both NCD and N-Sim have shown
good results in the past [10, 16] for measuring similarities between strings, which is
why they were selected for the test.
Since the datasets we work on are confidential, we present instead an example
dataset taken from the www.cnn.com homepage. It is given here to demonstrate
the performance of NCD and N-Sim. The dataset consists of a group of 13 URIs
together with 6 web attacks. The URI list is:
1. /SPECIALS/2008/news/luxury.week/
2. /SPECIALS/2008/news/olympics/
3. /site=cnn_international...tile=5260084747021&domId=330023
4. /site=cnn_international...=homepage&page.allowcompete=yes&params.styles=fs
5. /cnn/.element/js/2.0/scripts/prototype.js
6. /cnn/.element/js/2.0/scripts/effects.js
7. /cnn/.element/js/2.0/csiManager.js
8. /cnn/.element/js/2.0/StorageManager.js
9. /cnn/.element/img/2.0/sect/main/more_rss.gif
10. /cnn/.element/img/2.0/global/red_bull.gif
11. /cnn/2008/US/04/05/airport.arrests.ap/tzpop.lax.gi.jpg
12. /cnn/2008/SHOWBIZ/04/06/heston.dead/tzpop.jpg
13. /cnn/2008/CRIME/04/05/texas.ranch/tzpop.compound.housing.cnn.jpg
14. /...
15. /_vti_bin/owssvr.dll?UL=1&ACT=4&BUILD=6551&STRMVER=4&CAPREQ=0
16. /crawlers/img/stat_m.php?p=0...%26white%3D1%26tariff%3D21%26priceHigh%3D9999
17. /auth/auth.php?smf_root_path=http://www.ricksk8.xpg.com.br/echo2.txt?
18. /cnn/2008/CRIME/04/05/texas.ranch/bush.nato.ap/tags.php?BBCodeFile=http://rpgnet.com/newrpgnet/intranet/cmd.txt?
19. /cnn/2008/SHOWBIZ/04/06/heston.dead///tags.php?BBCodeFile=http://rpgnet.com/newrpgnet/intranet/cmd.txt?
There are four groups of similar URIs. Group 1: {1, 2}, Group 2: {3, 4}, Group
3: {5, 6, 7, 8, 9, 10} and Group 4: {11, 12, 13}. The last six URIs 14-19 are web
attacks. Distance matrices between these URIs were constructed using NCD and
N-SIM similarity metrics.
Table 2: The NCD distance matrix values. Values that are close to zero mean
higher similarity. The highlighted numbers, which are less than 0.5, have good
correspondence to the URI groups. Rows and columns are indexed by URIs 1-19.

 1: 0.18 0.47 0.98 0.98 0.88 0.87 0.91 0.87 0.91 0.90 0.90 0.87 0.85 0.94 0.97 0.97 0.98 0.92 0.91
 2: 0.47 0.19 0.97 0.98 0.90 0.87 0.91 0.89 0.91 0.93 0.90 0.87 0.87 0.97 0.97 0.97 0.98 0.92 0.91
 3: 0.98 0.97 0.15 0.39 0.95 0.96 0.96 0.95 0.95 0.96 0.93 0.95 0.96 0.99 0.98 0.93 0.97 0.91 0.91
 4: 0.98 0.97 0.40 0.14 0.92 0.95 0.95 0.92 0.93 0.95 0.92 0.94 0.96 0.99 0.98 0.93 0.95 0.90 0.91
 5: 0.88 0.88 0.93 0.91 0.12 0.32 0.51 0.49 0.58 0.62 0.87 0.81 0.80 0.93 0.97 0.96 0.92 0.89 0.89
 6: 0.87 0.87 0.95 0.93 0.34 0.13 0.49 0.49 0.56 0.60 0.87 0.81 0.83 0.92 0.97 0.96 0.94 0.89 0.89
 7: 0.88 0.85 0.95 0.93 0.49 0.49 0.15 0.29 0.60 0.62 0.87 0.83 0.85 0.94 0.97 0.97 0.97 0.91 0.91
 8: 0.87 0.87 0.94 0.92 0.49 0.49 0.32 0.13 0.58 0.60 0.87 0.81 0.80 0.95 0.97 0.97 0.95 0.89 0.89
 9: 0.91 0.91 0.95 0.92 0.60 0.58 0.64 0.62 0.11 0.49 0.87 0.80 0.85 0.96 0.95 0.95 0.94 0.88 0.88
10: 0.90 0.93 0.96 0.94 0.64 0.62 0.64 0.64 0.47 0.14 0.90 0.83 0.87 0.95 0.97 0.96 0.97 0.89 0.90
11: 0.90 0.89 0.93 0.91 0.89 0.89 0.89 0.87 0.87 0.89 0.15 0.65 0.69 0.98 0.98 0.97 0.91 0.66 0.79
12: 0.87 0.89 0.94 0.93 0.83 0.81 0.87 0.85 0.78 0.81 0.63 0.13 0.61 0.96 0.97 0.96 0.94 0.79 0.80
13: 0.87 0.87 0.96 0.96 0.83 0.85 0.87 0.83 0.83 0.85 0.68 0.61 0.15 0.98 0.97 0.97 0.98 0.85 0.66
14: 0.97 0.97 0.99 0.99 0.95 0.95 0.97 0.95 0.93 0.95 0.98 0.96 0.98 0.40 0.98 1.00 1.00 0.99 0.98
15: 0.97 0.95 0.98 0.97 0.98 0.98 0.98 0.98 0.95 0.95 1.00 0.98 1.00 0.98 0.14 0.99 1.02 0.99 0.99
16: 0.98 0.97 0.93 0.94 0.96 0.96 0.98 0.97 0.95 0.96 0.97 0.97 0.97 1.00 0.99 0.15 0.94 0.92 0.92
17: 0.97 0.95 0.95 0.92 0.91 0.92 0.97 0.95 0.91 0.95 0.88 0.92 0.95 0.98 1.00 0.93 0.14 0.82 0.81
18: 0.91 0.91 0.90 0.88 0.90 0.90 0.91 0.88 0.88 0.89 0.65 0.80 0.85 0.98 0.99 0.92 0.84 0.13 0.38
19: 0.91 0.91 0.91 0.89 0.89 0.90 0.91 0.88 0.88 0.89 0.79 0.81 0.65 0.98 0.98 0.93 0.84 0.38 0.13
Table 3: The N-Sim distance matrix values. Values that are close to zero mean
higher similarity. The highlighted numbers, which are less than 0.5, have almost
perfect correspondence to the URI groups. Rows and columns are indexed by URIs 1-19.

 1: 0.01 0.30 0.96 0.95 0.79 0.79 0.80 0.82 0.81 0.80 0.83 0.80 0.75 0.93 0.91 0.96 0.90 0.89 0.89
 2: 0.30 0.01 0.96 0.94 0.79 0.79 0.80 0.83 0.81 0.80 0.83 0.81 0.76 0.92 0.91 0.96 0.91 0.90 0.89
 3: 0.96 0.96 0.00 0.34 0.93 0.94 0.94 0.93 0.93 0.93 0.92 0.93 0.95 0.99 0.93 0.87 0.91 0.88 0.88
 4: 0.95 0.94 0.34 0.00 0.90 0.91 0.91 0.90 0.90 0.91 0.89 0.91 0.93 0.98 0.91 0.90 0.87 0.83 0.84
 5: 0.79 0.79 0.93 0.90 0.00 0.21 0.39 0.39 0.48 0.47 0.79 0.73 0.72 0.94 0.91 0.95 0.85 0.87 0.86
 6: 0.79 0.79 0.94 0.91 0.21 0.00 0.35 0.36 0.45 0.47 0.79 0.74 0.74 0.94 0.91 0.96 0.85 0.88 0.86
 7: 0.80 0.80 0.94 0.91 0.39 0.35 0.00 0.20 0.49 0.47 0.80 0.78 0.74 0.93 0.89 0.96 0.88 0.88 0.87
 8: 0.82 0.83 0.93 0.90 0.39 0.36 0.20 0.00 0.48 0.44 0.79 0.76 0.71 0.94 0.89 0.96 0.88 0.87 0.86
 9: 0.81 0.81 0.93 0.90 0.48 0.45 0.49 0.48 0.00 0.36 0.79 0.73 0.72 0.95 0.88 0.94 0.86 0.86 0.85
10: 0.80 0.80 0.93 0.91 0.47 0.47 0.47 0.44 0.36 0.00 0.81 0.76 0.75 0.94 0.89 0.95 0.87 0.88 0.87
11: 0.83 0.83 0.92 0.89 0.79 0.79 0.80 0.79 0.79 0.81 0.00 0.55 0.59 0.96 0.89 0.94 0.80 0.61 0.71
12: 0.80 0.81 0.93 0.91 0.73 0.74 0.78 0.76 0.73 0.76 0.55 0.00 0.50 0.96 0.89 0.95 0.82 0.75 0.76
13: 0.75 0.76 0.95 0.93 0.72 0.74 0.74 0.71 0.72 0.75 0.59 0.50 0.00 0.95 0.89 0.96 0.86 0.79 0.62
14: 0.93 0.92 0.99 0.98 0.94 0.94 0.93 0.94 0.95 0.94 0.96 0.96 0.95 0.04 0.96 0.99 0.97 0.98 0.98
15: 0.91 0.91 0.93 0.91 0.91 0.91 0.89 0.89 0.88 0.89 0.89 0.89 0.89 0.96 0.00 0.95 0.89 0.93 0.92
16: 0.96 0.96 0.87 0.90 0.95 0.96 0.96 0.96 0.94 0.95 0.94 0.95 0.96 0.99 0.95 0.00 0.91 0.89 0.90
17: 0.90 0.91 0.91 0.87 0.85 0.85 0.88 0.88 0.86 0.87 0.80 0.82 0.86 0.97 0.89 0.91 0.00 0.75 0.75
18: 0.89 0.90 0.88 0.83 0.87 0.88 0.88 0.87 0.86 0.88 0.61 0.75 0.79 0.98 0.93 0.89 0.75 0.00 0.25
19: 0.89 0.89 0.88 0.84 0.86 0.86 0.87 0.86 0.85 0.87 0.71 0.76 0.62 0.98 0.92 0.90 0.75 0.25 0.00
These results were compared to the N-Gram Cosine similarity metric, a typical
similarity metric used in NLP. The cosine similarity was calculated for the 6-gram
vectors of the URIs.
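For reference, the n-gram cosine computation (with n = 6, as used here) can be sketched as:

```python
from collections import Counter
from math import sqrt

def ngram_cosine_distance(a: str, b: str, n: int = 6) -> float:
    """1 minus the cosine similarity of the character n-gram count
    vectors; 0 means identical n-gram profiles, 1 means disjoint."""
    va = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    vb = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(c * c for c in va.values())) * \
           sqrt(sum(c * c for c in vb.values()))
    return 1.0 - (dot / norm if norm else 0.0)

print(ngram_cosine_distance("/cnn/.element/js/2.0/scripts/prototype.js",
                            "/cnn/.element/js/2.0/scripts/effects.js"))
```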
Table 4: The cosine similarity distance matrix values. Values that are close to zero
mean higher similarity. The highlighted numbers, which are less than 0.4, do not
correspond to the URI groups; hence, this metric is not suitable. Rows and columns
are indexed by URIs 1-19.

 1: 0.00 0.15 0.69 0.67 0.37 0.36 0.35 0.34 0.35 0.36 0.42 0.42 0.33 0.69 0.64 0.55 0.59 0.38 0.35
 2: 0.15 0.00 0.60 0.62 0.31 0.33 0.35 0.40 0.31 0.39 0.34 0.36 0.25 0.79 0.58 0.54 0.59 0.37 0.33
 3: 0.69 0.60 0.00 0.07 0.38 0.46 0.42 0.39 0.38 0.42 0.40 0.40 0.50 0.95 0.58 0.32 0.43 0.31 0.36
 4: 0.67 0.62 0.07 0.00 0.32 0.39 0.37 0.34 0.34 0.37 0.40 0.44 0.53 0.93 0.56 0.36 0.44 0.29 0.34
 5: 0.37 0.31 0.38 0.32 0.00 0.07 0.13 0.16 0.15 0.30 0.21 0.20 0.22 0.59 0.70 0.40 0.32 0.13 0.13
 6: 0.36 0.33 0.46 0.39 0.07 0.00 0.08 0.15 0.12 0.29 0.29 0.32 0.29 0.59 0.72 0.46 0.43 0.21 0.20
 7: 0.35 0.35 0.42 0.37 0.13 0.08 0.00 0.05 0.09 0.21 0.22 0.28 0.28 0.55 0.72 0.48 0.50 0.19 0.20
 8: 0.34 0.40 0.39 0.34 0.16 0.15 0.05 0.00 0.12 0.20 0.25 0.25 0.27 0.58 0.74 0.47 0.46 0.16 0.17
 9: 0.35 0.31 0.38 0.34 0.15 0.12 0.09 0.12 0.00 0.15 0.26 0.29 0.30 0.59 0.70 0.42 0.44 0.18 0.17
10: 0.36 0.39 0.42 0.37 0.30 0.29 0.21 0.20 0.15 0.00 0.34 0.35 0.35 0.58 0.68 0.48 0.52 0.27 0.25
11: 0.42 0.34 0.40 0.40 0.21 0.29 0.22 0.25 0.26 0.34 0.00 0.17 0.15 0.54 0.67 0.45 0.33 0.13 0.18
12: 0.42 0.36 0.40 0.44 0.20 0.32 0.28 0.25 0.29 0.35 0.17 0.00 0.18 0.49 0.67 0.38 0.29 0.16 0.20
13: 0.33 0.25 0.50 0.53 0.22 0.29 0.28 0.27 0.30 0.35 0.15 0.18 0.00 0.63 0.70 0.47 0.42 0.17 0.11
14: 0.69 0.79 0.95 0.93 0.59 0.59 0.55 0.58 0.59 0.58 0.54 0.49 0.63 0.00 0.86 0.78 0.56 0.62 0.66
15: 0.64 0.58 0.58 0.56 0.70 0.72 0.72 0.74 0.70 0.68 0.67 0.67 0.70 0.86 0.00 0.51 0.72 0.65 0.67
16: 0.55 0.54 0.32 0.36 0.40 0.46 0.48 0.47 0.42 0.48 0.45 0.38 0.47 0.78 0.51 0.00 0.35 0.36 0.37
17: 0.59 0.59 0.43 0.44 0.32 0.43 0.50 0.46 0.44 0.52 0.33 0.29 0.42 0.56 0.72 0.35 0.00 0.21 0.28
18: 0.38 0.37 0.31 0.29 0.13 0.21 0.19 0.16 0.18 0.27 0.13 0.16 0.17 0.62 0.65 0.36 0.21 0.00 0.03
19: 0.35 0.33 0.36 0.34 0.13 0.20 0.20 0.17 0.17 0.25 0.18 0.20 0.11 0.66 0.67 0.37 0.28 0.03 0.00
Next, we calculate the PCA of Tables 2, 3 and 4. The first two principal components are used to plot the tables in two dimensions.
Figure 5.2: 2D PCA plotting of table 2. NCD similarity metric is used
Figure 5.3: 2D PCA plotting of table 3. N-Sim similarity metric is used
Figure 5.4: 2D PCA plotting of table 4. Cosine similarity metric is used
Figures 5.2 and 5.3 are the plots of the distance matrices. Both NCD and N-Sim
produce clusters that are clearly separated from the attack URIs. In Fig. 5.4, we
can see that the cosine similarity distance matrix does not provide well-separated
clusters.
Several compressors have been tested for the NCD metric: LZW, bzip2 and
PPM. The PPM compressor provided the best results and was chosen for the NCD
calculation.
NCD and N-Sim metrics show good capabilities in measuring similarities between
URIs in an unsupervised manner.
5.3  Results of S-Score and NSC for anomaly detection
The results described in the rest of this section were collected by
running the algorithms on the LURI DB dataset (see 5.1).
5.3.1  S-Score Scoring Functions Comparison
Six scoring functions were tested: l1 -norm (Eq. 4.1), vector sum with median scaling
(Eq. 4.2), vector sum with mean scaling (Eq. 4.3), vector sum with exponential
smoothing (Eq. 4.4), vector sum with Gaussian smoothing (Eq. 4.5) and l2 -norm
(Eq. 4.6).
Figure 5.5 compares the accuracy of these scoring functions when
the NCD is used as the similarity metric. The best accuracy together with a low false
negative rate is achieved by the l2-norm and by the vector sum with mean scaling.
Leading is the l2-norm, which produced a false negative rate of 0.135%, a false positive
rate of 3.75% and a total accuracy of 96.15%. A little behind is the vector sum
with mean scaling, which produced a false negative rate of 0.155%, a false positive
rate of 5.055% and a total accuracy of 94.7%. Figure 5.6 shows the relation between
the correct detection probability and the false negative probability for these six
functions when the NCD is used as the similarity metric. For all six functions, a very
low false negative probability comes with a high false positive probability, which
accounts for the low correct detection probability; a higher false negative probability
comes with a low false positive probability and a high correct detection probability.
We can use the "elbow" rule to choose a good balance
between false negatives and correct detections for each function.
Figure 5.5: Comparison between the accuracies of the scoring functions that use the
NCD metric
Figure 5.6: Correct detection vs. false negatives of the scoring functions that use the
NCD metric
Figure 5.7 shows the accuracy comparisons between the same scoring functions
when the N-Sim is used as the similarity metric. Here, the vector sum with Gaussian
smoothing leads with a false negative rate of 0.125%, a false positive rate of 7.65%
and a total accuracy rate of 92.225%. Close second are the vector sum with mean
scaling and the l1-norm scoring functions, with nearly the same accuracy. Figure
5.8 shows the relation between the correct detection and false negative
probabilities of the six functions when the N-Sim similarity metric is used.
As mentioned before, the N-Sim metric produces high anomaly scores when a
short and a long string are compared. N-Sim measures the length of the shortest
conversion path from the first compared string to the second, and the conversion
path from a short string to a long one is at least as long as their length difference;
thus, the anomaly score becomes high. All the tested scoring functions sum the
similarity measures of the items. Therefore, short strings receive a high anomaly
score and N-Sim produces a high number of false positives.
Summing up, both similarity metrics performed well, with the NCD about 5%
more accurate. The best overall scoring functions were the l2-norm and the vector
sum with mean scaling.
Figure 5.7: Comparison between the accuracies of the scoring functions when N-Sim is used
Figure 5.8: Correct detection vs. false negatives of the scoring functions when N-Sim
is used
5.3.2  Performance of NSC vs. S-Score
The NSC detection method has a low false negative rate of 0.085% and a correct
detection rate of 99.105%. Figure 5.9 compares the NSC detection method with
the two best scoring functions. The time performance of the NSC detection method
depends directly on the total number of chosen representatives: for r representatives
taken from a training set of size T, we have r = T/p. If the S-Score detection method
takes τ seconds, then the NSC detection method takes τ/p seconds. Section 5.3.4
evaluates the performance for different values of r.
The NSC detection method achieves high accuracy with a low false negative
rate because it overcomes three problems that occurred when applying the
S-Score detection method: 1. The performance is better due to a lower number of
comparisons. 2. Testing for a minimum similarity distance allows a mixture of string
lengths: a short string compared to a long string achieves a high anomaly score,
meaning it does not belong to the cluster of long strings, as expected. 3. The
clustering and the search for a minimum similarity distance dramatically reduce
the noise in the training data. In our tests, the noise level dropped to zero: the
representatives did not contain any abnormal URIs or attacks.
Figure 5.9: NSC Vs. S-Score (NCD)
5.3.3  Comparison of the accuracy performance in NSC between NCD and N-Sim
Several methods were tested: NCD alone, N-Sim alone, and a decision tree trained
on 4000 items of NCD and N-Sim values. The results are presented in Fig.
5.10. N-Sim achieved the best accuracy, although very good accuracy was also
achieved by the other two methods. The decision tree method produces the lowest
false negative rate. Our assumption, left as a future research goal, is that combining
the two metrics with a better method than a decision tree, such as a Support Vector
Machine, will produce even better results.
Figure 5.10: Comparison between NCD, N-Sim and a decision tree that is based on
both metrics
Figure 5.11 shows the relation between the correct detection and false negative
probabilities of NCD and N-Sim when the NSC detection method is used.
Figure 5.11: Correct detection vs. false negatives that NCD and N-Sim produce
5.3.4  How the number of representatives in NSC affects the performance
The time performance of the NSC detection method depends linearly on
the number of representatives r (see section 5.3.2). The number of representatives is
chosen as p% of the total training set T. Figure 5.12 presents a comparison
between different representative percentages p. The most significant observation is
that even p = 2% of the training data produces a very high detection rate of 98.56%,
with a low false negative rate of 0.085%, together with a performance boost. The
performance gain, as expected from the linear dependency, is 5 times better than in
the test where p = 10% was used.
Figure 5.12: Performance comparison between different representative set sizes p%
A simple heuristic can be added to retain a high detection rate while improving
time performance. It is based on sorting the representatives by their proximity
to the center of their cluster: first all the representatives that are closest to the
center, then all those that are second closest, and so on. If we change the NSC
detection method to search for a representative r that satisfies m(u, r) < ρ, we get
a major performance gain. As Fig. 5.12 shows, this is because 98.56% of the items
are detected by 2% of the training data; therefore, ∼98% of the items will be
detected fast, and only for about 0.6% of the data will it take longer.
Algorithm 7: NSC detection method with the sort heuristic.
Data: u - the item being tested, R - the representatives set sorted by proximity
to the cluster centers, ρ - the maximum allowed radius.
Result: The anomaly score for u
DetectWithHeuristic(u, R, ρ)
begin
    foreach rep ∈ R do
        r ← m(u, rep)   // NCD or N-Sim
        if r < ρ then
            return r
    mark u as an anomaly
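Algorithm 7 is an early-exit scan over the sorted representatives; a sketch with a toy numeric metric in place of NCD or N-Sim:

```python
def detect_with_heuristic(u, sorted_reps, rho, m):
    """Algorithm 7: scan representatives sorted by proximity to their
    cluster centers and return on the first one within radius rho, so
    most normal items exit after very few comparisons."""
    for rep in sorted_reps:
        r = m(u, rep)
        if r < rho:
            return r, False    # close to some representative: normal
    return None, True          # nothing within rho: mark as anomaly

print(detect_with_heuristic(6, [5, 1, 20], 2, lambda a, b: abs(a - b)))   # (1, False)
print(detect_with_heuristic(50, [5, 1, 20], 2, lambda a, b: abs(a - b)))  # (None, True)
```

Compared with Algorithm 6, the full minimum is only computed implicitly for anomalies, which is exactly the rare case.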
6  Previous Work on URI Anomaly Detection
Previous works on URI anomaly detection are described. Each work uses a private
dataset that is unavailable to the public; therefore, it is impossible to compare
the results of these different approaches.
6.1  URI Anomaly Detection
Detecting HTTP Network Attacks using Language Models: A method for
network intrusion detection that is based on n-grams was developed in [24].
They propose a representation of n-grams using tries, accompanied by a novel
method for comparing tries in linear time. Each URI is converted to a
trie and a k-nearest-neighbors method called Zeta is applied to detect
anomalies. The issue of choosing the length of the n-gram is also addressed;
they solved it by moving to a word-based model instead of an
n-gram model, where each URI is parsed into its underlying words. The results
were comparable to those of the best n-gram length found in their previous tests.
Figure 6.1: Zeta function example
Detecting Anomalies in HTTP Queries: A system that can analyze client
queries to different server-side programs was developed in [18]. The system
performs a focused analysis and builds a profile for each of the server-side
applications. The system first enters a learning stage in which it builds the
profiles. In the second stage, parameters from new queries are compared
with the established profiles specific to the application being referenced.
URIs that do not contain a query string are ignored. The profiles
consist of analyses of the following query attributes: length, character
distribution, structural inference, range, presence or absence, and order.
6.2  Prediction by Partial Matching (PPM) Anomaly Detection
A spam filter based on the PPM compression scheme is described in [6, 7].
The PPM model is used to estimate the probabilities of character sequences
based on previous observations. The probability of a document is the product
of the probabilities of all the characters it contains. Each message
receives a “spamminess score” that is based on these probabilities.
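The idea can be sketched with a much simpler order-0 character model standing in for PPM (an assumption made purely for illustration): train one model on spam and one on ham, and score a message by the difference of its log-probabilities under the two.

```python
import math
from collections import Counter

def train_char_model(corpus):
    """Order-0 character model with add-one smoothing (a PPM stand-in)."""
    counts = Counter("".join(corpus))
    total = sum(counts.values())
    vocab = 256  # byte alphabet
    return lambda c: (counts.get(c, 0) + 1) / (total + vocab)

def log_prob(model, message):
    """Log-probability of a message: sum of per-character log-probabilities."""
    return sum(math.log(model(c)) for c in message)

spam_model = train_char_model(["buy now!!!", "free $$$ win $$$"])
ham_model = train_char_model(["meeting at noon", "see the attached notes"])

# Positive "spamminess" score means the spam model fits the message better.
score = log_prob(spam_model, "win $$$ now") - log_prob(ham_model, "win $$$ now")
print(score > 0)  # True: the message resembles the spam training data
```

PPM improves on this sketch by conditioning each character on a variable-length context of preceding characters, which is what makes the compression-based estimates sharp.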
7 Conclusion
In this work, we presented two algorithms for URI anomaly detection that use similarity metrics. The first detection algorithm, which uses score functions, achieves good
results but is computationally expensive.
The second detection algorithm contains two sequential stages. In the first, also termed
training, a normal dataset is summarized using clustering and centroid selection. In the second, anomalies are detected by comparing new data to the cluster centroids. The
tests showed that the NSC algorithm provides fast performance with a very good
detection rate and a very low false-negative rate.
Two similarity metrics, NCD and N-Sim, were used. Both exhibited similar accuracy in the NSC algorithm, while the NCD metric produced better results for the
S-Score algorithm.
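The NCD metric can be approximated with any off-the-shelf compressor, following [10, 20]. A sketch using zlib (the example URIs are invented for illustration):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance approximated with zlib (after [10, 20])."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

normal1 = b"/index.php?id=1&name=john&action=view"
normal2 = b"/index.php?id=2&name=jane&action=view"
attack = b"/index.php?id=1%27%20UNION%20SELECT%20password%20FROM%20users--"

print(ncd(normal1, normal2))  # small: similar URIs compress well together
print(ncd(normal1, attack))   # larger: the injection payload adds novelty
```

Because the compressor exploits shared structure, two URIs drawn from the same application template compress better together than a normal URI paired with an attack, which is exactly the signal the detection algorithms rely on.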
With some performance modifications, our NSC algorithm can serve as an
Intrusion Detection System (IDS) application for web servers.
In the future, we plan to improve the detection stage by taking the cluster
radius into account when detecting anomalies.
References
[1] C. E. Au, S. Skaff, and J. J. Clark. Anomaly detection for video surveillance
applications. In ICPR ’06: Proceedings of the 18th International Conference
on Pattern Recognition, pages 888–891, Washington, DC, USA, 2006. IEEE
Computer Society.
[2] D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping. Physical
Review Letters, 88(4), January 2002.
[3] T. Berners-Lee, R. Fielding, and L. Masinter. RFC 3986, Uniform Resource Identifier (URI): Generic Syntax.
[4] M. Bertacchini and P. I. Fierens. Preliminary results on masquerader detection
using compression based similarity metrics, 2006.
[5] M. Bertacchini and P. I. Fierens. Ncd based masquerader detection using enriched command lines, 2007.
[6] A. Bratko, B. Filipič, G. V. Cormack, T. R. Lynam, and B. Zupan. Spam
filtering using statistical data compression models. J. Mach. Learn. Res., 7:2673–
2698, 2006.
[7] A. Bratko, B. Filipič, and B. Zupan. Towards practical ppm spam filtering:
Experiments for the TREC 2006 spam track. In Proc. 15th Text REtrieval
Conference (TREC 2006), Gaithersburg, MD, 2006.
[8] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, 1994.
[9] G. J. Chaitin. Algorithmic information theory. IBM Journal of Research and Development, 1977. Reprinted in Information, Randomness & Incompleteness: Papers on Algorithmic Information Theory, World Scientific Series in Computer Science, Vol. 8, 1987.
[10] R. Cilibrasi and P. M. B. Vitanyi. Clustering by compression. Information
Theory, IEEE Transactions on, 51:1523–1545, 2005.
[11] J. G. Cleary and W. J. Teahan. Unbounded length contexts for PPM. The
Computer Journal, 40(2/3), 1997.
[12] J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, COM-32(4):396–
402, April 1984.
[13] J. C. Dunn. A fuzzy relative of the isodata process and its use in detecting
compact well-separated clusters. Journal of Cybernetics, 3:32–57, 1973.
[14] P. G. Howard and J. S. Vitter. Arithmetic coding for data compression. Technical Report DUKE-TR-1994-09, 1994.
[15] D. A. Keim and A. Hinneburg. Clustering techniques for large data sets - from
the past to the future. In KDD Tutorial Notes, pages 141–181, 1999.
[16] G. Kondrak. N-gram similarity and distance. In SPIRE, pages 115–126, 2005.
[17] R. Krishnapuram, A. Joshi, O. Nasraoui, and L. Yi. Low-complexity fuzzy
relational clustering algorithms for web mining. IEEE-FS, 9:595–607, Aug. 2001.
[18] C. Kruegel and G. Vigna. Anomaly detection of web-based attacks. In CCS
’03: Proceedings of the 10th ACM conference on Computer and communications
security, pages 251–261, New York, NY, USA, 2003. ACM.
[19] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and
reversals. Soviet Physics - Doklady, 10(8):707–710, February 1966.
[20] M. Li, X. Chen, X. Li, B. Ma, and P. M. B. Vitányi. The similarity metric.
In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 863–872. Society for Industrial and Applied Mathematics, 2003.
[21] J. B. MacQueen. Some methods for classification and analysis of multivariate
observations. In L. M. L. Cam and J. Neyman, editors, Proc. of the fifth Berkeley
Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297.
University of California Press, 1967.
[22] R. A. Maxion. Masquerade detection using enriched command lines. In Proceedings of International Conference on Dependable Systems and Networks (DSN
’03), San Francisco, CA, June 2003.
[23] R. Cilibrasi, A. Cruz, and S. Wehner. CompLearn toolkit.
[24] K. Rieck and P. Laskov. Detecting unknown network attacks using language
models. In DIMVA, volume 4064 of Lecture Notes in Computer Science, pages
74–90. Springer, 2006.
[25] Snort Development Team. Snort: the open source network intrusion detection system.
[26] S. Wehner. Analyzing worms and network traffic using compression. Journal of
Computer Security, 15(3):303–320, 2007.
[27] T. A. Welch. A technique for high-performance data compression. Computer,
17(6):8–19, 1984.
[28] J. Ziv and A. Lempel. A universal algorithm for sequential data compression.
IEEE Transactions on Information Theory, 23(3):337–343, 1977.
Abstract
Web servers are an important component of most businesses today. Because they are
accessible to the general public, they are the most exposed to break-in and attack
attempts. Server administrators deploy Intrusion Detection Systems (IDS) and
Intrusion Prevention Systems (IPS) to protect the servers against these threats.
Most IDSs rely on signatures to identify attacks. The main drawbacks of signatures
are new, unknown attacks (zero-day attacks) and the constant need to update the
signature database to cope with new threats. This problem is well known, and there
is a need to design an IDS that can detect attacks based on the server's normal
behavior, independently of signatures. In this work, two algorithms for detecting
anomalies (attacks or abnormal data) are presented. The algorithms are based on
measuring the similarity between normal and abnormal data. The algorithms, which
were tested on data collected from web servers, rely on two similarity-measurement
methods: the first, called NCD, is based on compression, and the second, called
N-gram similarity, comes from the field of natural language processing (NLP). The
algorithms exhibited high detection rates together with fast operation, and they can
be combined with a signature-based IDS to achieve better security.
Tel-Aviv University
Raymond and Beverly Sackler Faculty of Exact Science
School of Computer Science
URI Anomaly Detection using
Similarity Metrics
This thesis is submitted as partial fulfillment of the requirements
towards the M.Sc. degree in the School of Computer Science,
Tel-Aviv University
By
Saar Yahalom
The research work in this thesis has been conducted under the
supervision of Prof. Amir Averbuch
May 2008