A NetFlow based flow analysis and monitoring system in enterprise

Available online at www.sciencedirect.com
Computer Networks 52 (2008) 1074–1092
www.elsevier.com/locate/comnet
A NetFlow based flow analysis and monitoring system
in enterprise networks
Liu Bin a,*, Lin Chuang a, Qiao Jian b, He Jianping a, Peter Ungsunan a
a Department of Computer Science and Technology, Tsinghua University, Beijing, China
b School of Telecommunication Engineering, Beijing University of Posts and Telecommunications, Beijing, China
Received 2 April 2007; received in revised form 21 August 2007; accepted 29 December 2007
Available online 3 January 2008
Responsible Editor: A. Marshall
Abstract
In this paper, a flow analysis and monitoring system based on NetFlow is introduced. The system is built on a Browser–Server framework and is aimed at enterprise networks. Data collection and display are separated into two modules, which makes the system clearly demarcated and easy to deploy. The data collection module receives and analyzes NetFlow-exported packets and inserts per-flow record information into the Oracle database. The display module acts as a J2EE web server: it fetches real-time or historical traffic information from the database and shows it to web users. In addition to the above-mentioned functions, the most important part of the system is an IDS. A real-time anomalous traffic monitoring module with a stable matching pattern algorithm and two traffic statistic based intrusion detection algorithms (one based on variance similarity, the other on Euclidean distance) is embedded in the system to detect worms and other malicious attacks. To identify anomalous network traffic simply and effectively, a proven “join” strategy is also designed along with the two traffic statistic based intrusion detection algorithms. The whole IDS module is able to run with low computational complexity and high detection accuracy. Finally, we conduct experiments to verify the performance of our system.
© 2008 Elsevier B.V. All rights reserved.
Keywords: Traffic measurement; NetFlow; Intrusion detection; Matching pattern; Similarity
1. Background
When analyzing network traffic information or defending against the threat of various worms and attacks in enterprise networks, we need powerful tools that can handle gigabit-per-second traffic. Tcpdump and other software running on personal computers cannot satisfy this requirement, while Cisco’s NetFlow, along with the associated switches/routers, can easily capture enterprise network link traffic information, analyze it, and give high-level statistics for different purposes. NetFlow can also aggregate the raw data before exporting it, which reduces not only the memory and computational load of the data collecting server but also the link bandwidth consumption.
The principle of NetFlow is as follows: when the router receives a packet, its NetFlow module scans the source IP address, the destination IP address, the source port number, the destination port number, the protocol type, the type of service (ToS) field in the IP header, and the input (or output) interface number on the router to judge whether the packet belongs to a flow record that already exists. If so, it updates the flow record; otherwise, a new flow record is generated in the cache. The expired flow records in the cache are exported periodically to a destination IP address using UDP. (Note that a UDP packet contains several flow records.) The export packet format can be found in the NetFlow standard. The most common version of NetFlow is V5. The newest version, NetFlow V9, is described in RFC 3954 [1].
* Corresponding author. E-mail address: [email protected] (L. Bin).
1389-1286/$ - see front matter © 2008 Elsevier B.V. All rights reserved.
doi:10.1016/j.comnet.2007.12.004
L. Bin et al. / Computer Networks 52 (2008) 1074–1092
Looking through a flow record, we find that it does not contain any upper-layer information; it contains only traffic profiles. The advantage of this approach is its high speed. Ignoring packet payloads greatly reduces the processing overhead and makes NetFlow based anomalous traffic detection an extraordinarily good fit for busy, high-speed network environments [10].
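As an illustration of the flow-cache principle described in this section, the following Python sketch keys a cache on the seven fields listed above, updates a matching record or creates a new one, and returns expired records for export. The class names, field names and the 15-second inactivity timeout are assumptions made for this sketch; it is not the router's actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical key layout:
# (src IP, dst IP, src port, dst port, protocol, ToS, input interface)
FlowKey = Tuple[str, str, int, int, int, int, int]


@dataclass
class FlowRecord:
    first_seen: float
    last_seen: float
    packets: int = 0
    octets: int = 0


class FlowCache:
    """Toy flow cache: one record per distinct seven-field key."""

    def __init__(self, inactive_timeout: float = 15.0) -> None:
        # Assumed timeout; real devices make this configurable.
        self.inactive_timeout = inactive_timeout
        self.records: Dict[FlowKey, FlowRecord] = {}

    def observe(self, key: FlowKey, length: int, now: float) -> None:
        """Update the matching record, or create one if none exists."""
        record = self.records.get(key)
        if record is None:
            record = FlowRecord(first_seen=now, last_seen=now)
            self.records[key] = record
        record.packets += 1
        record.octets += length
        record.last_seen = now

    def expire(self, now: float) -> List[Tuple[FlowKey, FlowRecord]]:
        """Remove and return records idle for at least the timeout."""
        expired = [(k, r) for k, r in self.records.items()
                   if now - r.last_seen >= self.inactive_timeout]
        for key, _ in expired:
            del self.records[key]
        return expired
```

Because the key contains no payload information, each packet costs one dictionary lookup and a few additions, which is consistent with the low per-packet overhead noted above.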
In this paper, a flow analysis and monitoring system based on NetFlow is introduced. The system is built on a Browser–Server framework and is aimed at enterprise networks. Data collection and display are separated into two modules, which makes the system clearly demarcated and easy to deploy. The data collection module receives and analyzes NetFlow-exported packets and inserts per-flow record information into the Oracle database. The display module acts as a J2EE web server: it fetches real-time or historical traffic information from the database and shows it to web users. Additionally, the most important part of the system is an IDS: a real-time anomalous traffic monitoring module with a stable matching pattern algorithm and two traffic statistic based intrusion detection algorithms (one based on variance similarity and the other on Euclidean distance) is embedded in the system to detect worms and other malicious attacks. To identify anomalous network traffic simply and effectively, a proven “join” strategy is also designed along with the two traffic statistic based intrusion detection algorithms. The whole IDS module is able to run with low computational complexity and high detection accuracy.
2. Related works
Intrusion detection systems (IDSs) are designed to monitor activities on a network and recognize anomalies that may be symptomatic of misuse or malicious attacks [2]. An IDS consists of two components: monitoring and analysis. Data is collected by monitoring activities in the hosts or network. Then the raw data is analyzed to classify activities as normal or suspicious. Currently there are two basic approaches to intrusion detection: misuse detection and anomaly detection. Anomaly detection aims to characterize the legitimate behavior of the system and then detect behavior that deviates from it. Common methods used in intrusion detection include statistics, immunology, neural networks, data mining, machine learning, finite state automata and so on. In our system, two kinds of intrusion detection methods are embedded: one uses pattern matching while the other is based on traffic statistics.
Pattern matching is a simple type of attack detection technique. Using the pattern matching technique, IDSs generally match the text (audit records) or binary sequences against known attack signatures [13]. In the pattern matching approach [14–16], rules for identifying attacks are easy to write and understand, and are also very customizable. Recently, deep packet inspection (DPI) has often been used in network intrusion detection and prevention systems (NIDPS), where incoming packet payloads are compared against known attack signatures [38–40]. Attack signatures can be easily generated for new alerts and warnings. In [37], the author presents a novel scheme for pattern matching, called BFPM, that exploits a hardware based programmable state machine technology to achieve deterministic processing rates that are independent of input and pattern characteristics, on the order of 10 Gb/s for FPGA and at least 20 Gb/s for ASIC implementations. The limitation of the pattern matching approach is that it can recognize only known attacks: it requires continuous updates of attack signatures to identify new attacks.
Most pattern matching algorithms in IDSs rely on simple string matching, which is not applicable to our system because NetFlow records carry no upper-layer information, as described in Section 1. Thus, a special pattern matching algorithm is proposed in Section 3 to meet the needs of the system.
The advantage of anomaly detection over pattern matching is that previously unknown attacks can be discovered. Using statistical analysis of network traffic to identify intrusions is a popular research direction. The following are some of the existing detection technologies which are based on network traffic statistics:
1. Cumulative sum method: The CUSUM algorithm
is a commonly used algorithm in statistical process
control, which can detect the change of the mean
value of a statistical process [17,18]. CUSUM
relies on the fact that if a change occurs, the probability distribution of the random sequence will
also change. Generally, CUSUM requires a parametric model for the random sequence so that the
probability density function can be applied to
monitor the sequence. Unfortunately, in many situations, we do not have the knowledge of the
underlying distribution of the statistics we are
observing. For example, while the arrival rate of
TCP connections in the Internet has been shown
to follow a Pareto distribution [19], other traffic
measures that have been used for attack detection
have no known distribution, e.g., the arrival rate
of new source IP addresses [20], or the arrival rate
of reset packets [21]. Thus, a key challenge for
parametric methods is how to model the random
sequence. An alternative approach is the use of
non-parametric methods, which are not model-specific.
For the most common SYN flooding attacks, a
non-parametric cumulative sum (CUSUM)
method is applied in [3] to detect the sequential
change point. The two algorithms considered in
[4] are an adaptive threshold algorithm and a particular application of the CUSUM algorithm for
change point detection. The performance is investigated in terms of the detection probability, the
false alarm ratio, and the detection delay, using
workloads of real traffic traces. In [5], a detection
method of SYN flooding called D-SAT (detecting
SYN flooding attack by two-stage statistical
approach) is proposed. D-SAT only monitors
SYN count and the ratio between SYN and other
TCP packets at the first stage. Then it detects SYN
flooding and finds victims more accurately in its
second stage.
2. Statistical model method: The statistical model method is often used to detect anomalous intrusions. As a mature method, it flags as anomalous those activities that deviate statistically from usual ones by a large amount. The commonly used methods are threshold detection [22], mean and standard deviation [23], and the Markov process model [24].
Threshold detection [22] is one of the most basic and simplest techniques for intrusion detection. Its goal is to record each occurrence of a specific event and to detect when the number of occurrences of that event surpasses a reasonable amount that one might expect to occur within a specified time period. In the mean and standard deviation method, event measures are compared to a user profile mean and standard deviation, so that a confidence interval for an anomaly can be established. The user profile values are fixed or based on weighted historical data [22].
The Markov process model applies only to event counters. It treats each distinct type of event as a state variable and uses a state transition matrix [23] to characterize the transition frequencies between states. A new observation is defined to be anomalous if its probability, which is determined by the previous state and the transition matrix, is too low. In [6], the authors describe statistical model based algorithms which first combine the expectation, variance and other statistical parameters, and then use hypothesis testing to detect attacks.
3. Data mining methods: Data mining is an information extraction activity that discovers hidden facts contained in databases. Data mining techniques [25] are used to find patterns and subtle relationships in data and to infer rules that allow the prediction of future results.
Various data mining techniques are proposed
[26,27], e.g. association rule, classification, clustering analysis, characterization/generalization,
incremental updates and meta rules. In [7], technologies based on mining fuzzy rules and data
mining techniques are designed for the detection
of anomalous network traffic from normal traffic.
When there’s offensive traffic, the system can
detect intrusion by finding the deviation from
normal patterns.
4. Wavelet analysis method: The wavelet analysis technique has already been introduced to intrusion detection [8]. Wavelet analysis intrusion detection (WAID) is achieved by treating state-realizable protocols as discrete waveforms. WAID is offered as a multi-dimensional analysis approach to detecting temporally spaced network attacks. The wavelet analysis method can also be associated with the self-similar characteristics of network traffic. In [9], the author proved mathematically that significant changes in the Hurst parameter can help find attacks. In [41], the authors apply signal processing techniques in intrusion detection systems, and develop and implement a framework, called Waveman, for real-time wavelet based analysis of network traffic anomalies. Two metrics, namely percentage deviation and entropy, are then used to evaluate the performance of various wavelet functions in detecting different types of anomalies.
5. Neural networks: A neural network is a network
of computational units that jointly implement
complex mapping functions. The neural network
approach to intrusion detection is to learn the
behavior of actors (e.g. users, daemons, and so
on) in the system [28,29]. Three phases (collecting
training data, training, and performance) are
required to build a neural network for an intrusion detection system [30].
6. Genetic algorithm: Genetic algorithms are a family of computational models based on principles of evolution and natural selection. The process of a genetic algorithm usually begins with a randomly selected population of chromosomes (the set of chromosomes during a stage of evolution) [31,32]. Genetic algorithms can be used to evolve simple rules for network traffic. These rules are used to differentiate normal network connections from anomalous connections. The rules consider parameters such as source and destination IP addresses and port numbers, duration of the connection, protocol used, etc., to indicate the probability of an intrusion [33]. Anomalous connections here refer to events with a high probability of intrusion.
7. Immune system: The human immune system
(HIS) [34] provides the human body with a high
level of protection from invading pathogens.
Immune system approaches are also used for
computer security, scheduling and virus detection. The role of the immune system in the
human body is similar to the role of an intrusion
detection system. Immune system approaches
with distributed, light weight computing and
self-organizing capabilities can be applied to a
network based intrusion detection system to provide better security [35,36].
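To make the first technique in the list above concrete, a minimal sketch of a one-sided non-parametric CUSUM detector is given below. It uses the textbook recursion y_n = max(0, y_{n-1} + x_n - a) and raises an alarm whenever y_n exceeds a threshold. The function name and the `drift` and `threshold` parameters are chosen for this illustration only and are not taken from [3] or [4].

```python
from typing import List, Sequence


def cusum_alarms(samples: Sequence[float], drift: float,
                 threshold: float) -> List[int]:
    """Return the indices at which the CUSUM statistic exceeds the threshold.

    samples   -- per-interval measurements (e.g. SYN packets per second)
    drift     -- assumed upper bound 'a' on the normal mean of the samples
    threshold -- alarm level for the accumulated excess
    """
    y = 0.0
    alarms = []
    for i, x in enumerate(samples):
        # Accumulate only the excess over the drift; never go below zero.
        y = max(0.0, y + x - drift)
        if y > threshold:
            alarms.append(i)
    return alarms


if __name__ == "__main__":
    # Four quiet intervals followed by a burst.
    print(cusum_alarms([1, 1, 1, 1, 5, 5, 5, 1], drift=2, threshold=5))
```

No distribution of the samples is assumed; only a bound on the normal mean is needed, which is why this form is called non-parametric.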
Besides the pattern matching algorithm proposed in Section 3, two algorithms based on network traffic statistics will be introduced in Section 4. Unlike most existing work, our algorithms focus on practical deployment. In current high-speed networks, intrusion detection algorithms are required not only to be highly accurate but also to have low computational complexity and high processing speed. Although previous algorithms can achieve high accuracy, they are complicated and require many hardware and software resources. Our algorithms are simple to implement and have low false positive rates. Since certain exceptional scenarios may still lead to a decline in detection accuracy, we also propose and prove a “join results” strategy which can further reduce the false positive rate.
3. System structure
3.1. Overview
This system has the following characteristics:
1. Based on a Browser–Server framework to avoid the trouble of client software installation.
2. The data collection module and the data analysis and display module are separated, which makes the system clearly demarcated and easy to modify and deploy.
3. J2EE and Applet development technologies are used in the display module, which makes it possible to supply real-time traffic monitoring functions to web users.
4. Anomalous traffic analysis functions embedded in the system help to find common worms and other malicious attacks. For the pattern matching method, a characteristic field based algorithm can detect known anomalies efficiently. For the traffic statistical method, two simple and accurate algorithms are used with a proven “join” strategy.
5. The data collection module supports all versions (1, 5, 7, 8 and 9) of NetFlow and provides data to other modules in fixed, version-independent formats.
Fig. 1 shows the overall structure of the system.
The switches/routers collect traffic information and
send flow records to data collectors. The collectors
then analyze the raw data according to user-set rules,
and insert both the raw data (selectively) and the processed data into the database. The web server queries
the database, updates the display, and periodically
scans the data to look for anomalous traffic.
3.2. Data collection and processing module
The main function of this module is collecting, aggregating, processing and storing NetFlow data. This module runs on the data collectors in Fig. 1. In the module, dozens of mutually cooperative threads are executed concurrently. Fig. 2 is a simplified flowchart of the handling procedures. Moreover, there are some configuration files which allow users to easily set up the module to meet different requirements.
Fig. 1. System structure.
The system supports 12 kinds of aggregations (in Fig. 2, only two of the 12 aggregation threads are listed), each of which aggregates NetFlow data according to a different flow-field-set and operates on a different table in the database. For example, in the host flow matrix aggregation function, the flow-field-set is [source IP address, destination IP address], so the flow records which have the same source IP address and destination IP address are aggregated by the aggregation thread.
The aggregation threads are flexible in two ways. First, a certain aggregation function can be enabled or disabled at any time by modifying configuration files. Second, different aggregation functions can be deployed on different servers, which helps a high-load network increase processing speed. Here we explain the principles of the protocol aggregation thread and the AS matrix aggregation thread.
1. Protocol aggregation: Strictly speaking, it should be called application layer protocol aggregation, because this thread classifies flow records by not only the Ethernet protocol but also the source (or destination) port field.
The aggregation rules are defined in the configuration file “Protocols.aggregate”, part of which is shown in Fig. 3. Before the data collection starts, a corresponding item is generated in memory for each rule in “Protocols.aggregate”. After collection starts, the protocol type of a flow record is determined by matching it against these rules, and the corresponding rule item in memory is updated. The matching is “short circuit matching”: if a flow record matches an earlier rule, no later rule is checked.
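The short circuit matching described above can be sketched as follows. The rule format of “Protocols.aggregate” is only shown in Fig. 3, so the rule tuples, labels and function names below are hypothetical; the sketch only illustrates that rules are tried in order and that the first match wins.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# Hypothetical rule: (label, IP protocol number, port or None for "any port").
Rule = Tuple[str, int, Optional[int]]

RULES: List[Rule] = [
    ("HTTP", 6, 80),
    ("DNS", 17, 53),
    ("TCP-other", 6, None),   # catch-all rules must come after specific ones
    ("UDP-other", 17, None),
]


def classify(protocol: int, srcport: int, dstport: int,
             rules: List[Rule] = RULES) -> str:
    """Return the label of the first rule that matches (short circuit)."""
    for label, rule_protocol, rule_port in rules:
        if protocol != rule_protocol:
            continue
        if rule_port is None or rule_port in (srcport, dstport):
            return label  # stop here: later rules are never checked
    return "other"


def aggregate(flows: Iterable[Tuple[int, int, int, int]]) -> Counter:
    """Sum octets per label for (protocol, srcport, dstport, octets) flows."""
    totals: Counter = Counter()
    for protocol, srcport, dstport, octets in flows:
        totals[classify(protocol, srcport, dstport)] += octets
    return totals
```

With first-match semantics the order of the rules is significant: placing a catch-all rule before a specific one would hide the specific rule.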
2. AS matrix aggregation: Sometimes, in addition to the traffic statistics of different hosts, people are more concerned with the traffic between different departments. The source and destination AS fields of NetFlow can serve this purpose, but only if BGP has been deployed. The system provides the function of aggregating and analyzing the traffic between different departments based on the AS fields of NetFlow. Meanwhile, because the deployment of BGP is not very common, we also allow an AS to be defined by IP subnet addresses rather than by the AS fields of BGP.
For example, the IP address scope of a certain AS named AS_99 defined in the configuration file “AS_Source.aggregate” (Fig. 4) is “10.8.1.0–10.8.3.255”. Note that all subnet masks are of the form 255.255..0, and the rules for AS_99 are then generated by the following steps:
Step 1: Convert the start IP address 10.8.1.0 to hex: start = 0A080100, and convert the end IP address 10.8.3.255 to hex: end = 0A0803FF.
Step 2: (start >> 8) = 0A0801, (end >> 8) = 0A0803. The number of bits to shift right is 32 - 24 (the number of subnet mask bits) = 8. (Since the subnet mask is of the form 255.255..0, the start and end IP addresses are shifted right to reduce the hash table capacity.)
Step 3: For each long integer i between the shifted values of “start” and “end” (inclusive), insert [i, AS_99] into the hash table. Then, for each flow record’s source or destination IP address, shifting it right by 8 bits and searching the hash table yields the AS number to which the IP address belongs.
Fig. 2. Flowchart of collecting procedures.
Fig. 3. Protocols.aggregate.
Fig. 4. AS_Source.aggregate (AS_Destination.aggregate is quite the same).
3.3. Display and anomalous traffic monitoring module
This module contains three function units: the traffic information monitoring unit, the anomalous traffic detection and query unit, and the system setting unit. Each of the former two units contains two parts: historical data queries and real-time monitoring. Additionally, the system interfaces also belong to this module. Fig. 5 shows part of the system interfaces. In the figure, the bar graph, the pie graph and the blue and white form are interfaces for real-time monitoring, the white form is an interface for historical data queries, and the remaining panel is the main interface of the system. In the following, we introduce each function unit in detail:
Fig. 5. The system interface.
1. The traffic information monitoring unit mainly provides real-time monitoring and historical data queries of traffic going through the switches/routers in which NetFlow is embedded.
A major part of this unit can be considered as the interface of the aggregation threads introduced in Section 3.2 (i.e. this part shows users the real-time or historical data processed by the aggregation threads). Commonly, different monitoring functions in this unit correspond to different aggregation threads in the data collection and processing module introduced in Section 3.2. For example, protocol and application traffic monitoring corresponds to protocol aggregation (Fig. 3), AS flow matrix monitoring corresponds to the department flow matrix aggregation (Fig. 4), end-to-end flow matrix monitoring corresponds to the host flow matrix aggregation, and so on.
Moreover, this unit also offers a real-time monitoring function that is finer-grained than aggregation monitoring: it allows users to track traffic trends of a single IP address or even a single port without any aggregation.
2. The anomalous traffic detection and query unit mainly provides real-time monitoring and historical data queries of anomalous traffic. As we have explained in Section 1, we employ two types of anomalous traffic detection methods in this system: the pattern matching method and the traffic statistical method.
In the pattern matching method, anomalous flow
records are mainly identified by characteristic
fields and characteristic rules (sometimes by the
statistic of ICMP and TCP_FLAGS), while in
the traffic statistical method, anomalous traffic
is identified by the ‘‘similar degree” between the
traffic data and the trained sample.
In the traffic statistical methods, which will be introduced in Section 4.2, the traffic in a certain period is considered to be normal if its “feature extraction vector” and the “trained sample” are similar enough. In this section, however, we only discuss the pattern matching method; the concepts of “trained sample” and “feature extraction vector” and the details of the traffic statistical methods will be discussed in Section 4.2.
Because ICMP and TCP_FLAGS information is helpful when detecting some worms, an ICMP packet and TCP_FLAGS monitoring function is also implemented in this unit. It provides statistical information on the ICMP protocol and on the URG, ACK, PSH, RST, SYN, and FIN flag bits of the TCP protocol (a NetFlow export record contains the cumulative OR of the TCP flags over the entire flow duration).
Characteristic fields involve the source port, the
destination port, the number of packets, the
number of octets, and the protocol type of flow
record fields. A characteristic rule is defined by
several of the above characteristic fields. We say
that a flow record matches a characteristic rule
if this flow record matches all the characteristic
fields of the rule.
As shown in Fig. 6, the needed function is: “for a period of time, identify all the flow records that match a certain characteristic rule set (that is, if a flow record matches any rule of the set, we pick it up; otherwise we ignore it). When a characteristic flow record first appears, the system stores the matched flow record in memory in a specific format; if it has appeared before, the system updates the corresponding stored item instead of creating a new one.” To make the result clear, the matched result set distinguishes between different source addresses and different destination addresses, i.e. two flow records that match the same rule but have different source addresses or different destination addresses are treated as two different items in the output result set. To implement the function of Fig. 6, we propose the following matching algorithm.
Fig. 6. The needed function.
Matching algorithm:
Step 1: Take the characteristic rule set $A$. For each rule $a \in A$, named $a_{name}$, let its form be $[a_1, a_2, a_3, a_4, a_5]$, where $a_1$ to $a_5$ represent the characteristic fields (“source port”, “destination port”, “packets number”, “octets number”, and “protocol type”), and
$$a_i\ (i \in [1,5]) = \begin{cases} \text{the string format of the } i\text{th element} & (\text{the } i\text{th element of } a \text{ is defined}), \\ \phi & (\text{the } i\text{th element of } a \text{ is undefined}). \end{cases}$$
Then for each rule $a \in A$, a mapping set is generated:
$$a_{rule} = [a_{rule1}, [a_{rule2}, a_{rule3}]].$$
Here $a_{rule1} = \sum_{i=1}^{5} a_i$, where the sum means concatenating the strings; $a_{rule2} = a_{name}$; $a_{rule3} = [b_1, b_2, b_3, b_4, b_5]$, and
$$b_i = \begin{cases} 1 & (a_i \neq \phi), \\ 0 & (a_i = \phi). \end{cases}$$
Let $A_{rule}$ be the mapping rule set generated by $A$.
Step 2: The result set $R$ is obtained by accessing the database and getting all the raw flow records between times $t_1$ and $t_2$. Let $S$ (initialized as NULL) be the algorithm output set. Each item of $S$ includes two attributes: the keyword (denoted by key, introduced in Step 5) and the value (denoted by value). S(key) is the set of keys, and value(key) is the value corresponding to a certain key, i.e. value(key) is the user-defined output of the matched key. $m$ is the traversal cursor of $A_{rule}$ and $n$ is the traversal cursor of $R$. Initialize $m = 1$ and $n = 1$.
Step 3: Let temprule = $A_{rule}[m]$. According to Step 1, we have temprule = [temprule1, [temprule2, temprule3]].
Step 4: Let $r = R[n]$. According to the definition of table RAW, in which the raw flow records are stored, we have r = [srcaddr, dstaddr, srcport, dstport, dpkts, doctets, protocol, stamp], i.e. r includes seven attributes of a NetFlow record plus the timestamp at which this record was put into the database. The types of all these fields in r are string. Let string sign = NULL. We map [srcport, dstport, dpkts, doctets, protocol] to $[b_1, b_2, b_3, b_4, b_5]$ (temprule3), and append an element of r to sign if its mapping value in temprule3 is 1. For example, if $b_1 = 1$, we append srcport to sign. Then:
if (sign == temprule1) goto Step 5;
else goto Step 6;
Step 5: Let key' = temprule2 + srcaddr + dstaddr;
if (key' ∈ S(key)) update value(key');
else put [key', value(key')] into S;
Step 6: n = n + 1;
if (R[n] != NULL) goto Step 4;
else goto Step 7;
Step 7: m = m + 1; n = 1 (so that the next rule is again checked against every record of R);
if (Arule[m] != NULL) goto Step 3;
else goto Step 8;
Step 8: output S.
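A compact Python sketch of Steps 1 to 8 is given below, under stated assumptions: flow records are dictionaries of strings using the field names of table RAW, rules are given as (name, fields) pairs, a separator is inserted when concatenating field strings (to avoid ambiguous concatenations, which is a small departure from Step 1), and value(key) is taken to be a running count of flows, packets and octets, since the algorithm leaves it user-defined. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Characteristic fields in the order a1..a5 of Step 1.
FIELDS = ("srcport", "dstport", "dpkts", "doctets", "protocol")
SEP = "|"  # assumption: separator keeps concatenated signatures unambiguous

Rule = Tuple[str, Dict[str, Optional[str]]]
Key = Tuple[str, str, str]  # (rule name, srcaddr, dstaddr)


def compile_rule(name: str, fields: Dict[str, Optional[str]]):
    """Step 1: build (rule1, rule2, rule3) = (signature, name, bit mask)."""
    mask = [1 if fields.get(f) is not None else 0 for f in FIELDS]
    signature = SEP.join(str(fields[f]) for f, b in zip(FIELDS, mask) if b)
    return signature, name, mask


def match_flows(rules: List[Rule],
                flows: List[Dict[str, str]]) -> Dict[Key, Dict[str, int]]:
    """Steps 2-8: return the output set S as a dictionary keyed by key'."""
    result: Dict[Key, Dict[str, int]] = {}
    for signature, name, mask in (compile_rule(n, f) for n, f in rules):
        for r in flows:  # the record cursor restarts for every rule
            # Step 4: keep only the fields whose mask bit is 1.
            sign = SEP.join(r[f] for f, b in zip(FIELDS, mask) if b)
            if sign != signature:
                continue
            # Step 5: one output item per (rule, source, destination).
            key = (name, r["srcaddr"], r["dstaddr"])
            item = result.setdefault(
                key, {"flows": 0, "packets": 0, "octets": 0})
            item["flows"] += 1
            item["packets"] += int(r["dpkts"])
            item["octets"] += int(r["doctets"])
    return result
```

Each rule costs one string comparison per flow record, so the total work is proportional to the number of rules times the number of records fetched for the interval.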
One point about the algorithm above needs explanation. In Step 2, we access the database server to get flow information instead of receiving packets from the data collector and judging the flow records directly. The reasons are as follows:
• Receiving packets from the data collector and judging the flow records directly makes the matching algorithm act passively. System fluctuations (such as burst traffic) would then consume much of the algorithm’s time in processing flow records and result in a performance decline.
• Judgment based on the database makes the matching algorithm active. When to begin judging is determined entirely by the time at which the system has obtained the flow information, so even some fluctuation will not result in a performance decline.
• In practice, our layout meets the system requirements for processing speed well. Even when the algorithm is executed once every 1 s, it can handle the flow records in real time.
• If the data collection and the algorithm are deployed on different servers, implementing the algorithm without the database would require transmitting the flow records to both servers, causing a waste of bandwidth.
3. The system setting unit supplies other functions
like automatic traffic report generation, user
management, interface styles setting and so on.
The automatic traffic report generation function
can generate reports and delete or export useless
data in the database according to the configurations (e.g. the content of reports, the report generation time, and the data clearing strategy) to
save the storage space of the database.
4. Using the system to detect anomalous traffic
4.1. Pattern matching method
When using the pattern matching method (or ICMP and TCP_FLAGS monitoring) to detect anomalous flow records, the anomalous flow records are classified into three categories: malicious network attacks, such as denial of service (DoS); Trojan horses and worms; and unexpected network applications (e.g. eMule, BitTorrent).
1. Malicious network attacks
Malicious network attacks can be detected by our
system effectively:
(a) Ping of death: Malicious and oversize ICMP
packets can cause memory allocation errors, lead
to TCP/IP stack collapse, and leave the recipients
with a crashed system. When the ping of death
attack breaks out, the volume of ping packets
between the attacker and a victim can reach hundreds of MB.
ICMP monitoring of this system can create statistics for recent ICMP packets. When the ping of
death attack breaks out, the particularly large
and anomalous ICMP packets with their source
IP addresses can be found in real time.
(b) SYN flood: This kind of attack denies service to the victim by sending many “semi-connected” packets. As we all know, a normal TCP connection starts with a “three-way handshake”. When the attacker repeatedly carries out only the first two handshakes, many connections are left hanging, and these “semi-connected” transactions will exhaust the resources of the victim.
TCP_FLAGS monitoring of this system can find the IP address which has the most connections in a period of time. If a SYN flood does not occur, we should find that the TCP_FLAGS SYN and FIN have almost the same count in the traffic monitoring window. But if a SYN flood breaks out, many more SYN flags can be seen.
In addition, Smurf attacks, UDP floods, Land attacks, Fraggle attacks, e-mail attacks and so on can also be detected using a similar method.
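The SYN/FIN comparison described in (b) can be sketched as follows. The flag bit values are those of the TCP header (FIN = 0x01, SYN = 0x02), as carried in the cumulative TCP flags field of a flow record; grouping by destination address and the `ratio` and `min_syn` thresholds are assumptions made for this illustration, not parameters of the system.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

FIN = 0x01
SYN = 0x02


def syn_fin_counts(
        flows: Iterable[Tuple[str, int]]) -> Dict[str, Tuple[int, int]]:
    """Count flows carrying SYN and FIN per destination address.

    flows -- (dstaddr, tcp_flags) pairs, tcp_flags being the cumulative OR
             of the TCP flags seen during the flow.
    """
    counts = defaultdict(lambda: [0, 0])
    for dstaddr, tcp_flags in flows:
        if tcp_flags & SYN:
            counts[dstaddr][0] += 1
        if tcp_flags & FIN:
            counts[dstaddr][1] += 1
    return {addr: (syn, fin) for addr, (syn, fin) in counts.items()}


def suspected_victims(flows: Iterable[Tuple[str, int]],
                      ratio: float = 3.0, min_syn: int = 10) -> List[str]:
    """Destinations whose SYN count far exceeds their FIN count."""
    return sorted(
        addr for addr, (syn, fin) in syn_fin_counts(flows).items()
        if syn >= min_syn and syn > ratio * max(fin, 1))
```

A flow that completed normally carries both SYN and FIN in its cumulative flags, so its destination stays balanced; flows that never progressed past the handshake carry SYN only and tilt the ratio.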
2. Trojan horses and worms
Trojan horses and worms with known characteristics can be identified using the ‘‘characteristic fields” monitoring function. For example, the characteristic fields of the Trojan horse ‘‘Wincrash_v2” are: destination port = 2583, protocol type = TCP. The characteristic fields of the worm ‘‘Shockwave Killer” are: destination port = 2048, protocol type = ICMP, number of octets = 92. The characteristic fields of some worms are shown below:
Code Red Worm: destination port = 80, protocol type = TCP, packet number = 3, octet number = 144.
Worm.Opasoft, W32.Opaserv.Worm: destination port = 137, protocol type = UDP, octet number = 78.
Worm.NetKiller2003, Worm.Sqlp1434, W32.Slammer, W32.SQLExp.Worm: destination port = 1434, protocol type = UDP, octet number = 404.
Worm.Blaster, W32.Blaster.Worm: destination port = 135, protocol type = TCP, octet number = 48.
Worm.KillMsBlast, W32.Nachi.worm, W32.Welchia.Worm: destination port = 2048, protocol type = ICMP, octet number = 92.
Worm.Sasser, W32.Sasser: destination port = 445, protocol type = TCP, octet number = 48.
W32.Witty.Worm: source port = 4000, protocol type = UDP.
The protocol of most Trojan horses is TCP. The destination ports of some TCP Trojan horses are listed below (their characteristic fields can be written as: destination port = some value, protocol type = TCP):
BackDoor (1999), Black Hole 2001 (2001),
Ripper (2023), Wincrash v2 (2583), Remote
Administrator (4899), VNC (5800, 5900), Dameware NT Utilities (6129), GuangWai Girl (6267),
DeepThroat v1.0-3.1 (6670, 6671), Indoctrination (6939), Priority (6969), Netspy (7306),
Genue (7511), Glacier (7626), Way2.4 (8011),
back orifice (31337), back orifice2000 (54320),
netbus (12345), subseven (27374, 1243).
Because these ports are very special, when characteristic traffic like the above is observed, further investigation and analysis are necessary.
When a worm scans the network before dissemination, it randomly or pseudo-randomly generates a large number of IP addresses to scan, looking for vulnerabilities in the hosts. But a major part of these scanned IP addresses are unused or unreachable, so no TCP FIN packets are returned. So for TCP worms with unknown characteristics, similar to the ‘‘SYN flood” detection above, we can use ‘‘TCP_FLAGS monitoring” and expect to see a large number of SYN packets in the flow records associated with the worm-infected host.
Unknown characteristics UDP worm: If the protocol field of a NetFlow record is ‘‘1”, the protocol type of this record is ICMP. We can then write the destination port field of the record as a three-digit hexadecimal number in which the first digit is the ‘‘ICMP type” and the other two digits are the ‘‘code field for the type”, and use the ‘‘ICMP type” and the ‘‘code field” to look up detailed information in a table [11, Chapter 6].
When a host initiates a UDP request to a nonexistent host, an intermediate switch will send an ‘‘ICMP T_3” (destination unreachable) packet to the worm-infected host. So we can use ‘‘Characteristic fields monitoring” to set ‘‘ICMP T_3” rules such as: ‘‘Port Unreachable (characteristic fields: destination port = 771 (hex: 303), protocol type = ICMP)”, ‘‘Host Unreachable (characteristic fields: destination port = 769 (hex: 301), protocol type = ICMP)”, and so on.
Thus, if we find that a certain IP address suddenly receives a lot of ‘‘ICMP T_3” packets, it is very likely that someone is disseminating UDP worms.
3. Unexpected network application
We can use ‘‘Protocol monitoring” or ‘‘Characteristic monitoring” to detect unexpected network applications by setting a new rule in the configuration file in Fig. 3 or in the characteristics rule set in Fig. 6. For example, some P2P software which may consume too much bandwidth is forbidden in the enterprise network. The characteristics of this P2P software (their characteristic fields can be written as: destination port = some value, protocol type = UDP/TCP) are as follows:
QQ Live (UDP: 13000–14000), XunLei (TCP:
3076, 3077), NetFairy (TCP: 7777, 7778, 11300),
eMule (TCP: 4242, 4661, 4662), KuGoo (TCP:
3318, 7000), BitSpirit (TCP: 16881), BaoCue
(TCP: 6346), PTC (TCP: 50007), Poco (TCP:
2881, 5354, 8094), Kamun (TCP: 3751, 3753,
4772, 4774), 100bao (TCP: 3468), Bai Hua PP
(TCP: 5093), Kuro (TCP: 6800, 6801, 7003),
Baidu Download (TCP: 11000), eDonkey (4371,
4662), Baizhao P2P (TCP: 9000), OPENEXT
(TCP: 2500, 4173, 5467, 10002, 10003), iLink
(TCP: 5000), DDS (TCP: 11608), iMesh (TCP:
4662), winmx (TCP: 5690), PPlive (UDP: 4004,
TCP:8008).
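The characteristic-field rules of this section can be pictured as simple field comparisons against each flow record. The following is a minimal sketch, not the system's actual rule engine: the dictionary layout, the field names (`protocol`, `dst_port`, `octets`) and the helper names `match_rules` and `decode_icmp` are hypothetical, and only a handful of the rules listed above are included. The ICMP helper assumes the encoding described above, where the destination port value holds the ICMP type in its high byte and the code in its low byte.

```python
# Protocol numbers as carried in the protocol field of a flow record.
ICMP, TCP, UDP = 1, 6, 17

# A few rules copied from the lists above; the dict layout is hypothetical.
RULES = [
    {"name": "W32.Slammer", "protocol": UDP, "dst_port": 1434, "octets": 404},
    {"name": "W32.Blaster.Worm", "protocol": TCP, "dst_port": 135, "octets": 48},
    {"name": "W32.Sasser", "protocol": TCP, "dst_port": 445, "octets": 48},
    {"name": "Wincrash v2", "protocol": TCP, "dst_port": 2583},
    {"name": "XunLei", "protocol": TCP, "dst_port": 3076},
    {"name": "ICMP Port Unreachable", "protocol": ICMP, "dst_port": 771},
]


def match_rules(flow, rules=RULES):
    """Return the names of all rules whose fields all equal the flow's fields."""
    return [
        rule["name"]
        for rule in rules
        if all(flow.get(field) == value
               for field, value in rule.items() if field != "name")
    ]


def decode_icmp(dst_port):
    """Split the destination-port value of an ICMP flow into (type, code)."""
    return dst_port >> 8, dst_port & 0xFF


# Hypothetical flow record, for illustration only.
flow = {"src_addr": "10.0.0.5", "protocol": UDP, "dst_port": 1434,
        "packets": 1, "octets": 404}
print(match_rules(flow))
print(decode_icmp(771), decode_icmp(769), decode_icmp(2048))
```

With this encoding, 771 decodes to type 3, code 3 (port unreachable), 769 to type 3, code 1 (host unreachable), and 2048 to type 8, code 0 (echo request), which matches the hexadecimal values quoted in the rules above.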
4.2. Traffic statistical methods
For the pattern matching method, we detect anomalous flow records based on the flow characteristics of NetFlow. When the flow characteristics are unknown or have changed, we instead try to determine whether the traffic is normal from the overall traffic statistics at the macro level. Based on this idea, we propose two traffic statistic based algorithms. Unlike the pattern matching method, which has a high accuracy, the traffic statistical methods may lose accuracy in certain exceptional scenarios. To prevent this, we also propose and prove a ‘‘join results” strategy which can further reduce the false positive rate.
4.2.1. Intrusion detection algorithms
Algorithm 1. An anomalous traffic detection algorithm based on the variance similarity.
Detecting the system’s anomalous traffic is a classification problem, i.e. to distinguish between normal and anomalous traffic of a certain IP address. For management convenience, it can be done in more detail, i.e. with the help of multi-level standards, the level that the anomalous traffic belongs to can be clarified further.
A feature extraction vector $\vec{F}$ is defined as $\vec{F} = (F_1, F_2, \ldots, F_n)$. Here $F_i$ ($1 \le i \le n$) is an average value that can identify the traffic characteristics in a fixed length of time, such as a certain IP address’s average number of passive connections, of received octets, of received packets and so on.
Let us take NetFlow V5 and ‘‘Host Behavior Based Detection” [42,43] as an example to explain how the data can be prepared. In host behavior
based detection, three behavior classes (bytes sent/
bytes received per minute, outgoing flow records
per minute, and bidirectional flow records per minute) are defined as the standard to detect worms.
Thus for our algorithm, the feature extraction vector
of host behavior based detection is [bytes sent/bytes
received per minute, outgoing flow records per minute, bidirectional flow records per minute]. Fig. 7
shows the structure of a V5 flow record, and each
UDP packet sent from the router/switch to the data
collector (in Fig. 1) contains 29 V5 flow records.
When handling the received flow records, the data
collector aggregates and stores records in the database (Oracle) as described in Fig. 2 (note that in the
record structure, there are many traffic information
details like dPkts, dOctets and tos), then with the
help of Hibernate or JDBC, we can get the feature
extraction vector of a certain IP address as follows:
Fig. 7. The structure of a V5 flow record.
To obtain the ‘‘bytes sent/bytes received per minute”, the steps are:
1. To obtain the ‘‘bytes sent per minute”, we can execute the SQL statement
   select sum(dOctets)*60/(END_TIME-START_TIME) from TABLE where TIME between START_TIME and END_TIME and srcaddr = 'IP'
   where TABLE is the table storing the traffic data of IP addresses (named HOSTMATRIX in our system), START_TIME and END_TIME bound the time interval, and the constraint srcaddr = 'IP' restricts the selected records to those whose source address is 'IP'. Denote the result as i1.
2. To obtain the ‘‘bytes received per minute”, we can execute the SQL statement
   select sum(dOctets)*60/(END_TIME-START_TIME) from TABLE where TIME between START_TIME and END_TIME and dstaddr = 'IP'
   Denote the result as i2.
3. The ‘‘bytes sent/bytes received per minute” is obtained as i1/i2.
To obtain the ‘‘outgoing flow records per minute”, we can execute the SQL statement
   select count(*)*60/(END_TIME-START_TIME) from TABLE where TIME between START_TIME and END_TIME and srcaddr = 'IP'
To obtain the ‘‘bidirectional flow records per minute”, we can execute the SQL statement
   select count(*)*60/(END_TIME-START_TIME) from TABLE where TIME between START_TIME and END_TIME and srcaddr = 'IP' and dstaddr in (select srcaddr from TABLE where TIME between START_TIME and END_TIME and dstaddr = 'IP')
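The queries above can be tried end to end on a toy table. The sketch below is only an illustration under several assumptions: it uses Python's built-in `sqlite3` module instead of Oracle with Hibernate or JDBC, the column name `ts` (a timestamp in seconds) and the sample rows are hypothetical, and the function name `feature_vector` is not from the paper. Only the table name HOSTMATRIX and the fields `srcaddr`, `dstaddr` and `dOctets` come from the text.

```python
import sqlite3


def feature_vector(conn, ip, start, end):
    """Return [bytes sent / bytes received, outgoing flows per minute,
    bidirectional flows per minute] for `ip` in the window [start, end],
    where start and end are timestamps in seconds."""
    per_minute = 60.0 / (end - start)
    window = (start, end, ip)
    cur = conn.cursor()

    sent = cur.execute(
        "SELECT COALESCE(SUM(dOctets), 0) FROM HOSTMATRIX "
        "WHERE ts BETWEEN ? AND ? AND srcaddr = ?", window).fetchone()[0]
    received = cur.execute(
        "SELECT COALESCE(SUM(dOctets), 0) FROM HOSTMATRIX "
        "WHERE ts BETWEEN ? AND ? AND dstaddr = ?", window).fetchone()[0]
    outgoing = cur.execute(
        "SELECT COUNT(*) FROM HOSTMATRIX "
        "WHERE ts BETWEEN ? AND ? AND srcaddr = ?", window).fetchone()[0]
    bidirectional = cur.execute(
        "SELECT COUNT(*) FROM HOSTMATRIX "
        "WHERE ts BETWEEN ? AND ? AND srcaddr = ? AND dstaddr IN ("
        "  SELECT srcaddr FROM HOSTMATRIX "
        "  WHERE ts BETWEEN ? AND ? AND dstaddr = ?)",
        window + window).fetchone()[0]

    # The per-minute factors cancel in the sent/received ratio (i1/i2).
    ratio = sent / received if received else float("inf")
    return [ratio, outgoing * per_minute, bidirectional * per_minute]


def demo_connection():
    """Build an in-memory table with four hypothetical flow records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE HOSTMATRIX "
                 "(ts INTEGER, srcaddr TEXT, dstaddr TEXT, dOctets INTEGER)")
    conn.executemany("INSERT INTO HOSTMATRIX VALUES (?, ?, ?, ?)", [
        (0, "10.0.0.1", "10.0.0.2", 100),
        (10, "10.0.0.2", "10.0.0.1", 50),
        (20, "10.0.0.1", "10.0.0.3", 200),
        (30, "10.0.0.1", "10.0.0.2", 100),
    ])
    return conn


print(feature_vector(demo_connection(), "10.0.0.1", 0, 60))
```

For host 10.0.0.1 in this toy data, 400 bytes are sent and 50 received, three flows are outgoing, and two of them go to a host that also sent traffic back, so the vector is [8.0, 3.0, 2.0]. Returning infinity when nothing is received is a choice of this sketch, not something the paper specifies.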
The elements of the feature extraction vector $\vec{F}$ should be flexibly determined to meet different requirements. For example, as discussed in Section 4.1, if a certain IP address suddenly receives a lot of ‘‘ICMP T_3” packets, it is very likely that someone is disseminating UDP worms. Considering this circumstance, the switches’ average number of received ICMP packets can be added as an element of $\vec{F}$ by the servers which are monitoring the network traffic.
A trained sample $\vec{E}$ is defined as the average of the feature extraction vectors $\vec{F}$ of some normal traffic records, and represents the main characteristics of normal traffic. The number and update frequency of trained samples depend on the specific application. For example, two trained samples can be prepared, one for busy times (Monday–Friday) and one for leisure times (Saturday, Sunday), or three different trained samples can be prepared to detect the traffic with different time granularities for intervals of 1 min, 3 min and 10 min. When a streaming server is changed into a common web server, the trained samples obviously need to be re-generated.
So the problem of detecting anomalous traffic becomes a classification problem. The traffic in a certain period is considered to be normal if its feature extraction vector $\vec{F}$ and the trained sample $\vec{E}$ are similar enough, and anomalous otherwise.
Then there comes the problem of how to define ‘‘similar enough”. A quantification method is needed (i.e., there should be a mathematical formulation so that the similarity degree between $\vec{F}$ and $\vec{E}$ can be measured). The similarity-value $sim(x_1, x_2)$ denotes the similarity degree between $x_1$ and $x_2$. The larger $sim(x_1, x_2)$ is, the more similar $x_1$ and $x_2$ are. Here we compute the similarity-value between $\vec{F}$ and $\vec{E}$ as follows:
$$
sim(\vec{F}, \vec{E}) = \sum_{i=1}^{n} w_i \, \frac{\min(F_i, E_i)}{\max(F_i, E_i)},
$$
where $w_i$ is related to $(F_i - E_i)^2$. Here, we use the empirical formula:
$$
w_i = \frac{1/S_i}{\sum_{j=1}^{n} 1/S_j}, \qquad S_i = (F_i - E_i)^2.
$$
So $0 \le sim \le \sum_{i=1}^{n} w_i = 1$.
If Si is too large, it implies that the difference between Fi and Ei is obvious, and wi is given a lower
value. On the other hand, if Si is small, wi is given
a higher value. The weight based idea is simple
and unstable characteristics can be discarded
according to the weight. According to the definition
of similarity-values, for the purpose of reducing the
false positive rate, the determination of trained samples using Algorithm 1 should involve records
whose calculated feature extraction vector elements
are in the middle range.
Then we must find an appropriate similarity-value threshold. If the record’s similarity-value to the trained sample $\vec{E}$ is not less than the threshold, it can be classified as normal traffic, and otherwise as anomalous traffic. The choice of the threshold is directly related to detection accuracy. A more appropriate way to choose the threshold is to calculate the similarity-value for each of a large number of records (containing both normal and anomalous ones), and then identify the threshold that minimizes the false positive rate. A simpler way to determine the threshold is to calculate the similarity-values for each of a large number of normal records and pick the peak similarity-value as the threshold.
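Algorithm 1 can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names and the small constant `eps` are assumptions of the sketch (`eps` only avoids a division by zero when an element of the feature extraction vector equals the trained sample exactly, a case the empirical weight formula leaves undefined), all vector elements are assumed to be positive, and the traffic vectors in the example are made up. The trained sample and threshold are the Node 46 values reported later in Table 4.

```python
def similarity(F, E, eps=1e-12):
    """Weighted similarity between feature vector F and trained sample E.

    Weights are the normalised reciprocals of S_i = (F_i - E_i)^2, so
    elements that differ strongly from the trained sample get less weight.
    Assumes all elements of F and E are positive.
    """
    S = [(f - e) ** 2 for f, e in zip(F, E)]
    inverse = [1.0 / (s + eps) for s in S]   # eps is an assumption of this sketch
    total = sum(inverse)
    weights = [x / total for x in inverse]
    return sum(w * min(f, e) / max(f, e) for w, f, e in zip(weights, F, E))


def is_normal(F, E, threshold):
    """Traffic is normal if its similarity-value is not less than the threshold."""
    return similarity(F, E) >= threshold


E = [0.972, 700.64]        # trained sample of Node 46 (Table 4)
THRESHOLD = 0.2204         # similarity threshold of Node 46 (Table 4)

print(is_normal([1.0, 700.0], E, THRESHOLD))     # hypothetical quiet period
print(is_normal([15.0, 9000.0], E, THRESHOLD))   # hypothetical burst
```

In this example the quiet period has a similarity-value close to 1 and is classified as normal, while the burst falls well below the threshold and is classified as anomalous.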
Algorithm 2. An anomalous traffic detection algorithm based on the Euclidean distance.
In Algorithm 2, a trained sample $\vec{E}$ is also needed, which can be obtained by the same method as in Algorithm 1. But for the trained sample $\vec{E}$ in Algorithm 2, the above limitation of Algorithm 1 is not necessary. In other words, the trained sample of Algorithm 2 can be determined from several types of records, and need not be restricted to records whose calculated feature extraction vector elements are in the middle range. We will prove in detail in the next part that joining the results of the two algorithms can enhance the detection accuracy. Moreover, if the two algorithms generate their trained samples from different types of normal traffic records, a wider range of normal scenarios is covered, which helps to reduce the false positive rate and to implement intrusion detection functions.
For the feature extraction vector $\vec{F}$ and the trained sample $\vec{E}$, define a new vector:
$$
\vec{f} = (f_1, f_2, \ldots, f_n), \qquad f_i = \frac{F_i}{E_i} \quad (1 \le i \le n).
$$
$d$ is the Euclidean distance used to weigh the difference between $\vec{F}$ and $\vec{E}$, and is defined as
$$
d = \left\| \vec{f} - \vec{1} \right\| = \sqrt{\sum_{i=1}^{n} (f_i - 1)^2}.
$$
Following a similar principle to Algorithm 1, there must be a distance threshold that minimizes the false positive rate. (The method to determine the threshold is rather similar to that of Algorithm 1.) So the traffic can be classified by comparing the Euclidean distance between a certain feature extraction vector and the trained sample. If the distance is less than the threshold, the traffic of the time period which corresponds to the vector $\vec{f}$ is normal, and otherwise it is anomalous.
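Algorithm 2 reduces to a normalised Euclidean distance followed by a threshold test. The sketch below is an illustration under assumptions: the function names are hypothetical, all elements of the trained sample are assumed to be non-zero, and the two traffic vectors are made up. The trained sample and distance threshold are the Node 46 values reported later in Table 4.

```python
import math


def euclidean_distance(F, E):
    """Distance between f = (F_1/E_1, ..., F_n/E_n) and the all-ones vector.

    Assumes every element of the trained sample E is non-zero.
    """
    return math.sqrt(sum((f / e - 1.0) ** 2 for f, e in zip(F, E)))


def is_normal(F, E, threshold):
    """Traffic is normal if its distance to the trained sample is below the threshold."""
    return euclidean_distance(F, E) < threshold


E = [0.972, 700.64]        # trained sample of Node 46 (Table 4)
THRESHOLD = 6.3984         # distance threshold of Node 46 (Table 4)

print(is_normal([1.0, 700.0], E, THRESHOLD))     # hypothetical quiet period
print(is_normal([15.0, 9000.0], E, THRESHOLD))   # hypothetical burst
```

Because each element is divided by the corresponding element of the trained sample, features with very different magnitudes (here a packet rate near 1 and an octet rate near 700) contribute on the same relative scale.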
4.2.2. An improvement strategy of combining two algorithms
The main idea of the improvement strategy is to combine the two algorithms to improve the detection accuracy. As shown in Fig. 8, we ‘‘join” the results of Algorithms 1 and 2 in the strategy.
Fig. 8. Judgment process.
Table 1
Symbols definition (i, j, a1, a2 are Boolean variables)
Symbol       Definition
P(j)         The probability of input being j
Aij          The probability of detecting input j as i by Algorithm 1
Bij          The probability of detecting input j as i by Algorithm 2
A(i|j)       The conditional probability that the detection output of Algorithm 1 is i given that the input is j. Aij = A(i|j) P(j)
B(i|j)       The conditional probability that the detection output of Algorithm 2 is i given that the input is j. Bij = B(i|j) P(j)
P(a2, a1)    The strategy output. Here a1 is the output of Algorithm 1 and a2 is the output of Algorithm 2
Pr1          The probability of accurate detection by the ‘‘join” strategy
Now we will prove that the strategy is better than using either algorithm individually. Assume each algorithm is a Boolean function. When the actual traffic is normal, we say the input is 0, and when it is anomalous, we say the input is 1. When the traffic is classified as anomalous by Algorithm 1 or 2, we say the output A (Algorithm 1) or B (Algorithm 2) is 1, and otherwise 0. So there are four (input, output) combinations for each of the two algorithms: (1, 1), (1, 0), (0, 1) and (0, 0). The final result is denoted as D, and $D = A \cap B$ (the logical AND of the two outputs). The symbols needed are defined in Table 1.
By the marginal probability distribution, we have $A_{11} + A_{01} = B_{11} + B_{01} = P(1)$ and $A_{10} + A_{00} = B_{10} + B_{00} = P(0)$. So,
$$
\begin{aligned}
P(1,1) &= P(1,1|1)P(1) + P(1,1|0)P(0) \\
&= (A_{11}+A_{01})\,A(1|1)B(1|1) + (B_{10}+B_{00})\,A(1|0)B(1|0) \\
&= (A_{11}+A_{01})\,\frac{A_{11}}{A_{11}+A_{01}}\,\frac{B_{11}}{B_{11}+B_{01}}
 + (B_{10}+B_{00})\,\frac{A_{10}}{A_{10}+A_{00}}\,\frac{B_{10}}{B_{10}+B_{00}}.
\end{aligned}
$$
Similarly,
$$
\begin{aligned}
P(1,0) &= (A_{11}+A_{01})\,\frac{A_{01}}{A_{11}+A_{01}}\,\frac{B_{11}}{B_{11}+B_{01}}
 + (B_{10}+B_{00})\,\frac{A_{00}}{A_{10}+A_{00}}\,\frac{B_{10}}{B_{10}+B_{00}}, \\
P(0,1) &= (A_{11}+A_{01})\,\frac{A_{11}}{A_{11}+A_{01}}\,\frac{B_{01}}{B_{11}+B_{01}}
 + (B_{10}+B_{00})\,\frac{A_{10}}{A_{10}+A_{00}}\,\frac{B_{00}}{B_{10}+B_{00}}, \\
P(0,0) &= (A_{11}+A_{01})\,\frac{A_{01}}{A_{11}+A_{01}}\,\frac{B_{01}}{B_{11}+B_{01}}
 + (B_{10}+B_{00})\,\frac{A_{00}}{A_{10}+A_{00}}\,\frac{B_{00}}{B_{10}+B_{00}}.
\end{aligned}
$$
And we can validate that
$$
P(1,1) + P(1,0) + P(0,1) + P(0,0) = 1.
$$
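The joint output probabilities can also be checked numerically. The sketch below is an illustration only: the detection and false positive rates and the prior probability of anomalous traffic are made-up numbers, the function names are hypothetical, and, like the expressions above, it assumes that the two algorithms err independently once the input is fixed. It also evaluates the accuracy of the ANDed output, i.e. the case where both algorithms flag anomalous traffic plus the case where at least one of them passes normal traffic, and compares it with the accuracy of each algorithm alone.

```python
from itertools import product


def conditional_tables(a_tpr, a_fpr, b_tpr, b_fpr):
    """Return A[j][i] = A(i|j) and B[j][i] = B(i|j) from detection/false positive rates."""
    A = {1: {1: a_tpr, 0: 1 - a_tpr}, 0: {1: a_fpr, 0: 1 - a_fpr}}
    B = {1: {1: b_tpr, 0: 1 - b_tpr}, 0: {1: b_fpr, 0: 1 - b_fpr}}
    return A, B


def joint_output_probs(a_tpr, a_fpr, b_tpr, b_fpr, p_anomalous):
    """P(a2, a1) for all four output pairs (a2: Algorithm 2, a1: Algorithm 1)."""
    A, B = conditional_tables(a_tpr, a_fpr, b_tpr, b_fpr)
    prior = {1: p_anomalous, 0: 1 - p_anomalous}
    return {
        (a2, a1): sum(prior[j] * A[j][a1] * B[j][a2] for j in (0, 1))
        for a2, a1 in product((0, 1), repeat=2)
    }


def join_accuracy(a_tpr, a_fpr, b_tpr, b_fpr, p_anomalous):
    """Accuracy of D = A AND B: both flag an anomaly, or not both flag normal traffic."""
    return (p_anomalous * a_tpr * b_tpr
            + (1 - p_anomalous) * (1 - a_fpr * b_fpr))


def single_accuracy(tpr, fpr, p_anomalous):
    """Accuracy of one algorithm used on its own."""
    return p_anomalous * tpr + (1 - p_anomalous) * (1 - fpr)


rates = dict(a_tpr=0.9, a_fpr=0.05, b_tpr=0.9, b_fpr=0.02, p_anomalous=0.1)
probs = joint_output_probs(**rates)
print(probs, sum(probs.values()))
print(join_accuracy(**rates),
      single_accuracy(0.9, 0.05, 0.1),
      single_accuracy(0.9, 0.02, 0.1))
```

For these particular numbers the four probabilities sum to 1, and the joined output is more accurate (about 0.980) than Algorithm 1 alone (0.945) or Algorithm 2 alone (0.972). Other rate combinations may behave differently, so this is a spot check, not a proof.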
By the join Boolean operations, we get
$$
\begin{aligned}
Pr_1 &= [P(1,0|0) + P(0,1|0) + P(0,0|0)]\,P(0) + P(1,1|1)\,P(1) \\
&= (B_{10}+B_{00})\,\frac{A_{00}}{A_{10}+A_{00}}\,\frac{B_{10}}{B_{10}+B_{00}}
 + (B_{10}+B_{00})\,\frac{A_{10}}{A_{10}+A_{00}}\,\frac{B_{00}}{B_{10}+B_{00}} \\
&\quad + (B_{10}+B_{00})\,\frac{A_{00}}{A_{10}+A_{00}}\,\frac{B_{00}}{B_{10}+B_{00}}
 + (A_{11}+A_{01})\,\frac{A_{11}}{A_{11}+A_{01}}\,\frac{B_{11}}{B_{11}+B_{01}} \\
&= B_{00} + \frac{A_{00}B_{10}}{A_{10}+A_{00}} + \frac{A_{11}B_{11}}{A_{11}+A_{01}}.
\end{aligned}
$$
When Algorithm 1 or 2 is used individually, the probability of accurate detection is $A_{11} + A_{00}$ or $B_{11} + B_{00}$, respectively. So,
$$
\begin{aligned}
Pr_1 - (B_{11}+B_{00}) &= \frac{A_{00}B_{10}}{A_{10}+A_{00}} + \frac{A_{11}B_{11}}{A_{11}+A_{01}} - B_{11} \\
&= \frac{A_{00}}{A_{10}+A_{00}}\,B_{10} - \frac{A_{01}}{A_{11}+A_{01}}\,B_{11} \\
&= A(0|0)B(1|0)(B_{10}+B_{00}) - A(0|1)B(1|1)(B_{11}+B_{01}).
\end{aligned}
$$
For a real system, the probability of accurate detection is greater than that of a false positive, and the probability of normal traffic is greater than that of anomalous traffic. So we have
$$
A(0|0)B(1|0) \approx A(0|1)B(1|1)
\quad \text{and} \quad
(B_{10}+B_{00}) = P(0) \gg (B_{11}+B_{01}) = P(1).
$$
Then,
$$
Pr_1 - (B_{11}+B_{00}) > 0.
$$
Similarly,
$$
Pr_1 - (A_{11}+A_{00}) > 0,
$$
i.e., the probability of accurate detection of the ‘‘join” strategy is higher than that of either algorithm used individually. Thus, the above formulations show that the accuracy of detection is improved.
5. System performance
5.1. Overall performance
In this section, we analyze the overall performance of the system in the Gigabit network shown in Fig. 9 (in the figure, the NetFlow function is supported by Cisco 4000 and 6000 series). We hide the department names.
Fig. 9. The actual network.
According to Fig. 1, the hardware configuration of the database servers and data collectors is E3500 ((400 MHz CPU × 1, 1024 MB Memory × 1) × 4, TGX Card Quad Ethernet Card DFSCSI Card × 1, 18 GB Febrial Disk × 4, 8 mm Tape Driver Int × 1) and A1000 (36 GB DFSCSI Hard Disk × 1). The OS is Sun Solaris 8, and the database is Oracle 10g.
As introduced in Section 3, the system supports 12 aggregation threads, each of which aggregates NetFlow data according to a different flow-field-set and operates on a different table in the database, and all the aggregation functions can be enabled or disabled at any time by modifying configuration files. Additionally, when the data collector receives a NetFlow UDP packet, it can either record the raw flow records in the packet or simply drop the packet after handling it by the process in Fig. 2. Fig. 10 shows part of the configuration file which controls the status of the above functions. We can see that the ‘‘SaveRaw” function (i.e. recording the raw flow records to the database), the ‘‘SrcAS aggregation” function and the ‘‘ASMatrix aggregation” function are enabled, and that the intervals for flushing the memory and storing the ‘‘SrcAS” and ‘‘ASMatrix” information to the database are 3 s and 29 s, respectively. ‘‘DstAs.interval = 0” means that the ‘‘DstAS aggregation” function is disabled.
The overall performance testing results for the network in Fig. 9 are listed in Table 2. The average flow record rate of the network in Fig. 9 is 90 flow records/s (3–4 UDP packets/s), while in a peak hour it is 150 flow records/s (5–6 UDP packets/s), and under burst conditions it can reach 250 flow records/s (8–9 UDP packets/s). With all aggregation functions enabled, including recording raw flow records to the database, we have to sample 1 in 4 flow records to keep up with the packet arrival speed during burst periods. If we do not record raw flow records and disable five aggregating functions, sampling 1 in 2 is sufficient. Therefore, enabling only the necessary functions is an effective way to improve system performance.
Fig. 10. Configuration file.
Table 2
Performance testing results (real-time processed packet ratio)
Sample rate    Part of aggregating functions (%)    All aggregating functions, recording raw data (%)
No sampling    49.90                                25.30
1 in 2         99.50                                50.66
1 in 4         100                                  88.11
5.2. Pattern matching algorithm performance
By observing the matching algorithm we can find that the running time is related to the size of the rule set and the number of flow records involved, but has nothing to do with the proportion of matched records. (This is because: (1) all flow records are processed by the algorithm whether they are matched or not; (2) although processing a matched flow record differs from processing a non-matched one, the database access time, which is thousands of times longer than the execution time of an ordinary program statement, is the same.) Therefore, we discuss the performance of the matching algorithm in terms of rule set size and flow record number.
For different rule set sizes (100, 200, 500, 1000, 2000, 5000, 10 000) and flow record numbers (100–10 000, increased by 50), we measure the running time when the proportion of matched flow records is 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100% (where the corresponding rule of a matched flow record is randomly chosen from the rule set) and then calculate the average running time over these eleven circumstances. Results are shown in Fig. 11.
Fig. 11 shows that, for each rule set size, the running time of the matching algorithm grows linearly with the flow number, and as we discussed, the running time has nothing to do with the matched proportion, so our algorithm is stable. For the network in Fig. 9, the flow rate is 90 records/s on average, 150 records/s in the peak hour and 250 records/s during burst periods. From Fig. 11 we can get Table 3.
According to Section 5.1, with 1 in 4 sampling, the algorithm implementation can handle a scale of 1000 rules and 600 records/s when run once per second (the 150 sampled records take 984 ms in Table 3), far more than the actual scale during burst periods in the deployed system environment in Fig. 9, with 450 rules and 250 flows/s.
Fig. 11a. Matching algorithm running time 1.
Fig. 11b. Matching algorithm running time 2.
Table 3
Part of running times (ms)
Rule set size    Running time (ms) by number of flow records
                 100     150     200     250
100              125     141     156     187
200              188     203     250     328
500              359     468     563     766
1000             672     984     1294    1500
2000             1250    1875    2357    3015
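The capacity claim of this section can be checked against Table 3 with a small lookup. The sketch below is only an arithmetic illustration: the function name `fits_budget` and the one-second budget parameter are assumptions, and the lookup only works for the loads that Table 3 actually tabulates.

```python
# Table 3: running time in ms, indexed as RUNNING_TIME_MS[rules][records].
RUNNING_TIME_MS = {
    100:  {100: 125,  150: 141,  200: 156,  250: 187},
    200:  {100: 188,  150: 203,  200: 250,  250: 328},
    500:  {100: 359,  150: 468,  200: 563,  250: 766},
    1000: {100: 672,  150: 984,  200: 1294, 250: 1500},
    2000: {100: 1250, 150: 1875, 200: 2357, 250: 3015},
}


def fits_budget(rules, records_per_s, sample_1_in=1, budget_ms=1000):
    """True if one second's worth of sampled records is matched within the budget.

    Raises KeyError for loads that are not tabulated in Table 3.
    """
    sampled = records_per_s // sample_1_in
    return RUNNING_TIME_MS[rules][sampled] <= budget_ms


print(fits_budget(1000, 600, sample_1_in=4))   # 150 sampled records, 984 ms
print(fits_budget(1000, 250))                  # unsampled burst load, 1500 ms
```

With 1 in 4 sampling, 600 records/s become 150 records per run, which Table 3 lists at 984 ms for 1000 rules, just inside a one-second running interval. The unsampled burst load of 250 records with 1000 rules takes 1500 ms and would not fit.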
5.3. Traffic statistical algorithm performance
5.3.1. NS simulation of anomalous traffic
Because we had not obtained permission to introduce attacks to verify our traffic statistical algorithms when we wrote this paper, NS2 was used.
The topology for the normal traffic simulation is shown in Fig. 12. Nodes 6–45 denote 40 web servers and Nodes 46–465 denote 420 web clients, which communicate using HTTP 1.0 and TCP Reno. The number of sessions is 400 for high-load scenarios and each session consists of a fixed number (300) of Web pages. This ensures that almost all sessions are active for the duration of the simulation (4200 s). For each session, the server and client are randomly chosen from their respective groups, and the client visits all web pages of that server.
Fig. 12. Topology for internet traffic simulation.
Parameters are set as follows: Session Interval:
exponential distribution of 1 s; Visiting Interval:
exponential distribution of 1 s; Object Size and
Shape: 12 packet/object, shape 1.2 Pareto II; TCP
packet: 1000 byte.
Feldmann et al. have shown that simulation under the above topology and parameters produces traffic similar to real network traffic [12].
Based on the above topology, we simulate some
anomalous traffic in the following circumstances:
1. Long term attack: 1000–1500 s, attacker at Node
6 and recipient at Node 46 with an intruding traffic of 1 Mbps.
2. Heavy Traffic attack: 2000–2050 s, attacker at
Node 16 and recipient at Node 48 with an intruding traffic of 1.2 Mbps in a time burst.
3. Smooth attack: 3000–3100 s, attacker at Node 26
and recipient at Node 50 with an intruding traffic
of 1.8 Mbps. Both the attack traffic and time
range are moderate.
Fig. 13 shows the result of the simulation. The red traffic is for Node 46, green for Node 48 and blue for Node 50.
Many varieties of attack modes exist, and the corresponding traffic is rather complicated. Here we only consider the attack intensity and time range of the traffic to test whether our algorithms can detect those kinds of attacks or not.
5.3.2. Calculation for trained samples
Compared with a real scenario, the simulation provides less information in both the amount of data and the content of an individual packet, but it is still able to verify the performance of our strategy to a certain extent. According to the above simulation setting and the algorithms’ needs, we define the two-dimensional feature extraction vector:
[average number of received packets/100 s, average
amount of received octets/100 s]. Here we select
‘‘average number of received packets/100 s” to
reflect the attack time range, and also select ‘‘average amount of received octets/100 s” to reflect the
attack intensity. Note that the two feature extraction vector fields are unique to our attack setting
in the simulation, but for other kinds of attacking
settings, more vector fields like ‘‘Bytes sent/Bytes
received”, ‘‘outgoing flow records”, or ‘‘bidirectional flow records” may be necessary (as we have
introduced in Section 4.2.1).
In a real environment, the determination of the
trained samples should depend on a large number
of normal records, but for our 4200 s-simulation,
hundreds of seconds are enough.
According to the requirements for the trained samples of Algorithm 1 in Section 4.2.1, we choose 500 ‘‘middle-value” records for each of the three nodes to generate the trained samples of Algorithm 1. Because there is no special limitation for the trained samples of Algorithm 2, for simplicity we use the same trained samples as for Algorithm 1. The results are shown in Table 4.
5.3.3. Results and analysis
Here we compare the results of the two algorithms using Table 5. In the table, ‘‘1” represents the detection of anomalous traffic, ‘‘0” represents not detected, and a false positive is marked in bold.
Fig. 13. Result of the simulation for anomalous internet traffic.
Table 4
Sample and threshold
Node    Trained sample      Similarity threshold    Distance threshold
46      [0.972, 700.64]     0.2204                  6.3984
48      [0.826, 485.04]     0.3291                  4.2258
50      [1.048, 733.92]     0.3925                  2.7552
Table 5
Results (I: Algorithm 1, II: Algorithm 2), calculated at 100 s intervals
Node    Algorithm    Result
46      I            000000011111000001000000010000000100000
46      II           000000011111000000000000000000000110000
46      Join         000000011111000000000000000000000100000
48      I            000000000000000001000000000010000000000
48      II           000000000000000001000000000000000000000
48      Join         000000000000000001000000000000000000000
50      I            000000100000000000000000000110000000000
50      II           000000000000000000000000000100000000000
50      Join         000000000000000000000000000100000000000
From Table 5, we come to the following conclusions.
1. Generally speaking, our strategy for intrusion detection is simple, effective and accurate. There is only one false positive (at Node 46) in all, giving an accuracy rate of 99%. Even without the ‘‘join” strategy, the accuracy rates of the two algorithms reach 95% and 98%, respectively.
2. Algorithm 1 has more false positives than Algorithm 2. This is because the simulation time is limited and the records are insufficient. Therefore, the trained samples do not satisfy the requirements of Algorithm 1 well (meaning the ‘‘middle-value” records we chose are not very exact). Because there is no special limitation for Algorithm 2, its false positive rate is low.
3. Algorithm 1 indicates how similar the current traffic behavior is to the normal scenario. Therefore, it can better reflect the essence of intrusion detection with the support of a large number of records, and will be able to give the network traffic a more detailed delineation if multi-level similarity-values are defined.
4. The false positive in the period 3600–3700 s (at Node 46) appears in the results of both Algorithms 1 and 2. This is because in this period a sudden change occurred in the feature extraction vector elements, such as the average number of received packets/100 s: its range in the other periods is 0–4, while here it suddenly rises to 15. When calculating the trained sample and threshold of Node 46, we deliberately did not choose this period, which leads to the false positive. This further shows the importance of sufficient and appropriate samples.
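The ‘‘join” rows and the quoted accuracy rates can be recomputed from the bit strings of Table 5. The sketch below is an illustration under one loudly stated assumption: the ground truth is not given in the table, so the attack intervals are taken to be the positions where the attack windows of Section 5.3.1 fall if the first digit covers the interval starting at 300 s (this is what aligns the detected runs with those windows). The function and variable names are hypothetical.

```python
# Table 5: (Algorithm 1, Algorithm 2) results per node, one digit per 100 s interval.
TABLE5 = {
    46: ("000000011111000001000000010000000100000",
         "000000011111000000000000000000000110000"),
    48: ("000000000000000001000000000010000000000",
         "000000000000000001000000000000000000000"),
    50: ("000000100000000000000000000110000000000",
         "000000000000000000000000000100000000000"),
}

# ASSUMPTION: 0-based positions of the true attack intervals, inferred from the
# attack windows in Section 5.3.1 with the first digit starting at 300 s.
ATTACK_INTERVALS = {46: set(range(7, 12)), 48: {17}, 50: {27}}


def join(bits_a, bits_b):
    """Logical AND of two result strings, digit by digit."""
    return "".join("1" if a == b == "1" else "0" for a, b in zip(bits_a, bits_b))


def errors(bits, attacks):
    """Number of intervals where the detection result differs from the assumed truth."""
    return sum((bit == "1") != (k in attacks) for k, bit in enumerate(bits))


total = sum(len(alg1) for alg1, _ in TABLE5.values())
wrong = {"I": 0, "II": 0, "Join": 0}
for node, (alg1, alg2) in TABLE5.items():
    truth = ATTACK_INTERVALS[node]
    wrong["I"] += errors(alg1, truth)
    wrong["II"] += errors(alg2, truth)
    wrong["Join"] += errors(join(alg1, alg2), truth)

for name, count in wrong.items():
    print(name, count, round(100.0 * (1 - count / total), 1))
```

Under this assumption, Algorithm 1 is wrong in 6 of the 117 intervals, Algorithm 2 in 2, and the joined result in 1, which corresponds to the accuracy rates of about 95%, 98% and 99% quoted in the first conclusion.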
6. Conclusion
A flow analysis and monitoring system based on
the J2EE framework is introduced above. This system receives flow records exported by NetFlow,
extracts flow information and stores it in the Oracle
database. A J2EE web server is used to provide realtime and historical traffic information to web users.
Anomalous traffic monitoring functions are also
embedded. All the functions and anomalous traffic
definitions can be easily configured.
The performance of the pattern matching algorithm is stable, so it can be easily expanded. The traffic statistic based detection algorithms with the join strategy are cost-effective and can classify traffic (as normal or anomalous) online in real time with low computational complexity, which provides insights for the design of not only this but also other intrusion detection and prevention systems for improving network security.
This system has already been deployed in a government agency in Beijing, and has worked well so far. Our future work will focus on the improvement of system performance and algorithm efficiency, and the integration of more anomalous attack models.
Acknowledgment
This work is supported by the National High-Tech Research and Development Plan of China (863) under Grant No. 2006AA01Z225; the National Grand Fundamental Research Program of China (973) under Grant Nos. 2003CB314804 and 2006CB708301; and the National Natural Science Foundation of China (NSFC) under Grant Nos. 60573122 and 60773138.
References
[1] Cisco, Cisco IOS NetFlow Technology Data Sheet. <http://www.cisco.com/go/NetFlow>.
[2] T. Chen, Intrusion detection for viruses and worms, IEC
Annual Review of Communications 57 (Fall) (2004).
[3] Haining Wang, Danlu Zhang, Kang G. Shin, Detecting SYN
flooding attacks, in: INFOCOM 2002. Twenty-First Annual
Joint Conference of the IEEE Computer and Communications Societies, Proceedings IEEE 3 (23–27) (2002) 1530–
1539.
[4] Vasilios A. Siris, Fotini Papagalou, Application of anomaly
detection algorithms for detecting SYN flooding attacks,
Global Telecommunications Conference 29 (3) (2004) 2050–
2054.
[5] Seung-won Shin, Ki-young Kim, Jong-soo Jang, D-SAT:
detecting SYN flooding attack by two-stage statistical
approach, applications and the Internet, in: Proceedings,
The 2005 Symposium on 31 January–4 February 2005, pp.
430–436.
[6] J.B.D. Cabrera, T.B. Ravichandran, R.K. Mehra, Statistical traffic modeling for network intrusion detection, in:
Proceedings of 8th International Symposium on Modeling,
Analysis and Simulation of Computer and Telecommunication Systems, 2000, 29(1), 2000, pp. 466–473.
[7] John E. Dickerson, Jukka Juslin, Ourania Koukousoula,
Julie A. Dickerson, Fuzzy intrusion detection, in: IFSA
World Congress and 20th NAFIPS International Conference
9(3), 2001, pp. 1506–1510.
[8] R.C. Garcia, M.N.O. Sadiku, J.D. Cannady, WAID: wavelet
analysis intrusion detection, in: The 2002 45th Midwest
Symposium on Circuits and Systems (MWSCAS-2002), vol.
3, 4–7 August 2002, pp. III-688–III-691.
[9] M. Li, W. Jia, W. Zhao, Decision analysis of network based
intrusion detection systems for denial-of-service attacks, in:
Proceedings, IEEE Conferences on Info-tech and Infonet,
2001.
[10] Yiming Gong, Detecting Worms, and Anomaly Activities
with NetFlow. <http://www.securityfocus.com/infocus/1796>.
[11] W. Richard Stevens, TCP/IP Illustrated (Vol. 1: The
Protocols), Addison Wesley, 1994.
[12] P. Huang, A. Feldmann, A.C. Gilbert, W. Willinger,
Dynamics of IP traffic: a study of the role of variability
and the impact of control, in: ACM SIGCOMM’99, vol. 29,
Massachusetts, USA, 1999.
[13] Shiuh-Pyng Shieh, Virgil D. Gligor, On a pattern-oriented
model for intrusion detection, IEEE Transactions on
Knowledge and Data Engineering 9 (4) (1997).
[14] Shiuh-Pyng Shieh, Virgil D. Gligor, A pattern-oriented
intrusion detection system and its applications, in: Proceedings of IEEE Symposium on Research in Security and Privacy,
Oakland, CA, May 1991, pp. 327–342.
[15] Sandeep Kumar, Eugene H. Spafford, A pattern matching
model for misuse intrusion detection, in: Proceedings of the
17th National Computer Security Conference, Baltimore,
MD, 1994.
[16] C.J. Coit, S. Staniford, J. McAlerney, Towards faster string
matching for intrusion detection or exceeding the speed of
Snort, in: DARPA Information Survivability Conference and
Exposition (DISCEX II 01), Anaheim, CA, June
2001.
[17] B.E. Brodsky, B.S. Darkhovsky, Nonparametric Methods in
Change-point Problems, Kluwer Academic Publishers,
Dordrecht, 1993.
[18] M. Basseville, I.V. Nikiforov, Detection of Abrupt Changes:
Theory and Application, Prentice Hall, Englewood Cliffs, NJ,
1993.
[19] V. Paxson, S. Floyd, Wide-area traffic: the failure of Poisson
modeling, IEEE/ACM Transactions on Networking 3 (3) (1995).
[20] T. Peng, C. Leckie, K. Ramamohanarao, Detecting reflector
attacks by sharing beliefs, in: Proceedings of the IEEE 2003
Global Communications Conference (Globecom 2003), vol.
3, San Francisco, California, USA, 2003, pp. 1358–1362.
[21] T. Peng, C. Leckie, K. Ramamohanarao, Proactively
detecting DDoS attack using source IP address monitoring,
in: Proceedings of Networking 2004, Athens, Greece, 2004,
pp. 771–782.
[22] Harold S. Javitz, Alfonso Valdes, The NIDES statistical
component: description and justification, SRI International,
March 1993.
[23] Zheng Zhang, Jun Li, C.N. Manikopoulos, Jay Jorgenson,
Jose Ucles, HIDE: a hierarchical network intrusion detection
system using statistical preprocessing and neural network
classification, in: Proceedings of the 2001 IEEE Workshop
on Information Assurance and Security, United States
Military Academy, West Point, NY, 5–6 June, 2001.
[24] Daniel Q. Naiman, Statistical anomaly detection via httpd
data analysis, Computational Statistics & Data Analysis
(2004) 51–67.
[25] Wenke Lee, Salvatore J. Stolfo, Kui W. Mok, A data mining
framework for building intrusion detection models, in:
Proceedings of the 20th IEEE Symposium on Security and
Privacy, Oakland, CA, 1999.
[26] Wenke Lee, Salvatore J. Stolfo, Data mining approaches for
intrusion detection system, in: Proceedings of the 7th
USENIX Security Symposium, San Antonio, TX, January,
1998.
[27] Bertrand Portier, Jerome Froment, Data mining techniques
for intrusion detection, Data mining term paper, The
University of Texas, Spring, 2000.
[28] Srinivas Mukkamala, Guadalupe Janoski, Andrew Sung,
Intrusion detection using neural networks and support
vector machines, in: IEEE IJCNN, May
2002.
[29] Herve Debar, Monique Becker, Didier Siboni, A neural
network component for an intrusion detection system, in:
Proceedings of the 1992 IEEE Computer Society Symposium
on Research in Computer Security and Privacy, 1992, pp.
240–250.
[30] Jake Ryan, Meng-Jang Lin, Intrusion detection with neural
networks, Advances in Neural Information Processing Systems, vol. 10, MIT Press, 1998.
[31] Ludovic Me, GASSATA, a genetic algorithm as an alternative tool for security audit trail analysis, in: 1st International
Conference on the Recent Advances in Intrusion Detection,
Belgium, 1998.
[32] Susan M. Bridges, Rayford B. Vaughn, Fuzzy data mining
and genetic algorithms applied to intrusion detection.
[33] Wei Li, Using Genetic Algorithm for Network Intrusion
Detection, SANS Institute, 2004.
[34] Kathia Regina L. Juca, Azzedine Boukerche, Human
immune anomaly and misuse based detection for computer
system operations: part II, in: Proceedings of the International
Parallel and Distributed Processing Symposium (IPDPS’03),
IEEE, 2003.
[35] Hiroyuki Nishiyama, Fumio Mizoguchi, Design of security
system based on immune system, in: Proceedings of the 10th
International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE’01),
IEEE, 2001.
[36] Stephanie Forrest, Thomas A. Longstaff, A sense of self for
UNIX processes, in: Proceedings of 1996 IEEE Symposium
on Computer Security and Privacy, Los Alamos, CA, pp.
120–128.
[37] Jan van Lunteren, High-performance pattern-matching for
intrusion detection, in: Proceedings of IEEE INFOCOM
2006, April 2006, pp. 1–13.
[38] Z.K. Baker, V.K. Prasanna, High-throughput linked-pattern
matching for intrusion detection systems, in: Proceedings of
the First Annual ACM Symposium on Architectures for
Networking and Communications Systems, 2005.
[39] L. Tan, T. Sherwood, Architectures for bit-split string
scanning in intrusion detection, IEEE Micro (January–
February) (2006).
[40] N.S. Artan, H. Jonathan Chao, TriBiCa: Trie bitmap
content analyzer for high-speed network intrusion detection,
in: Proceedings of IEEE INFOCOM 2007, May 2007, pp.
125–133.
[41] C.-T. Huang, S. Thareja, Y.-J. Shin, Wavelet-based real time
detection of network traffic anomalies, in: Proceedings of
Workshop on Enterprise Network Security (WENS 2006) (in
assoc. with Second SecureComm), August 2006.
[42] T. Dübendorfer, B. Plattner, Host behaviour based early
detection of worm outbreaks in internet backbones, in:
Proceedings of 14th IEEE WET ICE/STCA Security
Workshop, IEEE, 2005.
[43] T. Dübendorfer, B. Plattner, A framework for real-time
worm attack detection and backbone monitoring, in: Proceedings of IWCIP 2005, November 2005.
Bin Liu is an M.E. student in the Department of Computer Science at
Tsinghua University, China. His research interests include sensor
networks, performance evaluation, and network management. He received
his B.S. in computer science and technology from Beijing University of
Posts and Telecommunications, China.
Chuang Lin is a professor in the Department of Computer Science and
Technology, Tsinghua University, Beijing, China. He received the Ph.D.
degree in Computer Science from Tsinghua University in 1994. His
current research interests include computer networks, performance
evaluation, network security analysis, and Petri net theory and its
applications. He has published more than 300 papers in research
journals and IEEE conference proceedings in these areas and has
published three books.
Professor Lin is a member of the ACM Council, a senior member of the
IEEE, and the Chinese Delegate in TC6 of IFIP. His professional service
includes: Technical Program Vice Chair of the 10th IEEE Workshop on
Future Trends of Distributed Computing Systems (FTDCS 2004); General
Chair of the ACM SIGCOMM Asia Workshop 2005; Associate Editor of IEEE
Transactions on Vehicular Technology; Area Editor of the Journal of
Computer Networks; and Area Editor of the Journal of Parallel and
Distributed Computing.
Jian Qiao is an M.E. student in the School of Telecommunication
Engineering, Beijing University of Posts and Telecommunications,
Beijing, China. He received the B.S. degree in Telecommunication
Engineering from Beijing University of Posts and Telecommunications.
His current research interests include distributed systems, grid
computing, and optical networks.
Jianping He is an M.E. candidate in the Department of Computer Science
at Tsinghua University, Beijing, P. R. China. His research interests
include multi-hop wireless networks and network management. He
received his B.S. in computer science from Beijing University of Posts
and Telecommunications in 2006.
Peter Ungsunan is a doctoral student in
the Department of Computer Science
and Technology at Tsinghua University.
Previously he taught at Beihang University and worked as a networking
engineer at Lucent Technologies. He
received MS and MBA degrees in CIS
and Finance from the City University of
New York and BS degrees in Electrical
and Mechanical Engineering from
Rensselaer Polytechnic Institute.