Measuring and Managing the Remote Client Perceived Response Time for Web Transactions using Server-side Techniques

David P. Olshefski

Submitted in partial fulfillment of the requirements for the degree of Doctor of Engineering Science in the Fu Foundation School of Engineering and Applied Science

COLUMBIA UNIVERSITY
2006

© 2006 David P. Olshefski
All Rights Reserved

ABSTRACT

As businesses continue to grow their dependence on the World Wide Web, it is increasingly vital for them to accurately measure and manage the response time of their Web services. This dissertation shows that it is possible to determine the remote client perceived response time for Web transactions using only server-side techniques, and that doing so is useful and essential for the management of latency-based service level agreements.

First, we present Certes, a novel modeling algorithm that accurately estimates connection establishment latencies as perceived by the remote clients, even in the presence of admission control drops. We present a non-linear optimization that models this effect and then we present an O(c) time and space online approximation algorithm. Second, we present ksniffer, an intelligent traffic monitor which accurately determines the pageview response times experienced by a remote client without any changes to existing systems or Web content. Novel algorithms for inferring the remote client perceived response time on a per pageview basis are presented which take into account network loss, RTT, and incomplete information. Third, we present Remote Latency-based Management (RLM), a system that controls the latencies experienced by the remote client by manipulating the packet traffic into and out of the Web server complex. RLM tracks the progress of each pageview download in real-time, as each embedded object is requested, making fine-grained decisions on the processing of each request as it pertains to the overall pageview latency. RLM introduces fast SYN and SYN/ACK retransmission and embedded object rewrite and removal techniques to control the latency perceived by the remote client.

We have implemented these mechanisms in Linux and demonstrate their effectiveness across a wide range of realistic workloads. Our experimental results show for the first time that server-side response time measurements can be done in real-time at gigabit traffic rates to within 5% of that perceived by the remote client. This is an order of magnitude better than common application-level techniques run at the Web server. Our results also demonstrate for the first time how both the mean and the shape of the per pageview client perceived response time distribution can be dynamically controlled at the server complex.

Contents

List of Figures
Acknowledgments
Chapter 1  Introduction
  1.1  Client Perceived Response Time
  1.2  Network Protocol Behavior
  1.3  Modeling/Managing Client Perceived Response Time Latencies
  1.4  Novel Contributions
Chapter 2  Modeling Latency of Admission Control Drops
  2.1  The Certes Model
  2.2  Mathematical Constructs of the Certes Model
  2.3  Fast Online Approximation of the Certes Model
    2.3.1  Packet Loss in the Network
    2.3.2  Client Frustration Time Out (FTO)
    2.3.3  SYN Flood Attacks
    2.3.4  Categorization
  2.4  Certes Linux Implementation
  2.5  Experimental Results
    2.5.1  Experimental Design
    2.5.2  Measurements and Results
  2.6  Certes Applied in Admission Control
  2.7  Shortcomings of the (strictly) Queuing Theoretic Approach
  2.8  Convergence
  2.9  Summary
Chapter 3  Modeling Client Perceived Response Time
  3.1  ksniffer Architecture
  3.2  ksniffer Pageview Response Time
    3.2.1  TCP Connection Setup
    3.2.2  HTTP Request
    3.2.3  HTTP Response
    3.2.4  Online Embedded Pattern Learning
    3.2.5  Embedded Object Processing
  3.3  Packet Loss
  3.4  Longest Prefix Matching
  3.5  Tracking the Most Active
  3.6  Experimental Results
  3.7  Summary
Chapter 4  Remote Latency-based Web Server Management
  4.1  RLM Architecture Overview
  4.2  RLM Pageview Event Node Model
  4.3  Connection Latency Management
  4.4  Transfer Latency Management
  4.5  Experimental Results
    4.5.1  Response Time Distribution
    4.5.2  Managing Connection Latency
    4.5.3  Managing Load and Admission Control
    4.5.4  Managing Transfer Latency
  4.6  Theoretical Analysis
  4.7  Alternative Approaches
  4.8  Summary
Chapter 5  Related Work
  5.1  Measuring Client Perceived Response Time
  5.2  Latency Management using Admission Control
  5.3  Web Server Based Approaches
  5.4  Content Adaptation
  5.5  TCP Level Mechanisms
  5.6  Packet Capture and Analysis Systems
  5.7  Services On-demand
  5.8  Stream-based Systems
  5.9  Internet Standards Activity
Chapter 6  Conclusion
  6.1  Future Work

List of Figures

1.1  Typical TCP client-server interaction.
1.2  Effect of SYN drops on client perceived response time.
1.3  Downloading a container page and embedded objects over multiple TCP connections.
1.4  Breakdown of client response time.
1.5  Pageview modeled as an event node graph.
2.1  Typical TCP client-server interaction.
2.2  Effect of SYN drops on client perceived response time.
2.3  Dropped SYN/ACK from server to client captured in SYN-to-END time.
2.4  Variance in RTT affects arrival time of retries.
2.5  Initial connection attempts that get dropped become retries three seconds later.
2.6  A second attempt at connection, that gets dropped, becomes a retry six seconds later.
2.7  After three connection attempts the client gives up.
2.8  Relationship between incoming, accepted, dropped, completed requests.
2.9  The smaller the interval, the more difficult to accurately discretize events.
2.10  Addition of network SYN drops to the model.
2.11  Certes implementation on a Linux Web server.
2.12  TCP/IP connection establishment on Linux.
2.13  TCP/IP outbound data transmission on Linux.
2.14  Experimental test bed.
2.15  Certes accuracy and stability in various environments.
2.16  Certes response time distribution approximates that of the client for Tests D and G.
2.17  Certes online tracking of the client response time in Tests A and G.
2.18  Certes online tracking of the client response time in Test J, in on-off mode.
2.19  Web server control manipulating the Apache accept queue limit.
2.20  Client response time increases as accept queue limit decreases.
2.21  Effect of SYN drop rate on client response time, as modeled as an M/M/1 queuing system.
2.22  Modeling as an M/M/1 queuing system fails to accurately track client perceived response time.
2.23  Using a sliding window of drop probabilities fails to capture all the dependences between time intervals.
2.24  Certes begins modeling at the 600th interval during a consistent load test.
2.25  Certes begins modeling at the 575th interval (in the middle of a peak) during a variable load test.
3.1  Downloading a container page and embedded objects over multiple TCP connections.
3.2  Multi-tiered server farm with a ksniffer monitor.
3.3  Typical libpcap based sniffer architecture (left) vs. the ksniffer architecture (right).
3.4  Objects used by ksniffer for tracking.
3.5  HTTP request/reply.
3.6  Downloading multiple container pages and embedded objects over multiple connections.
3.7  Client active pageviews.
3.8  Network and server dropped SYNs.
3.9  Algorithm for detecting network dropped SYNs from captured SYNs.
3.10  Experimental test bed.
3.11  Test F, pageviews per second.
3.12  Test F, mean pageview response time.
3.13  Client perceived response time on a per subnet basis.
3.14  Test V, mean pageview response time.
3.15  Test V, response time distribution.
3.16  Test X, mean pageview response time.
3.17  Live Internet Web site.
3.18  Apache measured response time, per URL.
3.19  Apache measured response time for loner pages.
4.1  RLM deployment.
4.2  Breakdown of client response time.
4.3  Pageview modeled as an event node graph.
4.4  SYN drops at the server.
4.5  Second connection in page download fails.
4.6  Fast SYN retransmission.
4.7  Fast SYN/ACK retransmission.
4.8  Cardwell et al. Transfer Latency Function f for 80 ms RTT and 2% loss rate.
4.9  Experimental test bed.
4.10  0.3 ms RTT, 0% loss.
4.11  80 ms RTT, 0% loss.
4.12  80 ms RTT, 4% loss.
4.13  Unmanaged heavy load.
4.14  MaxClients load shedding.
4.15  Low priority penalties.
4.16  Improvement from applying fast SYN retransmission.
4.17  Widening the think time gap.
4.18  Embedded object removal.
4.19  Embedded object rewrite.
4.20  Applying predicted elapsed time.
4.21  Full vs. Empty, mean pageview response time.
4.22  Steady state flow model.
4.23  Equation 4.27 for µ0 = λ0 = 1000.
4.24  Equation 4.27 for µ0 = λ0 = 1000, zoomed in on minima.
4.25  M/G/1 service latency 1/(µ0 − µ) overlayed onto Figure 4.24.
4.26  Capacity model.

Acknowledgments

My deepest gratitude goes toward Jason Nieh, my advisor, who led me through the dissertation. I will forever be amazed at his utterly relentless pursuit of excellence. He consistently overturned every stone, demanded thorough explanation and objective evaluation of every idea, and developed ideas well beyond that which I would reach on my own. He has developed and shaped my entire outlook on how to perform solid, meaningful research within the field of computer science. He led by example, not only as a researcher, but also as a person, whom I respect and admire.

I would like to thank my first advisor, Yechiam Yemini, who brought me into the Ph.D. program at Columbia University. I often think of YY and his insistence on developing ideas into something 'big'. His imagination and ability to see far down the road ahead has always struck me as unique and uncanny. Likewise, many thanks to Henning Schulzrinne and Vishal Misra for taking the time and energy to be on my thesis committee, making the event memorable (even for a 'hacker' such as myself).
There are many people at Columbia University who were a significant part of my graduate program. This includes my office mate Sushil DaSilva, who many times kept me laughing under the toils and stress of graduate life. Danilo Florissi was a mentor and friend, something I will always remember and treasure. Gong Su, Ioannis Stamos, Susan Tritto, Andreas Prodromidis, Maria Papadopouli (Bella Maria!), Alexandros Konstantinou, Martha Zadok, Ashutosh and Manu Dutta, Michael Grossberg, Ricardo Baratto, and Dan Phung have all helped me in numerous ways, leaving me with many fond memories as well.

Lastly, I owe a tremendous debt of gratitude to those individuals at IBM who made this possible. Michael Karasick, my manager, was the first to support me in my quest for higher education, both through encouragement and financial support. Dinesh Verma continued to support my degree program over the course of many years. John Tracey not only showed great patience and understanding by allowing me to spend time on thesis related work at the office, but also took the time to participate as a member of my thesis committee. Erich Nahum, coauthor and committee member, will always be a friend and mentor. My thanks to Li Zhang for his insights on modeling the steady state behavior of TCP connection establishment. Dakshi Agrawal, my friend and coauthor, could you burn me a few more of those CDs?

David P. Olshefski
Columbia University
May 2006

To 'Country Joe' and all those like him.

Chapter 1

Introduction

"A Web site that fails to deliver its content, either in a timely manner or not at all, causes visitors to quickly lose interest, wasting the time and money spent on the site's development." - Freshwater Software [147].

"Every Web usability study I have conducted since 1994 has shown the same thing: users beg us to speed up page downloads." - J. Nielsen, "The Need for Speed" [113].

"Some users and applications drive the revenue of the business.
If the system is slow, customers go elsewhere, and transactions or sales are lost forever." - P. Sevcik, Business Communications Review [144].

"Forty-one percent of consumers experiencing service failures at online retail sites say they will not shop at that site again." - Boston Consulting Group [70].

"Users perceive slow Web sites as being less secure and, as a result, are less likely to make a purchase due to fear of credit card theft." - Bhatti et al. [29].

These quotes and others like them indicate that Web sites need to manage their response times. The revenue generated by a Web site for a business depends on it: large amounts of revenue can be lost by a Web-based company if the response time of its Web site is too slow. Customers get frustrated and leave the site to conduct their business elsewhere. Worse yet, end users retain memories of such experiences and avoid slow Web sites when conducting future business. Once a customer is lost due to poor performance, he or she is difficult to regain. This translates into real dollars.

This need to manage Web response time has shifted the focus of Web server performance from throughput and utilization benchmarks [106, 24, 112] to guaranteeing delay bounds for different classes of clients [94, 159, 86, 125, 53, 8, 123, 42, 30]. Key to the effective management of response time is the accurate measurement of response time. Although an accurate measurement of response time is highly valued, it is difficult to obtain. Companies spend millions each year in the attempt, as is evident from the industry that has sprung up and flourished on the promise of obtaining true, accurate response times [69, 88, 98, 59, 149]. Unfortunately, Web-based companies are paying good money only to receive in return a poor substitute for the client perceived response time.

Real-time management of Web response time requires that client response time measurements be available in real-time.
Such real-time measurements can be an integral part of a closed-loop management system that manages resources or scheduling within the Web server complex. If the response time metrics are only available long after the transactions complete, then the control mechanism is obviously unable to affect the response time of those transactions. Likewise, both the measurement of response time and the corresponding control mechanisms must be able to function at high traffic rates so as to be applicable in today's Internet environment.

The term "response time" has itself become diluted, meaning a variety of different metrics to different people. To the system administrator it usually means the per-URL server response time. To the database administrator it means the per-query response time. To the network administrator it relates to the amount of available bandwidth. These definitions measure only a portion of the client perceived response time. In addition, no standard definition of "client perceived response time" for Web transactions exists, nor is there an RFC working group in the process of creating one. Existing research to date has misrepresented response time by incorrectly measuring or ignoring key latencies in the processing of a Web request. Management mechanisms are then based on and validated against these inaccurate latency measurements. We expose these long-standing shortcomings and present pragmatic solutions for addressing them.

From the remote client perspective, there is only one measure of response time that matters: how long it takes to download a Web page along with all its embedded objects. This definition includes latencies associated with network delays, server delays, retrieval of embedded objects, and the latencies experienced by the remote client due to his or her local machine.
It is this definition of response time we seek to measure and manage: the per-pageview response time, as perceived by the remote client.

Capturing and managing the per-pageview client perceived response time for a client located on the other side of the country (or planet) is non-trivial. This dissertation focuses on how to measure and manage the remote client perceived response time using only information available at the Web server. Our approach to tackling this problem is based on analyzing the packet streams into and out of the server complex. By reconstructing the activity across multiple network protocol layers, we are able to determine the per-pageview response time as perceived by the remote client, who may be physically located a great distance from the Web server. Our approach is non-invasive to existing systems: no modifications are required to the server complex or Web site content, making deployment fast and simple. Being a server-side approach, the resulting measurements are available in real-time and can be used to manage the response time as the Web page is being downloaded.

1.1 Client Perceived Response Time

Measuring and managing the remote client perceived response time ought to account for all latencies associated with a pageview download, from the perspective of the remote client. We first present an anatomical view of the client-server behavior that occurs when a Web client accesses a remote Internet Web site, considering the behaviors and interactions between the four key entities involved: client, Web browser, network, and server. Once a URL, such as http://www.cnn.com/index.html, is entered into a Web browser [99, 110], the following ten steps occur to download and display the Web page:

1. URL parsing. The client browser parses the URL to obtain the name of the remote host, www.cnn.com, from which to obtain the Web page, /index.html.
Web browsers maintain a cache of Web pages, so if the Web page is in cache and has not expired, processing can be performed locally and steps 2-7 below can be skipped.

2. DNS lookup. In order to contact the Web site (e.g., www.cnn.com), the browser must first obtain its IP address from DNS [100, 101]. Since the browser maintains a local cache containing the IP addresses of frequently accessed Web sites, contacting the DNS server for this information is only performed on a cache miss, which often implies that the Web site is being visited for the first time.

3. TCP connection setup. The client establishes a TCP connection with the remote Web server. Before a client can send the HTTP request to the Web server, a TCP connection must first be established via the TCP three-way handshake mechanism [8, 38]. First, the client sends a SYN packet to the server. Second, the server acknowledges the client's request for connection by sending a SYN/ACK back to the client. Third, the client responds by sending an ACK to the server, completing the process of establishing a connection. Note that if the client's Web browser already had an established TCP connection to the server and persistent HTTP connections [61, 28] are used, the browser may reuse this connection, skipping this step.

4. HTTP request sent. The browser requests the Web content, /index.html, from the remote site by sending an HTTP request over the established TCP connection.

5. HTTP request received. When the Web server machine receives an HTTP packet, the operating system determines which application should receive the message. The HTTP request is then passed to an HTTP server application, such as Apache, which is typically executing in user space.

6. HTTP request processed. The HTTP server application processes the request by obtaining the content from a disk file, CGI script, or other such program.

7. HTTP response sent.
The HTTP server application passes the content to the operating system, which, in turn, sends the content to the client. If the response is a disk file, the HTTP server may use the sendfile() system call to transfer the file to the client using only kernel-level mechanisms.

8. HTTP response processed. Upon receiving the response to the HTTP request, the client browser processes the Web content. If the content consists of an HTML page, the browser parses the HTML, identifies any embedded objects such as images, and begins rendering the Web page on the display.

9. Embedded objects retrieved. The browser opens additional connections to retrieve any embedded objects, allowing the browser to make multiple, simultaneous requests for the embedded objects. This parallelism helps to reduce overall latency. Depending on where the embedded objects are located, connections may be to the same server, other Web servers, or content delivery networks (CDNs). If the connections are persistent and the embedded objects are located on the same server, then several embedded objects will be obtained over each connection (HTTP 1.1 [61] or HTTP 1.0 with KeepAlive [28]). Otherwise, a new connection will be established for each embedded object (HTTP 1.0 without KeepAlive [28]).

10. Rendering. Once all the embedded objects have been obtained, the browser can fully render the Web page on the display, within the browser window.

As a remote client maneuvers through a Web site, this process repeats itself for each new pageview being downloaded. We define a pageview as the collection of objects used for displaying a Web page, usually consisting of the container page (an HTML file) and any embedded objects (i.e., GIFs, JPEGs, etc.) that may be associated with the container page. A pageview, of course, may not have embedded objects, as is the case for a pageview consisting of a single PostScript file.
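The parsing performed in step 8 can be illustrated with a short sketch using Python's standard html.parser module. This is not from the dissertation; the tag-to-attribute mapping below is a simplified assumption, not an exhaustive list of what real browsers scan for:

```python
from html.parser import HTMLParser

# Tags that typically reference embedded objects, mapped to the
# attribute holding the object's URL (an illustrative mapping only).
EMBED_ATTRS = {"img": "src", "script": "src", "link": "href"}

class EmbeddedObjectParser(HTMLParser):
    """Collects URLs of embedded objects found in a container page."""
    def __init__(self):
        super().__init__()
        self.objects = []

    def handle_starttag(self, tag, attrs):
        attr = EMBED_ATTRS.get(tag)
        if attr:
            for name, value in attrs:
                if name == attr and value:
                    self.objects.append(value)

page = ('<html><body><img src="obj1.gif"><img src="obj2.gif">'
        '<script src="s.js"></script></body></html>')
parser = EmbeddedObjectParser()
parser.feed(page)
print(parser.objects)   # -> ['obj1.gif', 'obj2.gif', 's.js']
```

After this parsing pass, the browser knows the set of embedded objects to request in step 9, typically in document order.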
A complete measure of the time to download and display a pageview would account for the time spent across all ten steps. The only way to completely and accurately measure the actual client perceived response time is to measure the response time on the client machine, within the Web browser itself. In addition, for such information to be of use to a Web site, the Web browser would need to support a mechanism for transmitting the response time measurements back to the Web site for use in verifying compliance with service-level objectives. Unfortunately, such browser functionality does not exist. As a result, several pragmatic approaches have been developed to determine client response time without requiring client browser modification. These approaches must be considered methods to estimate, rather than measure, client perceived response time, though some may be more accurate than others. For example, measuring response time within the Web server from the time the request arrives (step 5) to the time the response is transmitted (step 7) ignores the connection establishment and network transfer latencies. We show in Chapter 2 that this can be as much as an order of magnitude less than the response time perceived by the remote client. Likewise, the response time obtained via monitor machines, although covering steps 1 through 10, does not measure the response time for actual client transactions. The use of embedded JavaScript not only requires modifications to existing Web content but also fails to measure steps 1 through 8 (i.e., the download of the initial container page is not captured). In this dissertation, we present an approach for measuring all but the DNS lookup time and the final rendering step using server-side mechanisms that do not require modifications to existing systems. In this work, we assume that URL parsing and Web page rendering times are small and that DNS lookups are generally cached, reducing their impact on response time.
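The gap between a server-side measurement (steps 5-7) and the full client perceived time (steps 1-10) can be made concrete with a toy latency budget. All numbers below are illustrative assumptions, not measurements from this work:

```python
# Illustrative per-step latency budget for one pageview, in seconds.
rtt = 0.080                      # assumed client-server round-trip time
steps = {
    "url_parse": 0.001,          # step 1: local parsing
    "dns": 0.020,                # step 2: DNS lookup (on a cache miss)
    "connect": rtt,              # step 3: three-way handshake ~ 1 RTT
    "request_and_transfer": rtt, # steps 4, 7-8: request up, response back
    "server": 0.050,             # steps 5-6: server-side processing
    "render": 0.005,             # steps 8-10: local processing/rendering
}

client_perceived = sum(steps.values())
server_measured = steps["server"]            # only steps 5-7 are visible

print(round(client_perceived, 3))            # 0.236
print(round(client_perceived / server_measured, 1))  # ~4.7x larger
```

Even with no SYN drops at all, the server-side number understates what the client waits; with SYN retransmissions added (Section 1.2), the gap widens to multiple seconds.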
1.2 Network Protocol Behavior

Steps 1 through 10 describe the tasks performed by each key entity. Figure 1.1 illustrates how these tasks are performed by depicting the client and server interaction, at the packet level, that occurs in steps 3 through 8 over a single TCP connection.

[Figure 1.1: Typical TCP client-server interaction.]

Figure 1.1 depicts the download of a pageview with no embedded objects; shortly we address the case where embedded objects are present. First, the TCP connection is established via the TCP three-way handshake mechanism, corresponding to step 3. Step 4 is depicted when the client transmits a GET request to the server. The GET request makes its way through the Internet and arrives at the server (step 5). The Web server reads, parses, and processes the request (step 6). The Web server begins to transmit the response to the client in step 7. The response makes its way through the Internet and arrives at the remote client, where it is processed and rendered by the browser (step 8). The response is not fully processed or rendered until all segments of the response are received by the client.

The difficult challenge we address in this dissertation is how one can measure and manage the response time perceived by the remote client using events observed at the Web server. For example, the TCP three-way handshake that establishes the connection is processed entirely within the kernel, in the TCP stack. The HTTP GET request, on the other hand, is processed in user space by the HTTP server. A correlation between the activities in both kernel space and user space, across network protocol layers (IP, TCP and HTTP), is required.
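One way to picture such a correlation is a flow table keyed by the TCP connection's 4-tuple, associating kernel-level connection events with user-level HTTP events on the same connection. The following is a simplified illustration only; the event names and structure are assumptions, not the data structures actually used in this work:

```python
# Minimal server-side flow table: correlate kernel-level TCP events
# with HTTP-level events via the connection's 4-tuple.
# Times below are in milliseconds for exactness.
flows = {}

def on_syn(four_tuple, t):
    # The first SYN seen for this connection marks the client's start.
    flows.setdefault(four_tuple, {"syn": t})

def on_get(four_tuple, t):
    # The HTTP GET arriving on the same connection (user-space event).
    flows[four_tuple]["get"] = t

def on_last_data(four_tuple, t):
    # Last response byte sent; report (handshake, request/transfer) split.
    f = flows[four_tuple]
    f["end"] = t
    return (f["get"] - f["syn"], f["end"] - f["get"])

key = ("10.0.0.5", 3201, "192.0.2.1", 80)  # (client ip, port, server ip, port)
on_syn(key, 0)
on_get(key, 80)                   # GET arrives roughly one RTT later
print(on_last_data(key, 210))     # -> (80, 130)
```

The point of the sketch is the keying: both the in-kernel handshake and the user-space GET are attributed to the same logical connection, which is what makes cross-layer reconstruction at the server possible.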
In addition, events occurring at the Web server are ½ RTT (round trip time) removed in time from when the remote client experiences them. For example, the remote client perceives the arrival of the response to an HTTP GET request ½ RTT after the server transmits the response. Packet loss in the network, TCP retransmissions, variance in RTT, time spent waiting in kernel queues, etc. all have an effect on the response time latency, yet are invisible to the user space HTTP server. Figure 1.2 shows the same client-server interaction as in Figure 1.1, but in the presence of SYN drops at the server. SYN drops at the server commonly occur under server overload or due to admission control [148]. When the initial SYN is dropped, the server does not transmit the corresponding SYN/ACK packet. As a result, the client-side TCP waits 3 s, then retransmits the initial SYN to the server. If this retransmitted SYN is also dropped, the client-side TCP waits another 6 s, then retransmits the SYN. If that SYN is dropped, then the client-side TCP waits another 12 s before retransmitting the SYN. This is the well known TCP exponential back-off mechanism [33, 127], which causes the client-side TCP to double the wait time between SYN retransmissions. In Figure 1.2, the resulting effect is a 21 s connection establishment latency. Figure 1.2 clearly shows that dropping a SYN at the server does not represent a denial of access but rather a delay in establishing the connection. In other words, dropping a SYN at the server simply reschedules the connection establishment for a (near) future moment in time. The key observation is that the latency due to SYN drops/retransmissions is large relative to the time required to compose and transfer the HTTP response, and as such, will be the dominant factor in the overall client perceived response time. [Figure 1.2: Effect of SYN drops on client perceived response time.] This effect is exacerbated under HTTP 1.0 without KeepAlive, where each individual Web object requires the establishment of a separate TCP connection. Capturing this effect is crucial when measuring and managing response time, yet it has been ignored by other approaches [42, 43, 54, 89, 92, 91]. Server-side admission control mechanisms which perform load shedding by explicitly dropping connection requests ignore the effect that the dropped SYNs have on the response time perceived by the remote client. Only the response time for the accepted connections, once they are accepted, is collected and reported. The latencies associated with the SYN drops/retransmissions that occur prior to successful connection establishment are ignored. Any admission control mechanism which explicitly drops SYNs but then ignores these effects underestimates the client perceived response time and misrepresents the number of rejected requests. Modeling and quantifying the effect that SYN drops have on the client perceived response time is a major contribution of this dissertation. SYN drop latencies have significant implications not only for the individual client experiencing them but also for aggregate latency metrics. Often it is the mean, median or 95th percentile of the response time that is used in specifying the service level objective. For example, a service level objective may state that "the mean response time for high priority customers must be below 3 s" or that "the 95th percentile of the response time must be below 5 s".
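The backoff arithmetic behind the 21 s figure can be sketched as follows; this is a minimal illustration of the RFC 1122 retry schedule, and the function name is ours, not part of any system described here:

```python
def conn_fail_latency(drops, initial_timeout=3.0):
    """Time a client spends in TCP exponential backoff when its first
    `drops` SYNs are discarded by the server: 3 s before the 1st retry,
    6 s before the 2nd, 12 s before the 3rd, and so on."""
    return sum(initial_timeout * 2 ** j for j in range(drops))

# Three dropped SYNs, as in Figure 1.2: 3 + 6 + 12 = 21 s elapse before
# the fourth SYN is finally accepted.
latency = conn_fail_latency(3)
```

Note how quickly the penalty grows: a single drop adds 3 s, which already dwarfs a typical sub-second server response time.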
We will show in later chapters that the shape of the distribution of the client perceived response time is significantly impacted, in predictable ways, by SYN drop latencies. Prior work that ignores the shape of the response time distribution and validates an approach using only the mean response time misrepresents the effectiveness of its technique. Lastly, it is important not to lose sight of how Web browsers behave and what the client actually sees in the Web browser. Existing Web browsers open multiple connections to obtain embedded objects, in parallel, from the Web server. Whereas Figure 1.1 depicts the client-server interaction over a single TCP connection, Figure 1.3 depicts the pageview download of a container page plus embedded objects over multiple TCP connections. The beginning of the client perceived response time is the moment the initial SYN packet is transmitted from the client (t0), and the end of the response time is defined as the moment at which the client receives the last byte of the last embedded object within the page (te); the client perceived response time is therefore calculated as te − t0. [Figure 1.3: Downloading a container page and embedded objects over multiple TCP connections.] The connection establishment latencies depicted in Figure 1.2 are observed by the remote client differently depending on which connection is experiencing the SYN drops. If the first connection experiences SYN drops, the Web browser will display the message 'Connecting to server...' in the progress text message bar while the TCP exponential backoff mechanism is in effect. This is not the case when the SYN drops occur on the second connection.
Likewise, this is significantly different from the case when the connection is immediately established but the server then takes a significant amount of time to compose the response. In this case, the progress text message bar displays 'Waiting for response...' and, as such, the client may be willing to wait longer for a response, knowing that the server is busy working on a reply. Likewise, if portions of the pageview are slowly being obtained and rendered, a client may hit the stop button on the Web browser after enough of the pageview has been displayed. We refer to this as a partial page download. Facts such as these are important for a latency management system to understand: admission control drops ought to be coordinated given the current number of established connections the browser is maintaining. An effective server-side approach for measuring and managing the per pageview client perceived response time as depicted in Figure 1.3 must examine the activity between client and server at the packet level. Only by tracking activity at the packet level is it possible to capture the key latencies involved with each task in steps 3 through 10. This requires tracking, reconstructing and correlating the activity across multiple protocols (IP, TCP and HTTP), over multiple connections. Determining which embedded objects belong to which pageviews, and handling dropped SYNs and their impact on response time when they occur on either of the two connections depicted in Figure 1.3, is also required. An important novel contribution of this dissertation is to provide server-side mechanisms for solving these problems under high bandwidth rates, online, in real-time, in the presence of loss and incomplete information.

1.3 Modeling/Managing Client Perceived Response Time Latencies

A key observation is that whatever measure of latency a management approach relies on becomes the latency which gets managed.
Therefore, if a management approach simply tracks the server response time for a single URL (e.g. ty − tx in Figure 1.3), then this, of course, is what the management technique ends up actually controlling. In this dissertation we take a per pageview approach to response time management instead of a per URL approach. [Figure 1.4: Breakdown of client response time.] Figure 1.4 depicts the response time of te − t0 for the pageview download of index.html, which embeds obj3.gif, obj6.gif and obj8.gif. The figure is annotated with the following terms:

1. Tconn: TCP connection establishment latency, using the TCP 3-way handshake. Begins when the client sends the TCP SYN packet to the server.

2. Tserver: latency for the server complex to compose the response by opening a file, or calling a CGI program or servlet. Begins when the server receives an HTTP request from the client.

3. Ttransfer: time required to transfer the response from the server to the client. Begins when the server sends the HTTP response header to the client.

4. Trender: time required for the browser to process the response, such as parse the HTML or render the image. Begins when the client receives the last byte of the HTTP response.

These four latencies are serialized over each connection and delimited by specific events. As such, a pageview download can be viewed as a set of well defined activities required to complete the pageview. Figure 1.5 depicts the download of Figure 1.4 as an event node graph, where each node represents a state, and each link indicates a precedence relationship and is labeled with the transition activity.
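The event node view can be made concrete with a small sketch: the pageview is a directed acyclic graph whose edges are the activities above, and the client perceived response time is the completion time of the last node. The node names and millisecond weights below are illustrative only (loosely following the first connection of Figure 1.5), not measurements from the dissertation:

```python
from collections import defaultdict

def finish_times(edges, start):
    """edges: (src, dst, latency_ms) triples forming a DAG. Returns the
    earliest completion time of each node; activities on different
    connections overlap, so a node waits for its slowest predecessor."""
    adj, indeg = defaultdict(list), defaultdict(int)
    for s, d, w in edges:
        adj[s].append((d, w))
        indeg[d] += 1
    done, queue = {start: 0}, [start]
    while queue:                       # Kahn-style topological sweep
        n = queue.pop()
        for d, w in adj[n]:
            done[d] = max(done.get(d, 0), done[n] + w)
            indeg[d] -= 1
            if indeg[d] == 0:
                queue.append(d)
    return done

edges = [
    ("syn", "conn1", 75),        # Tconn, first connection
    ("conn1", "page", 1300),     # Tserver + Ttransfer for index.html
    ("page", "parsed", 10),      # Trender: browser parses the HTML
    ("parsed", "conn2", 75),     # Tconn, second (parallel) connection
    ("parsed", "obj_a", 700),    # embedded object on connection 1
    ("conn2", "obj_b", 900),     # embedded object on connection 2
]
t = finish_times(edges, "syn")
pageview_rt = max(t["obj_a"], t["obj_b"])   # te - t0, in ms
```

In this toy graph the second connection is the critical path (its object finishes last), so a management action that shortens only the first connection's transfer would not reduce te − t0 at all.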
The nodes in the graph are ordered by time and each node is annotated with the elapsed time from the start of the transaction. Each activity contributes to the overall response time; certain activities overlap in time, some activities have greater potential to add larger latencies than others, some activities are on the critical path, and some activities are more difficult to control than others. It is exactly this perspective which is missing in management systems to date. Existing systems fail to see the context in which an individual URL is being requested and its relationship to the overall pageview latency. Systems perform admission control on individual URL requests, or classify all URL requests from the same client into the same class, without regard to the current state of the pageview download. Managing the critical path high latency activities, in the context of the current state of the pageview download, is a novel approach presented in this dissertation. We present a system capable of tracking a pageview download, online as it occurs, and performing packet manipulations that manage the response time perceived by the remote client. These techniques are applied in the context of an individual page download to achieve a specified absolute response time goal, and can be triggered by an elapsed time threshold or a prediction of the latency associated with an embedded object. [Figure 1.5: Pageview modeled as an event node graph.] The rest of this dissertation is outlined as follows. In Chapter 2 we present Certes [116, 117], our initial research into understanding and quantifying the effects of SYN drops on TCP connection establishment latency. The result is a novel modeling algorithm capable of determining the effect of SYN drops on TCP connection establishment using only a set of simple counters. Certes was implemented within the Linux kernel on the Web server and shown to be highly accurate for a variety of workloads, under conditions commonly found in the Internet. In Chapter 3 we present ksniffer [118], a novel intelligent traffic monitor capable of determining the per pageview response times experienced by a remote client. Implemented as an appliance that is placed in front of the Web server, it uses novel algorithms for inferring the remote client perceived response time on a per pageview basis, online in real-time. In Chapter 4 we present Remote Latency-based Management (RLM) [115], which manages the latencies experienced by the remote client by manipulating the packet traffic into and out of the Web server complex.
RLM tracks the progress of each page download in real-time, as each embedded object is requested, allowing us to make fine grained decisions on the processing of each request as it pertains to the overall pageview latency. Related work is presented in Chapter 5. The last chapter of this dissertation consists of concluding remarks and directions for future work.

1.4 Novel Contributions

This dissertation addresses the challenge of measuring and managing per pageview response time, as perceived by the remote client, using only information observed at the Web server. We delve into the issues related to admission control drops and their effect on response time latency, as well as server-side packet-level manipulation techniques for managing client perceived response times. Many of our techniques generalize to the problem of determining transaction level latencies for protocols that consist of a set of implicitly correlated requests. Specifically, in the context of HTTP over TCP/IP, this dissertation presents:

1. An understanding and quantification of the effects of SYN drops on TCP connection establishment latency. We present a non-linear optimization model of the effect of the TCP exponential backoff mechanism on TCP connection establishment latency, then devise an O(c) online algorithm for approximating the non-linear model. We present a kernel level design and implementation of the fast online algorithm. We experimentally validate the fast online algorithm, showing results accurate to within 5% of the latencies measured at the client.

2. An architecture, design and implementation of a scalable, high speed traffic monitor capable of determining the remote client perceived response time on a per pageview basis. We present online algorithms for correlating HTTP requests over multiple TCP connections in the presence of ambiguity and incomplete information.
This includes an algorithm for online, incremental, embedded pattern learning in the presence of ambiguity and incomplete information, and an algorithm for inferring the existence of SYN or SYN/ACK packets that are dropped in the network and never captured by the traffic monitor. We experimentally validate our implementation of the high speed traffic monitor, showing that it is possible to determine the remote client perceived response time at near gigabit rates to within 5% error by only analyzing the packet stream into and out of the Web server complex.

3. The design and implementation of an inline traffic monitor capable of managing the client perceived response time through manipulation of the packet stream between the remote client and Web server. We present a novel model for tracking an individual pageview download, based on modeling the pageview activities as an event node graph. We present a study which examines how common Web browsers behave under certain failure conditions, such as admission control SYN drops, and how this affects the response time perceived by the client. This led us to two novel techniques, Fast SYN and Fast SYN/ACK retransmission, for managing the latencies associated with admission control drops. We experimentally validate the event node model, showing that it is possible to manage the shape of the response time distribution using the Fast SYN and Fast SYN/ACK retransmission techniques.

Chapter 2 Modeling Latency of Admission Control Drops

Certes (CliEnt Response Time Estimated by the Server) represents the results of our initial investigation into measuring TCP latencies, with a focus toward the behavior and effects of the TCP exponential backoff mechanism on TCP connection establishment.
The result was a novel, online mechanism that accurately estimates mean client perceived response time using only information available at the Web server. Certes combines a model of TCP retransmission and exponential back-off mechanisms with three simple server-side measurements: connection drop rate, connection accept rate, and connection completion rate. The model and measurements are used to quantify the time due to failed connection attempts and determine their effect on mean client perceived response time. Certes also measures both time spent waiting in kernel queues and time to retrieve requested Web data. It achieves this by going beyond application-level measurements, using a kernel-level measure of the time from the very beginning of a successful connection until it is completed. Existing admission control mechanisms which perform service differentiation based on connection throttling have failed to address these effects. Our approach does not require probing or third party sampling, and does not require modification of Web pages, HTTP servers, or clients. Certes uses a model that is inherently able to decompose response time into various server and network components, to help determine whether server or network providers are responsible for performance problems. Certes can be used to measure response times for any Web content, not just HTML. Certes runs online in constant time with very low overhead and can be used at Web sites and server farms to verify compliance with service level objectives. Figure 2.1 depicts the typical TCP interaction between the remote client and server, while Figure 2.2 depicts the same situation under conditions of server SYN drops (due to admission control or overload). Modeling the latency associated with the dropped SYNs, depicted as CONN-FAIL in Figure 2.2, is the main contribution of Certes.
The problem arises when a server drops a SYN: no information concerning the individual SYN drop is maintained. Therefore, when a server accepts a SYN and processes a connection, the server is unaware of how many failed connection attempts have been experienced by the client prior to this successful attempt. Maintaining state for each SYN drop is not an option, for scalability reasons. This would require the server to commit a portion of memory to track each SYN drop for a significant period of time (up to tens of seconds). Overload conditions or transient spikes could exhaust memory. In addition, this would make the server vulnerable to SYN flood attacks, where malicious remote clients send large numbers of SYN packets to the server without the intent of ever establishing a connection. In either case the server would be tying up large amounts of memory for unestablished connections at the exact point in time at which the server is already over-utilized. Certes models the CONN-FAIL latency without maintaining state for each individual SYN drop. We have implemented Certes and verified its response time measurements against those obtained via detailed client-side instrumentation. [Figure 2.1: Typical TCP client-server interaction.] Our results demonstrate that Certes provides accurate server-based measurements of mean client response times in HTTP 1.0/1.1 environments, even with rapidly changing workloads. Our results show that Certes is particularly useful under overloaded server conditions, when Web server application-level and kernel-level measurements can be grossly inaccurate.
We further demonstrate the need for Certes measurement accuracy in Web server control mechanisms that manipulate inbound kernel queuing or that perform admission control to achieve response time goals. This chapter is outlined as follows. Section 2.1 presents an overview of the Certes approach, followed by Section 2.2, which presents the detailed mathematical construction of the Certes model. Section 2.3 presents a fast online approximation of the non-linear maximization presented in Section 2.2. Section 2.4 describes our implementation of Certes within the Linux kernel. Section 2.5 presents experimental results demonstrating the effectiveness of Certes in estimating mean client perceived response time at the server with various dynamic workloads, for both HTTP 1.0/1.1. [Figure 2.2: Effect of SYN drops on client perceived response time.] Section 2.6 presents the problem associated with existing admission control mechanisms and how Certes effectively solves it. Section 2.7 compares Certes to the less effective queuing theoretic approach, and Section 2.8 shows that Certes converges rapidly. Before ending this chapter we summarize our findings.

2.1 The Certes Model

The novel contribution of Certes is to provide a server-side measure of mean client perceived response time that includes the impact of failed TCP connection attempts on Web server performance. To simplify our discussion and focus on the issue of failed TCP connection attempts, we make the following assumptions: 1.
We focus on measuring the response time due to TCP connection setup through retrieving embedded objects, steps 3 through 9 in Chapter 1. We do not consider steps 1, 2, and 10. We assume that URL parsing and Web page rendering times are small and that DNS lookups are generally cached, reducing their impact on response time. 2. We focus on determining the contribution to client perceived response time due to the performance of a given Web server. We do not quantify delays that may be due to Web objects residing on other servers or CDNs. 3. We limit our discussion to an estimate of response time based on the duration of a TCP connection. For non-persistent connections, where each HTTP request uses a separate TCP connection, this estimate corresponds to measuring the response time for individual HTTP requests. For persistent connections, where multiple HTTP requests may be served over a single connection, this estimate may include the time for multiple requests. Since a Web page with embedded objects requires multiple HTTP requests in order to be fully displayed, determining the response time for downloading a Web page requires correlating the response times of multiple HTTP requests, which we discuss further in Chapter 3. Given these assumptions, a measure of client-perceived response time should include the time starting from when the first SYN packet is sent from the client to the server until the last HTTP response data packet is received from the server by the client. For a given connection, we define CONN-FAIL as the time between when the first SYN packet is sent from the client and when the last SYN packet is sent from the client (Figure 2.2). This is the time due to failed TCP connection attempts. When there are no failed connection attempts, CONN-FAIL is zero.
For a given connection, we define SYN-to-END as the time from when the server receives the last SYN packet until the server sends the last data packet. This is essentially the server's perception of response time in the absence of SYN drops. The client perceived response time is the sum of CONN-FAIL and SYN-to-END plus one round trip time (RTT), to account for the time it takes to send the SYN packet from the client to the server plus the time it takes to send the last data packet from the server to the client. The client perceived response time over the connection is:

CLIENT_RT = CONN-FAIL + SYN-to-END + RTT    (2.1)

Determining client perceived response time then reduces to determining CONN-FAIL, SYN-to-END, and RTT. Note that any failure to complete the 3-way handshake after the SYN is accepted by the server is captured by SYN-to-END. For example, delays caused by dropped SYN/ACKs from the server to the client (the second part of the 3-way handshake) are accounted for in the SYN-to-END time (as shown in Figure 2.3). The equation also holds if the server terminates the connection before sending any data by sending a FIN or RST. [Figure 2.3: Dropped SYN/ACK from server to client captured in SYN-to-END time.] Determining the SYN-to-END component of the client perceived response time is relatively straightforward. The SYN-to-END time can be decomposed into two components: the time taken to establish the TCP connection after receiving the initial SYN, and the time taken to receive and process the HTTP request(s) from the client.
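As a minimal worked instance of Equation 2.1 (the numbers are illustrative; the closed form 3(2^d − 1) for CONN-FAIL assumes the RFC 1122 retry schedule of 3 s, 6 s, 12 s, ... and that exactly the first d SYNs are dropped):

```python
def client_rt(syn_drops, syn_to_end, rtt):
    """Equation 2.1: CLIENT_RT = CONN-FAIL + SYN-to-END + RTT.
    A connection whose first `syn_drops` SYNs were discarded accrues
    CONN-FAIL = 3 + 6 + ... + 3 * 2**(syn_drops - 1) = 3 * (2**syn_drops - 1)."""
    conn_fail = 3.0 * (2 ** syn_drops - 1)
    return conn_fail + syn_to_end + rtt

# No drops: the server-visible SYN-to-END plus one RTT (0.6 s here).
# Three drops: the same transaction now takes 21 s longer at the client,
# invisible to a purely server-side measurement.
fast = client_rt(0, syn_to_end=0.5, rtt=0.1)
slow = client_rt(3, syn_to_end=0.5, rtt=0.1)
```

The contrast between `fast` and `slow` is exactly the gap between the server's measured response time and what the remote client actually experiences.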
In certain circumstances, for example when the Web server is lightly loaded and the data transfer is large, the first component of the SYN-to-END time can be ignored, and the second component can be used as an approximation to the processing time spent in the application-level server. In such cases, measuring the processing time in the application-level server can provide a good estimate of the SYN-to-END time. In general, however, the processing time in the application-level server is not a good estimate of the SYN-to-END time. If the Web server is heavily loaded, it may delay sending the SYN/ACK back to the client, or it may delay delivering the HTTP request from the client to the application-level server. In such cases, the time to establish the TCP connection may constitute a significant component of the SYN-to-END time. Thus, to obtain an accurate measure of the SYN-to-END time, measurements must be done at the kernel level. A simple way to measure SYN-to-END is by timestamping in the kernel when the last SYN packet is received by the server and when the last data packet is sent from the server. If the kernel does not already provide such a packet timestamp mechanism, it can be added with minor modifications. Section 2.4 describes in further detail how we measured SYN-to-END for our Certes Linux implementation. Determining the RTT component of the client perceived response time is also relatively straightforward. RTT can be determined at the server by measuring the time from when the SYN/ACK is sent from the server to the time when the server receives the ACK back from the client. The RTT measured in this way includes the time spent by the client in processing the SYN/ACK and preparing its reply, as is the norm for TCP. Other approaches for estimating RTT can also be used [7]. For both SYN-to-END and RTT measurements, the kernel at the Web server must provide the respective timestamps.
As discussed in Section 2.4, these timestamps can be added with minor modifications. Determining CONN-FAIL, however, is a difficult problem. When a server accepts a SYN and processes the connection, the server is unaware of how many failed connection attempts have been made by the client prior to this successful attempt. The TCP header [127] and the data payload of a SYN packet do not provide any indication of which attempt the accepted SYN represents. As a result, the server cannot examine the accepted SYN to determine whether it is an initial attempt at connecting, a first retry, or an Nth retry. Even in the cases where the server is responsible for dropping the initial SYN and causing a retry, it is difficult for the server to remember the time the initial SYN was dropped and correlate it with the eventually accepted SYN for a given connection. For such a correlation, the server would be required to retain additional state for each dropped SYN at precisely the time when the server's input network queues are probably near capacity, which could result in performance scalability problems for the server. In Chapter 3 we present an approach that is able to maintain persistent state for each connection request by offloading this processing from the Web server to a separate appliance. Certes solves this problem by taking advantage of two properties of server mechanisms for handling SYNs. First, since the server cannot distinguish whether a SYN packet is an initial attempt or an Nth retry, it must treat them all equally. Second, it is easy for a server to simply count the number of SYNs that are dropped versus accepted, since this requires only a small amount of state.
As a result, Certes can compute the probability that a SYN is dropped and apply that probability equally to all SYNs during a given time period to estimate the number of SYN retries that occur. This information is then combined with an understanding of the TCP exponential backoff mechanism to correlate accepted SYNs with the number of SYN drops that occurred, to determine how many retries were needed before establishing a connection. Certes can then determine CONN-FAIL based on how many retries were needed and the amount of time necessary for those retries to occur. In particular, due to the TCP timeout and exponential backoff mechanisms specified in RFC 1122 [33], the first SYN retry occurs 3 seconds after the initial SYN, the second SYN retry occurs 6 seconds after the first retry, the third SYN retry occurs 12 seconds after the second retry, etc. Certes does assume that all clients adhere to this exact exponential behavior on SYN retries from RFC 1122. This is a reasonable assumption given that RFC 1122 is supported by all major operating systems, including Microsoft operating systems [99], Linux [133], FreeBSD [65], NetBSD 1.5 [108], AIX 5.x, and Solaris. OneStat.com [119] estimates that 97.46% of the Web server accesses on the Internet are from users running a Windows operating system. They attribute the rest to Macintosh and Linux users (1.43% and .26%, respectively). Section 2.2 presents a detailed step-by-step construction of the Certes model. In particular, we discuss the impact of the variance of RTT on when retries arrive at the server and how Certes accounts for this variability. Section 2.3 describes how the Certes model can be implemented efficiently, yielding good response time results.
2.2 Mathematical Constructs of the Certes Model

Certes determines the mean client perceived response time by accounting for CONN-FAIL using a statistical model that estimates the number of first, second, third, etc., retries that occur during a specified time interval. Certes divides time into discrete intervals for grouping connections by their temporal relationship. Without loss of generality, we will assume that time is divided into one second intervals, but in general any interval size less than the initial TCP retry timeout value of three seconds may be used. For ease of exposition, let m = 3 be the number of discrete time intervals that occur during the initial TCP retry timeout value of three seconds. Certes determines the number of retries that occurred before a SYN is accepted by using simple counters to take three aggregate server-side measurements for each time interval. The measurements are:

DROPPED_i: the total number of SYN packets that the server dropped during the ith interval.
ACCEPTED_i: the total number of SYN packets that the server did not drop during the ith interval.
COMPLETED_i: the total number of connections that completed during the ith interval.

Using these three measurements, we can compute for a given interval the offered load at the server, which is the number of SYN packets arriving at the server. The offered load in the ith interval is:

OFFERED_LOAD_i = ACCEPTED_i + DROPPED_i    (2.2)

Certes decomposes each of these measured quantities, OFFERED_LOAD_i, DROPPED_i, ACCEPTED_i, and COMPLETED_i, as a sum of terms associated with connection attempts. Let R_i^j be the number of SYNs that arrived at the server as a jth retry during the ith interval, starting with R_i^0 as the number of initial attempts to connect to the server during interval i. Let D_i^j be the number of SYNs that arrived at the server as a jth retry during the ith interval but were dropped by the server.
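The per-interval state Certes maintains amounts to a few integers; a sketch of the counters and Equation 2.2 (the class and field names are ours, for illustration):

```python
from dataclasses import dataclass

@dataclass
class IntervalCounters:
    dropped: int     # DROPPED_i: SYNs dropped in interval i
    accepted: int    # ACCEPTED_i: SYNs not dropped in interval i
    completed: int   # COMPLETED_i: connections completed in interval i

    @property
    def offered_load(self):
        """Equation 2.2: every SYN arriving in the interval was either
        accepted or dropped, so their sum is the offered load."""
        return self.accepted + self.dropped

    @property
    def drop_rate(self):
        """Per-interval SYN drop probability (Equation 2.6, used below)."""
        ol = self.offered_load
        return self.dropped / ol if ol else 0.0

iv = IntervalCounters(dropped=30, accepted=70, completed=55)
```

This constant-size record per interval is what allows the model to run online in O(c) time and space.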
Let A_i^j be the number of SYNs that arrived at the server as a jth retry during the ith interval and were accepted by the server. Let C_i^j be the number of connections completed during the ith interval that were accepted by the server as a jth retry. Let k be the maximum number of retries attempted by any client. For each interval i we have the following decomposition:

OFFERED_LOAD_i = Σ_{j=0}^{k} R_i^j
DROPPED_i = Σ_{j=0}^{k} D_i^j
ACCEPTED_i = Σ_{j=0}^{k} A_i^j
COMPLETED_i = Σ_{j=0}^{k} C_i^j    (2.3)

For each time interval i, Certes determines the mean client perceived response time for those Web transactions that are completed during the time interval. This includes both connections that are completed during the time interval as well as connections that give up during the interval after exceeding the maximum number of retries attempted by the client. COMPLETED_i is the number of transactions that completed during the ith interval and R_i^{k+1} is the number of clients that gave up during the interval. Applying Equation 2.1 to a time interval, Certes computes the mean client response time for the ith interval as:

CLIENT_RT_i = [ R_i^{k+1} · 3(2^{k+1} − 1) + Σ_{j=1}^{k} C_i^j · 3(2^j − 1) + Σ SYN-to-END + Σ RTT ] / [ COMPLETED_i + R_i^{k+1} ]    (2.4)

Equation 2.4 essentially divides the sum of the response times by the number of transactions to obtain the mean response time. In the denominator, Equation 2.4 sums the total number of transactions that completed and clients that gave up. In the numerator, there are four terms summed together. The first term, R_i^{k+1} · 3(2^{k+1} − 1), is the amount of time that clients waited before giving up, based on the TCP exponential backoff mechanism. The second term, Σ_{j=1}^{k} C_i^j · 3(2^j − 1), represents the total CONN-FAIL time experienced by those clients that completed in the ith interval. The third term, Σ SYN-to-END, is the sum of the measured SYN-to-END times for all transactions completed in the ith interval.
The fourth term, Σ RTT, is the sum of one round trip time for each transaction completed during the i-th interval. For example, if k = 2, then Equation 2.4 reduces to:

    CLIENT_RT_i = [ Σ SYN-to-END + Σ RTT + 21 R_i^{k+1} + 9 C_i^2 + 3 C_i^1 ] / [ COMPLETED_i + R_i^{k+1} ]    (2.5)

Here C_i^1 is the number of clients that waited an additional 3 seconds due to a SYN drop, C_i^2 is the number of clients that waited an additional 9 seconds due to two SYN drops, and R_i^{k+1} is the number of clients that gave up after waiting 21 seconds.

To compute the mean client perceived response time for each interval, Certes uses Equation 2.3 to derive the values of C_i^j and R_i^{k+1} from the measured quantities OFFERED_LOAD_i, DROPPED_i, ACCEPTED_i, and COMPLETED_i. We start from the observation that the TCP header [127] and the data payload of a SYN packet do not provide any indication of which connection attempt a dropped SYN represents. As a result, the server's TCP implementation cannot distinguish a SYN packet containing a j-th retry from a SYN packet containing a k-th retry. This implies that all classes of SYN packets are dropped or accepted with equal probability. The mean SYN drop rate at the server for the i-th interval can be computed from OFFERED_LOAD_i and DROPPED_i:

    DR_i = DROPPED_i / OFFERED_LOAD_i    (2.6)

A key hypothesis of Certes is that the drop rate must therefore be equal for all R_i^j in the i-th interval. This results in the following relations between R_i^j and D_i^j:

    D_i^j = DR_i · R_i^j,  j = 0, 1, ..., k    (2.7)

Each individual connection that completes during the i-th interval was accepted during the (i − SYN-to-END)-th interval. Because each connection may have a different SYN-to-END time, connections that complete during the i-th interval may have been accepted during different intervals.
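As a concrete illustration, the per-interval average of Equation 2.4 can be sketched in a few lines of Python; the function name and argument layout are ours, not part of Certes:

```python
def mean_client_rt(k, gave_up, C, sum_syn_to_end, sum_rtt, completed):
    """Mean client perceived response time for one interval (Equation 2.4).

    k              -- maximum number of retries a client will attempt
    gave_up        -- R_i^{k+1}, clients abandoning after the k-th retry
    C              -- C[j] = completions accepted as a j-th retry, j = 1..k
    sum_syn_to_end -- sum of server-measured SYN-to-END times (seconds)
    sum_rtt        -- one round trip time summed over completed transactions
    completed      -- COMPLETED_i
    """
    # 3(2^j - 1) seconds: cumulative TCP SYN retransmission backoff delay
    backoff = lambda j: 3 * (2 ** j - 1)
    total = gave_up * backoff(k + 1)                          # wait before giving up
    total += sum(C[j] * backoff(j) for j in range(1, k + 1))  # CONN-FAIL time
    total += sum_syn_to_end + sum_rtt                         # server time + one RTT
    return total / (completed + gave_up)
```

With k = 2, C = {1: 2, 2: 1}, one abandoned client, 10 seconds of total SYN-to-END time and 1 second of total RTT over 5 completions, this yields (21 + 6 + 9 + 10 + 1)/6 seconds, matching Equation 2.5.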
Let ACCEPTED_{p,i} be the number of connections that were accepted during the p-th interval and completed during the i-th interval. Therefore,

    COMPLETED_i = Σ_p ACCEPTED_{p,i}    (2.8)

Let

    ACCEPTED_{p,i} = Σ_{j=0}^{k} A_{p,i}^j    (2.9)

where A_{p,i}^j is the number of SYNs that were accepted during the p-th interval as a j-th retry and completed during the i-th interval. Therefore,

    C_i^j = Σ_p A_{p,i}^j    (2.10)

As mentioned above, when a server accepts a SYN and processes the connection, the server is unaware of how many failed connection attempts the client made prior to the successful attempt. Therefore, there is no direct method for determining the number of retries associated with a specific connection, and hence no direct method for obtaining A_{p,i}^j. We estimate the value of A_{p,i}^j from the ratio of A_p^j to ACCEPTED_p:

    A_{p,i}^j = (A_p^j / ACCEPTED_p) · ACCEPTED_{p,i}    (2.11)

Since the SYNs that do not get dropped get accepted, Equation 2.7 implies that A_i^j is:

    A_i^j = R_i^j − D_i^j = R_i^j − DR_i · R_i^j    (2.12)

Combining Equations 2.11 and 2.12 allows us to rewrite Equation 2.10 as:

    C_i^j = Σ_p [ (R_p^j − DR_p · R_p^j) / ACCEPTED_p ] · ACCEPTED_{p,i}    (2.13)

Equation 2.13 solves for C_i^j in terms of R_p^j, DR_p, and ACCEPTED_{p,i}. We can substitute Equation 2.13 into our equation for calculating CLIENT_RT_i, effectively removing C_i^j from Equation 2.4.

We now turn our attention to solving for R_i^j. Drops occurring during the i-th interval return as retries in future intervals. Based on the TCP exponential backoff mechanism, the timing of the return depends on whether it was an initial SYN, a 1st retry, a 2nd retry, etc. As a result, the number of retries arriving during the i-th interval is a function of the number of drops that occurred in prior intervals:

    R_i^1 = D_{i−m}^0
    R_i^2 = D_{i−2m}^1
    R_i^3 = D_{i−4m}^2    (2.14)
    ...
    R_i^{k+1} = D_{i−2^k m}^k
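The apportioning in Equations 2.11 through 2.13 amounts to splitting each accept interval's contribution to COMPLETED_i according to that interval's retry-class mix. A small sketch (function and variable names are ours):

```python
def completions_by_retry(A, accepted, accepted_to_i):
    """Estimate C_i^j per Equation 2.13.

    A             -- A[p][j]: SYNs accepted during interval p as j-th retries
    accepted      -- accepted[p] = ACCEPTED_p
    accepted_to_i -- accepted_to_i[p] = ACCEPTED_{p,i}: accepts from interval p
                     that complete during the interval of interest
    """
    C = {}
    for p, n in accepted_to_i.items():
        for j, a in A[p].items():
            # A^j_{p,i} = (A_p^j / ACCEPTED_p) * ACCEPTED_{p,i}  (Equation 2.11)
            C[j] = C.get(j, 0.0) + (a / accepted[p]) * n
    return C
```

For example, if interval 3 accepted 10 connections, 8 as initial SYNs and 2 as first retries, and 5 of those accepts complete in the interval of interest, the estimate splits them 4 : 1 across the two classes.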
[Figure 2.4: Variance in RTT affects the arrival time of retries. A client's initial SYN and its retries are dropped by the server while the network delay changes between attempts.]

Equation 2.14 assumes that retries arrive at the server exactly when expected based on the TCP specification (i.e., in 3 seconds, 6 seconds, etc.). Due to variance in RTT, this assumption may not hold in practice. Such a scenario is shown in Figure 2.4, where the network delay changes between connection attempts for a specific client. This has the effect of skewing the estimates for R_i^j, since retries may not always arrive at the server exactly when expected. Note that it is the variance in RTT for a specific client that affects the model, not the differences in RTT between clients. For example, the server will observe the 3 second, 6 second, 12 second, etc., retry delays for each client with a consistent RTT, regardless of the magnitude of that RTT.

This effect can be accounted for by treating R_i^j as a weighted distribution over the D_i^j of past intervals instead of just using a single interval. Let W_{p,i}^j be the portion of D_p^j that will return as part of R_i^{j+1}. The following holds:

    1 = Σ_i W_{p,i}^j    (2.15)

Using these weights, we can modify Equation 2.14 so that R_i^j is a combination of drops occurring in a small set of prior intervals, rather than the number of drops that occurred in one specific prior interval:

    R_i^1 = ... + W_{i−m−1,i}^0 · D_{i−m−1}^0 + W_{i−m,i}^0 · D_{i−m}^0 + W_{i−m+1,i}^0 · D_{i−m+1}^0 + ...
    R_i^2 = ... + W_{i−2m−1,i}^1 · D_{i−2m−1}^1 + W_{i−2m,i}^1 · D_{i−2m}^1 + W_{i−2m+1,i}^1 · D_{i−2m+1}^1 + ...
    R_i^3 = ... + W_{i−4m−1,i}^2 · D_{i−4m−1}^2 + W_{i−4m,i}^2 · D_{i−4m}^2 + W_{i−4m+1,i}^2 · D_{i−4m+1}^2 + ...
    ...
    R_i^k = ... + W_{i−2^{k−1}m−1,i}^{k−1} · D_{i−2^{k−1}m−1}^{k−1} + W_{i−2^{k−1}m,i}^{k−1} · D_{i−2^{k−1}m}^{k−1} + W_{i−2^{k−1}m+1,i}^{k−1} · D_{i−2^{k−1}m+1}^{k−1} + ...    (2.16)

Equation 2.7 allows us to rewrite Equation 2.16 in terms of DR_i, W_{p,i}^j, and R_i^j by substituting DR_i · R_i^j for D_i^j:

    R_i^1 = ... + W_{i−m−1,i}^0 · DR_{i−m−1} · R_{i−m−1}^0 + W_{i−m,i}^0 · DR_{i−m} · R_{i−m}^0 + W_{i−m+1,i}^0 · DR_{i−m+1} · R_{i−m+1}^0 + ...
    R_i^2 = ... + W_{i−2m−1,i}^1 · DR_{i−2m−1} · R_{i−2m−1}^1 + W_{i−2m,i}^1 · DR_{i−2m} · R_{i−2m}^1 + W_{i−2m+1,i}^1 · DR_{i−2m+1} · R_{i−2m+1}^1 + ...
    ...
    R_i^k = ... + W_{i−2^{k−1}m−1,i}^{k−1} · DR_{i−2^{k−1}m−1} · R_{i−2^{k−1}m−1}^{k−1} + W_{i−2^{k−1}m,i}^{k−1} · DR_{i−2^{k−1}m} · R_{i−2^{k−1}m}^{k−1} + ...    (2.17)

By recursive substitution of the R_i^j terms we can transform these k equations into terms of the unknowns R_i^0 and W_{p,i}^j. For k = 2 and m = 3 the result is:

    R_i^1 = W_{i−4,i}^0 · DR_{i−4} · R_{i−4}^0 + W_{i−3,i}^0 · DR_{i−3} · R_{i−3}^0 + W_{i−2,i}^0 · DR_{i−2} · R_{i−2}^0

    R_i^2 = W_{i−7,i}^1 · DR_{i−7} · [ W_{i−11,i−7}^0 · DR_{i−11} · R_{i−11}^0 + W_{i−10,i−7}^0 · DR_{i−10} · R_{i−10}^0 + W_{i−9,i−7}^0 · DR_{i−9} · R_{i−9}^0 ]
          + W_{i−6,i}^1 · DR_{i−6} · [ W_{i−10,i−6}^0 · DR_{i−10} · R_{i−10}^0 + W_{i−9,i−6}^0 · DR_{i−9} · R_{i−9}^0 + W_{i−8,i−6}^0 · DR_{i−8} · R_{i−8}^0 ]    (2.18)
          + W_{i−5,i}^1 · DR_{i−5} · [ W_{i−9,i−5}^0 · DR_{i−9} · R_{i−9}^0 + W_{i−8,i−5}^0 · DR_{i−8} · R_{i−8}^0 + W_{i−7,i−5}^0 · DR_{i−7} · R_{i−7}^0 ]

From Equation 2.3 we have:

    OFFERED_LOAD_i = R_i^0 + R_i^1 + R_i^2    (2.19)
By substituting Equations 2.18 into Equation 2.19 we get:

    OFFERED_LOAD_i = R_i^0
          + W_{i−4,i}^0 · DR_{i−4} · R_{i−4}^0 + W_{i−3,i}^0 · DR_{i−3} · R_{i−3}^0 + W_{i−2,i}^0 · DR_{i−2} · R_{i−2}^0
          + W_{i−7,i}^1 · DR_{i−7} · [ W_{i−11,i−7}^0 · DR_{i−11} · R_{i−11}^0 + W_{i−10,i−7}^0 · DR_{i−10} · R_{i−10}^0 + W_{i−9,i−7}^0 · DR_{i−9} · R_{i−9}^0 ]
          + W_{i−6,i}^1 · DR_{i−6} · [ W_{i−10,i−6}^0 · DR_{i−10} · R_{i−10}^0 + W_{i−9,i−6}^0 · DR_{i−9} · R_{i−9}^0 + W_{i−8,i−6}^0 · DR_{i−8} · R_{i−8}^0 ]    (2.20)
          + W_{i−5,i}^1 · DR_{i−5} · [ W_{i−9,i−5}^0 · DR_{i−9} · R_{i−9}^0 + W_{i−8,i−5}^0 · DR_{i−8} · R_{i−8}^0 + W_{i−7,i−5}^0 · DR_{i−7} · R_{i−7}^0 ]

Equation 2.20 provides one equation for each interval i, in terms of OFFERED_LOAD_i (which is measured), DR_i (which is measured), R_i^0 (which is unknown), and W_{p,i}^j (which is unknown). Once solutions for R_i^0 are found, they can be used to calculate R_i^j for all i, j. Additionally, the presence of W_{p,i}^j introduces nonlinearity. Each interval i contains 7 unknowns: R_i^0, W_{i,i+2}^0, W_{i,i+3}^0, W_{i,i+4}^0, W_{i,i+5}^1, W_{i,i+6}^1, and W_{i,i+7}^1. From Equation 2.15 we have the following equations for each interval i:

    1 = W_{i,i+2}^0 + W_{i,i+3}^0 + W_{i,i+4}^0
    1 = W_{i,i+5}^1 + W_{i,i+6}^1 + W_{i,i+7}^1    (2.21)

All values in Equation 2.20 must be non-negative, and hence we have the constraints:

    0 ≤ R_i^0, W_{p,i}^j    ∀ i, j, p    (2.22)

Of course, if the values for W_{p,i}^j were somehow known in advance, then Equation 2.20 could be solved directly, since it reduces to a linear system of N equations in N unknowns. In practice, however, the W_{p,i}^j are unknown and need to be estimated. We describe one approach to a solution whose general steps are as follows:

1. Determine an initial estimate for all W_{p,i}^j over a window of prior intervals. Errors in the estimates for W_{p,i}^j are directly related to the errors in R_i^0.
As such, determining the bounds for this error is a solved problem: bounding the error in the solution of a system of linear equations whose coefficients may contain experimental error [68].

2. Solve Equation 2.20 using these estimated W_{p,i}^j values.

3. If there is no solution in step 2 (i.e., Equation 2.22 is not satisfied), or there is a positive change in the optimization objective, then change the values for W_{p,i}^j and iterate.

Let W_I be the initial vector of estimated W_{p,i}^j values. The objective of the optimization may be to minimize ||W_I − W_S||, where W_S is the final solution vector of weights. In other words, assuming that the initial best estimate is based on prior fact, the solution vector ought not to deviate significantly from it.

Step 1: One approach for determining W_I that accounts for the impact of the RTT variance shown in Figure 2.4 is to base W_I on average historical measures of the changes in RTT over time. Let χ_k be the probability density function of ∆RTT over a period of length 3 · 2^k · m. Given that the arrivals of R_i^0 are uniformly distributed over the i-th interval (defined by the probability density function t_i), then

    E[W_{i,i+2}^0] = ∫_{i+1}^{i+2} f_{χ0}(x) dx
    E[W_{i,i+3}^0] = ∫_{i+2}^{i+3} f_{χ0}(x) dx
    E[W_{i,i+4}^0] = ∫_{i+3}^{i+4} f_{χ0}(x) dx
    E[W_{i,i+5}^1] = ∫_{i+4}^{i+5} f_{χ1}(x) dx    (2.23)
    E[W_{i,i+6}^1] = ∫_{i+5}^{i+6} f_{χ1}(x) dx
    E[W_{i,i+7}^1] = ∫_{i+6}^{i+7} f_{χ1}(x) dx

where f_{χk}(t) is the convolution of t_i and χ_k, shifted by the retry delay (3 seconds for the 1st retry, 9 seconds for the 2nd):

    f_{χ0}(t) = ∫_{−∞}^{∞} χ_0(x) t_i(t − 3 − x) dx
    f_{χ1}(t) = ∫_{−∞}^{∞} χ_1(x) t_i(t − 9 − x) dx    (2.24)

In other words, E[W_{i,i+2}^0] is the mean portion of R_i^0 that is expected to return during the (i+2)-nd interval as part of R_{i+2}^1. Note that in Equation 2.23 the E[W_{p,i}^j] terms are independent of p. We now set W_I to E[W_{p,i}^j], in effect replacing each W_{p,i}^j in Equation 2.20 with its historical mean. By replacing the variables W_{p,i}^j by their means, the error can be quantified using Chernoff's bound [124].
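The integrals of Equation 2.23 can be approximated numerically. As a hedged stand-in for the analytic convolution, the expected weight for each candidate arrival interval is simply the fraction of simulated retry arrival times (uniform initial offset plus backoff plus sampled RTT jitter) landing in it:

```python
def weight_estimates(arrival_samples, intervals):
    """Monte-Carlo approximation of Equation 2.23: for each candidate
    interval [a, b), E[W] is the probability mass of the retry-arrival
    density f_chi falling inside it, estimated from sampled arrival times."""
    n = len(arrival_samples)
    return {(a, b): sum(a <= t < b for t in arrival_samples) / n
            for (a, b) in intervals}
```

When the candidate intervals cover the support of the arrival density, the estimated weights sum to one, matching the constraint of Equation 2.15.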
Step 2: Substituting the current estimated values of W_{p,i}^0 and W_{p,i}^1 into Equation 2.20 turns the problem into a linear system of N equations in N unknowns, for N intervals (i.e., since W_{p,i}^0 and W_{p,i}^1 are now constants, the only unknowns left are the R_i^0).

During system initialization, note that all SYNs arriving, accepted, or dropped during the first interval are initial SYNs. Likewise, R_i^j = 0 for 1 ≤ j ≤ k, 1 ≤ i ≤ 3 (no 1st, 2nd, ..., k-th retries can occur in the first three intervals) and R_i^j = 0 for 2 ≤ j ≤ k, 4 ≤ i ≤ 9 (no 2nd, 3rd, ..., k-th retries can occur during the 4th through 9th intervals). In general,

    R_i^j = 0  for i ≤ 3(2^z − 1), j ≥ z, 1 ≤ z ≤ k    (2.25)
    R_i^j = 0  for i ≤ 0, ∀j

For the initial N intervals, there are only N unknowns:

    OFFERED_LOAD_1 = R_1^0
    OFFERED_LOAD_2 = R_2^0
    OFFERED_LOAD_3 = R_3^0
    OFFERED_LOAD_4 = R_1^0 · W_{1,4}^0 · DR_1 + R_2^0 · W_{2,4}^0 · DR_2 + R_4^0    (2.26)
    OFFERED_LOAD_5 = R_1^0 · W_{1,5}^0 · DR_1 + R_2^0 · W_{2,5}^0 · DR_2 + R_3^0 · W_{3,5}^0 · DR_3 + R_5^0

Step 3: If step 2 does not produce a satisfactory solution, an adjustment is made to the values of W_{p,i}^0 and W_{p,i}^1. There are several ways to perform this adjustment. One method is based on the partial derivatives of R_i^0 with respect to W_{p,i}^0 and W_{p,i}^1, as defined by the gradient matrix:

    G = [ ∂R_i^0/∂W_{i−2,i}^0   ∂R_{i−1}^0/∂W_{i−2,i}^0   ∂R_{i−2}^0/∂W_{i−2,i}^0   ...
          ∂R_i^0/∂W_{i−3,i}^0   ∂R_{i−1}^0/∂W_{i−3,i}^0   ∂R_{i−2}^0/∂W_{i−3,i}^0   ...    (2.27)
          ∂R_i^0/∂W_{i−4,i}^0   ∂R_{i−1}^0/∂W_{i−4,i}^0   ∂R_{i−2}^0/∂W_{i−4,i}^0   ...
          ...                   ...                       ...                       ... ]

The number of columns in G is equal to the number of intervals in the sliding window, and the number of rows in G is equal to the total number of W_{p,i}^j in the sliding window. Using G we can formulate a linear program to determine the W_{p,i}^j for the next iteration:
    G^T ∆W = ∆R^0    (2.28)

where the column vector ∆W = (∆W_{i−2,i}^0, ∆W_{i−3,i}^0, ∆W_{i−4,i}^0, ∆W_{i−7,i}^1, ∆W_{i−6,i}^1, ∆W_{i−5,i}^1, ...)^T is the amount of (unknown) change to apply to the W_{p,i}^j for the next iteration, and the column vector ∆R^0 = (∆R_i^0, ∆R_{i−1}^0, ∆R_{i−2}^0, ...)^T is the amount of change we would like to witness in each R_i^0 by applying the new values for W_{p,i}^j. In this case,

    ∆R_i^0 = ||0 − R_i^0||  if R_i^0 < 0
    ∆R_i^0 = 0              otherwise    (2.29)

Essentially, Equation 2.28 uses the gradient matrix G^T to determine how much each weight ought to be changed in order to achieve a viable solution. Equation 2.28 can be solved using a linear least squares method [130] to obtain a best-fit solution for ∆W.

Final Step: Once step 2 produces a satisfactory solution for R_i^0 and W_{p,i}^j, these values can be plugged into Equation 2.17 to obtain the values for R_i^j. The values for R_i^j can then be used in Equation 2.13 to determine C_i^j. Having determined the values for R_i^j and C_i^j for the i-th interval, we use these values in Equation 2.4 to obtain the mean client response time.

2.3 Fast Online Approximation of the Certes Model

Section 2.2 describes a computationally expensive algorithm: solving a system of nonlinear equations. We now present a fast, online implementation of Certes that produces near-optimal results based on a non-iterative approach. We simplify the mathematical approach in two ways:

1. We assume that all transactions that complete during the i-th interval have roughly the same SYN-to-END time. If variance in SYN-to-END time leads to an inconsistency in the model, we make an online adjustment similar to Equation 2.13 but based on the mean SYN-to-END time for a given interval. For the remainder of the chapter, when referring to SYN-to-END time, we imply the mean SYN-to-END time for a given interval.
2. We compute an initial estimate of the weights, W_I, by assuming the RTT has no variance. If this assumption leads to an inconsistency in the model, we make simple online adjustments to W_{p,i}^j in the current and future time intervals.

What follows is a step-by-step exposition of this approach.

Step 1: An alternative to the approach given in the prior section for determining W_I is to begin with the assumption that the RTT has no variance. Under this assumption, the initial values for W_I become:

    0 = W_{i−m−1,i}^0 = W_{i−2m−1,i}^1 = W_{i−4m−1,i}^2 = ... = W_{i−2^{k−1}m−1,i}^{k−1}
    1 = W_{i−m,i}^0   = W_{i−2m,i}^1   = W_{i−4m,i}^2   = ... = W_{i−2^{k−1}m,i}^{k−1}    (2.30)
    0 = W_{i−m+1,i}^0 = W_{i−2m+1,i}^1 = W_{i−4m+1,i}^2 = ... = W_{i−2^{k−1}m+1,i}^{k−1}

If, under this assumption, a solution cannot be found, we adjust for RTT variance by increasing or decreasing the values of W_{p,i}^j using simple online heuristics in Step 3. These adjustments serve as an alternative to iterating over Equation 2.20 to determine optimal values for W_{p,i}^j.

Step 2: The following demonstrates how to efficiently solve Equation 2.20 via online direct substitution over a sliding window of intervals. Assume that the server is booted at time t_0 (or that there is a period of inactivity prior to t_0), as shown in Figure 2.5.

[Figure 2.5: Initial connection attempts that get dropped become retries three seconds later.]

Certes assumes that all SYNs arriving during the first interval [t_0, t_1] are initial SYNs. During the first interval [t_0, t_1] the server measures ACCEPTED_1 and DROPPED_1 and can use those measurements to determine A_1^0 = ACCEPTED_1, D_1^0 = DROPPED_1, and R_1^0 = OFFERED_LOAD_1. Section 2.8 shows the results when Certes is applied when the SYNs in the first interval are not all initial SYNs.
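Before walking through Figures 2.5 and 2.6, note that once the weights are fixed, the triangular structure of Equation 2.26 lets R_i^0 be recovered interval by interval with no matrix solve at all. A minimal sketch for the k = 1 case (the function and variable names are ours, not from Certes):

```python
def solve_initial_syns(offered, dr, W0):
    """Forward substitution over Equation 2.26 (k = 1 for brevity).

    offered -- offered[i] = OFFERED_LOAD_i (measured)
    dr      -- dr[i] = DR_i (measured)
    W0      -- W0[(p, i)]: estimated portion of interval p's initial-SYN
               drops returning as 1st retries during interval i
    Returns R0, with R0[i] the estimated initial SYNs in interval i.
    """
    R0 = {}
    for i in sorted(offered):
        # 1st retries predicted to land in interval i from earlier drops
        retries = sum(W0.get((p, i), 0.0) * dr[p] * R0[p] for p in R0)
        R0[i] = offered[i] - retries
    return R0
```

With the zero-variance weights of Equation 2.30 and m = 3, drops from interval 1 return wholly in interval 4, so an offered load of 9 there with 5 returning retries leaves 4 initial SYNs.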
The dropped SYNs, D_1^0, will return to the server as 1st retries three seconds later, as R_4^1 during interval [t_3, t_4].

[Figure 2.6: A second connection attempt that gets dropped becomes a retry six seconds later.]

Moving ahead in time to interval [t_3, t_4], as shown in Figure 2.6, the server measures ACCEPTED_4 and DROPPED_4 and calculates the SYN drop rate for the 4th interval, DR_4, using Equation 2.6. The Web server cannot distinguish between an initial SYN and a 1st retry; therefore the drop rate applies to both R_4^0 and R_4^1 equally, giving D_4^1 = DR_4 · R_4^1, and then A_4^1 = R_4^1 − D_4^1. From Equations 2.3, A_4^0 = ACCEPTED_4 − A_4^1 and D_4^0 = DROPPED_4 − D_4^1. Finally, the number of initial SYNs arriving during the 4th interval is R_4^0 = A_4^0 + D_4^0. We have now determined the values for all terms in Figure 2.6.

Note that the D_4^1 dropped SYNs will return to the server as 2nd retries six seconds later during interval [t_9, t_10], as R_10^2, when those clients experience their second TCP timeout, and that the D_4^0 dropped SYNs will return to the server as 1st retries, as R_7^1, three seconds later during interval [t_6, t_7].

By continuing in this manner it is possible to recursively compute all values of R_i^j, A_i^j, and D_i^j for all intervals, for a given k. Figure 2.7 depicts the 10th interval, including those intervals that directly contribute to the values in the 10th interval. Clients that give up after k connection attempts are depicted as ending the transaction. Figure 2.8 shows the final model defining the relationships between the incoming, accepted, dropped, and completed connections during the i-th interval.
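The recursive bookkeeping of Step 2 condenses into a short loop. The following is a sketch under the zero-variance assumption of Equation 2.30 (a SYN dropped as a j-th attempt returns exactly 3 · 2^j seconds later); the names and data layout are ours:

```python
def certes_online(measurements, m=3, k=2):
    """Non-iterative Certes pass: earlier drops determine how many j-th
    retries R_i^j must arrive during interval i; the equal-drop-rate
    assumption (Equation 2.7) then splits ACCEPTED/DROPPED by class.

    measurements -- list of (ACCEPTED_i, DROPPED_i) pairs, one per interval
    Returns one (R, D, A) triple of per-class dicts for each interval.
    """
    n = len(measurements)
    # due[i][j]: class-j retries scheduled to arrive during interval i
    due = [{} for _ in range(n + (2 ** k) * m + 1)]
    out = []
    for i, (acc, drop) in enumerate(measurements):
        offered = acc + drop
        dr = drop / offered if offered else 0.0      # DR_i, Equation 2.6
        R = {j: due[i].get(j, 0.0) for j in range(1, k + 1)}
        R[0] = offered - sum(R.values())             # the rest are initial SYNs
        D = {j: dr * R[j] for j in R}                # equal drop probability
        A = {j: R[j] - D[j] for j in R}
        for j in range(k):                           # schedule future retries
            t = i + (2 ** j) * m                     # 3, 6, 12, ... seconds later
            due[t][j + 1] = due[t].get(j + 1, 0.0) + D[j]
        out.append((R, D, A))
    return out
```

Replaying the example of Figures 2.5 and 2.6: if 2 of 10 SYNs are dropped in the first interval, those 2 arrive as 1st retries exactly three intervals later.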
Connections accepted during the i-th interval complete during the (i + SYN-to-END)-th interval. The client frustration timeout (FTO) is specified in seconds, and the term R_{i+[FTO−(2^k−1)m]}^{k+1} indicates that clients who do not get accepted during the i-th interval on the k-th retry will cancel their attempt for service during the (i + [FTO − (2^k−1)m])-th interval.

The model in Figure 2.8 can be implemented in a Web server by using a simple data structure with a sliding window. Note that during each time interval, only the aggregate counters for DROPPED_i, ACCEPTED_i, and COMPLETED_i are incremented. At the end of each time interval, the more detailed counters for R_i^j, A_i^j, D_i^j, and C_i^j are computed using a fixed number of computations.

Step 3: As mentioned in Section 2.2, due to inconsistencies in network delays, the 1st retry from a client may not arrive at the server exactly three seconds later; rather, it may arrive in the interval before or after the interval in which it was expected. Likewise, since the measurement for SYN-to-END is not constant, there will be instances where C_{i+SYN-to-END}^j ≠ A_i^j; in other words, some of the j-th retries accepted in the i-th interval may complete before or after the (i + SYN-to-END)-th interval.

These occurrences relate to an interesting aspect of the choice of interval length. In general, when sampling techniques are used, the smaller the sampling period (the more frequent the sampling), the more accurate the result. Certes is not a sampling-based approach, yet one might intuit that using shorter intervals would somehow provide better results.

[Figure 2.7: After three connection attempts the client gives up; clients get frustrated and give up before the next retry.]
[Figure 2.8: Relationship between incoming, accepted, dropped, and completed requests. Transactions accepted in the i-th interval complete during the (i + SYN-to-END)-th interval, where SYN-to-END is the server-measured response time.]

[Figure 2.9: The smaller the interval, the more difficult it is to accurately discretize events.]

Just the opposite is true. As shown in Figure 2.9, as the size of the interval is reduced below a certain point, the probability that events happen when expected is reduced as well. For example, the probability that a dropped initial SYN will arrive back at the server during the interval that is exactly three seconds later becomes zero as the size of the interval is reduced toward zero. This is similar in nature to the problem of RTT variance shown in Figure 2.4. Likewise, with small intervals, the probability of events occurring on an interval boundary increases.

These inconsistencies can be accounted for by performing online adjustments to W_{p,i}^j to ensure that relationships within and between intervals remain consistent. The function rint, round to integer, is used to ensure that certain values in the model remain integral (i.e., we do not allow a fractional number of dropped SYNs). The first heuristic we use is:
    if (OFFERED_LOAD_i < R_i^1 + R_i^2) then
        overload  = (R_i^1 + R_i^2) − OFFERED_LOAD_i
        R_i^1     = R_i^1 − rint(overload · R_i^1 / (R_i^1 + R_i^2))
        R_{i+1}^1 = R_{i+1}^1 + rint(overload · R_i^1 / (R_i^1 + R_i^2))    (2.31)
        R_i^2     = R_i^2 − rint(overload · R_i^2 / (R_i^1 + R_i^2))
        R_{i+1}^2 = R_{i+1}^2 + rint(overload · R_i^2 / (R_i^1 + R_i^2))

If the number of retries exceeds the measured offered load, we simply delay a portion of the overload until the next interval. The second heuristic we use is:

    if (ACCEPTED_{i−SYN-to-END} ≠ COMPLETED_i) then
        C_i^0 = rint(COMPLETED_i · A_{i−SYN-to-END}^0 / ACCEPTED_{i−SYN-to-END})
        C_i^1 = rint(COMPLETED_i · A_{i−SYN-to-END}^1 / ACCEPTED_{i−SYN-to-END})    (2.32)
        C_i^2 = rint(COMPLETED_i · A_{i−SYN-to-END}^2 / ACCEPTED_{i−SYN-to-END})

Since our approximation uses the mean SYN-to-END time, the number of completed connections may not equal the number of accepted connections. We adjust for this difference by using the ratio A_{i−SYN-to-END}^0 : A_{i−SYN-to-END}^1 : A_{i−SYN-to-END}^2 to set the ratio C_i^0 : C_i^1 : C_i^2. This attempts to adjust for variance in the SYN-to-END time. As we show in Section 2.5, the results obtained by using these heuristics are sufficiently accurate to allow us to bypass the costlier optimization approach defined in Section 2.2.

Final Step: Having determined the values for R_i^j and C_i^j for the i-th interval, we use these values in Equation 2.4 to obtain the mean client response time.

2.3.1 Packet Loss in the Network

Packet drops that occur in the network (and not explicitly by the server) are included in the model to refine the client response time estimate. Since the client-side TCP reacts to network drops in the same manner as it does to server-side drops, network drops are estimated and added to the drop counts, D_i^j. As shown in Figure 2.10, SYNs dropped by the network (NDS_i^j) are combined with those dropped at the server.
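One way to turn a path loss-rate estimate into a network drop count, from server-side observables alone, is the following back-of-the-envelope helper. The constant, independent-loss assumption and the formula are our inference for illustration, not a procedure stated in the thesis:

```python
def network_syn_drops(observed_syns, loss_rate):
    """Estimate SYNs lost in the network before reaching the server.

    If each SYN is lost independently with probability loss_rate, then the
    observed count is a (1 - loss_rate) fraction of what clients actually
    sent, so roughly observed * loss_rate / (1 - loss_rate) SYNs went
    missing; these are added to the D_i^j counts as NDS_i^j.
    """
    return observed_syns * loss_rate / (1.0 - loss_rate)
```

For example, at a 2% loss rate, observing 98 SYNs suggests about 2 more were lost in transit.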
To estimate the SYN drop rate in the network, one can use a general estimate of a 2-3% [164, 168] packet loss rate in the Internet or, in the case of private networks, obtain packet loss probabilities from routers. Another approach is to assume that the packet loss rate from the client to the server is equal to the loss rate from the server to the client; the server can then estimate the packet loss rate to the client from the number of TCP retransmissions.

2.3.2 Client Frustration Time Out (FTO)

A scenario that is very often neglected when calculating response times occurs when the client cancels the connection request due to frustration while waiting to connect. This was shown in Figure 2.7. Any client is only willing to wait a certain amount of time before hitting reload on the browser or going to another site. Such failed transactions must be included when determining client response time. To include this latency, the Certes model defines a limit, referred to as the client frustration timeout (FTO), which is the longest amount of time a client is willing to wait for an indication of a successful connection. In other words, the FTO bounds the number of connection attempts that a client's TCP implementation will make before the client hits reload on the browser or goes to another Web site.

[Figure 2.10: Addition of network SYN drops to the model. Transactions accepted in the i-th interval complete during the (i + SYN-to-END)-th interval, where SYN-to-END is the server-measured response time.]
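Because the n-th retry is sent 3(2^n − 1) seconds after the initial SYN, the retry count implied by a given FTO (tabulated in Table 2.1) can be computed directly; a small sketch, with a function name of our choosing:

```python
def retries_for_fto(fto_seconds):
    """Number of SYN retries sent before a client with the given
    frustration timeout gives up (Table 2.1). Under TCP exponential
    backoff, the n-th retry fires 3 * (2**n - 1) seconds after the
    initial SYN."""
    n = 0
    while fto_seconds >= 3 * (2 ** (n + 1) - 1):
        n += 1
    return n
```

A client with an FTO under 3 seconds sends no retries at all, while one willing to wait 93 seconds (1.55 minutes) reaches the fifth retry.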
Table 2.1 specifies the relationship between the FTO and the value for k, which was introduced in Section 2.2 as the maximum number of retries a client is willing to attempt before giving up.

    If the client frustration timeout is:        then the number of retries will be:
    less than 3 sec                              0
    at least 3 sec but less than 9 sec           1
    at least 9 sec but less than 21 sec          2
    at least 21 sec but less than 45 sec         3
    at least 45 sec but less than 1.55 min       4
    at least 1.55 min but less than 3.15 min     5

    Table 2.1: Relationship between client frustration timeout and number of connection attempts.

Any single value chosen for the number of retries covers a range of client behavior; unfortunately, that range will not cover all client behavior. Although it is possible to use a distribution of the FTO derived from real-world Web browsing traffic, for simplicity we used a constant default value of 21 seconds (k = 2) in most of our experiments. In Section 2.5 we look more carefully at the impact of using an incorrect assumption for the FTO.

2.3.3 SYN Flood Attacks

Another scenario that is very often neglected when calculating response times arises during a SYN flood (denial of service) attack. During a SYN flood, the attackers keep the server's SYN queue full by continually sending large numbers of initial SYNs, which essentially reduces the FTO to zero. The end result is that the server stands idle, with a full SYN queue, while very few client connections are established and serviced. A SYN flood attack is very different from a period of heavy load. The perpetrators of a SYN attack do not adhere to the TCP timeout and exponential backoff mechanisms, never respond to a SYN/ACK, and never establish a connection with the server; no transactions are serviced. On the other hand, in heavy load conditions, clients adhere to the TCP protocol and large numbers of transactions are serviced (excluding situations where the server enters a thrashing state). Certes works well under heavy load conditions precisely because clients adhere to the TCP protocol.

During a SYN flood attack, Certes faces the problem of identifying the distribution of the FTO. Our approach to a solution involves identifying when a SYN attack is underway, allowing Certes to switch from the FTO distribution currently in use to one that is representative of a SYN attack. While identifying a SYN attack is relatively simple, constructing a representative FTO distribution for a SYN flood attack is not. Implementing this approach was beyond the scope of this thesis and is left for future work.

2.3.4 Categorization

Certes can be used, in parallel, to obtain response time estimates for multiple classes of transactions. Since Certes is based on the drop activity associated with SYN packets, the classification of a dropped SYN is limited to the information contained in the SYN packet: the device the packet arrived on, the source IP address and port, and the destination IP address and port.

2.4 Certes Linux Implementation

In this section we describe our implementation of the fast online Certes model from Section 2.3, which executes on the Web server machine itself. In Chapter 3 we discuss our implementation of the Certes model on an appliance that sits in front of the Web server.

[Figure 2.11: Certes implementation on a Linux Web server. Certes runs as a user-space process alongside Apache, reading TCP counters and the SYN-to-END measurement from a kernel module via /proc; local and remote administrators subscribe to events, and modeling results are logged.]

Figure 2.11 shows Certes executing alongside, but separate from, Apache.
Apache is shown as the Web server application in Figure 2.11, but any Web server application can be used, since Certes runs completely independently. Certes was built with the expectation that it would be part of a control loop. As such, a local or remote administrator (or control module) can subscribe to Certes to receive notification when response time thresholds are exceeded. Certes also periodically logs its modeling results to disk to provide a history of Web server performance that can be used for additional performance analysis.

Certes is mostly implemented as a user-space application that obtains kernel measurements at the end of each time interval and then uses the measurements to perform the modeling calculations in user space. This split between user and kernel space is by design, to reduce the number of changes introduced into the kernel. The time interval is the same one introduced in Section 2.2 and can be set to any value less than the initial TCP retry timeout value of 3 seconds. The kernel measurements required by Certes are the total number of accepted, dropped, and completed connections, and the total SYN-to-END time for all completed connections during an interval. We implemented these as global running counters within the kernel. These variables increase monotonically from the time at which the machine is booted, regardless of whether the Certes model is executing in user space. If the kernel already provided these four measurements, then Certes could be implemented without any kernel modifications. However, since RedHat 7.1 is not fully instrumented for Certes, minor modifications were made to the kernel, totaling less than 50 lines of code. To expose the ACCEPTED, COMPLETED, and DROPPED counters and the SYN-to-END measurement to user space, we wrote a simple kernel module that extended the /proc directory.
User-space programs can then obtain the kernel values simply by reading a file in the /proc directory. This is the de facto method in Linux for obtaining kernel values from user space. To provide further details on our kernel modifications, we first describe the steps by which the Linux kernel manages TCP connection establishment and termination, and then discuss the instrumentation code that maintains the ACCEPTED, COMPLETED, and DROPPED counters and the SYN-to-END measurement.

Figure 2.12 shows the structure of the TCP/IP connection establishment implementation in Linux. The three important data structures are the rx queue, the SYN queue, and the accept queue. The rx queue contains incoming packets that have just arrived. The SYN queue, which is actually implemented as a hash table, contains those connections which have yet to complete the TCP three-way handshake. The accept queue contains those connections which have completed the three-way handshake but have not yet been accepted by the Apache Web server application. The accept queue is often referred to as the listen queue, since the socket it is attached to is a listening socket.

Figure 2.12: TCP/IP connection establishment on Linux.

Figure 2.12 is numbered according to the following steps that occur during TCP connection establishment in an unmodified Linux kernel:

1. A SYN arrives and is timestamped (denoted ts in the figure).

2. The incoming SYN packet is placed onto the rx queue during the hardware interrupt. The rx queue does have a limit, but this limit can be changed, and packets rarely get dropped due to the rx queue being full.

3. During post-interrupt handling, the IP layer routes the incoming SYN to TCP.
If the SYN hash table is full, TCP drops the incoming SYN. Otherwise, TCP creates an open (connection) request and places it into the SYN hash table. Note that Linux does not save the timestamp of the initial SYN packet in the open request structure.

4. TCP responds to the incoming SYN immediately by sending a SYN/ACK to the client. If TCP cannot immediately send a SYN/ACK (i.e., the tx queue in Figure 2.13 is full), TCP drops the incoming SYN.

5. The client completes the TCP three-way handshake by sending an ACK to the server.

6. Once TCP receives the third part of the three-way handshake from the client, the open request is placed onto the accept queue for processing. At this point the new child socket is created and pointed to by the open request, and the connection is considered established.

7. The GET request may arrive at the server before the child connection is accepted by Apache, in which case the GET request is simply attached to the child socket as an inbound data packet.

8. Apache accepts the newly established child connection and proceeds to process the request. The speed at which Apache can process requests, along with the limit on the number of running Apache processes, affects the length of the accept queue.

Our kernel modifications with respect to steps 1 through 8 are as follows. We added an 8-byte timestamp field to both the open request structure and the socket structure so that the timestamp of the initial SYN could be saved across the lifetime of the connection. In step 3 the timestamp in the SYN is copied to the open request structure, and in step 6 it is copied from the open request structure to the child socket structure. The ACCEPTED counter is also incremented during step 6.
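The timestamp plumbing from step 3 to step 6 can be sketched as follows. The structure and function names here are illustrative stand-ins for the kernel's open request and socket structures; the actual change added an 8-byte timestamp field to each.

```c
#include <assert.h>

/* Simplified stand-in for the kernel's open request structure,
 * with the added timestamp field. */
struct open_request_s {
    unsigned long long syn_ts; /* arrival time of the initial SYN */
};

/* Simplified stand-in for the child socket structure. */
struct child_sock_s {
    unsigned long long syn_ts;
};

static unsigned long accepted_counter; /* the ACCEPTED counter */

/* Step 3: the SYN's arrival timestamp is saved in the open
 * request so it survives until the handshake completes. */
void on_syn(struct open_request_s *req, unsigned long long now)
{
    req->syn_ts = now;
}

/* Step 6: on the third ACK of the handshake, the timestamp is
 * copied from the open request to the child socket, and the
 * ACCEPTED counter is incremented. */
void on_handshake_complete(const struct open_request_s *req,
                           struct child_sock_s *sk)
{
    sk->syn_ts = req->syn_ts;
    accepted_counter++;
}
```

Carrying the timestamp across both structures is what later allows the SYN-to-END time to be measured from the very first SYN rather than from socket acceptance.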
For our DROPPED counter, we used the existing SNMP/TCP drop counter, but fixed several functions in the kernel that either failed to increment the counter when necessary or, due to incorrect logic, incremented it more than once for the same SYN drop.

Figure 2.13 shows the outbound processing that occurs when Apache is sending data to the client. Figure 2.13 is numbered according to the following steps that occur during TCP outbound data transmission in an unmodified Linux kernel:

Figure 2.13: TCP/IP outbound data transmission on Linux.

(9) Apache compiles a response to the GET request. This may include executing a program (such as a CGI script or Java program) or reading a file from disk.

(10) Once the response is composed, Apache makes a socket system call to send the data (e.g., writev()). If there is space available in the kernel for the response data, the data is copied into the kernel and writev() returns immediately, before the packet is transmitted. If not, writev() blocks the Apache process until space becomes available.

(11) Once the data is copied to kernel space, TCP immediately attempts to queue the data for transmission by placing it onto the tx queue. If the tx queue is full, TCP places the data on the socket outbound write queue; if that queue is also full, TCP blocks the Apache process.

(12) Placed onto the tx queue, the data waits to be transmitted onto the network.

(13) After the data is transmitted onto the network, the data packet is placed onto the free queue.

(14) Pending acknowledgment by the remote TCP, the data packet is freed.

Our kernel modifications centered on step 13.
As a data packet is placed onto the free queue, the current time is stored in the socket structure. Likewise, if the server application (e.g., Apache) closes the connection, or a TCP FIN or RST packet is received from the client, the current time is saved in another timestamp field in the socket structure. This is also the point at which the COMPLETED counter is incremented. In this manner we are able to identify when the server finished sending data to the client and when either the server or the client closed the connection. We choose the earlier of these two as the end of the transaction. Subtracting the timestamp obtained from the initial SYN gives the SYN-to-END time for the connection (which is then added to the running total). In other words, we define the end of the transaction to be whichever occurs first: the last data packet is sent from the server to the client, or the first TCP RST or FIN packet arrives or is transmitted.

In this section we provided some insights into the key kernel modifications we performed, all of which were relatively minor. Other modifications, not included in the above discussion, were too far removed from the purpose of this thesis to be discussed. Suffice it to say, a thorough investigation of the Linux kernel TCP/IP stack was undertaken to ensure that all code paths relevant to Certes were examined. Although we provided all of the above by directly modifying and rebuilding the kernel, it would be possible to provide identical support using a kernel module (though with greater implementation difficulty).

2.5 Experimental Results

To demonstrate the effectiveness of Certes, we implemented Certes on Linux and evaluated its performance in HTTP 1.0/1.1 environments, under constant and changing workloads.
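The end-of-transaction rule described above can be written down directly. The sketch below assumes opaque tick-count timestamps and uses a zero close timestamp to mean that no FIN/RST or application close was observed; those conventions are illustrative, not from the thesis.

```c
#include <assert.h>

/* The transaction ends at whichever comes first: the time the
 * last data packet was sent to the client, or the time the
 * connection was closed (first FIN/RST in either direction, or
 * an application-level close). close_ts == 0 means no close
 * event was observed. */
unsigned long long transaction_end(unsigned long long last_data_ts,
                                   unsigned long long close_ts)
{
    if (close_ts != 0 && close_ts < last_data_ts)
        return close_ts;
    return last_data_ts;
}

/* SYN-to-END time: transaction end minus the timestamp of the
 * initial SYN carried across the connection's lifetime. */
unsigned long long syn_to_end(unsigned long long syn_ts,
                              unsigned long long last_data_ts,
                              unsigned long long close_ts)
{
    return transaction_end(last_data_ts, close_ts) - syn_ts;
}
```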
The results presented here focus on evaluating the accuracy of Certes for determining client perceived response time in the presence of failed connection attempts. The accuracy of Certes is quantified by comparing its estimate of client perceived response time with client-side measurements obtained through detailed client instrumentation. Section 2.5.1 describes the experimental design and the test bed used for the experiments. Section 2.5.2 presents the client perceived response time measurements obtained for various HTTP 1.0/1.1 Web workloads. Section 2.6 demonstrates how a Web server control mechanism can use Certes to evaluate its own ability to manage client response time.

2.5.1 Experimental Design

The test bed consisted of six machines: a Web server, a WAN emulator, and four client machines (Figure 2.14). Each machine was an IBM Netfinity 4500R with dual 933 MHz CPUs, 10/100 Mbps Ethernet, 512 MB RAM, and a 9.1 GB SCSI HD. Both the server and clients ran RedHat Linux 7.1, while the WAN emulator ran FreeBSD 4.4. The client machines were connected to the Web server via two 10/100 Mbps Ethernet switches and a WAN emulator used as a router between the two switches. The client-side switch was a 3Com SuperStack II 3900 and the server-side switch was a Netgear FS508. The WAN emulator software used was DummyNet [136], a flexible and commonly used FreeBSD tool. The WAN emulator simulated network environments with different network latencies, ranging from 0.3 to 150 ms of round-trip time, as would be experienced in LAN and cross-country WAN environments, respectively. The WAN emulator also simulated networks with 0-3% packet loss, which is not uncommon over the Internet.

Figure 2.14: Experimental test bed.

The Web server machine ran the latest stable version of the Apache HTTP server, V1.3.20.
Apache was configured to run 255 daemons, and a variety of test Web pages and CGI scripts were stored on the Web server. The number of test pages was small, with page sizes of 1 KB, 5 KB, 10 KB, and 15 KB. The CGI scripts dynamically generated a set of pages of similar sizes. Certes also executed on the server machine, independently from Apache (as shown in Figure 2.11). The Certes implementation was designed to periodically obtain the counters and aggregate SYN-to-END time from the kernel, perform the modeling calculations in user space, and periodically log the modeling results to disk. For our experiments, the Certes implementation was configured to use 250 ms measurement intervals and a default frustration timeout of 21 seconds (except where noted).

The client machines ran an improved version of the WebStone 2.5 Web traffic generator [160]. Five improvements were made to the traffic generator. First, we removed all interprocess communication (IPC) and made each child process autonomous, to avoid any load associated with IPC. Second, we modified the WebStone log files to be smaller yet contain more information. Third, we extended the error handling mechanisms and modified how and when timestamps were taken, to obtain more accurate client-side measurements. Fourth, we implemented a client frustration timeout mechanism after discovering that the one provided in WebStone was only triggered during the select() function call and was not a true wall-clock frustration timeout mechanism. Fifth, we added an option to the traffic generator to produce a variable load on the server by switching between on and sleep states. The traffic generators were used on the four client machines to impose a variety of workloads on the Web server. The results for sixteen different workloads are presented, half of which were HTTP 1.0 and the other half HTTP 1.1.
While both HTTP 1.0 and HTTP 1.1 support persistent and non-persistent connections, we configured the traffic generators to run HTTP 1.0 over non-persistent connections and HTTP 1.1 over persistent connections. Recent studies indicate that non-persistent connections are still used far more frequently than persistent connections in practice [146]; the use of persistent connections increases the duration of each connection and reduces the number of connection attempts, thereby reducing the effect that SYN drops have on client response time. Measuring both HTTP 1.0 and 1.1 workloads thus provides a way to quantify the benefits of using Certes, for different versions of HTTP, over simpler SYN-to-END measurements. For the HTTP 1.1 workloads considered, the number of Web objects per connection ranged from 5 to 15, consistent with recent measurements of the number of objects (e.g., banners, icons) typically found in a Web page [146]. The characteristics of the sixteen workloads are summarized in Table 2.2. In an attempt to cover a broad range of conditions, we varied the workloads along the following dimensions:

1. static pages and dynamic content (Perl and C)

2. HTTP 1.0 and 1.1

3. 1 to 15 pages per connection

4. 0% to 3% network drop rate

5. 5 ms to 150 ms network delays

6. 1400 to 4800 clients (30 to 1670 conn/sec)

7. CPU and bandwidth bound

8. consistent and variable load

All of the sixteen workloads imposed a constant load on the server except for Test I and Test J, which imposed a highly varying load on the server. Each experimental workload was run for 20 minutes. For each workload, we measured at the server the steady-state number of connections per second and the mean SYN drop rate during successive one-second time intervals. These measurements provide an indication of the load imposed on the server.
Test  Page Types  Pages/Conn  Net Drop  HTTP  ping RTT (ms)  Conn/Sec   SYN Drop  Clients
                                              min/avg/max
A     static      1           0         1.0   1/8/21         1210-1670  11%-22%   2000
B     static+cgi  1           0         1.0   0.2/0.5/5      330-580    11%-33%   2000
C     static+cgi  1           0         1.0   141/152/165    320-675    0.5%-26%  2000
D     cgi         1           0         1.0   0.2/0.4/4      175-320    26%-44%   2000
E     static      15          0         1.1   4/11/17        80-150     45%-63%   2000
F     static      15          0         1.1   141/153/167    50-96      0.5%-36%  1400
G     static+cgi  5           0         1.1   0.2/0.7/6      97-173     42%-59%   2000
H     static+cgi  5           0         1.1   140/152/165    95-175     9%-37%    1600
I     static+cgi  5           0         1.1   120/133/147    30-185     0%-54%    2000
J     static      1           0         1.0   142/151/165    55-470     0%-78%    4800
K     static+cgi  5           3%        1.1   0.2/0.6/6      103-177    35%-56%   2000
L     static      1           3%        1.0   0.2/0.9/8      340-1310   0%-30%    2000
M     static+cgi  5           3%        1.1   140/151/161    50-115     14%-54%   1600
N     static      1           3%        1.0   144/151/164    145-400    0.5%-34%  2000
O     static+cgi  5           3%        1.1   140/150/161    57-147     8%-53%    1500
P     static+cgi  1           3%        1.0   140/151/161    180-400    5%-38%    1800

Table 2.2: Test configurations included HTTP 1.0/1.1, with static and dynamic pages.

2.5.2 Measurements and Results

Figure 2.15a compares the client-side, Certes, SYN-to-END, and Apache measured mean response times for each experiment. The values shown are the response times, calculated on a per-second interval, averaged over the 20-minute test period. Figure 2.15b shows the same results normalized with respect to the client-side measurements.

Figure 2.15: Certes accuracy and stability in various environments.
The results show that the SYN-to-END measurement consistently underestimated the client-side measured response time, with the error ranging from 5% to more than 80%. The Apache measurements of response time, which by definition will always be less than the SYN-to-END time, were extremely inaccurate, with an error of at least 80% in all test cases. In contrast, the Certes estimate was consistently very close to the client-side measured response time, with the error being less than 2.5% in all cases except Tests L, N, and P, where it was less than 7.4%. Figures 2.12 and 2.13 explain why the Apache-level measure of response time is so short compared to the mean client perceived response time: Apache measures neither the inbound kernel queuing that occurs nor the time it takes to perform the TCP three-way handshake, and on the outbound side, Apache marks the end of the transaction before the data is transmitted (i.e., as soon as writev() returns).

Figures 2.16a and 2.16b show the response time distributions for Test D using HTTP 1.0 and Test G using HTTP 1.1. These results show that Certes not only provides an accurate aggregate measure of client perceived response time, but also an accurate measure of the distribution of client perceived response times. Figure 2.16 again shows how erroneous the SYN-to-END measurements are as estimates of client perceived response time. Figures 2.17a and 2.17b show how the response time varies over time for Test A using HTTP 1.0 and Test G using HTTP 1.1. The figures show the mean response time at one-second time intervals as determined by each of the four measurement methods. The client-side measured response time increases at the beginning of each test run, then reaches a steady state during most of the run while the generated traffic is relatively constant.
At the end of the experiment the clients are terminated, the generated traffic drops off, and the response time drops to zero.

Figure 2.16: Certes response time distribution approximates that of the client for Tests D and G.

Figure 2.17: Certes online tracking of the client response time in Tests A and G.

Figure 2.17 shows that Certes can track in real time the variations in client perceived response time in both HTTP 1.0 and 1.1 environments. The figure also indicates that Certes is effective at tracking both smaller and larger scale response times, and that Certes is able to track client perceived response time over time in addition to providing the accurate long-term aggregate measures of mean response time shown in Figure 2.15. Again, Certes provides a far more accurate real-time measure of client perceived response time than SYN-to-END times or Apache. The large amount of overlap in the figures between the Certes and client-side response time measurements shows that the two are very close. In contrast, the SYN-to-END and Apache measurements have almost no overlap with the client-side measurements and are substantially lower.
To gain insight into Certes' sensitivity to the FTO, Test O and Test P were executed using false assumptions for the number of retries k. In these two cases the FTO was distributed across clients: 1/3 of the transactions were from clients configured with an FTO of 9 seconds (k = 1), 1/3 from clients configured with an FTO of 21 seconds (k = 2), and 1/3 from clients configured with an FTO of 45 seconds (k = 3); the online model used the incorrect assumption that all clients had an FTO of 21 seconds (k = 2). The results for Tests O and P show that the Certes response time measurements were still within 2% and 7.4%, respectively, of the client-side response time measurements. For Test O the resulting Certes estimate was off by only 108 ms, and for Test P the difference was 677 ms. As mentioned earlier, if the distribution of k were known (via historical measurements), that distribution could easily be incorporated into the model. Further study is needed to determine whether error bounds exist for Certes and under which specific conditions Certes is least accurate and why.

One of the key requirements for an online algorithm such as Certes is to be able to quickly observe rapid changes in client response time. Figure 2.18 shows how Certes is able to track the client response time as it rapidly changes over time. There is no significant lag in Certes' reaction to these changes. This is an important feature for any mechanism to be used in real-time control. As expected, the SYN-to-END measurement tracks the client perceived response time during the time intervals in which SYN drops do not occur.

Figure 2.18: Certes online tracking of the client response time in Test J, in on-off mode.
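The FTO values used in the mixture above (9, 21, and 45 seconds for k = 1, 2, 3) follow directly from TCP's SYN retransmission backoff: a 3-second initial timeout that doubles on each retry. A minimal sketch of that relationship:

```c
#include <assert.h>

/* Frustration timeout implied by giving up after k SYN retries,
 * with a 3-second initial TCP retransmission timeout that doubles
 * on each retry:
 *   FTO(k) = 3 + 6 + 12 + ... = 3 * (2^(k+1) - 1) seconds.
 * This reproduces the per-client FTO values used in Tests O and P:
 * k=1 -> 9 s, k=2 -> 21 s, k=3 -> 45 s. */
unsigned int fto_seconds(unsigned int k)
{
    return 3u * ((1u << (k + 1)) - 1u);
}
```

If the distribution of k across clients were known, the model's single-FTO assumption could be replaced by a weighted mixture of these values.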
During the intervals in which SYN drops occur, the SYN-to-END measurement reaches a maximum (about 6 seconds in Figure 2.18), which reflects the inaccuracy of the SYN-to-END time for those connections that are accepted when the accept queue is nearly full. We note for completeness that Figure 2.18 is zoomed in to show detail and does not contain information from the entire experiment. The erratic behavior at the end of the test run is indicative of the time-dependent nature of SYN dropping: the relatively few remaining clients experienced SYN drops prior to these last few intervals, which increases the overall mean client response time during a period when the load on the system is actually very light. The mean client response time during these intervals thus reflects heavy load in the recent past.

An important consideration in using an online measurement tool such as Certes is ensuring that the measurement overhead does not adversely affect the performance of the Web server. To determine the overhead of Certes, we re-executed Tests A, G, and H on the server without the Certes instrumentation and found the difference in throughput and client response time to be insignificant. This suggests that Certes imposes little or no overhead on the server.

2.6 Certes Applied in Admission Control

In this section we demonstrate how Certes can be combined with a Web server control mechanism to better manage client response time, and identify a common pitfall into which many admission control mechanisms fall. Web server control mechanisms often manipulate inbound kernel queue limits as a way to achieve response time goals [92, 94, 86, 125, 8, 42, 41]. Unfortunately, a serious pitfall can occur when post-TCP-connection measurements are used as an estimate of the client response time.
Using these measurements as the response time goal can lead the control mechanism to take actions that have the exact opposite effect on client perceived response time from that which is intended. Without a model such as Certes, the control mechanism will be unaware of this effect. To emulate the effects of a control mechanism at the Web server, we modified the server to dynamically change the Apache accept queue limit over time. Figure 2.19 shows the accept queue limit changing every 10 seconds between the values of 2^5 and 2^11. Figure 2.20 shows the effect this has on the client perceived response time. When the queue limit is small, such as near the 200th interval, the response time at the clients is high due to failed connection attempts, but the SYN-to-END time is small due to short queue lengths at the server. The pitfall occurs when the control mechanism decides to shorten the accept queue to reduce response time, causing SYN drops, which in turn increase the mean client response time. The control mechanism must be aware of the effect that SYN drops have on the client perceived response time and include this as an input when deciding on the proper queue limits.

Figure 2.19: Web server control manipulating the Apache accept queue limit.

Figure 2.20: Client response time increases as accept queue limit decreases.
2.7 Shortcomings of the (strictly) Queuing Theoretic Approach

As discussed in Chapter 1, the related work on modeling TCP that we cite assumes that the SYN drop rate remains consistent over time (and is based on network drop probabilities, not drop rates at the server). We show here that such a queuing theoretic approach leads to an error-prone result that is not nearly as accurate as the Certes model. Using an M/M/1 queuing system to represent the Web server, the steady-state expected client response time is:

CLIENT_RT = (t_s + t_q) · (1 − p^3) + 3p + 6p^2 + 12p^3 + ···   (2.33)

where t_s is the service time of the request (i.e., 1/µ), t_q is the time spent waiting in the queue, and p is the probability of dropping a connection request. The assumptions of this overly simplified model are that the offered load remains constant over time and that t_s remains constant regardless of the offered load. Nevertheless, Figure 2.21 is a plot of Equation 2.33 (with t_s + t_q = 0.010 seconds) showing the additional time added to the mean service time under the given SYN drop rate. For example, a drop rate of approximately 20% adds 1 second to the mean service time.

Figure 2.21: Effect of SYN drop rate on client response time, as modeled by an M/M/1 queuing system.

Substituting SYN-to-END for t_s + t_q in Equation 2.33, we can obtain the results that the M/M/1 model produces for Test J. Figure 2.22 shows Figure 2.18 overlaid with the M/M/1 results. The M/M/1 model fails to track the client perceived response time as effectively as Certes, due to its inability to track the dependencies between time intervals.
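Equation 2.33, truncated to the terms shown, is easy to evaluate directly; the sketch below does so. Note that at a 20% drop rate the truncated series already yields a penalty close to the 1 second cited above.

```c
#include <assert.h>
#include <math.h>

/* Equation 2.33, truncated to the terms shown in the text:
 * steady-state client response time under an M/M/1 view of the
 * server, where ts_tq is the service-plus-queueing time (seconds)
 * and p is the probability that a connection request (SYN) is
 * dropped. The 3/6/12 coefficients are the incremental delays of
 * TCP's SYN retransmission backoff. */
double mm1_client_rt(double ts_tq, double p)
{
    return ts_tq * (1.0 - p * p * p)
         + 3.0 * p + 6.0 * p * p + 12.0 * p * p * p;
}
```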
Note that this model still requires collecting the SYN-to-END time and the number of dropped, accepted, and completed connections for the current interval. Equation 2.34 shows a more accurate approximation, using the drop probabilities from prior intervals:

CLIENT_RT = mean(SYN-to-END_i)
            + 3 · DR(i − SYN-to-END − 3)
            + 6 · DR(i − SYN-to-END − 6) · DR(i − SYN-to-END − 9)
            + 12 · DR(i − SYN-to-END − 12) · DR(i − SYN-to-END − 18) · DR(i − SYN-to-END − 21)   (2.34)

This approach captures some, but not all, of the dependencies that exist between time intervals. The results from applying Equation 2.34 to Test J are shown in Figure 2.23. Note that to apply this approach, a sliding window of the number of dropped, accepted, and completed connections is required - exactly what Certes requires. Therefore, Certes gives a more accurate result using the same information at an equivalent computational cost.

Figure 2.22: Modeling as an M/M/1 queuing system fails to accurately track client perceived response time.

Figure 2.23: Using a sliding window of drop probabilities fails to capture all the dependencies between time intervals.

2.8 Convergence

The online implementation of Certes makes the assumption that during the first interval, all SYNs are initial SYNs. Certes will converge even if this assumption is not true, i.e., if Certes begins modeling during the ith interval. Figures 2.24 and 2.25 show the results of starting the Certes modeling halfway through the execution of a consistent-load and a variable-load experiment. Figure 2.24 is similar to Test A except that the accept queue limit was set to 512, and Figure 2.25 is a re-execution of Test J.
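The sliding-window approximation of Equation 2.34 can be sketched as follows. This is a simplified illustration: it assumes drop rates are kept per one-second interval in a flat history array, and it uses the retry-schedule offsets as reconstructed above (the thesis presents the equation, not code).

```c
#include <assert.h>

/* Sketch of Equation 2.34. dr[] holds the SYN drop rate of each
 * past one-second interval, i is the index of the current
 * interval, and s is the mean SYN-to-END time rounded to whole
 * seconds. The offsets 3/9/21 and 6, 12, 18 follow TCP's SYN
 * retry schedule (retries at 3, 9, and 21 seconds after the
 * initial SYN). The caller must ensure i - s - 21 >= 0. */
double eq234_client_rt(double mean_syn_to_end, const double *dr,
                       int i, int s)
{
    double rt = mean_syn_to_end;
    rt += 3.0 * dr[i - s - 3];
    rt += 6.0 * dr[i - s - 6] * dr[i - s - 9];
    rt += 12.0 * dr[i - s - 12] * dr[i - s - 18] * dr[i - s - 21];
    return rt;
}
```

Unlike Certes, this formula treats the window's drop rates as independent probabilities, which is exactly why it misses some of the inter-interval dependencies noted above.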
Figures 2.24 and 2.25 represent worst-case scenarios in the sense that none of the measurements for prior intervals are available when Certes begins modeling. In both cases, Certes converges after 21 seconds, which is the FTO.

Figure 2.24: Certes begins modeling at the 600th interval during a consistent load test.

Figure 2.25: Certes begins modeling at the 575th interval (in the middle of a peak) during a variable load test.

2.9 Summary

This chapter presented Certes, an online server-based mechanism that enables Web servers to measure client perceived response time. Certes is based on a model of TCP that quantifies the effect that SYN drops have on client perceived response time by using three simple server-side measurements. Certes does not suffer from any of the drawbacks associated with adding new hardware, modifying existing Web pages or HTTP servers, or relying on third-party sampling. Certes can also be used for the delivery of non-HTML objects such as PDF or PS files.

A key result of Certes is its robustness and accuracy. Certes was shown to provide accurate estimates in HTTP 1.0/1.1 environments, with both static and dynamically created pages, under constant and variable loads of differing scale. Certes can be applied over long periods of time and does not drift or diverge from the client perceived response time; any errors that may be introduced into the model do not accumulate over time. Certes is computationally inexpensive and can be used online at the Web server to provide information in real time.
Certes captures the subtle changes that can occur under constant load as well as the rapid changes that occur under bursty conditions. Certes can also determine the distribution of the client perceived response time, which is extremely important, since service-level objectives may not only specify mean response time targets but also indicate variability measures such as mode, maximum, standard deviation, and variance.

Certes can be readily applied in a number of contexts. Certes is particularly useful to Web servers that manage QoS by performing admission control. Certes allows such servers to quantify the effect that admission control drops have on client perceived response time, as well as to avoid the pitfalls associated with using application-level or kernel-level SYN-to-END measurements of response time. Certes is accurate under heavy server load, exactly when admission control or scheduling algorithms must make critical decisions. Algorithms that manage resource allocation, reservations, or congestion [17] can also benefit from the short-term forecasting [45] of connection retries modeled by Certes. Our work with Certes led us to develop ksniffer, a system for determining the per-pageview client perceived response time, which is presented in the next chapter.

Chapter 3

Modeling Client Perceived Response Time

The Certes research did what no other published work had done to date: it exposed a significant flaw in existing admission control techniques that ignore the latency effects of SYN drops. We now step beyond the Certes work by developing an approach that is capable of determining the response time, as perceived by the remote client, on a per-pageview basis.
We introduce ksniffer, a kernel-based traffic monitor capable of determining pageview response times as perceived by remote clients, in real time at gigabit traffic rates. ksniffer is based on novel, online mechanisms that take a “look once, then drop” approach to packet analysis to reconstruct TCP connections and learn client pageview activity. These mechanisms are designed to operate accurately with live network traffic even in the presence of packet loss and delay, and can be efficiently implemented in kernel space. This enables ksniffer to perform analysis that exceeds the functionality of current traffic analyzers while doing so at high bandwidth rates. ksniffer only needs to passively monitor network traffic and can be integrated with systems that perform server management to achieve specified response time goals. Our experimental results demonstrate that ksniffer can run on an inexpensive, commodity, Linux-based PC and provide online pageview response time measurements, across a wide range of operating conditions, that are within five percent of the response times measured at the client by detailed instrumentation. As described in Chapter 1 and shown in Figure 3.1, a pageview consists of a container page and zero or more embedded objects, which are usually obtained over multiple TCP connections. This introduces a set of problems not present when simply measuring per-URL server response time. In addition to the SYN drop latencies covered by Certes, we are now faced with the problem of correlating multiple HTTP requests over multiple TCP connections into a consistent model for each pageview. SYN drop latencies on any connection, ambiguities in object containment, TCP retransmission latencies, HTTP protocol parsing, missing information, and network packet loss are some of the problems that need to be addressed.
ksniffer mechanisms take a “look once, then drop” approach to packet analysis, use simple hashing data structures to match Web objects to pageviews, and can be efficiently implemented in kernel space. ksniffer uses the Certes model of TCP retransmission and exponential backoff, which accounts for latency due to connection setup overhead and network packet loss. It combines this model with higher-level online mechanisms that use access history and HTTP Referer information, when available, to learn relationships among Web objects, correlate connections and Web objects, and determine pageview response times. Furthermore, ksniffer only looks at TCP/IP and HTTP protocol header information and does not need to parse any HTTP data payload. This enables ksniffer to perform higher-level Web pageview analysis online, in the presence of high data rates; it can monitor traffic at gigabit line speeds while running on an inexpensive, commodity PC. These mechanisms enable ksniffer to provide accurate results across a wide range of operating conditions, including high load, connection drops, and packet loss. In these cases, obtaining accurate performance measures is most crucial, because Web server and network resources may be overloaded.

Figure 3.1: Downloading a container page and embedded objects over multiple TCP connections. (Packet timeline starting at t0: a SYN J / SYN K, ack J+1 / ack K+1 handshake followed by GET index.html and GET obj3.gif on one connection, and a second connection, established with SYN M / SYN N, ack M+1 / ack N+1, fetching obj1.gif, obj8.gif, obj2.gif, and obj4.gif, ending at time te.)

Current approaches to measuring response time include active probing from geographically distributed monitors, instrumenting HTML Web pages with JavaScript, offline analysis of packet traces, and instrumenting Web servers to measure application-level performance or per-connection performance.
These all fall short, in one area or another, in terms of accuracy, cost, scalability, usefulness of information collected, and real-time availability of measurements. ksniffer provides several significant advantages over these approaches. First, ksniffer does not require any modifications to Web pages, Web servers, or browsers, making deployment easier and faster. This is particularly important for Web hosting companies that are responsible for maintaining the infrastructure surrounding a Web site but are often not permitted to modify the customer’s server machines or content. Second, ksniffer captures network characteristics such as packet loss and delay, aiding in distinguishing network problems from server problems. Third, ksniffer measures the behavior of every session for every real client who visits the Web site. Therefore, it does not fall prey to the biases that arise when sampling from a select, predefined set of remote monitoring machines that have better connectivity, and use different Web browser software, than the actual users of the Web site. Fourth, ksniffer can obtain metrics for any Web content, not just HTML. Fifth, ksniffer performs online analysis of high bandwidth, live packet traffic instead of offline analysis of traces stored on disk, bypassing the need to manage large amounts of disk storage for packet traces. More importantly, ksniffer can provide performance measurements to Web servers in real time, enabling them to respond immediately to performance problems through diagnosis and resource management. The rest of this chapter is outlined as follows. Section 3.1 presents an overview of the ksniffer architecture. Section 3.2 describes the ksniffer algorithms for reconstructing TCP connections and pageview activities. Section 3.3 discusses how ksniffer handles less ideal operating conditions, such as packet loss and server overload.
Sections 3.4 and 3.5 discuss some issues related to result aggregation. Section 3.6 presents experimental results quantifying the accuracy and scalability of ksniffer under various operating conditions. We measure the accuracy of ksniffer against measurements obtained at the client and compare the scalability of ksniffer against user-space packet analysis systems. Finally, we present some concluding remarks.

Figure 3.2: Multi-tiered server farm with a ksniffer monitor. (Clients with browsers connect over the Internet to HTTP servers, application servers, and database servers, with ksniffer monitoring the traffic in front of the server complex.)

3.1 ksniffer Architecture

ksniffer was motivated by the desire to have a fast, scalable, flexible, yet inexpensive traffic monitor that can be used in production environments for observing latencies in server farms. As such, performance is one of our key design issues: online analysis of high bandwidth links, where information is available in an on-demand manner. Cost of deployment is another major consideration. Figure 3.2 depicts how ksniffer is deployed within a server farm. ksniffer receives packet streams from the mirrored port of the switch that is situated in front of the server complex. ksniffer, in turn, provides information back to the server complex, which can be used for managing the client perceived response time. This overall approach to gathering traffic from a multi-tiered server farm is used in existing systems [109]. What distinguishes ksniffer from other high speed systems is that ksniffer achieves high performance on commodity PCs by splitting its functionality between the kernel and user space. Figure 3.3 (right) illustrates the architecture of ksniffer. ksniffer attaches itself directly to the device independent layer within the Linux kernel. To the kernel, ksniffer appears as just another network protocol layer within the stack, treated similarly to the IP code.
ksniffer does not produce any log files, but it can read configuration parameters and write debugging information to disk from kernel space.

Figure 3.3: Typical libpcap based sniffer architecture (left) vs. the ksniffer architecture (right). (The libpcap monitor copies the packet stream from the device driver through a raw socket into a user space monitor, with trace logs written for local, offline, or remote analysis; ksniffer instead performs filtering and protocol analysis in kernel modules above the device independent layer, reading config files and sharing reports with user space via an mmap'ed region.)

ksniffer is similar to other computing systems in that a portion of the functionality is contained in the kernel, as dynamically loadable modules, and the remainder is contained in user space. Lower level features requiring high performance are implemented in the operating system, and exploit in-kernel performance advantages such as zero-copy buffer management, the elimination of system calls, and reduced context switching. This functionality includes TCP connection tracking, HTTP session monitoring, IP address longest prefix matching, file extension matching, longest URI path matching, multiple attribute range matching, and a hot list module. Some of this code was developed by duplicating and then modifying code from the Linux kernel, as well as from an in-kernel HTTP server [78, 83]. Certain components, such as HTTP session tracking, could be placed in either kernel or user space. In these cases, for performance reasons, we opted for the faster kernel implementation. In this way, we reduce the amount of information that is transferred between kernel and user space, and reduce the number of user space processes and their attendant context switching costs. In addition, the finer-grained timing capabilities afforded in the kernel allow for more accurate timekeeping than in user space.
Higher level analysis features, such as clustering and rule generation, are implemented in user space. This functionality is less performance-critical, and benefits from available support such as floating point arithmetic and a large number of specialty libraries. Application level information developed by ksniffer is made available to user space via an mmap’ed shared memory region. This lets communication occur between the user and kernel components, in a producer/consumer relationship, without the overhead of buffer copies or system calls. Making the decision to place a portion of ksniffer in the kernel has both advantages and drawbacks. Other research has demonstrated the scalability of this approach in the context of HTTP servers: the world’s fastest Web servers, AFPA [83, 78] and Tux [132], execute in kernel space. On the other hand, kernel module development requires more expertise than user space development. This issue is ameliorated somewhat by the availability in recent years of virtualization tools that allow kernel development in user space, such as VMWare [158] and User-Mode Linux [52]. Similarly, kernel programming is inherently less safe than user-space programming, since programming errors can result in crashes that halt the machine and disable other services that might be running. Since ksniffer is designed to be used as a dedicated monitoring appliance, however, this is less of a concern. We show that the performance gains of developing in the kernel outweigh the cost of developing for special purpose hardware. Nprobe [75] avoids copying overheads by mapping a section of kernel memory into user space and placing packets into the shared memory area. ksniffer also uses shared memory, but instead of transferring packets between kernel and user space, it transfers application level events. This allows ksniffer to access kernel structures directly, such as the skb, which contains pointers into the raw data packet.
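The producer/consumer exchange over the mmap'ed region can be sketched as a single-producer, single-consumer ring buffer of fixed-size event records. ksniffer itself implements this in kernel C; the record layout, index placement, and field names below are illustrative assumptions, not ksniffer's actual format.

```python
import mmap
import struct

REC = struct.Struct("<IIQ")  # event record: flow id, event type, timestamp
NSLOTS = 8                   # ring capacity (NSLOTS - 1 slots usable)
HDR = 8                      # two 4-byte indices at the front: head, tail
region = mmap.mmap(-1, HDR + NSLOTS * REC.size)  # shared, zero-initialized

def _idx(off):
    return struct.unpack_from("<I", region, off)[0]

def _set_idx(off, val):
    struct.pack_into("<I", region, off, val)

def produce(event):
    """Producer (kernel side): publish one event without buffer copies."""
    head, tail = _idx(0), _idx(4)
    if (head + 1) % NSLOTS == tail:
        return False                  # ring full; producer drops or backs off
    REC.pack_into(region, HDR + head * REC.size, *event)
    _set_idx(0, (head + 1) % NSLOTS)  # advance head only after the write
    return True

def consume():
    """Consumer (user side): drain one event, or None if the ring is empty."""
    head, tail = _idx(0), _idx(4)
    if head == tail:
        return None
    event = REC.unpack_from(region, HDR + tail * REC.size)
    _set_idx(4, (tail + 1) % NSLOTS)
    return event
```

Because only the producer writes the head index and only the consumer writes the tail index, neither side needs a system call or lock on the fast path, which is the property the text attributes to the mmap'ed channel.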
This avoids the dual header parsing that would occur with Nprobe: 1) the kernel creates an skb and parses the headers in the packet; 2) Nprobe passes the raw packet up to user space, where it is re-parsed and analyzed. Gigascope [48] is a traffic monitor designed for high-speed IP backbone links and requires a programmable network interface adapter. Where ksniffer splits its functionality between user and kernel space, Gigascope splits its functionality between the NIC and user space. The key difference between the two is that ksniffer provides higher level functions within its kernel modules (such as TCP connection tracking) whereas Gigascope modifies the NIC firmware to perform packet filtering. They report being able to count packets on port 80 whose payload matches the regular expression ^[^\n]*HTTP/1.* at 610 Mbps when their code is executing directly on the NIC. Executing their NIC packet filtering code on the host CPU instead of on the NIC, they fall to 480 Mbps, at which point they experience interrupt livelock. This was using a 733 MHz processor with 2 GB of RAM. ksniffer has already demonstrated more functionality at similar bandwidth rates, using a machine with less memory. In addition, the cost of a Gigascope deployment far exceeds that of ksniffer.

3.2 ksniffer Pageview Response Time

To determine the client perceived response time for a Web page, ksniffer measures the time from when the client sends the packet corresponding to the start of the transaction until the client receives the packet corresponding to the end of the transaction. How a packet may indicate the start or end of a transaction depends upon several factors.

Figure 3.4: Objects used by ksniffer for tracking. (Each client, identified by IP address, holds a pageview hash table, a FIFO pageview queue, a loners queue, and a round trip time (RTT). Each flow, identified by its 4-tuple, records start and end times, the number of requests, a FIFO request queue, a FIFO finish queue, and the requesting client. Each Web object records its URL, start and end times, server reply state, and referring pageview. Each pageview records its URL, start and end times, request count, container pattern, timeout, requesting client, container Web object, and an embedded object hash table.)

To show how this is done, we first briefly describe some basic entities tracked by ksniffer, then describe how ksniffer determines response time based on an anatomical view of the client-server behavior that occurs when a Web page is downloaded. ksniffer keeps track of four entities to maintain the information it needs to measure response time: clients, pageviews, HTTP objects, and TCP connections. ksniffer tracks each of these entities using the corresponding data objects shown in Figure 3.4. Clients are uniquely identified by their IP address. A pageview consists of a container page and a set of embedded HTTP objects. For example, a typical Web page consists of an HTML file as the container page and a set of embedded images as the embedded HTTP objects. Pageviews are identified by the URL of the associated container page, and Web objects are identified by their URL. A flow represents a TCP connection, and is uniquely identified by the four-tuple consisting of source and destination IP addresses and port numbers. It is the associations between instances of these objects that enable ksniffer to reconstruct the activity at the Web site. To efficiently manage these associations, ksniffer maintains sets of hash tables to perform fast lookup and correlation between the four types of objects. Separate hash tables are used for finding clients and flows, indexed by hash functions on the IP address and four-tuple, respectively.
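A minimal sketch of these lookup structures follows. The dictionary layout and field names are illustrative assumptions; ksniffer itself uses in-kernel hash tables.

```python
# Illustrative tracking structures: flows are keyed by the TCP 4-tuple,
# clients by source IP, and each flow links back to its client object.
clients = {}   # client IP address -> client object
flows = {}     # (src IP, src port, dst IP, dst port) -> flow object

def lookup_flow(four_tuple):
    """Find or create the flow for a 4-tuple, linking it to the
    (possibly newly created) client object for the source IP."""
    flow = flows.get(four_tuple)
    if flow is None:
        client = clients.setdefault(four_tuple[0],
                                    {"ip": four_tuple[0], "rtt": None,
                                     "pageviews": {}})
        flow = {"tuple": four_tuple, "start": None, "end": None,
                "requests": [], "finished": [], "client": client}
        flows[four_tuple] = flow
    return flow
```

Two connections from the same source IP thus share one client object, which is what lets ksniffer correlate pageview activity across a client's flows.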
Each client object contains a pageview hash table indexed by a hash function over the container page URL. Flows contain a FIFO request queue of Web objects that have been requested but not completed, and a FIFO finish queue of Web objects that have been completed. Suppose a remote client, Cj, requests a Web page. We decompose the resulting client-server behavior into four parts: TCP connection setup, HTTP request, HTTP response, and embedded object processing. We use the following notation in our discussion. Let Cj be the jth remote client and Fij be the ith TCP connection associated with remote client Cj. Let pvij be the ith pageview associated with remote client Cj, and wkj,i be the kth Web object requested on Fij. Let ti be the ith moment in time, d represent an insignificant amount of processing time, either at the client or the server, p represent the Web server processing time of an HTTP request, and RTT be the round trip time between the client and the server.

3.2.1 TCP Connection Setup

If the client, Cj, is not currently connected to the Web server, the pageview transaction begins with making a connection. Connection establishment is performed using the well known TCP three-way handshake, as shown in Figure 3.5. The start of the pageview transaction corresponds to the SYN J packet transmitted by the client at time t0. However, ksniffer is located on the server side of the network, where a dotted line is used in Figure 3.5 to represent the point at which ksniffer captures the packet stream. ksniffer does not capture SYN J until time t0 + .5RTT, after the packet takes 1/2 RTT to traverse the network. This assumes ksniffer and the Web server are located close enough together that they see packets at essentially the same time.

Figure 3.5: HTTP request/reply. (Timeline of the TCP three-way handshake on F1j followed by GET index.html and the HTTP response, annotated with times from t0 through tk + .5RTT; a dotted line marks the point at which ksniffer captures the packet stream.)

If this is the first connection from Cj, ksniffer will create a flow object F1j and insert it in the flow hash table. At this moment, ksniffer does not know the value of RTT, since only the SYN J packet has been captured, so it cannot immediately determine time t0. Instead, it sets the start time for F1j equal to t0 + .5RTT. ksniffer then waits for further activity on the connection. At t0 + 1.5RTT + 2d, ksniffer and the Web server receive the ACK K+1 packet, establishing the TCP connection between client and server. ksniffer can now determine the RTT as the difference between the SYN-ACK from the server (the SYN K, ACK J+1 packet) and the resulting ACK from the client during connection establishment (the ACK K+1 packet). ksniffer then updates F1j's start time by subtracting 1/2 RTT from its value to obtain t0. At time t0 + 1.5RTT + 2d, for the first connection from Cj, ksniffer creates a client object Cj, saves the RTT value, and inserts the object into the client hash table. For each subsequent connection from Cj, a new flow object Fij will be created and linked to the existing client object, Cj. The RTT for each new flow will be computed, and Cj's RTT will be updated based on an exponentially weighted moving average of the RTTs of its flows, in the same manner as TCP [126]. The updated RTT is then used to determine the actual start time for each flow.

3.2.2 HTTP Request

Once connected to the server, the remote client transmits an HTTP request for the container page and waits for the response.
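The RTT bookkeeping above can be sketched as follows. The EWMA gain of 1/8 mirrors TCP's smoothed RTT estimator and is an assumption here; the text only states that the average is computed in the same manner as TCP.

```python
ALPHA = 0.125  # EWMA gain; TCP's SRTT uses 1/8 (assumed, not stated above)

def handshake_rtt(t_synack, t_ack):
    """Server-side RTT sample: the gap between capturing the SYN-ACK
    and capturing the client's completing ACK."""
    return t_ack - t_synack

def update_client_rtt(client, sample):
    """Fold a new per-flow RTT sample into the client's smoothed RTT."""
    if client["rtt"] is None:
        client["rtt"] = sample
    else:
        client["rtt"] = (1 - ALPHA) * client["rtt"] + ALPHA * sample
    return client["rtt"]

def flow_start_time(t_syn_captured, rtt):
    """Estimate t0, when the client actually sent its SYN, by backing
    the server-side capture time off by half an RTT."""
    return t_syn_captured - 0.5 * rtt
```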
If this is not the first request over the connection, then this HTTP request indicates the beginning of the pageview transaction. Figure 3.5 depicts the first request over a connection. At time ti, the client transmits the HTTP GET request onto the network, and after taking 1/2 RTT to traverse the network, the server receives the request at ti + .5RTT. ksniffer captures and parses the packet containing the HTTP GET request, splitting the request into its constituent components and identifying the URL requested. Since this is the first HTTP request over connection F1j, it incurs the connection setup overhead. In this case, a Web object, w1j,1, is created to represent the request, and the start time for w1j,1 is set to the start time of F1j. In this manner, the connection setup time is attributed to the first HTTP request on each flow. w1j,1 is then inserted into F1j's request queue and F1j's number-of-requests field is set to one. If this were not the first HTTP request over connection F1j, but instead the kth request on F1j, a Web object wkj,1 would be created and its start time would be set equal to ti. Next, ksniffer creates pv1j, the pageview object that will track the pageview, and inserts it into Cj's pageview hash table. We assume for the moment that w1j,1 is a container page; embedded objects are discussed in Section 3.2.5. ksniffer sets pv1j's start time equal to w1j,1's start time, and sets w1j,1 as the container Web object for pv1j. At this point in time, ksniffer has properly determined which pageview is being downloaded, and the correct start time of the transaction.

3.2.3 HTTP Response

After the Web server receives the HTTP request and takes p amount of time to process it, the server sends a reply back to the client. ksniffer captures the value of p, the server response time, which is often mistakenly cited as the client perceived response time.
As we demonstrated in Chapter 2, server response time can underestimate the client perceived response time by more than an order of magnitude. The first response packet contains the HTTP response header, along with the initial portion of the Web object being retrieved. ksniffer looks at the response headers but never parses the actual Web content returned by the server; HTML parsing would entail too much overhead to be used in an online, high bandwidth environment (in Chapter 4 we discuss packet manipulation techniques which do not require an HTML language parser). ksniffer obtains F1j from the flow hash table and determines that the first Web object in F1j's request queue is w1j,1, which was placed onto the queue when the request was captured. An HTTP response header does not specify the URL to which the response corresponds. Instead, HTTP protocol semantics dictate that, for a given connection, HTTP requests be serviced in the order they are received by the Web server. As a result, F1j's FIFO request queue enables ksniffer to match each response over a flow with the correct request object. ksniffer updates w1j,1's server reply state based on information contained in the response header. In particular, ksniffer uses the Content-length: and Transfer-Encoding: fields, if present, to determine the sequence number of the last byte of data transmitted by the server for this request. ksniffer captures each subsequent packet to identify the time of the end of the response. This is usually done by identifying the packet containing the sequence number for the last byte of the response. When the response is chunked [61], sequence number matching cannot be used. Instead, ksniffer follows the chunk chain within the response body across multiple packets to determine the packet containing the last byte of the response.
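For a response that carries a Content-length: header, the end-of-response sequence number can be computed directly. The helper below is a hypothetical illustration of that arithmetic (chunked responses instead require walking the chunk chain, which is not shown).

```python
def last_byte_seq(first_seq, header_len, content_length):
    """TCP sequence number of the final data byte of an HTTP response
    whose body length comes from a Content-length: header (hypothetical
    helper). first_seq is the sequence number of the first byte of the
    response header; the body follows header_len header bytes."""
    return first_seq + header_len + content_length - 1
```

Once this sequence number is known, the packet whose payload covers it marks the (server-side) end of the response; ksniffer adds .5RTT to its capture time to project when the client receives it.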
For CGI responses over HTTP 1.0 that do not specify the Content-length: field, the server closes the connection to indicate the end of the response. In this case, ksniffer simply keeps track of the time of the last data packet before the connection is closed. ksniffer sets w1j,1's end time to the arrival time of each response packet, plus 1/2 RTT to account for the transit time of the packet from server to client. ksniffer also sets pv1j's end time to w1j,1's end time. The end time will monotonically increase until the server reply has been completed, at which point the (projected) end time will be equal to tk + .5RTT, as shown in Figure 3.5. When ksniffer captures the last byte of the response at time tk, w1j,1 is moved from F1j's request queue to F1j's finish queue, where it remains until either F1j is closed or ksniffer determines that all segment retransmissions (if any) have been accounted for, which is discussed in Section 3.3. Most Web browsers in use today serialize multiple HTTP requests over a connection, such that the next HTTP request is not sent until the response for the previous request has been fully received. For these clients, there is no need for each flow object to maintain a queue of requests, since there will only be one outstanding request at any given time. The purpose of ksniffer's request queue mechanism is to support HTTP pipelining, which has been adopted by a small but potentially growing number of Web browsers. Under HTTP pipelining, a browser can send multiple HTTP requests at once, without waiting for the server to reply to each individual request. ksniffer's request queues provide support for HTTP pipelining by conforming to RFC 2616 [61], which states that a server must send its responses to a set of pipelined requests in the same order that the requests are received on that connection.
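Because responses follow request order on a connection, a per-flow FIFO queue suffices to pair each response with its request, even under pipelining. A minimal sketch (field names are illustrative):

```python
from collections import deque

class Flow:
    """Per-connection request/response matching (a sketch). RFC 2616
    requires responses to be sent in request order, so the head of a
    FIFO queue always identifies the request that the current response
    answers, even when requests are pipelined."""
    def __init__(self):
        self.requests = deque()  # requested but not yet completed
        self.finished = []       # completed Web objects

    def on_request(self, url, t_start):
        self.requests.append({"url": url, "start": t_start, "end": None})

    def on_response_done(self, t_end):
        obj = self.requests.popleft()  # FIFO head matches this response
        obj["end"] = t_end
        self.finished.append(obj)
        return obj
```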
Since TCP is a reliable transport mechanism, requests that are pipelined from the client in a certain order are always received by the server in the same order. Any packet reordering that may occur in the network is handled by TCP at the server. ksniffer provides similar mechanisms to handle packet reordering, so that HTTP requests are placed in F1j's request queue in the correct sequence. This entails properly handling a packet that contains multiple HTTP requests, as well as an HTTP request that spans packet boundaries. At this point in time, ksniffer has properly determined tk + .5RTT, the time at which the packet containing the last byte of data for w1j,1 was received by client Cj. If the Web page has no embedded objects, then this marks the end of the pageview transaction. For example, if w1j,1 corresponds to a PDF file instead of an HTML file, ksniffer can determine that the transaction has completed, since a PDF file cannot have embedded objects. If w1j,1 can potentially embed one or more Web objects, ksniffer cannot assume that pv1j has completed. Unfortunately, at time tk + .5RTT, ksniffer cannot yet determine whether requests for embedded objects are forthcoming. In particular, ksniffer does not parse the HTML within the container page to identify which embedded objects may be requested by the browser. Such processing is too computationally expensive for an online, high bandwidth system, and often does not even provide the necessary information. For example, a JavaScript within the container page could download an arbitrary object that could only be detected by executing the JavaScript, not just parsing the HTML. Furthermore, HTML parsing would not indicate which embedded objects will be directly downloaded from the server, since some may be obtained via caches or proxies. ksniffer instead takes a simpler approach based on waiting and observing what further HTTP requests are sent by the client, then using HTTP request header information to dynamically learn which container pages embed which objects.

3.2.4 Online Embedded Pattern Learning

ksniffer learns which container pages embed which objects by tracking the Referer: field in HTTP request headers. The Referer: field contained in subsequent requests is used to group embedded objects with their associated container page. Since the Referer: field is not always present, ksniffer develops patterns from those it does collect to infer embedded object relationships when requests are captured that do not contain a Referer: field. This technique is faster than parsing HTML, executing JavaScript, or walking the Web site with a Web crawler. In addition, it allows ksniffer to react to changes in container page composition as they are reflected in the actual client transactions. ksniffer creates referer patterns on the fly. For each HTTP request that is captured, ksniffer parses the HTTP header and determines whether the Referer: field is present. If so, this relationship is saved in a pattern for the container object. For example, when monitoring www.ibm.com, if a GET request for obj1.gif is captured, and the Referer: field is found to contain “http://www.ibm.com/index.html”, ksniffer adds obj1.gif as an embedded object within the pattern for index.html. If a Referer: field is captured that specifies a host not being monitored by ksniffer, such as “http://www.xyz.com/buy.html”, it is ignored. ksniffer uses file extensions as a heuristic when building patterns. Web objects with an extension such as .ps and .pdf cannot contain embedded objects, nor can they be embedded within a page. As such, patterns are not created for them, nor are they associated with a container page. Web objects with an extension such as .gif or .jpg are usually associated with a container page, but cannot themselves embed other objects.
Web objects with an extension such as .html or .htm can embed other objects or be embedded themselves. Each individual .html object has its own unique pattern, but currently an .html object is never a member of another object's pattern. This prevents cycles within the pattern structures, but results in ksniffer treating frames of .html pages as separate pageviews. Taking this approach means that ksniffer does not need to be explicitly told which Web pages embed which objects; it learns this on its own. Patterns are persistently kept in memory using a hash table indexed by the container page URL. Each pvij and container wkj,i is linked to the pattern for the Web object it represents, allowing ksniffer to efficiently query the patterns associated with the set of active pageview transactions. Since Web pages can change over time, patterns get dynamically updated, based on the client activity seen at the Web site. Therefore, a particular embedded object, obj1.jpg, may not belong to the pattern for container index.html at time ti, and yet belong to the pattern at time ti±k. Likewise, a pattern may not exist for buy.html at time ti, but then be created at a later time ti+k, when a request is captured. Of course, the same embedded object, obj1.jpg, may appear in multiple patterns, index.html and buy.html, at the same time or at different times. Since patterns are only created from client transactions, the set of patterns managed by ksniffer may be a subset of all the container pages on the Web site. This can save memory: ksniffer maintains patterns for container pages that are being downloaded, but not for those container pages on the Web site which do not get requested. Only the Referer: field is used to manipulate patterns, and the embedded objects within a pattern are unordered. ksniffer places a configurable upper bound of 100 embedded objects within a pattern so as to limit storage requirements.
When the limit is reached, an LRU algorithm is used for replacement, removing the embedded object that has not been linked to the container page in an HTTP request for the longest amount of time. Each pattern typically contains a superset of those objects which the container page actually embeds. As the pattern changes, new embedded objects get added to the pattern, but old embedded objects only get removed from the pattern when the limit is reached. This is perfectly acceptable, since ksniffer does not use patterns in a strict sense to determine, absolutely, whether or not a container page embeds a particular object. Most Web browsers, including Internet Explorer and Mozilla, provide Referer: fields, but some do not, and privacy proxies may remove them. To see what percentage of embedded objects have Referer: fields in practice, we analyzed the access log files of a popular musician resource Web site that has over 800,000 monthly visitors. The access logs covered a 15 month period from January 2003 until March 2004. 87% of HTTP requests had a Referer: field, indicating that a substantial portion of embedded objects may have Referer: fields in practice. ksniffer is specifically designed for monitoring high speed links that transmit a large number of transactions per second. In the domain of pattern generation, this is an advantage: the probability that at least one HTTP request with the Referer: field set for a particular container page will arrive within a given time interval is extremely high.

3.2.5 Embedded Object Processing

If a container page references embedded objects, the end of the transaction will be indicated by the packet containing the sequence number of the last byte of data for the last object to complete transmission.
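The pattern learning mechanism of Section 3.2.4 can be sketched as follows. This is a simplified illustration: the check that the Referer: host is one being monitored is omitted, the extension heuristic is reduced to a few hard-coded extensions, and the data layout is an assumption.

```python
from collections import OrderedDict

MAX_EMBEDDED = 100  # configurable cap on embedded objects per pattern
patterns = {}       # container page URL -> LRU-ordered embedded objects

def learn(request_url, referer_url):
    """Update referer patterns from one captured HTTP request."""
    if referer_url is None:
        return                        # no Referer: field on this request
    ext = "." + request_url.rsplit(".", 1)[-1].lower()
    if ext in (".ps", ".pdf"):
        return                        # can neither embed nor be embedded
    if ext in (".html", ".htm"):
        return                        # an .html object is never a member
                                      # of another object's pattern
    pat = patterns.setdefault(referer_url, OrderedDict())
    pat.pop(request_url, None)        # refresh this object's LRU position
    pat[request_url] = True
    if len(pat) > MAX_EMBEDDED:
        pat.popitem(last=False)       # evict the least recently seen
```

An `OrderedDict` doubles as the unordered membership set and the LRU queue: re-inserting an object moves it to the most-recently-seen end, so `popitem(last=False)` removes the object that has gone longest without being linked to the container page.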
To identify this packet, ksniffer determines which embedded object requests are related to each container page using the Referer: field of HTTP requests, file extension information, and the referer patterns discussed in Section 3.2.4. In our example, suppose index.html contains references to five embedded images: obj1.gif, obj2.gif, obj3.gif, obj4.gif, and obj8.gif. The embedded objects will be identified and processed as shown in Figure 3.6 (ignoring for the moment F_3^j).

[Figure 3.6: Downloading multiple container pages and embedded objects over multiple connections.]

At time t_k + .5RTT, the browser parses the HTML document and identifies any embedded objects. If embedded objects are referenced within the HTML, the browser opens an additional connection, F_2^j, to the server so that multiple HTTP requests for the embedded objects can be serviced in parallel, reducing the overall latency of the transaction. The packet containing the sequence number of the last byte of the last embedded object to be fully transmitted indicates the end of the pageview transaction, t_e. The start and end times for embedded object requests are determined in the same manner as described in Sections 3.2.2 and 3.2.3. Each embedded object that is requested is tracked in the same manner as the container page, index.html. For example, when the second connection is initiated, ksniffer creates a flow object F_2^j to track the connection and associates it with C_j. When the request for obj1.gif on F_2^j is captured at time t_q, a w_1^(j,2) object is created to track the request and is placed onto F_2^j's request queue. Determining the pageview response time, calculated as t_e - t_0, requires correlating embedded objects to their proper container pages, which involves tackling a set of challenging problems. Clients, especially proxies, may be downloading multiple pageviews simultaneously. It is possible for a person to open two or more browsers and connect to the same Web site, or for a proxy to send multiple pageview requests to a server on behalf of several remote clients. In either case, there can be multiple active pageview transactions simultaneously associated with the remote client C_j (e.g., pv_1^j, pv_2^j, ..., pv_k^j). In addition, some embedded objects being requested may appear in multiple pageviews, and some Web objects may be retrieved from caches or CDNs. ksniffer applies a set of heuristics that attempt to determine the true container page for each embedded object. We present experimental results in Section 3.6 demonstrating that these heuristics are effective for accurately measuring client perceived response time. For example, suppose that F_3^j in Figure 3.6 depicts client C_j downloading buy.html at roughly the same time as index.html (i.e., t_0 ≈ t_j). Suppose also that ksniffer knows in advance that index.html embeds {obj1.gif, obj3.gif, obj8.gif, obj4.gif, obj2.gif} and that buy.html embeds {obj1.gif, obj8.gif, obj11.gif}. This means that both container pages are valid candidates for the true container page of obj1.gif. Whether or not t_r < t_q is a crucial indication as to the identity of the true container page. At time t_a, when connection F_2^j is being established, there is no information which could distinguish whether this connection belongs to index.html or buy.html. The only difference between F_1^j, F_2^j, and F_3^j with respect to the TCP/IP 4-tuple is the remote client port number.
Hence only the client, C_j, can be identified at time t_a; and at time t_q, it is unknown whether index.html or buy.html is the true container page for obj1.gif.

[Figure 3.7: Client active pageviews. Each client C_j has a loners queue (.pdf, .ps, .zip, etc.), a FIFO queue, and a hash table of pageviews (.html, .shtml, etc.).]

To manage pageviews and their associated embedded objects, ksniffer maintains three lists of active pageviews for each client, each sorted by request time, as shown in Figure 3.7. The loners queue contains pageviews which represent objects that cannot have embedded objects. These pageviews are kept in their own list, which is never searched when attempting to locate a container page for a new embedded object request. All other pageviews, which could potentially embed an object, are placed on both a FIFO pageview queue and the pageview hash table. This enables ksniffer to quickly locate the youngest candidate container page. Each pageview also maintains an embedded object hash table, not shown in Figure 3.7, that consists of the embedded objects associated with that pageview and state indicating whether and to what extent they have been downloaded. Given a request w_i^(j,k) captured on flow F_k^j for client C_j, ksniffer performs the following actions:

1. If w_i^(j,k) ∈ {.html, .shtml, ...}, ksniffer treats w_i^(j,k) as a container page by placing it into the pageview hash table (and FIFO queue) for client C_j. In addition, if a pageview is currently associated with F_k^j, ksniffer assumes that pageview is done.

2. If w_i^(j,k) ∈ {.pdf, .ps, ...}, ksniffer treats w_i^(j,k) as a loner object by placing it on the loner queue for C_j. In addition, if a pageview is currently associated with F_k^j, ksniffer assumes it is done.

3. If w_i^(j,k) ∈ {.jpg, .gif, ...}, then:

   (a) If the Referer: field contains the monitored server name, such as http://www.ibm.com/buy.html, then C_j's pageview hash table is searched to locate pv_c^j, the youngest pageview downloading that container page that has yet to download w_i^(j,k). If pv_c^j exists, w_i^(j,k) is associated with pv_c^j as one of its embedded objects. If no pageview meets the criterion, pv_c^j is created and w_i^(j,k) is associated with it.

   (b) If the Referer: field contains a foreign host name, such as http://www.xyz.com/buy.html, then w_i^(j,k) is treated as a loner object.

   (c) If w_i^(j,k) has no Referer: field, then the FIFO queue is searched to locate pv_c^j, the youngest pageview which has w_i^(j,k) in its referer pattern and has yet to download w_i^(j,k). If pv_c^j exists, then w_i^(j,k) is associated with pv_c^j as one of its embedded objects. If no pageview meets the criterion, then w_i^(j,k) is treated as a loner object.

The algorithm above is based on several premises. If a request for an embedded object w_i^(j,k) arrives with a Referer: field containing the monitored server as the host (e.g., http://www.ibm.com/buy.html), then the remote browser almost certainly must have previously downloaded that container page (e.g., buy.html) from the monitored server (e.g., www.ibm.com), parsed the page, and is now sending the request for the embedded object w_i^(j,k). If ksniffer failed to capture the request for the container page (e.g., buy.html), it is highly likely that it is being served from the browser cache for this particular transaction. If a request for an embedded object arrives with a Referer: field containing a foreign host (e.g., http://www.xyz.com/buy.html), it is highly likely that the foreign host is simply embedding objects from the monitored Web site into its own pages.
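The three per-request rules can be sketched in user-level Python. This is a simplified illustration of the dispatch logic, not ksniffer's actual kernel implementation; all class and function names are our own:

```python
# Illustrative extension sets; the real lists are configurable.
CONTAINER_EXTS = (".html", ".shtml", ".htm")
LONER_EXTS = (".pdf", ".ps", ".zip")

class Pageview:
    def __init__(self, url, t):
        self.url, self.start, self.done = url, t, False
        self.embedded = {}            # embedded object URL -> request time
        self.pattern = set()          # referer pattern (Section 3.2.4), may be empty
    def has_downloaded(self, u): return u in self.embedded
    def add_embedded(self, u, t): self.embedded[u] = t
    def pattern_contains(self, u): return u in self.pattern

class Flow:
    """One TCP connection; `current` is the pageview it is working on."""
    def __init__(self): self.current = None
    def finish_current_pageview(self):
        if self.current is not None:
            self.current.done = True
        self.current = None

class Client:
    def __init__(self):
        self.fifo = []                # active pageviews, oldest first
        self.pv_table = {}            # container URL -> pageviews for that URL
        self.loners = []              # objects that cannot embed others
    def new_pageview(self, url, t):
        pv = Pageview(url, t)
        self.fifo.append(pv)
        self.pv_table.setdefault(url, []).append(pv)
        return pv

def dispatch(client, flow, url, referer, server_host, now):
    if url.endswith(CONTAINER_EXTS):          # rule 1: container page
        flow.finish_current_pageview()
        flow.current = client.new_pageview(url, now)
        return flow.current
    if url.endswith(LONER_EXTS):              # rule 2: loner object
        flow.finish_current_pageview()
        client.loners.append(url)
        return None
    # rule 3: embedded object (.jpg, .gif, ...)
    if referer and referer.startswith(server_host):   # 3(a): our own host in Referer
        container_url = referer[len(server_host):]
        for pv in reversed(client.pv_table.get(container_url, [])):
            if not pv.done and not pv.has_downloaded(url):
                pv.add_embedded(url, now)     # youngest matching pageview
                return pv
        pv = client.new_pageview(container_url, now)  # container likely came from a cache
        pv.add_embedded(url, now)
        return pv
    if referer:                               # 3(b): foreign host in Referer
        client.loners.append(url)
        return None
    for pv in reversed(client.fifo):          # 3(c): no Referer, consult patterns
        if not pv.done and pv.pattern_contains(url) and not pv.has_downloaded(url):
            pv.add_embedded(url, now)
            return pv
    client.loners.append(url)
    return None
```

Searching the lists in reverse implements the "youngest candidate first" heuristic discussed below.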
When a request for an embedded object arrives without a Referer: field, every pageview associated with the client becomes a potential candidate for the container page of that object. This is depicted in Figure 3.6 when the request for obj1.gif arrives without a Referer: field. If the client is actually a remote proxy, then the number of potential candidates may be large. ksniffer applies the patterns described in Section 3.2.4 to reduce the number of potential candidates and focus on the true container page of the embedded object. The heuristic is to locate the youngest pageview which contains the object in its pattern but has yet to download the object. Patterns are therefore exclusionary: any candidate pageview not containing the embedded object in its pattern is excluded from consideration. This may result in the true container page being passed over, but as mentioned in Section 3.2.4, the likelihood that a container page embeds an object that does not appear in the page's pattern is very low for an active Web site. If a suitable container pageview is not found, then the object is treated as a loner object. If a Referer: field is missing, it was most likely removed by a proxy rather than by a browser on the client machine; and if the proxy had cached the container page during a prior transaction, it is likely to have cached the embedded object as well. This implies the object is not being requested as part of a page, but is being downloaded as an individual loner object. If a client downloads an embedded object, such as obj1.gif, it is unlikely that the client will download the same object again for the same container page. If an object appears in multiple places within a container page, most browsers will only request it once from the server. Therefore, ksniffer not only checks whether an embedded object is in the pattern for a container page, but also checks whether that instance has already downloaded the object.
The youngest candidate is usually a better choice than the oldest candidate. If browsers could not obtain objects from a cache or CDN, then the oldest candidate would be a better choice, based on FCFS. Since this is not the case, choosing the oldest candidate would tend to assign an object obj1.jpg to a container page whose 'slot' for obj1.jpg was already filled via an unseen cache hit, which overestimates response time for older pages. It is more likely that an older page obtained obj1.jpg from a cache and that the younger page is the true container for obj1.jpg than vice versa. ksniffer relies on capturing the last byte of data for the last embedded object to determine the pageview response time. However, given the use of browser caches and CDNs, not all embedded objects will be seen by ksniffer, since not all objects will be downloaded directly from the Web server. The purpose of a cache or CDN is to provide much faster response time than can be delivered by the original Web server. As a result, it is likely that objects requested from a cache or CDN will be received by the client before objects requested from the original server. If the Web server is still serving the last embedded object received by the client, other objects served from a cache or CDN will not impact ksniffer's pageview response time measurement accuracy. If the last embedded object received by the client is from a cache or CDN, ksniffer will end up not including that object's download time as part of its pageview response time. Since caches and CDNs are designed to be fast, the time unaccounted for by ksniffer will tend to be small even in this case. Given that embedded objects may be obtained from someplace other than the server, and that a pattern for a container page may not be complete, how can ksniffer determine that the last embedded object has been requested?
For example, at time t_e, how can ksniffer determine whether the entire download for index.html has completed, or whether another embedded object will be downloaded for index.html on either F_1^j or F_2^j? This is essentially the same problem described at the end of Section 3.2.3 with respect to whether or not embedded object requests will follow a request for a container page. ksniffer approaches this problem in two ways. First, if no embedded objects are associated with a pageview after a timeout interval, the pageview transaction is assumed to be complete. A six second timeout is used by default, in part based on the fact that the current ad hoc industry quality goal for complete Web page download times is six seconds [87]. If a client does not generate additional requests for embedded objects within this time frame, it is very likely that the pageview is complete. ksniffer also cannot report the response time for a pageview until the timeout expires; a six second timeout is small enough to impose only a modest delay in reporting. We discuss the implications of such a timeout in the presence of connection failure in the next section on packet loss. Second, if a request for a container page, w_k^(j,i), arrives on a persistent connection F_i^j, then we consider all pageview transactions associated with each prior object, w_b^(j,i), b < k, on F_i^j to be complete. In other words, a new container page request over a persistent connection signals the completion of the prior transaction and the beginning of a new one. We believe this to be a reasonable assumption, even under pipelined requests, since in most cases only the embedded object requests will be pipelined. Typical user behavior will end up serializing container page requests over any given connection. Hence, the arrival of a new container page request indicates a user click in the browser associated with this connection.
Taking this approach also allows ksniffer to properly handle quick clicks, when the user clicks on a visible link before the entire pageview is downloaded and displayed in the browser.

3.3 Packet Loss

Studies have shown that the packet loss rate within the Internet is roughly 1-3% [167]. We classify packet loss into three types: (A) a packet is dropped by the network before being captured by ksniffer, (B) a packet is dropped by the network after being captured, and (C) a packet is dropped by the server or client after being captured. Types A and B are most often due to network congestion or transmission errors, while type C drops occur when the Web server (or, less likely, the client) becomes temporarily overloaded. The impact that a packet drop has on measuring response time depends not only on where or why it was dropped, but also on the contents of the packet. We first address the impact of SYN drops, then look at how a lost data packet can affect response time measurements. Figure 3.5 depicts the well known TCP connection establishment protocol. Suppose that the initial SYN, transmitted at time t_0, is dropped either in the network or at the server. In either case, no SYN/ACK response is forthcoming from the server. The client side TCP recognizes such SYN drops through use of a timer [33]. If a response is not received within 3 seconds, TCP retransmits the SYN packet. If that SYN packet is also dropped by the network or server, TCP will again resend the same SYN packet, but not until after waiting an additional 6 seconds. As each SYN is dropped, TCP doubles the wait period between SYN retransmissions: 3 s, 6 s, 12 s, 24 s, etc. TCP continues in this manner until either the configured limit of retries is reached, at which time TCP reports "unable to connect" back to the browser, or the user takes an action to abort the connection attempt, such as refreshing or closing the browser.
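The retransmission schedule above is easy to state in closed form: the n-th retransmitted SYN is sent 3(2^n - 1) seconds after the original. A small Python sketch (illustrative, not ksniffer code) makes the schedule explicit:

```python
INITIAL_SYN_TIMEOUT = 3.0  # seconds; classic BSD-derived initial value [33]

def syn_retransmit_schedule(retries):
    """Offsets from the original SYN at which each retransmission is sent.
    The wait doubles each time (3, 6, 12, ...), so the n-th retransmission
    goes out at cumulative offset 3 * (2**n - 1) seconds."""
    offsets, wait, t = [], INITIAL_SYN_TIMEOUT, 0.0
    for _ in range(retries):
        t += wait
        offsets.append(t)
        wait *= 2
    return offsets
```

For example, `syn_retransmit_schedule(3)` yields offsets of 3 s, 9 s, and 21 s, the same cumulative connection delays referred to later in this section.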
This behavior is the TCP exponential backoff mechanism we have been discussing throughout this dissertation and is depicted in Figure 3.8. This additional delay has a large impact on the client perceived response time.

[Figure 3.8: Network and server dropped SYNs.]

Suppose there is a 3% network packet loss rate from client to server. Three percent of the SYN packets sent from the remote clients will be dropped in the network before reaching ksniffer or the server. The problem is that since these SYN packets are dropped in the network before reaching the server farm, both ksniffer and the server are completely unaware that the SYNs were dropped. This will automatically result in an error for any traffic monitoring system which measures response time using only those packets which are actually captured. If each client is using two persistent connections to access the Web site, this error will be 180% for a 100 ms response time and 4.5% for a 4 s response time. Under HTTP 1.0 without KeepAlive, where a connection is opened to obtain each object, the probability of a network SYN drop grows with the number of objects in the pageview. For a page download of 10 objects, there is a 30% chance of incurring the 3 second retransmission delay, a 60% chance for 20 objects, and a 90% chance for 30 objects. ksniffer uses a simple technique, similar to the fast online Certes algorithm, for capturing this undetectable connection delay (type A SYN packet loss). Three counters are kept for each subnet. One of the three counters is incremented whenever a SYN/ACK packet is retransmitted from the server to the client (which indicates that the SYN/ACK packet was lost in the network).
The counter that gets incremented depends on how many times the SYN/ACK has been transmitted. Every time a SYN/ACK is sent twice, the first counter is incremented; every time a SYN/ACK is sent 3 times, the second counter is incremented; and every time a SYN/ACK is sent 4 times, the third counter is incremented. Whenever a SYN packet arrives for a new connection, if one of the three counters is greater than zero, then ksniffer subtracts the appropriate amount of time from the start time of the connection and decrements the counter (round robin is used to break ties). Assuming that a SYN packet will be dropped as often as a SYN/ACK, this gives ksniffer a reasonable estimate of the number of connections experiencing a 3 s, 9 s, or 21 s connection delay. Alternatively, an approach such as the one in Sting [140] could be used to estimate client to server network loss rates, but this involves active probing. The same retransmission delays are incurred when SYNs are dropped by the server (type C). In this case, ksniffer is able to capture and detect that the SYNs were dropped by the server, and can distinguish these connection delays, which are due to server overload, from those previously described, which are due to network congestion. ksniffer also determines when a client is unable to connect to the server. If the client reattempts access to the Web site within six seconds of a connection failure, ksniffer counts the time associated with the first failed connection attempt as part of the connection latency for the reattempt; otherwise the failed connection attempt is reported under the category "frustrated client".
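The three-counter technique can be sketched as follows. This is a simplified user-level illustration with names of our own choosing; in particular, it resolves ties by taking the largest pending delay first, whereas the text specifies round robin:

```python
# Cumulative client-side delay implied by n lost handshake round trips: 3(2^n - 1) s.
DELAY_FOR = {1: 3.0, 2: 9.0, 3: 21.0}

class SubnetSynLoss:
    """Per-subnet counters of SYN/ACK retransmissions (an illustrative sketch)."""
    def __init__(self):
        self.counters = {1: 0, 2: 0, 3: 0}

    def on_synack_retransmit(self, times_sent):
        # A SYN/ACK sent n+1 times implies n lost round trips on this subnet.
        n = min(times_sent - 1, 3)
        if n >= 1:
            self.counters[n] += 1

    def adjust_start_time(self, syn_arrival):
        """On a new connection's first captured SYN, consume one pending delay
        (if any) and shift the connection start time earlier by that amount.
        Ties are broken largest-first here for simplicity (the text uses
        round robin)."""
        for n in (3, 2, 1):
            if self.counters[n] > 0:
                self.counters[n] -= 1
                return syn_arrival - DELAY_FOR[n]
        return syn_arrival
```

The underlying assumption, as stated above, is that client-to-server SYNs are lost about as often as server-to-client SYN/ACKs, so observed SYN/ACK retransmissions are a usable proxy for unobservable SYN drops.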
Interestingly, some undetectable network SYN drops are actually detectable. In Figure 3.8, the presence of a 6 second gap between the first and second captured SYN allows ksniffer to infer the presence of the network dropped SYN. Although this situation is less common, whenever two SYNs are captured by ksniffer, the gap between them can be examined to determine whether one or more network dropped SYNs occurred. The algorithm in Figure 3.9 depicts the process of inferring detectable network SYN drops from the time gap between two captured SYNs. In the algorithm, t_i is the capture time of the current SYN packet, k is the number of expected SYN retransmissions to account for, and num_syns is the total number of SYNs sent by the client, both captured and inferred.

    if (num_syns > 1) return;
    if (num_syns == 0) {
        num_syns = 1;
        last_syn = t_i;
        start_time = t_i;
    } else {
        delta = t_i - last_syn;
        for (j = 1; j <= k; j++) {
            F = 3 * (2^j - 1);           /* send offset of the j-th retransmission */
            for (i = j + 1; i <= k; i++) {
                L = 3 * (2^i - 1);       /* send offset of the i-th retransmission */
                if (delta ≈ L - F) {     /* gap matches the backoff schedule */
                    start_time -= F;
                    num_syns = i + 1;
                    return;
                }
            }
        }
    }

Figure 3.9: Algorithm for detecting network dropped SYNs from captured SYNs.

Similar undetected latency occurs when a GET request is dropped in the network before reaching ksniffer or the server, then retransmitted by the client. An undetected GET request drop differs from an undetected SYN drop in two ways. First, unlike SYN drops, TCP determines the retransmission timeout period based on the RTT and a number of implementation dependent parameters. ksniffer implements the standard RTO calculation [126] using Linux TCP parameters, and adjusts for this undetectable time in the same manner as mentioned above. Note that fast retransmit does not come into play since the server is not actually expecting to receive a TCP segment from the client.
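The core of the Figure 3.9 search can be restated as a small, runnable Python sketch (our own names; the tolerance parameter is an assumption we introduce to absorb RTT and timer jitter):

```python
def infer_dropped_syns(delta, k=6, tol=0.5):
    """Given the gap `delta` (seconds) between two captured SYNs of the same
    connection attempt, find retransmission indices j < i whose TCP backoff
    spacing matches it. Retransmission n is sent 3(2^n - 1) s after the
    original SYN. Returns (shift, total_syns), where `shift` is how far the
    connection start time must be moved back, or None if no schedule matches."""
    offset = lambda n: 3.0 * (2 ** n - 1)
    for j in range(1, k + 1):
        for i in range(j + 1, k + 1):
            if abs(delta - (offset(i) - offset(j))) <= tol:
                # The first captured SYN was really retransmission j, sent
                # offset(j) seconds after the client's original (dropped) SYN.
                return offset(j), i + 1
    return None
```

For instance, a 6 s gap matches j=1, i=2: the original SYN was dropped, the start time shifts back 3 s, and the client has sent 3 SYNs in total, exactly the scenario of Figure 3.8.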
Second, a dropped GET request will only affect the measurement of the overall pageview response time if the GET request is for a container page and is not the first request over the connection. Otherwise, the start of the transaction will be indicated by the start of connection establishment, not the time of the container page request. As mentioned in the previous section, ksniffer uses a timeout as one mechanism for determining when a pageview is finished: if no embedded objects are associated with a pageview after a specified timeout period, the pageview transaction is assumed to be complete. Yet the presence of network and server dropped SYNs affects this heuristic, since both can increase the time gap between object downloads. This is easily resolved by using the 6 s timeout period in conjunction with the notion that a pageview remains 'in progress' while the client is attempting to establish a connection (which may include a series of SYN drops). The usual rules then apply: the first request over the new connection will be for either an embedded object (i.e., continued processing of the current pageview) or a container page (i.e., start of a new pageview). ksniffer often expects to capture the packet containing the sequence number of the last byte of data for a particular request. To capture retransmissions, ksniffer uses a timer along with a finish queue on each flow, updating the end of response time appropriately. Suppose the last packet of a response is captured by ksniffer at time t_k, at which point ksniffer identifies it as containing the sequence number of the last byte of the response, and moves the w_k^(j,i) request object from the flow's request queue to the flow's finish queue. The packet is then dropped in the network before reaching the client (type B).
At time t_(k+h), ksniffer will capture the retransmitted packet and, using its sequence number, determine that it is a retransmission for w_k^(j,i), which is located on the finish queue. The completion time of w_k^(j,i) is then set to the timestamp of this packet.

3.4 Longest Prefix Matching

Longest prefix matching determines which subnet a remote client is a member of. This capability enables ksniffer to monitor or aggregate information on a per remote subnet basis: determining response times per subnet, identifying the most active subnets, and tracking changes in activity over time for specific subnets are all examples of such analysis. This capability cannot be provided using common packet header filtering techniques. As such, ksniffer implements the Chiueh and Pradhan [44] algorithm for longest prefix matching. We made two slight modifications to the published work. First, in our implementation, we shared the prefix structures across multiple hash tables, reducing memory requirements. Second, we tracked the most frequently accessed prefixes using a statistical hot list, in which we kept pointers to the top 200 prefixes. We perform an offline preprocessing of routing tables to improve the load time performance of the kprefix module. This includes the following offline steps:

1. Download the compressed, binary formatted BGP routing tables from RIPE NCC [135] and the Oregon Route Views Project [155].

2. Parse the compressed, binary BGP tables, construct a longest prefix matching hash structure, and save it to disk.

3. Copy the longest prefix file to the ksniffer machine.

The online process simply involves reading the longest prefix file from disk during module load and creating a hash table structure in kernel memory. The files obtained from RIPE NCC [135] and the Oregon Route Views Project [155] are in the well known MRT binary format [104].
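A hash-based longest prefix match can be sketched in a few lines of Python. This is a deliberately simplified per-prefix-length lookup for IPv4, not the actual Chiueh-Pradhan scheme ksniffer implements, but it conveys the idea of constant-expected-time probing of hash tables rather than a trie walk:

```python
import ipaddress

class PrefixTable:
    """Simplified IPv4 longest-prefix match: one hash table per prefix length,
    probed from longest to shortest prefix (an illustrative sketch)."""
    def __init__(self):
        self.tables = {}          # prefix length -> {network address int -> prefix string}

    def add(self, prefix):
        """Insert a prefix given as e.g. '10.1.0.0/16'."""
        net = ipaddress.ip_network(prefix)
        self.tables.setdefault(net.prefixlen, {})[int(net.network_address)] = prefix

    def lookup(self, addr):
        """Return the longest matching prefix for dotted-quad `addr`, or None."""
        a = int(ipaddress.ip_address(addr))
        for plen in sorted(self.tables, reverse=True):
            key = a & (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
            if key in self.tables[plen]:
                return self.tables[plen][key]
        return None
```

With a bounded number of distinct prefix lengths, each lookup probes a constant number of hash tables, consistent with the O(c) per-connection cost discussed below.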
The process of downloading and parsing the files takes several hours and is performed offline using a Java program. The longest prefix file is simply a list of prefix/length pairs. During module load, this file can be read into the hash structure in less than one second. The offline process takes several hours for two reasons: downloading the BGP routing table files takes time, and each file contains a significant number of duplicate entries. Krishnamurthy and Wang [90] found 391,497 unique prefix/netmask entries following a related approach. We obtained 139,508 unique longest prefixes, which required 21.5 MB of kernel memory. At runtime, the cost of a longest prefix match is O(c), where c is a small constant independent of the number of entries in the hash tables. ksniffer performs a longest prefix match only once per connection (for the remote IP address), when it observes the TCP three-way handshake, not for every packet, and then stores the result in a data structure.

3.5 Tracking the Most Active

Often it is only the most active clients, pages, or subnets that are of interest in managing a Web server. Knowing which subnets or clients are most active during different times of the day is useful for providing quality of service, performing capacity planning, or deciding where to place a content distribution node. Such a "hot list" would be expected to change over time, based on the activities of the clients. Unfortunately, for items like IP addresses, it is impractical to track each and every client over long periods of time (a hash table of size 2^32 would be required). Instead, we use a probabilistic approach to maintaining hot lists, based on the Gibbons and Matias algorithm [67], which tracks the most relevant items over time to within a small error fraction.
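To illustrate how a bounded-memory hot list can track heavy hitters, the sketch below uses the well known Misra-Gries frequent-items summary as a stand-in; it is not the Gibbons-Matias algorithm [67] that ksniffer actually uses, but it shares the key property of bounded space with a small error fraction:

```python
class HotList:
    """Approximate hot list of the most active keys (e.g. client IPs),
    using at most k counters (Misra-Gries style, shown as a stand-in
    for the Gibbons-Matias algorithm cited in the text)."""
    def __init__(self, k=200):
        self.k, self.counts = k, {}

    def observe(self, key):
        if key in self.counts:
            self.counts[key] += 1
        elif len(self.counts) < self.k:
            self.counts[key] = 1
        else:
            # Table full: decrement every counter and drop keys that hit
            # zero. Heavy hitters survive; memory stays bounded by k.
            for other in list(self.counts):
                self.counts[other] -= 1
                if self.counts[other] == 0:
                    del self.counts[other]

    def top(self, n=10):
        return sorted(self.counts, key=self.counts.get, reverse=True)[:n]
```

Each observation touches at most k counters, so memory and per-update cost stay fixed regardless of how many distinct clients appear over time.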
This allows ksniffer to maintain hot lists for a number of item types without large storage overheads. ksniffer also supports the ability to quickly perform various types of matching on a URI string (or a substring thereof), which is useful for aggregating response times. Three matching algorithms are implemented: 1) tracking response time for URIs based on the longest directory path matched, 2) matching on an exact directory path, and 3) matching on the file extension of the requested filename.

3.6 Experimental Results

We implemented ksniffer as a set of Linux kernel modules and installed it on a commodity PC to demonstrate its accuracy and performance under a wide range of Web workloads. We report an evaluation of ksniffer in a controlled experimental setting as well as an evaluation of ksniffer tracking user behavior at a live Internet Web site. Our experimental test bed is shown in Figure 3.10.

[Figure 3.10: Experimental test bed. Four client subnets (10.1.0.0 at 200 ms RTT, 10.2.0.0 at 140 ms, 10.3.0.0 at 80 ms, 10.4.0.0 at 20 ms), each generated by a 1 GHz PIII machine with 512 MB RAM running Redhat 7.3, connect through a Catalyst 6500 switch to the Web server and the ksniffer monitor, each a 1.7 GHz Xeon with 750 MB RAM running Redhat 9.0.]

We used a traffic model based on Surge [23] but made some minor adjustments to reflect more recent work [76, 146] on characterizing Web traffic: the maximum number of embedded objects in a given page was reduced from 150 to 100, and the percentages of base, embedded, and loner objects were changed from 30%, 38%, and 32% to 42%, 48%, and 10%, respectively. The total number of container pages was 1041, with 959 unique embedded objects. 49% of the embedded objects are embedded by more than one container page. We also fixed a bug in the modeling code and included CGI scripts in our experiments, something not present in Surge. For traffic generation, we used an updated version of WaspClient [107], which is a modified version of the client provided by Surge.
Virtual clients on each machine cycle through a series of pageview requests, first obtaining the container page and then all its embedded objects. A virtual client can open 2 parallel TCP connections for fetching pages, which mimics the behavior of Microsoft IE. Requests on a TCP connection are serialized, so that the next request is not sent until the current response on that connection is received. In addition, each virtual client binds to a unique IP address using IP aliasing on the client machine. This lets each client machine appear to the server as a collection of up to 200 unique clients from the same subnet. To emulate wide-area network conditions, we extended the rshaper [139] bandwidth shaping tool to include packet loss and round trip latencies. We installed this software on each client traffic generator machine, enabling us to impose packet drops as well as RTT delays between 20 and 200 ms, as specified in Figure 3.10. To quantify the accuracy of the client perceived response times measured by ksniffer, we ran fifteen different experiments with different traffic loads under non-ideal and high-stress operating conditions and compared ksniffer's measurements against those obtained by the traffic generators executing on the client machines. We measured with two different Web servers, Apache and TUX; used both HTTP 1.0 without KeepAlive and persistent HTTP 1.1; and included a combination of static pages and CGI programs for Web content. We also measured in the presence of network and server packet loss, missing Referer: fields, client caching, and near gigabit traffic rates. Table 3.1 summarizes these experimental results. In all cases, the difference between the mean response time as determined by ksniffer and that measured directly on the remote client was less than 5%.
Furthermore, the absolute time difference between ksniffer and client-side instrumentation was in some cases less than 1 ms and in all cases less than 50 ms.

Table 3.1: Summary of results.

Test  Web Server  Type        HTTP  Virtual Clients  PV/s      URL/s     Mbps    Client RT  ksniffer RT  diff (ms)  % diff  elapsed time
A     Apache      static      1.0   120              5-140     5-625     1-60    1.528s     1.498s       -29        -1.9    133m
B     Apache      cgi+static  1.0   120              5-160     10-660    1-60    1.513s     1.483s       -30        -2.0    133m
C     Apache      static      1.1   120              10-180    30-730    3-70    1.003s     0.981s       -22        -2.2    79m
D     Apache      cgi+static  1.1   120              10-400    40-1520   3-140   0.726s     0.699s       -27        -3.7    72m
E     TUX         static      1.0   800              65-750    260-3000  15-270  1.556s     1.506s       -49        -3.2    20m
F     TUX         static      1.1   800              125-1370  500-5300  35-455  0.815s     0.782s       -33        -4.1    11m
G     Apache      static      1.0   500              35-500    140-2000  10-200  1.537s     1.489s       -48        -3.1    32m
H     Apache      cgi+static  1.1   400              60-690    250-2880  15-250  0.792s     0.825s       -33        -4.0    22m
I     Apache      static      1.1   500              60-700    260-3000  20-265  0.884s     0.929s       -45        -4.8    18m
S1    TUX         static      1.0   16               1909      8,007     690     7.8ms      7.7ms        -0.17      -2.2    210s
S2    TUX         static      1.1   80               2423      10,164    878     30.5ms     29.7ms       -0.83      -2.7    165s
V     TUX         static      1.0   800              0-2410    0-10,000  0-850   0.574s     0.571s       -3         -0.5    29m
O1    Apache      static      1.0   800              419       1756      152     1.849s     1.806s       -42        -2.3    16m
O2    Apache      static      1.1   240              728       3054      264     .328s      .318s        -10        -3.1    9m
X     Apache      static      1.0   800              2174      9120      462     .365s      .363s        -1.7       -0.5    184s

[Figure 3.11: Test F, pageviews per second.]

All tests (except Tests S1 and S2) were done under non-ideal conditions found in the Internet, with 2% packet loss and 20% missing Referer: fields.
Each client requested the same sequence of pageviews, but since each traffic generator machine was configured with a different RTT to the Web server, as shown in Figure 3.10, the clients took different amounts of time to obtain all of their pages, resulting in a variable load on the Web server over time. For example, Figure 3.11 shows results from Test F comparing ksniffer against client-side instrumentation in measuring pageviews/s over time. There are two lines in the figure, but they are hard to distinguish because ksniffer's pageview count is so close to that of direct client-side instrumentation. Figure 3.12 shows results from Test F comparing ksniffer against client-side instrumentation in measuring mean client perceived pageview response time for each 1 second interval. The ksniffer results are very accurate and hard to distinguish from client-side instrumentation.

[Figure 3.12: Test F, mean pageview response time — mean response time (s) measured by the clients and by ksniffer over the 600 s test interval.]

As indicated by Figure 3.11, the response time varies over time because the client subnets complete at different points. During the initial 250 s, clients from each of the four subnets are actively making requests. At around 250 s, the clients from subnet 10.4.0.0 with an RTT of 20 ms have completed, while clients from the other subnets remain active. At around 300 s, the clients from subnet 10.3.0.0 with an RTT of 80 ms have completed, leaving clients from subnets 10.2.0.0 and 10.1.0.0 active. At time 475 s, clients from subnet 10.2.0.0 with an RTT of 140 ms have completed, leaving only those clients from subnet 10.1.0.0 with an RTT of 200 ms. Note that, although the pageview request rate decreases, the mean response time increases because the remaining clients have larger RTTs to the Web server and thus incur larger response times.
Figure 3.13 shows results for Tests A through F obtained by applying the longest prefix matching algorithm described in Section 3.4 to categorize RTT and response time on a per subnet basis.

[Figure 3.13: Client perceived response time on a per subnet basis — response time (s) for subnets 10.1.0.0 through 10.4.0.0 and for ksniffer, for Experiments A through F.]

In the figure, the blue bars represent the response time obtained by ksniffer for the corresponding set of clients in each subnet. These results show that ksniffer provides accurate pageview response times as compared to client-side instrumentation, even on a per subnet basis when different subnets have different RTTs to the Web server. ksniffer RTT measurements are also very accurate as compared to the actual RTT used for each subnet. The results show how this mechanism can be very effective in differentiating performance and identifying problems across different subnets.

Tests S1 and S2 were done under high bandwidth conditions to show results at the maximum bandwidth rate possible in our test bed. This was done by using the faster TUX Web server and by imposing no packet loss or network delay. For HTTP 1.1, 80 virtual clients generated the greatest bandwidth rate, but under HTTP 1.0 only 16 clients generated the highest bandwidth rate. ksniffer is within 3% of client-side measurements, even at a rate of 878 Mbps of HTTP content. The absolute time difference between ksniffer and client response time measurements was less than 1 ms. We note that the resolution of the packet timer on ksniffer is only 1 ms, due to the Linux clock timer granularity. Under HTTP 1.0 without KeepAlive, each object retrieved requires its own TCP connection. The TCP connection rate under Test S1 was 8,000 connections/s. These results demonstrate ksniffer's ability to track TCP connection establishment and termination at high connection rates.
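The per-subnet categorization described above can be sketched as follows. This is a minimal illustration, not ksniffer's actual implementation: the subnet table mirrors the four client subnets used in the experiments, and the prefix lengths and aggregation class are assumptions.

```python
import ipaddress

# Hypothetical subnet table; /16 prefixes are an assumption for illustration.
SUBNETS = [
    ipaddress.ip_network("10.1.0.0/16"),
    ipaddress.ip_network("10.2.0.0/16"),
    ipaddress.ip_network("10.3.0.0/16"),
    ipaddress.ip_network("10.4.0.0/16"),
]

def longest_prefix_match(ip, subnets=SUBNETS):
    """Return the most specific subnet containing ip, or None.
    With overlapping prefixes, the longest (most specific) one wins."""
    addr = ipaddress.ip_address(ip)
    matches = [n for n in subnets if addr in n]
    return max(matches, key=lambda n: n.prefixlen) if matches else None

class SubnetStats:
    """Aggregate per-subnet RTT and pageview response time samples."""
    def __init__(self):
        self.samples = {}  # subnet -> list of (rtt_s, response_time_s)

    def record(self, client_ip, rtt_s, response_time_s):
        net = longest_prefix_match(client_ip)
        if net is not None:
            self.samples.setdefault(net, []).append((rtt_s, response_time_s))

    def mean_response_time(self, subnet):
        rts = [rt for _, rt in self.samples[subnet]]
        return sum(rts) / len(rts)
```

A monitor records one sample per completed pageview and can then report per-subnet means, as in Figure 3.13.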
Test V was done with severe variations in load, alternating between no load and maximum bandwidth load by switching the clients between on and off modes every 50 s. Figure 3.14 compares ksniffer response time with that measured at the client, and Figure 3.15 compares the distribution of the response time. This indicates ksniffer's accuracy under extreme variations in load.

[Figure 3.14: Test V, mean pageview response time over the 1000 s test interval.]

[Figure 3.15: Test V, response time distribution — count versus seconds for the clients and for ksniffer.]

Tests O1 and O2 were done with the Web server experiencing overload and therefore dropping connections. We configured Apache to support up to 255 simultaneous connections, then started 240 virtual clients. Since each client opens two connections to the server to obtain a container page and its embedded objects, this overwhelmed Apache. During Tests O1 and O2, the Web server machine reported a connection failure rate of 27% and 12%, respectively. Table 3.1 shows that ksniffer's pageview response times for these tests were only 3% less than those from the client side. These results show ksniffer's ability to measure response times accurately in the presence of both server overload and network packet loss.

Test X was done to show ksniffer performance with caching clients, by modifying the clients so that 50% of the embedded objects requested were obtained from a zero-latency local cache. Figure 3.16 compares ksniffer and client-side instrumentation in measuring pageview response time over the course of the experiment.

[Figure 3.16: Test X, mean pageview response time over the 150 s test interval.]
The results show that ksniffer can provide very accurate response time measurements in the presence of client caching as well.

We deployed ksniffer in front of a live Internet Web site, GuitarNotes.com, which is hosted in NYC. Figure 3.17 depicts results for tracking a single user during a logon session from Hawthorne, NY.

[Figure 3.17: Live Internet Web site — per-page response time (s) measured by PageDetailer and by ksniffer, and the number of embedded objects (secondary Y axis), for each of the twelve pages downloaded during the user session.]

Using MS IE V6, and beginning with an empty browser cache, the user first accessed the home page and then visited a dozen pages within the site, including the product review section, discussion forum, FAQ, and classified ads, and performed several site searches for information. This covered a range of static and dynamically generated pageviews. The number of embedded objects for each page varied between 5 and 30, and is indicated by the dotted line, which is graphed against the secondary Y axis on the right. These objects included .gif, .css and .js objects. PageDetailer [79] was executing on the client machine, monitoring all socket level activity of IE. PageDetailer uses a Windows socket probe to monitor and timestamp each socket call made by the browser: connect(), select(), read() and write(). By parsing the HTTP requests and replies, it is able to determine the response time for a pageview, as well as for each embedded object within a page. The pageview response time is calculated as the difference between the entry into the connect() system call and the return from the read() system call for the last byte of data of the last embedded object. As shown in Figure 3.17, the response time which ksniffer calculates in NYC at the Web server is nearly identical to that measured by PageDetailer running on the remote client machine.
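The PageDetailer timing rule described above — first connect() entry to the return of the read() delivering the last byte of the last embedded object — can be sketched as follows. The event-tuple format is an illustrative assumption, not PageDetailer's actual data model.

```python
# Sketch of a PageDetailer-style pageview response time computation.
# events: list of (timestamp_s, call_name, phase) tuples for one page
# download, where phase is "enter" or "return".
def pageview_response_time(events):
    start = min(t for t, call, phase in events
                if call == "connect" and phase == "enter")
    end = max(t for t, call, phase in events
              if call == "read" and phase == "return")
    return end - start
```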
For each of the twelve pages downloaded by the client, ksniffer is within 5% of the response time recorded by PageDetailer.

ksniffer provides excellent performance scalability compared to common user-space passive packet capture systems. Almost all existing passive packet capture systems in use today are based on libpcap [151], a user space library that opens a raw socket to provide packets to user space monitor programs. As a scalability test, we wrote a libpcap based traffic monitor program whose only function was to count TCP packets. Executing on the same physical machine as ksniffer, the libpcap packet counter began to drop a large percentage of packets when the traffic rate reached roughly 325 Mbps. In contrast, ksniffer performs complex pageview analysis at near-gigabit traffic rates without such packet loss.

In Chapter 5 we discuss several alternative methods for measuring client perceived response time, along with their respective shortcomings. One approach, which we mention now, involves instrumenting the Web server to track when requests arrive and complete service at the application level [8, 86, 92, 94]. This approach has the desirable properties that it only requires information available at the Web server and can be used for non-HTML content. However, application-level approaches do not account for network interactions or delays due to TCP connection setup or waiting in kernel queues on the Web server. Our results from Chapter 2 demonstrate that application-level Web server measurements can underestimate response time by more than an order of magnitude. Figure 3.18 compares the response time captured within Apache, by tracking when requests arrive and complete service, with the response time measured at the remote client and by ksniffer.
[Figure 3.18: Apache measured response time, per URL — mean response time (s) over time for the clients (per pageview), ksniffer (per pageview), and Apache (per URL).]

[Figure 3.19: Apache measured response time for loner pages — mean response times: clients 0.131 s, ksniffer 0.133 s, Apache 0.000429 s.]

Apache does not correlate embedded objects with their container pages to provide a per pageview response time. Instead, Apache only measures the response time for each individual URL request. As such, the response time reported by Apache is completely unrelated to the response time perceived by the remote client. Even the per URL response time reported by Apache is grossly inaccurate with respect to the per URL response time measured by the client. Figure 3.19 depicts the Apache response time for only the loner pages, i.e., container pages without embedded objects. This per URL response time measured by Apache is extremely inaccurate with respect to the response time experienced by the remote client. Nevertheless, systems continue to mistakenly use the Apache measured response time as the basis for determining and managing quality of service in Web servers [26, 46].

3.7 Summary

We have designed, implemented and evaluated ksniffer, a kernel-based traffic monitor that can be colocated with Web servers to measure their performance as perceived by remote clients in real-time. State of the art packet level approaches for reconstructing network protocol behavior are based on offline analysis, after the network packets have been logged to disk [6, 37, 60, 66, 75]. This kind of analysis uses multiple passes and is limited to analyzing only reasonably sized log files [146].
ksniffer's correlation algorithm differs from EtE [66] in that it does not require multiple passes or offline operation, uses file extensions and referring host names in addition to the filename in the Referer: field, handles multiple requests for the same Web page from the same client, and accounts for connection setup time and packet loss in determining response time. Feldmann [60] describes many of the issues involved in TCP/HTTP reconstruction, but does not consider the problem of measuring response time.

ksniffer shares certain limitations that are present in all network traffic monitors. Response time components due to processing on the remote client machines cannot be directly measured from server-side network traffic. Examples include times for DNS query resolution and for HTML parsing and rendering on the client. Embedded objects obtained from locations other than the monitored servers may have an impact on accuracy as well, but only if their download completion time exceeds that of the last object obtained from the monitored server. As a passive network monitor, ksniffer requires no changes to clients or Web servers, and does not perturb performance in the way that intrusive instrumentation methods can.

ksniffer determines client perceived pageview response times using novel, online mechanisms that take a “look once, then drop” approach to packet analysis to reconstruct TCP connections and learn client pageview activity. We have implemented ksniffer as a set of loadable Linux kernel modules and validated its performance using both a controlled experimental test bed and a live Internet Web site. Our results show that ksniffer's in-kernel design scales much better than common user-space approaches, enabling ksniffer to monitor gigabit traffic rates using only commodity hardware, software, and network interface cards.
More importantly, our results demonstrate ksniffer's unique ability to accurately measure client perceived response times even in the presence of network and server packet loss, missing HTTP Referer: fields, client caching, and widely varying static and dynamic Web content. Our measurement work with ksniffer led us to develop Remote Latency-based Management, an approach for not only measuring but also managing the remote client perceived response time.

Chapter 4

Remote Latency-based Web Server Management

Although many techniques have been developed for managing the response time of Web services [3, 8, 30, 31, 43, 46, 54, 89, 159], previous approaches focused on controlling only the Web server response time of individual URL requests. Unfortunately, this has little relevance to end users, who are located remotely, not at the Web server, and who are interested in viewing entire Web pages that consist of multiple objects, not just individual URLs. The problem is exacerbated by the fact that the response time measured within the Web server can be an order of magnitude less than that perceived by the remote client, as shown in Chapter 2. These techniques are controlling the wrong measure of response time: they may in fact improve server response time while unknowingly and unexpectedly degrading the overall response time seen from the perspective of end users, as shown in Section 2.6.

Managing the response time of Web services is crucial for providing differentiated services, in which different classes of traffic or clients can receive different quality of service. For example, an e-commerce Web site may want to ensure that users with full shopping carts are given highest priority for receiving the best response time, while users who are casual visitors are given lower priority.
Existing approaches that provide differentiated services often depend on load shedding in the form of admission control to maintain a specified set of response time thresholds [31, 43, 54, 89]. Requests from low priority clients are dropped when they begin to interfere with the response time of high priority clients. However, prior techniques ignore the effect that admission control drops have on the overall response time perceived by end users.

In this chapter we present Remote Latency-based Management (RLM), a novel approach for managing Web response times as perceived by remote clients using only server-side techniques. RLM correlates container pages and their embedded objects to manage pageview response times of entire Web pages, not just individual URLs. RLM tracks the progress of each Web page in real-time as it is downloaded, and uses this information to dynamically control the client perceived response time by manipulating the network traffic into and out of the Web server complex. RLM provides its management functionality in a non-invasive manner, as a stand-alone appliance that simply sits in front of a Web server complex, without any changes to existing Web clients, servers, or applications.

RLM builds on our work with ksniffer from Chapter 3, our kernel-based traffic monitor that accurately estimates pageview response times as perceived by remote clients at gigabit traffic rates. RLM uses passive packet capture to track the elapsed time of each pageview download, then uses this information to build a novel event node model that enables RLM to make key management decisions dynamically at each point in the download process. In particular, our model defines and includes the effect of connection admission control drops on partially successful Web page downloads. It also accounts for some notable behaviors of common Web browsers in the presence of connection failures.
Using this model, RLM applies two sets of techniques for managing pageview response time: fast SYN and SYN/ACK retransmission, and embedded object removal and rewrite. Fast SYN and SYN/ACK retransmission reduce connection latencies associated with bursty loads and network loss, which are key factors for the short-lived TCP connections typical of Web transactions. Embedded object removal and rewrite reduce server and network transfer latencies by adapting Web page content while the pageview is in the process of being downloaded. These techniques can be applied using a simple set of management rules that can be defined to provide differentiated services across multiple classes of Web clients and content.

We implemented RLM on an inexpensive, commodity, Linux-based PC and demonstrate that it can manage client perceived pageview response times in real-time for three-tier Web architectures. Using our prototype, we present experimental data obtained from using RLM to manage client perceived pageview response times under the TPC-W e-commerce Web workload. We present results for both single and multiple service class environments. Our results show RLM's unique ability to track a pageview download as it occurs, properly measure its elapsed response time as perceived by the remote client, decide whether action ought to be taken at key junctures during the download, and apply latency control mechanisms to the current activities.

The rest of this chapter is outlined as follows. Section 4.1 presents an overview of the RLM architecture. Section 4.2 describes the RLM pageview download model and the pageview event node framework used for making response time management decisions. Section 4.3 describes the RLM mechanisms used for connection latency management and their effect on client browsers. Section 4.4 describes the RLM mechanisms used for page transfer latency management.
Section 4.5 describes the implementation of RLM and presents some experimental results based on managing a TPC-W workload. Section 4.6 presents theoretical analysis based on several simplifying assumptions. Finally, we present some concluding remarks.

4.1 RLM Architecture Overview

[Figure 4.1: RLM deployment — RLM sits between the Internet and the Web server complex, handling HTTP/TCP traffic in both directions.]

As shown in Figure 4.1, RLM is a stand-alone appliance which sits in front of a Web server complex. RLM does not require any modifications to Web pages, the server complex, or browsers, making deployment fast and easy. This is particularly important for Web hosting companies that are responsible for maintaining the infrastructure surrounding a Web site, but are not permitted to modify the customer's server machines or content.

RLM builds on our work with ksniffer from Chapter 3 and uses a similar architecture. It is designed as a set of dynamically loadable kernel modules that reside above the network device independent layer in the operating system. Its device independence makes it easy to deploy on any inexpensive, commodity PC without special NIC hardware or device driver modifications. RLM monitors and manages bidirectional network traffic and looks at each packet only once. Its in-kernel implementation exploits several performance optimizations, such as zero-copy buffer management, eliminated system calls, and reduced context switches [78, 84]. Our work with ksniffer demonstrated that this in-kernel architecture can support gigabit traffic rates. RLM operates as a server-side mechanism with a low-delay control path to the Web server complex that is unaffected by outside network conditions.
RLM measures client perceived pageview response times for Web transactions, then uses that information as real-time feedback in managing the behavior of the Web server complex to deliver desired response times. This tight measurement and management feedback loop near the server complex is key to RLM's ability to provide real-time control of the performance of the Web server complex. RLM passively captures network packets to measure client perceived response times, then actively manipulates the packet stream between client and server to meet desired response time goals.

RLM operates at the network packet level in part to provide its functionality without any modifications to the Web server complex. More importantly, as discussed in Section 4.2, measuring and controlling client perceived response times requires tracking the client-server interaction at the packet level. RLM measures client perceived response times by capturing and analyzing packets using the Certes model of TCP retransmission and exponential backoff, which accounts for latency due to connection setup overhead and network packet loss. It combines this model with ksniffer's higher level online mechanisms, which use access history and HTTP referrer information, when available, to learn relationships among Web objects and to correlate connections and Web objects to determine pageview response times. The remainder of this chapter focuses on the management mechanisms RLM provides that work in conjunction with its measurement mechanisms.

4.2 RLM Pageview Event Node Model

RLM introduces a new model for specifying and achieving response time service level objectives based on tracking a pageview download as it happens and making service decisions at each key juncture based on the current state of the pageview download.
A pageview download can be viewed as a set of well defined activities, such as establishing a connection, getting the container page, and getting the embedded objects in a page. RLM models a pageview download as an event node graph, where each node represents an activity and each link indicates a precedence relationship. The nodes in the graph are ordered by time and each node is annotated with the elapsed time from the start of the transaction. Each activity contributes to the overall response time. Some activities may overlap in time, have greater potential to incur larger latencies, be on the critical path, or be more difficult to control than others. RLM controls pageview response times by identifying and managing the high latency activities on the critical path of the pageview download.

To illustrate our approach, Figure 4.2 depicts the response time te − t0 for the pageview download of index.html, which embeds obj3.gif, obj6.gif and obj8.gif. This example uses two connections to download the page, consistent with modern Web browsers, which open multiple connections to download Web content faster.

[Figure 4.2: Breakdown of client response time — two parallel connections downloading index.html and its embedded objects, with Tconn, Tserver, Ttransfer and Trender serialized on each connection between t0 and te.]

Four types of latencies are serialized over each connection and delimited by specific events:

1. Tconn: TCP connection establishment latency, using the TCP 3-way handshake. Begins when the client sends the TCP SYN packet to the server.

2. Tserver: latency for the server complex to compose the response by opening a file, or calling a CGI program or servlet. Begins when the server receives an HTTP request from the client.

3. Ttransfer: time required to transfer the response from the server to the client.
Begins when the server sends the HTTP response header to the client.

4. Trender: time required for the browser to process the response, such as parsing the HTML or rendering an image. Begins when the client receives the last byte of the HTTP response.

The corresponding event node graph generated by RLM is shown in Figure 4.3. Each link is identified with the type of latency that results from the particular activity. Each node is annotated with the elapsed time from the start of the transaction.

[Figure 4.3: Pageview modeled as an event node graph — 18 nodes from SYN arrival at 0 ms through the final object response at 3436 ms, with links labeled Tconn, Tserver, Ttransfer and Trender.]

By measuring the elapsed time at a given node, RLM can track the page download as it progresses and determine at each node whether to take additional actions to satisfy response time goals for the given page. This ability to make management decisions at each point in time within the context of the pageview download is a key difference between RLM and other QoS approaches.

It is crucial for RLM to model network loss and track client-server interaction at the packet level to measure and manage the entire client perceived response time te − t0 shown in Figure 4.2. Mechanisms which attempt to measure response time by timestamping server-side user space events are ineffective.
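The event node bookkeeping described above can be sketched as follows. The class names and the goal-checking rule are illustrative assumptions, not RLM's actual implementation; the node times are taken from Figure 4.3.

```python
# Minimal sketch of an event node graph: each node records an activity
# and its elapsed time since the start of the transaction (t0); links
# carry the latency type (Tconn, Tserver, Ttransfer, Trender).
class EventNode:
    def __init__(self, name, elapsed_ms):
        self.name = name
        self.elapsed_ms = elapsed_ms   # elapsed time since the initial SYN
        self.successors = []           # list of (latency_type, EventNode)

    def link(self, latency_type, node):
        self.successors.append((latency_type, node))

def needs_action(node, goal_ms, remaining_estimate_ms):
    """At this node, is the response time goal at risk? Compare elapsed
    time plus an estimate of the remaining work against the goal."""
    return node.elapsed_ms + remaining_estimate_ms > goal_ms

# The container-page portion of the download in Figure 4.3:
syn = EventNode("SYN arrival", 0)
get = EventNode("GET index.html", 75)
hdr = EventNode("index.html response header", 884)
syn.link("Tconn", get)
get.link("Tserver", hdr)
```

For instance, at the node reached 884 ms into the download, a rule could decide to adapt the remaining content if the estimated remaining time would push the total past the page's response time goal.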
Measuring response time within an Apache Web server, for example, ignores time spent during the TCP 3-way handshake establishing the connection and time spent in kernel queues before the request is given to Apache. We showed in Chapter 2 that such measurements are an order of magnitude less than the response time experienced by the remote client. Likewise, measuring and controlling the time required to service a single URL (i.e., Tserver) is simply not relevant to the remote client, who is downloading not just a single URL but an entire pageview. Latency control mechanisms must take into account effects seen at the packet level which impose latency on the remote client.

RLM allows a variety of rules to be defined and enforced at different points in an event node graph to manage response time. Each node in the graph has a set of associated characteristics that determine what types of rules can be defined. For example, when a SYN is captured by RLM at node 1 in Figure 4.3, the management decision can be based only on fields contained within the SYN packet, namely the source IP address, destination IP address, source port, and destination port. The decision cannot be based on which page is being requested, since the GET request has not yet been sent by the client. Algorithms for quickly classifying packets or clients into service classes have been previously studied and are not discussed in this dissertation. We focus instead on the management framework and on introducing a set of techniques that can be applied to each instance of a page download to reduce the remaining time to complete the pageview, in order to meet a defined response time goal. Section 4.5 provides some example rules that can be used with RLM and their impact on response time.

4.3 Connection Latency Management

One of the key types of latency that RLM must manage is the TCP connection establishment latency Tconn.
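As an illustration of the rule evaluation at node 1 described above — where only the SYN's address and port fields are available, and where service-class decisions also govern the connection latency mechanisms that follow — a classifier might look like this. The rule table, rule syntax, and class names are assumptions for illustration, not RLM's actual rule language.

```python
import ipaddress

# Hypothetical rules, checked in order; first match wins.
# (source network, destination port, service class)
RULES = [
    (ipaddress.ip_network("10.1.0.0/16"), 80, "premium"),
    (ipaddress.ip_network("0.0.0.0/0"), 80, "default"),
]

def classify_syn(src_ip, dst_port):
    """Assign a service class using only fields visible in the SYN."""
    addr = ipaddress.ip_address(src_ip)
    for net, port, svc_class in RULES:
        if addr in net and dst_port == port:
            return svc_class
    return None  # no rule matched; e.g. traffic to an unmanaged port
```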
It is especially important to understand its impact on client perceived pageview response times, since a great deal of work on controlling Web server performance has focused on applying admission control to prevent Web server overload by dropping TCP connections. However, the effect of admission control drops on the behavior of Web browsers has not been carefully studied. To determine how load shedding affects the client perceived response time on real Web browsers, we conducted a series of experiments using Microsoft Internet Explorer v6.0 and Firefox v1.02 in which we performed various types of connection rejection by dropping SYNs to emulate an admission control mechanism at the Web server. The end result was that the response time at the browser is greatly affected not only by the number of SYN drops, but also by which connection the SYN drops occur on.

Figure 4.4 depicts the behavior of TCP under server SYN drops. The client sends the initial SYN at t0, but the server drops this connection request due to admission control. The client's TCP implementation waits 3 seconds for a response. If no response is received, the client retransmits the SYN at t0 + 3 s. If that SYN is also dropped, then the next SYN retransmission occurs at t0 + 9 s. The timeout period doubles (3 s, 6 s, 12 s, etc.) until either the connection is established, the client hits stop/refresh on the browser, which cancels the connection, or the maximum number of SYN retries is reached. This is the well-known TCP exponential backoff mechanism we have been discussing throughout this dissertation.

Server SYN drops are not a denial of service, but rather a means of rescheduling the connection into the near future. Although this behavior is effective in shedding server
[Figure 4.4: SYN drops at the server — the SYNs sent at t0 and t0 + 3 are dropped; the SYN sent at t0 + 9 is accepted, the connection is established, and the GET for index.html is received at tA.]

load, it has significant effects on the response time perceived at the remote clients. Existing admission control mechanisms which perform SYN throttling simply ignore this effect and report the response time once the connection is accepted, beginning from time tA. This misrepresents both the client perceived response time and the throttling rate at the Web site.

Because Web browsers open multiple connections to the server, as shown in Figure 4.2, it is important to understand the effect of a SYN drop in the context of which connection is being affected. If only the first SYN on the first connection is dropped, then the client will experience the 3 s retransmission delay, but will still be serviced. If the first connection gets established immediately, but all SYNs on the second connection are dropped, as shown in Figure 4.5, then the client will eventually receive a connection failure after multiple retries.

[Figure 4.5: Second connection in page download fails — the first connection, established at t0, services requests until tx, when the last byte of the last embedded object on that connection arrives; the SYNs for the second connection sent at tz, tz + 3 and tz + 9 are all dropped, and the connection failure is reported at tz + 21.]

While the second connection is undergoing SYN drops at the server, our study shows that Web browsers will display an hourglass cursor on the screen, a spinning busy icon in the corner of the browser, and a progress bar at the bottom of the browser showing ‘in progress’. The browser continues to show these signs that the page is in the process of being downloaded until TCP reports the connection failure to the browser after 21 s, as shown in Figure 4.5.
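The SYN retransmission schedule described above can be computed directly. This sketch reproduces the 3 s initial timeout and the doubling behavior; the number of retries is a parameter, since it depends on the client operating system.

```python
# Sketch of TCP's SYN retransmission schedule as described above:
# the initial timeout is 3 s and doubles after each dropped SYN.
def syn_retransmit_offsets(retries):
    """Seconds after t0 at which each retransmitted SYN is sent."""
    offsets, timeout, elapsed = [], 3, 0
    for _ in range(retries):
        elapsed += timeout
        offsets.append(elapsed)
        timeout *= 2
    return offsets

def connection_timeout(retries):
    """Seconds after t0 at which TCP reports the connection failure:
    the last SYN's timeout must also expire before failure is reported."""
    elapsed, timeout = 0, 3
    for _ in range(retries + 1):
        elapsed += timeout
        timeout *= 2
    return elapsed

# With two retries: retransmissions at t0 + 3 and t0 + 9, and the
# failure reported at t0 + 21, matching Figures 4.4 and 4.5.
```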
Although all objects successfully obtained from the server are obtained over the first connection during the interval t0 through tx, the browser does not indicate the end of the page download until tz + 21, when TCP reports the failure of the second connection to the browser. For the scenario in Figure 4.5, our study of Web browsers indicates that only a partial page download occurs in practice. The browser never retrieves the first object that would have been retrieved on the second connection. The browser retrieves all other objects over the first connection, including those which would have been obtained over the second connection had it been established. Therefore, exactly one embedded object is strictly associated with the failed second connection and is not obtained. If the second connection is eventually established, the embedded object associated with it is then obtained. For example, suppose the SYN transmitted at tz + 9 is accepted by the server, the connection is established, and an object is requested and obtained over that connection. The end of the client perceived response time is then the time at which the last byte of the response for that object is received by the client. A variety of SYN drop combinations can occur across multiple connections, with varying effects on the client perceived response time. If all SYNs on the first connection are dropped, then the client is actually denied access to the server. If both connections are established, each after one or more SYN drops, then the TCP exponential backoff mechanism plays an important role in the latency experienced at the remote browser. This effect becomes more pronounced under HTTP 1.0 without KeepAlive, where each URL request requires its own TCP connection and the retrieval of each embedded object faces the possibility of SYN drops and possible connection failure.
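These combinations can be composed into a toy calculator. Times are measured from when each connection is first attempted, and a failed connection is assumed to resolve 21 s after its first SYN (the two-retry timeout); this is an illustration of the scenarios above, not a browser model:

```python
def syn_delay(drops, initial_timeout=3.0, max_retries=2):
    """Extra setup delay caused by `drops` consecutive server SYN
    drops; None means the connection fails (drops exceed retries)."""
    if drops > max_retries:
        return None
    return sum(initial_timeout * 2 ** k for k in range(drops))

def pageview_end(last_byte, conn_drops):
    """The browser signals completion only once every connection has
    finished or failed; a failure is reported 21 s after the first
    SYN. `last_byte` is the drop-free download time (illustrative)."""
    ends = []
    for drops in conn_drops:
        d = syn_delay(drops)
        ends.append(21.0 if d is None else last_byte + d)
    return max(ends)
```

For the Figure 4.5 scenario (no drops on the first connection, all SYNs on the second dropped), `pageview_end(1.5, [0, 3])` resolves to 21 s even though all delivered bytes arrived by 1.5 s.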
Although the majority of browsers use persistent HTTP, the trend for Web servers is to close a connection after a single URL request is serviced if load is high. Apache Tomcat [153] behaves in this manner when the number of simultaneous connections exceeds 90% of the configured limit, and reduces the idle timeout when it exceeds 66%. This effectively reduces all transactions to HTTP 1.0 without KeepAlive. The maximum number of SYN retries before a connection failure defines the connection timeout and depends on the operating system used by the client Web browser. In most cases, the default number of SYN retries is used. For example, Windows XP defaults to two retries, resulting in a connection timeout after 21 s. As such, RLM uses 21 s in its model as the pageview response time for a Web page request that suffers a connection timeout. Other operating systems allow more SYN retries, so the value we use is conservative: a larger value would increase the effect of connection failure on response time, exaggerating the benefit of the RLM mechanisms used for managing Tconn. On the other hand, if the browser is painting the screen in a piecemeal manner, indicating that progress is being made, clients are more likely to read the pageview as it is slowly displayed. This behavior occurs when SYN drops hit the second connection; in this situation, the pageview response time can exceed 21 s. In all cases, our study of Web browsers indicates that packet drops during connection establishment can have a significant, coarse-grained impact on pageview response time. Because of the TCP exponential backoff mechanism, any SYN drop results in a significant increase in Tconn.
Note that other types of packet drops, occurring once a TCP connection is established, do not have the same coarse-grained effect. For example, if an HTTP GET request is dropped, the client retransmits after the retransmission timeout expires, but this timeout is much smaller than the 3 s initial timeout used during connection establishment.

RLM introduces a fast SYN retransmission technique that can be used to reduce the coarse-grained effect of SYN drops. Figure 4.6 depicts the behavior of this mechanism. After a server SYN drop, RLM retransmits the SYN, on behalf of the remote client, at a shorter interval than the TCP exponential backoff. Since RLM resides within the same complex as the server and is not retransmitting the SYNs over the network, it could at most be considered a locally controlled violation of the TCP protocol.

Figure 4.6: Fast SYN retransmission.

The net effect is that a connection is established as soon as the server is able to accept the request. Since dropping a SYN at the server requires little processing, the overhead of this approach on the server complex is minimal, even when the server is loaded. Nevertheless, the retransmission gap can be adjusted based on the current load or the number of active simultaneous connections. RLM also introduces a fast SYN/ACK retransmission technique that can be used to reduce the coarse-grained effect of SYN/ACK drops. SYN/ACKs dropped in the network cause the same latency effect as a SYN dropped at the server: from the client perspective there is no difference, since in either case no SYN/ACK arrives at the client and the TCP exponential backoff mechanism applies. Figure 4.7 depicts the behavior of the RLM fast SYN/ACK retransmission mechanism.
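To see why the retransmission gap matters, consider when each strategy first reaches a server that sheds load for a short interval. A toy comparison follows; the 1.2 s recovery time and 500 ms gap are illustrative assumptions, not measured values:

```python
def established_at(server_ready, attempt_times):
    """First (re)transmitted SYN that arrives once the server can
    accept connections again; None if every attempt is dropped."""
    for t in attempt_times:
        if t >= server_ready:
            return t
    return None

# Client-driven TCP backoff retransmits at 3 s and 9 s after the
# initial SYN; RLM's fast SYN retransmission re-injects every 500 ms.
tcp_backoff = [0.0, 3.0, 9.0]
fast_syn = [0.5 * k for k in range(60)]
server_ready = 1.2  # illustrative: server sheds load for 1.2 s
```

Under pure TCP backoff the connection completes at 3 s; with fast SYN retransmission it completes at 1.5 s, i.e., essentially as soon as the server is able to accept the request.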
If RLM does not capture an ACK from the client within a timeout much smaller than the TCP exponential backoff, RLM retransmits the SYN/ACK to the client on behalf of the server. Fast SYN/ACK retransmission violates the TCP protocol by performing retransmissions using a shorter retransmission timeout period than the exponential backoff.

Figure 4.7: Fast SYN/ACK retransmission.

One can make several arguments that this is a minor divergence from the protocol. On the other hand, an Internet Web site which uses this technique to improve Tconn can rightly be labeled an unfair participant on the Internet. If deployed, the overhead, either in the network or at the remote client, is minimal. Referring to Figure 4.3, both fast SYN and fast SYN/ACK retransmission can be applied during state transitions 1→2 and 7→8 to reduce the critical path Tconn.

4.4 Transfer Latency Management

Another key type of latency that RLM must manage is the TCP transfer latency Ttransfer, which can become a dominant component of response time when the network connection between the client and the server is the bottleneck. Ttransfer is known to be a function of object size, network round trip time (RTT) and packet loss rate:

Ttransfer = f(size, RTT, loss)    (4.1)

Several analytic models of f(size, RTT, loss) have been developed [121, 38, 145]. Figure 4.8 depicts the transfer latency function defined by Cardwell et al. [38] for realistic Internet conditions of an RTT of 80 ms and a loss rate of 2% [167].

Figure 4.8: Cardwell et al. transfer latency function f for 80 ms RTT and 2% loss rate.
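The shape of such a model can be sketched numerically. The following is a deliberately coarse approximation of f(size, RTT, loss), not the actual Cardwell et al. model: the window doubles each round trip (slow start) until it reaches the steady-state window suggested by the well-known W ≈ sqrt(1.5/p) relation, after which the transfer proceeds near-linearly at roughly W packets per RTT:

```python
import math

def transfer_latency(size_pkts, rtt, loss, init_cwnd=2):
    """Coarse sketch of f(size, RTT, loss): logarithmic (slow start)
    for small objects, near-linear (steady state) for large ones.
    An approximation for intuition only, not a validated model."""
    w_ss = math.sqrt(1.5 / loss)       # steady-state window (packets)
    t, sent, w = 0.0, 0, init_cwnd
    while sent < size_pkts:
        t += rtt                       # one round trip per window
        sent += w
        w = min(2 * w, w_ss)           # exponential growth, capped
    return t
```

For an 80 ms RTT and 2% loss, this sketch reproduces the qualitative shape of Figure 4.8: small objects (under about 10 packets) finish within a few RTTs, while latency grows roughly linearly with size beyond the slow start region.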
The line indicates the expected time (y-axis) to transfer an object of a given size (x-axis). For smaller objects, in this case less than 10 packets in size, the transfer latency is dominated by TCP slow start behavior, the logarithmic part of the graph. For larger objects, the transfer latency is dominated by TCP steady-state behavior, the near-linear part of the graph. Cardwell's function models the expected time required, not the minimum time. The farther a point is from the line, the less likely it is to occur in practice. For example, it is extremely unlikely that an object of 50 packets can be transferred in under 1 s if the RTT is 80 ms and the loss rate is 2%; we label the region below the line as infeasible. The model predicts that under higher loss rates and longer RTTs, reducing object size can cut Ttransfer in half.

Since RTT and loss rate are a function of the end-to-end path from client to server through the Internet, and therefore uncontrollable, RLM is left with varying the response size as a control mechanism for affecting Ttransfer. RLM accomplishes this using two simple techniques:

1. Embedded object rewrite: translate a request for a large image into a request for a smaller image. Capture the HTTP request packet; if the request is for a large image, modify the request packet by overwriting the URL so that it specifies a smaller image, then pass the request on to the server.

2. Embedded object removal: remove references to embedded objects from container pages. Capture the HTTP response packets; if the response is for a container page, modify the response packet by overwriting references to embedded objects with blanks, then pass the response packet on to the client.

Embedded object rewrite retrieves an embedded object, but one of much smaller size than the original, reducing the response size and Ttransfer for that object.
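As a concrete illustration, an in-place URL rewrite must keep the payload length unchanged (so the TCP sequence space stays consistent) and must update the checksum, as discussed below. The sketch here pads the shorter URL with spaces and uses the standard incremental checksum update of RFC 1624; the URL mapping and function names are assumptions of this sketch, not RLM's actual code:

```python
def rewrite_get_request(payload: bytes, mapping: dict) -> bytes:
    """Overwrite the URL in an HTTP GET with a smaller object's URL,
    padding with spaces so the payload length is unchanged. (A real
    implementation would be constrained by the white space actually
    present in the packet; extra spaces before the HTTP version are
    tolerated by most servers but are a simplification.)"""
    line, sep, rest = payload.partition(b"\r\n")
    parts = line.split(b" ")
    if len(parts) == 3 and parts[0] == b"GET":
        small = mapping.get(parts[1])
        if small is not None and len(small) <= len(parts[1]):
            parts[1] = small + b" " * (len(parts[1]) - len(small))
            line = b" ".join(parts)
    return line + sep + rest

def csum16(data: bytes) -> int:
    """Internet one's-complement checksum of `data`."""
    if len(data) % 2:
        data += b"\x00"
    s = sum(int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2))
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF

def incremental_update(old_cksum: int, old_word: int, new_word: int) -> int:
    """RFC 1624 incremental update for one changed 16-bit word:
    HC' = ~(~HC + ~m + m'). Avoids re-summing the whole segment."""
    s = (~old_cksum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF
```

The incremental form matters here because RLM forwards modified packets without buffering: only the changed words need to be folded into the existing checksum.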
The tradeoff is that the quality of the content is affected, since the client sees a lower quality image. By modifying the client-to-server HTTP request, RLM can decide on a per-request basis, in the middle of a pageview download, whether or not to change the requested object size. This presumes the existence of smaller objects; for some Web sites, maintaining all or some of their images in two or more sizes may not be possible. The technique can also be applied to dynamic content, where a less computationally expensive CGI is executed in place of the original, or the arguments to the CGI are modified (e.g., a search request has its arguments changed to return at most 25 items instead of 200). Embedded object removal avoids retrieving an embedded object entirely, eliminating Ttransfer for that embedded object. This has a greater latency reduction effect than embedded object rewrite, but may further reduce the quality of the Web content displayed: instead of viewing thumbnail images, the client sees only text. Unlike embedded object rewrite, which can be applied to any image retrieval during the pageview download, the decision whether to blank out embedded objects in the container page can be made at only one point in the pageview download, when the container page is being sent from the server to the client, which is transition 3→4 in Figure 4.3. These content adaptation techniques may also reduce other types of latencies. Embedded object rewrite can reduce Tserver at the server and Trender at the client when the smaller object is also faster to serve and render. Embedded object removal can eliminate Tconn for an embedded object if a new connection is no longer required, and can eliminate Tserver and Trender if the object no longer needs to be served or displayed.
However, the resulting change in pageview response time due to these latencies depends on whether the server delivering embedded objects or the client rendering them is the bottleneck, the latter being unlikely with modern PC clients. RLM adapts content by simply modifying a packet and forwarding the modified version; it does not need to buffer packet content. RLM is not a proxy, and as such must ensure the consistency of the TCP sequence space for each connection. This means that changing the HTTP request/response is constrained by the size and amount of white space in each packet, and the checksum must be recomputed since it changes with the payload.

4.5 Experimental Results

We implemented RLM as a set of kernel modules that can be loaded into an inexpensive, off-the-shelf PC running Linux. Our kernel module approach is based on our work with ksniffer, which demonstrated significant performance scalability benefits from executing within kernel space. We present results using RLM to manage client perceived response times in both single and multiple service class environments using TPC-W [152], a transactional Web e-commerce benchmark which emulates an online book store, running on a three-tier Web architecture.

Figure 4.9: Experimental test bed.

Figure 4.9 shows our experimental test bed. It consists of seven machines: three Web clients, one RLM appliance, and three servers functioning as a three-tier Web architecture. Apache 2.0.55 [64] was installed as the first tier HTTP server and was configured to run up to 1200 server threads using the worker multi-processing module configuration.
Apache Tomcat 5.5.12 [153] was employed as the second tier application server (servlet engine) and was configured to maintain a pool of 1500 to 2000 AJP 1.3 server threads to service requests from the HTTP server, and a pool of 1000 persistent JDBC connections to the database server. MySQL 1.3 [105] was employed as the third tier database (DB) server and used the default configuration, except that max_connections was raised to accommodate the 1000 persistent connections from Tomcat. Each of the three client machines was an IBM IntelliStation M Pro 6868 with a 1 GHz Pentium 3 CPU and 512 MB RAM. The RLM machine was an IBM IntelliStation 6850 with a 1.7 GHz Xeon CPU and 768 MB RAM. The Apache machine was an IBM IntelliStation M Pro 6868 with a 1 GHz Pentium 3 CPU and 1 GB RAM. The Tomcat machine was an IBM IntelliStation M Pro 6849 with a 1.7 GHz Pentium 4 CPU and 768 MB RAM. The MySQL machine was an IBM IntelliStation 6850 with a 1.7 GHz Xeon CPU and 768 MB RAM. All machines ran RedHat Linux, with the DB server running a 2.6.8.1 Linux kernel and the other machines running a 2.4.20 Linux kernel. The machines were connected via 100 Mbps Fast Ethernet Netgear, CentreCOM, and Dell switches. We installed a modified version of the rshaper [139] bandwidth shaping tool on each of the three client machines to emulate wide-area network conditions in terms of transmission latencies and packet loss. We used a popular Java implementation of TPC-W [154] for our workload, but made two important modifications to the client emulated browser (EB) code to make it behave like a real Web browser such as Microsoft Internet Explorer. First, we modified the EB code to use two persistent parallel connections, as shown in Figure 4.2, over which the container object and embedded objects are retrieved.
These connections were not closed by the client but remained open during the client think periods (unless closed by the server). The original EB sent HTTP/1.1 request headers but actually used one connection per GET request, effectively emulating HTTP/1.0 behavior by opening a connection, sending the request, reading the response, and closing the connection. Second, we modified the EB to behave under connection failure as shown in Figure 4.5. We also used IP aliasing so that each individual EB could obtain its own unique IP address. The TPC-W e-commerce application consists of a set of 14 servlets. Each pageview download consists of the container page and a set of embedded images. All container pages are built dynamically by one of the 14 servlets running within Tomcat. First, the servlet performs a DB query to obtain a list of items from one or more DB tables; then the container page is dynamically built to contain that list of items as references to embedded images. After the container page is sent to the client, the client parses it to obtain the list of embedded images, which are then retrieved from Apache. As such, all images are served by the front end Apache server, and all container pages are served by Tomcat and MySQL.

Figure 4.10: 0.3 ms RTT, 0% loss (mean 0.26 s, 81.4th percentile; 95th percentile 1.404 s; 32314 total pages).

4.5.1 Response Time Distribution

We first present measurements running TPC-W under ideal conditions of light load and no network loss or delay; we then add network loss and delay to see the effect this has on the response time distribution. We use 200 clients, which keeps the DB server, the bottleneck resource in our multi-tier complex, at only 60% utilization.
Figure 4.10 shows the response time distribution under ideal network conditions of minimal delay and zero loss, along with the mean and 95th percentile of the client perceived response time and the total number of Web pages served. This scenario is often used for Web server performance benchmarking and QoS experimentation, but is very unrealistic for an Internet Web site. Figure 4.11 shows the response time distribution for the same experiment under 80 ms RTT and zero packet loss. The additional RTT shifts and spreads the distribution to the right as the transfer latency becomes more significant for larger pageviews.

Figure 4.11: 80 ms RTT, 0% loss (mean 0.98 s, 68.5th percentile; 95th percentile 1.90 s; 29307 total pages).

Figure 4.12 shows the response time distribution under more realistic network conditions of 80 ms RTT and 2% network loss in each direction; studies have shown that the packet loss rate within the Internet is roughly 1-3% [167]. The distribution shifts further to the right and a clearly distinguishable spike occurs just after 3 s, attributable to either the first or second connection of the pageview having an initial SYN or SYN/ACK dropped in the network. The response time distribution in Figure 4.12, not the one shown in Figure 4.10, is most likely the shape of the response time distribution for real Web clients. Any approach which claims to manage client perceived response time for Internet Web service ought to be verified under conditions found in the Internet: network latency and loss. Unless otherwise indicated, all of our experiments use the same 80 ms RTT network latency and 2% network loss.

Figure 4.12: 80 ms RTT, 4% loss (mean 1.9 s, 67.4th percentile; 95th percentile 4.452 s; 26601 total pages).
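The size of the spike just after 3 s can be sanity-checked with a simple calculation, assuming independent packet drops and two connections per pageview:

```python
def handshake_delay_fraction(loss_each_way, conns_per_page=2):
    """Assuming independent drops, the fraction of pageviews in which
    at least one connection loses its initial SYN or SYN/ACK and
    therefore incurs the 3 s retransmission timeout. An illustrative
    back-of-the-envelope estimate, not a measured result."""
    clean = (1.0 - loss_each_way) ** 2      # SYN and SYN/ACK arrive
    return 1.0 - clean ** conns_per_page
```

With 2% loss in each direction, roughly 7-8% of pageviews would have a handshake delayed by the initial 3 s timeout, which is consistent with a visible spike in the distribution.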
4.5.2 Managing Connection Latency

Table 4.1 shows how RLM can improve response times by applying fast SYN/ACK retransmission to the same experiment shown in Figure 4.12, with different SYN/ACK retransmission gaps. For each retransmission gap, we report the mean client pageview (PV) response time (RT), the mean PV RT speedup versus Figure 4.12, the 95th percentile PV RT, the percentage of pages downloaded with greater than 3 s response time, and the total number of pageviews. For example, the following RLM rule is used for a 500 ms retransmission gap:

IF IP.SRC == *.*.*.* THEN FAST SYN/ACK GAP 500ms    (4.2)

SYN/ACK gap   mean PV RT   mean speedup   95th % PV RT   >3 s PV RT   total pages
3 s           1.9 s        0%             4.45 s         22.36%       26601
1 s           1.72 s       9.5%           4.22 s         18.17%       27001
500 ms        1.64 s       13.7%          4.09 s         16.05%       27287
10 ms         1.58 s       16.8%          4 s            15.42%       27455

Table 4.1: Fast SYN/ACK retransmission.

Fast SYN/ACK retransmission results in a modest reduction in the mean response time, as much as 16.8% for the smallest retransmission gap of 10 ms. More importantly, the technique results in a much larger reduction in the number of pages with high response times, reducing the number of pages with more than 3 s response time by over 30%. In general, we would not expect this technique to significantly affect the mean response time, but rather to significantly improve those pageviews that experience a network SYN/ACK drop.

4.5.3 Managing Load and Admission Control

Figure 4.13 shows the response time distribution for running TPC-W with 550 clients, causing heavy load. The mean client perceived response time increased to 5 s from the 1.9 s for 200 clients shown in Figure 4.12. No SYN drops are occurring at the server complex; the only SYNs being dropped are those lost in the network, so the percentage of SYN drops is the same for both the light and heavy load experiments shown in Figures 4.12 and 4.13, respectively.
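RLM rules such as Equation 4.2 match clients by dotted IP patterns like *.*.*.* or 10.4.*.*. A minimal sketch of such a pattern matcher follows; this illustrates the rule syntax used in this chapter, not RLM's actual implementation:

```python
def ip_matches(pattern: str, addr: str) -> bool:
    """Match a dotted pattern like '10.4.*.*' against an IPv4
    address; '*' matches any single octet."""
    p, a = pattern.split("."), addr.split(".")
    return len(p) == len(a) and all(x == "*" or x == y
                                    for x, y in zip(p, a))
```

A rule engine would evaluate each arriving SYN's source address against the pattern of every active rule, applying the first rule whose condition holds.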
Bandwidth is at low utilization throughout the entire test bed; the increase in response time is due to increased CPU utilization within the multi-tier complex. In such a scenario, it is usually desirable to apply a load shedding technique to prevent the Web server from overloading, or simply to improve response time by reducing the load. A simple and common load shedding mechanism is to manipulate the Apache MaxClients setting [51, 93, 161]. MaxClients is an upper bound on the number of httpd threads available to service incoming connections; it limits the number of simultaneous connections being serviced by Apache.

Figure 4.13: Unmanaged heavy load (mean 5.01 s, 58th percentile; 95th percentile 11.5 s; 54193 total pages).

Figure 4.14 shows the result of setting MaxClients to 400 for the same workload shown in Figure 4.13. The spike at 5 s in the distribution represents those pageviews which incurred an initial SYN drop, resulting in a 3 s timeout on one of the two EB connections to the server (in addition to the 2 s baseline latency shown in Figure 4.12). The spike at 8 s, barely visible in Figure 4.12 but pronounced in Figure 4.14, represents those pageviews which incurred a 3 s timeout on both connections to the server. The spike at 21 s represents those clients which experienced a connection failure.

Figure 4.14: MaxClients load shedding (mean 7.86 s, 61.8th percentile; 95th percentile 19.61 s; 44087 total pages).

Table 4.2 depicts the results of setting various limits on the number of simultaneous connections served. We instrumented the TPC-W servlets to capture their response time by taking a timestamp when the servlet was called and a timestamp when it returned; this covers the time to build the container page, including the DB query, but does not include the time to connect to the server complex or transmit the response.

MaxClients   mean PV RT   95th % PV RT   Tomcat RT   total pages   server SYN drops
600          5.01 s       11.6 s         3.14 s      54193         0%
500          5.28 s       11.9 s         1.013 s     53038         4.8%
400          7.86 s       19.6 s         0.405 s     44087         18.2%
300          12.3 s       28.5 s         0.155 s     34440         30.7%
200          19.1 s       40.5 s         0.068 s     25894         43.5%

Table 4.2: Load shedding via connection throttling.

The Tomcat response time column of Table 4.2 shows that as the number of simultaneous connections decreases, the mean time to query the DB and create the container page decreases. However, when measured in terms of pageviews perceived by the client, including those pages which experienced the admission control drops, the overall mean pageview response time actually increases. Some clients experience response times that can be considered better than required, while other clients experience significant latencies due to SYN drops. The results demonstrate that this common form of load shedding is ineffective at reducing client perceived response times. Furthermore, the significant effect that SYN drops have on the response time distribution makes providing service level agreements based on meeting a threshold for the 95th percentile impossible to achieve. A common alternative to changing the Apache MaxClients is to perform SYN throttling to control the offered load on the system. We apply this technique in the context of a multi-class QoS environment. Many Web sites demand the ability to support different users with multiple classes of service and response time, such as providing buyers with better performance than casual visitors at an e-commerce site. It is often desirable to maintain a specific response time threshold for a certain class of clients.
Given a finite set of resources under heavy load, high priority clients are expected to receive better response times than if all clients were treated equally; similarly, low priority clients suffer worse response times than if all clients were treated equally. We ran TPC-W with 550 clients as in Figure 4.13, but divided the clients into one-third high priority clients from subnet 10.4.*.* and two-thirds low priority clients from other subnets. We perform typical SYN throttling (i.e., admission control) by dropping SYNs arriving from low priority clients when the high priority clients exceed their response time threshold. We employ RLM using the following rule:

IF IP.SRC != 10.4.*.* AND RT HIGH > 3.0s THEN DROP SYN    (4.3)

Figure 4.15 shows that the mean response time for the 184 high priority clients was roughly 3 s, but at a heavy cost to the 366 low priority clients. The vertical spike at 21 s for the low priority clients indicates the set of connection failures experienced by those clients.

Figure 4.15: Low priority penalties (high priority: 3.11 s mean RT, 8.4 s 95th percentile, 21634 pages; low priority: 9.53 s mean RT, 26.6 s 95th percentile, 28280 pages).

From Figure 4.12 we see that 200 clients alone receive a 2 s mean response time. As such, our 184 high priority clients leave processing capacity to spare for the low priority clients, but the high retransmission penalty significantly affects the response time of the low priority clients. RLM can improve this situation by using fast SYN and SYN/ACK retransmission. Figure 4.16 shows the effect of the following rule:

IF IP.SRC == *.*.*.* THEN FAST SYN + SYN/ACK GAP 500ms
IF IP.SRC != 10.4.*.* AND RT HIGH > 3.0s THEN DROP SYN HALT FAST SYN    (4.4)

Figure 4.16: Improvement from applying fast SYN retransmission (high priority: 3.19 s mean RT, 7.55 s 95th percentile, 21520 pages; low priority: 7.74 s mean RT, 22 s 95th percentile, 29201 pages).

In enforcing this rule, if the mean response time for the high priority clients exceeds 3 s, then incoming SYNs from low priority clients are dropped by RLM and existing low priority fast SYN retransmissions are temporarily halted. The moment RT HIGH < 3.0 s, low priority fast SYN retransmission resumes and new SYNs from low priority clients are passed to the server. Low priority clients thus have their requests processed without waiting the full TCP retransmission timeout periods. This led to a 23% improvement for low priority clients, while maintaining essentially the same response time and throughput for the high priority clients. Most importantly, the spike at 21 s in Figure 4.15 is gone, with the removal of the large number of connection failures experienced by the low priority clients. Figure 4.17 depicts an alternative distribution to Figure 4.16, based on the same rule with the addition that all initial SYNs on the first connection of the pageview from low priority clients are dropped, with a fast SYN retransmitted 3 s later and every 500 ms thereafter. By dropping all initial SYNs for low priority clients we effectively increased their think time, the interarrival time between container page requests.

Figure 4.17: Widening the think time gap (high priority: 2.65 s mean RT, 6.28 s 95th percentile, 22472 pages; low priority: 7.51 s mean RT, 18.1 s 95th percentile, 29902 pages).

This shifts the distribution to the right and has the effect of improving the 95th percentile of both service classes. Note that some low priority pages are served very fast; these pages reuse the TCP connection for the container page and are hence unaffected by SYN manipulation.
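The admission decision in Equation 4.4 can be summarized as a small feedback gate. The class, subnet, and return values below are assumptions chosen to mirror the rule, not RLM's actual interface:

```python
class PriorityGate:
    """Sketch of the Equation 4.4 policy: while the high priority
    class's mean RT exceeds the threshold, drop SYNs from low
    priority subnets and halt their fast SYN retransmissions;
    resume the moment the threshold is met."""
    def __init__(self, high_subnet="10.4", threshold=3.0):
        self.high_subnet = high_subnet
        self.threshold = threshold
        self.rt_high = 0.0           # updated by the RT monitor

    def on_syn(self, src_ip):
        high = src_ip.startswith(self.high_subnet + ".")
        if high or self.rt_high <= self.threshold:
            return "PASS"            # admit; fast SYN retx active
        return "DROP_HALT_FAST_SYN"  # shed low priority load
```

Because the gate is purely a function of the current RT measurement, low priority SYNs begin passing again the instant RT HIGH falls back under 3.0 s, just as described above.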
Reducing the arrival rate of low priority clients resulted in reduced load on the MySQL server, which in turn reduced server time and client response time. Application of this technique warrants further investigation: over-use could lead to livelock [102] or create more work for the system, offsetting the benefits of reducing the drop latencies (although, in an e-commerce environment, the front-end Web server is not the bottleneck). We have not observed any negative effects on the TCP endpoints (client or server) with respect to their ability to measure RTT and enforce their TCP timeouts and retransmissions.

Figure 4.18: Embedded object removal (mean client perceived RT vs. number of embedded objects retrieved, for 160 ms, 220 ms, and 300 ms RTT).

4.5.4 Managing Transfer Latency

We now consider managing transfer latency under conditions of variable and larger RTTs. We ran TPC-W with 200 clients so that the multi-tier complex was under modest load, and modified our environment by splitting the clients into three groups: one with 160 ms RTT, another with 220 ms RTT, and the third with 300 ms RTT. The resulting mean client perceived response times were roughly 3 s, 4 s, and 5 s, respectively. In this environment, nothing in the server complex is overloaded and no server-side SYN drops occur; as such, load shedding performed at the server will not improve response times. Figure 4.18 shows the impact of embedded object removal on mean response time as a function of the number of embedded objects removed and retrieved. In this graph, the x-axis represents the number of embedded objects retrieved and therefore not removed. For example, when x is 3, the first 3 embedded objects are retrieved while the rest are removed from the container page. No page had more than 11 objects, so the rightmost points in Figure 4.18 correspond to full downloads. The leftmost points correspond to all embedded objects being removed, reducing the mean response time to near 1 s, using the following RLM rule:

IF IP.SRC == *.*.*.* THEN REMOVE EMBEDS    (4.5)

Without embedded object removal, the difference in RTT separates the clients into three service classes when only one service class is desired. Figure 4.18 shows that different numbers of embedded objects can be removed for clients with different RTTs to provide similar response times for all clients. For example, all clients can be given the same mean response time of 3 s by doing full downloads for the 160 ms RTT clients, allowing the 220 ms RTT clients to download only two embedded objects, and having half the 300 ms RTT clients download one object (receiving 2 s response time) while the other half download two objects (receiving 4 s response time). The mix of different numbers of embedded objects for the 300 ms RTT clients is due to the discrete nature of the technique: either an object is obtained or it is not. Although discrete, embedded object removal has the advantage over embedded object rewrite that multiple copies of the same object need not be maintained, an issue if disk space is limited or charged by use. Note the large jump in Figure 4.18 when the second embedded object is downloaded. While this is partly due to the overhead of opening the second connection, the key reason is that for roughly 18% of the pageviews in TPC-W, the second embedded object is a large 256 KB image. Since it is downloaded as the first object on the second connection, it incurs not only Tconn but also TCP slow start. By configuring RLM to
remove only the second image, the response time dropped to 2.19 s for the 160 ms RTT clients, 2.85 s for the 220 ms RTT clients, and 3.68 s for the 300 ms RTT clients. The curves in Figure 4.18 are relatively flat after the second object, for two reasons: as more objects are downloaded on the same persistent connection the TCP window size increases, and fewer pages have a larger number of embedded objects. Roughly 75% of the pages contain 9 objects, and 18% of the TPC-W pages contain 10 or more. If the typical e-commerce Web site has a similar shape, then removal/rewrite could be applied in this manner, bottom to top, for the pages with the most items, so that fewer clients are affected. Figure 4.19 shows the corresponding results for embedded object rewrite. Embedded object removal is more effective at reducing response time than embedded object rewrite, but the effect is coarse-grained. The removal of the references to embedded objects must occur during the transition 3→4 in Figure 4.3, essentially eliminating states 6 through 18. In contrast, embedded object rewrite can be applied at a finer granularity as different parts of the page are downloaded. To reduce response time, one would like to apply a rule such as:

IF RT > 2s THEN REWRITE EMBEDS (4.6)

However, this may not be effective. Referring back to Figure 4.3, it is at node 5 that the browser obtains the list of embedded objects to retrieve. In our current scenario, this is after the server, which is under light load, returns the container page. At this point, the elapsed response time is relatively short, less than 1 s. It is at this moment that the EB opens the second connection and may request the large image which greatly increases the response time. Even if we decide after 1 s to rewrite the remaining objects, the time
[Figure 4.19: Embedded object rewrite. Mean client perceived response time (sec) vs. number of full-sized objects, for 160 ms, 220 ms and 300 ms RTT clients.]

required to finish downloading the large image will extend the pageview response time beyond 1 s. Indeed, the result we obtain is a response time of 2.8 s, 3.52 s and 4.55 s for the 160 ms, 220 ms and 300 ms RTT clients, respectively; this matches the download of two full-size images in Figure 4.19. This indicates the need to predict the latency contribution that a request will make to the overall pageview response time. Figure 4.20 shows the results of rewriting embedded objects if the predicted Ttransfer for that object would cause the pageview to exceed the specified elapsed time. RLM keeps an average of the Ttransfer for image downloads under different size, RTT and loss groupings, comparable to Equation 4.1. Note this does not include Trender, and as such is an underestimate of the latency associated with the image. For an RTT of 160 ms, 220 ms and 300 ms the large image is predicted to have a Ttransfer value of 6.17 s, 8.11 s and 11.05 s, respectively. Notice that at a 1 s threshold Figure 4.20 matches the response time seen in Figure 4.19 when rewriting all images - hence correctly removing the large image for the short threshold.

[Figure 4.20: Applying predicted elapsed time. Mean client perceived response time (sec) vs. predicted elapsed RT threshold (sec), for 160 ms, 220 ms and 300 ms RTT clients.]

As the threshold increases, the predictor allows the large image to be included in the pageview when it is no longer a factor in achieving the response time goal. This is seen at 7 s, 9 s and 12 s for the 160 ms, 220 ms and 300 ms clients. This technique tends to shorten the tail of the response time distribution by removing embedded objects which take longer to download.
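The rewrite decision described above can be sketched as follows. This is a minimal illustration, not RLM's actual code; the helper name and parameters are hypothetical, and the predicted transfer time stands in for RLM's table of mean Ttransfer values keyed by size, RTT and loss groupings.

```python
# Sketch: keep an embedded object at full size only if its predicted
# T_transfer would not push the pageview past the elapsed-time threshold.
# (Hypothetical helper; RLM's implementation works at the packet level.)
def should_rewrite(elapsed_s, predicted_transfer_s, threshold_s):
    """True if the object should be rewritten to its small variant."""
    return elapsed_s + predicted_transfer_s > threshold_s
```

For example, with the large TPC-W image at 160 ms RTT (predicted Ttransfer of 6.17 s) and 0.8 s already elapsed, a 2 s goal triggers a rewrite while a 12 s goal allows the full-size download.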
Figure 4.21 shows the effect of embedded object removal compared to full page downloads as the number of clients increases.

[Figure 4.21: Full vs. Empty, mean pageview response time (sec, 220 ms RTT) vs. number of clients, for full downloads, the container portion, and all embeds removed.]

Once the number of clients reaches 550, the time to download full pages and pages with no embedded objects is the same. At this point, when full pages are being downloaded the 550 clients spend half their time at the DB server obtaining the container page and half their time at Apache retrieving the embedded objects, meaning the DB server is serving roughly 275 requests simultaneously. When the embedded images are removed from the container page, the client spends no time at Apache and all its time at the DB server, effectively doubling the load on the DB server, so that it is serving roughly 550 requests simultaneously. The extra load on the DB server causes longer delays in serving the container page. As a reference point, the dotted line in Figure 4.21 shows the split between the time spent on the container page (below the line) and the embedded objects (above the line). The time required to obtain the embedded objects from Apache increases the time between DB requests to the MySQL server. As such, the complex could return a full pageview or an empty pageview at the same response time. Note that the think time in both cases is the same. This clearly shows that attempting to manage one portion of the response time instead of the overall pageview response time may not be effective.
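The per-RTT-class equalization example earlier in this section (full downloads at 160 ms, two objects at 220 ms, one object at 300 ms, to equalize mean response time near 3 s) can be sketched as a simple truncation policy. The names and the limit table are illustrative assumptions, not RLM's actual configuration format.

```python
# Sketch: per-RTT-class embedded-object limits (illustrative values
# taken from the 3 s equalization example; None means full download).
OBJECT_LIMITS = {
    160: None,  # 160 ms RTT clients get full downloads
    220: 2,     # 220 ms RTT clients retrieve only the first 2 objects
    300: 1,     # 300 ms RTT clients retrieve only the first object
}

def truncate_embeds(embed_refs, rtt_class_ms):
    """Return the embedded-object references to keep in the container page."""
    limit = OBJECT_LIMITS.get(rtt_class_ms)
    if limit is None:
        return list(embed_refs)
    return list(embed_refs)[:limit]
```

Because the choice is per object ("either an object is obtained or it is not"), intermediate mean response times for a class require mixing limits across its clients, as described above for the 300 ms group.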
4.6 Theoretical Analysis

As mentioned throughout this dissertation, a key problem is that the retransmission timeout values for SYN drops are so large that they have a disproportionate effect on client response time, yet are ignored by existing admission control mechanisms. In this section we look at this problem from a theoretical perspective, with the goal of identifying any points of interest, such as an optimal response time minimum. To simplify matters we consider the situation where the client is downloading a container page with no embedded objects over a single connection, such as when a client downloads a Postscript or PDF file. In Section 2.2 we presented Equation 2.4 defining CLIENT_RT_i, the mean client perceived response time for the transactions completing during the ith interval. For simplicity we assume k = 2, which means Equation 2.4 resolves to:

CLIENT_RT_i = ( Σ SYN-to-END + 21 R_i^3 + 9 C_i^2 + 3 C_i^1 ) / ( COMPLETED_i + R_i^3 ) + RTT   (4.7)

where the sum is taken over the individual SYN-to-END times of the transactions completing in the ith interval. Intuition and general experience suggest that the SYN-to-END time of a transaction depends on the current load in the system: as the number of currently active transactions in the system increases, the resource share allotted to each individual transaction decreases and the per-transaction service time increases (the general processor sharing model). Assume the variance of SYN-to-END is small, such that all transactions accepted in the ith interval complete in the (i + SYN-to-END)th interval. Let SYN-to-END = f(µ) be the mean SYN-to-END as a function of the given acceptance rate µ. We modify Equation 4.7 to be:
CLIENT_RT_i = ( COMPLETED_i · SYN-to-END + 21 R_i^3 + 9 C_i^2 + 3 C_i^1 ) / ( COMPLETED_i + R_i^3 ) + RTT   (4.8)

Equation 2.3 allows us to substitute Σ_{j=0}^{k} C_i^j for COMPLETED_i and rewrite Equation 4.8 as:

CLIENT_RT_i = ( (C_i^2 + C_i^1 + C_i^0) · SYN-to-END + 21 R_i^3 + 9 C_i^2 + 3 C_i^1 ) / ( R_i^3 + C_i^2 + C_i^1 + C_i^0 ) + RTT   (4.9)

which resolves to:

CLIENT_RT_i = ( 21 R_i^3 + C_i^2 (SYN-to-END + 9) + C_i^1 (SYN-to-END + 3) + C_i^0 · SYN-to-END ) / ( R_i^3 + C_i^2 + C_i^1 + C_i^0 ) + RTT   (4.10)

In order to make our optimization claims we first re-formulate Equation 4.10 under a steady-state model. Equation 2.12 gives us:

C_i^j = A_{i−SYN-to-END}^j = R_{i−SYN-to-END}^j − [DR_{i−SYN-to-END} · R_{i−SYN-to-END}^j]   (4.11)

Substituting Equation 4.11 into Equation 4.10 gives us (writing S for SYN-to-END):

CLIENT_RT_i = ( 21 R_i^3
  + (R_{i−S}^2 − [DR_{i−S} · R_{i−S}^2])(S + 9)
  + (R_{i−S}^1 − [DR_{i−S} · R_{i−S}^1])(S + 3)
  + (R_{i−S}^0 − [DR_{i−S} · R_{i−S}^0]) S )
  / ( R_i^3
  + R_{i−S}^2 − [DR_{i−S} · R_{i−S}^2]
  + R_{i−S}^1 − [DR_{i−S} · R_{i−S}^1]
  + R_{i−S}^0 − [DR_{i−S} · R_{i−S}^0] )
  + RTT   (4.12)

which presents CLIENT_RT_i in terms of DR_i, R_i^j, SYN-to-END and RTT. Since we are able to specify R^j in terms of R^0 we are able to create a steady state flow model, depicted in Figure 4.22.

[Figure 4.22: Steady state flow model. Arrivals at rate λ0 pass through drop states with probabilities p0, p1, p2 and retransmission delays of 3 s, 6 s and 12 s, ending in the served state S (latency SYN-to-END) or the give-up state G.]

Let λ0 be the arrival rate of R^0 and p_j be the drop probability for R^j. State S represents those requests which get served and state G represents those requests which do not get served (give-ups). Each node is annotated with its associated latency.
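Equation 4.10 is just a weighted average over the outcome classes, which a few lines of code make concrete. This is an illustrative sketch with hypothetical names: r3 counts the give-ups (21 s each), and c2, c1, c0 count transactions completing after two, one, or zero SYN drops.

```python
# Sketch of Equation 4.10: interval-i mean response time as a weighted
# average over give-ups (r3, at 21 s each) and transactions completing
# after 2, 1, or 0 SYN drops (c2, c1, c0). Counts for the first four
# arguments; syn_to_end and rtt in seconds.
def client_rt(r3, c2, c1, c0, syn_to_end, rtt):
    num = (21 * r3 + c2 * (syn_to_end + 9)
           + c1 * (syn_to_end + 3) + c0 * syn_to_end)
    return num / (r3 + c2 + c1 + c0) + rtt
```

With no drops at all (only c0 nonzero) this reduces to SYN-to-END + RTT; with only give-ups it reduces to 21 + RTT.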
The steady state flow model has allowed us to remove the concept of an ith interval, making the analysis tractable by removing the explicit interdependencies between intervals from the equation. Let λj be:

λ1 = (1 − p0)λ0
λ2 = p0(1 − p1)λ0
λ3 = p0 p1 (1 − p2)λ0
λ4 = p0 p1 p2 λ0

It should be clear that just as we are able to write λj in terms of pj, we are able to write pj in terms of λj:

p0 = 1 − λ1/λ0
p1 = 1 − λ2/(p0 λ0)
p2 = 1 − λ3/(p0 p1 λ0)

We'll tend to write λj in terms of pj in the following since it results in easier-to-understand formulas. We define µ as the acceptance rate into state S in Figure 4.22. We can now rewrite Equation 4.10 as:

CLIENT_RT = RTT + ( λ1 · SYN-to-END + λ2 (SYN-to-END + 3) + λ3 (SYN-to-END + 9) + 21 λ4 ) / λ0   (4.13)

subject to the constraints:

λ0 = λ1 + λ2 + λ3 + λ4
µ = λ1 + λ2 + λ3
λj ≥ 0
µ ≥ 0

The first question we answer is how we should control λj to minimize the overall mean response time. In other words, does it affect mean response time if we accept or drop a SYN based on whether it is an initial SYN, 1st retry, 2nd retry, etc.? Put another way, is it possible to minimize Equation 4.13 for an arbitrary value of SYN-to-END? We can view Equation 4.13 as:

CLIENT_RT = ( w1 λ1 + w2 λ2 + w3 λ3 + w4 λ4 ) / ( λ1 + λ2 + λ3 + λ4 ) + RTT   (4.14)

where the wj are cost weights applied to each λj:

w1 = SYN-to-END
w2 = SYN-to-END + 3
w3 = SYN-to-END + 9
w4 = 21   (4.15)

Minimizing Equation 4.13 can now be stated as determining values for λj, 1 ≤ j ≤ 4, such that the cost of

w1 λ1 + w2 λ2 + w3 λ3 + w4 λ4   (4.16)

is minimal, under the same constraints as Equation 4.13. It should be clear that the minimal solution will choose λj to be largest when wj is smallest (and vice versa).
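The minimization argument can be sketched directly: pour the serviceable portion of the arrival rate into the cheapest-weight class, and divert the rest to the give-up class. This is an illustrative sketch (hypothetical names), not a general linear-programming solver; it encodes the two cases derived below.

```python
# Sketch: allocate arrivals lam0 across the four classes to minimize
# Equation 4.16, given acceptance rate mu and weights from Equation
# 4.15. When SYN-to-END > 21 s, denial (w4) is cheaper than service.
def optimal_lambdas(lam0, mu, syn_to_end):
    if 21 < syn_to_end:                        # denial cheaper than service
        return {1: 0.0, 2: 0.0, 3: 0.0, 4: float(lam0)}
    served = float(min(lam0, mu))              # accept on the initial SYN
    return {1: served, 2: 0.0, 3: 0.0, 4: lam0 - served}

def mean_rt(lam, syn_to_end, rtt):
    # Equation 4.14: weighted average of the per-class latencies.
    w = {1: syn_to_end, 2: syn_to_end + 3, 3: syn_to_end + 9, 4: 21.0}
    return sum(w[j] * lam[j] for j in lam) / sum(lam.values()) + rtt
```

Note that λ2 and λ3 are always zero in the optimum: delaying an acceptance to a retry only adds retransmission latency.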
In most cases the minimum mean response time will be achieved when

λ1 > λ2 > λ3 > λ4   (4.17)

since in most cases

w1 < w2 < w3 < w4   (4.18)

The optimal algorithm for accepting SYNs is therefore to always accept as many initial SYNs as possible, then as many 1st retries as possible, then as many 2nd retries as possible, etc. The exception to this rule is when the ordering of the weights in Equation 4.18 changes, which can happen when SYN-to-END gets large. Since

w1 < w2 < w3   (4.19)

always holds and w4 is a constant, the only possible orderings are:

w1 < w2 < w3 < w4
w1 < w2 < w4 < w3
w1 < w4 < w2 < w3
w4 < w1 < w2 < w3   (4.20)

Assuming that it is better to serve a request at 21 s rather than deny it, the ordering of the wj in terms of SYN-to-END is:

if SYN-to-END ≤ 12 then SYN-to-END < SYN-to-END + 3 < SYN-to-END + 9 ≤ 21
if 12 < SYN-to-END ≤ 18 then SYN-to-END < SYN-to-END + 3 ≤ 21 < SYN-to-END + 9
if 18 < SYN-to-END ≤ 21 then SYN-to-END ≤ 21 < SYN-to-END + 3 < SYN-to-END + 9
if 21 < SYN-to-END then 21 < SYN-to-END < SYN-to-END + 3 < SYN-to-END + 9   (4.21)

Equation 4.21 simply states that as SYN-to-END increases, at some point the client will perceive a shorter response time if denied service rather than being provided with a response. This ordering result is general and independent of f(µ), which determines the value of SYN-to-END. This implies the following two cases:

Case 1: if 21 < SYN-to-END

λ4 = λ0
λ1 = λ2 = λ3 = 0
p0 = p1 = p2 = 1
CLIENT_RT = 21   (4.22)

Case 2: if SYN-to-END ≤ 21

λ1 = µ
λ4 = λ0 − µ
λ2 = λ3 = 0
p0 = (λ0 − µ)/λ0
p1 = p2 = 1
CLIENT_RT = ( w1 λ1 + w4 λ4 ) / ( λ1 + λ4 )   (4.23)

Although Equation 4.21 has four orderings to consider, the constraint that λ1 + λ2 + λ3 = µ and the fact that we are minimizing force us to merge the first three orderings into case 2.
If λ0 ≤ µ then all requests get accepted on the initial SYN; if λ0 > µ then there will always be λ0 − µ transactions not getting any service. Of the µ receiving service, it is always optimal to accept them on the initial SYN, unless w4 < w1 (Case 1). Since w1 < w2 < w3, dropping some portion of λ1 only to accept it later as λ2 or λ3 merely adds latency to the response time. In summary, this result states that, in the general case, to minimize response time one should either accept a request on the initial SYN or completely deny the request by dropping all subsequent SYN retransmissions. However, our experimental results under realistic workload conditions show a benefit in not following this policy. Under realistic workload conditions, bursts in the arrival rate are often followed by a period t∆ of low λ0, eliminating the need to continually drop incoming requests during t∆. In other words, the additional work imposed by the burst can be serviced after the burst, when few other requests arrive. The second question we answer is for what functions f(µ) it is possible to optimize the overall mean response time. In other words, can we reduce the acceptance rate such that the resulting decrease in SYN-to-END more than offsets the latencies associated with SYN drops? Since µ = λ1 + λ2 + λ3, the only way to reduce the acceptance rate µ is to increase λ4, the number of requests which do not receive service. We now define µ0 to be the service rate at state S. Whereas µ is the acceptance rate (the rate at which requests enter state S), µ0 is the rate at which the accepted requests are serviced. By definition, µ ≤ µ0 always holds. From the result given in Equation 4.23 in case 2, we have:

CLIENT_RT = ( w1 λ1 + w4 λ4 ) / ( λ1 + λ4 ), with λ1 = µ, λ4 = λ0 − µ   (4.24)

Assume an M/G/1 queuing model for state S.
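As a numeric sanity check on this model, a brute-force sweep of the acceptance rate µ locates the trade-off between queuing delay and the 21 s drop penalty. This is an illustrative sketch assuming the 1/(µ0 − µ) service delay used below and µ0 = λ0 = 1000 requests/s, the values used later in Figures 4.23 and 4.24.

```python
# Sketch: sweep the acceptance rate mu and find the response-time
# minimum by brute force, assuming service delay 1/(mu0 - mu) and a
# 21 s give-up penalty for the lam0 - mu denied requests.
def steady_state_rt(mu, mu0=1000.0, lam0=1000.0):
    served_delay = mu / (mu0 - mu)     # accepted requests, 1/(mu0-mu) each
    denied_delay = 21.0 * (lam0 - mu)  # denied requests give up at 21 s
    return (served_delay + denied_delay) / lam0

# Sweep mu in steps of 0.1 over (0, 1000).
best_mu = min((mu / 10.0 for mu in range(1, 10000)), key=steady_state_rt)
```

The sweep lands on µ ≈ 993.1 with a mean response time of about 0.29 s, agreeing with the closed-form optimum derived next.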
The service delay for the M/G/1 model is:

SYN-to-END = 1 / (µ0 − µ)   (4.25)

By substituting Equation 4.25 into Equation 4.24 we get:

CLIENT_RT = ( µ/(µ0 − µ) + 21(λ0 − µ) ) / ( µ + (λ0 − µ) )   (4.26)

which is equivalent to:

CLIENT_RT = 21 − 21µ0/λ0 − 1/λ0 + (1/λ0)[ µ0/(µ0 − µ) + 21(µ0 − µ) ]   (4.27)

To minimize Equation 4.27, note that the only variable in the equation that is not a steady-state constant is µ, the acceptance rate. Therefore, minimizing the last term minimizes the equation: we must determine an acceptance rate µ which minimizes the response time. It is well known that a + b ≥ 2√(ab), with equality iff a = b. If we let

a = µ0/(µ0 − µ)
b = 21(µ0 − µ)   (4.28)

then

µ0/(µ0 − µ) + 21(µ0 − µ) = 2√(21µ0)   (4.29)

iff µ0/(µ0 − µ) = 21(µ0 − µ) = √(21µ0). By solving for µ we obtain µ_opt = µ0 − √(µ0/21) as the acceptance rate which minimizes the response time for a given capacity µ0, independent of the offered load λ0. Substituting √(21µ0) for both µ0/(µ0 − µ) and 21(µ0 − µ) in Equation 4.27, we obtain the minimum possible response time:

CLIENT_RT = 21 − 21µ0/λ0 − 1/λ0 + (2/λ0)√(21µ0)   (4.30)

The other result is µ = µ0 − 1/21, the point at which the accept rate causes the response time to exceed 21 s. This suggests that when the accept rate reaches 95.2% of the capacity of the system, it is better to simply drop all requests than to provide a response whose latency will be > 21 s. The optimal drop rate is then:

p0 = 1 − µ_opt/λ0, subject to p0 ≥ 0   (4.31)

[Figure 4.23: Equation 4.27 for µ0 = λ0 = 1000; response time (sec) vs. acceptance rate µ.]

which is zero when λ0 ≤ µ_opt. This implies that as long as λ0 is greater than the optimal accept rate, the minimum CLIENT_RT can only be achieved through the use of SYN drops. This leaves us with three ranges for setting the drop rate p0:
1. If 0 ≤ λ0 ≤ µ_opt then p0 = 0; λ0 = µ and λ0 < µ0.
2. If µ_opt ≤ λ0 ≤ µ0 − 1/21 then p0 = 1 − µ_opt/λ0.
3. If µ0 − 1/21 < λ0 then p0 = 1.

Figure 4.23 is a graph of Equation 4.27 and Figure 4.24 is the same graph, zoomed in on the minimum point. Figure 4.23 shows that for values of µ << λ0 (or µ << µ0) the drop penalty significantly affects the response time. As µ increases, the drops decrease and the response time decreases. The dotted portion of the curve, to the right of µ_opt, increases sharply; here the latency associated with queuing delay dominates the SYN drop latencies. Figure 4.24 shows the minimum point at CLIENT_RT = 0.29 s

[Figure 4.24: Equation 4.27 for µ0 = λ0 = 1000, zoomed in on the minimum at µ = µ0 − √(µ0/21) = 993.099.]

when µ = 993.099, which is when the acceptance rate is at 99.3% of system capacity. The drop rate at the minimum is p0 = 0.006901. As a comparison, Figure 4.25 shows the M/G/1 service latency, 1/(µ0 − µ), overlaid onto Figure 4.24 as a dashed line. The M/G/1 curve is monotonically increasing with no minimum point, short of servicing only a single request. Suppose the M/G/1 curve were used to determine when to drop or accept requests. If a 1 s threshold were selected, then indeed the mean response time would be 1 s, including the overhead of SYN drop latencies. Yet, for the same arrival rate and system capacity, the optimal mean response time is 0.29 s, 3.4 times less than the 1 s threshold. Likewise, if the system were throttled at 90% capacity, the resulting response time would be 3.8 s, 13 times larger than the optimum. An alternative view of the RLM/server complex can be based on capacity (a fixed buffer), as shown in Figure 4.26. Let C be the maximum number of requests the server complex can process concurrently. When a request is received by the server it is allocated
[Figure 4.25: M/G/1 service latency 1/(µ0 − µ) overlaid onto Figure 4.24.]

one of C slots. Let µ0 be the service rate, with 1/(µ0 − µ) being the service latency. Suppose that the server complex is currently servicing C − 1 requests. From this state the server can either enter state C, if a request arrives before a request completes, or state C − 2, if a request completes before a request arrives. This implies that if the goal is to keep the server fully loaded in state C, then RLM should send requests to the server complex at a rate of µ0, or every 1/µ0 seconds, once the server is full. Alternatively, RLM could transmit a request whenever it detects an available slot at the server (assuming RLM knows the value of C). Consider the requests waiting at RLM to be sent on to the server at the rate of µ0. If λ0 > µ0 then some requests will not be serviced. In this case, the policy which minimizes mean response time for the client is a LIFO service policy at RLM. In other words, RLM should pass on to the server the requests which have been waiting the shortest amount of time; since there will be λ0 − µ0 requests denied service, the best choice for those requests are the requests which currently have waited the longest. So the top of the stack is popped and sent on to the server, while the bottom of the stack is dequeued as clients are denied service.

[Figure 4.26: Capacity model. RLM maintains a stack of waiting requests in front of the server's C slots; requests are forwarded from the top while frustrated clients are shed from the bottom.]

This is consistent with the flow model we presented, which suggested that a request either be accepted or completely denied service. Here, we are simply saying that as λ0 varies over time, the LIFO stack at RLM stores requests which can be serviced when λ0 again falls below µ0.
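The LIFO stack with bottom shedding can be sketched as follows. This is an illustrative structure with hypothetical names, not RLM's implementation: arrivals push onto the top, the server is fed from the top, and load is shed from the bottom, where requests have waited the longest.

```python
from collections import deque

# Sketch of the LIFO admission stack described above: forward the most
# recently arrived request; deny service to the oldest when the stack
# depth bound is exceeded.
class LifoAdmission:
    def __init__(self, max_depth):
        self.stack = deque()
        self.max_depth = max_depth

    def arrive(self, request):
        """Push a new arrival; return any requests denied service."""
        self.stack.append(request)
        dropped = []
        while len(self.stack) > self.max_depth:
            dropped.append(self.stack.popleft())  # oldest, at the bottom
        return dropped

    def dispatch(self):
        """Pop the most recent arrival when a server slot frees up."""
        return self.stack.pop() if self.stack else None
```

Dispatching at rate µ0 while shedding from the bottom matches the flow-model result: a request is either served promptly or denied outright, rather than queued into a long wait.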
This is the basic idea behind fast SYN retransmission. The analysis presented in this section complements our experimental results, which were performed using actual Web servers under realistic workloads, conditions which are difficult to analyze. Although the analysis covers a limited set of conditions, it provides a theoretical framework and insight into how such systems behave. On the Internet, for example, the arrival rate λ0 and service capacity µ0 are not constant, are difficult to measure, and depend on numerous factors. Likewise, the default accept queue throttling mechanism within Linux is not a strict M/G/1 queue with a bounded buffer length, but rather resembles a variant of random early detection. Although the analysis suggests a 'persistent drop' approach to load shedding, we found through experimentation with actual Web servers that the variance in the arrival rate λ0 allowed us to effectively apply our techniques of fast SYN + SYN/ACK retransmission to obtain the benefit demonstrated in the prior section. This can be seen if we consider the server complex as changing from one steady state to the next over time: as λ0 and µ0 change over time, so does the optimal drop rate p0. Changes in p0 at time t0 will affect p1 at time t3 and p2 at time t9, allowing us to accept SYNs which had been dropped during a prior time interval.

4.7 Alternative Approaches

If we allow modifications to the Web server then alternative approaches arise. Some or all of the functionality of RLM could be moved into the Web server complex; alternatives arise based on which functions are moved from RLM into the server. Since tracking client perceived response time requires tracking activity at the packet level, this functionality as a whole would most likely remain a single unit, either in RLM or moved to the server.
Therefore, one option would be to load all the RLM kernel modules into the Web server kernel and pass response time information up to the Web server executing in user space. The Web server could then make control decisions and/or adjustments based on the response time. For example, with enough coordination between kernel and user space, the technique of embedded object rewrite/removal could be performed by the Web server or application server, while fast SYN + SYN/ACK retransmission would be performed by the RLM modules executing in kernel space. In addition, the RLM modules could combine fast SYN + SYN/ACK with other connection management techniques, such as manipulating the accept queue and SYN hash table. RLM could also share the learned embedded object patterns with the Web server (or vice versa) so that the Web server could perform pageview management within user space. The drawback to such an approach is that the Web server would now incur the additional overhead of performing RLM duties, but in an e-commerce environment where the front-end system is underutilized this is not a problem. If the server complex consists of several front-end systems, then placing RLM functionality in each of them loses the single point of control. Another option would be to keep measurement functionality within RLM and move the management and control functions into the server complex. In such a scenario, RLM would be used only as a measurement device, reporting per-pageview response times to the server complex, which makes adjustments based on the client perceived response time. One drawback to this approach is that the ability to manage TCP connection establishment latencies would be lost, unless the server was loaded with kernel modules which performed this RLM functionality.
Likewise, if RLM reported per-pageview response times only after a page has completed downloading, the ability to make adjustments for the client in real-time, while the page is in the process of being downloaded, is lost. By splitting RLM functionality across two machines, RLM and the server, we lose the tight feedback loop between measurement and management which allows us to affect the remote client perceived response time in real-time, in an online manner.

4.8 Summary

Remote Latency-based Management (RLM) is a new approach for managing the client perceived response time of a Web services infrastructure using only server-side techniques. RLM tracks each pageview download as it occurs and makes control decisions at each key juncture in the context of that specific pageview download. RLM introduces fast SYN and SYN/ACK retransmission mechanisms and uses content adaptation in the form of embedded object removal and rewrite. These techniques can be applied based on control decisions made during the pageview download. We have implemented RLM as a stand-alone Linux PC appliance that simply sits in front of a Web server complex and manages response time, without any changes to existing Web clients, servers, or applications. Our results demonstrate the importance of measuring client perceived pageview response time in managing Web server performance, the limitations of existing admission control techniques, and the benefits of RLM's mechanisms for controlling response time to manage an overloaded Web server complex and to mitigate the negative impact of network latencies and loss. RLM provides a starting point for future work in developing a comprehensive management scheme that can provide a range of policies for controlling client perceived response times and can meet different service level objectives for different classes of service.
Many other packet manipulation techniques can be explored in this context, including manipulating the drop, delay or retransmission of URL requests and responses. While RLM provides a black-box approach to management when modifications to a Web server complex are not possible, its core management framework can be used with other, more invasive mechanisms that can be deployed in the Web server complex to achieve a greater degree of control over resources and manage overall response time.

Chapter 5

Related Work

Many approaches for measuring and managing Web server latency have been developed [94, 159, 86, 125, 53, 8, 123, 42, 30]. Each approach has contributed to the field in its own way, yet almost all consist of the following elements, which provide a simple framework for viewing the measurement and management of Web servers:

1. A service level objective (SLO), the defined set of latency goals to achieve. The SLO is often expressed as a set of policy rules and a set of classification rules that categorize clients or requests into different service classes.

2. A measure of latency which is to be controlled, often used as feedback for decision making and validation.

3. A set of effectors which can be configured to control the latency. Examples are the scheduling mechanisms on CPU, disk or network transmission queues, or an admission control mechanism.

4. A decision algorithm for properly configuring the effectors given a set of measurements and a target latency to achieve.

The two most commonly cited service level objectives are the relative and the absolute objective. Also known as proportional, a relative objective seeks to distinguish service classes by a relative factor.
For example, low-priority requests may experience at worst twice the latency of high-priority requests, yet neither the high-priority nor the low-priority requests are guaranteed a specific threshold. In an absolute objective, each class is guaranteed to experience a latency below a certain threshold. For example, high-priority requests are guaranteed to complete in under 3 s, while low-priority requests are guaranteed to complete in under 10 s. Hybrids of the two have been proposed, and the definition of what a 'guarantee' actually means varies, with penalties being assessed for missed objectives. Verma [157] provides background on policy management and different types of service level agreements. A latency management approach that is applicable to only one type of service level objective is considered weaker than an approach which is applicable across several different types of service level objectives. Often the objective chosen is the one that best fits the mechanism. For example, weighted fair queuing naturally lends itself to a proportional objective, while admission control tends to be applied when response times exceed a specific threshold. The focus of this dissertation is not on policy-related matters such as policy definition, management or evaluation. We used absolute objectives for our work in Chapter 4 with RLM simply because we felt that an absolute threshold for response time is more relevant to the remote client. Our focus on managing the shape of the response time distribution has not been seen in the types of service level objectives discussed in the literature or appearing on commercial Web sites. In this context we showed that defining service level objectives by the use of percentiles (e.g., the 95th) works poorly in the context of admission control drops, since the rejected transactions cause a significant shift in the response time distribution. Future work in service level agreement definition and
validation may incorporate many of the ideas presented in this dissertation. The focus of this dissertation has been on measuring remote client perceived response time using only the information available at the server complex. As such, in this chapter we describe existing techniques for measuring response time and indicate how our work differs from these other approaches. Existing approaches fail to capture all the key latencies (such as the latencies associated with SYN drops), or fail to capture the latencies for actual clients (instead capturing latencies for monitor machines), or simply measure the wrong thing (e.g., per-URL server response time). In this dissertation we also applied novel server-side techniques for managing the remote client perceived response time distribution. As such, we present in this chapter existing latency management approaches and contrast them with our work, citing what we feel to be their strengths and weaknesses. Each of the latency management approaches we discuss here relies on one of the inferior methods presented in the next section for measuring response time (or a variant thereof). Whatever a latency management approach chooses as its measure of latency becomes the latency that is actually managed and controlled. Often the effector being used is an effective mechanism for controlling some resource (e.g., scheduling, admission control), but it is usually applied to manage the wrong latency (e.g., server response time) or without an understanding of the effect it has on the remote client pageview response time. As such, our work is unique among the work presented in this chapter in that, unlike these other approaches, we base our latency management approach on an accurate measure of the remote client perceived pageview response time.
Doing so has led us to develop unique effectors (e.g., fast SYN + SYN/ACK retransmission) and decision algorithms specific to the problem of managing per-pageview response times (e.g., accepting or dropping a SYN based on whether it belongs to the first or second connection of a pageview download), which also differentiates our work from the existing approaches.

5.1 Measuring Client Perceived Response Time

A key focus of this dissertation has been on accurately determining the response time as perceived by the remote client. As a result, we developed a number of measurement techniques that allow us to determine latency using only information available at the Web server. We then verified our approach by validating our measure of latency against measurements taken at the client via detailed instrumentation. In this section we present existing alternative approaches for measuring latency. One approach, taken by a number of companies [69, 88, 98, 59, 149], is to periodically measure the response times obtained by a geographically distributed set of monitors. These monitors can be fully instrumented to provide a complete measurement of response time across all of the ten steps previously discussed, as perceived by the monitors. However, this approach differs from ours in at least five significant ways. First, no actual client transactions are measured; only the response times for transactions generated by the monitors are reported. Second, this approach is based on coarse-grained sampling and may suffer from statistical biases. ksniffer and RLM, on the other hand, capture and analyze all actual client transactions, not just a sample. Third, monitors are limited to performing transactions that do not affect other users or modify state in backend databases. For example, it would be unwise to configure a monitor to actually purchase an airline ticket or trade stock on an open exchange.
Although it is possible to tag the transactions from monitors as 'fake', this requires that the server complex be changed to treat such transactions differently, hence exercising different code paths than those taken by actual client transactions. Neither ksniffer nor RLM requires changes to existing systems. It would, however, be an interesting study to compare the mean response time reported by monitor machines with the mean response time reported by ksniffer for the actual client pageviews. Fourth, the information gathered by monitors is generally not available at the Web server in real-time, limiting the ability of a Web server to respond to changes in response time to meet delay bound guarantees. Both ksniffer and RLM are placed at the server and produce results which are available in real-time. Lastly, CDN providers are known to place CDN servers near the monitors used by these companies to artificially improve their own performance measurements [50]. This effect is not seen by the real clients being tracked by ksniffer and RLM. A second approach involves instrumenting existing Web pages with client-side scripting (e.g., JavaScript) in order to gather client response time statistics [131]. Like our approach, this approach can account for actual client transactions. However, client-side scripting will always consider the start of the transaction to be sometime after the first byte of the HTTP response is received by the client and the client begins processing the HTTP response (step 8). A 'post-connection' approach such as this does not account for any delays that occur in steps 1 through 7, including time due to TCP connection setup or waiting in kernel queues on the Web server. Throughout this dissertation we showed how important it is to capture TCP connection establishment latency, especially in the presence of admission control drops.
Both ksniffer and RLM have been explicitly designed and engineered to determine these latencies. Client-side scripting also cannot be applied to non-HTML files that cannot be instrumented, such as PDF and Postscript files. It may also not work for older browsers or browsers with scripting capabilities disabled. Both ksniffer and RLM can determine response times for non-HTML content, regardless of whether scripting is enabled or disabled on the client browser, and without modification to existing systems. JavaScript measurements cannot accurately decompose the response time into server and network components and therefore provide no insight into whether server or network providers would be responsible for problems. A packet level approach such as that used by ksniffer and RLM is able to provide these insights. Network behaviors such as packet retransmissions are not visible to JavaScript executing within a Web browser. A third approach for measuring response time is to have the Web server application track when requests arrive and complete service [92, 94, 86, 8]. Like our approach, this approach has the desirable properties that it only requires information available at the Web server and can be used for non-HTML content. However, this approach only measures step 6 of the total response time - the per URL server latency. Server latency measurements at the application level do not properly include network interactions and provide no information on network problems that might occur and affect client perceived response time. They also do not account for overheads associated with the TCP protocol underlying HTTP, including the time due to TCP connection setup or waiting in kernel queues.
These times can be significant, especially for servers which discard connection attempts to avoid overloading the server [159], or for servers which limit the input queue lengths of an application server [53] in order to provide a bound on the time spent in the application layer. We showed in Chapter 2 how the Apache level measure of response time can be an order of magnitude less than the client perceived response time, and in Chapter 3 we presented mechanisms in ksniffer which can account for these latencies. A variant of this approach is to measure response time within the kernel of the Web server. We showed in Chapter 2 that without tracking or modeling the latencies associated with SYN drops and retransmissions, even the best designed kernel level measure of response time can be orders of magnitude less than the response time perceived by the remote client. A fourth approach is to simply log network packets to disk, and then use the log files to reconstruct the client response time [146, 6, 37, 60, 66, 75]. This kind of analysis is performed offline, using multiple passes, and is limited to analyzing only reasonably sized log files [146]. Hence this technique cannot be used in a real-time latency management system. EtE [66] uses a packet level reconstruction approach similar to that of ksniffer and RLM to determine per pageview response time, but uses multiple pass, offline algorithms to do so. Although EtE does account for steps 4 to 9, it does not account for SYN drop latencies due to admission control or network loss. Scalability can also be a drawback of any packet capture/logging approach: the mechanism must be able to capture packets at line speed while at the same time writing all or a portion of each packet to disk. Both ksniffer and RLM execute within kernel space, avoiding packet copies to user space, and perform online analysis without the need to generate log files.
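The magnitude of the SYN drop latencies discussed above can be illustrated with a small sketch. This is not the Certes model from Chapter 2; it simply assumes the classic BSD-style SYN retransmission schedule (an initial 3 second timeout that doubles after each successive drop), which is one common client TCP behavior:

```python
# Illustrative sketch (not the dissertation's Certes model): how dropped
# SYNs inflate the client perceived connection setup time. Assumes a
# BSD-style retransmission schedule: an initial 3 second timeout that
# doubles on each successive drop (3s, 6s, 12s, ...).

def perceived_connect_latency(syn_drops, rtt):
    """Seconds from the client's first SYN until the handshake completes,
    when the first `syn_drops` SYNs are dropped by admission control."""
    wait = sum(3.0 * (2 ** i) for i in range(syn_drops))
    return wait + rtt  # the surviving SYN still needs one round trip

if __name__ == "__main__":
    for drops in range(4):
        print(drops, perceived_connect_latency(drops, rtt=0.05))
```

With two drops a client on a 50 ms path waits 9.05 seconds instead of 0.05 seconds - over two orders of magnitude - which is exactly why a server-side measure that ignores dropped SYNs can be so far below the client perceived response time.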
BLT [60] is a system that aggregates TCP and HTTP information from packet traces to produce application-layer information. BLT is a user-space program built on top of tcpdump to produce logs that are processed off-line, and is used in a number of network monitoring projects within AT&T [37]. Feldmann [60] describes many of the issues involved in TCP/HTTP reconstruction from packet traces, but does not consider the problem of measuring response time. Indeed, simply reconstructing the TCP sequence space is much simpler than accurately determining latencies as perceived by the remote TCP endpoint. In addition to the above measurement approaches, a number of analytical models have been developed for modeling TCP latencies [122, 121, 38]. For example, Padhye et al. [121] derived a model of the steady state latency of a TCP bulk transfer for a given loss rate and round trip time. This model was further extended by Cardwell et al. [38] to include the effects of the TCP three-way handshake and TCP slow start. The extended model can accurately estimate throughput for TCP transfers of a given length. These analytical models focus on modeling only a portion of the pageview response time - the TCP transfer latency. They assume a fixed packet loss rate that remains constant over time and is known a priori. These assumptions are often invalid when measuring Web server performance. For example, SYN packet loss rates may change frequently due to server load or when the Web server uses SYN drops to manipulate its quality of service. The models and algorithms presented in this thesis are explicitly designed to handle large variances in SYN drop rates over time. As we mentioned in Chapter 4, we see the static models as being useful in the prediction of the transfer latency, Ttransfer, but they may need to be adjusted and combined with actual measurements for better accuracy.
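The flavor of these analytical models can be conveyed with a much-simplified, loss-free sketch in the spirit of the Cardwell et al. extension: transfer latency as one RTT for the TCP handshake plus the number of slow start rounds needed to deliver the object. The parameter names and the initial window of 2 segments are illustrative assumptions, not the published model:

```python
import math

# Loss-free, simplified slow-start latency estimate. One RTT for the
# three-way handshake, then one RTT per slow start round, with the
# congestion window doubling each round (2, 4, 8, ... segments).

def slow_start_rounds(segments, init_cwnd=2):
    """Rounds needed to send `segments` when cwnd doubles each RTT."""
    rounds, sent, cwnd = 0, 0, init_cwnd
    while sent < segments:
        sent += cwnd
        cwnd *= 2
        rounds += 1
    return rounds

def transfer_latency(object_bytes, rtt, mss=1460, init_cwnd=2):
    """Estimated seconds to fetch an object: handshake + slow start."""
    segments = math.ceil(object_bytes / mss)
    return rtt + slow_start_rounds(segments, init_cwnd) * rtt
```

Even this toy version makes the text's point concrete: the estimate is driven entirely by RTT and object size, with the loss rate (here zero) fixed in advance, whereas SYN drop rates at a real server vary widely over time.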
5.2 Latency Management using Admission Control

Throughout this dissertation we presented the problem that existing admission control approaches have in relation to controlling client perceived response time - existing approaches ignore the effect of admission control drops on the remote client perceived response time. In this section we mention several of these systems and contrast their approaches with ksniffer and RLM. Quorum [31] is a proxy front-end that throttles URL requests into the server. Quorum is driven by the per URL server response time and reports results for only successful requests, ignoring the dropped URL requests and their impact on the pageview response time. Like us, they take a black box approach towards the server complex, requiring no changes to existing systems. However, being a user space proxy, they fail to capture kernel and network level effects on both sides of the proxy: client to/from the proxy and proxy to/from the server. The key issue is that this system will drop a URL request without understanding the context (page download) in which the object is being requested and the impact this has on the remote client. By comparison, RLM is not a user space proxy but a packet level capture and manipulation system, so it can track latencies at all network protocol levels. The pageview correlation algorithms and online event node model allow RLM to understand the context of each URL being requested. Other systems suffer from similar problems. In [86], Kanodia and Knightly combine admission control and scheduling with service envelopes to provide relative delay bounds. They measure delay in user space on the Web server and only for those transactions that are accepted, ignoring all transactions that are dropped due to admission control. In [53], Eggert and Heidemann take an application-level approach that allocates resources into two classes, foreground and background.
They relegate lower priority responses to background processing. In [42], Chen et al. present an admission control algorithm named PACERS. Admission is based on the near-future request rate and the expected service time of the request. Unfortunately, only simulations were performed. A control theoretic approach is presented by Kamra et al. [85] in which a self-tuning proportional integral controller was integrated into a front end Web proxy and then used in admission control. One key benefit of this approach is that a learning period is not required. Although they used client measured response time to validate that the controller/proxy overhead is insignificant (as compared to not using the proxy), the remainder of their testing and validation was performed using response time measurements taken within the proxy, in user space, and not at the client. This work also focused on the per URL and not the pageview response time, and failed to report the effect that admission control drops have on latency (although the drop rates were reported). Welsh and Culler [162] decompose Internet services into multiple stages, each of which can perform admission control. They monitor the response time through a stage, where each stage can enforce its own targeted 90th percentile response time. Voigt et al. [159] proposed TCP SYN policing and prioritized accept queues to support different service classes. Elnikety et al. [54] developed a DB proxy that prevents overload on the database server by performing admission control on the database queries being sent from the 2nd tier application server to the 3rd tier DB server. Although they directly manage the bottleneck resource by dropping DB queries, they ignore the effect this has on the pageview response time. In addition, these drops are performed late, after substantial resources and processing have already been invested in developing a response.
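The proportional-integral admission control idea discussed above can be sketched as follows. This is a hedged illustration, not the Kamra et al. controller: the gains, target, and the use of a drop probability knob are assumptions chosen to show the control loop shape:

```python
# Sketch of PI-style admission control: adjust the probability of
# admitting a new request so that measured response time tracks a
# target. Gains (kp, ki) and all names are illustrative assumptions.

class PIAdmissionController:
    def __init__(self, target, kp=0.1, ki=0.01):
        self.target, self.kp, self.ki = target, kp, ki
        self.integral = 0.0
        self.admit_prob = 1.0  # start by admitting everything

    def update(self, measured_latency):
        """Called once per measurement interval; returns new admit prob."""
        error = measured_latency - self.target  # positive when too slow
        self.integral += error
        adjust = self.kp * error + self.ki * self.integral
        # Admit less when we are over target, more when under it.
        self.admit_prob = min(1.0, max(0.0, self.admit_prob - adjust))
        return self.admit_prob
```

Note that the loop is driven entirely by whatever latency is fed into update(); if that measurement is a user space proxy latency rather than the client perceived pageview latency, the controller faithfully manages the wrong quantity - which is precisely the criticism made above.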
Our load shedding mechanisms in RLM perform packet stream manipulations before resources are invested by the server complex in generating a response. Cherkasova et al. [43] introduced the concept of session based admission control, the idea being that the true measure of a Web site is its ability to successfully complete sessions; they consider the cost of rejecting a client session from the perspective of the amount of computational resources wasted on the rejected session. A session is defined to be a series of pageview downloads which culminate in a sale. Kihl and Widell [89] showed that session based admission control reduces the number of angry customers, defined to be those customers who gain access to the Web site but are unable to complete their full transaction. Yet, those customers that are rejected before they have spent any time in the site are not classified as angry. We see the possibility of extending RLM to support a variety of session-based management techniques.

5.3 Web Server Based Approaches

The systems we presented in this dissertation are server-side solutions, yet require no modifications to existing server complex infrastructure. Both ksniffer and RLM rely on packet level analysis, avoiding the severe problems associated with measuring latencies within user space. What follows is a description of existing server-side approaches which fall prey to these and other problems associated with attempting to manage latency within the Web server itself. eQoS [161] is an approach for managing pageview response time by adjusting the number of simultaneous connections being serviced by the Web server. This system tracks the activity between the client and Apache by intercepting socket calls in user space within the Web server.
As such, it is unable to detect latencies due to SYN drops, time spent in kernel queues, and network RTT and loss, and is therefore unable to obtain an accurate measure of client perceived response time. As shown in Figure 2.13, the writev() system call, which writes data to a socket, returns after the data has merely been copied from user space into the kernel - not when the transmission is actually completed. This approach simply manages the server response time, not the remote client perceived response time, and could be prone to the effects we presented in Section 2.6 - throttling the number of simultaneous connections enough to cause undetected SYN drops in the kernel. On the other hand, our approach directly tracks the packet level interaction between the remote client and server, allowing us to capture latencies associated with both TCP and HTTP. The tight feedback loop between measurement and management allows RLM to manage the pageview download online, as it happens. Hellerstein et al. [51, 93, 125] are performing research into using control theory to control computer systems. In [93] they apply control theory to determine the proper value of the well known MaxClients setting in Apache to minimize response time for all users. They measure response time as the average wait time in the accept queue (ignoring the TCP 3-way handshake) plus the average number of HTTP requests per page times the Apache response time. A major contribution of this dissertation is to show the significant effects that SYN drops have on remote client response time, rather than simply ignoring them as they have. We present results in Figure 4.14 which clearly show the problems associated with throttling the MaxClients setting within Apache. The control theory approach makes the assumption that the outputs of the system are a linear function of the system inputs, which is not typically true of a Web server.
It also assumes that the system itself is time invariant, meaning that the system will always behave in exactly the same way for a given load. This is not true for a Web server whose service times change due to application changes. Sensitivity to instability is another concern. An oversensitive controller can cause oscillations in the system by rapidly changing from one extreme setting to another. Another issue is that training such a controller usually requires a training set that covers a large portion of the operating space under which the system is expected to run. RLM does not require a training period but rather can begin working once placed in front of a server complex. In [125] they are able to achieve latency goals for individual RPC requests sent to Lotus Notes by managing the length of the inbound request queue. Lu et al. [94] have also looked at the control theory approach, to provide relative delays. Pradhan et al. [129] presented a closed-loop adaptive mechanism for controlling a single tier Web server. Once again, this work was per URL based and, unlike our per pageview based work, requires modifications to the Web server complex. Other work on managing service quality on a per URL request basis includes [46, 30, 8, 3, 159, 41]. We feel that all these per URL approaches are missing the simple, highly important idea that a remote client does not download a single URL, but rather a whole pageview into his or her browser.

5.4 Content Adaptation

In Chapter 4 we presented the content adaptation techniques of embedded object rewrite and removal used by RLM to manage the client perceived response time. What differentiates our work from prior art is that the application of these adaptation techniques is driven by the goal of achieving a specified response time, as perceived by the remote client, for the currently active pageview download.
In addition, our implementation is not proxy or Web server-based, but is instead based on a system which can analyze and manipulate packet streams, in real-time, as the pageview is being downloaded. Proxy or Web server based systems do not have the ability to accurately measure client perceived response time due to their inability to capture TCP level or kernel level latencies. T. Abdelzaher and N. Bhatti in [4] provide a good overview of the many issues involved with content adaptation within the Web server, using server load as the motivation for its application. In order for a Web server to perform embedded object rewrite, an initial offline process must take place that creates multiple versions of the embedded objects, each version varying in size and fidelity. Then, in order to perform embedded object rewrite, either (a) multiple versions of the container pages must be created, with each version embedding objects of alternate sizes, or (b) the underlying file system on the Web server is changed so that when the Web server performs a sendfile() the symbolic link to that file is swapped to point to an alternate version of the file, or (c) if the container page is created dynamically, the program creating the container page must be altered to choose which size objects to embed. Similar modifications must be made to the Web server in order to support embedded object removal: multiple versions of container pages must be created in advance or modifications must be made to the scripts which dynamically create the container pages. Our server-side approach does not require changes to the existing Web server for embedded object rewrite and removal, except that it requires the existence of alternate versions of an object. In their approach, content adaptation is driven by the need for load shedding on the Web server, whereas in our approach we drive the use of content adaptation based on the latency required by the remote client.
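The two adaptation techniques can be sketched as simple HTML transformations. This is illustrative only: the "-low" filename convention for alternate object versions is a hypothetical assumption (the dissertation only requires that alternates exist), and RLM applies such transformations at the packet level, not on whole documents as here:

```python
import re

# Embedded object rewrite: point IMG references at a lower-fidelity
# variant. The "-low" suffix naming scheme is a hypothetical example.
def rewrite_to_low_fidelity(html):
    def swap(match):
        name, ext = match.group(1), match.group(2)
        return 'SRC="%s-low.%s"' % (name, ext)
    return re.sub(r'SRC="([^".]+)\.(gif|jpg|png)"', swap, html,
                  flags=re.IGNORECASE)

# Embedded object removal: blank the SRC= attribute of strictly
# cosmetic images (those whose ALT text is empty or absent), so the
# browser falls back to the alternate text and fetches nothing.
def remove_cosmetic_images(html):
    def strip_src(match):
        tag = match.group(0)
        if re.search(r'ALT="[^"]+"', tag, flags=re.IGNORECASE):
            return tag  # meaningful ALT text: leave the image alone
        return re.sub(r'\s*SRC="[^"]*"', '', tag, flags=re.IGNORECASE)
    return re.sub(r'<IMG[^>]*>', strip_src, html, flags=re.IGNORECASE)
```

The removal policy here implements the selective variant described later in this chapter: only images whose ALT attribute is "" or absent are blanked, leaving clickable or meaningful images intact.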
As a way to use latency to drive the content adaptation, they suggest the use of a server-based agent, which executes directly on the Web server machine and periodically measures the response time by obtaining a request from the Web server. This, once again, is indicative of the focus existing systems have placed on server response time rather than the response time as perceived by the remote client. Content adaptation has been applied in the context of content negotiation, where the endpoints (client and server) negotiate the content format, such as when a browser requests that images be sent as .PNG rather than as .GIF files which the browser does not support [39]. The negotiation between endpoints typically requires two requests for each object downloaded: the first request is made to retrieve the list of alternative formats, and the second request is made to actually retrieve the selected object [77]. In client-initiated content negotiation [143], the browser first obtains a list of alternate versions of the container page, then retrieves an estimate of the available bandwidth to the server from a monitoring machine (using SPAND [142]). Based on the bandwidth prediction and the sizes of the alternates, the browser estimates the transfer time for the different versions of the container page and chooses the one that most closely matches the user's requested response time. In this approach, the browser chooses between a predefined set of alternative container pages, where each variant of the container page has references to a set of embedded objects whose total size varies between the variants. Once the variant of the container page is selected there is no going back - all embedded objects are retrieved. This differs from our approach in that we track the pageview as it is being downloaded and make decisions as each embedded object is being requested.
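The client-initiated selection logic just described can be sketched as follows. The function and parameter names are illustrative assumptions, and the transfer-time estimate (variant size divided by the SPAND bandwidth prediction) is the simple model the approach implies, not a published algorithm:

```python
# Sketch of client-initiated variant selection: given the sizes of the
# alternate container pages and a bandwidth estimate (e.g. from SPAND),
# pick the richest variant whose predicted transfer time fits the
# user's requested response time. Falls back to the smallest variant
# when nothing fits.

def choose_variant(variant_sizes, bandwidth_bps, latency_budget):
    """variant_sizes: page sizes in bytes; bandwidth_bps: bits/sec."""
    fitting = [s for s in variant_sizes
               if (8 * s) / bandwidth_bps <= latency_budget]
    return max(fitting) if fitting else min(variant_sizes)
```

The key contrast with our approach is visible in the signature: the decision is made once, up front, from a single bandwidth prediction, with no opportunity to adapt per embedded object as the download proceeds.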
In addition, our approach does not require changes to Web browsers or the existence of a monitor machine to provide the bandwidth estimate to the Web browser. In server-initiated content negotiation the amount of outbound available bandwidth is used to determine the variant to return, with the goal of keeping the outbound bandwidth to at most 90% of the link capacity. In our server-side solution, we do not seek to maintain a specified bandwidth utilization. Instead, we seek to provide a specified response time. We do so by tracking the RTT and loss rate between the server and client to develop an estimate for the transfer latency of an object. We then use this estimate to predict the impact the object will have on the overall pageview response time, which is currently in progress. In our model we do not necessarily assume that the outbound bandwidth is the sole key contributor to the latency of the overall pageview download, but instead treat the component latencies of the pageview separately, and adjust as the bottleneck latency changes. The IETF has two efforts related to content adaptation. The first, named Open Pluggable Edge Services (OPES) [22, 138, 63, 137, 21], is a working group defining an architecture and set of protocols for a series of network-based proxies (termed OPES servers) capable of modifying a request or response as the message traverses from one endpoint to the other. These intermediary proxies perform well-defined transformations on the message as they forward the message on toward its intended destination, usually modifying the message by adding or modifying the HTTP headers in the request or response. In addition, they perform typical proxy-like functions such as accounting, translation, compression, redirection, and servicing requests from local cache.
In the OPES architecture each intermediary OPES service is a TCP endpoint, thus a series of application-to-application hops are performed as the request or response traverses the series of OPES servers. Such a model is quite different from our approach, which consists of a single device placed in front of the Web server for the purpose of packet stream analysis and manipulation. RLM is not a TCP endpoint or proxy, and as such, is able to capture RTT, network loss and latencies due to SYN drops and retransmissions. In OPES, the services applied to a request or response are distributed across multiple machines. In RLM, the measurement and control feedback loop is tightly bound within the same machine. We wonder about the overhead associated with proxy-to-proxy hops as defined in OPES and what effect this may have on scalability. On the other hand, our approach requires maintaining the TCP sequence number space over the connection, which in turn places limitations on the amount of modification we can perform on the HTTP request or response. The second IETF effort, the Internet Content Adaptation Protocol (ICAP) [55], is a lightweight protocol for executing a remote procedure call on HTTP messages. An ICAP client can pass HTTP messages to an ICAP server for transformation. A typical example of ICAP usage would be a proxy which sends an HTTP container page response to an ICAP server which then inserts the correct advertising links into the container page. As such, ICAP servers are user level services which perform proxy-like functions - similar to OPES but with a different API. Once again this differs from our work, which seeks to manage remote client perceived response time through packet stream analysis and manipulation, rather than provide a general use architecture for HTTP request/response transformation. However, this does raise the question as to whether or not RLM could be used in conjunction with OPES or ICAP based proxies.
If the proxy is placed between RLM and the server complex, RLM will simply treat the proxy as part of the server complex. If such a proxy were positioned between the client and RLM, then RLM would consider the proxy to be the remote client (i.e. the remote TCP endpoint). In such a case, RLM would be measuring and managing response time to the proxy and not to the remote client. In Section 6.1 we consider the possibility of having distributed RLM machines, each associated with a distributed set of OPES or ICAP proxies. Related to embedded object removal is the optional HTTP ALT="string" attribute. Used by most Web sites, it allows a non-graphical browser to substitute string in place of the referenced image:

<IMG SRC="sell.gif" ALT="[sell icon]">

In cases where the image is strictly cosmetic (not clickable) the string is usually set to the NULL string "" so that the user sees nothing at all. The embedded object removal technique we describe blanks out the SRC= portion of the reference so that the browser is left with:

<IMG ALT="[sell icon]">

As such, this forces the browser to use the alternate text string where the image would have been displayed. One policy for embedded object removal would be to blank out only the non-clickable or strictly cosmetic images, whose ALT attribute is the empty string "" or is absent.

5.5 TCP Level Mechanisms

ksniffer and RLM track behaviors at a packet level, using an understanding of how TCP/IP behaves under various conditions. Other research has sought to control or manage latencies by examining the low-level behaviors of TCP/IP. Schroeder et al. [141] show how changing the outbound packet scheduler in Linux from the default (FCFS) to a Shortest-Remaining-Processing-Time-first (SRPT) scheduler can reduce mean response times for all clients in an overloaded Web server. However, they do not consider multi-class scheduling, nor dynamic adjustments to the scheduler.
They focus on comparing the two static scheduling approaches for a Web server serving only static pages. Since no content is dynamically generated, the bottleneck in the system is known a priori to always be the transmission link between the server and clients, and they use file size as the predictor of processing time, hence ignoring RTT and loss. Packets are scheduled on the outbound link based on which file the packet is from, and how many bytes remain from the file to be transmitted. Their traffic generator emulates a user as a single, persistent connection over which multiple requests are issued, which does not match the behavior of a typical end user in the Internet using IE or Netscape, which open multiple persistent connections. They also measured client response time at the traffic generator machine on a per request basis, not a per pageview basis. Crovella et al. [49] schedule outbound bandwidth based on SRN, where the byte count of the response is used as the response length - this is applied when the bandwidth at the server is the bottleneck and also does not take into account RTT and loss. Barford and Crovella used critical path analysis to examine the bottlenecks that exist for a single TCP connection [25]. Jamjoom and Shin [82] showed that the best approach to SYN dropping is all or nothing - accept the initial SYN, or drop it and all its retransmissions. With Pillai [81] they presented Abacus, a modified token bucket filter, and showed under simulation how it smoothed out the synchronization effects associated with SYN retransmissions. A fast retransmission mechanism that violates TCP semantics was presented in the context of wireless networking [16, 18] to alleviate latencies associated with loss due to the physical medium. However, this was not applied to SYN and SYN/ACK processing but only to data transfer over established connections. Williamson and Wu [163] detail the effects of individual packet drops (i.e.
SYN drops) on the transfer time for a pageview download, motivating the need to merge HTTP level information into TCP/IP to reduce timeout and retransmission latencies. Nahum et al. [107] discuss the effects of RTT and loss over the WAN. Offline, multiple pass pageview reconstruction from packet dumps was performed in [66]. Bhatti et al. have studied users' tolerance for delay [29]. Pradhan et al. [128] showed under simulation that a portal router residing in front of a Web server could reduce the TCP slow start time for short lived connections by modifying the receiver window size within client-to-server TCP packets, causing the server to skip slow start and immediately send larger numbers of packets. In addition, they effectively reduced the server-perceived RTT by transmitting ACK packets to the server immediately from the portal server, before the data packet is received by the remote client. Since their focus was on reducing TCP slow start, they did not consider mechanisms such as fast SYN + SYN/ACK retransmission for connection establishment as we presented here.

5.6 Packet Capture and Analysis Systems

ksniffer and RLM have a basic dependence on fast packet capture. In Chapter 3 we showed the scalability of ksniffer by tracking and reconstructing pageview response times up to gigabit bandwidth rates. This exceeds the functional capabilities of existing systems at that rate. Many packet capture and analysis systems have been developed over the years [48, 47, 75, 109, 151, 97]. Although none of these systems perform protocol reconstruction to determine the remote client per pageview response time, they can be compared to ksniffer and RLM in their design and engineering. Gigascope [48] is a general traffic monitor designed for high-speed IP backbone links. Gigascope is a compiled query system: the user submits a set of SQL-like queries and the system generates C/C++ code that is compiled, linked to a runtime system and then executed in user space.
Queries filter the packet stream into smaller streams which are stored in a data warehouse for offline report generation. They demonstrate scalability, but not at the level of online functionality that ksniffer and RLM provide. In this dissertation we have not sought to develop a general packet analysis and filtering system, but rather have sought to solve the problem of determining an accurate measurement of the client perceived response time, on a per pageview basis, using only server-side information. It might be possible to extend a system like Gigascope with the algorithms that are presented here, but most of the algorithms in this dissertation cannot be expressed as SQL queries (e.g., tracking the state of a TCP connection). Aguilera et al. [6] use passive packet capture for performance debugging of distributed systems. They treat each machine in the distributed system as a black box and then use offline statistical methods to infer the relationships between machines based on traffic patterns. ksniffer and RLM do not use statistical methods, but rather perform protocol reconstruction, examining each packet's relationship to the overall transaction. We do see that statistical pattern matching could be applied within ksniffer and RLM, but at a higher level, to determine historical patterns of behavior, which in turn could be used in prediction. NetQoS SuperAgent [109], a commercial product, is a passive traffic monitor that, like us, provides information such as round-trip times to remote clients and measures the components of response times that are due to packet retransmissions. SuperAgent also analyzes certain TCP-based applications such as Oracle and Lotus Notes. They report response times on a per URL request basis instead of the per pageview response time reported by ksniffer and RLM. Packeteer [120] sells an inline traffic monitor/shaper appliance.
Instead of connecting the device to a mirrored port, the device is plugged inline like a router, as RLM is. Although the device can monitor a large number of protocols, it does not reconstruct pageview response times, but only reports response time on a per URL request basis. It counts the number of dropped SYNs, but does not track the latency associated with them. Their focus is mostly on monitoring and shaping traffic in an enterprise network. The top model supports up to 500 Mbps and costs roughly $27,000 per unit. They also sell a centralized controller (software which runs on Windows) that can collect and aggregate information from each monitor device. Hypertrak from Trio Networks [111], a similar system, logs each response time entry into an Oracle database, from which reports can be generated. Using a third party database they can determine the geographic location of a remote client from his/her IP address. Their software performs filtering at the device driver, and runs on Red Hat 7.1. Similar to Packeteer, they can deploy multiple boxes within a server farm to achieve scalability: a central controller collects and aggregates information to develop final reports. The ability to collect and correlate information from multiple devices was not within the scope of this dissertation. We see this functionality as being useful, but not a research challenge. Most other network traffic monitoring systems have focused on improving scalability through packet filtering. The assumption here is that the amount of network traffic of real interest to the monitor is actually a small subset of the volume of the traffic as a whole. They are thus concerned with efficient techniques for filtering 'irrelevant' packets to reduce the processing load of the monitor. Systems falling into this category include the work by Mogul et al. [103], the BSD Packet Filter (BPF) [97], the CSPF language [103], tcpdump [80], libpcap [151], and Ethereal [58].
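The filtering model these systems share can be illustrated with a small sketch (a toy stand-in for BPF-style filtering, not any of the cited implementations): a filter specification is compiled into a predicate, and packets failing the predicate are discarded before any expensive analysis.

```python
def compile_filter(proto=None, port=None):
    """Compile a simple filter spec into a packet predicate.
    A toy stand-in for BPF-style filtering, not BPF's actual VM."""
    def match(pkt):
        # pkt is a dict, e.g. {'proto': 'tcp', 'src_port': 1234, 'dst_port': 80}
        if proto is not None and pkt.get('proto') != proto:
            return False
        if port is not None and port not in (pkt.get('src_port'), pkt.get('dst_port')):
            return False
        return True
    return match

def capture(packets, flt):
    """Keep only packets of interest, discarding 'irrelevant' traffic early."""
    return [p for p in packets if flt(p)]
```

The contrast with ksniffer is that, for a busy Web server, almost every packet passes such a predicate, so the cost moves from filtering to the per-packet protocol reconstruction itself.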
Subsequent research efforts on packet filters have focused on improving performance, for example by merging common predicates in filters [165], redesigning filtering engines [15], using dynamic code generation [57], and through just-in-time compilation [27]. For high-volume Web sites, however, the vast majority of the traffic is of interest, and thus we require efficient techniques for processing as much of it as possible. In this vein, the use of programmable Ethernet adapters that can copy packets directly into user space is of interest [56, 32]. The key drawback to these devices is their expense. In Chapter 3 we showed scalability up to the gigabit level using just off-the-shelf hardware. 5.7 Services On-demand Server farm management systems that allocate resources on-demand to meet specified response time goals are receiving much attention [134, 9, 96]. The ability of a Web hosting center to move CPU cycles, machines, bandwidth and storage from a hosted Web site that is meeting its latency goal to one that is not, is a key requirement for an automated management system. B. Urgaonkar and P. Shenoy state in their Cataclysm [156] paper that 'a hosting platform can take one or more of three actions during an overload: (i) add capacity to the application by allocating idle or underused servers, (ii) turn away excess requests and preferentially service only important requests, and (iii) degrade the performance of admitted requests in order to service a larger number of aggregate requests'. They did not consider a fourth option, which we presented in this dissertation and which was also presented by T. Abdelzaher and N. Bhatti in [4]: reducing the amount of work required for each pageview. Throughout this dissertation we have assumed a fixed amount of physical resources is available within the server farm (i.e., a black box). Nevertheless, on-demand allocation decisions must be based on accurate measurements.
Over-allocating resources to one hosted Web site results in an overcharge to that customer and a reduction in the available physical resources left to meet the needs of the others. Under-allocation results in poor response time and unsatisfied Web site users. The ability to base these allocation decisions on a measure that is relevant to both the Web site owner and the end user of the Web site is a competitive advantage. Likewise, the opportunity to apply content adaptation instead of throwing more physical resources at the problem could lead to a significant reduction in cost. Merging our work on ksniffer and RLM with an on-demand resource management system is potentially interesting future work. 5.8 Stream-based Systems There has been much work related to analyzing data streams, which has focused on information retrieval and not actually applying the information towards managing Web server latency. In this context, the packet stream into and out of the Web server complex can be viewed as a data stream which is to be mined. Several research groups have taken the approach of developing Data Stream Management Systems (DSMS), capable of performing persistent SQL-type queries over continuous data streams (i.e. performing selects and joins over multiple streams). The basic idea is to take the fundamentals of database systems and extend them into the realm of data streams. These extensions include making the queries persistent over time (rather than a simple request/response on a set of fixed-size tables), adding functionality for handling relational tables which are unbounded in length (via the use of windows) and adding support for time-related predicates. The largest research group addressing DSMS is located at Stanford [72]. They have published work related to building general-purpose DSMS [72], data stream query languages [10], operator scheduling [11], load shedding queries [13], and clustering [74, 73, 114].
So far, their only application of this system has been towards network traffic management [14]. Other work has been focused on optimizing the query engine by sharing and ordering selection predicates [40, 95]. One major benefit to these systems is being able to use a well-known existing API, such as SQL queries, to obtain information about the data streams. This provides the user the ability to compose SQL queries on the fly to obtain specific information by using select and join. On the other hand, drawbacks exist. First, such a system is fairly large and complex, requiring multiple components, such as an SQL parser, optimizer and query processor. Second, SQL queries alone are insufficient for modeling complex entities such as a TCP connection. In order to track TCP connection state, one must maintain state information and make complex decisions with respect to prior and current events. SQL queries alone are not amenable to this problem. Indeed, Gigascope [47] provides the capability for installing user-built custom modules to handle exactly this. Other modeling tasks that cannot be expressed well using SQL include determining remote client subnet membership via longest prefix matching and performing longest path matching on a URL. As such, these systems are different from the work presented in this dissertation. Our systems were designed to solve a specific problem, whereas these systems appear to be designed more for browsing streams to identify any interesting or useful patterns or information. Although further afield from our work, the problem of clustering and modeling online numeric data streams presents engineering challenges that are similar to the problem of tracking and modeling Web transactions (i.e. packet streams).
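To make the contrast concrete, here is a minimal sketch (hypothetical, not the dissertation's implementation) of the two computations just mentioned, tracking TCP connection state and longest prefix matching, both of which require the kind of imperative state or hierarchy that plain SQL queries express poorly.

```python
class ConnTracker:
    """Follows the TCP 3-way handshake from observed packet flags.
    Toy sketch; a real tracker also handles sequence numbers,
    timeouts, retransmissions and simultaneous opens."""

    def __init__(self):
        self.state = {}  # (client, server) -> connection state

    def on_packet(self, client, server, flags):
        key = (client, server)
        cur = self.state.get(key, 'CLOSED')
        if 'RST' in flags or 'FIN' in flags:
            cur = 'CLOSED'
        elif cur == 'CLOSED' and flags == {'SYN'}:
            cur = 'SYN_SENT'        # client SYN observed
        elif cur == 'SYN_SENT' and flags == {'SYN', 'ACK'}:
            cur = 'SYN_RECEIVED'    # server SYN/ACK observed
        elif cur == 'SYN_RECEIVED' and 'ACK' in flags:
            cur = 'ESTABLISHED'     # final ACK completes the handshake
        self.state[key] = cur
        return cur

def longest_prefix_match(ip_bits, table):
    """Return the value for the longest prefix in `table` matching
    `ip_bits` (both given as bit strings).  Linear scan for clarity;
    production code would use a trie."""
    best = None
    for prefix in table:
        if ip_bits.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return table[best] if best is not None else None
```

The tracker's transition logic depends on the prior event for the same connection, which is exactly the kind of stateful decision that Gigascope delegates to user-built modules.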
Such systems [166, 62, 74, 20, 5, 150] require high-speed incremental update, compact memory representations whose memory requirements grow slowly with the number of entities being tracked, and the ability to work well in the presence of incomplete information. An example would be a system that analyzes real-time sensor data. Barbara et al. [19] and Babcock et al. [12] both provide a good overview of such issues. 5.9 Internet Standards Activity We now end the related work section by mentioning the work being done by relevant standards bodies. As mentioned earlier in this dissertation, no standards body is tackling the problem of developing a standard definition of, or methodology for, measuring remote client perceived pageview response time. The Internet Engineering Task Force (IETF) [2] is a large open international community of network designers, operators, vendors, and researchers concerned with the evolution of the Internet architecture and the smooth operation of the Internet. The mission of the IETF is to produce high-quality, relevant technical and engineering documents that influence the way people design, use, and manage the Internet in such a way as to make the Internet work better. These documents include protocol standards, best current practices, and informational documents of various kinds. Transactions over TCP (T/TCP) [34, 35] is a set of RFCs concerned with the latency problems associated with the TCP 3-way handshake. They propose the TCP Accelerated Open (TAO) mechanism for eliminating the TCP 3-way handshake delay. In TAO, the request and response are sent in the payload of the SYN and SYN/ACK packets, respectively. Each endpoint accepts the other's sequence number if the sequence number is greater than the last sequence number seen from that host. This requires that each endpoint maintain a per-host cache of the last sequence number received from that host.
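Following the description above (which simplifies the connection-count machinery actually specified in RFC 1644), the per-host cache test can be sketched as:

```python
class TaoCache:
    """Sketch of the TAO test as described above: a SYN carrying data is
    accepted without waiting for the 3-way handshake only if its initial
    sequence number exceeds the last one cached for that host.  A missing
    cache entry falls back to the normal handshake.  Simplified from
    RFC 1644, which uses per-connection counts (CC values)."""

    def __init__(self):
        self.last = {}  # host -> last accepted initial sequence number

    def accept_syn(self, host, seq):
        prev = self.last.get(host)
        ok = prev is not None and seq > prev
        if prev is None or seq > prev:
            # cache the newest value (in RFC 1644 the update happens
            # once the connection is validated)
            self.last[host] = seq
        return ok
```

The monotonicity check is what lets the server safely deliver the SYN's payload to the application without waiting for the third handshake packet: an old duplicate SYN would carry a stale number and fail the test.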
This was proposed in 1992 and has never caught on for a variety of reasons, including the requirement to modify existing TCP implementations, C libraries, and user-level applications. The existing TCP specification does allow a data payload to be contained in a SYN packet, but data cannot be presented to the user application until the 3-way handshake is completed; hence, TAO proposed the use of cached sequence numbers to eliminate the need to wait for the third part of the handshake to complete. Regardless, no existing TCP implementation provides the capability to include a data payload in an initial SYN packet. This is something of an artifact of the sockets API and its system calls. A user application must first call connect() to establish the connection to the server (3-way handshake), then call writev() to send data. Even if the call to connect() is non-blocking and a writev() is executed immediately after returning from connect() in hopes of appending the data as payload to the initial SYN packet, the underlying TCP implementation will return an error to the user application on the call to writev(). By contrast, our approach does not seek to change the TCP protocol, nor does it require changes to existing TCP implementations. The Interconnect Software Consortium (ICSC) [1] was formed to develop and publish software specifications, guidelines and compliance tests that enable the successful deployment of fast interconnects such as those defined by the InfiniBand specification. The Extended Sockets API [71] presented by the ICSC proposes a new set of event-driven, asynchronous socket calls to improve the performance and scalability of the sockets API. On the server side, the Web server will be able to accept a large number of incoming connections at once by making a single call to accept().
This reduces the number of system calls required to handle all the incoming connection requests since they are grouped into a single call. On the client side, the browser has little added benefit: an asynchronous connect() call instead of a blocking or non-blocking call. In either case, the TCP 3-way handshake is still required before data can be transferred over the connection. The work presented in this dissertation is essentially neutral with respect to the proposed Extended Sockets API. Chapter 6 Conclusion This dissertation shows that it is possible to determine the remote client perceived response time for Web transactions using only server-side techniques and that doing so is useful for the management of latency-based service level agreements. First, we presented Certes, a novel modeling algorithm that accurately estimates connection establishment latencies as perceived by the remote clients, even in the presence of admission control drops. We presented a non-linear optimization that models the effect that the TCP exponential backoff mechanism has on connection establishment latency. We then presented an O(c) time and space online approximation algorithm as an improvement over the non-linear optimization. We implemented the fast online approximation algorithm on a Web server and performed numerous experiments validating that the latencies determined by our online model were within 5% of the latencies observed at the remote client. In addition, through the use of kernel-level accept queue limit adjustment on the Web server, we showed how existing techniques of admission control which ignore the latency effects of SYN drops can have the exact opposite effect on response time from the one they intend, making response time worse for the remote client rather than improving it.
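The core accounting behind this result can be caricatured in a few lines (a hypothetical sketch, not Certes' actual non-linear optimization or O(c) online algorithm): if the server knows how many connections completed after k dropped SYNs, each class contributes the cumulative exponential-backoff delay (3s, 6s, 12s, ... in classic BSD-derived TCP) that the server itself never observes.

```python
def est_connect_latency(counts, base_timeout=3.0):
    """Estimate the mean backoff component of client perceived
    connection-establishment latency.

    counts[k] = number of connections whose SYN was dropped k times
    before being accepted.  Each drop costs one retransmission timeout,
    doubling per attempt.  Hypothetical sketch; ignores the RTT portion
    of the latency and is not Certes' actual algorithm.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    acc = 0.0
    for k, n in counts.items():
        # cumulative client-side wait before the accepted SYN arrived
        delay = sum(base_timeout * (2 ** i) for i in range(k))
        acc += n * delay
    return acc / total
```

A server that only measures accepted connections would report zero for all of this latency, which is precisely why server-side logs understate the client's experience under admission control.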
The basic idea that a SYN drop does not deny service, but rather postpones service of a request, has significant impact on all prior and future work involving admission control. Second, we presented ksniffer, an intelligent traffic monitor which accurately determines the pageview response times experienced by a remote client without any changes to existing systems or Web content. We presented novel algorithms for inferring the remote client perceived response time on a per pageview basis which take into account network loss, RTT, and incomplete information. By noticing gaps in TCP retransmissions we are able to infer the existence of a network packet drop affecting the pageview response time, without the dropped packet ever being captured by our system. Our algorithms perform embedded object correlation, online, in the absence of Referer: fields and in the presence of multiple simultaneous pageview downloads to the same remote IP address. Our kernel-level design and implementation was shown capable of tracking packet streams at near gigabit rates, far exceeding the functionality and performance of existing systems. The basic idea that a client downloads an entire pageview and not just a single URL, and that what matters to the client is the response time as measured at the remote browser and not the server response time, has profound impact on all prior and future work involving response time measurement; it also impacts the basic concepts surrounding how Web server performance is benchmarked, evaluated, managed and controlled. Third, we presented Remote Latency-based Management (RLM), a system that controls the latencies experienced by the remote client by manipulating the packet traffic into and out of the Web server complex.
RLM tracks the progress of each pageview download in real time and manages latencies dynamically, as each embedded object is requested, making fine-grained decisions on the processing of each request as it pertains to the overall pageview latency, as perceived by the remote client. Our system is based on a novel event node graph that models the pageview download as a set of specific, well-defined activities with latency, precedence and critical path attributes. Using packet manipulation techniques such as fast SYN + SYN/ACK retransmission and embedded object rewrite and removal, we are able to manipulate not only the mean pageview response time but also the shape of the pageview response time distribution, without modifying the Web server complex or Web content. The basic idea that the QoS given to an individual URL request should be based on the context of the pageview it is currently being downloaded for has significant impact on how Web server performance is viewed, evaluated and managed. An important aspect of our approach has been to experimentally validate our ideas with realistic workloads under Internet conditions of network loss and delay, which can have significant impact on response time. Toward this end, we have also implemented various techniques to make our traffic generator behave as realistically as possible. By studying the behavior of existing Web browsers, such as Microsoft's Internet Explorer, we were able to uncover some notable effects that occur under failure conditions and incorporate this behavior into our own traffic generator. Likewise, we discovered behaviors in traffic generators used by the research community today that do not mimic the behavior of Internet Explorer at all.
For example, something as simple as the traffic generator closing the connection instead of keeping it open until the Web server closes the connection (which is how IE works) has significant impact on latency, load on the server, number of simultaneous idle connections at the server, and the perception of load by an admission control mechanism. The basic idea that a traffic generator used to validate a latency management approach should mimic the behavior of real Web browsers under all conditions has significant impact on prior and future research involving Web server benchmarking and latency management. 6.1 Future Work This dissertation provides a foundation for future work in managing Web server performance and also raises some fundamental questions to consider as follow-on research. The TCP exponential backoff mechanism as it is currently defined and implemented for connection establishment is antiquated and needs to be changed. We predict that future implementations of TCP will have a different mechanism, based on a change or extension to the TCP protocol specification. Indeed, what we learned about the effect that SYN drops have on connection establishment latency applies to any new protocols being developed for the Internet which do not rely on TCP as their underlying transport mechanism (e.g., peer-to-peer, chat). New protocols ought to be developed with the understanding of how retransmissions, similar in nature to those due to SYN drops, can affect overall response time. Perhaps another transport-layer protocol will be developed for the Internet and gain wide acceptance; one which is a hybrid of TCP, UDP, etc. Instead of dropping a SYN (if the protocol does have the concept of a SYN), the protocol may return a packet indicating to the client when it ought to re-attempt connection establishment. Unfortunately, it may take many years for a new version of TCP or a new transport protocol to be deployed throughout the world.
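The benefit of an explicit retry hint over silent SYN drops can be illustrated with a small comparison (hypothetical numbers; the 3-second initial timeout and per-attempt doubling follow classic BSD-derived TCP behavior):

```python
def backoff_wait(drops, base=3.0):
    """Total client-side wait imposed by TCP exponential backoff after
    `drops` consecutive SYN losses: 3s + 6s + 12s + ..."""
    return sum(base * (2 ** i) for i in range(drops))

def hinted_wait(drops, hint):
    """Wait if the server instead told the client when to retry:
    the client sleeps exactly `hint` seconds per rejection."""
    return drops * hint

# After three rejections, backoff costs 3 + 6 + 12 = 21 seconds of
# client perceived latency; a 1-second retry hint would cost only 3.
```

The point is that backoff delays are fixed by the protocol regardless of how soon the server could actually accept the connection, whereas a hint lets the server shape the retry schedule to its real capacity.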
A guiding principle when developing systems is to layer and isolate network protocols from one another, with each layer providing a well-defined yet somewhat limited API. This can be extended from the network into the server, with the kernel being one layer and the HTTP user-level process being another. This is considered a tried-and-true method for making the development of complex systems more tractable. What we found, of course, is that this makes for considerable difficulties when trying to measure and correlate activities across multiple layers. Each layer below hides information from the layer above, with little or no access to any latency measurement of real value. This layering concept needs to be re-evaluated; at the very least, new protocols being developed ought to be designed with the explicit goal of making accurate latency measurement a key design point. As Web services become more distributed, measuring and managing the client perceived response time becomes more difficult. If any portion of a pageview can be obtained from any computer on the Internet, then the only single centralized point of measurement and control will be the Web browser itself. It will be up to the browser to choose from which computer to obtain each portion of the pageview. Yet, Web browsers may not have access to the information required to make an intelligent choice of which source to use. Likewise, as content becomes ever more distributed, Web sites lose their ability to control the client experience. In such a scenario it may be the Internet infrastructure itself which measures performance and routes requests based on the content locations, load on those machines and available or projected bandwidth on the paths to those machines. This implies an Internet where the browser can place a request for an object or data onto the Internet infrastructure and receive a response without specifying where to obtain it from.
Such global load balancing may not be possible, but the infrastructure of the Internet could at least be extended to be more supportive of the measurement and management of Web transactions. Understanding client perception and behavior is key to a Web site's success. As such, additional research needs to be focused on managing the perceptions and behaviors of the remote client. Future browsers ought to be able to measure, analyze and relay their users' perceptions and intentions to the Web server complex, working in conjunction with the server complex to provide the best possible experience for the client when visiting the Web site. A full and thorough examination of how current browsers behave under various conditions and how clients react may lead to the development of better browsers, not just better Web site content [36]. A Web server in such a partnership with Web browsers would be pageview-centric, optimizing the pageview downloads of all currently active clients, given the current and expected load, RTT and loss to each client, and prediction of near future behaviors of each client. On-demand resource allocation promises the ability to shift resources, in real time, to exactly where they are needed. Such environments require accurate measures of response time as input to the allocation decision. In addition, research in this area needs to consider and address the issues involved in the cost trade-off between the choice of redirecting resources and the choice of adapting content or computation to the given set of resources. No industry standard method for measuring pageview response time exists today, which makes scientific evaluation and comparison of existing techniques difficult at best. A well-defined standard for the measurement of client perceived response time needs to be developed. This would allow for a more meaningful comparison between latency management approaches.
Such a standard may spark the development of new service level agreements based on remote client perceived response time (existing service level agreements only contractually bind to availability objectives). Given such a standard, verification tools could then be written to validate a system's ability to measure and manage the client perceived response time. Given the number of packet switching devices on the Internet and available for use in a multi-tier server complex, other packet manipulation techniques that can manage latency or control the server complex ought to be explored. We showed in Chapter 4 how embedded object removal shifted load from the front-end Apache server to the backend DB server. Other techniques could exist that control the behaviors of remote clients, browsers or Web servers. Bibliography [1] The Interconnect Software Consortium. http://www.opengroup.org/icsc/. [2] The Internet Engineering Task Force. http://www.ietf.org/. [3] Tarek Abdelzaher, Kang G. Shin, and Nina Bhatti. Performance Guarantees for Web Server End-Systems: A Control-Theoretical Approach. IEEE Transactions on Parallel and Distributed Systems, 13(1):80–96, January 2002. [4] Tarek F. Abdelzaher and Nina Bhatti. Web Content Adaptation to Improve Server Overload Behavior. Computer Networks, 31(11-16):1563–1577, 1999. [5] Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. A Framework for Clustering Evolving Data Streams. In Proceedings of the 29th International Conference on Very Large Data Bases, pages 81–92, Berlin, Germany, September 2003. [6] Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, and Athicha Muthitacharoen. Performance Debugging for Distributed Systems of Black Boxes. In Proceedings of the 19th ACM Symposium on Operating System Principles (SOSP ’03), pages 74–89, Lake George, NY, October 2003. [7] Mark Allman. A Web Server’s View of the Transport Layer.
ACM Computer Communication Review, 30(4):133–142, October 2000. [8] Jussara Almeida, Mihaela Dabu, Anand Manikutty, and Pei Cao. Providing Differentiated Levels of Service in Web Content Hosting. Technical Report CS-TR1998-1364, Computer Sciences Department, University of Wisconsin-Madison, 1998. [9] Karen Appleby, Sameh Fakhouri, Liana Fong, German Goldszmidt, Michael Kalantar, Srirama Krishnakumar, Donald Pazel, John Pershing, and Benny Rochwerger. Oceano-SLA Based Management of a Computing Utility. In Proceedings of the IFIP/IEEE Symposium on Integrated Network Management, pages 855–868. IEEE, May 2001. [10] Arvind Arasu, Shivnath Babu, and Jennifer Widom. The CQL Continuous Query Language: Semantic Foundations and Query Execution. Technical Report 2003-67, Stanford University, October 2003. [11] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Dilys Thomas. Operator Scheduling in Data Stream Systems. Technical Report 2003-68, Stanford University, 2003. [12] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and Issues in Data Stream Systems. In Proceedings of the 21st ACM Symposium on Principles of Database Systems (PODS ’02), pages 1–16, Madison, Wisconsin, 2002. [13] Brian Babcock, Mayur Datar, and Rajeev Motwani. Load Shedding Techniques for Data Stream Systems. In Proceedings of the 2003 Workshop on Management and Processing of Data Streams (MPDS ’03), June 2003. [14] Shivnath Babu, Lakshminarayan Subramanian, and Jennifer Widom. A Data Stream Management System for Network Traffic Management. In Proceedings of the Workshop on Network-Related Data Management (NRDM ’01), May 2001. [15] Mary L. Bailey, Burra Gopal, Michael A. Pagels, Larry L. Peterson, and Prasenjit Sarkar. PATHFINDER: A Pattern-Based Packet Classifier. In Proceedings of the First Symposium on Operating Systems Design and Implementation (OSDI ’94), pages 115–123, Monterey, CA, November 1994.
[16] Hari Balakrishnan, Venkata Padmanabhan, Srinivasan Seshan, and Randy Katz. A Comparison of Mechanisms for Improving TCP Performance over Wireless Links. IEEE/ACM Transactions on Networking (TON), 5(6):756–769, December 1997. [17] Hari Balakrishnan, Hariharan S. Rahul, and Srinivasan Seshan. An Integrated Congestion Management Architecture for Internet Hosts. ACM SIGCOMM Computer Communication Review, 29(4):175–187, October 1999. [18] Hari Balakrishnan, Srinivasan Seshan, and Randy Katz. Improving Reliable Transport and Handoff Performance in Cellular Wireless Networks. ACM Wireless Networks, 1(4):469–481, December 1995. [19] Daniel Barbara. Requirements for Clustering Data Streams. ACM SIGKDD Explorations, 3(2):23–27, 2002. [20] Daniel Barbara and Ping Chen. Using the Fractal Dimension to Cluster Datasets. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 260–264, Boston, MA, 2000. [21] Abbie Barbir, Eric Burger, Robin Chen, Stephen McHenry, Hilarie Orman, and Reinaldo Penno. RFC 3752: Open Pluggable Edge Services (OPES) Use Cases and Deployment Scenarios. IETF, April 2004. [22] Abbie Barbir, Reinaldo Penno, Robin Chen, Markus Hofmann, and Hilarie Orman. RFC 3835: An Architecture for Open Pluggable Edge Services (OPES). IETF, August 2004. [23] Paul Barford and Mark Crovella. Generating Representative Web Workloads for Network and Server Performance Evaluation. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’98), pages 151–160, July 1998. [24] Paul Barford and Mark Crovella. A Performance Evaluation of Hyper Text Transfer Protocols. ACM SIGMETRICS Performance Evaluation Review, 27(1):188–197, June 1999. [25] Paul Barford and Mark Crovella. Critical Path Analysis for TCP Transactions.
In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’00), pages 127–138, Stockholm, Sweden, 2000. [26] Paul Barham, Austin Donnelly, Rebecca Isaacs, and Richard Mortier. Using Magpie for Request Extraction and Workload Modelling. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ’04), pages 259–272, San Francisco, CA, December 2004. [27] Andrew Begel, Steven McCanne, and Susan L. Graham. BPF+: Exploiting Global Data-Flow Optimization in a Generalized Packet Filter Architecture. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’99), pages 123–134, August 1999. [28] Tim Berners-Lee, Roy Fielding, and Henrik Frystyk. RFC 1945: Hypertext Transfer Protocol – HTTP/1.0. IETF, May 1996. [29] Nina Bhatti, Anna Bouch, and Allan Kuchinsky. Integrating User-Perceived Quality into Web Server Design. In Proceedings of the 9th International World Wide Web Conference (WWW-9), pages 1–16, Amsterdam, Netherlands, May 2000. [30] Nina Bhatti and Rich Friedrich. Web Server Support for Tiered Services. IEEE Network, 13(5):64–71, September 1999. [31] Josep M. Blanquer, Antoni Batchelli, Klaus Schauser, and Rich Wolski. Quorum: Flexible Quality of Service for Internet Services. In Proceedings of the 2nd Symposium on Networked Systems Design and Implementation (NSDI ’05), pages 159–174, Boston, MA, May 2005. [32] Herbert Bos, Willem de Bruijn, Mihai Cristea, Trung Nguyen, and Georgios Portokalidis. FFPF: Fairly Fast Packet Filters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ’04), pages 347–363, San Francisco, CA, December 2004. [33] Robert Braden. RFC 1122: Requirements for Internet Hosts - Communication Layers. IETF, October 1989. [34] Robert Braden. RFC 1379: Extending TCP for Transactions – Concepts. IETF, November 1992. [35] Robert Braden.
RFC 1644: T/TCP – TCP Extensions for Transactions Function Specification. IETF, July 1994. [36] Browster. http://www.browster.org/. [37] R. Caceres, N. G. Duffield, A. Feldmann, J. Friedmann, A. Greenberg, R. Greer, T. Johnson, C. Kalmanek, B. Krishnamurthy, D. Lavelle, P. Mishra, K. K. Ramakrishnan, J. Rexford, F. True, and J. E. van der Merwe. Measurement and Analysis of IP Network Usage and Behavior. IEEE Communications Magazine, 38(5):144–151, May 2000. [38] Neal Cardwell, Stefan Savage, and Thomas Anderson. Modeling TCP Latency. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM ’00), pages 1742–1751, Tel-Aviv, Israel, March 2000. IEEE. [39] Surendar Chandra, Carla Ellis, and Amin Vahdat. Differentiated Multimedia Web Services using Quality Aware Transcoding. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM ’00), pages 961–969, Tel-Aviv, Israel, March 2000. [40] Jianjun Chen, David J. DeWitt, Feng Tian, and Yuan Wang. NiagaraCQ: a Scalable Continuous Query System for Internet Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 379–390, Dallas, TX, 2000. [41] Xiangping Chen and Prasant Mohapatra. Providing Differentiated Service from an Internet Server. In Proceedings of the IEEE 8th International Conference On Computer Communications and Networks, pages 214–217, Boston, MA, October 1999. [42] Xiangping Chen, Prasant Mohapatra, and Huamin Chen. An Admission Control Scheme for Predictable Server Response Time for Web Accesses. In Proceedings of the 10th International World Wide Web Conference, pages 545–554, Hong Kong, China, May 2001. [43] Ludmila Cherkasova and Peter Phaal. Session Based Admission Control: a Mechanism for Improving Performance of Commercial Web Sites. In Proceedings of the 7th IEEE/IFIP International Workshop on Quality of Service (IWQoS ’99), pages 226–235, London, UK, May 1999. [44] Tzi-Cker Chiueh and Prashant Pradhan.
High Performance IP Routing Table Lookup using CPU Caching. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM ’99), pages 1421–1428, Orlando, FL, March 1999.
[45] Edith Cohen, Balachander Krishnamurthy, and Jennifer Rexford. Efficient Algorithms for Predicting Requests to Web Servers. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM ’99), pages 284–293, Orlando, FL, March 1999.
[46] Ira Cohen, Jeff Chase, Moises Goldszmidt, Terence Kelly, and Julie Symons. Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ’04), pages 231–244, San Francisco, CA, December 2004.
[47] Chuck Cranor, Theodore Johnson, Vladislav Shkapenyuk, and Oliver Spatscheck. Gigascope: A Stream Database for Network Applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data, San Diego, CA, May 2003.
[48] Chuck Cranor, Theodore Johnson, Vladislav Shkapenyuk, Oliver Spatscheck, and G. Yuan. Gigascope: A Fast and Flexible Network Monitor. Technical Report TD-5ABQY6, AT&T Labs–Research, Florham Park, NJ, May 2002.
[49] Mark Crovella, Robert Frangioso, and Mor Harchol-Balter. Connection Scheduling in Web Servers. In Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems (USITS ’99), pages 243–254, Boulder, CO, October 1999.
[50] Peter Danzig. Ideas for Next Generation Content Delivery. Talk given at the 11th International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV ’01), June 2001.
[51] Yixin Diao, Joseph L. Hellerstein, and Sujay Parekh. Optimizing Quality of Service Using Fuzzy Control. In Proceedings of the 13th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management: Management Technologies for E-Commerce and E-Business Applications (DSOM ’02), pages 42–53, Montreal, Canada, 2002.
[52] Jeff Dike. User-mode Linux. http://user-mode-linux.sourceforge.net/.
[53] Lars Eggert and John Heidemann. Application-Level Differentiated Services for Web Servers. World Wide Web Journal, 3(2):133–142, August 1999.
[54] Sameh Elnikety, Erich Nahum, John Tracey, and Willy Zwaenepoel. A Method for Transparent Admission Control and Request Scheduling in E-Commerce Web Sites. In Proceedings of the 13th International World Wide Web Conference (WWW2004), pages 276–286, New York, NY, May 2004.
[55] Jeremy Elson and Alberto Cerpa. RFC 3507: Internet Content Adaptation Protocol (ICAP). IETF, April 2003.
[56] Endace. http://www.Endace.com/.
[57] Dawson R. Engler and M. Frans Kaashoek. DPF: Fast, Flexible Message Demultiplexing Using Dynamic Code Generation. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’96), pages 53–59, Palo Alto, CA, August 1996.
[58] Ethereal. http://www.Ethereal.com/.
[59] Exodus. http://www.Exodus.com/.
[60] Anja Feldmann. BLT: Bi-Layer Tracing of HTTP and TCP/IP. In Proceedings of the 9th International World Wide Web Conference (WWW-9), pages 321–335, Amsterdam, Netherlands, May 2000.
[61] Roy Fielding, Jim Gettys, Jeffrey Mogul, Henrik Frystyk, and Tim Berners-Lee. RFC 2068: Hypertext Transfer Protocol – HTTP/1.1. IETF, January 1997.
[62] Doug H. Fisher. Iterative Optimization and Simplification of Hierarchical Clusterings. Journal of Artificial Intelligence Research, 4:147–180, 1996.
[63] Sally Floyd and Leslie Daigle. RFC 3238: IAB Architectural and Policy Considerations for Open Pluggable Edge Services. IETF, January 2002.
[64] The Apache Software Foundation. Apache. http://www.apache.org/.
[65] FreeBSD. http://www.FreeBSD.org/.
[66] Yun Fu, Ludmila Cherkasova, Wenting Tang, and Amin Vahdat. EtE: Passive End-to-End Internet Service Performance Monitoring. In Proceedings of the USENIX Annual Conference, pages 115–130, Monterey, CA, June 2002.
[67] Phillip B. Gibbons and Yossi Matias.
New Sampling-based Summary Statistics for Improving Approximate Query Answers. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 331–342, Seattle, WA, June 1998.
[68] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 2715 North Charles Street, Baltimore, MD 21218-4319, 1996.
[69] Gomez Inc. http://www.Gomez.com/.
[70] Boston Consulting Group. http://www.BCG.com/.
[71] The Open Group Sockets API Extensions Working Group. Extended Sockets API (ES-API) 1.0. http://www.opengroup.org/icsc/sockets/, January 2005.
[72] The STREAM Group. STREAM: The Stanford Stream Data Manager. IEEE Data Engineering Bulletin, 26(1), March 2003.
[73] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O’Callaghan. Clustering Data Streams: Theory and Practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515–528, May/June 2003.
[74] Sudipto Guha, Nina Mishra, Rajeev Motwani, and Liadan O’Callaghan. Clustering Data Streams. In Proceedings of the Annual Symposium on Foundations of Computer Science, pages 359–366. IEEE Computer Society, November 2000.
[75] James Hall, Ian Pratt, and Ian Leslie. Non-Intrusive Estimation of Web Server Delays. In Proceedings of the 26th Annual IEEE Conference on Local Computer Networks (LCN), page 215, Tampa, FL, November 2001.
[76] Felix Hernandez-Campos, Kevin Jeffay, and F. Donelson Smith. Tracking the Evolution of Web Traffic: 1995–2003. In Proceedings of the 11th IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunications Systems (MASCOTS), pages 16–25, Orlando, FL, October 2003.
[77] Koen Holtman and Andrew Mutz. RFC 2295: Transparent Content Negotiation in HTTP. IETF, May 1998.
[78] Elbert Hu, Philippe Joubert, Richard King, Jason LaVoie, and John Tracey. Adaptive Fast Path Architecture. IBM Journal of Research and Development, 45(2):191–206, April 2001.
[79] IBM AlphaWorks.
Page Detailer. http://www.alphaworks.ibm.com/tech/pagedetailer.
[80] Van Jacobson, Craig Leres, and Steve McCanne. tcpdump. ftp://ftp.ee.lbl.gov/.
[81] Hani Jamjoom, Padmanabhan Pillai, and Kang G. Shin. Re-synchronization and Controllability of Bursty Service Requests. IEEE/ACM Transactions on Networking (TON), 12(4):582–594, August 2004.
[82] Hani Jamjoom and Kang G. Shin. Persistent Dropping: An Efficient Control of Traffic Aggregates. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’03), pages 287–298, Karlsruhe, Germany, August 2003.
[83] Philippe Joubert, Richard King, Rich Neves, Mark Russinovich, and John M. Tracey. High-Performance Memory-Based Web Servers: Kernel and User-Space Performance. In Proceedings of the USENIX Annual Conference, pages 175–188, Boston, MA, June 2001.
[84] M. Frans Kaashoek, Dawson Engler, Gregory R. Ganger, Hector Briceno, Russell Hunt, David Mazieres, Tom Pinckney, Robert Grimm, John Jannotti, and Kenneth Mackenzie. Application Performance and Flexibility on Exokernel Systems. In Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP ’97), Saint-Malo, France, October 1997.
[85] Abhinav Kamra, Vishal Misra, and Erich Nahum. Yaksha: A Self-Tuning Controller for Managing the Performance of 3-Tiered Web Sites. In Proceedings of the 12th IEEE/IFIP International Workshop on Quality of Service (IWQoS ’04), pages 47–56, Montreal, Canada, June 2004.
[86] Vikram Kanodia and Edward W. Knightly. Multi-Class Latency-Bounded Web Services. In Proceedings of the 8th IEEE/IFIP International Workshop on Quality of Service (IWQoS ’00), pages 231–239, Pittsburgh, PA, June 2000.
[87] Terry Keeley. Thin, High Performance Computing over the Internet. In Proceedings of the 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS ’00), page 407, San Francisco, CA, 2000.
[88] KeyNote. http://www.KeyNote.com/.
[89] Maria Kihl and Niklas Widell. Admission Control Schemes Guaranteeing Customer QoS in Commercial Web Sites. In Proceedings of the IFIP/IEEE Conference on Network Control and Engineering (NETCON ’02), pages 305–316, Paris, France, October 2002.
[90] Balachander Krishnamurthy and Jia Wang. On Network-Aware Clustering of Web Clients. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’00), pages 97–110, Stockholm, Sweden, August 2000.
[91] Sam C. M. Lee, John C. S. Lui, and David K. Y. Yau. Admission Control and Dynamic Adaptation for a Proportional-Delay DiffServ-Enabled Web Server. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’02), pages 172–182, Marina Del Rey, CA, June 2002.
[92] Kelvin Li and Sugih Jamin. A Measurement-Based Admission-Controlled Web Server. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM ’02), pages 651–659, New York, NY, June 2002.
[93] Xue Liu, Lui Sha, Yixin Diao, Steve Froehlich, Joseph L. Hellerstein, and Sujay Parekh. Online Response Time Optimization of Apache Web Server. In Proceedings of the IEEE/IFIP International Workshop on Quality of Service (IWQoS ’03), pages 461–478, 2003.
[94] Chenyang Lu, Tarek Abdelzaher, John Stankovic, and Sang H. Son. A Feedback Control Approach for Guaranteeing Relative Delays in Web Servers. In Proceedings of the 7th IEEE Real-Time Technology and Applications Symposium, Taipei, Taiwan, June 2001.
[95] Samuel Madden, Mehul Shah, Joseph M. Hellerstein, and Vijayshankar Raman. Continuously Adaptive Continuous Queries over Streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 49–60, Madison, WI, 2002.
[96] Francisco Matias Cuenca-Acuna and Thu D. Nguyen. Self-Managing Federated Services.
In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS ’04), pages 240–250, Florianopolis, Brazil, October 2004.
[97] Steven McCanne and Van Jacobson. The BSD Packet Filter: A New Architecture for User-level Packet Capture. In Proceedings of the Winter USENIX Technical Conference, pages 259–270, 1993.
[98] Mercury Interactive. http://www-heva.MercuryInteractive.com/.
[99] Microsoft. http://www.MicroSoft.com/.
[100] Paul Mockapetris. RFC 1034: Domain Names – Concepts and Facilities. IETF, November 1987.
[101] Paul Mockapetris. RFC 1035: Domain Names – Implementation and Specification. IETF, November 1987.
[102] Jeffrey Mogul and K. K. Ramakrishnan. Eliminating Receive Livelock in an Interrupt-Driven Kernel. ACM Transactions on Computer Systems (TOCS), 15(3):217–252, 1997.
[103] Jeffrey Mogul, Richard Rashid, and Michael Accetta. The Packet Filter: An Efficient Mechanism for User-Level Network Code. In Proceedings of the 11th Symposium on Operating System Principles (SOSP ’87), pages 39–51, Austin, TX, November 1987.
[104] Multi-Threaded Routing Toolkit. http://www.mrtd.net/.
[105] MySQL. http://www.MySQL.com.
[106] Erich Nahum, Tsipora Barzilai, and Dilip Kandlur. Performance Issues in WWW Servers. ACM SIGMETRICS Performance Evaluation Review, 27(1):216–217, May 1999.
[107] Erich Nahum, Marcel Rosu, Srinivasan Seshan, and Jussara Almeida. The Effects of Wide-Area Conditions on WWW Server Performance. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’01), pages 257–267, Cambridge, MA, June 2001.
[108] NetBSD. http://www.NetBSD.org/.
[109] NetQoS. http://www.NetQoS.com/.
[110] Netscape. http://www.Netscape.com/.
[111] Trio Networks. Hypertrak. http://www.TrioNetworks.com/.
[112] Henrik Frystyk Nielsen, James Gettys, Anselm Baird-Smith, Eric Prud’hommeaux, Håkon Wium Lie, and Chris Lilley. Network Performance Effects of HTTP/1.1, CSS1, and PNG.
ACM SIGCOMM Computer Communication Review, 27(4):155–166, October 1997.
[113] Jakob Nielsen. The Need for Speed. http://www.useit.com/alertbox/9703a.html, March 1997.
[114] Liadan O’Callaghan, Nina Mishra, Adam Meyerson, Sudipto Guha, and Rajeev Motwani. Streaming-Data Algorithms For High-Quality Clustering. In Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE ’02), page 685, San Jose, CA, March 2002.
[115] David Olshefski and Jason Nieh. Understanding the Management of Client Perceived Response Time. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’06), pages 240–251, Saint-Malo, France, June 2006.
[116] David Olshefski, Jason Nieh, and Dakshi Agrawal. Inferring Client Response Time at the Web Server. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’02), pages 160–171, Marina Del Rey, CA, June 2002.
[117] David Olshefski, Jason Nieh, and Dakshi Agrawal. Using Certes to Infer Client Response Time at the Web Server. ACM Transactions on Computer Systems (TOCS), 22(1):49–93, February 2004.
[118] David Olshefski, Jason Nieh, and Erich Nahum. ksniffer: Determining the Remote Client Perceived Response Time from Live Packet Streams. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ’04), pages 333–346, San Francisco, CA, December 2004.
[119] OneStat. Microsoft’s Windows OS global market share is more than 97% according to OneStat.com. OneStat Press Release, September 2002.
[120] Packeteer. http://www.Packeteer.com/.
[121] Jitendra Padhye, Victor Firoiu, Don Towsley, and Jim Kurose. Modeling TCP Throughput: A Simple Model and its Empirical Validation. ACM SIGCOMM Computer Communication Review, 28(4):303–314, October 1998.
[122] Jitendra Padhye and Sally Floyd. On Inferring TCP Behavior.
In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’01), pages 287–298, San Diego, CA, August 2001.
[123] Raju Pandey, J. Fritz Barnes, and Ronald Olsson. Supporting Quality of Service in HTTP Servers. In Proceedings of the 17th Annual ACM Symposium on Principles of Distributed Computing, pages 247–256, Puerto Vallarta, Mexico, June 1998.
[124] Athanasios Papoulis and S. Unnikrishna Pillai. Probability, Random Variables, and Stochastic Processes. McGraw-Hill Series in Electrical Engineering, 2001.
[125] Sujay Parekh, Neha Gandhi, Joe Hellerstein, Dawn Tilbury, T. S. Jayram, and Joe Bigus. Using Control Theory to Achieve Service Level Objectives in Performance Management. In Proceedings of the IFIP/IEEE International Symposium on Integrated Network Management, pages 841–854, Seattle, WA, May 2001.
[126] Vern Paxson and Mark Allman. RFC 2988: Computing TCP’s Retransmission Timer. IETF, November 2000.
[127] John Postel. RFC 793: Transmission Control Protocol. IETF, September 1981.
[128] Prashant Pradhan, Tzi-cker Chiueh, and Anindya Neogi. Aggregate TCP Congestion Control Using Multiple Network Probing. In Proceedings of the 20th IEEE International Conference on Distributed Computing Systems (ICDCS ’00), page 30, Taipei, Taiwan, April 2000.
[129] Prashant Pradhan, Renu Tewari, Sambit Sahu, Abhishek Chandra, and Prashant Shenoy. An Observation-based Approach Towards Self-managing Web Servers. In Proceedings of the 10th IEEE/IFIP International Workshop on Quality of Service (IWQoS ’02), pages 13–22, Miami, FL, May 2002.
[130] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes in C: The Art of Scientific Computing, 2nd Edition. Cambridge University Press, United Kingdom, 1992.
[131] Ramakrishnan Rajamony and Mootaz Elnozahy. Measuring Client-Perceived Response Times on the WWW.
In Proceedings of the 3rd USENIX Symposium on Internet Technologies and Systems (USITS ’01), San Francisco, CA, March 2001.
[132] Red Hat Inc. The Tux WWW server. http://people.redhat.com/~mingo/TUXpatches/.
[133] RedHat. http://www.RedHat.com/.
[134] IBM T.J. Watson Research. The Oceano Project. http://www.research.ibm.com/oceanoproject/.
[135] RIPE Network Coordination Centre. http://data.ris.ripe.net/.
[136] Luigi Rizzo. Dummynet: A Simple Approach to the Evaluation of Network Protocols. ACM SIGCOMM Computer Communication Review, 27(1):31–41, January 1997.
[137] Alex Rousskov. RFC 4037: Open Pluggable Edge Services (OPES) Callout Protocol (OCP) Core. IETF, March 2005.
[138] Alex Rousskov and Martin Stecher. RFC 4236: HTTP Adaptation with Open Pluggable Edge Services (OPES). IETF, November 2005.
[139] Alessandro Rubini. rshaper. http://www.linux.it/~rubini/software/index.html.
[140] Stefan Savage. Sting: A TCP-based Network Measurement Tool. In Proceedings of the USENIX Symposium on Internet Technologies and Systems (USITS ’99), pages 71–79, Boulder, CO, October 1999.
[141] Bianca Schroeder and Mor Harchol-Balter. Web Servers under Overload: How Scheduling can Help. ACM Transactions on Internet Technology, 6(1):20–52, 2006.
[142] Srinivasan Seshan, Mark Stemm, and Randy H. Katz. SPAND: Shared Passive Network Performance Discovery. In Proceedings of the USENIX Symposium on Internet Technologies and Systems (USITS ’97), Monterey, CA, December 1997.
[143] Srinivasan Seshan, Mark Stemm, and Randy H. Katz. Benefits of Transparent Content Negotiation in HTTP. In Proceedings of the IEEE Globecom 98 Internet Mini-Conference, Sydney, Australia, November 1998.
[144] Peter Sevcik. Customers Need Performance-Based Network ROI Analysis. Business Communications Review, pages 12–14, July 1999.
[145] Biplab Sikdar, Shivkumar Kalyanaraman, and Kenneth Vastola.
Analytic Models and Comparative Study of the Latency and Steady-State Throughput of TCP Tahoe, Reno and SACK. In Proceedings of IEEE GLOBECOM, pages 100–110, San Antonio, TX, November 2001.
[146] F. Donelson Smith, Felix Hernandez-Campos, Kevin Jeffay, and David Ott. What TCP/IP Protocol Headers can tell us about the Web. ACM SIGMETRICS Performance Evaluation Review, 29(1):245–256, June 2001.
[147] Freshwater Software. Web Server Monitor. http://www.freshtech.com/white paper/bookchapter/chapter.html.
[148] W. Richard Stevens. TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley, Massachusetts, 1994.
[149] StreamCheck. http://www.StreamCheck.com/.
[150] Nitin Thaper, Sudipto Guha, Piotr Indyk, and Nick Koudas. Dynamic Multidimensional Histograms. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 428–439, 2002.
[151] The libpcap project. http://sourceforge.net/projects/libpcap/.
[152] The Transaction Processing Performance Council (TPC). http://www.TPC.org/tpcw.
[153] Tomcat 5.5. http://www.jakarta.apache.org/tomcat.
[154] TPC-W Java Implementation. http://mitglied.lycos.de/jankiefer/tpcw.
[155] University of Oregon Route Views Archive Project. http://archive.routeviews.org/.
[156] Bhuvan Urgaonkar and Prashant Shenoy. Cataclysm: Policing Extreme Overloads in Internet Applications. In Proceedings of the 14th International World Wide Web Conference (WWW2005), pages 740–749, Chiba, Japan, May 2005.
[157] Dinesh Verma. Policy-Based Networking: Architecture and Algorithms. New Riders, ISBN 1-57870-226-7, 2000.
[158] VMWare. http://www.VMWare.com/.
[159] Thiemo Voigt, Renu Tewari, Ashish Mehra, and Douglas Freimuth. Kernel Mechanisms for Service Differentiation in Overloaded Web Servers. In Proceedings of the USENIX Annual Conference, pages 189–202, Boston, MA, June 2001.
[160] WebStone. http://www.mindcraft.com/.
[161] Jianbin Wei and Cheng-Zhong Xu. eQoS: Provisioning of Client-Perceived End-to-End QoS Guarantees in Web Servers.
In Proceedings of the International Workshop on Quality of Service (IWQoS 2005), Passau, Germany, June 2005.
[162] Matt Welsh and David Culler. Adaptive Overload Control for Busy Internet Servers. In Proceedings of the USENIX Symposium on Internet Technologies and Systems (USITS ’03), Seattle, WA, 2003.
[163] Carey Williamson and Qian Wu. A Case for Context-Aware TCP/IP. ACM SIGMETRICS Performance Evaluation Review, 29(4):11–23, 2002.
[164] Maya Yajnik, Sue B. Moon, James F. Kurose, and Donald F. Towsley. Measurement and Modeling of the Temporal Dependence in Packet Loss. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM ’99), pages 345–352, Orlando, FL, March 1999.
[165] Masanobu Yuhara, Brian N. Bershad, Chris Maeda, and J. Eliot B. Moss. Efficient Packet Demultiplexing for Multiple Endpoints and Large Messages. In Proceedings of the Winter USENIX Technical Conference, pages 153–165, 1994.
[166] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 103–114, Montreal, Canada, 1997.
[167] Yin Zhang, Nick Duffield, Vern Paxson, and Scott Shenker. On the Constancy of Internet Path Properties. In Proceedings of the ACM SIGCOMM Internet Measurement Workshop (IMW ’01), San Francisco, CA, November 2001.
[168] Yin Zhang, Vern Paxson, and Scott Shenker. The Stationarity of Internet Path Properties: Routing, Loss and Throughput. ACIRI Technical Report, May 2000.