Thesis
Modeling TCP/IP Software Implementation Performance And Its
Application For Software Routers
By
Oscar Iván Lepe Aldama
M.Sc. Electronics and Telecommunications
CICESE Research Center, México, 1995
B.S. Computer Engineering
Universidad Nacional Autónoma de México, 1992
Submitted to the Department of Computer Architecture
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY IN COMPUTER ENGINEERING
At the
UNIVERSITAT POLITECNICA DE CATALUNYA
Presented and defended on December 3, 2002
Advisor:
Prof. Dr. Jorge García-Vidal
Jury:
Prof. Dr. Ramón Puigjaner-Trepat, President
Profa. Dra. Olga Casals-Torres, Examiner
Prof. Dr. José Duato-Marín, Examiner
Prof. Dr. Joan García-Haro, Examiner
Prof. Dr. Sebastian Sallent-Ribes, Examiner
© 2002 Oscar I. Lepe A.
All rights reserved
Contents
List of figures
List of tables
Preface
Chapter 1 Introduction
1.1 Motivation
1.2 Scope
1.3 Dissertation's thesis
1.4 Synopsis
1.5 Outline
1.6 Related work
Chapter 2 Internet protocols' BSD software implementation
2.1 Introduction
2.2 Interprocess communication in the BSD operating system
2.2.1 BSD's interprocess communication model
2.2.2 Typical use of sockets
2.3 BSD's networking architecture
2.3.1 Memory management plane
2.3.2 User plane
2.4 The software interrupt mechanism and networking processing
2.4.1 Message reception
2.4.2 Message transmission
2.4.3 Interrupt priority levels
2.5 BSD implementation of the Internet protocols suite
2.6 Run-time environment: the host's hardware
2.6.1 The central processing unit and the memory hierarchy
2.6.2 The busses organization
2.6.3 The input/output bus' arbitration scheme
2.6.4 PCI hidden bus arbitration's influence on latency
2.6.5 Network interface card's system interface
2.6.6 Main memory allocation for direct memory access network interface cards
2.7 Other systems' networking architectures
2.7.1 Active Messages [Eicken et al. 1992]
2.7.2 Integrated Layer Processing [Abbot and Peterson 1993]
2.7.3 Application Device Channels [Druschel 1996]
2.7.4 Microkernel operating systems' extensions for improved networking [Coulson et al. 1994; Coulson and Blair 1995]
2.7.5 Communications oriented operating systems [Mosberger and Peterson 1996]
2.7.6 Network processors
2.8 Summary
Chapter 3 Characterizing and modeling a personal computer-based software router
3.1 Introduction
3.2 System modeling
3.3 Personal computer-based software routers
3.3.1 Routers' rudiments
3.3.2 The case for software routers
3.3.3 Personal computer-based software routers
3.3.4 Personal computer-based IPsec security gateways
3.4 A queuing network model of a personal computer-based software IP router
3.4.1 The forwarding engine, the network interface cards and the packet flows
3.4.2 The service stations' scheduling politics and the mapping between networking stages and model elements
3.4.3 Modeling a security gateway
3.5 System characterization
3.5.1 Tools and techniques for profiling in-kernel software
3.5.2 Software profiling
3.5.3 Probe implementation
3.5.4 Extracting information from the kernel
3.5.5 Experimental setup
3.5.6 Traffic patterns
3.5.7 Experimental design
3.5.8 Data presentation
3.5.9 Data analysis
3.5.10 "Noise" process characterization
3.6 Model validation
3.6.1 Service time correlations
3.6.2 Qualitative validation
3.6.3 Quantitative validation
3.7 Model parameterization
3.7.1 Central processing unit speed
3.7.2 Memory technology
3.7.3 Packet's size
3.7.4 Routing table's size
3.7.5 Input/output bus's speed
3.8 Model's applications
3.8.1 Capacity planning
3.8.2 Uniform experimental test-bed
3.9 Summary
Chapter 4 Input/output bus usage control in personal computer-based software routers
4.1 Introduction
4.2 The problem
4.3 Our solution
4.3.1 BUG's specifications and network interface card's requirements
4.3.2 Low overhead and intrusion
4.3.3 Algorithm
4.3.4 Algorithm's details
4.3.5 Algorithm's a priori estimated costs
4.3.6 An example scenario
4.4 BUG performance study
4.4.1 Experimental setup
4.4.2 Response to unbalanced constant packet rate traffic
4.4.3 Study on the influence of the activation period
4.4.4 Response to on-off traffic
4.4.5 Response to self-similar traffic
4.5 A performance study of a software router incorporating the BUG
4.5.1 Results
4.6 An implementation
4.7 Summary
Chapter 5 Conclusions and future work
5.1 Conclusions
5.2 Future work
Appendix A
Appendix B
Bibliography
List of figures
Figure 2.1—OMT object model for BSD IPC
Figure 2.2—BSD's two-plane networking architecture. The user plane is depicted with its layered structure, which is described in following sections. Bold circles in the figure represent defined interfaces between planes and layers: A) Socket-to-Protocol, B) Protocol-to-Protocol, C) Protocol-to-Network Interface, and D) User Layer-to-Memory Management. Observe that this architecture implies that layers share the responsibility of taking care of the storage associated with transmitted data
Figure 2.3—BSD networking user plane's software organization
Figure 2.4—Example of priority levels and kernel processing
Figure 2.5—BSD implementation of the Internet protocol suite. Only chief tasks, message queues and associations are shown. Please note that some control flow arrows are sourced at the bounds of the squares delimiting the implementation layers. This is for denoting that a task is executed after an external event, such as an interrupt or a CPU scheduler event
Figure 2.6—Chief components in a general purpose computing hardware
Figure 2.7—Example PCI arbitration algorithm
Figure 2.8—Main memory allocation for device drivers of network interface cards supporting direct memory access
Figure 2.9—Integrated layering processing [Abbot and Peterson 1993]
Figure 2.10—Application Device Channels [Druschel 1996]
Figure 2.11—SUMO extensions to a microkernel operating system [Coulson et al. 1994; Coulson and Blair 1995]
Figure 2.12—Making paths explicit in the Scout operating system [Mosberger and Peterson 1996]
Figure 3.1—A spectrum of performance modeling techniques
Figure 3.2—McKeown's [2001] router architecture
Figure 3.3—A queuing network model of a personal computer-based software router that has two network interface cards and that is traversed by a single packet flow. The number and meaning of the shown queues is a result of the characterization process presented in the next section
Figure 3.4—A queuing network model of a software router that shows a one-to-one mapping between C language functions implementing the BSD networking code and the depicted model's queues. In order to simplify the figure, this model does not model the software router's input/output bus nor the noise process. Moreover, it simplistically models the network interface cards
Figure 3.5—A queuing network model of a software router configured as a security gateway. The number and meaning of the shown queues is a result of the characterization process presented in next section
Figure 3.6—Probe implementation for FreeBSD
Figure 3.7—Experimental setup
Figure 3.8—Characterization charts for a security gateway's protocols layer
Figure 3.9—Characterization charts for a software router's network interfaces layer
Figure 3.10—Comparison of the CCPFs computed after both, measured data from the software router under test, and predicted data from a corresponding queuing network model, which used a one-to-one mapping between C language networking functions and model's queues
Figure 3.11—Example chart from the service time correlation analysis. It shows the plot of ip_input cycle counts versus ip_output cycle counts. A correlation is clearly shown
Figure 3.12—Model's validation charts. The two leftmost columns' charts depict per-packet latency traces. The right column's chart depicts latency traces' CCPFs
Figure 3.13—Relationship between measured execution times and central processing unit operation speed. Observe that some measures have proportional behavior while others have linear behavior. The main text explains the reasons to these behaviors and why the circled measures do not all agree with the regression lines
Figure 3.14—Outliers related to the CPU's instruction cache. The left chart was drawn after data taken from the ipintrq probe. The right chart corresponds to the ESP (DES) probe at a security gateway. Referenced outliers are highlighted
Figure 3.15—Relationship between measured execution times and message size
Figure 3.16—Capacity planning charts
Figure 3.17—Queuing network model for a BSD based software router with two network interface cards attached to it and three packet flows traversing it
Figure 3.18—Basic system's performance analysis: a) system's overall throughput; b) per flow throughput share for system one; c) per flow throughput share for system two
Figure 3.19—Queuing network model for a Mogul and Ramakrishnan [1997] based software router with two network interface cards attached to it and three packet flows traversing it
Figure 3.20—Mogul and Ramakrishnan [1997] based software router's performance analysis: a) system's overall throughput; b) per flow throughput share for system one; c) per flow throughput share for system two
Figure 3.21—Queuing network model for a software router including the receiver live-lock avoidance mechanism and a QoS aware CPU scheduler, similar to the one proposed by Qie et al [2001]. The software router has two network interface cards and three packet flows traverse it
Figure 3.22—Qie et al [2001] based software router's performance analysis: a) per flow throughput share for system one; b) per flow throughput share for system two
Figure 4.1—The BUG's system architecture. The BUG is a piece of software embedded in the operating system's kernel that shares information with the network interface card's device drivers and manipulates the vacancy of each DMA receive channel
Figure 4.2—The BUG's periodic and bistate operation for reducing overhead and intrusion
Figure 4.3—The BUG's packet-by-packet GPS server emulation with batch arrivals
Figure 4.4—The BUG is work conservative
Figure 4.5—The BUG's unfairness counterbalancing mechanism
Figure 4.6—The BUG's bus utilization grant packetization policy. In the considered scenario, three packets flows with different packet sizes traverse the router and the BUG has granted each an equal number of bus utilization bytes. Packet sizes are small, medium and large respectively for the orange, green and blue packets flows. After packetization, some idle time gets induced
Figure 4.7—An example of the behavior of the BUG mechanism. Vectors A, D, N, G and g are defined as: A = (A1, A2, A3), etc. It is assumed that the system serves three packets flows with the same shares and with the same packet lengths, and that in a period T up to six packets can be transferred through the bus. The BUG does not consider all variables at all times. At every activation instant, the variables that the BUG ignores are printed in gray
Figure 4.8—Queuing network models for: a) PCI bus, b) WFQ bus, and c) BUG protected PCI bus; all with three network interface cards attached to it and three packets flows traversing them
Figure 4.9—BUG performance study: response comparison to unbalanced constant packet rate traffic between a WFQ bus, a PCI bus and a BUG protected PCI bus; first, middle and left columns respectively. At row (a) flow3 is the misbehaving flow while flow2 and flow1 are for (b) and (c), respectively
Figure 4.10—BUG performance study: on the influence of the activation period
Figure 4.11—BUG performance study: response comparison to on-off traffic between an ideal WFQ bus, a PCI bus, and a BUG protected PCI bus
Figure 4.12—QoS aware system's performance analysis: a) system's overall throughput; b) per packets flow throughput share for system one; c) per packets flow throughput share for system two
List of tables
Table 3-I
Table 3-II
Table 3-III
Table 4-I
Table 4-II
Table 4-III
Table 4-IV
Preface
I dedicate this work to my wife, Tania Patricia, and to our children, Pedro Darío and Sebastián.
At the same time, I want to thank Tania and Pedro for their support and patience during the time we spent away from home, living in less than the best conditions. I only hope that this hardship was offset by how enriching the experiences we lived through have been; I feel they will make us better people, having reminded us of the importance of family unity and of the value of our customs, our people and our land.
I thank my thesis advisor, Jorge García-Vidal, for his invaluable guidance and dedication. I will hardly forget the countless long discussion meetings that gave life to this work, nor the many sleepless nights we spent polishing the details of the papers we submitted for review. But neither will I forget the afternoons over coffee, or the afternoons in the park with our children, where we talked about everything but work.
I thank the people of México, whose effort sustains the scholarship programs that allow Mexicans to pursue graduate studies abroad, such as the program managed by the Consejo Nacional de Ciencia y Tecnología, or the one managed by the Centro de Investigación Científica y de Educación Superior de Ensenada, which supported me.
And in general I thank all the people who, directly or indirectly, helped me bring this work to completion, for example the professors of DAC/UPC and the anonymous reviewers at the conferences. In particular I want to mention, in alphabetical order, Alejandro Limón Padilla, David Hilario Covarrubias Rosales, Delia White, Eva Angelina Robles Sánchez, Francisco Javier Mendieta Jiménez, Jorge Enrique Preciado Velasco, José María Barceló Ordinas, Llorenç Cerdà Alabern, Olga María Casals Torres, Oscar Alberto Lepe Peralta, Victoriano Pagoaga. To all of them, many thanks.
Chapter 1
Introduction
1.1 Motivation
Computer networks and the computer applications that run over them are fundamental to today's society. Some kind of telematic system is essential for the proper operation of the New York Stock Exchange, for instance, while another kind is essential for providing telephony service to small villages in hard-to-reach places, like the numerous villages in the mountains of Chiapas, México. As is well known, the application of telematic technology for the benefit of society is limited only by human imagination.
Today's needs for information processing and transportation pose complex problems. To solve them, scientists and engineers must produce ideas and products with an ever-increasing number of components and inter-component relationships. In order to cope with this complexity, developers invented the concept of layered architectures, which allow a structured "divide and conquer" approach to designing complex systems by providing a step-by-step enhancement of system services.
Unfortunately, layered structures can result in relatively low-performance telematic systems—and often do—if implemented carelessly [Clark 1982; Tennenhouse 1989; Abbott and Peterson 1993; Mogul and Ramakrishnan 1999]. In order to understand this, let us consider the following:
• Telematic systems are mostly implemented with software.
• Each software layer is designed as an independent entity that concurrently and asynchronously communicates with its neighbors through a message-passing interface. This allows for better interoperability, manageability and extensibility.
• In order to allow software layers to work concurrently and asynchronously, the host computer system has to provide a secure and reliable multiprogramming environment through an operating system.
• Ideally, the operating system should perform its role without consuming a significant share of processing resources. Unfortunately, as reported elsewhere [Druschel 1996], current operating systems are threatening to become bottlenecks when processing input/output data streams. Moreover, they are the source of statistical delays—incurred as each data unit is marshaled through the layered software—that hamper the correct deployment of important services.
Others have recognized this problem and have conducted studies analyzing some aspects of some operating systems' computer networks software—networking software for short. For us it is striking that, although these studies are numerous (a search through the ACM Digital Library on the term "(c.2 and d.4 and c.5)<IN>ccs" returns 54 references, and a search through IEEE Xplore on the term "'protocol processing'<IN>de" returns 81 references), only one of them pursued building a general model of the networking software—see this chapter's section on "related work". Indeed, although different systems' networking software has more similarities than differences, as we will later discuss, most of these studies have focused only on identifying and solving particular problems of particular systems. In saying this we do not deny the importance of that work; we simply believe that modeling is an important part of doing research.
The Internet protocol suite (TCP/IP) [Stevens 1994] is nowadays the preferred technology for networking. Of all possible implementations, the one done at the University of California at Berkeley for the Berkeley Software Distribution operating system, or BSD, has been used as the starting point for most available systems. The BSD operating system [McKusick et al. 1996], which is a flavor of the UNIX system [Ritchie and Thompson 1978], was selected as the foundation for implementing the first TCP/IP suite back in the 1980s.
1.2 Scope
Naturally, modeling a system requires high degrees of observability and controllability. For us this means free access to both the networking software's source code and the host computer's technical specifications. (When we speak of free, we are referring to freedom, not price.) Today's personal computer (PC) technology provides this. Indeed, not only is there plenty of freely available, detailed technical documentation on Intel's IA32 PC hardware, but there are also several PC operating systems with an open source policy. Of those we decided to work with FreeBSD, a 4.4BSD-Lite derived operating system optimized to run on Intel's IA32 PCs.
When searching for a networking application for which the application of a performance model could be of importance, we found software routers. A software router can be defined as a general-purpose computer that executes a computer program capable of forwarding IP datagrams among network interface cards attached to its input/output bus (I/O bus). Evidently, software routers have performance limitations because they use a single central processing unit (CPU) and a single shared bus to process all packets. However, due to the ease with which they can be programmed to support new functionality—securing communications, shaping traffic, supporting mobility, translating network addresses, supporting application proxies, and performing n-level routing—software routers are important at the edge of the Internet.
1.3 Dissertation’s thesis
From all the above we came up with the following dissertation thesis:
Is it possible to build a queuing network model of the Internet
protocols’ BSD implementation that can be used for predicting
with reasonable accuracy not only the mean values of the
operational parameters studied but also their cumulative
probability function? And, can this model be applied for
studying the performance of PC-based software routers
supporting communication quality assurance mechanisms, or
Quality-of-Service (QoS) mechanisms?
1.4 Synopsis
This work makes three main contributions, in no particular order:
• A detailed performance study of the software implementation of the TCP/IP protocol suite, when executed as part of the kernel of a BSD operating system over generic PC hardware.
• A validated queuing network model of the studied system, solved by computer simulation.
• An I/O bus utilization guard mechanism for improving the performance of software routers supporting QoS mechanisms and built upon PC hardware and software.
This document presents our experiences building a performance model of a PC-based software router. The resulting model is an open multiclass priority network of queues that we solved by simulation. While the model is not particularly novel from the system modeling point of view, in our opinion it is an interesting result to show that such a model can estimate, with high accuracy, not just average performance numbers but the complete probability distribution function of packet latency, allowing performance analysis at several levels of detail. The validity and accuracy of the multiclass model has been established by contrasting its packet latency predictions in both time and probability spaces. Moreover, we introduced into the validation analysis the predictions of a router's single-queue model. We did this to quantitatively assess the advantages of the more complex multiclass model with respect to the simpler and widely used, but not so accurate, as shown here, single-queue model, under the considered scenario in which the router's CPU, and not the communication links, is the system bottleneck. The single-queue model was also solved by simulation.
Besides, this document addresses the problem of resource sharing in PC-based software routers supporting QoS mechanisms. Others have put forward solutions that focus on suitably distributing the workload of a software router's CPU—see this chapter's section on "related work". However, the evident and increasing gap in operation speed between a PC-based software router's CPU and its I/O bus means, to us, that attention must be paid to the effect that the limitations imposed by this bus have on the system's overall performance. Consequently, we propose a mechanism that jointly controls both I/O bus and CPU operation for improved PC-based software router performance. This mechanism involves changes to the operating system kernel code and assumes the existence of certain network interface card functions, although it does not require changes to the PC hardware. A performance study is presented that provides insight into the problem and helps to evaluate both the effectiveness of our approach and several software router design trade-offs.
1.5 Outline
The rest of this chapter discusses related work. Chapter 2's objective is to understand the influence that operating system design and implementation techniques have on the performance of the Internet protocols' BSD software implementation. Chapter 3 presents our experiences building, validating and parameterizing a performance model of a PC-based software router; moreover, it presents some results from applying the model for capacity planning. Chapter 4 addresses the problem of resource sharing in PC-based software routers supporting communication quality assurance mechanisms; furthermore, it presents our mechanism for jointly controlling the router's CPU and I/O bus, indispensable for a software router to support such mechanisms. Chapter 5 states our conclusions and outlines future work.
1.6 Related work
Cabrera et al. [1988] (who actually presented their study's results in July 1985) is the earliest work we found in the publicly accessible literature on the experimental evaluation of TCP/IP implementations. Their study was an ambitious one, whose objective was to assess the impact that different processors, network hardware interfaces, and Ethernets have on communication across machines, under various host and communication-media load conditions. Their measurements highlighted the ultimate bounds on communication performance perceived by application programs. Moreover, they presented a detailed timing analysis of the dynamic behavior of the networking software. They studied the TCP/IP implementation within 4.2BSD when run on then state-of-the-art minicomputers attached to legacy Ethernets. Consequently, their study's results are no longer valid. Worse yet, they used the gprof(1) and kgmon(8) tools for profiling. These tools are supported only by software and, consequently, produce results of limited accuracy when compared to results produced by hardware performance-monitoring counters, as we use. However complete their study was, they did not pursue building a system model, as we do.
Sanghi et al. [1990] is the earliest work we found in the publicly accessible literature on experimental evaluation of TCP/IP implementations that uses software profiling. Their study is very narrow when compared to ours, in the sense that its only objective was to determine the suitability of round-trip time estimators for TCP implementations.
Papadopoulos and Gurudatta [1993] presented results of a more general study on TCP/IP implementation performance, obtained using software profiling. They studied the TCP/IP implementation within a 4.3BSD-derived operating system—SunOS. Like our work, their study's objective was to characterize the performance of the networking software. Unlike ours, however, their study was not aimed at producing a system model.
Kay and Pasquale [1996] (who actually presented their study's results in September 1994) conducted another TCP/IP implementation performance study. Differently from previous work, their study was carried out at a different granularity level; that is, they went inside the networking functions and analyzed how these functions spend their processing time—touching data, protocol-specific processing, memory buffer manipulation, error checking, data structure manipulation, and operating system overhead. Moreover, their study's objective was to guide a search for bottlenecks in achieving high throughput and low latency. Once again, they did not pursue building a system model.
Ramamurthy [1988] modeled the UNIX system using a queuing network model.
However, his model characterizes the system’s multitasking properties and therefore
cannot be applied to study the networking software, which is governed by the software
interrupt mechanism. Moreover, Ramamurthy’s model was only solved for predicting
mean values, something that is not enough when conducting performance analyses of
today’s networking software. What is required is a model capable of producing the
complete probability function of operational parameters so analysis at several levels of
detail may be performed.
Björkman and Gunningberg [1998] modeled an Internet protocols implementation using a queuing network model; however, their model characterizes a user-space implementation (based on a parallelized version of the University of Arizona's x-kernel [Hutchinson and Peterson 1991]) and therefore disregards the software interrupt mechanism. Moreover, their model was aimed at predicting only system throughput (measured in packets per second) when the protocols are hosted by shared-memory multiprocessor systems. Besides, their study was aimed only at high-performance distributed computing, where it is assumed that connections are always open with a steady stream of messages—that is, that no retransmissions or other unusual events occur. All this prevents the use of their model for our intended applications.
Packet service disciplines and their associated performance issues have been widely studied in the context of bandwidth scheduling in packet-switched networks [Zhang 1995]. Arguably, several such disciplines now exist that are both fair—in terms of assuring access to link bandwidth in the presence of competing packet flows—and easy to implement efficiently, in both hardware and software. Recently, interest has arisen in mapping these packet service disciplines onto CPU scheduling for programmable and software routers. Qie et al. [2001] and Chen and Morris [2001], for software routers, and Pappu and Wolf [2001], for programmable routers, have put forward solutions that focus on suitably distributing the workload of a router's processor. However, neither the former nor the latter consider the problem of I/O bus bandwidth scheduling. And this problem is important in the context of software routers, as we demonstrate.
Chiueh and Pradhan [2000] recognized both the suitability and the inherent limitations of using PC technology for implementing software routers. In order to overcome the performance limitations of PC I/O buses and to construct high-speed routers, they proposed using clusters of PCs. In this architecture, several PCs, each having at most one network interface card, are tightly coupled by means of a Myrinet system area network to form a software router with as many subnetwork attachments as computing nodes. After solving all the internode communication, memory coherency, and routing table distribution problems—arguably not an easy task—a "clustered router" may be able to overcome the limitations of current PC I/O buses and not only provide high performance (in terms of packets per second) but also support QoS mechanisms. Our work, however, is aimed at supporting QoS mechanisms in clearly simpler and less expensive PC-based software routers.
Scottis, Krunz and Liu [1999] recognized the performance limitations of the omnipresent Peripheral Component Interconnect (PCI) PC I/O bus with respect to supporting QoS mechanisms, and proposed an enhancement. In contrast to our software enhancement, theirs introduces a new bus arbitration protocol that has to be implemented in hardware. Moreover, their bus arbitration protocol aims to improve bus support only for periodic and predictable real-time streams, and clearly this is not suitable for Internet routers.
There is general agreement in the PC industry that the demands of current user applications are quickly overwhelming the shared parallel architecture and limited bandwidth of the various types of PCI buses. (Erlanger, L., editor, "High-performance busses and interconnects," http://www.pcmag.com/article2/0,4149,293663,00.asp, current as of 8 July 2002.) With this increasing demand and its lack of QoS, PCI has been due for a replacement in different system and application scenarios for more than a few years now. 3GIO, InfiniBand, RapidIO and HyperTransport are new technologies designed to improve I/O and device-to-device performance in a variety of system and application categories; however, not all of them are direct replacements for PCI. In this sense, 3GIO and InfiniBand may be considered direct candidates to succeed PCI. All these technologies have in common that they define packet-based, point-to-point serial connections with layered communications, which readily support QoS mechanisms. However, due to the large installed base of PCI equipment, it is expected that the PCI bus will remain in use for some years. Consequently, we think our work is important.
Chapter 2
Internet protocols’ BSD software
implementation
2.1 Introduction
This chapter’s objective is to understand the influence that operating system design and implementation techniques have over the performance of the Internet protocols’ BSD software implementation. Later chapters discuss how to apply this knowledge for building a performance model of a personal computer-based software router.
The chapter is organized as follows. The first three sections set the conceptual
framework for the document but may be skipped by a reader familiar with BSD’s networking subsystem. Section 2.2 presents a brief overview on BSD’s interprocess communication facility. Section 2.3 presents BSD’s networking architecture. And BSD’s
software interrupt mechanism is presented in section 2.4. Following sections present the
chief features, components and structures of the Internet protocol’s BSD software implementation. These sections present several ideas and diagrams that are referenced in
this document’s later sections and should not be skipped. Section 0 presents the software implementation while section 2.6 the run-time environment. Finally, section 2.7
presents brief descriptions of other system’s networking architectures and section 2.8
summarizes.
2.2 Interprocess communication in the BSD operating system
The BSD operating system [McKusick et al. 1996] is a flavor of UNIX [Ritchie
and Thompson 1978] and historically, UNIX systems were weak in the area of interprocess communication [Wright and Stevens 1995]. Before the 4.2 release of the BSD
operating system, the only standard interprocess communication facility was the pipe.
The requirements of the Internet [Stevens 1994] research community resulted in a significant effort to address the lack of a comprehensive set of interprocess communication
facilities in UNIX. (At the time 4.2BSD was being designed there was no global Internet but an experimental computer network sponsored by the United States of America’s
Defense Advanced Research Projects Agency.)
2.2.1 BSD's interprocess communication model
4.2BSD’s interprocess communication facility was designed to provide a sufficiently general interface upon which distributed-computing applications—sharing of
physically distributed computing resources, distributed parallel computing, computer
supported telecommunications—could be constructed independently of the underlying
communication protocols. This facility has outlasted and is present in the current 4.4 release. For now on, when referring to the BSD operating system, we mean the 4.2 release
or any follow-on release like the current 4.4. While designing the interprocesscommunication facilities that would support these goals, the developers identified the
following requirements and developed unifying concepts for each:
• The system must support communication networks that use different sets of protocols, different naming conventions, different hardware, and so on. The notion of a communication domain was defined for these reasons.
• A unified abstraction for an endpoint of communication is needed that can be manipulated with a file descriptor. The socket is the abstract object from which messages are sent and received.
• The semantic aspects of communication must be made available to applications in a controlled and uniform way. So, all sockets are typed according to their communication semantics.
• Processes must be able to locate endpoints of communication so that they can rendezvous without being related. Hence, sockets can be named.
Figure 2.1 depicts the OMT Object Model [Rumbaugh et al. 1991] for these requirements.
2.2.2 Typical use of sockets
First, a socket must be created with the socket system call, which returns a file
descriptor that is then used in later socket operations:
s = socket(domain, type, protocol);
int s, domain, type, protocol;
After a socket has been created, the next step depends on the type of socket being
used. The most commonly used type of socket requires a connection before it can be
used. The creation of a connection between two sockets usually requires that each socket
have an address (name) bound to it. The address to be bound to a socket must be formulated in a socket address structure.
error = bind(s, addr, addrlen);
int error, s, addrlen;
struct sockaddr *addr;
A connection is initiated with a connect system call:
error = connect(s, peeraddr, peeraddrlen);
int error, s, peeraddrlen;
struct sockaddr *peeraddr;
When a socket is to be used to wait for connection-requests to arrive, the system
call pair listen/accept is used instead:
error = listen(s, backlog);
int error, s, backlog;
Connections are then received, one at a time, with:
snew = accept(s, peeraddr, peeraddrlen);
int snew, s, peeraddrlen;
struct sockaddr *peeraddr;
Figure 2.1—OMT object model for BSD IPC, relating processes (clients and servers), sockets of stream, datagram and raw types, communication domains, protocols, networks, hosts, services, and the name schemes that bind them
A variety of calls are available for transmitting and receiving data. The usual read
and write system calls, as well as the newer send and recv system calls can be used
with sockets that are in a connected state. The sendto and recvfrom system calls are
most useful for connectionless sockets, where the peer’s address is specified with each
transmitted message. Finally, the sendmsg and recvmsg system calls support the full interface to the interprocess-communication facilities.
The shutdown system call terminates data transmission or reception at a socket.
Sockets are discarded with the normal close system call.
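To tie these calls together, the following is a minimal, self-contained C sketch of a connection-oriented client in the Internet communication domain. It is our own illustration rather than code from the BSD sources, and the peer address 192.0.2.1, port 7, is only a placeholder.

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        struct sockaddr_in peer;
        char buf[128];
        ssize_t n;
        int s;

        /* Create a stream socket in the Internet communication domain. */
        if ((s = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
                perror("socket");
                return 1;
        }

        /* Name the peer endpoint (placeholder address and port). */
        memset(&peer, 0, sizeof(peer));
        peer.sin_family = AF_INET;
        peer.sin_port = htons(7);
        peer.sin_addr.s_addr = inet_addr("192.0.2.1");

        /* Initiate the connection. */
        if (connect(s, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
                perror("connect");
                close(s);
                return 1;
        }

        /* Transmit and receive data on the connected socket. */
        (void)send(s, "hello", 5, 0);
        if ((n = recv(s, buf, sizeof(buf), 0)) > 0)
                (void)write(1, buf, (size_t)n);

        /* Terminate transmission and discard the socket. */
        (void)shutdown(s, SHUT_RDWR);
        (void)close(s);
        return 0;
}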
2.3 BSD’s networking architecture
BSD’s networking architecture has two planes, as shown in Figure 2.2: the user
plane and the memory management plane. The user plane defines a framework within
which many communication domains may coexist and network services can be implemented. The memory management plane defines memory management policies and
procedures that comply with the user plane’s memory requirements. More on this a little further.
2.3.1 Memory management plane
It is well known [McKusick et al. 1996; Wright and Stevens 1995] that the requirements placed by interprocess communication and network protocols on a memory management scheme tend to be substantially different from those of other parts of the operating system. Basically, network message processing requires attaching and detaching protocol headers and trailers to messages. Moreover, sometimes these headers' and trailers' sizes vary with the communication session's state; at other times the number of these protocol elements is not known a priori. Consequently, a special-purpose memory management facility was created by the BSD development team for the use of the interprocess communication and networking systems.
Figure 2.2—BSD's two-plane networking architecture. The user plane is depicted with its layered structure, which is described in following sections. Bold circles in the figure represent defined interfaces between planes and layers: A) Socket-to-Protocol, B) Protocol-to-Protocol, C) Protocol-to-Network Interface, and D) User Layer-to-Memory Management. Observe that this architecture implies that layers share the responsibility of taking care of the storage associated with transmitted data
The memory management facilities revolve around a data structure called an mbuf. Mbufs, or memory buffers, are 128 bytes long, with 100 or 108 bytes of this space reserved for data storage. There are three sets of header fields that might be present in an mbuf, used for identification and management purposes. Multiple mbufs can be linked to form mbuf chains that hold an arbitrary quantity of data. For very large messages, the system can associate larger sections of data with an mbuf by referencing an external mbuf cluster from a private virtual memory area. Data is stored either in the internal data area of the mbuf or in the external cluster, but never in both.
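The following is a simplified C sketch of such a buffer descriptor, meant only to illustrate the ideas of chaining and of optional external clusters. The field names and layout are ours, and the sketch deliberately ignores the unions with which the real 4.4BSD struct mbuf packs all of this into 128 bytes.

#define MBUF_SIZE     128     /* total size of one mbuf */
#define MBUF_DATA     108     /* space reserved for data in a plain mbuf */

struct sketch_mbuf {
        struct sketch_mbuf *m_next;     /* next mbuf in the same message (mbuf chain) */
        struct sketch_mbuf *m_nextpkt;  /* next message in a queue of messages */
        int                 m_len;      /* amount of data held by this mbuf */
        char               *m_data;     /* points into m_dat or into the external cluster */
        short               m_type;     /* data, packet header, socket name, ... */
        short               m_flags;    /* e.g., whether an external cluster is attached */
        char               *m_ext_buf;  /* external mbuf cluster, if any */
        unsigned int        m_ext_size; /* size of the external cluster */
        char                m_dat[MBUF_DATA]; /* internal data area */
};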
2.3.2 User plane
The user plane, as said before, provides a framework within which many communication domains may coexist and network services can be implemented. Networking facilities are accessed through the socket abstraction. These facilities include:
• A structured interface to the socket layer.
• A consistent interface to hardware devices.
• Network-independent support for message routing.
The BSD development team devised a pipelined implementation for the user plane with three vertically delimited stages or layers. As Figure 2.2 and Figure 2.3 show, these layers are the sockets layer, the protocols layer, and the network-interfaces layer. Jointly, the protocols layer and the network-interfaces layer are named the networking support. Basically, the sockets layer is a protocol-independent interface used by applications to access the networking support. The protocols layer contains the implementation of the communication domains supported by the system, where each communication domain may have its own internal structure. Last but not least, the network-interfaces layer is mainly concerned with driving the transmission media involved.
Figure 2.3—BSD networking user plane's software organization. User processes enter the socket layer through system calls; the socket, protocols and network-interfaces layers are decoupled by socket, protocol (IP input) and interface queues; protocol processing runs as a software interrupt at splnet, raised by the interface layer, while interface processing runs at splimp, raised by the hardware-interrupt handler
Entities at different layers communicate through well-defined interfaces and their
execution is decoupled by means of message queues, as shown in Figure 2.3. Concurrent access to these message queues is controlled by the software interrupt mechanism,
as explained in the next section.
2.4 The software interrupt mechanism and networking processing
Networking processing within the BSD operating system is pipelined and interrupt driven. In order to show how this works, let us describe the sequence of chief events that occur during message reception and transmission. If you feel lost during the first read, please keep an eye on Figure 2.3 during a second pass; it helps. In the following description, when we say "the system" we mean a computer system executing a BSD-derived operating system.
2.4.1 Message reception
When a network interface card captures a message from a communications link, it
posts a hardware interrupt to the system’s central processing unit. Upon catching this interrupt—preempting any running application program and entering supervisor mode and
the operating system kernel’s address space—the system executes some networkinterfaces layer task and marshals the message from the network interface card’s local
memory to a protocols layer’s mbuf queue in main memory. During this marshaling the
system does any data-link protocol duties and determines to which communication domain the message is destined. Just after leaving the message in the selected protocol’s
mbuf queue and before terminating the hardware interrupt execution context, the system
posts a software interrupt addressed to the corresponding protocols layer task. Considering that the arrived message is destined to a system’s application program and that the
addressed application has an opened socket, the system, upon catching the outstanding
software interrupt, executes the corresponding protocols layer task and marshals the
message from the protocol’s mbuf queue to the addressed socket’s mbuf queue. All protocols processing within the corresponding communication domain takes place at this
software interrupt’s context. Just after leaving the message into the addressed socket’s
mbuf queue and before terminating the software interrupt execution context, the system
flags for execution any application program that might be sleeping over the addressed
socket, waiting for a message to arrive. When the system is finished with all the interrupts execution contexts and its scheduler schedules for execution the application program that just received the message, the system executes the corresponding sockets
layer task and marshals the message from the socket’s mbuf queue to the corresponding
application’s buffer in user address space. Afterwards, the system exits supervisor mode
and the address space of the operating system’s kernel and resumes the execution of the
communicating application program.
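To make the control flow concrete, the following much-simplified sketch follows the style of the 4.4BSD sources. It is not compilable as-is—it leans on the kernel's own declarations of mbufs, interfaces, the spl functions and the ipintrq queue—and the _sketch suffix marks the functions as illustrations rather than the actual kernel routines.

    /* Hardware-interrupt context: the driver has already copied the frame from
     * the network interface card's local memory into the mbuf chain m. */
    void ether_input_sketch(struct ifnet *ifp, struct mbuf *m)
    {
        int s = splimp();            /* protect IP's input mbuf queue */
        IF_ENQUEUE(&ipintrq, m);     /* leave the message in the protocol queue */
        splx(s);
        schednetisr(NETISR_IP);      /* post the software interrupt at splnet */
    }

    /* Software-interrupt context at splnet: drain IP's input mbuf queue. */
    void ipintr_sketch(void)
    {
        struct mbuf *m;

        for (;;) {
            int s = splimp();
            IF_DEQUEUE(&ipintrq, m);
            splx(s);
            if (m == NULL)
                break;
            /* All protocol processing happens here; for a locally addressed
             * TCP or UDP message the data ends up in a socket's mbuf queue
             * and any process sleeping on that socket is flagged to run. */
            ip_input(m);
        }
    }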
Here let us point out a performance detail of the previous description. The message marshalling between mbuf queues does not always imply a data copy operation. Copy operations are involved when marshalling messages between a network interface card's local memory and a protocol's mbuf queue, and between a socket's mbuf queue and an application program's buffer. But there is no data copy operation between a protocol's and a socket's mbuf queues. Here, only mbuf references—also known as pointers—are copied.
2.4.2 Message transmission
Message transmission network processing may be initiated by several events. For
instance, by an application program issuing the sendmsg—or similar—system call. But
it can also be initiated when forwarding a message, when a protocol timer expires or
when the system has deferred messages.
When an application program issues the sendmsg system call, giving a data buffer as one of the arguments (other arguments are, for instance, the communication domain identification and the destination socket's address), the system enters supervisor mode and the operating system's kernel address space and executes some sockets layer task. This task builds an mbuf based on the selected communication domain and socket type and, considering that the given data buffer fits inside one mbuf, it copies the contents of the given data buffer into the built mbuf's payload. In case the communication channel protocol's state allows the system to immediately transmit a message, the system executes the appropriate protocols layer task and marshals the message through the arbitrary protocol structure of the corresponding communication domain. Among other protocol-dependent tasks, the system here selects a communication link for transmitting the message out. Considering that the network interface card attached to the selected communications link is idle, the system executes the appropriate network-interfaces layer task and marshals the message from the corresponding mbuf in main
memory to the network interface card’s local memory. At this point the system hands
over the message delivery’s responsibility to the network interface card.
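The tail of that path can be sketched as follows—again a simplification in the style of the 4.4BSD sources, not the actual ether_output code: the message is queued on the interface's output mbuf queue and, if the card is idle, its start routine is called at once; otherwise the mbuf stays queued for the deferred case described next.

    int ether_output_sketch(struct ifnet *ifp, struct mbuf *m)
    {
        int s = splimp();
        IF_ENQUEUE(&ifp->if_snd, m);          /* queue on the interface's output queue */
        if ((ifp->if_flags & IFF_OACTIVE) == 0)
            (*ifp->if_start)(ifp);            /* card idle: start transmission now */
        splx(s);
        /* If the card was busy, the mbuf remains in if_snd and the hardware
         * interrupt at the end of the busy period will transmit it. */
        return 0;
    }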
Observe that under the considered situation the system processes the message
transmission in a single execution context—that of the communicating application—and
no intermediary buffering is required. On the contrary, if for instance the system finds
an addressed network interface card busy, the system would place the mbuf in the corresponding network interface's mbuf queue and would defer the execution of the network-interfaces layer task. For cases like this, network interface cards are built to produce a hardware interrupt not just when receiving a message but at the end of every busy period. Moreover, network interface cards' hardware-interrupt handlers are built to always check for deferred messages at the corresponding network interface's output mbuf queue.
When deferred messages are found, the system does whatever is required to transmit
them out. Observe that in this case the message transmission is done in the execution
context of the network interface card’s hardware interrupt.
Another scenario happens if a communication channel protocol's state prevents the system from immediately transmitting a message. For instance, when a TCP connection's transmission window is closed [Stevens, 1994]. In this case, the system would place the message's mbuf in the corresponding socket's mbuf queue and defer the execution of
the protocols layer task. Of course, a deferring protocol must have some built-in means
for later transmitting or discarding any deferred message. For instance, TCP may open a
connection’s transmission window after receiving one or more segments from a peer.
Upon opening the transmission window, TCP will start transmitting as many deferred
messages as possible as soon as possible—that is, just after finishing message reception.
Observe that in this case the message transmission is done in the execution context of
the protocols layer reception software interrupt. Also observe that when transmitting deferred messages at the protocols layer, the system may defer again at the network-interfaces layer as explained above.
Communications protocols generally define timed operations that require the interchange of messages with peers, and thus require transmitting messages. For instance, TCP's delayed acknowledgment mechanism [Stevens 1994]. In these cases, protocols may rely on the BSD callout mechanism [McKusick et al, 1996] and request the system to execute some task at predefined times. The BSD callout mechanism uses the system's real-time clock for scheduling the execution of any enlisted task. It arranges itself to issue software interrupts every time an enlisted task is required to execute. If the called-out task initiates the transmission of a networking message, this message transmission is done in the execution context of the callout mechanism's software interrupt. Once again, transmission hold-off may happen at the network-interfaces layer, as explained above.
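As an illustration of the registration pattern only—the handler name below is hypothetical, and 4.4BSD's TCP actually batches delayed acknowledgments in its protocol fast-timeout routine—a protocol can ask the callout mechanism to run a task later with something like:

    /* Run the handler roughly 200 ms from now; hz is the number of clock
     * ticks per second and tp would be the protocol's control block. */
    timeout(tcp_delayed_ack_sketch, (void *)tp, hz / 5);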
Finally, let us consider the message-forwarding scenario. In this scenario some
communications protocol—implemented at the protocols layer—is capable of forwarding messages; for instance, the Internet Protocol (IP) [Stevens 1994]. During message
reception, a protocol like IP may find out that the received message is not addressed to
the local system but to another system that it knows how to reach by means of a routing table. In this case, the protocol will launch a message transmission task for the
message being forwarded. Observe that this message transmission processing is done in
the execution context of the protocols layer reception software-interrupt.
2.4.3 Interrupt priority levels
The BSD operating system assigns a priority level to each hardware and software
interrupt handler. The ordering of the different priority levels means that some interrupt
handler preempts the execution of any lower-priority one. One concern with these different priority levels is how to handle data structures shared between interrupt handlers
executed at different priority levels. The BSD interprocess communication facility code
is sprinkled with calls to the functions splimp and splnet. These two calls are always
paired with a call to splx to return the processor to the previous level. The result of this
synchronization mechanism is a sequence of events like the one depicted in Figure 2.4.
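A representative fragment of this pattern—paraphrased, not copied from any particular function—looks as follows:

    /* Sockets layer code runs at spl0; it raises the priority level before
     * touching a socket buffer that protocol code fills at splnet. */
    int s = splnet();
    /* ... examine and drain so->so_rcv, the socket's receive mbuf queue ... */
    splx(s);    /* return the processor to the previous priority level */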
1) While a sockets layer task is executing at spl0, an Ethernet card receives a message and posts a hardware interrupt causing a network interfaces layer task—the Ethernet device driver—to execute at splimp.
This interrupt preempts the sockets layer code.
2) While the Ethernet device driver is running, it places the received message into the appropriate protocols layer’s input mbuf queue—for instance IP—and schedules a software interrupt to occur at splnet. The
software interrupt won’t take effect immediately since the kernel is currently running at a higher priority level.
3) When the Ethernet device driver completes, the protocols layer executes at splnet.
4) A terminal device interrupt occurs—say the completion of a SLIP
packet. It is handled immediately, preempting the protocols layer, since
terminal processing’s priority, at spltty, is higher than protocols
layer’s.
5) The SLIP device driver places the received packet onto IP’s input mbuf
queue and schedules another software interrupt for the protocols layer.
6) When the SLIP device driver completes, the preempted protocols layer
task continues at splnet and finishes processing the message received
from the Ethernet device driver. Then, it processes the message received from the SLIP device driver. Only when IP's input mbuf queue is empty will the protocols layer task return control to whatever it preempted—the sockets layer task in this example.
7) The sockets layer task continues from where it was preempted.
Figure 2.4—Example of priority levels and kernel processing [Diagram: priority levels spl0, splnet, spltty and splimp from lowest to highest; a socket layer task running at spl0 is preempted by the Ethernet device driver at splimp (step 1), which schedules protocol processing (IP input + TCP input) at splnet (steps 2-3); the SLIP device driver at spltty preempts the protocol code (steps 4-5); the protocol code resumes and drains both messages (step 6), after which the socket task continues (step 7).]
2.5 BSD implementation of the Internet protocols
suite
Figure 2.5 shows control and data flow diagrams of the chief tasks that implement the Internet protocols suite within BSD. Furthermore, it shows their control and data
associations with chief tasks at both the sockets layer and the network-interfaces layer.
Within the 4.4BSD-lite source code distribution, the files implementing the Internet protocols suite are located at the sys/netinet subdirectory. On the other hand, the files
implementing the sockets layer are located at the sys/kern subdirectory. The files implementing the network-interfaces layer are scattered among a few subdirectories. The tasks implementing general data-link protocols, such as Ethernet, the Address Resolution Protocol or the Point-to-Point Protocol, are located at the sys/net subdirectory.
On the other hand, the tasks implementing hardware drivers are located at hardware dependent subdirectories, such as the sys/i386/isa or the sys/pci subdirectories.
As can be seen in Figure 2.5, the protocols implementation, in general, provides
output and input tasks per protocol. In addition, the IP protocol has a special
ip_forwarding task. It can also be seen that the IP protocol does not have an input
task. Instead, the implementation comes with an ipintr task. The fact that IP input
processing is started by a software interrupt may be the cause of this apparent exception to
the general rule. (The FreeBSD operating system drops the ipintr task in favor of an
ip_input task.) Observe that the figure depicts all the control and data flows corresponding to the message reception and message transmission scenarios described in the
previous section.
In order to complete the description let me note some facts on the network-interfaces layer. The tasks shown at the bottom half of the layer depict hardware-dependent tasks. The names depicted, Xintr, Xread and Xstart, are not actual task names but name templates. For building actual task names the capital "X" is substituted by the name of a hardware device. For example, the FreeBSD source code distribution has xlintr, xlread and xlstart for the xl device driver, which is the device driver used
for the 3COM’s 3C900 and 3C905 families of PCI/Fast-Ethernet network interface
cards.
Figure 2.5—BSD implementation of the Internet protocol suite. Only chief tasks, message queues and associations are shown. Please note that some control flow arrows are sourced at the bounds of the squares delimiting the implementation layers. This is for denoting that a task is executed after an external event, such as an interrupt or a CPU scheduler event. [Diagram: user processes issue recv/send system calls into the sockets layer's soreceive and sosend tasks, which use the socket receive and transmit buffers; the protocols layer holds tcp_input, udp_input, icmp_input, ipintr (fed by the ipintrq queue), ip_forward, tcp_output, udp_output and ip_output; the network-interfaces layer holds ether_input, ether_output, Xread, Xintr and Xstart with the if_snd queue and the received/transmitted packets; software interrupts at splimp (caused by the hardware-interrupt handler) and at splnet (caused by the interface layer) drive the lower layers. Legend: function call; function call (other tasks involved); data flow; task; message queue.]
2.6 Run-time environment: the host’s hardware
The BSD operating system was devised to run on computing hardware with an
organization much like the one shown in Figure 2.6. This computing hardware organization is widely used for building personal computers, low-end servers and workstations
or high-end embedded systems. The shown organization is an instance of the classical
stored-program computer architecture with the following features [Hennessy and Patterson 1995]:
• A single central processing unit
• A four-level, hierarchic memory (not shown)
• A two-tier, hierarchic bus
• Interrupt driven input/output processing (not shown)
• Programmable or direct memory access network interface cards

2.6.1 The central processing unit and the memory hierarchy
Nowadays, personal computers and similar computing systems are provisioned with high-performance microprocessors. These microprocessors in general leverage the following technologies: very low operation cycle periods, pipelines, multiple instruction issue, out-of-order and speculative execution, data prefetching, and trace caches.
In order to sustain a high operation throughput, this kind of microprocessor requires very fast access to instructions and data. Unfortunately, current memory technology lags behind microprocessor technology in its performance/price ratio. That is, low
latency memory components have to be small in order to remain economically feasible.
Figure 2.6—Chief components in a general purpose computing hardware [Diagram: the CPU and main memory are linked by the system bus; a bridge connects the system bus to the I/O bus, to which the network interface cards (NICs) and other devices attach.]

Consequently, personal computers and similar computing systems—but also other computing systems using high-performance microprocessors—are fitted with hierarchically organized memory. Ever faster and thus smaller memory components are
placed lower in the hierarchy and thus closer to the microprocessor. Several caching
techniques are used for mapping large address spaces onto the smaller and faster memory components, which in consequence are named cache memories [Hennessy and Patterson 1995]. These caching techniques mainly consist of replacement policies that swap out of the cache memory those sections of a computer program's address space (named address space pages) that are not expected to be used in the near future, in favor of active ones. The caching techniques also determine what to do with the swapped-out address space sections—they may or may not be stored in the memory component at the next
higher level, considering that the computer program’s complete address space is always
resident in the memory component at the top of the hierarchy.
Another important aspect of the microprocessor-memory relationship is the wire
latency. That is, the time required for a data signal to travel from the output ports of a
memory component to the input ports of a microprocessor, or vice versa. Nowadays, the
lowest wire latencies are obtained when placing a microprocessor and a cache memory
in the same chip. Latency increases when these components are placed within a single package, and increases further when the cache memory is part of the main memory component and thus sits on the opposite side of the system bus from the microprocessor.
Let us cite some related performance numbers of an example microprocessor. Intel's Pentium 4 microprocessor is available at speeds ranging from 1.6 to 2.4 GHz. It has a pipelined, multiple-issue, speculative, and out-of-order engine. It has a 20 KB, on-chip, separate data/instruction level-one cache, whose wire latency is estimated at two clock cycles. And it has a 512 or 256 KB on-chip and unified level-two cache, whose
wire latency is estimated at 10 clock cycles.
2.6.2 The busses organization
For reasons not relevant to this discussion, the use of a hierarchical organization
of busses is attractive. Nowadays, personal computers and similar computing systems come with two-tier busses. One bus, the so-called system bus, links the central processing unit and the main memory through a very fast point-to-point bus. The second bus,
named the input/output bus, links all input/output components or periphery devices, like
network interface cards and video or disk controllers, through a relatively slower multidrop input/output bus.
For quantitatively putting these busses in perspective, let us note some performance numbers of two widely deployed busses: the system bus of Intel's Pentium 4 microprocessor and the almost omnipresent Peripheral Component Interconnect (PCI) input/output bus [Shanley and Anderson 2000]. The specification for the Pentium 4's system bus states an operation speed of 400 MHz and a theoretical maximum throughput of 3.2 Gigabytes per second. (Here, 1 Gigabyte equals 10^9 bytes.) On the other hand, the PCI bus specification states a selection of path widths between 32 and 64 bits and a selection of operation speeds between 33 and 66 MHz. Consequently, the theoretical maximum throughput for the PCI bus stays between 132 and 528 Mbytes per second for the 33-MHz/32-bit PCI and the 66-MHz/64-bit PCI, respectively. (Here, 1 Mbyte equals 10^6 bytes.)
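These peak figures follow directly from multiplying the data path width by the clock rate, assuming one data phase per bus cycle and, for the Pentium 4 system bus, a 64-bit data path: $8\ \mathrm{bytes} \times 400\ \mathrm{MHz} = 3.2\ \mathrm{Gbytes/s}$ for the system bus, and $4\ \mathrm{bytes} \times 33\ \mathrm{MHz} = 132\ \mathrm{Mbytes/s}$ or $8\ \mathrm{bytes} \times 66\ \mathrm{MHz} = 528\ \mathrm{Mbytes/s}$ for the two PCI configurations.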
2.6.3 The input/output bus' arbitration scheme
One more important aspect to mention with respect to the input/output bus is its
arbitration scheme. Because the input/output bus is a multi-drop bus, its path is shared
by all components attached to it and thus some access protocol is required.
The omnipresent PCI bus [Shanley and Anderson 2000] uses a set of signals for
implementing a use-by-request master-slave arbitration scheme. These signals are emitted through a set of wires separated from the address/data wires. There is a request/grant
pair of wires for each bus attachment and a set of shared wires for signaling an initiator-ready event (FRAME and IRDY), a target-ready event (TRDY and DEVSEL), and for issuing commands (three wires).
A periphery device attached to the PCI bus (device for short) that wants to transfer
some data, requests the PCI bus mastership by emitting a request signal to the PCI bus
arbiter. (Bus arbiter for short.) The bus arbiter grants the bus mastership by emitting a
grant signal to a requesting device. A granted device becomes the bus master and drives
the data transfer by addressing a slave device and issuing to it read or write commands. A device may request bus mastership and the bus arbiter may grant it at any time, even when another device is currently performing a bus transaction, in what is called "hidden bus arbitration." This seems a natural way to improve performance. However, devices may experience reduced performance or malfunction if bus masters are preempted too quickly. The next subsection discusses this and other issues regarding performance and
latency.
The PCI bus specification does not define how the bus arbiter should behave when receiving simultaneous requests. The PCI 2.1 bus specification only states that the arbiter is required to implement a fairness algorithm to avoid deadlocks. Generally, some kind of bi-level round-robin policy is implemented. Under this policy, devices are separated into two groups: a fast-access group and a slow-access group. The bus arbiter rotates
grants through the fast access group allowing one grant to the slow access group at the
end of each cycle. Grants for slow access devices are also rotated. Figure 2.7 depicts
this policy.
Figure 2.7—Example PCI arbitration algorithm [Diagram: a fast access group with devices A, B and C, and a slow access group with devices a, b and c; example grant sequence: A B a A B b A B c.]
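A minimal simulation of this bi-level rotation, assuming every device in Figure 2.7 always has a request pending, is sketched below:

    #include <stdio.h>

    /* Bi-level round robin: one full pass over the fast-access group, then a
     * single grant to the slow-access group, whose grants also rotate. */
    int main(void)
    {
        const char *fast[] = { "A", "B", "C" };
        const char *slow[] = { "a", "b", "c" };
        int s = 0;

        for (int cycle = 0; cycle < 3; cycle++) {
            for (int f = 0; f < 3; f++)
                printf("%s ", fast[f]);   /* grant each fast-access device once */
            printf("%s ", slow[s]);       /* then one grant for the slow group */
            s = (s + 1) % 3;              /* rotate within the slow-access group */
        }
        printf("\n");
        return 0;
    }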
2.6.4 PCI hidden bus arbitration's influence on latency
PCI bus masters should always use burst transfers to transfer blocks of data between themselves and target devices. If a bus master is in the midst of a burst transaction and the bus arbiter removes its grant signal, this indicates that the bus arbiter has
detected a request from another bus master and is granting bus mastership for the next
transaction to the other device. In other words, the current bus master has been preempted. Due to PCI's hidden bus arbitration this could happen at any moment, even one bus cycle before the current bus master has initiated its transaction. Evidently this hampers PCI's burst transaction support and leads to bad performance.
In order to avoid this, the PCI 2.1 bus specification mandates the use of a master latency timer per PCI device. The value contained in this latency timer defines the minimum amount of time that a bus master is permitted to retain bus mastership. Therefore,
a current bus master retains bus mastership until either it completes its burst transaction
or its latency timer expires.
Note that independently of the latency timer a PCI device must be capable of
managing bus transaction preemption; that is, it must be capable of “remembering” the
state of a transaction so it may continue where it left off.
The latency timer is implemented as a configuration register in a PCI device’s
configuration space. It is either initialized by the system’s configuration software at
startup, or contains a hardwired value. It may equal zero, in which case a device can only enforce single-data-phase transactions. Configuration software computes latency timer values for PCI devices not having it hardwired from its knowledge of the bus speed and each PCI device's target value, stored in another PCI configuration register.
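For a rough sense of the time scale involved—the timer value here is hypothetical, chosen only for illustration—a latency timer of 32 clocks on a 33 MHz PCI bus guarantees a bus master a tenure of at least $32 / (33\ \mathrm{MHz}) \approx 0.97\ \mu\mathrm{s}$ before a pending preemption can take effect.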
2.6.5 Network interface card's system interface
There are two different techniques for interfacing a computer system with periphery devices like network interface cards. If using the programmable input/output technique, a periphery device interchanges data between its local memory and the system’s
main memory by means of a program executed by the central processing unit. This program issues either input/output or memory instructions that read or write data from or to particular main memory locations. These locations were previously allocated and
initialized by the system’s configuration software at startup. The periphery device’s and
motherboard’s organizations determine the use of either input/output or memory instructions. When using this technique, periphery devices interrupt the central processing
unit when they want to initiate a data interchange.
With the direct memory access (DMA) technique, the data interchange is carried out without central processing unit intervention. Instead, a DMA periphery device
uses a pair of specialized electronic engines for performing the data interchange with
the system’s main memory. One engine is part of the same periphery device while the
other is part of the bridge chipset; see Figure 2.6. Evidently, the input/output bus must
support DMA transactions. In a DMA transaction, one engine assumes the bus master
role and issues read or write commands; the other engine’s role is as servant and follows
commands. Generally, the DMA engine at the bridge chipset may assume both roles.
When incorporating a master DMA engine, a periphery device interrupts the central
processing unit after finishing a data interchange. It is important to note that DMA engines neither allocate nor initialize the main memory locations from or to which data is read or written. Instead, the corresponding device driver is responsible for that and somehow communicates the locations' addresses to the master DMA engine. The next subsection further explains this.
Periphery devices' system interfaces may incorporate both previously described techniques. For instance, they may rely on programmable input/output for setup and
performance statistics gathering tasks and on DMA for input/output data interchange.
2.6.6 Main memory allocation for direct memory access network interface cards
Generally, a DMA capable network interface card supports the usage of a linked
list of message buffers, mbufs, for data interchange with main memory; see Figure 2.8.
During startup, the corresponding device driver builds two of these mbuf lists, one for handling packets exiting the system, named the transmit channel, and the other for handling packets entering it, named the receive channel. Mbufs in these lists are wrapped
with descriptors that hold additional list management information, including the mbuf’s
main memory start address and size. Network interface cards maintain a local copy of
the current mbuf’s list descriptor. They use the list descriptor’s data to marshal DMA
transfers. A network interface card may get list descriptors either autonomously, by
DMA, or by device driver command, by programmable input/output code. The method
used depends on context as explained in the next paragraphs. Generally, a list descriptor
includes a “next element” field. If a network interface card supports it, it uses this field
to fetch the next current mbuf’s list descriptor. This field is set to zero to instruct the
network interface card to stop doing DMA through a channel.
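A minimal sketch of such a list descriptor follows; the field names are hypothetical and are not taken from any particular device driver:

    #include <stdint.h>

    /* Hypothetical DMA list descriptor wrapping one mbuf. The network interface
     * card keeps a local copy of the descriptor it is currently working on. */
    struct dma_descriptor {
        uint32_t buf_addr;     /* mbuf's start address in main memory */
        uint16_t buf_size;     /* mbuf's size in bytes */
        uint16_t msg_size;     /* bytes actually received (receive channel) */
        uint16_t processed;    /* "message processed" flag (transmit channel) */
        uint32_t next;         /* next descriptor's address; zero stops DMA */
    };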
Transmit channel. At system startup, transmit channels are empty and, consequently, device drivers clear—writing a zero—network interface cards’ local copy of
the current mbuf descriptor. When there is a network message to transmit, a device
driver queues the corresponding mbuf to a transmit channel and does whatever is necessary so the appropriate network interface card gets its copy of the new current mbuf's
list descriptor. This means that a device driver either copies the list descriptor by programmable input/output code, or instructs the network interface card to DMA copy it.
Advanced DMA network interface cards are programmed to periodically poll transmit
channels looking for new elements, simplifying the device driver task. Generally, list
descriptors have a “message processed” field, which is required for transmit channel
operation: after a network interface card DMA copies a message from the transmit
channel, it sets the “message processed” field and, after transmitting the message
through the data link, notifies the appropriate device driver of a message transmission.
(Message transmission notification may be batched for improved performance.) When
acknowledging an end-of-transmission notification, a device driver will walk the transmit channel, dequeuing each list element that has its "message processed" field set.
Receive channel. At system startup, device drivers provide receive channels with
a predefined number of empty mbufs. Naturally, this number is a trade-off between
channel-overrun probability and memory wastage, which in turn depends on the difference in operating speed between the host computer and the data link. Continuing with
OSCAR IVÁN LEPE ALDAMA
PH.D. DISSERTATION
2.6—RUN-TIME ENVIRONMENT: THE HOST’S HARDWARE
39
system startup, device drivers do whatever is necessary so that network interface cards get
their copy of the new current mbuf’s list descriptor of the appropriate receive channel.
This means that device drivers either copy the list descriptor by programmable input/output code or instruct network interface cards to DMA copy it. It may also happen
that device drivers do not have to signal network interface cards because the latter are
programmed to periodically poll receive channels looking for new elements. After receiving and DMA copying one or more network messages, a network interface card notifies the appropriate device driver of a message reception. When acknowledging a reception notification, a device driver will walk the receive channel dequeuing and processing each list element that has its “message size” field greater than zero. Moreover, a
device driver must provide its receive channel with more empty mbufs as soon as possible to keep the corresponding network interface card from stalling, a situation that may
result in network message losses.
Figure 2.8—Main memory allocation for device drivers of network interface cards supporting direct memory access [Diagram: a) transmit channel: the device driver queues filled message buffers and dequeues processed ones; the NIC holds a local copy of the current descriptor and DMA-copies filled buffers out to the data link. b) receive channel: the device driver queues empty message buffers and dequeues filled ones; the NIC DMA-fills buffers with messages arriving from the data link.]
2.7 Other system’s networking architectures
BSD-like networking architectures have been known to have good and bad qualities almost since their inception [Clark 1982]. While communications links worked at relatively low speeds, the tradeoff between modularity and efficiency was positive. With the advent of multi-megabit data communication technologies this ceased to hold true. Worse yet, for some ten years networking application programs have not been improving performance in proportion to the central processing unit's and communication links' speeds. Others [Abbot and Peterson 1993; Coulson and Blair 1995; Druschel and Banga 1996; Druschel and Peterson 1993; Eicken et al. 1992; Geist and Westall 1998; Hutchinson and Peterson 1991; Mosberger and Peterson 1996] have pointed out that operating
system overheads and networking software not exploiting cache memory features cause
the problem. These same people have proposed new networking architectures to improve overall networking application programs’ performance. Strikingly, although these
new networking architectures are relatively old, none have been deployed in production
systems. Arguably, the reason is that, at best, most of these networking architectures require large changes in networking application programs. In the worst case, they also
require changes in communication protocols’ design and implementation.
In this section we briefly explore post-BSD networking architectures. We think
this is interesting in the context of this document because we believe that these relatively new networking architectures have several similarities with the BSD networking
architecture. Consequently, the methodology defined in this document may easily be
applied for studying the performance of these other systems.
2.7.1 Active Messages [Eicken et al. 1992]
An active message is a message that incorporates the name of the remote procedure that will process it. When an active message arrives at the system, the operating system does not have to buffer the data because it can learn from the active message the name of the software module where the data goes. Like traditional communications protocol messages, active messages are encapsulated. Unlike traditional protocol layers, however, each software layer can learn the next procedure that will process the message and directly call it to run. This can avoid memory copy operations and reduce central processing unit context switches. Unfortunately, this requires a network-wide naming space for procedures that is not standard in current protocol specifications. Fast Sockets [Rodrigues, Anderson and Culler 1997] is a protocol stack with active messages that implements a BSD-socket-like network application program interface and can interoperate with legacy TCP/IP systems. However, this interoperation is limited.
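The following sketch conveys the idea within a single address space; the names are hypothetical, and in a real network implementation the handler would be identified by an index into a per-node table of registered procedures rather than by a raw function pointer:

    #include <stddef.h>

    typedef void (*am_handler_t)(const void *payload, size_t len);

    /* An active message carries the name of the procedure that will consume it. */
    struct active_message {
        am_handler_t handler;
        size_t       len;
        const void  *payload;
    };

    /* On arrival there is nothing to buffer: dispatch straight to the handler. */
    static void am_dispatch(const struct active_message *m)
    {
        m->handler(m->payload, m->len);
    }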
2.7.2 Integrated Layer Processing [Abbot and Peterson 1993]
Integrated Layer Processing (ILP) reduces memory copy operations and creates a running pipeline of layers' code by means of a proposed dynamic code hooking. This dynamic hooking, as shown in Figure 2.9, allows integrally running independently constructed layers' code and eliminates interlayer buffering. The result is improved performance, due to better cache behavior, fewer memory copy operations and fewer central processing unit context switches, all without sacrificing modularity. A brief
description of this technique follows.
Dynamic code hooking requires that software modules implement a special interface. Also, the hooked modules have to agree on the data type that they process, i.e., a
machine word. This means that ILP cannot be applied to legacy protocol stacks but new
protocols can be designed to meet the interface and data specification. Basically, the
module interface is a mixed form of code inlining and loop unrolling with an interlayer communication mechanism implemented with central processing unit registers. (This can be done only if all modules process data one word at a time.) That is, the programmer must implement each module as a loop that processes incoming metadata a word at a time. Inside the loop, the programmer must place an explicit hook that the runtime environment will
use to dynamically link the next layer’s code and transfer data. The first and last modules in the pipeline are special because they read/write information from/to a buffer in
memory.
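To give the flavor of the technique—this is an illustration, not the ILP interface itself—the following sketch integrates two hypothetical per-word manipulations, a byte swap and a running checksum, into one loop, so each word is read and written exactly once and intermediate results stay in registers:

    #include <stdint.h>
    #include <stddef.h>

    static uint32_t layer_swap(uint32_t w)              /* "layer N" manipulation */
    {
        return (w >> 24) | ((w >> 8) & 0x0000ff00u) |
               ((w << 8) & 0x00ff0000u) | (w << 24);
    }

    static uint32_t layer_sum(uint32_t w, uint32_t *s)   /* "layer N-1" manipulation */
    {
        *s += w;
        return w;
    }

    /* Integrated processing: one read and one write per word, no interlayer buffers. */
    uint32_t ilp_process(const uint32_t *in, uint32_t *out, size_t nwords)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < nwords; i++) {
            uint32_t w = in[i];
            w = layer_swap(w);
            w = layer_sum(w, &sum);
            out[i] = w;
        }
        return sum;
    }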
There is another restriction to ILP. Protocol metadata is encapsulated; that is, one
layer's header is another one's data. Furthermore, this layer's header could be nonexistent for a third one. So, for this technique to work, the integrated processing can only be applied to the part of a message that all layers agree exists and has the same meaning; that is, the user application data. This means another change in protocol specification; protocol processing must be divided into three parts: (1) protocol initialization,
(2) data processing and (3) protocol consolidation. The first and third parts implement protocol control, so they process control data in message headers and/or trailers. Those parts have to be run in the traditional serial order. Only the data processing parts can be integrated. This is shown in Figure 2.9.

Figure 2.9—Integrated layering processing [Abbot and Peterson 1993] [Diagram: a) separate per-layer loops—each reading from an input buffer, applying a function f1, f2 or f3, and writing to an output buffer—contrasted with a single loop that reads once, applies all three functions, and writes once. b) each level's INITIAL and FINAL parts run serially around one INTEGRATED DATA MANIPULATION stage for levels N, N-1 and N-2.]
2.7.3 Application Device Channels [Druschel 1996]
This technique improves overall telematic system performance by allowing common telecommunication operations to bypass the operating system's address space; that is, send and receive data operations are done by layered protocol code in the user application's address space. System security is preserved because connection establishment and tear-down are still controlled by protocol code within the operating system's kernel.
Figure 2.10 shows a typical scenario for application device channels (ADCs).
There, it can be seen that an ADC is formed by a pair of code stubs that cooperate to
send and receive messages to and from the network. One stub is at the adapter driver
and the other at the user application. The user-level layered protocols can process messages using single-address-space procedure-calls to improve overall performance. Also,
because the operating system's scheduler is not aware of the layered structure of user-level protocols, it is unlikely to interleave their execution with other processes, and overall performance is improved. Moreover, because there is no need for general-purpose protocols, user-level protocols can easily be optimized to meet user application needs, further improving performance. Finally, some protocol code must still be implemented inside the kernel because the operating system controls ADC allocation and connection establishment and tear-down.
Figure 2.10—Application Device Channels [Druschel 1996] [Diagram: the application contains send and receive stubs and a protocol library; connection management and the network protocols remain in the OS; a pair of ADC stubs links the application directly to the network interface.]
2.7.4 Microkernel operating systems' extensions for improved networking [Coulson et al. 1994; Coulson and Blair 1995]
Even though mechanisms like Application Device Channels can exploit the notion of user-level resource management to improve traditional telematic systems' performance, other requirements of modern telematic systems cannot be met. Telematic systems like distributed multimedia, loosely coupled parallel computing and wide-area network management require real-time behavior and end-to-end communications quality assurance mechanisms, better known as QoS mechanisms. While legacy operating systems cannot meet these requirements [Vahalia 1996], the so-named microkernel operating systems can [Liedtke 1996]. Furthermore, microkernel operating systems lend themselves to user-level resource management and can have multiple personalities that permit concurrently running modern and legacy telematic systems.
Under this trend, the SUMO project at Lancaster University (http://www.comp.lancs.ac.uk/computing/research/sumo/sumo.html, current as of 2 July 2002) has extended a microkernel OS to support an end-to-end quality of service architecture over ATM networks for multimedia distributed computing. Figure 2.11 shows the microkernel SUMO extensions. SUMO extends the basic microkernel concepts with rtports, rthandlers, QoS controlled connections, and QoS handlers. This set of abstractions promotes an event-driven style of programming that, with the help of a split-level central processing unit scheduling mechanism, a QoS external memory mapper, a connection-oriented network actor, and a flow management actor, allows the implementation of QoS controlled multimedia distributed systems.
Figure 2.11—SUMO extensions to a microkernel operating system [Coulson et al. 1994;
Coulson and Blair 1995]
2.7.5 Communications-oriented operating systems [Mosberger and Peterson 1996]
While the techniques just described look for better telematic system’s performance in computation-oriented operating systems, the Scout project at Princeton University (formerly at University of Arizona) is aimed at building a communication-oriented
operating system. This project is motivated by the fact that telematic systems workloads
have only a low percentage of computations. Here, communication operations are the
common case.
Just as computation-oriented operating systems extend the use of the processor by means of software abstractions like processes or threads, the Scout operating system extends the use of physical telecommunication channels by means of a software abstraction called a "path". This path is the logical next step after all the previously described techniques and can be seen as the virtual trail drawn by a message that travels through a
layered system from a source module to a sink one; see Figure 2.12. This path has a pair
of buffers, one at each end, and a transformation rule. The buffers are used the regular
way; the transformation rule represents the set of operations that act over each message
that travels through the path.
Within Scout, all resource management and allocation is done on behalf of a path.
So, it is possible to obtain the following benefits: (1) early work segregation for better
resource management, (2) central processing unit scheduling for the whole path, which improves performance, (3) easy implementation of admission control mechanisms, and (4) early work discard for reduced waste of resources.
Figure 2.12—Making paths explicit in the Scout operating system [Mosberger and Peterson 1996]
2.7.6 Network processors
Although the network processor concept is not a networking architecture solution to the problem of better telematic systems, we decided to include a brief mention of it because it is the current hot topic in software-programmable hardware for communications systems. (We got 15 references after a search on IEEE Xplore for the pattern "'network processors' in <ti> not 'neural network' in <ti>" restricted to 1999 and later. Besides, see Geppert, L., Ed., "The new chips on the block," IEEE Spectrum, January 2001, pp. 66-68.) Network processors are microprocessors specialized for
executing communications software. They exploit communications software’s intrinsic
parallelism at several levels: data, task, thread and instruction [Verdú et al. 2002]. Data-level parallelism (DLP) exists because, in general, one packet's processing is independent
of others—previous or subsequent. Vector processors and SIMD chip multiprocessors
are known to effectively leverage this kind of parallelism. They are also known to require processing regularity, however, to avoid the load imbalance that hampers performance. Communication software does not exhibit enough processing regularity—in general, the number of instructions required to process one packet has no correlation with others. One workaround for this problem exploits thread-level parallelism (TLP) in conjunction with DLP. The idea then is to simultaneously combine several execution
contexts within the microprocessor so if a microprocessor’s execution unit stalls for
whatever reason (the address resolution is not complete, the forwarding data has to be fetched from main memory, or the current communication session's state impedes further
processing) the hardware may automatically change the execution context and proceed
processing another thread—packet. This technique, which has in multitasking its software counterpart, is known as simultaneous multi-threading [Eggers et al. 1997].
2.8 Summary
• Internet protocols' BSD implementation is networking's de facto standard. In some way or another, most available systems derive their structure, concepts, and/or code from it.
• Networking processing within a BSD system is pipelined and governed by the software interrupt mechanism.
• Message queues are used by a BSD system for decoupling processing between the networking pipeline's stages.
• The software interrupt mechanism defines preemptive interrupt levels.
• All of the above suggests to us that the widely used, single-queue performance models of software routers incur significant error. We further discuss this in chapter 3, when proposing a queuing network performance model for these systems.
• Networking processing in PC-based telematic systems has some random variability due to microprocessor features like pipelines, multiple instruction issues, out-of-order and speculative execution, data prefetching, and trace caches. We take this into account when defining a characterization process for these systems, as discussed in chapter 3.
• The PCI I/O bus uses a simple bi-level round-robin policy for bus arbitration. We will show in chapter 3 that this may keep a PC-based software router from sustaining system-wide QoS behavior, and in chapter 4 we propose a solution to this problem.
• Devices and device drivers supporting DMA transactions use a pair of message buffer lists, called receive and transmit channels, which may be used to implement a credit-based flow-control mechanism for fairly sharing I/O bus bandwidth, as will be discussed later in chapter 4.
Chapter 3
Characterizing and modeling a personal
computer-based software router
3.1 Introduction
This chapter presents our experiences building and parameterizing performance models for a personal computer-based software IP router and IPsec security gateway; that is, machines built upon off-the-shelf personal computer technology. The resulting models are open multiclass priority networks of queues that we
solved by simulation. While these models are not particularly novel from the system
modeling point of view, in our opinion, it is an interesting result to show that such models can estimate, with high accuracy, not just average performance numbers but the
complete probability distribution function of packet latency, allowing performance
analysis at several levels of detail. The validity and accuracy of the queuing network
models have been established by contrasting their packet latency predictions in both the time and probability spaces. Moreover, we introduced into the validation analysis the predictions of a software IP router's single, first-come, first-served queue model. We did this
for quantitatively assessing the advantages of the more complex queuing network model
with respect to the simpler and widely used but not so accurate, as here shown, single
queue model, under the considered scenario that the software IP router’s CPU is the system bottleneck and not the communications links. The single queue model was also
solved by simulation. The queuing network models were successfully parameterized
with respect to central processing unit speed, memory technology, packet size, routing
table size, and input/output bus speed. Results reveal unapparent but important performance trends for software IP router design.
This chapter is organized as follows. The first two sections set the background.
Section 3.2 briefly discusses trade-offs in system modeling and puts in perspective
the appropriateness of networking software’s single queue models with respect to queuing network models. Section 3.3 presents personal computer-based IP software routers’
chief technological and performance issues. Moreover, it puts this software IP router
technology in perspective when comparing it with others. Following the background
sections, section 3.4 presents the case for queuing network modeling of personal computer-based IP software routers and describes an example model. Furthermore, it explains how to modify or extend the example model for modeling such routers with different configurations. Then, section 3.5 describes the laborious system characterization
process and the implications that this process’s results had on system modeling. Section
3.6 presents validation results for the example model, and section 3.7 argues about its
parameterization. Based on the results of this parameterization, section 3.7 also presents interesting software router performance trends. Finally, section 3.8 presents example uses of personal computer-based software IP router queuing network models for
capacity planning and as uniform experimental test beds, and section 3.9 summarizes
the chapter.
3.2 System modeling
Evidently, performance models of computer systems are important for researching
and developing new ideas. These performance models are commonly built from a spectrum of techniques such as those shown in Figure 3.1. A researcher or engineer may use hardware prototyping, hardware simulators, queuing networks, simple queues or simple math. Evidently, there are tradeoffs to consider when choosing a technique. Complexity, ease of reasoning, and the relevance of the obtained results are of chief concern. Clearly, hardware prototyping is the technique that can give results with the greatest relevance. However, it is also the most complex technique and it is difficult to reason with—it is not easy to see how a component's variation influences others or the whole system. On the other
side of the spectrum, single-queue theory is a simple technique that is easy to reason
with but gives results of limited relevance or scope. Networks of queues are important
models of multiprogrammed and time-shared computer systems. However, these models
have not been used for performance modeling of computer networking software. Instead, simple queues are generally used for modeling network nodes implementing networking software.
Figure 3.1—A spectrum of performance modeling techniques [Diagram: hardware prototype, hardware simulator, queuing network, FIFO queue, and simple calculations, ranked by complexity/time cost, ease of reasoning, and relevance of the given results.]
3.3 Personal computer-based software routers
3.3.1 Routers' rudiments
A router is a machine capable of conducting packet switching or layer 3 forwarding. Within the Internet realm, routers forward Internet Protocol (IP) datagrams and
bind the subnets that form the Internet. Consequently, IP routers’ performance directly
influences packets' end-to-end throughput, latency and jitter within the Internet.
In general, in order to perform their chief function, routers do several tasks [McKeown 2001]; see Figure 3.2. These tasks have different compliance, urgency, and complexity levels, and are done at different time scales. For instance, routing is a complex task (mainly due to the size of the data it processes and because this data is stored distributively) that is done every time there is a change in network topology and may take up to a few hundred seconds to complete [Labovitz et al 2001]. On the other hand, packet classification has to be done every time a packet arrives and, consequently, should complete at "wire speed," which means, for example, around five microseconds for Fast Ethernet [Quinn and Russell 1997] and around two microseconds for Gigabit Ethernet (when operating half duplex and with packet bursting enabled) [Stephen 1998]. In order to cope with this operational diversity, routers have, in general,
a two-plane architecture; see Figure 3.2.
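As a back-of-the-envelope check, the Fast Ethernet figure corresponds to the time a minimum-size 64-byte frame occupies a 100 Mb/s link, ignoring preamble and inter-frame gap: $64 \times 8\ \mathrm{bits} / (100\ \mathrm{Mb/s}) = 5.12\ \mu\mathrm{s}$.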
3.3.2 The case for software routers
Evidently, a router's data plane implementation mostly determines the router's influence on packet throughput, latency and jitter. Naturally, different levels of router features, like raw packet switching performance, support for extended functionality, implementation cost or implementation flexibility, may be set using different
implementation technologies for the data plane. A router’s data plane may be implemented using the following technologies:
Figure 3.2—McKeown's [2001] router architecture [Diagram: a control plane (routing protocols, admission control, reservation table) above a data plane that performs per-packet processing—packet classification, policing and access control, forwarding table lookup, switching across an interconnect, and output scheduling—between ingress and egress.]
• General-purpose computing hardware and software. For instance, a workstation or personal computer running some kind of Unix or Linux
• Specialized computing hardware and software. For instance, Cisco's 2500 or 7x00 hardware executing Cisco's IOS forwarding software
• Application specific integrated circuits. For instance, Juniper's M-120
For performance reasons it is interesting to classify routers into software and hardware routers.
• Hardware routers are routers whose data plane is implemented with hardware only. For instance, Juniper's M-120
• Software routers are routers whose data plane is implemented partly or completely with software. For instance, Cisco's 2500 and 7x00 product series, or a workstation or personal computer running some kind of Unix or Linux
Evidently, hardware routers outperform software routers in raw packet switching.
At the Internet’s core, where data links are utterly fast and expensive, hardware routers
are deployed in order to sustain high data link utilization. At the Internet’s edge, factors
like multiprotocol support, packet filtering and ciphering, or above level 3 switching,
are more important than data link utilization, however. Consequently, at the Internet’s
edge software routers are deployed. Besides, software routers have other features that
make them more attractive than hardware routers in particular scenarios; software
routers are easier to modify and extend, have shorter times-to-market, longer life cycles,
and cost less.
3.3.3 Personal computer-based software routers
Evidently, a software router’s performance is affected by the following
• The host's hardware architecture
• The forwarding program's software architecture
• The network interface card's hardware and corresponding device driver architectures
For this research, we chose to work with personal computing technology. This technology provided us with an open work environment—one that is easily controllable and observable—for which there is a wealth of publicly available and highly detailed technical information. Moreover, other software routers have similar or at least comparable
architectures so our findings may be extrapolated. For this research, we used computers
with the following features:
• Intel IA32 architecture
• Intel's Pentium class microprocessor as central processing unit
• Peripheral Component Interconnect (PCI) input/output bus
• Direct memory access (DMA) capable network interface cards attached to the PCI input/output bus
• Internet networking software as a subsystem of the kernel of a BSD-derived operating system
For the sake of brevity, throughout the rest of this chapter we may refer to personal computer-based software IP routers simply as software routers, unless otherwise specified.
3.3.4 Personal computer-based IPsec security gateways
Kent and Atkinson [1998] define a secured communications architecture for the
Internet named IPsec. Within this architecture, an IP router implementing the IPsec protocols is called an IPsec security gateway. Throughout the rest of this chapter, for the sake of brevity we may refer to personal computer-based IPsec security gateways simply as security gateways, unless otherwise specified.
3.4 A queuing network model of a personal
computer-based software IP router
After this chapter’s description of personal computer-based software IP routers, it
should be clear that no single-queue model could capture the whole system behavior. In
order to model the pipeline-like organization and priority preemptive execution of the
BSD networking code, a queuing network model is at least required. Figure 3.3 shows a
queuing network model of a software router. This model corresponds to a personal
computer that executes the BSD networking code for forwarding IP datagrams between
two DMA capable network interface cards. Moreover, this model is restricted to a software router that is traversed by a single unidirectional packet flow. (Packets only enter
the software router through the number-one network interface card and only exit the
software router through the number-two network interface card.) Models for software
routers with a different number of network interface cards or a different packet flow configuration, or for security gateways, may be derived from the one shown here, as later explained. The shown model is comprised of:
• Four service stations, one for each of the software router's active elements: the central processing unit (CPU), the input/output bus control logic (I/O BUS), the network interface card's packet upload control logic (NA1IN) and the network interface card's packet download control logic (NA2OUT)
• Eight first-come, first-served (FCFS) queues (na1brx, na2btx, dmatx, dmarx, ipintrq, ifsnd, ipreturn and ifeotx) and their corresponding service time policies, which model network interface cards' local memory (na1brx, na2brx) and BSD networking mbuf queues. The number and meaning of the networking-related queues (dmarx, ipintrq, ifsnd, ipreturn and ifeotx), as well as the features of the service time policies applied by the CPU service station, are a result of the characterization process that section 3.5 describes. The service time policies applied by the I/O BUS service station (to customers at the na1brx and dmatx queues) were not computed from our own measurements but from the results presented by Loeb et al [2001]. A PCI bus working at 33 MHz, with a 32-bit data path, whose data phases last one bus cycle, and whose data transactions were never preempted, was considered. The service time policy applied by the NA2OUT service station (to customers at the na2btx queue) corresponds to a zero-overhead synchronous data link of particular speed
• Nine customer transitions between queues representing the per-packet network processing execution order. Cloning transitions, pictured by circles at their cloning point, result in customers' copies walking the cloning paths. Blockable transitions, having a "t"-like symbol over the blocking point, result in service stations not servicing transitions' source queues if transitions' destination queues are full
• One customer source for driving the model. It spawns one customer per packet traversing the software router. An auxiliary first-come, first-served queue and its corresponding service time policy are used for modeling the communications medium's speed
Figure 3.3—A queuing network model of a personal computer-based software router that has two network interface cards and that is traversed by a single packet flow. The number and meaning of the shown queues is a result of the characterization process presented in the next section. (Diagram summary: service stations NA1IN, NA1OUT, NA2IN, NA2OUT, the round-robin I/O BUS and the preemptive-priority CPU; CPU priority numbers are 1 for the fast noise (OSFN) and slow noise (OSSN) queues, 2 for dmarx and ifeotx, 3 for ifsnd, 4 for ipreturn and 5 for ipintrq; auxiliary elements are the customer source SRC, the sink SINK and DEVNULL. Legend: dma - Direct Memory Access; if(IF) - Interface; ifeotx - IF End Of Transmission; na(NA) - Network Adapter; nab - NA buffer.)
• A set of two customer sources, two FCFS queues and two transitions through the central processing unit for modeling the "noise" process. This process is defined and its relevance explained in subsections 3.5.9 and 3.5.10
• A set of one FCFS queue and one customer sink for computing per-packet statistics
• A set of one FCFS queue and one customer sink for trashing residual customers
Can this relatively simple model give useful predictions? Is it a better model of a software router than the widely used single-queue model, under the considered scenario in which the central processing unit, and not the communication links, is the system's bottleneck? These are some questions for which we wanted answers but could not find them at the time we started this research. Before elaborating on the quest for these answers, the rest of this section presents modeling details. These details are important for deriving models of software routers with different configurations.
3.4.1 The forwarding engine, the network interface cards and the packet flows
Model elements depicted in Figure 3.3 within the dotted line model the software router. Model elements outside the dotted line are auxiliary elements. For modeling a software router with more network interface cards, some elements have to be replicated. Gray areas depict elements that jointly model a network interface card. A model requires one such group of elements for each network interface card that both inputs and outputs packets. Network interface cards that only input or only output packets require only part of these elements. A model requires some other queues (and corresponding service time policies and transitions) to be replicated depending on the packet flow configuration. The figure depicts these queues with a shadow along their top right side. To further clarify this, consider the example shown in Figure 3.3. As said before, for this model, packets enter the software router through one network interface card and exit it through the other. Consequently, the model uses one dmarx queue and one pair of dmatx and ifsnd queues. Indeed, one dmarx queue is required for each network interface card through which packets enter the software router, and one pair of dmatx and ifsnd queues is required for each network interface card through which they exit.
3.4.2 The service stations' scheduling policies and the mapping between networking stages and model elements
The service station acting as the input/output bus has a round-robin, non-preemptive scheduling policy that mimics the basic behavior of the PCI bus' arbitration scheme. The service station acting as the central processing unit has a prioritized preemptive scheduling policy that mimics the basic behavior of BSD's software interrupt mechanism. At any given time, this service station preemptively serves the customer at the front of the queue with the smallest priority number. Figure 3.3 shows the priority number in front of every queue served by the CPU service station. The priority scheme shown in Figure 3.3 models the fact that any task at the network-interfaces layer preempts any task at the protocols layer. It also models the fact that once a message enters a layer, its processing is not preempted by another message entering the same layer.
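To make this discipline concrete, the following minimal C sketch shows one way the next customer to serve could be selected; it is illustrative only, it is not taken from the simulator used in this work, and all names are hypothetical:

#include <stddef.h>

struct customer { struct customer *next; /* per-packet data would go here */ };

struct queue {
        int priority;              /* smaller number = higher priority  */
        struct customer *head;     /* first waiting customer, or NULL   */
};

/*
 * Return the queue the CPU service station should serve next, or NULL
 * if all queues are empty.  'queues' is assumed sorted by ascending
 * priority number, so the first non-empty entry wins.  If the returned
 * queue's priority number is smaller than that of the queue whose
 * customer is currently in service, the simulator preempts the current
 * service, which is exactly the behavior described above.
 */
static struct queue *
next_to_serve(struct queue *queues, size_t nqueues)
{
        for (size_t i = 0; i < nqueues; i++)
                if (queues[i].head != NULL)
                        return &queues[i];
        return NULL;
}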
To better explain the above, consider Figure 3.4. This figure shows a software router model with a different mapping of networking stages to queues and their corresponding service time policies. Besides, the input/output bus logic and the noise process are not modeled, and the network interface cards' models are simplified. This figure's model uses a one-to-one mapping between the C language functions implementing BSD networking and the depicted model's queues. You can check this by comparing Figure 3.4 with the previous chapter's Figure 2.5. Observe that the queues named ip_input, ip_forward, ip_output and ether_output, which represent tasks at the protocols layer, all have higher priority numbers than the queues named ether_input, if_intr, if_read and if_start, which are tasks at the network-interfaces layer. Moreover, the priority number of ip_input is greater than that of ip_forward, which is greater than that of ip_output, which is greater than that of ether_output. This models the fact that a message recently arrived at ip_input waits for processing until a message that is being forwarded (ip_forward) completes the rest of the protocols layer's stages, that is, ip_output and ether_output. For the same reason the priority number of if_intr is greater than that of if_read and if_start, which is greater than that of ether_input.
In order to complete this discussion, let us note that while Figure 3.4's model is clearly a more detailed model of a software router than that of Figure 3.3, it is not a better model. As will be explained in the next section, Figure 3.4's model gives inaccurate results due to service time correlation between some networking stages.

Figure 3.4—A queuing network model of a software router that shows a one-to-one mapping between C language functions implementing the BSD networking code and the depicted model's queues. In order to simplify the figure, this model represents neither the software router's input/output bus nor the noise process, and it models the network interface cards simplistically. (Diagram summary: the CPU serves the queues ether_input (priority number 1), if_read and if_start (2), if_intr (3), ether_output (4), ip_output (5), ip_forward (6) and ip_input (7), together with the NICs' FIFOs and customer sources/sinks.)
3.4.3 Modeling a security gateway

Figure 3.5 shows a queuing network model of a software router configured as a security gateway. The model corresponds to a personal computer that executes BSD networking code implementing the IPsec architecture [Kent and Atkinson 1998]. Note that the model assumes that the in-kernel networking software processes IPsec tunnels as pipelines, and thus, at "every exit of a tunnel" it puts the resulting packet back into the ipintrq queue.
Figure 3.5—A queuing network model of a software router configured as a security gateway. The number and meaning of the shown queues is a result of the characterization process presented in the next section. (The diagram matches Figure 3.3 except that the ipintrq queue now carries ip, esp and ah customers. Legend: dma - Direct Memory Access; if(IF) - Interface; ifeotx - IF End Of Transmission; na(NA) - Network Adapter; nab - NA buffer.)
3.5 System characterization
This section reports on the assessment of the service time policies applied by the CPU service station of a software router's queuing network model. These service time policies model the execution times of the BSD networking code executed on behalf of each forwarded IP datagram. We used software profiling to assess these execution times. Of the several tools and techniques available for profiling in-kernel software, software profiling offers a good trade-off between process intrusion, measurement quality and setup cost. We expected the profiled code to have stochastic behavior due to the hardware's cache memory and dynamic instruction scheduling features (branch prediction, out-of-order and speculative execution). Thus, we strove to compute service time histograms rather than single-value descriptors. We also carried out a data analysis that unveiled some data correlations and hidden processes, which forced us to adapt the queuing network model's structure. The rest of this section reports the details of this process.
3.5.1 Tools and techniques for profiling in-kernel software
As stated before, for our research we used networking code that is a subsystem of the kernel of a BSD-derived operating system. More specifically, we used the kernel of FreeBSD release 4.1.1. From the profiling point of view this kind of software presents a problem: the microprocessor runs the operating system kernel's code in supervisor (or similar) mode and, because of this, special tools and techniques are needed for profiling "live" kernel code, where by "live code" we mean code that is executed at full speed, in contrast to code that is executed step-by-step. Of the tools and techniques known to us for kernel code profiling we used software profiling [Chen and Eustace 1995; Kay and Pasquale 1996; Mogul and Ramakrishnan 1997; Papadopoulos and Parulkar 1993]. This technique is a reasonably flexible tool that does not require additional support. In contrast, other tools, like programmable instrumentation toolkits [Austin, Larson and Ernst 2002] and logic analyzers, although more powerful, are complex and proprietary, and thus require a considerable amount of time, effort and money. For the kind of task at hand, the use of these tools was considered too expensive in time, effort and money.
3.5.2 Software profiling
A software probe may be defined as a small piece of software manually introduced at strategic places of the target software for gathering performance information. Data structures and routines that manipulate the recorded information supplement the software probes. One important aspect when designing software probes, from now on just probes, is minimizing their overhead.
We found two ways to use probes [Kay and Pasquale 1996; Mogul and Ramakrishnan 1997; Papadopoulos and Parulkar 1993]. Each way uses a different mechanism to gather information. One uses software-only mechanisms, resulting in instruments that are, so to speak, self-sufficient. The other relies on special-purpose microprocessor registers used as event counters. Both have trade-offs. Software-only probes are portable but can incur relatively higher overhead, depending on the complexity of the gathered information. Event-counter-based probes are hardware dependent, and thus not portable, but can gather information hidden from software, like the instruction mix, interruptions, TLB misses, etc.
As our reference systems used Intel Pentium-class microprocessors, our probes take advantage of these microprocessors' performance monitoring event counters (PMEC). This kind of microprocessor has three such event counters, of which two are programmable [Shanley 1997]. Four model-specific registers (MSR), which can be accessed through the rdmsr (read MSR) and wrmsr (write MSR) ring-0 instructions, implement the three PMECs. One MSR implements the PMECs' control and event selector register. It is a 64-bit read/write register employed for programming the type of event that each of the two programmable PMECs counts. Another MSR implements the time stamp counter (TSC), a read-only 64-bit free-running counter that increments on every clock cycle and that can be used to indirectly measure execution time. This register can also be accessed through the rdtsc (read TSC) ring-0 instruction. Finally, two MSRs separately implement the two programmable PMECs. Each of these is a 40-bit read-only register that counts occurrences or duration of one of several dozen events, such as cache misses, executed instructions, or interrupts.
The technique we employed for software profiling is as follows:
1) A probe is placed at the activation point and at each returning point of each profiled software object. Note that some of these objects have more than one returning point
2) When an experiment is exercised, each probe reads and records values from the three PMECs along with a unique tag that identifies the probe
3) The recorded information is placed in a single heap inside the kernel. We decided to do this, instead of using, let us say, one heap per object, because in this way we found it easier to observe and analyze object-activation nesting. Furthermore, a common heap facilitated calculating each object's expenses in a chain of nested activations
4) When an experiment finishes, the recorded information is extracted from the in-kernel heap and organized by an "ad-hoc" computer program. This program classifies the records by their source probe, computes the monitored event expenses per activation and writes the results into several text files, one per source probe. Each produced text file holds a table with one row per object activation and one column per PMEC. Such text files can be fed to most widely available statistical and graphical computer programs for analysis
3.5.3 Probe implementation
Before implementing the probes, there were some questions to answer. The first question was: which hardware events do we monitor? At first sight the answer seems simple: because we were interested in measuring execution time, we had to monitor machine cycles. However, because we also wanted to know how those cycles were spent, and to see proof of the concurrent nature of the software objects we instrumented, we decided to monitor instructions and hardware interrupts as well. The instruction count gives us an idea of the path taken by the microprocessor through the code. The interrupt count shows us the concurrent behavior not only of the instrumented objects but of the whole kernel. In turn, this serves to discriminate meaningless data: a high interrupt count means that the microprocessor spent too much time doing something else besides exercising the instrumented object.
A second question was: where do we place the profiling probes? Another way to look at this problem is to ask: what level of granularity do we use when delimiting code for profiling? A priori, we chose a C language function level of granularity. This means that we flanked with profiling probes each of the C language functions implementing IP forwarding. From the previous chapter's Figure 2.5 it can be seen that these functions are: Xintr, Xread, Xstart, ether_input, ether_output, ipintr, ip_forward and ip_output. However, as will be explained in the next section, we found out that this level of granularity did not result in appropriate model service times. Therefore, a posteriori, we coarsened the granularity and flanked with profiling probes all the code executed from when an mbuf is taken out of any mbuf queue to when that mbuf is placed in another mbuf queue. This level of granularity resulted in the definition of the following five "profiling grains"; see Figure 3.3:
• The driver reception, dmarx; this grain spans Xintr (the reception part), Xread and ether_input
• The protocol processing, ipintrq; this grain spans from the activation point of ip_input, through ip_forwarding, ip_output and ether_output, and up to the code at ether_output that either places the packet in the interface's transmission queue or activates Xstart
• The driver transmission, ifsnd; this grain corresponds to Xstart
• The return from protocol processing, ipreturn; this grain spans from where the protocol processing grain left off, to the returning points of ether_output, ip_output, ip_forwarding and ip_input
• The driver end-of-transmission interrupt handler, ifeotx; this grain corresponds to the part of Xintr that attends the end-of-transmission interrupt that network interface cards signal
A third question was: how do we manipulate the PMECs? Searching the FreeBSD documentation on the web (http://www.FreeBSD.org) we found that this operating system implements a device driver named /dev/perfmon, which implements an interface to the PMECs. We also found that this device driver uses three FreeBSD kernel C language macro definitions for manipulating the Intel Pentium's MSRs and for reading the TSC. These macro definitions, wrmsr, rdmsr and rdtsc, are defined in the header file cpufunc.h stored in the sys/i386/include/ directory.
A fourth question was: how would we switch probes on and off? Papadopoulos and Parulkar [1993] present one way to do it. Although their mechanism did not fit us, it did give us a hint. Like them, we decided to use a test variable for controlling probe activation. Departing from their methodology, we decided to use a spill-monitoring variable that we named num_entries. Num_entries counts the number of entries recorded in the heap, and probes are active as long as num_entries is less than the heap's capacity. At any given time, we can deactivate the probes by writing a value bigger than the heap's capacity, and vice versa.
A last question was: how would we implement the heap of records? In answering this, we considered that heap manipulation would have to require as few instructions as possible. Thus, we decided not to employ dynamic memory and defined a static array of records. This has the inconvenience of limiting the size of the heap. (Compiler-related restrictions limited the size of the profiling-record heap to around 100 thousand entries.) Each record is implemented by a C language structure with the following members (a sketch of such a structure follows the list):
• id holds an integer value taken from a defined enumeration of probe identifiers.
• tsc holds the 64-bit number read from the TSC.
• instruc holds the 64-bit number read from PMEC1, which is initially configured to count executed instructions.
• interrup holds the 64-bit number read from PMEC2, which is initially configured to count hardware interruptions.
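A minimal sketch, consistent with the names used by the macros in Figure 3.6, of how the record structure and the static heap might have been declared; the exact field types and the capacity constant are assumptions, not the kernel's actual definitions:

#include <sys/types.h>

#define NAVI_NUMENTRIES 100000          /* assumed heap capacity (about 100 thousand records) */

struct navi_record {
        int       id;                   /* probe identifier (see the encoding described below) */
        u_int64_t tsc;                  /* value read from the time stamp counter               */
        u_int64_t instruc;              /* value read from PMEC1: executed instructions         */
        u_int64_t interrup;             /* value read from PMEC2: hardware interruptions        */
};

static struct navi_record navi_buffer[NAVI_NUMENTRIES];
static int navi_entrynum = 0;           /* next free slot; doubles as the probe on/off switch  */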
For the enumeration that defines the probe identifiers, we decided to employ a scheme that would allow us to readily identify both the source object and the probe's location inside the object. Thus, each probe identifier has two 4-bit parts: one identifying the source object and another identifying the probe's location inside the object. Moreover, given that every profiled object has one activation point but possibly many return points, we decided to use the number 0x0 for identifying the activation point and numbers between 0x1 and 0xF for identifying returning points.
With all this set up, we were able to devise a sequence of instructions that works for any probe. In order to ease the probe codification process, we defined a set of C language macro definitions that were added to the kernel.h header file stored in the sys/sys/ directory. The text of the macro definitions is shown in Figure 3.6.
Figure 3.6—Probe implementation for FreeBSD
#if defined(I586_CPU)
#define NAVI_RDMSR1 rdmsr(0x12)
#define NAVI_RDMSR2 rdmsr(0x13)
#elif defined(I686_CPU)
#define NAVI_RDMSR1 rdmsr(0xc1)
#define NAVI_RDMSR2 rdmsr(0xc2)
#endif

#define NAVI_REPORT(a) \
    if ( navi_entrynum < NAVI_NUMENTRIES ) { \
        disable_intr(); \
        navi_buffer[navi_entrynum].id = a; \
        navi_buffer[navi_entrynum].tsc = rdtsc(); \
        navi_buffer[navi_entrynum].instruc = NAVI_RDMSR1; \
        navi_buffer[navi_entrynum++].interrup = NAVI_RDMSR2; \
        enable_intr(); \
    }
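For illustration, a probe placement might look as follows. The function is part of the BSD networking code, but the identifier NAVI_OBJ_IPINTR and the PROBE_* helpers are hypothetical names introduced above only for this sketch:

void
ipintr(void)
{
        NAVI_REPORT(PROBE_ACTIVATION(NAVI_OBJ_IPINTR));    /* activation-point probe */

        /* ... original protocol-layer processing ... */

        NAVI_REPORT(PROBE_RETURN(NAVI_OBJ_IPINTR, 0x1));   /* returning-point probe  */
}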
3.5.4 Extracting information from the kernel
Extracting special information from a "live" kernel is a well-known procedure: you just have to implement a new system call. This is conceptually trivial but cumbersome. FreeBSD offers an easier way: the sysctl(8) subsystem. This subsystem defines a set of C language macro definitions for exporting kernel variables, and the system call sysctl(2) for reading and writing the exported variables. The scheme for identifying variables is similar to the one used for a management information base, MIB [Stallings 1997]. The header file sysctl.h, stored in the sys/sys/ directory, holds the macro definitions, and the file kern_mib.c, stored in the sys/kern/ directory, uses the macro definitions for initializing the subsystem. The sysctl(8) and sysctl(2) man pages explain how to use this subsystem, but they do not say how to extend it. Honing [1999] explains how to accomplish this.
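As a sketch of what exporting a control variable might look like, assuming the spill-monitoring variable from the probe implementation; the node name and its placement under the debug tree are illustrative assumptions, not the dissertation's actual choice:

#include <sys/types.h>
#include <sys/sysctl.h>

/*
 * Export the spill-monitoring variable so probes can be switched from user
 * space, e.g. "sysctl -w debug.navi_entrynum=200000" deactivates them (a value
 * above the heap capacity) and "sysctl -w debug.navi_entrynum=0" reactivates them.
 */
SYSCTL_INT(_debug, OID_AUTO, navi_entrynum, CTLFLAG_RW,
    &navi_entrynum, 0, "Index of the next free profiling-record slot");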
3.5.5 Experimental setup
Figure 3.7 shows the experimental setup we used for system characterization and model validation. We characterized and modeled two systems: a software router and a security gateway. In both cases, we wanted the central processing unit of the system under test to be the bottleneck. That is why the software router under test had a 100 MHz Intel Pentium as its central processing unit, while the security gateway under test had a 600 MHz Intel Pentium III. Indeed, it is well known that the software implementing authentication and encryption algorithms consumes more computing power than plain IP forwarding. (After our research, we now know exactly how much more computing power a security gateway consumes when compared to a plain IP software router.)
Figure 3.7—Experimental setup. (The diagram shows four traffic sources A-D, an HP Internet Advisor, the IP router under test, an IPsec tunnel between two security gateways with the security gateway under test at its outbound end, the corresponding security associations, and a phantom sink; traffic A->D is secured with an AH tunnel and an ESP tunnel, while traffic B->D is unsecured.)
Traffic generation was accomplished using a modified version of Richard Stevens' sock program [Stevens 1994, appendix C] and using the traffic generation facility of a Hewlett-Packard Internet Advisor J2522B. During system characterization, experiments were driven by a single sock source only, while during model validation we used a mix of sources. As we did not have any special requirements for the source nodes, other than being able to run the sock program, we used whatever we had at hand. Thus, sources A and B had 100 MHz and 120 MHz Intel Pentiums, respectively; source C had a 233 MHz Intel Pentium II and source D a 400 MHz Intel Pentium III. During experiments with the security gateway, the inbound gateway had a 650 MHz Intel Pentium III, in order to avoid that performance limitations at the IPsec tunnel's inbound security gateway would modify the traffic's features or interfere with the characterization of the security gateway under test, which was the outbound one. The "phantom sink" was implemented using an address resolution protocol (ARP) table entry for a nonexistent node.
All the computers shown are personal computers with the Intel IA32/PCI hardware architecture executing FreeBSD release 4.1.1 as their operating system. Main memory configuration varies between 16, 32 and 64 Megabytes. In any case, it was more than enough for storing the software router's working set; that is, the most used sections of the software router's software. Indeed, the FreeBSD kernel binary file we used was relatively small; it did not occupy more than 3 Mbytes. (1 Mbyte equals 2^20 bytes.) Besides the normal contents (the kernel was configured to include only a reduced set of features [Leffler and Karels 1993]), this binary file included the profiling probes and the storage for the heap of profiling records. Of course, the main memory space required for storing a FreeBSD system's data varies as a function of the software load offered to the system: number of processes, devices, open files, open sockets, etc. In our case, this was not an issue because we configured each FreeBSD system so it would only execute the chief system processes (init, pagedaemon, vmdaemon, syncer and only a few getty). With this configuration, we wanted to avoid uncontrolled process context switches that would drain computing power.
All computers had 3COM 905B-TX PCI/Fast Ethernet network interface cards attached to either 10 Megabit per second Ethernet or 100 Megabit per second Fast
Ethernet channels. A 3COM Office Connect TP4/Combo Ethernet hub was used for
linking all sources, the software router under test and the inbound security gateway. A
point-to-point Fast Ethernet link was used for the IPsec tunnel. FreeBSD kernels were
configured to use the xl(8) device driver for controlling the 3COM network interface
cards. All computers had onboard EIDE/PCI hard disk controllers. A plain and basic
FreeBSD system requires around 400 Megabytes of disk storage for holding the chief
file systems. With the considered configuration, no swap space was required.
3.5.6 Traffic patterns
The traffic pattern for system characterization served two purposes:
• Minimizing the preemption of networking stages
• Avoiding that packet arrivals would be in synchronization with any kernel-supported clock
The first feature was important for easing the system characterization process. Recall that we set up probes at the activation and returning points of selected software tasks, and that probes record the current value of the three PMECs. Therefore, the resource consumption of any profiled task is equal to the difference between the three PMEC values recorded at the profiled task's returning point and the corresponding values recorded at the same profiled task's activation point. Observe that any preemption of a profiled task, which will occur between its activation and returning points, will result in an overblown resource consumption record. Indeed, in this case, the resource consumption record would also include resources consumed by the preempting task.
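In formula form, using the record fields introduced in subsection 3.5.3 (this merely restates the rule above; "act" and "ret" denote the activation-point and returning-point records of one activation):

    cycles per activation       = tsc(ret)      - tsc(act)
    instructions per activation = instruc(ret)  - instruc(act)
    interrupts per activation   = interrup(ret) - interrup(act)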
We saw two ways to solve this problem. One was to build a computer program that would identify overblown records and try to eliminate the preemption costs from them. The other was to use a traffic pattern that would ensure the system under test forwards one packet at a time. The trade-offs here were the experiment's run time and the solution's complexity. Clearly, the second solution is simpler. As for the experiment's run time, it turned out not to be an issue. Indeed, given that the heap of profiling records was limited in size to 100 thousand entries and that the forwarding of a packet produced around 12 entries, experimental runs could only be long enough to produce about 8333 packets. At the same time, our slowest system under test would forward one packet in less than 80 microseconds, as we found out a posteriori. Consequently, we used the second solution.
The traffic's second feature, avoiding synchronization with kernel clocks, is important for dealing with software preemptions not related to networking tasks. A FreeBSD kernel includes tasks that attend urgent events, like the system real-time clock's interrupt handler, or a page-fault exception's or hard disk's interrupt handler. Such tasks are executed with high priority, and thus preempt any networking task. Our systems under test, which had a simplified configuration as stated before, only produced urgent events related to the real-time clock, and therefore launched high priority tasks following a periodic pattern, with different periods but all multiples of the real-time clock's period. A posteriori, we found out that if the sources (which were also FreeBSD systems using the same kind of real-time clocks to time the traffic production) produced IP datagrams periodically, packet arrivals at the system under test could get in synch with one or more periodic, high priority kernel tasks. This situation results in a larger number of overblown measurements than when the traffic is not synchronized with these kernel tasks.
All the above resulted in the following traffic pattern for system characterization:
• A packet trace with random packet inter-arrival times of at least 2 times the mean packet processing time of the system under test, and at most between 3 and 10 times this mean value

3.5.7 Experimental design
We carried out a large set of experiments for assessing the influence that various operational parameters have on service times. The parameters we considered are:
• Packet size
• Inter-packet arrival time
• Packet burst size (number of packets in a packet burst)
• Encryption protocol
• Authentication protocol
• Traffic mix
• Routing table size
Besides, we carried out some more experiments for assessing the behavior of
various system elements. The elements we considered are:
• Code paths
• Central processing unit speed
• Main memory
• Cache memory
Finally, we carried out some more experiments for assessing the overhead and intrusion of the profiling probes.
3.5.8 Data presentation
From the data gathered from each PMEC of each profiling grain (section 3.5.3) in each experiment, we produced a set of data descriptors and corresponding charts. As stated in this section's introduction, we expected the profiled code to have stochastic behavior and thus we strove to produce histograms as data descriptors. For each experiment, we produced the following six charts:
• Central processing unit cycle count trace and histogram (2 charts)
• Machine code instruction count trace and histogram (idem)
• Hardware interrupt count trace
• Correlation between cycle count and instruction count
Figure 3.8 and Figure 3.9 show some example chart sets. Appendix A shows the chart sets for all the experiments.
3.5.9 Data analysis
In the instruction and cycle count charts of Figure 3.8 and Figure 3.9, it can be seen that the measured points form belts. From the cycles/instructions correlation chart, it can be seen that each belt in the instruction count chart correlates to a single belt in the cycle count chart. This was expected, as the software probes at each layer were programmed to profile more than one routine. Thus, each belt represents a code path through the networking software. In the case of Figure 3.8, the upper belt corresponds to the Encapsulating Security Payload (ESP) protocol, the middle belt corresponds to the Authentication Header (AH) protocol, and the lower belt corresponds to the Internet Protocol (IP). In the case of Figure 3.9, the upper belt corresponds to the networking message reception routine while the lower belt corresponds to the end-of-transmission interrupt handler. From the charts alone, one cannot tell which belt corresponds to which code path. To do this one has to analyze the recorded data with respect to the identification tag written by the software probes with each performance record.
It is striking, although expected, that the belts in the instruction count charts are narrow (in fact they are single valued) while they are thick in the cycle count charts. This clearly supports our assumption that the networking code would have stochastic behavior, and shows the importance of representing the model's service times with histograms.
From the shown charts, it is evident that some kernel activities unrelated to network processing consume system resources. This is apparent from the existence of several clearly unclustered measures, or outliers, in the cycle and instruction charts. Moreover, these points correspond to probe records detecting at least one interruption; there is a one-to-one relationship between these points and impulses in the interrupt count chart. For us this clearly implies that the networking activities were being preempted by some other non-instrumented activity. We named this activity, or collection of activities, the "noise" process, as it is an ever-present process which influences system performance but over which we have no control. In order to model this "noise" influence we added a set of elements to the software router's model, as described in section 3.4. Consequently, when computing service time histograms we disregarded the outliers.

Figure 3.8—Characterization charts for a security gateway's protocols layer

Figure 3.9—Characterization charts for a software router's network interfaces layer
3.5.10 “Noise” process characterization
A precise representation of the "noise" process would require analyzing activation sequences for several kernel tasks. A priori, we decided not to do this, as it is a laborious task. Instead, we tried a simpler approach. A posteriori, the approach turned out to be enough for producing highly accurate results, as section 3.6 later shows. The approach's objective was to estimate the probability density function (PDF) of two random variables: the inter-arrival time of "noise" events and the number of central processing unit cycles consumed by these events. The motivation for going this way arose when doing initial model validations, as discussed later in section 3.6. Indeed, we were able to determine that, in most charts, the packet latency measures clustered in the second belt were due to packets that were preempted at least once. Under the assumed load conditions, this preemption can only be due to the "noise" process. Therefore, finding plausible relationships between the second belt and this process could solve the problem.
The first relationship we found was associated with the "noise" events' inter-arrival times. The inter-arrival time of "noise" events must be constant, as we have associated this process with activities tied to the system's real-time clock. Therefore, we only had to estimate its period or rate. We did this by computing the relative frequency of measures within the second belt. Considering no correlation between the occurrence of "noise" events and packet arrivals (something we were able to assure by following the traffic patterns stated before in subsection 3.5.6), we computed a mean arrival rate for "noise" events as λ = p / T, where p is the probability of a record detecting a preemption, and T is the mean service time computed over records that did not detect a preemption. The estimated period of 4.2 milliseconds is very close to the 4 millisecond period of some bookkeeping kernel routines reported in the literature [McKusick et al. 1996].
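As a purely illustrative numerical example (these are hypothetical values, not our measurements): if a fraction p = 0.02 of the records showed a preemption and the mean unpreempted service time were T = 84 microseconds, then λ = p / T ≈ 238 "noise" events per second, that is, a "noise" period of T / p = 84 µs / 0.02 = 4.2 milliseconds.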
After settling that the "noise" process was periodic, we proceeded to estimate the cycle count's PDF. The fact that this process was periodic was important; it allowed us to discard one random dimension and to consider that the measure dispersion within the second belt was related only to the randomness of the service times. (See section 3.6.) Moreover, the latency times experienced by preempted packets under the assumed load conditions are equal to the sum of an unpreempted service time plus a value associated with a "noise" event's service time. Consequently, for characterizing the "noise" process we decided to use a well-known PDF fitted to the packet-latency measures in the second belt minus the unpreempted average service time. We tried uniform, exponential and normal PDFs, but the best fit was given by the normal PDF.
3.6 Model validation
This section presents a validation of our software router's and security gateway's queuing network models. The validation is based on a two-dimensional comparison between the queuing network models' predictions and both measurements taken from real systems and predictions from a simpler single-queue model. All models were solved by computer simulation, using a computer program we built.
One dimension of the comparison contrasts per-packet processing latency traces. For simplicity, from now on we will refer to the per-packet processing latency as latency. In this dimension, we can qualitatively compare the temporal behavior of the systems. The other dimension compares the complementary cumulative probability functions, from now on CCPF, of the latency traces. Through this dimension, we can quantitatively compare the performance of the systems at several levels of detail. For example, we can either do a broad comparison by comparing mean values or assess different service levels by looking at different percentile values.
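For concreteness, the CCPF of a latency X is the function CCPF(x) = P(X >= x) = 1 - P(X < x); for instance, the latency at which a CCPF curve crosses the 0.01 probability level is the 99th-percentile latency.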
Performance experiments were devised to stress different system behaviors. Experiments varied according to the operational parameters of packet length and inter-packet time-gap. Of all the possible combinations of these two parameters, two main kinds of traffic are identified. One kind has fixed size packets produced with a constant inter-packet time-gap, from now on IPG. The other kind has fixed size packets produced with a randomly variable IPG.
The traffic of the first kind was initially intended to validate the service time distributions computed from the characterization process. Later, it was also used for validating the "noise" process's model. It is interesting to note that the equipment we used to produce the traffic, an HP Internet Advisor, was not able to deliver pure constant-IPG traffic at all IPG values. For IPG values below one millisecond, the equipment produced packet bursts one millisecond apart. The resulting traffic had an average IPG equal to the one we selected, but the IPG between packets within a burst was almost equal to the minimum Ethernet inter-frame gap. While this was at first annoying, we were able to take advantage of this "equipment feature" and used the resulting one thousand packet-bursts per second traffic for some validation experiments. In these cases, experiments varied according to the number of packets within a packet burst, or packet burst size.
The traffic of the second kind was devised to resemble usual traffic on a real local area network; that is, traffic with a high ratio between its peak and average packet rates. It was produced by the superposition of four on/off sources with geometrically distributed state periods. During the on state, the traffic was produced with a geometrically distributed IPG.
When looking at the validation charts it is important to take into account that all systems, the real one and both models, were fed with exactly the same input traffic. We assured this by gathering traffic traces during the performance-measuring experiments carried out with the real machines, and later feeding these traces to the model runs.
3.6.1 Service time correlation
Before presenting the final model validation, let us discuss the service time correlation we observed when using a one-to-one mapping between the C language functions implementing BSD networking and the queuing network model's queues. Subsections 3.4.2 and 3.5.3 presented this problem's implications on system modeling and probe implementation, respectively. Figure 3.10 presents the CCPF comparison for packet latency traces gathered from some experimental runs. It clearly shows that a software router model with queues mapped as stated before is not a good model. Figure 3.11 presents an example chart we produced after a service time correlation analysis. It clearly shows that the networking C language functions' execution times are not independent. As stated in subsections 3.4.2 and 3.5.3, these results drove changes in both the final model and the probes' implementation.
Figure 3.10—Comparison of the CCPFs computed from both measured data from the software router under test and predicted data from a corresponding queuing network model, which used a one-to-one mapping between C language networking functions and model queues

Figure 3.11—Example chart from the service time correlation analysis. It shows the plot of ip_input cycle counts versus ip_output cycle counts. A correlation is clearly shown
3.6.2 Qualitative validation
For carrying out the qualitative validation, as defined in this section's introduction, we produced latency trace charts like the ones shown in the two leftmost columns of Figure 3.12. There, each row corresponds to a particular experiment. Within each row, the left-hand chart depicts data measured on the system under test while the center chart depicts data estimated by the queuing network model. At a glance, the results qualitatively validate the queuing network model. Indeed, in all rows both charts have many more similarities than differences. But beyond this broad comparison there are several details worth looking at.
The first interesting thing that shows up in the charts is the existence of two or more measure belts. These are more evident in the charts for the 100 MHz software router but they also exist in the experiments with the 600 MHz security gateway. The first belt from the bottom up corresponds to packets that arrived at the system when it had empty message queues, and consequently were serviced without preemption. Measure dispersion within this belt corresponds to the system's stochastic service times. The second belt (again, from the bottom up) corresponds to packets that, although arriving to an empty system, experienced some preemption due to the "noise" process.
Note that the first two belts are omnipresent in the charts, and that charts corresponding to experiments with traffic of the first kind show no other belts. Differently, charts for experiments with "bursty" traffic (traffic with packet bursts) depict some other belts. These other belts correspond to packets that experienced some queuing time; they arrived at the system when it had non-empty message queues. These belts' large variance is a consequence of random message queue lengths and some additional preemption time due to the "noise" process. The number of these additional belts (additional to the first two) relates to the size of the driving traffic's packet bursts.
One discrepancy between the behavior of the real systems and that of their queuing network models becomes apparent when looking closely at the measure belts: the belts' variances are slightly different. The source of this discrepancy relates to the modeling of the "noise" process. As explained in subsection 3.5.10, our "noise" characterization trades accuracy for simplicity. However, as the next section shows, this discrepancy only results in a minor quantitative error and thus we did not pursue a more accurate noise characterization and modeling.
3.6.3 Quantitative validation
For carrying out the quantitative validation, as defined in this section's introduction, we produced charts comparing three latency CCPFs: the real system's, the queuing network model's and the single-queue model's. Example charts are shown in the right column of Figure 3.12. Each chart depicts curves produced from the latency traces shown in the same row. Again, at a glance the results quantitatively validate the queuing network model and reveal its better suitability with respect to the single-queue model. Beyond this broad comparison, there are some details worth discussing.
One interesting thing to note is the stepped nature of the real system's and queuing network model's curves. This is most evident in the experiments with constant IPG. This nature relates to the sharing of the central processing unit between packets and the "noise" process. Differences in the prominence of the steps relate to differences between the real density function of the "noise" process's service times and the estimated one. From the charts, it should now be evident that the quantitative error is minor.
Another interesting thing to note is that the error introduced by the statistical estimation of the "noise" process's service times diminishes as the traffic gets "bursty". This is striking although reasonable. Indeed, the average single-packet service time is significantly larger than the average "noise" service time. Thus, when packet bursts appear, the packet sojourn time is more affected by the size of the message queues than by the sharing of the central processing unit with the "noise" process.
Figure 3.12—Model validation charts. The two leftmost columns' charts depict per-packet latency traces (observed and estimated latency in nanoseconds versus packet number); the right column's charts depict the latency traces' CCPFs (p(X>=x) versus nanoseconds) for the real software router, the queuing network model and the single-queue model. The rows correspond to the following experiments:
• Constant IPG, 4 byte UDP datagrams @ 333 pkt/s
• Pseudo-constant IPG (there are some 2-packet bursts), 4 byte UDP datagrams @ 1000 pkt/s
• Constant inter-burst gap with bursts of 3 or 4 packets, 4 byte UDP datagrams @ 2500 pkt/s
• Pseudo-constant IPG (there are some 2-packet bursts), 512 byte UDP datagrams @ 1000 pkt/s
• Bursty ON/OFF traffic, 56 byte UDP datagrams, mean packet rate of 1492 pkt/s, bursts of up to 8700 pkt/s
• Bursty ON/OFF traffic, 56 byte UDP datagrams, mean packet rate of 2857 pkt/s, bursts of up to 9800 pkt/s
• Bursty ON/OFF traffic, 512 byte UDP datagrams, mean packet rate of 2857 pkt/s, bursts of up to 9800 pkt/s
3.7 Model parameterization
The previous section showed that it is possible to characterize a particularly slow software router and use the resulting service times for producing highly accurate predictions with a queuing network model of the same system. This section's objective is to present answers to the following questions:
• Is it possible to extrapolate our findings for performance evaluation of faster systems?
• From what point do changes in hardware architecture or traffic pattern affect our model's predictions?

3.7.1 Central processing unit speed
To answer this section's first question, we produced Figure 3.13. The figure shows the execution times measured by the software probes of the plain software router code when the operation speed of the software router's central processing unit, from now on CPU, was varied between 300 MHz and 630 MHz. We were able to do this thanks to a feature of the Gigabyte GA-6VXE+ motherboard and the Intel Pentium III type SECC2 microprocessor that one of our test systems had. A set of switches on this motherboard allowed us to change the operation speed of this type of microprocessor. Let us emphasize the importance of this feature. Without it, we would have needed to change microprocessors, which may have different cache memory configurations, and even whole motherboards, which most certainly would have different chipsets, in order to experiment at different central processing unit speeds. This would result in experiments with too many degrees of freedom, producing data too complex to analyze and thus useless.
Figure 3.13—Relationship between measured execution times and central processing unit operation speed. Observe that some measures have proportional behavior while others have linear behavior. The main text explains the reasons for these behaviors and why the circled measures do not all agree with the regression lines. (Regression lines, with y in microseconds and x the CPU's clock period in nanoseconds: y = 4.6271x + 0.148; y = 2.1723x + 1.9141; y = 0.8898x + 1.3722; y = 0.9017x + 0.0047; y = 0.6368x + 0.1364.)
Back to Figure 3.13, it can be seen that ipintrq's, ifsnd's and ipreturn's execution times scale proportionally with the CPU's clock period, while dmarx's and ifeotx's execution times vary linearly with respect to the same variable. (See subsection 3.5.3 for the definition of these software probes.) This has a clear explanation. The second set of probes profile code that executes some input/output (I/O) instructions for communicating through the I/O bus with the network interface cards. The operation speed of the I/O bus was constant during the experiment and therefore the time required for executing the I/O instructions did not scale with the CPU speed but remained constant. This results in the offset observed for dmarx's and ifeotx's curves. Note that the value of this offset depends not only on the I/O bus speed but also on the network interface card and corresponding device driver architectures.
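One way to summarize this behavior (our reading of Figure 3.13, not a formal derivation) is the affine model

    t(τ) ≈ N_cpu · τ + t_io

where τ is the CPU's clock period, N_cpu is the roughly constant number of CPU-bound cycles a profiling grain executes, and t_io is the time spent in I/O bus transactions, which does not scale with τ. Grains with t_io ≈ 0 (ipintrq, ifsnd and ipreturn) scale proportionally with the clock period, while dmarx and ifeotx keep the constant offset t_io.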
In order to complete this analysis we included in Figure 3.13 a set of measurements taken from a different computer. These measurements are shown circled in the figure. The "new" computer had a 1 GHz Intel Pentium III with a different cache configuration and a different motherboard, an ASUS TUV4X with a different chipset and main memory technology. Both computers' I/O buses operated at identical speeds, however. As the figure shows, not all of the "new" computer's measurements agree with the regression lines of the "old" computer's measurements. In fact, only ifeotx's new measurement agrees; the rest are lower than their corresponding regression lines. Moreover, ipintrq's new measurement is the farthest from its corresponding regression line.
We believe there is no single source for these discrepancies. Moreover, as the compared computers differ in more than one respect, we cannot be categorically sure what these sources are. However, knowing that the software router's performance is not influenced by the main memory technology and that the software router's code is small enough to fit inside the CPU's cache memory (see the next subsection), we believe the discrepancies' main source is the different level-two cache fitted to each CPU. Indeed, the SECC2 600 MHz Intel Pentium III had an on-package, half-speed, level-two cache while the FC-PGA2/370, 1 GHz Intel Pentium III had an on-chip, full-speed, advanced transfer, level-two cache. What we believe this means is that the "new" computer, when compared to the "old" one, not only executed instructions faster but also had its pipeline fuller, or with fewer bubbles, as on average it did not have to wait as long for instructions and their data.
Clearly, different code segments profited differently from this cache performance improvement. Although we have not done a detailed code analysis, we believe this is mainly due to the relative sizes of the code each software probe profiled and the mix of instructions. We believe it is reasonable to expect that, basically, a large piece of code without I/O instructions would profit more from a faster cache than a small piece of code with many I/O instructions. Ipintrq is the probe that profiled the biggest piece of code (1261 instructions), and none of them are I/O instructions. We believe that is why its 1 GHz measurement is the farthest from its corresponding regression line. The other code sizes are: dmarx, 475; ifeotx, 236; ifsnd, 216; and ipreturn, 192. As stated before, dmarx and ifeotx profiled some I/O instructions. We believe the reason ifeotx's 1 GHz measurement lies just on the regression line is that the profiled code has a relatively large number of I/O instructions. However, we have not assessed this.
3.7.2 Memory technology
We found that the working set of the BSD networking code for both the software router and the security gateway fits inside their CPUs' cache memory. Partridge et al. [1998] have found similar results. Therefore, these systems' performance is not affected by the main memory technology. We observed this after analyzing several cycle-count traces from different profiling probes. The analysis consisted in filtering preempted records, that is, records whose corresponding interrupt count was not zero. Figure 3.14 shows example charts from two different profiling probes: the ipintrq probe and the probe profiling the DES algorithm for the Encapsulating Security Payload protocol (ESP). The figure shows four outliers circled in each chart. Each of these records corresponds to the first packet being processed by the system after some non-networking-related activities have been executed. Observe that after these first packets are processed, no other packet experiences a high cycle count. For Figure 3.14 this means that only four out of almost 25 (or 35) thousand packets were affected. This suggested to us that during the processing of each so-called first packet, the CPU's instruction cache held no networking instructions, and thus the CPU required extra cycles for reading the instructions from main memory. It also suggested to us that all other packets were processed with instructions read from the CPU's cache, and thus did not require reads from main memory. Therefore, this subsection's premise is sustained.
To complete the picture, let us discuss why we hold that some non-networking related activities trashed the CPU's instruction cache. Because we could only allocate a probe-record heap big enough for storing 100 thousand records (for all the probes in the kernel), we had to break each experiment into four rounds. After each round, we had to start a shell session on the system's console and interact with the system. During this interaction, we executed several shell commands for extracting the heap of profiling records from the kernel's address space and saving it to a file. Then, we executed some other shell commands for preparing the system for the next round and for launching it. Clearly, after all this, the CPU's instruction cache was filled with anything but the instructions of the networking software.
Figure 3.14—Outliers related to the CPU’s instruction cache. The left chart was drawn after data taken from the ipintrq probe. The right chart corresponds to the ESP
(DES) probe at a security gateway. Referenced outliers are highlighted
3.7.3 Packet's size
Figure 3.15 shows the influence that packet size has on probe-measured execution times. On one hand, it shows that the execution time of the code required for plain IP forwarding is insensitive to packet size. This is most likely due to the way FreeBSD release 4.1.1 manages message buffers—it uses cluster message buffers regardless of packet size. On the other hand, as expected, it shows that the execution time of the code implementing authentication and encryption algorithms for the IPsec protocols increases with packet size. The figure shows curves for a pair of authenticating algorithms—md5 and sha1—and for three encrypting algorithms—DES, triple DES and blowfish. Each of these curves reflects the particular behavior of its algorithm. It can be seen that blowfish has the smallest dependence on packet size, while DES and triple DES have the greatest. It can also be seen that blowfish has the largest setup time and that triple DES is indeed three times costlier than DES.
Figure 3.15—Relationship between measured execution times and message size. [Chart: Pentium III 600 MHz; microseconds (log scale) versus packet size in bytes (0–1600); curves for Bookkeeping, IF start, EoT int, Rx int, IP forwarding, AH(md5), AH(sha1), ESP(des), ESP(3des) and ESP(blowfish).]
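The trends just described suggest that each algorithm's per-packet cost can be approximated by an affine model, t(L) = tsetup + tper-byte·L. The sketch below evaluates such a model; the coefficients are illustrative placeholders chosen only to mimic the qualitative behavior in Figure 3.15, not measured values.

// Affine per-packet cost model t(L) = setup + perByte * L (illustrative coefficients only).
#include <iostream>

struct AlgoCost {
    const char* name;
    double setupUs;   // fixed cost per packet, microseconds (hypothetical)
    double perByteUs; // incremental cost per byte, microseconds (hypothetical)
    double timeUs(int packetBytes) const { return setupUs + perByteUs * packetBytes; }
};

int main() {
    // Placeholder numbers chosen only to mimic the trends in Figure 3.15:
    // blowfish has the largest setup cost and the smallest per-byte slope.
    AlgoCost des      {"ESP(des)",      5.0,  0.09};
    AlgoCost tripleDes{"ESP(3des)",     5.0,  0.27};
    AlgoCost blowfish {"ESP(blowfish)", 30.0, 0.03};
    for (int bytes : {64, 512, 1500})
        std::cout << bytes << " B: des=" << des.timeUs(bytes)
                  << " us, 3des=" << tripleDes.timeUs(bytes)
                  << " us, blowfish=" << blowfish.timeUs(bytes) << " us\n";
}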
3.7.4 Routing table's size
Another aspect that generally influences the performance of software routers is the routing table lookup algorithm. Table 3-I lists IP forwarding's average execution times for our 600 MHz system when its routing table had 4, 16, 64, 128, 256, 512 and 1024 entries. Traffic threading the software router was randomly distributed across destination addresses. The times obtained (from 15.44 µs with 4 entries to 17.41 µs with 1024 entries) do not show significant variations. Consequently, we can say that for the routing table sizes that a software router may be expected to face—at the Internet's edge—forwarding performance is not influenced by the routing table's size.
TABLE 3-I
Routing table size    IP forwarding time [µs]
4                     15.44
16                    15.98
64                    16.16
128                   16.37
256                   16.56
512                   16.81
1024                  17.41

3.7.5 Input/output bus' speed
After settling that the considered networking code's execution time varies proportionally with the executing CPU's speed, we computed the expected execution times for IP forwarding and for encrypting/authenticating 1500-byte network messages at selected CPU speeds, and compared them with transfer times through two I/O buses: one with a 33 MHz clock and a 32-bit data path, and another with a 66 MHz clock and a 64-bit data path. The results in Table 3-II clearly show that, for the CPUs that will become available on the market in the following years, the bus is potentially the system's bottleneck.
TABLE 3-II
CPU speed   Forwarding   Forwarding + 3DES   IO bus transfer [µs]
[MHz]       [µs]         [µs]                33 x 32     66 x 64
600         15.44        580                 11.4        2.9
1200        7.72         290                 11.4        2.9
4800        1.93         72.5                11.4        2.9
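The bus transfer entries in Table 3-II follow directly from the bus clock and data-path width. Below is a minimal sketch of that arithmetic, assuming one data-path-wide word is transferred per bus cycle and ignoring arbitration and addressing overhead.

// Ideal PCI transfer time for a packet: one word per bus cycle, no overhead.
#include <iostream>

double transferTimeUs(int packetBytes, double busMHz, int dataPathBits) {
    const double bytesPerCycle = dataPathBits / 8.0;
    const double cycles = packetBytes / bytesPerCycle;
    return cycles / busMHz; // cycles divided by cycles-per-microsecond
}

int main() {
    // 1500-byte packet, as in Table 3-II.
    std::cout << "33 MHz x 32 bit: " << transferTimeUs(1500, 33.0, 32) << " us\n"; // ~11.4
    std::cout << "66 MHz x 64 bit: " << transferTimeUs(1500, 66.0, 64) << " us\n"; // ~2.9
}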
3.8 Model’s applications
Now that we have proved the validity of the queuing network model, in this section we apply it for capacity planning and as a uniform experimental test-bed. When used as a capacity-planning tool, the queuing network model may help to devise system configurations or to tune a system supporting communication quality-assurance mechanisms, or quality-of-service (QoS) mechanisms. Indeed, user QoS levels may be mapped to parameters such as overall maximum latency, sustainable throughput or packet loss rate. Then, the queuing network model's ability to estimate the complete probability density function of performance parameters may be used for identifying the system's operational regions that meet these QoS levels. Subsection 3.8.1 elaborates on this.
When used as a uniform experimental test-bed, a queuing network model eases the performance study of systems that incorporate some modifications or extensions. Generally, all there is to do is to adjust the model's queuing network. Certainly, when adjusting the queuing network, care must be taken to remap service times. Although this remapping may not be trivial, it is clearly easier than producing correct software for a new real system. Following this rationale, subsection 3.8.2 presents performance evaluation results for two modified software routers: one that incorporates Mogul and Ramakrishnan's [1997] receiver live-lock avoidance mechanism, and another that, in addition, incorporates a weighted fair queuing scheduler for IP packet flows.
3.8.1 Capacity planning
Figure 3.16 shows two charts drawn from data estimated by the queuing network model. In each chart, a software router performance variable is plotted against the average traffic load in packets per second (pps). We are assuming the performance bottleneck is the software router and that the traffic load is Poisson. Five curves are drawn in each chart, each one corresponding to the performance of a system under particular conditions. Let us now discuss how to use these charts for capacity planning.
Forwarding capacity—Let us define forwarding capacity as the maximum load value at which a software router can forward packets without losing any. Observe that, as noted in subsection 3.7.3, packet forwarding as implemented in FreeBSD 4.1.1 is insensitive to packet size. Thus, for assessing the forwarding capacity of a system we look at the losses chart and find the load at which the system starts to lose packets. For example, a 100 MHz system has a forwarding capacity of, more or less, 8.5 kpps. This means that it can hardly drive a single Ethernet—which may produce packets at rates exceeding 14 kpps. On the other hand, a 650 MHz system has a forwarding capacity of 55 kpps.
Encryption capacity—The above, mapped to a security gateway, yields the encryption capacity; that is, the maximum load value at which a security gateway can forward encrypted packets. Unlike packet forwarding, as also noted in subsection 3.7.3, packet encryption performance is influenced by the packet size and the encryption algorithm used. In fact, a security gateway has a bimodal response with respect to packet size. When packets are smaller than a particular value, the security gateway's performance is limited by its forwarding capacity. When packets are bigger than this value, the encryption capacity is the performance limit.
To numerically show this, consider the following. Let us define tf as the packet forwarding time and tue as the per-byte encryption time. Thus, the maximum processing capacity of a security gateway expressed in packets per second is C(pps) = 1/(tf + tue·L), where L is the packet size in bytes. This expression, measured in bytes per second (Bps), maps to:

C(Bps) = L·C(pps) = 1/(tue + tf/L)

Clearly, when L→0, C(Bps)→L/tf, and when L→∞, C(Bps)→1/tue.
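As a purely numerical illustration of these expressions, the sketch below evaluates C(pps) and C(Bps) over a range of packet sizes; the tf and tue values are placeholders and do not correspond to any of the measured systems.

// Security gateway capacity model: C(pps) = 1/(tf + tue*L), C(Bps) = L*C(pps).
#include <iostream>

int main() {
    const double tf  = 15e-6;   // packet forwarding time [s] (placeholder)
    const double tue = 0.35e-6; // per-byte encryption time [s] (placeholder)
    for (int L : {64, 512, 1386, 1500}) {
        const double cPps = 1.0 / (tf + tue * L);
        const double cBps = L * cPps; // equals 1/(tue + tf/L)
        std::cout << L << " B: " << cPps / 1e3 << " kpps, "
                  << cBps * 8 / 1e6 << " Mbps\n";
    }
}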
Figure 3.16—Capacity planning charts
Returning to the encryption capacity example: from the losses chart, a 650 MHz system configured to authenticate and encrypt a communication session—using both the AH and ESP protocols, and using DES for encrypting—has an encryption capacity of 4.2 kpps when processing 512-byte packets and 3.6 kpps when processing 1386-byte packets. If it uses 3DES for 1386-byte packets, this system's encryption capacity is around 1.5 kpps.
Maximum latency—If someone is interested in knowing the minimum clock rate (read: price) that supports packet latencies not higher than a certain value with some probability, he must look at the corresponding latency percentile chart. For example, a 100 MHz software router operating at its forwarding capacity of 8.5 kpps has a 99-percentile latency around 10 ms. As another example, a 650 MHz security gateway configured to use 3DES, processing 1386-byte packets and operating at its encryption capacity of 1.5 kpps, has a 99-percentile latency around 10 ms.
3.8.2 Uniform experimental test-bed
The following study’s objective is to observe the performance of a software router
supporting communication quality-assurance mechanisms, or Quality-of-Service (QoS)
mechanisms. Here it is shown that the considered software routers cannot sustain system wide QoS behavior solely by adding QoS aware CPU schedulers. Where by this
schedulers we do not mean user processes schedulers but networking tasks schedulers;
that is, schedulers regulating the marshalling of packets through the networking software. The next chapter presents a solution to this problem.
The study that concerns us here compares the performance of two software routers, one with a 1 GHz CPU and the other with a 3 GHz CPU. These software router models' service times were extrapolated by applying section 3.7's parameterization rules to section 3.5's measurements for the 600 MHz system. We considered a PCI I/O bus like the one described in section 3.4. The data links were considered to have a 1 Gbps throughput.
For this study, the software router workload consisted of the superposition of three traffic flows with the characteristics shown in Table 3-III. The offered load in bits per second is identical for each flow. As we are interested mainly in system throughput under overload, we have considered Poisson input traffic.
TABLE 3-III
          Packet length (bytes)   Solicited share
Flow 1    172                     1/3
Flow 2    558                     1/3
Flow 3    1432                    1/3
It was expected that an ideal QoS aware software router would allocate one third
of its resources to each flow.
Basic system—In order to set a comparison baseline, we first studied the performance of the considered software routers when they use the basic BSD networking software. Figure 3.17 shows the queuing network model for a basic software router traversed by three packet flows. Figure 3.18 shows performance data computed after these models for global offered loads in the range [0, 1400 Mbps]. As can be seen, the software router with the 1 GHz CPU shows a linear increase of the aggregated throughput for offered loads below 225 Mbps. At this point the CPU utilization is 100%, while the bus utilization is around 50%, and the system enters a saturation state. If we further increase the offered load, the throughput decreases until a receiver live-lock condition appears for an offered load of 810 Mbps. During saturation, most losses occur in the IP input buffer.
For the software router with the 3 GHz central processing unit, the bus saturates for an offered load of 500 Mbps. The bus utilization at this point is 100% while the CPU utilization is around 70%. The system behavior for increasing offered loads depends on which priorities are used by the bus arbiter. The results shown here correspond to the case in which reception has priority over transmission. Observe that system throughput decreases with increasing offered loads. This can be explained if we observe that CPU utilization keeps increasing and most losses occur in the device driver's transmit queue. This indicates that the NIC-to-CPU transfer volume increases with increasing offered loads, and hence the CPU processes more work. However, this work cannot be transferred to the output NIC, as the bus is saturated by the NIC-to-CPU transfers.
If NIC-to-CPU and CPU-to-NIC transfers have the same priority, the CPU never reaches saturation, and losses can occur either in the NIC's reception queue or in the device driver's output transmission queue.
Consequently, the basic system cannot provide a fair share of the resources when
it is in saturation.
Figure 3.17—Queuing network model for a BSD-based software router with two network interface cards attached to it and three packet flows traversing it. [Diagram: per-flow sources and NIC reception buffers, per-flow DMA receive and transmit queues served by the I/O bus (PNP scheduling), per-flow ipintrq and ifsnd queues plus ipreturn, ifeotx and operating-system fast/slow noise queues served by the CPU (PP scheduling), output NIC and sinks.]
MODEL’S APPLICATIONS—3.8
82
Figure 3.18—Basic system’s performance analysis: a) system’s overall throughput; b) per flow
throughput share for system one; c) per flow throughput share for system two
a)
b)
c)
Mogul and Ramakrishnan [1997] based software router—We performed similar experiments with a model of a software router that includes the modifications proposed by Mogul and Ramakrishnan. The proposed modification's objective is to eliminate the receiver live-lock pathology. This is accomplished, basically, by turning off the software interrupt mechanism and driving networking processing with a polling scheduler during a software router's busy period. Consequently, networking processing is no longer conducted through a pipeline but is done as a run-to-completion task. From the queuing network modeling point of view, this results in the aggregation of all networking-related service queues into a single queue. Mogul and Ramakrishnan showed that the resulting total IP processing time is actually shorter than the sum of the times for each individual phase: the added polling scheduler's processing cost is compensated by the dropped IP queue manager's processing cost, and the run-to-completion processing is implemented with fewer function calls. We took advantage of these observations and, considering that the basic networking phases' service times are independent, we computed the service time of the aggregated service as the sum of the basic networking phases' service times.
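In the model, this aggregation simply amounts to adding the per-phase CPU demands into a single run-to-completion demand, as the following sketch shows; the phase values used are placeholders, not the measured service times.

// Run-to-completion service time as the sum of the basic networking phases
// (independence assumed, as in the text); phase values are placeholders.
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // Hypothetical per-phase CPU demands in microseconds (e.g. ipintrq, ipreturn, ifsnd, ifeotx).
    std::vector<double> phaseUs = {15.4, 2.1, 2.4, 2.7};
    double aggregated = std::accumulate(phaseUs.begin(), phaseUs.end(), 0.0);
    std::cout << "aggregated service time = " << aggregated << " us\n";
}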
Figure 3.19 shows the queuing network model for a Mogul and Ramakrishnan based software router, and Figure 3.20 shows performance data computed after this model when parameterized for the two software routers described in this section's introduction. Naturally, the offered load was similar to the one used for the basic system. As can be seen, and as expected, the performance of the two modified software routers does not change with respect to that of the basic system when operating below saturation. Moreover, they reach saturation at more or less the same offered load, although now the throughput degradation is gone thanks to the receiver live-lock elimination mechanism. Yet, the system still does not achieve a fair share of software router resources.
Figure 3.19—Queuing network model for a Mogul and Ramakrishnan [1997] based software router with two network interface cards attached to it and three packet flows traversing it. [Diagram: as in Figure 3.17, but the per-flow networking queues are aggregated into a single run-to-completion stage served by the CPU (RR scheduling) together with ifeotx and the operating-system fast/slow noise queues.]
Figure 3.20—Mogul and Ramakrishnan [1997] based software router's performance analysis: a) system's overall throughput; b) per-flow throughput share for system one; c) per-flow throughput share for system two
Mogul and Ramakrishnan [1997] based software router with a QoS aware CPU scheduler—When a software router includes the modifications proposed by Mogul and Ramakrishnan, it seems possible to us to introduce a QoS-aware CPU scheduler, like a WFQ scheduler. Indeed, once the polling scheduler is in control, it may use whatever policy it implements to select the next packet for processing. This scheme is proposed by Qie et al [2001], although for an operating system other than UNIX.
Figure 3.21 shows the queuing network model for this kind of software router and Figure 3.22 shows performance data computed after this model when parameterized for the two software routers described in this section's introduction. As can be seen, the considered networking architecture can only support QoS communications when the CPU is the system's bottleneck, as in the case of the software router with a CPU operating at 1 GHz (Figure 3.22.a). However, when the bus becomes the system's bottleneck, a fair share of the router's resources is not achieved (Figure 3.22.b). By the way, the system's overall throughput chart has been omitted from Figure 3.22, as it is identical to the one presented in Figure 3.20.
Figure 3.21—Queuing network model for a software router including the receiver live-lock avoidance mechanism and a QoS-aware CPU scheduler, similar to the one proposed by Qie et al [2001]. The software router has two network interface cards and three packet flows traverse it. [Diagram: as in Figure 3.19, with a WFQ stage (wfqsignal, wfqcost) and per-flow dmarx queues feeding the run-to-completion CPU stage (RR scheduling).]
Figure 3.22—Qie et al [2001] based software router's performance analysis: a) per-flow throughput share for system one; b) per-flow throughput share for system two
3.9 Summary
• A queuing network model of a software router gives accurate results
• Single-queue models of software routers ignore chief system features
• Characterizing the system was hard not because of the studied system but due to the required process
• Armed with a mature process, characterization becomes straightforward
• A model's service times computed after some system may be used for predicting the performance of other systems, if scaled appropriately
• Service times scale linearly with CPU operation speed but can be considered constant with respect to message and routing table sizes
• Service times' offsets vary with the network interface card's and device driver's technology and with cache memory performance
• With the advent of CPUs working above 1 GHz, the I/O bus is the bottleneck
Chapter 4 Input/output bus usage control in personal computer-based software routers
4.1 Introduction
In this chapter we address the problem of resource sharing in personal computer-based software routers when supporting communication-quality assurance mechanisms, widely known as quality-of-service (QoS) mechanisms. Others have put forward solutions that focus on suitably distributing the workload of the central processing unit. However, the increase in central processing unit (CPU) speed relative to that of the input/output (I/O) bus means attention must be paid to the effect the limitations imposed by this bus have on the system's overall performance.
Here we propose a mechanism that jointly controls both the I/O bus and the CPU operation. This mechanism involves changes to the operating system kernel code and assumes the existence of certain network interface card functions, although it does not require changes to the personal computer hardware. Here we also present a performance study that provides insight into the problem and helps to evaluate both the effectiveness of our approach and several software router design trade-offs.
The chapter is organized as follows. Section 4.2 presents the problem, section 4.3
discusses our proposed solution and section 4.4 presents a performance study of the
proposed mechanism in isolation, carried out by simulation. Then, section 4.5 considers
the performance of a software router incorporating the proposed mechanism, carried out
also by simulation. Section 4.6 discusses a particular implementation of the proposed mechanism and presents measured performance data. Finally, section 4.7 summarizes.
4.2 The problem
Given that there are inherent performance limitations in the architecture of a software router, but also that there are reasons that make its use attractive (see the previous chapter's section 3.3), the question of how to optimize its performance arises. In addition, if we want to support QoS mechanisms, we must find suitable ways of sharing resources and, as Nagle said, of providing protection, so that ill-behaved sources can only have a limited negative impact on well-behaved ones [Nagle 1987]. There are two different aspects of the problem of resource sharing: the fair share of communications links and the fair share of router resources, mainly its central processing unit (CPU) and input/output (I/O) bus. The first aspect affects output packets flows that share a single network interface card, while the second aspect affects all packets flows that go through the router. We will focus our work on the second aspect of this problem.
In other works the problem of fairly sharing software router resources is tackled in terms of protecting [Indiresan, Mehra and Shin 1997; Mogul and Ramakrishnan 1997] or sharing [Druschel and Banga 1996; Qie et al 2001] the use of the central processing unit amongst different packets flows in an efficient way. However, the increase in central processing unit speed relative to that of the I/O bus makes it easy for this bus to constitute a bottleneck. This is why we address this problem in this chapter, considering not only the sharing of the software router's central processing unit, but also its I/O bus.
4.3 Our solution
The mechanism we propose for implementing input/output (I/O) bus sharing between differentiated packets flows manipulates the vacancy space of each direct memory access receive channel (see chapter 2's subsection 2.6.6 for a description of direct memory access channels) in such a way that the overall I/O bus activity follows a schedule similar to one produced by a general processor sharing (GPS) server [Demers, Keshav and Shenker 1989]. The mechanism, which we named the Bus Utilization Guard (BUG), acts as a flow scheduler that indirectly regulates the I/O bus utilization. Packets flows wanting to traverse the router have to be registered with the BUG. This may be accomplished either manually or by means of a resource reservation protocol like RSVP. In any case, a packets flow is required to submit a target resource utilization when registering, and the system is required to manage packets flow identifications for the packets flows it admits. Note that the system has to be able to map each differentiated packets flow to a direct memory access channel, DMA channel for short. Figure 4.1 shows the BUG's system architecture.
Figure 4.1—The BUG’s system architecture. The BUG is a piece of software embedded in the
operating system’s kernel that shares information with the network interface card’s device
drivers and manipulates the vacancy of each DMA receive channel
M ain M em ory
CPU
Filled
buffers
BUG
NIC's
Driver
Empty
buffers
Descriptor
Descriptor
Descriptor
Descriptor
Empty
Packet
Buffer
Empty
Packet
Buffer
W orking
Packet
Buffer
Filled
Packet
Buffer
Interrupts
System Bus
I/O Bus
Com mands
NIC
DM A
Descriptor
Packet
Buffer
Data Link
4.3.1 BUG's specifications and network interface card's requirements
The mission of the BUG is to assist in supporting QoS system behavior by indirectly controlling input/output bus usage following a GPS-like policy. (From now on we refer to the input/output bus simply as the bus.) We thought of the BUG as either a piece of software within the operating system's kernel, or as a hardware add-on attached to the AGP connector. We wanted the BUG not to require any change to the host computer's hardware architecture. However, the BUG still requires network interface cards to keep a running sum of packets and bytes received per differentiated packets flow. Moreover, the BUG can only differentiate packets flows when they are mapped to separate DMA channels. Therefore, if one wants to differentiate packets flows entering the router through a network interface card, the latter must support several DMA channels—one channel per differentiable packets flow.
Because the BUG protects a rather fast resource, and due to its software-only implementation requirement, it was important that the BUG have low overhead. Moreover, because the BUG uses the very bus it strives to protect (for polling information from the network interface cards and for configuring them, as will be explained later), it was important that the BUG be as little intrusive as possible. Table 4-I summarizes these specifications and requirements.
TABLE 4-I
BUG's specifications: avoid host hardware modifications; low overhead; avoid intrusion.
Network interface card's requirements: per-packets-flow DMA channels; ability to keep some packets flow state information.
4.3.2 Low overhead and intrusion
In order to minimize overhead and intrusion, we devised the BUG to get activated only every T seconds, where T >> τ, the worst-case time required for executing the BUG's activities when it gets activated. Figure 4.2 illustrates this behavior. As will be shown, τ influences packet latency because, during the time the BUG is running, no packet may traverse the protected bus.
May T be arbitrarily large? As will be shown, the BUG is a reactive mechanism that, from a summary of what has happened on the bus during the last T seconds, adjusts the DMA receive channels' vacancy spaces so that the bus' behavior during a busy period—a train of T periods of 100% utilization—resembles that of a GPS server. Consequently, a priori, T cannot be arbitrarily large. Intuitively, there must be a Tmax beyond which either the BUG cannot react fast enough or the required adjustments are unfeasible to perform. Besides, as will be shown, there is a proportional relationship between T and the BUG's main memory storage requirement. Indeed, a priori, the mean number of packets that the BUG has to be aware of increases with T, and more packets imply more
mbuf descriptors, which require more main memory. (Actually, the DMA channels are the ones requiring the added memory, not the BUG.)
To further reduce intrusion, we devised the BUG as a bistate mechanism. At every activation, the BUG first determines how much of the bus resources have been used during the last T seconds. If the bus has not been one hundred percent utilized, the BUG enters monitoring state and no further action is taken. Otherwise, the BUG enters enforcing state and executes all its tasks. Figure 4.2 also illustrates this behavior.
4.3.3 Algorithm
The BUG emulates a packet-by-packet GPS server with batched arrivals. This GPS server's inputs are computed after data polled from the network interface cards and the network interface cards' device drivers. Figure 4.3 shows an example scenario. Appendix B lists the C++ code of the algorithm's implementation used for the simulation experiments described in sections 4.4 and 4.5.
Assume that the mechanism is in monitoring state at cycle k·T. Then, the mechanism gathers Di,k—the number of bytes transferred through the bus during period ((k−1)·T, k·T) by DMA receive channel-i, channel-i for short. If sum(Di,k) < T/βBUS, where βBUS is the bus transfer cost per byte, the mechanism remains in monitoring state and no further actions are taken. Otherwise, the mechanism detects the start of a busy period and enters enforcing state. In this state, the mechanism polls each network interface card to gather Ni,k—the number of bytes stored at the network interface card associated with channel-i—and computes the amount of bus utilization granted to each DMA receive channel, γi,k. This is done after the outputs of the emulated GPS server, Gi,k. The emulated GPS server's (or just GPS server's) inputs are the Ni,k at the start of a busy period. Afterwards, the inputs are the amount of traffic arrived during the last period, Ai,k = Ni,k − Ni,k−1 + Di,k. The BUG is work-conservative and thus:

γi,k = Gi,k + (T/βBUS − (G1,k + … + GN,k))
Figure 4.2—The BUG’s periodic and bistate operation for reducing overhead and intrusion
τ
T
Enforcing
Poll NICs
Com pute
shares
not fully used
fully used
Adjust vacancy
spaces
fully used
Monitoring
Enfocing
not fully used
Observe that under the work-conservative premise:

sum(γi,k) ≥ T/βBUS

This situation may lead to an unfair share. Consequently, the BUG is equipped with an unfairness-counterbalancing algorithm. This algorithm, basically, computes an unfairness level per DMA receive channel, ui,k, which is positive for depriver packets flows and negative for deprived ones. Then, the BUG reduces or augments the bus utilization grant, γi,k, of every depriver or deprived packets flow, respectively, by an amount proportional to the corresponding unfairness level. That is:

γ∗i,k = γi,k − ui,k

One problem with this approach is that if unfairness is detected then:

sum(γ∗i,k) ≤ T/βBUS

That is, the unfairness-counterbalancing algorithm may artificially produce some bus idle time. This problem also arises when packetizing bus utilization grants, as explained shortly. Happily, a single mechanism, one that allows the BUG to vary the length of its activation period, solves both problems.
The BUG switches from enforcing state to monitoring state at the start of any activation instant at which sum(Di,k) < T/βBUS. When this switch occurs, the BUG resets the emulated GPS server and stops regulating the bus usage by giving away:

γi,k = T/βBUS
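The following sketch condenses one BUG activation as just described—detect a busy period from the per-channel byte counts Di,k, feed the emulated GPS server, and hand out work-conserving grants γi,k. The GPS emulation is abstracted behind a trivial stand-in, βBUS is taken as the bus cost per byte, and all names and values are illustrative; the actual implementation is the one listed in Appendix B.

// Condensed BUG activation step (monitoring/enforcing decision and grant computation).
// The GPS emulation is a trivial stand-in; values and layout are illustrative only.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

// Placeholder for the packet-by-packet GPS emulation described in the text:
// given per-flow demands (bytes), return per-flow GPS outputs (bytes).
std::vector<double> emulateGps(const std::vector<double>& input, double busBytesPerT);

struct Bug {
    double T = 100e-6;            // activation period [s]
    double betaBus = 1.0 / 132e6; // bus cost per byte [s/byte] (33 MHz x 32 bit)
    bool enforcing = false;       // current state

    // D: bytes moved per receive channel during the last period.
    // N: bytes currently waiting at each NIC (polled only when enforcing).
    std::vector<double> activate(const std::vector<double>& D, const std::vector<double>& N) {
        const double busBytesPerT = T / betaBus;
        const double used = std::accumulate(D.begin(), D.end(), 0.0);
        if (used < busBytesPerT) {             // bus not fully used: monitoring state
            enforcing = false;
            return std::vector<double>(D.size(), busBytesPerT); // unrestricted grants
        }
        enforcing = true;                      // busy period detected: enforcing state
        std::vector<double> G = emulateGps(N, busBytesPerT);
        const double leftover = busBytesPerT - std::accumulate(G.begin(), G.end(), 0.0);
        std::vector<double> gamma(G.size());
        for (size_t i = 0; i < G.size(); ++i)  // work-conserving grant per channel
            gamma[i] = G[i] + leftover;
        return gamma;
    }
};

// Trivial stand-in: split capacity equally, capped by each flow's demand.
std::vector<double> emulateGps(const std::vector<double>& input, double busBytesPerT) {
    std::vector<double> out(input.size());
    for (size_t i = 0; i < input.size(); ++i)
        out[i] = std::min(input[i], busBytesPerT / input.size());
    return out;
}

int main() {
    Bug bug;
    std::vector<double> grants = bug.activate({6000, 4000, 4000}, {2000, 1500, 500});
    for (double g : grants) std::cout << g << " bytes\n";
}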
4.3.4 Algorithm's details
The BUG requires detecting when the bus is fully utilized. One approach to do this is asking the bus for its accumulated idle time at each activation period. If there is a difference between the current value and the one read previously, then the bus was not fully utilized during the last T time units. The problem with this approach is that the PCI bus
Figure 4.3—The BUG’s packet-by-packet GPS server emulation with batch arrivals
Arrivals/Departures
Patterns
GPS input
GPS Bus' Capacity Share
6
Bus capacity
5
4
3
2
Tim e
t-T
t
t+T
OSCAR IVÁN LEPE ALDAMA
1
Flow1 Flow2 Flow3
Tim e
specification does not define such a thing as bus accumulated idle time [Shanley and Anderson 2000]. In its absence, the BUG computes the bus utilization level from the received byte counts it collects at each activation instant from the network interface cards' device drivers.
The BUG normally works periodically with period T, but the implementation of the unfairness counterbalancing mechanism, as explained shortly, and the bus-utilization grants' packetization requirement, as also explained shortly, may induce some bus idle time if nothing else is done. To circumvent this problem, after an enforcing state activation during which the emulated GPS server ran at 100% utilization, we allow the BUG to adjust its activation period. In this case, the BUG is granting the use of the bus for a time equal to sum(γ∗i,k·βBUS), where γ∗i,k is the packetized bus-utilization grant of packets flow i at activation instant k, adjusted for unfairness if necessary. Consequently, to avoid any bus idle time, the BUG sets its next activation instant sum(γ∗i,k·βBUS) into the future. Observe that this value is equal to T − dt, where it is expected that dt << T thanks to the BUG implementation's properties. Note that the activation period adjustment is only required at the considered activation instant.
The BUG is a work-conservative mechanism in which any idle time detected at the emulated GPS server is given away for all packets flows to use. Moreover, when in monitoring state it sets all use-grants to a value proportional to a complete T period, so the bus can get busy as soon as possible. Figure 4.4 shows an example scenario. This has two evident implications. On the good side, packets do not experience any initial waiting. On the bad side, some unfairness may arise because some packets flow may use more than its solicited share to the detriment of other packets flows. Happily, unfairness may only occur during a bus busy period. Therefore, the BUG only needs to look for and correct unfairness when in enforcing state. Observe that the emulated GPS server cannot deal with any unfairness because the latter is produced outside its scope. Thus, any counterbalancing mechanism has to be implemented as part of the BUG's control logic.
Of course, unfairness may be avoided altogether by implementing a non-work-conservative mechanism, in which packet arrivals that occurred during an activation period get artificially batched at the next activation instant (by appropriately manipulating the DMA receive channels' vacancy spaces). But then packets would experience added latency. We have also implemented a non-work-conservative BUG and it works well
Figure 4.4—The BUG is work conservative. [Diagram: the GPS input (new and backlogged traffic) yields per-flow GPS bus-capacity shares; any capacity the GPS server leaves unused is given away by the BUG on top of the computed shares, producing the final flow grants.]
throughput-wise. However, here we only present results from the work-conservative
version because we believe it is a better overall solution.
When in enforcing state, the BUG has to check for unfairness. One way to do this is to compare, at every activation instant of a busy period, two running sums per packets flow: the accumulated GPS outputs, Gac i,k, and the accumulated departures, Dac i,k. Then, unfairness may be computed as:

ui,k = Dac i,k − Gac i,k

A positive difference denotes a depriver packets flow, while a negative value denotes a deprived one. Every depriver packets flow has its bus utilization grant, γi,k, reduced by ui,k, and every deprived packets flow gets its grant augmented by ui,k. Any γi,k that becomes negative is clamped to zero. Figure 4.5 shows an example scenario.
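A small sketch of this counterbalancing step, using the running sums of departures and GPS outputs (Dac and Gac, as in Table 4-II of subsection 4.3.6) and clamping negative grants to zero as described; the numbers are illustrative only.

// Unfairness counterbalancing: u_i = Dac_i - Gac_i; gamma*_i = max(0, gamma_i - u_i).
// Illustrative stand-alone sketch; units are bytes.
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> gamma = {4400, 4400, 4400}; // grants from the GPS emulation
    std::vector<double> Gac   = {6000, 6000, 6000}; // running sums of GPS outputs
    std::vector<double> Dac   = {7000, 6000, 5000}; // running sums of actual departures

    for (size_t i = 0; i < gamma.size(); ++i) {
        const double u = Dac[i] - Gac[i];       // >0: depriver flow, <0: deprived flow
        gamma[i] = std::max(0.0, gamma[i] - u); // reduce deprivers, augment deprived
        std::cout << "flow " << i + 1 << ": u=" << u << ", grant=" << gamma[i] << "\n";
    }
}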
Admittedly, the unfairness counterbalancing mechanism just explained needs refinement. We can see that in a scenario where a router were "attacked" by overly misbehaving packets flows, the levels of unfairness would be so high that, with high probability, the average busy-period length would be relatively small, that is, much smaller than T. This is clearly undesirable since, in this case, the BUG would be consuming too much of the router's resources. In favor of the BUG, allow us to comment that under the considered scenario of packets flows threading a router supporting QoS, where the packets flows would need to pass an admission control and would be policed to ensure QoS level agreements, we expect only small levels of unfairness. Indeed, in this case, unfairness would be primarily due to computer networks' stochastic behavior, which produces packet bursts. Furthermore, we see that by allowing some added complexity in the BUG it is possible to extend it to detect relatively high levels of unfairness and proceed accordingly. One such addition would involve computing the sum of ui,k over the deprived packets flows and reducing γi,k for each depriver packets flow only by a proportional fraction of that sum. We leave this as future work.
There is another difficulty with the unfairness counterbalancing mechanism: the initialization of the running sums of GPS outputs, Gac i,k. We believe that an initial value of zero is wrong, simply because of our definition of "enforcing state". Indeed, if the BUG is at the first enforcing-state activation of a busy period, then the bus has just been 100% utilized, and the BUG needs to include a measure of that utilization in the unfairness formula. Unfortunately, because of our definition of "monitoring state", the BUG has no way of computing the previous period's Gi,k values.
Figure 4.5—The BUG's unfairness counterbalancing mechanism. [Diagram: when the sum of GPS outputs exceeds the bus capacity, flow grants are formed from the GPS outputs plus the capacity given away by the BUG; comparing possible with actual use yields per-flow unfairness levels, which are applied to the next flow grants (a (4,4,4) arrival vector is supposed).]
lated to previous period’s Gi,k values the BUG knows for sure is the set of QoS levels
requested by the packets flows threading the router. Consequently, we are using that set
of values for initializing sumi(Gi,k). Furthermore, we have observed that at the first enforcing state activation of a busy period, the bus and the emulated GPS server are not in
synch. That is, the bus is 100% utilized while the emulated GPS server is not. This results in Gi,k values inappropriate for the unfairness counterbalancing mechanism: they
are too low. Consequently, if during a first enforcing state activation of a busy period
the bus and the emulated GPS server are not in synch, the BUG does not accumulates
the Gi,k values but the set of QoS levels requested by the packets flows.
The BUG is required to packetize the computed bus utilization grants before adjusting vacancy spaces at the DMA receive channels. This is required because, while the GPS algorithm's information unit is the byte, the information unit for DMA channels is the packet. Figure 4.6 shows an example scenario. When packetizing utilization grants it may happen that modulus(γi,k, Li) ≠ 0, where Li is the mean packet length for channel-i. Hence, some rounding off is required. We have tested rounding off both down and up and both produce particular problems; however, the former gave us a more stable mechanism. Of course, if we let the BUG round off its bus utilization grants, then its emulated GPS server will get out of synch with respect to what is happening at the bus. Therefore, we also programmed the BUG to adjust the state of its emulated GPS server accordingly. If nothing else is done, some bus idle time is artificially produced and the overall share assigned to the affected packets flow would be much less than what it should be. This problem, and the one induced by the unfairness counterbalancing mechanism, can be solved if we let the BUG reduce its next activation period length by some dt time value. Evidently, this increases the BUG's overhead. But as long as dt is a small fraction of T, the overhead's increase remains at acceptable levels.
Figure 4.6—The BUG’s bus utilization grant packetization policy. In the considered
scenario, three packets flows with different packet sizes traverse the router and
the BUG has granted each an equal number of bus utilization bytes. Packet
sizes are small, medium and large respectively for the orange, green and blue
packets flows. After packetization, some idle time gets induced.
Byte grants
Packetized
grants
Back-to-back
packetized grants
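A sketch of the packetization step and of the resulting activation-period adjustment: each byte grant is rounded down to a whole number of packets of that flow's mean length, and the next activation instant is brought forward by dt so the rounded-off capacity is not wasted. Values and variable names are illustrative.

// Packetize byte grants (round down to whole packets) and compute the dt by which
// the next activation period is shortened. Illustrative values only.
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    const double betaBus = 1.0 / 132e6;                   // bus cost per byte [s/byte]
    const double T = 100e-6;                              // nominal activation period [s]
    std::vector<double> gammaBytes = {4400, 4400, 4400};  // grants computed by the BUG
    std::vector<double> meanPktLen = {172, 558, 1432};    // per-flow mean packet length

    double grantedBytes = 0.0;
    for (size_t i = 0; i < gammaBytes.size(); ++i) {
        const double pkts = std::floor(gammaBytes[i] / meanPktLen[i]); // whole packets only
        const double packetized = pkts * meanPktLen[i];
        grantedBytes += packetized;
        std::cout << "flow " << i + 1 << ": " << pkts << " packets, "
                  << packetized << " bytes\n";
    }
    // The next activation comes sooner so the rounded-off capacity is not wasted.
    const double nextPeriod = grantedBytes * betaBus;      // = T - dt
    std::cout << "next activation in " << nextPeriod * 1e6 << " us (dt = "
              << (T - nextPeriod) * 1e6 << " us)\n";
}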
4.3.5 Algorithm's a priori estimated costs
When in monitoring state, the BUG only needs "to keep an eye" on the bus utilization at every activation instant, and thus we can say this task's costs are low. However, this task still requires O(n) work, where n is the number of DMA receive channels associated with network interface cards.
When in enforcing state, the BUG performs a sequence of tasks involving loops: assessing bus utilization, computing the emulated GPS server's inputs, emulating the GPS server itself, and adjusting packets flows' bus grants. Of all these tasks, the emulation of the GPS server requires the largest number of instructions. Shreedhar and Varghese [1995] state that a naive implementation of a GPS server requires O(log(m)) work per packet, where m is the number of packets in the router. However, Keshav [1991] shows that a good implementation requires O(log(n)), where n is the number of active packets flows.
Section 4.6 reports a posteriori measurements of the BUG's costs.
4.3.6 An example scenario
In order to see how all this works, allow us to present a step-by-step description of an example operation scenario for the BUG. The scenario considers the operation of the BUG-protected bus in isolation. Three packets flows load the system. Each packets flow solicits one third of the bus capacity. All packets are of the same size and six packets fully occupy the bus. Figure 4.7 shows a picture of the description that follows. In the picture, time runs downwards. Marks on the time axis denote the BUG's nominal activation instants and thus are spaced by T seconds. The picture shows the BUG's operation variables as vectors. For each vector, dimension k corresponds to packets flow k. Table 4-II defines the vectors used.
TABLE 4-II
Vector   Definition
A        Number of packet arrivals during the last activation period
D        Number of packet departures during the last activation period
N        Number of packets waiting at the buffers of some NIC
G        Outputs produced by the emulated GPS
Gac      Running sum of GPS outputs
Dac      Running sum of departures
U        Unfairness (Dac,k−1 + Dk − Gac,k−1)
γ        Use grant given by the BUG
Because vectors A and D denote events that occurred at some time during the previous activation period, we show them between activation instants. The rest are computed, or considered, at every activation instant, and therefore we show them grouped and pointed at the corresponding instant. Observe that the D values are arbitrary and that, to simplify things, all units are given in packets, not bytes.
Figure 4.7—An example of the behavior of the BUG. Vectors A, D, N, G and γ are defined as A = (A1, A2, A3), etc. It is assumed that the system serves three packets flows with the same QoS shares (2,2,2) and with the same packet lengths, and that in a period T up to six packets can be transferred through the bus. The BUG does not consider all variables at all times; at every activation instant, the variables that the BUG ignores are printed in gray. [Diagram: six activation instants (0T to 6T) showing, for each, the measured bus utilization, the resulting state (monitoring or enforcing), and the vectors A, D, N, G, U, γ, Gac and Dac.]
At the first activation instant, the BUG measures a bus utilization level of
50%. Consequently, the BUG enters monitoring state, sets use-grants per packets
flow to six, resets the emulated GPS server’s state, and resets the running sums of
the unfairness counterbalancing mechanism: the running sums of departures are zeroed but the running sums of GPS outputs are set to packets flows’ QoS shares.
At the second activation instant, the BUG measures a bus utilization level of
100% and thus enters enforcing state, emulates the GPS server, checks for unfairness,
sets use-grants after the GPS’s outputs and the unfairness levels, and allows the GPS
server to keep its state. Because this is the first time into enforcing state, GPS server’s
inputs are taken after NIC occupation levels. In this case, inputs do not saturate the GPS
server, which reports some idle time. At the same time, the BUG detects a packets flow
is being deprived of some solicited bus time. Therefore, depriver packets flows have
their use-grants reduced and deprived ones augmented. At the same time, all packets
flows get additional use-grants proportional to the GPS server’s idle time. As the GPS
server is not in synch with the bus, the BUG updates the running sums of GPS outputs
with the QoS shares, not the actual GPS outputs.
At the third activation instant, the BUG again measures a bus utilization level of
100% and remains at enforcing state. This time, GPS server’s inputs are taken after the
arrivals. The GPS server now gets 100% used and thus no idle-time related use-grants
will be added. However, the BUG still detects a packets flow being deprived of some
solicited bus time and thus the GPS server’s outputs get adjusted appropriately for computing the use-grants.
At the fourth activation instant, the BUG remains in enforcing state. This time the GPS server again gets 100% used and no packets flow is detected as being deprived. Therefore, use-grants are taken directly from the GPS server's outputs. The fifth activation instant is pretty much the same as the previous one, apart from vector D.
At the sixth activation instant, the BUG enters monitoring state and thus sets use-grants per packets flow to six and resets the emulated GPS server's state.
4.4 BUG performance study
The BUG has several operational latitudes, as the previous section shows, such as the activation period's value and its variation, the effectiveness of the rounding-off policy, or the influence that highly variable traffic has on the dual-mode operation. In order to assess how, and how much, these latitudes influence overall performance, we devised a series of simulation experiments. At the same time, these experiments allowed us to see how well a PCI bus controlled by a BUG approximates a bus ideally supporting QoS, like a weighted fair queuing (WFQ) bus.
4.4.1 Experimental setup
We study the performance of a BUG-regulated PCI bus to which three network interface cards were attached. For the comparison studies we similarly configured a hypothetical, ideal WFQ bus and a plain PCI bus. Each bus was traversed by three packets flows, each coming from one network interface card. The buses were modeled with queuing networks. Figure 4.8 shows these models. We approximated the PCI bus operation by a server using a round robin scheduler. Operational parameters for all busses were computed after a 33 MHz, 32-bit PCI bus. Data links are assumed to sustain a one gigabit per second throughput. We used a simple yet meaningful QoS differentiation: the packet size. Indeed, as reported elsewhere [Shreedhar and Varghese 1996], round robin scheduling is particularly unfair to packets flows with different packet sizes. The packets flows used had the features shown in Table 4-III. Different experiments used different inter-arrival processes to show particular behavior.
TABLE 4-III
          Packet length (bytes)   Solicited share
Flow 1    172                     1/3
Flow 2    558                     1/3
Flow 3    1432                    1/3
Figure 4.8—Queuing network models for: a) a PCI bus, b) a WFQ bus, and c) a BUG-protected PCI bus; all with three network interface cards attached and three packets flows traversing them. [Diagrams: per-flow sources and NIC input queues feeding the I/O bus server (round robin); in b) a WFQ stage precedes the bus; in c) per-flow monitors and a GPS flow scheduler (the BUG) regulate the NIC receive channels.]
4.4.2 Response to unbalanced, constant packet rate traffic
Figure 4.9 shows the busses' responses to unbalanced, constant packet rate traffic, for three different operation scenarios for each of the three considered busses. Each line in every chart denotes the running sum of output bytes over time for one of the three considered packets flows. Charts in the left, center and right columns correspond to the WFQ bus, the plain PCI bus, and the BUG-equipped PCI bus, respectively. Each row of charts corresponds to a particular traffic pattern. The traffic pattern for row (a) was as follows. At time zero, flow1 and flow2 start loading the system with a load level equivalent to 45% of a PCI bus capacity each; that is, 475.2 Mbps. Two milliseconds later (first arrow), or 20 times the BUG's activation period, flow3 starts loading the system, also at 475.2 Mbps. Then, two milliseconds later (second arrow), flow3 augments its load to 1 Gbps. The traffic patterns for rows (b) and (c) are similar but with the packets flows' roles changed.
For these experiments, we set BUG’s nominal activation period to 0.1 milliseconds. At this value, and under the considered bus’ speed, the worst-case DMA receive
Figure 4.9—BUG performance study: response comparison to unbalanced, constant packet rate traffic between a WFQ bus, a PCI bus and a BUG-protected PCI bus; left, middle and right columns, respectively. In row (a) flow3 is the misbehaving flow, while flow2 and flow1 misbehave in rows (b) and (c), respectively. [Charts: nine panels of output (Mbytes) versus time (ms) for the small-, medium- and large-packet flows; T = 0.1 ms; Buff = 76 pkts for the BUG-protected bus.]
channel size is 76, which corresponds to flow1. Putting this in implementation perspective, we may say that 76 mbufs corresponds to a little more than half the nominal DMA channel size for FreeBSD, which is 128. At the same time, a 0.1 millisecond period is only 10 times smaller than FreeBSD's nominal real-time clock period, and thus feasible to implement. On the other hand, this implies that the BUG should take no more than 10 microseconds to execute if we want the overhead premise of T >> τ to hold. For a software router with a 1 GHz central processing unit this means 10 thousand cycles. A priori, that should be enough.
From Figure 4.9’s left-most column we can see that during the first two milliseconds the ideal bus allows a 50% bus share between the two active packets flows. Then,
after the third packets flow gets active, the bus allows a 33% bus share irrespectively of
the load level of the so-called misbehaving packets flow. (The small share differences
are due to WFQ’s well-known misbehavior upon packet bursts. Zhang and Keshav
[1991] explain this.)
From Figure 4.9’s middle column we can see that a plain PCI bus only adequately
follows the ideal behavior during the first two milliseconds. A that point in time, (first
arrow) the round robin scheduling deprives flow1 from having enough bus time in favor
of both flow2 and flow3. Moreover, flow3 is never deprived and flow2 is, when flow3
gets greedy after the second arrow at row (a).
From Figure 4.9’s right-most columns we can see that the BUG equipped PCI bus
behaves very much like the ideal bus does.
4.4.3 Study on the influence of the activation period
We repeated the experiments of the previous subsection, but only for the BUG-protected PCI bus and while augmenting the BUG's activation period T. We wanted to see if we could find any macroscopic problems related to this operational parameter. Per the BUG's algorithm description, as T becomes relatively larger, small-scale injustices appear; we wanted to see whether, and how, these microscopic injustices might show up macroscopically.
We incremented T until the BUG-regulated DMA channels' size became unreasonably large. Indeed, as stated in subsection 4.3.2, there is a proportional relationship between T and the BUG-regulated DMA channels' sizes. As stated in the previous section, for the considered bus speed and with T equal to 0.1 milliseconds, the BUG requires 76 mbufs in the worst case, or a little more than half the nominal DMA channel size for FreeBSD. With T equal to 0.5 milliseconds the required DMA channel size is 383, already more than double the normal size. And with T equal to 10 milliseconds each DMA channel requires 7674 slots, or 14.9 Mbytes of non-swappable (wired) memory. (1 Mbyte equals 2^20 bytes.) Recall that FreeBSD release 4.1.1 only uses mbuf clusters, and thus each DMA channel slot requires at least 2048 bytes.
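The channel sizes just quoted follow from the bus capacity per period and the smallest packet length. The sketch below reproduces that arithmetic, including the wired-memory footprint with 2048-byte mbuf cluster slots.

// Worst-case DMA receive channel size and wired memory versus the BUG period T.
// Bus: 33 MHz x 32 bit (132 Mbytes/s); smallest packets: flow1, 172 bytes.
#include <cmath>
#include <iostream>

int main() {
    const double busBytesPerSec = 33e6 * 4;  // 132 Mbytes/s
    const double minPktLen = 172;            // flow1 packet length [bytes]
    const double clusterBytes = 2048;        // FreeBSD 4.1.1 mbuf cluster slot
    for (double T : {0.1e-3, 0.5e-3, 1e-3, 5e-3, 10e-3}) {
        const double slots = std::floor(T * busBytesPerSec / minPktLen);
        const double wiredMB = slots * clusterBytes / (1 << 20); // 1 Mbyte = 2^20 bytes
        std::cout << "T = " << T * 1e3 << " ms: " << slots << " slots, "
                  << wiredMB << " Mbytes wired\n";
    }
}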
Figure 4.10 shows the study’s results and, as seen there, the BUG protected PCI
bus maintains its excellent macroscopic behavior. As in the previous figure, each line at
every chart denotes the running sum of output bytes over time for one of the three considered packets flows. Departing from previous figure, each row now corresponds to a
particular experiment using a particular BUG activation period T. Moreover, each col-
column now corresponds to a particular traffic pattern with respect to the misbehaving flow: flow1 for the left column, flow2 for the center column, and flow3 for the right column.
4.4.4 Response to on-off traffic
Figure 4.11 shows the busses' responses to on-off traffic, for two different operation scenarios for each of the three considered busses. Each line in every chart but one denotes the running sum of output bytes over time for one of the three considered packets flows. The extra line denotes the running sum of input bytes. Ideally, output lines
Figure 4.10—BUG performance study: on the influence of the activation period. [Charts: twelve panels of output (Mbytes) versus time (ms) for the small-, medium- and large-packet flows on the BUG-protected bus; rows correspond to T = 0.5, 1, 5 and 10 ms (worst-case buffers of 383, 767, 3837 and 7674 packets) and columns to the misbehaving flow (flow1, flow2, flow3).]
OSCAR IVÁN LEPE ALDAMA
108
4.4—BUG PERFORMANCE STUDY
Charts at the left, center, and right columns compare the buses’ responses for flow1, flow2, and flow3, respectively. Each row of charts corresponds to a particular value of the source’s on-state period length: a) eight times BUG’s activation period; b) half BUG’s activation period. During on-states, packet inter-arrival processes were Poisson with a mean bit rate of 3520 Mbps, or 300% of the PCI bus capacity. Sources’ off-state period lengths were drawn from an exponential random process with mean value equal to nine times the on-state period length. Consequently, every packets flow’s overall mean bit rate was equal to 30% of the PCI bus capacity, or 352 Mbps.
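The exact traffic generator is not shown in the text; the following sketch in C illustrates one way to produce the on-off sources just described. The fixed packet size and the on-state length are hypothetical placeholders (the real experiments use the three Table 4-III packet sizes and on-state lengths tied to T).

    /* On-off source: Poisson arrivals at 3520 Mbps while on, exponentially
     * distributed off periods with mean nine times the on period. */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    static double exp_rand(double mean)          /* exponential variate */
    {
        double u = (rand() + 1.0) / (RAND_MAX + 2.0);
        return -mean * log(u);
    }

    int main(void)
    {
        const double on_len_ms   = 0.8;          /* hypothetical: 8 * T, T = 0.1 ms */
        const double off_mean_ms = 9.0 * on_len_ms;
        const double pkt_bits    = 12000.0;      /* hypothetical 1500-byte packets */
        const double on_rate_bps = 3520e6;       /* on-state mean bit rate */
        const double mean_iat_ms = pkt_bits / on_rate_bps * 1e3;
        double t = 0.0;

        while (t < 100.0) {                      /* generate 100 ms of arrivals */
            double on_end = t + on_len_ms;
            while (t < on_end) {                 /* Poisson arrivals while on */
                t += exp_rand(mean_iat_ms);
                if (t < on_end)
                    printf("%.6f\n", t);         /* arrival timestamp [ms] */
            }
            t = on_end + exp_rand(off_mean_ms);  /* silent off period */
        }
        return 0;
    }

With these values the long-run mean rate is one tenth of the on-state rate, consistent with the 352 Mbps figure above.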
Besides observing the system response to this kind of traffic, with these experiments we wanted to see if we could find any BUG pathology related to operating-mode cycles, where the continuous but random path into and out of enforcing mode might produce erroneous behavior. Consequently, we ran several experiments with different on-off cycle lengths. Here we present results for two different on-state period lengths.
Figure 4.11.a presents results when the on-state period is equal to eight times the BUG activation period, T, and Figure 4.11.b presents results when the on-state period is equal to 0.5T. From both figures we can see that, despite the traffic’s fluctuations, the BUG follows the ideal WFQ policy quite well, while the PCI-like round-robin policy again favors the largest-packets flow and penalizes the smallest-packets flow the most. Furthermore, it seems that the BUG is not macroscopically sensitive to a traffic pattern that repeatedly takes it in and out of enforcing mode.
4.4.5 Response to self-similar traffic
In order to evaluate the long-range behavior of a BUG protected PCI bus, we performed an experiment feeding synthetic self-similar traffic to the simulators of the compared buses.
[Figure 4.11—BUG performance study: response comparison to on-off traffic between an ideal WFQ bus, a PCI bus, and a BUG protected PCI bus; panels a) and b) correspond to the two on-state period lengths discussed above.]
This traffic trace was composed of 6.5 million packets, classified into the three packets flows described in Table 4-III, and had an average throughput of 125 Mbytes per second (1 Mbyte equals 10^6 bytes), or 100% of the maximum theoretical throughput of the PCI bus. The simulation run spanned 18.874 real-time seconds. Table 4-IV lists variance values of output-byte traces’ correlation functions between the ideal WFQ bus and both the plain PCI bus (round robin) and the BUG protected PCI bus, when different observation periods were applied. In this table we can see that the correlation variance values for the plain PCI bus are higher than the values for the BUG protected PCI bus. Moreover, the differences between the values grow with the observation period, although the increase is not proportional to the size of the observation period. For instance, Flow1’s ratio at an observation period of 0.1 milliseconds is around 1.845 (14.36 / 7.78), at 1 millisecond it is 3.02, and at 2.5 milliseconds it is 3.342.
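The exact formula behind the values in Table 4-IV is not spelled out in the text. As a reading aid only, the following sketch in C computes one simple related quantity, the variance of the per-observation-period deviation of a bus’s output-byte trace from the ideal WFQ bus’s trace; it should not be taken as the precise metric used to build the table.

    #include <stddef.h>

    /* Variance of the per-bin deviation between a bus's output-byte trace and
     * the ideal WFQ bus's trace, both binned at the same observation period. */
    double
    deviation_variance(const double *bus_bytes, const double *ideal_bytes,
                       size_t nbins)
    {
        double mean = 0.0, var = 0.0;
        size_t i;

        for (i = 0; i < nbins; i++)
            mean += bus_bytes[i] - ideal_bytes[i];
        mean /= (double)nbins;

        for (i = 0; i < nbins; i++) {
            double d = (bus_bytes[i] - ideal_bytes[i]) - mean;
            var += d * d;
        }
        return var / (double)nbins;
    }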
Evidently, these results are a good news, bad news case. The good news is, of course, that the BUG protected PCI bus better follows the long-range behavior of the ideal WFQ bus when compared to a plain PCI bus, even under a very high variability operation scenario. However, the variance values for the BUG protected PCI bus are somewhat higher than we expected. We recognize that it would be interesting to dig further into this issue, but we leave it as future work.
TABLE 4-IV
Variance of output-byte traces’ correlation functions with respect to the ideal WFQ bus

Period [ms]      Plain PCI bus                  BUG protected PCI bus
                 Flow1    Flow2    Flow3        Flow1    Flow2    Flow3
0.1               14.36     2.70     1.62         7.78     1.58     0.95
0.5               49.24     7.51     4.73        19.47     3.29     2.09
1                 81.39    11.68     7.39        26.90     4.67     3.01
1.5              109.88    15.36     9.76        33.35     5.81     3.68
2                133.63    18.21    11.73        39.42     7.13     3.89
2.5              156.78    21.11    13.63        46.91     8.04     4.66
Before passing to another issue, here we give details on how we produced the self-similar traffic trace. We used Christian Schuler’s program (Christian Schuler, Research Institute for Open Communication Systems, GMD FOKUS, Hardenbergplatz 2, D-10623 Berlin, Germany), which implements Vern Paxson’s fast approximation method for synthesizing self-similar traffic [Paxson 1997]. Given a Hurst parameter, a mean value, and a variance, Schuler’s program produces a list of numbers. Each number represents a count of packet arrivals within an arbitrary period. Schuler’s program does not give any meaning to this period, leaving its definition to the program’s user. Paxson’s method uses a fractional Gaussian noise process and consequently the output of Schuler’s program may contain negative numbers. The given mean and variance values influence the relative count of these negative numbers. It is up to the program’s user to choose proper program inputs and to interpret the negative numbers. Schuler’s program output corresponds to an aggregated traffic trace. Given that we wanted this aggregated traffic to be composed of the three packets flows described in Table 4-III, we filtered Schuler’s program output through an ad hoc program to produce three traffic traces, each corresponding to a required packets flow.
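The ad hoc filtering program is not reproduced here. The following minimal sketch in C illustrates the kind of post-processing described: it reads one aggregate packet count per observation period, clamps negative counts to zero (one possible interpretation of the fractional Gaussian noise output), and splits each count into three flows; the split ratios are purely hypothetical.

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical split of the aggregate into the three Table 4-III flows. */
        const double split[3] = { 0.5, 0.3, 0.2 };
        long count;

        /* One aggregate packet count per observation period, read from stdin. */
        while (scanf("%ld", &count) == 1) {
            if (count < 0)
                count = 0;               /* one way to treat FGN negatives */
            /* Truncation means the three parts may not sum exactly to count. */
            printf("%ld %ld %ld\n",
                   (long)(count * split[0]),
                   (long)(count * split[1]),
                   (long)(count * split[2]));
        }
        return 0;
    }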
We employed a Hurst parameter of 0.8, a mean value of 25 packets per observation period, and a variance of four times the mean value. We chose this set of parameters based on what Lucas et al. [1997] report. Their paper reports statistical characteristics of traffic traversing the University of Virginia’s campus network, which hosts approximately 10 thousand computers. Their analysis focuses on three 90-minute intervals starting at 2:15 AM, 2:00 PM, and 9:00 PM. Regardless of the utilization levels exhibited during these periods, the paper reports the traffic adjusting to a Hurst parameter of 0.8 and having a variance of four times its mean value.
We came to the value of 25 packets per observation period empirically. We ran Schuler’s program with several mean values (using, of course, the stated Hurst parameter and variance) and counted the number of negative numbers contained in the output trace. Schuler’s program ran quite fast on the University’s SGI Power Challenge so, a priori, we did not invest any time selecting the initial trial value. We randomly chose 10 packets per observation period and the output contained 4.9% negative numbers. Entering 20 packets per observation period we got a trace with 1.2% of those numbers. With 30 we got 0.3%, and with 25 we got 0.5%.
We assigned a value of 72 microseconds to Schuler’s program’s observation period. This value, together with the given packets flows’ features and the selected mean value of 25 packets per observation period, results in a target traffic trace throughput of 125 Mbytes per second.
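As a rough consistency check of these figures (the Table 4-III packet-size mix is not reproduced here, so the average packet size below is only what the stated numbers imply):

    25 packets / 72 microseconds ≈ 347,000 packets per second
    125 Mbytes per second / 347,000 packets per second ≈ 360 bytes per packet, on average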
4.5 A performance study of a software router
incorporating the BUG
Naturally, after proving that a BUG protected bus works fine in isolation, at least at the queuing network modeling level of detail, we wanted to see if such a bus could improve the operation of a software router set up to provide differentiated services. In order to do this, we basically extended the set of experiments presented in the previous chapter’s subsection 3.8.2. Following the rationale presented there, we extended the queuing network model shown in the previous chapter’s Figure 3.21 and introduced the flow monitors and flow scheduler shown in the previous section’s Figure 4.8.c. The resulting model’s operational parameters and the features of the workload were set as in the previous chapter’s subsection 3.8.2. Once again, we performed the simulation for routers configured with two different central processing units: one with a 1 GHz CPU and the other with a 3 GHz CPU. As before, for the considered traffic, note that the CPU is the system’s bottleneck for the 1 GHz router while the bus is the system’s bottleneck for the 3 GHz router.
4.5.1 Results
Figure 4.12 shows results for a software router using WFQ scheduling for the CPU and the BUG mechanism for controlling bus usage. We see that the obtained results correspond to an almost ideal behavior, as under saturation the throughput does not decrease with increasing offered loads and the system achieves a fair share of both router resources: CPU and bus.
Figure 4.12—QoS aware system’s performance analysis: a) system’s overall
throughput; b) per packets flow throughput share for system one; c) per packets flow throughput share for system two
4.6 An implementation
Currently we are working on a BUG implementation for a FreeBSD powered PC-based software router, using 3COM’s 3C905B network interface cards, which are based on the “hurricane” PCI bus-master chips and controlled by the FreeBSD xl driver.
4.7 Summary
•  We presented a mechanism for improving the resource sharing of the input/output bus of personal computer-based software routers
•  The mechanism that we proposed and called BUG, for bus utilization guard, does not imply any changes in the host computer’s hardware, although some special features are required for network interface cards—they should have different direct memory access channels for each differentiated packets flow and they should be able to give information about the number of bytes and packets stored for each of these channels
•  The BUG mechanism can be run by the central processing unit or by a suitable coprocessor attached at the AGP connector
•  Using a queuing model solved by simulation, we studied BUG’s performance. The results show that the BUG is effective in controlling the bus share between competing packets flows
•  When we use this mechanism in combination with the known techniques for central processing unit usage control, we obtain a nearly ideal behavior of the share of the software router resources for a broad range of workloads
Chapter 5
Conclusions and future work
5.1 Conclusions
We have shown that single-queue performance models of software routers implemented with BSD networking software, or similar, and personal computing (PC) hardware, or similar, incur significant error when these models are used to study scenarios where the router’s central processing unit (CPU) is the system’s bottleneck and not the communications links. Furthermore, we have shown that a queuing network model better represents these systems under the considered scenarios.
Armed with a mature characterization process, we have shown that it is possible to build a queuing network model of PC-based software routers that is highly accurate, so this model may be used to carry out performance studies at several levels of detail. Furthermore, we have shown that model service times computed for one system may be used to predict the performance of other systems, if scaled appropriately, and that model service times scale linearly with the CPU’s operation speed but can be considered constant with respect to message and routing table sizes. Moreover, the model service times related to the network interfaces layer behave linearly with respect to the CPU’s operation speed, with an offset that varies with the network interface card’s and device driver’s technology and with cache memory performance.
Using our validated, parameterized queuing network model, we have quantitatively shown that current CPUs allow software routers to sustain high throughput when forwarding plain Internet Protocol datagrams if some adjustments are introduced to the networking software, and that current PCs’ input/output (I/O) buses, however, are the limiting factor for achieving system-wide high performance. Moreover, we have
quantitatively shown how and when current PCs’ I/O buses hamper a PC-based software router from supporting system-wide communication quality assurance mechanisms, or Quality-of-Service (QoS) mechanisms.
In light of the above, we proposed a mechanism for improving the resource sharing of the input/output bus of personal computer-based software routers. This mechanism, which we called BUG, for bus utilization guard, does not imply any changes in the host computer’s hardware, although some special features are required for network interface cards—they should have different direct memory access channels for each differentiated packets flow and they should be able to give information about the number of bytes and packets stored for each of these channels. We have also quantitatively shown that, when this mechanism is used in combination with known techniques for CPU usage control, it is possible to obtain a nearly ideal behavior of the share of the software router resources for a broad range of workloads.
5.2 Future work
A precise analysis of the results obtained when a BUG protected PCI bus was loaded with self-similar traffic is missing from this document; see subsection 4.4.5. Recall that although those results confirmed that, under the considered load conditions, a BUG protected PCI bus follows the ideal behavior of a WFQ bus better than a plain PCI bus does, they also showed a departure from the ideal behavior higher than we expected.
A working BUG implementation for a production system is also missing from this document. Currently, an undergraduate student of the Facultad de Informática de Barcelona (FIB/UPC) is conducting his final project on this subject. He is implementing the BUG for a software router running FreeBSD release 4.5 and using 3COM’s 3C996 PCI-X/Gigabit Ethernet network interface cards.
We find it natural to pursue extending our modeling process to embrace the whole networking software; that is, to model PC-based communication processes involving the Internet architecture’s transport layer protocols and BSD’s sockets layer. This, we think, is no easy task, as it would most certainly involve user-level application programs that are subject to the CPU scheduler. That is, the CPU’s extended model would have to include not only the priority preemptive scheduling policy, which models the software interrupt mechanism, but also a round-robin preemptive policy for modeling the UNIX CPU scheduler. Furthermore, the CPU’s extended model would have to switch between scheduling policies depending on the kind of tasks pending execution. In turn, this would require extending our characterization process to describe the tasks involved when context switching occurs. Evidently, if such a model could be built, it would be valuable for capacity planning and as a uniform test-bed for Internet services.
We also find it tempting to simplify the software router model’s assumptions and to try to solve the model analytically. The motivation would be to assess the trade-off between model simplification and the error incurred.
Current telematic systems research is focusing on embedded systems for personal communications and ubiquitous computing. At the same time, current embedded systems use complex microprocessors (some of which provide performance-monitoring counters, like Intel’s Pentium-class microprocessors do) and execute full-blown PC operating system kernels like OpenBSD or Linux. Under this light, we find it interesting to try to use our characterization and modeling process for studying the performance of telematic systems executing on embedded hardware. Other performance modeling challenges we find interesting are emerging software router technologies such as SMP PC-based and PC cluster-based software routers.