Modeling TCP/IP Software Implementation Performance and Its Application for Software Routers

Thesis by Oscar Iván Lepe Aldama
M.Sc. Electronics and Telecommunications, CICESE Research Center, México, 1995
B.S. Computer Engineering, Universidad Nacional Autónoma de México, 1992

Submitted to the Department of Computer Architecture in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering at the Universitat Politecnica de Catalunya. Presented and defended on December 3, 2002.

Advisor: Prof. Dr. Jorge García-Vidal
Jury: Prof. Dr. Ramón Puigjaner-Trepat (president); Profa. Dra. Olga Casals-Torres, Prof. Dr. José Duato-Marín, Prof. Dr. Joan García-Haro, Prof. Dr. Sebastian Sallent-Ribes (examiners)

© 2002 Oscar I. Lepe A. All rights reserved.

Contents

List of figures
List of tables
Preface

Chapter 1  Introduction
  1.1 Motivation
  1.2 Scope
  1.3 Dissertation's thesis
  1.4 Synopsis
  1.5 Outline
  1.6 Related work

Chapter 2  Internet protocols' BSD software implementation
  2.1 Introduction
  2.2 Interprocess communication in the BSD operating system
    2.2.1 BSD's interprocess communication model
    2.2.2 Typical use of sockets
  2.3 BSD's networking architecture
    2.3.1 Memory management plane
    2.3.2 User plane
  2.4 The software interrupt mechanism and networking processing
    2.4.1 Message reception
    2.4.2 Message transmission
    2.4.3 Interrupt priority levels
  2.5 BSD implementation of the Internet protocols suite
  2.6 Run-time environment: the host's hardware
    2.6.1 The central processing unit and the memory hierarchy
    2.6.2 The busses organization
    2.6.3 The input/output bus' arbitration scheme
    2.6.4 PCI hidden bus arbitration's influence on latency
    2.6.5 Network interface card's system interface
    2.6.6 Main memory allocation for direct memory access network interface cards
  2.7 Other systems' networking architectures
    2.7.1 Active Messages [Eicken et al. 1992]
    2.7.2 Integrated Layer Processing [Abbott and Peterson 1993]
    2.7.3 Application Device Channels [Druschel 1996]
    2.7.4 Microkernel operating systems' extensions for improved networking [Coulson et al. 1994; Coulson and Blair 1995]
    2.7.5 Communications oriented operating systems [Mosberger and Peterson 1996]
    2.7.6 Network processors
  2.8 Summary

Chapter 3  Characterizing and modeling a personal computer-based software router
  3.1 Introduction
  3.2 System modeling
  3.3 Personal computer-based software routers
    3.3.1 Routers' rudiments
    3.3.2 The case for software routers
    3.3.3 Personal computer-based software routers
    3.3.4 Personal computer-based IPsec security gateways
  3.4 A queuing network model of a personal computer-based software IP router
    3.4.1 The forwarding engine, the network interface cards and the packet flows
    3.4.2 The service stations' scheduling policies and the mapping between networking stages and model elements
    3.4.3 Modeling a security gateway
  3.5 System characterization
    3.5.1 Tools and techniques for profiling in-kernel software
    3.5.2 Software profiling
    3.5.3 Probe implementation
    3.5.4 Extracting information from the kernel
    3.5.5 Experimental setup
    3.5.6 Traffic patterns
    3.5.7 Experimental design
    3.5.8 Data presentation
    3.5.9 Data analysis
    3.5.10 "Noise" process characterization
  3.6 Model validation
    3.6.1 Service time correlations
    3.6.2 Qualitative validation
    3.6.3 Quantitative validation
  3.7 Model parameterization
    3.7.1 Central processing unit speed
    3.7.2 Memory technology
    3.7.3 Packet's size
    3.7.4 Routing table's size
    3.7.5 Input/output bus's speed
  3.8 Model's applications
    3.8.1 Capacity planning
    3.8.2 Uniform experimental test-bed
  3.9 Summary

Chapter 4  Input/output bus usage control in personal computer-based software routers
  4.1 Introduction
  4.2 The problem
  4.3 Our solution
    4.3.1 BUG's specifications and network interface card's requirements
    4.3.2 Low overhead and intrusion
    4.3.3 Algorithm
    4.3.4 Algorithm's details
    4.3.5 Algorithm's a priori estimated costs
    4.3.6 An example scenario
  4.4 BUG performance study
    4.4.1 Experimental setup
    4.4.2 Response to unbalanced constant packet rate traffic
    4.4.3 Study on the influence of the activation period
    4.4.4 Response to on-off traffic
    4.4.5 Response to self-similar traffic
  4.5 A performance study of a software router incorporating the BUG
    4.5.1 Results
  4.6 An implementation
  4.7 Summary

Chapter 5  Conclusions and future work
  5.1 Conclusions
  5.2 Future work

Appendix A
Appendix B
Bibliography

List of figures

Figure 2.1—OMT object model for BSD IPC
Figure 2.2—BSD's two-plane networking architecture. The user plane is depicted with its layered structure, which is described in the following sections. Bold circles in the figure represent defined interfaces between planes and layers: A) Socket-to-Protocol, B) Protocol-to-Protocol, C) Protocol-to-Network Interface, and D) User Layer-to-Memory Management. Observe that this architecture implies that layers share the responsibility of taking care of the storage associated with transmitted data
Figure 2.3—BSD networking user plane's software organization
Figure 2.4—Example of priority levels and kernel processing
Figure 2.5—BSD implementation of the Internet protocol suite. Only chief tasks, message queues and associations are shown. Please note that some control flow arrows are sourced at the bounds of the squares delimiting the implementation layers. This denotes that a task is executed after an external event, such as an interrupt or a CPU scheduler event
Figure 2.6—Chief components in a general purpose computing hardware
Figure 2.7—Example PCI arbitration algorithm
Figure 2.8—Main memory allocation for device drivers of network interface cards supporting direct memory access
Figure 2.9—Integrated Layer Processing [Abbott and Peterson 1993]
Figure 2.10—Application Device Channels [Druschel 1996]
Figure 2.11—SUMO extensions to a microkernel operating system [Coulson et al. 1994; Coulson and Blair 1995]
Figure 2.12—Making paths explicit in the Scout operating system [Mosberger and Peterson 1996]
Figure 3.1—A spectrum of performance modeling techniques
Figure 3.2—McKeown's [2001] router architecture
Figure 3.3—A queuing network model of a personal computer-based software router that has two network interface cards and that is traversed by a single packet flow. The number and meaning of the shown queues is a result of the characterization process presented in the next section
Figure 3.4—A queuing network model of a software router that shows a one-to-one mapping between C language functions implementing the BSD networking code and the depicted model's queues. In order to simplify the figure, this model does not model the software router's input/output bus nor the noise process. Moreover, it simplistically models the network interface cards
Figure 3.5—A queuing network model of a software router configured as a security gateway. The number and meaning of the shown queues is a result of the characterization process presented in the next section
Figure 3.6—Probe implementation for FreeBSD
Figure 3.7—Experimental setup
Figure 3.8—Characterization charts for a security gateway's protocols layer
Figure 3.9—Characterization charts for a software router's network interfaces layer
Figure 3.10—Comparison of the CCPFs computed after both measured data from the software router under test and predicted data from a corresponding queuing network model, which used a one-to-one mapping between C language networking functions and the model's queues
Figure 3.11—Example chart from the service time correlation analysis. It shows the plot of ip_input cycle counts versus ip_output cycle counts. A correlation is clearly shown
Figure 3.12—Model's validation charts. The two leftmost columns' charts depict per-packet latency traces. The right column's chart depicts latency traces' CCPFs
Figure 3.13—Relationship between measured execution times and central processing unit operation speed. Observe that some measures have proportional behavior while others have linear behavior. The main text explains the reasons for these behaviors and why the circled measures do not all agree with the regression lines
Figure 3.14—Outliers related to the CPU's instruction cache. The left chart was drawn after data taken from the ipintrq probe. The right chart corresponds to the ESP (DES) probe at a security gateway. Referenced outliers are highlighted
Figure 3.15—Relationship between measured execution times and message size
Figure 3.16—Capacity planning charts
Figure 3.17—Queuing network model for a BSD based software router with two network interface cards attached to it and three packet flows traversing it
Figure 3.18—Basic system's performance analysis: a) system's overall throughput; b) per flow throughput share for system one; c) per flow throughput share for system two
Figure 3.19—Queuing network model for a Mogul and Ramakrishnan [1997] based software router with two network interface cards attached to it and three packet flows traversing it
Figure 3.20—Mogul and Ramakrishnan [1997] based software router's performance analysis: a) system's overall throughput; b) per flow throughput share for system one; c) per flow throughput share for system two
Figure 3.21—Queuing network model for a software router including the receiver live-lock avoidance mechanism and a QoS aware CPU scheduler, similar to the one proposed by Qie et al. [2001]. The software router has two network interface cards and three packet flows traverse it
Figure 3.22—Qie et al. [2001] based software router's performance analysis: a) per flow throughput share for system one; b) per flow throughput share for system two
Figure 4.1—The BUG's system architecture. The BUG is a piece of software embedded in the operating system's kernel that shares information with the network interface cards' device drivers and manipulates the vacancy of each DMA receive channel
Figure 4.2—The BUG's periodic and bistate operation for reducing overhead and intrusion
Figure 4.3—The BUG's packet-by-packet GPS server emulation with batch arrivals
Figure 4.4—The BUG is work conservative
Figure 4.5—The BUG's unfairness counterbalancing mechanism
Figure 4.6—The BUG's bus utilization grant packetization policy. In the considered scenario, three packet flows with different packet sizes traverse the router and the BUG has granted each an equal number of bus utilization bytes. Packet sizes are small, medium and large respectively for the orange, green and blue packet flows.
After packetization, some idle time gets induced
Figure 4.7—An example of the behavior of the BUG mechanism. Vectors A, D, N, G and g are defined as: A = (A1, A2, A3), etc. It is assumed that the system serves three packet flows with the same shares and with the same packet lengths, and that in a period T up to six packets can be transferred through the bus. The BUG does not consider all variables at all times. At every activation instant, the variables that the BUG ignores are printed in gray
Figure 4.8—Queuing network models for: a) PCI bus, b) WFQ bus, and c) BUG protected PCI bus; all with three network interface cards attached and three packet flows traversing them
Figure 4.9—BUG performance study: response comparison to unbalanced constant packet rate traffic between a WFQ bus, a PCI bus and a BUG protected PCI bus; first, middle and last columns respectively. At row (a) flow3 is the misbehaving flow, while flow2 and flow1 are for (b) and (c), respectively
Figure 4.10—BUG performance study: on the influence of the activation period
Figure 4.11—BUG performance study: response comparison to on-off traffic between an ideal WFQ bus, a PCI bus, and a BUG protected PCI bus
Figure 4.12—QoS aware system's performance analysis: a) system's overall throughput; b) per packet flow throughput share for system one; c) per packet flow throughput share for system two

List of tables

Table 3-I
Table 3-II
Table 3-III
Table 4-I
Table 4-II
Table 4-III
Table 4-IV

Preface

I dedicate this work to my wife, Tania Patricia, and to our children, Pedro Darío and Sebastián. At the same time, I want to thank Tania and Pedro for their support and patience during the time we were away from home, living in conditions that were not the best.
I only hope that this hardship has been offset by how enriching the experiences we lived through were; I feel they will make us better people, having reminded us of the importance of family unity and the value of our customs, our people and our land.

I thank my thesis advisor, Jorge García-Vidal, for his invaluable guidance and dedication. I will hardly forget the countless long discussion meetings that gave life to this work, nor the many sleepless nights we spent polishing the details of the papers we sent for review. But neither will I forget the afternoons over coffee, or the afternoons in the park with our children, where we talked about everything except work.

I thank the people of México, whose effort makes possible the scholarship programs that allow Mexicans to pursue graduate studies abroad, such as the program managed by the Consejo Nacional de Ciencia y Tecnología, or the one managed by the Centro de Investigación Científica y de Educación Superior de Ensenada, which sheltered me.

And in general I thank all the people who directly or indirectly helped me complete this work, for example the professors of DAC/UPC and the anonymous reviewers at the conferences. In particular I want to mention, in alphabetical order, Alejandro Limón Padilla, David Hilario Covarrubias Rosales, Delia White, Eva Angelina Robles Sánchez, Francisco Javier Mendieta Jiménez, Jorge Enrique Preciado Velasco, José María Barceló Ordinas, Llorenç Cerdà Alabern, Olga María Casals Torres, Oscar Alberto Lepe Peralta, and Victoriano Pagoaga. To all of them, many thanks.

Chapter 1  Introduction

1.1 Motivation

Computer networks and the computer applications that run over them are evidently fundamental to today's human society. Some kind of telematic system is fundamental for the proper operation of the New York Stock Exchange, for instance, while another kind is just as fundamental for providing telephony service to small villages in hard-to-reach places, like the numerous villages in the mountains of Chiapas, México. As is well known, the application of telematic technology for the betterment of human society is limited only by human imagination.

Today's human necessities for information processing and transportation present complex problems. To solve these problems, scientists and engineers are required to produce ideas and products with an ever-increasing number of components and inter-component relationships. In order to cope with this complexity, developers invented the concept of layered architectures, which allow a structured "divide and conquer" approach for designing complex systems by providing a step-by-step enhancement of system services. Unfortunately, layered structures can result in relatively low performance telematic systems—and often they do—if implemented carelessly [Clark 1982; Tennenhouse 1989; Abbott and Peterson 1993; Mogul and Ramakrishnan 1999]. In order to understand this, let us consider the following:

• Telematic systems are mostly implemented in software.
• Each software layer is designed as an independent entity that concurrently and asynchronously communicates with its neighbors through a message-passing interface. This allows for better interoperability, manageability and extensibility.
• In order to allow software layers to work concurrently and asynchronously, the host computer system has to provide a secure and reliable multiprogramming environment through an operating system. Ideally, the operating system should perform its role without consuming a significant share of processing resources.
• Unfortunately, as reported elsewhere [Druschel 1996], current operating systems are threatening to become bottlenecks when processing input/output data streams. Moreover, they are the source of statistical delays—incurred as each data unit is marshaled through the layered software—that hamper the correct deployment of important services.

Others have recognized this problem and have conducted studies analyzing some aspects of some operating systems' computer network software, networking software for short. For us it is striking that although these studies are numerous (a search through the ACM's Digital Library on the term "(c.2 and d.4 and c.5)<IN>ccs" returns 54 references, and a search through IEEE's Xplore on the term "'protocol processing'<IN>de" returns 81 references), only one of them pursued building a general model of the networking software—see this chapter's section on related work. Indeed, although different systems' networking software has more similarities than differences, as we will later discuss, most of these studies have focused only on identifying and solving particular problems of particular systems. In saying this we do not deny the importance of that work; however, we believe that modeling is an important part of doing research.

The Internet protocol suite (TCP/IP) [Stevens 1994] is, nowadays, the preferred technology for networking. Of all possible implementations, the one done at the University of California at Berkeley for the Berkeley Software Distribution operating system, or BSD, has been used as the starting point for most available systems. The BSD operating system [McKusick et al. 1996], which is a flavor of the UNIX system [Ritchie and Thompson 1978], was selected as the foundation for implementing the first TCP/IP suite back in the 1980s.

1.2 Scope

Naturally, for modeling a system, high degrees of observability and controllability are required. For us this means free access to both the networking software's source code and the host computer's technical specifications. (When we speak of free, we are referring to freedom, not price.) Today's personal computer (PC) technology provides this. Indeed, not only is there plenty of freely available, detailed technical documentation on Intel's IA32 PC hardware, but there are also several PC operating systems with an open source policy. Of those we decided to work with FreeBSD, a 4.4BSD-Lite derived operating system optimized to run on Intel's IA32 PCs.

When searching for a networking application for which the application of a performance model could be of importance, we found software routers. A software router can be defined as a general-purpose computer that executes a computer program capable of forwarding IP datagrams among network interface cards attached to its input/output bus (I/O bus). Evidently, software routers have performance limitations because they use a single central processing unit (CPU) and a single shared bus to process all packets.
However, due to the ease with which they can be programmed to support new functionality—securing communications, shaping traffic, supporting mobility, translating network addresses, supporting application proxies, and performing n-level routing—software routers are important at the edge of the Internet.

1.3 Dissertation's thesis

From all the above we came up with the following dissertation thesis: Is it possible to build a queuing network model of the Internet protocols' BSD implementation that can be used for predicting with reasonable accuracy not only the mean values of the operational parameters studied but also their cumulative probability function? And, can this model be applied for studying the performance of PC-based software routers supporting communication quality assurance mechanisms, or Quality-of-Service (QoS) mechanisms?

1.4 Synopsis

This work makes three main contributions, in no particular order:

• A detailed performance study of the software implementation of the TCP/IP protocol suite, when executed as part of the kernel of a BSD operating system over generic PC hardware
• A validated queuing network model of the studied system, solved by computer simulation
• An I/O bus utilization guard mechanism for improving the performance of software routers supporting QoS mechanisms and built upon PC hardware and software

This document presents our experiences building a performance model of a PC-based software router. The resulting model is an open multiclass priority network of queues that we solved by simulation. While the model is not particularly novel from the system modeling point of view, in our opinion it is an interesting result to show that such a model can estimate, with high accuracy, not just average performance numbers but the complete probability distribution function of packet latency, allowing performance analysis at several levels of detail. The validity and accuracy of the multiclass model have been established by contrasting its packet latency predictions in both time and probability spaces. Moreover, we introduced into the validation analysis the predictions of a router's single queue model. We did this to quantitatively assess the advantages of the more complex multiclass model with respect to the simpler and widely used—but, as shown here, not so accurate—single queue model, under the considered scenario in which the router's CPU, and not the communications links, is the system bottleneck. The single queue model was also solved by simulation.

Besides, this document addresses the problem of resource sharing in PC-based software routers supporting QoS mechanisms. Others have put forward solutions that are focused on suitably distributing the workload of a software router's CPU—see this chapter's section on related work. However, the evident and increasing gap in operation speed between a PC-based software router's CPU and its I/O bus means to us that attention must be paid to the effect that the limitations imposed by this bus have on the system's overall performance. Consequently, we propose a mechanism that jointly controls both I/O bus and CPU operation for improved PC-based software router performance. This mechanism involves changes to the operating system kernel code and assumes the existence of certain network interface card functions, although it does not require changes to the PC hardware.
A performance study is shown that provides insight into the problem and helps to evaluate both the effectiveness of our approach and several software router design trade-offs.

1.5 Outline

The rest of this chapter is devoted to discussing related work. Chapter 2's objective is to understand the influence that operating system design and implementation techniques have over the performance of the Internet protocols' BSD software implementation. Chapter 3 presents our experiences building, validating and parameterizing a performance model of a PC-based software router. Moreover, it presents some results from applying the model for capacity planning. Chapter 4 addresses the problem of resource sharing in PC-based software routers supporting communication quality assurance mechanisms. Furthermore, it presents our mechanism for jointly controlling the router's CPU and I/O bus, indispensable for a software router to support communication quality assurance mechanisms.

1.6 Related work

Cabrera et al. [1988] (in fact, they presented their study's results in July 1985) is the earliest work we found in the publicly accessible literature on the experimental evaluation of TCP/IP implementations. Their study was an ambitious one, whose objective was to assess the impact that different processors, network hardware interfaces, and Ethernets have on communication across machines, under various host and communication media load conditions. Their measurements highlighted the ultimate bounds on communication performance perceived by application programs. Moreover, they presented a detailed timing analysis of the dynamic behavior of the networking software. They studied TCP/IP's implementation within 4.2BSD when run by then state-of-the-art minicomputers attached to legacy Ethernets. Consequently, their study's results are no longer valid. Worse yet, they used the gprof(1) and kgmon(8) tools for profiling. These tools are supported only by software and, consequently, produce results that have limited accuracy when compared to results produced by performance-monitoring hardware counters, as we use. However complete their study was, they did not pursue building a system model, as we do.

Sanghi et al. [1990] is the earliest work we found in the publicly accessible literature on the experimental evaluation of TCP/IP implementations that uses software profiling. Their study is very narrow compared to ours, in the sense that its objective was only to determine the suitability of round-trip time estimators for TCP implementations. Papadopoulos and Gurudatta [1993] presented results of a more general study on TCP/IP implementation performance, obtained using software profiling. They studied TCP/IP's implementation within a 4.3BSD-derived operating system—SunOS. Like our work, their study's objective was to characterize the performance of the networking software. Unlike ours, however, their study was not aimed at producing a system model.

Kay and Pasquale [1996] (in fact, they presented their study's results in September 1994) conducted another TCP/IP implementation performance study.
Differently from previous work, their study was carried out at another granularity level; that is, they went inside the networking functions and analyzed how these functions spend their time—touching data, protocol-specific processing, memory buffer manipulation, error checking, data structure manipulation, and operating system overhead. Moreover, their study's objective was to guide a search for bottlenecks in achieving high throughput and low latency. Once again, they did not pursue building a system model.

Ramamurthy [1988] modeled the UNIX system using a queuing network model. However, his model characterizes the system's multitasking properties and therefore cannot be applied to study the networking software, which is governed by the software interrupt mechanism. Moreover, Ramamurthy's model was solved only for predicting mean values, something that is not enough when conducting performance analyses of today's networking software. What is required is a model capable of producing the complete probability function of operational parameters, so that analyses at several levels of detail may be performed.

Björkman and Gunningberg [1998] modeled an Internet protocols' implementation using a queuing network model; however, their model characterizes a user-space implementation (based on a parallelized version of the University of Arizona's x-kernel [Hutchinson and Peterson 1991]) and therefore disregards the software interrupt mechanism. Moreover, their model was aimed at predicting only system throughput (measured in packets per second) when the protocols are hosted by shared-memory multiprocessor systems. Besides, their study was aimed only at high-performance distributed computing, where it is considered that connections are always open with a steady stream of messages—that is, no retransmissions or other unusual events occur. All this prevents the use of their model for our intended applications.

Packet service disciplines and their associated performance issues have been widely studied in the context of bandwidth scheduling in packet-switched networks [Zhang 1995]. Arguably, several such disciplines now exist that are both fair—in terms of assuring access to link bandwidth in the presence of competing packet flows—and easy to implement efficiently, with both hardware and software. Recently, interest has arisen in mapping these packet service disciplines onto CPU scheduling for programmable and software routers. Qie et al. [2001] and Chen and Morris [2001], for software routers, and Pappu and Wolf [2001], for programmable routers, have put forward solutions that are focused on suitably distributing the workload of a router's processor. However, neither the former nor the latter consider the problem of I/O bus bandwidth scheduling. And this problem is important in the context of software routers, as we demonstrate.

Chiueh and Pradhan [2000] recognized both the suitability and the inherent limitations of using PC technology for implementing software routers. In order to overcome the performance limitations of PCs' I/O buses, and to construct high-speed routers, they have proposed using clusters of PCs. In this architecture, several PCs having at most one network interface card each are tightly coupled by means of a Myrinet system area network to form a software router with as many subnetwork attachments as computing nodes.
After solving all the internode communication, memory coherency, and routing table distribution problems—arguably not an easy task—a "clustered router" may be able to overcome the limitations of current PCs' I/O buses and not only provide high performance (in terms of packets per second) but also support QoS mechanisms. Our work, however, is aimed at supporting QoS mechanisms in clearly simpler and less expensive PC-based software routers.

Scottis, Krunz and Liu [1999] recognized the performance limitations of the omnipresent Peripheral Component Interconnect (PCI) PC I/O bus for supporting QoS mechanisms, and proposed an enhancement. In contrast to our software enhancement, theirs introduces a new bus arbitration protocol that has to be implemented in hardware. Moreover, their bus arbitration protocol is aimed at improving bus support for periodic and predictable real-time streams only, and this is clearly not suitable for Internet routers.

There is general agreement in the PC industry that the demands of current user applications are quickly overwhelming the shared parallel architecture and limited bandwidth of the various types of PCI buses. (Erlanger, L., editor, "High-performance busses and interconnects," http://www.pcmag.com/article2/0,4149,293663,00.asp, current as of 8 July 2002.) With this increasing demand and its lack of QoS, PCI has been due for a replacement in different system and application scenarios for more than a few years now. 3GIO, InfiniBand, RapidIO and HyperTransport are new technologies designed to improve I/O and device-to-device performance in a variety of system and application categories; however, not all are direct replacements for PCI. In this sense, 3GIO and InfiniBand may be considered direct candidates to succeed PCI. All these technologies have in common that they define packet-based, point-to-point serial connections with layered communications, which readily support QoS mechanisms. However, due to the large installed base of PCI equipment, it is expected that the PCI bus will remain in use for some years still. Consequently, we think our work is important.

Chapter 2  Internet protocols' BSD software implementation

2.1 Introduction

This chapter's objective is to understand the influence that operating system design and implementation techniques have over the performance of the Internet protocols' BSD software implementation. Later chapters discuss how to apply this knowledge for building a performance model of a personal computer-based software router.

The chapter is organized as follows. The first three sections set the conceptual framework for the document but may be skipped by a reader familiar with BSD's networking subsystem. Section 2.2 presents a brief overview of BSD's interprocess communication facility. Section 2.3 presents BSD's networking architecture. And BSD's software interrupt mechanism is presented in section 2.4. The following sections present the chief features, components and structures of the Internet protocols' BSD software implementation. These sections present several ideas and diagrams that are referenced in this document's later sections and should not be skipped. Section 2.5 presents the software implementation, while section 2.6 presents the run-time environment. Finally, section 2.7 presents brief descriptions of other systems' networking architectures and section 2.8 summarizes.
2.2 Interprocess communication in the BSD operating system

The BSD operating system [McKusick et al. 1996] is a flavor of UNIX [Ritchie and Thompson 1978], and historically, UNIX systems were weak in the area of interprocess communication [Wright and Stevens 1995]. Before the 4.2 release of the BSD operating system, the only standard interprocess communication facility was the pipe. The requirements of the Internet [Stevens 1994] research community resulted in a significant effort to address the lack of a comprehensive set of interprocess communication facilities in UNIX. (At the time 4.2BSD was being designed there was no global Internet but an experimental computer network sponsored by the United States of America's Defense Advanced Research Projects Agency.)

2.2.1 BSD's interprocess communication model

4.2BSD's interprocess communication facility was designed to provide a sufficiently general interface upon which distributed-computing applications—sharing of physically distributed computing resources, distributed parallel computing, computer supported telecommunications—could be constructed independently of the underlying communication protocols. This facility has endured and is present in the current 4.4 release. From now on, when referring to the BSD operating system, we mean the 4.2 release or any follow-on release like the current 4.4. While designing the interprocess-communication facilities that would support these goals, the developers identified the following requirements and developed unifying concepts for each:

• The system must support communication networks that use different sets of protocols, different naming conventions, different hardware, and so on. The notion of a communication domain was defined for these reasons.
• A unified abstraction for an endpoint of communication is needed that can be manipulated with a file descriptor. The socket is the abstract object from which messages are sent and received.
• The semantic aspects of communication must be made available to applications in a controlled and uniform way. So, all sockets are typed according to their communication semantics.
• Processes must be able to locate endpoints of communication so that they can rendezvous without being related. Hence, sockets can be named.

Figure 2.1 depicts the OMT object model [Rumbaugh et al. 1991] for these requirements.

2.2.2 Typical use of sockets

First, a socket must be created with the socket system call, which returns a file descriptor that is then used in later socket operations:

    s = socket(domain, type, protocol);
    int s, domain, type, protocol;

After a socket has been created, the next step depends on the type of socket being used. The most commonly used type of socket requires a connection before it can be used. The creation of a connection between two sockets usually requires that each socket have an address (name) bound to it. The address to be bound to a socket must be formulated in a socket address structure.
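For the Internet communication domain, for instance, that structure is a sockaddr_in. The following fragment is our own illustrative sketch, not code taken from the dissertation; the function name bind_inet_socket and the port number 7777 are arbitrary choices. The structure filled in here plays the role of the addr argument of the bind call whose general form is given next.

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <string.h>

    /* Sketch: formulate an Internet-domain address and bind it to socket s. */
    int
    bind_inet_socket(int s)
    {
            struct sockaddr_in sin;

            memset(&sin, 0, sizeof(sin));             /* clear the structure first */
            sin.sin_family = AF_INET;                 /* Internet communication domain */
            sin.sin_port = htons(7777);               /* port number, network byte order */
            sin.sin_addr.s_addr = htonl(INADDR_ANY);  /* accept on any local address */
            return bind(s, (struct sockaddr *)&sin, sizeof(sin));
    }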
    error = bind(s, addr, addrlen);
    int error, s, addrlen;
    struct sockaddr *addr;

A connection is initiated with a connect system call:

    error = connect(s, peeraddr, peeraddrlen);
    int error, s, peeraddrlen;
    struct sockaddr *peeraddr;

When a socket is to be used to wait for connection requests to arrive, the system call pair listen/accept is used instead:

    error = listen(s, backlog);
    int error, s, backlog;

Connections are then received, one at a time, with:

    snew = accept(s, peeraddr, peeraddrlen);
    int snew, s, peeraddrlen;
    struct sockaddr *peeraddr;

A variety of calls are available for transmitting and receiving data. The usual read and write system calls, as well as the newer send and recv system calls, can be used with sockets that are in a connected state. The sendto and recvfrom system calls are most useful for connectionless sockets, where the peer's address is specified with each transmitted message. Finally, the sendmsg and recvmsg system calls support the full interface to the interprocess-communication facilities. The shutdown system call terminates data transmission or reception at a socket. Sockets are discarded with the normal close system call.

[Figure 2.1—OMT object model for BSD IPC. The class diagram relates communication domains, the socket types (stream, datagram and raw sockets), protocols, the naming entities (socket, protocol, service, network and host names and the name scheme), and the processes, clients, servers, hosts, networks, services and communication channels that use them.]

2.3 BSD's networking architecture

BSD's networking architecture has two planes, as shown in Figure 2.2: the user plane and the memory management plane. The user plane defines a framework within which many communication domains may coexist and network services can be implemented. The memory management plane defines memory management policies and procedures that comply with the user plane's memory requirements. More on this below.

[Figure 2.2—BSD's two-plane networking architecture. The user plane is depicted with its layered structure, which is described in the following sections. Bold circles in the figure represent defined interfaces between planes and layers: A) Socket-to-Protocol, B) Protocol-to-Protocol, C) Protocol-to-Network Interface, and D) User Layer-to-Memory Management. Observe that this architecture implies that layers share the responsibility of taking care of the storage associated with transmitted data.]

2.3.1 Memory management plane

It is well known [McKusick et al. 1996; Wright and Stevens 1995] that the requirements placed by interprocess communication and network protocols on a memory management scheme tend to be substantially different from those of other parts of the operating system. Basically, network message processing requires attaching and/or detaching protocol headers and/or trailers to messages. Moreover, sometimes these headers' and trailers' sizes vary with the communication session's state; at other times the number of these protocol elements is, a priori, unknown. Consequently, a special-purpose memory management facility was created by the BSD developing team for the use of the interprocess communication and networking systems.
The memory management facilities revolve around a data structure called an mbuf. Mbufs, or memory buffers, are 128 bytes long, with 100 or 108 bytes of this space reserved for data storage. There are three sets of header fields that might be present in an mbuf and which are used for identification and management purposes. Multiple mbufs can be linked to form mbuf chains that hold an arbitrary quantity of data. For very large messages, the system can associate larger sections of data with an mbuf by referencing an external mbuf cluster from a private virtual memory area. Data is stored either in the internal data area of the mbuf or in the external cluster, but never in both.

2.3.2 User plane

The user plane, as said before, provides a framework within which many communication domains may coexist and network services can be implemented. Networking facilities are accessed through the socket abstraction. These facilities include:

• A structured interface to the socket layer.
• A consistent interface to hardware devices.
• Network-independent support for message routing.

The BSD developing team devised a pipelined implementation for the user plane with three vertically delimited stages or layers. As Figure 2.2 and Figure 2.3 show, these layers are the sockets layer, the protocols layer, and the network-interfaces layer. Jointly, the protocols layer and the network-interfaces layer are named the networking support. Basically, the sockets layer is a protocol-independent interface used by applications to access the networking support. The protocols layer contains the implementation of the communication domains supported by the system, where each communication domain may have its own internal structure. Last but not least, the network-interfaces layer is mainly concerned with driving the transmission media involved. Entities at different layers communicate through well-defined interfaces and their execution is decoupled by means of message queues, as shown in Figure 2.3. Concurrent access to these message queues is controlled by the software interrupt mechanism, as explained in the next section.

[Figure 2.3—BSD networking user plane's software organization. Labels in the figure: user processes; system calls; kernel; Socket layer; socket queues; Protocols layer (IP, ESP, AH, cryptographic algorithms); protocol queue (IP input); interface queues; Interfaces layer; software interrupt @ splnet (caused by interface layer); software interrupt @ splimp (caused by hardware-interrupt handler).]

2.4 The software interrupt mechanism and networking processing

Networking processing within the BSD operating system is pipelined and interrupt driven. In order to show how this works, let us describe the sequence of chief events that occur during message reception and transmission. If you feel lost during the first read, please keep an eye on Figure 2.3 during the second pass; it helps. During the following description, when we say "the system" we mean a computer system executing a BSD-derived operating system.

2.4.1 Message reception

When a network interface card captures a message from a communications link, it posts a hardware interrupt to the system's central processing unit.
Upon catching this interrupt—preempting any running application program and entering supervisor mode and the operating system kernel's address space—the system executes some network-interfaces layer task and marshals the message from the network interface card's local memory to a protocols layer mbuf queue in main memory. During this marshaling the system does any data-link protocol duties and determines to which communication domain the message is destined. Just after leaving the message in the selected protocol's mbuf queue, and before terminating the hardware interrupt execution context, the system posts a software interrupt addressed to the corresponding protocols layer task.

Considering that the arrived message is destined to a system's application program and that the addressed application has an opened socket, the system, upon catching the outstanding software interrupt, executes the corresponding protocols layer task and marshals the message from the protocol's mbuf queue to the addressed socket's mbuf queue. All protocols processing within the corresponding communication domain takes place in this software interrupt's context. Just after leaving the message in the addressed socket's mbuf queue, and before terminating the software interrupt execution context, the system flags for execution any application program that might be sleeping on the addressed socket, waiting for a message to arrive. When the system is finished with all the interrupt execution contexts and its scheduler schedules for execution the application program that just received the message, the system executes the corresponding sockets layer task and marshals the message from the socket's mbuf queue to the corresponding application's buffer in user address space. Afterwards, the system exits supervisor mode and the address space of the operating system's kernel and resumes the execution of the communicating application program.

Here, let us point out a performance detail of the previous description. The message marshaling between mbuf queues does not always imply a data copy operation. There are copy operations involved when marshaling messages between a network interface card's local memory and a protocol's mbuf queue, and between a socket's mbuf queue and an application program's buffer. But there is no data copy operation between a protocol's and a socket's mbuf queues. Here, only mbuf references—also known as pointers—are copied.

2.4.2 Message transmission

Message transmission processing may be initiated by several events. For instance, by an application program issuing the sendmsg—or similar—system call. But it can also be initiated when forwarding a message, when a protocol timer expires or when the system has deferred messages.

When an application program issues the sendmsg system call, giving a data buffer as one of the arguments (other arguments are, for instance, the communication domain identification and the destination socket's address), the system enters supervisor mode and the operating system's kernel address space and executes some sockets layer task. This task builds an mbuf appropriate to the selected communication domain and socket type and, considering that the given data buffer fits inside one mbuf, copies the contents of the given data buffer into the built mbuf's payload.
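As an illustration of that copy step, the following is a simplified sketch in the spirit of the kernel's sosend() routine—our own reconstruction, not code quoted from the dissertation; the function name copy_user_data is ours—for the case where the user data fits in a single packet-header mbuf:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>
    #include <sys/uio.h>

    /* Sketch: copy at most one mbuf's worth of user data into a freshly
     * allocated packet-header mbuf, as the sockets layer does for small writes. */
    static struct mbuf *
    copy_user_data(struct uio *uio)
    {
            struct mbuf *m;

            MGETHDR(m, M_WAIT, MT_DATA);               /* allocate a packet-header mbuf */
            m->m_len = min(uio->uio_resid, MHLEN);     /* only what fits internally */
            m->m_pkthdr.len = m->m_len;
            uiomove(mtod(m, caddr_t), (int)m->m_len, uio);  /* user-to-kernel copy */
            return (m);
    }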
In case that the communication channel protocol’s state allows the system to immediately transmit a message, the system executes the appropriate protocols layer task and marshals the message through the arbitrary protocol structure of the corresponding communication domain. Among other protocol-dependent tasks, the system here selects a communication link for transmitting the message out. Considering that the network interface card attached to the selected communications link is idle, the system executes the appropriate networkinterfaces layer task and marshals the message from the corresponding mbuf in main memory to the network interface card’s local memory. At this point the system hands over the message delivery’s responsibility to the network interface card. Observe that under the considered situation the system processes the message transmission in a single execution context—that of the communicating application—and no intermediary buffering is required. On the contrary, if for instance the system finds an addressed network interface card busy, the system would place the mbuf in the corresponding network interface’s mbuf queue and would defer the execution of the networkinterfaces layer task. For cases like this, network interface cards are built to produce a hardware interrupt not just when receiving a message but at the end of every busy period. Moreover, network interface card’s hardware interrupt handlers are built to always check for deferred message at the corresponding network interface’s output mbuf queue. When deferred messages are found, the system does whatever is required to transmit them out. Observe that in this case the message transmission is done in the execution context of the network interface card’s hardware interrupt. Another scenario happens if a communication channel protocol’s state impedes the system to immediately transmit a message. For instance, when a TCP connection’s transmission window is closed [Stevens, 1994]. In this case, the system would place the message’s mbuf in the corresponding socket’s mbuf queue, and defers the execution of PH.D. DISSERTATION OSCAR IVÁN LEPE ALDAMA 30 THE SOFTWARE INTERRUPT MECHANISM AND NETWORKING PROCESSING—2.4 the protocols layer task. Of course, a deferring protocol must have some built-in means for later transmitting or discarding any deferred message. For instance, TCP may open a connection’s transmission window after receiving one or more segments from a peer. Upon opening the transmission window, TCP will start transmitting as many deferred messages as possible as soon as possible—that is, just after finishing message reception. Observe that in this case the message transmission is done in the execution context of the protocols layer reception software interrupt. Also observe that when transmitting deferred messages at the protocols layer, the system may defer again at the networkinterfaces layer as explained above. Communications protocols generally defined timed operations that require the interchange of messages with peers, and thus require transmitting messages. For instance, TCP’s delayed acknowledge mechanism [Stevens 1994]. In this cases, protocols may relay on the BSD callout mechanism [McKusick et al, 1996] and request for the system to execute some task at predefined times. The BSD callout mechanism uses the system‘s real-time clock for scheduling the execution of any enlisted task. It arranges itself to issue software interrupts every time an enlisted task is required to execute. 
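As a concrete, hedged example of the callout facility just described, a protocol could schedule a timed transmission as sketched below. timeout() and the global tick rate hz are the classic 4.4BSD/FreeBSD interfaces (declared in <sys/systm.h> and <sys/kernel.h>); the protocol-specific names are invented for illustration.

    #include <sys/param.h>
    #include <sys/systm.h>        /* timeout()                              */
    #include <sys/kernel.h>       /* hz, the clock-tick rate                */

    static void hypo_proto_timer(void *arg);    /* hypothetical timed task  */

    static void
    hypo_proto_arm_timer(void *session)
    {
        /* Ask the callout mechanism to run hypo_proto_timer(session) about
         * 200 ms from now; a software interrupt will invoke it.            */
        timeout(hypo_proto_timer, session, hz / 5);
    }

    static void
    hypo_proto_timer(void *arg)
    {
        /* Runs in the callout software-interrupt context; this is where a
         * protocol would transmit a deferred message, e.g. a delayed ACK.  */
    }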
If the called-out task initiates the transmission of a networking message, this message transmission is done in the execution context of the callout mechanism's software interrupt. Once again, transmission hold-off may happen at the network-interfaces layer, as explained above.

Finally, let us consider the message-forwarding scenario. In this scenario some communications protocol—implemented at the protocols layer—is capable of forwarding messages; for instance, the Internet Protocol (IP) [Stevens 1994]. During message reception, a protocol like IP may find out that the received message is not addressed to the local system but to another system it knows how to reach by means of a routing table. In this case, the protocol will launch a message transmission task for the message being forwarded. Observe that this message transmission processing is done in the execution context of the protocols layer's reception software interrupt.

2.4.3 Interrupt priority levels
The BSD operating system assigns a priority level to each hardware and software interrupt handler. The ordering of the different priority levels means that an interrupt handler preempts the execution of any lower-priority one. One concern with these different priority levels is how to handle data structures shared between interrupt handlers executed at different priority levels. The BSD interprocess communication facility code is sprinkled with calls to the functions splimp and splnet. These two calls are always paired with a call to splx to return the processor to the previous priority level. The result of this synchronization mechanism is a sequence of events like the one depicted in Figure 2.4.
1) While a sockets layer task is executing at spl0, an Ethernet card receives a message and posts a hardware interrupt, causing a network-interfaces layer task—the Ethernet device driver—to execute at splimp. This interrupt preempts the sockets layer code.
2) While the Ethernet device driver is running, it places the received message into the appropriate protocols layer's input mbuf queue—for instance IP's—and schedules a software interrupt to occur at splnet. The software interrupt won't take effect immediately since the kernel is currently running at a higher priority level.
3) When the Ethernet device driver completes, the protocols layer executes at splnet.
4) A terminal device interrupt occurs—say the completion of a SLIP packet. It is handled immediately, preempting the protocols layer, since terminal processing's priority, at spltty, is higher than the protocols layer's.
5) The SLIP device driver places the received packet onto IP's input mbuf queue and schedules another software interrupt for the protocols layer.
6) When the SLIP device driver completes, the preempted protocols layer task continues at splnet and finishes processing the message received from the Ethernet device driver. Then, it processes the message received from the SLIP device driver. Only when IP's input mbuf queue is empty will the protocols layer task return control to whatever it preempted—the sockets layer task in this example.
7) The sockets layer task continues from where it was preempted.

Figure 2.4—Example of priority levels and kernel processing (a sockets layer task at spl0 is successively preempted by the Ethernet device driver at splimp, protocol processing (IP input plus TCP input) at splnet, and the SLIP device driver at spltty, following steps 1 through 7)
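Steps 1 and 2 above correspond to a well-known coding pattern in the drivers' input routines; a hedged sketch follows. splimp(), splx(), the IF_* macros, schednetisr() and the ipintrq queue are real 4.4BSD/FreeBSD primitives (the needed kernel headers, <sys/mbuf.h>, <net/if.h>, <net/netisr.h> and the IP header that declares ipintrq, are omitted here); the wrapper function itself is invented for illustration.

    /* Hand a received mbuf to the IP input queue and post the splnet
     * software interrupt, while briefly blocking other network drivers.    */
    static void
    deliver_to_ip(struct mbuf *m)
    {
        int s;

        s = splimp();                   /* raise to the drivers' level       */
        if (IF_QFULL(&ipintrq)) {
            IF_DROP(&ipintrq);          /* queue full: account for the drop  */
            m_freem(m);                 /* and discard the message           */
        } else {
            IF_ENQUEUE(&ipintrq, m);    /* queue for the protocols layer     */
            schednetisr(NETISR_IP);     /* schedule the splnet soft interrupt */
        }
        splx(s);                        /* return to the previous level      */
    }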
2.5 BSD implementation of the Internet protocols suite
Figure 2.5 shows control and data flow diagrams of the chief tasks that implement the Internet protocols suite within BSD. Furthermore, it shows their control and data associations with chief tasks at both the sockets layer and the network-interfaces layer. Within the 4.4BSD-lite source code distribution, the files implementing the Internet protocols suite are located in the sys/netinet subdirectory. On the other hand, the files implementing the sockets layer are located in the sys/kern subdirectory. The files implementing the network-interfaces layer are scattered among a few subdirectories. The tasks implementing general data-link protocols, such as Ethernet, the address resolution protocol or the point-to-point protocol, are located in the sys/net subdirectory, while the tasks implementing hardware drivers are located in hardware-dependent subdirectories, such as the sys/i386/isa or the sys/pci subdirectories.

As can be seen in Figure 2.5, the protocols implementation, in general, provides output and input tasks per protocol. In addition, the IP protocol has a special ip_forward task. It can also be seen that the IP protocol does not have an input task. Instead, the implementation comes with an ipintr task. The fact that IP input processing is started by a software interrupt may be the cause of this apparent exception to the general rule. (The FreeBSD operating system drops the ipintr task in favor of an ip_input task.) Observe that the figure depicts all the control and data flows corresponding to the message reception and message transmission scenarios described in the previous section.

In order to complete the description, let us note some facts about the network-interfaces layer. The tasks shown at the bottom half of the layer depict hardware-dependent tasks. The names depicted, Xintr, Xread and Xstart, are not actual task names but name templates. For building actual task names the capital "X" is substituted with the name of a hardware device. For example, the FreeBSD source code distribution has xlintr, xlread and xlstart for the xl device driver, which is the device driver used for 3COM's 3C900 and 3C905 families of PCI/Fast-Ethernet network interface cards.

Figure 2.5—BSD implementation of the Internet protocol suite. Only chief tasks, message queues and associations are shown. Please note that some control flow arrows are sourced at the bounds of the squares delimiting the implementation layers. This denotes that a task is executed after an external event, such as an interrupt or a CPU scheduler event.
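To make the ipintr task concrete, here is a hedged sketch in the spirit of FreeBSD's ipintr(): the software interrupt drains the ipintrq queue, raising the priority level only while manipulating the queue that the drivers also touch. The primitives shown exist in 4.4BSD/FreeBSD; initialization, statistics and the required kernel headers are omitted.

    /* Runs at splnet as the NETISR_IP software interrupt handler.           */
    static void
    ipintr_sketch(void)
    {
        struct mbuf *m;
        int s;

        for (;;) {
            s = splimp();               /* protect ipintrq from the drivers  */
            IF_DEQUEUE(&ipintrq, m);
            splx(s);                    /* back to splnet for protocol work  */
            if (m == NULL)
                return;                 /* queue drained                     */
            ip_input(m);                /* per-datagram IP processing        */
        }
    }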
2.6 Run-time environment: the host's hardware
The BSD operating system was devised to run on computing hardware with an organization much like the one shown in Figure 2.6. This computing hardware organization is widely used for building personal computers, low-end servers and workstations, or high-end embedded systems. The shown organization is an instance of the classical stored-program computer architecture with the following features [Hennessy and Patterson 1995]:
• A single central processing unit
• A four-level, hierarchic memory (not shown)
• A two-tier, hierarchic bus
• Interrupt-driven input/output processing (not shown)
• Programmable or direct memory access network interface cards

Figure 2.6—Chief components in a general-purpose computing hardware (the CPU, main memory and bridge sit on the system bus; network interface cards and other devices sit on the I/O bus)

2.6.1 The central processing unit and the memory hierarchy
Nowadays, personal computers and similar computing systems are provisioned with high-performance microprocessors. These microprocessors in general leverage the following technologies: very short operation cycle periods, pipelining, multiple instruction issue, out-of-order and speculative execution, data prefetching, and trace caches. In order to sustain a high operation throughput, this kind of microprocessor requires very fast access to instructions and data. Unfortunately, current memory technology lags behind microprocessor technology in its performance/price ratio; that is, low-latency memory components have to be small in order to remain economically feasible. Consequently, personal computers and similar computing systems—but also other computing systems using high-performance microprocessors—are equipped with hierarchically organized memory. Ever faster and thus smaller memory components are placed lower in the hierarchy and thus closer to the microprocessor. Several caching techniques are used for mapping large address spaces onto the smaller and faster memory components, which in consequence are named cache memories [Hennessy and Patterson 1995]. These caching techniques mainly consist of replacement policies that swap out of the cache memory those sections of a computer program's address space (named address space pages) that are not expected to be used in the near future, in favor of active ones. The caching techniques also determine what to do with the swapped-out address space sections—they may or may not be stored in the memory component at the next higher level, considering that the computer program's complete address space is always resident in the memory component at the top of the hierarchy.

Another important aspect of the microprocessor-memory relationship is the wire latency; that is, the time required for a data signal to travel from the output ports of a memory component to the input ports of a microprocessor, or vice versa. Nowadays, the lowest wire latencies are obtained when placing a microprocessor and a cache memory on the same chip. Latency grows when these components are merely placed within a single package, and it grows further when the cache memory is part of the main memory component and thus sits on the opposite side of the system bus from the microprocessor. Let us cite some related performance numbers of an example microprocessor.
The Intel’s Pentium 4 microprocessor is available at speeds ranging from 1.6 to 2.4 GHz. It has a pipelined, multiple issue, speculative, and out-of-order engine. It has a 20 KB, onchip, separated data/instruction level-one cache, whose wire latency is estimated at two clock cycles. And it has a 512 or 256 KB on-chip and unified level-two cache, whose wire latency is estimated at 10 clock cycles. 2.6.2 The busses organization For reasons not relevant to this discussion, the use of a hierarchical organization of busses is attractive. Nowadays, personal computers and the like computing systems come with two-tier busses. One bus, the so named system bus, links the central processing unit and the main memory through a very fast point-to-point bus. The second bus, named the input/output bus, links all input/output components or periphery devices, like network interface cards and video or disk controllers, through a relatively slower multidrop input/output bus. For quantitatively putting these busses on perspective, let us note some performance numbers of two widely deployed busses: the system bus of Intel’s Pentium 4 microprocessor and the almost omnipresent Peripheral Component Interconnect input/output bus. [Shanley and Anderson 2000] The specification for the Pentium 4’s system bus states a speed operation of 400 MHz and a theoretical maximum throughput of 3.2 Gigabytes per second. (Here, 1 Gigabytes equals 10^9 bytes.) On the other hand, the PCI bus specification states a selection of path widths between 32 and 64 bits and a selection of speed operations between 33 and 66 MHz. Consequently, the theoretical maximum throughput for the PCI bus stays between 132 and 528 Mbytes per second for the 33-MHz/32-bit PCI and the 66-MHz/64-bit PCI, respectively. (Here, 1 Mbytes equals 10^6 bytes.) PH.D. DISSERTATION OSCAR IVÁN LEPE ALDAMA RUN-TIME ENVIRONMENT: THE HOST’S HARDWARE—2.6 36 2.6.3 The input/output bus’ arbitration scheme One more important aspect to mention with respect to the input/output bus is its arbitration scheme. Because the input/output bus is a multi-drop bus, its path is shared by all components attached to it and thus some access protocol is required. The omnipresent PCI bus [Shanley and Anderson 2000] uses a set of signals for implementing a use-by-request master-slave arbitration scheme. These signals are emitted through a set of wires separated from the address/data wires. There is a request/grant pair of wires for each bus attachment and a set of shared wires for signaling an initiatorready event, (FRAME and IRDY) a target-ready event, (TRDY and DEVSEL) and for issuing commands (three wires). A periphery device attached to the PCI bus (device for short) that wants to transfer some data, requests the PCI bus mastership by emitting a request signal to the PCI bus arbiter. (Bus arbiter for short.) The bus arbiter grants the bus mastership by emitting a grant signal to a requesting device. A granted device becomes the bus master and drives the data transfer by addressing a slave device and issuing to it read or writes commands. A device may request bus mastership and the bus arbiter may grant it at any time, even when other device is currently performing a bus transaction, in what is called “hidden bus arbitration.” This seams a natural way to improve performance. However, devices may experience reduced performance or malfunctioning if bus masters are preempted too quickly. Next subsection discusses this and other issues regarding performance and latency. 
The PCI bus specification does not define how the bus arbiter should behave when receiving simultaneous requests. The PCI 2.1 specification only states that the arbiter is required to implement a fairness algorithm to avoid deadlocks. Generally, some kind of bi-level round-robin policy is implemented. Under this policy, devices are separated into two groups: a fast-access group and a slow-access group. The bus arbiter rotates grants through the fast-access group, allowing one grant to the slow-access group at the end of each cycle. Grants for slow-access devices are also rotated. Figure 2.7 depicts this policy.

Figure 2.7—Example PCI arbitration algorithm (fast-access group: A, B, C; slow-access group: a, b, c; example grant sequence: A B a A B b A B c)

2.6.4 PCI hidden bus arbitration's influence on latency
PCI bus masters should always use burst transfers to transfer blocks of data between themselves and target devices. If a bus master is in the midst of a burst transaction and the bus arbiter removes its grant signal, this indicates that the bus arbiter has detected a request from another bus master and is granting bus mastership for the next transaction to the other device. In other words, the current bus master has been preempted. Due to PCI's hidden bus arbitration this could happen at any moment, even one bus cycle before the current bus master has initiated its transaction. Evidently this hampers PCI's support for burst transactions and leads to poor performance. In order to avoid this, the PCI 2.1 specification mandates the use of a master latency timer per PCI device. The value contained in this latency timer defines the minimum amount of time that a bus master is permitted to retain bus mastership. Therefore, a current bus master retains bus mastership until either it completes its burst transaction or its latency timer expires. Note that, independently of the latency timer, a PCI device must be capable of managing bus transaction preemption; that is, it must be capable of "remembering" the state of a transaction so it may continue where it left off.

The latency timer is implemented as a configuration register in a PCI device's configuration space. It is either initialized by the system's configuration software at startup or contains a hardwired value. It may equal zero, in which case the device can only enforce single-data-phase transactions. Configuration software computes latency timer values for PCI devices not having them hardwired, from its knowledge of the bus speed and each PCI device's target value, which is stored in another PCI configuration register.

2.6.5 Network interface card's system interface
There are two different techniques for interfacing a computer system with periphery devices like network interface cards. If using the programmable input/output technique, a periphery device interchanges data between its local memory and the system's main memory by means of a program executed by the central processing unit. This program issues either input/output or memory instructions that read or write data from or to particular main memory locations. These locations were previously allocated and initialized by the system's configuration software at startup. The periphery device's and motherboard's organizations determine the use of either input/output or memory instructions. When using this technique, periphery devices interrupt the central processing unit when they want to initiate a data interchange.
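A minimal programmed-input/output sketch follows, assuming an x86 host and a hypothetical card that exposes a status register and a data window through I/O ports; the port offsets and bit layout are invented, and only inl() (from FreeBSD's <machine/cpufunc.h>) is a real primitive. The point is that the central processing unit itself moves every word across the input/output bus.

    #include <sys/types.h>
    #include <machine/cpufunc.h>         /* inl(): x86 port input            */

    #define NIC_IOBASE 0x300             /* hypothetical I/O base address    */
    #define NIC_CSR    (NIC_IOBASE + 0)  /* hypothetical status register     */
    #define NIC_DATA   (NIC_IOBASE + 4)  /* hypothetical data window         */
    #define CSR_PKT_READY 0x1            /* hypothetical "packet ready" bit  */

    /* Copy nwords 32-bit words from the card to buf, one port read each.    */
    static void
    pio_read_packet(u_int32_t *buf, int nwords)
    {
        int i;

        while ((inl(NIC_CSR) & CSR_PKT_READY) == 0)
            ;                            /* busy-wait for a received packet  */
        for (i = 0; i < nwords; i++)
            buf[i] = inl(NIC_DATA);      /* the CPU performs every bus access */
    }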
With the direct memory access (DMA) technique, the data interchange is carried out without the central processing unit intervention. Instead, a DMA periphery device uses a pair of specialized electronic engines for performing the data interchange with the system’s main memory. One engine is part of the same periphery device while the other is part of the bridge chipset; see Figure 2.6. Evidently, the input/output bus must support DMA transactions. In a DMA transaction, one engine assumes the bus master role and issues read or write commands; the other engine’s role is as servant and follows commands. Generally, the DMA engine at the bridge chipset may assume both roles. PH.D. DISSERTATION OSCAR IVÁN LEPE ALDAMA RUN-TIME ENVIRONMENT: THE HOST’S HARDWARE—2.6 38 When incorporating a master DMA engine, a periphery device interrupts the central processing unit after finishing a data interchange. It is important to note that DMA engines do not allocate nor initialize the main memory’s locations from or to where data is read or written. Instead, the corresponding device driver is responsible of that and somehow communicates the location’s addresses to the master DMA engine. Next subsection further explains this. Periphery devices’ system interface may incorporate both previously described techniques. For instance, they may relay on programmable input/output for setup and performance statistics gathering tasks and on DMA for input/output data interchange. 2.6.6 Main memory allocation for direct memory access network interface cards Generally, a DMA capable network interface card supports the usage of a linked list of message buffers, mbufs, for data interchange with main memory; see Figure 2.8. During startup, the corresponding device driver builds two of this mbufs lists, one for handling packets exiting the system, named the transmit channel, and the other for handling packet entering it, named the receive channel. Mbufs in these lists are wrapped with descriptors that hold additional list management information, including the mbuf’s main memory start address and size. Network interface cards maintain a local copy of the current mbuf’s list descriptor. They use the list descriptor’s data to marshal DMA transfers. A network interface card may get list descriptors either autonomously, by DMA, or by device driver command, by programmable input/output code. The method used depends on context as explained in the next paragraphs. Generally, a list descriptor includes a “next element” field. If a network interface card supports it, it uses this field to fetch the next current mbuf’s list descriptor. This field is set to zero to instruct the network interface card to stop doing DMA through a channel. Transmit channel. At system startup, transmit channels are empty and, consequently, device drivers clear—writing a zero—network interface cards’ local copy of the current mbuf descriptor. When there is a network message to transmit, a device driver queues the corresponding mbuf to a transmit channel and does whatever necessary so the appropriate network interface card gets its copy of the new current mbuf’s list descriptor. This means that a device driver either copies the list descriptor by programmable input/output code, or instructs the network interface card to DMA copy it. Advanced DMA network interface cards are programmed to periodically poll transmit channels looking for new elements, simplifying the device driver task. 
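A hedged sketch of the transmit-channel bookkeeping just described follows; it also anticipates the "message processed" completion walk discussed in the next paragraphs. The descriptor layout, ring size and field names are invented for illustration (every real controller defines its own), while m_freem() is the standard BSD primitive for releasing an mbuf chain.

    #include <sys/types.h>

    struct mbuf;                            /* opaque here                     */
    void m_freem(struct mbuf *);            /* normally from <sys/mbuf.h>      */

    #define NTXDESC 64

    struct dma_desc {                       /* shared with the card by DMA     */
        u_int32_t buf_addr;                 /* physical address of mbuf data   */
        u_int32_t buf_len;                  /* number of bytes to transfer     */
        u_int32_t next;                     /* physical address of the next
                                               descriptor; zero stops the DMA  */
        u_int32_t status;                   /* DESC_DONE = "message processed" */
    };
    #define DESC_DONE 0x1

    struct tx_channel {
        struct dma_desc ring[NTXDESC];      /* the transmit channel itself     */
        struct mbuf    *mbufs[NTXDESC];     /* driver-private back pointers    */
        int             head, tail;         /* oldest pending / next free slot */
    };

    /* End-of-transmission walk: dequeue every element the card has marked.   */
    static void
    tx_channel_reclaim(struct tx_channel *c)
    {
        while (c->head != c->tail && (c->ring[c->head].status & DESC_DONE)) {
            m_freem(c->mbufs[c->head]);     /* release the transmitted mbuf    */
            c->mbufs[c->head] = NULL;
            c->head = (c->head + 1) % NTXDESC;
        }
    }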
Generally, list descriptors have a "message processed" field, which is required for transmit channel operation: after a network interface card DMA-copies a message from the transmit channel, it sets the "message processed" field and, after transmitting the message through the data link, notifies the appropriate device driver of a message transmission. (Message transmission notifications may be batched for improved performance.) When acknowledging an end-of-transmission notification, a device driver will walk the transmit channel, dequeuing each list element that has its "message processed" field set.

Receive channel. At system startup, device drivers provide receive channels with a predefined number of empty mbufs. Naturally, this number is a trade-off between channel-overrun probability and memory wastage, which in turn depends on the difference in operating speed between the host computer and the data link. Continuing with system startup, device drivers do whatever is necessary so that network interface cards get their copy of the new current mbuf's list descriptor of the appropriate receive channel. This means that device drivers either copy the list descriptor by programmable input/output code or instruct network interface cards to DMA-copy it. It may also happen that device drivers do not have to signal network interface cards because the latter are programmed to periodically poll receive channels looking for new elements. After receiving and DMA-copying one or more network messages, a network interface card notifies the appropriate device driver of a message reception. When acknowledging a reception notification, a device driver will walk the receive channel, dequeuing and processing each list element that has its "message size" field greater than zero. Moreover, a device driver must provide its receive channel with more empty mbufs as soon as possible to keep the corresponding network interface card from stalling, a situation that may result in network message losses.

Figure 2.8—Main memory allocation for device drivers of network interface cards supporting direct memory access: a) transmit channel; b) receive channel (in both cases the device driver queues and dequeues descriptor-wrapped mbufs in main memory while the card keeps a local copy of the current descriptor and moves the data by DMA)

2.7 Other systems' networking architectures
BSD-like networking architectures have been known to have good and bad qualities almost since their inception [Clark 1982]. While communications links worked at relatively low speeds, the trade-off between modularity and efficiency was positive. With the advent of multi-megabit data communication technologies this stopped holding true. Worse yet, for some ten years now networking application programs have not been improving performance in proportion to central processing unit and communication link speeds.
Others [Abbot and Peterson 1993; Coulson and Blair 1995; Druschel and Banga 1996; Druschel and Peterson 1993; Eicken et al. 1992; Geist and Westall 1998; Hutchinson and Peterson 1991; Mosberger and Peterson 1996] have pointed that operating system overheads and networking software not exploiting cache memory features cause the problem. These same people have proposed new networking architectures to improve overall networking application programs’ performance. Strikingly, although these new networking architectures are relatively old, none have been deployed in production systems. Arguably, the reason is that most of these networking architectures required large changes in networking application programs, at best. In the worst case, they also require changes in communication protocols’ design and implementation. In this section we briefly explore post-BSD networking architectures. We think this is interesting in the context of this document because we believe that these relatively new networking architectures have several similarities with the BSD networking architecture. Consequently, the methodology defined in this document may easily be applied for studying the performance of these other systems. 2.7.1 Active Messages [Eicken et al. 1992] An active message is a message that incorporates the name of the remote procedure that will process it. When an active message arrives to the system, the operating system does not have to buffer the data because it can learn from the active message the name of the software module where the data goes. Like traditional communications protocol messages, active messages are encapsulated. Differently form traditional protocol layers, each software layer can learn the next procedure that will process the message and directly call it to run. This can avoid copy memory operations and reduce central processing unit context switches. Unfortunately, this requires a network wide naming space for procedures that is not standard in current protocol specifications. Fast Sockets [Rodrigues, Anderson and Culler 1997] is a protocol stack with active messages that implements a BSD socket like network application program interface and can interoperate with legacy TCP/IP systems. However, this interoperation is limited. 2.7.2 Integrated Layer Processing [Abbot and Peterson 1993] Integrated Layer Processing (ILP) reduces the copy memory operations and creates a running pipeline of layers’ code by means of a proposed dynamic code hooking. This dynamic hooking, as showed in Figure 2.9, allows integrally running independently constructed layers’ code and eliminates interlayer buffering. The result is improved performance, due to better cache behavior, less copy memory operations and OSCAR IVÁN LEPE ALDAMA PH.D. DISSERTATION 2.7—OTHER SYSTEM’S NETWORKING ARCHITECTURES 41 less central processing unit context switches, all without scarifying modularity. A brief description of this technique follows. Dynamic code hooking requires that software modules implement a special interface. Also, the hooked modules have to agree on the data type that they process, i.e., a machine word. This means that ILP cannot be applied to legacy protocol stacks but new protocols can be designed to meet the interface and data specification. Basically, the module interface is a mixed form of code-in-lining and loop unrolling with an interlayer communication mechanism implemented with central processing unit’s registers. (this can be done only if all modules process data one word at a time.) 
That is, the programmer must implement each module as a loop that processes incoming data a word at a time. Inside the loop, the programmer must place an explicit hook that the runtime environment will use to dynamically link the next layer's code and transfer data. The first and last modules in the pipeline are special because they read/write information from/to a buffer in memory.

There is another restriction to ILP. Protocol metadata is encapsulated; that is, one layer's header is another one's data. Furthermore, one layer's header could be nonexistent for a third one. So, for this technique to work, the integrated processing can only be applied to the part of a message that all layers agree exists and has the same meaning; that is, the user application data. This means another change in protocol specification; protocol processing must be divided into three parts: (1) protocol initialization, (2) data processing and (3) protocol consolidation. The first and third parts implement protocol control, so they process control data in message headers and/or trailers. Those parts have to be run in the traditional serial order. Only the data processing parts can be integrated. This is shown in Figure 2.9b.

Figure 2.9—Integrated layering processing [Abbot and Peterson 1993]: a) separate read/write loops per layer versus a single integrated loop; b) per-layer initial and final stages surrounding one integrated data manipulation stage

2.7.3 Application Device Channels [Druschel 1996]
This technique improves overall telematic system performance by allowing common telecommunication operations to bypass the operating system's address space; that is, send and receive data operations are done by layered protocol code in the user application's address space. System security is preserved because connection establishment and tear-down are still controlled by protocol code within the operating system's kernel.

Figure 2.10 shows a typical scenario for application device channels (ADCs). There, it can be seen that an ADC is formed by a pair of code stubs that cooperate to send and receive messages to and from the network. One stub is at the adapter driver and the other at the user application. The user-level layered protocols can process messages using single-address-space procedure calls to improve overall performance. Also, because the operating system's scheduler is not aware of the layered structure of user-level protocols, it is unlikely to interleave their execution with that of other processes, and overall performance is improved. Moreover, because there is no need for general-purpose protocols, user-level protocols can easily be optimized to meet user application needs, further improving performance. Finally, there is still a need to implement some protocol code inside the kernel because the operating system controls ADC allocation and connection establishment and tear-down.

Figure 2.10—Application Device Channels [Druschel 1996]

2.7.4 Microkernel operating systems' extensions for improved networking [Coulson et al.
1994; Coulson and Blair 1995] Even though mechanism like Application Device Channels can exploit the notion of user-level resource management to improve traditional telematic system’s performance, other requirements of modern telematic systems cannot be met. Telematic systems like distributed multimedia, loosely couple parallel computing and wide-area network management, require real time behavior and end-to-end communications quality assurance mechanisms, better known as QoS mechanisms. While legacy operating systems cannot provide these requirements [Vahalia 1996], the so named microkernel operating systems can [Liedtke 1996]. Furthermore, microkernel operating systems lend themselves to user-level resource management and can have multiple personalities that permit concurrently run modern and legacy telematic systems. Under this trend, the SUMO project at Lancaster University (http://www.comp.lancs.ac.uk/computing/research/sumo/sumo.html, current as 2 July 2002) have extend a microkernel OS to support an end-to-end quality of service architecture over ATM networks for multimedia distributed computing. Figure 2.11 shows the microkernel SUMO extensions. SUMO extends the basic microkernel concepts with rtports, rthandlers, QoS controlled connections, and QoS handlers. This set of abstractions promote a event driven style of programming that, with the help of a splitlevel central processing unit scheduling mechanism; a QoS external memory mapper; a connection oriented network actor; and a flow management actor, allows the implementation of QoS controlled multimedia distributed systems. Figure 2.11—SUMO extensions to a microkernel operating system [Coulson et al. 1994; Coulson and Blair 1995] PH.D. DISSERTATION OSCAR IVÁN LEPE ALDAMA 44 OTHER SYSTEM’S NETWORKING ARCHITECTURES—2.7 2.7.5 Communications oriented operating systems [Mosberger and Peterson 1996] While the techniques just described look for better telematic system’s performance in computation-oriented operating systems, the Scout project at Princeton University (formerly at University of Arizona) is aimed at building a communication-oriented operating system. This project is motivated by the fact that telematic systems workloads have only a low percentage of computations. Here, communication operations are the common case. Just as computation oriented operating systems extend the use of the processor by means of software abstractions like processes or threads, the Scout operating system extend the use of physical telecommunication channels by means of a software abstraction called “path”. This path is the next step consequence of all the previously described techniques and can be seen as the virtual trail drawn by a message that travels through a layered system from a source module to a sink one; see Figure 2.12. This path has a pair of buffers, one at each end, and a transformation rule. The buffers are used the regular way; the transformation rule represents the set of operations that act over each message that travels through the path. Within Scout, all resource management and allocation is done on behalf of a path. So, it is possible to obtain the following benefits: (1) early work segregation for better resource management, (2) allow central processing unit scheduling for the hole path which improves performance, (3) allow easy implementation of admission control mechanism and (4) early work discard for reduced waste of resources. 
Figure 2.12—Making paths explicit in the Scout operating system [Mosberger and Peterson 1996]

2.7.6 Network processors
Although the network processor concept is not a networking architecture solution to the problem of better telematic systems, we decided to include a brief mention of it because it is the current hot topic in software-programmable hardware for communications systems. (We got 15 references after a search on IEEE Xplore for the pattern "'network processors' in <ti> not 'neural network' in <ti>" restricted to publications from 1999 onwards. Besides, see Geppert, L., Editor, "The new chips on the block," IEEE Spectrum, January 2001, p. 66-68.)

Network processors are microprocessors specialized for executing communications software. They exploit communications software's intrinsic parallelism at several levels: data, task, thread and instruction [Verdú et al. 2002]. Data-level parallelism (DLP) exists because, in general, the processing of one packet is independent of that of others—previous or subsequent. Vector processors and SIMD chip multiprocessors are known to effectively leverage this kind of parallelism. They are also known to require processing regularity, however, to avoid the load imbalance that hampers performance. Communications software does not exhibit enough processing regularity—in general, the number of instructions required to process one packet has no correlation with that of others. One workaround for this problem exploits thread-level parallelism (TLP) in conjunction with DLP. The idea is to simultaneously maintain several execution contexts within the microprocessor, so that if a microprocessor's execution unit stalls for whatever reason (the address resolution is not complete, the forwarding data has to be fetched from main memory, or the current communication session's state impedes further processing), the hardware may automatically switch execution contexts and proceed to process another thread—that is, another packet. This technique, whose software counterpart is multitasking, is known as simultaneous multithreading [Eggers et al. 1997].

2.8 Summary
• Internet protocols' BSD implementation is networking's de facto standard. In some way or another, most available systems derive their structure, concepts, and/or code from it.
• Networking processing within a BSD system is pipelined and governed by the software interrupt mechanism.
• Message queues are used by a BSD system for decoupling processing between the networking pipeline's stages.
• The software interrupt mechanism defines preemptive interrupt levels.
• All of the above suggests to us that the widely used, single-queue performance models of software routers incur significant error. We further discuss this in chapter 3, when proposing a queuing network performance model for these systems.
• Networking processing in PC-based telematic systems has some random variability due to microprocessor features like pipelines, multiple instruction issue, out-of-order and speculative execution, data prefetching, and trace caches. We take this into account when defining a characterization process for these systems, as discussed in chapter 3.
• The PCI I/O bus uses a simple bi-level round-robin policy for bus arbitration. We will show in chapter 3 that this may keep a PC-based software router from sustaining system-wide QoS behavior, and in chapter 4 we propose a solution to this problem.
•
Devices and device drivers supporting DMA transactions use a pair of message buffer lists, called receive and transmit channels, which may be used to implement a credit-based flow-control mechanism for fairly sharing I/O bus bandwidth, as will be discussed later in chapter 4. OSCAR IVÁN LEPE ALDAMA PH.D. DISSERTATION MODELING TCP/IP SOFTWARE IMPLEMENTATION PERFORMANCE AND ITS APPLICATION FOR SOFTWARE ROUTERS 47 Chapter 3 Characterizing and modeling a personal computer-based software router 3.1 Introduction This chapter presents our experiences building and conducting the parameterization of performance models for a personal computer-based software IP router and IPsec security gateway; that is, machines built upon off-the-self personal computer technology. The resulting models are open multiclass priority networks of queues that we solved by simulation. While these models are not particularly novel from the system modeling point of view, in our opinion, it is an interesting result to show that such models can estimate, with high accuracy, not just average performance-numbers but the complete probability distribution function of packet latency, allowing performance analysis at several levels of detail. The validity and accuracy of the queuing network models has been established by contrasting its packet latency predictions in both, time and probability spaces. Moreover, we introduced into the validation analysis the predictions of a software IP router’s single, first-come, first-served queue model. We did this for quantitatively assessing the advantages of the more complex queuing network model with respect to the simpler and widely used but not so accurate, as here shown, single queue model, under the considered scenario that the software IP router’s CPU is the system bottleneck and not the communications links. The single queue model was also solved by simulation. The queuing network models were successfully parameterized with respect to central processing unit speed, memory technology, packet size, routing table size, and input/output bus speed. Results reveal unapparent but important performance trends for software IP router design. PH.D. DISSERTATION OSCAR IVÁN LEPE ALDAMA 48 SYSTEM MODELING—3.2 This chapter is organized as follows. The first two sections set the background. Section 3.2 briefly discuss about trade-offs in system modeling and puts in perspective the appropriateness of networking software’s single queue models with respect to queuing network models. Section 3.3 presents personal computer-based IP software routers’ chief technological and performance issues. Moreover, it puts this software IP router technology in perspective when comparing it with others. Following the background sections, section 3.4 presents the case on queuing network modeling for personal computer-based IP software routers and describes an example model. Furthermore, it explains how to modify or extend the example model for modeling such routers with different configurations. Then, section 3.5 describes the laborious system characterization process and the implications that this process’s results had on system modeling. Section 3.6 presents validation results for the example model, and section 3.7 argues about its parameterization. After results provided by this parameterization, section 3.7 also presents interesting software routers’ performance trends. 
Finally, section 3.8 presents example uses of personal computer-based software IP router queuing network models for capacity planning and as uniform experimental test beds, and section 3.9 summarizes the chapter. 3.2 System modeling Evidently, performance models of computer systems are important for researching and developing new ideas. These performance models are commonly built from a spectrum of techniques as those shown in Figure 3.1. A researcher or engineer may use hardware prototyping, hardware simulators, queuing networks, simple queues or simple math. Evidently, there are tradeoffs to consider when choosing a technique. Complexity, ease to reason with, and obtained results’ relevance are of chief concern. Clearly, hardware prototyping is the technique that can give results with greater relevance. However, it is also the most complex technique and it is difficult to reason with—it is not ease to see how a component’s variation influences others or the whole system. On the other side of the spectrum, single-queue theory is a simple technique that is easy to reason with but gives results of limited relevance or scope. Networks of queues are important models of multiprogrammed and time-shared computer systems. However, these models have not been used for performance modeling of computer networking software. Instead, simple queues are generally used for modeling network nodes implementing networking software. Figure 3.1—A spectrum of performance modeling techniques Hardware Prototype Hardware Simulator Queuing Network FIFO queue Simple Calculations Target Application and OS Hardware Model SD PRO Fetch Predictor Pipeline Perf Core Caches Simulation Kernel Professiona l Workstation 6000 Target ISA Target I/O Interface Host Plataform Complexity / Time cost Easy to reason with Relevance of given results OSCAR IVÁN LEPE ALDAMA PH.D. DISSERTATION 3.3—PERSONAL COMPUTER-BASED SOFTWARE ROUTERS 49 3.3 Personal computer-based software routers 3.3.1 Routers’ rudiments A router is a machine capable of conducting packet switching or layer 3 forwarding. Within the Internet realm, routers forward Internet Protocol (IP) datagrams and bind the subnets that form the Internet. Consequently, IP routers’ performance directly influences packet’s end-to-end throughput, latency and jitter within the Internet. In general, in order to perform its chief function, routers do several tasks [McKeown 2001]; see Figure 3.2. These tasks have different comply, urgency, and complexity levels, and are done at different time scales. For instance, routing is a complex task (mainly due to the size of the data it processes and because this data is stored distributively) that is done every time there is a change in network topology and may take up to few hundred seconds to comply [Labovitz et al 2001]. On the other hand, packet classification has to be done every time a packet arrives, and consequently, should comply at “wire speeds.” Where wire speed means, for example, around five microseconds for Fast Ethernet [Quinn and Russell 1997] and around two microseconds for Gigabit Ethernet (when operating half duplex and with packet bursting enabled) [Stephen 1998]. In order to cope with this operational diversity, routers have, in general, a two-plane architecture; see Figure 3.2. 3.3.2 The case for software routers Evidently, a router’s data plane implementation mostly determines a router’s performance influence on packet throughput, latency and jitter. 
Naturally, different levels of router’s features, like raw packet switching performance, support for extended functionality, implementation cost or implementation flexibility, may be set using different implementation technologies for the data plane. A router’s data plane may be implemented using the following technologies: Figure 3.2—McKeown’s [2001] router architecture Control Plane Routing Admission Control Protocols Reservation Table Data Plane Packet Classification Policing & Access Control Ingress PH.D. DISSERTATION per-packet processing Switching Fwding table lookup Swting Interconnect Output Scheduling Egress OSCAR IVÁN LEPE ALDAMA 50 PERSONAL COMPUTER-BASED SOFTWARE ROUTERS—3.3 • • • General-purpose computing hardware and software. For instance, a workstation or personal computer running some kind of Unix or Linux Specialized computing hardware and software. For instance, Cisco‘s 2500 or 7x00 hardware executing Cisco’s IOS forwarding software Application specific integrated circuits. For instance, Juniper’s M-120 For performance reasons it is interesting to classify routers in software and hardware routers. • • Hardware routers are routers whose data plane is implemented with hardware only. For instance, Juniper’s M-120 Software routers are routers whose data plane is implemented partly or completely with software. For instance, Cisco’s 2500 and 7x00 product series, or a workstation or personal computer running some kind of Unix or Linux Evidently, hardware routers outperform software routers in raw packet switching. At the Internet’s core, where data links are utterly fast and expensive, hardware routers are deployed in order to sustain high data link utilization. At the Internet’s edge, factors like multiprotocol support, packet filtering and ciphering, or above level 3 switching, are more important than data link utilization, however. Consequently, at the Internet’s edge software routers are deployed. Besides, software routers have other features that made them more attractive than hardware routers at particular scenarios; software routers are easier to modify and extend, have shorter times-to-market, longer life cycles, and cost less. 3.3.3 Personal computer-based software routers Evidently, a software router’s performance is affected by the following • • • The host’s hardware architecture The forwarding program’s software architecture The network interface card’s hardware and corresponding device driver architectures For this research, we selected to work with personal computing technology. This technology provided us with an open work environment—one that is easily controllable and observable—for which there is lots of publicly available and highly detailed technical information. Moreover, other software routers have similar or at least comparable architectures so our findings may be extrapolated. For this research, we used computers with the following features: • • • • • Intel IA32 architecture Intel’s Pentium class microprocessor as central processing unit Periphery Component Interconnect (PCI) input/out bus Direct memory access (DMA) capable network interface cards attached to the PCI input/output bus Internet networking software as a subsystem of the kernel of a BSDderived operating system OSCAR IVÁN LEPE ALDAMA PH.D. 
DISSERTATION 3.4—A QUEUING NETWORK MODEL OF A PERSONAL COMPUTER-BASED SOFTWARE IP ROUTER 51 For the shake of brevity, along the rest of this chapter we may refer to personal computer-based software IP routers only as software routers, unless otherwise specified. 3.3.4 Personal computer based-IPsec security gateways Kent and Atkinson [1998] define a secured communications architecture for the Internet named IPsec. Within this architecture, an IP router implementing the IPsec protocols is called an IPsec security gateway. Along the rest of this chapter, for the shake of brevity we may refer to personal computer based-IPsec security gateway only as security gateways, unless otherwise specified. 3.4 A queuing network model of a personal computer-based software IP router After this chapter’s description of personal computer-based software IP routers, it should be clear that no single-queue model could capture the whole system behavior. In order to model the pipeline-like organization and priority preemptive execution of the BSD networking code, a queuing network model is at least required. Figure 3.3 shows a queuing network model of a software router. This model corresponds to a personal computer that executes the BSD networking code for forwarding IP datagrams between two DMA capable network interface cards. Moreover, this model is restricted to a software router that is threaded by a single unidirectional packet flow. (Packets only enter the software router through the number-one network interface card and only exit the software router through the number-two network interface card.) Models for software routers with different number of network interface cards or different packets flows configuration, or for security gateways may be derived after the one here shown as later explained. The shown model is comprised of: • • Four service stations, one per each software router’s active element: the central processing unit (CPU), the input/output bus control logic (I/O BUS), the network interface card’s packet upload control logic (NA1IN) and the network interface card’s packet download control logic (NA2OUT) Eight first-come, first-served (FCFS) queues (na1brx, na2btx, dmatx, dmarx, ipintrq, ifsnd, ipreturn and ifeotx) and their corresponding service time politics, which model network interface cards’ local memory (na1brx, na2brx) and BSD networking mbuf queues. The number and meaning of the networking related queues, (dmarx, ipintrq, ifsnd, ipreturn and ifeotx) as well as the features of the service time politics applied by the CPU service station, are a result of the characterization process that section 3.5 describes. The service time politics applied by the I/O BUS service station (to customers at na1brx and dmatx queues) were not computed after our own measurements but after the results presented by Loeb et al [2001]. A PCI bus working at 33 MHz, with a 32-bit data path, which data phases last one bus cycle, and whose data transactions were never preempted, was considered. The service time politic applied by the NA2OUT service station (to cus- PH.D. DISSERTATION OSCAR IVÁN LEPE ALDAMA 52 A QUEUING NETWORK MODEL OF A PERSONAL COMPUTER-BASED SOFTWARE IP ROUTER—3.4 • • tomers at the na2btx queue) corresponds to a zero-overhead synchronous data link of particular speed Nine customer transitions between queues representing the per-packet network processing execution order. Cloning transitions, pictured by circles at their cloning point, result in customers’ copies walking the cloning paths. 
Blockable transitions, having a “t” like symbol over the blocking point, result in service stations not servicing transitions’ source queues, if transitions’ destination queues are full One customer source for driving the model. It spawns one customer per packet traversing the software router. An auxiliary first-come, firstserved queue and its corresponding service time politic are used for modeling the communications medium’s speed Figure 3.3—A queuing network model of a personal computer-based software router that has two network interface cards and that is traversed by a single packet flow. The number and meaning of the shown queues is a result of the characterization process presented in the next section Legend na1btx src Queue na2btx NA2 IN Service station na1brx NA1 IN SRC Customer source or sink NA1 OUT NA2 OUT na2brx I/O BUS dmatx (RR) sink SINK dmarx 2 ipintrq 5 ifsnd 3 ipreturn 4 ifeotx 2 CPU (PP) null DEVNULL fast noise 1 OSFN slow noise OSSN 1 dma - Direct Memory Access if(IF) - Interface ifeotx - IF End Of Transmission na(NA) - Network Adapter nab - NA buffer OSCAR IVÁN LEPE ALDAMA PH.D. DISSERTATION 3.4—A QUEUING NETWORK MODEL OF A PERSONAL COMPUTER-BASED SOFTWARE IP ROUTER • • • 53 A set of two customers sources, two FCFS queues and two transitions through the central processing unit for modeling the “noise” process. This process is defined and its relevance explained in subsections 3.5.9 and 3.5.10. A set of one FCFS queue and one customer sink for computing per packet statistics A set of one FCFS queue and one customer sink for trashing residual customers May this relatively simple model give useful predictions? Is this a better model of a software router than the widely used single queue model, under the considered scenario that the central processing unit is the system’ s bottleneck a not the communications links? These are some questions for which we wanted answers but couldn’t find at the time we started this research. Before elaborating on the quest for these answers, the rest of this section is devoted to presenting modeling details. These details are important for deriving models of software routers with different configurations. 3.4.1 The forwarding engine, the network interface cards and the packet flows Model elements depicted at Figure 3.3 within the dotted line model the software router. Model elements outside the dotted line are auxiliary elements. For modeling a software router with more network interface cards, some elements have to be replicated. Gray areas depict elements that conjunctively model a network interface card. A model requires one such group of elements for each network interface card that both inputs and outputs packets. Network interface cards that only inputs or output packets require only part of these elements. A model requires some other queues—and corresponding service time politics and transitions—to be replicated depending on the packet flow configuration. The figure depicts these queues with a shadow following their top right side. To further clarify this, consider the example shown in Figure 3.3. As said before, for this model, packets enter the software router through one network interface card and exit it through the other. Consequently, the model uses one dmarx queue and one pair of dmatx and ifsnd queues. Indeed, one dmarx queue is required per each network interface card entering packets to the software router, and one pair of dmatx and ifsnd queues is required per each network interface card exiting them. 
3.4.2 The service stations' scheduling policies and the mapping between networking stages and model elements

The service station acting as the input/output bus has a round-robin, non-preemptive scheduling policy that mimics the basic behavior of the PCI bus' arbitration scheme. The service station acting as the central processing unit has a prioritized preemptive scheduling policy that mimics the basic behavior of BSD's software interrupt mechanism. At any given time, this service station preemptively serves the customer at the front of the non-empty queue with the smallest priority number. Figure 3.3 shows the priority number in front of every queue served by the CPU service station. The priority scheme shown in Figure 3.3 models the fact that any task at the network-interfaces layer preempts any task at the protocols layer. It also models the fact that once a message enters a layer, its processing is not preempted by another message entering the same layer.

To better explain the above, consider Figure 3.4. This figure shows a software router model with a different mapping of networking stages to queues—and their corresponding service time policies. Besides, the input/output bus logic and the noise are not modeled, and the network interface cards' models are simplified. This figure's model uses a one-to-one mapping between the C language functions implementing BSD networking and the depicted model's queues. You can check this by comparing Figure 3.4 with the previous chapter's Figure 2.5. Observe that the queues named ip_input, ip_forward, ip_output and ether_output, which represent tasks at the protocols layer, all have higher priority numbers (and thus lower priority) than the queues named ether_input, if_intr, if_read and if_start, which are tasks at the network-interfaces layer. Moreover, the priority number of ip_input is greater than that of ip_forward, which is greater than that of ip_output, which is greater than that of ether_output. This models the fact that a message recently arriving at ip_input waits for processing until a message that is being forwarded (ip_forward) completes the rest of the protocols layer's stages—that is, ip_output and ether_output. For the same reason, the priority number of if_intr is greater than those of if_read and if_start, which are greater than that of ether_input.

To complete this discussion, let us note that while Figure 3.4's model is clearly a more detailed model of a software router than that of Figure 3.3, it is not a better model. As will be explained in the next section, Figure 3.4's model gives inaccurate results due to service time correlation between some networking stages.

Figure 3.4—A queuing network model of a software router that shows a one-to-one mapping between C language functions implementing the BSD networking code and the depicted model's queues. In order to simplify the figure, this model does not model the software router's input/output bus nor the noise process. Moreover, it simplistically models the network interface cards. (The CPU serves the queues ether_input (1), if_read (2), if_intr (3), ether_output (4), ip_output (5), ip_forward (6) and ip_input (7), together with if_start and the NIC's FIFO.)
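The prioritized preemptive rule just described is easy to state operationally: at every decision point the CPU station serves the head customer of the non-empty queue with the smallest priority number, suspending whatever it was doing if a smaller-numbered queue becomes non-empty. A minimal sketch of that selection rule, in C and with names of our own choosing (it is not the dissertation's simulator), could look like this:

#include <stddef.h>

/* One queue served by the CPU station: a priority number (smaller is
 * more urgent) and the number of customers currently waiting. */
struct cpu_queue {
    const char *name;      /* e.g. "dmarx", "ipintrq", "ipreturn" */
    int         priority;  /* smaller number = served first        */
    size_t      backlog;   /* customers currently in the queue     */
};

/* Return the index of the queue the CPU must serve next, or -1 if all
 * queues are empty.  Calling this again whenever a customer arrives
 * implements preemption: if the arrival's queue has a smaller priority
 * number than the queue currently in service, service switches to it. */
static int next_queue(const struct cpu_queue *q, size_t nqueues)
{
    int best = -1;
    for (size_t i = 0; i < nqueues; i++) {
        if (q[i].backlog == 0)
            continue;
        if (best < 0 || q[i].priority < q[best].priority)
            best = (int)i;
    }
    return best;
}

With the priority numbers of Figure 3.3, an arrival at dmarx (priority 2) would therefore suspend a customer being served from ipintrq (priority 5), which is exactly the network-interfaces-over-protocols preemption described above.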
3.4.3 Modeling a security gateway

Figure 3.5 shows a queuing network model of a software router configured as a security gateway. The model corresponds to a personal computer that executes BSD networking code implementing the IPsec architecture [Kent and Atkinson 1998]. Note that the model assumes that the in-kernel networking software processes IPsec tunnels as pipelines, and thus, at "every exit of a tunnel" it puts the resulting packet back into the ipintrq queue.

Figure 3.5—A queuing network model of a software router configured as a security gateway. The number and meaning of the shown queues is a result of the characterization process presented in the next section. (The structure matches that of Figure 3.3, except that the ipintrq queue now carries the IP, ESP and AH processing stages.)

3.5 System characterization

This section reports on the assessment of the service time policies that are applied by the CPU service station of a software router queuing network model. These service time policies model the execution times of the BSD networking code executed on behalf of each forwarded IP datagram. We used software profiling for assessing these execution times. Of the several tools and techniques that were available for profiling in-kernel software, software profiling is a good trade-off between process intrusion, measurement quality and setup cost. We expected the profiled code to have stochastic behavior due to the hardware's cache memory and dynamic instruction scheduling features—branch prediction, out-of-order and speculative execution. Thus, we strived to compute service time histograms rather than single-value descriptors. We also carried out a data analysis for unveiling some data correlation and hidden processes that forced us to adapt the queuing network model's structure. The rest of this section is devoted to reporting the details of this process.

3.5.1 Tools and techniques for profiling in-kernel software

As stated before, for our research we used networking code that is a subsystem of the kernel of a BSD-derived operating system. More specifically, we used the kernel of FreeBSD release 4.1.1. From the profiling point of view this kind of software presents a problem: the microprocessor runs the operating system kernel's code in supervisor (or similar) mode and, because of this, special tools and techniques are needed for profiling "live" kernel code, where by "live code" we mean code that is executed at full speed, in contrast to code that is executed step-by-step. Of the tools and techniques known to us for kernel code profiling we used software profiling [Chen and Eustace 1995; Kay and Pasquale 1996; Mogul and Ramakrishnan 1997; Papadopoulos and Parulkar 1993]. This technique is reasonably flexible and does not require additional support. In contrast, other tools, like programmable instrumentation toolkits [Austin, Larson and Ernst 2002] and logic analyzers, although more powerful, are complex and proprietary and thus require a considerable amount of time, effort and money. For the kind of task at hand, the use of these tools was considered too expensive in time, effort and money.
3.5.2 Software profiling

A software probe may be defined as a little piece of software manually introduced at strategic places of the target software for gathering performance information. Data structures and routines that manipulate the recorded information supplement the software probes. One important aspect when designing software probes, from now on just probes, is minimizing their overhead.

We found two ways to use probes [Kay and Pasquale 1996; Mogul and Ramakrishnan 1997; Papadopoulos and Parulkar 1993]. Each way uses a different mechanism to gather information. One uses software mechanisms only, resulting in instruments that, we can say, are self-sufficient. The other relies on special-purpose microprocessor registers used as event counters. Both have tradeoffs. Software-only probes are portable but can incur relatively higher overhead, depending on the complexity of the gathered information. Event-counter based probes are hardware dependent, and thus are not portable, but can gather information hidden from software, like instruction mix, interruptions, TLB misses, etc.

As our reference systems used Intel Pentium class microprocessors, our probes take advantage of these microprocessors' performance monitoring event counters (PMEC). These microprocessors have three such event counters, of which two are programmable [Shanley 1997]. Four model specific registers (MSR), which can be accessed through the rdmsr (read MSR) and wrmsr (write MSR) ring-0 instructions, implement the three PMECs. One MSR implements the PMECs' control and event selector register. It is a 64-bit read/write register employed for programming the type of event that each of the two programmable PMECs will count. Another MSR implements the time stamp counter (TSC), a read-only 64-bit free-running counter that increments on every clock cycle and that can be used to indirectly measure execution time. This register can also be accessed through the rdtsc (read TSC) ring-0 instruction. Finally, two MSRs separately implement the two programmable PMECs. Each of these is a 40-bit read-only register that counts occurrences or duration of one of several dozen events, such as cache misses, instructions, or interrupts.

The technique we employed for software profiling is as follows:

1) A probe is placed at the activation point and at each returning point of each profiled software object. Note that some of these objects have more than one returning point

2) When an experiment is exercised, each probe reads and records values from the three PMECs along with a unique tag that identifies the probe

3) The recorded information is placed in a single heap inside the kernel. We decided to do this, instead of using, let us say, one heap per object, because in this way we found it easier to observe and analyze object-activation nesting. Furthermore, a common heap facilitated calculating each object's expenses in a chain of nested activations

4) When an experiment is finished, the recorded information is extracted from the in-kernel heap and organized by an ad-hoc computer program. This program classifies the records by their source probe, computes the monitored event expenses per activation and writes the results in several text files—one per source probe. Each produced text file holds a table with one row per object activation and one column per PMEC. These text files can be fed to most widely available statistical and graphical computer programs for analysis

3.5.3 Probe implementation

Before implementing the probes, there were some questions to answer.

The first question was: Which hardware events do we monitor? At first sight the answer to this question seems simple: because we were interested in measuring execution time, we had to monitor machine cycles. However, because we also wanted to know how those cycles were spent, and to see proof of the concurrent nature of the software objects we instrumented, we decided to monitor instructions and hardware interrupts as well. The instruction count gives us an idea of the path taken by the microprocessor through the code. The interrupt count shows us the concurrent behavior of not only the instrumented objects but of the whole kernel. In turn, this serves to discriminate meaningless data: a high interrupt count would mean that the microprocessor spent too much time doing something else besides exercising the instrumented object.

A second question was: Where do we place the profiling probes? Another way to look at this problem is to ask: what level of granularity do we use when delimiting code for profiling? A priori, we chose a C language function level of granularity. This means that we flanked with profiling probes each of the C language functions implementing IP forwarding. From the previous chapter's Figure 2.5 it can be seen that these functions are: Xintr, Xread, Xstart, ether_input, ether_output, ipintr, ip_forward and ip_output. However, as will be explained in the next section, we found out that this level of granularity did not result in appropriate model service times. Therefore, a posteriori, we coarsened the granularity and flanked with profiling probes all the code executed from when an mbuf is taken out of any mbuf queue to when that mbuf is placed in another mbuf queue. This level of granularity resulted in the definition of the following five "profiling grains"; see Figure 3.3:

• The driver reception, dmarx; this grain spans Xintr (the reception part), Xread and ether_input

• The protocol processing, ipintr; this grain spans from the activation point of ip_input, through ip_forwarding, ip_output and ether_output, and up to the code at ether_output that either places the packet at the interface's transmission queue or activates Xstart

• The driver transmission, ifsnd; this grain corresponds to Xstart

• The return from protocol processing, ipreturn; this grain spans from where the protocol processing grain left off to the returning points of ether_output, ip_output, ip_forwarding and ip_input

• The driver end-of-transmission interrupt handler, ifeotx; this grain corresponds to the part of Xintr that attends the end-of-transmission interrupt that network interface cards signal

A third question was: How do we manipulate the PMECs? Searching the FreeBSD documentation on the web (http://www.FreeBSD.org) we found that this operating system implements a device driver named /dev/perfmon, which implements an interface to the PMECs. We also found that this device driver uses three FreeBSD kernel C language macro definitions for manipulating the Intel Pentium's MSRs and for reading the TSC. These macro definitions, wrmsr, rdmsr and rdtsc, are defined in the header file cpufunc.h stored in the sys/i386/include/ directory.
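The ad-hoc post-processing program mentioned in step 4 above is not listed in the dissertation. As an illustration only, the core of such a program could pair each activation record with the following return record of the same object and subtract the counter values, along these lines (the record layout and probe-identifier encoding anticipate those described later in this subsection; all function and variable names are our own, and the pairing assumes activations of the same object do not nest):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One heap entry, as recorded by a probe: a tag plus the three PMECs.
 * The high nibble of 'id' names the profiled object, the low nibble the
 * probe location (0x0 = activation, 0x1..0xF = returning points).      */
struct record {
    uint8_t  id;
    uint64_t tsc;      /* machine cycles (time stamp counter) */
    uint64_t instruc;  /* executed instructions (PMEC1)       */
    uint64_t interrup; /* hardware interruptions (PMEC2)      */
};

#define OBJ(id) ((id) >> 4)
#define LOC(id) ((id) & 0x0F)

/* Walk the extracted heap in order and print, per activation, the
 * cycles, instructions and interrupts spent by the profiled object. */
static void expenses_per_activation(const struct record *r, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (LOC(r[i].id) != 0x0)
            continue;                       /* not an activation record */
        for (size_t j = i + 1; j < n; j++) {
            if (OBJ(r[j].id) == OBJ(r[i].id) && LOC(r[j].id) != 0x0) {
                printf("obj %u: %llu cycles, %llu instr, %llu intr\n",
                       (unsigned)OBJ(r[i].id),
                       (unsigned long long)(r[j].tsc      - r[i].tsc),
                       (unsigned long long)(r[j].instruc  - r[i].instruc),
                       (unsigned long long)(r[j].interrup - r[i].interrup));
                break;                      /* matching returning point */
            }
        }
    }
}

In the dissertation's workflow the output of such a pass would then be split into one text file per source probe, as described in step 4.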
A fourth question was: How would we switch probes on and off? Papadopoulos and Parulkar [1993] present one way to do it. Although their mechanism did not fit us, it did give us a hint. Like them, we decided to use a test variable for controlling probe activation. Departing from their methodology, we decided to use a spill-monitoring variable that we named num_entries. num_entries counts the number of entries recorded in the heap, and probes are active as long as num_entries is less than the heap's capacity. At any given time, we could deactivate the probes by writing into it a value bigger than the heap's capacity, and vice versa.

A last question was: How would we implement the heap of records? In answering this, we considered that heap manipulation should require as few instructions as possible. Thus, we decided not to employ dynamic memory and defined a static array of records. This has the inconvenience of limiting the size of the heap. (Compiler-related restrictions limited the size of the profiling records heap to around 100 thousand entries.) Each record is implemented by a C language structure with the following members:

• id holds an integer value taken from a defined enumeration of probe identifiers

• tsc holds the 64-bit number read from the TSC

• instruc holds the 64-bit number read from PMEC1, which is initially configured to count executed instructions

• interrup holds the 64-bit number read from PMEC2, which is initially configured to count hardware interruptions

For the enumeration that defines the probe identifiers, we decided to employ a scheme that would allow us to readily identify both the source object and the probe's location inside the object. Thus, each probe identifier has two 4-bit parts: one for identifying the source object and another for identifying the probe's location inside the object. Moreover, given that every profiled object has one activation point but possibly many return points, we decided to use the number 0x0 for identifying the activation point and numbers between 0x1 and 0xF for identifying returning points. With all this set up, we were able to devise a sequence of instructions that would work for any probe. In order to ease the probe codification process, we defined a set of C language macro definitions that were added to the kernel.h header file stored in the sys/sys/ directory. The text of the macro definitions is shown in Figure 3.6.

Figure 3.6—Probe implementation for FreeBSD

#if defined(I586_CPU)
#define NAVI_RDMSR1 rdmsr(0x12)
#define NAVI_RDMSR2 rdmsr(0x13)
#elif defined(I686_CPU)
#define NAVI_RDMSR1 rdmsr(0xc1)
#define NAVI_RDMSR2 rdmsr(0xc2)
#endif

#define NAVI_REPORT(a) \
    if ( navi_entrynum < NAVI_NUMENTRIES ) { \
        disable_intr(); \
        navi_buffer[navi_entrynum].id = a; \
        navi_buffer[navi_entrynum].tsc = rdtsc(); \
        navi_buffer[navi_entrynum].instruc = NAVI_RDMSR1; \
        navi_buffer[navi_entrynum++].interrup = NAVI_RDMSR2; \
        enable_intr(); \
    }

3.5.4 Extracting information from the kernel

Extracting special information from a "live" kernel is a well-known procedure. You just have to implement a new system call. This is trivial but cumbersome. FreeBSD has an easier way to go: the sysctl(8) subsystem. This subsystem defines a set of C language macro definitions for exporting kernel variables, and the system call sysctl(2) for reading and writing the exported variables.
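For concreteness, the declarations that Figure 3.6's macro relies on, together with a sysctl export of the spill-monitoring counter, might look roughly as follows. This is a sketch assuming FreeBSD-style SYSCTL macros; the record and variable names follow Figure 3.6, but the chosen types, the sysctl node names and the use of SYSCTL_INT/SYSCTL_OPAQUE are our own illustration, not the dissertation's actual code.

/* Sketch of the probe storage assumed by Figure 3.6's NAVI_REPORT macro
 * (kernel side; names follow the figure, everything else is hypothetical). */
#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

#define NAVI_NUMENTRIES 100000          /* heap capacity, about 100k records */

struct navi_record {
    u_int8_t  id;                       /* probe identifier: object | location */
    u_int64_t tsc;                      /* time stamp counter                  */
    u_int64_t instruc;                  /* PMEC1: executed instructions        */
    u_int64_t interrup;                 /* PMEC2: hardware interruptions       */
};

static struct navi_record navi_buffer[NAVI_NUMENTRIES];  /* static heap */
static int navi_entrynum = 0;           /* spill-monitoring variable          */

/* Export the counter so probes can be switched on and off from user space
 * (writing a value >= NAVI_NUMENTRIES disables them), and export the heap
 * read-only so it can be pulled out with sysctl(8)/sysctl(2). */
SYSCTL_INT(_debug, OID_AUTO, navi_entrynum, CTLFLAG_RW,
           &navi_entrynum, 0, "number of profiling records in the heap");
SYSCTL_OPAQUE(_debug, OID_AUTO, navi_buffer, CTLFLAG_RD,
              navi_buffer, sizeof(navi_buffer), "S,navi_record",
              "profiling record heap");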
The scheme for identifying variables is similar to the one used for a management information base, MIB [Stallings 1997]. The header file sysctl.h, stored in the sys/sys/ directory, contains the macro definitions, and the file kern_mib.c, stored in the sys/kern/ directory, uses the macro definitions for initializing the subsystem. The sysctl(8) and sysctl(2) man pages explain how to use this subsystem, but they do not say how to extend it. Honing [1999] explains how to accomplish this.

3.5.5 Experimental setup

Figure 3.7 shows the experimental setup we used for system characterization and model validation. We characterized and modeled two systems: a software router and a security gateway. In both cases, we wanted the system under test's central processing unit to be the bottleneck. That is why the software router under test had as its central processing unit a 100 MHz Intel Pentium microprocessor, while the security gateway under test had a 600 MHz Intel Pentium III. Indeed, it is well known that the software implementing authentication and encryption algorithms consumes more computing power than plain IP forwarding. (After our research, we now know exactly how much more computing power a security gateway consumes when compared to a plain IP software router.)

Figure 3.7—Experimental setup. (The setup comprises sources A, B, C and D, an HP Internet Advisor, the IP router or security gateway under test, an inbound security gateway linked by an IPsec tunnel, and a phantom sink; the configured security associations are A→D secured with an AH tunnel and an ESP tunnel, and B→D unsecured.)

Traffic generation was accomplished using a modified version of Richard Stevens' sock program [Stevens 1994, appendix C] and using the traffic generation facility of a Hewlett-Packard Internet Advisor J2522B. During system characterization, experiments were driven only by a single sock source, while during model validation we used a mix of sources. As we did not have any special requirements for the source nodes, besides being able to run the sock program, we used whatever we had at hand. Thus, sources A and B had 100 MHz and 120 MHz Intel Pentiums, respectively; source C had a 233 MHz Intel Pentium II and source D a 400 MHz Intel Pentium III. During experiments with the security gateway, in order to prevent performance limitations at the IPsec tunnel's inbound security gateway from modifying the traffic's features or interfering with the characterization of the security gateway under test (the outbound one), the inbound gateway had a 650 MHz Intel Pentium III. The "phantom sink" was implemented using an address resolution protocol (ARP) table entry for a nonexistent node.

All the computers shown are personal computers with Intel IA32/PCI hardware architecture executing FreeBSD release 4.1.1 as their operating system. Main memory configurations varied among 16, 32 and 64 Megabytes. In any case, it was more than enough for storing the software router's working set; that is, the most used sections of the software router's software. Indeed, the FreeBSD kernel binary file we used was relatively small; it did not occupy more than 3 Mbytes. (1 Mbyte equals 2^20 bytes.) This binary file included, besides the normal contents (the kernel was configured to include only a reduced set of features [Leffler and Karels 1993]), the profiling probes and the storage for the heap of profiling records.
Of course, the main memory space required for storing a FreeBSD system's data varies as a function of the software load offered to the system—number of processes, devices, open files, open sockets, etc. In our case, this was not an issue because we configured each FreeBSD system so it would only execute the chief system processes—init, pagedaemon, vmdaemon, syncer and only a few getty. With this configuration, we wanted to avoid uncontrolled process context switches that would drain computing power.

All computers had 3COM 905B-TX PCI/Fast Ethernet network interface cards attached to either 10 Megabit per second Ethernet or 100 Megabit per second Fast Ethernet channels. A 3COM Office Connect TP4/Combo Ethernet hub was used for linking all sources, the software router under test and the inbound security gateway. A point-to-point Fast Ethernet link was used for the IPsec tunnel. FreeBSD kernels were configured to use the xl(4) device driver for controlling the 3COM network interface cards. All computers had onboard EIDE/PCI hard disk controllers. A plain and basic FreeBSD system requires around 400 Megabytes of disk storage for holding the chief file systems. With the considered configuration, no swap space was required.

3.5.6 Traffic patterns

The traffic pattern for system characterization served two purposes:

• Minimizing the preemption of networking stages

• Avoiding packet arrivals being synchronized with any kernel-supported clock

The first feature was important for easing the system characterization process. Recall that we had set up probes at the activation and returning points of selected software tasks, and that probes record the current values of the three PMECs. Therefore, the resource consumption of any profiled task is equal to the difference between the three PMEC values recorded at a profiled task's returning point and the corresponding values recorded at the same profiled task's activation point. Observe that any preemption of a profiled task—which will occur between its activation and returning points—will result in an overblown resource consumption record. Indeed, in this case, the resource consumption record would also include resources consumed by the preempting task. We saw two ways to solve this problem. One was to build a computer program that would identify overblown records and try to eliminate the preemption costs from them. The other solution was to use a traffic pattern that would assure the system under test forwards one packet at a time. The tradeoffs here were the experiment's run time and the solution's complexity. Clearly, the second solution is simpler. As for the experiment's run time, it turned out not to be an issue. Indeed, given that the heap of profiling records was limited in size to 100 thousand entries and that the forwarding of a packet produced around 12 entries, experimental runs could only be long enough to produce about 8333 packets. At the same time, our slowest system under test would forward one packet in less than 80 microseconds, as we found out a posteriori. Consequently, we used the second solution.

The traffic's second feature, avoiding synchronization with kernel clocks, is important for dealing with software preemptions not related to networking tasks. A FreeBSD kernel includes tasks that attend urgent events, like the system real-time clock's interrupt handler, or a page-fault exception's or the hard disk's interrupt handler.
Such tasks are executed with high priority and thus preempt any networking task. Our systems under test, which had a simplified configuration as stated before, only produced urgent events related to the real-time clock, and therefore launched high-priority tasks following a periodic pattern—with different periods, but all a multiple of the real-time clock's period. A posteriori, we found out that if sources (which were also FreeBSD systems using the same kind of real-time clock to time the traffic production) produced IP datagrams periodically, packet arrivals at the system under test might get in sync with one or more periodic, high-priority kernel tasks. This situation would result in a larger number of overblown measurements than when the traffic is not synchronized with these kernel tasks.

All the above resulted in the following traffic pattern for system characterization:

• A packet trace with random packet inter-arrival times of at least 2 times the mean packet processing time of the system under test, and at most between 3 and 10 times this mean value

3.5.7 Experimental design

We carried out a large set of experiments for assessing the influence that various operational parameters have on service times. The parameters we considered are:

• Packet size
• Inter-packet arrival time
• Packet burst size (number of packets in a packet burst)
• Encryption protocol
• Authentication protocol
• Traffic mix
• Routing table size

Besides, we carried out some more experiments for assessing the behavior of various system elements. The elements we considered are:

• Code paths
• Central processing unit speed
• Main memory
• Cache memory

Finally, we carried out some more experiments for assessing the overhead and intrusion of the profiling probes.

3.5.8 Data presentation

From the data gathered from each PMEC of each profiling grain (section 3.5.3) in each experiment, we produced a set of data descriptors and corresponding charts. As stated in this section's introduction, we expected the profiled code to have stochastic behavior and thus we strived to produce histograms as data descriptors. For each experiment, we produced the following six charts:

• Central processing unit cycle count trace and histogram (2 charts)
• Machine code instruction count trace and histogram (idem)
• Hardware interrupt count trace
• Correlation between cycle count and instruction count

Figure 3.8 and Figure 3.9 show some example chart sets. Appendix A shows the chart sets for all the experiments.

3.5.9 Data analysis

In the instruction and cycle count charts of Figure 3.8 and Figure 3.9, it can be seen that measurement points form belts. From the cycles/instructions correlation chart, it can be seen that each belt in the instruction count chart correlates with a single belt in the cycle count chart. This was expected, as the software probes at each layer were programmed to profile more than one routine. Thus, each belt represents a code path through the networking software. In the case of Figure 3.8, the upper belt corresponds to the Encapsulating Security Payload (ESP) protocol, the middle belt corresponds to the Authentication Header (AH) protocol, and the lower belt corresponds to the Internet Protocol (IP) protocol. In the case of Figure 3.9, the upper belt corresponds to the networking message reception routine while the lower belt corresponds to the end-of-transmission interrupt handler.
From the charts alone, one cannot tell which belt corresponds to which code path. To do this one has to analyze the recorded data with respect to the identification tag written by the software probes with each performance record. It is striking, although expected, that the belts in the instruction count charts are narrow—in fact they are single valued—while they are thick in the cycle count charts. This clearly confirms our assumption that the networking code would have stochastic behavior, and the importance of representing the model's service times with histograms.

From the shown charts, it is evident that some kernel activities unrelated to network processing are consuming system resources. This is apparent from the existence of several clearly unclustered measures, or outliers, in the cycle and instruction charts. Moreover, these points correspond to probe records detecting at least one interruption—there is a one-to-one relationship between these points and impulses in the interrupt count chart. For us this clearly implies that the networking activities were being preempted by some other, non-instrumented activity. We named this activity, or collection of activities, the "noise" process, as it is an ever-present process which influences system performance but over which we have no control. In order to model this "noise's" influence we added a set of elements to the software router's model, as described in section 3.4. Consequently, when computing service time histograms we disregarded the outliers.

Figure 3.8—Characterization charts for a security gateway's protocols layer

Figure 3.9—Characterization charts for a software router's network interfaces layer

3.5.10 "Noise" process characterization

A precise representation of the "noise" process would require analyzing activation sequences for several kernel tasks. A priori, we decided not to do this, as it is a laborious task. Instead, we tried a simpler approach. A posteriori, the approach we used turned out to be enough for producing highly accurate results, as section 3.6 later shows. The approach's objective was to estimate the probability density function (PDF) of two random variables: the inter-arrival time of "noise" events and the number of central processing unit cycles consumed by these events. The motivation for going this way arose when doing initial model validations, as later discussed in section 3.6. Indeed, we were able to determine that, in most charts, the packet latency measures clustered in the second belt were due to packets that were preempted at least once. Under the assumed load conditions, this preemption can only be due to the "noise" process. Therefore, finding plausible relationships between the second belt and this process might solve the problem.

The first relationship we found was associated with the "noise" events' inter-arrival times. The inter-arrival time of "noise" events must be constant—as we have associated this process with activities tied to the system's real-time clock. Therefore, we only had to estimate its period or rate. We did this by computing the relative frequency of measures within the second belt.
Considering no correlation between the occurrence of "noise" events and packet arrivals (something that we were able to assure by following the traffic patterns we used, as stated before in subsection 3.5.6), we computed a mean arrival rate for "noise" events as λ = p/T, where p is the probability of a record detecting a preemption and T is the mean service time computed over records that did not detect a preemption. The estimated period of 4.2 milliseconds is very close to the 4 millisecond period of some bookkeeping kernel routines reported in the literature [McKusick et al 1996].

After settling that the "noise" process was a periodic process, we proceeded to estimate the cycle count's PDF. The fact that this process was periodic was important; it allowed us to discard one random dimension and to consider that the measure dispersion within the second belt was related only to the randomness of the service times. (See section 3.6.) Moreover, latency times experienced by preempted packets under the assumed load conditions are equal to the sum of an unpreempted service time plus a value associated with a "noise" event's service time. Consequently, for characterizing the "noise" process we decided to use a well-known PDF fitted to the packet-latency measures of the second belt minus the unpreempted average service time. We tried uniform, exponential and normal PDFs, but the best fit was given by the normal PDF.

3.6 Model validation

This section presents a validation of our software router's and security gateway's queuing network models. The validation is based on a two-dimensional comparison between the queuing network models' predictions and both measurements taken from real systems and predictions from a simpler single-queue model. All models were solved by computer simulation, using a computer program built by us. One dimension of the comparison contrasts per-packet processing latency traces. For simplicity, from now on we will refer to the per-packet processing latency as latency. In this dimension, we can qualitatively compare the temporal behavior of the systems. The other dimension compares the complementary cumulative probability functions, from now on CCPF, of the latency traces. Through this dimension, we can quantitatively compare the performance of the systems at several levels of detail. For example, we can either do a broad comparison by comparing mean values, or assess different service levels by looking at different percentile values.

Performance experiments were devised for stressing different system behaviors. Experiments varied according to the operational parameters of packet length and inter-packet time-gap. In spite of all the possible combinations of these two parameters, two main kinds of traffic are identified. One kind has fixed-size packets produced with a constant inter-packet time-gap, from now on IPG. The other kind has fixed-size packets produced with a randomly variable IPG. The traffic of the first kind was initially intended to validate the service time distributions computed from the characterization process. Later, it was also used for validating the "noise" process's model. It is interesting to note that the equipment we used to produce the traffic—an HP Internet Advisor—was not able to deliver purely constant IPG traffic at all IPG values. For IPG values below one millisecond, the equipment produced packet bursts one millisecond apart.
The resulting traffic had an average IPG equal to the one we selected, but the IPG between packets within a packet burst was almost equal to the minimum Ethernet inter-frame gap. While this was at first annoying, we were able to take advantage of this "equipment feature" and used the resulting traffic (packet bursts spaced one millisecond apart) for some validation experiments. In these cases, experiments varied according to the number of packets within a packet burst, or packet burst size. The traffic of the second kind was devised to resemble usual traffic on a real local area network; that is, traffic with a high ratio between its peak and average packet rates. It was produced by the superposition of four on/off sources with geometrically random state periods. During the on state, the traffic was produced with a geometrically random IPG.

When looking at the validation charts it is important to take into account that all systems, the real one and both models, were fed with exactly the same input traffic. We assured this by gathering traffic traces during the performance-measuring experiments that we carried out with the real machines, and later feeding these traces to the model runs.

3.6.1 Service time correlation

Before presenting the final model validation, allow us to discuss here the service time correlation we observed when using a one-to-one mapping between the C language functions implementing BSD networking and the queuing network model's queues. Subsections 3.4.2 and 3.5.3 presented this problem's implications for system modeling and probe implementation, respectively. Figure 3.10 presents the CCPF comparison for packet latency traces gathered from some experimental runs. It clearly shows that a software router model with queues mapped as stated before is not a good model. Figure 3.11 presents an example chart we produced from a service time correlation analysis. It clearly shows that the networking C language functions' execution times are not independent. As stated in subsections 3.4.2 and 3.5.3, these results drove changes in both the final model and the probes' implementation.

Figure 3.10—Comparison of the CCPFs computed from both measured data from the software router under test and predicted data from a corresponding queuing network model, which used a one-to-one mapping between C language networking functions and model queues

Figure 3.11—Example chart from the service time correlation analysis. It shows the plot of ip_input cycle counts versus ip_output cycle counts. A correlation is clearly shown

3.6.2 Qualitative validation

For carrying out the qualitative validation, as defined in this section's introduction, we produced latency trace charts like the ones shown along the two leftmost columns of Figure 3.12. There, each row corresponds to a particular experiment. Within each row, the left-hand chart depicts data measured on the system under test while the center chart depicts data estimated by the queuing network model. Results at a glance qualitatively validate the queuing network model. Indeed, in all rows both charts have many more similarities than differences. But beyond this broad comparison there are several details worth a closer look. The first interesting thing that shows up in the charts is the existence of two or more measurement belts.
These are more evident in the charts for the 100 MHz software router, but they also exist in the experiments with the 600 MHz security gateway. The first belt from the bottom up corresponds to packets that arrived when the system had empty message queues and that, consequently, were serviced without preemption. The measure dispersion in this belt corresponds to the system's stochastic service times. The second belt (again, from the bottom up) corresponds to packets that, although arriving at an empty system, experienced some preemption due to the "noise" process. Note that the first two bands are omnipresent in the charts, and that charts corresponding to experiments with traffic of the first kind show no other bands. In contrast, charts for experiments with "bursty" traffic (traffic with packet bursts) depict some other bands. These other bands correspond to packets that experienced some queuing time; they arrived when the system had non-empty message queues. These bands' large variance is a consequence of random message queue lengths and of some additional preemption time due to the "noise" process. The number of these additional bands (additional to the first two) relates to the size of the driving traffic's packet bursts.

One discrepancy between the real systems' behavior and their queuing network models' becomes apparent when looking closely at the measurement bands: the bands' variances are a little different. The source of this discrepancy relates to the modeling of the "noise" process. As explained in subsection 3.5.10, our "noise" characterization trades off accuracy for simplicity. However, as the next subsection shows, this discrepancy only results in a minor quantitative error, and thus we did not pursue a more accurate noise characterization and modeling.

3.6.3 Quantitative validation

For carrying out the quantitative validation, as defined in this section's introduction, we produced charts comparing three latency CCPFs: the real system's, the queuing network model's and the single-queue model's. Example charts are shown in the right column of Figure 3.12. Each chart depicts curves produced from the latency traces shown in the same row. Again, results at a glance quantitatively validate the queuing network model and reveal its better suitability with respect to the single-queue model. Beyond this broad comparison, there are some details worth discussing. One interesting thing to note is the stepped nature of the real system's and the queuing network model's curves. This is most evident in experiments with constant IPG. This nature relates to the central processing unit being shared between packets and the "noise" process. Differences in the prominence of the steps relate to differences between the real density function of the "noise" process's service times and the estimated one. From the charts, it should now be evident that the quantitative error is minor. Another interesting thing to note is that the error introduced by the statistical estimation of the "noise" process's service times diminishes as the traffic gets "bursty". This is striking although reasonable. Indeed, the average single-packet service time is significantly larger than the average "noise" service time. Thus, when packet bursts appear, the packet sojourn time is more affected by the size of the message queues than by the central processing unit being shared with the "noise" process.
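For reference, a CCPF curve like the ones compared in this section can be computed from a per-packet latency trace in a few lines. The sketch below is our own illustration (the dissertation's analysis tooling is not shown); it sorts the trace and reports, for each latency value x, the empirical probability p(X >= x):

#include <stdio.h>
#include <stdlib.h>

/* qsort comparator for latencies expressed in nanoseconds. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Print the empirical complementary cumulative probability function:
 * for the i-th smallest latency, p(X >= latency[i]) = (n - i) / n. */
static void print_ccpf(double *latency_ns, size_t n)
{
    qsort(latency_ns, n, sizeof(double), cmp_double);
    for (size_t i = 0; i < n; i++)
        printf("%.0f %.6f\n", latency_ns[i], (double)(n - i) / (double)n);
}

int main(void)
{
    /* Tiny made-up trace, only to exercise the function. */
    double trace[] = { 52000, 61000, 55000, 230000, 57000, 54000 };
    print_ccpf(trace, sizeof(trace) / sizeof(trace[0]));
    return 0;
}

Plotting the output on a logarithmic probability axis reproduces the kind of CCPF comparison shown in the right column of Figure 3.12.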
Figure 3.12—Model validation charts. The two leftmost columns' charts depict per-packet latency traces (observed and estimated, respectively, in nanoseconds versus packet number); the right column's charts depict the latency traces' CCPFs, comparing the queuing network model, the single-queue model and the real software router. The rows correspond to the following experiments: constant IPG, 4-byte UDP datagrams at 333 pkt/s; pseudo-constant IPG (with some 2-packet bursts), 4-byte UDP datagrams at 1000 pkt/s; constant inter-burst gap with bursts of 3 or 4 packets, 4-byte UDP datagrams at 2500 pkt/s; pseudo-constant IPG (with some 2-packet bursts), 512-byte UDP datagrams at 1000 pkt/s; bursty ON/OFF traffic, 512-byte UDP datagrams, mean packet rate of 2857 pkt/s with bursts of up to 9800 pkt/s; bursty ON/OFF traffic, 56-byte UDP datagrams, mean packet rate of 2857 pkt/s with bursts of up to 9800 pkt/s; and bursty ON/OFF traffic, 56-byte UDP datagrams, mean packet rate of 1492 pkt/s with bursts of up to 8700 pkt/s

3.7 Model parameterization

The previous section showed that it is possible to characterize a particularly slow software router and use the resulting service times for producing highly accurate predictions with a queuing network model of the same system. This section's objective is to present answers to the following questions:

• Is it possible to extrapolate our findings for the performance evaluation of faster systems?

• At what point do changes in hardware architecture or traffic pattern affect our model's predictions?

3.7.1 Central processing unit speed

For answering this section's first question, we produced Figure 3.13. The figure shows the execution times measured by the software probes of the plain software router code when the operation speed of the software router's central processing unit, from now on CPU, was varied between 300 MHz and 630 MHz. We were able to do this thanks to a feature of the Gigabyte GA-6VXE+ motherboard and the Intel Pentium III type SECC2 microprocessor that one of our test systems had. A set of switches on this motherboard allowed us to change the operation speed of this type of microprocessor. Let us emphasize the importance of this feature. Without it, we would have needed to change microprocessors, which may have different cache memory configurations, and even whole motherboards, which most certainly would have different chipsets, in order to experiment at different central processing unit speeds. This would have resulted in experiments with too many varying factors, producing data too complex to analyze and thus useless.

Figure 3.13—Relationship between measured execution times and central processing unit operation speed (execution time in microseconds versus the CPU's clock period in nanoseconds). Observe that some measures have proportional behavior while others have linear behavior.
The main text explains the reasons for these behaviors and why the circled measures do not all agree with the regression lines. (The regression lines shown in the figure are y = 4.6271x + 0.148, y = 2.1723x + 1.9141, y = 0.8898x + 1.3722, y = 0.9017x + 0.0047 and y = 0.6368x + 0.1364, with x the CPU's clock period in nanoseconds and y the execution time in microseconds.)

Back to Figure 3.13, it can be seen that ipintrq's, ifsnd's and ipreturn's execution times are proportional to the CPU's clock period, while dmarx's and eotx's execution times vary linearly, with an offset, with respect to the same variable. (See subsection 3.5.3 for the definition of these software probes.) This has a clear explanation. The second set of probes profiles code that executes some input/output (I/O) instructions for communicating through the I/O bus with the network interface cards. The operation speed of the I/O bus was constant during the experiment, and therefore the time required for executing the I/O instructions did not scale with the CPU speed but remained constant. This results in the offset observed for dmarx's and eotx's curves. Note that the value of this offset depends not only on the I/O bus speed but also on the network interface card and the corresponding device driver architecture.

In order to complete this analysis we included in Figure 3.13 a set of measurements taken from a different computer. These measurements are shown circled in the figure. The "new" computer had a 1 GHz Intel Pentium III with a different cache configuration and a different motherboard, an ASUS TUV4X, with a different chipset and main memory technology. Both computers' I/O buses operated at identical speeds, however. As the figure shows, not all of the "new" computer's measurements agree with the regression lines of the "old" computer's measurements. In fact, only eotx's new measurement agrees; the rest are lower than their corresponding regression lines. Moreover, ipintrq's new measurement is the farthest from its corresponding regression line.

We believe there is no single source for these discrepancies. Moreover, as the compared computers have more than a few differences, we cannot be categorically sure what these sources are. However, knowing that the software router's performance is not influenced by the memory technology and that the networking code is small enough to fit inside the CPU's cache memory (see the next subsection), we believe the discrepancies' main source is the different level-two cache fitted to each CPU. Indeed, the SECC2 600 MHz Intel Pentium III had an on-package, half-speed, level-two cache while the FC-PGA2/370 1 GHz Intel Pentium III had an on-chip, full-speed, advanced transfer, level-two cache. What we believe this means is that the "new" computer, when compared to the "old" one, not only executed instructions faster but also kept its pipeline fuller, or with fewer bubbles, as on average it did not have to wait as long for instructions and their data. Clearly, different code segments profited differently from this cache performance improvement. Although we have not done a detailed code analysis, we believe this is mainly due to the relative sizes of the code each software probe profiled and to the mix of instructions. We believe it is reasonable to expect that, basically, a large piece of code without I/O instructions would profit more from a faster cache than a small piece of code with many I/O instructions.
Ipintrq is the probe that profiled the biggest piece of code (1261 instructions), and none of them are I/O instructions. We believe that is why its 1 GHz measurement is the farthest from its corresponding regression line. The other code sizes are: dmarx, 475; eotx, 236; ifsnd, 216; and ipreturn, 192. As stated before, dmarx and eotx profiled some I/O instructions. We believe the reason eotx's 1 GHz measurement falls right on its regression line is that the profiled code has a relatively large number of I/O instructions. However, we have not assessed this.

3.7.2 Memory technology

We found that the working set of the BSD networking code for both the software router and the security gateway fits inside their CPUs' cache memory. Partridge et al [1998] have found similar results. Therefore, these systems' performance is not affected by main memory technology. We observed this after analyzing several cycle-count traces from different profiling probes. The analysis consisted of filtering out preempted records—records whose corresponding interrupt count was not zero. Figure 3.14 shows example charts from two different profiling probes: the ipintrq probe and the probe profiling the DES algorithm for the Encapsulating Security Payload protocol (ESP). The figure shows four outliers circled in each chart. Each of these records corresponds to the first packet processed by the system after some non-networking-related activities had been executed. Observe that after these first packets are processed, no other packet experiences a high cycle count. For Figure 3.14 this means that only four out of almost 25 (or 35) thousand packets were affected. This suggested to us that during the processing of each of these so-called first packets, the CPU's instruction cache must have held no networking instructions, and thus the CPU required extra cycles for reading the instructions from main memory. But it also suggested to us that all other packets were processed with instructions read from the CPU's cache, and thus did not require reading from main memory. Therefore, this subsection's premise is supported.

To complete the picture, allow us to discuss here why we hold that some non-networking-related activities trashed the CPU's instruction cache. Because we could only allocate a probe-record heap big enough for storing 100 thousand records—for all the probes in the kernel—we had to break each experiment into four rounds. After each round, we had to start a shell session on the system's console and interact with the system. During this interaction, we executed several shell commands for extracting the heap of profiling records from the kernel's address space and saving it to a file. Then, we executed some other shell commands for preparing the system for the next round and for launching it. Clearly, after all this, the CPU's instruction cache was filled with instructions other than those of the networking software.

Figure 3.14—Outliers related to the CPU's instruction cache. The left chart was drawn from data taken from the ipintrq probe. The right chart corresponds to the ESP (DES) probe at a security gateway. The referenced outliers are highlighted

3.7.3 Packet size

Figure 3.15 shows the influence that packet size has on probe-measured execution times.
On the one hand, it shows that the execution times of the code required for plain IP forwarding are insensitive to the packet size. This is most likely due to the way FreeBSD release 4.1.1 manages message buffers—it uses cluster message buffers regardless of packet size. On the other hand, as expected, it shows that the execution time of the code implementing the authentication and encryption algorithms for the IPsec protocols grows with the size of the packet. The figure shows curves for a pair of authentication algorithms—md5 and sha1—and for three encryption algorithms—DES, triple DES and blowfish. Each of these curves reflects the behavior of its algorithm. It can be seen that blowfish has the smallest dependence on packet size, while DES and triple DES have the greatest. On the other hand, it can be seen that blowfish has the largest setup time and that triple DES is truly three times costlier than DES.

Figure 3.15—Relationship between measured execution times and message size. (Pentium III 600 MHz; execution time in microseconds, log scale, versus packet size in bytes; curves for bookkeeping, IF start, EoT interrupt, Rx interrupt, IP forwarding, AH(md5), AH(sha1), ESP(des), ESP(3des) and ESP(blowfish).)

3.7.4 Routing table's size

Another aspect that generally influences the performance of software routers is the routing table lookup algorithm. Table 3-I lists IP forwarding's average execution times for our 600 MHz system when its routing table had 4, 16, 64, 128, 256, 512 and 1024 entries. Traffic traversing the software router was randomly distributed across destination addresses. The times obtained (ranging from 15.44 µs to 17.41 µs) do not show significant variations. Consequently, we can say that for the routing table sizes that a software router may be expected to face—at the Internet's edge—routing performance is not influenced by the routing table's size.

TABLE 3-I
Routing table size    IP forwarding time [µs]
4                     15.44
16                    15.98
64                    16.16
128                   16.37
256                   16.56
512                   16.81
1024                  17.41

3.7.5 Input/output bus speed

After settling that the considered networking code's execution time varies proportionally with the executing CPU's speed, we computed the expected execution times for IP forwarding and for encrypting/authenticating 1500-byte network messages at selected CPU speeds, and compared them with the transfer times through two I/O buses: one with a 33 MHz operation speed and a 32-bit data path, and the other with a 66 MHz operation speed and a 64-bit data path. The results in Table 3-II clearly show that for the CPUs that will be available on the market in the following years, the bus is potentially a bottleneck for the system.

TABLE 3-II
CPU speed [MHz]   Forwarding [µs]   Forwarding + 3DES [µs]   I/O bus transfer [µs], 33 MHz x 32-bit   I/O bus transfer [µs], 66 MHz x 64-bit
600               15.44             580                      11.4                                     2.9
1200              7.72              290                      11.4                                     2.9
4800              1.93              72.5                     11.4                                     2.9

3.8 Model's applications

Now that we have proved the validity of the queuing network model, in this section we apply it for capacity planning and as a uniform experimental test-bed. When used as a capacity-planning tool, the queuing network model may help to devise system configurations, or to tune a system supporting communication quality-assurance mechanisms, or quality-of-service (QoS) mechanisms. Indeed, user QoS levels may be mapped to parameters such as overall maximum latency, sustainable throughput or packet loss rate.
3.8 Model's applications

Now that we have proved the validity of the queuing network model, in this section we apply it for capacity planning and as a uniform experimental test-bed. When used as a capacity-planning tool, the queuing network model may help to devise system configurations, or to tune a system supporting communication quality-assurance mechanisms, or quality-of-service (QoS) mechanisms. Indeed, user QoS levels may be mapped to parameters such as overall maximum latency, sustainable throughput or packet loss rate. Then, the queuing network model's ability to estimate the complete probability density function of performance parameters may be used for identifying the system's operational regions that meet these QoS levels. Subsection 3.8.1 elaborates on this.

When used as a uniform experimental test-bed, a queuing network model eases the performance study of systems that incorporate modifications or extensions. Generally, all there is to do is to adjust the model's queuing network. Certainly, when adjusting the queuing network, care must be taken to remap service times. Moreover, although this remapping may not be trivial, it is clearly easier than producing correct software for a new real system. Following this rationale, subsection 3.8.2 presents performance evaluation results for two modified software routers: one that incorporates Mogul and Ramakrishnan's [1997] receiver live-lock avoidance mechanism, and another that, besides this, incorporates a weighted fair queuing scheduler for IP packet flows.

3.8.1 Capacity planning

Figure 3.16 shows two charts drawn from data estimated by the queuing network model. In each chart, some software router performance variable is plotted against the average traffic load in packets per second (pps). We are assuming the performance bottleneck is the software router and that the traffic load is Poisson. Five curves are drawn in each chart, each one corresponding to the performance of a system under particular conditions. Let us now discuss how to use these charts for capacity planning.

Forwarding capacity—Let us define forwarding capacity as the maximum load value at which a software router can forward packets without losing any. Observe that, as noted in subsection 3.7.3, packet forwarding as implemented in FreeBSD 4.1.1 is insensitive to packet size. Thus, for assessing the forwarding capacity of a system we look at the losses chart and find the load at which the system starts to lose packets. For example, a 100 MHz system has a forwarding capacity of, more or less, 8.5 kpps. This means that it can hardly drive a single Ethernet—which may produce packets at rates exceeding 14 kpps. On the other hand, a 650 MHz system has a forwarding capacity of 55 kpps.

Encryption capacity—The same notion mapped to a security gateway yields the encryption capacity; that is, the maximum load value at which a security gateway can forward encrypted packets. Differently from packet forwarding, as also noted in subsection 3.7.3, packet encryption performance is influenced by the packet size and the encryption algorithm used. In fact, a security gateway has a bimodal response with respect to packet size. When packets are smaller than a particular value, the security gateway's performance is limited by its forwarding capacity. When packets are bigger than this value, the encryption capacity is the performance limit. To show this numerically, consider the following. Let us define tf as the packet forwarding time and tue as the per-byte encryption time. Thus, the maximum processing capacity of a security gateway expressed in packets per second is

C(pps) = 1 / (tf + tue * L),

where L is the packet size in bytes. Expressed in bytes per second (Bps), this maps to

C(Bps) = 1 / (tue + tf / L).

Clearly, when L → 0, C(Bps) → L/tf, and when L → ∞, C(Bps) → 1/tue.

Figure 3.16—Capacity planning charts (panels a and b)
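To make the asymptotics concrete, here is a minimal sketch that evaluates both expressions over a range of packet sizes; the tf and tue values are placeholders for illustration only, not measurements from this chapter.

#include <cstdio>

// Capacity of a security gateway as a function of packet size L (bytes),
// following C(pps) = 1/(tf + tue*L) and C(Bps) = 1/(tue + tf/L).
int main() {
    const double tf  = 16e-6;    // placeholder per-packet forwarding time [s]
    const double tue = 0.4e-6;   // placeholder per-byte encryption time [s/byte]

    for (double L : {64.0, 512.0, 1386.0, 1500.0}) {
        double pps = 1.0 / (tf + tue * L);
        double Bps = 1.0 / (tue + tf / L);   // equals L * pps
        printf("L = %6.0f B: %8.0f pps, %6.2f Mbytes/s\n", L, pps, Bps / 1e6);
    }
    // Small L: throughput is forwarding limited, C(Bps) ~ L/tf.
    // Large L: throughput saturates at 1/tue, the encryption limit.
    return 0;
}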
Returning to the encryption capacity example: from the losses chart, a 650 MHz system configured to authenticate and encrypt a communication session—using both the AH and ESP protocols, with DES for encryption—has an encryption capacity of 4.2 kpps when processing 512-byte packets and 3.6 kpps when processing 1386-byte packets. If it uses 3DES for 1386-byte packets, this system's encryption capacity is around 1.5 kpps.

Maximum latency—If someone is interested in knowing the minimum clock rate (read: price) that supports packet latencies no higher than a certain value with some probability, he must look at the corresponding latency percentile chart. For example, a 100 MHz software router operating at its forwarding capacity of 8.5 kpps has a 99-percentile latency of around 10 ms. As another example, a 650 MHz security gateway configured to use 3DES, processing 1386-byte packets and operating at its encryption capacity of 1.5 kpps, has a 99-percentile latency of around 10 ms.

3.8.2 Uniform experimental test-bed

The following study's objective is to observe the performance of a software router supporting communication quality-assurance mechanisms, or quality-of-service (QoS) mechanisms. Here it is shown that the considered software routers cannot sustain system-wide QoS behavior solely by adding QoS-aware CPU schedulers. By these schedulers we do not mean user-process schedulers but networking-task schedulers; that is, schedulers regulating the marshalling of packets through the networking software. The next chapter presents a solution to this problem.

The study that concerns us here compares the performance of two software routers, one with a 1 GHz CPU and the other with a 3 GHz CPU. These software router models' service times were extrapolated by applying section 3.7's parameterization rules to section 3.5's measurements for the 600 MHz system. We considered a PCI I/O bus like the one described in section 3.4. The data links were assumed to sustain a 1 Gbps throughput. For this study, the software router workload consisted of the superposition of three traffic flows with the characteristics shown in Table 3-III. The offered load in bits per second for each flow is identical. As we are interested mainly in system throughput under overload, we have considered Poisson input traffic.

TABLE 3-III
          Packet length (bytes)    Solicited share
Flow 1    172                      1/3
Flow 2    558                      1/3
Flow 3    1432                     1/3

It was expected that an ideal QoS-aware software router would allocate one third of its resources to each flow.

Basic system—In order to set a comparison baseline, we first studied the performance of the considered software routers when they use the basic BSD networking software. Figure 3.17 shows the queuing network model for a basic software router traversed by three packet flows. Figure 3.18 shows performance data computed from these models for global offered loads in the range [0, 1400 Mbps]. As can be seen, the software router with the 1 GHz CPU shows a linear increase of the aggregated throughput for offered loads below 225 Mbps. At this point, the CPU utilization is 100% while the bus utilization is around 50%, and the system enters a saturation state. If we further increase the offered load, the throughput decreases until a receiver live-lock condition appears for an offered load of 810 Mbps. During saturation, most losses occur in the IP input buffer.
For the software router with the 3 GHz central processing unit, the bus saturates for an offered load of 500 Mbps. The bus utilization at this point is 100% while the CPU utilization is around 70%. The system behavior for increasing offered loads depends on which priorities are used by the bus arbiter. The results shown here correspond to the case in which reception has priority over transmission. Observe that system throughput decreases with increasing offered loads. This can be explained if we observe that CPU utilization keeps increasing and that most losses occur in the driver's transmit queue. This indicates that the NIC-to-CPU transfer volume increases with increasing offered loads and hence the CPU processes more work. However, this work cannot be transferred to the output NIC, as the bus is saturated by the NIC-to-CPU transfers. If NIC-to-CPU and CPU-to-NIC transfers have the same priority, the CPU never reaches saturation, and losses can occur either in the NIC's reception queue or in the device driver's output transmission queue. Consequently, the basic system cannot provide a fair share of the resources when it is in saturation.

Figure 3.17—Queuing network model for a BSD-based software router with two network interface cards attached to it and three packet flows traversing it

Figure 3.18—Basic system's performance analysis: a) system's overall throughput; b) per-flow throughput share for system one; c) per-flow throughput share for system two

Mogul and Ramakrishnan [1997] based software router—We performed similar experiments with a model of a software router that includes the modifications proposed by Mogul and Ramakrishnan. The proposed modification's objective is to eliminate the receiver live-lock pathology. This is accomplished, basically, by turning off the software interrupt mechanism and driving networking processing with a polling scheduler during a software router's busy period. Consequently, networking processing is no longer conducted through a pipeline but is done as a run-to-completion task. From the queuing network modeling point of view, this results in the aggregation of all networking-related service queues into a single queue. Mogul and Ramakrishnan showed that the resulting total IP processing time is actually shorter than the sum of the times for each individual phase. They showed that the added polling scheduler's processing cost is compensated by the dropped IP queue manager's processing cost. Moreover, the run-to-completion processing is implemented with fewer function calls. We took advantage of these observations and, considering that the basic networking phases' service times are independent, we computed the service time of the aggregated service as the sum of the basic networking phases' service times.
Figure 3.19 shows the queuing network model for a Mogul and Ramakrishnan based software router, and Figure 3.20 shows performance data computed from this model when parameterized for the two software routers described in this section's introduction. Naturally, the offered load was similar to the one used for the basic system. As can be seen, and as expected, the performance of the two modified software routers does not change with respect to that of the basic system when operating below saturation. Moreover, they reach saturation at more or less the same offered load, although now the throughput degradation is gone thanks to the receiver live-lock elimination mechanism. Yet the system still does not achieve a fair share of the software router's resources.

Figure 3.19—Queuing network model for a Mogul and Ramakrishnan [1997] based software router with two network interface cards attached to it and three packet flows traversing it

Figure 3.20—Mogul and Ramakrishnan [1997] based software router's performance analysis: a) system's overall throughput; b) per-flow throughput share for system one; c) per-flow throughput share for system two

Mogul and Ramakrishnan [1997] based software router with a QoS aware CPU scheduler—When a software router includes the modifications proposed by Mogul and Ramakrishnan, it seems possible to us to introduce a QoS-aware CPU scheduler such as a WFQ scheduler. Indeed, once the polling scheduler is in control, it may use whatever policy it implements to select the next packet for processing. This scheme is proposed by Qie et al [2001], although for an operating system other than UNIX. Figure 3.21 shows the queuing network model for this kind of software router and Figure 3.22 shows performance data computed from this model when parameterized for the two software routers described in this section's introduction. As can be seen, the considered networking architecture can only support QoS communications when the CPU is the system's bottleneck, as in the case of the software router with a CPU operating at 1 GHz (Figure 3.22.a). However, when the bus becomes the system's bottleneck, a fair share of the router's resources is not achieved (Figure 3.22.b). Incidentally, the system's overall throughput chart has been omitted from Figure 3.22, as it is identical to the one presented in Figure 3.20.

Figure 3.21—Queuing network model for a software router including the receiver live-lock avoidance mechanism and a QoS aware CPU scheduler, similar to the one proposed by Qie et al [2001].
The software router has two network interface cards and three packet flows traverse it.

Figure 3.22—Qie et al [2001] based software router's performance analysis: a) per-flow throughput share for system one; b) per-flow throughput share for system two

3.9 Summary

• A queuing network model of a software router gives accurate results
• Single-queue models of software routers ignore chief system features
• Characterizing the system was hard, not because of the studied system but due to the required process
• Armed with a mature process, characterization becomes straightforward
• A model's service times computed for some system may be used for predicting the performance of other systems, if scaled appropriately
• Service times scale linearly with CPU operation speed but can be considered constant with respect to message and routing table sizes
• Service times' offsets vary with the network interface card's and device driver's technology and with cache memory performance
• With the advent of CPUs working above 1 GHz, the I/O bus is the bottleneck

Chapter 4 Input/output bus usage control in personal computer-based software routers

4.1 Introduction

In this chapter we address the problem of resource sharing in personal computer-based software routers when supporting communication quality-assurance mechanisms, widely known as quality-of-service (QoS) mechanisms. Others have put forward solutions that focus on suitably distributing the workload of the central processing unit. However, the increase in central processing unit (CPU) speed relative to that of the input/output (I/O) bus means attention must be paid to the effect the limitations imposed by this bus have on the system's overall performance. Here we propose a mechanism that jointly controls both the I/O bus and the CPU operation. This mechanism involves changes to the operating system kernel code and assumes the existence of certain network interface card functions, although it does not require changes to the personal computer hardware. Here we also present a performance study that provides insight into the problem and helps to evaluate both the effectiveness of our approach and several software router design trade-offs.

The chapter is organized as follows. Section 4.2 presents the problem, section 4.3 discusses our proposed solution and section 4.4 presents a performance study of the proposed mechanism in isolation, carried out by simulation. Then, section 4.5 considers the performance of a software router incorporating the proposed mechanism, also studied by simulation. Section 4.6 discusses a particular implementation of the proposed mechanism and presents measured performance data. Finally, section 4.7 summarizes.
4.2 The problem

Given that there are inherent performance limitations in the architecture of a software router, but also that there are reasons that make its use attractive (see the previous chapter's section 3.3), the question of how to optimize its performance arises. In addition, if we want to support QoS mechanisms, we must find suitable ways of sharing resources and, as Nagle said, of providing protection, so that ill-behaved sources can have only a limited negative impact on well-behaved ones [Nagle 1987]. There are two different aspects to the problem of resource sharing: the fair share of the communications links and the fair share of the router's resources, mainly its central processing unit (CPU) and input/output (I/O) bus. The first aspect affects output packet flows that share a single network interface card, while the second aspect affects all packet flows that go through the router. We focus our work on the second aspect of this problem.

In other works, the problem of fairly sharing software router resources is tackled in terms of protecting [Indiresan, Mehra and Shin 1997; Mogul and Ramakrishnan 1997] or sharing [Druschel and Banga 1996; Qie et al 2001] the use of the central processing unit amongst different packet flows in an efficient way. However, the increase in central processing unit speed relative to that of the I/O bus makes it easy for this bus to become a bottleneck. This is why, in this chapter, we address the problem considering not only the sharing of the software router's central processing unit but also of its I/O bus.

4.3 Our solution

The mechanism we propose for implementing input/output (I/O) bus sharing between differentiated packet flows manipulates the vacancy space of each direct memory access receive channel (see chapter 2's subsection 2.6.6 for a description of direct memory access channels) in such a way that the overall I/O bus activity follows a schedule similar to one produced by a general processor sharing (GPS) server [Demers, Keshav and Shenker 1989]. The mechanism, which we named the Bus Utilization Guard (BUG), acts as a flow scheduler that indirectly regulates the I/O bus utilization. Packet flows wanting to traverse the router have to be registered with the BUG. This may be accomplished either manually or by means of a resource reservation protocol such as RSVP. In any case, a packet flow is required to submit a target resource utilization when registering, and the system is required to manage flow identifications for the packet flows it admits. Note that the system has to be able to map each differentiated packet flow to a direct memory access channel, DMA channel for short. Figure 4.1 shows the BUG's system architecture.

Figure 4.1—The BUG's system architecture. The BUG is a piece of software embedded in the operating system's kernel that shares information with the network interface cards' device drivers and manipulates the vacancy of each DMA receive channel
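The actuation point the BUG relies on is the vacancy of each DMA receive channel: the NIC can only transfer packets into descriptors the driver has made available. The following is a minimal illustrative sketch of that idea, with invented names and structure; it is not based on any real driver's data structures.

#include <cstddef>

// Illustrative model of a per-flow DMA receive channel. The NIC can only
// DMA packets into descriptors that the driver has marked vacant, so by
// bounding the number of vacant descriptors per activation period the
// kernel indirectly bounds that flow's share of I/O bus transfers.
struct DmaRxChannel {
    static constexpr std::size_t RING_SIZE = 128;   // nominal FreeBSD ring size
    std::size_t vacant = 0;                         // descriptors the NIC may still fill

    // Called once per activation period with the packetized bus-utilization
    // grant for this flow, expressed in packets.
    void set_vacancy(std::size_t grant_packets) {
        vacant = (grant_packets < RING_SIZE) ? grant_packets : RING_SIZE;
    }

    // Called from the receive path: the NIC consumes one vacant descriptor
    // per packet it transfers across the bus.
    bool try_receive_one() {
        if (vacant == 0) return false;              // grant exhausted: the NIC must hold the packet
        --vacant;
        return true;
    }
};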
4.3.1 BUG's specifications and network interface card's requirements

The mission of the BUG is to assist in supporting QoS system behavior by indirectly controlling input/output bus usage following a GPS-like policy. (From here on we refer to the input/output bus simply as the bus.) We conceived the BUG as either a piece of software within the operating system's kernel or as a hardware add-on attached to the AGP connector. We wanted the BUG not to require any change to the host computer's hardware architecture. However, the BUG still requires network interface cards to keep a running sum of packets and bytes received per differentiated packet flow. Moreover, the BUG can only differentiate packet flows when they are mapped to separate DMA channels. Therefore, if packet flows entering the router through a network interface card are to be differentiated, the card must support several DMA channels—one channel per differentiable packet flow.

Because the BUG protects a rather fast resource, and due to its software-only implementation requirement, it was important for the BUG to have low overhead. Moreover, because the BUG uses the very bus it strives to protect (for polling information from the network interface cards and for configuring them, as will be explained later), it was important for the BUG to be as unintrusive as possible. Table 4-I summarizes these specifications and requirements.

TABLE 4-I
BUG's specifications                    Network interface card's requirements
Avoid host hardware modifications       Per packet-flow DMA channels
Low overhead                            Ability to keep some per-flow state information
Avoid intrusion

4.3.2 Low overhead and intrusion

In order to minimize overhead and intrusion, we devised the BUG to get activated only every T seconds, where T >> τ, the worst-case time required for executing the BUG's activities upon activation. Figure 4.2 illustrates this behavior. As will be shown, τ influences packet latency because, during the time the BUG is running, no packet may traverse the protected bus.

May T be arbitrarily large? As will be shown, the BUG is a reactive mechanism that, from a summary of what has happened on the bus during the last T seconds, adjusts the DMA receive channels' vacancy spaces so that the bus's behavior during a busy period—a train of T periods at 100% utilization—resembles that of a GPS server. Consequently, a priori, T cannot be arbitrarily large. Intuitively, there must be a Tmax beyond which either the BUG cannot react fast enough or the required adjustments are unfeasible to perform. Besides, as will be shown, there is a proportional relationship between T and the BUG's main memory storage requirement. Indeed, a priori, the mean number of packets that the BUG has to be aware of increases with T, and more packets imply more mbuf descriptors, which require more main memory. (Strictly speaking, the DMA channels are the ones requiring the added memory, not the BUG.)

To further reduce intrusion, we devised the BUG as a bistate mechanism. At every activation, the BUG first determines how much of the bus's capacity has been used during the last T seconds. If the bus has not been one hundred percent utilized, the BUG enters the monitoring state and no further action is taken. Otherwise, the BUG enters the enforcing state and executes all its tasks. Figure 4.2 also illustrates this behavior.
4.3.3 Algorithm

The BUG emulates a packet-by-packet GPS server with batched arrivals. This GPS server's inputs are computed from data polled from the network interface cards and their device drivers. Figure 4.3 shows an example scenario. Appendix B lists the C++ language code of the algorithm's implementation used for the simulation experiments described in sections 4.4 and 4.5.

Assume that the mechanism is in the monitoring state at cycle k·T. Then, the mechanism gathers Di,k—the number of bytes transferred through the bus during the period ((k-1)·T, k·T) by DMA receive channel i, channel-i for short. If sum(Di,k) < T/βBUS, where βBUS is the cost per bit of bus transfer, the mechanism remains in the monitoring state and no further actions are taken. Otherwise, the mechanism detects the start of a busy period and enters the enforcing state. When in this state, the mechanism polls each network interface card to gather Ni,k—the number of bytes stored at the network interface card associated with channel-i—and computes the amount of bus utilization granted to each DMA receive channel, γi,k. This is done from the outputs of the emulated GPS server, Gi,k. The emulated GPS server's (or just the GPS server's) inputs are the Ni,k at the start of a busy period. Afterwards, the inputs are the amount of traffic arrived during the last period, Ai,k = Ni,k − Ni,k-1 + Di,k. The BUG is work-conserving and thus:

γi,k = Gi,k + (T/βBUS − (G1,k + … + GN,k))

Figure 4.2—The BUG's periodic and bistate operation for reducing overhead and intrusion (monitoring versus enforcing: poll NICs, compute shares, adjust vacancy spaces)

Observe that under the work-conserving premise:

sum(γi,k) ≥ T/βBUS

This situation may lead to an unfair share. Consequently, the BUG is equipped with an unfairness-counterbalancing algorithm. This algorithm, basically, computes an unfairness level per DMA receive channel, ui,k, which is positive for depriver packet flows and negative for deprived ones. Then, the BUG reduces or augments the bus utilization grant, γi,k, of every depriver or deprived packet flow, respectively, by an amount proportional to the corresponding unfairness level. That is:

γ*i,k = γi,k − ui,k

One problem with this approach is that if unfairness is detected then:

sum(γ*i,k) ≤ T/βBUS

That is, the unfairness-counterbalancing algorithm may artificially produce some bus idle time. This problem also arises when packetizing bus utilization grants, as explained shortly. Happily, a single mechanism—one that allows the BUG to vary the length of its activation period—solves both problems. The BUG switches from the enforcing state back to the monitoring state at the start of any activation instant at which sum(Di,k) < T/βBUS. When this switch occurs, the BUG resets the emulated GPS server and stops regulating the bus usage by giving away:

γi,k = T/βBUS
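The following is a compact sketch of one BUG activation as just described. It is written in the spirit of the Appendix B simulator but is not taken from it: the GPS emulation is replaced by a crude equal-weight stand-in (the experiments in this chapter use equal shares), βBUS is folded into a per-period byte capacity, and all names are illustrative.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Crude equal-weight stand-in for the emulated GPS server: it repeatedly
// splits the remaining capacity equally among still-backlogged flows.
struct GpsEmulator {
    std::vector<double> run(std::vector<double> backlog, double capacity) {
        std::vector<double> out(backlog.size(), 0.0);
        double remaining = capacity;
        while (remaining > 1e-9) {
            std::size_t active = 0;
            for (double b : backlog) if (b > 1e-9) ++active;
            if (active == 0) break;
            double share = remaining / active;
            for (std::size_t i = 0; i < backlog.size(); ++i) {
                if (backlog[i] <= 1e-9) continue;
                double served = std::min(share, backlog[i]);
                out[i] += served;
                backlog[i] -= served;
                remaining -= served;
            }
        }
        return out;
    }
};

// One BUG activation: decide monitoring versus enforcing from the transferred
// byte counts D, then compute work-conserving grants from the GPS outputs.
struct Bug {
    double capacity;                 // T / beta_BUS, expressed here in bytes per period
    bool enforcing = false;
    std::vector<double> prevN;       // Ni,k-1: bytes queued at each NIC at the previous instant
    GpsEmulator gps;

    std::vector<double> activate(const std::vector<double>& D,   // Di,k
                                 const std::vector<double>& N) { // Ni,k
        double used = std::accumulate(D.begin(), D.end(), 0.0);
        if (used + 1e-9 < capacity) {                    // bus not fully utilized: monitor
            enforcing = false;
            return std::vector<double>(D.size(), capacity);      // unrestricted grants
        }
        std::vector<double> in(D.size());                // GPS inputs
        for (std::size_t i = 0; i < D.size(); ++i)
            in[i] = enforcing ? (N[i] - prevN[i] + D[i]) // arrivals Ai,k
                              : N[i];                    // NIC occupancy at busy-period start
        enforcing = true;
        prevN = N;
        std::vector<double> G = gps.run(in, capacity);
        double idle = capacity - std::accumulate(G.begin(), G.end(), 0.0);
        std::vector<double> grant = G;
        for (double& g : grant) g += idle;               // work-conserving give-away of idle time
        return grant;                                    // gamma_i,k
    }
};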
4.3.4 Algorithm's details

The BUG requires detecting when the bus is fully utilized. One approach to do this is asking the bus for its accumulated idle time at each activation instant. If there is a difference between the current value and the one read previously, then the bus was not fully utilized during the last T time units. The problem with this approach is that the PCI bus specification does not define such a thing as bus accumulated idle time [Shanley and Anderson 2000]. In its absence, the BUG computes the bus utilization level from the received byte counts it collects at each activation instant from the network interface cards' device drivers.

Figure 4.3—The BUG's packet-by-packet GPS server emulation with batch arrivals (panels: arrival/departure patterns, GPS input, and GPS bus capacity share)

The BUG normally works periodically with period T, but the implementation of the unfairness counterbalancing mechanism, as explained shortly, and the bus-utilization grants' packetization requirement, as also explained shortly, may induce some bus idle time if nothing else is done. To circumvent this problem, after an enforcing-state activation during which the emulated GPS server ran at 100% utilization, we allow the BUG to adjust its activation period. In this case, the BUG is granting the use of the bus for a time equal to sum(γ*i,k/βBUS), where γ*i,k is the packetized bus-utilization grant of packet flow i at activation instant k, after being adjusted for unfairness if necessary. Consequently, to avoid any bus idle time, the BUG sets its next activation instant sum(γ*i,k/βBUS) into the future. Observe that this value is equal to T − dt, where it is expected that dt << T thanks to the BUG implementation's properties. Note that this activation period adjustment is only required at the considered activation instant.

The BUG is a work-conserving mechanism in which any idle time detected at the emulated GPS server is given away for all packet flows to use. Moreover, when in the monitoring state it sets all use-grants to a value proportional to a complete T period, so the bus can get busy as soon as possible. Figure 4.4 shows an example scenario. This has two evident implications. On the good side, packets do not experience any initial waiting. On the bad side, some unfairness may arise because some packet flow may use more than its solicited share to the detriment of other packet flows. Happily, unfairness may only occur during a bus busy period. Therefore, the BUG only needs to look for and correct unfairness when in the enforcing state. Observe that the emulated GPS server cannot deal with any unfairness because the latter is produced outside its scope. Thus, any counterbalancing mechanism has to be implemented as part of the BUG's control logic. Of course, unfairness may be avoided altogether by implementing a non-work-conserving mechanism, in which packet arrivals occurring during an activation period get artificially batched at the next activation instant (by appropriately manipulating the DMA receive channels' vacancy spaces). But then packets would experience added latency. We have also implemented a non-work-conserving BUG and it works well throughput-wise. However, here we only present results from the work-conserving version because we believe it is a better overall solution.

Figure 4.4—The BUG is work conserving (panels: GPS input, GPS bus capacity share, and the flow grants given away by the BUG)
When in the enforcing state, the BUG has to check for unfairness. One way to do this is by comparing, at every activation instant of a busy period, two running sums per packet flow: sumi(Gi,k) and sumi(Di,k). Then, unfairness may be computed as:

ui,k = sumi(Di,k) − sumi(Gi,k)

A positive difference denotes a depriver packet flow, while a negative value denotes a deprived one. Every depriver packet flow has its bus utilization grant, γi,k, reduced by ui,k and every deprived packet flow gets its grant augmented by ui,k. Any γi,k that becomes negative is set to zero. Figure 4.5 shows an example scenario.

Figure 4.5—The BUG's unfairness counterbalancing mechanism (panels: flow grants, possible actual use, unfairness, and next flow grants, supposing a (4,4,4) arrival vector)

Admittedly, the unfairness counterbalancing mechanism just explained needs refinement. We can see that in a scenario where a router were "attacked" by overly misbehaving packet flows, the levels of unfairness would be so high that, with high probability, the average activation-period length during a busy period, call it T', would be relatively small, that is, T' << T. This is clearly undesirable since, in this case, the BUG would be consuming too much of the router's resources. In favor of the BUG, allow us to comment that under the considered scenario of packet flows threading a router supporting QoS, where the packet flows would need to pass an admission control and would be policed to ensure QoS level agreements, we expect only small levels of unfairness. Indeed, in this case, unfairness would be primarily due to computer networks' stochastic behavior, which produces packet bursts. Furthermore, we see that, by allowing some added complexity in the BUG, it is possible to extend it for detecting relatively high levels of unfairness and proceeding accordingly. One such addition would involve computing the sum of ui,k over the deprived packet flows and reducing γi,k for each depriver packet flow only by a proportional fraction of that sum. We leave this as future work.

There is another difficulty with the unfairness counterbalancing mechanism: the initialization of sumi(Gi,k). We believe that an initial value of zero is wrong, simply because of our definition of "enforcing state". Indeed, if the BUG is at the first enforcing-state activation of a busy period, then the bus has just been 100% utilized and the BUG needs to include a measure of that utilization in the unfairness formula. Unfortunately, because of our definition of "monitoring state", the BUG has no way of computing the previous period's Gi,k values. At the point in time in question, the only thing related to the previous period's Gi,k values that the BUG knows for sure is the set of QoS levels requested by the packet flows threading the router. Consequently, we use that set of values for initializing sumi(Gi,k). Furthermore, we have observed that at the first enforcing-state activation of a busy period, the bus and the emulated GPS server are not in synch; that is, the bus is 100% utilized while the emulated GPS server is not. This results in Gi,k values inappropriate for the unfairness counterbalancing mechanism: they are too low. Consequently, if during a first enforcing-state activation of a busy period the bus and the emulated GPS server are not in synch, the BUG accumulates not the Gi,k values but the set of QoS levels requested by the packet flows.
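The following is a minimal sketch of the unfairness counterbalancing step, under the same simplified byte-valued representation as the previous sketch. The seeding of the GPS-output sums with the requested QoS shares follows the discussion above; the surrounding bookkeeping and names are assumptions.

#include <algorithm>
#include <cstddef>
#include <vector>

// Per-flow running sums kept only while a busy period lasts.
struct UnfairnessState {
    std::vector<double> sumG;   // accumulated GPS outputs (or QoS shares when out of synch)
    std::vector<double> sumD;   // accumulated departures (bytes actually transferred)
};

// Start of a busy period: seed sumG with the requested QoS shares rather
// than zero, and zero the departure sums.
void start_busy_period(UnfairnessState& s, const std::vector<double>& qos_shares) {
    s.sumG = qos_shares;
    s.sumD.assign(qos_shares.size(), 0.0);
}

// One enforcing-state activation: accumulate, compute ui,k = sumD - sumG,
// and counterbalance the grants; negative results are clipped to zero.
std::vector<double> counterbalance(UnfairnessState& s,
                                   const std::vector<double>& G,     // Gi,k (or QoS shares if out of synch)
                                   const std::vector<double>& D,     // Di,k
                                   std::vector<double> grant) {      // gamma_i,k from the GPS step
    for (std::size_t i = 0; i < grant.size(); ++i) {
        s.sumG[i] += G[i];
        s.sumD[i] += D[i];
        double u = s.sumD[i] - s.sumG[i];          // positive: depriver, negative: deprived
        grant[i] = std::max(0.0, grant[i] - u);    // reduce deprivers, augment the deprived
    }
    return grant;
}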
The BUG is required to packetize the computed bus utilization grants before adjusting the vacancy spaces of the DMA receive channels. This is required because, while the GPS algorithm's information unit is the byte, the information unit for DMA channels is the packet. Figure 4.6 shows an example scenario. When packetizing utilization grants it may happen that modulus(γi,k, Li) ≠ 0, where Li is the mean packet length for channel-i. Hence, some rounding off is required. We have tested rounding off both down and up, and both produce particular problems; however, the former gave us a more stable mechanism. Of course, if we let the BUG round off its bus utilization grants, then its emulated GPS server will get out of synch with respect to what is happening on the bus. Therefore, we also programmed the BUG to adjust the state of its emulated GPS server accordingly. If nothing else is done, some bus idle time is artificially produced and the overall share assigned to the affected packet flow would be much less than what it should be. This problem, and the one induced by the unfairness counterbalancing mechanism, can be solved if we let the BUG reduce its next activation period length by some dt time value. Evidently, this increases the BUG's overhead. But as long as dt is a small fraction of T, the overhead increase remains at acceptable levels.

Figure 4.6—The BUG's bus utilization grant packetization policy. In the considered scenario, three packet flows with different packet sizes traverse the router and the BUG has granted each an equal number of bus utilization bytes. Packet sizes are small, medium and large respectively for the orange, green and blue packet flows. After packetization, some idle time gets induced. (Panels: byte grants, packetized grants, back-to-back packetized grants.)

4.3.5 Algorithm's a priori estimated costs

When in the monitoring state, the BUG only needs "to keep an eye" on the bus utilization at every activation instant, and thus we can say this task's costs are low. However, this task still requires O(n) work, where n is the number of DMA receive channels associated with network interface cards. When in the enforcing state, the BUG performs a sequence of tasks involving loops: assessing bus utilization, computing the emulated GPS server's inputs, emulating the GPS server itself, and adjusting the packet flows' bus grants. Of all these tasks, the emulation of the GPS server requires the largest number of instructions. Shreedhar and Varghese [1995] state that a naive implementation of a GPS server requires O(log(m)) work per packet, where m is the number of packets in the router. However, Keshav [1991] shows that a good implementation requires O(log(n)), where n is the number of active packet flows. Section 4.6 reports the a posteriori measurements of the BUG's costs.
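Before walking through the example scenario, here is a minimal sketch of the grant packetization and activation-period shortening described in subsection 4.3.4: grants are rounded down to whole packets of the flow's mean length, and the next activation is scheduled right after the time the packetized grants keep the bus busy. For simplicity the bus cost is taken per byte here (the text defines βBUS per bit), and all names are illustrative.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct PacketizedSchedule {
    std::vector<double> grant_bytes;   // packetized grants gamma*_i,k
    double next_period;                // shortened activation period, T - dt
};

// Round each byte grant down to a whole number of packets of the flow's
// mean packet length, then shorten the next activation period so that no
// bus idle time is artificially introduced.
PacketizedSchedule packetize(const std::vector<double>& grant,      // gamma_i,k in bytes
                             const std::vector<double>& mean_len,   // Li in bytes
                             double cost_per_byte,                  // bus transfer cost [s/byte]
                             double T) {
    PacketizedSchedule out;
    out.grant_bytes.resize(grant.size());
    double busy_time = 0.0;
    for (std::size_t i = 0; i < grant.size(); ++i) {
        double pkts = std::floor(grant[i] / mean_len[i]);   // round down: more stable in practice
        out.grant_bytes[i] = pkts * mean_len[i];
        busy_time += out.grant_bytes[i] * cost_per_byte;
    }
    out.next_period = std::min(T, busy_time);               // equals T - dt, with dt << T expected
    return out;
}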
4.3.6 An example scenario

In order to see how all this works, allow us to present a step-by-step description of an example operation scenario for the BUG. The scenario considers the operation of the BUG-protected bus in isolation. Three packet flows load the system. Each packet flow solicits one third of the bus capacity. All packets are of the same size and six packets fully occupy the bus. Figure 4.7 shows a picture of the description that follows. In the picture, time runs downwards. Marks on the time axis denote the BUG's nominal activation instants and thus are spaced by T seconds. The picture shows the BUG's operation variables as vectors. For each vector, dimension i corresponds to packet flow i. Table 4-II defines the vectors used.

TABLE 4-II
Vector   Definition
A        Number of packet arrivals during the last activation period
D        Number of packet departures during the last activation period
N        Number of packets waiting at the buffers of some NIC
G        Outputs produced by the emulated GPS
Gac      Running sum of GPS outputs
Dac      Running sum of departures
U        Unfairness (previous Dac + D − previous Gac)
γ        Use grant given by the BUG

Because vectors A and D denote events occurring anytime during the previous activation period, we show them between activation instants. The rest are computed, or considered, at every activation instant and therefore we show them grouped and pointed at the corresponding instant. Observe that the D values are arbitrary and that, to simplify things, all units are given in packets, not bytes.

Figure 4.7—An example of the behavior of the BUG. Vectors A, D, N, G and γ are defined as A = (A1, A2, A3), etc. It is assumed that the system serves three packet flows with the same QoS shares (2,2,2) and with the same packet lengths, and that in a period T up to six packets can be transferred through the bus. The BUG does not consider all variables at all times; at every activation instant, the variables that the BUG ignores are printed in gray.

At the first activation instant, the BUG measures a bus utilization level of 50%. Consequently, the BUG enters the monitoring state, sets the use-grants per packet flow to six, resets the emulated GPS server's state, and resets the running sums of the unfairness counterbalancing mechanism: the running sums of departures are zeroed but the running sums of GPS outputs are set to the packet flows' QoS shares. At the second activation instant, the BUG measures a bus utilization level of 100% and thus enters the enforcing state, emulates the GPS server, checks for unfairness, sets use-grants from the GPS's outputs and the unfairness levels, and allows the GPS server to keep its state. Because this is the first time into the enforcing state, the GPS server's inputs are taken from the NIC occupation levels. In this case, the inputs do not saturate the GPS server, which reports some idle time. At the same time, the BUG detects that a packet flow is being deprived of some solicited bus time.
Therefore, depriver packet flows have their use-grants reduced and deprived ones augmented. At the same time, all packet flows get additional use-grants proportional to the GPS server's idle time. As the GPS server is not in synch with the bus, the BUG updates the running sums of GPS outputs with the QoS shares, not with the actual GPS outputs.

At the third activation instant, the BUG again measures a bus utilization level of 100% and remains in the enforcing state. This time, the GPS server's inputs are taken from the arrivals. The GPS server now gets 100% used and thus no idle-time-related use-grants are added. However, the BUG still detects a packet flow being deprived of some solicited bus time, and thus the GPS server's outputs get adjusted appropriately for computing the use-grants. At the fourth activation instant, the BUG remains in the enforcing state. This time the GPS server gets 100% used again and no packet flow is detected as being deprived. Therefore, use-grants are taken directly from the GPS server's outputs. The fifth activation instant is pretty much the same as the previous one, apart from vector D. At the sixth activation instant, the BUG enters the monitoring state and thus sets the use-grants per packet flow to six and resets the GPS server's state.

4.4 BUG performance study

The BUG has several operational latitudes, as the previous section shows, like the activation period value and its variation, the effectiveness of the rounding-off policy, or the influence that highly variable traffic has on the dual-mode operation. In order to assess how, and how much, these latitudes influence overall performance, we devised a series of simulation experiments. At the same time, these experiments allowed us to see how well a PCI bus controlled by a BUG approximates a bus ideally supporting QoS, like a weighted fair queuing (WFQ) bus.

4.4.1 Experimental setup

We studied the performance of a BUG-regulated PCI bus to which three network interface cards were attached. For the comparison studies we similarly configured a hypothetical, ideal WFQ bus and a plain PCI bus. Each bus was traversed by three packet flows, each coming from one network interface card. The buses were modeled with queuing networks. Figure 4.8 shows these models. We approximated the PCI bus operation by a server using a round robin scheduler. Operational parameters for all buses were computed after a 33 MHz, 32-bit PCI bus. Data links are assumed to sustain one gigabit per second of throughput. We used a simple yet meaningful QoS differentiation: the packet size. Indeed, as reported elsewhere [Shreedhar and Varghese 1996], round robin scheduling is particularly unfair to packet flows with different packet sizes. The packet flows used had the features shown in Table 4-III. Different experiments used different interarrival processes to show particular behaviors.

TABLE 4-III
          Packet length (bytes)    Solicited share
Flow 1    172                      1/3
Flow 2    558                      1/3
Flow 3    1432                     1/3
Figure 4.8—Queuing network models for: a) a PCI bus, b) a WFQ bus, and c) a BUG-protected PCI bus; all with three network interface cards attached and three packet flows traversing them

4.4.2 Response to unbalanced, constant packet rate traffic

Figure 4.9 shows the buses' responses to unbalanced, constant packet rate traffic, for three different operation scenarios for each of the three considered buses. Each line in every chart denotes the running sum of output bytes over time for one of the three considered packet flows. Charts in the left, center and right columns correspond to the WFQ bus, the plain PCI bus, and the BUG-equipped PCI bus, respectively. Each row of charts corresponds to a particular traffic pattern. The traffic pattern for row (a) was as follows. At time zero, flow1 and flow2 start loading the system with a load level equivalent to 45% of a PCI bus capacity each; that is, 475.2 Mbps. Two milliseconds later (first arrow), or 20 times the BUG's activation period, flow3 starts loading the system, also at 475.2 Mbps. Then, two milliseconds later (second arrow), flow3 augments its load to 1 Gbps. The traffic patterns for rows (b) and (c) are similar but with the packet flows' roles changed.

Figure 4.9—BUG performance study: response comparison to unbalanced constant packet rate traffic between a WFQ bus, a PCI bus and a BUG-protected PCI bus; left, center and right columns, respectively. In row (a) flow3 is the misbehaving flow, while flow2 and flow1 are for rows (b) and (c), respectively. (T = 0.1 ms; worst-case buffer of 76 packets.)

For these experiments, we set the BUG's nominal activation period to 0.1 milliseconds.
At this value, and given the considered bus speed, the worst-case DMA receive channel size is 76 packets, which corresponds to flow1. Putting this in implementation perspective, we may say that 76 mbufs corresponds to little more than half the nominal DMA channel size for FreeBSD, which is 128. At the same time, a 0.1 millisecond period is only 10 times smaller than the nominal FreeBSD real-time clock period, and thus feasible to implement. On the other hand, this implies that the BUG should take no more than 10 microseconds to execute if we want the overhead premise T >> τ to hold. For a software router with a 1 GHz central processing unit this means 10 thousand cycles. A priori, that should be enough.

From Figure 4.9's left-most column we can see that during the first two milliseconds the ideal bus allows a 50% bus share between the two active packet flows. Then, after the third packet flow becomes active, the bus allows a 33% bus share irrespective of the load level of the so-called misbehaving packet flow. (The small share differences are due to WFQ's well-known misbehavior upon packet bursts. Zhang and Keshav [1991] explain this.) From Figure 4.9's middle column we can see that a plain PCI bus only adequately follows the ideal behavior during the first two milliseconds. At that point in time (first arrow), the round robin scheduling deprives flow1 of having enough bus time in favor of both flow2 and flow3. Moreover, flow3 is never deprived, and flow2 is, when flow3 gets greedy after the second arrow in row (a). From Figure 4.9's right-most column we can see that the BUG-equipped PCI bus behaves very much like the ideal bus does.

4.4.3 Study on the influence of the activation period

We repeated the experiments of the previous subsection, but only for the BUG-protected PCI bus and augmenting the BUG's activation period T. We wanted to see if we could find any macroscopic problems related to this operational parameter. Per the BUG's algorithm description, as T becomes relatively larger, small-scale injustices appear; we wanted to see if these microscopic injustices might show up macroscopically and how. We incremented T until the BUG-regulated DMA channel's size became unreasonably large. Indeed, as stated in subsection 4.3.2, there is a proportional relationship between T and the BUG-regulated DMA channels' sizes. As stated in the previous subsection, for the considered bus speed and with T equal to 0.1 milliseconds the BUG requires 76 mbufs in the worst case, or a little more than half the nominal DMA channel size for FreeBSD. With T equal to 0.5 milliseconds the required DMA channel size is 383, already more than double the normal size. And with T equal to 10 milliseconds each DMA channel requires 7674 slots, or 14.9 Mbytes of non-swappable, or wired, memory. (Here 1 Mbyte equals 2^20 bytes.) Recall that FreeBSD release 4.1.1 only uses mbuf clusters, and thus each DMA channel slot requires at least 2048 bytes.
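These buffer sizes follow from the worst case in which a single flow of minimum-size (172-byte) packets is granted the whole bus for one activation period. The following sketch reproduces the arithmetic, assuming the raw 33 MHz x 32-bit PCI bandwidth and 2048-byte mbuf clusters; it is a back-of-the-envelope check, not part of the simulator.

#include <cmath>
#include <cstdio>

int main() {
    const double bus_bytes_per_s = 33e6 * 4.0;   // raw 33 MHz x 32-bit PCI bandwidth
    const double min_pkt  = 172.0;               // smallest packet size in Table 4-III (flow1)
    const double cluster  = 2048.0;              // bytes per mbuf cluster in FreeBSD 4.1.1

    for (double T : {0.1e-3, 0.5e-3, 1e-3, 5e-3, 10e-3}) {
        double slots = std::floor(T * bus_bytes_per_s / min_pkt);   // worst-case ring size
        printf("T = %4.1f ms: %5.0f slots, %6.2f Mbytes wired per channel\n",
               T * 1e3, slots, slots * cluster / (1024.0 * 1024.0));
    }
    // T = 0.1 ms -> 76 slots; T = 0.5 ms -> 383; T = 10 ms -> 7674 (~14.9 Mbytes).
    return 0;
}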
Figure 4.10 shows the study's results and, as seen there, the BUG-protected PCI bus maintains its excellent macroscopic behavior. As in the previous figure, each line in every chart denotes the running sum of output bytes over time for one of the three considered packet flows. Departing from the previous figure, each row now corresponds to a particular experiment using a particular BUG activation period T, and each column now corresponds to a particular traffic pattern with respect to the misbehaving flow: flow1 for the left column, flow2 for the center column, and flow3 for the right column.

Figure 4.10—BUG performance study: on the influence of the activation period. (Departure traces for the BUG-protected bus with T = 0.5, 1, 5 and 10 ms and corresponding worst-case buffers of 383, 767, 3837 and 7674 packets.)

4.4.4 Response to on-off traffic

Figure 4.11 shows the buses' responses to on-off traffic, for two different operation scenarios for each of the three considered buses. Each line in every chart but one denotes the running sum of output bytes over time for one of the three considered packet flows; the extra line denotes the running sum of input bytes. Ideally, the output lines should appear almost on top of the input lines. Charts in the left, center and right columns compare the buses' responses for flow1, flow2 and flow3, respectively. Each row of charts corresponds to a particular value of the sources' on-state period length: a) eight times the BUG's activation period; b) half the BUG's activation period. Packet inter-arrival processes were Poisson with a mean bit rate equal to 3520 Mbps, or 300% of the PCI bus capacity. The sources' off-state period lengths were drawn from an exponential random process with a mean value equal to nine times the on-state period length. Consequently, every packet flow's overall mean bit rate was equal to 30% of the PCI bus capacity, or 352 Mbps.

Besides observing the system response to this kind of traffic, with these experiments we wanted to see if we could find any BUG pathology related to operating-mode cycles, where the continuous but random path into and out of the enforcing mode might produce some wrong behavior. Consequently, we ran several experiments with different on-off cycle lengths. Here we present results for two different on-state period lengths: Figure 4.11.a presents results when the on-state period is equal to eight times the BUG activation period, T, and Figure 4.11.b presents results when the on-state period is equal to 0.5T.
From both figures we can see that despite the traffic's fluctuations, the BUG follows the ideal WFQ policy quite well, while the PCI-like round robin policy again favors the largest-packet flow and hurts the smallest-packet flow the most. Furthermore, it seems that the BUG is not macroscopically sensitive to a traffic pattern that repeatedly takes it into and out of the enforcing mode.

Figure 4.11—BUG performance study: response comparison to on-off traffic between an ideal WFQ bus, a PCI bus, and a BUG-protected PCI bus (panels a and b)

4.4.5 Response to self-similar traffic

In order to evaluate the long-range behavior of a BUG-protected PCI bus, we performed an experiment feeding synthetic self-similar traffic to the simulators of the compared buses. This traffic trace was composed of 6.5 million packets, classified into the three packet flows described in Table 4-III, and had an average throughput of 125 Mbytes per second (here 1 Mbyte equals 10^6 bytes), or 100% of the maximum theoretical throughput of the PCI bus. The simulation run spanned 18.874 real-time seconds. Table 4-IV lists the variances of the correlation functions between the output-byte traces of the ideal WFQ bus and both the plain PCI bus (round robin) and the BUG-protected PCI bus, for different observation periods. In this table we can see that the correlation variance values for the plain PCI bus are higher than the values for the BUG-protected PCI bus. Moreover, the differences get relatively larger with the observation period, although the increase is not proportional to the size of the observation period. For instance, Flow1's relative difference at an observation period of 0.1 milliseconds is around 1.845 (14.36 / 7.78), at 1 millisecond it is 3.02, and at 2.5 milliseconds it is 3.342. Evidently, these results are a good news, bad news case, where the good news is, of course, that the BUG-protected PCI bus better follows the long-range behavior of the ideal WFQ bus when compared to a plain PCI bus, even under a very high variability operation scenario. However, the variance values for the BUG-protected PCI bus are somewhat higher than we expected. We recognize that it would be interesting to dig further into this issue; however, we are leaving it as future work.

TABLE 4-IV
              Plain PCI bus                    BUG protected PCI bus
Period [ms]   Flow1     Flow2    Flow3         Flow1    Flow2    Flow3
0.1           14.36     2.70     1.62          7.78     1.58     0.95
0.5           49.24     7.51     4.73          19.47    3.29     2.09
1             81.39     11.68    7.39          26.90    4.67     3.01
1.5           109.88    15.36    9.76          33.35    5.81     3.68
2             133.63    18.21    11.73         39.42    7.13     3.89
2.5           156.78    21.11    13.63         46.91    8.04     4.66

Before passing to another issue, we give here the details of how we produced the self-similar traffic trace. We used Christian Schuler's program (Christian Schuler, Research Institute for Open Communication Systems, GMD FOKUS, Hardenbergplatz 2, D-10623 Berlin, Germany), which implements Vern Paxson's fast approximation method for self-similar traffic [Paxson 1997]. Given a Hurst parameter, a mean value, and a variance, Schuler's program produces a list of numbers. Each number represents a count of packet arrivals within an arbitrary period. Schuler's program does not give any meaning to this period, leaving its definition to the program's user. Paxson's method uses a fractional Gaussian noise process and consequently the output of Schuler's program may contain negative numbers. The given mean and variance values influence the relative count of these negative numbers.
Before moving to another issue, we give here the details of how we produced the self-similar traffic trace. We used Christian Schuler's program (Christian Schuler, Research Institute for Open Communication Systems, GMD FOKUS, Hardenbergplatz 2, D-10623 Berlin, Germany), which implements Vern Paxson's fast approximation method for synthesizing self-similar traffic [Paxson 1997]. Given a Hurst parameter, a mean value, and a variance, Schuler's program produces a list of numbers. Each number represents a count of packet arrivals within an arbitrary period. Schuler's program does not give any meaning to this period, leaving its definition to the program's user. Paxson's method uses a fractional Gaussian noise process, and consequently the output of Schuler's program may contain negative numbers. The given mean and variance values influence the relative count of these negative numbers. It is up to the program's user to choose proper program inputs and to interpret the negative numbers.

Schuler's program output corresponds to an aggregated traffic trace. Given that we wanted this aggregated traffic to be composed of the three packet flows described in Table 4-III, we filtered Schuler's program output through an ad hoc program that produces three traffic traces, each corresponding to one of the required packet flows. We employed a Hurst parameter of 0.8, a mean value of 25 packets per observation period, and a variance four times the mean value. We determined this set of parameters based on what Lucas et al. [1997] have reported. Their paper presents statistical characteristics of traffic traversing the University of Virginia's campus network, which hosts approximately 10 thousand computers. The analysis focuses on three 90-minute intervals starting at 2:15 AM, 2:00 PM, and 9:00 PM. Regardless of the utilization levels exhibited during these periods, the paper reports that the traffic fits a Hurst parameter of 0.8 and has a variance four times its mean value.

We came to the value of 25 packets per observation period empirically. We ran Schuler's program with several mean values (using, of course, the stated Hurst parameter and variance) and counted the negative numbers contained in the output trace. Schuler's program ran quite fast on the University's SGI Power Challenge, so, a priori, we did not invest any time selecting the initial trial value. We arbitrarily chose 10 packets per observation period, and the output contained 4.9% negative numbers. Entering 20 packets per observation period we got a trace with 1.2% negative numbers; with 30 we got 0.3%, and with 25 we got 0.5%. We set Schuler's program observation period to a convenient value of 72 microseconds. This value, together with the given packet flows' features and the selected mean of 25 packets per observation period, results in a target traffic trace throughput of 125 Mbytes per second.
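The selection procedure just described is sketched below, with generate_fgn_counts() standing in as a hypothetical placeholder for Schuler's program (the actual work used Schuler's C program, not this code). The comment at the end records our own arithmetic check of the 72-microsecond observation period; the implied 360-byte mean packet size is derived by us and is not stated in the text.

```python
# Sketch of the empirical selection of the per-period mean count (illustrative only).

def negative_fraction(counts):
    """Fraction of observation periods whose sampled packet count is negative."""
    return sum(1 for c in counts if c < 0) / len(counts)

# Hypothetical stand-in for Schuler's fractional-Gaussian-noise generator:
# for mean in (10, 20, 25, 30):
#     counts = generate_fgn_counts(hurst=0.8, mean=mean, variance=4 * mean, n=262144)
#     print(mean, negative_fraction(counts))
#
# Consistency check of the 72-microsecond observation period (our arithmetic):
#   125e6 bytes/s * 72e-6 s = 9000 bytes per period
#   9000 bytes / 25 packets = 360 bytes of mean packet size over the three flows
```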
4.5 A performance study of a software router incorporating the BUG

Naturally, after showing that a BUG protected bus works well in isolation, at least at the queuing network modeling level of detail, we wanted to see whether such a bus may improve the operation of a software router set up to provide differentiated services. In order to do this, we basically extended the set of experiments presented in the previous chapter's subsection 3.8.2. Following the rationale presented there, we extended the queuing network model shown in the previous chapter's Figure 3.21 and introduced the flow monitors and the flow scheduler shown in the previous section's Figure 4.8.c. The resulting model's operational parameters and the features of the workload were set as in the previous chapter's subsection 3.8.2. Once again, we performed the simulation for routers configured with two different central processing units: one with a 1 GHz CPU and the other with a 3 GHz CPU. As before, for the considered traffic, the CPU is the system's bottleneck for the 1 GHz router while the bus is the system's bottleneck for the 3 GHz router.

4.5.1 Results

Figure 4.12 shows results for a software router using WFQ scheduling for the CPU and the BUG mechanism for controlling bus usage. We see that the obtained results correspond to an almost ideal behavior: under saturation, throughput does not decrease with increasing offered loads, and the system achieves a fair share of both router resources, CPU and bus.

Figure 4.12—QoS aware system's performance analysis: a) system's overall throughput; b) per packet flow throughput share for system one; c) per packet flow throughput share for system two

4.6 An implementation

Currently we are working on a BUG implementation for a FreeBSD powered PC-based software router, using 3COM's 3C905B network interface cards, which are based on the "hurricane" PCI bus-master chips and are controlled by the FreeBSD xl driver.

4.7 Summary

• We presented a mechanism for improving the resource sharing of the input/output bus of personal computer-based software routers.
• The mechanism that we proposed, called BUG for bus utilization guard, does not imply any changes in the host computer's hardware, although some special features are required of network interface cards: they should have a different direct memory access channel for each differentiated packet flow, and they should be able to report the number of bytes and packets stored for each of these channels (see the sketch after this list).
• The BUG mechanism can be run by the central processing unit or by a suitable coprocessor attached at the AGP connector.
• Using a queuing model solved by simulation, we studied the BUG's performance. The results show that the BUG is effective in controlling the bus share between competing packet flows.
• When we use this mechanism in combination with known techniques for central processing unit usage control, we obtain a nearly ideal behavior of the share of the software router resources for a broad range of workloads.
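As a rough illustration of the interface assumed by the summary, and not of the chapter's actual BUG algorithm, the sketch below shows one way a per-activation-period guard could use the per-flow byte counts reported by the network interface card's DMA channels to grant weighted-fair bus quotas. The driver hooks bytes_stored() and set_dma_byte_quota() are hypothetical names invented for this example.

```python
# Illustrative sketch only; not the dissertation's BUG algorithm.
class BugSketch:
    def __init__(self, nic, weights, bus_bytes_per_period):
        self.nic = nic                        # assumed driver object exposing the hooks below
        self.weights = weights                # flow_id -> weight (WFQ-like share)
        self.budget = bus_bytes_per_period    # bytes the bus can move in one period T

    def on_timer(self):
        """Called once per activation period T: grant each flow a bus-byte quota."""
        backlog = {f: self.nic.bytes_stored(f) for f in self.weights}       # hypothetical hook
        active = {f: w for f, w in self.weights.items() if backlog[f] > 0}
        total_w = sum(active.values()) or 1
        for flow in self.weights:
            if flow in active:
                fair_share = self.budget * active[flow] / total_w           # weighted fair share
                quota = min(fair_share, backlog[flow])
            else:
                quota = 0
            self.nic.set_dma_byte_quota(flow, quota)                        # hypothetical hook
```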
Chapter 5 Conclusions and future work

5.1 Conclusions

We have shown that single-queue performance models of software routers implemented with BSD networking software, or similar, and personal computer (PC) hardware, or similar, incur significant error when they are used to study scenarios where the router's central processing unit (CPU) is the system's bottleneck and not the communication links. Furthermore, we have shown that a queuing network model represents these systems better under the considered scenarios.

Armed with a mature characterization process, we have shown that it is possible to build a queuing network model of PC-based software routers that is highly accurate, so this model may be used to carry out performance studies at several levels of detail. Furthermore, we have shown that the model's service times computed for one system may be used to predict the performance of other systems, if scaled appropriately, and that the model's service times scale linearly with the CPU's operation speed but can be considered constant with respect to message and routing table sizes. Moreover, the model's service times related to the network interfaces layer behave linearly with respect to the CPU's operation speed, and their offset varies with the network interface card's and device driver's technology and with cache memory performance.

Using our validated, parameterized queuing network model, we have quantitatively shown that current CPUs allow a software router to sustain high throughput when forwarding plain Internet Protocol datagrams if some adjustments are introduced to the networking software, and that current PCs' input/output (I/O) buses, however, are the limiting factor for achieving system-wide high performance. Moreover, we have quantitatively shown how and when current PCs' I/O buses prevent a PC-based software router from supporting system-wide communication quality assurance mechanisms, or Quality-of-Service (QoS) mechanisms.

In light of the above, we proposed a mechanism for improving the resource sharing of the input/output bus of personal computer-based software routers. This mechanism, which we called BUG for bus utilization guard, does not imply any changes in the host computer's hardware, although some special features are required of network interface cards: they should have a different direct memory access channel for each differentiated packet flow, and they should be able to report the number of bytes and packets stored for each of these channels. When we use this mechanism in combination with known techniques for CPU usage control, we quantitatively showed that it is possible to obtain a nearly ideal behavior of the share of the software router resources for a broad range of workloads.

5.2 Future work

A precise analysis of the results obtained when a BUG protected PCI bus was loaded with self-similar traffic is missing in this document; see subsection 4.4.5. Recall that although the results confirmed that, under the considered load conditions, a BUG protected PCI bus follows the ideal behavior of a WFQ bus better than a plain PCI bus does, these results also showed a departure from the ideal behavior higher than we expected.

A working implementation of the BUG for a production system is also missing in this document. Currently, an undergraduate student of the Facultad de Informática de Barcelona (FIB/UPC) is conducting his final project on this subject. He is implementing the BUG for a software router running FreeBSD release 4.5 and fitted with 3COM's 3C996 PCI-X/Gigabit Ethernet network interface cards.

We find it natural to pursue extending our modeling process to embrace the whole networking software; that is, to model PC-based communication processes involving the Internet architecture's transport layer protocols and BSD's sockets layer. This, we think, is no easy task, as it would most certainly involve user-level application programs that are subject to the CPU scheduler. That is, the CPU's extended model would have to include not only the priority preemptive scheduling policy, which models the software interrupt mechanism, but also a round-robin preemptive policy modeling the UNIX CPU scheduler. Furthermore, the CPU's extended model would have to switch between scheduling policies depending on the kind of tasks pending execution. In turn, this would require extending our characterization process to describe the tasks involved when context switching occurs. Evidently, if such a model could be built, it would be valuable for capacity planning and as a uniform test-bed for Internet services.

We also find it tempting to simplify the software router model's assumptions and to pursue an analytical solution. The motivation would be to assess the trade-off between the model simplification and the error incurred.
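As a rough sketch of the hybrid CPU service discipline discussed above, and under our own assumptions about how tasks would be represented, the following shows a dispatcher that serves software-interrupt tasks with preemptive priorities and falls back to a round-robin quantum for user-level tasks; all field names are illustrative only.

```python
# Illustrative sketch of a hybrid CPU service station for the extended model.
from collections import deque

class HybridCpuStation:
    def __init__(self, quantum_s):
        self.softint = {level: deque() for level in (0, 1)}  # priority level -> FIFO of kernel tasks
        self.users = deque()                                 # round-robin queue of user-level tasks
        self.quantum = quantum_s

    def enqueue(self, task):
        if task["kind"] == "softint":
            self.softint[task["level"]].append(task)
        else:
            self.users.append(task)

    def next_dispatch(self):
        """Return (task, allowed run time): software interrupts preempt user tasks."""
        for level in sorted(self.softint):                   # lower level = higher priority
            if self.softint[level]:
                task = self.softint[level].popleft()
                return task, task["service_s"]               # kernel task runs to completion
        if self.users:
            task = self.users.popleft()
            return task, self.quantum                        # user task gets one round-robin quantum
        return None, 0.0
```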
Current telematic systems research is focusing on embedded systems for personal communications and ubiquitous computing. At the same time, current embedded systems are built around complex microprocessors (some of which provide performance-monitoring counters, as Intel's Pentium-class microprocessors do) and execute full-blown PC operating system kernels such as OpenBSD or Linux. In this light, we find it interesting to try to apply our characterization and modeling process to study the performance of telematic systems executed over embedded hardware. Other performance modeling challenges we find interesting are the study of emerging software router technologies such as SMP PC-based and PC cluster-based software routers.