Traffic at the Network Edge: Sniff, Analyze, Act.
POLITECNICO DI TORINO
SCUOLA DI DOTTORATO
PhD Programme in Electronics and Communications Engineering – XVII Cycle

Doctoral Thesis
Traffic at the Network Edge: Sniff, Analyze, Act.

Dario Rossi

Advisor: prof. Marco Ajmone Marsan
PhD Programme Coordinator: prof. Ivo Montrosset

January 2005

Contents

1 A Primer on Network Measurement
  1.1 Motivations and Methodology
  1.2 Software Classification
      1.2.1 Software List
  1.3 Sniffing Tools
      1.3.1 Tcpdump
      1.3.2 Other Tools
  1.4 Tstat Overview
      1.4.1 The TCPTrace Tool
      1.4.2 Collected Statistics
      1.4.3 Output Overview
      1.4.4 Usage Overview
  1.5 The Network Scenario
      1.5.1 The GARR-B Architecture
      1.5.2 The GARR-B Statistics
      1.5.3 The GARR-G Project

2 The Measurement Setup
  2.1 Traffic Measures in the Internet
  2.2 The Tool: Tstat
  2.3 Trace analysis
  2.4 IP Level Measures
  2.5 TCP Level Measures
      2.5.1 TCP flow level analysis
      2.5.2 Inferring TCP Dynamics from Measured Data
  2.6 Conclusions

3 User Patience and the World Wide Wait
  3.1 Background
  3.2 Interrupted Flows: a definition
      3.2.1 Methodology
      3.2.2 Interruption Criterion
  3.3 Results
      3.3.1 Impact of the User Throughput
      3.3.2 Impact of Flow Size
      3.3.3 Completion and Interruption Times
  3.4 Conclusions

4 The Zoo of Elephant and Mice
  4.1 Introduction
  4.2 Related Works
  4.3 Problem Definition
      4.3.1 Preliminary Definitions
      4.3.2 Input Data
      4.3.3 Properties of the Aggregation Criterion
      4.3.4 Trace Partitioning Model and Algorithm
  4.4 Results
      4.4.1 Traffic Aggregate Bytewise Properties
      4.4.2 Inspecting TCP Interarrival Time Properties within TAs
  4.5 Conclusions

5 Feeding a Switch with Real Traffic
  5.1 Introduction
  5.2 Internet Traffic Synthesis
      5.2.1 Preliminary Definitions
      5.2.2 Traffic Matrix Generation
      5.2.3 Greedy Partitioning Algorithm
  5.3 Performance study
      5.3.1 Measurement setup
      5.3.2 The switching architectures under study
      5.3.3 Traffic scenarios
      5.3.4 Simulation results
  5.4 Conclusion

6 Data Inspection: the Analysis of Nonsense
  6.1 Introduction and Background
  6.2 Architecture Overview
      6.2.1 In the Beginning Was Perl
      6.2.2 Input Files and Formats
      6.2.3 Formats and Expressions Interaction
      6.2.4 The d.tools Core
      6.2.5 The d.tools Flexibility
      6.2.6 The DiaNa GUI
  6.3 Performance Evaluation and Benchmarking
      6.3.1 Related Works
      6.3.2 The Benchmarking Setup
      6.3.3 Perl IO
      6.3.4 Perl Implicit Split Construct
      6.3.5 Perl Operators and Expressions
      6.3.6 DiaNa Startup Overhead
      6.3.7 Fields Number and Memory Depth
  6.4 Practical Examples
  6.5 Conclusions

7 Final Thoughts
  7.1 Measurement Tool
      7.1.1 Cope with Traffic Shifts
      7.1.2 Cope with Bandwidth Increase
      7.1.3 Distributed Measurement Points
  7.2 User Patience and the World Wide Wait
      7.2.1 Practical Applications
  7.3 TCP Aggregates Analysis
      7.3.1 Future Aggregations
  7.4 Switching Performance Evaluation
      7.4.1 Modified Optimization Problem
      7.4.2 Responsive Sources
  7.5 Post Processing with DiaNa
      7.5.1 DiaNa Makeup and Restyle

List of Figures

1.1 Example of the tcpdump Command Output
1.2 Evolution of the GARR-B Network: from April 1999 to May 2004
1.3 Logical Map of the Italian GARR-B Network, with a Zoomed View of the Torino MAN
1.4 Input versus Output Load Statistics over Different Time-Scales: Yearly to Monthly and Weekly to Daily
2.1 IP payload traffic balance - Period (B)
2.2 Distribution of the incoming traffic versus the source IP address
2.3 Distribution of the TTL field value for outgoing and incoming packets - Period (B)
2.4 Incoming and outgoing flows size distribution; tail distribution in log-log scale (lower plot); zoom in linear and log-log scale of the portion near the origin (upper plots) - Period (B)
2.5 Asymmetry distribution of connections expressed in bytes (upper plot) and segments (lower plot) - Period (B)
2.6 Distribution of the connections completion time - Period (B)
2.7 Distribution of "rwnd" as advertised during handshake
2.8 TCP congestion window estimated from the TCP header
3.1 Completed and Interrupted TCP Flow
3.2 Probability and Cumulative Distribution
3.3 Normalized Probability Distribution
3.4 Temporal Gap Reduction
3.5 Interrupted vs Completed Flows Size CDF
3.6 Sensitivity of the Interruption Criterion to its Parameters
3.7 Interrupted vs Completed Flows Amount and Ratio
3.8 Server on the Top, Client on the Bottom
3.9 Server on the Top, Client on the Bottom
3.10 Server case only
3.11 Server case only
4.1 Aggregation Level: From Packet Level to TA Level
4.2 The Measure Setup
4.3 Flow Size and Arrival Times for Different TAs
4.4 TR Size Distribution
4.5 TR Flow Number Distribution
4.6 Trace Partitioning: Algorithmic Behavior
4.7 Trace Partitioning: Samples for Different Aggregated Classes K
4.8 Number of Traffic Relations within each Traffic Aggregate
4.9 Number of TCP Flows within each Traffic Aggregate
4.10 Mean Size of TCP Flows within each Traffic Aggregate
4.11 Elephant TCP Flows within each Traffic Aggregate
4.12 TCP Flows Size Distribution of TAs
4.13 Interarrival Time Mean of TCP Flows within TAs
4.14 Interarrival Time Variance of TCP Flows within TAs
4.15 Interarrival Time Hurst Parameter of TCP Flows within TAs
4.16 Interarrival Time Hurst Parameter of TCP Flows within TAs
5.1 Internet traffic abstraction model (on the top) and measure setup (on the bottom)
5.2 Internet traffic at different levels of aggregation
5.3 The optimization problem
5.4 The greedy algorithm
5.5 Mean packet delay under PT and P3 scenarios for cell mode policies
5.6 Mean packet delay under PT and P3 scenarios for packet mode policies
5.7 Throughput under PT and P3 scenarios for cell mode policies
6.1 DiaNa Framework Conceptual Layering
6.2 Input Data from the DiaNa Perspective
6.3 Parallel Expressions and Default Expansion
6.4 Architecture of a Generic d.tools
6.5 The d.loop Synoptic
6.6 The DiaNa Graphical User Interface
6.7 Input/Output Performance on Linux-2.4
6.8 Linux vs. Solaris Normalized Throughput Performance
6.9 Explicit split and Performance Loss
6.10 Normalized Cost for Floating Point, Integer and String Operations
6.11 Modules Loading Performance
6.12 Time Cost of Fields Number vs. Memory Depth
7.1 TstatRRD: Different Temporal Resolution for IP Packet Length Statistics
7.2 TstatRRD Web Interface: Round Trip Time Example
7.3 Flow "Delocalization" as Consequence of the Aggregation Process
7.4 Flows Regrouping through Logical Masks
7.5 Responsive Traffic Generation Process
List of Tables

1.1 Network Monitoring and Measuring Software Tools: Classification by Functional Role
1.2 Tstat Output: Log Field Description
1.3 Input versus Output Bandwidth and Utilization Statistics over Different Time-Scales: Yearly to Monthly and Weekly to Daily
2.1 Summary of the analyzed traces
2.2 Host name of the 10 most contacted hosts on a flow basis
2.3 TCP options negotiated
2.4 Percentage of TCP traffic generated by common applications in number of flows, segments and transferred bytes - Period (B)
2.5 Average data per flow sent by common applications, in segments and bytes - Period (B)
2.6 OutB or DupB events rate, computed with respect to the number of packets and flows
3.1 Three Most Active server and client statistics: total and interrupted flows
3.2 Average Values of the Inspected Parameters
4.1 Trace Information
6.1 Serial Expression Tokens
6.2 Sampling Option Synopsis
6.3 Benchmarking Setup Architectures

Abstract

Since the dawning of science, the experimental method has been considered an essential investigation tool for the study of all natural phenomena. Similarly, traffic measurement represents an indispensable and valuable tool for the analysis of today's telecommunication networks: describing the key characteristics of real communication processes is a necessary step toward a deeper understanding of the complex dynamics of network traffic. However, despite the importance gained by computer networks in everyday life, as testified by the deep penetration of the Internet, several aspects of current traffic knowledge remain limited or unsatisfactory. The pace at which technology evolves continuously enables new services and applications: as a consequence, the traffic streams flowing in current networks are very different from the traffic patterns of the recent past. While a few years ago the Internet was synonymous with Web browsing, the pervasive diffusion of wide-band access has entailed a shift of the application spectrum – of which peer-to-peer file-sharing and audio/video streaming are the most representative examples; moreover, this evolutionary process will likely repeat itself, although foreseeing its direction may be hard from our current viewpoint. It is therefore evident that traffic measurement and analysis are core tools for efficient Internet performance monitoring, evaluation and management.
Indeed, knowledge of real traffic is indispensable for the development of synthetic traffic models – which are very useful for the study, analysis and dimensioning of both real networks and network apparatuses. Besides, efficient traffic engineering cannot leave real traffic properties out of consideration. Finally, the characterization of current traffic is a key factor in enabling the design and deployment of future generation networks. This thesis hopefully brings some original contributions to academic research in the field of network traffic measurement, and is organized as follows: after a necessary introduction, the presented research focuses on the observation of traffic from different viewpoints – considering the network, transport and application levels – analyzing the haziest aspects of TCP/IP networking and uncovering some unanticipated phenomena. In more detail, Chapter 1 digs into the details of the traffic collection methodology: after discussing the motivations behind field measurement, an introduction is given to the available measurement techniques; then, the software tools are discussed at length, devoting special attention to the sniffing tools used throughout this research. Eventually, in order to fully describe the measurement setup, the chapter describes the network scenario, considering its architecture as well as its evolution. Chapter 2 provides a thorough analysis, at both the network and transport levels, of the typical traffic patterns that can be measured in the network just described. In order to characterize the most relevant aspects of the measured traffic, the study considered a huge data set collected during more than two months on the Politecnico's access link. It is important to stress that, while standard performance measures (such as flow dimensions, traffic distribution, etc.) remain at the base of traffic evaluation, more sophisticated indices (obtained by correlating incoming and outgoing traffic) are taken into account as well, in order to give reliable estimates of network performance from the user perspective too. The complex relationship between the network and its users is analyzed in Chapter 3, which introduces the notion of "user patience", an application-level metric apt to quantify the perceived Quality of Service (QoS). In more detail, the chapter presents a study of Web user behavior when degraded network performance increases page transfer times: a criterion to infer the application-level impatience metric from the information available at the transport level is presented and validated. In order to infer whether worsening network conditions translate into greater user impatience, more than two million flows were analyzed: surprisingly, several of the insights gained in the study are counter-intuitive, and they contribute to refining the complex picture of interactions between user perception of the Web and network-level events. Then, Chapter 4 highlights some interesting aspects at the root of the Long Range Dependence properties exhibited by aggregates of TCP flows, extending to the transport level results that were previously known only at the packet level. In more detail, starting from traffic measurements at the TCP-flow level, a simple aggregation criterion is used to induce a partition of heavy and light flows – known in the literature as elephants and mice, respectively – into different traffic aggregates.
Then, the statistical properties of the TCP flow inter-arrival process within each aggregate are analyzed in depth: the interpretation of the results suggests that TCP elephants are the cause of LRD phenomena at the TCP-flow level. In Chapter 5, traffic measurements are used to evaluate the performance of different switching architectures. The chapter describes a novel methodology – which can be expressed in terms of an optimization problem – for the synthesis of realistic traffic starting from a single real packet-level trace. Using simulation, several scheduling algorithms were stressed with traditional traffic models and with the novel synthetic traffic: the comparison uncovered unexpected results – which traditional traffic models were unable to bring to evidence – along with a strong degradation of the achievable performance. Chapter 6 introduces DiaNa, a novel software tool primarily designed to process huge amounts of data – possibly several orders of magnitude bigger than the workstation's random access memory – in an efficient, spreadsheet-like, batch fashion. The tool was developed primarily to perform the analyses described in the other research studies presented in this thesis; however, one of the primary design goals was to offer extreme flexibility from the user perspective: as a consequence, DiaNa is written in Perl and its use is not restricted to traffic traces. The DiaNa syntax is a very small and orthogonal superset of the underlying Perl syntax, which allows, e.g., file tokens to be addressed comfortably and file formats to be used profitably throughout the data processing; the DiaNa software also includes an interactive Perl/Tk graphical user interface, layered on top of several batch shell scripts. The chapter introduces the software, dissects its architecture, evaluates its performance through benchmarking and presents some examples of use in a networking context. Finally, some concluding considerations are drawn in Chapter 7: after a brief summary of each of the research studies described earlier, the chapter reports some possible research directions, and sketches some of the candidate's currently on-going activities in the measurement field.

Acknowledgements

I've always tried to be very concise: following this choice, I just want to thank everybody I know: if we are still in contact, then we both care.

Chapter 1

A Primer on Network Measurement

Traffic measurement is an applied networking research methodology aimed at understanding packet traffic on the Internet. From its humble beginnings in LAN-based measurement of network applications and protocols, network measurement research has grown in scope and magnitude, and has helped provide insight into fundamental behavioural properties of the Internet, its protocols, and its users. For example, Internet traffic measurement and monitoring serves as the basis for a wide range of IP network operations, management and engineering tasks, such as troubleshooting, accounting and usage profiling, routing weight configuration, load balancing, capacity planning, and so forth. From a very high-level perspective, there are mainly three core methodologies that, with very different properties, enable networking studies: namely, analytical models, simulation and measurement. To validate the research results and the researcher's intuition, it is normally useful to take several approaches in parallel.
For example, useful analytical models can be derived from traffic measurement; these models may describe traffic behavior, or specify the traffic generation pattern – in which case they can be used as simulation input. Besides, the statistical properties of simulation results can be compared to the statistical properties of the real observed traffic; or, in other setups, simulation input may be driven directly from a real trace. In this study, though we combine the aforementioned approaches, the starting point is always the collection and analysis of real traffic traces: therefore, we devote this introductory chapter to a brief overview of the computer tools necessary for trace collection and analysis. Adopting again a very high-level perspective, there are two main traffic measurement "philosophies": measurement can be either active or passive. Tools belonging to the first class inject ad-hoc traffic into the network, aiming at measuring the reaction of the network itself; conversely, tools belonging to the second class passively observe and collect the traffic flowing through the network. Though the field of active measurement is very interesting, in the following we will focus mainly on passive measurement tools and on the analysis of the gathered data. The rest of this chapter is organized as follows: firstly, Section 1.1 provides a high-level general description of the existing data collection methods, introducing the problem of packet sniffing; then, a more detailed classification of the available traffic measurement tools is presented in Section 1.2 – where a partial taxonomy of such tools, without any pretension of completeness, will also be presented. While Section 1.3 briefly describes some of the most widely deployed tools, Section 1.4 focuses on Tstat, the software tool created by our research group, to whose development I have contributed. Finally, the setup on which measurements are taken is thoroughly described in Section 1.5.

1.1 Motivations and Methodology

Network traffic measurement provides a means to go "under the hood", much like an Internet mechanic, to understand what is or is not working properly on a local-area or wide-area network. Using specialized network measurement hardware or software, a networking researcher can collect detailed information about the transmission of packets on the network, including their timing structure and contents. With detailed packet-level measurements, and some knowledge of the Internet protocol stack, it is possible to "reverse engineer" significant information about the structure of an Internet application, or the behaviour of an Internet user. There are four main reasons why network traffic measurement is a useful methodology:

Network Troubleshooting. Computer networks are not infallible; often, a single malfunctioning piece of equipment can disrupt the operation of an entire network, or at least degrade performance significantly. Examples of such scenarios include "broadcast storms", illegal packet sizes, incorrect addresses, and security attacks. In such scenarios, detailed measurements from the operational network can often provide a network administrator with the information required to pinpoint and solve the problem.
Protocol Debugging. Developers often want to test "new, improved" versions of network applications and protocols: in this context, network traffic measurement provides a means to ensure the correct operation of the new protocol or application, its conformance to required standards, and (if necessary) its backward compatibility with previous versions, prior to unleashing it on a production network.

Workload Characterization. Network traffic measurements can also be used as input to the workload characterization process, which analyzes empirical data (often using statistical techniques) to extract salient and representative properties describing a network application or protocol; as briefly mentioned earlier, knowledge of the workload characteristics can then lead to the design of better protocols and networks for supporting the application.

Performance Evaluation. Finally, network traffic measurements can be used to determine how well a given protocol or application is performing in the Internet, and a detailed analysis of network measurements can help identify performance bottlenecks; once these performance problems are addressed, new versions of the protocols can provide better (i.e., faster) performance for the end users of Internet applications.

The "tools of the trade" for network traffic measurement research can be classified in several different ways, according to their methodology:

Hardware-based vs Software-based Measurement Tools. The primary categorization among network measurement tools is hardware-based versus software-based. Hardware-based platforms, offered by many vendors, are often referred to as network traffic analyzers: special-purpose equipment designed expressly for the collection and analysis of network data. This equipment is often very expensive, depending on the number of network interfaces, the types of network cards, the storage capacity, and the protocol analysis capabilities. Software-based measurement tools typically rely on kernel-level modifications to the network interfaces of commodity workstations to convert them into machines with packet capture capability. In general, the software-based approach is much less expensive than the hardware-based approach, but may not offer the same functionality and performance as a dedicated network traffic analyzer. In some cases, the software-based approach is very specialized, as in the case of Web server and proxy workload analysis, which relies on the access logs recorded by Internet servers and proxies: these logs record each client request for Web site content, including the time of day, client IP address, URL requested, and document size, and provide useful insight, e.g., into Web server workloads, without the need to collect detailed network-level packet traces.

Passive vs Active Measurement Approaches. A passive network monitor is used to observe and record the packet traffic on an operational network, without injecting any traffic of its own onto the network – that is, the measurement device is non-intrusive; most network measurement tools fall into this category. An active network measurement approach uses packets generated by a measurement device to probe the Internet and measure its characteristics.
Simple examples of the latter approach include the ping utility for estimating network latency to a particular destination on the Internet, the traceroute utility for determining Internet routing paths, and the pathchar tool for estimating link capacities and latencies along an Internet path.

On-line vs Off-line Traffic Analysis. Some network traffic analyzers support real-time collection and analysis of network data, often with graphical displays for on-line visualization of live traffic data; most hardware-based network analyzers support this feature. Other network measurement devices are intended only for real-time collection and storage of traffic data, while analysis is postponed to an off-line stage – which may typically be the case depending on the analysis features: once the traffic data is collected and stored, a researcher can perform as many analyses as desired in the post-processing phase. Clearly, performing as much on-line analysis as possible brings several benefits, such as easing and speeding up the post-processing phase.

LAN vs WAN Measurement. The early network traffic measurement research in the literature was undertaken in Local Area Network (LAN) environments, such as Ethernet LANs. LANs are easier to measure, for several reasons. First, a LAN is typically administered by a single well-known organization, meaning that obtaining security clearance for traffic analysis is a relatively straightforward process. Second, the broadcast nature of an Ethernet LAN means that all packets transmitted on the LAN are seen by all hosts. Network measurement can be done in this context by simply configuring a network interface into promiscuous mode, which means that the interface will receive and record (rather than ignore) the packets destined to other hosts on the network. Later measurement work extended traffic collection and analysis to Wide Area Network (WAN) environments. The greater challenges here include administrative control of the network, and security and privacy issues; nevertheless, with time and coordination between such measurements, a more complete picture of end-to-end network performance will become possible. For organizations with a single access point to the external Internet, such as ours, measurement devices can be put in-line on an Internet link near the default router for the organization. However, it is worth pointing out that one of our current on-going concerns is to enable the deployment of a wide-area network measurement infrastructure that can collect simultaneous measurements of client, server, and network behaviours from different measurement points: please refer to Section 7.1.3 for further details on this topic.

Following the classification introduced so far, we may identify the methodology followed in this thesis as software-based passive sniffing; measurements are taken on the single egress router of our large campus LAN, and the choice between on-line and off-line processing depends on the specific task.

1.2 Software Classification

In this section, we identify four main criteria by which network monitoring and measurement software can be classified: namely, functional role, managed resources, underlying mechanism, and software environment/licence. With respect to the general management area or functional role of a tool, several classes of software can be identified.
The following keywords may be used to describe such classes:

- Alarm: a reporting/logging tool that can trigger on specific events within a network.
- Analyzer: a traffic monitor that reconstructs and interprets protocol messages that span several packets.
- Benchmark: a tool used to evaluate the performance of network components.
- Control: a tool that can change the state or status of a remote network resource.
- Debugger: a tool that, by generating arbitrary packets and monitoring traffic, can drive a remote network component to various states and record its responses.
- Generator: a traffic generation tool.
- Manager: a distributed network management system or system component.
- Map: a tool that can discover and report a system's topology or configuration.
- Reference: a tool for documenting MIB structure or system configuration.
- Routing: a packet route discovery tool.
- Security: a tool for analyzing or reducing threats to security.
- Status: a tool that remotely tracks the status of network components.
- Traffic: a tool that monitors packet flow.

Similarly, one can classify network monitoring software depending on the managed resources; without aiming at completeness, we may identify the following self-explanatory categories: Bridge, CHAOS, DECnet, DNS, Ethernet, FDDI, IP, OSI, NFS, Ring, SMTP. Besides, another possible classification policy involves the mechanism used by the tool, as in the following list:

- CMIS: a network management system or component based on CMIS/CMIP, the Common Management Information System and Protocol.
- Curses: a tool that uses the "curses" tty interface package.
- Eavesdrop: a tool that silently monitors communications media (e.g., by putting an ethernet interface into "promiscuous" mode); a minimal sketch of this mechanism is given after these lists.
- NMS: the tool is a component of, or queries, a Network Management System.
- Ping: a tool that sends packet probes such as ICMP echo messages; to help distinguish tools, we do not consider NMS queries or protocol spoofing (see below) as probes.
- Proprietary: a distributed tool that uses proprietary communications techniques to link its components.
- RMON: a tool which employs the RMON extensions to SNMP.
- SNMP: a network management system or component based on SNMP, the Simple Network Management Protocol.
- Spoof: a tool that tests the operation of remote protocol modules by peer-level message exchange.
- X: a tool that uses X-Windows.

Finally, we may discriminate among the tools' operating environments and licences. Possible environments are DOS, HP, Macintosh, OS/2, Sun, UNIX (as well as other *NIX systems, such as FreeBSD or Linux), VMS, and Standalone (an integrated hardware/software tool that requires only a network interface for operation). About the licence, the main classification is:

- Free: the tool is available at no charge, though other restrictions may apply (tools that are part of an OS distribution but not otherwise available are not listed as "free").
- Library: a tool packaged with either an Application Programming Interface (API) or object-level subroutines that may be loaded with programs.
- Sourcelib: a collection of source code (subroutines) upon which developers may construct other tools.
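To make the Eavesdrop mechanism concrete before moving to the tool list, here is a minimal sketch of a passive sniffer built on libpcap (the capture library discussed in Section 1.3.1): it opens an interface in promiscuous mode and prints the timestamp and length of every captured frame. The interface name "eth0", the snapshot length and the packet count are illustrative choices only, not prescriptions.

#include <pcap.h>
#include <stdio.h>

/* Invoked by libpcap for every captured frame: print the capture
   timestamp, the captured length and the original length on the wire. */
static void on_packet(u_char *user, const struct pcap_pkthdr *h, const u_char *bytes)
{
    (void)user;
    (void)bytes;
    printf("%ld.%06ld  %u bytes captured (%u on the wire)\n",
           (long)h->ts.tv_sec, (long)h->ts.tv_usec, h->caplen, h->len);
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];

    /* Open the interface in promiscuous mode (the Eavesdrop mechanism):
       a snapshot length of 96 bytes is enough for IP and TCP headers. */
    pcap_t *handle = pcap_open_live("eth0", 96, 1 /* promiscuous */, 1000, errbuf);
    if (handle == NULL) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return 1;
    }

    /* Capture 10 packets, then stop. */
    pcap_loop(handle, 10, on_packet, NULL);
    pcap_close(handle);
    return 0;
}

The same loop, pointed at a saved trace via pcap_open_offline() instead of a live interface, is the basic skeleton on which most of the passive tools mentioned in the following are built.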
In the following, we will restrict our attention to software tools that: are open source; run on *NIX systems; use the Eavesdrop mechanism; and capture IP and TCP headers.

1.2.1 Software List

In Table 1.1 we report a practical example of tool classification by functional role, listing the most relevant software tools surveyed in [2, 3]. It is evident that the list is far from complete, and it is also partly out-of-date: indeed, as is fairly well known in the scientific community, the task of keeping such a list up to date has grown too big. As a side – but important – note, it should be pointed out that in many cases the tools are not purely passive, simply because they try to resolve host names using reverse DNS lookups. Notice that a similar classification could be performed with respect to the other three criteria listed in the previous section. Besides, we must point out that a single tool may fall into several categories, depending on its purpose and extent; in the next section we will review some of the tools from the list, specifically some of those belonging to the last category (Traffic). A few words must be devoted to some important missing items: however, as a disclaimer, it should be pointed out that a complete bibliography of measurement projects and tools could easily have been a research project per se. Moreover, the compilation of a complete survey of measurement tools, although very interesting, is nevertheless very cumbersome and, worst of all, would become out-of-date very quickly, nullifying the tremendous effort. Nevertheless, we cannot forget to cite projects such as PlanetLab [4, 5], currently consisting of 550 nodes over 261 sites scattered across the globe, which has in its turn enabled countless independent measurement projects and tools. Another good starting point is CAIDA [6], the Cooperative Association for Internet Data Analysis, which provides tools and analyses promoting the engineering and maintenance of a robust, scalable global Internet infrastructure: about 40 software tools and more than 100 related technical papers are hosted on its website.

[Table 1.1 classifies the surveyed tools by functional role (Alarm, Analyzer, Benchmark, CMIS, Control, Debugger, Generator, Manager, Map, Reference, Routing, Security, Status, Traffic); the last category, Traffic, includes among others the tools reviewed in the next sections – tcpdump, ENTM, EtherApe, Getethers, IPTraf and Tstat – together with tools such as EtherView, LANWatch, nfswatch, Sniffer, tcplogger, tcptrace and ttcp.]

Table 1.1: Network Monitoring and Measuring Software Tools: Classification by Functional Role

1.3 Sniffing Tools

This section contains brief descriptions of some network traffic tools, commonly known as packet sniffers or protocol analyzers. Apart from Tstat, which we will review thoroughly in the next section, for each tool we report a terse description of its purposes or attributes. To summarize, however, let us say that the tcpdump utility can be used for basic manual analysis (or by other more automated tools via the libpcap library): these are, to date, the most widely used solutions, and a number of utilities are built on top of them.

1.3.1 Tcpdump

Tcpdump [10] is a command line tool for collecting network packets from a network interface. Normally it prints out the packet headers that match a specific boolean expression, but it can also be instructed to display the entire packet in different output formats. Tcpdump can read and write data from files (in tcpdump format) in addition to reading data from the network interface. This is especially useful for logging data for later analysis, possibly with some other tool that can read tcpdump files. Libpcap [11], the library component of tcpdump, has become very popular: the vast majority of the packet sniffers mentioned in this thesis make use of this library. An example of the tcpdump command output is provided in Figure 1.1.

The Tcpdump/libpcap duo has been ported to the Windows platforms by another research group of Politecnico di Torino: WinPcap [13, 14] is an open source library for packet capture and network analysis for the Win32 platforms, including a kernel-level packet filter, a low-level dynamic link library and a high-level, system-independent library. WinPcap, which is available at [12], features a device driver that adds to Windows (95, 98, ME, NT, 2000, XP and 2003) the ability to capture and send raw data from a network card, with the possibility of filtering and storing the captured packets in a buffer.

1.3.2 Other Tools

ENTM

ENTM [7] is a screen-oriented utility that runs under VAX/VMS. It monitors local ethernet traffic and displays either a real-time or a cumulative histogram showing a percent breakdown of traffic by ethernet protocol type. The information in the display can be reported based on packet count or byte count. The percentages of broadcast, multicast and approximate lost packets are reported as well. The screen display is updated every three seconds. Additionally, a real-time, sliding history window may be displayed, showing ethernet traffic patterns for the last five minutes. ENTM can also report IP traffic statistics by packet count or byte count. The IP histograms reflect information collected at the TCP and UDP port level, including ICMP type/code combinations. Both the ethernet and IP histograms may be sorted by ASCII protocol/port name or by percent value. All screen displays can be saved in a file for printing later.
17:06:02.613909 nonsns.polito.it.675 > serverlipar.polito.it.sunrpc: S 900316031:900316031(0) win 5840 <mss 1460,sackOK,timestamp 20220003 0,nop,wscale 0> (DF)
17:06:02.614144 serverlipar.polito.it.sunrpc > nonsns.polito.it.675: S 2978051708:2978051708(0) ack 900316032 win 5792 <mss 1460,sackOK,timestamp 203083456 20220003,nop,wscale 2> (DF)
17:06:02.614169 nonsns.polito.it.675 > serverlipar.polito.it.sunrpc: . ack 1 win 5840 <nop,nop,timestamp 20220003 203083456> (DF)
17:06:02.614197 nonsns.polito.it.675 > serverlipar.polito.it.sunrpc: P 1:61(60) ack 1 win 5840 <nop,nop,timestamp 20220003 203083456> (DF)
17:06:02.614432 serverlipar.polito.it.sunrpc > nonsns.polito.it.675: . ack 61 win 1448 <nop,nop,timestamp 203083456 20220003> (DF)
17:06:02.614631 serverlipar.polito.it.sunrpc > nonsns.polito.it.675: P 1:33(32) ack 61 win 1448 <nop,nop,timestamp 203083456 20220003> (DF)
17:06:02.614641 nonsns.polito.it.675 > serverlipar.polito.it.sunrpc: . ack 33 win 5840 <nop,nop,timestamp 20220003 203083456> (DF)
17:06:02.614667 nonsns.polito.it.675 > serverlipar.polito.it.sunrpc: F 61:61(0) ack 33 win 5840 <nop,nop,timestamp 20220003 203083456> (DF)
17:06:02.614689 nonsns.polito.it.676 > serverlipar.polito.it.731: S 910139082:910139082(0) win 5840 <mss 1460,sackOK,timestamp 20220003 0,nop,wscale 0> (DF)
17:06:02.614896 serverlipar.polito.it.731 > nonsns.polito.it.676: S 2977556316:2977556316(0) ack 910139083 win 5792 <mss 1460,sackOK,timestamp 203083456 20220003,nop,wscale 2> (DF)
17:06:02.614905 nonsns.polito.it.676 > serverlipar.polito.it.731: . ack 1 win 5840 <nop,nop,timestamp 20220003 203083456> (DF)
17:06:02.614918 serverlipar.polito.it.sunrpc > nonsns.polito.it.675: F 33:33(0) ack 62 win 1448 <nop,nop,timestamp 203083456 20220003> (DF)
17:06:02.614928 nonsns.polito.it.675 > serverlipar.polito.it.sunrpc: . ack 34 win 5840 <nop,nop,timestamp 20220003 203083456> (DF)
17:06:02.614957 nonsns.polito.it.676 > serverlipar.polito.it.731: P 1:77(76) ack 1 win 5840 <nop,nop,timestamp 20220003 203083456> (DF)
17:06:02.615133 serverlipar.polito.it.731 > nonsns.polito.it.676: . ack 77 win 1448 <nop,nop,timestamp 203083457 20220003> (DF)

Figure 1.1: Example of the tcpdump Command Output

EtherApe

EtherApe [8] passively collects data from one or several network interfaces. It makes simple traffic calculations and draws a graphical display of the network traffic in real time. The graphical display uses different colors to make it easier to distinguish between protocols. It also displays, in a very visible way, how much traffic of each protocol has been transmitted over the network: more traffic gives a thicker line. The configuration of which network traffic it should listen to is very flexible: it is possible, for example, to define that it should listen to TCP traffic from some hosts and IP traffic from other hosts. One can also define how long it should remember traffic from hosts, in order to get a good view of the current and past network traffic. EtherApe can be useful for the network administrator to get an overview of the traffic. The drawback is the graphical display, which requires a number of libraries to be installed on the node; the graphical output also interferes with the observed traffic if the display is not local to the node.
Getethers

Getethers [9] runs through all addresses on an ethernet segment (a.b.c.1 to a.b.c.254), pings each address, and then determines the ethernet address of that host. It produces a list of hostname/ethernet address pairs for all hosts on the local network, in either plain ASCII, the file format of the Excelan Lanalyzer, or the file format of the Network General Sniffer. The plain ASCII list optionally includes the vendor name of the ethernet card in each system, to aid in the identification of unknown systems. Getethers uses a raw IP socket to generate ICMP echo requests and receive ICMP echo replies, and then examines the kernel ARP table to determine the ethernet address of each responding system.

IPtraf

IPtraf [15] is a console-based traffic sniffer. It can gather a variety of reports in real time, such as TCP connection packet and byte counts, interface statistics and activity indicators, TCP/UDP traffic breakdowns, and LAN station packet and byte counts.

1.4 Tstat Overview

The lack of automatic tools able to produce statistical data from collected network traces was a major motivation to develop Tstat, a tool which, starting from standard software libraries, is able to offer network managers and researchers important information about classic and novel performance indexes and statistical data about Internet traffic. Tstat started as an evolution of TCPtrace [49], from which it inherits most of its features: therefore, we report a detailed description of the measurement mechanism of TCPtrace in Section 1.4.1; here, we briefly report the most relevant Tstat mechanisms, referring the reader to the information available with the Tstat distribution for further details. Tstat is able to analyze either traces captured in real time, using common PC hardware, or previously recorded traces, supporting various dump formats, such as the one supported by the libpcap library. The software assumes that its input is a trace collected on an edge node, in such a way that both data segments and ACK segments can be analyzed. Besides the more common IP statistics, derived from the analysis of the IP header, Tstat is also able to rebuild each TCP connection status by looking at the TCP header in the forward and backward packet flows: thus incoming and outgoing packets/flows are identified. If connection opening and closing are observed, the flow is marked as a complete flow, and then analyzed. To free the memory holding the status of inactive TCP flows, a 30-minute timer is used as a garbage collector. TCP segments that belong to flows whose opening is not recorded (either because they started before the trace, or because the flow was declared closed early by the garbage collection procedure) are discarded and marked as "trash". The bidirectional TCP flow analysis allows the derivation of novel statistics (such as, for example, the congestion window size, out-of-sequence segments, duplicated segments, etc.), which are collected distinguishing between clients and servers (i.e., hosts that actively open a connection and hosts that reply to the connection request) and also identifying internal and external hosts (i.e., hosts located inside or outside the edge node used as measurement point). Instead of dumping each single measured datum, for each measured quantity Tstat builds a histogram, dumping every collected distribution on a periodical basis (four times per hour by default), thus performing a fair amount of statistical analysis on-line.
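To illustrate the kind of per-flow state that a tool of this kind must maintain in order to rebuild bidirectional TCP connections, the sketch below shows one possible flow record and its bookkeeping; the structure and function names are purely illustrative and do not reflect Tstat's actual internals. Segments are attributed to the forward or backward direction by matching the source against the client side of the connection (the side that sent the SYN), and records idle for more than 30 minutes are reclaimed, mirroring the garbage-collection timer described above.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative flow record: one entry per bidirectional TCP connection,
   keyed on the (client, server) address/port pairs.  Index 0 is the side
   that sent the SYN (the client), index 1 the side that replied. */
struct flow_state {
    uint32_t ip[2];                /* client and server IP addresses        */
    uint16_t port[2];              /* client and server TCP ports           */
    bool     syn_seen, fin_seen;   /* "complete" = opening and closing seen */
    uint64_t pkts[2], bytes[2];    /* per-direction counters                */
    double   last_seen;            /* timestamp of the last segment         */
};

/* A segment travels in the forward (client-to-server) direction when its
   source matches the client side of the flow key. */
static int direction(const struct flow_state *f, uint32_t src_ip, uint16_t src_port)
{
    return (src_ip == f->ip[0] && src_port == f->port[0]) ? 0 : 1;
}

/* Account one segment into the per-direction counters and refresh the
   idle timestamp used by the garbage collector. */
static void account(struct flow_state *f, uint32_t src_ip, uint16_t src_port,
                    uint32_t payload_bytes, double now)
{
    int dir = direction(f, src_ip, src_port);
    f->pkts[dir]++;
    f->bytes[dir] += payload_bytes;
    f->last_seen = now;
}

/* Flows idle for more than 30 minutes are reclaimed; later segments of
   such flows would be counted as "trash", as described above. */
static bool expired(const struct flow_state *f, double now)
{
    return (now - f->last_seen) > 30.0 * 60.0;
}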
These data sets can be used to produce either time plots or aggregated plots over different time spans: the Web interface available at [32] allows the user to browse all the collected data, selecting the desired performance indexes and directly producing the graphs, as well as to retrieve the raw data, which can later be used for further analysis. A total of more than 80 different histogram types are available, including both IP and TCP statistics. They range from classic measures directly available from packet headers (e.g., percentage of TCP or UDP packets, packet length distribution, TCP port distribution, ...) to advanced measures related to TCP (e.g., average congestion window, RTT estimates, out-of-sequence data, duplicated data, ...). A complete flow-level log, useful for post-processing purposes, keeps track of all the TCP flows analyzed, including all the performance indexes described above.

1.4.1 The TCPTrace Tool

TCPtrace is a tool written by Shawn Ostermann at Ohio University for the analysis of TCP dump files, and is maintained these days by his students and members of the Internetworking Research Group (IRG) at Ohio University. It can take as input the files produced by several popular packet-capture programs, including tcpdump, snoop, etherpeek, HP Net Metrix, and WinDump. tcptrace can produce several different types of output containing information on each connection seen, such as elapsed time, bytes and segments sent and received, retransmissions, round trip times, window advertisements, throughput, and more. It can also produce a number of graphs for further analysis. In the following, we will overview the tool, referring the reader to [16] for further information.

Interface Overview

When tcptrace is run trivially on a dumpfile, it generates output similar to the following:

Beluga:/Users/mani> tcptrace tigris.dmp
1 arg remaining, starting with 'tigris.dmp'
Ostermann's tcptrace -- version 6.4.5 -- Fri Jun 13, 2003

87 packets seen, 87 TCP packets traced
elapsed wallclock time: 0:00:00.037900, 2295 pkts/sec analyzed
trace file elapsed time: 0:00:12.180796
TCP connection info:
  1: pride.cs.ohiou.edu:54735 - elephus.cs.ohiou.edu:ssh (a2b)      30>   30<  (complete)
  2: pride.cs.ohiou.edu:54736 - a17-112-152-32.apple.com:http (c2d) 12>   15<  (complete)

In the above example, tcptrace is run on the dumpfile tigris.dmp. The initial lines tell the name of the file tcptrace is currently processing (tigris.dmp), the version of tcptrace running, and when it was compiled. The next line tells that a total of 87 packets were seen in the dumpfile, and that all 87 TCP packets (in this case) were traced. The next line reports the elapsed wallclock time, i.e., the time tcptrace took to process the dumpfile, and the average processing speed in packets per second. The following line indicates the trace file elapsed time, i.e., the duration of packet capture of the dumpfile, calculated as the time between the capture of the first and last packets. The subsequent lines indicate the two TCP connections traced from the dumpfile. The first connection was seen between machines pride.cs.ohiou.edu at TCP port 54735 and elephus.cs.ohiou.edu at TCP port ssh (22). Similarly, the second connection was seen between machines pride.cs.ohiou.edu at TCP port 54736 and a17-112-152-32.apple.com at TCP port http (80). tcptrace uses a labeling scheme to refer to the individual connections traced: in the above example the two connections are labeled a2b and c2d respectively.
For the first connection, 30 packets were seen in the a2b direction (pride.cs.ohiou.edu ==> elephus.cs.ohiou.edu) and 30 packets were seen in the b2a direction (elephus.cs.ohiou.edu ==> pride.cs.ohiou.edu). The two connections are reported as complete, indicating that the entire TCP connection was traced, i.e., the SYN and FIN segments opening and closing the connection were observed. TCP connections may also be reported as reset if the connection was closed with an RST segment, or unidirectional if traffic was seen flowing in only one direction. The above brief output generated by tcptrace can also be requested with the -b option. In the above example, tcptrace looked up host names (elephus.cs.ohiou.edu, for example) and service names (http, for example), involving a DNS name lookup operation. Such name and service lookups can be turned off with the -n option to make tcptrace run faster. If you need name lookups but would rather have the short names of machines (elephus instead of elephus.cs.ohiou.edu, for example), use the -s option.

1.4.2 Collected Statistics

Output Example

tcptrace can produce detailed statistics of TCP connections from dumpfiles when given the -l (long output) option. The -l option produces output similar to the one shown in this example.

Beluga:/Users/mani> tcptrace -l malus.dmp.gz
1 arg remaining, starting with 'malus.dmp.gz'
Ostermann's tcptrace -- version 6.4.6 -- Tue Jul 1, 2003

32 packets seen, 32 TCP packets traced
elapsed wallclock time: 0:00:00.037948, 843 pkts/sec analyzed
trace file elapsed time: 0:00:00.404427
TCP connection info:
1 TCP connection traced:
TCP connection 1:
        host a:        elephus.cs.ohiou.edu:59518
        host b:        a17-112-152-32.apple.com:http
        complete conn: yes
        first packet:  Thu Jul 10 19:12:54.914101 2003
        last packet:   Thu Jul 10 19:12:55.318528 2003
TSTAT OVERVIEW elapsed time: 0:00:00.404427 total packets: 32 filename: malus.dmp.gz a->b: b->a: total packets: 16 ack pkts sent: 15 pure acks sent: 13 sack pkts sent: 0 dsack pkts sent: 0 max sack blks/ack: 0 unique bytes sent: 450 actual data pkts: 1 actual data bytes: 450 rexmt data pkts: 0 rexmt data bytes: 0 zwnd probe pkts: 0 zwnd probe bytes: 0 outoforder pkts: 0 pushed data pkts: 1 SYN/FIN pkts sent: 1/1 req 1323 ws/ts: Y/Y adv wind scale: 0 req sack: Y sacks sent: 0 urgent data pkts: 0 urgent data bytes: 0 mss requested: 1460 max segm size: 450 min segm size: 450 avg segm size: 449 max win adv: 40544 min win adv: 5840 zero win adv: 0 avg win adv: 23174 initial window: 450 initial window: 1 ttl stream length: 450 missed data: 0 truncated data: 420 truncated packets: 1 data xmit time: 0.000 idletime max: 103.7 throughput: 1113 13 pkts bytes bytes bytes bytes bytes bytes bytes times bytes bytes pkts bytes bytes bytes pkts secs ms Bps total packets: ack pkts sent: pure acks sent: sack pkts sent: dsack pkts sent: max sack blks/ack: unique bytes sent: actual data pkts: actual data bytes: rexmt data pkts: rexmt data bytes: zwnd probe pkts: zwnd probe bytes: outoforder pkts: pushed data pkts: SYN/FIN pkts sent: req 1323 ws/ts: adv wind scale: req sack: sacks sent: urgent data pkts: urgent data bytes: mss requested: max segm size: min segm size: avg segm size: max win adv: min win adv: zero win adv: avg win adv: initial window: initial window: ttl stream length: missed data: truncated data: truncated packets: data xmit time: idletime max: throughput: 16 16 2 0 0 0 18182 13 18182 0 0 0 0 0 1 1/1 Y/Y 0 N 0 0 0 1460 1448 806 1398 33304 33304 0 33304 1448 1 18182 0 17792 13 0.149 99.9 44957 pkts bytes bytes bytes bytes bytes bytes bytes times bytes bytes pkts bytes bytes bytes pkts secs ms Bps The initial lines of output are similar to the brief output explained in . The following lines indicate that the hosts involved in the connection and their TCP port numbers are: host a: host b: elephus.cs.ohiou.edu:59518 a17-112-152-32.apple.com:http The following lines indicate that the connection was seen to be complete i.e., the connection was traced in its entirety with the SYN and FIN segments of the connection observed in the dumpfile. CHAPTER 1. A PRIMER ON NETWORK MEASUREMENT 14 The time at which the first and last packets of the connection were captured is reported, followed by the lifetime of the connection, and the number of packets seen. Then, the filename currently being processed is listed, followed by the multiple TCP statistics for the forward (a2b) and the reverse (b2a) directions. Output Statistics We explain the TCP parameter statistics in the following, for the a2b direction. Similar explanation would hold for the b2a direction too. ; total packets The total number of packets seen. ; ack pkts sent The total number of ack packets seen (TCP segments seen with the ACK bit set). ; pure acks sent The total number of ack packets seen that were not piggy-backed with data (just the TCP header and no TCP data payload) and did not have any of the SYN/FIN/RST flags set. ; sack pkts sent The total number of ack packets seen carrying TCP SACK [20] blocks. ; dsack pkts sent The total number of sack packets seen that carried duplicate SACK (D-SACK) [22] blocks. ; max sack blks/ack The maximum number of sack blocks seen in any sack packet. ; unique bytes sent The number of unique bytes sent, i.e., the total bytes of data sent excluding retransmitted bytes and any bytes sent doing window probing. 
; actual data pkts The count of all the packets with at least a byte of TCP data payload. ; actual data bytes The total bytes of data seen. Note that this includes bytes from retransmissions / window probe packets if any. ; rexmt data pkts The count of all the packets found to be retransmissions. ; rexmt data bytes The total bytes of data found in the retransmitted packets. ; zwnd probe pkts The count of all the window probe packets seen. (Window probe packets are typically sent by a sender when the receiver last advertised a zero receive window, to see if the window has opened up now). ; zwnd probe bytes The total bytes of data sent in the window probe packets. ; outoforder pkts The count of all the packets that were seen to arrive out of order. ; pushed data pkts The count of all the packets seen with the PUSH bit set in the TCP header. 1.4. TSTAT OVERVIEW 15 ; SYN/FIN pkts sent The count of all the packets seen with the SYN/FIN bits set in the TCP header respectively. ; req 1323 ws/ts If the endpoint requested Window Scaling/Time Stamp options as specified in RFC 1323[25] a ‘Y’ is printed on the respective field. If the option was not requested, an ‘N’ is printed. For example, an “N/Y” in this field means that the window-scaling option was not specified, while the Time-stamp option was specified in the SYN segment. Note that since Window Scaling option is sent only in SYN packets, this field is meaningful only if the connection was captured fully in the dumpfile to include the SYN packets. ; adv wind scale The window scaling factor used. Again, this field is valid only if the connection was captured fully to include the SYN packets. Since the connection would use window scaling if and only if both sides requested window scaling [25], this field is reset to 0 (even if a window scale was requested in the SYN packet for this direction), if the SYN packet in the reverse direction did not carry the window scale option. ; req sack If the end-point sent a SACK permitted option in the SYN packet opening the connection, a ‘Y’ is printed; otherwise ‘N’ is printed. ; sacks sent The total number of ACK packets seen carrying SACK information. ; urgent data pkts The total number of packets with the URG bit turned on in the TCP header. ; urgent data bytes The total bytes of urgent data sent. This field is calculated by summing the urgent pointer offset values found in packets having the URG bit set in the TCP header. ; mss requested The Maximum Segment Size (MSS) requested as a TCP option in the SYN packet opening the connection. ; max segm size The maximum segment size observed during the lifetime of the connection. ; min segm size The minimum segment size observed during the lifetime of the connection. ; avg segm size The average segment size observed during the lifetime of the connection calculated as the value reported in the actual data bytes field divided by the actual data pkts reported. ; max win adv The maximum window advertisement seen. If the connection is using window scaling (both sides negotiated window scaling during the opening of the connection), this is the maximum window-scaled advertisement seen in the connection. For a connection using window scaling, both the SYN segments opening the connection have to be captured in the dumpfile for this and the following window statistics to be accurate. CHAPTER 1. A PRIMER ON NETWORK MEASUREMENT 16 ; min win adv The minimum window advertisement seen. This is the minimum windowscaled advertisement seen if both sides negotiated window scaling. 
; zero win adv The number of times a zero receive window was advertised.
; avg win adv The average window advertisement seen, calculated as the sum of all window advertisements divided by the total number of packets seen. If the connection endpoints negotiated window scaling, this average is calculated as the sum of all window-scaled advertisements divided by the number of window-scaled packets seen. Note that in the window-scaled case, the window advertisements in the SYN packets are excluded, since the SYN packets themselves cannot have their window advertisements scaled, as per RFC 1323 [25].
; initial window The total number of bytes sent in the initial window, i.e., the number of bytes seen in the initial flight of data before receiving the first ack packet from the other endpoint. Note that the ack packet from the other endpoint is the first ack acknowledging some data (the ACKs that are part of the 3-way handshake do not count), and any retransmitted packets in this stage are excluded.
; initial window The total number of segments (packets) sent in the initial window, as explained above.
; ttl stream length The Theoretical Stream Length, calculated as the difference between the sequence numbers of the SYN and FIN packets, giving the length of the data stream seen. Note that this calculation is aware of sequence space wrap-arounds, and is printed only if the connection was complete (both the SYN and FIN packets were seen).
; missed data The missed data, calculated as the difference between the ttl stream length and the unique bytes sent. If the connection was not complete, this calculation is invalid and an "NA" (Not Available) is printed.
; truncated data The truncated data, calculated as the total bytes of data truncated during packet capture. For example, with tcpdump the snaplen can be set to 64 (with the -s option) so that just the headers of the packet (assuming there are no options) are captured, truncating most of the packet data. In an Ethernet with a maximum segment size of 1500 bytes, most of each packet's payload would then be counted as truncated data.
; truncated packets The total number of packets truncated as explained above.
; data xmit time Total data transmit time, calculated as the difference between the times of capture of the first and last packets carrying non-zero TCP data payload.
; idletime max Maximum idle time, calculated as the maximum time between consecutive packets seen in the direction.
; throughput The average throughput, calculated as the unique bytes sent divided by the elapsed time, i.e., the value reported in the unique bytes sent field divided by the elapsed time (the time difference between the capture of the first and last packets in the direction).

RTT Stats

RTT (Round-Trip Time) statistics are generated when the -r option is specified along with the -l option. The following fields of output are produced along with the output generated by the -l option.
surya:/home/mani/tcptrace-manual> tcptrace -lr indica.dmp.gz 1 arg remaining, starting with ’indica.dmp.gz’ Ostermann’s tcptrace -- version 6.4.5 -- Fri Jun 13, 2003 153 packets seen, 153 TCP packets traced elapsed wallclock time: 0:00:00.128422, 1191 pkts/sec analyzed trace file elapsed time: 0:00:19.092645 TCP connection info: 1 TCP connection traced: TCP connection 1: host a: 192.168.0.70:32791 host b: webco.ent.ohiou.edu:23 complete conn: yes first packet: Thu Aug 29 18:54:54.782937 2002 last packet: Thu Aug 29 18:55:13.875583 2002 elapsed time: 0:00:19.092645 total packets: 153 filename: indica.dmp.gz a->b: b->a: total packets: 91 total packets: . . . . . . . . . . . . throughput: 10 Bps throughput: RTT RTT RTT RTT RTT samples: min: max: avg: stdev: 48 74.1 204.0 108.6 44.2 ms ms ms ms RTT RTT RTT RTT RTT samples: min: max: avg: stdev: 62 94 Bps 47 0.1 38.8 8.1 14.7 ms ms ms ms RTT from 3WHS: 75.0 ms RTT from 3WHS: 0.1 ms RTT RTT RTT RTT RTT 1 79.5 79.5 79.5 0.0 RTT RTT RTT RTT RTT 1 0.1 0.1 0.1 0.0 full_sz full_sz full_sz full_sz full_sz smpls: min: max: avg: stdev: post-loss acks: 0 ms ms ms ms full_sz full_sz full_sz full_sz full_sz smpls: min: max: avg: stdev: post-loss acks: For the following 5 RTT statistics, only ACKs for multiply-transmitted segments (ambiguous ACKs) were considered. Times are taken from the last instance of a segment. ambiguous acks: 1 ambiguous acks: RTT min (last): 76.3 ms RTT min (last): ms ms ms ms 0 0 0.0 ms CHAPTER 1. A PRIMER ON NETWORK MEASUREMENT 18 RTT max (last): RTT avg (last): RTT sdv (last): segs cum acked: duplicate acks: triple dupacks: max # retrans: min retr time: max retr time: avg retr time: sdv retr time: 76.3 76.3 0.0 0 0 0 1 380.2 380.2 380.2 0.0 ms ms ms ms ms ms ms RTT max (last): RTT avg (last): RTT sdv (last): segs cum acked: duplicate acks: triple dupacks: max # retrans: min retr time: max retr time: avg retr time: sdv retr time: 0.0 0.0 0.0 0 0 0 0 0.0 0.0 0.0 0.0 ms ms ms ms ms ms ms ; RTT samples The total number of Round-Trip Time (RTT) samples found. tcptrace is pretty smart about choosing only valid RTT samples. An RTT sample is found only if an ack packet is received from the other endpoint for a previously transmitted packet such that the acknowledgment value is 1 greater than the last sequence number of the packet. Further, it is required that the packet being acknowledged was not retransmitted, and that no packets that came before it in the sequence space were retransmitted after the packet was transmitted. Note : The former condition invalidates RTT samples due to the retransmission ambiguity problem, and the latter condition invalidates RTT samples since it could be the case that the ack packet could be cumulatively acknowledging the retransmitted packet, and not necessarily ack-ing the packet in question. ; RTT min The minimum RTT sample seen. ; RTT max The maximum RTT sample seen. ; RTT avg The average value of RTT found, calculated straightforward-ly as the sum of all the RTT values found divided by the total number of RTT samples. ; RTT stdev The standard deviation of the RTT samples. ; RTT from 3WHS The RTT value calculated from the TCP 3-Way Hand-Shake (connection opening) [18], assuming that the SYN packets of the connection were captured. ; RTT full sz smpls The total number of full-size RTT samples, calculated from the RTT samples of full-size segments. Full-size segments are defined to be the segments of the largest size seen in the connection. ; RTT full sz min The minimum full-size RTT sample. 
; RTT full sz max The maximum full-size RTT sample. ; RTT full sz avg The average full-size RTT sample. ; RTT full sz stdev The standard deviation of full-size RTT samples. 1.4. TSTAT OVERVIEW 19 ; post-loss acks The total number of ack packets received after losses were detected and a retransmission occurred. More precisely, a post-loss ack is found to occur when an ack packet acknowledges a packet sent (acknowledgment value in the ack pkt is 1 greater than the packet’s last sequence number), and at least one packet occurring before the packet acknowledged, was retransmitted later. In other words, the ack packet is received after we observed a (perceived) loss event and are recovering from it. ; ambiguous acks, RTT min, RTT max, RTT avg, RTT sdv These fields are printed only if there was at least one ack received that was ambiguous due to the retransmission ambiguity problem i.e., the segment being ack-ed was retransmitted and it is impossible to determine if the ack is for the original or the retransmitted packet. Note that these samples are not considered in the RTT samples explained above. The statistics below are calculated from the time of capture of the last transmitted instance of the segment. ; ambiguous acks is the total number of such ambiguous acks seen. The following RTT min, RTT max, RTT avg, RTT sdv fields represent the minimum, maximum, average, and standard deviation respectively of the RTT samples calculated from ambiguous acks. ; segs cum acked The count of the number of segments that were cumulatively acknowledged and not directly acknowledged. ; duplicate acks The total number of duplicate acknowledgments received. An ack packet is found to be a duplicate ack based on this definition used by 4.4 BSD Lite TCP Stack [26] : - The ack packet has the biggest ACK (acknowledgment number) ever seen. - The ack should be pure (carry zero tcp data payload). - The advertised window carried in the ack packet should not change from the last window advertisement. - There must be some outstanding data. ; triple dupacks The total number of triple duplicate acknowledgments received (three duplicate acknowledgments acknowledging the same segment), a condition commonly used to trigger the fast-retransmit/fast-recovery phase of TCP. ; max # retrans The maximum number of retransmissions seen for any segment during the lifetime of the connection. ; min retr time The minimum time seen between any two (re)transmissions of a segment amongst all the retransmissions seen. CHAPTER 1. A PRIMER ON NETWORK MEASUREMENT 20 ; max retr time The maximum time seen between any two (re)transmissions of a segment. ; avg retr time The average time seen between any two (re)transmissions of a segment calculated from all the retransmissions. ; sdv retr time The standard deviation of the retransmission-time samples obtained from all the retransmissions. The raw RTT samples found can also be dumped into data files with the -Z option as in tcptrace -Z file.dmp This generates files of the form a2b rttraw.dat and b2a rttraw.dat (for both directions of the first TCP connection traced), c2d rttraw.dat and d2c rttraw.dat (for the second TCP connection traced) etc. in the working directory. Each of the datafiles contain lines of the form : seq# rtt where seq# is the sequence number of the first byte of the segment being acknowledged (by the ack packet that contributed this RTT sample) and rtt is the RTT value in milli-seconds of the sample. 
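As a simple illustration of how these per-sample dumps can be post-processed, the following Python sketch re-computes the summary RTT statistics from one of the rttraw data files; the file name is only an example of the ones generated by the -Z option.

# Sketch: recompute summary RTT statistics from a tcptrace "-Z" dump file.
# Each line holds "seq# rtt", with rtt expressed in milliseconds.
import math

samples = []
with open('a2b_rttraw.dat') as f:          # example file name
    for line in f:
        seq, rtt = line.split()
        samples.append(float(rtt))

if samples:
    n = len(samples)
    avg = sum(samples) / n
    stdev = math.sqrt(sum((x - avg) ** 2 for x in samples) / n)
    print("RTT samples: %d" % n)
    print("RTT min:   %6.1f ms" % min(samples))
    print("RTT max:   %6.1f ms" % max(samples))
    print("RTT avg:   %6.1f ms" % avg)
    print("RTT stdev: %6.1f ms" % stdev)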
Note that only valid RTT samples (as counted in the RTT samples field listed above) are dumped.

CWND Stats

tcptrace reports statistics on the estimated congestion window with the -W option when used in conjunction with the -l option. Since there is no direct way to determine the congestion window at the TCP sender, the outstanding unacknowledged data is used to estimate the congestion window. The 4 new statistics produced by the -W option, in addition to the detailed statistics reported due to the -l option, are explained below.

surya:/home/mani/tcptrace-manual> tcptrace -lW malus.dmp.gz
1 arg remaining, starting with 'malus.dmp.gz'
Ostermann's tcptrace -- version 6.4.6 -- Tue Jul 1, 2003
32 packets seen, 32 TCP packets traced
elapsed wallclock time: 0:00:00.026658, 1200 pkts/sec analyzed
trace file elapsed time: 0:00:00.404427
TCP connection info:
1 TCP connection traced:
TCP connection 1:
        host a:        elephus.cs.ohiou.edu:59518
        host b:        A17-112-152-32.apple.com:80
        complete conn: yes
        first packet:  Thu Jul 10 19:12:54.914101 2003
        last packet:   Thu Jul 10 19:12:55.318528 2003
        elapsed time:  0:00:00.404427
        total packets: 32
        filename:      malus.dmp.gz
   a->b:                                  b->a:
     total packets:        16 pkts          total packets:        16 pkts
     ...                                    ...
     avg win adv:       22091 bytes         avg win adv:       33304 bytes
     max owin:            451 bytes         max owin:           1449 bytes
     min non-zero owin:     1 bytes         min non-zero owin:     1 bytes
     avg owin:              31 bytes        avg owin:           1213 bytes
     wavg owin:            113 bytes        wavg owin:           682 bytes
     initial window:       450 bytes        initial window:     1448 bytes
     ...                                    ...
     throughput:          1113 Bps          throughput:        44957 Bps

; max owin The maximum outstanding unacknowledged data (in bytes) seen at any point in time in the lifetime of the connection.
; min non-zero owin The minimum (non-zero) outstanding unacknowledged data (in bytes) seen.
; avg owin The average outstanding unacknowledged data (in bytes), calculated from the sum of all the outstanding data byte samples divided by the total number of samples.
; wavg owin The weighted average outstanding unacknowledged data seen. For example, if the outstanding data (odata) was 500 bytes for the first 0.1 seconds, 1000 bytes for the next 1 second, and 2000 bytes for the last 0.1 seconds of a connection that lasted 1.2 seconds, then wavg owin = ((500 x 0.1) + (1000 x 1) + (2000 x 0.1)) / 1.2 = 1041.67 bytes, an estimate close to the 1000 bytes that were outstanding for most of the lifetime of the connection. Note that the straightforward average of the three samples would have been (500 + 1000 + 2000) / 3 = 1166.67 bytes, a value less indicative of the outstanding data observed during most of the connection's lifetime.

1.4.3 Output Overview

This section provides a very brief overview of the Tstat output; since a number of the items have already been explained in the previous sections, being inherited from TCPtrace, this section only reports the list of items known to Tstat. Col 1 2 3 4 5 6 7 Label Client IP addr Client TCP port packets RST sent ACK sent PURE ACK sent unique bytes Description IP address of the client TCP port address of the client Total number of packets observed from the client 0 if no RST segment has been sent by the client, 1 otherwise Number of segments with the ACK field set to 1 Number of segments with ACK field set to 1 and no data Number of bytes sent in the payload CHAPTER 1.
A PRIMER ON NETWORK MEASUREMENT 22 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 data pkts data bytes rexmit pkts rexmit bytes out seq pkts SYN count FIN count RFC1323 ws RFC1323 ts window scale SACK req SACK sent MSS max seg size min seg size win max 24 win min 25 26 win zero cwin max 27 28 29 cwin min initial cwin Average rtt 30 31 32 33 34 35 36 37 38 rtt min rtt max Stdev rtt rtt count rtx RTO rtx FR reordering net dup unknown 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 flow control unnece rtx RTO unnece rtx FR != SYN seqno Server IP addr Server TCP port packets RST sent ACK sent PURE ACK sent unique bytes data pkts data bytes rexmit pkts rexmit bytes out seq pkts SYN count Number of segments with payload Number of bytes transmitted in the payload, including retransmissions Number of retransmitted segments Number of retransmitted bytes Number of segments observed out of sequence Number of SYN segments observed (including rtx) Number of FIN segments observed (including rtx) Window scale option sent [boolean] Timestamp option sent [boolean] Scaling values negotiated [scale factor] SACK option set [boolean] number of SACK messages sent MSS declared [bytes] Maxium segment size observed [bytes] Minimum segment size observed [bytes] Maximum receiver window announced (already scale by the window scale factor) [bytes] Maximum receiver window announced (already scale by the window scale factor) [bytes] Total number of segments declaring zero as receiver window Maximum in-flight-size (= largest sequence number - corresponding last ACK ) [bytes] Minimum in-flight-size [bytes] First in-flight size (or total unacked bytes sent before receiving the first ACK) [bytes] Average RTT (time elapsed between the data segment and the corresponding ACK) [ms] Minimum RTT observed during connection lifetime [ms] Maximum RTT observed during connection lifetime [ms] Standard deviation of the RTT [ms] Number of valid RTT observation Number of retransmitted segments due to timeout expiration Number of retransmitted segments due to Fast Retransmit (three dup-ack) Number of packet reordering observed Number of network duplicates observed Number of segments not in sequence (or duplicate which are not classified as specific events) Number of retransmitted segments to probe the receiver window Number of unnecessary transmissions following a timeout expiration Number of unnecessary transmissions following a fast retransmit Set to 1 if the eventual retransmitted SYN segments have different initial seqno IP addresses of the server TCP port addresses for the server Total number of packets observed form the server 0 if no RST segment has been sent by the server, 1 otherwise Number of segments with the ACK field set to 1 Number of segments with ACK field set to 1 and no data Number of bytes sent in the payload Number of segments with payload Number of bytes transmitted in the payload, including retransmissions Number of retransmitted segments Number of retransmitted bytes Number of segments observed out of sequence Number of SYN segments observed (including rtx) 1.4. 
TSTAT OVERVIEW 56 57 58 59 60 61 62 63 64 65 FIN count RFC1323 ws RFC1323 ts window scale SACK req SACK sent MSS max seg size min seg size win max 66 win min 67 68 win zero cwin max 69 70 cwin min initial cwin 71 Average rtt 72 73 74 75 76 77 78 79 80 rtt min rtt max Stdev rtt rtt count rtx RTO rtx FR reordering net dup unknown 81 82 83 84 85 86 87 88 89 90 91 92 flow control unnece rtx RTO unnece rtx FR != SYN seqno Completion time First time Last time C first payload S first payload C last payload S last payload Internal 23 Number of FIN segments observed (including rtx) Window scale option sent [boolean] Timestamp option sent [boolean] Scaling values negotiated [scale factor] SACK option set [boolean] number of SACK messages sent MSS declared [bytes] Maxium segment size observed [bytes] Minimum segment size observed [bytes] Maximum receiver window announced (already scale by the window scale factor) [bytes] Maximum receiver window announced (already scale by the window scale factor) [bytes] Total number of segments declaring zero as receiver window Maximum in-flight-size (= largest sequence number - corresponding last ACK ) [bytes] Minimum in-flight-size [bytes] First in-flight size, or total number of unack-ed bytes sent before receiving the first ACK segment [bytes] Average RTT (time elapsed between the data segment and the corresponding ACK)[ms] Minimum RTT observed during connection lifetime [ms] Maximum RTT observed during connection lifetime [ms] Standard deviation of the RTT [ms] Number of valid RTT observation Number of retransmitted segments due to timeout expiration Number of retransmitted segments due to Fast Retransmit (three dup-ack) Number of packet reordering observed Number of network duplicates observed Number of segments not in sequence (or duplicate which are not classified as specific events) Number of retransmitted segments to probe the receiver window Number of unnecessary transmissions following a timeout expiration Number of unnecessary transmissions following a fast retransmit Set to 1 if the eventual retransmitted SYN segments have different initial seqno Flow duration since first packet to last packet [ms] Flow first packet since first segment ever [ms] Flow last segment since first segment ever [ms] Client first segment with payload since the first flow segment [ms] Server first segment with payload since the first flow segment [ms] Client last segment with payload since the first flow segment [ms] Server last segment with payload since the first flow segment [ms] Bool set to 1 if the client has internal IP Table 1.2: Tstat Output: Log Field Description Basically, there are two currently supported kinds of output, although for on-going extensions, we refer the readed to Section 7.1.2. The first is a flow-level log, which tracks the statistics reported in Table 1.2; the table is partitioned into three macro-blocks, which reflects the flow direction: i.e., client to server, server to client, common to both and server and client. Notice that, since the log is CHAPTER 1. A PRIMER ON NETWORK MEASUREMENT 24 created on-line, the flows are sorted by their finish-time. The second kind of output is a periodical histogram dump, where some of the formerly mentioned statistics are tracked for all the aggregated flows in higher details; the configurable dump-interval time is, by default, 15 minutes. 
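Because the flow-level log is a plain text file with one flow per line and the columns of Table 1.2, it lends itself to straightforward post-processing. The following hypothetical Python sketch, which assumes the column layout of Table 1.2 and an example log file name, extracts the completion time and the client payload bytes of the flows originated by internal hosts.

# Sketch: post-process a Tstat flow-level log, assuming the column layout of
# Table 1.2 (columns are numbered from 1 in the table, hence the "- 1").
# The log file name is hypothetical.
internal_flows = []
with open('tstat_flow_log.txt') as f:
    for line in f:
        cols = line.split()
        internal = (cols[92 - 1] == '1')       # client has an internal IP
        client_bytes = int(cols[7 - 1])        # unique payload bytes from client
        completion_ms = float(cols[85 - 1])    # flow duration [ms]
        if internal:
            internal_flows.append((client_bytes, completion_ms))

if internal_flows:
    avg_time = sum(t for _, t in internal_flows) / len(internal_flows)
    print("internal flows: %d, average completion time: %.1f ms"
          % (len(internal_flows), avg_time))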
1.4.4 Usage Overview

For completeness, we report here a simple and brief overview of the calling syntax of Tstat: for a more detailed description we refer the reader to the manual available along with the standard distribution; conversely, we refer the reader to the next chapter for an exhaustive example of the analysis allowed by the tool.

Usage: tstat [-Nfile] [-d n] [-lmhtuvw] [-sdir] [-iinterface] [-ffilterfile]
             <file1 file2> [-dag device_name device_name ...]

Where:
-Nfile:  specify the file name which contains the description of the internal
         networks. This file must contain the subnets that will be considered
         as 'internal' during the analysis. Each subnet must be specified using
         the IP address on the first line and the netmask on the second one:
             130.192.0.0
             255.255.0.0
             193.204.134.0
             255.255.255.0
-d:      increase debug level (repeat to increase debug level)
-h:      print this help and exit
-H:      print internal histograms names and definitions
-t:      print ticks showing the trace analysis progress
-u:      do not trace udp packets
-v:      print version and exit
-w:      print *lots* of warnings
-m:      enable multi-threaded engine (useful for live capture)
-l:      enable live capture using libpcap
-iifc:   specifies the interface to be used to capture traffic
-ffile:  specifies the libpcap filter file (see tcpdump)
-dag:    enable live capture using Endace DAG cards; the default device for
         capture is /dev/dag0
-S:      pure rrd-engine for ever-lasting scalability (do not create histograms
         and log_files)
-Rconf:  specify the configuration file for integration with RRDTools (see
         README.rrdtools for further information)
-rpath:  path to use to create/update the RRDTool database: this should better
         be outside the directory tree and should be accessible from the Web
         server
-sdir:   puts the trace analysis into the specified directory
file:    trace file to be analyzed; use 'stdin' to read from standard input
         (e.g., when piped)

1.5 The Network Scenario

The Politecnico di Torino LAN is connected to the Internet through the Italian Academic and Research Network (GARR) [27]. This section briefly overviews this network in its current setup (Section 1.5.1), proposing some simple statistics (Section 1.5.2), and anticipating its future evolution (Section 1.5.3). Before entering into further details, let us point out that, although the networking scenario is common to all the research works described in the following chapters, it would nevertheless be cumbersome to introduce a common notation; along the same line, although the data sets used for all the works are somehow similar, there are both minor and critical differences among them. Therefore, while we provide here a comprehensive statistical study of the typical data pattern observed on our network, we will reintroduce the notation and describe the setup of each work separately, at the risk of some redundancy.

1.5.1 The GARR-B Architecture

The currently active network service is named "GARR-B Phase 4a: GARR BroadBand Network" and it deploys the GARR-B Project [28]. GARR-B supports the TCP/IP Transport Service with Managed Bandwidth Service (MBS) (which is also available for experimentation with new services) on top of a backbone running at 155 Mbps and 2.5 Gbps. GARR-B replaces the previous GARR-2 service, which was discontinued in late April 1999. Figure 1.2 depicts the evolution of the GARR-B Network from April 1999 to May 2004, and will be used as reference in the following.
The GARR-B network is widely interconnected with the other European and worldwide research networks (including Internet2) via the 2.5 Gbps link to the GEANT [30] backbone, and with the commercial Internet via a 2.5 Gbps link (Global Crossing - GX) and a 622 Mbps link (KPNQwest - KQ). The route from Politecnico di Torino toward an over-ocean Internet machine is as follows:

; from the Politecnico LAN to the Torino (TO) access POP, over an ATM AAL-5 link at 34 Mbps (not shown on the map);
; from the TO access POP to the Milano (MI) router (of the fully meshed backbone among MI, BO, RM, NA), over a 155 Mbps channel (red on the map);
; from the MI backbone router to:
  – MI-GEANT, 2.5 Gbps (violet on the map)
  – MI-GX, 2.5 Gbps (violet)
  – any other backbone router, at either 2.5 Gbps or 155 Mbps (respectively, blue and red)

Figure 1.2: Evolution of the GARR-B Network: from April 1999 to May 2004

1.5.2 The GARR-B Statistics

Abstracting from the network geography, Figure 1.3 represents the interconnection of functional network elements, where edges are drawn using different colors depending on the traffic volume exchanged. Let us now consider the Torino Metropolitan Area Network (MAN) subportion of the whole GARR-B Network, shown in the zoomed inset of Figure 1.3: it can be seen that our institution is the most active network element. Focusing on Politecnico di Torino, we can observe input versus output load statistics over different timescales. Since the measures are taken at the GARR, the input direction is to be considered from the POP perspective. Figure 1.4 reports yearly to monthly and weekly to daily timescales. The graphics shown in Figure 1.4 were created by the GARR using the Multi Router Traffic Grapher (MRTG) [29]. MRTG is a tool to monitor the traffic load on network links, which generates HTML pages containing graphical images that provide a live visual representation of this traffic. As can be gathered, the output traffic represents the predominant portion of the traffic; considering the POP perspective, the traffic streams entering the Politecnico di Torino have to be considered as output streams. Therefore, the Politecnico LAN is principally a client network, where clients contact external servers (e.g., for file downloads or other services). Similarly, Table 1.3 reports the numerical results referring to Figure 1.4; instantaneous values were sampled on a Tuesday of October 2004, nearly at tea time³.

³ Probably, today's 5 o'clock tea will be an excellent Fortnum and Mason's Earl Grey.

1.5.3 The GARR-G Project

The GARR-G project defines a program of activities to develop and provision the next generation of the data transmission network infrastructure for the Italian academic and research community,
named GARR-Giganet.

Figure 1.3: Logical Map of the Italian GARR-B Network, with a Zoomed View of the Torino MAN

                                        Bandwidth Mbps (Utilization %)
Direction                                  Max             Average         Instantaneous
'Daily' Graph (5 Minute Average)    In     17.8 (63.7%)    4.75 (17.0%)    9.71 (34.7%)
                                    Out    28.0 (100.0%)   11.0 (39.3%)    27.7 (98.8%)
'Weekly' Graph (30 Minute Average)  In     18.0 (64.4%)    5.27 (18.8%)    8.94 (31.9%)
                                    Out    27.7 (98.8%)    9.56 (34.2%)    27.3 (97.6%)
'Monthly' Graph (2 Hour Average)    In     16.8 (60.0%)    4.87 (17.4%)    11.4 (40.6%)
                                    Out    27.5 (98.0%)    8.25 (29.5%)    27.0 (96.3%)
'Yearly' Graph (1 Day Average)      In     15.6 (55.8%)    5.14 (18.4%)    4.95 (17.7%)
                                    Out    16.1 (57.3%)    7.75 (27.7%)    12.9 (46.2%)

Table 1.3: Input versus Output Bandwidth and Utilization Statistics over Different Time-Scales: Yearly to Monthly and Weekly to Daily

GARR-Giganet will be the evolution of the present GARR-B network. GARR-Giganet will continue to provide network connectivity between all the Italian academic and research institutions and with the rest of the world, and will have an active role in the development of the information society.

Figure 1.4: Input versus Output Load Statistics over Different Time-Scales: Yearly to Monthly and Weekly to Daily

The GARR-Giganet network will provide connectivity and services, always being at the state of the art in transmission technology, supporting IPv6 addresses and advanced Quality of Service (QoS) policies. The network will provide a highly distributed access. The backbone will be based on a high speed optical transport, using point-to-point "lambdas" with a link speed not lower than 2.5 Gigabit per second. The development of metropolitan networks connected to GARR-GigaPoPs will allow high speed advanced services to GARR users connected to MANs. The international connections to the European research networks (GEANT) [30] and to other international networks (research networks like Internet2 [31], as well as general interest ones) are part of the basic infrastructure and will also be connected at high speed. Meanwhile, a pilot network, named GARR-G Pilot, has been installed and is operational since the beginning of 2001. It is based on optical "lambda" links at 2.5 Gb/s and it spans well over half of the Italian territory: the Pilot provides solid ground for the engineering of GARR-Giganet.

Chapter 2

The Measurement Setup

THIS chapter introduces Tstat [34], a new tool for the collection and statistical analysis of TCP/IP traffic, able to infer TCP connection status from traces. Discussing its use, we present some of the performance figures that can be obtained and the insight that such figures can give on TCP/IP protocols and the Internet. While field measures have always been the starting point for network planning and dimensioning, their statistical analysis beyond simple traffic volume estimation is not so common. Analyzing Internet traffic is difficult not only because of the large number of performance figures that can be devised in TCP/IP networks, but also because many performance figures can be derived only if bidirectional traffic is jointly considered. Tstat automatically correlates incoming and outgoing flows and derives about 80 different performance indices both at the IP and at the TCP level, allowing a very deep insight in the network performance.
Moreover, while standard performance measure, such as flow dimensions, traffic distribution, etc., remain at the base of traffic evaluation, more sophisticated indices, like the out-of-order probability and gap dimension in TCP connections, obtained through data correlation between the incoming and outgoing traffic, give reliable estimates of the network performance also from the user perspective. Several of these indices are discussed on traffic measures performed for more than 2 months on the access link of our institution. 2.1 Traffic Measures in the Internet Planning and dimensioning of telecom networks is traditionally based on traffic measurements, upon which estimates and mathematical models are built. While this process proved to be reasonably simple in traditional, circuit switched, networks, it seems to be much harder in packet switched IP based networks, where the TCP/IP client-server communication paradigm introduces correlation both in space and time. While a large part of this difficulty lies in the failure of traditional modeling paradigms [35, 43], there are also several key points to be solved in performing the measures themselves and, most of all, in organizing the enormous amount of data that are collected through measures. First of all, the client-server communication paradigm implies that the traffic behavior does have meaning only when the forward and backward traffic are jointly analyzed, otherwise half of the story goes unwritten, and should be wearingly inferred (with high risk of making mistakes!). 29 30 CHAPTER 2. THE MEASUREMENT SETUP This problem makes measuring inherently difficult; it can be solved if measures are taken on the network edge, where the outgoing and incoming flows are necessarily coupled, but it can prove impossible in the backbone, where the peering contracts among providers often disjoint the forward and backward routes [44]. Second, data traffic must be characterized to a higher level of detail than voice traffic, since the ‘always-on’ characteristics of most sources and the nature itself of packet switching require the collections of data at the session, flow, and packet level, while circuit switched traffic is well characterized by the connection level alone. This is due to the source model of the traffic, which is well characterized and relatively simple in case of voice traffic, but more complex and variable in case of data networks, where different application models can coexist and interact together. In the absence of CAC (Connection Admission Control) functions and with a connectionless communication model, the notion of connection itself becomes quite fuzzy in the Internet. Summarizing, the layered structure of the TCP/IP protocol suite, requires the analysis of traffic at least at the IP, TCP/UDP, and Application level in order to have a picture of the traffic clear enough to allow the interpretation of data. Starting from the pioneering work of Danzig [57, 58, 59] and of Paxons and Floyd [35, 60] in which the authors characterized the traffic of the ”first Internet” via measures, there has always been an increasing interest in the data collection, measure and analysis, to characterize either the network protocol or the users behavior. After the birth of the Web, lots of effort has been devoted to study caching and content delivery architecture, which intrinsically are based on the deep knowledge of the traffic and user behavior. 
Thus many works analyze traces at the application levels, typically log files of web servers or proxy servers [36, 37, 38]. These are then very helpful understand user behavior, but less interesting from the network point of view. Many projects are instead using real traffic traces, captured form large campus networks, like the work in [39], where the authors characterize the HTTP protocol by using large traces collected at the university campus at Berkeley. Similarly in [40] the authors present data collected from a large Olympic server in 1996, where very useful findings are helpful to understand TCP behavior, like loss recovery efficiency and ACK compression. In [41], authors analyzed more than 23 millions of HTTP connections, and derived a model for the connection interarrival time. More recently, the authors of [42] analyze and derive models for the Web traffic, starting from the TCP/IP header analysis. None of these works, however, characterize the traffic at the network level, rebuilding the status of single TCP connections, independently from the application level. We are interested in passive tools which analyze traces rather than active tools that derive performance indices injecting traffic in the network, like for example the classic ping or traceroute utilities. Among passive tools, many are based on the libpcap library developed with the tcpdump tool [11, 10], that allow different protocol level analysis. For example, tcpanaly is a tool for automatically analyzing a TCP implementation’s behavior by inspecting packet traces of the TCP’s activity. Another interesting tool is tcptrace [49], which is able to rebuild a TCP connection status from traces, matching data segments and ACKs. For each connection, it keeps track of elapsed time, bytes/segments sent and received, retransmissions, round trip times, window advertisements, throughput, etc. At IP level, ntop [45] is able to collect statistics, enabling users to track relevant network activities including traffic characterization, network utilization, network protocol usage. However, none of the tools are able to derive statistical data collection and post-elaboration. 2.2. THE TOOL: TSTAT 31 Thus, to the best of our knowledge, this chapter presents two different contributions in the field of Internet traffic measures. ; A new tool, briefly described in Sect. 2.2, for gathering and elaborating Internet measurements has been developed and made available to the community. ; The description of the most interesting results of traffic analysis performed with the above tool, discussing their implication on the network, at both the IP level in Sect. 2.4, and the TCP level in Sect. 2.5. In the remaining of the chapter we assume that the reader is familiar with the Internet terminology, that can be found in [46, 47, 48] for example. 2.2 The Tool: Tstat The lack of automatic tools able to produce statistical data from collected network traces was a major motivation to develop a new tool, called Tstat [34], which, starting from standard software libraries, is able to offer network managers and researchers important information about classic and novel performance indices and statistical data about Internet traffic. Started as an evolution of TCPtrace [49], Tstat is able to analyze traces in real time (Tstat processes a 6 hour long trace from a 16 Mbit/s link in about 15 minute), using common PC hardware, or start from previously recorded traces in various dump formats, such the one supported by the libpcap library [11]. 
The software assumes as input a trace collected on an edge node, so that both data segments and ACK segments can be analyzed. Besides common IP statistics, derived from the analysis of the IP header, Tstat is also able to rebuild each TCP connection status by looking at the TCP header in the forward and backward packet flows. If the connection opening and closing are observed, the flow is marked as a complete flow, and then analyzed. To free the memory related to the status of TCP flows that are inactive, a timer is used as garbage collector. TCP segments that belong to flows whose opening is not recorded (because they either started before the trace, or were prematurely declared closed by the garbage procedure) are discarded and marked as "garbage", and are put into the corresponding bin (or else into the toilet and then flushed). The TCP flow analysis allows the derivation of novel statistics, such as, for example, the congestion window size, out-of-sequence segments, duplicated segments, etc. Some of the data analyses are described in deeper detail in Sects. 2.3–2.5, along with the measurement campaign conducted on our campus access node. Statistics are collected distinguishing between clients and servers, i.e., hosts that actively open a connection and hosts that reply to the connection request, but also identifying internal and external hosts, i.e., hosts located in our campus LAN or outside it with respect to the edge node where measures are collected. Thus incoming and outgoing packets/flows are identified. Instead of dumping single measured data, for each measured quantity Tstat builds a histogram, collecting the distribution of that given quantity. Every 15 minutes, it produces a dump of all the histograms it collected. A set of companion tools is available to produce both time plots and aggregated plots over different time spans. Moreover, a Web interface is available [34], which allows the user to browse all the collected data, select the desired quantity, and directly produce graphs, as well as retrieve the raw data that can then later be used for further analysis.

Period     Pkts [10^6]     Protocol share [%]          Flows [10^6]    Trash %
                           other    UDP     TCP
Jun. 00    242.7           0.75     5.96    93.29      4.48            5.72
Jan. 01    401.8           0.59     4.31    95.10      7.06            6.77

Table 2.1: Summary of the analyzed traces

A total of 79 different histogram types are available, comprising both IP and TCP statistics. They range from classic measures directly available from the packet headers (e.g., percentage of TCP or UDP packets, packet length distribution, TCP port distribution, ...), to advanced measures, related to TCP (e.g., average congestion window, RTT estimates, out-of-sequence data, duplicated data, ...). A complete log also keeps track of all the TCP flows analyzed, and is useful for post-processing purposes.

2.3 Trace analysis

We performed a trace elaboration using Tstat on data collected on the Internet access link of Politecnico di Torino, i.e., between the border router of Politecnico and the access router of GARR/BTEN [27], the Italian and European Research network. Data were collected on files, each storing a 6 hour long trace (to avoid incurring in file system size limitations), for a total of more than 100 Gbytes of compressed data. Within the Politecnico Campus LAN, there are approximately 7,000 access points; most of them are clients, but several servers are regularly accessed from outside institutions.
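The flow tracking just described — per-connection state, complete-flow detection when both the opening and the closing are observed, and a garbage-collection timeout for idle or unopened flows — can be summarized by the simplified Python sketch below. It is only an illustration of the bookkeeping, not Tstat's actual implementation, and the timeout value is an arbitrary example.

# Sketch: minimal TCP flow bookkeeping with complete-flow detection and a
# garbage-collection timeout, in the spirit of the analysis described above.
IDLE_TIMEOUT = 300.0        # seconds; example value, not Tstat's actual setting

flows = {}                  # per-connection state, keyed on the normalized 4-tuple

def normalize(src, sport, dst, dport):
    return tuple(sorted([(src, sport), (dst, dport)]))

def on_segment(ts, src, sport, dst, dport, syn, fin):
    key = normalize(src, sport, dst, dport)
    state = flows.get(key)
    if state is None:
        if not syn:
            return 'garbage'            # opening not observed: discard
        state = flows[key] = {'first': ts, 'last': ts, 'fins': 0}
    state['last'] = ts
    if fin:
        state['fins'] += 1
    if state['fins'] >= 2:              # both sides closed: complete flow
        del flows[key]
        return 'complete'
    return 'tracked'

def garbage_collect(now):
    # Free the state of flows that have been idle for too long.
    for key in [k for k, s in flows.items() if now - s['last'] > IDLE_TIMEOUT]:
        del flows[key]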
Figure 2.1: IP payload traffic balance - Period (B)

The data was collected during different time periods, in which the network topology evolved. Among them, we selected the earliest one and the latest one.

; Period (A) - June 2000: from 6/1/2000 to 6/11/2000, when the bandwidth of the access link was 4 Mbit/s, and the link between the GARR and the corresponding US peering was 45 Mbit/s;
; Period (B) - January 2001: from 1/19/2001 to 2/11/2001, when the bandwidth of the access link was 16 Mbit/s, and the link between the GARR and the corresponding US peering was 622 Mbit/s.

The two periods are characterized by a significant upgrade in network capacity. In particular, the link between GARR and the US was a bottleneck during June 2000, but not during January 2001. Other data collections are freely available through the web interface [34]. Every time we observed a non negligible difference in the measures during different periods, we report both of them. Otherwise, we report only the most recent one. Table 2.1 summarizes the traces that are analyzed in the next two sections. It shows that, not surprisingly, the larger part of the traffic is transported using the TCP protocol, the UDP traffic percentage being about 5%, and other protocols practically negligible. The number of complete TCP flows is larger than 4 and 7 million in the two periods, and only about 6% of the TCP packets were trashed by the flow analysis, the majority of them belonging to the beginning of each trace. Fig. 2.1 plots the IP payload type evolution normalized to the link capacity, during a week in period (B). The alternation between days and nights, and between working days and weekend days, is clearly visible. This periodic behavior allows us to define a busy period, which we selected from 8 AM to 6 PM, Monday to Friday. Thus, in the remainder of the chapter, we report results averaged only over busy periods.

Figure 2.2: Distribution of the incoming traffic versus the source IP address

2.4 IP Level Measures

Most measurement campaigns in data networks concentrated on traffic volumes, packet interarrival times and similar measures. We avoid reporting similar results because they do not differ much from previously published ones, and because we think that, from the data elaboration tool presented, other and more interesting performance figures can be derived, which allow a deeper insight in the Internet. Thus, we report only the most interesting statistics that can be gathered looking at the IP protocol header, referring the reader to [34] to analyze all the figures he might be interested in. Fig. 2.2 plots the distribution of the hit count for incoming traffic, i.e., the relative number of times the same IP address was seen, at the IP level. The log/log scale plot in the inset box draws the whole distribution, while the larger, linear scale plot magnifies the first 100 positions of the distribution. More than 200,000 different hosts were contacted, with the top 10 sending about 5% of the packets.
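A ranking such as the one plotted in Fig. 2.2 requires very little code once the source address of each incoming packet is available; the following Python sketch assumes a hypothetical text file with one source IP address per incoming packet and prints the relative hit count of the top 10 sources.

# Sketch: per-source hit count and ranking of incoming traffic, as in Fig. 2.2.
# Assumes a text file with one source IP address per incoming packet.
from collections import Counter

with open('incoming_sources.txt') as f:        # hypothetical input file
    hits = Counter(line.strip() for line in f if line.strip())

total = sum(hits.values())
for rank, (addr, count) in enumerate(hits.most_common(10), 1):
    print("%2d  %-15s  %6.2f %%" % (rank, addr, 100.0 * count / total))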
It is interesting to note that the distribution of the traffic is very similar during the two different periods, but, looking at the top 100 IP addresses, little correlation can be found: the most contacted hosts are different, but the relative quantity of traffic they send is, surprisingly, the same. This confirms the difficulties in predicting the traffic pattern in the Internet. A further, very interesting feature is the similarity of the TCP flow and the IP packet distributions. The reason probably lies in the dominance of short, web-browsing flows in the overall traffic.

       June 2000                       Jan. 2001
Host name         Flows        Host name        Flows
.cineca.it        244206       .kataweb.it      193508
.kataweb.it        91880       .cineca.it       184770
.ilsole24ore.it    79039       .e-urope.it       93989
.ilsole24ore.it    68194       .iol.it           87946
.kataweb.it        64202       .matrix.it        76917
.heartland.net     53539       .kataweb.it       75611
.edu.tr            51556       .kataweb.it       74990
.ilsole24ore.it    50721       .iol.it           61526
.banki.hu          47092       .supereva.it      61237
.iol.it            42837       .matrix.it        57550

Table 2.2: Host name of the 10 most contacted hosts on a flow basis

Looking instead at the distance (number of hops) between the client and the server, Fig. 2.3 reports the distribution of the Time To Live (TTL), distinguishing between incoming and outgoing packets. For the outgoing traffic, the TTL distribution is concentrated on the initial values set by the different operating systems: 128 (Win 98/NT/2000), 64 (Linux, SunOS 5.8), 32 (Win 95) and 60 (Digital OSF/1) being the most common. For the incoming traffic, instead, we can see very similar distributions at the left of each peak, reflecting the number of routers traversed by packets before arriving at the measurement point. The zoomed plot in the box shows that, supposing that the outside hosts set the TTL to 128, the number of hops traversed by packets is between 5 and 25 hops(1).

(1) The minimum is intrinsically due to the topological position of the Politecnico gateway [27].

Figure 2.3: Distribution of the TTL field value for outgoing and incoming packets - Period (B)

                 succ.   client   server   unset
June 2000
  SACK            5.0     29.5      0.1     70.4
  WinScale       10.9     19.2      1.3     79.5
  TimeStamp       3.1      3.1      0.1     96.9
January 2001
  SACK           11.6     32.9      0.3     66.7
  WinScale        5.1     10.9      1.2     87.9
  TimeStamp       4.5      4.5      0.0     95.5

Table 2.3: TCP options negotiated

2.5 TCP Level Measures

We concentrate now on TCP traffic, which represents the vast majority of the collected traffic. Tstat offers the most interesting (and novel) performance figures at this level. The first figure we look at is the TCP options [50, 51] negotiated during the three-way handshake. Table 2.3 shows the percentage of clients that requested an option in the "client" column; if the server positively replies, then the option is successfully negotiated and accounted in the "succ." column; otherwise it will not be used. The "unset" percentage counts connections where no option was present, from either side. Finally, the "server" column reports the percentage of servers that, although they did not receive the option request, did send an option acknowledgment. For example, looking at the SACK option, we see that about 30% of clients declared SACK capabilities, which were accepted by servers only in 5% of connections in June 2000 and 11.6% in January 2001. Note the "strange" behavior of some servers: 0.1% and 0.3% of replies contain a positive acknowledgment to clients that did not request the option.
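The per-connection classification behind Table 2.3 (option successfully negotiated, requested by the client only, acknowledged by the server without a request, or unset) only requires inspecting the options carried by the SYN and SYN-ACK segments. The following hypothetical Python sketch shows the classification step, assuming the sets of option kinds seen in the two segments have already been extracted.

# Sketch: classify the negotiation of a single TCP option (e.g. SACK-permitted)
# from the option kinds seen in the SYN and in the SYN-ACK of a connection.
def classify_option(client_syn_opts, server_synack_opts, kind):
    requested = kind in client_syn_opts
    granted = kind in server_synack_opts
    if requested and granted:
        return 'succ'       # option successfully negotiated
    if requested:
        return 'client'     # requested by the client, not granted by the server
    if granted:
        return 'server'     # "strange" reply: acknowledged but never requested
    return 'unset'

# Example: option kind 4 is SACK-permitted (RFC 2018), 2 is MSS, 8 is timestamps.
print(classify_option({2, 4, 8}, {2}, 4))      # -> 'client'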
In general, we can state that TCP options are rarely used, and, while the SACK option is increasingly used, the use of the Window Scale and Timestamp options is either constant or decreasing.

2.5.1 TCP flow level analysis

We now consider flow-level figures, i.e., those that require correlating both directions of a flow for their derivation.

Figure 2.4: Incoming and outgoing flows size distribution; tail distribution in log-log scale (lower plot); zoom in linear and log-log scale of the portion near the origin (upper plots) - Period (B)

Fig. 2.4 reports 3 plots. The lower one shows the tail of the distribution in log-log scale, showing a clear linear trend, typical of heavy tailed distributions. The linear plot (upper, large one) shows a magnification near the origin, with the characteristic peak of very short connections, typically of a few bytes. The inset in log-log scale shows the portion of the distribution where the mass is concentrated, and where the linear decay begins. In particular, incoming flows shorter than 10 kbytes are 83% of the total analyzed flows, while outgoing flows shorter than 10 kbytes are 98.5%. Notice also that the dimension of incoming flows is consistently larger than that of outgoing flows, except very close to the origin. The TCP port number distribution, which directly translates into the traffic generated by the most popular applications, is reported in Table 2.4, sorted by decreasing percentage. Results are reported for each application in percentage of flows, segments (including signalling and control ones), and bytes (considering the payload only).

Service        Port    flow %    segm. %    bytes %
HTTP             80     81.48     62.78      61.27
SMTP             25      2.98      2.51       2.04
HTTPS           443      1.66      0.87       0.52
POP             110      1.26      0.93       0.42
FTP control      21      0.54      0.54       0.50
GNUTELLA       6346      0.53      2.44       1.58
FTP data         20      0.51      6.04       9.46

Table 2.4: Percentage of TCP traffic generated by common applications in number of flows, segments and transferred bytes - Period (B)

The application average flow size in both directions of the TCP flows is instead reported in Table 2.5. These measures take a different look at the problem; indeed, that the largest portion of the Internet traffic is web-browsing is no news at all, and that FTP amounts to roughly 10% of the byte traffic, though the number of FTP flows is marginal, is again well known. The different amount of data transferred by the applications in the client-server and server-client directions is instead not as well known, though not surprising. The asymmetry is much more evident when expressed in bytes than in segments, hinting at a large number of control segments (acknowledgments) sent without data to piggyback them on. For example, an HTTP client sends to the server about 1 kbyte of data, and receives about 16 kbytes as reply. But more than 15 segments go from the client to the server, while more than 19 segments from the server to the client are required to transport the data.

                    client to server          server to client
Service           segment        byte       segment        byte
HTTP                 15.2      1189.9          19.5     15998.4
SMTP                 21.2     15034.4          16.7       624.3
HTTPS                11.5       936.7          12.3      6255.8
POP                  14.9        91.1          18.5      7489.0
FTP control          23.7     11931.1          21.9      9254.3
GNUTELLA            101.5     23806.9         105.7     44393.9
FTP data            314.0    343921.3         223.5     82873.2

Table 2.5: Average data per flow sent by common applications, in segments and bytes - Period (B)

Fig. 2.5 confirms the intuition given by Table 2.5. The figure reports the index of asymmetry ξ of the connections, obtained as the ratio between the client-to-server and the total (client-to-server plus server-to-client)
Table 2.5: Average data per flow sent by common applications, in segments and bytes - Period (B)

                    client to server         server to client
  Service          segments      bytes      segments      bytes
  HTTP                 15.2     1189.9          19.5    15998.4
  SMTP                 21.2    15034.4          16.7      624.3
  HTTPS                11.5      936.7          12.3     6255.8
  POP                  14.9       91.1          18.5     7489.0
  FTP control          23.7    11931.1          21.9     9254.3
  GNUTELLA            101.5    23806.9         105.7    44393.9
  FTP data            314.0   343921.3         223.5    82873.2

Fig. 2.5 confirms the intuition given by Table 2.5. The figure reports the index of asymmetry ξ of connections, obtained as the ratio between the client-server amount of transferred data and the total client-server plus server-client amount, i.e.:

  ξ = W_cs / (W_cs + W_sc)

where W_cs and W_sc denote the data flowing from client to server and from server to client, respectively, measured either bytes-wise on the net payload (upper plot) or segment-wise (lower plot), the latter again including all the signalling/control segments.

Figure 2.5: Asymmetry distribution of connections expressed in bytes (upper plot) and segments (lower plot) - Period (B)

The upper plot shows a clear trend toward asymmetric connections, with many more bytes transferred from the server to the client. If we consider the number of segments, instead, connections are almost perfectly symmetrical, as highlighted by the inset magnifying the central portion of the distribution: 25% of the connections are perfectly symmetrical, and no effect is observed due to the delayed ACK implementation. This observation can have non-marginal effects on the design of routers, which, regardless of the asymmetry of the information in bytes, must always route and switch an almost equal number of packets in both directions.
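As a quick illustration, the asymmetry index and its distribution can be derived from per-flow byte (or segment) counters as in the following sketch; the record format is hypothetical, not Tstat's own.

```python
# Sketch: asymmetry index xi = W_cs / (W_cs + W_sc) per connection, plus its
# histogram over 100 bins as in Fig. 2.5. Each flow is a hypothetical
# (client_to_server_bytes, server_to_client_bytes) pair.

def asymmetry(client_to_server, server_to_client):
    total = client_to_server + server_to_client
    return client_to_server / total if total > 0 else 0.5

def histogram(flows, bins=100):
    pdf = [0] * bins
    for c2s, s2c in flows:
        xi = asymmetry(c2s, s2c)
        pdf[min(int(xi * bins), bins - 1)] += 1
    return [100.0 * n / len(flows) for n in pdf]   # percentage per bin

# A typical HTTP flow: ~1.2 kB of requests against ~16 kB of replies
print(asymmetry(1190, 15998))   # ~0.07, i.e., strongly server-dominated
```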
Fig. 2.6 reports the distribution of the connection completion time, i.e., the time elapsed between the first segment of the three-way handshake and the last segment that closes the connection in both directions. Obviously, this measure depends upon the application, the data size, the path characteristics, network congestion, possible packet drops, etc. There are however several characteristic features in the plot. First of all, most of the connections complete in just a few seconds: indeed, about 52% last less than 1 second and 78% less than 10 seconds, while only 5% of the connections are not yet terminated after 70 s, where the histogram collects all the remaining flows. Moreover, the completion time tends to be heavy tailed, which can be related to the heavy-tailed flow size. Finally, spikes in the plot are due to application-level timeouts (e.g., 15 seconds corresponds to a timer in the AUTH protocol, 30 seconds to cache and web server timers), which are generally not considered at all in traffic analysis. Interestingly, in the inset it is also possible to observe the retransmission timeout suffered by TCP flows whose first segment is dropped, which is set by default to 3 seconds.

Figure 2.6: Distribution of the connection completion time - Period (B)

2.5.2 Inferring TCP Dynamics from Measured Data

In this last section, we show how some more sophisticated elaborations of the raw measured data can be used to gain insight into TCP behavior and dynamics. Fig. 2.7 plots the distribution of the receiver window (rwnd) advertised in the TCP header during the handshake. Looking at the plots for period (A), we note that about 50% of clients advertise an rwnd of around 8 kbytes, while 16 kbytes is used by about 9% of the connections and 30% use about 32 kbytes. These values are obtained by summing together all the bins around 8, 16 and 32 kbytes. During period (B), we observe a general increase in the initial rwnd, the respective percentages now being 44%, 19% and 24%.

Figure 2.7: Distribution of "rwnd" as advertised during handshake

Note that an 8 kbyte rwnd can be a strong limitation on the maximum throughput a connection can reach: for a 200 ms round trip time, it corresponds to about 40 kbytes/s. To complete the picture, Fig. 2.8 plots the estimated in-flight data for the outgoing flows, i.e., the bytes already sent by a source inside our LAN whose ACK has not yet been received, evaluated by looking at the sequence and acknowledgment numbers in the two directions. Given that the measures are collected very close to the sender, and that the rwnd is not a constraint, this is an estimate of the sender congestion window. The discrete-like result clearly shows the effect of the segmentation used by TCP. Moreover, flow lengths being very short (see Fig. 2.4), the flight size is always concentrated on small values: more than 83% of the samples correspond to flight sizes smaller than 4 kbytes. Finally, note that the increased network capacity in period (B) does not apparently affect the in-flight data, and hence the congestion windows observed. This suggests that in the current Internet scenario, where most of the flows are very short and the main limitation to the performance of the others seems to be the receiver buffer, the dynamic, sliding-window mechanism of TCP rarely comes into play. The only effect of TCP on the performance is to delay the data transfer, both with the three-way handshake and with unnecessary, very long timeouts when one of the first packets of the flow is dropped, an event that is due to traffic fluctuations and not to congestion induced by the short flow itself.

Figure 2.8: TCP congestion window estimated from the TCP header
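A minimal sketch of this in-flight estimate follows, under simplifying assumptions (no sequence-number wrap-around, no reordering or SACK handling); the tuple format is hypothetical and not Tstat's internal representation.

```python
# Sketch: estimate the sender's in-flight data (Fig. 2.8) from sequence and
# acknowledgment numbers seen at the measurement point. Input is a hypothetical
# list of (direction, seq, payload_len, ack) tuples in capture order.

def in_flight_samples(segments):
    highest_acked = None
    samples = []
    for direction, seq, length, ack in segments:
        if direction == "in":                 # ACKs travelling toward our sender
            if highest_acked is None or ack > highest_acked:
                highest_acked = ack
        elif length > 0:                      # outgoing data segment
            if highest_acked is not None:
                samples.append(seq + length - highest_acked)
    return samples

segs = [("out", 1, 1460, 0), ("in", 0, 0, 1461), ("out", 1461, 1460, 0),
        ("out", 2921, 1460, 0)]
print(in_flight_samples(segs))   # [1460, 2920] bytes in flight
```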
The out-of-sequence burst size (OutB), i.e., the byte-wise length of data received out of sequence, and, similarly, the duplicated data burst size (DupB), i.e., the byte-wise length of contiguous duplicated data received, are other interesting performance figures related to TCP and the Internet. In particular, an OutB can be observed if either packet reordering occurred in the network or packets were dropped along the path, but only if the flow is longer than one segment. A DupB, instead, can be observed if either a packet is replicated in the network or if, after a packet drop, the recovery phase performed by the sender covers already received segments. Table 2.6 reports the probability of observing OutB (or DupB) events, evaluated with respect to the number of segments and flows observed, i.e., the ratio between the total OutB (or DupB) events recorded and the number of packets or flows observed during the same period of time.

Table 2.6: OutB or DupB events rate, computed with respect to the number of packets and flows

  P{OutB}           vs Pkt               vs flow
  Period        in %    out %        in %    out %
  Jun. 00       3.44     0.07       43.41     0.43
  Jan. 01       1.70     0.03       23.03     0.99

  P{DupB}           vs Pkt               vs flow
  Period        in %    out %        in %    out %
  Jun. 00       1.45     1.47       18.27    18.59
  Jan. 01       1.31     1.09       17.65    14.81

Starting from OutB, we see that practically no OutB events are recorded on the outgoing flows, thanks to the simple LAN topology, which, being a 100 Mbit/s switched Ethernet, rarely drops packets (recall that the access link capacity is either 4 Mbit/s or 16 Mbit/s). On the contrary, the probability of observing an OutB is rather large for the incoming data: 3.4% of packets are received out of sequence in period (A), corresponding to a 43% probability when related to flows. Looking at the measures referring to period (B), we observe a halved chance of OutB, 1.7% and 23% respectively. This is mainly due to the increased capacity of the access and US peering links, which reduced the dropping probability; however, the loss probability remains very high, especially considering that these are average values over the whole working day. Looking at the DupB probabilities, and recalling that the internal LAN can be considered a sequential, drop-free environment, the duplicated bursts recorded on the outgoing data can be ascribed to dropping events recovered by the servers: a measure of the dropping probability is thus derived. Incoming DupB events are due to retransmissions, from external hosts, of already received data, as the probability that a packet is recorded on the trace and then dropped on the LAN is negligible.

2.6 Conclusions

This chapter presented a novel tool for Internet traffic data collection and statistical elaboration, offering roughly 80 different types of plots and measurement figures, ranging from the simple amount of observed traffic to complex reconstructions of the flow evolution. The major novelty of the tool is its capability of correlating the outgoing and incoming flows at a single edge router, thus inferring the performance and behavior of any single flow observed during the measurement period. Exploiting this capability, we have presented and discussed some statistical analyses performed on data collected at the ingress router of our institution. The results offer a deep insight into the behavior of both the IP and the TCP protocols, highlighting several characteristics of the traffic that, to the best of our knowledge, were never observed on 'normal' traffic, but were only generated by injecting ad-hoc flows into the Internet or observed in simulations.

Chapter 3
User Patience and the World Wide Wait

This chapter, whose results have been published in [52], presents a study of web user behavior when network performance decreases, causing page transfer times to increase. Real traffic measurements are analyzed to infer whether worsening network conditions translate into greater impatience by the user, which in turn translates into early interruption of TCP connections. Several parameters are studied to gauge their impact on the interruption probability of web transfers: time of day, file size, throughput and time elapsed since the beginning of the download. The results presented try to paint a picture of the complex interactions between user perception of the Web and network-level events.

3.1 Background

Despite the growing amount of peer-to-peer traffic exchanged, web browsing remains one of the most popular activities on the Internet.
Web users, at a rather unconscious level, usually judge their browsing experience through the page latency (or response time), defined as the time between the user request for a specific web page and the complete transfer of every object in that page. With the improvement in server and router technology and the availability of high-speed network access and larger capacity pipes, the web browsing experience is steadily improving. However, congestion may still arise, causing TCP congestion control to kick in and leading to higher page latencies. In such cases, users can become impatient, as testified by the popularization of the World Wide Wait acronym [53]. The user behavior then radically changes: the current transfer is aborted, and a new one is possibly started right away, e.g., by hitting the 'stop' and 'reload' buttons of the Web browser. This behavior can affect network performance, since the network spends effort transferring information which may turn out to be useless. Furthermore, resources devoted to aborted connections are unnecessarily taken away from other connections.

In this chapter, we do not focus on the causes that affect web browsing performance, but rather on measuring the impact of user behavior when dealing with poorly performing web transfers. Using almost two months of real traffic analysis, we study the effect of early transfer interruptions on TCP connections, and the correlation between connection parameters (such as throughput, file size, etc.) and the probability of early transfer interruption.

The rest of this chapter is organized as follows: Section 3.2 defines and validates the interruption measuring criterion; Section 3.3 analyzes the interruptions observed in real traffic traces, reporting the most interesting results; conclusive considerations are the object of Section 3.4.

3.2 Interrupted Flows: a definition

By interruption event we indicate the early termination of an ongoing Web transfer by the client, before the server has finished sending data. From the browser perspective, such an event can be generated by several interactions between the user and the application: aborting the transfer by pressing the stop button, leaving the page being downloaded by following a link or a bookmark, or closing the application. From the TCP perspective, the events described above cause the early termination of all the TCP connections(1) being used to transfer the web page objects. While it is impossible to distinguish among these causes, they can all be identified by looking at the evolution of the connection itself, as detailed in the following section. Though it would seem natural to consider the interruption a "session" metric rather than a "flow" metric, session aggregation is extremely difficult and critical [54]. Therefore, due also to the hazy definition of "Web session", we will restrict our attention to individual TCP flows, attempting to infer the early end of ongoing TCP connections rather than the termination of ongoing Web sessions.

(1) In this chapter we use the terms connection and flow interchangeably.

3.2.1 Methodology

In order to define a heuristic criterion discriminating between interrupted and completed TCP flows, we first inspected several packet-level traces corresponding to either artificially interrupted or regularly terminated Web transfers. We considered the most common operating systems and web browsers: Windows 9x, Me, 2k, Xp and Linux 2.2.x, 2.4.x were checked, in combination with MSIE 4.x, 5.x, 6.x, Netscape 4.7x, 6.x or Mozilla 1.x.
Figure 3.1 sketches the evolution of a single TCP connection used in an interrupted (right) versus a completed (left) HTTP transaction. In the latter case, after the connection set-up, the client performs a GET request, which causes DATA to be transmitted by the server. If persistent connections are used, several GET-DATA phases can follow. At the end, the connection tear-down is usually initiated by the server side through FIN or reset (RST) messages. Conversely, user-interrupted transfers cause the client to abruptly signal the TCP connection interruption to the server. The actual chain of events depends on the client OS and browser: Microsoft clients immediately send an RST segment, while Netscape/Mozilla clients gently close the connection by sending a FIN message first. From then on, the client replies with RST segments upon the reception of server segments that were in flight when the interruption happened (indicated by thicker arrows in the figure). In all cases, any user interruption action generates an event which is asynchronous with respect to the self-clocked TCP window mechanism.

Figure 3.1: Completed and Interrupted TCP Flow

In Figure 3.1, several time instants are also identified:

- t_FS and t_FE, identifying the TCP Flow Start and End times, respectively;
- t_CS and t_CE, identifying the client request Start and End times, corresponding to the first and last segment carrying data from the client side;
- t_SS and t_SE, identifying the server reply Start and End times, corresponding to the first and last segment carrying data from the server side.

Timestamps are recorded by Tstat, which passively analyzes traffic in between the client and server hosts (its location being represented by the vertical dotted line in the figure); therefore, the time reference is neither that of the client nor that of the server(2).

(2) In the measurement setup we used, the packet monitor is close to the client (or server) side, and therefore the reference error is small, since the delay introduced by our campus LAN is small compared to the RTT.

3.2.2 Interruption Criterion

From the single-flow traffic analysis, we can define a heuristic discriminating between client-interrupted and completed connections. We preliminarily introduce a necessary condition for the interruption of a flow, which we call eligibility, derived from the observation of Figure 3.1. TCP connections in which the server sent DATA but did not send a FIN (or RST) segment, and the client sent an RST segment, are said to be eligible. Thus:

  Eligible  ⟺  ¬(FIN_s ∨ RST_s) ∧ DATA_s ∧ RST_c        (3.1)

where the index (s or c) refers to the sender of the segment. The client FIN asynchronously sent by Netscape/Mozilla browsers at the time of the interruption can be neglected, because RSTs are sent anyway upon the reception of the subsequent incoming server packets.
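The eligibility test of Eq. (3.1) translates directly into a predicate over per-flow flags; the following is a minimal sketch, assuming these flags have been collected while tracking the connection.

```python
# Sketch of the eligibility condition of Eq. (3.1): the server sent data but
# neither FIN nor RST, while the client sent an RST. The flags are hypothetical
# per-flow booleans gathered by the flow tracker.

def eligible(fin_server, rst_server, data_server, rst_client):
    return (not (fin_server or rst_server)) and data_server and rst_client

print(eligible(False, False, True, True))   # True: candidate interrupted flow
print(eligible(True,  False, True, True))   # False: the server closed the flow itself
```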
However, this criterion by itself is not sufficient to distinguish between interrupted and completed connections. Indeed, there are a number of cases in which we can still observe an RST segment from the client before the connection tear-down by the server. In particular, due to HTTP protocol settings [55], servers may wait for a timer to expire (usually set to 15 seconds after the last data segment has been sent) before closing the connection; moreover, the HTTP 1.1 and Persistent-HTTP 1.0 protocols use a longer timer, set to a multiple of 60 seconds. Connections abruptly closed during this idle time would be classified as interrupted, even if the data transfer had already completed.

To gauge this, let us define t_gap as the time elapsed between the last data segment from the server and the actual flow end, i.e., t_gap = t_FE - t_SE. In Figure 3.2 we plot both the pdf (in the inset) and the CDF of t_gap for all HTTP connections (solid line) and for the eligible ones (dotted lines). As can be observed, the majority of connections are closed within a few seconds after the reception of the last data segment. The server-timer expiration is reflected by the pdf peak after 15 s, which is clearly absent for the eligible flow class. But the presence of a timer at the client side, triggered about 60 s after the last segment is received, causes the client to send an RST segment before the server connection tear-down, as shown by the CDF plot for eligible flows.

Figure 3.2: t_gap Probability and Cumulative Distribution

Unfortunately, all flows terminated by this timer expiration match the eligibility criterion: we need an additional time constraint in order to uniquely distinguish the interrupted flows from the subset of the eligible ones. Recalling that user interruptions are asynchronous with respect to the TCP self-clocking based on the RTT, we expect the t_gap of an interrupted flow to be roughly independent of TCP timings and upper-bounded by a function of the flow's measured RTT. Let us define the normalized gap t̂_gap as

  t̂_gap = t_gap / (α RTT + β σ_RTT)        (3.2)

where RTT and σ_RTT are the average and the standard deviation of the connection round-trip time, respectively(3).

(3) The RTT and σ_RTT estimation used by Tstat is the same one a TCP sender uses. The lack of accuracy of the algorithm, the variability of the RTT itself and the few samples per flow make this measurement not very accurate, affecting the t̂_gap distribution.

Figure 3.3: Normalized t̂_gap Probability Distribution, α=1, β=1

Figure 3.3 plots the t̂_gap pdf for both the eligible and non-eligible flows when α=1 and β=1. For non-eligible flows, the pdf shows that t̂_gap can be either:

- close to 0, when the server FIN is piggybacked on the last server data segments and the client has already closed, or closes, its half-open connection by means of an RST segment;
- roughly 1 RTT, when the server FIN is piggybacked on the last server data segments and the client sends a FIN-ACK segment, causing the last server-side ACK segment to be received 1 RTT later by the server;
- much larger than 1 RTT, for connections which remain open and are then closed by an application timer expiration.

Instead, considering eligible flows, we observe that t_gap is no longer correlated with the RTT.
Moreover, we would expect the asynchronous interruption events to be, in this case, uniformly distributed within one RTT. This is almost confirmed by Figure 3.3, except that the pdf exhibits a peak close to 0. This can be explained by considering the impact of the TCP window size: the transmission of several packets within the same window, and therefore during the same RTT, shifts the t_SE measurement point, reducing t_gap toward values smaller than the RTT, as sketched in Figure 3.4.

Figure 3.4: Temporal Gap Reduction

Therefore, from the former observations, we define the flow interruption criterion as:

  Interrupted  ⟺  Eligible ∧ (t̂_gap < 1)        (3.3)

As a further validation of the criterion, we plot in Figure 3.5 the CDF of the amount of server data transmitted on a connection, for both completed and interrupted flows. Looking at the inset, which reports a zoom of the CDF curve, it can be noted that the interrupted flow size is essentially a multiple of the maximum segment size (which usually corresponds to the Ethernet MTU of 1500 bytes). Indeed, for normal connections, the data size carried by flows is independent of the segmentation imposed by TCP. This further confirms that in the former case not all the server packets reached the client before the interruption happened.

Figure 3.5: Interrupted vs Completed Flows Size CDF

In order to test the sensitivity of the interruption heuristic, we analyzed the interruption probability, i.e., the ratio of the number of interrupted connections to the total number of traced connections, both for different values of α and β and with respect to a simplified interruption criterion that uses a fixed threshold (i.e., t_gap < T_thresh). Results are plotted in Figure 3.6, adding in the inset the relative error percentage with respect to the (α=1, β=1) curve, as a function of the time of day, considering a 10-minute observation window. It can be seen that different (α, β) values do not largely affect the RTT-dependent results (the error is within a few percentage points). On the contrary, a fixed-threshold approach deeply alters the interruption ratio, compromising the criterion validity. For example, when T_thresh includes the client 60-second timer of persistent connections, the error grows to over 100%: recalling the results of Figure 3.2, this would qualify almost all eligible flows as interrupted. Therefore we can confirm that the interruption criterion we defined is affected by a relative error which is small enough to be neglected.

Figure 3.6: Sensitivity of the Interruption Criterion to the α and β Parameters (α=1, β=1; α=2, β=0; fixed thresholds of 30 s and 90 s)
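Putting Eqs. (3.1)-(3.3) together, the complete classification can be sketched as follows; this is a minimal reconstruction of the criterion in which the per-flow fields (eligibility flag, t_FE, t_SE, RTT average and standard deviation) are assumed to be available from the flow tracker.

```python
# Sketch of the interruption test of Eqs. (3.2)-(3.3): an eligible flow is
# classified as interrupted when its gap t_FE - t_SE, normalized by
# alpha*RTT_avg + beta*RTT_std, falls below 1. Field names are hypothetical.

def interrupted(flow, alpha=1.0, beta=1.0):
    if not flow["eligible"]:                      # Eq. (3.1), see the sketch above
        return False
    t_gap = flow["t_FE"] - flow["t_SE"]           # last server data -> flow end
    norm = alpha * flow["rtt_avg"] + beta * flow["rtt_std"]
    return norm > 0 and t_gap / norm < 1.0        # Eq. (3.3)

flow = {"eligible": True, "t_FE": 10.35, "t_SE": 10.30,
        "rtt_avg": 0.12, "rtt_std": 0.04}
print(interrupted(flow))    # True: a 50 ms gap, well within one RTT
```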
3.3 Results

In this section we study how the interruption probability is affected by the most relevant connection properties, such as flow size, throughput and completion time. We also discriminate flows as client or server flows (respectively, when the server is external or internal to our LAN) and as mice or elephants (depending on whether their size is smaller or larger than 100 kbytes).

Figure 3.7: Interrupted vs Completed Flows: Amount and Ratio

Figure 3.7 plots the number of interrupted versus totally traced flows (left y-axis scale), together with their ratio (right y-axis), as a function of the time of day. Client flows only are considered(4). As expected, the total number of tracked flows is higher during working hours, and the same happens to the interrupted flows, leading to an almost constant interruption probability. Given this behavior, in the following we restrict our analysis to the 10:00-16:00 interval, where we consider both the traffic and the interruption ratio to be stationary.

(4) Server flows yielded the same behavior.

It must be pointed out that our campus is mainly a client network toward external servers, i.e., only a small fraction of the tracked connections have servers inside our campus LAN. Therefore, to both have a statistically meaningful data set and compare the client versus server results on approximately the same number of connections, we used traces with different temporal extensions. The client traces refer to the work week from Monday 7 to Friday 12 November 2002, from 10:00 to 16:00; the server data refer instead to a two-month-long trace (Monday to Friday, October to November 2002, 10:00-16:00), during which external clients contacted 118 unique internal servers. For the same reasons, the elephant data set refers to the same period as the server trace.

Considering the selected data set, the average percentage of interrupted flows over all logged servers is 9.18%, while over all logged clients it is 4.20%. This shows that a significant percentage of TCP flows are interrupted; this quantity was measured on our campus network, which offers a generally good browsing experience, so we expect the ratio to be much higher in worse-performing scenarios. Table 3.1 details the interruption statistics for the three most contacted internal and external servers: F# and I# represent, respectively, the total and the interrupted number of observed flows, and the last column reports their ratio.

Table 3.1: Three most active server and client statistics: total (F#) and interrupted (I#) flows

  rank                    F#       I#    I#/F#
  Internal server 1    186400    15969    8.6%
  Internal server 2    131907    10024    7.6%
  Internal server 3     86189     7320    8.5%
  External server 1     29300      539    1.8%
  External server 2     25637      659    2.6%
  External server 3     18448      231    1.3%

Apart from noticing that the number of contacted external servers is higher, and the traffic therefore more spread out than for the internal servers, it is worth noting that the interruption probability of the three most contacted internal servers is roughly the same for each server (about 8%). Considering the external server statistics, the interruption ratio is smaller (from about 1.3% to 2.6%), and also smaller than the average client interruption probability, which is larger than 4%. This suggests that the three most contacted servers offer a good browsing experience to our clients.
In order to better understand the motivations that drive user impatience, in the following subsections we inspect how the interruption probability varies when conditioned on different parameters φ. In particular, we define

  P|φ = (number of interrupted connections whose parameter φ falls in a given interval) / (total number of connections whose parameter φ falls in the same interval)

that is, the ratio of the number of interrupted connections to the total number of connections, conditioned on a generic parameter φ. Alternatively, with some algebra and applying the Bayes formula, it can be seen that P|φ expresses P{flow is Interrupted | φ}. Intuitively, when P|φ is constant over any interval of φ values, the interruption is not correlated with the parameter φ. We introduce in Table 3.2 the average values of the parameters studied in the following, for both elephant (E) and mice (M), client and server flows.

3.3.1 Impact of the User Throughput

Let the average user throughput be the amount of data transferred by the server over the time elapsed between the connection set-up and the last server data packet; referring to Figure 3.1, we may write(5):

  Throughput = DATA_s / (t_SE - t_FS)

(5) This performance parameter does not include the time elapsed during connection tear-down, since it does not affect the user perception of the transfer time.
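The two quantities used throughout this section can be sketched as follows; it is a minimal reconstruction in which the connection set-up time is taken to be the flow start t_FS, and the field names and bin width are hypothetical.

```python
# Sketch: average user throughput (server bytes over the time from flow start to
# the last server data segment) and the interruption probability conditioned on a
# generic parameter phi, estimated by binning flows on phi.

def throughput_kbps(flow):
    elapsed = flow["t_SE"] - flow["t_FS"]
    return 8e-3 * flow["server_bytes"] / elapsed if elapsed > 0 else 0.0

def conditional_interruption(flows, phi, bin_width):
    bins = {}
    for f in flows:
        b = int(phi(f) // bin_width)
        tot, intr = bins.get(b, (0, 0))
        bins[b] = (tot + 1, intr + (1 if f["interrupted"] else 0))
    return {b * bin_width: intr / tot for b, (tot, intr) in bins.items()}

flows = [{"t_FS": 0.0, "t_SE": 2.0, "server_bytes": 20000, "interrupted": False},
         {"t_FS": 0.0, "t_SE": 8.0, "server_bytes": 20000, "interrupted": True}]
print(conditional_interruption(flows, throughput_kbps, 50))  # {50: 0.0, 0: 1.0}
```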
Figure 3.8 reports P|Throughput, as well as the number of total and interrupted flow samples, for both server (top) and client (bottom) flows. The number of samples can be read on the left y-axis, while the corresponding probability can be read on the right y-axis. It can be noted that, in the server case, P|Throughput slightly decreases when the user transfer rate increases, while a general increase of P|Throughput is observed for client connections, which is quite counterintuitive. However, this is explained by considering mice and elephant flows separately. Indeed, in the mice case, the throughput of interrupted flows is about 1.5 times higher than that of completed flows, suggesting that the early termination is due to a link-follow behavior (i.e., the user clicking on a link to reach a new page). On the contrary, interrupted elephant flows have a throughput about 1.5 times smaller than that of completed flows, confirming the intuition that a smaller throughput leads to a higher interruption probability.

Table 3.2: Average Values of the Inspected Parameters

                          Interrupted              Completed
  Mice (M)              client    server         client    server
    T [s]                 4.81     28.91          10.12     29.58
    Size [Kb]            10.01     19.47           6.34      8.67
    Thr [Kbps]          115.93    118.72          79.15     90.57
  Elephants (E)         client    server         client    server
    T [s]               108.42     82.86         122.12    123.78
    Size [Kb]          1081.00    394.34         638.46    454.71
    Thr [Kbps]          192.62    255.07         274.29    344.58

Figure 3.8: P|Throughput: Server on the Top, Client on the Bottom

3.3.2 Impact of Flow Size

In Figure 3.9 the interruption probability is conditioned on the flow size(6), i.e., P|Size. Considering client flows (bottom plot), we observe a peak of short transfers that are aborted: this is due to the interruption of the parallel TCP connections opened by a single HTTP session. In the server case (top plot), P|Size is higher, on average, than in the previous case. In both cases, against expectations, users do not tend to wait longer when transferring longer flows, as the increasing interruption probability shows.

(6) In the case of interrupted connections, the size has to be interpreted as the amount of data transferred until the interruption occurred.

Figure 3.9: P|Size: Server on the Top, Client on the Bottom

3.3.3 Completion and Interruption Times

Figure 3.10 shows the dependence of completed and interrupted server flows on the time elapsed from the flow start until its end, i.e., P|Time. It can be gathered from the figure that users mainly abort the transfer in the first 20 seconds: during this time, users take the most 'critical' decisions, while, after that time, they tend to wait longer before interrupting the transfer. The slow rise in the interruption ratio after the 20-second mark, though, shows that users are still willing to interrupt the transfer if they think it is taking too much time.

Figure 3.10: P|Time: Server case only

Finally, Figure 3.11 considers server flows within the 0-20 s interval only. The P|Size probability is further conditioned on different classes of users according to their throughput, i.e., P|Size|Throughput. Three throughput classes are considered: Fast (> 100 Kbps), Slow (< 10 Kbps) and Medium speed (between 10 Kbps and 100 Kbps). Looking at the figure, it can be noticed that the three classes suffer very different interruption probabilities: higher for slow flows, and much smaller for fast flows. Linear interpolation of the data (dotted lines) is used to highlight this trend. Indeed, slow connections massively increase the interruption probability, while faster connections are likely to be left alone. This shows that the throughput is indeed one of the main performance indexes driving the interruption probability.

Figure 3.11: P|Size|Throughput: server case only

3.4 Conclusions

The research presented in this chapter inspected a phenomenon intrinsically rooted in the current use of the Internet, caused by user impatience at waiting too long for web downloads to complete. We defined a methodology to infer the interruption of TCP flows, and presented an extended set of results gathered from real traffic analysis. Several parameters have been considered, showing that the interruption probability is affected mainly by the user-perceived throughput.
The presented interruption metric could be profitably used to define user satisfaction with Web performance, as well as to derive traffic models that include the early interruption of connections.

Chapter 4
The Zoo of Elephant and Mice

This chapter, whose results have been published in [56], studies the TCP flow arrival process, starting from aggregated measurements at the TCP flow level taken from our campus network. After introducing the tools used to collect and process TCP flow level statistics, we analyze the statistical properties of the TCP flow inter-arrival process. We create different traffic aggregates by splitting the original trace, such that i) each traffic aggregate has, bytewise, the same amount of traffic, and ii) it is constituted by all the TCP flows with the same source/destination IP addresses (i.e., belonging to the same traffic relation). In addition, the splitting algorithm packs the largest traffic relations into the first traffic aggregates; therefore, subsequently generated aggregates are constituted by an increasing number of smaller traffic relations. This induces a division of TCP-elephants and TCP-mice into different traffic aggregates. The statistical characteristics of each aggregate are presented, showing that the TCP flow arrival process exhibits long-range dependence, which tends to vanish on traffic aggregates composed of many traffic relations mainly made of TCP-mice.

4.1 Introduction

Since the pioneering work of Danzig [57, 58, 59] and Paxson [60, 61], the interest in data collection, measurement and analysis to characterize either the network or the user behavior has increased steadily, also because it was clear from the very beginning that "measuring" the Internet was not an easy job. The lack of a simple yet satisfactory model, like the traditional Erlang teletraffic theory for circuit-switched networks, still makes this research field a central topic for the research community. Moreover, the well-known Long Range Dependence (LRD) behavior shown by Internet traffic makes traffic measurement and modeling even more interesting. Indeed, after the two seminal papers [61, 62], in which the authors showed that traffic traces captured on both LANs and WANs exhibit LRD properties, many works focused on studying the behavior of data traffic in packet networks, with the intent of both finding a physical explanation of the properties displayed by the traffic and finding accurate stochastic processes that can be used to describe the traffic in analytical models.

Considering the design of the Internet, it is possible to devise three different layers at which to study Internet traffic: Application, Transport and Network, to which user sessions, TCP or UDP flows, and IP packets respectively correspond. Indeed, a simple "click" on a web link causes the generation of a request at the application level (i.e., an HTTP request), which is translated into possibly many transport-level connections (TCP flows); each connection, in turn, generates a sequence of data messages that are transported by the network (IP packets). In this chapter, we concentrate our attention on the flow level, and on the TCP flow level in particular, given that the majority of the traffic is today transported using the TCP protocol.
The motivation behind this choice is that, while it was shown (see for example [61, 63]) that the arrival processes of both packets and flows exhibit LRD properties, most researchers concentrated their attention on the packet level, so the flow-level traffic characteristics are relatively less studied. Moreover, even if the packet level is of great interest to support router design, e.g., for buffer dimensioning, the study of the TCP flow level is becoming more and more important, since the flow arrival process plays a direct role in the dimensioning of web servers and proxies. Another strong motivation to support the study of the flow arrival process is the increasing diffusion of network apparatuses which operate at the flow level, e.g., Network Address Translators or Load Balancers; indeed, their design and scalability mainly depend on the number of flows they have to keep track of to perform packet manipulations between incoming and outgoing data.

Going back to the packet level, the prevailing justification for the presence of LRD at this level is supposed to be the heavy-tailed distribution of file sizes [64]: the presence of long-lived flows, called "elephants" in the literature, induces correlation in the packet-level traffic, even if the majority of the traffic is built by short-lived flows, or "mice". The question we try to answer in this chapter is whether the presence of mice and elephants has an influence on the LRD characteristics at the flow level as well. To face this topic, we collected several days of live traffic from our campus network at the Politecnico di Torino, which consists of more than 7000 hosts, the majority of which are clients. Instead of considering the packet-level trace, we performed a live collection of data directly at the TCP flow level, using Tstat [65, 33], a tool able to keep track of single TCP flows by looking at both the data and acknowledgment segments. The flow-level trace was then post-processed with DiaNa [95], a novel tool which allowed us to easily derive several simple as well as very complex measurement indexes in a very efficient way. Both tools are under development and are made available to the research community as open source.

To gauge the impact of elephants and mice on the TCP flow arrival process, we follow an approach similar to [66, 68], which creates a number of artificial scenarios, each derived by splitting the original trace into a number of sub-traces, mimicking the splitting/aggregation process that traffic aggregates experience while following different paths inside the network. We then study the statistical properties of the flow arrival process of the different sub-traces, showing that the LRD tends to vanish on traffic aggregates composed mostly of TCP-mice.

The rest of the chapter is organized as follows. Section 4.2 provides a survey of related works and results; the problem definition and the input data analysis are the object of Section 4.3, in which the algorithm adopted to derive traffic aggregates is briefly described, highlighting its features. The results of the analysis on traffic aggregates are presented in Section 4.4, and conclusions are drawn in Section 4.5.

4.2 Related Works

In the last years a lot of effort has been devoted to traffic analysis in the Internet, at both the IP and the TCP level. It is well known that the IP arrival process is characterized by long-range dependence, and how important it is to consider this property for network planning, buffer dimensioning in primis.
The study of long-range dependence in the network domain was pioneered by works such as [62, 61, 64, 72], all agreeing on the possible causes of LRD in IP traffic. Indeed, they identified the heavy-tailed file size distribution (and, consequently, that of TCP flow sizes) as the main cause of long-term correlation at the IP level, which holds true for both WAN and LAN traffic. IP traffic is also characterized by more complicated scaling phenomena, which are not our concern here and have been well discussed in [68], where the authors also consider small-scale phenomena, which have less direct impact on engineering problems. The authors of [68] used an interesting manipulation of data at the TCP level, which allowed them to study the relation between the IP and TCP scaling behavior at both small and large scales. The results also confirmed that the LRD properties of IP traffic are partly inherited as a consequence of TCP-level properties (e.g., the distribution of the connection duration and flow size), while other scaling properties seem to depend on packet arrivals within flows.

The statistical analysis of real measured traffic, thanks to the significant amount of collected data and research effort, gave new impulse to traffic modeling as well. Here, we briefly summarize the different approaches followed in the last years, in which a number of attempts were made to develop models for LRD data traffic, mainly at the packet level. Among the different approaches, Fractional Brownian Motion (FBM) received a lot of attention thanks to the pioneering work [73]; while providing good approximations, FBM models are however not able to capture both the short- and long-term correlation structure of network traffic. The poor FBM scaling properties therefore drove many research efforts toward multifractal models, whose attractiveness is due to their scale-invariance properties. [74, 75] mainly focus on the physical explanation of network dynamics, showing multifractal properties for the first time. Indeed, other works such as [76, 71] also suggest multifractal models as possibly being the best fit to measured data. For example, in [76] the authors tried to explain the scaling phenomena, from small to large scales, from direct inference of the network dynamics. In particular, they identify in the heavy-tailed distribution of the number of TCP flows per Web session the cause of the LRD effects clearly visible at large scales in TCP traffic. This effect, analogous to an M/G/∞ model with infinite-variance service time distribution, was pointed out in [71] and was already known in the literature. Finally, [77, 78] have been more interested in measurement-based traffic modeling, identifying a multifractal class of processes able to describe the multiscaling properties of TCP/IP traffic from small to large scales; however, the real impact of the small-scale properties of traffic still remains questionable.

All these traffic characterization works deviate considerably from the classical Markovian models, which continue to be widely used for performance evaluation purposes with good results [79, 80, 81, 82]; in all the above works, for example, the Markov Modulated Poisson Process (MMPP) is considered one of the best Markov processes for emulating traffic behavior. However, in [81, 82] the authors also point out that the MMPP does not bring long-term correlation; they therefore define a local Hurst parameter, using an approximated LRD definition valid over a limited range of time scales.
Finally, another approach to modeling Internet traffic involves the emulation of the real hierarchical nature of network dynamics, e.g., considering user sessions, TCP flows and IP packets; in [83], for example, each of the model's components was fitted to an empirical pdf, such as the distributions of TCP flow and web page sizes, and the arrival distributions of pages and flows.

All of these different approaches reach similar conclusions using different techniques. The common point of view has always been to take into account the real traffic behavior, in order to be able to either i) use more reasonable tools for network planning or ii) explain the links between causes and effects of network traffic phenomena. These analyses made the networking community aware of the real traffic that has to be taken into account for a correct performance evaluation of systems. In this chapter, we are mainly interested in the large scales, as explained in the following section, where we describe a new criterion to classify traffic starting from the TCP level, without considering the packet level at all. We define rules to aggregate TCP flows and define a higher-level entity with good engineering properties to split homogeneous traffic. By studying the different traffic aggregates, we then try to gain insight into the TCP flow inter-arrival behavior.

4.3 Problem Definition

In this section, we first introduce the different aggregation levels and the notation that will be used in the remainder of the chapter. We then describe the measurement setup and the input trace characteristics that are relevant to understand the significance of the presented results. Finally, we describe the splitting/aggregation criterion that will be used to derive the different traffic aggregates, briefly showing its properties and its technical aspects.

4.3.1 Preliminary Definitions

When performing trace analysis, it is possible to focus the attention on different levels of aggregation. Considering the Internet architecture, it is common to devise three different levels, i.e., the IP-packet, TCP/UDP flow, and user session levels. While the definition of the first two layers is commonly accepted, the user session one is fuzzier, as it derives from the user behavior. In this chapter we therefore decided to follow a different approach in the definition of aggregates. In particular, to mimic the splitting/aggregation process that data experience while following different paths in the network, we define four levels of aggregation, sketched in Figure 4.1: IP packets, TCP flows, Traffic Relations (TR), and Traffic Aggregates (TA). Being interested in the flow arrival process, we will neglect the packet level, and also the UDP traffic, because of its connectionless nature and because it constitutes a small portion of the current Internet traffic.

Figure 4.1: Aggregation Levels, from Packet Level to TA Level

Let us first introduce the formal definition of the different aggregation levels considered in this chapter, as well as the related notation:
- TCP Flow Level: a single TCP connection(1) is constituted, as usual, by the packets exchanged between the same client (i.e., the host that performed the active open) and the same server (i.e., the host that performed the passive open), with the same (source, destination) TCP port pair. We consider only successfully opened TCP flows – i.e., those whose three-way handshake was successful – and take as the flow arrival time the observation time of the first client segment. We denote with s_i(c,s) the bytewise size of the i-th TCP flow between client c and server s tracked during the observation interval; the flow size accounts for the amount of bytes flowing from the server s toward the client c, which is usually the most relevant part of the connection.

(1) In this chapter we use the terms "flow" and "connection" interchangeably.

- Traffic Relation Level: a Traffic Relation TR(c,s) aggregates all the |F(c,s)| TCP flows having c as source and s as destination; we indicate its size, expressed in bytes, with S(c,s) = Σ_i s_i(c,s). The intuition behind this aggregation criterion is that all the packets within TCP flows belonging to the same TR usually follow the same path(s) in the network, thus yielding the same statistical properties on the links along the path(s).

- Traffic Aggregate Level: considering that several TRs can cross the same links along their paths, we define a higher level of aggregation, which we call Traffic Aggregate (TA). The A-th TA is denoted by τ_A, and its bytewise size is S(τ_A) = Σ_{(c,s)∈τ_A} S(c,s); the number of traffic relations belonging to a traffic aggregate is indicated by |τ_A|.

In the following, rather than using the bytewise size of flows, TRs and TAs, we will refer to their weight, that is, their size in bytes normalized over the total amount of bytes observed in the whole trace, S = Σ_(c,s) Σ_i s_i(c,s); thus:

  w_i(c,s) = s_i(c,s) / S        (TCP Flow Level)
  w(c,s)   = S(c,s) / S          (Traffic Relation Level)
  w(τ_A)   = S(τ_A) / S          (Traffic Aggregate Level)
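As a small illustration of this normalization, the TR weights w(c,s) can be obtained from a flow-level trace as in the following sketch; each flow is a hypothetical (client, server, server_bytes) record, not the actual Tstat/DiaNa output format.

```python
# Sketch of the weight normalization defined above: a traffic relation (TR)
# groups the flows of one (client, server) pair, and weights are byte counts
# normalized by the total traffic S of the trace.

from collections import defaultdict

def tr_weights(flow_records):
    S = sum(size for _, _, size in flow_records)        # total bytes in the trace
    tr_bytes = defaultdict(int)
    for client, server, size in flow_records:
        tr_bytes[(client, server)] += size               # S(c,s)
    return {pair: size / S for pair, size in tr_bytes.items()}   # w(c,s)

flows = [("c1", "s1", 7000), ("c1", "s1", 1000), ("c2", "s2", 2000)]
print(tr_weights(flows))   # {('c1', 's1'): 0.8, ('c2', 's2'): 0.2}
```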
4.3.2 Input Data

The analysis was conducted over different traces collected on several days over our Institution's ISP link during October 2002. Our campus network is built upon a large 100 Mbps Ethernet LAN, which collects traffic from more than 7,000 hosts. The LAN is connected to the Internet by a single router, whose WAN link has a capacity of 28 Mbps(2).

(2) The data-link level is based on an AAL-5 ATM virtual circuit at 34 Mbps (OC1).

Figure 4.2: The Measurement Setup

We used Tstat to perform the live analysis of the incoming and outgoing packets sniffed on the router WAN link, as sketched in Figure 4.2, obtaining several statistics at both the packet and flow levels. Moreover, Tstat dumped a flow-level trace, which was then split into several traces, each referring to a period of time equal to an entire day of real traffic. Besides, since our campus network is mainly populated by clients, we consider in this analysis only the flows originated by clients internal to our Institution LAN. Each trace was then analyzed separately, preliminarily eliminating the non-stationary time intervals (i.e., the night/day effect) and considering a busy period from 8:00 to 18:00. Given that the qualitative results observed on the several traces do not change, in this chapter we present results derived from a single working-day trace, whose properties are briefly reported in Table 4.1.

During the 10 hours of observation, 2,380 clients contacted about 36,000 different servers, generating more than 172,000 TRs, for a total of more than 2.19 million TCP flows, 71.76 million packets, and nearly 80 GBytes of data.

Table 4.1: Trace Information

  Internal Clients            2,380
  External Servers           35,988
  Traffic Relations         172,574
  Flows Number           2.19 x 10^6
  Packets Number        71.76 x 10^6
  Total Trace Size          79.9 GB

In the considered traffic mixture, the TCP protocol represents 94% of the total packets, which allows us to neglect the influence of the other protocols (e.g., UDP); considering the application services, TCP connections are mainly constituted by HTTP flows, representing 86% of the total services and more than half of the exchanged bytes. More in detail, there are 10,664 different identified TCP server ports, 99 of which are well-known: these account for 95% of the flows and 57% of the traffic volume.

Considering the different traffic aggregation levels previously defined, Figure 4.3 shows examples of the flow arrival time sequence. Each vertical line represents a single TCP flow, which started at the corresponding time instant on the x-axis and whose weight w_i(c,s) is reported on the y-axis.

Figure 4.3: Flow Size and Arrival Times for Different TAs (from top to bottom: τ_A with w=1, |τ_A|=172332; τ_B with w=1/50, |τ_B|=5986; τ_C with w=1/50, |τ_C|=59; τ_D with w=0.06≈1/16, |τ_D|=1)

The upper plot shows trace τ_A, whose weight is w(τ_A)=1: it represents the largest possible TA, built considering all connections among all the possible source-destination pairs. The trace sub-portions τ_B and τ_C, while being constituted by a rather different number of TRs (|τ_B|=5986 and |τ_C|=59), have the same weight w(τ_B)=w(τ_C)=1/50. Observing Figure 4.3, it can be gathered that τ_B aggregates a larger number of flows than τ_C; furthermore, the weight of τ_B flows is smaller (i.e., its TCP flows tend to be "mice"), while τ_C is built by a much smaller number of heavier (i.e., "elephant") TCP flows. This intuition will be confirmed by the data analysis presented in Section 4.4. Finally, the TCP flows shown in τ_D constitute a single traffic relation; this TR is built by a small number of TCP flows whose weight is very large, so that they amount to 1/16 of the total traffic.

To give the reader more details on the statistical properties of the different TRs, Figure 4.4 shows the distribution of w(c,s) for all the TRs (using a lin/log plot). It can be noticed that, except for the largest and smallest TRs, the distribution can be approximated by a linear function, i.e., the amount of bytes exchanged by the different client/server couples follows an exponential distribution. More interesting is the distribution of the number of TCP flows per traffic relation, shown in Figure 4.5 using a log/log plot. Indeed, the almost linear shape of the pdf shows that it exhibits a heavy tail, which could be at the basis of the LRD properties of the TCP flow arrival process. The parallel with the heavy-tailed pdf of the flow size, which induces LRD properties at the packet level, is straightforward.

Figure 4.4: TR Size Distribution

Figure 4.5: TR Flow Number Distribution
4.3.3 Properties of the Aggregation Criterion

We designed the aggregation criterion in order to satisfy some properties that help the analysis and interpretation of the results. The key point is that the original trace is split into K different TAs, such that i) each TA has, bytewise, the same amount of traffic, i.e., the K-th portion of the total traffic, and ii) each TA aggregates one or more TRs. Indeed, we considered TR aggregation a natural choice, since it preserves the characteristics of packets within TCP flows following the same network path, which therefore have similar properties. Since it is possible to find more than one solution to the previous problem, the splitting algorithm we implemented packs the largest TRs into the first TAs; besides, by virtue of the bytewise traffic constraint, subsequently generated aggregates are constituted by an increasing number of smaller TRs. Therefore the TAs, while composed of several TRs related to heterogeneous network paths, are ordered by the number |τ_A| of TRs constituting them.

To formalize the problem, and to introduce the notation that will also be used to present the results, let us define:

- Class K: the number of TAs into which we split the trace;
- Slot J: a specific TA of class K, namely τ_J(K), J ∈ [1, K];
- Weight w_J(K): the weight of slot J of class K;
- Target Weight w(K) = 1/K: the ideal portion of the traffic that should be present in each TA of class K.

Figure 4.6: Trace Partitioning: Algorithmic Behavior

Figure 4.6 sketches the splitting procedure. When considering class K=1 we have a single TA of weight w_1(1)=1, derived by aggregating all the TRs, which corresponds to the original trace. Considering K=2, we have two TAs, namely τ_1(2) and τ_2(2); the former is built by 242 TRs, which account for w(2)=1/2 of the relative traffic, while the latter contains all the remaining TRs. This procedure can be repeated for increasing values of K, until the weight of a single traffic relation becomes larger than the target weight. Being impossible to split a single TR into smaller aggregates, we are forced to consider TAs having a weight w_J(K) > w(K). The weight w(K) has therefore to be interpreted as an ideal target, in the sense that it is possible that one or more TRs will have a weight larger than w(K) as the number of slots grows. In such cases, there will be a number N_F(K) of fixed slots, i.e., TAs constituted by a single TR of weight larger than 1/K; the remaining weight will be distributed over the K - N_F(K) non-fixed slots. Therefore the definition of w_J(K) is:

  w_J(K) = w(c,s)                                            if slot J is fixed, i.e., it holds a single TR with w(c,s) > 1/K
  w_J(K) ≈ (1 - Σ_{J' fixed} w_J'(K)) / (K - N_F(K))          otherwise

In the dataset considered in this chapter, for example, the TR τ_D shown in Figure 4.3 is the largest of the whole trace, having w(τ_D) ≈ 0.06 > 1/16.
Therefore, from class K = 16 on, slot J = 1 will always be occupied by this aggregate, i.e., w_1^(K) = w(τ_D) for every K ≥ 16, as evidenced in Figure 4.6.

4.3.4 Trace Partitioning Model and Algorithm

More formally, the problem can be traced back to a well-known optimization problem, the scheduling of jobs over identical parallel machines (P||Cmax) [69], which is known to be strongly NP-hard. Traffic relations are the jobs that have to be scheduled on a fixed number K of machines (i.e., TAs), minimizing the maximum completion time (i.e., the TA weight). The previously introduced ideal target w*(K) = 1/K is the optimum solution in the case of preemptive scheduling. Since we preserve the TR identities, preemption is not allowed; however, it is straightforward to see that minimizing the maximum deviation of the completion times from w*(K) is equivalent to minimizing the maximum completion time.

We define X as the 1 × |τ^(1)| vector of the job lengths (i.e., the TR weights w_i(c,s)) and Z as the 1 × K vector of the machine completion times (i.e., the TA weights w_J^(K)). Denoting by A the |τ^(1)| × K mapping matrix (a_iJ = 1 means that the i-th job is assigned to the J-th machine), and by A_J its J-th column, the problem can be written as:

    minimize     ε
    subject to   Σ_J a_iJ = 1                 for every job i
                 ε ≥ Z_J − w*(K)              for every slot J, with Z_J = A_J^T X
                 ε ≥ w*(K) − Z_J              for every slot J

The adopted greedy solution, which has the advantage over, e.g., an LPT [69] solution of the clustering properties discussed earlier, implies a preliminary bytewise sorting of the traffic relations and three simple rules:

- allow a machine load to exceed 1/K only if the machine has no previously scheduled job;
- keep scheduling the biggest unscheduled job on the same machine while its load is still below 1/K;
- remove a job from the unscheduled jobs list as soon as it has been scheduled.

A representation of the algorithm output applied to the TA creation problem is provided in Figure 4.7: the x-axis reports the number of TRs |τ| within each TA and the y-axis the weight w_J^(K) of each generated TA. All classes K ∈ [3,100] are plotted and, for ease of reading, the results for classes K = 10, 50, 100 are highlighted, whereas neither the smallest classes nor the fixed slots are reported in the picture. Observing the figure, and recalling the distribution of the TR weights plotted in Figure 4.4, we notice that, given a class K, the number |τ_J^(K)| of TRs inside the J-th slot increases as J increases. For example, the last slot (the one on the rightmost part of the plot) always exhibits a number of TRs larger than 100,000. On the contrary, looking at the first slots, we observe that the number of TRs tends to decrease for increasing K, showing the "packing" enforced by the selected algorithm.

Figure 4.7: Trace Partitioning: Samples for Different Aggregated Classes K
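As an illustration only, the heuristic of Section 4.3.4 can be sketched in a few lines of Perl; the code below assumes the TR weights are already normalized so that they sum to 1, and omits the explicit bookkeeping of fixed slots.

    use strict;
    use warnings;

    # Greedy splitting sketch: TRs are sorted bytewise and packed into K
    # traffic aggregates of roughly 1/K of the total weight each.
    sub split_trace {
        my ($K, @w) = @_;                  # @w: normalized TR weights
        @w = sort { $b <=> $a } @w;        # preliminary bytewise sorting
        my @ta   = ([]);                   # list of slots (TAs)
        my $j    = 0;                      # current slot index
        my $load = 0;                      # current slot weight
        foreach my $wi (@w) {
            # open a new slot once the 1/K target has been reached,
            # unless this is already the last available slot
            if ($load >= 1.0 / $K && $j < $K - 1) {
                $j++; $load = 0; $ta[$j] = [];
            }
            push @{ $ta[$j] }, $wi;        # a slot exceeds 1/K only when its
            $load += $wi;                  # first (oversized) TR already does
        }
        return @ta;
    }

Calling, e.g., split_trace(16, @weights) on the weights computed earlier returns the 16 aggregates: the first slots collect the few largest TRs, the last one the long tail of tiny TRs, which is the clustering behavior visible in Figure 4.7.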
4.4 Results

In this section we investigate the most interesting properties of the artificially built traffic aggregates; the analysis is conducted mainly in terms of aggregate size and TCP flow interarrival times, coupling their study with the knowledge of the underlying layers, i.e., the traffic relation and flow levels. To help the presentation and discussion of the measurement results, we first propose a visual representation of the dataset obtained by applying the splitting algorithm, alternative to the one of Figure 4.7. Indeed, the aggregation process induces a non-linear sampling of both the number |τ_J^(K)| of TRs within each TA and the TA weight w_J^(K), complicating the interpretation of the results. However, the same qualitative information can be immediately gathered if we plot the data as a function of the class K and slot J indexes, using different gray-scale intensities to represent the measured quantity we are interested in.

As a first example of the new representation, Figure 4.8 depicts the number |τ_J^(K)| of traffic relations mapped into each traffic aggregate τ_J^(K). Looking at the plot and choosing a particular class K, every point of the vertical line represents the number of TRs within each of the K possible TAs, obtained by partitioning the trace into K TAs of approximately the same weight. Given the partitioning algorithm used, it is straightforward to understand why the higher the slot considered, the larger the number of TRs within the TA, as the gray gradient clearly shows. To better appreciate this partitioning effect, contour lines are shown for a few reference values of |τ_J^(K)|; it can be gathered that the bottom, white-colored zone of the plot is constituted by fixed slots (i.e., |τ| = 1), whereas the last slot J = K always contains more than 100,000 TRs, as already observed from Figure 4.7. This further confirms the validity of our simple heuristic in providing the previously described aggregation properties.

Figure 4.8: Number of Traffic Relations |τ_J^(K)| within each Traffic Aggregate

4.4.1 Traffic Aggregate Bytewise Properties

Having shown that the number of TRs within the generated TAs spreads smoothly over a very wide range for any TA weight, we now investigate whether these effects are reflected also by the number of TCP flows within each TA, shown in Figure 4.9. Quite surprisingly, we observe that the number of TCP connections within a TA shows a similar spreading behavior: the larger the number of TRs within a TA, the larger the number of TCP flows within the same TA. Indeed, the smoothed upper part of the plot (roughly, slots J ≥ K/2) corresponds to TAs with a large number of TCP flows, larger than about 1,000; instead, TAs composed of few TRs contain a much smaller number of TCP connections: the number of TCP flows keeps relatively small, while not as regular as in the previous case. It must be pointed out that there are exceptions to this trend, as shown by the "darker" diagonal lines visible in the plot: within these TAs there is probably one (or possibly more) TR built by a large number of short TCP flows.

Figure 4.9: Number of TCP Flows within each Traffic Aggregate

Coupling this result with the bytewise constraint imposed on TAs within the same class, we can state that the bottom region is constituted by a small number of flows accounting for the same traffic volume generated by a huge number of lighter flows. This intuition is also confirmed by Figure 4.10, which shows the mean TCP flow size in the different aggregates: the higher the slot considered, the larger both the number of TRs and of TCP flows, and the shorter the TCP flows.
While this might be surprising at first, the intuition behind this clustering is that the largest TRs are built by heavier TCP connections than the smaller TRs, i.e., TCP-elephants play a big role also in defining the TR weight. Therefore the splitting algorithm, by packing together the larger TRs, tends also to pack together TCP-elephants. The TCP flow size variance, not shown to avoid cluttering the figures, further confirms the clustering of TCP-elephant flows in the bottom side of the plot.

Figure 4.10: Mean Size of TCP Flows within each Traffic Aggregate

To better highlight this trend, we adopted a threshold criterion: the mean values of the TCP flow size are compared against fixed quantization thresholds, which allows a better appreciation of the mean flow size distribution across TAs. The results for threshold values of 100 KB, 250 KB, 500 KB and 1 MB are shown in Figure 4.11, where higher threshold values correspond to darker colors. The resulting plot underlines that the mean flow size grows toward TAs constituted by a smaller number of larger TRs, which are in their turn constituted by a small number of TCP-elephants. Similarly, the higher-order slots are mostly constituted by mice.

Figure 4.11: Elephant TCP Flows within each Traffic Aggregate

However, these results do not exclude the presence of TCP-mice in the bottom TAs, nor the presence of TCP-elephants in the top TAs; therefore, to further verify that the clustering of TCP-elephants in the first slots holds, we investigated how the TCP flow size distributions in the different aggregates vary as a function of the slot. We thus plot in Figure 4.12 the empirical flow size distribution for each TA, showing only the results of class K = 3 for ease of reading. As expected, i) TCP-elephants are evidently concentrated in the lower slots, as shown by the heavier tail of the distribution, and ii) the TCP-elephant presence decreases as the slot increases, i.e., when moving toward higher slots. The complementary cumulative distribution function P(X > x), shown in the inset of Figure 4.12, clearly confirms this trend.

This finally confirms that the adopted aggregation criterion induces a division of TCP-elephants and TCP-mice into different TAs: the TRs containing the highest number of TCP-elephants tend to be packed into the first TAs. Therefore, in the following we will for convenience use the terms TA-mice and TA-elephants to indicate TAs constituted (mostly) by TCP-mice and TCP-elephant flows respectively.

Figure 4.12: TCP Flow Size Distribution of TAs (Class K = 3)

4.4.2 Inspecting TCP Interarrival Time Properties within TAs

The result gained in the previous section clearly has important consequences when studying the TCP flow arrival process of the different aggregates. Let us first consider the mean interarrival time of TCP flows within each TA, shown in Figure 4.13.
Intuitively, TA-mice have a large number of both TRs and TCP flows, and therefore the TCP flow mean interarrival time is fairly small (less than 100 ms). This is no longer true for TA-elephants, since a smaller number of flows has to carry the same amount of data over the same temporal window; therefore, the mean interarrival time is much larger (up to hours). We recognize, however, that a possible problem might arise, affecting the statistical quality of the results: the high interarrival time may be due to non-stationarity in the TA-elephant traffic, where TCP flows may be separated by long silence gaps. This effect becomes more visible as the aggregation level |τ_J^(K)| decreases, i.e., for TAs made of few TRs. We will try to underline whether the presented results are affected by this problem in the remaining part of the analysis.

Figure 4.13: Interarrival Time Mean of TCP Flows within TAs

The previous assertion is also confirmed by Figure 4.14, which shows, within each TA, the variance of the TCP flow interarrival time. It can be noticed that the TCP flow interarrival time variance is several orders of magnitude smaller for TA-mice than for TA-elephants.

Figure 4.14: Interarrival Time Variance of TCP Flows within TAs

Let us now consider the Hurst parameter H_J^(K) measured on the interarrival times of TCP flows within TAs. For each TA, we performed the calculation of H_J^(K) using the wavelet-based approach developed in [70], adopting the tools developed there and usually referred to as the AV estimator. Other approaches can be pursued to analyze traffic traces, but the wavelet framework has emerged as one of the best estimators, as it offers a very versatile environment as well as fast and efficient algorithms.
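For readers who want to reproduce a rough estimate without the wavelet toolchain, the following sketch implements the much simpler aggregated-variance method (explicitly not the AV estimator used in this thesis): the variance of the m-aggregated series scales as m^(2H−2), so the slope of log Var versus log m yields H. It assumes a reasonably long, stationary input series.

    use strict;
    use warnings;

    # Aggregated-variance estimate of the Hurst parameter of a series @x:
    # Var(X^(m)) ~ m^(2H-2), hence H = 1 + slope/2 in log-log scale.
    sub hurst_aggvar {
        my @x = @_;
        my (@logm, @logv);
        for (my $m = 1; $m <= @x / 10; $m *= 2) {
            my @agg;
            for (my $i = 0; $i + $m <= @x; $i += $m) {
                my $s = 0;
                $s += $x[$_] for $i .. $i + $m - 1;
                push @agg, $s / $m;                       # block means
            }
            my $mu = 0; $mu += $_ for @agg; $mu /= @agg;
            my $var = 0; $var += ($_ - $mu) ** 2 for @agg; $var /= @agg;
            push @logm, log($m);
            push @logv, log($var);
        }
        my ($n, $sx, $sy, $sxx, $sxy) = (scalar @logm, 0, 0, 0, 0);
        for my $k (0 .. $n - 1) {                         # least-squares slope
            $sx  += $logm[$k];           $sy  += $logv[$k];
            $sxx += $logm[$k] ** 2;      $sxy += $logm[$k] * $logv[$k];
        }
        my $slope = ($n * $sxy - $sx * $sy) / ($n * $sxx - $sx ** 2);
        return 1 + $slope / 2;
    }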
The results are shown in Figure 4.15, which clearly shows that the Hurst parameter tends to decrease for increasing slot, i.e., TA-mice show a smaller H_J^(K) than TA-elephants. Recalling the problem of the stationarity of the dataset obtained by the splitting algorithm, not all the H_J^(K) estimates are significant; in particular, those whose confidence interval is too large, or for which the series is not stationary enough to give a correct estimation, were discarded and are not reported in the plot. Still, the increase of the Hurst parameter toward TA-elephants is clearly visible.

Figure 4.15: Interarrival Time Hurst Parameter of TCP Flows within TAs

To better show this property, Figure 4.16 presents detailed plots of H_J^(K) for K = 10, 50, 100 (top-left, top-right and bottom-left plots respectively) and for all the slots of each class. It can be observed that the Hurst parameter always tends to decrease when considering the TA-mice slots, while it becomes unreliable for TAs with few TRs, as testified by the larger confidence intervals. In the bottom-right plot, finally, we report a detail of the decaying behavior of H_J^(K) for large J; on the x-axis the normalized slot J/K is used, so as to allow a direct comparison among the three different classes. Notice that the last slots, composed of many small TRs which are themselves aggregations of small TCP flows, exhibit the same Hurst parameter in all three classes.

Figure 4.16: Interarrival Time Hurst Parameter of TCP Flows within TAs (detail for K = 10, 50, 100)

Therefore, the most important observation, trustworthy thanks to the good confidence intervals, is that TA-mice behavior is driven by TCP-mice and, similarly, TA-elephant behavior is driven by TCP-elephants. Moreover, the interarrival process dynamics in TA-elephants and TA-mice are completely different in nature: light TRs tend to contain a relatively small amount of data carried over many small TCP flows, which do not clearly exhibit LRD properties; on the contrary, TCP-elephants seem to introduce clearer LRD effects in the interarrival time of flows within TA-elephants.

A possible justification of this effect might reside in the different behavior users have when generating connections. Considering that TCP-mice are typically due to Web browsing, the correlation generated by Web sessions tends to vanish when a large number of (small) TRs are aggregated together. On the other side, TCP-elephants, which are rare but not negligible, seem to be generated with a higher degree of correlation, so that i) the TRs are larger, and ii) when aggregating them, the number of users involved is still small. For example, consider the behavior of a user who is downloading large files, e.g., MP3s: one can suppose that the user starts downloading the first file, then, immediately after having finished the download, starts downloading a second one, and so on until, e.g., the entire LP has been downloaded. The effect on the TCP flow arrival time is similar to an ON-OFF source behavior, whose ON period roughly corresponds to the file download period and turns out to follow a heavy-tailed distribution; the OFF period is instead immaterial here. This is clearly one of the possible causes of LRD properties at the flow level: see, for an example, the τ_D aggregate in Figure 4.3.

Besides, consider again the heavy-tailed distribution of the number of TCP flows within TRs, shown earlier in Figure 4.5. If we further consider TRs as a superset of Web sessions, we obtain the same result stated in [71], that is, an M/G/∞ effect with infinite-variance service time distribution. Finally, TAs tend to aggregate several TRs, separating short-term correlated connections (TA-mice) from low-rate ON-OFF connections with infinite-variance ON time (TA-elephants). This clearly recalls the well-known phenomenon described in [64]: the superposition of ON-OFF sources (as well as "packet trains") exhibiting the Noah effect (infinite variance in the ON or OFF period) produces the so-called "Joseph effect", i.e., the resulting aggregated process exhibits self-similarity and in particular LRD properties.
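The ON-OFF intuition above can be made concrete with a one-line generator: drawing ON periods (file-download durations) from a Pareto distribution with shape parameter α ≤ 2 yields the infinite-variance activity periods (the "Noah effect") whose superposition produces the "Joseph effect". The sketch below is an illustration of ours, not a model fitted to the measured data, and uses plain inverse-transform sampling.

    use strict;
    use warnings;

    # Pareto-distributed ON period: P(X > x) = (xmin/x)^alpha for x >= xmin.
    # With alpha <= 2 the variance is infinite (heavy-tailed activity period).
    sub pareto_on {
        my ($alpha, $xmin) = @_;          # e.g., alpha = 1.2, xmin = 10 (s)
        return $xmin / (1 - rand()) ** (1 / $alpha);
    }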
4.5 Conclusions

In this chapter we have studied the TCP flow arrival process, starting from aggregated measurements taken from our campus network; specifically, we performed a live collection of data directly at the TCP flow level, therefore neglecting the underlying IP-packet level. Traces were both obtained and post-processed through software tools developed by our research group, publicly available to the community as open source.

We focused our attention beyond the TCP level, defining two layered higher-level traffic entities. At a first level beyond TCP, we identified traffic relations, which are constituted by TCP flows sharing the same path properties. At the highest level, we considered traffic relation aggregates having homogeneous bytewise weight; most importantly, the followed approach enabled us to divide the whole traffic into aggregates mainly made of either TCP-elephants or TCP-mice. This permitted us to gain some interesting insights into the TCP flow arrival process. First, we have observed, as already known, that long range dependence at the TCP level can be caused by the fact that the number of flows within a traffic aggregate is heavy-tailed. In addition, the traffic aggregate properties allowed us to see that TCP-elephant aggregates behave like ON-OFF sources characterized by a heavy-tailed activity period. Besides, we were able to observe that LRD at the TCP level vanishes for TCP-mice aggregates: this strongly suggests that the ON-OFF behavior itself is responsible for the LRD at the TCP level.

Chapter 5

Feeding a Switch with Real Traffic

THIS chapter, whose results have been published in [84], proposes a novel methodology to generate realistic traffic traces to be used for the performance evaluation of high performance switches. It is fairly well known that real Internet traffic shows long and short range dependence characteristics, which are difficult to capture with flexible, yet simple, synthetic models. One option is to use real traffic traces, which however are difficult to obtain, as this requires capturing traffic in different places, with synchronization and management problems. We therefore present a methodology to generate several synthetic traffic traces from a single real packet trace, by carefully grouping packets belonging to the same flow so as to preserve the statistical properties of the original trace. After formalizing the problem, we solve it and apply the results to assess the performance of scheduling algorithms in high performance switches, comparing the results to other, simpler traffic models traditionally adopted in the switching community. Our results show that realistic traffic degrades the performance of the switch by more than one order of magnitude with respect to the traditional traffic models.

5.1 Introduction

In the last years, many different studies have pointed out how Internet traffic behaves, focusing their analysis on the statistical properties of IP packets and traffic flows. The whole network community is now more conscious that the traffic arriving at an IP router is considerably different from the traditional models (Bernoulli, on/off and many others). The seminal paper of Leland [62] gave new impulse to traffic modeling, leading to a huge number of papers deeply investigating the problem from different points of view. A group of studies focused on statistical analysis and data fitting, e.g. [85]. These works highlighted traffic properties such as Long Range Dependence (LRD) at large time scales, and also multi-fractal properties. LRD is probably the most relevant cause of degradation in system performance, because it heavily influences the buffer performance, whose behavior, being characterized by a Weibull tail [86], is considerably different from the exponential tail of conventional Markovian models. However, no commonly accepted model of Internet traffic has yet emerged, because the proposed models are either too simple (e.g., Markovian models) or very complex and difficult to understand and tune (e.g., multi-fractal models).
Therefore, Internet traffic is intrinsically different from the random processes commonly used in the performance evaluation of networking systems, where trace-driven simulations are the most commonly used approach. This applies in particular to the performance evaluation of high speed switches/routers, since the overall complexity of switching systems cannot be fully captured by analytical models. Hence, switch designers use simulation to validate their architectures, stressing their performance under critical situations. How to generate the traffic that feeds the simulation model is still an open question, because: i) traffic models (like Bernoulli, on/off, etc.) are indeed flexible and easy to tune, but they are not good models of real Internet traffic; ii) real traffic traces are more difficult to tune, and are not flexible since they refer to a particular network topology/configuration.

In this chapter, we propose a novel approach to generate synthetic traffic traces to be used for performance evaluation. It stems from the generation of synthetic traffic from a real trace, and adds the capability of building different scenarios (e.g., number of input/output ports and traffic pattern), keeping the real traffic characteristics while providing the flexibility of synthetic traffic. The main idea is to generate traffic which follows the time correlation of packets at the IP flow level and, at the same time, satisfies some given traffic pattern relations. As an interesting example of application, the performance of basic switching architectures is discussed and compared with traditional benchmarking models.

5.2 Internet Traffic Synthesis

We consider an N × N switch, where each switch port is associated with an input/output link toward external networks, as shown in the top plot of Figure 5.1. One possible approach to feed this switch with real packet traces requires sampling the traffic at each of the links; unfortunately, this approach is not easily viable, since it requires either having a real N × N switching architecture or managing and synchronizing several distributed packet sniffers. In a more realistic situation, only one trace referring to one link is available: for example, in this work traffic traces have been sniffed at Politecnico's egress router, as shown in the bottom part of Figure 5.1. Once the trace is available, a methodology to create different traffic scenarios is required, for example by imposing specific traffic relations among the input/output ports. The output of the methodology will be a set of traces satisfying the constraints imposed by the selected scenario. Our approach tries to establish the best mapping between source-destination IP addresses (s, d) and input-output ports (i, j) of the switch, in order to generate N synthetic traces that can be "replayed", i.e., fed to the input links of the switch under analysis. The traffic relations are described by a traffic matrix T, of size N × N, expressing the normalized average offered load between any input and output port. Additional constraints must be met in order to keep the statistical behavior of the original traffic trace, e.g., to preserve the time correlation among the IP packets having the same source and destination addresses. This problem is intuitively not trivial, given the number of constraints that must be satisfied.
To better discuss the synthesis problem, we introduce the notation used throughout the rest of the chapter, then formalize the problem and solve it using a greedy heuristic.

Figure 5.1: Internet traffic abstraction model (top) and measurement setup (bottom)

5.2.1 Preliminary Definitions

When traffic is routed on a network, it is possible to focus the attention on different levels of aggregation – namely the IP packet, IP flow, and flow aggregate levels. In particular, we define:

- IP Flow: an IP flow aggregates all IP packets having the same IP source address s ∈ S and IP destination address d ∈ D. We denote its size, expressed in bytes, by f_sd, and its normalized load by λ_sd = f_sd / Σ_{s,d} f_sd. This is a natural aggregation level, which entails that IP packets routed from s to d will follow the same route, closely mimicking Internet behavior.

- Flow Aggregate: we define a flow aggregate as the aggregation of all packets having source address s ∈ S_i and destination address d ∈ D_j. We will choose the address sets such that {S_i} and {D_j} are partitions of S and D respectively. The flow aggregate represents the traffic crossing the switch from input i to output j.

Let us denote the address-to-port mapping function by a2p(·); then, for any source address s tied to a specific input i we have a2p(s) = i for all s ∈ S_i, and similarly, for any destination address d tied to a specific output port j, a2p(d) = j for all d ∈ D_j.

Figure 5.2 reports an example of the previous classification: at the bottom there is the original traffic trace, which is composed of four IP flows (labeled A, B, C, D); two flow aggregates are then generated, considering the unions {A,B} and {C,D} respectively.

Figure 5.2: Internet traffic at different levels of aggregation

5.2.2 Traffic Matrix Generation

The mapping of the IP flows to a given traffic matrix T establishes a binding between IP addresses and switch ports. More formally, this binding can be thought of as a generalization of the P||Cmax optimization problem of scheduling jobs over identical parallel machines [69]; its formalization is provided in Figure 5.3, in which element (i,j) of a matrix M is denoted by M_ij, the i-th row by M_i, the j-th column by M_j, and M^T is the transpose of M. The normalized IP flow load matrix Λ ∈ [0,1]^{|S|×|D|} is the input of the optimization problem, in which each IP flow represents a job of size λ_sd that has to be scheduled without deadline. A fixed number of machines is available, each corresponding to an input/output couple of the switch; each machine is assigned a target completion time, i.e., an entry of the target (normalized) traffic matrix T ∈ [0,1]^{N×N}, which can be exceeded without penalty. The matrices P ∈ {0,1}^{|S|×N} and Q ∈ {0,1}^{|D|×N} are the output of the problem, i.e., the mapping of jobs to machines, or, in our case, the mapping of source IP addresses to switch input ports (p_si = 1 iff a2p(s) = i) and of destination IP addresses to switch output ports (q_dj = 1 iff a2p(d) = j). The objective is to minimize the maximum error committed in the approximation, i.e., the maximum deviation from the target traffic matrix. The first two system constraints bound the error ε_ij from below by the absolute difference between the approximated aggregate load (P^T Λ Q)_ij and the target completion time T_ij:
    minimize     max_{i,j} ε_ij
    subject to   ε_ij ≥ (P^T Λ Q)_ij − T_ij        ∀ i, j ∈ [1, N]
                 ε_ij ≥ T_ij − (P^T Λ Q)_ij        ∀ i, j ∈ [1, N]
                 Σ_i p_si = 1                      ∀ s ∈ S
                 Σ_j q_dj = 1                      ∀ d ∈ D

Figure 5.3: The optimization problem

With respect to the classic P||Cmax formulation, two additional constraints are present: once an IP destination address has been mapped to a particular switch output port, all IP flows with the same destination IP address must be mapped to the same output port, as we assume that only one path is used to route packets. The same applies to IP source addresses and switch input ports. Therefore, each flow source will use just one row of the machine grid and each flow destination will use just one column, as enforced by the last two constraints. Since the well known P||Cmax optimization problem [69] is strongly NP-hard, its bi-dimensional extension cannot have a polynomial time solution. Moreover, due to the size of our problem, we look for a simple and fast approximation of the optimal solution: among all the possible strategies, a greedy approach has been selected due to its extreme simplicity.

5.2.3 Greedy Partitioning Algorithm

In this section, we briefly highlight the main features of the adopted greedy strategy, whose pseudo-code is shown in Figure 5.4. The intuition at the base of the algorithm is to try to map the heaviest unmapped IP flow (s, d) to the freest port pair (i, j), and then force all flows having the same IP source address to enter from the same switch input port, while flows having the same destination address will be forced to exit from the same switch output port. This is done by updating the approximated traffic matrix A, whose elements account for the size of the IP flows assigned to that particular port pair. The procedure is repeated until all the IP flows have been mapped.

Figure 5.4: The greedy algorithm

In the context of a greedy solution, the choice to accommodate at each step the heaviest remaining IP flow – which simply amounts to processing IP flows in reverse sorted order – is quite intuitive. However, this is not the case for the port pair selection; indeed, we tried several policies and tested their performance under different traffic scenarios (a simplified rendition of the first policy is sketched below). For example:

- the port pair corresponding to the globally largest element of T − A, i.e., (i, j) = argmax_{l,m} (T_lm − A_lm);
- the row-wise largest element, i.e., the largest element of the largest row: i = argmax_l Σ_m (T_lm − A_lm), j = argmax_m (T_im − A_im) – or, symmetrically, the column-wise largest;
- the coupled (row,column)-wise largest element, i.e., the element lying at the intersection of the largest row and the largest column: i = argmax_l Σ_m (T_lm − A_lm), j = argmax_m Σ_l (T_lm − A_lm).
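A simplified Perl rendition of the greedy mapping (our own sketch, using the first of the policies above and skipping the incremental updates of flows whose endpoints become bound later) could look as follows; %$load maps "source destination" strings to normalized loads and $T references the N × N target matrix.

    use strict;
    use warnings;

    sub greedy_map {
        my ($N, $T, $load) = @_;
        my (@A, %src2in, %dst2out);
        $A[$_] = [ (0) x $N ] for 0 .. $N - 1;        # approximated matrix
        # heaviest IP flow first
        for my $f (sort { $load->{$b} <=> $load->{$a} } keys %$load) {
            my ($s, $d) = split ' ', $f;
            my ($best, $bi, $bj);
            for my $i (defined $src2in{$s}  ? ($src2in{$s})  : (0 .. $N - 1)) {
                for my $j (defined $dst2out{$d} ? ($dst2out{$d}) : (0 .. $N - 1)) {
                    my $free = $T->[$i][$j] - $A[$i][$j];   # residual capacity
                    ($best, $bi, $bj) = ($free, $i, $j)
                        if !defined($best) || $free > $best;
                }
            }
            $src2in{$s}  = $bi;              # bind source to input port,
            $dst2out{$d} = $bj;              # destination to output port
            $A[$bi][$bj] += $load->{$f};     # update approximated matrix
        }
        return (\@A, \%src2in, \%dst2out);
    }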
All the former approaches gave similar results only with a uniform target matrix T; in the other cases the global approach gave the best results, even when compared to the coupled (row,column)-wise one. Indeed, the global approach tries to minimize the maximum error between the target T and the approximated traffic matrix A at a local level, i.e., for a specific input/output pair. The other strategies try to minimize, respectively, the error on the input ports (row-wise strategy) or the average error (coupled strategy), which explains their relatively worse performance.

5.3 Performance study

5.3.1 Measurement setup

In order to collect traffic traces, we observed the data flow on the Internet access link of our institution, i.e., we focused on the data flow between the edge router of our campus LAN and the access router of GARR/B-TEN, the Italian and European research network. Since our university hosts mainly act as clients, we recorded only the traffic flows originated by external servers and reaching internal clients (i.e., the direction highlighted in Figure 5.1). The trace has been sampled during a busy period of six hours, collecting about 28 million packets belonging to 42,400 IP flows. The time window has been chosen such that the overall traffic is stationary both in the first and second order statistics. The property of real traffic that we mainly take into account is long range dependence, which is well known to be responsible for buffer performance degradation. It is not the topic of this chapter to provide a statistical analysis of the traffic measured at our institution's router, but it is relevant to highlight that the measured traffic exhibits LRD [87] properties from the scale of hundreds of milliseconds up to the entire length of the data trace, with a Hurst parameter in the range 0.7–0.8 [88].

5.3.2 The switching architectures under study

An IP switch/router is a very complex system [89], composed of several functionalities: here we focus our attention only on the performance of the switching system. We consider a simple model of the switching architecture, based on the one described in [90]. The incoming, variable-size IP packets are chopped into fixed-size cells which are sent to the internal switch, where they are transferred to the output port and then reassembled into the original packets before being sent across the output link. The internal switch, which operates in a time-slotted fashion, can be input queued (IQ), output queued (OQ) or a combined solution, depending on the bandwidth available inside the switching fabric. IQ switches are usually considered to scale better than OQ switches with the line speed, and for this reason they are adopted in practical implementations of high speed switches. Input queues are organized into the well known virtual output queue (VOQ) structure, necessary to maximize the throughput. One disadvantage of IQ switches is that they require a scheduling algorithm to coordinate the transfer of the packets across the switching fabric; the performance of an IQ switch, in terms of delay and throughput, is very sensitive to the adopted scheduling algorithm and depends also on the traffic matrix considered.
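As a toy illustration of the cell-and-VOQ abstraction just described (data structures and names are ours, not those of the simulator actually used), an arriving packet is segmented into ⌈len/64⌉ fixed-size cells and stored in the VOQ of its input/output pair; the last cell carries the marker used for reassembly at the output.

    use strict;
    use warnings;
    use POSIX qw(ceil);

    my $CELL = 64;                          # internal cell size (bytes)
    my @voq;                                # $voq[$i][$j] = queue of cells

    sub enqueue_packet {
        my ($i, $j, $len, $id) = @_;        # input port, output port, bytes, id
        my $ncells = ceil($len / $CELL);    # segmentation into fixed-size cells
        push @{ $voq[$i][$j] },
             { pkt => $id, cell => $_, last => ($_ == $ncells) }
            for 1 .. $ncells;               # 'last' marks the reassembly boundary
    }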
Scheduling algorithms can work either in cell mode or in packet mode [90]. In cell mode, cells are transferred individually. In packet mode, cells belonging to the same IP packet are transferred as a train of cells, in subsequent time slots; hence, the scheduling decision is correlated with the packet size. In the past, the performance of several scheduling algorithms has been compared [90, 91] under Bernoulli or correlated on/off traffic. Here we compare the performance of the maximum weight matching (MWM) [92] and iSLIP [93] scheduling algorithms under different traffic models. We selected MWM as an example of a theoretically optimal algorithm which is too complex to be implemented, whereas iSLIP was chosen as an example of a practical implementation with suboptimal performance. We consider a 16 × 16 switch, with an internal cell format of 64 bytes. In the IQ switch, buffers are set equal to 5,000 cells per VOQ, i.e., about 320 KBytes per VOQ and about 5 MBytes per input port – a reasonable amount of high-speed memory available today in an input card. In the OQ switch, buffers are set equal to 80,000 cells (5,000 × 16) to compare fairly with the IQ switch.

5.3.3 Traffic scenarios

For the sake of space, we present our results only for uniform traffic, i.e., T_ij = ρ/N, where ρ (between 0.1 and 0.9) is the average input load, normalized to the link speed. We consider two scenarios, depending on the packet generation process.

- Packet trace (PT). Packets are generated according to the trace, following the methodology of traffic synthesis presented in Section 5.2.2.

- Packet trimodal (P3). Packet generation is modulated by an on/off process satisfying the traffic matrix T. Packet lengths are generated according to a trimodal distribution, which approximates the distribution observed in our trace (a generator sketch is given below). This can be considered a traditionally good synthetic model, tuned according to the features of the real trace.
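A minimal sketch of the P3 length generator follows; the three modes and their probabilities are placeholders chosen for illustration (typical IP packet sizes), not the values actually fitted on our trace.

    use strict;
    use warnings;

    my @mode = ( [0.5, 40], [0.2, 576], [0.3, 1500] );   # (probability, bytes)

    sub trimodal_len {
        my $u = rand();
        for my $m (@mode) {
            return $m->[1] if ($u -= $m->[0]) <= 0;
        }
        return $mode[-1][1];                             # numerical safety
    }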
5.3.4 Simulation results

Figures 5.5 and 5.6 plot the average delay as a function of the normalized load for three configurations: OQ switch, IQ switch with MWM scheduler and IQ switch with iSLIP scheduler. The first graph refers to cell mode (CM) schedulers for IQ, the second to packet mode (PM) schedulers for IQ. In all cases, the delays experienced under the Packet Trace model are much larger than in the case of trimodal traffic. This holds true not only at high load: also at low load the LRD traffic causes much higher delays. Note also that the performance of CM and PM are almost the same in both scenarios; this is reasonable, since the estimated coefficient of variation of the packet length distribution is 1.05 and, according to the approximate model in [90], CM and PM should then behave the same.

Figure 5.5: Mean packet delay under PT and P3 scenarios for cell mode policies
Figure 5.6: Mean packet delay under PT and P3 scenarios for packet mode policies

Figure 5.7 shows the throughput achieved in both scenarios considering CM policies (PM behaves the same). Because of the LRD property of the input traffic, the queue occupation under PT is much larger than under P3, causing a higher loss probability and a reduced throughput. The most surprising result is that the relative behavior of the three schedulers changes from P3 to PT. Indeed, OQ, IQ-MWM and IQ-iSLIP give almost the same performance with the traditional traffic model, while under the Packet Trace model a degradation of the throughput curves (up to a 10% reduction) and an increase in delays are present. Moreover, IQ-MWM behaves the worst with respect to the delay metric, since its matching is driven by a weight based on queue length [90]; iSLIP, on the contrary, shows shorter delays than OQ: a partial explanation is that, at high load, iSLIP experiences larger losses than OQ under PT (larger than IQ-MWM, for instance), as Figure 5.7 shows. Finally, the most important fact is that OQ is penalized in terms of average delay: the shared buffer at the output queue allows much longer queues to build up, therefore degrading the delay performance because of the Weibull tail [94]. These results underline that the traffic models traditionally adopted to assess switching performance are not capable of showing real-world figures.

Figure 5.7: Throughput under PT and P3 scenarios for cell mode policies

5.4 Conclusion

This work proposed a novel and flexible methodology to synthesize realistic traffic traces for evaluating the performance of switches and, in general, of controlled queueing networks. Packets are generated from a single packet trace, from which different synthetic traffic traces are obtained fulfilling a desired scenario, e.g., a traffic matrix. Additional constraints are imposed to maintain the original traffic characteristics, mimicking the behavior imposed by Internet routing. We compared the performance of a switch adopting different queueing and scheduling strategies under two scenarios: the synthetic traffic of our methodology and traditional traffic models. We observed that not only the absolute values of throughput and delay can change considerably from one scenario to the other, but also their relative behaviors. This fact highlights the importance of some design aspects (e.g., buffer management) which are traditionally treated separately. These results show new behavioral aspects of queueing and scheduling in switches, which will require more insight in the future.

Chapter 6

Data Inspection: the Analysis of Nonsense

DESPITE the availability of a rather large number of scientific software tools, to the best of the author's knowledge they all fail, unfortunately, in one scenario: when the data cannot be entirely fitted into the workstation's random access memory. This chapter introduces DiaNa, a novel software tool primarily designed to process huge amounts of data in an efficient, spreadsheet-like, batch fashion. One of the primary design goals was to offer extreme flexibility from the user's perspective: as a consequence, DiaNa is written in Perl. The DiaNa syntax is a very small and orthogonal superset of the underlying Perl syntax, which allows, e.g., to comfortably address file tokens and to profitably use file formats throughout the data processing. The DiaNa software also includes an interactive Perl/Tk graphical user interface, layered on top of several batch shell scripts. While we shall only briefly review the former, we will focus on the latter, thoroughly describing the tools' architecture in order to better explain the offered possibilities.
Besides, the achievable performance will be extensively inspected; we finally present some examples of the results gathered through the use of the presented tool in a networking context.

6.1 Introduction and Background

The vast landscape of software is characterized by massive redundancy, of which the number of existing programming languages, from general-purpose to very specialized ones, is a manifest example. The choice of the "right" tool for a given task is not univocal, given the number of existing tradeoffs and interrelationships among the different possible comparison criteria. Nevertheless, despite the wide variety of already available software, there are cases where the need for programs such as the Data Inspector aka Nonsense Analyzer (DiaNa) arises. Specifically, to the best of the author's knowledge, all the existing publicly available solutions fail to perform when the amount of data to be processed is much greater than – and cannot therefore be fitted into – the workstation's random access memory.

The DiaNa software has been designed by The Networking Group (TNG) at the Politecnico di Torino during research work dealing with massive amounts of real traffic measurements. Obviously many researchers before us have dealt with similar networking problems, and a large amount of software has been built to facilitate their analysis; however, in the author's opinion, the lack of generality of these programs entailed the need for a significant effort in order to do anything out of the ordinary. In the past years we therefore developed, extensively tested and profitably used the tool, which is made freely available [95] to the research community. The package is constituted by a set of Perl [96] scripts and a Perl/Tk Graphical User Interface (GUI): in the following, we will refer to the shell tools as d.tools and to both the GUI and the entire package as DiaNa, covering their essential features.

It must be pointed out that the presented framework does not intend to be a replacement for any existing analysis tool, and a key point of its design is to allow extreme interoperability with other existing tools. Also, although the software has been designed for rather lengthy batch computations on huge volumes of data, its use is not restricted to that scenario, as the featured GUI can profitably be used in an interactive fashion. DiaNa and the d.tools can automate tedious, repetitive and otherwise error-prone activities, and can assist the user in activities ranging from data collection and preparation to the typesetting and dissemination of the final results. More specifically, the software is able to parse text files holding information structurally representable as matrices, performing spreadsheet-like operations on the data. However, even in a single application domain the range of tasks that have to be performed over the same data set can change dramatically: therefore, the flexibility to perform the widest possible range of tasks has been one of the most critical design features. The choice of Perl as the base language is extremely helpful in this direction, allowing a natural cooperation with other tools, besides the use of a wide number of available libraries and the on-the-fly parsing of user-defined Perl code, as we will detail in the following.
Therefore the d.tools, besides offering a small number of specialized tasks, have to be intended as a customizable framework with a number of built-in functionalities especially suited to process a certain kind of input data.

The rest of this chapter is organized as follows: Section 6.2 contains a description of the architecture and syntax; a benchmarking analysis is proposed in Section 6.3; finally, some examples of use in a networking context are presented in Section 6.4.

6.2 Architecture Overview

In this section, starting from a wide-angle perspective, we outline the interaction and the interdependence of the various pieces composing the DiaNa software. As will become clear later, there are basically three different ways of interacting with the framework: namely, i) the interactive GUI approach, ii) the shell-oriented usage and iii) the API level, ordered by increasing level of complexity. The relationship among the DiaNa GUI, the d.tools and Perl is sketched in Figure 6.1. From a top-down point of view, the graphical Perl/Tk GUI acts as a centralized information collector and coordinator of the lower-level shell tools; each tool performs an atomic task on the input data, offering the possibility of defining and exploiting custom file formats, upon which useful algorithmic expressions may be built. More in detail, the GUI offers an Integrated Development Environment (IDE) to manage the DiaNa syntactical items; furthermore, it automates the interaction among the tools, covering a wide range of tasks, such as:

Figure 6.1: DiaNa Framework Conceptual Layering

- discriminating and splitting the input according to a given condition;
- performing arbitrary numerical and textual operations, in a serial or parallel fashion, on the input data;
- ranking the input data in a database-like manner;
- evaluating, in an on-line fashion, the first and second order statistics, possibly sub-sampling the data;
- computing empirical probability distributions, possibly conditioned.

The DiaNa items can basically be divided into the following categories: formats define columnar field labels and comments, allowing to easily identify and describe the different file tokens; expressions allow to use the previously defined format labels and tokens for, e.g., spreadsheet-like computations – as discussed in more detail later, the expression syntax represents a small and orthogonal superset of the already full-blown Perl one; ranges represent a convenient way to define partitions of the real number field R.

In the following sections we will dissect, adopting different perspectives, the architecture common to the d.tools and, devoting special attention to their syntax, we will describe how the items are defined and interact together. First of all, Section 6.2.1 motivates the choice of the base language. Although the DiaNa syntax provides a fairly small increment over the Perl one, it would nevertheless be cumbersome to provide here its exhaustive reference material.
Therefore, rather than covering all the gory details of the added items (i.e., formats, ranges and expressions), we will just give, respectively in Section 6.2.2 and Section 6.2.3, an adequate description of their use by illustrating the interaction between formats and expressions. After detailing the common core architecture in Section 6.2.4, we consider the most basic tool, d.loop, describing its usage in Section 6.2.5, while we quickly review the main GUI features in Section 6.2.6.

6.2.1 In the Beginning Was Perl

Although languages are interesting and intrinsically worthy of study, their real purpose is to serve as tools in problem solving: we will therefore focus here on the practical reasons behind the choice of Perl as DiaNa's parent framework. Nevertheless, we may say that the philosophy of the language designer shapes the language, its libraries, its ecological niche, and its community: if the language is successful, it will be used in domains that the original designer may not have intended, as well as being extended in ways the original designer may not have intended. This happened with many languages, but it is especially true for Perl: a real culture grew up around the language, to such an extent that there exists even Perl poetry, as well as book chapters on Perl poetry [98].

Technically, there are two main manifest disadvantages in this choice, the first being the a priori limit to the performance achievable with respect to, e.g., a specialized C program; however, should performance become a serious problem, it shall not be forgotten that it is possible to use external C code from Perl¹. A more serious drawback, as admitted by the authors themselves in [99], is that "There's no correct way to write Perl. A Perl script is correct if it's halfway readable and gets the job done before your boss fires you.". Although this is stated in a fun and winning fashion, the truth is that Perl syntax normally allows producing really illegible code – even outside the context of the Obfuscated Perl Contest [104] – which clearly has a cost in terms of readability and maintainability. On the other side, Perl dramatically reduces the writability cost, and its guiding design principle "There's More Than One Way to Do It" translates directly into unbeatable flexibility and extensibility. These are indispensable properties since, even within a single application domain, the requirements of two distinct projects may vary widely. In this respect, Perl has the singular advantage of a huge collection of well organized modules, publicly distributed worldwide through the CPAN [96] network of Internet sites; these obviously include an incredible amount of scientific data-manipulation tools, such as the Perl Data Language (PDL) [102]. Moreover, Perl offers a full suite of modern features beyond the numerical and graphical ones (e.g., from database access to network connectivity and Object-Oriented programming), as well as a wide range of ways to communicate with other tools, e.g., through the mechanisms of i) shell piping or ii) embedding of other applications. The former approach, though powerful, is a limited form of "one-way" communication. The latter, more advanced form of communication is also supported, as exemplified in [106]; its main disadvantage is that the implementation is often platform dependent.

Finally, we believe that the implementation in a true general-purpose, full-featured language gives access to a wealth of useful features – and Perl fills all these constraints most admirably – whereas specialist systems, whether commercial or free, are hampered in their access to such features by their proprietary nature and specialist syntax. Besides, it is undoubtedly easier to add features to a robust and complete language than to go in the opposite direction. Therefore, we may say that whoever already uses Perl for day-to-day programming tasks will find the DiaNa extension extremely productive. Conversely, the advanced use of DiaNa can be achieved through an expressive knowledge of Perl, and the DiaNa learning cost is therefore absorbed by the large number of other possible uses of Perl.

¹ XS allows interfacing Perl with C code or a C library, creating either dynamically loaded or statically linked Perl libraries; external C code is also supported in an inline fashion through the Inline Perl module.
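As a hedged example of the "inline" escape hatch mentioned in the footnote (the function name and body are ours), the CPAN Inline module lets a C routine be compiled on the fly and called as an ordinary Perl subroutine:

    use strict;
    use warnings;
    use Inline C => <<'END_C';
    /* toy C helper compiled on the fly by Inline */
    double weight_ratio(double bytes, double total) {
        return total > 0 ? bytes / total : 0;
    }
    END_C

    print weight_ratio(1500, 3000), "\n";   # prints 0.5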
Finally, we believe that implementation in a true general-purpose full-featured language gives access to a wealth of useful features –and Perl filled all these constraints most admirably– whereas specialist systems, whether commercial or free, are hampered in their access to these features by their proprietary nature and specialist syntax. Besides, it is undoubtedly easier to add features to a robust and complete language rather than going the opposite directions. Therefore, we may say that who already used Perl for day-to-day programming tasks will find the DiaNa extension extremely productive. Conversely, the advanced use of DiaNa can be achieved through an expressive knowledge of Perl, and therefore DiaNa learning cost is be absorbed by the other large number of possible uses of Perl. 1 XS allows to interface Perl and C code or a C library, creating either dynamically loaded or statically linked Perl libraries; it must be said that external C code is also supported in an inline fashion thorugh the Inline Perl module 6.2. ARCHITECTURE OVERVIEW 91 6.2.2 Input Files and Formats The d.tools matrix abstraction of a generic input data file is sketched in Figure 6.2. Input files may contain comments, that is, lines beginning with a configurable prefix (by default a î sign). Comments rows are usually discarded, but can be optionally processed; moreover, special informations can be included in the file header, defined as the first block of comment lines at the beginning of the input file Different rows are individuated by a configurable Record Separator (RS), newline by default, whereas an Input Field Separator (IFS) –which can be a character or a regular expression [103]– is responsible for columns partitioning. IFS RS Figure 6.2: Input Data from the DiaNa Perspective As earlier mentioned, the assumption behind the DiaNa developement is that not all the processing data can be sourced at once into the workstation memory. Therefore, the adopted strategy is to perform an on-line processing over a window of data rows, doubly limiting the information available at any point of time in both the “future” and the “past”, with respect to the “present” point. That is, as soon as data rows have been read from input, they become available for processing and are possibly buffered; all the columns of the buffered rows represent the sub-portion of the input available to the running algorithm. This crucial point is emphasized in Figure 6.2, where the buffered rows are represented with colors fading toward the past. Columnar fields play a privileged role over file rows: therefore, a format allows to attribute a (label,comment) pair to each file column. Each format definition is usually stored in an external file, where each row contains the label and comment string pair separated by a tabulation character. However, columnar labels may be directly stored in the input file header, which has to be intended as a quick and dirty alternative to the proper definition of a format; indeed, although very useful, this option is neverthless discouraged, since it is more error prone, less expressive and less maintainable than the centralized definition of a stable format. 6.2.3 Formats and Expressions Interaction DiaNa expressions are the glue that ties input files to formats, allowing to quick reference the file tokens as matrix cells in algebraic idioms or algorithmic chunks of code. 
As will be outlined later, input files can be read either in serial or in parallel; therefore, expressions are interpreted in a serial or parallel fashion accordingly to the input context, which is 92 CHAPTER 6. DATA INSPECTION: THE ANALYSIS OF NONSENSE Token #j #i:j #lab #i:lab Meaning the j-th columnar field of the current row the j-th columnar field of the the abs(i)-th previously read row the columnar field, whose label is lab, of the current row the columnar field, whose label is lab, of the abs(i)-th previously read row Table 6.1: Serial Expression Tokens usually uniquely settled by each tool. Let focus first on the former kind, i.e., on serial expressions: although parallel expressions syntax is formally identical to the serial one, nevertheless their interpretation is radically different. Serial Expressions As earlier mentioned, the DiaNa syntax consists of a small and orthogonal superset of the Perl syntax: more precisely, the novelty is represented by the introduction of a few tokens. Tokens are strings beginning with the pound # sign, which is interpreted by Perl as a comment delimiter character: that way, DiaNa tokens do not clashes with the underlying Perl syntax. Token allows to easily address different columnar fields among the several buffered rows within the current file “window”: i.e., they embed a mechanism to profit of this sort of memory. Moreover, since formats define a 1-to-1 correspondence between file columns and labels, these latter can be used as in expressions’ tokens. Serial tokens have mainly the form reported in Table 6.1; however, the list bares additional discussion, and there are a number of special convenient expansions. First of all, it should be noted that DiaNa tokens resemble more closely Awk’s tokens (i.e., $1,$2,. . . ,$NF) than Perl ones (i.e., $ [0],$ [1],. . . ,$ [-1]). Secondly, the two notations #i:j and #-i:j being perfectly equivalent, the latter may be used for further clarity. Besides, it is possible to reference fields starting from the end of the line through tokens of the form #-j : #-1 represents the last field, #-2 represents the second last and so on; however, this feature is valid on the current row only, i.e., it cannot be used in conjunction with memory. Other useful expansions are supported: for example, i is expanded to the i-th column when the matches the regular expression /ˆ ï d+$/, i.e., is uniquely constituted by the relative number Øñð Õ . Also, #. expands to the line number of the current input files seen as one unique stream: this differs from Perl internal variable $., since the latter counter is restarted on any new input file. Several tokens can then be combined together through the powerful Perl syntax, including any possible algebraic, textual and subroutine based manipulations. Indeed, all the tokens are parsed once at startup, in order to speed up their evaluation during processing, and the DiaNa expression becomes a Perl expression This desing choice follows from the pursuit of the maximum possible flexibility, in order for the application to be as general-purpose as possibile, suited to a range of contexts as well as to different problem classes within the same domain. 6.2. ARCHITECTURE OVERVIEW 93 Parallel Expressions Parallel expression tokens, though formally identical to their serial counterpart, have a totally different interpretation: #i:j identifies the j-th column of the i-th file (where both columns and files numbering starts from 1), whereas #i expands to #1:i , . . . 
Parallel Expressions

Parallel expression tokens, though formally identical to their serial counterparts, have a totally different interpretation: #i:j identifies the j-th column of the i-th file (where both column and file numbering starts from 1), whereas #i expands to #1:i, ..., #N:i (where N is the number of input files).

Figure 6.3: Parallel Expressions and Default Expansion

Readers familiar with UNIX-like environments will immediately appreciate the ability to easily write a tool orthogonal to paste, as illustrated in Figure 6.3. Indeed, suppose we work in parallel on three files a, b, c, and assume further that each file simply has two columns, containing respectively lower- and upper-case characters. By reason of the default token expansion, any expression concatenating the tokens #1 #2 will naturally interleave the files' columns; conversely, a paste-like output can be achieved only by explicit reference, as in #1:1 #1:2 #2:1 #2:2 #3:1 #3:2. It is not hard to see that parallel expressions can be tremendously useful for direct comparison: in the simplest form, each token can be algebraically compared (e.g., differences, ratios, averages) to its counterparts, stored in the same columnar positions of the other files.
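The following toy Perl sketch reproduces the two behaviors just described on made-up file contents, mirroring what the default expansion and the explicit per-file reference would print.

#!/usr/bin/perl
use strict;
use warnings;

# Toy reproduction of the parallel-expression behavior described above:
# the current row of three two-column "files" (contents invented here).
my @file = ( [qw(a A)],    # file 1: columns 1 and 2
             [qw(b B)],    # file 2
             [qw(c C)] );  # file 3

# Default expansion "#1 #2": column 1 of every file, then column 2 of
# every file -> "a b c A B C" (columns interleaved across files).
print join(' ', (map { $_->[0] } @file), (map { $_->[1] } @file)), "\n";

# Explicit reference "#1:1 #1:2 #2:1 #2:2 #3:1 #3:2": both columns of
# each file in turn -> "a A b B c C", i.e., a paste-like output.
print join(' ', map { @$_ } @file), "\n";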
Ranges

For completeness, we overview the syntax used to partition a real interval [min, max] into different bins. To uniformly partition the [min, max] range, one can specify either the bin size or the number of bins; alternatively, to take into account only the order of magnitude of the variable, one can specify a logarithmic bin size (through the LOG keyword; future releases plan to support an arbitrary base b through LOG b); finally, arbitrary ranges can be specified as explicit lists of bin edges.

6.2.4 The d.tools Core

The d.tools are a set of tools, each designed to perform a well-defined task, developed around a common core, sketched in Figure 6.4. The framework provides a set of useful built-in functionalities, reviewed in the current section, to simplify the implementation of new and unsupported tools. At the leftmost side of the scheme, we notice that tools may work in either a parallel or a serial fashion over the input files: in the former case, several buffered rows for each separate file are available at the same time; in the latter, the files are concatenated into a unique stream. Each tool will usually operate in a single context, since the syntactical expressions, although formally identical, have a totally different interpretation in the two cases. The core transparently handles the most common archiving and compression schemes (i.e., tar, gzip, bz2 as well as their combinations): the needed decompression/de-archiving filters are automatically pipelined at the input based on the file extension. Comment lines, discarded by default, can be processed if needed.

At tool bootstrap, the engine loads any required Perl module and possibly evaluates on-the-fly custom startup code, in the form of external scripts or inline code. The core module also tries to load the format corresponding to the input files, again based on the file extension (note that format recognition is compatible and coordinated with compression recognition); if the file format has been defined (either globally, or locally in the input file header), textual labels can be used in tokens. Expressions are parsed once, and the real memory requirements are automatically determined in both the vertical deepness (i.e., how many input rows have to be buffered) and the horizontal extension (i.e., which fields have to be buffered), so that the DiaNa memory can be represented as a sparse matrix. Besides, memory is set up only for the fields that are really required: rather than serving memory-economy purposes, this results in better performance through a partial optimization of the memory update loop, as we shall discuss in more detail in Section 6.3.7. Moreover, the configurable uniform or Monte Carlo sub-sampling routine is possibly set up; among the other startup tasks, if verbose mode has been requested, the core sets up the progress meter heuristic (provided that either the /dev/stderr or the /dev/stdout channel, in that order, is unused).

Figure 6.4: Architecture of a Generic d.tools

At each step of the loop over the input rows, a sampling decision is taken, and data are either discarded or read and possibly buffered, thus becoming available for on-line processing; the progress meter, if used, is refreshed by unity-percentage increments. The specific action performed clearly depends on the tool purpose: for example, the simplest tool (d.loop) does not perform any action by itself, but rather offers the possibility of executing user-specific code, possibly defined directly on the command line. Rather than detailing the possibilities offered by the already existing tools, it is important to stress that user code can easily be plugged in at this point of the process, using the DiaNa engine as a flexible and smart input-output framework which relieves the user of a wide series of tasks, allowing him to concentrate on the purely algorithmic part. The interaction with DiaNa is possible either through the existing tools, by on-the-fly evaluation of user-defined code, or, with an orthogonal approach, by embedding the framework into a custom application as a plain Perl module.

Finally, the rightmost side of the synopsis shows that up to ten output streams (optionally compressed through the same mechanisms accepted at the input) can be used simultaneously, each of which can be fed into a different custom shell pipe; this is clearly useful to parallelize algorithmic work on the same input data, sharing and therefore amortizing an intensive I/O workload over multiple processing tasks. The data processing can benefit from Perl's natural support for a wide variety of structures, such as hashes and arrays, that can be grown on demand without requiring explicit memory management. While we will provide some practical examples in Section 6.4, let us anticipate that nested combinations of the above structures are allowed without restriction – which can be very useful i) to post-process the parsed data, ii) for processes layered over several stages, or iii) for a common pre-processing phase. Besides, it should be said that complex data structures can naturally be stored and re-loaded: the internal Perl structures can be dumped, through standard Perl modules, into text scripts that may simply be sourced by the tools on demand.
6.2.5 The d.tools Flexibility

d.loop \
    [-cvVzZfPW] \
    [-d lib1,lib2,..libN] \
    [-B code] [-C code] [-E code] \
    [-? 'key1=>val1,...keyN=>valN'] \
    [-! private] \
    [-H hdrchar] \
    [-I IFS] [-O OFS] \
    [-F format] \
    [-S sampling] \
    [-0 out0 .. -9 out9] \
    [(-e expression | 'expr')] \
    in0 .. inN

Figure 6.5: The d.loop Synoptic

In this section we quickly describe how the core functionalities described so far can be accessed at the shell level – avoiding as far as possible to provide a reference manual, whose scope would be out of context. This is done with the intent of showing, among the three possible kinds of interaction mentioned above, the one representing the best tradeoff between complexity of the approach and featured flexibility. As previously mentioned, d.loop is a basic template for executing user-supplied code, thus allowing many useful tools to be built with access to all the functionalities provided by the DiaNa core module. Also, this tool should be taken as a reference for the subset of common options, whose synopsis is presented in Figure 6.5.

d.loop Specific Options

There is only one option (specifically, -C) among those listed in Figure 6.5 that is specific to d.loop, which allows the Perl code to be executed on-the-fly to be specified (by default, the d.loop action is to translate the IFS into the OFS).

Output Control

DiaNa features a redirection mechanism which we will call explicit, to distinguish it from the default implicit shell redirection syntax. The explicit redirection currently offers up to ten simultaneous output channels (addressable through -0 out0 ... -9 out9), but future releases plan to raise the limit up to the operating system capabilities (addressable through any numerical-only extended option, starting from --10). However, channel -0 should be considered as reserved, since it is currently used for debugging (dumping a copy of the on-the-fly created Perl script that is going to be executed). Unless otherwise specified, the first two channels are directed to /dev/stdout and /dev/stderr respectively, whereas -n defaults to /dev/null for the remaining values of n. The colon character :, when used as the parameter of a redirection option, expands to the null device /dev/null (this can be useful to discard part of the output in some processing, such as the part not matching the condition specified to d.match). By default, existing output files are protected from being overwritten, unless explicitly specified otherwise (through -f). Custom per-channel piping is supported through the Perl open syntax (e.g., -3 '| command > file' will pipe the output to command before redirecting it to file), where clearly there is no restriction on the number of piped commands. Output compression can be requested through the appropriate option (specifically, -z to use gzip and -Z for bzip2); the core module will handle the specified custom output pipes, and the compressed output file extensions .gz, .bz2 will be added if necessary. Channels are directly available in an expression through the OUT() subroutine (e.g., OUT($n,"string") will print the string to the $n-th output channel, and OUT("string") to the first output channel) or by direct access to the corresponding FileHandle (e.g., $OUT[$n]->print("string")).
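The following stand-alone sketch illustrates the multi-channel idea: rows are split across two output files according to a made-up threshold on an invented column. Inside a d.tool, the same effect would be obtained with the documented OUT(1, ...) / OUT(2, ...) calls and the -1/-2 redirection options; everything else here (file names, column meaning, threshold) is an assumption of the example.

#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;

# Stand-alone illustration of explicit output redirection: rows go to one
# of two files depending on a hypothetical byte count in column 2.
open my $large, '>', 'large.txt' or die "large.txt: $!";
open my $small, '>', 'small.txt' or die "small.txt: $!";

while (my $line = <STDIN>) {
    my (undef, $bytes) = split ' ', $line;   # hypothetical second column
    next unless defined $bytes;
    ( $bytes > 1000 ? $large : $small )->print($line);
}
close $_ for $large, $small;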
Input Control

Input data can either be piped or specified as file names on the command line; in the latter case, the core transparently handles the most common compression/archiving formats (i.e., gzip, bzip2, tar and their combinations). The comment character defaults to # (but can be specified through -H), and comment lines, which are skipped by default, can optionally be processed (--P flag). The input field separator regular expression can be configured (through -I) to a value other than the default /\s+/ (i.e., by default multiple spaces are treated as a unique separator).

Format recognition enables token resolution, allowing the expanded alphanumerical labels to be used in expressions. Though the format can be specified in several ways, here we consider for brevity the most common behavior (i.e., the colon expansion through -F:). In this case the core will look for a format name matching the file extension (taking care of compressed extensions); if a valid format definition cannot be found, the core will try to load the format from the last header line of the files being processed; otherwise, alphanumerical label expansion will be turned off. Besides, the search for the format definition file can be affected in several ways, the simplest of which is either the use of the DIANADIR environment variable or the use of local format definitions (i.e., in the current working directory). Finally, the input subsampling, which accepts the comma-separated tokens listed in Table 6.2, can be tuned through the -S option; the colon character : expands to the environment variable DIANASAMPLE or to the hard-coded defaults.

Code Control

There are several points at which custom code can be "injected" into a DiaNa tool; for example, different kinds of Perl code can be loaded at startup through the -d switch, and a basic code check is also performed. Note that the same option can load both Perl modules and code fragments, such as subroutine definitions as well as stored variables (e.g., -d My::Module,myscript.pl,myv). Pre- and post-processing code, similar to Awk's (but not Perl's) BEGIN and END blocks, can be specified through -B and -E respectively; remarkably, the loaded code is then available at any point of the processing. Although this may be misleading, these special code blocks are unrelated to the homonymous Perl syntactical blocks: rather, "begin" and "end" are relative to the main loop over the data. As mentioned earlier, variables can easily be stored into external files for later use: the easiest way is to "dump" them to a specific channel in an end block (e.g., -E 'OUT($n,Dumper($var))', the Data::Dumper module being loaded by default) and to redirect that channel to a file (e.g., -n myvar.pl). Finally, the code to be executed at each step can be specified either on the command line ('expr') or through a keyword (i.e., -e expr). In the latter case, loop code can be stored in a file, and the keyword is subject to an expansion mechanism similar to the format resolution mentioned earlier. Expressions are critical items, so strict syntax checking is always performed; furthermore, it is possible to only compile the expressions (-c), printing additional information about the DiaNa token expansion.
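As a concrete illustration of the begin/loop/end structure just described, the sketch below is written as a self-contained Perl script; within DiaNa the three parts would instead be supplied through the -B, -e and -E switches, and the column would be addressed with a token rather than an explicit split. The column meaning is invented for the example.

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Stand-alone sketch of begin / loop / end processing.
my %bytes_per_host;                        # "-B": initialization code

while (my $line = <STDIN>) {               # "-e": code run on every row
    next if $line =~ /^#/;                 # skip comment lines
    my ($host, $bytes) = split ' ', $line; # hypothetical columns
    next unless defined $bytes;
    $bytes_per_host{$host} += $bytes;
}

print Dumper(\%bytes_per_host);            # "-E": dump results at the end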
6.2.6 The DiaNa GUI

The DiaNa framework also features a Tk user interface, visually organized as a notebook, as shown in Figure 6.6. Following the paradigm that motivated the development of DiaNa, the GUI has been designed to allow for the greatest possible flexibility, while offering at the same time the ease and immediateness of use typical of graphical interfaces. Therefore, besides the point-and-click approach (useful to manage the expression and format items), the first notebook page is an evolved Perl shell (featuring auto-completion, interactive history, . . . ). Following the same approach, the Gnuplot page is a gnuplot [105] front-end that couples a drag-and-drop approach with a template mechanism. Template authors need only define a couple of well-known functions and, by accessing DiaNa internal data, can provide advanced visualization features; besides, the gnuplot code is always available for direct editing, and multiple picture formats are handled transparently to the user (based on the output file name). A plugin mechanism offers another comfortable peering point to customize the GUI, providing reusable building blocks for (possibly graphic and interactive) functionalities that are not implemented in the factory interface. For lack of space, we refer the reader, for further details about the framework, its usage and its development, to the DiaNa Tutorial and man-pages, both included in the standard distribution; finally, a valid source of information is represented by the roughly 16k lines of source code itself, more than half of which are comments.

Token        Meaning
u, m, h, t   sampling discipline: uniform, Montecarlo (default), head, tail
p            use p percent of the file (default 10)
>min         use at least min lines of input
<max         use at most max lines of input

Table 6.2: Sampling Option Synopsis

Figure 6.6: The DiaNa Graphical User Interface

6.3 Performance Evaluation and Benchmarking

The choice of Perl as the base language of the DiaNa software allows for great flexibility at the expense of the achievable performance: in this section we quantitatively inspect this tradeoff, further gathering some relevant qualitative insight. Techniques for computer performance analysis are divided into three broad classes: analytic modeling, simulation modeling and benchmarking; the latter technique, which is the object of the current section, is performance measurement through direct observation of the system of interest. In this context, a benchmark is said to be representative of a user's workload if it accurately predicts the performance of that workload on a range of configurations. Clearly, a benchmark that measures general compute performance will not be very representative for a user whose workload is intensively floating-point, memory-bandwidth or I/O oriented. Due to the difficulty of defining general, significant workloads representative of a broad class of tasks, which evidently depend on the nature of the problem to be solved, we will try in the following to isolate the impact of each single component affecting the performance of the proposed software.

6.3.1 Related Works

We must stress that our focus is neither to present a complete picture of the suitability of Perl as a scientific tool, nor to assert a final word on Perl benchmarking, nor to contrast Perl with the performance achieved by other languages – since these topics have already been explored more thoroughly by a number of relevant studies. For example, [106] explored in detail how Perl can be used in processing data for statistics and statistical computing, either by using modules or by embedding specialized statistical applications, also investigating the numerical and statistical reliability of various Perl statistical modules.
The thesis of that chapter is that Perl's distinguishing features make it particularly well suited to perform both labor-intensive tasks (e.g., complex statistical computations involving matrix algebra and numerical optimization) and sophisticated ones (ranging from the preparation of data to the writing of statistical reports). The authors of [107] performed some basic experiments to see how fast various popular scripting languages (like Awk, Perl and Tcl) and user-interface languages (like Tcl/Tk, Java and Visual Basic) run on a spectrum of representative tasks. They observed enormous variation in performance, depending on many factors, some of which are uncontrollable and even unknowable. As they point out, the comparison turned out to be more challenging, with more difficulties and fewer clear-cut results, than expected. The results seem to indicate little hope of predicting performance in anything but the most general way. In the authors' own words: "if there is a single clear conclusion, it is that no benchmark result should ever be taken at face value". Finally, [108] took a rather different approach to carry out a thorough empirical comparison of four scripting languages (Perl, Python, Rexx, and Tcl) and three conventional system programming languages (C, C++, Java). In a very large-scale effort, about 80 realizations of the same requirement set, coded by different programmers, were compared with each other in terms of program length, programming effort, runtime efficiency, memory consumption and reliability. Interestingly, it was found that programs written in conventional languages were typically two to three times longer than scripts, and that consequently productivity, in terms of lines of code per hour, was halved with respect to the scripting languages. Programs written in a conventional language (excluding Java) consumed about half the memory of a typical script program and ran twice as fast; in terms of efficiency, Java was consistently outperformed by the scripting languages. Scripting languages were found to be, on the whole, more reliable than conventional languages.

6.3.2 The Benchmarking Setup

The DiaNa benchmarking setup, as summarized in Table 6.3, considered different hardware architectures (HW), as well as different operating systems (OS) and different flavors of the Perl interpreter (SW). The data reported in the table bear additional discussion; on the SW side, it should be noted that we indicate simply by Perl the most common perl interpreter distribution (available at [96] as well as in almost any GNU+Linux distribution), whereas we denote by ActivePerl the ActiveState [109] solution. Besides, it should be noted that although the two interpreters share both the revision and version numbers, they differ in the subversion number. However, let us recall, at the risk of being tedious, that our purpose is neither to provide an exhaustive quantitative picture of DiaNa performance, nor to rank hardware architectures, operating systems or Perl interpreters; rather, our purpose is to gather a meaningful qualitative insight into DiaNa performance valid under a range of different architectures.

SL   HW: UltraSPARC IIe @ 0.5 GHz + 1 GB RAM   OS: Solaris 5.8                    SW: Perl 5.8.0
LX   HW: Pentium IV @ 2.4 GHz + 1 GB RAM       OS: Linux-2.4                      SW: Perl 5.8.0 + ActivePerl 5.8.4
WN   HW: Pentium IV @ 2.4 GHz + 1 GB RAM       OS: Windows 9X + Cygwin 1.5.11     SW: Perl 5.8.0 + ActivePerl 5.8.4

Table 6.3: Benchmarking Setup Architectures
Therefore, the performance achieved by each program instance is to be intended on a relative rather than an absolute scale. This remark also justifies the significant difference between the SL and the LX or WN clock speeds. In the following, when the performance picture is similar across the different systems and unless otherwise stated, we will report the performance results referring to LX (i.e., Perl 5.8.0 on a Linux 2.4 kernel running on an Intel Pentium IV platform clocked at 2.4 GHz and equipped with 1 GB of RAM). On all of the above systems, the performance metrics were measured using the GNU time utility. Besides, a number of standard techniques have been used to reduce caching effects, such as discarding the first measurement and using long enough files, roughly twice the workstation main memory size.

6.3.3 Perl IO

Preliminarily, we report some results on a benchmark whose workload is purely input/output oriented. The setup is very simple: a very long file is transferred using different tools, and the transfer mode is either binary or textual, depending on the tool. Specifically, we performed a textual-mode file copy with Perl, Awk and cat, and a binary-mode copy with cp, dd and a C program. Although our interest is mainly driven by textual-mode performance, we nevertheless tested the achievable performance using different sizes of the input and output buffers of both dd and the specialized C program.

Figure 6.7 reports, with the exception of the dd output, the IO performance expressed as the CPU load required (on the x-axis) to achieve a given throughput (on the y-axis) on the Linux platform. In terms of throughput, cp achieves the best performance, although the least CPU load is generated by cat. The performance of the specialized C program can be tuned between these two behaviors by varying the IO buffer size: small buffers allow for high throughput at the expense of a higher CPU load (cp-like), while both the throughput and the CPU load decrease as the buffer size increases (cat-like). Interestingly, under the Linux platform, Perl falls into the cp-like throughput class, whereas the throughput achieved by Awk belongs to the lower cat class; however, the Perl throughput performance comes at the cost of a higher load, roughly twice as much CPU as required by cp: this entails that less computational power will be available to sustain such a level of throughput performance.

Figure 6.7: Input/Output Performance on Linux-2.4

Let us now investigate whether these remarks hold on different systems, specifically on the Linux and Solaris platforms, considering in Figure 6.8 the behavior of a subset of the tools (namely cat, cp, Awk and Perl), again evidencing the tradeoff between CPU load and achievable throughput. In order to allow a qualitative comparison across systems, each tool's throughput has been normalized over the average throughput achieved within each system; similarly, the normalized CPU load is to be intended as rescaled over the average load within each system.

Figure 6.8: Linux vs. Solaris Normalized Throughput Performance
A first observation is that, although the scaled cp performance is very similar on both systems, the cat figures show a totally orthogonal behavior. More precisely, under LX, cat requires less CPU but achieves a low throughput (lower than both cp and the average of the other tools); conversely, under SL, cat achieves the highest throughput among all tools, requiring even more CPU power than cp. A final interesting remark is that the performance of Awk and Perl appears far more similar under SL than under LX.

6.3.4 Perl Implicit Split Construct

The Perl interpreters offer a number of switches and flags to modify and finely tune their runtime behavior; while these options are designed with Perl programmers' Laziness in mind (i.e., to quickly write powerful inline scripts), as a matter of performance the use of either switch does not provide any significant improvement. Rather, the use of the implicit split (i.e., splitting the current line of text, loaded into the default scalar variable $_, on the basis of the IFS into several tokens stored in the default array @_) dramatically contributes to boosting performance, practically more than halving the execution times. Therefore the implicit construct, though deprecated since it clobbers subroutine arguments, is nevertheless used by DiaNa by reason of the huge performance benefit. To better highlight this phenomenon, we measured the execution times of two Perl inlines, respectively implementing an implicit (perl -ne "split") and an explicit (perl -ne "@_=split") split, on the LX and WN machines. We repeated the tests considering both ActivePerl and Perl and, as before, we normalized the results over the mean per-system execution time. Results are plotted in Figure 6.9: the first important observation is that the performance gap between the explicit and the implicit solution persists across systems and interpreters. Besides, looking at the WN machine it is evident that ActivePerl literally outperforms the Cygwin Perl interpreter: it is tempting to conjecture that ActivePerl is highly optimized for Windows platforms, although this statement would require further investigation.

Figure 6.9: Explicit split and Performance Loss
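The comparison above can also be reproduced in miniature with the standard Benchmark module, as sketched below on a synthetic record; the benchmark in the text used long input files and the quoted perl -ne one-liners instead, so the numbers obtained this way are only indicative. Note that the implicit form relies on a deprecated side effect (split writing its result to @_) which recent perls have dropped.

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Implicit vs. explicit split on a made-up 20-field input line.
$_ = join ' ', 1 .. 20;

cmpthese( -2, {                  # run each variant for about 2 CPU s
    implicit => 'split',         # as in: perl -ne "split"
    explicit => '@_ = split',    # as in: perl -ne "@_=split"
} );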
6.3.5 Perl Operators and Expressions

Perl's rich set of operators, data types and control constructs is not necessarily intuitive when it comes to speed and space optimization. As admitted by the Perl authors, many trade-offs were made during its design, and such decisions are hidden in the code internals. A number of practical advices are given in [99], the de facto Perl reference, which features an entire section devoted to programming efficiency – where it is stated that, in general, the shorter and simpler the code is, the faster it runs, but there are exceptions. Some of the advices follow common programming praxis: it is well known that accessing a scalar variable is experimentally faster than accessing an array element and that, similarly, arrays and scalars are in turn faster than a hash table. Conversely, a counterintuitive statement is that maximizing the length of any non-optional literal string in regular expressions may result in a performance gain; indeed, longer patterns often match faster than shorter patterns because the optimizer performs a Boyer-Moore search, which benefits from longer strings.

Since all the operations contained in a DiaNa expression are ultimately performed by the Perl interpreter, it can be useful to investigate its per-operator performance: indeed, such knowledge may suggest how to "rephrase" an expression, when possible, in a more convenient fashion. An interesting plot of the unary, binary and ternary Perl operators (for the LX architecture only) is provided in Figure 6.10, depicting the relative time-cost of each operator for both integer and floating point operands; for completeness, string operations are reported in the small inset plot of Figure 6.10. The cost is measured as the ratio of the time elapsed to perform the generic operation to the time elapsed to perform the sum + operation on the same data set. The top and bottom x-axes in Figure 6.10 refer, respectively, to the floating point and integer operations, whose operators are sorted in increasing cost order.

Figure 6.10: Normalized Cost for Floating Point, Integer and String Operations

First of all, it must be said that the absolute cost scale for the integer operations is 1.5 times smaller than that of the corresponding floating point operations (averaged over all the observations). Then, it should be noticed that the floating point cost scale is flattened with respect to its integer counterpart: the most expensive integer operation takes appreciably more time than the cheapest one, whereas the same ratio is much smaller in a floating point context. Besides, Perl allows real operands to be combined with integer-only operators, such as the integer remainder division %, since the interpreter performs a preliminary conversion of the operands – which partly explains the changes in the operator cost order between the integer and floating point cases. As a side note, although it could be tempting to use the measured operator costs to predict, on a single platform, the cost of an arbitrary expression, such an approach would remain of limited interest (indeed, the relatively small difference between the most and the least expensive operation easily allows a coarse but satisfying upper bound to be estimated). Finally, it should be noted that, in general, the contracted assignment form of the four arithmetic operations (i.e., +=, -=, *=, /=) is slightly faster than the corresponding operation itself (i.e., +, -, *, /): this suggests breaking long expressions into smaller chunks, especially if the intermediate values have to be reused at a later step. We also tested the effect of parenthesization on expression cost, finding that even with several levels of redundant and unnecessary parenthesization (i.e., using more than 20,000 parentheses, differently combined on a 100-element sum), the parenthesization overhead is negligible.
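A per-operator measurement of the kind plotted in Figure 6.10 can be approximated with the standard Benchmark module, as in the sketch below; the selection of operators, the data set and the iteration count are arbitrary choices of the example, not those of the original benchmark.

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Each operator is applied over the same floating point data set; the
# normalized cost is then the CPU time of each entry over that of the
# reference '+' entry.
my @x = map { rand() * 1e3 } 1 .. 1_000;
my $sink;

my $results = timethese( 2000, {
    'add' => sub { $sink = $_ + 1.5     for @x },   # reference operation
    'mul' => sub { $sink = $_ * 1.5     for @x },
    'div' => sub { $sink = $_ / 1.5     for @x },
    'pow' => sub { $sink = $_ ** 1.5    for @x },
    'cmp' => sub { $sink = ($_ <=> 1.5) for @x },
} );

my $ref = $results->{add}->cpu_p;
printf "%-4s %.2f\n", $_, $results->{$_}->cpu_p / $ref for sort keys %$results;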
6.3.6 DiaNa Startup Overhead

We now shift the focus from Perl to DiaNa, which introduces different kinds of overhead with respect to its parent language: module loading, file startup, progress bar startup, sampling, expression parsing and the memory functionality, to mention a few. However, since we expect DiaNa to work on large amounts of data, we are allowed to consider the startup overhead associated with the shell scripts small enough to be neglected. Nevertheless, it is interesting to notice that the startup time, although depending on the desired features, always remains bounded by a small constant, which makes the tools suitable also for interactive data processing. Indeed, much of the startup overhead is clearly negligible per se: e.g., the computation required to set up the correct parameters for deterministic, uniform or Monte Carlo sampling is offset by the dramatic reduction of the file size, and the parsing of the DiaNa expression into Perl syntax occurs once at startup, so that no matter how complex the expression is, its complexity only affects the evaluation performed by the Perl interpreter.

Let us now analyze the estimate of the number of lines of a file, whose exact computation can be a very time-consuming task, especially since we expect to process large files; therefore, we decided to estimate the number of lines in a very simple manner (a sketch of the procedure is given at the end of this section). First we gather the actual, i.e., non-compressed, file size S expressed in bytes; then we compute the average line length L, also expressed in bytes, considering only a small subset of the file – by default, the first hundred lines. The underlying assumption is that, although the values measured in the files can change significantly, the number of bytes needed to store them will be roughly the same (e.g., both 1 and 10^9 may be stored in a text file as 1e+00 and 1e+09, respectively). Finally, we consider the file to be S/L lines long, which yields a very quick estimate with a satisfying degree of approximation. For instance, the estimation of the number of lines of the whole data set, requiring one access for each of the 12 files, took on average 250 milliseconds for the non-compressed 1 GB set and 430 ms for the gzip-compressed 67 MB set.

Figure 6.11: Modules Loading Performance

More interesting are the results concerning the module loading time, which measure the time required by perl to load external modules through the use pragma. We varied the number of loaded modules from a single module to more than 200 modules included in the standard distribution, also varying the loading order of the modules. The benchmarks reported in Figure 6.11 consider the cases where the modules are loaded, based on their length in Lines of Code (LoC), in ascending or descending order. The points in the figure represent, on the left y-axis, the load time expressed in seconds as a function of the number of loaded modules on the x-axis; the two lines on the right y-axis represent the cumulative distribution function of the LoC loaded, again as a function of the number of loaded modules. It can be gathered that the absolute time keeps very small, i.e., about 1.5 seconds, even in the insane case where all of the system's modules are loaded. Considering a more reasonable and practical number of modules, the time required to load the 5 biggest modules is on average equal to 150 ms, whereas the time required to load the 5 smallest modules is about three times smaller; an interesting observation is that this difference vanishes as more of the biggest or smallest modules are considered, the loading time being roughly 300 ms in both cases.
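The quick line-count estimate described earlier in this section (file size divided by the average length of a small head sample) can be sketched as follows; DiaNa's actual implementation also copes with compressed files and header lines, which this stand-alone version ignores.

#!/usr/bin/perl
use strict;
use warnings;

sub estimate_lines {
    my ($path, $sample) = @_;
    $sample ||= 100;                       # default: first hundred lines

    my $size = -s $path or return 0;       # non-compressed size S in bytes

    open my $fh, '<', $path or die "$path: $!";
    my ($lines, $bytes) = (0, 0);
    while ($lines < $sample and defined(my $row = <$fh>)) {
        $lines++;
        $bytes += length $row;
    }
    close $fh;
    return 0 unless $bytes;

    my $avg = $bytes / $lines;             # average line length L
    return int( $size / $avg + 0.5 );      # estimated number of lines S/L
}

print estimate_lines($ARGV[0]), "\n" if @ARGV;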
6.3.7 Fields Number and Memory Depth

To conclude the benchmarking analysis, we investigated the effect of the number of fields and of the memory depth; in order to avoid negatively biasing the test toward expressions with longer output, we simply evaluated the expression and then discarded the resulting value, so that the amount of I/O workload does not depend on the expression under examination. The considered expressions consist of a mere sum of different fields, all belonging to the same "past" point: in other words, each expression evaluates a sum of the form Σ_i #M:i, where both the number of summed fields and the memory depth M vary between 1 and 50. Recalling Figure 6.2, the number of considered fields defines the horizontal extension of the used memory, whereas the vertical dimension individuates its extent (or depth) toward the past. When the vertical dimension is not null, DiaNa implicitly sets up a loop, exploiting the buffering feature, to update the fields. Furthermore, to simplify the updating process, DiaNa overdimensions the buffers to the maximum of the memory depth over all expression tokens, thus avoiding keeping and accessing per-field state. In other words, the buffer maintenance requires M memory cells to be updated for each of the F memory-active fields at each advance in the input file; therefore, the per-step updates amount to a total of M · F.

Figure 6.12: Time Cost of Fields Number vs. Memory Depth

The cost of the expression, expressed in this benchmark as the seconds elapsed to process 100 MB on LX, is shown in Figure 6.12 as a function of both the number of fields (on the x-axis) and the memory depth (on the y-axis). The hyperbolic shape of the contour lines, shown for execution times multiple of 50 seconds, confirms our expectations; indeed, the nesting of the memory and field loops entails that the buffer update cost is proportional to the M · F product.

6.4 Practical Examples

The presented tool has been developed for the analysis of network traces: in the following we explore how the DiaNa framework has been used in the research activities of our group. Although the networking context is common to all the considered projects, each of them follows a rather different path. Referring to the results discussed in the previous chapters, we briefly recall each problem and highlight the specific DiaNa features used for its solution.

The Web and User Patience

The DiaNa tools development started for [52], which studies web users' behavior when decreasing network performance causes page transfer times to grow. Based on two months of real traffic measurements at the flow level, we analyzed whether worsening network conditions translate into greater user impatience, with the intent of refining the picture of the complex interactions between user perception of the Web and network-level events.
The chapter introduced a TCP-level heuristic criterion able to discriminate between completed and user-interrupted HTTP connections, involving about a dozen TCP connection parameters. Let us consider a simpler property (called eligibility in [52], which is a necessary condition for any interrupted flow), that can be stated through the following expression, valid for Tstat files: (#47==80) && (#4 && !#59) && #53. The definition of a Tstat format allowed us to rephrase the former expression into the following, more readable, equivalent: (#S.port==80) && #S.data && #C.rst && !(#S.fin || #S.rst), where the S and C prefixes stand for server and client. This is clearly convenient in terms of readability and maintainability, especially with more complex expressions. Besides, inferring the flow-level heuristic, which involved the processing of two months of real traces, would not have been possible without the core assumptions of the DiaNa framework.

The Zoo of Elephants and Mice

In [56], we analyze the statistical properties of the TCP flow arrival process, and its Long Range Dependence (LRD) characteristics in particular, showing as possible causes of the LRD of the TCP flow arrival process i) the heavy-tailed distribution of the number of flows in a traffic aggregate, and ii) the presence of TCP-elephants within them. Methodologically, we aggregated several TRs within the same TA in such a way that each TA carries, byte-wise, the same amount of traffic. To induce a division of TCP-elephants and TCP-mice into different traffic aggregates, the algorithm packs the largest TRs into the first TA, so that subsequently generated aggregates are constituted by an increasing number of smaller traffic relations. To perform such an aggregation, we computed once the amount of bytes exchanged per traffic relation during the observed trace (i.e., $TR{"#S.addr/#C.addr"} += #S.data), post-processing the hash by sorting its keys byte-wise (i.e., @TR = sort { $TR{$a} <=> $TR{$b} } keys %TR;). A greedy algorithm was then used to consecutively aggregate the sorted TRs into TAs, each accounting for 1/N of the total exchanged bytes, for different values of N. Then, we applied the resulting mapping to the original trace, thus defining the traffic aggregates; the tremendously useful features, in this case, are Perl's ability to easily define mappings through hash tables, coupled with its powerful sorting semantics. Finally, to study the TCP arrival process within the TAs, and specifically its LRD properties, we used the wavelet-based AV [67] estimator, which has many advantages, as it is fast, robust in the presence of non-stationarities and can be used to generate on-line estimates; in this case, the natural integration between Perl and the shell allowed the parallel execution of already existing (and fast) C programs through a simple piping mechanism.
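A stand-alone sketch of the accumulate / sort / greedy-pack steps just described is given below; inside DiaNa the accumulation would use tokens such as #S.addr and #S.data, while here the input format, the value of N and the variable names are invented for the example.

#!/usr/bin/perl
use strict;
use warnings;

my $N = 4;                                     # number of aggregates
my %TR;                                        # "server/client" -> bytes
while (<STDIN>) {                              # e.g. rows "srv clt bytes"
    my ($srv, $clt, $bytes) = split;
    next unless defined $bytes;
    $TR{"$srv/$clt"} += $bytes;
}

my $total = 0;
$total += $_ for values %TR;

# Largest traffic relations first, so that the first aggregate collects
# the elephants and later ones an increasing number of mice.
my @sorted = sort { $TR{$b} <=> $TR{$a} } keys %TR;

my %TA_of;
my ($ta, $acc) = (0, 0);
for my $tr (@sorted) {
    $TA_of{$tr} = $ta;
    $acc += $TR{$tr};
    if ($acc >= $total / $N and $ta < $N - 1) {   # move to next aggregate
        $ta++;
        $acc = 0;
    }
}
printf "%-25s TA%u %12u bytes\n", $_, $TA_of{$_}, $TR{$_} for @sorted;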
Synthesizing Realistic Traffic

In [84] we present a methodology to generate several synthetic traffic traces from a single real trace of packets captured with tcpdump [10]. The synthetic traffic has then been used to assess the performance of scheduling algorithms in high-performance switches. Our results show that realistic traffic degrades the performance of the switch by more than one order of magnitude with respect to traditional traffic models, suggesting that the latter may not be able to show real-world pictures.

Formally, the synthesis problem in [84] can be stated as an optimization problem of job scheduling over parallel machines. However, due to the size of the problem, and since a solution to the underlying scheduling problem, which is well known to be NP-hard, cannot be found in polynomial time, we opted for a greedy approximate solution; nevertheless, future work plans to optimally solve, using the Perl Data Language, a modified version of the problem. Methodologically, the first step of the synthesis problem consists in gathering the per-TR amount of exchanged bytes, similarly to the previous section but, unlike before, considering a packet-level trace. The gathered information is then used to define a near-optimal mapping between logical switch ports and real IP addresses, where, again, the mapping functions are expressed through Perl hashes. The second step of the synthesis problem consists in applying the mapping to the original trace, in order to obtain the synthetic traffic and to feed an external switch simulator engine.

6.5 Conclusions

This chapter presented DiaNa, a versatile and flexible Perl framework, designed especially to process huge amounts of data; however, as suggested by the benchmarking results, it can be profitably used for interactive analysis as well. After overviewing its architecture and syntax, we devoted special attention to the simplest possible way of getting the most out of it, describing its usage and interaction at the shell level. Finally, we gave some intuition of possible uses of the tool, describing the context that brought us to its development as well as giving some practical examples.

Chapter 7

Final Thoughts

This thesis covered several aspects related to the network measurement field: this final chapter discusses some interesting future directions and possible extensions of the research presented so far.

7.1 Measurement Tool

To summarize, in Chapter 1 we provided a general introduction to the network measurement field, describing how real traffic can actually be gathered, as well as Politecnico's network. In Chapter 2, we presented Tstat, a tool for Internet traffic data collection and statistical elaboration. Exploiting its capabilities, we presented and discussed some statistical analyses performed on data collected at the Politecnico ingress router. The results presented offer a deep insight into the behavior of both the IP and the TCP protocols, highlighting several characteristics of the traffic that had hardly been observed through passive measurement before – but were rather i) generated by injecting ad-hoc flows in the Internet or ii) observed in simulations. Having access to relevant and up-to-date measurement data is a key issue for network analysis, in order to allow for efficient Internet performance monitoring, evaluation and management. New applications keep appearing; user and protocol behavior keep evolving; traffic mixes and characteristics are continuously changing, which implies that traffic traces may have a short span of relevance and new traces have to be collected quite regularly. There are several ongoing works dealing with Tstat: indeed, based on the experience gained in IP network traffic measurement, and especially with TCP traffic, we are currently extending Tstat capabilities in several directions, which we describe here shortly.
7.1.1 Cope with Traffic Shifts

As mentioned earlier, the pace at which technology evolves continuously enables different services and applications: as a consequence, the traffic streams flowing in current networks are very different from the traffic patterns of the very recent past. Indeed, while a few years ago the Internet was synonymous with Web browsing, the pervasive diffusion of wide-band access has entailed a shift of the application spectrum – of which peer-2-peer file sharing and audio/video streaming are the most representative examples. Although TCP traffic still represents the majority of Internet traffic, the volume of UDP and RTP traffic is increasing. A significant effort should be devoted to enabling the monitoring of traffic types other than those currently supported. We have already extended Tstat to support simple UDP and RTP accounting, although much remains to be done in order to use Tstat as profitably as with TCP traffic. Finally, the traffic shift is still in progress: for example, no media streaming solution has definitively taken over the others, and as a result HTTP-based streaming is still very popular beside RTP-based applications. Also, the popularity of peer-2-peer applications changes very frequently; furthermore, since the underlying protocols keep evolving, the TCP ports carrying peer-2-peer control traffic change as well – which complicates, e.g., even the basic task of tracking the control traffic volume of peer-2-peer applications¹.

1 I have worked on P2P when I was a Visiting Researcher at the Computer Science Division, University of California, Berkeley; I worked with Prof. Ion Stoica on an enhancement [110] of the Chord [111] Distributed Hash Table.

7.1.2 Cope with Bandwidth Increase

In order to be useful, network traffic measures should evidently be continuous and persistent: these joint requirements allow changes in the traffic pattern to be tracked, as well as a statistically significant traffic "population" to be gathered. In this light, another issue arising from the widespread and massive increase of access link capacity is that the quantity of data to be monitored increases as well; although this may seem obvious, it generates two potentially very critical issues:

- computational complexity, due to the continuous monitoring requirement
- scalability and storage problems, due to the data persistence requirement

While at present the computational complexity does not constitute a problem, the scalability of the measurement and storage capacity begins to be troublesome. We have adopted a simple solution to allow for a complete and "flavored" analysis of the network traffic that satisfies both the persistence and the continuity requirements. The solution involves the use of a sophisticated and efficient Round-Robin Database (RRD) [112] structure: in this way, scalability is achieved through the use of different levels of temporal granularity. In very simple terms, the adopted methodology allows the maximum amount of state to be maintained to be determined a priori, tuned as a function of the available storage. At the same time, the detail of the information depends on the resolution of the temporal scale under examination: in other words, the higher the detail level (i.e., the finer the temporal resolution), the smaller the statistical population sample size (i.e., the shorter the temporal observation window). Such a system would allow, for example:

- to know exactly the characteristics of the last traffic hour, keeping packet-level granularity
- to know the traffic gathered in the last day, week, month and year using decreasing levels of detail, as shown for the IP packet length statistics in Figure 7.1

Finally, Figure 7.2 shows an example of the Web interface provided to browse the statistical properties of the different collected traces in the persistent monitoring setup.

Figure 7.1: TstatRRD: Different Temporal Resolution for IP Packet Length Statistics

Figure 7.2: TstatRRD Web Interface: Round Trip Time Example
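The multi-resolution idea can be sketched with the round-robin database through its Perl bindings, as below; the data source, the step and the archive sizes are invented for the example and are not those used in the actual TstatRRD setup.

#!/usr/bin/perl
use strict;
use warnings;
use RRDs;    # Perl bindings shipped with rrdtool

my $rrd = 'packets.rrd';

# One round-robin archive per temporal resolution: fine-grained samples
# are kept for a short window, coarser averages for progressively longer
# windows, so the total state is bounded a priori.
RRDs::create(
    $rrd,
    '--step', 60,                      # one primary sample per minute
    'DS:pkts:GAUGE:120:0:U',           # hypothetical packets/s data source
    'RRA:AVERAGE:0.5:1:1440',          # 1-minute detail kept for a day
    'RRA:AVERAGE:0.5:60:720',          # hourly averages kept for a month
    'RRA:AVERAGE:0.5:1440:365',        # daily averages kept for a year
);
die 'RRD create failed: ' . RRDs::error if RRDs::error;

RRDs::update($rrd, 'N:1250');          # feed the current measurement
die 'RRD update failed: ' . RRDs::error if RRDs::error;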
7.1.3 Distributed Measurement Points

In order to give a holistic view of what is going on in the network, passive measurements have to be carried out at different places simultaneously: on this basis, we are currently integrating Tstat into a passive measurement infrastructure, consisting of coordinated measurement points arranged in measurement areas. Specifically, the infrastructure identified is the Distributed Passive Measurement Infrastructure (DPMI) [113] developed at the Blekinge Institute of Technology (BTH). The DPMI structure allows for an efficient use of passive monitoring equipment in order to supply researchers and network managers with up-to-date and relevant data. The infrastructure is generic with regard to the capturing equipment, ranging from simple PCAP-based devices to high-end DAG cards and dedicated ASICs, in order to promote a large-scale deployment of measurement points. The key requirements of DPMI are:

- Cost: access to measurement equipment should be shared among users, primarily for two reasons: first, as measurements get longer (for instance for detecting LRD behavior) a single measurement can tie up a resource for many days; second, high-quality measurement equipment is expensive and should hence have a high rate of utilization.

- Ease of use: the setup and control of measurements should be easy from the user's point of view; as the complexity of measurements grows, we should hide this complexity from the users as far as possible.

- Modularity: the system should be modular, to allow independent development of separate modules and to separately handle security, privacy and scalability (w.r.t. different link speeds as well as locations); since one cannot predict all possible uses of the system, it should be flexible enough to support different measurements as well as different measurement equipment.

The DPMI architecture consists of three main components, Measurement Point (MP), Consumer and Measurement Area (MAr), and is depicted in Figure 7.1.3a. The task of the MP is to deal with packet capturing, packet filtering and measurement data distribution, while the Consumer is fed with the MP data streams. A complex policy of filtering and optimization can be configured through the MArC which, hiding the complexity from the consumers, basically handles the communications between MPs and Consumers. The actual distribution from MP to Consumers is delayed for efficiency reasons: captured data is stored in a buffer pending transmission, and the frame distribution and data duplication are handled by the MArN.
Under this perspective, Tstat is simply a Consumer of the DPMI architecture, and is fed a stream of encapsulated DPMI packets, whose header is shown in Figure 7.1.3b. The header stores a Capture Identifier (CI), a Measurement Point Identifier (MPid), a capture timestamp (supporting an accuracy of picoseconds), the packet length and the number of bytes that were actually captured, i.e., the length of the packet fraction that is actually being distributed (typically, the header only). Therefore, the capture header enables us to pinpoint exactly by which MP and on what link the frame was captured, which is vital information when trying to obtain spatial information about the network behavior.

Distributed Passive Measurement Infrastructure: The Architecture (a) and the Packet Header (b)

7.2 User Patience and the World Wide Wait

Then, in Chapter 3, we inspected a phenomenon intrinsically rooted in the current use of the Internet, caused by user impatience at waiting too long for web downloads to complete. We defined a methodology to infer the interruption of TCP flows, and presented an extended set of results gathered from real traffic analysis. Considering several parameters, we showed that the interruption probability is affected mainly by the user-perceived throughput, as expected. Nevertheless, we found that, contrary to what one might think, users do not tend to wait longer when transferring long files, as the unexpected increase of the interruption probability with the file size testifies. Moreover, results showed that the user intervention is mostly limited to the very beginning (around 20 seconds) of the life of the elephant flows.

7.2.1 Practical Applications

A possible practical application of the presented work would be to use the presented interruption metric to measure user satisfaction with Web performance: this objective and quantifiable metric could be used, e.g., by system administrators to test the validity of network link upgrades or the effectiveness of Web caching. Another possible use would be to include a model of the early interruption of connections in traffic generators for simulation purposes, such as WETMO, a WEb Traffic MOdule for the Network Simulator ns-2 available at [114]. WETMO was developed during my Master Thesis, A Simulation Study of Web Traffic over DiffServ Networks, available at [115], which has been published in the International Archives and was developed in collaboration with NEC Europe Ltd. – Network Laboratories, Heidelberg (Germany). Besides, it is worth pointing out that WETMO has been used to gather the results published in [116]. Let me briefly describe the tool: WETMO is a complete environment for the performance evaluation of short-lived traffic over DiffServ networks, featuring primarily a sophisticated, light and efficient Web traffic generator, which can also automatically handle long-lived TCP or UDP background traffic – and whose Web traffic could incorporate the notion of early interruption. WETMO is designed to relieve the user of a wide series of tasks, from traffic generation to statistics collection, allowing him to concentrate on the description of the scenario (network parameters, topology and QoS settings).
Worthy of mention is the fact that WETMO implements a new flow-level monitoring method, allowing both per-flow and aggregated statistics of TCP connections to be collected, which has proved to be very lightweight in terms of both computational power and storage requirements.

7.3 TCP Aggregates Analysis

Chapter 4 studied the TCP flow arrival process, starting from aggregated measurements taken on our campus network; specifically, we performed a live collection of data directly at the TCP flow level, therefore neglecting the underlying IP-packet level. Traces were both obtained and post-processed through software tools developed by our research group, publicly available to the community as open source. We focused our attention beyond the TCP level, defining two layered high-level traffic entities. At a first level beyond TCP, we identified traffic relations, which are constituted by TCP flows with the same path properties. At the highest level, we considered traffic relation aggregates having homogeneous byte-wise weight; most importantly, this approach enabled us to divide the whole traffic into aggregates mainly made of either TCP-elephants or TCP-mice. This permitted us to gain some interesting insights into the TCP flow arrival process. First, we observed, as already known, that long range dependence at the TCP level can be caused by the fact that the number of flows within a traffic aggregate is heavy-tailed. In addition, the traffic aggregate properties allowed us to see that TCP-elephant aggregates behave like ON-OFF sources characterized by a heavy-tailed activity period. Besides, we were able to observe that the LRD at the TCP level vanishes for TCP-mice aggregates: this strongly suggests that the ON-OFF behavior as well is responsible for the LRD at the TCP level.

7.3.1 Future Aggregations

The simple methodology described in Chapter 4 allowed comparable traffic aggregates to be created, i.e., traffic aggregates having the same volume (or weight) but internally constituted by an extremely different number of flows, thus having a very different density (or weight density). The aggregation process allowed, in other words, the so-called Elephants to be naturally partitioned from the Mice, as the heavy and light flows are commonly addressed in the literature. However, several other aggregation criteria could be explored as well, in order to gain a deeper knowledge of the traffic patterns currently populating the Internet.

Notation

The past analysis considered a growing number of homogeneous partitions of the original trace; recalling the notation adopted in Section 4.3.1, we have:

- TCP level: $f_i(s,c)$ is the flow size, expressed in bytes, of the $i$-th TCP flow exchanged between server $s$ and client $c$ (considering only the data originated from the server and directed to the client).

- Traffic Relation (TR) level: $TR(s,c)$ represents all the TCP flows exchanged between $s$ and $c$ during the whole trace, having size $|TR(s,c)| = \sum_i f_i(s,c)$.

- Traffic Aggregate (TA) level: $TA_k$ is the $k$-th traffic aggregate, having size $|TA_k| = \sum_{TR(s,c) \in TA_k} |TR(s,c)|$.

- Trace level: $T$ represents all the observed flows, and thus the whole trace, having size $|T| = \sum_{s,c} |TR(s,c)|$.
Aggregation Criteria

A possible extension of the past work could consider a complementary approach: rather than partitioning the whole trace into different TAs comprising a variable number of TRs, one could aggregate a constant number of traffic relations, adopting several different aggregation criteria; clearly, this entails that the TAs would no longer be dimensionally homogeneous. The problem formulation is thereby rather simplified: removing the partitioning constraint relieves the researcher from the burden of solving an optimization problem.

Let us indicate the aforementioned criterion with C, and define the criterion cardinality, indicated with n_C, as the number of traffic relations aggregated by C; let us further define A_C as the traffic aggregate generated by C; finally, whenever it is necessary to explicitly indicate at the same time both the cardinality and the criterion, the notation C(n) will be used, where evidently n = n_C. As previously stated, the set of aggregates does not constitute a partition of the whole trace unless the complementary aggregate is also considered, defined as \bar{A}_C = W \setminus A_C. The aggregation of a number n_C of TRs under C can be performed in several ways; for example:

- Largest C_L: select the n_C biggest TRs
- Smallest C_S: select the n_C smallest TRs
- Random C_R: randomly select n_C TRs
- Nearest C_N: select the n_C nearest traffic relations, where the distance function in the IP space could be tuned on the basis of a given reference address (AND-bitmask)
- Farthest C_F: select the n_C farthest traffic relations, where the distance function in the IP space could be tuned on the basis of a given reference address (XOR-bitmask)

(A code sketch of these criteria is reported at the end of this discussion.) Without entering into further details, let us briefly state that C_R represents an extremely simple though significant subsample of the trace, which can be implemented in several ways (e.g., uniformly or Monte-Carlo, on the number of flows within TRs, on the IP-space distance, or on the flow size, . . . ). In practice, one could consider a number of aggregates of increasing cardinality, e.g., up to half of the total number N of TRs (i.e., adopting base 2 or 10, we would have n_C = 2^i with i = 1, . . . , log_2(N/2), or n_C = 10^i with i = 1, . . . , log_10(N/2)). Intuitively, when the number of TRs increases, the observed properties of the traffic aggregate should become, at the same time, i) statistically more significant and ii) decreasingly evident. This is better explained through an example: consider the normalized TR weight within a TA (i.e., the ratio of the TA size over the whole trace size, further divided by the number of TRs within that TA) induced by C_L and C_S: initially, C_L(1) selects the single biggest traffic relation (having normalized weight max b(T_{s,c})/b(W)), while C_S(1) selects the smallest one (having normalized weight min b(T_{s,c})/b(W)). If we now extend the TAs by considering the second biggest (smallest) traffic relation, the normalized weight of C_L(2) (C_S(2)) will clearly decrease (increase); when finally all the TRs are considered, i.e., for C_S(N) and C_L(N), both TAs comprise the same TRs and can no longer be distinguished.
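The sketch below illustrates the five selection criteria on a dictionary of TR sizes keyed by (server, client) IP pairs. The XOR-based distance used for the nearest/farthest criteria is one plausible choice among those hinted at above, and the reference address (here the Politecnico di Torino network prefix) and function names are illustrative assumptions.

    import random
    from ipaddress import IPv4Address

    def ip_distance(a, b):
        """XOR distance between two dotted-quad IPv4 addresses."""
        return int(IPv4Address(a)) ^ int(IPv4Address(b))

    def select_trs(tr_sizes, n, criterion, reference_ip="130.192.0.0"):
        """Return the n TR keys chosen by the given aggregation criterion.

        tr_sizes: dict mapping (server_ip, client_ip) -> TR size in bytes.
        criterion: one of 'largest', 'smallest', 'random', 'nearest', 'farthest'.
        """
        keys = list(tr_sizes)
        if criterion == "largest":
            keys.sort(key=lambda k: tr_sizes[k], reverse=True)
        elif criterion == "smallest":
            keys.sort(key=lambda k: tr_sizes[k])
        elif criterion == "random":
            random.shuffle(keys)
        elif criterion == "nearest":
            keys.sort(key=lambda k: ip_distance(k[0], reference_ip))
        elif criterion == "farthest":
            keys.sort(key=lambda k: ip_distance(k[0], reference_ip), reverse=True)
        else:
            raise ValueError("unknown criterion: %s" % criterion)
        return keys[:n]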
" (1) individuate the smallest one (normalized e weight Õ$^ 1 4 å - ). If we now extend the TAs considering the second biggest (smallest) traffic relation the normalized weight .;4 (2) (. " (2)) will clearly decrease (increase): when finally all the TRs are considerer, thus considering . " (N) and .14 (N), both the TAs are comprised of the same TRs and cannot be distinguished. . å Î , A simple approach could be to directly contrast the characteristics of . and .;0 when î3q in other words when the two aggregates split the whole trace in a two-set partition. For example, one could consider the heaviest half .;4 (N/2) and the complementary one .10=< (N/2), which is therefore constituted by the lightest half of the trace (notice furthermore that with .:0?> 1 å Î4 . " 1 å Î4 , since .14 is the complementary criterion with respect to . " ). Recalling the results shown in Section 4.4, this would prove a successful method to uncover remarkably different properties of the aggregates. The following is a brief list of the metrics whose distribution, first and second order statistics, Hurst parameter could be studied. A first group of metric is irrespective from the aggregation criteria, in the sense that they are global, or related to the whole trace, as for example: CHAPTER 7. FINAL THOUGHTS 118 - number of - TRs per trace - TCP flows per TRs - IP packets per TR - IP packets per TCP flow - size of: - TRs - TCP flows - interarrival time: - of IP packets within TCP flows - of TCP flows within TRs These metrics could be easily extended to the local case, or in other words be studies along with the specific aggregation criteria; for example, it could be interesting to study, for different criteria and cardinality, the behavior of, e.g.: - number of TRs, TCP flows and IP packets per TA - interarrival time properties of TRs, TCP flows and IP packets within TAs - TA size 7.4 Switching Performance Evaluation In Chapter 5 we proposed a novel and flexible methodology to synthesize realistic traffic traces to evaluate the performance of switches and, in general, of controlled queuing networks. Packets are generated from a single packet trace from which different synthetic traffic traces are obtained fulfilling a desired scenario, e.g., a traffic matrix. Additional constraints are imposed to maintain the original traffic characteristics, mimicking the behavior imposed by Internet routing. We compared the performance of a switch adopting different queuing and scheduling strategies, under two scenarios: the synthetic traffic of our methodology and traditional traffic models. We observed that not only absolute values of throughput and delays can change considerably from one scenario to the other, but also their relative behaviors. This fact highlights the importance of some design aspects (e.g., the buffer management) which are traditionally treated separately. These results show new behavioral aspects of the queuing and scheduling in switches, which will likely requires more insight in the future. The work presented in Chapter 5 could be extended in two ways, which I will briefly describe here. The first path would require to solve a modified version of the optimization problem, considering additional constraints related to the network topology; a second path that could be possibly 7.4. SWITCHING PERFORMANCE EVALUATION 119 followed in order to fill the gap between realism of the traffic source and the current traffic models could be to use responsive traffic sources. 
7.4 Switching Performance Evaluation

In Chapter 5 we proposed a novel and flexible methodology to synthesize realistic traffic traces for evaluating the performance of switches and, in general, of controlled queueing networks. Packets are generated from a single packet trace, from which different synthetic traffic traces are obtained so as to fulfill a desired scenario, e.g., a given traffic matrix. Additional constraints are imposed to maintain the original traffic characteristics, mimicking the behavior imposed by Internet routing. We compared the performance of a switch adopting different queueing and scheduling strategies under two scenarios: the synthetic traffic produced by our methodology and traditional traffic models. We observed that not only the absolute values of throughput and delay can change considerably from one scenario to the other, but also their relative behaviors. This fact highlights the importance of some design aspects (e.g., buffer management) which are traditionally treated separately. These results uncover new behavioral aspects of queueing and scheduling in switches, which will likely require more insight in the future.

The work presented in Chapter 5 could be extended in two ways, which I will briefly describe here. The first path would require solving a modified version of the optimization problem, considering additional constraints related to the network topology; a second path, which could be followed in order to fill the gap between the realism of the traffic source and the current traffic models, would be to use responsive traffic sources. Both approaches could then be contrasted and compared with the results achievable under traditional traffic models, much as was done in Section 5.3.4.

7.4.1 Modified Optimization Problem

One of the main limitations of the current model is that, although it preserves the identity of the single packet streams at the IP level, the process still allows the aggregation of different TCP flows that, possibly traveling completely disjoint paths in the network, experience rather different network conditions. Similarly, but in a mirrored fashion, flows that travel along the same path may be assigned to different input-output aggregates.

The Problem

[Figure 7.3: Flow "Delocalization" as a Consequence of the Aggregation Process (external servers s1, s2, s3 reaching an internal Politecnico client d through ingress ports i1, i2)]

The diagram sketched in Figure 7.3 may help illustrate the problem. Assume that a user, incidentally named Dario(2), browses the Web from the Politecnico di Torino with a PC having IP address d. In order to correctly visualize the current page, the browser will send several HTTP requests in parallel: since P-HTTP/1.0 and HTTP/1.1, one for each object in the current page(3). Moreover, due to the increasing diffusion of load balancing techniques, it is possible that some of these requests will be redirected to two different servers belonging to the same ISP, having IP addresses s1 and s2. Similarly, another work session may generate traffic from s3 directed to d. Currently, the aggregation process can lead to the situation sketched in the figure, where the paths s1 -> d and s2 -> d enter the Politecnico ingress router through two different ports i1 and i2: this amounts to logically disjoining two physically overlapping paths. The complementary effect, described earlier and not shown in the picture, corresponds to the logical regrouping (through the ingress port i1) of physically disjoint paths (s1 -> d and s3 -> d). Therefore, the packet interarrival process within each input-output aggregate may be partly "artificial", in the sense that the real correlation among different traffic sources is potentially scrambled during the aggregation process.

(2) Any resemblance or reference to actual persons, living or dead, is unintentional and purely coincidental.
(3) Any reference to cookies in the website and any resemblance to real cookies or biscuits is purely coincidental.

The Solution

A possible solution to this problem would consist in devising a simple criterion to logically regroup the servers (e.g., s1 and s2); moreover, this should be done a priori, i.e., prior to the application of the greedy aggregation algorithm described in Section 5.2.3. To ease the notation and for the sake of clarity, in the following we will refer to this a priori process, which involves only the IP sources s_i, as regrouping, whereas the final process will be indicated, as usual, as aggregation.

[Figure 7.4: Flow Regrouping through Logical Masks (servers s1 . . . s5 with addresses 10.10.10.1, 10.10.10.2, 10.10.20.3, 10.20.20.4, 20.20.20.5 grouped under the masks FF.FF.FF.0, FF.FF.00.0 and FF.00.00.0)]

Let us focus first on the regrouping process only; we will analyze its impact on the aggregation process at a later step. Figure 7.4 reports an extreme example of the logical regrouping process through logical masks.
Since IP addresses can be represented on 32 bits, we indicate with logical mask a sequence of m '1' symbols followed by 32 - m '0' symbols; if we further define m as the mask depth, in the following we will indicate the m-deep mask with M_m. The m-th regrouping criterion can be expressed as a simple logical function between the server IP addresses and the mask M_m: two addresses s_i and s_j belong to the same group G_k if and only if their first m bits are the same, i.e.:

    s_i, s_j \in G_k \iff s_i \wedge M_m = s_j \wedge M_m          (7.6)

Intuitively, the m-th grouping criterion will generate a number g(m) of address groups, unknown a priori, that constitute a partition of the IP address space S, thus:

    S = \bigcup_{k=1}^{g(m)} G_k                                   (7.7)

    G_k \cap G_j = \emptyset \quad \forall k \neq j                (7.8)

(A toy implementation of this regrouping is sketched at the end of this subsection.) This process brings several advantages at the price of a limited disadvantage:

- though the regrouping process does not guarantee that two sources that are near in the IP space correspond to two sources that are physically near in the real network, it can nevertheless be considered a valid approximation;

- the improved model therefore represents a further step toward a realistic traffic generation methodology, offering an arrival pattern significantly more robust than the earlier model (as a consequence of the elimination of both the false positive aggregation of "far" sources and the false negative missed aggregation of "near" sources);

- the regrouping of source addresses does not affect the freedom to map destination addresses to the switch output ports; therefore, though the number of constraints increases, the increase is "unilateral", in the sense that the flexibility of the synthesis is shifted toward the ability to efficiently associate destination addresses to the switch output ports;

- moreover, it should be considered that, given the real network position of the destination addresses with respect to the trace collection point (i.e., the edge router of Politecnico di Torino), the model is largely unaffected by the mapping between destination addresses and output ports; indeed, the statistical properties of the flows entering the Politecnico LAN have already been determined: the impact of the LAN is clearly negligible with respect to the multi-hop route that flows have traveled from the source up to this last router, or, in other words, the Politecnico router does not act as a bottleneck and does not bias, or modifies only infinitesimally in the worst case, the flow performance;

- on the other hand, the average load will be harder to "balance" and the instantaneous load will be less "balanced" over the whole trace: potentially, the offered traffic will be disjoint over the temporal support, and the reduced flexibility of the aggregation process leads to a smaller number of potential traffic scenarios; however, the most interesting traffic matrices (such as uniform or diagonal matrices) should not be much harder to generate than in the previous case.
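A toy implementation of the mask-based regrouping is sketched below; the mask depths and example addresses mirror those of Figure 7.4, while the function names are of course illustrative.

    from ipaddress import IPv4Address

    def regroup(server_ips, depth):
        """Group IPv4 source addresses that share their first `depth` bits.

        Returns a dict mapping the masked prefix to the list of addresses in
        that group, i.e., the partition induced by the m-deep logical mask.
        """
        mask = (0xFFFFFFFF << (32 - depth)) & 0xFFFFFFFF
        groups = {}
        for ip in server_ips:
            prefix = int(IPv4Address(ip)) & mask
            groups.setdefault(prefix, []).append(ip)
        return groups

    # The example of Figure 7.4: with a 16-bit mask (FF.FF.00.0) the five
    # servers collapse into three groups.
    servers = ["10.10.10.1", "10.10.10.2", "10.10.20.3", "10.20.20.4", "20.20.20.5"]
    print({str(IPv4Address(p)): g for p, g in regroup(servers, 16).items()})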
7.4.2 Responsive Sources

As previously outlined, in the field of switching and routing architectures a severe problem is the absence of traffic models apt to describe the complex interactions happening in packet networks: a possible solution, still unexplored, would be to adopt a feedback mechanism and implement reactive sources, i.e., sources that react to the actual network conditions, mimicking the real TCP behavior.

The Problem

A simple choice would be to consider a generic Additive Increase Multiplicative Decrease (AIMD) source, which is an extreme simplification of the real TCP behavior but would be valuable for a first-grade analysis of feedback sources; without entering into the full details of the TCP algorithm and behavior, we illustrate here why AIMD responsive sources are appropriate. It is well known that the amount of data transmitted by TCP during a Round Trip Time (RTT) is proportional to the congestion window (cwnd), which is tuned by an acknowledgment mechanism. When the transmitted data is acknowledged, the cwnd is slowly increased (thus, the Additive Increase), and so is the amount of data injected into the network: this is done to slowly probe the network reaction to the increase of the transmission rate. Conversely, when data fails to be acknowledged, the TCP source has an indication of packet loss, in which case the cwnd is abruptly reduced (thus, the Multiplicative Decrease): specifically, it is either halved (on Fast Recovery) or reset (on Timeout), depending on the "gravity" of the received congestion signal; indeed, the loss of a packet is a symptom of network congestion and therefore, in order to avoid network saturation, responsive sources react by decreasing their transmission rate.

Although AIMD sources are not able to capture all the different flavors of the TCP protocol suite and its many variations (e.g., Reno, NewReno, Tahoe, Sack, Vegas, Westwood and other more or less stable algorithms), they nevertheless constitute a first valuable approximation of a windowed protocol, featuring some important differences from the classical Markovian traffic models, where everything is independent, memoryless and uncorrelated, and thus the network feedback is completely neglected.

The Solution

Contextualizing an AIMD traffic model to the performance evaluation of switching architectures is straightforward, provided that we accept some compromises. With the help of Figure 7.5, we introduce in the following the details of AIMD traffic generation. Normally, the switch offered traffic is specified through a matrix Lambda whose elements lambda_{ij} quantify the average traffic entering at input i and directed to output j. Clearly, if we introduce a feedback model it is no longer possible to impose the average load a priori; rather, the load must be measured at the ingress ports of the device. The target matrix Lambda shall therefore be replaced by an adjacency matrix A used to control the AIMD traffic, whose semantic meaning is the following: every element a_{ij} is an integer number representing the number of superimposed AIMD sources generating responsive traffic directed from input i to output j. Initially, it could be a good compromise to use a uniform unitary adjacency matrix (i.e., only one source per ingress-egress pair).

[Figure 7.5: Responsive Traffic Generation Process (AIMD source, switch, ACK sink, M/M/1/B queue with background traffic beta, D/D/infinity propagation-delay stage closing the feedback ring)]

Referring to Figure 7.5, the adjacency matrix A is only the first control mechanism for the AIMD traffic, the second being the feedback mechanism. Let us continue the analysis of the picture in clockwise sense, focusing only on the metrics that may be of interest for a future analysis: namely, the offered load T', the delay Delta and the throughput T'', as well as the packet loss statistics, which can be gathered as p_loss = (T' - T'')/T'.
At the output of the switching fabric, an ACK generator injects acknowledgment packets into the feedback ring, which abstracts the network path traveled by the acknowledgment packets (or acks). Acks are first queued in an M/M/1/B queue, where, possibly, some background traffic can be injected in order to tune the network congestion. Before completing the backward path and being notified to the AIMD generator, every ack can also be delayed by a constant quantity: a D/D/infinity queue emulates the portion of the RTT due to link propagation. Thus, as summarized below, acks are subject to two different kinds of delay:

- a queuing delay, a variable component due to the M/M/1/B queue;
- a propagation delay, a fixed component due to the D/D/infinity stage.

The algorithm of the AIMD-ACK pair is very simple; schematically, the AIMD generator:

- checks for timeout expiration, in which case:
  - it drops the transmission window by a factor gamma (e.g., the window is halved when gamma = 2);
  - it retransmits the lost packet, starting from the last frame positively acknowledged (avoiding to overload the network if the in-flight size is bigger than the decreased window size);
- otherwise:
  - it generates new packets, in an amount such as to fill up the cwnd;
  - it sets the timeout corresponding to each packet acknowledgment;
  - it starts the actual packet transmission;
  - upon reception of a positive acknowledgment, it increments the transmission window to min(++cwnd, W_max), where the window may saturate to a finite value W_max or grow unbounded (W_max -> infinity).

It should be noted that there are two points in the system where packet loss can happen: either in the backward bottleneck queue or in the switch itself. The second case deserves additional discussion; indeed, there may be cases in which an AIMD packet does not even reach the ACK generator: specifically, this can happen when the scheduling algorithm is not stable for every admissible ingress traffic (e.g., iSLIP). In either of the two situations, the timer expiration due to the missed reception of the acknowledgment for the lost packet will trigger the reduction of the transmission window. Conversely, the ACK generator works much like a "reflector": upon reception of a packet directed from i to j and carrying sequence number n, it sends a packet from j to i with sequence number n (or, to mimic the TCP behavior more closely, we could set this value to the next sequence number the sender of the segment is expecting to receive, i.e., n+1). Acknowledgment packets then enter the queuing network constituting the backward part of the path, and they are responsible for either a decrease (on ack loss) or an increase (on successful ack reception) of the window at the transmitter side, thus closing the feedback loop. As a conclusive observation, notice that, besides packet loss, the feedback loop can also contribute to tuning the AIMD transmission rate: a simple ack delay may be sufficient to trigger the expiration of the timeout and consequently lead to a decrease of the congestion window, and thus of the transmission rate (a minimal sketch of these window dynamics is reported below).
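The following discrete-time sketch only shows the AIMD window dynamics described above: the whole feedback ring (M/M/1/B queue plus D/D/infinity stage) is collapsed into a single fixed loss probability, and every parameter value is an arbitrary placeholder rather than a setting used in Chapter 5.

    import random

    def aimd_window_trace(steps=1000, loss_prob=0.01, w_max=64, gamma=2.0):
        """Evolve an AIMD congestion window over `steps` RTTs.

        Each RTT the source sends cwnd packets; if any packet (or its ack) is
        lost, the window is divided by gamma (multiplicative decrease),
        otherwise it grows by one segment up to w_max (additive increase).
        Returns the list of per-RTT window sizes.
        """
        cwnd, trace = 1.0, []
        for _ in range(steps):
            # probability that at least one of the cwnd packets/acks is lost
            lost = random.random() < 1.0 - (1.0 - loss_prob) ** int(cwnd)
            if lost:
                cwnd = max(1.0, cwnd / gamma)            # multiplicative decrease
            else:
                cwnd = min(float(w_max), cwnd + 1.0)     # additive increase
            trace.append(cwnd)
        return trace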
This network model has several parameters that can be tuned; without any aim of completeness, we finally report some interesting scenarios that could be studied. First of all, the architectures and switching algorithms that could be considered are the same I considered in Section 5.3.4: mainly, input queuing (iSLIP, MWM, . . . ) or output queuing solutions. The new parameters that can be tuned are the M/M/1/B buffer size B, as well as the queuing discipline adopted there (i.e., FIFO, DropTail, RED, Choke, . . . ); moreover, one can consider several classes of propagation delay, or even non-deterministic delays (e.g., an M/M/infinity rather than a D/D/infinity stage). Finally, the AIMD traffic allows one to specify the number of sources per input-output port pair through the adjacency matrix A, whereas the background traffic (which can be Markovian, bursty, Bernoulli ON/OFF, . . . ) can be controlled in volume through beta.

7.5 Post Processing with DiaNa

Finally, in Chapter 6 we presented DiaNa, the versatile and flexible Perl framework, designed especially to process huge amounts of data, that has been used to carry out the analyses described in the previous chapters. After overviewing its architecture and syntax, we devoted special attention to the simplest possible way of getting the most out of it, describing its usage and interaction at the shell level. Finally, we gave some intuition of possible uses of the tool, describing the context that led to its development as well as giving some practical examples, which are ultimately the "backstage" of the presented research activity.

7.5.1 DiaNa Makeup and Restyle

As with any computer program, a lot has been done, but nevertheless a lot remains to be done. For example, several extensions have already been implemented, such as the P-Square algorithm [97] for the on-line estimation of quantiles, which requires constant state and no a-priori knowledge of or assumptions on the data; however, the implementation was carried out in Perl, which has a number of features that speed up the development to the detriment of the running speed: therefore, a lot of optimization is still possible, such as re-implementing the most stable parts of the tool in a faster and more efficient (as well as more stubborn and annoying(4)) language. However, we refer the reader to the DiaNa website [95] for other technical issues: indeed, to thank the reader for his attention up to this point, I have decided to report (and sometimes invent) some Murphy's laws regarding computer topics, which ironically highlight some of the real problems that occurred during the development of DiaNa (and during the analysis of the results described in this thesis as well).

(4) As you have probably guessed, I am talking about Kernighan's & Ritchie's C, with which many programmers have a love-hate relationship. About the optimization of DiaNa through reimplementation, however, I would go along with the Perl manual pages, which suggest that "the implementation in C is left as an exercise for the reader".

- Complexity Axiom: The complexity of any program grows until it exceeds the capability of the programmer who must maintain it.

- Obsolescence Law: Any given program, when running, is obsolete.

- Bugs Invariant: Software bugs are impossible to detect by anybody except the end user.
  - Corollary 1: A working program is one that has only unobserved bugs.
  - Corollary 2: If a program has not crashed yet, it is waiting for a critical moment before crashing.
  - Corollary 3: Bugs will appear in one part of a working program when another "unrelated" part is modified.
  - Corollary 4: A program generator creates programs that are more buggy than the program generator itself(5).

- Triviality Theorem: Every non-trivial program has at least one bug.
  - Corollary 1: A sufficient condition for program triviality is that it has no bugs.
  - Corollary 2: At least one bug will be observed after the author finishes his PhD.

- Axiom of File-dynamics: The number of files produced by each simulation run tends to overcome any effort to keep them ranked in any order.

- Saturation Lemma: Disks are always full(6).
  - Corollary: It is futile to try to get more disk space: data expands to fill any void.

- Rules of Thumb: The value of a program is inversely proportional to the weight of its output.
  - Corollary 1: If a program is useful, it will have to be changed.
  - Corollary 2: If a program is useless, it will have to be documented.

- Indetermination Principle: Constants aren't and variables won't.
  - Corollary: In any multi-threaded program the value (or the state) of any given resource cannot be punctually determined, even if the resource is not supposed to be shared across agents(7).

(5) Except the DiaNa core, of course.
(6) Except disks holding traffic measurements when the output of tcpdump is redirected to a single file named /dev/null.
(7) Tstat obviously being an exception.

Bibliography

[1] J. D. Sloan, "Network Troubleshooting Tools", O'Reilly Ed., Aug. 2001.
[2] R. Stine (Ed.), "FYI on a Network Management Tool Catalog", Network Working Group RFC 1147, Apr. 1990.
[3] "Tools for Monitoring and Debugging TCP/IP Internets and Interconnected Devices", Network Working Group RFC 1470, Jun. 1993.
[4] PlanetLab website, http://www.planet-lab.org/
[5] L. Peterson, T. Anderson, D. Culler, and T. Roscoe, In Proceedings of the First Workshop on Hot Topics in Networking (HotNets-I), October 2002.
[6] CAIDA, the Cooperative Association for Internet Data Analysis, website http://www.caida.org
[7] ENTM, available at ftp://ccc.nmfecc.gov
[8] EtherApe homepage, http://etherape.sourceforge.net/
[9] Getethers, available at ftp://harbor.ecn.purdue.edu/
[10] S. McCanne, C. Leres, and V. Jacobson, tcpdump homepage, http://www.tcpdump.org/
[11] S. McCanne, C. Leres, and V. Jacobson, libpcap homepage, http://sourceforge.net/projects/libpcap/
[12] WinPcap homepage, http://winpcap.polito.it
[13] L. Degioanni, M. Baldi, F. Risso and G. Varenni, "Profiling and Optimization of Software-Based Network-Analysis Applications", In Proceedings of the 15th IEEE Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2003), Sao Paulo, Brazil, November 2003.
[14] F. Risso and L. Degioanni, "An Architecture for High Performance Network Analysis", In Proceedings of the 6th IEEE Symposium on Computers and Communications (ISCC 2001), Hammamet, Tunisia, July 2001.
[15] IPTraf homepage, http://cebu.mozcom.com/riker/iptraf/
[16] M. Ramadas, "TCPTrace Manual", available at http://jarok.cs.ohiou.edu/software/tcptrace/manual/
[17] Information Sciences Institute, U. of S. C., "User Datagram Protocol", August 1980. RFC 768.
[18] Information Sciences Institute, U. of S. C., "Transmission Control Protocol", September 1981. RFC 793.
[19] K. Ramakrishnan, S. Floyd and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", September 2001. RFC 3168.
[20] M. Mathis, J. Mahdavi, S. Floyd and A. Romanow, "TCP Selective Acknowledgement Options", October 1996. RFC 2018.
[21] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach and T. Berners-Lee, "Hypertext Transfer Protocol – HTTP/1.1", June 1999. RFC 2616.
[22] S. Floyd, J. Mahdavi, M. Mathis and M. Podolsky, "An Extension to the Selective Acknowledgement (SACK) Option for TCP", July 2000. RFC 2883.
[23] W. R. Stevens, "TCP/IP Illustrated Volume I: The Protocols", Addison-Wesley, 1994.
[24] T. Berners-Lee, R. Fielding and H. Frystyk, "Hypertext Transfer Protocol – HTTP/1.0", May 1996. RFC 1945.
[25] V. Jacobson, R. Braden and D. Borman, "TCP Extensions for High Performance", May 1992. RFC 1323.
[26] G. R. Wright and W. R. Stevens, "TCP/IP Illustrated Volume II: The Implementation", Addison-Wesley, 1996.
[27] GARR, "GARR – The Italian Academic and Research Network", http://www.garr.it/garr-b-home-engl.shtml, 2001.
[28] Commissione Reti e Calcolo Scientifico del MURST, "Progetto di Rete a Larga Banda per le Università e la Ricerca Scientifica Italiana", Technical Document CRCS-97/11, 1997, available at http://www.garr.it/docs/crcs1.shtml
[29] MRTG homepage, http://people.ee.ethz.ch/~oetiker/webtools/mrtg/
[30] GEANT, Pan-European Gigabit Network homepage, cfr. http://www.dante.net/geant/
[31] Internet2 homepage, cfr. http://www.internet2.edu/
[32] Tstat's homepage, http://tstat.tlc.polito.it/
[33] M. Mellia, A. Carpani and R. Lo Cigno, "Measuring IP and TCP Behavior on Edge Nodes", IEEE Globecom 2002, Taipei, Taiwan, November 2002.
[34] M. Mellia, A. Carpani, and R. Lo Cigno, "Tstat web page", http://tstat.tlc.polito.it/, 2001.
[35] V. Paxson and S. Floyd, "Wide-Area Traffic: The Failure of Poisson Modeling", IEEE/ACM Transactions on Networking, Vol. 3, No. 3, pp. 226–244, Jun. 1995.
[36] B. M. Duska, D. Marwood and M. J. Feeley, "The Measured Access Characteristics of World-Wide-Web Client Proxy Caches", USENIX Symposium on Internet Technologies and Systems, pp. 23–36, Dec. 1997.
[37] L. Fan, P. Cao, J. Almeida and A. Broder, "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol", ACM SIGCOMM '98, pp. 254–265, 1998.
[38] A. Feldmann, R. Caceres, F. Douglis, G. Glass and M. Rabinovich, "Performance of Web Proxy Caching in Heterogeneous Bandwidth Environments", IEEE INFOCOM '99, pp. 107–116, 1999.
[39] B. Mah, "An Empirical Model of HTTP Network Traffic", IEEE INFOCOM '97, Apr. 1997.
[40] H. Balakrishnan, M. Stemm, S. Seshan, V. Padmanabhan and R. H. Katz, "TCP Behavior of a Busy Internet Server: Analysis and Solutions", IEEE INFOCOM '98, pp. 252–262, Mar. 1998.
[41] W. S. Cleveland, D. Lin and X. Sun, "IP Packet Generation: Statistical Models for TCP Start Times Based on Connection-Rate Superposition", ACM SIGMETRICS 2000, pp. 166–177, Jun. 2000.
[42] F. Donelson Smith, F. Hernandez, K. Jeffay and D. Ott, "What TCP/IP Protocol Headers Can Tell Us About the Web", ACM SIGMETRICS '01, pp. 245–256, Jun. 2001.
[43] M. E. Crovella and A. Bestavros, "Self Similarity in World Wide Web Traffic: Evidence and Possible Causes", IEEE/ACM Transactions on Networking, Vol. 5, No. 6, pp. 835–846, 1997.
[44] V. Paxson, "End-to-End Routing Behavior in the Internet", IEEE/ACM Transactions on Networking, Vol. 5, No. 5, pp. 601–615, 1997.
[45] L. Deri and S. Suin, "Effective Traffic Measurement Using ntop", IEEE Communications Magazine, Vol. 38, pp. 138–143, May 2000.
[46] J. Postel, "Internet Protocol", RFC 791, Sept. 1981.
[47] J. Postel, "Transmission Control Protocol", RFC 793, Sept. 1981.
[48] M. Allman, V. Paxson, and W. Stevens, "TCP Congestion Control", RFC 2581, 1999.
[49] S. Ostermann, tcptrace, 2001, Version 5.2.
[50] V. Jacobson, R. Braden, and D. Borman, "TCP Extensions For High Performance", RFC 1323, May 1992.
[51] M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow, "TCP Selective Acknowledgment Options", RFC 2018, Oct. 1996.
[52] D. Rossi, C. Casetti, M. Mellia, "User Patience and the Web: a Hands-on Investigation", In Proceedings of IEEE Globecom, San Francisco, CA, Dec. 2003.
[53] R. Khare and I. Jacobs, http://www.w3.org/Protocols/NL-PerfNote.html
[54] M. Molina et al., "Web Traffic Modeling Exploiting TCP Connections' Temporal Clustering through HTML-REDUCE", IEEE Network, May 2000.
[55] R. Fielding et al., "Hypertext Transfer Protocol – HTTP/1.1", RFC 2616, June 1999.
[56] D. Rossi, L. Muscariello, M. Mellia, "On the Properties of TCP Flow Arrival Process", In Proceedings of ICC, QoS and Performance Modeling Symposium, Paris, Jun. 2004.
[57] R. Caceres, P. Danzig, S. Jamin and D. Mitzel, "Characteristics of Wide-Area TCP/IP Conversations", ACM SIGCOMM, 1991.
[58] P. Danzig and S. Jamin, "tcplib: A Library of TCP Internetwork Traffic Characteristics", USC Technical Report, 1991.
[59] P. Danzig, S. Jamin, R. Caceres, D. Mitzel, and D. Estrin, "An Empirical Workload Model for Driving Wide-Area TCP/IP Network Simulations", Internetworking: Research and Experience, Vol. 3, No. 1, pp. 1–26, 1992.
[60] V. Paxson, "Empirically Derived Analytic Models of Wide-Area TCP Connections", IEEE/ACM Transactions on Networking, Vol. 2, pp. 316–336, Aug. 1994.
[61] V. Paxson and S. Floyd, "Wide-Area Traffic: The Failure of Poisson Modeling", IEEE/ACM Transactions on Networking, Vol. 3, No. 3, pp. 226–244, Jun. 1995.
[62] W. E. Leland, M. S. Taqqu, W. Willinger and V. Wilson, "On the Self-Similar Nature of Ethernet Traffic (Extended Version)", IEEE/ACM Transactions on Networking, Vol. 2, No. 1, pp. 1–15, Jan. 1994.
[63] A. Feldmann, "Characteristics of TCP Connection Arrivals", in Park and Willinger (editors), Self-Similar Network Traffic and Performance Evaluation, Wiley-Interscience, 2000.
[64] W. Willinger, M. S. Taqqu, R. Sherman and D. V. Wilson, "Self-Similarity through High Variability: Statistical Analysis of Ethernet LAN Traffic at the Source Level", IEEE/ACM Transactions on Networking, Vol. 5, No. 1, pp. 71–86, Jan. 1997.
[65] M. Mellia, A. Carpani and R. Lo Cigno, "Measuring IP and TCP Behavior on Edge Nodes", IEEE Globecom, Taipei (TW), Nov. 2002.
[66] A. Erramilli, O. Narayan and W. Willinger, "Experimental Queueing Analysis with Long-Range Dependent Packet Traffic", IEEE/ACM Transactions on Networking, Vol. 4, No. 2, pp. 209–223, 1996.
[67] P. Abry and D. Veitch, "Wavelet Analysis of Long Range Dependent Traffic", IEEE Transactions on Information Theory, Vol. 44, No. 1, pp. 2–15, Jan. 1998.
[68] N. Hohn, D. Veitch and P. Abry, "Does Fractal Scaling at the IP Level Depend on TCP Flow Arrival Processes?", 2nd Internet Measurement Workshop, Marseille, Nov. 2002.
[69] S. C. Graves, A. H. G. Rinnooy Kan and P. H. Zipkin, "Logistics of Production and Inventory", in Nemhauser and Rinnooy Kan (editors), Handbooks in Operations Research and Management Science, Vol. 4, North-Holland, 1993.
[70] P. Abry, P. Flandrin, M. S. Taqqu and D. Veitch, "Self-Similarity and Long-Range Dependence Through the Wavelet Lens", in Doukhan and Oppenheim (editors), Long Range Dependence: Theory and Applications, 2000.
[71] A. Feldmann, A. Gilbert, W. Willinger and T. Kurtz, "The Changing Nature of Network Traffic: Scaling Phenomena", Computer Communication Review, Vol. 28, No. 2, Apr. 1998.
[72] M. E. Crovella and A. Bestavros, "Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes", IEEE/ACM Transactions on Networking, Vol. 5, No. 6, pp. 835–846, 1997.
[73] I. Norros, "On the Use of Fractional Brownian Motion in the Theory of Connectionless Networks", IEEE Journal on Selected Areas in Communications, Vol. 13, pp. 953–962, Aug. 1995.
[74] M. Taqqu and V. Teverosky, "Is Network Traffic Self-Similar or Multifractal?", Fractals, Vol. 5, No. 1, pp. 63–73, 1997.
[75] R. H. Riedi and J. Lévy Véhel, "Multifractal Properties of TCP Traffic: a Numerical Study", IEEE Transactions on Networking, Nov. 1997 (extended version appeared as INRIA Research Report 3129, Mar. 1997).
[76] A. Gilbert, W. Willinger and A. Feldmann, "Scaling Analysis of Conservative Cascades, with Applications to Network Traffic", IEEE Transactions on Information Theory, Vol. 45, No. 3, pp. 971–992, Apr. 1999.
[77] D. Veitch, P. Abry, P. Flandrin and P. Chainais, "Infinitely Divisible Cascade Analysis of Network Traffic Data", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '00), 2000.
[78] S. Roux, D. Veitch, P. Abry, L. Huang, P. Flandrin and J. Micheel, "Statistical Scaling Analysis of TCP/IP Data", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), Special Session: Network Inference and Traffic Modeling, May 2001.
[79] A. Horváth and M. Telek, "A Markovian Point Process Exhibiting Multifractal Behavior and its Application to Traffic Modeling", 4th International Conference on Matrix-Analytic Methods in Stochastic Models, Adelaide (Australia), Jul. 2002.
[80] A. T. Andersen and B. F. Nielsen, "A Markovian Approach for Modeling Packet Traffic with Long-Range Dependence", IEEE Journal on Selected Areas in Communications, Vol. 16, No. 5, pp. 719–732, 1998.
[81] S. Robert and J. Y. Le Boudec, "New Models for Pseudo Self-Similar Traffic", Performance Evaluation, Vol. 30, No. 1-2, pp. 57–68, 1997.
[82] S. Robert and J. Y. Le Boudec, "Can Self-Similar Traffic be Modeled by Markovian Processes?", International Zurich Seminar on Digital Communication, Feb. 1996.
[83] A. Reyes Lecuona, E. González Parada, E. Casilari, J. C. Casasola and A. Díaz Estrella, "A Page-Oriented WWW Traffic Model for Wireless System Simulations", ITC16, Edinburgh, Jun. 1999.
[84] P. Giaccone, L. Muscariello, M. Mellia, D. Rossi, "The Performance of Switch under Real Traffic", In Proceedings of HPSR04, Phoenix, AZ, April 2004.
[85] A. Feldmann, A. C. Gilbert and W. Willinger, "Data Networks as Cascades: Investigating the Multifractal Nature of Internet WAN Traffic", ACM SIGCOMM '98, Boston, MA, pp. 42–55, Sep. 1998.
[86] I. Norros, "A Storage Model with Self-Similar Input", Queueing Systems, Vol. 16, pp. 387–396, 1994.
[87] M. S. Taqqu, "Fractional Brownian Motion and Long Range Dependence", in P. Doukhan, G. Oppenheim, M. S. Taqqu (editors), Theory and Applications of Long-Range Dependence, 2002.
[88] L. Muscariello, M. Mellia, M. Meo, R. Lo Cigno, M. Ajmone Marsan, "A Simple Markovian Approach to Model Internet Traffic at Edge Routers", COST279 Technical Document TD(03)032, May 2003.
[89] H. J. Chao, "Next Generation Routers", Proceedings of the IEEE, Vol. 90, No. 9, pp. 1518–1558, Sep. 2002.
[90] M. Ajmone Marsan, A. Bianco, P. Giaccone, E. Leonardi and F. Neri, "Packet-Mode Scheduling in Input-Queued Cell-Based Switches", IEEE/ACM Transactions on Networking, Vol. 10, No. 5, pp. 666–678, Oct. 2002.
[91] N. McKeown and T. E. Anderson, "A Quantitative Comparison of Scheduling Algorithms for Input-Queued Switches", Computer Networks and ISDN Systems, Vol. 30, No. 24, pp. 2309–2326, Dec. 1998.
[92] N. McKeown, A. Mekkittikul, V. Anantharam and J. Walrand, "Achieving 100% Throughput in an Input-Queued Switch", IEEE Transactions on Communications, Vol. 47, No. 8, pp. 1260–1272, Aug. 1999.
[93] N. McKeown, "The iSLIP Scheduling Algorithm for Input-Queued Switches", IEEE/ACM Transactions on Networking, Vol. 7, No. 2, pp. 188–201, Aug. 1999.
[94] B. V. Rao, K. R. Krishnan and D. P. Heyman, "Performance of Finite Buffer Queues under Traffic with Long-Range Dependence", Proc. IEEE Globecom, Vol. 1, pp. 607–611, Nov. 1996.
[95] DiaNa, http://www.tlc-networks.polito.it/diana
[96] Perl, http://www.perl.com/
[97] R. Jain and I. Chlamtac, "The P-Square Algorithm for Dynamic Calculation of Percentiles and Histograms without Storing Observations", Communications of the ACM, Oct. 1985.
[98] S. Hopkins, "Camels and Needles: Computer Poetry Meets the Perl Programming Language", USENIX, 1992.
[99] L. Wall, T. Christiansen and J. Orwant, "Programming Perl", 3rd Edition, O'Reilly Ed., Jul. 2000.
[100] D. M. Beazley, D. Fletcher and D. Dumont, "Perl Extension Building with SWIG", USENIX, 1998.
[101] Comprehensive Perl Archive Network, cfr. http://www.cpan.org
[102] Perl Data Language, cfr. http://pdl.perl.org
[103] J. E. Friedl, "Mastering Regular Expressions", 2nd Edition, O'Reilly Ed., Jul. 2002.
[104] Obfuscated Perl Contest, The Perl Journal, cfr. http://www.tpj.com/
[105] Gnuplot, cfr. http://www.gnuplot.info
[106] G. Baiocchi, "Using Perl for Statistics: Data Processing and Statistical Computing", Journal of Statistical Software, Vol. 11, 2004.
[107] B. W. Kernighan and C. J. Van Wyk, "Timing Trials, or, the Trials of Timing: Experiments with Scripting and User-Interface Languages", Technical Report, Nov. 1997.
[108] L. Prechelt, "An Empirical Comparison of Seven Programming Languages", IEEE Computer, Oct. 2000.
[109] ActiveState, http://www.activestate.com
[110] D. Rossi and I. Stoica, "Gambling Heuristics and Chord Routing", submitted to IEEE Globecom '05, St. Louis, MO, November 2005.
[111] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek and H. Balakrishnan, "Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications", IEEE/ACM Transactions on Networking, Vol. 11, No. 1, pp. 17–32, Feb. 2003.
[112] T. Oetiker, RRDtools website, http://people.ee.ethz.ch/~oetiker/webtools/rrdtoo
[113] P. Arlos, M. Fiedler and A. A. Nilsson, "A Distributed Passive Measurement Infrastructure", Passive and Active Measurements Workshop, Boston, MA, USA, 2005.
[114] WETMO, a WEb Traffic MOdule for the Network Simulator ns-2, available at http://www.tlc-networks.polito.it/wetmo/
[115] D. Rossi, "A Simulation Study of Web Traffic over DiffServ Networks", Master Thesis, International Archives, available at http://www.tlc-networks.polito.it/rossi/DRossi-Thesis.ps.gz
[116] D. Rossi, C. Casetti, M. Mellia, "A Simulation Study of Web Traffic over DiffServ Networks", In Proceedings of IEEE Globecom 2002, Taipei, TW, November 2002.