CiNIC – Calpoly Intelligent NIC
A Thesis
Presented to the Faculty of
California Polytechnic State University
San Luis Obispo
In Partial Fulfillment
Of the Requirements for the Degree
Master of Science in Electrical Engineering
By
James D. Fischer
June 2001
Authorization for Reproduction of Master’s Thesis
I hereby grant permission for the reproduction of this thesis, in whole or in part,
without further authorization from me.
Signature (James D. Fischer)
Date
Approval Page
Title:
CiNIC – Calpoly Intelligent NIC
Author:
James D. Fischer
Date Submitted:
June 26, 2001
Dr. James G. Harris
Advisor
Signature
Dr. Mei-Ling Liu
Committee Member
Signature
Dr. Fred W. DePiero
Committee Member
Signature
Abstract
CiNIC – Calpoly Intelligent NIC
by:
James D. Fischer
The Calpoly intelligent NIC (CiNIC) Project offloads TCP/IP stack processing from
an i686-based (Pentium) host system running either Microsoft Windows 2000 or
Redhat Linux 7.1 onto a dedicated EBSA-285 co-host running the ARM/Linux
operating system.
The goals of this thesis are threefold: 1) implement a software development
environment for the EBSA-285 platform; 2) implement a low-level system environment
for the EBSA-285 that can boot an ARM/Linux kernel on the EBSA-285 board;
and 3) implement performance instrumentation that gathers latency data from the
CiNIC architecture during network send operations.
The first half of this thesis presents a brief history of the CiNIC Project and an
overview of the hardware and software architectures that comprise the CiNIC
platform. An overview of the software development tools for the ARM/Linux
environment on the EBSA-285 is also provided. The second half of the thesis
presents the performance instrumentation technique that Ron J. Ma and I developed
to gather latency data from the CiNIC platform during network send operations.
Samples of the performance data are provided to demonstrate the capabilities of
this instrumentation technique.
Acknowledgements
I would like to thank the following individuals and companies for their invaluable
support throughout this project. Paul Rudnick, John Dieckman, Carl First, Charles
Chen, Anand Rajagopalan and others at 3Com Corporation who generously provided
equipment, financial support, technical support and moral support to Cal Poly’s
Network Performance Research Group over the years. Jim Medeiros and Pete
Hawkins of Ziatech/Intel Corporation for volunteering their time and knowledge of
PCI-based systems design. Adaptec Corporation for their generous donation of a
SCSI-RAID card to the CiNIC project.
I would particularly like to thank Dr. Jim Harris, my thesis advisor, mentor,
and friend for his seemingly inexhaustible enthusiasm, knowledge, and support.
Dr. Mei-Ling Liu, Dr. Hugh Smith, and the student members of Cal Poly’s Network
Performance Research Group. These bright, creative, stimulating, and thoroughly
enjoyable individuals have contributed more to my Cal Poly education than they can
possibly imagine. In particular I would like to thank Ron J. Ma for assisting me in
adapting my original instrumentation technique for use in the new CiNIC
architecture.
And finally, I extend my heartfelt thanks to my family for their boundless
support and love.
Table of Contents
LIST OF TABLES .......................................................................... VIII
LIST OF FIGURES ......................................................................... IX
CHAPTER 1      INTRODUCTION ........................................................ 1
CHAPTER 2      CINIC ARCHITECTURE ............................................... 3
2.1      HISTORICAL OVERVIEW ....................................................... 3
2.2      HARDWARE ARCHITECTURE .................................................. 7
2.2.1      Host Computer System ..................................................... 8
2.2.2      Intel 21554 PCI-PCI Nontransparent Bridge ......................... 9
2.2.3      Intel EBSA-285 StrongARM / 21285 Evaluation Board ............ 11
2.2.4      3Com 3c905C Fast Ethernet Network Card .......................... 12
2.2.5      Promise Ultra66 PCI-EIDE Hard Disk Adaptor ...................... 12
2.2.6      Maxtor 531DX Ultra 100 Hard Drive .................................. 12
2.2.7      Intel EBSA-285 SDRAM Memory Upgrade ............................ 12
2.2.8      FuturePlus Systems FS2000 PCI Analysis Probe ................... 13
2.3      SOFTWARE ARCHITECTURE ................................................. 14
2.3.1      Integrating CiNIC with Microsoft Windows 2000 .................. 14
2.3.2      Integrating CiNIC with Redhat Linux 7.x ............................ 17
2.3.3      Service Provider Example: Co-host Web Caching .................. 20
CHAPTER 3      DEVELOPMENT / BUILD ENVIRONMENT ....................... 23
3.1      OVERVIEW OF THE SOFTWARE INFRASTRUCTURE .................... 23
3.2      CREATING AN I686–ARM TOOL CHAIN .................................. 23
3.2.1      Caveat Emptor .............................................................. 23
3.2.2      Getting Started ............................................................. 25
3.2.3      Creating the i686-ARM Development Tools with an i686 host .. 27
3.2.4      Creating the i686-ARM C Library with an i686 host .............. 30
3.2.5      Creating the ARM-ARM Tool Chain on an i686 host ............... 32
3.2.6      Copying the ARM-ARM Tool Chain to the EBSA-285 .............. 34
3.2.7      Rebuild the ARM-ARM Tool Chain on the EBSA-285 .............. 37
3.3      SYSTEM ENVIRONMENT ...................................................... 38
3.3.1      EBSA-285 BIOS ............................................................. 38
3.3.2      Booting Linux on the EBSA-285 ........................................ 42
3.3.3      Linux Bootstrap Sequence ............................................... 43
3.3.4      Linux Operating System Characteristics ............................. 46
3.3.5      Synchronizing the System Time with the Current Local Time ... 47
3.4      USER ENVIRONMENT ......................................................... 50
3.4.1      Building applications with the i686-ARM cross tool chain ....... 51
3.4.2      Installing from the i686 host to the EBSA-285’s file system .... 53
3.4.3      Rebuilding everything with the ARM-ARM tool chain ............. 54
CHAPTER 4      INSTRUMENTATION ................................................. 56
4.1      BACKGROUND .................................................................. 56
4.2      PERFORMANCE METRICS .................................................... 58
4.2.1      Framework for Host Performance Metrics ........................... 59
4.3      STAND-ALONE PERFORMANCE INSTRUMENTATION .................. 60
4.3.1      Theory of Operation ....................................................... 60
4.3.2      IPv4 Fragmentation of UDP Datagrams in Linux ................... 66
4.4      HOST / CO-HOST COMBINED PLATFORM PERFORMANCE INSTRUMENTATION ... 71
4.4.1      Theory of Operation ....................................................... 72
4.4.2      CiNIC Architecture Polling Protocol Performance Testbed ....... 79
4.4.3      CiNIC Host Socket Performance Testbed ............................. 81
4.4.4      Secondary PCI Bus “Posted Write” Phenomenon ................... 82
CHAPTER 5      PERFORMANCE DATA AND ANALYSIS ......................... 85
5.1      STAND-ALONE PERFORMANCE – MICROSOFT WINDOWS 2000 (SP1) ... 86
5.2      STAND-ALONE PERFORMANCE – LINUX 2.4.3 KERNEL ON I686 HOST ... 90
5.3      STAND-ALONE PERFORMANCE – LINUX 2.4.3 KERNEL ON EBSA-285 ... 95
5.4      CINIC ARCHITECTURE – POLLING PROTOCOL PERFORMANCE TESTBED ... 100
5.4.1      CiNIC Polling Protocol TCP / IPv4 Performance ................... 101
5.4.2      CiNIC Polling Protocol UDP / IPv4 Performance .................. 114
CHAPTER 6      CONCLUSIONS AND FUTURE WORK ........................... 123
6.1      CONCLUSIONS ................................................................. 123
6.2      FUTURE WORK ................................................................ 124
BIBLIOGRAPHY .......................................................................... 125
APPENDIX A      ACRONYMS .......................................................... 129
APPENDIX B      RFC 868 TIME PROTOCOL SERVERS ......................... 131
APPENDIX C      EBSA-285 SDRAM UPGRADE ................................... 136
List of Tables
TABLE 2-1 DESCRIPTION OF THE HOST COMPUTER SYSTEM......................................................................9
List of Figures
FIGURE 2-1 ANACONDA TESTBED ............................................................................................................4
FIGURE 2-2 INTEL’S EBSA-285 REPLACES 3COM ANACONDA AS CO-HOST ..............................................5
FIGURE 2-3 THE 21554 BRIDGES TWO DISTINCT PCI BUS DOMAINS .........................................................6
FIGURE 2-4 CINIC HARDWARE ARCHITECTURE ......................................................................................8
FIGURE 2-5 SCHEMATIC DIAGRAM OF THE CINIC HARDWARE ARCHITECTURE .......................................8
FIGURE 2-6 INTEL 21554 NONTRANSPARENT PCI-PCI BRIDGE ..............................................................10
FIGURE 2-7 PRIMARY PCI BUS’S VIEW OF THE 21554 AND THE SECONDARY PCI BUS ...........................10
FIGURE 2-8 SECONDARY PCI BUS’S VIEW OF THE 21554 AND THE PRIMARY PCI BUS ...........................10
FIGURE 2-9 ORIGINAL WINDOWS 2000 SOFTWARE ARCHITECTURE .......................................................15
FIGURE 2-10 MICROSOFT WINDOWS OPEN SYSTEM ARCHITECTURE [37] .............................................16
FIGURE 2-11 “WOSA FRIENDLY” WINDOWS 2000 SOFTWARE ARCHITECTURE .....................................17
FIGURE 2-12 REDHAT LINUX 7.X NETWORKING ARCHITECTURE ............................................................18
FIGURE 2-13 OVERVIEW OF THE CINIC ARCHITECTURE ........................................................................19
FIGURE 2-14 EXAMPLE OF A WEB CACHING PROXY ON A NON-CINIC HOST...........................................21
FIGURE 2-15 CINIC-BASED WEB CACHING SERVICE PROVIDER ..............................................................22
FIGURE 3-1 SOME ON-LINE HELP RESOURCES FOR LINUX SOFTWARE DEVELOPERS ................................27
FIGURE 3-2 USING AN I686 HOST TO CREATE THE I686-ARM TOOL CHAIN ...........................................28
FIGURE 3-3 CREATING THE INSTALLATION DIRECTORY FOR THE I686-ARM TOOL CHAIN .....................29
FIGURE 3-4 CONFIGURE OPTIONS FOR THE I686-ARM BINUTILS AND GCC PACKAGES ............................29
FIGURE 3-5 COPYING THE ARM/LINUX HEADERS INTO THE I686-ARM DIRECTORY TREE ....................31
FIGURE 3-6 CONFIGURE OPTIONS FOR THE GNU GLIBC C LIBRARY .......................................................31
FIGURE 3-7 ADDING THE I686-ARM TOOL CHAIN PATH TO THE ‘PATH’ ENVIRONMENT VARIABLE .....32
FIGURE 3-8 USING AN I686 HOST TO CREATE AN ARM-ARM TOOL CHAIN ...........................................33
FIGURE 3-9 INITIAL CONFIGURE OPTIONS WHEN BUILDING THE ARM-ARM TOOL CHAIN .....................34
FIGURE 3-10 CREATE A MOUNT POINT ON THE I686 HOST FOR THE EBSA-285’S FILE SYSTEM ..............35
FIGURE 3-11 /ETC/FSTAB ENTRY FOR /EBSA285FS MOUNT POINT ON I686 HOST .....................................35
FIGURE 3-12 COPY THE ARM-ARM TOOL CHAIN TO THE EBSA-285’S FILE SYSTEM ...........................35
FIGURE 3-13 CREATE SYMLINKS AFTER INSTALLING FROM I686 HOST TO EBSA-285’S FILE SYSTEM ...36
FIGURE 3-14 USING THE LDCONFIG COMMAND TO UPDATE THE EBSA-285’S DYNAMIC LINKER ...........36
FIGURE 3-15 CREATE LDCONFIG ON THE I686 HOST AND INSTALL IT ON THE EBSA-285.......................37
FIGURE 3-16 RECREATION OF THE NATIVE ARM-ARM TOOL CHAIN ....................................................37
FIGURE 3-17 CONFIGURE OPTIONS WHEN REBUILDING THE ARM-ARM TOOL CHAIN ...........................38
FIGURE 3-18 EBSA-285 JUMPER POSITIONS FOR PROGRAMMING THE FLASH ROM [24].......................39
FIGURE 3-19 EBSA-285 FLASH ROM CONFIGURATION ........................................................................42
FIGURE 3-20 THE PLATFORM USED BY ERIC ENGSTROM TO PORT LINUX ONTO THE EBSA-285 ............43
FIGURE 3-21 ARM/LINUX BIOS’S BOOTP SEQUENCE ............................................................................44
FIGURE 3-22 LINUX KERNEL BOOTP SEQUENCE......................................................................................46
FIGURE 3-23 BASH ENVIRONMENT VARIABLES FOR COMMON I686-ARM CONFIGURE OPTIONS ............51
FIGURE 3-24 CONFIGURE STEP FOR APPS THAT ARE USED DURING THE LINUX BOOT SEQUENCE............53
FIGURE 3-25 CONFIGURE STEP FOR APPS THAT ARE NOT USED DURING THE LINUX BOOT SEQUENCE .....53
FIGURE 3-26 COPYING THE /ARM-ARM DIRECTORY TO THE EBSA-285’S ROOT DIRECTORY ..................54
FIGURE 3-27 CREATE SYMLINKS AFTER INSTALLING FROM I686 HOST TO EBSA-285’S FILE SYSTEM ...54
FIGURE 3-28 BASH ENVIRONMENT VARIABLES FOR THE ARM-ARM CONFIGURE OPTIONS ...................55
FIGURE 4-1 “STAND-ALONE” HOST PERFORMANCE INSTRUMENTATION.................................................61
FIGURE 4-2 CONTENTS OF THE PAYLOAD BUFFER FOR STAND-ALONE MEASUREMENTS .........................61
FIGURE 4-3 WRITE TO UNIMPLEMENTED BAR ON 3C905C NIC CORRESPONDS TO TIME T 0 ..................63
FIGURE 4-4 WIRE ARRIVAL TIME T 1 .......................................................................................................64
FIGURE 4-5 WIRE EXIT TIME T 2..............................................................................................................64
FIGURE 4-6 “SPREADSHEET FRIENDLY” CONTENTS OF TEST RESULTS FILE ............................................65
FIGURE 4-7 IP4 FRAGMENTATION OF “LARGE” UDP DATAGRAM ..........................................................66
FIGURE 4-8 IP4 DATAGRAM SEND SEQUENCE FOR UDP DATAGRAMS > 1 MTU....................................67
FIGURE 4-9 PCI BUS WRITE BURST OF IP DATAGRAM N WHERE |IP DATAGRAM N| < |START MARKER| ...69
FIGURE 4-10 LOGIC ANALYZER PAYLOAD “MISS” ..................................................................................69
FIGURE 4-11 CINIC ARCHITECTURE INSTRUMENTATION CONFIGURATION ............................................71
FIGURE 4-12 WELL-DEFINED DATA PAYLOAD FOR THE COMBINED PLATFORM ......................................72
FIGURE 4-13 WRITE TO UNIMPLEMENTED BAR TRIGGERS THE LOGIC ANALYZER .................................73
FIGURE 4-14 WIRE ARRIVAL TIME T 1,1 ON PRIMARY PCI BUS ..............................................................74
FIGURE 4-15 WIRE ARRIVAL TIME T 2,1 ON SECONDARY PCI BUS .........................................................75
FIGURE 4-16 WIRE EXIT TIME T 1,2 ON PRIMARY PCI BUS ....................................................75
FIGURE 4-17 WIRE EXIT TIME T 2,2 ON SECONDARY PCI BUS ................................................................76
FIGURE 4-18 WIRE ARRIVAL TIME T 2,3 ON SECONDARY PCI BUS ...........................................................77
FIGURE 4-19 WIRE EXIT TIME T 2,4 ON SECONDARY PCI BUS ..................................................................78
FIGURE 4-20 DATA PATH TIME LINE .......................................................................................................78
FIGURE 4-21 CINIC POLLING PROTOCOL PERFORMANCE TESTBED ........................................................80
FIGURE 4-22 CINIC ARCHITECTURE HOST SOCKET PERFORMANCE TESTBED .........................................81
FIGURE 4-23 START MARKER APPEARS FIRST ON SECONDARY PCI BUS ................................................82
FIGURE 5-1 WINDOWS 2000 TCP/IP4 ON I686 HOST, PAYLOAD SIZES: 32 TO 16K BYTES .....................87
FIGURE 5-2 WINDOWS 2000 TCP/IP4 ON I686 HOST, PAYLOAD SIZES: 512 TO 65K BYTES ...................88
FIGURE 5-3 WINDOWS 2000 UDP/IP4 ON I686 HOST, PAYLOAD SIZES: 512 TO 65K BYTES ..................89
FIGURE 5-4 LINUX 2.4.3 TCP/IP4 ON I686 HOST, PAYLOAD SIZES: 64 TO 9K BYTES .............................91
FIGURE 5-5 LINUX 2.4.3 TCP/IP4 ON I686 HOST, PAYLOAD SIZES: 512 TO 65K BYTES .........................92
FIGURE 5-6 LINUX 2.4.3 UDP/IP4 ON I686 HOST, PAYLOAD SIZES: 64 TO 9K BYTES ............................93
FIGURE 5-7 LINUX 2.4.3 UDP/IP4 ON I686 HOST, PAYLOAD SIZES: 512 TO 65K BYTES ........................94
FIGURE 5-8 LINUX 2.4.3-RMK2 TCP/IP4 ON EBSA285, PAYLOAD SIZES: 64 TO 4.3K BYTES ...............96
FIGURE 5-9 LINUX 2.4.3-RMK2 TCP/IP4 ON EBSA285, PAYLOAD SIZES: 512 TO 50K BYTES...............97
FIGURE 5-10 LINUX 2.4.3-RMK2 UDP/IP4 ON EBSA285, PAYLOAD SIZES: 64 TO 9K BYTES ................98
FIGURE 5-11 LINUX 2.4.3-RMK2 UDP/IP4 ON EBSA285, PAYLOAD SIZES: 512 TO 31K BYTES ............99
FIGURE 5-12 CINIC WIRE ENTRY TIMES...............................................................................................101
FIGURE 5-13 LINUX 2.4.3 TCP/IP4 ON CINIC, POLLING PROTOCOL, WIRE ENTRY LATENCIES ...........102
FIGURE 5-14 LINUX 2.4.3 TCP/IP4 ON CINIC, POLLING PROTOCOL, WIRE ENTRY LATENCIES ...........103
FIGURE 5-15 PARTIAL PLOT OF CINIC WIRE ENTRY TIMES ..................................................................104
FIGURE 5-16 LINUX 2.4.3 TCP/IP4 ON CINIC, POLLING PROTOCOL, WIRE ENTRY LATENCIES ...........105
FIGURE 5-17 CINIC WIRE EXIT TIMES ..................................................................................................106
FIGURE 5-18 LINUX 2.4.3 TCP/IP4 ON CINIC, POLLING PROTOCOL, WIRE EXIT LATENCIES ..............107
FIGURE 5-19 LINUX 2.4.3 TCP/IP4 ON CINIC, POLLING PROTOCOL, WIRE EXIT LATENCIES ..............108
FIGURE 5-20 PARTIAL PLOT OF CINIC WIRE EXIT TIMES ......................................................................109
FIGURE 5-21 LINUX 2.4.3 TCP/IP4 ON CINIC, POLLING PROTOCOL, WIRE EXIT LATENCIES ................110
FIGURE 5-22 21554 OVERHEAD ON WIRE ENTRY AND EXIT TIMES ........................................................111
FIGURE 5-23 LINUX 2.4.3 TCP/IP4 ON CINIC, POLLING PROTOCOL, 21554 OVERHEAD .....................112
FIGURE 5-24 SOCKET TRANSFER TIME .................................................................................................113
FIGURE 5-25 LINUX 2.4.3 TCP/IP4 ON CINIC, POLLING PROTOCOL, TESTAPP SEND TIMES (PCI2)......114
FIGURE 5-26 LINUX 2.4.3 UDP/IP4 ON CINIC, POLLING PROTOCOL, WIRE ENTRY LATENCIES ..........115
FIGURE 5-27 LINUX 2.4.3 UDP/IP4 ON CINIC, POLLING PROTOCOL, WIRE ENTRY LATENCIES ..........116
FIGURE 5-28 LINUX 2.4.3 UDP/IP4 ON CINIC, POLLING PROTOCOL, WIRE ENTRY LATENCIES ..........117
FIGURE 5-29 LINUX 2.4.3 UDP/IP4 ON CINIC, POLLING PROTOCOL, WIRE EXIT LATENCIES ..............118
FIGURE 5-30 LINUX 2.4.3 UDP/IP4 ON CINIC, POLLING PROTOCOL, WIRE EXIT LATENCIES ..............119
FIGURE 5-31 LINUX 2.4.3 UDP/IP4 ON CINIC, POLLING PROTOCOL, WIRE EXIT LATENCIES ..............120
FIGURE 5-32 LINUX 2.4.3 UDP/IP4 ON CINIC, POLLING PROTOCOL, 21554 OVERHEADS ...................121
FIGURE 5-33 LINUX 2.4.3 UDP/IP4 ON CINIC, POLLING PROTOCOL, TESTAPP SEND TIMES (PCI2) .....122
FIGURE 6-1 CINIC DEVELOPMENT CYCLE ............................................................................................124
Chapter 1
Introduction
The Calpoly intelligent Network Interface Card (CiNIC) Project introduces a
networking architecture that offloads network-related protocol stack processing from
an Intel i686 host system onto a dedicated ARM/Linux co-host attached to the i686
system. Placing these processing tasks on a dedicated co-host relieves the host system
of the myriad peripheral tasks that go along with managing the flow of information
through a networking protocol stack – e.g., servicing interrupts from the network
card, managing port connection requests, error detection and recovery, managing
flow control, etc. Placing network-related protocol stack processing on a dedicated
co-host also allows one to implement low-cost, “value-added” service providers such
as firewall protection, encryption, information caching, QOS, etc., on the dedicated
co-host with little or no performance hit on the host system.
The goals of this thesis are threefold:
• Implement a software development environment for the co-host platform. To this end, I discuss the two GNU software development tool chains I implemented to support ARM/Linux software development for the EBSA-285 co-host platform.
• Implement a system environment on the co-host that is capable of bootstrapping an ARM/Linux kernel when power is applied to the board. To this end, I discuss my implementation of the ARM/Linux BIOS for the EBSA-285.
• Implement performance instrumentation that gathers latency data from the CiNIC architecture during network send operations. To this end, I discuss the logic analysis / systems approach I developed to gather latency data from the individual components that comprise the CiNIC platform – i.e., the host system by itself and the EBSA-285 co-host by itself – and from the combined host/co-host CiNIC platform. Samples of the performance data obtained with this instrumentation technique are provided to demonstrate its capabilities.
The remainder of this thesis is organized as follows. A brief historical
perspective of the CiNIC project is provided in Chapter 2, along with an overview of
the project’s current hardware and software architectures. Chapter 3 covers the CiNIC
Project’s software infrastructure – i.e., the i686-ARM and ARM-ARM software
development tool chains, the ARM/Linux BIOS being used to boot a Linux kernel on
the EBSA-285, and a brief overview of the bootstrap process. The performance
instrumentation technique used to capture performance data from the CiNIC platform
during network sends is discussed in Chapter 4. The performance metrics used by the
Cal Poly Network Performance Research Group are also mentioned in Chapter 4.
Samples of the performance data are provided in Chapter 5 for the purpose of
demonstrating the capabilities of the aforementioned instrumentation technique.
Chapter 6 presents the thesis’s conclusions and proposes some areas for future work.
Appendix A contains a list of acronyms that are used throughout this document.
Information on RFC-868 Time Servers is provided in Appendix B. And finally,
Appendix C provides information on adding SDRAM to an EBSA-285.
NOTE:
A CD-ROM disc accompanies this thesis. This disc contains the source
code and executable programs that accompany this work. The disc also
contains “README” files that contain additional / last-minute information that is not presented herein.
Chapter 2
CiNIC Architecture

2.1 Historical Overview
The first efforts for incorporating a co-host for network operation were directed
towards off-loading the NIC driver to a co-host. Maurico Sanchez’s I2O development
work was the project’s first effort [49]. He ported a 3Com 3c905x NIC device driver
from the host machine to the I2O platform. Subsequent performance measurements
indicated that there was a significant latency penalty paid for this arrangement, due to
the overhead associated with the I2O message handling.
The seeds of the CiNIC project were sown in the summer of 1999 when the
CPNPRG received two “Anaconda” development boards from 3Com Corporation.
Each Anaconda contained an Intel StrongARM (SA-110) CPU, an Intel 21285 Core
Logic chip, two Hurricane NICs (3Com’s 10/100 Mbps Ethernet ASIC), a custom
DMA engine, and 2 MB of RAM (see Figure 2-1). With these boards, Angel Yu
began experimenting with the idea of transplanting a host machine’s TCP/IP protocol
stack onto a parasitic “network appliance” plugged into the host’s PCI bus [56]. The
hope was that by offloading the host machine’s low-level networking tasks onto a
dedicated “co-host” we could, among other things, free up a significant number of
CPU clock cycles on the host system.
Figure 2-1 Anaconda testbed
Yu’s work yielded some valuable results, but one unexpected result was the
realization that the Anaconda was not particularly well suited for the task at hand. Its
limited memory capacity made it almost impossible to fit the required software items
within it – i.e., the Anaconda microcode, the TCP/IP protocol stack, an operating
system of some sort, data processing and storage areas, etc. It was not well suited as a
development platform in that there was no “easy” way to reload our software into the
Anaconda’s RAM during the boot process (e.g., after a software “crash”). Also, the
“mapmem” technique we were using to integrate the Anaconda’s RAM into the
“virtual machine” environment of a Microsoft Windows NT 4.0 host was problematic,
at best.
Despite the aforementioned difficulties with the Anaconda, we genuinely
liked the board’s architectural concept. We felt that Intel’s StrongARM CPU and
21285 Core Logic chip set were particularly well suited to performing TCP/IP stack
operations on a dedicated co-host. Furthermore, if we could increase the amount of
RAM on the co-host, we could perform not only TCP/IP stack operations but “value
added” tasks as well – e.g., web caching, QOS, encryption, firewall support, etc.
So late in the fall of 1999 we began searching for a development platform to
replace the Anaconda and came across the Intel EBSA-285 – a single-board computer
system with the form factor of a PCI add-in card that serves as an evaluation
platform for Intel’s StrongARM CPU and 21285 Core Logic chip set [25]. The EBSA-285 board had all of the features we were looking for in an ARM-based development
platform and more. The addition of an EBSA-285 board to the CiNIC project allowed
us to make additional progress toward the goal of offloading the TCP/IP stack onto a
dedicated co-host.
Of course, the incorporation of an EBSA-285 into the CiNIC project did not
immediately solve all our problems. In fact, it quickly became apparent that the
EBSA-285’s built-in boot loader / monitor / debugging software (a.k.a., “Angel”) was
less than optimal for our development efforts. What we really needed was a full-blown operating system environment on the EBSA-285 itself that would facilitate our
development efforts rather than impede them. Fortunately, Russell King, a systems
programming guru over in Great Britain [29], had ported the Linux operating system
to a variety of ARM-based single-board computers by this time, including the EBSA-285. Thus began the CPNPRG’s effort to implement the Linux operating system –
and ultimately a complete development environment – on our EBSA-285.
Figure 2-2 Intel’s EBSA-285 replaces 3Com Anaconda as co-host
The decision to implement the Linux operating system on the EBSA-285 was
a move in the right direction, but it introduced a new and significant problem: we now
needed a way to isolate the Linux operating system on the EBSA-285 from the
operating system environment on the host machine (which would be either Windows
2000 or Linux). Without an isolation layer between the host and co-host operating
system environments, the two operating systems would battle each other for control
of the host machine’s PCI bus. Such a conflict would throw the entire system into a
state of chaos, and therefore had to be prevented.
Thankfully, the task of isolating the host and co-host operating systems had a
straightforward solution: the Intel 21554 Nontransparent PCI-to-PCI Bridge [23].
The 21554 is specifically designed to bridge two distinct PCI bus domains (a.k.a.,
“processor domains”).
Figure 2-3 The 21554 bridges two distinct PCI bus domains
The addition of a 21554 bridge seemed a perfect solution to the problem of
isolating the co-host’s OS from the primary PCI bus, but its incorporation introduced
yet another significant architectural change to the CiNIC project. Specifically, the
incorporation of a 21554 bridge into the CiNIC architecture implied:
a) The EBSA-285 must now reside on a secondary PCI bus system, and,
b) The secondary PCI bus system must somehow be connected to the host
system’s PCI backplane via the 21554 nontransparent bridge, and,
c) A host/co-host communications protocol would need to be created to pass
information back and forth between the host and co-host processor
domains via the 21554 bridge.
Nevertheless, the benefits of having the 21554 were obvious and more than justified
its incorporation into the project. Ihab Bishara, in his senior project [3], addressed
points a) and b) above by locating a 21554 evaluation board that suited our needs
perfectly.
2.2 Hardware Architecture
The current hardware architecture of the CiNIC project is comprised of the following
components: a PC-based host computer system, an Intel 21554 evaluation board, an
Intel EBSA-285 evaluation board, a 3Com 3c905C network interface card [NIC], a
Promise Ultra66 EIDE card and a Maxtor 2R015H1 15GB hard drive. As an option, a
second NIC may be attached to the system as well. These components are shown
pictorially in Figure 2-4 and schematically in Figure 2-5. The instrumentation
components consist of two FuturePlus Systems FS2000 PCI bus probe cards, a
Hewlett-Packard 16700A Logic Analysis mainframe, and three Hewlett-Packard
16555D State/Timing modules. (The HP16700A Logic Analysis system and its three
16555D State/Timing modules are not shown.)
Figure 2-4 CiNIC Hardware Architecture

Figure 2-5 Schematic diagram of the CiNIC hardware architecture
2.2.1 Host Computer System
The host computer, whose DNS name is Hydra, is a Dell Dimension XPS T450
desktop computer whose characteristics of interest are itemized in Table 2-1. This
machine was purchased specifically for the CiNIC project, and has five 32-bit PCI
slots on its motherboard instead of the usual four.
Table 2-1 Description of the host computer system

Component: CPU
Characteristics: Intel Pentium-III (Katmai); stepping 03; 450 MHz; 16 KB L1 instruction cache; 16 KB L1 data cache; 512 KB L2 cache; “fast CPU save and restore” feature enabled

Component: Primary PCI backplane
Characteristics: PCI BIOS revision 2.10; 5 expansion slots; Device list:
    00:00.0  Host bridge: Intel Corporation 440BX/ZX - 82443BX/ZX Host bridge (rev 03)
    00:01.0  PCI bridge: Intel Corporation 440BX/ZX - 82443BX/ZX AGP bridge (rev 03)
    00:07.0  ISA bridge: Intel Corporation 82371AB PIIX4 ISA (rev 02)
    00:07.1  IDE interface: Intel Corporation 82371AB PIIX4 IDE (rev 01)
    00:07.2  USB Controller: Intel Corporation 82371AB PIIX4 USB (rev 01)
    00:07.3  Bridge: Intel Corporation 82371AB PIIX4 ACPI (rev 02)
    00:0c.0  Multimedia audio controller: Yamaha Corporation YMF-724F [DS-1 Audio Controller] (rev 03)
    00:0d.0  Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 74)
    00:10.0  Bridge: Digital Equipment Corporation DECchip 21554 (rev 01)
    01:00.0  VGA compatible controller: ATI Technologies Inc 3D Rage Pro AGP 1X/2X (rev 5c)

Component: Memory
Characteristics: 128 MB
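For illustration (this example is not part of the original hardware description), the device list above is in the format produced by the Linux lspci utility, so the presence of the 21554 bridge on the primary PCI bus can be confirmed directly from the host:

    # List every device on the host's primary PCI bus
    # (lspci is part of the pciutils package on Redhat Linux 7.x).
    lspci

    # Show verbose details (BARs, IRQ assignment, etc.) for the 21554
    # nontransparent bridge at bus address 00:10.0, taken from Table 2-1.
    lspci -v -s 00:10.0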
2.2.2 Intel 21554 PCI-PCI Nontransparent Bridge
The Intel 21554 is a nontransparent PCI-PCI bridge that performs PCI bridging
functions for embedded and intelligent I/O (I2O) applications [23]. Unlike a transparent PCI-to-PCI bridge, the 21554 is specifically designed to provide bridging
support between two distinct PCI bus domains (a.k.a., “processor domains”) as shown
in Figure 2-6.
Figure 2-6 Intel 21554 nontransparent PCI-PCI bridge
Briefly, the 21554 passes PCI transactions between the primary and secondary PCI
buses by translating primary PCI bus addresses into secondary PCI bus addresses and
vice versa. Because the 21554 is nontransparent, it presents itself to the primary PCI
bus as a single PCI device. Consequently, the host bridge on the primary PCI bus is
unaware of the presence of the secondary PCI bus on the “downstream” side of the
21554 as shown in Figure 2-7:
Figure 2-7 Primary PCI bus’s view of the 21554 and the secondary PCI bus
Likewise, the host bridge on the secondary PCI bus sees the 21554 as a single PCI
device, and is unaware of the presence of the primary PCI bus on the “upstream” side
of the 21554 as shown in Figure 2-8.
Figure 2-8 Secondary PCI bus’s view of the 21554 and the primary PCI bus
The opacity of the 21554 allows two separate host bridges on two PCI busses to
operate more-or-less independently of each other, while simultaneously providing a
communications path that facilitates the transfer of data between the primary and
secondary PCI address domains.
In addition to the 21554’s PCI-PCI isolation feature, the 21554 can also
provide PCI support services on the secondary PCI bus. For example, the 21554 can
be configured as the PCI bus arbiter for up to nine PCI devices on the secondary PCI
bus. See [3] and [23] for complete coverage of the feature set and capabilities of the
Intel 21554.
2.2.3 Intel EBSA-285 StrongARM / 21285 Evaluation Board
The Intel EBSA-285 is a single-board computer with the form-factor of a “standard
long” PCI add-in card. It is an evaluation board for the Intel SA-110 “StrongARM”
microprocessor and Intel 21285 Core Logic chip set. The board can be configured
either as a PCI add-in card or as a PCI host bridge. In the CiNIC architecture, the
EBSA-285 is configured for host bridge mode (a.k.a., “central function” mode) [24],
and is therefore installed in the “host” slot on the Intel 21554 evaluation board [8]
(see also: Figure 2-4). When configured for host bridge operation, the EBSA-285 may
optionally be configured as the arbiter for the secondary PCI bus. However, when the
EBSA-285 is configured for arbitration, its LED indicators and flash ROM image
selector switch are unavailable for use. Since we wanted to use both the LEDs and the
flash ROM image selector in the CiNIC project, the job of arbitration on the
secondary PCI bus was delegated to the 21554. For complete coverage of the feature
set and capabilities of the EBSA-285, as well as a discussion of the hardware
configuration choices for the EBSA-285 as they pertain to the CiNIC project, refer to
[3], [24] and [25].
2.2.4 3Com 3c905C Fast Ethernet Network Card
3Com’s 3c905C network card (p/n: 3CSOHO100-TX) provides connectivity between
the secondary PCI bus and a 10 Mbps Ethernet or 100 Mbps Fast Ethernet network.
The card’s built-in “link auto-negotiation” feature transparently configures the card
for use with either a 10-Mbps or 100-Mbps Ethernet network.
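As an illustrative aside (not taken from the thesis), the result of the card’s link auto-negotiation can be checked from Linux with the mii-tool utility in the net-tools package, assuming the 3c905C is registered as eth0:

    # Report the negotiated speed and duplex for the first Ethernet device,
    # e.g. "eth0: negotiated 100baseTx-FD, link ok".
    mii-tool eth0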
2.2.5 Promise Ultra66 PCI-EIDE Hard Disk Adaptor
The Promise Ultra66 EIDE card provides connectivity between the secondary PCI
bus and an EIDE hard drive. The Promise Ultra66 EIDE card was chosen because it
is supported by the ARM/Linux BIOS [30] we are using to boot the EBSA-285.
2.2.6 Maxtor 531DX Ultra 100 Hard Drive
The Maxtor 531DX Ultra 100 hard drive (p/n: 2R015H1) is a “higher-reliability”
single head / single platter hard drive. The drive has an Ultra ATA/100 interface that
can transfer up to 100 megabytes per second. The drive also utilizes Maxtor’s Silent
Store™ technology for “whisper-quiet” acoustic performance. The 531DX Ultra 100
product line targets entry-level systems and consumer electronics applications [34].
2.2.7 Intel EBSA-285 SDRAM Memory Upgrade
The Intel EBSA-285 comes from the factory with 16 MB of SDRAM installed. This
is a sufficient amount of RAM for booting a Linux kernel and running a few
programs “on top of” a command shell, but not much more. Software packages
comprised of one or more “large” binary images – e.g., the GNU C library (glibc), the
Linux kernel, the GNU C and C++ compilers (gcc and g++, respectively), etc. –
generally cannot be built on an EBSA-285 with only 16 MB of RAM installed. For
this reason, we decided to increase the amount of SDRAM on the EBSA-285 by
adding a 128 MB DIMM to the existing 16 MB DIMM, to give a total RAM capacity
of 144 MB. See Appendix C for the particulars of this upgrade.
2.2.8 FuturePlus Systems FS2000 PCI Analysis Probe
The FuturePlus Systems FS2000 PCI Analysis Probe and extender card performs
three functions [13]:
• The first is to act as an extender card, extending a module up 3 inches from the motherboard.
• The second is to provide test points very near (less than one inch) to the PCI gold finger card edge, to measure the power and signal fidelity.
• The third is to provide a complete interface between any PCI add-in slot and HP Logic Analyzers. The Analysis Probe interface connects the signals from the PCI Local bus to the logic analyzer inputs.
The FS2000 card comes with inverse assembly software that is loaded onto
the HP 16700A logic analyzer. This software converts captured PCI bus signals into
human-readable textual descriptions of the PCI transactions captured by the analyzer.
When a PCI add-in card is installed in the FS2000 card edge connector, JP2
should have the jumper connected across pins 1 and 2. If no card is installed in the
FS2000 card edge connector, JP2 should have the jumper connected across pins 2 and
3. For complete information regarding the possible jumper configurations on the
FS2000 card, see page 9 of the FS2000 Users Manual [13].
2.3 Software Architecture
From the very beginning, the ability to use the CiNIC with a variety of operating
system environments was a design goal of the CiNIC project. To this end, we focused
our efforts on integrating the CiNIC co-host into two different operating system
environments: Microsoft Windows 2000 and Redhat Linux. Microsoft’s Windows
2000 was chosen because it is representative of the family of “32-bit” Microsoft
Windows operating systems that are currently the predominant operating system on
Intel x86-based desktop PCs. Redhat Linux (or more specifically, the Linux operating
system in general) was chosen primarily because it is an open source operating
system – i.e., the source code for both the operating system and the software tools that
are needed to build the operating system is freely available via the Internet.
Off-loading the TCP/IP protocol stack from the host machine onto the EBSA-285 co-host was just one of the jobs we had in mind for the EBSA-285 platform. Another
of the design goals of the CiNIC project was extensibility. Specifically, we wanted
the EBSA-285 to provide value-added services coincident to the offloaded TCP/IP
protocol stack. For example, the TCP/IP stack on the EBSA-285 might be configured
to support tasks such as web caching, encryption, firewall protection, traffic
management, QOS support, and so on.
2.3.1 Integrating CiNIC with Microsoft Windows 2000
Integrating the CiNIC co-host into the Windows 2000 operating system environment
is (not surprisingly) a non-trivial task. The CPNPRG’s original design, described in
Peter Huang’s master’s thesis [22], recommends replacing the existing Winsock 2
DLL (WS2_32.dll) with two custom software packages: a CiNIC-specific Winsock 2
DLL and a host-CiNIC communication interface / device driver.
Figure 2-9 Original Windows 2000 software architecture
This is essentially a “brute force” approach in that it completely eliminates Windows’
own networking subsystem and replaces it with a CiNIC-specific subsystem. Clearly,
this design is not feasible for the “real world” since it renders useless all non-CiNIC
network devices – whether these be virtual devices (e.g., “loopback” devices) or real
(e.g., Token Ring, X.25, etc.).
Fortunately, the task of integrating the CiNIC into the Windows 2000
environment is solvable by other means. The Windows Sockets 2 Architecture (a.k.a.,
the Windows Open System Architecture [WOSA]) is Microsoft’s networking
infrastructure for their 32-bit Windows operating systems. The WOSA defines a pair
of service provider interfaces (SPI) at the bottom edge of Microsoft’s Winsock 2 DLL
that developers can use to integrate vendor-defined transport and/or name space
providers into the existing Win32 networking subsystem.
Figure 2-10 Microsoft Windows Open System Architecture [37]
The WOSA gives user-layer applications simultaneous access to multiple
service providers (i.e., transport and/or name space providers) via Microsoft’s
Winsock 2 DLL. The Win32 application simply asks the Winsock 2 DLL to enumerate
the list of available service providers, and then the application connects itself to the
desired provider.
Microsoft’s Winsock 2 SDK also includes a Win32 applet called sporder.exe and a
companion DLL named sporder.dll that exports a set of functions device vendors
can use to reorder the list of available service providers. Software installation tools
can leverage the sporder.dll interface to programmatically reorder the available
service providers to suit their device’s needs. Of course, end users can also run the
sporder.exe applet to establish, for example, a particular TCP/IP protocol stack as the
default TCP/IP provider if more than one such stack is present on the system.
Taking the WOSA feature set into consideration, a “new and improved”
design for the CiNIC software architecture would probably look something like
Figure 2-11. Specifically, it would: a) leave Microsoft’s Winsock DLL in place, and,
b) create a CiNIC-specific transport service provider DLL that would interface with
the Winsock 2 SPI, and c) use the sporder.exe applet and sporder.dll library to make
the CiNIC transport service the default transport service provider.
Figure 2-11 “WOSA friendly” Windows 2000 software architecture
2.3.2 Integrating CiNIC with Redhat Linux 7.x
Figure 2-12 shows the pre-CiNIC architecture of a typical Redhat Linux 7.x host. In
very general terms, when a user-layer application wishes to send (or receive) data
over a network connection, it does so by calling the appropriate functions in a host-specific (i.e., custom) “socket” library. The functions in the socket library call the
appropriate functions in the host’s TCP/IP protocol stack, which then perform
whatever processing is necessary on the user-defined payload. And finally, the
TCP/IP stack utilizes the NIC’s device driver routines to send the data out on the
wire. Network receive operations are, in essence, performed in the reverse order.
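To make this call path concrete, the strace utility can be used on a stock Linux host to watch a user-layer application cross into the kernel’s networking subsystem; these are exactly the calls that the CiNIC architecture reroutes to the co-host. The example below is a hedged illustration (the choice of wget and the URL are arbitrary, not part of the thesis testbed):

    # Trace only the network-related system calls (socket, connect,
    # send/sendto, recv, etc.) made during a simple HTTP fetch. Each line
    # of output is a user-space request entering the kernel's Socket API.
    strace -f -e trace=network wget -q -O /dev/null http://www.example.com/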
Figure 2-12 Redhat Linux 7.x networking architecture
With the CiNIC architecture, the host’s own networking subsystem is
effectively replaced by the networking subsystem on the parasitic co-host. In other
words, the functional blocks comprising the shaded region of Figure 2-12 – i.e.,
• The low-level interface between the System Calls block and the top of the TCP/IP Stack block
• The entire TCP/IP protocol stack
• The NIC’s device driver
• The NIC itself
– are in essence transplanted from the host system to the EBSA-285 co-host, as shown
in Figure 2-13.
Figure 2-13 Overview of the CiNIC architecture
Of course, a certain amount of “glue logic” is required to support such a
transplant. Specifically, any host-side system call that formerly made calls into the
host-side TCP/IP protocol stack must now be redesigned to communicate with the
networking subsystem on the co-host, and vice versa. As shown in Figure 2-13, this
logic is implemented as two new functional blocks on both the host and co-host sides
– i.e., a Protocol block and an xxxmem_p module block. The two xxxmem module
blocks – i.e., hostmem_p on the host side and ebsamem_p on the co-host side – are a
complementary pair of device drivers, designed and implemented by Mark
McClelland [35], that establish a low-level communications pathway between the
primary and secondary PCI busses via the Intel 21554 PCI-to-PCI nontransparent
bridge. The host and co-host Protocol blocks were designed and implemented by Rob
McCready [36]. These two protocol blocks implement the networking data path as
follows. The host-side Protocol block intercepts (a.k.a., “hijacks”) all network-related
system calls made by the host-side Socket API. The host-side Protocol block reroutes
these system calls to its complementary block on the co-host side, the Protocol
Handlers block, using the PCI-based communications channel established by Mark
McClelland’s hostmem_p and ebsamem_p modules. Note that in addition to
interfacing with the underlying system calls on the co-host side, the Protocol
Handlers block on the co-host also provides a mechanism for shimming user-defined
service providers into the networking data path. The idea here is that these custom
service providers will perform value added processing on the data as it leaves and/or
enters the system.
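As a purely hypothetical illustration of how the two drivers would be brought into service (the module file names, paths, and absence of module parameters are assumptions, not details taken from [35]), each side loads its half of the communications channel as a Linux 2.4 kernel module:

    # On the i686 host: load the host-side half of the 21554 channel.
    insmod ./hostmem_p.o

    # On the EBSA-285 co-host: load the complementary co-host driver.
    insmod ./ebsamem_p.o

    # On either side, confirm that the module registered successfully.
    lsmod | grep mem_p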
2.3.3 Service Provider Example: Co-host Web Caching
As an example of the service provider shim mentioned in the previous section,
assume for the moment that someone wants to implement a web caching utility. On a
non-CiNIC host, the architecture for a web caching utility might look something like
Figure 2-14.

Figure 2-14 Example of a web caching proxy on a non-CiNIC host
With this design, a user manually configures her web browser – e.g., Netscape
Navigator – to use a custom web caching proxy. (It is assumed that the web caching
proxy is running on the same machine as the web browser software.) In such a design,
the web browser would probably establish a communications channel between itself
and the proxy via some sort of loopback connection. Consequently, when the browser
sends a web page request to the network it is intercepted by the web caching proxy
before it reaches the network. The proxy checks to see if the requested web page is
currently stored in the local system’s web page cache. If it is, the proxy then checks
whether the locally cached page is current or not. If the requested page is not present
in the local cache, or a newer version of the requested web page exists, the web caching
proxy retrieves the requested web page and stores it in the system’s local cache.
Finally, the proxy copies the web page from the local cache to the browser software
where it is then displayed. This design was, in fact, implemented by Jenny Huang in
her Master’s Thesis. Refer to [21] for the details of her implementation.
With the CiNIC architecture, the proxy server would be replaced with a web
caching service provider on the co-host side. The service provider would be shimmed
into the data path via the Protocol Handlers block as shown in Figure 2-13 above,
and schematically would probably look something like Figure 2-15. As shown in
Figure 2-15, HTTP requests from the host are demultiplexed from the host’s
outbound network traffic. The filtered out HTTP requests are then passed to the web
caching service provider which performs the required cache lookup, cache refresh,
cache update, or whatever. Note too that the output from the web caching service
provider could theoretically be piped into yet another service provider – e.g., a
content filter that blocks access to specific web sites (e.g., a “Net Nanny” type
content filter for use in elementary schools).
Figure 2-15 CiNIC-based web caching service provider
Chapter 3
Development / Build Environment

3.1 Overview of the Software Infrastructure
This chapter discusses the software infrastructure of the CiNIC project.
Specifically, I discuss my work creating the i686-ARM and ARM-ARM versions of
the GNU software development tool chains to facilitate systems-level and
application-level software development for (and in) the ARM/Linux environment on
the EBSA-285. I then describe my implementation of Russell King’s ARM/Linux
basic I/O subsystem [BIOS] that boots the ARM/Linux kernel on the EBSA-285. The
chapter ends with a brief description of the Redhat 7.x / System V user environment I
created on the EBSA-285.
3.2 Creating an i686–ARM Tool Chain

3.2.1 Caveat Emptor
Creating an i686-ARM (pronounced “i686 cross ARM”) version of the GNU
software development tool chain [GNU/SDTC] is supposed to be a relatively
straightforward process – at least, that’s the claim made by the folks that create the
packages that comprise the GNU tool chain. My experiences with creating a
functional i686-ARM tool chain were problematic at best – particularly when
building the GNU C compiler gcc, and the GNU C library, glibc. To give some idea
of just how problematic this was for me, I worked full time for more than three
months before I finally succeeded at building my first, fully functional, i686-ARM
cross tool chain. Unfortunately, this seems to be the usual state of affairs for neophyte
developers of the GNU tool chains. So if you are feeling brave and want to try
building a GNU software development tool chain for yourself, don’t be surprised
when you run into major problems doing this; you are definitely not alone.
With regard to building the GNU tool chains, I would like to pass along the
following comments and suggestions.
• Use only a high-end computer platform to build the sources. By “high-end” I mean a computer that has a very high speed CPU (preferably an SMP machine with multiple “very high speed CPUs”!), lots of RAM, and lots of free disk space. The CPNPRG purchased a high-end Dell Precision 420 workstation specifically for the task of building (and rebuilding, and rebuilding, etc.) the GNU tool chains and Linux kernel. This machine has two 750 MHz Pentium-III CPUs, 512 MB of RAM, and two 18 GB hard drives, and is perfect for the task at hand. Specifically, this particular Dell workstation is able to build the entire GNU tool chain in about one hour. Prior to having this Dell machine, the build process took more than 6 hours from start to finish. This is an altogether intolerable (and frustrating!) situation because a particular build sequence may run for a few hours before crashing, and depending on the reason for the crash you might need to restart the entire build from scratch.
• Do not hesitate to use the appropriate newsgroups and mailing lists to ask for help when you feel hopelessly stuck. The GNU home page (www.gnu.org) has links to the GNU C compiler and GNU C library home pages, and from there you can access many useful help resources (e.g., online documents, links to mailing lists, etc.).
• Become familiar with a scripting language such as Bash or Perl. The steps one must take to create a complete tool chain are next to impossible to perform by typing the required commands by hand at a shell command prompt. It is much easier – and therefore more efficient – to write the required command sequences into a text file (using a text editor like vim or emacs) and then to have a scripting language read those commands from the text file and execute them on your behalf.
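To make the scripting suggestion concrete, the sketch below shows one way the configure/build/install steps for a single tool-chain package might be captured in a small Bash script so they can be rerun verbatim after a failed build. It is only an illustration: the package version, directory paths, and installation prefix shown here are assumptions, not the scripts actually used for this project.

    #!/bin/bash
    # build-binutils.sh -- rebuild the i686-ARM binutils from scratch.
    set -e                            # stop at the first error

    SRC=$HOME/src/binutils-2.10.1     # unpacked source tree (assumed)
    BUILD=$HOME/build/binutils        # separate build directory
    PREFIX=/i686-arm                  # install outside the native tool chain

    rm -rf "$BUILD" && mkdir -p "$BUILD"
    cd "$BUILD"

    "$SRC"/configure --target=arm-linux --prefix="$PREFIX" 2>&1 | tee configure.log
    make all     2>&1 | tee make.log
    make install 2>&1 | tee install.log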
3.2.2 Getting Started
Before one can develop software on an ARM platform, one first needs
an ARM-compatible software development tool chain – i.e., an ARM-compatible
assembler, an ARM-compatible C compiler, an ARM-compatible linker / loader, and
so on. Note, however, that the only way to build an ARM-compatible tool chain is to
have an existing ARM-compatible tool chain! Fortunately, the GNU folks have
solved this “chicken or the egg” conundrum by designing their tool chain so that one
can build a “cross” tool chain that runs on a non-ARM host – e.g., an Intel i686 host –
but produces executable programs that run on an ARM host. This cross tool chain,
then, is used to create a “native” ARM-compatible tool chain – i.e., a tool chain that
can be used directly within the ARM environment.
Each software package in the GNU/SDTC contains a “read me” file named
INSTALL that provides instructions for configuring, building, and installing the
software in that particular package. (This file is typically located in the top-level
directory of the package’s source tree.) So the starting point, when building these
GNU software packages, is reading and understanding the information in the
INSTALL file. Be aware, too, that the INSTALL file typically contains GNU-specific
jargon that seems unintelligible to neophyte package builders. So to fully understand
how to configure, build, and install the software in a GNU software package, you
must also understand the idioms used by the GNU software developers.
Three GNU-derived jargon terms one must understand before attempting to
build a GNU software package – and in particular the packages that comprise the
GNU/SDTC – are build machine, host machine, and target machine. These three
terms are used at the very beginning of the build process during the software
configuration phase. Note that if you incorrectly specify any of these machine types
during the configuration phase, one of two outcomes will occur: a) the build process
will fail, or b) the resulting executables will not work. The following descriptions,
then, are intended as supplements to the descriptions found in the GNU documents:
• The build machine is the machine you are using “right now” to convert source code files into executable images (or libraries, etc.).
•
The build machine creates executables that run on a particular host machine.
Note that the host machine need not be the same machine type as the build
machine type. For example, an i686 machine can create executable images
(e.g., a C compiler, a game, a Linux kernel, etc.) for a PowerPC machine. So
the build machine, in this case, is specified as i686, and the host machine is
specified as PowerPC.
•
Assume the build machine creates an executable image that is used on a
specific host machine. Assume further that the executable image is some sort
of source code translator. (For example, an i686 build machine creates a
C compiler (e.g., GNU gcc) for a PowerPC host machine.) When the source
code translator is used on the host machine, it produces executable images that
run on a target machine. (For example: The build machine creates a C
compiler that runs on a PowerPC host machine. When the C compiler runs on
the PowerPC, it translates source code files into executable images that run on
a StrongARM target machine.)
In addition to specifying the build, host, and target machine types, one must
also specify, during the configuration process, the installation path(s) for the package
that’s being built. For example, after the build machine creates the executable images
for the GNU C compiler, these executable images must be installed on the host
machine. The directories in which these executable images are to be installed on the
host machine must be specified during the configure process.
For additional information on the creation and installation of the software
packages that comprise the GNU/SDTC, the Internet is an invaluable resource. The
starting point for GNU-related software packages should probably be the GNU web
site. A complete list of Linux-related “how to” documents can be found at the Linux
Documentation Project and the Tucows websites, just to mention a few. There are
also a number of Linux-related newsgroups on Usenet that provide peer-to-peer
support for Linux users. A brief list of some of these online resources is provided in
Figure 3-1.
World Wide Web
http://www.gnu.org/
http://www.linuxdoc.org/
http://howto.tucows.com/
Usenet Newsgroups
news://news.calpoly.edu/gnu.gcc.help
news://news.calpoly.edu/comp.os.linux.development.apps
Figure 3-1 Some on-line help resources for Linux software developers
3.2.3
Creating the i686-ARM Development Tools with an i686 host
As shown in Figure 3-2, an i686 build machine will be used to create executable
images of the GNU/SDTC (from the GNU/SDTC sources). The resulting
GNU/SDTC images will subsequently be used on an i686 host machine to generate
executable images that run on a StrongARM target machine. In other words, the goal
is to create an i686-ARM cross compiler – i.e., a compiler that runs on an i686
machine, but creates executable images that run on a StrongARM machine.
[Diagram — Step #1 (B: i686, H: i686, T: ARM): the i686 PC uses its native i686→i686 tools and the GNU/SDTC sources to produce the i686→ARM tool chain.]
Figure 3-2 Using an i686 host to create the i686-ARM tool chain
The starting point in this process is the GNU binutils package. This package
creates the “binary utilities” that are commonly found in any software development
tool chain – e.g., a linker, an assembler, a librarian, etc. Note that you DO NOT want
to install the i686-ARM tools in the same directories that currently contain the host’s
versions of the GNU/SDTC. If you inadvertently overwrite the native i686-i686 tool
chain with the i686-ARM tool chain, you will lose the ability to create programs that
run on an i686 host! So during the configuration step (see the INSTALL file that
accompanies the binutils package), be sure to specify an installation path that is
outside the directory tree that currently holds the host’s own software development
tool chain. I recommend creating a directory named “/i686-arm” in the root of the
host’s file system, and then specifying this directory as the installation directory when
configuring the software packages that comprise the GNU/SDTC. I also recommend
setting the permissions on this directory to 777 – i.e., all users have read, write, and
execute access to this directory, as shown in Figure 3-3. Without these permissions,
one must be logged in as the super user “root” to perform the installation step, and
installing the i686-ARM binaries as root is about the best way to really mess up the
host system if, for some reason, the installation procedure decides to overwrite the
host’s own tool chain. (Note that if you are not logged in as the super user root, you
won’t have the necessary permissions to overwrite the system’s tool chain. So if you
are not logged in as root, and you perform an installation step that tries to overwrite
the host’s own tool chain, the install step will simply fail, and will not damage the
host’s tool chain.)
NOTE
The i686-ARM tool chain will eventually be used to create an ARM-ARM
tool chain. This tool chain must be (temporarily) installed on the i686 host
during the build process, so while we’re making directories, we might as well also
create a directory called “/arm-arm” in which to install the ARM-ARM tool chain
(see Figure 3-3).
$ su
Password: <root’s password here>
# mkdir -m 777 /i686-arm
# mkdir -m 777 /arm-arm
# exit
$
Figure 3-3 Creating the installation directory for the i686-ARM tool chain
Assuming a directory named “/i686-arm” has been created on the i686
host as shown in Figure 3-3, Figure 3-4 shows the applicable arguments one should
use when configuring the binutils package.
% ./configure \
    --build=i686-pc-linux-gnu \
    --host=i686-pc-linux-gnu \
    --target=arm-ebsa285-linux-gnu \
    --prefix=/i686-arm/usr \
    --exec-prefix=/i686-arm/usr \
    ...
Figure 3-4 Configure options for the i686-ARM binutils and gcc packages
After configuring the binutils package, build and install the package according
to the instructions in the INSTALL file. Following the creation and installation of the
binutils package, the next step in the process is the creation of an i686-ARM C
compiler. After reading the INSTALL file that accompanies the GNU gcc source
package, use the same configure settings shown in Figure 3-4 when configuring the
GNU gcc package. After configuring, build and install the gcc sources as instructed
by the gcc package’s INSTALL file.
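For reference, the complete round for one package looks roughly like the sketch below. The version number is a placeholder, and the separate build directory is simply one approach the GNU documentation generally supports; the same round is then repeated for the gcc package using the options from Figure 3-4.

$ tar xzf binutils-<version>.tar.gz
$ mkdir binutils-build && cd binutils-build
$ ../binutils-<version>/configure \
      --build=i686-pc-linux-gnu \
      --host=i686-pc-linux-gnu \
      --target=arm-ebsa285-linux-gnu \
      --prefix=/i686-arm/usr \
      --exec-prefix=/i686-arm/usr
$ make
$ make install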
3.2.4
Creating the i686-ARM C Library with an i686 host
Creating an i686-ARM version of the GNU C compiler does not also create the
standard C library. So the goal at this point is to use the i686-ARM cross compiler to
build a C library that is compatible with the i686-ARM C compiler. The i686-ARM C
compiler and i686-ARM standard C library can then be used in conjunction with the
i686-ARM binutils programs to create, on the i686 host, executable programs that run
on a StrongARM target machine.
In order to create the i686-ARM C library, the i686-ARM C compiler must
have access to the header files for an ARM/Linux kernel. This requirement may seem
strange at first, but in fact it makes good sense. Note that these header files describe
in complete detail the architecture of the underlying hardware. Without this
information, it would be virtually impossible to create a C library that is compatible
with the EBSA-285. Since the header files for the Linux kernel do not yet exist (or
perhaps do exist but are in some incomplete state), they apparently must be created at
this point. Fortunately, the task of building a Linux kernel and/or the kernel’s header
files does not depend on the existence of a standard C library. So the Linux kernel
and its header files can be built even if a standard C library does not yet exist.
NOTE
The accompanying CD-ROM contains a Bash shell script and additional
documentation that demonstrates how to use the i686-ARM C compiler
to create the ARM/Linux kernel header files.
After creating the ARM/Linux kernel header files, the files must be copied
from the Linux kernel source tree to the /usr/include directory in the “/i686-arm”
tree as shown in Figure 3-5.
$ cp -a /<path_to_ARM/Linux_sources>/include/linux \
> /i686-arm/usr/include
Figure 3-5 Copying the ARM/Linux headers into the i686-ARM directory tree
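For completeness, one plausible way to generate these header files – the actual script on the accompanying CD-ROM may differ in its details – is to configure the ARM/Linux kernel sources with the cross tools, which creates the version- and configuration-dependent headers, and then copy both the linux headers (as in Figure 3-5) and the architecture-specific asm headers into the /i686-arm tree, since the glibc build typically expects both. The source path and configuration targets below are assumptions.

$ cd /usr/src/arm-linux                                 # assumed location of the ARM/Linux sources
$ make ARCH=arm CROSS_COMPILE=arm-ebsa285-linux-gnu- menuconfig
$ make ARCH=arm CROSS_COMPILE=arm-ebsa285-linux-gnu- dep
$ cp -a include/linux   /i686-arm/usr/include/
$ cp -a include/asm-arm /i686-arm/usr/include/asm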
Now that the i686-ARM tool chain has access to the ARM/Linux header files,
one can create an i686-ARM compatible C library. Keep in mind that the i686-ARM
C library must contain ARM object code – i.e., machine code that executes on an
ARM CPU. The i686-ARM C library must not contain object code that executes on
an i686 CPU. When performing the software configuration step for the GNU glibc C
library package, then, the host machine type must be specified as ARM and not i686
(see Figure 3-6). And as always, do not forget to specify the “/i686-arm” directory
as the root installation directory for the library!
% ./configure \
    --build=i686-pc-linux-gnu \
    --host=arm-ebsa285-linux-gnu \
    --target=arm-ebsa285-linux-gnu \
    --prefix=/i686-arm/usr \
    --exec-prefix=/i686-arm/usr \
    ...
Figure 3-6 Configure options for the GNU glibc C library
3.2.5
Creating the ARM-ARM Tool Chain on an i686 host
At this point the /i686-arm directory contains an i686-ARM tool chain that can (at
least in theory!) create executable images that run on an ARM processor in the Linux
environment. Keep in mind that many of the i686-ARM tool chain executables in
the /i686-arm directory have the prefix “arm-ebsa285-linux-gnu-” attached to
them – e.g., arm-ebsa285-linux-gnu-gcc, arm-ebsa285-linux-gnu-ld, etc. So if you
want to build ARM executable images, you must use the arm-ebsa285-linux-gnu-*
executables to do so. You should also put the /i686-arm/bin directory in your
shell’s PATH environment variable so that the cross tools can be found during the
build process. Figure 3-7 shows how to do this if you are using the Bash shell.
$ PATH="/i686-arm/bin:$PATH"
Figure 3-7 Adding the i686-ARM tool chain path to the ‘PATH’ environment variable
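A quick sanity check at this point is to cross compile a trivial source file and confirm that the result really is ARM code. The hello.c file below is a throwaway example created only for the test; the file command reports the architecture of the resulting object file.

$ cat > hello.c << 'EOF'
#include <stdio.h>
int main(void) { printf("Hello from the EBSA-285\n"); return 0; }
EOF
$ arm-ebsa285-linux-gnu-gcc -c hello.c
$ file hello.o        # should report ARM object code, not i686 object code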
In practice, the i686-ARM tool chain can build most of the software packages
one would find on a “typical” Linux host. There are, however, a few software
packages that cannot be built with the i686-ARM tool chain (or with any cross tool
chain, for that matter). These packages must be built on the same host that will
ultimately execute them. A typical example is where a package’s build sequence
creates one or more executable images that test the capabilities of the underlying
system (e.g., checking for overflow errors during integer calculations). If this package
were created with the i686-ARM cross compiler, the compiler would generate ARM
executables (not i686 executables), and these ARM executables clearly will not run
on the i686 host that’s performing the build. Such a package, then, must be built
natively – i.e., the compiling and linking must be performed on the same platform that
will ultimately execute the application program(s) that are being created.
The next logical step, then, is to create a “native” ARM-ARM tool chain –
i.e., a tool chain that is installed on an ARM platform (not an i686 platform), that runs
on an ARM platform (not on an i686 platform), and that creates executable images
for an ARM platform.
[Diagram — Step #2 (B: i686, H: ARM, T: ARM): the i686 PC uses the i686→ARM cross tool chain and the sources to produce the ARM→ARM tool chain.]
Figure 3-8 Using an i686 host to create an ARM-ARM tool chain
The native tool chain is initially created on an i686 host using the i686-ARM
tool chain (see Figure 3-8). The i686-ARM-generated ARM-ARM tool chain is then
installed on the EBSA-285 where it is then used to rebuild the entire ARM-ARM tool
chain. (See the documentation that comes with the GNU C compiler for complete
information on this process. This documentation also describes a test procedure one
can use to verify the correctness [to some degree] of the C compiler that is built using
the i686-ARM-generated C compiler on the ARM platform.)
The steps one takes to build ARM-ARM versions of the GNU binutils, gcc,
and glibc packages using the i686-ARM cross compiler are essentially the same as for
the i686-ARM versions of these tools – i.e., the software is configured, built, and
installed, in that order. The configure step is a bit different, though, because one now
wants to create executables (on the i686 build machine) that will execute on an ARM
host. Furthermore, the resulting ARM-ARM tool chain should not be installed in the
same directory tree as the i686-ARM tool chain, as this would replace the i686-ARM
tool chain with the ARM-ARM tool chain. The applicable settings when configuring
for ARM-ARM versions of the GNU binutils, gcc, and glibc packages are shown in
Figure 3-9.
% ./configure \
    --build=i686-pc-linux-gnu \
    --host=arm-ebsa285-linux-gnu \
    --target=arm-ebsa285-linux-gnu \
    --prefix=/arm-arm/usr \
    --exec-prefix=/arm-arm/usr \
    ...
Figure 3-9 Initial configure options when building the ARM-ARM tool chain
3.2.6
Copying the ARM-ARM Tool Chain to the EBSA-285
After building the ARM-ARM tool chain on the i686 host, the ARM-ARM tool chain
will be installed on the i686 host in the “/arm-arm” subdirectory. The contents of
this subdirectory must now be copied over to the EBSA-285’s file system. The easiest
way to do this is to create a subdirectory on the i686 host on which to mount the
EBSA-285’s file system. Assuming the EBSA-285’s file system resides on a host
named pictor, and assuming the NFS server on pictor is configured so that multiple
hosts may mount the EBSA-285’s file system, create a directory named
/ebsa285fs on the i686 host as shown in Figure 3-10.
$ su
Password: <root’s password>
# mkdir -m 777 /ebsa285fs
Figure 3-10 Create a mount point on the i686 host for the EBSA-285’s file system
While still logged in on the i686 host as the super user ‘root’, edit the file
/etc/fstab and add the line shown in Figure 3-11 (if this line is not already in the
file, of course).
NOTE
The text in Figure 3-11 is actually one long line, and not two separate
lines as shown. So when typing the text into the /etc/fstab file on the
i686 host, be sure to type all of the text in Figure 3-11 on a single line.
Also, there is no space between the comma that follows the word
noauto and the beginning of the word users on the second line – i.e.,
this sequence should be typed as ‘…,exec,noauto,users,async,…’.
pictor:/ebsa285fs /ebsa285fs nfs rw,suid,dev,exec,noauto,
users,async,rsize=8192,wsize=8192,nfsvers=2,hard 0 0
Figure 3-11 /etc/fstab entry for /ebsa285fs mount point on i686 host
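The server side of this arrangement is not shown in the figures, but for reference a corresponding export entry on pictor might look like the hypothetical /etc/exports line below. The subnet and option list are assumptions; no_root_squash is included because the installation steps are performed as the super user on the client.

/ebsa285fs   192.168.24.0/255.255.255.0(rw,no_root_squash)

After editing /etc/exports on pictor, the export list is re-read by running exportfs -ra as root on pictor.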
While still logged in on the i686 host as the super user ‘root’, mount the EBSA-285’s
file system on the i686 host and copy the ARM-ARM tool chain from the i686 host to
the EBSA-285’s file system as shown in Figure 3-12.
# mount /ebsa285fs
# cp -a /arm-arm/* /ebsa285fs/
# cp -a /arm-arm/.[^.]* /ebsa285fs/
Figure 3-12 Copy the ARM-ARM tool chain to the EBSA-285’s file system
The final step in the installation process is the creation of some symbolic links
on the EBSA-285’s file system. Note that some packages hard code the name of the
installation directory into the executable programs the package creates (e.g., the GNU
gcc package does this). Since the install directory was specified as
/arm-arm/<whatever> during the configure step, the next step is to: a) create a
directory named /arm-arm in the root of the EBSA-285’s file system, and b) create
some appropriate symlinks within that directory, as shown in Figure 3-13.
# cd /ebsa285fs
# if [ ! -d arm-arm ]; then mkdir arm-arm; fi
# cd arm-arm
# for item in $(ls ..); do
>   if [ -d "../${item}" ]; then
>     ln -s ../${item} ${item}
>   fi
> done
# rm -f arm-arm
Figure 3-13 Create symlinks after installing from i686 host to EBSA-285’s file system
Now that the ARM-ARM tool chain is installed on the EBSA-285’s file
system, you probably need to log on to the EBSA-285 as the super user ‘root’ (e.g.,
via a telnet session) and use the ldconfig command to update the necessary links and
cache files for the GNU run-time (a.k.a., “dynamic”) linker (see Figure 3-14).
[jfischer@fornax jfischer]$ telnet ebsa285
Trying 192.168.24.96...
Connected to ebsa285.
Escape character is '^]'.
Cal Poly Network Performance Research Group
Kernel 2.4.3-rmk2 on an armv4l
ebsa285 login: root
Password:
[root@ebsa285: /root]$ ldconfig
Figure 3-14 Using the ldconfig command to update the EBSA-285’s dynamic linker
If the ldconfig command does not yet exist on the EBSA-285, create it on the
i686 host using the i686-ARM cross compiler, and then copy the ldconfig executable
image from the i686 host to the /usr/bin/ directory on the EBSA-285’s file system
(see Figure 3-15).
# cd <path_to_ldconfig_sources>
# ./configure <etc...>
# make
# mount /ebsa285fs
# cp -a ./ldconfig /ebsa285fs/usr/bin/
# umount /ebsa285fs
Figure 3-15 Create ldconfig on the i686 host and install it on the EBSA-285
3.2.7
Rebuild the ARM-ARM Tool Chain on the EBSA-285
Believe it or not, the last step in the creation of the ARM-ARM tool chain is to
rebuild the entire tool chain using the i686-ARM-generated ARM-ARM tool chain
(i.e., using the tool chain that is currently installed in the “/arm-arm/usr/bin”
subdirectory on the EBSA-285’s file system); see Figure 3-16.
[Diagram — Step #3 (B: ARM, H: ARM, T: ARM): the EBSA-285 uses the ARM→ARM tool chain and the sources to rebuild the ARM→ARM tool chain natively.]
Figure 3-16 Recreation of the native ARM-ARM tool chain
Figure 3-17 shows the applicable configuration settings for building the GNU
binutils, gcc, and glibc packages on the ARM with the i686-ARM-generated ARM-ARM
tool chain. Note that the build, target, and host machines are all three specified
as ARM machines in this case. Also note that the install step should install the
ARM-ARM binaries in the /usr/bin/ subdirectory this time, not in the /arm-arm/usr/bin/
subdirectory.
% ./configure \
    --build=arm-ebsa285-linux-gnu \
    --host=arm-ebsa285-linux-gnu \
    --target=arm-ebsa285-linux-gnu \
    --prefix=/usr \
    --exec-prefix=/usr \
    ...
Figure 3-17 Configure options when rebuilding the ARM-ARM tool chain
3.3
3.3.1
System Environment
EBSA-285 BIOS
When the EBSA-285 ships from the factory, the first two banks of its Flash ROM
contain the Angel boot / monitor / debugging utility (bank 0) and a self-test utility
(bank 1). The Angel debugger is designed for use with ARM Ltd.’s Software
Development Toolkit [ARM/SDT] and works quite well in this capacity.
Unfortunately, the Angel boot loader is not designed to function as a boot
loader for the Linux operating system. To remedy this situation, Russell King (“the”
ARM/Linux guru) created a custom, Linux-based BIOS for the EBSA-285 that is
“Linux kernel friendly.” King’s BIOS can either replace the Angel boot loader
outright, or it can be installed alongside Angel in the EBSA’s Flash ROM. The
ARM/Linux BIOS [A/L-BIOS] is open source and available from King’s FTP site
[30]. The documentation that accompanies the A/L-BIOS sources describes how to
configure and build the A/L-BIOS for your specific needs. Since the A/L-BIOS
executes on the EBSA-285, the BIOS’s sources (apparently) must be compiled with
the i686-ARM cross compiler or the native ARM-ARM compiler.
Before building the A/L-BIOS you must edit the Makefile and specify the
Flash ROM bank number that you will be loading the A/L-BIOS into. And, of course,
after you create the A/L-BIOS image you must load the image into the Flash ROM
bank number specified in the Makefile. If you don’t do this, the A/L-BIOS will not
work. The documentation that accompanies the sources discusses this requirement
(and others) in detail.
After creating an executable image of the A/L-BIOS, the next step is to store
this image in the EBSA-285’s Flash ROM. Prepare the EBSA-285 for this procedure
by performing the following steps:
• If the EBSA is powered on, remove its power source and wait a few
seconds for the power supplies to completely die down.
• Remove the EBSA from its host and lay the board down on a static-free
surface.
• Record the positions of jumpers J9, J10, and J15 pins 4-5-6 (see Figure
3-18 for the locations of these three jumpers).
• Reconfigure the EBSA’s jumpers as shown in Figure 3-18.
Specifically, the EBSA-285 must be configured for “add-in” card
mode (jumpers on J9 and J10 pins 1-2), and the SA-110
microprocessor must be held in “blank programming mode” by
placing a jumper on J15 pins 5-6 [24].
[Diagram — locations of jumpers J9, J10, and J15 (pins 5-6) on the EBSA-285.]
Figure 3-18 EBSA-285 jumper positions for programming the Flash ROM [24]
After configuring the EBSA’s jumpers as shown in Figure 3-18, make certain
the MS-DOS host is turned off. If the MS-DOS host is currently turned on, turn it off
and wait a few seconds for the power supplies to die off. Install the EBSA-285 in any
available PCI slot in the MS-DOS host machine and then turn on the DOS host. Make
certain the host boots the MS-DOS operating system.
NOTE
The Flash Management Utility [FMU] software will not work if the host
machine boots a non-MS-DOS operating system (e.g., Linux or Microsoft
Windows).
NOTE
The FMU software is installed when the ARM/SDT is installed. So you
may need to install the ARM/SDT software in order to have the FMU
software available to you.
When the DOS prompt appears, navigate to the directory where the FMU
software is installed. Note that the FMU application is a 32-bit, protected mode DOS
program, and not a 16-bit “real mode” DOS program. Therefore, you must invoke the
DOS/4GW 32-bit DOS protected mode emulation utility dos4gw.exe before running
the FMU application. For example:
C:\FMU> dos4gw fmu [enter]
// {blah blah text here...}
FMU>
The dos4gw command starts the 32-bit protected mode DOS emulator, which in turn
launches the FMU application program, FMU.EXE.
When the “FMU>” command prompt appears, you can get a listing of the
FMU command set by typing a question mark ‘?’ and pressing the ENTER key. To
learn how each of FMU commands is used, refer to the chapter titled “Flash
Management Utility” (chapter 7) in [24]. A brief description of some of these
commands is provided in the next few paragraphs.
The FMU’s List command lists the Flash ROM image numbers that currently
have programs in them (and, by omission, it also identifies the images that currently
do not have programs in them).
If you want to replace an existing image with a new image, you must first
delete the existing image with the FMU’s “Delete <image-number>” command.
The FMU’s Program command is explained in [24], but the following
example shows how one might use this command. This example replaces the contents
of image #5 with a BIOS image that is stored on a floppy disk:
FMU> delete 5 [enter]
Do you really want to do this (y/N)? yes
Deleting flash blocks: 5
Scanning Flash blocks for usage
FMU> program 5 ARM/Linux-BIOS A:\bios 5
Writing A:\bios into flash block 5
// yadda yadda, lotsa output here...
#######################################
Scanning Flash blocks for usage
[enter]
FMU>
When the “FMU>” command prompt reappears, the programming step is complete.
Exit the FMU by issuing the Quit command at the “FMU>” command prompt. Turn
off the MS-DOS host and wait a few seconds for the power supplies to die out.
Remove the EBSA-285 from the DOS host, lay it down on a static-free surface, and
return the EBSA-285’s jumpers to their original positions.
Before reinstalling the EBSA-285 in its original host, be sure to set the
EBSA’s Flash Image Selector switch (see Figure 3-19) to the image number that
holds the new BIOS. For example, if the new BIOS was programmed into image #5,
set the EBSA’s Image Selector Switch so that it points at the number 5 (see below).
Flash ROM Layout
  0  Angel Debugger
  1  Diagnostics
  2  < unused >
  3  < unused >
  4  < unused >
  5  ARM/Linux BIOS
  ...
  C  < unused >
  D  < unused >
  E  < unused >
  F  < unused >
Figure 3-19 EBSA-285 Flash ROM configuration
3.3.2
Booting Linux on the EBSA-285
When the A/L-BIOS boots on the EBSA-285, it performs a power-on system test
[POST]. During the POST, the A/L-BIOS tries to determine whether any “boot-worthy” devices are connected to the system. In particular, the A/L-BIOS checks to
see if an IDE hard drive is connected to the EBSA-285. If a hard drive is indeed
connected, the A/L-BIOS tries to load and boot a Linux kernel from the hard drive. If
this fails, the A/L-BIOS then tries to locate a NIC. If a NIC is found, the A/L-BIOS
tries using the BOOTP and TFTP protocols to download a Linux kernel via the
network. If this fails, the BIOS attempts to download a Linux kernel via the EBSA-285’s serial port. If this fails, the A/L-BIOS panics and everything comes to a
screeching halt.
When Eric Engstrom began the task of porting Linux to the EBSA-285 [11],
he utilized the A/L-BIOS’s ability to download a Linux kernel via the EBSA’s serial
port as shown in Figure 3-20. Using this configuration, he was able to get his first
“test” kernels up and running on the EBSA-285. Actually, Engstrom’s first test
kernels were not Linux kernels at all; they were simple little 3-line “hello world”
programs. Nevertheless, the A/L-BIOS dutifully downloaded and launched these
programs as if they were “the real thing.” Of course, once these test programs ran to
completion they would simply terminate and the EBSA would hang. Regardless, this
was indeed a major milestone for Engstrom and the entire CiNIC project.
[Diagram — CiNIC v0.1a: the EBSA-285 (behind the 21554 bridge) installed in the host PC Hydra, with the “Hello, World!” test image downloaded from the machine Fornax over an RS-232C serial connection.]
Figure 3-20 The platform used by Eric Engstrom to port Linux onto the EBSA-285
3.3.3
Linux Bootstrap Sequence
It is a well-known fact that the Linux operating system will not successfully bootstrap
itself if it does not have access to a “root” file system. So the next step in the
evolutionary process of porting Linux to the EBSA-285 was creating the EBSA’s root
file system, and, more importantly, making this file system accessible to the EBSA-285. To do this, Engstrom and I leveraged information we found in [31], [39], [52],
and [54] to set up a diskless “NFS/root” file system for the EBSA-285. In a nutshell,
we:
•
Attached a 3Com 3c905C NIC to the secondary PCI bus (i.e., to the 21554
evaluation board) and then connected the NIC to the lab’s Ethernet LAN.
•
Created a “bare bones” root file system for the EBSA in a subdirectory of
the root file system owned by the host machine pictor.
•
Installed and configured the Network File System [NFS] protocol on pictor
so that the EBSA could access its root file system on pictor via an
Ethernet LAN.
•
Installed and configured the BOOTP [54] and TFTP [52] protocols on
pictor. (Support for these protocols is already built into the A/L-BIOS.)
With these protocols in place, the A/L-BIOS could ask “who am I?” and
receive its IP4 address in reply. The A/L-BIOS can now use the TFTP
protocol to download a Linux kernel from pictor (see Figure 3-21 below).
•
Configured and built an ARM/Linux kernel with NFS/root file system
support built into it, and installed this kernel image on pictor in
accordance with [31], [39], [52], and [54].
[Diagram — pictor and the EBSA-285 (21554 + 3c905C) on the 100 Mbps Fast Ethernet: (1) the A/L-BIOS broadcasts a bootp request (“Who am I?”); (2) pictor replies with the EBSA’s DNS name and IP4 address; (3) the A/L-BIOS issues the tftp download request “get /tftproot/kernel”; (4) the EBSA-285 Linux kernel is downloaded.]
Figure 3-21 ARM/Linux BIOS’s bootp sequence
After the BIOS downloads the Linux kernel from the host machine pictor, it
decompresses the kernel image and then hands control of the EBSA-285 over to the
Linux kernel. From this point on, the A/L-BIOS no longer controls the EBSA-285.
As the Linux kernel comes to life on the EBSA-285, it eventually reaches a
point where it must mount and use its root file system. But before this can happen, the
kernel must first determine the EBSA-285’s IP4 address. It does this by sending out a
BOOTP (i.e., “who am I?”) request. The BOOTP server pictor hears this request and
sends back two pieces of information: the EBSA’s IP4 address and the name of the
directory (relative to pictor’s own file system) that contains the EBSA’s root file
system (see Figure 3-22 below). With these items of information in hand, the Linux
kernel sends an NFS mount request back to pictor, asking it to mount the specified
directory (“/ebsa285fs” in this case). If pictor complies with this mount request (and
it generally does), the EBSA/Linux kernel has its root file system and can continue its
bootstrap sequence. Otherwise the EBSA/Linux kernel panics and halts.
[Diagram — pictor and the EBSA-285 (21554 + 3c905C) on the 100 Mbps Fast Ethernet: (1) the Linux kernel broadcasts a bootp request (“Who am I?”); (2) pictor replies with the EBSA’s DNS name, IP4 address, and the root path /ebsa285fs; (3) the kernel issues the nfs mount request “mount pictor:/ebsa285fs”.]
Figure 3-22 Linux kernel bootp sequence
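For reference, the BOOTP reply described above is driven by an entry in the BOOTP server’s configuration on pictor. A hypothetical /etc/bootptab entry for a CMU-style bootpd is sketched below; the hardware address is a placeholder, and the exact tag names are assumptions about the particular bootpd being used.

# /etc/bootptab on pictor (hypothetical entry)
ebsa285:\
        :ht=ethernet:\
        :ha=00105A123456:\
        :ip=192.168.24.96:\
        :sm=255.255.255.0:\
        :hd=/tftproot:\
        :bf=kernel:\
        :rp=/ebsa285fs: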
3.3.4
Linux Operating System Characteristics
A complete description of the evolution of the Linux operating system environment
on the EBSA-285 is provided in Eric Engstrom’s senior project report [11]. At the
time of this writing, the operating system environment on the EBSA-285 is (more-or-less) the same System V setup used by Redhat Linux 7.x distributions. The Linux
kernel is explicitly configured to support the creation and use of multiple RAM disks
to provide “disk-like” support for Jenny Huang’s web caching research [21].
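As a point of reference, a RAM disk of this sort can be created and mounted with a few commands like the hypothetical ones below (the device name, size, and mount point are assumptions; the kernel must have been built with RAM disk support, as noted above).

# mke2fs -m 0 /dev/ram0 4096        # build a 4 MB ext2 file system in RAM
# mkdir -p /mnt/ramdisk
# mount /dev/ram0 /mnt/ramdisk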
3.3.5
Synchronizing the System Time with the Current Local Time
As is often the case with small, single-board computer systems, the EBSA-
285 does not have an on-board real-time clock chip (e.g., such as the Intel 82801AA
LPC Interface Controller – a CMOS chip with battery backup – that maintains,
among other things, the current local time in many desktop PCs). So when Linux
boots on the EBSA-285 it default initializes its system clock to the “zero epoch” time
of 12:00:00.0 midnight, January 1, 1970. Keep in mind that Linux’s system clock
is a software pseudo-clock that Linux creates and manages, whose time value resides
in system RAM. Consequently, whenever the EBSA-285 reboots, the current system
time value is lost.
If Linux is unable to synchronize its system clock with the current local time,
it cannot properly manage some of its most important subsystems – particularly its
file systems. Furthermore, a number of application programs (e.g., the GNU make
utility) do not function properly when the system time is not properly synchronized
with the current local time. So in order for Linux to function correctly on the EBSA-285, its system clock must somehow be synchronized with the current local time
during the boot sequence. One way to accomplish this is to log on to the EBSA-285
as the system administrator – i.e., as the super-user ‘root’ – and manually synchronize
the system time with the current local time via the date command. This should be
done immediately after Linux boots on the EBSA-285, before any other use of the
system, in order to minimize the possibility of corruption within the EBSA-285’s
time-sensitive subsystems. Of course, manually setting the system clock like this is a
somewhat risky solution to the clock synchronization issue because it assumes the
system administrator will always be on hand immediately after Linux boots. If this is
not the case and a system- or application-layer process happens to use a time-dependent resource before the system time is synchronized to the current local time,
the EBSA-285 can (and typically will) become unstable. An example of an
application-layer process that is highly dependent on the current system time is the
make utility. If the current system time is not coordinated with the time stamps that
appear on the underlying file system, the make utility will exhibit a variety of failure
modes – e.g., failing to recompile stale source files, linking stale object modules with
new object modules, etc. Another example of an application-layer process that
requires some degree of accuracy in the system clock is the Concurrent Versioning
System (CVS) utility. This utility is used by a group of programmers who collectively
work on (i.e., make modifications to) the source files that comprise a given software
package. If a user happens to use the CVS utility to check out, check in, or edit files,
etc., before the system time is synchronized with the current local time, the entire
CVS database can become hopelessly corrupted (i.e., the CVS database for that
particular software package would probably need to be recreated from scratch).
Assuming the system clock is somehow synchronized with the current local
time during the boot sequence, the system clock’s long-term stability now becomes
an issue of concern. Because the system clock is a software-based clock it does not
exhibit the same degree of long-term stability and/or accuracy as a hardware-based
real-time clock. Consequently, the system clock time always drifts away from the
current local time after a synchronization operation. Furthermore, the system clock’s
drift rate is heavily influenced by the amount of hardware and/or software interrupt
activity on the Linux host, so the magnitude of the drift rate varies over time.
Because the system time inevitably drifts away from the current local time,
the system administrator must periodically re-execute the date command (or some
similar command) to bring the system clock back into synchronization with the
current local time. Since manual resynchronization of the system clock is a tedious
and error prone procedure, a preferred solution would be one that automatically
obtains the current local time from an external time reference and then resynchronizes
the local system time to the time value obtained from the external reference.
Fortunately, a number of time synchronization protocols exist that perform exactly
this function. Of these protocols, perhaps the simplest is the RFC 868 Time Protocol
[42]. A host configured as an RFC-868 Time Protocol server responds to incoming
service requests by sending back a 32-bit, RFC-defined value that represents the
number of seconds that have elapsed since January 1, 1900. The requestor – i.e., the
RFC-868 client – can then use this value to set its internal clock(s).
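Since the RFC-868 value counts seconds from January 1, 1900 while the Unix/Linux system clock counts seconds from January 1, 1970, the client simply subtracts the 2,208,988,800-second offset between the two epochs. For example (the server value below is made up):

$ echo $(( 3197290059 - 2208988800 ))       # hypothetical RFC-868 value converted to Unix seconds
988301259                                   # a time in late April 2001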
On Linux hosts, the rdate command is an RFC 868-based utility that obtains
the current local time from a remote RFC-868 Time Protocol server via a TCP/IP-based network connection. The rdate command is easily integrated into Linux’s boot
scripts (e.g., the /etc/rc.d/rc.sysinit script) for the purpose of automatically
synchronizing the system time with the current local time during the boot sequence.
The task of periodically resynchronizing the system clock with the current local time
can be accomplished with the help of Linux’s cron utility. The files
/sbin/syncsysclock and /etc/cron.hourly/local.cron on the accompanying CD-ROM
show how boot-time and periodic synchronization is performed on our EBSA-285.
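As an illustration only – the actual /sbin/syncsysclock script on the CD-ROM may differ – a minimal synchronization script could look like the following, where the time server name is an assumption:

#!/bin/bash
# Hypothetical clock-synchronization sketch; TIMEHOST is an assumed
# RFC-868 time server on the local LAN.
TIMEHOST=pictor
/usr/bin/rdate -s "$TIMEHOST" || logger "rdate: could not reach $TIMEHOST"

Calling such a script once from /etc/rc.d/rc.sysinit and again from an hourly cron entry provides the boot-time and periodic synchronization described above.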
NOTE
The clockdiff utility can be used to observe the amount of drift in the
EBSA’s system clock over time, relative to other Linux hosts, with a
resolution of 1 millisecond (0.001 second). For additional information on
network-based time synchronization and/or system timing issues, see
[38] and [41].
3.4
User Environment
The evolution of the initial user environment on the EBSA-285 is documented
in Eric Engstrom’s senior project report [11]. Engstrom’s initial user environment
was intentionally minimal, providing only enough functionality to support the most
basic tasks – e.g., logging on to the EBSA-285, listing the contents of a directory,
changing to a new directory, and so on – and not much else. There were no full-featured command shells, no editors, no file system utilities, etc.; these were yet to be
created and installed on the EBSA-285.
Since I was the one with the most experience using the i686-ARM cross
development tool chain, I began the long, tedious process of building and installing
the software packages that comprise a “typical” Linux user environment on the
EBSA-285 – e.g., command shells, file system utilities, system configuration files,
networking support, etc. For the most part, I used the source packages that came with
the Redhat 7.0 and 7.1 distributions when creating the EBSA-285’s user environment.
Apparently, then, the EBSA-285’s user environment is essentially the same “System
V” user environment found on Redhat 7.x workstations (with the possible exception
that there is currently no support for the X-windows environment on the EBSA-285).
3.4.1
Building applications with the i686-ARM cross tool chain
Initially, I used the i686-ARM cross development tool chain to build the
applications and utilities one typically finds on a Redhat 7.x workstation. The steps
required to use the i686-ARM tool chain are not always obvious, and experimentation
is generally needed before one succeeds at building a software package with it. In
some cases, one must edit a package’s Makefile(s) before the build process will work.
As is the case when building the GNU software development tool chain, one
must be extremely careful when specifying the build and install options during the
package configuration phase. In my experience, the use of command shell
environment variables helps minimize the possibility of incorrectly specifying these
options. Figure 3-23 shows some of the bash environment variables I used when
configuring software packages on the i686 host, before building and installing them
on the i686 host with the i686-ARM cross tool chain.
export arm_arm_root="/arm-arm"
export common_options="\
    --build=i686-pc-linux-gnu \
    --host=arm-ebsa285-linux-gnu \
    --target=arm-ebsa285-linux-gnu \
    --sysconfdir=${arm_arm_root}/etc \
    --sharedstatedir=${arm_arm_root}/usr/share \
    --localstatedir=${arm_arm_root}/var \
    --mandir=${arm_arm_root}/usr/share/man \
    --infodir=${arm_arm_root}/usr/share/info \
    "
export prefix_bin="\
    --prefix=${arm_arm_root}/bin \
    --sbindir=${arm_arm_root}/sbin \
    --exec-prefix=${arm_arm_root}/bin \
    "
export prefix_usr="\
    --prefix=${arm_arm_root}/usr \
    --sbindir=${arm_arm_root}/sbin \
    --exec-prefix=${arm_arm_root}/usr \
    "
Figure 3-23 Bash environment variables for common i686-ARM configure options
The arm_arm_root environment variable specifies the root directory on the
i686 host where the resulting ARM-compatible executables will be installed (i.e.,
after they are created with the i686-ARM cross tool chain). Note that the contents of
this directory tree will eventually be copied en masse over to the EBSA-285’s file
system on Pictor.
WARNING
If you choose to use the environment variables shown in Figure 3-23, make sure
the arm_arm_root environment variable is initialized as shown. Do not omit
the definition for the arm_arm_root environment variable, and do not define it
as the root directory ‘/’. Doing either of these will potentially wipe out the i686
applications of the same name on the build host, rendering the i686 build host
inoperable in the process!
The common_options environment variable shown in Figure 3-23
specifies the build and install options that are more-or-less common to all of the
packages being built with the i686-ARM cross tool chain.
The prefix_bin environment variable shown in Figure 3-23 specifies the
build and install options that are common to source packages that create application
programs that are typically used during the Linux boot sequence. See the /sbin and
/bin directories to find out which programs are generally required during the Linux
kernel boot sequence. For example, Figure 3-24 shows how the common_options
and prefix_bin environment variables are used when configuring the software
package that creates the /sbin/insmod program – a program that is typically used
during the Linux boot sequence to load kernel-loadable modules.
% ./configure \
    ${common_options} \
    ${prefix_bin} \
    ... # other package options here
Figure 3-24 Configure step for apps that are used during the Linux boot sequence
The prefix_usr environment variable shown in Figure 3-23 specifies the
build and install options that are common to source packages that create programs
that typically are not required during the Linux boot sequence (e.g., the emacs editor).
For example, Figure 3-25 shows how the common_options and prefix_usr
environment variables are used when configuring the software package that creates
the /usr/bin/emacs text editor program – a program that is typically not used during
the Linux boot sequence.
% ./configure \
    ${common_options} \
    ${prefix_usr} \
    ... # other package options here
Figure 3-25 Configure step for apps that are not used during the Linux boot sequence
3.4.2
Installing from the i686 host to the EBSA-285’s file system
After building these packages with the cross development tool chain, the
packages must be temporarily installed on the i686 host. Assuming one uses the same
bash environment variables shown in Figure 3-23, performing the ‘make install’ step
on the i686 host installs the ARM executables in the /arm-arm subdirectory on the
i686 host.
Following the i686 install step, the contents of the /arm-arm directory must
be copied to the EBSA-285’s file system. As mentioned in §3.2.6, the easiest way to
do this is to use the NFS feature to mount the EBSA-285’s file system on the i686
host, and then simply copy everything from the /arm-arm directory to the
/ebsa285fs directory as shown in Figure 3-26.
# mount /ebsa285fs
# cp -a /arm-arm/* /ebsa285fs/
# cp -a /arm-arm/.[^.]* /ebsa285fs/
Figure 3-26 Copying the /arm-arm directory to the EBSA-285’s root directory
The final step in the installation process is (once again) the creation of some
symbolic links on the EBSA-285’s file system. As mentioned in §3.2.6, some
packages hard code the name of the installation directory into the executable
programs the package creates. Since the install directory was specified as
/arm-arm/<whatever>, some programs will not work correctly if they are not
installed in this specific directory tree on the EBSA-285’s file system. The quick and
easy fix for this problem is to create a directory named /arm-arm in the root of the
EBSA-285’s file system and create some appropriate symlinks within that directory,
as shown in Figure 3-27.
% cd /ebsa285fs
% mkdir arm-arm
% cd arm-arm
% for item in $(ls ..); do
>   [ -d "../${item}" ] && ln -s ../${item} ${item}
> done
% rm -f arm-arm
Figure 3-27 Create symlinks after installing from i686 host to EBSA-285’s file system
3.4.3
Rebuilding everything with the ARM-ARM tool chain
Eventually, everything that is built with the i686-ARM cross tool chain should
be rebuilt natively on the EBSA-285 platform using the native ARM-ARM tool
chain. This should be done as soon as possible to maximize the stability of the
runtime applications on the EBSA-285 platform. Note that cross compiling sometimes yields ARM executables that are less stable than ARM executables that are
built natively. So I recommend using the i686-ARM cross compiler to build only
what is absolutely necessary in §3.4.2 – i.e., the utilities that are commonly used
when building software packages (e.g., the GNU make utility, the Perl scripting
language, the GNU sed and awk utilities, the GNU fileutils package, etc.).
When building a software package on the EBSA-285 with the native ARM-ARM
tool chain, one uses essentially the same configure options that were used when
building the same package on the i686 host with the i686-ARM tool chain. The only
difference is that the arm_arm_root environment variable is now undefined (or
defined to the empty string “”) as shown in Figure 3-28.
export arm_arm_root=""
export common_options="\
    --build=i686-pc-linux-gnu \
    --host=arm-ebsa285-linux-gnu \
    --target=arm-ebsa285-linux-gnu \
    --sysconfdir=${arm_arm_root}/etc \
    --sharedstatedir=${arm_arm_root}/usr/share \
    --localstatedir=${arm_arm_root}/var \
    --mandir=${arm_arm_root}/usr/share/man \
    --infodir=${arm_arm_root}/usr/share/info \
    "
export prefix_bin="\
    --prefix=${arm_arm_root}/bin \
    --sbindir=${arm_arm_root}/sbin \
    --exec-prefix=${arm_arm_root}/bin \
    "
export prefix_usr="\
    --prefix=${arm_arm_root}/usr \
    --sbindir=${arm_arm_root}/sbin \
    --exec-prefix=${arm_arm_root}/usr \
    "
Figure 3-28 Bash environment variables for the ARM-ARM configure options
Chapter 4
Instrumentation
When developing a new hardware / software architecture, a natural question that
arises is “what are the performance characteristics of the new architecture?” The
answers to this question are particularly important if the new architecture represents
an evolutionary or revolutionary change to some preexisting architecture. Such is the
case with the CiNIC Project.
This chapter describes the performance instrumentation techniques I created,
with the assistance of Ron J. Ma [33], to gather latency data from the CiNIC
architecture during network send operations. There are two implementations of this
instrumentation technique: “stand-alone” and “CiNIC.” The stand-alone implementation gathers performance data from the host system by itself, or from the EBSA-285
co-host system by itself. The stand-alone system does not gather performance data
from the combined host/co-host CiNIC platform. The CiNIC instrumentation
technique, on the other hand, gathers performance data only from the combined
host/co-host CiNIC platform. Samples of the performance data obtained with these
two instrumentation techniques are provided in Chapter 5.
4.1
Background
Historically, the CPNPRG has relied on techniques such as time stamping and
profiling to harvest performance data from a particular hardware / software platform
at runtime. For example, Maurico Sanchez used these techniques to glean CPU
utilization and throughput performance data from his I2O development efforts [49].
Sanchez also wanted to obtain latency data for the I2O platform, but was ultimately
unable to do so, primarily because of restrictions that were imposed on him by the
Novell NetWare environment he was using. This difficulty was another example of the
problems that spawned the idea of using a logic analyzer to passively gather latency
data from a host system via its PCI bus, an idea that was ultimately implemented with
great success – albeit on platforms other than Sanchez’s I2O platform [12], [32], [57].
An overview of this technique is provided in §4.3.
Following the conclusion of Sanchez’s I2O development work, the CPNPRG
obtained the source code for an experimental Win32 IPv6 protocol stack from
Microsoft Research (a.k.a., “msripv6”). Peter Xie instrumented this code with
software time stamps to determine its performance characteristics [56], and I
subsequently used the logic analysis measurement method described in [12] to obtain
latency data for the same msripv6 protocol stack. Unfortunately, Xie and I were
gathering our performance data from two different machines, and the performance
characteristics we were measuring were also quite different. Consequently, the
correlation between his measurement data and mine was not particularly good. To
remedy this situation, Bo Wu and I set about the task of integrating Xie’s work with
mine so that we could simultaneously harvest performance data from the msripv6
protocol stack via: a) the software timestamps Xie and Wu had placed in the msripv6
stack, and b) the logic analysis measurement method described in [12]. This approach
imposed a tight correlation on the time stamp data gathered by the instrumented
msripv6 stack, the timestamp data gathered by the test application, and the latency
times observed by the logic analyzer on the host’s PCI bus. The combined data from
this experiment revealed some interesting characteristics of the msripv6 stack that had
previously gone undetected [12].
The integration of the host and co-host platforms into the CiNIC architecture
marks the current evolution of the logic analysis / systems approach to gathering
latency data from a host’s PCI bus in response to network send operations at the
application layer. Ron J. Ma and I worked together to extend the logic analyzer’s
measurement capability to gather latency data from two different PCI busses – i.e.,
the primary PCI bus on the host system, and the secondary PCI bus on the co-host –
in response to network send transactions from the host system. This work was
performed in parallel with the development efforts of Rob McCready and Mark
McClellend (§2.3.2), who at the time were still working on the host/co-host
communication infrastructure. Because this infrastructure was only partially available
to us, Ma and I had to devise a proverbial “plan B” host/co-host communications
protocol we could use in the interim to send data from the host system down to the
EBSA-285 co-host and out to the NIC ([33] and §4.4.2). This simple protocol allowed
Ma and me to meet the goal stated above – i.e., gathering latency data from both the
host’s primary PCI bus and the co-host’s secondary PCI bus in response to network
send transactions from the host system.
4.2
Performance Metrics
At present, the CPNPRG is interested in determining latency, throughput,
and CPU utilization characteristics for the pre- and post-CiNIC architectures. In this
thesis, I am interested in obtaining PCI-level wire entry and exit latencies for:
•
The host system Hydra as a “stand-alone” platform (i.e., before its TCP/IP stack
processing is off-loaded onto the EBSA-285 co-host), and,
•
The EBSA-285 co-host as a “stand-alone” platform (i.e., the EBSA-285 is
operating independently of the host system, and plays no part in the host system’s
networking transactions), and,
•
The combined host / co-host CiNIC architecture that off-loads the host system’s
TCP/IP stack processing onto the EBSA-285 co-host.
4.2.1
Framework for Host Performance Metrics
Before one begins the task of obtaining data for some particular performance characteristic, one must first define fundamental concepts such as ‘metrics’ and
‘measurement methodologies’ that allow us to speak clearly about measurement
issues [41]. These concepts must be clearly defined for each performance
characteristic being measured, and must include some discussion of the measurement
uncertainties and errors involved in the quantity being measured and the proposed
measurement process. Clearly, these definitions must be derived from standardized
metrics and measurement methodologies to obtain results that are meaningful to the
scientific community at large.
In the realm of IP performance metrics, Vern Paxson’s “Framework for IP
Performance Metrics” (RFC 2330) is the de facto framework for implementing
performance metrics for IP-related traffic on the Internet at large. Specifically,
Paxson’s document is:
“[a] memo that attempts to define a general framework for
particular metrics to be developed by the IETF’s IP Performance
Metrics effort, begun by the Benchmarking Methodology Working
Group (BMWG) of the Operational Requirements Area, and being
continued by the IP Performance Metrics Working Group (IPPM)
of the Transport Area” [41].
While this memorandum deals specifically with IP performance issues for the Internet
at large, we (the CPNPRG) feel Paxson’s measurement framework (or portions
thereof) is also useful for gathering performance data within an endpoint node. To
this end, the CPNPRG has adopted Vern Paxson’s “Framework for IP Performance
Metrics” (RFC 2330) as a reference document for the measurement methodologies
employed by the group.
4.3
Stand-alone Performance Instrumentation
This section describes the instrumentation technique used to measure the “stand-alone”
performance characteristics of either the host machine hydra, or the EBSA-285 co-host.
This instrumentation technique is not used to gather performance data
concurrently from the combined host / co-host CiNIC architecture. (Section 4.4 below
discusses the instrumentation technique used with the CiNIC architecture.)
NOTE
In section 4.3, the term “system under test” [SUT] refers to either the
host machine hydra, or the EBSA-285 co-host. It does not refer to the
combined host / co-host CiNIC architecture.
4.3.1
Theory of Operation
The instrumentation configuration shown in Figure 4-1 below is essentially the same
configuration described in [12], with the exception that there is no longer a connection between the host computer’s parallel port and the logic analyzer. A “new-and-improved” PCI-based triggering mechanism is now being used to trigger the logic
analyzer in lieu of the parallel port triggering mechanism described in [12].
[Diagram — the system under test (SUT): the NIC sits on the SUT’s PCI bus and connects to the LAN; the HP 16700A logic analyzer monitors the PCI bus.]
Figure 4-1 “Stand-alone” host performance instrumentation
To quickly review the measurement technique described in [12], a test
application, running on the SUT, allocates storage for a “payload buffer” of some
desired size. The test application then initializes the buffer’s contents with a “well
formed” payload value. The specific content of the payload buffer after initialization
depends upon the transport protocol being used. For TCP transactions, the test
application initializes the payload buffer with the content shown in Figure 4-2.
Payload buffer (TCP/UDP payload), offset 0:  AAAAAAAAAAAA ******** EEEEEEEEEEEE
Figure 4-2 Contents of the payload buffer for stand-alone measurements
The TCP-based payload buffer always begins with a contiguous sequence of twelve
“A” characters (a.k.a., the “start marker” characters), and it always ends with a
contiguous sequence of twelve “E” characters (a.k.a., the “stop marker” characters).
The bytes in the middle of the payload buffer – i.e., between the start and stop marker
characters – are initialized with asterisk characters ‘*’. This arrangement ensures:
a) the start marker characters are the first characters of the TCP payload to reach the
PCI bus (and therefore the network media), and b) the stop marker characters are the
last characters of the TCP payload to reach the PCI bus, in response to the test
application sending the payload buffer contents to a remote host, via a TCP
connection, using the SUT’s networking subsystem.
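To make this layout concrete, the fragment below builds a 200-byte “well formed” payload in a Bash shell and pushes it to a remote discard service. This is only an illustration of the byte layout – the actual test application uses native socket calls as described above – and the remote address, the use of the well-known discard port 9, and the nc invocation are assumptions.

#!/bin/bash
SIZE=200
REMOTE=192.168.24.65                              # assumed remote host running a discard service
PORT=9                                            # well-known TCP discard port

start=$(printf 'A%.0s' {1..12})                   # twelve start-marker bytes
stop=$(printf 'E%.0s' {1..12})                    # twelve stop-marker bytes
fill=$(printf '*%.0s' $(seq 1 $((SIZE - 24))))    # asterisks in between

printf '%s%s%s' "$start" "$fill" "$stop" | nc "$REMOTE" "$PORT"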
For UDP-based send transactions in the Linux environment, initialization of
the payload buffer requires some special processing to ensure the start marker
characters are the first characters of the UDP payload to appear on the PCI bus, and
the stop marker characters are the last characters of the UDP payload to appear on the
PCI bus. Section 4.3.2 below explains in detail why this special processing is required
on Linux hosts, and shows where to place the start and stop markers within the UDP
payload buffer. The goal here is to arrange the contents of the UDP payload buffer so
that the start and stop marker characters arrive on the PCI bus with the same sequence
shown in Figure 4-2, in response to the test application sending the contents of the
payload buffer to a remote host, via a UDP connection, using the SUT’s networking
subsystem.
Referring now to Figure 4-3 below, the newly developed PCI-based logic
analyzer triggering mechanism works as follows. A test application, running on the
SUT, writes a known 32-bit value to an unimplemented base address register [BAR].
This unimplemented BAR can belong to any PCI function that is connected to the
SUT’s PCI bus. For this project, the value 0xFEEDFACE is written to the 3rd BAR on
a 3Com 3c905C NIC. The logic analyzer’s triggering subsystem is manually
configured to detect this write transaction on the PCI bus, and the analyzer uses this
event as its “time = 0 seconds” reference time t0 for all subsequent PCI transactions it
observes on the PCI bus (see Figure 4-3).
[Diagram — the test application on the SUT writes 0xFEEDFACE to an unimplemented BAR on the NIC; the HP 16700A logic analyzer records this PCI-bus write as the reference time t0.]
Figure 4-3 Write to unimplemented BAR on 3c905C NIC corresponds to time t0
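For reference, a configuration-space write of this kind can also be produced by hand with the setpci utility from the pciutils package; the third BAR of any PCI function lives at configuration offset 0x18. The PCI bus/device/function address below is a placeholder, and this manual method is shown only as an illustration – it is not the mechanism used by the test application, which performs the write itself immediately before sending the payload.

# setpci -s 00:0c.0 0x18.L=0xFEEDFACE       # write 0xFEEDFACE to the 3rd BAR (offset 0x18)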
From this point on, the instrumentation technique used to gather performance
data from the SUT is virtually the same as described in [12]. So immediately after the
test application triggers the logic analyzer by writing a known value to an unimplemented BAR, the test application sends the contents of the payload buffer to a remote
host’s discard port ([43], [47]) via the network. The payload is sent using socket calls
that are native to the operating system environment that’s controlling the SUT.
After a bit of processing within the SUT’s TCP/IP protocol stack, the payload
buffer’s start marker characters reach the PCI bus on their way to the NIC (see Figure
4-4). The analyzer detects the arrival of these characters on the PCI bus and records
this event as the “wire arrival time” t1 (formerly referred to as the “start marker” time
in [12]). In other words, time t1 represents the processing latency associated with
moving the first bytes of the payload buffer from the application layer, down through
the SUT’s networking subsystem, onto the PCI bus, and out to the NIC.
[Diagram — the start-marker characters (“AAAA…”) reach the SUT’s PCI bus on their way to the NIC and the LAN; the logic analyzer records this event as time t1 (relative to t0).]
Figure 4-4 Wire arrival time t1
In a similar manner, when the payload buffer’s stop marker sequence arrives
at the PCI bus (see Figure 4-5), the analyzer detects and records this event as the
“wire exit time” t2 (formerly referred to as the “stop marker” time in [12]). Time t2,
then, represents the processing latency associated with moving the last bytes of the
payload buffer from the application layer, down through the SUT’s networking
subsystem, onto the PCI bus, and out to the NIC.
[Diagram — the stop-marker characters (“EEEE…”) reach the SUT’s PCI bus on their way to the NIC and the LAN; the logic analyzer records this event as time t2 (after t0 and t1).]
Figure 4-5 Wire exit time t 2
After the logic analyzer records the wire entry and exit times t1 and t2, the test
application (on the SUT) downloads these values from the analyzer for postprocessing purposes.
In general, the test application repeats the aforementioned steps multiple times
for a given payload size. For example, the test application might: a) allocate a buffer
for a 200-byte payload, and b) initialize the buffer’s contents for a TCP connection
(see Figure 4-2), and c) send the buffer’s contents a total of thirty times to a remote
host’s discard port via a TCP connection. Each of these send operations generates a
{t1, t2} data pair, and the resulting collection of thirty {t1,t2} data pairs comprise a
sample set of latency times for a 200-byte TCP payload. The test application sorts the
t1 and t2 values in the sample set and determines the minimum, maximum, and
median values for these latency times. The resulting min, max, and median t1 and t2
times are then written to a text file in such a way that the file’s contents have a
“spreadsheet-friendly” input format (see Figure 4-6). For example, importing the
results files into Microsoft Excel and then plotting the median latency times for t1 and
t2 created the X-Y scatter plots found in Chapter 5.
After the test application collects and processes a sample set of latency times
for a given payload size, it can (if programmed to do so) perform the aforementioned
steps again on different payload sizes.
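The post-processing step reduces to a sort followed by selection of the extreme and middle elements. A minimal sketch is shown below; the names are illustrative, and the real test application also formats the summary values into the output file of Figure 4-6.

    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Sort one sample set in place and report its min, median, and max values. */
    static void summarize(double *samples, int n, double *min, double *med, double *max)
    {
        qsort(samples, n, sizeof(samples[0]), cmp_double);
        *min = samples[0];
        *max = samples[n - 1];
        *med = (n % 2) ? samples[n / 2]
                       : (samples[n / 2 - 1] + samples[n / 2]) / 2.0;  /* one common convention for even n */
    }

The t1 and t2 values of a sample set are summarized separately, and the resulting six summary values are written out as one line per payload size.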
Protocol   : TCP/IPv4 Send -- Blocking
Server     : 192.168.24.65
Trials     : 30
Start size : 512
Stop size  : 65536
Step size  : 512
Bad data   : 0
Timeouts   : 4
Date/Time  : Mon Apr 16 13:07:39 2001
File name  : LT4SB_512-65536-512_2001-04-16_13-07-39.dat
OS/Version : Red Hat Linux release 7.0 (Guinness) [2.4.3-ipv6]

        |-------- t1 --------|     |-------- t2 --------|
Size    min     median  max        min     median  max
512     20.024  21.916  24.136     23.848  25.728  27.952
1024    23.816  24.608  25.472     31.496  32.268  33.136
1536    24.840  27.088  29.272     43.008  45.236  47.440
2048    25.912  27.112  28.848     53.888  55.216  85.624
2560    26.072  27.080  28.224     91.512  92.816  93.784
. . .

Figure 4-6 “Spreadsheet friendly” contents of test results file
4.3.2
IPv4 Fragmentation of UDP Datagrams in Linux
When the Linux kernel’s IP4 sub-layer receives a UDP datagram, it checks to see
whether the resulting IP datagram will fit within a single network frame. (Recall that
the maximum transmission unit [MTU] of the underlying network media determines
the size of a network frame.) If the resulting IP datagram will indeed fit within a
single network frame, no further processing is required by the IP4 sub-layer and the
datagram is sent on its way.
On the other hand, if the size of a UDP datagram is such that the resulting IP
datagram will exceed the capacity of a single network frame, the IP4 sub-layer must
fragment the UDP datagram into a collection of smaller datagrams whose sizes are
compatible with the MTU of the underlying network media. When fragmenting a
“large” UDP datagram into multiple IP datagrams, Linux’s IP4 sub-layer fragments
the UDP datagram from front to back as shown in Figure 4-7. Depending on the
length of the UDP datagram, the size of the last IP datagram (IP Datagram n in Figure
4-7) will be less than or equal to the MTU of the underlying network media. The remaining IP datagrams – i.e., IP Datagram 1 through IP Datagram n-1 – are all the same size, which is determined by the MTU of the underlying network media.
[Figure 4-7 IP4 fragmentation of "large" UDP datagram: the UDP datagram (UDP header + UDP payload) is split, front to back, into IP Datagram 1 through IP Datagram n-1, each sized in proportion to one MTU, plus IP Datagram n, whose size is ≤ 1 MTU]
After fragmenting the UDP datagram into a collection of network-compatible
IP datagrams, we have observed that the IP4 sub-layer sends the IP datagrams to the
network in “back-to-front” fashion – i.e., IP Datagram n is the first to be sent, followed
by IP Datagram n-1 and so on, so that the last IP datagram sent to the network is
IP Datagram 1.
For UDP payloads, then, the start marker sequence should be positioned at the beginning of IP Datagram n, and the stop marker sequence at the end of IP Datagram 1 (see Figure 4-8 below). This placement scheme will ensure the start marker characters are the first characters of the UDP payload to appear on the PCI bus, because the IP4 sub-layer sends IP Datagram n to the network first.
Likewise, placing the stop marker at the end of IP Datagram 1 will ensure the stop
marker characters are the last characters to reach the PCI bus, as a result of the IP4
sub-layer sending IP Datagram 1 to the network last.
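A sketch of this placement rule is shown below. It assumes a 20-octet IP header and an 8-octet UDP header and mirrors standard IPv4 fragmentation, in which every fragment except the last carries a multiple of eight data octets; the function is illustrative and is not the actual test application code.

    #include <stddef.h>

    #define MARKER_LEN  12
    #define IP_HDR_LEN  20    /* assumed: no IP options */
    #define UDP_HDR_LEN 8

    /* Compute the payload-buffer offset of the start marker so that it rides at
     * the front of the last-created (first-sent) fragment.  Returns 0 when the
     * datagram is not fragmented, i.e., the marker simply goes up front. */
    static size_t start_marker_offset(size_t payload_len, size_t mtu)
    {
        size_t frag_cap  = ((mtu - IP_HDR_LEN) / 8) * 8;  /* data octets per fragment   */
        size_t dgram_len = UDP_HDR_LEN + payload_len;     /* UDP header + UDP payload   */
        size_t nfrags    = (dgram_len + frag_cap - 1) / frag_cap;
        size_t offset;

        if (nfrags < 2)
            return 0;
        offset = (nfrags - 1) * frag_cap;   /* offset of the last fragment's data      */
        if (dgram_len - offset < MARKER_LEN)
            offset -= frag_cap;             /* last fragment too small: use previous   */
        return offset - UDP_HDR_LEN;        /* convert to an offset into the payload   */
    }

The stop marker offset is simply payload_len - MARKER_LEN, since the stop marker always sits at the very end of the payload, i.e., at the end of IP Datagram 1's data. The special case handled by the subtraction of frag_cap is discussed in the text that follows Figure 4-8.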
[Figure 4-8 IP4 datagram send sequence for UDP datagrams > 1 MTU: the start marker (AAAA...) is placed at the beginning of IP Datagram n's data and the stop marker (...EEEE) at the end of IP Datagram 1's data; the send order is IP Datagram n first, then IP Datagram n-1, IP Datagram n-2, and so on, with IP Datagram 1 sent last]
Having established a general-purpose algorithm for placing the start and stop
markers within a large UDP payload, one must now check for special cases. The most
obvious special case is where the last IP datagram (IP Datagram n in Figure 4-8) is too
small to hold the entire 12-character start marker sequence. For this case, I chose to
place the start marker at the beginning of the previous IP datagram – i.e., IP Datagram n-1 – rather than trying to split the start marker across two separate IP datagrams. The reasoning behind this design decision is as follows.
The start marker is a contiguous sequence of twelve ASCII “A” characters –
i.e., AAAAAAAAAAAA. (For a detailed description of why the start [and stop]
marker is 12 bytes long, see [12].) Assume for the moment that IP Datagram n's payload section is exactly eleven bytes long. These eleven bytes will
typically be grouped together into three 32-bit “DWORD” values before being sent
across the PCI bus to the NIC. (Note that one of the three 32-bit DWORDs will have
an unused byte in it, as there are eleven bytes-worth of payload being transferred to
the NIC, and a total of twelve bytes are available within the three DWORDs.)
Assuming these three DWORDs are encapsulated within a “typical” IP datagram with
a header length of 20 octets (i.e., 5 DWORDs), the entire IP datagram (header + payload) occupies a total of 31 bytes, or 8 DWORDs. So when the payload section of IP Datagram n is no greater than eleven bytes, a maximum of eight DWORDs will be
transferred across the PCI bus when sending this datagram to the NIC.
On PCI bus systems that support burst-mode memory write transactions – and
this is the case for both the host machine Hydra and the EBSA-285 co-host – these
eight DWORDs are written to the NIC using a burst-mode memory write transaction.
Such a transaction is comprised of a single PCI address phase followed immediately
by a contiguous sequence of PCI data phases:
[Figure 4-9 PCI bus write burst of IP Datagram n, where |IP Datagram n| < |start marker|: Addr, Data 1, Data 2, ..., Data 8]
Let t1 represent the time required to transfer IP Datagram n across the PCI bus to the
NIC during a burst-mode write transaction. As shown below, time t1 is approximately
273 nanoseconds:
t1 = [1 address phase + (8 DWORDs × 1 data phase/DWORD)] × 30.3 ns/phase ≈ 273 ns    [4.1]
Recalling that IP Datagram n does not contain any start marker characters for
the case when the length of IP Datagram n is less than the length of the start marker,
the logic analyzer will not detect the write transaction that sends IP Datagram n to the NIC.
[Figure 4-10 Logic analyzer payload "miss": the UDP datagram is fragmented into IP Datagram 1 (one MTU, carrying both the AAAA and EEEE markers) and IP Datagram 2 (shorter than the start marker); IP Datagram 2 is sent first and is not detected by the logic analyzer, while IP Datagram 1 is sent second]
This “payload miss” by the logic analyzer introduces a slight measurement error into
the UDP payload transfer latency measurements that is inversely proportional to the
size of the UDP datagram. The worst-case error, therefore, occurs when a UDP
datagram is fragmented into two IP datagrams, and IP Datagram 2 contains fewer than
|start marker| characters (see Figure 4-10). Since the underlying network media is
in our case an Ethernet LAN, IP Datagram 1 contains exactly 1,492 octets (or 373
DWORDs). Assuming the entire contents of IP Datagram 1 are sent to the NIC in a single PCI burst-mode write transaction, the time t2 required to send IP Datagram 1 to the NIC is approximately 11.3 microseconds:
t2 = [1 address phase + (373 DWORDs × 1 data phase/DWORD)] × 30.3 ns/phase ≈ 11.3 µs    [4.2]
Ideally, the interval recorded by the logic analyzer would span the time required to send the entire UDP datagram to the NIC – i.e., both IP datagrams. Because of the missed fragment, however, the recorded interval does not include the time spent sending the IP Datagram 2 component of the UDP datagram to the NIC.
% error = t1 / (t1 + t2) × 100 = 242 ns / (242 ns + 11.3 µs) × 100 = 2.1 %    [4.3]
For our purposes, this is an acceptable upper bound on the measurement error. The error can, of course, be avoided altogether by choosing UDP payload sizes such that every resulting IP datagram is exactly one MTU long. Because the payload sizes used here were chosen in this manner, this artifact of IP fragmentation introduces no measurement error in the results.
4.4
Host / Co-host Combined Platform Performance Instrumentation
The instrumentation used to gather latency data from the combined host / co-host
CiNIC platform is shown schematically in Figure 4-11 below. The data path between
the host system and the NIC on the co-host system is logically broken down into three
sub-paths. The first data flow sub-path is between the host computer system and the
Intel 21554 PCI-to-PCI Nontransparent Bridge [21554] – i.e., the outbound data flow
across the primary PCI bus. The second data flow sub-path is from the 21554 to the
system RAM on the EBSA-285 co-host – i.e., the inbound data flow on the secondary
PCI bus. The third and final data flow sub-path is between the EBSA-285 and the
NIC on the secondary PCI bus – i.e., the outbound data flow on the secondary PCI
bus. The logic analyzer’s job, in this case, is to gather latency data from these three
data flow sub-paths.
NOTE
In section 4.4, the term “system under test” [SUT] refers to the
combined host / co-host CiNIC architecture.
[Figure 4-11 CiNIC architecture instrumentation configuration: the HP 16700A logic analyzer monitors (1) the primary PCI bus between the host computer system and the 21554, (2) the secondary PCI bus between the 21554 and the EBSA-285 (21285 / SA-110), and (3) the secondary PCI bus between the EBSA-285 and the NIC / LAN]
4.4.1
Theory of Operation
A test application, running on the host system, allocates storage for a “payload
buffer” of some desired size. The test application then initializes the buffer’s contents
with essentially the same “well formed” payload as described in section 4.3, with the
exception that the TCP or UDP payload content is now encapsulated within a pair of
4-character “P” and “Q” sequences as shown in Figure 4-12.
[Figure 4-12 Well-defined data payload for the combined platform: the payload buffer is laid out as PPPP AAAAAAAAAAAA ******** EEEEEEEEEEEE QQQQ, where the inner region from the A markers through the E markers is the TCP / UDP payload]
This new payload arrangement is required because, as described in §4.3.2 above, large UDP datagrams under Linux are fragmented and their fragments are sent to the network in back-to-front order.
After initializing the payload buffer, the test application writes a known 32-bit
value to an unimplemented base address register [BAR] on the primary PCI bus. For
this project, the value 0xFEEDFACE is written to the 3rd BAR on a 3Com 3c905C
NIC connected to the primary PCI bus.
NOTE
This NIC corresponds to the generic “PCI” device block shown connected
to the primary PCI bus in Figure 4-13. The 3Com 3c905C NIC on the
primary PCI bus is not used to send data to, or receive data from, the
network. This NIC’s sole purpose in this case is to provide an
unimplemented BAR for the aforementioned BAR write transaction.
The logic analyzer’s triggering subsystem is manually configured to detect this BAR
write transaction when it occurs on the primary PCI bus, and the analyzer uses this
event as the “time = 0 seconds” reference times t1,0 and t2,0 for all subsequent PCI
transactions on both the primary and secondary PCI busses (see Figure 4-13).
[Figure 4-13 Write to unimplemented BAR triggers the logic analyzer: the host (Win2K or Linux) writes 0xFEEDFACE to a BAR on a PCI device on the primary PCI bus; the HP 16700A logic analyzer records this event as t1,0 on the primary PCI bus and t2,0 on the secondary PCI bus]
Immediately after the test application triggers the logic analyzer by writing a
known value to an unimplemented BAR on the primary PCI bus, the test application
sends the contents of the payload buffer to a remote host’s discard port via the
network [43], [47]. The payload is sent from the test application on the host using
socket calls that are native to the operating system environment that’s controlling the
host machine. More specifically, the socket API “seen” by the test application on the
host is virtually the same API for both the CiNIC and stand-alone architectures. Of
course, the networking subsystem within the operating system proper will generally
need to be modified in some (perhaps radical) way in order to support the CiNIC
architecture (e.g., see [36]).
After a bit of processing within the host operating system (in response to the
network send operation by the test application), the four “P” characters at the head of
the payload buffer eventually reach the primary PCI bus on their way to the 21554
(see Figure 4-14). The analyzer detects the arrival of these characters on the primary
PCI bus and records this event as the “wire arrival time” t1,1 for the primary PCI bus.
In other words, time t1,1 represents the processing latency associated with moving the
first bytes of the payload buffer from the application layer, down through the host’s
operating system, and onto the primary PCI bus.
[Figure 4-14 Wire arrival time t1,1 on primary PCI bus: the "PPPP" characters appear on the primary PCI bus between the host and the 21554]
When the payload buffer’s leading edge characters reach the 21554 bridge, the
21554 stores these characters within its 256-byte downstream posted write buffer
[23]. As soon as it is able, the 21554 dumps the contents of its posted write buffer
onto the secondary PCI bus. At this point, the four “P” characters at the head of the
payload buffer reach the secondary PCI bus on their way to the system RAM on the
EBSA-285 (via the 21285) as shown in Figure 4-15. The analyzer detects the arrival
of these characters on the secondary PCI bus and records this event as the “wire
arrival time” t2,1 for the secondary PCI bus. In other words, time t2,1 represents the
data processing latency introduced by the 21554 bridge as it moves the first bytes of
the payload buffer from the primary PCI bus onto the secondary PCI bus.
[Figure 4-15 Wire arrival time t2,1 on secondary PCI bus: the "PPPP" characters appear on the secondary PCI bus between the 21554 and the 21285 on the EBSA-285]
After a bit more processing within the host operating system (in response to
the aforementioned network send operation by the test application), the four “Q”
characters at the end of the payload buffer eventually reach the primary PCI bus on
their way to the 21554 (see Figure 4-16). The analyzer detects the arrival of these
characters on the primary PCI bus and records this event as the “wire exit time” t1,2
for the primary PCI bus. In other words, time t1,2 represents the processing latency
associated with moving the last bytes of the payload buffer from the application layer,
down through the host’s operating system, and onto the primary PCI bus.
[Figure 4-16 Wire exit time t1,2 on primary PCI bus: the "QQQQ" characters appear on the primary PCI bus between the host and the 21554]
When the payload buffer’s trailing edge characters reach the 21554 bridge, the
21554 again stores these characters within its 256-byte downstream posted write
buffer [23]. As soon as it is able, the 21554 dumps the contents of its posted write
buffer onto the secondary PCI bus. At this point, the four “Q” characters at the tail
end of the payload buffer reach the secondary PCI bus on their way to the system
RAM on the EBSA-285 (via the 21285) as shown in Figure 4-17. The analyzer
detects the arrival of these characters on the secondary PCI bus and records this event
as the “wire exit time” t2,2 for the secondary PCI bus. In other words, time t2,2
represents the data processing latency introduced by the 21554 bridge as it moves the
last bytes of the payload buffer from the primary PCI bus onto the secondary PCI bus.
[Figure 4-17 Wire exit time t2,2 on secondary PCI bus: the "QQQQ" characters appear on the secondary PCI bus between the 21554 and the 21285 on the EBSA-285]
After some processing by the Linux operating system on the co-host, the
twelve “A” characters near the head of the payload buffer eventually reach the
secondary PCI bus on their way to the NIC (see Figure 4-18). The analyzer detects the
arrival of these characters on the secondary PCI bus and records this event as the
“wire arrival time” t2,3 on the secondary PCI bus. In other words, time t2,3 represents the processing latency introduced by the TCP/IP stack on the co-host as it moves the first bytes of the payload buffer from the system RAM on the EBSA-285 to the NIC on the secondary PCI bus.
[Figure 4-18 Wire arrival time t2,3 on secondary PCI bus: the "AAAA" characters appear on the secondary PCI bus between the EBSA-285 and the NIC]
After a bit more processing within the Linux operating system on the co-host,
the twelve “E” characters near the end of the payload buffer eventually reach the
secondary PCI bus on their way to the NIC (see Figure 4-19). The analyzer detects the
arrival of these characters on the secondary PCI bus and records this event as the
“wire exit time” t2,4 on the secondary PCI bus. In other words, time t2,4 represents the
processing latency introduced by the TCP/IP stack on the co-host as it moves the last bytes of the payload buffer from the system RAM on the EBSA-285 to the NIC on the secondary PCI bus.
[Figure 4-19 Wire exit time t2,4 on secondary PCI bus: the "EEEE" characters appear on the secondary PCI bus between the EBSA-285 and the NIC]
[Figure 4-20 Data path time line: on the primary PCI bus, the BAR write "trigger" (t1,0) is followed by the host writing the data payload to the 21554 (PPPP at t1,1 through QQQQ at t1,2); on the secondary PCI bus, the trigger is t2,0, the 21554 writes the payload to shared memory via the 21285 (PPPP at t2,1 through QQQQ at t2,2), and a socket call on the EBSA then sends the payload from shared memory to the NIC (AAAA at t2,3 through EEEE at t2,4)]
4.4.2
CiNIC Architecture Polling Protocol Performance Testbed
As mentioned in §4.1, Rob McCready and Mark McClellend (§2.3.2) had not yet
completed the host/co-host communication protocol when Ma and I began working
on the new logic analysis measurement tool (i.e., the tool that would gather latency
data from both the primary and secondary PCI busses). Consequently, Ma and I had
to devise a “plan B” communications protocol that would allow us to send data from the host system down to the EBSA-285 co-host and then out to the NIC on the secondary PCI bus. This protocol is basically a stop-and-wait protocol built around a set of state bits that the host and co-host sides manipulate and poll to determine the current state of the protocol (e.g., the host has finished writing data down to the co-host; the co-host has finished sending the data to the NIC, etc.). Ma’s senior project
report [33] describes this protocol in detail, so only a brief description of the polling
protocol testbed is provided here.
Figure 4-21 below is a schematic representation of the hardware / software
architecture of the polling testbed for the CiNIC architecture. The three data paths
from Figure 4-11 are also shown on Figure 4-21 along with a fourth data path 0
(zero). In a nutshell, the client-2.1 application triggers the logic analyzer by writing to
an unimplemented BAR on a PCI device that’s connected to the primary PCI bus
(path 0). Immediately thereafter, the client-2.1 application copies a block of data
down to the shared memory on the EBSA-285 (paths 1 and 2). The testapp program
on the EBSA-285 detects the data block’s arrival and copies it from shared memory
out to the NIC via a socket send operation (path 3).
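To give a flavor of the idea, the following sketch shows a hypothetical shared-memory control block and the host side of one send cycle. The flag names, layout, and initialization convention used here are assumptions made for illustration; the actual state bits and their semantics are those documented in [33].

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical shared-memory control block (not the layout defined in [33]). */
    struct cinic_ctrl {
        volatile uint32_t cohost_done;   /* co-host -> host: previous payload sent    */
        volatile uint32_t host_ready;    /* host -> co-host: new payload is in place  */
        volatile uint32_t payload_len;   /* number of payload bytes that follow       */
        uint8_t           payload[];     /* payload area within shared memory         */
    };

    /* Host side of one stop-and-wait cycle (pure busy-wait polling, no interrupts).
     * `cohost_done` is assumed to be initialized to 1 by the co-host at start-up. */
    static void host_send_one(struct cinic_ctrl *ctrl, const void *buf, uint32_t len)
    {
        while (ctrl->cohost_done == 0)
            ;                                 /* wait for the previous send to finish   */
        ctrl->cohost_done = 0;
        memcpy(ctrl->payload, buf, len);      /* copy the payload down to shared memory */
        ctrl->payload_len = len;
        ctrl->host_ready  = 1;                /* hand the payload off to the co-host    */
    }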
[Figure 4-21 CiNIC Polling protocol performance testbed: on the host, the client-2.1 application writes to the unimplemented BAR through the pcitrig.o module (path 0) and writes the payload through the hostmem_p.o module (/dev/hostmem), across PCI 1 (path 1), through the 21554, and across PCI 2 into the EBSA-285's shared memory via the 21285 and ebsamem_p.o / /dev/ebsamem (path 2); the testapp program on the EBSA-285 then sends the payload through the socket API, TCP/IP stack, and NIC device driver to the NIC (path 3); the HP 16700A probes both PCI busses]
4.4.3
CiNIC Host Socket Performance Testbed
The performance testbed for the CiNIC “Host Socket” configuration is shown in
Figure 4-22. This is the current CiNIC design and implementation, which was completed in June 2001.
[Figure 4-22 CiNIC architecture host socket performance testbed: on the host, the client-2.01 application uses the socket API; the kernel-side protocol module, pcitrig.o, and hostmem_p.o (/dev/hostmem) carry the data across PCI 1 (path 1), through the 21554, and across PCI 2 into the EBSA-285's shared memory via the 21285 and ebsamem_p.o / /dev/ebsamem (path 2); protocol handlers and the TCP/IP stack on the EBSA-285 then send the data through the NIC's device driver to the NIC (path 3); path 0 is the BAR write that triggers the HP 16700A]
4.4.4
Secondary PCI Bus “Posted Write” Phenomenon
While working on the instrumentation for the CiNIC architecture polling protocol,
Ma and I noticed a phenomenon where the logic analyzer would detect the presence
of the four “P” characters on the secondary PCI bus before they appeared on the
primary PCI bus as shown in Figure 4-23a. Recall that the data path is from the host
to the 21554 across the primary PCI bus (Figure 4-23b), and then from the 21554 to
the 21285 (and ultimately onto the shared memory) via the secondary PCI bus (Figure
4-23c). For a moment there, we thought we might have discovered a heretofore
unknown quantum-like behavior in the 21554 bridge – i.e., data was showing up on
the secondary PCI side of the 21554 before it had been written to the primary PCI
side. Talk about a performance boost!
[Figure 4-23 Start marker appears first on secondary PCI bus: (a) the "PPPP" characters are seen on PCI2 between the 21554 and the 21285; (b) the "PPPP" characters are seen on PCI1 between the host and the 21554; (c) the "PPPP" characters are seen on PCI2 as the 21554 forwards them toward the shared memory]
What we were seeing, of course, was not some strange quantum effect but a
standard PCI behavior. Taking it step by step, the test application on the host machine
Hydra copies the contents of its defined payload buffer down to the shared memory
region on the EBSA-285. At the hardware level, this copy operation transfers the
payload buffer’s contents from the upstream side of the 21554 to the downstream side
of the 21554, then to the 21285 host bridge, and finally to the shared memory region
on the EBSA-285. An application on the EBSA-285 then copies the data from the
shared memory region to the NIC via a socket connection. When this write completes, the test application begins the next iteration of the “host ⇒ EBSA ⇒ NIC” send cycle by once again copying the contents of its payload
buffer down to the shared memory region on the EBSA-285. At this point, however,
the 21554 bridge is faced with a dilemma because: a) the 21554 has no way of
knowing whether the 21285 host bridge has completed the previous write operation to
shared memory, and b) the 21554 has just received data (from the current
“host ⇒ EBSA ⇒ NIC” send cycle) that is to be written to the same shared memory
region as the previous write operation. So before the 21554 can send the new data to
the 21285 (i.e., to the shared memory region via the 21285), the 21554 must first
ensure the 21285 host bridge has written all of the data from the previous memory
write transaction out to shared memory. The 21554 does this by performing a read
transaction from the same region of shared memory that was the target of the previous
write transaction. According to [51]:
“Before permitting a read to occur on the target bus the
bridge designer must first flush all posted writes to their
destination memory targets. A device driver can ensure that
all memory data has been written to its device by performing
a read from the device. This will force the flushing of all
posted write buffers in bridges that reside between the master
executing the read and the target device before the read is
permitted to complete.”
So the initial set of “P” characters on the secondary PCI bus (Figure 4-23a) is an
artifact of the 21554 bridge chip performing a dummy read from shared memory to
ensure the 21285 has finished writing its posted write data out to shared memory.
Following this read, the 21554 asserts its “target ready” line and allows the host to
send it the four “P” characters via the primary PCI bus (Figure 4-23b). Next, the
21554 transfers the “P” characters across the secondary PCI bus to the 21285 (Figure
4-23c), which ultimately writes the “P” characters into shared memory.
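In source code terms, the flushing rule quoted from [51] amounts to nothing more than a dummy read-back, as in the following sketch; the pointer is assumed to reference a mapping of the memory region that was just written.

    #include <stdint.h>

    /* Dummy read-back: the read cannot complete until every bridge between the
     * reader and the target has flushed its posted write buffers [51]. */
    static void flush_posted_writes(volatile uint32_t *region)
    {
        (void)*region;
    }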
Chapter 5
Performance Data and Analysis
Using the instrumentation technique described in Chapter 4, I obtained performance
data for a number of test cases. Performance data for earlier versions of the Linux
kernel, and for Microsoft Windows NT 4.0 is presented in my senior project report
[12]. Samples of the performance data obtained with this instrumentation technique
are provided here solely for the purpose of demonstrating the capabilities of this
measurement technique.
The charts in the following subsections represent the latency characteristics of
a particular system under test [SUT] in response to a series of network send operations
from the application layer. There are two types of SUTs: “stand-alone” systems and
the combined host/co-host CiNIC system. During the stand-alone tests, the SUT was
either an i686 host or the EBSA-285, but not both. For the combined tests, the SUT
was the entire CiNIC architecture – i.e., the host computer system, the 21554 PCI-PCI
nontransparent bridge, and the EBSA-285 co-host (see Figure 2-4).
Each chart’s title contains the following information items: the name and
version of the operating system that was running on the SUT during the tests, the
transport protocol tested (either TCP or UDP), the version of the Internet protocol
tested (either IPv4 or IPv6), and the range of payloads tested. The payload range is shown in the format min-max-step/count, where min is the
minimum (or “starting”) payload size, max is the largest (or “ending”) payload size,
step is the payload increment size, and count is the number of times each payload size
was sent.
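For reference, a sweep encoded by a title such as 512-65536-512 / 30 corresponds to a simple nested loop of the following form; run_one_trial() is a placeholder for the trigger, send, and analyzer-download sequence described in §§4.3 and 4.4.

    #include <stddef.h>

    extern void run_one_trial(size_t payload_size);   /* placeholder: trigger, send, download */

    /* Sweep the payload sizes encoded in a chart title of the form min-max-step / count. */
    static void run_sweep(size_t min, size_t max, size_t step, int count)
    {
        for (size_t size = min; size <= max; size += step)
            for (int i = 0; i < count; i++)
                run_one_trial(size);                  /* yields one latency sample per call */
    }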
Each chart contains two traces. The lower trace represents the median wire
arrival time [41] – i.e., the median time when the first byte of the data payload
appeared on a particular PCI bus relative to the “time zero” PCI trigger signal (see
§§4.3 and 4.4). The upper trace represents the median wire exit time [41] – i.e., the
median time when the last byte of the data payload reached a particular PCI bus
relative to a “time zero” trigger signal.
For purely informational purposes, certain figures on the following pages
contain a pair of charts that plot the same set of data values. The top chart provides a
linear plot of the data and the bottom provides a semi-log plot of the data. This was
done because linear plots often reveal performance characteristics that go unnoticed on
semi-log plots, and vice versa.
5.1
Stand-alone Performance – Microsoft Windows 2000 (SP1)
The TCP/IP4 and UDP/IP4 performance data shown in this section were obtained
using the stand-alone performance instrumentation described in §4.3. The SUT was a
dual Pentium-III host with 512 MB of RAM and a 3Com 3c905C NIC. The operating
system environment was Microsoft Windows 2000 with service pack 1 (SP1) installed.
Blocking-type sockets were used for all network connections.
[Figure 5-1 Windows 2000 TCP/IP4 on i686 host, payload sizes: 32 to 16K bytes (Fornax: Windows 2000 TCP / IP4, 32-16384-32 / 10; linear and semilog plots of Latency (usec) vs. Payload Size (bytes))]
[Figure 5-2 Windows 2000 TCP/IP4 on i686 host, payload sizes: 512 to 65K bytes (Fornax: Windows 2000 TCP / IP4, 512-65536-512 / 30; linear and semilog plots of Latency (usec) vs. Payload Size (bytes))]
[Figure 5-3 Windows 2000 UDP/IP4 on i686 host, payload sizes: 512 to 65K bytes (Fornax: Windows 2000 UDP / IP4, 512-65536-512 / 30; linear and semilog plots of Latency (usec) vs. Payload Size (bytes))]
5.2
Stand-alone Performance – Linux 2.4.3 Kernel on i686 Host
The TCP/IP4 and UDP/IP4 performance data shown in this section were obtained
using the stand-alone performance instrumentation described in §4.3. The SUT was
the i686 PC named Hydra. As mentioned in section 2.2.1, Hydra is a uniprocessor
host with a 450 MHz Pentium-III CPU. The operating system was Redhat Linux 7.1
with a Linux 2.4.3 kernel. Blocking-type sockets were used for all network
connections.
[Figure 5-4 Linux 2.4.3 TCP/IP4 on i686 host, payload sizes: 64 to 9K bytes (Hydra: Linux 2.4.3 TCP / IP4, 64-9216-32 / 30; linear and semilog plots of Latency (usec) vs. Payload Size (bytes))]
[Figure 5-5 Linux 2.4.3 TCP/IP4 on i686 host, payload sizes: 512 to 65K bytes (Hydra: Linux 2.4.3 TCP / IP4, 512-65536-512 / 30; linear and semilog plots of Latency (usec) vs. Payload Size (bytes))]
[Figure 5-6 Linux 2.4.3 UDP/IP4 on i686 host, payload sizes: 64 to 9K bytes (Hydra: Linux 2.4.3 UDP / IP4, 64-9216-32 / 30; linear and semilog plots of Latency (usec) vs. Payload Size (bytes))]
[Figure 5-7 Linux 2.4.3 UDP/IP4 on i686 host, payload sizes: 512 to 65K bytes (Hydra: Linux 2.4.3 UDP / IP4, 512-65536-512 / 30; linear and semilog plots of Latency (usec) vs. Payload Size (bytes))]
5.3
Stand-alone Performance – Linux 2.4.3 Kernel on EBSA-285
The TCP/IP4 and UDP/IP4 performance data shown in this section were obtained
using the stand-alone performance instrumentation described in §4.3. The SUT was an
EBSA-285 with 144 MB of RAM and a 3Com 3c905C NIC. The EBSA-285 was
configured for host bridge operation. The operating system environment was
(approximately) Redhat Linux 7.1 with a Linux 2.4.3-rmk2 kernel. The “-rmk2” suffix
on the Linux kernel version identifies the second patch that “Russell M. King” has
created for version 2.4.3 of the ARM/Linux kernel. Blocking-type sockets were used
for all network connections.
[Figure 5-8 Linux 2.4.3-rmk2 TCP/IP4 on EBSA285, payload sizes: 64 to 4.3K bytes (EBSA285: Linux 2.4.3-rmk2 TCP / IP4, 64-9216-32 / 30; linear and semilog plots of Latency (usec) vs. Payload Size (bytes))]
[Figure 5-9 Linux 2.4.3-rmk2 TCP/IP4 on EBSA285, payload sizes: 512 to 50K bytes (EBSA285: Linux 2.4.3-rmk2 TCP / IP4, 512-65536-512 / 30; linear and semilog plots of Latency (usec) vs. Payload Size (bytes))]
[Figure 5-10 Linux 2.4.3-rmk2 UDP/IP4 on EBSA285, payload sizes: 64 to 9K bytes (EBSA285: Linux 2.4.3-rmk2 UDP / IP4, 64-9216-32 / 30; linear and semilog plots of Latency (usec) vs. Payload Size (bytes))]
[Figure 5-11 Linux 2.4.3-rmk2 UDP/IP4 on EBSA285, payload sizes: 512 to 31K bytes (EBSA285: Linux 2.4.3-rmk2 UDP / IP4, 512-65536-512 / 30; linear and semilog plots of Latency (usec) vs. Payload Size (bytes))]
5.4
CiNIC Architecture – Polling Protocol Performance Testbed
The TCP/IP4 and UDP/IP4 performance data shown in this section were obtained
using the combined host / co-host CiNIC performance instrumentation described in
§4.4. The SUT was the i686 host PC named Hydra and the EBSA-285 co-host. The
operating system environment on the host PC was Redhat Linux 7.1. The operating
system environment on the EBSA-285 was (approximately) Redhat Linux 7.1 with a
Linux 2.4.3-rmk2 kernel. The “-rmk2” suffix on the Linux kernel version identifies the
second patch that “Russell M. King” has created for version 2.4.3 of the ARM/Linux
kernel. Blocking-type sockets were used for all network connections.
Recall that the polling protocol testbed being used writes the data payload
directly down to the shared memory on the co-host. It does not use any socket calls to
transfer data down to the co-host. A custom test application on the co-host detects the
host-to-shared memory write operation whereupon it copies the payload contents from
shared memory to the NIC via a co-host-side socket call.
5.4.1
CiNIC Polling Protocol TCP / IPv4 Performance
The stacked area charts shown in Figure 5-13 and Figure 5-14 show the cumulative
wire entry times across the entire CiNIC data path, as depicted in Figure 5-12.
Time T1 (= t1,1 – t1,0) is the host application / OS wire entry latency – i.e., the time
interval between the PCI trigger event and the arrival of the data payload’s leading
edge on the primary PCI bus. Time T2 (= t2,1 – t1,1) is the latency added by the 21554
as it copies the payload’s leading edge from the primary PCI bus to the secondary PCI
bus. Time T3 (= t2,3 – t2,1) is the processing overhead of the test application and Linux
OS on the EBSA-285 – i.e., the time interval between the arrival of the payload in
shared memory and the reappearance of the payload’s leading edge on the secondary PCI bus as the test app sends the payload from shared memory to the NIC via a socket call.
[Figure 5-12 CiNIC wire entry times: time line showing T1 from the BAR write trigger (t1,0) to the PPPP arrival on the primary PCI bus (t1,1), T2 from t1,1 to the PPPP arrival on the secondary PCI bus (t2,1) as the 21554 writes the payload to shared memory via the 21285, and T3 from t2,1 to the AAAA arrival (t2,3) as the socket call on the EBSA sends the payload from shared memory to the NIC]
[Figure 5-13 Linux 2.4.3 TCP/IP4 on CiNIC, Polling protocol, Wire entry latencies: stacked area charts (linear and semilog) of t(2,3)-t(2,1), t(2,1)-t(1,1), and t(1,1)-t(1,0) in usec vs. payload size (bytes), 64 to 764 bytes]
[Figure 5-14 Linux 2.4.3 TCP/IP4 on CiNIC, Polling protocol, Wire entry latencies: stacked area charts (linear and semilog) of the same wire entry latency components vs. payload size (bytes), 512 bytes and larger]
In Figure 5-13 and Figure 5-14, the elapsed time plot T3 (t2,3 – t2,1) is so large it
completely swamps out the elapsed time plots T1 (t1,1 – t1,0) and T2 (t2,1 – t1,1). The time
plots T1 and T2 are therefore re-plotted without T3 in Figure 5-16 so as to reveal the
characteristics of these two time intervals.
[Figure 5-15 Partial plot of CiNIC wire entry times: time line showing only T1 (t1,0 to t1,1) and T2 (t1,1 to t2,1)]
[Figure 5-16 Linux 2.4.3 TCP/IP4 on CiNIC, Polling protocol, Wire entry latencies: linear charts of t(2,1)-t(1,1) and t(1,1)-t(1,0) only (0 to 3.0 usec) vs. payload size (bytes), for 64 to 764 bytes and for 512 bytes and larger]
The stacked area charts shown in Figure 5-18 and Figure 5-19 show the cumulative
wire exit times across the entire CiNIC data path, as depicted in Figure 5-17. Time T1
(= t1,2 – t1,0) is the host application / OS wire exit latency – i.e., the time interval
between the PCI trigger event and the arrival of the data payload’s trailing edge on the
primary PCI bus. Time T2 (= t2,2 – t1,2) is the latency added by the 21554 as it copies
the payload’s trailing edge from the primary PCI bus to the secondary PCI bus. Time
T3 (= t2,4 – t2,2) is the processing overhead of the test application and Linux OS on the
EBSA-285 – i.e., the time interval between the arrival of the payload in shared
memory and the reappearance of the payload’s trailing edge on the secondary PCI bus as the test app sends the payload from shared memory to the NIC via a socket call.
[Figure 5-17 CiNIC wire exit times: time line showing T1 from the BAR write trigger (t1,0) to the QQQQ arrival on the primary PCI bus (t1,2), T2 from t1,2 to the QQQQ arrival on the secondary PCI bus (t2,2), and T3 from t2,2 to the EEEE arrival (t2,4) as the socket call on the EBSA sends the payload from shared memory to the NIC]
[Figure 5-18 Linux 2.4.3 TCP/IP4 on CiNIC, Polling protocol, Wire exit latencies: stacked area charts (linear and semilog) of t(2,4)-t(2,2), t(2,2)-t(1,2), and t(1,2)-t(1,0) in usec vs. payload size (bytes), 64 to 764 bytes]
[Figure 5-19 Linux 2.4.3 TCP/IP4 on CiNIC, Polling protocol, Wire exit latencies: stacked area charts (linear and semilog) of the same wire exit latency components vs. payload size (bytes), 512 bytes and larger]
In Figure 5-18 and Figure 5-19, the elapsed time plot T3 (t2,4 – t2,2) is so large it
completely swamps out the elapsed time plots T1 (t1,2 – t1,0) and T2 (t2,2 – t1,2). The
time plots T1 and T2 are therefore re-plotted without T3 in Figure 5-21 so as to reveal
the characteristics of these two time intervals.
[Figure 5-20 Partial plot of CiNIC wire exit times: time line showing only T1 (t1,0 to t1,2) and T2 (t1,2 to t2,2)]
[Figure 5-21 Linux 2.4.3 TCP/IP4 on CiNIC, Polling protocol, Wire exit latencies: linear charts of t(2,2)-t(1,2) and t(1,2)-t(1,0) only vs. payload size (bytes), for 64 to 764 bytes (0 to 25 usec) and for 512 bytes and larger (0 to 1600 usec)]
Figure 5-23 shows the overhead introduced on the wire entry and exit times by the
Intel 21554 PCI-PCI nontransparent bridge, as depicted in Figure 5-22.
[Figure 5-22 21554 overhead on wire entry and exit times: time line highlighting the 21554 contributions t1,1 to t2,1 (wire entry) and t1,2 to t2,2 (wire exit)]
Some points of interest on these graphs are:
• The 21554’s downstream posted write FIFO is 256 bytes long. This fact is clearly visible on the wire exit time plot (i.e., the upper plot), where the wire exit latency across the 21554 does not begin to increase until the payload size reaches approximately 256 bytes.
• The wire exit latency plot levels off at about 2.5 usec (≈ 83 PCI clock cycles at 33 MHz) for payload sizes of approximately 512 bytes and larger. This is reasonable because the 21554 requires a minimum of
[256 (bytes)] / [4 (bytes / bus cycle)] × [1 (PCI bus cycle) / 33 MHz] ≈ 1.94 usec
to empty all 256 bytes of its downstream posted write FIFO.
[Figure 5-23 Linux 2.4.3 TCP/IP4 on CiNIC, Polling protocol, 21554 overhead: linear charts of the t(2,2)-t(1,2) and t(2,1)-t(1,1) overheads (usec) vs. payload size (bytes), for 0 to 800 bytes and for 512 bytes and larger]
Figure 5-25 shows the time required for the co-host-side socket to transfer the entire
data payload from shared memory to the NIC, as depicted in Figure 5-24.
[Figure 5-24 Socket transfer time: time line highlighting the interval from t2,3 (AAAA) to t2,4 (EEEE), during which the socket call on the EBSA sends the payload from shared memory to the NIC]
[Figure 5-25 Linux 2.4.3 TCP/IP4 on CiNIC, Polling protocol, testapp send times (PCI2): linear chart of the send duration t(2,4)-t(2,3) (usec) from shared memory to the NIC on PCI bus 2 vs. payload size (bytes), 64 to 9024 bytes]
5.4.2
CiNIC Polling Protocol UDP / IPv4 Performance
The UDP performance graphs in this section have the same sequence as the TCP
performance graphs in the previous section – i.e., wire entry, partial wire entry, wire
exit, partial wire exit, and so on.
[Figure 5-26 Linux 2.4.3 UDP/IP4 on CiNIC, Polling protocol, Wire entry latencies: stacked area charts (linear and semilog) of t(2,3)-t(2,1), t(2,1)-t(1,1), and t(1,1)-t(1,0) in usec vs. payload size (bytes), 64 to 9024 bytes]
[Figure 5-27 Linux 2.4.3 UDP/IP4 on CiNIC, Polling protocol, Wire entry latencies: stacked area charts (linear and semilog) of the same wire entry latency components vs. payload size (bytes), 512 bytes and larger]
[Figure 5-28 Linux 2.4.3 UDP/IP4 on CiNIC, Polling protocol, Wire entry latencies: linear charts of t(2,1)-t(1,1) and t(1,1)-t(1,0) only (0 to 3.0 usec) vs. payload size (bytes), for 64 to 9024 bytes and for 512 bytes and larger]
[Figure 5-29 Linux 2.4.3 UDP/IP4 on CiNIC, Polling protocol, Wire exit latencies: stacked area charts (linear and semilog) of t(2,4)-t(2,2), t(2,2)-t(1,2), and t(1,2)-t(1,0) in usec vs. payload size (bytes), 64 to 9024 bytes]
[Figure 5-30 Linux 2.4.3 UDP/IP4 on CiNIC, Polling protocol, Wire exit latencies: stacked area charts (linear and semilog) of the same wire exit latency components vs. payload size (bytes), 512 bytes and larger]
[Figure 5-31 Linux 2.4.3 UDP/IP4 on CiNIC, Polling protocol, Wire exit latencies: linear charts of t(2,2)-t(1,2) and t(1,2)-t(1,0) only vs. payload size (bytes), for 64 to 9024 bytes (0 to 250 usec) and for 512 bytes and larger (0 to 1600 usec)]
[Figure 5-32 Linux 2.4.3 UDP/IP4 on CiNIC, Polling protocol, 21554 overheads: linear charts of the t(2,2)-t(1,2) and t(2,1)-t(1,1) overheads (usec) vs. payload size (bytes), for 0 to 10000 bytes and for 512 bytes and larger]
[Figure 5-33 Linux 2.4.3 UDP/IP4 on CiNIC, Polling protocol, testapp send times (PCI2): linear chart of the send duration t(2,4)-t(2,3) (usec) from shared memory to the NIC on PCI bus 2 vs. payload size (bytes), 64 to 9024 bytes]
Chapter 6
Conclusions and Future Work
6.1
Conclusions
The first half of this thesis presented a brief history of the Calpoly intelligent NIC
(CiNIC) Project, and described the hardware and software architectures that comprise
the CiNIC platform. This was followed by a discussion of the i686-ARM and ARM-ARM development tools I implemented to support software development for (and
within) the ARM/Linux environment on the EBSA-285 co-host.
In the second half of the thesis I described the performance instrumentation
techniques I created, with the assistance of Ron J. Ma [33], to gather latency data
from the CiNIC architecture during network send operations. There are two
implementations of this instrumentation technique: “stand-alone” and “CiNIC.” The
stand-alone implementation gathers performance data either from the host system by
itself, or the EBSA-285 co-host system by itself. The CiNIC instrumentation
technique gathers performance data from the combined host/co-host CiNIC platform.
Samples of the performance data obtained with these two instrumentation techniques
are provided in Chapter 5. As part of this development effort, Ma and I devised a
makeshift communications protocol that sends data from the host system down to the EBSA-285 co-host and out to the NIC on the secondary PCI bus. The results of our tests with this makeshift data transfer protocol showed the CiNIC platform to be about 50% slower than the stand-alone i686 host Hydra at this stage of development.
6.2
Future Work
Some ideas for future work based on this project are:
• Upon completion of the host / co-host communications infrastructure (see [35] and [36]), use the instrumentation technique described herein to gather performance data for that system. Specifically, use host-side socket calls instead of the memory mapping technique used in the polling protocol [33] to write data down to the EBSA-285 co-host and out to the NIC.
• Using empirical data gathered with the instrumentation technique described herein, create a performance model for the CiNIC architecture. This model should be derived from the performance metrics described in Vern Paxson’s “Framework for IP Performance Metrics” document [41].
• Use the performance model to identify potential optimization points within the CiNIC architecture’s data path.
• Employing the cyclic development strategy shown in Figure 6-1, make incremental improvements to the CiNIC architecture.
[Diagram: cyclic development flow with the stages Research, System design, Implementation, Testing, Performance Instrumentation, Analysis, and Results.]
Figure 6-1 CiNIC development cycle
Bibliography
[1]
American National Standards Institute, “Programming Languages – C”
ISO/IEC 9899:1999(E), 2nd edition. December 1, 1999.
[2]
American National Standards Institute, “Programming Languages – C++”
ISO/IEC 14882:1998(E). September 1, 1998.
[3]
D. Barlow, “The Linux GCC HOWTO.” May 1999
<http://howto.tucows.com/>
[4]
I. Bishara, “Data Transfer Between WinNT and Linux Over a Non-Transparent Bridge.” Senior project, California Polytechnic State University,
San Luis Obispo (June 2000).
[5]
Borland International, “Service applications.” Software: Borland C++ Builder
5.0 Development Guide: Programming with C++ Builder, 2000.
[6]
D. P. Bovet and M. Cesati, “Understanding the Linux Kernel.” O’Reilly,
2001. ISBN 0-596-00002-2.
[7]
Y. Chen, “The Performance of a Client-Side Web Caching System.” Master’s
thesis, California Polytechnic State University, San Luis Obispo (April 2000).
[8]
D. Comer, “Internetworking with TCP/IP: Principles, Protocols, and Architectures.” 4th ed., v1, Prentice-Hall, 2000. ISBN 0-13-018380-6.
[9]
Digital Equipment Corp., “Digital Semiconductor 21554 PCI-to-PCI Bridge
Evaluation Board.” User’s guide number EC-R93JA-TE, February 22, 2000
(Preliminary document).
[10]
D. Dougherty and A. Robbins, “sed & awk.” 2nd ed. Unix Power Tools series.
O’Reilly, 1997. ISBN 1-56592-225-5.
[11]
E. Engstrom, “Running ARM/Linux on the EBSA-285.” Senior project,
California Polytechnic State University, San Luis Obispo (Spring 2001).
[12]
J. Fischer, “Measuring TCP/IP Performance in the PC Environment: A Logic
Analysis / Systems Approach.” Senior project, California Polytechnic State
University, San Luis Obispo (June 2000).
[13]
FuturePlus Systems Corp, “FS2000 Users Manual.” Rev. 3.5, 1998.
[14]
FuturePlus Systems Corp, “PCI Analysis Probe and Extender Card FS2000 –
Quick Start Instructions.” Rev 3.4, 1998.
[15]
Hewlett-Packard Corp, “HP 16500C/16501A Logic Analysis System:
Programmer's Guide.” Publication number 16500-97018, December 1996.
[16]
Hewlett-Packard Corp, “HP 16550A 100-MHz State / 500-MHz Timing Logic
Analyzer: Programmer's Guide.” Publication number 16500-97000, May
1993.
[17]
Hewlett-Packard Corp, “HP 16550A 100-MHz State / 500-MHz Timing Logic
Analyzer: User’s Reference.” Publication number 16550-97001, May 1993.
[18]
Hewlett-Packard Corp, “HP 16554A, HP 16555A, and HP 16555D
State/Timing Logic Analyzer: User’s Reference.” Publication number 16555-97015, February 1999.
[19]
Hewlett-Packard Corp, “Installation Guide: HP 16600 Series, HP 16700A, HP
16702A Measurement Modules.” Publication number 16700-97010, 1999.
[20]
Hewlett-Packard Corp, “Remote Programming Interface (RPI) For the HP
16700 Logic Analysis System.” Version 3-1-99, 1999.
[21]
Y. Huang, “Web Caching Performance Comparison for Implementation on a
Host (Pentium III) and a Co-host (EBSA-285).” Master’s thesis, California
Polytechnic State University, San Luis Obispo (June 2001).
[22]
P. Huang, “Design and Implementation of the CiNIC Software Architecture
on a Windows Host.” Master’s thesis, California Polytechnic State University,
San Luis Obispo (February 2001).
[23]
Intel Corporation, “21554 PCI-to-PCI Bridge for Embedded Applications.”
Hardware reference manual 278091-001, September 1998.
[24]
Intel Corporation, “StrongARM** EBSA-285 Evaluation Board.” Reference
manual 278136-001, October 1998.
[25]
Intel Corporation, “21285 Core Logic for SA-110 Microprocessor.” Datasheet
number 278115-001, September 1998.
[26]
Intel Corporation, “SA-110 Microprocessor.” Technical reference manual
278058-001, September 1998.
[27]
Intel Corporation, “PCI and uHAL on the EBSA-285.” Application note
278204-001, October 1998.
[28]
Intel Corporation, “Memory Initialization on the EBSA-285.” Application
note 278152-001, October 1998.
[29]
R. King, “The ARM Linux Project.” 09-June-2001,
<http://www.arm.linux.org.uk/>
[30]
R. King, “ARM/Linux FTP site.” 09-June-2001
<ftp.arm.linux.org.uk/pub/armlinux/source/>.
[31]
A. Kostyrka, “NFS-Root mini-howto.” v8.0, 8 Aug, 1997
<http://howto.tucows.com/>
[32]
Y.G. Liang, “Network Latency Measurement for I2O Architecture by Logic
Analyzer Instrumentation Technique.” Senior project, California Polytechnic
State University, San Luis Obispo (June 1998).
[33]
R. Ma, “Instrumentation Development on the Host PC / Intel EBSA-285 Co-host Network Architecture.” Senior project, California Polytechnic State
University, San Luis Obispo (June 2001).
[34]
Maxtor Corp., “531DX Ultra DMA 100 5400 RPM: Quick Specs.” Document
number: 42085, rev. 2, May 7, 2001.
[35]
M. McClelland, “A Linux PCI Shared Memory Device Driver for the Cal Poly
Intelligent Network Interface Card.” Senior project, California Polytechnic
State University, San Luis Obispo (June 2001).
[36]
R. McCready, “Design and Development of CiNIC Host/Co-host Protocol.” Senior
Senior project, California Polytechnic State University, San Luis Obispo (June
2001).
[37]
Microsoft Corp., “Simultaneous Access to Multiple Transport Protocols,”
Windows Sockets Version 2. Microsoft Developer Network, Platform SDK
online reference, January 2001.
[38]
D. Mills, “Network Time Protocol (Version 3),” Network Working Group,
RFC 1305, 120 pages (March 1992).
[39]
R. Nemkin, “Diskless-HOWTO.” v1.0, 13 May 1999.
<http://howto.tucows.com/>
[40]
C. Newham and B. Rosenblatt, “Learning the bash shell.” 2nd ed. O’Reilly,
1998. ISBN 1-56592-347-2.
[41]
V. Paxson, “Framework for IP Performance Metrics.” Informational
RFC 2330, 40 pages (May 1998).
[42]
T. Pepper, “Linux Network Device Drivers for PCI Network Interface Cards
Using Embedded EBSA-285 StrongARM Processor Systems.” Senior project,
California Polytechnic State University, San Luis Obispo (June 1999).
[43]
J. Postel, “Discard Protocol.” RFC-863, 1 page (May 1983).
[44]
J. Postel, “Internet Protocol.” RFC-791, 45 pages (September 1981).
[45]
J. Postel and K. Harrenstien, “Time Protocol.” Internet standard RFC 868,
2 pages (May 1983).
[46]
B. Quinn and D. Shute, “Windows Sockets Network Programming.” Reading:
Addison-Wesley, 1996.
[47]
J. Reynolds and J. Postel, “Assigned Numbers.” Memo RFC 1340, 139 pages
(July 1992).
[48]
A. Rubini, “Linux Device Drivers.” O’Reilly, 1998. ISBN 1-56592-292-1.
[49]
M. Sanchez, “Iterations on TCP/IP – Ethernet Network Optimization.”
Master’s Thesis, California Polytechnic State University, San Luis Obispo
(June 1999).
[50]
R. L. Schwartz and T. Christiansen, “Learning Perl.” 2nd ed. UNIX
Programming series, O’Reilly, 1997. ISBN 1-56592-284-0.
[51]
T. Shanley and D. Anderson, “PCI System Architecture.” PC System Architecture Series. 4th ed. Reading: Mindshare, 1999. ISBN 0-201-30974-2.
[52]
K. Sollins, “The TFTP Protocol (Revision 2).” Internet standard, RFC 1350,
11 pages (July 1992).
[53]
L. Wall, T. Christiansen and R. L. Schwartz. “Programming Perl.” 2nd ed.
O’Reilly, 1996. ISBN 1-56592-149-6.
[54]
W. Wimer, “Clarifications and Extensions for the Bootstrap Protocol.”
Proposed standard, RFC 1532. 23 pages (October 1993).
[55]
B. Wu, “An Interface Design and Implementation for a Proposed Network
Architecture to Enhance the Network Performance.” Master’s thesis,
California Polytechnic State University, San Luis Obispo (June 2000).
[56]
P. Xie, “Network Protocol Performance Evaluation of IPv6 for Windows
NT.” Master’s thesis, California Polytechnic State University, San Luis
Obispo (June 1999).
[57]
A. Yu, “Windows NT 4.0 Embedded TCP/IP Network Appliance: A Proposed
Design to Increase Network Performance.” Master’s thesis, California Polytechnic State University, San Luis Obispo (December 1999).
Appendix A Acronyms
Acronym       Definition
A/L-BIOS      ARM/Linux BIOS
ARM           Advanced RISC Microprocessor
BAR           Base Address Register
CiNIC         Calpoly Intelligent NIC
CPNPRG        Cal Poly Network Performance Research Group
DIMM          Dual In-line Memory Module
DLL           Dynamic Link Library
DNS           Domain Name Service
DUT           Device Under Test
EBSA          Evaluation Board – StrongARM
EIDE          Enhanced IDE
FMU           Flash Management Utility (p/o the SDT)
FTP           File Transfer Protocol
GNU/SDTC      GNU Software Development Tool Chain
HAL           Hardware Abstraction Layer
HUT           Host Under Test
I2O           Intelligent I/O
IDE           Integrated Drive Electronics
IETF          Internet Engineering Task Force
IP            Internet Protocol (versions 4 and 6)
JTAG          Joint Test Action Group
KB            Kilobyte
LED           Light Emitting Diode
MB            Megabyte
MBPS          Megabits per second
NFS           Network File System
NIC           Network Interface Card
OS            Operating System
PCI           Peripheral Component Interconnect
POST          Power On System (Self) Test
QOS           Quality of Service
RAID          Redundant Array of Inexpensive Disks
RPM           Redhat Package Manager
SCSI          Small Computer System Interface
SDRAM         Synchronous Dynamic Random Access Memory
SDT           Software Development Toolkit (ARM Ltd.)
SUT           System Under Test
TCP           Transmission Control Protocol
UDP           User Datagram Protocol
uHAL          Micro HAL
WOSA          Windows Open System Architecture
Appendix B RFC 868 Time Protocol Servers
B.1.
RFC 868 Time Protocol Servers
The Linux operating system on the EBSA-285 can only use the rdate command when
the EBSA-285 is connected to a network that has one or more RFC 868 Time Protocol
servers connected to it. Therefore, the system administrator needs to configure at least
one of the networked workstations as a server for the RFC 868 Time Protocol. This
appendix describes how this is done.
NOTE
Redundancy is generally a good thing when dealing with system-critical
services. So the system administrator should, in my opinion, configure at
least three machines to act as RFC 868 Time Protocol servers. These
machines should preferably be a mixture of both Linux and Windows
2000 hosts so that if one set of hosts is temporarily unavailable (e.g., a
Windows upgrade temporarily disables the RFC 868 time service on the
Win2K hosts), the remaining RFC 868 time servers will still be available
to the EBSA-285.
B.1.1.
RFC 868 Support on Linux
The Linux hosts in the CPNPRG lab use the xinetd daemon to manage (on a host-by-host basis) incoming connection requests for the so-called “well-known” network
services such as telnet, FTP, finger, the RFC 868 Time Protocol, etc. On Redhat 7.x
systems there exists a directory named /etc/xinetd.d/ that contains configuration files for
each of the well-known network services the xinetd daemon is responsible for. The
xinetd daemon reads these configuration files when it starts, and thereby learns what it
should do when it receives an incoming connection request for a particular network
service.
To enable support for the RFC 868 Time Protocol on our Redhat 7.x Linux
hosts I had to manually create two configuration files named time and time-udp in each host’s /etc/xinetd.d/ directory. This was done in accordance with the instructions
found in the xinetd.conf(5) man page. After creating the time and time-udp
configuration files I restarted each host’s xinetd daemon so that the daemon would
re-read its configuration files and thereby enable support for the RFC 868 Time
Protocol. To restart the xinetd daemon on a Redhat 7.x Linux host, log on as the super
user ‘root’ and issue the xinetd daemon’s restart command as shown in Figure B-1
[jfischer@server: /temp]$ su -
Password:
[root@server: /root]$ /etc/rc.d/init.d/xinetd restart
Figure B-1 Restarting the xinetd daemon on a Linux host
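The contents of the lab’s time and time-udp files are not reproduced here, but a minimal /etc/xinetd.d/time file for the TCP variant of the service might look like the following sketch. The attribute values shown are assumptions based on stock Redhat 7.x xinetd conventions (the time service is implemented inside xinetd itself, hence type = INTERNAL); the xinetd.conf(5) man page remains the authoritative reference.

service time
{
    type        = INTERNAL
    id          = time-stream
    socket_type = stream
    protocol    = tcp
    user        = root
    wait        = no
    disable     = no
}

The time-udp file would be analogous, with id = time-dgram, socket_type = dgram, protocol = udp, and wait = yes.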
To test whether the xinetd restart enabled support for the RFC 868 Time Protocol on a
particular Linux host, log on to any other Linux host (i.e., any host other than the RFC
868 server) and issue the command shown in Figure B-2:
[jfischer@xyz: /temp]$ rdate -p server
[server]       Wed Apr 25 02:47:09 2001
[jfischer@xyz: /temp]$
Figure B-2 Testing the xinetd daemon with the rdate command
where ‘server’ is the DNS name (e.g., pictor) or IP address of the RFC 868 server
whose xinetd daemon you just restarted. The RFC 868 server should send back a reply
that looks something like the one on line 2 (i.e., the line that starts with the string
“[server]”) in Figure B-2.
B.1.2.
RFC 868 Support on Windows 2000 Workstations
As far as I know, Microsoft’s Windows 2000 for Workstations operating system does
not ship with an RFC 868 Time Protocol service. Since the RFC 868 time protocol is a
fairly trivial protocol, I decided to “roll my own” Windows 2000 service application to
implement the TCP/IP version of the RFC 868 Time Protocol. Recall that a Win2K
“service application” is a user-layer application that: a) does not have a “display
window,” and b) takes service requests from client applications, processes those
requests, and returns some information to the client applications. They typically run in
the background, without much user input. A web, FTP, or e-mail server is an example
of a service application [5]. I used Borland C++ Builder, version 5, build 2195,
Service Pack 1, to write the Win2K service application. The source code and
executable image for the Win2K service can be found on the accompanying CD-ROM
in the directory named \CiNIC\software\Win2k\RFC-868\.
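To give a sense of how little the protocol requires, the sketch below shows the core behavior of an RFC 868 time server (TCP variant) written in C with BSD-style sockets. This is not the Borland C++ Builder source found on the CD-ROM; it is a simplified, hypothetical illustration only: accept a connection on port 37, send the current time as a 32-bit big-endian count of seconds since 00:00 1 January 1900 GMT, and close the connection. The Win2K service described above implements the same wire behavior, presumably via Winsock and the Borland service-application framework.

/* Minimal RFC 868 time server, TCP variant -- illustrative sketch only,
   not the Win2K service application described above. It must run with
   sufficient privilege to bind the well-known port 37. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define RFC868_PORT  37
/* Seconds from the RFC 868 epoch (1 Jan 1900) to the Unix epoch (1 Jan 1970). */
#define EPOCH_OFFSET 2208988800UL

int main(void)
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    if (listener < 0) { perror("socket"); return EXIT_FAILURE; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(RFC868_PORT);

    if (bind(listener, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(listener, 5) < 0) {
        perror("bind/listen");
        return EXIT_FAILURE;
    }

    for (;;) {
        int conn = accept(listener, NULL, NULL);
        if (conn < 0)
            continue;
        /* RFC 868 reply: current time as 32 bits, network byte order,
           counted in seconds since 00:00 1 January 1900 GMT. */
        uint32_t t = htonl((uint32_t)(time(NULL) + EPOCH_OFFSET));
        write(conn, &t, sizeof t);
        close(conn);   /* one reply per connection, then disconnect */
    }
}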
B.1.2.1.
Installing the Windows 2000 RFC 868 Service Application
To install the RFC 868 service application on a Windows 2000 workstation,
start by copying the file TimeServerTcp.exe from the accompanying CD-ROM to the
/WINNT/system32/ subdirectory on the Windows 2000 host. Then open a Win2K
command prompt window and type the following at the command prompt:
C:\> %windir%\system32\TimeServerTcp.exe /INSTALL
After executing the install command shown above, you should be able to start the
service application by opening the Administrative Tools > Services applet (via the
Windows Control Panel), selecting the RFC 868 Time Service (TCP) entry in the list
of available services (see Figure B-3), and clicking on the Start Service button on the
Services window toolbar.
Figure B-3 Windows 2000 “Services” applet showing the RFC 868 Time Service
NOTE
The RFC 868 time service is designed to start automatically when
Windows boots. If for some reason this (or any other) service fails to
start, the System Administrator can manually start, pause, stop, and
restart any Win2K service via the Administrative Tools > Services applet
in the Windows Control Panel.
B.1.2.2.
Removing the Windows 2000 RFC 868 Service Application
To remove the RFC 868 Time Service from the Services applet, open the
Administrative Tools > Services applet (via the Windows Control Panel) and make
sure the RFC 868 Time Service (TCP) is stopped. Then open a command prompt
window and type the command,
W:/> %windir%\system32\TimeServerTcp.exe /UNINSTALL
This should remove the RFC 868 Time Service from the Win2K services database.
Appendix C EBSA-285 SDRAM Upgrade
It is important to note that the EBSA-285 has very specific requirements for the
SDRAM DIMMs it uses. Not all SDRAM DIMMs will work with the EBSA-285. For
example, we tried transplanting some SDRAM DIMMs from various PC hosts into the
EBSA-285 and found that the EBSA-285 would not boot with these DIMMs installed.
The exact specifications for the EBSA-285 SDRAM DIMMs are located in sections
2.3.1, 8.8.7 (particularly tables 8-1 and 8-2), and appendix A.5 of the EBSA-285
reference manual [24].
The ARM/Linux BIOS [30] we are using to boot the EBSA-285 also has some
specific requirements for the SDRAM DIMMs. Note that the BIOS must be able to
detect and initialize the SDRAM DIMMs during the EBSA-285 POST sequence;
otherwise, the EBSA-285 will not boot. We ran into this very problem with versions
1.9 and lower of the ARM/Linux BIOS – even when compatible SDRAM DIMMs
were installed on the EBSA-285 (i.e., the v1.9 and lower BIOSs failed to detect and
initialize some EBSA-compatible DIMMs). The 1.10 and 1.11 versions of the
ARM/Linux BIOS are definitely better at detecting and initializing SDRAM DIMMs,
but they still have trouble with certain DIMM configurations. For example, after
installing two EBSA-compatible 128 MB DIMMs on the EBSA-285, the 1.10 and
1.11 ARM/Linux BIOSs both failed to detect and initialize these DIMMs;
consequently, the EBSA-285 would not boot. However, when we installed one 128
MB DIMM in the EBSA’s first (i.e., “top”) DIMM socket and the original 16 MB
DIMM in the second (i.e., “bottom”) DIMM socket, the 1.10 and 1.11 BIOSs both
successfully detected and initialized the DIMMs. Consequently, we are now using a
128 MB + 16 MB SDRAM DIMM configuration on the EBSA-285, giving us a total
SDRAM capacity of 144 MB. This quantity of RAM has proven to be more than
sufficient for development and debugging purposes, and for building “large” software
packages on the EBSA-285.
The particulars of the 128 MB SDRAM DIMM part we are currently using are
itemized in Table C-1:
Table C-1 EBSA-285 and ARM/Linux v1.1x BIOS compatible 128 MB SDRAM
Item                   Description
Vendor                 Viking Components
                       30200 Avenida de las Bandera
                       Rancho Santa Maria, CA 92688
                       1.800.338.2361
Part number            H6523
Nomenclature           128 MB SDRAM DIMM
                       PC100 compliant
                       168-pin 64-bit unbuffered
                       3.3 Volt switching levels
                       4M x 64 array configuration
Received / Installed   December 2000