Real-Time Performance of Windows XP Embedded
Andreas Harnesk ([email protected])
David Tenser ([email protected])

April 30, 2006

ABB Corporate Research, Advanced Industrial Communication Group, Västerås, Sweden
Supervisor: Henrik Johansson ([email protected])

Mälardalen University, Department of Computer Science and Electronics, Västerås, Sweden
Supervisor: Frank Lüders ([email protected])

Abstract

A business unit of ABB, providing embedded-system-based products for the automation industry, today runs its real-time core on dedicated hardware, isolated from any extra functionality. To stay competitive in the industry, development costs need to be reduced. One possible solution is to run both the real-time core and the extra functionality on the same hardware, and to switch to Windows XP Embedded as the operating system. In this report, the characteristics of XP as a real-time operating system are revealed by investigating how XP works under the hood. Two types of real-time implementations are evaluated: one implemented as a normal user thread, and another implemented in a device driver. Tests are conducted, measuring the execution times of the implementations. The results show that a device driver implementation is more deterministic than a user-mode implementation. While the specific tests conducted yielded execution times with what appeared to be limited variation, no hard guarantees about an absolute worst-case execution time can be made. However, the tests show that execution times exceeding those measured are very unlikely. This indicates that XP might be suitable as a soft real-time operating system under certain controlled conditions.

Sammanfattning

A business unit at ABB that manufactures products based on embedded systems for the automation industry today runs its real-time core on separate hardware, isolated from other functionality. To remain competitive in the industry, development costs must be reduced. One possible solution is to run both the real-time core and the other functionality on the same hardware, and to switch operating system to Windows XP Embedded. In this report, the characteristics of XP as a real-time operating system are revealed by investigating how XP works under the hood. Two types of real-time implementations are tested: one implemented as a normal user thread and the other implemented in a device driver. Tests measuring execution times are carried out. The results show that a driver implementation is more deterministic than a user-thread implementation. Although the specific tests yielded execution times with limited variation, no hard guarantees can be given for an absolute worst-case execution time. The tests do show, however, that execution times exceeding those measured are highly unlikely. This indicates that XP can work as a soft real-time operating system under controlled conditions.

Acknowledgements

First and foremost we would like to thank our thesis supervisors Henrik Johansson and Frank Lüders, who have shown a large and consistent interest throughout the project. Our numerous scientific discussions and their many constructive comments have greatly improved this work. Thanks to Roger Melander for giving us a deeper insight into how task interrupts work in a processor, and also for being such a nice person. A special thanks to Jimmy Kjellsson for helping us set up the oscilloscope environment. Last but not least, thanks to Dr. Tomas Lennvall for providing us with lots of constructive feedback.
Our experience at ABB Corporate Research has been nothing but positive and the people working there truly are top notch.

Contents

1 Introduction
  1.1 Microsoft Windows
2 Real-Time Concepts
  2.1 Hard and Soft Real-Time
  2.2 Tasks, Processes and Threads
  2.3 Shared Resources and Semaphores
  2.4 Priorities
  2.5 Scheduling
  2.6 Time Analysis
3 RTOS Requirements
  3.1 Requirement 1: Preemptible and Multitasking
  3.2 Requirement 2: Task Priorities
  3.3 Requirement 3: Predictable Task Synchronization Mechanisms
  3.4 Requirement 4: Avoid Priority Inversion
  3.5 Requirement 5: Predictable Temporal Behavior
4 Windows XP Embedded
  4.1 Background
  4.2 System Structure Overview
    4.2.1 Hardware Abstraction Layer
    4.2.2 Kernel
    4.2.3 Device Drivers
    4.2.4 Executive
  4.3 Thread Scheduling and Priority Levels
  4.4 Interrupt Handling
    4.4.1 Interrupt Service Routine
    4.4.2 Deferred Procedure Call
    4.4.3 Asynchronous Procedure Call
  4.5 Memory Management
    4.5.1 Kernel Page Pools
    4.5.2 Memory Manager
  4.6 Windows Driver Model
    4.6.1 I/O Request Packets
    4.6.2 Driver Types
    4.6.3 Device Objects
    4.6.4 I/O Request Processing
    4.6.5 Floating-Point Operations
5 Real-Time Aspects of XP
  5.1 Design Issues That Limit XP's Use As a RTOS
  5.2 Using XP as a RTOS
6 Extensions
  6.1 RTX
    6.1.1 Architecture
    6.1.2 Software Development
    6.1.3 Does RTX Meet the RTOS Requirements?
  6.2 INtime
    6.2.1 Architecture
    6.2.2 APIs
    6.2.3 Software Development
    6.2.4 Does INtime Meet the RTOS Requirements?
7 Related Work
  7.1 User Level Thread Implementation
  7.2 Driver Based Implementation
  7.3 Conclusions
8 Problem Description
  8.1 Suggested Model
  8.2 Purpose
  8.3 Scope
9 Methodology
  9.1 Conducted Tests
    9.1.1 User-Thread Implementation
    9.1.2 Driver Implementation
  9.2 Test System
    9.2.1 System Services
  9.3 Execution Time Measurement
    9.3.1 Performance Counter
    9.3.2 Time-Stamp Counter
    9.3.3 Oscilloscope
  9.4 System Load Conditions
    9.4.1 Idle
    9.4.2 CPU Load
    9.4.3 Graphics Load
    9.4.4 HDD Load
    9.4.5 Network Load
    9.4.6 Stress
  9.5 Test Names
  9.6 Additional Tests
10 Results
  10.1 TSC Measurement Results
    10.1.1 UserIdle
    10.1.2 UserCPU
    10.1.3 UserGraphics
    10.1.4 UserHDD
    10.1.5 UserNetwork
    10.1.6 UserStress
  10.2 Oscilloscope Test Results
11 Conclusions
  11.1 Better Determinism Than Reported In Previous Work
  11.2 Higher Task Priority Yields Better Determinism
  11.3 Driver Faster Than User-Mode
  11.4 Task Interruption Can Occur Anywhere
  11.5 Small Difference Between Normal and Prioritized DPC
  11.6 Algorithm Slower in Kernel-Mode
  11.7 No Guarantees Can Be Given
12 Future Work
  12.1 Use of an Ethernet Based Protocol for Communication
  12.2 Modify Interrupt Handling
  12.3 Run the Tests on XPE
  12.4 Evaluate Extensions
A Oscilloscope Test Results

List of Figures

1  Simplified Windows architecture[30].
2  The full range of priority levels in XP[3].
3  The virtual memory for two processes. The gray areas represent shared memory.
4  The flow of I/O requests through the system.
5  Sketch of the implementation suggested by the ABB business unit.
6  Full event cycle of the user-thread implementation.
7  Sequential time diagram of the user-thread implementation.
8  Full event cycle of the driver implementation.
9  Sequential time diagram of the driver implementation.
10 Measured start-stop time versus measurement number for the Performance Counter. (a) With Sleep(), (b) without Sleep().
11 Measured start-stop time versus measurement number for the Time-Stamp Counter. (a) With Sleep(), (b) without Sleep().
12 UserIdle algorithm execution time. (a) Scatter plot, (b) time distribution.
13 UserCPU algorithm execution time. (a) Scatter plot, (b) time distribution.
14 UserGraphics algorithm execution time. (a) Scatter plot, (b) time distribution.
15 UserHDD algorithm execution time. (a) Scatter plot, (b) time distribution.
16 UserNetwork algorithm execution time. (a) Scatter plot, (b) time distribution.
17 UserStress algorithm execution time. (a) Scatter plot, (b) time distribution.
18 Suggested model for interrupt interception.

List of Tables

1  The priority levels in XP.
2  Measured start-stop time in µs for the PeC and TSC.
3  Test names used throughout the report.
4  Algorithm execution time in µs for the TSC tests.

Glossary

APC     Asynchronous Procedure Call
API     Application Program Interface
BIOS    Basic Input/Output System
COTS    Commercial off the Shelf
CPU     Central Processing Unit
DDK     Microsoft Windows Driver Development Kit
DPC     Deferred Procedure Call
EDF     Earliest Deadline First
FDO     Functional Device Object
FiDO    Filter Device Object
FIFO    First In First Out
FTP     File Transfer Protocol
GPOS    General Purpose Operating System
GUI     Graphical User Interface
HAL     Hardware Abstraction Layer
HDD     Hard Disk Drive
I/O     Input/Output
IDE     Integrated Development Environment
IDT     Interrupt Descriptor Table
IDTR    Interrupt Descriptor Table Register
IRP     I/O Request Packet
IRQ     Interrupt Request
IRQL    Interrupt Request Level
ISR     Interrupt Service Routine
OS      Operating System
PC      Personal Computer
PeC     Performance Counter
PCI     Peripheral Component Interconnect
PDO     Physical Device Object
RAM     Random Access Memory
RT-HAL  Real-Time Hardware Abstraction Layer
RTOS    Real-Time Operating System
RTSS    Real-Time Sub-System
SRI     Service Request Interrupt
TSC     Time-Stamp Counter
WCET    Worst-Case Execution Time
WDM     Windows Driver Model
XPE     Microsoft Windows XP Embedded
XP      Microsoft Windows XP Family (including XPE)

1 Introduction

Within ABB, and the automation industry in general, embedded systems are found in virtually every product and system. More and more functionality is being built in, and the performance requirements are becoming tougher. Today it is very common that embedded systems run with the support offered by a real-time operating system (RTOS) in order to meet the requirements enforced by, for example, the industrial process being controlled. These RTOSs are often high performing and quite reliable. However, this often comes at the price of high cost, high complexity, and poor usability. In addition, it is not rare that such systems require special development tools and environments, and sometimes also special platforms. Since cost and usability are two important factors for the industry, this can be a problem in certain business areas.
In the past, automation systems were usually developed specifically for one or possibly a few products, which made the development extremely expensive. To decrease development costs in general, functionality is often grouped into independent, reusable and well-defined solutions. This has been possible thanks to standardization from organizations such as IEEE, W3C, ISO, and IEC, but also because of de-facto standards like Microsoft Windows and the .NET framework.

1.1 Microsoft Windows

The Windows XP family of operating systems (OS) dominates the personal computer OS market[30]. It is a general purpose operating system (GPOS) designed to optimize throughput and average performance[34]. Because of its popularity, there is a strong interest in using XP as an embedded real-time system for the automation industry. There are several reasons why the interest in XP as a RTOS is so strong. The most significant possible benefits are:

• Personal computer (PC) hardware is much cheaper than traditional embedded systems in the automation industry. For example, cheap Ethernet adapters could be used.
• Functionality can be developed using rapid prototyping and the .NET framework.
• A vast amount of software, development tools, and COTS components (e.g. ActiveX) are available to the developers[24].
• It is arguably easier to develop software for XP than for RTOSs because of familiar integrated development environments (IDE) like Microsoft Visual Studio.
• Customers are inherently familiar with the user interface, since XP is a de-facto standard in the office world[30].

This leads to the key question: under which conditions can XP be used as a RTOS? This report tries to answer this question by investigating how XP works under the hood, and by describing central functionality and mechanisms of XP, in particular those affecting real-time performance.

2 Real-Time Concepts

Before covering the requirements of a RTOS, the general real-time terms and concepts will be introduced. A real-time system is defined as a system where correct behavior not only depends on an error-free result, but also on when the result is delivered[23].

2.1 Hard and Soft Real-Time

If a real-time system fails to complete its calculation within a defined time frame, it is considered a system failure. The effect of missing a deadline varies between applications, and real-time systems are often divided into two separate classes, depending on how critical a missed deadline is. In a hard real-time system, deadlines must be met at all times, and a missed deadline could lead to catastrophic results[21]. An example of a hard real-time system is the steering system in an airplane, where a missed deadline during landing could result in a crash. A calculated result delivered after a deadline is considered useless in a hard real-time system. Soft real-time systems[7], however, are allowed to miss deadlines occasionally, but this will usually result in a performance degradation[21]. An example of a soft real-time system is a DVD player, where missed deadlines during decoding could result in frame skips, leading to poor quality rather than failure.

2.2 Tasks, Processes and Threads

All real-time systems consist of tasks[7]. A task can be seen as a sequence of method executions. There are two types of tasks, known as periodic and nonperiodic tasks. Just as it sounds, periodic tasks are executed periodically, for example every 20 milliseconds. Nonperiodic tasks, also known as event-triggered tasks, are executed when an event occurs[23].
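To make the notion of a periodic task concrete, the following is a minimal user-mode sketch of one common way to structure such a task on XP, using the Win32 waitable timer API. The 20 ms period and the read_sensor() function are illustrative assumptions only; they are not part of the systems discussed in this report. A nonperiodic (event-triggered) task would instead block on an event object and run its handler whenever the event is signaled.

    #include <windows.h>

    /* Hypothetical work function standing in for the periodic job. */
    static void read_sensor(void)
    {
        /* ... time-critical work goes here ... */
    }

    int main(void)
    {
        /* Create an auto-reset waitable timer. */
        HANDLE timer = CreateWaitableTimer(NULL, FALSE, NULL);
        if (timer == NULL)
            return 1;

        /* Start after 20 ms, then fire every 20 ms. The due time is given
           in 100 ns units (negative = relative), the period in milliseconds. */
        LARGE_INTEGER due;
        due.QuadPart = -200000LL;   /* 20 ms */
        if (!SetWaitableTimer(timer, &due, 20, NULL, NULL, FALSE))
            return 1;

        for (;;)
        {
            /* Block until the next period starts, then do the work. */
            WaitForSingleObject(timer, INFINITE);
            read_sensor();
        }
    }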
Periodic tasks are often used for sensor reading, actuator control, and other time-critical activities, while nonperiodic tasks are better suited for events that are less common, for example user interaction.

A process is an executing program, including the current values of the program counter, registers, and variables. The central processing unit (CPU) rapidly switches from process to process, running each for a short period of time. At any instant of time, the CPU is running only one process, but over the course of a longer period of time it may run several processes. This technique, giving the illusion of parallelism, is known as multitasking[31]. The actual switch of the actively running process is known as a context switch. Each process has an address space and at least one thread of execution. The thread has a program counter, keeping track of the instruction to execute next, along with registers and a stack. In modern operating systems, a process can have more than one thread, all sharing the same address space. Switching the actively running thread within a process is also called a context switch. However, a context switch within the same process is much faster, since the address space of the process remains unchanged. This is one of the most significant benefits of multithreading OSs. Both processes and threads can be seen as different types of tasks.

2.3 Shared Resources and Semaphores

A shared resource is a resource used by several tasks. It can be anything from network access to a global variable used for task synchronization. To ensure deterministic behavior of a real-time application, the usage of shared resources may need to be protected in some cases. More specifically, only one task should be able to access a shared resource at a time[31]. This protection mechanism is known as a critical section. For example, if a linked list is used as a shared resource, only one task can be allowed to access it during updates (write operations), since iterating over a list being written to can result in pointer errors. This protection of resources is often realized using semaphores[23]. Put simply, a task has exclusive rights to a resource if the task has locked the semaphore. When a semaphore is locked, any other task requesting access to the resource is blocked for the duration of the critical section. When a task has left the critical section, the semaphore is unlocked and a blocked task can acquire it instead. If multiple tasks are waiting for the semaphore, different approaches can be taken to determine which task should be granted the semaphore[31].

2.4 Priorities

The concept of priority levels is important in a real-time system. All tasks are given a priority level, which determines the execution order and time share of each task in a system. A low-priority task is interrupted if a task with higher priority wants to execute during the same time frame. A common priority problem in real-time systems occurs when shared resources are used between tasks with different priorities. In the following example, a system with three tasks and a semaphore is used. The tasks have the priority levels high, normal, and low. At first, the high and normal priority tasks are idle, and the low priority task runs and immediately locks the semaphore. After a while, the high priority task is ready to run and wants to use the semaphore, but since it is already locked, this task is blocked until the low priority task is finished with the semaphore. In the meantime, the normal priority task is ready to execute.
Because it has a higher priority than the low priority task, it gets to execute instead, and does so for an arbitrarily long time. During this time, the high priority task has to wait for the semaphore owned by the low priority task, which in turn cannot execute since a higher priority task is running; the high priority task is thus delayed even though it has the highest priority. In other words, the priorities of the tasks are inverted, a phenomenon called priority inversion. Mechanisms used to avoid priority inversion will be discussed in Section 3.4.

2.5 Scheduling

The execution order of tasks is decided by a scheduler. Two kinds of scheduling algorithms exist: off-line and on-line scheduling[23]. An off-line scheduler creates a schedule prior to code execution[7]. Because of this, an off-line scheduler can guarantee that no deadlines are missed, since it has complete knowledge of the system, assuming the timing constraints of each task are correct. However, this type of scheduling algorithm allows no event-triggered tasks, since it is not known before runtime when an event will occur. To allow event-triggered tasks in a real-time system, an on-line scheduler needs to be used. The main drawback of an on-line scheduler is that deadline guarantees can only be given under certain controlled conditions, for example, if no event-triggered tasks exist[20].

2.6 Time Analysis

Time analysis is an important subject in real-time systems. Normally, developers are interested in the average execution time, the worst-case execution time (WCET), and the execution time variation[7]. WCET and variation are the most interesting of the three. The WCET is the longest time a task will take to execute. If this time is known, it is possible to design the system in such a way that deadlines are never missed. The execution time variation is also important, since a low variation means better utilization of the hardware. Since most real-time systems are embedded and have limited hardware capacity[7], both determinism and memory efficiency are important to keep hardware costs down.

3 RTOS Requirements

Based on the definition of a real-time system, the results of a RTOS should be delivered within a predefined time frame. The RTOS needs to be time deterministic to guarantee the fulfillment of this requirement. Although time-deterministic behavior is important in a RTOS, it is not the only requirement for an OS to be considered a RTOS. The following requirements need to be fulfilled[34, 32, 19]:

• The OS has to be multitasking and preemptible.
• The notion of task priority has to exist.
• The OS has to support predictable task synchronization mechanisms.
• The OS must support a mechanism for avoiding priority inversion.
• The OS must have predictable temporal behavior.

Because of the vague definition of a soft real-time system (a system allowed to occasionally miss deadlines), no definition of a RTOS can be based on what is required by soft real-time systems. As Timmerman says in [34]: "...the term 'real-time' is often misused to indicate a fast system. And fast can then be seen as 'should meet timing deadlines', thus meaning a soft real-time system." In other words, a GPOS would be considered a RTOS if soft real-time characteristics were sufficient. In this report, the term RTOS means an OS suitable for running hard real-time systems.

3.1 Requirement 1: Preemptible and Multitasking

According to the first requirement, the OS must be multitasking. Tasks can be implemented as both processes and threads in the same system.
Since all threads in a process share the same address space, creating, destroying, and switching threads is many times faster than the same operations on processes[31]. Multithreaded OSs are therefore preferred over those that are only multitasking. According to [34]: "...[The] scheduler should be able to preempt any thread in the system and give the resource to the thread that needs it most. The OS (and the hardware architecture) should also allow multiple levels of interrupts to enable preemption at the interrupt level." In other words, a preemptible system must be capable of preempting a thread at any time during execution. Almost all OSs are multitasking, multithreaded, and offer preemption. However, most GPOSs do not allow the kernel to be preempted. Because of this limitation, a high-priority task cannot preempt a kernel call made by a low-priority task[19].

3.2 Requirement 2: Task Priorities

The notion of task priorities needs to exist in order to have some predictability of the task execution order and to ensure that the most critical tasks get to run first. There are many different scheduling algorithms available to make this possible. The optimal solution for dynamic priorities (priorities assigned during runtime) is called earliest deadline first (EDF) and lets the task with the earliest deadline execute. But since complete knowledge of the task execution needs to be known in advance, this algorithm is not suitable in event-triggered systems[23]. Although tasks in a system running EDF scheduling do not have priority levels assigned during system design, all tasks are still effectively prioritized according to the earliest deadline. Rate monotonic is the optimal algorithm for systems with statically prioritized tasks (task priority is decided in advance). One of the major drawbacks of this scheduling algorithm is the unrealistic requirement that all tasks execute without any interaction[23].

3.3 Requirement 3: Predictable Task Synchronization Mechanisms

It is unlikely that tasks in a RTOS execute completely independently of each other. Because of this, a RTOS needs predictable synchronization between tasks[23]. By using shared resources guarded by locks, safe inter-process and inter-thread communication can be guaranteed. In a RTOS, this locking mechanism needs to be time deterministic.

3.4 Requirement 4: Avoid Priority Inversion

Priority inversion is a classic real-time problem and must be handled in a RTOS. There is no way to eliminate priority inversion when shared resources and priority levels are used[19], which are both requirements of a RTOS. A RTOS therefore needs a mechanism for minimizing the duration of the inversion. One solution to this problem is known as a shared resource protocol, which defines rules for accessing shared resources. One of the simplest and most widely used shared resource protocols is the priority inheritance protocol. It reduces the blocking time by giving the low priority task the same priority as the blocked task waiting for the semaphore. To reduce the blocking time even further, a task may not still hold a locked semaphore when its execution is done. The downside of this protocol is that a high priority task can be blocked by several low priority tasks[23]. For example, if a high priority task needs two semaphores to execute and both of them are locked when execution starts, the high priority task must wait until both low priority tasks have released their semaphores before execution can start.
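The priority inversion scenario described in Sections 2.4 and 3.4 can be reproduced with a short user-mode sketch like the one below. It assumes a Win32 environment, binds the process to one CPU so that the three threads actually compete for the same processor, and uses a Win32 mutex in place of the semaphore; the thread names, delays, and busy loops are illustrative only. Since Win32 mutexes do not implement priority inheritance, the high priority thread stays blocked for as long as the normal priority thread keeps the low priority owner off the CPU (XP's anti-starvation boost eventually lets the low priority thread run, so the demo terminates after a few seconds and prints the accumulated delay).

    #include <windows.h>
    #include <stdio.h>

    static HANDLE g_mutex;            /* plays the role of the semaphore */
    static volatile LONG g_stop = 0;  /* lets the demo terminate         */

    static DWORD WINAPI low_task(LPVOID arg)    /* THREAD_PRIORITY_LOWEST  */
    {
        WaitForSingleObject(g_mutex, INFINITE); /* lock the shared resource */
        Sleep(50);                              /* pretend to work while holding it */
        ReleaseMutex(g_mutex);
        return 0;
    }

    static DWORD WINAPI normal_task(LPVOID arg) /* THREAD_PRIORITY_NORMAL  */
    {
        /* CPU-bound work that does not need the resource, but keeps the
           low priority owner (and thus the high priority task) waiting. */
        while (!g_stop)
            ;
        return 0;
    }

    static DWORD WINAPI high_task(LPVOID arg)   /* THREAD_PRIORITY_HIGHEST */
    {
        DWORD t0 = GetTickCount();
        WaitForSingleObject(g_mutex, INFINITE); /* blocks on the lock held by low_task */
        printf("high priority task got the resource after %lu ms\n",
               GetTickCount() - t0);
        ReleaseMutex(g_mutex);
        InterlockedExchange(&g_stop, 1);        /* let the demo finish */
        return 0;
    }

    int main(void)
    {
        HANDLE t[3];

        SetProcessAffinityMask(GetCurrentProcess(), 1);  /* single CPU */
        g_mutex = CreateMutex(NULL, FALSE, NULL);

        t[0] = CreateThread(NULL, 0, low_task,    NULL, CREATE_SUSPENDED, NULL);
        t[1] = CreateThread(NULL, 0, normal_task, NULL, CREATE_SUSPENDED, NULL);
        t[2] = CreateThread(NULL, 0, high_task,   NULL, CREATE_SUSPENDED, NULL);

        SetThreadPriority(t[0], THREAD_PRIORITY_LOWEST);
        SetThreadPriority(t[1], THREAD_PRIORITY_NORMAL);
        SetThreadPriority(t[2], THREAD_PRIORITY_HIGHEST);

        ResumeThread(t[0]);   /* low priority task starts first and takes the lock */
        Sleep(10);
        ResumeThread(t[2]);   /* high priority task blocks on the lock */
        ResumeThread(t[1]);   /* normal priority task now starves the owner */

        WaitForMultipleObjects(3, t, TRUE, INFINITE);
        return 0;
    }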
3.5 Requirement 5: Predictable Temporal Behavior

The final requirement states that the system activities (system calls, task switching, interrupt latency, and interrupt masking) should have predictable temporal behavior. Some papers argue that predictable temporal behavior is not enough and that timing constraints should even be stated by the RTOS manufacturer. Also, system interrupt levels and device driver interrupt request levels need to be known by the developer of the real-time system[34]. Interrupts are described in Section 4.4.

4 Windows XP Embedded

In the previous section, the basic concepts of real-time systems were introduced, along with a list of requirements a RTOS needs to fulfill. This section introduces Windows XP Embedded (XPE) and explains how its relevant OS mechanisms work. Because XPE is a componentized version of Windows XP Professional (XP), all technical operating system details for XP, such as thread priorities, scheduling algorithms, and inter-process communication, also apply to XPE[37]. Applications designed for XP can run without modification on XPE, as long as the required libraries for the application are installed (for example, a .NET application will obviously need the .NET framework)[37]. Furthermore, the same driver model (WDM) is used, which makes all device drivers for XP available to the embedded system[36].

4.1 Background

Most of the previous research on real-time applications and Windows has been based on Windows NT 4.0. There are several reasons for this:

• Windows NT was designed from the ground up as a 32-bit operating system with reliability, security, and performance as its primary goals[8]. This means NT was considered a new technology, which incidentally is what the letters NT stand for[31].
• Windows NT 4.0 was the first version of NT that sported the popular user interface from Windows 95, which made it easier for companies to migrate to it; and many of them did so[8, 31].
• Since NT 4.0, the kernel has not changed in terms of real-time characteristics. The scheduling algorithm, thread priorities, and interrupt routines have remained the same throughout the different versions of NT[30]. This means that the limitations of using the platform as a RTOS are already well known.

Because this report examines the real-time characteristics of XPE, it is important to know that XP, which XPE derives from, is part of the Windows NT family of operating systems and that its formal version number is NT 5.1. As stated above, XP also uses the same scheduling algorithms and interrupt handling routines as NT 4.0 and NT 5.0 (commonly known as Windows 2000). This makes the previous research (see [27, 24, 34]) highly relevant for this report, even though it was performed on NT 4.0. From here on, the term XP will be used for information applying to both Windows XP and Windows XP Embedded. The term XPE will be used only for information applying specifically to Windows XP Embedded.

[Figure 1: Simplified Windows architecture[30], showing user-mode components (system support processes, service processes, user applications, environment subsystems, subsystem DLLs) above the kernel-mode executive, kernel, windowing and graphics, device drivers, and hardware abstraction layer (HAL).]

4.2 System Structure Overview

In order to understand how XP works with threads, priorities, and interrupts, it is necessary to gain some basic knowledge about the structure of the OS.

4.2.1 Hardware Abstraction Layer

One of the primary design goals of NT was to make it portable across different platforms[31].
Therefore, NT/XP is divided into several layers, each one using the services of the ones below it. As shown in Figure 1, the first layer, working closely with the hardware, is called the Hardware Abstraction Layer (HAL). Its purpose is to provide the upper levels of the OS with a simplified abstraction of the often very complex hardware below it, in order to allow the rest of the OS to be mostly platform independent. For example, the HAL has calls to associate interrupt service procedures with interrupts and to set their priorities[31]. The HAL is delivered in source code (requiring a special agreement with Microsoft). It is thus possible to redefine how XP handles the system clock, interrupts, and so forth[35]. As will be shown in Section 6, some third-party solutions make use of a modified HAL to achieve predictable temporal behavior in XP.

4.2.2 Kernel

Above the HAL is the actual kernel layer. The purpose of the kernel is to make the rest of the OS hardware independent. This is where XP handles thread management and scheduling, context switches, CPU registers, page tables, and so on. The actual scheduling algorithm used will be discussed later in this section. The kernel has another important function: it provides support for two classes of system objects, namely control objects and dispatcher objects. Control objects are objects controlling the system. The most important one to know about is the deferred procedure call (DPC), which is used to split off the non-time-critical part of an interrupt service procedure from the time-critical part. This mechanism will be explained in greater detail in Section 4.4.2. Dispatcher objects include semaphores, mutexes, events, and other objects threads can wait on. Since this is closely related to thread scheduling, dispatcher objects are handled in the kernel.

4.2.3 Device Drivers

Device drivers work closely with the kernel. Running in kernel-mode, they have direct memory access and can manipulate system objects and I/O devices. However, a device driver can also do things not related to devices, such as performing calculations. This part of the system is particularly relevant for this report. The Windows Driver Model (WDM) will be discussed in greater detail in Section 4.6.

4.2.4 Executive

The last part of the system structure mentioned in this brief overview is what is known as the executive. It is a collection of components working together with the kernel to provide the rest of the system with a device-independent abstraction. Among other things, the executive contains components for managing processes, I/O, and memory. The I/O Manager, for example, plays an important role in interrupt handling, as explained in Section 4.4.

4.3 Thread Scheduling and Priority Levels

Windows XP has 32 priority levels for user-mode threads, numbered 0 to 31. A process can have one of the following priority classes: Idle, Below Normal, Normal, Above Normal, High, and Realtime. Each thread can then have a relative priority compared to the other threads in the process. The available thread priority levels are: Idle, Lowest, Below Normal, Normal, Above Normal, Highest, and Time Critical[31]. This sums up to a total of 42 combinations, which are mapped to the 32 priority levels according to Table 1. As seen in Table 1, the priority classes ranging from High down to Idle share the same upper and lower priority limits. This makes it possible for the XP scheduler to dynamically make priority adjustments to maximize average performance[24].
For example, when an I/O operation completes a request that a thread was blocked waiting for, the priority of that thread is increased. The purpose of this is to maximize I/O utilization[31] and is not the same as priority inheritance.

    Win32 thread priority   Realtime  High  Above Normal  Normal  Below Normal  Idle
    Time critical                 31    15            15      15            15    15
    Highest                       26    15            12      10             8     6
    Above normal                  25    14            11       9             7     5
    Normal                        24    13            10       8             6     4
    Below normal                  23    12             9       7             5     3
    Lowest                        22    11             8       6             4     2
    Idle                          16     1             1       1             1     1

    Table 1: The priority levels in XP. Columns are the Win32 process priority classes; rows are the Win32 thread priorities.

Note that dynamic priority boosts never increase the priority above level 15. As a result of these dynamic priority adjustments, none of the affected priority classes are predictable, and using them in a real-time application would render the application non-deterministic. The number of priority levels to be considered for a real-time application is thus reduced from 32 to just 7 (the Realtime class). The thread priority levels in the Realtime class are all higher than those of the dynamic classes, making them more suitable for real-time applications. It should be clear that, although this priority class is called Realtime, no guarantees are given by the operating system. It simply means that it is the highest priority class available for user-level threads and that no dynamic priority adjustments are ever made on threads in this class[31]. Threads sharing the same priority level are processed in First-In-First-Out (FIFO) order. Figure 2 shows the full range of priority levels in XP, including the ISRs and DPCs.

4.4 Interrupt Handling

Interrupts in XP have higher priority than all the user-level threads mentioned in Section 4.3, including those in the Realtime priority class. All hardware platforms supported by XP implement an interrupt controller that manages external interrupt requests (IRQs) for the CPU. Once an interrupt occurs, the CPU gets the interrupt number (known as a vector), which is translated from the IRQ by the interrupt controller. This vector is then used as an index into the interrupt descriptor table (IDT) to find the appropriate routine for handling the interrupt[30, 13]. XP fills the IDT with pointers to interrupt handling routines at start-up. To locate the IDT, the CPU reads the IDT register (IDTR), which stores the base address and size of the IDT[13]. XP also uses the IDT to map vectors to IRQs[30].

[Figure 2: The full range of priority levels in XP[3].]

Since interrupts are handled differently by different CPU architectures, XP provides an abstract scheme to deal with all platforms. This HAL scheme provides a common priority handling mechanism for interrupt requests by assigning an interrupt request level (IRQL) to all interrupts[2]. IRQLs range from 0 to 31, where higher numbers represent higher priority. The dynamic and real-time priority spectrum for user threads all run at IRQL 0 and have an internal priority scheme as described earlier in this report. Because the CPU always executes code at a specific IRQL, stored as part of the execution context of the executing thread, the IRQL is used to determine execution order. When an interrupt occurs, the CPU compares the IRQL of the incoming interrupt to the current IRQL. If the incoming interrupt has a higher IRQL than the current one, the trap handler saves the state information of the currently executing thread, raises the IRQL of the CPU to the value of the incoming interrupt, and calls the interrupt dispatcher, which is a part of the I/O Manager.
The interrupt dispatcher calls the appropriate routine for handling the interrupt. When the interrupt routine is finished, the CPU lowers the IRQL to the value of the preempted thread and continues execution. If the IRQL of the interrupt is lower than or equal to the current IRQL of the CPU, the interrupt request is left pending until the IRQL drops below the value of the request[30, 2]. Two classes of IRQLs exist. The lowest three IRQLs (0-2) belong to the software class. They consist of PASSIVE_LEVEL, used for normal thread execution, DISPATCH_LEVEL, used for thread scheduling, memory management, and execution of DPCs, and APC_LEVEL, used for asynchronous procedure call execution[2]. Asynchronous procedure calls and deferred procedure calls are explained later in this section. The remaining levels (3-31) belong to the hardware class. The lowest 24 IRQLs in this class (3-26) are reserved for device interrupts, also known as DIRQLs. They are used for interrupt service routine (ISR) execution[2, 30].

4.4.1 Interrupt Service Routine

The interrupt dispatcher, among other things, makes the system execute an ISR mapped to the device triggering the interrupt, which runs at the same IRQL as the interrupt[31]. For a more detailed explanation of the interrupt dispatcher, see Section 4.6.4. Only critical processing is performed in the ISR, for example copying or moving a register value or buffer. An ISR must complete its execution very quickly to avoid slowing down the operation of the device triggering the interrupt and to avoid delaying all processing at lower IRQLs.

4.4.2 Deferred Procedure Call

Although an ISR might move data from a CPU register or a hardware port into a memory buffer, in general the bulk of the processing is scheduled for later execution in a DPC, which runs when the processor drops its IRQL to DISPATCH_LEVEL[31]. The DPCs are handled in a FIFO queue. Since interrupts have higher IRQLs, a DPC can be preempted by an interrupt at any time, which means the FIFO queue can sometimes grow very long. However, it is possible to set a higher priority on a scheduled DPC using a special kernel routine. This will effectively place the DPC first in the queue[25].

4.4.3 Asynchronous Procedure Call

There are also asynchronous procedure calls (APCs), running below the DPC priority level. APCs are similar to DPCs, but they must execute their code in the context of a specific user process[3], which means a full process context switch may need to be carried out by the OS before an APC can run. ISRs and DPCs, on the other hand, only manipulate the kernel memory shared by all processes and can therefore run within any process context.

4.5 Memory Management

The concept of virtual memory is used in XP. One of the main reasons for this is to allow the system to use more memory than is physically available. For example, an application requiring 500 MB of memory can run on a computer with only 256 MB of random access memory (RAM) available. This can be achieved by moving blocks of memory out to the hard disk drive (HDD) when they are not directly needed by an application, to make room for the ones that are actually needed[31, 9]. These blocks of memory, or pages, are said to be mapped out from memory when not needed. Likewise, when pages are needed by an application and not currently in memory, they are mapped in again. Pages not loaded in memory are stored in paging files. This allows the system to use as much memory as the RAM and paging files combined.

All processes running in XP use pages to access memory. A fixed page size is used for a specific system architecture; on the Pentium architecture, the page size is 4 KB. An address in the virtual address space is 32 bits long, which results in a total of 4 GB of virtual memory for each process[31, 9, 30]. The virtual memory of each process is split into two halves. The lower 2 GB half is used for process code and data, except for about 250 MB, which is reserved for system data. This system data is shared by all user processes and contains system counters and timers. The upper 2 GB half of the virtual address space is the kernel memory, containing the operating system itself, the page tables, the paged pool, and the nonpaged pool. Except for the page tables, the upper memory is shared by all user processes in the system. However, it is only accessible from kernel-mode, which means the user processes are not allowed to directly access this memory[31, 9].

[Figure 3: The virtual memory for two processes. The gray areas represent shared memory. Each process maps the shared paged pool, nonpaged pool, OS image, and system data, together with its own private page table and private code and data, onto physical memory.]

The page tables store information about the available pages for each process in the system. Every process has its own private page table.
To make sure enough free pages exist, the system runs the balance set manager every second. If the number of free pages decrease to a specic threshold, the memory manager starts mapping out pages not needed at the moment[31, 30]. 4.6 Windows Driver Model The WDM is a framework for device drivers that is source code compatible with Windows 98 and later. It includes a library, oering a large set of 24 routines to the developer[25, 36]. There are two major classes of WDM drivers. The rst class is called user-mode drivers. Drivers in this class run in user-mode and the class is mostly intended for testing purposes with simulated hardware. The other class, which this report will focus on, is the kernel-mode driver. As the name implies, drivers in this class run in kernel-mode. Because of the direct hardware access available in kernel-mode, this type of driver is used to control hardware. Even though kernel-mode drivers are often used to control hardware, simulated hardware or no hardware at all can be used by these drivers[2, 25]. 4.6.1 I/O Request Packets Drivers written for the WDM framework should handle input/output (I/O) requests as specied in the I/O Request Packet (IRP). I/O requests are I/O system service calls from user-mode applications, such as read and write operations[2]. An IRP determines the work order, i.e. in what order dierent subroutines of a driver should be executed to complete an I/O request. When the IRP is created, it is passed to the I/O Manager, which determines what driver and subroutine should execute. The subroutine performs its work on the IRP and passes it back to the I/O Manager, which sends it to the next subroutine. When the IRP is completed, the I/O Manager destroys it and sends the status back to the requestor[25, 2]. 4.6.2 Driver Types Three types of drivers exist under WDM. Function drivers are responsible for I/O operations, handling interrupts within the driver, and deciding what should be controllable by the user. Bus drivers handle the connection between the hardware and the rest of the computer. The PCI bus driver, for example, detects the cards on the PCI bus. It determines the I/O-mapping or memory-mapping requirements of each card. Both function and bus drivers are required for all hardware devices. The third type of driver, the lter driver, can be supplied by manufacturers to modify the functionality of the higher functional driver[25]. This is known as an upper lter driver. There are also lower lter drivers that work as a lter between the bus driver and the function driver. A good example of a lower lter driver is one that encrypts data before it reaches the bus driver, which means neither the functional driver nor the bus driver need to know about the encryption. 25 4.6.3 Device Objects To help software manage hardware in Windows, device objects are used. Each type of driver has a device object mapped to it. Bus drivers are represented by physical device objects (PDO). Functional device objects (FDO) are mapped to the function drivers. Both above and below the FDO, lter device objects (FiDO) may exist, which are mapped to lter drivers[2, 25]. 4.6.4 I/O Request Processing When an I/O request is raised in the system, it gets processed according to the steps in Figure 4. Although not all I/O requests go through all these steps, this model represents a typical I/O request ow. 
1 User Thread 2 IRP = PENDING; StartIO(); 7 12 io_request(); Dispatch routine 6 Win32 Kernel User-Mode Application 3 HAL I/O Manager 4 5 8 ISR StartIO EnableInterrupts(); RequestDPC(); 9 11 10 DPC IRP = SUCCESS; CompleteIRP(); Kernel-Mode Driver Figure 4: The ow of I/O requests through the system. 1. When an I/O request is invoked by a user-thread, the system traps into kernel-mode and passes the request to the I/O Manager. 2. In the I/O Manager the request is translated into an IRP, describing the work order of the drivers involved in handling the request. Before invoking the right dispatch routine of the driver (one dispatch routine 26 per function oered by the driver exists), the I/O manger prepares the user buer and the access method to this buer[2, 25]. 3. If no device activity requiring interrupts is needed for the I/O request (for example, when reading zero bytes or writing to a port register), the dispatch routine marks the IRP as completed, executes the rest of the dispatch routine, and sends it back to the I/O Manager, which noties the user-thread of the completion of the I/O request. The scenario of reading zero bytes can occur if polling (periodical status checking) drivers are used[2, 25]. Usually, however, the I/O request actually needs some device activity before completion. In this case the IRP is marked as pending and the Start I/O function of the driver is called before the IRP is passed back to the I/O Manager. The dispatch routine also performs parameter validation. For functional drivers, the parameter validation has to take the limitations of the underlying bus driver into account. For example, if the total transfer size exceeds the limits of the bus driver, the dispatch routine is responsible for splitting the request into multiple requests[2, 25]. 4. The I/O Manager then queues the call to the Start I/O routine of the driver, which starts up the device. The rst thing done by the I/O Manager when a device is requested to start is to check to see if the device is busy. That is, checking if a previous IRP is marked as pending for the device. If the device is busy, the new IRP is queued. If the device have no IRP marked as pending, the queue is skipped and the Start I/O routine of the device is called directly, which starts the device by safely accessing the device registers[2, 25]. 5. The IRP is then returned to the I/O Manager, which awaits a device interrupt[2, 25]. 6. HAL receives the device interrupt when it occurs. 7. The interrupt is then routed to the interrupt dispatcher of the I/O Manager. 8. Most devices are connected to an interrupt request level (IRQL), which means the interrupt dispatcher calls the ISR of a device connected to a specic IRQL when a device interrupt occurs. Some devices do not use interrupts and requires polling to notice any changes for that device[2]. Since IRQLs can be shared by other drivers, the rst thing the ISR does is checking whether or not the interrupt was intended for the specic device. If not, the interrupt request is passed back to the interrupt dispatcher, which sends it to another device connected to the same IRQL[2, 25]. 27 The ISR is working on the IRQL of the device, which means that other threads at the same IRQL or lower have to wait until the ISR is completed. Because of this, as little work as is reasonably possible should be performed in the ISR. Most of the time, ISRs only perform hardware dependent work, such as moving data to or from hardware registers to kernel-mode buers. 
As mentioned earlier, the number of kernel-mode functions available in an ISR is very limited[2, 25].

9. Because of the limited kernel-mode functionality available, the ISR often schedules a DPC for later execution, which will take care of the processing not performed in the ISR[25].

10. The scheduling of DPCs is handled by the I/O Manager and is implemented as a FIFO queue. Although the DPC queue is of FIFO type, drivers can set the priority of a DPC to high, which will make the I/O Manager place the DPC first in the queue.

11. The DPCs run at DISPATCH_LEVEL and have full access to the kernel-mode functions. The DPCs complete the work of the device driver that for various reasons could not or should not be performed in an ISR. After the work in the DPC is done, the DPC marks the IRP as completed and sends it back to the I/O Manager, which in turn destroys it[25, 2].

12. When the I/O Manager has destroyed the IRP, it schedules a kernel-mode APC. This APC executes I/O Manager code that copies status and transfer size information to the user thread. The APC needs to execute in the context of the requesting user thread, since it needs to safely access the user-space memory. By running the APC at the same priority level as the requesting thread, page faults can be handled normally. If the I/O request included a data read from a device with the buffered I/O read method, the APC copies the driver-allocated buffers back to the user-space buffers of the requesting thread (from the nonpaged pool to the paged pool accessible by the user thread). When the APC has completed its execution, the I/O Manager notifies the requesting user thread[2, 25].

4.6.5 Floating-Point Operations

According to the WDM documentation, drivers should avoid doing any floating-point operations unless absolutely necessary, for performance reasons[25, 36]. Before carrying out floating-point operations, a special kernel routine needs to be called to save the nonvolatile floating-point context. After the floating-point operations are finished, another kernel routine must be called to restore the nonvolatile floating-point context again[25]. Callers of these kernel routines must be running at IRQL ≤ DISPATCH_LEVEL. In other words, floating-point operations are not allowed in ISRs[25].

5 Real-Time Aspects of XP

While the previous section provided an overview of XP, this section analyses the real-time aspects of the OS and compares the system characteristics with the previously mentioned RTOS requirements. Finally, the reasons why XP is not suitable for hard real-time applications are explained. XP is a GPOS for PCs[34] and, as such, the priority of the OS is to optimize average performance, not to minimize or limit worst-case performance. For a real-time application, the WCET is more relevant, since it is a guarantee that the execution time will never exceed a certain limit[33]. The average performance, on the other hand, is irrelevant in the RTOS context, since it gives no guarantee regarding the execution time of a particular execution.

5.1 Design Issues That Limit XP's Use As a RTOS

There are several design issues in XP limiting its use as a RTOS[24]:

• No priority inversion protection exists. Threads running in the Realtime class can be blocked by lower priority threads holding a shared resource. No mechanism to prevent this exists in XP.

• Limited number of priorities. As explained in Section 4.3, only 7 priority levels are available for Realtime threads. This is only sufficient for very simple real-time applications and severely limits the amount of control a system designer has over thread priorities.
This is only sucient for very simple real-time applications and severely limits the amount of control a system designer has over thread priorities. • DPCs are processed in FIFO order Even though dierent interrupt priority levels exist, the bulk of the processing in a device driver is done in a DPC, which is processed in FIFO order. This makes time critical processing unsuitable even at this priority level, since it may be delayed indenitely by less critical processing scheduled earlier in the FIFO queue. DPCs can also be delayed by ISRs of any priority level. It is possible to specify a higher priority when scheduling a DPC. This will place the DPC rst in the DPC queue. However, there is no guarantee that other device drivers will not do the same, which would only invert the DPC processing order. • Masking interrupts Any code running in kernel-mode, including all device drivers, can disable interrupts or raise the IRQL to the highest level, which eectively 30 gives the code exclusive access to the CPU. This can lead to unpredictable results. This could potentially be used by a small real-time application that wants to increase the temporal determinism, but there is no guarantee that other non-critical device drivers in the system would not take advantage of this too. • Page swapping XP's use of virtual memory leads to page swapping, which can occur at any point during the execution of a thread. However, virtual memory can be turned o in XP, eectively eliminating this design issue. • IRQL mapping The HAL dynamically maps interrupts to IRQLs at system startup as it detects the devices attached. This leads to reduced portability and predictability of a real-time application, since it is not possible to know the order of device interrupts when hardware changes. By reducing the number of device drivers used in the system and making sure that as few drivers as possible share the same IRQL, a higher level of predictability can be achieved. • Interrupts and DPCs have higher priority than Realtime threads Even threads running at the highest user-level priority can be delayed indenitely because of interrupting ISRs and DPCs. 5.2 Using XP as a RTOS Dierent approaches of using XP as a RTOS are suggested throughout the literature, where the most common alternatives are [32, 27, 17]: • Use XP as it is, but with a constrained environment for applications and functionality to ensure timing constraints. Future development of such a system is hard, and no guarantees of deadlines can be given. • Implement the time critical parts as a device driver running in kernelmode. The richness of the entire Win32 application program interface (API) cannot be utilized. Debugging becomes more dicult and critical, since bugs can crash the whole system.[3, 5]. • Create a wrapper for the Win32 API to a commercial RTOS. No COTS can be used and the Windows device drivers cannot be used. • Run Windows XP and a RTOS on two dierent machines. Both hardware and software costs increase. 31 • Run Windows XP and a RTOS on a single machine. This report will focus on the rst two approaches. 32 6 Extensions In this section, the approach of running Windows XP and a RTOS on a single processor machine will be examined, using real-time extensions available from third parties. All the extensions have slightly dierent implementations, but all of them have made some modications to the HAL or at least intercepts the interrupts before they reach the HAL (which actually can be seen as a modication)[32]. 
6 Extensions
In this section, the approach of running Windows XP and a RTOS on a single-processor machine is examined, using real-time extensions available from third parties. The extensions have slightly different implementations, but all of them make some modifications to the HAL, or at least intercept the interrupts before they reach the HAL (which can itself be seen as a modification)[32]. Note that not all of these extensions make permanent changes to the HAL; instead, a reconfiguration of the HAL is done at system startup. The extensions include:
• CeWin and VxWin by KUKA Controls[18].
• HyperKernel by Nematron[12].
• RTX by Ardence[29].
• INtime by TenAsys[15].
Since information on CeWin, VxWin, and HyperKernel is sparse, this report will not focus on those solutions. RTX and INtime, however, are described in more depth, since more research has been done on those technologies.

6.1 RTX
6.1.1 Architecture
The RTX runtime environment is implemented as something called a Real-Time Sub-System (RTSS), which is in fact a kernel device driver for Windows XP. Achieving real-time performance this way is possible thanks to the standard device driver model and the fact that the HAL is customizable. By combining these two techniques, a temporally predictable model for building real-time systems is possible[10]. The RTSS is implemented as a system capable of stopping Windows from masking interrupts, using its own scheduler, and handling synchronization, to name a few features[10]. Since the RTSS runs as a kernel device driver, applications written for RTX also run in kernel-mode. This mode offers no memory or stack overrun protection, and such errors would likely give an unreliable execution environment resulting in a system crash.
The HAL modifications used in RTX are implemented as extensions instead of an entire replacement. This makes the RTSS compatible with all existing versions of the Windows XP, Windows 2000, and Windows 2003 platforms. New Windows service packs can be installed without affecting the RTSS environment. The RTSS relies on the HAL extensions to operate correctly[10]. The extended HAL used by the RTSS is called RT-HAL throughout this section.
The standard Windows HAL was modified for the following three reasons[10]:
1. To make it impossible for Windows XP threads to interrupt the RTSS or mask the RTSS-managed devices. RT-HAL intercepts interrupt masks coming from the Windows threads and manipulates the mask so that no RTSS-controlled interrupts can be masked.
2. To increase the resolution of the Windows XP provided timers to 100 µs, instead of 1000 µs.
3. To provide a shutdown handler for the Windows XP environment, which makes it possible for the RTSS to carry on after a traditional blue-screen Windows crash. The RTSS applications are responsible for managing the shutdown handler, and it is up to the real-time application developer to decide which applications should use it. The handler is used to clean up and reset any hardware state if a crash or normal shutdown of the XP environment occurs; what actually happens is the developer's responsibility.
RTX supports 256 thread priority levels. The scheduling algorithm used is round-robin[1], and the ready queue is implemented as a doubly linked list for each priority level, which speeds up both insertion and removal of threads compared to a singly linked list. If two threads of the same priority are ready at the same time, one of them is chosen and runs until its quantum has expired. By default, the quantum is set to infinity[10].
The RTSS uses the Windows-provided model even for RTSS interrupt handling. This may seem unwise, since previous work has shown that DPCs are not deterministic enough for real-time use[3]. However, RTX only catches the interrupt in Windows; the actual ISR is run in the RTSS, if the interrupt was intended for it.
The RTSS is therefore only dependent on the interrupt latency of XP[10]. Studies have shown that the interrupt latency in XP is deterministic enough even to run hard real-time systems in an ISR[3]. RTX has worked on lowering the interrupt latency, and claims have been made that a WCET of less than 30 µs is possible[10].
The RTSS environment uses the memory management mechanisms provided by XP, and memory allocation is done in the nonpaged memory pool[10]. This means that memory allocation by an RTSS thread is non-deterministic. The benefit of this memory model, according to Ardence, is that it reduces RTX resource consumption.
Communication between the XP environment and the RTSS environment is realized with queues, one in each direction. If an XP thread needs some service from the RTSS environment, a command is inserted into the queue as a Service Request Interrupt (SRI). The RTSS environment then executes the service and sends a reply message back to the XP thread. Normally, SRIs for synchronization are requested by the XP environment, and SRIs for memory management and file operations are requested by the RTSS environment[10].
Priority inversion on shared resources is avoided by using priority inheritance, also known as priority promotion in most papers studying RTX[10, 1].

6.1.2 Software Development
RTX provides libraries which can be used from Visual Studio. It also provides a useful application wizard, guiding the user through the settings and generating skeleton source code for the applications[10, 1]. Even though applications written for RTX run in kernel-mode, code writing and debugging can be done in user-mode during development from within Visual Studio (version 6.0 and newer), offering a fully protected environment. Breakpoints can be set and source code stepping can be used, just like when debugging any normal Windows application. Final releases, however, are compiled to run in kernel-mode[10, 1].

6.1.3 Does RTX Meet the RTOS Requirements?
• The OS has to be multitasked and preemptible. The OS is definitely preemptible and multitasked. Preemption can occur for both threads and ISRs. Tasks can be realized as both processes and threads in this system, allowing for lower task switching time if memory can be shared with other tasks. The scheduling algorithm used is round-robin with priority queues.
• The notion of task priority has to exist. 256 priority levels for threads exist.
• The OS has to support predictable task synchronization mechanisms. Synchronization objects are available, such as semaphores, mutexes, and shared memory objects. A study by Timmerman et al. showed predictable behavior of the synchronization objects[32].
• The OS must support a system for avoiding priority inversion. Priority inheritance is used to protect the system from priority inversion.
• The OS must have predictable temporal behavior. To determine whether this requirement is met, extensive testing of the extension is needed. Memory allocation is not deterministic, since it is handled by the Windows memory management mechanism[10].
Four of the five requirements of a RTOS are definitely fulfilled by RTX. Even though the non-determinism of memory allocation can be avoided by allocating all memory needed before startup of the real-time system, the tests done in [32] are too limited to conclude that RTX offers predictable temporal behavior under all conditions.

6.2 INtime
6.2.1 Architecture
In contrast to RTX, INtime from TenAsys runs both the real-time applications and the non-real-time applications in user-mode.
There is still the possibility to write a real-time application as a driver, which will then run in kernel-mode[16, 28]. Running applications in user-mode protects the system from crashes caused by programming errors such as null pointer dereferences and page faults. However, applications still have the ability to gain direct access to physical memory if the developer deems that necessary[28].
INtime installs a number of components in Windows. The most important are a Windows kernel driver and a Windows service. The kernel driver manages communication between the INtime and Windows environments. The service handles the actual loading of the INtime kernel into the system. A context switch then occurs to make the system enter the INtime kernel. In this state, all real-time activity is handled before any Windows activity; XP effectively becomes the idle task of the INtime kernel. When running in the INtime kernel, all Windows interrupts are masked, and a real-time interrupt (both software and hardware) is handled directly. Thanks to monitoring of the HAL, the Windows kernel is unable to mask real-time interrupts. This means that even badly designed device drivers running in the Windows kernel and masking interrupts cannot affect the performance of the real-time kernel[28].
The scheduling algorithm used in INtime is round-robin with 256 priority levels. 128 of these levels are used for user thread priorities and the other 128 are used for interrupt priorities[16, 24]. The interrupt handling in INtime is similar to that used in XP. When an interrupt occurs, it is handled by its appropriate ISR. Just as in XP, minimal work is done in the ISR[24]. The bulk of the work is instead performed in an interrupt thread. Interrupt threads are like DPCs in XP, but with different priority levels instead of a single FIFO queue, to increase the temporal determinism of the system. This interrupt model can also be bypassed, and processes can handle interrupts directly[28].
Memory management is handled by INtime itself, and all shared memory (used for shared resources) resides in the nonpaged memory pool. This means no swapping of shared memory can occur, giving good temporal predictability when accessing shared memory[24]. Shared resources used within INtime are protected with semaphores. Semaphore queues can be realized as both FIFO and priority queues. Compared to XP, which only uses FIFO queues for semaphores, the temporal behavior is more deterministic using priority queues. Priority inheritance is used on shared resources to keep priority inversion as low as possible[24, 28].
Thanks to the monitoring functionality, which makes XP the idle thread of INtime, real-time applications can continue to run even if XP crashes. The first action in case of an XP crash is to suspend the thread scheduling XP. A real-time process can then restart the Windows operating system and bring operation back to normal mode[28]. This means it is possible to make the real-time applications completely independent of XP.

6.2.2 APIs
INtime provides the user with multiple programming APIs:
• Real-Time API. The real-time API resembles the Win32 API, which makes the transition for Windows programmers as smooth as possible[24]. The real-time API is object based, where all objects are referenced by handles. Handles are global to the entire real-time system[28].
• Win32 API. A subset of the Win32 implementation used in Windows CE is provided by INtime. It allows usage of some existing code directly in INtime[28].
Although it is based on the Win32 API for CE, no information is given on whether it has time-deterministic behavior or not.
• APIs for the Windows environment. Windows APIs are provided to allow the non-real-time Windows environment to share objects with the real-time INtime environment. Both real-time objects and Win32 objects can be shared by processes in the different environments[28].
• C and C++ libraries. INtime provides support for both Embedded C++ (EC++), with the use of the Standard Template Library (STL), and ANSI C.

6.2.3 Software Development
Software is written in C or C++, with the entire STL available. Building applications from start to release can be done entirely in Microsoft Visual Studio. INtime even includes a project wizard for the IDE, which eases development and generates skeleton code for the developer. Debugging can also be done in Visual Studio .NET with the use of breakpoints, source-level single-stepping, and variable watching. For Visual Studio 6 users, INtime includes a separate debugger called Spider.

6.2.4 Does INtime Meet the RTOS Requirements?
Does XP using the INtime extension fulfill the requirements put on a RTOS?
• The OS has to be multitasked and preemptible. This requirement is fulfilled, since INtime is multithreaded and thereby multitasked. Preemption can occur at every level in the system. ISRs have different priority levels and are preemptible as well. Interrupt threads exist with 128 different priority levels. These attributes clearly fulfill the first requirement.
• The notion of task priority has to exist. This requirement is also fulfilled, since both user/kernel level threads and interrupt threads have priorities.
• The OS has to support predictable task synchronization mechanisms. The INtime kernel uses semaphores with both FIFO and priority queues, where priority queues should give higher predictability. Acquisition and release of semaphores has been shown to be deterministic in [32].
• The OS must support a system for avoiding priority inversion. INtime uses priority inheritance to achieve this goal.
• The OS must have predictable temporal behavior. In a previous study, INtime showed predictable behavior[32]. However, that study was based on version 1.20 (while the current version is 3.0) and the number of tests conducted was limited.
INtime fulfills at least four of the five RTOS requirements. However, tests and time measurements of the software are needed to determine whether it offers predictable temporal behavior under all conditions.

7 Related Work
Studies of real-time performance in Windows NT (which uses the same scheduling and interrupt routines as XP) have been done before. While most of the studies focused on the real-time performance of user-level threads[24, 27, 4], some focused on real-time performance in device drivers[5, 3]. Most of the papers based their conclusions on time measurements, while [34] drew its conclusions from the inner workings of Windows NT.

7.1 User Level Thread Implementation
The testing of user-level thread performance was done in slightly different ways. While [27, 24] implemented a time-critical application, [4] only measured thread creation time and task switching under various system loads. All these tests show that temporal predictability decreases as the system load and the frequency of interrupts increase. Higher-prioritized tasks also seem to increase the temporal predictability.
The conclusions regarding the predictability of user-level threads are clear: Windows NT is not suitable for running real-time applications at user level; the WCET for the application in [24] was almost 10,000% above the average execution time. Depending on the timing constraints, Windows NT running user-level applications could be used for soft real-time systems. According to the studies, WCETs cannot be guaranteed, which means the system must be allowed to occasionally miss deadlines.

7.2 Driver Based Implementation
The driver-based experiments differ more from each other than the user-level tests. [3] measures interrupt latencies for different drivers and interrupt vectors under various load conditions (always known before testing). The paper continues with execution time measurements for ISRs and DPCs under the same system conditions. Measurements of context switching latency for both processes and threads were also made, but these tests did not use as varied a system load as the tests conducted to measure interrupt latency. The results from the interrupt latency tests showed that the latency did not differ much with system load, except when network load was present. Since the Ethernet interface was set up to raise interrupts at vector 10, the custom serial driver (connected to vector 11) would be preempted by every interrupt on the Ethernet interface. By assigning the driver to another vector, the high latency introduced by network load could be reduced. As shown by these test results, network load had no major effect on the interrupt latency when interrupt vectors lower than 10 were used. The ISR execution time of the custom serial driver had high temporal predictability under all system loads tested. However, the ISR execution times of the other drivers did not show predictable temporal determinism, but that depends on the amount of work needed in the ISR (or perhaps on badly designed drivers). The results from the DPC latency measurements had enormous standard deviations from the average times. Neither thread nor process context switching gives any determinism to the system; the latency depended highly on CPU load. The conclusion of that study is that Windows NT is suitable not only for soft real-time systems but also for hard systems, as long as all the time-critical execution is done inside the ISRs. Running at DPC level or below offers too poor determinism to be used by a hard real-time system. The final recommendation was to turn off virtual memory in the system, especially if an Integrated Drive Electronics (IDE) HDD is used, since the experiment on paging showed that IDE drivers basically had no temporal determinism at all.
The second paper focusing on a driver implementation[5] had a different approach. It implemented a driver polling input data at the frequency programmed into the LAPIC timer (see [5] for more information) located on the CPU, running at the frequency of the system bus. This timer was programmed to use the highest interrupt level. However, the LAPIC timer can still be delayed if interrupts have been disabled by other drivers. The methodology descriptions in these tests were sparse, making it hard to fully understand how the tests were carried out. According to the paper, the LAPIC driver performed well on both loaded and unloaded systems. However, since some data polling occasions still missed the deadline, no hard real-time system could be implemented successfully according to the paper.
The authors stated that the LAPIC driver required a dual-processor machine, since the LAPIC is disabled by Windows XP on a uniprocessor system. This contradicts the fact that they presented test results from a system using a single Pentium 4 processor with Hyper-Threading, a technology that presents one physical processor as two logical processors (see [11]). Although this suggests that Hyper-Threading is sufficient, the dual-processor requirement is still a drawback of this solution.

7.3 Conclusions
In general, all papers seem to agree that Windows NT can be used for soft real-time systems if:
• the timing constraints are not too tight,
• the system is allowed to miss deadlines sometimes, and
• the work load is low.
Some of the papers also concluded that if all work is done at ISR level, even hard real-time systems can be built on Windows NT[3]. In contrast, [34] concludes that running a hard real-time system on Windows NT is out of the question. The methodologies of these two papers were very different, since [3] measured the execution time of an implementation while [34] based its conclusions on an analysis of the inner workings of Windows NT.

8 Problem Description
The trigger of this master's thesis is an anonymous business unit of ABB providing embedded system based products for the automation industry. They use a traditional real-time system, where input data received from a sensor is processed by an algorithm and then sent to an actuator. Outside the real-time core, extra functionality is provided to make the units more useful to the customer. Today, the real-time core runs on dedicated hardware, isolated from the extra functionality.
The trend clearly shows that the demand for extra functionality outside the core is growing. The development cost required to meet this demand is usually very high, for several reasons:
• The systems often run on specific hardware with memory constraints and limited resources.
• The systems sometimes run a custom designed operating system, which means no commercial off-the-shelf (COTS) software components exist.
• Even on systems running a commercial RTOS, writing software and reusing software components is more limited than in popular general purpose operating systems (GPOS).
• The RTOS development environments are often complex, which makes writing software hard.
To stay competitive in this industry, the development cost for this kind of supportive functionality needs to be reduced. One possible solution is to run the real-time core on the same hardware as the extra functionality and to switch OS to a popular GPOS. The ABB business unit wants the alternative of using XPE with a cheap real-time extension to be explored. For their purposes, a self-written device driver implementation is optimal, since it would be cheaper than buying licenses for third-party real-time extensions. Because of the planned shipping volume, it is important to keep the manufacturing cost per unit as low as possible. The development cost is less important, as it is viewed more as a one-time cost.
The following is a list of some of the key reasons why the ABB business unit wants XP to be investigated:
• By using XP, development time would be reduced by using the same platform for both development and target units.
• Rapid prototyping using the .NET framework would be possible without access to the target unit.
• A large number of COTS and standard applications would be able to run on the target units.
The fact that XP is designed as a GPOS, and as such does not support hard real-time usage, is recognized by the business unit. However, the characteristics of the OS need to be thoroughly explored in order to get a deeper understanding of its performance and limitations. Even if the results were to show that XP is not suitable for their purposes, they would at least have a clear reason why this is the case.

8.1 Suggested Model
Figure 5 shows the original model for the embedded system suggested by the business unit. The sensor is on the left, the embedded system running XPE is in the middle, and the actuator is on the right.
Figure 5: Sketch of the implementation suggested by the ABB business unit. It is meant only as an overview of a possible system and is in no way finalized. For example, the suggested communication stack may be replaced with another Ethernet-based automation protocol in the real implementation.

8.2 Purpose
The purpose of this report is to reveal the characteristics of XPE as a RTOS by investigating how XP works under the hood.

8.3 Scope
Because of the limited time available for this master's thesis, XP will be used instead of XPE to run the tests. Too much time would otherwise be spent on setting up and configuring an XPE installation. Since XP and XPE use the same kernel (along with the same scheduling algorithms, IRQLs, and HAL), this will not affect the results of the tests[37]. While Figure 5 is a model of the whole embedded system, the only part investigated in this report is the real-time characteristics of XP. Communication from the sensor and to the actuator will be simulated. The third-party real-time extensions will not actually be tested; the focus will be solely on the real-time characteristics of XP itself.
There are other OSs relevant to the commissioning business unit. For example, Windows CE 5.0 is a scalable OS with real-time capabilities that allows applications to be developed in a familiar Windows environment[22]. However, CE does not have the rich availability of COTS and applications that XP has, and the .NET Compact Framework is more limited than the standard .NET framework[6]. Windows CE will not be investigated in this report.

9 Methodology
In order to measure the real-time characteristics of XP, a number of tests were conducted on a dedicated system. Because Microsoft is very protective of the source code for XP, at best a black-box approach to performance analysis was possible. The tests evaluated the performance aspects that affect the determinism and responsiveness of XP as a real-time system: ISR latency, interrupt execution, DPC latency, DPC execution, and communication between user-mode and device drivers.

9.1 Conducted Tests
This report focused on the first two approaches to using XP as a RTOS explained in Section 5.2: using XP as it is with a standard user-mode process, and implementing the time-critical parts in a device driver running in kernel-mode. The latter approach was in turn divided into two separate tests: one implementing the time-critical parts in a DPC, and another implementing them in a prioritized DPC. An ISR implementation was not considered, because of the inability to safely perform floating-point operations there, as mentioned in Section 4.6.5.
The tests simulated a typical system used in the automation industry, where a sensor transmits an input to an embedded real-time system, which performs calculations on the input and thereafter sends the result to an actuator. Time measurements were conducted on specific events, as well as on the full event cycle, to evaluate the determinism of XP. The sensor input was simulated using a tone generator connected to the acknowledge (ACK) pin of the parallel port, which generated a hardware interrupt at IRQL 3 in the CPU. A custom device driver was written to handle the interrupts and start the event cycle. The processing of the sensor input was simulated using an algorithm performing a fixed number of floating-point operations (e.g. multiplications, divisions, etc.). Originally, this algorithm came from the ABB business unit that triggered this master's thesis, but because the algorithm contained many compiler warnings and was generally very complex, it was replaced with a simpler one using only a fraction of the source code needed for the original algorithm. Tests were conducted to make sure the execution time of the new algorithm was equal to that of the original one. Finally, the output to the actuator was simulated by setting a parallel port pin in the device driver.
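The replacement algorithm itself is not reproduced in this report. As a rough illustration of the kind of load it represents, the fragment below performs a fixed number of floating-point multiplications and divisions; the constants, the iteration count, and the function name SimulatedAlgorithm are placeholders and do not correspond to the actual source code.

    /* Illustrative stand-in for the simulated sensor-processing algorithm:
       a fixed number of floating-point operations on the input value. */
    double SimulatedAlgorithm(double input)
    {
        double value = input;
        int i;

        for (i = 0; i < 1000; i++) {      /* fixed operation count (placeholder) */
            value = value * 1.000001;      /* multiplication */
            value = value / 1.0000005;     /* division */
        }
        return value;
    }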
9.1.1 User-Thread Implementation
In the user-thread implementation, communication with the input and output was handled by read and write calls to the device driver. The algorithm was then processed in a user-thread with the highest priority (31). Figure 6 shows the full event cycle for this implementation. Although the user-thread starts by calling the read() function of the driver, the actual event cycle (from sensor input to actuator output) starts when the hardware interrupt occurs; this is event number 6 in Figure 6. After the algorithm has been processed, the user-thread calls the write() function of the device driver, which simulates the communication with the actuator by setting a parallel port pin. Figure 7 shows the sequence of events, where the vertical axis represents the priority level of the executing event.
Figure 6: Full event cycle of the user-thread implementation.
Figure 7: Sequential time diagram of the user-thread implementation.

9.1.2 Driver Implementation
In the driver implementation, the algorithm was executed directly in the DPC of the driver. Figure 8 shows the full event cycle for the implementation. As with the user-thread implementation, the event cycle starts when the interrupt occurs. Figure 9 shows the sequence of events, where the vertical axis represents the priority level of the executing event. Because of the fewer events and higher priority levels of the driver implementation, it was reasonable to believe it would perform better than the user-thread implementation; more specifically, lower WCETs were expected.
Figure 8: Full event cycle of the driver implementation.
Figure 9: Sequential time diagram of the driver implementation.
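The fragment below sketches the shape of the driver implementation described above: a minimal ISR that only acknowledges the interrupt and queues a DPC, and a DPC routine that runs the algorithm and sets the output pin. It is a simplified sketch under stated assumptions, not the tested driver; the names SimulatedAlgorithm, WriteActuatorPin, and the device-extension layout are placeholders, and error handling is omitted. The KeSetImportanceDpc call shows how the prioritized-DPC variant would ask the kernel to place the DPC at the head of the queue.

    #include <wdm.h>

    typedef struct _DEVICE_EXTENSION {
        KDPC   Dpc;       /* DPC object queued from the ISR           */
        double Input;     /* simulated sensor input (placeholder)     */
    } DEVICE_EXTENSION, *PDEVICE_EXTENSION;

    double SimulatedAlgorithm(double input);   /* placeholder, see Section 9.1   */
    void   WriteActuatorPin(double value);     /* placeholder parallel port write */

    /* ISR: runs at DIRQL, so it only acknowledges the device and defers
       the real work to a DPC. */
    BOOLEAN ParallelPortIsr(PKINTERRUPT Interrupt, PVOID Context)
    {
        PDEVICE_EXTENSION ext = (PDEVICE_EXTENSION)Context;
        UNREFERENCED_PARAMETER(Interrupt);

        /* ...acknowledge/clear the interrupt on the device here... */

        KeInsertQueueDpc(&ext->Dpc, NULL, NULL);   /* schedule the deferred work */
        return TRUE;
    }

    /* DPC: runs at DISPATCH_LEVEL; executes the algorithm and writes the output. */
    VOID AlgorithmDpc(PKDPC Dpc, PVOID DeferredContext, PVOID Arg1, PVOID Arg2)
    {
        PDEVICE_EXTENSION ext = (PDEVICE_EXTENSION)DeferredContext;
        KFLOATING_SAVE fpState;

        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(Arg1);
        UNREFERENCED_PARAMETER(Arg2);

        /* Floating-point work in a DPC must be bracketed as in Section 4.6.5. */
        if (NT_SUCCESS(KeSaveFloatingPointState(&fpState))) {
            WriteActuatorPin(SimulatedAlgorithm(ext->Input));
            KeRestoreFloatingPointState(&fpState);
        }
    }

    /* Called once at driver initialization (e.g. from DriverEntry/AddDevice). */
    VOID InitDeferredWork(PDEVICE_EXTENSION ext)
    {
        KeInitializeDpc(&ext->Dpc, AlgorithmDpc, ext);

        /* Prioritized-DPC variant only: request placement at the head of the
           DPC queue instead of the default FIFO position. */
        KeSetImportanceDpc(&ext->Dpc, HighImportance);
    }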
9.2 Test System
The tests were conducted on a dedicated PC system running Windows XP Professional with Service Pack 2 installed. The hardware on which the measurements were conducted consisted of an ICP Electronics NANO-7270 motherboard, a Pentium M 1.6 GHz processor, and a Fujitsu MHT2060BH SATA hard disk drive. Attached to it were a standard USB keyboard and a PS/2 mouse.

9.2.1 System Services
All Windows XP system services not needed for the test system, such as Server, Workstation, and DNS, were disabled. Virtual memory was disabled as well. Only the most critical services required for XP to run properly were enabled, namely:
• Plug and Play
• Remote Procedure Call (RPC)

9.3 Execution Time Measurement
To determine the real-time performance of XP, the execution times of the different aspects described earlier in this section were measured. Three different methods for measuring execution time were considered:
• using the performance counter (PeC) available through the Win32 API,
• using the time-stamp counter (TSC) of the processor, and
• using an oscilloscope to externally measure signals on the parallel port.

9.3.1 Performance Counter
Measurements using the PeC were performed using the two functions provided in the Win32 API: QueryPerformanceCounter() and QueryPerformanceFrequency(). The QueryPerformanceFrequency() function returns the number of counter ticks per second, while QueryPerformanceCounter() returns the current counter value. Unfortunately, the PeC uses different hardware timers on different systems. Most platforms without any processor power saving technology such as SpeedStep use the TSC of the processor as the timer, while other systems use the chipset, BIOS, or power management timer[38].
This counter was evaluated under two different test conditions. The first condition stored two consecutive readings of the PeC (start and stop time) 6.8 million times in a for-loop. This condition generated a theoretical worst-case latency, since the test utilized 100% of the CPU, which makes it more likely for kernel-mode tasks such as scheduling to interrupt the PeC readings. In the second test condition, two consecutive readings were stored in the same manner, but with a Sleep() statement added after each start and stop measurement. This more closely simulated a real-time system used in the automation industry, where the system waits for an input sent from a sensor. The number of instructions required for reading the PeC is insignificant compared to the entire test system and algorithm. As a result, it is more likely that the system will be interrupted when not reading the PeC, which made this condition more applicable to a real-world application.
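The measurement loop for these two conditions presumably looked something like the sketch below, which stores pairs of QueryPerformanceCounter() readings and afterwards converts the differences to nanoseconds using QueryPerformanceFrequency(). The sample count and variable names are illustrative; the actual test code is not reproduced here.

    #include <windows.h>
    #include <stdio.h>

    #define SAMPLES 1000000   /* illustrative; the actual test stored 6.8 million pairs */

    static LARGE_INTEGER startTs[SAMPLES], stopTs[SAMPLES];

    int main(void)
    {
        LARGE_INTEGER freq;
        int i;

        QueryPerformanceFrequency(&freq);   /* counter ticks per second */

        for (i = 0; i < SAMPLES; i++) {
            QueryPerformanceCounter(&startTs[i]);
            QueryPerformanceCounter(&stopTs[i]);
            /* For the second test condition, a Sleep() call was added here,
               after each pair of readings. */
        }

        for (i = 0; i < SAMPLES; i++) {
            double ns = (double)(stopTs[i].QuadPart - startTs[i].QuadPart)
                        * 1e9 / (double)freq.QuadPart;
            printf("%f\n", ns);   /* one start-stop latency per line, in ns */
        }
        return 0;
    }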
On the test platform, the timer was running at a frequency of 3,579,545 Hz, which gives a resolution of 279 ns. The difference between the start and stop times under both system conditions can be seen in Figure 10. Table 2 provides a summary of the measurement statistics.
Figure 10: Measured start-stop time versus measurement number for the Performance Counter. (a) With Sleep(), (b) Without Sleep().
The test of the PeC without a Sleep() statement showed two discrete levels, one at 838 ns and another at 1,117 ns, which are equal to 3 and 4 ticks on the PeC, respectively. Both levels probably represent the normal latency introduced by two consecutive time-stamps. Since the processor/performance counter frequency ratio is about 450 to 1, we can assume that the normal latency of two time-stamps is somewhere between 3 and 4 ticks of the PeC. The second test of the PeC, with the added Sleep() statement, showed three discrete levels: the same two as in the previous test, and a third level with a slightly higher latency than the other two. Even though the PeC usually gives a low latency for taking the time-stamps, the tests showed maximum values as high as 126.27 µs and 45.26 µs for the first and second test, respectively. Figure 10 and the standard deviations of these tests showed that the PeC was unsuitable for time measurements in our applications, since samples were spread over the entire spectrum between 838 ns and 126.27 µs (45.26 µs for the second test).

9.3.2 Time-Stamp Counter
All processors built on the IA-32 architecture, starting with the Pentium processor, have a built-in TSC. The clock tick frequency of this counter varies between processor families. On some processors, the counter is increased at a constant rate determined by the processor configuration, while others increase the counter with every internal processor clock cycle[14]. In the P6 family (Pentium, Pentium M, Pentium 4, and Xeon) the TSC is implemented as a 64-bit counter and is guaranteed not to wrap around within 10 years after being reset[14].
Figure 11: Measured start-stop time versus measurement number for the Time-Stamp Counter. (a) With Sleep(), (b) Without Sleep().
The test platform, with its 1.6 GHz Pentium M processor, has the TSC implemented as a 64-bit counter increasing with every processor clock cycle.

            PeC without   PeC with   TSC without   TSC with
            Sleep()       Sleep()    Sleep()       Sleep()
Min              0.84        0.84         0.03        0.03
Mean             1.07        0.99         0.03        0.03
WCET           126.27       45.26        60.56       39.65
Std. dev.        0.37        0.22         0.05        0.02
Table 2: Measured start-stop time in µs for the PeC and TSC.

Since the test was conducted with SpeedStep technology disabled, a constant clock frequency was guaranteed, giving a resolution of 0.625 ns. To compare the latency of the TSC with that of the performance counter, the two evaluation tests explained in the previous section were conducted on the TSC as well. The TSC showed lower latency for two consecutive readings than the performance counter. The different test conditions had a higher impact on the results than in the test of the PeC. As seen in Figure 11, the test without any Sleep() statement has the same latency for almost all samples, while the test condition with the Sleep() statement added shows three discrete levels, where the highest of the three only occurred during the last half of the test. However, the test without a Sleep() statement had a higher worst-case latency and a higher standard deviation than the other test. Table 2 shows the minimum, mean value, WCET, and standard deviation (in µs) of all four tests conducted on the PeC and the TSC. Even though the maximum values of the TSC were in the same range as those of the PeC, only a few of the samples exceeded 1 µs. The low latency, and the fact that it is unlikely for two consecutive readings to have a latency above 1 µs, gave determinism good enough for the TSC to be suitable for the measurements in our conducted tests.
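On the Pentium M, the TSC can be read from user mode with the RDTSC instruction, exposed by Microsoft compilers as the __rdtsc() intrinsic. The sketch below shows how two consecutive readings would presumably be taken and converted to nanoseconds; the hard-coded 1.6 GHz clock frequency mirrors the test platform and is otherwise an assumption.

    #include <intrin.h>
    #include <stdio.h>

    #pragma intrinsic(__rdtsc)

    int main(void)
    {
        const double cpuHz = 1.6e9;        /* Pentium M 1.6 GHz, SpeedStep disabled */
        unsigned __int64 start, stop;

        start = __rdtsc();                 /* first time-stamp  */
        /* ...code under measurement would go here... */
        stop  = __rdtsc();                 /* second time-stamp */

        printf("latency: %.1f ns\n", (double)(stop - start) * 1e9 / cpuHz);
        return 0;
    }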
9.3.3 Oscilloscope
The ISR latency (the delay between a hardware interrupt and the start of the ISR execution) is impossible to measure using either the PeC or the TSC, since the start of the event occurs at the OS level, which the test implementations have no control over. For this reason, an external measurement approach was also necessary. An Agilent Infiniium 54833D MSO oscilloscope was connected to the parallel port of the motherboard, measuring the voltage on selected pins.
To verify the reliability of this measurement method, a simple test was conducted where a parallel port pin was set (logical 1) and then immediately unset (logical 0) again. This test was run in a user-thread of priority 31 (Realtime) and was iterated one million times. Normally, a user-thread is not allowed to write to port registers, because this is a restricted kernel-mode operation and will cause a Privileged Instruction exception. Because of this, a third-party solution called AllowIO was used, which can grant a process full rights to any port[26]. The results of the test showed a maximum jitter of less than 5 µs, with a maximum execution time of 6.21 µs and an average execution time of 1.37 µs. This was significantly more deterministic than using the PeC or TSC, and the accuracy was sufficient for the other tests conducted. One significant limitation of the oscilloscope was its inability to save each individual measurement to a file for later analysis. The oscilloscope was only capable of calculating the WCET, minimum execution time, mean execution time, and standard deviation of the collected data set.

9.4 System Load Conditions
The real-time application tests were conducted under different load conditions in order to evaluate the performance impact. Several applications were developed to realize these load conditions. These applications were developed in Visual Studio .NET. A custom device driver was also developed, to allow the measurement of ISR/DPC latency and performance, and of communication between device drivers and user processes.

9.4.1 Idle
When the system was idle, no processes other than the ones necessary for XP to function properly were running. The network was disabled, and the keyboard and mouse were not used.

9.4.2 CPU Load
For this system load, a simple C application was developed, running an endless for-loop to utilize 100% of the CPU. The process was running at the Normal process and thread priority levels.

9.4.3 Graphics Load
A custom application was written in Visual Basic .NET to realize this system load. It dynamically created many graphical user interface (GUI) controls and then moved, resized, and changed properties on them. The purpose was to test how GUI rendering by normal applications affected real-time performance.

9.4.4 HDD Load
Two large files were copied back and forth on the HDD to determine how disk activity affects real-time performance. A simple batch script was used to achieve this load condition.

                  Idle             CPU Load        Graphics Load        HDD Load        Network Load        Stress
User-thread       UserIdle         UserCPU         UserGraphics         UserHDD         UserNetwork         UserStress
DPC               DriverIdle       DriverCPU       DriverGraphics       DriverHDD       DriverNetwork       DriverStress
Prioritized DPC   DriverPrioIdle   DriverPrioCPU   DriverPrioGraphics   DriverPrioHDD   DriverPrioNetwork   DriverPrioStress
Table 3: Test names used throughout the report.

9.4.5 Network Load
For this system load, a batch script was used to transfer large files over a small local network connected with a router. In effect, both network and disk load were measured at the same time.
A HP Vectra VL800 running FileZilla Server version 0.9.12 beta was used as the File Transfer Protocol (FTP) server. The test platform was running the console-based FTP client NcFTP version 3.1.9.

9.4.6 Stress
In the Stress mode, all of the above load conditions were running at the same time, to simulate a worst-case scenario for the real-time application.

9.5 Test Names
Table 3 shows the names used to identify the specific tests conducted under the various load conditions.

9.6 Additional Tests
Aside from the above tests measuring the impact of different load conditions, additional tests were conducted to measure mechanisms such as a process context switch, the time quantum of a process, etc. These test results are not presented in the report, as they were only conducted to gain a better understanding of the primary execution time tests.
All tests listed in Table 3 were conducted using the oscilloscope. However, since the oscilloscope was incapable of collecting and saving each sample in the tests, additional tests were conducted using the TSC. This provided a scatter plot of the measured execution time for every sample and an execution time distribution of the samples. The TSC tests measured the algorithm execution time 4,500,000 times in each test. Because of the limited time available for these additional tests, they were only conducted on the user-thread implementation.

10 Results
10.1 TSC Measurement Results
This section presents the results of the TSC measurements of each user-thread test graphically, using two diagrams for each test. The first diagram shows a scatter plot of the measured execution time in µs for every sample, while the second shows the execution time distribution of the samples. Table 4 shows the minimum, mean, WCET, and standard deviation of the TSC test results.

            UserIdle   UserCPU   UserGraphics   UserHDD   UserNetwork   UserStress
Min            36.88     36.89          36.87     36.87         36.88        36.88
Mean           37.27     37.40          37.29     37.40         38.36        39.40
WCET          132.24    107.15         154.74    155.42        144.48       168.87
Std. dev.       0.83      0.72           1.00      2.00          4.58         6.20
Table 4: Algorithm execution time in µs for the TSC tests.

10.1.1 UserIdle
As seen in the time distribution diagram in Figure 12, a majority of the samples measured 37.42 µs, represented by the distinct lowest line in the scatter plot. This means that the majority of the algorithm calculations in the test were not interrupted by other tasks. The remaining samples were distributed in a time spectrum ranging from around 40 µs to 120 µs, except for two samples taking 130.55 µs and 132.24 µs, respectively. It is possible to identify discrete levels in this spectrum where samples are more densely grouped; for example, one level exists at 75 µs. However, because of the black-box approach used when testing XP, the reason why these levels exist is not known.
One interesting note about the UserIdle scatter plot is the change of characteristics after one third of the test period. The reason could be one or more device drivers entering a power saving mode; for example, the HDD might be spinning down after a period of inactivity. The power saving functionality is part of the WDM development guidelines[2, 25].
Figure 12: UserIdle algorithm execution time. (a) Scatter plot, (b) Time distribution.

10.1.2 UserCPU
The difference between UserIdle and UserCPU was minimal. A majority of the samples measured 37.43 µs and the remaining samples were distributed in the 40-120 µs range. The fact that a pure CPU load did not affect the performance of the test implementation much was not surprising, since the test ran in the Realtime priority class, while the CPU load application ran in the Normal priority class.
10.1.3 UserGraphics
In the UserGraphics test, the sample distribution in the 40-120 µs range was denser, which means more algorithm calculations were interrupted than in UserIdle and UserCPU. However, the WCET sample of 154.74 µs was not much worse than the WCET for UserIdle, which indicates that the real-time performance of XP is not significantly affected by GUI stress.

10.1.4 UserHDD
Under the HDD load condition, there was a significant increase in samples around 50 µs, as seen in Figure 15. Also, the spectrum between 40 µs and 80 µs was denser than under the previous load conditions. However, the WCET was just 155.42 µs, which was similar to the results of the previous tests. As in the UserIdle test, the UserHDD test changed characteristics after a period of time. In this test, the change occurred after two thirds of the time, when the samples ranging between 50 µs and 80 µs suddenly dropped to a more compact range of 50-60 µs. The reason for this is unknown, but as seen in Figure 15, it does not affect the WCET. In fact, the WCET sample was measured near the end of the test, where the scatter plot showed the best temporal determinism.
Figure 13: UserCPU algorithm execution time. (a) Scatter plot, (b) Time distribution.
Figure 14: UserGraphics algorithm execution time. (a) Scatter plot, (b) Time distribution.
Figure 15: UserHDD algorithm execution time. (a) Scatter plot, (b) Time distribution.

10.1.5 UserNetwork
The network load condition had a significant number of samples around 60 µs, and the range between 40 µs and 65 µs was dense. The samples above 65 µs were distributed in a similar way as in the UserHDD test, and the WCET was 144.48 µs.

10.1.6 UserStress
The UserStress test, running all previous load conditions at the same time, had (perhaps unsurprisingly) the largest impact on real-time performance. The density of samples around 60 µs was even higher than in UserNetwork, but the characteristics and time distribution were similar. However, even under this stressed load condition, no sample exceeded 170 µs. In fact, although not specifically designed for it, XP seems to do a good job of keeping the WCET at what appears to be a limited level; the amount of system load applied seems to have only a small impact on the measured WCET.
Every one million samples, the test changed characteristics for a short period of time, as seen in the scatter plot of Figure 17. In actual time, this was roughly every 15 minutes. The reason for this behavior is not known, but it did not negatively affect real-time performance.
Figure 16: UserNetwork algorithm execution time. (a) Scatter plot, (b) Time distribution.
Figure 17: UserStress algorithm execution time. (a) Scatter plot, (b) Time distribution.

10.2 Oscilloscope Test Results
The results of the tests conducted using the oscilloscope are briefly presented in this section. For a complete listing of these test results, see Appendix A. The oscilloscope test results showed a surprisingly good level of predictability compared to the results of previous work in the field of XP real-time performance[27, 24, 3]. The CPU load conditions (UserCPU, DriverCPU, and DriverPrioCPU) had a minor impact on the tests. At most, a slightly higher standard deviation was measured, but the WCETs were similar to those of the idle tests (UserIdle, DriverIdle, and DriverPrioIdle).
Similar to the TSC tests, the HDD and network loads had the biggest performance impact after the stress tests. As expected, the driver implementation had shorter average execution times and WCETs than the user-thread implementation. For example, the UserStress WCET was 450.89 µs, whereas the DriverStress and DriverPrioStress WCETs were 328.30 µs and 356.06 µs, respectively. One surprising discovery was that the tests with prioritized DPCs had longer execution times than the normal DPC tests in many cases. Possible reasons for this are discussed in Section 11.

11 Conclusions
A number of observations can be made from the tests conducted. The following list summarizes the most important observations, and the rest of this chapter is devoted to explaining them in greater detail:
• XP has better determinism than was reported in previous work (Section 7).
• Higher task priority yields better determinism.
• A pure driver implementation is faster and more deterministic than a user-mode implementation.
• Task interruption can occur anywhere in a full event cycle.
• The difference in execution time and determinism between a prioritized and a normal DPC is small.
• The algorithm execution time is longer in kernel-mode than in normal user-mode.
• No hard guarantees in terms of WCET can be given.

11.1 Better Determinism Than Reported In Previous Work
When evaluating the test results, a general reflection is that the latencies and WCETs were much more predictable than previously reported in the field of XP real-time performance[27, 24, 3]. While [24] reported application WCETs of almost 10,000% above the average execution time, our tests never even produced full event cycle WCETs ten times above the average execution time. There can be several reasons for this difference, where the most probable one is different load conditions. It is impossible to pinpoint the exact reasons, since the conditions under which the tests in the previous work were conducted are not known.

11.2 Higher Task Priority Yields Better Determinism
It is clear from the results that a higher priority level yields a higher degree of determinism. The ISR, running at the highest priority level in the tests, had a maximum latency of 41.82 µs after an interrupt was triggered. This was measured in the UserHDD test. Compared to the mean latency of 11.96 µs, the maximum latency is roughly four times larger. In the same test, the maximum time between the scheduling of a DPC and its actual execution was 80.61 µs. Compared to the mean latency of 3.72 µs, the maximum latency is over 20 times larger, which is a significantly larger variation than for the ISR. The reason, as discussed in Section 4.4, is that a DPC can be interrupted by any interrupt, whereas the ISR can only be interrupted by interrupts at higher IRQLs. Since the parallel port uses IRQL 3 on the test system, the only devices with priority over it are the keyboard and the system timer.

11.3 Driver Faster Than User-Mode
A pure driver implementation has a shorter WCET and is more deterministic than a user-mode implementation communicating with a driver. This came as no surprise, considering the reduced number of steps required in the driver compared to the user-mode implementation. Also, the algorithm runs at a higher priority level (DISPATCH_LEVEL) in the driver implementation, which reduces the probability of it being interrupted.

11.4 Task Interruption Can Occur Anywhere
Every individual step in a full event cycle can be interrupted at any time.
In the test results, it is easy to see that a discrete step, such as the DPC execution time, has a significantly higher WCET than its average execution time. The sum of the WCETs of the individual events exceeds the actually measured WCET for the whole event cycle. This means that, theoretically, the WCET is higher than measured in the tests. However, the test results indicate that it is statistically very unlikely that all steps in the chain of events will be interrupted within a single event cycle.

11.5 Small Difference Between Normal and Prioritized DPC
The difference between a prioritized and a normal FIFO DPC in terms of WCET and average execution time was unexpectedly small. In fact, many of the driver implementation tests had longer execution times when using a prioritized DPC. Two possible reasons are considered, where a mixture of the two may be closer to the actual explanation:
1. The DPC queue is never long enough for the priority to make a difference. This reason alone seems unlikely, considering the HDD and Network load conditions, both of which generate many DPCs.
2. The other device drivers loaded in the system also use prioritized DPCs. This is impossible to know without access to the source code of every device driver loaded in the system.
The use of prioritized DPCs in a real-time application is therefore not advisable.

11.6 Algorithm Slower in Kernel-Mode
The average execution time for the algorithm is approximately 12% shorter when executing in a normal user-mode thread than when executing in the DPC. This is likely because of different levels of code optimization in the DDK compiler and the Visual Studio 2003 compiler.

11.7 No Guarantees Can Be Given
While the specific tests conducted did not yield any execution times over one millisecond, no hard guarantees about an absolute WCET can be made. The tests only prove that, under exactly the load conditions simulated, execution times exceeding the results were not measured. However, the tests show that execution times exceeding those measured are very unlikely. Thus, this indicates that XP might be suitable as a soft RTOS under certain controlled conditions.

12 Future Work
The results from the tests showed that XP could be suitable as a soft real-time system. However, only the inner workings of XP were evaluated, which is just one part of the target platform suggested by the ABB business unit. Although the temporal predictability of XP as it stands is sufficient for some soft real-time systems, different techniques to further increase the determinism should be evaluated to make XP an alternative for systems with even stricter temporal constraints. The following areas would be of interest to evaluate if more time were available:
• Use of an Ethernet based protocol for communication
• Modify interrupt handling
• Run the tests on XPE instead of XP
• Evaluate extensions

12.1 Use of an Ethernet Based Protocol for Communication
As mentioned in Section 8, all conducted tests used the parallel port to simulate communication from the sensor and to the actuator. In the original model suggested by the business unit (see Figure 5), the communication between actuator and sensor could use an Ethernet stack to decrease production cost. Evaluating the temporal behavior of the TCP/IP stack using UDP would be of interest. If the temporal predictability of this protocol stack is not sufficient, other Ethernet-based protocols could be evaluated instead.
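As a rough illustration of what such an evaluation would exercise, the fragment below sends a single simulated actuator value over UDP using the Win32 Winsock API. The address, port, and payload format are placeholders; the sketch only indicates which code path (socket creation and the sendto() call) a timing evaluation of the XP TCP/IP stack would need to measure.

    #include <winsock2.h>
    #include <stdio.h>

    #pragma comment(lib, "ws2_32.lib")

    int main(void)
    {
        WSADATA wsa;
        SOCKET s;
        struct sockaddr_in actuator = {0};
        double value = 42.0;                  /* simulated actuator output */

        WSAStartup(MAKEWORD(2, 2), &wsa);

        s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

        actuator.sin_family = AF_INET;
        actuator.sin_port = htons(5000);                       /* placeholder port    */
        actuator.sin_addr.s_addr = inet_addr("192.168.0.10");  /* placeholder address */

        /* This call is the part whose latency and jitter would be measured. */
        sendto(s, (const char *)&value, sizeof(value), 0,
               (struct sockaddr *)&actuator, sizeof(actuator));

        closesocket(s);
        WSACleanup();
        return 0;
    }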
12.2 Modify Interrupt Handling
To modify the interrupt handling in XP, two different approaches are suggested as future work: modify the source code of the HAL, or intercept interrupts before they even reach the HAL. As mentioned in Section 4, the source code for the HAL can be obtained from Microsoft under a special agreement. Some simple modifications could possibly increase the determinism enough to make XP a more suitable alternative for systems with stricter temporal constraints.
Interception of interrupts could be done by modifying the IDT to pass all interrupts to a custom interrupt handler routine. A suggested model is presented in Figure 18. When an interrupt occurs in this model, it is passed to the customized interrupt handler, which first examines the interrupt vector to determine whether the interrupt is intended for the time-critical system or not. If the interrupt was intended for the system (represented by path An in Figure 18), the custom interrupt handler sets a flag to mark that an interrupt for the system is pending, queues all incoming interrupts, executes the critical application, turns off the pending flag, and finally processes the queue of interrupts. If the interrupt was not intended for the time-critical application, the customized interrupt handler simply passes the interrupt to the HAL, and processing of the interrupt is handled by the XP I/O Manager as normal (represented by path Bn in Figure 18).
Figure 18: Suggested model for interrupt interception.

12.3 Run the Tests on XPE
Because of the limited time available for this master's thesis, the tests were conducted on Windows XP Professional instead of XPE. Although the kernel, thread priorities, scheduling algorithms, and inter-process communication of XP and XPE are identical, further testing on XPE would be interesting, to see whether additional system services not needed by the ABB business unit could be disabled to improve temporal predictability.

12.4 Evaluate Extensions
This report only had time for a brief overview of the third-party real-time extensions. Although [32] shows promising results for the evaluated extensions, the number of tests conducted is too small to draw any real conclusions about the temporal predictability of the extensions. Further analysis of the available real-time extensions would be a subject for future work in this area.

A Oscilloscope Test Results
User-thread measurements
All user-thread tests measured the full event cycle as well as selected individual events described in Figure 6. The test results use the same event names as Figure 7. All test results are presented in µs.

UserIdle
            t0-tGf   t0-tAs   tAf-tBs   t0-tDf   tDf-tGf
Min         106.72     8.69      1.96    81.67     24.59
Mean        110.89    10.21      2.96    85.38     25.52
WCET        186.76    21.30     62.96   160.55     94.88
Std. dev.     1.68     0.69      0.34     1.51      0.67
Number of samples: 994 210

UserCPU
            t0-tGf   t0-tAs   tAf-tBs   t0-tDf   tDf-tGf
Min         103.57     8.60      2.08    80.41     22.27
Mean        109.28    10.19      2.15    85.04     24.24
WCET        192.55    25.43     66.02   165.71    100.28
Std. dev.     2.46     0.69      0.39     2.18      0.83
Number of samples: 1 064 400

UserGraphics
            t0-tGf   t0-tAs   tAf-tBs   t0-tDf   tDf-tGf
Min         104.52     8.69      2.08    81.74     22.16
Mean        118.25    10.39      2.99    93.60     24.66
WCET        256.30    22.85     85.17   226.59    105.60
Std. dev.    14.70     0.78      0.65    13.47      1.50
Number of samples: 441 060
UserHDD
            t0-tGf   t0-tAs   tAf-tBs   t0-tDf   tDf-tGf
Min         103.05     8.69      2.09    81.16     21.84
Mean        145.05    11.80      3.78   119.14     25.87
WCET        327.89    42.67    163.61   298.27    123.26
Std. dev.    17.79     2.03      1.60    16.41      2.34
Number of samples: 1 339 300

UserNetwork
            t0-tGf   t0-tAs   tAf-tBs   t0-tDf   tDf-tGf
Min         102.99     8.69      2.09    81.05     21.77
Mean        130.04    11.27      3.52   104.58     25.46
WCET        402.52    42.43    214.81   358.28    130.16
Std. dev.    11.31     1.48      2.26    10.06      3.98
Number of samples: 1 099 900

UserStress
            t0-tGf   t0-tAs   tAf-tBs   t0-tDf   tDf-tGf
Min         102.68     8.74      2.08    80.13     21.98
Mean        141.32    12.14      3.54   114.55     26.77
WCET        450.89    51.34    243.74   416.50    147.96
Std. dev.    20.04     2.72      4.10    18.14      5.60
Number of samples: 1 267 800

Driver Measurements
All device driver tests measured the full event cycle as well as selected individual events described in Figure 8. The test results use the same event names as Figure 9. All test results are presented in µs.

DriverIdle
            t0-tBf   t0-tAs   tAf-tBs   tAs-tAf   tBs-tBf
Min          56.93     6.50      2.06      2.77     44.49
Mean         58.58     7.98      3.05      2.95     44.61
WCET        119.63    18.82     60.54     13.37     59.38
Std. dev.     0.84     0.68      0.34      0.06      0.35
Number of samples: 1 294 000

DriverCPU
            t0-tBf   t0-tAs   tAf-tBs   tAs-tAf   tBs-tBf
Min          57.08     6.44      2.07      2.78     44.49
Mean         65.53     7.96      2.12      2.81     52.62
WCET        125.81    18.80     59.73     13.03     67.75
Std. dev.     0.84     0.68      0.33      0.05      0.82
Number of samples: 1 314 900

DriverGraphics
            t0-tBf   t0-tAs   tAf-tBs   tAs-tAf   tBs-tBf
Min          56.61     6.48      2.06      2.77     44.49
Mean         60.60     8.19      2.92      3.00     46.49
WCET        146.52    21.91     90.39     16.85     71.61
Std. dev.     3.60     0.81      0.64      0.16      3.43
Number of samples: 1 315 300

DriverHDD
            t0-tBf   t0-tAs   tAf-tBs   tAs-tAf   tBs-tBf
Min          56.68     6.53      2.07      2.78     44.51
Mean         65.05     9.32      3.51      3.61     48.61
WCET        190.39    26.01    123.43     19.78     84.33
Std. dev.     5.92     1.33      1.50      0.89      3.46
Number of samples: 963 450

DriverNetwork
            t0-tBf   t0-tAs   tAf-tBs   tAs-tAf   tBs-tBf
Min          57.05     6.67      2.09      2.94     44.71
Mean         65.21     9.08      3.36      3.45     49.33
WCET        259.72    25.86    182.26     27.14     89.71
Std. dev.     5.67     1.05      2.26      0.64      4.15
Number of samples: 1 353 900

DriverStress
            t0-tBf   t0-tAs   tAf-tBs   tAs-tAf   tBs-tBf
Min          64.01     6.49      2.07      2.89     51.23
Mean         72.31     9.49      3.09      3.67     56.06
WCET        328.30    31.37    249.14     26.81     93.01
Std. dev.     7.20     1.55      3.65      1.20      3.05
Number of samples: 1 312 300

DriverPrioIdle
            t0-tBf   t0-tAs   tAf-tBs   tAs-tAf   tBs-tBf
Min          56.56     6.53      2.08      2.79     44.51
Mean         58.63     8.03      3.05      2.94     44.62
WCET        110.85    17.32     56.18     13.41     59.41
Std. dev.     0.83     0.68      0.28      0.07      0.36
Number of samples: 991 920

DriverPrioCPU
            t0-tBf   t0-tAs   tAf-tBs   tAs-tAf   tBs-tBf
Min          64.00     6.44      2.07      2.78     52.54
Mean         65.57     7.99      2.12      2.84     52.62
WCET        121.72    13.98     57.59     13.33     67.80
Std. dev.     0.84     0.68      0.32      0.05      0.35
Number of samples: 906 930

DriverPrioGraphics
            t0-tBf   t0-tAs   tAf-tBs   tAs-tAf   tBs-tBf
Min          56.58     6.47      2.06      2.78     44.50
Mean         60.22     8.13      2.91      3.01     46.17
WCET        135.74    17.74     80.38     12.76     71.19
Std. dev.     3.37     0.77      0.58      0.22      3.23
Number of samples: 809 390

DriverPrioHDD
            t0-tBf   t0-tAs   tAf-tBs   tAs-tAf   tBs-tBf
Min          56.68     6.56      2.07      2.79     44.51
Mean         65.44     9.38      3.55      4.25     48.26
WCET        223.53    39.20    138.00     31.30     88.63
Std. dev.     6.12     1.35      1.49      1.02      3.50
Number of samples: 3 776 400

DriverPrioNetwork
            t0-tBf   t0-tAs   tAf-tBs   tAs-tAf   tBs-tBf
Min          56.93     6.73      2.07      2.82     44.73
Mean         68.44     9.21      3.63      3.95     51.65
WCET        297.21    25.15    218.36     24.53     84.95
Std. dev.     7.25     1.29      3.20      0.97      4.97
Number of samples: 990 750

DriverPrioStress
            t0-tBf   t0-tAs   tAf-tBs   tAs-tAf   tBs-tBf
Min          64.09     6.52      2.07      2.80     52.51
Mean         72.73     9.42      3.28      3.28     55.82
WCET        356.06    33.63    278.43     28.39     96.11
Std. dev.     7.61     1.59      3.92      1.37      3.16
Number of samples: 998 320
Algorithm Execution Time

The following test was conducted to measure the difference in execution time of floating-point operations between a user thread and a device driver. The results are presented in µs.

             User-thread   Device driver
Min                39.34           41.67
Mean               39.87           41.76
WCET              105.28           56.40
Std. dev.           0.79            0.32
Samples          153 910         116 370
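As a rough illustration of how such an execution time can be obtained from a user-mode thread, the sketch below times a placeholder floating-point workload with the Windows high-resolution performance counter. This is not necessarily the instrumentation used for the oscilloscope measurements in this appendix; the workload in RunAlgorithm and the iteration count are assumptions made for the example.

    /*
     * Sketch: timing a floating-point workload from a user-mode thread with
     * the Windows high-resolution performance counter.  RunAlgorithm() is a
     * placeholder; the actual operations used in the test are not listed here.
     */
    #include <windows.h>
    #include <stdio.h>

    static volatile double sink;   /* keeps the compiler from removing the loop */

    static void RunAlgorithm(void)
    {
        double x = 1.0;
        for (int i = 1; i <= 10000; i++)
            x = x * 1.000001 + 1.0 / i;   /* placeholder floating-point work */
        sink = x;
    }

    int main(void)
    {
        LARGE_INTEGER freq, start, stop;
        QueryPerformanceFrequency(&freq);

        QueryPerformanceCounter(&start);
        RunAlgorithm();
        QueryPerformanceCounter(&stop);

        double us = (double)(stop.QuadPart - start.QuadPart) * 1e6 / (double)freq.QuadPart;
        printf("algorithm execution time: %.2f us\n", us);
        return 0;
    }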
References

[1] Ardence RTX Real-time Extension for Control of Windows. Ardence. http://www.ardence.com/assets/5f940542924c4a42b30fc5584872d798.pdf.
[2] A. Baker and J. Lozano. The Windows 2000 Device Driver Book. Prentice Hall PTR, 2001.
[3] A. Baril. Using Windows NT in Real-Time Systems. In Proceedings of the Fifth IEEE Real-Time Technology and Applications Symposium (RTAS '99), pages 132–141, Washington - Brussels - Tokyo, 1999. IEEE Computer Society.
[4] L. Budin and L. Jelenkovic. Time-Constrained Programming in Windows NT Environment. In Proceedings of the IEEE International Symposium on Industrial Electronics (ISIE '99), pages 90–94, Bled, 1999. IEEE Computer Society.
[5] J. Cinkelj et al. Soft Real-Time Acquisition in Windows XP. In Intelligent Solutions in Embedded Systems, Third International Workshop, pages 110–116, Bled, 2005.
[6] Comparisons with the .NET Framework. http://msdn.microsoft.com/library/default.asp?url=/library/enus/dv_evtuv/html/etconcomparisonswithnetframework.asp.
[7] I. Crnkovic and M. Larsson. Building Reliable Component-Based Software Systems. Artech House, Inc., 2002.
[8] S. Daily. Introducing Windows NT 4.0. 29th Street Press, February 1997.
[9] E. Dekker and J. Newcomer. Developing Windows NT Device Drivers. Addison-Wesley, 1999.
[10] Hard Real-Time with Venturcom RTX on Microsoft Windows XP and Windows XP Embedded. Venturcom, Inc., September 2003. http://msdn.microsoft.com/library/default.asp?url=/library/enus/dnxpesp1/html/tchHardRealTimeWithVenturcomRTXOnMicrosoftWindowsXPWindowsXPEmbedded.asp.
[11] Hyper-Threading Technology Overview. Intel Corporation. http://www.intel.com/business/bss/products/hyperthreading/overview.htm.
[12] HyperKernel - Real-time Extensions for Windows NT/2000. Nematron. http://www.nematron.com/HyperKernel/.
[13] Intel Corporation. IA-32 Intel Architecture Software Developer's Manual, Volume 3A: System Programming Guide, Part 1, January 2006. Order Number: 253668-018.
[14] Intel Corporation. IA-32 Intel Architecture Software Developer's Manual, Volume 3B: System Programming Guide, Part 2, January 2006. Order Number: 253669-018.
[15] INtime. TenAsys. http://www.tenasys.com/intime.html.
[16] INtime 3.0 Real-time Operating System (RTOS) Extension for Windows. TenAsys. http://www.tenasys.com/resources/getFile.php?leid=6.
[17] D. Kresta. Getting Real with NT: Approaches to Real-Time Windows NT. Real-Time Magazine, 2:32–35, 1997.
[18] KUKA Controls GmbH - Hard Real-Time Windows XP. KUKA Controls GmbH. http://www.kuka-control.com/product/.
[19] P. N. Leroux. RTOS versus GPOS: What is best for embedded development? Embedded Computing Design, January 2005.
[20] C. Liu and J. Layland. Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment. Journal of the ACM, 20(1), 1973.
[21] M. Lutz and P. Laplante. C# and the .NET Framework: Ready for Real-Time? IEEE Software, 20(1):74–80, 2003.
[22] Microsoft Windows CE 5.0. http://msdn.microsoft.com/library/default.asp?url=/library/enus/wceintro5/html/wce50oriWelcomeToWindowsCE.asp.
[23] C. Nordström et al. Robusta realtidssystem. Mälardalen Real-Time Research Centre, Västerås, August 2000.
[24] K. Obenland, J. Kowalik, T. Frazier, and J. Kim. Comparing the Real-Time Performance of Windows NT to an NT Real-Time Extension. In Proceedings of the Fifth IEEE Real-Time Technology and Applications Symposium (RTAS '99), pages 142–153, Washington - Brussels - Tokyo, 1999. IEEE Computer Society.
[25] W. Oney. Programming the Microsoft Windows Driver Model. Microsoft Press, 1999.
[26] C. Peacock. PortTalk - A Windows NT I/O Port Device Driver. http://www.beyondlogic.org/porttalk/porttalk.htm.
[27] K. Ramamritham et al. Using Windows NT for Real-Time Applications: Experimental Observations and Recommendations. In Proceedings of the Fourth IEEE Real-Time Technology and Applications Symposium (RTAS '98), pages 132–141, Washington - Brussels - Tokyo, June 1998. IEEE Computer Society.
[28] Real-Time Operating Systems: INtime Architecture. TenAsys Corporation, September 2003. http://msdn.microsoft.com/library/default.asp?url=/library/enus/dnxpesp1/html/tchReal-TimeOperatingSystemsINtimeArchitecture.asp.
[29] RTX. Ardence. http://www.ardence.com/embedded/products.aspx?ID=70.
[30] M. E. Russinovich and D. A. Solomon. Microsoft Windows Internals, Fourth Edition: Microsoft Windows Server 2003, Windows XP, and Windows 2000. Microsoft Press, 2005.
[31] A. Tanenbaum. Modern Operating Systems, Second Edition. Prentice Hall International, 2001.
[32] M. Timmerman et al. Designing for Worst Case: The Impact of Real-Time OS Performance on Real-World Embedded Design. Real-Time Magazine, 3:11–19, 1998.
[33] M. Timmerman and J-C. Monfret. Designing for Worst Case: The Impact of Real-Time OS Performance on Real-World Embedded Design. Real-Time Magazine, 3:52–56, 1997.
[34] M. Timmerman and J-C. Monfret. Windows NT as Real-Time OS? Real-Time Magazine, 2:6–13, 1997.
[35] M. Timmerman and J-C. Monfret. Windows NT Real-Time Extensions: an Overview. Real-Time Magazine, 2:14–24, 1997.
[36] Windows Driver Model (WDM). Microsoft Corporation, April 2002. http://www.microsoft.com/whdc/archive/wdmoverview.mspx.
[37] Windows XP Embedded Home Page. Microsoft Corporation, November 2005. http://msdn.microsoft.com/embedded/windowsxpembedded/.
[38] P. Work and K. Nguyen. Measure Code Sections Using The Enhanced Timer. Intel Corporation. http://www.intel.com/cd/ids/developer/asmo-na/eng/209859.htm.