Predictable Programming on a Precision Timed Architecture
Ben Lickly, Isaac Liu, Hiren Patel, Edward Lee, University of California, Berkeley
Sungjun Kim, Stephen Edwards, Columbia University, New York
Presented by Ashutosh Dhekne, PhD Student, University of Illinois at Urbana-Champaign

Goal of the Paper
• Rethink processor architecture to provide predictable timing
• Why such a stance?
• Current computers are optimized for average performance
• Too many time-saving tricks complicate WCET analysis
• How to achieve it?
• Exposed memory hierarchies
• Thread-interleaved pipelining
• Deadline instructions
[Slide art: CPU/RAM sketches of caching, pipelined execution, virtual memory, and frequency scaling]

Words that Stick [link]

The Familiar Architecture (x86)
[Slide diagram, external material drawn from memory: a CISC processor with an instruction pipeline, ALUs, internal and task-switch registers, an MMU, and paging; cache tries and cache misses separate low-latency accesses from high-latency main-memory and HDD accesses; DMA-based IO is transparent to the program]

The PRET Architecture
[Slide diagram: a RISC processor with a thread-interleaved pipeline, six per-thread register files (0–5), a thread controller, ALUs, and scratchpad memory that is part of the memory address space; main memory (code and data), DMA, and memory-mapped IO are reached through a six-slot memory wheel]

Paper Innovations
• Main memory layout:
• 0x00000000–0x00000FFF: boot code used by each thread on startup; initializes all the registers
• 0x3F800000–0x3FFFFFFF: shared data, 8MB, shared between multiple threads
• 0x40000000–0x405FFFFF: thread-local instructions and data (1MB per thread: 512KB for instructions, 512KB for data)
• 0x80000000–0xFFFFFFFF: memory-mapped IO
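To make the layout concrete, here is a minimal sketch of the address map as C constants; the macro and function names are mine (hypothetical), only the address ranges come from the slide.

    /* Hypothetical constants for the PRET memory map shown above;
       the names are illustrative, not from the paper. */
    #define BOOT_CODE_BASE     0x00000000u  /* 4KB boot code: 0x00000000-0x00000FFF    */
    #define SHARED_DATA_BASE   0x3F800000u  /* 8MB shared data, ends at 0x40000000     */
    #define THREAD_LOCAL_BASE  0x40000000u  /* 1MB per thread, threads 0-5             */
    #define THREAD_LOCAL_SIZE  0x00100000u  /* 512KB instructions + 512KB data         */
    #define MMIO_BASE          0x80000000u  /* memory-mapped IO: 0x80000000-0xFFFFFFFF */

    /* Base address of a thread's 1MB local region (tid in 0..5);
       the six regions end at 0x405FFFFF, matching the slide. */
    static inline unsigned long thread_local_base(unsigned tid) {
        return THREAD_LOCAL_BASE + (unsigned long)tid * THREAD_LOCAL_SIZE;
    }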
The Memory Wheel
"I am feeling lucky!"
• Access the main memory only through the memory wheel
• Each thread gets a 13-cycle slot in which to access the main memory
• TDMA access makes the memory look busy even when no other thread wants it
• In the worst case, 90 cycles are required to access memory – a bounded worst case (consistent with six 13-cycle slots: a thread that just misses its slot waits up to 77 cycles, then spends its own 13)

Instruction Pipelines
• Can we keep the pipeline always running?
• What about data hazards, control hazards, and structural hazards?
[Slide diagram, external material drawn from memory: a classic five-stage pipeline (F D E M W); instructions 0–7 enter in consecutive cycles 0–11, so successive instructions from the same program overlap in the pipeline]

Thread-Interleaved Pipelines
• What if we thread-interleave pipelines instead?
• Can we avoid all pipeline hazards?
[Slide diagram, derived from Precision Timed Machines, Isaac Liu: the same five-stage pipeline, but consecutive slots carry instructions from threads 0–4 in round-robin order, so a thread's next instruction enters only as its previous one retires]

Hazardless Pipeline – Not Quite
• Can we ensure no hazards in thread-interleaved pipelines?
• Always fill the pipeline with instructions from distinct threads:
• No explicit control dependencies between threads – no control hazards
• Long-latency instructions: prevent two from the same thread – no data hazards
• Very few concurrent threads: push in NOPs – no data hazards
• Access to multi-cycle shared resources (e.g., memory) – a structural hazard remains; TDMA access to the shared resources removes the timing dependencies
• Nonetheless, removing interdependence between pipeline units eases timing analysis
Derived from: Precision Timed Machines, Isaac Liu

Deadline Handling
[Slide diagram, derived from Precision Timed Machines, Isaac Liu, shown twice in the deck: a task against its deadline, with deadline-miss handler, preemption, and stall paths; the second pass marks the alternatives as future work]
• Deadline hit vs. deadline miss, relative to the deadline of a task
• 1A) Finish the task and detect at the end whether the deadline was missed
• 1B) Immediately handle a missed deadline
• 2A) Continue with the next task
• 2B) Stall before the next task

The Deadline Instruction
• A per-thread deadline register ti
• DEAD(x) blocks until ti reaches zero
• It then loads the value x into the register and executes the next instruction
• The deadline register is checked in the register-access stage and the instruction is replayed until the register becomes zero
• The paper does not handle missed deadlines

Producer

    int main() {
        DEAD(28);      /* register ti is loaded with the value 28 */
        volatile unsigned int *buf = (unsigned int *)(0x3F800200);
        unsigned int i = 0;
        for (i = 0; ; i++) {
            DEAD(26);  /* the program waits here until ti becomes zero, then
                          loads 26; when the loop returns here, it may wait again */
            *buf = i;
        }
        return 0;
    }
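The deck shows only the producer. As a companion, here is a hypothetical consumer sketch illustrating the later point that simple synchronization can be built from deadline instructions alone; it assumes the same DEAD primitive and shared buffer address, and the initial offset of 41 cycles is an assumed value, not from the paper.

    /* Hypothetical consumer (not from the paper): runs in another thread and
       polls the shared buffer with the same 26-cycle period as the producer,
       phase-shifted by an initial DEAD so each read falls between writes. */
    int main() {
        DEAD(41);      /* assumed initial offset to land between producer writes */
        volatile unsigned int *buf = (unsigned int *)(0x3F800200);
        for (;;) {
            DEAD(26);  /* same period as the producer keeps the phases aligned */
            unsigned int v = *buf;  /* most recent value written by the producer */
            (void)v;                /* consume v here */
        }
        return 0;
    }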
Example Game
[Slide diagram: three threads in a pipeline. The game logic thread fills command queues and swaps them with the graphics controller thread (1: new graphics available – sync request; 2: queue swap; 3: sync complete – queue swapped). The graphics controller thread renders pixel data into even/odd frame buffers and swaps them with the video driver thread (1: refresh screen – VSync request; 2: buffer swap; 3: VSync – frame buffer swapped). The video driver thread drives the VGA output.]

Real-time Constraints
• VGA VSync time
• VGA HSync time
• Sixteen pixels at a time

Experiences from the Two Samples
• It is possible to provide timing guarantees using the PRET architecture
• But timing calculations by hand are error-prone
• Automated tools will be provided in the future
• The underlying architecture lacks synchronization primitives
• Simple synchronization can be achieved using the deadline instructions

Comparison with the LEON3
• Average-case time degradation is studied
• PRET shows significant degradation due to the lack of parallel threads
• None of the special PRET features are used
• Degradation factor < 6; no pipeline-hazard advantage?

Conclusions
• The paper builds a remarkable architecture as a SystemC model
• It introduces a new instruction for one type of deadline
• PRET keeps the memory hierarchy and its timing differences exposed to the user
• The model runs actual C programs and a small game
• Somewhat unfair comparison between the LEON3 and PRET at the end
• It is possible to modify a RISC processor to have predictable timing

Some Observations
• With a project of this scale, it is difficult to fit all the details in a paper
• I had to refer to the thesis work of one of the authors to gain insights
• The memory wheel assumes all threads use memory equally
• I would suggest reducing the LEON3 comparison and including more fundamental insights instead
• Overall, the work is commendable
• It provides some thoughts not discussed in any previous paper
• A true systems-level work
• Can off-the-shelf architectures provide a strict WCET mode?

Thanks!