All About the Cell Processor
H. Peter Hofstee, Ph.D.
IBM Systems and Technology Group
SCEI/Sony Toshiba IBM Design Center
Austin, Texas
© 2005 IBM Corporation

Acknowledgements
Cell is the result of a deep partnership between SCEI/Sony, Toshiba, and IBM
Cell represents the work of more than 400 people starting in 2001
More detailed papers on the Cell implementation and the SPE micro-architecture can be found in the ISSCC 2005 proceedings

Agenda
Trends in Processors and Systems
Cell Processor Overview
Aspects of the Implementation
Power Efficient Architecture

Trends in Microprocessors and Systems

Processor Performance over Time
(Game processors take the lead on media performance)
[Chart: single-precision Flops (log scale, 10 to 1,000,000) versus year (1993 to 2004); series: Game Processors, PC Processors]

System Trends toward Integration
[Diagram labels: Processor, Northbridge, Southbridge, Memory, IO, Next-Gen, Accel]
Implied loss of system configuration flexibility
Must compensate with generality of acceleration function to maintain market size.

Next Generation Processors address Programming Complexity and Trend Towards Programmable Offload Engines with a Simpler System Alternative
[Diagram: three system alternatives
Hardwired Function: CPU, GPU, NIC, Security, Media, …
Programmable ASIC: CPU, Streaming Graphics Processor, Network Processor, Security Processor, Media Processor, …
Cell: 64b Power Processor, Synergistic Processor … Synergistic Processor, Mem. Contr., Config. IO]

Performance Limiters in Conventional Microprocessors
Memory Wall – Latency induced bandwidth limitations
Power Wall – Must improve efficiency and performance equally
Frequency Wall – Diminishing returns from deeper pipelines (can be negative if power is taken into account)

Cell Overview

Cell Goals
Outstanding performance, especially on game/multimedia applications.
Real time responsiveness to the user and the network.
Applicable to a wide range of platforms.
Support introduction in systems in 2005.

Cell Concept
Compatibility with 64b Power Architecture™
– Builds on and leverages IBM investment and community
Increased efficiency and performance
– Non-Homogeneous Coherent Chip Multiprocessor
– High design frequency, low operating voltage
– Streaming DMA architecture attacks “Memory Wall”
– Highly optimized implementation
Interface between user and networked world
– Image rich information, virtual reality
– Flexibility and security
Multi-OS support, including RTOS/non-RTOS
– Combine real-time and non-real time worlds

Cell Chip Block Diagram
[Block diagram: eight SPEs, each an SPU (SXU, LS) with an SMF; PPE (PXU, L1, L2); MIC to Dual XDR™; BIC to FlexIO™; all connected by a Coherent Bus (up to 96 Bytes/cycle)]

What is a Synergistic Processor? (and why is it efficient?)
Local Store “is” large 2nd level register file / private instruction store instead of cache
– Asynchronous transfer (DMA) to shared memory
– Frontal attack on the Memory Wall
Media Unit turned into a Processor
– Unified (large) Register File
– 128 entry x 128 bit
Media & Compute optimized
– One context
– SIMD architecture
[Die floorplan labels: FXU EVN, FXU ODD, GPR, LS, CHANNEL, DMA, SBI, FWD, SMM, BEB, SFP, CONTROL, DP, ATO, RTB, SM]

Coherent Offload Model
DMA into and out of Local Store equivalent to Power core loads & stores
Governed by Power Architecture page and segment tables for translation and protection
Shared memory model
– Power architecture compatible addressing
– MMIO capabilities for SPEs
– Local Store is mapped (alias) allowing LS to LS DMA transfers
– DMA equivalents of locking loads & stores
– OS management/virtualization of SPEs
  • Pre-emptive context switch is supported (but not efficient)

One approach to programming Cell
Single Source Compiler
– Auto parallelization (treat target Cell as an SMP)
– Auto SIMD-ization (SIMD-vectorization)
– Compiler management of Local Store as 2nd level register file / SW managed cache (I&D)
  • Most Cell unique piece
Optimization
– OpenMP pragmas
– Vector.org SIMD intrinsics
– Data/Code partitioning
– Streaming / pre-specifying code/data use
Prototype Single Source Compiler Developed in IBM Research

Cell Implementation Aspects

SPE BLOCK DIAGRAM
[Block diagram: Floating-Point Unit, Permute Unit, Fixed-Point Unit, Load-Store Unit, Branch Unit, Channel Unit; Result Forwarding and Staging; Register File; Local Store (256kB) Single Port SRAM; Instruction Issue Unit / Instruction Line Buffer; DMA Unit; On-Chip Coherent Bus. Data path widths shown: 8 Byte/Cycle, 16 Byte/Cycle, 64 Byte/Cycle, 128 Byte/Cycle; 128B Read, 128B Write]
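The Local Store and asynchronous DMA model from the "What is a Synergistic Processor?" and "Coherent Offload Model" slides is commonly used with double buffering: the fetch for the next chunk is issued before computing on the current one. The following is a minimal sequential sketch of that buffer rotation in Python. It is not the Cell SDK API: `dma_get`, `dma_put`, `stream_scale`, and the 16 kB chunk size are hypothetical names and values chosen for illustration, and the "DMA" here is a plain copy, so only the ordering is modelled, not the actual overlap.

```python
# Sequential model of double-buffered streaming through a small local store.
# All function names are hypothetical; this is not the Cell SDK API.

LOCAL_STORE_BYTES = 256 * 1024   # local store size from the SPE block diagram
CHUNK_BYTES = 16 * 1024          # assumed chunk size; two buffers must fit in LS
FLOAT_BYTES = 4                  # single-precision element size
CHUNK_ELEMS = CHUNK_BYTES // FLOAT_BYTES


def dma_get(main_memory, start, count):
    """Stand-in for an asynchronous DMA read: copy a slice into a local buffer."""
    return list(main_memory[start:start + count])


def dma_put(main_memory, start, local_buffer):
    """Stand-in for an asynchronous DMA write back to shared memory."""
    main_memory[start:start + len(local_buffer)] = local_buffer


def scale_in_place(local_buffer, factor):
    """Compute kernel: touches only data that is already in the local store."""
    for i, value in enumerate(local_buffer):
        local_buffer[i] = value * factor


def stream_scale(main_memory, factor):
    """Scale every element, fetching chunk k+1 before computing on chunk k.

    Returns the number of chunks processed.
    """
    assert 2 * CHUNK_BYTES <= LOCAL_STORE_BYTES
    starts = list(range(0, len(main_memory), CHUNK_ELEMS))
    if not starts:
        return 0
    buffers = [None, None]
    buffers[0] = dma_get(main_memory, starts[0], CHUNK_ELEMS)  # prime buffer 0
    for k, start in enumerate(starts):
        current = k % 2
        if k + 1 < len(starts):
            # Issue the next fetch first; on real hardware this transfer would
            # overlap with the compute below instead of completing before it.
            buffers[1 - current] = dma_get(main_memory, starts[k + 1], CHUNK_ELEMS)
        scale_in_place(buffers[current], factor)
        dma_put(main_memory, start, buffers[current])
    return len(starts)
```

Calling `stream_scale(data, 2.0)` on a Python list doubles every element in place while never holding more than two chunks in the modelled local store. The design point is that the kernel never waits on shared memory directly; it only sees buffers that a transfer has already filled.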
SPE PIPELINE FRONT END
IF1 IF2 IF3 IF4 IF5 IB1 IB2 ID1 ID2 ID3 IS1 IS2
SPE PIPELINE BACK END
Branch Instruction RF1 RF2
Permute Instruction EX1 EX2 EX3 EX4 WB
Load/Store Instruction EX1 EX2 EX3 EX4 EX5 EX6 WB
Fixed Point Instruction EX1 EX2 WB
Floating Point Instruction EX1 EX2 EX3 EX4 EX5 EX6 WB
Legend: IF Instruction Fetch; IB Instruction Buffer; ID Instruction Decode; IS Instruction Issue; RF Register File Access; EX Execution; WB Write Back

PPE BLOCK DIAGRAM
[Block diagram: Pre-Decode, L2 Interface, Fetch Control, Branch Scan, L1 Instruction Cache, SMT Dispatch (Queue), Microcode, Decode / Dependency / Issue, Branch Execution Unit, Load/Store Unit, Fixed-Point Unit, L1 Data Cache, Completion/Flush, VMX/FPU Issue (Queue), VMX Load/Store/Permute, VMX Arith./Logic Unit, FPU Load/Store, FPU Arith/Logic Unit, VMX Completion, FPU Completion; Thread A and Thread B paths]
Threads alternate fetch and dispatch cycles

PPE PIPELINE FRONT END
Instruction Cache and Buffer: IC1 IC2 IC3 IC4 IB1 IB2
Branch Prediction: BP1 BP2 BP3 BP4
Instruction Decode and Issue: ID1 ID2 ID3 IS1 IS2 IS3
Microcode: MC1 MC2 MC3 MC4 ... MC9 MC10 MC11
PPE PIPELINE BACK END
Branch Instruction: DLY DLY DLY RF1 RF2 EX1 EX2 EX3 EX4 IBZ IC0
Fixed Point Unit Instruction: DLY DLY DLY RF1 RF2 EX1 EX2 EX3 EX4 EX5 WB
Load/Store Instruction: RF1 RF2 EX1 EX2 EX3 EX4 EX5 EX6 EX7 EX8 WB
Legend: IC Instruction Cache; IB Instruction Buffer; BP Branch Prediction; MC Microcode; ID Instruction Decode; IS Instruction Issue; DLY Delay Stage; RF Register File Access; EX Execution; WB Write Back

Cell Processor
~250M transistors
~235 mm²
Top frequency >4GHz
9 cores, 10 threads
>256 GFlops (SP) @4GHz
>26 GFlops (DP) @4GHz
Up to 25.6GB/s memory B/W
Up to 75 GB/s I/O B/W
~400M$(US) design investment

Hardware Performance Measurement (85°C)
First pass hardware measurement in the Lab
Nominal Voltage = 1V
[Chart: Fmax, Frequency [GHz] (3 to 4.5) versus Supply Voltage (0.9 to 1.2)]

Cell Processor Configurable I/O
Direct Attach XDR
Two I/O interfaces
– Coherent or I/O Protection
– Configurable number of Bytes
[Diagram: example configurations of one, two, and four CELL Processors, each with XDR™ memory, connected through IOIF (IOIF0, IOIF1), BIF, and a switch (SW)]

Power Efficient Architecture

Power Efficient Architecture and the BE
Non-Homogeneous Coherent Multi-Processor
– Data-plane/Control-plane specialization
– More efficient than homogeneous SMP
3-level model of Memory
– Bandwidth without (inefficient) speculation
– High-bandwidth .. Low power

Power Efficient Architecture and the SPE
Power Efficient ISA allows Simple Control
– Single mode architecture
– No cache
– Branch hint
– Large unified register file
– Channel Interface
Efficient Microarchitecture
– Single port local store
– Extensive clock gating
Efficient implementation
– See Cool Chips paper by O. Takahashi et al. and T. Asano et al.

Cell BE Processor Example Application Areas
Cell excels at processing of rich media content in the context of broad connectivity
– Digital content creation (games and movies)
– Game playing and game serving
– Distribution of dynamic, media rich content
– Imaging and image processing
– Image analysis (e.g. video surveillance)
– Next-generation physics-based visualization
– Video conferencing
– Streaming applications (codecs etc.)
– Physical simulation & science
Cell is an excellent match for any applications that require:
– Parallel processing
– Real time processing
– Graphics content creation or rendering
– Pattern matching
– High-performance SIMD capabilities

Cell Summary
Cell ushers in a new era of leading edge processors optimized for digital media and entertainment
Responsiveness to the human user and the network are key drivers for Cell
New levels of performance and power efficiency beyond what is achieved by PC processors
Cell will enable entirely new classes of applications, even beyond those we contemplate today
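As a sanity check on the headline numbers from the "Cell Processor" slide, the single-precision peak can be rebuilt from figures elsewhere in the deck: eight SPEs, a 128-bit register width, and a 4 GHz clock. The sketch below assumes one 4-wide fused multiply-add per SPE per cycle, counted as two flops per lane; that issue rate is an assumption, not something the slides state. The slide's ">" presumably leaves room for the PPE, which is not modelled here. All function names are hypothetical.

```python
# Back-of-envelope check of the peak figures on the "Cell Processor" slide.
# Assumption (not stated on the slides): each SPE completes one 4-wide
# single-precision fused multiply-add per cycle, i.e. 2 flops per SIMD lane.

SPE_COUNT = 8             # eight SPEs in the chip block diagram
REGISTER_BITS = 128       # "128 entry x 128 bit" register file
REGISTER_ENTRIES = 128
SP_BITS = 32              # single-precision float width
FLOPS_PER_FMA = 2         # assumption: one multiply plus one add
CLOCK_GHZ = 4.0           # "@4GHz" on the slide
MEMORY_GB_PER_S = 25.6    # "Up to 25.6GB/s memory B/W"
LOCAL_STORE_BYTES = 256 * 1024


def sp_lanes():
    """Single-precision values that fit in one SIMD register."""
    return REGISTER_BITS // SP_BITS


def peak_sp_gflops():
    """Peak single-precision GFlops across all SPEs under the FMA assumption."""
    flops_per_cycle_per_spe = sp_lanes() * FLOPS_PER_FMA
    return SPE_COUNT * flops_per_cycle_per_spe * CLOCK_GHZ


def flops_per_memory_byte():
    """Flops that must be performed per byte of memory traffic to stay at peak."""
    return peak_sp_gflops() / MEMORY_GB_PER_S


def register_file_bytes():
    """Size of the unified register file in bytes."""
    return REGISTER_ENTRIES * REGISTER_BITS // 8


if __name__ == "__main__":
    print("SP lanes per register:", sp_lanes())
    print("Peak SP GFlops (SPEs only):", peak_sp_gflops())
    print("Flops per memory byte at peak:", round(flops_per_memory_byte(), 2))
    print("Register file bytes:", register_file_bytes())
    print("Local store / register file:", LOCAL_STORE_BYTES // register_file_bytes())
```

Under these assumptions the SPEs alone account for 256 GFlops, and sustaining that rate needs roughly ten flops of work per byte fetched from memory, which is one way to read the deck's emphasis on the Memory Wall and on keeping working data in the Local Store.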