All About the Cell Processor

Transcription

All About the Cell Processor
Systems and Technology Group
All About the Cell Processor
H. Peter Hofstee, Ph. D.
IBM Systems and Technology Group
SCEI/Sony Toshiba IBM Design Center
Austin, Texas
© 2005 IBM Corporation
Systems and Technology Group
Acknowledgements
ƒ Cell is the result of a deep partnership between
SCEI/Sony, Toshiba, and IBM
ƒ Cell represents the work of more than 400 people
starting in 2001
ƒ More detailed papers on the Cell implementation
and the SPE micro-architecture can be found in
the ISSCC 2005 proceedings
2
© 2005 IBM Corporation
Systems and Technology Group
Agenda
ƒ Trends in Processors and Systems
ƒ Cell Processor Overview
ƒ Aspects of the Implementation
ƒ Power Efficient Architecture
3
© 2005 IBM Corporation
Systems and Technology Group
Trends in Microprocessors and Systems
4
© 2005 IBM Corporation
Systems and Technology Group
Processor Performance over Time
(Game processors take the lead on media performance)
Flops (SP)
1000000
100000
10000
1000
Game Processors
PC Processors
100
10
1993 1995 1996 1998 2000 2002 2004
5
© 2005 IBM Corporation
Systems and Technology Group
System Trends toward Integration
Memory
Northbridge
Memory
Next-Gen
Accel
Processor
Processor
IO
Southbridge
IO
ƒ Implied loss of system configuration flexibility
ƒ Must compensate with generality of acceleration
function to maintain market size.
6
© 2005 IBM Corporation
Systems and Technology Group
Next Generation Processors address Programming
Complexity and Trend Towards Programmable Offload
Engines with a Simpler System Alternative
GPU
NIC
CPU
Security
…
Media
Hardwired
Function
7
CPU
Streaming
Graphics
Processor
64b Power
Processor
Network
Processor
Synergistic
Processor
Security
Processor
…
Media
Processor
Programmable
ASIC
Mem.
Contr.
..
.
Config.
Synergistic
IO
Processor
Cell
© 2005 IBM Corporation
Systems and Technology Group
Performance Limiters in Conventional
Microprocessors
ƒ Memory Wall
– Latency induced bandwidth limitations
ƒ Power Wall
– Must improve efficiency and performance equally
ƒ Frequency Wall
– Diminishing returns from deeper pipelines
(can be negative if power is taken into account)
8
© 2005 IBM Corporation
Systems and Technology Group
Cell Overview
9
© 2005 IBM Corporation
Systems and Technology Group
Cell Goals
10
ƒ
Outstanding performance, especially on
game/multimedia applications.
ƒ
Real time responsiveness to the user and the
network.
ƒ
Applicable to a wide range of platforms.
ƒ
Support introduction in systems in 2005.
© 2005 IBM Corporation
Systems and Technology Group
Cell Concept
ƒ Compatibility with 64b Power Architecture™
– Builds on and leverages IBM investment and community
ƒ Increased efficiency and performance
– Non Homogenous Coherent Chip Multiprocessor
– High design frequency, low operating voltage
– Streaming DMA architecture attacks “Memory Wall”
– Highly optimized implementation
ƒ Interface between user and networked world
– Image rich information, virtual reality
– Flexibility and security
ƒ Multi-OS support, including RTOS/non-RTOS
– Combine real-time and non-real time worlds
11
© 2005 IBM Corporation
Systems and Technology Group
Cell Chip Block Diagram
SPU SPE
SXU
SXU
SXU
SXU
SXU
SXU
SXU
SXU
LS
LS
LS
LS
LS
LS
LS
LS
SMF
SMF
SMF
SMF
SMF
SMF
SMF
SMF
Coherent Bus (up to 96 Bytes/cycle)
L2
PPE
MIC
Dual XDRTM
L1
12
BIC
FlexIOTM
PXU
© 2005 IBM Corporation
Systems and Technology Group
What is a Synergistic Processor?
(and why is it efficient?)
ƒ Local Store “is” large 2nd level register file / private instruction
store instead of cache
– Asynchronous transfer (DMA) to shared memory
– Frontal attack on the Memory Wall
– 128 entry x 128 bit
ƒ Media & Compute optimized
FXU EVN
LS
FXU ODD
GPR
LS
CHANNEL
DMA
SBI
13
SPU
FWD
– One context
– SIMD architecture
LS
SMM
BEB
– Unified (large) Register File
SFP
CONTROL
ƒ Media Unit turned into a Processor
LS
DP
ATO
RTB
SM
© 2005 IBM Corporation
Systems and Technology Group
Coherent Offload Model
ƒ DMA into and out of Local Store equivalent to Power core
loads & stores
ƒ Governed by Power Architecture page and segment tables for
translation and protection
ƒ Shared memory model
– Power architecture compatible addressing
– MMIO capabilities for SPEs
– Local Store is mapped (alias) allowing LS to LS DMA transfers
– DMA equivalents of locking loads & stores
– OS management/virtualization of SPEs
• Pre-emptive context switch is supported (but not efficient)
14
© 2005 IBM Corporation
Systems and Technology Group
One approach to programming Cell
ƒ Single Source Compiler
– Auto parallelization ( treat target Cell as an SMP )
– Auto SIMD-ization ( SIMD-vectorization )
– Compiler management of Local Store as 2nd level register file / SW
managed cache (I&D)
• Most Cell unique piece
ƒ Optimization
– OpenMP pragmas
– Vector.org SIMD intrinsics
– Data/Code partitioning
– Streaming / pre-specifying code/data use
ƒ Prototype Single Source Compiler Developed in IBM Research
15
© 2005 IBM Corporation
Systems and Technology Group
Cell Implementation Aspects
16
© 2005 IBM Corporation
Systems and Technology Group
SPE BLOCK DIAGRAM
Floating-Point Unit
Permute Unit
Fixed-Point Unit
Load-Store Unit
Branch Unit
Channel Unit
Local Store
(256kB)
Single Port SRAM
Result Forwarding and Staging
Register File
Instruction Issue Unit / Instruction Line Buffer
On-Chip Coherent Bus
8 Byte/Cycle
17
16 Byte/Cycle
128B Read
128B Write
DMA Unit
64 Byte/Cycle
128 Byte/Cycle
© 2005 IBM Corporation
Systems and Technology Group
SPE PIPELINE FRONT END
IF1
IF2
IF3
IF4
IF5
IB1
IB2
ID1
ID2
ID3
IS1
IS2
SPE PIPELINE BACK END
Branch Instruction
RF1
RF2
Permute Instruction
EX1
EX2 EX3
EX4
WB
Load/Store Instruction
EX1
EX2
EX3
EX4
EX5
EX6
WB
IF
IB
ID
IS
RF
EX
WB
Instruction Fetch
Instruction Buffer
Instruction Decode
Instruction Issue
Register File Access
Execution
Write Back
Fixed Point Instruction
EX1
EX2
WB
Floating Point Instruction
EX1
18
EX2 EX3
EX4 EX5
EX6
WB
© 2005 IBM Corporation
Systems and Technology Group
PPE BLOCK DIAGRAM
Pre-Decode
8
L2
Interface
Fetch Control
Branch Scan
L1 Instruction Cache
4
Thread A Thread B
4
SMT Dispatch (Queue)
2
L1 Data Cache
1
1
1
Branch
Load/Store Fixed-Point
Execution
Unit
Unit
Unit
Completion/Flush
Decode
Dependency
Issue
2
Microcode
Thread A
Thread B
Thread A
VMX/FPU Issue (Queue)
2
1
1
1
VMX
VMX
FPU
Load/Store/
Arith./Logic Unit Arith/Logic Unit
Permute
VMX Completion
19
1
Threads alternate
fetch and dispatch
cycles
1
FPU
Load/Store
FPU Completion
© 2005 IBM Corporation
Systems and Technology Group
PPE PIPELINE FRONT END
Microcode
IC1
IC2
IC3
IC4
IB1
IB2
Instruction Cache and Buffer
BP1
BP2
MC1 MC2 MC3
MC4
ID1
IS1
ID2
ID3
...
IS2
MC9 MC10 MC11
IS3
Instruction Decode and Issue
BP3 BP4
Branch Prediction
PPE PIPELINE BACK END
Branch Instruction
DLY
DLY
DLY
RF1
EX1
EX2
EX3
EX4
IBZ
IC0
RF2
EX1
EX2
EX3
EX4
EX5
WB
EX3
EX4
EX5
EX6
EX7
EX8 WB
RF2
Fixed Point Unit Instruction
DLY
DLY
DLY
RF1
Load/Store Instruction
RF1
20
RF2
EX1
EX2
IC Instruction Cache
IB Instruction Buffer
BP Branch Prediction
MC Microcode
ID Instruction Decode
IS
Instruction Issue
DLY Delay Stage
RF Register File Access
EX Execution
WB Write Back
© 2005 IBM Corporation
Systems and Technology Group
Cell Processor
ƒ ~250M transistors
ƒ ~235mm2
ƒ Top frequency >4GHz
ƒ 9 cores, 10 threads
ƒ > 256 GFlops (SP) @4GHz
ƒ > 26 GFlops (DP) @4GHz
ƒ Up to 25.6GB/s memory B/W
ƒ Up to 75 GB/s I/O B/W
ƒ ~400M$(US) design investment
21
© 2005 IBM Corporation
Systems and Technology Group
First pass hardware measurement in the Lab - Nominal Voltage = 1V
Hardware Performance Measurement (85°C)
4.5
Frequency [GHz]
Fmax
4
3.5
3
0.9
1
1.1
1.2
Supply Voltage
22
© 2005 IBM Corporation
Systems and Technology Group
Cell Processor Configurable I/O
XDRtm
ƒ Direct Attach XDR
XDRtm
XDRtm
XDRtm
ƒ Two I/O interfaces
IOIF
XDRtm
XDRtm
IOIF
XDRtm XDRtm
XDRtm XDRtm
CELL
Processor
CELL
Processor
BIF
CELL
Processor
XDRtm XDRtm
XDRtm XDRtm
IOIF
23
IOIF1
SW
CELL
Processor
IOIF0
CELL
Processor
BIF
CELL
Processor
IOIF
IOIF
IOIF
– Coherent or I/O Protection
CELL
Processor
BIF
– Configurable number of Bytes
© 2005 IBM Corporation
Systems and Technology Group
Power Efficient Architecture
24
© 2005 IBM Corporation
Systems and Technology Group
Power Efficient Architecture and the BE
ƒ Non-Homogeneous Coherent Multi-Processor
– Data-plane/Control-plane specialization
– More efficient than homogeneous SMP
ƒ 3-level model of Memory
– Bandwidth without (inefficient) speculation
– High-bandwidth .. Low power
25
© 2005 IBM Corporation
Systems and Technology Group
Power Efficient Architecture and the SPE
ƒ Power Efficient ISA allows Simple Control
– Single mode architecture
– No cache
– Branch hint
– Large unified register file
– Channel Interface
ƒ Efficient Microarchitecture
– Single port local store
– Extensive clock gating
ƒ Efficient implementation
– See Cool Chips paper by O. Takahashi et al. and T. Asano et al.
26
© 2005 IBM Corporation
Systems and Technology Group
Cell BE Processor Example Application Areas
ƒ Cell excels at processing of rich media content in the context of broad
connectivity
–
–
–
–
–
–
–
–
–
Digital content creation (games and movies)
Game playing and game serving
Distribution of dynamic, media rich content
Imaging and image processing
Image analysis (e.g. video surveillance)
Next-generation physics-based visualization
Video conferencing
Streaming applications (codecs etc.)
Physical simulation & science
ƒ Cell is an excellent match for any applications that require:
–
–
–
–
–
27
Parallel processing
Real time processing
Graphics content creation or rendering
Pattern matching
High-performance SIMD capabilities
© 2005 IBM Corporation
Systems and Technology Group
Cell Summary
ƒ Cell ushers in a new era of leading edge processors
optimized for digital media and entertainment
ƒ Responsiveness to the human user and the network
are key drivers for Cell
ƒ New levels of performance and power efficiency
beyond what is achieved by PC processors
ƒ Cell will enable entirely new classes of applications,
even beyond those we contemplate today
28
© 2005 IBM Corporation