HC19.20.320.Teraflops Prototype Processor with 80 Cores.v2

Transcription

Teraflops Prototype Processor
with 80 Cores
Yatin Hoskote, Sriram Vangal,
Saurabh Dighe, Nitin Borkar,
Shekhar Borkar
Microprocessor Technology Labs,
Intel Corp.
Agenda
Goals
Architecture overview
Processing engine
Interconnect fabric
Power management
Application mapping
Experimental results
2
Goals
Build Intel’s first Network-on-Chip (NoC)
research prototype
Show teraflops performance under 100W
Develop NoC design methodologies
– Tiled design
– Mesochronous (phase tolerant) clocking
– Fine grain power management
Prototype key NoC technologies
– On-die 2D mesh fabric
– 3D stacked memory
3
Processor Architecture
[Tile block diagram: a 5-port router with crossbar, 39-bit links through mesochronous interfaces (MSINT) on the four mesh ports, and a 40GB/s I/O port, connected to the processing engine (PE). The PE contains a 2KB data memory (DMEM), a 3KB instruction memory (IMEM), a 6-read, 4-write 32-entry register file, two single-precision FPMAC units (multiply, accumulate, normalize), a 96-bit instruction/control path, PLL and JTAG.]
Each tile: 5 GHz, 20 GFLOPS @ 1.2V
160 SP FP engines in 8x10 2D mesh
4
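
As a quick sanity check on the per-tile and chip-level peak numbers above, a short Python calculation (assuming, per the FPMAC slides that follow, 2 FLOPs per FPMAC per cycle):

# Peak-throughput arithmetic for the 80-tile design, using numbers from the slides.
fpmacs_per_tile = 2        # FPMAC0 and FPMAC1
flops_per_fpmac = 2        # sustained multiply + accumulate each cycle
freq_ghz = 5.0             # design target at 1.2V
tile_gflops = fpmacs_per_tile * flops_per_fpmac * freq_ghz
chip_gflops = 80 * tile_gflops
print(tile_gflops, chip_gflops)   # 20.0 GFLOPS per tile, 1600.0 GFLOPS chip peak
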
Instruction Set
[Instruction word slots: FPU (2), LOAD/STORE, SND/RCV, PGM FLOW, SLEEP]
96-bit instruction word, up to 8 operations/cycle

Instruction Type       Latency (cycles)
FPU                    9
LOAD/STORE             2
SEND/RECEIVE           2
JUMP/BRANCH            1
STALL/WAIT FOR DATA    N/A
NAP/WAKE               1
5
SP FP Processing Engine
Fast, single-cycle multiply-accumulate algorithm
Key optimizations
– Multiplier output is in carry-save format and uses 4-2 carry-save adders, removing expensive carry-propagate adders from the critical path
– Accumulation is performed in base 32, converting expensive variable shifters in the accumulate loop to constant shifters
– Costly normalization step is moved outside the accumulate loop
[FPMAC datapath: 32-bit operands feed a multiplier, an accumulate stage and a final normalize stage.]
Accumulation in 15 fanout-of-4 stages
Sustained 2 FLOPS per cycle
Ref: S. Vangal, et al., IEEE JSSC, Oct. 2006
6
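
To make the "normalize once, outside the loop" idea concrete, here is a minimal behavioral sketch in Python. It is not the base-32 carry-save datapath itself; it only shows that the running sum can be kept in a wide, unnormalized fixed-point form during accumulation and rounded/normalized a single time at the end:

# Behavioral sketch only: accumulate products in a wide fixed-point form and
# defer the normalization/rounding step until after the loop.
import numpy as np

def mac_deferred_normalize(a, b, frac_bits=60):
    acc = 0                                  # wide integer accumulator
    scale = 1 << frac_bits
    for x, y in zip(a, b):
        acc += int(round(float(x) * float(y) * scale))   # no per-step normalize
    return acc / scale                       # single normalization at the end

a = np.random.rand(1000).astype(np.float32)
b = np.random.rand(1000).astype(np.float32)
print(mac_deferred_normalize(a, b))          # close to the float64 dot product
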
Agenda
Goals
Architecture overview
Processing engine
Interconnect fabric
Power management
Application mapping
Experimental results
7
2D Mesh Interconnect
4 byte wide data links, 6 bits overhead
Source directed, wormhole routing
Two virtual lanes
5 port, fully non-blocking router
5GHz operation @ 1.2V
320GB/s bisection B/W
5 cycle fall-through latency
On/off flow control
[Diagram: one tile = compute element + router, tiles connected in the 2D mesh.]
High bandwidth, low latency fabric
8
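
The 320GB/s bisection figure follows from the link width and clock rate; a quick check, assuming the bisection cut crosses the 8 rows of the 8x10 mesh and counts both directions of each cut link:

# Bisection-bandwidth check for the 8x10 mesh, using numbers from this slide.
link_bytes = 4                       # 4-byte-wide data links
freq_ghz = 5.0                       # 5 GHz fabric clock
link_gbps = link_bytes * freq_ghz    # 20 GB/s per link, per direction
rows_cut = 8                         # bisecting the 10-column dimension cuts 8 row links
bisection = rows_cut * 2 * link_gbps # both directions counted
print(bisection)                     # 320.0 GB/s
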
Router Architecture
Five router pipe stages
Input buffered
Shared crossbar switch
Distributed dual phase arbitration
[Block diagram: five 39-bit input ports (From PE, N, E, S, W), the mesh ports behind MSINTs, each with two-lane (Lane 0/Lane 1) input queues, feeding a shared crossbar switch to the five output ports (To PE, N, E, S, W).]
Router pipe stages: Buffer Write (Stg 1), Buffer Read, Route Compute, Port/Lane Arbitration, Switch Traversal, Link Traversal
Ref: S. Vangal, et al., VLSI 2007
9
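
A first-order latency sketch for this fabric under the usual wormhole model: the head flit pays the 5-cycle fall-through latency at every hop and the remaining flits stream behind it. Illustrative only; contention and the 1-2 cycle mesochronous synchronizer penalty noted on the next slide are ignored.

# First-order wormhole latency model (illustrative; no contention modeled).
def packet_latency_cycles(hops, flits, fall_through=5):
    """Head flit pays fall_through per hop; body flits pipeline behind it."""
    return hops * fall_through + (flits - 1)

# Example: a 4-flit packet crossing 9 hops of the 8x10 mesh.
print(packet_latency_cycles(hops=9, flits=4))   # 48 cycles
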
Fabric Key Features
Double pumped crossbar switch
– Dual-edge triggered FFs on alternate data bits
– Crossbar channel width reduced by 50% leading
to 0.34mm2 router area
– Ref: S. Vangal, et al. VLSI, 2005
Mesochronous clocking
– Phase-tolerant FIFO based synchronizers at
interfaces between tiles
– Enables low power, scalable global clock
distribution
– 2W global clock distribution power @ 1.2V 5GHz
– Overhead of synchronizers: 6% of router power
and 1-2 clock cycles in latency
10
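
A toy Python model of the double-pumping idea described above: alternate bits of a flit share one crossbar wire, one bit per clock edge, so the physical channel is half as wide. The actual dual-edge-triggered flip-flop circuit is in the VLSI 2005 reference; the even-width 38-bit payload below is just for illustration, and the function names are made up.

# Toy model of a double-pumped channel: alternate bits share one physical wire.
def double_pump_tx(flit_bits):
    """Split bits into (rising-edge, falling-edge) halves sent on shared wires."""
    rising = flit_bits[0::2]    # even-numbered bits, sent on the rising edge
    falling = flit_bits[1::2]   # odd-numbered bits, sent on the falling edge
    return rising, falling

def double_pump_rx(rising, falling):
    """Re-interleave the two half-width transfers back into the full flit."""
    flit = []
    for r, f in zip(rising, falling):
        flit += [r, f]
    return flit

bits = [(i * 7) % 2 for i in range(38)]              # arbitrary 38-bit payload
assert double_pump_rx(*double_pump_tx(bits)) == bits # round-trips losslessly
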
Agenda
Goals
Architecture overview
Processing engine
Interconnect fabric
Power management
Application mapping
Experimental results
11
Power Management Hooks
Extensive use of sleep transistors
Activation of sleep transistors is done dynamically or statically
Dynamic through special instructions or packets
based on workload
– NAP/WAKE: PE can put its FPMAC engines to sleep or wake
them up
– PESLEEP/PEWAKE: PE can put another PE to sleep or wake
it up for processing tasks by sending sleep or wake packets
Static through scan
– Entire PE or individual router ports
12
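
A conceptual sketch (not RTL, and the class and function names are made up) of the two dynamic paths described above: NAP/WAKE act on the local FPMAC engines, while PESLEEP/PEWAKE reach another PE as packets:

# Conceptual model of the dynamic sleep controls described on this slide.
class PE:
    def __init__(self, tile_id):
        self.tile_id = tile_id
        self.fpmac_asleep = [False, False]   # FPMAC0, FPMAC1
        self.pe_asleep = False

    def nap(self, fpmac):                    # NAP instruction: sleep one FPMAC
        self.fpmac_asleep[fpmac] = True

    def wake(self, fpmac):                   # WAKE instruction: wake one FPMAC
        self.fpmac_asleep[fpmac] = False

def send_sleep_packet(mesh, dst, sleep):
    """PESLEEP/PEWAKE: one PE puts another to sleep, or wakes it, via a packet."""
    mesh[dst].pe_asleep = sleep

mesh = {i: PE(i) for i in range(80)}
mesh[0].nap(1)                               # tile 0 naps its FPMAC1
send_sleep_packet(mesh, dst=42, sleep=True)  # tile 0 puts tile 42 to sleep
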
Tile Sleep Regions
[Tile diagram annotated with % of device width on sleep: router/crossbar 10%, 2KB data memory (DMEM) 57%, RIB 14%, 6R/4W 32-entry register file 72%, 3KB instruction memory (IMEM) 56%, FPMAC0 and FPMAC1 90% each.]
21 sleep regions with independent control
– 74% of transistor device width
Dynamic sleep
– Individual FPMACs, cores or tiles
Static sleep control
– Scan chain
Clock gating
– Works with sleep
Fine grain power management
13
Leakage Savings
Est. breakdown @ 1.2V, 110°C
[Tile diagram annotated with per-block leakage, without sleep → with sleep: router/crossbar 82mW → 70mW, 2KB data memory (DMEM) 42mW → 8mW, 6R/4W 32-entry register file 15mW → 7.5mW, 3KB instruction memory (IMEM) 63mW → 21mW, FPMAC0 and FPMAC1 100mW → 20mW each.]
– Total measured idle power 13W @ 1.2V
Regulated sleep for memory arrays
– State retention
[Memory clamping circuit: the array's virtual Vssv is regulated against VREF = Vcc − VMIN, holding the array at its standby VMIN while asleep.]
Dynamic sleep
2X-5X leakage power reduction
14
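
The 2X-5X claim can be read directly off the per-block estimates above; a quick check:

# Leakage reduction ratios from the estimated per-block breakdown (1.2V, 110°C).
blocks_mw = {                      # (without sleep, with sleep), in mW
    "DMEM": (42, 8),
    "Register file": (15, 7.5),
    "IMEM": (63, 21),
    "FPMAC (each)": (100, 20),
}
for name, (awake, asleep) in blocks_mw.items():
    print(f"{name}: {awake / asleep:.1f}X")   # ranges from 2.0X to about 5.3X
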
Pipelined Wakeup from Sleep
Blocks enter sleep at same time
[Circuit: the FPMAC's multiplier, accumulator and normalization logic each sit behind their own sleep devices (nap1-nap6, W/2W sleep transistors), with clk and fastwake controls.]
[Plots of FPMAC current (A) vs time (ns): single-cycle wakeup, where all blocks wake together, versus the pipelined wakeup sequence, which wakes the multiplier, accumulator and normalization blocks in turn.]
4X reduction in peak wakeup current
15
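
An illustrative model, with made-up current numbers, of why the pipelined wakeup helps: staggering each block's inrush pulse in time keeps the pulses from adding at a single instant.

# Illustrative model of simultaneous vs pipelined wakeup (currents are made up).
import numpy as np

t = np.arange(0, 8.0, 0.1)                   # time in ns

def inrush(t, start, peak, width=0.5):
    """Simple triangular inrush current pulse for one block."""
    return np.clip(peak * (1 - np.abs(t - start - width) / width), 0, None)

blocks = [("multiplier", 1.5), ("accumulator", 1.5), ("normalization", 1.5)]

simultaneous = sum(inrush(t, 1.0, peak) for _, peak in blocks)
pipelined = sum(inrush(t, 1.0 + i, peak) for i, (_, peak) in enumerate(blocks))

print(simultaneous.max(), pipelined.max())   # roughly 4.5 vs 1.5 A here (3X)
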
Router Power Management
Activity based power management
Individual port enables
– Queues on sleep and clock gated when port idle
[Circuit: a per-port enable (Port_en) gates the global clock buffer (Gclk → gated_clk) and drives the sleep transistor on the lane 0/1 queue arrays (register files) of router port 0 and ports 1-4, covering stages 2-5 and the MSINT.]
[Plot: router power vs number of active ports (0-5) at 1.2V, 5.1GHz, 80°C: 924mW with all ports active down to 126mW when idle, a 7.3X reduction.]
7X power reduction for idle routers
16
Estimated Power Breakdown
4GHz, 1.2V, 110°C
Tile Power Profile:
– Dual FPMACs 36%
– Router + Links 28%
– IMEM + DMEM 21%
– Clock dist. 11%
– 10-port RF 4%
Communication Power Profile:
– Clocking 33%
– Queues + Datapath 22%
– Links 17%
– Crossbar 15%
– Arbiters + Control 7%
– MSINT 6%
17
Agenda
Goals
Architecture overview
Processing engine
Interconnect fabric
Power management
Application mapping
Experimental results
18
Application Kernels
Four kernels mapped onto the architecture
– Stencil 2D heat diffusion equation (see the sketch after this slide)
– SGEMM for 100x100 matrices
– Spreadsheet doing weighted sums
– 64 point 2D FFT (using 64 tiles)
Kernels were hand coded in assembly code
and manually optimized
Sized to fit in the on-chip local data memory
– Instruction memory not a limiter
19
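
For reference, this is the kind of computation the stencil kernel performs: a plain NumPy version of one 2D heat-diffusion step, not the hand-coded assembly or its mapping onto the 80 tiles.

# Plain single-node version of a 2D heat-diffusion (5-point stencil) time step.
import numpy as np

def heat_step(u, alpha=0.1):
    """Each interior point relaxes toward the average of its four neighbors."""
    nxt = u.copy()
    nxt[1:-1, 1:-1] = u[1:-1, 1:-1] + alpha * (
        u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:] - 4 * u[1:-1, 1:-1]
    )
    return nxt

grid = np.zeros((64, 64), dtype=np.float32)
grid[32, 32] = 100.0                 # hot spot
for _ in range(10):
    grid = heat_step(grid)
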
Communication Patterns
[Diagrams of the on-mesh communication patterns for the Stencil, Spreadsheet, SGEMM and 2D FFT kernels.]
Communication overlapped with computation
20
Agenda
Goals
Architecture overview
Processing engine
Interconnect fabric
Power management
Application mapping
Experimental results
21
Application Performance

Application Kernel             FLOP count   TFLOPS @ 4.27GHz   % Peak TFLOPS   Tiles used
Stencil                        358K         1.00               73.3%           80
SGEMM: Matrix Multiplication   2.63M        0.51               37.5%           80
Spreadsheet                    62.4K        0.45               33.2%           80
2D FFT                         196K         0.02               2.73%           64

1.07V, 4.27GHz operation, 80°C
22
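
The "% Peak TFLOPS" column is achieved throughput over peak at the test frequency; for example, for the stencil row (80 tiles, 4 FLOPs/cycle per tile, 4.27GHz):

# Peak at the test point and the stencil kernel's fraction of it.
peak_gflops = 80 * 4 * 4.27   # tiles x FLOPs/cycle x GHz = 1366.4 GFLOPS
print(1000 / peak_gflops)     # ~0.73, in line with the 73.3% in the table
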
Frequency vs Power
[Plot: chip frequency (GHz) and power (W, split into active and leakage) vs supply voltage (0.75V to 1.39V), 25°C, N=80. Labeled points include 1.2GHz / 17W, 3.8GHz / 78W and 5.34GHz / 162W.]
Peak 2TFLOPS @ 300W
Stencil app peak efficiency: 6 GFLOPS/W to 22 GFLOPS/W
23
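
Efficiency here is simply delivered GFLOPS over chip power; for example, using the 1TFLOPS at 97W stencil point quoted on the summary slide:

# GFLOPS-per-watt arithmetic for the stencil workload.
print(1000 / 97)   # ~10.3 GFLOPS/W at the 1 TFLOPS / 97W operating point
# The 6-22 GFLOPS/W range on this slide comes from sweeping supply voltage.
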
Measured Power Savings
[Plot: chip power (W, 0-300, left axis) and power savings (%, 0-50%, right axis) vs Vcc (0.7V to 1.4V).]
Stencil app: Savings in power with router power management on
24
Leakage Savings
[Plot: leakage as % of total power (0-16%) vs Vcc (0.67V to 1.35V), sleep disabled vs sleep enabled, 80°C, N=80; about a 2X gap between the curves.]
Measured maximum leakage power reduction
25
Design Tradeoffs
Core count
– Determined by die size constraints (300 sqmm) and
performance per watt requirements (10 GFLOPS/W)
Router link width
– 4 byte data path enables single flit transfer of single
precision word
– Crossbar area and power scale quadratically with link width
Mesochronous implementation
– Single clock source with scalable, low power clock
distribution
Backend design and validation benefits
– Tiled methodology enabled rapid design by a small team (less than 400 person-months)
– Functional first silicon in 2 hours
26
Die Photo and Chip Details
[Die photo: 21.72mm x 12.64mm die with I/O areas along both edges, a global clock spine plus clock buffers, PLL and TAP; inset of a single 2.0mm x 1.5mm tile showing DMEM, IMEM, RF, RIB, CLK, MSINTs, FPMAC0, FPMAC1 and the router.]

Technology       65nm CMOS process
Interconnect     1 poly, 8 metal (Cu)
Transistors      100 Million
Die area         275mm2
Tile area        3mm2
Package          1248 pin LGA, 14 layers, 343 signal pins
27
Summary
An 80-Core NoC architecture in 65nm CMOS
– 160 high-performance FPMAC engines
– Fast, single-cycle accumulator
– Low latency, compact router design
– High bandwidth 2D mesh interconnect
TFLOPS level performance at high power efficiency
– Fine grain power management
– Chip dissipates 97W at 1TFLOPS (Stencil app)
– Measured energy efficiency of 6 to 22 GFLOPS/W
Demonstrated peak performance up to 2TFLOPS
Building blocks for future peta-scale computing
28
Acknowledgements
Implementation
– Circuit Research Lab Advanced Prototyping team (Hillsboro,
OR and Bangalore, India)
Application kernels
– Software Solutions Group (Santa Clara, CA)
– Application Research Labs (DuPont, WA)
PLL design
– Logic Technology Development (Hillsboro, OR)
Package design
– Assembly Technology Development (Chandler, AZ)
29