Swizzle Switch: A Self-Arbitrating High-Radix
Crossbar for NoC Systems
Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy,
Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna Das,
Thomas Wenisch, Dennis Sylvester, David Blaauw, and Trevor Mudge
University of Michigan
Outline
- Swizzle Switch—Circuit & Microarchitecture
  - Overview
  - Arbitration
  - Prototype
- Swizzle Switch—Cache Coherent Manycore Interconnect
  - Motivation & Existing Interconnects
  - Swizzle Switch Interconnect
  - Evaluation
Swizzle Switch
[Figure: conventional matrix crossbar vs. Swizzle Switch]
- Embeds arbitration within the crossbar for single-cycle arbitration
- Re-uses the input/output data buses for arbitration
- SRAM-like layout with priority bits at the cross-points
- Low-power optimizations
- Excellent scalability
Data Routing
[Figure: source IPs drive 128-bit data onto pre-charged bitlines through the crosspoints; read buffers at the destination IPs sense the result]
- Bitlines are discharged where Data = "1" and Crosspoint = "1"
- Supports multicast & broadcast (see the sketch below)
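To make the routing rule concrete, here is a minimal behavioral sketch in Python of the wired-OR bitline routing described above; it is an illustration only, and the function name `route_cycle` and the example values are assumptions, not taken from the design.

```python
# A behavioral sketch (illustration only) of Swizzle Switch data routing.
def route_cycle(inputs, crosspoint, width=128):
    """Model one routing cycle of an N-input, M-output switch.

    inputs:     list of N integers, each a `width`-bit word driven by a source IP.
    crosspoint: N x M matrix of 0/1; crosspoint[i][j] = 1 connects input i to output j.
    Returns the M words seen at the read buffers.

    Bitlines are modeled as pre-charged (0) and discharged wherever
    Data = 1 AND Crosspoint = 1, i.e. a wired-OR per output column.
    Setting crosspoint[i][j] = 1 for several outputs j of the same input i
    models multicast/broadcast.
    """
    mask = (1 << width) - 1
    n_outputs = len(crosspoint[0])
    outputs = [0] * n_outputs
    for j in range(n_outputs):                 # each output column is independent
        for i, word in enumerate(inputs):
            if crosspoint[i][j]:
                outputs[j] |= word & mask      # "discharge" = drive a 1 onto the bitline
    return outputs

# Example: input 0 multicasts to outputs 0 and 2; input 1 drives output 1.
xp = [[1, 0, 1],
      [0, 1, 0],
      [0, 0, 0]]
print([hex(w) for w in route_cycle([0xABCD, 0x1234, 0xFFFF], xp)])
# -> ['0xabcd', '0x1234', '0xabcd']
```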
Swizzle Switch Architecture
[Figure: each source IP has Req/Rel lines and a priority vector stored at its crosspoints; each output column has a sense amp + latch and read buffers; the output bus is re-used as the priority bus]
Data routing, arbitration, and priority-update control are embedded within the crosspoints.
Outline
- Swizzle Switch—Circuit & Microarchitecture
  - Overview
  - Arbitration
  - Prototype
- Swizzle Switch—Cache Coherent Manycore Interconnect
  - Motivation & Existing Interconnects
  - Swizzle Switch Interconnect
  - Evaluation
Inhibit Based Arbitration
[Figure: one output column of the Swizzle Switch, with per-input priority vectors driving the priority lines and a sense amp + latch at each crosspoint; the output bus is re-used as the priority bus]
- This diagram shows a single column (one output) of the Swizzle Switch; each output arbitrates and transfers data independently.
- Each crosspoint has a sense amp/latch to indicate connectivity.
- Each input samples a unique bit of the output bus to determine whether it has been granted the channel.
- Priority vectors are stored at the crosspoints; when a request is issued, they discharge bits along the output column to INHIBIT lower-priority requests (sketched in code below).
- Finally, the priority vectors are updated when the data transfer completes.
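A minimal behavioral sketch of this inhibit-based arbitration for a single output column, assuming distinct integer priorities per input; `arbitrate` and its arguments are illustrative names, not the authors' implementation.

```python
# A behavioral sketch (illustration only) of inhibit-based arbitration
# for ONE output column.
def arbitrate(requests, priority):
    """requests: list of booleans, one per input.
    priority: list of distinct integers; larger = higher priority
              (a higher-priority input discharges more priority lines).
    Returns the index of the granted input, or None if nobody requests.
    """
    n = len(requests)
    line_charged = [True] * n                 # pre-charged priority lines
    for i in range(n):
        if requests[i]:
            for j in range(n):
                if j != i and priority[j] < priority[i]:
                    line_charged[j] = False   # inhibit lower-priority inputs
    for i in range(n):
        if requests[i] and line_charged[i]:   # sensed by input i's sense amp
            return i                          # crosspoint latch records the grant
    return None

# Inputs 0 and 2 request; input 2 holds the higher priority, so it wins.
print(arbitrate([True, False, True], priority=[1, 2, 3]))  # -> 2
```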
Least Recently Granted (LRG)
[Figure: one output column with the priority vectors stored at the crosspoints]
- HIGHEST priority discharges ALL priority lines
- INTERMEDIATE priority discharges SOME priority lines
- LOWEST priority discharges NO priority lines
Least Recently Granted (LRG)
[Figure: arbitration example on one output column]
Example arbitration:
1. Req l and Req m request the bus (red lines).
2. Req m discharges priority line l; priority lines m and n remain charged (green lines).
3. Req l senses priority line l and is inhibited (not granted); Req m senses priority line m and is not inhibited.
4. The crosspoint records the connectivity at input m.
Least Recently Granted (LRG)
[Figure: priority update circuitry; the Rel lines SET and RESET priority bits at the crosspoints]
Example priority update: input m signals that it is done with the data transfer by asserting Rel m.
Least Recently Granted (LRG)
[Figure: after the update, input m holds the all-zero priority vector (LOWEST priority, discharges no priority lines), input l is upgraded to priority x+1, and input n keeps priority z]
Least Recently Granted (LRG)
[Figure: LRG algorithm on the increasing-priority axis (0 = least, 62 = highest). When "m" releases the channel, m drops to least priority, inputs below m's old priority are upgraded by one (e.g., l: x → x+1), and inputs above it are unchanged (e.g., n keeps priority z).]
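The LRG update described above can be summarized as a short sketch; the function name `lrg_update` is an assumption, and the code models only the priority bookkeeping, not the circuit.

```python
# A sketch (illustration only) of the LRG priority update.
def lrg_update(priority, winner):
    """priority: list of distinct integers per input (0 = least priority).
    winner: index of the input that just released the channel.

    The winner drops to least priority, inputs that were below the winner
    are upgraded by one, and inputs that were above it are unchanged, so
    the least recently granted input ends up with the highest priority.
    """
    old = priority[winner]
    for i in range(len(priority)):
        if i == winner:
            priority[i] = 0                 # most recently granted -> lowest
        elif priority[i] < old:
            priority[i] += 1                # upgraded by one
        # inputs above the winner keep their priority
    return priority

p = [2, 0, 3, 1]          # input 2 currently holds the highest priority
print(lrg_update(p, 2))   # -> [3, 1, 0, 2]
```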
Outline
- Swizzle Switch—Circuit & Microarchitecture
  - Overview
  - Arbitration
  - Prototype
- Swizzle Switch—Cache Coherent Manycore Interconnect
  - Motivation & Existing Interconnects
  - Swizzle Switch Interconnect
  - Evaluation
64x64 Prototype
[Figure: 64×64 Swizzle Switch prototype]
Measurement Results
[Figures: measured results of the 64×64 prototype]
Outline
- Swizzle Switch—Circuit & Microarchitecture
  - Overview
  - Arbitration
  - Prototype
- Swizzle Switch—Cache Coherent Manycore Interconnect
  - Motivation & Existing Interconnects
  - Swizzle Switch Interconnect
  - Evaluation
Scaling Interconnect for Many-Cores
- Existing interconnects: buses, crossbars, rings
  - Limited to ~16 cores
- Others' interconnect proposals for many-cores
  - Packet-switched, multi-hop networks-on-chip (NoCs)
  - Grids of routers: meshes, tori, and flattened butterflies
- Our proposal: Swizzle Switch Networks
  - Flat, single-stage, one-hop "crossbar++" interconnect
Mesh Network-on-Chip
[Figure: 8×8 grid of tiles connected by a mesh of routers (R), with memory controllers on the chip periphery. Each 1.73 mm × 1.73 mm tile (area = 3.0 mm²) contains an ARM Cortex-A5 with 32 kB ICache and 32 kB DCache, a 256 kB L2 cache bank, and a router. Die: 13.8 mm × 13.8 mm, area = 190 mm².]
Flattened Butterfly Network-on-Chip
[Figure: the same tiles grouped around routers in a flattened-butterfly topology, with memory controllers on the chip periphery. Die: 13.8 mm × 13.8 mm, area = 190 mm².]
Motivating Swizzle Switch Networks
- Uniform access latency
  - Ease of programming, data placement, thread placement, ...
- Low power
- Simplicity
  - Packet-switched NoCs need routing, congestion management, flow control, wormhole switching, ...
Motivating Swizzle Switch Networks
[Figure: per-node accepted throughput (flits/node/cycle) under hotspot traffic. Mesh: average = 0.017, unfairness = 42.2. SSN: average = 0.013, unfairness = 1.008.]
- Unfairness = throughput of the highest-throughput node / throughput of the lowest-throughput node
- Hotspot traffic: all nodes send data to node (8,8)
- Under hotspot traffic, the crossbar has slightly less throughput than the mesh but is 40× more fair
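For reference, the fairness metric above amounts to a one-line ratio; the sketch below uses made-up throughput numbers purely as an illustration, not measured data.

```python
# A small sketch (illustration only) of the fairness metric defined above.
def unfairness(per_node_throughput):
    """Unfairness = throughput of the best node / throughput of the worst node."""
    flat = [t for row in per_node_throughput for t in row]
    return max(flat) / min(flat)

def average_throughput(per_node_throughput):
    flat = [t for row in per_node_throughput for t in row]
    return sum(flat) / len(flat)

# Hypothetical 2x2 grid of accepted throughput (flits/node/cycle):
grid = [[0.010, 0.012],
        [0.015, 0.020]]
print(average_throughput(grid))  # -> 0.01425
print(unfairness(grid))          # -> 2.0
```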
Motivating Swizzle Switch Networks
[Figure: per-node accepted throughput (flits/node/cycle) under uniform random traffic. SSN: average = 0.47, unfairness = 1.02. Mesh: average = 0.36, unfairness = 1.89.]
- In the mesh, the nodes closest to the center receive the highest throughput
- Under uniform random traffic, the crossbar has more throughput than the mesh and is 87% more fair
Outline
- Swizzle Switch—Circuit & Microarchitecture
  - Overview
  - Arbitration
  - Prototype
- Swizzle Switch—Cache Coherent Manycore Interconnect
  - Motivation & Existing Interconnects
  - Swizzle Switch Interconnect
  - Evaluation
Top-Level Floorplan
[Figure: 14.3 mm × 14.3 mm floorplan (total area = 204 mm²) with memory controllers on the periphery, 3 Swizzle Switches in the center, and regions labeled NW, NE, EN, ES, SE, SW, WS, WN. Each ARM Cortex-A5 with 32 kB Icache and 32 kB Dcache occupies 0.86 mm × 0.86 mm (core + L1 area = 0.74 mm²); each 512 kB L2 cache slice is 1.38 mm × 3.27 mm (L2 area = 4.50 mm²).]
Outline
- Swizzle Switch—Circuit & Microarchitecture
  - Overview
  - Arbitration
  - Prototype
- Swizzle Switch—Cache Coherent Manycore Interconnect
  - Motivation & Existing Interconnects
  - Swizzle Switch Interconnect
  - Evaluation
Evaluation
Simulation parameters:
- Processors (both): 64 in-order cores, 1 IPC, 1.5 GHz
- L1 cache (both): 32 kB I/D caches, 4-way associative, 64-byte line size, 1-cycle latency
- L2 cache — NoC (Mesh/FBFly): shared 16 MB, 64-way banked, 8-way associative, 64-byte line size, 10-cycle latency; SSN: shared 16 MB, 32-way banked, 16-way associative, 64-byte line size, 11-cycle latency
- Interconnect — NoC (Mesh/FBFly): 3.0 GHz, 128-bit, 4-stage routers, 3 virtual networks with 3 virtual channels; SSN: 1.5 GHz, 64×32×128-bit Swizzle Switch Network
- Main memory (both): 4096 MB, 50-cycle latency
Benchmarks:
- SPLASH-2: scientific parallel application suite
Results—Performance & QoS
[Figure: normalized execution time (0–100%) for Mesh, FBFly, and SSN across the SPLASH-2 benchmarks (Barnes, Cholesky, FFT, FMM, Lu Contig, Lu NonContig, Ocean Contig, Ocean NonContig, Radix, Raytrace, Water NSquared, Water Spatial), broken down into core active, memory stall, and synchronization stall time; annotated for overall performance and quality-of-service.]
Results—Power
[Figure: interconnect power (mW) for Mesh, FBFly, and SSN across the SPLASH-2 benchmarks and their weighted average, broken down into wire dynamic, SSN/router dynamic, SSN/router leakage, clock, and buffer power; and total benchmark energy (pJ/inst), broken down into interconnect, L2 dynamic, L2 leakage, core/L1 dynamic, and core/L1 leakage.]
On average, the SSN uses 28% less power in the interconnect than a flattened butterfly, which results in an average reduction of 11% in the total system energy to complete the task.
Summary
- Swizzle Switch prototype (45 nm)
  - 64×64 crossbar with 128-bit buses
  - Embedded LRG priority arbitration
  - Achieved 4.4 Tbps at ~600 MHz while consuming only 1.3 W of power
- Swizzle Switch Network evaluation
  - Improved performance by 21%
  - Reduced power by 28%
  - Reduced latency variability by 3×
Additional Detailed Slides
Arbitration Mechanism (Matrix View)
[Figure: matrix view of a 5-input Swizzle Switch. Each source IP drives request (Req) and release (Rel) lines; the inhibit bits X0–X4 stored along each row record which other inputs that input can inhibit, and the number of 1s in a row gives the input's priority. A requesting input (R0–R4) discharges the columns of all lower-priority inputs, inhibiting them.]
Least Recently Granted (LRG)
[Figure: inhibit-matrix update after a grant to In4. Before the update, the row counts give priorities In0 = 1, In1 = 0, In2 = 2, In3 = 3, In4 = 4 (highest). After In4 is granted, its row is reset to all 0s and its column is set to all 1s, so the priorities become In0 = 2, In1 = 1, In2 = 3, In3 = 4, In4 = 0: the winner drops to least priority and the others move up.]
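A compact sketch of this matrix view: the inhibit bit W[i][j] = 1 means input i can inhibit input j, a requester wins if no other requester holds an inhibit bit over it, and the LRG update clears the winner's row and sets its column. The function names and the 5-input example are assumptions for illustration.

```python
# A sketch (illustration only) of the matrix view of arbitration plus the
# LRG update; the 5-input example mirrors the matrices in the figure above.

def grant(requests, inhibit):
    """inhibit[i][j] = 1 means input i can inhibit (has priority over) input j.
    Returns the granted requester, or None."""
    n = len(requests)
    for i in range(n):
        if requests[i] and not any(
                requests[j] and inhibit[j][i] for j in range(n) if j != i):
            return i          # nobody discharged column i: the request survives
    return None

def lrg_matrix_update(inhibit, winner):
    """Reset the winner's row and set its column: the winner drops to the
    lowest priority and every other input gains priority over it."""
    for j in range(len(inhibit)):
        if j != winner:
            inhibit[winner][j] = 0
            inhibit[j][winner] = 1

# Row-wise counts of 1s give each input's priority (In0..In4 = 1, 0, 2, 3, 4).
W = [[0, 1, 0, 0, 0],
     [0, 0, 0, 0, 0],
     [1, 1, 0, 0, 0],
     [1, 1, 1, 0, 0],
     [1, 1, 1, 1, 0]]
w = grant([True, False, False, False, True], W)   # In0 and In4 request
print(w)                                          # -> 4 (higher priority wins)
lrg_matrix_update(W, w)
print([sum(row) for row in W])                    # -> [2, 1, 3, 4, 0]
```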
Round Robin Arbitration
[Figure: same output-column structure as LRG; on release, the SET/RESET of the priority bits follows a round-robin policy instead of least-recently-granted order]
Round Robin Arbitration
[Figure: round-robin algorithm on the increasing-priority axis (0 = least, 62 = highest). When "m" releases the channel, it drops to least priority and the other inputs' priorities are upgraded (e.g., l: x → x+1, y → y+1).]
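The slides do not spell out the exact hardware update, but a standard round-robin reassignment, where priority follows the fixed cyclic input order starting just after the last-granted input, can be sketched as follows; the function name and formulation are assumptions, not the circuit's behavior.

```python
# A hedged sketch of a round-robin priority reassignment based on the
# fixed cyclic input order (assumption; the exact hardware update may differ).

def round_robin_update(n, winner):
    """Return new priorities for n inputs after `winner` releases the channel:
    the winner drops to 0 (least) and the input right after it in the fixed
    cyclic order becomes the highest priority."""
    return [(winner - i) % n for i in range(n)]

print(round_robin_update(4, winner=2))   # -> [2, 1, 0, 3]
```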
QoS Arbitration
[Figure: two-stage arbitration example. Each requester carries a QoS value; the QoS arbitration stage keeps only the requesters with the highest QoS value (3 in the example), and LRG arbitration among those contenders selects the winner.]
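A minimal sketch of this two-stage scheme, assuming each requester carries a small QoS value and LRG priorities are tracked separately; `qos_arbitrate` and the example values are illustrative names and numbers, not from the design.

```python
# A sketch (illustration only) of two-stage QoS + LRG arbitration.
def qos_arbitrate(requests, qos, lrg_priority):
    """requests: list of booleans; qos: per-input QoS value (larger = more
    important); lrg_priority: per-input LRG priority (larger = higher).
    Returns the granted input index, or None."""
    contenders = [i for i, r in enumerate(requests) if r]
    if not contenders:
        return None
    best_qos = max(qos[i] for i in contenders)
    contenders = [i for i in contenders if qos[i] == best_qos]   # QoS stage
    return max(contenders, key=lambda i: lrg_priority[i])        # LRG stage

# Inputs 0, 1, 3 request; inputs 0 and 3 share the highest QoS (3),
# and input 3 holds the higher LRG priority, so it wins.
print(qos_arbitrate([True, True, False, True],
                    qos=[3, 2, 0, 3],
                    lrg_priority=[1, 4, 2, 3]))   # -> 3
```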
Timing Diagram
[Figure: timing diagram]
Crosspoint Circuit
[Figure: crosspoint (x, y) circuit with the priority bit, priority line, sense amp, and Req/Rel/Ack logic connected to the input bus in<0:127> and output bus out<0:127>]
- in<x> is used to index Req/Rel
- out<y> is sampled for arbitration
- in<x+64> is used for sending the ack
- out<y> is discharged for the LRG update
Regenerative Bit-line Repeater
[Figure: the SSN is tiled into 16×16 macros with regenerative bit-line repeaters (Bitline_L/Bitline_R, clock/clock_b) between them]
Regeneration and decoupling improve speed.
Simulated bit-line delay improvement
[Figure: simulated bit-line delay improvement. Technology: 45 nm, supply: 1.1 V, temperature: 25 °C.]
SSN Scaling: Simulation
[Figure: simulated SSN scaling (45 nm, 1.1 V, 25 °C) without repeaters, with repeaters and a 256-bit bus, and with repeaters and a 128-bit bus]
Regenerative repeaters improve SSN scalability.
Swizzle Switch Network-on-Chip
[Figure: classification of coherence traffic by source and destination (L1 and L2): requests, responses, invalidations, writebacks, data forwarding, and shared data, and how these classes map onto the Swizzle Switches]
Results—64-core with A9 O3 cores
[Figure: results for a 64-core configuration with out-of-order (O3) Cortex-A9 cores]