Presentation

Transcription

Presentation
Photonic Many-Core Architecture Study
Nadya Bliss1, Krste Asanović2, Keren Bergman3, Luca Carloni3,
Jeremy Kepner1, Sanjeev Mohindra1, Vladimir Stojanović4
1MIT
Lincoln Laboratory, 2University of California Berkeley,
3Columbia University, 4MIT Research Laboratory of Electronics
September 23rd, 2008
PM: Jagdeep Shah
This work is sponsored by DARPA under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions
and recommendations are those of the author and are not necessarily endorsed by the United States Government.
MIT Lincoln Laboratory
HPEC2008 1
NTBliss 9/29/2008
Outline
•
•
•
•
•
Introduction
Logical Architecture Abstraction
Modeling and Mapping
Experiments and Results
Summary
HPEC2008 2
NTBliss 9/29/2008
MIT Lincoln Laboratory
Emerging Device Trends
Feature Size Reduction
3D Fabrication
Photonic Interconnects
1970s
Intel 80486DX2
Die: 12x6.75mm
Intel 4004
10 microns
Sun Sparc
0.8 microns
AMD Athlon
0.18 microns
Reduced path length
for accesses across
the memory hierarchy
STI Cell
65 nm
VS
Intel Core 2
45 nm
2008
Emerging device technologies create a large
parameter space of possible future architectures
HPEC2008 3
NTBliss 9/29/2008
MIT Lincoln Laboratory
Benefits of Photonic Interconnects
OPTICS
ELECTRONICS
TO MEMORY
• Communication to memory banks is chip power
and pin/wire density limited
• Use optical network as an efficient global crossbar
• Better scaling with N groups
• Poor scaling of on-chip mem controllers with cores • Expected performance - 40-80 Tb/sec
• At most 3-6 Tb/sec in the next few years
CORE-TO-CORE
TX
RX
TX
RX
TX
RX
TX
RX
TX
RX
TX
RX
• Buffer, receive and re-transmit at every switch
• Modulate/receive data once per communication
• Power dissipation grows with data rate
• Scalable, low power switch fabric
• Balanced communication and computation
Photonics
Photonics can
can provide
provide high
high bandwidth,
bandwidth, low
low latency
latency communication
communication
while
while meeting
meeting power
power requirements
requirements of
of embedded
embedded systems.
systems.
HPEC2008 4
NTBliss 9/29/2008
MIT Lincoln Laboratory
System Level View
-Photonic Many-core Architecture Network: PhotoMANSelecting a system level architecture allows the parameter space
to be narrowed while meeting requirements of DoD applications.
•
Manycore processor chip
– 64-256 cores (in 22nm node)
•
Off-chip memory
– a set of DRAM chips
– minimum capacity - 128 GB (at 22nm)
•
•
Evaluate interaction of the photonic
network and memory hierarchy
Board power limit 500 W
– Consistent with power constraints of
medium-sized UAV
RQ-7 Shadow
To
To evaluate
evaluate the
the architecture
architecture develop
develop
1.
1. Expressive
Expressive logical
logical abstraction
abstraction
2.
2. Modeling
Modeling and
and mapping
mapping framework
framework
HPEC2008 5
NTBliss 9/29/2008
MIT Lincoln Laboratory
Outline
•
•
•
•
•
Introduction
Logical Architecture Abstraction
Modeling and Mapping
Experiments and Results
Summary
HPEC2008 6
NTBliss 9/29/2008
MIT Lincoln Laboratory
Logical Abstraction
-Kuck* Memory Hierarchy2-LEVEL HIERARCHY EXAMPLE
SM2
Legend:
• P - processor
• N - inter-processor network
• M - memory
• SM - shared memory
• SMN - shared memory network
SMN2
SM1
SM1
...
SMN1
M0
...
P0
SMN1
M0
M0
P0
P0
N0.5
...
P0
N0.5
N1.5
M0
Subscript indicates
hierarchy level
x.5 subscript for N indicates
indirect memory access
The Kuck notation provides a clear way of describing a hardware
architecture along with the memory and communication hierarchy
HPEC2008 7
NTBliss 9/29/2008
MIT Lincoln Laboratory
*High Performance Computing: Challenges for Future Systems, David Kuck, 1996
PhotoMAN Logical Representation
-MIT/UCB 1 Group Memory ConfigurationSystem-Level
High-Level
The Kuck notation is suitable for both high-level
and detailed physical descriptions of the
architecture, such as groups and access points.
HPEC2008 8
NTBliss 9/29/2008
Detailed
Legend:
• AP - access point
• APG - access point group
MIT Lincoln Laboratory
PhotoMAN Logical Representation
-MIT/UCB 4 Group Memory Configuration0
15
SM2
XS to SM connections
are 1-to-1
XSG
APN connections are
1-to-Number of Groups
N0.5 is a single
electrical mesh
15
XS2
0
APN1
AP10
SMN
is an
electrical mesh
connecting only
processors
within the group
0
...
0
XS2
0...3
SM0...15 are DRAM
memory banks,
8GB each
SM2
0
...
APG
AP15
1
0
SMN1
...
...
AP10
3
...
APG
AP15
1
Number of access
points per group is
equal to number of
memory banks
3
SMN1
0
1
2
3
M0
M0
M0
M0
0
1
2
3
P0
P0
P0
P0
255
M0
...
255
P0
Logical view of the 16
(N) group configuration
is similar
N0.5
While
While the
the Kuck
Kuck representation
representation is
is flexible,
flexible, the
the PhotoMAN
PhotoMAN study
study
is
is focused
focused on
on 1,
1, 4,
4, and
and 16
16 group
group memory
memory configurations.
configurations.
HPEC2008 9
NTBliss 9/29/2008
Legend:
• APN - access point network
• XS - cross bar
• XSG - cross bar group
MIT Lincoln Laboratory
Outline
•
•
•
•
•
Introduction
Logical Architecture Abstraction
Modeling and Mapping
Experiments and Results
Summary
HPEC2008 10
NTBliss 9/29/2008
MIT Lincoln Laboratory
pMapper: Modeling and Mapping
Machine description
together with an
abstraction layer is
used to generate a
performance model
Application specification
(MATLAB) is used to
generate a signal flow graph
Maps (distribution specifications)
are generated for the application
Results can be used to predict
application performance and
architecture parameters
pMapper performs
• application to
architecture mapping
• application on
architecture simulation
APPLICATION
SIGNAL FLOW GRAPH
HPEC2008 11
NTBliss 9/29/2008
MIT Lincoln Laboratory
PhotoMAN Machine Description
Given a hardware model H
and a program parse tree T,
pMapper finds maps M that
minimize execution latency:
Focus of the PhotoMAN study
HPEC2008 12
NTBliss 9/29/2008
MIT Lincoln Laboratory
Memory Hierarchy Formulation
-MIT/UCB 1 Group Memory ConfigurationPHYSICAL VIEW
CORE-TO-CORE NETWORK, N0.5
SHARED MEMORY NETWORK, SMN1
•• Bandwidth
Bandwidth and
and latency
latency matrices
matrices have
have the
the
same
same pattern
pattern of
of non-zeros
non-zeros
•• Topology
and SMN is the same
Topology for
for N
N0.5
0.5 and SMN11 is the same
for
for the
the 1-Group
1-Group configuration
configuration
•• Diagonal
Diagonal entries
entries encode
encode
•• R
RNN -- bandwidth
bandwidth to
to local
local store
store
i
i
•• R
- whether P is an access point
RMon
Mon - whether P is an access point
HPEC2008 13
NTBliss 9/29/2008
AP-to-SM
ACCESS
POINTS
MIT Lincoln Laboratory
Memory Hierarchy Formulation
-MIT/UCB NG Group Memory ConfigurationPHYSICAL VIEW
SHARED MEMORY NETWORK, SMN1
ACCESS POINTS
AP-XS-MEMORY NETWORK
•• Core-to-core
Core-to-core network
network not
not shown
shown and
and is
is
the
the same
same as
as in
in 11 group
group case
case
•• While
While memory
memory access
access requires
requires one
one
additional
additional transfer,
transfer, the
the topology
topology is
is
represented
represented with
with aa single
single matrix
matrix -- R
RAXSon
AXSon
AP-XS BANDWIDTH
XS-MEMORY BANDWIDTH
HPEC2008 14
NTBliss 9/29/2008
MIT Lincoln Laboratory
Outline
•
•
•
•
•
Introduction
Logical Architecture Abstraction
Modeling and Mapping
Experiments and Results
Summary
HPEC2008 15
NTBliss 9/29/2008
MIT Lincoln Laboratory
Maps
1D BLOCK
2D BLOCK
1D CYCLIC
2D CYCLIC
1D HIERARCHICAL
...
P0 P1 P2 P3
INCREASING PROGRAMMING COMPLEXITY
•• High
High programmability
programmability is
is aa desirable
desirable architecture
architecture characteristic
characteristic
•• Complexity
Complexity of
of mapping
mapping chosen
chosen to
to optimize
optimize performance
performance (minimize
(minimize
execution
execution time)
time) provides
provides insight
insight into
into programmability
programmability of
of hardware
hardware
•• The
The higher
higher complexity
complexity of
of the
the mapping,
mapping, the
the lower
lower programmability
programmability
HPEC2008 16
NTBliss 9/29/2008
MIT Lincoln Laboratory
Synthetic Aperture Radar (SAR)
SAR processing chain is common to many defense application and
requires significant amount of both computation and communication.
Typical application
• SAR spotlight mode
• Collect raw SAR data
• Processing chain produces an image
• Image can then be analyzed
Processing chain simulated
• FFTs, IFFTs, and data-reorganization
HPC Challenge relevance: FFT
Pulse
Compression
Cross-range
Re-sampling
Matched Filter &
Interpolation
Back-projection
Image
Conversion
Part 1
Image
Conversion
Part 2
All-to-all, full data redistributions
HPEC2008 17
NTBliss 9/29/2008
MIT Lincoln Laboratory
Airborne Video Surveillance
Georegistration is a key computational kernel in airborne
video surveillance and other image processing algorithms.
Typical application
• High data rate imaging sensor
• Collect data
• Georegister data
• Analyze activity
SONOMA (LLNL)
GPS/
INS
Processing chain simulated
• projective transform with bilinear
interpolation for each pixel
6 COTS cameras - 66Mpix
HPC Challenge relevance: STREAM
and Random Access
HPEC2008 18
NTBliss 9/29/2008
MIT Lincoln Laboratory
PhotoMAN Performance
PROJECTIVE TRANSFORM
SAR
AVS
OPTICAL TO MEMORY
BANKS (MIT/UCB)
OPTICAL MESH
(COLUMBIA)
HPEC2008 19
NTBliss 9/29/2008
Optical
Optical (photonic)
(photonic) interconnects
interconnects
both
both to
to memory
memory and
and between
between
cores
cores yield
yield best
best performance
performance
MIT Lincoln Laboratory
PhotoMAN Programmability
SAR
AVS
Scalability with
number of cores
Maps selected:
• 1D Block
• Hierarchical
• Smallest block: fits
into core’s local store
1D HIERARCHICAL
• Requires hierarchical maps
• Can be improved with cache architecture
HPEC2008 20
NTBliss 9/29/2008
+
• Architecture is well-balanced
• Maps with maximum number of cores are
chosen (optical to memory and optical)
MIT Lincoln Laboratory
See J. Kepner and N. Bliss, “Evaluating the Productivity of a Multicore Architecture”
Best Performing Architecture
LOGICAL VIEW
•• 16
16 groups
groups
•• Optical
Optical to
to memory
memory
•• Optical
Optical mesh
mesh
•• 256
256 cores
cores
HPEC2008 21
NTBliss 9/29/2008
PHYSICAL VIEW
Current/future
Current/future research
research
•• Network
Network topology
topology
•• Power
Power optimization
optimization
•• Processor
Processor characteristics
characteristics
•• Cache
Cache architecture
architecture
•• Hierarchical
Hierarchical mapping
mapping
MIT Lincoln Laboratory
Summary
•
Emerging device trends are motivating the need for logical
architecture abstractions and robust modeling, mapping and
simulation environments
•
PhotoMAN study focus: photonic networks
•
Kuck diagrams provide an expressive logical abstraction
•
Detailed hardware model describes the mapping and
modeling optimization space explored by pMapper and allows
for architecture evaluation
•
Initial results show over an order of magnitude improvement
in application performance with photonics, while maintaining
scalability
HPEC2008 22
NTBliss 9/29/2008
MIT Lincoln Laboratory