Presentation
Transcription
Presentation
Photonic Many-Core Architecture Study Nadya Bliss1, Krste Asanović2, Keren Bergman3, Luca Carloni3, Jeremy Kepner1, Sanjeev Mohindra1, Vladimir Stojanović4 1MIT Lincoln Laboratory, 2University of California Berkeley, 3Columbia University, 4MIT Research Laboratory of Electronics September 23rd, 2008 PM: Jagdeep Shah This work is sponsored by DARPA under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government. MIT Lincoln Laboratory HPEC2008 1 NTBliss 9/29/2008 Outline • • • • • Introduction Logical Architecture Abstraction Modeling and Mapping Experiments and Results Summary HPEC2008 2 NTBliss 9/29/2008 MIT Lincoln Laboratory Emerging Device Trends Feature Size Reduction 3D Fabrication Photonic Interconnects 1970s Intel 80486DX2 Die: 12x6.75mm Intel 4004 10 microns Sun Sparc 0.8 microns AMD Athlon 0.18 microns Reduced path length for accesses across the memory hierarchy STI Cell 65 nm VS Intel Core 2 45 nm 2008 Emerging device technologies create a large parameter space of possible future architectures HPEC2008 3 NTBliss 9/29/2008 MIT Lincoln Laboratory Benefits of Photonic Interconnects OPTICS ELECTRONICS TO MEMORY • Communication to memory banks is chip power and pin/wire density limited • Use optical network as an efficient global crossbar • Better scaling with N groups • Poor scaling of on-chip mem controllers with cores • Expected performance - 40-80 Tb/sec • At most 3-6 Tb/sec in the next few years CORE-TO-CORE TX RX TX RX TX RX TX RX TX RX TX RX • Buffer, receive and re-transmit at every switch • Modulate/receive data once per communication • Power dissipation grows with data rate • Scalable, low power switch fabric • Balanced communication and computation Photonics Photonics can can provide provide high high bandwidth, bandwidth, low low latency latency communication communication while while meeting meeting power power requirements requirements of of embedded embedded systems. systems. HPEC2008 4 NTBliss 9/29/2008 MIT Lincoln Laboratory System Level View -Photonic Many-core Architecture Network: PhotoMANSelecting a system level architecture allows the parameter space to be narrowed while meeting requirements of DoD applications. • Manycore processor chip – 64-256 cores (in 22nm node) • Off-chip memory – a set of DRAM chips – minimum capacity - 128 GB (at 22nm) • • Evaluate interaction of the photonic network and memory hierarchy Board power limit 500 W – Consistent with power constraints of medium-sized UAV RQ-7 Shadow To To evaluate evaluate the the architecture architecture develop develop 1. 1. Expressive Expressive logical logical abstraction abstraction 2. 2. Modeling Modeling and and mapping mapping framework framework HPEC2008 5 NTBliss 9/29/2008 MIT Lincoln Laboratory Outline • • • • • Introduction Logical Architecture Abstraction Modeling and Mapping Experiments and Results Summary HPEC2008 6 NTBliss 9/29/2008 MIT Lincoln Laboratory Logical Abstraction -Kuck* Memory Hierarchy2-LEVEL HIERARCHY EXAMPLE SM2 Legend: • P - processor • N - inter-processor network • M - memory • SM - shared memory • SMN - shared memory network SMN2 SM1 SM1 ... SMN1 M0 ... P0 SMN1 M0 M0 P0 P0 N0.5 ... P0 N0.5 N1.5 M0 Subscript indicates hierarchy level x.5 subscript for N indicates indirect memory access The Kuck notation provides a clear way of describing a hardware architecture along with the memory and communication hierarchy HPEC2008 7 NTBliss 9/29/2008 MIT Lincoln Laboratory *High Performance Computing: Challenges for Future Systems, David Kuck, 1996 PhotoMAN Logical Representation -MIT/UCB 1 Group Memory ConfigurationSystem-Level High-Level The Kuck notation is suitable for both high-level and detailed physical descriptions of the architecture, such as groups and access points. HPEC2008 8 NTBliss 9/29/2008 Detailed Legend: • AP - access point • APG - access point group MIT Lincoln Laboratory PhotoMAN Logical Representation -MIT/UCB 4 Group Memory Configuration0 15 SM2 XS to SM connections are 1-to-1 XSG APN connections are 1-to-Number of Groups N0.5 is a single electrical mesh 15 XS2 0 APN1 AP10 SMN is an electrical mesh connecting only processors within the group 0 ... 0 XS2 0...3 SM0...15 are DRAM memory banks, 8GB each SM2 0 ... APG AP15 1 0 SMN1 ... ... AP10 3 ... APG AP15 1 Number of access points per group is equal to number of memory banks 3 SMN1 0 1 2 3 M0 M0 M0 M0 0 1 2 3 P0 P0 P0 P0 255 M0 ... 255 P0 Logical view of the 16 (N) group configuration is similar N0.5 While While the the Kuck Kuck representation representation is is flexible, flexible, the the PhotoMAN PhotoMAN study study is is focused focused on on 1, 1, 4, 4, and and 16 16 group group memory memory configurations. configurations. HPEC2008 9 NTBliss 9/29/2008 Legend: • APN - access point network • XS - cross bar • XSG - cross bar group MIT Lincoln Laboratory Outline • • • • • Introduction Logical Architecture Abstraction Modeling and Mapping Experiments and Results Summary HPEC2008 10 NTBliss 9/29/2008 MIT Lincoln Laboratory pMapper: Modeling and Mapping Machine description together with an abstraction layer is used to generate a performance model Application specification (MATLAB) is used to generate a signal flow graph Maps (distribution specifications) are generated for the application Results can be used to predict application performance and architecture parameters pMapper performs • application to architecture mapping • application on architecture simulation APPLICATION SIGNAL FLOW GRAPH HPEC2008 11 NTBliss 9/29/2008 MIT Lincoln Laboratory PhotoMAN Machine Description Given a hardware model H and a program parse tree T, pMapper finds maps M that minimize execution latency: Focus of the PhotoMAN study HPEC2008 12 NTBliss 9/29/2008 MIT Lincoln Laboratory Memory Hierarchy Formulation -MIT/UCB 1 Group Memory ConfigurationPHYSICAL VIEW CORE-TO-CORE NETWORK, N0.5 SHARED MEMORY NETWORK, SMN1 •• Bandwidth Bandwidth and and latency latency matrices matrices have have the the same same pattern pattern of of non-zeros non-zeros •• Topology and SMN is the same Topology for for N N0.5 0.5 and SMN11 is the same for for the the 1-Group 1-Group configuration configuration •• Diagonal Diagonal entries entries encode encode •• R RNN -- bandwidth bandwidth to to local local store store i i •• R - whether P is an access point RMon Mon - whether P is an access point HPEC2008 13 NTBliss 9/29/2008 AP-to-SM ACCESS POINTS MIT Lincoln Laboratory Memory Hierarchy Formulation -MIT/UCB NG Group Memory ConfigurationPHYSICAL VIEW SHARED MEMORY NETWORK, SMN1 ACCESS POINTS AP-XS-MEMORY NETWORK •• Core-to-core Core-to-core network network not not shown shown and and is is the the same same as as in in 11 group group case case •• While While memory memory access access requires requires one one additional additional transfer, transfer, the the topology topology is is represented represented with with aa single single matrix matrix -- R RAXSon AXSon AP-XS BANDWIDTH XS-MEMORY BANDWIDTH HPEC2008 14 NTBliss 9/29/2008 MIT Lincoln Laboratory Outline • • • • • Introduction Logical Architecture Abstraction Modeling and Mapping Experiments and Results Summary HPEC2008 15 NTBliss 9/29/2008 MIT Lincoln Laboratory Maps 1D BLOCK 2D BLOCK 1D CYCLIC 2D CYCLIC 1D HIERARCHICAL ... P0 P1 P2 P3 INCREASING PROGRAMMING COMPLEXITY •• High High programmability programmability is is aa desirable desirable architecture architecture characteristic characteristic •• Complexity Complexity of of mapping mapping chosen chosen to to optimize optimize performance performance (minimize (minimize execution execution time) time) provides provides insight insight into into programmability programmability of of hardware hardware •• The The higher higher complexity complexity of of the the mapping, mapping, the the lower lower programmability programmability HPEC2008 16 NTBliss 9/29/2008 MIT Lincoln Laboratory Synthetic Aperture Radar (SAR) SAR processing chain is common to many defense application and requires significant amount of both computation and communication. Typical application • SAR spotlight mode • Collect raw SAR data • Processing chain produces an image • Image can then be analyzed Processing chain simulated • FFTs, IFFTs, and data-reorganization HPC Challenge relevance: FFT Pulse Compression Cross-range Re-sampling Matched Filter & Interpolation Back-projection Image Conversion Part 1 Image Conversion Part 2 All-to-all, full data redistributions HPEC2008 17 NTBliss 9/29/2008 MIT Lincoln Laboratory Airborne Video Surveillance Georegistration is a key computational kernel in airborne video surveillance and other image processing algorithms. Typical application • High data rate imaging sensor • Collect data • Georegister data • Analyze activity SONOMA (LLNL) GPS/ INS Processing chain simulated • projective transform with bilinear interpolation for each pixel 6 COTS cameras - 66Mpix HPC Challenge relevance: STREAM and Random Access HPEC2008 18 NTBliss 9/29/2008 MIT Lincoln Laboratory PhotoMAN Performance PROJECTIVE TRANSFORM SAR AVS OPTICAL TO MEMORY BANKS (MIT/UCB) OPTICAL MESH (COLUMBIA) HPEC2008 19 NTBliss 9/29/2008 Optical Optical (photonic) (photonic) interconnects interconnects both both to to memory memory and and between between cores cores yield yield best best performance performance MIT Lincoln Laboratory PhotoMAN Programmability SAR AVS Scalability with number of cores Maps selected: • 1D Block • Hierarchical • Smallest block: fits into core’s local store 1D HIERARCHICAL • Requires hierarchical maps • Can be improved with cache architecture HPEC2008 20 NTBliss 9/29/2008 + • Architecture is well-balanced • Maps with maximum number of cores are chosen (optical to memory and optical) MIT Lincoln Laboratory See J. Kepner and N. Bliss, “Evaluating the Productivity of a Multicore Architecture” Best Performing Architecture LOGICAL VIEW •• 16 16 groups groups •• Optical Optical to to memory memory •• Optical Optical mesh mesh •• 256 256 cores cores HPEC2008 21 NTBliss 9/29/2008 PHYSICAL VIEW Current/future Current/future research research •• Network Network topology topology •• Power Power optimization optimization •• Processor Processor characteristics characteristics •• Cache Cache architecture architecture •• Hierarchical Hierarchical mapping mapping MIT Lincoln Laboratory Summary • Emerging device trends are motivating the need for logical architecture abstractions and robust modeling, mapping and simulation environments • PhotoMAN study focus: photonic networks • Kuck diagrams provide an expressive logical abstraction • Detailed hardware model describes the mapping and modeling optimization space explored by pMapper and allows for architecture evaluation • Initial results show over an order of magnitude improvement in application performance with photonics, while maintaining scalability HPEC2008 22 NTBliss 9/29/2008 MIT Lincoln Laboratory