vectors
ENES Workshop: Exascale Technologies & Innovation in HPC for Climate Models
Hamburg, March 2014
© NEC Corporation 2014

Agenda
▌ Brief overview
    First results and interpretation
    Lessons learned future
▌ NEC’s future plans
    Basic directions
    Implications for users

Vector-Core
[Block diagram: Scalar Unit with Scalar Reg.; Vector Pipeline (VPP) x16 with Mask Reg., Vector Reg. and the units Mask, Multi, Multi/Logical, Add, Add/Div/Sqrt; ADB (Assignable Data Buffer), 1 MByte. Labels: Load: 256GB/s * (16B x16); 16KB * + 128KB *; ALU/Mul; Add/Sub]
* aggregate for 16 VPPs

SIMD = Vector ?
[Figure: Input, Pipeline and Result compared for Scalar, SIMD, Vector and SX]
Vector is more efficient than SIMD
SX is a SIMD-vector

LSI Configuration
[Block diagram: CPU with four cores, each with SPU (Scalar Processing Unit), VPU (Vector Processing Unit) and ADB (Assignable Data Buffer); RCU (Remote access Control Unit); crossbar; 16 MC (Memory controller); Memory (DDR3). Labels: 256GB/s; Interconnect 8GB/s x2]

Core
    Performance          64GFlops
    ADB size             1MB
    ADB bandwidth        256GB/s
    Memory bandwidth     64GB/s ~ 256GB/s
    Memory Byte/Flop     1.0 ~ 4.0
CPU
    Cores                4
    Performance          256GFlops
    Memory bandwidth     256GB/s
    Byte/Flop            1.0

Architectural Improvement
    Enhancing vector instruction issue
    Enhancing bypass chaining path
    Shorten memory latency
    Enhancing instruction reordering
    Enlarging ADB capacity
    Avoiding redundant loads (MSHR)
    Avoiding redundant stores (store merge)

MSHR: Miss Status Handling Register
New feature for NGV, lesson learned from SX-9
Should improve performance for a lot of applications

    do j = 1, n
      do i = 1, n
        d(i,j) = a(i-1,j) + a(i,j) + a(i+1,j) - a(i,j-1)
      end do
    end do

MSHR: Miss Status Handling Register
Simulation: Factor 2.4 on the Himeno-Benchmark
[Figure: load timing diagrams, panel "With MSHR (NGV)", ADB Off and ADB On, through Memory, MSHR, ADB and ALU; labels: LD A #4, MSHR release, 4:4, 4:3, 4:2]
[Figure: load timing diagrams, panel "No MSHR (SX-9)", ADB Off and ADB On, through Memory, ADB and ALU; labels: LD A #1, LD A #2, LD A #3, 4:1]

NGV Node Card
    CPU: 256GF, 256GB/s
    Memory: 4GB x 16 DIMMs, DDR3 2000MHz
    Dimensions: 37cm, 11cm

Configurations
    1 CPU, 256GF, 256GB/s
    2 nodes = 2 CPUs
    8 modules = 16 nodes = 16 CPUs
    4 cages = 32 modules = 64 nodes = 64 CPUs
    64 nodes = 16TF, 16TB/s
Rack Specifications
    0.75m x 1.5m x 2.0m
    30KW

Sustained Memory Bandwidth
[Figure: plot; labels: Sustained memory bandwidth [GB/s], Exploiting # of cores]
A single core can exploit the whole memory bandwidth of a single CPU, 256GB/s

Real Code

Power-Efficiency

Lessons learned (what a surprise!)
▌ Bandwidth matters
▌ Faster CPU: less loss of efficiency due to parallelization
    Amdahl’s law still holds! If not worse!
    There is no use in optimizing only scalability, be it application, system software or hardware.
▌ Consequently there is a scientific need for an HPC-targeted architecture.
    Do the business-cases exist?
    • Need for capability computing? Or rather throughput?
    • Power-consumption and TCO!
    Economical aspects on the vendors’ side!

Moving Forward
All information is subject to change without notice!
▌ Board of Directors: Provide a profitable business-concept for a differentiated system targeted to high-performance processing, not only the traditional HPC-market.
▌ Current direction
    new product-line, not just a continuation of SX
    “It’s the economy …”, need to target a broader market, not only HPC
    consequences even for the processor architecture
    Studies ongoing:
    • technology, architecture, most important competition
    • which markets, and what are the applications and the requirements?
▌ Thoughts and quite some open questions:
    Multi-core Vector, more registers
    Shared ADB?
    Multi-socket, or how much “SMP”? Coherency?
    Optimization for short(er)-vectors (some ideas!)
    Interconnect: own one, 3rd party?
    • Functionalities!
      Atomic updates, CAF, …
    Revised product lineup? From “student board” to Exa-Flops
    LINUX!

What does it mean for the code-owner?
▌ Vectorization is here to stay and increasingly important
    depending on hardware-vendor
    I believe the importance will grow, and so will the “SIMD- or vector-width”
▌ MPI will continue to be the leading paradigm for distributed memory, OpenMP for shared memory; the user has to master both.
▌ What about PGAS?
    Perhaps my personal problem: I have seen such things before, Intel Paragon and Cray T3D, HPF …
    Two aspects to it: ease of use and performance
    Latency? (dimensionless numbers!)
    • PGAS can also be used for latency hiding … in my opinion the most important aspect!
▌ Well, yes, economy! A little complaint from a vendor
    A “proprietary architecture” (basically nowadays a “non-Intel-architecture”) is expensive, but is “Intel-only-HPC” a solution?
    • probably not!
    • Nvidia!?
    Implications for users! We build the targeted product, but the customers have to give it a chance!
    RFP
    • A long time ago RFPs were very different!

Water Cooling
NGV racks are cooled by both water and air
Two-stage water cooling with primary and secondary coolant
[Diagram: chiller, heat exchanger and four NGV racks, with a primary coolant loop (facility supplying chilled water) and a secondary coolant loop; labels: 20 deg C, 300L/min, 7-15 deg C, 300L/min, 21-24 deg C, 16-19 deg C]
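
Note on the figures quoted in the slides: the Byte/Flop values on the LSI Configuration slide and the "Amdahl’s law still holds" remark can be checked with a few lines of arithmetic. The following is only a minimal sketch. The hardware numbers (4 cores, 64 GFlops per core, 256GB/s per CPU, 64 nodes) are taken from the slides; the function names and the 99% parallel fraction are hypothetical, chosen purely for illustration, and Amdahl’s law is used in its textbook form rather than any model from the talk.

```python
# Figures quoted in the slides, per CPU.
CORES_PER_CPU = 4        # "Cores 4"
GFLOPS_PER_CORE = 64.0   # "Performance 64GFlops" (per core)
MEM_BW_GBS = 256.0       # "Memory bandwidth 256GB/s" (per CPU)
NODES = 64               # largest configuration on the Configurations slide


def bytes_per_flop(bandwidth_gbs, gflops):
    """Memory bandwidth available per floating-point operation (hypothetical helper)."""
    return bandwidth_gbs / gflops


def amdahl_speedup(parallel_fraction, workers):
    """Textbook Amdahl's law: the serial part does not speed up at all."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    cpu_gflops = CORES_PER_CPU * GFLOPS_PER_CORE

    # All four cores share the CPU's bandwidth; a single core can use all of it.
    print("Byte/Flop, four cores busy:", bytes_per_flop(MEM_BW_GBS, cpu_gflops))
    print("Byte/Flop, one core busy:  ", bytes_per_flop(MEM_BW_GBS, GFLOPS_PER_CORE))

    # Aggregate peak of the 64-node configuration, in GFlops and GB/s.
    print("64 nodes, peak GFlops:", NODES * cpu_gflops)
    print("64 nodes, peak GB/s:  ", NODES * MEM_BW_GBS)

    # Assumed 99% parallel code: more workers give diminishing returns.
    for workers in (4, 64, 1024):
        print(workers, "workers ->", round(amdahl_speedup(0.99, workers), 1), "x")
```

The two Byte/Flop results correspond to the ends of the "1.0 ~ 4.0" range on the slide, and the 64-node totals correspond to the rounded "16TF, 16TB/s". The loop at the end illustrates why, for a fixed parallel fraction, fewer and faster processors lose less efficiency than many slow ones.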