The Mont-Blanc project
Transcription
The Mont-Blanc project
http://www.montblanc-project.eu
The Mont-Blanc approach towards Exascale
Alex Ramirez, Barcelona Supercomputing Center

Disclaimer: I speak only for myself. All references to unavailable products are speculative, taken from web sources; there is no commitment from ARM, Samsung, Intel, Nvidia, or others implied.

This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 288777.

Outline
• A bit of history
  • Vector supercomputers
  • Commodity supercomputers
  • The next step in the commodity chain
• Supercomputers from mobile components
  • Homogeneous architecture
  • Compute accelerators
  • Rely on OmpSs to handle the challenges
  • BSC prototype roadmap
  • Mont-Blanc project goals

In the beginning ... there were only supercomputers
• Built to order
  • Very few of them
• Special-purpose hardware
  • Very expensive
• Control Data, Convex, ...
• Cray-1: 1975, 160 MFLOPS, 80 units at 5-8 M$
• Cray X-MP: 1982, 800 MFLOPS
• Cray-2: 1985, 1.9 GFLOPS
• Cray Y-MP: 1988, 2.6 GFLOPS
• Fortran + vectorizing compilers

The Killer Microprocessors
[Chart: peak MFLOPS (10 to 10,000, log scale) vs. year (1974-1999), comparing vector machines (Cray-1, Cray C90, NEC SX4/SX5) with microprocessors (Alpha EV4/EV5, Intel Pentium, IBM P2SC, HP PA8200)]
• Microprocessors killed the vector supercomputers
  • They were not faster ...
  • ... but they were significantly cheaper and greener
• Need 10 microprocessors to achieve the performance of 1 vector CPU
  • SIMD vs. MIMD programming paradigms

Then, commodity took over special purpose
• ASCI Red, Sandia
  • 1997, 1 TFLOPS (Linpack), 9298 cores @ 200 MHz, 1.2 TBytes
  • Intel Pentium Pro
  • Upgraded to Pentium II Xeon: 1999, 3.1 TFLOPS
• ASCI White, LLNL
  • 2001, 7.3 TFLOPS
  • 8192 processors @ 375 MHz, 6 TBytes, (3+3) MWatts
  • IBM Power3
• Message-passing programming models

Finally, commodity hardware + commodity software
• MareNostrum
  • Nov 2004, #4 in the Top500
  • 20 TFLOPS (Linpack)
• IBM PowerPC 970 FX
  • Blade enclosure
• Myrinet + 1 GbE network
• SuSE Linux

The next step in the commodity chain
[Diagram: the commodity chain: HPC, servers, desktop, mobile]
• Total cores in the Jun'12 Top500: 13.5 Mcores
• Tablets sold in Q4 2011: 27 Mtablets
• Smartphones sold in Q4 2011: > 100 Mphones

ARM processor improvements in DP FLOPS
[Chart: DP ops/cycle (1 to 16, log scale) vs. year (1999-2015): Intel SSE2, IBM BG/P, IBM BG/Q, Intel AVX, ARM Cortex-A9, ARM Cortex-A15, ARMv8]
• ARM quickly moved from optional floating point to state-of-the-art
  • IBM BG/Q and Intel AVX implement DP in 256-bit SIMD
  • ARMv8 ISA introduces DP in the NEON instruction set (128-bit SIMD)
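To make the ARMv8 point concrete, here is a minimal sketch of what double-precision NEON code could look like, written against the AArch64 ACLE intrinsics (a 128-bit vector holds two doubles). The kernel choice (DAXPY), the function name, and the assumption that n is a multiple of 2 are illustrative, not from the talk; no ARMv8 silicon was available at the time of this presentation.

```c
/* Sketch: double-precision DAXPY (y = a*x + y) with 128-bit NEON.
 * Illustrative only; assumes AArch64 ACLE intrinsics and that n
 * is a multiple of 2. */
#include <arm_neon.h>

void daxpy_neon(long n, double a, const double *x, double *y)
{
    float64x2_t va = vdupq_n_f64(a);        /* broadcast the scalar a  */
    for (long i = 0; i < n; i += 2) {
        float64x2_t vx = vld1q_f64(x + i);  /* load two doubles from x */
        float64x2_t vy = vld1q_f64(y + i);  /* load two doubles from y */
        vy = vfmaq_f64(vy, va, vx);         /* vy += va * vx (fused)   */
        vst1q_f64(y + i, vy);               /* store the two results   */
    }
}
```

Each vfmaq_f64 performs two fused multiply-adds, i.e. four DP operations in a single instruction, which is the kind of jump over scalar VFP that the chart above attributes to ARMv8.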
ARM processor efficiency vs. IBM / Intel / Nvidia
[Chart: GFLOPS/W vs. release date (Nov 2007 - Nov 2012) for IBM chips, Intel chips, NVIDIA cards, and ARM chips; data points include ARM11 @ 482 MHz (0.5 GFLOPS), Cortex-A9 @ 1 GHz (2 GFLOPS, 2-core), Cortex-A15 @ 2 GHz* (32 GFLOPS, 4-core), and BG/Q @ 1.6 GHz (205 GFLOPS, 16-core)]
* Based on ARM Cortex-A9 @ 2 GHz power consumption on 45 nm, not an ARM commitment

Are the "Killer Mobiles™" coming?
[Chart: cost (log10) vs. performance (log2): Server ($1500), Desktop ($150), Mobile ($20) today; HPC-Mobile ($40)? in the near future]
• Where is the sweet spot? Maybe in the low end ...
  • Today: ~1:8 ratio in performance, 1:100 ratio in cost
  • Tomorrow: ~1:2 ratio in performance, still 1:100 in cost?
• The same reason why microprocessors killed the vector supercomputers
  • Not so much performance ... but much lower cost and power

Killer Mobile™ example: Samsung Exynos 5450*
• 4-core ARM Cortex-A15 @ 2 GHz
  • 32 GFLOPS (4 cores × 2 GHz × 4 DP ops/cycle)
• 8-core ARM Mali T685
  • 168 GFLOPS*
• Dual-channel DDR3 memory controller
• All in a low-power mobile socket
* Data from web sources, not an ARM or Samsung commitment

Are we building BlueGene again?
• Yes ...
  • Exploit Pollack's Rule in the presence of abundant parallelism: performance grows only as roughly the square root of core complexity, so many small cores beat a single fast core
  • Heterogeneous computing
    • On-chip GPU
• ... and no
  • Commodity vs. special purpose: higher volume, many vendors, lower cost
  • Lots of room for improvement
    • No SIMD / vectors yet ...
  • Build on Europe's embedded strengths

Can we achieve competitive performance?
[Diagram: intra- and inter-node links: 64 GB/s, 56 Gb/s, 10-40 Gb/s, 1 Gb/s]
• 2-socket Intel Sandy Bridge
  • 370 GFLOPS
  • 1 address space
  • 44 MB on-chip memory
  • 136 GB/s memory bandwidth
  • 64 GB/s intra-node (2 × QPI)
• 8-socket ARM Cortex-A15
  • 256 GFLOPS
  • 8 address spaces
  • 16 MB on-chip memory
  • 102 GB/s memory bandwidth
  • 1 Gb/s intra-node (1 GbE)

Can we achieve competitive performance?
• Sandy Bridge + Nvidia K20
  • 1685 GFLOPS
  • 2 address spaces
  • 32 GB/s between CPU and GPU (16x PCIe 3.0)
  • 68 + 192 GB/s memory bandwidth
• 8-socket Exynos 5450
  • 1600 GFLOPS
  • 16 address spaces
  • 12.8 GB/s between CPU and GPU (shared memory)
  • 102 GB/s memory bandwidth

Then, what is so good about it?
• Sandy Bridge + Nvidia K20
  • > $3000
  • > 400 Watt (roughly 4 GFLOPS/W peak)
• 8-socket Exynos 5450
  • < $200
  • < 100 Watt (roughly 16 GFLOPS/W peak)

There is no free lunch
• 2x more cores for the same performance
• ½ the on-chip memory per core
• 8x more address spaces
• 1 GbE inter-chip communication

OmpSs runtime layer manages architecture complexity
[Diagram: a task dependence graph partitioned across processors P0, P1, P2, and execution timelines with and without communication/computation overlap]
• The programmer is exposed to a simple architecture
• The task graph provides lookahead
  • Exploit knowledge about the future
• Automatically handle all of the architecture challenges
  • Strong scalability
  • Multiple address spaces
  • Low cache size
  • Low interconnect bandwidth
• Enjoy the positive aspects
  • Energy efficiency
  • Low cost
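Since the slide leans on OmpSs, a minimal sketch of the model may help: the programmer writes plain C and marks tasks with data directions, and the runtime derives the dependence graph from those annotations, scheduling tasks with lookahead and overlapping data movement with computation. The array size, block size, and kernel names below are illustrative, not from the talk.

```c
/* Minimal OmpSs sketch: per-block tasks with declared data
 * directions. The runtime builds the task graph from the
 * in/out/inout clauses; independent blocks can run concurrently,
 * possibly in different address spaces. */
#include <stdio.h>

#define N  1024
#define BS 256   /* illustrative block size */

void init_block(double *b, int n)  { for (int i = 0; i < n; i++) b[i] = 1.0; }
void scale_block(double *b, int n) { for (int i = 0; i < n; i++) b[i] *= 2.0; }

int main(void)
{
    static double v[N];

    for (int i = 0; i < N; i += BS) {
        #pragma omp task out(v[i;BS])    /* produces one block         */
        init_block(&v[i], BS);

        #pragma omp task inout(v[i;BS])  /* depends only on that block */
        scale_block(&v[i], BS);
    }
    #pragma omp taskwait                 /* drain the whole task graph */

    printf("v[0] = %f\n", v[0]);
    return 0;
}
```

The v[i;BS] array sections (start; length) are what give the runtime per-block dependences: it can copy a block to a remote address space, run the task there, and prefetch the next block in the graph, which is the overlap shown on the slide. A plain C compiler ignores the pragmas and runs the same code serially.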
A big challenge, and a huge opportunity for Europe
[Chart: GFLOPS/W roadmap, 2011-2017: prototypes built with the best of the market today, then with the best that is coming]
• What is the best that we could do?
  • 256 nodes, 250 GFLOPS, 1.7 kW
• Prototypes are critical to accelerate software development
  • System software stack + applications

Very high expectations ...
• High media impact of ARM-based HPC
  • Scientific, HPC, and general press quote the Mont-Blanc objectives
  • Highlighted by Eric Schmidt, Google Executive Chairman, at the EC's Innovation Convention

The hype curve
[Chart: the Gartner hype cycle: visibility vs. time, from the Technology Trigger through the Peak of Inflated Expectations, the Trough of Disillusionment, and the Slope of Enlightenment to the Plateau of Productivity]
• We'll see how deep it gets on the way down ...

Project goals
• To develop a European Exascale approach based on embedded power-efficient technology
• Objectives
  • Develop a first prototype system, limited by available technology
  • Design a Next Generation system to overcome the limitations
  • Develop a set of Exascale applications targeting the new system