The Mont-Blanc project

Transcription

The Mont-Blanc project
http://www.montblanc-project.eu
The Mont-Blanc approach
towards Exascale
Alex Ramirez
Barcelona Supercomputing Center
Disclaimer: I speak only for myself ... All references to unavailable products are speculative, taken from web sources.
There is no commitment from ARM, Samsung, Intel, Nvidia, or others, implied.
This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 288777.
Outline
• A bit of history
  • Vector supercomputers
  • Commodity supercomputers
  • The next step in the commodity chain
• Supercomputers from mobile components
  • Homogeneous architecture
  • Compute accelerators
  • Rely on OmpSs to handle the challenges
  • BSC prototype roadmap
  • Mont-Blanc project goals
HPC Advisory Council, Malaga
September 13, 2012
In the beginning ... there were only supercomputers
• Built to order
  • Very few of them
• Special purpose hardware
  • Very expensive
• Control Data, Convex, ...
• Cray-1
  • 1975, 160 MFLOPS
  • 80 units, 5-8 M$
• Cray X-MP
  • 1982, 800 MFLOPS
• Cray-2
  • 1985, 1.9 GFLOPS
• Cray Y-MP
  • 1988, 2.6 GFLOPS
• Fortran + vectorizing compilers
The Killer Microprocessors
[Chart: MFLOPS (10 to 10,000, log scale) vs. year (1974–1999), comparing vector machines (Cray-1, Cray-C90, NEC SX4, SX5) against microprocessors (Alpha EV4, EV5, Intel Pentium, IBM P2SC, HP PA8200)]
• Microprocessors killed the vector supercomputers
  • They were not faster ...
  • ... but they were significantly cheaper and greener
• Need 10 microprocessors to achieve the performance of 1 vector CPU
• SIMD vs. MIMD programming paradigms
Then, commodity took over special purpose
• ASCI Red, Sandia
  • 1997, 1 Tflops (Linpack)
  • 9298 cores @ 200 MHz
  • 1.2 Tbytes
  • Intel Pentium Pro
  • Upgraded to Pentium II Xeon, 1999, 3.1 Tflops
• ASCI White, LLNL
  • 2001, 7.3 TFLOPS
  • 8192 proc. @ 375 MHz
  • 6 Tbytes
  • (3+3) MWatts
  • IBM Power 3
• Message-Passing programming models
Finally, commodity hardware + commodity software
• MareNostrum
  • Nov 2004, #4 Top500
  • 20 Tflops, Linpack
  • IBM PowerPC 970 FX
  • Blade enclosure
  • Myrinet + 1 GbE network
  • SuSe Linux
The next step in the commodity chain
[Diagram: the commodity chain: HPC → Servers → Desktop → Mobile]
• Total cores in Jun'12 Top500
  • 13.5 Mcores
• Tablets sold in Q4 2011
  • 27 Mtablets
• Smartphones sold in Q4 2011
  • > 100 Mphones
ARM Processor improvements in DP FLOPS
[Chart: DP ops/cycle (1 to 16, log scale) vs. year (1999–2015): ARM Cortex-A9 and Cortex-A15 vs. Intel SSE2 and AVX, IBM BG/P and BG/Q, and ARMv8]
• ARM quickly moved from optional floating-point to state-of-the-art
• IBM BG/Q and Intel AVX implement DP in 256-bit SIMD
  • 8 DP ops / cycle
• ARMv8 ISA introduces DP in the NEON instruction set (128-bit SIMD)
ARM processor efficiency vs. IBM / Intel / Nvidia
[Chart: GFLOPS/W (0 to 8) vs. release date (Nov 2007 – Nov 2012) for IBM chips, Intel chips, NVIDIA cards, and ARM chips; data points include ARM11 @ 482 MHz (0.5 GFLOPS), Cortex-A9 @ 1 GHz (2 GFLOPS, 2-core), Cortex-A15 @ 2 GHz* (32 GFLOPS, 4-core), and BG/Q @ 1.6 GHz (205 GFLOPS, 16-core)]
* Based on ARM Cortex-A9 @ 2 GHz power consumption on 45nm, not an ARM commitment
Are the “Killer Mobiles™” coming?
[Chart: Cost (log10) vs. Performance (log2): Server ($1500), Desktop ($150), and Mobile ($20) nowadays; HPC-Mobile ($40)? in the near future]
• Where is the sweet spot? Maybe in the low-end ...
  • Today ~ 1:8 ratio in performance, 1:100 ratio in cost
  • Tomorrow ~ 1:2 ratio in performance, still 1:100 in cost?
  • Not so much performance ... but much lower cost, and power
• The same reason why microprocessors killed supercomputers
Killer mobile™ example: Samsung Exynos 5450 *
• 4-core ARM Cortex-A15 @ 2 GHz
  • 32 GFLOPS
• 8-core ARM Mali T685
  • 168 GFLOPS*
• Dual-channel DDR3 memory controller
• All in a low-power mobile socket
* Data from web sources, not an ARM or Samsung commitment
Are we building BlueGene again?
• Yes ...
  • Many small cores vs. single fast core
  • Exploit Pollack's Rule in the presence of abundant parallelism
• ... and No
  • Heterogeneous computing
    • On-chip GPU
  • Commodity vs. special purpose
    • Higher volume
    • Many vendors
    • Lower cost
  • Lots of room for improvement
    • No SIMD / vectors yet ...
  • Build on Europe's embedded strengths
Can we achieve competitive performance?
[Diagram: node architectures with link bandwidths of 64 GB/s, 56 Gb/s, 10-40 Gb/s, and 1 Gb/s]

                      2-socket Intel Sandy Bridge   8-socket ARM Cortex-A15
Peak performance      370 GFLOPS                    256 GFLOPS
Address spaces        1                             8
On-chip memory        44 MB                         16 MB
Memory bandwidth      136 GB/s                      102 GB/s
Intra-node link       64 GB/s (2 x QPI)             1 Gb/s (1 GbE)
Can we achieve competitive performance?

                      Sandy Bridge + Nvidia K20     8-socket Exynos 5450
Peak performance      1685 GFLOPS                   1600 GFLOPS
Address spaces        2                             16
CPU-GPU link          32 GB/s (16x PCIe 3.0)        12.8 GB/s (shared memory)
Memory bandwidth      68 + 192 GB/s                 102 GB/s
Then, what is so good about it?
• Sandy Bridge + Nvidia K20
  • > $3000
  • > 400 Watt
• 8-socket Exynos 5450
  • < $200
  • < 100 Watt
There is no free lunch
• 2X more cores for the same performance
• ½ on-chip memory / core
• 8X more address spaces
• 1 GbE inter-chip communication
OmpSs runtime layer manages architecture complexity
[Diagram: task timelines on processors P0, P1, P2, without overlap vs. with overlap of communication and computation]
• The programmer is exposed to a simple architecture
• The task graph provides lookahead
  • Exploit knowledge about the future
• Automatically handle all of the architecture challenges
  • Strong scalability
  • Multiple address spaces
  • Low cache size
  • Low interconnect bandwidth
• Enjoy the positive aspects
  • Energy efficiency
  • Low cost
A big challenge, and a huge opportunity for Europe
[Chart: GFLOPS/W roadmap, 2011–2017: systems built with the best of the market, then with the best that is coming; what is the best that we could do?]
• Prototypes are critical to accelerate software development
  • 256 nodes, 250 GFLOPS, 1.7 kWatt
  • System software stack + applications
Very high expectations ...
• High media impact of ARM-based HPC
  • Scientific, HPC, and general press quote Mont-Blanc objectives
  • Highlighted by Eric Schmidt, Google Executive Chairman, at the EC's Innovation Convention
The hype curve
[Chart: the Gartner hype curve: Visibility vs. Time, through the Technology Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, and Plateau of Productivity]
• We'll see how deep it gets on the way down ...
Project goals
• To develop a European Exascale approach
  • Based on embedded power-efficient technology
• Objectives
  • Develop a first prototype system, limited by available technology
  • Design a next-generation system to overcome the limitations
  • Develop a set of Exascale applications targeting the new system