Maximum Performance Computing for Exascale Applications

Transcription

Oskar Mencer
July 2012
Challenges
Scientific Computing is a small market with a large impact on society:
Medicine, Earth Science, Physics, Chemistry, BioChemistry, ...
Efficiency
§  What is the maximum amount of computation per Watt we could get?
§  Exascale: ExaBytes at Exaflops
§  Operational Costs: Exa$ and ExaWatts?
Microprocessors
§  ISCA makes computer architecture research boring
§  Intel-ISA dominance
§  Von Neumann Architecture
§  IEEE Floating Point abstraction
§  If performance depends on data movement, Amdahl's Law does not apply.
Parallel Programming
§  David May: Compilers improve 2x in >=10 years
(but SW efficiency halves every 18 months)
§  Parallel Programming is HARD
§  Reading parallel programs is “impossible”
Limits of Computation
Objective: Maximum Performance Computing (MPC)
What is the fastest we can compute desired results?
Conjecture:
Data movement is the real limit on computation.
Maximum Performance Computing (MPC)
Less Data Movement = Less Data * Less Movement
The journey will take us through:
1.  Information Theory: Kolmogorov Complexity
2.  Optimisation via Kahneman and Von Neumann
3.  Real World Dataflow Implications and Results
Kolmogorov Complexity (K)
Definition (Kolmogorov):
“If a description of string s, d(s), is of minimal length, […]
it is called a minimal description of s. Then the length of
d(s), […] is the Kolmogorov complexity of s, written K(s),
where K(s) = |d(s)|”
Of course, K(s) depends heavily on the language L
used to write the description d(s)
(e.g. Java, Esperanto, an executable file, etc.)
Kolmogorov, A.N. (1965). "Three Approaches to the Quantitative Definition of Information". Problems Inform. Transmission 1 (1): 1–7.
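To make the definition concrete, a sketch in C (illustrative, not from the talk): the whole program below is a description, a few dozen bytes long, of a string of one million 'a' characters, so K(s) in the language "C programs" is tiny compared with |s|; a random string of the same length admits no description much shorter than itself.

#include <stdio.h>

/* This program is a short description of a long but highly
 * regular string: 10^6 copies of 'a'.  Its own length bounds
 * K(s) from above in the language of C programs. */
int main(void) {
    for (int i = 0; i < 1000000; i++)
        putchar('a');
    return 0;
}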
A Maximum Performance Computing Theorem
For a computational task f, computing the result r
given inputs i, i.e. r = f( i ), or pictorially:
[diagram: i → f → r]
Assuming infinite capacity to compute and remember
inside box f, the time T to compute task f depends on moving
the data in and out of the box.
Thus, for a machine f with infinite memory and
infinitely fast arithmetic, Kolmogorov complexity K(i+r) defines
the fastest way to compute task f.
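As a one-line formalisation, with B the I/O bandwidth of the box (B is my notation, not the slide's):

$T(f) \;\ge\; \frac{K(i + r)}{B}$

since even an infinitely fast box must still move at least the information content of its inputs and results across its boundary.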
SABR model:

$dF_t = \sigma_t F_t^{\beta}\, dW_t$
$d\sigma_t = \alpha\, \sigma_t\, dZ_t$
$\langle dW, dZ \rangle = \rho\, dt$

We integrate in time (Euler in log-forward, Milstein in volatility):

$\ln F_{t+1} = \ln F_t - \tfrac{1}{2}\bigl(\sigma_t e^{(\beta-1)\ln F_t}\bigr)^2 \Delta t + \sigma_t e^{(\beta-1)\ln F_t}\, \Delta W_t$
$\sigma_{t+1} = \sigma_t + \alpha \sigma_t\, \Delta Z_t + \tfrac{1}{2} \alpha^2 \sigma_t \bigl(\Delta Z_t^2 - \Delta t\bigr)$
[diagram: feedback loop between arithmetic logic and the state (σ, F)]
The representation K(σ, F) of the state (σ, F) is critical!
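For concreteness, here is the same update as straight-line C, a minimal sketch: the struct and function names are mine, and the correlated normal increments ΔW, ΔZ (correlation ρ) are assumed to be generated elsewhere.

#include <math.h>

/* One time step of the discretisation above:
 * Euler in the log-forward, Milstein in the volatility. */
typedef struct { double lnF, sigma; } sabr_state;

static sabr_state sabr_step(sabr_state s, double alpha, double beta,
                            double dt, double dW, double dZ)
{
    /* sigma_t * F_t^(beta-1), written as on the slide */
    double v = s.sigma * exp((beta - 1.0) * s.lnF);
    sabr_state n;
    n.lnF   = s.lnF - 0.5 * v * v * dt + v * dW;            /* Euler, log-forward */
    n.sigma = s.sigma + alpha * s.sigma * dZ
            + 0.5 * alpha * alpha * s.sigma * (dZ * dZ - dt); /* Milstein correction */
    return n;
}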
MPC – Bad News
1. Real computers do not have either infinite memory or
infinitely fast arithmetic units.
2. Kolmogorov Theorem. K is not a computable function.
MPC – Good News
Today’s arithmetic units are fast enough.
So in practice...
Kolmogorov Complexity => MPC depends on the Representation of the Problem.
Euclid's Elements, representing a² + b² = c²
17 × 24 = ?
Thinking Fast and Slow
Daniel Kahneman
Nobel Prize in Economics, 2002
back to 17 × 24
Kahneman splits thinking into:
System 1: fast, hard to control ... blurts out 400
System 2: slow, easier to control ... works out 408
Remembering Fast and Slow
John von Neumann, 1946:
“We are forced to recognize the
possibility of constructing a
hierarchy of memories, each of
which has greater capacity than
the preceding, but which is less
quickly accessible.”
Consider Computation and Memory Together
Computing f(x) in the range [a,b] with |E| ≤ 2⁻ⁿ
Table
§  uniform vs non-uniform
§  number of table entries
Table + Arithmetic (+, −, ×, ÷)
§  how many coefficients
§  polynomial or rational approx
Arithmetic (+, −, ×, ÷)
§  continued fractions
§  multi-partite tables
Underlying hardware/technology changes the optimum
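As a sketch of the middle option, Table + Arithmetic, here is a uniform piecewise-linear evaluator in C; all names are illustrative, not from any Maxeler API. Growing the table trades memory for arithmetic, which is exactly the optimum that shifts with the underlying hardware.

/* Approximate f(x) on [a_lo, a_hi] with a uniform table of
 * N linear segments: one table read, one multiply, one add. */
#define N 256
static double a_lo = 0.0, a_hi = 1.0;
static double slope[N], icept[N];

void seg_build(double (*f)(double)) {
    double h = (a_hi - a_lo) / N;
    for (int i = 0; i < N; i++) {
        double x0 = a_lo + i * h;
        slope[i] = (f(x0 + h) - f(x0)) / h;
        icept[i] = f(x0) - slope[i] * x0;
    }
}

double seg_eval(double x) {
    int i = (int)((x - a_lo) / (a_hi - a_lo) * N);
    if (i < 0) i = 0; else if (i >= N) i = N - 1;
    return slope[i] * x + icept[i];
}

For example, seg_build(sin) (with <math.h>) builds the table once; each seg_eval(x) then costs one lookup plus one multiply-add, with error shrinking as N grows.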
MPC in Practice
Tradeoff Representation, Memory and Arithmetic
From Theory to Practice
Optimise Whole Programs
Customise Architecture / Customise Numerics
[diagram: optimisation layers pairing Method, Iteration, Processor with Discretisation, Storage, Bit-Level Representation]
Example: a dataflow graph generated by MaxCompiler,
with 4,866 static dataflow cores in 1 chip.
Mission Impossible?
Maximum Performance Computing (MPC)
Less Data Movement = Less Data * Less Movement
The journey will take us through:
1.  Information Theory: Kolmogorov Complexity
2.  Optimisation via Kahneman and Von Neumann
3.  Real World Dataflow Implications and Results
8 Maxeler DFEs replacing 1,900 Intel CPU cores
presented by ENI at the Annual SEG Conference, 2010
[chart: equivalent CPU cores (0-2,000) vs number of MAX2 cards (1, 4, 8), one curve each for 15Hz, 30Hz, 45Hz and 70Hz peak frequency; baseline: 32 3GHz x86 cores parallelized using MPI]
100 kWatts of Intel cores => 1 kWatt of Maxeler Dataflow Engines
Example: Sparse Matrix Computations
O. Lindtjorn et al, HotChips 2010
Given matrix A, vector b, find vector x in Ax = b.
DOES NOT SCALE BEYOND SIX x86 CPU CORES
MAXELER SOLUTION: 20-40x in 1U
[chart: speedup per 1U node (0-60) vs compression ratio (0-10) for two matrices, GREE0A and 1new01]
Domain Specific Address and Data Encoding (*Patent Pending)
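The patented encoding itself is not given here; as a baseline for what the compression ratio acts on, this is a plain compressed sparse row (CSR) multiply in C, where the ptr/col arrays are the address encoding and val the data encoding that a domain-specific scheme would compress further.

/* Standard CSR sparse matrix-vector multiply, y = A x.
 * Every byte of ptr, col and val must move through the
 * memory system, so compressing them buys speedup. */
typedef struct {
    int n;             /* number of rows                     */
    const int *ptr;    /* row i occupies val[ptr[i]..ptr[i+1]) */
    const int *col;    /* column index of each stored value  */
    const double *val; /* the nonzero values themselves      */
} csr;

void csr_matvec(const csr *A, const double *x, double *y) {
    for (int i = 0; i < A->n; i++) {
        double s = 0.0;
        for (int k = A->ptr[i]; k < A->ptr[i + 1]; k++)
            s += A->val[k] * x[A->col[k]];
        y[i] = s;
    }
}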
Example: JP Morgan Derivatives Pricing
O. Mencer, S. Weston, Journal on Concurrency and Computation, July 2011.
•  Compute value and risk of complex credit derivatives.
•  Moving overnight runs to real-time, intra-day
•  Reported speedup: 220-270x (8 hours => 2 minutes)
•  2011: American Finance Technology Award for Most Cutting Edge IT Initiative
Validated Maximum Performance Computing
customers comparing 1 box from Maxeler (in a deployed system) with 1 box from Intel
Seismic: App1 19x, App2 25x
Weather: 30x
Finance: App1 32x, App2 29x
Fluid Flow: 60x
Sensor Trace Processing: App1 22x, App2 22x
Imaging / Preprocessing: App1 26x, App2 30x
Optimise Whole Programs with Finite Resources
[diagram: SYSTEM 1 (x86 cores, Low Latency Memory) beside SYSTEM 2 (flexible memory + logic, High Throughput Memory); balance computation and memory]
The Ideal System 2 is a Production Line
[same diagram repeated]
1U dataflow cloud providing dynamically scalable compute capability over Infiniband
MPC-X1000
•  8 Vectis dataflow engines (DFEs)
•  192GB of DFE RAM
•  Dynamic allocation of DFEs to conventional CPU servers
   –  Zero-copy RDMA between CPUs and DFEs over Infiniband
•  Equivalent performance to 40-60 x86 servers
Datacenter Qualified Dataflow Solutions
Integrated engines (cards), 1U nodes, racks, MaxelerOS, MaxCompiler
•  High Density DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288GB of RAM
•  MaxWorkstation: desktop dataflow development system
•  The Dataflow Appliance: dense compute with 8 DFEs, 384GB of RAM and dynamic allocation of DFEs to CPU servers with zero-copy RDMA access
•  The Low Latency Appliance: Intel Xeon CPUs and 1-2 DFEs with direct links to up to six 10Gbit Ethernet connections
•  MaxRack: 10, 20 or 40 node rack systems integrating compute, networking & storage
•  MaxCloud: hosted, on-demand, scalable accelerated compute
•  Dataflow Engines: 48GB DDR3, high-speed connectivity and dense configurable logic
Architecture Model
[diagram: a CPU runs the host application on top of SLiC and MaxelerOS, connected through the interconnect to a dataflow engine whose Manager feeds Kernels (dataflow +, ×) and local Memory]
Programming with MaxCompiler
[diagram: applications in C / C++ / Fortran call DFEs through SLiC; kernels and managers are written in MaxJ]
MaxCompiler Development Process
[diagram: CPU with main memory running pure CPU code]
CPU Code (.c):

int *x, *y;
for (int i = 0; i < DATA_SIZE; i++)
    y[i] = x[i] * x[i] + 30;

computing $y_i = x_i \times x_i + 30$
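The slide shows only a fragment; a self-contained version, where the allocation, the DATA_SIZE value and the printed check are my additions, might look like:

#include <stdio.h>
#include <stdlib.h>

#define DATA_SIZE 1024

int main(void) {
    int *x = malloc(DATA_SIZE * sizeof *x);
    int *y = malloc(DATA_SIZE * sizeof *y);
    for (int i = 0; i < DATA_SIZE; i++)
        x[i] = i;                    /* example input */
    for (int i = 0; i < DATA_SIZE; i++)
        y[i] = x[i] * x[i] + 30;     /* the slide's loop */
    printf("y[2] = %d\n", y[2]);     /* 2*2 + 30 = 34 */
    free(x);
    free(y);
    return 0;
}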
MaxCompiler Development Process
[diagram: the CPU code calls the DFE through SLiC and MaxelerOS over PCI Express; the Manager routes streams x and y between PCIe and the kernel's dataflow graph computing x × x + 30]

CPU Code (.c):

#include "MaxSLiCInterface.h"
#include "Calc.max"

int *x, *y;
/* the loop y[i] = x[i] * x[i] + 30 now runs on the DFE */
Calc(x, y, DATA_SIZE);

Manager (.java):

Manager m = new Manager("Calc");
Kernel k = new MyKernel();
m.setKernel(k);
m.setIO(
    link("x", PCIE),
    link("y", PCIE));
m.addMode(modeDefault());
m.build();

MyKernel (.java):

HWVar x = io.input("x", hwInt(32));
HWVar result = x * x + 30;
io.output("y", result, hwInt(32));
MaxCompiler Development Process
[diagram: as above, but the kernel output y now streams into the DFE's on-card memory instead of back over PCI Express]

CPU Code (.c):

#include "MaxSLiCInterface.h"
#include "Calc.max"

int *x, *y;
device = max_open_device(maxfile, "/dev/maxeler0");
Calc(x, DATA_SIZE);

Manager (.java):

Manager m = new Manager("Calc");
Kernel k = new MyKernel();
m.setKernel(k);
m.setIO(
    link("x", PCIE),
    link("y", DRAM_LINEAR1D));
m.addMode(modeDefault());
m.build();

MyKernel (.java):

HWVar x = io.input("x", hwInt(32));
HWVar result = x * x + 30;
io.output("y", result, hwInt(32));