The EXA-DUNE project

Transcription

The EXA-DUNE project
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
Dune vs. 1 000 000 000 000 000 000
wissen leben
WWU Münster
Christian Engwer
joint work with P. Bastian, O. Ippisch, J. Fahlke,
D. Göddeke, S. Müthing, D. Ribbrock, S. Turek
2.04.2016,
OPM Meeting
The EXA-DUNE project
2006 Discussion in the German Research Foundation (DFG) on the necessity of a funding
initiative for HPC software.
2010 Discussion with DFG’s Executive Committee. Suggestion of a flexible, strategically
initiated SPP.
Initiative out of Germany HPC community, referring to increasing activities on HPC
software elsewhere (USA: NSF, DOE; Japan; China; G8)
2011 Submission of the proposal, international reviewing, and formal acceptance.
2012 13 full proposals accepted for funding SPPEXA 1.
Review of project sketches and full proposals.
2013 Official start of SPPEXA 1.
2014 Call for proposals SPPEXA 2, incl. international partners (France, Japan)
2015 16 full proposals accepted for funding SPPEXA 2 (7 collaborations with Japan / 3
collaborations with France).
2016 Official start of SPPEXA 2.
2020 Expected arrival of EXA-scale machines ;-)
WWU Münster
Christian Engwer
02.06.2016
1 /22
wissen leben
WWU Münster
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
2006 Discussion in the German Research Foundation (DFG) on the necessity of a funding
initiative for HPC software.
2010 Discussion with DFG’s Executive Committee. Suggestion of a flexible, strategically
initiated SPP.
Initiative out of Germany HPC community, referring to increasing activities on HPC
software elsewhere (USA: NSF, DOE; Japan; China; G8)
2011 Submission of the proposal, international reviewing, and formal acceptance.
2012 13 full proposals accepted for funding SPPEXA 1.
Review of project sketches and full proposals.
2013 Official start of SPPEXA 1.
2014 Call for proposals SPPEXA 2, incl. international partners (France, Japan)
2015 16 full proposals accepted for funding SPPEXA 2 (7 collaborations with Japan / 3
collaborations with France).
2016 Official start of SPPEXA 2.
2020 Expected arrival of EXA-scale machines ;-)
6 years HPC related project funding
WWU Münster
Christian Engwer
02.06.2016
1 /22
wissen leben
WWU Münster
Westfälische
Wilhelms-Universität
Münster
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
2 /22
EXA-DUNE
DUNE’s Framework approach to software development
Integrated toolbox of simulation components
Existing body of complex applications
I
Good performance + scalability for traditional MPI model
wissen leben
WWU Münster
I
I
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
2 /22
EXA-DUNE
Challenges:
Standard low order algorithms do not scale any more
Incorporate new algorithms, hardware paradigms
I
Integrate changes across simulation stages (Ahmdahl’s Law)
I
Provide “reasonable” upgrade path for existing applications
wissen leben
WWU Münster
I
I
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
2 /22
EXA-DUNE
DUNE + FEAST = Flexibility + Performance
I General Software Frameworks
→ co-designed to specific hardware platforms is not sufficient
I
Hardware-Oriented Numerics
wissen leben
WWU Münster
→ design/choose algorithms with hardware in mind
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
3 /22
Hardware Challenges
I
Multiple Levels of concurrency
MPI-parallel, Multi-core, SIMD
I
Memory-wall
wissen leben
WWU Münster
I
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
3 /22
Hardware Challenges
I
Multiple Levels of concurrency
I
MPI-parallel, Multi-core, SIMD
I
Memory-wall
WWU Münster
Christian Engwer
wissen leben
WWU Münster
Example Intel Xeon E5-2698v3 (Haswell)
I Advertised peak performance: 486.4 GFlop/s
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
3 /22
Hardware Challenges
I
Multiple Levels of concurrency
I
MPI-parallel, Multi-core, SIMD
I
Memory-wall
WWU Münster
Christian Engwer
wissen leben
WWU Münster
Example Intel Xeon E5-2698v3 (Haswell)
I Advertised peak performance: 486.4 GFlop/s
I 16 cores → single Core: 30.4 GFlop/s
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
3 /22
Hardware Challenges
I
Multiple Levels of concurrency
I
MPI-parallel, Multi-core, SIMD
I
Memory-wall
WWU Münster
Christian Engwer
wissen leben
WWU Münster
Example Intel Xeon E5-2698v3 (Haswell)
I Advertised peak performance: 486.4 GFlop/s
I 16 cores → single Core: 30.4 GFlop/s
I AVX2+FMA → without FMA: 15.2 GFlop/s
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
3 /22
Hardware Challenges
I
Multiple Levels of concurrency
I
MPI-parallel, Multi-core, SIMD
I
Memory-wall
WWU Münster
Christian Engwer
wissen leben
WWU Münster
Example Intel Xeon E5-2698v3 (Haswell)
I Advertised peak performance: 486.4 GFlop/s
I 16 cores → single Core: 30.4 GFlop/s
I AVX2+FMA → without FMA: 15.2 GFlop/s
I 4× SIMD → without AVX: 3.8 GFlop/s
→ classic, non-parallel code bound by 3.8 GFlop/s
→ you loose 99% Performance
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
4 /22
Outline
1 Introduction
2 EXA-DUNE Overview
3 EXA-DUNE Selected Features
wissen leben
WWU Münster
Linear Algebra
Assembly
High-level SIMD techniques
Multiple RHS
4 Outlook
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
5 /22
Hybrid Parallelism
MPI
M
UMA
SIMD
CPU
L2
L2
L1
L1
P
P
...
...
M
CPU
Accl
L2
IC
IC
IC
L1
M
M
M
...
IC
M
P
...
I
Coarse-grained: MPI between heterogeneous nodes
I
Medium-grained: multicore-CPUs, GPUs, MICs, APUs, . . .
I
Fine-grained: vectorization, GPU ‘threads’, . . .
wissen leben
WWU Münster
M
M
CPU
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
5 /22
Hybrid Parallelism
MPI
M
UMA
SIMD
CPU
L2
L2
L1
L1
P
P
...
...
M
CPU
Accl
L2
IC
IC
IC
L1
M
M
M
...
IC
M
P
...
I
Coarse-grained: MPI between heterogeneous nodes
I
Medium-grained: multicore-CPUs, GPUs, MICs, APUs, . . . TBB
I
Fine-grained: vectorization, GPU ‘threads’, . . . ???
wissen leben
WWU Münster
M
M
CPU
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
6 /22
Algorithm Design
Hardware-Oriented Numerics
wissen leben
WWU Münster
Challenge: Combine flexibility and hardware efficiency
I Existing codes no longer run faster automatically
I Standard low order algorithms do not scale any more
I Much more than a pure implementational issue
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
6 /22
Algorithm Design
Hardware-Oriented Numerics
Challenge: Combine flexibility and hardware efficiency
I
higher order:
Discontinuous Galerkin, Reduced
Basis
I
virtual local refinement:
global unstructured mesh, local
tensor-prduct mesh
I
multiple right-hand-sides:
horizontal SIMD-vectorization
wissen leben
WWU Münster
Solution: Locally structured, globally unstructured data
→ increased algorithmic intensity and vectorization
Concepts:
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
6 /22
Algorithm Design
Hardware-Oriented Numerics
Challenge: Combine flexibility and hardware efficiency
I
higher order:
Discontinuous Galerkin, Reduced
Basis
I
virtual local refinement:
global unstructured mesh, local
tensor-prduct mesh
I
multiple right-hand-sides:
horizontal SIMD-vectorization
wissen leben
WWU Münster
Solution: Locally structured, globally unstructured data
→ increased algorithmic intensity and vectorization
Concepts:
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
6 /22
Algorithm Design
Hardware-Oriented Numerics
Challenge: Combine flexibility and hardware efficiency
Solution: Locally structured, globally unstructured data
→ increased algorithmic intensity and vectorization
Concepts:
higher order:
Discontinuous Galerkin, Reduced
Basis
I
virtual local refinement:
global unstructured mesh, local
tensor-prduct mesh
I
multiple right-hand-sides:
horizontal SIMD-vectorization

x0,0 x1,0
x0,1 x1,1
A· x0,2 x1,2
..
x0,M x1,M

















. . . xN,0   b0,0
. . . xN,1   b0,1
 
. . . xN,2  =  b0,2
..  
 
 
b0,M
. . . xN,M
b1,0
b1,1
b1,2
..
b1,M

. . . bN,0 
. . . bN,1 

. . . bN,2 
.. 


. . . bN,M
wissen leben
WWU Münster
I
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
7 /22
dune-common
I Multi-Threading support in DUNE (via TBB)
dune-grid
I Hybrid parallelization of DUNE grids (on-top of the grid interface)
I Virtual-refinement of unstructured DUNE grids
dune-istl
I Hybrid parallelization
I Performance portable Matrix format
I (vertically) vectorized preconditioners (incl. CUDA versions)
I (horizontally) vectorized solvers (multiple RHS)
dune-pdelab
I Sum-Factorization DG
I Special-purpose high-performance assemblers
wissen leben
WWU Münster
Features in EXA-DUNE
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
7 /22
dune-common
I Multi-Threading support in DUNE (via TBB)
dune-grid
I Hybrid parallelization of DUNE grids (on-top of the grid interface)
I Virtual-refinement of unstructured DUNE grids
dune-istl
I Hybrid parallelization
I Performance portable Matrix format
I (vertically) vectorized preconditioners (incl. CUDA versions)
I (horizontally) vectorized solvers (multiple RHS)
dune-pdelab
I Sum-Factorization DG
I Special-purpose high-performance assemblers
wissen leben
WWU Münster
Features in EXA-DUNE
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
8 /22
SELL-C-σ as Cross-Platform Matrix Format
Kreutzer, Hager, Wellein, Fehske, Bishop ’13
I
Adopt SELL-C -σ as unified matrix format in DUNE
(including assembly phase → dune-pdelab)
I
Differentiate by memory domain (host, Xeon Phi, GPU)
I
Memory domain and C via extended C++ allocator interface
wissen leben
WWU Münster
Block-sorted and chunked ELL (SELL-C -σ) with suitable tuning offers
competitive performance across modern architectures (CPU, GPGPU,
Xeon Phi)
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
9 /22
Blocked SELL Format for DG discretizations
Choi, Singh, Vuduc ’10
I
Vectorize algorithms by
operating on multiple
independent blocks
simultaneously
I
Tradeoff between
vectorization and cache
usage at larger block sizes
···
memory layout of data and column index arrays
[Muething et.al. 2013]
WWU Münster
Christian Engwer
02.06.2016
wissen leben
WWU Münster
Move SIMD from scalar
level to block level
compressed block
column index matrix
I
compressed data matrix
Interleaved storage of blocks from C block rows
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
10 /22
Hybrid-parallel DUNE Grid
Strategies for multi-threaded assembly of matrices and vectors
I Thread parallelization on-top of the DUNE grid interface
Evaluation of partitioning strategies:
strided
I
ranged
sliced
tensor
. . . and data access strategies:
I
I
I
batched: Batched writeback with global lock.
elock: One lock per mesh entity
coloring: partitions of the same color do not “touch”.
other strategies not considered here:
I
I
WWU Münster
global locking → as bad as expected.
race-free schemes → in general not possible not possible.
Christian Engwer
wissen leben
WWU Münster
I
,
,
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
11 /22
Evaluation
Evaluation of partitioning strategies
. . . and data access strategies
I Best partitioning strategy:
I Best strategy: entity-wise locking
(no benefit from coloring)
CPU
CPU
PHI
PHI
PHI
k
E10
E20
E60
E120
E240
0
1
2
3
4
5
62%
62%
72%
79%
87%
88%
42%
42%
46%
50%
49%
51%
75%
84%
90%
92%
43%
57%
69%
72%
21%
30%
38%
41%
Runtimes per dof, degree k, jacobian, sliced partitioning, entity-wise locking
Evaluation: 60-core Xeon PHIs and 10-core CPU’s
I Full benefit of Xeon PHI will require vectorization of user code
→ vectorization: non-trivial, work in progress
WWU Münster
Christian Engwer
02.06.2016
wissen leben
WWU Münster
ranges of cells
→ memory efficient and fast
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
11 /22
Evaluation
Evaluation of partitioning strategies
. . . and data access strategies
I Best partitioning strategy:
I Best strategy: entity-wise locking
(no benefit from coloring)
CPU
CPU
PHI
PHI
PHI
k
E10
E20
E60
E120
E240
0
1
2
3
4
5
62%
62%
72%
79%
87%
88%
42%
42%
46%
50%
49%
51%
75%
84%
90%
92%
43%
57%
69%
72%
21%
30%
38%
41%
Runtimes per dof, degree k, jacobian, sliced partitioning, entity-wise locking
Evaluation: 60-core Xeon PHIs and 10-core CPU’s
I Full benefit of Xeon PHI will require vectorization of user code
→ vectorization: non-trivial, work in progress
WWU Münster
Christian Engwer
02.06.2016
wissen leben
WWU Münster
ranges of cells
→ memory efficient and fast
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
12 /22
Vectorizing over Elements
User provides local Operator to compute local Stiffness matrix
Patch Grid
reduce costs of unstructured
meshes
I
I
I
I
extract subset of mesh
store as flat grid
store in consecutive arrays,
without pointers
wissen leben
WWU Münster
I
I
add structured refinement
I
I
exploit local structure
improve data locality
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
12 /22
Vectorizing over Elements
User provides local Operator to compute local Stiffness matrix
I
Patch Grid
reduce costs of unstructured
meshes
I
I
I
I
(0)
extract subset of mesh
store as flat grid
store in consecutive arrays,
without pointers
v0
local coefficients vector
v1
v2
v3
add structured refinement
I
I
exploit local structure
improve data locality
WWU Münster
0,0 1,0 2,0 3,0 0,1 1,1 2,1 3,1 0,2 1,2 2,2 3,2 0,3 1,3 2,3 3,3 0,4 1,4 2,4 3,4
patch coefficients vector
Christian Engwer
02.06.2016
wissen leben
WWU Münster
I
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
12 /22
Vectorizing over Elements
User provides local Operator to compute local Stiffness matrix
I
Patch Grid
reduce costs of unstructured
meshes
I
I
I
I
(1)
extract subset of mesh
store as flat grid
store in consecutive arrays,
without pointers
v0
local coefficients vector
v1
v2
v3
add structured refinement
I
I
exploit local structure
improve data locality
WWU Münster
0,0 1,0 2,0 3,0 0,1 1,1 2,1 3,1 0,2 1,2 2,2 3,2 0,3 1,3 2,3 3,3 0,4 1,4 2,4 3,4
patch coefficients vector
Christian Engwer
02.06.2016
wissen leben
WWU Münster
I
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
12 /22
Vectorizing over Elements
User provides local Operator to compute local Stiffness matrix
I
Patch Grid
reduce costs of unstructured
meshes
I
I
I
I
(2)
extract subset of mesh
store as flat grid
store in consecutive arrays,
without pointers
v0
local coefficients vector
v1
v2
v3
add structured refinement
I
I
exploit local structure
improve data locality
WWU Münster
0,0 1,0 2,0 3,0 0,1 1,1 2,1 3,1 0,2 1,2 2,2 3,2 0,3 1,3 2,3 3,3 0,4 1,4 2,4 3,4
patch coefficients vector
Christian Engwer
02.06.2016
wissen leben
WWU Münster
I
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
12 /22
Vectorizing over Elements
User provides local Operator to compute local Stiffness matrix
I
Patch Grid
reduce costs of unstructured
meshes
I
I
I
I
(3)
extract subset of mesh
store as flat grid
store in consecutive arrays,
without pointers
v0
local coefficients vector
v1
v2
v3
add structured refinement
I
I
exploit local structure
improve data locality
WWU Münster
0,0 1,0 2,0 3,0 0,1 1,1 2,1 3,1 0,2 1,2 2,2 3,2 0,3 1,3 2,3 3,3 0,4 1,4 2,4 3,4
patch coefficients vector
Christian Engwer
02.06.2016
wissen leben
WWU Münster
I
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
12 /22
Vectorizing over Elements
User provides local Operator to compute local Stiffness matrix
I
Patch Grid
reduce costs of unstructured
meshes
I
I
I
I
(4)
extract subset of mesh
store as flat grid
store in consecutive arrays,
without pointers
v0
local coefficients vector
v1
v2
v3
add structured refinement
I
I
exploit local structure
improve data locality
WWU Münster
0,0 1,0 2,0 3,0 0,1 1,1 2,1 3,1 0,2 1,2 2,2 3,2 0,3 1,3 2,3 3,3 0,4 1,4 2,4 3,4
patch coefficients vector
Christian Engwer
02.06.2016
wissen leben
WWU Münster
I
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
12 /22
Vectorizing over Elements
User provides local Operator to compute local Stiffness matrix
I
Patch Grid
reduce costs of unstructured
meshes
I
I
I
I
(5)
extract subset of mesh
store as flat grid
store in consecutive arrays,
without pointers
v0
local coefficients vector
v1
v2
v3
add structured refinement
I
I
exploit local structure
improve data locality
0,0 1,0 2,0 3,0 0,1 1,1 2,1 3,1 0,2 1,2 2,2 3,2 0,3 1,3 2,3 3,3 0,4 1,4 2,4 3,4
patch coefficients vector
wissen leben
WWU Münster
I
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
13 /22
Vectorizing over Elements (2)
−∆u = 0
I
Conforming FEM, Lagrange, Q1 , three level virtual refinement
16-Core Intel Haswell E5-2698v3
I
I
I
SIMD
AVX
AVX
Advertised 486.4 GFlop/s,
Peak w.o. FMA: 243.2 GFlop/s (→ %avail)
Bandwidth: STREAM 2×40 GByte/s
lanes
1
1
4
4
threads
16
32
16
32
GByte/s
26.73
34.32
62.89
73.01
%effBW
33.4
42.9
78.8
91.3
GFlop/s
9.758
12.53
22.96
26.65
%avail
4.0
5.2
9.4
11.0
wissen leben
WWU Münster
I
on Ω + BC
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
14 /22
Interface Challenges
Vectorization of local operator (user code)
application
pdelab
user code
Assembler
local operator
local function
space
time integrator
grid operator
grid function
space
common
grid
geometry
istl
wissen leben
WWU Münster
core modules
localfunctions
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
14 /22
Interface Challenges
Vectorization of local operator (user code)
application
pdelab
user code
Assembler
local operator
local function
space
time integrator
grid operator
grid function
space
common
WWU Münster
grid
geometry
istl
localfunctions
I
Intrinsics (non-portable)
I
Special language (needs special compiler)
I
Autovectorizer (difficult to drive portably)
Christian Engwer
wissen leben
WWU Münster
core modules
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
14 /22
Interface Challenges
Vectorization of local operator (user code)
application
user code
Assembler
local operator
local function
space
time integrator
grid operator
grid function
space
core modules
common
grid
geometry
istl
localfunctions
I
Intrinsics (non-portable)
I
Special language (needs special compiler)
I
Autovectorizer (difficult to drive portably)
→ Vectorization library (e.g. Vc, Boost.SIMD, NGSolve)
WWU Münster
Christian Engwer
02.06.2016
wissen leben
WWU Münster
pdelab
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
15 /22
Programming Approaches
Intrinsics
(non-portable)
I
Special language
(needs special compiler)
I
Autovectorize
(difficult to drive portably)
WWU Münster
↓
Christian Engwer
wissen leben
WWU Münster
I
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
15 /22
Programming Approaches
Intrinsics
(non-portable)
I
Special language
(needs special compiler)
I
Autovectorize
(difficult to drive portably)
WWU Münster
↓
Christian Engwer
wissen leben
WWU Münster
I
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
15 /22
Programming Approaches
Intrinsics
(non-portable)
I
Special language
(needs special compiler)
I
Autovectorize
(difficult to drive portably)
vectorization library
I
I
I
↓
wissen leben
WWU Münster
I
Hide instrinsics beneath a portable
interface
Implementations: e.g. Vc, Boost.SIMD,
NGSolve
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
16 /22
Features of Vc
Storage: Vc::SimdArray<T,N>
Abstracts a vector of type T with size N
template < typename T >
SimdArray <T ,N > operator *( SimdArray <T ,N > a , SimdArray <T ,N > b )
{
for ( std :: size_t i =0; i < N ; i ++) a [ i ] *= b [ i ];
return a ;
}
WWU Münster
Christian Engwer
02.06.2016
wissen leben
WWU Münster
SIMD Operator overloads: +, −, · , /, etc.
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
16 /22
Features of Vc
Storage: Vc::SimdArray<T,N>
Abstracts a vector of type T with size N
template < typename T >
SimdArray <T ,N > operator *( SimdArray <T ,N > a , SimdArray <T ,N > b )
{
for ( std :: size_t i =0; i < N ; i ++) a [ i ] *= b [ i ];
return a ;
}
Limitations:
Not all operations are supported on SIMD data types
In particular
I No implicit cast from float → double
I No branching, e.g. if(c) x=a else x=b
Alternative: x = c?a:b; → x = cond(c,a,b);
wissen leben
WWU Münster
SIMD Operator overloads: +, −, · , /, etc.
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
17 /22
Horizontal Vectorization for multiple RHS
I
Many applications require solves for many right-hand-sides
I
I
wissen leben
WWU Münster
I
Multi-Scale FEM
Inverse problems
...
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
17 /22
Horizontal Vectorization for multiple RHS
Many applications require solves for many right-hand-sides
Multi-Scale FEM
Inverse problems
...
I
I
I
I
This corresponds to
foreach
i ∈ [0, N] : solve Axi = bi
→
solve AX = B
wissen leben
WWU Münster
I
with X = (x0 , ..xN ), B = (b0 , ..bN )
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
17 /22
Horizontal Vectorization for multiple RHS
Many applications require solves for many right-hand-sides
Multi-Scale FEM
Inverse problems
...
I
I
I
I
This corresponds to
foreach
I
i ∈ [0, N] : solve Axi = bi
→
solve AX = B
with X = (x0 , ..xN ), B = (b0 , ..bN )
Templated FEM-code allows to implement vectorized assembly and
solvers, e.g.
template<typename T> Dune::BlockVector
and
wissen leben
WWU Münster
I
template<class X> class BiCGSTABSolver : public InverseOperator<X,X>
template<class M, class X, class Y> class AssembledLinearOperator
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
18 /22
Modifications to the linear Algebra code
Dune::BlockVector<double>
→
Dune::BlockVector<Vc::SimdArray<double,N>>



X =


x0,0
x0,1
x0,2
x1,0
x1,1
x1,2
..
.
x2,0
x2,1
x2,2
...
...
...
xN,0
xN,1
xN,2
..
.
x0,M
x1,M
x2,M
...
xN,M







Memory layout: Corresponds to M × N dense-matrix with row-major
storage.
wissen leben
WWU Münster

,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
18 /22
Modifications to the linear Algebra code
Dune::BlockVector<double>
→
Dune::BlockVector<Vc::SimdArray<double,N>>



X =


x0,0
x0,1
x0,2
x1,0
x1,1
x1,2
..
.
x2,0
x2,1
x2,2
...
...
...
xN,0
xN,1
xN,2
..
.
x0,M
x1,M
x2,M
...
xN,M







Memory layout: Corresponds to M × N dense-matrix with row-major
storage.
And smaller SIMD-aware modifications to the Solvers
wissen leben
WWU Münster

,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
19 /22
Application: EEG Source Reconstruction
Cooperation UKM (Münster), BESA GmbH (München)
I
EEG measurements
I
> 200 electrodes
I
measure surface potential
I
governing equation
∇u · n = 0
on Ω
on ∂Ω
Source: Wikipedia
Goal: reconstruct brain activity
I Inverse modelling approach
I L2 regulerization
I Fidelity term on potential at electrodes
wissen leben
WWU Münster
−∇ · K ∇u = f
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
19 /22
Application: EEG Source Reconstruction
Cooperation UKM (Münster), BESA GmbH (München)
I
EEG measurements
I
> 200 electrodes
I
measure surface potential
I
governing equation
∇u · n = j
on Ω
on ∂Ω
Source: Wikipedia
Goal: reconstruct brain activity
I Inverse modelling approach
I L2 regulerization
I Fidelity term on potential at electrodes
wissen leben
WWU Münster
−∇ · K ∇u = 0
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
20 /22
Performance Measurements
I
SIMD-vectorized AMG-CG solver
Solve for 256 RHS, 300K Cells, 60K Vertices
Timing for Haswell-EP (E5-2698v3, 16 cores, AVX2, 4 lanes)
I 1-16 cores
I no SIMD, AVX (4 lanes), AVX (4 × 4 lanes)
I speedup 50 (max theoretical speedup 64)
Speeup
120
60
100
50
80
40
speedup
sec
Wall time for 256 electrodes
60
40
30
20
20
10
0
4x4
16
8
4
2
1
4
1
es
lan
0
cores
WWU Münster
1
16
8
4
2
1
4
4x4
es
lan
cores
Christian Engwer
02.06.2016
wissen leben
WWU Münster
I
I
,
,
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
21 /22
Outlook
dune-common
I Multi-Threading support in DUNE (via TBB)
dune-grid
I Hybrid parallelization of DUNE grids (on-top of the grid interface)
I Virtual-refinement of unstructured DUNE grids
dune-istl
I Hybrid parallelization
I Performance portable Matrix format
I (vertically) vectorized preconditioners (incl. CUDA versions)
I (horizontally) vectorized solvers (multiple RHS)
dune-pdelab
I Sum-Factorization DG
I Special-purpose high-performance assemblers
wissen leben
WWU Münster
Transition from EXA-DUNE to mainline?!
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
21 /22
Outlook
dune-common
X Multi-Threading support in DUNE (via TBB)
dune-grid
X Hybrid parallelization of DUNE grids (on-top of the grid interface)
˜ Virtual-refinement of unstructured DUNE grids
dune-istl
X Hybrid parallelization
? Performance portable Matrix format
? (vertically) vectorized preconditioners (incl. CUDA versions)
+ (horizontally) vectorized solvers (multiple RHS)
dune-pdelab
+ Sum-Factorization DG
˜ Special-purpose high-performance assemblers
wissen leben
WWU Münster
Transition from EXA-DUNE to mainline?!
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
22 /22
Outlook II
second phase
Push features upstream
I
Task-parallel scheduling
I
Asynchronicity → latency hiding, communication hiding, . . .
I
Resilience
I
...
wissen leben
WWU Münster
I
,
,
WWU Münster
Christian Engwer
02.06.2016
Westfälische
Wilhelms-Universität
Münster
The EXA-DUNE project
22 /22
Outlook II
second phase
Push features upstream
I
Task-parallel scheduling
I
Asynchronicity → latency hiding, communication hiding, . . .
I
Resilience
I
...
wissen leben
WWU Münster
I
Questions?!.
,
,
WWU Münster
Christian Engwer
02.06.2016

Similar documents