The EXA-DUNE project
Westfälische Wilhelms-Universität Münster (WWU Münster)
The EXA-DUNE project: Dune vs. 1 000 000 000 000 000 000
Christian Engwer
Joint work with P. Bastian, O. Ippisch, J. Fahlke, D. Göddeke, S. Müthing, D. Ribbrock, S. Turek
2.04.2016, OPM Meeting
(Footer on every slide: WWU Münster, Christian Engwer, 02.06.2016)

[1/22] The EXA-DUNE project
2006  Discussion in the German Research Foundation (DFG) on the necessity of a funding initiative for HPC software.
2010  Discussion with DFG's Executive Committee. Suggestion of a flexible, strategically initiated SPP. Initiative out of Germany's HPC community, referring to increasing activities on HPC software elsewhere (USA: NSF, DOE; Japan; China; G8).
2011  Submission of the proposal, international reviewing, and formal acceptance.
2012  Review of project sketches and full proposals. 13 full proposals accepted for funding in SPPEXA 1.
2013  Official start of SPPEXA 1.
2014  Call for proposals for SPPEXA 2, incl. international partners (France, Japan).
2015  16 full proposals accepted for funding in SPPEXA 2 (7 collaborations with Japan, 3 collaborations with France).
2016  Official start of SPPEXA 2.
2020  Expected arrival of EXA-scale machines ;-)
→ 6 years of HPC-related project funding

[2/22] EXA-DUNE
DUNE's framework approach to software development:
- Integrated toolbox of simulation components
- Existing body of complex applications
- Good performance + scalability for the traditional MPI model
Challenges:
- Standard low-order algorithms do not scale any more
- Incorporate new algorithms and hardware paradigms
- Integrate changes across simulation stages (Amdahl's Law)
- Provide a "reasonable" upgrade path for existing applications
DUNE + FEAST = Flexibility + Performance
- General software frameworks → being co-designed to specific hardware platforms is not sufficient
- Hardware-oriented numerics → design/choose algorithms with the hardware in mind

[3/22] Hardware Challenges
- Multiple levels of concurrency: MPI-parallel, multi-core, SIMD
- Memory wall
Example: Intel Xeon E5-2698v3 (Haswell)
- Advertised peak performance: 486.4 GFlop/s
- 16 cores → single core: 30.4 GFlop/s
- AVX2+FMA → without FMA: 15.2 GFlop/s
- 4× SIMD → without AVX: 3.8 GFlop/s
→ classic, non-parallel code is bound by 3.8 GFlop/s
→ you lose 99% of the performance

[4/22] Outline
1 Introduction
2 EXA-DUNE Overview
3 EXA-DUNE Selected Features
  - Linear Algebra
  - Assembly
  - High-level SIMD techniques
  - Multiple RHS
4 Outlook
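The peak-performance breakdown on the Hardware Challenges slide is plain arithmetic and can be checked with a few lines of C++. This is a minimal sketch: the struct and function names are hypothetical, and the factor 2 for FMA assumes that one fused multiply-add is counted as two floating-point operations, which is what the slide's step from 30.4 to 15.2 GFlop/s implies.

```cpp
#include <cassert>

// Peak-performance breakdown, following the arithmetic on the slide.
// All names here are illustrative, not taken from any library.
struct PeakBreakdown {
    double socket;         // advertised peak of the whole chip, GFlop/s
    double per_core;       // one core, SIMD + FMA
    double without_fma;    // one core, SIMD but no FMA
    double scalar;         // one core, no SIMD, no FMA
    double lost_fraction;  // share of the advertised peak out of reach for scalar serial code
};

inline PeakBreakdown peak_breakdown(double socket_gflops, int cores, int simd_lanes) {
    assert(cores > 0 && simd_lanes > 0);
    PeakBreakdown b{};
    b.socket = socket_gflops;
    b.per_core = socket_gflops / cores;
    b.without_fma = b.per_core / 2.0;  // assumption: FMA counts as two flops
    b.scalar = b.without_fma / simd_lanes;
    b.lost_fraction = 1.0 - b.scalar / socket_gflops;
    return b;
}
```

Calling `peak_breakdown(486.4, 16, 4)` reproduces the slide's 30.4, 15.2 and 3.8 GFlop/s; the lost fraction is 1 - 1/128, a little over 99%.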
[5/22] Hybrid Parallelism
[Figure: node architecture diagram with the labels MPI, UMA, SIMD, CPU, Accl, M, L1, L2, P, IC]
- Coarse-grained: MPI between heterogeneous nodes
- Medium-grained: multicore CPUs, GPUs, MICs, APUs, ... → TBB
- Fine-grained: vectorization, GPU 'threads', ... → ???

[6/22] Algorithm Design: Hardware-Oriented Numerics
Challenge: Combine flexibility and hardware efficiency
- Existing codes no longer run faster automatically
- Standard low-order algorithms do not scale any more
- Much more than a pure implementation issue
Solution: Locally structured, globally unstructured data
→ increased algorithmic intensity and vectorization
Concepts:
- higher order: Discontinuous Galerkin, Reduced Basis
- virtual local refinement: global unstructured mesh, local tensor-product mesh
- multiple right-hand sides: horizontal SIMD vectorization
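To illustrate what "locally structured" buys, the following sketch shows index arithmetic on a tensor-product patch. It rests on assumptions the slides do not state: a single quadrilateral coarse cell, uniform bisection per refinement level, and lexicographic vertex numbering. `Patch2D` is a hypothetical helper, not a DUNE class.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of DUNE: the structured patch obtained by
// refining one quadrilateral coarse cell `levels` times.
struct Patch2D {
    std::size_t n;  // sub-cells per direction, n = 2^levels

    explicit Patch2D(unsigned levels) : n(std::size_t{1} << levels) {}

    std::size_t cells() const { return n * n; }
    std::size_t vertices() const { return (n + 1) * (n + 1); }

    // Lexicographic index of vertex (i, j) inside the patch.
    std::size_t vertex(std::size_t i, std::size_t j) const {
        assert(i <= n && j <= n);
        return j * (n + 1) + i;
    }

    // The four vertex indices of sub-cell (i, j): computed from (i, j),
    // so no per-cell connectivity table has to be stored or loaded.
    void cell_vertices(std::size_t i, std::size_t j, std::size_t out[4]) const {
        assert(i < n && j < n);
        out[0] = vertex(i, j);
        out[1] = vertex(i + 1, j);
        out[2] = vertex(i, j + 1);
        out[3] = vertex(i + 1, j + 1);
    }
};
```

With three refinement levels a patch has 8 × 8 sub-cells and 9 × 9 vertices. Only the coarse cell needs unstructured connectivity, while everything inside the patch is addressed by arithmetic on contiguous data.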
Multiple right-hand sides written as one system A · X = B:

$$
A \cdot
\begin{pmatrix}
x_{0,0} & x_{1,0} & \dots & x_{N,0} \\
x_{0,1} & x_{1,1} & \dots & x_{N,1} \\
x_{0,2} & x_{1,2} & \dots & x_{N,2} \\
\vdots & \vdots & & \vdots \\
x_{0,M} & x_{1,M} & \dots & x_{N,M}
\end{pmatrix}
=
\begin{pmatrix}
b_{0,0} & b_{1,0} & \dots & b_{N,0} \\
b_{0,1} & b_{1,1} & \dots & b_{N,1} \\
b_{0,2} & b_{1,2} & \dots & b_{N,2} \\
\vdots & \vdots & & \vdots \\
b_{0,M} & b_{1,M} & \dots & b_{N,M}
\end{pmatrix}
$$

[7/22] Features in EXA-DUNE
dune-common
- Multi-threading support in DUNE (via TBB)
dune-grid
- Hybrid parallelization of DUNE grids (on top of the grid interface)
- Virtual refinement of unstructured DUNE grids
dune-istl
- Hybrid parallelization
- Performance-portable matrix format
- (Vertically) vectorized preconditioners (incl. CUDA versions)
- (Horizontally) vectorized solvers (multiple RHS)
dune-pdelab
- Sum-factorization DG
- Special-purpose high-performance assemblers

[8/22] SELL-C-σ as Cross-Platform Matrix Format
Block-sorted and chunked ELL (SELL-C-σ) with suitable tuning offers competitive performance across modern architectures (CPU, GPGPU, Xeon Phi) [Kreutzer, Hager, Wellein, Fehske, Bishop '13]
- Adopt SELL-C-σ as the unified matrix format in DUNE (including the assembly phase → dune-pdelab)
- Differentiate by memory domain (host, Xeon Phi, GPU)
- Memory domain and C via an extended C++ allocator interface

[9/22] Blocked SELL Format for DG discretizations
Move SIMD from scalar level to block level [Choi, Singh, Vuduc '10]
- Vectorize algorithms by operating on multiple independent blocks simultaneously
- Interleaved storage of blocks from C block rows
- Tradeoff between vectorization and cache usage at larger block sizes
[Figure: memory layout of the data and column index arrays (compressed data matrix, compressed block column index matrix) [Muething et al. 2013]]

[10/22] Hybrid-parallel DUNE Grid
Strategies for multi-threaded assembly of matrices and vectors
- Thread parallelization on top of the DUNE grid interface
Evaluation of partitioning strategies:
- strided
- ranged
- sliced
- tensor
... and data access strategies:
- batched: batched writeback with a global lock
- elock: one lock per mesh entity
- coloring: partitions of the same color do not "touch"
Other strategies not considered here:
- global locking → as bad as expected
- race-free schemes → in general not possible
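The coloring strategy listed above can be sketched with a greedy graph coloring. This is an assumption-laden illustration, not the EXA-DUNE implementation: the function name is hypothetical, and the input is assumed to be a symmetric adjacency list saying which partitions share mesh entities.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// adjacency[p] lists the partitions that share mesh entities with partition p
// (assumed symmetric). Returns one color per partition such that neighbouring
// partitions never get the same color. Greedy, so the number of colors is
// valid but not necessarily minimal.
inline std::vector<int>
color_partitions(const std::vector<std::vector<std::size_t>>& adjacency) {
    const std::size_t n = adjacency.size();
    std::vector<int> color(n, -1);  // -1 means "not colored yet"
    for (std::size_t p = 0; p < n; ++p) {
        // A partition has at most n-1 neighbours, so a color below n is always free.
        std::vector<bool> used(n + 1, false);
        for (std::size_t q : adjacency[p]) {
            assert(q < n);
            if (color[q] >= 0) used[static_cast<std::size_t>(color[q])] = true;
        }
        int c = 0;
        while (used[static_cast<std::size_t>(c)]) ++c;  // smallest free color
        color[p] = c;
    }
    return color;
}
```

Assembly would then loop over the colors one after another and hand all partitions of the current color to different threads; since they do not touch, they can write without locks.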
[11/22] Evaluation
Evaluation of partitioning strategies ... and data access strategies
- Best partitioning strategy: ranges of cells → memory efficient and fast
- Best data access strategy: entity-wise locking (no benefit from coloring)
- Evaluation: 60-core Xeon PHIs and 10-core CPUs

Table: Runtimes per dof, degree k, jacobian, sliced partitioning, entity-wise locking

  k            0     1     2     3     4     5
  E10  (CPU)   62%   62%   72%   79%   87%   88%
  E20  (CPU)   42%   42%   46%   50%   49%   51%
  E60  (PHI)   75%   84%   90%   92%
  E120 (PHI)   43%   57%   69%   72%
  E240 (PHI)   21%   30%   38%   41%

(The PHI rows carry only four values; which degrees k they belong to cannot be recovered from the transcript, so their column alignment above is arbitrary.)

- Full benefit of the Xeon PHI will require vectorization of user code
  → vectorization: non-trivial, work in progress
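A partitioning into "ranges of cells" can be sketched as follows. This assumes that a range means a contiguous block of consecutive cell indices per thread; the slides do not define the strategy in detail, and `cell_range` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Half-open cell index range [first, second) handled by `thread` out of
// `threads`. The ranges are contiguous, disjoint, cover all cells, and
// differ in size by at most one cell.
inline std::pair<std::size_t, std::size_t>
cell_range(std::size_t cells, std::size_t threads, std::size_t thread) {
    assert(threads > 0 && thread < threads);
    const std::size_t base = cells / threads;
    const std::size_t extra = cells % threads;  // the first `extra` ranges get one more cell
    const std::size_t begin = thread * base + (thread < extra ? thread : extra);
    const std::size_t size = base + (thread < extra ? 1 : 0);
    return {begin, begin + size};
}
```

Such a partition needs only two integers per thread instead of an explicit list of cells, which is one plausible reading of "memory efficient" on the slide.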
[12/22] Vectorizing over Elements
User provides the local operator to compute the local stiffness matrix
Patch Grid:
- reduce costs of unstructured meshes
  - extract subset of mesh
  - store as flat grid
  - store in consecutive arrays, without pointers
- add structured refinement
  - exploit local structure
  - improve data locality
[Figure, animation steps (0) to (5): local coefficient vectors v0, v1, v2, v3 and the patch coefficients vector with entries i,j for i = 0..3, j = 0..4]

[13/22] Vectorizing over Elements (2)
- −∆u = 0 on Ω + BC
- Conforming FEM, Lagrange, Q1, three-level virtual refinement
- 16-core Intel Haswell E5-2698v3
  - Advertised 486.4 GFlop/s, peak without FMA: 243.2 GFlop/s (→ %avail)
  - Bandwidth: STREAM 2×40 GByte/s

  SIMD   lanes   threads   GByte/s   %effBW   GFlop/s   %avail
  -      1       16        26.73     33.4     9.758     4.0
  -      1       32        34.32     42.9     12.53     5.2
  AVX    4       16        62.89     78.8     22.96     9.4
  AVX    4       32        73.01     91.3     26.65     11.0

[14/22] Interface Challenges
Vectorization of the local operator (user code)
[Figure: module diagram with the labels application, user code, pdelab (assembler, local operator, local function space, time integrator, grid operator, grid function space) and core modules (common, grid, geometry, istl, localfunctions)]
- Intrinsics (non-portable)
- Special language (needs special compiler)
- Autovectorizer (difficult to drive portably)
→ Vectorization library (e.g. Vc, Boost.SIMD, NGSolve)

[15/22] Programming Approaches
- Intrinsics (non-portable)
- Special language (needs special compiler)
- Autovectorizer (difficult to drive portably)
↓
Vectorization library
- Hide intrinsics beneath a portable interface
- Implementations: e.g. Vc, Boost.SIMD, NGSolve

[16/22] Features of Vc
Storage: Vc::SimdArray<T,N> abstracts a vector of type T with size N
SIMD operator overloads: +, −, ·, /, etc. Conceptually:

    // lane-wise product: illustrates the semantics only
    template <typename T, std::size_t N>
    SimdArray<T,N> operator*(SimdArray<T,N> a, SimdArray<T,N> b)
    {
      for (std::size_t i = 0; i < N; i++)
        a[i] *= b[i];
      return a;
    }

Limitations: not all operations are supported on SIMD data types. In particular:
- No implicit cast from float → double
- No branching, e.g. if (c) x = a; else x = b;
  Alternative: x = c ? a : b; → x = cond(c,a,b);

[17/22] Horizontal Vectorization for multiple RHS
- Many applications require solves for many right-hand sides
  - Multi-Scale FEM
  - Inverse problems
  - ...
- This corresponds to
    foreach i ∈ [0, N]: solve A x_i = b_i → solve AX = B
  with X = (x_0, ..., x_N), B = (b_0, ..., b_N)
- Templated FEM code allows implementing vectorized assembly and solvers, e.g.
    template<typename T> Dune::BlockVector
  and
    template<class X> class BiCGSTABSolver : public InverseOperator<X,X>
    template<class M, class X, class Y> class AssembledLinearOperator
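The idea of getting multiple right-hand sides "for free" from templated code can be sketched without any SIMD library. In the block below, `Lanes<T,N>` is a hypothetical stand-in for a type like `Vc::SimdArray<T,N>`: it only mimics the lane-wise semantics with plain loops and is not the Vc implementation. `cond` is written here as a free function following the slide's notation; whether a given library offers a select operation under that exact name is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a SIMD vector type; lane i belongs to right-hand side i.
template <typename T, std::size_t N>
struct Lanes {
    std::array<T, N> v{};
    T& operator[](std::size_t i) { return v[i]; }
    const T& operator[](std::size_t i) const { return v[i]; }
};

// Lane-wise arithmetic, mirroring the operator overloads on the Vc slide.
template <typename T, std::size_t N>
Lanes<T, N> operator+(Lanes<T, N> a, const Lanes<T, N>& b) {
    for (std::size_t i = 0; i < N; ++i) a[i] += b[i];
    return a;
}

template <typename T, std::size_t N>
Lanes<T, N> operator*(Lanes<T, N> a, const Lanes<T, N>& b) {
    for (std::size_t i = 0; i < N; ++i) a[i] *= b[i];
    return a;
}

// Lane-wise select: the replacement for `x = c ? a : b`, since different
// lanes may need different branches.
template <typename T, std::size_t N>
Lanes<T, N> cond(const Lanes<bool, N>& c, const Lanes<T, N>& a, const Lanes<T, N>& b) {
    Lanes<T, N> r;
    for (std::size_t i = 0; i < N; ++i) r[i] = c[i] ? a[i] : b[i];
    return r;
}

// A scalar product written once for any field type F. With F = double it
// handles one right-hand side; with F = Lanes<double,N> it handles N
// right-hand sides in one sweep and returns N results.
template <typename F>
F dot(const std::vector<F>& x, const std::vector<F>& y) {
    assert(x.size() == y.size());
    F sum{};
    for (std::size_t j = 0; j < x.size(); ++j) sum = sum + x[j] * y[j];
    return sum;
}
```

The same `dot` works for `std::vector<double>` and for `std::vector<Lanes<double,4>>`. Solver code that is templated on the field type picks up the multi-RHS variant in the same way, apart from places that branch on scalar values, such as convergence checks.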
[18/22] Modifications to the linear algebra code
Dune::BlockVector<double> → Dune::BlockVector<Vc::SimdArray<double,N>>

$$
X =
\begin{pmatrix}
x_{0,0} & x_{1,0} & x_{2,0} & \dots & x_{N,0} \\
x_{0,1} & x_{1,1} & x_{2,1} & \dots & x_{N,1} \\
x_{0,2} & x_{1,2} & x_{2,2} & \dots & x_{N,2} \\
\vdots & \vdots & \vdots & & \vdots \\
x_{0,M} & x_{1,M} & x_{2,M} & \dots & x_{N,M}
\end{pmatrix}
$$

Memory layout: corresponds to an M × N dense matrix with row-major storage.
In addition: smaller SIMD-aware modifications to the solvers.

[19/22] Application: EEG Source Reconstruction
Cooperation: UKM (Münster), BESA GmbH (München)
- EEG measurements
  - > 200 electrodes
  - measure surface potential
- Governing equation (shown in two variants across the slide builds):

$$-\nabla \cdot K \nabla u = f \ \text{on } \Omega, \qquad \nabla u \cdot n = 0 \ \text{on } \partial\Omega$$

$$-\nabla \cdot K \nabla u = 0 \ \text{on } \Omega, \qquad \nabla u \cdot n = j \ \text{on } \partial\Omega$$

[Figure: EEG illustration. Source: Wikipedia]
Goal: reconstruct brain activity
- Inverse modelling approach
- L2 regularization
- Fidelity term on the potential at the electrodes

[20/22] Performance Measurements
- SIMD-vectorized AMG-CG solver
- Solve for 256 RHS, 300K cells, 60K vertices
- Timing for Haswell-EP (E5-2698v3, 16 cores, AVX2, 4 lanes)
  - 1-16 cores
  - no SIMD, AVX (4 lanes), AVX (4 × 4 lanes)
  - speedup 50 (max theoretical speedup 64)
[Figure: two bar charts over cores (1, 2, 4, 8, 16) and lanes (1, 4, 4x4): "Wall time for 256 electrodes" (sec) and "Speedup"]
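The two speedup figures can be related by a short calculation. It assumes that the theoretical maximum of 64 is the product of 16 cores and 4 SIMD lanes; the slide states only the value 64, not how it is composed.

```latex
S_{\max} = 16 \times 4 = 64,
\qquad
\frac{S_{\text{measured}}}{S_{\max}} = \frac{50}{64} \approx 0.78
```

Under that assumption, the vectorized multi-RHS solver reaches roughly 78% of the combined multi-core and SIMD potential.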
[21/22] Outlook
Transition from EXA-DUNE to mainline?! (status markers as shown on the slide)
dune-common
  X  Multi-threading support in DUNE (via TBB)
dune-grid
  X  Hybrid parallelization of DUNE grids (on top of the grid interface)
  ~  Virtual refinement of unstructured DUNE grids
dune-istl
  X  Hybrid parallelization
  ?  Performance-portable matrix format
  ?  (Vertically) vectorized preconditioners (incl. CUDA versions)
  +  (Horizontally) vectorized solvers (multiple RHS)
dune-pdelab
  +  Sum-factorization DG
  ~  Special-purpose high-performance assemblers

[22/22] Outlook II: second phase
- Push features upstream
- Task-parallel scheduling
- Asynchronicity → latency hiding, communication hiding, ...
- Resilience
- ...

Questions?!