U. Meier Yang
Preparing hypre for Emerging Architectures
Ulrike Meier Yang (in collaboration with R. Falgout)
SPPEXA Symposium 2016, Jan 25, 2016
LLNL-PRES-681117
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Current architecture trends favor regular compute patterns for high performance.
The next high performance computer is coming to LLNL: Sierra ATS (Advanced Technology System) in 2017-2018.
Processor architecture: … IBM Power and NVIDIA Volta processors
[Figure: node diagram with one CPU and two GPUs]

Structured, semi-structured and unstructured interfaces
[Figure: linear system interfaces feed the linear solvers (PFMG, ...; FAC, ...; Split, ...; MLI, ...; AMG, ...), which operate on the data layouts (structured, composite, block-struc, unstruc, CSR)]

(Semi-)Structured interface and solvers:
- PFMG: highly efficient, but limited in applicability
- SMG: currently doesn't scale well
- No semi-structured multigrid solver
- Semi-structured interface has large unstructured portions
What needs to happen to fix these issues?
- Develop a new structured-grid matrix class that supports rectangular matrices and constant coefficients
- Develop a new semi-structured-grid matrix class that builds on the new structured-grid matrix
Benefits:
- Will increase structure (we are sure!) and performance (we hope)
- Will facilitate development of new structured and semi-structured multigrid solvers (e.g. interpolation, RAP)

Appropriate for scalar applications on structured grids with a fixed stencil pattern.
A lot of potential for GPUs!
Grids are described via a global d-dimensional index space (singles in 1D, tuples in 2D, and triples in 3D).
A box is a collection of cell-centered indices, described by its "lower" and "upper" corners.
The scalar grid data is always associated with cell centers (unlike the more general SStruct interface).
[Figure: index space containing boxes, with labeled corners (-3,2), (6,11), (7,3), (15,8)]

The StructGrid data structure consists of
• a global d-dimensional index space
• an array of boxes, described by their "lower" and "upper" corners
• a box manager needed to manage box information within and across processes
• box manager entries organized using an index table that divides index space into regions defined by cuts in each coordinate direction

Current StructMatrix data structure consists of
- a struct grid
- a struct stencil
- data space, i.e. extended grid, which includes grid boxes with ghost layers defined by the stencil
- matrix data, defined on the data space, stored contiguously
- communication package
[Figure: grid from (-3,1) to (6,4), split between Proc 0 and Proc 1]

              S4               -1
Stencil =  S1 S0 S2   =    -1   4  -1
              S3               -1

StructVector consists of grid, ghost layer info, data space and data, but has no stencil or communication package.

Grid boxes: [(-3,1), (-1,2)], [(0,1), (2,4)]
Data space (grid boxes + ghost layers): [(-4,0), (0,3)], [(-1,0), (3,5)]
Data stored: S0 S1 S2 S3 S4 (first box), S0 S1 S2 S3 S4 (second box)
[Figure: the two grid boxes and their data-space boxes, with labeled corners (-3,1), (-1,2), (0,1), (2,4), (-4,0), (0,3), (-1,0), (3,5)]
Should we allow for different ways of storing in the future, e.g. interleaving?
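The relationship between the grid boxes, the data-space boxes, and the stored stencil coefficients listed above can be sketched in a few lines of C. This is a minimal illustration, not hypre code: the `Box` type and the functions `box_grow`, `box_volume`, `box_offset`, and `coeff_offset` are hypothetical names, a uniform ghost width on every side is assumed, and the layout (first dimension fastest, all values of one stencil entry stored before the next entry) is one reading of the "Data stored" line, i.e. the non-interleaved case.

```c
#include <assert.h>
#include <stddef.h>

#define NDIM 2 /* dimension used in the slide example (assumption for this sketch) */

/* Hypothetical box type: inclusive "lower" and "upper" corners in index space. */
typedef struct {
    int lower[NDIM];
    int upper[NDIM];
} Box;

/* Grow a grid box by `ghost` layers on every side to get its data-space box. */
static Box box_grow(Box b, int ghost)
{
    for (int d = 0; d < NDIM; d++) {
        b.lower[d] -= ghost;
        b.upper[d] += ghost;
    }
    return b;
}

/* Number of indices contained in the box. */
static size_t box_volume(Box b)
{
    size_t v = 1;
    for (int d = 0; d < NDIM; d++) {
        assert(b.upper[d] >= b.lower[d]);
        v *= (size_t)(b.upper[d] - b.lower[d] + 1);
    }
    return v;
}

/* Linear offset of index `idx` inside data-space box `db`,
 * with the first dimension varying fastest (assumed ordering). */
static size_t box_offset(Box db, const int idx[NDIM])
{
    size_t off = 0, stride = 1;
    for (int d = 0; d < NDIM; d++) {
        assert(idx[d] >= db.lower[d] && idx[d] <= db.upper[d]);
        off += (size_t)(idx[d] - db.lower[d]) * stride;
        stride *= (size_t)(db.upper[d] - db.lower[d] + 1);
    }
    return off;
}

/* Non-interleaved layout: all S0 values of the box, then all S1 values, ... */
static size_t coeff_offset(Box db, int entry, const int idx[NDIM])
{
    return (size_t)entry * box_volume(db) + box_offset(db, idx);
}
```

For the first grid box [(-3,1), (-1,2)], `box_grow` with one ghost layer gives [(-4,0), (0,3)], which matches the data space on the slide and contains 20 indices. An interleaved layout, as raised in the question above, would instead store the five coefficients of each index next to each other, i.e. `box_offset(db, idx) * 5 + entry` under the same assumptions.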

    hypre_BoxLoop2Begin(ndim, loop_size,
                        dbox1, start1, stride1, i1,
                        dbox2, start2, stride2, i2)
    hypre_BoxLoop2For(i1, i2)
    {
       x(i1) += y(i2);
    }
    hypre_BoxLoop2End(i1, i2);

ndim = 2
loop_size = (3,2)
dbox1 = [(0,0),(10,5)]
dbox2 = [(1,1),(6,6)]
start1 = (2,1)
stride1 = (3,2)
start2 = (2,2)
stride2 = (1,1)
[Figure: data box for x with 11 x 6 entries from (0,0) to (10,5), and data box for y with 6 x 6 entries from (1,1) to (6,6)]

Parallelize using OpenMP4 pragmas (R.
Li: preliminary results are not good)
Use RAJA underneath: "a software abstraction that systematically encapsulates platform-specific code to enable applications to be portable across diverse hardware architectures without major source code disruption"
Consider the loops as a kind of DSL, which could be compiled with the right compiler, e.g. ROSE

Requires two grids: a domain grid and a range grid
Typical rectangular matrices in multigrid methods, e.g. restriction
[Figure: fine domain grid and coarse range grid, each distributed over Proc 0 and Proc 1]
Note that boxes as well as neighbor processes can disappear when coarsening; this requires careful alignment.
Stencil, e.g. R = [ 1/2  1  1/2 ]

Now the domain grid is coarser than the range grid
[Figure: coarse domain grid and fine range grid, each distributed over Proc 0 and Proc 1]
What about the stencil? Example:

        1/4  1/2  1/4
P  =    1/2   1   1/2
        1/4  1/2  1/4

[Figure: P decomposed into stencils r1, r2, r3, r4 with entries 1/4, 1/2, 1/2 and 1, respectively]

New StructMatrix has
- a base grid from which both grids can be derived
- two strides that define potential coarsening factors: here (1,1) for domain and (2,1) for range
- two sets of box numbers that define the subset of the base grid for domain and range: here (1,2,3,4) for domain (for efficiency) and (1,2,4) for range
Note that box numbers are local to a process; (proc, box-num) is unique across the global grid.
[Figure: base grid with boxes numbered 1 to 4]

Old StructVector consists of grid, ghost layer info, data space and data
New StructVector has in addition
- a stride to allow for coarsening
- a set of box numbers to define a subset of the base grid
- a placeholder to save the old grid, data space, data and stride, which allows temporarily resizing a vector to fit a matrix for matvecs (requires a copy but reduces active memory)

Re-index x with regard to the grid and domain stride of the matrix A
Generate the ghost layer sizes from the matrix stencil
Compute the data
space for x using the ghost layer info
Generate the compute information based on the matrix grid, stencil and strides
Create the compute package, including communication info

Matrix-vector multiplication for square and rectangular matrices (not completely general, needs strides)
Matrix-matrix multiplications
Triple matrix product
Transpose matrix operations
They are written and working (e.g. in PFMG)
This is not ready for release!

Allows more general grids:
- Grids that are mostly (but not entirely) structured
- Examples: block-structured grids, structured adaptive mesh refinement grids, overset grids
[Figure: three example grids: Block-Structured, Adaptive Mesh Refinement, Overset]

Allows more general PDEs
- Multiple variables (system PDEs)
- Multiple variable types (cell centered, face centered, vertex centered, …)
Variables are referenced by the abstract cell-centered index to the left and down
[Figure: cell (i,j) with its associated variables]
The interface uses a graph to allow nearly arbitrary relationships between part data

An SStruct grid consists of
- grid information for each part, including
  • variable types
  • struct grids for each variable type for each part
- neighborhood info
- box managers for each part and variable
- possibly finite element information
[Figure: 3 discretization stencils]

An SStruct graph describes stencil and non-stencil couplings between parts and consists of
• stencils for each part and variable
• grid info for each part
• possibly finite element info

The SStructMatrix data structure is based on a splitting of the nonzeros into structured and unstructured couplings: A = S + U
S is stored as a collection of struct matrices for each part and variable type and currently contains only stencil couplings between the same variable type
Submatrix for a part with 3 variable types:

    S11 U12 U13         S11 S12 S13
    U21 S22 U23    →    S21 S22 S23
    U31 U32 S33         S31 S32 S33

New data structures will allow increasing the structured
portion!

In the context of the Struct interface we have so far considered only the cell-centered variable type for the matvec
Now we also have face-, edge- and node-centered variable types, and potentially more in higher dimensions!

Need to somehow be able to consider them in the context of the boxes of the Struct interface
Difficulty in properly lining up boxes for different variable types
How can we match them up properly?
Current approach:
[Figure: current approach to matching boxes of different variable types]

Consider the nodal variable type for this grid
But the boxes are not allowed to overlap!
The current algorithm deals with this within the same variable type
It becomes too complicated for a combination of variable types

Suggestion by C. Engwer at the Dagstuhl stencil workshop:
Better option: go one level up and use strides!
Now one can use the old method!
We are still working on this and checking for roadblocks

Implemented structured rectangular matrix routines in hypre (will not be released until the semi-structured part is completed)
In the process of implementing them in the semi-structured interface
We plan to move this to a CPU-GPU system
We plan to implement a semi-structured multigrid solver

Thank you!
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Partial support for this work was provided through the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (and Basic Energy Sciences/Biological and Environmental Research/High Energy Physics/Fusion Energy Sciences/Nuclear Physics) and by the Applied Mathematics Program, DOE ASCR.