Preparing hypre for Emerging
Architectures
Ulrike Meier Yang
(coll. with R. Falgout)
SPPEXA Symposium 2016
LLNL-PRES-681117
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore
National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Jan 25, 2016
 Current architecture trends favor regular compute patterns for
high performance
 The next high performance computer is coming to LLNL:
Sierra ATS (Advanced Technology System)
in 2017-2018
 Processor architecture: IBM Power and NVIDIA Volta processors
[Figure: node diagram with a CPU and GPUs]
 Structured, semi-structured and unstructured interfaces
[Figure: hypre conceptual interfaces. Linear system interfaces feed the linear solvers (PFMG, ...; FAC, ...; Split, ...; MLI, ...; AMG, ...), which operate on the data layouts (structured, composite, block-struc, unstruc, CSR).]
 (Semi-)Structured interface and solvers:
— PFMG: highly efficient, but limited in applicability
— SMG: currently doesn’t scale well
— No semi-structured multigrid solver
— Semi-structured interface has large unstructured portions
 What needs to happen to fix these issues?
— Develop a new structured-grid matrix class that supports rectangular matrices and constant coefficients
— Develop a new semi-structured-grid matrix class that builds on the new structured-grid matrix class
 Benefits:
— Will increase structure (we are sure!) and performance (we hope)
— Will facilitate development of new structured and semi-structured multigrid solvers (e.g.
interpolation, RAP)
 Appropriate for scalar applications on structured grids with a fixed stencil
pattern
 A lot of potential for GPUs!!
 Grids are described via a global d-dimensional index space (singles in 1D,
tuples in 2D, and triples in 3D)
 A box is a collection of cell-centered indices, described by its “lower” and
“upper” corners
 The scalar grid data is always associated with cell centers (unlike the more general SStruct interface)
[Figure: index space containing two boxes, with corners labeled (-3,2), (6,11) and (7,3), (15,8)]
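The box description above can be sketched in a few lines of C. This is only an illustration: the `Box` type and the helper names are hypothetical and are not hypre's own data structures. It assumes inclusive "lower" and "upper" corners, as on the slide.

```c
#include <assert.h>

#define MAX_DIM 3

/* Hypothetical box type: inclusive lower and upper corners in index space. */
typedef struct {
    int ndim;
    int lower[MAX_DIM];
    int upper[MAX_DIM];
} Box;

/* Number of cells in the box; a box with upper < lower is empty. */
static long box_volume(const Box *b)
{
    long vol = 1;
    assert(b->ndim >= 1 && b->ndim <= MAX_DIM);
    for (int d = 0; d < b->ndim; d++) {
        int extent = b->upper[d] - b->lower[d] + 1;
        if (extent <= 0) {
            return 0;
        }
        vol *= extent;
    }
    return vol;
}

/* Returns 1 if the index lies inside the box, 0 otherwise. */
static int box_contains(const Box *b, const int *index)
{
    for (int d = 0; d < b->ndim; d++) {
        if (index[d] < b->lower[d] || index[d] > b->upper[d]) {
            return 0;
        }
    }
    return 1;
}
```

For a box with corners (-3,2) and (6,11), each direction spans 10 cells, so the box holds 100 cells.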
 The StructGrid data structure consists of
• a global d-dimensional index space
• an array of boxes, described by their “lower” and “upper” corners
• a box manager needed to manage box information within and across processes
• box manager entries organized using an index table that divides index space into regions defined by cuts in each coordinate direction
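To make the idea of an index table built from cuts concrete, here is a small sketch. It is not hypre's box manager; `region_of` and `table_slot_2d` are hypothetical helpers, and the convention that region r covers cuts[r-1] <= coord < cuts[r] is an assumption made for the example.

```c
#include <assert.h>

/* cuts[] holds ncuts sorted cut positions along one coordinate direction;
 * they split that axis into ncuts + 1 regions. Returns the region number
 * (0 .. ncuts) that contains coord, found by binary search. */
static int region_of(const int *cuts, int ncuts, int coord)
{
    int lo = 0, hi = ncuts;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (coord >= cuts[mid]) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}

/* Flattened table slot for a 2D index: x region varies fastest. */
static int table_slot_2d(const int *xcuts, int nxcuts,
                         const int *ycuts, int nycuts,
                         const int *index)
{
    int rx = region_of(xcuts, nxcuts, index[0]);
    int ry = region_of(ycuts, nycuts, index[1]);
    return rx + (nxcuts + 1) * ry;
}
```

Each table slot would then hold the box manager entries that intersect that region, so a lookup only has to inspect a few candidates instead of every box.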
 Current StructMatrix data structure consists of
— a struct grid
[Figure: grid distributed over Proc 0 and Proc 1, with corners labeled (-3,1) and (6,4)]
— a struct stencil, e.g.

      S4                -1
   S1 S0 S2    =    -1   4  -1
      S3                -1

— Data space, i.e. extended grid, includes grid boxes with ghost layers defined by
stencil
— matrix data, defined on data space, stored contiguously
— communication package
 StructVector consists of grid, ghost layer info, data space and data, but has no stencil
or communication package
 Stencil:

      S4                -1
   S1 S0 S2    =    -1   4  -1
      S3                -1

 Grid boxes: [(-3,1), (-1,2)], [(0,1), (2,4)]
 Data space: grid boxes + ghost layers: [(-4,0), (0,3)], [(-1,0), (3,5)]
[Figure: the two grid boxes and their surrounding data boxes, with the corners above labeled]
 Data stored: S0 S1 S2 S3 S4 for the first data box, followed by S0 S1 S2 S3 S4 for the second data box
 Should we allow for different ways of storing in the future, e.g. interleaving?
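The difference between the layout drawn on the slide and an interleaved one can be expressed as two offset formulas. The sketch below assumes a 2D data box stored with the first coordinate varying fastest; the type and function names are hypothetical and the code is not taken from hypre.

```c
#include <assert.h>

/* Hypothetical 2D data box (grid box plus ghost layers), inclusive corners. */
typedef struct {
    int lower[2];
    int upper[2];
} DataBox2D;

/* Number of index points in the data box. */
static int data_box_size(const DataBox2D *b)
{
    return (b->upper[0] - b->lower[0] + 1) * (b->upper[1] - b->lower[1] + 1);
}

/* Offset of index (i, j) within the data box, first coordinate fastest. */
static int cell_offset(const DataBox2D *b, int i, int j)
{
    int nx = b->upper[0] - b->lower[0] + 1;
    return (i - b->lower[0]) + nx * (j - b->lower[1]);
}

/* Stencil-major layout: all S0 values, then all S1 values, and so on. */
static int offset_stencil_major(const DataBox2D *b, int s, int i, int j)
{
    return s * data_box_size(b) + cell_offset(b, i, j);
}

/* Interleaved layout: the stencil entries of one index point are adjacent. */
static int offset_interleaved(const DataBox2D *b, int nstencil,
                              int s, int i, int j)
{
    return cell_offset(b, i, j) * nstencil + s;
}
```

In the stencil-major form, a loop over one stencil entry walks memory with unit stride; in the interleaved form, all coefficients of one row of the matrix sit next to each other. Which is preferable on a GPU is exactly the open question raised above.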
hypre_BoxLoop2Begin(ndim, loop_size,
                    dbox1, start1, stride1, i1,
                    dbox2, start2, stride2, i2);
hypre_BoxLoop2For(i1, i2)
{
   x(i1) += y(i2);
}
hypre_BoxLoop2End(i1, i2);

ndim = 2
loop_size = (3,2)
dbox1 = [(0,0),(10,5)]    start1 = (2,1)    stride1 = (3,2)
dbox2 = [(1,1),(6,6)]     start2 = (2,2)    stride2 = (1,1)

[Figure: data box dbox1 drawn as an 11 x 6 array of X entries from (0,0) to (10,5), and data box dbox2 as a 6 x 6 array of Y entries from (1,1) to (6,6). The loop visits loop_size = (3,2) entries of each box, beginning at start1 and start2 and stepping by stride1 and stride2.]
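What the two-box loop computes for ndim = 2 can be written out as ordinary nested loops. The following is a plain-C sketch of the index arithmetic only, assuming data boxes are stored with the first coordinate fastest; it is not the macro expansion used in hypre, and the names are hypothetical.

```c
#include <assert.h>

/* Hypothetical 2D data box with inclusive corners. */
typedef struct {
    int lower[2];
    int upper[2];
} DataBox2D;

/* Linear offset of index (i, j) in a data box, first coordinate fastest. */
static int linear_index(const DataBox2D *b, int i, int j)
{
    int nx = b->upper[0] - b->lower[0] + 1;
    return (i - b->lower[0]) + nx * (j - b->lower[1]);
}

/* Plain-C equivalent of the two-box loop body x(i1) += y(i2). */
static void box_loop2_add(const int loop_size[2],
                          const DataBox2D *dbox1, const int start1[2],
                          const int stride1[2], double *x,
                          const DataBox2D *dbox2, const int start2[2],
                          const int stride2[2], const double *y)
{
    for (int kj = 0; kj < loop_size[1]; kj++) {
        for (int ki = 0; ki < loop_size[0]; ki++) {
            int i1 = linear_index(dbox1,
                                  start1[0] + ki * stride1[0],
                                  start1[1] + kj * stride1[1]);
            int i2 = linear_index(dbox2,
                                  start2[0] + ki * stride2[0],
                                  start2[1] + kj * stride2[1]);
            x[i1] += y[i2];
        }
    }
}
```

With the parameters from the slide, the first iteration pairs index (2,1) of dbox1 (offset 13) with index (2,2) of dbox2 (offset 7), and the last pairs (8,3) with (4,3).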
 Parallelize using OpenMP4 pragmas
(R. Li - preliminary results are not good)
 Use RAJA underneath
— “a software abstraction that systematically encapsulates platform-specific
code to enable applications to be portable across diverse hardware
architectures without major source code disruption”
 Treat the BoxLoop macros as a kind of DSL that could be translated by a suitable compiler, e.g. ROSE
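As an illustration of the first option, a strided 2D loop of the kind shown earlier could be annotated with an OpenMP 4 target directive as below. This is a sketch under assumptions: the function name and its flattened start/stride parameters are hypothetical, it is not the code that was benchmarked, and it requires that distinct iterations write to distinct entries of x. A compiler without OpenMP support ignores the pragma and runs the loop serially.

```c
#include <assert.h>

/* x[...] += y[...] over an nx_loop x ny_loop iteration space. Starts and
 * strides are given in linear (flattened) offsets into x and y. */
static void box_loop2_add_omp(int nx_loop, int ny_loop,
                              double *x, int x_len, int x_start,
                              int x_stride_i, int x_stride_j,
                              const double *y, int y_len, int y_start,
                              int y_stride_i, int y_stride_j)
{
#pragma omp target teams distribute parallel for collapse(2) \
        map(tofrom: x[0:x_len]) map(to: y[0:y_len])
    for (int kj = 0; kj < ny_loop; kj++) {
        for (int ki = 0; ki < nx_loop; ki++) {
            x[x_start + ki * x_stride_i + kj * x_stride_j] +=
                y[y_start + ki * y_stride_i + kj * y_stride_j];
        }
    }
}
```

The map clauses copy both arrays to the device on every call, which is one plausible reason a naive port performs poorly; keeping data resident on the device would need additional data-region directives.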
 Requires two grids: a domain grid and a range grid
 Typical rectangular matrices in multigrid methods, e.g. restriction:
[Figure: domain grid and coarser range grid, each distributed over Proc 0 and Proc 1]
 Note that boxes as well as neighbor processes can disappear when coarsening; this requires careful alignment
 Stencil, e.g. R = [ 1/2  1  1/2 ]
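A one-dimensional sketch shows how the restriction stencil R = [1/2 1 1/2] maps a fine (domain) vector to a coarse (range) vector. The alignment of coarse point c with fine point 2c and the zero treatment outside the fine grid are assumptions chosen for the example; `restrict_1d` is a hypothetical name.

```c
#include <assert.h>

/* Apply R = [1/2 1 1/2] with coarsening factor 2: coarse point c is
 * aligned with fine point 2c. Fine values outside [0, nf) count as zero. */
static void restrict_1d(const double *fine, int nf, double *coarse, int nc)
{
    for (int c = 0; c < nc; c++) {
        int f = 2 * c;
        assert(f < nf);
        double left = (f - 1 >= 0) ? fine[f - 1] : 0.0;
        double right = (f + 1 < nf) ? fine[f + 1] : 0.0;
        coarse[c] = 0.5 * left + fine[f] + 0.5 * right;
    }
}
```

The matrix is rectangular because the output has nc entries while the input has nf entries, which is why the domain and range need separate grids.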
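 Now domain grid coarser than range
[Figure: coarse domain grid and finer range grid, each distributed over Proc 0 and Proc 1]
 What about stencil? Example:

\[
P =
\begin{bmatrix} 1/4 & * & 1/4 \\ * & * & * \\ 1/4 & * & 1/4 \end{bmatrix}_{r_1}
\oplus
\begin{bmatrix} 1/2 & * & 1/2 \end{bmatrix}_{r_2}
\oplus
\begin{bmatrix} 1/2 \\ * \\ 1/2 \end{bmatrix}_{r_3}
\oplus
\begin{bmatrix} 1 \end{bmatrix}_{r_4}
\]

which together form the stencil

\[
\begin{bmatrix} 1/4 & 1/2 & 1/4 \\ 1/2 & 1 & 1/2 \\ 1/4 & 1/2 & 1/4 \end{bmatrix}
\]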
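One way to read the four stencil pieces is as the four kinds of fine (range) point: points between four coarse points, between two horizontal neighbors, between two vertical neighbors, and coinciding with a coarse point. The sketch below implements that reading; the mapping of r1 to r4 onto these cases and the alignment of coarse point (I,J) with fine point (2I,2J) are assumptions, and `interpolate_2d` is a hypothetical name.

```c
#include <assert.h>

/* Interpolate an ncx x ncy coarse grid to a (2*ncx-1) x (2*ncy-1) fine
 * grid. Arrays are stored with the first coordinate fastest. */
static void interpolate_2d(const double *coarse, int ncx, int ncy,
                           double *fine)
{
    int nfx = 2 * ncx - 1;
    int nfy = 2 * ncy - 1;
    for (int j = 0; j < nfy; j++) {
        for (int i = 0; i < nfx; i++) {
            int I = i / 2, J = j / 2;
            double v;
            if (i % 2 == 0 && j % 2 == 0) {
                /* r4: fine point coincides with a coarse point */
                v = coarse[I + ncx * J];
            } else if (j % 2 == 0) {
                /* r2: between two horizontal coarse neighbors */
                v = 0.5 * (coarse[I + ncx * J] + coarse[I + 1 + ncx * J]);
            } else if (i % 2 == 0) {
                /* r3: between two vertical coarse neighbors */
                v = 0.5 * (coarse[I + ncx * J] + coarse[I + ncx * (J + 1)]);
            } else {
                /* r1: center of four coarse points */
                v = 0.25 * (coarse[I + ncx * J] + coarse[I + 1 + ncx * J]
                            + coarse[I + ncx * (J + 1)]
                            + coarse[I + 1 + ncx * (J + 1)]);
            }
            fine[i + nfx * j] = v;
        }
    }
}
```

Each fine point uses only one of the four pieces, which is why a single fixed stencil on the domain grid does not describe the operator directly.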
 New StructMatrix has
— a base grid from which both grids can be derived
[Figure: base grid with boxes numbered 1 to 4]
Note that box numbers are local to a process; the pair (proc, box-num) is unique across the global grid
— two strides that define potential coarsening factors: here (1,1) for domain and (2,1) for range
— two sets of box numbers that define the subset of the base grid for domain and range (for efficiency): here (1,2,3,4) for domain and (1,2,4) for range
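How a stride turns a base-grid box into a coarsened index range can be sketched as follows. The convention that the kept points are the multiples of the stride, with coarse index equal to the base index divided by the stride, is an assumption for illustration; the names are hypothetical and not hypre's.

```c
#include <assert.h>

/* Hypothetical 2D box with inclusive corners. */
typedef struct {
    int lower[2];
    int upper[2];
} Box2D;

/* Integer division rounding toward negative infinity (b > 0 expected). */
static int floor_div(int a, int b)
{
    int q = a / b;
    if ((a % b != 0) && ((a < 0) != (b < 0))) {
        q--;
    }
    return q;
}

/* Integer division rounding toward positive infinity. */
static int ceil_div(int a, int b)
{
    return -floor_div(-a, b);
}

/* Index range of the base box after coarsening by the given stride:
 * keeps the base indices that are multiples of the stride. */
static Box2D coarsen_box(Box2D base, const int stride[2])
{
    Box2D c;
    for (int d = 0; d < 2; d++) {
        assert(stride[d] > 0);
        c.lower[d] = ceil_div(base.lower[d], stride[d]);
        c.upper[d] = floor_div(base.upper[d], stride[d]);
    }
    return c;
}
```

A box whose coarsened range has upper < lower in some direction contains no coarse points, which is one way a box can drop out of the range grid while staying in the domain grid.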
 Old StructVector consists of grid, ghost layer info, data
space and data
 New StructVector has in addition
— a stride to allow for coarsening
— a set of box numbers to define a subset of the base grid
— a placeholder that saves the old grid, data space, data and stride,
so that a vector can be temporarily resized to fit a matrix for
matvecs (requires a copy but reduces active memory)
 Re-index x with regard to the grid and domain stride of the
matrix A
 Generate the ghost layer sizes from the matrix stencil
 Compute the data space for x using the ghost layer info
 Generate the compute information based on the matrix grid,
stencil and strides
 Create the compute package, including communication info
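Two of the steps above, deriving ghost layer sizes from the stencil and computing the data space from them, are simple enough to sketch. The code below is an illustration in 2D with hypothetical names, not the routine used in hypre; stencil offsets are passed as a flat array of (dx, dy) pairs.

```c
#include <assert.h>

/* Ghost layer widths below (lo) and above (hi) the box in each direction. */
typedef struct {
    int lo[2];
    int hi[2];
} Ghost2D;

/* Hypothetical 2D box with inclusive corners. */
typedef struct {
    int lower[2];
    int upper[2];
} Box2D;

/* Widths needed so that every stencil offset applied to a grid point
 * stays inside the data space. offsets holds nentries (dx, dy) pairs. */
static Ghost2D ghost_from_stencil(const int *offsets, int nentries)
{
    Ghost2D g = {{0, 0}, {0, 0}};
    for (int e = 0; e < nentries; e++) {
        for (int d = 0; d < 2; d++) {
            int off = offsets[2 * e + d];
            if (-off > g.lo[d]) {
                g.lo[d] = -off;
            }
            if (off > g.hi[d]) {
                g.hi[d] = off;
            }
        }
    }
    return g;
}

/* Data space for one grid box: the box grown by the ghost widths. */
static Box2D grow_box(Box2D box, Ghost2D g)
{
    for (int d = 0; d < 2; d++) {
        box.lower[d] -= g.lo[d];
        box.upper[d] += g.hi[d];
    }
    return box;
}
```

For the 5-point stencil this gives one ghost layer on every side, so the grid box [(-3,1), (-1,2)] grows to the data box [(-4,0), (0,3)].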
 Matrix-vector multiplication for square and rectangular matrices
(not completely general, needs strides)
 Matrix-matrix multiplications
 Triple matrix product
 Transpose matrix operations
 They are written and working (e.g. in PFMG)
 This is not ready for release!
 Allows more general grids:
— Grids that are mostly (but not entirely) structured
— Examples: block-structured grids, structured adaptive mesh refinement
grids, overset grids
[Figures: adaptive mesh refinement grid, block-structured grid, overset grid]
 Allows more general PDEs
— Multiple variables (system PDEs)
— Multiple variable types (cell centered, face centered, vertex centered, … )
 Variables are referenced by the abstract cell-centered index (i,j) to the left and down
 The interface uses a graph to allow nearly arbitrary relationships
between part data
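The rule that a variable is referenced by the cell-centered index to its left and down can be illustrated by computing where a variable with index (i,j) sits. The sketch works in half-cell units so that everything stays in integers; the enum and function are hypothetical and cover only four 2D variable types.

```c
#include <assert.h>

/* Hypothetical variable types for a 2D illustration. */
typedef enum { VAR_CELL, VAR_NODE, VAR_XFACE, VAR_YFACE } VarType;

/* Position of variable (i, j) in half-cell units, where the center of
 * cell (i, j) is at (2i, 2j). A non-cell variable lies to the right of
 * and/or above the center of the cell whose index it carries. */
static void var_location_2d(VarType t, int i, int j, int *px, int *py)
{
    *px = 2 * i;
    *py = 2 * j;
    switch (t) {
    case VAR_CELL:
        break;
    case VAR_NODE:   /* upper-right corner of cell (i, j) */
        *px += 1;
        *py += 1;
        break;
    case VAR_XFACE:  /* right face of cell (i, j) */
        *px += 1;
        break;
    case VAR_YFACE:  /* top face of cell (i, j) */
        *py += 1;
        break;
    }
}
```

Seen from the variable, the referencing cell is the one to the left and down, which matches the convention stated above.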
 An SStruct grid consists of
— grid information for each part, including
• variable types
• struct grids for each variable type for each part
— neighborhood info
— box managers for each part and variable
— possibly finite element information
[Figure: 3 discretization stencils]
 An SStruct graph describes stencil and non-stencil couplings between
parts and consists of
• stencils for each part and variable
• grid info for each part
• possibly finite element info
 The SStructMatrix data structure is based on a splitting of the
nonzeros into structured and unstructured couplings:
A=S+U
 S is stored as a collection of struct matrices for
each part and variable type and currently contains
only stencil couplings between the same variable
type
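Submatrix for a part with 3 variable types:

\[
\begin{bmatrix}
S_{11} & U_{12} & U_{13} \\
U_{21} & S_{22} & U_{23} \\
U_{31} & U_{32} & S_{33}
\end{bmatrix}
\;\rightarrow\;
\begin{bmatrix}
S_{11} & S_{12} & S_{13} \\
S_{21} & S_{22} & S_{23} \\
S_{31} & S_{32} & S_{33}
\end{bmatrix}
\]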
 New data structures will allow increasing the structured portion!
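The splitting A = S + U implies a matrix-vector product with two parts: a regular stencil loop for S and a sparse loop for U. A minimal 1D sketch, assuming a constant 3-point stencil for S and CSR storage for U (the function name and argument layout are hypothetical, not hypre's):

```c
#include <assert.h>

/* y = (S + U) x on n points. S is a constant 3-point stencil
 * (stencil[0] couples to i-1, stencil[1] to i, stencil[2] to i+1);
 * U is an unstructured remainder stored in CSR form. */
static void matvec_split(int n, const double stencil[3],
                         const int *u_rowptr, const int *u_col,
                         const double *u_val,
                         const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        /* structured part: fixed access pattern, no index lookups */
        double sum = stencil[1] * x[i];
        if (i > 0) {
            sum += stencil[0] * x[i - 1];
        }
        if (i < n - 1) {
            sum += stencil[2] * x[i + 1];
        }
        /* unstructured part: indirect access through column indices */
        for (int k = u_rowptr[i]; k < u_rowptr[i + 1]; k++) {
            sum += u_val[k] * x[u_col[k]];
        }
        y[i] = sum;
    }
}
```

The structured loop has a regular memory access pattern while the CSR loop relies on indirect addressing, so moving couplings from U into S is what makes the matvec more regular.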
 In the context of the Struct interface we have so far
considered only the cell-centered variable type for
matvec
 Now we also have face- and edge-centered and node-centered
variable types, and potentially more in higher dimensions!
 Need to somehow be able to consider them in the context of
boxes of struct interface
 Difficulty in properly lining up boxes for different variable types
 How can we match them up properly?
 Current approach:
 Consider nodal variable type for this grid
 But not allowed to overlap!
 Current algorithm to deal with this
within the same variable type
 Becomes too complicated for a
combination of variable types
 Suggestion by C. Engwer at the Dagstuhl stencil workshop:
 Better option:
Go one level up and use strides!
Now one can use the old method!
 We are still working on this
and checking for road blocks
 Implemented structured rectangular matrix routines in hypre
(will not be released until the semi-structured part is
completed)
 In the process of implementing it in the semi-structured
interface
 We plan to move this to a CPU-GPU system
 We plan to implement a semi-structured multigrid solver
Thank you!
 This work was performed under the auspices of the U.S. Department
of Energy by Lawrence Livermore National Laboratory under
Contract DE-AC52-07NA27344. Partial support for this work was
provided through Scientific Discovery through Advanced Computing
(SciDAC) program funded by U.S. Department of Energy, Office of
Science, Advanced Scientific Computing Research (and Basic
Energy Sciences/Biological and Environmental Research/High
Energy Physics/Fusion Energy Sciences/Nuclear Physics) and by
Applied Mathematics Program, DOE ASCR.