Semi-Discrete Matrix-Free Formulation of 3D Elastic Full Waveform Inversion Modeling

Stephen Moore, Devi Sudheer Chunduri, Sergiy Zhuk, Tigran Tchrakian, Ewout van den Berg, Albert Akhriev, Alberto Costa Nogueira Junior, Andrew Rawlinson, Lior Horesh
IBM Research

Abstract. Full waveform inversion (FWI) is an emerging subsurface imaging technique used to locate oil and gas bodies. The key challenges that hinder its adoption by industry are both algorithmic and computational in nature, including the storage, communication, and processing of large-scale system matrices, which impose cardinal impediments upon computational scalability. In this work we present a complete matrix-free algorithmic formulation of a 3D elastic time-domain spectral element solver for both the forward and adjoint wave-fields, as part of a greater cloud-based FWI framework. We discuss computational optimisation (SIMD vectorisation, use of Many Integrated Core architectures, etc.) and present scaling results for two HPC systems, namely an IBM Blue Gene/Q and an Intel-based system equipped with Xeon Phi coprocessors.

1 Introduction

In the field of exploration geophysics, Full Waveform Inversion (FWI) is regarded as the state-of-the-art seismic imaging technique. Its goal is to detect and characterize subsurface geological structures such as ore minerals, hydrocarbons, geothermal reservoirs and aquifers. As the approach utilizes a comprehensive representation of the interaction between wave physics and subsurface properties, it offers unique advantages in terms of generality, fidelity, complexity and robustness, and is hence capable of imaging arbitrarily heterogeneous compressional and shear wave velocity profiles of the subsurface constituents [1].

As a wavefield propagates through a medium of heterogeneous elastic properties, waves are reflected off different material regions in the subsurface. Data acquisition is performed by recording the response of the subsurface to either active (e.g. via airguns, waterguns, vibrator trucks, boomers, explosives) or passive excitation of the subsurface. Data are collected using an array of geophones deployed according to a predefined survey design [2], where the data themselves take the form of seismic reflection traces (seismograms).

By creating a computer model of the surveyed region (with a choice of subsurface material properties, or 'earth model') and assuming that a governing equation such as the acoustic or elastic wave equation is representative of the physics, a 'forward' simulation of the wavefield subject to the same sources can be performed. From this simulation, synthetic seismograms can be generated and the discrepancy between the experimental and computational seismograms can be assessed. The level of misfit can be quantified in various ways, depending upon the assumptions made regarding the noise characteristics. The misfit functional (or noise model) is instrumental in directing the non-linear optimization process of inversion in updating the material properties. Since the inverse problem is by nature ill-posed, it is essential to incorporate a-priori information (e.g. structure, sparsity) as well as constraints (e.g. positivity) so as to constrain the solution space. Thus, the inversion's objective is to find a feasible solution that minimizes a compromise between the aforementioned data misfit and the a-priori information.
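Schematically (a sketch only; the regularizer R, weight α, and feasible set \mathcal{C} below are generic placeholders rather than quantities specified in this work), the inversion can be posed as

\min_{m \in \mathcal{C}} \; J(m) + \alpha\, R(m)

where J(m) is the data misfit functional defined later in (8), R(m) encodes the a-priori information (e.g. structure or sparsity), α > 0 weights the compromise between the two terms, and \mathcal{C} imposes constraints such as positivity of the material parameters.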
At this point one may infer that the current a-posteriori estimate of the subsurface model is a sufficient approximation to reality, or alternatively devise means to quantify its uncertainty. 3D and 4D formulations of the elastic FWI problem are notoriously large-scale, implying that only matrix-free inversion schemes are computationally tractable. From an inversion perspective, this means that both Jacobian and Hessian approximations can only be accessed implicitly, in a matrix-vector product fashion [3]. Of key computational importance in the update of the subsurface model is the gradient of the misfit functional, which is computed through a combination of a forward wavefield and an 'adjoint' wavefield [4,5]. The former is computed by marching forward in time and the latter by marching backward in time. Some of the important computational bottlenecks are associated with the storage, communication, and processing of large-scale system matrices. To alleviate these problems, a complete matrix-free formulation is introduced in this study. Specifically, we propose a hybrid optimize/discretize strategy for subsurface model (Lamé parameter) estimation in the displacement-stress formulation of the 3D elastic wave equation. Our formulation relies upon a high-order spectral element discretization of the wave equation, which yields a system of ODEs. The latter, the so-called Spectral Element Model (SEM), is continuous in time and defines the dynamics of the projection coefficients representing the solution of the elastic wave equation in a polynomial basis. We further introduce a derivation of the adjoint equations for the SEM, which enables the incorporation of a generic data misfit functional measuring, continuously in time, the discrepancy between the simulated wavefield and the observed data.

The key contributions of the study are a scalable computational strategy and a novel approach for matrix-free evaluation of the adjoint wavefield using a hybrid distributed-shared memory approach. The formulation enables computation of the adjoint wavefield, and consequently the gradient (or a proximal function thereof) of the objective function, while avoiding storage of the forward displacement field in memory. This algorithmic setup grants scalability in both weak and strong scaling tests. The solver is one component of a greater effort aimed at the development of a cloud-based FWI framework.

2 Methods

The elastic wave equation [1] is a statement of Newton's second law of motion applied to a continuous solid material and is defined as

\rho\, \partial_{tt} u_i = \partial_{x_j} \sigma_{ij} + f_i    (1)

where x ∈ Ω ⊂ R^3 denotes a computational solid domain with boundary Γ, t ∈ [0, T], i, j ∈ [1, 2, 3], ρ(x) is the mass density, u(x, t) is the displacement vector field, σ(x, t) is the stress tensor field, and f(x, t) is a seismic source modeled as a point source f(x, t) = θ(t) δ(x − x_s) applied at x_s using a standard Ricker wavelet θ(t). Approximating the earth model with a linear isotropic relation (i.e. Hooke's law), the stress tensor is related to the displacement gradients as

\sigma_{ij} = \lambda\, \delta_{ij}\, \partial_{x_k} u_k + \mu \left( \partial_{x_i} u_j + \partial_{x_j} u_i \right)    (2)

where λ(x) and μ(x) are the first and second Lamé parameters respectively. In fact, μ can be physically interpreted as the shear modulus and λ = κ − (2/3)μ, where κ is the bulk modulus.
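As an illustration of equation (2) only (a minimal sketch; the array layout and function name are our own choices, not the solver's), the per-point stress evaluation amounts to:

```cpp
#include <array>

using Mat3 = std::array<std::array<double, 3>, 3>;

// Isotropic Hooke's law (2): sigma_ij = lambda*delta_ij*(du_k/dx_k) + mu*(du_i/dx_j + du_j/dx_i).
// 'grad' holds the displacement gradient at a single GLL point, grad[i][j] = du_i/dx_j.
Mat3 stress(const Mat3& grad, double lambda, double mu) {
    const double div_u = grad[0][0] + grad[1][1] + grad[2][2];  // dilatation du_k/dx_k
    Mat3 sigma{};
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            sigma[i][j] = mu * (grad[i][j] + grad[j][i]);
            if (i == j) sigma[i][j] += lambda * div_u;          // diagonal (volumetric) term
        }
    }
    return sigma;
}
```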
The elastic wave equation is subject to standard initial conditions u_i = ∂_t u_i = 0, free surface boundary conditions σ_ij n_j = 0, and Clayton and Engquist absorbing boundary conditions σ_ij n_j = −ρ v ∂_t u_i, where n_j denotes the normal to the boundary Γ and v denotes the elastic wave speed (the P wave speed for displacement components parallel to n_j and the S wave speed for components perpendicular to n_j). Using Galerkin projection, the weak form of the elastic wave equation is

\int_\Omega \psi\, \rho\, \partial_{tt} u_i \, d\Omega + \int_\Omega \partial_{x_j}\psi\, \sigma_{ij} \, d\Omega = \int_\Omega \psi f_i \, d\Omega + \int_\Gamma \psi \sigma_{ij} \, d\Gamma_j    (3)

where ψ denotes the chosen basis functions.

2.1 Discretization

Using the spectral element method for the spatial discretization, the computational domain is defined as a tessellation of N_e hexahedral elements (Figure 1(c)), with the displacement field approximated as u_i(x, t) ≈ ψ_pqr(x) u_{i,pqr}(t), such that

\int_{\Omega_e} \psi_{abc}\, \rho\, \psi_{pqr}\, \partial_{tt} u_{i,pqr} \, d\Omega + \int_{\Gamma_e} \psi_{ab}\, \rho v\, \psi_{pq}\, \partial_t u_{i,pq} \, d\Gamma + \int_{\Omega_e} \partial_{x_j}\psi_{abc}\, \sigma_{ij} \, d\Omega = \int_{\Omega_e} \psi_{abc} f_i \, d\Omega    (4)

In this specific formulation the basis functions ψ_pqr are taken to be a tensor product of the family of Nth order Lagrange polynomials, ψ_pqr = ℓ_p ℓ_q ℓ_r (Figure 1(b)). After applying a geometric transformation to a reference element ξ_i ∈ [−1, +1] (Figure 1(a)), performing the integration with Gauss-Legendre-Lobatto (GLL) quadrature (Figure 1(c)) [6], and then global assembly, we obtain the system of ODEs

M \ddot{u} + C \dot{u} + F(m, u) = s    (5)

Fig. 1. Illustrations of (a) the transformation of a hexahedral element from x to ξ coordinates, (b) a family of 8th order Lagrange polynomials defined for ξ_i ∈ [−1, +1], (c) an 8th order spectral element depicting the locations of the GLL quadrature/nodal points.

where u is the global set of displacements and m the global set of Lamé parameters λ and μ defined at the GLL points of all spectral elements in the grid. The mass matrix M, damping matrix C, force vector F and source vector s are given by

M = \sum_{e=1}^{N_e} \sum_{f,g,h=1}^{N+1} w_f w_g w_h\, \psi_{abc}\, \rho\, \psi_{pqr}\, J_{\xi_{1,f},\xi_{2,g},\xi_{3,h}}
C = \sum_{e=1}^{N_e} \sum_{f,g=1}^{N+1} w_f w_g\, \psi_{ab}\, \rho v_i\, \psi_{pq}\, J_{\xi_{1,f},\xi_{2,g}}
F = \sum_{e=1}^{N_e} \sum_{f,g,h=1}^{N+1} w_f w_g w_h\, \partial_{\xi_k}\psi_{abc}\, \frac{\partial \xi_k}{\partial x_j}\, \sigma_{ij,pqr}\, J_{\xi_{1,f},\xi_{2,g},\xi_{3,h}}
s = \sum_{e=1}^{N_e} \sum_{f,g,h=1}^{N+1} w_f w_g w_h\, \psi_{abc}\, f_i\, J_{\xi_{1,f},\xi_{2,g},\xi_{3,h}}    (6)

where J denotes the Jacobian of the transformation mapping to the reference element and w_f denote the weights associated with the GLL quadrature. Due to the cardinal interpolation property of the Lagrange polynomials, ℓ_a(ξ_f) = δ_af, and the specific choice of GLL quadrature, the mass and damping matrices (which do not involve spatial derivatives of the Lagrange polynomials) are diagonal and hence trivially assembled and stored as 1D arrays. It is worth mentioning that F(m, u) ≡ K(m)u, where K is the well known (sparse, but non-diagonal) stiffness matrix of solid mechanics, which depends linearly on m, that is K(m_1 + m_2) = K(m_1) + K(m_2). Finally, in order to perform the time integration, an explicit form of the Newmark method [7] is used:

u^{n+1} = u^n + \Delta t\, \dot{u}^n + \tfrac{1}{2}\Delta t^2\, \ddot{u}^n
\hat{\dot{u}}^{n+1} = \dot{u}^n + \tfrac{1}{2}\Delta t\, \ddot{u}^n
\ddot{u}^{n+1} = \left(M + \tfrac{1}{2}\Delta t\, C\right)^{-1}\left( s^{n+1} - C \hat{\dot{u}}^{n+1} - F(u^{n+1}) \right)
\dot{u}^{n+1} = \hat{\dot{u}}^{n+1} + \tfrac{1}{2}\Delta t\, \ddot{u}^{n+1}    (7)

It is important to note that, due to the diagonality of M and C, the inverse required for the update of \ddot{u}^{n+1} in (7) is trivial and can in fact be precomputed and stored as a 1D array once M and C have been assembled.
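To make the role of the diagonal operators concrete, the following is a minimal per-degree-of-freedom sketch of one Newmark step (7); the flat arrays, the precomputed inverse inv_MC = 1/(M + Δt/2·C), and the force callback are illustrative choices of ours rather than the solver's actual data structures.

```cpp
#include <functional>
#include <vector>

// One explicit Newmark step (7) for the semi-discrete system (5).
// M and C are diagonal, so (M + dt/2*C)^{-1} is precomputed as the 1D array inv_MC.
// compute_forces assembles F(m, u^{n+1}) into f, i.e. the matrix-free action K(m)u.
void newmark_step(std::vector<double>& u, std::vector<double>& v, std::vector<double>& a,
                  const std::vector<double>& C, const std::vector<double>& inv_MC,
                  const std::vector<double>& s,
                  const std::function<void(const std::vector<double>&, std::vector<double>&)>& compute_forces,
                  double dt) {
    const std::size_t n = u.size();
    std::vector<double> f(n);
    for (std::size_t i = 0; i < n; ++i) {
        u[i] += dt * v[i] + 0.5 * dt * dt * a[i];        // displacement update
        v[i] += 0.5 * dt * a[i];                         // velocity predictor (u-hat-dot)
    }
    compute_forces(u, f);                                // F(m, u^{n+1}), assembled element by element
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = inv_MC[i] * (s[i] - C[i] * v[i] - f[i]);  // acceleration update, diagonal "solve"
        v[i] += 0.5 * dt * a[i];                         // velocity corrector
    }
}
```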
In order to update the earth model, a generic data misfit functional relating the solution of the discrete forward elastic model to the observed data is defined as

J(m) = \int_0^T \sum_{s=1}^{N_s} r\big(d(x_s, t) - V u(t)\big)\, dt    (8)

where r, with r(x) > 0 for x ≠ 0, is a noise model quantifying the error between V u and the observed quantities d, and V maps the discrete wavefield u into the space of observations, or equivalently, maps the grid of GLL points approximating Ω onto the set of sensor locations {x_1, ..., x_{N_s}}. Estimation of the earth model m, given the data d, by variational approaches (gradient descent or quasi-Newton methods) requires the computation of ∂_{m_j} J. The latter is achieved by introducing the following adjoint equation

M \ddot{u}^\dagger - C \dot{u}^\dagger + F(m, u^\dagger) = V^T (d - V u), \qquad u^\dagger(T) = \dot{u}^\dagger(T) = 0    (9)

where u solves (5). We also define β_j as the solution of the following scalar ODE

\dot{\beta}_j = u^\dagger(t)^T\, \partial_{m_j} K\, u(t), \qquad \beta_j(0) = 0    (10)

Finally, through integration by parts, it is easy to prove that

\partial_{m_j} J(m) = \beta_j(T).    (11)

The key drawback of this approach for computing ∂_{m_j} J is that the computation of β_j requires the storage of either u or u† in memory, as the former is computed forward in time while integration of the adjoint model (10) requires the latter, which is computed backward in time. Storing either discrete wavefield at every GLL point at every time step would prohibitively limit the scalability of the code, as multiple runs of the forward and adjoint time marching are required for the computation of each ∂_{m_j} J. Alternatively, we propose that during the first run (i.e. the solution of (5)) the residual r is computed and stored at each time step. As the number of sensors N_s is significantly smaller than the total number of GLL points in the grid, storage of the wavefield at these points for every time step is trivial. The second run solves (9) backward in time solely for the purpose of computing u†(0) and u̇†(0), which are then stored. Finally, (5), (9), and (10) are simultaneously marched forward until time T, at which point ∂_{m_j} J can be evaluated. Note that in this final stage (9) is solved forward in time using the stored u†(0) and u̇†(0) as initial conditions. This approach requires little storage, but assumes that the time integration can be performed sufficiently fast.

2.2 Parallel Implementation

The spectral element and Newmark discretization methods allow for a relatively simple explicit time marching scheme, avoiding the need for sparse matrix data structures or the solution of linear systems. The parallel implementation of the solver uses a hybrid distributed-shared memory approach implemented with MPI and OpenMP. The distributed memory aspect involves a standard domain decomposition approach where individual processes are responsible for the time marching over a unique subset of spectral elements (Figure 2). This approach implies that GLL points shared by spectral elements on neighbouring processes are duplicated, so that when the forces F(m, u) are assembled at each time step in the forward or adjoint solve, the forces need to be exchanged and summed between any processes that contain a duplicated GLL point.

Fig. 2. An illustration of a computational domain (left) that is discretized into spectral elements and decomposed amongst multiple compute nodes (middle), where a given process's portion of elements (right) is further separated into interior (blue) and interface (red) elements.
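The exchange-and-sum on duplicated GLL points can be sketched as follows; this is a minimal illustration under our own assumptions about the communication data structures, whereas the actual solver overlaps the exchange with the processing of interior elements, as described next and in Algorithm 1.

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

// Per-neighbour list of local indices of GLL points duplicated on that rank,
// plus matching send/receive buffers (3 force components per shared point). Illustrative layout.
struct Neighbour {
    int rank;
    std::vector<int> shared_gll;
    std::vector<double> send, recv;
};

// Exchange and sum force contributions on duplicated GLL points; forces is a flat [3*n] array.
void exchange_forces(std::vector<double>& forces, std::vector<Neighbour>& nbrs, MPI_Comm comm) {
    std::vector<MPI_Request> reqs;
    reqs.reserve(2 * nbrs.size());
    for (auto& nb : nbrs) {
        nb.send.resize(3 * nb.shared_gll.size());
        nb.recv.resize(3 * nb.shared_gll.size());
        for (std::size_t p = 0; p < nb.shared_gll.size(); ++p)
            for (int c = 0; c < 3; ++c)
                nb.send[3 * p + c] = forces[3 * nb.shared_gll[p] + c];   // pack local contributions
        reqs.emplace_back();
        MPI_Isend(nb.send.data(), (int)nb.send.size(), MPI_DOUBLE, nb.rank, 0, comm, &reqs.back());
        reqs.emplace_back();
        MPI_Irecv(nb.recv.data(), (int)nb.recv.size(), MPI_DOUBLE, nb.rank, 0, comm, &reqs.back());
    }
    MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
    for (auto& nb : nbrs)                                                // sum neighbour contributions
        for (std::size_t p = 0; p < nb.shared_gll.size(); ++p)
            for (int c = 0; c < 3; ++c)
                forces[3 * nb.shared_gll[p] + c] += nb.recv[3 * p + c];
}
```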
In order to hide the latency of this data exchange, a process's elements can be grouped into interior and interface elements (Figure 2), processing the interface elements first, exchanging forces on duplicate GLL points with non-blocking MPI sends and receives, and processing the interior elements while the messages are in transit. The shared memory parallelization is applied mainly to the computation of the forces, as this is by far the most computationally intensive part of the algorithm. The computation involves the use of multiple threads to process the different spectral elements within an interior or interface region, where an OpenMP thread loops over the GLL points in an element, computes the displacement gradients and stresses at each GLL point, performs the integration, and updates the forces array, which is shared by all threads. The update involves a synchronized write to memory, which is achieved using OpenMP locks by maintaining a lock for each shared GLL point; this is a more flexible and scalable approach than using either critical or named critical sections. Pseudocode for the evaluation of the forces is presented in Algorithm 1.

    for region = interface : interior
        #pragma omp for
        for e = 1 : N_e                 // elements in region
            for p,q,r = 1 : N+1         // each GLL point
                Compute displacement gradients
                Compute stresses
                Integrate
                omp_set_lock at GLL point pqr
                Update global forces array
                omp_unset_lock at GLL point pqr
            end
        end
        if region == interface
            MPI_Isend then MPI_Irecv forces on duplicate GLL points
        end
    end
    MPI_Waitall

Algorithm 1: Pseudocode for computation of forces at each time step.

2.3 Experimental Setup

Experiments were performed on two different systems, namely an IBM Blue Gene/Q and the Intel-based 'Stampede' supercomputer. Each Blue Gene/Q compute node contains a 16-core 1.6 GHz A2 processor with 16GB of DDR3 memory. Each core supports four-way Symmetric Multithreading (SMT), and our implementation uses 64 threads per node. Each core has a Quad Floating Point Unit (FPU) supporting 4-way double precision SIMD operations. The code was compiled with the IBM XLC compiler. The Stampede compute nodes contain two 8-core Intel Xeon E5-2680 (Sandy Bridge) processors at 2.7 GHz, one 61-core Intel Xeon Phi SE10P (Knights Corner) coprocessor at 1.09 GHz, and 32GB of DRAM. The Sandy Bridge core has 256-bit vector registers and can be programmed using the 256-bit vector intrinsics. The Xeon Phi coprocessor is a Many Integrated Core (MIC) architecture, the basis of which is a light-weight x86 core with in-order instruction processing, coupled with heavy-weight 512-bit SIMD registers and instructions. With these two features the Phi die can support 61 cores and can execute 8-way double precision SIMD operations. The code was compiled and run using Intel Composer XE 2013 and Intel MPI Library 4.1.

The nature of the algorithm is such that most computations required at a GLL point, such as the displacement gradients in the ξ_i and x_i coordinates, for example

\frac{\partial u_i}{\partial \xi_1} = \sum_{f=1}^{N+1} u_{i,f} \left.\frac{\partial \ell_f}{\partial \xi_1}\right|_{\xi_{1,p}}, \qquad \frac{\partial u_i}{\partial x_1} = \frac{\partial u_i}{\partial \xi_1}\frac{\partial \xi_1}{\partial x_1} + \frac{\partial u_i}{\partial \xi_2}\frac{\partial \xi_2}{\partial x_1} + \frac{\partial u_i}{\partial \xi_3}\frac{\partial \xi_3}{\partial x_1}    (12)

can be performed in a vectorizable fashion for i ∈ [1, 2, 3], with similar expressions for the displacement gradients in the other directions, and with similar operations for the computation of the stresses and the integration. Taking advantage of this fact, we used the QPX intrinsics on Blue Gene/Q, such as vec_mul and vec_madd, to perform multiply and multiply-add operations respectively, using 3 of the 4 available execution slots.
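The structure being exploited can be sketched in plain C++ as below (our own simplified layout and function name; on Blue Gene/Q the inner loop over the three components maps onto QPX vec_mul/vec_madd operations occupying 3 of the 4 SIMD lanes, and on Sandy Bridge onto the corresponding AVX intrinsics):

```cpp
// Chain rule (12) at one GLL point: the three displacement components share identical
// arithmetic, so the innermost loop over i vectorizes across 3 of the 4 double lanes.
// Assumed layout (ours): component index i is innermost/contiguous.
// du_dxi[k][i] = du_i/d(xi_k), dxi_dx[k][j] = d(xi_k)/d(x_j), du_dx[j][i] = du_i/d(x_j).
void displacement_gradient_x(const double du_dxi[3][3],
                             const double dxi_dx[3][3],
                             double du_dx[3][3]) {
    for (int j = 0; j < 3; ++j) {
        for (int i = 0; i < 3; ++i) {
            du_dx[j][i] = du_dxi[0][i] * dxi_dx[0][j]      // multiply      (vec_mul)
                        + du_dxi[1][i] * dxi_dx[1][j]      // multiply-add  (vec_madd)
                        + du_dxi[2][i] * dxi_dx[2][j];     // multiply-add  (vec_madd)
        }
    }
}
```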
Despite not using the full capacity of the floating point unit, this approach significantly improves performance without requiring the development of a more complex algorithm. With the Sandy Bridge core the same approach is used, but with the AVX intrinsics. With the MIC architecture, on the other hand, the ability to perform 8-way double precision SIMD operations means that this approach would make far less efficient use of the SIMD registers. To test its full capacity, a restriction was made to 7th order Lagrange polynomials in the discretization (hence N + 1 = 8), and the algorithm for the computation of forces was restructured so that, rather than vectorizing the computation of quantities over i ∈ [1, 2, 3], the terms over the Lagrange polynomials f ∈ [1, N+1] in (12) were vectorized with KNC multiply and reduction intrinsics, namely _mm512_mul_pd and _mm512_reduce_add_pd respectively.

3 Results & Discussion

To test the scalability of the solver, both strong and weak scaling runs were performed on each system for a simulation of the forward wavefield only (since the nature of the algorithm means that the adjoint and β fields scale in exactly the same way). On each Blue Gene/Q compute node one MPI process was instantiated, with multiple threads spread across the 16 cores. To test the distributed memory parallelization, Figures 3(a) and 3(b) present the speedup and parallel efficiency for a grid comprising approximately 5.5 billion degrees of freedom, distributed over increasingly many compute nodes. As can be observed, the solver scales well out to 2048 compute nodes (which is in fact two Blue Gene/Q 'racks').

Fig. 3. Strong scaling results for the forward solver presenting (a) speedup and (b) parallel efficiency for an internode run, testing the distributed memory parallelization on Blue Gene/Q.

To test the shared memory parallelization, Figures 4(a) and 4(b) present the speedup and parallel efficiency within a single compute node for a grid comprising approximately 6.4 million degrees of freedom. As can be observed, the solver scales reasonably well to 16 threads per node. With 32 and 64 threads per node, however, the hardware threading capability is being utilized, which shows an improvement but decreasing parallel efficiency, as expected.

Fig. 4. Strong scaling results for the forward solver presenting (a) speedup and (b) parallel efficiency for an intranode run, testing the shared memory parallelization on Blue Gene/Q.

To test the weak scalability, Figure 5 presents the run time for 500 time steps with grid portions comprising 6.4 million degrees of freedom per MPI process, implying that with 1024 compute nodes the total grid comprises approximately 6.5 billion degrees of freedom. As can be observed, the solver performs reasonably well, requiring approximately 13% greater run time for a grid approximately one thousand times larger. Figures 6(a) and 6(b) present the speedup and parallel efficiency for the grid comprising approximately 5.5 billion degrees of freedom on Stampede. As can be observed, the solver scales reasonably well out to 512 compute nodes.
The speedup on Stampede is less than that on Blue Gene/Q for large node runs. The reason could be that the topology of the nodes allocated on Stampede is in general irregular, whereas on Blue Gene/Q the nodes allocated for a job have a structured topology, resulting in good MPI communication performance.

Fig. 5. Weak scaling results for the forward solver presenting the run time for 500 time steps on Blue Gene/Q.

Fig. 6. Strong scaling results for the forward solver presenting (a) speedup and (b) parallel efficiency for an internode run, testing the distributed memory parallelization on Stampede.

As previously mentioned, Stampede compute nodes are equipped with Intel Xeon Phi (MIC) coprocessors. In general, there are two ways in which the MIC coprocessor could be used. Once a portion of the grid is allocated to a node, the elements in the domain could be distributed between the host and the MIC, for instance with the host operating on the interface elements (which involve MPI communication) and the processing of the interior elements offloaded to the MIC. However, this approach involves a high overhead of data transfer between host and MIC at each time step to update the displacement gradients and the global force array. The other approach is to assign a portion of the grid to the MIC, with an MPI process running on each MIC. This can be accomplished using the symmetric MPI methodology between host and MIC. In this approach it becomes crucial to balance the compute load between CPU and MIC. Since the core computation involves irregular memory accesses, and hence poor last level cache locality, the memory access problem becomes even more severe on the MIC, where many threads compete for the memory interface controller. According to our experiments, on average the code runs 3 times slower on the MIC (using 120 threads) than on the host (using 16 cores). The approach used to balance the load is therefore to run 3 MPI tasks on the CPU, each using 5 threads, and 1 MPI task on the MIC. Figure 7(a) presents strong scaling results for both cases for a grid comprising approximately 115 million degrees of freedom, using only the host to run 3 MPI processes with 5 OpenMP threads per process, and additionally running 1 MPI task with 120 threads on the MIC. As the results show, using the MIC results in around a 34% improvement in performance.

Fig. 7. Strong scaling results for the forward solver comparing CPU and symmetric CPU+MIC, presenting (a) runtime for an internode run testing the distributed memory parallelization and (b) speedup for a MIC intranode run testing the shared memory parallelization on Stampede.

To test the shared memory parallelization, Figure 7(b) presents the speedup within a MIC for a grid comprising approximately 32 million degrees of freedom. As can be observed, the solver scales reasonably well to 60 threads per MIC. With 120 and 180 threads per MIC, the hardware threading capability (the MIC supports up to 4 hardware threads per core) is utilized, which shows an improvement but decreasing parallel efficiency.
Since the code uses 512-bit vector intrinsics, and since the vector unit is shared across the threads in a core, there is no improvement when 240 threads are used.

4 Conclusions

A communication-avoiding, matrix-free formulation and parallelization strategy for 3D elastic spectral element forward and adjoint solvers has been introduced. The formulation presents a viable strategy for overcoming one of the key computational bottlenecks associated with full waveform simulation and inversion. The proposed algorithmic framework requires minimal storage and communication and shows remarkable scalability on large computational domains. Future work will include the addition of the variational approach (using ∂_{m_j} J to update the earth model and thereby completing the FWI algorithm) and integration of the solver with the cloud-based delivery model.

References

1. Fichtner, A.: Full Seismic Waveform Modelling and Inversion. Advances in Geophysical and Environmental Mechanics and Mathematics. Springer (2010)
2. Haber, E., Van Den Doel, K., Horesh, L.: Optimal design of simultaneous source encoding. Inverse Problems in Science and Engineering (ahead-of-print) (2014) 1-18
3. Horesh, L., Schweiger, M., Arridge, S., Holder, D.: Large-scale non-linear 3D reconstruction algorithms for electrical impedance tomography of the human head. In: World Congress on Medical Physics and Biomedical Engineering 2006, Springer (2007) 3862-3865
4. Chavent, G.: Analyse fonctionnelle et identification de coefficients répartis dans les équations aux dérivées partielles. IRIA (1971)
5. Fichtner, A., Bunge, H.P., Igel, H.: The adjoint method in seismology - II. Applications: traveltimes and sensitivity functionals. Physics of the Earth and Planetary Interiors 157 (2006) 105-123
6. Parter, S.V.: On the Legendre-Gauss-Lobatto points and weights. J. Sci. Comput. 14(4) (December 1999) 347-355
7. Kane, C., Marsden, J.E., Ortiz, M., West, M.: Variational integrators and the Newmark algorithm for conservative and dissipative mechanical systems. Internat. J. Numer. Methods Engrg. 49 (1999) 1295-1325