Transcription: Domain Decomposition Method for Fast Gaussian Process Regression of Large Spatial Datasets
Domain Decomposition Method for Fast Gaussian Process Regression of Large Spatial Datasets
Chiwoo Park (Florida State University), Jianhua Huang (Texas A&M University), and Yu Ding (Texas A&M University)
Spatial Statistics Workshop, January 30, 2015

Introduction: Spatial Prediction with Large Datasets

- Goal: predict a spatial field f(x) at an arbitrary location x_* ∈ R^2, given observations of f(x) at n locations x_1, ..., x_n.
- Potential application: gridding remote sensing data.
- Typically n > 100,000 and the number of testing locations exceeds 10,000.

Introduction: A Popular Solution Is Gaussian Process Regression

- The observations y = (y_1, ..., y_n)^t are corrupted by independent Gaussian noise: y_i = f(x_i) + \epsilon_i with \epsilon_i \sim N(0, \sigma^2).
- Assume f(x) is a zero-mean Gaussian process with covariance function k(x, x'). Then

    f(x_*) \mid y \sim N(\hat{\mu}(x_*), \hat{\sigma}(x_*))
    mean:     \hat{\mu}(x_*) = k_{x_*}^t (\sigma^2 I + K_{xx})^{-1} y
    variance: \hat{\sigma}(x_*) = k_{**} - k_{x_*}^t (\sigma^2 I + K_{xx})^{-1} k_{x_*},

  where k_{**} = k(x_*, x_*), k_{x_*} = (k(x_1, x_*), ..., k(x_n, x_*))^t, and K_{xx} is the n x n matrix with (i, j)th entry k(x_i, x_j).

Introduction: Computational Challenge

- Computing the mean and the variance involves inverting (\sigma^2 I + K_{xx}).
- Training time complexity: the operation count for the inversion is O(n^3). Estimating the parameters of the covariance function typically requires evaluating the inverses of several different covariance matrices.
- Testing time complexity: after the inversion, we still need an additional O(n^2) operations to obtain the mean and the variance.
- For n < 2,000 the computation is not a problem; for n > 10,000 it is prohibitive.

Literature Review: Covariance Tapering (Furrer et al., 2006; Kaufman et al., 2008)

- Idea: replace the sample covariance matrix with a similar but cheaply invertible covariance matrix.
- Use the tapered covariance K_taper(x, x') = K(x, x') K_r(x, x'). K_taper preserves the shape of K, but its value is truncated to zero outside a fixed range r.
- Asymptotically consistent for the isotropic Matérn covariance function.
- Complexity ~ O(n^2 r); tested on n = 7,352 (computing time 20 seconds).

Literature Review: Low-rank Approximation (Seeger et al., 2003; Snelson and Ghahramani, 2006, 2007)

- Assume a certain conditional independence: introduce m latent variables u = (u_1, ..., u_m)^t, the values of f(x) at a set of input locations x_u, and assume f(x_*) and y are conditionally independent given u.
- The sample covariance matrix K_{xx} is replaced with a version of K_{xu} K_{uu}^{-1} K_{ux}.
- Complexity is reduced to O(nm^2) for training and O(m^2) for testing.

Literature Review: Local Kriging

- Partition y into y_1, ..., y_m and fit a separate Gaussian process regression to each partition.
- The sample covariance matrix is approximated by its block-diagonal version. When each block has size b, the computation is reduced to O(nb^2) for training and O(b^2) for testing.
- However, prediction inconsistency arises between partitions. (Two short code sketches, of the exact predictor and of plain local kriging, follow below.)
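For a concrete reference point, here is a minimal sketch of the exact GP predictor described above, implementing the mean and variance formulas via a Cholesky factorization. The squared-exponential kernel, the hyperparameter values, and the function names are illustrative assumptions, not from the talk.

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') between two sets of locations."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X, y, Xstar, noise_var=0.1):
    """Exact GP prediction: factorizing (sigma^2 I + K_xx) is the O(n^3)
    training cost; the backsolves afterwards are the O(n^2) testing cost."""
    n = len(X)
    L = np.linalg.cholesky(noise_var * np.eye(n) + sq_exp_kernel(X, X))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (sigma^2 I + K)^-1 y
    Ks = sq_exp_kernel(X, Xstar)                         # k_{x*} for each test point
    mean = Ks.T @ alpha
    V = np.linalg.solve(L, Ks)
    var = sq_exp_kernel(Xstar, Xstar).diagonal() - (V * V).sum(axis=0)
    return mean, var
```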
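And a matching sketch of plain local kriging, reusing gp_predict and the numpy import from the block above: partition the domain, fit an independent GP per partition, and let each partition predict its own test points. The strip partitioning scheme is an assumption for illustration; the point is that nothing ties neighboring predictors together at their shared boundary.

```python
def local_kriging_predict(X, y, Xstar, n_strips=4, noise_var=0.1):
    """Local kriging sketch: split the domain into strips along the first
    coordinate and run an independent exact GP inside each strip."""
    edges = np.quantile(X[:, 0], np.linspace(0.0, 1.0, n_strips + 1))
    edges[0], edges[-1] = -np.inf, np.inf     # cover every test location
    mean, var = np.empty(len(Xstar)), np.empty(len(Xstar))
    for j in range(n_strips):
        tr = (X[:, 0] >= edges[j]) & (X[:, 0] < edges[j + 1])
        te = (Xstar[:, 0] >= edges[j]) & (Xstar[:, 0] < edges[j + 1])
        if te.any():
            # Each local solve is O(b^3) in the block size b instead of O(n^3),
            # but neighboring strips can disagree near their shared boundary.
            mean[te], var[te] = gp_predict(X[tr], y[tr], Xstar[te], noise_var)
    return mean, var
```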
Literature Review: Mixture of Local Experts

- Take an ensemble of local kriging predictors: the Bayesian Committee Machine (Tresp, 2000), Bagging (Chen and Ren, 2009), the Dirichlet process mixture (Rasmussen and Ghahramani, 2002), and Bayesian model averaging (Urtasun and Darrell, 2008).
- The operation count is reduced to O(nb^2) for training and O(nb) for testing. Note that the testing count is higher than the O(b^2) of local kriging because of the mixture or ensemble step. This can be an issue when the number of test locations is large, e.g., 6.6 x 10^5 in Furrer et al. (2006).

Our Approach

- We propose an improved local kriging approach that explicitly addresses the prediction inconsistency problem.
- Local kriging has advantages over the other approaches: it can fit a different set of covariance parameters to each individual partition, and it is easy to parallelize, with per-node complexity further reduced to O(b^3) for training and O(b^2) for testing.
- Solving the prediction inconsistency problem might also improve the prediction accuracy.

Our Approach: Domain Decomposition Method (Park et al., 2011, 2012)

How do we solve the prediction inconsistency problem?
1. Formulate local kriging as an unconstrained optimization.
2. Add a consistency constraint to the optimization.

Gaussian process regression as an optimization problem:

- The mean predictor is a linear predictor, i.e., \hat{\mu}(x_*) = k_{x_*}^t (\sigma^2 I + K_{xx})^{-1} y lies in L := { u(x_*)^t y : u(x_*) : R^2 -> R^n }.
- The predictor minimizes the expected prediction error over \hat{f} ∈ L:

    \min_{\hat{f} \in L} E[(f - \hat{f})^2 \mid x = x_*]
    = \min_{u(x_*)} u(x_*)^t (\sigma^2 I + K_{xx}) u(x_*) - 2 u(x_*)^t k_{x_*} + k_{**}.

  The optimal solution is u(x_*) = (\sigma^2 I + K_{xx})^{-1} k_{x_*}.
- We restrict u(x_*) to the form A k_{x_*} with A ∈ R^{n x n}:

    \min_{A \in R^{n \times n}} k_{x_*}^t A^t (\sigma^2 I + K_{xx}) A k_{x_*} - 2 k_{x_*}^t A^t k_{x_*} + k_{**}.

- The KKT condition yields the optimal solution \hat{A} = (\sigma^2 I + K_{xx})^{-1}, with corresponding mean prediction \hat{f}(x_*) = k_{x_*}^t \hat{A}^t y. (A short numerical check of this equivalence follows below.)
- The expected prediction error is the objective value at the optimum, which is the minimum error within this class. The computational cost is unchanged by the reformulation.

Our Approach: Domain Decomposition

- Partition the domain of the latent function f, and partition the observations y accordingly.

    [Figure: the domain is split into subdomains Ω_1 with data (x_1, y_1) and Ω_2 with data (x_2, y_2), sharing the interface Γ_0 with boundary values (x_0, r).]

- For x_* ∈ Ω_j,

    \min_{A_j \in R^{n \times n}} k_{x_j *}^t A_j^t (\sigma^2 I + K_{x_j x_j}) A_j k_{x_j *} - 2 k_{x_j *}^t A_j^t k_{x_j *} + k_{**}.

- The optimal solution is \hat{A}_j = (\sigma^2 I + K_{x_j x_j})^{-1}, and the corresponding predictor \hat{f}_j(x_*) = k_{x_j *}^t \hat{A}_j^t y_j is equivalent to the local kriging solution.
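As the quick numerical check promised above, using the illustrative sq_exp_kernel from the earlier sketch: the minimizer of the quadratic objective in u(x_*) reproduces the kriging weights (\sigma^2 I + K_{xx})^{-1} k_{x_*} exactly, because setting the gradient 2(\sigma^2 I + K_{xx})u - 2k_{x_*} to zero yields the same linear system. The toy data below is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 6.0, size=(50, 2))              # 50 training locations in R^2
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)  # toy observations
xstar = np.array([[3.0, 2.0]])                       # one test location

C = 0.01 * np.eye(50) + sq_exp_kernel(X, X)          # sigma^2 I + K_xx
kstar = sq_exp_kernel(X, xstar)[:, 0]                # k_{x*}

u_opt = np.linalg.solve(C, kstar)        # minimizer of u^t C u - 2 u^t k* + k**
mean_via_opt = u_opt @ y                 # predictor from the optimization view
mean_via_kriging = kstar @ np.linalg.solve(C, y)     # standard kriging mean
assert np.isclose(mean_via_opt, mean_via_kriging)
```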
Our Approach: Adding a Consistency Constraint

- Surely, the two predictors for Ω_1 and Ω_2 will give different predictions on Γ_0:

    \hat{f}_1(x_\circ) = k_{x_1 \circ}^t \hat{A}_1^t y_1 \ne \hat{f}_2(x_\circ) = k_{x_2 \circ}^t \hat{A}_2^t y_2.

- Add constraints: we force the two predictors to take the same value r(x_\circ) at every x_\circ ∈ Γ_0:

    (LP_j)  \min_{A_j \in R^{n \times n}} k_{x_j *}^t A_j^t (\sigma^2 I + K_{x_j x_j}) A_j k_{x_j *} - 2 k_{x_j *}^t A_j^t k_{x_j *} + k_{**}
            s.t. k_{x_j \circ}^t A_j^t y_j = r(x_\circ) for x_\circ \in \Gamma_0.

Our Approach: Boundary Value Estimation

- Ideally r(x_\circ) = f(x_\circ) on Γ_0. Since f is unknown, we estimate the r(x_\circ) that minimizes the overall predictive error at x_\circ ∈ Γ_0:

    (BVP)  \min_{r(x_\circ)} \int_{x_* \in \Gamma_0} E[(f - \hat{f}_1)^2 \mid x = x_*] + E[(f - \hat{f}_2)^2 \mid x = x_*]
           = \int_{x_* \in \Gamma_0} E[(r - \hat{f}_1)^2 \mid x = x_*] + E[(r - \hat{f}_2)^2 \mid x = x_*].

- We call each problem (LP_j) a "local problem" and the problem (BVP) for r(x_\circ) a "boundary value problem".

Our Approach: A Finite-Dimensional Approximation of (LP_j)

- Let x_b be a collection of q points on Γ_0 and r_b the vector of the q values of r(x_\circ) at those points. When r_b is known, problem (LP_j) is approximated by

    (LP_j)  \min_{A_j \in R^{n \times n}} k_{x_j *}^t A_j^t (\sigma^2 I + K_{x_j x_j}) A_j k_{x_j *} - 2 k_{x_j *}^t A_j^t k_{x_j *}
            s.t. k_{x_j x_b}^t A_j^t y_j = r_b.

- The KKT condition of this problem yields the optimal solution. The corresponding prediction and prediction error are

    \hat{f}_j^c(x_*) = k_{x_j *}^t (\sigma^2 I + K_{x_j x_j})^{-1} y_j + \bar{k}_{x_b *}^t (r_b - k_{x_j x_b}^t (\sigma^2 I + K_{x_j x_j})^{-1} y_j)
    \hat{\sigma}_j^c(x_*) = k_{**} - k_{x_j *}^t (\sigma^2 I + K_{x_j x_j})^{-1} k_{x_j *}
                            + \bar{k}_{x_b *}^t (r_b - k_{x_j x_b}^t (\sigma^2 I + K_{x_j x_j})^{-1} y_j)(r_b - k_{x_j x_b}^t (\sigma^2 I + K_{x_j x_j})^{-1} y_j)^t \bar{k}_{x_b *},

  where the weight \bar{k}_{x_b *} diminishes as x_* moves away from Γ_0. (A hedged code sketch of this boundary-consistent prediction follows below.)

Our Approach: A Finite-Dimensional Approximation of (BVP)

    (BVP)  \min_{r_b} \int_{x_* \in \Gamma_0} E[(r - \hat{f}_1)^2] + E[(r - \hat{f}_2)^2]
           \approx \sum_{j=1,2} (r_b - k_{x_j x_b}^t \hat{A}_j^t y_j)^t (r_b - k_{x_j x_b}^t \hat{A}_j^t y_j).

- The optimal solution has the form

    r_b = \alpha_1 k_{x_1 x_b}^t \hat{A}_1^t y_1 + \alpha_2 k_{x_2 x_b}^t \hat{A}_2^t y_2 = \alpha_1 \hat{f}_1(x_b) + \alpha_2 \hat{f}_2(x_b),

  where \alpha_j is proportional to the standardized square sum of y_j.

Our Approach: Learning the Covariance Structure

- The underlying covariance function k(x, x') is unknown; it is typically assumed to take a parametric form k_\theta(x, x'), and \theta is learned by likelihood maximization.
- (G-DDM) One global \theta is learned by

    \max_\theta \; -\sum_j \{ y_j^t (\sigma^2 I + K_{x_j x_j}^\theta)^{-1} y_j + \log\det(\sigma^2 I + K_{x_j x_j}^\theta) \}.

- (L-DDM) A separate \theta_j is learned for each Ω_j by

    \max_{\theta_j} \; - y_j^t (\sigma^2 I + K_{x_j x_j}^{\theta_j})^{-1} y_j - \log\det(\sigma^2 I + K_{x_j x_j}^{\theta_j}).
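To make the constrained prediction step concrete, here is a hedged sketch of the two-subdomain mechanism. It is not the paper's exact KKT-derived formula: the boundary value r_b is estimated with a simple average of the two local boundary predictions (the talk's \alpha_j weights are a refinement), and boundary agreement is imposed by appending (x_b, r_b) to each local training set as pseudo-observations, which plays the role of the correction term \bar{k}_{x_b *}^t (r_b - ...) above. It reuses gp_predict from the first sketch.

```python
import numpy as np

def ddm_two_domain_predict(X1, y1, X2, y2, Xb, Xstar1, Xstar2, noise_var=0.1):
    """Boundary-consistent local prediction for two subdomains sharing the
    interface points Xb; a mechanism sketch, not the paper's KKT solution."""
    # Step 1: each local GP predicts the boundary points independently.
    fb1, _ = gp_predict(X1, y1, Xb, noise_var)
    fb2, _ = gp_predict(X2, y2, Xb, noise_var)
    # Step 2: estimate the boundary values r_b (simple average in this sketch).
    rb = 0.5 * (fb1 + fb2)
    # Step 3: refit each local GP with (Xb, rb) appended as pseudo-observations,
    # pulling both predictors toward the same values r_b on the interface.
    out1 = gp_predict(np.vstack([X1, Xb]), np.concatenate([y1, rb]), Xstar1, noise_var)
    out2 = gp_predict(np.vstack([X2, Xb]), np.concatenate([y2, rb]), Xstar2, noise_var)
    return out1, out2
```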
Empirical Studies: Experimental Setting

- Datasets: TCO.Lv2 (n = 182,591), TCO.Lv3 (n = 48,331), MOD08-CL (n = 64,800).
- 90% of each dataset used for training, 10% reserved for testing.
- Anisotropic squared exponential covariance function.
- Focus of the empirical study: compare with standard kriging; compare with local kriging to check whether forcing prediction consistency improves the overall accuracy; compare with the low-rank approximation methods FIC and PIC; compare with the localized methods Bayesian Committee Machine (BCM), Local Probabilistic Regression (LPR), and Bagging for GP (BGP).

Empirical Study: G-DDM versus Standard Kriging (1-d toy dataset, n = 200)

    [Figure: predictions on the toy dataset from (a) standard (full) GPR, (b) local kriging, and (c) G-DDM.]

- Our method eliminates the prediction inconsistency visible in local kriging.
- The prediction of DDM is comparable to that of standard kriging.

Empirical Study: L-DDM versus Local Kriging (TCO.Lv3)

    [Figure: (a) mean squared error and (b) mean squared mismatch versus the number of subdomains, for DDM and local GP.]

- Our method practically eliminates the prediction inconsistency problem.
- Controlling the boundary continuity does indeed help the prediction accuracy.

Empirical Study: G-DDM versus FIC, PIC

    [Figure: MSE versus train + test time (sec) on (a) TCO.Lv3 and (b) MOD08-CL, for G-DDM, kPIC, rPIC, and FIC.]

- DDM outperformed kPIC (PIC with k-means clustering), rPIC (PIC with regular meshing), and FIC.

Empirical Study: L-DDM versus BCM, LPR, BGP (TCO.Lv3)

    [Figure: MSE versus (a) training time (sec) and (b) testing time (sec), for L-DDM, BGP, L-BGP, LPR, and BCM.]

- Training time vs. MSE: DDM is comparable to the BCM.
- Testing time vs. MSE: DDM is superior to the other methods.
- (BCM = Bayesian Committee Machine (Tresp, 2000); BGP = Bagging (Chen and Ren, 2009); LPR = Bayesian model averaging (Urtasun and Darrell, 2008).)

Empirical Study: Parallel Computation (P-DDM)

- P-DDM complexity = O(B^3), where B = the number of data points in a subdomain.
  1. Solve the interface equation for r_\circ.
  2. Given r_\circ, train each local predictor in parallel. (A sketch of this parallel step follows below.)
- Computation time is constant in n. The computation times required to obtain the best accuracy on the two datasets (with four CPU cores) were: TCO.Lv3 (n = 48,331): 27 seconds; TCO.Lv2 (n = 182,591): 29 seconds.
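Finally, a minimal sketch of the parallel step in P-DDM, assuming the interface equation has already been solved so that the boundary values r_b of every subdomain are fixed: the local fits then decouple and can run on separate cores. The process-pool layout and the task tuple are illustrative assumptions, not the GPLP toolbox's actual implementation; gp_predict is reused from the first sketch.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def fit_local(task):
    """Train and test one local predictor; once r_b is fixed, each subdomain
    is independent of all others, so the cost per core is O(B^3)."""
    Xj, yj, Xb, rb, Xstar_j = task
    Xaug = np.vstack([Xj, Xb])               # boundary points appended as
    yaug = np.concatenate([yj, rb])          # pseudo-observations, as above
    return gp_predict(Xaug, yaug, Xstar_j)

def p_ddm_predict(local_tasks, n_workers=4):
    """local_tasks: one (Xj, yj, Xb, rb, Xstar_j) tuple per subdomain."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(fit_local, local_tasks))
```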
Conclusion: Summary

Benefits of our approach:
- Reduced computation. Locality gives O(nB^2), comparable to FIC, PIC, BCM, and LPR; parallelization gives O(B^3) per node, constant in n.
- Competitive accuracy, because the method is adaptive to local changes (different covariance functions for different subdomains) and has improved continuity (enhanced accuracy over the boundaries).
- An open-source implementation is also available: see "A Local and Parallel Computation Toolbox for Gaussian Process Regression 1.0" at mloss.org or in the Journal of Machine Learning Research Open Source Software section.

Ongoing work: a better finite element approximation of the formulation, for better accuracy and improved numerical stability.

References

Chen, T. and J. Ren (2009). Bagging for Gaussian process regression. Neurocomputing 72(7-9), 1605-1610.
Furrer, R., M. G. Genton, and D. Nychka (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics 15(3).
Kaufman, C. G., M. J. Schervish, and D. W. Nychka (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association 103(484), 1545-1555.
Park, C., J. Huang, and Y. Ding (2011). Domain decomposition approach for fast Gaussian process regression of large spatial data sets. Journal of Machine Learning Research 12, 1697-1728.
Park, C., J. Huang, and Y. Ding (2012). GPLP: a local and parallel computation toolbox for Gaussian process regression. Journal of Machine Learning Research 13, 775-779.
Rasmussen, C. E. and Z. Ghahramani (2002). Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems 14, pp. 881-888. MIT Press.
Seeger, M., C. K. I. Williams, and N. D. Lawrence (2003). Fast forward selection to speed up sparse Gaussian process regression. In International Workshop on Artificial Intelligence and Statistics 9. Society for Artificial Intelligence and Statistics.
Snelson, E. and Z. Ghahramani (2006). Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, pp. 1257-1264. MIT Press.
Snelson, E. and Z. Ghahramani (2007). Local and global sparse Gaussian process approximations. In International Conference on Artificial Intelligence and Statistics 11, pp. 524-531. Society for Artificial Intelligence and Statistics.
Tresp, V. (2000). A Bayesian committee machine. Neural Computation 12(11), 2719-2741.
Urtasun, R. and T. Darrell (2008). Sparse probabilistic regression for activity-independent human pose inference. In IEEE Conference on Computer Vision and Pattern Recognition 2008, pp. 1-8. IEEE.