Transcription: Domain Decomposition Method for Fast Gaussian Process Regression of Large Spatial Datasets
Domain Decomposition Method for Fast Gaussian Process Regression of Large Spatial Datasets
Chiwoo Park (Florida State University), Jianhua Huang (Texas A&M University), and Yu Ding (Texas A&M University)
Spatial Statistics Workshop, January 30, 2015

Introduction: Spatial Prediction with Large Datasets

- Goal: predict a spatial field f(x) at an arbitrary location x_* ∈ R^2, given observations of f(x) at n locations x_1, ..., x_n.
- Potential application: gridding remote sensing data.
- Typically n > 100,000 and the number of testing locations exceeds 10,000.

Introduction: A Popular Solution Is Gaussian Process Regression

- The observations y = (y_1, ..., y_n)^t are corrupted by independent Gaussian noise: y_i = f(x_i) + \epsilon_i with \epsilon_i \sim N(0, \sigma^2).
- Assume f(x) is a zero-mean Gaussian process with covariance function k(x, x'). Then

    f(x_*) \mid y \sim N(\hat{\mu}(x_*), \hat{\sigma}(x_*))
    mean:     \hat{\mu}(x_*) = k_{x_*}^t (\sigma^2 I + K_{xx})^{-1} y
    variance: \hat{\sigma}(x_*) = k_{**} - k_{x_*}^t (\sigma^2 I + K_{xx})^{-1} k_{x_*},

  where k_{**} = k(x_*, x_*), k_{x_*} = (k(x_1, x_*), ..., k(x_n, x_*))^t, and K_{xx} is the n x n matrix with (i, j)th entry k(x_i, x_j).

Introduction: Computational Challenge

- Computing the mean and the variance involves inverting (\sigma^2 I + K_{xx}).
- Training time complexity: the operation count for the inversion is O(n^3). Estimating the parameters of the covariance function typically requires evaluating the inverses of several different covariance matrices.
- Testing time complexity: after the inversion, we still need an additional O(n^2) operations to obtain the mean and the variance.
- For n < 2,000 the computation is not a problem; for n > 10,000 it is prohibitive.

Literature Review: Covariance Tapering (Furrer et al., 2006; Kaufman et al., 2008)

- Idea: replace the sample covariance matrix with a similar but cheaply invertible covariance matrix.
- Use the tapered covariance K_taper(x, x') = K(x, x') K_r(x, x'). K_taper preserves the shape of K, but its value is truncated to zero outside a fixed range r.
- Asymptotically consistent for the isotropic Matérn covariance function.
- Complexity ~ O(n^2 r); tested on n = 7,352 (computing time 20 seconds).

Literature Review: Low-rank Approximation (Seeger et al., 2003; Snelson and Ghahramani, 2006, 2007)

- Assume a certain conditional independence: introduce m latent variables u = (u_1, ..., u_m)^t, the values of f(x) at a set of input locations x_u, and assume f(x_*) and y are conditionally independent given u.
- The sample covariance matrix K_{xx} is replaced with a version of K_{xu} K_{uu}^{-1} K_{ux}.
- Complexity is reduced to O(nm^2) for training and O(m^2) for testing.

Literature Review: Local Kriging

- Partition y into y_1, ..., y_m and fit a separate Gaussian process regression to each partition.
- The sample covariance matrix is approximated by its block-diagonal version. When each block has size b, the computation is reduced to O(nb^2) for training and O(b^2) for testing.
- However, prediction inconsistency arises between partitions. (Two short code sketches, of the exact predictor and of plain local kriging, follow below.)
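For a concrete reference point, here is a minimal sketch of the exact GP predictor described above, implementing the mean and variance formulas via a Cholesky factorization. The squared-exponential kernel, the hyperparameter values, and the function names are illustrative assumptions, not from the talk.

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') between two sets of locations."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X, y, Xstar, noise_var=0.1):
    """Exact GP prediction: factorizing (sigma^2 I + K_xx) is the O(n^3)
    training cost; the backsolves afterwards are the O(n^2) testing cost."""
    n = len(X)
    L = np.linalg.cholesky(noise_var * np.eye(n) + sq_exp_kernel(X, X))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (sigma^2 I + K)^-1 y
    Ks = sq_exp_kernel(X, Xstar)                         # k_{x*} for each test point
    mean = Ks.T @ alpha
    V = np.linalg.solve(L, Ks)
    var = sq_exp_kernel(Xstar, Xstar).diagonal() - (V * V).sum(axis=0)
    return mean, var
```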
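And a matching sketch of plain local kriging, reusing gp_predict and the numpy import from the block above: partition the domain, fit an independent GP per partition, and let each partition predict its own test points. The strip partitioning scheme is an assumption for illustration; the point is that nothing ties neighboring predictors together at their shared boundary.

```python
def local_kriging_predict(X, y, Xstar, n_strips=4, noise_var=0.1):
    """Local kriging sketch: split the domain into strips along the first
    coordinate and run an independent exact GP inside each strip."""
    edges = np.quantile(X[:, 0], np.linspace(0.0, 1.0, n_strips + 1))
    edges[0], edges[-1] = -np.inf, np.inf     # cover every test location
    mean, var = np.empty(len(Xstar)), np.empty(len(Xstar))
    for j in range(n_strips):
        tr = (X[:, 0] >= edges[j]) & (X[:, 0] < edges[j + 1])
        te = (Xstar[:, 0] >= edges[j]) & (Xstar[:, 0] < edges[j + 1])
        if te.any():
            # Each local solve is O(b^3) in the block size b instead of O(n^3),
            # but neighboring strips can disagree near their shared boundary.
            mean[te], var[te] = gp_predict(X[tr], y[tr], Xstar[te], noise_var)
    return mean, var
```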
Literature Review: Mixture of Local Experts

- Take an ensemble of local kriging predictors: the Bayesian Committee Machine (Tresp, 2000), Bagging (Chen and Ren, 2009), the Dirichlet process mixture (Rasmussen and Ghahramani, 2002), and Bayesian model averaging (Urtasun and Darrell, 2008).
- The operation count is reduced to O(nb^2) for training and O(nb) for testing. Note that the testing count is higher than the O(b^2) of local kriging because of the mixture or ensemble step. This can be an issue when the number of test locations is large, e.g., 6.6 x 10^5 in Furrer et al. (2006).

Our Approach

- We propose an improved local kriging approach that explicitly addresses the prediction inconsistency problem.
- Local kriging has advantages over the other approaches: it can fit a different set of covariance parameters to each individual partition, and it is easy to parallelize, with per-node complexity further reduced to O(b^3) for training and O(b^2) for testing.
- Solving the prediction inconsistency problem might also improve the prediction accuracy.

Our Approach: Domain Decomposition Method (Park et al., 2011, 2012)

How do we solve the prediction inconsistency problem?
1. Formulate local kriging as an unconstrained optimization.
2. Add a consistency constraint to the optimization.

Gaussian process regression as an optimization problem:

- The mean predictor is a linear predictor, i.e., \hat{\mu}(x_*) = k_{x_*}^t (\sigma^2 I + K_{xx})^{-1} y lies in L := { u(x_*)^t y : u(x_*) : R^2 -> R^n }.
- The predictor minimizes the expected prediction error over \hat{f} ∈ L:

    \min_{\hat{f} \in L} E[(f - \hat{f})^2 \mid x = x_*]
    = \min_{u(x_*)} u(x_*)^t (\sigma^2 I + K_{xx}) u(x_*) - 2 u(x_*)^t k_{x_*} + k_{**}.

  The optimal solution is u(x_*) = (\sigma^2 I + K_{xx})^{-1} k_{x_*}.
- We restrict u(x_*) to the form A k_{x_*} with A ∈ R^{n x n}:

    \min_{A \in R^{n \times n}} k_{x_*}^t A^t (\sigma^2 I + K_{xx}) A k_{x_*} - 2 k_{x_*}^t A^t k_{x_*} + k_{**}.

- The KKT condition yields the optimal solution \hat{A} = (\sigma^2 I + K_{xx})^{-1}, with corresponding mean prediction \hat{f}(x_*) = k_{x_*}^t \hat{A}^t y. (A short numerical check of this equivalence follows below.)
- The expected prediction error is the objective value at the optimum, which is the minimum error within this class. The computational cost is unchanged by the reformulation.

Our Approach: Domain Decomposition

- Partition the domain of the latent function f, and partition the observations y accordingly.

    [Figure: the domain is split into subdomains Ω_1 with data (x_1, y_1) and Ω_2 with data (x_2, y_2), sharing the interface Γ_0 with boundary values (x_0, r).]

- For x_* ∈ Ω_j,

    \min_{A_j \in R^{n \times n}} k_{x_j *}^t A_j^t (\sigma^2 I + K_{x_j x_j}) A_j k_{x_j *} - 2 k_{x_j *}^t A_j^t k_{x_j *} + k_{**}.

- The optimal solution is \hat{A}_j = (\sigma^2 I + K_{x_j x_j})^{-1}, and the corresponding predictor \hat{f}_j(x_*) = k_{x_j *}^t \hat{A}_j^t y_j is equivalent to the local kriging solution.
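As the quick numerical check promised above, using the illustrative sq_exp_kernel from the earlier sketch: the minimizer of the quadratic objective in u(x_*) reproduces the kriging weights (\sigma^2 I + K_{xx})^{-1} k_{x_*} exactly, because setting the gradient 2(\sigma^2 I + K_{xx})u - 2k_{x_*} to zero yields the same linear system. The toy data below is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 6.0, size=(50, 2))              # 50 training locations in R^2
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)  # toy observations
xstar = np.array([[3.0, 2.0]])                       # one test location

C = 0.01 * np.eye(50) + sq_exp_kernel(X, X)          # sigma^2 I + K_xx
kstar = sq_exp_kernel(X, xstar)[:, 0]                # k_{x*}

u_opt = np.linalg.solve(C, kstar)        # minimizer of u^t C u - 2 u^t k* + k**
mean_via_opt = u_opt @ y                 # predictor from the optimization view
mean_via_kriging = kstar @ np.linalg.solve(C, y)     # standard kriging mean
assert np.isclose(mean_via_opt, mean_via_kriging)
```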
Our Approach: Adding a Consistency Constraint

- Surely, the two predictors for Ω_1 and Ω_2 will give different predictions on Γ_0:

    \hat{f}_1(x_\circ) = k_{x_1 \circ}^t \hat{A}_1^t y_1 \ne \hat{f}_2(x_\circ) = k_{x_2 \circ}^t \hat{A}_2^t y_2.

- Add constraints: we force the two predictors to take the same value r(x_\circ) at every x_\circ ∈ Γ_0:

    (LP_j)  \min_{A_j \in R^{n \times n}} k_{x_j *}^t A_j^t (\sigma^2 I + K_{x_j x_j}) A_j k_{x_j *} - 2 k_{x_j *}^t A_j^t k_{x_j *} + k_{**}
            s.t. k_{x_j \circ}^t A_j^t y_j = r(x_\circ) for x_\circ \in \Gamma_0.

Our Approach: Boundary Value Estimation

- Ideally r(x_\circ) = f(x_\circ) on Γ_0. Since f is unknown, we estimate the r(x_\circ) that minimizes the overall predictive error at x_\circ ∈ Γ_0:

    (BVP)  \min_{r(x_\circ)} \int_{x_* \in \Gamma_0} E[(f - \hat{f}_1)^2 \mid x = x_*] + E[(f - \hat{f}_2)^2 \mid x = x_*]
           = \int_{x_* \in \Gamma_0} E[(r - \hat{f}_1)^2 \mid x = x_*] + E[(r - \hat{f}_2)^2 \mid x = x_*].

- We call each problem (LP_j) a "local problem" and the problem (BVP) for r(x_\circ) a "boundary value problem".

Our Approach: A Finite-Dimensional Approximation of (LP_j)

- Let x_b be a collection of q points on Γ_0 and r_b the vector of the q values of r(x_\circ) at those points. When r_b is known, problem (LP_j) is approximated by

    (LP_j)  \min_{A_j \in R^{n \times n}} k_{x_j *}^t A_j^t (\sigma^2 I + K_{x_j x_j}) A_j k_{x_j *} - 2 k_{x_j *}^t A_j^t k_{x_j *}
            s.t. k_{x_j x_b}^t A_j^t y_j = r_b.

- The KKT condition of this problem yields the optimal solution. The corresponding prediction and prediction error are

    \hat{f}_j^c(x_*) = k_{x_j *}^t (\sigma^2 I + K_{x_j x_j})^{-1} y_j + \bar{k}_{x_b *}^t (r_b - k_{x_j x_b}^t (\sigma^2 I + K_{x_j x_j})^{-1} y_j)
    \hat{\sigma}_j^c(x_*) = k_{**} - k_{x_j *}^t (\sigma^2 I + K_{x_j x_j})^{-1} k_{x_j *}
                            + \bar{k}_{x_b *}^t (r_b - k_{x_j x_b}^t (\sigma^2 I + K_{x_j x_j})^{-1} y_j)(r_b - k_{x_j x_b}^t (\sigma^2 I + K_{x_j x_j})^{-1} y_j)^t \bar{k}_{x_b *},

  where the weight \bar{k}_{x_b *} diminishes as x_* moves away from Γ_0. (A hedged code sketch of this boundary-consistent prediction follows below.)

Our Approach: A Finite-Dimensional Approximation of (BVP)

    (BVP)  \min_{r_b} \int_{x_* \in \Gamma_0} E[(r - \hat{f}_1)^2] + E[(r - \hat{f}_2)^2]
           \approx \sum_{j=1,2} (r_b - k_{x_j x_b}^t \hat{A}_j^t y_j)^t (r_b - k_{x_j x_b}^t \hat{A}_j^t y_j).

- The optimal solution has the form

    r_b = \alpha_1 k_{x_1 x_b}^t \hat{A}_1^t y_1 + \alpha_2 k_{x_2 x_b}^t \hat{A}_2^t y_2 = \alpha_1 \hat{f}_1(x_b) + \alpha_2 \hat{f}_2(x_b),

  where \alpha_j is proportional to the standardized square sum of y_j.

Our Approach: Learning the Covariance Structure

- The underlying covariance function k(x, x') is unknown; it is typically assumed to take a parametric form k_\theta(x, x'), and \theta is learned by likelihood maximization.
- (G-DDM) One global \theta is learned by

    \max_\theta \; -\sum_j \{ y_j^t (\sigma^2 I + K_{x_j x_j}^\theta)^{-1} y_j + \log\det(\sigma^2 I + K_{x_j x_j}^\theta) \}.

- (L-DDM) A separate \theta_j is learned for each Ω_j by

    \max_{\theta_j} \; - y_j^t (\sigma^2 I + K_{x_j x_j}^{\theta_j})^{-1} y_j - \log\det(\sigma^2 I + K_{x_j x_j}^{\theta_j}).
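To make the constrained prediction step concrete, here is a hedged sketch of the two-subdomain mechanism. It is not the paper's exact KKT-derived formula: the boundary value r_b is estimated with a simple average of the two local boundary predictions (the talk's \alpha_j weights are a refinement), and boundary agreement is imposed by appending (x_b, r_b) to each local training set as pseudo-observations, which plays the role of the correction term \bar{k}_{x_b *}^t (r_b - ...) above. It reuses gp_predict from the first sketch.

```python
import numpy as np

def ddm_two_domain_predict(X1, y1, X2, y2, Xb, Xstar1, Xstar2, noise_var=0.1):
    """Boundary-consistent local prediction for two subdomains sharing the
    interface points Xb; a mechanism sketch, not the paper's KKT solution."""
    # Step 1: each local GP predicts the boundary points independently.
    fb1, _ = gp_predict(X1, y1, Xb, noise_var)
    fb2, _ = gp_predict(X2, y2, Xb, noise_var)
    # Step 2: estimate the boundary values r_b (simple average in this sketch).
    rb = 0.5 * (fb1 + fb2)
    # Step 3: refit each local GP with (Xb, rb) appended as pseudo-observations,
    # pulling both predictors toward the same values r_b on the interface.
    out1 = gp_predict(np.vstack([X1, Xb]), np.concatenate([y1, rb]), Xstar1, noise_var)
    out2 = gp_predict(np.vstack([X2, Xb]), np.concatenate([y2, rb]), Xstar2, noise_var)
    return out1, out2
```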
Empirical Studies: Experimental Setting

- Datasets: TCO.Lv2 (n = 182,591), TCO.Lv3 (n = 48,331), MOD08-CL (n = 64,800).
- 90% of each dataset used for training, 10% reserved for testing.
- Anisotropic squared exponential covariance function.
- Focus of the empirical study: compare with standard kriging; compare with local kriging to check whether forcing prediction consistency improves the overall accuracy; compare with the low-rank approximation methods FIC and PIC; compare with the localized methods Bayesian Committee Machine (BCM), Local Probabilistic Regression (LPR), and Bagging for GP (BGP).

Empirical Study: G-DDM versus Standard Kriging (1-d toy dataset, n = 200)

    [Figure: predictions on the toy dataset from (a) standard (full) GPR, (b) local kriging, and (c) G-DDM.]

- Our method eliminates the prediction inconsistency visible in local kriging.
- The prediction of DDM is comparable to that of standard kriging.

Empirical Study: L-DDM versus Local Kriging (TCO.Lv3)

    [Figure: (a) mean squared error and (b) mean squared mismatch versus the number of subdomains, for DDM and local GP.]

- Our method practically eliminates the prediction inconsistency problem.
- Controlling the boundary continuity does indeed help the prediction accuracy.

Empirical Study: G-DDM versus FIC, PIC

    [Figure: MSE versus train + test time (sec) on (a) TCO.Lv3 and (b) MOD08-CL, for G-DDM, kPIC, rPIC, and FIC.]

- DDM outperformed kPIC (PIC with k-means clustering), rPIC (PIC with regular meshing), and FIC.

Empirical Study: L-DDM versus BCM, LPR, BGP (TCO.Lv3)

    [Figure: MSE versus (a) training time (sec) and (b) testing time (sec), for L-DDM, BGP, L-BGP, LPR, and BCM.]

- Training time vs. MSE: DDM is comparable to the BCM.
- Testing time vs. MSE: DDM is superior to the other methods.
- (BCM = Bayesian Committee Machine (Tresp, 2000); BGP = Bagging (Chen and Ren, 2009); LPR = Bayesian model averaging (Urtasun and Darrell, 2008).)

Empirical Study: Parallel Computation (P-DDM)

- P-DDM complexity = O(B^3), where B = the number of data points in a subdomain.
  1. Solve the interface equation for r_\circ.
  2. Given r_\circ, train each local predictor in parallel. (A sketch of this parallel step follows below.)
- Computation time is constant in n. The computation times required to obtain the best accuracy on the two datasets (with four CPU cores) were: TCO.Lv3 (n = 48,331): 27 seconds; TCO.Lv2 (n = 182,591): 29 seconds.
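Finally, a minimal sketch of the parallel step in P-DDM, assuming the interface equation has already been solved so that the boundary values r_b of every subdomain are fixed: the local fits then decouple and can run on separate cores. The process-pool layout and the task tuple are illustrative assumptions, not the GPLP toolbox's actual implementation; gp_predict is reused from the first sketch.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def fit_local(task):
    """Train and test one local predictor; once r_b is fixed, each subdomain
    is independent of all others, so the cost per core is O(B^3)."""
    Xj, yj, Xb, rb, Xstar_j = task
    Xaug = np.vstack([Xj, Xb])               # boundary points appended as
    yaug = np.concatenate([yj, rb])          # pseudo-observations, as above
    return gp_predict(Xaug, yaug, Xstar_j)

def p_ddm_predict(local_tasks, n_workers=4):
    """local_tasks: one (Xj, yj, Xb, rb, Xstar_j) tuple per subdomain."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(fit_local, local_tasks))
```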
Conclusion: Summary

Benefits of our approach:
- Reduced computation. Locality gives O(nB^2), comparable to FIC, PIC, BCM, and LPR; parallelization gives O(B^3) per node, constant in n.
- Competitive accuracy, because the method is adaptive to local changes (different covariance functions for different subdomains) and has improved continuity (enhanced accuracy over the boundaries).
- An open-source implementation is also available: see "A Local and Parallel Computation Toolbox for Gaussian Process Regression 1.0" at mloss.org or in the Journal of Machine Learning Research Open Source Software section.

Ongoing work: a better finite element approximation of the formulation, for better accuracy and improved numerical stability.

References

Chen, T. and J. Ren (2009). Bagging for Gaussian process regression. Neurocomputing 72(7-9), 1605-1610.
Furrer, R., M. G. Genton, and D. Nychka (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics 15(3).
Kaufman, C. G., M. J. Schervish, and D. W. Nychka (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association 103(484), 1545-1555.
Park, C., J. Huang, and Y. Ding (2011). Domain decomposition approach for fast Gaussian process regression of large spatial data sets. Journal of Machine Learning Research 12, 1697-1728.
Park, C., J. Huang, and Y. Ding (2012). GPLP: a local and parallel computation toolbox for Gaussian process regression. Journal of Machine Learning Research 13, 775-779.
Rasmussen, C. E. and Z. Ghahramani (2002). Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems 14, pp. 881-888. MIT Press.
Seeger, M., C. K. I. Williams, and N. D. Lawrence (2003). Fast forward selection to speed up sparse Gaussian process regression. In International Workshop on Artificial Intelligence and Statistics 9. Society for Artificial Intelligence and Statistics.
Snelson, E. and Z. Ghahramani (2006). Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, pp. 1257-1264. MIT Press.
Snelson, E. and Z. Ghahramani (2007). Local and global sparse Gaussian process approximations. In International Conference on Artificial Intelligence and Statistics 11, pp. 524-531. Society for Artificial Intelligence and Statistics.
Tresp, V. (2000). A Bayesian committee machine. Neural Computation 12(11), 2719-2741.
Urtasun, R. and T. Darrell (2008). Sparse probabilistic regression for activity-independent human pose inference. In IEEE Conference on Computer Vision and Pattern Recognition 2008, pp. 1-8. IEEE.