Deep Canonical Correlation Analysis
Galen Andrew, Raman Arora, Jeff Bilmes
Transcription
International Conference on Machine Learning (ICML 2013)

Deep Canonical Correlation Analysis
Galen Andrew (1), Raman Arora (2), Jeff Bilmes (1), Karen Livescu (2)
(1) University of Washington  (2) Toyota Technological Institute at Chicago
Presented by Shaobo Han, Duke University, Nov. 7, 2014
G. Andrew et al., 2013, Deep Canonical Correlation Analysis, slide 1 / 13

Outline
1. Background
   - Canonical Correlation Analysis (CCA)
   - Kernel Canonical Correlation Analysis (KCCA)
   - Denoising Autoencoder
2. Deep Canonical Correlation Analysis (DCCA)
3. Experiments
   - MNIST Handwritten Digits
   - Articulatory Speech

Introduction

The problem: learn complex nonlinear transformations of two views such that the resulting representations are maximally correlated.

Deep CCA (DCCA): learn highly correlated deep architectures
- A nonlinear extension of CCA (which uses linear projections)
- An alternative to kernel CCA (which uses nonlinear projections)

Related work:
- Multimodal autoencoders [Ngiam et al., Multimodal deep learning, ICML, 2011]
- Multimodal restricted Boltzmann machines [Srivastava & Salakhutdinov, Multimodal learning with deep Boltzmann machines, NIPS, 2012]

Key difference: DCCA learns two separate deep encodings, with the objective that the learned encodings be as correlated as possible.

Canonical Correlation Analysis

Objective: find pairs of linear projections of the two views, (ω1^T X1, ω2^T X2), that are maximally correlated:

  ρ(X1, X2) = max_{ω1, ω2} corr(ω1^T X1, ω2^T X2)
            = max_{ω1, ω2} cov(ω1^T X1, ω2^T X2) / √( var(ω1^T X1) var(ω2^T X2) )
            = max_{ω1, ω2} (ω1^T Σ12 ω2) / √( (ω1^T Σ11 ω1)(ω2^T Σ22 ω2) )

CCA reduces to a generalized eigenvalue problem:

  [  0   Σ12 ] [ω1]       [ Σ11   0  ] [ω1]
  [ Σ21   0  ] [ω2]  = ρ  [  0   Σ22 ] [ω2]

Given centered data matrices H̄1 ∈ R^{p1 × n} and H̄2 ∈ R^{p2 × n}, one can estimate

  Σ̂11 = (1 / (n − 1)) H̄1 H̄1^T + r1 I,   r1 > 0

(and Σ̂22, Σ̂12 analogously; the ridge term r1 I keeps the estimate nonsingular).

Kernel Canonical Correlation Analysis (1/3)

Objective: find maximally correlated nonlinear projections:

  ρ_F = max_{f1 ∈ H1, f2 ∈ H2} corr(f1(X1), f2(X2))
      = max_{f1 ∈ H1, f2 ∈ H2} cov(f1(X1), f2(X2)) / √( var(f1(X1)) var(f2(X2)) )

Reproducing property:

  f(x) = ⟨K(·, x), f⟩,  ∀ f ∈ H   (1)

Let K1 and K2 be Mercer kernels [Saitoh, Theory of reproducing kernels and its applications, Longman Scientific & Technical, 1988] with feature maps φ1, φ2; then

  corr(f1(X1), f2(X2)) = corr(⟨φ1(X1), f1⟩, ⟨φ2(X2), f2⟩)   (2)

Kernel Canonical Correlation Analysis (2/3)

Kernel trick: corresponding to any Mercer kernel K(x1, x2), there is a map φ : X → F such that K(x1, x2) = ⟨φ(x1), φ(x2)⟩.

An instantiation: define φ(x) = K(·, x) as the feature map; then

  ⟨φ(x1), φ(x2)⟩ = ⟨K(·, x1), K(·, x2)⟩ = K(x1, x2)   (3)

Empirical correlations: let f1(x) = α1^T K1(·, x) and f2(x) = α2^T K2(·, x). Then

  côv(⟨φ1(x1), f1⟩, ⟨φ2(x2), f2⟩) = (1/N) α1^T K1 K2 α2
  v̂ar(⟨φ1(x1), f1⟩) = (1/N) α1^T K1² α1
  v̂ar(⟨φ2(x2), f2⟩) = (1/N) α2^T K2² α2

Kernelized CCA problem:

  ρ̂_F = max_{α1, α2 ∈ R^N} (α1^T K1 K2 α2) / √( (α1^T K1² α1)(α2^T K2² α2) )   (4)

All the calculations are performed in the input space.

Kernel Canonical Correlation Analysis (3/3)

Regularized KCCA:

  ρ̂_F^r = max_{α1, α2 ∈ R^N} (α1^T K1 K2 α2) / √( (α1^T (K1 + r1 I)² α1)(α2^T (K2 + r2 I)² α2) )   (5)

KCCA also reduces to a generalized eigenvalue problem:

  [   0     K1 K2 ] [α1]       [ (K1 + r1 I)²        0        ] [α1]
  [ K2 K1     0   ] [α2]  = ρ  [       0       (K2 + r2 I)²   ] [α2]

Drawbacks of KCCA:
1. The representation is limited by the choice of a fixed kernel
2. Training time scales poorly with the size of the training set
3. The training data must be referenced when computing the representations of unseen instances
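The linear CCA problem above can be solved numerically by forming the regularized covariance estimates and passing the two block matrices to a generalized eigenvalue solver. A minimal NumPy/SciPy sketch (the function name and the ridge defaults r1 = r2 = 1e-4 are illustrative choices, not from the slides):

```python
import numpy as np
from scipy.linalg import eigh

def linear_cca(X1, X2, r1=1e-4, r2=1e-4, k=1):
    """Regularized linear CCA sketch.

    X1: (p1, n) and X2: (p2, n) data matrices, columns are samples.
    Returns the top-k canonical correlations and projection directions.
    """
    n = X1.shape[1]
    # Center each view to obtain H-bar_1, H-bar_2
    H1 = X1 - X1.mean(axis=1, keepdims=True)
    H2 = X2 - X2.mean(axis=1, keepdims=True)
    # Regularized covariance estimates, as on the slide
    S11 = H1 @ H1.T / (n - 1) + r1 * np.eye(H1.shape[0])
    S22 = H2 @ H2.T / (n - 1) + r2 * np.eye(H2.shape[0])
    S12 = H1 @ H2.T / (n - 1)
    p1, p2 = H1.shape[0], H2.shape[0]
    # Generalized eigenproblem  [[0, S12], [S21, 0]] w = rho [[S11, 0], [0, S22]] w
    A = np.block([[np.zeros((p1, p1)), S12],
                  [S12.T, np.zeros((p2, p2))]])
    B = np.block([[S11, np.zeros((p1, p2))],
                  [np.zeros((p2, p1)), S22]])
    vals, vecs = eigh(A, B)            # symmetric-definite solver
    order = np.argsort(-vals)          # eigenvalues come in +/- pairs;
    top = order[:k]                    # the largest are the correlations
    return vals[top], vecs[:p1, top], vecs[p1:, top]
```

Feeding the same data in as both views is a quick sanity check: the top canonical correlation should be close to 1 (slightly below, because of the ridge terms).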
Denoising Autoencoder [Vincent et al., Extracting and composing robust features with denoising autoencoders, ICML, 2008]

Objective: find good initial intermediate representations by explicit "fill-in-the-blanks" training
- The clean input x is partially destroyed, yielding the corrupted input x̃
- x̃ is mapped to a hidden representation y = f_θ(x̃) = s(W x̃ + b)
- From y we reconstruct z = g_θ′(y) = W^T y
- The parameters are trained to minimize the "reconstruction error"

  l_a(W, b) = ||Z − X||_F² + λ_a (||W||_F² + ||b||₂²)   (6)

Deep Canonical Correlation Analysis (1/2)

Objective: find maximally correlated representations of the two views by passing each view through multiple stacked layers of nonlinear transformations. For view 1, with input x1 ∈ R^{p1}:

  h1 = s(W_1^1 x1 + b_1^1) ∈ R^{c1},  where W_1^1 ∈ R^{c1 × p1}, b_1^1 ∈ R^{c1}
  h2 = s(W_2^1 h1 + b_2^1) ∈ R^{c1}
  ...
  f1(x1) = s(W_d^1 h_{d−1} + b_d^1) ∈ R^o,  for a network with d layers

(View 2 is transformed analogously.)

Deep Canonical Correlation Analysis (2/2)

Parameter optimization:

  (θ1*, θ2*) = argmax_{(θ1, θ2)} corr(f1(X1; θ1), f2(X2; θ2))   (7)

1. Pretrain the layers with a denoising autoencoder
2. Compute the gradient of corr(H1^d, H2^d) with respect to the top-level representations H1^d and H2^d
3. Fine-tune the parameters W_l^v and b_l^v using backpropagation

(All hyper-parameters are chosen to optimize the total correlation on a development set.)

Non-saturating sigmoid function: if g : R → R is the function g(y) = y³/3 + y, then s(x) = g⁻¹(x)
- Its derivative is a simple function of its value
- s is not bounded

MNIST Handwritten Digits

Learn correlated representations of the left and right halves of the images
- 54,000 training, 6,000 development, and 10,000 test images
- Each image is a 28 × 28 matrix of pixels, giving 392 features in each view
- For KCCA, radial basis function (RBF) kernels are used
- Selected hidden-layer widths for the DCCA-50-2 model: 2038 (left half-images), 1608 (right half-images)

[Table omitted: total correlation captured in the 50 most correlated dimensions on the split MNIST dataset]

Articulatory Speech Data (1/2)

Wisconsin X-ray Microbeam Database (XRMB) of simultaneous acoustic and articulatory recordings
- 5 independent experiments; in each, 60% of the data is used for training, 20% for development, and 20% for testing
- MFCC features x1 ∈ R^{273}, XRMB features x2 ∈ R^{112}
- For KCCA, RBF kernels or polynomial kernels of degree d are used
- Selected hidden-layer widths for the DCCA-50-2 model: 1641 (MFCC), 1769 (XRMB)

[Table omitted: total correlation captured in the 50 most correlated dimensions on the articulatory dataset]

Articulatory Speech Data (2/2)

[Figure omitted: total correlation captured by DCCA-112-d, for depth d ranging from 3 to 8]
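The quantity maximized in Eq. (7), the total correlation of the top-level representations, has a closed form: in the DCCA paper it is computed as the sum of the singular values (the trace norm) of T = Σ̂11^{−1/2} Σ̂12 Σ̂22^{−1/2}, and it is this quantity whose gradient is backpropagated. A minimal NumPy sketch (function name and ridge defaults are illustrative assumptions):

```python
import numpy as np

def total_correlation(H1, H2, r1=1e-4, r2=1e-4):
    """Total canonical correlation of top-level representations.

    H1: (o, n) and H2: (o, n) activations, columns are samples.
    Returns the trace norm of T = S11^{-1/2} S12 S22^{-1/2}.
    """
    n = H1.shape[1]
    H1c = H1 - H1.mean(axis=1, keepdims=True)
    H2c = H2 - H2.mean(axis=1, keepdims=True)
    # Regularized covariance estimates of the top-layer outputs
    S11 = H1c @ H1c.T / (n - 1) + r1 * np.eye(H1.shape[0])
    S22 = H2c @ H2c.T / (n - 1) + r2 * np.eye(H2.shape[0])
    S12 = H1c @ H2c.T / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    # Singular values of T are the canonical correlations
    return np.linalg.svd(T, compute_uv=False).sum()
```

As a sanity check, feeding identical o-dimensional representations in as both views should give a total correlation close to o, while independent random representations should score near zero.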