Deep Canonical Correlation Analysis
Galen Andrew, Raman Arora, Jeff Bilmes
Transcription
International Conference on Machine Learning (ICML 2013)

Deep Canonical Correlation Analysis
Galen Andrew (1), Raman Arora (2), Jeff Bilmes (1), Karen Livescu (2)
(1) University of Washington  (2) Toyota Technological Institute at Chicago
Presented by Shaobo Han, Duke University, Nov. 7, 2014
G. Andrew et al., 2013, Deep Canonical Correlation Analysis, slide 1 / 13

Outline
1. Background
   - Canonical Correlation Analysis (CCA)
   - Kernel Canonical Correlation Analysis (KCCA)
   - Denoising Autoencoder
2. Deep Canonical Correlation Analysis (DCCA)
3. Experiments
   - MNIST Handwritten Digits
   - Articulatory Speech

Introduction

The problem: learn complex nonlinear transformations of two views such that the resulting representations are maximally correlated.

Deep CCA (DCCA): learn highly correlated deep architectures
- A nonlinear extension of CCA (which uses linear projections)
- An alternative to kernel CCA (which uses nonlinear projections)

Related work:
- Multimodal autoencoders [Ngiam et al., Multimodal deep learning, ICML, 2011]
- Multimodal restricted Boltzmann machines [Srivastava & Salakhutdinov, Multimodal learning with deep Boltzmann machines, NIPS, 2012]

Key difference: DCCA learns two separate deep encodings, with the objective that the learned encodings be as correlated as possible.

Canonical Correlation Analysis

Objective: find pairs of linear projections of the two views, (ω1^T X1, ω2^T X2), that are maximally correlated:

  ρ(X1, X2) = max_{ω1, ω2} corr(ω1^T X1, ω2^T X2)
            = max_{ω1, ω2} cov(ω1^T X1, ω2^T X2) / √( var(ω1^T X1) var(ω2^T X2) )
            = max_{ω1, ω2} (ω1^T Σ12 ω2) / √( (ω1^T Σ11 ω1)(ω2^T Σ22 ω2) )

CCA reduces to a generalized eigenvalue problem:

  [  0   Σ12 ] [ω1]       [ Σ11   0  ] [ω1]
  [ Σ21   0  ] [ω2]  = ρ  [  0   Σ22 ] [ω2]

Given centered data matrices H̄1 ∈ R^{p1 × n} and H̄2 ∈ R^{p2 × n}, one can estimate

  Σ̂11 = (1 / (n − 1)) H̄1 H̄1^T + r1 I,   r1 > 0

(and Σ̂22, Σ̂12 analogously; the ridge term r1 I keeps the estimate nonsingular).

Kernel Canonical Correlation Analysis (1/3)

Objective: find maximally correlated nonlinear projections:

  ρ_F = max_{f1 ∈ H1, f2 ∈ H2} corr(f1(X1), f2(X2))
      = max_{f1 ∈ H1, f2 ∈ H2} cov(f1(X1), f2(X2)) / √( var(f1(X1)) var(f2(X2)) )

Reproducing property:

  f(x) = ⟨K(·, x), f⟩,  ∀ f ∈ H   (1)

Let K1 and K2 be Mercer kernels [Saitoh, Theory of reproducing kernels and its applications, Longman Scientific & Technical, 1988] with feature maps φ1, φ2; then

  corr(f1(X1), f2(X2)) = corr(⟨φ1(X1), f1⟩, ⟨φ2(X2), f2⟩)   (2)

Kernel Canonical Correlation Analysis (2/3)

Kernel trick: corresponding to any Mercer kernel K(x1, x2), there is a map φ : X → F such that K(x1, x2) = ⟨φ(x1), φ(x2)⟩.

An instantiation: define φ(x) = K(·, x) as the feature map; then

  ⟨φ(x1), φ(x2)⟩ = ⟨K(·, x1), K(·, x2)⟩ = K(x1, x2)   (3)

Empirical correlations: let f1(x) = α1^T K1(·, x) and f2(x) = α2^T K2(·, x). Then

  côv(⟨φ1(x1), f1⟩, ⟨φ2(x2), f2⟩) = (1/N) α1^T K1 K2 α2
  v̂ar(⟨φ1(x1), f1⟩) = (1/N) α1^T K1² α1
  v̂ar(⟨φ2(x2), f2⟩) = (1/N) α2^T K2² α2

Kernelized CCA problem:

  ρ̂_F = max_{α1, α2 ∈ R^N} (α1^T K1 K2 α2) / √( (α1^T K1² α1)(α2^T K2² α2) )   (4)

All the calculations are performed in the input space.

Kernel Canonical Correlation Analysis (3/3)

Regularized KCCA:

  ρ̂_F^r = max_{α1, α2 ∈ R^N} (α1^T K1 K2 α2) / √( (α1^T (K1 + r1 I)² α1)(α2^T (K2 + r2 I)² α2) )   (5)

KCCA also reduces to a generalized eigenvalue problem:

  [   0     K1 K2 ] [α1]       [ (K1 + r1 I)²        0        ] [α1]
  [ K2 K1     0   ] [α2]  = ρ  [       0       (K2 + r2 I)²   ] [α2]

Drawbacks of KCCA:
1. The representation is limited by the choice of a fixed kernel
2. Training time scales poorly with the size of the training set
3. The training data must be referenced when computing the representations of unseen instances
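The linear CCA problem above can be solved numerically by forming the regularized covariance estimates and passing the two block matrices to a generalized eigenvalue solver. A minimal NumPy/SciPy sketch (the function name and the ridge defaults r1 = r2 = 1e-4 are illustrative choices, not from the slides):

```python
import numpy as np
from scipy.linalg import eigh

def linear_cca(X1, X2, r1=1e-4, r2=1e-4, k=1):
    """Regularized linear CCA sketch.

    X1: (p1, n) and X2: (p2, n) data matrices, columns are samples.
    Returns the top-k canonical correlations and projection directions.
    """
    n = X1.shape[1]
    # Center each view to obtain H-bar_1, H-bar_2
    H1 = X1 - X1.mean(axis=1, keepdims=True)
    H2 = X2 - X2.mean(axis=1, keepdims=True)
    # Regularized covariance estimates, as on the slide
    S11 = H1 @ H1.T / (n - 1) + r1 * np.eye(H1.shape[0])
    S22 = H2 @ H2.T / (n - 1) + r2 * np.eye(H2.shape[0])
    S12 = H1 @ H2.T / (n - 1)
    p1, p2 = H1.shape[0], H2.shape[0]
    # Generalized eigenproblem  [[0, S12], [S21, 0]] w = rho [[S11, 0], [0, S22]] w
    A = np.block([[np.zeros((p1, p1)), S12],
                  [S12.T, np.zeros((p2, p2))]])
    B = np.block([[S11, np.zeros((p1, p2))],
                  [np.zeros((p2, p1)), S22]])
    vals, vecs = eigh(A, B)            # symmetric-definite solver
    order = np.argsort(-vals)          # eigenvalues come in +/- pairs;
    top = order[:k]                    # the largest are the correlations
    return vals[top], vecs[:p1, top], vecs[p1:, top]
```

Feeding the same data in as both views is a quick sanity check: the top canonical correlation should be close to 1 (slightly below, because of the ridge terms).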
Denoising Autoencoder [Vincent et al., Extracting and composing robust features with denoising autoencoders, ICML, 2008]

Objective: find good initial intermediate representations by explicit "fill-in-the-blanks" training
- The clean input x is partially destroyed, yielding the corrupted input x̃
- x̃ is mapped to a hidden representation y = f_θ(x̃) = s(W x̃ + b)
- From y we reconstruct z = g_θ′(y) = W^T y
- The parameters are trained to minimize the "reconstruction error"

  l_a(W, b) = ||Z − X||_F² + λ_a (||W||_F² + ||b||₂²)   (6)

Deep Canonical Correlation Analysis (1/2)

Objective: find maximally correlated representations of the two views by passing each view through multiple stacked layers of nonlinear transformations. For view 1, with input x1 ∈ R^{p1}:

  h1 = s(W_1^1 x1 + b_1^1) ∈ R^{c1},  where W_1^1 ∈ R^{c1 × p1}, b_1^1 ∈ R^{c1}
  h2 = s(W_2^1 h1 + b_2^1) ∈ R^{c1}
  ...
  f1(x1) = s(W_d^1 h_{d−1} + b_d^1) ∈ R^o,  for a network with d layers

(View 2 is transformed analogously.)

Deep Canonical Correlation Analysis (2/2)

Parameter optimization:

  (θ1*, θ2*) = argmax_{(θ1, θ2)} corr(f1(X1; θ1), f2(X2; θ2))   (7)

1. Pretrain the layers with a denoising autoencoder
2. Compute the gradient of corr(H1^d, H2^d) with respect to the top-level representations H1^d and H2^d
3. Fine-tune the parameters W_l^v and b_l^v using backpropagation

(All hyper-parameters are chosen to optimize the total correlation on a development set.)

Non-saturating sigmoid function: if g : R → R is the function g(y) = y³/3 + y, then s(x) = g⁻¹(x)
- Its derivative is a simple function of its value
- s is not bounded

MNIST Handwritten Digits

Learn correlated representations of the left and right halves of the images
- 54,000 training, 6,000 development, and 10,000 test images
- Each image is a 28 × 28 matrix of pixels, giving 392 features in each view
- For KCCA, radial basis function (RBF) kernels are used
- Selected hidden-layer widths for the DCCA-50-2 model: 2038 (left half-images), 1608 (right half-images)

[Table omitted: total correlation captured in the 50 most correlated dimensions on the split MNIST dataset]

Articulatory Speech Data (1/2)

Wisconsin X-ray Microbeam Database (XRMB) of simultaneous acoustic and articulatory recordings
- 5 independent experiments; in each, 60% of the data is used for training, 20% for development, and 20% for testing
- MFCC features x1 ∈ R^{273}, XRMB features x2 ∈ R^{112}
- For KCCA, RBF kernels or polynomial kernels of degree d are used
- Selected hidden-layer widths for the DCCA-50-2 model: 1641 (MFCC), 1769 (XRMB)

[Table omitted: total correlation captured in the 50 most correlated dimensions on the articulatory dataset]

Articulatory Speech Data (2/2)

[Figure omitted: total correlation captured by DCCA-112-d, for depth d ranging from 3 to 8]
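The quantity maximized in Eq. (7), the total correlation of the top-level representations, has a closed form: in the DCCA paper it is computed as the sum of the singular values (the trace norm) of T = Σ̂11^{−1/2} Σ̂12 Σ̂22^{−1/2}, and it is this quantity whose gradient is backpropagated. A minimal NumPy sketch (function name and ridge defaults are illustrative assumptions):

```python
import numpy as np

def total_correlation(H1, H2, r1=1e-4, r2=1e-4):
    """Total canonical correlation of top-level representations.

    H1: (o, n) and H2: (o, n) activations, columns are samples.
    Returns the trace norm of T = S11^{-1/2} S12 S22^{-1/2}.
    """
    n = H1.shape[1]
    H1c = H1 - H1.mean(axis=1, keepdims=True)
    H2c = H2 - H2.mean(axis=1, keepdims=True)
    # Regularized covariance estimates of the top-layer outputs
    S11 = H1c @ H1c.T / (n - 1) + r1 * np.eye(H1.shape[0])
    S22 = H2c @ H2c.T / (n - 1) + r2 * np.eye(H2.shape[0])
    S12 = H1c @ H2c.T / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    # Singular values of T are the canonical correlations
    return np.linalg.svd(T, compute_uv=False).sum()
```

As a sanity check, feeding identical o-dimensional representations in as both views should give a total correlation close to o, while independent random representations should score near zero.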