Structured deep learning : Pose and gestures - LIRIS

Transcription

Structured deep learning :!
Pose and gestures!
Christian Wolf!
Université de Lyon, INSA-Lyon!
LIRIS UMR CNRS 5205!
April 30th, 2015!
Pose and gestures!
Deep Learning : overview!
Pose estimation: hand!
Fig. 4. The proposed deep convolutional architecture of a single learner: convolutional layers F1, F2 and F3 with rectified linear activation units (ReLU); layers F1 and F2 are followed by 2×2 max pooling and reduction.
Direct deep gesture recognition!
(without pose estimation)!
Articulated pose from Kinect (V1) does not provide hand pose (finger joint positions)!
Hand pose estimation!
-  A complex problem!
–  Small images (given the distance between hand and sensor)!
–  Large variation in hand poses!
–  Real time is a challenge!
-  Our solution!
–  Segmenting hands into parts!
–  Structured deep learning!
–  Semi-supervised setting!
PhD of Natalia
Neverova!
Pose estimation through segmentation!
-  Calculate human pose : set of joint positions!
-  Use an intermediate representation : body / hand part segmentation!
[PRL 2014]!
Figure: from J. Shotton et al., « Real-Time Human Pose Recognition in Parts from Single Depth Images », CVPR 2011 (with M. Cook, T. Sharp, M. Finocchio, A. Kipman, A. Blake; Microsoft Research Cambridge & Xbox Incubation). Overview: from an input depth image, a per-pixel body part distribution is inferred (colors indicate the most likely part), then 3D joint proposals are derived: depth image → body parts → 3D joint proposals.!
[ACCV 2014]!
Segmentation and spatial relationships !
(Diagram: pixel features F1, F2, F3, F4, ..., Fi and labels l1, l2, l3, l4, ..., li, with a neighborhood N relating pairs (Fi, li) and (Fj, lj).)!
Spatial relationships?!
-  No → pixelwise classification (independent)!
-  Yes, MRF/CRF/BN → inference of a global solution with high computational complexity!
-  Yes, auto-context models → pixelwise classification where the prior improves the classifier!
[ICPR 2002a]!
[ICPR 2002b]!
[ICPR 2008]!
[ICPR 2010]!
[EG-W-3DOR 2008]!
[Neurocomputing 2010]!
[IEEE-Tr-PAMI 2010]!
[Pattern Recognition Letters 2014]!
[BMVC 2014]!
[ACCV 2014]!
5th/43 @ DIBCO 2009!
[Work in progress, R. Khan]!
[Work in progress, N. Neverova]!
Projects: ANR Canada, ANR Madras, ANR Solstice, Labex IMU-Rivière, INTERABOT!
(Deep) learning!
Feature mapping!
Classifier!
Notation. The pixels of an image are indexed; we are given M training images {X^1, ..., X^M} and their associated labels (see Section 3.1). We seek to learn a segmentation model in two parts:
-  A feature mapping Z_i^m = f(X_i^m | θ_f) which embeds each pixel and its receptive field into a feature representation Z_i^m ∈ R^Q.
-  A classifier l_i^m = g(Z_i^m | θ_g), with parameters θ_g, giving an estimate of the pixel's class label.
The parameters θ_f are learned from training data, taking into account the Euclidean distances ||Z_i^m − Z_j^m||_2 of pairs of feature representations. As is common in the Deep Learning literature, f can be trained in an unsupervised setting (Salakhutdinov and Hinton, 2007) or with another kind of inductive principle.
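As a rough illustration of this two-part decomposition (a sketch only, not the network used in the papers), the following numpy code embeds one pixel's receptive field with f and classifies it with g; the patch size, feature dimension Q and number of classes are invented values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented sizes for the example only (not taken from the paper)
Q, n_classes, patch = 64, 11, 9

# Parameters theta_f of the feature mapping f and theta_g of the classifier g
theta_f = rng.normal(0.0, 0.01, (patch * patch, Q))
theta_g = rng.normal(0.0, 0.01, (Q, n_classes))

def f(x_patch, theta_f):
    """Embed a pixel's receptive field into a feature vector Z_i in R^Q."""
    return np.maximum(x_patch.reshape(-1) @ theta_f, 0.0)   # ReLU embedding

def g(z, theta_g):
    """Classifier giving a distribution over part labels for one pixel."""
    logits = z @ theta_g
    e = np.exp(logits - logits.max())
    return e / e.sum()

x_patch = rng.random((patch, patch))       # receptive field around pixel i
z_i = f(x_patch, theta_f)                  # Z_i = f(X_i | theta_f)
l_i = int(np.argmax(g(z_i, theta_g)))      # l_i = g(Z_i | theta_g)
```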
Results!
Input: real depth videos from a Kinect sensor. Synthetic training data!
600 000 images rendered with a 3D modeler (« Poser »)!
Calculations distributed over 4 workstations (several weeks)!
-  Training:!
–  Synthetic training data and real training data!
–  Leverage structural information to get a loss on real data w/o groundtruth!
-  Testing:!
–  Pure classification, no graphical model (speed!)!
(Parenthesis: semantic full scene labelling)!
Goal: label each pixel of an image with a semantic class.!
Joint work with LHC, St. Etienne:!
Elisa Fromont!
Rémi Emonet!
Taygun Kekec!
Alain Trémeau!
Damien Muselet!
[BMVC-2014]!
(Independent pixelwise classification)!
Figure by [Sermanet 2012]!
(a) msConvnet   (b) msAugLearner   (c) GroundTruth!
Figure 5: Raw image labeling of the multiscale ConvNet, our multiscale augmented learner and ground truth labels.!
(a)
We experiment
with anwith
AugL
does
usenot
anyuse
trueany
context
label injection
corresponding
We experiment
an that
AugL
thatnot
does
true context
label injection
corresponding
to t =to
0 and
AugL that
hasthat
an has
injection
parameter
t = 0.05.
t = another
0 and another
AugL
an injection
parameter
t = 0.05.
(Deep auto-context: augmented learner)!
(a) msConvnet   (b) msAugLearner   (c) GroundTruth!
Figure 5: Raw image labeling of the multiscale ConvNet, our multiscale augmented learner and ground truth labels.!
Intermediate results: context learner – The classification accuracies obtained from the context learner (“ContextL” in the table) are given in Table 1 for both datasets. In Fig. 4, we show the responses of our context learner maps for some input patches. The second row shows strong responses for the object, tree and building classes. For the second and third rows, although the context learner outputs a strong response for the object class, due to its relative simplicity it is not able to provide an accurate classification (e.g., for the building class). Nevertheless, this network is useful for the augmented learner and its training time is negligible: the time per sample is lower (see Table 1) and it converges faster than the Convnet.

Classification accuracy results – Table 1 shows the classification results obtained with the different approaches. Overall, we observe that our method provides better results for both the Stanford and the SIFT Flow datasets. For the Stanford dataset, another state of the art technique is reported by Munoz et al. [12]. They reported their pixel accuracy as 76.9 and class accuracy as 66.2 without a deep learning architecture. With our technique, we were able to obtain a much higher class accuracy.

While the accuracy gain varies between single-scale and multiscale implementations, we observe that our approach consistently improves both pixel and class accuracies. The gain on single-scale experiments is higher compared to multiscale implementations.
Context learner fc:!
•  Learns to predict a whole patch of semantic labels.!
•  Integrate this patch of « features » into the direct extraction function f and the prediction function p.!
Our learner fd:!
•  « Augmented learner »: joint learning of context features fc and dependent features fd.!
•  1-in-K encoding (« one-hot ») of semantic labels when used as input.!
Structured loss (1) : local context!
Loss generated from a context learner output is calculated on a segmented patch.!
The context learner is trained on synthetic images.!
Fig. 2. The two learning pathways involving a direct learner fd and a context learner fc. The context learner operates on punctured neighborhood maps n(i,j), where the (to be predicted) middle pixel is missing. (Training phase: an input depth map X(i,j) is passed through the direct learner fd(.; θd) and the context learner fc(.; θc); forward passes produce the segmentation maps Yd(i,j) and Yc(i,j), and backpropagation uses the ground truth map G(i,j).)
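As a loose illustration of the punctured neighborhood maps n(i,j) the context learner consumes (a sketch only; the function name, patch radius and encoding details are assumptions, and border handling is ignored):

```python
import numpy as np

def punctured_neighborhood(seg_map, i, j, radius, n_classes):
    """Build the 1-in-K encoded label patch around pixel (i, j) with the
    centre pixel (the one to be predicted) blanked out.
    Assumes (i, j) lies at least `radius` pixels away from the image border."""
    patch = seg_map[i - radius:i + radius + 1, j - radius:j + radius + 1]
    one_hot = np.eye(n_classes)[patch]          # shape (2r+1, 2r+1, K)
    one_hot[radius, radius, :] = 0.0            # puncture the middle pixel
    return one_hot

# Toy usage: a random segmentation map with 11 hand-part classes
rng = np.random.default_rng(0)
seg = rng.integers(0, 11, size=(64, 64))
n_ij = punctured_neighborhood(seg, i=32, j=32, radius=8, n_classes=11)
```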
Structured loss (2) : global context!
A single region is supposed to exist for each label.!
Unconnected outlier regions are identified to generate a loss.!
Pixel coordinates, in vector form, are denoted as r^(i,j). If the pixel (i,j) is close enough to the barycenter R_k of its predicted class k (estimated from the labelled synthetic data), i.e. ||r^(i,j) − R_k|| ≤ τ, then the pixel is considered correctly classified and used to update the parameters θ_d of the direct learner. The loss function term for one pixel (i,j) is given as follows:

Q+_glb(θ_d; y_d^(i,j)) = − F_k^(i) log P( Y_d^(i,j) = y_d^(i,j) | x^(i,j), θ_d, θ_c ),

where F_k^(i) is a weight related to the size of the class components (k being the predicted class):

F_k^(i) = |{ j : Y_d^(i,j) = k }|^(−α),

and α > 0 is a gain parameter. In the opposite case, when the pixel is out of range (||r^(i,j) − R_k|| > τ), the current prediction is penalized and the class k̂ corresponding to the closest segment within the given distance τ is promoted (the class of the nearest barycenter):

Q_glb(θ_d; y_d^(i,j)) = − F_k̂^(i) log P( Y_d^(i,j) = k̂ | x^(i,j), θ_d, θ_c ),   where k̂ = argmin_k ||r^(i,j) − R_k||.
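A hedged sketch of this global context term; here the class barycenters R_k are computed from the current predicted map (a simplification of the procedure above), and τ, α and the array shapes are invented for the example:

```python
import numpy as np

def global_context_loss(pred_probs, pred_labels, tau=20.0, alpha=0.5):
    """Sketch: pixels close to the barycenter of their predicted class keep
    their prediction reinforced; pixels far away are pushed towards the class
    of the nearest barycenter. tau and alpha are made-up values."""
    h, w, K = pred_probs.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys, xs], axis=-1).astype(float)

    # Barycenter R_k and size-based weight F_k for each predicted class
    barycenters, weights = {}, {}
    for k in range(K):
        mask = pred_labels == k
        if mask.any():
            barycenters[k] = coords[mask].mean(axis=0)
            weights[k] = mask.sum() ** (-alpha)

    loss = 0.0
    for i in range(h):
        for j in range(w):
            k = pred_labels[i, j]
            d = {c: np.linalg.norm(coords[i, j] - R) for c, R in barycenters.items()}
            if d[k] <= tau:                      # close enough: reinforce prediction
                loss += -weights[k] * np.log(pred_probs[i, j, k] + 1e-12)
            else:                                # out of range: promote nearest class
                k_hat = min(d, key=d.get)
                loss += -weights[k_hat] * np.log(pred_probs[i, j, k_hat] + 1e-12)
    return loss
```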
High resolution segmentation!
(Classical) resolution reductions between layers.!
System resolution is kept high by keeping different shifts.!
Natalia Neverova, Christian Wolf, Graham W. Taylor, Florian Nebout
Results: supervised vs. semi-supervised!
On 50 manually annotated frames (real data)!
Training method   | Training data | Test data | Accuracy (per pixel) | Accuracy (per class)
Supervised        | synth.        | synth.    | 85.9%                | 78.5%
Supervised        | synth.        | real      | 47.2%                | 35.0%
Semi-supervised   | all           | synth.    | 75.5%                | 78.3%
Semi-supervised   | all           | real      | 50.5%                | 43.4%
Average gain of a single update (stochastic gradient descent), in % points:!
Supervised (w. labels): Local +0.60, Global +0.41, Loc+Glb +0.82!
No labels required: +16.05!
[ACCV 2014]!
Results on real images!
One step of unsupervised training!
Supervised pre-training!
Experimental results!
Conclusions!
Pose and gestures!
Object recognition, pose estimation and scene recognition usually build on object and part segmentation.
Pose estimation: full body!
Features are typically based on appearance (SIFT, HOG/HOF ...).
Features can also be informed by spatial relationships.
Pose estimation: hand!
spatial learning
Direct deep gesture recognition!
(without pose estimation)!
Gesture recognition!
Communicative gestures!
Multiple modalities:!
-  color and depth video!
-  Skeleton (articulated pose)!
-  Audio!
Multiple scales:!
-  full upper-body motion!
-  fine hand articulation!
-  short and long-term
dependencies.!
PhD of Natalia
Neverova!
A multi-scale architecture!
Operates at 3 temporal scales corresponding to dynamic poses of 3 different durations!
Figure caption: The deep convolutional multi-modal architecture operating at 3 temporal scales corresponding to dynamic poses of 3 different durations. Although the audio modality is not present in the 2014 ChaLearn Looking at People Challenge dataset, we have conducted additional experiments by augmenting the visual signal with audio.
Single-scale deep architecture!
(Figure: Path V1 processes depth and intensity video of the right hand through ConvD1/ConvD2 and ConvC1/ConvC2 with max pooling, followed by hidden layers HLV1 and HLV2; Path V2 does the same for the left hand; Path M feeds the mocap stream through a pose feature extractor and hidden layers HLM1, HLM2, HLM3; Path A processes the audio stream as mel-frequency spectrograms through ConvA1, HLA1 and HLA2; all paths meet in a shared hidden layer HLS followed by the output layer.)
Articulated pose descriptor!
1. Based on 11 upper body joints!
2. Position normalization: HipCenter is the origin of a new coordinate system.!
3. Size normalization by the mean distance between each pair of joints.¹!
4. Calculation of basis vectors for each frame (shown in blue) by applying PCA on 6 torso joints (shown in white).!
¹ Zanfir M., Leordeanu M., Sminchisescu C., “The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection”, ICCV 2013!
Fig. 3. The pose descriptor is calculated from the normalized coordinates of 11 upper body joints, also including their velocities and the angles of triples of joints.!
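The normalization steps 1–4 can be sketched as follows; the joint ordering, the torso joint indices and the use of SVD for the PCA step are assumptions of this example, not the exact recipe of the descriptor (which also stacks velocities and joint-triple angles):

```python
import numpy as np

def pose_descriptor(joints, torso_idx, hip_idx=0):
    """Sketch of the normalisation steps: `joints` is an (11, 3) array of 3D
    joint positions; `torso_idx` indexes the 6 torso joints. Indices are
    illustrative, not the dataset's actual joint order."""
    p = joints - joints[hip_idx]                       # HipCenter as origin

    # Size normalisation by the mean distance between each pair of joints
    diffs = p[:, None, :] - p[None, :, :]
    mean_dist = np.linalg.norm(diffs, axis=-1).sum() / (len(p) * (len(p) - 1))
    p /= mean_dist

    # Per-frame basis from PCA on the torso joints
    torso = p[torso_idx]
    torso -= torso.mean(axis=0)
    _, _, vt = np.linalg.svd(torso, full_matrices=False)
    return (p @ vt.T).ravel()                          # joints in the torso frame

# Toy usage with random joint positions
rng = np.random.default_rng(0)
desc = pose_descriptor(rng.random((11, 3)), torso_idx=[0, 1, 2, 3, 4, 5])
```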
TABLE 1!
Hyper-parameters chosen for the deep learning models (a single temporal scale)!
Parameters / weights (single scale)!

Layer        | Filter size / n.o. units | N.o. parameters | Pooling
Paths V1, V2 |                          |                 |
Input D1,D2  | 72×72×5                  | -               | -
ConvD1       | 25×5×5×3                 | 1900            | 2×2×1
ConvD2       | 25×5×5                   | 650             | 2×2×3
Input C1,C2  | 72×72×5                  | -               | -
ConvC1       | 25×5×5×3                 | 1900            | 2×2×1
ConvC2       | 25×5×5                   | 650             | 2×2×3
HLV1         | 900                      | 3 240 900       | -
HLV2         | 450                      | 405 450         | -
Path M       |                          |                 |
Input M      | 183                      | -               | -
HLM1         | 700                      | 128 800         | -
HLM2         | 700                      | 490 700         | -
HLM3         | 350                      | 245 350         | -
Path A       |                          |                 |
Input A      | 40×9                     | -               | -
ConvA1       | 25×5×5                   | 650             | 1×1
HLA1         | 700                      | 3 150 000       | -
HLA2         | 350                      | 245 350         | -
Shared layers|                          |                 |
HLS1         | 1600                     | 3 681 600       | -
HLS2         | 84                       | 134 484         | -
Output layer | 21                       | 1785            | -

(Truncated right-hand column of the paper page: HoG features are extracted from zero-mean normalized images on a 2-level spatial pyramid and on a magnified version (by a factor ...); histograms of depth are extracted at two scales, from a whole map and a magnified version; together with their derivatives h(t) − h(t−1) they are combined into a 270-dimensional descriptor per frame for the dynamic pose; extremely randomized trees are used for data fusion and gesture classification as a baseline, trained independently of the neural architecture, as described in Sec. ...)
Training algorithm!
Difficulties:!
Number of parameters: !
-  ~12.4M per scale !
-  ~37.2M total!
Number of training gestures: ~10 000.!
Result: poor convergence.!
(Figure: the single-scale architecture with paths V1 and V2 (depth and intensity video of each hand, ConvD1/ConvD2, ConvC1/ConvC2, max pooling, HLV1, HLV2), path M (mocap stream, pose feature extractor, HLM), a shared hidden layer HLS and the output layer.)
Proposed solution:!
-  Structured weight matrices!
-  Pretraining of individual channels separately;!
-  “Clever” initialization of shared layers;!
-  An iterative training algorithm gradually increasing the number of parameters to learn.!
Initialization of the shared layer: !
structured weight matrices!
The top hidden layer from each path is initially wired to a subset of
neurons in the shared layer.!
!
During fusion, additional connections between paths and the shared
hidden layer are added.!
A slightly different view!
(Figure: data flows from path V1 (depth, hand 1), path V2 (depth, hand 2), path M (mocap data) and path A (audio signal) through the first-layer weights W1 into a hidden layer HLS with units shared across modalities, then through the second-layer weights W2 into the output layer. W1 is shown as it looks when training is completed: within-modality blocks plus the cross-modality blocks (video↔mocap, video↔audio, mocap↔audio, ...).)
Blocks of the weight matrices are learned iteratively !
after proper initialization of the diagonal elements.!
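A small sketch of what such a block-structured initialization could look like: each modality path is first wired only to its own block of shared-layer units (non-zero diagonal blocks), and the off-diagonal blocks are unmasked later during fusion. Sizes, names and the masking mechanism are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def init_shared_weights(path_sizes, units_per_block, scale=0.01, seed=0):
    """Block-structured initialisation of the weights into the shared layer:
    only the diagonal (per-modality) blocks are non-zero and trainable at
    first; cross-modal blocks start at zero and are unmasked during fusion."""
    rng = np.random.default_rng(seed)
    n_in, n_out = sum(path_sizes), units_per_block * len(path_sizes)
    W = np.zeros((n_in, n_out))
    mask = np.zeros_like(W)
    row = 0
    for k, size in enumerate(path_sizes):
        cols = slice(k * units_per_block, (k + 1) * units_per_block)
        W[row:row + size, cols] = rng.normal(0.0, scale, (size, units_per_block))
        mask[row:row + size, cols] = 1.0   # only diagonal blocks trainable at first
        row += size
    return W, mask

# During the early training phases, gradients are multiplied by `mask`;
# later the mask is set to all-ones so the cross-modal blocks can evolve freely.
W, mask = init_shared_weights(path_sizes=[450, 450, 350, 350], units_per_block=400)
```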
Gesture localization!
An additional binary classifier is employed for filtering and refinement
of temporal position of each gesture. !
(Figure: per-frame outputs. The main classifier produces a class number per frame, with confident predictions around the target gesture and 'no gesture' during the pre-stroke and post-stroke phases; the binary motion detector outputs motion / no motion (1 / 0) per frame and is used to filter and refine the temporal extent of each gesture.)
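A rough sketch of the filtering/refinement idea: frames are kept only where the main classifier is confident and the binary motion detector fires, then consecutive kept frames are grouped into gesture intervals. Thresholds, names and the grouping rule are assumptions, not the exact procedure used in the system.

```python
import numpy as np

def localize_gestures(class_probs, motion_scores, prob_thr=0.5, motion_thr=0.5):
    """class_probs: (T, n_classes) per-frame class probabilities from the main
    classifier; motion_scores: (T,) outputs of the binary motion detector.
    Returns (first_frame, last_frame, predicted_class) intervals."""
    confident = class_probs.max(axis=1) > prob_thr
    moving = motion_scores > motion_thr
    active = confident & moving

    intervals, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            intervals.append((start, t - 1, int(class_probs[start:t].sum(0).argmax())))
            start = None
    if start is not None:
        intervals.append((start, len(active) - 1, int(class_probs[start:].sum(0).argmax())))
    return intervals
```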
2014 ChaLearn Looking at People Challenge (ECCV)!
Track 3: Gesture recognition!
Rank | Team                                             | Score
1    | Ours (LIRIS)                                     | 0.8500
2    | C. Monnier (Charles River Analytics, USA)        | 0.8339
3    | Ju Yong Chang (ETRI, KR)                         | 0.8268
4    | Xiaojiang Peng et al. (Southwest University, CN) | 0.7919
5    | L. Pigou et al. (Ghent University, BE)           | 0.7888
6    | D. Wu et al. (University of Sheffield, UK)       | 0.7873
7    | N.C. Camgoz et al. (Bogazici University, TR)     | 0.7466
8    | G. Evangelidis et al. (INRIA Rhône-Alpes, FR)    | 0.7454
9    | “Telepoints”                                     | 0.6888
10   | G. Chen et al. (TU München, D)                   | 0.6490
…    | …                                                | …
17   | “YNL”                                            | 0.2706
Escalera et al, “ChaLearn Looking at People Challenge 2014: Dataset and Results”, ECCV-W, 2014!
Results!
Error evolution during iterative training!
Post competition improvements!

Model (no audio)                      | Without motion detector | With motion detector | Virtual rank
Multi-scale deep learning (proposed)  | 0.821                   | 0.865                | (1)
Extremely randomized trees (baseline) | 0.729                   | 0.781                | (6)
Deep learning + ERT (hybrid)          | 0.829                   | 0.867                | (1)
(Figure: the single-scale multi-modal architecture, as above: paths V1, V2 (depth and intensity video of each hand) and M (mocap stream, pose feature extractor) feeding the shared hidden layer HLS and the output layer.)
Dropout!
-  Introduced in 2012 for Imagenet!
–  A. Krizhevsky, I. Sutskever, G. Hinton, « ImageNet Classification with Deep Convolutional Neural Networks », NIPS 2012.!
-  During training, for each training sample, 50% of the
units are disabled.!
-  Punishes co-adaptation of units!
-  Large performance gains!
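A minimal sketch of the mechanism (written as the common "inverted" dropout variant, which rescales during training; the 2012 formulation instead scales activations at test time):

```python
import numpy as np

def dropout(h, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: during training each unit is disabled with
    probability p; at test time the layer is the identity."""
    if not training:
        return h
    keep = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * keep / (1.0 - p)   # rescale so the expected activation is unchanged

h = np.ones(8)
print(dropout(h, p=0.5))          # roughly half the units zeroed, rest scaled by 2
```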
Our network : dropout on shared layer!
… shared layer) is therefore related to a specific gesture class. Let us note that this block structure (and meaning) is forced on the weight matrix during initialization and in the early phases of training. If only the diagonal blocks are non-zero, which is forced at the beginning of the training procedure, then individual modalities are trained independently, and no cross correlations between modalities are modelled. During the final phases of training, no structure is imposed and the weights can evolve freely. Considering the shared layer, this can be formalized by expressing the output of each hidden unit as follows:

h_j^(k) = [ Σ_{i=1..F_k} w_{i,j}^(k,k) x_i^(k) + Σ_{n=1..K, n≠k} Σ_{i=1..F_n} w_{i,j}^(n,k) x_i^(n) + b_j^(k) ]   (10)

Moddrop : modality-wise dropout!
-  Punish co-adaptation of individual units (like vanilla dropout)!
-  Train a network which is robust/resistant to the dropping of individual modalities (e.g. failure of the audio channel).!

Inputs of modality k are dropped with probability p^(k). In a dropout network, this can be formulated such that the input to the activation function of a given output unit (denoted as s_d) involves a Bernoulli selector variable δ^(k) for each modality k, which can take values in {0, 1} and which is activated with probability p^(k):

h̄_j^(k) = δ^(k) [ Σ_{i=1..F_k} w_{i,j}^(k,k) x_i^(k) + Σ_{n=1..K, n≠k} Σ_{i=1..F_n} w_{i,j}^(n,k) x_i^(n) + b_j^(k) ]   (19)

s_d = Σ_{k=1..K} δ^(k) Σ_{j=1..F_k} w_j^(k) x_j^(k)   (20)

At this step, the network takes as input multi-modal training samples {x_d^(k)}, k = 1 … K, from the training set D, where for each sample each modality component x_d^(k) is dropped (set to 0) with a certain probability, indicated by the Bernoulli selector δ^(k): P(δ^(k) = 1) = p^(k).   (15)

Accordingly, one step of gradient descent is performed given an input with a certain number of non-zero modality components.
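A hedged sketch of the modality-wise dropping itself (not the full training procedure): one Bernoulli selector per modality zeroes out that modality's whole input vector, so the network learns to predict with channels missing. Here p(k) is used as the probability of keeping modality k, and the probabilities and feature sizes are illustrative, not the values used in the paper.

```python
import numpy as np

def moddrop(modality_inputs, keep_probs, training=True, rng=np.random.default_rng(0)):
    """Modality-wise dropout: each modality's input vector x^(k) is kept with
    probability p^(k) (one Bernoulli selector per modality) and zeroed otherwise."""
    if not training:
        return modality_inputs
    out = []
    for x_k, p_k in zip(modality_inputs, keep_probs):
        delta_k = 1.0 if rng.random() < p_k else 0.0   # Bernoulli selector delta^(k)
        out.append(delta_k * x_k)
    return out

# Example: video, mocap and audio feature vectors with different keep probabilities
inputs = [np.ones(900), np.ones(350), np.ones(350)]
dropped = moddrop(inputs, keep_probs=[0.9, 0.8, 0.8])
```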
Moddrop: results!
Classification accuracy on the validation set (dynamic poses)!

Modalities     | Dropout | Dropout + Moddrop
All            | 96.77   | 96.81
Mocap missing  | 38.41   | 92.82
Audio missing  | 84.10   | 92.59
Hands missing  | 53.13   | 73.28

Jaccard index on the test set (full gestures)!

Modalities     | Dropout | Dropout + Moddrop
All            | 0.875   | 0.875
Mocap missing  | 0.306   | 0.859
Audio missing  | 0.789   | 0.854
Hands missing  | 0.466   | 0.680
Current and future work!
Integrate a model learned for pose estimation into the gesture
recognition algorithm.!
Still no direct pose estimation! !
(Figure: the single-scale multi-modal architecture: paths V1, V2 (depth and intensity video of each hand) and M (mocap stream, pose feature extractor) feeding the shared hidden layer HLS and the output layer.)
Conclusion (1)!
1.  General method for gesture and near-range action recognition from
a combination of color and depth video and articulated pose data.!
2.  Each channel captures a spatial scale, the system operates at three
temporal scales.!
3.  Possible extensions:!
- additional modalities;!
- additional temporal scales;!
- feedback connections to handle noisy or missing channels.!
[ECCV Workshop 2014]!
[ACCV 2014]!
[Journal under review]!
[Conference in writing]!
[Pattern Recognition Letters
2014]!
Conclusion (2)!
Visual recognition!
Deep Learning!
Structured deep learning!
Structured and semi-structured models!
(Background figures from a French manuscript on graph matching: Figure 4.8 – hyper-graphs constructed from video sequences; Figure 4.9 – (a) a general hyper-graph, (b) a reduced 2nd-order hyper-graph.)

Fig. 4. The proposed deep convolutional architecture of a single learner: convolutional layers F1, F2 and F3 with rectified linear activation units (ReLU). Layers F1 and F2 are followed by 2×2 max pooling and reduction.
As opposed to most existing methods for scene labeling, instead of randomly sampling pixels (or patches), training is performed image-wise, i.e. all pixels from the given image are provided to the classifier at once and each pixel is assigned an output class label based on information extracted from its neighborhood.
Applying the convolutional classifier with pooling/reduction layers to an image in the traditional way would lead to a loss in resolution by a factor of 4 (in the given configuration). On the other hand, simply not reducing the image resolution will prevent higher layers from learning higher level features, as the size of the filter support does not grow with respect to the image content. To avoid this dilemma, we employ specifically designed splitting functions originally proposed for image scanning in [27] and further exploited in OverFeat networks [28]. Intuitively speaking, each map at a given resolution is reduced to four different maps of lower resolution using max pooling. The amount of elements is preserved, but the resolution of each map is lower compared to the maps of previous layers.
In more detail, let us consider the output of the first convolutional layer F1 of the network. Once the output feature maps are obtained, 4 virtual extended copies of them are created by zero padding with 1) one column on the left, 2) one column on the right, 3) one row on top, 4) one row at the bottom. Therefore, each copy will contain the original feature map but shifted in 4 different directions. In the next step, we apply max pooling (2×2 with stride 2×2) to each of the extended maps, producing 4 low-resolution maps. By introducing the shifts, pixels from all extended maps combined together can reconstruct the original feature map as if max pooling with stride 1×1 had been applied. This operation allows the network to preserve the results of all computations at every pixel position.
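The shift-and-pool trick can be sketched as follows; this version enumerates the four (row, column) offsets directly rather than building the four padded copies one direction at a time, and the exact padding bookkeeping is an assumption of the sketch:

```python
import numpy as np

def split_maxpool(fmap):
    """Splitting function sketch: four copies of the feature map, shifted by
    (row, column) offsets in {0, 1} via zero padding, are each 2x2 max-pooled
    with stride 2. Together the four half-resolution maps retain every value
    that stride-1 max pooling would have produced."""
    outputs = []
    for dr in (0, 1):
        for dc in (0, 1):
            m = np.pad(fmap, ((dr, 1), (dc, 1)))       # shift via zero padding
            m = m[: m.shape[0] // 2 * 2, : m.shape[1] // 2 * 2]
            pooled = m.reshape(m.shape[0] // 2, 2, m.shape[1] // 2, 2).max(axis=(1, 3))
            outputs.append(pooled)
    return outputs

maps = split_maxpool(np.arange(36, dtype=float).reshape(6, 6))
```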
Including structural terms into deep feed-forward models is
a goal and a challenge!