Video Content Recognition with Deep Learning
Zuxuan Wu
Fudan University
Big Video Data
Global Internet Video Highlights in 2019 (by Cisco)
• It would take an individual more than 5 million years to watch the videos generated each month across all networks.
• Video traffic will be 80 percent of all network traffic, up from 64 percent in 2014.
Video Classification
• Videos are everywhere
• Wide applications
  ✓ Web video search
  ✓ Video collection management
  ✓ Intelligent video surveillance
Outline
• Benchmarks
• Modeling Temporal Dependencies
• Utilizing Class Context
For images (ImageNet):
• Total number of non-empty synsets: 21,841
• Total number of images: 14,197,122
An overview of existing video datasets

Dataset      | # Videos | # Classes | Year | Manually Labeled?
Kodak        | 1,358    | 25        | 2007 | ✓
MCG-WEBV     | 234,414  | 15        | 2009 | ✓
CCV          | 9,317    | 20        | 2011 | ✓
UCF-101      | 13,320   | 101       | 2012 | ✓
THUMOS-2014  | 18,394   | 101       | 2014 | ✓
MED-2014     | ≈28,000  | 20        | 2014 | ✓
Sports-1M    | 1M       | 487       | 2014 | ✗
ActivityNet  | 27,801   | 203       | 2015 | ✓
FCVID        | 91,223   | 239       | 2015 | ✓
……
Fudan-Columbia Video Dataset (FCVID)
• The largest public benchmark (239 categories) of Internet videos with manual annotations
• Covering many categories organized in a hierarchical structure
• 91,223 videos, average duration 167 seconds
• Released in Feb. 2015!
http://bigvid.fudan.edu.cn/FCVID/
Audio, visual features, texts, and videos available
Fudan-Columbia Video Dataset (FCVID)
• Utility – supporting practical needs
• Coverage – what people like to record
• Feasibility – can be recognized
• Multiple Annotators – minimize subjectivity
Outline
• Benchmarks
• Modeling Temporal Dependencies
• Utilizing Class Context
Video Classification: State of the Art
1. Improved Dense Trajectories [Wang et al., ICCV 2013]
   a) Tracking trajectories
   b) Computing local descriptors along the trajectories
2. Feature Encoding [Perronnin et al., CVPR 2010; Xu et al., CVPR 2015]
   a) Encoding local features with Fisher Vector/VLAD
   b) Normalization methods, such as the Power Norm
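The normalization step mentioned above can be sketched in a few lines. This is a minimal illustration of signed power normalization followed by L2 normalization, as commonly applied to Fisher Vector encodings; the function name and the alpha = 0.5 default (square-root normalization) are illustrative choices, not from the slides.

```python
import math

def power_l2_normalize(v, alpha=0.5):
    """Signed power normalization, then L2 normalization.

    Each component x becomes sign(x) * |x|^alpha, which damps bursty
    dimensions; the vector is then scaled to unit L2 norm."""
    powered = [math.copysign(abs(x) ** alpha, x) for x in v]
    norm = math.sqrt(sum(x * x for x in powered)) or 1.0  # guard all-zero input
    return [x / norm for x in powered]
```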
Video Classification: Deep Learning
1. Image-based CNN Classification [Zha et al., arXiv 2015]
   a) Extracting deep features for each frame
   b) Averaging frame-level deep features
2. Two-Stream CNN [Simonyan et al., NIPS 2014]
(Example: a diving video unfolds in stages: jumping from platform, rotating in the air, falling into water.)
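The frame-averaging step of image-based CNN classification is simple enough to show directly. A minimal sketch, assuming per-frame features are already extracted as equal-length lists (the function name is illustrative):

```python
def average_frame_features(frame_features):
    """Video-level descriptor as the mean of per-frame CNN features.

    frame_features: list of T feature vectors, each of dimension D."""
    T = len(frame_features)
    D = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / T for d in range(D)]
```

Note that averaging discards frame order entirely, which is exactly why the temporal models below are needed.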
Video Classification: Deep Learning
3. Recurrent NN: LSTM [Ng et al., CVPR 2015]
(Diagram: stacked LSTM layers unrolled over the diving stages, emitting outputs O_t-1, O_t, O_t+1.)
The performance is not ideal: the same as image-based classification.
Video Classification: Deep Learning
The performance of LSTM and average pooling is close.
We propose a hybrid deep learning framework to capture appearance, short-term motion and long-term temporal dynamics in videos.
Zuxuan Wu, Xi Wang, Yu-Gang Jiang et al., Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification. In ACM Multimedia, 2015. (Full Presentation.)
Our Framework
We propose a hybrid deep learning framework to model rich multimodal information:
a) Appearance and short-term motion with CNNs
b) Long-term temporal information with LSTMs
c) Regularized fusion to explore feature correlations
(Diagram: individual frames feed Spatial CNNs and stacked optical flow feeds Motion CNNs; each stream's per-frame outputs y_1, …, y_T pass through stacked LSTMs, and a regularized fusion layer combines the spatial and motion streams into the final prediction.)
Spatial and Motion CNN Features
http://bigvid.fudan.edu.cn/code/ucf101_motion_model.txt
Hao Ye, Zuxuan Wu et al., Evaluating Two-Stream CNN for Video Classification. In ICMR, 2015.
Temporal Modeling with LSTM
An unrolled recurrent neural network.
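The unrolled recurrent network applies the same cell at every time step. A minimal sketch of a single scalar LSTM step, using the standard gate equations (the weight layout, with one (w_x, w_h, b) triple per gate, is an illustrative simplification):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step on scalar inputs.

    w maps each gate name 'i', 'f', 'o', 'g' to (w_x, w_h, b).
    Standard equations: i, f, o are sigmoid gates, g is a tanh candidate,
    c = f * c_prev + i * g, h = o * tanh(c)."""
    i = sigmoid(w['i'][0] * x + w['i'][1] * h_prev + w['i'][2])
    f = sigmoid(w['f'][0] * x + w['f'][1] * h_prev + w['f'][2])
    o = sigmoid(w['o'][0] * x + w['o'][1] * h_prev + w['o'][2])
    g = math.tanh(w['g'][0] * x + w['g'][1] * h_prev + w['g'][2])
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c
```

The memory cell c carries information across many steps, which is what lets the unrolled network model long-term temporal dynamics that frame averaging misses.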
Regularized Feature Fusion
[Ngiam et al., ICML 2011; Srivastava et al., NIPS 2012]
DNN Learning Scheme:
- Calculate the prediction error
- Update weights w^(t) → w^(t+1) in a back-propagation manner
The fusion is performed in a free manner, without explicitly exploring the feature correlations.
Feature Selection:
- Modeling relationships among features
- Removing redundancy
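One common way to couple fusion with feature selection is to penalize the fusion weights with a structured-sparsity norm. A minimal sketch of the l2,1 norm, which zeroes out entire feature rows; this is a standard illustration, and the exact regularizer used in the paper may differ:

```python
def l21_norm(W):
    """Row-wise l2,1 norm: sum over rows r of ||W[r]||_2.

    Adding lambda * l21_norm(W) to the fusion loss pushes whole rows
    (i.e., whole input features) of W toward zero, performing feature
    selection while the fusion weights are learned."""
    return sum(sum(x * x for x in row) ** 0.5 for row in W)
```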
Experiments
Datasets:
- UCF-101: 101 action classes, 13,320 video clips from YouTube
- Columbia Consumer Videos (CCV): 20 classes, 9,317 videos from YouTube
Experiments
Temporal Modeling:

Method                     | UCF-101 | CCV
Spatial ConvNet            | 80.4    | 75.0
Motion ConvNet             | 78.3    | 59.1
Spatial LSTM               | 83.3    | 43.3
Motion LSTM                | 76.6    | 54.7
ConvNet (spatial + motion) | 86.2    | 75.8
LSTM (spatial + motion)    | 86.3    | 61.9
ConvNet + LSTM (spatial)   | 84.4    | 77.9
ConvNet + LSTM (motion)    | 81.4    | 70.9
All streams                | 90.3    | 82.4

LSTMs are worse than CNNs on noisy long videos.
CNNs and LSTMs are highly complementary!
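The "All streams" row combines the per-stream predictions. A minimal sketch of late fusion as a weighted average of class-score vectors (the function name and uniform default weights are illustrative; the learned fusion in the paper is more involved):

```python
def fuse_scores(score_lists, weights=None):
    """Late fusion: weighted average of per-stream class-score vectors.

    score_lists: one score vector per stream, all of equal length.
    weights: optional per-stream weights; defaults to a uniform average."""
    n = len(score_lists)
    weights = weights or [1.0 / n] * n
    return [sum(w * s[k] for w, s in zip(weights, score_lists))
            for k in range(len(score_lists[0]))]
```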
Experiments
Regularized Feature Fusion:

Method             | UCF-101 | CCV
Spatial SVM        | 78.6    | 74.4
Motion SVM         | 78.2    | 57.9
SVM-EF             | 86.6    | 75.3
SVM-LF             | 85.3    | 74.9
SVM-MKL            | 86.8    | 75.4
NN-EF              | 86.5    | 75.6
NN-LF              | 85.1    | 75.2
M-DBM              | 86.9    | 75.3
Two-Stream CNN     | 86.2    | 75.8
Regularized Fusion | 88.4    | 76.2

Regularized fusion performs better than fusion in a free manner.
Experiments
Hybrid Deep Learning Framework:
Experiments
Comparisons with the State of the Art:

UCF-101:
Donahue et al.    82.9%
Srivastava et al. 84.3%
Wang et al.       85.9%
Tran et al.       86.7%
Simonyan et al.   88.0%
Lan et al.        89.1%
Zha et al.        89.6%
Ours              91.3%

CCV:
Xu et al.   60.3%
Ye et al.   64.0%
Jhuo et al. 64.0%
Ma et al.   63.4%
Liu et al.  68.2%
Wu et al.   70.6%
Ours        83.5%
Outline
• Benchmarks
• Modeling Temporal Dependencies
• Utilizing Class Context
Class Relationships
Similar Video Semantics: Football is semantically related to classes such as Hockey, Running, Rugby, Tennis, Hurdles and Badminton.
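Class relationships like these can serve as a prior to refine raw scores. A minimal sketch that propagates scores through a class-correlation matrix and blends the result with the original predictions; the interpolation form and the alpha parameter are illustrative assumptions, not the exact refinement used in the work below:

```python
def refine_with_class_prior(scores, corr, alpha=0.5):
    """Refine class scores with a class-correlation prior.

    scores: raw per-class scores; corr[i][j]: relatedness of class i to j.
    refined = (1 - alpha) * scores + alpha * (corr @ scores), so a high
    'Football' score also lifts correlated classes like 'Rugby'."""
    n = len(scores)
    propagated = [sum(corr[i][j] * scores[j] for j in range(n)) for i in range(n)]
    return [(1 - alpha) * scores[i] + alpha * propagated[i] for i in range(n)]
```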
A More Recent Work
- Utilize 3 ConvNets to capture appearance, short-term motion and audio clues respectively: a Spatial ConvNet on individual frames, a Motion ConvNet on stacked optical flow, and an Audio ConvNet on audio spectrograms.
- Leverage LSTMs (on the spatial and motion streams) to model long-term temporal dynamics.
- Use class relationships as a prior to refine scores; the streams are combined by prediction with adaptive fusion.
Fusing Multi-Stream Deep Networks for Video Classification, arXiv preprint arXiv:1509.06086.
Experimental Results
Adaptive Multi-Stream Fusion:

Fusion Method                         | UCF-101 | CCV
Average fusion                        | 90.3    | 82.4
Weighted fusion                       | 90.6    | 82.7
Kernel average fusion                 | 90.2    | 82.1
MKL fusion                            | 89.6    | 81.8
Logistic regression fusion            | 89.8    | 82.0
Adaptive multi-stream fusion (λ1 = 0) | 90.9    | 82.8
Adaptive multi-stream fusion (λ2 = 0) | 91.6    | 83.7
Adaptive multi-stream fusion (-A)     | 92.2    | 84.0
Adaptive multi-stream fusion          | 92.6    | 84.9

Adaptive multi-stream fusion gives the best result!
Thank you!
