Video Content Recognition with Deep Learning


Yu-Gang Jiang
Lab for Big Video Data Analytics (BigVid)
Fudan University, Shanghai, China
[email protected]
Joint work with: Zuxuan Wu, Xi Wang, Hao Ye, Xiangyang Xue, Jian Tu, Rui-Wei Zhao, Jun Wang, Shih-Fu Chang
Big Video Data
• Global Internet Video Highlights (by Cisco)
– It would take an individual more than 5 million years to watch the amount of video that will cross global IP networks each month in 2019.
– Globally, IP video traffic will be 80 percent of all IP traffic (both business and consumer) by 2019, up from 67 percent in 2014.
Very Little Research on Video
Slide credit: Prof. Chua Tat Seng (NUS)
• Given the needs and popularity of videos in industry, there are insufficient research efforts on video
• Several urgent needs are all video based:
– Visual recognition for in-video advertising
– Live (video) object/scene/event recognition
– Live copyright protection
– Live surveillance applications
Very Little Research on Video
Slide credit: Prof. Chua Tat Seng (NUS)
• Need to look into multi-modal approaches to tackle problems
• Lack of datasets and infrastructure
Outline
• Datasets
• Deep Learning Approaches
• Demo
• Future Directions
An overview of existing datasets for video content recognition

Dataset | # Videos | # Classes | Year | Manually Labeled? | Publicly Accessible
Kodak | 1,358 | 25 | 2007 | ✓ | Some features
MCG-WEBV | 234,414 | 15 | 2009 | ✓ | 100%
CCV | 9,317 | 20 | 2011 | ✓ | mostly
UCF-101 | 13,320 | 101 | 2012 | ✓ | 100%
THUMOS-2014 | 18,394 | 101 | 2014 | ✓ | 100%
MED-2014 | 300,000 | 20 | 2014 | ✓ | ~18%
Sports-1M | 1M | 487 | 2014 | ✗ | mostly
FCVID | 91,223 | 239 | 2015 | ✓ | 100%
EventNet | 95,321 | 500 | 2015 | ✓ | mostly
NUS-CMU-Yahoo! | 100,000 | 520 | 2015/6 | ✓ | to be released
THUMOS Challenge
• ~20k Web videos; 101 classes (mainly human motion related)
• Started in 2013, in conjunction with ICCV'13, ECCV'14, CVPR'15
• 10-20 teams from top universities/companies participated each year
• Organizers:
– Ivan Laptev (INRIA), Mubarak Shah (UCF, USA), Rahul Sukthankar (Google)
– Haroon Idrees (UCF), Alex Gorban (Google), Yu-Gang Jiang (Fudan), Amir Zamir (Stanford)
http://www.thumos.info/
Fudan-Columbia Video Dataset (FCVID)
• One of the largest (91,223 videos) public benchmarks of Internet videos with manual annotations
• Covers 239 categories organized in a hierarchical structure
• Released in Feb. 2015: http://bigvid.fudan.edu.cn/FCVID/
– Downloaded by people from 40+ universities/companies
Y.-G. Jiang et al., Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks, arXiv preprint arXiv:1502.07209.
FCVID
Total number of categories: 239; Higher-level groups: 11; Second-level groups: 32
All categories — Sports (46): Sports Amateur (14), Sports Professional (20), Extreme Sports (8), Sports for the disabled (4); Music (17): musical performance without instruments (4), solo musical performance with instruments (10), group musical performance with instruments (3); DIY (21)
[Slide shows example categories for each group, e.g., baseball, marathon, rock climbing, wheelchair basketball, singing on stage, guitar performance, symphony orchestra performance, making rings.]
FCVID
All categories — Beauty & Fashion (10): Beauty (6), Fashion (4); Cooking & Health (30): Food (13), Drinks (5), Health (12); Leisure & Tricks (22): Leisure sports (7), Common leisure activities (11), Tricks (4); Art (10)
[Slide shows example categories for each group, e.g., making up, barbecue, making coffee, dumbbell workout, roller skating, flying kites, yoyo tricks, painting.]
FCVID
All categories — Everyday Life (54): Places (6), Activities (4), Kids (7), Family Events (4), Chores (8), Social Events (10), Public Events (7), Pets and others (8); Nature (26): Sceneries (8), Natural Phenomenon (7), Animal (11); Travel (11): Transportations (4), Tourist Spots (7); Tech & Education (7): High-tech product introductions (6), Education (1)
[Slide shows example categories for each group, e.g., hair cutting, birthday, wedding ceremony, parade, beach, tornado, dolphin, train, Eiffel Tower, smart phone, classroom.]

Example Videos
Everyday life → Kids → kids playing blocks
EventNet Dataset
• 95,321 videos and 500 events (from WikiHow)
• 4,490 event-specific concepts and detectors
• http://eventnet.ee.columbia.edu
Presentation at this conference on Thursday (29th).
NUS-CMU-Yahoo! Dataset
Slide credit: Prof. Chua Tat Seng (NUS)
• Targets consumer videos, mostly recorded by mobile devices
• Source and scale: Yahoo Creative Commons
– 0.1 million videos from Flickr
• Targeted annotations
– 420 entry-level concepts from ImageNet, 100 scenes from SUN
• To be released very soon.
Outline
• Datasets
• Deep Learning Approaches
• Demo
• Future Directions
Popular Approaches
1. Improved Dense Trajectories
a) Tracking local frame patches
b) Computing trajectory descriptors
2. Feature Encoding
a) Encoding local features with Fisher Vector/VLAD
b) Normalization methods, e.g., Power Norm
3. Image-Based Deep Video Classification
a) Extracting deep features on each frame
b) Averaging frame-level deep features
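A compact sketch of items 2b and 3, assuming per-frame deep features have already been extracted by a CNN. The linear classifier and the alpha=0.5 power norm are illustrative stand-ins, not the exact pipeline from any of the papers above:

```python
import numpy as np

def power_normalize(x, alpha=0.5):
    # Signed power normalization followed by L2 normalization,
    # commonly applied to Fisher Vector encodings (item 2b).
    x = np.sign(x) * np.abs(x) ** alpha
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x

def classify_video(frame_features, weights, bias):
    # Image-based deep video classification (item 3): average the
    # frame-level deep features, then apply a linear classifier.
    video_feature = frame_features.mean(axis=0)   # (T, D) -> (D,)
    scores = video_feature @ weights + bias       # (D,) @ (D, C) -> (C,)
    return int(np.argmax(scores))
```

Note that averaging discards temporal order, which is exactly the limitation the sequence models later in the talk address.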
Video Classification with Regularized DNN
[Slides show the network architecture, with regularization applied within the model.]
Z. Wu, Y.-G. Jiang et al., Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification, ACM Multimedia 2014 (full paper)
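The architecture figure is not reproduced here, but the general shape of such a regularized objective can be sketched. The Frobenius penalties below are illustrative placeholders that only show where feature-side and class-side terms enter the loss; the paper's actual regularizers are structured to exploit inter-feature and inter-class relationships:

```python
import numpy as np

def regularized_objective(data_loss, W_in, W_out, lam1=0.01, lam2=0.01):
    # data_loss: the usual classification loss.
    # W_in: input-side weights, where a feature-relationship penalty applies.
    # W_out: output-side weights, where a class-relationship penalty applies.
    # Plain squared-Frobenius penalties stand in for the paper's
    # relationship-aware regularizers; they only shrink weights and are
    # purely illustrative of the objective's structure.
    reg_features = lam1 * np.sum(W_in ** 2)
    reg_classes = lam2 * np.sum(W_out ** 2)
    return data_loss + reg_features + reg_classes
```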
Experimental Results
Effect of Exploring Feature Relationships

Approaches | Hollywood2 | CCV | CCV+
NN-EF | 62.00% | 66.70% | 70.50%
NN-LF | 58.50% | 61.90% | 64.70%
SVM-EF | 62.60% | 67.50% | 70.00%
SVM-LF | 62.10% | 64.90% | 68.50%
M-DBM | 61.50% | 67.20% | 70.10%
DNN-FR | 64.50% | 69.10% | 71.80%
Experimental Results
Effect of Exploring Class Relationships

Approaches | Hollywood2 | CCV | CCV+
DMF | 61.80% | 67.60% | 68.50%
DASD | 60.90% | 66.80% | 70.20%
DNN-CR | 63.00% | 69.30% | 72.10%
Two-Stream CNN
K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in NIPS, 2014.
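A minimal late-fusion sketch of the two-stream idea: one network sees an RGB frame, the other a stack of optical-flow fields, and their softmax scores are combined only at the end. The stand-in networks and stream weights are illustrative; the paper fuses by averaging or by training an SVM on the two streams' scores:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def two_stream_predict(rgb_frame, flow_stack, spatial_net, temporal_net,
                       w_spatial=1.0, w_temporal=1.0):
    # Each stream produces class logits independently; fusion happens
    # only at the score level (late fusion).
    p_spatial = softmax(spatial_net(rgb_frame))
    p_temporal = softmax(temporal_net(flow_stack))
    return (w_spatial * p_spatial + w_temporal * p_temporal) / (w_spatial + w_temporal)
```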
A Hybrid Deep Learning Framework
[Framework figure: static frames and local motion are each modeled with long-term temporal information.]
Presentation at this conference on Thursday (29th):
Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, X. Xue, Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification, ACM Multimedia 2015 (Full Paper)
Experimental Results

UCF-101:
Donahue et al. (2014, LSTM) | 82.9%
Srivastava et al. (2015, LSTM) | 84.3%
Wang et al. (2013) | 85.9%
Tran et al. (2014, CNN) | 86.7%
Simonyan et al. (2014, CNN) | 88.0%
Lan et al. (2014) | 89.1%
Zha et al. (2015, CNN) | 89.6%
Ours | 91.3%

CCV:
Xu et al. (2013) | 60.3%
Ma et al. (2014) | 63.4%
Ye et al. (2012) | 64.0%
Jhuo et al. (2014) | 64.0%
Liu et al. (2013) | 68.2%
Wu et al. (2014) | 70.6%
Ours | 83.5%

Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, X. Xue, Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification, ACM Multimedia 2015 (Full Paper)
A More Recent Work
• Utilize 3 ConvNets to capture appearance, short-term motion, and audio clues respectively: a Spatial ConvNet on individual frames, a Motion ConvNet on stacked optical flow, and an Audio ConvNet on audio spectrograms.
• Leverage LSTMs (on top of the spatial and motion ConvNets) to model long-term temporal dynamics.
• Prediction with adaptive fusion, which learns the best stream weights for each class.
Fusing Multi-Stream Deep Networks for Video Classification, arXiv preprint arXiv:1509.06086.
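The long-term temporal modeling can be illustrated with a toy recurrent cell that folds per-frame CNN features into a single hidden state. A plain tanh RNN stands in for the LSTMs of the actual framework, and all shapes and weights here are illustrative:

```python
import numpy as np

def recurrent_video_state(frame_features, W_x, W_h):
    # frame_features: (T, D) per-frame CNN features in temporal order.
    # The final hidden state summarizes long-term temporal dynamics,
    # which is the role the spatial/motion LSTMs play in the framework.
    h = np.zeros(W_h.shape[0])
    for x in frame_features:
        h = np.tanh(W_x @ x + W_h @ h)   # toy recurrent update
    return h
```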
Experimental Results
Multi-Stream Networks:

Stream | UCF-101 | CCV
Spatial ConvNet | 80.4 | 75.0
Motion ConvNet | 78.3 | 59.1
Spatial LSTM | 83.3 | 43.3
Motion LSTM | 76.6 | 54.7
Audio ConvNet | 16.2 | 21.5
ConvNet (spatial + motion) | 86.2 | 75.8
LSTM (spatial + motion) | 86.3 | 61.9
ConvNet + LSTM (spatial) | 84.4 | 77.9
ConvNet + LSTM (motion) | 81.4 | 70.9
All streams | 90.3 | 82.4

LSTMs are worse than CNNs on noisy videos. CNNs and LSTMs are highly complementary!
Experimental Results
Adaptive Multi-Stream Fusion:

Fusion Method | UCF-101 | CCV
Average fusion | 90.3 | 82.4
Weighted fusion | 90.6 | 82.7
Kernel average fusion | 90.2 | 82.1
MKL fusion | 89.6 | 81.8
Logistic regression fusion | 89.8 | 82.0
Adaptive multi-stream fusion (λ1=0) | 90.9 | 82.8
Adaptive multi-stream fusion (λ2=0) | 91.6 | 83.7
Adaptive multi-stream fusion (-A) | 92.2 | 84.0
Adaptive multi-stream fusion | 92.6 | 84.9

Adaptive multi-stream fusion gives the best result!
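The gap between average fusion and the adaptive variant comes from letting every class choose its own mixture over streams. A minimal sketch of that per-class weighting, with hypothetical scores; the actual method learns the weights jointly, subject to the regularizers the λ terms in the table refer to:

```python
import numpy as np

def average_fusion(stream_scores):
    # stream_scores: (S, C), one score vector per stream.
    return stream_scores.mean(axis=0)

def per_class_weighted_fusion(stream_scores, class_weights):
    # class_weights: (S, C); each column is normalized so every class
    # gets its own convex combination over the S streams.
    w = class_weights / class_weights.sum(axis=0, keepdims=True)
    return (w * stream_scores).sum(axis=0)
```

With weights [[1, 0], [0, 1]], for example, each class simply trusts the single stream that is most reliable for it.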
Experimental Results
Adaptive Multi-Stream Fusion:
[Figure: per-class results on CCV.]
Experimental Results

UCF-101:
Donahue et al. | 82.9%
Srivastava et al. | 84.3%
Wang et al. | 85.9%
Tran et al. | 86.7%
Simonyan et al. | 88.0%
Lan et al. | 89.1%
Zha et al. | 89.6%
Ours (-A) | 92.2%
Ours | 92.6%

CCV:
Xu et al. | 60.3%
Ye et al. | 64.0%
Jhuo et al. | 64.0%
Ma et al. | 63.4%
Liu et al. | 68.2%
Wu et al. | 70.6%
Ours (-A) | 84.0%
Ours | 84.9%

Fusing Multi-Stream Deep Networks for Video Classification, arXiv preprint arXiv:1509.06086.
Outline
• Datasets
• Deep Learning Approaches
• Demo
• Future Directions
Outline
• Datasets
• Deep Learning Approaches
• Demo
• Future Directions
(1) Better Networks
• Two-Stream CNN (NIPS'14)
• Long Short-Term Memory (CVPR'15, ICML'15, MM'15, arXiv'15)
• And Convolution3D, Gated RNN, etc.
(2) Better Datasets
For images there is ImageNet:
• Total number of non-empty synsets: 21,841
• Total number of images: 14,197,122
(2) Better Datasets
For videos, we only have these…

Dataset | # Videos | # Classes | Year | Manually Labeled? | Publicly Accessible
Kodak | 1,358 | 25 | 2007 | ✓ | No videos
MCG-WEBV | 234,414 | 15 | 2009 | ✓ | 100%
CCV | 9,317 | 20 | 2011 | ✓ | 100%
UCF-101 | 13,320 | 101 | 2012 | ✓ | 100%
THUMOS-2014 | 18,394 | 101 | 2014 | ✓ | 100%
MED-2014 | 300,000 | 20 | 2014 | ✓ | ~18%
Sports-1M | 1M | 487 | 2014 | ✗ | 100%
FCVID | 91,223 | 239 | 2015 | ✓ | 100%
EventNet | 95,321 | 500 | 2015 | ✓ | 100%
NUS-CMU-Yahoo! | 700,000 | 520 | 2015 | ✓ | to be released
References
• FCVID: Yu-Gang Jiang, Zuxuan Wu, Jun Wang, Xiangyang Xue, Shih-Fu Chang, Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks, arXiv:1502.07209. http://bigvid.fudan.edu.cn/FCVID/
• THUMOS: www.thumos.info
• CCV: Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui, Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance, ACM ICMR 2011. http://www.ee.columbia.edu/ln/dvmm/CCV/
• [content recognition] Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, Xiangyang Xue, Jun Wang, Fusing Multi-Stream Deep Networks for Video Classification, arXiv:1509.06086.
• [content recognition] Zuxuan Wu, Xi Wang, Yu-Gang Jiang, Hao Ye, Xiangyang Xue, Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification, ACM Multimedia 2015.
• [content recognition] Hao Ye, Zuxuan Wu, Rui-Wei Zhao, Xi Wang, Yu-Gang Jiang, Xiangyang Xue, Evaluating Two-Stream CNN for Video Classification, ACM ICMR 2015.
• [content recognition] Zuxuan Wu, Yu-Gang Jiang, et al., Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification, ACM Multimedia 2014.
Collaborators: Chong-Wah Ngo, Xi Wang, Jian Tu, Shih-Fu Chang, Xiangyang Xue, Zuxuan Wu, Jiajun Wang, Hao Ye, Jun Wang, Qi Dai, Yudong Jiang, Rui-Wei Zhao

Thank you!
[email protected]
