Video Content Recognition with Deep Learning
Yu-Gang Jiang
Lab for Big Video Data Analytics (BigVid)
Fudan University, Shanghai, China
[email protected]
Joint work with: Zuxuan Wu, Xi Wang, Hao Ye, Xiangyang Xue, Jian Tu, Rui-Wei Zhao, Jun Wang, Shih-Fu Chang

Big Video Data
• Global Internet Video Highlights (by Cisco):
  – It would take an individual more than 5 million years to watch the amount of video that will cross global IP networks each month in 2019.
  – Globally, IP video traffic will be 80 percent of all IP traffic (both business and consumer) by 2019, up from 67 percent in 2014.

Very Little Research on Video (slide credit: Prof. Chua Tat-Seng, NUS)
• Given the needs and popularity of video in industry, there are insufficient research efforts on video
• Several urgent needs are all video-based:
  – Visual recognition for in-video advertising
  – Live (video) object/scene/event recognition
  – Live copyright protection
  – Live surveillance applications
• Need to look into multi-modal approaches to tackle these problems
• Lack of datasets and infrastructure

Outline
• Datasets
• Deep Learning Approaches
• Demo
• Future Directions

An Overview of Existing Datasets for Video Content Recognition

Dataset          # Videos   # Classes   Year     Manually Labeled?   Publicly Accessible
Kodak            1,358      25          2007     ✓                   some features
MCG-WEBV         234,414    15          2009     ✓                   100%
CCV              9,317      20          2011     ✓                   mostly
UCF-101          13,320     101         2012     ✓                   100%
THUMOS-2014      18,394     101         2014     ✓                   100%
MED-2014         300,000    20          2014     ✓                   ~18%
Sports-1M        1M         487         2014     ✗                   mostly
FCVID            91,223     239         2015     ✓                   100%
EventNet         95,321     500         2015     ✓                   mostly
NUS-CMU-Yahoo!   100,000    520         2015/6   ✓                   to be released

THUMOS Challenge
• ~20k Web videos; 101 classes (mainly human-motion related)
• Started in 2013, in conjunction with ICCV'13, ECCV'14, and CVPR'15
• 10-20 teams from top universities/companies participated each year
• Organizers:
  – Ivan Laptev (INRIA), Mubarak Shah (UCF, USA), Rahul Sukthankar (Google)
  – Haroon Idrees (UCF), Alex Gorban (Google), Yu-Gang Jiang (Fudan), Amir Zamir (Stanford)
• http://www.thumos.info/

Fudan-Columbia Video Dataset (FCVID)
• One of the largest (91,223 videos) public benchmarks of Internet videos with manual annotations
• Covers 239 categories organized in a hierarchical structure
• Released in Feb. 2015: http://bigvid.fudan.edu.cn/FCVID/
  – Downloaded by people from 40+ universities/companies
Y.-G. Jiang et al., Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks, arXiv:1502.07209.

FCVID Category Hierarchy
Total number of categories: 239; higher-level groups: 11; second-level groups: 32

Sports (46): Sports Amateur (14), Sports Professional (20), Extreme Sports (8), Sports for the Disabled (4). Music (17): musical performance without instruments (4), solo musical performance with instruments (10), group musical performance with instruments (3). DIY (21).
Example categories (shown as a grid on the slide; some sports appear under both the amateur and professional columns): baseball, marathon, rock climbing, wheelchair basketball, singing on stage, guitar performance, symphony orchestra performance, making rings, making wallets, basketball, rhythmic gymnastics, skateboarding, wheelchair tennis, singing in KTV, piano performance, rock band performance, making earrings, making pencil cases, soccer, taekwondo, surfing, wheelchair race, beatbox, violin performance, chamber music, making bracelets, making phone cases, biking, archery, skydiving, wheelchair soccer, chorus, accordion performance, making festival cards, making photo frames, ice skating, swimming, fencing, bungee jumping, cello performance, making clothes (sewing), making bookmarks, skiing, car racing, rafting, flute performance, making a paper plane, building a dog house, American football, rowing, parkour, trumpet performance, making paper flowers, blowing up an air bed, tennis, sumo wrestling, kitesurfing, saxophone performance, knitting, pitching a tent, sports track, diving, harmonica performance, assembling a computer, changing tires, table tennis, boxing, shooting, drumming, assembling a bike, making shorts, badminton, billiards, frisbee shooting, tying a tie.

Beauty & Fashion (10): Beauty (6), Fashion (4). Cooking & Health (30): Food (13), Drinks (5), Health (12). Leisure & Tricks (22): Leisure Sports (7), Common Leisure Activities (11), Tricks (4). Art (10).
Example categories: making up, showing fashionable high-heeled shoes, barbecue, making coffee, dumbbell workout, roller skating, flying kites, yoyo tricks, painting, making lipstick, showing fashionable handbags, making French fries, making tea, barbell workout, fishing, bumper cars, pen spinning, sculpting, eye makeup, fashion show, making sandwiches, making juice, punching-bag workout, boating, kicking shuttlecock, solving magic cube, doing graffiti, hair style design, red carpet fashion, roasting turkey, making milk tea, push-ups, golfing, playing chess, card manipulation, making ceramic craft, nail art design, making sushi, making mixed drinks, pull-ups, bowling, playing bridge, solo dance, face massage, making salad, sit-ups, hiking, snowball fight, group dance, tattooing, making pizza, rope skipping, horse riding, making a snowman, social dance, making cake, treadmill, arm wrestling, spray painting, making hotdog, hula hoop, playing with nunchucks, sand painting, making cookies, jogging, playing with remote-controlled aircraft, Chinese paper cutting, making ice cream, yoga, playing with remote-controlled cars, making Chinese dumplings, Tai Chi Chuan, making egg tarts.

Everyday Life (54): Places (6), Activities (4), Kids (7), Family Events (4), Chores (8), Social Events (10), Public Events (7), Pets and Others (8). Nature (26): Sceneries (8), Natural Phenomena (7), Animals (11). Travel (11): Transportations (4), Tourist Spots (7). Tech & Education (7): High-tech Product Introductions (6), Education (1).
Example categories: temple exterior, hair cutting, kid playing on playground, birthday, cleaning windows, wedding ceremony, parade, bird, beach, sunset, dolphin, train, Egyptian pyramids, PSP, classroom, bridge, shaving beard, kids building blocks, family dinner, cleaning floor, wedding reception, fire fighting, dog, mountain, tornado, turtle, airplane, Eiffel Tower, smart phone, cathedral exterior, brushing teeth, kindergarten, decorating Christmas tree, weeding, wedding dance, car accidents, cat, river, lightning, snake, ship, the Great Wall, panel computer, museum interior, walking with a dog, baby eating snack, camping, washing dishes, graduation, street fighting, hamster, waterfall, sandstorm, spider, bus, the Statue of Liberty, single-lens reflex camera, library interior, baby crawling, car washing, dining at restaurant, public speech, rabbit, forest, volcano eruption, cow, the Oriental Pearl TV Tower, telescope, amusement park, washing an infant, fruit tree pruning, picnic, fireworks show, delicious food, ocean, solar eclipse, panda, the Leaning Tower of Pisa, laptop, kids making faces, shoveling snow, group banquet, car exhibition, house plants, desert, lunar eclipse, butterfly, Taj Mahal, cleaning carpet, debate, toy figures, grassland, camel, drinking inside bar, gorilla, dancing inside nightclub, giraffe, elephant.

Example Videos
• Everyday Life → Kids → kids playing blocks

EventNet Dataset
• 95,321 videos and 500 events (from WikiHow)
• 4,490 event-specific concepts and detectors
• http://eventnet.ee.columbia.edu
• Presentation at this conference on Thursday (29th)

NUS-CMU-Yahoo! Dataset (slide credit: Prof. Chua Tat-Seng, NUS)
• Targets consumer videos, mostly recorded by mobile devices
• Source and scale: Yahoo Creative Commons – 0.1 million videos from Flickr
• Targeted annotations: 420 entry-level concepts from ImageNet, 100 scenes from SUN
• To be released very soon

Outline
• Datasets
• Deep Learning Approaches
• Demo
• Future Directions

Popular Approaches
1. Improved Dense Trajectories
   a) Tracking local frame patches
   b) Computing trajectory descriptors
2. Feature Encoding
   a) Encoding local features with Fisher Vectors / VLAD
   b) Normalization methods, e.g., power normalization
3. Image-Based Deep Video Classification
   a) Extracting deep features on each frame
   b) Averaging frame-level deep features

Video Classification with Regularized DNN
Z. Wu, Y.-G. Jiang et al., Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification, ACM Multimedia 2014 (full paper).

Experimental Results: Effect of Exploring Feature Relationships

Approach   Hollywood2   CCV      CCV+
NN-EF      62.00%       58.50%   62.60%
NN-LF      62.10%       61.50%   64.50%
SVM-EF     66.70%       61.90%   67.50%
SVM-LF     64.90%       67.20%   69.10%
M-DBM      70.50%       64.70%   70.00%
DNN-FR     68.50%       70.10%   71.80%

Experimental Results: Effect of Exploring Class Relationships

Approach   Hollywood2   CCV      CCV+
DMF        61.80%       67.60%   68.50%
DASD       60.90%       66.80%   70.20%
DNN-CR     63.00%       69.30%   72.10%

Two-Stream CNN
K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in NIPS, 2014.
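At test time, two-stream models are typically combined by late fusion of the two networks' class scores. A minimal sketch of that step, assuming dummy softmax score vectors in place of real network outputs (the `late_fuse` helper and the weights are illustrative, not code from any of the cited papers):

```python
# Two-stream late fusion sketch: a spatial CNN scores the RGB frame,
# a temporal CNN scores stacked optical flow, and the per-class
# scores are combined by a weighted average.

def late_fuse(spatial, temporal, w=0.5):
    """Weighted average of spatial- and temporal-stream class scores."""
    return [w * s + (1.0 - w) * t for s, t in zip(spatial, temporal)]

# Dummy softmax scores over two classes for one clip.
spatial_scores = [0.8, 0.2]   # spatial stream favors class 0
temporal_scores = [0.6, 0.4]  # temporal stream agrees, less strongly
fused = late_fuse(spatial_scores, temporal_scores)  # equal weights
```

With equal weights this is plain score averaging; a single scalar `w` tuned on validation data gives the "weighted fusion" variant discussed later in the talk.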
A Hybrid Deep Learning Framework
• Combines static frames, local motion, and long-term temporal information
• Presentation at this conference on Thursday (29th)
Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, X. Xue, Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification, ACM Multimedia 2015 (full paper).

Experimental Results

UCF-101:
Donahue et al. (2014, LSTM)     82.9%
Srivastava et al. (2015, LSTM)  84.3%
Wang et al. (2013)              85.9%
Tran et al. (2014, CNN)         86.7%
Simonyan et al. (2014, CNN)     88.0%
Lan et al. (2014)               89.1%
Zha et al. (2015, CNN)          89.6%
Ours                            91.3%

CCV:
Xu et al. (2013)     60.3%
Ma et al. (2014)     63.4%
Ye et al. (2012)     64.0%
Jhuo et al. (2014)   64.0%
Liu et al. (2013)    68.2%
Wu et al. (2014)     70.6%
Ours                 83.5%

A More Recent Work
• Utilizes three ConvNets to capture appearance (individual frames), short-term motion (stacked optical flow), and audio clues (audio spectrograms), respectively
• Leverages LSTMs on top of the spatial and motion streams to model long-term temporal dynamics
• Uses adaptive fusion to learn the best stream weights for each class
Fusing Multi-Stream Deep Networks for Video Classification, arXiv:1509.06086.

Experimental Results: Multi-Stream Networks

Stream                       UCF-101   CCV
Spatial ConvNet              80.4      75.0
Motion ConvNet               78.3      59.1
Spatial LSTM                 83.3      43.3
Motion LSTM                  76.6      54.7
Audio ConvNet                16.2      21.5
ConvNet (spatial + motion)   86.2      75.8
LSTM (spatial + motion)      86.3      61.9
ConvNet + LSTM (spatial)     84.4      77.9
ConvNet + LSTM (motion)      81.4      70.9
All streams                  90.3      82.4

LSTMs are worse than ConvNets on noisy videos, but ConvNets and LSTMs are highly complementary!
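The adaptive-fusion idea assigns each class its own mixing weights over the streams. A rough illustrative sketch of that class-wise combination, with hand-set weights and dummy scores (the paper learns these weights from data with regularization terms λ1 and λ2, which is not modeled here):

```python
# Class-wise weighted fusion sketch: unlike plain averaging, each
# class c gets its own weight per stream s, so e.g. an audio-heavy
# class can lean on the audio stream while a visual class ignores it.

def classwise_fusion(stream_scores, weights):
    """stream_scores[s][c]: score of stream s for class c.
    weights[s][c]: weight of stream s for class c (columns sum to 1).
    Returns the fused per-class score vector."""
    n_classes = len(stream_scores[0])
    return [sum(weights[s][c] * stream_scores[s][c]
                for s in range(len(stream_scores)))
            for c in range(n_classes)]

# Dummy scores from three streams over two classes.
spatial = [0.6, 0.4]
motion = [0.2, 0.8]
audio = [0.5, 0.5]
# Hand-set weights: class 0 trusts spatial most, class 1 trusts motion.
w = [[0.6, 0.2], [0.2, 0.6], [0.2, 0.2]]
fused = classwise_fusion([spatial, motion, audio], w)
```

Setting every weight column to (1/3, 1/3, 1/3) recovers the "average fusion" baseline, which is why learned class-wise weights can only help if they are regularized against overfitting.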
Experimental Results: Adaptive Multi-Stream Fusion

Fusion Method                            UCF-101   CCV
Average fusion                           90.3      82.4
Weighted fusion                          90.6      82.7
Kernel average fusion                    90.2      82.1
MKL fusion                               89.6      81.8
Logistic regression fusion               89.8      82.0
Adaptive multi-stream fusion (λ1 = 0)    90.9      82.8
Adaptive multi-stream fusion (λ2 = 0)    91.6      83.7
Adaptive multi-stream fusion (-A)        92.2      84.0
Adaptive multi-stream fusion             92.6      84.9

Adaptive multi-stream fusion gives the best result! Per-class results on CCV were also reported.

Experimental Results: Comparison with the State of the Art

UCF-101:
Donahue et al.     82.9%
Srivastava et al.  84.3%
Wang et al.        85.9%
Tran et al.        86.7%
Simonyan et al.    88.0%
Lan et al.         89.1%
Zha et al.         89.6%
Ours (-A)          92.2%
Ours               92.6%

CCV:
Xu et al.     60.3%
Ye et al.     64.0%
Jhuo et al.   64.0%
Ma et al.     63.4%
Liu et al.    68.2%
Wu et al.     70.6%
Ours (-A)     84.0%
Ours          84.9%

Fusing Multi-Stream Deep Networks for Video Classification, arXiv:1509.06086.

Outline
• Datasets
• Deep Learning Approaches
• Demo
• Future Directions

Future Directions
(1) Better Networks
• Two-Stream CNN (NIPS'14)
• Long Short-Term Memory (CVPR'15, ICML'15, MM'15, arXiv'15)
• And Convolution3D, gated RNNs, etc.

(2) Better Datasets
• For images there is ImageNet:
  – Total number of non-empty synsets: 21,841
  – Total number of images: 14,197,122
• For videos, we only have these…

Dataset          # Videos   # Classes   Year   Manually Labeled?   Publicly Accessible
Kodak            1,358      25          2007   ✓                   no videos
MCG-WEBV         234,414    15          2009   ✓                   100%
CCV              9,317      20          2011   ✓                   100%
UCF-101          13,320     101         2012   ✓                   100%
THUMOS-2014      18,394     101         2014   ✓                   100%
MED-2014         300,000    20          2014   ✓                   ~18%
Sports-1M        1M         487         2014   ✗                   100%
FCVID            91,223     239         2015   ✓                   100%
EventNet         95,321     500         2015   ✓                   100%
NUS-CMU-Yahoo!   700,000    520         2015   ✓                   to be released

References
• FCVID: Yu-Gang Jiang, Zuxuan Wu, Jun Wang, Xiangyang Xue, Shih-Fu Chang, Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks, arXiv:1502.07209. http://bigvid.fudan.edu.cn/FCVID/
• THUMOS: http://www.thumos.info
• CCV: Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui, Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance, ACM ICMR 2011. http://www.ee.columbia.edu/ln/dvmm/CCV/
• [Content recognition] Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, Xiangyang Xue, Jun Wang, Fusing Multi-Stream Deep Networks for Video Classification, arXiv:1509.06086.
• [Content recognition] Zuxuan Wu, Xi Wang, Yu-Gang Jiang, Hao Ye, Xiangyang Xue, Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification, ACM Multimedia 2015.
• [Content recognition] Hao Ye, Zuxuan Wu, Rui-Wei Zhao, Xi Wang, Yu-Gang Jiang, Xiangyang Xue, Evaluating Two-Stream CNN for Video Classification, ACM ICMR 2015.
• [Content recognition] Zuxuan Wu, Yu-Gang Jiang, et al., Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification, ACM Multimedia 2014.

Acknowledgments: Chong-Wah Ngo, Xi Wang, Jian Tu, Shih-Fu Chang, Xiangyang Xue, Zuxuan Wu, Jiajun Wang, Hao Ye, Jun Wang, Qi Dai, Yudong Jiang, Rui-Wei Zhao

Thank you!
[email protected]