Video Content Recognition with Deep Learning
Zuxuan Wu, Fudan University

Big Video Data

Global Internet video highlights in 2019 (by Cisco):
• It would take an individual more than 5 million years to watch the video generated each month across all networks.
• Video traffic will be 80 percent of all network traffic, up from 64 percent in 2014.

Video Classification

• Videos are everywhere
• Wide applications
  ✓ Web video search
  ✓ Video collection management
  ✓ Intelligent video surveillance

Outline

• Benchmarks
• Modeling Temporal Dependencies
• Utilizing Class Context

For images (ImageNet):
• Total number of non-empty synsets: 21,841
• Total number of images: 14,197,122

An overview of existing video datasets:

Dataset       # Videos   # Classes   Year   Manually Labeled?
Kodak         1,358      25          2007   ✓
MCG-WEBV      234,414    15          2009   ✓
CCV           9,317      20          2011   ✓
UCF-101       13,320     101         2012   ✓
THUMOS-2014   18,394     101         2014   ✓
MED-2014      ≈28,000    20          2014   ✓
Sports-1M     1M         487         2014   ✗
ActivityNet   27,801     203         2015   ✓
FCVID         91,223     239         2015   ✓

Fudan-Columbia Video Dataset (FCVID)

• The largest public benchmark (239 categories) of Internet videos with manual annotations
• Covers many categories organized in a hierarchical structure
• 91,223 videos, average duration 167 seconds
• Released in Feb. 2015: http://bigvid.fudan.edu.cn/FCVID/ (audio, visual features, texts, and videos available)

Category selection criteria:
• Utility – supporting practical needs
• Coverage – what people like to record
• Feasibility – can be recognized
• Multiple annotators – minimize subjectivity

Outline

• Benchmarks
• Modeling Temporal Dependencies
• Utilizing Class Context

Video Classification: State of the Art

1. Improved dense trajectories [Wang et al., ICCV 2013]
   a) Tracking trajectories
   b) Computing local descriptors along the trajectories
2. Feature encoding [Perronnin et al., CVPR 2010; Xu et al., CVPR 2015]
   a) Encoding local features with Fisher vectors/VLAD
   b) Normalization methods, such as the power norm

Video Classification: Deep Learning

1. Image-based CNN classification [Zha et al., arXiv 2015]
   a) Extracting deep features for each frame
   b) Averaging frame-level deep features
2. Two-stream CNN [Simonyan et al., NIPS 2014]
3. Recurrent NN: LSTM [Ng et al., CVPR 2015]

A video such as diving unfolds over time: jumping from the platform, rotating in the air, and falling into the water. Temporal order clearly matters, yet the performance of LSTMs is not ideal: it is close to that of image-based classification with average pooling.

We propose a hybrid deep learning framework to capture appearance, short-term motion, and long-term temporal dynamics in videos; a minimal sketch of the CNN-to-LSTM pipeline follows below.

Zuxuan Wu, Xi Wang, Yu-Gang Jiang et al. Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification. In ACM Multimedia, 2015 (full presentation).
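To make the CNN-to-LSTM pipeline concrete, here is a minimal sketch in PyTorch. The backbone choice (ResNet-18), hidden size, and averaging of per-step predictions are illustrative assumptions, not the configuration used in the paper:

```python
# Minimal sketch: per-frame CNN features fed to an LSTM for long-term temporal
# modeling. The paper's spatial/motion CNNs are more elaborate; this only
# illustrates the CNN -> LSTM pipeline.
import torch
import torch.nn as nn
import torchvision.models as models

class FrameLSTMClassifier(nn.Module):
    def __init__(self, num_classes, hidden_size=512):
        super().__init__()
        # Frame-level feature extractor; a stand-in for the paper's spatial CNN.
        backbone = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (b*t, 512)
        feats = feats.view(b, t, -1)                       # (b, t, 512)
        outputs, _ = self.lstm(feats)                      # hidden state per time step
        # One prediction per step (y_1 ... y_T in the framework figure), averaged here.
        return self.fc(outputs).mean(dim=1)

# Usage: scores = FrameLSTMClassifier(num_classes=101)(torch.randn(2, 16, 3, 224, 224))
```

The motion stream works the same way, with stacked optical flow fed to the CNN in place of RGB frames.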
Our Framework

We propose a hybrid deep learning framework to model rich multimodal information:
a) Appearance and short-term motion with CNNs
b) Long-term temporal information with LSTMs
c) Regularized fusion to explore feature correlations

[Framework diagram: individual frames feed a spatial CNN and stacked optical flow feeds a motion CNN; each stream's features pass through LSTMs producing step-wise outputs y_1^s ... y_T^s and y_1^m ... y_T^m, which a regularized fusion layer (weights W_E^s, W_E^m) combines into the final prediction.]

Spatial and Motion CNN Features

Motion model released at http://bigvid.fudan.edu.cn/code/ucf101_motion_model.txt
Hao Ye, Zuxuan Wu et al. Evaluating Two-Stream CNN for Video Classification. In ICMR, 2015.

Temporal Modeling with LSTM

[Figure: an unrolled recurrent neural network.]

Regularized Feature Fusion [Ngiam et al., ICML 2011; Srivastava et al., NIPS 2012]

DNN learning scheme:
• Calculate the prediction error
• Update the weights in a backpropagation manner: w(t) → w(t+1)

In this scheme the fusion is performed in a free manner, without explicitly exploring the feature correlations. Regularized fusion instead performs feature selection: modeling the relationships among features and removing redundancy (see the sketch after the experiments table below).

Experiments

Datasets:
• UCF-101: 101 action classes, 13,320 video clips from YouTube
• Columbia Consumer Videos (CCV): 20 classes, 9,317 videos from YouTube

Temporal modeling (accuracy, %):

Method                       UCF-101   CCV
Spatial ConvNet              80.4      75.0
Motion ConvNet               78.3      59.1
Spatial LSTM                 83.3      43.3
Motion LSTM                  76.6      54.7
ConvNet (spatial + motion)   86.2      75.8
LSTM (spatial + motion)      86.3      61.9
ConvNet + LSTM (spatial)     84.4      77.9
ConvNet + LSTM (motion)      81.4      70.9
All streams                  90.3      82.4

LSTMs are worse than CNNs on noisy long videos. CNNs and LSTMs are highly complementary!
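Before the fusion results, here is a minimal sketch of regularized feature fusion in PyTorch. The group (L2,1-style) penalty on the fusion weights is an illustrative stand-in for structured regularization; the paper's exact regularizer may differ:

```python
# Minimal sketch: fusing two feature streams through a shared layer whose weights
# carry an extra structured penalty, so correlations between streams are modeled
# explicitly rather than fused "freely". The L2,1 penalty is an assumption.
import torch
import torch.nn as nn

class RegularizedFusion(nn.Module):
    def __init__(self, dim_spatial, dim_motion, num_classes, fused_dim=256):
        super().__init__()
        self.w_s = nn.Linear(dim_spatial, fused_dim)  # spatial-stream projection
        self.w_m = nn.Linear(dim_motion, fused_dim)   # motion-stream projection
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, x_s, x_m):
        fused = torch.relu(self.w_s(x_s) + self.w_m(x_m))
        return self.classifier(fused)

    def fusion_penalty(self):
        # Penalize each fused unit by the joint norm of its incoming weights from
        # both streams, encouraging shared structure and removing redundancy.
        w = torch.cat([self.w_s.weight, self.w_m.weight], dim=1)  # (fused_dim, d_s+d_m)
        return w.norm(dim=1).sum()  # L2,1 norm over rows

# Training step (sketch): loss = cross_entropy(model(x_s, x_m), y)
#                                + lam * model.fusion_penalty()
```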
Regularized feature fusion (accuracy, %):

Method               UCF-101   CCV
Spatial SVM          78.6      74.4
Motion SVM           78.2      57.9
SVM-EF               86.6      75.3
SVM-LF               85.3      74.9
SVM-MKL              86.8      75.4
NN-EF                86.5      75.6
NN-LF                85.1      75.2
M-DBM                86.9      75.3
Two-Stream CNN       86.2      75.8
Regularized Fusion   88.4      76.2

Regularized fusion performs better than fusion in a free manner.

Hybrid deep learning framework: [result charts omitted.]

Comparisons with the state of the art:

UCF-101:
Donahue et al.      82.9%
Srivastava et al.   84.3%
Wang et al.         85.9%
Tran et al.         86.7%
Simonyan et al.     88.0%
Lan et al.          89.1%
Zha et al.          89.6%
Ours                91.3%

CCV:
Xu et al.     60.3%
Ye et al.     64.0%
Jhuo et al.   64.0%
Ma et al.     63.4%
Liu et al.    68.2%
Wu et al.     70.6%
Ours          83.5%

Outline

• Benchmarks
• Modeling Temporal Dependencies
• Utilizing Class Context

Class Relationships

Classes with similar video semantics share context: football, for example, is related to hockey, running, rugby, tennis, hurdles, and badminton.

A More Recent Work

• Utilize 3 ConvNets to capture appearance, short-term motion, and audio clues respectively: a spatial ConvNet on individual frames, a motion ConvNet on stacked optical flow, and an audio ConvNet on audio spectrograms.
• Leverage LSTMs (spatial and motion) to model long-term temporal dynamics.
• Use class relationships as a prior to refine the scores, producing the final prediction with adaptive fusion (a small sketch follows at the end).

Fusing Multi-Stream Deep Networks for Video Classification. arXiv preprint arXiv:1509.06086.

Experimental Results

Adaptive multi-stream fusion (accuracy, %):

Fusion Method                           UCF-101   CCV
Average fusion                          90.3      82.4
Weighted fusion                         90.6      82.7
Kernel average fusion                   90.2      82.1
MKL fusion                              89.6      81.8
Logistic regression fusion              89.8      82.0
Adaptive multi-stream fusion (λ1 = 0)   90.9      82.8
Adaptive multi-stream fusion (λ2 = 0)   91.6      83.7
Adaptive multi-stream fusion (−A)       92.2      84.0
Adaptive multi-stream fusion            92.6      84.9

Adaptive multi-stream fusion gives the best result!

Thank you!
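For readers who want to experiment, here is a minimal NumPy sketch of the adaptive multi-stream fusion idea above. The softmax stream weights and the class-correlation prior A (applied here as simple score smoothing) are illustrative assumptions; see arXiv:1509.06086 for the actual formulation:

```python
# Minimal sketch: combine per-stream class scores with learned weights, then
# refine them with a class-relationship prior so related classes reinforce
# each other. Names and the smoothing form are illustrative assumptions.
import numpy as np

def adaptive_fusion(stream_scores, stream_logits, class_corr, alpha=0.5):
    """stream_scores: (num_streams, num_classes) per-stream class scores.
    stream_logits:   (num_streams,) learned fusion logits, one per stream.
    class_corr:      (num_classes, num_classes) row-normalized class prior A.
    """
    w = np.exp(stream_logits) / np.exp(stream_logits).sum()  # softmax stream weights
    fused = w @ stream_scores                                # weighted fusion
    # Blend the fused scores with prior-propagated scores.
    return (1 - alpha) * fused + alpha * class_corr @ fused

# Usage with 3 streams (spatial, motion, audio) over the 20 CCV classes:
scores = np.random.rand(3, 20)
refined = adaptive_fusion(scores, np.zeros(3), np.eye(20))
```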