Keynote slides
ECCV WORKSHOP VISART, ZURICH, SEPTEMBER 2014
Computer Vision for interactive experiences with art and artistic documents
Prof. Rita Cucchiara, Dipartimento di Ingegneria "Enzo Ferrari", Università di Modena e Reggio Emilia, Italia
http://www.Imagelab.ing.unimore.it

ABOUT OUR WORK
IMAGELAB [email protected] – SOFTECH-ICT research center in ICT for enterprise, Dipartimento di Ingegneria "Enzo Ferrari".
Research in computer vision, pattern recognition, multimedia and machine learning:
- Vision for video surveillance (since 1999)
- Vision for medical imaging
- Vision for industry
- Vision for cultural experiences
Open post-doc positions!! Please contact me.

CULTURAL HERITAGE
UNESCO 2003 cultural heritage definitions:
• Tangible heritage: artifacts such as objects, paintings, buildings, structures, landscapes, cities (cultural heritage, natural heritage). Examples: Modena, Italy; the Dolomiti, Italy.
• Intangible heritage: practices, representations, expressions, memories, knowledge that are naturally transmitted in oral or written form. Example: the Mediterranean diet, Italy.
• Digital heritage: computer-based materials of enduring value that should be kept for future generations – all multimedia data (texts, images, audio, graphics, software and web pages), either born digital or obtained after digitalization.

COMPUTER VISION FOR INTERACTIVITY
Interacting with the Great Beauty:
1) Computer vision for human-centered (digital) activities in the experience with art and cultural heritage (monitoring, reconstruction, AR, retrieval, learning and understanding).
2) Computer vision for human augmentation, to improve the experience with art and cultural heritage (visual augmentation, natural HCI, enjoyment).

COMPUTER VISION FOR INTERACTIVITY
Computer vision for interactivity with art is not (only) interactive art, but they started together…
Myron Krueger's Videoplace (1975, for "artificial reality"; SIGGRAPH 1985)… before the mouse!

CV FOR INTERACTING WITH CH
Envi-vision: vision by the environment, with fixed or moving cameras. Ego-vision: vision by mobile, wearable, egocentric cameras.
1) For seeing what your eyes don't reach
2) For seeing what your eyes cannot see
3) For telling what your eyes are seeing
4) For seeing with more eyes
Rita Cucchiara, Alberto Del Bimbo, "Visions for Augmented Cultural Heritage Experience," IEEE Multimedia, vol. 21, no. 1, pp. 74-82, Jan.-Mar. 2014.

1. FOR SEEING WHAT YOUR EYES DON'T REACH
• Webcams and surveillance cameras mounted in museums and CH locations for seeing interactions – Italy: Project Cluster SC&C MNEMOSYNE [Bagdanov et al., MCH Workshop 2012]
• Drones
• Images for 3D reconstruction [Pollefeys, Van Gool, ACM Surveys 2003]; The Great Buddha Project, 2002-07 [Miyazaki … Ikeuchi, IJCV 2007]

2. FOR SEEING WHAT YOUR EYES CANNOT SEE
• Thermal cameras for monitoring
• Floor cameras for interactions* – FLORIMAGE project, Lecce Museum 2015
• Stereo and range scanners for 3D reconstruction
• Deep image processing, as David does…
* M. Lombardi, A. Pieracci, P. Santinelli, R. Vezzani, R. Cucchiara, "Human Behavior Understanding with Wide Area Sensing Floors," in HBU 2013, LNCS 8212, pp. 112-123, 2013. Italian Project PON DICET, Lecce 2013-2016.
3. FOR TELLING WHAT YOUR EYES ARE SEEING
• Typical interaction from mobile vision (Google Goggles); augmented reality and vision [Caarls et al., JIVP 2009]
• Gaming experiments for 3D retrieval by mobile (Enzo Ferrari Museum, 2013)
• Vision and augmented reality in cultural and natural sites (MARMOTA, FBK; FP7 VENTURI)
• Document and painting recognition for retrieval – as James and many of us do

DOCUMENT RECOGNITION FOR RETRIEVAL
Computer vision (CV) and HCI:
- Image retrieval [Zhang, PAMI 2012]
- Image segmentation and multi-digitalization
- Multi-digitalization of illuminated manuscripts (miniated codes) – the De Rerum Novarum project+
- Multi-digitalization of a digitalized encyclopedia – the Treccani-DICET project**
+ D. Borghesani, C. Grana, R. Cucchiara, "Miniature illustrations retrieval and innovative interaction for digital illuminated manuscripts," Multimedia Systems, 2013.
** D. Coppi, C. Grana, R. Cucchiara, "Illustrations Segmentation in Digitized Documents Using Local Correlation Features," Proc. of IRCDL 2014; MTAP, to appear.

DOCUMENT ANALYSIS FOR INTERACTION
Multi-digitalization of artistic books for new forms of digital interactivity:
- Layout segmentation
- Picture segmentation and tagging
- Copy detection
- Search with relevance feedback
Pipeline: papery documents → digitalization → document analysis (plus manual annotation) → digital library → multimedia interactive digital library → web interaction.

CV FOR AUTOMATIC PICTURE SELECTION
The Holy Bible of Borso d'Este (15th century). Content-based image retrieval.
Classes: background, text, image/picture, decoration; feature annotation and user interface.
Thanks to D. Borghesani, C. Grana (2010-13).

FEATURES
Correlation matrix; feature points from Riemannian to Euclidean space; autocorrelation; directional histograms.
[Borghesani et al., ACM J. Multimedia Systems 2014] [Borghesani et al., MTAP 2012]
Segmentation and tagging on digitalized books (ACM Multimedia 2010, MTAP 2011); interactive surfing on digitalized books; adding positive and negative relevance feedback by users; improving search by similarity.
After multi-digitalization, a multitouch interface (2012): software written in C++ using Nokia Qt4 libraries; supports Windows, Mac and Linux; 46'' LCD panel equipped with 32-point multitouch. Now a multi-digitalized product.

INTERACTION WITH ENCYCLOPEDIA
Multi-digitalization of the Treccani Encyclopedia.
Image segmentation and block classification: XY cut → block extraction → block feature extraction (autocorrelation) → training model → SVM classification.
Thanks to C. Grana, M. Fornaciari, D. Coppi.

INTERACTING WITH ENCYCLOPEDIA
Layout segmentation, specifically tailored for drawings and artistic schemes; compared with Tesseract.
Courtesy of Treccani Enciclopedia.

AUTOCORRELATION MATRIX
• Block analysis
• Visual feature extraction
• Feature classification

Dataset       Method      % TP     % FN     % FP
Treccani      Our         99.57    0.43     4.53
Treccani      Tesseract   52.25    47.71    0.39
Gutenberg13   Our         99.50    0.50     11.50
Gutenberg13   Tesseract   83.13    16.87    1.02

AFTER MULTI-DIGITALIZATION: WEB RETRIEVAL
Searching and interacting with Treccani images through the web.
GOLD descriptors (Gaussians of Local Descriptors), best accuracy at ImageCLEF 2013.
C. Grana, G. Serra, M. Manfredi, R. Cucchiara, "Beyond Bag of Words for Concept Detection and Search of Cultural Heritage Archives," in SISAP 2013, LNCS 8199, Spain, pp. 233-244, Oct. 2-4, 2013. Tutorial at ICPR 2014.
Thanks to C. Grana, G. Serra, M. Manfredi.
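To make the idea behind a Gaussian-of-local-descriptors representation concrete, here is a minimal sketch in Python (illustrative only, not the published GOLD implementation): the set of local descriptors of an image is summarized by the mean and covariance of a fitted Gaussian, and the covariance is mapped from the Riemannian manifold of SPD matrices to Euclidean space with the matrix logarithm before concatenation. The function name, descriptor sizes and regularization constant are assumptions.

import numpy as np
from scipy.linalg import logm

def gold_like_descriptor(local_descs: np.ndarray) -> np.ndarray:
    """local_descs: (N, D) array of local descriptors (e.g., dense SIFT)."""
    mu = local_descs.mean(axis=0)
    cov = np.cov(local_descs, rowvar=False) + 1e-6 * np.eye(local_descs.shape[1])
    # Project the covariance (a point on the manifold of SPD matrices)
    # to Euclidean space with the matrix logarithm.
    log_cov = logm(cov).real
    # Keep only the upper triangle (the matrix is symmetric), then concatenate.
    iu = np.triu_indices(log_cov.shape[0])
    return np.concatenate([mu, log_cov[iu]])

# Hypothetical usage: 500 local descriptors of dimension 64.
descriptor = gold_like_descriptor(np.random.rand(500, 64))

The resulting fixed-length vector can then be fed to a linear classifier for concept detection, playing the role that a Bag-of-Words histogram plays in the baseline.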
Cucchiara,"Beyond Bag of Words for Concept Detection and Search of Cultural Heritage Archives« in SISAP 2013, vol. 8199, LNCS 8199, Spain, pp. 233-244, Oct. 2-4, 2013 TUTORIAL AT ICPR2014 Rita Cucchiara ECCV W VISART 2014, Italy Thanks to C.Grana, G.Serra, M. Manfredi 4. FOR SEEING WITH MORE EYES Physical, augmented experience with tangible CH The explosion of wearable cameras and ego-centric vision: - wearable museum or city guides - self gesture analysis for interactions - real-time recognition of CH targets Using Vision for detecting/tracking/recognizing targets and observers’ interaction Rita Cucchiara ECCV W VISART 2014, Italy EGOCENTRIC VISION Egocentric vision ( “EgoVision”) models and techniques for understanding what a person sees, from the first person’s point of view and centered on the human perceptual needs. Often called first-person vision, to recall the needs of using wearable cameras (e.g. on glasses mounted on the head) for acquiring and processing the same visual stimuli that human acquire and process. a broader meaning ….. to understand what a person sees or want to see and to exploit similar learning, perception and reasoning paradigms of humans.. Rita Cucchiara ECCV W VISART 2014, Italy CV CHALLENGES FOR CULTURAL EXPERIENCE IN EGO-VISION Cultural experience in ego-vision Life Logging Organizing memory and data Off-line Big data Deep learning Storage and transmission issues Rita Cucchiara ECCV W VISART 2014, Italy Human Augmentation Understanding world by vision… On-line Noisy, unconstrained data Fast learning with few examples Processing issues A CULTURAL EXPERIENCE WITH EGOVISION & Computer vision Rita Cucchiara ECCV W VISART 2014, Italy Thanks to Giuseppe Serra, Stefano Aletto, Lorenzo Baldini Francesco Paci, Luca Benini @ETHZ CHALLENGES IN EGO-VISION Egovision for visual augmentation … * Similar to video-surveillance and robot vision • • • • fast, real-time ( please, limit the data searchspace ) reactive pro-active similar scenes ( typically people, social life, children..) many similar methods ( detection, action recognition , tracking) but • Unconstrained • Large different motion factors • Frequent Changes of field of view • Very Long videos * R.Cucchiara Egocentric vision tarcking and evaluating human signs ICVSS Catania 2014 Rita Cucchiara ECCV W VISART 2014, Italy CV CHALLENGES IN EGOVISION(1/3) 1. Hardware • Design new hardware • Exploit real-time capabilities for egovision 2. Recognizing FoA /PoI • Estimating FoA[Li ICCV2013], [OgakiCVPRW2012], [Jianfeng CVPR2014] • Eye-tracking; & ego vision Rita Cucchiara ECCV W VISART 2014, Italy Gglass, Vuzix M100,Golden-i, Mobox+OdroidXU, MEG4.0 CV CHALLENGES IN EGO-VISION(2/3) 3. Recognizing head motion • • • • Head/body motion for outdoor summarization [ Peleg CVPR2014] Motion for indoor summarization [Grauman CVPR2013] Motion for supporting attention [Matsuo, CVPRW2014] Motion for SLAM as in robotics [Bahera, ACCV2012] 4. Recognizing objects • Objects useful for humans [Fathi, CVPR2013, Fathi, CVPR2011] • Objects in the hand [Fathi, CVPR2011] • Target tagging in the scene [Pirsiavash, Ramanan CVPR 2012] or Artworks in a museum…. Rita Cucchiara ECCV W VISART 2014, Italy CV CHALLENGES IN EGO-VISION(3/3) 5. Recognizing actions • Self-actions gestures [Kitani, CVPR 2013; Baraldi, EVW2014] • Actions of people, social actions [Ryoo, CVPR 2013; Alletto EFPVW 2014] • Actions in the environment (sport..) [Kitani, IEEE PC Magazine 2012] 6. 
6. Tracking: recognizing over time
• Tracking target objects
• Tracking faces and people [Alletto ICPR 2014]
• Multiple target tracking

EGO-VISION INTERACTION WITH CH
• Positioning: recognizing targets indoors or outdoors
• HCI: gesture recognition from egocentric video; recognition of emotions and feelings
• Experience augmentation: recognizing visual/audio queries and interaction

A VIDEO
Video: gesture recognition demo.
L. Baraldi, F. Paci, G. Serra, L. Benini, R. Cucchiara, "Gesture Recognition using Wearable Vision Sensors to Enhance Visitors' Museum Experiences," IEEE Sensors Journal, 2015.

HAND SEGMENTATION IN EGO-VISION
In ego-vision:
• many luminance variations
• correcting strong camera/head motion
• recognizing ego-gestures from very few examples
It is an old problem, with many approaches in different contexts:
• Skin classification [Khan et al. ICIP 2010]; random forests (better than BN, MLP, NB, AdaBoost…)
• Background subtraction after image registration [Fathi ICCV 2011] (assuming a static background, hands with objects, etc.)
• Generic object recognition [Li, Kitani CVPR 2013]: sparse feature selection and a battery of random forests trained under different luminance conditions

EGO-GESTURE RECOGNITION
1) (Ego-)hand detection
2) (Ego-)camera motion suppression
3) Feature extraction
4) Classification
L. Baraldi, F. Paci, G. Serra, L. Benini, R. Cucchiara, "Gesture Recognition in Ego-Centric Videos using Dense Trajectories and Hand Segmentation," in Proc. of the 10th IEEE Embedded Vision Workshop @ CVPR 2014.

AN EGO-VISION SOLUTION
Pipeline: superpixel segmentation → superpixel descriptors → classification by a collection of random forests → temporal coherence → spatial coherence.
Superpixel segmentation: SLIC (Simple Linear Iterative Clustering*), k-means in 5D (Lab + xy).
* Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk, "SLIC Superpixels Compared to State-of-the-art Superpixel Methods," IEEE TPAMI, vol. 34, no. 11, pp. 2274-2282, 2012.
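As a concrete illustration of the superpixel step, here is a minimal sketch in Python using scikit-image's SLIC implementation; the input file name, number of segments and compactness are illustrative values, not those used in the published pipeline.

import numpy as np
from skimage import io
from skimage.segmentation import slic
from skimage.color import rgb2lab

frame = io.imread("frame.png")            # hypothetical input frame
# SLIC clusters pixels with k-means in the 5-D (L, a, b, x, y) space.
segments = slic(frame, n_segments=400, compactness=10, start_label=0)

# Per-superpixel mean Lab color, a first ingredient of the superpixel descriptor.
lab = rgb2lab(frame)
for sp in np.unique(segments):
    mask = segments == sp
    mean_lab = lab[mask].mean(axis=0)

Each superpixel would then be described further (color covariances, Gabor responses, HOG, as listed in the next slide) before classification.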
[Serra ACM MM Workshops 2013] [Baraldi CVPRW 2014]

AN EGOVISION SOLUTION: SUPERPIXEL DESCRIPTORS
Descriptors per superpixel:
- mean and covariance in RGB, LabH and HSVH
- 27 Gabor filters (9 orientations, 3 scales: 7x7, 13x13, 19x19)
- HOG

AN EGOVISION SOLUTION: CLASSIFICATION BY A COLLECTION OF RFs
Classifier:
- a collection of random forests, indexed by a 32-bin RGB histogram of the frame (a global luminance feature)
- it encodes the appearance of the scene and the global luminance
- hypothesis: background and hands change color accordingly

AN EGOVISION SOLUTION: TEMPORAL COHERENCE
Estimated priors: temporal smoothing in a window of k frames; the posterior probability of being (or not being) a hand pixel in the previous window is used as a prior.

AN EGOVISION SOLUTION: SPATIAL CONSISTENCY
- Eliminate spurious superpixels
- Close holes
- Apply GrabCut, using the posterior as a seed

HAND SEGMENTATION CONSISTENCY
• Hand segmentation without temporal and spatial consistency
• Hand segmentation with temporal and spatial consistency

HAND SEGMENTATION
Results show a significant improvement in performance when all three consistency aspects are used together:
• illumination invariance (II)
• temporal smoothing (TS)
• spatial consistency (SC)
The method proposed by Li et al. is the most similar to our approach; nevertheless, by exploiting temporal and spatial coherence we are able to outperform their results.

CAMERA MOTION SUPPRESSION
1) (Ego-)hand detection
2) (Ego-)camera motion suppression
3) Feature extraction
4) Classification

CAMERA MOTION
• Camera (head) motion removal.
• Hand movements are usually not consistent with camera motion, resulting in wrong matches between two frames; for this reason we introduce a segmentation mask that disregards feature matches belonging to hands.
• Pipeline: extract dense keypoints from the original frame sequence, discard matches on hand regions, estimate the homography, and apply it to obtain an output frame sequence without camera motion.
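A minimal sketch of this kind of homography-based camera-motion suppression with OpenCV follows. The hand mask is assumed to come from the segmentation step above; for brevity the sketch uses sparse ORB keypoints rather than the dense keypoints mentioned in the slide, and the file names and thresholds are illustrative.

import cv2
import numpy as np

prev = cv2.imread("frame_t.png", cv2.IMREAD_GRAYSCALE)      # hypothetical frames
curr = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)
hand_mask = cv2.imread("hands_t.png", cv2.IMREAD_GRAYSCALE)  # 255 on hand pixels

# Detect keypoints only outside the hands (their motion is independent of the head).
detector = cv2.ORB_create(nfeatures=2000)
kp1, des1 = detector.detectAndCompute(prev, cv2.bitwise_not(hand_mask))
kp2, des2 = detector.detectAndCompute(curr, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des1, des2)

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# Robustly estimate the background homography and warp the current frame onto
# the previous one, so that residual motion is mostly due to the hands/gesture.
H, inliers = cv2.findHomography(dst, src, cv2.RANSAC, 3.0)
stabilized = cv2.warpPerspective(curr, H, (prev.shape[1], prev.shape[0]))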
EGO-GESTURE FEATURES
• Dense trajectories, HOG, HOF and MBH* are extracted around hand regions.
• Feature points are sampled inside and around the user's hands and tracked during the gesture; descriptors are computed inside a spatio-temporal volume aligned with each trajectory.
• Descriptors are encoded in a Bag of Words and then classified using a linear SVM classifier.
* [Wang et al. CVPR 2013]

DENSE TRAJECTORIES
From* dense points (but, as in Shi-Tomasi '94, points whose autocorrelation-matrix eigenvalues are very small, i.e. in homogeneous regions, are discarded). Points are connected into trajectories and the normalized trajectory shape is used.
HOG: static appearance [Dalal and Triggs 2005]; HOF with 9 bins: motion [Laptev 2008]; MBH: motion boundary histograms [Dalal 2004].
* H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, "Action Recognition by Dense Trajectories," in Proc. of CVPR, 2011, and IJCV 2013.

TRAJECTORY DESCRIPTION
• Having removed camera motion between two adjacent frames, trajectories can be extracted.
• Feature points are densely sampled at several spatial scales and tracked. Trajectories are restricted to lie inside and around the user's hands (with hand segmentation vs. without hand segmentation).

TRAJECTORY DESCRIPTION
• The spatio-temporal volume aligned with each trajectory is considered, and the Trajectory descriptor, HOG, HOF and MBH are computed around it.
• Since the histograms tend to be sparse, they are power-normalized to unsparsify the representation while still allowing for linear classification; a power-law function of the form f(h) = sign(h)·|h|^α is applied to each bin.
• The final descriptor is the concatenation of the four power-normalized Bag-of-Words histograms (Trajectory, HOG, HOF, MBH). Gestures are eventually recognized using a linear SVM 1-vs-1 classifier.
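To make the final encoding step concrete, here is a minimal sketch in Python of power normalization, concatenation of the four Bag-of-Words histograms, and one-vs-one linear SVM classification with scikit-learn. The value of alpha, the codebook sizes, the random data and all variable names are illustrative assumptions, not values from the talk.

import numpy as np
from sklearn.svm import SVC

def power_normalize(h, alpha=0.5):
    # f(h) = sign(h) * |h|^alpha, applied bin-wise to "unsparsify" the histogram.
    return np.sign(h) * np.abs(h) ** alpha

def encode(traj_bow, hog_bow, hof_bow, mbh_bow):
    return np.concatenate([power_normalize(h) for h in (traj_bow, hog_bow, hof_bow, mbh_bow)])

# Hypothetical training data: one BoW quadruple per gesture clip.
rng = np.random.default_rng(0)
X = np.stack([encode(*[rng.random(256) for _ in range(4)]) for _ in range(70)])
y = rng.integers(0, 7, size=70)                           # seven gesture classes

clf = SVC(kernel="linear", decision_function_shape="ovo")  # 1-vs-1 linear SVM
clf.fit(X, y)
pred = clf.predict(X[:5])

In the actual pipeline the histograms would come from quantizing the trajectory-aligned descriptors against learned codebooks rather than from random data.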
EXPERIMENTAL RESULTS
Datasets:
• The Cambridge-Gesture database, with 900 sequences of nine hand gesture types under different illumination conditions;
• Our Interactive Museum Dataset, an ego-centric gesture recognition dataset with 700 sequences from seven gesture classes performed by five subjects;
• The EDSH dataset, which consists of three egocentric videos with indoor and outdoor scenes and large variations of illumination.

GESTURE RECOGNITION
Results on the Cambridge Gesture DB and on the Interactive Museum Dataset, using only 2 samples per class for training.
[2] T.-K. Kim and R. Cipolla, "Canonical correlation analysis of video volume tensors for action categorization and detection," IEEE Trans. PAMI, 2009.
[3] Y. M. Lui, J. R. Beveridge, and M. Kirby, "Action classification on product manifolds," in Proc. of CVPR, 2010.
[4] Y. M. Lui and J. R. Beveridge, "Tangent bundle for human action recognition," in Proc. of Automatic Face & Gesture Recognition and Workshops, 2011.
[5] A. Sanin, C. Sanderson, M. T. Harandi, and B. C. Lovell, "Spatio-temporal covariance descriptors for action and gesture recognition," in Proc. of the Workshop on Applications of Computer Vision, 2013.

LAST BUT…

TRACKING: THE BIG CHALLENGE

SINGLE TARGET TRACKING
Tracking is the task of generating an inference about the motion of an object given a sequence of images*. In ego-vision it is hard!
* Arnold W. M. Smeulders, Dung M. Chu, Rita Cucchiara, Simone Calderara, Afshin Dehghan, and Mubarak Shah, "Visual Tracking: an Experimental Survey," IEEE TPAMI, July 2014.

THE HARDNESS OF TRACKING
What is the invariance that can be perceived and maintained over time? Tracking is hard, as nothing is fixed:
• problems of light: the target's appearance, the illumination;
• problems of motion: the object/camera motion;
• problems of scene: occlusion, confusion…
Searching for the invariance in the video.

14 TRACKING CHALLENGES IN 313 VIDEOS
01-LIGHT, 02-SURFACECOVER, 03-SPECULARITY, 04-TRANSPARENCY, 05-SHAPE, 06-MOTIONSMOOTHNESS, 07-MOTIONCOHERENCE, 08-CLUTTER, 09-CONFUSION, 10-LOWCONTRAST, 11-OCCLUSION, 12-MOVINGCAMERA, 13-ZOOMINGCAMERA, 14-LONGDURATION.

EXPERIMENTAL RESULTS ON ALOV++
Survival curves by Kaplan-Meier (see PAMI 2014) for trackers including [TST], [FBT], [STR], [L1O], [NCC], [TLD]: even the upper bound (taking the best of all trackers at each frame) shows that only about 30% of the videos are correctly tracked, while the lower bound (what all trackers can do) is about 7%.

TRACKING IN EGO-VISION
In egovision: all the previous problems, plus relative motion:
• no motion of the observer, but motion of the target;
• motion of the observer, but a fixed target;
• motion of both observer and target.
The datasets @Imagelab: EGO_GROUP and EGO_TRACK.

EGOVISION FROM A MOVING HEAD

TRACKING IN EGOVISION: EVALUATION
Tracking results in the second scenario, V2.2: tracking of an environmental point of interest. The target stays still but gets occluded and exits the camera FoV. Color-based trackers (HBT, NN) perform poorly due to the difficulty in discriminating the object based on color.

TRACKING IN EGOVISION: EVALUATION
[Table: F-measure of the NN, HBT, TLD, STR, NCC and FRT trackers on videos V1.1-V1.2 (still camera, still person), V2.1-V2.3 (moving camera, still person) and V3.1-V3.3 (moving camera, moving person); values range from about 0.01 to 0.64.]
A lot of work to do…

TRACKING AND INTERACTING
DICET Project (2013-2015): tracking people and targets for social interaction analysis; on-line interaction with art.

CONCLUSIONS AND OPEN PROBLEMS
A few conclusions:
• Computer vision for multi-digitalization and interaction: a long, successful story (2D documents, 3D objects).
• Computer vision for real-time interaction: not a hardware but a software problem; a long way ahead; interaction by egocentric vision is no longer a utopia.

THANKS TO
http://imagelab.ing.unimo.it
Interdepartmental Research Center in ICT, Tecnopolo di Modena, Emilia Romagna High Technology Network.
PEOPLE: Rita Cucchiara, Giuseppe Serra, Marco Manfredi, Costantino Grana, Paolo Santinelli, Francesco Solera, Roberto Vezzani, Martino Lombardi, Simone Pistocchi, Simone Calderara, Michele Fornaciari, Fabio Battilani, Augusto Pieracci, Dalia Coppi, Patrizia Varini, Stefano Alletto.