Keynote slides
ECCV WORKSHOP VISART, ZURICH, SEPTEMBER 2014
Computer Vision for interactive experiences with art and artistic documents
Prof. Rita Cucchiara, Dipartimento di Ingegneria "Enzo Ferrari", Università di Modena e Reggio Emilia, Italia
http://www.Imagelab.ing.unimore.it

ABOUT OUR WORK
IMAGELAB [email protected] – SOFTECH-ICT research center in ICT for enterprise, Dipartimento di Ingegneria "Enzo Ferrari".
Research in computer vision, pattern recognition, multimedia and machine learning:
- Vision for video surveillance (since 1999)
- Vision for medical imaging
- Vision for industry
- Vision for cultural experiences
Open post-doc positions!! Please contact me.

CULTURAL HERITAGE
UNESCO 2003 cultural heritage definitions:
• Tangible heritage: artifacts such as objects, paintings, buildings, structures, landscapes, cities (cultural heritage, natural heritage). Examples: Modena, Italy; the Dolomiti, Italy.
• Intangible heritage: practices, representations, expressions, memories, knowledge that are naturally transmitted in oral or written form. Example: the Mediterranean diet, Italy.
• Digital heritage: computer-based materials of enduring value that should be kept for future generations – all multimedia data (texts, images, audio, graphics, software and web pages), either born digital or obtained after digitalization.

COMPUTER VISION FOR INTERACTIVITY
Interacting with the Great Beauty:
1) Computer vision for human-centered (digital) activities in the experience with art and cultural heritage (monitoring, reconstruction, AR, retrieval, learning and understanding).
2) Computer vision for human augmentation, to improve the experience with art and cultural heritage (visual augmentation, natural HCI, enjoyment).

COMPUTER VISION FOR INTERACTIVITY
Computer vision for interactivity with art is not (only) interactive art, but they started together…
Myron Krueger's Videoplace (1975, for "artificial reality"; SIGGRAPH 1985)… before the mouse!

CV FOR INTERACTING WITH CH
Envi-vision: vision by the environment, with fixed or moving cameras. Ego-vision: vision by mobile, wearable, egocentric cameras.
1) For seeing what your eyes don't reach
2) For seeing what your eyes cannot see
3) For telling what your eyes are seeing
4) For seeing with more eyes
Rita Cucchiara, Alberto Del Bimbo, "Visions for Augmented Cultural Heritage Experience," IEEE Multimedia, vol. 21, no. 1, pp. 74-82, Jan.-Mar. 2014.

1. FOR SEEING WHAT YOUR EYES DON'T REACH
• Webcams and surveillance cameras mounted in museums and CH locations for seeing interactions – Italy: Project Cluster SC&C MNEMOSYNE [Bagdanov et al., MCH Workshop 2012]
• Drones
• Images for 3D reconstruction [Pollefeys, Van Gool, ACM Surveys 2003]; The Great Buddha Project, 2002-07 [Miyazaki … Ikeuchi, IJCV 2007]

2. FOR SEEING WHAT YOUR EYES CANNOT SEE
• Thermal cameras for monitoring
• Floor cameras for interactions* – FLORIMAGE project, Lecce Museum 2015
• Stereo and range scanners for 3D reconstruction
• Deep image processing, as David does…
* M. Lombardi, A. Pieracci, P. Santinelli, R. Vezzani, R. Cucchiara, "Human Behavior Understanding with Wide Area Sensing Floors," in HBU 2013, LNCS 8212, pp. 112-123, 2013. Italian Project PON DICET, Lecce 2013-2016.
3. FOR TELLING WHAT YOUR EYES ARE SEEING
• Typical interaction from mobile vision (Google Goggles); augmented reality and vision [Caarls et al., JIVP 2009]
• Gaming experiments for 3D retrieval by mobile (Enzo Ferrari Museum, 2013)
• Vision and augmented reality in cultural and natural sites (MARMOTA, FBK; FP7 VENTURI)
• Document and painting recognition for retrieval – as James and many of us do

DOCUMENT RECOGNITION FOR RETRIEVAL
Computer vision (CV) and HCI:
- Image retrieval [Zhang, PAMI 2012]
- Image segmentation and multi-digitalization
- Multi-digitalization of illuminated manuscripts (miniated codes) – the De Rerum Novarum project+
- Multi-digitalization of a digitalized encyclopedia – the Treccani-DICET project**
+ D. Borghesani, C. Grana, R. Cucchiara, "Miniature illustrations retrieval and innovative interaction for digital illuminated manuscripts," Multimedia Systems, 2013.
** D. Coppi, C. Grana, R. Cucchiara, "Illustrations Segmentation in Digitized Documents Using Local Correlation Features," Proc. of IRCDL 2014; MTAP, to appear.

DOCUMENT ANALYSIS FOR INTERACTION
Multi-digitalization of artistic books for new forms of digital interactivity:
- Layout segmentation
- Picture segmentation and tagging
- Copy detection
- Search with relevance feedback
Pipeline: papery documents → digitalization → document analysis (plus manual annotation) → digital library → multimedia interactive digital library → web interaction.

CV FOR AUTOMATIC PICTURE SELECTION
The Holy Bible of Borso d'Este (15th century). Content-based image retrieval.
Classes: background, text, image/picture, decoration; feature annotation and user interface.
Thanks to D. Borghesani, C. Grana (2010-13).

FEATURES
Correlation matrix; feature points from Riemannian to Euclidean space; autocorrelation; directional histograms.
[Borghesani et al., ACM J. Multimedia Systems 2014] [Borghesani et al., MTAP 2012]
Segmentation and tagging on digitalized books (ACM Multimedia 2010, MTAP 2011); interactive surfing on digitalized books; adding positive and negative relevance feedback by users; improving search by similarity.
After multi-digitalization, a multitouch interface (2012): software written in C++ using Nokia Qt4 libraries; supports Windows, Mac and Linux; 46'' LCD panel equipped with 32-point multitouch. Now a multi-digitalized product.

INTERACTION WITH ENCYCLOPEDIA
Multi-digitalization of the Treccani Encyclopedia.
Image segmentation and block classification: XY cut → block extraction → block feature extraction (autocorrelation) → training model → SVM classification.
Thanks to C. Grana, M. Fornaciari, D. Coppi.

INTERACTING WITH ENCYCLOPEDIA
Layout segmentation, specifically tailored for drawings and artistic schemes; compared with Tesseract.
Courtesy of Treccani Enciclopedia.

AUTOCORRELATION MATRIX
• Block analysis
• Visual feature extraction
• Feature classification

Dataset       Method      % TP     % FN     % FP
Treccani      Our         99.57    0.43     4.53
Treccani      Tesseract   52.25    47.71    0.39
Gutenberg13   Our         99.50    0.50     11.50
Gutenberg13   Tesseract   83.13    16.87    1.02

AFTER MULTI-DIGITALIZATION: WEB RETRIEVAL
Searching and interacting with Treccani images through the web.
GOLD descriptors (Gaussians of Local Descriptors), best accuracy at ImageCLEF 2013.
C. Grana, G. Serra, M. Manfredi, R. Cucchiara, "Beyond Bag of Words for Concept Detection and Search of Cultural Heritage Archives," in SISAP 2013, LNCS 8199, Spain, pp. 233-244, Oct. 2-4, 2013. Tutorial at ICPR 2014.
Thanks to C. Grana, G. Serra, M. Manfredi.
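To make the idea behind a Gaussian-of-local-descriptors representation concrete, here is a minimal sketch in Python (illustrative only, not the published GOLD implementation): the set of local descriptors of an image is summarized by the mean and covariance of a fitted Gaussian, and the covariance is mapped from the Riemannian manifold of SPD matrices to Euclidean space with the matrix logarithm before concatenation. The function name, descriptor sizes and regularization constant are assumptions.

import numpy as np
from scipy.linalg import logm

def gold_like_descriptor(local_descs: np.ndarray) -> np.ndarray:
    """local_descs: (N, D) array of local descriptors (e.g., dense SIFT)."""
    mu = local_descs.mean(axis=0)
    cov = np.cov(local_descs, rowvar=False) + 1e-6 * np.eye(local_descs.shape[1])
    # Project the covariance (a point on the manifold of SPD matrices)
    # to Euclidean space with the matrix logarithm.
    log_cov = logm(cov).real
    # Keep only the upper triangle (the matrix is symmetric), then concatenate.
    iu = np.triu_indices(log_cov.shape[0])
    return np.concatenate([mu, log_cov[iu]])

# Hypothetical usage: 500 local descriptors of dimension 64.
descriptor = gold_like_descriptor(np.random.rand(500, 64))

The resulting fixed-length vector can then be fed to a linear classifier for concept detection, playing the role that a Bag-of-Words histogram plays in the baseline.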
Cucchiara,"Beyond Bag of Words for Concept Detection and Search of Cultural Heritage Archives« in SISAP 2013, vol. 8199, LNCS 8199, Spain, pp. 233-244, Oct. 2-4, 2013 TUTORIAL AT ICPR2014 Rita Cucchiara ECCV W VISART 2014, Italy Thanks to C.Grana, G.Serra, M. Manfredi 4. FOR SEEING WITH MORE EYES Physical, augmented experience with tangible CH The explosion of wearable cameras and ego-centric vision: - wearable museum or city guides - self gesture analysis for interactions - real-time recognition of CH targets Using Vision for detecting/tracking/recognizing targets and observers’ interaction Rita Cucchiara ECCV W VISART 2014, Italy EGOCENTRIC VISION Egocentric vision ( “EgoVision”) models and techniques for understanding what a person sees, from the first person’s point of view and centered on the human perceptual needs. Often called first-person vision, to recall the needs of using wearable cameras (e.g. on glasses mounted on the head) for acquiring and processing the same visual stimuli that human acquire and process. a broader meaning ….. to understand what a person sees or want to see and to exploit similar learning, perception and reasoning paradigms of humans.. Rita Cucchiara ECCV W VISART 2014, Italy CV CHALLENGES FOR CULTURAL EXPERIENCE IN EGO-VISION Cultural experience in ego-vision Life Logging Organizing memory and data Off-line Big data Deep learning Storage and transmission issues Rita Cucchiara ECCV W VISART 2014, Italy Human Augmentation Understanding world by vision… On-line Noisy, unconstrained data Fast learning with few examples Processing issues A CULTURAL EXPERIENCE WITH EGOVISION & Computer vision Rita Cucchiara ECCV W VISART 2014, Italy Thanks to Giuseppe Serra, Stefano Aletto, Lorenzo Baldini Francesco Paci, Luca Benini @ETHZ CHALLENGES IN EGO-VISION Egovision for visual augmentation … * Similar to video-surveillance and robot vision • • • • fast, real-time ( please, limit the data searchspace ) reactive pro-active similar scenes ( typically people, social life, children..) many similar methods ( detection, action recognition , tracking) but • Unconstrained • Large different motion factors • Frequent Changes of field of view • Very Long videos * R.Cucchiara Egocentric vision tarcking and evaluating human signs ICVSS Catania 2014 Rita Cucchiara ECCV W VISART 2014, Italy CV CHALLENGES IN EGOVISION(1/3) 1. Hardware • Design new hardware • Exploit real-time capabilities for egovision 2. Recognizing FoA /PoI • Estimating FoA[Li ICCV2013], [OgakiCVPRW2012], [Jianfeng CVPR2014] • Eye-tracking; & ego vision Rita Cucchiara ECCV W VISART 2014, Italy Gglass, Vuzix M100,Golden-i, Mobox+OdroidXU, MEG4.0 CV CHALLENGES IN EGO-VISION(2/3) 3. Recognizing head motion • • • • Head/body motion for outdoor summarization [ Peleg CVPR2014] Motion for indoor summarization [Grauman CVPR2013] Motion for supporting attention [Matsuo, CVPRW2014] Motion for SLAM as in robotics [Bahera, ACCV2012] 4. Recognizing objects • Objects useful for humans [Fathi, CVPR2013, Fathi, CVPR2011] • Objects in the hand [Fathi, CVPR2011] • Target tagging in the scene [Pirsiavash, Ramanan CVPR 2012] or Artworks in a museum…. Rita Cucchiara ECCV W VISART 2014, Italy CV CHALLENGES IN EGO-VISION(3/3) 5. Recognizing actions • Self-actions gestures [Kitani, CVPR 2013; Baraldi, EVW2014] • Actions of people, social actions [Ryoo, CVPR 2013; Alletto EFPVW 2014] • Actions in the environment (sport..) [Kitani, IEEE PC Magazine 2012] 6. 
6. Tracking: recognizing over time
• Tracking target objects
• Tracking faces and people [Alletto ICPR 2014]
• Multiple target tracking

EGO-VISION INTERACTION WITH CH
• Positioning: recognizing targets indoors or outdoors
• HCI: gesture recognition from egocentric video; recognition of emotions and feelings
• Experience augmentation: recognizing visual/audio queries and interaction

A VIDEO
Video: gesture recognition demo.
L. Baraldi, F. Paci, G. Serra, L. Benini, R. Cucchiara, "Gesture Recognition using Wearable Vision Sensors to Enhance Visitors' Museum Experiences," IEEE Sensors Journal, 2015.

HAND SEGMENTATION IN EGO-VISION
In ego-vision:
• many luminance variations
• correcting strong camera/head motion
• recognizing ego-gestures from very few examples
It is an old problem, with many approaches in different contexts:
• Skin classification [Khan et al. ICIP 2010]; random forests (better than BN, MLP, NB, AdaBoost…)
• Background subtraction after image registration [Fathi ICCV 2011] (assuming a static background, hands with objects, etc.)
• Generic object recognition [Li, Kitani CVPR 2013]: sparse feature selection and a battery of random forests trained under different luminance conditions

EGO-GESTURE RECOGNITION
1) (Ego-)hand detection
2) (Ego-)camera motion suppression
3) Feature extraction
4) Classification
L. Baraldi, F. Paci, G. Serra, L. Benini, R. Cucchiara, "Gesture Recognition in Ego-Centric Videos using Dense Trajectories and Hand Segmentation," in Proc. of the 10th IEEE Embedded Vision Workshop @ CVPR 2014.

AN EGO-VISION SOLUTION
Pipeline: superpixel segmentation → superpixel descriptors → classification by a collection of random forests → temporal coherence → spatial coherence.
Superpixel segmentation: SLIC (Simple Linear Iterative Clustering*), k-means in 5D (Lab + xy).
* Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk, "SLIC Superpixels Compared to State-of-the-art Superpixel Methods," IEEE TPAMI, vol. 34, no. 11, pp. 2274-2282, 2012.
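As a concrete illustration of the superpixel step, here is a minimal sketch in Python using scikit-image's SLIC implementation; the input file name, number of segments and compactness are illustrative values, not those used in the published pipeline.

import numpy as np
from skimage import io
from skimage.segmentation import slic
from skimage.color import rgb2lab

frame = io.imread("frame.png")            # hypothetical input frame
# SLIC clusters pixels with k-means in the 5-D (L, a, b, x, y) space.
segments = slic(frame, n_segments=400, compactness=10, start_label=0)

# Per-superpixel mean Lab color, a first ingredient of the superpixel descriptor.
lab = rgb2lab(frame)
for sp in np.unique(segments):
    mask = segments == sp
    mean_lab = lab[mask].mean(axis=0)

Each superpixel would then be described further (color covariances, Gabor responses, HOG, as listed in the next slide) before classification.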
[Serra ACM MM Workshops 2013] [Baraldi CVPRW 2014]

AN EGOVISION SOLUTION: SUPERPIXEL DESCRIPTORS
Descriptors per superpixel:
- mean and covariance in RGB, LabH and HSVH
- 27 Gabor filters (9 orientations, 3 scales: 7x7, 13x13, 19x19)
- HOG

AN EGOVISION SOLUTION: CLASSIFICATION BY A COLLECTION OF RFs
Classifier:
- a collection of random forests, indexed by a 32-bin RGB histogram of the frame (a global luminance feature)
- it encodes the appearance of the scene and the global luminance
- hypothesis: background and hands change color accordingly

AN EGOVISION SOLUTION: TEMPORAL COHERENCE
Estimated priors: temporal smoothing in a window of k frames; the posterior probability of being (or not being) a hand pixel in the previous window is used as a prior.

AN EGOVISION SOLUTION: SPATIAL CONSISTENCY
- Eliminate spurious superpixels
- Close holes
- Apply GrabCut, using the posterior as a seed

HAND SEGMENTATION CONSISTENCY
• Hand segmentation without temporal and spatial consistency
• Hand segmentation with temporal and spatial consistency

HAND SEGMENTATION
Results show a significant improvement in performance when all three consistency aspects are used together:
• illumination invariance (II)
• temporal smoothing (TS)
• spatial consistency (SC)
The method proposed by Li et al. is the most similar to our approach; nevertheless, by exploiting temporal and spatial coherence we are able to outperform their results.

CAMERA MOTION SUPPRESSION
1) (Ego-)hand detection
2) (Ego-)camera motion suppression
3) Feature extraction
4) Classification

CAMERA MOTION
• Camera (head) motion removal.
• Hand movements are usually not consistent with camera motion, resulting in wrong matches between two frames; for this reason we introduce a segmentation mask that disregards feature matches belonging to hands.
• Pipeline: extract dense keypoints from the original frame sequence, discard matches on hand regions, estimate the homography, and apply it to obtain an output frame sequence without camera motion.
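A minimal sketch of this kind of homography-based camera-motion suppression with OpenCV follows. The hand mask is assumed to come from the segmentation step above; for brevity the sketch uses sparse ORB keypoints rather than the dense keypoints mentioned in the slide, and the file names and thresholds are illustrative.

import cv2
import numpy as np

prev = cv2.imread("frame_t.png", cv2.IMREAD_GRAYSCALE)      # hypothetical frames
curr = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)
hand_mask = cv2.imread("hands_t.png", cv2.IMREAD_GRAYSCALE)  # 255 on hand pixels

# Detect keypoints only outside the hands (their motion is independent of the head).
detector = cv2.ORB_create(nfeatures=2000)
kp1, des1 = detector.detectAndCompute(prev, cv2.bitwise_not(hand_mask))
kp2, des2 = detector.detectAndCompute(curr, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des1, des2)

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# Robustly estimate the background homography and warp the current frame onto
# the previous one, so that residual motion is mostly due to the hands/gesture.
H, inliers = cv2.findHomography(dst, src, cv2.RANSAC, 3.0)
stabilized = cv2.warpPerspective(curr, H, (prev.shape[1], prev.shape[0]))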
EGO-GESTURE FEATURES
• Dense trajectories, HOG, HOF and MBH* are extracted around hand regions.
• Feature points are sampled inside and around the user's hands and tracked during the gesture; descriptors are computed inside a spatio-temporal volume aligned with each trajectory.
• Descriptors are encoded in a Bag of Words and then classified using a linear SVM classifier.
* [Wang et al. CVPR 2013]

DENSE TRAJECTORIES
From* dense points (but, as in Shi-Tomasi '94, points whose autocorrelation-matrix eigenvalues are very small, i.e. in homogeneous regions, are discarded). Points are connected into trajectories and the normalized trajectory shape is used.
HOG: static appearance [Dalal and Triggs 2005]; HOF with 9 bins: motion [Laptev 2008]; MBH: motion boundary histograms [Dalal 2004].
* H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, "Action Recognition by Dense Trajectories," in Proc. of CVPR, 2011, and IJCV 2013.

TRAJECTORY DESCRIPTION
• Having removed camera motion between two adjacent frames, trajectories can be extracted.
• Feature points are densely sampled at several spatial scales and tracked. Trajectories are restricted to lie inside and around the user's hands (with hand segmentation vs. without hand segmentation).

TRAJECTORY DESCRIPTION
• The spatio-temporal volume aligned with each trajectory is considered, and the Trajectory descriptor, HOG, HOF and MBH are computed around it.
• Since the histograms tend to be sparse, they are power-normalized to unsparsify the representation while still allowing for linear classification; a power-law function of the form f(h) = sign(h)·|h|^α is applied to each bin.
• The final descriptor is the concatenation of the four power-normalized Bag-of-Words histograms (Trajectory, HOG, HOF, MBH). Gestures are eventually recognized using a linear SVM 1-vs-1 classifier.
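To make the final encoding step concrete, here is a minimal sketch in Python of power normalization, concatenation of the four Bag-of-Words histograms, and one-vs-one linear SVM classification with scikit-learn. The value of alpha, the codebook sizes, the random data and all variable names are illustrative assumptions, not values from the talk.

import numpy as np
from sklearn.svm import SVC

def power_normalize(h, alpha=0.5):
    # f(h) = sign(h) * |h|^alpha, applied bin-wise to "unsparsify" the histogram.
    return np.sign(h) * np.abs(h) ** alpha

def encode(traj_bow, hog_bow, hof_bow, mbh_bow):
    return np.concatenate([power_normalize(h) for h in (traj_bow, hog_bow, hof_bow, mbh_bow)])

# Hypothetical training data: one BoW quadruple per gesture clip.
rng = np.random.default_rng(0)
X = np.stack([encode(*[rng.random(256) for _ in range(4)]) for _ in range(70)])
y = rng.integers(0, 7, size=70)                           # seven gesture classes

clf = SVC(kernel="linear", decision_function_shape="ovo")  # 1-vs-1 linear SVM
clf.fit(X, y)
pred = clf.predict(X[:5])

In the actual pipeline the histograms would come from quantizing the trajectory-aligned descriptors against learned codebooks rather than from random data.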
EXPERIMENTAL RESULTS
Datasets:
• The Cambridge-Gesture database, with 900 sequences of nine hand gesture types under different illumination conditions;
• Our Interactive Museum Dataset, an ego-centric gesture recognition dataset with 700 sequences from seven gesture classes performed by five subjects;
• The EDSH dataset, which consists of three egocentric videos with indoor and outdoor scenes and large variations of illumination.

GESTURE RECOGNITION
Results on the Cambridge Gesture DB and on the Interactive Museum Dataset, using only 2 samples per class for training.
[2] T.-K. Kim and R. Cipolla, "Canonical correlation analysis of video volume tensors for action categorization and detection," IEEE Trans. PAMI, 2009.
[3] Y. M. Lui, J. R. Beveridge, and M. Kirby, "Action classification on product manifolds," in Proc. of CVPR, 2010.
[4] Y. M. Lui and J. R. Beveridge, "Tangent bundle for human action recognition," in Proc. of Automatic Face & Gesture Recognition and Workshops, 2011.
[5] A. Sanin, C. Sanderson, M. T. Harandi, and B. C. Lovell, "Spatio-temporal covariance descriptors for action and gesture recognition," in Proc. of the Workshop on Applications of Computer Vision, 2013.

LAST BUT…

TRACKING: THE BIG CHALLENGE

SINGLE TARGET TRACKING
Tracking is the task of generating an inference about the motion of an object given a sequence of images*. In ego-vision it is hard!
* Arnold W. M. Smeulders, Dung M. Chu, Rita Cucchiara, Simone Calderara, Afshin Dehghan, and Mubarak Shah, "Visual Tracking: an Experimental Survey," IEEE TPAMI, July 2014.

THE HARDNESS OF TRACKING
What is the invariance that can be perceived and maintained over time? Tracking is hard, as nothing is fixed:
• problems of light: the target's appearance, the illumination;
• problems of motion: the object/camera motion;
• problems of scene: occlusion, confusion…
Searching for the invariance in the video.

14 TRACKING CHALLENGES IN 313 VIDEOS
01-LIGHT, 02-SURFACECOVER, 03-SPECULARITY, 04-TRANSPARENCY, 05-SHAPE, 06-MOTIONSMOOTHNESS, 07-MOTIONCOHERENCE, 08-CLUTTER, 09-CONFUSION, 10-LOWCONTRAST, 11-OCCLUSION, 12-MOVINGCAMERA, 13-ZOOMINGCAMERA, 14-LONGDURATION.

EXPERIMENTAL RESULTS ON ALOV++
Survival curves by Kaplan-Meier (see PAMI 2014) for trackers including [TST], [FBT], [STR], [L1O], [NCC], [TLD]: even the upper bound (taking the best of all trackers at each frame) shows that only about 30% of the videos are correctly tracked, while the lower bound (what all trackers can do) is about 7%.

TRACKING IN EGO-VISION
In egovision: all the previous problems, plus relative motion:
• no motion of the observer, but motion of the target;
• motion of the observer, but a fixed target;
• motion of both observer and target.
The datasets @Imagelab: EGO_GROUP and EGO_TRACK.

EGOVISION FROM A MOVING HEAD

TRACKING IN EGOVISION: EVALUATION
Tracking results in the second scenario, V2.2: tracking of an environmental point of interest. The target stays still but gets occluded and exits the camera FoV. Color-based trackers (HBT, NN) perform poorly due to the difficulty in discriminating the object based on color.

TRACKING IN EGOVISION: EVALUATION
[Table: F-measure of the NN, HBT, TLD, STR, NCC and FRT trackers on videos V1.1-V1.2 (still camera, still person), V2.1-V2.3 (moving camera, still person) and V3.1-V3.3 (moving camera, moving person); values range from about 0.01 to 0.64.]
A lot of work to do…

TRACKING AND INTERACTING
DICET Project (2013-2015): tracking people and targets for social interaction analysis; on-line interaction with art.

CONCLUSIONS AND OPEN PROBLEMS
A few conclusions:
• Computer vision for multi-digitalization and interaction: a long, successful story (2D documents, 3D objects).
• Computer vision for real-time interaction: not a hardware but a software problem; a long way ahead; interaction by egocentric vision is no longer a utopia.

THANKS TO
http://imagelab.ing.unimo.it
Interdepartmental Research Center in ICT, Tecnopolo di Modena, Emilia Romagna High Technology Network.
PEOPLE: Rita Cucchiara, Giuseppe Serra, Marco Manfredi, Costantino Grana, Paolo Santinelli, Francesco Solera, Roberto Vezzani, Martino Lombardi, Simone Pistocchi, Simone Calderara, Michele Fornaciari, Fabio Battilani, Augusto Pieracci, Dalia Coppi, Patrizia Varini, Stefano Alletto.