Concepts and Techniques for Image and Video Search by Visual Content
Alejandro Jaimes
Columbia University, Department of Electrical Engineering, Digital Video and Multimedia Group

Outline
- Introduction
  – Future Multimedia Applications
  – Problems Addressed & Overview
- Understanding of Visual Content & Users
  – Multi-Level Indexing Pyramid & Eye Tracking
- Flexible Computational Methods that Learn
  – The Visual Apprentice
- Detectors in Practical Tasks
  – STELLA & Non-Identical Duplicates
- Summary of Contributions
- Conclusions & Future Work

Motivation
- Tremendous growth in the availability of multimedia data for personal use
  – Creation (e.g., digital cameras)
  – Acquisition (e.g., scanners, recording devices)
  – Access (e.g., web, pervasive devices)
- Advances in communications, affective, and wearable computing

The Visual Revolution
[Timeline figure, "a non-mathematical historical perspective" of user activity over time: lithography 1798, photography 1860s, Brownie camera 1900, Super 8 mm film cartridges 1965, VCR 1972, digital cameras 1990s, TiVo / TV Anytime late 1990s, future applications 2050+]

Alex in the Future: Active Users!
[Figure: personalized everywhere display, affective interface, multimedia capture device, wearable multimedia application ... exciting, unforeseen applications]

Problems Addressed
- Automatic indexing and organization of visual information through user input at multiple levels
  – Understanding of visual content and users: Multi-Level Indexing Pyramid & Eye Tracking
  – Flexible computational frameworks: The Visual Apprentice
  – Integration of generic detectors in solving practical tasks in a specific domain: STELLA & Duplicate Detection

Overview
[Diagram: automatic indexing and organization of visual information through user input at multiple levels — Understanding Content & Users, Flexible Frameworks (object, object-parts, perceptual areas, regions), Detectors in Practical Tasks]

Roadmap: Understanding of Visual Content & Users — Multi-Level Indexing Pyramid & Eye Tracking

The Multi-Level Indexing Pyramid (Understanding Content & Users) [SPIE '00] [MPEG-7 '99, '00] [ASIST '00] [JASIS '01]
- Understand levels of description of visual information
- Conceptual structure for classifying visual attributes into multiple levels
  – Art (E. Panofsky), Cognitive Psychology (E. Rosch et al.)
  – Information Sciences (C. Jörgensen), Visual Information Retrieval
- Why is this important?
  – Guides the indexing process
  – Helps build better indexing systems
- Key points
  – The pyramid represents the full range of visual attributes
  – Strong impact on MPEG-7

Indexing Levels (Visual Attributes)
- Syntax: 1. Type/Technique, 2. Global Distribution, 3. Local Structure, 4. Global Composition
- Semantics: 5. Generic Object, 6. Generic Scene, 7. Specific Object, 8. Specific Scene, 9. Abstract Object, 10. Abstract Scene
- The knowledge required increases down the pyramid (e.g., "texture" at the global-distribution level vs. named persons such as Ana and Alex at the specific-object level)

Level 1: Type/Technique (Syntax)
- Type or technique used during production
- No knowledge of visual content, just general visual characteristics
- Examples:
  – Color or b/w photograph
  – Watercolor or oil painting
[Example images: color photograph, oil painting]
Level 2: Global Distribution (Syntax)
- Distribution of low-level features only
- Examples:
  – Color distribution: dominant color, histogram
  – Global texture: coarseness, contrast
  – Global shape: aspect ratio
  – Global motion/deformation: speed, acceleration
[Example images: similar texture, similar color histogram]

Level 3: Local Structure (Syntax)
- Characterization and extraction of basic visual elements
- Examples:
  – Dots, lines, tone, circles, squares
  – Local color
  – Binary shape mask
  – Local motion/deformation
[Example images: blood cells = circles, stars = dots]

Level 4: Global Composition (Syntax)
- Arrangement or layout of basic elements
- No knowledge of objects
- Examples:
  – Balance, symmetry
  – Center of interest
  – Leading line, viewing angle
[Example images: horizontal leading line, centered object]

Level 5: Generic Object (Semantics)
- General (everyday) knowledge about objects
- What the image is "of"
- Examples:
  – Common nouns: person, chair, desk, airplane
[Example image: persons, flag]

Level 6: Generic Scene (Semantics)
- General knowledge about the scene
- What the image is "of"
- Examples:
  – City, landscape
  – Indoor, outdoor
  – Daytime, nighttime
[Example images: outdoors/city/street; indoors/office]

Level 7: Specific Object (Semantics)
- Identified and named objects
- Specific knowledge about objects, known facts
- What the image is "of"
- Examples:
  – F-18
  – B. Clinton
  – Chinese Ambassador Z. Li
  – American flag
  – Lincoln desk
[Example image: B. Clinton, Z. Li]

Level 8: Specific Scene (Semantics)
- Identified and named scene
- Specific knowledge about the scene, known facts
- What the image is "of"
- Examples:
  – Name of a city, street, lake
  – Name of a building
[Example images: Paris; Oval Office, White House]

Level 9: Abstract Object (Semantics)
- Interpretation of an object
- Subjective, or based on specific personal knowledge
- What the image is "about"
- Examples:
  – Political power
  – Sympathy
[Example images: about baseball (or basketball?); political gesture]

Level 10: Abstract Scene (Semantics)
- Subjective interpretation of a scene
- What the image is "about"
- Examples:
  – International politics
  – War
  – Apology
[Example images: peacefulness; US government]

Pyramid Example
1. TYPE/TECHNIQUE: Color still image
2. GLOBAL DISTRIBUTION: Color histogram
3. LOCAL STRUCTURE: Circles, squares
4. GLOBAL COMPOSITION: Centered
5. GENERIC OBJECT (of): Persons, building
6. GENERIC SCENE (of): Outdoors
7. SPECIFIC OBJECT (of): Ana, Alex
8. SPECIFIC SCENE (of): CEPSR
9. ABSTRACT OBJECT (about): Happy, friendly
10. ABSTRACT SCENE (about): Research agreement, friendship
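The pyramid is, in the end, a structure for organizing annotations. As a purely illustrative sketch (not part of the original talk), the ten levels and the example annotation above could be represented as follows; the names PyramidLevel and ImageDescription are assumptions introduced here.

```python
from dataclasses import dataclass, field
from enum import Enum


class PyramidLevel(Enum):
    """The ten levels of the Multi-Level Indexing Pyramid."""
    TYPE_TECHNIQUE = 1
    GLOBAL_DISTRIBUTION = 2
    LOCAL_STRUCTURE = 3
    GLOBAL_COMPOSITION = 4
    GENERIC_OBJECT = 5
    GENERIC_SCENE = 6
    SPECIFIC_OBJECT = 7
    SPECIFIC_SCENE = 8
    ABSTRACT_OBJECT = 9
    ABSTRACT_SCENE = 10

    @property
    def is_syntax(self) -> bool:
        # Levels 1-4 describe syntax; levels 5-10 describe semantics.
        return self.value <= 4


@dataclass
class ImageDescription:
    """Attributes of one image, grouped by pyramid level."""
    image_id: str
    attributes: dict = field(default_factory=dict)  # PyramidLevel -> list of labels

    def add(self, level: PyramidLevel, *labels: str) -> None:
        self.attributes.setdefault(level, []).extend(labels)


# The "Pyramid Example" slide above, expressed with this structure.
desc = ImageDescription("ana_alex_cepsr")
desc.add(PyramidLevel.TYPE_TECHNIQUE, "color still image")
desc.add(PyramidLevel.GLOBAL_DISTRIBUTION, "color histogram")
desc.add(PyramidLevel.LOCAL_STRUCTURE, "circles", "squares")
desc.add(PyramidLevel.GLOBAL_COMPOSITION, "centered")
desc.add(PyramidLevel.GENERIC_OBJECT, "persons", "building")
desc.add(PyramidLevel.GENERIC_SCENE, "outdoors")
desc.add(PyramidLevel.SPECIFIC_OBJECT, "Ana", "Alex")
desc.add(PyramidLevel.SPECIFIC_SCENE, "CEPSR")
desc.add(PyramidLevel.ABSTRACT_OBJECT, "happy", "friendly")
desc.add(PyramidLevel.ABSTRACT_SCENE, "research agreement", "friendship")

for level, labels in desc.attributes.items():
    kind = "syntax" if level.is_syntax else "semantics"
    print(f"{level.value:2d}. {level.name} ({kind}): {', '.join(labels)}")
```

Keeping syntax and semantics in one structure is what allows the same annotation record to serve both low-level (feature-based) and high-level (concept-based) retrieval.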
Pyramid Contributions & Impact (Understanding Content & Users)
- Novel structure to classify visual attributes into multiple levels
  – Spontaneous, semi-structured, and structured descriptions; 66 subjects (naïve users, indexers, researchers), 700 images
  – Represents the full range of visual attributes (also shown by Burke, CIVR '02)
  – Can guide the indexing process (more attributes are generated with the pyramid than without it)
  – Retrieval improvement between 5% and 80%
- Impact on MPEG-7
  – Syntactic/semantic objects and relationships
  – Specific and generic events/objects, concrete objects/events
  – Text annotation type, labels for semantic entities, etc.

Eye Tracking (Understanding Content & Users) [with J. Pelz, T. Grabowski, J. Babcock, SPIE '01]
- Study the way people look at images
- Determine whether there are fixation patterns within image categories (handshake, landscape, etc.)
- Why is this interesting?
  – Understand the visual process
  – Use regions of interest from eye tracking to construct structured classifiers (e.g., the Visual Apprentice)
- Key points
  – First study with different image categories
    • Prior work: natural tasks (Pelz et al., 2000), art (Buswell, 1920), ROIs (Privitera & Stark, 2000), perception (Yarbus, 1967), etc.
  – Patterns found in the handshake and centered-object categories

Types of Eye Movements
[Figure: fixation A, saccade, fixation B]
- Fixations: gaze held at a stationary point (typically 250 msec, but can vary from 100 msec to 1000 msec)
- Saccades: rapid, ballistic eye movements; 2-3 eye movements per second (> 100,000 per day)

Eye Tracking Setup
- ASL 504 Remote Eye Tracker (Applied Science Laboratories)

Experimental Setup
- Examine fixation patterns of human subjects viewing images of different categories
- Image classes (50 images per class)
  – Handshakes
  – Landscapes
  – Centered object
  – Crowds
  – Miscellaneous
- Subjects
  – 10 subjects (6 male, 4 female; undergraduate students with normal or corrected-to-normal vision)

Experimental Results I
- Strong image dependence (one subject, several images)

Experimental Results II
- Strong similarity in fixation patterns for some images
[Figure: fixation patterns for Subjects A-D]

Experimental Results III
- Wide variation in fixation patterns for some images
[Figure: fixation patterns for Subjects A-D]

Experimental Results IV
- Categories with consistent fixation patterns (handshake, centered object)

Eye Tracking Contributions & Impact (Understanding Content & Users)
- First study within and across image categories
  – 10 subjects, 250 images, 5 categories (400,000+ data points)
  – Patterns in the handshake and centered-object categories; no patterns in the others
- Understand what to index
- Link to automatic techniques
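The fixation patterns above are computed from raw gaze samples. A minimal sketch of one standard way to do this, dispersion-threshold identification (I-DT), is shown below; it is not the analysis code used in the study, and the sample format (t, x, y) and the thresholds are illustrative assumptions based on the fixation durations quoted above.

```python
def _dispersion(window):
    xs = [x for _, x, _ in window]
    ys = [y for _, _, y in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def detect_fixations(samples, max_dispersion=1.0, min_duration=0.100):
    """Dispersion-threshold (I-DT) fixation detection.

    samples: time-ordered list of (t_seconds, x_deg, y_deg) gaze samples.
    max_dispersion: max (x-range + y-range), in degrees, for a fixation window.
    min_duration: minimum fixation duration in seconds (slide: ~100-1000 ms).
    Returns a list of (t_start, t_end, centroid_x, centroid_y).
    """
    fixations = []
    i, n = 0, len(samples)
    while i < n:
        # Start with a window spanning the minimum fixation duration.
        j = i
        while j < n and samples[j][0] - samples[i][0] < min_duration:
            j += 1
        if j >= n:
            break
        if _dispersion(samples[i:j + 1]) <= max_dispersion:
            # Extend the window while the gaze stays tightly clustered.
            while j + 1 < n and _dispersion(samples[i:j + 2]) <= max_dispersion:
                j += 1
            window = samples[i:j + 1]
            xs = [x for _, x, _ in window]
            ys = [y for _, _, y in window]
            fixations.append((window[0][0], window[-1][0],
                              sum(xs) / len(xs), sum(ys) / len(ys)))
            i = j + 1  # What follows the fixation is treated as a saccade.
        else:
            i += 1
    return fixations
```

The resulting fixation centroids are exactly the kind of regions of interest that the slides propose feeding into structured classifiers such as the Visual Apprentice.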
Roadmap: Flexible Computational Methods that Learn — The Visual Apprentice

The Visual Apprentice (Flexible Frameworks that Learn) [SPIE '99, '00, ACCV '00, IJIG '01]
- Automatically construct a visual detector from user input
  – Visual detectors are programs that automatically assign semantic labels to objects (e.g., sky) or scenes (e.g., handshake scene)
- Novel framework to learn structured visual detectors from user input at multiple levels
  – The user defines models and provides training examples
  – Learning algorithms + different features + training examples = automatic visual detectors
- Key points
  – Flexible computational approach
  – Multiple features, learning algorithms
  – No expert input required

Motivation
- Why structured detectors? Definition of complex scenes/objects
- Why learning? Flexible (no expert input) vs. static systems
- Why multiple learning algorithms and features? Different algorithms and features perform differently on the same data

Related Work
- Visual detectors
  – Most approaches are static or specialized: Body Plans (UC Berkeley), WebSeer (U. Chicago)
  – Others work at the scene level only (MIT indoor/outdoor, HP city vs. landscape)
  – Other learning approaches are not structured (MIT FourEyes)

The Visual Apprentice Definition Hierarchy
- Level 1: Object
- Level 2: Object-parts (object-part 1 ... object-part n)
- Level 3: Perceptual areas (perceptual area 1 ... perceptual area n)
- Level 4: Regions (region 1, region 2, ...)

User Input
- Define the hierarchy
  – Decide nodes and containment relationships
- For each node
  – Label examples in images/videos
- Labeling is performed by clicking on regions or outlining areas.

Definition Hierarchy: Batting
- Object: batting scene
- Object-parts: ground, pitcher, batter
- Perceptual areas (under ground): grass, sand
- Regions under each leaf node

Definition Hierarchy Example
[Figure: labeled batting image — ground (grass and sand regions), pitcher regions, batter regions]

Learning Classifiers
- A classifier is a function that, given an input, assigns it to one of k classes. A learning algorithm is a function that, given a set of examples and their classes, constructs a classifier.
- Training data
  – For each node of the hierarchy, a set of examples
  – A superset of features is extracted from each example:
    • Color (average LUV, dominant color, etc.)
    • Shape & location (perimeter, form factor, eccentricity, etc.)
    • Texture (edge direction histogram, Tamura features, etc.)
    • Motion (trajectory, velocity, etc.)
- A superset of machine learning algorithms uses the training data
  • Nearest neighbors, decision trees, etc.

Learning Classifiers: Stages
[Diagram: for each node of the definition hierarchy, learning algorithms 1..n produce classifiers C1..Cn, compared by a performance estimator on the training data]
- Stage 1: Training data obtained and feature vectors computed
- Stage 2: Classifiers learned
- Stage 3: Classifiers selected

Feature Subset Selection
- Given a feature set A, find a subset B ⊆ A such that B is a better feature set than A (according to the performance of the resulting classifier)
- Wrapper model: candidate feature subsets are drawn from the feature set and each subset is evaluated by running the learning algorithm on the training set and measuring the resulting classifier's performance
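To make the three stages and the wrapper model concrete, here is a minimal scikit-learn style sketch, assuming per-node feature matrices X (NumPy arrays) and labels y have already been produced by the segmentation and feature-extraction steps described above; it illustrates the idea, not the Visual Apprentice implementation, and the function names are hypothetical.

```python
# Minimal sketch: one classifier per hierarchy node, chosen among several
# learners, with greedy wrapper-style feature selection.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def select_features_wrapper(model, X, y, max_features=5):
    """Greedy forward selection: subsets are scored by the classifier itself."""
    selected, best_score = [], -np.inf
    while len(selected) < min(max_features, X.shape[1]):
        scores = {
            f: cross_val_score(model, X[:, selected + [f]], y, cv=3).mean()
            for f in range(X.shape[1]) if f not in selected
        }
        f_best, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:
            break  # no remaining feature improves performance
        selected.append(f_best)
        best_score = score
    return selected, best_score


def train_node_classifier(X, y):
    """Stages 1-3: try several learners (+ feature subsets), keep the best."""
    candidates = [KNeighborsClassifier(n_neighbors=3), DecisionTreeClassifier()]
    best = None
    for model in candidates:
        feats, score = select_features_wrapper(model, X, y)
        if best is None or score > best[2]:
            best = (model.fit(X[:, feats], y), feats, score)
    return best  # (fitted classifier, selected feature indices, CV score)


# One classifier per node of the definition hierarchy, e.g. for batting:
# node_classifiers = {name: train_node_classifier(X, y)
#                     for name, (X, y) in training_data.items()}
```

Because the evaluation loop wraps around the learning algorithm itself, a different feature subset (and a different learner) can win at each node, which is exactly the flexibility the slides argue for.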
A Classification Example: Handshake
- Hierarchy: object = handshake; object-parts = face 1, handshake, face 2; each defined over regions
- A face region classifier Cr determines which regions become the input of the face object-part classifier Cf

Training Summary
- User:
  – Labels example images according to the hierarchy
- System:
  – Automatically segments the examples
  – Extracts visual features for each node
  – Applies a set of learning algorithms to each node
  – Selects the best features for each classifier
  – Produces a set of classifiers for each node (with the best features)
  – Selects the best classifier

Classification Summary
- Automatic segmentation
- Feature extraction
- Classification and grouping, bottom-up through the hierarchy (regions → perceptual areas → object-parts → object, as in the handshake hierarchy above)

Experiments: Definition Hierarchy I
- Objects defined directly over regions (no object-parts or perceptual areas): ship, sky, elephant, face

Experiments: Definition Hierarchy II
- Skater: object-parts body and face; perceptual areas white and blue; regions
- Handshake: object-parts face 1, handshake, face 2; regions

Experiments: Definition Hierarchy III
- Batting: object-parts ground (perceptual areas grass and sand), pitcher, batter; regions

Results I

Hierarchy I:
  Class       Recall  Precision  Train  Test
  Ships         94%      70%       30    280
  Roses         94%      75%       30    280
  Elephants     91%      77%       30    280
  Cocktails     95%      36%       30    280
  Face          70%      89%       86    378
  Skies         55%      87%       40    670

Hierarchy II:
  Skater       100%      62%       30    490
  Handshakes    74%      70%       80    733

Baseball Experiments
- Data used
  – 12 innings from different games
  – Different teams (7)
  – Field type (natural and artificial turf)
  – Time of day (6 day games, 6 night games)
  – Broadcast source (4 channels)
- Difficulties
  – Visual appearance: time, weather, stadium, teams, etc.
  – Signal: reception, noise, origin, etc.
- Batting-scene detection: Recall 64%, Precision 100% (Train: 60, Test: 376)

Examples
[Figure: missed batting scenes; shots correctly rejected as pitching scenes]

When to Use Learning? Recurrent Visual Semantics (RVS)
- RVS: the repetitive appearance of elements (e.g., objects, scenes, or shots) that are visually similar and have a common level of meaning within a specific context
- Analysis: when to use the VA?
  – Structured classes (e.g., not indoor/outdoor classifiers)
  – Where Recurrent Visual Semantics are present
- Structural differences
  – More nodes: higher precision, lower recall
- Advantages
  – The hierarchy implies context
  – Can integrate classifiers built by experts
  – Can generate multiple features (e.g., MPEG-7)

VA Contributions & Impact
- Novel framework that learns visual detectors from user input at multiple levels
  – Integration of multiple learning algorithms and features
  – Use of non-domain-specific techniques (domain knowledge can be easily incorporated)
  – Flexibility in defining object/scene hierarchies and detailed user input at multiple levels
- Applied in different domains and projects (digital library news images, Corel, baseball, consumer photography)
- Visual Apprentice plus expert input (D. Zhong, ICME '00)
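As a rough illustration of the classification-and-grouping step summarized above (a simplified, assumption-based sketch rather than the actual VA decision logic), per-node classifiers can be applied bottom-up and an object declared present when its object-parts are found; the names classify_image and hierarchy, and the binary 0/1 labels, are hypothetical.

```python
# Illustrative bottom-up classification with per-node classifiers
# (continuing the hypothetical train_node_classifier example above).

def classify_image(regions, node_classifiers, hierarchy):
    """regions: list of per-region feature vectors (NumPy arrays) from segmentation.
    node_classifiers: {node_name: (clf, feature_idx, score)} per hierarchy node.
    hierarchy: {object_name: [object_part names]} containment relationships.
    Returns object labels whose parts were all detected."""
    detected_parts = set()
    for part, (clf, feats, _) in node_classifiers.items():
        # A part counts as detected if any segmented region is classified as
        # belonging to it (label 1 is assumed to mean "positive").
        if any(clf.predict(x[feats].reshape(1, -1))[0] == 1 for x in regions):
            detected_parts.add(part)
    # An object (e.g., "handshake" or "batting") is detected when all of its
    # object-parts are present; the hierarchy itself supplies the context.
    return [obj for obj, parts in hierarchy.items()
            if all(p in detected_parts for p in parts)]


# Example with hypothetical names:
# hierarchy = {"handshake": ["face_1", "face_2", "handshake_area"]}
# labels = classify_image(segment(image), node_classifiers, hierarchy)
```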
Roadmap: Detectors in Practical Tasks — STELLA & Non-Identical Duplicates

STELLA (Story TELLing and Album Creation) [with Alex C. Loui of Kodak & Ana B. Benitez; interface with Enrico Bounano]
- Hierarchical organization (novel composition features)
- Time sequence information (novel variation of the Ward clustering algorithm)
- The user can add metadata, edit clusters, etc.
- Logs editing operations
- Non-identical duplicates!

Non-Identical Duplicate Detection (Detectors in Practical Tasks) [with Alex C. Loui of Kodak]
- Determine whether image I1 and image I2 are duplicates
  – Duplicates: images of the same scene, taken from approximately the same angle and range (Kodak definition)
- New duplicate taxonomy, novel framework for detecting non-identical duplicates, detailed analysis of a duplicate database
- Why are duplicates important?
  – Essential in organizing personal collections
  – They represent important events and are prevalent in current collections (19% of the images per roll in the Kodak database); many more in the future
- Key points
  – New problem (related to, but distinct from, stereo, registration, mosaics, etc.)
  – Multiple view geometry + visual detectors

A Duplicate Model
- A duplicate pair shares the scene, but something has changed in the scene, the lighting, the camera, or the image:
  – Scene (subject/background): move, change, non-stationary subject
  – Lighting: flash/no flash, sunny/overcast, light/dark
  – Camera: exposure (aperture, shutter speed), viewpoint (translation, rotation pan/tilt, zoom)
  – Image: noise, luminosity, color, added text (video only)

Duplicate Categories
- No change; horizontal/vertical translation; framing; angle change; zoom; subject move; subject change; different background; several changes

A Novel Duplicate Taxonomy
[Figure: taxonomy of duplicate types]

A New Framework
[Figure: duplicate detection framework]

Multiple View Geometry
- Type III (Homography): x′ = Hx
- Type IV (Fundamental matrix): l′ = Fx, i.e., the match of x must lie on the epipolar line l′ (equivalently, x′ᵀ F x = 0)

Framework Overview
- Interest point matching → alignment (homography or fundamental matrix) → separation of global vs. local changes → self-correlation and block matching → global change areas → local analysis with VA detectors and domain rules

Detected Change Areas
[Figure: example change areas between candidate duplicate pairs]

Object Detection
[Figure: vegetation and face detection within change areas]

Duplicate Image Database [collected & labeled by Kodak]
- 255 image pairs from 60 rolls of film, from 54 real consumers, labeled by 10 other people
  – Only obvious duplicates, very similar non-duplicates, and "borderline" duplicates (obvious non-duplicates not included)
  – Considerable subjectivity (100% agreement on only 43% of the pairs)
[Example pairs: obvious duplicate, borderline duplicate, similar non-duplicate, obvious non-duplicate]

Non-Identical Duplicates: Contributions & Impact
- New problem of non-identical duplicate consumer photographs
  – Novel taxonomy of different types of duplicates
  – New framework to detect non-identical duplicates
    • Multiple view geometry
    • Visual detectors
    • Domain knowledge
- Detailed analysis of the duplicate database
  – On the set with 100% labeler agreement (65 duplicates, 43 non-duplicates):
    • Precision: 64%
    • Recall: 97%
    • False positives: visually similar non-duplicate image pairs
    • Misses: significant visual variation but minor semantic variation
    • Hits: mostly in the no-change and subject-move categories
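The interest-point-matching and alignment front end of the framework above can be sketched with standard tools. The following OpenCV example is an assumption-laden illustration, not the thesis system: it only covers the homography case (x′ = Hx), and the thresholds and the inlier-ratio decision rule are invented for illustration; a real system would pass surviving pairs on to local change analysis.

```python
import cv2
import numpy as np


def duplicate_candidate(path1, path2, min_matches=20, inlier_ratio=0.6):
    img1 = cv2.imread(path1, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(path2, cv2.IMREAD_GRAYSCALE)

    # 1. Interest point matching (SIFT + ratio test).
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    if des1 is None or des2 is None:
        return False, None
    matches = []
    for pair in cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2):
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            matches.append(pair[0])
    if len(matches) < min_matches:
        return False, None

    # 2. Alignment: estimate a homography x' = Hx with RANSAC.
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return False, None

    # 3. Crude decision: many geometric inliers suggest the same scene from a
    # similar viewpoint; such a pair would then go to local analysis
    # (change areas, VA detectors, domain rules) in the framework above.
    ratio = float(mask.sum()) / len(matches)
    return ratio >= inlier_ratio, H
```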
Roadmap: Summary of Contributions, Conclusions & Future Work

Overview
[Diagram repeated: automatic indexing and organization of visual information through user input at multiple levels — Understanding Content & Users, Flexible Frameworks, Detectors in Practical Tasks]

Publications Summary (34 total)
- 1 book chapter (Wiley & Sons '02)
- 2 journal papers (IJIG, JASIS)
- 11 MPEG-7 proposals (includes one outside the thesis topics)
- 17 conference publications (includes work outside the thesis topics)
- 3 patents (filed)

Summary of Contributions
- Understanding Content & Users
  – Novel conceptual structure
    • Covers the full range of visual attributes
    • Guides the indexing process
    • Improves retrieval
  – First eye-tracking study across image categories
  – Link to automatic detectors
- Flexible Frameworks
  – User input at multiple levels
  – Multiple features and algorithms
  – Structured detectors
- Detectors in Practical Tasks
  – New problem, novel taxonomy, detailed analysis
  – New framework: multiple view geometry + visual detectors

Conclusions
- We addressed the problem of automatic indexing and organization of visual information through user input at multiple levels
  – Understanding of visual content and users
  – Flexible computational frameworks
  – Integration of generic detectors in solving practical tasks in a specific domain

Future Work I (Understanding Content & Users)
- Multi-Level Indexing Pyramid
  – Integration with STELLA for browsing
  – Automatic classification of descriptions
- Eye Tracking
  – Extend to more categories
  – Analyze scan-path information
  – Build a computational approach directly from eye-track data
  – Passive learning?
  – Does color make a difference?

Future Work II (Flexible Frameworks that Learn)
- The Visual Apprentice
  – Active learning
  – Expert knowledge integration (ontologies?)
  – Further classifier combination
  – Automatic collection of training data (RVS)
  – Integration with eye tracking

Future Work III (Detectors in Practical Tasks)
- STELLA (Story TELLing and Album Creation)
  – New browsing and visualization approaches
  – Transcoding (photographs anywhere, anytime)
  – Multimedia STELLA (not just photographs)
  – Storytelling
- Duplicate Detection
  – More efficient duplicate detection
  – Automatic model selection

Future Research Directions
- Personalization
  – Machine learning (Video Digests, ICIP '02)
  – Multiple-level feature integration
  – Learning from semi-passive interaction
- Semantic analysis
  – Knowledge repositories & reasoning (Context, SPIE '03)
  – Production-rule knowledge
  – Multi-modal analysis
- Interactive frameworks
  – Multi-level browsing
  – Multimedia ontologies (ICME '03)
  – "Smart" annotation

Other Projects
- TREC benchmark (IBM, TREC '02)
- Personalized video digests (IBM, ICIP '02)
- Learning color corrections (IBM, IS&T PICS '01)
- Descreening (IBM, SPIE '99)
- 3-D stereoscopic imaging (IBM, SPIE '99)
- MPEG-7 (Columbia)

Acknowledgements
- Prof. Shih-Fu Chang
- Dr. Nevenka Dimitrova, Prof. Alexandros Eleftheriadis, Prof. Dan Ellis, Prof. John R. Kender
- Ana B. Benitez
- Dr. Lawrence Bergman, Dr. Vittorio Castelli (IBM T. J. Watson)
- Dr. Corinne Jörgensen (Florida State University)
- Dr. Alexander C. Loui (Kodak, Rochester, NY)
- Dr. Jeff Pelz (RIT, Rochester, NY)
- Dr. Hawley Rising, Dr. Toby Walker (Sony)
- All my colleagues and friends at Columbia and IBM T. J. Watson Research Center, the anonymous reviewers, and conference & MPEG-7 colleagues

The End
THANK YOU FOR YOUR ATTENTION!