Analysis and Quality of Text and Images
Categorization of Generic Images
Marco Bressan, July 2008
Index
Intro
Image Categorization
 Problem statement
 Representation of Visual Objects
 Global vs. Local Descriptors
 Bag of Words
 Patch Detection
 Local Features
 Visual Codebook
 Supervised vs. Unsupervised Codebook
 Images as Continuous Distributions
 Fisher kernel
Tasks Related to Categorization
 Similarity, Retrieval, Clustering, Segmentation, …
Applications
Open Research Challenges
Bibliography
Research @ Xerox
- Xerox Research Centre Europe (Grenoble, France)
- Xerox Research Centre of Canada (Mississauga, Ontario, Canada)
- Fuji Xerox (Japan)
- Palo Alto Research Center (California, USA)
- Xerox Research Center Webster (New York, USA)
Research on Textual And Visual Pattern Analysis
Partner projects: U.S. Library of Congress, CACAO, Pascal 2, EERQI, ARTSTOR, SAPIR, SMART, Shaman, PINVIEW, Infom@gic, UAB, Omnia, SYNC3, Fragrance.
Core research: pattern classification (mostly text and images), computer vision, text mining and
retrieval, machine learning, image processing. Sample projects:
 handwritten word-spotting
 automatic content creation
 visual aesthetics and personalization
 eye-tracking for image understanding
 emotional visual categories
 multilingual categorization and retrieval
Image Categorization
Example labels: painting, oil, portrait, modernism, woman, Modigliani, …
Research Challenge: Vision and Object Recognition
Aristotle’s observation, “Of all senses, trust only the sense of sight”, is somewhat supported by nature: roughly 50% of the cerebral cortex is devoted to vision. Still…
- Vision is ambiguous: multiple scenes can result in the same image, and one scene can result in multiple images.
- Vision is contextual: the human brain is sensitive to contextual associations.
- Vision is inferential: our prior knowledge (learned or “wired”) determines what we see.
- The environment is complex: not accessible, continuous, dynamic, non-deterministic, …
Nevertheless, animals extract information from images, e.g. we are able to recognize
objects despite variations in size, shape, view, lighting, time, occlusion, clutter, …
Research Challenge: Vision + Classification
I stand at the window and see a house, trees, sky.
Theoretically I might say there were 327
brightnesses and nuances of colour.
Do I have "327"?
No. I have sky, house, and trees.
Max Wertheimer, 1923
Example: a frame of a football broadcast, annotated at multiple levels.
LO (low-level): at the bottom, the image is nothing but bits:
1111001010111000011100110111000010000100010000110010110001100100… (raw pixel data, truncated)
LO cues: color (teams), texture (stadium)
HI (high-level) cues: identities (players); text (temporal: teams, score); action (temporal: match situation); ads (sponsors)
And much more…
• sports event (football)
• Leicester City vs. F.C. Barcelona
• Javier Saviola has the ball
• Full stadium
• UEFA? Champions?
• Players wear certain brands
• Advertising in the background
• The score favours a team
• Barcelona is attacking
• Nike, UNICEF
• Anguish, excitement, celebration…
Research Challenge: Vision + Classification
L’ambitieux, Gilbert Garcin, 2003
Visual semantics involve multiple levels of abstraction: semantic,
aesthetic, contextual, social, emotional, …
Tasks related to Image Categorization
Image types: photographs, document images, paintings, drawings, maps, charts, tables, and domain-specific images (satellite, medical).
- Image Categorization: to assign one or more labels to a given image, based on its semantic content
- Object Recognition: to identify an instance of a given object
- Object Detection: to determine whether a member of a visual category is present in an image
- Image Similarity: to define a metric between images, e.g. for duplicate detection
- Content-Based Image Retrieval: given an image repository, to retrieve images relevant to a given query
- Unsupervised Image Categorization (Clustering): to group images by similarity
- Image Segmentation: to assign one or more labels to each image pixel, based on its semantic content
- Hybrid approaches: to include multiple modalities in the analysis
Generic Visual Categorization (GVC)
The pattern classification problem which consists in assigning one
or multiple labels to an image based on its semantic content
Example label sets: photo: beach, sky, clouds, vegetation, landscape, horizon, …; document: form, handwriting, letter, service contract; painting: oil, portrait, modernism, woman, Modigliani; map: legend, downtown.
Generic: The same flexible framework copes with various objects, subjects and scenes and with
various graphic arts (photography, painting, drawing, document images, etc.)
Scientific challenges: handle inherent category
variability as well as variations in size, shape,
view, lighting, occlusion, clutter, etc.
Representing Visual Objects
Approaches Based on Global Descriptors
 MPEG-7 SCD (Scalable Color Descriptor) + HTD (Homogeneous Texture Descriptor)
 Appearance or View-Based [1]: Having seen all possible appearances of an object, can
recognition be achieved by just efficiently remembering all of them?
 GIST [2]: With just a glance at a complex real-world scene, an observer can comprehend a
variety of perceptual and semantic information.
Approaches Based on Local Descriptors
 Region-Based

BlobWorld [3]: Segmentation into coherent regions that correspond to different objects.

Recognition as Translation [4]: Learning a lexicon for a fixed image vocabulary
 Part-Based

Constellations of Parts [5]: Object/scene model is spatial arrangement of a number of parts.
Recognition is based on inferring object class based on similarity in parts’ appearance and their
spatial arrangement.

Bag of Visual Words [6]
[1] S.K. Nayar, S.A. Nene, and H. Murase. Real-time 100 object recognition system. In ARPA96, pages 1223–1228, 1996.
[2] Oliva, A. Gist of the scene. In: Itti L, Rees G, Tsotsos J. , editors. Neurobiology of attention. San Diego, CA: Academic Press /Elsevier; pp. 251–257, 2005.
[3] C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, and J. Malik. Blobworld: A system for region-based image indexing and retrieval. In Proc. Int. Conf. Visual Inf. Sys., 1999
[4] P. Duygulu, K. Barnard, N. de Freitas, D. Forsyth, Object Recognition as Machine Translation: Learning a lexicon for a fixed image vocabulary, ECCV, Copenhagen, 2002.
[5] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2003.
[6] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Proc. of ECCV Workshop on Statistical Learning for Computer Vision, 2004.
Appearance-Based Object Recognition
Based on statistical modeling of the raw image data, it has shown good performance in a variety of problems, but it requires segmentation of the object, and its performance is very limited in complex scenes, e.g. in the presence of occlusion or cluttered backgrounds.
View-based modelling
Eigenspace
representation
GIST
80 million tiny images: a large dataset for non-parametric object and scene recognition.
Representation: PCA of a set of spatially averaged filter-bank outputs + adapted metrics.
Tasks: retrieval, semantic classification (at multiple semantic levels), segmentation, object detection, colorization, orientation detection.
A person will have processed ~1B images by the age of 3.
A. Torralba, R. Fergus and W. T. Freeman, “Tiny Images”, Computer Science and Artificial Intelligence Lab (CSAIL), MIT, 2007.
Region-based Approaches: Recognition as Translation
Learning a lexicon for a fixed image vocabulary.
Steps:
- min-cut segmentation
- blob features
- blob clustering → lexicon
- categorization as a statistical machine translation problem
Experiments: Corel dataset with 371 words and 4,500 images that have 5 to 10 segments each, with 33 features per segment. A total of ~35,000 segments yields a 500-blob lexicon.
Constellations of Parts
Bouchard, G.; Triggs, B.; “Hierarchical Part-Based Visual Object Categorization”, CVPR, 2005.
H is the set of valid allocations of features to the parts, with locations X, scales S, and appearances A. |H| is O(N^P), where N is the total number of features and P is the number of parts (typically < 7).
Bag of Visual Words (BoW)
BoW: Motivation from Text Mining
Orderless document representation: frequencies of words from a dictionary (Salton & McGill, 1983).
Example: US Presidential Speeches Tag Cloud, http://chir.ag/phernalia/preztags/
BoW : Motivation from Texture Recognition
Texture is characterized by the repetition of basic elements or textons
For stochastic textures, it is the identity of the textons, not their spatial
arrangement, that matters
Julesz, 1981; Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001; Schmid 2001; Varma & Zisserman, 2002,
2003; Lazebnik, Schmid & Ponce, 2003
BoW: Motivation from Texture Recognition
Figure: each texture image is represented as a histogram over a universal texton dictionary.
BoW: Strict Definition
1. Traditional Bag-of-Words (figure: objects OBJ1 and OBJ2 are each described by counts over a UNIVERSAL SET of words).
The four steps:
1. Patch detection
2. Feature extraction
3. Learning the codebook
4. Histogram computation
BoW: Pipeline
patch detection → feature extraction → vector quantization → histogram computation → classification
(representation, then categorization: the image becomes a fixed-length vector x = [+0.1, −1.5, …, −0.5])
Components:
- Patch detection: identify where to extract information in the image
- Feature extraction: compute local features
- Vector quantization: map local features to visual words
- Histogram computation: build a global representation of the image
- Classification and learning:
  - discriminative model: support vector machine
  - generative model: naïve Bayes, hierarchical Bayesian models
BoW: Patch detection, where is the information?
- Interest point detectors: Harris (corner detector), Laplacian (round-shape detector); unstable, and they ignore uniform regions
- Blob detector: partition of the image into uniform regions (generally based on color); little semantic meaning
- Regular grid: extract information on regular grids (at multiple scales); typically ≈ 500 patches / image
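The regular-grid option is simple enough to sketch. `grid_patches` below is a hypothetical helper (not the detector used in the slides) that uses crude integer subsampling to approximate multiple scales.

```python
import numpy as np

def grid_patches(img, patch=32, step=16, scales=(1.0, 0.5)):
    """Extract square patches on a regular grid at multiple scales.

    Returns a list of (scale, y, x, patch_array) tuples.
    """
    out = []
    for s in scales:
        # Crude rescale by integer subsampling; real code would interpolate.
        sub = img[::int(round(1 / s)), ::int(round(1 / s))]
        H, W = sub.shape[:2]
        for y in range(0, H - patch + 1, step):
            for x in range(0, W - patch + 1, step):
                out.append((s, y, x, sub[y:y + patch, x:x + patch]))
    return out

img = np.zeros((128, 128))
patches = grid_patches(img)
print(len(patches))  # 49 patches at scale 1.0 + 9 at scale 0.5 = 58
```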
BoW: Patch detection for image registration
Andrew Zisserman, “Probabilistic Models of Visual Object Categories”, AERFAI tutorial, 2006
BoW: Patch detection for image retrieval
Andrew Zisserman, “Probabilistic Models of Visual Object Categories”, AERFAI tutorial, 2006
Microsoft Photosynth
BoW: Local features, what to extract?
Split each patch into sub-regions (typically 4×4) and accumulate local statistics:
Gray-level features (the SIFT descriptor):
- compute the dominant orientation at each pixel and discretize it (8 bins)
- accumulate orientation histograms over the sub-regions
- rotation, scale, and illumination invariant
- 128 dimensions
Color features:
- compute simple RGB statistics over the sub-regions (mean + standard deviation)
- 96 dimensions
PCA for dimensionality reduction:
- reduces computational cost and noise
- D = 50 dimensions
K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. In Proc. IEEE CVPR, June 2003.
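The 96-dimensional color feature and the PCA step can be illustrated directly. Random patches stand in for real image patches here, and scikit-learn's PCA is assumed as the reducer; the 4×4 grid and dimensions follow the slide.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def color_descriptor(patch_rgb, grid=4):
    """96-dim color feature: mean and std of R, G, B per 4x4 sub-region."""
    H, W, _ = patch_rgb.shape
    hs, ws = H // grid, W // grid
    feats = []
    for i in range(grid):
        for j in range(grid):
            sub = patch_rgb[i*hs:(i+1)*hs, j*ws:(j+1)*ws].reshape(-1, 3)
            feats.extend(sub.mean(axis=0))   # 3 means
            feats.extend(sub.std(axis=0))    # 3 standard deviations
    return np.array(feats)                   # 16 regions x 6 stats = 96 dims

descs = np.stack([color_descriptor(rng.random((32, 32, 3))) for _ in range(200)])
reduced = PCA(n_components=50).fit_transform(descs)  # D = 50 after PCA
print(descs.shape, reduced.shape)
```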
BoW: What is the visual codebook?
Given a set of training images, cluster the low-level feature vectors to estimate a visual codebook.
Alternatives: k-means, mean-shift, Gaussian mixture model (GMM).
With a GMM, each Gaussian is a visual word of the visual codebook:
- the mixture weight represents the frequency of the visual word
- the mean vector represents the average of the word
- the covariance matrix models how much the visual word varies
Typically, N is on the order of a hundred Gaussians / words.
Trade-off for the estimation of visual vocabularies:
- unsupervised approach [SZ03]: universal but not discriminative
- supervised approach [MT06]: discriminative but not universal
[SZ03] J. Sivic and A. Zisserman, Video Google: a text retrieval approach to object matching in videos, ICCV, 2003.
[MT06] F. Moosmann, B. Triggs and F. Jurie, Randomized clustering forests for building fast and discriminative visual vocabularies, NIPS, 2006.
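A GMM codebook and the resulting soft occupancy histogram can be sketched with scikit-learn. The synthetic features and the small component count (far below a realistic vocabulary) are illustration choices, not values from the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 50))            # low-level features, D = 50

# Each Gaussian of the mixture is one visual word (diagonal covariances)
gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(train)

# Soft BoW histogram of one image: average of the occupancy probabilities
image_feats = rng.normal(size=(500, 50))
occupancy = gmm.predict_proba(image_feats)     # gamma_t(i), shape (500, 8)
soft_hist = occupancy.mean(axis=0)             # sums to 1
print(soft_hist.shape)  # (8,)
```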
BoW: What is the visual codebook?
Analogy between the visual codebook and human vision (Olshausen and Field, 2004; Fei-Fei and Perona, 2005).
BoW: Universal and adapted vocabularies
Figure: a universal vocabulary (eye, ear, tail) is estimated by MLE; class vocabularies are derived from it by MAP adaptation: a cat vocabulary (cat’s eye, cat’s ear, cat’s tail) and a dog vocabulary (dog’s eye, dog’s ear, dog’s tail).
F. Perronnin, “Universal and Adapted Vocabularies for Generic Visual Categorization”, IEEE Trans. on PAMI, 2007.
BoW: Universal and adapted vocabularies
Given the training samples:
- MLE (universal vocabulary): maximize the likelihood p(X | λ), estimated with EM (E-step: compute occupancy probabilities; M-step: re-estimate weights, means and covariances).
- MAP (adapted vocabularies): maximize the posterior p(λ | X) ∝ p(X | λ) p(λ), also estimated with EM; the M-step interpolates between the statistics of the class data and the universal (prior) vocabulary.
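Written out, the updates labelled on the slide take the following standard form. This is a sketch: the relevance factor τ and the exact conjugate prior are assumptions (the slide's own equations are not in the transcript); the superscript u denotes universal-vocabulary parameters.

```latex
% E-step (shared by MLE and MAP): occupancy probabilities
\gamma_t(i) = \frac{w_i \, p_i(x_t \mid \lambda)}{\sum_{j=1}^{N} w_j \, p_j(x_t \mid \lambda)},
\qquad n_i = \sum_{t=1}^{T} \gamma_t(i)

% MLE M-step (universal vocabulary)
\hat{w}_i = \frac{n_i}{T}, \qquad
\hat{\mu}_i = \frac{1}{n_i}\sum_{t} \gamma_t(i)\, x_t, \qquad
\hat{\sigma}_i^2 = \frac{1}{n_i}\sum_{t} \gamma_t(i)\, x_t^2 - \hat{\mu}_i^2

% MAP M-step (class vocabulary, adapted from the universal parameters)
\hat{\mu}_i = \frac{\sum_{t} \gamma_t(i)\, x_t + \tau\, \mu_i^{u}}{n_i + \tau}
% (weights and variances are interpolated analogously with the prior)
```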
BoW: Bipartite histograms
For each image, compute one histogram per category on the combined vocabularies (e.g. cat vocabulary + universal vocabulary, dog vocabulary + universal vocabulary).
This separates relevant information from irrelevant information. If an image belongs to a given category:
- it is more likely to be best described by the adapted vocabulary of that category than by the universal vocabulary;
- it is more likely to be best described by the universal vocabulary than by the adapted vocabularies of the other categories.
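A bipartite histogram can be sketched as occupancy probabilities computed jointly over a class vocabulary and the universal vocabulary. Tiny random GMMs stand in for actually trained vocabularies, and `weighted_likelihoods` is a hypothetical helper, not API from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Stand-ins for a trained universal vocabulary and one class-adapted vocabulary
universal = GaussianMixture(4, covariance_type="diag", random_state=0)
cat_vocab = GaussianMixture(4, covariance_type="diag", random_state=0)
universal.fit(rng.normal(size=(500, 8)))
cat_vocab.fit(rng.normal(loc=0.5, size=(500, 8)))

def weighted_likelihoods(gmm, feats):
    # w_i * p(x | word i) for each component of a diagonal GMM
    cols = [w * multivariate_normal.pdf(feats, mean=mu, cov=np.diag(var))
            for w, mu, var in zip(gmm.weights_, gmm.means_, gmm.covariances_)]
    return np.column_stack(cols)

def bipartite_histogram(feats, class_gmm, universal_gmm):
    """Occupancies over the combined (class + universal) word set."""
    lik = np.hstack([weighted_likelihoods(class_gmm, feats),
                     weighted_likelihoods(universal_gmm, feats)])
    occ = lik / lik.sum(axis=1, keepdims=True)   # posterior over 2N words
    return occ.mean(axis=0)

h_cat = bipartite_histogram(rng.normal(size=(200, 8)), cat_vocab, universal)
print(h_cat.shape)  # (8,): 4 class words + 4 universal words
```

Mass concentrated on the first half of `h_cat` would indicate the image is better described by the class vocabulary than by the universal one.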
BoW: Bipartite histograms, examples
Figures: patch-relevance maps with respect to the categories boat, clouds & sky, and flowers.
BoW: Bipartite histograms, performance
PASCAL Visual Object Classes challenge 2006:
- 10 objects: bicycle, bus, car, cat, cow, dog, horse, motorbike, person, sheep
- 2,618 training images + 2,610 test images
In-house database:
- 19 objects and scenes: beach, bicycling, birds, boating, cats, clouds & sky, desert, dogs, flowers, golf, motor sports, mountains, people, sunrise & sunset, surfing, underwater, urban, waterfalls, winter sports
- 30,000 training images + 4,800 test images (collected independently)
Processing time: 170 ms / image.
BoW: Images as Continuous Distributions
Instead of modelling all images with a single vocabulary, model each image with its own (much smaller) vocabulary [FS05, ZM05]:
patch detection → feature extraction → distribution estimation → classification
Image representations are discriminative and universal.
Kernels on distributions (Kullback-Leibler divergence, Probability Product Kernel, Earth Mover’s Distance) are costly.
[FS05] J. Farquhar, S. Szedmak, H. Meng and J. Shawe-Taylor, Improving bags-of-keypoints image categorization, Tech. report, University of Southampton, 2005.
[ZM05] J. Zhang, M. Marszalek, S. Lazebnik and C. Schmid, Local features and kernels for classification of texture and object categories: an in-depth study, Tech. report, INRIA, 2005.
Images as Continuous Distributions
2. Vector sets as GMMs (figure: OBJ1 and OBJ2 each modelled by their own GMM).
3. Proposed approach (figure: per-object GMMs adapted from a UNIVERSAL SET).
BoW: The Fisher kernel
Idea [JH99]: given a generative model with parameters λ, compute a fixed-length representation of the vector set X using the gradient vector ∇_λ log p(X | λ).
Use the Fisher information matrix F_λ to measure the similarity between gradients:
K(X, Y) = (∇_λ log p(X|λ))ᵀ F_λ⁻¹ (∇_λ log p(Y|λ))
This is equivalent to normalizing the gradient vectors by F_λ^(−1/2).
Fisher representation: given a visual vocabulary, estimate how the visual words should be modified to best model the image.
[JH99] T. Jaakkola and D. Haussler, Exploiting generative models in discriminative classifiers, NIPS, 1999.
BoW: The Fisher kernel
The gradient with respect to the mixture weights is closely related to the histogram of visual-word occurrences, so the Fisher representation extends the traditional histogram of occurrences.
It results in very high-dimensional representations, even with small visual vocabularies (typically N = 128):
- N dimensions for the mixture weights
- N × D for the mean vectors
- N × D for the diagonal covariance matrices
= a (2D + 1) × N dimensional vector
Classification accuracy is similar to the histogram representation but at a fraction of the cost (20 ms / image).
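The gradient representation and its (2D + 1) × N dimensionality can be checked with a small sketch. This computes unnormalized gradients only (the Fisher-information normalization is omitted), and the weight gradient uses one common parametrization rather than necessarily the one on the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
N, D = 8, 5                                   # visual words, feature dim
gmm = GaussianMixture(N, covariance_type="diag", random_state=0)
gmm.fit(rng.normal(size=(1000, D)))

def fisher_vector(feats, gmm):
    """Unnormalized gradient of log p(X|lambda) w.r.t. weights, means and
    standard deviations of a diagonal GMM: a (2D + 1) * N dim vector."""
    gamma = gmm.predict_proba(feats)                   # occupancies (T, N)
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    sigma = np.sqrt(var)
    d_w = (gamma - w).sum(axis=0)                      # (N,)
    diff = feats[:, None, :] - mu[None, :, :]          # (T, N, D)
    d_mu = np.einsum('tn,tnd->nd', gamma, diff / var)  # (N, D)
    d_sigma = np.einsum('tn,tnd->nd', gamma,
                        diff**2 / sigma**3 - 1.0 / sigma)
    return np.concatenate([d_w, d_mu.ravel(), d_sigma.ravel()])

fv = fisher_vector(rng.normal(size=(300, D)), gmm)
print(fv.shape)  # ((2*D + 1) * N,) = (88,)
```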
Experimental validation
PASCAL Visual Object Classes challenge 2006:
- 10 classes: bicycle, bus, car, cat, cow, dog, horse, motorbike, person, sheep
- 2,618 training + 2,610 test images
- performance measured with Area Under the Curve (AUC)
Experimental validation
PASCAL Visual Object Classes challenge 2007:
- 20 classes: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, tv monitor
- 5,011 training images + 4,952 test images
- performance measured with Average Precision (AP)
Cross-dataset generalization: training on VOC 2007 data, testing on VOC 2006 data.
Related Tasks
Research Challenge: Classification
The Dewey Decimal Classification (DDC) system is a general
knowledge organization tool conceived by Melvil Dewey in 1873.
000 Computer science, information & general works
100 Philosophy & psychology
200 Religion
300 Social sciences
400 Language
500 Science
600 Technology
700 Arts & recreation
800 Literature
900 History & geography
201 Philosophy of Christianity
202 Miscellany of Christianity
203 Dictionaries of Christianity
204 Special topics
205 Serial publications of Christianity
206 Organizations of Christianity
207 Education, research in Christianity
208 Kinds of persons in Christianity
209 History & geography of Christianity
210 Natural theology
211 Concepts of God
212 Existence, attributes of God
213 Creation
214 Theodicy
215 Science & religion
216 Good & evil
217 Not assigned or no longer used
218 Humankind
219 Not assigned or no longer used
220 Bible
221 Old Testament
222 Historical books of Old Testament
223 Poetic books of Old Testament
224 Prophetic books of Old Testament
225 New Testament
226 Gospels & Acts
227 Epistles
228 Revelation (Apocalypse)
229 Apocrypha & pseudepigrapha
230 Christian theology
231 God
232 Jesus Christ & his family
233 Humankind
234 Salvation (Soteriology) & grace
235 Spiritual beings
236 Eschatology
237 Not assigned or no longer used
238 Creeds & catechisms
239 Apologetics & polemics
240 Christian moral & devotional
theology
241 Moral theology
242 Devotional literature
243 Evangelistic writings for individuals
244 Not assigned or no longer used
245 Texts of hymns
246 Use of art in Christianity
247 Church furnishings & articles
248 Christian experience, practice, life
249 Christian observances in family life
250 Christian orders & local church
251 Preaching (Homiletics)
252 Texts of sermons
253 Pastoral office (Pastoral theology)
254 Parish government & administration
255 Religious congregations & orders
256 Not assigned or no longer used
257 Not assigned or no longer used
258 Not assigned or no longer used
259 Activities of the local church
260 Christian social theology
261 Social theology
262 Ecclesiology
263 Times, places of religious observance
264 Public worship
265 Sacraments, other rites & acts
266 Missions
267 Associations for religious work
268 Religious education
269 Spiritual renewal
270 Christian church history
271 Religious orders in church history
272 Persecutions in church history
273 Heresies in church history
274 Christian church in Europe
275 Christian church in Asia
276 Christian church in Africa
277 Christian church in North America
278 Christian church in South America
279 Christian church in other areas
280 Christian denominations & sects
281 Early church & Eastern churches
282 Roman Catholic Church
283 Anglican churches
284 Protestants of Continental origin
285 Presbyterian, Reformed, Congregational
286 Baptist, Disciples of Christ, Adventist
287 Methodist & related churches
288 Not assigned or no longer used
289 Other denominations & sects
290 Other & comparative religions
291 Comparative religion
292 Classical (Greek & Roman) religion
293 Germanic religion
294 Religions of Indic origin
295 Zoroastrianism (Mazdaism, Parseeism)
296 Judaism
297 Islam & religions originating in it
298 Not assigned or no longer used
299 Other religions
Research Challenge: Classification
In the physical domain, everything has to go someplace; it can go in only one place; and two things cannot go in the same place.
Research Challenge: Classification
Categorization is the process in which ideas and objects are recognized, differentiated and understood. Categorization is fundamental in language, prediction, inference, decision making and in all kinds of interaction with the environment. (Wikipedia)
Physical World vs. Digital World:
- An object can hang from only one branch. / An object can be classified in many (hundreds of) categories.
- Design needs to be defined ahead of time. / Organic growth.
- The owner of the information controls the organization. / Users can control the organization of information owned by others.
- Multiple people can use one tree. / There can be a different tree for each person.
- Ambiguity is a problem. / Ambiguity is an advantage.
Example digital categories for a single product: Computer Equipment, Photographic Equipment, Active Lifestyles, Sporting Goods, Graduation Gifts, New Arrivals, Sale Items, Travel Equipment, Canon Products.
Research Challenge: Classification
“Folksonomies are different in important ways from top-down, hierarchical taxonomies… The
old way creates a tree. The new rakes leaves together.”
David Weinberger
Image Similarity and Retrieval
A similarity measure based on the Fisher representation is robust to scanning variations, resolution, compression, cropping, image edits, etc.
(chart: retrieval precision of text-only, image-only, and hybrid similarity; Xerox vs. average)
Hybrid Similarity and Retrieval
Hybrid clustering (text + images) example: a flowers cluster splits into flowers from « Pyrénées » and flowers from « Vercors ».
InfoM@gic: Hybrid Retrieval
- Retrieve documents from a cross-media database (images with text) given one or more query images and/or textual queries.
- Results are ranked using cross-media similarity measures.
Clustering
Fully automated grouping of large image repositories; grouping criteria can be defined through the universal vocabulary.
Actual clusters discovered by our tool from 120,000 random images from a photofinishing workflow. Also applied successfully to document images (NIST database).
Example painter clusters: Edward Hopper (1882-1967), Johannes Vermeer (1632-1675).
Generic Visual Segmentation
LocBov, sheep: “The Sheepness Map” (figure: input image and per-pixel sheep-probability map).
Urban Structure Example
Combination of low-level feature-based segmentation with class-probability maps (figures: classifier combined map and Fisher-kernel texture map for the UrbanStructure class).
Sample apps: image asset visualization, delivery to mobile phone, visualization on the printer screen.
Generic Visual Segmentation
Summarization from the sheep category probability map: automatic thumbnailing (figures).
Summarization from the drawing probability map: repurpose an image for reflow (figures).
Applications
Applications: Class-based Image Enhancement (CBIE)
Content understanding enables superior quality:
- “Snow looks dirty in all my skiing photos”
- “Look at all those details in that building”
- “I like the colour in those flowers”
Based on Generic Visual Categorization (GVC) and Clustering:
- GVC: the same flexible framework copes with various objects, subjects and scenes and with various graphic arts (photography, painting, drawing, document images, etc.)
- Clustering: discovery of similar types of semantic content
Validated through user-preference evaluations. The semantic aspect is key to a future personalized offering.
Class taxonomy:
- PEOPLE: no faces, portraits, groups, crowds
- TIME & LOC: outdoors, indoors, night / flash, day, seasonal
- ENVIRONMENT: urban, nature, clouds, underwater, sunrise/sunset, sky, flowers, fog, snow
- TYPE: photograph, poster, drawing, paintings, doc objects…
- INTENT: professional, personal, effects
M. Bressan, G. Csurka and S. Favre, “Towards Intent Dependent Image Enhancement: State-of-the-art and Recent Attempts”, In
Proc VISAPP, March 2007.
Applications
- Robotics, e.g. assisted driving
- Medical & satellite imaging
- Security, e.g. biometrics, tracking
- Entertainment
- Visual inspection & quality control
- OCR/ICR
- Management of multimedia assets: indexing, storage, retrieval, visualization
- Augmented reality: human-computer interfaces, context-aware computing, wearable devices
- Knowledge inference / knowledge creation
Applications: Inference from Large Databases
Inference of geographic information from a single image; land cover estimation.
Other applications: urban vs. rural (via light pollution), population density estimation, elevation gradient estimation.
James Hays and Alexei A. Efros, “IM2GPS: estimating geographic information from a single image”, CMU, CVPR 2008.
Applications: Semi-supervised Hybrid Content Generation
Our plans to hit Copacabana beach the next
day and check out hot Brazilian girls in
skimpy bikinis were ruined by the weather. It
rained all day! Can you believe that. I think
we'll be heading to another place mid-week
for some beach time.
There is a lot of tourists there from around ten
until three, but it didn’t feel as crowded as we’d
feared. We started there for 12 hours- saw the
sunrise and sunset, and walked the citadel twice. It
is an awesome site in the proper sense of the word
(Yanks take note). Bloody magic. Some
archeologists reckon that Machu Picchu could
have predated the Inca but that they did a lot of
improvements.
Marco Bressan, Gabriela Csurka, Yves Hoppenot and Jean-Michel Renders, “Travel Blog Assistant System (TBAS)”, Metadata Mining for Image Understanding Workshop, 2008.
Today had another wander around the old
town and went into a number of the great
churches. On the way around some of them
noticed a parade of monks and nuns singing
and carrying statues of Mary and Jesus
before entering the Cathedral - was nice to
watch.
Applications: Document Images
NIST: database of simulated US tax forms:
- 20 form types
- 5,590 images
Evaluation protocol (repeated 10 times):
- 10 images / class for training
- the rest is used for testing
0% error rate (the best previously reported was 0.2%).
DEMO
Applications: Document Images: Demo
GVC Model UI screenshots
GVC modeling features, performance analytics, customized image list view, online content sources.
Open Research Challenges
Improve BoW:
- patch detection
- improved viewpoint invariance: scale, similarity, affine invariance
- gaze models
- efficient discriminative models
- taking shape into account
- low-level fusion of hybrid content models
Beyond BoW: taking into account the structure of objects
- structure models: tight parametric model (e.g. complete Gaussian) or loose model (e.g. pairwise relations)
Improved classification models:
- multiple levels of abstraction
- transfer learning
- learning for structured output, hierarchical class models
Learning:
- learn from ‘contaminated’ data sets: noisy, unlabelled, weakly labelled data
- reduction of training requirements: active, semi-supervised and unsupervised learning
Retrieval:
- efficient coding for large-scale indexing and retrieval
- visualization of query results
Useful References
Recommended Tutorials
- Frédéric Jurie, “Vision par ordinateur et catégorisation d’images”, CNRS, Projet LEAR, INRIA Rhône-Alpes, September 2006. http://www-poleia.lip6.fr/~cord/isis/jurie.pdf
- Andrew Zisserman, “Probabilistic models of visual object categories”, Visual Geometry Group, University of Oxford, 2006. http://www.robots.ox.ac.uk/~vgg
- Li Fei-Fei, “Bag of Words”, CVPR 2007 tutorial, Princeton University, 2007. http://vision.cs.princeton.edu/documents/CVPR2007_tutorial_bag_of_words.ppt
Conclusions and Questions
Fisher kernel on visual vocabularies
Notations
Let X = {x_t, t = 1…T} be the set of local features of an image and λ the parameters of the GMM visual vocabulary.
Log-likelihood function: L(X|λ) = Σ_t log p(x_t|λ)
Modeling the visual vocabulary with a GMM: p(x_t|λ) = Σ_{i=1}^{N} w_i p_i(x_t|λ)
Occupancy probability: γ_t(i) = w_i p_i(x_t|λ) / Σ_j w_j p_j(x_t|λ)
Fisher kernel on visual vocabularies
Gradient formulae (1/2)
Formulae for the partial derivatives (diagonal covariances, dimension d; the weight constraint Σ_i w_i = 1 is enforced through w_1):
∂L/∂w_i = Σ_t [ γ_t(i)/w_i − γ_t(1)/w_1 ]  (i ≥ 2)
∂L/∂μ_i^d = Σ_t γ_t(i) (x_t^d − μ_i^d) / (σ_i^d)²
∂L/∂σ_i^d = Σ_t γ_t(i) [ (x_t^d − μ_i^d)² / (σ_i^d)³ − 1/σ_i^d ]
BoV: the traditional bag-of-visual-words histogram corresponds to the weight-gradient part alone.
Gradient size = (2×D + 1)×N − 1, compared to histogram size = N.
Fisher kernel on visual vocabularies
Gradient formulae (2/2)
Introducing the MLE formulae leads to compact expressions of the gradient in terms of the sufficient statistics Σ_t γ_t(i), Σ_t γ_t(i) x_t and Σ_t γ_t(i) x_t².
Fisher kernel on visual vocabularies
Fisher information matrix
Assumption: at a given time t, the occupancy distribution γ_t is sharply peaked, i.e. a single Gaussian contributes significantly.
Under this approximation, F is diagonal, which amounts to a component-wise normalization of the dynamic range.
Formulae (diagonal approximation):
F_{w_i} ≈ T (1/w_i + 1/w_1), F_{μ_i^d} ≈ T w_i / (σ_i^d)², F_{σ_i^d} ≈ 2 T w_i / (σ_i^d)²