Slides - Image, Video and Multimedia Systems Lab

Transcription

Slides - Image, Video and Multimedia Systems Lab
Naive Bayes Nearest Neighbor Classifiers
Christos Varytimidis
Image, Video and Multimedia Systems Laboratory
National Technical University of Athens
January 2011
Outline
Irani - In Defence of Nearest-Neighbor Based Image Classification
Wang - Image-to-Class Distance Metric Learning for Image
Classification
Behmo - Towards Optimal Naive Bayes Nearest Neighbor
Outline
Irani - In Defence of Nearest-Neighbor Based Image Classification
Wang - Image-to-Class Distance Metric Learning for Image
Classification
Behmo - Towards Optimal Naive Bayes Nearest Neighbor
Irani - In Defence of Nearest-Neighbor Based Image
Classification
rds”), for obtaining compact
ization gives rise to a signifi, but also to significant degraower of descriptors. Such dintial for many learning-based
l tractability, and for avoid(a)
(b)
(c)
(d)
Figure 1. Effects
of descriptor quantization
Informative
des unnecessary and especially Boiman,
Shechtman,
Irani – –CVPR
2008
scriptors have low database frequency, leading to high quanametric classification, that has
tization
error.
(a)
An
image
from
the
Face
class
in
CalInof Defence
Nearest-Neighbor
Based Image Classification
te for this loss
information. oftech101.
(b) Quantization error of densely computed image descriptors (SIFT) using a large codebook (size 6, 000) of Calteche is essential to Kernel meth101 (generated using [14]). Red = high error; Blue = low error.
n NN-Image classifiers, it proThe most informative descriptors (eye, nose, etc.) have the highest
n only when the query image
quantization error. (c) Green marks the 8% of the descriptors in
the image that are most frequent in the database (simple edges).
ase images, but does not gen(d) Magenta marks the 8% of the descriptors in the image that are
led images. This limitation is
least frequent in the database (mostly facial features).
with large diversity.
e a remarkably simple nonthe data (typically hundreds of thousands of descriptors exer, which requires no descriptracted from the training images), is quantized to a rather
a direct “Image-to-Class” dissmall codebook (typically into 200 − 1000 representative
1
he Naive-Bayes assumption ,
descriptors). Lazebnik et al. [16] further proposed to add
especially severe for classes with large diversity.
In this paper we propose a remarkably simple nonparametric NN-based classifier, which requires no descriptor quantization, and employs a direct “Image-to-Class” distance. We show that under the Naive-Bayes assumption1 ,
the theoretically optimal image classifier can be accurately
approximated by this simple algorithm. For brevity, we refer to this classifier as “NBNN”, which stands for “NaiveBayes Nearest-Neighbor”.
NBNN is embarrassingly simple: Given a query image, compute all its local image descriptors d 1 , ..., dn .
Search
for the class C which minimizes the sum
n
2
d
(where NNC (di ) is the NNi − NNC (di ) i=1
descriptor of d i in class C). Although NBNN is extremely
simple and requires no learning/training, its performance
ranks among the top leading learning-based image classifiers. Empirical comparisons are shown on several challenging databases (Caltech-101,Caltech-256 and Graz-01).
The paper is organized as follows: Sec. 2 discusses the
causes for the inferior performance of standard NN-based
image classifiers. Sec. 3 provides the probabilistic formulation and the derivation of the optimal Naive-Bayes image
classifier. In Sec. 4 we show how the optimal Naive-Bayes
classifier can be accurately approximated with a very simple NN-based classifier (NBNN). Finally, Sec. 5 provides
empirical evaluation and comparison to other methods.
the data (typically
Naive Bayes Nearest-Neighbor Classifier
tracted from the t
small codebook (
descriptors). Laze
rough quantized lo
resentation. Such
are necessary for
image classificatio
were also used i
compared to in [27
However, the si
tized codebook re
will be shown nex
tion is considerab
Learning-based al
information loss b
classification resu
ple non-parametri
phase to “undo” th
It is well known
quantization error
tization error. Ho
a large database o
simple edges and
classes within the
ng, its performance
based image classiwn on several chal-256 and Graz-01).
Sec. 2 discusses the
standard NN-based
robabilistic formuNaive-Bayes image
ptimal Naive-Bayes
ed with a very simly, Sec. 5 provides
other methods.
assification?
mage classification
erformance of non-
rametric classifiers
d to generate codeng compact image
ms of quantized decriptors taken from
are i.i.d. given image class.
will be shown next, the amount of discriminative information is considerably reduced due to the rough quantization.
Learning-based algorithms can compensate for some of this
information loss by their learning phase, leading to good
classification results. This, however, is not the case for simple non-parametric algorithms, since they have no training
phase to “undo” the quantization damage.
It is well known that highly frequent descriptors have low
quantization error, while rare descriptors have high quantization error. However, the most frequent descriptors in
a large database of images (e.g., Caltech-101) comprise of
simple edges and corners that appear abundantly in all the
classes within the database, and therefore are least informative for classification (provide very low class discriminativity). In contrast, the most informative descriptors for classification are the ones found in one (or few) class, but are
rare in other classes. These discriminative descriptors tend
to be rare in the database, hence get high quantization error.
This problem is exemplified in Fig. 1 on a face image from
Caltech-101, even when using a relatively large codebook
of quantized descriptors.
As noted before [14, 26], when densely sampled image descriptors are divided into fine bins, the bin-density
follows a power-law (also known as long-tail or heavy-tail
distributions). This implies that most descriptors are infrequent (i.e., found in low-density regions in the descriptor
Quantization Error
btaining compact
s rise to a signifisignificant degracriptors. Such diny learning-based
y, and for avoidary and especially
sification, that has
oss of information.
al to Kernel methe classifiers, it pron the query image
but does not genThis limitation is
versity.
ably simple nonquires no descripage-to-Class” disayes assumption1 ,
can be accurately
Quantization Error
(a)
(b)
(c)
(d)
Figure 1. Effects of descriptor quantization – Informative descriptors have low database frequency, leading to high quantization error.
(a) An image from the Face class in Caltech101. (b) Quantization error of densely computed image descriptors (SIFT) using a large codebook (size 6, 000) of Caltech101 (generated using [14]). Red = high error; Blue = low error.
The most informative descriptors (eye, nose, etc.) have the highest
quantization error. (c) Green marks the 8% of the descriptors in
the image that are most frequent in the database (simple edges).
(d) Magenta marks the 8% of the descriptors in the image that are
least frequent in the database (mostly facial features).
the data (typically hundreds of thousands of descriptors extracted from the training images), is quantized to a rather
small codebook (typically into 200 − 1000 representative
descriptors). Lazebnik et al. [16] further proposed to add
rough quantized location information to the histogram rep-
space), therefore rather isolated. In other words, there are
Effects
of descriptor
Quantization
almost no ‘clusters’
in the descriptor space.
Consequently,
any clustering to a small number of clusters (even thousands) will inevitably incur a very high quantization error
in most database descriptors. Thus, such long-tail descriptor distribution is inherently inappropriate for quantization.
High quantization error leads to a drop in the discriminative power of descriptors. Moreover, the more informative (discriminative) a descriptor is, the more severe the
degradation in its discriminativity. This is shown quantitatively in Fig. 2. The graph provides an evidence to
the severe drop in the discriminativity (informativeness) of
the (SIFT) descriptors in Caltech-101 as result of quantization. The descriptor discriminativity measure of [2, 26]
was used: p(d|C)/p(d|C), which measures how well a descriptor d discriminates between its class C and all other
classes C. We compare the average discriminativity of
all descriptors in all Caltech-101 classes after quantization:
p(dquant |C)/p(dquant |C), to their discriminativity before
quantization.
Alternative methods have been proposed for generating
compact codebooks via informative feature selection [26,
2]. These approaches, however, discard all but a small set
of highly discriminative descriptors/features. In particular,
they discard all descriptors with low-discriminativity. Although individually such descriptors offer little discrimina-
Figure 2. Effects of d
descriptor discrimin
of descriptor discrimi
(for a very large samp
each for its respective
along the y-axis. Thi
after quantization” (th
scale in both axes. NO
a descriptor d is, the l
2.2. Image-to-Im
In this section
tance, which is fund
RVM), significantly
non-parametric ima
belled (‘training’) im
NN-image classi
Effects of descriptor Quantization
her words, there are
pace. Consequently,
clusters (even thouh quantization error
h long-tail descripate for quantization.
rop in the discrimi, the more informahe more severe the
his is shown quandes an evidence to
informativeness) of
as result of quantimeasure of [2, 26]
ures how well a deass C and all other
discriminativity of
s after quantization:
criminativity before
osed for generating
Figure 2. Effects of descriptor quantization – Severe drop in
descriptor discriminative power. We generated a scatter plot
of descriptor discriminative power before and after quantization
(for a very large sample set of SIFT descriptors d in Caltech-101,
each for its respective class C). We then averaged this scatter plot
along the y-axis. This yields the “Average discriminative power
after quantization” (the RED graph). The display is in logarithmic
scale in both axes. NOTE: The more informative (discriminative)
a descriptor d is, the larger the drop in its discriminative power.
2.2. Image-to-Image vs. Image-to-Class Distance
In this section we argue that “Image-to-Image” dis-
Image-to-Class Distance
p(Q|C) =
Taking the log proba
Ĉ = arg max log(p
C
The simple classifie
sification algorithm
Sec 4 we show how
approximated using
(without descriptor q
Figure 3. “Image-to-Image” vs. “Image-to-Class” distance. A
Ballet class with large variability and small number (three) of ‘labelled’ images (bottom row). Even though the “Query-to-Image”
distance is large to each individual ‘labelled’ image, the “Queryto-Class” distance is small. Top right image: For each descriptor at each point in Q we show (in color) the ‘labelled’ image
which gave it the highest descriptor likelihood. It is evident that
the new query configuration is more likely given the three images,
than each individual image seperately. (Images taken from [4].)
the entire class C (using all images I ∈ C), we would
get better generalization capabilities than by employing in-
Naive-Bayes classi
KL-Distance: In S
benefits of using an
show that the above
to minimizing “Que
Eq. (1) can be rew
Ĉ = a
where we sum over
tract a constant term
image configurations by “composing pieces” from a set of
other images was previously shown useful in [17, 4].
We prove (Sec. 3) that under the Naive-Bayes assumption, the optimal distance to use in image classification is
the KL “Image-to-Class” distance, and not the commonly
used “Image-to-Image” distribution distances (KL, χ 2 , etc.)
where KL(··) is
two probability di
Naive-Bayes assum
mizes a “Query-totor distributions of
A similar relat
and KL-distance w
tion, yet between
distances and not
between descriptor
cation have also be
again – between pa
Probabilistic formulation - Maximum Likelihood
3. Probabilistic Formulation
Bayes Rule: p(C|Q) =
p(Q|C)p(C)
p(Q)
In this section we derive likelihood×prior
the optimal Naive-Bayes imposterior
=
age classifier,
which is approximated
by NBNN (Sec. 4).
evidence
Given a new query (test) image Q, we want to find its
class C. It is well known [7] that maximum-a-posteriori
(MAP) classifier minimizes the average classification error: Ĉ = arg maxC p(C|Q). When the class prior p(C)
is uniform, the MAP classifier reduces to the MaximumLikelihood (ML) classifier:
Ĉ = arg max p(C|Q) = arg max p(Q|C).
C
C
Let d1 , ..., dn denote all the descriptors of the query image Q. We assume the simplest (generative) probabilistic
model, which is the Naive-Bayes assumption (that the descriptors d1 , ..., dn of Q are i.i.d. given its class C), namely:
4. The Approx
In this section w
accurately approxim
age classifier of Se
Non-Parametric D
The optimal MAP
requires computing
scriptor d in a class
tors in an image da
ber of pixels in the
Probabilistic formulation - Naive Bayes Assumption
Naive Bayes Assumption = descriptors d1 , . . . , dn are i.i.d.
p(Q|C) = p(d1 , .., dn |C) =
n
i=1
p(di |C)
Taking the log probability of the ML decision rule we get:
n
1
log p(di |C)
Ĉ = arg max log(p(C|Q)) = arg max
C
C n
i=1
(1)
The simple classi¿er implied by Eq. (1) is the optimal classi¿cation algorithm under the Naive-Bayes assumption.
approximated using a non-parametric NN-based algorithm
(without
descriptor quantization).
Naive Bayes
Classifier
⇔ Minimum Image-to-Class
Naive-Bayes classifier
⇔ Minimum “Image-to-Class”
KL-Distance
KL-Distance: In Sec. 2.2 we discussed the generalization
Class” distance. A
umber (three) of ‘lae “Query-to-Image”
image, the “Querye: For each descriphe ‘labelled’ image
d. It is evident that
en the three images,
es taken from [4].)
∈ C), we would
by employing ints. Such a direct
ned by computing
distributions of Q
ough the “Query‘labelled’ images
KL-distance may
on. Inferring new
ces” from a set of
l in [17, 4].
ve-Bayes assumpe classification is
not the commonly
nces (KL, χ 2 , etc.)
benefits of using an “Image-to-Class” distance. We next
the
above MAP classifier of Eq. (1) is equivalent
to minimizing “Query-to-Class” KL-distances.
Eq. (1) can be rewritten as:
Ĉ = arg max
p(d|Q) log p(d|C)
C
d
where we sum over all possible descriptors d. We can subtract a constant term independent of C from the right hand
side of the
above equation, without affecting Ĉ. By subtracting d p(d|Q) log p(d|Q), we get:
p(d|C)
)
Ĉ = arg max(
p(d|Q) log
C
p(d|Q)
d∈D
=
arg min( KL(p(d|Q)p(d|C)) )
C
(2)
where KL(··) is the KL-distance (divergence) between
two probability distributions. In other words, under the
Naive-Bayes assumption, the optimal MAP classifier minimizes a “Query-to-Class” KL-distance between the descriptor distributions of the query Q and the class C.
A similar relation between Naive-Bayes classification
and KL-distance was used in [28] for texture classifica-
d=1
thatAlgorithm:
the approximation of Eq. (4) always bounds
TheNote
NBNN
from below
complete
Parzen window
estimate ofdistribuEq. (3).
Due
to the the
long-tail
characteristic
of descriptor
Fig.
4
shows
the
accuracy
of
such
NN
approximation
of
tions, almost all of the descriptors are rather isolated in the
the distribution
p(d|C).
Even
when
very small
numdescriptor
space,
therefore
very
farusing
fromamost
descriptors
berthe
of database.
nearest neighbors
(as small
as the
r =terms
1, a single
nearin
Consequently,
all of
in the sumest neighbor
forfor
each
d in will
eachbe
class
C), a very
mation
of Eq.descriptor
(3), except
a few,
negligible
(K
accurate
approximation
p (d|C)
of the complete Parzen
exponentially
decreases with
NN distance) . Thus we can accuwindow
estimate isthe
obtained
(see Fig.
Moreover,
NN
approximate
summation
in Eq.4.a).
(3) using
the (few)
descriptor
approximation
hardly
reduces
the discriminative
r largest elements
in the sum.
These
r largest
elements corpower
(seeneighbors
Fig. 4.b).ofThis
is in contrast
to
respondoftodescriptors
the r nearest
a descriptor
d ∈ Q
C descriptors due to dethe
severe
discriminativity
of
within
the drop
class in
descriptors
dC
,
..,
d
∈
C:
1
L
r
scriptor quantization.
1
(d|C)
=
K(d
− dC ) in the ac(4)
p
We have indeed
found very small differences
NN
NNj
L
d=1changing r from 1 to 1000
tual classification results when
always bounds
TheNote that the approximation
case of rof=Eq.
1 is(4)
especially
convefrom
complete
Parzenobtains
windowa estimate
of Eq.
(3).
nientbelow
to use,thesince
log
p(d|C)
very
simple
form:
n
2
Fig.
4 shows
of such NN approximation
of
logP
(Q|C)
∝ −the accuracy
i=1 di − NNC (di ) and there is no
the distribution
p(d|C).
when using
smallkernel
numlonger
a dependence
on Even
the variance
of thea very
Gaussian
ber
K. of nearest neighbors (as small as r = 1, a single nearest
neighbor descriptor
for each
d in 5.
each class C), a very
experimental
results reported
in Sec.
accurate approximation p (d|C) of the complete Parzen
NN
window estimate is obtained (see Fig. 4.a). Moreover, NN
descriptor approximation hardly reduces the discriminative
power of descriptors (see Fig. 4.b). This is in contrast to
the severe drop in discriminativity of descriptors due to descriptor quantization.
The resulting Na
fication performan
can therefore be su
descriptor types a
The NBNN Algo
each image using
1. Compute desc
assumption on all
2. ∀di ∀C compu
very simple exten
3. Ĉ = arg min
The decision rulC
of Despite
each of its
thesimp
t
imates
the single-d
theore
the above
requires
nomin
learnin
Ĉ = arg
C
where dji is the i-t
Combining
determined bySever
the
approaches
to imat
Kj corresponding
demonstrated
that
who learn weights
in
single
areafixed
andclassifi
share
fication performan
descriptor
types Ca
Computational
each
image
using
ficient approximate
assumption
on all
tree implementatio
very
exten
searchsimple
is logarithm
The
decision[1].rule
the KD-tree
of each of the t d
the above single-d
Ĉ = arg minC
where dji is the i-th
determined by the
Kj corresponding t
Parzen Window - Nearest Neighbors
NN density distribution - discriminativity
approximation of the
nsity p(d|C) [7]. Let
obtained from all the
the Parzen likelihood
− dC
j )
(3)
function (which is
typically a Gaussian:
2
)). As L approaches
educes accordingly, p̂
[7].
y, all the database dey estimation of Eq. (3).
nally time-consuming
ance (d − d C
j ) for all
lass). We next show
hbor approximation of
of descriptor distribue rather isolated in the
(a)
(b)
Figure 4. NN descriptor estimation preserves descriptor density distribution and discriminativity.
(a) A scatter plot of
the 1-NN probability density distribution p (d|C) vs. the true
NN
distribution p(d|C). Brightness corresponds to the concentration
of points in the scatter plot. The plot shows that 1-NN distribution provides a very accurate approximation of the true distribution. (b) 20-NN descriptor approximation (Green graph) and 1NN descriptor approximation (Blue graph) preserve quite well the
discriminative power of descriptors. In contrast, descriptor quantization (Red graph) severely reduces discriminative power of descriptors. Displays are in logarithmic scale in all axes.
The resulting Naive-Bayes NN image classifier (NBNN)
can therefore be summarized as follows:
The NBNN Algorithm:
1. Compute descriptors d , ..., d of the query image Q.
ces accordingly, p̂
l the database deimation of Eq. (3).
y time-consuming
e (d − d C
j ) for all
). We next show
approximation of
escriptor distribuher isolated in the
most descriptors
terms in the sumbe negligible (K
Thus we can accu(3) using the (few)
gest elements cordescriptor d ∈ Q
C:
dC
NNj
)
(4)
4) always bounds
stimate of Eq. (3).
approximation of
sity distribution and discriminativity.
(a) A scatter plot of
the 1-NN probability density distribution p (d|C) vs. the true
NN
distribution p(d|C). Brightness corresponds to the concentration
of points in the scatter plot. The plot shows that 1-NN distribution provides a very accurate approximation of the true distribution. (b) 20-NN descriptor approximation (Green graph) and 1NN descriptor approximation (Blue graph) preserve quite well the
discriminative power of descriptors. In contrast, descriptor quantization (Red graph) severely reduces discriminative power of descriptors. Displays are in logarithmic scale in all axes.
The NBNN Algorithm!
The resulting Naive-Bayes NN image classifier (NBNN)
can therefore be summarized as follows:
The NBNN Algorithm:
1. Compute descriptors d 1 , ..., dn of the query image Q.
2. ∀di ∀C computethe NN of d i in C: NNC (di ).
n
3. Ĉ = arg minC i=1 di − NNC (di ) 2 .
Despite its simplicity, this algorithm accurately approximates the theoretically optimal Naive-Bayes classifier,
requires no learning/training, and is efficient.
Combining Several Types of Descriptors:
Recent
approaches to image classification [5, 6, 20, 27] have
demonstrated that combining several types of descriptors
in a single classifier can significantly boost the classification performance. In our case, when multiple (t)
descriptor types are used, we represent each point in
each image using t descriptors. Using a Naive Bayes
gest elements cordescriptor d ∈ Q
C:
requires no learning/training, and is efficient.
Combining
of Descriptors
Combining Several
Several TypesTypes
of Descriptors:
Recent
C
d
NNj
)
(4)
4) always bounds
estimate of Eq. (3).
approximation of
a very small num= 1, a single nearh class C), a very
e complete Parzen
a). Moreover, NN
the discriminative
s is in contrast to
criptors due to de-
erences in the acr from 1 to 1000
especially convevery simple form:
2 and there is no
e Gaussian kernel
was used in all the
approaches to image classification [5, 6, 20, 27] have
demonstrated that combining several types of descriptors
in a single classifier can significantly boost the classification performance. In our case, when multiple (t)
descriptor types are used, we represent each point in
using
t descriptors. Using a Naive Bayes
assumption on all the descriptors of all types yields a
very simple extension of the NBNN algorithm above.
The decision rule linearly combines the contribution
of each of the t descriptor types. Namely, Step (3) in
the above single-descriptor-type
NBNN is replaced by:
Ĉ = arg minC tj=1 wj · ni=1 dji − NNC (dji ) 2 ,
where dji is the i-th query descriptor of type j, and w j are
determined by the variance of the Parzen Gaussian kernel
Kj corresponding to descriptor type j. Unlike [5, 6, 20, 27],
who learn weights wj per descriptor-type per class, our w j
are fixed and shared by all classes.
Computational Complexity & Runtime: We use the efficient approximate-r-nearest-neighbors algorithm and KDtree implementation of [23]. The expected time for a NNsearch is logarithmic in the number of elements stored in
the KD-tree [1]. Note that the KD-tree data structure is
requires no learncessing step has a
ber of elements N )
w seconds for cones in Caltech-101).
(‘training’) images
n D the number of
ains n label ·nD deors searches within
for one query im(nC ·nD ·log(nD ))
no training time in
g of the KD-tree.
on Caltech-101 for
d SIFT descriptors
NBNN, and comrs (learning-based
s are provided in
the luminance part (L* from a CIELAB color space) as a
luminance descriptor, and the chromatic part (a*b*) as a
color descriptor. Both are normalized to unit length.
4. Shape descriptor: We extended the Shape-Context
descriptor [22] to contain edge-orientation histograms in
NN-based
its log-polar
bins. method
This descriptorPerformance
is applied to texture42.1 ± 0.81%
invariant SPM
edge NN
mapsImage
[21], [27]
and is normalized
to unit length.
GBDist NN Image
[27] of45.2
5. The Self-Similarity
descriptor
[25].± 0.96%
GB Vote NN [3]
52%
SVM-KNN
59.1 ±for0.56%
The descriptors
are[30]
densely computed
each image, at
NBNN
(1 Desc)
± scale
1.14%
five different
spatial
scales, enabling65.0
some
invariance.
Desc)
72.8(similar
± 0.39%
To furtherNBNN
utilize(5
rough
spatial position
to [30, 16]),
we augment each descriptor d with its location l in the imTable 1.˜ Comparing the performance of non-parametric NN-based
age: d = (d, αl). The resulting L 2 distance between deapproaches on
the Caltech-101 dataset (nlabel = 215). All the
scriptors, d˜ − d˜2 2 = d1 −d2 2 +α2 l
1 −l2 , combines
listed methods1do not
require a learning phase.
descriptor distance and location distance. (α was manually set in our experiments. The same fixed α was used
for Caltech-101 and Caltech-256, and α = 0 for Graz-01.)
pearance and shap
on Ca
Experiments - Descriptor extraction comparisons
of NBNN to other
5.2. Experiments
Following common benchmarking procedures, we split
each class to randomly chosen disjoint sets of ‘training images’ and ‘test images’. In our NBNN algorithm, since there
is no training, we use the term ‘labelled images’ instead of
‘training images’. In learning-based methods, the training
images are fed to a learning process generating a classifier
paring NBNN with
descriptor-type im
NN-based) (Fig. 5
tiple descriptor-typ
classifiers (learning
Table 1 shows
101 for several N
we used 15 labelle
bers reported in t
tor NBNN algorith
gap all NN-image
‘SVM-KNN’ [30]
based, which was c
quires no learning/training, its performance ranks among
the top leading learning-based image classifiers. Sec 5.3
further demonstrates experimentally the damaging effects
of using descriptor-quantization or “image-to-image” distances in a non-parametric classifier.
(training) and test
Combining Several Types of Descriptors
selected nlabel = 1
5.1. Implementation
We tested our NBNN algorithm with a single descriptortype (SIFT), and with a combination of 5 descriptor-types:
1. The SIFT descriptor ([19]).
2 + 3.
Simple Luminance & Color Descriptors: We
use log-polar sampling of raw image patches, and take
the luminance part (L* from a CIELAB color space) as a
luminance descriptor, and the chromatic part (a*b*) as a
color descriptor. Both are normalized to unit length.
4. Shape descriptor: We extended the Shape-Context
descriptor [22] to contain edge-orientation histograms in
its log-polar bins. This descriptor is applied to textureinvariant edge maps [21], and is normalized to unit length.
5. The Self-Similarity descriptor of [25].
The descriptors are densely computed for each image, at
five different spatial scales, enabling some scale invariance.
To further utilize rough spatial position (similar to [30, 16]),
we augment each descriptor d with its location l in the image: d˜ = (d, αl). The resulting L 2 distance between de2
2
2
˜ ˜ 2
20 images per clas
nlabel = 1, 5, 10,
images per class. T
times (randomly s
time performance
rate per class. Th
somewhat differen
Caltech-101: Th
furniture, vehicles
pearance and shap
comparisons on C
of NBNN to other
paring NBNN with
descriptor-type im
NN-based) (Fig. 5
tiple descriptor-ty
classifiers (learnin
Table 1 shows
101 for several N
we used 15 labell
bers reported in
tor NBNN algorit
gap all NN-imag
Results on Caltech-101
(a) (a)
(b) (b)
Figure
5. Performance
comparison
on Caltech-101.
(a) Single
descriptor
methods:
‘NBNN
(1 Desc)’,
‘Griffin
Figure
5. Performance
comparison
on Caltech-101.
(a) Single
descriptor
typetype
methods:
‘NBNN
(1 Desc)’,
‘Griffin
SPM’SPM’
[13],[13],
‘SVM‘SVM
KNN’
‘SPM’
‘PMK’
‘DHDP’
SVM’
(SVM
Geometric
KNN’
[30],[30],
‘SPM’
[16],[16],
‘PMK’
[12],[12],
‘DHDP’
[29],[29],
‘GB ‘GB
SVM’
(SVM
withwith
Geometric
Blur)Blur)
[27],[27],
‘GB ‘GB
Vote Vote
NN’ NN’
[3], [3],
‘GB ‘GB
NN’ NN’
(NN-(NNImage
Geometric
‘SPM
(NN-Image
Spatial
Pyramids
Match)
(b) Multiple
descriptor
methods:
Image
withwith
Geometric
Blur)Blur)
[27],[27],
‘SPM
NN’ NN’
(NN-Image
withwith
Spatial
Pyramids
Match)
[27].[27].
(b) Multiple
descriptor
typetype
methods:
‘NBNN
(5
Desc)’,
‘Bosch
Trees’
(with
ROI
Optimization)
[5],
‘Bosch
SVM’
[6],
‘LearnDist’
[11],
‘SKM’
[15],
‘Varma’
[27],
‘KTA’
‘NBNN (5 Desc)’, ‘Bosch Trees’ (with ROI Optimization) [5], ‘Bosch SVM’ [6], ‘LearnDist’ [11], ‘SKM’ [15], ‘Varma’ [27], ‘KTA’ [18].[18].
multi-descriptor
NBNN
algorithm
performs
better
OurOur
multi-descriptor
NBNN
algorithm
performs
eveneven
better
(72.8%
on labelled
15 labelled
images).
[3] uses
(72.8%
on 15
images).
‘GB‘GB
VoteVote
NN’NN’
[3] uses
an an
image-to-class
NN-based
voting
scheme
(without
descripimage-to-class
NN-based
voting
scheme
(without
descriptor quantization),
descriptor
votes
a single
tor quantization),
but but
eacheach
descriptor
votes
onlyonly
to a to
single
(nearest) class, hence the inferior performance.
(SVM-based) and [24] (Boosting based). NBNN performs
only slightly worse than the SVM-based classifier of [31].
Results - Contribution Evaluation
5.3. Impact of Quantization & Image-to-Image Dist.
In Sec. 2 we have argued that descriptor quantization
and “Image-to-Image” distance degrade the performance of
non-parametric image classifiers. Table 3 displays the results of introducing either of them into NBNN (tested on
Caltech-101 with n label = 30). The baseline performance
of NBNN (1-Desc) with a SIFT descriptor is 70.4%. If we
replace the “Image-to-Class” KL-distance in NBNN with
Opelt Zhang
Lazebnik the
NBNN
NBNN
an Class
“Image-to-Image”
KL-distance,
performance
drops
[24]
[16]
(1 Desc)
(5 Desc)
to 58.4% (i.e.,
17%[31]
drop in performance).
To check
the ef86.5 92.0
86.3descriptors
±2.5 89.2 ±4.7
90.0 ±4.3
fectBikes
of quantization,
the SIFT
are quantized
to a
People of80.8
88.0 82.3
86.0the
±5.0
87.0 ±4.6of
codebook
1000 words.
This±3.1
reduces
performance
Table
Resultsdrop
on Graz-01
NBNN to 50.4%
(i.e.,2.28.4%
in performance).
The spatial pyramid match kernel of [16] measures
No Quant.
With Quant.
distances between histograms of quantized SIFT descrip“Image-to-Class”
70.4%
50.4% (-28.4%)
tors, but within an SVM classifier. Their SVM learning
“Image-to-Image” 58.4% (-17%) phase compensates for some of the information loss due
to
quantization,
raising classification
performance
up to
Table
3. Impact of introducing
descriptor quantization
or “Image64.6%. However,
comparison
to the
performance
to-Image”
distance into
NBNN (using
SIFTbaseline
descriptor
on Caltechof NBNN
implies that the information loss in101,
nlabel =(70.4%)
30).
curred by the descriptor quantization was larger than the
formance is better than the learning-based classifiers of [16]
gain obtained and
by using
SVM.
(SVM-based)
[24] (Boosting
based). NBNN performs
only slightly worse than the SVM-based classifier of [31].
[12] K. Grauman a
Discriminative
ICCV, 2005.
[13] G. Griffin, A. H
egory dataset.
[14] F. Jurie and B
sual recognitio
[15] A. Kumar and
for object reco
[16] S. Lazebnik,
features: Spat
scene categori
and Pattern Re
[17] B. Leibe, A. Le
egorization an
InBosch,
ECCV A.
Work
[5] A.
Zi
[18] using
Y. Lin,
T. Liu,f
random
forBosch,
objectA.
cate
[6] A.
Zi
[19] with
D. Lowe.
a spatialDi
p
keypoints.
IJC
[7] R.
Duda, P. Har
York, 200
[20] New
M. Marszałek,
jer.Fei-Fei,
Learning
[8] R.
L.an
visual
modelsInf
recognition.
[21] bayesian
C. M. J. approa
Marti
Workshop
on G
image bounda
[9] P.cues.
Felzenszwalb
PAMI, 2
recogniti
[22] object
G. Mori,
S. Be
[10] R.
Fergus,
P. co
Pe
using
shape
by unsup
[23] nition
D. Mount
and
[11] A.
Y. Sis
estFrome,
neighbor
consistent
local
Comp. Geome
and M.
classF
[24] trieval
A. Opelt,
[12] K.
Grauman
potheses
and an
b
Discriminative
nition. In ECC
ICCV, 2005.
Outline
Irani - In Defence of Nearest-Neighbor Based Image Classification
Wang - Image-to-Class Distance Metric Learning for Image
Classification
Behmo - Towards Optimal Naive Bayes Nearest Neighbor
Wang - Image-to-Class Distance Metric Learning
for Image Classification
710
Z. Wang, Y. Hu, and L.-T. Chia
soft-margin, we introduce a slack variable ξ in the error term, and the whole
convex optimization problem is therefore formed as:
min
O(M1 , M2 , . . . , MC ) =
T
(1 − λ)
T r(ΔXip Mp ΔXip
)+λ
ξipn
M1 ,M2 ,...,MC
i,p→i
(6)
i,p→i,n→i
T
T
s.t.∀ i, p, n : T r(ΔXin Mn ΔXin
) − T r(ΔXip Mp ΔXip
) ≥ 1 − ξipn
∀ i, p, n : ξipn ≥ 0
∀ c : Mc 0
This optimization problem
an instance
SDP, which
Wang,is Hu,
Chia –ofECCV
2010can be solved using
standard SDP solver. However, as the standard SDP solvers is computation
Image-to-Class
Metricdescent
Learning
for Image
expensive, we useDistance
an efficient gradient
based method
derived Classification
from [20,19]
to solve our problem. Details are explained in the next subsection.
2.3
An Efficient Gradient Descent Solver
Due to the expensive computation cost of standard SDP solvers, we propose
an efficient gradient descent solver derived from Weinberger et al. [20,19] to
solve this optimization problem. Since the method proposed by Weinberger et
Contributions - Differences
• Differences on approach
1. Mahalanobis distance instead of Euclidean (learning for Mc )
2. Spatial Pyramid Match
3. Weighted local features
• Results
1. Fewer features-descriptors needed per image
2. Lower testing time
3. and. . . higher performance!
In this section, we formulate a large
margin convex optimization problem for
Notation
learning the Per-Class metrics and introduce an efficient gradient descent method
to solve this problem. We also adopt two strategies to further enhance the discrimination of our learned I2C distance.
2.1
Notation
Our work deals with the image represented by a collection of its local feature descriptors extracted from patches around each keypoint. So let Fi =
{fi1 , fi2 , . . . , fimi } denote features belonging to image Xi , where mi represents
the number of features in Xi and each feature is denoted as fij ∈ Rd , ∀j ∈
{1, . . . , mi }. To calculate the I2C distance from an image Xi to a candidate
class c, we need to find the NN of each feature fij from class c, which is denoted
as fijc . The original I2C distance from image Xi to class c is defined as the sum
Introducing Mahalanobis distance
Image-to-Class Distance Metric Learning for Image Classification
709
of Euclidean distances between each feature in image Xi and its NN in class c
and can be formulated as:
Dist(Xi , c) =
mi
j=1
fij − fijc 2
(1)
After learning the Per-Class metric Mc ∈ Rd×d for each class c, we replace the
Euclidean distance between each feature in image Xi and its NN in class c by
the Mahalanobis distance and the learned I2C distance becomes:
Dist(Xi , c) =
mi
j=1
(fij − fijc )T Mc (fij − fijc )
(2)
This learned I2C distance can also be represented in a matrix form by introducing
a new term ΔXic , which is a mi × d matrix representing the difference between
all features in the image Xi and their nearest neighbors in the class c formed as:
⎛
⎞
c T
(fi1 − fi1
)
Euclidean distance between each feature in image Xi and its NN in class c by
the Mahalanobis distance and the learned I2C distance becomes:
. . . more Notation
Dist(Xi , c) =
mi
j=1
(fij − fijc )T Mc (fij − fijc )
(2)
This learned I2C distance can also be represented in a matrix form by introducing
a new term ΔXic , which is a mi × d matrix representing the difference between
all features in the image Xi and their nearest neighbors in the class c formed as:
⎛
⎞
c T
(fi1 − fi1
)
⎜ (fi2 − f c )T ⎟
i2
⎟
ΔXic = ⎜
(3)
⎝
⎠
...
c
(fimi − fim
)T
i
So the learned I2C distance from image Xi to class c can be reformulated as:
T
Dist(Xi , c) = T r(ΔXic Mc ΔXic
)
(4)
This is equivalent to the equation (2). If Mc is an identity matrix, then it’s
also equivalent to the original Euclidean distance form of equation (1). In the
following subsection, we will use this formulation in the optimization function.
2.2
Problem Formulation
Optimization problem
The objective function in our optimization problem is composed of two terms:
the regularization term and error term. This is analogous to the optimization
problem in SVM. In the error term, we incorporate the idea of large margin and
formulate the constraint that the I2C distance from image Xi to its belonging
class p (named as positive distance) should be smaller than the distance to any
other class n (named as negative distance) with a margin. The formula is given
as follows:
T
T
) − T r(ΔXip Mp ΔXip
)≥1
(5)
T r(ΔXin Mn ΔXin
In the regularization term, we simply minimize all the positive distances similar
to [20]. So for the whole objective function, on one side we try to minimize all
the positive distances, on the other side for every image we keep those negative
distances away from the positive distance by a large margin. In order to allow
Optimization problemn
710
Z. Wang, Y. Hu, and L.-T. Chia
soft-margin, we introduce a slack variable ξ in the error term, and the whole
convex optimization problem is therefore formed as:
min
O(M1 , M2 , . . . , MC ) =
T
(1 − λ)
T r(ΔXip Mp ΔXip
)+λ
ξipn
M1 ,M2 ,...,MC
i,p→i
(6)
i,p→i,n→i
T
T
s.t.∀ i, p, n : T r(ΔXin Mn ΔXin
) − T r(ΔXip Mp ΔXip
) ≥ 1 − ξipn
∀ i, p, n : ξipn ≥ 0
∀ c : Mc 0
This optimization problem is an instance of SDP, which can be solved using
standard SDP solver. However, as the standard SDP solvers is computation
expensive, we use an efficient gradient descent based method derived from [20,19]
to solve our problem. Details are explained in the next subsection.
2.3
An Efficient Gradient Descent Solver
solve this optimization problem. Since the method proposed by Weinberger et
al. targets on solving only one global metric, we modify it to learn our Per-Class
metrics. This solver updates all matrices iteratively by taking a small step along
the gradient direction to reduce the objective function (6) and projecting onto
feasible set to ensure that each matrix is positive semi-definite in each iteration.
t+1 = M t − η · ∇O(M , M , . . . , M )
Gradient
To evaluate
the Descent:
gradient ofM
objective
function
for each matrix,
1
2 we denote
c the
c
c
th
matrix Mc for each class c at t iteration as Mct , and the corresponding gradient
as G(Mct ). We define a set of triplet error indices N t such that (i, p, n) ∈ N t
if ξipn > 0 at the tth iteration. Then the gradient G(Mct ) can be calculated by
taking the derivative of objective function (6) to Mct :
T
T
T
G(Mct ) = (1 − λ)
ΔXic
ΔXic + λ
ΔXic
ΔXic − λ
ΔXic
ΔXic (7)
An Efficient Gradient Descent Solver
i,c=p
(i,p,n)∈N t ,c=p
(i,p,n)∈N t ,c=n
Directly calculating the gradient in each iteration using this formula would be
computational expensive. As the changes in the gradient from one iteration to
the next are only determined by the differences between the sets N t and N t+1 ,
we use G(Mct ) to calculate the gradient G(Mct+1 ) in the next iteration, which
would be more efficient:
T
T
G(Mct+1 ) = G(Mct ) + λ(
ΔXic
ΔXic −
ΔXic
ΔXic ) (8)
(i,p,n)∈(N t+1 −N t ),c=p
(i,p,n)∈(N t+1 −N t ),c=n
T
T
− λ(
ΔXic
ΔXic −
ΔXic
ΔXic )
(i,p,n)∈(N t −N t+1 ),c=p
(i,p,n)∈(N t −N t+1 ),c=n
T
Since (ΔXic
ΔXic ) is unchanged during the iterations, we can accelerate the
updating procedure by pre-calculating this value before the first iteration. The
timization problem (6) is convex, this solver is able to converge to the global
optimum. We summarize
the whole
work flow Algorithm
in Algorithm 1.
Gradient
Descent
Algorithm 1. A Gradient Descent Method for Solving Our Optimization Problem
T
Input: step size α, parameter λ and pre-calculated data (ΔXic
ΔXic ), i
{1, . . . , N }, c ∈ {1, . . . , C}
for c := 1 to C do T
ΔXip
G(Mc0 ) := (1 − λ) i,p→i ΔXip
0
Mc := I
end for{Initialize M and gradient for each class}
Set t := 0
repeat
Compute N t by checking each error term ξipn
for c = 1 to C do
Update G(Mct+1 ) using equation (8)
Mct+1 := Mct + αG(Mct+1 )
Project Mct+1 for keeping positive semi-definite
end for
Calculate new objective function
t := t + 1
until Objective function converged
Output: each matrix M1 , . . . , MC
∈
To generate a more discriminative I2C distance for better recognition perforimages in a class. We adopt the idea of spatial pyramid by restricting each feature
mance,
we improve
ourSpatial
learned
distance
bythe
adopting
the idea
of aspatial
descriptor
in the image
to
only find
its
NN in
same
subregion
from
class atpyramid
each
Pyramid
Match
match
[9]
and
learning
I2C
distance
function
[16].
level.
Spatial pyramid match (SPM) is proposed by Lazebnik et al. [9] which makes
use of spatial correspondence, and the idea of pyramid match is adapted from
Grauman et al. [8]. This method recursively divides the image into subregions
at increasingly fine resolutions. We adopt this idea in our NN search by limiting
each feature point in the image to find its NN only in the same subregion from a
candidate class at each level. So the feature searching set in the candidate class
is reduced from the whole image (top level, or level 0) to only the corresponding
subregion (finer level), see Figure 2 for details. This spatial restriction enhances
the robustness of NN search by reducing the effect of noise due to wrong matches
from other subregions. Then the learned distances from all levels are merged
together as pyramid combination.
In addition, we find in our experiments that a single level spatial restriction
Fig.
parallelogram
an image, and
the right
parallelograms
denote
at 2.
a The
finer left
resolution
makes denotes
better recognition
accuracy
compared
to the top
images
in
a
class.
We
adopt
the
idea
of
spatial
pyramid
by
restricting
each
feature
level especially for those images with geometric scene structure, although the
descriptor
imagelower
to only
find
NN in the
same subregion
class at
accuracyinisthe
slightly
than
theitspyramid
combination
of all from
levels.a Since
theeach
level.
candidate searching set is smaller in a finer level, which requires less computation
cost for the NN search, we can use just a single level spatial restriction of the
Spatial pyramid match (SPM) is proposed by Lazebnik et al. [9] which makes
use of spatial correspondence, and the idea of pyramid match is adapted from
Weghts
onspeed
local
features
- for
Optimization
learned I2C
distance to
up the
classification
test images. Compared to
the top level, a finer level spatial restriction not only reduces the computation
cost, but also improves the recognition accuracy in most datasets. For some
images without geometric scene structure, this single level can still preserve the
recognition performance due to sufficient features in the candidate class.
We also use the method of learning I2C distance function proposed in [16]
to combine with the learned Mahalanobis I2C distance. The idea of learning
local distance function is originally proposed by Frome et al. and used for image
classification and retrieval in [6,5]. Their method learns a weighted distance
function for measuring I2I distance, which is achieved by also using a large
margin framework to learn the weight associated with each local feature. Wang
et al. [16] have used this idea to learn a weighted I2C distance function from
each image to a candidate class, and we find our distance metric learning method
can be combined with this distance function learning approach. For each class,
its weighted I2C distance is multiplied with our learned Per-Class matrix to
generate a more discriminative weighted Mahalanobis I2C distance. Details of
this local distance function for learning weight can be found in [6,16].
3
3.1
Experiment
Datasets and Setup
Experiments - descriptors
tion, we use dense sampling strategy and SIFT features [12] as our descriptor,
which are computed on a 16 × 16 patches over a grid with spacing of 8 pixels
for all datasets. This is a simplified method compared to some papers that use
densely sampled and multi-scale patches to extract large number of features,
81.2 ± 0.52 79.7 ± 1.83 89.8 ± 1.16
I2CDML+SPM
Results
- Spatial
Match
gain
I2CDML+Weight
78.5 ± Pyramid
0.74 81.3 ± 1.46
90.1 ± 0.94
I2CDML+
83.7 ± 0.49 84.3 ± 1.52 91.4 ± 0.88
SPM+Weight
Scene 15
Corel
Sports
0.86
0.93
I2CDML
I2CDML+Weight
0.84
I2CDML
I2CDML+Weight
0.92
0.84
0.82
I2CDML
I2CDML+Weight
0.91
0.82
0.9
0.8
0.8
0.89
0.78
0.78
0.76
0.76
NS
SSL
SPM
0.88
NS
SSL
SPM
0.87
NS
SSL
SPM
Fig. 4. Comparing the performance of no spatial restriction (NS), spatial single
level restriction (SSL) and spatial pyramid match (SPM) for both I2CDML and
I2CDML+Weight in all the three datasets. With only spatial single level, it achieves
better performance than without spatial restriction, although slightly lower than spatial pyramid combination of multiple levels. But it requires much less computation cost
for feature NN search.
Then we show in Table 2 the improved I2C distance through spatial pyramid
restriction from the idea of spatial pyramid match in [9] and learning weight
associated with each local feature in [16]. Both strategies are able to augment
Results - Caltech 101
Image-to-Class Distance Metric Learning for Image Classification
717
Caltech 101
0.7
0.6
0.5
0.4
I2CDML
I2CDML+Weight
NBNN
0.3
0.2
1*1
2*2
3*3
4*4
5*5
6*6
7*7
SPM
Fig. 5. Comparing the performances of I2CDML, I2CDML+Weight and NBNN from
spatial division of 1× to 7×7 and spatial pyramid combination (SPM) on Caltech 101.
less than 1000 features per image on average using our feature extraction strategy, which are about 1/20 compared to the size of feature set in [1]. We also use
single level spatial restriction to constrain the NN search for acceleration. For
each image, we divide it from 2×2 to 7×7 subregions and test the performance
Outline
Irani - In Defence of Nearest-Neighbor Based Image Classification
Wang - Image-to-Class Distance Metric Learning for Image
Classification
Behmo - Towards Optimal Naive Bayes Nearest Neighbor
Behmo - Towards Optimal Naive Bayes Nearest
Neighbor
172
R. Behmo et al.
Fig. 1. Subwindow detection for the original NBNN (red) and for our version of NBNN (green).
Behmo,
Marcombes,
Dalalyan,
Prinet
– original
ECCV
Since the
background class is more densely
sampled than the
object class, the
NBNN 2010
tends to select an object window that is too small relatively to the object instance. As show these
examples,
our approach addresses
this issue.
Towards
Optimal
Naive
Bayes Nearest Neighbor
by the efficiency of the classifier. Naive Bayes Nearest Neighbor (NBNN) is a classifier
introduced in [1] that was designed to address this issue: NBNN is non-parametric, does
not require any feature quantization step and thus uses to advantage the full discriminative power of visual features. However, in practice, we observe that NBNN performs
relatively well on certain datasets, but not on others. To remedy this, we start by analyzing the theoretical foundations of the NBNN. We show that this performance variability
could stem from the assumption that the normalization factor involved in the kernel estimator of the conditional density of features is class-independent. We relax this assumption and provide a new formulation of the NBNN which is richer than the original one.
In particular, our approach is well suited for optimal, multi-channel image classification
and object detection. The main argument of NBNN is that the log-likelihood of a visual
Contributions - Differences
• Differences on approach
1. Optimize normalization factors of Parzen Window (learning)
2. Learn optimal combinations of defferent descriptors (channels)
3. Spatial Pyramid Matching
4. Classification by Detection using ESS
• Results
1. Copes with differently populated classes
2. Has higher performance!
3. but. . . is slow at both learning and testing!
This classifier is shown to outperform the usual nearest neighbor classifier. Moreover,
it does not require any feature quantization step, and the descriptive power of image
features is thus preserved.
The reasoning above proceeds in three distinct steps: the naive Bayes assumption
considers that image features are independent identically distributed given the image
class cI (equation 1). Then, the estimation of a feature probability density is obtained
by a non-parametric density estimation process like the Parzen-Rosenblatt estimator
(equation 2). NBNN is based on the assumption that the logarithm of this value, which
is a sum of distances, can be approximated by its largest term (equation 3). In the following section, we will show that the implicit simplification that consists in removing the
normalization parameter from the density estimator is invalid in most practical cases.
Along with the notation introduced in this section, we will also need the notion of
point-to-set distance, which is simply the squared Euclidean distance of a point to its
nearest neighbor in the set: ∀Ω ⊂ RD , ∀x ∈ RD , τ (x, Ω) = inf y∈Ω x − y 2 . In
what follows, τ (x, χc ) will be abbreviated as τ c (x).
Notation
2.2 Affine Correction of Nearest Neighbor Distance for NBNN
The most important theoretical limitation of NBNN is that in order to obtain a simple
approximation of the log-likelihood, the normalization factor 1/Z of the kernel estimator is assumed to be the same for all classes. Yet, there is no a priori reason to believe
that this assumption is satisfied in practice. If this factor significantly varies from one
class to another, then the approximation of the maximum a posteriori class label ĉI by
equation 4 becomes unreliable.
It should be noted that the objection that we raise does not concern the core hypothesis of NBNN, namely the naive Bayes hypothesis and the approximation of the
sum of exponentials of equation 2 by its largest term. In fact, in the following we will
essentially follow and extend the arguments presented in [1] using the same starting
hypothesis.
NBNN reasoning steps
1. Naive Bayes assumption (eq.1)
2. Parzen window estimator of pdf (eq.2)
3. Nearest neighbor approximation (eq.3) (invalid removal of
normalization parameters)
2.1 Initial Formulation of NBNN
Original NBNN
In this section, we briefly recall the main arguments of NBNN described by Boiman et
al. [1] and introduce some necessary notation.
In an image I with hidden class label cI , we extract KI features (dIk )k ∈ RD .
Under the naive Bayes assumption, and assuming all image labels are equally probable
(P (c) ∼ cte) the optimal prediction ĉI of the class label of image I maximizes the
product of the feature probabilities relatively to the class label:
ĉI = arg max
c
KI
k=1
P (dIk |c).
(1)
The feature probability conditioned on the image class P (dIk |c) can be estimated by
a non-parametric
kernel estimator,
also called Parzen-Rosenblatt estimator. If we note
χc = dJk |cJ = c, 1 ≤ k ≤ KJ the set of all features from all training images that
belong to class c, we can write:
− dIk − d 2
1 P (dIk |c) =
exp
,
(2)
Z
2σ 2
c
d∈χ
where σ is the bandwidth of the density estimator. In [1], this estimator is further approximated by the largest term from the sum on the RHS. This leads to a quite simple
expression:
174
R. Behmo et al.
d − d 2 .
(3)
∀d, ∀c, − log (P (d|c)) min
c
d ∈χ
The decision rule for image I is thus:
ĉI = arg max P (I|c) = arg min
c
c
k
min dIk − d 2 .
d∈χc
(4)
This classifier is shown to outperform the usual nearest neighbor classifier. Moreover,
c
c 2
c 2
Optimal
Naive
Nearest
175
Z Towardsbounded
2(σby
) 1/2Bayes
2(σ
)Neighbor
space, in order to reach an approximation
we need
to sample
233 points.
In practice, the PR estimator does not converge and there is little sense in keeping more
−4/(4+D)
cthe first D
cof D
that
speed
the Parzen-Rosenblatt
2 (σ
just
=the
|χconvergence
|(2π) term
) the.ofsum.
Recall
that τ c (d) is(PR)
theestimator
squaredis K
Euclidean[13].
distance of
where Z cthan
This
means
that
in
the
case
of
a
128-dimensional
feature
space,
such
as
the
SIFT
c
Thus, the
log-likelihood
relatively to an
image
label
c is: feature
the feature
aboved equations,
we
have
replaced
the classd to its nearest
neighbor
in χ of. aInvisual
space, in order to reach an approximation
bounded
by
1/2 we need to sample 233 points.
c c
c
c is no reason to believe that
independent
notation
σ,
Z
by
σ
,
Z
since,
in
general,
there
τ
1
τ
(d)
(d)
In practice,
estimator
does notexp
converge
there =
is little sense
in keeping
c
− logthePPR
(d|c)
= − log
− and
+ log(Z
), more
(6)
c )2
Zc
2(σinstance,
2(σ c )2parameters are functions
parameters
For
both
thanshould
just the be
f rstequal
term ofacross
the sum.classes.
Thus,
the
log-likelihood
of
a
visual
feature
d
relatively
to
an
image
label
c
is:
of the number ofc training
features
of class c inc the training set.
D
Recall that
is the
Euclidean distance of
where Z = |χc |(2π) 2 (σ c )D . squared
τ (d)
c obtain:
c
Returning
to
the
naive
Bayes
formulation,
c
1 the abovewe
τ have
τequations,
(d)
(d) replaced cthe class.
In
we
d to its−nearest
neighbor
in
χ
log P (d|c) = − log c c exp −
=
+ log(Z ),
(6)
c )2 there is
2
2(σ
independent notation σ, Z by
σ , Z since, in 2(σ
general,
noc )reason
to believe that
KI c I
KI
parameters should be equal
across
classes.
For
instance,
both
parameters
are
functions
τ (dk )
c
c (I|c))
c = D
2 (σ c )D .
+ clog(Z
) = set.
αc
τ c (dIk ) + KI β cR, τ c (d)
(7)
∀c, −
lognumber
(P
= |χ
where
Z
of
the
of |(2π)
training
features of
in the training
c )class
2
2(σ
c
d to
In the aboveweequations,
its nearest
neighbor
in χ .formulation,
Returning
to the
naive
Bayes
obtain: we have
k=1
k=1replaced the classindependent notation σ, Z by σ c , Z c since, in general, there is no reason to believe that
KI c I c
KI
cparameters should
c 2 be equalc
Fora instance,
both
parameters
τ classes.
(dk ) ) is
where α = 1/(2(σ
) )(I|c))
and =
β across
= log(Z
re-parametrization
ofare
thefunctions
log-likelihood
c
+
log(Z
)
= αcset. τ c (dIk ) + KI β c , (7)
∀c,
−
log
(P
of the number of training features2(σ
of cclass
c in the training
2
)
6 that has the advantage of being
linear in the model parameters.
The image label is
k=1
k=1
Returning to the naive Bayes formulation, we obtain:
Correction of Nearest Neighbor Distance
then decided according
toc a2 criterion
that cis slightly different from equation 4:
c
c
where α = 1/(2(σ ) ) andKβI = log(Z
) is a re-parametrization
of the log-likelihood
KI
τ c (dI )
c
c
cThe
I
k in the model
6 that∀c,
has−the
linear
parameters.
imageclabel(7)
is
log advantage
(P (I|c)) =of being +
log(Z
)
=
α
τ
(d
K
I
k ) + KI β ,
2(σ cthat
)2c then decided according to ak=1
criterion
is slightly
from
c Idifferentk=1
c equation 4:
ĉI = arg min α
τ (dk ) + KI β
.
(8)
c
I
)Kis
a cre-parametrization
of the log-likelihood
where αc = 1/(2(σ c )2 ) and β c = log(Zcck=1
I
min
τ (d
KI β c .
ĉI =
α in the
k ) + parameters.
6 that has the advantage
of arg
being
linear
model
The image label(8)
is
c
We note then
that decided
this modified
criterion
can be
interpreted
in two4: different ways:
accordingdecision
to a criterion
that k=1
is slightly
different
from equation
it can either
be interpreted
as the
consequence
of a density
to which
We note
that this modified
decision
criterion
in two different
ways: a mul estimator
KI can be interpreted
c
cof IaNBNN
cin which
either
be interpreted
as
the
consequence
density
estimator
to
which
a
multiplicativeit can
factor
was
added,ĉI or
as
an
unmodified
an
affine
correction
(8)
= arg min α
τ (dk ) + KI β .
added,Euclidean
or ascan unmodified
NBNN
in which
an affine
correction formuk=1
has beentiplicative
added tofactor
the was
squared
distance.
In the
former,
the resulting
has been added to the squared Euclidean distance. In the former, the resulting formu-
has been added to the squared Euclidean distance. In the former, the resulting formulation can be considered different from the initial NBNN. In the latter, equation 8 can
be obtained from equation 4 simply by replacing τ c (d) by αc τ c (d) + β c (since αc is
positive, the nearest neighbor distance itself does not change). This formulation differs
from [1] only in the evaluation of the distance function, leaving us with two parameters
per class to be evaluated.
At this point, it is important to recall that the introduction of parameters αc and β c
does not violate the naive Bayes assumption, nor the assumption of equiprobability
of classes. In fact, the density estimation correction can be seen precisely as an enIf a class is more densely sampled than others (i.e: its
feature space contains more training samples), then the NBNN estimator will have a
bias towards that class, even though it made the assumption that all classes are equally
probable. The purpose of setting appropriate values for αc and β c is to correct this bias.
It might be noted that deciding on a suitable value for αc and β c reduces to def ning
176
R. Behmo
et al.
an appropriate
bandwidth
σ c . Indeed, the dimensionality D of the feature space and
the number |χc | of training feature points are known parameters. However, in practice,
choosing a good value for the bandwidth parameter is time-consuming and inefficient.
To cope with this issue, we designed an optimization scheme to find the optimal values
of parameters αc , β c with respect to cross-validation.
Correction of Nearest Neighbor Distance
2.3 Multi-channel Image Classification
In the most general case, an image is described by different features coming from different sources or sampling methods. For example, we can sample SIFT features and
local color histograms from an image. We observe that the classification criterion of
equation 1 copes well with the introduction of multiple feature sources. The only difference should be the parameters for density estimation, since feature types correspond,
in general, to different feature spaces.
In order to handle different feature types, we need to introduce a few definitions and notations: an image channel χ is a set of image features that take their values in a common feature space R^{d_χ}. Channels can be defined arbitrarily: a channel can be associated to a particular detector/descriptor pair, but can also represent global image characteristics. For instance, an image channel can consist of a single element, such as the global color histogram.
Let us assume we have defined a certain number of channels (χn )1≤n≤N , that are
expected to be particularly relevant to the problem at hand. Adapting the framework of
our modified NBNN to multiple channels is just a matter of changing notation. Similarly
to the single-channel case, we aim here at estimating the class label of an image I:
ĉ_I = arg max_c P(I|c),   with   P(I|c) = ∏_n ∏_{d∈χ_n(I)} P(d|c).   (9)
Combining descriptors
Since different channels have different feature spaces, the density correction parameters should depend on the channel index: α^c, β^c will thus be noted α_n^c, β_n^c. The notations from the previous section are adapted in a similar way: we call χ_n^c = ∪_{J : c_J = c} χ_n(J) the set of all features from class c and channel n, and define the distance function of a feature d to χ_n^c by: ∀d, τ_n^c(d) = τ(d, χ_n^c). This leads to the classification criterion:
ĉ_I = arg min_c Σ_n ( α_n^c Σ_{d∈χ_n(I)} τ_n^c(d) + β_n^c |χ_n(I)| ).   (10)
Naturally, when adding feature channels to our decision criterion, we wish to balance
the importance of each channel relative to its relevance to the problem at hand. Equation 10 shows that the role of relevance weighting can be assigned to the distance correction parameters. The problems of adequate channel balancing and nearest neighbor distance correction should thus be addressed in a single step. In the following
section, we present a method to find the optimal values of these parameters.
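As an illustration of how the multi-channel criterion of equation 10 combines channels, the sketch below extends the previous one with channel-indexed parameters; the dictionaries query, channels, alpha and beta are assumed inputs and their names are illustrative, not taken from the paper.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def multichannel_nbnn_label(query, channels, alpha, beta, class_labels):
    """query[n]: descriptors of channel n in image I; channels[n][c]: training features chi_n^c."""
    scores = {}
    for c in class_labels:
        total = 0.0
        for n in query:                                   # loop over feature channels
            nn = NearestNeighbors(n_neighbors=1).fit(channels[n][c])
            dist, _ = nn.kneighbors(query[n])
            tau_sum = np.sum(dist[:, 0] ** 2)             # sum of squared NN distances
            total += alpha[n][c] * tau_sum + beta[n][c] * len(query[n])
        scores[c] = total
    return min(scores, key=scores.get)                    # arg min over classes (equation 10)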
2.4 Parameter Estimation
We now turn to the problem of estimating values of α_n^c and β_n^c that are optimal for classification. For every class c, we define:
X_n^c(I) = Σ_{d∈χ_n(I)} τ_n^c(d),    X_{N+n}^c(I) = |χ_n(I)|,    ∀n = 1, ..., N.   (11)
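For illustration, the 2N coordinates of X^c(I) in equation 11 can be assembled as follows, assuming a hypothetical helper tau[n][c] that returns the nearest-neighbor distance of a descriptor to χ_n^c.

import numpy as np

def global_descriptor(query, tau, c, N):
    """query[n]: descriptors of channel n; tau[n][c](d): NN distance of d to chi_n^c."""
    X = np.zeros(2 * N)
    for n in range(N):
        X[n] = sum(tau[n][c](d) for d in query[n])   # X_n^c(I): summed NN distances of channel n
        X[N + n] = len(query[n])                     # X_{N+n}^c(I) = |chi_n(I)|
    return X

With this vector in hand, the decision rule becomes a simple dot product against ω^c, as equation 12 below makes explicit.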
Parameter Estimation - Optimization
For every c, the vector X^c(I) can be considered as a global descriptor of image I. We also denote by ω^c the (2N)-vector (α_1^c, ..., α_N^c, β_1^c, ..., β_N^c) and by W the matrix that results from the concatenation of the vectors ω^c for the different values of c. Using these notations, the classifier we propose can be rewritten as:
ĉ_I = arg min_c (ω^c)^T X^c(I),   (12)
where (ω^c)^T stands for the transpose of ω^c. This is close in spirit to the winner-takes-all classifier widely used for multiclass classification.
Given a labeled sample (I_i, c_i)_{i=1,...,K} independent of the sample used for computing the sets χ_n^c, we can define a constrained linear energy optimization problem that minimizes the hinge loss of a multi-channel NBNN classifier:
E(W) = Σ_{i=1}^{K} max_{c ≠ c_i} ( 1 + (ω^{c_i})^T X^{c_i}(I_i) − (ω^c)^T X^c(I_i) )_+ ,   (13)
where (x)_+ stands for the positive part of a real x. The minimization of E(W) can be recast as a linear program, since it is equivalent to minimizing Σ_i ξ_i subject to the constraints:
ξ_i ≥ 1 + (ω^{c_i})^T X^{c_i}(I_i) − (ω^c)^T X^c(I_i),   ∀i = 1, ..., K,  ∀c ≠ c_i,   (14)
ξ_i ≥ 0,   ∀i = 1, ..., K,   (15)
(ω^c)^T e_n ≥ 0,   ∀n = 1, ..., N,   (16)
where e_n stands for the vector of R^{2N} having all coordinates equal to zero, except for the nth coordinate, which is equal to 1. This linear program can be solved quickly for a relatively large number of images and channels. In practice, the number of channels should be kept small relative to the number of training samples to avoid overfitting.
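The linear program of equations 13-16 can be set up directly with an off-the-shelf solver. The following sketch uses scipy.optimize.linprog and assumes the vectors X^c(I_i) of equation 11 have been precomputed on a held-out labeled sample; it illustrates the formulation rather than reproducing the authors' implementation.

import numpy as np
from scipy.optimize import linprog

def fit_correction_parameters(X, y, classes, N):
    """X[i][c]: the 2N-vector X^c(I_i); y[i]: the true label c_i of validation image I_i."""
    C, K = len(classes), len(y)
    n_w = C * 2 * N                               # one omega^c block of length 2N per class
    cost = np.r_[np.zeros(n_w), np.ones(K)]       # minimize sum_i xi_i (slack variables)
    A_ub, b_ub = [], []
    for i in range(K):
        ci = classes.index(y[i])
        for c in range(C):
            if c == ci:
                continue
            row = np.zeros(n_w + K)
            row[ci * 2 * N:(ci + 1) * 2 * N] = X[i][y[i]]         # +(omega^{c_i})^T X^{c_i}(I_i)
            row[c * 2 * N:(c + 1) * 2 * N] = -X[i][classes[c]]    # -(omega^c)^T X^c(I_i)
            row[n_w + i] = -1.0                                   # -xi_i
            A_ub.append(row)
            b_ub.append(-1.0)                                     # <= -1 encodes constraint 14
    # bounds: alphas >= 0 (constraint 16), betas free, slacks >= 0 (constraint 15)
    bounds = ([(0, None)] * N + [(None, None)] * N) * C + [(0, None)] * K
    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    return res.x[:n_w].reshape(C, 2 * N)          # row c is omega^c = (alpha^c, beta^c)

The rows of the recovered matrix W are the vectors ω^c; a test image is then labeled by the dot-product rule of equation 12.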
Classification by detection
We now assume that the likelihood of a feature point, given the object class c and the object position π (a rectangular window), only depends on the point belonging or not to the object:
∀n, d, c,   −log P(d|c, π) = τ_n^c(d) if d ∈ π,   τ̄_n^c(d) if d ∉ π.   (17)
In the above equation, we have written the feature-to-set distance functions τ_n^c and τ̄_n^c without apparent density correction in order to alleviate the notation. We leave to the reader the task of replacing τ_n^c by α_n^c τ_n^c + β_n^c in the equations of this section.
The image log-likelihood function is now decomposed over all features inside and outside the object: E(I, c, π) = −log P(I|c, π) = Σ_n ( Σ_{d∈π} τ_n^c(d) + Σ_{d∉π} τ̄_n^c(d) ).
The term on the RHS can be rewritten:
E(I, c, π) = Σ_n ( Σ_{d∈π} (τ_n^c(d) − τ̄_n^c(d)) + Σ_d τ̄_n^c(d) ).   (18)
Observing that the second sum on the RHS does not depend on π, we get E(I, c, π) = E_1(I, c, π) + E_2(I, c), where E_1(I, c, π) = Σ_n Σ_{d∈π} (τ_n^c(d) − τ̄_n^c(d)) and E_2(I, c) = Σ_n Σ_d τ̄_n^c(d). Let us define the optimal object position π̂^c relative to class c as the position that minimizes the first energy term: π̂^c = arg min_π E_1(I, c, π) for all c. Then, we can obtain the most likely image class and object position by:
ĉ_I = arg min_c ( E_1(I, c, π̂^c) + E_2(I, c) ),    π̂_I = π̂^{ĉ_I}.   (19)
For any class c, finding the rectangular window π̂^c that is the most likely candidate can be done naively by exhaustive search, but this proves prohibitively expensive. Instead, we make
use of fast branch and bound subwindow search [2]. The method used to search for the
image window that maximizes the prediction of a linear SVM can be generalized to any
classifier that is linear in the image features, such as our optimal multi-channel NBNN.
In short, the most likely class label and object position for a test image I are found
by the following algorithm:
Detection Algorithm
1: declare variables ĉ, π̂
2: Ê = +∞
3: for each class label c do
4:    find π̂^c by efficient branch and bound subwindow search:
5:    π̂^c = arg min_π E_1(I, c, π)
6:    if E_1(I, c, π̂^c) + E_2(I, c) < Ê then
7:       Ê = E_1(I, c, π̂^c) + E_2(I, c)
8:       ĉ = c
9:       π̂ = π̂^c
10:   end if
11: end for
12: return ĉ, π̂
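In code, the detection loop can be sketched as follows; window_scores, min_subwindow and E2 are hypothetical helpers standing in for the per-feature energies τ_n^c − τ̄_n^c and for the branch and bound subwindow search of [2].

import numpy as np

def classify_and_detect(I, classes, window_scores, min_subwindow, E2):
    best = (None, None, np.inf)                        # (class, window, energy)
    for c in classes:
        scores, locations = window_scores(I, c)        # per-feature tau - taubar and (x, y)
        window, E1 = min_subwindow(scores, locations)  # pi_hat^c = arg min_pi E1(I, c, pi)
        energy = E1 + E2(I, c)                         # E1(I, c, pi_hat^c) + E2(I, c)
        if energy < best[2]:
            best = (c, window, energy)
    c_hat, pi_hat, _ = best
    return c_hat, pi_hat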
4 Experiments
Our optimal NBNN classifier was tested on three datasets: Caltech-101 [15], SceneClass
13 [16] and Graz-02 [14]. In each case, the training set was divided into two equal parts
for parameter selection. Classification results are expressed in percent and reflect the
rate of good classification, per class or averaged over all classes.
Fast NN search - LSH
A major practical limitation of NBNN and of our approach is the computational time necessary for nearest neighbor search, since the sets of potential nearest neighbors to explore can contain on the order of 10^5 to 10^6 points. We thus need to implement an appropriate search method. However, the dimensionality of the descriptor space can also be quite large, and traditional exact search methods, such as kd-trees or vantage point trees [17], are inefficient. We chose Locality Sensitive Hashing (LSH) and addressed the thorny issue of parameter tuning by multi-probe LSH [18] with a recall rate of 0.8. We observed that the resulting classification performance is not overly sensitive to small variations in the required recall rate. Computation speed, however, is: compared to exhaustive naive search, the observed speed increase was more than ten-fold. Further improvement in the execution times can be achieved using recent approximate NN-search methods [19,20].
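As an illustration of the indexing step, the sketch below builds one approximate nearest-neighbor index per class with the FAISS library; an HNSW index is used here purely as a stand-in for the multi-probe LSH of [18], and the parameters are illustrative.

import numpy as np
import faiss  # approximate nearest-neighbor search library

def build_class_index(class_feats, hnsw_links=32):
    """class_feats: (N_c, D) array of training descriptors of one class."""
    feats = np.ascontiguousarray(class_feats, dtype=np.float32)
    index = faiss.IndexHNSWFlat(feats.shape[1], hnsw_links)   # graph-based approximate index
    index.add(feats)
    return index

def approx_tau_sum(index, query_descriptors):
    """Approximate sum of squared NN distances of the query descriptors to the class set."""
    q = np.ascontiguousarray(query_descriptors, dtype=np.float32)
    dist, _ = index.search(q, 1)          # FAISS L2 indexes report squared L2 distances
    return float(np.sum(dist[:, 0]))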
Let us describe the databases used in our experiments.
Caltech-101 (5 classes). This dataset includes the five most populated classes of the
Caltech-101 dataset: faces, airplanes, cars-side, motorbikes and background. These
images present relatively little clutter and variation in object pose. Images were
resized to a maximum of 300 × 300 pixels prior to processing. The training and
testing sets both contain 30 randomly chosen images per class. Each experiment
was repeated 20 times and we report the average results over all experiments.
SceneClass 13. Each image of this dataset belongs to one of 13 indoor and outdoor scene categories.
Results on SIFT descriptors
4.1 Single-Channel Classification
The impact of optimal parameter selection on NBNN is measured by performing image classification with just one feature channel. We chose SIFT features [22] for their relative popularity. Results are summarized in Tables 1 and 2.
Table 1. Performance comparison between the bag of words classified by linear and χ2-kernel SVM, the NBNN classifier and our optimal NBNN.
Datasets             BoW/SVM        BoW/χ2-SVM     NBNN [1]       Optimal NBNN
SceneClass13 [16]    67.85 ±0.78    76.7 ±0.60     48.52 ±1.53    75.35 ±0.79
Graz02 [14]          68.18 ±4.21    77.91 ±2.43    61.13 ±5.61    78.98 ±2.37
Caltech101 [15]      59.2 ±11.89    89.13 ±2.53    73.07 ±4.02    89.77 ±2.31
In Table 1, the first two columns refer to the classification of bags of words by linear and by χ2-kernel SVM. In all three experiments we selected the most efficient SVM codebook size (between 500 and 3000) and feature histograms were normalized by their L1 norm. Furthermore, only the results for the χ2-kernel SVM with the best possible value (in a finite grid) of the smoothing parameter are reported. In Table 2, we omitted the results of BoW/SVM because of their clear inferiority w.r.t. BoW/χ2-SVM.
Table 2. Performance comparison between the bag of words classified by χ2-kernel SVM, the NBNN classifier and our optimal NBNN. Per class results for Caltech-101 (5 classes) dataset.
Class        BoW/χ2-SVM     NBNN [1]       Optimal NBNN
Airplanes    91.99 ± 4.87   34.17 ±11.35   95.00 ± 3.25
Car-side     96.16 ± 3.84   97.67 ± 2.38   94.00 ± 4.29
Table 3. Caltech101 (5 classes): Influence of various radiometry invariant features. Best and worst SIFT invariants are highlighted in blue and red, respectively.
Feature              BoW/χ2-SVM    NBNN [1]      Optimal NBNN
SIFT                 88.90 ±2.59   73.07 ±4.02   89.77 ±2.31
OpponentSIFT         89.90 ±2.18   72.73 ±6.01   91.10 ±2.45
rgSIFT               86.03 ±2.63   80.17 ±3.73   85.17 ±4.86
cSIFT                86.13 ±2.76   75.43 ±3.86   86.87 ±3.23
Transf. color SIFT   89.40 ±2.48   73.03 ±5.52   90.01 ±3.03
With optimal NBNN, OpponentSIFT performs best among the radiometry invariant descriptors, with a 91.10% good classification rate, while rgSIFT performs worst, with 85.17%. Thus, a wrong evaluation of the feature space properties undermines the descriptor performance.
4.3 Multi-channel Classification
In this experiment, we borrow the idea developed in [4] to subdivide the image in different spatial regions. We consider that an image channel associated to a certain image region is composed of all features that are located inside this region.
² The notion of channel is sufficiently versatile to be adapted to a variety of different contexts.
Results on Spatial Pyramid Matching
Fig. 2. Feature channels as image subregions: 1 × 1, 1 × 2, 1 × 3, 1 × 4
Table 4. Multi-channel classification, SceneClass13 dataset
Channels     #channels   NBNN    Optimal NBNN
1×1          1           48.52   75.35
1×1+1×2      3           53.59   76.10
1×1+1×3      4           55.24   76.54
1×1+1×4      5           55.37   78.26
In practice, image regions are regular grids of fixed size. We conducted experiments on the SceneClass13 dataset.
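A possible way to build such region channels is sketched below; the grid layouts follow Fig. 2, but whether a 1×k grid splits the image into rows or columns, and all names, are assumptions for illustration.

def region_channels(features, positions, image_width, image_height,
                    grids=((1, 1), (1, 2), (1, 3), (1, 4))):
    """features[i]: a descriptor; positions[i]: its (x, y) location in the image."""
    channels = {(g, r, c): [] for g in grids for r in range(g[0]) for c in range(g[1])}
    for f, (x, y) in zip(features, positions):
        for rows, cols in grids:
            r = min(int(y * rows / image_height), rows - 1)   # grid cell containing the feature
            c = min(int(x * cols / image_width), cols - 1)
            channels[((rows, cols), r, c)].append(f)
    return channels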
Results on Classification by Detection
Fig. 3. Subwindow detection for NBNN (red) and optimal NBNN (green). For this experiment,
all five SIFT radiometry invariants were combined. (see Section 4.4)
It can be observed that the non-parametric NBNN usually converges towards an optimal object window that is too small relative to the object instance. This is due to
the fact that the background class is more densely sampled. Consequently, the nearest
neighbor distance gives an estimate of the probability density that is too large. It was
precisely to address this issue that optimal NBNN was designed.
5 Conclusion