Slides - Image, Video and Multimedia Systems Lab
Transcription
Naive Bayes Nearest Neighbor Classifiers
Christos Varytimidis
Image, Video and Multimedia Systems Laboratory
National Technical University of Athens
January 2011

Outline
Irani - In Defence of Nearest-Neighbor Based Image Classification
Wang - Image-to-Class Distance Metric Learning for Image Classification
Behmo - Towards Optimal Naive Bayes Nearest Neighbor

Irani - In Defence of Nearest-Neighbor Based Image Classification

Boiman, Shechtman, Irani - CVPR 2008
In Defense of Nearest-Neighbor Based Image Classification

[Figure 1: Effects of descriptor quantization. Informative descriptors have low database frequency, leading to high quantization error. (a) An image from the Face class in Caltech-101. (b) Quantization error of densely computed image descriptors (SIFT) using a large codebook (size 6,000) of Caltech-101 (generated using [14]). Red = high error; Blue = low error. The most informative descriptors (eye, nose, etc.) have the highest quantization error. (c) Green marks the 8% of the descriptors in the image that are most frequent in the database (simple edges). (d) Magenta marks the 8% of the descriptors in the image that are least frequent in the database (mostly facial features).]

Naive Bayes Nearest-Neighbor Classifier

The "Image-to-Image" distance used by NN-image classifiers provides good classification only when the query image is similar to one of the database images, but it does not generalize beyond the labelled images. This limitation is especially severe for classes with large diversity. In this paper we propose a remarkably simple non-parametric NN-based classifier, which requires no descriptor quantization and employs a direct "Image-to-Class" distance. We show that under the Naive-Bayes assumption, the theoretically optimal image classifier can be accurately approximated by this simple algorithm. For brevity, we refer to this classifier as "NBNN", which stands for "Naive-Bayes Nearest-Neighbor". NBNN is embarrassingly simple: given a query image, compute all its local image descriptors d_1, ..., d_n, and search for the class C which minimizes the sum \sum_{i=1}^{n} \| d_i - NN_C(d_i) \|^2 (where NN_C(d_i) is the nearest-neighbor descriptor of d_i in class C). Although NBNN is extremely simple and requires no learning/training, its performance ranks among the top leading learning-based image classifiers. Empirical comparisons are shown on several challenging databases (Caltech-101, Caltech-256 and Graz-01).

The paper is organized as follows: Sec. 2 discusses the causes for the inferior performance of standard NN-based image classifiers. Sec. 3 provides the probabilistic formulation and the derivation of the optimal Naive-Bayes image classifier. In Sec. 4 we show how the optimal Naive-Bayes classifier can be accurately approximated with a very simple NN-based classifier (NBNN).
Finally, Sec. 5 provides empirical evaluation and comparison to other methods.

Quantization Error

In learning-based approaches, the data (typically hundreds of thousands of descriptors extracted from the training images) is quantized into a rather small codebook (typically 200-1000 representative descriptors). Lazebnik et al. [16] further proposed to add rough quantized location information to the histogram representation. However, as will be shown next, the amount of discriminative information is considerably reduced by the rough quantization. Learning-based algorithms can compensate for some of this information loss through their learning phase, leading to good classification results. This, however, is not the case for simple non-parametric algorithms, since they have no training phase to "undo" the quantization damage.

It is well known that highly frequent descriptors have low quantization error, while rare descriptors have high quantization error. However, the most frequent descriptors in a large database of images (e.g., Caltech-101) consist of simple edges and corners that appear abundantly in all the classes within the database, and are therefore least informative for classification (they provide very low class discriminativity). In contrast, the most informative descriptors for classification are the ones found in one (or few) classes but rare in the others. These discriminative descriptors tend to be rare in the database, hence get high quantization error. This problem is exemplified in Fig. 1 on a face image from Caltech-101, even when using a relatively large codebook of quantized descriptors.

As noted before [14, 26], when densely sampled image descriptors are divided into fine bins, the bin-density follows a power-law (also known as a long-tail or heavy-tail distribution). This implies that most descriptors are infrequent (i.e., found in low-density regions of the descriptor space)
and therefore rather isolated. In other words, there are almost no 'clusters' in the descriptor space. Consequently, any clustering into a small number of clusters (even thousands) will inevitably incur a very high quantization error for most database descriptors. Thus, such a long-tail descriptor distribution is inherently inappropriate for quantization.

Effects of descriptor Quantization

High quantization error leads to a drop in the discriminative power of descriptors. Moreover, the more informative (discriminative) a descriptor is, the more severe the degradation of its discriminativity. This is shown quantitatively in Fig. 2. The graph provides evidence of the severe drop in the discriminativity (informativeness) of the (SIFT) descriptors in Caltech-101 as a result of quantization. The descriptor discriminativity measure of [2, 26] was used, p(d|C)/p(d|C̄), which measures how well a descriptor d discriminates between its class C and all other classes C̄. We compare the average discriminativity of all descriptors in all Caltech-101 classes after quantization, p(d_quant|C)/p(d_quant|C̄), to their discriminativity before quantization.

Alternative methods have been proposed for generating compact codebooks via informative feature selection [26, 2]. These approaches, however, discard all but a small set of highly discriminative descriptors/features. In particular, they discard all descriptors with low discriminativity, even though individually such descriptors offer little discriminative power.

[Figure 2: Effects of descriptor quantization. Severe drop in descriptor discriminative power. We generated a scatter plot of descriptor discriminative power before and after quantization (for a very large sample set of SIFT descriptors d in Caltech-101, each for its respective class C). We then averaged this scatter plot along the y-axis, yielding the "Average discriminative power after quantization" (the red graph). The display is in logarithmic scale in both axes. NOTE: the more informative (discriminative) a descriptor d is, the larger the drop in its discriminative power.]

Image-to-Class Distance

2.2 Image-to-Image vs. Image-to-Class Distance

In this section we argue that the "Image-to-Image" distance, which is fundamental to kernel methods (SVM, RVM), significantly limits the generalization of non-parametric image classifiers when only few labelled ('training') images are available.
[Figure 3: "Image-to-Image" vs. "Image-to-Class" distance. A Ballet class with large variability and a small number (three) of 'labelled' images (bottom row). Even though the "Query-to-Image" distance is large to each individual 'labelled' image, the "Query-to-Class" distance is small. Top right image: for each descriptor at each point in Q we show (in color) the 'labelled' image which gave it the highest descriptor likelihood. It is evident that the new query configuration is more likely given the three images than given each individual image separately. (Images taken from [4].)]

By searching for descriptors in the entire class C (using all images I ∈ C), we would get better generalization capabilities than by employing individual "Image-to-Image" distances: even though the "Query-to-Image" distances to the individual 'labelled' images may be large, the "Query-to-Class" distance may be small. Inferring new image configurations by "composing pieces" from a set of other images was previously shown useful in [17, 4]. We prove (Sec. 3) that under the Naive-Bayes assumption, the optimal distance to use in image classification is the KL "Image-to-Class" distance, and not the commonly used "Image-to-Image" distribution distances (KL, χ², etc.).

Probabilistic formulation - Maximum Likelihood

3. Probabilistic Formulation

Bayes Rule: p(C|Q) = p(Q|C) p(C) / p(Q), i.e., posterior = likelihood × prior / evidence.

In this section we derive the optimal Naive-Bayes image classifier, which is approximated by NBNN (Sec. 4). Given a new query (test) image Q, we want to find its class C. It is well known [7] that the maximum-a-posteriori (MAP) classifier minimizes the average classification error: Ĉ = \arg\max_C p(C|Q). When the class prior p(C) is uniform, the MAP classifier reduces to the Maximum-Likelihood (ML) classifier:

Ĉ = \arg\max_C p(C|Q) = \arg\max_C p(Q|C).

Let d_1, ..., d_n denote all the descriptors of the query image Q. We assume the simplest (generative) probabilistic model, which is the Naive-Bayes assumption (that the descriptors d_1, ..., d_n of Q are i.i.d. given its class C), namely:

Probabilistic formulation - Naive Bayes Assumption

Naive Bayes Assumption = descriptors d_1, ..., d_n are i.i.d. given the class:

p(Q|C) = p(d_1, ..., d_n|C) = \prod_{i=1}^{n} p(d_i|C)

Taking the log probability of the ML decision rule we get:

Ĉ = \arg\max_C \log(p(C|Q)) = \arg\max_C \frac{1}{n} \sum_{i=1}^{n} \log p(d_i|C)    (1)

The simple classifier implied by Eq. (1) is the optimal classification algorithm under the Naive-Bayes assumption. It can be accurately approximated using a non-parametric NN-based algorithm (without descriptor quantization).

Naive-Bayes Classifier ⇔ Minimum "Image-to-Class" KL-Distance

In Sec. 2.2 we discussed the generalization benefits of using an "Image-to-Class" distance.
We next show that the above MAP classifier of Eq. (1) is equivalent to minimizing "Query-to-Class" KL-distances. Eq. (1) can be rewritten as:

Ĉ = \arg\max_C \sum_d p(d|Q) \log p(d|C)

where we sum over all possible descriptors d. We can subtract a constant term, independent of C, from the right-hand side of the above equation without affecting Ĉ. By subtracting \sum_d p(d|Q) \log p(d|Q), we get:

Ĉ = \arg\max_C \sum_{d} p(d|Q) \log \frac{p(d|C)}{p(d|Q)} = \arg\min_C KL\big(p(d|Q) \,\|\, p(d|C)\big)    (2)

where KL(·||·) is the KL-distance (divergence) between two probability distributions. In other words, under the Naive-Bayes assumption, the optimal MAP classifier minimizes a "Query-to-Class" KL-distance between the descriptor distributions of the query Q and the class C. A similar relation between Naive-Bayes classification and KL-distance was used in [28] for texture classification, yet between "Image-to-Image" distances rather than between query and class descriptor distributions.

Parzen Window - Nearest Neighbors

The optimal MAP classifier requires the probability density p(d|C) of a descriptor d in a class C, estimated from all the descriptors of that class in the image database. This density can be estimated with a Parzen window estimator over the L descriptors d_1^C, ..., d_L^C of class C:

\hat{p}(d|C) = \frac{1}{L} \sum_{j=1}^{L} K(d - d_j^C)    (3)

where the Parzen kernel K(·) is typically a Gaussian. Evaluating Eq. (3) over all database descriptors is computationally time-consuming. Due to the long-tail characteristic of descriptor distributions, almost all descriptors are rather isolated in the descriptor space, and therefore very far from most descriptors in the database. Consequently, all of the terms in the summation of Eq. (3), except for a few, will be negligible (K decreases exponentially with the NN distance). Thus we can accurately approximate the summation in Eq. (3) using the (few) r largest elements in the sum. These r largest elements correspond to the r nearest neighbors of a descriptor d ∈ Q within the class C descriptors:

\hat{p}_{NN}(d|C) = \frac{1}{L} \sum_{j=1}^{r} K(d - d_{NN_j}^C)    (4)

Note that the approximation of Eq. (4) always bounds from below the complete Parzen window estimate of Eq. (3). Fig. 4 shows the accuracy of such NN approximation of the distribution p(d|C). Even when using a very small number of nearest neighbors (as small as r = 1, a single nearest-neighbor descriptor for each d in each class C), a very accurate approximation \hat{p}_{NN}(d|C) of the complete Parzen window estimate is obtained (see Fig. 4.a). Moreover, the NN descriptor approximation hardly reduces the discriminative power of descriptors (see Fig. 4.b). This is in contrast to the severe drop in discriminativity of descriptors due to descriptor quantization.

We have indeed found very small differences in the actual classification results when changing r from 1 to 1000. The case r = 1 is especially convenient to use, since \log p(d|C) obtains a very simple form: \log P(Q|C) \propto -\sum_{i=1}^{n} \| d_i - NN_C(d_i) \|^2, and there is no longer a dependence on the variance of the Gaussian kernel K. The case r = 1 was used in all the experimental results reported in Sec. 5.
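To make the approximation of Eqs. (3)-(4) concrete, here is a small numerical sketch (not from the paper): it compares the full Parzen window estimate of p(d|C) with its r-nearest-neighbor truncation on toy data. The Gaussian bandwidth `sigma` and the random "descriptors" are arbitrary illustrative choices.

```python
import numpy as np

def parzen_full(d, class_descriptors, sigma):
    """Full Parzen window estimate of Eq. (3): average Gaussian kernel over all class descriptors."""
    sq_dists = np.sum((class_descriptors - d) ** 2, axis=1)
    return np.mean(np.exp(-sq_dists / (2.0 * sigma ** 2)))

def parzen_r_nn(d, class_descriptors, sigma, r=1):
    """r-NN truncation of Eq. (4): keep only the r largest kernel terms (the r nearest neighbors)."""
    sq_dists = np.sum((class_descriptors - d) ** 2, axis=1)
    nearest = np.sort(sq_dists)[:r]
    return np.sum(np.exp(-nearest / (2.0 * sigma ** 2))) / len(class_descriptors)

rng = np.random.default_rng(0)
class_C = rng.normal(size=(5000, 128))   # toy stand-in for all descriptors of one class
query_d = rng.normal(size=128)
sigma = 3.0

full = parzen_full(query_d, class_C, sigma)
approx = parzen_r_nn(query_d, class_C, sigma, r=1)
print(full, approx)   # Eq. (4) always lower-bounds Eq. (3); for isolated descriptors the gap is small
```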
NN density distribution - discriminativity

[Figure 4: NN descriptor estimation preserves descriptor density distribution and discriminativity. (a) A scatter plot of the 1-NN probability density distribution \hat{p}_{NN}(d|C) vs. the true distribution p(d|C). Brightness corresponds to the concentration of points in the scatter plot. The plot shows that the 1-NN distribution provides a very accurate approximation of the true distribution. (b) The 20-NN descriptor approximation (green graph) and the 1-NN descriptor approximation (blue graph) preserve quite well the discriminative power of descriptors. In contrast, descriptor quantization (red graph) severely reduces the discriminative power of descriptors. Displays are in logarithmic scale in all axes.]

The NBNN Algorithm!

The resulting Naive-Bayes NN image classifier (NBNN) can therefore be summarized as follows:

The NBNN Algorithm:
1. Compute the descriptors d_1, ..., d_n of the query image Q.
2. ∀d_i ∀C compute the NN of d_i in C: NN_C(d_i).
3. Ĉ = \arg\min_C \sum_{i=1}^{n} \| d_i - NN_C(d_i) \|^2.

Despite its simplicity, this algorithm accurately approximates the theoretically optimal Naive-Bayes classifier, requires no learning/training, and is efficient.
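The three-step algorithm above translates almost line by line into code. The following sketch is my own illustration (not the authors' implementation): it assumes descriptors are already extracted into numpy arrays and uses scipy's cKDTree for the per-class nearest-neighbor searches.

```python
import numpy as np
from scipy.spatial import cKDTree

class NBNN:
    """Naive-Bayes Nearest-Neighbor classifier: no training, only one KD-tree per class."""

    def fit(self, descriptors_per_class):
        # descriptors_per_class: {class_label: (L_c, D) array of all descriptors of that class}
        self.trees = {c: cKDTree(descs) for c, descs in descriptors_per_class.items()}
        return self

    def predict(self, query_descriptors):
        totals = {}
        for c, tree in self.trees.items():
            # Step 2: NN of every query descriptor within class c
            nn_dists, _ = tree.query(query_descriptors, k=1)
            # Step 3: Image-to-Class distance = sum of squared NN distances
            totals[c] = np.sum(nn_dists ** 2)
        return min(totals, key=totals.get)

# toy usage with random "descriptors"
rng = np.random.default_rng(1)
labelled = {"faces": rng.normal(0.0, 1.0, size=(2000, 128)),
            "cars":  rng.normal(0.5, 1.0, size=(2000, 128))}
clf = NBNN().fit(labelled)
query = rng.normal(0.5, 1.0, size=(300, 128))   # descriptors of one query image
print(clf.predict(query))                        # most likely prints "cars"
```

An approximate NN search (as in the paper's use of [23]) would replace the exact cKDTree queries; the structure of the classifier stays the same.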
Combining Several Types of Descriptors

Combining Several Types of Descriptors: Recent approaches to image classification [5, 6, 20, 27] have demonstrated that combining several types of descriptors in a single classifier can significantly boost the classification performance. In our case, when multiple (t) descriptor types are used, we represent each point in each image using t descriptors. Using a Naive Bayes assumption on all the descriptors of all types yields a very simple extension of the NBNN algorithm above. The decision rule linearly combines the contribution of each of the t descriptor types; namely, Step (3) in the above single-descriptor-type NBNN is replaced by:

Ĉ = \arg\min_C \sum_{j=1}^{t} w_j \sum_{i=1}^{n} \| d_i^j - NN_C(d_i^j) \|^2,

where d_i^j is the i-th query descriptor of type j, and the weights w_j are determined by the variance of the Parzen Gaussian kernel K_j corresponding to descriptor type j. Unlike [5, 6, 20, 27], who learn weights w_j per descriptor type per class, our w_j are fixed and shared by all classes.

Computational Complexity & Runtime: We use the efficient approximate-r-nearest-neighbors algorithm and KD-tree implementation of [23]. The expected time for a NN-search is logarithmic in the number of elements stored in the KD-tree [1]. Constructing the KD-trees requires no learning; this preprocessing step takes a few seconds for all the classes of Caltech-101. With n_label 'labelled' images per class and n_D descriptors per image, each class KD-tree contains n_label · n_D descriptors, and classifying one query image requires n_D descriptor searches within each of the n_C class KD-trees, i.e., O(n_C · n_D · log(n_D)) operations. There is no training time in NBNN other than the construction of the KD-trees.
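A minimal sketch of the multi-descriptor decision rule above (illustrative only): assuming one NBNN-style Image-to-Class distance table per descriptor type, the combination is a fixed weighted sum. The class names and weight values are placeholders; in the paper the weights come from the per-type Parzen kernel variances.

```python
def combine_descriptor_types(image_to_class_dists, weights):
    """image_to_class_dists: one {class: sum of squared NN distances} dict per descriptor type.
    weights: one fixed weight w_j per descriptor type (shared by all classes)."""
    classes = image_to_class_dists[0].keys()
    combined = {c: sum(w * d[c] for w, d in zip(weights, image_to_class_dists)) for c in classes}
    return min(combined, key=combined.get)

# e.g. SIFT, color and shape channels for a single query image (hypothetical numbers):
dists_sift  = {"faces": 410.2, "cars": 530.9}
dists_color = {"faces": 120.7, "cars":  98.1}
dists_shape = {"faces": 300.4, "cars": 350.0}
print(combine_descriptor_types([dists_sift, dists_color, dists_shape], weights=[1.0, 0.5, 0.8]))
```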
Experiments - Descriptor extraction

5.1. Implementation

We tested our NBNN algorithm with a single descriptor type (SIFT), and with a combination of 5 descriptor types:
1. The SIFT descriptor ([19]).
2 + 3. Simple Luminance & Color Descriptors: We use log-polar sampling of raw image patches, and take the luminance part (L* from a CIELAB color space) as a luminance descriptor and the chromatic part (a*b*) as a color descriptor. Both are normalized to unit length.
4. Shape descriptor: We extended the Shape-Context descriptor [22] to contain edge-orientation histograms in its log-polar bins. This descriptor is applied to texture-invariant edge maps [21], and is normalized to unit length.
5. The Self-Similarity descriptor of [25].

The descriptors are densely computed for each image, at five different spatial scales, enabling some scale invariance. To further utilize rough spatial position (similar to [30, 16]), we augment each descriptor d with its location l in the image: \tilde{d} = (d, αl). The resulting L_2 distance between descriptors, \| \tilde{d}_1 - \tilde{d}_2 \|^2 = \| d_1 - d_2 \|^2 + α^2 \| l_1 - l_2 \|^2, combines descriptor distance and location distance. (α was manually set in our experiments. The same fixed α was used for Caltech-101 and Caltech-256, and α = 0 for Graz-01.)
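The location augmentation amounts to appending the scaled keypoint coordinates to each descriptor before the NN search, so plain squared Euclidean distance automatically becomes the combined distance above. A short sketch, with α as a free parameter (the value below is arbitrary, not the paper's):

```python
import numpy as np

def augment_with_location(descriptors, locations, alpha):
    """descriptors: (n, D) array; locations: (n, 2) array of (x, y) keypoint positions."""
    return np.hstack([descriptors, alpha * locations])

rng = np.random.default_rng(2)
d = rng.normal(size=(500, 128))
l = rng.uniform(0, 1, size=(500, 2))            # normalized image coordinates
d_tilde = augment_with_location(d, l, alpha=0.7)
print(d_tilde.shape)                             # (500, 130): ready for the usual NN search
```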
5.2. Experiments

Following common benchmarking procedures, we split each class into randomly chosen disjoint sets of 'training images' and 'test images'. In our NBNN algorithm, since there is no training, we use the term 'labelled images' instead of 'training images'. In learning-based methods, the training images are fed to a learning process that generates a classifier. Although NBNN requires no learning/training, its performance ranks among the top leading learning-based image classifiers. Sec. 5.3 further demonstrates experimentally the damaging effects of using descriptor quantization or "image-to-image" distances in a non-parametric classifier.

Results on Caltech-101

Caltech-101 contains a large variety of object classes (furniture, vehicles, etc.) with considerable variability in appearance and shape. Table 1 compares NBNN to other non-parametric NN-based methods on Caltech-101, using 15 labelled images per class. The single-descriptor NBNN algorithm outperforms by a large gap all NN-image classifiers, including 'SVM-KNN' [30], which is partially learning-based.

[Table 1: Comparing the performance of non-parametric NN-based approaches on the Caltech-101 dataset (n_label = 15). All the listed methods do not require a learning phase.]
NN-based method        Performance
SPM NN Image [27]      42.1 ± 0.81%
GBDist NN Image [27]   45.2 ± 0.96%
GB Vote NN [3]         52%
SVM-KNN [30]           59.1 ± 0.56%
NBNN (1 Desc)          65.0 ± 1.14%
NBNN (5 Desc)          72.8 ± 0.39%

[Figure 5: Performance comparison on Caltech-101. (a) Single descriptor type methods: 'NBNN (1 Desc)', 'Griffin SPM' [13], 'SVM-KNN' [30], 'SPM' [16], 'PMK' [12], 'DHDP' [29], 'GB SVM' (SVM with Geometric Blur) [27], 'GB Vote NN' [3], 'GB NN' (NN-Image with Geometric Blur) [27], 'SPM NN' (NN-Image with Spatial Pyramids Match) [27]. (b) Multiple descriptor type methods: 'NBNN (5 Desc)', 'Bosch Trees' (with ROI Optimization) [5], 'Bosch SVM' [6], 'LearnDist' [11], 'SKM' [15], 'Varma' [27], 'KTA' [18].]

Our multi-descriptor NBNN algorithm performs even better (72.8% on 15 labelled images). 'GB Vote NN' [3] uses an image-to-class NN-based voting scheme (without descriptor quantization), but each descriptor votes only for a single (nearest) class, hence the inferior performance.

Results on Graz-01: NBNN's performance is better than the learning-based classifiers of [16] (SVM-based) and [24] (Boosting-based). NBNN performs only slightly worse than the SVM-based classifier of [31].

[Table 2: Results on Graz-01.]
Class    Opelt [24]   Zhang [31]   Lazebnik [16]   NBNN (1 Desc)   NBNN (5 Desc)
Bikes    86.5         92.0         86.3 ± 2.5      89.2 ± 4.7      90.0 ± 4.3
People   80.8         88.0         82.3 ± 3.1      86.0 ± 5.0      87.0 ± 4.6

Results - Contribution Evaluation

5.3. Impact of Quantization & Image-to-Image Distance

In Sec. 2 we argued that descriptor quantization and "Image-to-Image" distance degrade the performance of non-parametric image classifiers. Table 3 displays the results of introducing either of them into NBNN (tested on Caltech-101 with n_label = 30). The baseline performance of NBNN (1 Desc) with a SIFT descriptor is 70.4%. If we replace the "Image-to-Class" KL-distance in NBNN with an "Image-to-Image" KL-distance, performance drops to 58.4% (i.e., a 17% drop in performance). To check the effect of quantization, the SIFT descriptors are quantized to a codebook of 1000 words. This reduces the performance of NBNN to 50.4% (i.e., a 28.4% drop in performance).

The spatial pyramid match kernel of [16] measures distances between histograms of quantized SIFT descriptors, but within an SVM classifier. Their SVM learning phase compensates for some of the information loss due to quantization, raising classification performance up to 64.6%. However, comparison to the baseline performance of NBNN (70.4%) implies that the information loss incurred by the descriptor quantization was larger than the gain obtained by using the SVM.

[Table 3: Impact of introducing descriptor quantization or "Image-to-Image" distance into NBNN (using a SIFT descriptor on Caltech-101, n_label = 30).]
                     No Quant.       With Quant.
"Image-to-Class"     70.4%           50.4% (-28.4%)
"Image-to-Image"     58.4% (-17%)    -
Outline
Irani - In Defence of Nearest-Neighbor Based Image Classification
Wang - Image-to-Class Distance Metric Learning for Image Classification
Behmo - Towards Optimal Naive Bayes Nearest Neighbor

Wang - Image-to-Class Distance Metric Learning for Image Classification

Wang, Hu, Chia - ECCV 2010
Image-to-Class Distance Metric Learning for Image Classification

Contributions - Differences

• Differences in approach
1. Mahalanobis distance instead of Euclidean (learning of the per-class metrics M_c)
2. Spatial Pyramid Match
3. Weighted local features

• Results
1. Fewer features/descriptors needed per image
2. Lower testing time
3. and... higher performance!

In this section, we formulate a large margin convex optimization problem for learning the Per-Class metrics and introduce an efficient gradient descent method to solve this problem. We also adopt two strategies to further enhance the discrimination of our learned I2C distance.

Notation

2.1 Notation

Our work deals with the image represented by a collection of its local feature descriptors extracted from patches around each keypoint. Let F_i = {f_{i1}, f_{i2}, ..., f_{im_i}} denote the features belonging to image X_i, where m_i is the number of features in X_i and each feature is denoted f_{ij} ∈ R^d, ∀j ∈ {1, ..., m_i}. To calculate the I2C distance from an image X_i to a candidate class c, we need to find the NN of each feature f_{ij} in class c, denoted f_{ij}^c. The original I2C distance from image X_i to class c is defined as the sum of Euclidean distances between each feature in image X_i and its NN in class c, and can be formulated as:

Dist(X_i, c) = \sum_{j=1}^{m_i} \| f_{ij} - f_{ij}^c \|^2    (1)

Introducing Mahalanobis distance

After learning the Per-Class metric M_c ∈ R^{d×d} for each class c, we replace the Euclidean distance between each feature in image X_i and its NN in class c by the Mahalanobis distance, and the learned I2C distance becomes:

Dist(X_i, c) = \sum_{j=1}^{m_i} (f_{ij} - f_{ij}^c)^T M_c (f_{ij} - f_{ij}^c)    (2)
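As a concrete reading of Eq. (2), the sketch below (an illustration, not the authors' code) computes the learned I2C distance of one image to one class, given that class's pooled features and its learned metric M_c; the NN search is done by brute force for clarity.

```python
import numpy as np

def learned_i2c_distance(F_i, class_features, M_c):
    """Eq. (2): sum of Mahalanobis distances between each feature of image X_i and its NN in class c.
    F_i: (m_i, d) features of the image; class_features: (L, d) pool of class c; M_c: (d, d) PSD metric."""
    dist = 0.0
    for f in F_i:
        diffs = class_features - f                    # differences to every candidate NN
        eucl = np.einsum("nd,nd->n", diffs, diffs)    # squared Euclidean, used to pick the NN (as in Eq. 1)
        nn_diff = f - class_features[np.argmin(eucl)] # f_ij - f_ij^c
        dist += nn_diff @ M_c @ nn_diff               # Mahalanobis term for that pair
    return dist

rng = np.random.default_rng(4)
F_i = rng.normal(size=(50, 16))
pool_c = rng.normal(size=(400, 16))
M_c = np.eye(16)                                      # identity metric recovers the Euclidean I2C of Eq. (1)
print(learned_i2c_distance(F_i, pool_c, M_c))
```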
more Notation

This learned I2C distance can also be represented in matrix form by introducing a new term ΔX_ic, an m_i × d matrix representing the difference between all features in the image X_i and their nearest neighbors in the class c, formed as:

ΔX_ic = \begin{pmatrix} (f_{i1} - f_{i1}^c)^T \\ (f_{i2} - f_{i2}^c)^T \\ \vdots \\ (f_{im_i} - f_{im_i}^c)^T \end{pmatrix}    (3)

So the learned I2C distance from image X_i to class c can be reformulated as:

Dist(X_i, c) = Tr(ΔX_ic M_c ΔX_ic^T)    (4)

This is equivalent to equation (2). If M_c is an identity matrix, then it is also equivalent to the original Euclidean distance form of equation (1). In the following subsection, we will use this formulation in the optimization function.

Optimization problem

2.2 Problem Formulation

The objective function in our optimization problem is composed of two terms: the regularization term and the error term. This is analogous to the optimization problem in SVM. In the error term, we incorporate the idea of a large margin and formulate the constraint that the I2C distance from image X_i to its own class p (the positive distance) should be smaller than the distance to any other class n (the negative distance) by a margin. The formula is given as follows:

Tr(ΔX_in M_n ΔX_in^T) - Tr(ΔX_ip M_p ΔX_ip^T) ≥ 1    (5)

In the regularization term, we simply minimize all the positive distances, similar to [20]. So for the whole objective function, on one side we try to minimize all the positive distances; on the other side, for every image we keep the negative distances away from the positive distance by a large margin. In order to allow a soft margin, we introduce a slack variable ξ in the error term, and the whole convex optimization problem is therefore formed as:

\min_{M_1, M_2, \ldots, M_C} O(M_1, M_2, \ldots, M_C) = (1 - λ) \sum_{i, p \to i} Tr(ΔX_ip M_p ΔX_ip^T) + λ \sum_{i, p \to i, n \to i} ξ_ipn    (6)

s.t.  ∀ i, p, n:  Tr(ΔX_in M_n ΔX_in^T) - Tr(ΔX_ip M_p ΔX_ip^T) ≥ 1 - ξ_ipn
      ∀ i, p, n:  ξ_ipn ≥ 0
      ∀ c:  M_c ⪰ 0

This optimization problem is an instance of SDP, which can be solved using a standard SDP solver. However, as standard SDP solvers are computationally expensive, we use an efficient gradient descent based method derived from [20,19] to solve our problem. Details are explained in the next subsection.

An Efficient Gradient Descent Solver

2.3 An Efficient Gradient Descent Solver

Due to the expensive computation cost of standard SDP solvers, we propose an efficient gradient descent solver derived from Weinberger et al. [20,19] to solve this optimization problem. Since the method proposed by Weinberger et al. targets only one global metric, we modify it to learn our Per-Class metrics. This solver updates all matrices iteratively by taking a small step along the gradient direction to reduce the objective function (6), and projects onto the feasible set to ensure that each matrix is positive semi-definite in each iteration.

Gradient Descent:  M_c^{t+1} = M_c^t - η \cdot \nabla_{M_c} O(M_1, M_2, \ldots, M_C)

To evaluate the gradient of the objective function for each matrix, we denote the matrix M_c for class c at the t-th iteration as M_c^t, and the corresponding gradient as G(M_c^t). We define a set of triplet error indices N^t such that (i, p, n) ∈ N^t if ξ_ipn > 0 at the t-th iteration. Then the gradient G(M_c^t) can be calculated by taking the derivative of the objective function (6) with respect to M_c^t:

G(M_c^t) = (1 - λ) \sum_{i, c = p} ΔX_ic^T ΔX_ic + λ \sum_{(i,p,n) \in N^t, c = p} ΔX_ic^T ΔX_ic - λ \sum_{(i,p,n) \in N^t, c = n} ΔX_ic^T ΔX_ic    (7)

Directly calculating the gradient in each iteration using this formula would be computationally expensive.
As the changes in the gradient from one iteration to the next are only determined by the differences between the sets N^t and N^{t+1}, we use G(M_c^t) to calculate the gradient G(M_c^{t+1}) of the next iteration, which is more efficient:

G(M_c^{t+1}) = G(M_c^t) + λ \left( \sum_{(i,p,n) \in (N^{t+1} - N^t), c = p} ΔX_ic^T ΔX_ic - \sum_{(i,p,n) \in (N^{t+1} - N^t), c = n} ΔX_ic^T ΔX_ic \right) - λ \left( \sum_{(i,p,n) \in (N^t - N^{t+1}), c = p} ΔX_ic^T ΔX_ic - \sum_{(i,p,n) \in (N^t - N^{t+1}), c = n} ΔX_ic^T ΔX_ic \right)    (8)

Since (ΔX_ic^T ΔX_ic) is unchanged during the iterations, we can accelerate the updating procedure by pre-calculating this value before the first iteration. Since the optimization problem (6) is convex, this solver is able to converge to the global optimum. We summarize the whole work flow in Algorithm 1.

Gradient Descent Algorithm

Algorithm 1. A Gradient Descent Method for Solving Our Optimization Problem
Input: step size α, parameter λ and pre-calculated data (ΔX_ic^T ΔX_ic), i ∈ {1, ..., N}, c ∈ {1, ..., C}
for c := 1 to C do
    G(M_c^0) := (1 - λ) \sum_{i, p \to i} ΔX_ip^T ΔX_ip
    M_c^0 := I
end for  {Initialize M and gradient for each class}
Set t := 0
repeat
    Compute N^t by checking each error term ξ_ipn
    for c = 1 to C do
        Update G(M_c^{t+1}) using equation (8)
        M_c^{t+1} := M_c^t - α G(M_c^{t+1})
        Project M_c^{t+1} to keep it positive semi-definite
    end for
    Calculate the new objective function
    t := t + 1
until the objective function converges
Output: each matrix M_1, ..., M_C
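A compact sketch of the solver (my paraphrase of Algorithm 1, not the authors' code): each per-class metric takes a gradient step built from the currently violated triplets, then is projected back onto the positive semi-definite cone by clipping negative eigenvalues. For clarity the gradient is recomputed from scratch each iteration instead of using the incremental update of Eq. (8); the pre-computation of ΔX_ic^T ΔX_ic is kept.

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by zeroing negative eigenvalues."""
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    return (V * np.clip(w, 0.0, None)) @ V.T

def solve_per_class_metrics(Delta, labels, lam=0.5, step=1e-3, iters=100):
    """Delta[i][c]: (m_i, d) difference matrix ΔX_ic of image i w.r.t. class c (Eq. 3).
    labels[i]: true class p of image i. Returns one (d, d) PSD metric per class."""
    classes = sorted({c for i in Delta for c in Delta[i]})
    d = next(iter(next(iter(Delta.values())).values())).shape[1]
    M = {c: np.eye(d) for c in classes}
    XtX = {(i, c): Delta[i][c].T @ Delta[i][c] for i in Delta for c in Delta[i]}  # pre-calculated

    def dist(i, c):  # Tr(ΔX_ic M_c ΔX_ic^T) == <M_c, ΔX_ic^T ΔX_ic>
        return np.sum(M[c] * XtX[(i, c)])

    for _ in range(iters):
        G = {c: np.zeros((d, d)) for c in classes}
        for i, p in labels.items():
            G[p] += (1.0 - lam) * XtX[(i, p)]                    # regularization term of Eq. (7)
            for n in Delta[i]:
                if n != p and dist(i, n) - dist(i, p) < 1.0:     # violated triplet (i, p, n) in N^t
                    G[p] += lam * XtX[(i, p)]
                    G[n] -= lam * XtX[(i, n)]
        for c in classes:
            M[c] = project_psd(M[c] - step * G[c])               # descend, then project onto the feasible set
    return M

# toy usage: 6 images, 2 classes, 4-dimensional features
rng = np.random.default_rng(6)
Delta = {i: {c: rng.normal(size=(10, 4)) for c in range(2)} for i in range(6)}
labels = {i: i % 2 for i in range(6)}
metrics = solve_per_class_metrics(Delta, labels)
print(metrics[0].shape)   # (4, 4)
```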
Spatial Pyramid Match

To generate a more discriminative I2C distance for better recognition performance, we improve our learned distance by adopting the idea of spatial pyramid match [9] and of learning an I2C distance function [16].

Spatial pyramid match (SPM) was proposed by Lazebnik et al. [9] and makes use of spatial correspondence; the idea of pyramid match is adapted from Grauman et al. [8]. This method recursively divides the image into subregions at increasingly fine resolutions. We adopt this idea in our NN search by limiting each feature point in the image to find its NN only in the same subregion of a candidate class at each level. The feature searching set in the candidate class is thus reduced from the whole image (top level, or level 0) to only the corresponding subregion (finer levels); see Figure 2 for details. This spatial restriction enhances the robustness of the NN search by reducing the effect of noise due to wrong matches from other subregions. The learned distances from all levels are then merged together as a pyramid combination.

In addition, we find in our experiments that a single level spatial restriction at a finer resolution gives better recognition accuracy than the top level, especially for images with geometric scene structure, although the accuracy is slightly lower than the pyramid combination of all levels. Since the candidate searching set is smaller at a finer level, which requires less computation cost for the NN search, we can use just a single level spatial restriction of the learned I2C distance to speed up the classification of test images. Compared to the top level, a finer level spatial restriction not only reduces the computation cost, but also improves the recognition accuracy on most datasets. For images without geometric scene structure, this single level can still preserve the recognition performance due to sufficient features in the candidate class.

[Fig. 2: The left parallelogram denotes an image, and the right parallelograms denote images in a class. We adopt the idea of the spatial pyramid by restricting each feature descriptor in the image to only find its NN in the same subregion of a class at each level.]

Weights on local features - Optimization

We also use the method of learning an I2C distance function proposed in [16], combined with the learned Mahalanobis I2C distance. The idea of learning a local distance function was originally proposed by Frome et al. and used for image classification and retrieval in [6,5]. Their method learns a weighted distance function for measuring I2I distance, which is achieved by also using a large margin framework to learn the weight associated with each local feature. Wang et al. [16] used this idea to learn a weighted I2C distance function from each image to a candidate class, and we find that our distance metric learning method can be combined with this distance function learning approach. For each class, its weighted I2C distance is multiplied with our learned Per-Class matrix to generate a more discriminative weighted Mahalanobis I2C distance. Details of this local distance function for learning weights can be found in [6,16].

Experiments - descriptors

3 Experiment
3.1 Datasets and Setup

For feature extraction, we use a dense sampling strategy and SIFT features [12] as our descriptor, computed on 16 × 16 patches over a grid with a spacing of 8 pixels for all datasets. This is a simplified method compared to some papers that use densely sampled and multi-scale patches to extract a large number of features.

Results - Spatial Pyramid Match gain

Table 2 shows the I2C distance improved through the spatial restriction taken from the idea of spatial pyramid match in [9] and through the learned weight associated with each local feature from [16]. Both strategies are able to augment the recognition performance.

[Table 2 (partial): recognition rates on the three datasets.]
                      Scene 15       Corel          Sports
I2CDML+SPM            81.2 ± 0.52    79.7 ± 1.83    89.8 ± 1.16
I2CDML+Weight         78.5 ± 0.74    81.3 ± 1.46    90.1 ± 0.94
I2CDML+SPM+Weight     83.7 ± 0.49    84.3 ± 1.52    91.4 ± 0.88

[Fig. 4: Comparing the performance of no spatial restriction (NS), spatial single level restriction (SSL) and spatial pyramid match (SPM) for both I2CDML and I2CDML+Weight on all three datasets. With only a spatial single level, it achieves better performance than without spatial restriction, although slightly lower than the spatial pyramid combination of multiple levels; but it requires much less computation cost for the feature NN search.]

Results - Caltech 101

[Fig. 5: Comparing the performances of I2CDML, I2CDML+Weight and NBNN from spatial divisions of 1×1 to 7×7 and the spatial pyramid combination (SPM) on Caltech 101.]

We extract less than 1000 features per image on average using our feature extraction strategy, about 1/20 of the size of the feature set in [1]. We also use the single level spatial restriction to constrain the NN search for acceleration.
For each image, we divide it from 2×2 to 7×7 subregions and test the performance.

Outline
Irani - In Defence of Nearest-Neighbor Based Image Classification
Wang - Image-to-Class Distance Metric Learning for Image Classification
Behmo - Towards Optimal Naive Bayes Nearest Neighbor

Behmo - Towards Optimal Naive Bayes Nearest Neighbor

Behmo, Marcombes, Dalalyan, Prinet - ECCV 2010
Towards Optimal Naive Bayes Nearest Neighbor

[Fig. 1: Subwindow detection for the original NBNN (red) and for our version of NBNN (green). Since the background class is more densely sampled than the object class, the original NBNN tends to select an object window that is too small relative to the object instance. As these examples show, our approach addresses this issue.]

Naive Bayes Nearest Neighbor (NBNN) is a classifier introduced in [1] that was designed to address this issue: NBNN is non-parametric, does not require any feature quantization step and thus uses to advantage the full discriminative power of visual features. However, in practice, we observe that NBNN performs relatively well on certain datasets, but not on others. To remedy this, we start by analyzing the theoretical foundations of NBNN. We show that this performance variability could stem from the assumption that the normalization factor involved in the kernel estimator of the conditional density of features is class-independent. We relax this assumption and provide a new formulation of NBNN which is richer than the original one. In particular, our approach is well suited for optimal, multi-channel image classification and object detection.

Contributions - Differences

• Differences in approach
1. Optimize the normalization factors of the Parzen window (learning)
2. Learn optimal combinations of different descriptors (channels)
3. Spatial Pyramid Matching
4. Classification by Detection using ESS

• Results
1. Copes with differently populated classes
2. Has higher performance!
3. but... is slow at both learning and testing!
NBNN reasoning steps
1. Naive Bayes assumption (eq. 1)
2. Parzen window estimator of the pdf (eq. 2)
3. Nearest neighbor approximation (eq. 3) (invalid removal of the normalization parameters)

Original NBNN

2.1 Initial Formulation of NBNN

In this section, we briefly recall the main arguments of NBNN described by Boiman et al. [1] and introduce some necessary notation. In an image I with hidden class label c_I, we extract K_I features (d_k^I)_k ∈ R^D. Under the naive Bayes assumption, and assuming all image labels are equally probable (P(c) ~ cte), the optimal prediction ĉ_I of the class label of image I maximizes the product of the feature probabilities conditioned on the class label:

ĉ_I = \arg\max_c \prod_{k=1}^{K_I} P(d_k^I | c).    (1)

The feature probability conditioned on the image class, P(d_k^I | c), can be estimated by a non-parametric kernel estimator, also called the Parzen-Rosenblatt estimator. If we denote by χ^c = {d_k^J | c_J = c, 1 ≤ k ≤ K_J} the set of all features from all training images that belong to class c, we can write:

P(d_k^I | c) = \frac{1}{Z} \sum_{d \in χ^c} \exp\left( \frac{-\| d_k^I - d \|^2}{2σ^2} \right),    (2)

where σ is the bandwidth of the density estimator. In [1], this estimator is further approximated by the largest term of the sum on the RHS. This leads to a quite simple expression:

∀d, ∀c,   -\log(P(d|c)) \approx \min_{d' \in χ^c} \| d - d' \|^2.    (3)

The decision rule for image I is thus:

ĉ_I = \arg\max_c P(I|c) = \arg\min_c \sum_k \min_{d \in χ^c} \| d_k^I - d \|^2.    (4)

This classifier is shown to outperform the usual nearest neighbor classifier. Moreover, it does not require any feature quantization step, and the descriptive power of image features is thus preserved. The reasoning above proceeds in three distinct steps: the naive Bayes assumption considers that image features are independent and identically distributed given the image class c_I (equation 1); then, the estimation of a feature probability density is obtained by a non-parametric density estimation process like the Parzen-Rosenblatt estimator (equation 2); finally, NBNN is based on the assumption that the logarithm of this value, which is a sum of distances, can be approximated by its largest term (equation 3). In the following section, we will show that the implicit simplification that consists in removing the normalization parameter from the density estimator is invalid in most practical cases.

Notation

Along with the notation introduced in this section, we will also need the notion of point-to-set distance, which is simply the squared Euclidean distance of a point to its nearest neighbor in the set: ∀Ω ⊂ R^D, ∀x ∈ R^D, τ(x, Ω) = \inf_{y \in Ω} \| x - y \|^2. In what follows, τ(x, χ^c) will be abbreviated as τ^c(x).

Correction of Nearest Neighbor Distance

2.2 Affine Correction of Nearest Neighbor Distance for NBNN

The most important theoretical limitation of NBNN is that, in order to obtain a simple approximation of the log-likelihood, the normalization factor 1/Z of the kernel estimator is assumed to be the same for all classes. Yet, there is no a priori reason to believe that this assumption is satisfied in practice. If this factor significantly varies from one class to another, then the approximation of the maximum a posteriori class label ĉ_I by equation 4 becomes unreliable. It should be noted that the objection that we raise does not concern the core hypotheses of NBNN, namely the naive Bayes hypothesis and the approximation of the sum of exponentials of equation 2 by its largest term. In fact, in the following we will essentially follow and extend the arguments presented in [1], using the same starting hypotheses.
Recall that the convergence speed of the Parzen-Rosenblatt (PR) estimator is K^{-4/(4+D)} [13]. This means that in the case of a 128-dimensional feature space, such as the SIFT feature space, in order to reach an approximation bounded by 1/2 we need to sample 2^{33} points. In practice, the PR estimator does not converge and there is little sense in keeping more than just the first term of the sum. Thus, the log-likelihood of a visual feature d relative to an image label c is:

-\log P(d|c) = -\log\left( \frac{1}{Z^c} \exp\left( \frac{-τ^c(d)}{2(σ^c)^2} \right) \right) = \frac{τ^c(d)}{2(σ^c)^2} + \log(Z^c),    (6)

where Z^c = |χ^c| (2π)^{D/2} (σ^c)^D. Recall that τ^c(d) is the squared Euclidean distance of d to its nearest neighbor in χ^c. In the above equations, we have replaced the class-independent notation σ, Z by σ^c, Z^c since, in general, there is no reason to believe that these parameters should be equal across classes. For instance, both parameters are functions of the number of training features of class c in the training set. Returning to the naive Bayes formulation, we obtain:

∀c,   -\log(P(I|c)) = \sum_{k=1}^{K_I} \left( \frac{τ^c(d_k^I)}{2(σ^c)^2} + \log(Z^c) \right) = α^c \sum_{k=1}^{K_I} τ^c(d_k^I) + K_I β^c,    (7)

where α^c = 1/(2(σ^c)^2) and β^c = \log(Z^c) is a re-parametrization of the log-likelihood (6) that has the advantage of being linear in the model parameters. The image label is then decided according to a criterion that is slightly different from equation 4:

ĉ_I = \arg\min_c \left( α^c \sum_{k=1}^{K_I} τ^c(d_k^I) + K_I β^c \right).    (8)

We note that this modified decision criterion can be interpreted in two different ways: it can either be interpreted as the consequence of a density estimator to which a multiplicative factor was added, or as an unmodified NBNN in which an affine correction has been added to the squared Euclidean distance.
In the former, the resulting formulation can be considered different from the initial NBNN. In the latter, equation 8 can be obtained from equation 4 simply by replacing τ^c(d) by α^c τ^c(d) + β^c (since α^c is positive, the nearest neighbor distance itself does not change). This formulation differs from [1] only in the evaluation of the distance function, leaving us with two parameters per class to be evaluated.

At this point, it is important to recall that the introduction of the parameters α^c and β^c does not violate the naive Bayes assumption, nor the assumption of equiprobability of classes. If a class is more densely sampled than others (i.e., its feature space contains more training samples), then the NBNN estimator will have a bias towards that class, even though it made the assumption that all classes are equally probable. The purpose of setting appropriate values for α^c and β^c is to correct this bias. It might be noted that deciding on a suitable value for α^c and β^c reduces to defining an appropriate bandwidth σ^c. Indeed, the dimensionality D of the feature space and the number |χ^c| of training feature points are known parameters. However, in practice, choosing a good value for the bandwidth parameter is time-consuming and inefficient. To cope with this issue, we designed an optimization scheme to find the optimal values of the parameters α^c, β^c with respect to cross-validation.

Combining descriptors

2.3 Multi-channel Image Classification

In the most general case, an image is described by different features coming from different sources or sampling methods. For example, we can sample SIFT features and local color histograms from an image. We observe that the classification criterion of equation 1 copes well with the introduction of multiple feature sources. The only difference should be the parameters for density estimation, since feature types correspond, in general, to different feature spaces. In order to handle different feature types, we need to introduce a few definitions: a channel χ maps an image I to a set of features χ(I) in some feature space R^{d_χ}. Channels can be defined arbitrarily: a channel can be associated to a particular detector/descriptor pair, but can also represent global image characteristics. For instance, an image channel can consist in a single element, such as the global color histogram. Let us assume we have defined a certain number of channels (χ_n)_{1≤n≤N} that are expected to be particularly relevant to the problem at hand. Adapting the framework of our modified NBNN to multiple channels is just a matter of changing notation. Similarly to the single-channel case, we aim here at estimating the class label of an image I:

ĉ_I = \arg\max_c P(I|c),   with   P(I|c) = \prod_n \prod_{d \in χ_n(I)} P(d|c).    (9)

Since different channels have different feature spaces, the density correction parameters should depend on the channel index: α^c, β^c will thus be noted α_n^c, β_n^c. The notation from the previous section is adapted in a similar way: we call χ_n^c = ∪_{J | c_J = c} χ_n(J) the set of all features from class c and channel n, and define the distance function of a feature d to χ_n^c by ∀d, τ_n^c(d) = τ(d, χ_n^c). This leads to the classification criterion:

ĉ_I = \arg\min_c \sum_n \left( α_n^c \sum_{d \in χ_n(I)} τ_n^c(d) + β_n^c |χ_n(I)| \right).    (10)

Naturally, when adding feature channels to our decision criterion, we wish to balance the importance of each channel relative to its relevance to the problem at hand.
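A minimal sketch of the corrected multi-channel decision rule of Eq. (10) (illustrative, with brute-force NN distances): each channel contributes its summed point-to-set distances scaled by α_n^c, plus a per-feature offset β_n^c. With α = 1 and β = 0 it reduces to the plain NBNN rule of Eq. (4).

```python
import numpy as np

def tau(x, feature_set):
    """Point-to-set distance: squared Euclidean distance of x to its NN in the set."""
    return np.min(np.sum((feature_set - x) ** 2, axis=1))

def corrected_nbnn_predict(query_channels, class_channels, alpha, beta):
    """query_channels[n]: (k_n, d_n) features of the test image in channel n.
    class_channels[c][n]: pooled training features of class c in channel n.
    alpha[c][n], beta[c][n]: learned correction parameters of Eq. (10)."""
    scores = {}
    for c, channels in class_channels.items():
        s = 0.0
        for n, chi_cn in enumerate(channels):
            q = query_channels[n]
            s += alpha[c][n] * sum(tau(x, chi_cn) for x in q) + beta[c][n] * len(q)
        scores[c] = s
    return min(scores, key=scores.get)

# toy usage: two classes, two channels (class names and parameter values are hypothetical)
rng = np.random.default_rng(7)
classes = {"bikes": [rng.normal(0, 1, (300, 8)), rng.normal(0, 1, (300, 4))],
           "cars":  [rng.normal(1, 1, (900, 8)), rng.normal(1, 1, (900, 4))]}
query = [rng.normal(0, 1, (40, 8)), rng.normal(0, 1, (40, 4))]
alpha = {c: [1.0, 1.0] for c in classes}
beta = {c: [0.0, 0.0] for c in classes}
print(corrected_nbnn_predict(query, classes, alpha, beta))
```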
Naturally, when adding feature channels to our decision criterion, we wish to balance the importance of each channel relative to its relevance to the problem at hand. Equation 10 shows that the function of relevance weighting can be assigned to the distance correction parameters. The problems of adequate channel balancing and nearest neighbor distance correction should thus be addressed in one single step. In the following section, we present a method to find the optimal values of these parameters.

2.4 Parameter Estimation

We now turn to the problem of estimating values of \alpha_n^c and \beta_n^c that are optimal for classification. For every class c and image I, define:

X_n^c(I) = \sum_{d\in\chi_n(I)} \tau_n^c(d), \qquad X_{N+n}^c(I) = |\chi_n(I)|, \qquad \forall n = 1,\ldots,N.   (11)

Parameter Estimation - Optimization

For every c, the vector X^c(I) can be considered as a global descriptor of image I. We also denote by \omega^c the (2N)-vector (\alpha_1^c,\ldots,\alpha_N^c,\beta_1^c,\ldots,\beta_N^c) and by W the matrix that results from the concatenation of the vectors \omega^c for the different values of c. Using this notation, the classifier we propose can be rewritten as:

\hat{c}_I = \arg\min_c (\omega^c)^\top X^c(I),   (12)

where (\omega^c)^\top stands for the transpose of \omega^c. This is close in spirit to the winner-takes-all classifier widely used for multiclass classification.

Given a labeled sample (I_i, c_i)_{i=1,\ldots,K} independent of the sample used for computing the sets \chi_n^c, we can define a constrained linear energy optimization problem that minimizes the hinge loss of a multi-channel NBNN classifier:

E(W) = \sum_{i=1}^{K} \max_{c \neq c_i} \left[1 + (\omega^{c_i})^\top X^{c_i}(I_i) - (\omega^c)^\top X^c(I_i)\right]_+,   (13)

where (x)_+ stands for the positive part of a real x. The minimization of E(W) can be recast as a linear program, since it is equivalent to minimizing \sum_i \xi_i subject to the constraints:

\xi_i \ge 1 + (\omega^{c_i})^\top X^{c_i}(I_i) - (\omega^c)^\top X^c(I_i), \quad \forall i = 1,\ldots,K, \ \forall c \neq c_i,   (14)
\xi_i \ge 0, \quad \forall i = 1,\ldots,K,   (15)
(\omega^c)^\top e_n \ge 0, \quad \forall n = 1,\ldots,N,   (16)

where e_n stands for the vector of R^{2N} having all coordinates equal to zero, except for the nth coordinate, which is equal to 1.
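Under the layout of eq. (11), the program (13)-(16) can be handed to any LP solver. The sketch below uses scipy.optimize.linprog; the input containers (a dict of precomputed global descriptors X^c(I_i), a list of labels, the channel count N) are hypothetical, and constraint (16) is expressed through variable bounds on the \alpha coordinates.

```python
import numpy as np
from scipy.optimize import linprog

def fit_correction_parameters(X, labels, classes, N):
    """Solve the LP of eqs. (13)-(16).

    X:       dict (i, c) -> 2N-dim global descriptor X^c(I_i) of validation image i.
    labels:  list of true labels c_i, one per validation image.
    classes: list of all class labels.
    N:       number of channels, so each omega^c = (alpha_1..alpha_N, beta_1..beta_N).
    Returns a dict c -> optimal omega^c.
    """
    K, C = len(labels), len(classes)
    col = {c: j * 2 * N for j, c in enumerate(classes)}   # first column of omega^c in x
    n_var = C * 2 * N + K                                 # all omega^c, then the slacks xi_i

    cost = np.zeros(n_var)
    cost[C * 2 * N:] = 1.0                                # objective: minimize sum_i xi_i

    A_ub, b_ub = [], []
    for i, ci in enumerate(labels):
        for c in classes:
            if c == ci:
                continue
            # (omega^{c_i})^T X^{c_i}(I_i) - (omega^c)^T X^c(I_i) - xi_i <= -1   (eq. 14)
            row = np.zeros(n_var)
            row[col[ci]:col[ci] + 2 * N] += X[(i, ci)]
            row[col[c]:col[c] + 2 * N] -= X[(i, c)]
            row[C * 2 * N + i] = -1.0
            A_ub.append(row)
            b_ub.append(-1.0)

    # alphas non-negative (eq. 16), betas free, slacks non-negative (eq. 15)
    bounds = []
    for _ in classes:
        bounds += [(0, None)] * N + [(None, None)] * N
    bounds += [(0, None)] * K

    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=bounds, method="highs")
    return {c: res.x[col[c]:col[c] + 2 * N] for c in classes}
```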
This linear program can be solved quickly for a relatively large number of channels and images. In practice, the number of channels should be kept small relative to the number of training samples to avoid overfitting.

Classification by detection

The likelihood of a feature given the object class and position only depends on the point belonging or not to the object:

\forall n, d, c, \quad -\log P(d|c,\pi) = \begin{cases} \tau_n^c(d) & \text{if } d \in \pi \\ \bar{\tau}_n^c(d) & \text{if } d \notin \pi. \end{cases}   (17)

In the above equation, we have written the feature-to-set distance functions \tau_n^c and \bar{\tau}_n^c without apparent density correction in order to alleviate the notation. We leave to the reader the task of replacing \tau_n^c by \alpha_n^c\tau_n^c + \beta_n^c in the equations of this section. The image log-likelihood function is now decomposed over all features inside and outside the object: E(I,c,\pi) = -\log P(I|c,\pi) = \sum_n \left[\sum_{d\in\pi}\tau_n^c(d) + \sum_{d\notin\pi}\bar{\tau}_n^c(d)\right]. The term on the RHS can be rewritten:

E(I,c,\pi) = \sum_n \left[\sum_{d\in\pi}\left(\tau_n^c(d) - \bar{\tau}_n^c(d)\right) + \sum_d \bar{\tau}_n^c(d)\right].   (18)

Observing that the second sum on the RHS does not depend on \pi, we get E(I,c,\pi) = E_1(I,c,\pi) + E_2(I,c), where E_1(I,c,\pi) = \sum_n\sum_{d\in\pi}\left(\tau_n^c(d) - \bar{\tau}_n^c(d)\right) and E_2(I,c) = \sum_n\sum_d \bar{\tau}_n^c(d). Let us define the optimal object position \hat{\pi}^c relative to class c as the position that minimizes the first energy term: \hat{\pi}^c = \arg\min_\pi E_1(I,c,\pi) for all c. Then, we can obtain the most likely image class and object position by:

\hat{c}_I = \arg\min_c \left(E_1(I,c,\hat{\pi}^c) + E_2(I,c)\right), \qquad \hat{\pi}_I = \hat{\pi}^{\hat{c}_I}.   (19)

For any class c, finding the rectangular window \hat{\pi}^c that is the most likely candidate can be done naively by exhaustive search, but this proves prohibitive. Instead, we make use of fast branch and bound subwindow search [2]. The method used to search for the image window that maximizes the prediction of a linear SVM can be generalized to any classifier that is linear in the image features, such as our optimal multi-channel NBNN. In short, the most likely class label and object position for a test image I are found by the following algorithm:

Detection Algorithm
1: declare variables ĉ, π̂
2: Ê = +∞
3: for each class label c do
4:   find π̂^c by efficient branch and bound subwindow search
5:   π̂^c = arg min_π E_1(I, c, π)
6:   if E_1(I, c, π̂^c) + E_2(I, c) < Ê then
7:     Ê = E_1(I, c, π̂^c) + E_2(I, c)
8:     ĉ = c
9:     π̂ = π̂^c
10:  end if
11: end for
12: return ĉ, π̂
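As an illustration of the decomposition (18)-(19), the following sketch scores candidate windows with a brute-force loop standing in for the branch-and-bound subwindow search of [2]; the per-feature distance arrays (already summed over channels) and the window list are assumed inputs, not part of the paper.

```python
import numpy as np

def detect_class_and_window(tau_obj, tau_bg, positions, candidate_windows):
    """Sketch of the detection criterion of eqs. (18)-(19).

    tau_obj, tau_bg:   dicts c -> (K,) arrays of the inside/outside feature-to-set
                       distances for the K features of the test image, summed over channels.
    positions:         (K, 2) array of feature coordinates.
    candidate_windows: iterable of (x0, y0, x1, y1) rectangles to score.
    """
    best_class, best_window, best_energy = None, None, np.inf
    for c in tau_obj:
        gain = tau_obj[c] - tau_bg[c]          # per-feature contribution to E_1
        E2 = tau_bg[c].sum()                   # second term of (18): independent of the window
        E1_min, window = np.inf, None
        for (x0, y0, x1, y1) in candidate_windows:   # stands in for branch-and-bound [2]
            inside = ((positions[:, 0] >= x0) & (positions[:, 0] <= x1) &
                      (positions[:, 1] >= y0) & (positions[:, 1] <= y1))
            E1 = gain[inside].sum()
            if E1 < E1_min:
                E1_min, window = E1, (x0, y0, x1, y1)
        if E1_min + E2 < best_energy:
            best_class, best_window, best_energy = c, window, E1_min + E2
    return best_class, best_window
```

Because E_1 is just a sum of per-feature weights over a rectangular window, it is exactly the kind of quality function that the branch-and-bound scheme of [2] can bound efficiently.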
4 Experiments

Our optimal NBNN classifier was tested on three datasets: Caltech-101 [15], SceneClass 13 [16] and Graz-02 [14]. In each case, the training set was divided into two equal parts for parameter selection. Classification results are expressed in percent and reflect the rate of good classification, per class or averaged over all classes.

Fast NN search - LSH

A major practical limitation of NBNN and of our approach is the computational time necessary for nearest neighbor search, since the sets of potential nearest neighbors to explore can contain on the order of 10^5 to 10^6 points. We thus need to implement an appropriate search method. However, the dimensionality of the descriptor space can also be quite large, and traditional exact search methods, such as kd-trees or vantage point trees [17], are inefficient. We chose Locality Sensitive Hashing (LSH) and addressed the thorny issue of parameter tuning by using multi-probe LSH [18] with a recall rate of 0.8. We observed that the resulting classification performance is not overly sensitive to small variations in the required recall rate. Computation speed, however, is: compared to exhaustive naive search, the observed speed increase was more than ten-fold. Further improvement in the execution times can be achieved using recent approximate NN-search methods [19,20].
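For intuition only, here is a toy random-hyperplane LSH index; it is not the multi-probe implementation of [18] used in the paper, and all names and parameters below are invented for the sketch.

```python
import numpy as np

class SimpleLSH:
    """Toy random-hyperplane LSH index for approximate nearest-neighbor search."""

    def __init__(self, data, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.data = data                                   # (N, D) training descriptors
        self.planes = rng.standard_normal((data.shape[1], n_bits))
        self.buckets = {}
        for idx, code in enumerate(self._codes(data)):
            self.buckets.setdefault(code, []).append(idx)

    def _codes(self, x):
        bits = (x @ self.planes) > 0                       # sign of projection on each hyperplane
        return [tuple(row) for row in bits]

    def nn_sq_dist(self, q):
        """Approximate squared distance of q to its nearest neighbor in the index."""
        cand = self.buckets.get(self._codes(q[None, :])[0], [])
        if not cand:                                       # empty bucket: fall back to a full scan
            cand = range(self.data.shape[0])
        d2 = ((self.data[list(cand)] - q) ** 2).sum(-1)
        return d2.min()
```

In an NBNN-style pipeline, one such index would be built per class (and per channel), and nn_sq_dist would replace the brute-force minimum over all training descriptors.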
Let us describe the databases used in our experiments.

Caltech-101 (5 classes). This dataset includes the five most populated classes of the Caltech-101 dataset: faces, airplanes, cars-side, motorbikes and background. These images present relatively little clutter and variation in object pose. Images were resized to a maximum of 300 × 300 pixels prior to processing. The training and testing sets both contain 30 randomly chosen images per class. Each experiment was repeated 20 times and we report the average results over all experiments.

SceneClass 13. Each image of this dataset belongs to one of 13 indoor and outdoor scene categories.

4.1 Single-Channel Classification

Results on 1 SIFT and 5 SIFT descriptors

The impact of optimal parameter selection on NBNN is measured by performing image classification with just one feature channel. We chose SIFT features [22] for their relative popularity. Results are summarized in Tables 1 and 2.

Table 1. Performance comparison between the bag of words classified by linear and χ2-kernel SVM, the NBNN classifier and our optimal NBNN.

Datasets            BoW/SVM       BoW/χ2-SVM    NBNN [1]      Optimal NBNN
SceneClass13 [16]   67.85 ±0.78   76.7 ±0.60    48.52 ±1.53   75.35 ±0.79
Graz02 [14]         68.18 ±4.21   77.91 ±2.43   61.13 ±5.61   78.98 ±2.37
Caltech101 [15]     59.2 ±11.89   89.13 ±2.53   73.07 ±4.02   89.77 ±2.31

In Table 1, the first two columns refer to the classification of bags of words by linear SVM and by χ2-kernel SVM. In all three experiments we selected the most efficient codebook size (between 500 and 3000), and feature histograms were normalized by their L1 norm. Furthermore, only the results for the χ2-kernel SVM with the best possible value (in a finite grid) of the smoothing parameter are reported. In Table 2, we omitted the results of BoW/SVM because of their clear inferiority w.r.t. BoW/χ2-SVM.

Table 2. Performance comparison between the bag of words classified by χ2-kernel SVM, the NBNN classifier and our optimal NBNN. Per-class results for the Caltech-101 (5 classes) dataset.

Class       BoW/χ2-SVM     NBNN [1]       Optimal NBNN
Airplanes   91.99 ± 4.87   34.17 ±11.35   95.00 ± 3.25
Car-side    96.16 ± 3.84   97.67 ± 2.38   94.00 ± 4.29

Table 3. Caltech101 (5 classes): Influence of various radiometry invariant features. Best and worst SIFT invariants are highlighted in blue and red, respectively.

Feature              BoW/χ2-SVM    NBNN [1]      Optimal NBNN
SIFT                 88.90 ±2.59   73.07 ±4.02   89.77 ±2.31
OpponentSIFT         89.90 ±2.18   72.73 ±6.01   91.10 ±2.45
rgSIFT               86.03 ±2.63   80.17 ±3.73   85.17 ±4.86
cSIFT                86.13 ±2.76   75.43 ±3.86   86.87 ±3.23
Transf. color SIFT   89.40 ±2.48   73.03 ±5.52   90.01 ±3.03

With the optimal NBNN, OpponentSIFT performs best among the evaluated descriptors, with a 91.10% good classification rate, while rgSIFT performs worst, with 85.17%. Thus, a wrong evaluation of the feature space properties undermines the descriptor performance.

4.3 Multi-channel Classification

The notion of channel is sufficiently versatile to be adapted to a variety of different contexts. In this experiment, we borrow the idea developed in [4] to subdivide the image into different spatial regions. We consider that an image channel associated to a certain image region is composed of all features that are located inside this region. In practice, image regions are regular grids of fixed size. We conducted experiments on the SceneClass13 dataset.

Results on Spatial Pyramid Matching

Fig. 2. Feature channels as image subregions: 1 × 1, 1 × 2, 1 × 3, 1 × 4.

Table 4. Multi-channel classification, SceneClass13 dataset.

Channels        #channels   NBNN    Optimal NBNN
1×1             1           48.52   75.35
1×1 + 1×2       3           53.59   76.10
1×1 + 1×3       4           55.24   76.54
1×1 + 1×4       5           55.37   78.26

Results on Classification by Detection

Fig. 3. Subwindow detection for NBNN (red) and optimal NBNN (green). For this experiment, all five SIFT radiometry invariants were combined (see Section 4.4).

It can be observed that the non-parametric NBNN usually converges towards an optimal object window that is too small relative to the object instance. This is due to the fact that the background class is more densely sampled. Consequently, the nearest neighbor distance gives an estimate of the probability density that is too large. It was precisely to address this issue that optimal NBNN was designed.

5 Conclusion