Emotion Related Structures in Large Image Databases
Martin Solli and Reiner Lenz
ITN, Linköping University, SE-60174 Norrköping, Sweden
[email protected], [email protected]

ABSTRACT

We introduce two large databases consisting of 750 000 and 1.2 million thumbnail-sized images, labeled with emotion-related keywords. The smaller database consists of images from Matton Images, an image provider. The larger database consists of web images that were indexed by the crawler of the image search engine Picsearch. The images in the Picsearch database belong to one of 98 emotion-related categories and contain meta-data in the form of secondary keywords, the originating website and some view statistics. We use two psycho-physics related feature vectors based on the emotional impact of color combinations, the standard RGB histogram and two SIFT-related descriptors to characterize the visual properties of the images. These features are then used in two-class classification experiments to explore the discrimination properties of emotion-related categories. The clustering software and the classifiers are available in the public domain, and the same standard configurations are used in all experiments. Our findings show that for emotional categories, descriptors based on global image statistics (global histograms) perform better than local image descriptors (bag-of-words models). This indicates that content-based indexing and retrieval using emotion-based approaches are fundamentally different from the dominant object-recognition based approaches, for which SIFT-related features are the standard descriptors.

Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Indexing methods; I.4.9 [Image Processing and Computer Vision]: Applications

General Terms: Theory, Experimentation

Keywords: Image databases, image indexing, emotions

1. INTRODUCTION

Most available systems for image classification and Content Based Image Retrieval are focused on object and scene recognition. Typical tasks are classifying or finding images containing vehicles, mountains, animals, faces, etc. (see for instance Liu et al. [20] for an overview). In recent years, however, the interest in classification and retrieval methods incorporating emotions and aesthetics has increased. In a survey by Datta et al. [8], the subject is listed as one of the upcoming topics in Content Based Image Retrieval. The main question dealt with in this paper is whether it is possible to use simple statistical measurements, for instance global image histograms, to classify images into categories related to emotions and aesthetics. We compare the image database from an Internet search service provider, where images and meta-data are crawled from the Internet (this database is made available to the research community), with a database from an image provider, where images are selected and categorized by professionals. This leads us to our second question: Is there a difference between classification in databases collected from the Internet and databases used by professionals? Can we increase the performance on the Internet database by training a supervised learning algorithm on the professional database?

In our experiments we compare five different image descriptors, four of which incorporate color information. Three descriptors are based on global image histograms, and two are bag-of-words strategies, where the descriptors are histograms derived for local image patches. Standard tools are used for clustering in the feature space (if needed), and for building the classification model by supervised learning. Earlier attempts have been made at comparing global and local image descriptors (recently by Douze et al. [12], and van de Sande et al. [28]), but not in an emotion context.
A general problem with investigations involving emotions and aesthetics is the lack of ground truth. The concepts of emotions and aesthetics are influenced by many factors, such as perception, cognition, and culture. Even if research has shown that people in general perceive color emotions in similar ways, some individuals may have different opinions. The most obvious example is people who have reduced color sensitivity. Moreover, since our predictions are mainly based on color content, it is of course possible to find images with other types of content, for instance objects in the scene, that are the cause of the perceived emotions. Consequently, we cannot expect a correct classification result for every possible image, or a result that will satisfy every possible user.

Finally, we emphasize that this paper mainly addresses visual properties within image databases, and not linguistic properties. Moreover, the term high-level semantics refers to emotional and aesthetical properties, and we use keywords in English (or American English) in all investigations.

2. RELATED WORK

Even if the interest in image emotions and aesthetics has increased, relatively few papers address the problem of including those topics in image indexing. Papers by Berretti et al. [1], and Corridoni et al. [6][5], were among the earliest in this research field. They used clustering in the CIELUV color space, together with a modified k-means algorithm, for segmenting images into homogeneous color regions. Fuzzy sets are then used to convert intra-region properties (warmth, luminance, etc.) and inter-region properties (hue, saturation, etc.) to a color description language. A similar method is proposed by Wang and Yu [30]. Cho and Lee [3] present an approach where features are extracted from wavelets and average colors. Hong and Choi [14] propose a search scheme called FMV (Fuzzy Membership Value) Indexing, which allows users to retrieve images based on high-level keywords, such as "soft", "romantic", etc. Emotion concepts are derived using the HSI color space.

We mention three papers that to some extent incorporate psychophysical experiments in the image retrieval model. Yoo [32] proposes a method using descriptors called query color code and query gray code, which are based on human evaluation of color patterns on 13 emotion scales. The database is queried with one of the emotions, and a feedback mechanism is used for dynamically updating the search result. Lee et al. [17] present an emotion-based retrieval system based on rough set theory. They use category scaling experiments, where observers judge random color patterns on three emotion scales.
Intended applications are primarily within indexing of wallpapers etc. Wang et al. [31] identify an orthogonal three-dimensional emotion space and design three emotion factors. From histogram features, emotional factors are predicted using a Support Vector Machine. The method was developed and evaluated for artwork. Datta et al. [7][9] use photos from a photo-sharing web page, peer-rated on two qualities, aesthetics and originality. Visual or aesthetical image features are extracted, and the relationship between observer ratings and extracted features is explored through Support Vector Machines and classification trees. The primary goal is to build a model that can predict the quality of an image. In [10], Datta et al. discuss future possibilities for including image aesthetics and emotions in, for instance, image classification.

Within color science, research on color emotions, especially for single colors and two-color combinations, has been an active research field for many years. Such experiments usually incorporate observer studies, and are often carried out in controlled environments. See [21][22][23][13][16] for a few examples. The concept of harmony has also been investigated in various contexts; see for instance Cohen-Or et al. [4], where harmony is used to beautify images. Labeling and search strategies based on emotions have not yet been used in commercial search engines, with one exception, the Japanese emotional visual search engine EVE (http://amanaimages.com/eve/, with Japanese documentation and user interface).

3. IMAGE COLLECTIONS

Our experiments are based on two different databases, containing both images and meta-data, such as keywords or labels. The first database is collected from the image search engine belonging to Picsearch AB (publ) (http://www.picsearch.com/). The second database is supplied by the image provider Matton Images (http://www.matton.com/). In the past 5-10 years, numerous large image databases have been made publicly available by the research community (examples are ImageNet [11], CoPhIR [2], and the database used by Torralba et al. [27]). However, most databases target objects and scenes, and very few address high-level semantics, such as emotions and aesthetics. To help others conducting research in this upcoming topic within image indexing, major parts of the Picsearch database are made available for public download at http://diameter.itn.liu.se/emodb/.

Picsearch: The Picsearch database contains 1.2 million thumbnail images, with a maximum size of 128 pixels (height or width). Original images were crawled from public web pages around March 2009, using 98 keywords related to emotions, like "calm", "elegant", "provocative", "soft", etc. Thumbnails were shown to users visiting the Picsearch image search engine, and statistics of how many times each image was viewed and clicked during March through June 2009 were recorded. The ratio between the number of clicks and the number of views (for images that have been viewed at least 50 times) is used as a rough estimate of popularity. The creation of the database of course depends on how people present images on their websites, and on the performance of the Picsearch crawler when interpreting the sites. A challenging but interesting aspect of images gathered from the Internet is that the images were labeled (indirectly) by a wide range of people, not knowing that their images and text would be used in emotional image classification. The drawback is that we cannot guarantee the quality of the data, since images and keywords have not been checked afterwards. Picsearch uses a "family filter" for excluding non-appropriate images; otherwise we can expect all kinds of material in the database.
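To make the popularity estimate concrete, the following is a minimal sketch of the click/view ratio described above, assuming hypothetical per-image view and click counters; the field names and the handling of low-view images are our own illustration, not code from the Picsearch pipeline.

```python
def popularity(views: int, clicks: int, min_views: int = 50):
    """Rough popularity estimate: the click/view ratio, defined only for
    images viewed at least min_views times (helper name is hypothetical)."""
    if views < min_views:
        return None  # too few views for a meaningful estimate
    return clicks / views

# Hypothetical usage: rank (image_id, views, clicks) records by popularity.
records = [("img_001", 120, 9), ("img_002", 40, 5), ("img_003", 300, 12)]
scored = [(popularity(v, c), img) for img, v, c in records
          if popularity(v, c) is not None]
scored.sort(reverse=True)  # most popular images first
```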
Matton: The Matton database is a subset of a commercial image database maintained by Matton Images in Stockholm, Sweden. Our database contains 750 000 images. Each image is accompanied by a set of keywords assigned by professionals. Even if the images originate from different sources, we expect the labeling in the Matton database to be more consistent than in the Picsearch database. However, we do not know how the professionals choose the high-level concepts when labeling the images. To match the Picsearch database, all images are scaled to a maximum size of 128 pixels.

3.1 Data Sets for Training and Testing

The Picsearch database was created from 98 keywords that are related to emotions. To illustrate two-category classification, we selected pairs of keywords that represent opposing emotional properties, like vivid-calm. The experiments are done with the 10 pairs of keywords given in Table 1. Opposing emotions were selected based on an intuitive feeling. In upcoming research, however, a linguist or psychologist can be involved to establish the relevance of each selected pair.

Table 1: The Keyword Pairs Used in the Experiments, Representing Opposing Emotion Properties.

  1: calm-intense      6: formal-vivid
  2: calm-vivid        7: intense-quiet
  3: cold-warm         8: quiet-wild
  4: colorful-pure     9: soft-spicy
  5: formal-wild      10: soft-vivid

For keyword k we extract subsets of the Picsearch database, denoted Pi, and the Matton database, denoted Ma. Image number i, belonging to the keyword subset k, is denoted Pi(k, i) and Ma(k, i) respectively.

Picsearch subsets: All images in each category Pi(k, :) are sorted based on the popularity measurement described earlier in this section. A category subset is created, containing 2000 images: the 1000 most popular images, saved in Pi(k, 1, ..., 1000), and 1000 images randomly sampled from the remaining ones, saved in Pi(k, 1001, ..., 2000).

Matton subsets: Since we do not have a popularity measurement for the Matton database, only 1000 images are extracted and indexed in Ma(k, 1, ..., 1000), using each keyword k as query. For two categories the query resulted in fewer than 1000 images (intense: 907 images, spicy: 615 images).

We divide each image category into one training and one test set, where every second image, i = 1:2:end, is used for training, and the remaining ones, i = 2:2:end, are used for testing (see the sketch below). If one of the categories contains fewer images than the other (which occurs when working with the categories intense or spicy from the Matton database, or when we mix data from the two databases), the number of images in the opposing category is reduced to the same size by sampling the subset.
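A minimal sketch of the subset construction and the odd/even split described above; the function names, the random seed, and the assumption of precomputed popularity scores are our own additions.

```python
import random

def picsearch_subset(images, popularity_score, n_top=1000, n_rand=1000, seed=0):
    """Build Pi(k, :): the n_top most popular images followed by n_rand images
    randomly sampled from the remaining ones (illustrative helper)."""
    ranked = sorted(images, key=lambda im: popularity_score[im], reverse=True)
    top, rest = ranked[:n_top], ranked[n_top:]
    random.Random(seed).shuffle(rest)
    return top + rest[:n_rand]

def split_train_test(subset, match_size=None, seed=0):
    """Every second image (i = 1:2:end) for training, the rest for testing.
    If the opposing category is smaller, first sample down to match_size."""
    if match_size is not None and match_size < len(subset):
        subset = random.Random(seed).sample(subset, match_size)
    return subset[0::2], subset[1::2]  # (training set, test set)
```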
4. IMAGE DESCRIPTORS

The selection of image descriptors is a fundamental problem in computer vision and image processing. State-of-the-art solutions in object and scene classification often involve bag-of-features models (also known as bag-of-words models), where interest points, for instance corners or local maxima/minima, are detected in each image. The characteristics of a patch surrounding each interest point are saved in a descriptor, and various training methods can be used for finding relationships between descriptor content and categories belonging to specified objects or scenes. Finding interest points corresponding to corners etc. works well in object and scene classification, but using the same descriptors in classification of emotional and aesthetical properties can be questioned. Instead, we assume that the overall appearance of the image, especially the color content, plays an important role. Consequently, we focus our initial experiments on global image descriptors, like histograms. A related approach that is assumed to be useful in emotional classification is to classify the image based on homogeneous regions, and transitions between regions, as proposed in [26]. We select three different global histogram descriptors for our experiments. For comparison, we also include two implementations of traditional bag-of-words models, where the descriptors are histograms for image patches corresponding to found interest points. Other image descriptors could also be used, but a comprehensive comparison between different descriptors is beyond the scope of this initial study. The five image descriptors are listed below (see the sketch after the list).

RGB-histogram: We choose the commonly used 512-bin RGB histogram, where each color channel is quantized into 8 equally sized bins.

Emotion-histogram: A descriptor proposed in [25], where 512-bin RGB histograms are transformed to 64-bin emotion histograms. A kd-tree decomposition is used for mapping bins from the RGB histogram to a three-dimensional emotion space, spanned by three emotions related to human perception of colors. The three emotions incorporate the scales warm-cool, active-passive, and light-heavy. The color emotion metric used originates from perceptual color science, and was derived from psychophysical experiments by Ou et al. [21][22][23]. The emotion metric was originally designed for single colors, and later extended to include pairs of colors. In [25], the model was extended to images and used in image retrieval. In general, an emotion metric is a set of equations that defines the relationship between a common color space (for instance CIELAB) and a space spanned by emotion factors. The emotion factors are usually derived in psychophysical experiments.

Bag-of-Emotions: A color-based emotion-related image descriptor, described in [26]. It is based on the same emotion metric as the emotion histogram mentioned above. For this descriptor, the assumption is that perceived color emotions in images are mainly affected by homogeneous regions, defined by the emotion metric, and by transitions between regions. RGB coordinates are converted to emotion coordinates, and for each emotion channel, statistical measurements of gradient magnitudes within a stack of low-pass filtered images are used for finding interest points corresponding to homogeneous regions and transitions between regions. Emotion characteristics are derived for patches surrounding each interest point, and the contributions from all patches are saved in a bag-of-emotions, which is a 112-bin histogram. Notice that the result is a single histogram describing the entire image, and not a set of histograms (or other descriptors) as in ordinary bag-of-features models.

SIFT: The Scale Invariant Feature Transform is a standard tool in computer vision and image processing. Here we use the SIFT implementation by Andrea Vedaldi (http://www.vlfeat.org/~vedaldi/), both for detecting interest points and for computing the descriptors. The result is a 128-bin histogram for each interest point.

Color descriptor: The color descriptor proposed by van de Weijer and Schmid [29] is an extension of SIFT, where photometric invariant color histograms are added to the original SIFT descriptor. Experiments have shown that the color descriptor can outperform similar shape-based approaches in many matching, classification, and retrieval tasks.
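A minimal sketch of the 512-bin RGB histogram (8 equally sized bins per channel) described in the first item above, written with NumPy; the function name and the final normalization step are our own choices, not specified in the paper.

```python
import numpy as np

def rgb_histogram(image: np.ndarray, bins_per_channel: int = 8) -> np.ndarray:
    """512-bin RGB histogram: each 8-bit channel quantized into 8 equal bins.
    `image` is an (H, W, 3) uint8 array; returns a flat 512-element vector."""
    # Map intensities 0..255 to bin indices 0..bins_per_channel-1 per channel.
    q = (image.astype(np.uint16) * bins_per_channel) // 256
    # Combine the three channel indices into one bin index 0..511.
    idx = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins_per_channel ** 3).astype(float)
    return hist / hist.sum()  # L1 normalization (our own choice)
```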
In the experimental part of the paper, the above descriptors are referred to as "RGB", "ehist", "ebags", "SIFT", and "cdescr". For SIFT and the Color descriptor, individual descriptors are obtained for each interest point found. The average numbers of interest points found per image in the Picsearch and Matton databases are 104 and 115, respectively. The difference originates from a difference in image size: it is possible to find images smaller than 128 pixels in the Picsearch database, whereas such images are not present in the Matton database. The number of found interest points is believed to be sufficient for the intended application.

A common approach in bag-of-words models is to cluster the descriptor space (also known as codebook generation), and by vector quantization obtain a histogram that describes the distribution over cluster centers (the distribution over codewords). We use k-means clustering with 500 clusters and 10 restarts, each with a new set of initial centroids. Cluster centers are saved for the restart that returns the minimum within-cluster sum of point-to-centroid distances. Clustering is carried out in each database separately, with 1000 descriptors randomly collected from each of the 12 categories used in the experiments. State-of-the-art solutions in image classification often use codebooks of even greater size, but since we use thumbnail images, where the number of found interest points is relatively low, we find it appropriate to limit the size to 500 cluster centers. Preliminary experiments with an increased codebook size did not result in increased performance.
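A minimal sketch of the codebook generation and vector quantization step described above, using scikit-learn's KMeans as a stand-in for the public-domain clustering tool actually used in the paper; the array names, the random seed, and the L1 normalization are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors: np.ndarray, k: int = 500, restarts: int = 10):
    """Cluster sampled descriptors into k codewords; 500 clusters and 10
    restarts follow the setup above. KMeans keeps the restart with the
    lowest within-cluster sum of squared distances (inertia)."""
    km = KMeans(n_clusters=k, n_init=restarts, random_state=0)
    km.fit(train_descriptors)
    return km

def bow_histogram(image_descriptors: np.ndarray, km: KMeans) -> np.ndarray:
    """Vector-quantize one image's descriptors and return the distribution
    over codewords (L1-normalized; normalization is our own choice)."""
    words = km.predict(image_descriptors)
    hist = np.bincount(words, minlength=km.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```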
4.1 Dimensionality Reduction

For many types of image histograms, the information contained in the histograms is in general quite similar; therefore the histograms can be described in a lower-dimensional subspace with only a minor loss in discriminative properties (see for instance Lenz and Latorre-Carmona [18]). There are two advantages of reducing the number of bins in the histograms: 1) storage and computational savings, and 2) easier comparison between methods when the number of dimensions is the same. We use Principal Component Analysis (the Karhunen-Loève expansion) to reduce the number of dimensions, keeping the dimensions with the highest variance.

5. CLASSIFICATION METHODS

With images separated into opposing emotion pairs, the goal is to predict which emotion an unlabeled image should belong to. A typical approach is to use some kind of supervised learning algorithm. Common methods include Support Vector Machines, the Naive Bayes classifier, Neural Networks, and decision tree classifiers. In this work we utilize a Support Vector Machine, SVMlight, by Thorsten Joachims [15]. For simplicity and reproducibility, all experiments are carried out with default settings, which, for instance, means that we use a linear kernel function.

5.1 Probability Estimates

Most supervised learning algorithms produce classification scores that can be used for ranking examples in the test set. However, in some situations it might be more useful to obtain an estimate of the probability that each example belongs to the category of interest. Various methods have been proposed for finding these estimates. We adopt a method proposed by Lin et al. [19], which is a modification of a method proposed by Platt [24]. The basic idea is to estimate the probability using a sigmoid function. Given training examples $x_i \in \mathbb{R}^n$ with corresponding labels $y_i \in \{-1, 1\}$, and the decision function $f(x)$ of the SVM, the category probability $\Pr(y = 1 \mid x)$ can be approximated by the sigmoid function

$$\Pr(y = 1 \mid x) = \frac{1}{1 + \exp(A f(x) + B)} \qquad (1)$$

where $A$ and $B$ are estimated by solving

$$\min_{(A,B)} \left\{ -\sum_{i=1}^{l} \bigl( t_i \log(p_i) + (1 - t_i) \log(1 - p_i) \bigr) \right\} \qquad (2)$$

with

$$p_i = \frac{1}{1 + \exp(A f(x_i) + B)}, \qquad
t_i =
\begin{cases}
  \dfrac{N_p + 1}{N_p + 2} & \text{if } y_i = 1 \\[6pt]
  \dfrac{1}{N_n + 2} & \text{if } y_i = -1
\end{cases} \qquad (3)$$

where $N_p$ and $N_n$ are the numbers of examples belonging to the positive and negative category, respectively.

5.2 Probability Threshold

When working with emotions and aesthetics, it is hard to define the ground truth, especially for images lying close to the decision plane. We can refer to these images as "neutral", or as images that belong to neither of the keywords. Since our databases, especially the Picsearch database, contain data of varying quality, we suspect that a large portion of the images might belong to a "neutral" category. By defining a probability threshold t, we only classify images with a probability estimate $p_i$ that lies outside the interval $0.5 - t \le p_i \le 0.5 + t$. Images with a probability estimate above the threshold, $p_i > 0.5 + t$, are assigned to the positive category, and images with a probability estimate below the threshold, $p_i < 0.5 - t$, are assigned to the negative category. This method is only applicable when the intended application allows an exclusion of images from the two-category classification. On the other hand, for the images that do receive a label, the accuracy will in most cases increase. Moreover, a probability threshold can be useful if we apply the method to a completely unknown image, where we do not know whether the image should belong to one of the keywords.
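A minimal sketch of the calibrated probability of Eq. (1) and the threshold rule of Section 5.2, assuming the parameters A and B have already been fitted by minimizing Eq. (2); the function and variable names are our own.

```python
import math

def platt_probability(f_x: float, A: float, B: float) -> float:
    """Eq. (1): Pr(y = 1 | x) ~ 1 / (1 + exp(A * f(x) + B)), where f(x) is the
    SVM decision value and A, B are assumed to be already fitted via Eq. (2)."""
    return 1.0 / (1.0 + math.exp(A * f_x + B))

def thresholded_label(p: float, t: float = 0.1):
    """Section 5.2: positive class if p > 0.5 + t, negative class if
    p < 0.5 - t, otherwise the image is left unlabeled ('neutral')."""
    if p > 0.5 + t:
        return +1
    if p < 0.5 - t:
        return -1
    return None  # "neutral" image, excluded from the two-category decision
```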
6. EXPERIMENTS

In all experiments, the classification accuracy is given by the proportion of correctly labeled images. A figure of, for instance, 0.8 means that a correct label was predicted for 80% of the images in the test set.

6.1 Classification Statistics

The results of the two-category classifications in the Picsearch and Matton databases can be seen in Table 2 and Table 3, respectively. Here we use the default descriptor sizes: 512 bins for the RGB histogram, 64 bins for the emotion histogram, 112 bins for the bag-of-emotions, and 500 bins (equal to the codebook size) for the bag-of-features models using SIFT or the Color descriptor. We notice that the classification accuracy varies between keyword pairs and descriptors. Higher accuracy is usually obtained for keyword pairs 4 (colorful-pure), 5 (formal-wild), 6 (formal-vivid), and 9 (soft-spicy), whereas pairs 1 (calm-intense), 3 (cold-warm), and 10 (soft-vivid) perform poorer. The three models based on global histograms perform better than the bag-of-features models using region-based histograms. Moreover, the classification accuracy increases when the Matton database is used.

Table 2: Classification Accuracy in the Picsearch Database, with Default Descriptor Sizes (t=0).

  kw pair   RGB   ehist  ebags  SIFT  cdescr  mean
  1         0.62  0.60   0.61   0.57  0.56    0.59
  2         0.60  0.58   0.62   0.57  0.57    0.59
  3         0.60  0.60   0.59   0.52  0.54    0.57
  4         0.63  0.61   0.65   0.59  0.61    0.62
  5         0.64  0.62   0.62   0.65  0.67    0.64
  6         0.69  0.67   0.69   0.61  0.61    0.65
  7         0.61  0.59   0.60   0.55  0.55    0.58
  8         0.58  0.57   0.60   0.58  0.60    0.59
  9         0.69  0.69   0.70   0.63  0.62    0.67
  10        0.56  0.55   0.56   0.57  0.55    0.56
  mean      0.62  0.61   0.62   0.59  0.59

Table 3: Classification Accuracy in the Matton Database, with Default Descriptor Sizes (t=0).

  kw pair   RGB   ehist  ebags  SIFT  cdescr  mean
  1         0.67  0.65   0.69   0.64  0.64    0.66
  2         0.80  0.80   0.79   0.72  0.75    0.77
  3         0.75  0.71   0.74   0.59  0.57    0.67
  4         0.78  0.78   0.84   0.65  0.71    0.75
  5         0.82  0.81   0.78   0.79  0.79    0.80
  6         0.84  0.84   0.84   0.71  0.76    0.80
  7         0.64  0.65   0.64   0.65  0.65    0.64
  8         0.76  0.72   0.72   0.67  0.68    0.71
  9         0.81  0.75   0.82   0.73  0.72    0.77
  10        0.69  0.68   0.73   0.67  0.64    0.68
  mean      0.76  0.74   0.76   0.68  0.69

As mentioned in Section 4.1, PCA is used for reducing the dimensionality of the descriptors. Figure 1 and Figure 2 show how the classification accuracy depends on the number of dimensions used, for the Picsearch and Matton database respectively. We show the results for 112 dimensions down to 4. For descriptors containing fewer than 112 bins (the emotion histogram), we substitute the missing dimensions with the default descriptor size. We conclude that 32-64 dimensions is an appropriate trade-off between accuracy and storage/computational savings. In the remaining experiments we represent image descriptors with 64 dimensions.

Figure 1: The mean classification accuracy for different numbers of dimensions in the descriptor (Picsearch database, t=0).

Figure 2: The mean classification accuracy for different numbers of dimensions in the descriptor (Matton database, t=0).

If the intended application allows an exclusion of uncertain images from the two-category classification (or if we try to classify completely unknown images), we can use a probability threshold as discussed in Section 5.2. How the classification accuracy is affected by an increased threshold, for the Picsearch and Matton databases, can be seen in Figure 3 and Figure 4. A raised threshold value leads to an exclusion of images from the test set. With only a few images left, the accuracy becomes unreliable, as shown for the Picsearch database in Figure 3. The curve showing the accuracy for a descriptor is ended if the threshold value leads to an empty test set for at least one of the emotion pairs. The almost linear increase in accuracy in Figure 4 shows that the models used for classification in the Matton database can also be utilized as predictors of emotion scores. The result confirms that the labeling accuracy in the Matton database is higher than in the Picsearch database.

Figure 3: The change in classification accuracy in the Picsearch database when the threshold t is raised.

Figure 4: The change in classification accuracy in the Matton database when the threshold t is raised.

Next we investigate whether the classification accuracy in the Picsearch database can increase if the model is trained with Matton samples, and vice versa. The results for the Bag-of-emotions descriptor ("ebags") can be seen in Figure 5 (only "ebags" results are shown, since this has been the best performing descriptor in the earlier experiments). A probability threshold of 0.1 is used together with 64-dimensional descriptors.
The highest accuracy was achieved when both training and testing were carried out on the Matton database, followed by training on the Picsearch database and testing on the Matton database. The worst result was achieved when we trained on the Matton database and evaluated the classification with Picsearch images.

Figure 5: Classification accuracy for different keyword pairs (using Bag-of-emotions), for different combinations of training and testing databases (Train:Pi/Test:Pi, Train:Pi/Test:Ma, Train:Ma/Test:Ma, Train:Ma/Test:Pi).

Our last experiment exploits the user statistics that are gathered for the Picsearch database. As explained in Section 3, the popularity of each image is roughly estimated by a click and view ratio. Using the Bag-of-emotions descriptor, we train and evaluate the classification model using different image subsets that are created based on popularity estimates. The result can be seen in Table 4. A slightly increased performance is recorded when the model is evaluated with popular images, or when popular images are used for training, but the differences between subsets are very small.

Table 4: Mean Classification Accuracy over all Keyword Pairs for Different Subsets of the Picsearch Database. Mix: A Mixture Between Popular and Non-Popular Images. Pop: Only Popular Images. Non: Only Non-Popular Images.

  Training  Testing  Mean accuracy
  mix       mix      0.66
  mix       pop      0.68
  mix       non      0.64
  pop       mix      0.65
  pop       pop      0.68
  pop       non      0.63
  non       mix      0.65
  non       pop      0.65
  non       non      0.65

6.2 Classification Examples

We illustrate the classification performance by plotting a few subsets of classified images. Plots have been created for two keyword pairs: 1 (calm-intense), where the classification score was low for both databases, and 6 (formal-vivid), where the classification score was among the highest. Results for keyword pair number 1, for the Picsearch and Matton database respectively, can be seen in Figure 6 and Figure 7. Corresponding results for pair number 6 are given in Figure 8 and Figure 9. In all figures we show the 20+20 images that obtained the highest and lowest probability estimates. In between, we plot 20 images that obtained a probability estimate close to 0.5. Experiments are carried out with the Bag-of-emotions descriptor ("ebags"), using 64 bins. Images are squared for viewing purposes.

Figure 6: Classification examples for keyword pair calm-intense, using the Picsearch database.

Figure 7: Classification examples for keyword pair calm-intense, using the Matton database.

Figure 8: Classification examples for keyword pair formal-vivid, using the Picsearch database.

Figure 9: Classification examples for keyword pair formal-vivid, using the Matton database.
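A minimal sketch of how such example subsets can be selected: sort the test images by their probability estimates and pick the 20 highest, the 20 lowest, and the 20 closest to 0.5. The data structure and function name are our own; the paper does not specify the plotting code.

```python
def example_subsets(prob_estimates: dict, n: int = 20):
    """Given {image_id: Pr(y=1|x)}, return the n most 'positive' images,
    n 'neutral' images (closest to 0.5), and the n most 'negative' images."""
    ranked = sorted(prob_estimates, key=prob_estimates.get, reverse=True)
    top = ranked[:n]       # highest probability estimates (positive keyword)
    bottom = ranked[-n:]   # lowest probability estimates (negative keyword)
    neutral = sorted(ranked, key=lambda im: abs(prob_estimates[im] - 0.5))[:n]
    return top, neutral, bottom
```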
7. SUMMARY AND CONCLUSIONS

We introduced two large image databases where images are labeled with emotion-related keywords: one database containing images from a commercial image provider, and one with images collected from the Internet. We used the standard RGB histogram and psychophysics-based color-emotion related features to characterize the global appearance of the images. We also used two bag-of-features models, where histograms are derived for individual interest points. The extracted feature vectors are then used by public-domain classification software to separate images from two classes defined by opposing emotion pairs. In our experiments we found that the three models based on global image histograms outperform the two chosen bag-of-features models. The best performing descriptor is the Bag-of-emotions, a histogram describing the properties of homogeneous emotion regions in the image and the transitions between regions. The selected emotion-related keywords are often related to the color content, and it is therefore no surprise that the classification accuracy for the intensity-based SIFT descriptor is rather poor. Interesting are the results for the Color descriptor, where photometric invariant color histograms are added to the original SIFT descriptors: these experiments gave only slightly better results than SIFT alone.

The Picsearch database contains images and meta-data crawled from the Internet, whereas the images in the database from Matton are labeled by professionals (although the keywords vary depending on the original supplier of the images). As expected, the highest classification accuracy was achieved for the Matton database. In general, a high accuracy is obtained when training and testing are carried out on images from the same database source. However, we notice that the accuracy decreases considerably when we train on the Matton database and test on the Picsearch database, compared to the opposite. A probable explanation is that the Picsearch database is much more diverse (varied images, inconsistent labeling, etc.) than the Matton database. When training is conducted on the Matton database we obtain a classifier that is highly suitable for that type of image source, but presumably not robust enough to capture the diversity of the Picsearch database. Hence, if we want a robust classification of images from an unknown source, the classifier trained on the Picsearch database might be preferred.

An unexpected conclusion is that the use of popular and non-popular images in the Picsearch database did not significantly influence the performance. One explanation could be that the popularity of an image depends less on its visual features and more on the context. The relations between the visual content and the emotional impact of an image are very complicated and probably depend on a large number of different factors. This is why we proposed a probability threshold on the SVM output: we should avoid classifying images that receive a probability estimate close to 0.5 (in the middle between two emotions). Instead, these images can be assigned to a "neutral" category. We believe this is a way to increase the acceptance of emotions and aesthetics among end-users in daily-life search applications.

8. FUTURE WORK

The experiments reported in this paper are only a first attempt at exploring the internal structure of these databases.
The results show that the way the databases were constructed leads to profound differences in the statistical properties of the images contained in them. The Picsearch database is, due to its larger variations, much more challenging. In these experiments we selected those opposing keyword categories where the keywords of the different categories also have a visual interpretation. Images in other categories (like funny) are probably even more visually diverse. Further studies are also needed to investigate whether the visual properties are somehow related to the visual content. Classification into opposite categories is one of the simplest approaches to exploring the structure of the databases; clustering and multiclass classification are other obvious choices. Different feature extraction methods suitable for large-scale databases also have to be investigated.

9. ACKNOWLEDGMENTS

The presented research is part of the project Visuella Världar, financed by the Knowledge Foundation, Sweden. Picsearch AB (publ) and Matton Images, Stockholm (Sweden), are gratefully acknowledged for their contributions.

10. REFERENCES

[1] S. Berretti, A. Del Bimbo, and P. Pala. Sensations and psychological effects in color image database. IEEE Int Conf on Image Proc, 1:560–563, Santa Barbara, 1997.
[2] P. Bolettieri, A. Esuli, F. Falchi, C. Lucchese, R. Perego, T. Piccioli, and F. Rabitti. CoPhIR: A test collection for content-based image retrieval. arXiv:0905.4627v2, 2009.
[3] S.-B. Cho and J.-Y. Lee. A human-oriented image retrieval system using interactive genetic algorithm. IEEE Trans Syst Man Cybern Pt A Syst Humans, 32(3):452–458, 2002.
[4] D. Cohen-Or, O. Sorkine, R. Gal, T. Leyvand, and Y.-Q. Xu. Color harmonization. In ACM SIGGRAPH 2006, 25:624–630, Boston, MA, 2006.
[5] J. Corridoni, A. Del Bimbo, and P. Pala. Image retrieval by color semantics. Multimedia Syst, 7(3):175–183, 1999.
[6] J. Corridoni, A. Del Bimbo, and E. Vicario. Image retrieval by color semantics with incomplete knowledge. JASIS, 49(3):267–282, 1998.
[7] R. Datta, D. Joshi, J. Li, and J. Wang. Studying aesthetics in photographic images using a computational approach. In 9th Eu Conf on Computer Vision, ECCV 2006, 3953:288–301, Graz, 2006.
[8] R. Datta, D. Joshi, J. Li, and J. Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Comput Surv, 40(2), 2008.
[9] R. Datta, J. Li, and J. Wang. Learning the consensus on visual quality for next-generation image management. In 15th ACM Int Conf on Multimedia, MM'07, p. 533–536, Augsburg, 2007.
[10] R. Datta, J. Li, and J. Z. Wang. Algorithmic inferencing of aesthetics and emotion in natural images: An exposition. In 15th IEEE Int Conf on Image Proc, ICIP 2008, p. 105–108, 2008.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR 2009, 2009.
[12] M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, and C. Schmid. Evaluation of GIST descriptors for web-scale image search. In ACM Int Conf on Image and Video Retrieval, CIVR 2009, p. 140–147, 2009.
[13] X.-P. Gao, J. Xin, T. Sato, A. Hansuebsai, M. Scalzo, K. Kajiwara, S.-S. Guan, J. Valldeperas, M. Lis, and M. Billger. Analysis of cross-cultural color emotion. Color Res Appl, 32(3):223–229, 2007.
[14] S. Hong and H. Choi. Color image semantic information retrieval system using human sensation and emotion. In Proceedings IACIS, VII, p. 140–145, 2006.
[15] T. Joachims. Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Learning, p. 169–184, Cambridge, USA, 1999.
[16] S. Kobayashi. Color Image Scale. Kodansha International, 1991.
[17] J. Lee, Y.-M. Cheon, S.-Y. Kim, and E.-J. Park. Emotional evaluation of color patterns based on rough sets. In 3rd Int Conf on Natural Computation, ICNC 2007, 1:140–144, Haikou, Hainan, 2007.
[18] R. Lenz and P. Latorre-Carmona. Hierarchical S(3)-coding of RGB histograms. In VISIGRAPP 2009, Communications in Computer and Information Science, 68:188–200, 2010.
[19] H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on Platt's probabilistic outputs for support vector machines. Mach. Learn., 68(3):267–276, 2007.
[20] Y. Liu, D. Zhang, G. Lu, and W.-Y. Ma. A survey of content-based image retrieval with high-level semantics. Pattern Recogn., 40(1):262–282, 2007.
[21] L.-C. Ou, M. Luo, A. Woodcock, and A. Wright. A study of colour emotion and colour preference. Part I: Colour emotions for single colours. Color Res Appl, 29(3):232–240, 2004.
[22] L.-C. Ou, M. Luo, A. Woodcock, and A. Wright. A study of colour emotion and colour preference. Part II: Colour emotions for two-colour combinations. Color Res Appl, 29(4):292–298, 2004.
[23] L.-C. Ou, M. Luo, A. Woodcock, and A. Wright. A study of colour emotion and colour preference. Part III: Colour preference modeling. Color Res Appl, 29(5):381–389, 2004.
[24] J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, p. 61–74. MIT Press, 1999.
[25] M. Solli and R. Lenz. Color emotions for image classification and retrieval. In Proceedings CGIV 2008, p. 367–371, 2008.
[26] M. Solli and R. Lenz. Color based bags-of-emotions. In CAIP '09: Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns, p. 573–580, Berlin, Heidelberg, 2009.
[27] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 30(11):1958–1970, 2008.
[28] K. van de Sande, T. Gevers, and C. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Trans Pattern Anal Mach Intell (in print), 2009.
[29] J. van de Weijer and C. Schmid. Coloring local feature extraction. In European Conference on Computer Vision, II:334–348. Springer, 2006.
[30] W.-N. Wang and Y.-L. Yu. Image emotional semantic query based on color semantic description. In Int Conf on Mach Learn and Cybern, ICMLC 2005, p. 4571–4576, Guangzhou, 2005.
[31] W.-N. Wang, Y.-L. Yu, and S.-M. Jiang. Image retrieval by emotional semantics: A study of emotional space and feature extraction. In 2006 IEEE Int Conf on Syst, Man and Cybern, 4:3534–3539, Taipei, 2007.
[32] H.-W. Yoo. Visual-based emotional descriptor and feedback mechanism for image retrieval. J. Inf. Sci. Eng., 22(5):1205–1227, 2006.