Emotion Related Structures in Large Image Databases

Martin Solli and Reiner Lenz
ITN, Linköping University
SE-60174 Norrköping, Sweden
[email protected], [email protected]
ABSTRACT
We introduce two large databases consisting of 750 000 and 1.2 million thumbnail-sized images, labeled with emotion-related keywords. The smaller database consists of images from Matton Images, an image provider. The larger database consists of web images that were indexed by the crawler of the image search engine Picsearch. The images in the Picsearch database belong to one of 98 emotion-related categories and contain meta-data in the form of secondary keywords, the originating website, and some view statistics. We use two psychophysics-related feature vectors based on the emotional impact of color combinations, the standard RGB histogram, and two SIFT-related descriptors to characterize the visual properties of the images. These features are then used in two-class classification experiments to explore the discrimination properties of emotion-related categories. The clustering software and the classifiers are available in the public domain, and the same standard configurations are used in all experiments. Our findings show that for emotional categories, descriptors based on global image statistics (global histograms) perform better than local image descriptors (bag-of-words models). This indicates that content-based indexing and retrieval using emotion-based approaches are fundamentally different from the dominant object-recognition-based approaches for which SIFT-related features are the standard descriptors.
1. INTRODUCTION

Most available systems for image classification and Content Based Image Retrieval are focused on object and scene
recognition. Typical tasks are classifying or finding images
containing vehicles, mountains, animals, faces, etc. (see for
instance Liu et al. [20] for an overview). In recent years,
however, the interest in classification and retrieval methods
incorporating emotions and aesthetics has increased. In a
survey by Datta et al. [8], the subject is listed as one of the
upcoming topics in Content Based Image Retrieval.
The main question dealt with in this paper is whether it is possible to use simple statistical measurements, for instance global image histograms, to classify images into categories related to emotions and aesthetics. We compare the image
database from an Internet search service provider, where
images and meta-data are crawled from the Internet (this
database is made available for the research community),
with a database from an image provider, where images are
selected and categorized by professionals. This leads us to
our second question: Is there a difference between classification in databases collected from the Internet, and databases
used by professionals? Can we increase the performance on
the Internet database by training a supervised learning algorithm on the professional database? In our experiments
we compare five different image descriptors, four of which incorporate color information. Three descriptors are
based on global image histograms, and two are bag-of-words
strategies, where the descriptors are histograms derived for
local image patches. Standard tools are used for clustering
in the feature space (if needed), and for building the classification model by supervised learning. Earlier attempts have
been made at comparing global and local image descriptors
(recently by Douze et al. [12], and van de Sande et al. [28]),
but not in an emotion context.
A general problem with investigations involving emotions
and aesthetics is the lack of ground truth. The concepts
of emotions and aesthetics are influenced by many factors,
such as perception, cognition, culture, etc. Even if research
has shown that people in general perceive color emotions in
similar ways, some individuals may have different opinions.
The most obvious example is people who have a reduced
color sensitivity. Moreover, since our predictions are mainly
based on color content, it is, of course, possible to find images with other types of content, for instance objects in the
scene, that are the cause of the perceived emotions. Consequently, we cannot expect a correct classification result
for every possible image, or a result that will satisfy every
possible user.
Finally, we emphasize that this paper mainly addresses visual properties within image databases, and not linguistic properties. Moreover, the term high-level semantics refers to emotional and aesthetic properties, and we use keywords in English (or American English) in all investigations.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Indexing methods; I.4.9 [Image Processing and Computer Vision]: Applications

General Terms
Theory, Experimentation

Keywords
Image databases, image indexing, emotions
2. RELATED WORK
Even if the interest in image emotions and aesthetics has increased, relatively few papers address the problem of including those topics in image indexing. Papers by
Berretti et al. [1], and Corridoni et al. [6][5], were among
the earliest in this research field. They used clustering in
the CIELUV color space, together with a modified k-means
algorithm, for segmenting images into homogeneous color regions. Then fuzzy sets are used to convert intra-region properties (warmth, luminance, etc.), and inter-region properties
(hue, saturation, etc.), to a color description language. A
similar method is proposed by Wang and Yu [30]. Cho and
Lee [3] present an approach where features are extracted
from wavelets and average colors. Hong and Choi [14] propose a search scheme called FMV (Fuzzy Membership Value)
Indexing, which allows users to retrieve images based on high-level keywords, such as "soft", "romantic", etc. Emotion concepts are derived using the HSI color space.
We mention three papers that to some extent incorporate
psychophysical experiments in the image retrieval model.
Yoo [32] proposes a method using descriptors called query
color code and query gray code, which are based on human evaluation of color patterns on 13 emotion scales. The
database is queried with one of the emotions, and a feedback
mechanism is used for dynamically updating the search result. Lee et al. [17] present an emotion-based retrieval system based on rough set theory. They use category scaling
experiments, where observers judge random color patterns
on three emotion scales. Intended applications are primarily
within indexing of wall papers etc. Wang et al. [31] identify
an orthogonal three-dimensional emotion space, and design
three emotion factors. From histogram features, emotional
factors are predicted using a Support Vector Machine. The
method was developed and evaluated for artwork.
Datta et al. [7][9] use photos from a photo sharing web
page, peer-rated in two qualities, aesthetics and originality.
Visual or aesthetic image features are extracted, and the relationship between observer ratings and extracted features is explored through Support Vector Machines and classification trees. The primary goal is to build a model that can
predict the quality of an image. In [10], Datta et al. discuss
the future possibilities of including image aesthetics and
emotions in, for instance, image classification.
Within color science, research on color emotions, especially for single colors and two-color combinations, has been an active field for many years. Such experiments
usually incorporate observer studies, and are often carried
out in controlled environments. See [21][22][23][13][16] for
a few examples. The concept of harmony has also been investigated in various contexts. See for instance Cohen-Or et
al. [4], where harmony is used to beautify images.
Labeling and search strategies based on emotions have not yet been used in commercial search engines, with one exception: the Japanese emotional visual search engine EVE (http://amanaimages.com/eve/, with Japanese documentation and user interface).
3. IMAGE COLLECTIONS
Our experiments are based on two different databases, containing both images and meta-data, such as keywords or labels. The first database is collected from the image search engine belonging to Picsearch AB (publ) (http://www.picsearch.com/). The second database is supplied by the image provider Matton Images (http://www.matton.com/). In the past 5-10 years, numerous large image databases have been made publicly available by the research community (examples are ImageNet [11], CoPhIR [2], and the database used by Torralba et al. [27]). However, most databases target objects and scenes, and very few address high-level semantics, such as emotions and aesthetics. To help others conducting research in this upcoming topic within image indexing, major parts of the Picsearch database are made available for public download (http://diameter.itn.liu.se/emodb/).
Picsearch: The Picsearch database contains 1.2 million
thumbnail images, with a maximum size of 128 pixels (height
or width). Original images were crawled from public web
pages around March 2009, using 98 keywords related to
emotions, like ”calm”, ”elegant”, ”provocative”, ”soft”, etc.
Thumbnails were shown to users visiting the Picsearch image search engine, and statistics of how many times each image has been viewed and clicked during March through June
2009 were recorded. The ratio between number of clicks and
number of views (for images that have been viewed at least
50 times) is used as a rough estimate of popularity. The creation of the database of course depends on how people present images on their websites, and on the performance of the Picsearch crawler when interpreting the sites. A challenging but interesting aspect of images gathered from the Internet is that the images were labeled (indirectly) by a wide range of people who did not know that their images and text would be used in emotional image classification. The drawback is that we cannot guarantee the quality of the data, since images and keywords have not been checked afterwards. Picsearch uses a "family filter" for excluding inappropriate images; otherwise, we can expect all kinds of material in the database.
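As a concrete illustration of the popularity estimate, the sketch below computes the click-to-view ratio for images with at least 50 views. This is a minimal sketch, not the Picsearch implementation; the field names and data layout are assumptions made for the example.

```python
# Minimal sketch (not the Picsearch code): estimate image "popularity" as the
# click-to-view ratio, counting only images viewed at least 50 times.
# The tuple layout (image_id, views, clicks) is an illustrative assumption.

def popularity_scores(view_stats, min_views=50):
    """view_stats: iterable of (image_id, views, clicks) tuples."""
    scores = {}
    for image_id, views, clicks in view_stats:
        if views >= min_views:
            scores[image_id] = clicks / views
    return scores

stats = [("img_001", 120, 9), ("img_002", 40, 5), ("img_003", 300, 45)]
print(popularity_scores(stats))  # img_002 is skipped (fewer than 50 views)
```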
Matton: The Matton database is a subset of a commercial
image database maintained by Matton Images in Stockholm,
Sweden. Our database contains 750 000 images. Each image is accompanied by a set of keywords, assigned by professionals. Even if the images originate from different sources, we expect the labeling in the Matton database to be more consistent than in the Picsearch database. However, we do not know how the professionals chose the high-level concepts when labeling the images. To match the Picsearch database, all
images are scaled to a maximum size of 128 pixels.
3.1 Data Sets for Training and Testing
The Picsearch database was created from 98 keywords which are related to emotions. To illustrate two-category classification, we selected pairs of keywords that represent opposing emotional properties, like vivid-calm. The experiments are done with the 10 pairs of keywords given in Table 1. Opposing emotions were selected based on an intuitive feeling. In upcoming research, however, a linguist or psychologist can be involved to establish the relevance of each selected pair.
Table 1: The Keyword Pairs Used in the Experiments, Representing Opposing Emotion Properties.

1: calm-intense      6: formal-vivid
2: calm-vivid        7: intense-quiet
3: cold-warm         8: quiet-wild
4: colorful-pure     9: soft-spicy
5: formal-wild      10: soft-vivid
For keyword k we extract subsets of the Picsearch database, denoted Pi, and the Matton database, denoted Ma. Image number i, belonging to the keyword subset k, is denoted Pi(k, i) and Ma(k, i), respectively.
Picsearch subsets: All images in each category Pi(k, :) are sorted based on the popularity measurement described earlier in this section. A category subset is created, containing 2000 images: the 1000 most popular images, saved in Pi(k, 1, ..., 1000), and 1000 images randomly sampled from the remaining ones, saved in Pi(k, 1001, ..., 2000).
Matton subsets: Since we do not have a popularity measurement for the Matton database, only 1000 images are extracted and indexed in Ma(k, 1, ..., 1000), using each keyword k as query. For the following two categories, the query resulted in fewer than 1000 images; intense: 907 images, spicy: 615 images.
We divide each image category into one training set and one test set, where every second image, i = 1 : 2 : end, is used for training, and the remaining ones, i = 2 : 2 : end, are used for testing. If one of the categories contains fewer images than the other (this occurs when working with the categories intense or spicy from the Matton database, or when we mix data from the two databases), the number of images in the opposing category is reduced to the same size by sampling the subset.
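The sketch below illustrates the splitting and balancing procedure described above. It is a minimal interpretation of the text, not the authors' code; the random subsampling used for balancing is an assumption about how "sampling the subset" is carried out.

```python
# Minimal sketch of the split described above: odd-indexed images form the
# training set, even-indexed images the test set, and the larger category is
# subsampled to match the smaller one.
import random

def split_and_balance(category_a, category_b, seed=0):
    """category_a, category_b: lists of image identifiers for the two keywords."""
    def split(images):
        return images[0::2], images[1::2]   # i = 1:2:end for training, 2:2:end for testing

    train_a, test_a = split(category_a)
    train_b, test_b = split(category_b)

    rng = random.Random(seed)
    def balance(x, y):
        # Reduce both sets to the size of the smaller one by random sampling.
        n = min(len(x), len(y))
        return rng.sample(x, n), rng.sample(y, n)

    train_a, train_b = balance(train_a, train_b)
    test_a, test_b = balance(test_a, test_b)
    return (train_a, train_b), (test_a, test_b)
```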
4. IMAGE DESCRIPTORS
The selection of image descriptors is a fundamental problem in computer vision and image processing. State-of-the-art solutions in object and scene classification often involve bag-of-features models (also known as bag-of-words models), where interest points, for instance corners or local maxima/minima, are detected in each image. The characteristics of a patch surrounding each interest point are saved in a descriptor, and various training methods can be used for finding relationships between descriptor content and categories belonging to specified objects or scenes.
Finding interest points corresponding to corners etc. works well in object and scene classification, but using the same descriptors in the classification of emotional and aesthetic properties can be questioned. Instead, we assume that the overall
appearance of the image, especially the color content, plays
an important role. Consequently, we focus our initial experiments on global image descriptors, like histograms. A
related approach that is assumed to be useful in emotional
classification is to classify the image based on homogeneous
regions, and transitions between regions, as proposed in [26].
We select three different global histogram descriptors in
our experiments. For comparison, we also include two implementations of traditional bag-of-words models, where the
descriptors are histograms for image patches corresponding
to found interest points. Other image descriptors could also
be used, but a comprehensive comparison between different
descriptors is beyond the scope of this initial study. Listed
below are the five image descriptors:
RGB-histogram: We choose the commonly used 512-bin RGB histogram, where each color channel is quantized into 8 equally sized bins.
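A 512-bin RGB histogram of this kind can be computed as in the following sketch (a straightforward NumPy implementation, not necessarily identical to the one used in the experiments):

```python
# Minimal sketch of a 512-bin RGB histogram: each channel is quantized into
# 8 equal bins, giving 8*8*8 = 512 joint bins.
import numpy as np

def rgb_histogram(image, bins_per_channel=8):
    """image: H x W x 3 uint8 array; returns a normalized 512-dimensional vector."""
    pixels = image.reshape(-1, 3)
    # Map 0..255 to bin indices 0..7 per channel.
    idx = (pixels // (256 // bins_per_channel)).astype(int)
    # Combine the three channel indices into one joint bin index.
    joint = idx[:, 0] * bins_per_channel**2 + idx[:, 1] * bins_per_channel + idx[:, 2]
    hist = np.bincount(joint, minlength=bins_per_channel**3).astype(float)
    return hist / hist.sum()

img = np.random.randint(0, 256, (128, 96, 3), dtype=np.uint8)  # example thumbnail
print(rgb_histogram(img).shape)  # (512,)
```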
Emotion-histogram: A descriptor proposed in [25], where 512-bin RGB histograms are transformed to 64-bin emotion histograms. A kd-tree decomposition is used for mapping bins from the RGB histogram to a three-dimensional emotion space, spanned by three emotions related to human perception of colors. The three emotions incorporate the scales:
warm-cool, active-passive, and light-heavy. The color emotion metric used originates from perceptual color science,
and was derived from psychophysical experiments by Ou et
al. [21][22][23]. The emotion metric was originally designed
for single colors, and later extended to include pairs of colors. In [25], the model was extended to images, and used in
image retrieval. A general description of an emotion metric
is a set of equations that defines the relationship between
a common color space (for instance CIELAB), and a space
spanned by emotion factors. A set of emotion factors is usually derived in psychophysical experiments.
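The sketch below only illustrates the general idea of folding an RGB histogram into a coarser emotion histogram. The function rgb_to_emotion is a placeholder: the actual mapping in [25] uses the Ou et al. color-emotion metric and a kd-tree decomposition of the emotion space, whereas this example uses an invented transform and a uniform 4x4x4 grid.

```python
# Hedged sketch (not the implementation from [25]) of redistributing a 512-bin
# RGB histogram into a 64-bin histogram over a 3-D emotion space.
import numpy as np

def rgb_to_emotion(rgb):
    """Placeholder: map an RGB triplet (0..255) to three emotion coordinates.
    NOT the real Ou et al. metric (warm-cool, active-passive, light-heavy)."""
    r, g, b = [c / 255.0 for c in rgb]
    return np.array([r - b + 1.0, (r + g + b) / 3.0, g])

def emotion_histogram(rgb_hist, bins_per_channel=8, emotion_bins_per_axis=4):
    """Fold a 512-bin RGB histogram into 4*4*4 = 64 emotion bins."""
    ehist = np.zeros(emotion_bins_per_axis**3)
    step = 256 // bins_per_channel
    for flat_idx, mass in enumerate(rgb_hist):
        # Recover the RGB bin center for this flat index.
        ri, gi, bi = np.unravel_index(flat_idx, (bins_per_channel,) * 3)
        center = (np.array([ri, gi, bi]) + 0.5) * step
        e = np.clip(rgb_to_emotion(center) / 2.0, 0, 0.999)  # crude normalization to [0, 1)
        ei = (e * emotion_bins_per_axis).astype(int)
        ehist[np.ravel_multi_index(tuple(ei), (emotion_bins_per_axis,) * 3)] += mass
    return ehist
```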
Bag-of-Emotions: This is a color-based emotion-related
image descriptor, described in [26]. It is based on the same
emotion metric as in the emotion-histogram mentioned above.
For this descriptor, the assumption is that perceived color
emotions in images are mainly affected by homogeneous regions, defined by the emotion metric, and transitions between regions. RGB coordinates are converted to emotion
coordinates, and for each emotion channel, statistical measurements of gradient magnitudes within a stack of low-pass
filtered images are used for finding interest points corresponding to homogeneous regions and transitions between
regions. Emotion characteristics are derived for patches
surrounding each interest point, and contributions from all
patches are saved in a bag-of-emotions, which is a 112-bin
histogram. Notice that the result is a single histogram describing the entire image, and not a set of histograms (or
other descriptors) as in ordinary bag-of-features models.
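The following heavily hedged sketch illustrates only the interest-point idea behind the descriptor: gradient magnitudes are examined in a stack of low-pass filtered versions of an emotion channel, and locations with consistently low or high gradients are treated as candidates for homogeneous regions and transitions. The scales and thresholds are arbitrary assumptions, and the aggregation into the 112-bin histogram of [26] is not reproduced.

```python
# Hedged sketch of the interest-point idea behind bags-of-emotions [26]:
# inspect gradient magnitudes in a stack of low-pass filtered versions of one
# emotion channel and keep low-gradient (homogeneous region) and high-gradient
# (transition) locations. Scales and thresholds below are arbitrary assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_gradient_magnitude

def emotion_interest_points(channel, sigmas=(1, 2, 4), low=0.01, high=0.1):
    """channel: 2-D array holding one emotion coordinate per pixel."""
    grads = np.stack([gaussian_gradient_magnitude(gaussian_filter(channel, s), 1.0)
                      for s in sigmas])
    mean_grad = grads.mean(axis=0)
    homogeneous = np.argwhere(mean_grad < low)    # candidate region points
    transitions = np.argwhere(mean_grad > high)   # candidate transition points
    return homogeneous, transitions
```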
SIFT: The Scale Invariant Feature Transform is a standard tool in computer vision and image processing. Here we use the SIFT implementation by Andrea Vedaldi (http://www.vlfeat.org/~vedaldi/), both for detecting interest points and for computing the descriptors. The result is a 128-bin histogram for each interest point.
Color descriptor: The color descriptor proposed by van de Weijer and Schmid [29] is an extension of SIFT, where photometric invariant color histograms are added to the original SIFT descriptor. Experiments have shown that the color descriptor can outperform similar shape-based approaches in many matching, classification, and retrieval tasks.
In the experimental part of the paper, the above descriptors are referred to as "RGB", "ehist", "ebags", "SIFT", and "cdescr". For SIFT and the Color descriptor, individual descriptors are obtained for each interest point found. The average number of detected interest points per image is 104 for the Picsearch database and 115 for the Matton database. The difference originates from a difference in image size.
It is possible to find images smaller than 128 pixels in the
Picsearch database, whereas such images are not present in
the Matton database. The number of found interest points
is believed to be sufficient for the intended application. A
common approach in bag-of-words models is to cluster the
descriptor space (also known as codebook generation), and
by vector quantization obtain a histogram that describes
the distribution over cluster centers (the distribution over
codewords). We use k-means clustering, with 500 clusters,
and 10 iterations, each with a new set of initial centroids.
Cluster centers are saved for the iteration that returns the
minimum within-cluster sums of point-to-centroid distances.
Clustering is carried out in each database separately, with
1000 descriptors randomly collected from each of the 12 categories used in the experiments. State-of-the-art solutions in image classification often use codebooks of even greater size. But since we use thumbnail images, where the number of detected interest points is relatively low, we find it appropriate to limit the size to 500 cluster centers. Preliminary
experiments with an increased codebook size did not result
in increased performance.
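The codebook and vector-quantization step can be sketched as follows. This is a simplified stand-in, not the exact configuration used in the paper: it runs a single k-means pass via SciPy, whereas the experiments keep the best of 10 restarts with new initial centroids.

```python
# Minimal sketch of codebook generation and vector quantization for the
# bag-of-words descriptors: k-means with 500 clusters on a sample of local
# descriptors, then a normalized histogram over codewords per image.
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def build_codebook(sampled_descriptors, k=500):
    """sampled_descriptors: N x D array of local descriptors (e.g., SIFT).
    Simplification: one k-means run instead of the best of 10 restarts."""
    centroids, _ = kmeans2(sampled_descriptors.astype(float), k, iter=10, minit="points")
    return centroids

def bow_histogram(image_descriptors, codebook):
    """Quantize an image's descriptors against the codebook and count codewords."""
    labels, _ = vq(image_descriptors.astype(float), codebook)
    hist = np.bincount(labels, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)
```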
4.1 Dimensionality Reduction
For many types of image histograms, the information in the histograms is quite similar in general; therefore, histograms can be described in a lower-dimensional subspace, with only minor loss in discriminative properties (see for instance Lenz and Latorre-Carmona [18]). There are two advantages of reducing the number of bins in the histograms: 1) storage and computational savings, and 2) easier comparison between methods when the number of dimensions is the same. We use Principal Component Analysis (or the Karhunen-Loève expansion) to reduce the number of dimensions, keeping the dimensions with the highest variance.
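A minimal PCA reduction of the descriptor matrices, keeping the highest-variance directions, could look like the following sketch (plain NumPy, not the authors' implementation):

```python
# Minimal sketch of reducing descriptor histograms to 64 dimensions with PCA,
# keeping the directions of highest variance.
import numpy as np

def pca_reduce(histograms, n_components=64):
    """histograms: N x D matrix, one descriptor per row.
    Returns the projected data, the mean, and the projection matrix."""
    mean = histograms.mean(axis=0)
    centered = histograms - mean
    cov = np.cov(centered, rowvar=False)            # D x D covariance matrix
    vals, vecs = np.linalg.eigh(cov)                # eigen-decomposition
    order = np.argsort(vals)[::-1][:n_components]   # highest-variance directions first
    return centered @ vecs[:, order], mean, vecs[:, order]
```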
5. CLASSIFICATION METHODS
With images separated into opposing emotion pairs, the
goal is to predict which emotion an unlabeled image should
belong to. A typical approach is to use some kind of supervised learning algorithm. Among common methods we find
Support Vector Machines, the Naive Bayes classifier, Neural Networks, and Decision tree classifiers. In this work we
utilize a Support Vector Machine, SVMlight, by Thorsten
Joachims [15]. For simplicity and reproducibility reasons,
all experiments are carried out with default settings, which,
for instance, means that we use a linear kernel function.
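For illustration, the two-category training step can be sketched as below. The paper uses SVMlight with default settings; here scikit-learn's LinearSVC stands in as an assumed, roughly comparable linear-kernel substitute.

```python
# Minimal sketch of the two-category classification step. LinearSVC is an
# assumed substitute for SVMlight with default settings (linear kernel).
import numpy as np
from sklearn.svm import LinearSVC

def train_two_class_svm(train_pos, train_neg):
    """train_pos, train_neg: N x D arrays of (e.g., 64-dimensional) descriptors."""
    X = np.vstack([train_pos, train_neg])
    y = np.hstack([np.ones(len(train_pos)), -np.ones(len(train_neg))])
    clf = LinearSVC()          # default settings, linear decision function
    clf.fit(X, y)
    return clf

def accuracy(clf, test_pos, test_neg):
    """Proportion of correctly labeled test images, as reported in Section 6."""
    X = np.vstack([test_pos, test_neg])
    y = np.hstack([np.ones(len(test_pos)), -np.ones(len(test_neg))])
    return float((clf.predict(X) == y).mean())
```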
5.1 Probability Estimates
Most supervised learning algorithms produce classification
scores that can be used for ranking examples in the test
set. However, in some situations, it might be more useful
to obtain an estimate of the probability that each example
belongs to the category of interest. Various methods have
been proposed in order to find these estimates. We adopt a
method proposed by Lin et al. [19], which is a modification
of a method proposed by Platt [24].
The basic idea is to estimate the probability using a sigmoid function. Given training examples $x_i \in \mathbb{R}^n$ with corresponding labels $y_i \in \{-1, 1\}$, and the decision function $f(x)$ of the SVM, the category probability $\Pr(y = 1|x)$ can be approximated by the sigmoid function

$$\Pr(y = 1|x) = \frac{1}{1 + \exp(Af(x) + B)} \qquad (1)$$

where $A$ and $B$ are estimated by solving

$$\min_{(A,B)} \left\{ -\sum_{i=1}^{l} \big( t_i \log(p_i) + (1 - t_i)\log(1 - p_i) \big) \right\} \qquad (2)$$

for

$$p_i = \frac{1}{1 + \exp(Af(x_i) + B)}, \qquad t_i = \begin{cases} \dfrac{N_p + 1}{N_p + 2} & \text{if } y_i = 1 \\[6pt] \dfrac{1}{N_n + 2} & \text{if } y_i = -1 \end{cases} \qquad (3)$$

where $N_p$ and $N_n$ are the numbers of examples belonging to the positive and negative category, respectively.
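A hedged sketch of fitting the sigmoid of Eqs. (1)-(3) is given below. It minimizes the negative log-likelihood of Eq. (2) with a general-purpose SciPy optimizer instead of the dedicated procedure described by Lin et al. [19].

```python
# Hedged sketch of Platt-style probability estimation: the targets t_i follow
# Eq. (3), and A, B minimize the negative log-likelihood of Eq. (2).
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(decision_values, labels):
    """decision_values: SVM outputs f(x_i); labels: +1 / -1."""
    f = np.asarray(decision_values, dtype=float)
    y = np.asarray(labels)
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

    def nll(params):
        A, B = params
        z = A * f + B
        log_p = -np.logaddexp(0.0, z)     # log p_i, numerically stable
        log_1mp = -np.logaddexp(0.0, -z)  # log(1 - p_i)
        return -np.sum(t * log_p + (1.0 - t) * log_1mp)

    res = minimize(nll, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
    return res.x  # A, B

def probability(decision_value, A, B):
    """Eq. (1): estimated probability of the positive category."""
    return 1.0 / (1.0 + np.exp(A * decision_value + B))
```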
5.2 Probability Threshold
When working with emotions and aesthetics, it is hard to
define the ground truth, especially for images lying close to
the decision plane. We can refer to these images as ”neutral”,
or images that belong to neither of the keywords. Since our
databases, especially the Picsearch database, contain data
of varying quality, we suspect that a large portion of the
images might belong to a "neutral" category. By defining a probability threshold t, we only classify images whose probability estimate pi lies outside the interval [0.5 − t, 0.5 + t]. Images with a probability estimate above the upper bound, pi > 0.5 + t, are assigned to the positive category, and images with a probability estimate below the lower bound, pi < 0.5 − t, belong to the negative category. This method is only applicable when the
intended application allows an exclusion of images from the
two-category classification. On the other hand, for images
receiving a label, the accuracy will in most cases increase.
Moreover, a probability threshold can be useful if we apply the method to a completely unknown image, where we do not know if the image should belong to one of the keywords.
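Applying the threshold to the probability estimates of Section 5.1 then reduces to the following simple rule (a minimal sketch; the label names are illustrative):

```python
# Minimal sketch of the probability-threshold rule: estimates inside
# [0.5 - t, 0.5 + t] are left unlabeled ("neutral") instead of being forced
# into either emotion category.
def threshold_classify(prob, t=0.1):
    """prob: estimated probability of the positive category (Section 5.1)."""
    if prob > 0.5 + t:
        return "positive"
    if prob < 0.5 - t:
        return "negative"
    return "neutral"

print([threshold_classify(p, t=0.1) for p in (0.82, 0.55, 0.31)])
# ['positive', 'neutral', 'negative']
```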
6. EXPERIMENTS
In all experiments, the classification accuracy is given by
the proportion of correctly labeled images. A figure of, for
instance, 0.8, means that a correct label was predicted for
80% of the images in the test set.
6.1 Classification Statistics
The results of the two-category classifications in the Picsearch and Matton databases can be seen in Table 2 and Table 3, respectively. Here we use the default descriptor sizes: 512 bins for the RGB histograms, 64 bins for the emotion histogram, 112 bins for bags-of-emotions, and 500 bins (equal to the codebook size) for the bag-of-features models using SIFT or the Color descriptor. We notice that the classification accuracy varies between keyword pairs and descriptors. Higher accuracy is usually obtained for keyword pairs 4 (colorful-pure), 5 (formal-wild), 6 (formal-vivid), and 9 (soft-spicy), whereas pairs 1 (calm-intense), 3 (cold-warm), and 10 (soft-vivid) perform worse. The three models based on global histograms perform better than the bag-of-features models using region-based histograms. Moreover, the classification accuracy increases when the Matton database is used.
Table 3: Classification Accuracy in the Matton Database, with Default Descriptor Sizes. (t=0)

kw pair  RGB   ehist  ebags  SIFT  cdescr  mean
1        0.67  0.65   0.69   0.64  0.64    0.66
2        0.80  0.80   0.79   0.72  0.75    0.77
3        0.75  0.71   0.74   0.59  0.57    0.67
4        0.78  0.78   0.84   0.65  0.71    0.75
5        0.82  0.81   0.78   0.79  0.79    0.80
6        0.84  0.84   0.84   0.71  0.76    0.80
7        0.64  0.65   0.64   0.65  0.65    0.64
8        0.76  0.72   0.72   0.67  0.68    0.71
9        0.81  0.75   0.82   0.73  0.72    0.77
10       0.69  0.68   0.73   0.67  0.64    0.68
mean     0.76  0.74   0.76   0.68  0.69
As mentioned in Section 4.1, PCA is used for reducing
the dimensionality of the descriptors. Figure 1 and Figure 2 show how the classification accuracy depends on the
number of dimensions used, for the Picsearch and Matton databases, respectively. We show the results for 112 dimensions down to 4. For descriptors containing fewer than 112 bins (the emotion histogram), we substitute the dimensions
with the default descriptor size. We conclude that 32-64 dimensions is an appropriate tradeoff between accuracy and
storage/computational savings. In the remaining experiments we represent image descriptors with 64 dimensions.
If the intended application allows an exclusion of uncertain images from the two-category classification (or if we try
to classify completely unknown images), we can use a probability threshold as discussed in Section 5.2. How the classification accuracy is affected by an increased threshold, for
the Picsearch and Matton databases, can be seen in Figure 3
and Figure 4. A raised threshold value leads to an exclusion
of images from the test set. With only a few images left,
the accuracy becomes unreliable, as shown for the Picsearch
database in Figure 3. The curve showing the accuracy for a given descriptor is terminated if the threshold value leads to an empty test set for at least one of the emotion pairs. The almost linear increase in accuracy in Figure 4 shows that the models used for classification in the Matton database can also be utilized as predictors of emotion scores. The result
confirms that the labeling accuracy in the Matton database
is higher than in the Picsearch database.
Figure 1: The mean classification accuracy for different number of dimensions in the descriptor. (Picsearch database, t=0)
Table 2: Classification Accuracy in the Picsearch Database, with Default Descriptor Sizes. (t=0)

kw pair  RGB   ehist  ebags  SIFT  cdescr  mean
1        0.62  0.60   0.61   0.57  0.56    0.59
2        0.60  0.58   0.62   0.57  0.57    0.59
3        0.60  0.60   0.59   0.52  0.54    0.57
4        0.63  0.61   0.65   0.59  0.61    0.62
5        0.64  0.62   0.62   0.65  0.67    0.64
6        0.69  0.67   0.69   0.61  0.61    0.65
7        0.61  0.59   0.60   0.55  0.55    0.58
8        0.58  0.57   0.60   0.58  0.60    0.59
9        0.69  0.69   0.70   0.63  0.62    0.67
10       0.56  0.55   0.56   0.57  0.55    0.56
mean     0.62  0.61   0.62   0.59  0.59
Figure 2: The mean classification accuracy for different number of dimensions in the descriptor. (Matton database, t=0)
Next, we investigate if the classification accuracy in the Picsearch database can increase if the model is trained with Matton samples, and vice versa. The results for the Bag-of-emotions descriptor ("ebags") can be seen in Figure 5 (only "ebags" results are shown since this has been the best performing descriptor in the earlier experiments). A probability threshold of 0.1 is used together with 64-dimensional descriptors. The highest accuracy was achieved when both training and testing were carried out on the Matton database, followed by training on the Picsearch database and testing on the Matton database. The worst result was achieved when we trained on the Matton database and evaluated the classification with Picsearch images.
Our last experiment exploits the user statistics that are
gathered for the Picsearch database. As explained in Section 3, the popularity of each image is roughly estimated by
a click and view ratio. Using the Bag-of-emotions descriptor, we train and evaluate the classification model using different image subsets that are created based on popularity
estimates. The result can be seen in Table 4. A slightly
increased performance is recorded when the model is evaluated with popular images, or when popular images are used
for training. But differences between subsets are very small.
6.2 Classification Examples
We illustrate the classification performance by plotting a few subsets of classified images.
Figure 5: Classification accuracy for different keyword pairs (using Bag-of-emotions), for different combinations of training and testing databases.
Figure 3: The change in classification accuracy in the Picsearch database when the threshold t is raised.
Table 4: Mean Classification Accuracy over all Keyword Pairs for Different Subsets of the Picsearch Database. Mix: A Mixture Between Popular and Non-Popular Images. Pop: Only Popular Images. Non: Only Non-Popular Images.

Training  Testing  Mean accuracy
mix       mix      0.66
mix       pop      0.68
mix       non      0.64
pop       mix      0.65
pop       pop      0.68
pop       non      0.63
non       mix      0.65
non       pop      0.65
non       non      0.65
Figure 4: The change in classification accuracy in the Matton database when the threshold t is raised.
Plots have been created for two keyword pairs: 1 (calm-intense), where the classification score was low for both databases, and 6 (formal-vivid), where the classification score was among the highest. Results for keyword pair number 1, for the Picsearch and Matton databases respectively, can be seen in Figure 6 and Figure 7. Corresponding results for pair number 6 are given in Figure 8 and Figure 9. In all figures we show the 20+20 images that obtained the highest and lowest probability estimates. In between, we plot 20 images that obtained a probability estimate close to 0.5. Experiments are carried out with the Bag-of-emotions descriptor ("ebags"), using 64 bins. Images are squared for viewing purposes.
7. SUMMARY AND CONCLUSIONS
We introduced two large image databases where images
are labeled with emotion-related keywords. One database contains images from a commercial image provider, and the other contains images collected from the Internet. We used the standard RGB histogram and psychophysics-based color-emotion-related features to characterize the global appearance of the images. We also used two bag-of-features models,
where histograms are derived for individual interest points.
The extracted feature vectors are then used by public-domain
classification software to separate images from two classes
defined by opposing emotion pairs. In our experiments we found that the three models based on global image histograms outperform the two chosen bag-of-features models.
The best performing descriptor is the Bag-of-emotions, which
is a histogram describing the properties of homogeneous
emotion regions in the image, and transitions between regions. The selected emotion-related keywords are often related to the color content, and it is therefore no surprise
that the classification accuracy for the intensity-based SIFT
descriptor is rather poor. The results for the Color descriptor, where photometric invariant color histograms are added to the original SIFT descriptors, were interesting: these experiments gave only slightly better results than SIFT alone.
Figure 6: Classification examples for keyword pair calm-intense, using the Picsearch database.

Figure 7: Classification examples for keyword pair calm-intense, using the Matton database.

Figure 8: Classification examples for keyword pair formal-vivid, using the Picsearch database.

Figure 9: Classification examples for keyword pair formal-vivid, using the Matton database.
The Picsearch database contains images and meta-data
crawled from the Internet, whereas the images in the database
from Matton are labeled by professionals (although the keywords vary depending on the original supplier of the images).
As expected, the highest classification accuracy was achieved
for the Matton database. In general, a high accuracy is obtained when training and testing are carried out on images
from the same database source. However, we notice that
the accuracy decreases considerably when we train on the
Matton database, and test on the Picsearch database, compared to the opposite. A probable explanation is that the
Picsearch database is much more diverse (various images, inconsistent labeling, etc.) than the Matton database. When
training is conducted on the Matton database we obtain a
classifier that is highly suitable for that type of image source,
but presumably not robust enough to capture the diversity
of the Picsearch database. Hence, if we want a robust classification of images from an unknown source, the classifier
trained on the Picsearch database might be preferred.
An unexpected conclusion is that the use of popular and non-popular images in the Picsearch database did not significantly influence the performance. One conclusion could be that the popularity of an image depends less on its visual features, and more on the context. The relations between the visual content and the emotional impact of an image are very complicated and probably depend on a large number
of different factors. This is why we proposed a probability threshold on the SVM output. This means that we should avoid classifying images that receive a probability estimate close to 0.5 (in the middle between two emotions). Instead, these images can be assigned to a "neutral" category. We believe this is a way to increase the acceptance of emotions and aesthetics among end-users in everyday search applications.
8. FUTURE WORK
The experiments reported in this paper are only a first attempt at exploring the internal structure of these databases. The results show that the way the databases were constructed leads to profound differences in the statistical properties of the images contained in them. The Picsearch database is, due to its larger variations, much more challenging. In these experiments we selected opposing keyword categories where the keywords of the different categories also have a visual interpretation. Images in other categories (like funny) are probably even more visually diverse. Further studies are also needed to investigate if the visual properties are somehow related to the visual content. Classification into opposite categories is one of the easiest approaches to explore the structure of a database; clustering and multi-class classification are other obvious choices. Different feature extraction methods suitable for large-scale databases also have to be investigated.
9. ACKNOWLEDGMENTS
The presented research is included in the project Visuella Världar, financed by the Knowledge Foundation, Sweden. Picsearch AB (publ) and Matton Images, Stockholm, Sweden, are gratefully acknowledged for their contributions.
10. REFERENCES
[1] S. Berretti, A. Del Bimbo, and P. Pala. Sensations and
psychological effects in color image database. IEEE Int
Conf on Image Proc, 1:560–563, Santa Barbara, 1997.
[2] P. Bolettieri, A. Esuli, F. Falchi, C. Lucchese,
R. Perego, T. Piccioli, and F. Rabitti. CoPhIR: A test
collection for content-based image retrieval.
arXiv:0905.4627v2, 2009.
[3] S.-B. Cho and J.-Y. Lee. A human-oriented image
retrieval system using interactive genetic algorithm.
IEEE Trans Syst Man Cybern Pt A Syst Humans,
32(3):452–458, 2002.
[4] D. Cohen-Or, O. Sorkine, R. Gal, T. Leyvand, and
Y.-Q. Xu. Color harmonization. In ACM SIGGRAPH
2006, 25:624–630, Boston, MA, 2006.
[5] J. Corridoni, A. Del Bimbo, and P. Pala. Image
retrieval by color semantics. Multimedia Syst,
7(3):175–183, 1999.
[6] J. Corridoni, A. Del Bimbo, and E. Vicario. Image
retrieval by color semantics with incomplete
knowledge. JASIS, 49(3):267–282, 1998.
[7] R. Datta, D. Joshi, J. Li, and J. Wang. Studying
aesthetics in photographic images using a
computational approach. In 9th Eu Conf on Computer
Vision, ECCV 2006, 3953:288–301, Graz, 2006.
[8] R. Datta, D. Joshi, J. Li, and J. Wang. Image
retrieval: Ideas, influences, and trends of the new age.
ACM Comput Surv, 40(2), 2008.
[9] R. Datta, J. Li, and J. Wang. Learning the consensus
on visual quality for next-generation image
management. In 15th ACM Int Conf on Multimedia,
MM’07, p.533–536, Augsburg, 2007.
[10] R. Datta, J. Li, and J. Z. Wang. Algorithmic
inferencing of aesthetics and emotion in natural
images: An exposition. 15th IEEE Int Conf on Im
Proc, 2008. ICIP 2008., p.105–108, 2008.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and
L. Fei-Fei. Imagenet: A large-scale hierarchical image
database. In CVPR09, 2009.
[12] M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, and
C. Schmid. Evaluation of GIST descriptors for
web-scale image search. In ACM Int Conf on Im and
Video Retr, CIVR 2009, p. 140-147, 2009.
[13] X.-P. Gao, J. Xin, T. Sato, A. Hansuebsai, M. Scalzo,
K. Kajiwara, S.-S. Guan, J. Valldeperas, M. Lis, and
M. Billger. Analysis of cross-cultural color emotion.
Color Res Appl, 32(3):223–229, 2007.
[14] S. Hong and H. Choi. Color image semantic
information retrieval system using human sensation
and emotion. In Proceedings IACIS, VII, p. 140–145,
2006.
[15] T. Joachims. Making large-scale support vector
machine learning practical. Advances in kernel
methods: support vector learning, p. 169–184,
Cambridge, USA 1999.
[16] S. Kobayashi. Color Image Scale. Kodansha Intern.,
1991.
[17] J. Lee, Y.-M. Cheon, S.-Y. Kim, and E.-J. Park.
Emotional evaluation of color patterns based on rough
sets. In 3rd Intern Conf on Natural Computation,
ICNC 2007, 1:140–144, Haikou, Hainan, 2007.
[18] R. Lenz and P. Latorre-Carmona. Hierarchical
S(3)-Coding of RGB Histograms. VISIGRAPP 2009,
Communications in Computer and Information
Science, 68:188–200, 2010.
[19] H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on
Platt's probabilistic outputs for support vector
machines. Mach. Learn., 68(3):267–276, 2007.
[20] Y. Liu, D. Zhang, G. Lu, and W.-Y. Ma. A survey of
content-based image retrieval with high-level
semantics. Pattern Recogn., 40(1):262–282, 2007.
[21] L.-C. Ou, M. Luo, A. Woodcock, and A. Wright. A
study of colour emotion and colour preference. part i:
Colour emotions for single colours. Color Res Appl,
29(3):232–240, 2004.
[22] L.-C. Ou, M. Luo, A. Woodcock, and A. Wright. A
study of colour emotion and colour preference. part ii:
Colour emotions for two-colour combinations. Color
Res Appl, 29(4):292–298, 2004.
[23] L.-C. Ou, M. Luo, A. Woodcock, and A. Wright. A
study of colour emotion and colour preference. part iii:
Colour preference modeling. Color Res Appl,
29(5):381–389, 2004.
[24] J. C. Platt. Probabilistic outputs for support vector
machines and comparisons to regularized likelihood
methods. In Advances in Large Margin Classifiers, p.
61–74. MIT Press, 1999.
[25] M. Solli and R. Lenz. Color emotions for image
classification and retrieval. In Proceedings CGIV 2008,
p. 367–371, 2008.
[26] M. Solli and R. Lenz. Color based bags-of-emotions. In
CAIP ’09: Proceedings of the 13th International
Conference on Computer Analysis of Images and
Patterns, p. 573–580, Berlin, Heidelberg, 2009.
[27] A. Torralba, R. Fergus, and W. T. Freeman. 80 million
tiny images: A large data set for nonparametric object
and scene recognition. IEEE Trans. Pattern Anal.
Mach. Intell., 30(11):1958–1970, 2008.
[28] K. van de Sande, T. Gevers, and C. Snoek. Evaluating
color descriptors for object and scene recognition.
IEEE Trans Pattern Anal Mach Intell (in print), 2009.
[29] J. van de Weijer and C. Schmid. Coloring local feature
extraction. In European Conference on Computer
Vision, II:334–348. Springer, 2006.
[30] W.-N. Wang and Y.-L. Yu. Image emotional semantic
query based on color semantic description. In Int Conf
on Mach Learn and Cybern, ICMLC 2005,
p.4571–4576, Guangzhou, 2005.
[31] W.-N. Wang, Y.-L. Yu, and S.-M. Jiang. Image
retrieval by emotional semantics: A study of emotional
space and feature extraction. In 2006 IEEE Int Conf
on Syst, Man and Cybern, 4:3534–3539, Taipei, 2007.
[32] H.-W. Yoo. Visual-based emotional descriptor and
feedback mechanism for image retrieval. J. Inf. Sci.
Eng., 22(5):1205–1227, 2006.