Journal of Electronic Imaging 11(4), 1–10 (October 2002).
Classifying images on the web automatically
Rainer Lienhart
Alexander Hartmann*
Intel Labs
Intel Corporation
2200 Mission College Boulevard
Santa Clara
California 95052-8119
E-mail: [email protected]
Abstract. Numerous research works about the extraction of low-level features from images and videos have been published. However, only recently has the focus shifted to exploiting low-level features to classify images and videos automatically into semantically broad and meaningful categories. In this paper, novel classification algorithms are presented for three broad and general-purpose categories. In detail, we present algorithms for distinguishing photo-like images from graphical images, actual photos from merely photo-like but artificial images, and presentation slides/scientific posters from comics. On a large image database, our classification algorithm achieved an accuracy of 97.69% in separating photo-like images from graphical images. In the subset of photo-like images, true photos could be separated from ray-traced/rendered images with an accuracy of 97.3%, while the subset of graphical images was successfully partitioned into presentation slides/scientific posters and comics with an accuracy of 99.5%. © 2002 SPIE and IS&T.
[DOI: 10.1117/1.1502259]
1 Introduction
Today’s web search engines allow searching for text contained in web pages. However, more and more people are
also interested in finding images and videos on the World
Wide Web. Some search engines, such as AltaVista™ and Google™, have already started to offer the possibility to search for images and videos; however, they often only enable searches based on textual hints, which are taken from the image's filename, ALT tag, and/or the associated web page.
AltaVista™ also offers the possibility to search for images similar to one already found using textual hints. However, the similarity search is only possible for some images,
maybe because either not all images have been analyzed yet or
there are certain criteria an image must meet before it can
be used for a similarity search. Those criteria, however, are
not explained.
The next generation of search engines will also be media portals, which allow searching for all kinds of media elements. For instance, the search engine of Ref. 1 indexes web images based on the visual appearance of text, faces, and registered trademark logos. There is a high demand for search engines that can index beyond textual descriptions. Media portals of tomorrow need to classify their media content automatically. Image libraries of tens of millions of images cannot be classified manually.

*Present address: IT-SAS, Gottlieb-Daimler-Str. 12, 68165 Mannheim, Germany.
Paper II-04 received Feb. 18, 2002; revised manuscript received May 31, 2002; accepted for publication June 12, 2002.
1017-9909/2002/$15.00 © 2002 SPIE and IS&T.
In this paper we present novel classification algorithms
for three broad categories. In detail, we present algorithms
for distinguishing
1. photos/photo-like images from graphical images,
2. actual photos from artificial photo-like images such
as raytracing images or screen shots from photorealistic computer games, and
3. presentation slides/scientific posters from comics/
cartoons.
With the exception of the classification into photos/
photo-like images and graphical images we are not aware
of any directly related work.
Our choice of these four classes is the result of a thorough analysis of the image classes we could find most often in our database of web images. Over a period of four months about 300 000 web images, which did not represent buttons or navigational elements, were crawled and downloaded from the web. A large percentage of these images fell into the four aforementioned categories. The four categories were arranged into a simple classification hierarchy (see Fig. 1).
2 Related Work
Only recently automatic semantic classification of images
into broad general-purpose classes has been the topic of
some research. General-purpose classes are meaningful to
normal people and can be performed by them without being
an expert in a specific field. Examples of general-purpose
classes are outdoor versus indoor and city versus landscape
scenes.
In Refs. 2 and 3 Vailaya et al. describe a method to
classify vacation images into classes like indoor/outdoor,
city/landscape, and sunset/mountain/forest scenes. They
use a Bayesian framework for separating the images in a classification hierarchy and report an accuracy of 90.5% for indoor versus outdoor classification, 95.3% for city versus landscape classification, and 96.6% for sunset versus forest/mountain classification.

Fig. 1 Classification hierarchy.
Gorkani et al. propose a method for distinguishing city/
suburb from country/landscape scenes using the most
dominant orientation in the image texture.4 The dominant
orientation differs between city and landscape images. The
authors state that it takes humans almost no time or "brain power" to distinguish between those image classes, so an easy and fast-to-calculate feature should exist. The
authors report a classification accuracy of 92.8% on 98 test
images.
Yiu et al. classify pictures into indoor/outdoor scenes
using color histograms and texture orientation.5 For the orientation they use the algorithm by Gorkani and Picard.4
The vertical orientation serves as the discriminant feature,
because indoor images tend to have more artifacts, and artifacts tend to have strong vertical lines.
Bradshaw proposes a method for labeling image regions as natural or man-made. For instance, buildings are man-made, while mountains in the background are natural. For
homogeneous images, i.e., images depicting either only
man-made or only natural objects, an error rate of about
10% is reported. Bradshaw also proposes how this feature
can be used for indoor versus outdoor classification.6 He
reports a classification accuracy of 86.3%.
Swain et al. describe how to separate photographs and graphics on web pages.7,8 They only search for "simple" graphics such as navigation buttons or drawings, while our work deals with artificial but realistic-looking images, which would be classified as natural by their algorithm. The features that Swain et al. used are: number of
colors, most frequent color, farthest neighbor metric, saturation metric, color histogram metric, and a few more.7,8 An
error rate of about 9% is reported for distinguishing photos
from graphics encoded as JPEG images.
Schettini et al. recently addressed the problem of separating photographs, graphics, text, and compound documents using color distribution, color statistics, edge distribution, wavelet coefficients, texture features, and
percentage of skin color pixels as features. Compound
documents here are images consisting of more than one of
the categories photographs, graphics, and text. Decision
trees trained by the CART algorithm are used as the base
classifier. Multiple decision trees are trained and combined
in a majority vote. For photos versus text versus graphics
precision values between 0.88 and 0.95 are reported.9 The
authors also applied the same approach to the problem of
distinguishing indoor, outdoor and close-up images. Precision values between 0.87 and 0.91 are reported.10
3 Graphical Versus Photo-Like Images
One of the first decisions a user has to make when searching for a particular image is whether the image should be
graphical or photo-like. Examples of graphical images are buttons and navigation elements, scientific presentations, slides, and comics; examples of realistic-looking, photo-like images are photos, raytracing images, and photorealistic images of modern computer games.
3.1 Features
Features that could be distinctive for this separation, some of which have been proposed by Swain et al. in Refs. 7 and 8, are:
• the total number of different colors. Graphics tend to have fewer colors;
• the relative size of the largest region and/or the number of regions with a relative size bigger than a certain threshold. Graphics tend to have larger uniformly
colored regions;
• the sharpness of the edges. Edges in graphics are usually sharper than edges in photos;
• the fraction of pixels with a saturation greater than a
certain threshold. Colors in graphics are usually more
saturated than those in realistic-looking images;
• the fraction of pixels having the prevalent color. Graphics tend to have fewer colors than photos, and thus the fraction of pixels of the prevalent color is higher;
• the farthest neighbor metric, which measures the color distance between two neighboring pixels. The distance is defined as d = |r1 − r2| + |g1 − g2| + |b1 − b2|, the absolute difference of both pixels' RGB values. Three subfeatures can be derived:
  • the fraction f1 of pixels with a distance greater than zero. Graphics usually have larger single-colored regions, so this metric should be lower for graphics;
  • the fraction f2 of pixels with a distance greater than a high threshold. This value should be high for graphics; and
  • the ratio between f2 and f1. As f1 tends to be larger for photographs, a low value of f2/f1 indicates a photo-like image. (A feature-extraction sketch for these candidates follows the list.)
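For illustration, most of these candidate features can be sketched in a few lines of Python/NumPy. This is a minimal sketch, assuming an 8-bit (H, W, 3) RGB input; the saturation and distance thresholds are illustrative placeholders (the paper does not state the exact values), the neighbor comparison is restricted to horizontal neighbors for brevity, and the region-size and edge-sharpness features are omitted:

```python
import numpy as np

def candidate_features(rgb, sat_thresh=0.5, dist_thresh=96):
    """Candidate features of Sec. 3.1 for an (H, W, 3) uint8 RGB image.
    Thresholds are illustrative; the paper does not specify them."""
    n_pixels = rgb.shape[0] * rgb.shape[1]

    # Total number of colors after truncating each channel to 5 bits.
    quant = (rgb >> 3).reshape(-1, 3).astype(np.int32)
    codes = quant[:, 0] * 1024 + quant[:, 1] * 32 + quant[:, 2]
    counts = np.bincount(codes, minlength=32768)
    n_colors = np.count_nonzero(counts)

    # Fraction of pixels having the prevalent (most frequent) color.
    prevalent_frac = counts.max() / n_pixels

    # Fraction of pixels with saturation above a threshold.
    mx = rgb.max(axis=2).astype(np.float32)
    mn = rgb.min(axis=2).astype(np.float32)
    saturation = np.where(mx > 0, (mx - mn) / np.maximum(mx, 1), 0.0)
    sat_frac = float(np.mean(saturation > sat_thresh))

    # Farthest-neighbor metric d = |r1-r2| + |g1-g2| + |b1-b2| between
    # horizontal neighbors; f1, f2, and their ratio as defined above.
    d = np.abs(np.diff(rgb.astype(np.int32), axis=1)).sum(axis=2)
    f1 = float(np.mean(d > 0))
    f2 = float(np.mean(d > dist_thresh))
    return n_colors, prevalent_frac, sat_frac, f1, f2, (f2 / f1 if f1 else 0.0)
```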
3.2 Training
Obviously, most of these features are not statistically independent, but rather highly correlated. Therefore, we decided to implement all features and then to select the most relevant ones by means of feature selection. The discrete AdaBoost machine learning algorithm with stumps served as our feature selector.11 AdaBoost is a boosting algorithm that combines many "weak" classifiers into a powerful, committee-based "strong" classifier (see Fig. 2). Weak classifiers can be very simple and are only required to be better than chance. Common weak classifiers are stumps: single-split trees with only two terminal nodes. In each loop the stump with the lowest training error is selected in step (3a) of Fig. 2. In other words, in (3a) k simple threshold classifiers ("stumps") are trained, one for each of the k dimensions of the input samples. The classifier with the lowest weighted error err_m is selected as f_m(x).

Fig. 2 Discrete AdaBoost training algorithm.
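Since Fig. 2 itself is not reproduced here, the following is a minimal sketch of Discrete AdaBoost with stumps in the spirit of the figure, assuming labels y in {-1, +1}; the exhaustive threshold search is the simplest (not the fastest) way to implement step (3a):

```python
import numpy as np

def best_stump(X, y, w):
    """Step (3a): over all k feature dimensions, find the threshold
    classifier (stump) with the lowest weighted training error."""
    best = (np.inf, 0, 0.0, 1)
    for k in range(X.shape[1]):
        for t in np.unique(X[:, k]):
            for polarity in (1, -1):
                pred = np.where(X[:, k] < t, -polarity, polarity)
                err = float(w @ (pred != y))
                if err < best[0]:
                    best = (err, k, t, polarity)
    return best

def discrete_adaboost(X, y, n_rounds):
    n = len(y)
    w = np.full(n, 1.0 / n)                        # (1) uniform weights
    ensemble = []
    for _ in range(n_rounds):                      # (2) boosting rounds
        err, k, t, pol = best_stump(X, y, w)       # (3a) select weak learner
        c = np.log((1.0 - err) / max(err, 1e-12))  # (3b) committee weight
        pred = np.where(X[:, k] < t, -pol, pol)
        w = w * np.exp(c * (pred != y))            # (3c) up-weight errors
        w = w / w.sum()
        ensemble.append((c, k, t, pol))
    return ensemble  # F(x) = sign(sum_m c_m f_m(x))
```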
After training with 7516 images, only four features proved to be useful:
• the total number of colors c_n after truncating each color channel to only its five most significant bits (32×32×32 = 32 768 colors);
• the prevalent color c_p;
• the fraction f1 of pixels with a distance greater than zero; and
• the ratio between f2 and f1.
All other features were not selected by the AdaBoost algorithm. Most likely they were not distinctive enough, partly because all our images were JPEG compressed. Some of the characteristic features of graphics are
destroyed by JPEG’s lossy compression. Note that in Refs. 7
and 8 most graphical images were GIF compressed, which
simplifies the task.
The overall classifier

    F(x) = sign( Σ_{m=1}^{M} c_m · f_m(x) )

with

    f_m(x) = val_left(m)  if x < threshold_m,  and  val_right(m)  otherwise,

used seven stumps (M = 7) with the parameters in Table 1.
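Read as code, the committee decision with the Table 1 stumps looks as follows. Note that the val_left/val_right entries in Table 1 are nonnegative scores, so the working decision rule presumably compares the weighted sum against a cut-off (e.g., half the total weight) rather than zero; that detail is our assumption, since the paper only writes sign(·):

```python
def committee_score(x, stumps):
    """sum_m c_m f_m(x), where each stump returns val_left if
    x[feature] < threshold_m and val_right otherwise. `stumps` holds
    Table 1 rows as (feature, c_m, threshold_m, val_left, val_right)."""
    return sum(c * (vl if x[f] < t else vr) for f, c, t, vl, vr in stumps)

# First stump of Table 1 as an example (values from the paper):
stumps = [("f2/f1", 3.43024, 0.21425, 0.970419, 0.0402329)]
print(committee_score({"f2/f1": 0.1}, stumps))  # 3.43024 * 0.970419
```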
3.3 Experimental Results
On a test set of 947 graphical images (comics, scientific posters, and presentation slides; the same images as in Sec. 5) and 2272 photographic images (raytracing images and photographs; the same images as in Sec. 4), 91.92% of the graphical images and 98.97% of the photo-like images were classified correctly, resulting in an overall accuracy of 97.69%. The misclassified photo-like images were mostly photos made up of only a few colors (such as the right image in Fig. 3) or raytracing images that did not look realistic at all, but were put into this class because they were part of an image archive of raytracing images (see the two leftmost images in Fig. 3).
Misclassified graphical images were either slides containing large photographs (Fig. 4) or very colorful comics (not shown for copyright reasons).
Overall, most errors in this class were caused by the large visual diversity of the slides/presentation class, which not only consists of PowerPoint presentation images, but also many scientific posters related to space/astronomy, fluid/wind motion, or physics in general. Only 80.5% of them were classified as graphical.

Fig. 3 Examples of realistic looking images, which were misclassified as being graphics. The misclassified images were either photos made up of only a few colors (right image) or raytracing images, which did not look realistic at all, but were put into this class because they were part of an image archive of raytracing images.
Fig. 4 Examples of graphical images misclassified as realistic.
4 Computer-Generated, Realistic-Looking Images Versus Real Photos
The algorithm proposed in this section for distinguishing between real photos and computer-generated, but realistic-looking images can be applied to the set of images that have been classified as photo-like by the algorithm described in Sec. 3.
The class of real photos encompasses all kinds of images
taken from nature. Typical examples are digital photos and
video frames. In contrast, the class of computer-generated
images encompasses raytracing images as well as images
from graphic tools such as Adobe Photoshop and computer
games. Figure 5 shows three examples for each class.
4.1 Features
Every real photo contains noise due to the process of converting an analog image into digital form. For computer-generated, realistic-looking images this conversion/scanning process is not needed. Thus, it can be expected that computer-generated images are far less noisy than digitized images. By designing a feature that measures noise, it should be possible to distinguish between scanned images and images that were digital right from the beginning.
A second suitable feature is the sharpness of the edges. Computer-generated images are supposed to display sharper edges than photographs. However, due to lossy JPEG compression this feature becomes less reliable: sharp edges may be blurred, and blockiness may be added, i.e., sharp edges might appear that were not there before.
In practice, we measure the amount of noise by means of
the histogram of the absolute difference image between the
original and its denoised version. The difference values can
vary between 0 and 255. Two simple and fast filters for
denoising are the median and the Gaussian filter. The core difference between the two filters is that they assume different noise sources: the median filter is more suitable for individual pixel outliers, while the Gaussian filter is better for additive noise.12
Both denoising filters were applied with a radius of 1, 2, 3, and 4. Thus, the resulting feature vector consisted of 2048 values: 4×256 from the median filter and 4×256 from the Gaussian filter.

Fig. 5 Examples of raytracing images and photographs.
Fig. 6 Gentle AdaBoost training algorithm (see Ref. 13).
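For illustration, the noise measurement described above can be sketched as follows, assuming a grayscale uint8 input, SciPy's standard median and Gaussian filters, the mappings radius → kernel size 2r+1 (median) and radius → sigma (Gaussian), and per-pixel histogram normalization; these details are our assumptions, as the paper does not spell them out:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter

def noise_features(gray):
    """2048-dimensional noise feature: for the median and Gaussian
    denoising filters at radii 1..4, the 256-bin histogram of the
    absolute difference between the image and its denoised version."""
    feats = []
    for kind in ("median", "gaussian"):
        for radius in (1, 2, 3, 4):
            if kind == "median":
                denoised = median_filter(gray, size=2 * radius + 1)
            else:
                denoised = gaussian_filter(gray.astype(np.float32), sigma=radius)
            diff = np.abs(gray.astype(np.int32) - denoised.astype(np.int32))
            hist = np.bincount(diff.ravel(), minlength=256)[:256]
            feats.append(hist / gray.size)   # normalize by pixel count
    return np.concatenate(feats)             # 2 filters x 4 radii x 256 = 2048
```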
4.2 Training
As in Sec. 3.2, we can expect that our features are highly correlated. In addition, many of them may only encode noise with respect to the classification task at hand. Therefore, we use boosting again for training and feature selection. This time, however, due to the large number of features, we compare the performance of four boosting variants: Discrete AdaBoost, Gentle AdaBoost, Real AdaBoost, and LogitBoost.13 The latter three usually compare favorably to Discrete AdaBoost with respect to the number of weak classifiers needed to achieve a certain classification performance. The algorithm for Gentle AdaBoost is depicted in Fig. 6. In our experiments it usually produced the classifier with the best performance/computational-complexity tradeoff.
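A minimal sketch of Gentle AdaBoost with regression stumps, in the spirit of Fig. 6 and Ref. 13, again assuming labels y in {-1, +1}; the exhaustive threshold search is kept simple rather than efficient:

```python
import numpy as np

def gentle_adaboost(X, y, n_rounds):
    """Each round fits a regression stump f_m to y by weighted least
    squares and updates w_i *= exp(-y_i * f_m(x_i)); the final
    classifier is F(x) = sign(sum_m f_m(x))."""
    n, k = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(n_rounds):
        best = None
        for j in range(k):
            for t in np.unique(X[:, j]):
                left = X[:, j] < t
                wl, wr = w[left].sum(), w[~left].sum()
                # Weighted least squares: the optimal constant on each
                # side of the split is the weighted mean of y there.
                vl = (w[left] @ y[left]) / wl if wl > 0 else 0.0
                vr = (w[~left] @ y[~left]) / wr if wr > 0 else 0.0
                pred = np.where(left, vl, vr)
                err = float(w @ (y - pred) ** 2)
                if best is None or err < best[0]:
                    best = (err, j, t, vl, vr)
        _, j, t, vl, vr = best
        pred = np.where(X[:, j] < t, vl, vr)
        w = w * np.exp(-y * pred)
        w = w / w.sum()
        ensemble.append((j, t, vl, vr))
    return ensemble
```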
4.3 Experimental Results
The overall image set consisted of 3225 scenic photographs from the "Master Clips 500.000" collection and 4352 raytracing images from http://www.irtc.com, the Internet Ray Tracing Competition. The overall image set was randomly partitioned into 5305 (70%) training images and 2272 (30%) test images.
Training was performed with
a. all 2048 feature values,
b. only the 1024 median feature values, and
c. only the 1024 Gaussian feature values
in order to analyze the suitability of the median and Gaussian features for the classification task as well as the performance gain from using both feature sets jointly.
The results are shown in Table 2. The number M of weak classifiers was determined by setting the target hit rate as the termination criterion for the iterative loop of the boosting algorithms. The following observations can be drawn from the results shown in Table 2.
a. The test accuracy increases consistently with the training accuracy, demonstrating one of the most impressive features of boosting algorithms: their tendency not to overfit the training data in practice.
b. The median feature values perform better than the Gaussian features. For instance, with Discrete AdaBoost, the test error rate for median values is 3.4% compared to 5.9% for Gaussian values, while using only half the number of weak classifiers and thus roughly half the number of features.
c. Using both feature sets reduces the test error rate to 2.7%. At the same time the number of weak classifiers is reduced by another 30%. Thus, by using a larger feature pool from which the boosting algorithm can pick, fewer features are needed for a better classifier.
d. The best classification results were achieved using Gentle AdaBoost with all 2048 values in the feature pool. Classification accuracy for the test set was 97.3%: 98.2% for raytracing images and 96.0% for photos. In our previous work, we used the linear vector quantization package from the Helsinki University of Technology14 to train our classifier. However, the classification accuracy for the same test set was only 87.33%.15
A closer inspection of the misclassified raytracing images revealed two main sources of misclassification. These images either used noisy, real-world textures or were very small in dimension (e.g., only 100×75 pixels). Some "photos" were misclassified since they were not real photos (see Fig. 7, left image). Figure 7 shows a few examples of misclassified images.
5 Presentation Slides/Scientific Posters Versus Comics/Cartoons
The algorithm proposed in this section for distinguishing between presentation slides/scientific posters and comics/cartoons can be applied to the set of images that have been classified as graphical by the algorithm in Sec. 3.
Table 1 Discrete AdaBoost parameters for graphics vs photo/photo-like images.

m   Feature   c_m        Threshold_m   Val_left    Val_right
1   f2/f1     3.43024    0.21425       0.970419    0.0402329
2   f2/f1     1.4672     0.0638578     0.931319    0.263485
3   c_p       1.04537    0.1872        0.821982    0.387983
4   c_n       0.775026   626           0.330293    0.687471
5   f2/f1     0.766541   0.213204      0.370636    0.798874
6   f2/f1     0.509741   0.106966      0.729928    0.463033
7   f1        0.538214   0.987547      0.636153    0.397175
Table 2 Classification performance of computer-generated, realistic-looking images vs real photos. The results are shown for four common boosting algorithms. Training/test accuracy was determined on a training/test set of 5305/2272 images. Training accuracy was used as a termination criterion for the boosting training. Note that Discrete AdaBoost consistently needs more features to achieve the same training and test accuracy as the other boosting algorithms.

                       Median values             Median+Gaussian values    Gaussian values
                    No.    Train   Test       No.    Train   Test       No.    Train   Test
Gentle AdaBoost      61    0.950   0.917       45    0.951   0.935      155    0.950   0.905
                     75    0.961   0.932       55    0.963   0.938      185    0.960   0.913
                     96    0.971   0.948       70    0.972   0.951      236    0.970   0.917
                    127    0.980   0.951       91    0.980   0.963      272    0.980   0.924
                    185    0.990   0.960      130    0.991   0.973      356    0.990   0.935
Discrete AdaBoost    79    0.951   0.922       47    0.951   0.929      210    0.950   0.908
                    101    0.960   0.929       57    0.961   0.935      256    0.960   0.917
                    139    0.970   0.944       86    0.971   0.947      324    0.970   0.928
                    183    0.980   0.953      100    0.980   0.959      403    0.980   0.928
                    295    0.990   0.966      164    0.990   0.967      590    0.990   0.941
Real AdaBoost        58    0.951   0.923       46    0.952   0.926      158    0.951   0.905
                     70    0.961   0.930       56    0.961   0.941      183    0.960   0.905
                     90    0.971   0.945       65    0.971   0.947      223    0.971   0.915
                    122    0.981   0.951       92    0.981   0.957      273    0.980   0.928
                    163    0.991   0.960      120    0.990   0.963      349    0.990   0.937
LogitBoost           52    0.951   0.926       39    0.951   0.923      144    0.951   0.892
                     63    0.961   0.929       50    0.961   0.932      179    0.960   0.897
                     80    0.970   0.942       65    0.970   0.941      205    0.970   0.908
                    104    0.980   0.955       88    0.980   0.957      261    0.980   0.913
                    151    0.990   0.958      121    0.990   0.964      321    0.990   0.921
Max (test accuracy)        0.966                     0.973                     0.941

Fig. 7 Examples of misclassified images as (a) natural images and (b) photorealistic, but artificial images.
Fig. 8 Examples for (a) slides and (b) scientific posters.
The class of presentation slides includes all images
showing slides independently of whether they were created
digitally by presentation programs such as MS PowerPoint
or by hand. Many scientific posters are designed like a slide and, therefore, fall into this class, too. However, scientific posters may also differ significantly from the general layout of slides. Together, the image classes of presentation slides and scientific posters make up the class of presentation slides/scientific posters.
The class of comics includes cartoons from newspapers,
most of which are available on the web, and books as well
as other kinds of comics.
Images of both classes can be colored or black and
white. Three examples for slides and three for scientific
posters are shown in Fig. 8, while examples of comics cannot be shown for copyright reasons.
5.1 Features
We observed the following three main differences between
presentation slides/scientific posters and comics/cartoons.
1. In general, the relative size and/or alignment of text
line occurrences differ for comics and slides/posters. Thus,
images of both classes can be distinguished by means of
• the relative width of the topmost text line, i.e., the
ratio between the width of the topmost text line and
the width of the entire image,
• the average relative width and height of all text lines
and their respective standard deviations, and
• the average relative position and standard deviation of
the center of mass over all text lines.
These features are motivated by the following observations. Slides usually have a heading that almost fills the entire width of the image. The subsequent text lines are wider than they are in comics. Moreover, the text lines in slides either have only one center at about the middle of the image, leading to a small standard deviation over the locations of their centers of mass, or they all start in the same column and therefore have different centers of mass; but all those centers of mass are still near each other, resulting in a small standard deviation over the center locations, too.
The relative width of the topmost text line in comics is
usually smaller than in slides, as are all other text lines.
Slides in general use larger fonts than comics do. Therefore, the larger the average relative height of the text lines,
the more probable it is that the image represents a slide.
Further, text in two or more columns is uncommon for
slides. Comics on the other hand usually consist of more
than one image resulting in more than just one visual center
of text blocks. Thus, the standard deviation over the text
line center locations will be large.
2. Images containing multiple smaller images aligned on a virtual grid and framed by rectangles are very likely to be comics. These borders can easily be detected by edge detection algorithms.
In comics, the length of those border lines is usually an integral fraction of the image's width or height. For instance, they might be about a third of the image's width. The more lines of such length can be found, the higher the probability that the image is a comic instead of a presentation slide.
This criterion can be made more precise by checking for the presence of the other n−1 lines in the same row/column if a line with a length of one n-th of the image's width/height was found. By means of this procedure, lines are eliminated that just by chance have the correct length but have nothing to do with the typical borders in comics. (A sketch of this check follows below.)
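A minimal sketch of this border-line test, assuming the vectorized edges arrive as axis-parallel segments (x0, y0, x1, y1); the 8% length tolerance and the exact grouping by row/column are illustrative choices of ours, not values from the paper:

```python
def border_line_counts(lines, image_w, image_h, tol=0.08):
    """For n = 1..4, count axis-parallel lines whose length is about
    1/n of the image width (horizontal) or height (vertical), keeping
    a line only if at least n such segments share its row/column, so
    that accidental length matches are dropped."""
    def seg_len(l):
        return abs(l[2] - l[0]) + abs(l[3] - l[1])  # axis-parallel

    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    for line in lines:
        horizontal = line[1] == line[3]
        full = image_w if horizontal else image_h
        for n in (1, 2, 3, 4):
            if abs(seg_len(line) - full / n) > tol * full:
                continue
            # All segments of matching length in the same row/column.
            track = [l for l in lines
                     if (l[1] == l[3]) == horizontal
                     and abs(seg_len(l) - full / n) <= tol * full
                     and (l[1] == line[1] if horizontal else l[0] == line[0])]
            if len(track) >= n:
                counts[n] += 1
            break
    return counts
```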
3. Slides very often have a width-to-height ratio of 4:3 (landscape orientation). If the aspect ratio differs from this ratio, it is very unlikely that the image is a slide.
5.2 Feature Calculation
We used the algorithm and system developed by Lienhart
et al. to find all text lines and text columns in the image
under analysis.16 The text detection system was retrained
with text samples from slides and comics in order to improve text line detection performance. Based on the detected bounding boxes the following five features were calculated:
• the relative width of the topmost text line with respect to the image's width,
• the average text line width and its standard deviation over all detected text lines, and
• the average horizontal center position and its standard deviation over all detected text lines.
A sketch of these computations follows.
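Assuming the detector of Ref. 16 returns axis-aligned text boxes as (x, y, width, height) tuples, these five values can be computed roughly as follows; normalizing everything by the image width is our reading of "relative," since the paper does not spell out the exact normalization:

```python
import numpy as np

def text_layout_features(boxes, image_w):
    """Five text-line features from detected boxes (x, y, w, h),
    all normalized by the image width."""
    if not boxes:
        return np.zeros(5)
    boxes = sorted(boxes, key=lambda b: b[1])        # topmost line first
    widths = np.array([b[2] for b in boxes]) / image_w
    centers = np.array([b[0] + b[2] / 2.0 for b in boxes]) / image_w
    return np.array([
        widths[0],       # relative width of the topmost text line
        widths.mean(),   # average relative text line width
        widths.std(),    # ...and its standard deviation
        centers.mean(),  # average horizontal center-of-mass position
        centers.std(),   # ...and its standard deviation
    ])
```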
Edges were extracted by means of the Canny edge detection algorithm and then vectorized.17 All nonhorizontal or nonvertical edges were discarded. Two vertical or horizontal lines were merged if and only if they had the same orientation and the end point of one line was near the start point of the other. This procedure helped to overcome accidental breakups in the border lines, but also merged nearby lines from multiple "picture boxes." Next, the lengths of all remaining edges were determined and checked as to whether they were about one, one half, one third, or
one fourth of the width or height of the image. If not, the respective edge was discarded. Finally, the relative frequencies of edges with roughly the n-th fraction of the image's width or height (n ∈ {1, 2, 3, 4}) were counted and taken as another four features.
The feature set was completed by
• the absolute number of vertical and the absolute number of horizontal edges, as well as
• the aspect ratio of the image dimensions.
In total 12 features were used.

Table 3 Classification performance for comics/cartoons vs slides/posters.

                     No. features   Training accuracy   Test accuracy
Gentle AdaBoost             2             0.980              0.976
                            2             0.980              0.976
                            2             0.980              0.976
                            5             0.983              0.983
                           36             1.000              0.995
Discrete AdaBoost           2             0.980              0.976
                            2             0.980              0.976
                            2             0.980              0.976
                            5             0.987              0.979
                           42             1.000              0.995
Real AdaBoost               2             0.980              0.976
                            2             0.980              0.976
                            2             0.980              0.976
                            4             0.983              0.983
                           21             1.000              0.993
LogitBoost                  2             0.980              0.976
                            2             0.980              0.976
                            2             0.980              0.976
                            5             0.994              0.990
                         1000             0.999              0.992
Max (test accuracy)                                          0.995

Fig. 9 The only two misclassified presentation slides/scientific posters.
Fig. 10 Box layout of two of the three misclassified cartoon images.
5.3 Experimental Results
During our experiments we observed that in comics the
neural network-based text localizer detected a significant
number of false text blocks of small width, but large height.
In contrast, the text blocks in slides were recognized very
well. This stark contrast in the false alarm rate of our text
localizer between comics and slides can partly be explained
by the fact that large fonts are prevalent in slides, but not in comics, and that our detector worked very
well on large text lines. In addition, the kinds of strokes
used in comics to draw people and objects sometimes have
similar properties as the strokes used for text, and thus
result in false alarms. Despite these imperfections of our
text detector,16 all our features except the average height of
the text lines could be used.
Again the boosting learning algorithms were used for training. The training set consisted of 2211 images (70% of the overall image set): 818 slides/posters and 1393 comics. Our novel classification algorithm was tested on a test set of 947 images (30% of the overall image set): 361 slides/posters and 586 comics. As shown in Table 3, there are not many differences between the different boosting algorithms. For Gentle and Discrete AdaBoost a test accuracy of 99.5% was achieved. This translates to only five misclassified images. The image's aspect ratio and the number of vertical edges were always the first two features chosen by the boosting algorithms. In the Gentle AdaBoost case, even at the test accuracy of 99.5% the following three features were not selected:
• the relative width of the topmost text line with respect to the image width,
• the standard deviation of text line widths, and
• the relative number of edges with a length of about one third of the image width or height.
As mentioned before, only five images were misclassified, two of which were slides/posters (see Fig. 9). The three misclassified cartoons cannot be shown for copyright reasons; however, their schematic layout is shown in Fig. 10. One of them had displaced bounding boxes (Fig. 10, right image), while another violated the assumption that framing lines must be an n-th fraction of the image width or height (Fig. 10, left image). For the third misclassified comic, the reason for misclassification was bad text detection.
6 Conclusion
Automatic semantic classification of images is a very interesting research field. In this paper, we presented novel and effective algorithms for two classification problems that have not been addressed before: comics/cartoons versus slides/posters and real photos versus realistic-looking but computer-generated images. On a large image database, true photos could be separated from ray-traced/rendered images with an accuracy of 97.3%, while presentation slides were successfully distinguished from comics with an accuracy of 99.5%. We also enhanced and adjusted the algorithms proposed in Refs. 7 and 8 for the separation of graphical images from photo-like images. On a large image database, our classification algorithm achieved an accuracy of 97.69%.
Acknowledgments
The authors would like to thank Alexander Kuranov and
Vadim Pisarevsky for the work they put in designing and
implementing the four boosting algorithms.
References
1. www.visoo.com
2. A. Vailaya, "Semantic classification in image databases," PhD thesis, Department of Computer Science, Michigan State University, 2000. http://www.cse.msu.edu/~vailayaa/publications.html.
3. A. Vailaya, M. Figueiredo, A. Jain, and H. J. Zhang, "Bayesian framework for hierarchical semantic classification of vacation images," Proceedings of the IEEE International Conference on Multimedia Computing and Systems (ICMCS), pp. 518–523, Florence, Italy (1999).
4. M. M. Gorkani and R. W. Picard, "Texture orientation for sorting photos 'at a Glance'," Proc. ICPR, pp. 459–464 (Oct. 1994).
5. E. Yiu, "Image classification using color cues and texture orientation," Master thesis, Department of Electrical Engineering and Computer Science, MIT (1996), http://www.ai.mit.edu/projects/cbcl/resarea/current-html/ecyiu/project.html.
6. B. Bradshaw, "Semantic based image retrieval: A probabilistic approach," ACM Multimedia 2000, pp. 167–176 (Oct. 2000).
7. V. Athitsos, M. J. Swain, and C. Frankel, "Distinguishing photographs and graphics on the world wide web," IEEE Workshop on Content-Based Access of Image and Video Libraries, pp. 10–17 (June 1997).
8. C. Frankel, M. J. Swain, and V. Athitsos, "WebSeer: An image search engine for the world wide web," University of Chicago Department of Computer Science Technical Report TR-96-14 (August 1996), http://www.infolab.nwu.edu/webseer/.
9. R. Schettini, G. Ciocca, A. Valsasna, C. Brambilla, and M. De Ponti, "A hierarchical classification strategy for digital documents," Pattern Recogn. 35(8), 1759–1769 (2002).
10. R. Schettini, C. Brambilla, A. Valsasna, and M. De Ponti, "Content based classification of digital documents," IAPR Workshop on Pattern Recognition in Information Systems, Setúbal, Portugal (6–7 July 2001).
11. Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Machine Learning: Proceedings of the Thirteenth International Conference, pp. 148–156, Morgan Kaufmann, San Francisco (1996).
12. B. Jaehne, Digital Image Processing, Springer, Berlin (1997).
13. J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: A statistical view of boosting," Dept. of Statistics, Stanford University, Technical Report (1998).
14. The Learning Vector Quantization Program Package, ftp://cochlea.hut.fi.
15. A. Hartmann and R. Lienhart, "Automatic classification of images on the web," in Storage and Retrieval for Media Databases 2002, Proc. SPIE 4676, 31–40 (2002).
16. R. Lienhart and A. Wernicke, "Localizing and segmenting text in images, videos and web pages," IEEE Trans. Circuits Syst. Video Technol. 12(4), 256–268 (2002).
17. J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell. 8(6), 679–698 (1986).
Rainer Lienhart received his Master's degree in computer science and applied economics and his PhD in computer science from the University of Mannheim, Germany, on "methods for content analysis, indexing, and comparison of digital video sequences." He was a core member of the Movie Content Analysis Project (MoCA). Since 1998 he has been a Staff Researcher at Intel Labs in Santa Clara. His research interests include image/video/audio content analysis, machine learning, scalable signal processing, scalable learning, ubiquitous and distributed media computing in heterogeneous networks, media streaming, and peer-to-peer networking and mass media sharing. He is a member of the IEEE and the IEEE Computer Society.
Alexander Hartmann received his Master's degree in computer science and applied economics from the University of Mannheim, Germany, on "new algorithms for automatic classification of images." During the summer of 2000 he was a Summer Intern at Intel Labs in Santa Clara. Currently he is working as a software engineer at ITSAS, an IBM Global Services Company, in Germany. His interests include Linux and cryptography.