Journal of Electronic Imaging 11(4), 1-10 (October 2002). [DOI: 10.1117/1.1502259]

Classifying images on the web automatically

Rainer Lienhart
Alexander Hartmann*
Intel Labs, Intel Corporation
2200 Mission College Boulevard, Santa Clara, California 95052-8119
E-mail: [email protected]

Abstract. Numerous research works about the extraction of low-level features from images and videos have been published. Only recently, however, has the focus shifted to exploiting low-level features to classify images and videos automatically into semantically broad and meaningful categories. In this paper, novel classification algorithms are presented for three broad and general-purpose categories. In detail, we present algorithms for distinguishing photo-like images from graphical images, actual photos from photo-like but artificial images, and presentation slides/scientific posters from comics. On a large image database, our classification algorithm achieved an accuracy of 97.69% in separating photo-like images from graphical images. In the subset of photo-like images, true photos could be separated from ray-traced/rendered images with an accuracy of 97.3%, while the subset of graphical images was successfully partitioned into presentation slides/scientific posters and comics with an accuracy of 99.5%. © 2002 SPIE and IS&T.

*Present address: IT-SAS, Gottlieb-Daimler-Str. 12, 68165 Mannheim, Germany.
Paper II-04 received Feb. 18, 2002; revised manuscript received May 31, 2002; accepted for publication June 12, 2002. 1017-9909/2002/$15.00 © 2002 SPIE and IS&T.

1 Introduction

Today's web search engines allow searching for text contained in web pages. However, more and more people are also interested in finding images and videos on the World Wide Web. Some search engines, such as AltaVista™ and Google™, have already started to offer the possibility to search for images and videos; however, they often only enable search based on textual hints taken from the image's filename, ALT tag, and/or associated web page. AltaVista™ also offers the possibility to search for images similar to one already found using textual hints. However, the similarity search is only possible for some images, perhaps because not all images have been analyzed yet, or because an image must meet certain criteria before it can be used for a similarity search. Those criteria, however, are not explained.

The next generation of search engines will also be media portals, which allow searching for all kinds of media elements. For instance, Ref. 1 indexes web images based on the visual appearance of text, faces, and registered trademark logos. There is a high demand for search engines that can index beyond textual descriptions. Media portals of tomorrow need to classify their media content automatically: image libraries of tens of millions of images cannot be classified manually.

In this paper we present novel classification algorithms for three broad categories. In detail, we present algorithms for distinguishing

1. photos/photo-like images from graphical images,
2. actual photos from artificial photo-like images such as raytracing images or screen shots from photorealistic computer games, and
3. presentation slides/scientific posters from comics/cartoons.

With the exception of the classification into photos/photo-like images and graphical images, we are not aware of any directly related work. Our choice of these four classes is the result of a thorough analysis of the image classes we could find most often in our database of web images.
Over a period of four months, about 300,000 web images that did not represent buttons or navigational elements were crawled and downloaded from the web. A large percentage of these images fell into the four categories above. The four categories were arranged into a simple classification hierarchy (see Fig. 1).

Fig. 1 Classification hierarchy.

2 Related Work

Only recently has automatic semantic classification of images into broad general-purpose classes become the topic of research. General-purpose classes are meaningful to ordinary people, who can assign them without being experts in a specific field. Examples of general-purpose classes are outdoor versus indoor and city versus landscape scenes.

In Refs. 2 and 3, Vailaya et al. describe a method to classify vacation images into classes like indoor/outdoor, city/landscape, and sunset/mountain/forest scenes. They use a Bayesian framework for separating the images in a classification hierarchy and report an accuracy of 90.5% for indoor versus outdoor classification, 95.3% for city versus landscape classification, and 96.6% for sunset versus forest/mountain classification.

Gorkani et al. propose a method for distinguishing city/suburb from country/landscape scenes using the most dominant orientation in the image texture.4 The dominant orientation differs between city and landscape images. The authors argue that since it takes humans almost no time or "brain power" to distinguish between those image classes, there should exist a discriminating feature that is easy and fast to calculate. They report a classification accuracy of 92.8% on 98 test images.

Yiu et al. classify pictures into indoor/outdoor scenes using color histograms and texture orientation.5 For the orientation they use the algorithm by Gorkani and Picard.4 The vertical orientation serves as the discriminant feature, because indoor images tend to contain more artifacts, and artifacts tend to have strong vertical lines.

Bradshaw proposes a method for labeling image regions as natural or man-made. For instance, buildings are man-made, while mountains in the background are natural. For homogeneous images, i.e., images depicting either only man-made or only natural objects, an error rate of about 10% is reported. Bradshaw also shows how this feature can be used for indoor versus outdoor classification,6 reporting a classification accuracy of 86.3%.

Swain et al. describe how to separate photographs and graphics on web pages.7,8 They only search for "simple" graphics such as navigation buttons or drawings, while our work deals with artificial but realistic-looking images, which would be classified as natural by their algorithm. The features Swain et al. use are the number of colors, the most frequent color, a farthest neighbor metric, a saturation metric, a color histogram metric, and a few more.7,8 An error rate of about 9% is reported for distinguishing photos from graphics encoded as JPEG images.

Schettini et al. recently addressed the problem of separating photographs, graphics, text, and compound documents using color distribution, color statistics, edge distribution, wavelet coefficients, texture features, and the percentage of skin color pixels as features. Compound documents here are images consisting of more than one of the categories photographs, graphics, and text. Decision trees trained by the CART algorithm are used as the base classifier.
Multiple decision trees are trained and combined in a majority vote. For photos versus text versus graphics, precision values between 0.88 and 0.95 are reported.9 The authors also applied the same approach to the problem of distinguishing indoor, outdoor, and close-up images. Precision values between 0.87 and 0.91 are reported.10

3 Graphical Versus Photo-Like Images

One of the first decisions a user has to make when searching for a particular image is whether the image should be graphical or photo-like. Examples of graphical images are buttons and navigation elements, scientific presentations, slides, and comics; examples of realistic-looking, photo-like images are photos, raytracing images, and photorealistic images from modern computer games.

3.1 Features

Features that could be distinctive for this separation, some of which have been proposed by Swain et al. in Refs. 7 and 8, are:

• the total number of different colors. Graphics tend to have fewer colors;
• the relative size of the largest region and/or the number of regions with a relative size bigger than a certain threshold. Graphics tend to have larger uniformly colored regions;
• the sharpness of the edges. Edges in graphics are usually sharper than edges in photos;
• the fraction of pixels with a saturation greater than a certain threshold. Colors in graphics are usually more saturated than those in realistic-looking images;
• the fraction of pixels having the prevalent color. Graphics tend to have fewer colors than photos, and thus the fraction of pixels of the prevalent color is higher;
• the farthest neighbor metric, which measures the color distance between two neighboring pixels. The distance is defined as d = |r1 - r2| + |g1 - g2| + |b1 - b2|, the absolute difference of both pixels' RGB values. Three subfeatures can be derived:
  • the fraction f1 of pixels with a distance greater than zero. Graphics usually have larger single-colored regions, so this metric should be lower for graphics;
  • the fraction f2 of pixels with a distance greater than a high threshold. This value should be high for graphics; and
  • the ratio f2/f1. As f1 tends to be larger for photographs, a low value of f2/f1 indicates a photo-like image.

3.2 Training

Obviously, most of these features are not statistically independent, but rather highly correlated. Therefore, we decided to implement all features and then to select the most relevant ones by means of feature selection. The discrete AdaBoost machine learning algorithm with stumps served as our feature selector.11 AdaBoost is a boosting algorithm that combines many "weak" classifiers into a "strong," powerful committee-based classifier (see Fig. 2). Weak classifiers can be very simple and are only required to be better than chance. Common weak classifiers are stumps: single-split trees with only two terminal nodes.

Fig. 2 Discrete AdaBoost training algorithm.

In each loop, the stump with the lowest training error is selected in step (3a) of Fig. 2. In other words, in (3a) k simple threshold classifiers ("stumps") are trained, one for each of the k dimensions of the input samples. The classifier with the lowest weighted error err_m is selected as f_m(x).
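To make the training loop of Fig. 2 concrete, the following is a minimal sketch of Discrete AdaBoost with stumps in Python/NumPy. It is our illustration, not the authors' implementation; the classifier weight uses the common log((1 - err)/err) convention, and thresholds are taken as midpoints between distinct feature values.

```python
import numpy as np

def fit_stump(X, y, w):
    """Step (3a): train one threshold stump per feature dimension and keep
    the one with the lowest weighted classification error.
    X: (n, k) features; y: (n,) labels in {-1, +1}; w: (n,) weights, sum 1.
    Returns (dim, threshold, val_left, val_right, err)."""
    best = None
    for dim in range(X.shape[1]):
        xs = np.unique(X[:, dim])
        for thr in 0.5 * (xs[1:] + xs[:-1]):      # midpoints between values
            left = X[:, dim] < thr
            for vl, vr in ((-1.0, 1.0), (1.0, -1.0)):
                pred = np.where(left, vl, vr)
                err = w[pred != y].sum()          # weighted training error
                if best is None or err < best[-1]:
                    best = (dim, thr, vl, vr, err)
    return best

def discrete_adaboost(X, y, m_rounds):
    """Discrete AdaBoost with stumps; returns a list of (c_m, stump)."""
    w = np.full(len(y), 1.0 / len(y))             # step (1): uniform weights
    committee = []
    for _ in range(m_rounds):
        dim, thr, vl, vr, err = fit_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)     # guard against err in {0, 1}
        c = np.log((1.0 - err) / err)             # step (3b): classifier weight
        miss = np.where(X[:, dim] < thr, vl, vr) != y
        w *= np.exp(c * miss)                     # step (3c): emphasize errors
        w /= w.sum()
        committee.append((c, (dim, thr, vl, vr)))
    return committee

def predict(committee, X):
    """Committee decision F(x) = sign(sum_m c_m f_m(x))."""
    F = sum(c * np.where(X[:, dim] < thr, vl, vr)
            for c, (dim, thr, vl, vr) in committee)
    return np.sign(F)
```

Because each round keeps only the single best stump, the selected feature dimensions double as a ranking of feature relevance, which is exactly how the algorithm is used here for feature selection.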
After training with about 7,516 images, only four features proved to be useful:

• the total number of colors c_n after truncating each color channel to its five most significant bits (32 × 32 × 32 = 32,768 possible colors);
• the fraction c_p of pixels having the prevalent color;
• the fraction f1 of pixels with a farthest-neighbor distance greater than zero; and
• the ratio f2/f1.

All other features were not selected by the AdaBoost algorithm. Most likely they were not distinctive enough, partly because all our images were JPEG compressed: some of the characteristic features of graphics are destroyed by JPEG's lossy compression. Note that in Refs. 7 and 8 most graphical images were GIF compressed, which simplifies the task.

The overall classifier

F(x) = sign[ Σ_{m=1..M} c_m f_m(x) ],  with f_m(x) = val_left^m if x < threshold_m and f_m(x) = val_right^m otherwise,

used seven stumps (M = 7) with the parameters in Table 1.

Table 1 Discrete AdaBoost parameters for graphics vs. photo/photo-like images.

m   Feature   c_m        threshold_m   val_left^m   val_right^m
1   f2/f1     3.43024    0.21425       0.970419     0.0402329
2   f2/f1     1.4672     0.0638578     0.931319     0.263485
3   c_p       1.04537    0.1872        0.821982     0.387983
4   c_n       0.775026   626           0.330293     0.687471
5   f2/f1     0.766541   0.213204      0.370636     0.798874
6   f2/f1     0.509741   0.106966      0.729928     0.463033
7   f1        0.538214   0.987547      0.636153     0.397175
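As an illustration, the four selected features could be computed along the following lines. This is a Python/NumPy sketch under our reading of the definitions in Sec. 3.1: the neighbor set used for the farthest-neighbor distance (right and bottom neighbor, taking the larger distance) and the f2 threshold value are assumptions, since the paper does not specify them.

```python
import numpy as np

def color_features(img, f2_threshold=128):
    """Compute the four AdaBoost-selected features for one RGB image.
    img: (H, W, 3) uint8 array. f2_threshold is a hypothetical value; the
    paper only says f2 counts neighbor distances above a 'high threshold'."""
    h, w, _ = img.shape

    # c_n: number of distinct colors after truncating each channel to its
    # five most significant bits (32 x 32 x 32 = 32,768 possible colors).
    q = (img >> 3).astype(np.uint32)
    packed = (q[..., 0] << 10) | (q[..., 1] << 5) | q[..., 2]
    colors, counts = np.unique(packed, return_counts=True)
    c_n = len(colors)

    # c_p: fraction of pixels having the prevalent (most frequent) color.
    c_p = counts.max() / (h * w)

    # Neighbor distance d = |r1-r2| + |g1-g2| + |b1-b2| to the right and
    # bottom neighbor; keep the larger of the two per pixel (assumption).
    rgb = img.astype(np.int32)
    d_right = np.abs(rgb[:, 1:, :] - rgb[:, :-1, :]).sum(axis=2)
    d_down = np.abs(rgb[1:, :, :] - rgb[:-1, :, :]).sum(axis=2)
    d = np.maximum(d_right[:-1, :], d_down[:, :-1])  # common (H-1, W-1) grid

    f1 = np.mean(d > 0)               # fraction of pixels with distance > 0
    f2 = np.mean(d > f2_threshold)    # fraction above the high threshold
    ratio = f2 / f1 if f1 > 0 else 0.0

    return c_n, c_p, f1, ratio
```

Feeding these four values per image into the stump committee of Table 1 reproduces the structure of the classifier described above.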
3.3 Experimental Results

On a test set of 947 graphical images (comics, scientific posters, and presentation slides; the same images as in Sec. 5) and 2,272 photographic images (raytracing images and photographs; the same images as in Sec. 4), 91.92% of the graphical images and 98.97% of the photo-like images were classified correctly, resulting in an overall accuracy of 97.69%.

The misclassified photo-like images were mostly photos made up of only a few colors (such as the right image in Fig. 3), or raytracing images that did not look realistic at all but were put into this class because they were part of an image archive of raytracing images (see the two leftmost images in Fig. 3). Misclassified graphical images were either slides containing large photographs (Fig. 4) or very colorful comics (not shown for copyright reasons). Overall, most errors in this class were caused by the large visual diversity of the slides/presentations class, which not only consists of PowerPoint presentation images, but also of many scientific posters related to space/astronomy, fluid/wind motion, or physics in general. Only 80.5% of them were classified as graphical.

Fig. 3 Examples of realistic-looking images misclassified as graphics. The misclassified images were either photos made up of only a few colors (right image) or raytracing images that did not look realistic at all, but were part of an image archive of raytracing images.

Fig. 4 Examples of graphical images misclassified as realistic.

4 Computer-Generated, Realistic-Looking Images Versus Real Photos

The algorithm proposed in this section for distinguishing between real photos and computer-generated but realistic-looking images can be applied to the set of images that have been classified as photo-like by the algorithm described in Sec. 3. The class of real photos encompasses all kinds of images taken from nature. Typical examples are digital photos and video frames. In contrast, the class of computer-generated images encompasses raytracing images as well as images from graphic tools such as Adobe Photoshop and computer games. Figure 5 shows three examples of each class.

Fig. 5 Examples of raytracing images and photographs.

4.1 Features

Every real photo contains noise due to the process of converting an analog image into digital form. For computer-generated, realistic-looking images this conversion/scanning process is not needed. Thus, it can be expected that computer-generated images are far less noisy than digitized images. By designing a feature that measures noise, it should be possible to distinguish between scanned images and images that were digital right from the beginning.

A second suitable feature is the sharpness of the edges. Computer-generated images are supposed to display sharper edges than photographs. However, due to lossy JPEG compression this feature becomes less reliable: sharp edges may be blurred, and blockiness may be added, i.e., sharp edges might be introduced that were not there before.

In practice, we measure the amount of noise by means of the histogram of the absolute difference image between the original and its denoised version. The difference values can vary between 0 and 255. Two simple and fast filters for denoising are the median and the Gaussian filter. The core difference between the two filters is that they assume different noise sources: the median filter is more suitable for individual pixel outliers, while the Gaussian filter is better for additive noise.12 Both denoising filters were applied with a radius of 1, 2, 3, and 4. Thus, the resulting feature vector consisted of 2048 values: 4 × 256 from the median filter and 4 × 256 from the Gaussian filter.
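A sketch of this noise-feature extraction is shown below (Python with SciPy's ndimage filters). Mapping "radius r" to a (2r+1)-sized median window and to a Gaussian with sigma = r is our assumption, as is applying the filters to a single grayscale channel; the paper does not spell out these parameters.

```python
import numpy as np
from scipy.ndimage import median_filter, gaussian_filter

def noise_features(gray):
    """2048-dim noise feature vector for one image.
    gray: (H, W) uint8 grayscale image. For each radius r in {1, 2, 3, 4}
    the image is denoised and the 256-bin histogram of the absolute
    difference to the original is taken: 4 x 256 median values followed
    by 4 x 256 Gaussian values."""
    g = gray.astype(np.float32)
    feats = []
    for denoise in (
        lambda im, r: median_filter(im, size=2 * r + 1),   # pixel outliers
        lambda im, r: gaussian_filter(im, sigma=r),        # additive noise
    ):
        for r in (1, 2, 3, 4):
            diff = np.abs(g - denoise(g, r))
            hist, _ = np.histogram(diff, bins=256, range=(0, 256))
            feats.append(hist / diff.size)    # normalize by pixel count
    return np.concatenate(feats)              # shape (2048,)
```

Normalizing each histogram by the pixel count makes the feature vector comparable across images of different sizes, which matters because the crawled web images vary widely in resolution.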
4.2 Training

As in Sec. 3.2, we can expect that our features are highly correlated. In addition, many of them may only encode noise with respect to the classification task at hand. Therefore, we use boosting again for training and feature selection. This time, however, due to the large number of features, we compare the performance of four boosting variants: Discrete AdaBoost, Gentle AdaBoost, Real AdaBoost, and LogitBoost.13 The latter three usually compare favorably to Discrete AdaBoost with respect to the number of weak classifiers needed to achieve a certain classification performance. The algorithm for Gentle AdaBoost is depicted in Fig. 6. In our experiments it usually produced the classifier with the best performance/computational-complexity tradeoff.

Fig. 6 Gentle AdaBoost training algorithm (see Ref. 13).
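For comparison with the Discrete AdaBoost sketch in Sec. 3.2, here is a minimal sketch of Gentle AdaBoost as formulated by Friedman et al. (Ref. 13): each round fits a real-valued regression stump to the labels by weighted least squares and adds it to the additive model. This is again our illustration under that formulation, not the authors' code.

```python
import numpy as np

def fit_regression_stump(X, y, w):
    """Weighted least-squares regression stump.
    Returns (dim, threshold, val_left, val_right, weighted_squared_error)."""
    best = None
    for dim in range(X.shape[1]):
        for thr in np.unique(X[:, dim]):
            left = X[:, dim] < thr
            wl, wr = w[left].sum(), w[~left].sum()
            if wl == 0 or wr == 0:
                continue
            a = (w[left] * y[left]).sum() / wl      # weighted mean y, left
            b = (w[~left] * y[~left]).sum() / wr    # weighted mean y, right
            pred = np.where(left, a, b)
            wse = (w * (y - pred) ** 2).sum()       # weighted squared error
            if best is None or wse < best[-1]:
                best = (dim, thr, a, b, wse)
    return best

def gentle_adaboost(X, y, rounds):
    """Gentle AdaBoost: additive model of regression stumps.
    X: (n, k) features; y: (n,) labels in {-1, +1}."""
    w = np.full(len(y), 1.0 / len(y))
    stumps = []
    for _ in range(rounds):
        dim, thr, a, b, _ = fit_regression_stump(X, y, w)
        pred = np.where(X[:, dim] < thr, a, b)
        w *= np.exp(-y * pred)                      # reweight samples
        w /= w.sum()
        stumps.append((dim, thr, a, b))
    return stumps    # classify a sample with sign(sum of stump outputs)
```

Because the stump outputs are bounded weighted means rather than log-odds ratios, the weight updates are "gentler" than in Discrete or Real AdaBoost, which is the usual explanation for its numerical robustness.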
4.3 Experimental Results

The overall image set consisted of 3,225 scenic photographs from the "Master Clips 500.000" collection and 4,352 raytracing images from http://www.irtc.com, the Internet Ray Tracing Competition. The overall image set was randomly partitioned into 5,305 (70%) training images and 2,272 (30%) test images. Training was performed with

a. all 2048 feature values,
b. only the 1024 median feature values, and
c. only the 1024 Gaussian feature values

in order to analyze the suitability of the median and Gaussian features for the classification task as well as the performance gain from using both feature sets jointly. The results are shown in Table 2. The number M of weak classifiers was determined by setting the target hit rate as the termination criterion for the iterative loop of the boosting algorithms.

Table 2 Classification performance of computer-generated, realistic-looking images vs. real photos for four common boosting algorithms. Training/test accuracy was determined on a training/test set of 5,305/2,272 images. Training accuracy was used as a termination criterion for the boosting training. Note that Discrete AdaBoost consistently needs more features to achieve the same training and test accuracy as the other boosting algorithms.

                      Median values         Median + Gaussian       Gaussian values
                    No.   Train   Test     No.   Train   Test     No.   Train   Test
Gentle AdaBoost      61   0.950   0.917     45   0.951   0.935    155   0.950   0.905
                     75   0.961   0.932     55   0.963   0.938    185   0.960   0.913
                     96   0.971   0.948     70   0.972   0.951    236   0.970   0.917
                    127   0.980   0.951     91   0.980   0.963    272   0.980   0.924
                    185   0.990   0.960    130   0.991   0.973    356   0.990   0.935
Discrete AdaBoost    79   0.951   0.922     47   0.951   0.929    210   0.950   0.908
                    101   0.960   0.929     57   0.961   0.935    256   0.960   0.917
                    139   0.970   0.944     86   0.971   0.947    324   0.970   0.928
                    183   0.980   0.953    100   0.980   0.959    403   0.980   0.928
                    295   0.990   0.966    164   0.990   0.967    590   0.990   0.941
Real AdaBoost        58   0.951   0.923     46   0.952   0.926    158   0.951   0.905
                     70   0.961   0.930     56   0.961   0.941    183   0.960   0.905
                     90   0.971   0.945     65   0.971   0.947    223   0.971   0.915
                    122   0.981   0.951     92   0.981   0.957    273   0.980   0.928
                    163   0.991   0.960    120   0.990   0.963    349   0.990   0.937
LogitBoost           52   0.951   0.926     39   0.951   0.923    144   0.951   0.892
                     63   0.961   0.929     50   0.961   0.932    179   0.960   0.897
                     80   0.970   0.942     65   0.970   0.941    205   0.970   0.908
                    104   0.980   0.955     88   0.980   0.957    261   0.980   0.913
                    151   0.990   0.958    121   0.990   0.964    321   0.990   0.921
Max test accuracy                 0.966                  0.973                  0.941

The following observations can be drawn from the results shown in Table 2.

a. The test accuracy increases consistently with the training accuracy, demonstrating one of the most impressive features of boosting algorithms: their tendency not to overfit the training data in practice.
b. The median feature values perform better than the Gaussian features. For instance, with Discrete AdaBoost, the test error rate for median values is 3.4% compared to 5.9% for Gaussian values, while using only half the number of weak classifiers and thus roughly half the number of features.
c. Using both feature sets jointly reduces the test error rate to 2.7%, while at the same time the number of weak classifiers is reduced by another 30%. Thus, given a larger feature pool from which the boosting algorithm can pick, fewer features are needed for a better classifier.
d. The best classification results were achieved using Gentle AdaBoost with all 2048 values in the feature pool. Classification accuracy on the test set was 97.3% (98.2% for raytracing images and 96.0% for photos). In our previous work, we used the learning vector quantization package from the Helsinki University of Technology14 to train our classifier; the classification accuracy for the same test set was only 87.33%.15

A closer inspection of the misclassified raytracing images revealed two main sources of misclassification: these images either used noisy real-world textures or were very small in dimension (e.g., only 100 × 75 pixels). Some "photos" were misclassified because they were not real photos (see Fig. 7, left image). Figure 7 shows a few examples of misclassified images.

Fig. 7 Examples of images misclassified as (a) natural images and (b) photorealistic but artificial images.

5 Presentation Slides/Scientific Posters Versus Comics/Cartoons

The algorithm proposed in this section for distinguishing between presentation slides/scientific posters and comics/cartoons can be applied to the set of images that have been classified as graphical by the algorithm in Sec. 3.

The class of presentation slides includes all images showing slides, independently of whether they were created digitally by presentation programs such as MS PowerPoint or by hand. Many scientific posters are designed like a slide and therefore fall into this class, too. However, scientific posters may also differ significantly from the general layout of slides. Both image classes, presentation slides and scientific posters, comprise the class of presentation slides/scientific posters. The class of comics includes cartoons from newspapers, most of which are available on the web, and from books, as well as other kinds of comics. Images of both classes can be colored or black and white. Three examples of slides and three of scientific posters are shown in Fig. 8, while examples of comics cannot be shown for copyright reasons.

Fig. 8 Examples of (a) slides and (b) scientific posters.

5.1 Features

We observed the following three main differences between presentation slides/scientific posters and comics/cartoons.

1. In general, the relative size and/or alignment of text line occurrences differ between comics and slides/posters. Thus, images of both classes can be distinguished by means of
   • the relative width of the topmost text line, i.e., the ratio between the width of the topmost text line and the width of the entire image,
   • the average relative width and height of all text lines and their respective standard deviations, and
   • the average relative position and standard deviation of the center of mass over all text lines.

These features are motivated by the following observations. Slides usually have a heading that almost fills the entire width of the image, and their subsequent text lines are wider than those in comics. Moreover, the text lines in slides either have a single center at about the middle of the image, leading to a small standard deviation over the locations of their centers of mass, or they all start in the same column; in the latter case the centers of mass differ, but they still lie near each other and likewise result in a small standard deviation around the average center location. The relative width of the topmost text line in comics is usually smaller than in slides, as are all other text lines.
Slides in general use larger fonts than comics do. Therefore, the larger the average relative height of the text lines, the more probable it is that the image represents a slide. Further, text in two or more columns is uncommon for slides. Comics, on the other hand, usually consist of more than one picture, resulting in more than just one visual center of text blocks. Thus, the standard deviation over the text line center locations will be large.

2. Images containing multiple smaller images aligned on a virtual grid and framed by rectangles are very likely to be comics. These borders can easily be detected by edge detection algorithms. In comics, the length of such border lines is usually an integral fraction of the image's width or height. For instance, they might be about a third of the image's width. The more lines of such length are found, the higher the probability that the image is a comic rather than a presentation slide. This criterion can be made more precise by checking for the presence of the other n - 1 lines in the same row/column whenever a line with a length of one n-th of the image's width/height is found. By means of this procedure, lines are eliminated that merely happen to have the correct length but have nothing to do with the typical borders in comics.

3. Slides very often have a width-to-height ratio of 4:3 (landscape orientation). If the aspect ratio differs from this ratio, it is very unlikely that the image is a slide.

5.2 Feature Calculation

We used the algorithm and system developed by Lienhart et al. to find all text lines and text columns in the image under analysis.16 The text detection system was retrained with text samples from slides and comics in order to improve text line detection performance. Based on the detected bounding boxes, the following five features were calculated:

• the relative width of the topmost text line with respect to the image's width,
• the average text line width and its standard deviation over all detected text lines, and
• the average horizontal center position and its standard deviation over all detected text lines.

Edges were extracted by means of the Canny edge detection algorithm and then vectorized.17 All non-horizontal and non-vertical edges were discarded. Two vertical or horizontal lines were merged if and only if they had the same orientation and the end point of one line was near the start point of the other. This procedure helped to overcome accidental breakups in the border lines as well as to merge nearby lines from multiple "picture boxes." Next, the lengths of all remaining edges were determined and checked as to whether they were about one, one half, one third, or one fourth of the width or height of the image. If not, the respective edge was discarded. Finally, the relative frequencies of edges with roughly the n-th fraction of the image's width or height (n ∈ {1, 2, 3, 4}) were counted and taken as another four features.

The feature set was completed by

• the absolute number of vertical and the absolute number of horizontal edges, as well as
• the aspect ratio of the image dimensions.

In total, 12 features were used.
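Putting Secs. 5.1 and 5.2 together, the 12-dimensional feature vector could be assembled as in the following sketch. The text-box and line-segment inputs stand in for the outputs of the text detector of Ref. 16 and the Canny/vectorization step; the 10% length tolerance is a placeholder of our choosing, and the companion-line verification described in Sec. 5.1 is omitted for brevity.

```python
import numpy as np

def slide_vs_comic_features(img_w, img_h, text_boxes, h_lines, v_lines,
                            tol=0.1):
    """Assemble the 12-dimensional feature vector described above.
    text_boxes: list of (x, y, w, h) text-line bounding boxes from a text
    detector. h_lines / v_lines: lengths of vectorized horizontal /
    vertical edge segments. tol is a hypothetical relative tolerance."""
    boxes = sorted(text_boxes, key=lambda b: b[1])     # top to bottom
    if boxes:
        top_rel_width = boxes[0][2] / img_w            # topmost text line
        rel_w = np.array([b[2] for b in boxes]) / img_w
        centers = np.array([b[0] + b[2] / 2.0 for b in boxes]) / img_w
        text_feats = [top_rel_width, rel_w.mean(), rel_w.std(),
                      centers.mean(), centers.std()]
    else:
        text_feats = [0.0] * 5                         # no text detected

    # Relative frequency of line segments whose length is about 1/n of the
    # image width (horizontal lines) or height (vertical lines), n = 1..4.
    all_lines = [(l, img_w) for l in h_lines] + [(l, img_h) for l in v_lines]
    border_feats = []
    for n in (1, 2, 3, 4):
        target = None
        hits = sum(1 for length, extent in all_lines
                   if abs(length - extent / n) < tol * extent / n)
        border_feats.append(hits / max(len(all_lines), 1))

    # Edge counts and aspect ratio complete the 12-dimensional vector.
    return np.array(text_feats + border_feats +
                    [len(v_lines), len(h_lines), img_w / img_h])
```

As with the other classifiers in this paper, these vectors would then be fed to a boosting algorithm, which performs the final feature selection.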
5.3 Experimental Results

During our experiments we observed that, in comics, the neural network-based text localizer detected a significant number of false text blocks of small width but large height. In contrast, the text blocks in slides were recognized very well. This stark contrast in the false alarm rate of our text localizer between comics and slides can partly be explained by the fact that large fonts are prevalent in slides but not in comics, and that our detector worked very well on large text lines. In addition, the kinds of strokes used in comics to draw people and objects sometimes have properties similar to the strokes used for text, and thus result in false alarms. Despite these imperfections of our text detector,16 all our features except the average height of the text lines could be used.

Again, the boosting learning algorithms were used for training. The training set consisted of 2,211 images (70% of the overall image set): 818 slides/posters and 1,393 comics. Our novel classification algorithm was tested on a test set of 947 images (30% of the overall image set): 361 slides/posters and 586 comics. As shown in Table 3, there are few differences between the boosting algorithms. For Gentle and Discrete AdaBoost, a test accuracy of 99.5% was achieved. This translates to only five misclassified images.

Table 3 Classification performance for comics/cartoons vs. slides/posters.

                    No. features   Training accuracy   Test accuracy
Gentle AdaBoost          2              0.980              0.976
                         2              0.980              0.976
                         2              0.980              0.976
                         5              0.983              0.983
                        36              1.000              0.995
Discrete AdaBoost        2              0.980              0.976
                         2              0.980              0.976
                         2              0.980              0.976
                         5              0.987              0.979
                        42              1.000              0.995
Real AdaBoost            2              0.980              0.976
                         2              0.980              0.976
                         2              0.980              0.976
                         4              0.983              0.983
                        21              1.000              0.993
LogitBoost               2              0.980              0.976
                         2              0.980              0.976
                         2              0.980              0.976
                         5              0.994              0.990
                      1000              0.999              0.992
Max test accuracy                                          0.995

The image's aspect ratio and the number of vertical edges were always the first two features chosen by the boosting algorithms. In the Gentle AdaBoost case, even at the test accuracy of 99.5%, the following three features were not selected:

• the relative width of the topmost text line with respect to the image width,
• the standard deviation of the text line widths, and
• the relative number of edges with a length of about one third of the image width or height.

As mentioned before, only five images were misclassified, of which two were slides/posters (see Fig. 9). The three misclassified cartoons cannot be shown for copyright reasons; however, their schematic layout is shown in Fig. 10. One of them exhibited displaced bounding boxes (Fig. 10, right image), while another violated the assumption that framing lines must be an n-th fraction of the image width or height (Fig. 10, left image). For the third misclassified comic, the reason for misclassification was poor text detection.

Fig. 9 The only two misclassified presentation slides/scientific posters.

Fig. 10 Box layout of two of the three misclassified cartoon images.

6 Conclusion

Automatic semantic classification of images is a very interesting research field. In this paper, we presented novel and effective algorithms for two classification problems that have not been addressed before: comics/cartoons versus slides/posters, and real photos versus realistic-looking but computer-generated images. On a large image database, true photos could be separated from ray-traced/rendered images with an accuracy of 97.3%, while presentation slides were successfully distinguished from comics with an accuracy of 99.5%. We also enhanced and adjusted the algorithms proposed in Refs. 7 and 8 for the separation of graphical images from photo-like images.
On a large image database, our classification algorithm achieved an accuracy of 97.69%.

Acknowledgments

The authors would like to thank Alexander Kuranov and Vadim Pisarevsky for the work they put into designing and implementing the four boosting algorithms.

References

1. www.visoo.com
2. A. Vailaya, "Semantic classification in image databases," PhD thesis, Department of Computer Science, Michigan State University (2000), http://www.cse.msu.edu/~vailayaa/publications.html.
3. A. Vailaya, M. Figueiredo, A. Jain, and H. J. Zhang, "Bayesian framework for hierarchical semantic classification of vacation images," Proceedings of the IEEE International Conference on Multimedia Computing and Systems (ICMCS), pp. 518-523, Florence, Italy (1999).
4. M. M. Gorkani and R. W. Picard, "Texture orientation for sorting photos 'at a glance'," Proc. ICPR, pp. 459-464 (Oct. 1994).
5. E. Yiu, "Image classification using color cues and texture orientation," Master's thesis, Department of Electrical Engineering and Computer Science, MIT (1996), http://www.ai.mit.edu/projects/cbcl/res-area/current-html/ecyiu/project.html.
6. B. Bradshaw, "Semantic based image retrieval: A probabilistic approach," ACM Multimedia 2000, pp. 167-176 (Oct. 2000).
7. V. Athitsos, M. J. Swain, and C. Frankel, "Distinguishing photographs and graphics on the world wide web," IEEE Workshop on Content-Based Access of Image and Video Libraries, pp. 10-17 (June 1997).
8. C. Frankel, M. J. Swain, and V. Athitsos, "WebSeer: An image search engine for the world wide web," University of Chicago Department of Computer Science Technical Report TR-96-14 (August 1996), http://www.infolab.nwu.edu/webseer/.
9. R. Schettini, G. Ciocca, A. Valsasna, C. Brambilla, and M. De Ponti, "A hierarchical classification strategy for digital documents," Pattern Recogn. 35(8), 1759-1769 (2002).
10. R. Schettini, C. Brambilla, A. Valsasna, and M. De Ponti, "Content based classification of digital documents," IAPR Workshop on Pattern Recognition in Information Systems, Setúbal, Portugal (6-7 July 2001).
11. Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Machine Learning: Proceedings of the Thirteenth International Conference, pp. 148-156, Morgan Kaufmann, San Francisco (1996).
12. B. Jaehne, Digital Image Processing, Springer, Berlin (1997).
13. J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: A statistical view of boosting," Technical Report, Dept. of Statistics, Stanford University (1998).
14. The Learning Vector Quantization Program Package, ftp://cochlea.hut.fi.
15. A. Hartmann and R. Lienhart, "Automatic classification of images on the web," in Storage and Retrieval for Media Databases 2002, Proc. SPIE 4676, 31-40 (2002).
16. R. Lienhart and A. Wernicke, "Localizing and segmenting text in images, videos and web pages," IEEE Trans. Circuits Syst. Video Technol. 12(4), 256-268 (2002).
17. J. Canny, "A computational approach to edge detection," IEEE Transactions on Pattern Analysis and Machine Intelligence 8(6), 679-698 (1986).

Rainer Lienhart received his Master's degree in computer science and applied economics and his PhD in computer science from the University of Mannheim, Germany, on "methods for content analysis, indexing, and comparison of digital video sequences." He was a core member of the Movie Content Analysis Project (MoCA).
Since 1998 he has been a Staff Researcher at Intel Labs in Santa Clara. His research interests include image/video/audio content analysis, machine learning, scalable signal processing, scalable learning, ubiquitous and distributed media computing in heterogeneous networks, media streaming, and peer-to-peer networking and mass media sharing. He is a member of the IEEE and the IEEE Computer Society.

Alexander Hartmann received his Master's degree in computer science and applied economics from the University of Mannheim, Germany, on "new algorithms for automatic classification of images." During the summer of 2000 he was a summer intern at Intel Labs in Santa Clara. Currently he is working as a software engineer at IT-SAS, an IBM Global Services company, in Germany. His interests include Linux and cryptography.