Análisis y Calidad de Texto e Imágenes
Transcription
Categorization of Generic Images
Marco Bressan, July 2008

Index
Intro: image categorization, problem statement
Representation of Visual Objects: global vs. local descriptors
Bag of Words: patch detection, local features, visual codebook, supervised vs. unsupervised codebook
Images as Continuous Distributions: Fisher kernel
Tasks Related to Categorization: similarity, retrieval, clustering, segmentation, …
Applications
Open Research Challenges
Bibliography

Research @ Xerox
Xerox Research Centre Europe (Grenoble, France); Xerox Research Centre of Canada (Mississauga, Ontario, Canada); Fuji Xerox (Japan); Palo Alto Research Center (California, USA); Xerox Research Center Webster (New York, USA).

Research on Textual and Visual Pattern Analysis
Projects: U.S. Library of Congress, CACAO, Pascal 2, EERQI, ARTSTOR, SAPIR, SMART, Shaman, PINVIEW, Infom@gic, UAB, Omnia, SYNC3, Fragrance.
Core research: pattern classification (mostly text and images), computer vision, text mining and retrieval, machine learning, image processing.
Sample projects: handwritten word-spotting, automatic content creation, visual aesthetics and personalization, eye-tracking for image understanding, emotional visual categories, multilingual categorization and retrieval.

Image Categorization
painting, oil, portrait, modernism, woman, Modigliani, …

Research Challenge: Vision and Object Recognition
Aristotle's observation "Of all senses, trust only the sense of sight" is somewhat supported by nature: 50% of the cerebral cortex is devoted to vision. Still…
Vision is ambiguous: multiple scenes can result in the same image, and one scene can result in multiple images.
Vision is contextual: the human brain is sensitive to contextual associations.
Vision is inferential: our prior knowledge (learned or "wired") determines what we see.
The environment is complex: not accessible, continuous, dynamic, non-deterministic, ...
Nevertheless, animals extract information from images: we are able to recognize objects despite variations in size, shape, view, lighting, time, occlusion, clutter, …

Research Challenge: Vision + Classification
"I stand at the window and see a house, trees, sky. Theoretically I might say there were 327 brightnesses and nuances of colour. Do I have '327'? No. I have sky, house, and trees."
Max Wertheimer, 1923
(Figure: a photograph of a football match shown next to its raw binary pixel data, contrasting the low-level signal with the high-level interpretations listed below.)
LO (low-level): texture (stadium), color (teams)
HI (high-level): identities (players), text (temporal: teams, score), action (temporal: match situation), ads (sponsors)
And much more…
• sports event (football)
• Leicester City vs. F.C. Barcelona
• Javier Saviola has the ball
• Full stadium
• UEFA? Champions?
• Players wear certain brands
• Advertising in background
• The score favours a team
• Barcelona is attacking
• Nike, UNICEF
• Anguish, excitement, celebration...

Research Challenge: Vision + Classification
L'ambitieux, Gilbert Garcin, 2003
Visual semantics involve multiple levels of abstraction: semantic, aesthetic, contextual, social, emotional, …

Tasks related to Image Categorization
(photographs, document images, paintings, drawings, maps, charts, tables; domain-specific: satellite, medical)
Image Categorization: to assign one or more labels to a given image, based on its semantic content
Object Recognition: to identify an instance of a given object
Object Detection: to determine whether a member of a visual category is present in an image
Image Similarity: to define a metric between images, e.g. for duplicate detection
Content-Based Image Retrieval: given an image repository, to retrieve images relevant to a given query
Non-Supervised Image Categorization (Clustering): to group images by similarity
Image Segmentation: to assign one or more labels to each image pixel, based on its semantic content
Hybrid approaches: to include multiple modalities in the analysis

Generic Visual Categorization (GVC)
The pattern classification problem which consists in assigning one or multiple labels to an image based on its semantic content: beach, sky, clouds, vegetation, landscape, horizon, photo, …; form, handwriting, letter, service contract; painting, oil, portrait, modernism, woman, Modigliani; map, legend, downtown.
Generic: the same flexible framework copes with various objects, subjects and scenes, and with various graphic arts (photography, painting, drawing, document images, etc.).
Scientific challenges: handle inherent category variability as well as variations in size, shape, view, lighting, occlusion, clutter, etc.

Representing Visual Objects
Approaches Based on Global Descriptors
MPEG-7: SCD (Scalable Color Descriptor) + HTD (Homogeneous Texture Descriptor)
Appearance or View-Based [1]: having seen all possible appearances of an object, can recognition be achieved by just efficiently remembering all of them?
GIST [2]: with just a glance at a complex real-world scene, an observer can comprehend a variety of perceptual and semantic information.
Approaches Based on Local Descriptors
Region-Based
BlobWorld [3]: segmentation into coherent regions that correspond to different objects.
Recognition as Translation [4]: learning a lexicon for a fixed image vocabulary.
Part-Based
Constellations of Parts [5]: the object/scene model is a spatial arrangement of a number of parts; recognition infers the object class from the similarity of the parts' appearance and their spatial arrangement.
Bag of Visual Words [6]
[1] S.K. Nayar, S.A. Nene, and H. Murase. Real-time 100 object recognition system. In ARPA96, pages 1223–1228, 1996.
[2] Oliva, A. Gist of the scene. In: Itti L, Rees G, Tsotsos J., editors. Neurobiology of attention. San Diego, CA: Academic Press/Elsevier, pp. 251–257, 2005.
[3] C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, and J. Malik. Blobworld: a system for region-based image indexing and retrieval. In Proc. Int. Conf. Visual Inf. Sys., 1999.
[4] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth. Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. ECCV, Copenhagen, 2002.
[5] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2003.
[6] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Proc. of ECCV Workshop on Statistical Learning for Computer Vision, 2004.

Appearance-Based Object Recognition
Appearance-based recognition is based on the statistical modeling of raw image data. It has shown good performance in a variety of problems, but it requires segmentation of the object, and its performance is very limited in complex scenes, e.g. in the presence of occlusion or cluttered backgrounds.
View-based modelling; eigenspace representation.
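To make the eigenspace idea concrete, here is a minimal sketch in Python, assuming pre-segmented and size-normalized grayscale views; the function names, the number of components and the nearest-neighbor rule are illustrative choices, not the system of [1]:

import numpy as np

def fit_eigenspace(views, n_components=20):
    """Learn a low-dimensional eigenspace from flattened training views.

    `views` is an (n_views, h*w) float array of segmented, size-normalized
    images; n_components is an illustrative choice.
    """
    mean = views.mean(axis=0)
    centered = views - mean
    # SVD of the centered data yields the principal "eigenviews".
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]              # (n_components, h*w)
    coords = centered @ basis.T            # training views projected into the eigenspace
    return mean, basis, coords

def recognize(query_view, mean, basis, coords, labels):
    """Project a query view into the eigenspace and return the nearest label."""
    q = (query_view - mean) @ basis.T
    dists = np.linalg.norm(coords - q, axis=1)
    return labels[int(np.argmin(dists))]

The sketch also makes the stated limitation visible: everything hinges on the object being segmented and normalized before projection, which is exactly what fails in cluttered or occluded scenes.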
GIST
80 million tiny images: a large dataset for non-parametric object and scene recognition.
Representation: PCA of a set of spatially averaged filter-bank outputs + adapted metrics.
Tasks: retrieval, semantic classification, classification at multiple semantic levels, segmentation, object detection, orientation detection, colorization.
A person will process ~1B images by the age of 3.
A. Torralba, R. Fergus and W. T. Freeman, "Tiny Images", Computer Science and Artificial Intelligence Lab (CSAIL), MIT, 2007.

Region-based Approaches: Recognition as Translation
Learning a lexicon for a fixed image vocabulary.
Steps: min-cut segmentation; blob features; blob clustering -> lexicon; categorization as a statistical machine translation problem.
Results/Experiments: Corel dataset with 371 words and 4,500 images that have 5 to 10 segments each, 33 features per segment. A total of ~35,000 segments yields a 500-blob lexicon.

Constellations of Parts
Bouchard, G.; Triggs, B.; "Hierarchical Part-Based Visual Object Categorization", CVPR, 2005.
H is the set of valid allocations of features to the parts, with locations X, scales S, and appearances A. The size of H is O(N^P), where N is the total number of features and P the number of parts (typically < 7).
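A toy illustration of why the hypothesis space grows as O(N^P), assuming ordered assignments of parts to detected features without reuse; this only counts hypotheses and is not the inference procedure of Bouchard and Triggs:

from itertools import permutations

def part_hypotheses(n_features, n_parts):
    """Enumerate ordered assignments of n_parts model parts to n_features
    detected features (no feature reused): the hypothesis space H whose
    size grows as O(N^P). For realistic N and P you would never
    materialize this list, only bound or prune it."""
    return list(permutations(range(n_features), n_parts))

print(len(part_hypotheses(10, 3)))   # 720 hypotheses even in this toy setting

# With 200 features and 6 parts the count is 200*199*198*197*196*195,
# roughly 6e13 hypotheses, which is why part-based models keep P small
# (typically < 7) and prune the search aggressively.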
Bag of Visual Words (BoW)

BoW: Motivation from Text Mining
Orderless document representation: frequencies of words from a dictionary (Salton & McGill, 1983).
Example: US Presidential Speeches Tag Cloud, http://chir.ag/phernalia/preztags/

BoW: Motivation from Texture Recognition
Texture is characterized by the repetition of basic elements, or textons.
For stochastic textures, it is the identity of the textons, not their spatial arrangement, that matters.
(Julesz, 1981; Cula & Dana, 2001; Leung & Malik, 2001; Mori, Belongie & Malik, 2001; Schmid, 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003)
A texture image can thus be represented as a histogram over a universal texton dictionary.

BoW: Strict Definition
(Diagram 1 - Traditional Bag-of-Words: UNIVERSAL SET, OBJ1, OBJ2.)
1. PATCH DETECTION
2. FEATURE EXTRACTION
3. LEARNING THE CODEBOOK
4. HISTOGRAM COMPUTATION

BoW: Pipeline
patch detection -> feature extraction -> vector quantization -> histogram computation (representation), followed by classification (categorization)
Components:
Patch detection: identify where to extract information in the image.
Feature extraction: compute local features.
Vector quantization: map local features to visual words.
Histogram computation: build a global representation of the image.
Classification and learning: discriminative model (support vector machine) or generative model (naïve Bayes, hierarchical Bayesian models).

BoW: Patch detection, where is the information?
Interest point detector: Harris (corner detector), Laplacian (round shape detector); unstable, ignores uniform regions.
Blob detector: partition of the image into uniform regions (generally based on color); little semantic meaning.
Regular grid: extract information on regular grids (at multiple scales); typically extract ≈ 500 patches / image.

BoW: Patch detection for image registration
Andrew Zisserman, "Probabilistic Models of Visual Object Categories", AERFAI tutorial, 2006.
BoW: Patch detection for image retrieval
Andrew Zisserman, "Probabilistic Models of Visual Object Categories", AERFAI tutorial, 2006.
Microsoft Photosynth.

BoW: Local features, what to extract?
Split the patch into sub-regions (typically 4x4) and accumulate local statistics on:
Gray-level features: compute the dominant orientation at each pixel and discretize it (8 bins); accumulate histograms over sub-regions; rotation, scale and illumination invariant; 128 dimensions.
Color features: compute simple RGB statistics over sub-regions (mean + standard deviation); 96 dimensions.
PCA for dimensionality reduction: reduces computational cost, reduces noise; D = 50 dimensions.
K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. In Proc. IEEE CVPR, June 2003.
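A compact sketch of the four-step pipeline above, assuming grayscale image arrays; the descriptor (2x2 sub-regions instead of 4x4, no color channel), the codebook size, and the variables train_images / train_labels are simplified stand-ins for illustration, not the exact features or settings used at XRCE:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def patch_descriptors(image, patch=16, step=8, bins=8):
    """Dense-grid patch detection plus a simplified orientation-histogram
    descriptor. `image` is a 2-D float array; each patch is split into
    2x2 sub-regions with an 8-bin gradient-orientation histogram each."""
    gy, gx = np.gradient(image)
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), np.pi)          # orientation in [0, pi)
    descs = []
    half = patch // 2
    for y in range(0, image.shape[0] - patch + 1, step):
        for x in range(0, image.shape[1] - patch + 1, step):
            d = []
            for sy in (y, y + half):                  # 2x2 sub-regions
                for sx in (x, x + half):
                    o = ori[sy:sy + half, sx:sx + half].ravel()
                    m = mag[sy:sy + half, sx:sx + half].ravel()
                    h, _ = np.histogram(o, bins=bins, range=(0, np.pi), weights=m)
                    d.extend(h)
            descs.append(d)
    return np.array(descs, dtype=float)

def bow_histogram(descs, codebook):
    """Vector quantization + histogram computation against a learned codebook."""
    words = codebook.predict(descs)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Learning the codebook and a discriminative classifier;
# train_images / train_labels are assumed to exist.
all_descs = np.vstack([patch_descriptors(im) for im in train_images])
codebook = KMeans(n_clusters=100, n_init=4, random_state=0).fit(all_descs)
X = np.array([bow_histogram(patch_descriptors(im), codebook) for im in train_images])
clf = LinearSVC().fit(X, train_labels)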
BoW: What is the visual codebook?
Given a set of training images, cluster the low-level feature vectors to estimate a visual codebook.
Alternatives: K-means, mean-shift, Gaussian mixture model (GMM).
Each Gaussian is a visual word of the visual codebook:
the mixture weight represents the frequency of the visual word,
the mean vector represents the average of the word,
the covariance matrix models how much the visual word varies.
Typically, N is on the order of a hundred Gaussians / words.
Trade-off for the estimation of visual vocabularies:
unsupervised approach [SZ03]: universal but not discriminative;
supervised approach [MT06]: discriminative but not universal.
[SZ03] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. ICCV, 2003.
[MT06] F. Moosmann, B. Triggs and F. Jurie. Randomized clustering forests for building fast and discriminative visual vocabularies. NIPS, 2006.

BoW: Local features, what to extract?
Analogy between the visual codebook and human vision (Olshausen and Field, 2004; Fei-Fei and Perona, 2005).

BoW: Universal and adapted vocabularies
(Diagram: a universal vocabulary with words such as eye, ear and tail is estimated by MLE; a cat vocabulary (cat's eye, cat's ear, cat's tail) and a dog vocabulary (dog's eye, dog's ear, dog's tail) are derived from it by MAP adaptation.)
F. Perronnin, "Universal and Adapted Vocabularies for Generic Visual Categorization", IEEE Trans. on PAMI, 2007.
Given the training samples, both estimates are obtained with EM: MLE maximizes the likelihood of the samples, while MAP maximizes the posterior with the universal vocabulary acting as prior; each alternates an E-step and an M-step.

BoW: Bipartite histograms
For each image, compute one histogram per category on the combined vocabularies (cat vocabulary + universal vocabulary, dog vocabulary + universal vocabulary, ...), to separate the relevant information from the irrelevant one.
If an image belongs to a given category:
it is more likely to be best described by the adapted vocabulary of that category rather than by the universal vocabulary;
it is more likely to be best described by the universal vocabulary than by the adapted vocabularies of the other categories.

BoW: Bipartite histograms, examples
Relevance with respect to boat; relevance with respect to clouds & sky; relevance with respect to flowers.

BoW: Bipartite histograms, performance
PASCAL visual object classes challenge 2006: 10 objects (bicycle, bus, car, cat, cow, dog, horse, motorbike, person, sheep); 2,618 training images + 2,610 test images.
In-house database: 19 objects and scenes (beach, bicycling, birds, boating, cats, clouds & sky, desert, dogs, flowers, golf, motor sports, mountains, people, sunrise & sunset, surfing, underwater, urban, waterfalls, winter sports); 30,000 training images + 4,800 test images (collected independently).
Processing time: 170 ms / image.
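A minimal sketch of a GMM visual vocabulary with mean-only MAP adaptation and a bipartite histogram, in the spirit of the universal/adapted vocabulary idea above; the relevance factor and the adaptation rule follow the generic GMM-UBM recipe and are assumptions, not the published update:

import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def train_universal_vocabulary(descriptors, n_words=64, seed=0):
    """Universal vocabulary: a diagonal-covariance GMM fitted on low-level
    descriptors pooled from all training images (one Gaussian per visual word)."""
    return GaussianMixture(n_components=n_words, covariance_type="diag",
                           random_state=seed).fit(descriptors)

def map_adapt_means(ubm, class_descriptors, relevance=16.0):
    """Mean-only MAP adaptation of the universal GMM to one category."""
    gamma = ubm.predict_proba(class_descriptors)          # (T, N) occupancy probabilities
    n_i = gamma.sum(axis=0) + 1e-10                       # soft counts per visual word
    x_bar = gamma.T @ class_descriptors / n_i[:, None]    # per-word data means
    alpha = (n_i / (n_i + relevance))[:, None]            # adaptation strength per word
    adapted = copy.deepcopy(ubm)
    adapted.means_ = alpha * x_bar + (1.0 - alpha) * ubm.means_
    return adapted

def bipartite_histogram(image_descriptors, adapted, ubm):
    """One histogram on the combined (adapted + universal) vocabularies."""
    h = np.concatenate([adapted.predict_proba(image_descriptors).mean(axis=0),
                        ubm.predict_proba(image_descriptors).mean(axis=0)])
    return h / h.sum()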
BoW: Images as Continuous Distributions
Instead of modelling all images with a single vocabulary, model each image with its own (much smaller) vocabulary [FS05, ZM05].
Pipeline: patch detection -> feature extraction -> distribution estimation -> classification.
Image representations are discriminative and universal.
Kernels on distributions (Kullback-Leibler divergence, Probability Product Kernel, Earth Mover's Distance) are costly.
[FS05] J. Farquhar, S. Szedmak, H. Meng and J. Shawe-Taylor. Improving bags-of-keypoints image categorization. Tech report, University of Southampton, 2005.
[ZM05] J. Zhang, M. Marszalek, S. Lazebnik and C. Schmid. Local features and kernels for classification of texture and object categories: an in-depth study. Technical report, INRIA, 2005.
(Diagram 2 - Vector Sets as GMMs: OBJ1, OBJ2. Diagram 3 - Proposed Approach: UNIVERSAL SET, OBJ1, OBJ2.)

BoW: The Fisher kernel
Idea [JH99]: given a generative model with parameters λ, compute a fixed-length representation of the vector set X as the gradient vector ∇_λ log p(X | λ). Use the Fisher information matrix F_λ to measure the similarity between gradients, K(X, Y) = ∇_λ log p(X|λ)' F_λ^-1 ∇_λ log p(Y|λ), which is equivalent to normalizing the gradient vectors by F_λ^-1/2.
Fisher representation: given a visual vocabulary, estimate how the visual words should be modified to best model the image.
[JH99] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. NIPS, 1999.

BoW: The Fisher kernel
The gradient with respect to the mixture weights is essentially a (soft) histogram of visual word occurrences, so the Fisher representation extends the traditional histogram of occurrences.
It results in very high dimensional representations even with small visual vocabularies (typically N = 128): N dimensions for the mixture weights, N x D for the mean vectors, N x D for the diagonal covariance matrices, i.e. a (2D + 1) x N dimensional vector.
Classification accuracy is similar to the histogram representation, but at a fraction of the cost (20 ms / image).

Experimental validation
PASCAL visual object classes challenge 2006: 10 classes (bicycle, bus, car, cat, cow, dog, horse, motorbike, person, sheep); 2,618 training + 2,610 test images; performance measured with Area Under the Curve (AUC).
PASCAL visual object classes challenge 2007: 20 classes (aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, tv monitor); 5,011 training images + 4,952 test images; performance measured with Average Precision (AP).
Cross-dataset experiment: training on VOC 2007 data, testing on VOC 2006 data.

Related Tasks

Research Challenge: Classification
The Dewey Decimal Classification (DDC) system is a general knowledge organization tool conceived by Melvil Dewey in 1873.
200 Religion 000 Computer science, information & general works 100 Philosophy & psychology 200 Religion 300 Social sciences 400 Language 500 Science 600 Technology 700 Arts & recreation 800 Literature 900 History & geography 201 Philosophy of Christianity 202 Miscellany of Christianity 203 Dictionaries of Christianity 204 Special topics 205 Serial publications of Christianity 206 Organizations of Christianity 207 Education, research in Christianity 208 Kinds of persons in Christianity 209 History & geography of Christianity 210 Natural theology 211 Concepts of God 212 Existence, attributes of God 213 Creation 214 Theodicy 215 Science & religion 216 Good & evil 217 Not assigned or no longer used 218 Humankind 219 Not assigned or no longer used 220 Bible 221 Old Testament 222 Historical books of Old Testament 223 Poetic books of Old Testament 224 Prophetic books of Old Testament 225 New Testament 226 Gospels & Acts 227 Epistles 228 Revelation (Apocalypse) 229 Apocrypha & pseudepigrapha 230 Christian theology 231 God 232 Jesus Christ & his family 233 Humankind 234 Salvation (Soteriology) & grace 235 Spiritual beings 236 Eschatology 237 Not assigned or no longer used 238 Creeds & catechisms 239 Apologetics & polemics 240 Christian moral & devotional theology 241 Moral theology 242 Devotional literature 243 Evangelistic writings for individuals 244 Not assigned or no longer used 245 Texts of hymns 246 Use of art in Christianity 247 Church furnishings & articles 248 Christian experience, practice, life 249 Christian observances in family life 250 Christian orders & local church 251 Preaching (Homiletics) 252 Texts of sermons 253 Pastoral office (Pastoral theology) 254 Parish government & administration 255 Religious congregations & orders 256 Not assigned or no longer used 257 Not assigned or no longer used 258 Not assigned or no longer used 259 Activities of the local church 260 Christian social theology 261 Social theology 262 Ecclesiology 263 Times, places of religious observance 264 Public worship 265 Sacraments, other rites & acts 266 Missions 267 Associations for religious work 268 Religious education 269 Spiritual renewal 270 Christian church history 271 Religious orders in church history 272 Persecutions in church history 273 Heresies in church history 274 Christian church in Europe 275 Christian church in Asia 276 Christian church in Africa 277 Christian church in North America 278 Christian church in South America 279 Christian church in other areas 280 Christian denominations & sects 281 Early church & Eastern churches 282 Roman Catholic Church 283 Anglican churches 284 Protestants of Continental origin 285 Presbyterian, Reformed, Congregational 286 Baptist, Disciples of Christ, Adventist 287 Methodist & related churches 288 Not assigned or no longer used 289 Other denominations & sects 290 Other & comparative religions 291 Comparative religion 292 Classical (Greek & Roman) religion 293 Germanic religion 294 Religions of Indic origin 295 Zoroastrianism (Mazdaism, Parseeism) 296 Judaism 297 Islam & religions originating in it 298 Not assigned or no longer used 299 Other religions 51 Research Challenge: Classification In the physical domain everything has to go someplace, it can only go in one place and two things cannot go to the same place 52 Research Challenge: Classification Categorization is the process in which ideas and objects are recognized, differentiated and understood. 
Categorization is fundamental in language, prediction, inference, decision making, and in all kinds of interaction with the environment. (Wikipedia)
Physical World vs. Digital World:
An object can hang from only one branch. / An object can be classified in many (hundreds of) categories.
Design needs to be defined ahead of time. / Organic growth.
The owner of the information controls the organization. / Users can control the organization of the info owned by others.
Multiple people can use one tree. / There can be a different tree for each person.
Ambiguity is a problem. / Ambiguity is an advantage.
Example categories: Computer Equipment, Photographic Equipment, Active Lifestyles, Sporting Goods, Graduation Gifts, New Arrivals, Sale Items, Travel Equipment, Canon Products.

Research Challenge: Classification
"Folksonomies are different in important ways from top-down, hierarchical taxonomies… The old way creates a tree. The new rakes leaves together." David Weinberger

Tasks Related to Image Categorization
Image categorization, object recognition, object detection, image similarity, content-based image retrieval, clustering, image segmentation and hybrid approaches, applied to photographs, document images, paintings, drawings, maps, charts, tables and domain-specific imagery (satellite, medical).

Image Similarity and Retrieval
(Chart: Precision; series: Xerox, Average; conditions: Text Only, Image Only, Hybrid.)
The similarity measure based on the Fisher representation is robust to scanning variations, resolution, compression, cropping, image edits, etc.

Hybrid Similarity and Retrieval
Hybrid Clustering (TEXT + IMAGES): for example, a Flowers cluster is further separated into flowers from « Pyrénées » and flowers from « Vercors ».
InfoM@gic: Hybrid Retrieval. Retrieve documents from a cross-media database (images with text) given one or more query images and/or textual queries; results are ranked using cross-media similarity measures.

Clustering
Fully automated grouping of large image repositories; grouping criteria can be defined through the universal vocabulary.
Actual clusters discovered by our tool from 120,000 random images from a photofinishing workflow.
Also applied successfully to document images (NIST database).
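A minimal sketch of one way to implement the cross-media (hybrid) similarity used for retrieval and as a clustering criterion above; the late-fusion weighting alpha and the function name are illustrative assumptions, not XRCE's actual measure:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def hybrid_retrieval(db_texts, db_image_vecs, query_text, query_image_vec, alpha=0.5):
    """Rank a cross-media database by a weighted late fusion of text and image
    similarities. db_image_vecs and query_image_vec are image signatures
    (e.g. BoV histograms or Fisher vectors); alpha balances the two media."""
    vectorizer = TfidfVectorizer()
    T = vectorizer.fit_transform(db_texts)                    # (n_docs, vocab) tf-idf matrix
    q = vectorizer.transform([query_text])
    text_sim = cosine_similarity(q, T).ravel()                # (n_docs,) text similarity
    image_sim = cosine_similarity(query_image_vec.reshape(1, -1),
                                  np.asarray(db_image_vecs)).ravel()
    score = alpha * text_sim + (1.0 - alpha) * image_sim      # cross-media score
    return np.argsort(-score)                                 # document indices, best first

The same fused similarity matrix can be handed to any standard clustering algorithm to obtain text+image clusters of the kind illustrated by the « Pyrénées » / « Vercors » flowers example.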
Clustering
Example: paintings clustered by artist, Edward Hopper (1882-1967) and Johannes Vermeer (1632-1675).

Generic Visual Segmentation
(Figure: "LocBov, sheep": the Sheepness Map for a test image.)
(Figure: urban structure example; panels "Classifier, UrbanStructure, CombMap" and "FisherKernel, UrbanStructure, Texture Map".)
Combination of low-level feature-based segmentation with class-probability maps.
Sample apps: image asset visualization, delivery to mobile phone, visualization on the printer screen.

Generic Visual Segmentation
(Figure: summarization from the sheep category probability map; automatic thumbnailing.)
(Figure: summarization from the drawing probability map; repurposing an image for reflow.)

Applications

Applications: Class-based Image Enhancement (CBIE)
Content understanding enables superior quality: "Snow looks dirty in all my skiing photos", "Look at all those details in that building", "I like the colour in those flowers".
Based on Generic Visual Categorization (GVC) and Clustering.
GVC: the same flexible framework copes with various objects, subjects and scenes, and with various graphic arts (photography, painting, drawing, document images, etc.).
Clustering: discovery of similar types of semantic content.
Validated through user preference evaluations. The semantic aspect is key to a future personalized offering.
Semantic dimensions: PEOPLE (no faces, portraits, groups, crowds); TIME & LOC (outdoors, indoors, night / flash, day, seasonal); ENVIRONMENT (urban, nature, clouds, underwater, sunrise/sunset, sky, flowers, fog, snow); TYPE (photograph, poster, drawing, paintings, doc, objects…); INTENT (professional, personal, effects).
M. Bressan, G. Csurka and S. Favre, "Towards Intent Dependent Image Enhancement: State-of-the-art and Recent Attempts", in Proc. VISAPP, March 2007.

Applications
Robotics, e.g. assisted driving; Medical & Satellite Imaging; Security, e.g. biometrics, tracking; Entertainment; Visual Inspection & Quality Control; OCR/ICR; Management of Multimedia Assets; Augmented Reality; indexing, storage, retrieval, visualization, human-computer interfaces, context-aware computing, wearable devices; Knowledge Inference / Knowledge Creation.

Applications: Inference from Large Databases
Inference of geographic information.
Land cover estimation. Other applications: urban vs. rural (via light pollution), population density estimation, elevation gradient estimation.
James Hays, Alexei A. Efros, "IM2GPS: estimating geographic information from a single image", CMU, CVPR 2008.
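A toy sketch of the nearest-neighbor idea behind IM2GPS, under the assumption that a database of geotagged scene descriptors (e.g. GIST-like vectors) is available as arrays; the simple mean over the nearest matches stands in for the mean-shift mode estimation used in the paper:

import numpy as np

def estimate_location(query_desc, db_descs, db_latlon, k=20):
    """Nearest-neighbor geolocation: find the k most similar geotagged images
    (by descriptor distance) and propagate their GPS coordinates."""
    d = np.linalg.norm(db_descs - query_desc, axis=1)   # distance to every database image
    nearest = np.argsort(d)[:k]
    candidates = db_latlon[nearest]                      # (k, 2) lat/lon of the best matches
    return candidates.mean(axis=0), candidates           # crude estimate + raw candidates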
Applications: Semi-supervised Hybrid Content Generation
Example travel-blog inputs:
"Our plans to hit Copacabana beach the next day and check out hot Brazilian girls in skimpy bikinis were ruined by the weather. It rained all day! Can you believe that. I think we'll be heading to another place mid-week for some beach time."
"There is a lot of tourists there from around ten until three, but it didn't feel as crowded as we'd feared. We started there for 12 hours - saw the sunrise and sunset, and walked the citadel twice. It is an awesome site in the proper sense of the word (Yanks take note). Bloody magic. Some archeologists reckon that Machu Picchu could have predated the Inca but that they did a lot of improvements."
"Today had another wander around the old town and went into a number of the great churches. On the way around some of them noticed a parade of monks and nuns singing and carrying statues of Mary and Jesus before entering the Cathedral - was nice to watch."
Marco Bressan, Gabriela Csurka, Yves Hoppenot and Jean-Michel Renders, "Travel Blog Assistant System (TBAS)", Metadata Mining for Image Understanding Workshop, 2008.

Applications: Document Images
NIST: database of simulated USA tax forms: 20 form types, 5,590 images.
Evaluation protocol (repeated 10 times): 10 images / class for training, the rest used for testing.
0% error rate (best previously reported was 0.2%).
DEMO

Applications: Document Images: Demo
GVC Model UI screenshots: GVC modeling features, performance analytics, customized image list view, online content sources.

Open Research Challenges
Improve BOW:
Patch detection: improve viewpoint invariance (scale, similarity, affine invariance); gaze models.
Efficient discriminative models.
Taking shape into account.
Low-level fusion of hybrid content models.
Beyond BOW:
Taking into account the structure of objects: a tight parametric structure model (e.g. complete Gaussian) or a loose model (e.g. pairwise relations).
Improved classification models: multiple levels of abstraction, transfer learning, learning for structured output, hierarchical class models.
Learning: learn from 'contaminated' data sets (noisy, unlabelled, weakly labelled data); reduce training requirements (active, semi-supervised and unsupervised learning).
Retrieval: efficient coding for large-scale indexing and retrieval; visualization of query results.

Useful References
Recommended Tutorials
Frédéric Jurie, "Vision par ordinateur et catégorisation d'images", CNRS, Projet LEAR, INRIA Rhône-Alpes, September 2006. http://www-poleia.lip6.fr/~cord/isis/jurie.pdf
Andrew Zisserman, "Probabilistic models of visual object categories", Visual Geometry Group, University of Oxford, 2006. http://www.robots.ox.ac.uk/~vgg
Li Fei-Fei, "Bag of Words", CVPR 2007 tutorial, Princeton University, 2007. http://vision.cs.princeton.edu/documents/CVPR2007_tutorial_bag_of_words.ppt

Conclusions and Questions

Fisher kernel on visual vocabularies: Notations
Log-likelihood function; modeling of the visual vocabulary with a GMM; occupancy probability (see the formulae sketched below).
Fisher kernel on visual vocabularies: Gradient formulae (1/2)
Formulae for the partial derivatives. BOV: gradient size = (2D + 1) x N - 1, compared to a histogram of size N.
Fisher kernel on visual vocabularies: Gradient formulae (2/2)
Introducing the MLE formulae leads to closed-form expressions.
Fisher kernel on visual vocabularies: Fisher information matrix
Assumption: at a given time t, the occupancy distribution is sharply peaked, i.e. a single Gaussian contributes significantly. Under this approximation, the Fisher information matrix is diagonal, and normalizing by it amounts to a component-wise normalization of the dynamic range of the gradient vector. The resulting formulae are sketched below.
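The equations on these appendix slides were not captured in the transcription. The following LaTeX sketch reconstructs the standard formulae for the Fisher kernel on a GMM visual vocabulary, consistent with the Perronnin and Dance formulation; the notation (lambda, w_i, mu_i, sigma_i, gamma_t(i)) is assumed rather than copied from the slides.

% Notation (assumed): X = {x_t, t = 1..T} are the low-level descriptors of one image,
% lambda = {w_i, mu_i, Sigma_i, i = 1..N} is the GMM visual vocabulary (diagonal Sigma_i).
\[
  L(X|\lambda) = \sum_{t=1}^{T} \log p(x_t|\lambda),
  \qquad
  p(x_t|\lambda) = \sum_{i=1}^{N} w_i\, p_i(x_t|\mu_i,\Sigma_i),
  \qquad
  \gamma_t(i) = \frac{w_i\, p_i(x_t)}{\sum_{j=1}^{N} w_j\, p_j(x_t)} .
\]
% Gradient formulae; dimension (2D+1)N - 1, the -1 coming from the sum-to-one constraint on the weights:
\[
  \frac{\partial L}{\partial w_i}
     = \sum_{t=1}^{T}\left[\frac{\gamma_t(i)}{w_i}-\frac{\gamma_t(1)}{w_1}\right] \;\; (i \ge 2),
  \qquad
  \frac{\partial L}{\partial \mu_i^{d}}
     = \sum_{t=1}^{T}\gamma_t(i)\,\frac{x_t^{d}-\mu_i^{d}}{(\sigma_i^{d})^{2}},
  \qquad
  \frac{\partial L}{\partial \sigma_i^{d}}
     = \sum_{t=1}^{T}\gamma_t(i)\left[\frac{(x_t^{d}-\mu_i^{d})^{2}}{(\sigma_i^{d})^{3}}-\frac{1}{\sigma_i^{d}}\right].
\]
% Under the sharply-peaked-occupancy assumption the Fisher information matrix is approximately
% diagonal, with entries
\[
  f_{w_i} \approx T\left(\frac{1}{w_i}+\frac{1}{w_1}\right),\qquad
  f_{\mu_i^{d}} \approx \frac{T\,w_i}{(\sigma_i^{d})^{2}},\qquad
  f_{\sigma_i^{d}} \approx \frac{2\,T\,w_i}{(\sigma_i^{d})^{2}},
\]
% so normalizing the gradient by the inverse square root of this matrix is a component-wise
% rescaling of its dynamic range, as stated on the slide.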