Concepts and Techniques for Image and Video Search by Visual Content
Alejandro Jaimes
Columbia University, Department of Electrical Engineering, Digital Video and Multimedia Group

Outline
- Introduction
  – Future Multimedia Applications
  – Problems Addressed & Overview
- Understanding of Visual Content & Users
  – Multi-Level Indexing Pyramid & Eye Tracking
- Flexible Computational Methods that Learn
  – The Visual Apprentice
- Detectors in Practical Tasks
  – STELLA & Non-Identical Duplicates
- Summary of Contributions
- Conclusions & Future Work

Motivation
- Tremendous growth in the availability of multimedia data for personal use
  – Creation (e.g., digital cameras)
  – Acquisition (e.g., scanners, recording devices)
  – Access (e.g., web, pervasive devices)
- Advances in communications, affective, and wearable computing

The Visual Revolution
[Timeline figure, "a non-mathematical historical perspective" of user activity over time: lithography 1798, photography 1860s, Brownie camera 1900, Super 8 mm film cartridges 1965, VCR 1972, digital cameras 1990s, TiVo / TV Anytime late 1990s, future applications 2050+]

Alex in the Future: Active Users!
[Figure: personalized everywhere display, affective interface, multimedia capture device, wearable multimedia application ... exciting, unforeseen applications]

Problems Addressed
- Automatic indexing and organization of visual information through user input at multiple levels
  – Understanding of visual content and users: Multi-Level Indexing Pyramid & Eye Tracking
  – Flexible computational frameworks: The Visual Apprentice
  – Integration of generic detectors in solving practical tasks in a specific domain: STELLA & Duplicate Detection

Overview
[Diagram: automatic indexing and organization of visual information through user input at multiple levels — Understanding Content & Users, Flexible Frameworks (object, object-parts, perceptual areas, regions), Detectors in Practical Tasks]

Roadmap: Understanding of Visual Content & Users — Multi-Level Indexing Pyramid & Eye Tracking

The Multi-Level Indexing Pyramid (Understanding Content & Users) [SPIE '00] [MPEG-7 '99, '00] [ASIST '00] [JASIS '01]
- Understand levels of description of visual information
- Conceptual structure for classifying visual attributes into multiple levels
  – Art (E. Panofsky), Cognitive Psychology (E. Rosch et al.)
  – Information Sciences (C. Jörgensen), Visual Information Retrieval
- Why is this important?
  – Guides the indexing process
  – Helps build better indexing systems
- Key points
  – The pyramid represents the full range of visual attributes
  – Strong impact on MPEG-7

Indexing Levels (Visual Attributes)
- Syntax: 1. Type/Technique, 2. Global Distribution, 3. Local Structure, 4. Global Composition
- Semantics: 5. Generic Object, 6. Generic Scene, 7. Specific Object, 8. Specific Scene, 9. Abstract Object, 10. Abstract Scene
- The knowledge required increases down the pyramid (e.g., "texture" at the global-distribution level vs. named persons such as Ana and Alex at the specific-object level)

Level 1: Type/Technique (Syntax)
- Type or technique used during production
- No knowledge of visual content, just general visual characteristics
- Examples:
  – Color or b/w photograph
  – Watercolor or oil painting
[Example images: color photograph, oil painting]
Level 2: Global Distribution (Syntax)
- Distribution of low-level features only
- Examples:
  – Color distribution: dominant color, histogram
  – Global texture: coarseness, contrast
  – Global shape: aspect ratio
  – Global motion/deformation: speed, acceleration
[Example images: similar texture, similar color histogram]

Level 3: Local Structure (Syntax)
- Characterization and extraction of basic visual elements
- Examples:
  – Dots, lines, tone, circles, squares
  – Local color
  – Binary shape mask
  – Local motion/deformation
[Example images: blood cells = circles, stars = dots]

Level 4: Global Composition (Syntax)
- Arrangement or layout of basic elements
- No knowledge of objects
- Examples:
  – Balance, symmetry
  – Center of interest
  – Leading line, viewing angle
[Example images: horizontal leading line, centered object]

Level 5: Generic Object (Semantics)
- General (everyday) knowledge about objects
- What the image is "of"
- Examples:
  – Common nouns: person, chair, desk, airplane
[Example image: persons, flag]

Level 6: Generic Scene (Semantics)
- General knowledge about the scene
- What the image is "of"
- Examples:
  – City, landscape
  – Indoor, outdoor
  – Daytime, nighttime
[Example images: outdoors/city/street; indoors/office]

Level 7: Specific Object (Semantics)
- Identified and named objects
- Specific knowledge about objects, known facts
- What the image is "of"
- Examples:
  – F-18
  – B. Clinton
  – Chinese Ambassador Z. Li
  – American flag
  – Lincoln desk
[Example image: B. Clinton, Z. Li]

Level 8: Specific Scene (Semantics)
- Identified and named scene
- Specific knowledge about the scene, known facts
- What the image is "of"
- Examples:
  – Name of a city, street, lake
  – Name of a building
[Example images: Paris; Oval Office, White House]

Level 9: Abstract Object (Semantics)
- Interpretation of an object
- Subjective, or based on specific personal knowledge
- What the image is "about"
- Examples:
  – Political power
  – Sympathy
[Example images: about baseball (or basketball?); political gesture]

Level 10: Abstract Scene (Semantics)
- Subjective interpretation of a scene
- What the image is "about"
- Examples:
  – International politics
  – War
  – Apology
[Example images: peacefulness; US government]

Pyramid Example
1. TYPE/TECHNIQUE: Color still image
2. GLOBAL DISTRIBUTION: Color histogram
3. LOCAL STRUCTURE: Circles, squares
4. GLOBAL COMPOSITION: Centered
5. GENERIC OBJECT (of): Persons, building
6. GENERIC SCENE (of): Outdoors
7. SPECIFIC OBJECT (of): Ana, Alex
8. SPECIFIC SCENE (of): CEPSR
9. ABSTRACT OBJECT (about): Happy, friendly
10. ABSTRACT SCENE (about): Research agreement, friendship
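The pyramid is, in the end, a structure for organizing annotations. As a purely illustrative sketch (not part of the original talk), the ten levels and the example annotation above could be represented as follows; the names PyramidLevel and ImageDescription are assumptions introduced here.

```python
from dataclasses import dataclass, field
from enum import Enum


class PyramidLevel(Enum):
    """The ten levels of the Multi-Level Indexing Pyramid."""
    TYPE_TECHNIQUE = 1
    GLOBAL_DISTRIBUTION = 2
    LOCAL_STRUCTURE = 3
    GLOBAL_COMPOSITION = 4
    GENERIC_OBJECT = 5
    GENERIC_SCENE = 6
    SPECIFIC_OBJECT = 7
    SPECIFIC_SCENE = 8
    ABSTRACT_OBJECT = 9
    ABSTRACT_SCENE = 10

    @property
    def is_syntax(self) -> bool:
        # Levels 1-4 describe syntax; levels 5-10 describe semantics.
        return self.value <= 4


@dataclass
class ImageDescription:
    """Attributes of one image, grouped by pyramid level."""
    image_id: str
    attributes: dict = field(default_factory=dict)  # PyramidLevel -> list of labels

    def add(self, level: PyramidLevel, *labels: str) -> None:
        self.attributes.setdefault(level, []).extend(labels)


# The "Pyramid Example" slide above, expressed with this structure.
desc = ImageDescription("ana_alex_cepsr")
desc.add(PyramidLevel.TYPE_TECHNIQUE, "color still image")
desc.add(PyramidLevel.GLOBAL_DISTRIBUTION, "color histogram")
desc.add(PyramidLevel.LOCAL_STRUCTURE, "circles", "squares")
desc.add(PyramidLevel.GLOBAL_COMPOSITION, "centered")
desc.add(PyramidLevel.GENERIC_OBJECT, "persons", "building")
desc.add(PyramidLevel.GENERIC_SCENE, "outdoors")
desc.add(PyramidLevel.SPECIFIC_OBJECT, "Ana", "Alex")
desc.add(PyramidLevel.SPECIFIC_SCENE, "CEPSR")
desc.add(PyramidLevel.ABSTRACT_OBJECT, "happy", "friendly")
desc.add(PyramidLevel.ABSTRACT_SCENE, "research agreement", "friendship")

for level, labels in desc.attributes.items():
    kind = "syntax" if level.is_syntax else "semantics"
    print(f"{level.value:2d}. {level.name} ({kind}): {', '.join(labels)}")
```

Keeping syntax and semantics in one structure is what allows the same annotation record to serve both low-level (feature-based) and high-level (concept-based) retrieval.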
Pyramid Contributions & Impact (Understanding Content & Users)
- Novel structure to classify visual attributes into multiple levels
  – Spontaneous, semi-structured, and structured descriptions; 66 subjects (naïve users, indexers, researchers), 700 images
  – Represents the full range of visual attributes (also shown by Burke, CIVR '02)
  – Can guide the indexing process (more attributes are generated with the pyramid than without it)
  – Retrieval improvement between 5% and 80%
- Impact on MPEG-7
  – Syntactic/semantic objects and relationships
  – Specific and generic events/objects, concrete objects/events
  – Text annotation type, labels for semantic entities, etc.

Eye Tracking (Understanding Content & Users) [with J. Pelz, T. Grabowski, J. Babcock, SPIE '01]
- Study the way people look at images
- Determine whether there are fixation patterns within image categories (handshake, landscape, etc.)
- Why is this interesting?
  – Understand the visual process
  – Use regions of interest from eye tracking to construct structured classifiers (e.g., the Visual Apprentice)
- Key points
  – First study with different image categories
    • Prior work: natural tasks (Pelz et al., 2000), art (Buswell, 1920), ROIs (Privitera & Stark, 2000), perception (Yarbus, 1967), etc.
  – Patterns found in the handshake and centered-object categories

Types of Eye Movements
[Figure: fixation A, saccade, fixation B]
- Fixations: gaze held at a stationary point (typically 250 msec, but can vary from 100 msec to 1000 msec)
- Saccades: rapid, ballistic eye movements; 2-3 eye movements per second (> 100,000 per day)

Eye Tracking Setup
- ASL 504 Remote Eye Tracker (Applied Science Laboratories)

Experimental Setup
- Examine fixation patterns of human subjects viewing images of different categories
- Image classes (50 images per class)
  – Handshakes
  – Landscapes
  – Centered object
  – Crowds
  – Miscellaneous
- Subjects
  – 10 subjects (6 male, 4 female; undergraduate students with normal or corrected-to-normal vision)

Experimental Results I
- Strong image dependence (one subject, several images)

Experimental Results II
- Strong similarity in fixation patterns for some images
[Figure: fixation patterns for Subjects A-D]

Experimental Results III
- Wide variation in fixation patterns for some images
[Figure: fixation patterns for Subjects A-D]

Experimental Results IV
- Categories with consistent fixation patterns (handshake, centered object)

Eye Tracking Contributions & Impact (Understanding Content & Users)
- First study within and across image categories
  – 10 subjects, 250 images, 5 categories (400,000+ data points)
  – Patterns in the handshake and centered-object categories; no patterns in the others
- Understand what to index
- Link to automatic techniques
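The fixation patterns above are computed from raw gaze samples. A minimal sketch of one standard way to do this, dispersion-threshold identification (I-DT), is shown below; it is not the analysis code used in the study, and the sample format (t, x, y) and the thresholds are illustrative assumptions based on the fixation durations quoted above.

```python
def _dispersion(window):
    xs = [x for _, x, _ in window]
    ys = [y for _, _, y in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def detect_fixations(samples, max_dispersion=1.0, min_duration=0.100):
    """Dispersion-threshold (I-DT) fixation detection.

    samples: time-ordered list of (t_seconds, x_deg, y_deg) gaze samples.
    max_dispersion: max (x-range + y-range), in degrees, for a fixation window.
    min_duration: minimum fixation duration in seconds (slide: ~100-1000 ms).
    Returns a list of (t_start, t_end, centroid_x, centroid_y).
    """
    fixations = []
    i, n = 0, len(samples)
    while i < n:
        # Start with a window spanning the minimum fixation duration.
        j = i
        while j < n and samples[j][0] - samples[i][0] < min_duration:
            j += 1
        if j >= n:
            break
        if _dispersion(samples[i:j + 1]) <= max_dispersion:
            # Extend the window while the gaze stays tightly clustered.
            while j + 1 < n and _dispersion(samples[i:j + 2]) <= max_dispersion:
                j += 1
            window = samples[i:j + 1]
            xs = [x for _, x, _ in window]
            ys = [y for _, _, y in window]
            fixations.append((window[0][0], window[-1][0],
                              sum(xs) / len(xs), sum(ys) / len(ys)))
            i = j + 1  # What follows the fixation is treated as a saccade.
        else:
            i += 1
    return fixations
```

The resulting fixation centroids are exactly the kind of regions of interest that the slides propose feeding into structured classifiers such as the Visual Apprentice.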
Roadmap: Flexible Computational Methods that Learn — The Visual Apprentice

The Visual Apprentice (Flexible Frameworks that Learn) [SPIE '99, '00, ACCV '00, IJIG '01]
- Automatically construct a visual detector from user input
  – Visual detectors are programs that automatically assign semantic labels to objects (e.g., sky) or scenes (e.g., handshake scene)
- Novel framework to learn structured visual detectors from user input at multiple levels
  – The user defines models and provides training examples
  – Learning algorithms + different features + training examples = automatic visual detectors
- Key points
  – Flexible computational approach
  – Multiple features, learning algorithms
  – No expert input required

Motivation
- Why structured detectors? Definition of complex scenes/objects
- Why learning? Flexible (no expert input) vs. static systems
- Why multiple learning algorithms and features? Different algorithms and features perform differently on the same data

Related Work
- Visual detectors
  – Most approaches are static or specialized: Body Plans (UC Berkeley), WebSeer (U. Chicago)
  – Others work at the scene level only (MIT indoor/outdoor, HP city vs. landscape)
  – Other learning approaches are not structured (MIT FourEyes)

The Visual Apprentice Definition Hierarchy
- Level 1: Object
- Level 2: Object-parts (object-part 1 ... object-part n)
- Level 3: Perceptual areas (perceptual area 1 ... perceptual area n)
- Level 4: Regions (region 1, region 2, ...)

User Input
- Define the hierarchy
  – Decide nodes and containment relationships
- For each node
  – Label examples in images/videos
- Labeling is performed by clicking on regions or outlining areas.

Definition Hierarchy: Batting
- Object: batting scene
- Object-parts: ground, pitcher, batter
- Perceptual areas (under ground): grass, sand
- Regions under each leaf node

Definition Hierarchy Example
[Figure: labeled batting image — ground (grass and sand regions), pitcher regions, batter regions]

Learning Classifiers
- A classifier is a function that, given an input, assigns it to one of k classes. A learning algorithm is a function that, given a set of examples and their classes, constructs a classifier.
- Training data
  – For each node of the hierarchy, a set of examples
  – A superset of features is extracted from each example:
    • Color (average LUV, dominant color, etc.)
    • Shape & location (perimeter, form factor, eccentricity, etc.)
    • Texture (edge direction histogram, Tamura features, etc.)
    • Motion (trajectory, velocity, etc.)
- A superset of machine learning algorithms uses the training data
  • Nearest neighbors, decision trees, etc.

Learning Classifiers: Stages
[Diagram: for each node of the definition hierarchy, learning algorithms 1..n produce classifiers C1..Cn, compared by a performance estimator on the training data]
- Stage 1: Training data obtained and feature vectors computed
- Stage 2: Classifiers learned
- Stage 3: Classifiers selected

Feature Subset Selection
- Given a feature set A, find a subset B ⊆ A such that B is a better feature set than A (according to the performance of the resulting classifier)
- Wrapper model: candidate feature subsets are drawn from the feature set and each subset is evaluated by running the learning algorithm on the training set and measuring the resulting classifier's performance
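To make the three stages and the wrapper model concrete, here is a minimal scikit-learn style sketch, assuming per-node feature matrices X (NumPy arrays) and labels y have already been produced by the segmentation and feature-extraction steps described above; it illustrates the idea, not the Visual Apprentice implementation, and the function names are hypothetical.

```python
# Minimal sketch: one classifier per hierarchy node, chosen among several
# learners, with greedy wrapper-style feature selection.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def select_features_wrapper(model, X, y, max_features=5):
    """Greedy forward selection: subsets are scored by the classifier itself."""
    selected, best_score = [], -np.inf
    while len(selected) < min(max_features, X.shape[1]):
        scores = {
            f: cross_val_score(model, X[:, selected + [f]], y, cv=3).mean()
            for f in range(X.shape[1]) if f not in selected
        }
        f_best, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:
            break  # no remaining feature improves performance
        selected.append(f_best)
        best_score = score
    return selected, best_score


def train_node_classifier(X, y):
    """Stages 1-3: try several learners (+ feature subsets), keep the best."""
    candidates = [KNeighborsClassifier(n_neighbors=3), DecisionTreeClassifier()]
    best = None
    for model in candidates:
        feats, score = select_features_wrapper(model, X, y)
        if best is None or score > best[2]:
            best = (model.fit(X[:, feats], y), feats, score)
    return best  # (fitted classifier, selected feature indices, CV score)


# One classifier per node of the definition hierarchy, e.g. for batting:
# node_classifiers = {name: train_node_classifier(X, y)
#                     for name, (X, y) in training_data.items()}
```

Because the evaluation loop wraps around the learning algorithm itself, a different feature subset (and a different learner) can win at each node, which is exactly the flexibility the slides argue for.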
A Classification Example: Handshake
- Hierarchy: object = handshake; object-parts = face 1, handshake, face 2; each defined over regions
- A face region classifier Cr determines which regions become the input of the face object-part classifier Cf

Training Summary
- User:
  – Labels example images according to the hierarchy
- System:
  – Automatically segments the examples
  – Extracts visual features for each node
  – Applies a set of learning algorithms to each node
  – Selects the best features for each classifier
  – Produces a set of classifiers for each node (with the best features)
  – Selects the best classifier

Classification Summary
- Automatic segmentation
- Feature extraction
- Classification and grouping, bottom-up through the hierarchy (regions → perceptual areas → object-parts → object, as in the handshake hierarchy above)

Experiments: Definition Hierarchy I
- Objects defined directly over regions (no object-parts or perceptual areas): ship, sky, elephant, face

Experiments: Definition Hierarchy II
- Skater: object-parts body and face; perceptual areas white and blue; regions
- Handshake: object-parts face 1, handshake, face 2; regions

Experiments: Definition Hierarchy III
- Batting: object-parts ground (perceptual areas grass and sand), pitcher, batter; regions

Results I

Hierarchy I:
  Class       Recall  Precision  Train  Test
  Ships         94%      70%       30    280
  Roses         94%      75%       30    280
  Elephants     91%      77%       30    280
  Cocktails     95%      36%       30    280
  Face          70%      89%       86    378
  Skies         55%      87%       40    670

Hierarchy II:
  Skater       100%      62%       30    490
  Handshakes    74%      70%       80    733

Baseball Experiments
- Data used
  – 12 innings from different games
  – Different teams (7)
  – Field type (natural and artificial turf)
  – Time of day (6 day games, 6 night games)
  – Broadcast source (4 channels)
- Difficulties
  – Visual appearance: time, weather, stadium, teams, etc.
  – Signal: reception, noise, origin, etc.
- Batting-scene detection: Recall 64%, Precision 100% (Train: 60, Test: 376)

Examples
[Figure: missed batting scenes; shots correctly rejected as pitching scenes]

When to Use Learning? Recurrent Visual Semantics (RVS)
- RVS: the repetitive appearance of elements (e.g., objects, scenes, or shots) that are visually similar and have a common level of meaning within a specific context
- Analysis: when to use the VA?
  – Structured classes (e.g., not indoor/outdoor classifiers)
  – Where Recurrent Visual Semantics are present
- Structural differences
  – More nodes: higher precision, lower recall
- Advantages
  – The hierarchy implies context
  – Can integrate classifiers built by experts
  – Can generate multiple features (e.g., MPEG-7)

VA Contributions & Impact
- Novel framework that learns visual detectors from user input at multiple levels
  – Integration of multiple learning algorithms and features
  – Use of non-domain-specific techniques (domain knowledge can be easily incorporated)
  – Flexibility in defining object/scene hierarchies and detailed user input at multiple levels
- Applied in different domains and projects (digital library news images, Corel, baseball, consumer photography)
- Visual Apprentice plus expert input (D. Zhong, ICME '00)
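As a rough illustration of the classification-and-grouping step summarized above (a simplified, assumption-based sketch rather than the actual VA decision logic), per-node classifiers can be applied bottom-up and an object declared present when its object-parts are found; the names classify_image and hierarchy, and the binary 0/1 labels, are hypothetical.

```python
# Illustrative bottom-up classification with per-node classifiers
# (continuing the hypothetical train_node_classifier example above).

def classify_image(regions, node_classifiers, hierarchy):
    """regions: list of per-region feature vectors (NumPy arrays) from segmentation.
    node_classifiers: {node_name: (clf, feature_idx, score)} per hierarchy node.
    hierarchy: {object_name: [object_part names]} containment relationships.
    Returns object labels whose parts were all detected."""
    detected_parts = set()
    for part, (clf, feats, _) in node_classifiers.items():
        # A part counts as detected if any segmented region is classified as
        # belonging to it (label 1 is assumed to mean "positive").
        if any(clf.predict(x[feats].reshape(1, -1))[0] == 1 for x in regions):
            detected_parts.add(part)
    # An object (e.g., "handshake" or "batting") is detected when all of its
    # object-parts are present; the hierarchy itself supplies the context.
    return [obj for obj, parts in hierarchy.items()
            if all(p in detected_parts for p in parts)]


# Example with hypothetical names:
# hierarchy = {"handshake": ["face_1", "face_2", "handshake_area"]}
# labels = classify_image(segment(image), node_classifiers, hierarchy)
```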
Roadmap: Detectors in Practical Tasks — STELLA & Non-Identical Duplicates

STELLA (Story TELLing and Album Creation) [with Alex C. Loui of Kodak & Ana B. Benitez; interface with Enrico Bounano]
- Hierarchical organization (novel composition features)
- Time sequence information (novel variation of the Ward clustering algorithm)
- The user can add metadata, edit clusters, etc.
- Logs editing operations
- Non-identical duplicates!

Non-Identical Duplicate Detection (Detectors in Practical Tasks) [with Alex C. Loui of Kodak]
- Determine whether image I1 and image I2 are duplicates
  – Duplicates: images of the same scene, taken from approximately the same angle and range (Kodak definition)
- New duplicate taxonomy, novel framework for detecting non-identical duplicates, detailed analysis of a duplicate database
- Why are duplicates important?
  – Essential in organizing personal collections
  – They represent important events and are prevalent in current collections (19% of the images per roll in the Kodak database); many more in the future
- Key points
  – New problem (related to, but distinct from, stereo, registration, mosaics, etc.)
  – Multiple view geometry + visual detectors

A Duplicate Model
- A duplicate pair shares the scene, but something has changed in the scene, the lighting, the camera, or the image:
  – Scene (subject/background): move, change, non-stationary subject
  – Lighting: flash/no flash, sunny/overcast, light/dark
  – Camera: exposure (aperture, shutter speed), viewpoint (translation, rotation pan/tilt, zoom)
  – Image: noise, luminosity, color, added text (video only)

Duplicate Categories
- No change; horizontal/vertical translation; framing; angle change; zoom; subject move; subject change; different background; several changes

A Novel Duplicate Taxonomy
[Figure: taxonomy of duplicate types]

A New Framework
[Figure: duplicate detection framework]

Multiple View Geometry
- Type III (Homography): x′ = Hx
- Type IV (Fundamental matrix): l′ = Fx, i.e., the match of x must lie on the epipolar line l′ (equivalently, x′ᵀ F x = 0)

Framework Overview
- Interest point matching → alignment (homography or fundamental matrix) → separation of global vs. local changes → self-correlation and block matching → global change areas → local analysis with VA detectors and domain rules

Detected Change Areas
[Figure: example change areas between candidate duplicate pairs]

Object Detection
[Figure: vegetation and face detection within change areas]

Duplicate Image Database [collected & labeled by Kodak]
- 255 image pairs from 60 rolls of film, from 54 real consumers, labeled by 10 other people
  – Only obvious duplicates, very similar non-duplicates, and "borderline" duplicates (obvious non-duplicates not included)
  – Considerable subjectivity (100% agreement on only 43% of the pairs)
[Example pairs: obvious duplicate, borderline duplicate, similar non-duplicate, obvious non-duplicate]

Non-Identical Duplicates: Contributions & Impact
- New problem of non-identical duplicate consumer photographs
  – Novel taxonomy of different types of duplicates
  – New framework to detect non-identical duplicates
    • Multiple view geometry
    • Visual detectors
    • Domain knowledge
- Detailed analysis of the duplicate database
  – On the set with 100% labeler agreement (65 duplicates, 43 non-duplicates):
    • Precision: 64%
    • Recall: 97%
    • False positives: visually similar non-duplicate image pairs
    • Misses: significant visual variation but minor semantic variation
    • Hits: mostly in the no-change and subject-move categories
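The interest-point-matching and alignment front end of the framework above can be sketched with standard tools. The following OpenCV example is an assumption-laden illustration, not the thesis system: it only covers the homography case (x′ = Hx), and the thresholds and the inlier-ratio decision rule are invented for illustration; a real system would pass surviving pairs on to local change analysis.

```python
import cv2
import numpy as np


def duplicate_candidate(path1, path2, min_matches=20, inlier_ratio=0.6):
    img1 = cv2.imread(path1, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(path2, cv2.IMREAD_GRAYSCALE)

    # 1. Interest point matching (SIFT + ratio test).
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    if des1 is None or des2 is None:
        return False, None
    matches = []
    for pair in cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2):
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            matches.append(pair[0])
    if len(matches) < min_matches:
        return False, None

    # 2. Alignment: estimate a homography x' = Hx with RANSAC.
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return False, None

    # 3. Crude decision: many geometric inliers suggest the same scene from a
    # similar viewpoint; such a pair would then go to local analysis
    # (change areas, VA detectors, domain rules) in the framework above.
    ratio = float(mask.sum()) / len(matches)
    return ratio >= inlier_ratio, H
```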
Roadmap: Summary of Contributions, Conclusions & Future Work

Overview
[Diagram repeated: automatic indexing and organization of visual information through user input at multiple levels — Understanding Content & Users, Flexible Frameworks, Detectors in Practical Tasks]

Publications Summary (34 total)
- 1 book chapter (Wiley & Sons '02)
- 2 journal papers (IJIG, JASIS)
- 11 MPEG-7 proposals (includes one outside the thesis topics)
- 17 conference publications (includes work outside the thesis topics)
- 3 patents (filed)

Summary of Contributions
- Understanding Content & Users
  – Novel conceptual structure
    • Covers the full range of visual attributes
    • Guides the indexing process
    • Improves retrieval
  – First eye-tracking study across image categories
  – Link to automatic detectors
- Flexible Frameworks
  – User input at multiple levels
  – Multiple features and algorithms
  – Structured detectors
- Detectors in Practical Tasks
  – New problem, novel taxonomy, detailed analysis
  – New framework: multiple view geometry + visual detectors

Conclusions
- We addressed the problem of automatic indexing and organization of visual information through user input at multiple levels
  – Understanding of visual content and users
  – Flexible computational frameworks
  – Integration of generic detectors in solving practical tasks in a specific domain

Future Work I (Understanding Content & Users)
- Multi-Level Indexing Pyramid
  – Integration with STELLA for browsing
  – Automatic classification of descriptions
- Eye Tracking
  – Extend to more categories
  – Analyze scan-path information
  – Build a computational approach directly from eye-track data
  – Passive learning?
  – Does color make a difference?

Future Work II (Flexible Frameworks that Learn)
- The Visual Apprentice
  – Active learning
  – Expert knowledge integration (ontologies?)
  – Further classifier combination
  – Automatic collection of training data (RVS)
  – Integration with eye tracking

Future Work III (Detectors in Practical Tasks)
- STELLA (Story TELLing and Album Creation)
  – New browsing and visualization approaches
  – Transcoding (photographs anywhere, anytime)
  – Multimedia STELLA (not just photographs)
  – Storytelling
- Duplicate Detection
  – More efficient duplicate detection
  – Automatic model selection

Future Research Directions
- Personalization
  – Machine learning (Video Digests, ICIP '02)
  – Multiple-level feature integration
  – Learning from semi-passive interaction
- Semantic analysis
  – Knowledge repositories & reasoning (Context, SPIE '03)
  – Production-rule knowledge
  – Multi-modal analysis
- Interactive frameworks
  – Multi-level browsing
  – Multimedia ontologies (ICME '03)
  – "Smart" annotation

Other Projects
- TREC benchmark (IBM, TREC '02)
- Personalized video digests (IBM, ICIP '02)
- Learning color corrections (IBM, IS&T PICS '01)
- Descreening (IBM, SPIE '99)
- 3-D stereoscopic imaging (IBM, SPIE '99)
- MPEG-7 (Columbia)

Acknowledgements
- Prof. Shih-Fu Chang
- Dr. Nevenka Dimitrova, Prof. Alexandros Eleftheriadis, Prof. Dan Ellis, Prof. John R. Kender
- Ana B. Benitez
- Dr. Lawrence Bergman, Dr. Vittorio Castelli (IBM T. J. Watson)
- Dr. Corinne Jörgensen (Florida State University)
- Dr. Alexander C. Loui (Kodak, Rochester, NY)
- Dr. Jeff Pelz (RIT, Rochester, NY)
- Dr. Hawley Rising, Dr. Toby Walker (Sony)
- All my colleagues and friends at Columbia and IBM T. J. Watson Research Center, the anonymous reviewers, and conference & MPEG-7 colleagues

The End
THANK YOU FOR YOUR ATTENTION!