Grauman - Frontiers in Computer Vision
Transcription
Grauman - Frontiers in Computer Vision
Rela-ve A0ributes Devi Parikh (TTIC) and Kristen Grauman (UT Aus0n) Learn a scoring func0on rm (xi ) = ∀(i, j) ∈ Om : T w m xi > T w m xi ), }, S :{{ m ∼ that best sa0sfies constraints: ∀(i, j) ∈ Sm : T w m xj ... T w m xi = T wm xj }} ... , Young: S �C �H M �Z Smiling: M �Z Need not use all aMributes Need not relate to all S S H M C Z Youth Infer image category using max-‐likelihood 5. Describing Images Rela-vely Learnt rela-ve a0ributes Density: � m T≺I∼S≺H≺C∼O∼M∼F T∼F≺I∼S≺M≺H∼C∼O O≺C≺M∼F≺H≺I≺S≺T F≺O∼M≺I∼S≺H∼C≺T F≺O∼M≺C≺I∼S≺H≺T C≺M≺O≺T∼I∼S∼H∼F S≺M≺Z≺V≺J≺A≺H≺C A≺C≺H≺Z≺J≺S≺M≺V V≺H≺C≺J≺A≺S≺Z≺M J≺V≺H≺A∼C≺S∼Z≺M V≺J≺H≺C≺Z≺M≺S≺A J≺Z≺M≺S≺A∼C∼H∼V M≺S≺Z≺V≺H≺A≺C≺J M≺J≺S≺A≺H≺C≺V≺Z A≺C≺J∼M∼V≺S≺Z≺H H≺J≺V≺Z≺C≺M≺A≺S H≺V≺J≺C≺Z≺A≺S≺M 7. Image Descrip-on Results … Auto -‐ generate textual descrip-on of: � � Max-‐margin learning to rank formula-on � � � � 1 � � Rela-ve a 0ributes s pace T 2 2 2 � � � � 1/8 d ataset min ||w || + C ξ + γ 1 Adapted o bjec0ve m 2 ij ij T 2 2 2 2 min ||w || + C ξ + γ � m 2 ij ij �� � � from [ Joachims, 2 002] 2 Density 2 2 T T C ξij + γij s.t w (x − x ) ≥ 1 − ξ , ∀(i, j) ∈ O ; |w (x − x )| ≤ γ , ∀(i, j) ∈ S ; i j ij m i j ij m m m T T s.t wm (xi − xj ) ≥ 1 − ξij , ∀(i, j) ∈ Om ; |wm (xi − xj )| ≤ γij , ∀(i, j) ∈ Sm ;C C H H H C F H H M F F I F ξ ≥ 0; γ ≥ 0 ij ij T ≥ 1 − ξij , ∀(i, j) ∈ Om ξij; |w ≥m 0;(x γiji − ≥ x0j )| ≤ γij , ∀(i, j) ∈ Sm ; Rela-ve descrip-on: 0 3. Ranking Func-on vs. Binary Classifier Score “more dense than , less dense than ” w w How do learned “more dense than Highways, less dense than Forests” % c orrectly o rdered p airs Classifier Ranker ranking func0ons Not dense: Dense: Outdoor scenes 80% 89% Whereas conven0onal differ from classifier Celebrity faces 67% 82% Binary descrip-on: “not dense” outputs? b m Classifier instead of ranker (SRA) 80 Less chubby than Less smiling than Less VisFHead than ? ? ? 100 80 20 0 1 20 PubFig 40 20 SRA 0 Proposed 0 1 2 3 4 5 # unseen categories binary ~ rela0ve supervision Amt. of labeled data to learn a0ributes PubFig 60 60 40 20 0 DAP 1 2 5 15 # labeled pairs 40 20 SRA 0 Proposed 1 2 5 15 # labeled pairs << baseline supervision can give unique ordering on all classes Amount of descrip-on OSR PubFig 60 40 20 40 20 DAP SRA Proposed 0 0 6 5 4 3 2 1 11109 8 7 6 5 4 3 2 1 # att to describe unseen # att to describe unseen Rela0ve descrip0ons more natural than insidecity; less natural than highway; more open than not natural, not open, street; less open than coast; more perspec0ve than highway; less perspec0ve perspec0ve than insidecity natural, open, more natural than tallbuilding; less natural than mountain; more open perspec0ve than mountain; less perspec0ve than opencountry; more White than AlexRodriguez; more Smiling than JaredLeto; less White, not Smiling, Smiling than ZacEfron; more VisibleForehead than JaredLeto; less VisibleForehead VisibleForehead than MileyCyrus more White than AlexRodriguez; less White than MileyCyrus; less Smiling White, not Smiling, than HughLaurie; more VisibleForehead than ZacEfron; less not VisibleForehead VisibleForehead than MileyCyrus not Young, more Young than CliveOwen; less Young than ScarleMJohansson; more BushyEyebrows, BushyEyebrows than ZacEfron; less BushyEyebrows than AlexRodriguez; RoundFace more RoundFace than CliveOwen; less RoundFace than ZacEfron DAP 0 1 2 3 4 5 # unseen categories 60 OSR 2 3 # top choices not natural, not open, more natural than tallbuilding; less natural than forest; more open than perspec0ve tallbuilding; less open than coast; more perspec0ve than tallbuilding; OSR 40 60 40 PubFig classical recogni0on problem 60 Example descrip-ons Image Binary descrip0ons Binary Relative 60 0 Human subject experiment: Which image is? More chubby than More smiling than More VisFHead than Number of unseen categories OSR Accuracy Rela-ve a0ributes space Relative Accuracy {( For each aMribute am , Supervision is Om : � Tes0ng: Categorize image into N (=S+U) categories OSR natural open perspective large-objects diagonal-plane close-depth PubFig Masculine-looking White Young Smiling Chubby Visible-forehead Bushy-eyebrows Narrow-eyes Pointy-nose Big-lips Round-face Binary TI S HC OMF 00001 11 1 00011 11 0 11110 00 0 11100 00 0 11110 00 0 11110 00 1 ACHJ MS V Z 11110 01 1 01111 11 1 00001 10 1 11101 10 1 10000 00 0 11101 11 0 01010 00 0 01100 01 1 00100 00 1 10001 10 0 10001 10 0 Accuracy 2. Learning Rela-ve A0ributes � An aMribute is more discrimina0ve when used rela0vely OSR PubFig Quality of descrip-on 60 60 40 20 Accuracy ? Novel zero-‐shot learning from aMribute comparisons Precise automa0cally generated textual Not Smiling descrip0ons of images ∼ Training: Images from S seen and descrip0ons of U unseen categories Unseen categories Enables new applica-ons Smiling … Smiling: Accuracy ? � Smiling Natural Learn a ranking func0on for each Not Natural aMribute Young: Public Figure Face (PubFig): 800 images, 8 categories: Alex Rodriguez (A), Clive Owen (C), Hugh Laurie (H), Jared Leto (J), Miley Cyrus (M), ScarleM Johansson (S), Viggo Mortensen (V) and Zac Efron (Z), gist and color features Baselines: Direct AMribute Predic0on (DAP) [Lampert et al. 2009] (binary) � c ĉ(x) = argmax P (am |x) Accuracy Categorical (binary) aMributes are restric0ve and can be unnatural Richer communica0on between humans and machines Describe images or categories rela0vely e.g. “dogs are furrier than giraffes”, “find less congested downtown Chicago ” scene than Learnt rela-ve a0ributes Outdoor Scene Recogni-on (OSR): 2688 images, 8 categories: coast (C), forest (F), highway (H), inside-‐city (I), mountain (M), open-‐country (O), street (S) and tall-‐building (T), gist features; 8. Zero-‐shot Learning Results Accuracy Proposed idea: Rela-ve A0ributes 6. Datasets Accuracy Mo-va-on: 4. Rela-ve Zero-‐shot Learning % correct image in top choices 1. Main Idea 40 20 DAP SRA Proposed 0 0 1 2 3 1 2 3 Looseness of constraints Looseness of constraints Rela0ve aMributes jointly carve out space for unseen category
Similar documents
BGP in 2013
rou0ng and refrain from announcing more specifics into the rou0ng table? • Are we gewng be[er or worse at aggrega0on in rou0ng? • ...
More informationMoYve, Gesture and the Analysis of Performance Gestures
• Determine how the sounded music is made to cohere; focus not the intent of the performer. • Trap of tradi0onal analyses: assembling...
More information