Grauman - Frontiers in Computer Vision

Transcription

Grauman - Frontiers in Computer Vision
Rela-ve A0ributes Devi Parikh (TTIC) and Kristen Grauman (UT Aus0n) Learn a scoring func0on rm (xi ) =
∀(i, j) ∈ Om :
T
w m xi
>
T
w m xi
), }, S :{{
m
∼
that best sa0sfies constraints: ∀(i, j) ∈ Sm :
T
w m xj
...
T
w m xi
=
T
wm xj
}}
...
,
Young: S �C �H M �Z Smiling: M �Z  Need not use all aMributes  Need not relate to all S S H M C Z Youth Infer image category using max-­‐likelihood 5. Describing Images Rela-vely Learnt rela-ve a0ributes Density: �
m
T≺I∼S≺H≺C∼O∼M∼F
T∼F≺I∼S≺M≺H∼C∼O
O≺C≺M∼F≺H≺I≺S≺T
F≺O∼M≺I∼S≺H∼C≺T
F≺O∼M≺C≺I∼S≺H≺T
C≺M≺O≺T∼I∼S∼H∼F
S≺M≺Z≺V≺J≺A≺H≺C
A≺C≺H≺Z≺J≺S≺M≺V
V≺H≺C≺J≺A≺S≺Z≺M
J≺V≺H≺A∼C≺S∼Z≺M
V≺J≺H≺C≺Z≺M≺S≺A
J≺Z≺M≺S≺A∼C∼H∼V
M≺S≺Z≺V≺H≺A≺C≺J
M≺J≺S≺A≺H≺C≺V≺Z
A≺C≺J∼M∼V≺S≺Z≺H
H≺J≺V≺Z≺C≺M≺A≺S
H≺V≺J≺C≺Z≺A≺S≺M
7. Image Descrip-on Results … Auto -­‐ generate textual descrip-on of: �
�
Max-­‐margin learning to rank formula-on �
�
�
�
1
�
�
Rela-ve a
0ributes s
pace T
2
2
2
�
�
�
�
1/8 d
ataset min
||w
||
+
C
ξ
+
γ
1
Adapted o
bjec0ve m 2
ij
ij
T 2
2
2
2
min
||w
||
+
C
ξ
+
γ
�
m
2
ij
ij
��
�
�
from [
Joachims, 2
002] 2
Density 2
2
T
T
C
ξij +
γij
s.t
w
(x
−
x
)
≥
1
−
ξ
,
∀(i,
j)
∈
O
;
|w
(x
−
x
)|
≤
γ
,
∀(i,
j)
∈
S
;
i
j
ij
m
i
j
ij
m
m
m
T
T
s.t wm (xi − xj ) ≥ 1 − ξij , ∀(i, j) ∈ Om ; |wm (xi − xj )| ≤ γij , ∀(i, j) ∈ Sm ;C C H H H C F H H M F F I F ξ
≥
0;
γ
≥
0
ij
ij
T
≥ 1 − ξij , ∀(i, j) ∈ Om
ξij; |w
≥m
0;(x
γiji −
≥ x0j )| ≤ γij , ∀(i, j) ∈ Sm ;
Rela-ve descrip-on: 0
3. Ranking Func-on vs. Binary Classifier Score “more dense than , less dense than ” w
w
How do learned “more dense than Highways, less dense than Forests” % c
orrectly o
rdered p
airs Classifier Ranker ranking func0ons Not dense: Dense: Outdoor scenes 80% 89% Whereas conven0onal differ from classifier Celebrity faces 67% 82% Binary descrip-on: “not dense” outputs? b
m
  Classifier instead of ranker (SRA) 80
Less chubby than Less smiling than Less VisFHead than ? ? ? 100
80
20
0
1
20
PubFig 40
20
SRA
0
Proposed
0 1 2 3 4 5
# unseen categories
binary ~ rela0ve supervision Amt. of labeled data to learn a0ributes PubFig
60
60
40
20
0
DAP
1
2
5 15
# labeled pairs
40
20
SRA
0
Proposed
1
2
5 15
# labeled pairs
<< baseline supervision can give unique ordering on all classes Amount of descrip-on OSR
PubFig
60
40
20
40
20
DAP
SRA
Proposed
0
0
6 5 4 3 2 1
11109 8 7 6 5 4 3 2 1
# att to describe unseen
# att to describe unseen
Rela0ve descrip0ons more natural than insidecity; less natural than highway; more open than not natural, not open, street; less open than coast; more perspec0ve than highway; less perspec0ve perspec0ve than insidecity natural, open, more natural than tallbuilding; less natural than mountain; more open perspec0ve than mountain; less perspec0ve than opencountry; more White than AlexRodriguez; more Smiling than JaredLeto; less White, not Smiling, Smiling than ZacEfron; more VisibleForehead than JaredLeto; less VisibleForehead VisibleForehead than MileyCyrus more White than AlexRodriguez; less White than MileyCyrus; less Smiling White, not Smiling, than HughLaurie; more VisibleForehead than ZacEfron; less not VisibleForehead VisibleForehead than MileyCyrus not Young, more Young than CliveOwen; less Young than ScarleMJohansson; more BushyEyebrows, BushyEyebrows than ZacEfron; less BushyEyebrows than AlexRodriguez; RoundFace more RoundFace than CliveOwen; less RoundFace than ZacEfron DAP
0 1 2 3 4 5
# unseen categories
60
OSR
2
3
# top choices
not natural, not open, more natural than tallbuilding; less natural than forest; more open than perspec0ve tallbuilding; less open than coast; more perspec0ve than tallbuilding; OSR 40
60
40
PubFig
classical recogni0on problem 60
Example descrip-ons Image Binary descrip0ons Binary
Relative
60
0
Human subject experiment: Which image is? More chubby than More smiling than More VisFHead than Number of unseen categories OSR
Accuracy
Rela-ve a0ributes space Relative
Accuracy
{(
For each aMribute am , Supervision is Om :
�
 Tes0ng: Categorize image into N (=S+U) categories OSR
natural
open
perspective
large-objects
diagonal-plane
close-depth
PubFig
Masculine-looking
White
Young
Smiling
Chubby
Visible-forehead
Bushy-eyebrows
Narrow-eyes
Pointy-nose
Big-lips
Round-face
Binary
TI S HC OMF
00001 11 1
00011 11 0
11110 00 0
11100 00 0
11110 00 0
11110 00 1
ACHJ MS V Z
11110 01 1
01111 11 1
00001 10 1
11101 10 1
10000 00 0
11101 11 0
01010 00 0
01100 01 1
00100 00 1
10001 10 0
10001 10 0
Accuracy
2. Learning Rela-ve A0ributes �
An aMribute is more discrimina0ve when used rela0vely OSR
PubFig
Quality of descrip-on 60
60
40
20
Accuracy
?  Novel zero-­‐shot learning from aMribute comparisons  Precise automa0cally generated textual Not Smiling descrip0ons of images ∼
 Training: Images from S seen and descrip0ons of U unseen categories Unseen categories Enables new applica-ons Smiling … Smiling: Accuracy
? �
Smiling Natural  Learn a ranking func0on for each Not Natural aMribute Young:   Public Figure Face (PubFig): 800 images, 8 categories: Alex Rodriguez (A), Clive Owen (C), Hugh Laurie (H), Jared Leto (J), Miley Cyrus (M), ScarleM Johansson (S), Viggo Mortensen (V) and Zac Efron (Z), gist and color features Baselines:   Direct AMribute Predic0on (DAP) [Lampert et al. 2009] (binary) �
c
ĉ(x) = argmax
P (am |x)
Accuracy
Categorical (binary) aMributes are restric0ve and can be unnatural  Richer communica0on between humans and machines  Describe images or categories rela0vely e.g. “dogs are furrier than giraffes”, “find less congested downtown Chicago ” scene than Learnt rela-ve a0ributes   Outdoor Scene Recogni-on (OSR): 2688 images, 8 categories: coast (C), forest (F), highway (H), inside-­‐city (I), mountain (M), open-­‐country (O), street (S) and tall-­‐building (T), gist features; 8. Zero-­‐shot Learning Results Accuracy
Proposed idea: Rela-ve A0ributes 6. Datasets Accuracy
Mo-va-on: 4. Rela-ve Zero-­‐shot Learning % correct image in top choices
1. Main Idea 40
20
DAP
SRA
Proposed
0
0
1
2
3
1
2
3
Looseness of constraints
Looseness of constraints
Rela0ve aMributes jointly carve out space for unseen category 

Similar documents

BGP in 2013

BGP in 2013 rou0ng  and  refrain  from  announcing  more   specifics  into  the  rou0ng  table?   •  Are  we  gewng  be[er  or  worse  at  aggrega0on  in   rou0ng?   • ...

More information

MoYve, Gesture and the Analysis of Performance Gestures

MoYve, Gesture and the Analysis of Performance Gestures •  Determine  how  the  sounded  music  is  made  to   cohere;  focus  not  the  intent  of  the  performer.   •  Trap  of  tradi0onal  analyses:  assembling...

More information