Photo Search by Face Positions and Facial Attributes on Touch Devices
Yu-Heng Lei, Yan-Ying Chen, Lime Iida, Bor-Chun Chen, Hsiao-Hang Su, Winston H. Hsu
National Taiwan University, Taipei, Taiwan
{ryanlei, yanying}@cmlab.csie.ntu.edu.tw,
{limeiida, siriushpa}@gmail.com, [email protected], [email protected]
ABSTRACT
With the explosive growth of camera devices, people can freely take photos to capture moments of life, especially those spent with friends and family. A better solution for organizing the increasing number of personal or group photos is therefore highly desirable. In this paper, we propose a novel way to search for face images according to the facial attributes and face similarity of the target persons. To better match the face layout the user has in mind, our system allows the user to graphically specify face positions and sizes on a query "canvas," where each attribute or identity is represented as an "icon" for easier manipulation. Moreover, we provide aesthetics filtering to enhance the visual experience by removing candidates of poor photographic quality. The scenario has been realized on a touch device with an intuitive user interface. With the proposed block-based indexing approach, we achieve near real-time retrieval (0.1 second on average) in a large-scale dataset (more than 200k faces in Flickr images).
Figure 1: Example queries and top 5 retrieval results
from our image search system. (a) specifies two arbitrary faces with the larger one on the left and the
smaller one on the right. (b) further constrains that
the left face has attributes “female” and “youth” and
the right face has attribute “kid.” (c) specifies two
faces of “male” and “African” on the left and right,
in addition to an arbitrary face in the center. (d)
specifies a particular face in the database at the desired position and in the desired size. (e) specifies
the previous database face on the left, and a face of
“female” and “youth” on the right.
Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Indexing methods; H.5.2 [User Interfaces]: Input devices and strategies
General Terms
Algorithms, Design, Experimentation, Performance
Keywords
Face attributes, Face retrieval, Touch-based user interface, Block-based indexing
1. INTRODUCTION
When browsing photos, what makes an image memorable? MIT Media Relations [6] pointed out that images with people in them are the most memorable, followed by images of human-scale spaces and close-ups of objects. This phenomenon is even more evident in consumer photos, because most of them contain the family members or close friends that users care about and usually keep in mind. Users may forget where or when they took a photo, but they will not forget their friends and family. They can therefore use facial attributes and face identities to effectively formulate their search intentions. Furthermore, reviewing the retrieved images is likely to recall more scenes from memory, for example, that Alice was standing next to me with an African kid sitting in the middle. The photo imagined in the user's mind can be composed intuitively by graphically arranging people on a query "canvas" and refined by designating more facial attributes and identities (Fig. 1). Although consumer photos naturally lack annotations, automatic facial attribute and identity recognition techniques make this scenario economical and scalable.

Recently, some efforts have attempted to capture users' intentions by allowing them to visually describe the image content and layout on a query canvas. [1] revisits the problem
of sketch-based image search for scene photos. However, the gap between what the user has in mind and the query they can specify may still be large even in such a system. For instance, users with poor drawing skills may have a hard time describing their intention accurately. In addition, many object details, such as the age of a face, are inherently difficult to sketch. To deal with this difficulty, [7] allows the user to formulate a 2-D "semantic map" by placing text boxes of various search concepts at desired positions and in desired sizes. However, it does not address the problem of finding specific persons in the intended scene and is therefore inapplicable to managing consumer photos. Meanwhile, typing text is not a simple operation on touch panels, even though both sketch-based and concept-based queries aim at a better user experience.
Recent trends reveal that the popularity of touch devices brings new opportunities and challenges to image organization. In this paper, we propose a novel system for searching consumer photos that exploits computer vision technologies for estimating facial attributes and face similarity. Rather than laboriously sketching detailed appearances [1] or typing text [7], our work allows users to formulate a query canvas by placing "icons" of desired facial attributes (Fig. 1 (b)(c)), a specific face instance (Fig. 1 (d)), or a specific face instance with a wildcard face (Fig. 1 (e)) at desired positions and in desired sizes. Moreover, we provide aesthetics filtering to retain images of better composition, colorfulness, and contrast, thus enhancing the visual experience and saving the time spent reviewing photographically poor (usually unintended) candidates. The scenario has been realized on a touch device with an intuitive user interface. To tackle the computational overhead of searching face positions in large-scale datasets, we propose a block-based indexing approach enabling rapid on-line retrieval response (0.1 second on average). The system currently indexes more than 200k faces in Flickr images, and the approach scales to larger photo collections.
2. OBSERVATIONS AND SYSTEM OVERVIEW

When looking for a photo one has in mind, it is difficult and inefficient to locate the exact file in storage, even when photos are well categorized by time or geo-location. Some prevailing photo-sharing websites employ crowd-sourcing to obtain free tags semantically associated with images, but this mechanism cannot be duplicated in a personal photo organizer, because users cannot be expected to actively annotate their own photos. Although current commercial photo management software has begun to exploit face recognition and face clustering technologies, such solutions still lack the capability of searching for scenes with faces deployed in a specific layout. In light of these observations, our proposed system (Fig. 2) attempts to make consumer photo management faster and easier. The contributions of this paper are (1) to analyze "wild photos" with no tag information at all by automatic facial attribute detection (Sec. 3) and face similarity estimation (Sec. 4), (2) to enhance the visual experience by aesthetic filtering, which removes image candidates of poor photographic quality (Sec. 5), (3) to advance the search pattern from query by a single face instance to query by multiple attributed faces arranged on a canvas (Sec. 6.1, 6.2), and (4) to support rapid and accurate search for face positions with the proposed block-based indexing approach (Sec. 6.3).

Figure 2: Framework of the proposed system. Photos are analyzed by facial attribute detection, face similarity estimation, aesthetics assessment, and block-based indexing in the off-line process.
3. DETECTING FACE ATTRIBUTES
Facial attributes carry rich information about people and have been shown promising for seeking specific persons in face retrieval and surveillance systems. In this work, we utilize eight attributes, two of gender (female, male), three of age (kid, youth, elder), and three of race (Caucasian, Asian, African), to categorize faces in large-scale photo collections; we will extend to more facial attributes in the future. In the training phase, all the attributes are learned through a combination of Support Vector Machines (SVM) and AdaBoost, similar to [4]. First, we crawl user-contributed photos from Flickr and extract facial regions with a face detector. The face images are then annotated manually and decomposed into different face components (e.g., whole face, eyes, nose, mouth), from each of which various low-level features (e.g., Gabor filters, HOG, grid color moments, and local binary patterns) are extracted. Each learned mid-level feature is an SVM classifier over a specific low-level feature extracted from a specific face component. Finally, the optimal mid-level features for the designated facial attribute are selected and weighted through AdaBoost. The combined strong classifier reflects the most important parts for that attribute; for example, <Gabor, whole face> is most effective for the female attribute, while <color, whole face> is most effective for the African attribute.

Experiments on the benchmark data of [4] show that the approach can effectively detect facial attributes, achieving more than 80% accuracy on average. Meanwhile, the framework is generic across facial attributes and thus scales to profiling faces with more attributes. The real-valued attribute scores are normalized to the interval (0,1) by a sigmoid function before they are used.
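For concreteness, the following Python sketch, using scikit-learn with random placeholder features, illustrates this training pipeline: per-component SVM outputs serve as mid-level features, AdaBoost combines them into a strong classifier, and a sigmoid maps the final score to (0,1). All data, names, and parameter values here are illustrative assumptions, not the authors' code.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Placeholder low-level features, one matrix per <feature, component>
# pair, e.g. <Gabor, whole face>; real features come from face crops.
components = {
    "<Gabor, whole face>": rng.normal(size=(200, 40)),
    "<HOG, eyes>":         rng.normal(size=(200, 40)),
    "<color, whole face>": rng.normal(size=(200, 40)),
}
y = rng.integers(0, 2, size=200)   # manual labels, e.g. female / not female

# Mid-level features: one SVM per <low-level feature, face component>.
mid_level = {name: SVC(probability=True).fit(X, y)
             for name, X in components.items()}

# Stack the SVM outputs; AdaBoost selects and weights the most useful
# mid-level features for this attribute (its default weak learners are
# one-level decision stumps over these columns).
stacked = np.column_stack([mid_level[name].predict_proba(X)[:, 1]
                           for name, X in components.items()])
strong = AdaBoostClassifier(n_estimators=50).fit(stacked, y)

# Sigmoid-normalize the real-valued score to (0, 1), as in the paper.
# (Scoring the training set here is purely illustrative.)
attr_score = 1.0 / (1.0 + np.exp(-strong.decision_function(stacked)))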
4. ESTIMATING FACE SIMILARITIES
To enable search through face appearance, we adapt the face retrieval framework of [2]. Its advantages include (1) efficiency, achieved by using a sparse representation of face images with inverted indexing, and (2) leveraging identity information, done by incorporating identity information into the optimization process for codebook construction. Both points suit our system. In detail, detected faces are first aligned into a canonical position, and component-based local binary patterns are extracted from the images to form feature vectors. Sparse representations are then computed from these feature vectors based on a learned dictionary combined with extra identity information. By incorporating this framework into our system, the user can not only specify positions and attributes of a face but also use a face image itself, with a position, as the query. The real-valued similarity scores are normalized to the interval (0,1) before they are used.
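A minimal Python sketch of this pipeline follows, assuming scikit-learn's plain dictionary learning in place of the identity-constrained codebook optimization of [2]; the feature matrix is a random placeholder for the component-based LBP vectors, and the final rescaling is a stand-in for the (0,1) normalization mentioned above.

import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
lbp = rng.random((200, 64))     # placeholder component-based LBP vectors

# Learn a dictionary; [2] additionally injects identity constraints
# into this optimization, which plain DictionaryLearning omits.
dico = DictionaryLearning(n_components=32, max_iter=10).fit(lbp)
codes = sparse_encode(lbp, dico.components_, algorithm="lasso_lars")

# Sparsity makes an inverted index natural: each face is posted only
# under its nonzero atoms, so query lookup touches few posting lists.
codes = normalize(codes)        # unit rows: cosine similarity = dot product
sim = codes @ codes[0]          # similarity of every face to face 0
sim01 = (sim + 1.0) / 2.0       # stand-in for the (0,1) normalization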
5. ASSESSING PHOTO AESTHETICS

Filtering based on photo aesthetics is also integrated into the proposed system. Following [5], we extract bag-of-aesthetics-preserving features to model photo aesthetics at the global scope. These features have the following advantages: (1) photos can be modeled at multiple resolutions through the decomposition method of [5]; (2) photos can be described from different aesthetic aspects, including color, texture, saliency, and edge, by applying the patchwise operations proposed in [5]; (3) contrast information, to which humans are more sensitive, is taken into consideration. Based on these features, the aesthetic properties of photo composition, colorfulness, contrast, etc., can be modeled jointly.
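At query time, applying these scores reduces to a simple rank-and-cut filter. The sketch below uses hypothetical names; the top-50% cut is the setting reported in Sec. 7.1.

def aesthetic_filter(candidates, aesthetic_score, keep_ratio=0.5):
    """Keep the top keep_ratio of candidate image ids by their
    precomputed (off-line) aesthetic scores."""
    ranked = sorted(candidates, key=lambda i: aesthetic_score[i], reverse=True)
    return ranked[:max(1, int(len(ranked) * keep_ratio))]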
6. PHOTO SEARCH ON TOUCH DEVICES

6.1 User Interface
Since our system is particularly suitable for a touch-based interface, we have implemented the user interface on a tablet device, as shown in Figure 3. The user can drag faces from the top-right panel onto the canvas, and the results are displayed in the result panel in real time. Holding a face icon invokes a pop-up attribute selector. We have designed a total of 48 face icons (3 × 4 × 4) to cover the attribute combinations. To search by similarity, the user can hold a face in the result panel and use the new icon on the top-right to find similar faces in other photos. There is also a simple aesthetic filter to help find photos of better visual quality.
Figure 3: The touch-based interface of our system. Users can formulate a query by adding face icons from the top-right control panel and dragging the icons onto the canvas at desired positions. Clicking an icon brings up a pop-up window for attribute selection. Users can browse the query results at the bottom and drag any face back onto the query canvas to find more images with similar faces.

6.2 Ranking Function
First note that in our system, coordinates are always represented as fractions of the image width or height. This allows the computation to adapt to the various aspect ratios of the query canvas and the database images. For a (query image, target image) pair, denoted (q, t), the ranking problem is cast as a greedy version of maximum bipartite matching, with the two sets Q and T being the faces in the query image and the target image, respectively. Note that bipartite matching ensures each face is matched at most once. By greedy, we mean that the query faces are matched in order, each choosing the best matching face still available. This significantly reduces the computational cost and coincides with the idea that the first face coming to the user's mind is the most important. The matching score between a face in the query q and a face in the target t is proposed as a linear combination of face similarity, face attributes, face center position, and face size:
\[
\mathrm{match}(q,t) = w_{sim}\,\mathrm{Sim}(q,t) + w_{attr}\left(\prod_{\alpha=1}^{|\alpha|}\mathrm{Attr}_{\alpha}(q,t)\right)^{1/|\alpha|} + w_{pos}\left(1-\frac{d_c}{\sqrt{2}}\right) + w_{size}\left(1-\frac{d_w+d_h}{2}\right) \tag{1}
\]
The first term is the similarity score between q and t. The second term is the geometric mean of the three attribute scores (|α| = 3); if an attribute is not specified, it counts as 1. Note that in the UI, face similarity and face attributes are never specified at the same time. The third and fourth terms normalize the errors in position and size to the interval (0,1), where d_c is the L2 distance between the face centers, and d_w and d_h are the L1 distances between the face widths and heights, respectively.
The overall matching score is then proposed as the arithmetic mean of the individual matching scores:

\[
\mathrm{score}(Q,T) = \frac{1}{\max(|Q|,|T|)} \sum_{(q,t)\in M} \mathrm{match}(q,t) \tag{2}
\]
Here M denotes the set of matched face pairs. |Q| and |T| in the denominator are the numbers of faces in the query image and the target image, so if the two numbers differ, the overall score incurs a large penalty.
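To make the ranking concrete, the Python sketch below implements Eqs. (1) and (2) with the greedy matching just described. The face dictionaries and helper names are illustrative assumptions; the weights are the values later reported in Sec. 7.1.

import math

W_SIM, W_ATTR, W_POS, W_SIZE = 2.00, 0.70, 0.20, 0.10  # Sec. 7.1 weights
SLOTS = ("gender", "age", "race")                      # |alpha| = 3

def match(q, t, sim=0.0):
    """Eq. (1) for one (query face, target face) pair. Faces carry
    fractional x, y, w, h; q['attrs'] maps each slot to the requested
    attribute name or None; t['scores'] maps attribute names to
    detected scores in (0, 1)."""
    vals = []
    for s in SLOTS:
        a = q["attrs"].get(s)              # unspecified slots count as 1
        vals.append(t["scores"].get(a, 0.0) if a else 1.0)
    geo = math.prod(vals) ** (1 / len(SLOTS))          # geometric mean
    d_c = math.hypot(q["x"] - t["x"], q["y"] - t["y"])  # L2 center distance
    d_w, d_h = abs(q["w"] - t["w"]), abs(q["h"] - t["h"])  # L1 size distances
    return (W_SIM * sim + W_ATTR * geo
            + W_POS * (1 - d_c / math.sqrt(2))
            + W_SIZE * (1 - (d_w + d_h) / 2))

def score(Q, T, sim=lambda q, t: 0.0):
    """Eq. (2) with greedy bipartite matching: each face is matched at
    most once, and the first query face picks its best target first."""
    remaining, total = list(T), 0.0
    for q in Q:
        if not remaining:
            break
        best = max(remaining, key=lambda t: match(q, t, sim(q, t)))
        total += match(q, best, sim(q, best))
        remaining.remove(best)
    return total / max(len(Q), len(T))

# Usage: a single-face query against a one-face image.
q1 = {"x": 0.3, "y": 0.5, "w": 0.3, "h": 0.4,
      "attrs": {"gender": "female", "age": "youth", "race": None}}
t1 = {"x": 0.32, "y": 0.48, "w": 0.28, "h": 0.41,
      "scores": {"female": 0.9, "youth": 0.8, "kid": 0.1}}
print(score([q1], [t1]))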
6.3 Block-based Indexing
We apply a block-based method to spatially index all the database faces. Since the face center coordinates, width, and height, denoted x, y, w, and h, are fractions, the infinitely many values in the interval (0,1) make exact indexing computationally infeasible and naive quantization too sensitive. Therefore, we first quantize each of the four variables into L levels, pre-define overlapping blocks for the valid (x, y, w, h) combinations, and use them throughout the system. Note that not all of the L^4 combinations are valid blocks. The mapping between an (x, y, w, h) tuple and a block id is easily achieved by representing a block id as an L-nary number of 4 digits; this mapping is both unique and storage-free. We can then build an inverted index recording, for each block id, the ids of the images that contain a face in this block, together with the faces' attribute scores. Examining only the faces in the exact block of the query would still be too sensitive, so in on-line search the system runs a small "sliding window" to also score faces in neighboring blocks. The range of the sliding window sets the level of tolerance.
For multiple-face queries, each face is processed separately. It is important that we enforce the constraints that each target face can be matched at most once and that each query face matches at most one face in the same target image.
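The following Python sketch illustrates the index under simplifying assumptions: all L^4 blocks are treated as valid (the paper prunes invalid combinations), and the tolerances from Sec. 7.1 are interpreted as ± ranges in quantization levels. Names are illustrative, not from the authors' code.

from collections import defaultdict

L = 20                    # quantization levels (Sec. 7.1)
TOL_POS, TOL_SIZE = 5, 4  # sliding-window tolerance in levels (Sec. 7.1)

def quantize(v):
    """Map a fraction in (0,1) to one of L levels."""
    return min(int(v * L), L - 1)

def block_id(x, y, w, h):
    """Encode the four quantized digits as one L-nary, 4-digit number;
    the mapping is unique and needs no stored lookup table."""
    qx, qy, qw, qh = (quantize(v) for v in (x, y, w, h))
    return ((qx * L + qy) * L + qw) * L + qh

index = defaultdict(list)  # block id -> [(image id, attribute scores), ...]

def add_face(image_id, x, y, w, h, scores):
    index[block_id(x, y, w, h)].append((image_id, scores))

def candidates(x, y, w, h):
    """Gather faces from the query block and its neighbors; the window
    ranges set the level of tolerance in position and size."""
    qx, qy, qw, qh = (quantize(v) for v in (x, y, w, h))
    hits = []
    for bx in range(max(qx - TOL_POS, 0), min(qx + TOL_POS, L - 1) + 1):
        for by in range(max(qy - TOL_POS, 0), min(qy + TOL_POS, L - 1) + 1):
            for bw in range(max(qw - TOL_SIZE, 0), min(qw + TOL_SIZE, L - 1) + 1):
                for bh in range(max(qh - TOL_SIZE, 0), min(qh + TOL_SIZE, L - 1) + 1):
                    hits.extend(index.get(((bx * L + by) * L + bw) * L + bh, []))
    return hits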
7. EXPERIMENTS
In this section, we describe the dataset and implementation details and evaluate the performance of the joint ranking of face attributes, face position, and face size. For the performance of face similarity and photo aesthetics, please refer to the corresponding references [2] and [5]. For a video demonstration of the system, please visit our project page: http://www.csie.ntu.edu.tw/~winston/projects/face
7.1 Dataset and Implementation Details
The dataset is composed of two portions. As mentioned in Section 3, we crawl a large number of user-contributed photos from Flickr as the main portion. For similar-face retrieval, 732 daily photos containing 1,248 faces are added to the dataset as the second portion. After face detection with a commercial but free API [3], the dataset altogether contains 115,487 images with 244,491 faces (2.117 faces per image).

Similar-face retrieval is intended only for the second portion, so only the pairwise similarity scores within the second portion are estimated; faces in the first portion have zero similarity scores. For the weights in the ranking function, we empirically choose w_sim = 2.00, w_attr = 0.70, w_pos = 0.20, and w_size = 0.10, reflecting the user's intention that face similarity and attributes are much more important when they are specified. For block-based indexing, we choose L = 20 quantization levels, and the sliding window allows a tolerance of 5 levels in position and 4 levels in size. Aesthetic filtering is optionally applied to the initial result and keeps the images in the top 50% of aesthetic ranks in the final result.

With the index and metadata preloaded, the proposed system reports a typical running time of 0.10 second on a 16-core, 2.40GHz Intel Xeon server with 48GB of RAM. The storage cost is 112MB.
7.2 Performance Evaluation

To evaluate our system, we manually create eight queries covering different search intentions, listed in Table 1. We then ask twenty people to evaluate the system: for each query, we show them the top 10 results retrieved by our system and ask whether each result is relevant to the query. Table 1 shows the results averaged over the twenty users. Query tasks 1 through 4 all reach precision higher than 90%. Precision decreases as the query becomes more complicated, because the attribute detector itself makes errors: the more attributes a query contains, the harder it is to find images in which all of them are correctly detected. For instance, if the attribute detector has an 80% detection rate, then with three attributes specified in a query only 51% of the images have all attributes correctly detected (0.8^3 ≈ 0.51). Query task 7 has the lowest P@10, probably because it is inherently hard for the attribute detector to tell whether a kid is male or female.

Table 1: Precision@10 of 8 selected queries.

#  Query Intention                          P@10
1  Single face (top-left)                   0.99
2  Single face (profile canvas)             0.93
3  Single face (close-up)                   0.99
4  Two faces (left and right)               0.97
5  Single face (male, youth)                0.75
6  Male (left) and female (right)           0.53
7  Female kid (left) and male kid (right)   0.29
8  Three faces on top and two below         0.68

8. CONCLUSIONS
Our work proposes a novel way to effectively organize and search consumer photos by positioning attributed faces at desired positions and in desired sizes on a query canvas. Facial attributes are detected and face similarity is measured automatically in the off-line process to enable rapid on-line photo search. Integrated with aesthetics assessment, the system further saves the time spent browsing photos of poor quality. The scenario has been realized on a touch device with an easy-to-use interface and achieves fast retrieval response through the proposed block-based indexing approach.
9. REFERENCES
[1] Y. Cao et al. Edgel index for large-scale sketch-based image search. CVPR, 2011.
[2] B.-C. Chen, Y.-H. Kuo, Y.-Y. Chen, K.-Y. Chu, and W. Hsu. Semi-supervised face image retrieval using sparse coding with identity constraint. ACM Multimedia, 2011.
[3] face.com API. http://developers.face.com.
[4] N. Kumar et al. FaceTracer: A search engine for large collections of images with faces. ECCV, 2008.
[5] H.-H. Su, T.-W. Chen, C.-C. Kao, S.-Y. Chien, and W. Hsu. Scenic photo quality assessment with bag of aesthetics-preserving features. ACM Multimedia, 2011.
[6] A. Trafton. What makes an image memorable? MIT Media Relations, 2011.
[7] H. Xu et al. Image search by concept map. SIGIR, 2010.