Towards a Video Annotation System using Face Recognition

Lucas Lindström
December 27, 2013
Master’s Thesis in Computing Science, 30 credits
Supervisor at CS-UmU: Petter Ericson
Examiner: Fredrik Georgsson
Umeå University
Department of Computing Science
SE-901 87 UMEÅ
SWEDEN
Abstract
A face recognition software framework was developed to lay the foundation for a future
video annotation system. The framework provides a unified and extensible interface to multiple existing implementations of face detection and recognition algorithms from OpenCV
and Wawo SDK. The framework supports face detection with cascade classification using
Haar-like features, and face recognition with Eigenfaces, Fisherfaces, local binary pattern
histograms, the Wawo algorithm and an ensemble method combining the output of the
four algorithms. An extension to the cascade face detector was developed that covers yaw
rotations. CAMSHIFT object tracking was combined with an arbitrary face recognition
algorithm to enhance face recognition in video. The algorithms in the framework and the
extensions were evaluated on several different test databases with different properties in
terms of illumination, pose, obstacles, background clutter and imaging conditions. The results of the evaluation show that the algorithmic extensions provide improved performance
over the basic algorithms under certain conditions.
Contents
1 Introduction
  1.1 Report layout
  1.2 Problem statement
  1.3 Goals
  1.4 Methods
  1.5 Related work

2 Introduction to face recognition and object tracking
  2.1 Preliminaries
  2.2 Face detection
    2.2.1 Categories of techniques
    2.2.2 Cascade classification with Haar-like features
  2.3 Face identification
    2.3.1 Difficulties
    2.3.2 Categories of approaches
    2.3.3 Studied techniques
    2.3.4 Other techniques
  2.4 Face recognition in video
    2.4.1 Multiple observations
    2.4.2 Temporal continuity/Dynamics
    2.4.3 3D model
  2.5 Object tracking
    2.5.1 Object representation
    2.5.2 Image features
    2.5.3 Object detection
    2.5.4 Trackers

3 Face recognition systems and libraries
  3.1 OpenCV
    3.1.1 Installation and usage
  3.2 Wawo SDK
    3.2.1 Installation and usage
  3.3 OpenBR
    3.3.1 Installation and usage

4 System description of standalone framework
  4.1 Detectors
    4.1.1 CascadeDetector
    4.1.2 RotatingCascadeDetector
  4.2 Recognizers
    4.2.1 EigenFaceRecognizer
    4.2.2 FisherFaceRecognizer
    4.2.3 LBPHRecognizer
    4.2.4 WawoRecognizer
    4.2.5 EnsembleRecognizer
  4.3 Normalizers
  4.4 Techniques
    4.4.1 SimpleTechnique
    4.4.2 TrackingTechnique
  4.5 Other modules
    4.5.1 Annotation
    4.5.2 Gallery
    4.5.3 Renderer
  4.6 Command-line interface
    4.6.1 Options

5 Algorithm extensions
  5.1 Face recognition/object tracking integration
    5.1.1 Backwards tracking
  5.2 Rotating cascade detector

6 Performance evaluation
  6.1 Metrics
  6.2 Testing datasets
    6.2.1 NRC-IIT
    6.2.2 News
    6.2.3 NR
  6.3 Experimental setup
    6.3.1 Regular versus tracking recognizers
    6.3.2 Regular detector versus rotating detector
    6.3.3 Algorithm accuracy in cases of multiple variable conditions
  6.4 Evaluation results
    6.4.1 Comparison of algorithm accuracy and speed over gallery size
    6.4.2 Regular detector versus rotating detector
    6.4.3 Evaluation of algorithm accuracy in cases of multiple variable conditions

7 Conclusion
  7.1 Limitations of the evaluation
  7.2 Future work

8 Acknowledgements

References
List of Figures
2.1 Example features relative to the detection window.
2.2 Eigenfaces, i. e., visualizations of single eigenvectors.
2.3 The first four Fisherfaces from a set of 100 classes.
2.4 Binary label sampling points at three different radiuses.
2.5 A given sampling point is labeled 1 if its intensity value exceeds that of the central pixel.
2.6 A number of object shape representations.
2.7 CAMSHIFT in action.
4.1 Conceptual view of a typical application.
4.2 IDetector interface UML diagram.
4.3 IRecognizer interface UML diagram.
4.4 INormalizer interface UML diagram.
4.5 ITechnique interface UML diagram.
4.6 SimpleTechnique class UML diagram.
4.7 TrackingTechnique class UML diagram.
4.8 Gallery class UML diagram.
4.9 Renderer class UML diagram.
5.1 Example illustrating the face recognition/tracking integration.
5.2 Illustrated example of the rotating cascade detector in action.
6.1 The performance of each algorithm as measured by subset accuracy, …
6.2 The real time factor of each algorithm as the gallery size increases.
6.3 The performance, as measured by subset accuracy, …
6.4 The real time factor of each algorithm as the gallery size increases.
List of Tables
2.1 Combining rules for assembly-type techniques.
6.1 NRC-IIT test results.
6.2 News test results.
6.3 NR test results.
Chapter 1
Introduction
Face recognition is finding more and more applications in modern society as time and technology progress. Traditionally, the main application area has been biometrics for security and law enforcement purposes, similar to fingerprints. Lately, it has also been used for crime prevention by identifying suspects in live video feeds[4][41]. With the rise of the world wide web, and Web 2.0 in particular, an application area that is more relevant to the general public has emerged: the automatic annotation of metadata for images and video. By automatically analyzing and attaching metadata to image and video files, end users are given the power to search and sort among them more intelligently and efficiently.
Codemill AB is an Umeå-based software consultancy with around 25 employees and an annual turnover of 12.9 million SEK as of 2011. They developed a face recognition plugin for the media asset management platform of a client, Vidispine AB, of which they retained ownership. Now they want to extract the face recognition functionality into a separate product for sophisticated, automated annotation and searching of video content. The end goal would be to create a product that combines face and voice recognition for the identification of individuals present in a video clip, speech recognition for automatic subtitling, and object recognition for the detection and identification of significant signs, e.g. letters or company logos. A product like this could have broad application areas, ranging from automatically annotating recordings of internal company meetings for easy cataloguing to annotating videos uploaded to the web to increase the power of search engines. A first step towards that goal is the extraction of the existing face recognition functionality of the Vidispine platform into a standalone application that would serve as the basis for the continued development of the future product.
The focus of this thesis lies foremost in the extraction of the Vidispine face recognition
module into a standalone software package. A secondary goal was to attempt to improve the
accuracy and/or performance of the existing face recognition system using existing software
libraries. In particular, possibilities for utilizing object tracking and profile face recognition
were to be explored.
1.1 Report layout
Chapter one gives an introduction to the background of the project, the purpose and the
goals. The specific problem that is being addressed is described and an overview of the
methods employed is given. A short summary of related work that has been investigated
over the course of the project is also presented.
Chapter two provides a quick introduction to the theory behind face recognition systems.
The general problems of face detection and face recognition are described, as well as brief
descriptions of the most common approaches to solving them. This chapter also includes a
brief introduction to object tracking.
Chapter three lists and describes the most common existing face recognition libraries and
systems. Special emphasis is given to OpenCV and Wawo, which are the libraries evaluated
in this report.
Chapter four gives a detailed system description of the face recognition system developed
in the course of this project. In particular, the modular nature of the system is described,
as well as how it can be extended with additional algorithms and techniques in the future.
Chapter five describes an original integration of face recognition algorithms and object
tracking, and discusses its merits and flaws. This chapter also describes an extension to
basic face detection techniques by rotating the input images prior to detection.
Chapter six describes the methods, metrics and test data used in the evaluation of the
different algorithms in the system implementation. The results are presented and discussed.
Chapter seven summarizes the conclusions drawn from the results of the evaluation, discusses problems encountered over the course of the project and gives suggestions for future work.
1.2 Problem statement
The primary task of this project was to extract the Vidispine face recognition plugin module
into a standalone application. Possibilities for improving the accuracy and performance of the system were to be investigated and different options systematically evaluated. In practice,
this would mainly consist of finding existing face recognition libraries and evaluating their
relative accuracy and performance.
The research questions intended to be addressed in this report are:
1. Using currently available face recognition libraries, evaluated on standard test databases and on original test databases suited to the intended application, what is the optimal tradeoff between accuracy and performance for the task of face recognition?
2. Can frontal face recognition and profile face recognition be combined to improve the
total accuracy, and at what performance cost?
3. Can face detection and recognition be combined with object tracking forwards and
backwards in time to improve accuracy, and at what performance cost?
1.3 Goals
The first goal of this project is to extract the Vidispine face detection and recognition
plugin into a standalone application. The system design of the application should be highly
modular, to allow for low-cost replacement of the underlying libraries. The application
should accept a gallery of face images or videos for a set of subjects, as well as a probe
video. The output will be an annotation of the probe video, describing at different points
in time which subjects are present.
The second goal is to conduct an evaluation of the tradeoff between performance and
accuracy of a number of common libraries and algorithms, for different parameter configurations and under different scene and imaging conditions. The third goal is to investigate
the possibility of combining frontal face recognition with profile recognition to improve the
total recognition accuracy and what the relative performance of such a method would be.
The final goal would be to try to combine face detection and recognition with object
tracking forwards and backwards in time to improve accuracy and to possibly cover parts
of the video during which the face is partially or completely occluded.
1.4 Methods
A study of the literature on face detection, recognition and tracking is performed to gain
understanding of the inner workings of the libraries, what the challenges to successful face
detection and recognition are, the significance of the parameters for different algorithms
and how they can be used to improve the accuracy and performance of the system. The
standalone application is written in C++ for several reasons. To start with, the original Vidispine plugin was written in C++, and using the same language makes it possible to reuse some code. In addition, C++ is widely considered to be a good choice for performance-intensive applications while still giving the programmer the tools to create scalable, high-level designs. Finally, since C++ is a very popular programming language, the likelihood of finding compatible face detection, face recognition, object tracking and
image processing libraries is high.
Existing test databases and protocols are investigated in order to produce results that
can be compared with existing literature. To the extent that it is possible, the evaluation is
performed with standard methods, but when necessary, original datasets that resemble the
intended use cases are created and used. The optimal configuration of libraries, algorithms and parameters is implemented as the default of the resulting system, for presentation and live usage purposes.
1.5 Related work
In his master’s thesis, Cahit Gürel presented a face recognition system including subsystems
for image normalization, face detection and face recognition using a feed-forward artificial
neural network[27]. Similarly to the present work, Gürel aimed at creating a complete
integrated software system for face identification instead of simply presenting a single algorithm. Unlike the present work, however, Gürel’s system does not support different choices
of method for each step in the face identification pipeline. Hung-Son Le, in his Ph. D. thesis,
presented a scalable, distributed face database and identification system[35]. The system
provides the entire face identification pipeline and spreads the various phases, such as storage, user interaction, detection and recognition over different processes, allowing different
physical servers to handle different tasks. This system only implements Le’s original algorithms, while the present work interfaces with different underlying libraries implementing a
variety of existing algorithms. These can easily be combined in a multitude of configurations
according to the requirements of the intended application. Acosta et al.[15] presented an integrated face detection and recognition system customized for video indexing applications. The face detection is performed by an original algorithm based on segmenting the input image into color regions and using a number of constraints such as shape, size, overlaps, texture and landmark features to distinguish face from non-face. The recognition stage
consists of a modified Eigenfaces approach based on storing multiple views of each gallery
subject. Acosta’s design and choice of algorithms are tuned to the task of face recognition
in video, but again the system provides only a single alternative.
Chapter 2
Introduction to face recognition and object tracking
Face recognition is a field that deals with the problem of identifying or verifying the identity
of one or more persons in either a static image or a sequence of video frames by making a
comparison with a database of facial images. Research has progressed to the point where
various real-world applications have been developed and are in active use in different settings. The typical use case has traditionally been biometrics in security systems, similar to
fingerprint or iris analysis, but the technology has also been deployed for crime prevention
measures with limited success[4][41]. Recently, face recognition has also been used for web
searching in different contexts[15][36].
The complexity of the problem varies greatly depending on the conditions imposed by
the intended application. In the case of identity verification, the user can be assumed to be
cooperative and to make an identity claim. Thus, the incoming probe image only needs to be
compared to a small subset of the database, as opposed to the case of recognition, where
the probe will be compared to a potentially very large database. On the other hand, an
authentication system will need to operate in near real-time to be acceptable to users, while some recognition systems can operate over much longer time frames.
In general, face recognition can be divided into three main steps, although depending on
the application, not all steps may be required:
1. Detection, the process of detecting a face in a potentially cluttered image.
2. Normalization, which involves transforming, filtering and converting the probe image into whatever format the face database is stored in.
3. Identification, the final step where the normalized probe is compared to the face
database.
2.1 Preliminaries
The following general notation is used in this thesis: x and y represent image coordinates, I an intensity image of dimensions r × c, and I(x, y) the intensity at position (x, y). Γ denotes the rc-dimensional vector acquired by concatenating the rows of an image. i, j and k represent generic sequence indices and l, m, n sequence bounds. ind{P} is the indicator function, which equals 1 if proposition P is true, and otherwise equals 0.
2.2 Face detection
In order to perform face recognition, the face must first be located in the probe image.
The field of face detection deals with this problem. The main task of face detection can
be defined as follows: given an input image, determine if a face is present and if so, its
location and boundaries. Many of the factors that complicate this problem are the same as
for recognition:
– Pose: The orientation of the face relative to the camera may vary.
– Structural components: Hairstyle, facial hair, glasses or other accessories can vary
greatly between individuals.
– Facial expression: A person can wear a multitude of facial expressions like smiling,
frowning, screaming, etc.
– Occlusion: Other objects, including other faces, can partially occlude the face.
– Imaging conditions: Illumination and camera characteristics can vary between images.
The following sections describe the different categories of face detection techniques and
the technique primarily used in this project, cascade classification with Haar-like features.
2.2.1 Categories of techniques
Techniques that deal with detecting faces in single intensity or color images can be roughly
classified into the following four categories[40]:
– Knowledge-based methods: These methods utilize human knowledge of what constitutes a face. Formal rules are defined based on human intuitions of facial properties
which are used to differentiate regions that contain faces from those that do not.
– Feature invariant approaches: Approaches of this type attempt to first extract facial
features that are invariant under differing conditions from an image and then infer the
presence of a face based on those.
– Template matching methods: Standard face pattern templates are manually constructed and stored, either of the entire face or of separate facial features. Correlations between input images and the stored patterns are computed, and detection is based on these correlations.
– Appearance-based methods: These methods differ from template matching methods
in that instead of manually constructing templates, they are learned from a set of
training images in order to capture facial variability.
It should be noted that not all techniques fall neatly into a single category, but rather,
some clearly overlap two or more categories. However, these categories still provide a useful
conceptual structure for thinking about face detection methods.
2.2.2 Cascade classification with Haar-like features
A very popular method of face detection, and object detection in general, is the cascade
classifier with Haar-like features introduced by Viola and Jones in 2001[61]. The concept is to
characterize a subwindow in an image with a sequence of simple classifiers, each consisting
of one or more features, described below. Each level in the cascade is constructed by
selecting the most distinguishing features out of all possible features using the AdaBoost
algorithm. Each individual classifier in the cascade performs relatively poorly, but in concert
the cascade achieves very good detection rates. Numerous features of this algorithm make
it very efficient, such as immediately discarding subwindows that are rejected by a classifier
early in the sequence, as well as computing the value of a simple classifier on a specialized
image representation in constant time.
Haar-like features
The features used by the method are illustrated in figure 2.1. They are called ”Haar-like”
because they are reminiscent of Haar basis functions which have been used previously[14].
The value of each feature is the sum of the pixel intensities in the dark rectangles subtracted
from the sum of the intensities in the white rectangles. The features can be of any size within
a detection window of fixed dimensions; the original paper used 24x24 pixels. In this case, the total number of features is approximately 180,000. A classifier based on a single feature is defined as

h_j(W) = ind{p_j f_j(W) < p_j θ_j}

where f_j is the feature, W is the detection window, θ_j the threshold and p_j the parity indicating the direction of the inequality sign. The false negative and false positive rates of the classifier can be modulated by varying the threshold, which will become important later.
Integral image
The features described above can be computed in constant time using a specialized image representation called an integral image. The value of the integral image at location (x, y) is simply the sum of the pixels above and to the left of the location, or

II(x, y) = Σ_{x′ ≤ x, y′ ≤ y} I(x′, y′)

Using this image representation, the sum of pixels in an arbitrary rectangle can be computed with only four array references. Due to the fact that the rectangles in the features are adjacent, a feature with two rectangles can be computed with six array references, a feature with three rectangles with eight references and a feature with four rectangles with only nine references. The integral image can be computed in a single pass using the recurrence relations

s(x, y) = s(x, y − 1) + I(x, y)
II(x, y) = II(x − 1, y) + s(x, y)

where s(x, y) is the cumulative column sum, s(x, −1) = 0 and II(−1, y) = 0.
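As an illustration of these recurrences, the following minimal C++ sketch (not taken from the thesis framework; the toy image and the example feature are arbitrary) builds an integral image and uses it to evaluate rectangle sums and a simple two-rectangle feature.

#include <iostream>
#include <vector>

typedef std::vector<std::vector<long> > Table;

// Build the integral image II from intensity image I using the single-pass
// recurrences s(x, y) = s(x, y-1) + I(x, y) and II(x, y) = II(x-1, y) + s(x, y).
Table integralImage(const std::vector<std::vector<int> >& I) {
    int rows = (int)I.size(), cols = (int)I[0].size();
    Table II(rows, std::vector<long>(cols, 0));
    for (int x = 0; x < cols; ++x) {
        long s = 0;                               // cumulative column sum s(x, y)
        for (int y = 0; y < rows; ++y) {
            s += I[y][x];
            II[y][x] = (x > 0 ? II[y][x - 1] : 0) + s;
        }
    }
    return II;
}

// Sum of the rectangle with top-left (x0, y0) and bottom-right (x1, y1),
// inclusive, using four references into the integral image.
long rectSum(const Table& II, int x0, int y0, int x1, int y1) {
    long A = (x0 > 0 && y0 > 0) ? II[y0 - 1][x0 - 1] : 0;
    long B = (y0 > 0) ? II[y0 - 1][x1] : 0;
    long C = (x0 > 0) ? II[y1][x0 - 1] : 0;
    long D = II[y1][x1];
    return D - B - C + A;
}

int main() {
    // Toy 4x4 image of all ones; a two-rectangle feature is the difference
    // between the sums of the left and right halves of a window.
    std::vector<std::vector<int> > I(4, std::vector<int>(4, 1));
    Table II = integralImage(I);
    long feature = rectSum(II, 0, 0, 1, 3) - rectSum(II, 2, 0, 3, 3);
    std::cout << "two-rectangle feature value: " << feature << std::endl;
    return 0;
}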
Figure 2.1: Example features relative to the detection window.
AdaBoost learning
Given a set of positive and negative training samples of the same size as the detection
window, in this case 24x24 pixels, we want to select a subset of the total 180,000 features
that best distinguishes between them. We do this using the generic machine learning meta-algorithm AdaBoost[19], which is also used in conjunction with many other algorithms to improve their performance. The general idea of the algorithm is to build classifiers that are tweaked in favor of samples misclassified by previous classifiers. This is done by assigning a weight to each sample, initially equal for all samples, and in each round selecting the feature that minimizes the weighted sum of prediction errors. The weights are then adjusted so that the samples that were misclassified by the selected classifier receive a greater weight, and in subsequent rounds classifiers that are able to correctly classify those samples become more likely to be selected. The resulting set of features is then integrated into a composite classifier:
1. Given a set of sample images (I_1, b_1), . . . , (I_n, b_n), where b_i = 0, 1 for negative and positive samples respectively.

2. Initialize the weights w_{1,i} = 1/(2m) for b_i = 0 and w_{1,i} = 1/(2l) for b_i = 1, where m and l are the number of negative and positive samples respectively.

3. For rounds t = 1, . . . , T:

(a) Normalize the weights, w_{t,i} = w_{t,i} / Σ_{j=1}^{n} w_{t,j}, so that w_t is a probability distribution.

(b) For each feature j, train a classifier h_j. The error is evaluated with respect to w_t: ε_j = Σ_i w_{t,i} |h_j(I_i) − b_i|.

(c) Choose the classifier h_t with the lowest error ε_t.

(d) Update the weights: w_{t+1,i} = w_{t,i} β_t^{1−e_i}, where e_i = 0 if sample I_i is classified correctly, e_i = 1 otherwise, and β_t = ε_t / (1 − ε_t).

4. The final composite classifier is

h(I) = ind{ Σ_{t=1}^{T} α_t h_t(I) ≥ (1/2) Σ_{t=1}^{T} α_t }

where α_t = log(1/β_t).
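To make the weight-update mechanics concrete, here is a small, self-contained C++ sketch of the boosting loop over brute-force decision stumps on precomputed scalar feature values. It is illustrative only: the toy data, the stump search strategy and the epsilon guard are assumptions made for the sketch, not the Viola-Jones training code.

#include <iostream>
#include <vector>
#include <cmath>
#include <cstdlib>

struct Stump { int feature; double theta; int polarity; double alpha; };

// h_j(I) = ind{ p_j f_j(I) < p_j theta_j }, evaluated on precomputed feature values.
int classify(const Stump& s, const std::vector<std::vector<double> >& vals, int i) {
    return (s.polarity * vals[s.feature][i] < s.polarity * s.theta) ? 1 : 0;
}

int main() {
    // Toy data: vals[j][i] is the value of feature j on sample i, b[i] its label.
    double v0[] = {0.9, 0.3, 0.8, 0.1};
    double v1[] = {0.3, 0.7, 0.6, 0.2};
    int b[] = {1, 1, 0, 0};
    std::vector<std::vector<double> > vals(2, std::vector<double>(4));
    for (int i = 0; i < 4; ++i) { vals[0][i] = v0[i]; vals[1][i] = v1[i]; }

    std::vector<double> w(4, 0.25);                    // step 2: initial weights
    std::vector<Stump> strong;

    for (int t = 0; t < 3; ++t) {                      // step 3: T boosting rounds
        double total = 0.0;
        for (int i = 0; i < 4; ++i) total += w[i];
        for (int i = 0; i < 4; ++i) w[i] /= total;     // (a) normalize the weights

        Stump best = {0, 0.0, 1, 0.0};
        double bestErr = 1e9;
        for (int j = 0; j < 2; ++j)                    // (b) best stump per feature
            for (int k = 0; k < 4; ++k)
                for (int p = -1; p <= 1; p += 2) {
                    Stump s = {j, vals[j][k], p, 0.0};
                    double err = 0.0;
                    for (int i = 0; i < 4; ++i)
                        err += w[i] * std::abs(classify(s, vals, i) - b[i]);
                    if (err < bestErr) { bestErr = err; best = s; }
                }

        if (bestErr < 1e-10) bestErr = 1e-10;          // guard against a perfect stump
        double beta = bestErr / (1.0 - bestErr);       // (d) weight update factor
        best.alpha = std::log(1.0 / beta);
        for (int i = 0; i < 4; ++i)
            if (classify(best, vals, i) == b[i]) w[i] *= beta;  // correct samples lose weight
        strong.push_back(best);                        // (c) keep the selected classifier
    }
    std::cout << "trained " << strong.size() << " weak classifiers" << std::endl;
    return 0;
}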
Training the cascade
As was previously mentioned, the cascade consists of a sequence of classifiers. Each classifier
is applied in turn to the detection window, and if any one rejects it, the detection window
is immediately discarded. This is desirable because the large majority of detection windows
will not contain a face and a large amount of computation time can be saved by discarding
true negatives early. For this reason, it is important for each individual stage to have a
very low false negative rate, as this rate will be compounded as the window is passed down
the cascade. For example, in a 32-stage cascade, each stage will need a detection rate of
99,7% to achieve a total detection rate of 90%. However, the reverse applies to the false
positive rate, which means that each stage can have a fairly high false positive rate and
still achieve a low compounded rate. As previously stated, these rates can be modulated
by modifying the threshold parameter, and improved by adding additional features (i.e.
running more AdaBoost rounds). However, the total performance of the cascade classifier
is highly dependent on the number of features, so in order to maintain efficiency we would
like to keep this number low.
Thus, we select a desired final false positive rate and a required false positive rate γ per stage, and run the AdaBoost method for the number of rounds required to achieve a false negative rate close to 0% and a false positive rate of γ once the threshold θ has been adjusted. The rates are
determined by testing the classifier on a validation set. For the first stage, the entire sample
set is used and a very low number of features is likely to be needed. The samples used for
the next stage are those that the first stage classifier misclassified, which are likely to be
”harder” and thus require more features to achieve the desired rates. This is acceptable
because the large majority of detection windows will be discarded by the earliest stages,
which are also the fastest. We keep adding stages until the final desired detection/false
positive rate has been achieved.
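The early-rejection behavior of the resulting cascade can be sketched as follows. The Weak/Stage structures and the toy values are hypothetical simplifications of what the training procedure above would actually produce.

#include <iostream>
#include <vector>

struct Weak { int feature; double theta; int polarity; double alpha; };
struct Stage { std::vector<Weak> weak; };

// A stage accepts the window when the weighted vote of its weak classifiers
// reaches half of the total weight; in practice this factor is lowered so that
// the per-stage false negative rate is pushed towards zero.
bool stageAccepts(const Stage& s, const std::vector<double>& featureVals) {
    double vote = 0.0, total = 0.0;
    for (size_t k = 0; k < s.weak.size(); ++k) {
        const Weak& w = s.weak[k];
        total += w.alpha;
        if (w.polarity * featureVals[w.feature] < w.polarity * w.theta)
            vote += w.alpha;
    }
    return vote >= 0.5 * total;
}

bool cascadeAccepts(const std::vector<Stage>& cascade, const std::vector<double>& featureVals) {
    for (size_t s = 0; s < cascade.size(); ++s)
        if (!stageAccepts(cascade[s], featureVals))
            return false;            // early rejection: later, larger stages are skipped
    return true;                     // survived every stage: report a detection
}

int main() {
    Weak w = {0, 0.5, 1, 1.0};       // single toy weak classifier: feature 0 below 0.5
    Stage s; s.weak.push_back(w);
    std::vector<Stage> cascade(1, s);
    std::vector<double> window(1, 0.3);   // feature values for one candidate window
    std::cout << (cascadeAccepts(cascade, window) ? "face" : "non-face") << std::endl;
    return 0;
}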
Since computing the value of a feature can be done in constant time regardless of the size,
the resulting classifier has the interesting property of being scalable to any size. When we
apply the detector in practice, we can scan the input image by placing the detection window
at different locations and scale it to different sizes. Thus, we can easily trade performance
for accuracy by doing a more or less coarse scan.
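In practice this multi-scale scan does not have to be written by hand. For example, the cascade detector in OpenCV (the 2.4-era API is sketched below; the cascade and image file names are placeholders) exposes the scan granularity through its scale factor, neighbor count and minimum-size parameters:

#include <opencv2/objdetect/objdetect.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>
#include <vector>

int main() {
    // Load a pre-trained frontal face cascade shipped with OpenCV
    // (the file name below is a placeholder path).
    cv::CascadeClassifier cascade;
    if (!cascade.load("haarcascade_frontalface_default.xml")) return 1;

    cv::Mat image = cv::imread("probe.jpg");   // placeholder input image
    if (image.empty()) return 1;
    cv::Mat gray;
    cv::cvtColor(image, gray, CV_BGR2GRAY);
    cv::equalizeHist(gray, gray);

    // The scale factor (1.1), minimum neighbor count (3) and minimum window
    // size control how coarse the multi-scale scan is, trading speed for accuracy.
    std::vector<cv::Rect> faces;
    cascade.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(30, 30));

    for (size_t i = 0; i < faces.size(); ++i)
        std::cout << "face at (" << faces[i].x << ", " << faces[i].y << "), size "
                  << faces[i].width << "x" << faces[i].height << std::endl;
    return 0;
}

Increasing the scale factor or the minimum window size makes the scan coarser and faster, at the cost of possibly missing faces.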
2.3 Face identification
In this section, the process of determining the identity of a detected face in an image, or
verifying an identity claim, is introduced. First, the main obstacles to successful identification are discussed and the various categories of approaches are described. After that, a
detailed technical description of the techniques used in this project is given. Finally, brief
descriptions of other techniques are listed.
2.3.1 Difficulties
There is a variety of factors that can make the problem of facial recognition or verification
more difficult. The illumination of the probe image commonly varies greatly and this can
cripple the performance of certain techniques. For some use cases the user can be assumed
to look directly at the camera but in many others the view angle could be different, and
also vary. The performance of some techniques are dependent on the pose of the face being
in a certain angle, and are more or less sensitive to deviations from the preferred angle.
For some scenarios the subject cannot be relied on to have a neutral facial expression, and
some techniques are very sensitive to this complication. It might also be of interest to allow
for variation in the style of the face, such as facial hair, hairstyle, sunglasses or articles
of clothing. Any combination of these factors might potentially need to be dealt with as
well. Many solutions to these issues have been proposed and some techniques are markedly
better at dealing with some types of variation. In general, it seems that the performance of
face recognition systems decreases significantly whenever multiple sources of variation are
combined in a single probe. When conditions are ideal, however, current techniques work
very well.
2.3.2 Categories of approaches
Techniques for face recognition can be classified in a multitude of ways. Some of the most
common categorizations are briefly described below[6].
Fully versus partially automatic
A system that performs all three steps listed earlier, detection, normalization and identification, is referred to as fully automatic. It is given only a facial image and performs the
recognition process unaided. A system that assumes the detection and normalization steps
have already been performed is referred to as partially automatic. Commonly, it is given a
facial image and the coordinates of the center of the eyes.
Static versus video versus 3D
Methods can be subdivided by the type of input data they utilize. The most basic form
of recognition is performed on a single static image. It is the most widespread approach
both in literature and in real-world applications. Recognition can also be applied to video
sequences, which give the extra advantage of multiple perspectives and possibly imaging
conditions, as well as temporal continuity. Some scanners, such as infrared cameras, can
even provide 3D geometric data which some techniques make use of.
Frontal versus profile versus view-tolerant
Some techniques are designed to handle only frontal images. This is the classical approach
and the alternatives are more recent developments. View-tolerant techniques allow for a
variety of poses and are often more sophisticated, taking the underlying geometry, physics
and statistics into consideration. Techniques that handle profile images are rarely used for
stand-alone applications, but can be useful for coarse pre-searches to reduce the computational load of a more sophisticated technique, or in combination with another technique to
improve recognition precision.
Global versus component-based approach
A global approach is one in which a single feature vector is computed based on the entire face
and fed as input to a classifier. These tend to be very good at classifying frontal images.
However, they are not robust to pose changes since global features tend to be sensitive
to translation and rotation of the face. This weakness can be addressed by aligning the
face prior to classification. The alternative to the global approach is to classify local facial
components independently of each other, thus allowing a flexible geometrical relation
between them. This makes component-based techniques naturally more robust to pose
changes.
Invariant features versus canonical forms versus variation-modeling
As has been previously stated, variation in appearance depending on illumination, pose,
facial hair, etc., is the central issue to performing face recognition. Approaches to dealing
with it can be divided into three main categories. The first focuses on utilizing features that
are invariant to the changes being studied. The second seeks to either normalize away the variation using clever image processing or to synthesize a canonical or prototypical version of the probe image and perform classification on that. The third attempts to create a parameterized model of the variation and estimate the parameters for a given probe.
2.3.3 Studied techniques
This section gives an overview of the major face recognition techniques that have been
evaluated in this report, and describes the advantages and disadvantages of each.
Eigenfaces
Eigenfaces are one of the earliest successful and most thoroughly investigated approaches to
face recognition[59]. Also known as Karhunen-Loève expansion or eigenpictures, it makes
use of principal component analysis (PCA) to efficiently represent pictures of faces. A set
of eigenfaces are generated by performing PCA on a large set of images representing human
Figure 2.2: Eigenfaces, i. e., visualizations of single eigenvectors.
faces. Informally, the eigenfaces can be considered a set of ”standardized face ingredients”
derived by statistical analysis of a set of real faces. For example, a real face could be
represented by the average face plus 7% of eigenface 1, 53% of eigenface 2 and -3% of
eigenface 3. Interestingly, only a few eigenfaces combined are required to arrive at a fair
approximation of a real human face. Since an individual face is represented only by a vector
of weights, one for each eigenface, this representation is highly space-efficient. Empirical
results show that eigenfaces are robust to variations in illumination, less so to variations in
orientation and even less to variations in size[33], but despite this illumination normalization
is usually required in practice[6].
Mathematically, we wish to find the principal components of the distribution of faces,
represented by the covariance matrix of the face images. These eigenvectors can be thought
of as the primary distinguishing features of the image. Each pixel element contributes to a
lesser or greater extent to each eigenvector, and this allows us to visualize each eigenvector
as a ghostly image, which we call eigenfaces (see figure 2.2). Each image in the gallery can
be represented exactly in terms of a linear combination of all eigenfaces, but can also be
approximated by combining only a subset of the eigenvectors. The ”best” approximation is
achieved by using the eigenvectors with the largest eigenvalues, as they account for most of
the variance in the gallery set. This feature can be used to improve computational efficiency
without necessarily losing much precision. The best M′ eigenvectors span an M′-dimensional subspace of all possible images, a ”face space”[59].
Algorithm
The algorithm can be summarized as follows:
1. Acquire the gallery set and compute its eigenfaces, which define the face space.
2. When given a probe image, project it onto each of the eigenfaces in order to compute
a set of weights to represent it in terms of those eigenfaces.
3. Determine if the image contains a known face by checking if it is sufficiently close to
some gallery face class, or unknown if the distance exceeds some threshold.
Let the gallery set of face images be Γ_1, Γ_2, Γ_3, . . . , Γ_n. The average face of the set is defined by Ψ = (1/n) Σ_{i=1}^{n} Γ_i. Each face differs from the average by the vector Φ_i = Γ_i − Ψ. This set of vectors is then subjected to principal component analysis, which seeks a set of M orthonormal vectors u_j and their associated eigenvalues λ_j. The vectors u_j and scalars λ_j are the eigenvectors and eigenvalues, respectively, of the covariance matrix

C = (1/n) Σ_{k=1}^{n} Φ_k Φ_k^T = A A^T

where A = [Φ_1 Φ_2 . . . Φ_n]. The matrix C is rc × rc, and computing its eigenvectors and eigenvalues directly is intractable for typical image sizes. However, this can be worked around by solving a smaller n × n matrix problem and taking linear combinations of the resulting vectors (see [58] for details). An arbitrary number M′ of eigenvectors with the largest associated eigenvalues are selected. The probe image Γ is transformed into its eigenface components by the simple operation ω_k = u_k^T (Γ − Ψ) for k = 1, . . . , M′. The weights form a vector Ω = [ω_1 ω_2 . . . ω_{M′}]^T that describes the contribution of each eigenvector in representing the input image. This vector is then used to determine which face class best describes the probe. The simplest method is to select the class k that minimizes the Euclidean distance ε_k = ||Ω − Ω_k||, where Ω_k is a vector describing the kth face class, provided it falls below some threshold θ_ε. Otherwise the face is classified as ”unknown”.
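As a rough illustration of the projection and nearest-neighbor steps (not the framework's EigenFaceRecognizer), the sketch below uses OpenCV's general-purpose cv::PCA class with the 2.4-era headers; the file names, the number of retained eigenvectors and the one-image-per-subject gallery are assumptions.

#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Gallery: one grayscale face image per subject (placeholder file names).
    // All images are assumed to exist and to have identical dimensions.
    std::vector<std::string> files;
    files.push_back("subject1.png");
    files.push_back("subject2.png");

    // Stack each image as one row of the data matrix (concatenating its rows).
    cv::Mat data;
    for (size_t i = 0; i < files.size(); ++i) {
        cv::Mat row = cv::imread(files[i], 0).reshape(1, 1);
        row.convertTo(row, CV_32F);
        data.push_back(row);
    }

    // PCA computes the mean face Psi and the M' leading eigenfaces.
    int M = 2;   // number of retained eigenvectors (toy value)
    cv::PCA pca(data, cv::Mat(), cv::PCA::DATA_AS_ROW, M);

    // Project the gallery into the face space (each row becomes a weight vector Omega).
    cv::Mat gallery = pca.project(data);

    // Project a probe and pick the gallery class with the smallest Euclidean distance.
    cv::Mat probe = cv::imread("probe.png", 0).reshape(1, 1);
    probe.convertTo(probe, CV_32F);
    cv::Mat omega = pca.project(probe);

    int best = -1;
    double bestDist = 1e30;
    for (int j = 0; j < gallery.rows; ++j) {
        double d = cv::norm(omega, gallery.row(j), cv::NORM_L2);
        if (d < bestDist) { bestDist = d; best = j; }
    }
    std::cout << "closest gallery subject: " << best
              << " (distance " << bestDist << ")" << std::endl;
    return 0;
}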
Fisherfaces
The Eigenfaces method projects face images to a low-dimensional subspace with axes that
capture the greatest variance of the input data. This is desirable, but not necessarily
optimal for classification purposes. For example, the difference in facial features between
two individuals is a type of variance that one would like to capture, but the difference in
illumination between two images of the same individual is not. A different but related
approach is to project the input image to a subspace which minimizes within-class variation
but maximizes inter-class variation. This can be achieved by applying linear discriminant
analysis (LDA), a technique that traces back to work by R. A. Fisher[17]. The resulting
method is thus called Fisherfaces[44].
Given C classes, assume that the data in each class are of homoscedastic normal distributions (i.e., each class is normally distributed, with covariance matrices equal to each other). We denote this Γ_i ∼ N(μ_i, Σ) for a sample of class i. We want to find a subspace of the face space which minimizes the within-class variation and maximizes the between-class variation. Within-class differences can be estimated by the within-class scatter matrix, which is given by

S_w = Σ_{j=1}^{C} Σ_{i=1}^{n_j} (Γ_ij − μ_j)(Γ_ij − μ_j)^T

where Γ_ij is the ith sample of class j, μ_j is the mean of class j, and n_j is the number of samples in class j. Likewise, the between-class differences are computed using the between-class scatter matrix,
Figure 2.3: The first four Fisherfaces from a set of 100 classes.
S_b = Σ_{j=1}^{C} (μ_j − μ)(μ_j − μ)^T

where μ is the mean of all classes. We now want to find the matrix V for which |V^T S_b V| / |V^T S_w V| is maximized. The columns v_i of V correspond to the basis vectors of the desired subspace. They can be found through the generalized eigenvalue decomposition S_b V = S_w V Λ, where Λ is the diagonal matrix of the corresponding eigenvalues of V. The eigenvectors of V associated with non-zero eigenvalues are the Fisherfaces.[16][44]
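For reference, Eigenfaces, Fisherfaces and the local binary pattern histogram method described in the next subsection are all available behind a common interface in OpenCV. The sketch below uses the OpenCV 2.4-era contrib API (in later versions the same classes live in the opencv_contrib face module); the image file names and labels are placeholders, and all gallery images are assumed to have identical dimensions.

#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/contrib/contrib.hpp>   // FaceRecognizer in OpenCV 2.4
#include <iostream>
#include <vector>

int main() {
    // Gallery: grayscale face images with an integer label per subject.
    std::vector<cv::Mat> images;
    std::vector<int> labels;
    images.push_back(cv::imread("subject1_a.png", 0)); labels.push_back(0);
    images.push_back(cv::imread("subject1_b.png", 0)); labels.push_back(0);
    images.push_back(cv::imread("subject2_a.png", 0)); labels.push_back(1);
    images.push_back(cv::imread("subject2_b.png", 0)); labels.push_back(1);

    // Fisherfaces; createEigenFaceRecognizer() or createLBPHFaceRecognizer()
    // can be swapped in behind the same FaceRecognizer interface.
    cv::Ptr<cv::FaceRecognizer> model = cv::createFisherFaceRecognizer();
    model->train(images, labels);

    // Classify a normalized probe face of the same size as the gallery images.
    cv::Mat probe = cv::imread("probe.png", 0);
    int predictedLabel = -1;
    double confidence = 0.0;
    model->predict(probe, predictedLabel, confidence);
    std::cout << "predicted subject " << predictedLabel
              << " (distance " << confidence << ")" << std::endl;
    return 0;
}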
Local binary pattern histograms
The LBP histograms approach builds on the idea that a face can be viewed as a composition
of local subpatterns that are invariant to monotonic grayscale transformations[62]. By
identifying and combining these patterns, a description of the face image which includes
both texture and shape information is obtained. The LBP operator labels each pixel in an
image with a binary string of length P by selecting P sampling points evenly distributed
around the pixel at a specific radius r. If the intensity at a sampling point exceeds that of the central pixel, the corresponding bit in the binary string is 1, and otherwise 0. If the sampling
point is not in the center of a pixel, bilinear interpolation is used to acquire the intensity
value of the sampling point.
Let f_l be the labeled image. We can define the histogram of the labeled image as

H_i = Σ_{x,y} ind{f_l(x, y) = i},  i = 0, 1, . . . , n − 1
where n is the number of different labels produced by the LBP operator. This histogram captures the texture information of the subpatterns of the image. We can capture spatial information as well by subdividing the image into regions R_0, R_1, . . . , R_{m−1}. The spatially enhanced histogram becomes

H_{i,j} = Σ_{x,y} ind{f_l(x, y) = i} ind{(x, y) ∈ R_j},  i = 0, 1, . . . , n − 1,  j = 0, 1, . . . , m − 1.
We can classify a probe image by comparing the corresponding histograms of the probe
and the gallery set using some dissimilarity measure. Several options exist, including
– Histogram intersection: D(S, M) = Σ_i min(S_i, M_i).

– Log-likelihood statistic: L(S, M) = −Σ_i S_i log(M_i).

– Chi-square statistic: χ²(S, M) = Σ_i (S_i − M_i)² / (S_i + M_i).
Each of these can be extended to the spatially enhanced histogram by simply summing
over both i and j.[8]
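A minimal sketch of the basic operator (radius 1, eight sampling points that fall on pixel centers, so no bilinear interpolation) and its global 256-bin histogram is given below. It is illustrative only; a full implementation would also divide the image into regions and concatenate the per-region histograms as described above, and the input file name is a placeholder.

#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <vector>

// Basic 8-neighbor LBP at radius 1 plus the histogram H_i over all labels.
std::vector<int> lbpHistogram(const cv::Mat& gray) {
    std::vector<int> hist(256, 0);
    const int dy[8] = {-1, -1, -1, 0, 1, 1, 1, 0};
    const int dx[8] = {-1, 0, 1, 1, 1, 0, -1, -1};
    for (int y = 1; y < gray.rows - 1; ++y) {
        for (int x = 1; x < gray.cols - 1; ++x) {
            uchar center = gray.at<uchar>(y, x);
            unsigned label = 0;
            for (int p = 0; p < 8; ++p)
                if (gray.at<uchar>(y + dy[p], x + dx[p]) >= center)
                    label |= (1u << p);   // bit set when the neighbor is at least
                                          // as bright as the central pixel
            ++hist[label];
        }
    }
    return hist;
}

int main() {
    cv::Mat gray = cv::imread("face.png", 0);   // placeholder file, loaded as grayscale
    std::vector<int> hist = lbpHistogram(gray);
    return (int)hist.size() == 256 ? 0 : 1;
}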
Figure 2.4: Binary label sampling points at three different radiuses.
Figure 2.5: A given sampling point is labeled 1 if its intensity value exceeds that of the
central pixel.
Hidden Markov models
Hidden Markov models (HMMs) can be applied to the task of face recognition by treating
different regions of the human face (eyes, nose, mouth, etc) as hidden states. HMMs require
one-dimensional observation sequences, and thus the two-dimensional facial images need to
be converted into either 1D temporal or spatial sequences. This way, an HMM is created
for each subject in the database, the probe image is fed as an observation sequence to each
and the match with the highest likelihood is considered best.
Wawo
The core face recognition algorithm of the Wawo system is based on an extended
HMM scheme called Joint Multiple Hidden Markov Models (JM-HMM)[35]. The primary
objective of the algorithm is capturing the 2D nature of face images while only requiring a
single gallery image per subject to achieve good performance. The input image is treated
as a set of horizontal and vertical strips. Each strip consists of small rectangular blocks of
pixels and each strip is managed by an individual HMM. When an HMM subsystem of a
probe is to be compared to the corresponding one in a gallery image, the block strips of each
image are first matched according to some similarity measure, and the observation sequence
is formed by the indices of the best-matching blocks.
2.3.4 Other techniques
These are approaches from the literature that have not been evaluated in this report.
Neural networks
A variety of techniques based on artificial neural networks have been developed. The reason for the popularity of artificial neural networks may be their non-linearity, which allows for more effective feature extraction than eigenface-based methods. The structure of the network is essential to the success of the system, and which structure is suitable depends on the application. For example, multilayer perceptrons and convolutional neural networks
have been applied to face detection, and a multi-resolution pyramid structure[52][49][32]
to face verification. Some techniques combine multiple structures to increase precision and
counteract certain types of variation[49]. A probabilistic decision-based neural network
(PDBNN) has been shown to function effectively as a face detector, eye localizer and face
recognizer[48]. In general, neural network approaches suffer from computational complexity
issues as the number of individuals increases. They are also unsuitable for single model
image cases due to the fact that they tend to require multiple model images to train to
optimal parameter settings.
Dynamic link architecture
Dynamic link architectures are an extension of traditional artificial neural networks[6]. Memorized objects are represented by sparse graphs, whose vertices are labeled with a multiresolution description in terms of a local power spectrum, and whose edges are geometrical
distance vectors. Distortion invariant object recognition can be achieved by employing elastic graph matching to find the closest stored graph. The method tends to be superior
to other methods in terms of coping with rotation variation, but the matching process is
comparatively expensive.
Geometrical feature matching
This technique is based on the computation of a set of geometrical features from the picture
of a face. The overall configuration can be represented by a vector containing the position
and size of a set of main facial features, such as eyes, eyebrows, mouth, face outline, etc.
It has been shown to be successful for large face databases such as mug shot albums[30].
However, it is dependent on the accuracy of automated feature location algorithms, which
generally do not achieve a high degree of accuracy and require considerable computational
time.
3D model
The 3D face model is based on a vector representation of a face that is constructed such that any convex combination of shape and texture vectors describes a realistic
human face. Fitting the 3D model to, or extracting it from, images can be used in two ways
for recognition across different viewing conditions:
– After fitting the model, the comparison can be based on model coefficients that
represent intrinsic features of shape and texture that are independent of imaging
conditions[25].
– 3D face reconstruction can be employed to generate synthetic views from different angles. The views are then transferred to a second view-dependent recognition
system[63].
3D morphable models have been combined with computer graphics simulations of illumination and projection[10]. Among other things, this approach allows for modeling more
sophisticated lighting conditions such as specular lighting and cast shadows (most techniques only consider Lambertian illumination). Scene parameters in probe images, such
as head position and orientation, camera focal length and illumination direction, can be
automatically estimated.
Line edge map
Edge information is useful for face recognition because it is partially insensitive to illumination variation. It has been argued that face recognition in the human brain might
make extensive use of early-stage edge detection without involving higher-level cognitive
functions[53]. The Line Edge Map (LEM) approach extracts lines from a face edge map
as features. This gives it the robustness to illumination variation that is characteristic of
feature-based approaches while simultaneously retaining low memory requirements and high
recognition performance. In addition, LEM is highly robust to size variation. It has been
shown to be less sensitive to pose changes than the eigenface method, but more sensitive to
changes in facial expression[24].
Support vector machines
Support vector machines (SVMs) are considered an effective method for general-purpose pattern recognition due to their high generalization performance without the need to add other knowledge[60]. Intuitively, given a set of points belonging to two classes, an SVM
finds the hyperplane that separates the largest possible set of points of the same class on
the same side while maximizing the distance from either class to the hyperplane. A large
variety of SVM-based approaches have been developed with regards to a number of different
application areas[38][22][45][9][34][42]. The main features of SVM-based approaches are that
they are able to extract relevant discriminatory information automatically, and are robust to
illumination changes. However, they can become overtrained on data sanitized by feature
extraction and/or normalization and they involve a large number of parameters so the
optimization space can become difficult to explore completely.
Multiple classifier systems
Traditionally, the approach used in the design of pattern recognition systems has been to
experimentally compare the performance of several classifiers in order to select the best one.
Recently, the alternative approach of combining the output of several classifiers has emerged,
under various names such as multiple classifier systems (MCSs), committee or ensemble
classifiers, with the purpose of improved accuracy. A limited number of approaches of this
kind have been developed with good results for established face databases[54][55][28][47].
2.4 Face recognition in video
Since a video clip consists of a sequence of frame images, face recognition algorithms that
apply to single still images can be applied to video virtually unchanged. However, a video
sequence possesses a number of additional properties that can potentially be utilized to
design face recognition techniques with improved accuracy and/or performance over single
still image techniques. Three properties of major importance are:
– Multiple observations: A video sequence by its very nature will yield multiple observations of any probe or gallery. Additional observations mean additional constraints
and potentially increased accuracy.
– Temporal continuity/Dynamics: Successive frames in a video sequence are continuous in the temporal dimension. Geometric continuity related to changes in facial
expression or head/camera movement, or photometric continuity related to changes
in illumination provide additional constraints. Furthermore, changes in head movement or facial expression obey certain dynamics that can be modeled for additional
constraints.
– 3D model: We can attempt to reconstruct a 3D model of the face using a video
sequence. This can be achieved either by treating the video as a set of multiple observations or by making use of temporal continuity and dynamics. Recognition can
then be based on the 3D model, which, as previously described, has the potential to be
invariant to pose and illumination.
Below, these properties, and how they can be exploited to design better face recognition
techniques, will be discussed in detail. We will also study some existing techniques that
make use of these properties.
2.4.1 Multiple observations
This is the most commonly used feature of video sequences. Techniques exploiting this
property treat the video sequence as a set of related still images but ignore the temporal
dimension. The discussion below assumes that images are normalized before being subjected
to further analysis.
Assembly-type algorithms
A simple approach to dealing with multiple observations is to apply a single still image technique to each individual frame of a video sequence and combine the results by some rule. In many cases the combining rule is very simple, and some common examples are
given in table 2.1.
Let {Fi ; i = 1, 2, . . . , n} denote the sequence of probe video frames. Let {Ij ; j =
1, 2, . . . , m} denote the set of gallery images. Let d(Fi , Ij ) denote the distance function
between the ith frame of a video sequence and the jth gallery image of some single still
image technique. Let Ai (Fi ) denote the gallery image selected by the algorithm applied to
the ith frame of the probe video.
Table 2.1: Combining rules for assembly-type techniques.

Minimum arithmetic mean:  ĵ = argmin_{j=1,2,...,m} (1/n) Σ_{i=1}^{n} d(F_i, I_j)
Minimum geometric mean:   ĵ = argmin_{j=1,2,...,m} (Π_{i=1}^{n} d(F_i, I_j))^{1/n}
Minimum median:           ĵ = argmin_{j=1,2,...,m} [med_{i=1,...,n} d(F_i, I_j)]
Minimum minimum:          ĵ = argmin_{j=1,2,...,m} [min_{i=1,...,n} d(F_i, I_j)]
Majority voting:          ĵ = argmax_{j=1,2,...,m} Σ_{i=1}^{n} ind{A_i(F_i) = j}
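Two of these rules are sketched below for a precomputed distance matrix d[i][j] between probe frame i and gallery image j; the toy distances are arbitrary and the code is illustrative rather than part of the evaluated framework.

#include <iostream>
#include <vector>
#include <limits>

// Minimum arithmetic mean: pick the gallery image with the lowest mean distance.
int minimumArithmeticMean(const std::vector<std::vector<double> >& d) {
    int best = -1;
    double bestMean = std::numeric_limits<double>::max();
    for (size_t j = 0; j < d[0].size(); ++j) {
        double sum = 0.0;
        for (size_t i = 0; i < d.size(); ++i) sum += d[i][j];
        double mean = sum / d.size();
        if (mean < bestMean) { bestMean = mean; best = (int)j; }
    }
    return best;
}

// Majority voting: each frame votes for its closest gallery image.
int majorityVoting(const std::vector<std::vector<double> >& d) {
    std::vector<int> votes(d[0].size(), 0);
    for (size_t i = 0; i < d.size(); ++i) {
        size_t argmin = 0;
        for (size_t j = 1; j < d[i].size(); ++j)
            if (d[i][j] < d[i][argmin]) argmin = j;
        ++votes[argmin];
    }
    int best = 0;
    for (size_t j = 1; j < votes.size(); ++j)
        if (votes[j] > votes[best]) best = (int)j;
    return best;
}

int main() {
    // Three probe frames scored against two gallery subjects (toy values).
    double rows[3][2] = { {0.4, 0.9}, {0.5, 0.3}, {0.2, 0.8} };
    std::vector<std::vector<double> > d(3, std::vector<double>(2));
    for (int i = 0; i < 3; ++i) for (int j = 0; j < 2; ++j) d[i][j] = rows[i][j];
    std::cout << "arithmetic mean picks subject " << minimumArithmeticMean(d) << std::endl;
    std::cout << "majority voting picks subject " << majorityVoting(d) << std::endl;
    return 0;
}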
One image or several images
Multiple observations can be summarized into a smaller number of images. For example, one
could use the mean or median image of the probe sequence, or use clustering techniques to
produce multiple summary images. After that, single still image techniques or assembly-type
algorithms can be applied to the result.
Matrix
If each frame of the probe video is vectorized by some means, the video can be represented
as a matrix V = [F1 F2 . . . Fn ]. This representation can make use of the various methods of
matrix analysis. For example, matrix decompositions can be invoked to represent the data
more efficiently. Matrix similarity measures can be used for recognition[43].
Probability density function
Multiple observations {F1 , F2 , . . . , Fn } can be regarded as independent realizations drawn
from the same underlying probability distribution. PDF estimation techniques can be utilized to learn this distribution[23]. If both the probe and the gallery consist of video footage,
PDF distance measures can be used to perform recognition. If the probe consists of video
and the gallery of still images, recognition becomes a matter of determining which gallery
image is most likely to be generated from the probe distribution. In the reverse case, where
the gallery consists of video and the probe of a still image, recognition tests which gallery
distribution is most likely to generate the probe.
Manifold
Face appearances of multiple observations form a highly nonlinear manifold. If we can characterize the manifold[18], recognition reduces to (i) comparing two manifolds if both the probe and gallery are video, (ii) comparing the distance between a data point and various manifolds if the probe is a still image and the gallery is video, or (iii) comparing the distance between various data points and a manifold if the probe is video and the gallery consists of still images.
2.4.2 Temporal continuity/Dynamics
Successive frames in a video clip are continuous along the temporal axis. Temporal continuity provides an additional constraint for modeling face appearance. For example, smoothness
of face movement can be used in face tracking. It was previously stated that these techniques assume that the probe and gallery have been prenormalized, but it can be noted that
in the case of video, face tracking can be used instead of face detection for the purposes of
normalization due to the temporal continuity.
Simultaneous tracking and recognition
Zhou and Chellappa proposed[64] an approach that models tracking and recognition in a single probabilistic framework using time series analysis. A time series model is used, consisting of the state vector $(a_t, \theta_t)$, where $a_t$ is the identity variable at time $t$ and $\theta_t$ is the tracking parameter, as well as the observation $y_t$ (the video frame), the state transition probability $p(a_t, \theta_t \mid a_{t-1}, \theta_{t-1})$ and the observation likelihood $p(y_t \mid a_t, \theta_t)$. The task of recognition thus becomes computing the posterior probability $p(a_t \mid y_{0:t})$, where $y_{0:t} = y_0, y_1, \ldots, y_t$.
Probabilistic appearance manifolds
A probabilistic appearance manifold[31] models each individual in the gallery as a set of
linear subspaces, each modelling a particular pose variation, called pose manifolds. These
are generated by extracting samples from a training video which are divided into groups
through k-means clustering. Principal component analysis is performed on each group to
characterize that subspace. Temporal continuity is captured by computing the transition
probabilities between pose manifolds in the training video. Recognition is performed by
integrating the likelihood that an input frame is generated by a pose manifold and the
probability of transitioning to that pose manifold from the previous frame.
Adaptive hidden Markov model
Liu and Chen proposed[39] an HMM-based approach that captures temporal information
by using temporally indexed observation sequences. The approach makes use of principal
component analysis to reduce each gallery video to a sequence of low-dimensional feature
vectors. These are then used as observation sequences in the training of the HMM models. In
addition, the algorithm gradually adapts to probe videos by using unambiguously identified
probes to update the corresponding gallery model.
System identification
Aggarwal, Chowdhury and Chellappa presented[21] a system identification approach to face recognition in video. Each video sequence is represented by a first-order auto-regressive and moving average (ARMA) model

$\theta_{t+1} = A\theta_t + v_t, \qquad I_t = C\theta_t + w_t$

where $\theta_t$ is a state vector characterizing the pose of the face, $I_t$ is the frame, and $v_t$ and $w_t$ are independent, identically distributed white noise terms drawn from $N(0, Q)$ and $N(0, R)$ respectively. System identification is the process of estimating the model parameters $A$, $C$, $Q$ and $R$ based on the observations $I_1, I_2, \ldots, I_n$. Recognition is performed by selecting the gallery model that is closest to the probe model by some distance function of the model parameters.
2.4.3 3D model
We can attempt to reconstruct a 3D model of a face from a video sequence. One way to
do this is by utilizing light field rendering, which involves treating each observation as a
2D slice of a 4D function - the light field, which characterizes the flow of light through
unobstructed space. Another method is structure from motion (SfM), which attempts to
recover 3D structure from 2D images coupled with local motion signals. The 3D model will
possess two components: geometric and photometric. The geometric component describes
depth information of the face and the photometric component depicts the texture map.
Structure from motion is more focused on recovering the geometric component, and light
field rendering on recovering the photometric component.
Structure from motion
There is a large body of literature on SfM, but despite this, current SfM algorithms cannot reconstruct the 3D face model reliably. The difficulties are three-fold: (i) the ill-posed nature of the perspective camera model, which results in instability of SfM solutions, (ii) the fact that the face is not a truly rigid object, especially when the face presents facial expressions and other deformations, and (iii) the input to the SfM algorithm, which is usually a sparse set of feature points provided by a tracking algorithm with its own flaws. Interpolation from a sparse to a dense set of feature points is very inaccurate. The first difficulty can be addressed by using an orthographic or paraperspective model to approximate the perspective camera model[56][46]. The second problem can often be resolved by imposing a subspace constraint on the face model[13]. A dense face model can be used to overcome the sparse-to-dense issue. However, the dense face model is generic and not appropriate for a specific individual. Bundle adjustment has been used to adjust the generic model to accommodate the video observations[20].
Light field rendering
The SfM algorithm mainly recovers the geometric component of the face model, i. e., the depth value of every pixel. Its photometric component is naively set to the appearance in one reference video frame.
in one reference video frame. An image-based rendering method recovers the photometric
component of the 3D model instead, and light field rendering bypasses even this stage by
extracting novel views directly[37].
2.5 Object tracking
The field of object tracking deals with the combined problems of locating objects in video
sequences, tracking their movement from frame to frame and analyzing object tracks to
recognize behavior. In its simplest form, object tracking can be defined as the problem of
consistently labeling a tracked object in each frame of a video. Depending on the tracker,
additional information about the object can also be detected, such as area, orientation,
shape, etc. Conditions that create difficulties in object tracking include:
– information loss due to projecting a 3D world onto a 2D image,
– noise and cluttered, dynamic backgrounds,
– complex rigid motion (drastic changes in velocity),
– nonrigid motion (deformation),
– occlusion,
– complex object shape,
– varying illumination.
Tracking can be simplified by imposing constraints on the conditions of the scene. For
example, most object tracking methods assume smooth object motion, i. e., no abrupt
changes in direction and velocity. Assuming constant illumination also increases the number
of potential approaches that can be used. The approaches to choose from mainly differ in
how they represent the objects to be tracked, which image features they use and how the
motion is modeled. Which approach performs best depends on the intended application.
Some trackers are even specifically tailored to the tracking of certain classes of objects, for
example humans.
2.5.1 Object representation
Objects can be represented both in terms of their shapes and their appearances. Some
approaches use only the shape of the object to represent them, but some also combine
shape with appearance. Shape and appearance representations are usually chosen to fit a
certain application domain[7]. Major categories of shape representations include:
– Points. The object is represented by one or more points. This is generally suitable for
objects that occupy a small image region.
– Geometric primitives. Objects are represented by geometric primitives, such as rectangles, circles or ellipses. This representation is particularly suitable for rigid objects
but can also be used to bound non-rigid ones.
– Silhouette and contour. The contour is the boundary of an object, and the area inside
it is called the silhouette. Using this representation is suitable for tracking complex
non-rigid objects.
– Articulated shape models. These models consist of body parts held together with
joints. The human body, for example, consists of a head, torso, upper and lower arms,
etc. The motion of the parts are constrained by kinematic models. The constituent
parts can be modeled by simple primitives such as ellipses or cylinders.
– Skeletal models. The skeleton of an object can be extracted using medial axis transformation. This model is commonly used as a shape representation in object recognition
and can be used to model both rigid and articulated objects.
Common appearance representations of objects are:
– Probability densities. Probability density appearance representations can be parametric, such as a Gaussian, or non-parametric, such as histograms. The probability
densities of object appearance can be computed from features (color, texture or more
complex features) of the image region specified by the shape representation, such as
the interior of a rectangle or a contour.
– Templates. Templates are formed from a composition of simple geometric objects.
The main advantage of templates is that they carry both spatial and appearance
information, but they tend to be sensitive to pose changes.
– Active appearance models. Active appearance models simultaneously model shape and
appearance by defining objects in terms of a set of landmarks. Landmarks are often
positioned on the object boundary but can also reside inside the object region. Each
landmark is associated with an appearance vector containing, for example, color and
texture information. The models need to be trained using a set of samples by some
technique, e.g. PCA.
Figure 2.6: A number of object shape representations. a) Single point. b) Multiple points.
c) Rectangle. d) Ellipse. e) Articulated shape. f) Skeletal model. g) Control points on
contour. h) Complete contour. i) Silhouette.
2.5.2 Image features
The image features to use are an integral part of any tracking algorithm. The most desirable property of a feature is how well it distinguishes between the object region and the
background[7]. Features are usually closely related to the object representation. For example, color is mostly used for histogram representations while edges are more commonly used
for contour-based representations. The most common features are:
– Color. The apparent color of an object is influenced both by the light source and the
reflective properties of the object surface. Different color spaces, such as RGB, HSV,
L*u*v or L*a*b, each with different properties, can be used, depending on application
area.
– Edges. Object boundaries generally create drastic changes in image intensity and edge
detectors identify these changes. These features are mostly used in trackers that use
contour-based object representations.
– Optical flow. Optical flow is a field of displacement vectors that describe the motion of
each pixel in an image region. It is computed by assuming that the same pixel retains
the same brightness between consecutive frames.
– Texture. Texture measures the variation of intensity across a surface, describing properties like smoothness and regularity.
Methods for automatic feature selection have also been developed. These can mostly
be categorized as either filter or wrapper methods[11]. Filter methods derive a set of features from a much larger set (such as pixels) based on some general criteria, such as noncorrelation, while wrapper methods select features based on their usefulness in a particular
problem domain.
2.5.3 Object detection
Every tracking method requires some form of detection mechanism. The most common
approach is to use information from a single initial frame, but some methods utilize temporal
information across multiple frames to reduce the number of false positives[7]. This is usually
in the form of frame differencing, which highlights regions that change between frames. In
this case, it is then the tracker’s task to establish correspondence between detected objects
across frames. Some common techniques include:
– Point detectors. Detectors used to find points of interest whose respective loci have
particular qualities[29]. Major advantages of point detectors are insensitivity to variation in illumination and viewpoint.
– Background subtraction. Approach based on the idea of building a model for the
background of the scene and detecting foreground objects based on deviations from
this model[51].
– Segmentation. Segmentation aims to detect objects by partitioning the image into
perceptually similar regions and characterizing them[50].
– Supervised learning. Based on learning a mapping between object features and object
class and then applying the trained model to different parts of an image. These
approaches include neural networks, adaptive boosting, decision trees and support
vector machines.
2.5.4 Trackers
The goal of an object tracker is to track the trajectory of an object over time by locating
it in a series of consecutive frames in a video. This can be done in two general ways[7].
Firstly, the object can be located in each frame individually using a detector, the tracker
being responsible for establishing a correspondence between the regions in separate frames.
Secondly, the tracker can be provided with an initial region located by the detector and then
iteratively update its location in each frame. The shape and appearance model limits the
types of transformations it can undergo between frames. The main categories of tracking
algorithms are:
– Point tracking. With objects represented by points, tracking algorithms use the state
of the object in the previous frame to associate it to the next. This state can include
the position and motion of the object. This requires an external mechanism to detect
the object in each frame beforehand.
– Kernel tracking. The kernel refers to a combination of shape and appearance model.
For example, a kernel can be the rectangular region of the object coupled with a color
histogram describing its appearance. Tracking is done by computing the motion of
the kernel across frames.
– Silhouette tracking. Tracking is performed by estimating the object region in each
frame. This is done by using information encoded in the object region from previous
frames. This information usually takes the form of appearance density, or shape
models such as edge maps.
CAMSHIFT
The Continuously Adaptive Mean Shift (CAMSHIFT) algorithm[12] is a color histogram-based object tracker built on a statistical method called mean shift. It was designed to be used in perceptual user interfaces, and minimizing computational cost was thus a primary design criterion. In addition, it is relatively tolerant to noise, pose changes and occlusion, and to some extent also illumination changes. It tracks object movement along four degrees of freedom: x, y, z position as well as roll angle. x and y are given directly by the search window; the z position can be derived by estimating the size of the object and relating it to the current size of the tracking window. Roll can be derived from the second moments of the probability distribution in the tracking window. It was initially developed to track faces, but it can also be applied to other object classes.
Color probability distribution The first step of the CAMSHIFT algorithm is to create
a probability distribution image of each frame, based on an externally selected initial track
window which contains exactly the object to track. This is done by generating a color
histogram of the window and using it as a lookup table to convert an incoming frame into
a probability-of-object map. CAMSHIFT uses only the hue dimension of the HSV color
space, and ignores saturation and brightness, which gives it some robustness to illumination
changes. For the purposes of face tracking, it also minimizes the impact of differing skin
colors. Problems with this approach can occur if the brightness value is too extreme, or if
the saturation is too low, due to the nature of the HSV color space causing the hue value
to vary drastically. The solution is to simply ignore pixels to which these conditions apply.
This means that very dim scenes need to be preprocessed to increase the brightness prior
to tracking.
Mean shift The mean shift algorithm is a non-parametric statistical technique which
climbs the gradient of a probability distribution to find the local mode/peak. It involves
five steps:
1. Choose a search window size.
2. Choose the initial location of the search window.
3. Compute the location of the mode inside the search window. This is done as follows:
Let p(x, y) be the probability at position (x, y) in the image, and x and y range over
the search window.
(a) Find the zeroth moment:
$M_{00} = \sum_{x} \sum_{y} p(x, y)$
(b) Find the first moments for $x$ and $y$:
$M_{10} = \sum_{x} \sum_{y} x\,p(x, y); \qquad M_{01} = \sum_{x} \sum_{y} y\,p(x, y)$
(c) Then the mode of the search window is:
$x_c = \frac{M_{10}}{M_{00}}; \qquad y_c = \frac{M_{01}}{M_{00}}$
4. Center the search window on the mode.
5. Repeat steps 3 and 4 until convergence.
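The moment computation in step 3 maps directly onto OpenCV's image moments. The following is a minimal sketch of one re-centering step over a probability-of-object map, with illustrative names (prob, window); it is not the framework's own code.

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// One mean shift re-centering step. `prob` is a single-channel probability map
// and `window` the current search window (both names are placeholders).
bool meanShiftStep(const cv::Mat& prob, cv::Rect& window)
{
    cv::Moments m = cv::moments(prob(window));   // M00, M10, M01 of the window
    if (m.m00 <= 0.0)
        return false;                            // empty distribution, stop
    // Mode (centroid) in image coordinates, then re-center the window on it.
    const int xc = window.x + static_cast<int>(m.m10 / m.m00);
    const int yc = window.y + static_cast<int>(m.m01 / m.m00);
    const cv::Rect old = window;
    window.x = xc - window.width / 2;
    window.y = yc - window.height / 2;
    window &= cv::Rect(0, 0, prob.cols, prob.rows);  // clamp to the image
    return window != old;                        // true while the window still moves
}

Calling this function repeatedly until it returns false corresponds to steps 3-5 above.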
CAMSHIFT extension The mean shift algorithm only applies to a single static distribution, while CAMSHIFT operates on a continuously changing distribution from frame to
frame. The zeroth moment, which can be thought of as the distribution’s "area", is used to
adapt the window size each time a new frame is processed. This means that it can easily
handle objects changing size when, for example, the distance between the camera and the
object changes. The steps to compute the CAMSHIFT extension are as follows:
1. Choose the initial search window.
2. Apply mean shift as described above and store the zeroth moment.
3. Set the search window size to a function of the zeroth moment found in step 2.
4. Repeat steps 2 and 3 until convergence.
When applying the algorithm to a series of video frames, the initial search window of
one frame is simply the computed region of the previous one. The window size function
used in the original paper is
$s = 2\sqrt{\dfrac{M_{00}}{256}}$
Figure 2.7: CAMSHIFT in action. The graph in each step shows a cross-section of the
distribution map, with red representing the probability distribution, yellow the tracking
window and blue the current mode. In this example the algorithm converges after six steps.
This is arrived at by first dividing the zeroth moment by the maximum pixel intensity
value to convert it into units of number of cells, which makes sense for a window size measure.
In order to convert the 2D region into a 1D length, we take the square root. We desire an
expansive window that grows to encompass a connected distribution area, so we multiply
the result by two.
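For reference, OpenCV ships a ready-made implementation of this procedure. The sketch below is illustrative (variable names are placeholders, and the hue histogram hist is assumed to have been computed once from the initial track window): it back-projects the histogram onto a new frame and hands the result to cv::CamShift.

#include <opencv2/imgproc.hpp>
#include <opencv2/video/tracking.hpp>

// `frame` is the current BGR video frame, `track` the search window carried
// over from the previous frame, `hist` a hue histogram of the initial region.
cv::RotatedRect camshiftStep(const cv::Mat& frame, cv::Rect& track, const cv::Mat& hist)
{
    cv::Mat hsv, hue, backproj;
    cv::cvtColor(frame, hsv, cv::COLOR_BGR2HSV);

    // Use only the hue channel, as described above.
    hue.create(hsv.size(), CV_8U);
    int fromTo[] = {0, 0};
    cv::mixChannels(&hsv, 1, &hue, 1, fromTo, 1);

    // Convert the frame into a probability-of-object map via the histogram.
    float hueRange[] = {0, 180};
    const float* ranges[] = {hueRange};
    cv::calcBackProject(&hue, 1, 0, hist, backproj, ranges);

    // CAMSHIFT adapts the window size and also reports the roll angle.
    return cv::CamShift(backproj, track,
                        cv::TermCriteria(cv::TermCriteria::EPS | cv::TermCriteria::COUNT, 10, 1));
}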
Chapter 3
Face recognition systems and libraries
This chapter describes the face recognition systems and libraries examined in this report
and gives a brief review of the installation process for each of them.
3.1 OpenCV
OpenCV (Open Source Computer Vision) is a free, open source library for image and video processing
with support for a wide variety of different algorithms in many different domains. It has
extensive documentation and community support, with complete API documentation as
well as various guides and tutorials for specific topics and tasks[2].
3.1.1 Installation and usage
Using Ubuntu 12.04 LTS, OpenCV was very simple to install. There is a relatively new
binary release available through the package manager, but the latest version, which among
other things includes the face recognition algorithms, had to be downloaded separately and
compiled from source. However, the compilation required few external dependencies and
the process was relatively simple. The library itself has an intuitive interface and it is very
easy to create quite powerful sample applications. A great advantage of using OpenCV
for building a face recognition system is that it supports many of the tasks that are not
directly related to recognition, such as loading, representing and presenting video and image
data, image processing and normalization, face detection and so on. The implemented face recognition algorithms have a unified API, which makes them easily interchangeable in an application. One minor problem in evaluating their performance is that they only return
the rank-1 classification which prevents the methods from being evaluated using rank-based
metrics. They also return a confidence value for the classification, but its definition is never
formally documented, and the only way to find this out is by reading the source code itself.
3.2 Wawo SDK
Wawo Technology AB[3] is a Sweden-based company developing face recognition technologies, with its main product being the Wawo face recognition system, which is based on
original algorithms and techniques presented in the doctoral dissertation of Hung-Son Le.
It is advertised as being capable of performing rapid and accurate illumination-invariant
face recognition based on a very small number of samples per subject. This is done using an
original HMM-based algorithm as well as original contrast-enhancing image normalization
procedures[35].
3.2.1 Installation and usage
The binary distribution of Wawo used in this work was acquired through Codemill and not
directly from Wawo, and thus it might not be the most up-to-date version of the library.
The distribution I was first given access to also lacked some files that I had to piece together
myself from an old hard drive used in previous projects, so while I was able to get the library
running, it is possible that some of the problems described below could have occurred because of this. The documentation that came along was quite limited and mainly consisted
of code samples with some rudimentary comments partially describing the API. Despite
this, producing functional code was not very difficult when Wawo was used in conjunction with the basic facilities provided by OpenCV. I occasionally encountered segmentation faults originating inside library calls; again, this is possibly due to using an old, incomplete distribution. I also discovered that Wawo sets a relatively low upper limit on the size of
the training gallery. This is reasonable given that Wawo’s strength is advertised to be good
performance with very few training samples.
3.3 OpenBR
OpenBR (OpenBiometrics) is a collaborative research project initiated by MITRE, a US-based non-profit organization operating several federally funded research centers. Its purpose is to develop biometrics algorithms and evaluation methodologies. The current system supports several different types of automated biometric analysis, including face recognition, age estimation and gender estimation[1].
3.3.1 Installation and usage
I could not find a binary release of OpenBR 0.4.0, the version advertised on the official website at the time of writing. The only available option seemed
to be building from source. The only instructions available were specific to Ubuntu 13.04.
Following them on Ubuntu 12.04 LTS did not work, and thus, I was unable to install and test
the library. There were a large number of different steps involved, which I suspect makes
the process error-prone even if the correct OS version is used. Overall, the installation
procedure seemed immature and documentation was limited.
Chapter 4
System description of standalone framework
This chapter gives a practical description of the framework developed over the course of
this project. First, a conceptual overview is given and then each component is described in
detail. The framework depends on OpenCV (≥ 2.4) and CMake (tested with 2.8). The Wawo SDK is included in the source tree but may not be the latest version.
There are four primary types of objects that are of concern to client programmers and
framework developers: detectors, recognizers, normalizers and techniques. Detectors and
recognizers encapsulate elementary face detection and recognition algorithms. Normalizers
perform image preprocessing to deal with varying imaging conditions in the data and with algorithm-specific requirements. Techniques integrate the lower-level components and perform high-level algorithmic functions. These components are interchangeable and can be mixed and matched to suit the intended application and the source data.
Figure 4.1: Conceptual view of a typical application.
4.1 Detectors
A detector is a class that wraps the functionality of a face detection algorithm. Every detector implements the IDetector interface. The interface specifies a detect method, which
accepts an image represented by an OpenCV matrix and returns a list of rectangles representing the image regions containing the detected faces. In order to deal with varying input
image formats and imaging conditions, and the fact that different detection algorithms may
benefit from different types of image preprocessing, the IDetector interface also specifies a
method for setting a detector-specific normalizer. See below for details on normalizers.
Figure 4.2: IDetector interface UML diagram.
The framework currently supports two different detectors, but adding additional detectors to the framework is simply a matter of implementing the interface.
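In C++ terms, the interface described above might look roughly as follows. The exact signatures are a sketch inferred from figure 4.2 and the surrounding description, not a verbatim copy of the framework header.

#include <memory>
#include <vector>
#include <opencv2/core.hpp>

class INormalizer;  // see section 4.3

// Sketch of the IDetector interface; method names follow the description in
// this section, parameter and return types are assumptions.
class IDetector {
public:
    virtual ~IDetector() {}
    // Returns the image regions containing the detected faces.
    virtual std::vector<cv::Rect> detect(const cv::Mat& image) = 0;
    // Detector-specific preprocessing applied before detection.
    virtual void setNormalizer(std::shared_ptr<INormalizer> normalizer) = 0;
};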
4.1.1 CascadeDetector
CascadeDetector implements the cascade classifier with Haar-like features described in detail
in chapter two. Algorithm parameters can be configured using the following methods:
– CascadeDetector(std::string cascadeFileName) - cascadeFileName specifies the
file that contains the cascade training data.
– setScaleFactor(double) - Specifies the degree to which the image size is reduced at
each image scale.
– setMinNeighbours(int) - Specifies how many neighboring detections each candidate rectangle should have in order to be retained.
– setMinWidth(int), setMinHeight(int) - Minimum possible face dimensions. Faces smaller than this are ignored.
– setMaxWidth(int), setMaxHeight(int) - Maximum possible face dimensions. Faces larger than this are ignored.
– setHaarFlags(int) - Probably obsolete, see OpenCV documentation.
4.1.2 RotatingCascadeDetector
RotatingCascadeDetector inherits from CascadeDetector and also implements the rotation
extension to the cascade classifier described in detail in chapter five. In addition to the ones
provided by CascadeDetector, algorithm parameters are supplied through the constructor:
– RotatingCascadeDetector(std::string cascadeFileName, double maxAngle,
double stepSize) - maxAngle is the angle of the maximum orientation deviation
from the original upright position, and stepSize is the size of the step angle in each
iteration.
4.2 Recognizers
A recognizer wraps a face recognition algorithm. It is a class that implements the IRecognizer
interface. A face recognition algorithm is always a two-step process. In the first step, the
algorithm needs to be trained with a gallery of known subjects, and thus a recognizer needs
to implement the train method. It accepts a list of gallery images, a corresponding list of
image regions containing the faces of the subjects and a list of labels indicating the identity
of the subject in the image. All three arguments need to be of equal length and a given
index refers to the same subject in all three lists.
Figure 4.3: IRecognizer interface UML diagram.
The recognize method is responsible for actually performing the recognition. It accepts
an image and a face region as input arguments, and returns the estimated label indicating
the identity of the subject in the image. As in the case with detectors, different image
formats and imaging conditions can require varying image preprocessing in order to optimize
the performance of the recognition algorithm, and so image normalization is required. In
addition, since recognizers deal with images from two different sources, the gallery and the
probe, two different normalizers may be necessary. Thus, an implementing class needs to
accept a gallery normalizer and a separate probe normalizer.
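As with IDetector, the interface can be sketched in C++ as follows. The signatures are inferred from figure 4.3 and the description above rather than copied from the framework.

#include <memory>
#include <vector>
#include <opencv2/core.hpp>

class INormalizer;  // see section 4.3

// Sketch of the IRecognizer interface; names follow the description, exact
// types are assumptions.
class IRecognizer {
public:
    virtual ~IRecognizer() {}
    // images[i], faces[i] and labels[i] all refer to the same gallery subject.
    virtual void train(const std::vector<cv::Mat>& images,
                       const std::vector<cv::Rect>& faces,
                       const std::vector<int>& labels) = 0;
    // Returns the estimated label of the face region in the probe image.
    virtual int recognize(const cv::Mat& image, const cv::Rect& face) = 0;
    // Separate preprocessing for gallery and probe imagery.
    virtual void setGalleryNormalizer(std::shared_ptr<INormalizer> n) = 0;
    virtual void setProbeNormalizer(std::shared_ptr<INormalizer> n) = 0;
};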
The framework currently supports five different recognition algorithms:
4.2.1 EigenFaceRecognizer
EigenFaceRecognizer implements the Eigenfaces algorithm, described in detail in chapter
two. Algorithm parameters can be configured using the following methods:
– setComponentCount(int) - Set the number of eigenvectors to use.
– setConfidenceThreshold(double) - Set the known/unknown subject threshold. A value between 0.0 and DBL_MAX, inclusive.
4.2.2 FisherFaceRecognizer
FisherFaceRecognizer implements the Fisherfaces algorithm, described in detail in chapter
two. Algorithm parameters can be configured using the following methods:
– setComponentCount(int) - Set the number of eigenvectors to use.
– setConfidenceThreshold(double) - Set the known/unknown subject threshold. A value between 0.0 and DBL_MAX, inclusive.
4.2.3 LBPHRecognizer
LBPHRecognizer implements the local binary pattern histograms algorithm, described in
detail in chapter two. Algorithm parameters can be configured using the following methods:
– setRadius(int) - The radius used for building the circular local binary pattern.
– setNeighbours(int) - The number of sample points to build a circular local binary
pattern from.
– setGrid(int x, int y) - The number of cells in the horizontal and vertical direction
respectively.
– setConfidenceThreshold(double) - Set the known/unknown subject threshold. A value between 0.0 and DBL_MAX, inclusive.
4.2.4 WawoRecognizer
WawoRecognizer implements the Wawo algorithm, described briefly in chapters two and three.
Algorithm parameters can be configured using the following methods:
– setRecognitionThreshold(float) - Set the known/unknown subject threshold. A value between 0.0 and 1.0, inclusive.
– setVerificationLevel(int) - A value between 1 and 6, inclusive. A lower value runs faster, but probably decreases accuracy.
– setMatchesUsed(int) - If set to greater than 1, the result returned is the mode of
the n most likely candidates.
4.2.5 EnsembleRecognizer
The EnsembleRecognizer combines an arbitrary number of elementary recognizers, which
vote democratically amongst themselves about the final result. The setConfidenceThreshold(double) method sets the minimum fraction of participating recognizers which need to
agree to produce a result. Otherwise, the probe is considered unknown. Note that the
participating recognizers can also explicitly vote for an unknown identity, if so configured.
4.3 Normalizers
Input imagery can vary greatly depending on camera equipment, lighting conditions during
the shot, lossy processing since the image was taken, etc. In addition, different image
processing algorithms require different input preprocessing to achieve optimal performance.
Normalizers are modules that perform the preprocessing steps for the other parts of the
recognition system. This makes it easy to test a variety of normalization options in order
to figure out which one suits a particular algorithm in a particular context best.
Figure 4.4: INormalizer interface UML diagram.
A normalizer is a class that implements the INormalizer interface. The interface defines
only a single method, normalize, which accepts an input image and returns an output
image. Any number and combination of image processing operations can be performed by
a normalizer. The framework currently supports the following four, but adding more is very
simple:
– GrayNormalizer - Converts an RGB image to grayscale.
– ResizeNormalizer - Scale an image to the given dimensions.
– EqHistNormalizer - Enhance contrast by equalizing the image histogram.
– AggregateNormalizer - Utility class that lets the user create a custom normalizer by assembling a sequence of elementary normalizers, as sketched below. This circumvents the need to create a new normalizer for every conceivable combination of normalization steps.
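A hypothetical composition, chaining grayscale conversion, histogram equalization and a resize, might look as follows. The add method and the ResizeNormalizer(width, height) constructor are assumptions based on the descriptions above, not documented framework API.

#include <memory>
#include <opencv2/core.hpp>

// Illustrative fragment: build a normalizer pipeline for a recognizer that
// expects equally sized grayscale images (e.g. Eigenfaces/Fisherfaces).
cv::Mat prepareForRecognition(const cv::Mat& inputImage)
{
    AggregateNormalizer normalizer;
    normalizer.add(std::make_shared<GrayNormalizer>());           // RGB -> grayscale
    normalizer.add(std::make_shared<EqHistNormalizer>());         // contrast enhancement
    normalizer.add(std::make_shared<ResizeNormalizer>(100, 100)); // fixed dimensions
    return normalizer.normalize(inputImage);
}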
4.4 Techniques
A technique is a top-level class that ties together the constituent detection and recognition algorithms in a particular way. While the IDetector and IRecognizer interfaces deal solely with individual images, a technique is responsible for loading the gallery and probe files, potentially iterating over the frames of a probe video, and performing algorithmic tasks spanning multiple sequential frames.
Figure 4.5: ITechnique interface UML diagram.
Every technique implements the ITechnique interface, which specifies the train and recognize methods. The former accepts a Gallery object which specifies the files to use for training the underlying recognition model. The latter accepts a string containing the filename of the probe image or video and produces an Annotation object describing the presence, identities and locations of recognized individuals in the probe.
4.4.1 SimpleTechnique
SimpleTechnique is the prototypical technique. It accepts one detector for the gallery and
one for the probe, and a single recognizer. All gallery files are loaded from disk in turn.
If a gallery file is an image, it applies the gallery detector and feeds the detected image
region and the corresponding label to the recognizer for training. If a gallery file is a video, it performs the same operations on each frame of the video in turn. After the training
is complete, it loads the probe from file. If it is an image, the probe detector is applied to
it and the detected image region is fed to the recognizer and the result is stored. If it is a
video file, the technique performs the same operations on each frame in turn.
4.4.2 TrackingTechnique
TrackingTechnique implements the recognition/object tracking integration described in detail in chapter five. The gallery face detection preprocessing and recognizer training are identical to SimpleTechnique (see above).
Figure 4.6: SimpleTechnique class UML diagram.
Figure 4.7: TrackingTechnique class UML diagram.
4.5 Other modules
4.5.1 Annotation
The Annotation class represents the annotation of the presence and location of individuals
in a sequence of images, such as a video clip. An instance can be produced by the framework
through the application of face recognition to a probe image or video, but also saved and
loaded from disk.
In addition, an instance can be compared to another, "true", annotation by a number
of performance measures, described in chapter six. When saved to file, it uses a simple
ASCII-based file format. The first line contains exactly one positive integer, representing
the total number of individuals in the subject gallery. Each subsequent line represents a
frame or image.
The individuals present in the frame/image can be specified in two ways, depending on whether or not location data is included. Either each individual present in the frame is represented by a non-negative integer label, separated by whitespace, or each individual is represented by a non-negative integer label followed by a start parenthesis '(', four comma-separated non-negative integers representing the x, y, width and height of the rectangle specifying the image region of the individual's face in the frame/image, followed by an end parenthesis ')'. Each such segment is separated by whitespace. For example, a file of the first type may look like this:
4
0
0
0
0
0
1
1
1
1
1
1 2
2
2
2 3
And a file of the second type may look like this:
9
0(46,25,59,59)
0(46,25,61,61)
0(45,24,63,63)
0(47,25,61,61)
0(46,25,61,61)
0(45,24,62,62)
0(46,24,62,62) 1(146,124,41,41)
0(45,24,62,62) 1(146,124,41,41)
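To make the format concrete, the following sketch parses a single frame line of either kind into labels and (optionally) face rectangles. It is only an illustration of the format above, not the framework's own parser.

#include <cstdio>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>
#include <opencv2/core.hpp>

struct FrameAnnotation {
    std::vector<int> labels;
    std::vector<cv::Rect> regions;   // left empty when no location data is given
};

// Parses a line such as "0(46,25,59,59) 1(146,124,41,41)" or "0 1 2".
FrameAnnotation parseFrameLine(const std::string& line)
{
    FrameAnnotation result;
    std::istringstream in(line);
    std::string token;
    while (in >> token) {
        int label, x, y, w, h;
        if (std::sscanf(token.c_str(), "%d(%d,%d,%d,%d)", &label, &x, &y, &w, &h) == 5) {
            result.labels.push_back(label);
            result.regions.push_back(cv::Rect(x, y, w, h));
        } else {
            result.labels.push_back(std::atoi(token.c_str()));  // label-only form
        }
    }
    return result;
}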
4.5.2 Gallery
The Gallery class is a simple abstraction of the gallery data used to train face recognition
models. An instance is created simply by providing the path to the gallery listing file. An
optional parameter samplesPerSubject specifies the maximum number of samples to extract
from each video file in the gallery listing. If the parameter is left out, this indicates to client
modules that the maximum number of samples should be extracted. This commonly means
all detected faces in the video.
Figure 4.8: Gallery class UML diagram.
The gallery listing is an ASCII-format newline-separated list of gallery file/label pairs.
Each pair consists of a string specifying the path to the gallery file and a non-negative
integer label specifying the identity of the subject in the gallery file, separated by a colon.
The gallery file can be either an image or a video clip. In both cases, it is assumed that
only the face of the subject is present throughout the image sequence. For example:
/home/user1/mygallery/subj0.jpg:0
/home/user1/mygallery/subj0.avi:0
/home/user1/mygallery/subj1.png:1
/home/user1/mygallery/subj1.wmv:1
/home/user1/mygallery/subj2.bmp:2
/home/user1/mygallery/subj2.avi:2
4.5.3 Renderer
This class is used to play an annotated video back and display detection or recognition
results visually. To render detection results, it accepts the video file and a list of lists of
cv::Rects, representing the set of detected face regions for each frame of the video. To
render recognition results, it simply accepts the video file and an associated Annotation
object. The framerate of the playback can also be configured.
Figure 4.9: Renderer class UML diagram.
4.6 Command-line interface
The framework includes a simple command-line interface to a subset of the functionality
provided by the framework, as an example of what an application might look like. It accepts
a gallery listings file and probe video file and performs face recognition. The technique,
detection and recognition algorithms used can be customized and the result can be either
saved to file or rendered visually. The application can also be used to benchmark different
algorithms. The syntax is as follows:
./[executable] GALLERY_FILE PROBE_FILE [-o OUTPUT_FILE] [-t TECHNIQUE] [-d DETECTOR]
[-c CASCADE_DATA] [-r RECOGNIZER] [-R] [-C CONFIDENCE_THRESHOLD] [-b BENCHMARKING_FILE]
[-n SAMPLES_PER_VIDEO]
4.6.1 Options
– -o - Specifies the file to write the resulting annotation to. If this option is left out, the
output is not saved.
– -t - Specifies the technique to use. Can be either "simple" or "tracking". The default is "simple".
– -d - Specifies the detector to use. Can be either "cascade" or "rotating". The default is "cascade".
– -c - Specifies the cascade detector training data to use. The default is frontal face
training data included in the source tree.
– -r - Specifies the recognizer to use. Can be either "eigenfaces", "fisherfaces", "lbph" or "wawo". The default is "eigenfaces".
– -R - Indicates that the result should be rendered visually. The sequence of frames is played back at 30 frames per second and the recognition result is overlaid on each frame.
– -C - The confidence threshold to set for the selected recognizer. The range of this
value depends on the algorithm. The default is 0.
– -b - Set the benchmarking annotation file to use. The result is compared to this file and performance data is written to stdout when processing is complete.
– -n - Set the number of faces to extract from each gallery video file for training the
recognizer model. If this option is not given, as many faces as possible will be used.
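For example, a hypothetical invocation (all file names, including the executable name, are placeholders) that recognizes faces in probe.avi against the subjects listed in gallery.txt using the tracking technique and the LBPH recognizer, renders the result and saves the annotation, could look like this:

./facerec gallery.txt probe.avi -t tracking -r lbph -R -o probe_annotation.txt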
Chapter 5
Algorithm extensions
This chapter discusses the improvements made to the basic face recognition system of the Vidispine plugin that have been added to the standalone framework. The improvements are twofold: firstly, an integration of an arbitrary face recognition algorithm with the CAMSHIFT object tracking algorithm, and secondly, an extension of the cascade face detection algorithm. This chapter primarily describes the improvements and discusses their potential and weaknesses, while their performance is empirically evaluated in chapter six.
5.1 Face recognition/object tracking integration
The majority of face recognition approaches proposed in the literature operate on a single image. As discussed in chapter two, these kinds of techniques can be applied to face recognition in video by applying them on a frame-by-frame basis. However, this deliberately disregards certain information contained in a video clip that can be used to achieve better recognition performance. For example, geometric continuity in successive frames tends to imply the same object. How can we take that into account when recognizing faces? In addition, a weakness of many popular face recognition techniques is that they are view-dependent. Either the model is trained exclusively with samples from a single view, and is thus only able to recognize faces from this one perspective, or the model is trained with samples from multiple views and often suffers a reduction in recognition performance for any one perspective.
A color-based object tracking algorithm has the advantage of not being dependent on the pose of the tracked object, as long as the color distribution of the object does not radically change with the pose. In fact, an advertised strength of the CAMSHIFT tracking algorithm is that it is able to continue tracking an object as long as occlusion isn't 100%[12]. It is thus a natural step to combine an elementary face recognition algorithm, used to identify faces in individual frames, with the CAMSHIFT algorithm, in order to overcome issues with view-dependence and to associate the faces of the same subjects across multiple frames. The proposed algorithm consists of the steps below. For details concerning face recognition, face detection or the CAMSHIFT algorithm, see chapter two.
1. For each frame in the video:
(a) Extend any existing CAMSHIFT tracks to the new frame, possibly terminating
them. If two tracks intersect, select one and terminate it.
(b) Detect any faces in the frame using an arbitrary face detection algorithm. If a
face is detected and it does not intersect with any existing tracks, use it as the
initial search region of a new CAMSHIFT track.
(c) For each existing track, uniformly expand the search region of the current frame
by a fixed percentage, and apply face detection inside the expanded search region.
If a face is detected, apply an arbitrary face recognition algorithm on the face
region, and store the recognized ID.
2. Once the video has been processed, iterate over all tracks that were created:
(a) Compute the mode of all recognized IDs of the track and write it as output to each
frame the track covers. Write the CAMSHIFT search region as the corresponding
face region to each frame the track covers.
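A compact sketch of the per-frame loop is given below. It uses the IDetector and IRecognizer interfaces from chapter four, while Track, extendTrack() and expandBy() are illustrative helpers; track termination and intersection handling are omitted, so this is an outline of the steps above rather than the framework's implementation.

#include <vector>
#include <opencv2/core.hpp>

struct Track {
    cv::Rect region;            // current CAMSHIFT search window
    std::vector<int> ids;       // recognized IDs collected along the track
    std::vector<int> frames;    // frame indices the track covers
    bool alive;
};

bool extendTrack(Track& t, const cv::Mat& frame);    // CAMSHIFT step, not shown
cv::Rect expandBy(const cv::Rect& r, double frac);   // uniform expansion, not shown

void processFrame(int frameNo, const cv::Mat& frame,
                  IDetector& detector, IRecognizer& recognizer,
                  std::vector<Track>& tracks)
{
    // 1(a) Extend existing CAMSHIFT tracks to the new frame.
    for (Track& t : tracks)
        if (t.alive)
            t.alive = extendTrack(t, frame);

    // 1(b) Start a new track for every detected face outside existing tracks.
    for (const cv::Rect& face : detector.detect(frame)) {
        bool covered = false;
        for (const Track& t : tracks)
            covered = covered || (t.alive && (t.region & face).area() > 0);
        if (!covered)
            tracks.push_back(Track{face, {}, {frameNo}, true});
    }

    // 1(c) Detect and recognize inside a slightly expanded search region.
    for (Track& t : tracks) {
        if (!t.alive)
            continue;
        cv::Rect expanded = expandBy(t.region, 0.2) & cv::Rect(0, 0, frame.cols, frame.rows);
        for (const cv::Rect& face : detector.detect(frame(expanded)))
            t.ids.push_back(recognizer.recognize(frame(expanded), face));
        t.frames.push_back(frameNo);
    }
}

// Step 2(a), run once after the last frame: for each track, compute the mode of
// t.ids and write it, together with the CAMSHIFT region, to every frame in t.frames.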
Figure 5.1 illustrates an example of the algorithm in action visually.
Figure 5.1: Example illustrating the face recognition/tracking integration. In frame a, a
face is detected using a frontal face detector and a CAMSHIFT track subsequently created.
The face is recognized as belonging to subject ID #1. In frame b, the CAMSHIFT track is
extended, but since the head is partially occluded, the detector is unable to recognize it as a
face. In frame c, the CAMSHIFT track is once again extended, and the head is once again
fully revealed, allowing the detector to once again detect and identify the subject as #1. In
frame d, the face is completely occluded and CAMSHIFT loses track of it. The final track
covers frames 1-3, the mode of the identified faces in the track is computed and assigned to
each frame the track covers.
The main advantage of this approach is that weaknesses of the face detector or the face recognizer under temporarily disadvantageous conditions are circumvented. As long as a majority of the successful identifications in a track are correct, failed detections or invalid recognitions are overlooked. This means that temporary pose changes that would normally interrupt a regular recognition algorithm are mitigated. The approach could also deal with temporary occlusions or shadows so long as they do not cause CAMSHIFT to lose track.
The approach could also be extended to using multiple detector/recognizer pairs for multiple
viewpoints to increase the range of conditions that result in valid identifications, further
increasing the probability of achieving a majority of valid identifications in a single track. A
probable weakness of the approach is that cluttered backgrounds could easily result in false
positive face detections, depending on the quality of the face detector used, which could
produce tracks tracking background noise. It might be possible to filter out many tracks of
this type by finding criteria that are likely to be fulfilled by noise tracks, such as a short
length or a relatively low number of identifications.
5.1.1 Backwards tracking
In theory, there is nothing that prevents the algorithm presented above from tracking both forward and backward along the temporal axis. In the case where a face is introduced into the video under conditions that are disadvantageous to the underlying detector/recognizer and only later becomes detectable, the algorithm would miss this initial segment unless tracking is done in both temporal directions. In practice, however, video is usually compressed in such a way that only certain key frames are stored in full, and the frames in between are represented as sequential modifications to the previous key frame. This means that a video can only be played back in reverse by first playing it forwards and buffering the frames as they are decoded. If no upper limit is set on the size of this buffer, even short video clips of moderate resolution would require huge amounts of memory to process. The implementation of the recognition/tracking integration developed during this project supports backwards tracking with a finite-sized buffer, the size of which can be configured to suit the memory resources and accuracy requirements of the intended application.
5.2 Rotating cascade detector
The range of face poses that can be detected by cascade detection with Haar-like features is limited by the range of poses present in the training data. In addition, training for multiple poses can limit the accuracy of detecting a face in any one particular pose. To partially mitigate this issue, an extension to the basic cascade detector has been developed. The basic idea is to rotate the input image about its center in a stepwise manner and apply the regular detector to each resulting rotated image in turn, in order to detect faces that match the trained pose except for an in-plane rotation about the image z-axis. However, this approach is likely to detect the same face multiple times, due to the basic cascade detector being somewhat robust to minor pose changes. In order to handle multiple detections of the same face, the resulting detected face regions are then rotated back to the original image orientation and merged, producing a single face region in the original image for each face present. This extended detector thus expands the range of detectable poses. The steps of the algorithm are as follows:
1. For a given input image, a given maximum angle and a given step angle, start by
rotating the image by the negative maximum angle. Increase the rotation angle by
the step angle for each iteration until the angle exceeds the positive maximum angle.
2. For each image orientation:
(a) Apply cascade face detection to the rotated image.
(b) For each detected face region, rotate the region back to the original orientation
and compute an axis-aligned bounding box (AABB) around it.
3. For all AABBs from the previous step, find each set of overlapping rectangles.
4. For each set, compute the average rectangle, i. e., the average top-left/bottom-right
point defining each rectangle.
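The rotation and back-projection in steps 1 and 2 can be expressed with standard OpenCV calls. The sketch below is a simplified illustration of the idea, not a reproduction of RotatingCascadeDetector; in particular, the merging in steps 3-4 is replaced by OpenCV's rectangle grouping rather than averaging overlapping boxes.

#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/objdetect.hpp>

// Detect faces with in-plane rotation by trying a range of image orientations.
// `image` is assumed to be a single-channel 8-bit image.
std::vector<cv::Rect> detectRotated(const cv::Mat& image, cv::CascadeClassifier& cascade,
                                    double maxAngle, double stepSize)
{
    std::vector<cv::Rect> boxes;
    const cv::Point2f center(image.cols / 2.0f, image.rows / 2.0f);
    for (double angle = -maxAngle; angle <= maxAngle; angle += stepSize) {
        // Step 1: rotate the image about its center.
        cv::Mat rot = cv::getRotationMatrix2D(center, angle, 1.0);
        cv::Mat rotated, inverse;
        cv::warpAffine(image, rotated, rot, image.size());
        cv::invertAffineTransform(rot, inverse);

        // Step 2(a): apply the ordinary cascade detector to the rotated image.
        std::vector<cv::Rect> faces;
        cascade.detectMultiScale(rotated, faces);

        // Step 2(b): rotate each detection back and take its axis-aligned bounding box.
        for (const cv::Rect& f : faces) {
            std::vector<cv::Point2f> corners = {
                {(float)f.x, (float)f.y},
                {(float)(f.x + f.width), (float)f.y},
                {(float)(f.x + f.width), (float)(f.y + f.height)},
                {(float)f.x, (float)(f.y + f.height)}};
            std::vector<cv::Point2f> back;
            cv::transform(corners, back, inverse);
            boxes.push_back(cv::boundingRect(back));
        }
    }
    // Steps 3-4, simplified: merge overlapping detections of the same face.
    // Note that this also drops faces found at only one orientation.
    cv::groupRectangles(boxes, 1, 0.4);
    return boxes;
}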
Figure 5.2 shows an example of the rotating detector in action. A downside of the
rotating detector versus the basic detector is that the stepwise rotation increases processing
time by a constant factor, which is the total number of orientations processed, in addition to
the relatively minor time it takes to merge the resulting face regions. Thus, this extension
is only a viable option in scenarios where accuracy is prioritised over speed, and where a
rotation about the z-axis is very likely to occur frequently. Another issue is that the risk
of detecting false positives is increased as the number of orientations considered increases.
For this reason, the approach may be less useful in scenes with cluttered backgrounds.
Figure 5.2: Illustrated example of the rotating cascade detector in action. In the first step,
the image is rotated to a fixed set of orientations and face detection is applied in each. In
the second step, the detected regions are rotated back to the original image orientation and
an axis-aligned bounding box is computed for each. In the last step, the average of the
overlapping bounding boxes is computed as the final face region.
Chapter 6
Performance evaluation
In this chapter the accuracy and performance of the basic face detection and recognition
algorithms, as well as the algorithmic extensions introduced in this report, are empirically
evaluated. Also, the accuracy and performance of the basic algorithms are evaluated under
a variety of scene and imaging conditions, in order to elucidate what their strengths and
weaknesses are, and to develop recommendations for application areas and avenues for
improvement to be used in future work.
Firstly, the performance metrics used in the evaluation are introduced and explained in
detail. Secondly, the sources and properties of the various datasets used are described, and
thirdly, the setup of each individual test is explained. The final section includes both a
presentation and explanation of the results as well as an analysis and discussion.
6.1 Metrics
The task of performing face detection and recognition on a probe video with respect to a
subject gallery and producing an annotation of the identities and temporal locations of any
faces present in the probe can be viewed as a multi-label classification problem applied to
each frame of the probe. In order to prevent optimization according to a potential bias in
a certain metric, a number of different metrics will be used. Let L be the set of subject
labels present in the gallery. Let $D = x_1, x_2, \ldots, x_{|D|}$ be the sequence of frames of the probe video and let $Y = y_1, y_2, \ldots, y_{|D|}$ be the true annotation of the probe, where $y_i \subseteq L$ is the true set of subject labels for the $i$th frame. Let $H$ be a face recognition system and $H(D) = Z = z_1, z_2, \ldots, z_{|D|}$ the predicted annotation of the probe by the system. Let $t_r$ be the time it takes to play the video and $t_p$ the time it takes to perform face recognition on it. The following metrics, prominent in the literature[57][5], will be used:
– Hamming loss:
$HL(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|y_i \,\Delta\, z_i|}{|L|}$
where $\Delta$ is the symmetric difference of two sets, which corresponds to the XOR operation in Boolean logic. This metric measures the average ratio of incorrect labelings and missing labels to the total number of labels. Since this is a loss function, a Hamming loss equal to 0 corresponds to optimal performance.
– Accuracy:
$A(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|y_i \cap z_i|}{|y_i \cup z_i|}$
Accuracy symmetrically measures the similarity between $y_i$ and $z_i$, averaged over all frames. A value of 1 corresponds to optimal performance.
– Precision:
$P(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|y_i \cap z_i|}{|z_i|}$
Precision is the average proportion of correctly identified labels (true positives) among all labels predicted. A value of 1 corresponds to optimal performance.
– Recall:
$R(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|y_i \cap z_i|}{|y_i|}$
Recall is the average proportion of correctly identified labels (true positives) among all true labels. A value of 1 corresponds to optimal performance.
– F-measure:
$F(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{2\,|y_i \cap z_i|}{|z_i| + |y_i|}$
The F-measure is the harmonic mean of precision and recall and gives an aggregate description of both metrics. A value of 1 corresponds to optimal performance.
– Subset accuracy:
$SA(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \operatorname{ind}\{z_i = y_i\}$
Subset accuracy is the fraction of frames in which all subjects are correctly classified without false positives. A value of 1 corresponds to optimal performance.
– Real time factor:
$RTF(H, D) = \frac{t_p}{t_r}$
The real time factor is the ratio of the time it takes to perform recognition on the video to the time it takes to play it back. If this value is 1 or below, it is possible to perform face recognition in near-real time.
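For reference, the set-based metrics translate directly into code. The sketch below computes them for a single probe, given the true and predicted label sets per frame; the struct and function names are illustrative, and the handling of empty sets (where the formulas would divide by zero) is a convention chosen here, not prescribed by the definitions above.

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

struct Metrics {
    double hamming, accuracy, precision, recall, fMeasure, subsetAccuracy;
};

// yTrue[i] and yPred[i] are the true and predicted label sets for frame i;
// labelCount is |L|, the number of subjects in the gallery.
Metrics evaluate(const std::vector<std::set<int>>& yTrue,
                 const std::vector<std::set<int>>& yPred, std::size_t labelCount)
{
    Metrics m{0, 0, 0, 0, 0, 0};
    const double n = static_cast<double>(yTrue.size());
    for (std::size_t i = 0; i < yTrue.size(); ++i) {
        std::vector<int> inter, uni, symDiff;
        std::set_intersection(yTrue[i].begin(), yTrue[i].end(),
                              yPred[i].begin(), yPred[i].end(), std::back_inserter(inter));
        std::set_union(yTrue[i].begin(), yTrue[i].end(),
                       yPred[i].begin(), yPred[i].end(), std::back_inserter(uni));
        std::set_symmetric_difference(yTrue[i].begin(), yTrue[i].end(),
                                      yPred[i].begin(), yPred[i].end(), std::back_inserter(symDiff));

        m.hamming        += symDiff.size() / static_cast<double>(labelCount);
        m.accuracy       += uni.empty()      ? 1.0 : inter.size() / static_cast<double>(uni.size());
        m.precision      += yPred[i].empty() ? 0.0 : inter.size() / static_cast<double>(yPred[i].size());
        m.recall         += yTrue[i].empty() ? 0.0 : inter.size() / static_cast<double>(yTrue[i].size());
        m.fMeasure       += uni.empty()      ? 1.0
                            : 2.0 * inter.size() / static_cast<double>(yTrue[i].size() + yPred[i].size());
        m.subsetAccuracy += (yTrue[i] == yPred[i]) ? 1.0 : 0.0;
    }
    m.hamming /= n; m.accuracy /= n; m.precision /= n;
    m.recall /= n;  m.fMeasure /= n; m.subsetAccuracy /= n;
    return m;
}

The real time factor is computed separately by timing the recognition run and dividing by the probe's playback duration.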
6.2 Testing datasets
Several different test databases were used, for two reasons. Firstly, using several databases
with different properties, such as clutter, imaging conditions and number of subjects gives
a better overview of how different parameters affect the quality and speed of recognition.
Secondly, using standard test databases allows the results to be compared with other results
in the literature. This section lists the databases that were used along with a description of
their properties.
6.2.1 NRC-IIT
This database contains a pair of short video clips for each of 11 individuals. One or more of the files for two of the subjects could not be read, and these subjects were excluded from this evaluation. The resolution is 160x120 pixels, with the face occupying roughly 1/4 to 1/8 of the frame width. The average duration is 10-20 seconds. All of the clips were shot under approximately equal illumination conditions: uniformly distributed ceiling light and no sunlight. The subjects pose a variety of different facial expressions and head orientations. Only a single face is present in each video and the face is present for the entire duration.[26]
6.2.2 News
Contains a gallery of six short video clips of the faces of news anchors, each 12-15 seconds long, and a probe clip containing outtakes from news reports featuring two of the
subjects which is 40 seconds long. The resolution of the gallery clips varies slightly, but
is approximately 190x250 pixels. The face occupies the majority of the image with very
little background and no clutter. The subjects are speaking but have mostly neutral facial
expressions. The probe contains outtakes featuring two of the anchors in a full frame from
the original news reports. The resolution is 640x480 pixels. The background contains slight
clutter and is mostly static, but varies slightly as imagery from news stories is sometimes
displayed. In some cases, unknown faces are shown. The illumination is uniform studio
lighting without significant shadows.
6.2.3 NR
Consists of a gallery of seven subjects, and for each subject, five video clips featuring the
subject in a frontal pose, and one to five video clips featuring the subject in a profile pose.
The gallery clips contain only the subjects’ faces but also some background with varying
degrees of clutter. Each clip is one to 10 seconds long. The database also contains one 90
second probe video clip featuring a subset of the subjects in the gallery. All gallery and
probe clips were shot with the same camera. They are in color with a resolution of 640x480
pixels. The illumination and facial expressions of the subjects vary across the gallery clips.
The pose and facial expressions of the subjects in the probe vary, but the illumination is
approximately uniform. The probe features several subjects in a single frame as well as
several unknown subjects. The background is dynamic with a relatively high degree of
clutter compared to the other datasets.
6.3 Experimental setup
Three different experiments were performed. The first compares the accuracy and processing speed of the tracking extension to the basic framework face recognition algorithms as the gallery size increases; the second measures the performance of the rotation extension to the cascade detector; and the third evaluates the impact of variations in multiple imaging and scene conditions on recognition algorithm performance. This section describes the purpose and setup of each experiment. All tests were performed on an Asus N56V laptop with an Intel Core i7-3630QM CPU (four cores, eight hardware threads) at 2.40 GHz, 8 GB of RAM and an NVIDIA GeForce GT 635M graphics card. The operating system used was Ubuntu 12.04 LTS.
6.3.1 Regular versus tracking recognizers
This test was performed in order to evaluate the performance and processing speed of the
tracking extension compared to regular recognition systems. The algorithms evaluated were
Eigenfaces, Fisherfaces, Local binary pattern histograms, Wawo and the ensemble method (see chapter four) combining the other four algorithms. For each algorithm, both a frame-by-frame recognition approach and the CAMSHIFT tracking approach described in chapter five were used. All algorithms used the cascade classifier with Haar-like features
for face detection. This test was performed on the NRC-IIT database. The gallery was
extracted from the first video clip of each subject. The second video clip of each subject
was used as probes. The mean subset accuracy over all probes was computed for a number
of gallery sizes ranging from 1 to 50. The real time factor (RTF) was computed for each
gallery size by dividing the total processing time for all probes, including retraining the
algorithms with the gallery for each probe, by the sum total length of all probe video clips.
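To make the measure concrete, the RTF computation amounts to the following; this is an illustrative sketch with hypothetical numbers, not the actual evaluation code used in this chapter.

    def real_time_factor(processing_times_s, probe_durations_s):
        """Total processing time for all probes (including retraining with the
        gallery for each probe) divided by the total length of the probe clips.
        An RTF below 1.0 means the system runs faster than real time."""
        return sum(processing_times_s) / sum(probe_durations_s)

    # Hypothetical example: three probes processed in 18, 25 and 12 seconds,
    # with clip lengths of 15, 20 and 10 seconds respectively.
    print(real_time_factor([18.0, 25.0, 12.0], [15.0, 20.0, 10.0]))  # ~1.22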
6.3.2 Regular detector versus rotating detector
This test evaluates the recognition performance and processing speed of recognition systems
using the rotating extension of the cascade classifier face detector (see chapter five) with
respect to the regular classifier. The algorithms used for the evaluation are Eigenfaces,
Fisherfaces, Local binary pattern histograms, Wawo and the ensemble method (see chapter
four) combining the other four algorithms. For each algorithm, both the regular cascade classifier and the cascade classifier with the rotating extension were used. In all other regards, the test was identical to the test described in the previous section. The rotating detector used 20 different angle deviations from the original orientation, with a maximum angle of ±40° and a step size of 4°.
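A minimal sketch of such a rotating detector is given below, assuming OpenCV's Python interface and the standard frontal-face Haar cascade file. Mapping detections back to the original frame's coordinates and merging overlapping detections, which the extension described in chapter five handles, are omitted for brevity.

    import cv2

    def detect_rotated(gray, cascade, max_angle=40, step=4):
        """Run the cascade detector on rotated copies of a grayscale frame.
        With max_angle=40 and step=4 this covers 20 deviations from the
        original orientation plus the unrotated frame itself."""
        h, w = gray.shape[:2]
        detections = []
        for angle in range(-max_angle, max_angle + 1, step):
            rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
            rotated = cv2.warpAffine(gray, rot, (w, h))
            for box in cascade.detectMultiScale(rotated, scaleFactor=1.1,
                                                minNeighbors=5):
                detections.append((angle, tuple(int(v) for v in box)))
        return detections

    # Hypothetical usage with the default frontal-face cascade file.
    cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
    frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
    print(detect_rotated(frame, cascade))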
6.3.3 Algorithm accuracy in cases of multiple variable conditions
The purpose of this test is to illuminate what the obstacles to applying the system to real
life scenarios are, where many different scene and image conditions can be expected to vary
simultaneously. For this reason, each of the basic algorithms was tested on each of the three
datasets, each of which has a different set of variable image and scene conditions (see above).
For each of the three datasets, Eigenfaces, Fisherfaces, LBPH and Wawo were tested, each using a cascade detector trained for detecting frontal faces using the default cascade data included in the framework. For the NRC-IIT database, the gallery was extracted from the first video clip of each subject. The second video clip of each subject was used as probes. The mean Hamming loss, accuracy, precision, recall, F-measure and subset accuracy were computed over all probes. For the News database, the same set of measurements was
computed over its single probe. For the NR database, only the frontal gallery was used. The
same set of measurements was computed over its single probe. In each test, the maximum number of usable samples was extracted from each gallery video, as specified by the Gallery
module (see chapter four). In the case of Wawo and the ensemble method, this had to be
limited to 50 samples per video for the NRC-IIT test and 10 samples per video for the News
and NR tests due to segmentation faults occurring inside Wawo library calls when using
larger gallery sizes.
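For reference, the example-based multi-label measures used in this chapter can be computed from per-frame label sets along the lines of the sketch below, following the standard definitions surveyed in [57]. This is an illustrative reconstruction rather than the framework's evaluation code, and the scoring of frames with empty label sets is a convention that varies in the literature.

    def multilabel_scores(true_sets, pred_sets, n_labels):
        """Example-based multi-label measures averaged over all frames.
        true_sets / pred_sets hold one set of subject labels per frame;
        n_labels is the size of the label space (gallery subjects)."""
        n = len(true_sets)
        hamming = acc = prec = rec = f1 = subset = 0.0
        for y, z in zip(true_sets, pred_sets):
            hamming += len(y ^ z) / n_labels           # symmetric difference
            if not y and not z:                        # no faces, none predicted
                acc += 1; prec += 1; rec += 1; f1 += 1; subset += 1
                continue
            inter = len(y & z)
            acc += inter / len(y | z)
            prec += inter / len(z) if z else 0.0
            rec += inter / len(y) if y else 0.0
            f1 += 2 * inter / (len(y) + len(z))
            subset += 1.0 if y == z else 0.0
        return {name: value / n for name, value in [
            ("hamming_loss", hamming), ("accuracy", acc), ("precision", prec),
            ("recall", rec), ("f_measure", f1), ("subset_accuracy", subset)]}

    # Hypothetical example: two frames, three gallery subjects labelled 0, 1, 2.
    print(multilabel_scores([{0}, {1, 2}], [{0}, {1}], n_labels=3))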
6.4 Evaluation results
6.4.1 Comparison of algorithm accuracy and speed over gallery size
As figure 6.1 illustrates, the tracking extension vastly improves the accuracy of all five
algorithms. Wawo and the ensemble approach quickly reach a near-optimal level of accuracy
at 95-96% and the other algorithms catch up as the gallery size increases. Without the
tracking extension, Wawo outperforms the other algorithms for all but the smallest gallery
sizes. However, as figure 6.2 shows, the RTF of Wawo and the ensemble approach are heavily
impacted by the gallery size while the other algorithms retain an essentially constant RTF as
the gallery size increases. The tracking extension adds a relatively minor, constant increase
to the RTF of all algorithms.
These results suggest that the tracking-extended Wawo algorithm may be a suitable choice for applications where the gallery is small to medium-sized (but not too small) and processing time is not a critical factor. On the other hand, LBPH and Fisherfaces perform nearly as well, even for smaller gallery sizes, and vastly outperform Wawo in terms
of processing time. It should be noted that these results are highly dependent on the dataset
and this analysis should only be considered valid for applications that use data with similar
conditions to the test data.
Figure 6.1: The performance of each algorithm as measured by subset accuracy, as the
gallery size increases. The thin lines represent the basic frame-by-frame algorithms and the
thick lines represent the corresponding algorithm using the tracking extension.
Figure 6.2: The real time factor of each algorithm as the gallery size increases. The thin
lines represent the basic frame-by-frame algorithms and the thick lines represent the corresponding algorithm using the tracking extension.
6.4.2 Regular detector versus rotating detector
Figure 6.3 shows that the rotating extension consistently improves the accuracy of all algorithms. The degree of improvement varies slightly, but usually lies in the 0.05-0.1 range, and does not seem to be affected by the gallery size. Figure 6.4 illustrates that the extension adds a hefty cost to the processing time that is constant with respect to gallery size, but directly proportional to the total number of orientations considered by the rotation extension.
These results indicate that the rotating extension buys a modest improvement in accuracy at a large cost in processing time. The cost can be reduced by considering a smaller number of orientations; depending on the application, this may or may not reduce accuracy. For example, in a scenario where the subjects to be identified are unlikely to lean their heads to either side by more than a small amount, a large maximum angle may be wasteful. In addition, this test only considers the case where the step size is a tenth of the maximum angle. It is possible that a larger step size would yield the same accuracy, but at the time of writing this has not been tested.
Another consideration is the increased likelihood of false positives. Since the basic face
detection algorithm is performed once for each orientation, the basic probability of a false
positive is compounded by the number of orientations considered. This factor does not
appear to impair the algorithm for this particular dataset, but in cases where the basic false
positive rate is relatively high, such as in scenes with cluttered backgrounds, it may become
a greater problem.
Figure 6.3: The performance, as measured by subset accuracy, of each algorithm using the
regular cascade classifier (thin lines) compared to ones using the rotating extension (thick
lines), as the gallery size increases.
Figure 6.4: The real time factor of each algorithm as the gallery size increases. The thin lines represent algorithms using the regular cascade classifier and the thick lines represent the corresponding algorithms using the rotating extension.
6.4.3 Evaluation of algorithm accuracy in cases of multiple variable conditions
Table 6.1 shows the results of the NRC-IIT test. The subset accuracy measurement indicates
that the algorithms correctly label about 50-60% of frames. Visual inspection shows that
most of the error comes from a failure to detect faces that have been oriented away from the
camera or distorted and/or occluded. This issue is overcome using the tracking technique
as demonstrated above.
Table 6.1: NRC-IIT test results. The values of all measures besides Hamming loss are equal because, by their definitions, they equate to the subset accuracy when |z_i|, |y_i| ≤ 1.

Algorithm     Hamming loss   Accuracy   Precision   Recall     F-measure   Subset accuracy
Eigenfaces    0.104112       0.531498   0.531498    0.531498   0.531498    0.531498
Fisherfaces   0.0875582      0.605988   0.605988    0.605988   0.605988    0.605988
LBPH          0.0975996      0.560802   0.560802    0.560802   0.560802    0.560802
Wawo          0.0908961      0.590968   0.590968    0.590968   0.590968    0.590968
Ensemble      0.0933599      0.57988    0.57988     0.57988    0.57988     0.57988
Table 6.2 shows the result of testing on the News dataset. A noteworthy feature here is
that the recall is similar to the results for the NRC-IIT test, which means that about the
same fraction of true positives were identified. However, the precision is markedly lower,
which indicates a greater number of false positives. This is most likely due to the more
cluttered background in the News probe. This is corroborated by visual inspection of the
result. The error consists both of non-face background elements falsely classified as faces by
the detector and unknown faces falsely identified as belonging to the subject gallery by the
recognizers. The ensemble method performs worst according to all measures in this test.
This is most likely due to the fact that the Wawo component causes a segmentation fault
for sample sizes larger than 10 and this limitation is applied to the other algorithms as well
in the current implementation. The other algorithms perform comparatively worse at this gallery size, as demonstrated above, and thus drag down the overall performance of the ensemble.
Table 6.2: News test results. Accuracy and precision are equal because y_i ⊆ z_i whenever |y_i ∩ z_i| ≠ 0; if y_i = z_i held, accuracy and precision would also equal recall. That precision is lower than recall indicates that a number of false positives were present.

Algorithm     Hamming loss   Accuracy   Precision   Recall     F-measure   Subset accuracy
Eigenfaces    0.261373       0.484974   0.484974    0.605459   0.524676    0.367246
Fisherfaces   0.34381        0.398677   0.398677    0.520265   0.438737    0.27957
LBPH          0.309898       0.444169   0.444169    0.622002   0.50284     0.269644
Wawo          0.351944       0.340433   0.340433    0.463193   0.381086    0.219189
Ensemble      0.368211       0.301213   0.301213    0.438379   0.34654     0.166253
Table 6.3 shows the results for the NR test. Despite the fact that the training data is
of somewhat lower quality and the background is highly dynamic and cluttered, with many
unknown individuals present, Wawo and Fisherfaces performed on par with the results from
the News test, although Eigenfaces and LBPH performed worse. To various degrees, the
precision measurements indicate all methods detected a large number of false positives.
Visual inspection shows that the error is due to non-faces classified as faces, unknown
individuals identified as belonging to the gallery and, to a greater extent than for the News
test, known subjects falsely identified as other known subjects. The last issue may arise
partly from the relatively lower quality of the subject gallery but also from the more dynamic
and varied poses and facial expressions that all individuals in the probe assume. Despite the ensemble method being subject to the same gallery size limitation as in the News test, it surprisingly outperforms all other algorithms except Wawo.
Table 6.3: NR test results. Accuracy and precision are equal because y_i ⊆ z_i for all i where |y_i ∩ z_i| ≠ 0; if instead y_i = z_i, accuracy and precision would also equal recall. That precision is lower than recall indicates that a number of false positives were present.

Algorithm     Hamming loss   Accuracy   Precision   Recall     F-measure   Subset accuracy
Eigenfaces    0.288492       0.210648   0.210648    0.308333   0.24213     0.119444
Fisherfaces   0.21746        0.340046   0.340046    0.55       0.406111    0.152778
LBPH          0.263492       0.244444   0.244444    0.388889   0.291667    0.105556
Wawo          0.194444       0.389583   0.389583    0.625      0.465463    0.169444
Ensemble      0.21746        0.343981   0.343981    0.55       0.40963     0.155556
The above results indicate that a major obstacle to applying face recognition in real
life scenarios is the high rate of false positives detected in cluttered background conditions
and with unknown individuals present in the scene. This problem can be attacked from
two angles. The first is that the face detection algorithm can be improved so as to reduce
the number of non-face regions falsely detected. Besides testing other algorithms than the
cascade classifier, or using better training data, it may also be possible to preprocess the
image to optimize the conditions for face detection.
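As an illustration of such preprocessing, the sketch below normalizes contrast with CLAHE (contrast-limited adaptive histogram equalization) before running the cascade detector; this is one plausible choice rather than a step of the framework's actual pipeline, and the parameters are hypothetical.

    import cv2

    def preprocess_for_detection(frame_bgr):
        """Convert to grayscale and normalize contrast before detection.
        CLAHE is one plausible choice; the parameters are illustrative and
        would need tuning for a given application."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        return clahe.apply(gray)

    cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
    frame = cv2.imread("frame.png")
    faces = cascade.detectMultiScale(preprocess_for_detection(frame),
                                     scaleFactor=1.1, minNeighbors=5)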
The second angle would be to improve the face recognition algorithms themselves so
as to correctly label unknown individuals as such. While no formal evaluation has been
performed in this project, rudimentary investigation has indicated that it is possible to
optimize the recognition performance for a particular dataset to a certain degree by finding
an appropriate confidence level. However, as the dataset grows large, it seems likely that
these gains would diminish.
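As a sketch of this idea, the snippet below rejects a prediction as unknown when the recognizer's confidence score (a distance for OpenCV's recognizers, where lower is better) exceeds a threshold. The threshold value is purely hypothetical and, as noted above, would have to be tuned per dataset; the factory function shown belongs to the OpenCV contrib face module, whose naming differs between OpenCV versions.

    import cv2

    UNKNOWN = -1
    DISTANCE_THRESHOLD = 80.0  # hypothetical value; must be tuned per dataset

    # Assumes the OpenCV contrib face module; the model is trained elsewhere
    # on the gallery images and labels.
    recognizer = cv2.face.LBPHFaceRecognizer_create()
    # recognizer.train(gallery_images, gallery_labels)

    def predict_with_rejection(face_gray):
        """Return a gallery label, or UNKNOWN if the distance reported by
        the recognizer is too large to be trusted."""
        label, distance = recognizer.predict(face_gray)
        return UNKNOWN if distance > DISTANCE_THRESHOLD else label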
Chapter 7
Conclusion
In this report, the implementation of a standalone version of the Vidispine face recognition plugin was documented, the tradeoff between face recognition processing speed and accuracy was evaluated, and the possibility of integrating face recognition with object tracking was investigated.
Among the algorithms evaluated, Wawo performs better than the others for all but the
smallest gallery sizes. However, this comes at a great performance cost as the recognition
time scales linearly with the number of samples in the gallery. Eigenfaces outperforms
Wawo in terms of accuracy for small gallery sizes and Fisherfaces has an almost comparable
accuracy for large gallery sizes, and the processing time for both is constant with respect to
gallery size.
This implies that the best method to use depends on the requirements of the application.
If the goal is to maximize accuracy with a limited, but not too limited (>5 samples per subject), gallery, regardless of computational cost, Wawo may be the best option. If the gallery
size is limited (<25 samples per subject) but processing time should also be kept in check,
Eigenfaces would be better. Finally, if a relatively large gallery size (>30 samples per
subject) can be acquired and processing time should be minimized, Fisherfaces seems to be
the best choice.
While acquiring a large number of high-quality photographs of a single individual, especially one who is not directly available, can be very difficult, many admissible samples can easily be extracted from short video clips. In the era of the Web and real-time media streaming, this is a far more viable option than it used to be.
Due to the nature of the evaluation dataset, these recommendations are conditional on
the assumption that the probe data has an uncluttered background and constant illumination. A cluttered background in particular can be devastating for recognition performance,
as the number of false positives for both face detection and recognition rises dramatically.
Based on this evaluation, the best recommendation for dealing with these issues is simply to restrict the application scope so as not to include scenes with cluttered backgrounds and highly variable illumination.
One of the original goals was to investigate the possibilities of profile face recognition.
While the framework supports profile face recognition in principle, by supplying profile
recognition training data to the elementary detection and recognition algorithms, the performance of such approaches has not been evaluated in this report. This is mainly due to
the fact that there is hardly any suitable data for such an evaluation readily available, and
gathering such data is a time-consuming process that did not fit into the project schedule.
An approach for integrating image-based face recognition algorithms and the CAMSHIFT
object tracking algorithm was developed, the specifics of which are described in chapter five.
Using this method, the performance of the basic face recognition algorithms was improved
by approximately 35-45 percentage points. In certain contexts, the method seems to be
able to overcome some major obstacles to successful face detection and recognition, such as
partial occlusion, face deformation, pose changes and illumination variability.
7.1 Limitations of the evaluation
The majority of tests in this evaluation were performed on the NRC-IIT dataset, which
covers only a restricted set of possible scene and imaging conditions. As a consequence,
the results, discussion and recommendations are mainly applicable to similar types of data.
That is, scenes with static uncluttered backgrounds and constant illumination. Based on
the literature, this is a problem that affects the field as a whole, and many authors call for
more standardized face recognition datasets covering wider ranges of variables [6]. If more time had been available, additional data would have been gathered for a more informative evaluation. This would be made easier if the scope of the intended application area of the framework were restricted and specified in more detail, as the amount of necessary data would be reduced.
7.2 Future work
As the original aims of this project were quite broad, so are the possible future lines of
investigation. As previously mentioned, a primary issue throughout the project was the
lack of useful data to use for evaluation of the developed solutions. Specifically, in order to
systematically address the performance of the various techniques under specific conditions,
such as variability in illumination, pose, background clutter, facial expression, external facial
features such as beard, glasses or makeup, test data sets that introduce these factors one
by one, and in combination in small numbers, would be required. For a more restricted
application area, the amount of necessary test data would be limited to those conditions
that appear in the real-world scenarios in which the framework would be used. There is also
a lack of profile image and video test data relative to the amount of frontal face databases
available in the literature. An important future project could be to build an extensive test
dataset, appropriate to the intended application, according to the above specifications, to
be used to evaluate the performance of new and existing algorithms under development.
The strength and applicability of the framework can always be enhanced by adding
new algorithms for face detection, face recognition and face tracking. If a more specific
application area is selected, this would inform the choice of new algorithms to add, as
different algorithms have different strengths and weaknesses that may make them more or
less suitable for a particular application. In addition, it would be interesting to see how
the ensemble method could be improved by adding new algorithms, from different face
recognition paradigms, that complement each other’s primary weaknesses.
Being able to distinguish between known and unknown individuals is relevant to many
applications, so a future project could be to try to find more general solutions to this
problem. This is yet another instance of a problem that most likely becomes easier if the
problem domain is restricted. Another possible direction could be to investigate image
preprocessing methods to improve detection and recognition performance.
Chapter 8
Acknowledgements
I would like to thank my supervisor, Petter Ericson, for keeping me on track and providing
suggestions and insight throughout the project. I want to thank Johanna Björklund and
Emil Lundh for providing the initial thesis concept and Codemill for providing a calm,
quiet workspace. I’d also like to thank everyone at Codemill for helping me get set up and
providing feedback. Finally, I want to thank my girlfriend, my parents and my brother for
supporting me all the way.
References
[1] OpenBR (Open Source Biometric Recognition). http://openbiometrics.org/ (visited
2013-11-28).
[2] OpenCV (Open Source Computer Vision). http://opencv.org/ (visited 2013-11-28).
[3] Wawo Technology AB. http://www.wawo.com/ (visited 2013-11-28).
[4] R. Bolle A. K. Jain and S. Pankanti. Biometrics: Personal Identification in Networked
Society. Kluwer Academic Publishers, 1999.
[5] A. M. P. Canuto A. M. Santos and A. F. Neto. Evaluating classification methods
applied to multi-label tasks in different domains. International Journal of Computer
Information Systems and Industrial Management Applications, 3:218–227, 2011.
[6] A. H. El-baz A. S. Tolba and A. A. El-harby. Face recognition: A literature review.
International Journal of Signal Processing, 2(2):88–103, 2005.
[7] O. Javed A. Yilmaz and M. Shah. Object tracking: A survey. ACM Comput. Surv.,
38(4), 2006.
[8] T. Ahonen, A. Hadid, and M. Pietikäinen. Face recognition with local binary patterns. In Proc. of the European Conference on Computer Vision (ECCV), pages 469–481, 2004.
[9] P. Ho B. Heisele and T. Poggio. Face recognition with support vector machines: Global
versus component-based approach. In In Proc. 8th International Conference on Computer Vision, pages 688–694, 2001.
[10] V. Blanz and T. Vetter. Face recognition based on fitting a 3d morphable model. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 25:2003, 2003.
[11] A. L. Blum and P. Langley. Selection of relevant features and examples in machine
learning. Artificial Intelligence, 97:245–271, 1997.
[12] Gary R. Bradski. Computer vision face tracking for use in a perceptual user interface,
1998.
[13] A. Hertzmann C. Bregler and H. Biermann. Recovering non-rigid 3d shape from image
streams. In CVPR, pages 2690–2696. IEEE Computer Society, 2000.
[14] M. Oren C. P. Papageorgiou and T. Poggio. A general framework for object detection.
In Proceedings of the Sixth International Conference on Computer Vision, pages 555–.
IEEE Computer Society, 1998.
[15] A. Albiol E. Acosta, L. Torres and E. J. Delp. An automatic face detection and recognition system for video indexing applications. Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal Processing, 4:3644–3647, 2002.
[16] K. Etemad and R. Chellappa. Discriminant analysis for recognition of human face
images. Journal of Optical Society of America A, 14:1724–1733, 1997.
[17] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7(7):179–188, 1936.
[18] A. W. Fitzgibbon and A. Zisserman. Joint manifold distance: A new approach to
appearance based clustering. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR’03, pages 26–33, Washington, DC, USA, 2003. IEEE Computer Society.
[19] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning
and an application to boosting. In Proceedings of the Second European Conference on
Computational Learning Theory, pages 23–37. Springer-Verlag, 1995.
[20] P. Fua. Regularized bundle adjustment to model heads from image sequences without
calibrated data. International Journal of Computer Vision, 38:154–157, 2000.
[21] A. K. Roy Chowdhury G. Aggarwal and R. Chellappa. A system identification approach
for video-based face recognition. In ICPR (4), pages 175–178, 2004.
[22] S. Z. Li G. Guo and K. Chan. Face recognition by support vector machines. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture
Recognition 2000, FG ’00, pages 196–201, Washington, DC, USA, 2000. IEEE Computer Society.
[23] J. W. Fisher G. Shakhnarovich and T. Darrell. Face recognition from long-term observations. In In Proc. IEEE European Conference on Computer Vision, pages 851–868,
2002.
[24] Y. Gao and M. K. H. Leung. Face recognition using line edge map. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 24:764– 779, 2002.
[25] G. G. Gordon. Face recognition based on depth maps and surface curvature. In SPIE
Geometric methods in Computer Vision, pages 234–247, 1991.
[26] D. O. Gorodnichy. Video-based framework for face recognition in video. In Second
Workshop on Face Processing in Video (FPiV’05) in Proceedings of Second Canadian
Conference on Computer and Robot Vision (CRV’05), pages 330–338, 2005.
[27] C. Gürel. Development of a face recognition system. Master’s thesis, Atilim University,
2011.
[28] M. R. Lyu H.-M. Tang and I. King. Face recognition committee machine. In ICME,
pages 425–428. IEEE, 2003.
[29] C. Harris and M. Stephens. A combined corner and edge detector. In In Proc. of Fourth
Alvey Vision Conference, pages 147–151, 1988.
[30] J. Ghosn, I. J. Cox, and P. N. Yianilos. Feature-based face recognition using mixture-distance. In Proceedings of the 1996 Conference on Computer Vision and Pattern
Recognition (CVPR ’96), CVPR ’96, pages 209–216, Washington, DC, USA, 1996.
IEEE Computer Society.
[31] M.-H. Yang J. Ho and D. Kriegman. Video-based face recognition using probabilistic
appearance manifolds. In In Proc. IEEE Conference on Computer Vision and Pattern
Recognition, pages 313–320, 2003.
[32] N. Ahuja J. Weng and T. S. Huang. Learning recognition and segmentation of 3d
objects from 2d images. Proc. IEEE Int’l Conf. Computer Vision, pages 121–128,
1993.
[33] R. Jafri and H. R. Arabnia. A survey of face recognition techniques. Journal of
Information Processing Systems, 5(2):41–68, 2009.
[34] Y. Li K. Jonsson, J. Kittler and J. Matas. Learning support vectors for face verification
and recognition. In Fourth IEEE International Conference on Automatic Face and
Gesture Recognition, pages 208–213. IEEE Computer Society, 2000.
[35] H.-S. Le. Face Recognition: A Single View Based HMM Approach. PhD thesis, Umeå
University, 2008.
[36] J.-H. Lee and W.-Y. Kim. Video summarization and retrieval system using face recognition and mpeg-7 descriptors. Image and Video Retrieval, 3115:179–188, 2004.
[37] M. Levoy and P. Hanrahan. Light field rendering. In Proceedings of the 23rd Annual
Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’96, pages
31–42, New York, NY, USA, 1996. ACM.
[38] C.-J. Lin. On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12(6):1288–1298, 2001.
[39] X. Liu and T. Chen. Video-based face recognition using adaptive hidden markov models.
In CVPR (1), pages 340–345. IEEE Computer Society, 2003.
[40] D. J. Kriegman M.-H. Yang and N. Ahuja. Detecting faces in images: A survey. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 24(1):34–58, 2002.
[41] D. McCullagh. Call it super bowl face scan I. http://www.wired.com/politics/law/news/2001/02/41571 (visited 2013-11-26).
[42] M. C. Santana O. Déniz and M. Hernández. Face recognition using independent component analysis and support vector machines. In AVBPA, volume 2091 of Lecture Notes
in Computer Science, pages 59–64. Springer, 2001.
[43] K. Fukui O. Yamaguchi and K. Maeda. Face recognition using temporal image sequence.
In Proceedings of the 3rd. International Conference on Face & Gesture Recognition, FG
’98, pages 318–323, Washington, DC, USA, 1998. IEEE Computer Society.
[44] J. P. Hespanha P. N. Belhumeur and D. J. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell.,
19(7):711–720, 1997.
[45] P. J. Phillips. Support vector machines applied to face recognition. In Advances in
Neural Information Processing Systems 11, pages 803–809. MIT Press, 1999.
[46] C. J. Poelman and T. Kanade. A paraperspective factorization method for shape and
motion recovery. IEEE Trans. Pattern Anal. Mach. Intell., 19(3):206–218, 1997.
[47] V. Pavlovic R. Huang and D. N. Metaxas. A hybrid face recognition method using
markov random fields. In In Proceedings of ICPR 2004, pages 157–160, 2004.
[48] S.-Y. Kung, S.-H. Lin, and L.-J. Lin. Face recognition/detection by probabilistic decision-based neural network. IEEE Transactions on Neural Networks, 8(1):114–132, 1997.
[49] A. C. Tsoi S. Lawrence, C. L. Giles and A. D. Back. Face recognition: A convolutional
neural-network approach. IEEE Transactions on Neural Networks, 8(1):98–113, 1997.
[50] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 22:888–905, 1997.
[51] C. Stauffer and W. E. L. Grimson. Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:747–757,
2000.
[52] K.-K. Sung and T. Poggio. Learning human face detection in cluttered scenes. Computer
Analysis of Images and Patterns, 970:432–439, 1995.
[53] B. Takács. Comparing face images using the modified hausdorff distance. Pattern
Recognition, 31(12):1873–1881, 1998.
[54] A. S. Tolba. A parameter-based combined classifier for invariant face recognition.
Cybernetics and Systems, 31(8):837–849, 2000.
[55] A. S. Tolba and A. N. S. Abu-Rezq. Combined classifiers for invariant face recognition.
Pattern Anal. Appl., 3(4):289–302, 2000.
[56] Carlo Tomasi. Shape and motion from image streams under orthography: A factorization method. International Journal of Computer Vision, 9:137–154, 1992.
[57] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. Int J Data
Warehousing and Mining, 2007:1–13, 2007.
[58] M. A. Turk and A. P. Pentland. Eigenfaces for recognition. Journal of Cognitive
Neuroscience, 3(1):71–86, 1991.
[59] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. Computer Vision
and Pattern Recognition, pages 586–591, 1991.
[60] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York,
Inc., New York, NY, USA, 1995.
[61] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple
features. In CVPR, volume 1, pages 511–518. IEEE Computer Society, 2001.
[62] P. Wagner. Face recognition with OpenCV. http://docs.opencv.org/trunk/modules/contrib/doc/facerec/facerec (visited 2013-12-02).
[63] W. Zhao and R. Chellappa. Sfs based view synthesis for robust face recognition. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture
Recognition, pages 285–292, 2000.
[64] S. K. Zhou and R. Chellappa. Probabilistic human recognition from video. In ECCV
(3), volume 2352 of Lecture Notes in Computer Science, pages 681–697. Springer, 2002.