Towards a Video Annotation System using Face Recognition
Lucas Lindström
December 27, 2013
Master's Thesis in Computing Science, 30 credits
Supervisor at CS-UmU: Petter Ericson
Examiner: Fredrik Georgsson
Umeå University, Department of Computing Science, SE-901 87 Umeå, Sweden

Abstract

A face recognition software framework was developed to lay the foundation for a future video annotation system. The framework provides a unified and extensible interface to multiple existing implementations of face detection and recognition algorithms from OpenCV and Wawo SDK. The framework supports face detection with cascade classification using Haar-like features, and face recognition with Eigenfaces, Fisherfaces, local binary pattern histograms, the Wawo algorithm and an ensemble method combining the output of the four algorithms. An extension to the cascade face detector was developed that covers yaw rotations. CAMSHIFT object tracking was combined with an arbitrary face recognition algorithm to enhance face recognition in video. The algorithms in the framework and the extensions were evaluated on several different test databases with different properties in terms of illumination, pose, obstacles, background clutter and imaging conditions. The results of the evaluation show that the algorithmic extensions provide improved performance over the basic algorithms under certain conditions.

Contents

1 Introduction
  1.1 Report layout
  1.2 Problem statement
  1.3 Goals
  1.4 Methods
  1.5 Related work

2 Introduction to face recognition and object tracking
  2.1 Preliminaries
  2.2 Face detection
    2.2.1 Categories of techniques
    2.2.2 Cascade classification with Haar-like features
  2.3 Face identification
    2.3.1 Difficulties
    2.3.2 Categories of approaches
    2.3.3 Studied techniques
    2.3.4 Other techniques
  2.4 Face recognition in video
    2.4.1 Multiple observations
    2.4.2 Temporal continuity/Dynamics
    2.4.3 3D model
  2.5 Object tracking
    2.5.1 Object representation
    2.5.2 Image features
    2.5.3 Object detection
    2.5.4 Trackers
3 Face recognition systems and libraries
  3.1 OpenCV
    3.1.1 Installation and usage
  3.2 Wawo SDK
    3.2.1 Installation and usage
  3.3 OpenBR
    3.3.1 Installation and usage

4 System description of standalone framework
  4.1 Detectors
    4.1.1 CascadeDetector
    4.1.2 RotatingCascadeDetector
  4.2 Recognizers
    4.2.1 EigenFaceRecognizer
    4.2.2 FisherFaceRecognizer
    4.2.3 LBPHRecognizer
    4.2.4 WawoRecognizer
    4.2.5 EnsembleRecognizer
  4.3 Normalizers
  4.4 Techniques
    4.4.1 SimpleTechnique
    4.4.2 TrackingTechnique
  4.5 Other modules
    4.5.1 Annotation
    4.5.2 Gallery
    4.5.3 Renderer
  4.6 Command-line interface
    4.6.1 Options

5 Algorithm extensions
  5.1 Face recognition/object tracking integration
    5.1.1 Backwards tracking
  5.2 Rotating cascade detector

6 Performance evaluation
  6.1 Metrics
  6.2 Testing datasets
    6.2.1 NRC-IIT
    6.2.2 News
    6.2.3 NR
  6.3 Experimental setup
    6.3.1 Regular versus tracking recognizers
    6.3.2 Regular detector versus rotating detector
    6.3.3 Algorithm accuracy in cases of multiple variable conditions
  6.4 Evaluation results
    6.4.1 Comparison of algorithm accuracy and speed over gallery size
    6.4.2 Regular detector versus rotating detector
    6.4.3 Evaluation of algorithm accuracy in cases of multiple variable conditions

7 Conclusion
  7.1 Limitations of the evaluation
  7.2 Future work

8 Acknowledgements

References

List of Figures

2.1 Example features relative to the detection window.
2.2 Eigenfaces, i.e., visualizations of single eigenvectors.
2.3 The first four Fisherfaces from a set of 100 classes.
2.4 Binary label sampling points at three different radiuses.
2.5 A given sampling point is labeled 1 if its intensity value exceeds that of the central pixel.
2.6 A number of object shape representations.
2.7 CAMSHIFT in action.
4.1 Conceptual view of a typical application.
4.2 IDetector interface UML diagram.
4.3 IRecognizer interface UML diagram.
4.4 INormalizer interface UML diagram.
4.5 ITechnique interface UML diagram.
4.6 SimpleTechnique class UML diagram.
4.7 TrackingTechnique class UML diagram.
4.8 Gallery class UML diagram.
4.9 Renderer class UML diagram.
5.1 Example illustrating the face recognition/tracking integration.
5.2 Illustrated example of the rotating cascade detector in action.
6.1 The performance of each algorithm as measured by subset accuracy, ...
6.2 The real time factor of each algorithm as the gallery size increases.
6.3 The performance, as measured by subset accuracy, ...
6.4 The real time factor of each algorithm as the gallery size increases.

List of Tables

2.1 Combining rules for assembly-type techniques.
6.1 NRC-IIT test results.
6.2 News test results.
6.3 NR test results.

Chapter 1

Introduction

Face recognition is finding more and more applications in modern society as time and technology progress. Traditionally, the main application area has been biometrics for security and law enforcement purposes, similar to fingerprints. Lately, it has also been used for crime prevention by identifying suspects in live video feeds[4][41]. With the rise of the world wide web, and Web 2.0 in particular, an application area that is more relevant to the general public has emerged: the automatic annotation of metadata for images and video.
By automatically analyzing and attaching metadata to image and video files, end users are given the power to search and sort among them more intelligently and efficiently.

Codemill AB is an Umeå-based software consultancy with around 25 employees and an annual turnover of 12.9 million SEK as of 2011. Codemill developed a face recognition plugin for the media asset management platform of a client, Vidispine AB, and retained ownership of the plugin. The company now wants to extract the face recognition functionality into a separate product for sophisticated, automated annotation and searching of video content. The end goal is a product that combines face and voice recognition for identifying the individuals present in a video clip, speech recognition for automatic subtitling, and object recognition for detecting and identifying significant signs, e.g. letters or company logos. Such a product could have broad application areas, from automatically annotating recordings of internal company meetings for easy cataloguing to annotating videos uploaded to the web to increase the power of search engines. A first step towards that goal is the extraction of the existing face recognition functionality of the Vidispine platform into a standalone application that will serve as the basis for the continued development of the future product.

The focus of this thesis lies foremost in the extraction of the Vidispine face recognition module into a standalone software package. A secondary goal was to attempt to improve the accuracy and/or performance of the existing face recognition system using existing software libraries. In particular, possibilities for utilizing object tracking and profile face recognition were to be explored.

1.1 Report layout

Chapter one gives an introduction to the background of the project, its purpose and its goals. The specific problem being addressed is described and an overview of the methods employed is given. A short summary of related work investigated over the course of the project is also presented.

Chapter two provides a quick introduction to the theory behind face recognition systems. The general problems of face detection and face recognition are described, along with brief descriptions of the most common approaches to solving them. This chapter also includes a brief introduction to object tracking.

Chapter three lists and describes the most common existing face recognition libraries and systems. Special emphasis is given to OpenCV and Wawo, which are the libraries evaluated in this report.

Chapter four gives a detailed system description of the face recognition system developed in the course of this project. In particular, the modular nature of the system is described, as well as how it can be extended with additional algorithms and techniques in the future.

Chapter five describes an original integration of face recognition algorithms and object tracking, and discusses its merits and flaws. This chapter also describes an extension to basic face detection techniques that rotates the input images prior to detection.

Chapter six describes the methods, metrics and test data used in the evaluation of the different algorithms in the system implementation. The results are presented and discussed.

Chapter seven summarizes the conclusions drawn from the results of the evaluation, discusses problems encountered over the course of the project, and gives suggestions for future work.
1.2 Problem statement

The primary task of this project was to extract the Vidispine face recognition plugin module into a standalone application. Possibilities for improving the accuracy and performance of the system were to be investigated and different options systematically evaluated. In practice, this would mainly consist of finding existing face recognition libraries and evaluating their relative accuracy and performance. The research questions addressed in this report are:

1. Using currently available face recognition libraries, and both standard test databases and original test databases suitable for the intended application, what is the optimal tradeoff between accuracy and performance for the task of face recognition?

2. Can frontal face recognition and profile face recognition be combined to improve the total accuracy, and at what performance cost?

3. Can face detection and recognition be combined with object tracking forwards and backwards in time to improve accuracy, and at what performance cost?

1.3 Goals

The first goal of this project is to extract the Vidispine face detection and recognition plugin into a standalone application. The system design of the application should be highly modular, to allow for low-cost replacement of the underlying libraries. The application should accept a gallery of face images or videos for a set of subjects, as well as a probe video. The output is an annotation of the probe video, describing which subjects are present at different points in time.

The second goal is to conduct an evaluation of the tradeoff between performance and accuracy of a number of common libraries and algorithms, for different parameter configurations and under different scene and imaging conditions. The third goal is to investigate the possibility of combining frontal face recognition with profile recognition to improve the total recognition accuracy, and to determine what the relative performance of such a method would be. The final goal is to combine face detection and recognition with object tracking forwards and backwards in time to improve accuracy and to possibly cover parts of the video during which the face is partially or completely occluded.

1.4 Methods

A study of the literature on face detection, recognition and tracking is performed to gain an understanding of the inner workings of the libraries, the challenges to successful face detection and recognition, the significance of the parameters of the different algorithms, and how they can be used to improve the accuracy and performance of the system.

The standalone application is written in C++ for several reasons. To start with, the original Vidispine plugin was written in C++, and using the same language makes it possible to reuse some code. In addition, C++ is widely considered to be a good choice for performance-intensive applications while still giving the programmer the tools to create scalable, high-level designs. Finally, since C++ is a massively popular programming language, the likelihood of finding compatible face detection, face recognition, object tracking and image processing libraries is high.

Existing test databases and protocols are investigated in order to produce results that can be compared with the existing literature. To the extent that it is possible, the evaluation is performed with standard methods, but when necessary, original datasets that resemble the intended use cases are created and used.
The optimal configuration of libraries, algorithms and parameters is implemented as the default of the resulting system, for presentation and live usage purposes.

1.5 Related work

In his master's thesis, Cahit Gürel presented a face recognition system including subsystems for image normalization, face detection and face recognition using a feed-forward artificial neural network[27]. Similarly to the present work, Gürel aimed at creating a complete integrated software system for face identification instead of simply presenting a single algorithm. Unlike the present work, however, Gürel's system does not support different choices of method for each step in the face identification pipeline.

Hung-Son Le, in his Ph.D. thesis, presented a scalable, distributed face database and identification system[35]. The system provides the entire face identification pipeline and spreads the various phases, such as storage, user interaction, detection and recognition, over different processes, allowing different physical servers to handle different tasks. This system only implements Le's original algorithms, while the present work interfaces with different underlying libraries implementing a variety of existing algorithms. These can easily be combined in a multitude of configurations according to the requirements of the intended application.

Acosta et al.[15] presented an integrated face detection and recognition system customized for video indexing applications. The face detection is performed by an original algorithm based on segmenting the input image into color regions and using a number of constraints such as shape, size, overlap, texture and landmark features to distinguish face from non-face. The recognition stage consists of a modified Eigenfaces approach based on storing multiple views of each gallery subject. Acosta's design and choice of algorithms are tuned to the task of face recognition in video, but again provide only a single alternative.

Chapter 2

Introduction to face recognition and object tracking

Face recognition is a field that deals with the problem of identifying or verifying the identity of one or more persons in either a static image or a sequence of video frames by making a comparison with a database of facial images. Research has progressed to the point where various real-world applications have been developed and are in active use in different settings. The typical use case has traditionally been biometrics in security systems, similar to fingerprint or iris analysis, but the technology has also been deployed for crime prevention measures with limited success[4][41]. Recently, face recognition has also been used for web searching in different contexts[15][36].

The complexity of the problem varies greatly depending on the conditions imposed by the intended application. In the case of identity verification, the user can be assumed to be cooperative and to make an identity claim. Thus, the incoming probe image only needs to be compared to a small subset of the database, as opposed to the case of recognition, where the probe will be compared to a potentially very large database. On the other hand, an authentication system needs to operate in near real-time to be acceptable to users, while some recognition systems can operate over much longer time frames.

In general, face recognition can be divided into three main steps, although depending on the application, not all steps may be required:
1. Detection, the process of detecting a face in a potentially cluttered image.

2. Normalization, which involves transforming, filtering and converting the probe image into whatever format the face database is stored in.

3. Identification, the final step, in which the normalized probe is compared to the face database.

2.1 Preliminaries

The following general notation is used in this thesis: $x$ and $y$ represent image coordinates, $I$ is an intensity image of dimensions $r \times c$, and $I(x, y)$ is the intensity at position $(x, y)$. $\vec{\Gamma}$ is the $rc$-dimensional vector acquired by concatenating the rows of an image. $i$, $j$ and $k$ represent generic sequence indices and $l$, $m$, $n$ sequence bounds. $\mathrm{ind}\{P\}$ is the indicator function, which equals 1 if proposition $P$ is true, and otherwise equals 0.

2.2 Face detection

In order to perform face recognition, the face must first be located in the probe image. The field of face detection deals with this problem. The main task of face detection can be defined as follows: given an input image, determine if a face is present and, if so, its location and boundaries. Many of the factors that complicate this problem are the same as for recognition:

– Pose: The orientation of the face relative to the camera may vary.

– Structural components: Hairstyle, facial hair, glasses or other accessories can vary greatly between individuals.

– Facial expression: A person can wear a multitude of facial expressions, like smiling, frowning, screaming, etc.

– Occlusion: Other objects, including other faces, can partially occlude the face.

– Imaging conditions: Illumination and camera characteristics can vary between images.

The following sections describe the different categories of face detection techniques and the technique primarily used in this project, cascade classification with Haar-like features.

2.2.1 Categories of techniques

Techniques that deal with detecting faces in single intensity or color images can be roughly classified into the following four categories[40]:

– Knowledge-based methods: These methods utilize human knowledge of what constitutes a face. Formal rules are defined based on human intuitions of facial properties, which are used to differentiate regions that contain faces from those that do not.

– Feature invariant approaches: Approaches of this type attempt to first extract facial features that are invariant under differing conditions from an image and then infer the presence of a face based on those.

– Template matching methods: Standard face pattern templates are manually constructed and stored, either of the entire face or of separate facial features. Detection is based on correlations computed between input images and the stored patterns.

– Appearance-based methods: These methods differ from template matching methods in that instead of manually constructing templates, the templates are learned from a set of training images in order to capture facial variability.

It should be noted that not all techniques fall neatly into a single category; some clearly overlap two or more categories. However, these categories still provide a useful conceptual structure for thinking about face detection methods.

2.2.2 Cascade classification with Haar-like features

A very popular method of face detection, and object detection in general, is the cascade classifier with Haar-like features introduced by Viola and Jones in 2001[61].
The concept is to characterize a subwindow of an image with a sequence of simple classifiers, each consisting of one or more features, described below. Each level in the cascade is constructed by selecting the most distinguishing features out of all possible features using the AdaBoost algorithm. Each individual classifier in the cascade performs relatively poorly, but in concert the cascade achieves very good detection rates. Several properties of this algorithm make it very efficient, such as immediately discarding subwindows that are rejected by a classifier early in the sequence, and computing the value of a simple classifier in constant time on a specialized image representation.

Haar-like features

The features used by the method are illustrated in figure 2.1. They are called "Haar-like" because they are reminiscent of the Haar basis functions which have been used previously[14]. The value of each feature is the sum of the pixel intensities in the dark rectangles subtracted from the sum of the intensities in the white rectangles. The features can be of any size within a detection window of fixed dimensions; the original paper used 24×24 pixels. In this case, the total number of features is approximately 180,000.

Figure 2.1: Example features relative to the detection window.

A classifier based on a single feature is defined as

$$h_j(W) = \mathrm{ind}\{p_j f_j(W) < p_j \theta_j\}$$

where $f_j$ is the feature, $W$ is the detection window, $\theta_j$ the threshold and $p_j$ the parity indicating the direction of the inequality sign. The false negative and false positive rates of the classifier can be modulated by varying the threshold, which will become important later.

Integral image

The features described above can be computed in constant time using a specialized image representation called an integral image. The value of the integral image at location $(x, y)$ is simply the sum of the pixels above and to the left of that location, or

$$II(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y').$$

Using this image representation, the sum of the pixels in an arbitrary rectangle can be computed with only four array references. Because the rectangles in the features are adjacent, a feature with two rectangles can be computed with six array references, a feature with three rectangles with eight references and a feature with four rectangles with only nine references. The integral image can be computed in a single pass using the recurrence relations

$$s(x, y) = s(x, y - 1) + I(x, y)$$
$$II(x, y) = II(x - 1, y) + s(x, y)$$

where $s(x, y)$ is the cumulative column sum, $s(x, -1) = 0$ and $II(-1, y) = 0$.
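To make the constant-time evaluation concrete, the following is a minimal sketch (illustrative helper code, not part of the thesis framework) that builds an integral image using the recurrences above and evaluates a rectangle sum with exactly four array references; a two-rectangle feature then follows from two such sums. OpenCV exposes the same representation through cv::integral.

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>

// Integral image with a one-element zero border, so that ii[y][x] holds the
// sum of I(x', y') for all x' < x, y' < y; the border encodes the boundary
// conditions s(x, -1) = 0 and II(-1, y) = 0.
std::vector<std::vector<long long>> integralImage(
    const std::vector<std::vector<uint8_t>>& img) {
    const std::size_t rows = img.size(), cols = img[0].size();
    std::vector<std::vector<long long>> ii(
        rows + 1, std::vector<long long>(cols + 1, 0));
    for (std::size_t x = 0; x < cols; ++x) {
        long long colSum = 0;                          // s(x, -1) = 0
        for (std::size_t y = 0; y < rows; ++y) {
            colSum += img[y][x];                       // s(x, y) = s(x, y-1) + I(x, y)
            ii[y + 1][x + 1] = ii[y + 1][x] + colSum;  // II(x, y) = II(x-1, y) + s(x, y)
        }
    }
    return ii;
}

// Sum of the pixels in the w-by-h rectangle with top-left corner (x, y),
// computed with exactly four references into the integral image.
long long rectSum(const std::vector<std::vector<long long>>& ii,
                  std::size_t x, std::size_t y, std::size_t w, std::size_t h) {
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x];
}

// A two-rectangle Haar-like feature: the difference between the sums of the
// left and right halves of the window (which half counts as "dark" is only a
// matter of sign convention).
long long twoRectFeature(const std::vector<std::vector<long long>>& ii,
                         std::size_t x, std::size_t y, std::size_t w, std::size_t h) {
    return rectSum(ii, x, y, w / 2, h) - rectSum(ii, x + w / 2, y, w / 2, h);
}
```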
AdaBoost learning

Given a set of positive and negative training samples of the same size as the detection window, in this case 24×24 pixels, we want to select the subset of the roughly 180,000 features that best distinguishes between them. We do this using the generic machine learning meta-algorithm AdaBoost[19], which is also used in conjunction with many other algorithms to improve their performance. The general idea of the algorithm is to build classifiers that are tweaked in favor of samples misclassified by previous classifiers. This is done by assigning a weight to each sample, initially equal for all samples, and in each round selecting the feature that minimizes the weighted error of the predictions. The weights are then adjusted so that the samples that were misclassified by the selected classifier receive a greater weight, and in subsequent rounds classifiers that are able to correctly classify those samples become more likely to be selected. The resulting set of features is then integrated into a composite classifier:

1. Given a set of sample images $(I_1, b_1), \ldots, (I_n, b_n)$ where $b_i = 0, 1$ for negative and positive samples respectively.

2. Initialize the weights $w_{1,i} = \frac{1}{2m}, \frac{1}{2l}$ for $b_i = 0, 1$ respectively, where $m$ and $l$ are the number of negative and positive samples respectively.

3. For rounds $t = 1, \ldots, T$:

   (a) Normalize the weights
   $$w_{t,i} \leftarrow \frac{w_{t,i}}{\sum_{j=1}^{n} w_{t,j}}$$
   so that $w_t$ is a probability distribution.

   (b) For each feature $j$, train a classifier $h_j$. The error is evaluated with respect to $w_t$,
   $$\epsilon_j = \sum_i w_{t,i} \, |h_j(I_i) - b_i|.$$

   (c) Choose the classifier $h_t$ with the lowest error $\epsilon_t$.

   (d) Update the weights:
   $$w_{t+1,i} = w_{t,i} \beta_t^{1-e_i}$$
   where $e_i = 0$ if sample $I_i$ is classified correctly, $e_i = 1$ otherwise, and $\beta_t = \frac{\epsilon_t}{1-\epsilon_t}$.

4. The final composite classifier is
   $$h(I) = \mathrm{ind}\Big\{\sum_{t=1}^{T} \alpha_t h_t(I) \ge \frac{1}{2}\sum_{t=1}^{T} \alpha_t\Big\}$$
   where $\alpha_t = \log\frac{1}{\beta_t}$.

Training the cascade

As was previously mentioned, the cascade consists of a sequence of classifiers. Each classifier is applied in turn to the detection window, and if any one rejects it, the detection window is immediately discarded. This is desirable because the large majority of detection windows will not contain a face, and a large amount of computation time can be saved by discarding true negatives early. For this reason, it is important for each individual stage to have a very low false negative rate, as this rate compounds as the window is passed down the cascade. For example, in a 32-stage cascade, each stage needs a detection rate of 99.7% to achieve a total detection rate of 90%. However, the reverse applies to the false positive rate, which means that each stage can have a fairly high false positive rate and still achieve a low compounded rate. As previously stated, these rates can be modulated by modifying the threshold parameter, and improved by adding additional features (i.e. running more AdaBoost rounds). However, the total performance of the cascade classifier is highly dependent on the number of features, so in order to maintain efficiency we would like to keep this number low. Thus, we select a desired final false positive rate and a required false positive rate $\gamma$ per stage, and run the AdaBoost method for the number of rounds required to achieve a false negative rate close to 0% and a false positive rate of $\gamma$ once the threshold $\theta$ has been modulated. The rates are determined by testing the classifier on a validation set. For the first stage, the entire sample set is used and a very low number of features is likely to be needed. The samples used for the next stage are those that the first stage classifier misclassified, which are likely to be "harder" and thus require more features to achieve the desired rates. This is acceptable because the large majority of detection windows will be discarded by the earliest stages, which are also the fastest. We keep adding stages until the final desired detection/false positive rate has been achieved.

Since the value of a feature can be computed in constant time regardless of its size, the resulting classifier has the interesting property of being scalable to any size. When we apply the detector in practice, we can scan the input image by placing the detection window at different locations and scaling it to different sizes. Thus, we can easily trade performance for accuracy by doing a more or less coarse scan.
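The following minimal sketch shows such a multi-scale scan using OpenCV's CascadeClassifier with a pretrained frontal-face cascade (OpenCV 2.4-era headers); the cascade file path and the scale factor, neighbor and minimum size parameters are illustrative defaults rather than the configuration evaluated in this thesis.

```cpp
#include <opencv2/objdetect/objdetect.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) return 1;

    // Pretrained frontal-face cascade shipped with OpenCV; the path is illustrative.
    cv::CascadeClassifier cascade;
    if (!cascade.load("haarcascade_frontalface_alt.xml")) {
        std::cerr << "Could not load cascade file" << std::endl;
        return 1;
    }

    cv::Mat image = cv::imread(argv[1]);
    if (image.empty()) return 1;

    cv::Mat gray;
    cv::cvtColor(image, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);   // simple illumination normalization

    // scaleFactor 1.1 and minNeighbors 3 control how coarse the scan is and
    // therefore trade speed against accuracy, as discussed above.
    std::vector<cv::Rect> faces;
    cascade.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(30, 30));

    for (std::size_t i = 0; i < faces.size(); ++i)
        std::cout << "face at (" << faces[i].x << ", " << faces[i].y << "), size "
                  << faces[i].width << "x" << faces[i].height << std::endl;
    return 0;
}
```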
2.3 Face identification

In this section, the process of determining the identity of a detected face in an image, or of verifying an identity claim, is introduced. First, the main obstacles to successful identification are discussed and the various categories of approaches are described. After that, a detailed technical description of the techniques used in this project is given. Finally, other techniques are described briefly.

2.3.1 Difficulties

A variety of factors can make the problem of facial recognition or verification more difficult. The illumination of the probe image commonly varies greatly, and this can cripple the performance of certain techniques. For some use cases the user can be assumed to look directly at the camera, but in many others the view angle could be different, and also vary. The performance of some techniques depends on the pose of the face being at a certain angle, and such techniques are more or less sensitive to deviations from the preferred angle. In some scenarios the subject cannot be relied on to have a neutral facial expression, and some techniques are very sensitive to this complication. It might also be of interest to allow for variation in the style of the face, such as facial hair, hairstyle, sunglasses or articles of clothing. Any combination of these factors might potentially need to be dealt with as well. Many solutions to these issues have been proposed, and some techniques are markedly better at dealing with certain types of variation. In general, the performance of face recognition systems decreases significantly whenever multiple sources of variation are combined in a single probe. When conditions are ideal, however, current techniques work very well.

2.3.2 Categories of approaches

Techniques for face recognition can be classified in a multitude of ways. Some of the most common categorizations are briefly described below[6].

Fully versus partially automatic

A system that performs all three steps listed earlier, detection, normalization and identification, is referred to as fully automatic. It is given only a facial image and performs the recognition process unaided. A system that assumes the detection and normalization steps have already been performed is referred to as partially automatic. Commonly, it is given a facial image and the coordinates of the centers of the eyes.

Static versus video versus 3D

Methods can be subdivided by the type of input data they utilize. The most basic form of recognition is performed on a single static image. It is the most widespread approach both in the literature and in real-world applications. Recognition can also be applied to video sequences, which give the extra advantage of multiple perspectives and possibly imaging conditions, as well as temporal continuity. Some scanners, such as infrared cameras, can even provide 3D geometric data, which some techniques make use of.

Frontal versus profile versus view-tolerant

Some techniques are designed to handle only frontal images. This is the classical approach, and the alternatives are more recent developments. View-tolerant techniques allow for a variety of poses and are often more sophisticated, taking the underlying geometry, physics and statistics into consideration.
Techniques that handle profile images are rarely used for stand-alone applications, but can be useful for coarse pre-searches to reduce the computational load of a more sophisticated technique, or in combination with another technique to improve recognition precision.

Global versus component-based approach

A global approach is one in which a single feature vector is computed based on the entire face and fed as input to a classifier. These approaches tend to be very good at classifying frontal images. However, they are not robust to pose changes, since global features tend to be sensitive to translation and rotation of the face. This weakness can be addressed by aligning the face prior to classification. The alternative to the global approach is to classify local facial components independently of each other, thus allowing a flexible geometrical relation between them. This makes component-based techniques naturally more robust to pose changes.

Invariant features versus canonical forms versus variation-modeling

As has been previously stated, variation in appearance depending on illumination, pose, facial hair, etc., is the central issue in performing face recognition. Approaches to dealing with it can be divided into three main categories. The first focuses on utilizing features that are invariant to the changes being studied. The second seeks either to normalize away the variation using clever image processing or to synthesize a canonical or prototypical version of the probe image and perform classification on that. The third attempts to create a parameterized model of the variation and estimate the parameters for a given probe.

2.3.3 Studied techniques

This section gives an overview of the major face recognition techniques that have been evaluated in this report, and describes the advantages and disadvantages of each.

Eigenfaces

The Eigenfaces method is one of the earliest successful and most thoroughly investigated approaches to face recognition[59]. Also known as the Karhunen-Loève expansion or eigenpictures, it makes use of principal component analysis (PCA) to efficiently represent pictures of faces. A set of eigenfaces is generated by performing PCA on a large set of images representing human faces. Informally, the eigenfaces can be considered a set of "standardized face ingredients" derived by statistical analysis of a set of real faces. For example, a real face could be represented by the average face plus 7% of eigenface 1, 53% of eigenface 2 and -3% of eigenface 3. Interestingly, only a few eigenfaces combined are required to arrive at a fair approximation of a real human face. Since an individual face is represented only by a vector of weights, one for each eigenface, this representation is highly space-efficient. Empirical results show that eigenfaces are robust to variations in illumination, less so to variations in orientation and even less to variations in size[33], but despite this, illumination normalization is usually required in practice[6].

Figure 2.2: Eigenfaces, i.e., visualizations of single eigenvectors.

Mathematically, we wish to find the principal components of the distribution of faces, represented by the covariance matrix of the face images. These eigenvectors can be thought of as the primary distinguishing features of the images.
Each pixel element contributes to a lesser or greater extent to each eigenvector, and this allows us to visualize each eigenvector as a ghostly face-like image, which we call an eigenface (see figure 2.2). Each image in the gallery can be represented exactly as a linear combination of all eigenfaces, but it can also be approximated by combining only a subset of the eigenvectors. The best approximation is achieved by using the eigenvectors with the largest eigenvalues, as they account for most of the variance in the gallery set. This can be used to improve computational efficiency without necessarily losing much precision. The best $M'$ eigenvectors span an $M'$-dimensional subspace of all possible images, a "face space"[59].

Algorithm

The algorithm can be summarized as follows:

1. Acquire the gallery set and compute its eigenfaces, which define the face space.

2. When given a probe image, project it onto each of the eigenfaces in order to compute a set of weights that represents it in terms of those eigenfaces.

3. Determine if the image contains a known face by checking whether it is sufficiently close to some gallery face class, or unknown if the distance exceeds some threshold.

Let the gallery set of face images be $\vec{\Gamma}_1, \vec{\Gamma}_2, \vec{\Gamma}_3, \ldots, \vec{\Gamma}_n$. The average face of the set is defined by $\vec{\Psi} = \frac{1}{n}\sum_{i=1}^{n} \vec{\Gamma}_i$. Each face differs from the average by the vector $\vec{\Phi}_i = \vec{\Gamma}_i - \vec{\Psi}$. This set of vectors is then subjected to principal component analysis, which seeks a set of $M$ orthonormal vectors $\vec{u}_j$ and their associated eigenvalues $\lambda_j$. The vectors $\vec{u}_j$ and scalars $\lambda_j$ are the eigenvectors and eigenvalues, respectively, of the covariance matrix

$$C = \frac{1}{n}\sum_{k=1}^{n} \vec{\Phi}_k \vec{\Phi}_k^T = AA^T \quad \text{where} \quad A = [\vec{\Phi}_1\, \vec{\Phi}_2 \ldots \vec{\Phi}_n].$$

The matrix $C$ is $rc \times rc$, and computing its eigenvectors and eigenvalues directly is intractable for typical images. However, this can be worked around by solving a smaller $n \times n$ matrix problem and taking linear combinations of the resulting vectors (see [58] for details). An arbitrary number of eigenvectors $M'$ with the largest associated eigenvalues are selected. The probe image $\vec{\Gamma}$ is transformed into its eigenface components by the simple operation $\omega_k = \vec{u}_k^T(\vec{\Gamma} - \vec{\Psi})$ for $k = 1, \ldots, M'$. The weights form a vector $\vec{\Omega} = [\omega_1\, \omega_2 \ldots \omega_{M'}]^T$ that describes the contribution of each eigenvector in representing the input image. This vector is then used to determine which face class best describes the probe. The simplest method is to select the class $l$ that minimizes the Euclidean distance $\epsilon_l = \|\vec{\Omega} - \vec{\Omega}_l\|$, where $\vec{\Omega}_l$ is the vector describing the $l$th face class, provided the distance falls below some threshold $\theta$. Otherwise the face is classified as "unknown".
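For a usage-level illustration, the following minimal sketch uses the FaceRecognizer interface from OpenCV 2.4's contrib module, which implements exactly this projection-and-nearest-class scheme; the gallery file names, the number of components and the distance threshold are illustrative only, and newer OpenCV releases expose the same class through the opencv_contrib face module instead.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/contrib/contrib.hpp>  // cv::createEigenFaceRecognizer in OpenCV 2.4
#include <vector>
#include <iostream>

int main() {
    // Gallery of equally sized, grayscale, normalized face images, with one
    // integer label per subject. The file names are hypothetical; in the
    // framework the Gallery module supplies these images.
    std::vector<cv::Mat> gallery;
    std::vector<int> labels;
    gallery.push_back(cv::imread("subject0_frontal.png", 0));  // 0 = load as grayscale
    labels.push_back(0);
    gallery.push_back(cv::imread("subject1_frontal.png", 0));
    labels.push_back(1);

    // Keep at most 80 eigenfaces (M'); probes whose distance to the nearest
    // class exceeds 5000 are reported as unknown (label -1). Both values are
    // illustrative, not the thesis configuration.
    cv::Ptr<cv::FaceRecognizer> model = cv::createEigenFaceRecognizer(80, 5000.0);
    model->train(gallery, labels);

    cv::Mat probe = cv::imread("probe.png", 0);  // must match the gallery image size
    int predictedLabel = -1;
    double distance = 0.0;
    model->predict(probe, predictedLabel, distance);
    std::cout << "predicted class " << predictedLabel
              << " at distance " << distance << std::endl;
    return 0;
}
```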
Fisherfaces

The Eigenfaces method projects face images onto a low-dimensional subspace whose axes capture the greatest variance of the input data. This is desirable, but not necessarily optimal for classification purposes. For example, the difference in facial features between two individuals is a type of variance that one would like to capture, but the difference in illumination between two images of the same individual is not. A different but related approach is to project the input image onto a subspace which minimizes within-class variation but maximizes inter-class variation. This can be achieved by applying linear discriminant analysis (LDA), a technique that traces back to work by R. A. Fisher[17]. The resulting method is thus called Fisherfaces[44].

Given $C$ classes, assume that the data in each class follow homoscedastic normal distributions (i.e., each class is normally distributed and the covariance matrices of all classes are equal). We denote this $\vec{\Gamma}_i \sim N(\vec{\mu}_i, \Sigma)$ for a sample of class $i$. We want to find a subspace of the face space which minimizes the within-class variation and maximizes the between-class variation. Within-class differences can be estimated by the within-class scatter matrix, which is given by

$$S_w = \sum_{j=1}^{C}\sum_{i=1}^{n_j} (\vec{\Gamma}_{ij} - \vec{\mu}_j)(\vec{\Gamma}_{ij} - \vec{\mu}_j)^T$$

where $\vec{\Gamma}_{ij}$ is the $i$th sample of class $j$, $\vec{\mu}_j$ is the mean of class $j$, and $n_j$ is the number of samples in class $j$. Likewise, the between-class differences are computed using the between-class scatter matrix,

$$S_b = \sum_{j=1}^{C} (\vec{\mu}_j - \vec{\mu})(\vec{\mu}_j - \vec{\mu})^T$$

where $\vec{\mu}$ is the mean of all classes. We now want to find the matrix $V$ for which $\frac{|V^T S_b V|}{|V^T S_w V|}$ is maximized. The columns $\vec{v}_i$ of $V$ correspond to the basis vectors of the desired subspace. This can be done by the generalized eigenvalue decomposition

$$S_b V = S_w V \Lambda$$

where $\Lambda$ is the diagonal matrix of the corresponding eigenvalues of $V$. The eigenvectors of $V$ associated with non-zero eigenvalues are the Fisherfaces[16][44].

Figure 2.3: The first four Fisherfaces from a set of 100 classes.

Local binary pattern histograms

The LBP histograms approach builds on the idea that a face can be viewed as a composition of local subpatterns that are invariant to monotonic grayscale transformations[62]. By identifying and combining these patterns, a description of the face image which includes both texture and shape information is obtained.

The LBP operator labels each pixel in an image with a binary string of length $P$ by selecting $P$ sampling points evenly distributed around the pixel at a specific radius $r$. If the intensity at a sampling point exceeds the intensity of the central pixel, the corresponding bit in the binary string is 1, and otherwise 0. If the sampling point is not in the center of a pixel, bilinear interpolation is used to acquire the intensity value of the sampling point (see figures 2.4 and 2.5).

Figure 2.4: Binary label sampling points at three different radiuses.

Figure 2.5: A given sampling point is labeled 1 if its intensity value exceeds that of the central pixel.

Let $f_l$ be the labeled image. We can define the histogram of the labeled image as

$$H_i = \sum_{x,y} \mathrm{ind}\{f_l(x, y) = i\}, \quad i = 0, 1, \ldots, n - 1$$

where $n$ is the number of different labels produced by the LBP operator. This histogram captures the texture information of the subpatterns of the image. We can capture spatial information as well by subdividing the image into regions $R_0, R_1, \ldots, R_{m-1}$. The spatially enhanced histogram becomes

$$H_{i,j} = \sum_{x,y} \mathrm{ind}\{f_l(x, y) = i\}\,\mathrm{ind}\{(x, y) \in R_j\}, \quad i = 0, 1, \ldots, n - 1, \quad j = 0, 1, \ldots, m - 1.$$

We can classify a probe image by comparing the corresponding histograms of the probe and the gallery set using some dissimilarity measure. Several options exist, including:

– Histogram intersection: $D(S, M) = \sum_i \min(S_i, M_i)$.

– Log-likelihood statistic: $L(S, M) = -\sum_i S_i \log M_i$.

– Chi-square statistic: $\chi^2(S, M) = \sum_i \frac{(S_i - M_i)^2}{S_i + M_i}$.

Each of these can be extended to the spatially enhanced histogram by simply summing over both $i$ and $j$[8].
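As an illustration of the comparison step, the following sketch (an assumed helper, not framework code) computes the chi-square dissimilarity between two spatially enhanced histograms stored as flat vectors of length $nm$; iterating over the flat vectors is equivalent to summing over both $i$ and $j$.

```cpp
#include <vector>
#include <cstddef>

// Chi-square dissimilarity between two spatially enhanced LBP histograms,
// each stored as a flat vector of length n * m (n labels per region, m regions).
double chiSquare(const std::vector<double>& S, const std::vector<double>& M) {
    double dissimilarity = 0.0;
    const std::size_t bins = S.size() < M.size() ? S.size() : M.size();
    for (std::size_t b = 0; b < bins; ++b) {
        const double denom = S[b] + M[b];
        if (denom > 0.0) {                 // skip empty bins to avoid division by zero
            const double diff = S[b] - M[b];
            dissimilarity += diff * diff / denom;
        }
    }
    return dissimilarity;
}
```

Classification then reduces to choosing the gallery subject whose histogram minimizes this dissimilarity against the probe histogram.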
Hidden Markov models

Hidden Markov models (HMMs) can be applied to the task of face recognition by treating different regions of the human face (eyes, nose, mouth, etc.) as hidden states. HMMs require one-dimensional observation sequences, and thus the two-dimensional facial images need to be converted into either 1D temporal or spatial sequences. An HMM is created for each subject in the database, the probe image is fed as an observation sequence to each, and the match with the highest likelihood is considered best.

Wawo

The core face recognition algorithm of the Wawo system is based on an extended HMM scheme called Joint Multiple Hidden Markov Models (JM-HMM)[35]. The primary objective of the algorithm is to capture the 2D nature of face images while requiring only a single gallery image per subject to achieve good performance. The input image is treated as a set of horizontal and vertical strips. Each strip consists of small rectangular blocks of pixels, and each strip is managed by an individual HMM. When an HMM subsystem of a probe is to be compared to the corresponding one in a gallery image, the block strips of each image are first matched according to some similarity measure, and the observation sequence is formed by the indices of the best-matching blocks.

2.3.4 Other techniques

These are approaches from the literature that have not been evaluated in this report.

Neural networks

A variety of techniques based on artificial neural networks have been developed. The popularity of artificial neural networks may be due to their non-linearity allowing for more effective feature extraction than eigenface-based methods. The structure of the network is essential to the success of the system, and which structure is suitable depends on the application. For example, multilayer perceptrons and convolutional neural networks have been applied to face detection, and a multi-resolution pyramid structure[52][49][32] to face verification. Some techniques combine multiple structures to increase precision and counteract certain types of variation[49]. A probabilistic decision-based neural network (PDBNN) has been shown to function effectively as a face detector, eye localizer and face recognizer[48]. In general, neural network approaches suffer from computational complexity issues as the number of individuals increases. They are also unsuitable for single model image cases, since they tend to require multiple model images to train to optimal parameter settings.

Dynamic link architecture

Dynamic link architectures are an extension of traditional artificial neural networks[6]. Memorized objects are represented by sparse graphs, whose vertices are labeled with a multi-resolution description in terms of a local power spectrum, and whose edges are geometrical distance vectors. Distortion-invariant object recognition can be achieved by employing elastic graph matching to find the closest stored graph. The method tends to be superior to other methods in terms of coping with rotation variation, but the matching process is comparatively expensive.

Geometrical feature matching

This technique is based on the computation of a set of geometrical features from the picture of a face. The overall configuration can be represented by a vector containing the position and size of a set of main facial features, such as the eyes, eyebrows, mouth, face outline, etc. It has been shown to be successful for large face databases such as mug shot albums[30].
However, it is dependent on the accuracy of automated feature location algorithms, which generally do not achieve a high degree of accuracy and require considerable computational time.

3D model

The 3D face model is based on a vector representation of a face that is constructed such that any convex combination of shape and texture vectors describes a realistic human face. Fitting the 3D model to, or extracting it from, images can be used in two ways for recognition across different viewing conditions:

– After fitting the model, the comparison can be based on model coefficients that represent intrinsic features of shape and texture that are independent of imaging conditions[25].

– 3D face reconstruction can be employed to generate synthetic views from different angles. The views are then transferred to a second, view-dependent recognition system[63].

3D morphable models have been combined with computer graphics simulations of illumination and projection[10]. Among other things, this approach allows for modeling more sophisticated lighting conditions such as specular lighting and cast shadows (most techniques only consider Lambertian illumination). Scene parameters in probe images, such as head position and orientation, camera focal length and illumination direction, can be automatically estimated.

Line edge map

Edge information is useful for face recognition because it is partially insensitive to illumination variation. It has been argued that face recognition in the human brain might make extensive use of early-stage edge detection without involving higher-level cognitive functions[53]. The Line Edge Map (LEM) approach extracts lines from a face edge map as features. This gives it the robustness to illumination variation that is characteristic of feature-based approaches while retaining low memory requirements and high recognition performance. In addition, LEM is highly robust to size variation. It has been shown to be less sensitive to pose changes than the eigenface method, but more sensitive to changes in facial expression[24].

Support vector machines

Support vector machines (SVMs) are considered an effective method for general-purpose pattern recognition due to their high generalization performance without the need to add other knowledge[60]. Intuitively, given a set of points belonging to two classes, an SVM finds the hyperplane that separates the largest possible set of points of the same class on the same side while maximizing the distance from either class to the hyperplane. A large variety of SVM-based approaches have been developed for a number of different application areas[38][22][45][9][34][42]. The main features of SVM-based approaches are that they are able to extract relevant discriminatory information automatically and that they are robust to illumination changes. However, they can become overtrained on data sanitized by feature extraction and/or normalization, and they involve a large number of parameters, so the optimization space can become difficult to explore completely.

Multiple classifier systems

Traditionally, the approach used in the design of pattern recognition systems has been to experimentally compare the performance of several classifiers in order to select the best one.
Recently, the alternative approach of combining the output of several classifiers has emerged, under various names such as multiple classifier systems (MCSs), committee classifiers or ensemble classifiers, with the purpose of improving accuracy. A limited number of approaches of this kind have been developed, with good results for established face databases[54][55][28][47].

2.4 Face recognition in video

Since a video clip consists of a sequence of frame images, face recognition algorithms that apply to single still images can be applied to video virtually unchanged. However, a video sequence possesses a number of additional properties that can potentially be utilized to design face recognition techniques with improved accuracy and/or performance over single still image techniques. Three properties of major importance are:

– Multiple observations: A video sequence by its very nature yields multiple observations of any probe or gallery subject. Additional observations mean additional constraints and potentially increased accuracy.

– Temporal continuity/Dynamics: Successive frames in a video sequence are continuous in the temporal dimension. Geometric continuity related to changes in facial expression or head/camera movement, or photometric continuity related to changes in illumination, provide additional constraints. Furthermore, changes in head movement or facial expression obey certain dynamics that can be modeled for additional constraints.

– 3D model: We can attempt to reconstruct a 3D model of the face using a video sequence. This can be achieved both by treating the video as a set of multiple observations and by making use of temporal continuity and dynamics. Recognition can then be based on the 3D model, which, as previously described, has the potential to be invariant to pose and illumination.

Below, these properties, and how they can be exploited to design better face recognition techniques, are discussed in detail. We also study some existing techniques that make use of these properties.

2.4.1 Multiple observations

This is the most commonly used feature of video sequences. Techniques exploiting this property treat the video sequence as a set of related still images but ignore the temporal dimension. The discussion below assumes that images are normalized before being subjected to further analysis.

Assembly-type algorithms

A simple approach to dealing with multiple observations is to apply a single still image technique to each individual frame of a video sequence and combine the results by some rule. In many cases the combining rule is very simple; some common examples are given in table 2.1, and a code sketch follows below.

Let $\{F_i;\ i = 1, 2, \ldots, n\}$ denote the sequence of probe video frames. Let $\{I_j;\ j = 1, 2, \ldots, m\}$ denote the set of gallery images. Let $d(F_i, I_j)$ denote the distance function between the $i$th frame of a video sequence and the $j$th gallery image for some single still image technique. Let $A_i(F_i)$ denote the gallery image selected by the algorithm applied to the $i$th frame of the probe video.

Table 2.1: Combining rules for assembly-type techniques.

– Minimum arithmetic mean: $\hat{j} = \operatorname{argmin}_{j=1,2,\ldots,m} \frac{1}{n}\sum_{i=1}^{n} d(F_i, I_j)$
– Minimum geometric mean: $\hat{j} = \operatorname{argmin}_{j=1,2,\ldots,m} \sqrt[n]{\prod_{i=1}^{n} d(F_i, I_j)}$
– Minimum median: $\hat{j} = \operatorname{argmin}_{j=1,2,\ldots,m} \left[\operatorname{med}_{i=1,2,\ldots,n} d(F_i, I_j)\right]$
– Minimum minimum: $\hat{j} = \operatorname{argmin}_{j=1,2,\ldots,m} \left[\min_{i=1,2,\ldots,n} d(F_i, I_j)\right]$
– Majority voting: $\hat{j} = \operatorname{argmax}_{j=1,2,\ldots,m} \sum_{i=1}^{n} \operatorname{ind}\{A_i(F_i) = j\}$
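The following sketch (illustrative helper code, not part of the framework) shows how cheaply two of these combining rules can be layered on top of any still-image recognizer, given a precomputed distance matrix where d[i][j] corresponds to $d(F_i, I_j)$.

```cpp
#include <vector>
#include <map>
#include <limits>
#include <algorithm>
#include <cstddef>

// Minimum arithmetic mean rule: pick the gallery image with the smallest mean
// distance over all frames (the constant 1/n factor does not affect the argmin).
std::size_t minArithmeticMean(const std::vector<std::vector<double>>& d) {
    std::size_t best = 0;
    double bestSum = std::numeric_limits<double>::max();
    for (std::size_t j = 0; j < d.front().size(); ++j) {
        double sum = 0.0;
        for (std::size_t i = 0; i < d.size(); ++i) sum += d[i][j];
        if (sum < bestSum) { bestSum = sum; best = j; }
    }
    return best;
}

// Majority voting rule: each frame votes for its nearest gallery image and the
// gallery image with the most votes wins.
std::size_t majorityVote(const std::vector<std::vector<double>>& d) {
    std::map<std::size_t, int> votes;
    for (std::size_t i = 0; i < d.size(); ++i) {
        const std::size_t choice = static_cast<std::size_t>(
            std::min_element(d[i].begin(), d[i].end()) - d[i].begin());
        ++votes[choice];
    }
    std::size_t best = 0;
    int bestVotes = -1;
    for (const auto& entry : votes)
        if (entry.second > bestVotes) { bestVotes = entry.second; best = entry.first; }
    return best;
}
```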
One image or several images

Multiple observations can be summarized into a smaller number of images. For example, one could use the mean or median image of the probe sequence, or use clustering techniques to produce multiple summary images. After that, single still image techniques or assembly-type algorithms can be applied to the result.

Matrix

If each frame of the probe video is vectorized by some means, the video can be represented as a matrix $V = [F_1\, F_2 \ldots F_n]$. This representation can make use of the various methods of matrix analysis. For example, matrix decompositions can be invoked to represent the data more efficiently, and matrix similarity measures can be used for recognition[43].

Probability density function

Multiple observations $\{F_1, F_2, \ldots, F_n\}$ can be regarded as independent realizations drawn from the same underlying probability distribution, and PDF estimation techniques can be utilized to learn this distribution[23]. If both the probe and the gallery consist of video footage, PDF distance measures can be used to perform recognition. If the probe consists of video and the gallery of still images, recognition becomes a matter of determining which gallery image is most likely to be generated from the probe distribution. In the reverse case, where the gallery consists of video and the probe of a still image, recognition tests which gallery distribution is most likely to generate the probe.

Manifold

Face appearances of multiple observations form a highly nonlinear manifold. If we can characterize the manifold[18], recognition reduces to (i) comparing two manifolds if both the probe and gallery are video, (ii) comparing the distance between a data point and various manifolds if the probe is a still image and the gallery is video, or (iii) comparing the distance between various data points and a manifold if the probe is video and the gallery consists of still images.

2.4.2 Temporal continuity/Dynamics

Successive frames in a video clip are continuous along the temporal axis. Temporal continuity provides an additional constraint for modeling face appearance. For example, smoothness of face movement can be used in face tracking. It was previously stated that these techniques assume that the probe and gallery have been prenormalized, but it can be noted that in the case of video, face tracking can be used instead of face detection for the purposes of normalization, thanks to the temporal continuity.

Simultaneous tracking and recognition

Zhou and Chellappa proposed[64] an approach that models tracking and recognition in a single probabilistic framework using time series analysis. A time series model is used, consisting of the state vector $(a_t, \theta_t)$, where $a_t$ is the identity variable at time $t$ and $\theta_t$ is the tracking parameter, as well as the observation $y_t$ (the video frame), the state transition probability $p(a_t, \theta_t \mid a_{t-1}, \theta_{t-1})$ and the observation likelihood $p(y_t \mid a_t, \theta_t)$. The task of recognition thus becomes computing the posterior probability $p(a_t \mid y_{0:t})$ where $y_{0:t} = y_0, y_1, \ldots, y_t$.
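As a purely didactic sketch of the recursive flavor of this computation, the hypothetical helper below updates a posterior over identities one frame at a time, ignoring the tracking parameter $\theta_t$ and assuming the identity stays constant; Zhou and Chellappa's actual method handles tracking jointly and is considerably more involved.

```cpp
#include <vector>
#include <cstddef>

// One frame of a heavily simplified recursive identity update: the posterior
// over identities is multiplied by the current frame's likelihood p(y_t | a)
// and renormalized. The tracking parameter is ignored and the identity is
// assumed constant over time, so this is only an approximation for illustration.
void updateIdentityPosterior(std::vector<double>& posterior,
                             const std::vector<double>& frameLikelihood) {
    double total = 0.0;
    for (std::size_t a = 0; a < posterior.size(); ++a) {
        posterior[a] *= frameLikelihood[a];
        total += posterior[a];
    }
    if (total > 0.0)
        for (std::size_t a = 0; a < posterior.size(); ++a)
            posterior[a] /= total;   // renormalize so the posterior sums to one
}
```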
Probabilistic appearance manifolds A probabilistic appearance manifold[31] models each individual in the gallery as a set of linear subspaces, each modelling a particular pose variation, called pose manifolds. These are generated by extracting samples from a training video which are divided into groups through k-means clustering. Principal component analysis is performed on each group to characterize that subspace. Temporal continuity is captured by computing the transition probabilities between pose manifolds in the training video. Recognition is performed by integrating the likelihood that an input frame is generated by a pose manifold and the probability of transitioning to that pose manifold from the previous frame. Adaptive hidden Markov model Liu and Chen proposed[39] an HMM-based approach that captures temporal information by using temporally indexed observation sequences. The approach makes use of principal component analysis to reduce each gallery video to a sequence of low-dimensional feature vectors. These are then used as observation sequences in the training of the HMM models. In addition, the algorithm gradually adapts to probe videos by using unambiguously identified probes to update the corresponding gallery model. System identification Aggarwal, Chowdury and Chellappa presented[21] a system identification approach to face recognition in video. Each video sequence is represented by a first-order auto-regressive and moving average (ARMA) model θt+1 = Aθt + vt , It = Cθt + wt where θ is a state vector characterizing the pose of the face, It the frame and v and w independent identically distributed white noise factors drawn from N (0, Q) and N (0, R) respectively. System identification is the process of estimating the model parameters A, C, Q and R based on the observations I1 , I2 , . . . , In . Recognition is performed by selecting the gallery model that is closest to the probe model by some distance function of the model parameters. 2.4. Face recognition in video 2.4.3 21 3D model We can attempt to reconstruct a 3D model of a face from a video sequence. One way to do this is by utilizing light field rendering, which involves treating each observation as a 2D slice of a 4D function - the light field, which characterizes the flow of light through unobstructed space. Another method is structure from motion (SfM), which attempts to recover 3D structure from 2D images coupled with local motion signals. The 3D model will possess two components: geometric and photometric. The geometric component describes depth information of the face and the photometric component depicts the texture map. Structure from motion is more focused on recovering the geometric component, and light field rendering on recovering the photometric component. Structure from motion There is a large body of literature on SfM, but despite this current SfM algorithms cannot reconstruct the 3D face model reliably. The difficulties are three-fold: (i) the ill-posed nature of the perspective camera model that results in instability of SfM solutions, (ii) the fact that the face is not a truly rigid object, especially when the face presents facial expressions and other deformations and (iii) the input to the SfM algorithm. This is usually a sparse set of feature points provided by a tracking algorithm with its own flaws. Interpolation from a sparse to a dense set of feature points is very inaccurate. 
The first difficulty can be addressed by using an orthographic or paraperspective model to approximate the perspective camera model[56][46]. The second problem can often be resolved by imposing a subspace constraint on the face model[13]. A dense face model can be used to overcome the sparse-to-dense issue. However, the dense face model is generic and not appropriate for a specific individual; bundle adjustment has been used to adjust the generic model to accommodate video observations[20].

Light field rendering

The SfM algorithm mainly recovers the geometric component of the face model, i.e., the depth value of every pixel. Its photometric component is naively set to the appearance in one reference video frame. An image-based rendering method recovers the photometric component of the 3D model instead, and light field rendering bypasses even this stage by extracting novel views directly[37].

2.5 Object tracking

The field of object tracking deals with the combined problems of locating objects in video sequences, tracking their movement from frame to frame and analyzing object tracks to recognize behavior. In its simplest form, object tracking can be defined as the problem of consistently labeling a tracked object in each frame of a video. Depending on the tracker, additional information about the object can also be detected, such as area, orientation, shape, etc. Conditions that create difficulties in object tracking include:

– information loss due to projecting a 3D world onto a 2D image,
– noise and cluttered, dynamic backgrounds,
– complex rigid motion (drastic changes in velocity),
– nonrigid motion (deformation),
– occlusion,
– complex object shape,
– varying illumination.

Tracking can be simplified by imposing constraints on the conditions of the scene. For example, most object tracking methods assume smooth object motion, i.e., no abrupt changes in direction and velocity. Assuming constant illumination also increases the number of potential approaches that can be used. The approaches to choose from mainly differ in how they represent the objects to be tracked, which image features they use and how the motion is modeled. Which approach performs best depends on the intended application. Some trackers are even specifically tailored to the tracking of certain classes of objects, for example humans.

2.5.1 Object representation

Objects can be represented both in terms of their shapes and their appearances. Some approaches use only the shape of the object to represent it, while others also combine shape with appearance. Shape and appearance representations are usually chosen to fit a certain application domain[7]. Major categories of shape representations include:

– Points. The object is represented by one or more points. This is generally suitable for objects that occupy a small image region.
– Geometric primitives. Objects are represented by geometric primitives, such as rectangles, circles or ellipses. This representation is particularly suitable for rigid objects but can also be used to bound non-rigid ones.
– Silhouette and contour. The contour is the boundary of an object, and the area inside it is called the silhouette. Using this representation is suitable for tracking complex non-rigid objects.
– Articulated shape models. These models consist of body parts held together with joints. The human body, for example, consists of a head, torso, upper and lower arms, etc.
The motion of the parts are constrained by kinematic models. The constituent parts can be modeled by simple primitives such as ellipses or cylinders. 2.5. Object tracking 23 – Skeletal models. The skeleton of an object can be extracted using medial axis transformation. This model is commonly used as a shape representation in object recognition and can be used to model both rigid and articulated objects. Common appearance representations of objects are: – Probability densities. Probability density appearance representations can be parametric, such as a Gaussian, or non-parametric, such as histograms. The probability densities of object appearance can be computed from features (color, texture or more complex features) of the image region specified by the shape representation, such as the interior of a rectangle or a contour. – Templates. Templates are formed from a composition of simple geometric objects. The main advantage of templates is that they carry both spatial and appearance information, but they tend to be sensitive to pose changes. – Active appearance models. Active appearance models simultaneously model shape and appearance by defining objects in terms of a set of landmarks. Landmarks are often positioned on the object boundary but can also reside inside the object region. Each landmark is associated with an appearance vector containing, for example, color and texture information. The models need to be trained using a set of samples by some technique, e.g. PCA. Figure 2.6: A number of object shape representations. a) Single point. b) Multiple points. c) Rectangle. d) Ellipse. e) Articulated shape. f) Skeletal model. g) Control points on contour. h) Complete contour. i) Silhouette. 24 Chapter 2. Introduction to face recognition and object tracking 2.5.2 Image features The image features to use are an integral part of any tracking algorithm. The most desirable property of a feature is how well it distinguishes between the object region and the background[7]. Features are usually closely related to the object representation. For example, color is mostly used for histogram representations while edges are more commonly used for contour-based representations. The most common features are: – Color. The apparent color of an object is influenced both by the light source and the reflective properties of the object surface. Different color spaces, such as RGB, HSV, L*u*v or L*a*b, each with different properties, can be used, depending on application area. – Edges. Object boundaries generally create drastic changes in image intensity and edge detectors identify these changes. These features are mostly used in trackers that use contour-based object representations. – Optical flow. Optical flow is a field of displacement vectors that describe the motion of each pixel in an image region. It is computed by assuming that the same pixel retains the same brightness between consecutive frames. – Texture. Texture measures the variation of intensity across a surface, describing properties like smoothness and regularity. Methods for automatic feature selection have also been developed. These can mostly be categorized as either filter or wrapper methods[11]. Filter methods derive a set of features from a much larger set (such as pixels) based on some general criteria, such as noncorrelation, while wrapper methods select features based on their usefulness in a particular problem domain. 2.5.3 Object detection Every tracking method requires some form of detection mechanism. 
The most common approach is to use information from a single initial frame, but some methods utilize temporal information across multiple frames to reduce the number of false positives[7]. This is usually in the form of frame differencing, which highlights regions that change between frames. In this case, it is then the tracker’s task to establish correspondence between detected objects across frames. Some common techniques include: – Point detectors. Detectors used to find points of interest whose respective loci have particular qualities[29]. Major advantages of point detectors are insensitivity to variation in illumination and viewpoint. – Background subtraction. Approach based on the idea of building a model for the background of the scene and detecting foreground objects based on deviations from this model[51]. – Segmentation. Segmentation aims to detect objects by partitioning the image into perceptually similar regions and characterizing them[50]. – Supervised learning. Based on learning a mapping between object features and object class and then applying the trained model to different parts of an image. These approaches include neural networks, adaptive boosting, decision trees and support vector machines. 2.5. Object tracking 2.5.4 25 Trackers The goal of an object tracker is to track the trajectory of an object over time by locating it in a series of consecutive frames in a video. This can be done in two general ways[7]. Firstly, the object can be located in each frame individually using a detector, the tracker being responsible for establishing a correspondence between the regions in separate frames. Secondly, the tracker can be provided with an initial region located by the detector and then iteratively update its location in each frame. The shape and appearance model limits the types of transformations it can undergo between frames. The main categories of tracking algorithms are: – Point tracking. With objects represented by points, tracking algorithms use the state of the object in the previous frame to associate it to the next. This state can include the position and motion of the object. This requires an external mechanism to detect the object in each frame beforehand. – Kernel tracking. The kernel refers to a combination of shape and appearance model. For example, a kernel can be the rectangular region of the object coupled with a color histogram describing its appearance. Tracking is done by computing the motion of the kernel across frames. – Silhouette tracking. Tracking is performed by estimating the object region in each frame. This is done by using information encoded in the object region from previous frames. This information usually takes the form of appearance density, or shape models such as edge maps. CAMSHIFT The Continuously Adaptive Mean Shift (CAMSHIFT) algorithm[12] is a color histogrambased object tracker based on a statistical method called mean shift. It was designed to be used in perceptual user interfaces and minimizing computational costs was thus a primary design criterion. In addition, it is relatively tolerant to noise, pose changes and occlusion, and to some extent also illumination changes. It tracks object movement along four degrees of freedom: x, y, z position as well as roll angle. x and y is given directly by the search window, the z position can be derived by estimating the size of the object and relating it to the current size of the tracking window. 
Roll can be derived from the second moments of the probability distribution in the tracking window. It was initially developed to track faces, but it can also be applied to other object classes.

Color probability distribution

The first step of the CAMSHIFT algorithm is to create a probability distribution image of each frame, based on an externally selected initial track window which contains exactly the object to track. This is done by generating a color histogram of the window and using it as a lookup table to convert an incoming frame into a probability-of-object map. CAMSHIFT uses only the hue dimension of the HSV color space, and ignores saturation and brightness, which gives it some robustness to illumination changes. For the purposes of face tracking, it also minimizes the impact of differing skin colors. Problems with this approach can occur if the brightness value is too extreme, or if the saturation is too low, due to the nature of the HSV color space causing the hue value to vary drastically. The solution is to simply ignore pixels to which these conditions apply. This means that very dim scenes need to be preprocessed to increase the brightness prior to tracking.

Mean shift

The mean shift algorithm is a non-parametric statistical technique which climbs the gradient of a probability distribution to find the local mode/peak. It involves five steps:

1. Choose a search window size.
2. Choose the initial location of the search window.
3. Compute the location of the mode inside the search window. This is done as follows: let p(x, y) be the probability at position (x, y) in the image, with x and y ranging over the search window.
   (a) Find the zeroth moment
       M_{00} = \sum_x \sum_y p(x, y)
   (b) Find the first moments for x and y
       M_{10} = \sum_x \sum_y x \, p(x, y); \quad M_{01} = \sum_x \sum_y y \, p(x, y)
   (c) Then the mode of the search window is
       x_c = \frac{M_{10}}{M_{00}}; \quad y_c = \frac{M_{01}}{M_{00}}
4. Center the search window on the mode.
5. Repeat steps 3 and 4 until convergence.

CAMSHIFT extension

The mean shift algorithm only applies to a single static distribution, while CAMSHIFT operates on a continuously changing distribution from frame to frame. The zeroth moment, which can be thought of as the distribution's "area", is used to adapt the window size each time a new frame is processed. This means that it can easily handle objects changing size when, for example, the distance between the camera and the object changes. The steps to compute the CAMSHIFT extension are as follows:

1. Choose the initial search window.
2. Apply mean shift as described above and store the zeroth moment.
3. Set the search window size to a function of the zeroth moment found in step 2.
4. Repeat steps 2 and 3 until convergence.

When applying the algorithm to a series of video frames, the initial search window of one frame is simply the computed region of the previous one.

Figure 2.7: CAMSHIFT in action. The graph in each step shows a cross-section of the distribution map, with red representing the probability distribution, yellow the tracking window and blue the current mode. In this example the algorithm converges after six steps.

The window size function used in the original paper is

s = 2 \sqrt{\frac{M_{00}}{256}}

This is arrived at by first dividing the zeroth moment by the maximum pixel intensity value to convert it into units of number of cells, which makes sense for a window size measure. In order to convert the 2D region into a 1D length, we take the square root.
We desire an expansive window that grows to encompass a connected distribution area, so we multiply the result by two. 28 Chapter 2. Introduction to face recognition and object tracking Chapter 3 Face recognition systems and libraries This chapter describes the face recognition systems and libraries examined in this report and gives a brief review of the installation process for each of them. 3.1 OpenCV OpenCV (open computer vision) is a free open source library for image and video processing with support for a wide variety of different algorithms in many different domains. It has extensive documentation and community support, with complete API documentation as well as various guides and tutorials for specific topics and tasks[2]. 3.1.1 Installation and usage Using Ubuntu 12.04 LTS, OpenCV was very simple to install. There is a relatively new binary release available through the package manager, but the latest version, which among other things includes the face recognition algorithms, had to be downloaded separately and compiled from source. However, the compilation required few external dependencies and the process was relatively simple. The library itself has an intuitive interface and it is very easy to create quite powerful sample applications. A great advantage of using OpenCV for building a face recognition system is that it supports many of the tasks that are not directly related to recognition, such as loading, representing and presenting video and image data, image processing and normalization, face detection and so on. The implemented face recognition algorithms have a unified API which make them easily interchangable in an application. One minor problem in evaluating their performances is that they only return the rank-1 classification which prevents the methods from being evaluated using rank-based metrics. They also return a confidence value for the classification, but its definition is never formally documented, and the only way to find this out is by reading the source code itself. 29 30 3.2 Chapter 3. Face recognition systems and libraries Wawo SDK Wawo Technology AB[3] is a Sweden-based company developing face recognition technologies, with its main product being the Wawo face recognition system, which is based on original algorithms and techniques presented in the doctoral dissertation of Hung-Son Le. It is advertised as being capable of performing rapid and accurate illumination-invariant face recognition based on a very small number of samples per subject. This is done using an original HMM-based algorithm as well as original contrast-enhancing image normalization procedures[35]. 3.2.1 Installation and usage The binary distribution of Wawo used in this work was acquired through Codemill and not directly from Wawo, and thus it might not be the most up-to-date version of the library. The distribution I was first given access to also lacked some files that I had to piece together myself from an old hard drive used in previous projects, so while I was able to get the library running, it is possible that some of the problems described below could have occurred because of this. The documentation that came along was quite limited and mainly consisted of code samples with some rudimentary comments partially describing the API. Despite this, producing functional code was not very difficult when Wawo was used in conjunction with the basic facilities provided by OpenCV. 
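As a rough illustration of what using these basic OpenCV facilities looks like in practice, the sketch below loads a probe video, runs cascade face detection on each frame and marks the detected regions, which is where a recognizer backend (Wawo or any of the OpenCV algorithms) would be invoked. This is a minimal sketch under assumptions: the cascade file, parameter values and the recognizer placeholder are illustrative choices, not the plugin's actual code; only the OpenCV 2.4 calls themselves (VideoCapture, CascadeClassifier::detectMultiScale, cvtColor, equalizeHist, rectangle) are standard API.

#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 3) {
        std::cerr << "usage: detect CASCADE_XML PROBE_VIDEO" << std::endl;
        return 1;
    }

    // Load a cascade trained for frontal faces and open the probe video.
    cv::CascadeClassifier detector;
    cv::VideoCapture video(argv[2]);
    if (!detector.load(argv[1]) || !video.isOpened())
        return 1;

    cv::Mat frame, gray;
    while (video.read(frame)) {
        // Basic normalization: grayscale conversion and histogram equalization.
        cv::cvtColor(frame, gray, CV_BGR2GRAY);
        cv::equalizeHist(gray, gray);

        // Detect candidate face regions in the current frame.
        std::vector<cv::Rect> faces;
        detector.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(30, 30));

        for (size_t i = 0; i < faces.size(); ++i) {
            // The detected region is what a recognizer (Wawo, Eigenfaces, ...)
            // would be handed for identification; here we simply mark it.
            cv::rectangle(frame, faces[i], cv::Scalar(0, 255, 0), 2);
        }
    }
    return 0;
}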
I occasionally encountered segmentation faults originating inside library calls, Again, this is possibly due to using an old, incomplete distribution. I also discovered that Wawo sets a relatively low upper limit to the size of the training gallery. This is reasonable given that Wawo’s strength is advertised to be good performance with very few training samples. 3.3 OpenBR OpenBR (OpenBiometrics) is a collaborative research project initiated by the MITRE company, a US-based non-profit company operating several federally-funded research centers. Its purpose is to develop biometrics algorithms and evaluation methodologies. The current system supports several different types of automated biometric analysis, including face recognition, age estimation and gender estimation[1]. 3.3.1 Installation and usage I could not find a binary release of OpenBR 0.4.0 on the official website, which is the advertised version of OpenBR at the time of writing. The only available option seemed to be building from source. The only instructions available were specific to Ubuntu 13.04. Following them on Ubuntu 12.04 LTS did not work, and thus, I was unable to install and test the library. There were a large number of different steps involved, which I suspect makes the process error-prone even if the correct OS version is used. Overall, the installation procedure seemed immature and documentation was limited. Chapter 4 System description of standalone framework This chapter gives a practical description of the framework developed over the course of this project. First, a conceptual overview is given and then each component is described in detail. The framework depends on OpenCV (≥2.4), cmake (tested with 2.8). The Wawo SDK is included in the source tree but may not be the latest version. There are four primary types of objects that are of concern to client programmers and framework developers: detectors, recognizers, normalizers and techniques. Detectors and recognizers encapsulate elementary face detection and recognition algorithms. Normalizers perform image preprocessing to deal with varying imaging conditions in the data and algorithm requirements. Techniques integrate the lower-level components and perform high-level algorithmic functions. These components are interchangable and can be mixed and matched to suit the intended application and the source data. Figure 4.1: Conceptual view of a typical application. 31 32 Chapter 4. System description of standalone framework 4.1 Detectors A detector is a class that wraps the functionality of a face detection algorithm. Every detector implements the IDetector interface. The interface specifies a detect method, which accepts an image represented by an OpenCV matrix and returns a list of rectangles representing the image regions containing the detected faces. In order to deal with varying input image formats and imaging conditions, and the fact that different detection algorithms may benefit from different types of image preprocessing, the IDetector interface also specifies a method for setting a detector-specific normalizer. See below for details on normalizers. Figure 4.2: IDetector interface UML diagram. The framework currently supports two different detectors, but adding additional detectors to the framework is simply a matter of implementing the interface. 4.1.1 CascadeDetector CascadeDetector implements the cascade classifier with Haar-like features described in detail in chapter two. 
Algorithm parameters can be configured using the following methods:

– CascadeDetector(std::string cascadeFileName) - cascadeFileName specifies the file that contains the cascade training data.
– setScaleFactor(double) - Specifies the degree to which the image size is reduced at each image scale.
– setMinNeighbours(int) - Specifies how many neighbors each candidate rectangle must have in order to be retained.
– setMinWidth(int), setMinHeight(int) - Minimum possible face dimensions. Faces smaller than this are ignored.
– setMaxWidth(int), setMaxHeight(int) - Maximum possible face dimensions. Faces larger than this are ignored.
– setHaarFlags(int) - Probably obsolete, see the OpenCV documentation.

4.1.2 RotatingCascadeDetector

RotatingCascadeDetector inherits from CascadeDetector and also implements the rotation extension to the cascade classifier described in detail in chapter five. In addition to the parameters provided by CascadeDetector, algorithm parameters are supplied through the constructor:

– RotatingCascadeDetector(std::string cascadeFileName, double maxAngle, double stepSize) - maxAngle is the maximum orientation deviation from the original upright position, and stepSize is the size of the step angle in each iteration.

4.2 Recognizers

A recognizer wraps a face recognition algorithm. It is a class that implements the IRecognizer interface. A face recognition algorithm is always a two-step process. In the first step, the algorithm needs to be trained with a gallery of known subjects, and thus a recognizer needs to implement the train method. It accepts a list of gallery images, a corresponding list of image regions containing the faces of the subjects and a list of labels indicating the identity of the subject in each image. All three arguments need to be of equal length, and a given index refers to the same subject in all three lists.

Figure 4.3: IRecognizer interface UML diagram.

The recognize method is responsible for actually performing the recognition. It accepts an image and a face region as input arguments, and returns the estimated label indicating the identity of the subject in the image. As in the case with detectors, different image formats and imaging conditions can require varying image preprocessing in order to optimize the performance of the recognition algorithm, and so image normalization is required. In addition, since recognizers deal with images from two different sources, the gallery and the probe, two different normalizers may be necessary. Thus, an implementing class needs to accept a gallery normalizer and a separate probe normalizer. The framework currently supports five different recognition algorithms:

4.2.1 EigenFaceRecognizer

EigenFaceRecognizer implements the Eigenfaces algorithm, described in detail in chapter two. Algorithm parameters can be configured using the following methods:

– setComponentCount(int) - Set the number of eigenvectors to use.
– setConfidenceThreshold(double) - Set the known/unknown subject threshold. A value between 0.0 and DBL_MAX, inclusive.

4.2.2 FisherFaceRecognizer

FisherFaceRecognizer implements the Fisherfaces algorithm, described in detail in chapter two. Algorithm parameters can be configured using the following methods:

– setComponentCount(int) - Set the number of eigenvectors to use.
– setConfidenceThreshold(double) - Set the known/unknown subject threshold. A value between 0.0 and DBL_MAX, inclusive.
System description of standalone framework 4.2.3 LBPHRecognizer LBPHRecognizer implements the local binary pattern histograms algorithm, described in detail in chapter two. Algorithm parameters can be configured using the following methods: – setRadius(int) - The radius used for building the circular local binary pattern. – setNeighbours(int) - The number of sample points to build a circular local binary pattern from. – setGrid(int x, int y) - The number of cells in the horizontal and vertical direction respectively. – setConfidenceThreshold(double) - Set the known/unknown subject threshold. A value including and between 0.0 and DBL MAX. 4.2.4 WawoRecognizer WawoRecognizer implements the Wawo algorithm, described briefly in chapter two and four. Algorithm parameters can be configured using the following methods: – setRecognitionThreshold(float) - Set the known/unknown subject threshold. A value including and between 0.0 and 1.0. – setVerificationLevel(int) - A value including and between 1 and 6. A lower value runs faster, but probably decreases accuracy. – setMatchesUsed(int) - If set to greater than 1, the result returned is the mode of the n most likely candidates. 4.2.5 EnsembleRecognizer The EnsembleRecognizer combines an arbitrary number of elementary recognizers, which vote democratically amongst themselves about the final result. The setConfidenceThreshold(double) method sets the minimum fraction of participating recognizers which need to agree to produce a result. Otherwise, the probe is considered unknown. Note that the participating recognizers can also explicitly vote for an unknown identity, if so configured. 4.3 Normalizers Input imagery can vary greatly depending on camera equipment, lighting conditions during the shot, lossy processing since the image was taken, etc. In addition, different image processing algorithms require different input preprocessing to achieve optimal performance. Normalizers are modules that perform the preprocessing steps for the other parts of the recognition system. This makes it easy to test a variety of normalization options in order to figure out which one suits a particular algorithm in a particular context best. Figure 4.4: INormalizer interface UML diagram. 4.4. Techniques 35 A normalizer is a class that implements the INormalizer interface. The interface defines only a single method, normalize, which accepts an input image and returns an output image. Any number and combinations of image processing operations can be performed by a normalizer. The framework currently supports the following four, but adding more is very simple: – GrayNormalizer - Converts an RGB image to grayscale. – ResizeNormalizer - Scale an image to the given dimensions. – EqHistNormalizer - Enhance contrast by equalizing the image histogram. – AggregateNormalizer - Utility class that lets the user create a custom normalizer by assembling a sequence of elementary normalizers. This circumvents the need to create a new normalizer for every conceivable combination of normalization steps. 4.4 Techniques A technique is a top-level class that ties together the constituent detection and recognition algorithms in a particular way. While the IDetection and IRecognition interfaces deal solely with individual images, a technique is responsible for loading the gallery and probe files and potentially also iterating over the frames of a probe video and algorithmic tasks spanning multiple sequential frames. Figure 4.5: ITechnique interface UML diagram. 
Every technique implements the ITechnique interface which specifies the train and recognize methods. The former accepts a Gallery object which specifies the files to use for training the underlying recognition model. The latter accepts a string containing the filename of the probe image or video and produces an Annotation object describing the prescence, identities and locations of recognized individuals in the probe. 4.4.1 SimpleTechnique SimpleTechnique is the prototypical technique. It accepts one detector for the gallery and one for the probe, and a single recognizer. All gallery files are loaded from disk in turn. If a gallery file is an image, it applies the gallery detector and feeds the detected image region and the corresponding label to the recognizer for training. If a gallery file is a video, performs the same operations on each of the frames of the video in turn. After the training is complete, it loads the probe from file. If it is an image, the probe detector is applied to it and the detected image region is fed to the recognizer and the result is stored. If it is a video file, the technique performs the same operations on each frame in turn. 4.4.2 TrackingTechnique TrackingTechnique implements the recognition/object tracking integration described in detail in chapter five. The gallery face detection preprocessing and recognizer training is identical to SimpleTechnique (see above). 36 Chapter 4. System description of standalone framework Figure 4.6: SimpleTechnique class UML diagram. Figure 4.7: TrackingTechnique class UML diagram. 4.5 4.5.1 Other modules Annotation The Annotation class represents the annotation of the presence and location of individuals in a sequence of images, such as a video clip. An instance can be produced by the framework through the application of face recognition to a probe image or video, but also saved and loaded from disk. In addition, an instance can be compared to another, ”true”, annotation by a number of performance measures, described in chapter six. When saved to file, it uses a simple ASCII-based file format. The first line contains exactly one positive integer, representing the total number of individuals in the subject gallery. Each subsequent line represents a frame or image. The present individuals in the frame/image can be specified in two ways, depending on whether or not location data is included. Either, the individuals present in the frame are represented by a number of non-negative integer labels, separated by whitespace, or each individual is represented by a non-negative integer label followed by a start parenthesis ’(’, four comma-separated non-negative integers representing the x, y, width, height of the rectangle specifying the image region of the face of the individual in the frame/image, followed by an end parenthesis ’)’. Each such segment is separated by whitespace. For example, a file of the first type may look like this: 4 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 3 4.5. Other modules 37 And a file of the second type may look like this: 9 0(46,25,59,59) 0(46,25,61,61) 0(45,24,63,63) 0(47,25,61,61) 0(46,25,61,61) 0(45,24,62,62) 0(46,24,62,62) 1(146,124,41,41) 0(45,24,62,62) 1(146,124,41,41) 4.5.2 Gallery The Gallery class is a simple abstraction of the gallery data used to train face recognition models. An instance is created simply by providing the path to the gallery listing file. An optional parameter samplesPerSubject specifies the maximum number of samples to extract from each video file in the gallery listing. 
If the parameter is left out, this indicates to client modules that the maximum number of samples should be extracted. This commonly means all detected faces in the video. Figure 4.8: Gallery class UML diagram. The gallery listing is an ASCII-format newline-separated list of gallery file/label pairs. Each pair consists of a string specifying the path to the gallery file and a non-negative integer label specifying the identity of the subject in the gallery file, separated by a colon. The gallery file can be either an image or a video clip. In both cases, it is assumed that only the face of the subject is present throughout the image sequence. For example: /home/user1/mygallery/subj0.jpg:0 /home/user1/mygallery/subj0.avi:0 /home/user1/mygallery/subj1.png:1 /home/user1/mygallery/subj1.wmv:1 /home/user1/mygallery/subj2.bmp:2 /home/user1/mygallery/subj2.avi:2 4.5.3 Renderer This class is used to play an annotated video back and display detection or recognition results visually. To render detection results, it accepts the video file and a list of lists of cv::Rects, representing the set of detected face regions for each frame of the video. To render recognition results, it simply accepts the video file and an associated Annotation object. The framerate of the playback can also be configured. Figure 4.9: Renderer class UML diagram. 38 Chapter 4. System description of standalone framework 4.6 Command-line interface The framework includes a simple command-line interface to a subset of the functionality provided by the framework, as an example of what an application might look like. It accepts a gallery listings file and probe video file and performs face recognition. The technique, detection and recognition algorithms used can be customized and the result can be either saved to file or rendered visually. The application can also be used to benchmark different algorithms. The syntax is as follows: ./[executable] GALLERY_FILE PROBE_FILE [-o OUTPUT_FILE] [-t TECHNIQUE] [-d DETECTOR] [-c CASCADE_DATA] [-r RECOGNIZER] [-R] [-C CONFIDENCE_THRESHOLD] [-b BENCHMARKING_FILE] [-n SAMPLES_PER_VIDEO] 4.6.1 Options – -o - Specifies the file to write the resulting annotation to. If this option is left out, the output is not saved. – -t - Specifies the technique to use. Can be either ”simple” or ”tracking”. The default is ”simple”. – -d - Specifies the detector to use. Can be either ”cascade” or ”rotating”. The default is ”cascade”. – -c - Specifies the cascade detector training data to use. The default is frontal face training data included in the source tree. – -r - Specifies the recognizer to use. Can be either ”eigenfaces”, ”fisherfaces”, ”lbph” or ”wawo”. The default is ”eigenfaces”. – -R - Indicates that the result should be rendered visually. The sequence of frames is played back at 30 frames per second and the recognition result is overlayed on each frame. – -C - The confidence threshold to set for the selected recognizer. The range of this value depends on the algorithm. The default is 0. – -D - Set the benchmarking annotation file to use. The result is compared to this file and performance data is written to stdout when processing is complete. – -n - Set the number of faces to extract from each gallery video file for training the recognizer model. If this option is not given, as many faces as possible will be used. Chapter 5 Algorithm extensions This chapter discusses the improvements made to the basic face recognition system of the Vidispine plugin that have been added to the standalone framework. 
The improvements are twofold: Firstly, an integration of an arbitrary face recognition algorithm and the CAMSHIFT object tracking algorithm and secondly, an extension of the cascade face detection algorithm. This chapter primarily describes the improvements and discusses their potential and weaknesses, while their performance is empirically evaluated in chapter six. 5.1 Face recognition/object tracking integration The majority of face recognition approaches proposed in the literature operate on a single image. As discussed in chapter two, these kinds of techniques can be applied to face recognition in video by applying them on a frame-by-frame basis. However, this purposefully disregards certain information contained in a video clip that can be used to achieve better recognition performance. For example, geometric continuity in successive frames tends to imply the same object. How can we take that into account when recognizing faces? In addition, a weakness of many popular face recognition techniques is that they are viewdependent. Either the model is trained exclusively with samples from a single view, and thus only able to recognize faces from this one perspective, or the model is trained with samples from multiple views and often suffer a reduction in recognition performance for any one perspective. A color-based object tracking algorithm has the advantage of not being dependent on the pose of the tracked object as long as the color distribution of the object does not radically change with the pose. In fact, an advertised strength of the CAMSHIFT tracking algorithm is that it is able to continue tracking an object as long as occlusion isn’t 100%[12]. It is thus a natural step to combine an elementary face recognition algorithm to identify faces in individual frames, and the CAMSHIFT algorithm in order to overcome issues with viewdependence and to associate the faces of the same subjects across multiple frames. The proposed algorithm consists of the steps below. For details concerning face recognition, face detection of the CAMSHIFT algorithm, see chapter two. 39 40 Chapter 5. Algorithm extensions 1. For each frame in the video: (a) Extend any existing CAMSHIFT tracks to the new frame, possibly terminating them. If two tracks intersect, select one and terminate it. (b) Detect any faces in the frame using an arbitrary face detection algorithm. If a face is detected and it does not intersect with any existing tracks, use it as the initial search region of a new CAMSHIFT track. (c) For each existing track, uniformly expand the search region of the current frame by a fixed percentage, and apply face detection inside the expanded search region. If a face is detected, apply an arbitrary face recognition algorithm on the face region, and store the recognized ID. 2. Once the video has been processed, iterate over all tracks that were created: (a) Compute the mode of all recognized IDs of the track and write it as output to each frame the track covers. Write the CAMSHIFT search region as the corresponding face region to each frame the track covers. Figure 5.1 illustrates an example of the algorithm in action visually. Figure 5.1: Example illustrating the face recognition/tracking integration. In frame a, a face is detected using a frontal face detector and a CAMSHIFT track subsequently created. The face is recognized as belonging to subject ID #1. In frame b, the CAMSHIFT track is extended, but since the head is partially occluded, the detector is unable to recognize it as a face. 
In frame c, the CAMSHIFT track is once again extended, and the head is once again fully revealed, allowing the detector to once again detect and identify the subject as #1. In frame d, the face is completely occluded and CAMSHIFT loses track of it. The final track covers frames 1-3, the mode of the identified faces in the track is computed and assigned to each frame the track covers. The main advantage of this approach is that weaknesses in the face detector or the face recognizer for temporary disadvantageous conditions are circumvented. As long as a majority of successful identifications in a track are correct, failed detections or invalid recognitions are overlooked. This means that temporary pose changes that would normally interrupt a regular recognition algorithm are mediated. The approach could also deal with temporary occlusions or shadows so long as they do not cause CAMSHIFT to lose track. 5.2. Rotating cascade detector 41 The approach could also be extended to using multiple detector/recognizer pairs for multiple viewpoints to increase the range of conditions that result in valid identifications, further increasing the probability of achieving a majority of valid identifications in a single track. A probable weakness of the approach is that cluttered backgrounds could easily result in false positive face detections, depending on the quality of the face detector used, which could produce tracks tracking background noise. It might be possible to filter out many tracks of this type by finding criteria that are likely to be fulfilled by noise tracks, such as a short length or a relatively low number of identifications. 5.1.1 Backwards tracking In theory, there is nothing that prevents the algorithm presented above from tracking both forward and backward along the temporal axis. In the case where a face is introduced into the video under conditions that are disadvantageous to the underlying detector/recognizer and then later become detectable, the algorithm would miss this initial segment unless tracking is done in both temporal directions. In practice, however, video is usually compressed in such a way that only certain key frames are stored in full, and the frames inbetween are represented as sequential modifications to the previous key frame. This means that a video can only be played back in reverse by first playing it forwards and buffering the frames as they are computed. If no upper limit is set on the size of this buffer, even short video clips of moderate resolution would require huge amounts of memory to process. The implementation of the recognition/tracking integration developed during this project supports backwards tracking with a finite-sized buffer, the size of which can be configured to suit the memory resources and accuracy requirements of the intended application. 5.2 Rotating cascade detector The range of face poses that can be detected by cascade detection with Haar-like features is limited by the range of poses present in the training data. In addition, training for multiple poses can limit the accuracy of detecting a face in any one particular pose. To partially mitigate this issue, an extension to the basic cascade detector has been developed. The basic idea is to rotate the input image about its center in a stepwise manner and apply the regular detector to each resulting rotated image in turn, in order to detect faces in the correct pose besides a rotation about the image z-axis. 
However, this approach is likely to detect the same face multiple times due to the basic cascade detector being somewhat robust to minor pose changes. In order to handle multiple detections of the same face, the resulting detected face regions are then rotated back to the original image orientation and merged, producing a single face region in the original image for each face present. This extended detector thus expands the range of detectable poses. The steps of the algorithm are as follows: 1. For a given input image, a given maximum angle and a given step angle, start by rotating the image by the negative maximum angle. Increase the rotation angle by the step angle for each iteration until the angle exceeds the positive maximum angle. 2. For each image orientation: (a) Apply cascade face detection to the rotated image. (b) For each detected face region, rotate the region back to the original orientation and compute an axis-aligned bounding box (AABB) around it. 42 Chapter 5. Algorithm extensions 3. For all AABBs from the previous step, find each set of overlapping rectangles. 4. For each set, compute the average rectangle, i. e., the average top-left/bottom-right point defining each rectangle. Figure 5.2 shows an example of the rotating detector in action. A downside of the rotating detector versus the basic detector is that the stepwise rotation increases processing time by a constant factor, which is the total number of orientations processed, in addition to the relatively minor time it takes to merge the resulting face regions. Thus, this extension is only a viable option in scenarios where accuracy is prioritised over speed, and where a rotation about the z-axis is very likely to occur frequently. Another issue is that the risk of detecting false positives is increased as the number of orientations considered increases. For this reason, the approach may be less useful in scenes with cluttered backgrounds. Figure 5.2: Illustrated example of the rotating cascade detector in action. In the first step, the image is rotated to a fixed set of orientations and face detection is applied in each. In the second step, the detected regions are rotated back to the original image orientation and an axis-aligned bounding box is computed for each. In the last step, the average of the overlapping bounding boxes is computed as the final face region. Chapter 6 Performance evaluation In this chapter the accuracy and performance of the basic face detection and recognition algorithms, as well as the algorithmic extensions introduced in this report, are empirically evaluated. Also, the accuracy and performance of the basic algorithms are evaluated under a variety of scene and imaging conditions, in order to elucidate what their strengths and weaknesses are, and to develop recommendations for application areas and avenues for improvement to be used in future work. Firstly, the performance metrics used in the evaluation are introduced and explained in detail. Secondly, the sources and properties of the various datasets used are described, and thirdly, the setup of each individual test is explained. The final section includes both a presentation and explanation of the results as well as an analysis and discussion. 
6.1 Metrics

The task of performing face detection and recognition on a probe video with respect to a subject gallery, and producing an annotation of the identities and temporal locations of any faces present in the probe, can be viewed as a multi-label classification problem applied to each frame of the probe. In order to avoid optimizing for the potential bias of any single metric, a number of different metrics will be used.

Let L be the set of subject labels present in the gallery. Let D = x_1, x_2, . . . , x_{|D|} be the sequence of frames of the probe video and let Y = y_1, y_2, . . . , y_{|D|} be the true annotation of the probe, where y_i ⊆ L is the true set of subject labels for the ith frame. Let H be a face recognition system and H(D) = Z = z_1, z_2, . . . , z_{|D|} the predicted annotation for the probe produced by the system. Let t_r be the time it takes to play the video and t_p the time it takes to perform face recognition on the video. The following metrics, prominent in the literature[57][5], will be used:

– Hamming loss:

HL(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|y_i \,\Delta\, z_i|}{|L|}

where \Delta is the symmetric difference of two sets, which corresponds to the XOR operation in Boolean logic. This metric measures the average ratio of incorrect labelings and missing labels to the total number of labels. Since this is a loss function, a Hamming loss equal to 0 corresponds to optimal performance.

– Accuracy:

A(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|y_i \cap z_i|}{|y_i \cup z_i|}

Accuracy symmetrically measures the similarity between y_i and z_i, averaged over all frames. A value of 1 corresponds to optimal performance.

– Precision:

P(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|y_i \cap z_i|}{|z_i|}

Precision is the average ratio of identified true positives to the total number of labels identified. A value of 1 corresponds to optimal performance.

– Recall:

R(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|y_i \cap z_i|}{|y_i|}

Recall is the average ratio of identified true positives to the total number of true positives. A value of 1 corresponds to optimal performance.

– F-measure:

F(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{2\,|y_i \cap z_i|}{|z_i| + |y_i|}

The F-measure is the harmonic mean of precision and recall and gives an aggregate description of both metrics. A value of 1 corresponds to optimal performance.

– Subset accuracy:

SA(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \mathrm{ind}\{z_i = y_i\}

Subset accuracy is the fraction of frames in which all subjects are correctly classified without false positives. A value of 1 corresponds to optimal performance.

– Real time factor:

RTF(H, D) = \frac{t_p}{t_r}

The real time factor is the ratio of the time it takes to perform recognition on the video to the time it takes to play it back. If this value is 1 or below, it is possible to perform face recognition in near-real time.

6.2 Testing datasets

Several different test databases were used, for two reasons. Firstly, using several databases with different properties, such as clutter, imaging conditions and number of subjects, gives a better overview of how different parameters affect the quality and speed of recognition. Secondly, using standard test databases allows the results to be compared with other results in the literature. This section lists the databases that were used along with a description of their properties.

6.2.1 NRC-IIT

This database contains 11 pairs of short video clips, one pair for each of 11 individuals. One or more of the files for two of the subjects could not be read and were excluded from this evaluation.
The resolution is 160x120 pixels, with the face occupying between 1/4 and 1/8 of the frame width. The average duration is 10-20 seconds. All of the clips were shot under approximately equal illumination conditions, namely uniformly distributed ceiling light and no sunlight. The subjects display a variety of facial expressions and head orientations. Only a single face is present in each video, and the face is present for the entire duration of the clip.[26]

6.2.2 News

Contains a gallery of six short video clips of the faces of news anchors, each 12-15 seconds long, and a 40 second probe clip containing outtakes from news reports featuring two of the subjects. The resolution of the gallery clips varies slightly, but is approximately 190x250 pixels. The face occupies the majority of the image with very little background and no clutter. The subjects are speaking but have mostly neutral facial expressions. The probe contains outtakes featuring two of the anchors in a full frame from the original news reports. The resolution is 640x480 pixels. The background contains slight clutter and is mostly static, but varies slightly as imagery from news stories is sometimes displayed. In some cases, unknown faces are shown. The illumination is uniform studio lighting without significant shadows.

6.2.3 NR

Consists of a gallery of seven subjects and, for each subject, five video clips featuring the subject in a frontal pose and one to five video clips featuring the subject in a profile pose. The gallery clips contain only the subjects' faces but also some background with varying degrees of clutter. Each clip is one to ten seconds long. The database also contains one 90 second probe video clip featuring a subset of the subjects in the gallery. All gallery and probe clips were shot with the same camera. They are in color with a resolution of 640x480 pixels. The illumination and facial expressions of the subjects vary across the gallery clips. The pose and facial expressions of the subjects in the probe vary, but the illumination is approximately uniform. The probe features several subjects in a single frame as well as several unknown subjects. The background is dynamic, with a relatively high degree of clutter compared to the other datasets.

6.3 Experimental setup

Three different experiments were performed. The first compares the accuracy and processing speed of the tracking extension to those of the basic framework face recognition algorithms as the gallery size increases, the second measures the performance of the rotation extension to the cascade detector, and the third evaluates the impact of variations in multiple imaging and scene conditions on recognition algorithm performance. This section describes the purpose and setup of each experiment. All tests were performed on an Asus N56V laptop with an Intel Core i7-3630QM CPU (eight logical cores at 2.40 GHz), 8 GB of RAM and an NVIDIA GeForce GT 635M graphics card. The operating system used was Ubuntu 12.04 LTS.

6.3.1 Regular versus tracking recognizers

This test was performed in order to evaluate the accuracy and processing speed of the tracking extension compared to regular recognition systems. The algorithms evaluated were Eigenfaces, Fisherfaces, local binary pattern histograms, Wawo and the ensemble method (see chapter four) using the other four algorithms. For each algorithm, both a frame-by-frame recognition approach and the CAMSHIFT tracking approach described in chapter five were used.
All algorithms used the cascade classifier with Haar-like features for face detection. This test was performed on the NRC-IIT database. The gallery was extracted from the first video clip of each subject. The second video clip of each subject was used as probes. The mean subset accuracy over all probes was computed for a number of gallery sizes ranging from 1 to 50. The real time factor (RTF) was computed for each gallery size by dividing the total processing time for all probes, including retraining the algorithms with the gallery for each probe, by the sum total length of all probe video clips. 6.3.2 Regular detector versus rotating detector This test evaluates the recognition performance and processing speed of recognition systems using the rotating extension of the cascade classifier face detector (see chapter five) with respect to the regular classifier. The algorithms used for the evaluation are Eigenfaces, Fisherfaces, Local binary pattern histograms, Wawo and the ensemble method (see chapter four) using the other four algorithms. For each algorithm, both the regular cascade classifier and the cascade classifier with the rotating extension was used. In all other regards, the test was identical to the test described in the previous section. The rotating detector used 20 different angle deviations from the original orientation, with a maximum angle of ±40◦ and a step size of 4◦ . 6.3.3 Algorithm accuracy in cases of multiple variable conditions The purpose of this test is to illuminate what the obstacles to applying the system to real life scenarios are, where many different scene and image conditions can be expected to vary simultaneously. For this reason, each of the basic algorithms was tested on each of the three datasets, each of which has a different set of variable image and scene conditions (see above). For each of the three datasets, Eigenfaces, Fisherfaces, LBPH and Wawo was tested, each using a cascade detector trained for detecting frontal faces using the default cascade data included in the framework. For the NRC-IIT database, the gallery was extracted from the first video clip of each subject. The second video clip of each subject was used as probes. The mean Hamming loss, accuracy, precision, recall, F-measure and subset accuracy was computed over all probes. For the News database, the same set of measurements was 6.4. Evaluation results 47 computed over its single probe. For the NR database, only the frontal gallery was used. The same set of measurements was computed over its single probe. In each test, the maximum number of usable samples were extracted from each gallery video, as specified by the Gallery module (see chapter four). In the case of Wawo and the ensemble method, this had to be limited to 50 samples per video for the NRC-IIT test and 10 samples per video for the News and NR tests due to segmentation faults occurring inside Wawo library calls when using larger gallery sizes. 6.4 6.4.1 Evaluation results Comparison of algorithm accuracy and speed over gallery size As figure 6.1 illustrates, the tracking extension vastly improves the accuracy of all five algorithms. Wawo and the ensemble approach quickly reach a near-optimal level of accuracy at 95-96% and the other algorithms catch up as the gallery size increases. Without the tracking extension, Wawo outperforms the other algorithms for all but the smallest gallery sizes. 
However, as figure 6.2 shows, the RTF of Wawo and the ensemble approach is heavily affected by the gallery size, while the other algorithms retain an essentially constant RTF as the gallery size increases. The tracking extension adds a relatively minor, constant increase to the RTF of all algorithms. These results suggest that the tracking-extended Wawo algorithm may be a suitable choice for applications where the gallery is small to medium-sized (though not only a handful of samples) and processing time is not a critical factor. On the other hand, LBPH and Fisherfaces perform nearly as well, even for smaller gallery sizes, and vastly outperform Wawo in terms of processing time. It should be noted that these results are highly dependent on the dataset, and this analysis should only be considered valid for applications that use data with conditions similar to the test data.

Figure 6.1: The performance of each algorithm, as measured by subset accuracy, as the gallery size increases. The thin lines represent the basic frame-by-frame algorithms and the thick lines represent the corresponding algorithms using the tracking extension.

Figure 6.2: The real time factor of each algorithm as the gallery size increases. The thin lines represent the basic frame-by-frame algorithms and the thick lines represent the corresponding algorithms using the tracking extension.

6.4.2 Regular detector versus rotating detector

Figure 6.3 shows that the rotating extension consistently improves the accuracy of all algorithms. The degree of the improvement varies slightly, but usually lies in the 0.05-0.1 range, and does not seem to be affected by the gallery size. Figure 6.4 illustrates that the extension adds a substantial cost to the processing time; this cost is constant with respect to gallery size but directly proportional to the number of orientations considered by the rotation extension. These results indicate that the rotating extension buys a slight improvement in accuracy at a large cost in processing time. The cost can be reduced by considering a smaller number of orientations; depending on the application, this may or may not reduce accuracy. For example, in a scenario where the subjects to be identified are unlikely to lean their heads to either side by more than a small amount, a large maximum angle may be wasteful. In addition, this test only considers the case where the step size is a tenth of the maximum angle. It is possible that a larger step size would yield the same accuracy, but at the time of writing this has not been tested. Another consideration is the increased likelihood of false positives. Since the basic face detection algorithm is performed once for each orientation, the basic probability of a false positive is compounded by the number of orientations considered. This factor does not appear to impair the algorithm for this particular dataset, but in cases where the basic false positive rate is relatively high, such as in scenes with cluttered backgrounds, it may become a greater problem.

Figure 6.3: The performance, as measured by subset accuracy, of each algorithm using the regular cascade classifier (thin lines) compared to the same algorithm using the rotating extension (thick lines), as the gallery size increases.

Figure 6.4: The real time factor of each algorithm as the gallery size increases. The thin lines represent algorithms using the regular cascade classifier and the thick lines represent the corresponding algorithms using the rotating extension.
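The cost structure discussed above (constant in gallery size, proportional to the number of orientations) can be made concrete with the sketch below, which shows one plausible realization of a rotating detection pass consistent with the parameters used in the test: the frame is rotated to each orientation from -40° to +40° in 4° steps, the regular cascade detector is run, and the detections are mapped back to the original frame. This is an assumption-laden illustration rather than the thesis extension itself (chapter five); in particular, the way overlapping detections from different orientations are merged here is an arbitrary choice.

# Sketch of a rotating cascade detection pass (illustrative; the merging of
# overlapping detections below is an assumed convention, not the thesis method).
import cv2

def detect_rotated(gray, detector, max_angle=40, step=4):
    h, w = gray.shape[:2]
    center = (w / 2.0, h / 2.0)
    boxes = []
    # Original orientation plus 20 deviations: -40, -36, ..., +36, +40 degrees.
    for angle in range(-max_angle, max_angle + 1, step):
        rotation = cv2.getRotationMatrix2D(center, angle, 1.0)
        rotated = cv2.warpAffine(gray, rotation, (w, h))
        inverse = cv2.invertAffineTransform(rotation)
        for (x, y, bw, bh) in detector.detectMultiScale(rotated, 1.1, 5):
            # Map the box centre back to the unrotated frame (box size kept as-is).
            cx, cy = x + bw / 2.0, y + bh / 2.0
            ox = inverse[0, 0] * cx + inverse[0, 1] * cy + inverse[0, 2]
            oy = inverse[1, 0] * cx + inverse[1, 1] * cy + inverse[1, 2]
            boxes.append([int(ox - bw / 2), int(oy - bh / 2), int(bw), int(bh)])
    # One full detection pass per orientation: the cost grows linearly with the
    # number of orientations, and detections from neighbouring angles overlap.
    merged, _ = cv2.groupRectangles(boxes, 1, 0.3) if boxes else ([], [])
    return merged

In a scheme of this kind, halving the number of orientations roughly halves the added detection cost, which matches the observation above that the cost is proportional to the number of orientations considered.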
6.4.3 Evaluation of algorithm accuracy in cases of multiple variable conditions

Table 6.1 shows the results of the NRC-IIT test. The subset accuracy indicates that the algorithms correctly label about 50-60% of frames. Visual inspection shows that most of the error comes from a failure to detect faces that are oriented away from the camera, distorted or occluded. This issue is overcome by the tracking technique, as demonstrated above.

Table 6.1: NRC-IIT test results. The values of all measures besides Hamming loss are equal because, by definition, they equate to the subset accuracy when |z_i|, |y_i| ≤ 1.

Algorithm      Hamming loss   Accuracy   Precision   Recall     F-measure   Subset accuracy
Eigenfaces     0.104112       0.531498   0.531498    0.531498   0.531498    0.531498
Fisherfaces    0.0875582      0.605988   0.605988    0.605988   0.605988    0.605988
LBPH           0.0975996      0.560802   0.560802    0.560802   0.560802    0.560802
Wawo           0.0908961      0.590968   0.590968    0.590968   0.590968    0.590968
Ensemble       0.0933599      0.57988    0.57988     0.57988    0.57988     0.57988

Table 6.2 shows the results of testing on the News dataset. A noteworthy feature here is that the recall is similar to that of the NRC-IIT test, which means that about the same fraction of true positives was identified. However, the precision is markedly lower, which indicates a greater number of false positives. This is most likely due to the more cluttered background in the News probe, and is corroborated by visual inspection of the results. The error consists of both non-face background elements falsely classified as faces by the detector and unknown faces falsely identified as belonging to the subject gallery by the recognizers. The ensemble method performs worst according to all measures in this test. This is most likely because the Wawo component causes a segmentation fault for sample sizes larger than 10, and in the current implementation this limitation is applied to the other ensemble members as well. As demonstrated above, the other algorithms perform comparatively poorly at this gallery size and thus drag down the overall performance.

Table 6.2: News test results. Accuracy and precision are equal because y_i ⊆ z_i whenever |y_i ∩ z_i| ≠ 0. If y_i = z_i, accuracy and precision would also equal recall; that they do not indicates that a number of false positives were present.

Algorithm      Hamming loss   Accuracy   Precision   Recall     F-measure   Subset accuracy
Eigenfaces     0.261373       0.484974   0.484974    0.605459   0.524676    0.367246
Fisherfaces    0.34381        0.398677   0.398677    0.520265   0.438737    0.27957
LBPH           0.309898       0.444169   0.444169    0.622002   0.50284     0.269644
Wawo           0.351944       0.340433   0.340433    0.463193   0.381086    0.219189
Ensemble       0.368211       0.301213   0.301213    0.438379   0.34654     0.166253

Table 6.3 shows the results for the NR test. Despite the fact that the training data is of somewhat lower quality and the background is highly dynamic and cluttered, with many unknown individuals present, Wawo and Fisherfaces performed on par with the results from the News test, although Eigenfaces and LBPH performed worse. To varying degrees, the precision measurements indicate that all methods produced a large number of false positives. Visual inspection shows that the error is due to non-face regions classified as faces, unknown individuals identified as belonging to the gallery and, to a greater extent than in the News test, known subjects falsely identified as other known subjects. The last issue may arise partly from the relatively lower quality of the subject gallery, but also from the more dynamic and varied poses and facial expressions the individuals in the probe assume. Despite the same gallery size limitation to the ensemble method as in the News test, it surprisingly outperforms all other algorithms except Wawo.

Table 6.3: NR test results. Accuracy and precision are equal because y_i ⊆ z_i for all i where |y_i ∩ z_i| ≠ 0. If y_i = z_i, accuracy and precision would also equal recall; that they do not indicates that a number of false positives were present.

Algorithm      Hamming loss   Accuracy   Precision   Recall     F-measure   Subset accuracy
Eigenfaces     0.288492       0.210648   0.210648    0.308333   0.24213     0.119444
Fisherfaces    0.21746        0.340046   0.340046    0.55       0.406111    0.152778
LBPH           0.263492       0.244444   0.244444    0.388889   0.291667    0.105556
Wawo           0.194444       0.389583   0.389583    0.625      0.465463    0.169444
Ensemble       0.21746        0.343981   0.343981    0.55       0.40963     0.155556
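For reference, the example-based multi-label measures reported in tables 6.1-6.3 can be computed per probe frame from the true label set y_i and the predicted label set z_i roughly as in the sketch below. It follows the standard definitions surveyed by Tsoumakas and Katakis [57]; the handling of frames with empty label sets is an assumed convention chosen so that the identities noted in the table captions hold, and whether the thesis averages the F-measure per frame or derives it from the averaged precision and recall is not restated in this chapter.

# Example-based multi-label measures per probe (sketch; the handling of empty
# label sets is an assumed convention, not taken from the thesis framework).
def multilabel_scores(true_sets, predicted_sets, num_labels):
    # true_sets, predicted_sets: one set of subject labels per probe frame.
    n = len(true_sets)
    hamming = accuracy = precision = recall = f_measure = subset = 0.0
    for y, z in zip(true_sets, predicted_sets):
        hamming += len(y ^ z) / float(num_labels)  # symmetric difference
        subset += 1.0 if y == z else 0.0
        if not y and not z:
            # Assumed convention: a frame with no true and no predicted faces
            # counts as a perfect prediction for the set-based measures.
            accuracy += 1.0
            precision += 1.0
            recall += 1.0
            f_measure += 1.0
            continue
        overlap = float(len(y & z))
        accuracy += overlap / len(y | z)
        precision += overlap / len(z) if z else 0.0
        recall += overlap / len(y) if y else 0.0
        f_measure += 2.0 * overlap / (len(y) + len(z))
    return {"hamming_loss": hamming / n, "accuracy": accuracy / n,
            "precision": precision / n, "recall": recall / n,
            "f_measure": f_measure / n, "subset_accuracy": subset / n}

With these definitions, a frame whose prediction contains the true subjects plus extra false positives contributes equally to accuracy and precision but more to recall, which is the pattern visible in tables 6.2 and 6.3.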
The above results indicate that a major obstacle to applying face recognition in real-life scenarios is the high rate of false positives in cluttered background conditions and with unknown individuals present in the scene. This problem can be attacked from two angles. The first is to improve the face detection algorithm so as to reduce the number of non-face regions falsely detected. Besides testing algorithms other than the cascade classifier, or using better training data, it may also be possible to preprocess the image to optimize the conditions for face detection. The second angle is to improve the face recognition algorithms themselves so that they correctly label unknown individuals as such. While no formal evaluation has been performed in this project, rudimentary investigation has indicated that recognition performance can be optimized for a particular dataset to some degree by finding an appropriate confidence threshold. However, as the dataset grows large, these gains are likely to diminish.

Chapter 7 Conclusion

In this report, the implementation of a standalone version of the Vidispine face recognition plugin was documented, the tradeoff between face recognition processing speed and accuracy was evaluated, and the possibility of integrating face recognition with object tracking was investigated. Among the algorithms evaluated, Wawo performs better than the others for all but the smallest gallery sizes. However, this comes at a great performance cost, as the recognition time scales linearly with the number of samples in the gallery. Eigenfaces outperforms Wawo in terms of accuracy for small gallery sizes, Fisherfaces has almost comparable accuracy for large gallery sizes, and the processing time of both is constant with respect to gallery size. This implies that the best method to use depends on the requirements of the application. If the goal is to maximize accuracy at any computational cost with a limited, but not too limited (>5 samples per subject), gallery, Wawo may be the best option. If the gallery size is limited (<25 samples per subject) but processing time should also be kept in check, Eigenfaces would be better.
Finally, if a relatively large gallery (>30 samples per subject) can be acquired and processing time should be minimized, Fisherfaces seems to be the best choice. While acquiring a large number of high-quality photos of a single individual, especially one who is not directly available, can be very difficult, a large number of admissible samples can easily be extracted from short video clips. In the modern era of the Web and real-time media streaming, this is a much more viable option than it used to be.

Due to the nature of the evaluation dataset, these recommendations are conditional on the assumption that the probe data has an uncluttered background and constant illumination. A cluttered background in particular can be devastating for recognition performance, as the number of false positives for both face detection and recognition rises dramatically. The best recommendation that can be given based on this evaluation is simply to restrict the application scope so as not to include scenes with cluttered backgrounds and highly variable illumination.

One of the original goals was to investigate the possibilities of profile face recognition. While the framework supports profile face recognition in principle, by supplying profile training data to the elementary detection and recognition algorithms, the performance of such an approach has not been evaluated in this report. This is mainly because hardly any suitable data for such an evaluation is readily available, and gathering such data is a time-consuming process that did not fit into the project schedule.

An approach for integrating image-based face recognition algorithms and the CAMSHIFT object tracking algorithm was developed, the specifics of which are described in chapter five. Using this method, the accuracy of the basic face recognition algorithms was improved by approximately 35-45 percentage points. In certain contexts, the method seems able to overcome some major obstacles to successful face detection and recognition, such as partial occlusion, face deformation, pose changes and illumination variability.

7.1 Limitations of the evaluation

The majority of tests in this evaluation were performed on the NRC-IIT dataset, which covers only a restricted set of possible scene and imaging conditions. As a consequence, the results, discussion and recommendations are mainly applicable to similar types of data, that is, scenes with static, uncluttered backgrounds and constant illumination. Based on the literature, this is a problem that affects the field as a whole, and many authors call for more standardized face recognition datasets covering wider ranges of variables [6]. If more time had been available, additional data would have been gathered for a more informative evaluation. This would be made easier if the scope of the intended application area of the framework were restricted and specified in more detail, as the amount of necessary data would be reduced.

7.2 Future work

As the original aims of this project were quite broad, so are the possible future lines of investigation. As previously mentioned, a primary issue throughout the project was the lack of useful data for evaluating the developed solutions.
Specifically, in order to systematically assess the performance of the various techniques under specific conditions, such as variability in illumination, pose, background clutter, facial expression, and external facial features such as beards, glasses or makeup, test datasets that introduce these factors one by one, and in small combinations, would be required. For a more restricted application area, the amount of necessary test data would be limited to those conditions that appear in the real-world scenarios in which the framework would be used. There is also a lack of profile image and video test data relative to the number of frontal face databases available in the literature. An important future project could be to build an extensive test dataset, appropriate to the intended application, according to the above specifications, to be used to evaluate the performance of new and existing algorithms under development.

The strength and applicability of the framework can always be enhanced by adding new algorithms for face detection, face recognition and face tracking. If a more specific application area is selected, this would inform the choice of new algorithms to add, as different algorithms have different strengths and weaknesses that may make them more or less suitable for a particular application. In addition, it would be interesting to see how the ensemble method could be improved by adding new algorithms, from different face recognition paradigms, that complement each other's primary weaknesses.

Being able to distinguish between known and unknown individuals is relevant to many applications, so a future project could be to try to find more general solutions to this problem. This is yet another instance of a problem that most likely becomes easier if the problem domain is restricted. Another possible direction could be to investigate image preprocessing methods to improve detection and recognition performance.

Chapter 8 Acknowledgements

I would like to thank my supervisor, Petter Ericson, for keeping me on track and providing suggestions and insight throughout the project. I want to thank Johanna Björklund and Emil Lundh for providing the initial thesis concept and Codemill for providing a calm, quiet workspace. I'd also like to thank everyone at Codemill for helping me get set up and providing feedback. Finally, I want to thank my girlfriend, my parents and my brother for supporting me all the way.

References

[1] OpenBR (Open Source Biometric Recognition). http://openbiometrics.org/ (visited 2013-11-28).
[2] OpenCV (Open Source Computer Vision). http://opencv.org/ (visited 2013-11-28).
[3] Wawo Technology AB. http://www.wawo.com/ (visited 2013-11-28).
[4] R. Bolle, A. K. Jain, and S. Pankanti. Biometrics: Personal Identification in Networked Society. Kluwer Academic Publishers, 1999.
[5] A. M. P. Canuto, A. M. Santos, and A. F. Neto. Evaluating classification methods applied to multi-label tasks in different domains. International Journal of Computer Information Systems and Industrial Management Applications, 3:218-227, 2011.
[6] A. H. El-baz, A. S. Tolba, and A. A. El-harby. Face recognition: A literature review. International Journal of Signal Processing, 2(2):88-103, 2005.
[7] O. Javed, A. Yilmaz, and M. Shah. Object tracking: A survey. ACM Comput. Surv., 38(4), 2006.
[8] T. Ahonen, A. Hadid, and M. Pietikäinen. Face recognition with local binary patterns. In Proc. of the European Conference on Computer Vision (ECCV), pages 469-481, 2004.
[9] P. Ho, B. Heisele, and T. Poggio. Face recognition with support vector machines: Global versus component-based approach. In Proc. 8th International Conference on Computer Vision, pages 688-694, 2001.
[10] V. Blanz and T. Vetter. Face recognition based on fitting a 3d morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:2003, 2003.
[11] A. L. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97:245-271, 1997.
[12] G. R. Bradski. Computer vision face tracking for use in a perceptual user interface, 1998.
[13] A. Hertzmann, C. Bregler, and H. Biermann. Recovering non-rigid 3d shape from image streams. In CVPR, pages 2690-2696. IEEE Computer Society, 2000.
[14] M. Oren, C. P. Papageorgiou, and T. Poggio. A general framework for object detection. In Proceedings of the Sixth International Conference on Computer Vision, pages 555-. IEEE Computer Society, 1998.
[15] A. Albiol, E. Acosta, L. Torres, and E. J. Delp. An automatic face detection and recognition system for video indexing applications. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 4:3644-3647, 2002.
[16] K. Etemad and R. Chellappa. Discriminant analysis for recognition of human face images. Journal of the Optical Society of America A, 14:1724-1733, 1997.
[17] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(7):179-188, 1936.
[18] A. W. Fitzgibbon and A. Zisserman. Joint manifold distance: A new approach to appearance based clustering. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR'03, pages 26-33, Washington, DC, USA, 2003. IEEE Computer Society.
[19] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Second European Conference on Computational Learning Theory, pages 23-37. Springer-Verlag, 1995.
[20] P. Fua. Regularized bundle adjustment to model heads from image sequences without calibrated data. International Journal of Computer Vision, 38:154-157, 2000.
[21] A. K. Roy Chowdhury, G. Aggarwal, and R. Chellappa. A system identification approach for video-based face recognition. In ICPR (4), pages 175-178, 2004.
[22] S. Z. Li, G. Guo, and K. Chan. Face recognition by support vector machines. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition 2000, FG '00, pages 196-201, Washington, DC, USA, 2000. IEEE Computer Society.
[23] J. W. Fisher, G. Shakhnarovich, and T. Darrell. Face recognition from long-term observations. In Proc. IEEE European Conference on Computer Vision, pages 851-868, 2002.
[24] Y. Gao and M. K. H. Leung. Face recognition using line edge map. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:764-779, 2002.
[25] G. G. Gordon. Face recognition based on depth maps and surface curvature. In SPIE Geometric Methods in Computer Vision, pages 234-247, 1991.
[26] D. O. Gorodnichy. Video-based framework for face recognition in video. In Second Workshop on Face Processing in Video (FPiV'05) in Proceedings of the Second Canadian Conference on Computer and Robot Vision (CRV'05), pages 330-338, 2005.
[27] C. Gürel. Development of a face recognition system. Master's thesis, Atilim University, 2011.
[28] M. R. Lyu, H.-M. Tang, and I. King. Face recognition committee machine. In ICME, pages 425-428. IEEE, 2003.
[29] C. Harris and M. Stephens. A combined corner and edge detector. In Proc. of the Fourth Alvey Vision Conference, pages 147-151, 1988.
[30] J. Ghosn, I. J. Cox, and P. N. Yianilos. Feature-based face recognition using mixture-distance. In Proceedings of the 1996 Conference on Computer Vision and Pattern Recognition (CVPR '96), pages 209-216, Washington, DC, USA, 1996. IEEE Computer Society.
[31] M.-H. Yang, J. Ho, and D. Kriegman. Video-based face recognition using probabilistic appearance manifolds. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 313-320, 2003.
[32] N. Ahuja, J. Weng, and T. S. Huang. Learning recognition and segmentation of 3d objects from 2d images. Proc. IEEE Int'l Conf. Computer Vision, pages 121-128, 1993.
[33] R. Jafri and H. R. Arabnia. A survey of face recognition techniques. Journal of Information Processing Systems, 5(2):41-68, 2009.
[34] Y. Li, K. Jonsson, J. Kittler, and J. Matas. Learning support vectors for face verification and recognition. In Fourth IEEE International Conference on Automatic Face and Gesture Recognition, pages 208-213. IEEE Computer Society, 2000.
[35] H.-S. Le. Face Recognition: A Single View Based HMM Approach. PhD thesis, Umeå University, 2008.
[36] J.-H. Lee and W.-Y. Kim. Video summarization and retrieval system using face recognition and MPEG-7 descriptors. Image and Video Retrieval, 3115:179-188, 2004.
[37] M. Levoy and P. Hanrahan. Light field rendering. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '96, pages 31-42, New York, NY, USA, 1996. ACM.
[38] C.-J. Lin. On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12(6):1288-1298, 2001.
[39] X. Liu and T. Chen. Video-based face recognition using adaptive hidden Markov models. In CVPR (1), pages 340-345. IEEE Computer Society, 2003.
[40] D. J. Kriegman, M.-H. Yang, and N. Ahuja. Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):34-58, 2002.
[41] D. McCullagh. Call it Super Bowl Face Scan I. http://www.wired.com/politics/law/news/2001/02/41571 (visited 2013-11-26).
[42] M. C. Santana, O. Déniz, and M. Hernández. Face recognition using independent component analysis and support vector machines. In AVBPA, volume 2091 of Lecture Notes in Computer Science, pages 59-64. Springer, 2001.
[43] K. Fukui, O. Yamaguchi, and K. Maeda. Face recognition using temporal image sequence. In Proceedings of the 3rd International Conference on Face & Gesture Recognition, FG '98, pages 318-323, Washington, DC, USA, 1998. IEEE Computer Society.
[44] J. P. Hespanha, P. N. Belhumeur, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell., 19(7):711-720, 1997.
[45] P. J. Phillips. Support vector machines applied to face recognition. In Advances in Neural Information Processing Systems 11, pages 803-809. MIT Press, 1999.
[46] C. J. Poelman and T. Kanade. A paraperspective factorization method for shape and motion recovery. IEEE Trans. Pattern Anal. Mach. Intell., 19(3):206-218, 1997.
[47] V. Pavlovic, R. Huang, and D. N. Metaxas. A hybrid face recognition method using Markov random fields. In Proceedings of ICPR 2004, pages 157-160, 2004.
[48] S.-Y. Kung, S.-H. Lin, and L.-J. Lin. Face recognition/detection by probabilistic decision-based neural network. IEEE Transactions on Neural Networks, 8(1):114-132, 1997.
[49] A. C. Tsoi, S. Lawrence, C. L. Giles, and A. D. Back. Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1):98-113, 1997.
[50] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:888-905, 1997.
[51] C. Stauffer and W. E. L. Grimson. Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:747-757, 2000.
[52] K.-K. Sung and T. Poggio. Learning human face detection in cluttered scenes. Computer Analysis of Images and Patterns, 970:432-439, 1995.
[53] B. Takács. Comparing face images using the modified Hausdorff distance. Pattern Recognition, 31(12):1873-1881, 1998.
[54] A. S. Tolba. A parameter-based combined classifier for invariant face recognition. Cybernetics and Systems, 31(8):837-849, 2000.
[55] A. S. Tolba and A. N. S. Abu-Rezq. Combined classifiers for invariant face recognition. Pattern Anal. Appl., 3(4):289-302, 2000.
[56] C. Tomasi. Shape and motion from image streams under orthography: A factorization method. International Journal of Computer Vision, 9:137-154, 1992.
[57] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. Int. J. Data Warehousing and Mining, 2007:1-13, 2007.
[58] M. A. Turk and A. P. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-86, 1991.
[59] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. Computer Vision and Pattern Recognition, pages 586-591, 1991.
[60] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.
[61] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, volume 1, pages 511-518. IEEE Computer Society, 2001.
[62] P. Wagner. Face recognition with OpenCV. http://docs.opencv.org/trunk/modules/contrib/doc/facerec/facerec (visited 2013-12-02).
[63] W. Zhao and R. Chellappa. SFS based view synthesis for robust face recognition. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, pages 285-292, 2000.
[64] S. K. Zhou and R. Chellappa. Probabilistic human recognition from video. In ECCV (3), volume 2352 of Lecture Notes in Computer Science, pages 681-697. Springer, 2002.