The remainder of this thesis is organized as follows. Chapter 2 describes the face detection subsystem based on AdaBoost. Chapter 3 presents the face age estimation algorithm, including texture analysis using Gabor wavelets, data reduction based on orthogonal locality preserving projections, and classification. Chapter 4 presents the experimental results and comparisons. Finally, the conclusions and future work are presented in Chapter 5.
2 Chapter 2 Face Detection
2.1 Face Detection by AdaBoost
In this section, the face detection subsystem is introduced. The detection technique employed in this subsystem is based on the AdaBoost algorithm that Viola and Jones proposed in 2004 [11]. The Viola-Jones classifier employs AdaBoost at each node of the cascade to learn a multitree classifier with a high detection rate at the cost of a low rejection rate. The algorithm incorporates several innovative features: 1) It uses Haar-like input features: a threshold applied to sums and differences of rectangular image regions. 2) Its integral image technique enables rapid computation of the value of rectangular regions or such regions rotated by 45 degrees. This data structure is used to accelerate computation of the Haar-like input features. 3) It uses statistical boosting to create binary (face or non-face) classification nodes characterized by high detection and weak rejection. 4) It organizes the weak classifier nodes into a rejection cascade. In other words, the first group of classifiers is selected that best detects image regions containing an object while allowing many mistaken detections; the next classifier group is the second-best at detection with weak rejection; and so forth. In test mode, an object is detected only if it makes it through the entire cascade.
Fig. 2-1 shows the face detection process in our implementation. First, search windows of various sizes are used to find face candidates at multiple scales in the input image. The face candidates at different scales reflect the different distances of the clients in the input image. We define 12 search windows in total, starting from the smallest block size of 24 by 24 and enlarging each successive window by a multiplier of 1.25. All face candidates are normalized to the same size and luminance. Then, each normalized candidate is classified as face or non-face by the face detector, which was trained beforehand. Finally, the face regions are located precisely by a region-based clustering method at the end of the subsystem.
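The window sizes described above can be sketched in a few lines of Python. This is only an illustration: the function name `search_window_sizes` is hypothetical, and rounding each size to whole pixels is an assumption, since the text does not say how fractional sizes are handled.

```python
def search_window_sizes(base=24, multiplier=1.25, count=12):
    """Return the side lengths of the square search windows, smallest first."""
    # Assumption: sizes are rounded to whole pixels (not stated in the text).
    return [round(base * multiplier ** k) for k in range(count)]


sizes = search_window_sizes()
print(sizes)  # 12 side lengths, starting at 24 and growing by a factor of 1.25
```

With these assumptions the largest window is a little under 280 pixels on a side, which bounds the largest face the detector can report.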
Fig. 2-1: Flowchart of face detection
2.2 Feature Selection
The intensity-based features employed in this work are based on Haar features. Haar-like features are reminiscent of the Haar basis functions used by Papageorgiou [12]. A composition of rectangles of different brightness can represent the light and dark regions in an image. If the complete set of rectangle features that represent the target object is known, the target object can be detected by comparing unknown objects against these rectangle features. Fig. 2-2 shows the difference in intensity between the region of the eyes and a region across the upper cheeks, which is used for characterizing faces.
Fig. 2-2: Two of the rectangle features that characterize the face
Fig. 2-3: The four kinds of rectangle features
We selected four types of rectangle features, as illustrated in Fig. 2-3: the vertical edge, horizontal edge, vertical line, and diagonal edge. The features are defined as Eq. (2.1).
$value_{subtracted} = f(x, y, w, h, Type)$ (2.1)
where $(x, y)$ is the origin of the rectangle feature in the relative coordinates of the search window. The search window is used to find the block in the image that contains a face. $w$ and $h$ denote the relative width and height of the rectangle feature, respectively. $Type$ represents the kind of rectangle feature. $value_{subtracted}$ is the sum of the pixels in the white rectangle subtracted from the sum of those in the black rectangle.
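As a minimal sketch of how such a feature value can be computed with an integral image, consider the following Python code. The function names, the zero-padded layout of the integral image, and the choice of the left half as the white rectangle are assumptions made for illustration; only the vertical edge type is shown.

```python
def integral_image(img):
    """ii[r][c] holds the sum of img over rows < r and columns < c."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y): four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def vertical_edge_feature(ii, x, y, w, h):
    """Two-rectangle feature: (sum of black) - (sum of white).

    Assumption: the left half is the white rectangle, the right half the black one.
    """
    half = w // 2
    white = rect_sum(ii, x, y, half, h)
    black = rect_sum(ii, x + half, y, half, h)
    return black - white


img = [[1, 2, 3, 4],
       [5, 6, 7, 8]]
ii = integral_image(img)
print(vertical_edge_feature(ii, 0, 0, 4, 2))
```

Once the integral image is built, every rectangle sum costs four table lookups regardless of the rectangle size, which is what makes evaluating many features per window affordable.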
A weak classifier can be built from the single rectangle feature that best separates face and non-face images. That is, for each rectangle feature, the weak learner determines the optimal threshold classification function, such that the minimum number of examples is misclassified.
Fig. 2-4: The positive database
Fig. 2-5: The negative database
Fig. 2-6: The flow chart of selecting threshold for rectangle features
The selected threshold for each rectangle feature is acquired through the training process by our database which contains 4,000 face images and 59,000 non-face images. Fig. 2-4 and Fig. 2-5 present some face and nonface examples in our database.
The threshold selection process is shown in Fig. 2-6. To obtain the threshold for a rectangle feature, the distribution of the subtracted values of this feature over the face database must be collected. A threshold is then found that discriminates the two classes with a higher detection rate than the other candidates. A single rectangle feature that best separates the face and non-face samples can be considered a weak classifier $h(x, f, p, \theta)$, chosen to minimize the probability of misclassification. In this procedure, we collect the distribution of $f(x, y, w, h, Type)$ over the images in the database, and then the threshold with the highest distinguishability between the two classes is chosen.
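The threshold search can be illustrated with a small exhaustive sketch. It assumes the usual convention from [11] that the weak classifier outputs 1 (face) when `polarity * value < polarity * theta`; the text does not spell this form out, and the function names and toy data are hypothetical.

```python
def stump_predict(value, polarity, theta):
    """Weak classifier: 1 (face) if polarity * value < polarity * theta, else 0."""
    return 1 if polarity * value < polarity * theta else 0


def select_threshold(values, labels):
    """Return (errors, polarity, theta) with the fewest misclassified samples.

    values: one feature value per training image; labels: 1 = face, 0 = non-face.
    """
    distinct = sorted(set(values))
    # Candidate thresholds: below the smallest value, between neighbours, above the largest.
    thetas = [distinct[0] - 1]
    thetas += [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    thetas += [distinct[-1] + 1]
    best = None
    for theta in thetas:
        for polarity in (1, -1):
            errors = sum(stump_predict(v, polarity, theta) != y
                         for v, y in zip(values, labels))
            if best is None or errors < best[0]:
                best = (errors, polarity, theta)
    return best


# Toy data: faces happen to have small feature values here.
values = [1, 2, 3, 10, 11, 12]
labels = [1, 1, 1, 0, 0, 0]
print(select_threshold(values, labels))
```

The polarity lets the same search handle features for which faces give larger values than non-faces.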
Although each rectangle feature can be computed easily, computing the complete set of all features is extremely costly. For example, for the smallest search window of 24 by 24, the total number of rectangle features is 160,000.
2.3 Training of Face Detector
Viola and Jones [11] presented a variant of AdaBoost that is used both to select the rectangle features and to train the classifier. In its original form, the AdaBoost learning algorithm boosts the classification performance of a simple learning algorithm by combining a collection of weak classification functions to form a strong classifier.
Since a single strong classifier is rather time consuming, the cascaded classifier structure of Viola and Jones [11] is preferred for improving the detection performance and reducing the computation time. As a result, our cascaded AdaBoost classifier, built from strong classifiers, classifies each extracted face candidate stage by stage, as shown in Fig. 2-7.
Fig. 2-7: The overall classifier
In stage 1, an object extracted by the search window is either classified as a face, in which case it is allowed to enter stage 2, or rejected. Likewise, to enter stage 3 the object has to pass stage 2. In brief, an object is labeled as a face only after it passes through the whole series of stages; an object rejected by any particular stage is discarded, even if that stage is the last one.
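The stage-by-stage rejection can be sketched as follows. The stages here are stand-in callables on toy numeric "windows", not trained classifiers, and all names are hypothetical.

```python
def cascade_classify(window, stages):
    """Return True only if every stage accepts the window.

    stages: list of callables that map a window to True (face) or False.
    """
    for stage in stages:
        if not stage(window):
            return False  # rejected here; later stages are never evaluated
    return True


# Toy stages standing in for trained strong classifiers.
calls = []


def make_stage(name, test):
    def stage(window):
        calls.append(name)
        return test(window)
    return stage


stages = [
    make_stage("stage1", lambda w: w > 0),
    make_stage("stage2", lambda w: w % 2 == 0),
    make_stage("stage3", lambda w: w < 100),
]

print(cascade_classify(4, stages))   # passes all three stages
print(cascade_classify(-2, stages))  # rejected by stage1
```

Because most windows in an image are not faces, the early exit means that the cheap first stages do most of the work.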
Fig. 2-8 shows the flowchart of the training of the cascaded classifier. $f$ is the maximum acceptable false positive rate of each stage, $d$ is the minimum acceptable detection rate of each stage, $F_{target}$ is the overall false positive rate, $P$ is the set of face samples, and $N$ is the set of non-face samples. $i$ denotes the stage of the cascaded classifier and $n_i$ denotes the number of weak classifiers in stage $i$. The overall false positive rate must be smaller than $F_{target}$, and each stage has to satisfy the inequality $F_i \le f \times F_{i-1}$.
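To give a feel for how the per-stage bound relates to the overall target, the sketch below counts how many stages are needed in the worst case where every stage only just meets its bound. It assumes a starting rate of 1 before the first stage; the numeric values are hypothetical and are not the settings used in this work.

```python
def stages_needed(f, f_target):
    """Smallest number of stages n such that f ** n <= f_target.

    Assumption: the false positive rate before the first stage is 1 and each
    stage multiplies it by exactly f (the worst case allowed by the bound).
    """
    n, overall = 0, 1.0
    while overall > f_target:
        overall *= f
        n += 1
    return n


print(stages_needed(0.5, 1e-3))  # hypothetical f and F_target
```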
Fig. 2-8: The flow chart of training of cascaded classification
The procedure of our implemented AdaBoost process is summarized in Fig. 2-9. Let $m$ and $l$ be the numbers of non-face and face samples, respectively, and let $j$ index the $m + l$ training samples. The initial weights $w_{i,j}$ for the $i$th stage are defined as
$w_{i,j} = \frac{1}{2m}, \frac{1}{2l}$ for $y_j = 0, 1$, respectively (2.2)
The normalized weighted error with respect to the weak classifier can be expressed in the following equation:
$\epsilon_i = \min_{f, p, \theta} \sum_j w_{i,j} \, |h(x_j, f, p, \theta) - y_j|$ (2.3)
The weights are updated at each iteration as defined in Eq. (2.4), where $e_j = 0$ if sample $x_j$ is classified correctly and $e_j = 1$ otherwise. The final classifier for the $i$th stage is then formed from the selected weak classifiers.
Fig. 2-9: The flow chart of training classification for each stage
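The training loop for one stage can be sketched as below. This is not the implementation used in this work: it assumes the standard update from Viola and Jones [11], in which correctly classified samples are multiplied by $\beta = \epsilon / (1 - \epsilon)$ and the final classifier is a weighted vote with weights $\log(1/\beta)$. The exhaustive stump search, the clamping of the error, and the toy data are further assumptions for illustration.

```python
import math


def stump(value, polarity, theta):
    """Weak classifier: 1 (face) if polarity * value < polarity * theta, else 0."""
    return 1 if polarity * value < polarity * theta else 0


def best_stump(X, y, w):
    """Search features, polarities and thresholds for the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        vals = sorted(set(row[f] for row in X))
        thetas = [vals[0] - 1] + [(a + b) / 2 for a, b in zip(vals, vals[1:])]
        for theta in thetas:
            for p in (1, -1):
                err = sum(wj for row, yj, wj in zip(X, y, w)
                          if stump(row[f], p, theta) != yj)
                if best is None or err < best[0]:
                    best = (err, f, p, theta)
    return best


def adaboost(X, y, rounds):
    m = y.count(0)  # number of non-face samples
    l = y.count(1)  # number of face samples
    w = [1 / (2 * m) if yj == 0 else 1 / (2 * l) for yj in y]
    classifiers = []
    for _ in range(rounds):
        total = sum(w)
        w = [wj / total for wj in w]  # normalize the weights
        err, f, p, theta = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(1/0)
        beta = err / (1 - err)
        # e_j = 0 for correct samples, so they are down-weighted by beta.
        w = [wj * (beta if stump(row[f], p, theta) == yj else 1.0)
             for row, yj, wj in zip(X, y, w)]
        classifiers.append((math.log(1 / beta), f, p, theta))
    return classifiers


def strong_classify(x, classifiers):
    """Weighted vote of the selected weak classifiers."""
    score = sum(a * stump(x[f], p, theta) for a, f, p, theta in classifiers)
    return 1 if score >= 0.5 * sum(a for a, _, _, _ in classifiers) else 0


# Toy data: two features per sample; feature 0 separates the classes.
X = [[1, 5], [2, 1], [3, 6], [6, 2], [7, 7], [8, 3]]
y = [1, 1, 1, 0, 0, 0]
clf = adaboost(X, y, rounds=2)
print([strong_classify(x, clf) for x in X])
```

In a real detector each column of `X` would be the value of one rectangle feature, so selecting a stump also selects a feature.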
2.4 Region-based Clustering
The face detector usually finds more than one face candidate around each face, as shown in Fig. 2-10. A problem arises when several blocks around a single face are classified as faces. A region-based clustering method is therefore proposed to solve this problem. There are two levels of clustering in this method: local scale clustering and global scale clustering. The local scale clustering clusters the blocks of the same scale and applies a simple filter based on the number of blocks within each cluster. If a cluster contains more than one block, it is preserved; otherwise, it is discarded.
Fig. 2-10: The image after face detection
... distance between the centers of these two regions. Fig. 2-11 (a) and (b) show the overlapped region and the distance between the centers of two blocks in the local scale clustering and the global scale clustering, respectively. In Fig. 2-11 (a), the two blocks are processed as the same cluster, and in Fig. 2-11 (b), the two blocks are processed as different clusters because the distance between their centers does not satisfy Eq. (2.8).
(a)
(b)
Fig. 2-11: The overlapped region and the distance between the centers of two blocks in (a) the local scale clustering and (b) the global scale clustering
Finally, the global scale clustering takes the blocks obtained from the local scale clustering and labels the face regions with the average size of all available blocks.
Some results of the entire region-based clustering process at both the local-scale and global-scale levels are shown in Fig. 2-12. As the right image in Fig. 2-12 shows, only one block is clustered as the face region of each face after applying our local and global clustering processes, even though more than five face candidates are obtained for an image with only five faces.
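A rough sketch of one clustering level is given below. Eq. (2.8) is not reproduced in this section, so `same_cluster` uses a hypothetical stand-in criterion (centers closer than a fraction of the smaller block size); the greedy single-pass grouping, the `ratio` value, and the `(x, y, size)` block format are also assumptions.

```python
def center(block):
    x, y, size = block
    return (x + size / 2, y + size / 2)


def same_cluster(a, b, ratio=0.5):
    """Hypothetical stand-in for Eq. (2.8): centers closer than ratio * smaller size."""
    (ax, ay), (bx, by) = center(a), center(b)
    dist = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
    return dist < ratio * min(a[2], b[2])


def cluster_blocks(blocks, ratio=0.5):
    """Group blocks greedily, drop single-block clusters, average the rest."""
    clusters = []
    for blk in blocks:
        for cl in clusters:
            if any(same_cluster(blk, other, ratio) for other in cl):
                cl.append(blk)
                break
        else:
            clusters.append([blk])
    kept = [cl for cl in clusters if len(cl) > 1]  # the simple count filter
    return [tuple(sum(b[i] for b in cl) / len(cl) for i in range(3)) for cl in kept]


# Three overlapping candidates around one face plus one isolated false alarm.
blocks = [(10, 10, 24), (12, 11, 24), (11, 9, 24), (100, 100, 24)]
print(cluster_blocks(blocks))
```

The isolated block is discarded by the count filter, and the three overlapping candidates collapse into one averaged region.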
(a) (b)
Fig. 2-12: (a) The results of the local scale clustering (b) the results of the global scale clustering
3 Chapter 3
Age Estimation Algorithm
3.1 Texture Analysis Using Gabor Wavelets
Gabor wavelets are used for image analysis because of their biological relevance and computational properties. The kernels of Gabor wavelets are similar to the 2D receptive field profiles of mammalian cortical simple cells, exhibit strong characteristics of spatial locality and orientation selectivity, and are optimally localized in the space and frequency domains. The Gabor wavelet transform is generally acknowledged to be particularly suitable for image decomposition and representation when the goal is the derivation of local and discriminating features.
Moreover, Donato et al. [13] have shown experimentally that the Gabor wavelet representation gives better performance for classifying facial actions. In this section, the basics of Gabor wavelets are introduced, the Gabor feature representation of images is described, and a Gabor feature vector for age estimation is derived.
3.2 Gabor Wavelets Transform
$k_{\mu,\nu} = k_\nu e^{i\phi_\mu}$, (3.2)
where $k_\nu = k_{max}/f^\nu$ and $\phi_\mu = \pi\mu/8$. $k_{max}$ is the maximum frequency, and $f$ is the spacing factor between kernels in the frequency domain. Generally, the Gabor kernels in Eq. (3.1) are all self-similar since they can be generated from one filter, the mother wavelet, by scaling and rotation via the wave vector $k_{\mu,\nu}$. Each kernel is a product of a Gaussian envelope and a complex plane wave; the first term in the square brackets in Eq. (3.1) determines the oscillatory part of the kernel, and the second term compensates for the DC value. The parameter $\sigma$ relates the standard deviation (width) of the Gaussian envelope to the wavelength. Fig. 3-1 shows both the real part and the imaginary part of a Gabor wavelet kernel waveform.
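The kernel can be evaluated numerically as in the sketch below. It assumes that Eq. (3.1) is the form commonly used in the Gabor face literature, $\psi_{\mu,\nu}(z) = \frac{\|k_{\mu,\nu}\|^2}{\sigma^2} e^{-\|k_{\mu,\nu}\|^2 \|z\|^2 / (2\sigma^2)} [e^{i k_{\mu,\nu} \cdot z} - e^{-\sigma^2/2}]$, which matches the description above. The parameter values $k_{max} = \pi/2$, $f = \sqrt{2}$ and $\sigma = 2\pi$ are common choices, not values confirmed by this text.

```python
import cmath
import math


def wave_vector(mu, nu, k_max=math.pi / 2, f=math.sqrt(2)):
    """k_{mu,nu} = k_nu * exp(i * phi_mu), as in Eq. (3.2), stored as a complex number."""
    k_nu = k_max / f ** nu
    phi_mu = math.pi * mu / 8
    return k_nu * cmath.exp(1j * phi_mu)


def gabor_kernel(x, y, mu, nu, sigma=2 * math.pi):
    """Value of the (assumed) Gabor kernel at z = (x, y)."""
    k = wave_vector(mu, nu)
    k2 = abs(k) ** 2
    z2 = x * x + y * y
    envelope = (k2 / sigma ** 2) * math.exp(-k2 * z2 / (2 * sigma ** 2))
    plane_wave = cmath.exp(1j * (k.real * x + k.imag * y))  # oscillatory part
    dc = math.exp(-sigma ** 2 / 2)                          # DC compensation
    return envelope * (plane_wave - dc)


# 5 scales (nu) times 8 orientations (mu) give the 40 kernels.
bank = [(mu, nu) for nu in range(5) for mu in range(8)]
print(len(bank), gabor_kernel(0, 0, 0, 0))
```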
(a) (b)
Fig. 3-1: Waveform of Gabor wavelets kernel (a) real part (b) imaginary part
Fig. 3-2: The real part of the 5 x 8 Gabor wavelets
where $z = (x, y)$, and $*$ denotes the convolution operator.
To apply the convolution theorem, the Fast Fourier Transform (FFT) is used to derive the convolution output. Eq. (3.4) and Eq. (3.5) define the convolution via the FFT,
where $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the Fourier and inverse Fourier transforms, respectively.
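The convolution theorem itself can be checked on a small example. The sketch below is one-dimensional for brevity and uses a naive discrete Fourier transform in place of a real FFT routine; it compares the Fourier-domain product with a direct circular convolution. All names are hypothetical, and an image implementation would use a 2D FFT with suitable zero padding.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def circular_convolve_direct(a, b):
    """Circular convolution computed straight from its definition."""
    n = len(a)
    return [sum(a[t] * b[(k - t) % n] for t in range(n)) for k in range(n)]


def circular_convolve_fourier(a, b):
    """Same result via the convolution theorem: multiply the spectra, then invert."""
    A, B = dft(a), dft(b)
    product = [p * q for p, q in zip(A, B)]
    return [v.real for v in dft(product, inverse=True)]


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.0, 1.0, 0.5, 0.0]
direct = circular_convolve_direct(signal, kernel)
fourier = circular_convolve_fourier(signal, kernel)
print(direct)
print(fourier)
```

The two results agree up to floating-point error; the benefit of the Fourier route appears only when a true FFT is used on large inputs.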
Fig. 3-3: Sample image and magnitude of 40 convolution outputs
Fig. 3-3 shows the magnitude of the convolution outputs of a sample image. The outputs exhibit strong characteristics of spatial locality, scale selectivity, and orientation selectivity, corresponding to those displayed in Fig. 3-2. Such characteristics produce salient local features that are suitable for visual event recognition. From now on, we denote the magnitude of the convolution outputs by $O_{\mu,\nu}(z)$.
3.3 Scheme of Gabor Wavelets Features
Generally, the convolution results corresponding to all Gabor wavelets are put together as a whole when Principal Component Analysis (PCA) or other algorithms are applied for dimension reduction. In this section, we introduce three different schemes: the Parallel Dimension Reduction Scheme (PDRS), the Ensemble Dimension Reduction Scheme (EDRS), and the Multi-channel Dimension Reduction Scheme (MDRS) [17]. Moreover, we further compare their performance.
Fig. 3-4 shows the principle of PDRS. First, 40 Gabor wavelet features are extracted from each sample. Then a PCA projection matrix is trained for every channel, and the resulting features are combined by a voting method that will be described later.
Fig. 3-4: Parallel Dimension Reduction Scheme
Fig. 3-5: Ensemble Dimension Reduction Scheme
Fig. 3-6: Multi-channel Dimension Reduction Scheme
EDRS is the most common scheme used for Gabor wavelet features. As shown in Fig. 3-5, the difference between PDRS and EDRS is that EDRS concatenates the 40 Gabor wavelet features instead of using them in parallel. In addition, Xiaodong Li et al. [17] proposed the MDRS scheme in 2009. As shown in Fig. 3-6, the main idea of MDRS is to train a PCA projection matrix for the same channel across different samples. In [17], Xiaodong Li et al. showed that MDRS performs better than EDRS in facial feature extraction using the Gabor wavelet transform.
To compare the performance of PDRS and MDRS, the K-Nearest Neighbor (KNN) classifier is used in the experiment. For PDRS, we use a voting method called Gaussian voting to combine the 40 channels. In Gaussian voting, a KNN classifier is applied to each channel, which yields 40 predicted ages. Each predicted age is treated as the mean of a Gaussian distribution, and the distributions are accumulated into a histogram. The highest peak of the histogram is the final predicted age. For MDRS, we use the concatenated feature directly.
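Gaussian voting can be sketched as follows. The standard deviation `sigma`, the integer age bins, and the age range of 0 to 69 (taken from the database described in this section) are assumptions; the text does not give the settings actually used.

```python
import math


def gaussian_vote(predicted_ages, sigma=2.0, min_age=0, max_age=69):
    """Accumulate one Gaussian per prediction over integer ages and return the peak."""
    ages = range(min_age, max_age + 1)
    histogram = [
        sum(math.exp(-((a - p) ** 2) / (2 * sigma ** 2)) for p in predicted_ages)
        for a in ages
    ]
    peak = max(range(len(histogram)), key=histogram.__getitem__)
    return min_age + peak


# Hypothetical per-channel predictions (only six shown instead of 40).
channel_predictions = [20, 21, 22, 21, 50, 3]
print(gaussian_vote(channel_predictions))
```

Compared with a plain majority vote, nearby predictions such as 20, 21 and 22 reinforce each other, while isolated outliers contribute little to the peak.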
The FG-NET Aging Database [18] is adopted for the experiments. The database contains 1002 high-resolution color and grey scale face images with large variations in lighting, pose, and expression. There are 82 subjects (of multiple races) in total, with ages ranging from 0 to 69 years. We use the mean absolute error (MAE) criterion to evaluate the performance of each age estimation. MAE denotes the average of the absolute errors between the estimated ages and the ground truth ages. It is defined as
$MAE = \frac{1}{N} \sum_{k=1}^{N} |\hat{a}_k - a_k|$
where $\hat{a}_k$ and $a_k$ are the estimated and ground truth ages of the $k$th test image, and $N$ is the total number of test images. Table 3-1 shows the experimental results of the two schemes. The results demonstrate that MDRS is the better scheme.
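The MAE criterion is straightforward to compute; a minimal sketch with hypothetical ages is shown below.

```python
def mean_absolute_error(estimated, truth):
    """Average absolute difference between estimated and ground truth ages."""
    if len(estimated) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - t) for e, t in zip(estimated, truth)) / len(truth)


# Hypothetical values for three test images.
estimated_ages = [25, 30, 41]
true_ages = [22, 30, 45]
print(mean_absolute_error(estimated_ages, true_ages))
```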
The dimensionality of the Gabor wavelet feature space is overwhelmingly high, even after the dimension reduction scheme has been applied. To reduce the features further to a low-dimensional space, we compare four typical dimensionality reduction methods in this section.
(1) Principal Component Analysis (PCA) [19][20]: The concept of PCA is to find a subspace that maximizes the projection variance,
$p = \arg\max_{\|p\| = 1} p^T S p$ (3.7)
where $S$ is the scatter matrix of the data. PCA is mentioned here because it is very popular.
(2) Linear Discriminant Analysis (LDA) [19]: LDA is similar to PCA; the difference is that LDA uses class information. It defines two scatter matrices, the between-class scatter matrix and the within-class scatter matrix, computed over the classes of the training data, and the best subspace is defined from these two matrices.
(3) Locality Preserving Projections (LPP): LPP preserves the essential manifold structure by measuring local neighborhood distance information. The main idea of LPP is the construction of an undirected adjacency graph. The affinity weights of the graph are defined from the relationship between data points, and the optimal projection is then derived from them.
(4) Orthogonal Locality Preserving Projections (OLPP): OLPP produces orthogonal basis functions based on LPP and preserves the metric structure.
We still use the MAE criterion to evaluate performance. In our experiment, we vary the affinity weight of LPP and OLPP to obtain a more detailed comparison. Table 3-2 shows the MAE of each reduction method. We can see that OLPP with the cosine distance affinity weight has the best performance in age estimation.
Table 3-2: MAE of different reduction methods
Method    Best dimension    MAE
LDA       4096              11.15
Support Vector Machines (SVMs), which were developed to solve classification and regression problems, have considerable potential as classifiers of sparse training data. SVMs have similar roots to neural networks and share their well-known ability to act as universal approximators of any multivariable function to any desired degree of accuracy. This approach was developed by Vapnik et al. using statistical learning theory [23][24][25].
3.5.1 Support Vector Machines
Hard-Margin Support Vector Machines
The SVM starts from a linearly separable problem. First, we discuss hard-margin SVMs, in which the training data are linearly separable in the input space.
Then we extend the discussion to the case where the training data are not linearly separable.
For classification, the objective of the SVM is to separate the two classes by a function that is induced from the available examples. Consider the example in Fig. 3-6: there are two classes of data and many possible linear classifiers that can separate them. However, only one of them maximizes the distance between the two classes, which is called the margin; this linear classifier is called the optimal separating hyperplane.
Given a set of training data $\{(x_i, y_i)\}$, $i = 1, \dots, m$, where $x_i \in R^p$ and $y_i \in \{1, -1\}$, the associated labels are $y_i = 1$ for class 1 and $y_i = -1$ for class 2. If the data are linearly separable, the decision function can be determined by
$f(x) = w^T x + b$ (3.17)
Fig. 3-6: Separating hyperplanes
Fig. 3-7: The optimal separating hyperplane
Because the training data are linearly separable, no training data satisfy $w^T x + b = 0$, and we can select two hyperplanes that maximize the distance between the two classes. These two hyperplanes pass through the closest data points, which are named support vectors, and are called support hyperplanes. Therefore, after scaling, the problem can be described by the following equation:
$w^T x_i + b \ge 1$ for $y_i = 1$, and $w^T x_i + b \le -1$ for $y_i = -1$
Maximizing the margin means minimizing $\|w\|$. Thus, the problem becomes the following optimization problem: minimize $\frac{1}{2}\|w\|^2$ subject to $y_i (w^T x_i + b) \ge 1$, $i = 1, \dots, m$.
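To solve the above primal problem of the SVM, we use the method of Lagrange multipliers (Minoux, 1986) and construct the function
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i [y_i (w^T x_i + b) - 1]$
where $\alpha_i \ge 0$ are the Lagrange multipliers. The Lagrangian has to be minimized with respect to $w$, $b$ and maximized with respect to $\alpha$. The training data with nonzero multipliers are the support vectors, and the solution to the problem is given by $w^* = \sum_{i \in S} \alpha_i y_i x_i$, where $S$ is the set of support vectors. Hence, the classifier is simply
$f(x) = \mathrm{sgn}(w^{*T} x + b^*)$ (3.29)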
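The resulting linear classifier can be illustrated with a toy example. The weights `w` and bias `b` below are picked by hand for a small two-dimensional data set rather than obtained by solving the optimization problem, and mapping a decision value of exactly zero to class 1 is an arbitrary choice.

```python
def decision_value(w, b, x):
    """w^T x + b for plain Python sequences."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """sgn(w^T x + b); a value of exactly 0 is mapped to +1 here (arbitrary choice)."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def satisfies_margin(w, b, data):
    """Check the scaled hard-margin constraints y_i (w^T x_i + b) >= 1."""
    return all(y * decision_value(w, b, x) >= 1 for x, y in data)


# Hypothetical separable data: (point, label).
data = [((2, 2), 1), ((3, 3), 1), ((0, 0), -1), ((-1, 0), -1)]
w, b = (0.5, 0.5), -1.0  # hand-picked, not trained

print(satisfies_margin(w, b, data))
print([classify(w, b, x) for x, _ in data])
```

In this example the points (2, 2) and (0, 0) meet the constraint with equality, so they play the role of support vectors.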
Soft-Margin Support Vector Machines
However, in most situations the training data are not linearly separable, as shown in Fig. 3-8: some training data points lie on the opposite side. To handle such data, it is appropriate to introduce an additional cost function associated with misclassification, as in the following equation, where $\xi_i \ge 0$.
$w^T x_i + b \ge 1 - \xi_i$ for $y_i = 1$, and $w^T x_i + b \le -1 + \xi_i$ for $y_i = -1$
The misclassification is reduced as $\xi_i$ becomes smaller; thus we need to minimize the cost function.
Fig. 3-8: Inseparable case in a two-dimensional space
Fig. 3-9: Transformation of feature space
Mapping to a High-Dimensional Space
If the training data are not linearly separable, we can enhance the linear separability by mapping the data from the input space into a high-dimensional feature space. An example is shown in Fig. 3-9. The resulting algorithm is formally similar, except that every dot product is replaced by a non-linear kernel function $k$. This allows the algorithm to fit the maximum-margin hyperplane in the transformed feature space.
The following kernels are in common use with support vector machines.
Linear kernels: $k(x, x') = x^T x'$
Polynomial kernels: $k(x, x') = (x^T x')^d$
Radial basis function kernels: $k(x, x') = \exp(-\gamma \|x - x'\|^2)$, for $\gamma > 0$
Sigmoid: $k(x, x') = \tanh(\kappa x^T x' + c)$
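The four kernels can be written directly as small functions, as in the sketch below. The default parameter values (`d`, `gamma`, `kappa`, `c`) are placeholders chosen for illustration, not the settings used in the experiments.

```python
import math


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, d=2):
    return dot(x, z) ** d


def rbf_kernel(x, z, gamma=0.5):
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def sigmoid_kernel(x, z, kappa=0.1, c=0.0):
    return math.tanh(kappa * dot(x, z) + c)


x, z = (1.0, 2.0), (3.0, 4.0)
for kernel in (linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel):
    print(kernel.__name__, kernel(x, z))
```

Each function returns a similarity score for a pair of input vectors, which is the only quantity the kernelized classifier needs from the feature space.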
4 Chapter 4
Experimental Results
In this chapter, the results of age estimation are presented. Our algorithm was implemented on a PC with an Intel Pentium Duo 2 GHz processor and 4 GB of RAM.
Matlab and Borland C++ Builder were our development tools, running on Windows XP. All