Chapter 2 SVM-based Classifiers
2.2 Incremental SVM
$$\alpha_i \bigl( y_i (w \cdot x_i + b) - 1 + \xi_i \bigr) = 0, \quad \forall i \tag{2-23}$$
2.1.3 Non-linear SVM
SVM can also be generalized to the case where the decision function is not linear. This is done by mapping the data from the input feature space to a higher-dimensional space (possibly of infinite dimension), in which a linear decision function is sought. In practice, the nonlinear mapping is realized through a kernel function. The three most commonly used kernel functions are
Polynomial:
$$K(x_1, x_2) = (\gamma x_1^T x_2 + r)^d, \quad r > 0 \tag{2-24}$$
Sigmoid:
$$K(x_1, x_2) = \tanh(\gamma x_1^T x_2 + r) \tag{2-25}$$
Gaussian radial basis function:
$$K(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|^2}{c}\right)$$
where $\gamma$, $r$, $d$ and $c$ are kernel parameters. The kernel space may be of very high or even infinite dimension (as for the Gaussian kernel); because the kernel function returns inner products in that space directly, the mapping does not need to be known explicitly. In conclusion, for a given kernel function, the non-linear decision function of the SVM classifier can now be written as
$$f(x) = \operatorname{sgn}\Bigl(\sum_{i} \alpha_i y_i K(x_i, x) + b\Bigr)$$
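The kernels and the kernelized decision rule can be sketched in a few lines of Python. This is only an illustration: the default parameter values are arbitrary, the Gaussian kernel assumes the form exp(-||x1 - x2||^2 / c) for the width parameter c, and the helper names (`dot`, `decision`) are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x1, x2, gamma=1.0, r=1.0, d=2):
    # (gamma * x1^T x2 + r)^d, with r > 0 as in (2-24)
    return (gamma * dot(x1, x2) + r) ** d

def sigmoid_kernel(x1, x2, gamma=1.0, r=0.0):
    # tanh(gamma * x1^T x2 + r) as in (2-25)
    return math.tanh(gamma * dot(x1, x2) + r)

def gaussian_kernel(x1, x2, c=1.0):
    # Assumed form: exp(-||x1 - x2||^2 / c)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / c)

def decision(x, sv_x, sv_y, alpha, b, kernel):
    """Sign of sum_i alpha_i * y_i * K(x_i, x) + b (ties resolved to +1)."""
    score = sum(a * y * kernel(xi, x) for a, y, xi in zip(alpha, sv_y, sv_x)) + b
    return 1 if score >= 0 else -1
```

Only kernel values between the test point and the stored samples are needed, which is why the mapping itself never has to be evaluated.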
Training an SVM classifier requires solving a quadratic optimization problem, which is time consuming. The key point is that obtaining the multipliers $\alpha_i$ of the dual objective function involves computing a matrix inverse in the programming procedure, so the training time increases dramatically with the number of training samples. SVM also suffers from a large memory requirement when trained on a large data set. Therefore, incremental learning techniques have been developed to make SVM training faster and to reduce the storage cost over very large data sets.
According to (2-13), the training result of an SVM depends only on a small set of samples, termed the support vector (SV) set [11-12]. The Chunking algorithm [12] exploits this by solving the SVM incrementally on a working set that contains all current SVs plus some of the new samples that violate the Karush-Kuhn-Tucker (KKT) conditions. Its shortcoming is that it is limited by the maximal number of support vectors and still requires quadratic optimization. The decomposition method [12] is similar to the Chunking algorithm; the main difference is that the number of retraining samples in the decomposition method is fixed.
Apart from these conventional methods, α-ISVM [22] provides an efficient incremental algorithm based on a discard factor α. By adjusting this parameter, samples can be discarded optimally, which greatly reduces the retraining time of the SVM. Another incremental learning algorithm, I-SVM [23], also discards part of the historical samples; it uses both a pre-extracting SVs algorithm and an iteration algorithm in retraining the SVM.
SeqSVM [24] trains an SVM classifier on a small part of the training samples and selects the so-called convex hull samples to retrain a new SVM. Convex hull samples are those that are wrongly classified by the current SVM and lie furthest from the current SVM solution, which implies that they are the most likely to become support vectors of the future SVM.
Some other traditional SVM-based incremental learning methods [25-26] also expedite retraining by reducing the number of training data. Besides reducing the number of retraining samples, Sequential Minimal Optimization (SMO) [12] provides an alternative approach that avoids the overall QP procedure. SMO takes the idea of the decomposition method to its extreme and optimizes just two multipliers, $\alpha_1$ and $\alpha_2$, in each QP sub-problem. The new values of the two multipliers are computed while keeping the linear constraint
$$\sum_{i} \alpha_i y_i = 0$$
so the new multipliers satisfy the following relationship:
$$\alpha_1^{new} y_1 + \alpha_2^{new} y_2 = \alpha_1^{old} y_1 + \alpha_2^{old} y_2$$
The pair of multipliers to update is chosen by a heuristic. SMO terminates when all Lagrange multipliers satisfy the KKT conditions.
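A single SMO pair update can be sketched as follows. The clipping bounds and the step along the constraint line follow the standard textbook formulation of SMO rather than anything stated in this chapter, and the function name, argument names and the choice to skip degenerate pairs are assumptions made for illustration. Here `e1` and `e2` are the prediction errors (output minus label) of the two samples, and `k11`, `k12`, `k22` are kernel values.

```python
def smo_pair_update(a1, a2, y1, y2, e1, e2, k11, k12, k22, c):
    """Return updated (alpha1, alpha2) for one SMO sub-problem.

    The pair moves along the line y1*a1 + y2*a2 = const, so the global
    constraint sum_i alpha_i * y_i = 0 is preserved.
    """
    # Feasible interval for alpha2 inside the box [0, c].
    if y1 != y2:
        low, high = max(0.0, a2 - a1), min(c, c + a2 - a1)
    else:
        low, high = max(0.0, a1 + a2 - c), min(c, a1 + a2)

    eta = k11 + k22 - 2.0 * k12  # curvature along the constraint line
    if eta <= 0.0 or low >= high:
        return a1, a2  # degenerate pair: leave unchanged in this sketch

    # Unconstrained optimum for alpha2, then clip to [low, high].
    a2_new = min(high, max(low, a2 + y2 * (e1 - e2) / eta))
    # alpha1 absorbs the change so that the linear constraint still holds.
    a1_new = a1 + y1 * y2 * (a2 - a2_new)
    return a1_new, a2_new
```

A full SMO solver would wrap this update in the heuristic pair selection and the KKT-based stopping test described above.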
Note that most of the incremental learning methods described above need the old SV sets. Therefore, to train an SVM that can classify new learning data while keeping its performance on old data, the new data must be retrained together with the SV set of the original samples or with many historical sets. This is inadequate for the real-time and memory requirements of robotic applications. To this end, we propose a fast SVM-based learning algorithm to expedite the training of SVM.
Chapter 3
Facial Expression Classification and Learning
This chapter describes the classifier and learning algorithms developed for a robotic facial expression recognition system. SVM is applied in the proposed system to recognize five facial expressions. Generally, an image-based facial expression recognition system needs to collect a certain number of facial datasets to train the emotion classifiers. In practical applications, however, the appearance variations of human faces are too great to be represented by a limited number of samples, so a high recognition rate can hardly be expected with only a few datasets. Therefore, we propose a novel SVM-based incremental learning algorithm to overcome this problem of facial expression recognition, termed the subject-dependent problem.
The basic idea of the proposed learning algorithm is to adjust the parameters of the SVM hyperplane to learn the facial expressions of a new face, defined as data that are recognized incorrectly by the previously trained SVM classifiers. Support vector pursuit learning (SVPL, see 3.2.1) [27-28] is applied to retrain the hyperplane iteratively in the Gaussian-kernel space. To expedite the retraining procedure, only the erroneous samples are combined with the critical historical sets (see 3.2.2) to retrain a new SVM classifier. After this fast retraining, the new classifier recognizes the new facial data with an improved correct rate.
3.1 The Hierarchical SVM Classifier
SVM is known to be an effective method for designing facial expression recognition systems [10], especially when only a small number of training data are available. With an SVM classifier, the category of two-class data is decided by the sign of the decision function. The facial expression classifier needs to recognize five categories of emotional expressions: anger, happiness, neutral, sadness and surprise. Consequently, a hierarchical SVM classifier is designed to categorize the five emotional expressions. The structure of the hierarchical SVM classifier is shown in Figure 3-1, and an example of classifying a neutral expression is illustrated in Figure 3-2. First, neutral, anger and sadness are performed through the first layer classification. Next, the anger expression assumes the sadness expression in the second one. Finally, the neutral expression can assume sadness and is output through three layers of classification.
Figure 3-1 The structure of hierarchical SVM classifier. (Blocks: Input, Unknown Emotion; Neutral, Happiness, Anger, Surprise, Sadness; Output Emotion.)
Figure 3-2 An example of classifying neutral expression
3.2 Fast Learning for Robotic Applications
A facial expression recognition system needs to collect some representative samples to train the SVM classifier beforehand. The subject-dependent problem in facial expression recognition can hardly be avoided, because different people may show the same category of expression in different ways. If the facial expressions of a test person have not been collected in the database beforehand, high recognition rates are difficult to obtain. Therefore, a solution to this problem is investigated so that the robot can adapt to the facial expressions of various persons.
The objective of this study is to design a system for emotional interaction of entertainment robots. An entertainment robot equipped with a facial expression recognition system and a learning algorithm can recognize the facial expressions of its host. For a newly purchased robot, if the robot recognizes new faces correctly, the current parameters of the recognition system are good enough. Otherwise, if the owner perceives from the robot's emotional responses that it recognizes new facial expressions erroneously, the owner informs the robot of the wrong recognition results through a simple input device. The entertainment robot can then start to retrain a new SVM hyperplane immediately.
As mentioned in Chapter 2, a large number of retraining samples requires much time to solve the quadratic programming (QP) problem of an SVM classifier. This is inefficient in practice in terms of both memory requirement and real-time criteria. A fast SVM-based incremental learning algorithm is therefore required for practical applications.
3.2.1 Support Vector Pursuit Learning
We adopt the concept of support vector pursuit learning (SVPL) [27-28] to develop the emotion learning system. The previous parameters of the hyperplane, together with the new data, are employed to retrain a new SVM classifier. The main idea of SVPL is that the old hyperplane $w_{k-1}$ shifts by a minimal distance to a new hyperplane $w_k$ such that the new one separates the new data correctly (see Figure 3-3). The data in Figure 3-3 are all new training samples. Because the distance between the new and previous hyperplanes is minimal, the new hyperplane is also expected to separate the old data. Once the new classifier is obtained, the new training data of the current step can be discarded. Hence SVPL effectively reduces the computation and the memory requirement in learning new data.
In practical applications, the facial expressions of new faces, which will be learned incrementally, usually differ greatly from the original ones. As a result, the new SVM may not maintain satisfactory performance in recognizing the old data. We therefore propose to couple SVPL with a new concept, critical sets, to design a learning algorithm that achieves this goal.
Figure 3-3 The diagram of SVPL retraining strategy
3.2.2 Critical Sets
A Gaussian-kernel mapping is applied in the first stage of the developed algorithm to map the feature space to a kernel space where facial data can be retrained easily. The Gaussian radial basis function is written as follows:
$$K(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|^2}{c}\right) \tag{3-1}$$
Subsequently, as described in (2-7), the hyperplane $w$ is constructed from the sum of the feature values weighted by the class labels. The Lagrange multiplier $\alpha_i$ determines the weight of each training sample in forming the hyperplane $w$. In general, the samples nearest to the SVM hyperplane have the greatest Lagrange multipliers; that is, these samples play critical roles in forming the SVM hyperplane.
This concept can be explained using Figure 3-4. Figure 3-4 (a) shows an SVM trained with all samples in the feature space, but an almost identical SVM can be trained with only a few sets, as shown in Figure 3-4 (b). In addition, Table 3-1 gives an example in which the sets with smaller absolute decision values have larger Lagrangian multipliers.
Figure 3-4 (a) a SVM trained with all samples (b) a similar SVM trained with critical sets
Table 3-1 The relationship of decision value and Lagrangian multiplier
Decision value    Lagrangian multiplier
 0.7894            0.2132
-0.91              0.0875
-0.975             0.0225
-0.9791            0.0183
 1.0053           -0.0027
 1.0694           -0.0669
As a result, a few important historical samples, those nearest to the SVM hyperplane, are reserved to retrain a new SVM classifier so that recognition of the old data is preserved. We define these samples as critical sets (CSs):
$$\text{CSs}: \quad X_i = \arg\min |w \cdot X_i + b| \tag{3-2}$$
The size of the critical sets is determined empirically. There is a trade-off between training time and the size of the critical sets, so the size can be adjusted to balance learning time against non-forgetting learning.
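The selection rule in (3-2) can be sketched as a ranking of the samples by their absolute decision value. The function name, the linear form of the decision value and the toy data are assumptions for illustration; `size` stands for the empirically chosen number of critical sets.

```python
def select_critical_sets(samples, labels, w, b, size):
    """Keep the `size` samples with the smallest |w . x + b|."""
    def abs_decision(x):
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b)

    # Indices ordered from nearest to furthest from the hyperplane.
    order = sorted(range(len(samples)), key=lambda i: abs_decision(samples[i]))
    keep = order[:size]
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Toy usage with a hyperplane x[0] = 0 (hypothetical values).
toy_x = [[2.0, 0.0], [0.5, 0.0], [-0.2, 0.0], [-3.0, 0.0]]
toy_y = [1, 1, -1, -1]
cs_x, cs_y = select_critical_sets(toy_x, toy_y, w=[1.0, 0.0], b=0.0, size=2)
```

A larger `size` keeps more of the old decision boundary at the cost of a larger retraining problem, which is the trade-off described above.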
The effectiveness of critical sets in maintaining the recognition of old data is shown in Figure 3-5. Assume that the hyperplane in Figure 3-5 (a) was trained with the white circles and white triangles, and that the black samples are new samples. Two new black circles are classified erroneously by the original hyperplane. Figure 3-5 (b) shows one possible retraining error with traditional SVPL: the new hyperplane classifies the new samples correctly, but one old sample (a white triangle) is misclassified. In the new retraining strategy, only the two erroneous black points, combined with the critical sets, are used to retrain a new hyperplane with SVPL. One black triangle becomes a new critical set after the CSs are updated online. The resulting hyperplane is shown in Figure 3-5 (c); it separates both the new and old samples correctly.
3.2.3 The Proposed Learning Algorithm
This section introduces the proposed learning algorithm, which utilizes CSs and SVPL to retrain an SVM. We denote the critical sets, which are nearest to the initial hyperplane, as $TS_0 = \{x_i^0, y_i^0\}_{i=1}^{L}$. At the t-th incremental learning step, some new samples $TS_k = \{x_i^t, y_i^t\}_{i=1}^{m}$ come in, and we assume that the initial SVM cannot classify these new samples correctly. The proposed algorithm changes the parameters of the hyperplane to $(w_t, b_t)$ so that the new samples are separated successfully. The following steps are similar to the conventional SVM procedure. The quadratic programming problem of the incremental update can be expressed as follows:
Objective function:
$$\min \; \frac{1}{2}\|w_t - w_{t-1}\|^2 + C \sum_{i=1}^{L+m} \xi_i$$
where $\xi_i$ are the slack variables of the samples in $TS_0$ and $TS_k$, and $C$ is a trade-off between training error and VC-dimension. Let $w_{t-1}$ denote the previously trained parameter, which is a fixed constant. This optimization problem can be solved as follows:
$$w_t = w_{t-1} + \sum_{i=1}^{L+m} \alpha_i y_i x_i$$
This indicates that the new parameter of the SVM can be derived iteratively from the parameter of the previous SVM and the retraining sets comprising $TS_0$ and $TS_k$. Consequently, the proposed learning algorithm yields a new SVM that not only recognizes the new facial data but also keeps an acceptable recognition rate on the original facial data. Further, the proposed algorithm speeds up the QP procedure of SVM dramatically by reducing the retraining datasets, which benefits the real-time requirement of practical robotic applications. Finally, the decision function can be written as below:
$$f(x) = \operatorname{sgn}\Bigl(\sum_{i} \alpha_i y_i K(x_i, x) + b_t\Bigr)$$
where $K(\cdot,\cdot)$ is a Gaussian kernel function, which maps the input space to a higher-dimensional kernel space. Suppose that $\varphi: X \to F$ is a non-linear mapping from the input space to a higher-dimensional kernel space; the decision function can then be rewritten as:
$$f(x) = \operatorname{sgn}\bigl(w_t \cdot \varphi(x) + b_t\bigr)$$
Figure 3-5 An example of the retraining with critical sets by SVPL. (a) is the original SVM hyperplane, (b) one possible error of retraining all new samples by SVPL, (c) retraining only the two erroneous black points combined with critical sets by SVPL
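The incremental form of the solution can be motivated by a short Lagrangian argument. The sketch below assumes that the retraining problem minimizes $\frac{1}{2}\|w_t - w_{t-1}\|^2 + C\sum_i \xi_i$ under the usual soft-margin constraints $y_i(w_t \cdot x_i + b_t) \ge 1 - \xi_i$ and $\xi_i \ge 0$; the constraints and the multipliers $\mu_i$ are assumptions introduced here, chosen to match the KKT condition (2-23).

```latex
\begin{aligned}
\mathcal{L} &= \tfrac{1}{2}\|w_t - w_{t-1}\|^2 + C\sum_{i=1}^{L+m}\xi_i
  - \sum_{i=1}^{L+m}\alpha_i\bigl[y_i(w_t \cdot x_i + b_t) - 1 + \xi_i\bigr]
  - \sum_{i=1}^{L+m}\mu_i \xi_i, \\
\frac{\partial \mathcal{L}}{\partial w_t} &= w_t - w_{t-1} - \sum_{i=1}^{L+m}\alpha_i y_i x_i = 0
  \;\Longrightarrow\; w_t = w_{t-1} + \sum_{i=1}^{L+m}\alpha_i y_i x_i, \\
\frac{\partial \mathcal{L}}{\partial b_t} &= -\sum_{i=1}^{L+m}\alpha_i y_i = 0, \\
\frac{\partial \mathcal{L}}{\partial \xi_i} &= C - \alpha_i - \mu_i = 0
  \;\Longrightarrow\; 0 \le \alpha_i \le C .
\end{aligned}
```

Under these assumptions, only the correction term depends on the retraining samples, so a smaller retraining set directly means a smaller QP.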
3.2.4 Decomposing Kernel Function
A kernel function is an implicit mapping from the feature space to the kernel space; the underlying feature mapping is not known explicitly. Hence, $w$ cannot be derived directly for the purpose of retraining the SVM hyperplane incrementally.
We can take the feature mapping to be any linear transformation $x \to Ax$, for some matrix $A$. In this case the kernel mapping is given by
$$K_{ij} = (A x_i)^T (A x_j)$$
where $K$ is a square symmetric matrix. $K$ can be diagonalized into the following expression:
$$K = V \Lambda V^T = (\Lambda^{1/2} V^T)^T (\Lambda^{1/2} V^T) = X^T X$$
We assume that $X$ is one possible result of the kernel mapping, so the parameter $w$ can be derived explicitly. The final decision function in kernel space is given below:
$$f(X_j) = \operatorname{sgn}(w \cdot X_j + b), \qquad w = \sum_{i} \alpha_i y_i X_i$$
where $X_i$ denotes the i-th column of $X$. By diagonalizing the kernel matrix to obtain an explicit representation of the kernel space, the proposed learning algorithm can be carried out even though the kernel function is an implicit mapping. The architecture of the proposed learning algorithm is illustrated in Figure 3-6, and Table 3-2 summarizes the proposed learning algorithm. First, the features of the facial expression samples are collected and mapped to the Gaussian kernel space. Next, a hierarchical SVM classification is used to categorize the five emotional expressions. Meanwhile, the parameters of the SVM and the critical sets are reserved for future learning. If new data are supplied, a new SVM classifier is learned by the proposed algorithm, which uses only the erroneous data and the critical sets to update the SVM. As soon as the new SVM is adjusted, the critical sets are updated accordingly. The learning procedure continues until no new data exist.
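The idea of recovering an explicit representation from a kernel matrix can be checked numerically. The following NumPy sketch builds a Gaussian kernel matrix, diagonalizes it with `numpy.linalg.eigh`, and forms a matrix X whose columns act as explicit kernel-space vectors. The sample points, the multipliers and the kernel form exp(-||x1 - x2||^2 / c) are hypothetical choices made only for illustration.

```python
import numpy as np

def gaussian_gram(samples, c=1.0):
    """Kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / c) (assumed form)."""
    sq_dist = np.sum((samples[:, None, :] - samples[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dist / c)

def explicit_features(kernel_matrix):
    """Return X = Lambda^(1/2) V^T so that X^T X reproduces the kernel matrix."""
    eigvals, eigvecs = np.linalg.eigh(kernel_matrix)  # symmetric eigendecomposition
    eigvals = np.clip(eigvals, 0.0, None)             # guard against round-off
    return np.sqrt(eigvals)[:, None] * eigvecs.T      # column j represents sample j

samples = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_gram(samples)
X = explicit_features(K)

# With explicit vectors, w can be formed directly from (hypothetical) multipliers.
alpha = np.array([0.5, 0.5, 0.0])
y = np.array([1.0, -1.0, 1.0])
w = X @ (alpha * y)   # w = sum_i alpha_i * y_i * X[:, i]
scores = w @ X        # w . X[:, j] for every sample j
```

The inner products of the recovered columns match the kernel values, so `scores` agrees with the kernel expansion of the decision value on the same samples.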
Table 3-2 The overall procedure of proposed learning algorithm
Initial: Train the SVM classifier S1 with the training facial datasets IS (initial sets) and save CS (critical sets).
Step 1: Test the new facial data NS with classifier S1.
    If erroneous data = Nil, then do nothing and exit.
    Else erroneous data = ES (erroneous sets); then retrain a new SVM classifier S with the proposed learning algorithm using ES and CS, and update CS.
Step 2: New SVM classifier S1 = S; IS = NS + IS.
Step 3: Test IS with S1.
Figure 3-6 The flowchart of proposed learning algorithm
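The control flow of Table 3-2 can be sketched independently of the actual SVM solver. In the sketch below the training, prediction and critical-set selection routines are passed in as callables; the toy one-dimensional threshold classifier is a stand-in used only to exercise the loop and is not SVPL. Updating CS from the retraining set is likewise an assumption.

```python
def incremental_step(classifier, critical_sets, new_samples,
                     train, predict, select_critical):
    """Run Step 1 of Table 3-2 once and return (classifier, critical_sets)."""
    # ES: new samples that the current classifier gets wrong.
    erroneous = [(x, y) for x, y in new_samples if predict(classifier, x) != y]
    if not erroneous:
        return classifier, critical_sets  # nothing to learn
    # Retrain on ES + CS only, then refresh CS from the retraining set.
    retrain_set = erroneous + list(critical_sets)
    new_classifier = train(classifier, retrain_set)
    new_critical = select_critical(new_classifier, retrain_set)
    return new_classifier, new_critical


# Toy 1-D stand-ins (NOT SVPL): the "classifier" is a single threshold.
def toy_predict(threshold, x):
    return 1 if x >= threshold else -1

def toy_train(threshold, data):
    positives = [x for x, y in data if y == 1]
    negatives = [x for x, y in data if y == -1]
    return (min(positives) + max(negatives)) / 2.0

def toy_select_critical(threshold, data, size=2):
    return sorted(data, key=lambda s: abs(s[0] - threshold))[:size]

s1, cs = incremental_step(0.0, [(-1.0, -1), (1.0, 1)],
                          [(0.5, -1), (2.0, 1)],
                          toy_train, toy_predict, toy_select_critical)
```

In this toy run only the misclassified sample and the two stored critical sets enter retraining, while the correctly classified new sample is ignored.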
Chapter 4
Feature Extraction Under Illumination Variation
To ensure that feature values are extracted accurately for classification and learning, we adopt the Gabor wavelet to develop a robust feature extraction method, so that facial points can be detected under various lighting conditions. Before facial features are extracted, the face must be localized and segmented from the digitized image sequence. The face preprocessing stage consists of face normalization and feature region localization, which allow facial features to be extracted efficiently. Once the regions of interest corresponding to the relevant features are determined, we apply Gabor jets, based on the Gabor wavelet transformation, to extract the facial points. For representing local features, Gabor jets are more invariant and reliable than gray values, which suffer from large ambiguities and are sensitive to even slight changes in illumination. Each feature point is matched by a phase-sensitive similarity function within the relevant region of interest. Once the feature points are extracted under varying illumination, the geometric displacements of these points are evaluated as emotional feature values.
4.1 Face Detection
Color is a direct cue for face localization, but skin color is easily affected by illumination uncertainty. In this design, a YCrCb 3D color distribution model is applied to segment skin color under illumination variation [29]. Face localization consists of face detection and face tracking. The face detection module serves as the initial state; it provides the location and size of the face for later face tracking. The face region is detected in the first image of the image sequence by executing face color segmentation, morphology, color region mapping and an attentional cascade (see Figure 4-1).
As shown in Figure 4-2 (b), a binary image is obtained by skin color segmentation in the YCrCb color space. Then, the morphological closing operation [30] is used to eliminate discontinuous interference in the skin color region (see Figure 4-2 (c)). After the binary image of color segmentation is obtained, 2-D histogram mapping is used to find possible face areas; Figure 4-2 (d) illustrates the result. Finally, the attentional cascade applies several rules to confirm whether the candidate region is a real human face. Face detection is regarded as successful if the following conditions are satisfied:
a. The ratio of mapping length to mapping width is between 1 and 2.
b. The sum of grayscale in the upper area is smaller than the sum of the lower one.
c. The sum of grayscale of eye areas is smaller than the sum of the center area of two eyebrows.
d. The sum of grayscale of the adjacent cheek areas is less than a fixed threshold.
One successful example is shown in Figure 4-2 (e).
Figure 4-1 The flowchart of face detection. (Blocks: CMOS sensor, Color Segmentation, Morphology Closing, Color Region Mapping, Attentional Cascade, decision "Candidate is face or not" with YES/NO branches, Face Tracking.)
Figure 4-2 The results of face detection. (a) is the testing image. (b) is the result of color segmentation and (c) is the result of closing operation. (d) is the candidate of face regions. (e) is the final result via the attentional cascade.
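Rules a to d of the attentional cascade amount to a conjunction of simple comparisons. The sketch below assumes that the grayscale sums of the relevant areas have already been computed; how those areas are located inside the candidate region is not specified here, so all argument names are hypothetical, and treating the bounds in rule a as inclusive is an assumption.

```python
def passes_attentional_cascade(length, width,
                               upper_sum, lower_sum,
                               eye_sum, brow_center_sum,
                               cheek_sum, cheek_threshold):
    """Return True if a candidate region satisfies rules a-d."""
    # a. Ratio of mapping length to mapping width lies between 1 and 2.
    rule_a = width > 0 and 1.0 <= length / width <= 2.0
    # b. Upper area is darker (smaller grayscale sum) than the lower area.
    rule_b = upper_sum < lower_sum
    # c. Eye areas are darker than the area between the two eyebrows.
    rule_c = eye_sum < brow_center_sum
    # d. Adjacent cheek areas stay below a fixed threshold.
    rule_d = cheek_sum < cheek_threshold
    return rule_a and rule_b and rule_c and rule_d
```

Because every rule is a cheap comparison, a candidate can be rejected as soon as any one of them fails.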
4.2 Face Tracking
After a face is detected in the initial state, face tracking is accomplished in the subsequent images by applying the adaptive YCrCb 3D color distribution model. This statistical model determines the proper threshold values for each color channel of the face image in the tracking mode. Let Y1 (Y2) be the lower (upper) skin color threshold of the Y channel, Cr1 (Cr2) the lower (upper) threshold of the Cr channel, and Cb1 (Cb2) the lower (upper) threshold of the Cb channel. The threshold values used to segment the face regions are updated by the following equations [29]:
] 0
where C denotes the color channel (Y, Cr or Cb) and Chist is the histogram of the C channel. S is a scale factor between 0 and 1 used to reject undesired parts of the histogram, such as the eyes or eyebrows; in this case the scale factor is set to 0.1. Ntotal is the total number of pixels in the previous face region. Figure 4-3 shows an example of computing the