Decomposing Kernel Function - Fast Learning for Robotic Applications

Chapter 3 Facial Expression Classification and Learning

3.2 Fast Learning for Robotic Applications

3.2.4 Decomposing Kernel Function

A kernel function is an implicit mapping from feature space to kernel space; the underlying feature mapping is not known explicitly. Hence, the weight vector w cannot be derived directly for the purpose of incrementally retraining the SVM hyperplane.

We can take the feature mapping to be any linear transformation x → Ax, for some matrix A. In this case the kernel mapping is given by

$$K(x, z) = \langle Ax, Az \rangle = x^{T} A^{T} A z,$$

where the resulting kernel matrix K is a square symmetric matrix. K can be diagonalized into the following expression:

$$K = V \Lambda V^{T} = \left(V \Lambda^{1/2}\right)\left(V \Lambda^{1/2}\right)^{T} = X X^{T}, \qquad X = V \Lambda^{1/2}.$$

We assume that X is one possible result of the kernel mapping; thus the parameter w can be derived explicitly. The final decision function in kernel space is given below:

$$f(x) = \operatorname{sign}\left(w^{T} x + b\right), \qquad w = \sum_{i} \alpha_i y_i X_i,$$

where X_i is the i-th row of X, y_i is its class label, α_i is the corresponding Lagrange multiplier and b is the bias.

By diagonalizing the kernel matrix to obtain the kernel space explicitly, the proposed learning algorithm can be carried out even though the kernel function is an implicit mapping. The architecture of the proposed learning algorithm is illustrated in Figure 3-6, and Table 3-2 summarizes the algorithm. First, features of the facial expression samples are collected and mapped to the Gaussian kernel space. Next, a hierarchical SVM classification is used to categorize five emotional expressions. Meanwhile, the SVM parameters and the critical sets are retained for future learning. When new data are supplied, a new SVM classifier is learned by the proposed algorithm, which uses only the erroneous data and the critical sets to update the SVM. As soon as the new SVM is adjusted, the critical sets are updated accordingly. The learning procedure continues until no new data remain.

Figure 3-6 The flowchart of proposed learning algorithm

Table 3-2 The overall procedure of proposed learning algorithm

Initial: Train the SVM classifier S1 with the training facial datasets IS (initial sets) and save CS (critical sets);

Step 1: Test the new facial data NS with classifier S1;

    if erroneous data = Nil then:

        do nothing and exit;

    else (erroneous data = ES, erroneous sets) then:

        retrain a new SVM classifier S with the proposed learning algorithm using ES and CS; update CS;

Step 2: New SVM classifier S1 = S; IS = NS + IS;

Step 3: Test IS with S1;
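The control flow of Table 3-2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the classifier is abstracted behind a hypothetical `train` callable that returns a classifier together with its critical set, and `train_threshold` is a toy one-dimensional stand-in for SVM training, not the decomposed-kernel method of this section.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]           # (feature vector, label)
Classifier = Callable[[Sequence[float]], int]  # maps features to a label


def incremental_update(train, classifier, critical_set, initial_set, new_set):
    """One pass of the Table 3-2 procedure (names are illustrative).

    Returns (classifier, critical_set, initial_set, accuracy); accuracy is
    None when the new data produce no errors and nothing is retrained.
    """
    # Step 1: test the new data NS with the current classifier S1.
    erroneous = [(x, y) for x, y in new_set if classifier(x) != y]
    if not erroneous:
        return classifier, critical_set, initial_set, None  # do nothing and exit
    # Retrain using only the erroneous set ES and the critical set CS.
    new_classifier, new_critical = train(erroneous + critical_set)
    # Step 2: S1 = S, IS = NS + IS.
    merged = list(new_set) + list(initial_set)
    # Step 3: test IS with the updated classifier.
    correct = sum(1 for x, y in merged if new_classifier(x) == y)
    return new_classifier, new_critical, merged, correct / len(merged)


def train_threshold(samples: List[Sample]):
    """Toy stand-in for SVM training on 1-D features (assumption for the demo).

    The boundary is the midpoint of the two class means; the "critical set"
    is the sample of each class that lies closest to the boundary.
    """
    zeros = [x[0] for x, y in samples if y == 0]
    ones = [x[0] for x, y in samples if y == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0

    def classify(x):
        return 1 if x[0] > threshold else 0

    critical = [
        min((s for s in samples if s[1] == label),
            key=lambda s: abs(s[0][0] - threshold))
        for label in (0, 1)
    ]
    return classify, critical
```

In this sketch only the misclassified new samples and the stored critical samples enter retraining, which is the point of the procedure: the full initial set is used for testing in Step 3 but not for training.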

Chapter 4 Feature Extraction Under Illumination Variation

To ensure that feature values are extracted accurately for classification and learning, we adopt the Gabor wavelet to develop a robust feature extraction method.

Facial points can then be detected under various lighting conditions. Before facial features are extracted, a face must be localized and segmented from the digitized image sequence. The face preprocessing stage consists of face normalization and feature region localization, which allow facial features to be extracted efficiently. Once the regions of interest corresponding to the relevant features are determined, we apply Gabor jets based on the Gabor wavelet transformation to extract the facial points. For representing local features, Gabor jets are more invariant and reliable than gray values, which suffer from large ambiguities even under slight changes in illumination. Each feature point can be matched by a phase-sensitive similarity function in the relevant region of interest. Once the feature points are extracted under varying illumination, we can evaluate the geometric displacements of these points as emotional feature values.

4.1 Face Detection

Color is a direct cue for face localization, but skin color is easily affected by illumination uncertainty. In this design, a YCrCb 3D color distribution model is applied to segment skin color under illumination variation [29]. The face localization method consists of face detection and face tracking. The face detection module provides the initial state, which contains the location and size of the face for later face tracking. The face region is detected in the first image of the image sequence by executing skin color segmentation, morphology, color region mapping and an attentional cascade (see Figure 4-1).

As shown in Figure 4-2 (b), the binary image is obtained by skin color segmentation in the YCrCb color space. Then, the morphological closing operation [30] is used to eliminate discontinuities in the skin color region (see Figure 4-2 (c)). After the binary image of the color segmentation is obtained, 2-D histogram mapping is utilized to find possible face areas; Figure 4-2 (d) illustrates the result. Finally, the attentional cascade applies several rules to confirm whether a candidate region is a real human face. Face detection is regarded as successful if the following conditions are satisfied:

a. The ratio of mapping length to mapping width is between 1 and 2.

b. The sum of grayscale values in the upper area is smaller than that in the lower area.

c. The sum of grayscale values in the eye areas is smaller than that in the central area between the two eyebrows.

d. The sum of grayscale values in the adjacent cheek areas is less than a fixed threshold.

Figure 4-1 The flowchart of face detection (CMOS sensor → Color Segmentation → Morphology Closing → Color Region Mapping → Attentional Cascade → "Candidate is face or not": NO / YES → Face Tracking)

One successful example is shown in Figure 4-2 (e).
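Rules a to d can be read as a short-circuiting sequence of cheap tests on a candidate region. The sketch below assumes the grayscale sums have already been computed for each sub-area; the `CandidateStats` fields, the inclusive bounds of rule a and the `cheek_threshold` parameter are illustrative assumptions, since the text does not give the area definitions or the threshold value.

```python
from dataclasses import dataclass


@dataclass
class CandidateStats:
    """Precomputed measurements of one candidate region (hypothetical layout)."""
    length: float           # mapping length of the region, in pixels
    width: float            # mapping width of the region, in pixels
    upper_sum: float        # sum of grayscale values in the upper area
    lower_sum: float        # sum of grayscale values in the lower area
    eye_sum: float          # sum of grayscale values in the eye areas
    brow_center_sum: float  # sum in the central area between the eyebrows
    cheek_sum: float        # sum in the adjacent cheek areas


def passes_cascade(c: CandidateStats, cheek_threshold: float) -> bool:
    """Apply rules a-d in order; evaluation stops at the first failed rule."""
    rules = (
        lambda: c.width > 0 and 1.0 <= c.length / c.width <= 2.0,  # rule a
        lambda: c.upper_sum < c.lower_sum,                         # rule b
        lambda: c.eye_sum < c.brow_center_sum,                     # rule c
        lambda: c.cheek_sum < cheek_threshold,                     # rule d
    )
    return all(rule() for rule in rules)
```

Because `all` stops at the first rule that fails, most non-face candidates are rejected after one or two comparisons.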

Figure 4-2 The results of face detection: (a) the test image; (b) the result of color segmentation; (c) the result of the closing operation; (d) the candidate face regions; (e) the final result after the attentional cascade.

4.2 Face Tracking

After a face is detected in the initial state, face tracking is accomplished in the subsequent images by applying the adaptive YCrCb 3D color distribution model.

This statistical model determines the proper threshold values for each color channel of the face image in tracking mode. Let Y1 (Y2) be the lower (upper) skin-color threshold of the Y channel, Cr1 (Cr2) the lower (upper) skin-color threshold of the Cr channel, and Cb1 (Cb2) the lower (upper) skin-color threshold of the Cb channel. The threshold values used to segment the face regions are updated from the channel histograms as described in [29].

In these update equations, C denotes the color channel (Y, Cr or Cb) and Chist is the histogram of the C channel. S is a scale factor between 0 and 1 that rejects undesired parts of the histogram, such as those contributed by the eyes or eyebrows; in this case the scale factor is set to 0.1. Ntotal is the total number of pixels in the previous face region. Figure 4-3 shows an example of computing the threshold values. The face tracking method uses the color distribution model to update the skin-color threshold values from the previous face region at sample instant t-1. The threshold values can thus be adjusted dynamically to accommodate unexpected changes in lighting conditions.

4.3 Face image preprocessing

To improve the efficiency of facial feature extraction, we perform face image preprocessing, which includes face normalization and feature region localization. The segmented front-view face region is first normalized to a 160x120 grayscale image. Three reference points are then found: one on each pupil and one at the center of the mouth. As shown in Figure 4-4, the face region is divided into 14 relevant regions of interest (ROIs), each containing one feature point.

Figure 4-3 The lower and upper thresholds of each color channel: (a) Y1 and Y2 of the Y channel; (b) Cb1 and Cb2 of the Cb channel; (c) Cr1 and Cr2 of the Cr channel

4.3.1 Face normalization

The face region is normalized to a 160x120 gray-value image by using bilinear interpolation [30]. Bilinear interpolation estimates the function value f(x', y') from the known values f(x, y) of nearby pixels. As shown in Figure 4-5, assume that the black points are unknown values; we estimate them linearly from the known white points. The bilinear interpolation function can be expressed as follows:

$$f(x', y') = (1-b)\,\bigl[(1-a)\,f(x, y) + a\,f(x+1, y)\bigr] + b\,\bigl[(1-a)\,f(x, y+1) + a\,f(x+1, y+1)\bigr]$$

where (x, y) is the known pixel to the upper left of (x', y'), a = x' − x and b = y' − y.

Figure 4-4 The 14 facial regions of interest

Figure 4-5 Image interpolation
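As an illustration of bilinear interpolation applied to image resizing, the following sketch resamples a grayscale image stored as a list of rows. It assumes a corner-aligned mapping between output and input coordinates (the text does not state which convention is used) and inputs and outputs of at least 2x2 pixels; the function name is hypothetical.

```python
def bilinear_resize(image, out_rows, out_cols):
    """Resize a grayscale image (list of rows) with bilinear interpolation.

    Assumption: output corners map exactly onto input corners.
    """
    in_rows, in_cols = len(image), len(image[0])
    if min(in_rows, in_cols, out_rows, out_cols) < 2:
        raise ValueError("this sketch expects at least 2x2 input and output")
    row_scale = (in_rows - 1) / (out_rows - 1)
    col_scale = (in_cols - 1) / (out_cols - 1)
    result = []
    for r in range(out_rows):
        y_src = r * row_scale               # y' in source coordinates
        y0 = min(int(y_src), in_rows - 2)   # known row above y'
        b = y_src - y0                      # vertical weight
        row = []
        for c in range(out_cols):
            x_src = c * col_scale           # x' in source coordinates
            x0 = min(int(x_src), in_cols - 2)
            a = x_src - x0                  # horizontal weight
            top = (1 - a) * image[y0][x0] + a * image[y0][x0 + 1]
            bottom = (1 - a) * image[y0 + 1][x0] + a * image[y0 + 1][x0 + 1]
            row.append((1 - b) * top + b * bottom)
        result.append(row)
    return result
```

For the normalization step described here, the call would take the form `bilinear_resize(face, 160, 120)`, reading 160x120 as rows by columns, which is itself an assumption about the notation.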

4.3.2 Feature Region Localization

For feature extraction, the relevant regions of interest are determined first. For this purpose, three reference points, one on each pupil and one at the center of the mouth, are detected by using an adaptive threshold method based on the integral optical density (IOD) [31]. The pupils are assumed to be the darkest part of the upper face region, so a binary image of the pupils can be segmented by selecting a proper threshold. However, the overall gray values of the face may vary from image to image, so a fixed threshold cannot reliably produce the binary image of the pupils.

To overcome this problem, IOD is employed to determine the threshold adaptively. The IOD value specifies the fraction of the darkest pixels in a region that the threshold retains.

In this design, the binary image of the pupils is segmented with the IOD value set to 0.05 within a rectangular area of 30x30 pixels; that is, we select the darkest 5% of the gray values to locate the pupils. The positions of the two rectangular regions are illustrated in Figure 4-6 (a). In this case, the left pupil region extends from pixel(15, 40) to pixel(45, 70), and the right pupil region from pixel(75, 40) to pixel(105, 70). After obtaining the binary image of the pupils, we use 2-D histogram mapping to locate the pupil positions accurately. For the mouth, the center of the widest peak defines the vertical position of the middle of the mouth, searched between pixel 100 and pixel 150 in the vertical direction, and the horizontal position of the mouth is taken as the midpoint between the two pupils. In this way, the three reference points are found. Figure 4-6 (b) shows a typical detection result for these points.
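The adaptive threshold can be illustrated with a cumulative histogram. The sketch below assumes that an IOD setting of 0.05 means "choose the smallest gray level at which the darkest pixels make up at least 5% of the region", which matches the description above but is not taken from the definition in [31]; the function names are hypothetical.

```python
def darkest_fraction_threshold(region, fraction, levels=256):
    """Return the smallest gray level whose cumulative pixel count reaches
    `fraction` of the region (assumed reading of the IOD setting)."""
    histogram = [0] * levels
    total = 0
    for row in region:
        for gray in row:
            histogram[gray] += 1
            total += 1
    target = fraction * total
    cumulative = 0
    for level, count in enumerate(histogram):
        cumulative += count
        if cumulative >= target:
            return level
    return levels - 1


def binarize_dark(region, threshold):
    """Mark pixels at or below the threshold with 1, all others with 0."""
    return [[1 if gray <= threshold else 0 for gray in row] for row in region]
```

Because the threshold is derived from the histogram of each region, a globally brighter or darker face still yields roughly the same proportion of selected pixels, which is the property a fixed threshold lacks.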

Subsequently, we use these three reference points to divide the face region into 14 regions, in each of which one facial point is extracted individually. Detailed descriptions of the 14 ROIs are given in Table 4-1.

4.4 Gabor Wavelet Transformation

Features based on Gabor filters have been widely used in image processing because of their powerful properties [32-35]. In particular, their most valued property is robustness to variations in lighting, rotation, and small shifts or deformations in local feature areas. In this thesis, we apply Gabor wavelet-based features instead of grayscale features to extract facial points.

Figure 4-6 (a) The diagram of locating the three reference points; (b) the detection result

Table 4-1 The detailed descriptions of 14 ROIs

ROI       Range
ROI-1     from pixel(R3_x-8, R3_y-18) to pixel(R3_x+8, R3_y-3)
ROI-2     from pixel(R3_x-8, R3_y+3) to pixel(R3_x+8, R3_y+30)
ROI-3     from pixel(R3_x+15, R3_y-10) to pixel(R3_x+40, R3_y+10)
ROI-4     from pixel(R3_x-40, R3_y-10) to pixel(R3_x-15, R3_y+10)
ROI-5     from pixel(R1_x+3, R1_y-5) to pixel(R1_x+15, R1_y+5)
ROI-6     from pixel(R1_x-3, R1_y-10) to pixel(R1_x+3, R1_y)
ROI-7     from pixel(R1_x-3, R1_y) to pixel(R1_x+3, R1_y+10)
ROI-8     from pixel(R2_x-15, R2_y-5) to pixel(R2_x-3, R2_y+5)
ROI-9     from pixel(R2_x-3, R2_y-10) to pixel(R2_x+3, R2_y)
ROI-10    from pixel(R2_x-3, R2_y) to pixel(R2_x+3, R2_y+10)
ROI-11    from pixel(R1_x+5, 10) to pixel(R1_x+30, R1_y)
ROI-12    from pixel(10, 10) to pixel(R1_x, R1_y-20)
ROI-13    from pixel(R2_x-30, 10) to pixel(R2_x-5, R2_y)
ROI-14    from pixel(R2_x, 10) to pixel(110, R2_y-20)
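Table 4-1 translates directly into a lookup computed from the three reference points. In the sketch below, R1 and R2 are taken to be the two pupil points and R3 the mouth center, as the offsets in the table suggest; the function names and the tuple layout are illustrative assumptions.

```python
def compute_rois(r1, r2, r3):
    """Return {roi_number: ((x_min, y_min), (x_max, y_max))} per Table 4-1.

    r1, r2: (x, y) of the two pupil reference points; r3: (x, y) of the
    mouth center (assumed roles of R1, R2 and R3).
    """
    (r1x, r1y), (r2x, r2y), (r3x, r3y) = r1, r2, r3
    return {
        1: ((r3x - 8, r3y - 18), (r3x + 8, r3y - 3)),
        2: ((r3x - 8, r3y + 3), (r3x + 8, r3y + 30)),
        3: ((r3x + 15, r3y - 10), (r3x + 40, r3y + 10)),
        4: ((r3x - 40, r3y - 10), (r3x - 15, r3y + 10)),
        5: ((r1x + 3, r1y - 5), (r1x + 15, r1y + 5)),
        6: ((r1x - 3, r1y - 10), (r1x + 3, r1y)),
        7: ((r1x - 3, r1y), (r1x + 3, r1y + 10)),
        8: ((r2x - 15, r2y - 5), (r2x - 3, r2y + 5)),
        9: ((r2x - 3, r2y - 10), (r2x + 3, r2y)),
        10: ((r2x - 3, r2y), (r2x + 3, r2y + 10)),
        11: ((r1x + 5, 10), (r1x + 30, r1y)),
        12: ((10, 10), (r1x, r1y - 20)),
        13: ((r2x - 30, 10), (r2x - 5, r2y)),
        14: ((r2x, 10), (110, r2y - 20)),
    }


def in_roi(point, roi):
    """True if point (x, y) lies inside the inclusive rectangle roi."""
    (x_min, y_min), (x_max, y_max) = roi
    return x_min <= point[0] <= x_max and y_min <= point[1] <= y_max
```

With hypothetical reference points such as `compute_rois((30, 55), (90, 55), (60, 125))`, each of the 14 entries is an axis-aligned rectangle that can be intersected with the binary edge image later in the pipeline.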

In its general form, a 2D Gabor wavelet kernel function can be described as [33]:

$$\psi_j(\mathbf{x}) = \frac{\|\mathbf{k}_j\|^2}{\sigma^2} \exp\!\left(-\frac{\|\mathbf{k}_j\|^2 \|\mathbf{x}\|^2}{2\sigma^2}\right) \left[\exp\!\left(i\,\mathbf{k}_j \cdot \mathbf{x}\right) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right] \qquad (4\text{-}5)$$

where i denotes the imaginary unit, σ is the standard deviation of the Gaussian envelope, and k_j is the wave vector; different Gabor kernel functions are obtained from different k_j. The index u selects the orientation and v the frequency of the filter. The multiplicative factor $\|\mathbf{k}_j\|^2/\sigma^2$ ensures that filters tuned to different spatial frequency bands have approximately equal energies. The term $\exp(-\sigma^2/2)$ is subtracted to render the filters insensitive to illumination. x represents the coordinates of a given pixel. In addition, because facial expression features are mainly described by high-frequency components, 3 frequencies (v = 0, 1, 2) and N = 8 orientations (u = 1, …, 8) are used to yield 3x8 filters. However, in robotic emotion recognition, convolving the face image with all 24 filters is quite time consuming. Consequently, only six important filters are used to extract facial points.

With the Gabor wavelet kernel functions defined in (4-5), a single Gabor feature is obtained by convolving one of the filters (a specific v and u) with the original image. For pixel x in image I(x), a Gabor jet J is defined such that:

$$J_j(\mathbf{x}) = \int I(\mathbf{x}')\,\psi_j(\mathbf{x} - \mathbf{x}')\,d^2\mathbf{x}'$$

where I(x) represents the gray-value image. When we apply the Gabor filters at a specific point x, we obtain multiple filter responses for that point. The results of the transformation are complex values, so their amplitudes are taken as the final transformation results. Each feature point can thus be represented by such a Gabor jet in place of just its gray value.
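A Gabor kernel of the general form (4-5) and the resulting jet at one pixel can be sketched with the standard library alone. The parameterization of the wave vector (magnitude π·2^(−(v+2)/2), angle uπ/8) and the value σ = 2π are assumptions borrowed from common practice in the Gabor-jet literature, since this section does not state them; the function names and the zero padding at the image border are likewise illustrative.

```python
import cmath
import math


def gabor_kernel(v, u, size=11, sigma=2 * math.pi):
    """Sample a complex 2D Gabor kernel on a size x size grid.

    Assumed wave vector: magnitude pi * 2**(-(v + 2) / 2), angle u * pi / 8.
    """
    k_mag = math.pi * 2 ** (-(v + 2) / 2)
    phi = u * math.pi / 8
    kx, ky = k_mag * math.cos(phi), k_mag * math.sin(phi)
    k2 = k_mag ** 2
    dc = math.exp(-sigma ** 2 / 2)  # term subtracted to remove the DC response
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            envelope = (k2 / sigma ** 2) * math.exp(
                -k2 * (x * x + y * y) / (2 * sigma ** 2))
            row.append(envelope * (cmath.exp(1j * (kx * x + ky * y)) - dc))
        kernel.append(row)
    return kernel


def gabor_response(image, kernel, px, py):
    """Complex convolution response at pixel (px, py); pixels outside the
    image are treated as 0 (an assumption of this sketch)."""
    half = len(kernel) // 2
    rows, cols = len(image), len(image[0])
    total = 0j
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            yy, xx = py - dy, px - dx  # convolution: I(x - d) * psi(d)
            if 0 <= yy < rows and 0 <= xx < cols:
                total += image[yy][xx] * kernel[dy + half][dx + half]
    return total


def gabor_jet(image, px, py, frequencies=(0, 1, 2), orientations=(7, 8)):
    """Collect the complex responses of the selected filters at one pixel."""
    return [gabor_response(image, gabor_kernel(v, u), px, py)
            for v in frequencies for u in orientations]
```

Taking `abs(c)` of each jet component gives the amplitudes used as the final transformation results, while `cmath.phase(c)` gives the phases needed later for phase-sensitive matching.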

Figure 4-7 shows examples of the responses of the Gabor filters (of size 11x11), and Figure 4-8 is the original face image. The filtered images bring out facial features at the different frequencies and orientations of the Gabor wavelet filters.

Features of the filtered images show less illumination variation than the original grayscale image, which is lit from the left. To reduce the computational cost of convolving with all 24 filters, we use only 6 filters to extract Gabor jets: 3 frequencies (v = 0, 1, 2) and 2 orientations (u = 7, 8) were found to be the best choices for filtering the original face image. These six filters clearly bring out the edges of the eyebrows, eyes and mouth. The six filtered facial images are illustrated in Figure 4-9.

Figure 4-7 Different frequencies and orientations of Gabor wavelet filters

Figure 4-8 The original face image

Figure 4-9 Six selected filtered facial images

We observe that the edges of the eyebrows, eyes and mouth are emphasized after Gabor filtering. Since these edges are the regions where the facial points are located, binary images of the edges can be obtained by applying a proper threshold to the filtered images. As a result, we can restrict the search for facial points to the intersection of the binary edge image and the relevant ROIs.

In this design, a binary image of the facial edges is obtained by summing the Gabor jets of the six selected filtered images into one image; we then set IOD=0.2 to separate the facial edges in the upper face region and IOD=0.2 in the mouth region.

The upper face region ranges from pixel(10, 10) to pixel(110, 70), and the mouth region ranges from pixel(R3_x-40, R3_y-20) to pixel(R3_x+40, R3_y+30).

Summing the six selected filtered images effectively weakens the illumination interference present in the individual filtered images. An example of the binary image of the face is shown in Figure 4-10. Figure 4-10 (a) is the result of summing the six Gabor jets into one image, in which the facial edges are clearly visible. The binary edge image is then obtained by setting a proper threshold; the result is illustrated in Figure 4-10 (b). Finally, the intersection of the 14 ROIs and the binary edge image gives the final ROIs in which the facial points are located (see Figure 4-10 (c)). Afterwards, the similarity function is used to match facial points only in these areas, which greatly reduces the cost of searching for facial points.

4.5 Feature Points Extracting

The proposed method of extracting facial points is composed of two phases: a training phase and a testing phase. In the training phase, Gabor jets at defined facial points are collected so that facial points can be extracted automatically in the testing phase. Then, for each search point in the testing phase, a sliding window of 5x5 pixels with 6 different jets is applied to find the best-matching point by using a phase-sensitive similarity function. The facial points involved in changes of facial expression are extracted. The exact positions of these points are shown in Figure 4-11, and Table 4-2 gives the definitions of the 14 facial points.

Figure 4-10 (a) The image obtained by summing 6 Gabor jets. (b) The binary image of facial edges. (c) The intersection of the 14 ROIs and the binary image of edge regions on

Figure 4-11 14 facial points

Table 4-2 The definitions of 14 facial points

Point    Description
P1       Mouth top
P2       Mouth bottom
P3       Left mouth corner
P4       Right mouth corner
P5       Inner corner of the right eye
P6       Top of the right eye
P7       Bottom of the right eye
P8       Inner corner of the left eye
P9       Top of the left eye
P10      Bottom of the left eye
P11      Inner corner of right eyebrow
P12      Top of right eyebrow
P13      Inner corner of left eyebrow
P14      Top of left eyebrow

4.5.1 The Training Phase of Feature Points Extracting

The overall procedure of feature point extraction in the training phase is given in the left part of Figure 4-12. First, a number of expressionless images are collected into a database; we then select the facial points manually and evaluate the Gabor jets both on and around these points. A patch of 5x5 pixels around each facial point is used to extract its Gabor jets by filtering with a bank of 6 Gabor filters at 2 orientations and 3 frequencies. In other words, 150 (5x5x6) Gabor features are used to represent one facial point. Finally, the Gabor jets of the 14 feature points are collected into the feature point database so that facial points can be extracted automatically in the testing phase.

4.5.2 The Testing Phase of Feature Points Extracting

In the testing phase (see the right part of Figure 4-12), the final ROIs are first defined by the intersection of the 14 relevant regions of interest and the binary image of the edge regions; subsequently, we evaluate the Gabor features of each 5x5-pixel patch by scanning over the ROIs. A phase-sensitive similarity function is applied to match these Gabor features against candidate facial pixels, and the position with the highest response gives the exact position of the facial point in question. The phase-sensitive similarity function is [33]:

$$S_\phi(J, J') = \frac{\sum_j a_j a'_j \cos\!\left(\phi_j - \phi'_j - \mathbf{d}\cdot\mathbf{k}_j\right)}{\sqrt{\sum_j a_j^2 \, \sum_j {a'_j}^2}}$$

where a_j is the amplitude, which varies slowly with position, and φ_j is the phase of the j-th jet component. d is a small relative displacement between the two jets J and J'. The displacement d can be estimated by maximizing S_φ in its Taylor expansion around d = 0, which is a constrained fit of the two-dimensional d to the 6 phase differences φ_j − φ'_j.

Figure 4-12 The overall procedure of feature point extraction in the training phase and testing phase
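The phase-sensitive similarity can be evaluated directly on complex jet components. The sketch below follows the standard form of this measure from the Gabor-jet literature; jets are assumed to be lists of complex filter responses, and the displacement term is omitted unless wave vectors are supplied. The function names and the dictionary of candidate jets are hypothetical, and the Taylor-expansion estimate of d is not implemented.

```python
import cmath
import math


def phase_similarity(jet_a, jet_b, wave_vectors=None, displacement=(0.0, 0.0)):
    """Phase-sensitive similarity of two jets (lists of complex responses).

    wave_vectors: optional list of (kx, ky), one per component, used to
    compensate the phase shift caused by a small displacement d.
    """
    numerator = energy_a = energy_b = 0.0
    for j, (ca, cb) in enumerate(zip(jet_a, jet_b)):
        amp_a, phase_a = cmath.polar(ca)
        amp_b, phase_b = cmath.polar(cb)
        shift = 0.0
        if wave_vectors is not None:
            kx, ky = wave_vectors[j]
            shift = displacement[0] * kx + displacement[1] * ky
        numerator += amp_a * amp_b * math.cos(phase_a - phase_b - shift)
        energy_a += amp_a ** 2
        energy_b += amp_b ** 2
    denominator = math.sqrt(energy_a * energy_b)
    return numerator / denominator if denominator > 0 else 0.0


def best_match(candidate_jets, reference_jet):
    """Return the position whose jet is most similar to the reference jet.

    candidate_jets: dict mapping (x, y) positions inside the final ROIs to
    their jets (hypothetical data layout).
    """
    return max(candidate_jets,
               key=lambda pos: phase_similarity(candidate_jets[pos],
                                                reference_jet))
```

The measure lies between -1 and 1: identical jets score 1, and a jet compared with its sign-inverted copy scores -1, which is what makes it sensitive to small positional shifts that an amplitude-only comparison would miss.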

4.6 Feature Extraction Evaluation

The displacement information of these key points is taken as the feature values in the proposed facial expression recognition system. To analyze the expression feature values systematically, we adopt the AU-coded descriptions of facial expressions in the FACS to describe the relationship between AUs and the displacements of facial points [36]. The Facial Action Coding System (FACS) [3][4], designed for human observers to describe subtle muscle-induced changes in facial appearance, is a well-known method for analyzing facial activity. FACS provides a linguistic description of all possible facial changes in terms of 44 Action Units (AUs).

Referring to [36], Table 4-3 shows the association of facial AUs with four expressions, and the facial AUs pertaining to five expressions are also illustrated in Figure 4-13 [4]. According to these AUs, we can establish the association between the AUs and the displacements of the fiducial points, as shown in Table 4-4. The facial changes of the AUs can thus be measured quantitatively via the displacements of the facial points. Finally, we summarize 16 feature values as AUs for recognizing 5 emotional expressions. The detailed descriptions of these 16 feature values, associated with the defined facial points, are shown in Table 4-5.

Table 4-3 The association of 5 facial expressions to AU combination

Emotional category    Visual cues
Anger                 AU4 AU7 AU23 AU24
Happiness             AU7 AU12 AU25
Sadness               AU1 AU15
Surprise              AU1 AU5 AU7 AU27

Figure 4-13 The list of AUs related to five expressions [4]

Table 4-4 Feature points based descriptions for AUs

AUs     Facial visual cues                                Description
AU1     P9P15 increased                                   Inner brow raiser
AU4     L1, P9P15, P10P16 decreased                       Brow lowerer
AU5     P10P12 increased                                  Upper eyelid raiser
AU7     P10P12 decreased                                  Eyelid tightener
AU12    L2 decreased, P3P4 increased, P3P11 decreased     Lip corner puller
AU15    L2 increased, P3P11 decreased, L3 decreased       Lip corner depressor
AU20    L2 unchanged, P3P4 increased                      Lip stretcher
AU23    P1P2, P3P4 decreased                              Lip tightener
AU24    P1P2 decreased, P3P4 unchanged                    Lip presser
AU25    P1P2, P3P4 increased                              Lips part
AU27    P1P2 increased, P3P4 decreased                    Mouth stretch

* L1: the vertical displacement between the right brow upper center and the right eyelid lower center.
  L2: the vertical displacement between the right eyelid inner corner and the right lip corner.

4.7 The Experimental Results of Facial Points Extraction

To evaluate the effectiveness of the Gabor wavelet-based feature extraction, a set of images with different expressions and different lighting conditions from the AR database [37] was used. The AR Face database was created by Aleix Martinez and Robert Benavente at the Computer Vision Center (CVC) of the U.A.B. It contains frontal-view faces with different facial expressions, illumination conditions, and occlusions; only the images with different expressions and different illumination conditions are used in our experiments. The experimental images are: neutral expression, smile, scream, left light on, right light on, and all lights on. Figure 4-14 shows examples of the images.

The Gabor wavelet-based feature point extraction has been tested on a set of 180 images from 30 persons, with six images per person. The neutral expressions of ten persons are used as the matching data in the facial point extraction database. Therefore, the neutral expression has 20 images and each of the other expressions has 30 images. To evaluate the performance of facial point matching, each of the extracted facial points
