Kernel optimization-based discriminant analysis for face recognition

ORIGINAL ARTICLE

Jun-Bao Li · Jeng-Shyang Pan · Zhe-Ming Lu

Received: 1 December 2006 / Accepted: 7 May 2009 / Published online: 26 May 2009
© Springer-Verlag London Limited 2009

Abstract The choice of kernel function and its parameters strongly influences the performance of a kernel learning machine. Different kernels and parameters yield different geometric structures of the data in the empirical feature space. The traditional approach of changing only the kernel parameters does not change the data distribution in the empirical feature space, which makes it unsuitable for improving the performance of kernel learning. This paper applies kernel optimization to enhance the performance of kernel discriminant analysis and proposes Kernel Optimization-based Discriminant Analysis (KODA) for face recognition. KODA consists of two steps: optimizing the kernel and projecting. KODA automatically adjusts the kernel parameters according to the input samples, which improves feature extraction for face recognition. Simulations on the Yale and ORL face databases demonstrate the feasibility of enhancing KDA with kernel optimization.

Keywords Face recognition · Kernel optimization-based discriminant analysis (KODA) · Kernel discriminant analysis (KDA)

J.-B. Li (✉)
Department of Automatic Test and Control, Harbin Institute of Technology, P.O. Box 339, Harbin 150001, People's Republic of China
e-mail: [email protected]

J.-S. Pan
Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, D415 Chien-Kung Road, Kaohsiung 807, Taiwan

Z.-M. Lu
Visual Information Analysis and Processing Research Center, Harbin Institute of Technology Shenzhen Graduate School, Room 202L, Building No. 4, HIT Campus Shenzhen University Town, Xili, Shenzhen 518055, People's Republic of China

Neural Comput & Applic (2009) 18:603–612

1 Introduction

Face recognition research started in the late 1970s and has been one of the most active research areas in computer vision and pattern recognition since the 1990s. Many face recognition algorithms have been developed in recent years [1–3]. Among the crucial issues of face recognition technology, a low-dimensional feature representation with enhanced discriminatory power is of paramount importance. Many dimension reduction methods have been proposed, such as linear discriminant analysis (LDA) [4], principal component analysis (PCA) [5], and independent component analysis (ICA) [6, 7]. These algorithms have been analyzed or combined to solve practical problems [8, 9]. In face recognition, however, face images have a nonlinear and complex distribution under perceivable variations in viewpoint, illumination or facial expression, so linear techniques such as PCA or LDA cannot provide reliable and robust solutions for problems with complex face variations [10]. Recently, kernel learning has been applied to face recognition as an effective nonlinear method [11, 12]. In particular, owing to its nonlinear property and strong theoretical background, kernel discriminant analysis (KDA) has been employed since 2000 to address the pose and illumination problems in face recognition [11, 13–15]. KDA is the nonlinear extension of LDA with the kernel trick. The kernel trick maps the input data to a higher dimensional kernel space to represent complex nonlinear relationships in the input data. KDA then finds an optimal linear projection from the kernel space to a lower dimensional projection subspace in which each class is well separated from the others.

Several algorithms have been proposed to enhance the performance of KDA. Reference [16] generalizes uncorrelated linear discriminant analysis to kernel uncorrelated discriminant analysis. Reference [10] proposes kernel direct discriminant analysis to solve the "small sample size" (SSS) problem. Reference [17] applies the kernel trick to the maximal Fisher criterion and the minimal statistical correlation between feature components. These methods enhance KDA by developing better ways to find the optimal linear projection from the kernel space to the projection subspace, but they do not consider the influence of the kernel function and its parameters. More recently, researchers have optimized the parameters of the kernel function to improve KDA [18–20], but these methods only choose the optimal kernel parameter from a set of discrete values created in advance. Changing only the kernel parameters does not change the geometric structure of the data distribution in the kernel space. Other work optimizes the kernel itself: Micchelli and Pontil [21] proposed a method based on combining basic kernel functions, Lanckriet et al. [22] proposed optimizing the kernel matrix instead of a specific kernel function, Dai and Yeung [23] proposed convex semi-supervised kernel optimization based on basic kernels, and Xiong et al. [24] proposed a data-dependent kernel for kernel optimization.

In this paper, we apply kernel optimization to improve KDA and propose kernel optimization-based discriminant analysis (KODA). The procedure of KODA is divided into two steps: optimizing the kernel and finding the linear projection. In the first step, we use the data-dependent kernel and optimize the geometric structure of the data distribution in the kernel feature space by adjusting the parameters of the data-dependent kernel. In the second step, we apply the traditional method of finding the linear projection. The main contribution of this paper lies in the kernel optimization method. First, we extend the definition of the data-dependent kernel; second, we propose two criteria for constructing the constraint equation used to solve for the optimal parameters.

The remainder of this paper is organized as follows. A review of kernel discriminant analysis is presented in Sect. 2. The proposed KODA algorithm is introduced in Sect. 3. In Sect. 4, simulation results are presented to demonstrate the effectiveness of KODA. Conclusions are summarized in Sect. 5.

2 Kernel discriminant analysis

This section describes the generalized kernel discriminant analysis algorithm in detail. Suppose there are $M$ training samples $\{x_1, x_2, \ldots, x_M\}$ of dimension $N$ from $L$ classes. The main idea of KDA is to map the original training samples to the feature space $F$ through the nonlinear mapping $\Phi$, and then to implement linear discriminant analysis in the feature space $F$. That is,

$$\Phi : \mathbb{R}^N \to F, \quad x \mapsto \Phi(x) \tag{1}$$

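The dimension of the feature space $F$ is very high. To avoid dealing with the mapped samples explicitly, we introduce the kernel function $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$. Under the nonlinear mapping, the between-class scatter matrix and the total scatter matrix in the feature space $F$ are defined as follows:

$$S_B^{\Phi} = \sum_{i=1}^{L} \frac{n_i}{M} \left(m_i^{\Phi} - m^{\Phi}\right)\left(m_i^{\Phi} - m^{\Phi}\right)^T \tag{2}$$

$$S_T^{\Phi} = \frac{1}{M} \sum_{i=1}^{M} \left(\Phi(x_i) - m^{\Phi}\right)\left(\Phi(x_i) - m^{\Phi}\right)^T \tag{3}$$

where $n_i$ is the number of samples in class $i$, $m^{\Phi} = \frac{1}{M}\sum_{i=1}^{M} \Phi(x_i)$ is the mean of all mapped samples, and $m_j^{\Phi} = \frac{1}{n_j}\sum_{i=1}^{n_j} \Phi(x_i)$ is the mean of the mapped samples of class $j$, the sum running over the $n_j$ samples of that class. Then the criterion to be maximized is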

$$J(V) = \frac{V^T S_B^{\Phi} V}{V^T S_T^{\Phi} V} \tag{4}$$

where $V$ is the discriminant vector. According to Mercer kernel theory, any solution vector $V$ must lie in the span of $\{\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_M)\}$; that is, there exist coefficients $c_p$ ($p = 1, 2, \ldots, M$) such that

$$V = \sum_{p=1}^{M} c_p \Phi(x_p) = \Psi\alpha \tag{5}$$

where $\Psi = [\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_M)]$ and $\alpha = [c_1, c_2, \ldots, c_M]^T$. Eq. (4) is then transformed to

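$$J(\alpha) = \frac{\alpha^T K G K \alpha}{\alpha^T K K \alpha} \tag{6}$$

where $G = \mathrm{diag}(G_1, G_2, \ldots, G_L)$, $G_i$ is an $n_i \times n_i$ matrix whose elements all equal $1/n_i$, and $K$ is the kernel matrix calculated by $k(x, y)$. Let $A_{opt} = [\alpha_1, \alpha_2, \ldots, \alpha_d]$ consist of $d$ discriminant vectors $\alpha_1, \alpha_2, \ldots, \alpha_d$. The matrix $A_{opt}$ satisfies

$$A_{opt} = \arg\max_{A} \frac{\left|A^T K G K A\right|}{\left|A^T K K A\right|} \tag{7}$$

Then the feature vector $y$ of a sample $x$ is

$$y = A_{opt}^T \left[k(x, x_1), k(x, x_2), \ldots, k(x, x_M)\right]^T \tag{8}$$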

Procedure:

Step 1. Select kernel function k(x, y) and its parameters, calculate the kernel matrix K;

Step 2. Calculate the projection matrix Aopt;

Step 3. The feature of the sample x is $y = A_{opt}^T \left[k(x, x_1), k(x, x_2), \ldots, k(x, x_M)\right]^T$.
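The three steps can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`gaussian_kernel`, `class_matrix`, `kda_fit`, `kda_transform`), the Gaussian form of the basic kernel, and the small ridge term `reg` added to the denominator matrix are all assumptions introduced here. The maximization in Eq. (7) is solved in the usual way, by taking the leading generalized eigenvectors of the pair (KGK, KK).

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows."""
    sq = (X**2).sum(axis=1)[:, None] + (Y**2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def class_matrix(labels):
    """G[i, j] = 1/n_c if samples i and j both belong to class c, else 0.
    For samples sorted by class this equals diag(G_1, ..., G_L)."""
    labels = np.asarray(labels)
    G = np.zeros((labels.size, labels.size))
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        G[np.ix_(idx, idx)] = 1.0 / idx.size
    return G

def kda_fit(K, labels, d, reg=1e-6):
    """Step 2: leading d generalized eigenvectors of (K G K) a = lam (K K) a."""
    M = K.shape[0]
    B = K @ class_matrix(labels) @ K
    W = K @ K + reg * np.eye(M)            # reg is an assumption, for numerical stability
    Lc = np.linalg.cholesky(W)             # W = Lc Lc^T
    C = np.linalg.solve(Lc, np.linalg.solve(Lc, B).T)   # Lc^{-1} B Lc^{-T}
    vals, U = np.linalg.eigh((C + C.T) / 2.0)           # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:d]                  # keep the d largest
    A = np.linalg.solve(Lc.T, U[:, order])              # map back: a = Lc^{-T} u
    return A, vals[order]

def kda_transform(A, K_rows):
    """Step 3: each row of K_rows is [k(x, x_1), ..., k(x, x_M)]; returns rows y^T."""
    return K_rows @ A

# Toy usage: two well-separated blobs (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(3.0, 0.3, (10, 2))])
labels = np.array([0] * 10 + [1] * 10)
K = gaussian_kernel(X, X, sigma=1.0)       # Step 1
A, vals = kda_fit(K, labels, d=2)          # Step 2
Y = kda_transform(A, K)                    # Step 3, shape (20, 2)
```

Because G is an orthogonal projection, the criterion value of every discriminant vector lies between 0 and 1. The projection matrix depends on the data only through K, which is why the choice of kernel matters in what follows.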

From the above discussion, the projection matrix is a function of the kernel matrix. If the kernel function and its parameters are not chosen appropriately, the projection matrix is not optimal for the mapping from the input space to the feature space. Traditional KDA therefore cannot automatically adjust the data structure in the feature space for feature extraction.

3 Kernel optimization-based discriminant analysis (KODA)

As discussed above, KDA finds an optimal linear projection from the kernel feature space to the projection subspace. If the nonlinear mapping Φ is chosen inappropriately, KDA cannot find the optimal linear projection. Our algorithm therefore first optimizes the nonlinear map Φ, by optimizing the kernel, to maximize the class separability in the feature space, and then finds the optimal transformation to maximize the class separability in the projection subspace. The proposed algorithm thus has two stages: the first optimizes the kernel, and the second finds the optimal projection in the same way as traditional KDA. The procedure is shown in Fig. 1.

3.1 Kernel optimization

The geometric structure of the sample data in the nonlinear projection space differs from one kernel function to another, and so does the class discriminative ability of the data in that space. The kernel function should therefore depend on the input data, which is the main idea of the data-dependent kernel proposed in [25]. The parameters of the data-dependent kernel are adjusted according to the input data so that the data attain a geometric structure in the feature space that is optimal for classification. In this paper, we extend the definition of the data-dependent kernel and use it as the objective function of a constrained optimization problem. The data-dependent kernel is defined as

$$k(x, y) = f(x) f(y) k_0(x, y) \tag{9}$$

where $k_0(x, y)$ is the basic kernel function, such as a polynomial kernel or a Gaussian kernel. The function $f(x)$ is defined in [25] as

$$f(x) = \sum_{i \in SV} a_i e^{-\delta \|x - \tilde{x}_i\|^2} \tag{10}$$

where $\tilde{x}_i$ is the $i$th support vector, $SV$ is the set of support vectors, $a_i$ is a positive value that represents the contribution of $\tilde{x}_i$, and $\delta$ is a free parameter. We extend the definition of the data-dependent kernel by defining the function $f(x)$ in different ways, as follows.

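$$f(x) = b_0 + \sum_{n=1}^{N_{XV}} b_n e(x, \tilde{x}_n) \tag{11}$$

where $\delta$ is a free parameter appearing in $e(x, \tilde{x}_n)$, $\tilde{x}_n$ are the expansion vectors (XVs), $N_{XV}$ is the number of expansion vectors, and $b_n$ ($n = 0, 1, 2, \ldots, N_{XV}$) are the corresponding expansion coefficients.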
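To make Eqs. (9) and (11) concrete, the following NumPy sketch builds a data-dependent kernel matrix from a basic kernel matrix. It is a minimal illustration under assumptions: the helper names (`sq_dists`, `factor_f`, `data_dependent_kernel`), the Gaussian basic kernel, the Gaussian form used for e, the choice of expansion vectors and the coefficient values are all hypothetical and chosen only for the example.

```python
import numpy as np

def sq_dists(X, Z):
    """Pairwise squared Euclidean distances between rows of X and rows of Z."""
    d = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.maximum(d, 0.0)

def factor_f(E, b):
    """Eq. (11): f(x_i) = b_0 + sum_n b_n e(x_i, x~_n), with E[i, n-1] = e(x_i, x~_n)."""
    b = np.asarray(b, dtype=float)
    return b[0] + E @ b[1:]

def data_dependent_kernel(K0, f_left, f_right=None):
    """Eq. (9) applied entrywise: K[i, j] = f(x_i) f(y_j) k0(x_i, y_j)."""
    f_right = f_left if f_right is None else f_right
    return f_left[:, None] * K0 * f_right[None, :]

# Toy usage (every value below is an illustrative assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
XV = X[:2]                                # two expansion vectors, picked arbitrarily
delta = 0.5                               # free parameter
K0 = np.exp(-sq_dists(X, X) / 2.0)        # Gaussian basic kernel with unit width
E = np.exp(-delta * sq_dists(X, XV))      # a Gaussian choice for e(x, x~_n)
b = np.array([1.0, 0.3, -0.2])            # b_0, b_1, b_2
f = factor_f(E, b)
K = data_dependent_kernel(K0, f)          # 6 x 6 data-dependent kernel matrix
```

In matrix form the result is Q K0 Q with Q = diag(f(x_1), …, f(x_M)), so K stays symmetric and positive semidefinite whenever K0 is. Setting b_1 = … = b_NXV = 0 and b_0 = 1 recovers the basic kernel.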

Xiong et al. [24] randomly selected one third of the samples as the expansion vectors. This paper proposes four methods of defining $e(x, \tilde{x}_n)$, as follows:

m1:

$$e(x, \tilde{x}_n) = e(x, x_n) = \begin{cases} 1, & x \text{ and } x_n \text{ have the same class label} \\ e^{-\delta \|x - x_n\|^2}, & x \text{ and } x_n \text{ have different class labels} \end{cases}$$

This method regards all labeled samples as expansion vectors and is applicable to supervised learning. It must consider the class labels of the samples, and it causes samples from the same class to concentrate at one point in the feature space. Many current pattern recognition methods, such as linear discriminant analysis and support vector machines, are supervised learning methods.

m2:

$$e(x, \tilde{x}_n) = e(x, x_n) = e^{-\delta \|x - x_n\|^2}, \quad n = 1, 2, \ldots, M$$

All training samples are considered as expansion vectors, so the number of expansion vectors equals the number of training samples, i.e., $N_{XV} = M$.

m3:

$$e(x, \tilde{x}_n) = e(x, \bar{x}_n) = \begin{cases} 1, & x \text{ and } \bar{x}_n \text{ have the same class label} \\ e^{-\delta \|x - \bar{x}_n\|^2}, & x \text{ and } \bar{x}_n \text{ have different class labels} \end{cases}$$

where $N_{XV} = L$ and $\bar{x}_n$ is the mean of the $n$th class of samples. This method is intended to reduce the computational burden of methods m1 and m2.

m4:

$$e(x, \tilde{x}_n) = e(x, \bar{x}_n) = e^{-\delta \|x - \bar{x}_n\|^2}, \quad n = 1, 2, \ldots, L$$

This method takes the class means as the expansion vectors. It considers the distance between each sample and the class means, but makes little use of the class label information.
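The four definitions differ only in the choice of expansion vectors (all training samples or the class means) and in whether same-class entries are set to 1. The NumPy sketch below, with the hypothetical helper `expansion_matrix`, computes the matrix of values e(x_i, x̃_n) for the training set under each method. It is an illustration of the definitions above, not the authors' code.

```python
import numpy as np

def sq_dists(X, Z):
    """Pairwise squared Euclidean distances between rows of X and rows of Z."""
    d = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.maximum(d, 0.0)

def expansion_matrix(X, y, method, delta=1.0):
    """Return E with E[i, n] = e(x_i, x~_n) for method 'm1', 'm2', 'm3' or 'm4'."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    if method in ("m1", "m2"):
        XV, yv = X, y                      # expansion vectors: all training samples
    elif method in ("m3", "m4"):
        # expansion vectors: one class mean per class, in sorted label order
        XV = np.stack([X[y == c].mean(axis=0) for c in classes])
        yv = classes
    else:
        raise ValueError("method must be one of 'm1', 'm2', 'm3', 'm4'")
    E = np.exp(-delta * sq_dists(X, XV))   # Gaussian part shared by all four methods
    if method in ("m1", "m3"):
        E[y[:, None] == yv[None, :]] = 1.0 # same class label: e = 1
    return E

# Toy usage: four 2-D samples in two classes (illustrative data only).
X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
y = np.array([0, 0, 1, 1])
E1 = expansion_matrix(X, y, "m1", delta=0.1)   # shape (4, 4), N_XV = M
E2 = expansion_matrix(X, y, "m2", delta=0.1)   # shape (4, 4), N_XV = M
E3 = expansion_matrix(X, y, "m3", delta=0.1)   # shape (4, 2), N_XV = L
E4 = expansion_matrix(X, y, "m4", delta=0.1)   # shape (4, 2), N_XV = L
```

With m3 and m4 the matrix has only L columns instead of M, which is the source of the computational saving mentioned for m3.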

Fig. 1 Procedure of KODA (Input → Kernel optimization → Kernel discriminant analysis → Output)

mean that KODA can obtain absolutely optimal performance under any condition. The performance of KODA also depends on whether the basic kernel is chosen appropriately.

5 Conclusion

A new face recognition method, KODA, has been developed in this paper. In KODA, we propose a novel criterion for kernel discriminant analysis, which enhances the discriminant power by maximizing the class separability both in the feature space and in the projection subspace. To maximize the class separability in the feature space, we optimize the kernel function with the proposed Fisher criterion and maximum margin criterion. The proposed algorithm adapts better to the input data, so in our simulations the performance of KODA is superior overall to that of KDA. We expect that KODA will also perform well in other areas, such as content-based image indexing and retrieval as well as video and audio classification.

References

1. Li J-B, Pan J-S, Chu S-C (2008) Kernel class-wise locality preserving projection. Inf Sci 178(7):1825–1835

Fig. 6 Example face images of face databases used in our experiments. a Example cropped face images from the ORL face database in our experiments (cropped to the size of 48 × 48 to extract the facial region). b Example cropped face images from the Yale face database in our experiments (cropped to the size of 100 × 100 to extract the facial region)

Fig. 7 Example face images of UMIST face database used in our simulations

Table 1 Performance comparison with MMC on YALE and ORL databases

Database  m1      m2      m3      m4
YALE      0.9187  0.9227  0.9200  0.9227
ORL       0.9365  0.9340  0.9335  0.9355

Table 2 Time consumption with m1 on YALE and ORL databases

Criterion  ORL   YALE
MMC        0.05  0.03
FC         1.50  0.32

Table 3 Performance comparison on YALE, ORL and UMIST databases

Database  PCA    KPCA   LDA    ICA    KDA    KODA
YALE      0.733  0.767  0.867  0.887  0.933  0.947
ORL       0.905  0.915  0.930  0.935  0.940  0.955

2. Ma B, Qu H-Y, Wong H-S (2007) Kernel clustering-based discriminant analysis. Pattern Recognit 40(1):324–327

3. Wu X-H, Zhou J-J (2000) Fuzzy discriminant analysis with kernel methods. Pattern Recognit 39(11):2236–2239

4. Belhumeur PN, Hespanha JP, Kriegman DJ (1997) Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Trans Pattern Anal Mach Intell 19(7):711–720

5. Chawla MPS (2008) Segment classification of ECG data and construction of scatter plots using principal component analysis. Int J Mech Med Biol (JMMB), WSPC 8(3):421–458

6. Chawla MPS (2007) Parameterization and correction of electrocardiogram signals using independent component analysis. Int J Mech Med Biol (JMMB), WSPC 7(4):355–379

7. Chawla MPS, Verma HK, Kumar V (2008) Artifacts and noise removal in electrocardiograms using independent component analysis. Int J Cardiol 129(2):278–281

8. Chawla MPS (2008) A comparative analysis of principal component and independent component techniques for electrocardiograms. Int J Neural Comput Appl (NCA), Springer (available online 23-7-2008)

9. Chawla MPS, Verma HK, Kumar V (2008) A new statistical PCA–ICA algorithm for location of R-peaks in ECG. Int J Cardiol 129(1):146–148

10. Lu J, Plataniotis KN, Venetsanopoulos AN (2003) Face recognition using kernel direct discriminant analysis algorithms. IEEE Trans Neural Netw 14(1):117–126

11. Müller KR, Mika S, Rätsch G, Tsuda K, Schölkopf B (2001) An introduction to kernel-based learning algorithms. IEEE Trans Neural Netw 12:181–201

12. Liu Q, Lu H, Ma S (2004) Improving kernel fisher discriminant analysis for face recognition. IEEE Trans Pattern Anal Mach Intell 14(1):42–49

13. Baudat G, Anouar F (2000) Generalized discriminant analysis using a kernel approach. Neural Comput 12:2385–2404

14. Ruiz A, López de Teruel PE (2001) Nonlinear kernel-based statistical pattern analysis. IEEE Trans Neural Netw 12:16–32

15. Lu JW, Plataniotis K, Venetsanopoulos AN (2003) Face recognition using kernel direct discriminant analysis algorithms. IEEE Trans Neural Netw 14(1):117–126

16. Liang ZZ, Shi PF (2005) Uncorrelated discriminant analysis using a kernel method. Pattern Recognit 38(2):307–310

17. Liang Z, Shi P (2004) Efficient algorithm for kernel discriminant analysis. Pattern Recognit 37(2):381–384

18. Huang J, Yuen PC, Chen W-S, Lai JH (2004) Kernel subspace LDA with optimized kernel parameters on face recognition. In: Proceedings of the sixth IEEE international conference on automatic face and gesture recognition

19. Wang L, Chan KL, Xue P (2005) A criterion for optimizing kernel parameters in KBDA for image retrieval. IEEE Trans Syst Man Cybern B Cybern 35(3):556–562

20. Chen W-S, Yuen PC, Huang J, Dai D-Q (2005) Kernel machine-based one-parameter regularized fisher discriminant method for face recognition. IEEE Trans Syst Man Cybern B Cybern 35(4):658–669

21. Micchelli CA, Pontil M (2005) Learning the kernel function via regularization. J Mach Learn Res 6:1099–1125

22. Lanckriet G, Cristianini N, Bartlett P, Ghaoui LE, Jordan MI (2004) Learning the kernel matrix with semidefinite programming. J Mach Learn Res 5:27–72

23. Dai G, Yeung D-Y (2007) Kernel selection for semi-supervised kernel machines. In: Proceedings of the 24th international conference on machine learning, pp 1457–1465

24. Xiong H, Swamy MNS, Ahmad MO (2005) Optimizing the kernel in the empirical feature space. IEEE Trans Neural Netw 16(2):460–474

25. Amari S, Wu S (1999) Improving support vector machine classifiers by modifying kernel functions. Neural Netw 12(6):783–789

26. Li H, Jiang T, Zhang K (2006) Efficient and robust feature extraction by maximum margin criterion. IEEE Trans Neural Netw 17(1):157–164

27. Samaria F, Harter A (1994) Parameterisation of a stochastic model for human face identification. In: Proceedings of 2nd IEEE workshop on applications of computer vision

28. Graham DB, Allinson NM (1998) Face recognition: from theory to applications. Comput Syst Sci 163:446–456