Robust algorithms for principal component analysis


Tai-Ning Yang *, Sheng-De Wang

Department of Electrical Engineering, National Taiwan University, EE Building, Room 441, 1 Roosevelt Road, Sec. 4, Taipei 106, Taiwan

Received 1 December 1997; received in revised form 3 May 1999

Abstract

In this paper, we address the design of fuzzy robust principal component analysis (FRPCA) algorithms. Robust principal component analysis has been studied in the statistics literature for over two decades. More recently, Xu and Yuille proposed a family of on-line robust principal component analysis algorithms based on a statistical physics approach. We extend Xu and Yuille's objective function by using fuzzy membership and derive improved algorithms that can extract the appropriate principal components from a spoiled data set. The difficulty of selecting an appropriate hard threshold in Xu and Yuille's approach is alleviated by replacing it with an automatically selected soft threshold in FRPCA. Artificially generated data sets are used to evaluate the performance of various PCA algorithms. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Principal component analysis; Robust algorithm; Noise clustering; Neural networks; Fuzzy theory

1. Introduction

Principal component analysis (PCA) is an essential technique for data reduction, image compression, and feature extraction. It has been widely used in many fields, including data communication, pattern recognition, and image processing. Since PCA algorithms have to process information from the real world, they should be able to cope with noise and outliers.

Robustness theory is concerned with solving problems subject to model perturbation or added noise. According to Huber (1981), a robust algorithm not only performs well under the assumed model, but also produces a satisfactory result when the data deviate from the assumed model. Moreover, it does not deteriorate drastically in the presence of noise or outliers. Much effort has been devoted to robust principal component analysis, especially in the statistics literature. Several strategies have been used to deal with outliers in PCA. One is to robustify the existing algorithms by applying a robust estimate of the covariance matrix. Several such estimates are reviewed in (Huber, 1981). Ruymgaart (1981) proposed another robust PCA based on robust estimates for dispersion in the univariate case along with a certain linearization of the bivariate structure. Critchley (1985) designed another robust PCA which produces diagnostic statistics based on the influence function. Most of these algorithms from the statistics field operate in batch mode.

In the neural network literature, Oja (1982) found that a simple linear neuron model with a constrained Hebbian learning rule could extract the principal components of a stationary data set. Thus, the self-organizing learning rule for computing the weights of the hidden nodes in a neural network can be associated with PCA techniques. Since then, many other neural-network-based PCA techniques have been proposed. Sanger (1989) extended Oja's method and designed an algorithm for extracting the first k principal components. Foldiak (1989) and Kung and Diamantaras (1990) developed similar algorithms based on anti-Hebbian learning rules. Unlike the traditional eigenvector analysis algorithms, these approaches do not require the computation of the input data covariance, whose cost may increase significantly with the dimensionality of the training data. Furthermore, it is not necessary to evaluate all the eigenvalues and eigenvectors if only the eigenvector corresponding to the most significant eigenvalue is required. To robustify the existing methods, Xu and Yuille (1995) first related the PCA learning rules to energy functions and proposed an objective function that takes outliers into account. Robust PCA algorithms were then derived using a statistical physics approach.

* Corresponding author. Tel.: 23635251; fax: +886-2-23671909; e-mail: [email protected]

This paper attempts to develop a family of robust PCA algorithms that avoid the difficulty of choosing a hard threshold in Xu and Yuille's approach. First, we define a fuzzy objective function which includes Xu and Yuille's as a crisp special case. Using gradient descent optimization, we then derive robust algorithms called FRPCA. Only one parameter, the fuzziness variable, needs to be preset; it affects the influence of outliers.

The remainder of this paper is organized as follows. In Section 2, we review Xu and Yuille's PCA and introduce our algorithm, called fuzzy robust principal component analysis (FRPCA). In Section 3, artificially generated data sets are used to illustrate the performance of various PCA algorithms. We demonstrate the difficulty of parameter setting in Xu and Yuille's PCA. The effects of various fuzziness values on FRPCA are also shown. Finally, Section 4 contains the summary and conclusion.

2. Robust principal component analysis algorithms

To derive robust PCA algorithms, Xu and Yuille (1995) proposed the objective function Eq. (1), subject to $u_i \in \{0, 1\}$:

$$E(U, w) = \sum_{i=1}^{n} u_i\, e(x_i) + g \sum_{i=1}^{n} (1 - u_i), \tag{1}$$

where $X = \{x_1, x_2, \ldots, x_n\}$ is the data set, $U = \{u_i \mid i = 1, \ldots, n\}$ is the membership set, and $g$ is the threshold. We now briefly review their method. The goal is to minimize Eq. (1) with respect to $u_i$ and $w$ simultaneously. Since $u_i$ is a binary variable and $w$ is a continuous variable, this is a mixed discrete and continuous optimization problem and is hard to solve with the gradient descent approach. To overcome the problem, they transformed the goal from the minimization of Eq. (1) to the maximization of the following Gibbs distribution:

$$P(U, w) = \frac{\exp(-c\,E(U, w))}{Z}, \tag{2}$$

where $Z$ is the partition function that ensures $\sum_U \int_w P(U, w) = 1$. Using the same procedure as for computing the mean field approximation to a statistical physics system by the saddle point method in (Parisi, 1988), they computed the marginal distribution $P_{\text{marginal}}(w)$ for approximating the maximization of $P(U, w)$. $P_{\text{marginal}}(w)$ is calculated by averaging over the variables in $\{u_i\}$. The measure $e(x_i)$ could be one of the following functions:

$$e_1(x_i) = \| x_i - (w^T x_i)\, w \|^2, \tag{3}$$

$$e_2(x_i) = \|x_i\|^2 - \frac{\|w^T x_i\|^2}{\|w\|^2} = x_i^T x_i - \frac{w^T x_i x_i^T w}{w^T w}. \tag{4}$$

The gradient descent rules for minimizing $E_1 = \sum_{i=1}^{n} e_1(x_i)$ and $E_2 = \sum_{i=1}^{n} e_2(x_i)$ are

$$w_{\text{new}} = w_{\text{old}} + a_t \left[ y (x_i - u) + (y - v) x_i \right], \tag{5}$$

$$w_{\text{new}} = w_{\text{old}} + a_t \left( x_i y - \frac{w}{w^T w}\, y^2 \right), \tag{6}$$

where $y = w^T x_i$, $u = y w$, $v = w^T u$, and $a_t$ is the learning rate. Under the following conditions

$$\lim_{t \to \infty} a_t = 0, \qquad \sum_t a_t = \infty, \qquad \sum_t a_t^k < \infty \ \text{ for some } k > 1, \tag{7}$$

the weight $w$ in the updating rules converges to the principal component vector almost surely (Oja, 1982; Oja and Karhunen, 1985).
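The relation between the error measures (3), (4) and the update rules (5), (6) can be checked numerically. The following Python sketch is illustrative only: the helper names (`dot`, `e1`, `e2`, `rule5_direction`, `rule6_direction`, `numerical_gradient`) and the sample vectors are assumptions made here and do not come from the original paper. It compares the bracketed terms of Eqs. (5) and (6) with central-difference gradients of the two error measures.

```python
def dot(a, b):
    """Inner product of two equally long sequences."""
    return sum(p * q for p, q in zip(a, b))

def e1(x, w):
    """Eq. (3): squared reconstruction error ||x - (w^T x) w||^2."""
    y = dot(w, x)
    return sum((xk - y * wk) ** 2 for xk, wk in zip(x, w))

def e2(x, w):
    """Eq. (4): x^T x - (w^T x)^2 / (w^T w)."""
    return dot(x, x) - dot(w, x) ** 2 / dot(w, w)

def rule5_direction(x, w):
    """Bracketed term of Eq. (5): y (x - u) + (y - v) x, with u = y w, v = w^T u."""
    y = dot(w, x)
    v = y * dot(w, w)
    return [y * (xk - y * wk) + (y - v) * xk for xk, wk in zip(x, w)]

def rule6_direction(x, w):
    """Bracketed term of Eq. (6): x y - w y^2 / (w^T w)."""
    y = dot(w, x)
    s = dot(w, w)
    return [xk * y - wk * y * y / s for xk, wk in zip(x, w)]

def numerical_gradient(f, x, w, h=1e-6):
    """Central-difference gradient of f(x, w) with respect to w."""
    grad = []
    for k in range(len(w)):
        up = list(w)
        dn = list(w)
        up[k] += h
        dn[k] -= h
        grad.append((f(x, up) - f(x, dn)) / (2 * h))
    return grad

x = [0.8, -0.3, 0.5]   # illustrative sample (hypothetical values)
w = [0.4, 0.7, -0.2]   # illustrative weight vector (hypothetical values)
s = dot(w, w)

from_e1 = [-0.5 * gk for gk in numerical_gradient(e1, x, w)]
from_e2 = [-0.5 * s * gk for gk in numerical_gradient(e2, x, w)]
print(from_e1, rule5_direction(x, w))
print(from_e2, rule6_direction(x, w))
```

Under these definitions the term in Eq. (5) equals $-\tfrac{1}{2} \nabla_w e_1$ and the term in Eq. (6) equals $-\tfrac{w^T w}{2} \nabla_w e_2$, so both rules move against the gradient; the positive scalar factors are absorbed into the learning rate.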

Setting $e = e_1$ or $e = e_2$, Xu and Yuille derived the following on-line algorithms.

Xu and Yuille's PCA1 algorithm.

Step 1. Initially set the iteration count $t = 1$, iteration bound $T$, learning coefficient $a_0 \in (0, 1]$, the initial weight $w$ and the threshold $g$.
Step 2. While $t$ is less than $T$, do steps 3–8.
Step 3. Compute $a_t = a_0 (1 - t/T)$ and set $i = 1$.
Step 4. While $i \le n$, do steps 5–7.
Step 5. Compute $y = w^T x_i$, $u = y w$ and $v = w^T u$.
Step 6. Update the weight:

$$w_{\text{new}} = w_{\text{old}} + a_t\, \frac{1}{1 + \exp(c\,(e_1(x_i) - g))} \left[ y (x_i - u) + (y - v) x_i \right]. \tag{8}$$

Step 7. Add 1 to $i$.
Step 8. Add 1 to $t$.

Xu and Yuille's PCA2 algorithm. The same as Xu and Yuille's PCA1 except step 6.

Step 6. Update the weight:

$$w_{\text{new}} = w_{\text{old}} + a_t\, \frac{1}{1 + \exp(c\,(e_2(x_i) - g))} \left( x_i y - \frac{w}{w^T w}\, y^2 \right). \tag{9}$$
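Steps 5 and 6 of PCA1 can be sketched for a single sample as follows. This is a minimal illustration rather than Xu and Yuille's implementation: the function names `dot` and `pca1_step`, the two sample points and the learning rate 0.5 are assumptions made here, while $c = 30$ and $g = 0.6$ are borrowed from the simulations in Section 3. The logistic factor of Eq. (8) is evaluated in a form that avoids overflow of the exponential.

```python
import math

def dot(a, b):
    """Inner product of two equally long sequences."""
    return sum(p * q for p, q in zip(a, b))

def pca1_step(w, x, a_t, c, g):
    """Apply steps 5-6 of PCA1 (Eq. (8)) to one sample x and return the new weight."""
    y = dot(w, x)                                     # step 5: y = w^T x
    u = [y * wk for wk in w]                          #         u = y w
    v = dot(w, u)                                     #         v = w^T u
    e = sum((xk - uk) ** 2 for xk, uk in zip(x, u))   # e_1(x) of Eq. (3)
    z = c * (e - g)
    # Factor 1 / (1 + exp(z)), written so that exp() never receives a large positive argument.
    if z >= 0:
        ez = math.exp(-z)
        factor = ez / (1.0 + ez)
    else:
        factor = 1.0 / (1.0 + math.exp(z))
    return [wk + a_t * factor * (y * (xk - uk) + (y - v) * xk)
            for wk, xk, uk in zip(w, x, u)]

w = [1.0, 0.0]
inlier = [0.9, 0.2]    # small error e_1, well below g (hypothetical point)
outlier = [0.3, 3.0]   # large error e_1, well above g (hypothetical point)

w_after_inlier = pca1_step(w, inlier, a_t=0.5, c=30.0, g=0.6)
w_after_outlier = pca1_step(w, outlier, a_t=0.5, c=30.0, g=0.6)
print(w_after_inlier, w_after_outlier)
```

With these values the inlier rotates the weight toward itself, while the outlier leaves it essentially unchanged because its factor is negligibly small.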

There is another weight updating rule, called one-unit Oja's algorithm:

$$w_{\text{new}} = w_{\text{old}} + a_t \left( x_i y - w y^2 \right). \tag{10}$$

Although one-unit Oja's algorithm is not a gradient rule of any objective function, as pointed out by Xu and Yuille (1991), Xu (1993) proved the following results:

1. Only one local (also global) minimum exists for $E_1$ and $E_2$, and all the other critical points are saddle points.

2. $E(x_i y - w y^2)^T\, E[\,y (x_i - u) + (y - v) x_i\,] \ge 0$, where $E$ represents the expectation operation.

3. $(x_i y - w y^2)^T \left( x_i y - \frac{w}{w^T w} y^2 \right) \ge 0$ and $E(x_i y - w y^2)^T\, E\left( x_i y - \frac{w}{w^T w} y^2 \right) \ge 0$.
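The on-line inequality in result 3 can be verified directly. The short derivation below is a sketch added for illustration and is not reproduced from Xu (1993); it uses only $y = w^T x_i$ and the definition of $e_2$ in Eq. (4), whose non-negativity follows from the Cauchy–Schwarz inequality.

```latex
\begin{aligned}
(x_i y - w y^2)^T \Bigl( x_i y - \tfrac{w}{w^T w}\, y^2 \Bigr)
  &= y^2\, x_i^T x_i \;-\; \frac{y^3\, x_i^T w}{w^T w} \;-\; y^3\, w^T x_i \;+\; \frac{y^4\, w^T w}{w^T w} \\
  &= y^2\, x_i^T x_i \;-\; \frac{y^4}{w^T w} \;-\; y^4 \;+\; y^4 \\
  &= y^2 \Bigl( x_i^T x_i - \frac{y^2}{w^T w} \Bigr)
   \;=\; y^2\, e_2(x_i) \;\ge\; 0 .
\end{aligned}
```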

So Eq. (10) minimizes $E_1$ in the average sense and minimizes $E_2$ in both the on-line sense and the average sense. Since there is only one minimum for $E_1$ and $E_2$, the three rules will finally produce the same solution, the principal component. Based on the above relationship, it is reasonable to propose the following algorithm, in which $e(x_i)$ could be set as $e_1(x_i)$ or $e_2(x_i)$.

Xu and Yuille's PCA3 algorithm. The same as Xu and Yuille's PCA1 except step 6.

Step 6. Update the weight:

$$w_{\text{new}} = w_{\text{old}} + a_t\, \frac{1}{1 + \exp(c\,(e(x_i) - g))} \left( x_i y - w y^2 \right). \tag{11}$$

After the training, the membership is decided by the following rule:

$$u_i = \begin{cases} 1 & \text{if } e(x_i) < g, \\ 0 & \text{otherwise.} \end{cases}$$

$c$ and $g$ are two parameters in this algorithm. Xu and Yuille suggest setting a small $c$ at first and then tracking the minimum of the objective function as $c$ increases to infinity. The hard threshold $g$ must be determined before the training process. We would like an algorithm that sets the threshold automatically.
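The role of the two parameters can be seen by evaluating the weighting factor of Eqs. (8), (9) and (11) for increasing $c$. The sketch below is illustrative: the function name `xu_yuille_weight`, the threshold value and the error values are assumptions chosen here. As $c$ grows, the factor approaches the hard membership rule above, which is why the choice of the threshold matters.

```python
import math

def xu_yuille_weight(e, g, c):
    """Weighting factor 1 / (1 + exp(c * (e - g))) from Eqs. (8), (9) and (11)."""
    z = c * (e - g)
    # Evaluate the logistic function in a form that cannot overflow.
    if z >= 0:
        ez = math.exp(-z)
        return ez / (1.0 + ez)
    return 1.0 / (1.0 + math.exp(z))

g = 1.0                         # illustrative threshold (hypothetical value)
errors = [0.2, 0.9, 1.1, 5.0]   # illustrative values of e(x_i)

for c in (0.5, 5.0, 50.0, 500.0):
    print(c, [round(xu_yuille_weight(e, g, c), 3) for e in errors])
```

For small $c$ all samples receive comparable weight; for large $c$ samples with $e(x_i) < g$ are kept and the others are discarded.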

We propose the objective function

$$RE = \sum_{i=1}^{n} u_i^m\, e(x_i) + g \sum_{i=1}^{n} (1 - u_i)^m, \tag{12}$$

subject to $u_i \in [0, 1]$ and $m \in [1, \infty)$. $u_i$ is the membership of $x_i$ belonging to the data cluster and $1 - u_i$ is the membership of $x_i$ belonging to the noise cluster. $m$ is the weighting exponent. $e(x_i)$ measures the error between $x_i$ and the class center. The concept is to add a noise cluster in which the data have a constant influence $g$. The idea comes from the noise clustering design of Dave (1991) and the fuzzy C-means algorithm of Bezdek (1981). Let us discuss this function from a clustering viewpoint. $u_i$ is the membership of $x_i$ in the data cluster, while $1 - u_i$ is the membership of $x_i$ in the noise cluster. [...] of small $u_i$ compared to large $u_i$. Following the fuzzy clustering approach, this is an appropriate formulation when only one data cluster exists. This function measures the weighted sum of distances between the data and the cluster center, which is zero (the origin) in the data set.

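Since $u_i$ is a continuous variable in our objective function (12), we do not encounter the difficulty caused by the mixture of discrete and continuous optimization. Let us derive our algorithm with the gradient descent approach. First, we compute the gradient of $RE$ with respect to $u_i$. By setting $\partial RE / \partial u_i = 0$, we get

$$u_i = \frac{1}{1 + (e(x_i)/g)^{1/(m-1)}}. \tag{13}$$

Substituting this membership back and after simplification, we get

$$RE = \sum_{i=1}^{n} \left( \frac{1}{1 + (e(x_i)/g)^{1/(m-1)}} \right)^{m-1} e(x_i). \tag{14}$$

Following the multidimensional chain rule, the gradient of $RE$ with respect to $w$ is

$$\frac{\partial RE}{\partial w} = \frac{\partial RE}{\partial e(x_i)}\, \frac{\partial e(x_i)}{\partial w} = \left( \frac{1}{1 + (e(x_i)/g)^{1/(m-1)}} \right)^{m} \frac{\partial e(x_i)}{\partial w}. \tag{15}$$

Let $b(x_i)$ denote

$$b(x_i) = \left( \frac{1}{1 + (e(x_i)/g)^{1/(m-1)}} \right)^{m}.$$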
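The algebra between Eqs. (12) and (15) can be written out as follows. This is a sketch of the omitted steps under the assumption $m > 1$; the shorthand $r_i = (e(x_i)/g)^{1/(m-1)}$ is introduced here for convenience and does not appear in the original.

```latex
\begin{aligned}
\frac{\partial RE}{\partial u_i}
  &= m\, u_i^{m-1} e(x_i) - m\, g\, (1 - u_i)^{m-1} = 0
  \;\Longrightarrow\;
  \frac{1 - u_i}{u_i} = \Bigl( \frac{e(x_i)}{g} \Bigr)^{1/(m-1)} = r_i
  \;\Longrightarrow\;
  u_i = \frac{1}{1 + r_i}, \\[4pt]
u_i^m e(x_i) + g\, (1 - u_i)^m
  &= \frac{e(x_i) + g\, r_i^{m}}{(1 + r_i)^m}
   = \frac{e(x_i)\,(1 + r_i)}{(1 + r_i)^m}
   = \Bigl( \frac{1}{1 + r_i} \Bigr)^{m-1} e(x_i)
   \qquad \bigl(\text{using } g\, r_i^{m-1} = e(x_i)\bigr), \\[4pt]
\frac{\partial}{\partial e(x_i)} \Bigl[ (1 + r_i)^{-(m-1)}\, e(x_i) \Bigr]
  &= (1 + r_i)^{-(m-1)} - (1 + r_i)^{-m}\, r_i
   = (1 + r_i)^{-m} = u_i^{m} .
\end{aligned}
```

The last line uses $\partial r_i / \partial e(x_i) = r_i / ((m-1)\, e(x_i))$ and shows that the factor multiplying $\partial e(x_i)/\partial w$ is the $m$-th power of the membership of Eq. (13).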

$m$ is called a fuzziness variable in the literature of fuzzy clustering. If $m = 1$ (taken as the limit $m \to 1^{+}$ in Eq. (13)), the fuzzy membership reduces to the hard membership and is determined by the following rule:

$$u_i = \begin{cases} 1 & \text{if } e(x_i) < g, \\ 0 & \text{otherwise.} \end{cases}$$

$g$ plays the role of a hard threshold in this situation.

If $m \to \infty$, then the maximum fuzziness is achieved:

$$u_i = \tfrac{1}{2} \quad \text{for all } x_i. \tag{16}$$
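The two limiting cases can be illustrated by evaluating the membership of Eq. (13) for several fuzziness values. The snippet is a sketch: the function name `fuzzy_membership`, the threshold and the three error levels are assumptions chosen here for illustration.

```python
def fuzzy_membership(e, g, m):
    """Membership of Eq. (13): u = 1 / (1 + (e / g) ** (1 / (m - 1))), for m > 1."""
    if m <= 1.0:
        raise ValueError("Eq. (13) is evaluated for m > 1; m = 1 is the hard limit")
    return 1.0 / (1.0 + (e / g) ** (1.0 / (m - 1.0)))

g = 1.0  # illustrative soft threshold (hypothetical value)
for m in (1.05, 1.5, 2.0, 5.5, 1000.0):
    inlier = fuzzy_membership(0.5 * g, g, m)    # error below the threshold
    border = fuzzy_membership(g, g, m)          # error equal to the threshold
    outlier = fuzzy_membership(4.0 * g, g, m)   # error above the threshold
    print(f"m={m:8.2f}  inlier={inlier:.4f}  border={border:.4f}  outlier={outlier:.4f}")
```

For $m$ close to 1 the memberships are nearly 1 and 0 on either side of the threshold, for very large $m$ they all approach 1/2, and a sample whose error equals the threshold has membership 1/2 for every $m$.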

We show the membership for some other values of $m$ in Fig. 1. An interesting observation is that $g$ is no longer a hard threshold but a soft threshold that determines where the membership becomes 0.5. Since 0.5 is the average value in the membership domain $[0, 1]$, a reasonable choice for $g$ is the average distance, $\sum_{i=1}^{n} e(x_i)/n$. There is no general rule for setting $m$; most papers set $m = 2$ since it leads to a simpler modification rule. Replacing $e(x_i)$ with $e_1(x_i)$ or $e_2(x_i)$, the FRPCA1 and FRPCA2 algorithms are derived.

FRPCA1 algorithm.

Step 1. Initially set the iteration count $t = 1$, iteration bound $T$, learning coefficient $a_0 \in (0, 1]$, soft threshold $g$ to a small positive value and randomly initialize the weight $w$.
Step 2. While $t$ is less than $T$, do steps 3–9.
Step 3. Compute $a_t = a_0 (1 - t/T)$, set $i = 1$ and $r = 0$.
Step 4. While $i \le n$, do steps 5–8.
Step 5. Compute $y = w^T x_i$, $u = y w$ and $v = w^T u$.
Step 6. Update the weight:

$$w_{\text{new}} = w_{\text{old}} + a_t\, b(x_i) \left[ y (x_i - u) + (y - v) x_i \right]. \tag{17}$$

Step 7. Update the temporary count: $r = r + e_1(x_i)$.
Step 8. Add 1 to $i$.
Step 9. Compute $g = r/n$ and add 1 to $t$.

FRPCA2 algorithm. The same as FRPCA1 except steps 6–7.

Step 6. Update the weight:

$$w_{\text{new}} = w_{\text{old}} + a_t\, b(x_i) \left( x_i y - \frac{w}{w^T w}\, y^2 \right). \tag{18}$$

Step 7. Update the temporary count: $r = r + e_2(x_i)$.

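For the same reason as for Xu and Yuille's PCA3, we propose FRPCA3 as follows.

FRPCA3 algorithm. The same as FRPCA1 except steps 6–7.

Step 6. Update the weight:

$$w_{\text{new}} = w_{\text{old}} + a_t\, b(x_i) \left( x_i y - w y^2 \right). \tag{19}$$

Step 7. Update the temporary count: $r = r + e(x_i)$.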
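The three FRPCA variants can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `robust` switch (which sets $b(x_i) = 1$ to recover the plain rules (5), (6) and (10)), the synthetic two-dimensional data, the fixed initial weight and the smaller learning coefficient $a_0 = 0.05$ (chosen to keep this toy example numerically tame) are all choices made here. FRPCA3 uses $e_1$ as its error measure in this sketch, although $e_2$ would also be admissible.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def e1(x, w):
    """Eq. (3): ||x - (w^T x) w||^2."""
    y = dot(w, x)
    return sum((xk - y * wk) ** 2 for xk, wk in zip(x, w))

def e2(x, w):
    """Eq. (4): x^T x - (w^T x)^2 / (w^T w)."""
    return dot(x, x) - dot(w, x) ** 2 / dot(w, w)

def frpca(data, w0, variant=1, m=2.0, a0=0.05, T=40, g0=1e-6, robust=True):
    """One-unit FRPCA1/2/3; robust=False applies the plain rules (5), (6), (10)."""
    w, g, n = list(w0), g0, len(data)
    err = e2 if variant == 2 else e1           # FRPCA3: e1 is an arbitrary choice here
    for t in range(1, T):                      # step 2: while t < T
        a_t = a0 * (1.0 - t / T)               # step 3
        r = 0.0
        for x in data:                         # steps 4-8
            e = err(x, w)
            # b(x_i) = (1 / (1 + (e/g)^(1/(m-1))))^m, the factor defined after Eq. (15)
            b = (1.0 / (1.0 + (e / g) ** (1.0 / (m - 1.0)))) ** m if robust else 1.0
            y = dot(w, x)
            if variant == 1:                   # Eq. (17)
                v = y * dot(w, w)
                step = [y * (xk - y * wk) + (y - v) * xk for xk, wk in zip(x, w)]
            elif variant == 2:                 # Eq. (18)
                s = dot(w, w)
                step = [xk * y - wk * y * y / s for xk, wk in zip(x, w)]
            else:                              # Eq. (19)
                step = [xk * y - wk * y * y for xk, wk in zip(x, w)]
            w = [wk + a_t * b * sk for wk, sk in zip(w, step)]
            r += e                             # step 7
        g = r / n                              # step 9: soft threshold = average error
    return w

def alignment(w, axis):
    """Absolute cosine of the angle between w and a unit-length axis."""
    return abs(dot(w, axis)) / math.sqrt(dot(w, w))

# Synthetic data (hypothetical): 21 inliers along d with small perpendicular noise,
# plus 4 outliers far away along the perpendicular direction p.
d = (2 / math.sqrt(5), 1 / math.sqrt(5))
p = (-1 / math.sqrt(5), 2 / math.sqrt(5))
inliers = []
for k in range(-10, 11):
    noise = 0.05 if k % 2 == 0 else -0.05
    inliers.append((k / 10 * d[0] + noise * p[0], k / 10 * d[1] + noise * p[1]))
outliers = [(2 * s * p[0], 2 * s * p[1]) for s in (1, -1, 1, -1)]
data = inliers + outliers

for variant in (1, 2, 3):
    w = frpca(data, (1.0, 0.0), variant=variant)
    print(f"FRPCA{variant}: |cos| with inlier axis = {alignment(w, d):.4f}")
w_plain = frpca(data, (1.0, 0.0), variant=3, robust=False)
print(f"plain rule (10): |cos| with inlier axis = {alignment(w_plain, d):.4f}")
```

In this toy configuration the outliers carry more total energy than the inliers, so the unweighted rule is drawn toward the outlier direction, whereas the weighted rules are expected to stay close to the inlier axis. The outcome depends on the initial weight, since a weight started near the outlier direction would treat the inliers as noise.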

Both Xu and Yuille's PCA and FRPCA belong to the group of algorithms called M-estimators. The theoretical maximum breakdown point for M-estimators can be found in (Huber, 1981; Hampel et al., 1986). This limit, which is a function of the input dimension, is higher than the limit of the traditional approach.

In some applications, it is necessary to compute the first k principal components. The algorithms for the first k principal components in (Xu and Yuille, 1995) can be modified in a similar way.

3. Simulations

In the first part of this section, we present results from comparative experiments on the unrobust PCA and FRPCA. The unrobust PCA algorithms using weight updating rules (5), (6) and (10) are called PCA1, PCA2 and PCA3, respectively. Fig. 2 shows a set of two-dimensional training data with 100 elements and zero mean. There are 5 outliers. We set $T = 40$ and $a_0 = 1$. That is, the final learning rate is 0.025 and each input datum is processed 40 times. Fig. 2 shows that the results of PCA1, PCA2 and PCA3 are significantly affected by these outliers. The artificially generated data set is also used to train FRPCA1, FRPCA2 and FRPCA3. With the same setting as the former simulation and $m = 2$, the result shown in Fig. 3 indicates that the FRPCA-type algorithms are robust to these outliers. The weight is initialized with random values and is almost unchanged in the first iteration, since a very small value, $10^{-6}$, is assigned to the initial value of the soft threshold, $g$. In FRPCA3, $e(x_i)$ is replaced by $e_1(x_i)$ or $e_2(x_i)$ separately, so there are four overlapped lines in Fig. 3. Since the learning rate is changed from $a_t$ to $a_t b(x_i)$, we find that FRPCA requires fewer iterations than PCA. Fig. 4 shows the results of FRPCA when $T$ is reduced to 5 and the number of outliers is increased to 10.
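The final learning rate quoted above follows from the schedule in Step 3 of the algorithms. The few lines below are a sketch with a hypothetical helper name, `learning_rate`, and they read Step 2 literally, so that $t$ runs from 1 to $T - 1$.

```python
def learning_rate(t, T, a0):
    """Step 3 of the algorithms: a_t = a_0 * (1 - t / T)."""
    return a0 * (1.0 - t / T)

T, a0 = 40, 1.0
rates = [learning_rate(t, T, a0) for t in range(1, T)]   # t = 1, ..., T - 1
print(rates[0], rates[-1])
```

The last entry, for $t = 39$, is $1 - 39/40 = 0.025$ up to floating-point rounding, and the schedule decreases linearly from $0.975$ at $t = 1$.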

Fig. 2. Testing results of PCA1, PCA2 and PCA3 on the spoiled data set.

Fig. 3. Testing results of FRPCA1, FRPCA2 and FRPCA3 on the spoiled data set.

To show the experimental differences between Xu and Yuille's PCA and FRPCA, we use the same data set and a transformed data set in which the x and y coordinates of the data points are scaled down by half, as shown in Fig. 6. In the following experiments, we set $T = 40$ and $a_0 = 1$. Note that there are four overlapped extracted principal axes in each illustrated line. Sorted by the distance between the origin and the y-intercept of the principal axis, the parameter settings are $\{c = 30, g = 0.6\}$ and $\{c = 0.5, g = 4\}$ in Figs. 5 and 6. Since the parameters used in (Xu and Yuille, 1995) are $c = 0.5$ and $g = 4$, we start from this setting and get an unrobust result. After experiments with various parameter settings, Xu and Yuille's PCA produces a robust result when $c = 30$ and $g = 0.6$, as shown in Fig. 5. Unfortunately, these two parameters need to be reset even when the data set is merely scaled down. As shown in Fig. 6, Xu and Yuille's PCA produces an unrobust result when $c = 30$ and $g = 0.6$ on the scaled-down data set. Setting $c$ and $g$ properly could be even more difficult when computing the first k principal components rather than only the first.
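The scale sensitivity described above can be illustrated with the two weighting factors alone. The sketch below makes several simplifying assumptions that are not in the paper: the weight vector is held fixed, the error values are invented for illustration, and the function names are hypothetical. Halving the coordinates divides every squared error by four; a fixed threshold then no longer separates the outlier, whereas the FRPCA factor with the threshold set to the average error (Step 9) depends only on the ratio of each error to that average.

```python
import math

def xu_yuille_weight(e, g, c):
    """Factor 1 / (1 + exp(c * (e - g))) from Eqs. (8), (9) and (11)."""
    z = c * (e - g)
    if z >= 0:
        ez = math.exp(-z)
        return ez / (1.0 + ez)
    return 1.0 / (1.0 + math.exp(z))

def frpca_weight(e, g, m=2.0):
    """b(x_i) = (1 / (1 + (e / g) ** (1 / (m - 1)))) ** m."""
    return (1.0 / (1.0 + (e / g) ** (1.0 / (m - 1.0)))) ** m

errors = [0.05, 0.1, 0.2, 0.4, 2.0]    # illustrative e(x_i); the last plays the outlier
scaled = [0.25 * e for e in errors]    # coordinates halved, so squared errors are quartered

c, g = 30.0, 0.6                       # fixed parameters, as in the experiment above
xy_before = [xu_yuille_weight(e, g, c) for e in errors]
xy_after = [xu_yuille_weight(e, g, c) for e in scaled]

g_before = sum(errors) / len(errors)   # soft threshold = average error
g_after = sum(scaled) / len(scaled)
fr_before = [frpca_weight(e, g_before) for e in errors]
fr_after = [frpca_weight(e, g_after) for e in scaled]

print("Xu-Yuille outlier weight:", xy_before[-1], "->", xy_after[-1])
print("FRPCA outlier weight:    ", fr_before[-1], "->", fr_after[-1])
```

With these numbers the fixed-threshold factor for the outlier changes from nearly 0 to nearly 1 after scaling, while the FRPCA factors are unchanged.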

Although FRPCA's objective function, Eq. (12), reduces to Xu and Yuille's objective function, Eq. (1), when $m = 1$ and $u_i$ belongs to $\{0, 1\}$, Xu and Yuille's PCA is not a special case of FRPCA because different optimization approaches are used. Any very small value could be used to initialize the soft threshold, $g$. In the following, we examine the influence of various $m$ values on the performance of FRPCA. Before doing the experiments, we may predict that FRPCA will behave more like an unrobust PCA as $m$ increases, because increasing $m$ raises the membership of the outliers, as indicated in Fig. 1. Sorted by the distance between the origin and the y-intercept of the principal axis, the fuzziness variable $m$ is set to 1.5, 2.5, 3.5, 4.5 and 5.5 in Fig. 7. The results, which may be regarded as interpolations between the results of noise-filtering PCA and unrobust PCA, agree with our prediction. Using the same $m$, FRPCA produces similar results on the scaled-down data set, as shown in Fig. 8.

Fig. 4. Testing results of FRPCA1, FRPCA2 and FRPCA3 on another spoiled data set.

Fig. 5. Testing results of Xu and Yuille's PCA on the spoiled data set.

Fig. 6. Testing results of Xu and Yuille's PCA on the scale-down data set.

4. Conclusions

Stemming from the work of Xu and Yuille and the concept of noise clusters, we derive a family of robust principal component extraction algorithms from a fuzzy objective function. The main characteristics of the proposed algorithm are as follows:

· In comparison with the traditional robust PCA, the proposed FRPCA is more robust when outliers exist.

· FRPCA uses a soft threshold that is automatically determined in the algorithm.

· As demonstrated by the simulations, the initial value of the soft threshold can easily be set to any very small value.

There exist other forms of FRPCA-like algorithms. One simple modification is to change the learning law to batch mode or to use a momentum updating law. These alterations may be better than the original algorithm if the input presentation order is biased.

References

Bezdek, J.C., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.

Critchley, F., 1985. Influence in principal component analysis. Biometrika 72 (3), 627–636.

Dave, R.N., 1991. Characterization and detection of noise in clustering. Pattern Recognition Letters 12 (11), 657–664.

Foldiak, P., 1989. Adaptive network for optimal linear feature extraction. In: Internat. Joint Conf. Neural Networks, Washington, DC, pp. I401–I406.

Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A., 1986. Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.

Huber, P.J., 1981. Robust Statistics. Wiley, New York.

Kung, S.Y., Diamantaras, K.I., 1990. A neural network learning algorithm for adaptive principal extraction. In: Proc. ICASSP, Albuquerque, pp. 861–864.

Oja, E., 1982. A simplified neuron model as a principal component analyzer. J. Math. Biol., 267–273.

Oja, E., Karhunen, J., 1985. On stochastic approximation of eigenvectors and eigenvalues of the expectation of a random matrix. J. Math. Anal. Appl., 69–84.

Parisi, G., 1988. Statistical Field Theory. Addison-Wesley, Reading, MA.

Ruymgaart, F.H., 1981. A robust principal analysis. J. Multivariate Anal. 11, 485–497.

Sanger, T.D., 1989. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 459–473.

Xu, L., 1993. Least mean square error reconstruction for self-organization neural nets. Neural Networks 6, 627–648.

Xu, L., Yuille, A.L., 1991. Back-propagation and unsupervised learning in linear networks. In: Chauvin, Y., Rumelhart, D.E. (Eds.), Back-Propagation Theory, Architecture, and Application. Erlbaum, Hillsdale, NJ.

Xu, L., Yuille, A.L., 1995. Robust principal component analysis by self-organizing rules based on statistical physics approach. IEEE Trans. Neural Networks 6 (1), 131–143.

Fig. 7. Testing results of FRPCA on the spoiled data set with different m values.

Fig. 8. Testing results of FRPCA on the scale-down data set with different m values.