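Applying Nonlinear Dimension Reduction Techniques to Multi-class Classification Problems and Semi-supervised Learning

National Science Council, Executive Yuan: Research Project Final Report

Applying Nonlinear Dimension Reduction Techniques to Multi-class Classification Problems and Semi-supervised Learning: Research Results Report (Condensed Version)

Project type: Individual

Project number: NSC 97-2221-E-011-130-

Execution period: August 1, 2008 to July 31, 2009 (ROC years 97 to 98)

Executing institution: Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology

Principal investigator: Yuh-Jye Lee (李育杰)

Project participants (master's students serving as part-time assistants): 李政益, 陳柏宇, 陳政翰, 曾義澄

Report attachments: report on attending an international conference, and the published paper

Handling: this project involves patents or other intellectual property rights; the report may be publicly accessed after 1 year

December 31, 2009 (ROC year 98)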

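National Science Council, Executive Yuan, Subsidized Research Project: ■ Final report □ Interim progress report

Applying Nonlinear Dimension Reduction Techniques to Multi-class Classification Problems and Semi-supervised Learning

Project type: ■ Individual project □ Integrated project

Project number: NSC 97-2221-E-011-130

Execution period: August 1, 2008 to July 31, 2009

Principal investigator: Professor Yuh-Jye Lee (李育杰)

Co-principal investigator:

Project participants: 李政益, 陳柏宇, 陳政翰, 曾義澄

Report type (submitted according to the approved budget list): ■ Condensed report □ Full report

This report includes the following required attachments:

□ One report on an overseas business trip or study visit

□ One report on a business trip or study visit to mainland China

■ One report on attending an international academic conference, and one copy of the published paper

□ One overseas research report for an international cooperative research project

Handling: except for industry-academia cooperative research projects, projects for upgrading industrial technology and cultivating talent, controlled projects, and the cases below, the report may be made publicly accessible immediately

□ Involves patents or other intellectual property rights; publicly accessible after □ one year ■ two years

Executing institution: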

Abstract

Sliced inverse regression (SIR) is a renowned dimension reduction method for finding an effective low-dimensional linear subspace. Like many other linear methods, SIR can be extended to a nonlinear setting via the “kernel trick”. The purpose of this project is two-fold. First, we build kernel SIR rigorously in a reproducing kernel Hilbert space, which gives a more intuitive model explanation and supports theoretical development. Second, we focus on an implementation algorithm for kernel SIR that is fast and numerically stable. We adopt a low rank approximation to the huge and dense full kernel covariance matrix and a reduced singular value decomposition technique for extracting kernel SIR directions. We also explore kernel SIR’s ability to combine with other linear learning algorithms for multi-class classification and regression. Numerical experiments show that kernel SIR is an effective kernel tool for nonlinear dimension reduction and that it combines easily with other linear algorithms to form a powerful toolkit for nonlinear data analysis.

1 Introduction

Dimension reduction is an important topic in machine learning and data mining. The main demand comes from complex data analysis, data visualization and parsimonious modeling [1, 2]. The most popular dimension reduction method is probably principal component analysis (PCA), which is an unsupervised method. In contrast, sliced inverse regression (SIR) [3] extracts the dimension reduction subspace, called the effective dimension reduction (e.d.r.) subspace, based on the covariance matrix of input attributes inversely regressed on the responses. SIR can be viewed as a supervised companion of PCA for linear dimension reduction. SIR has earned a reputation for performing well in dimension reduction and related applications and has received considerable attention in the statistical literature [2, 3, 4, 5, 6, 7]. In this project, we reformulate kernel sliced inverse regression (KSIR) in a reproducing kernel Hilbert space setting and emphasize KSIR’s implementation techniques, its ability to combine with other linear algorithms and its applications to classification as well as regression, including multiresponse regression. We adopt a low rank approximation to the huge and dense kernel covariance matrix and also introduce a reduced singular value decomposition technique for estimating kernel e.d.r. directions in the KSIR implementation. These reduction techniques speed up the computation and increase the numerical stability. In a nutshell, our learning framework for a complicated data set is as follows. We extract a kernel e.d.r. subspace, also named the feature e.d.r. subspace, in which the nonlinear structure of the data has been embedded. Then we can apply linear learning algorithms to the images of the original complicated data in this low-dimensional feature e.d.r. subspace. Under this learning framework we are able to generate nonlinear models for learning tasks such as regression and multi-class classification in an efficient way, compared with the popular SVM tool LIBSVM [9].

2 Sliced Inverse Regression

Different from other dimension reduction methods, SIR summarizes a regression or classification model. Here we use bold-faced x and y for random vectors and variables and italic letters x and y for their realizations. Suppose we have a data set

$$D := \{(x^1, y_1), \ldots, (x^n, y_n)\},$$

where each pair $(x^i, y_i)$ consists of an instance $x^i \in \mathbb{R}^p$ and its response or class label $y_i$. Let $A \in \mathbb{R}^{n \times p}$ be the data matrix of input attributes and $Y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$ be the corresponding responses. Each row of $A$ represents an observation $x^i$. The empirical data version of sliced inverse regression finds the dimension reduction directions by solving the following generalized eigenvalue problem based on the empirical data $D$:

$$\Sigma_{E(A|Y_J)}\,\beta = \lambda\,\Sigma_A\,\beta, \qquad (1)$$

where $\Sigma_A$ is the sample covariance matrix of $A$, $Y_J$ denotes the membership in slices (there are $J$ slices), and $\Sigma_{E(A|Y_J)}$ denotes the between-slice sample covariance matrix based on sliced means, given by

$$\Sigma_{E(A|Y_J)} = \frac{1}{n}\sum_{j=1}^{J} n_j\,(\bar{x}^j - \bar{x})(\bar{x}^j - \bar{x})^\top.$$

Here $\bar{x}$ is the sample grand mean, $\bar{x}^j = \frac{1}{n_j}\sum_{i \in S_j} x^i$ is the sample mean for the $j$th slice and $S_j$ is the index set for the $j$th slice. Note that the slices are extracted from $A$ according to the sorted responses $Y$. For classification, $\bar{x}^j$ is simply the sample mean of input attributes for the $j$th class.
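The sample version of problem (1) can be sketched in a few lines of NumPy. This is only a minimal illustration, not the authors' implementation: the function name `sir_directions` and the argument `slice_ids` are hypothetical, the Cholesky whitening is just one convenient way to solve the generalized eigenvalue problem, and the sketch assumes that $\Sigma_A$ is nonsingular.

```python
import numpy as np

def sir_directions(A, slice_ids, n_dirs):
    """Solve Sigma_E beta = lambda Sigma_A beta for the leading n_dirs directions.

    A is the n x p data matrix; slice_ids[i] is the slice (or class) of row i.
    """
    n, p = A.shape
    grand_mean = A.mean(axis=0)
    centered = A - grand_mean
    sigma_A = centered.T @ centered / n            # sample covariance of A

    # Between-slice covariance: (1/n) * sum_j n_j (xbar_j - xbar)(xbar_j - xbar)^T
    sigma_E = np.zeros((p, p))
    for j in np.unique(slice_ids):
        members = A[slice_ids == j]
        diff = members.mean(axis=0) - grand_mean
        sigma_E += len(members) * np.outer(diff, diff) / n

    # Assumption: sigma_A is nonsingular, so sigma_A = L L^T exists.
    # Substituting beta = L^{-T} z gives the symmetric problem
    # (L^{-1} sigma_E L^{-T}) z = lambda z.
    L_inv = np.linalg.inv(np.linalg.cholesky(sigma_A))
    evals, Z = np.linalg.eigh(L_inv @ sigma_E @ L_inv.T)
    order = np.argsort(evals)[::-1][:n_dirs]       # largest eigenvalues first
    return evals[order], L_inv.T @ Z[:, order]
```

For a classification problem, `slice_ids` would simply be the class labels; for regression, it would hold the slice index of each sorted response. Because the between-slice covariance has rank at most $J - 1$, at most $J - 1$ of the returned eigenvalues are nonzero.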

3 Kernel Extension of SIR and a Fast Implementation

3.1 Kernel Extension of SIR

The classical SIR is designed to find a linear transformation from the input space to a low-dimensional e.d.r. subspace that keeps as much information as possible about the output variable $y$. However, it does not work for nonlinear feature extraction, and it fails to find linear directions that lie in the null space of $\Sigma_{E(x|y)}$ or have small angles to it. We therefore consider an alternative feature map $\Gamma : \mathcal{X} \to \mathcal{H}_K$ given by

$$x \mapsto \Gamma(x) := K(x, \cdot). \qquad (2)$$

Kernels used here are positive definite, also known as reproducing kernels. The goal of KSIR is to estimate the feature e.d.r. directions $h_k$. Similar to the classical SIR, KSIR finds feature e.d.r. directions by solving a generalized spectrum decomposition problem.

Note that the feature data $K(x^i, \cdot)$ and the feature e.d.r. directions $h_k$ are in a high- or infinite-dimensional space $\mathcal{H}_K$. To store and compute the feature data in a computer, a finite basis set $\{K(\cdot, u_i) : u_i \in I\}$ is used to represent (and also to approximate) the feature data, where $I \subset \mathcal{X}$ is a finite index set. We often take the following basis set:

$$\{K(\cdot, x^i)\}_{i=1}^{n}, \text{ denoted by } \{K(\cdot, A)\}. \qquad (3)$$

Then the feature data become $K = [K(x^i, x^j)]_{i,j=1}^{n}$, the usual kernel matrix. This formulates KSIR as a generalized spectrum decomposition of the between-slice covariance operator in an RKHS framework. In other words, we solve the following generalized eigenvalue problem:

$$\Sigma_{E(T|y)}\,\alpha = \lambda\,\Sigma_T\,\alpha. \qquad (4)$$

After extracting the feature e.d.r. directions, we will map the feature data onto the feature e.d.r. subspace spanned by these directions. The nonlinear structure of data will be embedded in this feature e.d.r. subspace. Thus, a linear learning algorithm can be applied on this feature e.d.r. subspace directly.
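To make the basis set in (3) concrete, the following sketch builds the kernel matrix for the full basis $\{K(\cdot, A)\}$ and for a smaller basis. It assumes a Gaussian kernel with a hypothetical width parameter `gamma`; this section of the report does not fix a particular kernel, so both the kernel choice and the function name `gaussian_kernel` are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, U, gamma):
    """Return the matrix with entries exp(-gamma * ||X[i] - U[j]||^2).

    Rows of X are the instances; rows of U are the basis points u_i.
    """
    sq_dist = (
        (X ** 2).sum(axis=1)[:, None]
        + (U ** 2).sum(axis=1)[None, :]
        - 2.0 * X @ U.T
    )
    # Clip tiny negative values caused by rounding before exponentiating.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))
```

Calling `gaussian_kernel(A, A, gamma)` gives the usual $n \times n$ kernel matrix $K$, while `gaussian_kernel(A, A[:m], gamma)` uses only the first `m` instances as basis points and returns the corresponding column subset of $K$.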


3.2 A Fast Implementation of KSIR

Since the rank of the between-slice covariance of kernel data, $\Sigma_{E(T|y)}$ in (4), is $J - 1$, we do not need to solve the whole eigenvalue decomposition problem for this $n \times n$ matrix. Instead, we use the reduced singular value decomposition (SVD) technique to solve for the leading $J - 1$ components to save computing time. Let $\pi_j$ be the proportional size (or prior probability) of the $j$th slice. Consider the eigenvalue decomposition of $W^\top \Sigma_T^{-1} W$ as $U D U^\top$, where $W$ consists of centered and weighted slice means. Precisely, the $j$th column of $W$ is given by

$$w_j = \sqrt{\pi_j}\,\bigl(E(T \mid y_i, i \in S_j) - E(T)\bigr).$$

Since the eigenvectors associated with the zero eigenvalue carry no information about $y$, we may restrict $U$ and $D$ to eigenvectors with non-zero eigenvalues. For simplicity, assume there is only one zero eigenvalue. Then $U$ has size $J \times (J - 1)$ and its columns are formed by eigenvectors, and $D$ is a $(J - 1) \times (J - 1)$ diagonal matrix with descending non-zero eigenvalues. In the following proposition, we give the orthonormal basis set that forms the feature e.d.r. directions.

Proposition 1 (feature e.d.r. directions) The orthonormalized feature e.d.r. directions are given by the columns of $\Sigma_T^{-1} W U D^{-1/2}$, where $U$ and $D$ are from the eigenvalue decomposition $W^\top \Sigma_T^{-1} W = U D U^\top$.

Proposition 1 reduces the computing complexity of the eigenvalue decomposition by using the reduced singular value decomposition to solve for only the leading $J - 1$ components, where $J \ll n$. It converts the larger eigenvalue problem for $\Sigma_T^{-1} \Sigma_{E(T|y)} \in \mathbb{R}^{n \times n}$ into a smaller one, $W^\top \Sigma_T^{-1} W \in \mathbb{R}^{J \times J}$. Note that if only a few leading directions are needed, we can use only the leading $k$ ($< J - 1$) columns of $U$ and the corresponding leading diagonal elements of $D$ for further reduction.
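A short verification of why the columns in Proposition 1 solve (4) may be helpful. The lines below are a sketch under two assumptions that are consistent with the definitions above but are not spelled out as a proof in the report: $W W^\top = \Sigma_{E(T|y)}$ (the columns of $W$ are the centered, weighted slice means), and "orthonormalized" is read as orthonormal in the $\Sigma_T$ inner product. Here $u_k$ denotes the $k$th column of $U$ and $d_k > 0$ the $k$th diagonal entry of $D$.

```latex
% Assumptions: W W^T = Sigma_{E(T|y)}, Sigma_T symmetric and invertible,
% (W^T Sigma_T^{-1} W) u_k = d_k u_k with u_k^T u_l = delta_{kl} and d_k > 0.
\begin{align*}
v_k &:= \Sigma_T^{-1} W u_k\, d_k^{-1/2}, \\
\Sigma_{E(T|y)}\, v_k
  &= W \bigl(W^\top \Sigma_T^{-1} W u_k\bigr) d_k^{-1/2}
   = d_k\, W u_k\, d_k^{-1/2}
   = d_k\, \Sigma_T v_k, \\
v_k^\top \Sigma_T v_l
  &= d_k^{-1/2}\, u_k^\top \bigl(W^\top \Sigma_T^{-1} W\bigr) u_l\, d_l^{-1/2}
   = d_k^{-1/2}\, d_l\, \delta_{kl}\, d_l^{-1/2}
   = \delta_{kl}.
\end{align*}
```

The second line shows that each $v_k$ is a generalized eigenvector of (4) with eigenvalue $\lambda = d_k$, and the third line shows that the $v_k$ are orthonormal with respect to $\Sigma_T$. Only a $J \times J$ eigenvalue problem and one linear solve with $\Sigma_T$ are needed.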

Proposition 1 is a population version. In practical data analysis, sample estimates based on the given data $D$ are used to replace all the population-based quantities. The sample covariance matrices $\Sigma_{E(K|Y_J)}$ and $\Sigma_K$ are used to replace the population covariance matrices $\Sigma_{E(T|y)}$ and $\Sigma_T$, respectively. Also, the following sample estimates for the centered weighted slice means are used to replace their population versions:

$$w_j = \sqrt{n_j/n}\,\left(\frac{1_{n_j}^\top K(A_{S_j}, A)}{n_j} - \frac{1_n^\top K(A, A)}{n}\right)^{\!\top},$$

where $1_{n_j}^\top K(A_{S_j}, A)/n_j$ and $1_n^\top K(A, A)/n$ are respectively the $j$th slice sample mean and the grand mean of $K(A, A)$. Note that then $W W^\top = \Sigma_{E(K|Y_J)}$ is the between-slice sample covariance. The sample covariance matrix $\Sigma_K$ is singular and often has a much lower effective rank than its size. This low-effective-rank phenomenon causes numerical instability and poor estimation of the e.d.r. directions [10]. An appropriate way to deal with this problem is to find a reduced-column approximation to $K$, denoted by $\tilde{K}$, so that $\tilde{K}$ has full column rank and its column space $C(\tilde{K})$ provides a good approximation to $C(K)$. This approximation enhances the numerical stability without much information loss. The reduced-column approximation also cuts down the problem size of the generalized spectrum decomposition and speeds up the computation.

Let $\tilde{P}$ be a projection matrix of size $n \times \tilde{n}$ that satisfies $\tilde{P}^\top \tilde{P} = I_{\tilde{n}}$. Given the reduced-column kernel data $\tilde{K} := K\tilde{P}$, the approximation of KSIR is to solve the following reduced generalized eigenvalue problem:

$$\Sigma_{E(\tilde{K}|Y_J)}\,\tilde{\alpha} = \lambda\,\Sigma_{\tilde{K}}\,\tilde{\alpha}, \qquad (5)$$

which is of much smaller size, as $\tilde{n} \ll n$. With the use of the reduced kernel $\tilde{K}$, the corresponding centered weighted slice means are given by $\tilde{W}_{\tilde{n} \times J}$ with the $j$th column

$$\tilde{w}_j = \sqrt{n_j/n}\,\left(\frac{1_{n_j}^\top \tilde{K}_{S_j}}{n_j} - \frac{1_n^\top \tilde{K}}{n}\right)^{\!\top},$$

where $1_{n_j}^\top \tilde{K}_{S_j}/n_j$ and $1_n^\top \tilde{K}/n$ are respectively the $j$th slice mean and the grand mean of $\tilde{K}$. We can also apply Proposition 1 to the reduced problem (5), and the resulting feature e.d.r. directions are given by $\tilde{V} = \Sigma_{\tilde{K}}^{-1} \tilde{W} \tilde{U} \tilde{D}^{-1/2}$, where $\tilde{U}$ and $\tilde{D}$ are the eigenvectors and eigenvalues of $\tilde{W}^\top \Sigma_{\tilde{K}}^{-1} \tilde{W}$. Here, $\Sigma_{\tilde{K}}^{-1}$ exists if a proper projection $\tilde{P}$ is used. Note that the simplest way to choose $\tilde{P}$ is to take a column subset of $I_n$. The KSIR algorithm using a reduced kernel approximation is given in Algorithm 1.
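The reduced-kernel KSIR procedure (Algorithm 1) could be sketched in NumPy roughly as follows. This is a hedged illustration rather than the authors' code: the function name `ksir_reduced`, the argument `slice_ids` and the small `ridge` term added to $\Sigma_{\tilde{K}}$ are assumptions of this sketch (the report instead relies on a proper choice of $\tilde{P}$ to make $\Sigma_{\tilde{K}}$ invertible), and the sketch assumes the leading $J - 1$ eigenvalues are strictly positive.

```python
import numpy as np

def ksir_reduced(K_tilde, slice_ids, ridge=1e-8):
    """Sketch of low-rank projection KSIR on a reduced kernel (n x n_tilde).

    Returns V (n_tilde x (J-1)) and the associated eigenvalues d (length J-1).
    """
    n, n_tilde = K_tilde.shape
    slices = np.unique(slice_ids)
    grand_mean = K_tilde.mean(axis=0)

    # Step 1: centered and weighted slice means, one column per slice.
    W = np.column_stack([
        np.sqrt(np.mean(slice_ids == j))
        * (K_tilde[slice_ids == j].mean(axis=0) - grand_mean)
        for j in slices
    ])

    # Step 2: covariance matrix of the reduced kernel.
    centered = K_tilde - grand_mean
    sigma = centered.T @ centered / n
    sigma += ridge * np.eye(n_tilde)    # assumption: tiny ridge for stability

    # Step 3: solve sigma X = W, then decompose the J x J matrix W^T X.
    X = np.linalg.solve(sigma, W)
    M = W.T @ X
    evals, U = np.linalg.eigh((M + M.T) / 2.0)
    order = np.argsort(evals)[::-1][:len(slices) - 1]   # drop the zero eigenvalue
    d, U = evals[order], U[:, order]

    # Step 4: V = sigma^{-1} W U D^{-1/2}.
    V = X @ U / np.sqrt(d)
    return V, d
```

The KSIR variates of the training data would then be the centered reduced kernel multiplied by `V`, for example `(K_tilde - K_tilde.mean(axis=0)) @ V`, and any linear learner can be run on those few columns.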

Algorithm 1 Low-rank Projection KSIR Algorithm

Input: A reduced kernel matrix $\tilde{K}_{n \times \tilde{n}}$ and a $J$-sliced response (label) $n$-vector $Y_J$.
Output: KSIR directions $\tilde{V}_{\tilde{n} \times (J-1)}$ and associated eigenvalues $\tilde{d}_{(J-1) \times 1}$.

1. Compute the centered and weighted slice means $\tilde{W}_{\tilde{n} \times J}$;
2. Compute the covariance matrix $\Sigma_{\tilde{K}}$ of the reduced kernel;
3. Compute the eigenvalue decomposition of $\tilde{W}^\top \Sigma_{\tilde{K}}^{-1} \tilde{W}$ as $\tilde{U} \tilde{D} \tilde{U}^\top$;
   /* $O(J^3)$ for solving the eigenvalue problem */
   /* $\tilde{D}$ and $\tilde{U}$ consist of non-zero eigenvalues and associated eigenvectors */
   /* $O(\tilde{n}^3)$ for solving the linear system $\Sigma_{\tilde{K}} X = \tilde{W}$ to get $\Sigma_{\tilde{K}}^{-1} \tilde{W}$ */
4. $\tilde{V} \leftarrow \Sigma_{\tilde{K}}^{-1} \tilde{W} \tilde{U} \tilde{D}^{-1/2}$; $\tilde{d} \leftarrow \mathrm{diagonal}\{\tilde{D}\}$.

4 Numerical Experiments

4.1 Data visualization

SIR and KSIR aim to extract a low-dimensional feature e.d.r. subspace that contains as much information about the output variable $y$ as possible. The former looks for such a subspace in the pattern Euclidean space, while the latter looks in the feature RKHS. Moreover, both SIR and KSIR algorithms rank the importance of e.d.r. directions by the associated eigenvalues. Thus, we can use the first one or two directions to visualize the main data structure, which would otherwise be complex in high dimension. Here we show some data views in a 2-dimensional subspace obtained by PCA, SIR and KSIR, respectively. There are two examples, one for a classification problem (pendigits data, Figure 1) and the other for a regression problem (peaks in Matlab, Figure 2).

We can clearly see that the KSIR processing turns the nonlinear structure in pattern space into a prominent linear structure in kernel feature space. It provides an empirical justification to combine the KSIR with linear learning algorithms for other tasks on the feature e.d.r. subspace.

4.2 KSIR dimension reduction for classification

In classification, the clustering structure has been defined through the class labels, and slices are made accordingly. We then estimate the central feature e.d.r. subspace and map the data onto this feature subspace for discriminant purposes. In a J-class problem, we slice the data set into J slices according to the class labels. Thus, there are at most J − 1 independent e.d.r. directions, since the rank of $\Sigma_{E(A|Y_J)}$ or $\Sigma_{E(K|Y_J)}$ is at most J − 1. After extracting the feature e.d.r. subspace, discriminant analysis becomes computationally much easier in this very low-dimensional subspace. Since we have turned the nonlinear structure in the pattern space into an approximately linear structure in the feature space via kernel transformation, direct application of linear learning algorithms on KSIR variates is often sufficient. In our classification experiments, we pick the Fisher linear discriminant analysis (FDA) and the linear smooth support vector machine (SSVM) [13] as our baseline linear learning algorithms. One property of SSVM is that it is solved in the primal space and its computational complexity depends on the number of input attributes (here, the number of KSIR variates, which is at most J − 1). A smaller number of columns implies less computational load. Note that as data are projected along the KSIR directions, discriminant analysis is computationally light. The FDA and SSVM acting on top of KSIR variates are numerically compared with the standard nonlinear SVM benchmark algorithm LIBSVM (Table 1 and Table 2).

[Figure 1: 2D views of pendigits data by PCA, SIR, KSIR. Panels: (a) PCA-1 vs. PCA-2; (b) SIR-1 vs. SIR-2; (c) KSIR-1 vs. KSIR-2; (d) KSIR-1 vs. KSIR-2. Points are plotted as their digit labels.]

4.3 KSIR dimension reduction for regression

[Figure 2: 2D views of response vs. the 1st variate by PCA, SIR and KSIR with peaks data. Panels: (a) Response (Y) vs. Variable-1 and Variable-2; (b) Response (Y) vs. PCA-1; (c) Response (Y) vs. SIR-1; (d) Response (Y) vs. KSIR-1.]

Unlike classification, KSIR for regression can be more complicated because there are no intuitive slices. In fact, we need to consider more factors, such as the number of slices, their positioning and the dimensionality of the final e.d.r. subspace. For the positioning of slices, we adopt a simple equal frequency strategy and fix the number of slices at 30 for the data. After extracting the KSIR directions, we apply the linear regularized least squares (RLS) fit of the responses on the KSIR variates.
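The two steps described above, equal-frequency slicing of the responses and an RLS fit on the extracted variates, might look as follows in NumPy. This is a sketch under stated assumptions: the function names, the regularization parameter `lam` and the handling of the intercept by centering are illustrative choices that the report does not specify.

```python
import numpy as np

def equal_frequency_slices(y, n_slices):
    """Assign each response to one of n_slices slices of (nearly) equal size,
    following the sorted order of y."""
    order = np.argsort(y, kind="stable")
    slice_ids = np.empty(len(y), dtype=int)
    for j, chunk in enumerate(np.array_split(order, n_slices)):
        slice_ids[chunk] = j
    return slice_ids

def rls_fit(Z, y, lam):
    """Regularized least squares fit of y on the columns of Z.

    Assumption: the intercept is left unpenalized by centering Z and y first.
    Returns the weight vector w and the intercept b.
    """
    z_mean, y_mean = Z.mean(axis=0), y.mean()
    Zc = Z - z_mean
    w = np.linalg.solve(Zc.T @ Zc + lam * np.eye(Z.shape[1]), Zc.T @ (y - y_mean))
    return w, y_mean - z_mean @ w
```

With 30 slices, `equal_frequency_slices(y, 30)` produces the slice memberships used to form the slice means, and `rls_fit` is then applied to the KSIR variates (for example 3 or 29 columns) instead of the original attributes.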

The testing results of KSIR+RLS are compared with nonlinear SVR provided in LIBSVM. Note that a reduced kernel approximation has been used in KSIR for all our regression examples. The R² values based on ten-fold cross validation are shown in Table 3 and the cpu time is reported in Table 4. R² is a commonly used criterion for evaluation of regression goodness of fit. Its definition is given below:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

where $\hat{y}_i$ is the fitted response and $\bar{y}$ is the grand mean.
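As a small worked example of this definition, the helper below (the name `r_squared` is hypothetical) computes $R^2$ from observed and fitted responses.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - sum (y_i - yhat_i)^2 / sum (y_i - ybar)^2."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    residual_ss = np.sum((y - y_hat) ** 2)
    total_ss = np.sum((y - y.mean()) ** 2)
    return 1.0 - residual_ss / total_ss
```

For responses (1, 2, 3, 4) and fitted values (1.1, 1.9, 3.2, 3.8), the residual sum of squares is 0.10 and the total sum of squares is 5.0, so $R^2 = 0.98$. A perfect fit gives 1, and always predicting the grand mean gives 0.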

Table 1: The error rate for FDA and linear SSVM on KSIR variates compared with LIBSVM on classification data sets.

Data set     KSIR+FDA           KSIR+SSVM          LIBSVM
             mean (std)         mean (std)         mean (std)
banana       0.1176 (0.0039)    0.1183 (0.0028)    0.1229
tree         0.1225 (0.0027)    0.1156 (0.0023)    0.1283
splice       0.1182 (0.0041)    0.1175 (0.0036)    0.1048
adult        0.1702 (0.0011)    0.1492 (0.0008)    0.1491
web          0.0164 (0.0007)    0.0152 (0.0004)    0.0090
Iris         0.0227 (0.0064)    0.0273 (0.0066)    0.0373 (0.0034)
wine         0.0163 (0.0053)    0.0125 (0.0042)    0.0187 (0.0042)
vehicle      0.1460 (0.0091)    0.1499 (0.0103)    0.1460 (0.0072)
segment      0.0297 (0.0025)    0.0282 (0.0020)    0.0285 (0.0016)
dna          0.0652 (0.0035)    0.0447 (0.0013)    0.0463
satimage     0.0926 (0.0031)    0.0914 (0.0034)    0.0870
pendigits    0.0229 (0.0011)    0.0193 (0.0027)    0.0180
medline      0.1208             0.1136             0.1106

Table 2: The training time (seconds) of FDA and linear SSVM on KSIR variates compared with LIBSVM on classification data sets.

Data set     KSIR+FDA    KSIR+SSVM    LIBSVM
banana       0.065       0.064        0.041
tree         0.076       0.078        0.075
splice       0.219       0.185        0.443
adult        4.863       4.867        219.3
web          16.18       16.37        149.5
dna          0.378       0.360        2.806
satimage     3.041       3.197        3.654
pendigits    1.245       1.775        2.402
medline      1.701       1.799        3.032

5 Conclusion

We have reformulated KSIR [8] in a more rigorous and formal RKHS framework, both for theoretical development and for the practical implementation of a fast algorithm. The KSIR algorithm first maps the pattern space to an appropriate RKHS and then extracts the main linear features in this embedded feature space. It takes class labels or regression response information into account and is a supervised dimension reduction method. After the extraction of the feature e.d.r. subspace, many supervised linear learning algorithms, such as FDA, SVM, SVR and possibly others, can be applied to the images of the input data in this feature e.d.r. subspace. This generates a nonlinear learning model in the original input pattern space and achieves very good performance for complex data analysis. We have also incorporated a reduced kernel approximation to cut down the computational load and to resolve the numerical instability due to singularity in a large between-slice covariance matrix. The singularity problem not only causes numerical instability but also leads to inferior estimation of the e.d.r. directions.

A few leading components extracted by KSIR can carry most of the relevant information about y in regression and in classification. This allows us to run linear learning algorithms in a very low-dimensional feature e.d.r. subspace and to gain computational advantages without sacrificing the performance of the learning algorithms. For example, in solving a nonlinear SVM multi-class problem, one has to solve $\binom{J}{2}$ nonlinear binary SVMs under the “one-versus-one” scheme. The KSIR-based approach only involves solving the KSIR problem once, and the remaining task is solved by a series of $\binom{J}{2}$ linear binary SVMs in a (J − 1)-dimensional space. Moreover, the KSIR approach also has an advantage in the tuning procedure. We have demonstrated these merits in our numerical experiments. Finally, using the first one or two components will help scientists or data analysts to gain direct insight into data patterns, which would otherwise be complex in high dimensionality.
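To give a sense of scale for the one-versus-one count mentioned above, consider the pendigits data, which has ten digit classes (J = 10). The arithmetic below is only an illustration of the counting argument, not a result reported in the experiments.

```latex
\[
\binom{J}{2} = \frac{J(J-1)}{2}, \qquad
J = 10 \;\Rightarrow\; \binom{10}{2} = \frac{10 \cdot 9}{2} = 45, \qquad
J - 1 = 9 .
\]
```

So the direct approach trains 45 nonlinear binary SVMs on the original attributes, whereas the KSIR-based approach solves the KSIR problem once and then trains 45 linear binary SVMs in a 9-dimensional space.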

Table 3: R² of RLS on 3 and 29 KSIR variates compared with LIBSVM on regression data sets.

Data set     KSIR(3)+RLS        KSIR(29)+RLS       LIBSVM
             mean (std)         mean (std)         mean (std)
Housing      0.8619 (0.0398)    0.8611 (0.0440)    0.8680 (0.0314)
Kf 1000      0.6457 (0.0479)    0.6473 (0.0462)    0.6470 (0.0459)
CA 1000      0.9606 (0.0094)    0.9702 (0.0055)    0.9783 (0.0056)
Kin fh       0.6949 (0.0129)    0.6950 (0.0128)    0.7004 (0.0131)
CA           0.9733 (0.0047)    0.9770 (0.0031)    0.9821 (0.0022)
Friedman     0.9553 (0.0006)    0.9554 (0.0006)    0.9561 (0.0005)

Table 4: The training time (seconds) of RLS on 3 and 29 KSIR variates compared with LIBSVM on regression data sets.

Data set     KSIR(3)+RLS    KSIR(29)+RLS    LIBSVM
Housing      0.021          0.035           0.244
Kf 1000      0.015          0.024           0.371
CA 1000      0.016          0.042           0.502
Kin fh       1.029          1.121           19.57
CA           0.167          0.376           27.50
Friedman     6.064          6.433           2242.1

References

[1] E. Alpaydın, Introduction to Machine Learning. The MIT Press, 2004.

[2] R. D. Cook, Regression Graphics: Ideas for Studying Regressions Through Graphics. John Wiley and Sons, 1998.

[3] K. C. Li, “Sliced inverse regression for dimension reduction (with discussion),” Journal of the American Statistical Association, vol. 86, pp. 316–342, 1991.

[4] C. Chen and K. C. Li, “Can SIR be as popular as multiple linear regression?” Statistica Sinica, vol. 8, pp. 289–316, 1998.

[5] N. Duan and K. C. Li, “Slicing regression: a link free regression method,” Annals of Statistics, vol. 19, pp. 505–530, 1991.

[6] P. Hall and K. C. Li, “On almost linearity of low dimensional projection from high dimensional data,” Annals of Statistics, vol. 21, pp. 867–889, 1993.

[7] K. C. Li, “Nonlinear confounding in high-dimensional regression,” Annals of Statistics, vol. 25, pp. 577–612, 1997.

[8] H. M. Wu, “Kernel sliced inverse regression with applications on classification,” Journal of Computational and Graphical Statistics, vol. 17, no. 3, pp. 590–610, 2008.

[9] C. C. Chang and C. J. Lin, “LIBSVM: a library for support vector machines,” 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[10] F. R. Bach and M. I. Jordan, “Kernel independent component analysis,” Journal of Machine Learning Research, vol. 3, pp. 1–48, 2002.

[11] Y. J. Lee and S. Y. Huang, “Reduced support vector machines: a statistical theory,” IEEE Transactions on Neural Networks, vol. 18, pp. 1–13, 2007.

[12] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, “Fisher discriminant analysis with kernels,” in Neural Networks for Signal Processing IX, 1999, pp. 41–48.

[13] Y. J. Lee and O. L. Mangasarian, “SSVM: A smooth support vector machine,” Computational Optimization and Applications, vol. 20, pp. 5–22, 2001.

[14] L. Ferré, “Determining the dimension in sliced inverse regression and related methods,” Journal of the American Statistical Association, vol. 93, pp. 132–140, 1998.

[15] J. R. Schott, “Determining the dimensionality in sliced inverse regression,” Journal of the American Statistical Association, vol. 89, pp. 141–148, 1994.

[16] I. P. Tu, H. Chen, H. P. Wu, and X. Chen, “An eigenvector variability plot,” Statistica Sinica, to appear.

[17] C. M. Huang, Y. J. Lee, D. K. J. Lin, and S. Y. Huang, “Model selection for support vector machines via uniform design,” A special issue on Machine Learning and Robust Data Mining of Computational Statistics and Data Analysis, vol. 52, pp. 335–346, 2007.


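Report on “First KES International Symposium on Intelligent Decision Technologies”

(Report on attending an international conference)

Venue: Himeji, Japan. Dates: April 23 – 24, 2009

Participant: Associate Professor Yuh-Jye Lee (李育杰), Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology

Conference overview

KES - Intelligent Decision Technologies is one of the well-known conferences in the fields of data mining and decision making. The conference focuses on the latest theoretical and practical developments in decision-related analysis and covers the following topics: intelligent agents, fuzzy logic, multi-agent systems, artificial neural networks, genetic algorithms, expert systems, intelligent decision making support systems, information retrieval systems, geographic information systems, knowledge management systems. The conference ran for two days, with multiple sessions held in parallel.

Report

This year's project theme, “Nonlinear dimension reduction for supervised learning”, led to applications and discussion of the related techniques in information security, and we also presented the paper “Adaptive Alarm Filtering by Causal Correlation Consideration in Intrusion Detection” [1]. An excerpt from the abstract follows: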

One of the main difficulties in most modern Intrusion Detection Systems is the problem of massive alarms generated by the systems. The alarms may either be false alarms which are wrongly classified by a sensitive model, or duplicated alarms which may be issued by various intrusion detectors or be issued at different time for the same attack. We focus on learning-based alarm filtering system. The system takes alarms as the input which may include the alarms from several intrusion detectors, or the alarms issued in different time such as for multistep attacks. The goal is to filter those alarms with high accuracy and enough representative capability so that the number of false alarms and duplicated alarms can be reduced and the efforts from alarm analysts can be significantly saved. To achieve that, we consider the causal correlation between relevant alarms in the temporal domain to re-label the alarm either to be a false alarm, a duplicated alarm, or a representative true alarm. To be


find novel attacks. As another feature of our system, our system can deal with the frequent changes of network environment. The framework gives the judgment of attacks adaptively. An ensemble of classifiers is adopted for the purpose. Accordingly, we propose a system mainly consisting of two components: one is for alarm filtering to reduce the number of false alarms and duplicated alarms; and one is the ensemble-based adaptive learner which is capable of adapting to environment changes through automatic tuning given the expertise feedback. Two datasets are evaluated.

The session topics at this conference included “Engineering of IDTs for KMS”, “Intelligent Data Processing Techniques for Decision Making”, “Decision Making in a Dynamic Environment”, “Decision and Health, Intelligent Systems: Foundations and Applications”, “Non-Classical Logics for Intelligent Decision Technologies”, “Knowledge - Based Interface Systems”, “IDT Based Anomaly Detection”, “Intelligent Data Analysis”, “Knowledge-Based Software Engineering and Medical Decision Support Systems”, and “Rough Sets and Decision Making, Decision Making in a Changing Financial and Social Environment”, among others. The sessions I attended most and found most interesting were “Intelligent Data Processing Techniques for Decision Making”, “Decision Making in a Dynamic Environment”, “IDT Based Anomaly Detection”, and “Intelligent Data Analysis”. Of these, “IDT Based Anomaly Detection” was also the session in which we presented our paper.

In addition, I met several international scholars at this conference, such as Toyoaki Nishida of Kyoto University and Junzo Watada of Waseda University. The way the conference itself was organized is also worth studying as a reference for Taiwan when hosting similar conferences in the future.

References

[1] Yi-Ren Yeh, Zheng-Yi Lee, and Yuh-Jye Lee. “Anomaly Detection via Over-sampling Principal Component Analysis,” First KES International Symposium on Intelligent Decision Technologies, Himeji, Japan, 2009.


Anomaly Detection via Over-Sampling Principal Component Analysis

Yi-Ren Yeh, Zheng-Yi Lee, and Yuh-Jye Lee

Abstract. Outlier detection is an important issue in data mining and has been studied in different research areas. It can be used for detecting a small amount of deviated data. In this article, we use the “Leave One Out” procedure to check the “with or without” effect of each individual point on the variation of principal directions. Based on this idea, an over-sampling principal component analysis outlier detection method is proposed for emphasizing the influence of an abnormal instance (or an outlier). Besides identifying suspicious outliers, we also design an on-line anomaly detection method to detect newly arriving anomalies. In addition, we study the quick updating of the principal directions for efficient computation and to satisfy the on-line detection demand. Numerical experiments show that our proposed method is effective in computation time and anomaly detection.

1 Introduction

Because only very few labeled data are available in real applications, and the events that people are interested in are extremely rare or have never happened before, outlier detection is attracting more and more attention [3, 4, 7, 8, 9, 11]. Outlier detection can be used in many application domains, such as homeland security, credit card fraud detection, intrusion and insider threat detection in cyber-security, fault detection and malignant diagnosis [8, 11, 12, 13]. Thus, outlier detection methods are designed for finding rare instances or deviated data. In other words, an outlier detection method can be applied to deal with extremely unbalanced data distribution problems, such as capturing the anomaly which exists in a small proportion of network traffic.

Yi-Ren Yeh, Zheng-Yi Lee, and Yuh-Jye Lee
Computer Science and Information Engineering, National Taiwan University of Science and Technology, No.43, Sec.4, Keelung Rd., Taipei, Taiwan 10607
e-mail: {D9515009,M9615018,yuh-jye}@mail.ntust.edu.tw

In the past, many outlier detection methods have been proposed [3, 7, 9]. One of the most popular outlier detection methods uses the density-based local outlier factor (LOF) to measure the outlierness of each instance [3]. The LOF uses the density of each individual instance’s neighbors to define the degree of outlierness and produces a suspicion ranking for all instances. The most important property of the LOF is that it considers the local data structure when estimating the density. This property lets the LOF discover outliers that are sheltered under a global data structure. In addition, an angle-based outlier detection (ABOD) method has been proposed recently [9]. The main concept of ABOD is to use the variation of the angles between each target instance and the remaining instances. An outlier or deviated instance will generate a smaller variance among its associated angles. Based on this observation, ABOD considers the variance of all the angles between the target instance and any pair of instances to detect outliers. However, the time complexity of ABOD is too high to deal with large datasets. In [9], the authors also proposed the fast ABOD, which is an approximation of the original ABOD. The difference is that fast ABOD only considers the variance of the angles between the target instance and any pair of instances among the target instance’s k nearest neighbors. Even so, the methods mentioned above cannot be scaled up to massive datasets because of their very expensive computational cost.

In this paper, we observe that removing (or adding) an abnormal instance (or outlier) will cause a lager effect on principal directions than removing (or adding) a normal one. From this observation, we apply the “Leave One Out” (LOO) procedure to check each individual point the “with or without” effect on the variation of principal directions. This will help us to remove the suspicious outliers in the dataset.

Thus, it can be used for data cleaning. Once we have a clean dataset, we can extract the leading principal directions from it and use these directions to characterize the normal profile of the dataset. Similarly, we can evaluate the “with or without” effect of a newly arriving data point, which defines a suspicious score for that point. If the score is greater than a certain threshold, we regard the point as an outlier. Based on this mechanism, we propose an on-line anomaly detection method. Intuitively, when the dataset is large, the “with or without” effect of a single data point on the principal directions is diminished, even if the point is an outlier. To overcome this problem, we employ an “over-sampling” scheme that amplifies the “with or without” influence of an outlier. We are also aware of the computational issues in the whole process. How to compute the principal directions efficiently when the mean and covariance matrix change slightly is a key issue, and the matrix computation tricks for it are included in this work as well.

2 Over-Sampling Principal Component Analysis

In this section, we first introduce the classical dimension reduction method PCA briefly. The study on the influence of the variation of principal directions via LOO


Anomaly Detection via Over-Sampling Principal Component Analysis 451

computation for computing the covariance matrix and estimating principal directions in LOO procedure is also proposed.

2.1 Principal Component Analysis

PCA is an unsupervised dimension reduction method. It retains those characteristics of the data set that contribute most to its variance by keeping the lower-order principal components. These few components often contain the “most important” aspects of the data. Let $A \in \mathbb{R}^{p \times n}$ be the data matrix, where each column $x_i \in \mathbb{R}^p$ represents an instance. PCA involves the eigenvalue decomposition of the covariance matrix of the data. It is formulated as the following eigenvalue problem:

$$\Sigma_A \Gamma = \lambda \Gamma, \tag{1}$$

where $\Sigma_A = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^\top$ is the covariance matrix, $\mu$ is the grand mean, and the resulting $\Gamma$ is the eigenvector set. In practice, some eigenvalues contribute little to the variance and can be discarded. This means that we only need to keep a few components to represent the data. In addition, because PCA explains variance, it is sensitive to outliers. A few points distant from the center can have a large influence on the variance and on the principal directions. In other words, the first few principal directions will be seriously affected if our data contain outliers.
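The eigenvalue problem in (1) can be illustrated with a minimal NumPy sketch. The function name `principal_directions`, the parameter `k`, and the synthetic two-dimensional data are assumptions made for illustration; they are not part of the original paper.

```python
import numpy as np

def principal_directions(A, k=1):
    """Top-k eigenpairs of the covariance of A (p x n, columns are instances).

    Illustrative helper; the name and signature are assumptions.
    """
    mu = A.mean(axis=1, keepdims=True)        # grand mean, shape (p, 1)
    centered = A - mu
    cov = centered @ centered.T / A.shape[1]  # (1/n) * sum of outer products
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest
    return eigvals[order], eigvecs[:, order]

# Synthetic data: large variance along the first axis, small along the second.
rng = np.random.default_rng(0)
A = np.vstack([rng.normal(0.0, 3.0, 200), rng.normal(0.0, 0.5, 200)])
vals, vecs = principal_directions(A, k=1)
```

For this data the leading direction is expected to lie close to the first coordinate axis, since almost all of the variance is along it.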

2.2 The Influence of an Outlier on Principal Directions

As mentioned in the previous section, PCA is sensitive to outliers, and we only need a few principal components to represent the main data structure. That is, an outlier or a deviated instance causes a larger effect on these principal directions. Hence, we explore the variation of the principal directions when removing or adding an instance. This concept is illustrated in Fig. 1, where the clustered blue circles represent the normal data, the red square represents an outlier, and the green arrow is the first principal direction. From the right panel to the left panel in Fig. 1, we can see that the first principal direction is affected when we remove an outlier: it changes and forms a larger angle with the old one. In contrast, if we remove a normal instance, the first principal direction is barely affected and forms only an extremely small angle with the old one. Based on this observation, we use the LOO procedure to check the “with or without” effect of each individual point. On the other hand, we might have pure normal data in hand. In this case, we use the same concept as in the LOO setting but with an incremental strategy, that is, adding an instance and observing the variation of the principal directions. Similarly, adding a normal


Fig. 1 The illustration for the effect of an outlier on the first principal direction (arrow labels between the two panels: “Remove an outlier” and “Add an outlier”)

We check the variation of the principal directions for each new arriving instance and regard it as an outlier if the variation of the principal directions is significant.

In summary, we find that removing an outlier affects the principal directions noticeably, while removing a normal instance causes only a small variation. This concept can be used for identifying anomalies or outliers in our data. Likewise, adding an outlier causes a larger influence on the principal directions, while adding a normal instance causes only a small variation. This means that we can use the incremental strategy to detect newly arriving abnormal data or outliers. In other words, we explore the variation of the principal directions when removing or adding a data point, and we use this information to identify outliers and to detect newly arriving deviated data.
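The LOO idea summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the score `1 - |cos(angle)|` between the old and new first principal directions is an assumed way to quantify the "variation", and the names `first_direction` and `loo_scores` as well as the synthetic data are hypothetical.

```python
import numpy as np

def first_direction(A):
    """First principal direction of A (p x n, columns are instances)."""
    centered = A - A.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / A.shape[1]
    _, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    return eigvecs[:, -1]              # unit eigenvector of the largest eigenvalue

def loo_scores(A):
    """Score each instance by how much removing it rotates the first direction.

    Assumed score: 1 - |cosine| of the angle between old and new directions.
    """
    u = first_direction(A)
    n = A.shape[1]
    scores = np.empty(n)
    for t in range(n):
        u_t = first_direction(np.delete(A, t, axis=1))  # leave instance t out
        scores[t] = 1.0 - abs(u @ u_t)                  # abs() ignores sign flips
    return scores

# 100 normal points stretched along the first axis, plus one outlier (last column).
rng = np.random.default_rng(0)
normal = np.vstack([rng.normal(0.0, 3.0, 100), rng.normal(0.0, 0.5, 100)])
outlier = np.array([[8.0], [12.0]])
A = np.hstack([normal, outlier])
scores = loo_scores(A)
suspect = int(np.argmax(scores))  # index of the most suspicious instance
```

Each score requires a full eigendecomposition here, which is exactly the cost that motivates faster updating schemes.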

2.3 Over-Sampling Principal Components Analysis

As mentioned in Section 2.2, we identify outliers in our data and detect newly arriving outliers through the variation of the principal directions. However, the “with or without” effect of a particular data point may be diminished when the data set is large. In addition, the computation for estimating the principal directions is heavy, because we need to recompute the principal directions many times in the LOO scenario.

To overcome the first problem, we employ an “over-sampling” scheme to amplify the outlierness of each data point. To identify an outlier via the LOO strategy, we duplicate the target instance instead of removing it. That is, we duplicate the target instance many times (10% of the whole data in our experiments) and observe how much the principal directions vary. With this over-sampling scheme, the principal directions and the mean of the data are only slightly affected if the target instance is a normal data point (see Fig. 2(a)). In contrast, the variation is enlarged if we duplicate an outlier (see Fig. 2(b)). On the other hand,


Fig. 2 The effect of over-sampling on an outlier and a normal instance. (a) The effect of over-sampling on normal data. (b) The effect of over-sampling on an outlier. Point labels in both panels: single point, duplicated points.

point and an outlier. Based on the over-sampling PCA, we make the idea discussed in Section 2.2 more practical.
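The over-sampling scheme can be sketched in the same style. This is a rough illustration under stated assumptions: the covariance matrix is recomputed from scratch for every target instance (no fast updating), the score `1 - |cos(angle)|` is an assumed measure of variation, and `oversampling_scores`, `ratio`, and the synthetic data are hypothetical names and values. Only the 10% duplication ratio comes from the text.

```python
import numpy as np

def first_direction(A):
    """First principal direction of A (p x n, columns are instances)."""
    centered = A - A.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / A.shape[1]
    _, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    return eigvecs[:, -1]

def oversampling_scores(A, ratio=0.1):
    """Duplicate each target instance ratio * n times and measure the rotation
    of the first principal direction (assumed score: 1 - |cosine|)."""
    n = A.shape[1]
    copies = max(1, int(round(ratio * n)))
    u = first_direction(A)
    scores = np.empty(n)
    for t in range(n):
        duplicated = np.repeat(A[:, [t]], copies, axis=1)  # p x copies
        u_t = first_direction(np.hstack([A, duplicated]))
        scores[t] = 1.0 - abs(u @ u_t)
    return scores

# 500 normal points plus a single outlier stored in the last column.
rng = np.random.default_rng(1)
normal = np.vstack([rng.normal(0.0, 3.0, 500), rng.normal(0.0, 0.5, 500)])
A = np.hstack([normal, np.array([[8.0], [12.0]])])
scores = oversampling_scores(A, ratio=0.1)
suspect = int(np.argmax(scores))
```

Duplicating the target gives a single instance a weight comparable to a tenth of the data, so an outlier rotates the first direction substantially even when the data set is large.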

Regarding computation, we need to recompute the principal directions many times in the LOO scenario. To avoid this heavy load, we also propose two strategies to accelerate the estimation of the principal directions. The first is the fast updating of the covariance matrix. The other is solving the eigenvalue problem via the power method [6]. As (1) shows, PCA is formulated as an eigenvalue decomposition of the covariance matrix of the data.
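For reference, a generic power iteration for the dominant eigenpair of a symmetric positive semidefinite matrix can be sketched as below. This is a textbook version, not necessarily the variant used in [6] or by the authors; the function name `power_method`, the optional starting vector `u0`, and the stopping rule are assumptions.

```python
import numpy as np

def power_method(S, u0=None, num_iters=1000, tol=1e-10, seed=0):
    """Dominant eigenpair of a symmetric PSD matrix S by power iteration.

    u0 is an optional starting vector (for example a previously computed
    direction); a random start is used when it is not given.
    """
    if u0 is None:
        u0 = np.random.default_rng(seed).normal(size=S.shape[0])
    u = u0 / np.linalg.norm(u0)
    for _ in range(num_iters):
        v = S @ u
        norm = np.linalg.norm(v)
        if norm == 0.0:                  # u lies in the null space of S
            break
        v /= norm
        converged = np.linalg.norm(v - u) < tol
        u = v
        if converged:
            break
    eigval = u @ S @ u                   # Rayleigh quotient of the unit vector u
    return eigval, u

# Small symmetric positive definite example.
S = np.array([[4.0, 1.0], [1.0, 3.0]])
eigval, u = power_method(S)
```

Passing the direction computed on the full data as `u0` is one plausible way to reduce the number of iterations when the matrix changes only slightly, although that choice is an assumption here.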

However, it is unnecessary to completely recompute the covariance matrix in the LOO procedure. Because we only duplicate one instance, the change in the covariance matrix can be easily adjusted for. Hence, we consider a light update of the covariance matrix for fast computation [5]. Let $Q = \frac{AA^\top}{n}$ be the pre-computed scaled outer-product matrix. We use the following update for the adjusted mean vector $\tilde{\mu}$ and covariance matrix $\tilde{\Sigma}$: