
National Science Council (Executive Yuan) Research Project Final Report

Feasibility Assessment of Building Sample Classification Rules Using the Coefficient of Intrinsic Dependence

Research Results Report (Condensed Version)

Project Type: Individual

Project Number: NSC 95-2119-M-002-045-

Project Period: August 1, 2006 to July 31, 2007

Host Institution: Department of Agronomy, National Taiwan University

Principal Investigator: 劉力瑜

Project Personnel: Master's students (part-time assistants): 蕭雅純, 馬梓豪

Availability: This project report is publicly accessible

October 31, 2007


Final Report of a Research Project Funded by the National Science Council, Executive Yuan

Feasibility Assessment of Building Sample Classification Rules Using the Coefficient of Intrinsic Dependence

Project Type: Individual Project

Project Number: NSC 95-2119-M-002-045-

Project Period: August 1, 2006 to July 31, 2007

Principal Investigator: 劉力瑜

Project Personnel: Master's students (part-time assistants): 蕭雅純, 馬梓豪

Report Type: Condensed Report

Availability: This project report is publicly accessible

Host Institution: Department of Agronomy, National Taiwan University


Chinese Abstract

Building a sample classification rule based on the characteristics of each experimental unit can be divided into two steps: (1) selecting a few variables that clearly discriminate between groups, and (2) using the selected variables to build an optimal classification rule. Step (1) is commonly known as feature selection. When many variables are available as a basis for classification, choosing a small number of variables that suffice to classify samples properly not only helps reduce cost; selecting appropriate variables is also the key to accurate classification (Sima et al., 2005). Feature selection criteria fall into two broad categories: measures of association and measures of misclassification rate. Misclassification-rate measures depend on the classification model, and different classifiers give different results. Association measures, in turn, depend on the choice of association statistic. Hsing et al. (2005) proposed the coefficient of intrinsic dependence (CID), which describes the degree of dependence between variables by quantifying the difference between the marginal and conditional distributions of a variable. It is an association statistic that is not tied to any specific form of dependence, which makes it suitable as a basis for feature selection. Hsing et al. (2005) also noted that, even with small samples, the CID estimate still faithfully reflects the actual degree of dependence between variables. The purpose of this study is to investigate the feasibility of using the CID to build classification rules for small samples and to compare the approach with conventional classification methods. Finally, the method is applied to the classification of breast cancer gene expression data.

Keywords: coefficient of intrinsic dependence, classification rule, feature selection, microarray

Abstract

The problem of classification is to assign objects to one of the mutually exclusive subgroups in a population based on the objects' characteristics. To build a precise classification rule, a two-step procedure is usually performed on the training dataset: (1) selecting a few features that are most informative in the sense of decision making; (2) deriving the formula that outputs the optimal allocation of objects. Selecting appropriate features is particularly essential for successful classification. Recent methods of feature selection consider either the misclassification rate of objects given the information of a set of variables, or the association between variables and the class label. The former yields inconsistent results for different settings of classifiers. The latter is subject to the choice of association measure. An analysis of actual data from a study of breast cancer gene expression is included.

Hsing et al. (2005) proposed a new measure of association, the coefficient of intrinsic dependence, or CID. The CID captures not only linear but general association among variables. It was also demonstrated that CID is capable of putting variables in an appropriate order according to their degree of association with the target variable even when the sample size is small. This research broadens the work of Hsing et al. (2005) by applying CID to feature selection. It is followed by the construction of Bayes classifiers and comparisons with conventional methods.

Keywords: CID, classification, feature selection, microarray

1 Introduction

The problem of classification is to assign objects to one of the mutually exclusive subgroups in a population based on the objects' characteristics. The methodology of classification has a long history of development (Jain et al., 2000). To build a classification rule, a two-step procedure is usually performed on the training dataset: (1) selecting a few features that distinguish the classes the most; (2) deriving the decision-making formula, or the classifier, that best allocates objects. The task in the first step is usually referred to as feature selection. Given a set of variables, feature selection aims to select the variable subset that performs best according to certain statistical or mechanical criteria. This procedure not only reduces the cost of trials but also increases the accuracy of classification (Jain and Zongker, 1997). It was also noted that feature selection is particularly essential for proper classification (Sima et al., 2005). Conventional methods of feature selection fall into two categories. One category recognizes the variables that best determine the class labels in the training dataset and collects the subset of variables making the fewest allocation errors. It is intuitive to consider the misclassification rate, but it can be assessed only after the classifier is built. To carry out feature selection, one must decide on a specific type of classifier. An exhaustive search is then conducted over all possible combinations of k features, followed by estimation of the misclassification rate. The number of variables, k, is usually pre-determined as well. Sima et al. (2005) illustrated that the choice of classifier has little effect on classification results; it is the algorithm of error estimation that matters. The precision of error estimation algorithms deteriorates rapidly as the sample size decreases. As a result, an improper classifier may be wrongly concluded.

The second category of feature selection discerns variables whose values differ from one class to another. In other words, the variables that are associated with the class labels are favored. Various statistical methods can be used to determine whether the values of a particular feature are differentially expressed. For example, Student's t-test is widely used when there are only two classes of interest. Welch's approximation is suggested instead when the population variances of the two classes are assumed to be unequal. One would prefer a nonparametric method, such as the Wilcoxon rank-sum test, if one is concerned about the validity of the normality assumption underlying Student's t-test.

Hsing et al. (2005) proposed a new measure of association, the coefficient of intrinsic dependence, or CID. The CID takes real values between 0 and 1 inclusive. It is 1 in the case of full dependence and zero in the case of independence. As the level of dependence increases, the CID value moves from 0 toward 1. Naturally, CID is applicable to feature selection because it follows the convention of association measures. Hsing et al. (2005) demonstrated the merits of CID in feature selection, especially with small samples, and their results were confirmed by simulation studies. One of the objectives of this research is to continue the work of Hsing et al. (2005) by developing a classifier. We will compare the feature selection results and the misclassification rates with those of conventional methods. We will exercise CID on a complete breast cancer gene expression dataset produced by National Taiwan University Hospital.

2 Method

2.1 Coefficient of Intrinsic Dependence

Let W and Z be the explanatory and target variables, respectively. The CID of Z given W is defined as follows:

$$\mathrm{CID}(Z \mid W) = \frac{\int_0^1 \mathrm{Var}\!\left[E\!\left(I(G(Z) \le v) \mid W\right)\right] dv}{\int_0^1 \mathrm{Var}\!\left[I(G(Z) \le u)\right] du}, \qquad (1)$$

where $G(\cdot)$ is the marginal cdf of Z, and $I(A)$ is an indicator function such that $I(A) = 1$ if A is true and $I(A) = 0$ otherwise.

It was shown that the numerator integrates the squared distance between the marginal cdf of Z and the conditional cdf of Z given W. By variance decomposition, the denominator serves to standardize the value of CID to [0, 1]. When W and Z are nearly independent, knowledge of W provides little information about Z. The conditional and marginal distributions of Z are therefore similar to each other, which makes the numerator of CID nearly 0. On the other hand, if the two variables are highly dependent, one can easily discriminate objects using only the knowledge of the explanatory variables. In these cases, CID yields values close to 1. In summary, CID has the following properties:


1. CID always has a value between 0 and 1. If two random variables W and Z are fully dependent, CID(Z|W) = CID(W|Z) = 1. On the other hand, CID(Z|W) = CID(W|Z) = 0 if W and Z are independent of each other.

2. The causal relationship between variables is taken into account by the asymmetry of CID. That is, CID(Z|W) may differ from CID(W|Z).

3. CID requires no distributional assumptions and is invariant under transformations of variables.

4. It is readily implemented in different settings, such as numerical, categorical, or multivariate cases, by inserting the appropriate distribution functions.

In the case of classification, the target variable is the class of the objects, denoted by Y. When allocating objects into two classes, as assumed in this project, there are only two possible observed values of Y: 0 or 1. A simpler version of (1) can then be derived:

$$\mathrm{CID}(Y \mid x) = \frac{\mathrm{Var}[E(Y \mid x)]}{\mathrm{Var}(Y)} = 1 - \frac{E[\mathrm{Var}(Y \mid x)]}{\mathrm{Var}(Y)}. \qquad (2)$$

In particular, if x is univariate with binary outcomes, then CID is equal to the square of the correlation coefficient of Y and x.

In the absence of knowledge about the true distribution functions, the denominator and numerator of CID are estimated separately from a sample of size n. If x involves continuous random variable(s), a binning process can be employed to estimate the conditional distributions. We first determine the number of subspaces, B, based on experience. The cutting points are then chosen so that each subspace contains approximately the same number of observations. The estimate of CID is

$$\widehat{\mathrm{CID}}(Y \mid x) = 1 - \frac{\sum_{i=1}^{B} \hat{p}_i (1 - \hat{p}_i) \, n_i / n}{\hat{\pi}(1 - \hat{\pi})}, \qquad (3)$$

where

$$\hat{\pi} = \sum_{j=1}^{n} Y_j / n; \quad n_i = \sum_{j=1}^{n} I(x_j \in A_i); \quad \hat{p}_i = \sum_{j=1}^{n} Y_j \, I(x_j \in A_i) / n_i; \quad A_i = \text{the } i\text{th defined subspace}.$$
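As a rough illustration of the estimate in (3), the following Python sketch bins a continuous explanatory variable into B equal-frequency subspaces and computes the plug-in CID estimate; the function name cid_hat, the NumPy implementation, and the simulated example are our own and are not part of the original report.

```python
import numpy as np

def cid_hat(x, y, B=3):
    """Plug-in estimate of CID(Y|x) from equation (3), for a binary target y
    and a continuous univariate x, using B equal-frequency subspaces."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    pi_hat = y.mean()                              # overall proportion of class 1
    # cut points chosen so that each subspace A_i holds roughly n/B observations
    edges = np.quantile(x, np.linspace(0, 1, B + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    within = 0.0
    for i in range(B):
        in_bin = (x > edges[i]) & (x <= edges[i + 1])
        n_i = in_bin.sum()
        if n_i == 0:
            continue                               # duplicated quantiles give empty bins
        p_i = y[in_bin].mean()                     # estimated Pr(Y = 1 | x in A_i)
        within += p_i * (1 - p_i) * n_i / n        # contribution to E[Var(Y|x)]
    return 1 - within / (pi_hat * (1 - pi_hat))    # equation (3)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100).astype(float)
x = rng.normal(loc=2.0 * y)                        # an informative feature
print(cid_hat(x, y, B=3))                          # clearly above 0

# Check of the remark after (2): for a binary x (grouped by its two values),
# the CID estimate equals the squared sample correlation between Y and x.
xb = rng.integers(0, 2, 200)
yb = (rng.random(200) < 0.3 + 0.4 * xb).astype(float)
pi_b = yb.mean()
within_b = sum(yb[xb == v].mean() * (1 - yb[xb == v].mean()) * (xb == v).mean()
               for v in (0, 1))
print(np.isclose(1 - within_b / (pi_b * (1 - pi_b)),
                 np.corrcoef(xb, yb)[0, 1] ** 2))  # True
```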

2.2 Conventional Statistical Methods

We compare the feature selection results of CID with those of the conventional statistics described in this section. Let $x_1, \dots, x_{n_1}$ be a random sample taken from the first distribution with mean $\mu_1$ and variance $\sigma_1^2$, and let $y_1, \dots, y_{n_2}$ be a random sample taken from the second distribution with mean $\mu_2$ and variance $\sigma_2^2$. Their sample means and variances are denoted by $\bar{x}$, $\bar{y}$, $S_x^2$, and $S_y^2$, respectively.

2.2.1 Student's t-Statistic and Welch's Approximation

If we assume $\sigma_1^2 = \sigma_2^2$, the statistic can be written as

$$t_0 = \frac{\bar{x} - \bar{y}}{\sqrt{S_p^2 \,(1/n_1 + 1/n_2)}},$$

where $S_p^2$ is the pooled estimate of the common variance. If $\sigma_1^2 \neq \sigma_2^2$, we adopt Welch's approximation

$$t_0 = \frac{\bar{x} - \bar{y}}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}.$$

The features are ranked based on the absolute values of $t_0$. Features having the largest absolute values of $t_0$ are claimed to be the most differentially expressed.
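For concreteness, a minimal sketch of how the per-feature t and Welch statistics might be computed with SciPy is given below; the helper name t_scores and the simulated example are our own illustration, not part of the report.

```python
import numpy as np
from scipy import stats

def t_scores(X, y, equal_var=True):
    """Per-feature two-sample t statistics: pooled-variance t0 (equal_var=True)
    or Welch's approximation (equal_var=False).
    X : (n_samples, n_features) matrix; y : 0/1 class labels."""
    t0, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=equal_var)
    return t0

# Example: rank five simulated features, 20 samples per class; features with
# larger mean shifts should come out on top.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 5)) + np.outer(y, [0.0, 0.5, 1.0, 0.0, 2.0])
order = np.argsort(-np.abs(t_scores(X, y, equal_var=False)))  # Welch version
print(order)  # indices of the most differentially expressed features first
```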

2.2.2 Wilcoxon Rank-Sum Test

We combine the two samples $x_1, \dots, x_{n_1}$ and $y_1, \dots, y_{n_2}$ and rank the observations in the combined sample from the smallest (1) to the largest ($n_1 + n_2$). The ranks of $x_1, \dots, x_{n_1}$ are denoted by $r_{x_1}, \dots, r_{x_{n_1}}$ and the ranks of $y_1, \dots, y_{n_2}$ by $r_{y_1}, \dots, r_{y_{n_2}}$. Let

$$T_1 = \sum_i r_{x_i}, \quad T_2 = \sum_i r_{y_i}, \quad U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - T_1, \quad U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - T_2.$$

For a two-tailed test, the test statistic is $U = \min\{U_1, U_2\}$; for a one-tailed test, the test statistic is $U = U_1$. The features are ranked based on the values of U. A feature associated with a higher value of U is considered more differentially expressed.
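The U statistic can be computed directly from these definitions, as in the short sketch below (the helper name rank_sum_u is ours; scipy.stats.mannwhitneyu offers an equivalent library implementation):

```python
import numpy as np

def rank_sum_u(x, y, two_tailed=True):
    """Rank-sum U statistic following the definitions above; for feature
    selection, call once per feature with its values in the two classes."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    combined = np.concatenate([x, y])
    # ordinal ranks from smallest (1) to largest (n1 + n2); ties broken by sort order
    ranks = np.empty(n1 + n2)
    ranks[np.argsort(combined)] = np.arange(1, n1 + n2 + 1)
    T1, T2 = ranks[:n1].sum(), ranks[n1:].sum()
    U1 = n1 * n2 + n1 * (n1 + 1) / 2 - T1
    U2 = n1 * n2 + n2 * (n2 + 1) / 2 - T2
    return min(U1, U2) if two_tailed else U1

rng = np.random.default_rng(2)
print(rank_sum_u(rng.normal(0.0, 1.0, 15), rng.normal(1.0, 1.0, 15)))
```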

2.3 Rank Statistics

The CID and the conventional methods each rank the features from simulated samples. We compare the ranking list produced by each method to the true ranking. The methods that best retrieve the original ranking are judged the best. It is intuitive to compute the correlation between the true ranking and the ranking list obtained from the sample: if a method retrieves the original ranking well, the correlation should be high. There is one thing we want to draw readers' attention to. In a massive dataset such as microarray expression data, one aims to find only a few essential features for classification in order to achieve dimension reduction. Unimportant features may move up and down in the ranking simply because of random error. Therefore, we do not hesitate to omit the ranking of unimportant features; only the subset of features proclaimed important is of interest. Braga-Neto et al. (2004) suggested two ranking statistics for comparing ranking subsets. They are adopted in our study to evaluate the performance of each feature-selection criterion (i.e., CID, t/Welch test, or Wilcoxon rank-sum test). They are described below. Let $k$ be the true ranking (most important ones on top) and $k^*$ be the estimated one.

• The first statistic calculates the mean absolute difference between the top K features in the true list and the top K features in the estimated list:

$$T_1 = \frac{\sum_{i=1}^{K} |k_i - k^*_i|}{K}.$$

• The second statistic counts how many of the top K features in the true list are also among the top K features in the estimated list:

$$T_2 = \sum_{i=1}^{K} I(k^*_i \le K).$$
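A small sketch of how the two ranking statistics might be computed, given each feature's true rank and estimated rank (the function name rank_statistics and the toy example are our own, following the definitions above rather than any code from Braga-Neto et al. (2004)):

```python
import numpy as np

def rank_statistics(true_rank, est_rank, K):
    """T1: mean absolute rank difference over the true top-K features.
       T2: number of true top-K features also in the estimated top K.
    true_rank, est_rank : arrays giving each feature's rank (1 = best)."""
    true_rank, est_rank = np.asarray(true_rank), np.asarray(est_rank)
    top = true_rank <= K                          # the true top-K features
    T1 = np.abs(true_rank[top] - est_rank[top]).mean()
    T2 = int((est_rank[top] <= K).sum())
    return T1, T2

# Toy example with 6 features: the estimate swaps the features ranked 3 and 4.
print(rank_statistics([1, 2, 3, 4, 5, 6], [1, 2, 4, 3, 5, 6], K=3))  # (1/3, 2)
```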

2.4 Bayes Decision Theory

For known class-conditional densities $p_k(x) = \Pr(x \mid Y = k)$ and class priors $\pi_k$, let

$$p(k \mid x) = \frac{\pi_k\, p_k(x)}{\pi_0\, p_0(x) + \pi_1\, p_1(x)}$$

denote the posterior probability of class $k$ ($k = 0, 1$) given $x$. The Bayes rule, $d_B$, classifies an object with observed feature $x$ as the class $j$ for which the posterior probability is maximum:

$$d_B(x) = j = \arg\max_k\, p(k \mid x).$$

In other words, the object is allocated to the class that is more probable given the values of $x$. The Bayes misclassification rate, denoted $R_B$, can be estimated as follows:

$$\Pr(d_B(x) \ne Y) = \pi_0 \Pr(d_B(x) = 1 \mid Y = 0) + \pi_1 \Pr(d_B(x) = 0 \mid Y = 1). \qquad (4)$$

The Bayes rule has the nice property that, when the relevant distributions and priors are known, it produces fewer errors than any other classifier.

One wishes to allocate an object based on the observed values of the features corresponding to high CID values, denoted $x^*$. Suppose $x^* \in A_i$; the probability that the object belongs to either group has already been estimated while computing the CID value on the training data. Hence the Bayes rule can be immediately implemented for classification: the object is assigned to group 1 if $\hat{p}_i > 0.5$ and to group 0 otherwise.
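A minimal sketch of this conditional-distribution classifier for a single CID-selected feature, assuming the equal-frequency binning of Section 2.1; the class name CondDistClassifier is a hypothetical label we introduce for illustration, and bins that are empty or unseen in the training data (an issue revisited in the Results and Discussion) fall back to the overall class proportion here.

```python
import numpy as np

class CondDistClassifier:
    """Bin a CID-selected feature on the training data, estimate
    p_i = Pr(Y = 1 | x in A_i), and assign a new object to group 1
    if the estimated p_i exceeds 0.5."""

    def __init__(self, B=3):
        self.B = B

    def fit(self, x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        edges = np.quantile(x, np.linspace(0, 1, self.B + 1))
        edges[0], edges[-1] = -np.inf, np.inf
        self.edges = edges
        self.pi_hat = y.mean()                   # fallback for empty/unseen bins
        self.p = np.full(self.B, self.pi_hat)
        for i in range(self.B):
            in_bin = (x > edges[i]) & (x <= edges[i + 1])
            if in_bin.any():
                self.p[i] = y[in_bin].mean()     # estimated Pr(Y = 1 | A_i)
        return self

    def predict(self, x_new):
        i = np.clip(np.searchsorted(self.edges, x_new, side="left") - 1, 0, self.B - 1)
        return (self.p[i] > 0.5).astype(int)

rng = np.random.default_rng(3)
y = rng.integers(0, 2, 60)
x = rng.normal(loc=1.5 * y)                      # a feature with high CID
clf = CondDistClassifier(B=3).fit(x, y)
print(clf.predict(np.array([-1.0, 2.0])))        # likely [0, 1]
```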

2.5 Data Simulation

We compare the feature selection results of CID with those of the other conventional methods. All methods are tested on simulated data whose true Bayes errors can be computed. Three models are used to generate samples: the Gaussian model with unequal means, the Gaussian model with unequal variances, and the lognormal model. They are described as follows.

2.5.1 The Gaussian Model with Unequal Means

Suppose all samples come from one of two equally likely p-dimensional Gaussian distributions, $N(\mu_1, \Sigma)$ and $N(\mu_2, \Sigma)$; both $\mu_1$ and $\mu_2$ are $p \times 1$ vectors and $\Sigma$ is a $p \times p$ positive-definite matrix. To simplify the scenario, let $\Sigma$ be the identity matrix, $\mu_1 = -\delta a$, and $\mu_2 = \delta a$. The true error rate $R_B(x)$ decreases as the value of $\delta$ increases.

2.5.2 The Gaussian Model with Unequal Variances

Suppose a particular feature behaves similarly on average in the two classes, but its values are more varied in one of the classes. From the classification point of view, this feature can be used to discriminate objects, especially when the variances are highly unequal, since the two distributions then do not overlap much. Let all samples come from one of two equally likely p-dimensional Gaussian distributions, $N(\mu, \Sigma_1)$ and $N(\mu, \Sigma_2)$, where $\mu$ is a $p \times 1$ vector and $\Sigma_1$ and $\Sigma_2$ are $p \times p$ positive-definite matrices. To simplify the scenario, let $\mu = 0$, $\Sigma_1$ be the identity matrix, and $\Sigma_2 = \mathrm{diag}(\sigma_1^2, \dots, \sigma_p^2)$, where $\sigma_i^2 = 1 + \frac{(i-1)(\delta-1)}{p-1}$.


2.5.3 The Lognormal Model

All samples are drawn from the Gaussian model with unequal means described in Section 2.5.1 and then exponentiated. Therefore, the true error rate $R_B(x)$ also decreases as the value of $\delta$ increases.
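A compact sketch of how samples from the three simulation models might be generated; the parameter names (delta, p, n_per_class), the choice of the direction vector a as all ones, and the fixed class sizes are our own assumptions, as the report does not specify them.

```python
import numpy as np

def simulate(model, delta, p=10, n_per_class=25, seed=0):
    """Generate one sample from the unequal-means, unequal-variances,
    or lognormal model described in Sections 2.5.1-2.5.3."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_per_class)
    a = np.ones(p)                                            # assumed direction vector
    if model in ("unequal_means", "lognormal"):
        mu = np.where(y[:, None] == 0, -delta * a, delta * a)
        X = mu + rng.normal(size=(2 * n_per_class, p))
        if model == "lognormal":
            X = np.exp(X)                                     # exponentiate the Gaussian draws
    elif model == "unequal_variances":
        sigma2 = 1 + np.arange(p) * (delta - 1) / (p - 1)     # sigma_i^2 for class 1
        sd = np.where(y[:, None] == 0, 1.0, np.sqrt(sigma2))
        X = rng.normal(size=(2 * n_per_class, p)) * sd
    else:
        raise ValueError(f"unknown model: {model}")
    return X, y

X, y = simulate("unequal_variances", delta=3.0)
print(X.shape, y.mean())                                      # (50, 10) 0.5
```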

2.6 Clinical Breast Cancer Expression Array

A total of 80 clinical arrays were obtained from a patient cohort collected at National Taiwan University Hospital (NTUH). These clinical arrays were generated using the Human 1A (version 2) oligonucleotide microarray (Agilent Technologies) according to the methods provided by the manufacturer (Lien et al., 2007). All patients had given informed consent according to guidelines approved by the Institutional Review Board (IRB) of NTUH. Paraffin sections of breast cancer specimens were stained with the CONFIRM anti-Estrogen Receptor (SP1) antibody (Ventana Medical Systems, Inc.) using the Ventana Autostainer (BenchMark LT Full System) (Ventana Medical Systems, Inc.). The control slides for tumor specimens were stained using the iVIEW DAB Detection Kit (Ventana Medical Systems, Inc.). All ER immunochemically stained slides were further examined by two experienced pathologists. Among the 80 clinical arrays there are 18 ER(−), 60 ER(+), and two unidentified specimens.

3 Results

1. In the normal and lognormal model settings, t/Welch and Wilcoxon performed only slightly better than CID in terms of retrieving the true ranking. However, t/Welch and Wilcoxon were not capable of identifying features with unequal variances.

[Figure: correlation with the true ranking, average of the first rank statistic, and average of the second rank statistic, plotted against delta for the Normal, Lognormal, and Normal (unequal variance) models; methods compared: CID.3, CID.10, t, and Wilcoxon.]


2. We used the top 3 features from each method to build classifiers by linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and the 3-nearest-neighbor method (3NN), respectively. The figure below presents the misclassification rates for each method.

[Figure: misclassification rates plotted against delta for the Normal, Lognormal, and Normal (unequal variance) models, with classifiers built from the top 3 features selected by CID.3, CID.10, t, and Wilcoxon; rows of panels correspond to LDA, QDA, and 3NN.]

3. The probability that an object belongs to either group had been estimated while computing the CID value on the training data. The Bayes rule assigns the object to group 1 if $\hat{p}_i > 0.5$ and to group 0 otherwise. We used the top 3 features selected by CID to build the Bayes classifier. The classification results (CondDist) were compared with those of linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and the 3-nearest-neighbor method (3NN). The Bayes classifier performed remarkably better under the lognormal model. As expected, LDA yielded the lowest misclassification rate for normal distributions with equal variances, and QDA yielded the lowest misclassification rate for normal distributions with unequal variances. Another drawback of the CID Bayes classifier was observed during the simulations. When the size of the training data is small, we might not observe all possible combinations of the categories of the three top features. This caused problems in allocating a newly observed sample to one of the subsets of the space, as well as in obtaining the estimated probability that the object belongs to either group.

[Figure: misclassification rates plotted against delta for the Normal, Lognormal, and Normal (unequal variance) models, comparing LDA, QDA, 3NN, and the CID-based Bayes classifier (CondDist).]


4. Three feature selection methods were applied to the 80 breast cancer clinical array data. Based on the top 50 features claimed by each method, we clustered the 80 specimens. The clustering results are shown in the figure below (from left to right: CID, t/Welch, and Wilcoxon). Observe that CID best separated the ER+/− specimens (marked brown for ER− and blue for ER+ below the top dendrogram). We can also speculate that the two specimens with unknown ER status may be ER+.

[Figure: hierarchical clustering dendrograms of the 80 specimens based on the top 50 features selected by each method (left to right: CID, t/Welch, Wilcoxon).]

5. By resubstitution and cross-validation, we estimated the misclassification rate for each method. The top 10 features were used to build the classifiers. The Wilcoxon rank-sum test claimed the top 50 features to be equally important, so it was excluded from this analysis. The figure below shows that the features selected by CID produced a lower misclassification rate than those selected by the t/Welch test.

[Figure: number of misclassified specimens versus the number of top genes (2 to 10) used to build the classifiers (LDA, QDA, 3NN; resubstitution and cross-validation), for features selected by CID and by the t test.]
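A rough sketch of how the resubstitution and cross-validated error counts above might be obtained with scikit-learn; the helper error_counts, the leave-one-out choice, and the loop over the number of top genes are our own assumptions rather than the report's actual implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

def error_counts(X, y, ranked_features, clf, n_top=10):
    """Resubstitution and leave-one-out CV error counts using the top-ranked features."""
    resub, cv = [], []
    for k in range(2, n_top + 1):
        Xk = X[:, ranked_features[:k]]
        clf.fit(Xk, y)
        resub.append(int((clf.predict(Xk) != y).sum()))
        acc = cross_val_score(clf, Xk, y, cv=LeaveOneOut())
        # each leave-one-out fold scores 0 or 1, so summing (1 - acc) counts the errors
        cv.append(int((1 - acc).sum()))
    return resub, cv

# Hypothetical usage with an expression matrix X (samples x genes), labels y,
# and a ranking of gene indices produced by, e.g., the CID criterion:
# resub, cv = error_counts(X, y, cid_ranking, LinearDiscriminantAnalysis())
```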


4 Discussion

High-throughput techniques, such as microarrays, make it possible to monitor the expression of thousands of genes simultaneously. One objective of microarray studies is to compare gene expression levels in two or several predetermined classes. Differential expression of genes can appear in different forms, for example, as differences in means and/or variances. Therefore, the need arises for a statistical method that can universally identify different patterns of gene expression.

In this study, we adopted the coefficient of intrinsic dependence (CID) to identify putative signatures for classification. Our results show that CID is promising in supervised learning. The simulations showed that CID is robust in selecting features whose means or variances differ between the two classes. When applied to the breast cancer clinical array data, the genes selected by CID best classified the ER+/− patients.

However, CID is not appropriate for immediate application to unsupervised learning. Although the misclassification rate of CID was as low as those of conventional methods in most cases, CID suffered from the curse of dimensionality the most. The small sample size relative to the number of variables (genes) is of particular concern in microarray studies. When the sample size of the training data is small, one might obtain a classifier that perfectly classifies the training sample but performs badly on other samples. When applying CID in classification, the curse of dimensionality strikes from another direction: one may not observe a particular combination of feature values in the training set even though that combination appears in the test set. In this scenario, the probability that the object belongs to a certain group is not estimable. Another way to estimate the conditional distribution, such as nonparametric smoothing, might solve the problem.

References

1. U. Braga-Neto, R. Hashimoto, E.R. Dougherty, D.V. Nguyen, and R.J. Carroll (2004). Is Cross-Validation Better than Resubstitution for Ranking Genes? Bioinformatics. 20(2): 253-258.

2. T. Hsing, L.-Y. D. Liu, M. Brun, and E.R. Dougherty (2005). The coefficient of intrinsic dependence (feature selection using el CID). Pattern Recognition. 38: 623-636.

3. H.C. Lien, Y.H. Hsiao, Y.S. Lin, Y.T. Yao, H.F. Juan, W.H. Kuo, M.C. Hung, K.J. Chang, and F.J. Hsieh (2007). Molecular signatures of metaplastic carcinoma of the breast by large-scale transcriptional profiling: identification of genes potentially related to epithelial-mesenchymal transition. Oncogene. 1-13.

4. A.K. Jain and D. Zongker (1997). Feature Selection: Evaluation, Application, and Small Sample Performance. IEEE Transactions on Pattern Analysis and Machine Intelligence. 19(2): 153-158.

5. A.K. Jain, R.P.W. Duin, and J. Mao (2000). Statistical Pattern Recognition: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence. 22(1): 4-37.

6. I. Shmulevich, E.R. Dougherty, S. Kim, and W. Zhang (2001). Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics. 18: 261-274.

7. C. Sima, S. Attoor, U. Braga-Neto, J. Lowey, E. Suh, and E.R. Dougherty (2005). Impact of error estimation on feature selection. Pattern Recognition. 38(12): 2472-2482.

8. V.G. Tusher, R. Tibshirani, and G. Chu (2001). Significance Analysis of Microarrays Applied to the Ionizing Radiation Response. Proceedings of the National Academy of Sciences. 98(9): 5116-5121.

Self-Evaluation of Project Results

As the report above shows, we carried out the data simulation and analysis according to the original project design and applied the method to real data, thereby achieving the project's intended goals. The results obtained are also very helpful for planning future research directions. We will further refine the methodology based on the observations reported here. We have written up the results as a technical report, which is currently being polished and will be submitted to an academic journal upon completion.
