
Analysis of SAGE Results with Combined Learning Techniques


(1)

Analysis of SAGE Results with Combined Learning Techniques

Hsuan-Tien Lin and Ling Li

Learning Systems Group, Caltech

ECML/PKDD Discovery Challenge, 2005/10/07

(2)

Outline

1 Difficulty in SAGE

2 Classification Techniques

3 Feature Selection Techniques

4 Error Estimation Techniques

5 Experimental Results

6 Conclusion

(3)

Difficulty in SAGE

Problem Formulation

SAGE: serial analysis of gene expressions

the larger dataset: 90 samples (libraries) x_i, each with 27679 features (counts of SAGE tags) (x_i)_d

labels yi: 59 cancerous samples, and 31 normal ones

can we predict the cancerous status of the sample based on the features given?

[diagram: DNA → mRNA → (?) → cancerous status, via the biological process; SAGE tags → (?) → cancerous status, via machine learning]

(4)

Difficulty in SAGE

Difficulty of the Problem

how to build a classifier for the black box?

many possibilities: linear models, decision trees, classifier ensembles, etc.

27679 features with any of the models above can usually cover all possible labelings of the 90 samples

– fitting the 90 samples perfectly is as poor as fitting a random labeling

should all features be used in the black box?

not all features are useful (Alves et al. 2005)
some features may even be misleading

how to compare different models?

performance needs to be estimated with unseen samples
each sample is a precious one out of 90

(5)

Difficulty in SAGE

“Easiness” of the Problem

27679 features give each sample much information

procedure: feature selection, then train with 89 samples, and test on the other

A: feature selection with 89 samples
B: feature selection with 90 samples

B gets the test sample in data “preprocessing”

how much does an extra sample in the “preprocessing” stage affect the prediction performance?

(6)

Difficulty in SAGE

“Easiness” of the Problem

procedure: feature selection, then train with 89, test on the other

A: feature selection with 89 samples
B: feature selection with 90 samples

B is significantly biased towards the single sample

[figure: cross-validation error (%) vs. number of features, for procedures A and B]

1 any piece of information can affect the result dramatically

2 be careful NOT to look at any test information
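the bias of procedure B can be reproduced in a few lines — a minimal sketch (synthetic data and a nearest-centroid classifier, not the authors' setup): labels are random, so any honest estimate should be near 50%, yet selecting features on all samples before leave-one-out drags the estimate down

```python
import numpy as np

# random labels: the true error of any classifier is 50%
rng = np.random.default_rng(0)
n, D, k = 60, 2000, 10
X = rng.standard_normal((n, D))
y = rng.choice([-1, 1], size=n)

def top_k(X, y, k):
    # rank features by a simple correlation-like score with the labels
    score = np.abs((X * y[:, None]).mean(axis=0))
    return np.argsort(score)[-k:]

def loo_error(select_inside):
    cols_all = top_k(X, y, k)      # procedure B: selection sees every sample
    errs = 0
    for i in range(n):
        tr = np.arange(n) != i
        cols = top_k(X[tr], y[tr], k) if select_inside else cols_all
        mu_p = X[tr][:, cols][y[tr] == 1].mean(axis=0)
        mu_m = X[tr][:, cols][y[tr] == -1].mean(axis=0)
        x = X[i, cols]
        pred = 1 if ((x - mu_p) ** 2).sum() < ((x - mu_m) ** 2).sum() else -1
        errs += (pred != y[i])
    return errs / n

err_A = loo_error(select_inside=True)    # honest estimate, near 0.5
err_B = loo_error(select_inside=False)   # optimistically biased, far below 0.5
print(err_A, err_B)
```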

(7)

Difficulty in SAGE

Our Approach of Analysis

combination of classification, feature selection, and error estimation techniques

use different combinations to show the relative usefulness of different techniques

systematic and repeatable on similar datasets

careful use of unseen samples

robust conclusion with multiple combinations and error estimations

(8)

Classification Techniques

Classification Techniques

techniques that avoid overfitting

models that seem promising

four classification algorithms:

AdaBoost-Stump
SVM-Linear
SVM-Gauss
SVM-Stump

– a novel and promising paradigm through infinite ensemble learning (Lin and Li, ECML 2005)

(9)

Classification Techniques

Adaptive Boosting with Decision Stumps

model:

ĝ(x) = sign( Σ_{t=1}^{T} w_t s_t(x) )

a finite ensemble of weak rules

each s_t is a decision stump (thresholding rule on a SAGE tag)
– e.g. if the count of tag 200 is greater than 10, then cancerous

each w_t: a nonnegative weight for s_t

prediction: each s_t tells whether the sample is cancerous, and ĝ reports the majority of weighted votes

automatically selects ≤ T important tags and ignores the others in prediction
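the prediction rule can be sketched with a hypothetical 3-stump ensemble (tag indices, thresholds, and weights invented for illustration):

```python
import numpy as np

# a hypothetical ensemble: each stump is (tag index d, threshold theta, direction q)
# s_t(x) = q if (x)_d > theta else -q; +1 means "cancerous"
stumps = [(200, 10.0, +1), (57, 3.0, -1), (941, 0.5, +1)]
w = np.array([0.8, 0.5, 0.3])        # nonnegative weights w_t

def stump(x, d, theta, q):
    return q * (1.0 if x[d] > theta else -1.0)

def g_hat(x):
    votes = np.array([stump(x, d, t, q) for d, t, q in stumps])
    return int(np.sign(w @ votes))   # majority of weighted votes

x = np.zeros(1000)
x[200] = 12.0                        # count of tag 200 exceeds 10
print(g_hat(x))                      # weighted votes: +0.8 +0.5 -0.3 > 0
```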

(10)

Classification Techniques

Support Vector Machine with Linear Kernel

model:

ĝ(x) = sign( Σ_{d=1}^{D} w_d (x)_d + b )

a hyperplane in R^D

– e.g. if the weighted sum of all counts is greater than 10, then cancerous

a large-margin hyperplane: clear separation between cancerous and normal samples

each w_d: sensitivity to change of (x)_d
– a measure of the importance of tag d
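the "|w_d| as importance" idea can be sketched with a Pegasos-style subgradient trainer (not the authors' solver; no bias term, synthetic data where only tags 7 and 42 matter):

```python
import numpy as np

rng = np.random.default_rng(1)
n, D = 90, 500
X = rng.standard_normal((n, D))
y = np.where(X[:, 7] + X[:, 42] > 0, 1, -1)   # only tags 7 and 42 are informative

# Pegasos-style stochastic subgradient descent on the SVM objective
w, lam = np.zeros(D), 0.01
for t in range(1, 20001):
    i = rng.integers(n)
    eta = 1.0 / (lam * t)
    violated = y[i] * (X[i] @ w) < 1          # hinge loss active?
    w *= (1 - eta * lam)                      # shrink: large-margin regularizer
    if violated:
        w += eta * y[i] * X[i]

ranking = np.argsort(-np.abs(w))              # importance of tag d ≈ |w_d|
print(ranking[:10])
```

the informative tags should float to the top of the ranking, which is exactly how RLW (slide 15) reuses the hyperplane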

(11)

Classification Techniques

Support Vector Machine with Gaussian Kernel

model:

ĝ(x) = sign( Σ_{i=1}^{N} y_i λ_i exp(−γ ‖x − x_i‖²) )

a nonlinear classifier, similar to a radial basis function network

a large-margin hyperplane in an infinite-dimensional space

pros: powerful model, often good prediction performance

cons: time-consuming to choose the parameter γ; hard to interpret
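evaluating the decision function is straightforward once the λ_i are known — a sketch with hypothetical support coefficients (real λ_i would come from SVM training):

```python
import numpy as np

# three training samples, their labels, and hypothetical coefficients lambda_i
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
y = np.array([1, 1, -1])
lam = np.array([0.7, 0.3, 1.0])
gamma = 0.5

def g_hat(x):
    # Gaussian kernel weight of each training sample, then a signed vote
    k = np.exp(-gamma * np.sum((X - x) ** 2, axis=1))
    return int(np.sign(np.sum(y * lam * k)))

print(g_hat(np.array([0.5, 0.5])))   # near the two positive samples
print(g_hat(np.array([4.0, 4.0])))   # on top of the negative sample
```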

(12)

Classification Techniques

Support Vector Machine with Stump Kernel

model:

ĝ(x) = sign( Σ_{d=1}^{D} Σ_{q∈±1} ∫ w_{q,d}(α) s_{q,d,α}(x) dα + b )

large-margin infinite ensemble of decision stumps: novel and promising

pros: powerful model, often good performance

superior power to AdaBoost-Stump due to infinity
superior power to SVM-Linear due to nonlinearity
faster parameter selection than SVM-Gauss

model: partially interpretable
– w_{q,d} can estimate the importance of tag d
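the infinite ensemble is trained as an ordinary SVM with the stump kernel, which Lin and Li (2005) show has a simple closed form — up to scaling and an additive constant Δ, the negative ℓ1 distance; a sketch of the kernel matrix computation under that form (Δ chosen arbitrarily here):

```python
import numpy as np

def stump_kernel(X, Z, delta):
    # pairwise K(x, z) = delta - ||x - z||_1, the stump kernel up to constants
    return delta - np.abs(X[:, None, :] - Z[None, :, :]).sum(axis=2)

X = np.array([[0.0, 1.0], [2.0, 3.0], [5.0, 1.0]])
K = stump_kernel(X, X, delta=20.0)   # feed K to any SVM solver that accepts
print(K)                             # a precomputed kernel matrix
```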

(13)

Classification Techniques

Relative Comparison of Classification Techniques

all four have some degree of regularization: avoid overfitting

the first three were used in some gene/cancer related tasks

SVM-Stump is closely related to AdaBoost-Stump

pros and cons:

                   AdaBoost-Stump   SVM-Linear   SVM-Gauss   SVM-Stump
model power (*)          −               −            ↑           ↑
interpretability         ↑               ↑            ↓           −
speed                    ↑               −            ↓           −

(*) it is hard to compare AdaBoost-Stump to SVM-Linear in power

(14)

Feature Selection Techniques

Feature Selection with Ranking

Algorithm

1 rank (order) the features by their importance

2 select only the top M features

a simple strategy; relies on a good ranking algorithm

three simple ranking algorithms:

Ranking with Fisher Score
Ranking with Linear Weight
Ranking with Stump Weight

the first two have been used in similar tasks

(15)

Feature Selection Techniques

Feature Ranking Techniques

Rank with Fisher Score (RFS):
how well can we use only (x_i)_d to predict y_i?

Rank with Linear Weight (RLW):
what is the importance w_d of (x)_d in the hyperplane Σ_d w_d (x)_d + b found by SVM-Linear?

Rank with Stump Weight (RSW):
what is the amount Σ_{q∈±1} ∫ w²_{q,d}(α) dα of decision stumps needed for feature d in the SVM-Stump ensemble

ĝ(x) = sign( Σ_{d=1}^{D} Σ_{q∈±1} ∫ w_{q,d}(α) s_{q,d,α}(x) dα + b )
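RFS is the only ranking above that needs no trained model — a sketch of the per-feature Fisher score F_d = (μ⁺_d − μ⁻_d)² / (σ⁺_d² + σ⁻_d²) on synthetic data (tag 5 made informative by construction):

```python
import numpy as np

def fisher_scores(X, y):
    # between-class separation over within-class spread, per feature
    pos, neg = X[y == 1], X[y == -1]
    num = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    den = pos.var(axis=0) + neg.var(axis=0) + 1e-12   # guard zero variance
    return num / den

rng = np.random.default_rng(2)
X = rng.standard_normal((90, 1000))
y = np.array([1] * 59 + [-1] * 31)
X[y == 1, 5] += 2.0                  # shift tag 5 for the cancerous class

ranking = np.argsort(-fisher_scores(X, y))   # descending importance
print(ranking[0])                            # tag 5 should rank first
```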

(16)

Error Estimation Techniques

Error Estimation Techniques

v-fold cross-validation: economic use of samples

training folds: v−1 of the v folds
test fold: the remaining fold, reserved unseen
estimate: average error on the reduced test fold

v-fold CV is a random process: it can be repeated many times

our settings: 10 fold × 10, 5 fold × 20, or 90 fold × 1

90 fold: also called leave-one-out
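the repeated v-fold estimate can be sketched as a small harness (exercised below with a trivial majority-class baseline, not one of the four classifiers):

```python
import numpy as np

def cv_error(X, y, classify, v=10, repeats=10, seed=0):
    # shuffle, split into v folds, average test-fold errors over all repeats
    rng = np.random.default_rng(seed)
    n, errs = len(y), []
    for _ in range(repeats):
        idx = rng.permutation(n)
        for fold in np.array_split(idx, v):
            train = np.setdiff1d(idx, fold)
            pred = classify(X[train], y[train], X[fold])
            errs.append(np.mean(pred != y[fold]))
    return float(np.mean(errs))

def majority(Xtr, ytr, Xte):
    # baseline: always predict the training folds' majority class
    lab = 1 if np.sum(ytr == 1) >= np.sum(ytr == -1) else -1
    return np.full(len(Xte), lab)

X = np.zeros((90, 3))
y = np.array([1] * 59 + [-1] * 31)    # the SAGE class proportions
err = cv_error(X, y, majority)
print(round(err, 3))                   # 31/90 ≈ 0.344, the minority fraction
```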

(17)

Error Estimation Techniques

Experiment Settings

1 Cross-validation splitting into training folds / test fold

2 Feature ranking on training folds

3 Feature selection by ranking (top 50, 100, 200, 500, 1000, or all 27679)

4 Classification on the reduced training folds

5 Test on the reduced test fold
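the five steps for one CV split can be sketched as a skeleton — the rank/fit/predict pieces below are toy stand-ins, but the ordering is the point: ranking sees the training folds only

```python
import numpy as np

def run_split(X, y, train, test, M, rank, fit, predict):
    order = rank(X[train], y[train])                 # 2. rank on training folds
    cols = order[:M]                                 # 3. keep the top M features
    model = fit(X[np.ix_(train, cols)], y[train])    # 4. train on reduced folds
    pred = predict(model, X[np.ix_(test, cols)])     # 5. test on reduced fold
    return float(np.mean(pred != y[test]))

# toy instantiations so the skeleton runs end to end: correlation ranking
# and a nearest-centroid classifier (not the techniques of the talk)
rank = lambda X, y: np.argsort(-np.abs((X * y[:, None]).mean(axis=0)))
fit = lambda X, y: (X[y == 1].mean(0), X[y == -1].mean(0))
predict = lambda m, X: np.where(((X - m[0]) ** 2).sum(1) < ((X - m[1]) ** 2).sum(1), 1, -1)

rng = np.random.default_rng(3)
X = rng.standard_normal((90, 200))
X[:59, 0] += 3.0                                     # tag 0 is informative
y = np.array([1] * 59 + [-1] * 31)
err = run_split(X, y, np.arange(80), np.arange(80, 90), 50, rank, fit, predict)
print(err)
```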

(18)

Experimental Results

Comparison of Classification Techniques

[two figures: cross-validation error (%) vs. number of features, under Ranking with Linear Weight (left) and Ranking with Stump Weight (right); curves: AdaBoost-Stump (T=100, T=1000), SVM-Linear, SVM-Gauss, SVM-Stump]

results with 10 fold CV × 10

AdaBoost-Stump is not good
SVM-Gauss is slightly worse than SVM-Linear
SVM-Stump is slightly better than SVM-Linear

(19)

Experimental Results

Comparison of Classification Techniques

SVM-Linear and SVM-Stump are the better choices

                   AdaBoost-Stump   SVM-Linear   SVM-Gauss   SVM-Stump
model power              −               −            ↑           ↑
interpretability         ↑               ↑            ↓           −
speed                    ↑               −            ↓           −
performance              ↓               ↑            ↑           ↑

(20)

Experimental Results

Comparison of Feature Selection Techniques

[two figures: cross-validation error (%) vs. number of features, for SVM-Linear (left) and SVM-Stump (right); curves: full set, RFS, RLW, RSW]

results with 10 fold CV × 10

Ranking with Fisher Score is not good
Ranking with Stump Weight is slightly better than Ranking with Linear Weight

(21)

Experimental Results

Comparison of Error Estimation Techniques

[two figures: cross-validation error (%) vs. number of features, under Ranking with Fisher Score, estimated with 10 fold × 10 (left) and 90 fold (right); curves: AdaBoost-Stump (T=100, T=1000), SVM-Linear, SVM-Gauss, SVM-Stump]

leave-one-out does not give stable and explainable results

(22)

Experimental Results

Comparison of Error Estimation Techniques

[two figures: cross-validation error (%) vs. number of features, under Ranking with Fisher Score, estimated with 10 fold × 10 (left) and 5 fold × 20 (right); curves: AdaBoost-Stump (T=100, T=1000), SVM-Linear, SVM-Gauss, SVM-Stump]

similar conclusions from 5 fold and 10 fold CV

10 fold uses more samples for training
– the better choice considering the importance of samples

(23)

Conclusion

Conclusion

carefully analyzed the difficult SAGE dataset

used legitimate information only

robust conclusion through multiple testing

classification: SVM-Linear and SVM-Stump are both promising

feature selection: RLW and RSW are both good
– possible to achieve better performance than the full set

error estimation: 10-fold CV seems to be the better choice; leave-one-out is bad

how can we possibly distinguish between the linear model and the stump ensemble model?

are there more samples to verify the findings?

which model selects more biologically meaningful features?

which model is biologically more plausible?
