Analysis of SAGE Results with Combined Learning Techniques
Hsuan-Tien Lin and Ling Li
Learning Systems Group, Caltech
ECML/PKDD Discovery Challenge, 2005/10/07
Outline
1 Difficulty in SAGE
2 Classification Techniques
3 Feature Selection Techniques
4 Error Estimation Techniques
5 Experimental Results
6 Conclusion
Difficulty in SAGE
Problem Formulation
SAGE: serial analysis of gene expression
the larger dataset: 90 samples (libraries) x_i, each with 27679 features (x_i)_d (counts of SAGE tags)
labels y_i: 59 cancerous samples and 31 normal ones
can we predict the cancerous status of the sample based on the features given?
[Diagram: DNA → mRNA → (?) biological process → cancerous status; SAGE → (?) machine learning]
Difficulty in SAGE
Difficulty of the Problem
how to build a classifier for the black box?
many possibilities: linear models, decision trees, classifier ensembles, etc.
with 27679 features, any of the models above can usually realize every possible labeling of the 90 samples
– fitting perfectly on 90 samples is as poor as fitting a random labeling
should all features be used in the black box?
not all features are useful (Alves et al. 2005)
some features may even be misleading
how to compare different models?
performance needs to be estimated with unseen samples
each sample is a precious one out of 90
Difficulty in SAGE
“Easiness” of the Problem
27679 features give each sample much information
procedure: feature selection, then train with 89 samples, and test on the other
A: feature selection with 89 samples
B: feature selection with 90 samples
B gets to see the test sample during data “preprocessing”
how much does an extra sample in the “preprocessing” stage affect the prediction performance?
Difficulty in SAGE
“Easiness” of the Problem
procedure: feature selection, then train with 89 samples, and test on the other
A: feature selection with 89 samples
B: feature selection with 90 samples
B is significantly biased towards the single sample
[Figure: cross-validation error (%) vs. number of features for procedures A and B]
1 any piece of information can affect the result dramatically
2 careful NOT to look at any test information
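The comparison above can be reproduced with a short script. Below is a minimal sketch, assuming ±1 labels and using scikit-learn's ANOVA F-score merely as a stand-in for the ranking step; the function names and the choice of M are illustrative, not the authors' code.

import numpy as np
from sklearn.feature_selection import f_classif   # ANOVA F-score: stand-in ranking
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loo_error(X, y, M, select_inside_cv):
    # leave-one-out estimate; rank features either inside (A) or outside (B) the loop
    scores_all = np.nan_to_num(f_classif(X, y)[0])        # procedure B: sees every sample
    errors = []
    for train, test in LeaveOneOut().split(X):
        scores = (np.nan_to_num(f_classif(X[train], y[train])[0])
                  if select_inside_cv else scores_all)
        top = np.argsort(scores)[::-1][:M]                 # keep only the top-M ranked tags
        clf = SVC(kernel="linear").fit(X[train][:, top], y[train])
        errors.append(clf.predict(X[test][:, top])[0] != y[test][0])
    return float(np.mean(errors))

# err_A = loo_error(X, y, M=200, select_inside_cv=True)    # legitimate estimate
# err_B = loo_error(X, y, M=200, select_inside_cv=False)   # optimistically biased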
Difficulty in SAGE
Our Approach of Analysis
combination of classification, feature selection, and error estimation techniques
use different combinations to show the relative usefulness of different techniques
systematic and repeatable on similar datasets
careful use of unseen samples
robust conclusion with multiple combinations and error estimations
Classification Techniques
Classification Techniques
techniques that avoid overfitting
models that seem promising
four classification algorithms:
AdaBoost-Stump
SVM-Linear
SVM-Gaussian
SVM-Stump
– a novel and promising paradigm through infinite ensemble learning (Lin and Li, ECML 2005)
Classification Techniques
Adaptive Boosting with Decision Stumps
model: $\hat{g}(x) = \mathrm{sign}\left(\sum_{t=1}^{T} w_t s_t(x)\right)$
a finite ensemble of weak rules
each s_t is a decision stump (a thresholding rule on a SAGE tag)
– e.g. if the count of tag 200 is greater than 10, then cancerous
each w_t: a nonnegative weight for s_t
prediction: each s_t tells whether the sample is cancerous, and $\hat{g}$ reports the weighted majority vote
automatically selects at most T important tags and ignores the others in prediction
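A minimal sketch of the prediction rule above, assuming the stumps s_t and weights w_t have already been produced by AdaBoost; the tag indices, thresholds, and weights in the usage comment are made up for illustration. In practice an off-the-shelf boosting routine (for example scikit-learn's AdaBoostClassifier with depth-1 trees) fills the same role.

def stump(d, theta, q):
    # decision stump s(x): returns +q if the count of tag d exceeds theta, else -q
    return lambda x: q if x[d] > theta else -q

def adaboost_stump_predict(x, stumps, weights):
    # g_hat(x) = sign( sum_t w_t * s_t(x) ): weighted majority vote of the stumps
    vote = sum(w * s(x) for w, s in zip(weights, stumps))
    return 1 if vote > 0 else -1

# illustrative usage with two stumps (hypothetical tags and weights):
# ensemble = [stump(200, 10, +1), stump(513, 3, -1)]
# label = adaboost_stump_predict(sample, ensemble, weights=[0.7, 0.4])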
Classification Techniques
Support Vector Machine with Linear Kernel
model: $\hat{g}(x) = \mathrm{sign}\left(\sum_{d=1}^{D} w_d (x)_d + b\right)$
a hyperplane in $\mathbb{R}^D$
– e.g. if the weighted sum of all counts is greater than 10, then cancerous
a large-margin hyperplane: clear separation between cancerous and normal samples
each w_d: sensitivity to changes in (x)_d – a measure of the importance of tag d
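A rough sketch of fitting and inspecting this model with scikit-learn's SVC; the synthetic data and the value of C are placeholders, not the settings used in the talk.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(90, 27679)).astype(float)   # synthetic stand-in for SAGE counts
y = np.where(np.arange(90) < 59, 1, -1)                 # 59 cancerous, 31 normal (labels only)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_.ravel(), clf.intercept_[0]              # per-tag weights w_d and bias b
decision = np.sign(X @ w + b)                            # same rule as sign(sum_d w_d (x)_d + b)
important_tags = np.argsort(np.abs(w))[::-1][:20]        # tags with the largest |w_d|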
Classification Techniques
Support Vector Machine with Gaussian Kernel
model: $\hat{g}(x) = \mathrm{sign}\left(\sum_{i=1}^{N} y_i \lambda_i \exp\bigl(-\gamma \|x - x_i\|^2\bigr)\right)$
a nonlinear classifier, similar to a radial basis function network
large-margin hyperplane in an infinite dimensional space
pros: powerful model, often good prediction performance
cons: time-consuming to choose the parameter γ, hard to interpret
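One common way to handle the costly choice of γ is a cross-validated grid search; a sketch below, where the grids of γ and C are illustrative values rather than those used in the talk.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"gamma": [1e-6, 1e-4, 1e-2, 1.0], "C": [0.1, 1.0, 10.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
# search.fit(X_train, y_train)              # must use training folds only (see above)
# best_gamma = search.best_params_["gamma"]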
Classification Techniques
Support Vector Machine with Stump Kernel
model: $\hat{g}(x) = \mathrm{sign}\left(\sum_{d=1}^{D} \sum_{q \in \{\pm 1\}} \int w_{q,d}(\alpha)\, s_{q,d,\alpha}(x)\, d\alpha + b\right)$
large-margin infinite ensemble of decision stumps: novel and promising
pros: powerful model, often good performance
superior power to AdaBoost-Stump due to the infinite ensemble
superior power to SVM-Linear due to nonlinearity
faster parameter selection than SVM-Gauss
model: partially interpretable
– w_{q,d} can estimate the importance of tag d
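Lin and Li (ECML 2005) show this infinite ensemble can be trained with a standard SVM once the corresponding stump kernel is plugged in; up to a constant and scaling, that kernel is the negative ℓ1 distance between samples. A rough sketch using scikit-learn's custom-kernel interface; the constant DELTA below is an illustrative stand-in for the exact value derived in the paper.

from scipy.spatial.distance import cdist
from sklearn.svm import SVC

DELTA = 1.0e4   # illustrative; the paper derives the proper constant from the feature ranges

def stump_kernel(X, Y):
    # stump kernel ~ constant minus the L1 (cityblock) distance between samples
    return DELTA - cdist(X, Y, metric="cityblock")

# clf = SVC(kernel=stump_kernel, C=1.0).fit(X_train, y_train)
# predictions = clf.predict(X_test)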
Classification Techniques
Relative Comparison of Classification Techniques
all four have some degree of regularization: avoid overfitting
the first three were used in some gene/cancer related tasks
SVM-Stump is closely related to AdaBoost-Stump
pros and cons:

                   AdaBoost-Stump  SVM-Linear  SVM-Gauss  SVM-Stump
model power (*)          −             −           ↑          ↑
interpretability         ↑             ↑           ↓          −
speed                    ↑             −           ↓          −

(*) it is hard to compare AdaBoost-Stump to SVM-Linear in power
Feature Selection Techniques
Feature Selection with Ranking
Algorithm
1 rank (order) the features by their importance
2 select only the top M features
a simple strategy
relies on a good ranking algorithm
three simple ranking algorithms:
Ranking with Fisher Score
Ranking with Linear Weight
Ranking with Stump Weight
the first two have been used in similar tasks
Feature Selection Techniques
Feature Ranking Techniques
Rank with Fisher Score (RFS):
how well can we use only $(x_i)_d$ to predict $y_i$?
Rank with Linear Weight (RLW):
what is the importance $w_d$ of $(x)_d$ in the hyperplane $\sum_d w_d (x)_d + b$ found by SVM-Linear?
Rank with Stump Weight (RSW):
what is the amount of decision stumps $\sum_{q} \int w_{q,d}^2(\alpha)\, d\alpha$ needed for feature d in the ensemble $\sum_{d=1}^{D} \sum_{q \in \{\pm 1\}} \int w_{q,d}(\alpha)\, s_{q,d,\alpha}(x)\, d\alpha + b$?
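Rough sketches of RFS and RLW, assuming ±1 labels; the Fisher-score formula is the usual two-class form, which may differ in minor details from the exact score used in the talk, and RSW is omitted because it needs the stump-kernel dual solution from the paper.

import numpy as np
from sklearn.svm import SVC

def rank_fisher_score(X, y):
    # RFS: per-tag score (mu+ - mu-)^2 / (var+ + var-); larger means more discriminative
    pos, neg = X[y == 1], X[y == -1]
    score = (pos.mean(0) - neg.mean(0)) ** 2 / (pos.var(0) + neg.var(0) + 1e-12)
    return np.argsort(score)[::-1]          # tag indices, most informative first

def rank_linear_weight(X, y):
    # RLW: rank tags by |w_d| taken from the hyperplane found by SVM-Linear
    w = SVC(kernel="linear").fit(X, y).coef_.ravel()
    return np.argsort(np.abs(w))[::-1]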
Error Estimation Techniques
Error Estimation Techniques
v-fold cross-validation: economic use of samples
training folds: v−1 of the v folds
test fold: the remaining fold, reserved as unseen
estimate: average error on the reduced test fold
v-fold CV is a random process: it can be repeated many times
our setting: 10-fold × 10, 5-fold × 20, or 90-fold × 1
90-fold: also called leave-one-out
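The three settings can be set up directly with scikit-learn; a small sketch below, where stratified folds and the random seed are assumed choices (the talk does not specify them) and 90-fold on 90 samples coincides with leave-one-out.

from sklearn.model_selection import LeaveOneOut, RepeatedStratifiedKFold

cv_10x10 = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
cv_5x20  = RepeatedStratifiedKFold(n_splits=5,  n_repeats=20, random_state=0)
cv_loo   = LeaveOneOut()                      # 90-fold on 90 samples = leave-one-out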
Error Estimation Techniques
Experiment Setting
1 Cross-validation splitting to training folds/test fold
2 Feature ranking on training folds
3 Feature selection by ranking: keep the top M features (M = 50, 100, 200, 500, 1000, or 27679)
4 Classification on the reduced training folds
5 Test on the reduced test fold
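Putting the five steps together, a condensed sketch; the ranking function, classifier factory, and fold object are the hypothetical pieces from the earlier sketches, and the key point is that ranking (step 2) happens inside the loop, on the training folds only.

import numpy as np

def cv_error(X, y, cv, rank_features, make_classifier, M):
    # steps 1-5: split, rank on the training folds, keep the top-M tags, train, test
    errors = []
    for train, test in cv.split(X, y):
        top = rank_features(X[train], y[train])[:M]                      # steps 2-3
        clf = make_classifier().fit(X[train][:, top], y[train])          # step 4
        errors.append(np.mean(clf.predict(X[test][:, top]) != y[test]))  # step 5
    return float(np.mean(errors))

# e.g. (hypothetical wiring of the earlier sketches):
# from sklearn.svm import SVC
# err = cv_error(X, y, cv_10x10, rank_linear_weight, lambda: SVC(kernel="linear"), M=200)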
Experimental Results
Comparison of Classification Techniques
[Figure: cross-validation error (%) vs. number of features, under Ranking with Linear Weight (left) and Ranking with Stump Weight (right); curves for AdaBoost-Stump (T=100, T=1000), SVM-Linear, SVM-Gauss, and SVM-Stump]
results with 10-fold CV × 10
AdaBoost-Stump is not good
SVM-Gauss is slightly worse than SVM-Linear
SVM-Stump is slightly better than SVM-Linear
Experimental Results
Comparison of Classification Techniques
SVM-Linear and SVM-Stump are the better choices
                   AdaBoost-Stump  SVM-Linear  SVM-Gauss  SVM-Stump
model power              −             −           ↑          ↑
interpretability         ↑             ↑           ↓          −
speed                    ↑             −           ↓          −
performance              ↓             ↑           ↑          ↑
Experimental Results
Comparison of Feature Selection Techniques
[Figure: cross-validation error (%) vs. number of features, for SVM-Linear (left) and SVM-Stump (right); curves for the full feature set, RFS, RLW, and RSW]
results with 10-fold CV × 10
Ranking with Fisher Score is not good
Ranking with Stump Weight is slightly better than with Linear Weight
Experimental Results
Comparison of Error Estimation Techniques
[Figure: cross-validation error (%) vs. number of features, Ranking with Fisher Score, under 10-fold CV × 10 (left) and 90-fold CV / leave-one-out (right); curves for AdaBoost-Stump (T=100, T=1000), SVM-Linear, SVM-Gauss, and SVM-Stump]
leave-one-out does not give stable and explainable results
Experimental Results
Comparison of Error Estimation Techniques
[Figure: cross-validation error (%) vs. number of features, Ranking with Fisher Score, under 10-fold CV × 10 (left) and 5-fold CV × 20 (right); curves for AdaBoost-Stump (T=100, T=1000), SVM-Linear, SVM-Gauss, and SVM-Stump]
similar conclusions from 5-fold and 10-fold CV
10-fold uses more samples for training
– better choice considering the importance of samples
Conclusion
Conclusion
carefully analyzed the difficult SAGE dataset
legitimate information only
robust conclusion through multiple testing
classification: SVM-Linear and SVM-Stump are both promising
feature selection: RLW and RSW are both good
– possible to achieve better performance than full set
error estimation: 10-fold CV seems to be the better choice, and leave-one-out is bad
how can we possibly distinguish between the linear model and the stump ensemble model?
are there more samples to verify the findings?
which model selects more biologically meaningful features?
which model is biologically more plausible?