### Analysis of SAGE Results with Combined Learning Techniques

Hsuan-Tien Lin and Ling Li

Learning Systems Group, Caltech

ECML/PKDD Discovery Challenge, 2005/10/07

### Outline

1 Difficulty in SAGE

2 Classification Techniques

3 Feature Selection Techniques

4 Error Estimation Techniques

5 Experimental Results

6 Conclusion

Difficulty in SAGE

### Problem Formulation

SAGE: serial analysis of gene expression

the larger dataset: 90 samples (libraries) $x_i$, each with 27679 features (counts of SAGE tags) $(x_i)_d$

labels $y_i$: 59 cancerous samples and 31 normal ones

**can we predict the cancerous status of the sample based on**
**the features given?**

DNA → mRNA → (?) biological process → cancerous status

SAGE → (?) machine learning

### Difficulty of the Problem

**how to build a classifier for the black box?**

many possibilities: linear models, decision trees, classifier ensembles, etc.

with 27679 features, any of the models above can usually realize every possible labeling of the 90 samples

– fitting the 90 samples perfectly is then no more informative than fitting a random labeling
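The overfitting risk can be illustrated with a small sketch that is not part of the original analysis. A plain perceptron stands in for "a linear model", and it is trained on random features with random labels. The sizes `N = 10` and `D = 50`, the helper names, and the seed are illustrative assumptions; the point is only that with many more features than samples, a perfect fit is easy to obtain even when the labels carry no signal.

```python
import random

def train_perceptron(X, y, max_epochs=1000):
    """Plain perceptron with a bias term; stops once every sample is fit."""
    D = len(X[0])
    w, b = [0.0] * D, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            score = sum(wd * xd for wd, xd in zip(w, xi)) + b
            if yi * score <= 0:            # misclassified or on the boundary
                w = [wd + yi * xd for wd, xd in zip(w, xi)]
                b += yi
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def training_accuracy(X, y, w, b):
    hits = 0
    for xi, yi in zip(X, y):
        score = sum(wd * xd for wd, xd in zip(w, xi)) + b
        hits += (1 if score > 0 else -1) == yi
    return hits / len(y)

rng = random.Random(0)
N, D = 10, 50                               # illustrative sizes, D much larger than N
X = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(N)]
y = [rng.choice([-1, 1]) for _ in range(N)]  # labels carry no signal at all

w, b = train_perceptron(X, y)
print("training accuracy on random labels:", training_accuracy(X, y, w, b))
```

A perfect training fit here says nothing about unseen samples, which is why the error must be estimated on held-out data.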

**should all features be used in the black box?**

not all features are useful (Alves et al. 2005)

some features may even be misleading

**how to compare different models?**

performance needs to be estimated with unseen samples

each sample is a precious one out of 90

### “Easiness” of the Problem

27679 features give each sample much information

procedure: feature selection, then train with 89 samples, and test on the remaining one

A: feature selection with 89 samples

B: feature selection with 90 samples

B gets to see the test sample during data “preprocessing.”

**how much does an extra sample in the “preprocessing” stage affect the prediction performance?**

B is significantly biased towards the single sample

[Figure: cross-validation error (%) versus number of features for procedures A and B]

1 any piece of information about the test sample can affect the result dramatically

2 be careful NOT to look at any test information
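The difference between procedures A and B can be sketched on synthetic data. Everything below is an illustrative assumption rather than the setup used in the slides: the data are pure noise, the ranking score is a simple label-feature covariance magnitude, the classifier is 1-nearest-neighbor, and the sizes are tiny. Since the labels are unrelated to the features, an honest error estimate should hover around 50%, and procedure B tends to report something lower.

```python
import random

def top_features(X, y, rows, M):
    """Rank features by |sum_i y_i * x_id| over the given rows; keep the top M."""
    def score(d):
        return abs(sum(y[i] * X[i][d] for i in rows))
    return sorted(range(len(X[0])), key=score, reverse=True)[:M]

def nearest_neighbor_label(X, y, train, feats, x):
    """Label of the closest training sample, using only the selected features."""
    best = min(train, key=lambda i: sum((X[i][d] - x[d]) ** 2 for d in feats))
    return y[best]

def leave_one_out_error(X, y, M, select_with_test_sample):
    everyone = list(range(len(X)))
    mistakes = 0
    for held_out in everyone:
        train = [i for i in everyone if i != held_out]
        # Procedure B ranks on all samples, procedure A on the training ones only.
        rows = everyone if select_with_test_sample else train
        feats = top_features(X, y, rows, M)
        guess = nearest_neighbor_label(X, y, train, feats, X[held_out])
        mistakes += guess != y[held_out]
    return mistakes / len(X)

rng = random.Random(1)
n, D, M = 30, 300, 5                        # illustrative sizes only
X = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(n)]
y = [1] * (n // 2) + [-1] * (n // 2)        # labels unrelated to the features

err_a = leave_one_out_error(X, y, M, select_with_test_sample=False)
err_b = leave_one_out_error(X, y, M, select_with_test_sample=True)
print(f"A (select on 29 samples): {err_a:.2f}   B (select on all 30): {err_b:.2f}")
```

The only difference between the two calls is which rows the ranking is allowed to see, which mirrors the A/B comparison above.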

### Our Approach of Analysis

combination of classification, feature selection, and error estimation techniques

use different combinations to show the relative usefulness of different techniques

systematic and repeatable on similar datasets

careful use of unseen samples

robust conclusion with multiple combinations and error estimations

Classification Techniques

### Classification Techniques

techniques that avoid overfitting

models that seem promising

four classification algorithms:

AdaBoost-Stump

SVM-Linear

SVM-Gaussian

SVM-Stump

**– a novel and promising paradigm through infinite ensemble learning (Lin and Li, ECML 2005)**

### Adaptive Boosting with Decision Stumps

model:

$$\hat{g}(x) = \mathrm{sign}\left(\sum_{t=1}^{T} w_t s_t(x)\right)$$

a finite ensemble of weak rules

each $s_t$ is a decision stump (thresholding rule on a SAGE tag)

– e.g. if the count of tag 200 is greater than 10, then cancerous

each $w_t$: a nonnegative weight for $s_t$

prediction: each $s_t$ tells whether the sample is cancerous, and $\hat{g}$ reports the weighted majority vote

automatically selects $\le T$ important tags and ignores the others in prediction
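The prediction rule can be written out as a minimal sketch. The three stumps, their thresholds, and the weights below are made up for illustration and are not learned from the SAGE data; only the form of the rule, a sign of a weighted vote of stumps, follows the model above.

```python
def stump(x, d, threshold, q=+1):
    """Decision stump: returns q if the count of tag d exceeds threshold, else -q."""
    return q if x[d] > threshold else -q

def ensemble_predict(x, stumps, weights):
    """sign(sum_t w_t * s_t(x)); +1 means cancerous, -1 means normal."""
    vote = sum(w * stump(x, d, thr, q) for w, (d, thr, q) in zip(weights, stumps))
    return 1 if vote > 0 else -1

# Hypothetical ensemble of T = 3 stumps, each given as (tag index, threshold, direction).
stumps = [(200, 10, +1), (17, 3, -1), (5, 0, +1)]
weights = [0.9, 0.4, 0.3]            # nonnegative weights w_t (made up)

sample = {200: 12, 17: 8, 5: 0}      # tag counts for one library (made up)
print(ensemble_predict(sample, stumps, weights))
```

Only the tags that appear in some stump (here 200, 17, and 5) influence the prediction, which is the sense in which the ensemble uses at most $T$ tags.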

### Support Vector Machine with Linear Kernel

model:

$$\hat{g}(x) = \mathrm{sign}\left(\sum_{d=1}^{D} w_d (x)_d + b\right)$$

a hyperplane in $\mathbb{R}^D$

– e.g. if the weighted sum of all counts is greater than 10, then cancerous

a large-margin hyperplane: clear separation between cancerous and normal samples

each $w_d$: sensitivity to changes in $(x)_d$

– a measure of the importance of tag $d$

### Support Vector Machine with Gaussian Kernel

model:

$$\hat{g}(x) = \mathrm{sign}\left(\sum_{i=1}^{N} y_i \lambda_i \exp\left(-\gamma \|x - x_i\|^2\right)\right)$$

a nonlinear classifier, similar to a radial basis function network

large-margin hyperplane in an infinite dimensional space

pros: powerful model, often good prediction performance

cons: time-consuming to choose parameter $\gamma$, hard to interpret
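The decision function can be evaluated directly once the $\lambda_i$ are known. The sketch below assumes two hand-picked training points with made-up multipliers and a made-up $\gamma$; it does not train an SVM, it only evaluates the model form given above.

```python
import math

def gaussian_kernel(x, z, gamma):
    """exp(-gamma * squared Euclidean distance between x and z)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_gauss_predict(x, points, labels, lambdas, gamma):
    """sign(sum_i y_i * lambda_i * K(x, x_i)), following the model above."""
    total = sum(y_i * lam_i * gaussian_kernel(x, x_i, gamma)
                for x_i, y_i, lam_i in zip(points, labels, lambdas))
    return 1 if total > 0 else -1

# Hypothetical values, chosen by hand for illustration.
points = [[0.0, 0.0], [4.0, 4.0]]
labels = [-1, +1]
lambdas = [1.0, 1.0]
gamma = 0.5

print(svm_gauss_predict([3.5, 4.0], points, labels, lambdas, gamma))
print(svm_gauss_predict([0.5, 0.0], points, labels, lambdas, gamma))
```

The prediction is driven by the nearby training points, and how quickly their influence decays is controlled by $\gamma$, which is why that parameter has to be chosen with care.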

### Support Vector Machine with Stump Kernel

model:

$$\hat{g}(x) = \mathrm{sign}\left(\sum_{d=1}^{D} \sum_{q \in \{-1,+1\}} \int w_{q,d}(\alpha)\, s_{q,d,\alpha}(x)\, d\alpha + b\right)$$

large-margin infinite ensemble of decision stumps: novel and promising

pros: powerful model, often good performance

superior power to AdaBoost-Stump due to infinity

superior power to SVM-Linear due to nonlinearity

faster parameter selection than SVM-Gauss

model: partially interpretable

– $w_{q,d}$ can estimate the importance of tag $d$
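The slides do not state how the infinite ensemble is computed. The sketch below assumes the closed form of the stump kernel described in Lin and Li (ECML 2005), $K(x, x') = \Delta - \|x - x'\|_1$, with the constant $\Delta$ taken as half the summed feature ranges; both the form and the choice of $\Delta$ are assumptions here, and the three toy samples are made up.

```python
def stump_kernel(x, z, delta):
    """Assumed closed form: delta minus the L1 distance between x and z."""
    return delta - sum(abs(a - b) for a, b in zip(x, z))

def gram_matrix(X, delta):
    """Kernel values between all pairs of samples."""
    return [[stump_kernel(xi, xj, delta) for xj in X] for xi in X]

# Toy tag counts for three samples (made up).
X = [[0, 3, 1], [2, 3, 0], [5, 0, 0]]

# Assumption: delta is half the sum of the per-feature ranges.
ranges = [max(col) - min(col) for col in zip(*X)]
delta = 0.5 * sum(ranges)

K = gram_matrix(X, delta)
for row in K:
    print(row)
```

Under this assumed form the kernel has no width parameter to tune, which would be consistent with the faster parameter selection claimed relative to SVM-Gauss.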

### Relative Comparison of Classification Techniques

all four have some degree of regularization: avoid overfitting

the first three were used in some gene/cancer related tasks

SVM-Stump is closely related to AdaBoost-Stump

pros and cons:

| | AdaBoost-Stump | SVM-Linear | SVM-Gauss | SVM-Stump |
|---|---|---|---|---|
| model power (*) | − | − | ↑ | ↑ |
| interpretability | ↑ | ↑ | ↓ | − |
| speed | ↑ | − | ↓ | − |

(*) it is hard to compare AdaBoost-Stump to SVM-Linear in power

Feature Selection Techniques

### Feature Selection with Ranking

Algorithm

1 rank (order) the features by their importance

2 select only the top $M$ features

a simple strategy

relies on a good ranking algorithm

three simple ranking algorithms:

Ranking with Fisher Score

Ranking with Linear Weight

Ranking with Stump Weight

the first two have been used in similar tasks

### Feature Ranking Techniques

Rank with Fisher Score (RFS):

how well can we use only $(x_i)_d$ to predict $y_i$?

Rank with Linear Weight (RLW):

what is the importance $w_d$ of $(x)_d$ in the hyperplane $\sum_{d=1}^{D} w_d (x)_d + b$ found by SVM-Linear?

Rank with Stump Weight (RSW):

what is the amount of decision stumps $\sum_{q} \int w_{q,d}^2(\alpha)\, d\alpha$ needed for feature $d$ in the ensemble

$$\sum_{d=1}^{D} \sum_{q \in \{-1,+1\}} \int w_{q,d}(\alpha)\, s_{q,d,\alpha}(x)\, d\alpha + b$$
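Two of the rankings can be sketched in a few lines. The slides do not give the exact Fisher score formula, so the sketch assumes a common definition, the squared gap between class means divided by the sum of class variances. Ranking linear weights by absolute value is likewise an assumption about how $w_d$ is turned into an ordering. The toy data and weight vector are made up.

```python
def fisher_score(values, labels):
    """Assumed form: (mean_pos - mean_neg)^2 / (var_pos + var_neg)."""
    pos = [v for v, t in zip(values, labels) if t == 1]
    neg = [v for v, t in zip(values, labels) if t == -1]
    mean_p, mean_n = sum(pos) / len(pos), sum(neg) / len(neg)
    var_p = sum((v - mean_p) ** 2 for v in pos) / len(pos)
    var_n = sum((v - mean_n) ** 2 for v in neg) / len(neg)
    return (mean_p - mean_n) ** 2 / (var_p + var_n + 1e-12)  # guard against 0/0

def rank_by_fisher_score(X, y):
    """Feature indices ordered from most to least important."""
    columns = list(zip(*X))
    return sorted(range(len(columns)),
                  key=lambda d: fisher_score(columns[d], y), reverse=True)

def rank_by_linear_weight(w):
    """Assumed RLW ordering: larger |w_d| means a more important tag d."""
    return sorted(range(len(w)), key=lambda d: abs(w[d]), reverse=True)

def select_top(X, ranking, M):
    """Keep only the top-M ranked features of every sample."""
    keep = ranking[:M]
    return [[row[d] for d in keep] for row in X]

# Toy data: 4 samples, 3 features; feature 0 separates the classes well.
X = [[10, 1, 5], [12, 0, 6], [1, 1, 5], [3, 0, 7]]
y = [1, 1, -1, -1]

ranking = rank_by_fisher_score(X, y)
print(ranking)
print(select_top(X, ranking, 2))
print(rank_by_linear_weight([0.1, -2.0, 0.5]))
```

Both functions return an ordering of feature indices, so either can feed the same top-$M$ selection step.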

Error Estimation Techniques

### Error Estimation Techniques

$v$-fold cross-validation: economical use of samples

training folds: $v - 1$ of the $v$ folds

test fold: the remaining fold is kept unseen

estimate: average error on the reduced test fold

$v$-fold CV is a random process: can be repeated many times

our setting: 10 fold×10, 5 fold×20, or 90 fold×1

90 fold: also called leave-one-out
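The three settings can be generated with a short sketch. The function name, the seed, and the plain (non-stratified) shuffling are assumptions made for illustration; the slides do not say how the folds were drawn.

```python
import random

def v_fold_splits(n, v, repeats, seed=0):
    """Yield (train_indices, test_indices) for `repeats` random v-fold partitions."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        folds = [order[k::v] for k in range(v)]     # v nearly equal folds
        for k in range(v):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test

for v, repeats in [(10, 10), (5, 20), (90, 1)]:
    splits = list(v_fold_splits(90, v, repeats))
    train, test = splits[0]
    print(f"{v} fold x {repeats}: {len(splits)} splits, "
          f"{len(train)} training / {len(test)} test samples each")
```

With 90 samples this gives 81/9, 72/18, and 89/1 training/test samples per split, so 10-fold trains on more samples than 5-fold, while leave-one-out tests on a single sample at a time.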

### Experiment Settings

Experiment Setting

1 Cross-validation split into training folds and a test fold

2 Feature ranking on the training folds

3 Feature selection by ranking (top 50, 100, 200, 500, 1000, or all 27679 features)

4 Classification on the reduced training folds

5 Test on the reduced test fold
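The five steps can be tied together in one loop. The sketch below takes the ranker and the classifier as arguments; the two stand-ins used in the example (a class-mean-gap ranker and a nearest-centroid classifier) and the six toy samples are assumptions for illustration, not the techniques compared in the slides.

```python
def cv_error_by_feature_count(X, y, splits, rank, fit, feature_counts):
    """Average test error per feature count M, ranking on training folds only."""
    errors = {M: [] for M in feature_counts}
    for train, test in splits:                                    # step 1
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        ranking = rank(X_tr, y_tr)                                # step 2
        for M in feature_counts:
            keep = ranking[:M]                                    # step 3
            predict = fit([[r[d] for d in keep] for r in X_tr], y_tr)  # step 4
            wrong = sum(predict([X[i][d] for d in keep]) != y[i]
                        for i in test)                            # step 5
            errors[M].append(wrong / len(test))
    return {M: sum(e) / len(e) for M, e in errors.items()}

def rank_by_mean_gap(X_tr, y_tr):
    """Stand-in ranker: absolute gap between the two class means."""
    def gap(d):
        pos = [r[d] for r, t in zip(X_tr, y_tr) if t == 1]
        neg = [r[d] for r, t in zip(X_tr, y_tr) if t == -1]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return sorted(range(len(X_tr[0])), key=gap, reverse=True)

def fit_centroids(X_tr, y_tr):
    """Stand-in classifier: predict the class with the nearer centroid."""
    def centroid(label):
        rows = [r for r, t in zip(X_tr, y_tr) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c_pos, c_neg = centroid(1), centroid(-1)
    def predict(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
        return 1 if d_pos < d_neg else -1
    return predict

# Toy data (made up): feature 0 separates the classes, the others are noise.
X = [[9, 1, 0], [8, 0, 1], [9, 1, 1], [1, 1, 0], [2, 0, 1], [1, 0, 0]]
y = [1, 1, 1, -1, -1, -1]
loo_splits = [([j for j in range(len(X)) if j != i], [i]) for i in range(len(X))]

result = cv_error_by_feature_count(X, y, loo_splits, rank_by_mean_gap,
                                   fit_centroids, [1, 3])
print(result)
```

The ranking is computed once per split from the training folds and reused for every feature count, so the test fold never influences which features are kept.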

Experimental Results

### Comparison of Classification Techniques

[Figure: cross-validation error (%) versus number of features for AdaBoost-Stump (T=100), AdaBoost-Stump (T=1000), SVM-Linear, SVM-Gauss, and SVM-Stump; panels: Ranking with Linear Weight and Ranking with Stump Weight]

results with 10 fold CV×10

AdaBoost-Stump is not good

SVM-Gauss is slightly worse than SVM-Linear

SVM-Stump is slightly better than SVM-Linear

### Comparison of Classification Techniques

SVM-Linear and SVM-Stump are the better choices

| | AdaBoost-Stump | SVM-Linear | SVM-Gauss | SVM-Stump |
|---|---|---|---|---|
| model power | − | − | ↑ | ↑ |
| interpretability | ↑ | ↑ | ↓ | − |
| speed | ↑ | − | ↓ | − |
| performance | ↓ | ↑ | ↑ | ↑ |

### Comparison of Feature Selection Techniques

[Figure: cross-validation error (%) versus number of features for the full set, RFS, RLW, and RSW; panels: SVM-Linear and SVM-Stump]

results with 10 fold CV×10

Ranking with Fisher Score is not good

Ranking with Stump Weight is slightly better than Ranking with Linear Weight

### Comparison of Error Estimation Techniques

[Figure: cross-validation error (%) versus number of features under Ranking with Fisher Score, for AdaBoost-Stump (T=100), AdaBoost-Stump (T=1000), SVM-Linear, SVM-Gauss, and SVM-Stump; panels: 10 fold×10 and 90 fold]

leave-one-out does not give stable or explainable results

### Comparison of Error Estimation Techniques

[Figure: cross-validation error (%) versus number of features under Ranking with Fisher Score, for AdaBoost-Stump (T=100), AdaBoost-Stump (T=1000), SVM-Linear, SVM-Gauss, and SVM-Stump; panels: 10 fold×10 and 5 fold×20]

similar conclusions from 5-fold and 10-fold CV

10-fold uses more samples for training

– better choice considering the importance of samples

Conclusion

### Conclusion

**carefully analyzed the difficult SAGE dataset**

legitimate information only

robust conclusion through multiple testing

classification: SVM-Linear and SVM-Stump are both promising

feature selection: RLW and RSW are both good

– possible to achieve better performance than with the full set

error estimation: 10-fold CV seems to be the better choice, and leave-one-out gives unstable results

how can we possibly distinguish between the linear model and the stump ensemble model?

are there more samples to verify the findings?

which model selects more biologically meaningful features?

which model is biologically more plausible?