Analysis of SAGE Results with Combined Learning Techniques
Hsuan-Tien Lin and Ling Li
Learning Systems Group, Caltech
ECML/PKDD Discovery Challenge, 2005/10/07
Outline
1 Difficulty in SAGE
2 Classification Techniques
3 Feature Selection Techniques
4 Error Estimation Techniques
5 Experimental Results
6 Conclusion
Difficulty in SAGE
Problem Formulation
SAGE: serial analysis of gene expression
the larger dataset: 90 samples (libraries) x_i, each with 27679 features (x_i)_d (counts of SAGE tags)
labels y_i: 59 cancerous samples and 31 normal ones
can we predict the cancerous status of the sample based on the features given?
[Diagram: DNA → mRNA → (?) biological process → cancerous status; SAGE → (?) machine learning]
Difficulty in SAGE
Difficulty of the Problem
how to build a classifier for the black box?
many possibilities: linear models, decision trees, classifier ensembles, etc.
with 27679 features, any of the models above can usually realize every possible labeling of the 90 samples
– fitting perfectly on 90 samples is as poor as fitting a random labeling
should all features be used in the black box?
not all features are useful (Alves et al. 2005)
some features may even be misleading
how to compare different models?
performance needs to be estimated with unseen samples
each sample is a precious one out of 90
Difficulty in SAGE
“Easiness” of the Problem
27679 features give each sample much information
procedure: feature selection, then train with 89 samples, and test on the other
A: feature selection with 89 samples
B: feature selection with 90 samples
B gets to see the test sample during data “preprocessing”
how much does an extra sample in the “preprocessing” stage affect the prediction performance?
Difficulty in SAGE
“Easiness” of the Problem
procedure: feature selection, then train with 89 samples, and test on the other
A: feature selection with 89 samples
B: feature selection with 90 samples
B is significantly biased towards the single sample
[Figure: cross-validation error (%) vs. number of features for procedures A and B]
1 any piece of information can affect the result dramatically
2 careful NOT to look at any test information
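The comparison above can be reproduced with a short script. Below is a minimal sketch, assuming ±1 labels and using scikit-learn's ANOVA F-score merely as a stand-in for the ranking step; the function names and the choice of M are illustrative, not the authors' code.

import numpy as np
from sklearn.feature_selection import f_classif   # ANOVA F-score: stand-in ranking
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loo_error(X, y, M, select_inside_cv):
    # leave-one-out estimate; rank features either inside (A) or outside (B) the loop
    scores_all = np.nan_to_num(f_classif(X, y)[0])        # procedure B: sees every sample
    errors = []
    for train, test in LeaveOneOut().split(X):
        scores = (np.nan_to_num(f_classif(X[train], y[train])[0])
                  if select_inside_cv else scores_all)
        top = np.argsort(scores)[::-1][:M]                 # keep only the top-M ranked tags
        clf = SVC(kernel="linear").fit(X[train][:, top], y[train])
        errors.append(clf.predict(X[test][:, top])[0] != y[test][0])
    return float(np.mean(errors))

# err_A = loo_error(X, y, M=200, select_inside_cv=True)    # legitimate estimate
# err_B = loo_error(X, y, M=200, select_inside_cv=False)   # optimistically biased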
Difficulty in SAGE
Our Approach of Analysis
combination of classification, feature selection, and error estimation techniques
use different combinations to show the relative usefulness of different techniques
systematic and repeatable on similar datasets
careful use of unseen samples
robust conclusion with multiple combinations and error estimations
Classification Techniques
Classification Techniques
techniques that avoid overfitting
models that seem promising
four classification algorithms:
AdaBoost-Stump
SVM-Linear
SVM-Gaussian
SVM-Stump
– a novel and promising paradigm through infinite ensemble learning (Lin and Li, ECML 2005)
Classification Techniques
Adaptive Boosting with Decision Stumps
model: $\hat{g}(x) = \mathrm{sign}\left(\sum_{t=1}^{T} w_t s_t(x)\right)$
a finite ensemble of weak rules
each s_t is a decision stump (a thresholding rule on a SAGE tag)
– e.g. if the count of tag 200 is greater than 10, then cancerous
each w_t: a nonnegative weight for s_t
prediction: each s_t tells whether the sample is cancerous, and $\hat{g}$ reports the weighted majority vote
automatically selects at most T important tags and ignores the others in prediction
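A minimal sketch of the prediction rule above, assuming the stumps s_t and weights w_t have already been produced by AdaBoost; the tag indices, thresholds, and weights in the usage comment are made up for illustration. In practice an off-the-shelf boosting routine (for example scikit-learn's AdaBoostClassifier with depth-1 trees) fills the same role.

def stump(d, theta, q):
    # decision stump s(x): returns +q if the count of tag d exceeds theta, else -q
    return lambda x: q if x[d] > theta else -q

def adaboost_stump_predict(x, stumps, weights):
    # g_hat(x) = sign( sum_t w_t * s_t(x) ): weighted majority vote of the stumps
    vote = sum(w * s(x) for w, s in zip(weights, stumps))
    return 1 if vote > 0 else -1

# illustrative usage with two stumps (hypothetical tags and weights):
# ensemble = [stump(200, 10, +1), stump(513, 3, -1)]
# label = adaboost_stump_predict(sample, ensemble, weights=[0.7, 0.4])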
Classification Techniques
Support Vector Machine with Linear Kernel
model: $\hat{g}(x) = \mathrm{sign}\left(\sum_{d=1}^{D} w_d (x)_d + b\right)$
a hyperplane in $\mathbb{R}^D$
– e.g. if the weighted sum of all counts is greater than 10, then cancerous
a large-margin hyperplane: clear separation between cancerous and normal samples
each w_d: sensitivity to changes in (x)_d – a measure of the importance of tag d
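A rough sketch of fitting and inspecting this model with scikit-learn's SVC; the synthetic data and the value of C are placeholders, not the settings used in the talk.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(90, 27679)).astype(float)   # synthetic stand-in for SAGE counts
y = np.where(np.arange(90) < 59, 1, -1)                 # 59 cancerous, 31 normal (labels only)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_.ravel(), clf.intercept_[0]              # per-tag weights w_d and bias b
decision = np.sign(X @ w + b)                            # same rule as sign(sum_d w_d (x)_d + b)
important_tags = np.argsort(np.abs(w))[::-1][:20]        # tags with the largest |w_d|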
Classification Techniques
Support Vector Machine with Gaussian Kernel
model: $\hat{g}(x) = \mathrm{sign}\left(\sum_{i=1}^{N} y_i \lambda_i \exp\bigl(-\gamma \|x - x_i\|^2\bigr)\right)$
a nonlinear classifier, similar to a radial basis function network
large-margin hyperplane in an infinite dimensional space
pros: powerful model, often good prediction performance
cons: time-consuming to choose the parameter γ, hard to interpret
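One common way to handle the costly choice of γ is a cross-validated grid search; a sketch below, where the grids of γ and C are illustrative values rather than those used in the talk.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"gamma": [1e-6, 1e-4, 1e-2, 1.0], "C": [0.1, 1.0, 10.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
# search.fit(X_train, y_train)              # must use training folds only (see above)
# best_gamma = search.best_params_["gamma"]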
Classification Techniques
Support Vector Machine with Stump Kernel
model: $\hat{g}(x) = \mathrm{sign}\left(\sum_{d=1}^{D} \sum_{q \in \{\pm 1\}} \int w_{q,d}(\alpha)\, s_{q,d,\alpha}(x)\, d\alpha + b\right)$
large-margin infinite ensemble of decision stumps: novel and promising
pros: powerful model, often good performance
superior power to AdaBoost-Stump due to the infinite ensemble
superior power to SVM-Linear due to nonlinearity
faster parameter selection than SVM-Gauss
model: partially interpretable
– w_{q,d} can estimate the importance of tag d
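Lin and Li (ECML 2005) show this infinite ensemble can be trained with a standard SVM once the corresponding stump kernel is plugged in; up to a constant and scaling, that kernel is the negative ℓ1 distance between samples. A rough sketch using scikit-learn's custom-kernel interface; the constant DELTA below is an illustrative stand-in for the exact value derived in the paper.

from scipy.spatial.distance import cdist
from sklearn.svm import SVC

DELTA = 1.0e4   # illustrative; the paper derives the proper constant from the feature ranges

def stump_kernel(X, Y):
    # stump kernel ~ constant minus the L1 (cityblock) distance between samples
    return DELTA - cdist(X, Y, metric="cityblock")

# clf = SVC(kernel=stump_kernel, C=1.0).fit(X_train, y_train)
# predictions = clf.predict(X_test)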
Classification Techniques
Relative Comparison of Classification Techniques
all four have some degree of regularization: avoid overfitting
the first three were used in some gene/cancer related tasks
SVM-Stump is closely related to AdaBoost-Stump
pros and cons:

                   AdaBoost-Stump  SVM-Linear  SVM-Gauss  SVM-Stump
model power (*)          −             −           ↑          ↑
interpretability         ↑             ↑           ↓          −
speed                    ↑             −           ↓          −

(*) it is hard to compare AdaBoost-Stump to SVM-Linear in power
Feature Selection Techniques
Feature Selection with Ranking
Algorithm
1 rank (order) the features by their importance
2 select only the top M features
a simple strategy
relies on a good ranking algorithm
three simple ranking algorithms:
Ranking with Fisher Score
Ranking with Linear Weight
Ranking with Stump Weight
the first two have been used in similar tasks
Feature Selection Techniques
Feature Ranking Techniques
Rank with Fisher Score (RFS):
how well can we use only $(x_i)_d$ to predict $y_i$?
Rank with Linear Weight (RLW):
what is the importance $w_d$ of $(x)_d$ in the hyperplane $\sum_d w_d (x)_d + b$ found by SVM-Linear?
Rank with Stump Weight (RSW):
what is the amount of decision stumps $\sum_{q} \int w_{q,d}^2(\alpha)\, d\alpha$ needed for feature d in the ensemble $\sum_{d=1}^{D} \sum_{q \in \{\pm 1\}} \int w_{q,d}(\alpha)\, s_{q,d,\alpha}(x)\, d\alpha + b$?
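Rough sketches of RFS and RLW, assuming ±1 labels; the Fisher-score formula is the usual two-class form, which may differ in minor details from the exact score used in the talk, and RSW is omitted because it needs the stump-kernel dual solution from the paper.

import numpy as np
from sklearn.svm import SVC

def rank_fisher_score(X, y):
    # RFS: per-tag score (mu+ - mu-)^2 / (var+ + var-); larger means more discriminative
    pos, neg = X[y == 1], X[y == -1]
    score = (pos.mean(0) - neg.mean(0)) ** 2 / (pos.var(0) + neg.var(0) + 1e-12)
    return np.argsort(score)[::-1]          # tag indices, most informative first

def rank_linear_weight(X, y):
    # RLW: rank tags by |w_d| taken from the hyperplane found by SVM-Linear
    w = SVC(kernel="linear").fit(X, y).coef_.ravel()
    return np.argsort(np.abs(w))[::-1]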
Error Estimation Techniques
Error Estimation Techniques
v-fold cross-validation: economic use of samples
training folds: v−1 of the v folds
test fold: the remaining fold, reserved as unseen
estimate: average error on the reduced test fold
v-fold CV is a random process: it can be repeated many times
our setting: 10-fold × 10, 5-fold × 20, or 90-fold × 1
90-fold: also called leave-one-out
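The three settings can be set up directly with scikit-learn; a small sketch below, where stratified folds and the random seed are assumed choices (the talk does not specify them) and 90-fold on 90 samples coincides with leave-one-out.

from sklearn.model_selection import LeaveOneOut, RepeatedStratifiedKFold

cv_10x10 = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
cv_5x20  = RepeatedStratifiedKFold(n_splits=5,  n_repeats=20, random_state=0)
cv_loo   = LeaveOneOut()                      # 90-fold on 90 samples = leave-one-out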
Error Estimation Techniques
Experiment Setting
1 Cross-validation splitting to training folds/test fold
2 Feature ranking on training folds
3 Feature selection by ranking: keep the top M features (M = 50, 100, 200, 500, 1000, or 27679)
4 Classification on the reduced training folds
5 Test on the reduced test fold
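Putting the five steps together, a condensed sketch; the ranking function, classifier factory, and fold object are the hypothetical pieces from the earlier sketches, and the key point is that ranking (step 2) happens inside the loop, on the training folds only.

import numpy as np

def cv_error(X, y, cv, rank_features, make_classifier, M):
    # steps 1-5: split, rank on the training folds, keep the top-M tags, train, test
    errors = []
    for train, test in cv.split(X, y):
        top = rank_features(X[train], y[train])[:M]                      # steps 2-3
        clf = make_classifier().fit(X[train][:, top], y[train])          # step 4
        errors.append(np.mean(clf.predict(X[test][:, top]) != y[test]))  # step 5
    return float(np.mean(errors))

# e.g. (hypothetical wiring of the earlier sketches):
# from sklearn.svm import SVC
# err = cv_error(X, y, cv_10x10, rank_linear_weight, lambda: SVC(kernel="linear"), M=200)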
Experimental Results
Comparison of Classification Techniques
[Figure: cross-validation error (%) vs. number of features, under Ranking with Linear Weight (left) and Ranking with Stump Weight (right); curves for AdaBoost-Stump (T=100, T=1000), SVM-Linear, SVM-Gauss, and SVM-Stump]
results with 10-fold CV × 10
AdaBoost-Stump is not good
SVM-Gauss is slightly worse than SVM-Linear
SVM-Stump is slightly better than SVM-Linear
Experimental Results
Comparison of Classification Techniques
SVM-Linear and SVM-Stump are the better choices
                   AdaBoost-Stump  SVM-Linear  SVM-Gauss  SVM-Stump
model power              −             −           ↑          ↑
interpretability         ↑             ↑           ↓          −
speed                    ↑             −           ↓          −
performance              ↓             ↑           ↑          ↑
Experimental Results
Comparison of Feature Selection Techniques
[Figure: cross-validation error (%) vs. number of features, for SVM-Linear (left) and SVM-Stump (right); curves for the full feature set, RFS, RLW, and RSW]
results with 10-fold CV × 10
Ranking with Fisher Score is not good
Ranking with Stump Weight is slightly better than with Linear Weight
Experimental Results
Comparison of Error Estimation Techniques
[Figure: cross-validation error (%) vs. number of features, Ranking with Fisher Score, under 10-fold CV × 10 (left) and 90-fold CV / leave-one-out (right); curves for AdaBoost-Stump (T=100, T=1000), SVM-Linear, SVM-Gauss, and SVM-Stump]
leave-one-out does not give stable and explainable results
Experimental Results
Comparison of Error Estimation Techniques
[Figure: cross-validation error (%) vs. number of features, Ranking with Fisher Score, under 10-fold CV × 10 (left) and 5-fold CV × 20 (right); curves for AdaBoost-Stump (T=100, T=1000), SVM-Linear, SVM-Gauss, and SVM-Stump]
similar conclusions from 5-fold and 10-fold CV
10-fold uses more samples for training
– better choice considering the importance of samples
Conclusion
Conclusion
carefully analyzed the difficult SAGE dataset
legitimate information only
robust conclusion through multiple testing
classification: SVM-Linear and SVM-Stump are both promising
feature selection: RLW and RSW are both good
– possible to achieve better performance than full set
error estimation: 10-fold CV seems to be the better choice, and leave-one-out is bad
how can we possibly distinguish between the linear model and the stump ensemble model?
are there more samples to verify the findings?
which model selects more biologically meaningful features?
which model is biologically more plausible?