Infinite Ensemble Learning with Support Vector Machinery
Hsuan-Tien Lin and Ling Li
Learning Systems Group, California Institute of Technology
ECML/PKDD, October 4, 2005
Outline
1 Motivation of Infinite Ensemble Learning
2 Connecting SVM and Ensemble Learning
3 SVM-Based Framework of Infinite Ensemble Learning
4 Concrete Instance of the Framework: Stump Kernel
5 Experimental Comparison
6 Conclusion
Motivation of Infinite Ensemble Learning
Learning Problem
notation: example x ∈ X ⊆ R^D and label y ∈ {+1, −1}
hypotheses (classifiers): functions from X → {+1, −1}
binary classification problem: given training examples and labels {(x_i, y_i)}_{i=1}^N, find a classifier g(x) : X → {+1, −1} that predicts the label of unseen x well
Motivation of Infinite Ensemble Learning
Ensemble Learning
g(x) : X → {+1, −1}
ensemble learning: popular paradigm (bagging, boosting, etc.)
ensemble: weighted vote of a committee of hypotheses
g(x) = sign(Σ_t w_t h_t(x))
h_t: base hypotheses, usually chosen from a set H
w_t: nonnegative weight for h_t
ensemble: usually better than an individual h_t(x) in stability/performance
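Not part of the original slides: a minimal Python sketch of the weighted vote above, with two made-up decision-stump hypotheses and weights chosen purely for illustration.

```python
import numpy as np

def ensemble_predict(x, hypotheses, weights):
    """Weighted vote: g(x) = sign(sum_t w_t * h_t(x))."""
    return np.sign(sum(w * h(x) for h, w in zip(hypotheses, weights)))

# Toy committee of two decision stumps on a 1-D input (illustrative values only).
hypotheses = [lambda x: np.sign(x - 0.2), lambda x: -np.sign(x - 0.7)]
weights = [0.6, 0.4]
print(ensemble_predict(np.array([0.5]), hypotheses, weights))   # -> [1.]
```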
Motivation of Infinite Ensemble Learning
Infinite Ensemble Learning
g(x) = sign(Σ_t w_t h_t(x)), h_t ∈ H, w_t ≥ 0
the set H can be of infinite size
traditional algorithms: assign a finite number of nonzero w_t
1 is finiteness a regularization and/or a restriction?
2 how do we handle an infinite number of nonzero weights?
Motivation of Infinite Ensemble Learning
SVM for Infinite Ensemble Learning
Support Vector Machine (SVM): large-margin hyperplane in some feature space
SVM: possibly infinite-dimensional hyperplane g(x) = sign(Σ_d w_d φ_d(x) + b)
an important piece of machinery for conquering infinity: the kernel trick
how can we use Support Vector Machinery for infinite ensemble learning?
Connecting SVM and Ensemble Learning
Properties of SVM
g(x) = sign(Σ_{d=1}^∞ w_d φ_d(x) + b) = sign(Σ_{i=1}^N λ_i y_i K(x_i, x) + b)
a successful large-margin learning algorithm
goal: (infinite-dimensional) large-margin hyperplane
min_{w,b} (1/2)||w||_2^2 + C Σ_{i=1}^N ξ_i, s.t. y_i (Σ_{d=1}^∞ w_d φ_d(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0
optimal hyperplane: represented through duality
key for handling infinity: computation with the kernel trick K(x, x') = Σ_{d=1}^∞ φ_d(x) φ_d(x')
regularization: controlled with the trade-off parameter C
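For concreteness (not part of the slides; a sketch assuming scikit-learn, whose SVC wraps a standard soft-margin solver): after training, dual_coef_ stores λ_i y_i for the support vectors and intercept_ stores b, so the decision value Σ_i λ_i y_i K(x_i, x) + b can be reconstructed by hand.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Toy data, purely illustrative.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X, y)

# Reconstruct sign(sum_i lambda_i y_i K(x_i, x) + b) from the dual representation.
x_new = rng.normal(size=(1, 2))
K = rbf_kernel(clf.support_vectors_, x_new, gamma=0.5)  # K(x_i, x_new) for support vectors
decision = clf.dual_coef_ @ K + clf.intercept_          # dual_coef_ holds lambda_i * y_i
print(np.sign(decision), clf.predict(x_new))            # the two should agree
```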
Connecting SVM and Ensemble Learning
Properties of AdaBoost
g(x) = sign(Σ_{t=1}^T w_t h_t(x))
a successful ensemble learning algorithm
goal: asymptotically, a large-margin ensemble
min_{w,h} ||w||_1, s.t. y_i (Σ_{t=1}^∞ w_t h_t(x_i)) ≥ 1, w_t ≥ 0
optimal ensemble: approximated by a finite one
key for good approximation:
finiteness: some h_{t1}(x_i) = h_{t2}(x_i) for all i
sparsity: the optimal ensemble usually has many zero weights
regularization: finite approximation
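Not from the slides: a compact sketch of standard AdaBoost with decision stumps (the AdaBoost-Stump baseline compared against later), assuming numpy; for simplicity, candidate thresholds are the training feature values themselves rather than the midpoints used for "middle stumps".

```python
import numpy as np

def train_adaboost_stumps(X, y, T=100):
    """Standard AdaBoost with decision stumps h(x) = q * sign(x_d - a)."""
    N, D = X.shape
    u = np.ones(N) / N                           # example weights
    ensemble = []                                # list of (w_t, q, d, a)
    for _ in range(T):
        best = None
        for d in range(D):                       # exhaustive search for the best stump
            for a in np.unique(X[:, d]):
                for q in (+1, -1):
                    pred = q * np.sign(X[:, d] - a)
                    pred[pred == 0] = q
                    err = u[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, q, d, a, pred)
        err, q, d, a, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        w_t = 0.5 * np.log((1 - err) / err)      # nonnegative hypothesis weight
        ensemble.append((w_t, q, d, a))
        u *= np.exp(-w_t * y * pred)             # re-weight the training examples
        u /= u.sum()
    return ensemble

def adaboost_predict(ensemble, X):
    agg = sum(w * q * np.sign(X[:, d] - a) for w, q, d, a in ensemble)
    return np.sign(agg)
```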
Connecting SVM and Ensemble Learning
Connection between SVM and AdaBoost
φ_d(x) ⇔ h_t(x)
SVM vs. AdaBoost:
G(x): Σ_k w_k φ_k(x) + b (SVM) vs. Σ_k w_k h_k(x), w_k ≥ 0 (AdaBoost)
hard goal: min ||w||_p, s.t. y_i G(x_i) ≥ 1, with p = 2 (SVM) vs. p = 1 (AdaBoost)
key for infinity: kernel trick (SVM) vs. finiteness and sparsity (AdaBoost)
regularization: soft-margin trade-off (SVM) vs. finite approximation (AdaBoost)
SVM-Based Framework of Infinite Ensemble Learning
Challenge
challenge: how to design a good infinite ensemble learning algorithm?
traditional ensemble learning: iterative and cannot be directly generalized
our main contribution: novel and powerful infinite ensemble learning algorithm with Support Vector Machinery
our approach: embed an infinite number of hypotheses in the SVM kernel, i.e., K(x, x') = Σ_{t=1}^∞ h_t(x) h_t(x')
– then, SVM classifier: g(x) = sign(Σ_{t=1}^∞ w_t h_t(x) + b)
1 does the kernel exist?
2 how do we ensure w_t ≥ 0?
SVM-Based Framework of Infinite Ensemble Learning
Embedding Hypotheses into the Kernel
Definition
The kernel that embodies H = {h_α : α ∈ C} is defined as
K_{H,r}(x, x') = ∫_C φ_x(α) φ_{x'}(α) dα,
where C is a measure space, φ_x(α) = r(α) h_α(x), and r : C → R^+ is chosen such that the integral always exists
integral instead of sum: works even for uncountable H
existence problem: handled with a suitable r(·)
K_{H,r}(x, x'): an inner product of φ_x and φ_{x'} in F = L_2(C)
the classifier: g(x) = sign(∫_C w(α) r(α) h_α(x) dα + b)
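An illustrative numerical check, not in the slides (assuming numpy): for one-dimensional decision stumps h_α(x) = q · sign(x − α) on [L, R] with r(α) = 1/2, the defining integral can be approximated on a grid and compared with the closed form (R − L)/2 − |x − x'|, the per-dimension term of the stump kernel introduced later.

```python
import numpy as np

def stump_embedding_kernel_1d(x, xp, L=-1.0, R=1.0, grid=100001):
    """Approximate K(x, x') = sum_{q in {+1,-1}} int_L^R (1/2)^2 * q*sign(x-a) * q*sign(xp-a) da."""
    alphas = np.linspace(L, R, grid)
    integrand = np.sign(x - alphas) * np.sign(xp - alphas)  # identical for q = +1 and q = -1
    return 2 * 0.25 * integrand.mean() * (R - L)            # Riemann-sum approximation

x, xp = 0.3, -0.2
print(stump_embedding_kernel_1d(x, xp))    # approximately 0.5
print((1.0 - (-1.0)) / 2 - abs(x - xp))    # closed form: (R - L)/2 - |x - x'| = 0.5
```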
SVM-Based Framework of Infinite Ensemble Learning
Negation Completeness and Constant Hypotheses
g(x) = sign(∫_C w(α) r(α) h_α(x) dα + b)
not an ensemble classifier yet
is w(α) ≥ 0? hard to handle: possibly uncountably many constraints
simple with a negation completeness assumption on H (h ∈ H if and only if (−h) ∈ H)
e.g., neural networks, perceptrons, decision trees, etc.
for any w, there exists a nonnegative w̃ that produces the same g
what is b? equivalently, the weight on a constant hypothesis
another assumption: H contains a constant hypothesis
with these mild assumptions, g(x) is equivalent to an ensemble classifier
SVM-Based Framework of Infinite Ensemble Learning
Framework of Infinite Ensemble Learning
Algorithm
1 Consider a hypothesis set H (negation complete and containing a constant hypothesis)
2 Construct a kernel K_{H,r} with a proper r(·)
3 Properly choose other SVM parameters
4 Train SVM with K_{H,r} and {(x_i, y_i)}_{i=1}^N to obtain λ_i and b
5 Output g(x) = sign(Σ_{i=1}^N y_i λ_i K_{H,r}(x_i, x) + b)
hard part: kernel construction
SVM as optimization machinery: training routines are widely available
SVM as a well-studied learning model: inherits its profound regularization properties
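Not part of the slides: a minimal end-to-end sketch of steps 2-5, assuming scikit-learn's SVC with a precomputed Gram matrix and toy data; the kernel used here is the stump kernel K_S(x, x') = Δ_S − ||x − x'||_1, the concrete instance presented in the next part.

```python
import numpy as np
from sklearn.svm import SVC

def stump_kernel(X1, X2, lo, hi):
    """K_S(x, x') = Delta_S - ||x - x'||_1 with Delta_S = 0.5 * sum_d (R_d - L_d)."""
    delta = 0.5 * np.sum(hi - lo)
    l1 = np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=2)
    return delta - l1

# Toy data; [L_d, R_d] estimated from the training inputs (an assumption for this sketch).
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(40, 2))
y_train = np.where(X_train[:, 0] + 0.3 * X_train[:, 1] > 0, 1, -1)
lo, hi = X_train.min(axis=0), X_train.max(axis=0)

# Steps 3-4: choose C, then train SVM on the precomputed Gram matrix to get lambda_i and b.
clf = SVC(C=1.0, kernel="precomputed")
clf.fit(stump_kernel(X_train, X_train, lo, hi), y_train)

# Step 5: g(x) = sign(sum_i y_i lambda_i K(x_i, x) + b), evaluated on new points.
X_test = rng.uniform(-1, 1, size=(5, 2))
print(clf.predict(stump_kernel(X_test, X_train, lo, hi)))
```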
Concrete Instance of the Framework: Stump Kernel
Decision Stump
decision stump: s_{q,d,α}(x) = q · sign((x)_d − α)
simplicity: popular for ensemble learning
Figure: illustration of the decision stump s_{+1,2,α}(x); (a) decision process: predict +1 if (x)_2 ≥ α, otherwise −1; (b) decision boundary: the line (x)_2 = α in the ((x)_1, (x)_2) plane
Concrete Instance of the Framework: Stump Kernel
Stump Kernel
consider the set of decision stumps S = {s_{q,d,α_d} : q ∈ {+1, −1}, d ∈ {1, . . . , D}, α_d ∈ [L_d, R_d]}
when X ⊆ [L_1, R_1] × [L_2, R_2] × · · · × [L_D, R_D], S is negation complete and contains a constant hypothesis
Definition
The stump kernel K_S is defined for S with r(q, d, α_d) = 1/2 as
K_S(x, x') = Δ_S − Σ_{d=1}^D |(x)_d − (x')_d| = Δ_S − ||x − x'||_1,
where Δ_S = (1/2) Σ_{d=1}^D (R_d − L_d) is a constant
Concrete Instance of the Framework: Stump Kernel
Properties of Stump Kernel
simple to compute: can even use the simpler variant K̃_S(x, x') = −||x − x'||_1 while getting the same solution
under the dual constraint Σ_i y_i λ_i = 0, using K_S or K̃_S is the same
feature-space explanation for the ℓ_1-norm distance
infinite power: under mild assumptions, SVM-Stump with C = ∞ can perfectly classify all training examples
if there is a dimension in which all feature values are different, the kernel matrix K with K_ij = K_S(x_i, x_j) is strictly positive definite
similar power to the popular Gaussian kernel exp(−γ||x − x'||_2^2)
– suitable control on the power leads to good performance
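A quick sanity check, not from the paper (a sketch assuming scikit-learn and toy data): training with K_S = Δ_S − ||x − x'||_1 and with the simplified K̃_S = −||x − x'||_1 should give identical predictions, since the dual constraint Σ_i y_i λ_i = 0 cancels the constant Δ_S.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(50, 3))
y = np.where(X[:, 0] - X[:, 2] > 0, 1, -1)

l1 = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)  # pairwise l1-norm distances
delta = 0.5 * np.sum(X.max(axis=0) - X.min(axis=0))     # Delta_S estimated from the data

pred_full = SVC(C=1.0, kernel="precomputed").fit(delta - l1, y).predict(delta - l1)
pred_simple = SVC(C=1.0, kernel="precomputed").fit(-l1, y).predict(-l1)
print((pred_full == pred_simple).mean())                # expected: 1.0
```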
Concrete Instance of the Framework: Stump Kernel
Properties of Stump Kernel (Cont’d)
fast automatic parameter selection: only needs to search for a good soft-margin parameter C
scaling the stump kernel is equivalent to scaling the soft-margin parameter C
the Gaussian kernel depends on a good (γ, C) pair (parameter search about 10 times slower)
well suited to some specific applications:
cancer prediction with gene expressions
(Lin and Li, ECML/PKDD Discovery Challenge, 2005)
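An illustration of the claim, not from the paper (assuming scikit-learn's GridSearchCV, with placeholder grids and toy data): SVM-Stump only searches a one-dimensional grid over C, while SVM-Gauss searches a (γ, C) grid with many times more candidates.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(80, 4))
y = np.where(X[:, 0] + X[:, 1] - X[:, 2] > 0, 1, -1)

# Precomputed stump kernel matrix on the training data.
l1 = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
K_stump = 0.5 * np.sum(X.max(axis=0) - X.min(axis=0)) - l1

Cs = [0.01, 0.1, 1, 10, 100]
# SVM-Stump: 5 candidates (C only).
stump_search = GridSearchCV(SVC(kernel="precomputed"), {"C": Cs}, cv=3).fit(K_stump, y)
# SVM-Gauss: 25 candidates (every (gamma, C) pair).
gauss_search = GridSearchCV(SVC(kernel="rbf"),
                            {"C": Cs, "gamma": [0.01, 0.1, 1, 10, 100]}, cv=3).fit(X, y)
print(stump_search.best_params_, gauss_search.best_params_)
```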
Concrete Instance of the Framework: Stump Kernel
Infinite Decision Stump Ensemble
g(x) = sign(Σ_{q∈{+1,−1}} Σ_{d=1}^D ∫_{L_d}^{R_d} w_{q,d}(α) s_{q,d,α}(x) dα + b)
each s_{q,d,α}: infinitesimal influence w_{q,d}(α) dα
equivalently,
g(x) = sign(Σ_{q∈{+1,−1}} Σ_{d=1}^D Σ_{a=0}^{A_d} ŵ_{q,d,a} ŝ_{q,d,a}(x) + b)
ŝ: a smoother variant of the decision stump
Figure: the smooth stump ŝ(x) plotted along (x)_d, near the training values (x_i)_d and (x_j)_d
Concrete Instance of the Framework: Stump Kernel
Infinitesimal Influence
g(x) = sign(Σ_{q∈{+1,−1}} Σ_{d=1}^D Σ_{a=0}^{A_d} ŵ_{q,d,a} ŝ_{q,d,a}(x) + b)
infinity → dense combination of a finite number of smooth stumps
infinitesimal influence → concrete weights of the smooth stumps
Figure (weights along (x)_d): SVM gives a dense combination of smooth stumps; AdaBoost gives a sparse combination of middle stumps
Experimental Comparison
Experiment Setting
ensemble learning algorithms:
SVM-Stump: infinite ensemble of decision stumps (dense ensemble of smooth stumps)
SVM-Mid: dense ensemble of middle stumps
AdaBoost-Stump: sparse ensemble of middle stumps
SVM algorithms: SVM-Stump versus SVM-Gauss
artificial, noisy, and real-world datasets
cross-validation for automatic parameter selection of SVM
errors evaluated on a hold-out test set and averaged over 100 different splits
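Not from the paper: a sketch of this protocol with placeholder data, 10 random splits instead of 100, and scikit-learn handling both the SVM training and the cross-validated choice of C; the simplified kernel −||x − x'||_1 is used, which gives the same SVM solution as the full stump kernel.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

def stump_gram(X1, X2):
    """Simplified stump kernel -||x - x'||_1 (same SVM solution as Delta_S - ||x - x'||_1)."""
    return -np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=2)

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(120, 5))
y = np.where(X @ np.array([1.0, -0.5, 0.3, 0.0, 0.2]) > 0, 1, -1)

errors = []
for split in range(10):                              # the paper averages over 100 splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=split)
    search = GridSearchCV(SVC(kernel="precomputed"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=3)
    search.fit(stump_gram(X_tr, X_tr), y_tr)         # cross-validation selects C
    errors.append(np.mean(search.predict(stump_gram(X_te, X_tr)) != y_te))
print(np.mean(errors))                               # average hold-out test error
```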
Experimental Comparison
Comparison between SVM and AdaBoost
Figure: test error (%) on the tw, twn, th, thn, ri, rin, aus, bre, ger, hea, ion, pim, son, and vot datasets for SVM-Stump, SVM-Mid, AdaBoost-Stump(100), and AdaBoost-Stump(1000)
Results
fair comparison between AdaBoost and SVM
SVM-Stump is usually the best – the benefit of going to infinity
SVM-Mid is also good – the benefit of having a dense ensemble
sparsity and finiteness are restrictions
Experimental Comparison
Comparison between SVM and AdaBoost (Cont’d)
Figure: decision boundaries in the ((x)_1, (x)_2) plane, left to right: SVM-Stump, SVM-Mid, AdaBoost-Stump
smoother boundary with the infinite ensemble (SVM-Stump)
still fits well with the dense ensemble (SVM-Mid)
cannot fit well when sparse and finite (AdaBoost-Stump)
Experimental Comparison
Comparison of SVM Kernels
Figure: test error (%) on the same datasets for SVM-Stump and SVM-Gauss
Results
SVM-Stump is only a bit worse than SVM-Gauss
still benefits from faster parameter selection in some applications
Conclusion
Conclusion
novel and powerful framework for infinite ensemble learning
derived a new and meaningful kernel
– stump kernel: succeeded in specific applications
infinite ensemble learning could be better – existing AdaBoost-Stump applications may switch
not the only kernel:
perceptron kernel → infinite ensemble of perceptrons
Laplacian kernel → infinite ensemble of decision trees
SVM: our machinery for conquering infinity
– possible to apply similar machinery to other areas that need infinite or large-scale aggregation