Infinite Ensemble Learning with Support Vector Machinery
Hsuan-Tien Lin and Ling Li
Learning Systems Group, California Institute of Technology
ECML/PKDD, October 4, 2005
Outline
1 Motivation of Infinite Ensemble Learning
2 Connecting SVM and Ensemble Learning
3 SVM-Based Framework of Infinite Ensemble Learning
4 Concrete Instance of the Framework: Stump Kernel
5 Experimental Comparison
6 Conclusion
Motivation of Infinite Ensemble Learning
Learning Problem
notation: example x ∈ X ⊆ R^D and label y ∈ {+1, −1}
hypotheses (classifiers): functions from X → {+1, −1}
binary classification problem: given training examples and labels {(x_i, y_i)}_{i=1}^N, find a classifier g(x) : X → {+1, −1} that predicts the label of unseen x well
Motivation of Infinite Ensemble Learning
Ensemble Learning
g(x) : X → {+1, −1}
ensemble learning: popular paradigm (bagging, boosting, etc.)
ensemble: weighted vote of a committee of hypotheses
g(x) = sign(Σ_t w_t h_t(x))
h_t: base hypotheses, usually chosen from a set H
w_t: nonnegative weight for h_t
ensemble: usually better than an individual h_t(x) in stability/performance
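Not part of the original slides: a minimal Python sketch of the weighted vote above, with two made-up decision-stump hypotheses and weights chosen purely for illustration.

```python
import numpy as np

def ensemble_predict(x, hypotheses, weights):
    """Weighted vote: g(x) = sign(sum_t w_t * h_t(x))."""
    return np.sign(sum(w * h(x) for h, w in zip(hypotheses, weights)))

# Toy committee of two decision stumps on a 1-D input (illustrative values only).
hypotheses = [lambda x: np.sign(x - 0.2), lambda x: -np.sign(x - 0.7)]
weights = [0.6, 0.4]
print(ensemble_predict(np.array([0.5]), hypotheses, weights))   # -> [1.]
```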
Motivation of Infinite Ensemble Learning
Infinite Ensemble Learning
g(x) = sign(Σ_t w_t h_t(x)), h_t ∈ H, w_t ≥ 0
the set H can be of infinite size
traditional algorithms: assign a finite number of nonzero w_t
1 is finiteness a regularization and/or a restriction?
2 how do we handle an infinite number of nonzero weights?
Motivation of Infinite Ensemble Learning
SVM for Infinite Ensemble Learning
Support Vector Machine (SVM): large-margin hyperplane in some feature space
SVM: possibly infinite-dimensional hyperplane g(x) = sign(Σ_d w_d φ_d(x) + b)
an important piece of machinery for conquering infinity: the kernel trick
how can we use Support Vector Machinery for infinite ensemble learning?
Connecting SVM and Ensemble Learning
Properties of SVM
g(x) = sign(Σ_{d=1}^∞ w_d φ_d(x) + b) = sign(Σ_{i=1}^N λ_i y_i K(x_i, x) + b)
a successful large-margin learning algorithm
goal: (infinite-dimensional) large-margin hyperplane
min_{w,b} (1/2)||w||_2^2 + C Σ_{i=1}^N ξ_i, s.t. y_i (Σ_{d=1}^∞ w_d φ_d(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0
optimal hyperplane: represented through duality
key for handling infinity: computation with the kernel trick K(x, x') = Σ_{d=1}^∞ φ_d(x) φ_d(x')
regularization: controlled with the trade-off parameter C
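For concreteness (not part of the slides; a sketch assuming scikit-learn, whose SVC wraps a standard soft-margin solver): after training, dual_coef_ stores λ_i y_i for the support vectors and intercept_ stores b, so the decision value Σ_i λ_i y_i K(x_i, x) + b can be reconstructed by hand.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Toy data, purely illustrative.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X, y)

# Reconstruct sign(sum_i lambda_i y_i K(x_i, x) + b) from the dual representation.
x_new = rng.normal(size=(1, 2))
K = rbf_kernel(clf.support_vectors_, x_new, gamma=0.5)  # K(x_i, x_new) for support vectors
decision = clf.dual_coef_ @ K + clf.intercept_          # dual_coef_ holds lambda_i * y_i
print(np.sign(decision), clf.predict(x_new))            # the two should agree
```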
Connecting SVM and Ensemble Learning
Properties of AdaBoost
g(x) = sign(Σ_{t=1}^T w_t h_t(x))
a successful ensemble learning algorithm
goal: asymptotically, a large-margin ensemble
min_{w,h} ||w||_1, s.t. y_i (Σ_{t=1}^∞ w_t h_t(x_i)) ≥ 1, w_t ≥ 0
optimal ensemble: approximated by a finite one
key for good approximation:
finiteness: some h_{t1}(x_i) = h_{t2}(x_i) for all i
sparsity: the optimal ensemble usually has many zero weights
regularization: finite approximation
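Not from the slides: a compact sketch of standard AdaBoost with decision stumps (the AdaBoost-Stump baseline compared against later), assuming numpy; for simplicity, candidate thresholds are the training feature values themselves rather than the midpoints used for "middle stumps".

```python
import numpy as np

def train_adaboost_stumps(X, y, T=100):
    """Standard AdaBoost with decision stumps h(x) = q * sign(x_d - a)."""
    N, D = X.shape
    u = np.ones(N) / N                           # example weights
    ensemble = []                                # list of (w_t, q, d, a)
    for _ in range(T):
        best = None
        for d in range(D):                       # exhaustive search for the best stump
            for a in np.unique(X[:, d]):
                for q in (+1, -1):
                    pred = q * np.sign(X[:, d] - a)
                    pred[pred == 0] = q
                    err = u[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, q, d, a, pred)
        err, q, d, a, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        w_t = 0.5 * np.log((1 - err) / err)      # nonnegative hypothesis weight
        ensemble.append((w_t, q, d, a))
        u *= np.exp(-w_t * y * pred)             # re-weight the training examples
        u /= u.sum()
    return ensemble

def adaboost_predict(ensemble, X):
    agg = sum(w * q * np.sign(X[:, d] - a) for w, q, d, a in ensemble)
    return np.sign(agg)
```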
Connecting SVM and Ensemble Learning
Connection between SVM and AdaBoost
φ_d(x) ⇔ h_t(x)
SVM vs. AdaBoost:
G(x): Σ_k w_k φ_k(x) + b (SVM) vs. Σ_k w_k h_k(x), w_k ≥ 0 (AdaBoost)
hard goal: min ||w||_p, s.t. y_i G(x_i) ≥ 1, with p = 2 (SVM) vs. p = 1 (AdaBoost)
key for infinity: kernel trick (SVM) vs. finiteness and sparsity (AdaBoost)
regularization: soft-margin trade-off (SVM) vs. finite approximation (AdaBoost)
SVM-Based Framework of Infinite Ensemble Learning
Challenge
challenge: how to design a good infinite ensemble learning algorithm?
traditional ensemble learning: iterative and cannot be directly generalized
our main contribution: novel and powerful infinite ensemble learning algorithm with Support Vector Machinery
our approach: embed an infinite number of hypotheses in the SVM kernel, i.e., K(x, x') = Σ_{t=1}^∞ h_t(x) h_t(x')
– then, SVM classifier: g(x) = sign(Σ_{t=1}^∞ w_t h_t(x) + b)
1 does the kernel exist?
2 how do we ensure w_t ≥ 0?
SVM-Based Framework of Infinite Ensemble Learning
Embedding Hypotheses into the Kernel
Definition
The kernel that embodies H = {h_α : α ∈ C} is defined as
K_{H,r}(x, x') = ∫_C φ_x(α) φ_{x'}(α) dα,
where C is a measure space, φ_x(α) = r(α) h_α(x), and r : C → R^+ is chosen such that the integral always exists
integral instead of sum: works even for uncountable H
existence problem: handled with a suitable r(·)
K_{H,r}(x, x'): an inner product of φ_x and φ_{x'} in F = L_2(C)
the classifier: g(x) = sign(∫_C w(α) r(α) h_α(x) dα + b)
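An illustrative numerical check, not in the slides (assuming numpy): for one-dimensional decision stumps h_α(x) = q · sign(x − α) on [L, R] with r(α) = 1/2, the defining integral can be approximated on a grid and compared with the closed form (R − L)/2 − |x − x'|, the per-dimension term of the stump kernel introduced later.

```python
import numpy as np

def stump_embedding_kernel_1d(x, xp, L=-1.0, R=1.0, grid=100001):
    """Approximate K(x, x') = sum_{q in {+1,-1}} int_L^R (1/2)^2 * q*sign(x-a) * q*sign(xp-a) da."""
    alphas = np.linspace(L, R, grid)
    integrand = np.sign(x - alphas) * np.sign(xp - alphas)  # identical for q = +1 and q = -1
    return 2 * 0.25 * integrand.mean() * (R - L)            # Riemann-sum approximation

x, xp = 0.3, -0.2
print(stump_embedding_kernel_1d(x, xp))    # approximately 0.5
print((1.0 - (-1.0)) / 2 - abs(x - xp))    # closed form: (R - L)/2 - |x - x'| = 0.5
```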
SVM-Based Framework of Infinite Ensemble Learning
Negation Completeness and Constant Hypotheses
g(x) = sign(∫_C w(α) r(α) h_α(x) dα + b)
not an ensemble classifier yet
is w(α) ≥ 0? hard to handle: possibly uncountably many constraints
simple with a negation completeness assumption on H (h ∈ H if and only if (−h) ∈ H)
e.g., neural networks, perceptrons, decision trees, etc.
for any w, there exists a nonnegative w̃ that produces the same g
what is b? equivalently, the weight on a constant hypothesis
another assumption: H contains a constant hypothesis
with these mild assumptions, g(x) is equivalent to an ensemble classifier
SVM-Based Framework of Infinite Ensemble Learning
Framework of Infinite Ensemble Learning
Algorithm
1 Consider a hypothesis set H (negation complete and containing a constant hypothesis)
2 Construct a kernel K_{H,r} with a proper r(·)
3 Properly choose other SVM parameters
4 Train SVM with K_{H,r} and {(x_i, y_i)}_{i=1}^N to obtain λ_i and b
5 Output g(x) = sign(Σ_{i=1}^N y_i λ_i K_{H,r}(x_i, x) + b)
hard part: kernel construction
SVM as optimization machinery: training routines are widely available
SVM as a well-studied learning model: inherits its profound regularization properties
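Not part of the slides: a minimal end-to-end sketch of steps 2-5, assuming scikit-learn's SVC with a precomputed Gram matrix and toy data; the kernel used here is the stump kernel K_S(x, x') = Δ_S − ||x − x'||_1, the concrete instance presented in the next part.

```python
import numpy as np
from sklearn.svm import SVC

def stump_kernel(X1, X2, lo, hi):
    """K_S(x, x') = Delta_S - ||x - x'||_1 with Delta_S = 0.5 * sum_d (R_d - L_d)."""
    delta = 0.5 * np.sum(hi - lo)
    l1 = np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=2)
    return delta - l1

# Toy data; [L_d, R_d] estimated from the training inputs (an assumption for this sketch).
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(40, 2))
y_train = np.where(X_train[:, 0] + 0.3 * X_train[:, 1] > 0, 1, -1)
lo, hi = X_train.min(axis=0), X_train.max(axis=0)

# Steps 3-4: choose C, then train SVM on the precomputed Gram matrix to get lambda_i and b.
clf = SVC(C=1.0, kernel="precomputed")
clf.fit(stump_kernel(X_train, X_train, lo, hi), y_train)

# Step 5: g(x) = sign(sum_i y_i lambda_i K(x_i, x) + b), evaluated on new points.
X_test = rng.uniform(-1, 1, size=(5, 2))
print(clf.predict(stump_kernel(X_test, X_train, lo, hi)))
```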
Concrete Instance of the Framework: Stump Kernel
Decision Stump
decision stump: s_{q,d,α}(x) = q · sign((x)_d − α)
simplicity: popular for ensemble learning
Figure: illustration of the decision stump s_{+1,2,α}(x); (a) decision process: predict +1 if (x)_2 ≥ α, otherwise −1; (b) decision boundary: the line (x)_2 = α in the ((x)_1, (x)_2) plane
Concrete Instance of the Framework: Stump Kernel
Stump Kernel
consider the set of decision stumps S = {s_{q,d,α_d} : q ∈ {+1, −1}, d ∈ {1, . . . , D}, α_d ∈ [L_d, R_d]}
when X ⊆ [L_1, R_1] × [L_2, R_2] × · · · × [L_D, R_D], S is negation complete and contains a constant hypothesis
Definition
The stump kernel K_S is defined for S with r(q, d, α_d) = 1/2 as
K_S(x, x') = Δ_S − Σ_{d=1}^D |(x)_d − (x')_d| = Δ_S − ||x − x'||_1,
where Δ_S = (1/2) Σ_{d=1}^D (R_d − L_d) is a constant
Concrete Instance of the Framework: Stump Kernel
Properties of Stump Kernel
simple to compute: can even use the simpler variant K̃_S(x, x') = −||x − x'||_1 while getting the same solution
under the dual constraint Σ_i y_i λ_i = 0, using K_S or K̃_S is the same
feature-space explanation for the ℓ_1-norm distance
infinite power: under mild assumptions, SVM-Stump with C = ∞ can perfectly classify all training examples
if there is a dimension in which all feature values are different, the kernel matrix K with K_ij = K_S(x_i, x_j) is strictly positive definite
similar power to the popular Gaussian kernel exp(−γ||x − x'||_2^2)
– suitable control on the power leads to good performance
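A quick sanity check, not from the paper (a sketch assuming scikit-learn and toy data): training with K_S = Δ_S − ||x − x'||_1 and with the simplified K̃_S = −||x − x'||_1 should give identical predictions, since the dual constraint Σ_i y_i λ_i = 0 cancels the constant Δ_S.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(50, 3))
y = np.where(X[:, 0] - X[:, 2] > 0, 1, -1)

l1 = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)  # pairwise l1-norm distances
delta = 0.5 * np.sum(X.max(axis=0) - X.min(axis=0))     # Delta_S estimated from the data

pred_full = SVC(C=1.0, kernel="precomputed").fit(delta - l1, y).predict(delta - l1)
pred_simple = SVC(C=1.0, kernel="precomputed").fit(-l1, y).predict(-l1)
print((pred_full == pred_simple).mean())                # expected: 1.0
```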
Concrete Instance of the Framework: Stump Kernel
Properties of Stump Kernel (Cont’d)
fast automatic parameter selection: only needs to search for a good soft-margin parameter C
scaling the stump kernel is equivalent to scaling the soft-margin parameter C
the Gaussian kernel depends on a good (γ, C) pair (parameter search about 10 times slower)
well suited to some specific applications:
cancer prediction with gene expressions
(Lin and Li, ECML/PKDD Discovery Challenge, 2005)
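An illustration of the claim, not from the paper (assuming scikit-learn's GridSearchCV, with placeholder grids and toy data): SVM-Stump only searches a one-dimensional grid over C, while SVM-Gauss searches a (γ, C) grid with many times more candidates.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(80, 4))
y = np.where(X[:, 0] + X[:, 1] - X[:, 2] > 0, 1, -1)

# Precomputed stump kernel matrix on the training data.
l1 = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
K_stump = 0.5 * np.sum(X.max(axis=0) - X.min(axis=0)) - l1

Cs = [0.01, 0.1, 1, 10, 100]
# SVM-Stump: 5 candidates (C only).
stump_search = GridSearchCV(SVC(kernel="precomputed"), {"C": Cs}, cv=3).fit(K_stump, y)
# SVM-Gauss: 25 candidates (every (gamma, C) pair).
gauss_search = GridSearchCV(SVC(kernel="rbf"),
                            {"C": Cs, "gamma": [0.01, 0.1, 1, 10, 100]}, cv=3).fit(X, y)
print(stump_search.best_params_, gauss_search.best_params_)
```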
Concrete Instance of the Framework: Stump Kernel
Infinite Decision Stump Ensemble
g(x) = sign(Σ_{q∈{+1,−1}} Σ_{d=1}^D ∫_{L_d}^{R_d} w_{q,d}(α) s_{q,d,α}(x) dα + b)
each s_{q,d,α}: infinitesimal influence w_{q,d}(α) dα
equivalently,
g(x) = sign(Σ_{q∈{+1,−1}} Σ_{d=1}^D Σ_{a=0}^{A_d} ŵ_{q,d,a} ŝ_{q,d,a}(x) + b)
ŝ: a smoother variant of the decision stump
Figure: the smooth stump ŝ(x) plotted along (x)_d, near the training values (x_i)_d and (x_j)_d
Concrete Instance of the Framework: Stump Kernel
Infinitesimal Influence
g(x) = sign(Σ_{q∈{+1,−1}} Σ_{d=1}^D Σ_{a=0}^{A_d} ŵ_{q,d,a} ŝ_{q,d,a}(x) + b)
infinity → dense combination of a finite number of smooth stumps
infinitesimal influence → concrete weights of the smooth stumps
Figure (weights along (x)_d): SVM gives a dense combination of smooth stumps; AdaBoost gives a sparse combination of middle stumps
Experimental Comparison
Experiment Setting
ensemble learning algorithms:
SVM-Stump: infinite ensemble of decision stumps (dense ensemble of smooth stumps)
SVM-Mid: dense ensemble of middle stumps
AdaBoost-Stump: sparse ensemble of middle stumps
SVM algorithms: SVM-Stump versus SVM-Gauss
artificial, noisy, and real-world datasets
cross-validation for automatic parameter selection of SVM
errors evaluated on a hold-out test set and averaged over 100 different splits
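Not from the paper: a sketch of this protocol with placeholder data, 10 random splits instead of 100, and scikit-learn handling both the SVM training and the cross-validated choice of C; the simplified kernel −||x − x'||_1 is used, which gives the same SVM solution as the full stump kernel.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

def stump_gram(X1, X2):
    """Simplified stump kernel -||x - x'||_1 (same SVM solution as Delta_S - ||x - x'||_1)."""
    return -np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=2)

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(120, 5))
y = np.where(X @ np.array([1.0, -0.5, 0.3, 0.0, 0.2]) > 0, 1, -1)

errors = []
for split in range(10):                              # the paper averages over 100 splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=split)
    search = GridSearchCV(SVC(kernel="precomputed"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=3)
    search.fit(stump_gram(X_tr, X_tr), y_tr)         # cross-validation selects C
    errors.append(np.mean(search.predict(stump_gram(X_te, X_tr)) != y_te))
print(np.mean(errors))                               # average hold-out test error
```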
Experimental Comparison
Comparison between SVM and AdaBoost
Figure: test error (%) on the tw, twn, th, thn, ri, rin, aus, bre, ger, hea, ion, pim, son, and vot datasets for SVM-Stump, SVM-Mid, AdaBoost-Stump(100), and AdaBoost-Stump(1000)
Results
fair comparison between AdaBoost and SVM
SVM-Stump is usually the best – the benefit of going to infinity
SVM-Mid is also good – the benefit of having a dense ensemble
sparsity and finiteness are restrictions
Experimental Comparison
Comparison between SVM and AdaBoost (Cont’d)
Figure: decision boundaries in the ((x)_1, (x)_2) plane, left to right: SVM-Stump, SVM-Mid, AdaBoost-Stump
smoother boundary with the infinite ensemble (SVM-Stump)
still fits well with the dense ensemble (SVM-Mid)
cannot fit well when sparse and finite (AdaBoost-Stump)
Experimental Comparison
Comparison of SVM Kernels
Figure: test error (%) on the same datasets for SVM-Stump and SVM-Gauss
Results
SVM-Stump is only a bit worse than SVM-Gauss
still benefits from faster parameter selection in some applications
Conclusion
Conclusion
novel and powerful framework for infinite ensemble learning
derived a new and meaningful kernel
– stump kernel: succeeded in specific applications
infinite ensemble learning could be better – existing AdaBoost-Stump applications may switch
not the only kernel:
perceptron kernel → infinite ensemble of perceptrons
Laplacian kernel → infinite ensemble of decision trees
SVM: our machinery for conquering infinity
– possible to apply similar machinery to other areas that need infinite or large-scale aggregation