Distance-Based SVM Kernels for Infinite Ensemble Learning
Hsuan-Tien Lin and Ling Li
Learning Systems Group, California Institute of Technology
ICONIP, November 2, 2005
Infinite Ensemble Learning with SVM
Setup
notation:
examples: x ∈ X ⊆ R^D; labels: y ∈ {+1, −1}
hypotheses (classifiers): functions from X → {+1, −1}
binary classification problem: given training examples and labels {(x_i, y_i)}_{i=1}^N, find a classifier g(x) that predicts the labels of unseen examples well
ensemble learning: weighted vote of a committee of hypotheses
g(x) = sign(Σ_t w_t h_t(x)), w_t ≥ 0, h_t ∈ H
g(·) is usually better than an individual h(·)
Traditional Ensemble Learning
traditional ensemble learning: iteratively find (w_t, h_t) for t = 1, 2, …, T
g(x) = sign(Σ_{t=1}^T w_t h_t(x)), w_t ≥ 0, h_t ∈ H
AdaBoost: asymptotically approximates an optimal ensemble
min_{w,h} ‖w‖_1, s.t. y_i (Σ_{t=1}^∞ w_t h_t(x_i)) ≥ 1, w_t ≥ 0
by T iterations of coordinate descent on a barrier objective
is an infinite ensemble better?
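For reference, here is a minimal Python sketch of the finite AdaBoost procedure above, with decision stumps as the hypothesis set; `train_stump` is a hypothetical brute-force weak learner, and the weight formula w_t = ½ ln((1 − ε)/ε) is the standard AdaBoost choice rather than anything specific to this talk.

```python
import numpy as np

def train_stump(X, y, weights):
    """Hypothetical weak learner: exhaustively pick the decision stump
    s_{q,d,a}(x) = q * sign(x_d - a) with the lowest weighted error."""
    best = (np.inf, 1, 0, 0.0)  # (error, q, d, alpha)
    for d in range(X.shape[1]):
        for a in np.unique(X[:, d]):
            pred = np.where(X[:, d] >= a, 1, -1)  # sign(0) taken as +1
            for q in (1, -1):
                err = weights[q * pred != y].sum()
                if err < best[0]:
                    best = (err, q, d, a)
    return best

def adaboost(X, y, T=100):
    weights = np.full(len(y), 1.0 / len(y))   # example weights
    ensemble = []
    for _ in range(T):
        err, q, d, a = train_stump(X, y, weights)
        if err >= 0.5:                         # no useful weak hypothesis left
            break
        if err == 0:                           # perfect stump: take it and stop
            ensemble.append((1.0, q, d, a))
            break
        w_t = 0.5 * np.log((1 - err) / err)    # hypothesis weight, w_t >= 0
        pred = q * np.where(X[:, d] >= a, 1, -1)
        weights *= np.exp(-w_t * y * pred)     # emphasize misclassified examples
        weights /= weights.sum()
        ensemble.append((w_t, q, d, a))
    return ensemble

def predict(ensemble, X):
    # g(x) = sign(sum_t w_t h_t(x)): the finite weighted vote
    votes = sum(w_t * q * np.where(X[:, d] >= a, 1, -1)
                for w_t, q, d, a in ensemble)
    return np.sign(votes)
```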
Infinite Ensemble Learning
infinite ensemble learning: |H| = ∞, and possibly an infinite number of nonzero weights w_t
g(x) = sign(Σ_t w_t h_t(x)), h_t ∈ H, w_t ≥ 0
infinite ensemble learning is a challenge (e.g. Vapnik, 1998)
SVM can handle an infinite number of weights with suitable kernels (e.g. Schölkopf and Smola, 2002)
SVM and AdaBoost are connected (e.g. Rätsch et al., 2001)
can SVM be applied to infinite ensemble learning?
Connection between SVM and AdaBoost
SVM:
  min_{w,b} ½‖w‖_2² + C Σ_{i=1}^N |ξ_i|, s.t. y_i (Σ_{d=1}^∞ w_d φ_d(x_i) + b) ≥ 1 − ξ_i
  optimization: dual solution with quadratic programming
  key for infinity: kernel trick, K(x, x′) = Σ_d φ_d(x) φ_d(x′)
AdaBoost:
  min_{w,h} ‖w‖_1, s.t. y_i (Σ_{t=1}^∞ w_t h_t(x_i)) ≥ 1, w_t ≥ 0
  optimization: asymptotically approximated with barrier and coordinate descent
  key for infinity: approximation
correspondence: φ_d(x) ⇐⇒ h_t(x)
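To make the "key for infinity" row explicit, recall the standard soft-margin SVM dual (a textbook step, filled in here for clarity, not a slide from the talk): the possibly infinite-dimensional w and φ enter only through the kernel K.

```latex
\max_{\lambda}\; \sum_{i=1}^{N} \lambda_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
    \lambda_i \lambda_j y_i y_j \, K(x_i, x_j),
\qquad \text{s.t. } 0 \le \lambda_i \le C,\;\; \sum_{i=1}^{N} y_i \lambda_i = 0,
\qquad
g(x) = \operatorname{sign}\Bigl(\sum_{i=1}^{N} y_i \lambda_i \, K(x_i, x) + b\Bigr).
```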
Framework of Infinite Ensemble Learning
Algorithm
1 Consider a hypothesis set H
2 Embed H into a kernel K_{H,r} using φ_d ⇔ h_t
3 Properly choose other SVM parameters
4 Train SVM with K_{H,r} and {(x_i, y_i)}_{i=1}^N
5 Output an infinite ensemble classifier
if H is negation complete and contains a constant hypothesis, the SVM classifier is equivalent to an infinite ensemble classifier
SVM as an optimization machinery: training routines are widely available (e.g. LIBSVM)
SVM as a well-studied learning model: inherits the profound regularization properties
hard part: kernel construction
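A minimal sketch of the framework in Python, assuming scikit-learn's SVC with a precomputed Gram matrix; the toy data and names are illustrative, and the stump kernel used in step 2 is only defined on the upcoming slides.

```python
import numpy as np
from sklearn.svm import SVC

# Steps 1-2: the hypothesis set S of decision stumps, embedded into the
# stump kernel K_S(x, x') = Delta_S - ||x - x'||_1 (defined later);
# the constant Delta_S is dropped, which does not change the solution
def stump_kernel(X1, X2):
    return -np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=2)

# illustrative toy data {(x_i, y_i)}_{i=1}^N
rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 5))
y_train = np.where(X_train[:, 0] + 0.5 * X_train[:, 1] > 0, 1, -1)
X_test = rng.standard_normal((20, 5))

# Step 3: choose the soft-margin parameter C
clf = SVC(C=1.0, kernel="precomputed")

# Step 4: train SVM with the Gram matrix of K_S
clf.fit(stump_kernel(X_train, X_train), y_train)

# Step 5: the resulting classifier is an infinite ensemble of stumps
y_pred = clf.predict(stump_kernel(X_test, X_train))
```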
Embedding Hypotheses into the Kernel
K(x, x′) = Σ_{t=1}^∞ h_t(x) h_t(x′)   (may not converge)
⇒ K(x, x′) = Σ_{d=1}^∞ [r_d h_d(x)][r_d h_d(x′)]   (with some positive r_d)
⇒ K(x, x′) = ∫_C [r(α) h_α(x)][r(α) h_α(x′)] dα   (handles the uncountable case)
let φ_x(α) = r(α) h_α(x); the kernel K_{H,r}(x, x′) = ∫_C φ_x(α) φ_{x′}(α) dα embodies H = {h_α : α ∈ C}
K_{H,r}(x, x′): an inner product of φ_x and φ_{x′} in L_2(C)
examples: stump and perceptron kernels
Stump and Perceptron Kernels
Stump Kernel
decision stump: s_{q,d,α}(x) = q · sign((x)_d − α)
simplicity: popular for ensemble learning
consider S = {s_{q,d,α_d} : q ∈ {+1, −1}, 1 ≤ d ≤ D, α_d ∈ [L_d, R_d]}:
Definition
The stump kernel K_S is defined for S with r(q, d, α_d) = 1/2:
K_S(x, x′) = Δ_S − ‖x − x′‖_1,
where Δ_S = (1/2) Σ_{d=1}^D (R_d − L_d) is a constant
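The closed form follows by direct integration; this worked step is filled in for clarity and assumes (x)_d, (x′)_d ∈ [L_d, R_d]. In each dimension d, the product of the two stump outputs is −1 exactly when α lies between (x)_d and (x′)_d, so

```latex
\sum_{q \in \{\pm 1\}} \int_{L_d}^{R_d}
    \left[\tfrac{1}{2}\, q\, \mathrm{sign}\bigl((x)_d - \alpha\bigr)\right]
    \left[\tfrac{1}{2}\, q\, \mathrm{sign}\bigl((x')_d - \alpha\bigr)\right] d\alpha
  = \frac{R_d - L_d}{2} - \bigl|(x)_d - (x')_d\bigr|,
```

and summing over d = 1, …, D gives K_S(x, x′) = Δ_S − ‖x − x′‖_1.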
Perceptron Kernel
a simple hyperplane: p_{θ,α}(x) = sign(θᵀx − α)
not easy for ensemble learning: hard to design a good algorithm
consider P = {p_{θ,α} : ‖θ‖_2 = 1, α ∈ [−R, R]}:
Definition
The perceptron kernel K_P is defined for P with a constant r(θ, α):
K_P(x, x′) = Δ_P − ‖x − x′‖_2, where Δ_P is a constant.
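As a sanity check on the embedding, a short Python sketch (illustrative, not from the talk) that approximates the stump-kernel integral by Monte Carlo sampling of stumps and compares it with the closed form; the analogous check works for the perceptron kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, R = 3, -1.0, 1.0                      # assume each feature lies in [L, R]
x, xp = rng.uniform(L, R, D), rng.uniform(L, R, D)

# closed form: K_S(x, x') = Delta_S - ||x - x'||_1
delta_s = 0.5 * D * (R - L)
k_closed = delta_s - np.abs(x - xp).sum()

# Monte Carlo: sample stumps s_{q,d,alpha}(x) = q * sign(x_d - alpha)
# uniformly; with r = 1/2, the integral over (q, d, alpha) equals the
# sample average times the total measure 2 * D * (R - L)
M = 200_000
q = rng.choice([-1.0, 1.0], M)
d = rng.integers(0, D, M)
alpha = rng.uniform(L, R, M)
phi_x = 0.5 * q * np.sign(x[d] - alpha)     # phi_x(q, d, alpha)
phi_xp = 0.5 * q * np.sign(xp[d] - alpha)
k_mc = (phi_x * phi_xp).mean() * 2 * D * (R - L)

print(k_closed, k_mc)                       # the two values agree closely
```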
Properties of the Novel Kernels
simple to compute: can even drop Δ_S or Δ_P
– adding/subtracting a constant does not change the solution, by the SVM linear constraint Σ_i y_i λ_i = 0
infinite power: perfect separability under mild assumptions
– similar power to the popular Gaussian kernel exp(−γ‖x − x′‖_2²)
– suitable control of the power may give good performance
fast automatic parameter selection: only a good parameter C is needed
– the Gaussian kernel depends on a good (γ, C) pair (usually 10 times more computation)
feature space interpretation: domain-specific tuning
– e.g. stump kernel for gene selection: the stump weight is a natural estimate of gene importance (Lin and Li, ECML/PKDD Discovery Challenge, 2005)
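A sketch of the parameter-selection saving, assuming scikit-learn's GridSearchCV (the data and grids are illustrative): the perceptron kernel needs only a search over C, while the Gaussian kernel searches a (γ, C) grid, roughly 10 times more training runs.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.where(X[:, 0] + 0.3 * rng.standard_normal(200) > 0, 1, -1)

def perceptron_kernel(X1, X2):
    # K_P(x, x') = -||x - x'||_2 (the constant Delta_P is dropped)
    return -np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2))

C_grid = np.logspace(-2, 7, 10)

# perceptron kernel: the kernel has no parameter of its own,
# so cross-validation only searches the 10 values of C
perc = GridSearchCV(SVC(kernel=perceptron_kernel), {"C": C_grid}, cv=5)

# Gaussian kernel: a (gamma, C) grid of 10 x 10 = 100 candidates,
# i.e. roughly 10 times more training runs
gauss = GridSearchCV(SVC(kernel="rbf"),
                     {"C": C_grid, "gamma": np.logspace(-4, 5, 10)}, cv=5)

perc.fit(X, y)
gauss.fit(X, y)
```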
Comparison between SVM and AdaBoost
[figure: test error (%) on datasets tw, twn, th, thn, ri, rin, aus, bre, ger, hea, ion, pim, son, vot; SVM-Stump vs. AdaBoost-Stump(100) and AdaBoost-Stump(1000)]
[figure: test error (%) on the same datasets; SVM-Perc vs. AdaBoost-Perc(100) and AdaBoost-Perc(1000)]
Results
fair comparison between AdaBoost and SVM
SVM is usually the best – there are benefits in going to infinity
Experimental Comparison
Comparison of SVM Kernels
[figure: test error (%) on the same datasets; SVM-Stump vs. SVM-Perc vs. SVM-Gauss]
Results
SVM-Perc is very similar to SVM-Gauss
SVM-Stump is comparable to, but sometimes a bit worse than, the others
Conclusion
derived two useful kernels: the stump kernel and the perceptron kernel
– provided meanings to specific distance metrics
stump kernel: succeeded in specific applications
– existing AdaBoost-Stump applications may switch
perceptron kernel: similar to Gaussian, faster in parameter selection
– can be an alternative to SVM-Gauss
not the only kernels:
– Laplacian kernel → infinite ensemble of decision trees
– exponential kernel → infinite ensemble of decision regions
SVM: a machinery for conquering infinity
– possible to apply similar machinery to areas that need infinite or large-scale aggregation