(1)

Distance Based SVM Kernels for Infinite Ensemble Learning

Hsuan-Tien Lin and Ling Li

Learning Systems Group, California Institute of Technology

ICONIP, November 2, 2005

(2)

Infinite Ensemble Learning with SVM

Setup

notation:

examples: x ∈ X ⊆ ℝ^D; labels: y ∈ {+1, −1}
hypotheses (classifiers): functions from X → {+1, −1}

binary classification problem: given training examples and labels {(x_i, y_i)}_{i=1}^N, find a classifier g(x) that predicts the labels of unseen examples well

ensemble learning: weighted vote of a committee of hypotheses

g(x) = sign( Σ_t w_t h_t(x) ),  w_t ≥ 0, h_t ∈ H

g(·) is usually better than an individual h(·)
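A minimal sketch of the weighted vote above, assuming NumPy; the hypotheses and weights are made-up illustrations, not from the talk:

```python
# Weighted-vote prediction g(x) = sign( Σ_t w_t h_t(x) ).
import numpy as np

hypotheses = [
    lambda x: np.sign(x[0]),          # h_1: a stump on the first coordinate
    lambda x: np.sign(x[1] - 0.5),    # h_2: a stump with threshold 0.5
    lambda x: np.sign(x[1] - x[0]),   # h_3: a simple perceptron-like hypothesis
]
weights = np.array([0.5, 0.3, 0.2])   # w_t >= 0

def g(x):
    """Weighted vote of the committee of hypotheses."""
    votes = np.array([h(x) for h in hypotheses])
    return np.sign(weights @ votes)

print(g(np.array([0.2, 0.9])))  # prints 1.0 (all three hypotheses vote +1)
```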

(3)

Traditional Ensemble Learning

traditional ensemble learning: iteratively find (w_t, h_t) for t = 1, 2, ..., T

g(x) = sign( Σ_{t=1}^T w_t h_t(x) ),  w_t ≥ 0, h_t ∈ H

AdaBoost: asymptotically approximates an optimal ensemble

min_{w,h} ‖w‖_1,  s.t.  y_i ( Σ_t w_t h_t(x_i) ) ≥ 1,  w_t ≥ 0

by T iterations of coordinate descent with a barrier

is an infinite ensemble better?
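A minimal sketch of this finite, iterative procedure, assuming scikit-learn's AdaBoost implementation (the talk uses its own AdaBoost-Stump setup); the default base learner is a depth-1 decision tree, i.e. a decision stump:

```python
# Traditional (finite) ensemble learning: AdaBoost over decision stumps, T = 100.
# Sketch assuming scikit-learn and NumPy; the data here is synthetic.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # labels in {+1, -1}

# AdaBoost iteratively picks (w_t, h_t) for t = 1, ..., T
ada = AdaBoostClassifier(n_estimators=100)
ada.fit(X, y)
print(ada.score(X, y))   # training accuracy of the finite ensemble
```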

(4)

Infinite Ensemble Learning with SVM

Infinite Ensemble Learning

infinite ensemble learning: |H| = ∞, and possibly an infinite number of nonzero weights w_t

g(x) = sign( Σ_t w_t h_t(x) ),  h_t ∈ H, w_t ≥ 0

infinite ensemble learning is a challenge (e.g. Vapnik, 1998)
SVM can handle an infinite number of weights with suitable kernels (e.g. Schölkopf and Smola, 2002)
SVM and AdaBoost are connected (e.g. Rätsch et al., 2001)

can SVM be applied to infinite ensemble learning?

(5)

Connection between SVM and AdaBoost

SVM:      min_{w,b} (1/2)‖w‖_2^2 + C Σ_{i=1}^N ξ_i,  s.t.  y_i ( Σ_d w_d φ_d(x_i) + b ) ≥ 1 − ξ_i
AdaBoost: min_{w,h} ‖w‖_1,  s.t.  y_i ( Σ_t w_t h_t(x_i) ) ≥ 1,  w_t ≥ 0

optimization:     SVM – dual solution with quadratic programming; AdaBoost – asymptotic approximation with barrier and coordinate descent

key for infinity: SVM – kernel trick, K(x, x') = Σ_d φ_d(x) φ_d(x'); AdaBoost – approximation

correspondence:   φ_d(x) ⟺ h_t(x)

(6)

Infinite Ensemble Learning with SVM

Framework of Infinite Ensemble Learning

Algorithm

1 Consider a hypothesis set H
2 Embed H into a kernel K_{H,r} using φ_d ⟺ h_t
3 Properly choose other SVM parameters
4 Train SVM with K_{H,r} and {(x_i, y_i)}_{i=1}^N
5 Output an infinite ensemble classifier

If H is negation complete and contains a constant hypothesis, the SVM classifier is equivalent to an infinite ensemble classifier
SVM as an optimization machinery: training routines are widely available (LIBSVM)
SVM as a well-studied learning model: inherits profound regularization properties

hard part: kernel construction
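A minimal sketch of steps 2–5 with scikit-learn's SVC and a precomputed Gram matrix (an assumption of this example; the talk uses LIBSVM directly). The kernel here is a stand-in for the embedded kernel K_{H,r}, anticipating the stump-kernel form Δ − ‖x − x'‖_1 defined two slides later:

```python
# Framework sketch: train an SVM with a user-supplied (precomputed) kernel matrix.
# Assumes scikit-learn and NumPy; the kernel is a stand-in for K_{H,r}.
import numpy as np
from sklearn.svm import SVC

def embedded_kernel(X1, X2, delta=10.0):
    """Stand-in for K_{H,r}(x, x'); here of the form delta - ||x - x'||_1."""
    return delta - np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=2)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 3))                      # {(x_i, y_i)}, i = 1..N
y_train = np.where(X_train[:, 0] + 0.5 * X_train[:, 1] > 0, 1, -1)

svm = SVC(C=1.0, kernel="precomputed")                  # step 3: choose C
svm.fit(embedded_kernel(X_train, X_train), y_train)     # step 4: train with K_{H,r}

X_test = rng.normal(size=(5, 3))
print(svm.predict(embedded_kernel(X_test, X_train)))    # step 5: g(x) on new points
```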

(7)

Embedding Hypotheses into the Kernel

K(x, x') = Σ_{t=1}^∞ h_t(x) h_t(x')   (may not converge)

⇒ K(x, x') = Σ_{d=1}^∞ [r_d h_d(x)] [r_d h_d(x')]   (with some positive r)

⇒ K(x, x') = ∫ [r(α) h_α(x)] [r(α) h_α(x')] dα   (handle uncountable cases)

Let φ_x(α) = r(α) h_α(x); the kernel K_{H,r}(x, x') = ∫_C φ_x(α) φ_{x'}(α) dα embodies H = {h_α : α ∈ C}

K_{H,r}(x, x'): an inner product of φ_x and φ_{x'} in L_2(C)

examples: stump and perceptron kernels

(8)

Stump and Perceptron Kernels

Stump Kernel

decision stump: s_{q,d,α}(x) = q · sign( (x)_d − α )
simplicity: popular for ensemble learning
consider S = { s_{q,d,α_d} : q ∈ {+1, −1}, 1 ≤ d ≤ D, α_d ∈ [L_d, R_d] }:

Definition
The stump kernel K_S is defined for S with r(q, d, α_d) = 1/2:

K_S(x, x') = Δ_S − ‖x − x'‖_1,

where Δ_S = (1/2) Σ_{d=1}^D (R_d − L_d) is a constant
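A small numerical check of this definition (a sketch assuming NumPy): approximating the integral over all stumps with a fine grid of thresholds recovers Δ_S − ‖x − x'‖_1:

```python
# Verify the stump-kernel closed form K_S(x, x') = Delta_S - ||x - x'||_1
# against a grid approximation of the integral over stumps s_{q,d,alpha}.
# Sketch assuming NumPy; L, R, x, x' below are arbitrary test values.
import numpy as np

def stump_kernel(x, xp, L, R):
    delta_S = 0.5 * np.sum(R - L)
    return delta_S - np.sum(np.abs(x - xp))

def stump_kernel_grid(x, xp, L, R, n_alpha=100_000):
    total = 0.0
    for d in range(len(x)):
        alphas = np.linspace(L[d], R[d], n_alpha)
        product = np.sign(x[d] - alphas) * np.sign(xp[d] - alphas)
        # q = +1 and q = -1 contribute the same product (factor 2); r = 1/2 gives r^2 = 1/4
        total += 2 * 0.25 * np.sum(product) * (R[d] - L[d]) / (n_alpha - 1)
    return total

L, R = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
x, xp = np.array([0.3, -0.2]), np.array([-0.5, 0.4])
print(stump_kernel(x, xp, L, R))       # 0.6 exactly
print(stump_kernel_grid(x, xp, L, R))  # close to 0.6
```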

(9)

Perceptron Kernel

a simple hyperplane: p_{θ,α}(x) = sign( θᵀx − α )
not easy for ensemble learning: hard to design a good algorithm
consider P = { p_{θ,α} : ‖θ‖_2 = 1, α ∈ [−R, R] }:

Definition
The perceptron kernel K_P is defined for P with a constant r(θ, α):

K_P(x, x') = Δ_P − ‖x − x'‖_2,  where Δ_P is a constant.
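The closed form is just as easy to compute; a minimal sketch of the Gram matrix, assuming NumPy (Δ_P is set to an arbitrary constant for illustration):

```python
# Perceptron-kernel Gram matrix K_P(x, x') = Delta_P - ||x - x'||_2.
import numpy as np

def perceptron_kernel(X1, X2, delta_P=10.0):
    dists = np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2))
    return delta_P - dists

X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [1.0, 1.0]])
print(perceptron_kernel(X, X))   # e.g. K_P(x_1, x_2) = 10 - 5 = 5
```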

(10)

Stump and Perceptron Kernels

Properties of the Novel Kernels

simple to compute: can even drop Δ_S or Δ_P
– adding/subtracting a constant does not change the solution under the SVM linear constraint Σ_i y_i λ_i = 0

infinite power: perfect separability under mild assumptions
– similar power to the popular Gaussian kernel exp(−γ‖x − x'‖_2^2)
– suitable control on the power may give good performance

fast automatic parameter selection: only a good parameter C is needed
– the Gaussian kernel depends on a good (γ, C) pair (usually 10 times more computation)

feature space interpretation: domain-specific tuning
– e.g. stump kernel for gene selection: the stump weight is a natural estimate of gene importance (Lin and Li, ECML/PKDD Discovery Challenge, 2005)
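To illustrate the parameter-selection point, a sketch assuming scikit-learn: the perceptron kernel only needs a search over C, while the Gaussian kernel needs a (γ, C) grid (the grids below are arbitrary):

```python
# Parameter selection: C only for the perceptron kernel vs. a (gamma, C) grid
# for the Gaussian kernel. Sketch assuming scikit-learn; data and grids are arbitrary.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def perceptron_kernel(X1, X2, delta_P=10.0):
    return delta_P - np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2))

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)

Cs = [0.1, 1, 10, 100, 1000]

perc = GridSearchCV(SVC(kernel=perceptron_kernel), {"C": Cs}, cv=3)        # 5 candidates
gauss = GridSearchCV(SVC(kernel="rbf"),
                     {"C": Cs, "gamma": [0.01, 0.1, 1, 10, 100]}, cv=3)    # 25 candidates
perc.fit(X, y)
gauss.fit(X, y)
print(perc.best_params_, gauss.best_params_)
```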

(11)

Comparison between SVM and AdaBoost

[Figure: test error (%) on 14 benchmark datasets (tw, twn, th, thn, ri, rin, aus, bre, ger, hea, ion, pim, son, vot); left panel: SVM-Stump vs. AdaBoost-Stump(100) and AdaBoost-Stump(1000); right panel: SVM-Perc vs. AdaBoost-Perc(100) and AdaBoost-Perc(1000)]

Results

fair comparison between AdaBoost and SVM

SVM is usually the best – it benefits from going to infinity

(12)

Experimental Comparison

Comparison of SVM Kernels

[Figure: test error (%) on the same 14 datasets for SVM-Stump, SVM-Perc, and SVM-Gauss]

Results

SVM-Perc is very similar to SVM-Gauss
SVM-Stump is comparable to, but sometimes a bit worse than, the others

(13)

Conclusion

derived two useful kernels: the stump kernel and the perceptron kernel
– provided meanings to specific distance metrics

stump kernel: succeeded in specific applications
– existing AdaBoost-Stump applications may switch

perceptron kernel: similar to Gaussian, faster in parameter selection
– can be an alternative to SVM-Gauss

not the only kernels:
– Laplacian kernel ⟺ infinite ensemble of decision trees
– exponential kernel ⟺ infinite ensemble of decision regions

SVM: a machinery for conquering infinity
– possible to apply similar machinery to areas that need infinite or large-scale aggregation
