(1)

Infinite Ensemble Learning with Support Vector Machinery

Hsuan-Tien Lin and Ling Li

Learning Systems Group, California Institute of Technology

ECML/PKDD, October 4, 2005

(2)

Outline

1 Motivation of Infinite Ensemble Learning

2 Connecting SVM and Ensemble Learning

3 SVM-Based Framework of Infinite Ensemble Learning

4 Concrete Instance of the Framework: Stump Kernel

5 Experimental Comparison

6 Conclusion

(3)

Motivation of Infinite Ensemble Learning

Learning Problem

notation: example x ∈ X ⊆ R^D and label y ∈ {+1, −1}

hypotheses (classifiers): functions from X → {+1, −1}

binary classification problem: given training examples and labels {(x_i, y_i)}_{i=1}^N, find a classifier g(x): X → {+1, −1} that predicts the label of unseen x well

(4)

Motivation of Infinite Ensemble Learning

Ensemble Learning

g(x) : X → {+1, −1}

ensemble learning: popular paradigm (bagging, boosting, etc.)

ensemble: weighted vote of a committee of hypotheses

g(x) = sign(Σ_t w_t h_t(x))

h_t: base hypotheses, usually chosen from a set H
w_t: nonnegative weight for h_t

ensemble usually better than individual h_t(x) in stability/performance
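As a minimal illustration (not part of the original slides), such a weighted vote can be written directly; the threshold-type base hypotheses and weights below are hypothetical:

```python
import numpy as np

def ensemble_predict(x, hypotheses, weights):
    """Weighted vote g(x) = sign(sum_t w_t * h_t(x)) over base hypotheses h_t."""
    return np.sign(sum(w * h(x) for h, w in zip(hypotheses, weights)))

# hypothetical committee: three threshold classifiers on the first feature
hypotheses = [lambda x, a=a: np.sign(x[0] - a) for a in (-0.5, 0.0, 0.5)]
weights = [0.2, 0.5, 0.3]   # nonnegative weights w_t
print(ensemble_predict(np.array([0.3, -1.0]), hypotheses, weights))   # -> 1.0
```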

(5)

Motivation of Infinite Ensemble Learning

Infinite Ensemble Learning

g(x) = sign(Σ_t w_t h_t(x)), h_t ∈ H, w_t ≥ 0

the set H can be of infinite size

traditional algorithms: assign a finite number of nonzero w_t

1 is finiteness a regularization and/or a restriction?

2 how to handle an infinite number of nonzero weights?


(6)

Motivation of Infinite Ensemble Learning

SVM for Infinite Ensemble Learning

Support Vector Machine (SVM): large-margin hyperplane in some feature space

SVM: possibly infinite-dimensional hyperplane

g(x) = sign(Σ_d w_d φ_d(x) + b)

an important piece of machinery to conquer infinity: the kernel trick

how can we use Support Vector Machinery for infinite ensemble learning?

(7)

Connecting SVM and Ensemble Learning

Properties of SVM

g(x) = sign(Σ_d w_d φ_d(x) + b) = sign(Σ_{i=1}^N λ_i y_i K(x_i, x) + b)

a successful large-margin learning algorithm

goal: (infinite dimensional) large-margin hyperplane

min_{w,b} (1/2)||w||₂² + C Σ_{i=1}^N ξ_i, s.t. y_i (Σ_d w_d φ_d(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0

optimal hyperplane: represented through duality

key for handling infinity: computation with the kernel trick K(x, x') = Σ_d φ_d(x) φ_d(x')

regularization: controlled with the trade-off parameter C
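For reference, the duality mentioned above leads to the standard soft-margin dual (not written out on the slide), in which the features enter only through the kernel:

```latex
\max_{\lambda}\; \sum_{i=1}^{N} \lambda_i
  - \frac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N}
      \lambda_i \lambda_j\, y_i y_j\, K(x_i, x_j)
\quad \text{s.t.}\quad 0 \le \lambda_i \le C,\;
  \sum_{i=1}^{N} \lambda_i y_i = 0
```

so an infinite-dimensional φ never needs to be computed explicitly.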

(8)

Connecting SVM and Ensemble Learning

Properties of AdaBoost

g(x) = sign(Σ_{t=1}^T w_t h_t(x))

a successful ensemble learning algorithm

goal: asymptotically, large-margin ensemble

min_{w,h} ||w||₁, s.t. y_i (Σ_t w_t h_t(x_i)) ≥ 1, w_t ≥ 0

optimal ensemble: approximated by a finite one

key for good approximation:
finiteness: some h_{t1}(x_i) = h_{t2}(x_i) for all i
sparsity: optimal ensemble usually has many zero weights

regularization: finite approximation

(9)

Connecting SVM and Ensemble Learning

Connection between SVM and AdaBoost

φ_d(x) ⇔ h_t(x)

                     SVM                                 AdaBoost
G(x)                 Σ_k w_k φ_k(x) + b                  Σ_k w_k h_k(x), w_k ≥ 0
hard-goal            min ||w||_p, s.t. y_i G(x_i) ≥ 1
                     p = 2                               p = 1
key for infinity     kernel trick                        finiteness and sparsity
regularization       soft-margin trade-off               finite approximation

(10)

SVM-Based Framework of Infinite Ensemble Learning

Challenge

challenge: how to design a good infinite ensemble learning algorithm?

traditional ensemble learning: iterative and cannot be directly generalized

our main contribution: novel and powerful infinite ensemble learning algorithm with Support Vector Machinery

our approach: embedding an infinite number of hypotheses in the SVM kernel, i.e., K(x, x') = Σ_t h_t(x) h_t(x')

– then, SVM classifier: g(x) = sign(Σ_t w_t h_t(x) + b)

1 does the kernel exist?

2 how to ensure w_t ≥ 0?

(11)

SVM-Based Framework of Infinite Ensemble Learning

Embedding Hypotheses into the Kernel

Definition

The kernel that embodies H = {h_α : α ∈ C} is defined as

K_{H,r}(x, x') = ∫_C φ_x(α) φ_{x'}(α) dα,

where C is a measure space, φ_x(α) = r(α) h_α(x), and r: C → R⁺ is chosen such that the integral always exists

integral instead of sum: works even for an uncountable H

existence problem handled with a suitable r(·)

K_{H,r}(x, x'): an inner product for φ_x and φ_{x'} in F = L₂(C)

the classifier: g(x) = sign(∫_C w(α) r(α) h_α(x) dα + b)
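As an illustration only (not from the slides), the integral can be approximated by a Riemann sum over a grid of α; here for one-dimensional decision stumps h_{q,α}(z) = q · sign(z − α) on C = {+1, −1} × [L, R] with r = 1/2:

```python
import numpy as np

def kernel_by_integration(x, xp, L=-1.0, R=1.0, n_grid=100000, r=0.5):
    """Approximate K_{H,r}(x, x') = Σ_q ∫_L^R r^2 (q·sign(x−α)) (q·sign(x'−α)) dα
    by a midpoint Riemann sum (1-D stump hypothesis set)."""
    dalpha = (R - L) / n_grid
    alphas = L + (np.arange(n_grid) + 0.5) * dalpha      # midpoints of the grid cells
    total = 0.0
    for q in (+1.0, -1.0):
        phi_x  = r * q * np.sign(x  - alphas)            # φ_x(q, α) = r · h_{q,α}(x)
        phi_xp = r * q * np.sign(xp - alphas)
        total += np.sum(phi_x * phi_xp) * dalpha
    return total

# agrees with the closed form Δ_S − |x − x'| derived later (Δ_S = (R − L)/2 = 1 here)
print(kernel_by_integration(0.3, -0.2))   # ≈ 0.5
```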

(12)

SVM-Based Framework of Infinite Ensemble Learning

Negation Completeness and Constant Hypotheses

g(x) = sign(∫_C w(α) r(α) h_α(x) dα + b)

not an ensemble classifier yet

w(α) ≥ 0?
hard to handle: possibly uncountable constraints
simple with the negation completeness assumption on H (h ∈ H if and only if (−h) ∈ H)
e.g. neural networks, perceptrons, decision trees, etc.
for any w, there exists a nonnegative w̃ that produces the same g

What is b?
equivalently, the weight on a constant hypothesis
another assumption: H contains a constant hypothesis

with mild assumptions, g(x) is equivalent to an ensemble classifier

(13)

SVM-Based Framework of Infinite Ensemble Learning

Framework of Infinite Ensemble Learning

Algorithm

1 Consider a hypothesis set H (negation complete and containing a constant hypothesis)

2 Construct a kernel K_{H,r} with a proper r(·)

3 Properly choose other SVM parameters

4 Train SVM with K_{H,r} and {(x_i, y_i)}_{i=1}^N to obtain λ_i and b

5 Output g(x) = sign(Σ_{i=1}^N y_i λ_i K_{H,r}(x_i, x) + b)

hard part: kernel construction

SVM as an optimization machinery: training routines are widely available

SVM as a well-studied learning model: inherits profound regularization properties
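A minimal sketch of steps 4–5 with an off-the-shelf SVM routine (scikit-learn's SVC with a precomputed kernel); the toy data are placeholders, and the stump kernel from the next part is used as a concrete K_{H,r}:

```python
import numpy as np
from sklearn.svm import SVC

def stump_kernel(X, Z, delta_s):
    """K_S(x, x') = Δ_S − ||x − x'||_1 (the stump kernel introduced later in the talk)."""
    return delta_s - np.abs(X[:, None, :] - Z[None, :, :]).sum(axis=2)

rng = np.random.RandomState(0)
X_train, X_test = rng.randn(40, 3), rng.randn(5, 3)
y_train = np.sign(X_train[:, 0])                                     # toy labels

delta_s = 0.5 * np.sum(X_train.max(axis=0) - X_train.min(axis=0))    # Δ_S from the training range
clf = SVC(C=1.0, kernel="precomputed")
clf.fit(stump_kernel(X_train, X_train, delta_s), y_train)      # step 4: train SVM with K_{H,r}
print(clf.predict(stump_kernel(X_test, X_train, delta_s)))     # step 5: g(x) = sign(Σ_i y_i λ_i K(x_i, x) + b)
```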

(14)

Concrete Instance of the Framework: Stump Kernel

Decision Stump

decision stump: s_{q,d,α}(x) = q · sign((x)_d − α)

simplicity: popular for ensemble learning







[Figure: Illustration of the decision stump s_{+1,2,α}(x). (a) Decision process: is (x)₂ ≥ α? Yes → +1, No → −1. (b) Decision boundary: s_{+1,2,α}(x) = +1 above the line (x)₂ = α]
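A direct transcription of the decision stump as code (a sketch, not from the slides):

```python
import numpy as np

def decision_stump(x, q, d, alpha):
    """s_{q,d,α}(x) = q · sign((x)_d − α): threshold feature d at α, direction q ∈ {+1, −1}."""
    return q * np.sign(x[d] - alpha)

x = np.array([0.4, 1.3])
print(decision_stump(x, q=+1, d=1, alpha=0.7))   # s_{+1,2,α}(x) with 0-based d = 1 -> +1.0
```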

(15)

Concrete Instance of the Framework: Stump Kernel

Stump Kernel

consider the set of decision stumps S = {s_{q,d,α_d} : q ∈ {+1, −1}, d ∈ {1, . . . , D}, α_d ∈ [L_d, R_d]}

when X ⊆ [L_1, R_1] × [L_2, R_2] × · · · × [L_D, R_D], S is negation complete and contains a constant hypothesis

Definition

The stump kernel K_S is defined for S with r(q, d, α_d) = 1/2:

K_S(x, x') = Δ_S − Σ_{d=1}^D |(x)_d − (x')_d| = Δ_S − ||x − x'||₁,

where Δ_S = (1/2) Σ_{d=1}^D (R_d − L_d) is a constant
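A sketch of why the integral collapses to this closed form (my own expansion following the definition above; per dimension d, with r = 1/2 and both directions q ∈ {+1, −1}):

```latex
\sum_{q \in \{+1,-1\}} \int_{L_d}^{R_d} \tfrac{1}{4}\,
    q\,\operatorname{sign}\!\big((x)_d - \alpha\big)\,
    q\,\operatorname{sign}\!\big((x')_d - \alpha\big)\, d\alpha
  \;=\; \tfrac{1}{2}\Big[(R_d - L_d) - 2\,\big|(x)_d - (x')_d\big|\Big]
  \;=\; \tfrac{1}{2}(R_d - L_d) - \big|(x)_d - (x')_d\big|
```

The two sign factors agree except when α lies strictly between (x)_d and (x')_d; summing over d gives K_S(x, x') = Δ_S − ||x − x'||₁.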

(16)

Concrete Instance of the Framework: Stump Kernel

Properties of Stump Kernel

simple to compute: can even use a simpler variant K̃_S(x, x') = −||x − x'||₁ while getting the same solution

under the dual constraint Σ_i y_i λ_i = 0, using K_S or K̃_S is the same

feature-space explanation for the ℓ₁-norm distance

infinite power: under mild assumptions, SVM-Stump with C = ∞ can perfectly classify all training examples

if there is a dimension for which all feature values are different, the kernel matrix K with K_ij = K_S(x_i, x_j) is strictly positive definite

similar power to the popular Gaussian kernel exp(−γ||x − x'||₂²)

– suitable control on the power leads to good performance
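A small check of the claimed equivalence (a sketch with placeholder data, using scikit-learn's SVC with precomputed kernels): since the dual solution satisfies Σ_i y_i λ_i = 0, shifting the kernel by the constant Δ_S should not change the resulting decisions.

```python
import numpy as np
from sklearn.svm import SVC

def l1_dist(X, Z):
    """Pairwise L1 distances ||x − x'||_1."""
    return np.abs(X[:, None, :] - Z[None, :, :]).sum(axis=2)

rng = np.random.RandomState(0)
X, Xte = rng.randn(60, 4), rng.randn(10, 4)
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

delta_s = 0.5 * np.sum(X.max(axis=0) - X.min(axis=0))
kernels = {"K_S":       lambda A, B: delta_s - l1_dist(A, B),   # Δ_S − ||x − x'||_1
           "K_S_tilde": lambda A, B: -l1_dist(A, B)}            #      − ||x − x'||_1

preds = {}
for name, K in kernels.items():
    clf = SVC(C=1.0, kernel="precomputed").fit(K(X, X), y)
    preds[name] = clf.predict(K(Xte, X))
print(np.array_equal(preds["K_S"], preds["K_S_tilde"]))   # expected: True (up to numerics)
```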

(17)

Concrete Instance of the Framework: Stump Kernel

Properties of Stump Kernel (Cont’d)

fast automatic parameter selection: only needs to search for a good soft-margin parameter C

scaling the stump kernel is equivalent to scaling the soft-margin parameter C

the Gaussian kernel depends on a good (γ, C) pair (parameter selection is 10 times slower)

well suited to some specific applications:

cancer prediction with gene expressions

(Lin and Li, ECML/PKDD Discovery Challenge, 2005)
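A sketch of what this means in practice (hypothetical grids, placeholder data): with the stump kernel only C is cross-validated, whereas a Gaussian kernel needs a joint (γ, C) grid:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(80, 4)
y = np.sign(X[:, 0])

delta_s = 0.5 * np.sum(X.max(axis=0) - X.min(axis=0))
K = delta_s - np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)    # stump kernel matrix

# SVM-Stump: one-dimensional search over C only
stump = GridSearchCV(SVC(kernel="precomputed"),
                     {"C": np.logspace(-2, 2, 9)}, cv=5).fit(K, y)

# SVM-Gauss: two-dimensional search over (γ, C) -- roughly an order of magnitude more candidates
gauss = GridSearchCV(SVC(kernel="rbf"),
                     {"C": np.logspace(-2, 2, 9),
                      "gamma": np.logspace(-3, 1, 9)}, cv=5).fit(X, y)
print(stump.best_params_, gauss.best_params_)
```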

(18)

Concrete Instance of the Framework: Stump Kernel

Infinite Decision Stump Ensemble

g(x) = sign(Σ_{q∈{+1,−1}} Σ_{d=1}^D ∫_{L_d}^{R_d} w_{q,d}(α) s_{q,d,α}(x) dα + b)

each s_{q,d,α}: infinitesimal influence w_{q,d}(α)

equivalently,

g(x) = sign(Σ_{q∈{+1,−1}} Σ_{d=1}^D Σ_{a=0}^{A_d} ŵ_{q,d,a} ŝ_{q,d,a}(x) + b)

ŝ: a smoother variant of the decision stump

[Figure: the smoothed stump ŝ(x) plotted along (x)_d, changing value between (x_i)_d and (x_j)_d]

(19)

Concrete Instance of the Framework: Stump Kernel

Infinitesimal Influence

g(x) = sign(Σ_{q∈{+1,−1}} Σ_{d=1}^D Σ_{a=0}^{A_d} ŵ_{q,d,a} ŝ_{q,d,a}(x) + b)

infinity → dense combination of a finite number of smooth stumps

infinitesimal influence → concrete weight of the smooth stumps

[Figure: SVM: dense combination of smoothed stumps ŝ(x) along (x)_d]

[Figure: AdaBoost: sparse combination of middle stumps s(x) along (x)_d]

(20)

Experimental Comparison

Experiment Setting

ensemble learning algorithms:

SVM-Stump: infinite ensemble of decision stumps (dense ensemble of smooth stumps)

SVM-Mid: dense ensemble of middle stumps

AdaBoost-Stump: sparse ensemble of middle stumps

SVM algorithms: SVM-Stump versus SVM-Gauss

artificial, noisy, and real-world datasets

cross-validation for automatic parameter selection of SVM

evaluate on a hold-out test set, averaged over 100 different splits
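A sketch of this protocol for one of the algorithms (SVM-Stump with the stump kernel), using placeholder data: 100 random training/test splits, cross-validation inside each training set to pick C, and error averaged over splits.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

def stump_kernel(A, B, delta_s):
    return delta_s - np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)

rng = np.random.RandomState(0)
X = rng.randn(300, 4)
y = np.sign(X[:, 0] - 0.3 * X[:, 1])

errors = []
for split in range(100):                                   # 100 different train/test splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=split)
    delta_s = 0.5 * np.sum(X_tr.max(axis=0) - X_tr.min(axis=0))
    clf = GridSearchCV(SVC(kernel="precomputed"),          # cross-validation selects C
                       {"C": np.logspace(-2, 2, 5)}, cv=5)
    clf.fit(stump_kernel(X_tr, X_tr, delta_s), y_tr)
    y_hat = clf.predict(stump_kernel(X_te, X_tr, delta_s))
    errors.append(np.mean(y_hat != y_te))
print(f"SVM-Stump test error: {100 * np.mean(errors):.1f}%")
```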

(21)

Experimental Comparison

Comparison between SVM and AdaBoost

[Figure: test error (%) on datasets tw, twn, th, thn, ri, rin, aus, bre, ger, hea, ion, pim, son, vot for SVM-Stump, SVM-Mid, AdaBoost-Stump(100), and AdaBoost-Stump(1000)]

Results

fair comparison between AdaBoost and SVM

SVM-Stump is usually the best – it benefits from going to infinity

SVM-Mid is also good – it benefits from having a dense ensemble

sparsity and finiteness are restrictions

(22)

Experimental Comparison

Comparison between SVM and AdaBoost (Cont’d)

[Figure: decision boundaries in the (x)₁–(x)₂ plane]

left to right: SVM-Stump, SVM-Mid, AdaBoost-Stump

smoother boundary with infinite ensemble (SVM-Stump)

still fits well with dense ensemble (SVM-Mid)

cannot fit well when sparse and finite (AdaBoost-Stump)

(23)

Experimental Comparison

Comparison of SVM Kernels

[Figure: test error (%) on the same datasets for SVM-Stump and SVM-Gauss]

Results

SVM-Stump is only a bit worse than SVM-Gauss

still benefits from faster parameter selection in some applications

(24)

Conclusion

Conclusion

novel and powerful framework for infinite ensemble learning

derived a new and meaningful kernel
– stump kernel: succeeded in specific applications

infinite ensemble learning could be better
– existing AdaBoost-Stump applications may switch

not the only kernel:
perceptron kernel: infinite ensemble of perceptrons
Laplacian kernel: infinite ensemble of decision trees

SVM: our machinery for conquering infinity
– possible to apply similar machinery to areas that need infinite or large-scale aggregation
