### Infinite Ensemble Learning with Support Vector Machines

Hsuan-Tien Lin

in collaboration with Ling Li, Learning Systems Group, Caltech

Second Symposium on Vision and Learning, 2005/09/21

### Outline

1 Setup of our Learning Problem

2 Motivation of Infinite Ensemble Learning

3 Connecting SVM and Ensemble Learning

4 SVM-Based Framework of Infinite Ensemble Learning

5 Examples of the Framework

6 Experimental Comparison

7 Conclusion and Discussion


### Setup of our Learning Problem

binary classification problem: does this image represent an apple?

- features of the image: a vector x ∈ X ⊆ R^D.
  e.g., (x)_1 can describe the shape, (x)_2 can describe the color, etc.
- difference to the features in vision: a vector of properties, not a "set of interest points."
- label (whether the image is an apple): y ∈ {+1, −1}.
- learning problem: given many images and their labels (training examples) {(x_i, y_i)}_{i=1}^N, find a classifier g(x): X → {+1, −1} that predicts unseen images well.
- hypotheses (classifiers): functions from X to {+1, −1}.


### Motivation of Infinite Ensemble Learning

g(x): X → {+1, −1}

- ensemble learning: popular paradigm.
- ensemble: weighted vote of a committee of hypotheses,
  g(x) = sign(Σ_t w_t h_t(x)), w_t ≥ 0.
- traditional ensemble learning: infinite-size committee, but only a finite number of nonzero weights.
  - is finiteness a restriction and/or regularization?
  - how to handle an infinite number of nonzero weights?
- SVM (large-margin hyperplane): also popular.
- hyperplane: a weighted combination of features.
- SVM: infinite-dimensional hyperplane through kernels,
  g(x) = sign(Σ_d w_d φ_d(x) + b).
- can we use SVM for infinite ensemble learning?
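The weighted-vote ensemble g(x) = sign(Σ_t w_t h_t(x)) can be sketched in a few lines; a minimal illustration assuming NumPy, with toy hypotheses and weights chosen purely for exposition:

```python
import numpy as np

def ensemble_predict(x, hypotheses, weights):
    # g(x) = sign(sum_t w_t * h_t(x)), with nonnegative weights w_t
    vote = sum(w * h(x) for h, w in zip(hypotheses, weights))
    return 1 if vote >= 0 else -1

# three toy hypotheses on a 2-D input (illustrative choices only)
hypotheses = [
    lambda x: 1 if x[0] > 0 else -1,
    lambda x: 1 if x[1] > 0 else -1,
    lambda x: 1 if x[0] + x[1] > 1 else -1,
]
weights = [0.5, 0.3, 0.2]

print(ensemble_predict(np.array([2.0, 3.0]), hypotheses, weights))   # 1
```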


### Illustration of SVM

training examples {(x_i, y_i)}_{i=1}^N

- features φ_d implicitly computed: φ_1(x), φ_2(x), …, φ_∞(x).
- weights w_d obtained via duality: w_1, w_2, …, w_∞.
- g(x) = sign(Σ_{d=1}^∞ w_d φ_d(x) + b).
- SVM: implicit computation with K(x, x') = Σ_{d=1}^∞ φ_d(x) φ_d(x').
- optimal solution (w, b): represented by the dual variables (λ_i)_{i=1}^N.


### Property of SVM

g(x) = sign(Σ_{d=1}^∞ w_d φ_d(x) + b) = sign(Σ_{i=1}^N λ_i y_i K(x_i, x) + b)

- optimal hyperplane: represented through duality.
- key for handling infinity: kernel trick K(x, x') = Σ_{d=1}^∞ φ_d(x) φ_d(x').
- quadratic programming of a margin-related criterion.
- goal: (infinite-dimensional) large-margin hyperplane,

  min_{w,b} (1/2)‖w‖_2^2 + C Σ_{i=1}^N ξ_i, s.t. y_i (Σ_{d=1}^∞ w_d φ_d(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0.

- regularization: controlled with the trade-off parameter C.


### Illustration of AdaBoost

training examples {(x_i, y_i)}_{i=1}^N

- hypotheses h_t ∈ H iteratively selected: h_1(x), h_2(x), …, h_T(x).
- weights w_t ≥ 0 iteratively assigned: w_1, w_2, …, w_T.
- example weights u_1(i), u_2(i), … emphasize difficult examples.
- g(x) = sign(Σ_{t=1}^T w_t h_t(x)).
- AdaBoost: most successful ensemble learning algorithm.
- boosts up the performance of each individual h_t.
- emphasizes difficult examples by u_t and finds (h_t, w_t) iteratively.

### Property of AdaBoost

g(x) = sign(Σ_{t=1}^T w_t h_t(x))

- iterative coordinate descent of a margin-related criterion:

  min Σ_{i=1}^N exp(−ρ_i), s.t. ρ_i = y_i (Σ_{t=1}^∞ w_t h_t(x_i)), w_t ≥ 0.

- goal: asymptotically, a large-margin ensemble:

  min_{w,h} ‖w‖_1, s.t. y_i (Σ_{t=1}^∞ w_t h_t(x_i)) ≥ 1, w_t ≥ 0.

- optimal ensemble: approximated by a finite one.
- key for good approximation: sparsity – some optimal ensemble has many zero weights.
- regularization: finite approximation.

### Connection between SVM and AdaBoost

correspondence: φ_d(x) ⇔ h_t(x)

| | SVM | AdaBoost |
| --- | --- | --- |
| classifier | G(x) = Σ_k w_k φ_k(x) + b | G(x) = Σ_k w_k h_k(x), w_k ≥ 0 |
| hard-goal | min ‖w‖_p, s.t. y_i G(x_i) ≥ 1, with p = 2 | same, with p = 1 |
| optimization | quadratic programming | iterative coordinate descent |
| key for infinity | kernel trick | sparsity |
| regularization | soft-margin trade-off | finite approximation |


### Challenge

designing an infinite ensemble learning algorithm is challenging:

- traditional ensemble learning: iterative, and cannot be directly generalized.
- another approach: embed an infinite number of hypotheses in the SVM kernel, i.e., K(x, x') = Σ_{t=1}^∞ h_t(x) h_t(x').
- then, the SVM classifier becomes g(x) = sign(Σ_{t=1}^∞ w_t h_t(x) + b).
- does the kernel exist?
- how to ensure w_t ≥ 0?
- our main contribution: a framework that conquers the challenge.


### Embedding Hypotheses into the Kernel

Definition

The kernel that embodies H = {h_α : α ∈ C} is defined as

K_{H,r}(x, x') = ∫_C φ_x(α) φ_{x'}(α) dα,

where C is a measure space, φ_x(α) = r(α) h_α(x), and r: C → R^+ is chosen such that the integral always exists.

- integral instead of sum: works even for uncountable H.
- K_{H,r}(x, x'): an inner product for φ_x and φ_{x'} in F = L_2(C).
- the classifier: g(x) = sign(∫_C w(α) r(α) h_α(x) dα + b).
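The integral definition can be checked numerically; a sketch assuming NumPy, for one-dimensional decision stumps on X = [0, 1] with C = {+1, −1} × [0, 1] and r = 1/2, where the integral should recover Δ_S − |x − x'| with Δ_S = 1/2:

```python
import numpy as np

# Numerical check of K_{H,r} for 1-D decision stumps on X = [0, 1]:
# phi_x(q, alpha) = r * q * sign(x - alpha) with r = 1/2; the sum over
# q in {+1, -1} plus a midpoint-rule integral over alpha approximates
# K_S(x, x') = Delta_S - |x - x'|, with Delta_S = 1/2.
def stump_kernel_integral(x, xp, n=200_000):
    alphas = (np.arange(n) + 0.5) / n          # midpoint rule on [0, 1]
    total = 0.0
    for q in (+1, -1):
        phi_x = 0.5 * q * np.sign(x - alphas)
        phi_xp = 0.5 * q * np.sign(xp - alphas)
        total += np.sum(phi_x * phi_xp) / n
    return total

print(stump_kernel_integral(0.3, 0.8))         # approx 0.5 - |0.3 - 0.8| = 0.0
```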


### Negation Completeness and Constant Hypotheses

g(x) = sign(∫_C w(α) r(α) h_α(x) dα + b)

- not an ensemble classifier yet.
- w(α) ≥ 0?
  - hard to handle: possibly uncountable constraints.
  - simple with a negation-completeness assumption on H.
  - negation completeness: h ∈ H if and only if (−h) ∈ H.
  - for any w, there exists a nonnegative w̃ that produces the same g.
- what is b?
  - equivalently, the weight on a constant hypothesis.
  - another assumption: H contains a constant hypothesis.
- both assumptions: mild in practice.
- g(x) is equivalent to an ensemble classifier.


### Framework of Infinite Ensemble Learning

Algorithm

1 Consider a hypothesis set H (negation complete and containing a constant hypothesis).

2 Construct a kernel K_{H,r} with a proper r(·).

3 Properly choose other SVM parameters.

4 Train SVM with K_{H,r} and {(x_i, y_i)}_{i=1}^N to obtain λ_i and b.

5 Output g(x) = sign(Σ_{i=1}^N y_i λ_i K_{H,r}(x_i, x) + b).

- easy: SVM routines.
- hard: kernel construction.
- shall inherit the profound properties of SVM.
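The five steps can be run with an off-the-shelf SVM routine; a minimal sketch assuming scikit-learn and NumPy, using the stump kernel with the constant Δ_S dropped (which only shifts b), and toy data and C chosen purely for exposition:

```python
import numpy as np
from sklearn.svm import SVC

# Minimal sketch of the framework (steps 1-5) with the stump kernel.
# Delta_S is dropped: K(x, x') = -||x - x'||_1, which only shifts b.
def stump_kernel(X1, X2):
    # Gram matrix of negative pairwise L1 distances
    return -np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=-1)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 2))           # training examples
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)     # toy labels

svm = SVC(C=10.0, kernel=stump_kernel).fit(X, y)   # step 4: obtain lambda_i, b
g = svm.predict(X)                                 # step 5: g(x)
print((g == y).mean())                             # training accuracy
```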


### Decision Stump

- decision stump: s_{q,d,α}(x) = q · sign((x)_d − α).
- simplicity: popular for ensemble learning (e.g., Viola and Jones).

Figure: Illustration of the decision stump s_{+1,2,α}(x). (a) decision process: predict +1 if (x)_2 ≥ α, otherwise −1; (b) decision boundary: the line (x)_2 = α in the ((x)_1, (x)_2) plane.


### Stump Kernel

- consider the set of decision stumps S = {s_{q,d,α_d} : q ∈ {+1, −1}, d ∈ {1, …, D}, α_d ∈ [L_d, R_d]}.
- when X ⊆ [L_1, R_1] × [L_2, R_2] × ⋯ × [L_D, R_D], S is negation complete and contains a constant hypothesis.

Definition

The stump kernel K_S is defined for S with r(q, d, α_d) = 1/2:

K_S(x, x') = Δ_S − Σ_{d=1}^D |(x)_d − (x')_d| = Δ_S − ‖x − x'‖_1,

where Δ_S = (1/2) Σ_{d=1}^D (R_d − L_d) is a constant.
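A small numeric sketch of the definition, assuming NumPy and hypothetical box bounds [L_d, R_d]:

```python
import numpy as np

# Stump kernel on X = [L_1,R_1] x ... x [L_D,R_D]:
# K_S(x, x') = Delta_S - ||x - x'||_1, Delta_S = (1/2) sum_d (R_d - L_d).
L, R = np.array([0.0, -1.0]), np.array([1.0, 1.0])   # assumed box bounds
delta_S = 0.5 * (R - L).sum()                        # here Delta_S = 1.5

def stump_kernel(x, xp):
    return delta_S - np.abs(x - xp).sum()

x, xp = np.array([0.2, 0.5]), np.array([0.7, -0.5])
print(stump_kernel(x, xp))   # 1.5 - (0.5 + 1.0) = 0.0
```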

### Property of Stump Kernel

- simple to compute: the constant Δ_S can even be dropped,
  K̃(x, x') = −‖x − x'‖_1.
- infinite power: under mild assumptions, SVM with C = ∞ can perfectly classify training examples with the stump kernel.
  - the popular Gaussian kernel exp(−γ‖x − x'‖_2^2) also has this property.
- fast parameter selection: scaling the stump kernel is equivalent to scaling the soft-margin parameter C.
  - the Gaussian kernel depends on a good (γ, C) pair.
  - the stump kernel only needs a good C: roughly ten times faster.
- feature-space explanation for ℓ_1-norm similarity.
- well suited to some specific applications: cancer prediction with gene expressions.
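The scaling claim can be checked numerically; a sketch assuming scikit-learn and NumPy, comparing an SVM trained with the kernel scaled by κ against one trained with C scaled by κ (toy data for exposition only):

```python
import numpy as np
from sklearn.svm import SVC

# Numerical check: scaling the (constant-dropped) stump kernel by kappa is
# equivalent to scaling the soft-margin parameter C by kappa.
def stump_kernel(X1, X2):
    return -np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=-1)

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(30, 2))
y = np.where(X[:, 0] > 0, 1, -1)
kappa = 4.0

scaled_kernel = SVC(C=1.0, kernel=lambda A, B: kappa * stump_kernel(A, B)).fit(X, y)
scaled_C = SVC(C=kappa, kernel=stump_kernel).fit(X, y)
print(np.array_equal(scaled_kernel.predict(X), scaled_C.predict(X)))
```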


### Perceptron

- perceptron: p_{θ,α}(x) = sign(θ^T x − α).
- not easy for ensemble learning: hard to design a good algorithm.

Figure: Illustration of the perceptron p_{θ,α}(x). (a) decision process: predict +1 if θ^T x ≥ α, otherwise −1; (b) decision boundary: the hyperplane θ^T x = α with normal direction θ.


### Perceptron Kernel

- consider the set of perceptrons P = {p_{θ,α} : θ ∈ R^D, ‖θ‖_2 = 1, α ∈ [−R, R]}.
- when X is within a ball of radius R centered at the origin, P is negation complete and contains a constant hypothesis.

Definition

The perceptron kernel is K_P with r(θ, α) = r_P,

K_P(x, x') = Δ_P − ‖x − x'‖_2,

where r_P and Δ_P are constants.
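Like the stump kernel, the perceptron kernel is a one-liner; a sketch assuming scikit-learn and NumPy, with Δ_P dropped (it only shifts b) and a toy nonlinear problem chosen for exposition:

```python
import numpy as np
from sklearn.svm import SVC

# Perceptron-kernel sketch: K_P(x, x') = Delta_P - ||x - x'||_2, with the
# constant Delta_P dropped. Toy circular-boundary data for exposition.
def perceptron_kernel(X1, X2):
    diff = X1[:, None, :] - X2[None, :, :]
    return -np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(40, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5, 1, -1)   # circular boundary

svm = SVC(C=10.0, kernel=perceptron_kernel).fit(X, y)
print((svm.predict(X) == y).mean())                      # training accuracy
```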


### Property of Perceptron Kernel

- similar properties to the stump kernel.
- also simple to compute.
- infinite power: equivalent to a D-∞-1 neural network.
- fast parameter selection: the same kernel was also studied by Fleuret and Sahbi (ICCV 2003 workshop), who call it the triangular kernel, but without the feature-space explanation.


### Histogram Intersection Kernel

- introduced for scene recognition (Odone et al., IEEE TIP, 2005).
- assume (x)_d: counts in the histogram (how many pixels are red?) – an integer in [0, size of image].
- histogram intersection kernel: K(x, x') = Σ_{d=1}^D min((x)_d, (x')_d).
- generalized with difficult math when (x)_d is not an integer (Boughorbel et al., ICIP, 2005), for similar tasks.
- let ŝ(x) = (s(x) + 1)/2: HIK can be constructed easily from the framework.
- furthermore, HIK is equivalent to the stump kernel.
- insights on why the HI (stump) kernel works well for the task?
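The equivalence to the stump kernel rests on the identity min(a, b) = (a + b − |a − b|)/2, so the histogram intersection kernel differs from −‖x − x'‖_1 only by terms that depend on x or x' alone; a quick numeric check assuming NumPy:

```python
import numpy as np

# Check min(a, b) = (a + b - |a - b|) / 2 elementwise: it relates the
# histogram intersection kernel to the L1 distance (and hence to the
# stump kernel, up to terms depending on x or x' alone).
def hik(x, xp):
    return np.minimum(x, xp).sum()

rng = np.random.default_rng(2)
x, xp = rng.integers(0, 256, size=8), rng.integers(0, 256, size=8)

lhs = hik(x, xp)
rhs = 0.5 * (x.sum() + xp.sum() - np.abs(x - xp).sum())
print(lhs == rhs)   # True
```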


### Other Kernels

- Laplacian kernel: K(x, x') = exp(−γ‖x − x'‖_1).
  - provably embodies an infinite number of decision trees.
- generalized Laplacian kernel: K(x, x') = exp(−γ Σ_d |(x)_d^a − (x')_d^a|).
  - can be similarly constructed with a slightly different r function.
  - standard kernel for histogram-based image classification with SVM (Chapelle et al., IEEE TNN, 1999).
  - insights on why it should work well?
- exponential kernel: K(x, x') = exp(−γ‖x − x'‖_2).
  - provably embodies an infinite number of decision trees of perceptrons.
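A sketch of the Laplacian kernel's Gram matrix, assuming NumPy, with γ a free parameter chosen here for illustration:

```python
import numpy as np

# Laplacian kernel K(x, x') = exp(-gamma * ||x - x'||_1), which the text
# states embodies an infinite number of decision trees.
def laplacian_kernel(X1, X2, gamma=1.0):
    l1 = np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=-1)
    return np.exp(-gamma * l1)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = laplacian_kernel(X, X, gamma=0.5)
print(np.round(K, 3))
```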


### Comparison between SVM and AdaBoost

Figure: test error (%) on fourteen datasets (tw, twn, th, thn, ri, rin, aus, bre, ger, hea, ion, pim, son, vot), comparing SVM-Stump with AdaBoost-Stump(100) and AdaBoost-Stump(1000), and SVM-Perc with AdaBoost-Perc(100) and AdaBoost-Perc(1000).

Results

- fair comparison between AdaBoost and SVM.
- SVM is usually best – it benefits from going to infinity.
- sparsity (finiteness) is a restriction.


### Comparison of SVM Kernels

Figure: test error (%) on the same fourteen datasets, comparing SVM-Stump, SVM-Perc, and SVM-Gauss.

Results

- SVM-Perc: very similar to SVM-Gauss.
- SVM-Stump: comparable to, but sometimes a bit worse than, the others.


### Conclusion and Discussion

- constructed: a general framework for infinite ensemble learning.
  - infinite ensemble learning could be better – existing AdaBoost-Stump applications may switch.
- derived new and meaningful kernels.
  - stump kernel: succeeded in specific applications.
  - perceptron kernel: similar to Gaussian, faster in parameter selection.
- gave novel interpretations to existing kernels.
  - histogram intersection kernel: equivalent to the stump kernel.
  - Laplacian kernel: ensemble of decision trees.
- possible thoughts for vision:
  - would fast parameter selection be important for some problems?
  - any vision applications in which these kernel models are reasonable?
  - do the novel interpretations give any insights?
  - any domain knowledge that can be brought into kernel construction?