Distance-Based SVM Kernels for Infinite Ensemble Learning
Hsuan-Tien Lin and Ling Li
Learning Systems Group, California Institute of Technology
ICONIP, November 2, 2005
Infinite Ensemble Learning with SVM
Setup
notation:
examples: x ∈ X ⊆ R^D; labels: y ∈ {+1, −1}
hypotheses (classifiers): functions from X → {+1, −1}
binary classification problem: given training examples and labels {(x_i, y_i)}_{i=1}^N, find a classifier g(x) that predicts the labels of unseen examples well
ensemble learning: weighted vote of a committee of hypotheses
g(x) = sign(Σ_t w_t h_t(x)), w_t ≥ 0, h_t ∈ H
g(·) is usually better than an individual h(·)
Traditional Ensemble Learning
traditional ensemble learning: iteratively find (w_t, h_t) for t = 1, 2, …, T
g(x) = sign(Σ_{t=1}^T w_t h_t(x)), w_t ≥ 0, h_t ∈ H
AdaBoost: asymptotically approximates an optimal ensemble
min_{w,h} ‖w‖_1, s.t. y_i (Σ_{t=1}^∞ w_t h_t(x_i)) ≥ 1, w_t ≥ 0
by T iterations of coordinate descent on a barrier objective
is an infinite ensemble better?
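For reference, here is a minimal Python sketch of the finite AdaBoost procedure above, with decision stumps as the hypothesis set; `train_stump` is a hypothetical brute-force weak learner, and the weight formula w_t = ½ ln((1 − ε)/ε) is the standard AdaBoost choice rather than anything specific to this talk.

```python
import numpy as np

def train_stump(X, y, weights):
    """Hypothetical weak learner: exhaustively pick the decision stump
    s_{q,d,a}(x) = q * sign(x_d - a) with the lowest weighted error."""
    best = (np.inf, 1, 0, 0.0)  # (error, q, d, alpha)
    for d in range(X.shape[1]):
        for a in np.unique(X[:, d]):
            pred = np.where(X[:, d] >= a, 1, -1)  # sign(0) taken as +1
            for q in (1, -1):
                err = weights[q * pred != y].sum()
                if err < best[0]:
                    best = (err, q, d, a)
    return best

def adaboost(X, y, T=100):
    weights = np.full(len(y), 1.0 / len(y))   # example weights
    ensemble = []
    for _ in range(T):
        err, q, d, a = train_stump(X, y, weights)
        if err >= 0.5:                         # no useful weak hypothesis left
            break
        if err == 0:                           # perfect stump: take it and stop
            ensemble.append((1.0, q, d, a))
            break
        w_t = 0.5 * np.log((1 - err) / err)    # hypothesis weight, w_t >= 0
        pred = q * np.where(X[:, d] >= a, 1, -1)
        weights *= np.exp(-w_t * y * pred)     # emphasize misclassified examples
        weights /= weights.sum()
        ensemble.append((w_t, q, d, a))
    return ensemble

def predict(ensemble, X):
    # g(x) = sign(sum_t w_t h_t(x)): the finite weighted vote
    votes = sum(w_t * q * np.where(X[:, d] >= a, 1, -1)
                for w_t, q, d, a in ensemble)
    return np.sign(votes)
```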
Infinite Ensemble Learning
infinite ensemble learning: |H| = ∞, and possibly an infinite number of nonzero weights w_t
g(x) = sign(Σ_t w_t h_t(x)), h_t ∈ H, w_t ≥ 0
infinite ensemble learning is a challenge (e.g. Vapnik, 1998)
SVM can handle an infinite number of weights with suitable kernels (e.g. Schölkopf and Smola, 2002)
SVM and AdaBoost are connected (e.g. Rätsch et al., 2001)
can SVM be applied to infinite ensemble learning?
Connection between SVM and AdaBoost
SVM:
  min_{w,b} ½‖w‖_2² + C Σ_{i=1}^N |ξ_i|, s.t. y_i (Σ_{d=1}^∞ w_d φ_d(x_i) + b) ≥ 1 − ξ_i
  optimization: dual solution with quadratic programming
  key for infinity: kernel trick, K(x, x′) = Σ_d φ_d(x) φ_d(x′)
AdaBoost:
  min_{w,h} ‖w‖_1, s.t. y_i (Σ_{t=1}^∞ w_t h_t(x_i)) ≥ 1, w_t ≥ 0
  optimization: asymptotically approximated with barrier and coordinate descent
  key for infinity: approximation
correspondence: φ_d(x) ⇐⇒ h_t(x)
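To make the "key for infinity" row explicit, recall the standard soft-margin SVM dual (a textbook step, filled in here for clarity, not a slide from the talk): the possibly infinite-dimensional w and φ enter only through the kernel K.

```latex
\max_{\lambda}\; \sum_{i=1}^{N} \lambda_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
    \lambda_i \lambda_j y_i y_j \, K(x_i, x_j),
\qquad \text{s.t. } 0 \le \lambda_i \le C,\;\; \sum_{i=1}^{N} y_i \lambda_i = 0,
\qquad
g(x) = \operatorname{sign}\Bigl(\sum_{i=1}^{N} y_i \lambda_i \, K(x_i, x) + b\Bigr).
```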
Framework of Infinite Ensemble Learning
Algorithm
1 Consider a hypothesis set H
2 Embed H into a kernel K_{H,r} using φ_d ⇔ h_t
3 Properly choose other SVM parameters
4 Train SVM with K_{H,r} and {(x_i, y_i)}_{i=1}^N
5 Output an infinite ensemble classifier
if H is negation complete and contains a constant hypothesis, the SVM classifier is equivalent to an infinite ensemble classifier
SVM as an optimization machinery: training routines are widely available (e.g. LIBSVM)
SVM as a well-studied learning model: inherits the profound regularization properties
hard part: kernel construction
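A minimal sketch of the framework in Python, assuming scikit-learn's SVC with a precomputed Gram matrix; the toy data and names are illustrative, and the stump kernel used in step 2 is only defined on the upcoming slides.

```python
import numpy as np
from sklearn.svm import SVC

# Steps 1-2: the hypothesis set S of decision stumps, embedded into the
# stump kernel K_S(x, x') = Delta_S - ||x - x'||_1 (defined later);
# the constant Delta_S is dropped, which does not change the solution
def stump_kernel(X1, X2):
    return -np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=2)

# illustrative toy data {(x_i, y_i)}_{i=1}^N
rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 5))
y_train = np.where(X_train[:, 0] + 0.5 * X_train[:, 1] > 0, 1, -1)
X_test = rng.standard_normal((20, 5))

# Step 3: choose the soft-margin parameter C
clf = SVC(C=1.0, kernel="precomputed")

# Step 4: train SVM with the Gram matrix of K_S
clf.fit(stump_kernel(X_train, X_train), y_train)

# Step 5: the resulting classifier is an infinite ensemble of stumps
y_pred = clf.predict(stump_kernel(X_test, X_train))
```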
Embedding Hypotheses into the Kernel
K(x, x′) = Σ_{t=1}^∞ h_t(x) h_t(x′)   (may not converge)
⇒ K(x, x′) = Σ_{d=1}^∞ [r_d h_d(x)][r_d h_d(x′)]   (with some positive r_d)
⇒ K(x, x′) = ∫_C [r(α) h_α(x)][r(α) h_α(x′)] dα   (handles the uncountable case)
let φ_x(α) = r(α) h_α(x); the kernel K_{H,r}(x, x′) = ∫_C φ_x(α) φ_{x′}(α) dα embodies H = {h_α : α ∈ C}
K_{H,r}(x, x′): an inner product of φ_x and φ_{x′} in L_2(C)
examples: stump and perceptron kernels
Stump and Perceptron Kernels
Stump Kernel
decision stump: s_{q,d,α}(x) = q · sign((x)_d − α)
simplicity: popular for ensemble learning
consider S = {s_{q,d,α_d} : q ∈ {+1, −1}, 1 ≤ d ≤ D, α_d ∈ [L_d, R_d]}:
Definition
The stump kernel K_S is defined for S with r(q, d, α_d) = 1/2:
K_S(x, x′) = Δ_S − ‖x − x′‖_1,
where Δ_S = (1/2) Σ_{d=1}^D (R_d − L_d) is a constant
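The closed form follows by direct integration; this worked step is filled in for clarity and assumes (x)_d, (x′)_d ∈ [L_d, R_d]. In each dimension d, the product of the two stump outputs is −1 exactly when α lies between (x)_d and (x′)_d, so

```latex
\sum_{q \in \{\pm 1\}} \int_{L_d}^{R_d}
    \left[\tfrac{1}{2}\, q\, \mathrm{sign}\bigl((x)_d - \alpha\bigr)\right]
    \left[\tfrac{1}{2}\, q\, \mathrm{sign}\bigl((x')_d - \alpha\bigr)\right] d\alpha
  = \frac{R_d - L_d}{2} - \bigl|(x)_d - (x')_d\bigr|,
```

and summing over d = 1, …, D gives K_S(x, x′) = Δ_S − ‖x − x′‖_1.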
Perceptron Kernel
a simple hyperplane: p_{θ,α}(x) = sign(θᵀx − α)
not easy for ensemble learning: hard to design a good algorithm
consider P = {p_{θ,α} : ‖θ‖_2 = 1, α ∈ [−R, R]}:
Definition
The perceptron kernel K_P is defined for P with a constant r(θ, α):
K_P(x, x′) = Δ_P − ‖x − x′‖_2, where Δ_P is a constant.
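As a sanity check on the embedding, a short Python sketch (illustrative, not from the talk) that approximates the stump-kernel integral by Monte Carlo sampling of stumps and compares it with the closed form; the analogous check works for the perceptron kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, R = 3, -1.0, 1.0                      # assume each feature lies in [L, R]
x, xp = rng.uniform(L, R, D), rng.uniform(L, R, D)

# closed form: K_S(x, x') = Delta_S - ||x - x'||_1
delta_s = 0.5 * D * (R - L)
k_closed = delta_s - np.abs(x - xp).sum()

# Monte Carlo: sample stumps s_{q,d,alpha}(x) = q * sign(x_d - alpha)
# uniformly; with r = 1/2, the integral over (q, d, alpha) equals the
# sample average times the total measure 2 * D * (R - L)
M = 200_000
q = rng.choice([-1.0, 1.0], M)
d = rng.integers(0, D, M)
alpha = rng.uniform(L, R, M)
phi_x = 0.5 * q * np.sign(x[d] - alpha)     # phi_x(q, d, alpha)
phi_xp = 0.5 * q * np.sign(xp[d] - alpha)
k_mc = (phi_x * phi_xp).mean() * 2 * D * (R - L)

print(k_closed, k_mc)                       # the two values agree closely
```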
Properties of the Novel Kernels
simple to compute: can even drop Δ_S or Δ_P
– adding/subtracting a constant does not change the solution, by the SVM linear constraint Σ_i y_i λ_i = 0
infinite power: perfect separability under mild assumptions
– similar power to the popular Gaussian kernel exp(−γ‖x − x′‖_2²)
– suitable control of the power may give good performance
fast automatic parameter selection: only a good parameter C is needed
– the Gaussian kernel depends on a good (γ, C) pair (usually 10 times more computation)
feature space interpretation: domain-specific tuning
– e.g. stump kernel for gene selection: the stump weight is a natural estimate of gene importance (Lin and Li, ECML/PKDD Discovery Challenge, 2005)
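A sketch of the parameter-selection saving, assuming scikit-learn's GridSearchCV (the data and grids are illustrative): the perceptron kernel needs only a search over C, while the Gaussian kernel searches a (γ, C) grid, roughly 10 times more training runs.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.where(X[:, 0] + 0.3 * rng.standard_normal(200) > 0, 1, -1)

def perceptron_kernel(X1, X2):
    # K_P(x, x') = -||x - x'||_2 (the constant Delta_P is dropped)
    return -np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2))

C_grid = np.logspace(-2, 7, 10)

# perceptron kernel: the kernel has no parameter of its own,
# so cross-validation only searches the 10 values of C
perc = GridSearchCV(SVC(kernel=perceptron_kernel), {"C": C_grid}, cv=5)

# Gaussian kernel: a (gamma, C) grid of 10 x 10 = 100 candidates,
# i.e. roughly 10 times more training runs
gauss = GridSearchCV(SVC(kernel="rbf"),
                     {"C": C_grid, "gamma": np.logspace(-4, 5, 10)}, cv=5)

perc.fit(X, y)
gauss.fit(X, y)
```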
Comparison between SVM and AdaBoost
[figure: test error (%) on datasets tw, twn, th, thn, ri, rin, aus, bre, ger, hea, ion, pim, son, vot; SVM-Stump vs. AdaBoost-Stump(100) and AdaBoost-Stump(1000)]
[figure: test error (%) on the same datasets; SVM-Perc vs. AdaBoost-Perc(100) and AdaBoost-Perc(1000)]
Results
fair comparison between AdaBoost and SVM
SVM is usually the best – there are benefits in going to infinity
Experimental Comparison
Comparison of SVM Kernels
[figure: test error (%) on the same datasets; SVM-Stump vs. SVM-Perc vs. SVM-Gauss]
Results
SVM-Perc is very similar to SVM-Gauss
SVM-Stump is comparable to, but sometimes a bit worse than, the others
Conclusion
derived two useful kernels: the stump kernel and the perceptron kernel
– provided meanings to specific distance metrics
stump kernel: succeeded in specific applications
– existing AdaBoost-Stump applications may switch
perceptron kernel: similar to Gaussian, faster in parameter selection
– can be an alternative to SVM-Gauss
not the only kernels:
– Laplacian kernel → infinite ensemble of decision trees
– exponential kernel → infinite ensemble of decision regions
SVM: a machinery for conquering infinity
– possible to apply similar machinery to areas that need infinite or large-scale aggregation