Perceptron Learning with Random Coordinate Descent
Ling Li and Hsuan-Tien Lin
Learning Systems Group, California Institute of Technology, U.S.A.
International Joint Conf. on Neural Networks, August 15, 2007
Introduction
Perceptron
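proposed by Rosenblatt (1958)
a single neuron;
a linear threshold classifier;
a hyperplane in $\mathbb{R}^d$
$y = \operatorname{sign}(\langle w, x \rangle + b)$
define $(x)_0 \triangleq 1$ and $w_0 \triangleq b$:
$y = \operatorname{sign}(\langle w, x \rangle)$
[Figure: a single neuron with inputs $1, (x)_1, \ldots, (x)_d$ weighted by $b, w_1, \ldots, w_d$, producing output $y$]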
a simple but useful classifier, especially for building more complex systems
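A minimal sketch of the classifier above in plain Python, assuming inputs are stored as lists of floats. The helper names `augment` and `predict` are hypothetical, and mapping a zero activation to -1 is an assumption of this sketch, since the slide leaves sign(0) unspecified.

```python
def augment(x):
    """Prepend the constant (x)_0 = 1 so the bias b can be stored as w_0."""
    return [1.0] + list(x)


def predict(w, x_aug):
    """Return sign(<w, x>) for an augmented input; zero maps to -1 (assumption)."""
    activation = sum(wi * xi for wi, xi in zip(w, x_aug))
    return 1 if activation > 0 else -1


# w = (b, w_1, w_2) with bias b = -1
w = [-1.0, 2.0, 0.5]
label = predict(w, augment([1.0, 0.0]))  # activation = -1 + 2 + 0 = 1
```

Folding the bias into the weight vector lets every later formula use the single inner product $\langle w, x \rangle$.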
Perceptron Learning Rule (PLR)
an iterative optimization procedure to learn $w$ from $S = \{(x_n, y_n)\}_{n=1}^{N}$ (Rosenblatt, 1962)
repeatedly, for $(x_n, y_n) \in S$,
1 if current $w$ correctly classifies $x_n$, do nothing;
2 if current $w$ wrongly classifies $x_n$, $w_{\text{new}} = w + y_n x_n$
convergence proved for separable $S$
but unstable for nonseparable cases
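The rule above can be sketched as follows. This is an illustrative version, not the authors' code: the function name `plr`, the zero initial weights, the cyclic visiting order, and the `max_epochs` cap are all assumptions. Inputs are assumed to be augmented with a leading 1.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def plr(X, y, max_epochs=100):
    """Perceptron learning rule: add y_n * x_n whenever x_n is misclassified."""
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        mistakes = 0
        for xn, yn in zip(X, y):
            if yn * dot(w, xn) <= 0:          # wrongly classified
                w = [wi + yn * xi for wi, xi in zip(w, xn)]
                mistakes += 1
        if mistakes == 0:                      # a full pass with no update
            break
    return w


# A small separable set; the first component of each x is the constant 1.
X = [[1.0, 2.0, 1.0], [1.0, 1.0, 3.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
y = [1, 1, -1, -1]
w = plr(X, y)
```

On a nonseparable set the loop never reaches a mistake-free pass, so it simply stops at the epoch cap with whatever weights it last visited.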
Minimum Training Error Perceptrons
$$w^* \in \operatorname*{argmin}_{w} \sum_{n=1}^{N} \llbracket y_n \langle w, x_n \rangle \le 0 \rrbracket$$
Hard Optimization Problem
numerically:
0/1 loss $c(\rho) = \llbracket \rho \le 0 \rrbracket$ is not convex, not continuous, and has zero gradient almost everywhere
combinatorially:
NP-complete (Marcotte and Savard, 1992)
Useful Classifier
theoretically:
$w^*$ converges to the optimal linear classifier when $N \to \infty$
practically:
basic building blocks for networks/ensembles of neurons
goal: an efficient algorithm guaranteed to approach $w^*$ even for nonseparable cases
Two Existing Approaches for Nonseparable Sets
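$$w^* \in \operatorname*{argmin}_{w} C(w), \qquad C(w) = \sum_{n=1}^{N} c(y_n \cdot \langle w, x_n \rangle), \quad \text{where } c(\rho) = \llbracket \rho \le 0 \rrbracket$$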
pocket-PLR
in addition to PLR, store the best w encountered
guaranteed to locate w∗ with high probability in the long run
usually inefficient
– PLR unstable and wastes iterations on bad candidates
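A simplified sketch of the pocket idea, assuming augmented inputs. It cycles through the examples in order and recounts the training errors after every update; published pocket variants differ in such details, and the names `pocket_plr` and `count_errors` are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def count_errors(w, X, y):
    """C(w): number of examples with y_n <w, x_n> <= 0."""
    return sum(1 for xn, yn in zip(X, y) if yn * dot(w, xn) <= 0)


def pocket_plr(X, y, epochs=50):
    """Run PLR updates, keeping the best weights seen so far in the pocket."""
    w = [0.0] * len(X[0])
    pocket_w, pocket_err = list(w), count_errors(w, X, y)
    for _ in range(epochs):
        for xn, yn in zip(X, y):
            if yn * dot(w, xn) <= 0:
                w = [wi + yn * xi for wi, xi in zip(w, xn)]
                err = count_errors(w, X, y)    # full pass over S per update
                if err < pocket_err:
                    pocket_w, pocket_err = list(w), err
    return pocket_w, pocket_err


# A nonseparable 1-D set with alternating labels (first component is 1).
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [1, -1, 1, -1]
best_w, best_err = pocket_plr(X, y)
```

The full error count after each update illustrates the inefficiency noted above: most PLR iterates on nonseparable data are poor candidates, yet each one costs a pass over the training set to evaluate.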
support vector machine (SVM)
regularize $C(w)$; change $c(\rho)$ to hinge loss
efficiently solved via quadratic programming
no guarantee on getting $w^*$
– hinge loss different from 0/1 loss
[Figure: 0/1 loss and hinge loss plotted together]
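To make the difference between the two losses concrete, the sketch below evaluates both on a few margins $\rho = y_n \langle w, x_n \rangle$. The hinge formula $\max(0, 1 - \rho)$ is the standard definition and is not spelled out on the slide; the function names are hypothetical.

```python
def zero_one_loss(rho):
    """c(rho) = [[rho <= 0]]: 1 for a misclassified example, else 0."""
    return 1 if rho <= 0 else 0


def hinge_loss(rho):
    """Standard hinge loss, a convex upper bound on the 0/1 loss."""
    return max(0.0, 1.0 - rho)


margins = [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]
table = [(rho, zero_one_loss(rho), hinge_loss(rho)) for rho in margins]
```

A correctly classified example with margin 0.5 costs nothing under the 0/1 loss but 0.5 under the hinge loss, which is one way the two minimizers can diverge.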
Our Contributions
new perceptron algorithm to minimize 0/1 loss
– efficient, with guarantee on approaching $w^*$
[Figure: training error (%) versus number of epochs (log scale, $10^0$ to $10^3$) for our algorithm, pocket, and SVM-stochastic]
empirical study to understand 0/1 loss
– insights on dealing with nonseparable data sets
better neural ensemble approach: AdaBoost + our algorithm – useful when modeling very complex data sets
Random Coordinate Descent
Our Algorithm: Random Coordinate Descent
PLR: $w_{\text{new}} = w + \llbracket y_n \langle w, x_n \rangle \le 0 \rrbracket \, (y_n x_n)$
⇓ generalized and improved
Random Coordinate Descent (RCD): $w_{\text{new}} = w + \alpha d$
instead of fixed directions $y_n x_n$, use random directions $d$
instead of a fixed step size 0 or 1, use the optimal step size $\alpha$ with respect to $d$
next: how to compute the optimal step size
Computing the Optimal Step Size α
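$$\min_{\alpha \in \mathbb{R}} \sum_{n=1}^{N} \llbracket y_n \langle w + \alpha d, x_n \rangle \le 0 \rrbracket$$
Define $\delta_n \triangleq \langle d, x_n \rangle$:
$$\langle w + \alpha d, x_n \rangle = \begin{cases} \langle w, x_n \rangle & \text{when } \delta_n = 0 \\ \delta_n \left( \delta_n^{-1} \langle w, x_n \rangle + \alpha \right) & \text{when } \delta_n \ne 0 \end{cases}$$
for those $n$ with nonzero $\delta_n$, let $(x'_n, y'_n) \leftarrow \left( \delta_n^{-1} \langle w, x_n \rangle,\; y_n \operatorname{sign}(\delta_n) \right)$
$$\min_{\alpha \in \mathbb{R}} \sum_{\delta_n \ne 0} \llbracket y'_n (x'_n + \alpha) \le 0 \rrbracket$$
optimal $\alpha$ can be computed from these new 1-D examples efficiently by sorting + dynamic programming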
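One way to realize the sorting step is a sweep over the thresholds $-x'_n$ with a running error count, which plays the role of the dynamic program. This is a sketch under assumptions, not the authors' implementation: the function names are hypothetical, and placing $\alpha$ at the midpoint of the best interval (or one unit beyond the extreme thresholds) is a choice made here.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def count_errors(w, X, y):
    """0/1 training errors: examples with y_n <w, x_n> <= 0."""
    return sum(1 for xn, yn in zip(X, y) if yn * dot(w, xn) <= 0)


def optimal_step_size(w, d, X, y):
    """Return (alpha, errors) minimizing the 0/1 errors of w + alpha * d."""
    fixed = 0      # errors of examples with delta_n = 0, unaffected by alpha
    events = []    # (threshold -x'_n, label y'_n) for the 1-D examples
    for xn, yn in zip(X, y):
        delta = dot(d, xn)
        margin = dot(w, xn)
        if delta == 0:
            fixed += 1 if yn * margin <= 0 else 0
        else:
            events.append((-margin / delta, yn if delta > 0 else -yn))
    if not events:
        return 0.0, fixed
    events.sort()
    # For alpha below every threshold, exactly the y' = +1 examples are wrong.
    errors = fixed + sum(1 for _, lab in events if lab > 0)
    best_alpha, best_errors = events[0][0] - 1.0, errors
    for k, (t, lab) in enumerate(events):
        # Moving alpha past t fixes a y' = +1 example and breaks a y' = -1 one.
        errors += -1 if lab > 0 else 1
        nxt = events[k + 1][0] if k + 1 < len(events) else t + 2.0
        if nxt > t and errors < best_errors:   # only between distinct thresholds
            best_alpha, best_errors = (t + nxt) / 2.0, errors
    return best_alpha, best_errors


# All four examples are misclassified by w; moving along d repairs them.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 3.0], [1.0, -2.0]]
y = [1, -1, 1, -1]
w, d = [0.0, -1.0], [0.0, 1.0]
alpha, errors = optimal_step_size(w, d, X, y)
```

The sort dominates the cost at $O(N \log N)$ per direction; the sweep itself is a single linear pass.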
Choosing Update Directions d
some natural candidates
1 coordinate directions $e_i = (\ldots, 0, 1, 0, \ldots)^T$
2 PLR directions $y_n x_n$
3 sufficiently random directions on the unit sphere $\|d\| = 1$
recall: hard optimization problem
– finite choices like coordinate or PLR stuck in local minima
sufficiently random directions guarantee convergence to global minima $w^*$ in the long run
some even provably help with efficient local search
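The three candidate families can be generated as below. The slides do not say which distribution the authors sampled from for the third family; normalizing a standard Gaussian vector, which yields a uniform draw from the unit sphere, is an assumption of this sketch, as are the function names.

```python
import math
import random


def coordinate_direction(dim, i):
    """Candidate 1: the i-th coordinate direction e_i."""
    return [1.0 if k == i else 0.0 for k in range(dim)]


def plr_direction(xn, yn):
    """Candidate 2: the PLR direction y_n x_n."""
    return [yn * xi for xi in xn]


def random_unit_direction(dim, rng):
    """Candidate 3: a direction on the unit sphere (normalized Gaussian)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


rng = random.Random(0)   # fixed seed so the sketch is reproducible
d = random_unit_direction(3, rng)
```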
Putting Things Together
Random Coordinate Descent
iteratively,
1 pick a direction $d$ from sufficiently random choices
2 transform $(x_n, y_n)$ to $(x'_n, y'_n)$ with $w$ and $d$
3 compute optimal step size $\alpha$ from $(x'_n, y'_n)$
4 $w_{\text{new}} = w + \alpha d$
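The four steps can be put together as in the sketch below. Everything not on the slide is an assumption: the function names, the zero initial weights, the Gaussian sampling of $d$, the midpoint choice of $\alpha$, and the final check that rejects a step if rounding made it worse.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def count_errors(w, X, y):
    return sum(1 for xn, yn in zip(X, y) if yn * dot(w, xn) <= 0)


def random_direction(dim, rng):
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(v * v for v in g) ** 0.5
    return [v / norm for v in g]


def optimal_step(w, d, X, y):
    # Step 2: build the 1-D examples as (threshold -x'_n, label y'_n).
    events = []
    for xn, yn in zip(X, y):
        delta = dot(d, xn)
        if delta != 0:
            events.append((-dot(w, xn) / delta, yn if delta > 0 else -yn))
    if not events:
        return 0.0
    events.sort()
    # Step 3: sweep alpha upward, tracking errors among the 1-D examples.
    count = sum(1 for _, lab in events if lab > 0)
    best, alpha = count, events[0][0] - 1.0
    for k, (t, lab) in enumerate(events):
        count += -1 if lab > 0 else 1
        nxt = events[k + 1][0] if k + 1 < len(events) else t + 2.0
        if nxt > t and count < best:
            best, alpha = count, (t + nxt) / 2.0
    return alpha


def rcd(X, y, iterations=200, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for _ in range(iterations):
        d = random_direction(len(w), rng)                   # step 1
        alpha = optimal_step(w, d, X, y)                    # steps 2 and 3
        w_new = [wi + alpha * di for wi, di in zip(w, d)]   # step 4
        # Safeguard of this sketch: never accept a step that adds errors.
        if count_errors(w_new, X, y) <= count_errors(w, X, y):
            w = w_new
    return w


X = [[1.0, 2.0, 1.0], [1.0, 1.0, 3.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
y = [1, 1, -1, -1]
w = rcd(X, y)
```

Because $\alpha = 0$ is always a feasible step, the training error under this scheme cannot increase from one iteration to the next.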
[Figure: training error (%) versus number of epochs (log scale, $10^0$ to $10^3$) for our algorithm, pocket, and SVM-stochastic]
Experiments
Comparison as Single Perceptron Algorithms
[Figure: training error (%) and test error (%) of RCD, pocket, and SVM on 12 data sets (au., br., cl., ge., he., pi., io., ri., so., th., vo., yi.)]
training error (0/1 loss): RCD usually lowest; SVM highest
test error: SVM often better
pocket: slow, and not the best in either case
for a single perceptron, RCD minimizes 0/1 loss too well and causes overfitting
Comparison When Coupled with AdaBoost
[Figure: test error (%) on the 12 data sets (au., br., cl., ge., he., pi., io., ri., so., th., vo., yi.) for AdaBoost-RCD, AdaBoost-pocket, AdaBoost-SVM, and the best single perceptron]
single perceptron sufficient on 6/12 sets
AdaBoost-RCD significantly better than any single perceptron on the other half
AdaBoost-SVM cannot improve;
AdaBoost-pocket slow
for modeling very complex data sets with perceptron ensembles, AdaBoost-RCD is the best
Conclusion
Random Coordinate Descent: an efficient algorithm guaranteed to minimize 0/1 loss of perceptron
theoretical analysis:
proved to converge to $w^*$ and to perform fast local search
empirical study:
RCD the best training error minimizer – but can cause overfitting
AdaBoost-RCD the best perceptron ensemble approach in test performance
Thank you. Questions?