Perceptron Learning with Random Coordinate Descent

(1)

Ling Li and Hsuan-Tien Lin

Learning Systems Group, California Institute of Technology, U.S.A.

International Joint Conf. on Neural Networks, August 15, 2007

(2)

Introduction

Perceptron

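proposed by Rosenblatt (1958)

a single neuron;

a linear threshold classifier;

a hyperplane in $\mathbb{R}^d$

$y = \mathrm{sign}(\langle w, x \rangle + b)$

define $(x)_0 \triangleq 1$ and $w_0 \triangleq b$:

$y = \mathrm{sign}(\langle w, x \rangle)$

[Figure: a single neuron with inputs $1, (x)_1, \ldots, (x)_d$, weights $b, w_1, \ldots, w_d$, and output $y$]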

a simple but useful classifier, especially for building more complex systems
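The definition above can be sketched in a few lines of Python. This is only an illustration: the helper names `augment` and `predict` are hypothetical, inputs are assumed to be plain lists, and mapping a zero activation to -1 is an assumption, since the slides do not state a tie rule.

```python
def augment(x):
    """Prepend the constant (x)_0 = 1 so that the bias b becomes w_0."""
    return [1.0] + list(x)

def predict(w, x):
    """Return sign(<w, x>) for w = [b, w_1, ..., w_d] and a raw input x."""
    activation = sum(wi * xi for wi, xi in zip(w, augment(x)))
    # Assumption: an activation of exactly 0 is mapped to -1.
    return 1 if activation > 0 else -1

w = [-1.0, 2.0, 0.5]            # b = -1, w_1 = 2, w_2 = 0.5
print(predict(w, [1.0, 0.0]))   # 1   (activation  1.0)
print(predict(w, [0.0, 1.0]))   # -1  (activation -0.5)
```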

(3)

Introduction

Perceptron Learning Rule (PLR)

an iterative optimization procedure to learn $w$ from $S = \{(x_n, y_n)\}_{n=1}^{N}$ (Rosenblatt, 1962)

repeatedly, for $(x_n, y_n) \in S$,

1. if the current $w$ correctly classifies $x_n$, do nothing;

2. if the current $w$ wrongly classifies $x_n$, $w^{\text{new}} = w + y_n x_n$

convergence proved for separable $S$

but unstable for nonseparable cases
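A minimal sketch of the rule above, assuming each $x_n$ already carries the leading constant 1 and that examples are visited in a fixed cyclic order. The function name `plr` and the `max_epochs` cap are hypothetical choices for this illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def plr(samples, max_epochs=100):
    """Perceptron learning rule on a list of (x, y) pairs with y in {-1, +1}."""
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:
            if y * dot(w, x) <= 0:                          # wrongly classified
                w = [wi + y * xi for wi, xi in zip(w, x)]   # w_new = w + y_n x_n
                mistakes += 1
        if mistakes == 0:                                   # a full clean pass: stop
            break
    return w

# A separable toy set (logical AND), inputs augmented with a leading 1.
and_set = [([1.0, 0.0, 0.0], -1), ([1.0, 0.0, 1.0], -1),
           ([1.0, 1.0, 0.0], -1), ([1.0, 1.0, 1.0], 1)]
w = plr(and_set)
print(all(y * dot(w, x) > 0 for x, y in and_set))  # True
```

On a nonseparable set the loop never reaches a clean pass and simply stops at `max_epochs`, returning whatever $w$ it holds at that moment.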

(4)

Introduction

Minimum Training Error Perceptrons

$$w^* \in \arg\min_{w} \sum_{n=1}^{N} [\![\, y_n \langle w, x_n \rangle \le 0 \,]\!]$$

Hard Optimization Problem

numerically: 0/1 loss $c(\rho) = [\![ \rho \le 0 ]\!]$ is not convex, not continuous, and has zero gradient almost everywhere

combinatorially: NP-complete (Marcotte and Savard, 1992)

Useful Classifier

theoretically: $w^*$ converges to the optimal linear classifier when $N \to \infty$

practically: basic building blocks for networks/ensembles of neurons

goal: an efficient algorithm guaranteed to approach $w^*$ even for nonseparable cases
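To make the objective concrete, the sketch below counts 0/1 training errors on a small nonseparable set (XOR). The helper name `training_error` and the toy data are assumptions for illustration only. The last two calls show the "zero gradient" issue: a small change in $w$ leaves the count unchanged.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def training_error(w, samples):
    """Number of examples with y_n * <w, x_n> <= 0."""
    return sum(1 for x, y in samples if y * dot(w, x) <= 0)

# XOR, inputs augmented with a leading 1; no hyperplane separates it.
xor = [([1.0, 0.0, 0.0], -1), ([1.0, 0.0, 1.0], 1),
       ([1.0, 1.0, 0.0], 1), ([1.0, 1.0, 1.0], -1)]

print(training_error([0.0, 0.0, 0.0], xor))     # 4
print(training_error([-0.5, 1.0, 1.0], xor))    # 1
print(training_error([-0.5, 1.01, 0.99], xor))  # 1
```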

(5)

Introduction

Two Existing Approaches for Nonseparable Sets

$$w^* \in \arg\min_{w} \; C(w) = \sum_{n=1}^{N} c\bigl(y_n \cdot \langle w, x_n \rangle\bigr), \quad \text{where } c(\rho) = [\![ \rho \le 0 ]\!]$$

pocket-PLR

in addition to PLR, store the best $w$ encountered

guaranteed to locate $w^*$ with high probability in the long run

usually inefficient

– PLR is unstable and wastes iterations on bad candidates

support vector machine (SVM)

regularize $C(w)$; change $c(\rho)$ to hinge loss

efficiently solved via quadratic programming

no guarantee on getting $w^*$

– hinge loss is different from 0/1 loss

[Figure: 0/1 loss versus hinge loss]
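The pocket idea can be sketched as follows. This is a simplified variant, not necessarily the exact procedure used in the experiments: it assumes examples are drawn uniformly at random and that the full training error is re-evaluated after every update, and the names `pocket_plr`, `iterations`, and `seed` are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def training_error(w, samples):
    return sum(1 for x, y in samples if y * dot(w, x) <= 0)

def pocket_plr(samples, iterations=1000, seed=0):
    """Run PLR updates and keep (in the 'pocket') the best w seen so far."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0][0])
    best_w, best_err = list(w), training_error(w, samples)
    for _ in range(iterations):
        x, y = rng.choice(samples)
        if y * dot(w, x) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]   # ordinary PLR update
            err = training_error(w, samples)            # costs a full pass
            if err < best_err:
                best_w, best_err = list(w), err
    return best_w, best_err

# XOR is nonseparable, so the best achievable 0/1 error is at least 1.
xor = [([1.0, 0.0, 0.0], -1), ([1.0, 0.0, 1.0], 1),
       ([1.0, 1.0, 0.0], 1), ([1.0, 1.0, 1.0], -1)]
best_w, best_err = pocket_plr(xor)
print(best_err == training_error(best_w, xor))  # True
```

The full error evaluation after each update is what makes this variant expensive, which is one concrete way to read "usually inefficient".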

(6)

Introduction

Our Contributions

new perceptron algorithm to minimize 0/1 loss

– efficient with guarantee on approaching $w^*$

[Figure: training error (%) versus number of epochs (log scale, $10^0$ to $10^3$) for our algorithm, pocket, and SVM-stochastic]

empirical study to understand 0/1 loss

– insights on dealing with nonseparable data sets

better neural ensemble approach: AdaBoost + our algorithm

– useful when modeling very complex data sets

(7)

Random Coordinate Descent

Our Algorithm: Random Coordinate Descent

PLR

$w^{\text{new}} = w + [\![ y_n \langle w, x_n \rangle \le 0 ]\!] \, (y_n x_n)$

⇓ generalized and improved

Random Coordinate Descent (RCD)

$w^{\text{new}} = w + \alpha d$

instead of fixed directions $y_n x_n$, use random directions $d$

instead of a fixed step size 0 or 1, use the optimal step size $\alpha$ with respect to $d$

next: how to compute the optimal step size

(8)

Computing the Optimal Step Size α

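$$\min_{\alpha \in \mathbb{R}} \sum_{n=1}^{N} [\![ y_n \langle w + \alpha d, x_n \rangle \le 0 ]\!]$$

Define $\delta_n \triangleq \langle d, x_n \rangle$

$$\langle w + \alpha d, x_n \rangle = \begin{cases} \langle w, x_n \rangle & \text{when } \delta_n = 0 \\ \delta_n \left( \delta_n^{-1} \langle w, x_n \rangle + \alpha \right) & \text{when } \delta_n \ne 0 \end{cases}$$

for those $n$ with nonzero $\delta_n$, let $(x'_n, y'_n) \leftarrow \left( \delta_n^{-1} \langle w, x_n \rangle, \; y_n \, \mathrm{sign}(\delta_n) \right)$

$$\min_{\alpha \in \mathbb{R}} \sum_{\delta_n \ne 0} [\![ y'_n (x'_n + \alpha) \le 0 ]\!]$$

optimal $\alpha$ can be computed from these new 1-D examples efficiently by sorting + dynamic programming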
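One way to realize "sorting + dynamic programming" is sketched below; it is an illustration under stated assumptions, not necessarily the authors' implementation. Each transformed example flips at the threshold $\alpha = -x'_n$: an example with $y'_n = +1$ is an error for $\alpha \le -x'_n$, and one with $y'_n = -1$ is an error for $\alpha \ge -x'_n$. Sweeping the sorted thresholds and updating a running error count gives the best interval. Choosing the midpoint of that interval (or a point 1.0 outside the outermost threshold) is an arbitrary assumption, as are the names `optimal_step` and `training_error`.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def training_error(w, samples):
    return sum(1 for x, y in samples if y * dot(w, x) <= 0)

def optimal_step(w, d, samples):
    """Return (alpha, errors) minimizing the 0/1 error of w + alpha * d."""
    fixed = 0      # errors of examples with delta_n = 0 (independent of alpha)
    points = []    # (threshold -x'_n, y'_n) for examples with delta_n != 0
    for x, y in samples:
        delta = dot(d, x)
        if delta == 0:
            fixed += 1 if y * dot(w, x) <= 0 else 0
        else:
            points.append((-dot(w, x) / delta, y if delta > 0 else -y))
    if not points:
        return 0.0, fixed
    points.sort()
    # Left of every threshold, exactly the y' = +1 examples are errors.
    err = fixed + sum(1 for _, yp in points if yp > 0)
    best_alpha, best_err = points[0][0] - 1.0, err
    i, n = 0, len(points)
    while i < n:
        t = points[i][0]
        while i < n and points[i][0] == t:        # cross all ties at t together
            err += 1 if points[i][1] < 0 else -1  # y'=-1 turns wrong, y'=+1 right
            i += 1
        if err < best_err:
            best_alpha = (t + points[i][0]) / 2 if i < n else t + 1.0
            best_err = err
    return best_alpha, best_err

xor = [([1.0, 0.0, 0.0], -1), ([1.0, 0.0, 1.0], 1),
       ([1.0, 1.0, 0.0], 1), ([1.0, 1.0, 1.0], -1)]
print(optimal_step([0.0, 0.0, 0.0], [-0.5, 1.0, 1.0], xor))  # (1.0, 1)
```

The sort dominates the cost, so one step takes $O(N \log N)$ time after the $O(Nd)$ inner products. Values of $\alpha$ that sit exactly on a threshold are never better than a neighboring interval, which is why only interval interiors are considered.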

(9)

Choosing Update Directions d

some natural candidates

1. coordinate directions $e_i = (\ldots, 0, 1, 0, \ldots)^T$

2. PLR directions $y_n x_n$

3. sufficiently random directions on the unit sphere $\|d\| = 1$

recall: hard optimization problem

– finite choices like coordinate or PLR directions can get stuck in local minima

sufficiently random directions guarantee convergence to global minima $w^*$ in the long run

some even provably help with efficient local search
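Two of the candidates can be generated as sketched below. The slides do not say which distribution counts as "sufficiently random"; drawing uniformly from the unit sphere by normalizing a standard Gaussian vector is an assumption made here for illustration, and both function names are hypothetical.

```python
import math
import random

def coordinate_direction(i, dim):
    """e_i: 1 in position i, 0 elsewhere."""
    return [1.0 if j == i else 0.0 for j in range(dim)]

def random_unit_direction(dim, rng):
    """Uniform direction on the unit sphere (normalized Gaussian vector)."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in g))
        if norm > 0.0:                  # retry in the (practically impossible) zero case
            return [v / norm for v in g]

rng = random.Random(0)
d = random_unit_direction(3, rng)
print(abs(sum(v * v for v in d) - 1.0) < 1e-9)  # True
print(coordinate_direction(1, 3))               # [0.0, 1.0, 0.0]
```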

(10)

Putting Things Together

Random Coordinate Descent

iteratively,

1. pick a direction $d$ from sufficiently random choices

2. transform $(x_n, y_n)$ to $(x'_n, y'_n)$ with $w$ and $d$

3. compute optimal step size $\alpha$ from $(x'_n, y'_n)$

4. $w^{\text{new}} = w + \alpha d$

[Figure: training error (%) versus number of epochs (log scale) for our algorithm, pocket, and SVM-stochastic]
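The four steps can be put together in a self-contained sketch. Several details are assumptions rather than the authors' choices: directions are uniform on the unit sphere, the step is taken at the midpoint of the best interval between sorted thresholds, $w$ starts at zero, and an extra acceptance check guards against floating-point rounding. The names `rcd`, `optimal_step`, `iterations`, and `seed` are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def training_error(w, samples):
    return sum(1 for x, y in samples if y * dot(w, x) <= 0)

def optimal_step(w, d, samples):
    """Steps 2 and 3: build the 1-D examples and sweep sorted thresholds."""
    points = []
    for x, y in samples:
        delta = dot(d, x)
        if delta != 0:                        # delta = 0: unaffected by alpha
            points.append((-dot(w, x) / delta, y if delta > 0 else -y))
    if not points:
        return 0.0
    points.sort()
    err = sum(1 for _, yp in points if yp > 0)   # errors left of all thresholds
    best_alpha, best_err = points[0][0] - 1.0, err
    i, n = 0, len(points)
    while i < n:
        t = points[i][0]
        while i < n and points[i][0] == t:
            err += 1 if points[i][1] < 0 else -1
            i += 1
        if err < best_err:
            best_alpha = (t + points[i][0]) / 2 if i < n else t + 1.0
            best_err = err
    return best_alpha

def rcd(samples, iterations=200, seed=0):
    rng = random.Random(seed)
    dim = len(samples[0][0])
    w = [0.0] * dim
    for _ in range(iterations):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]      # step 1
        norm = math.sqrt(dot(g, g))
        if norm == 0.0:
            continue
        d = [v / norm for v in g]
        alpha = optimal_step(w, d, samples)                # steps 2 and 3
        candidate = [wi + alpha * di for wi, di in zip(w, d)]
        # Assumption: defensive check so rounding can never increase the error.
        if training_error(candidate, samples) <= training_error(w, samples):
            w = candidate                                  # step 4
    return w

xor = [([1.0, 0.0, 0.0], -1), ([1.0, 0.0, 1.0], 1),
       ([1.0, 1.0, 0.0], 1), ([1.0, 1.0, 1.0], -1)]
w = rcd(xor)
print(training_error(w, xor) <= 2)  # True
```

Because each accepted step is the best point along its direction, the training error in this sketch never increases from one iteration to the next.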

(11)

Experiments

Comparison as Single Perceptron Algorithms

[Figure: training error (%) and test error (%) of RCD, pocket, and SVM on the data sets au., br., cl., ge., he., pi., io., ri., so., th., vo., yi.]

training error (0/1 loss): RCD usually lowest; SVM highest

test error: SVM often better

pocket slow and not the sharpest in both cases

for a single perceptron, RCD does too good a job of minimizing 0/1 loss, which causes overfitting

(12)

Experiments

Comparison When Coupled with AdaBoost

[Figure: test error (%) of AdaBoost-RCD, AdaBoost-pocket, AdaBoost-SVM, and the best single perceptron on the data sets au., br., cl., ge., he., pi., io., ri., so., th., vo., yi.]

single perceptron sufficient on 6/12 sets

AdaBoost-RCD significantly better than any single perceptron on the other half

AdaBoost-SVM cannot improve;

AdaBoost-pocket slow

for modeling very complex data sets with perceptron ensembles, AdaBoost-RCD is the best

(13)

Conclusion

Random Coordinate Descent: an efficient algorithm guaranteed to minimize 0/1 loss of perceptron

theoretical analysis:

proved to converge to $w^*$ and to perform fast local search

empirical study:

RCD the best training error minimizer

– but can cause overfitting

AdaBoost-RCD the best perceptron ensemble approach in test performance

Thank you. Questions?