(1)

Perceptron Learning with Random Coordinate Descent

Ling Li and Hsuan-Tien Lin

Learning Systems Group, California Institute of Technology, U.S.A.

International Joint Conf. on Neural Networks, August 15, 2007

(2)

Introduction

Perceptron

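proposed by Rosenblatt (1958)

a single neuron;

a linear threshold classifier;

a hyperplane in $\mathbb{R}^d$:

$y = \operatorname{sign}(\langle w, x \rangle + b)$

define $(x)_0 = 1$ and $w_0 = b$:

$y = \operatorname{sign}(\langle w, x \rangle)$

[Figure: a single neuron with inputs $1, (x)_1, \ldots, (x)_d$ weighted by $b, w_1, \ldots, w_d$, producing the output $y$]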

a simple but useful classifier, especially for building more complex systems

(3)

Perceptron Learning Rule (PLR)

an iterative optimization procedure to learn $w$ from $S = \{(x_n, y_n)\}_{n=1}^{N}$ (Rosenblatt, 1962)

repeatedly, for $(x_n, y_n) \in S$,

1 if current $w$ correctly classifies $x_n$, do nothing;

2 if current $w$ wrongly classifies $x_n$, $w_{\text{new}} = w + y_n x_n$

convergence proved for separable $S$

but unstable for nonseparable cases
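
The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `dot` and `plr`, the zero initial weight vector, and the `max_epochs` cap are assumptions made for the example, and each input is assumed to already carry the constant coordinate $(x)_0 = 1$.

```python
def dot(a, b):
    """Inner product <a, b> of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def plr(X, y, max_epochs=100):
    """Perceptron learning rule: cycle through S, update on each mistake."""
    w = [0.0] * len(X[0])  # assumption: start from the zero vector
    for _ in range(max_epochs):
        mistakes = 0
        for xn, yn in zip(X, y):
            if yn * dot(w, xn) <= 0:  # w wrongly classifies x_n
                w = [wi + yn * xi for wi, xi in zip(w, xn)]  # w_new = w + y_n x_n
                mistakes += 1
        if mistakes == 0:  # a full pass without mistakes: S is separated
            break
    return w


# Toy separable set (hypothetical data); first coordinate is the constant 1.
X = [(1.0, 2.0), (1.0, 1.0), (1.0, -1.0), (1.0, -2.0)]
y = [1, 1, -1, -1]
w = plr(X, y)
```

On a nonseparable set the loop would simply run until `max_epochs` and return whatever weight vector it last visited, which is the instability the slide refers to.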

(4)

Minimum Training Error Perceptrons

$$w^* \in \operatorname*{argmin}_{w} \sum_{n=1}^{N} [\![\, y_n \langle w, x_n \rangle \le 0 \,]\!]$$

Hard Optimization Problem

numerically:

0/1 loss $c(\rho) = [\![\, \rho \le 0 \,]\!]$ is not convex, not continuous, with mostly 0 gradient

combinatorially:

NP-complete (Marcotte and Savard, 1992)

Useful Classifier

theoretically:

$w^*$ converges to optimal linear classifier when $N \to \infty$

practically:

basic building blocks for networks/ensembles of neurons

goal: an efficient algorithm guaranteed to approach $w^*$ even for nonseparable cases
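
To make the objective concrete, the following sketch counts 0/1 training errors for a given weight vector. The names `zero_one_loss` and `training_error` and the toy data are assumptions introduced for illustration. It also shows the "mostly 0 gradient" point: a small nudge of $w$ leaves the count unchanged.

```python
def dot(a, b):
    """Inner product <a, b> of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def zero_one_loss(rho):
    """c(rho) = [rho <= 0]; a margin of exactly zero counts as an error."""
    return 1 if rho <= 0 else 0


def training_error(w, X, y):
    """Number of training examples with y_n <w, x_n> <= 0."""
    return sum(zero_one_loss(yn * dot(w, xn)) for xn, yn in zip(X, y))


# Toy data (hypothetical); first coordinate is the constant 1.
X = [(1.0, 2.0), (1.0, 1.0), (1.0, -1.0), (1.0, -2.0)]
y = [1, 1, -1, -1]

w = [0.0, 1.0]
nudged = [0.01, 1.01]  # a small perturbation of w
print(training_error(w, X, y), training_error(nudged, X, y))
```

Because the count is piecewise constant in $w$, gradient-based methods get no signal from it almost everywhere.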

(5)

Two Existing Approaches for Nonseparable Sets

$$w^* \in \operatorname*{argmin}_{w} C(w) = \sum_{n=1}^{N} c(y_n \cdot \langle w, x_n \rangle), \quad \text{where } c(\rho) = [\![\, \rho \le 0 \,]\!]$$

pocket-PLR

in addition to PLR, store the best $w$ encountered

guaranteed to locate $w^*$ with high probability in the long run

usually inefficient

– PLR unstable and wastes iterations on bad candidates

support vector machine (SVM)

regularize $C(w)$; change $c(\rho)$ to hinge loss

efficiently solved via quadratic programming

no guarantee on getting $w^*$

– hinge loss different from 0/1 loss

[Figure: 0/1 loss compared with hinge loss]
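
The pocket idea can be sketched as follows. This is an illustrative reading of "store the best $w$ encountered", not the exact variant used in the experiments: picking a uniformly random misclassified example, the `max_updates` budget, the seed, and all function names are assumptions.

```python
import random


def dot(a, b):
    """Inner product <a, b> of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def training_errors(w, X, y):
    """Number of examples with y_n <w, x_n> <= 0."""
    return sum(1 for xn, yn in zip(X, y) if yn * dot(w, xn) <= 0)


def pocket_plr(X, y, max_updates=200, seed=0):
    """Run PLR updates, keeping the best weight vector seen so far."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    pocket_w, pocket_err = list(w), training_errors(w, X, y)
    for _ in range(max_updates):
        wrong = [(xn, yn) for xn, yn in zip(X, y) if yn * dot(w, xn) <= 0]
        if not wrong:
            break  # current w makes no mistakes; it is already in the pocket
        xn, yn = rng.choice(wrong)  # assumption: random misclassified example
        w = [wi + yn * xi for wi, xi in zip(w, xn)]
        err = training_errors(w, X, y)  # a full pass over S per update
        if err < pocket_err:
            pocket_w, pocket_err = list(w), err
    return pocket_w, pocket_err


# Nonseparable toy set (hypothetical): labels alternate along the line.
X = [(1.0, -2.0), (1.0, -1.0), (1.0, 1.0), (1.0, 2.0)]
y = [-1, 1, -1, 1]
pocket_w, pocket_err = pocket_plr(X, y)
```

Each update costs a full pass over the data just to evaluate a candidate that PLR may immediately abandon, which is one way to see the inefficiency noted above.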

(6)

Our Contributions

new perceptron algorithm to minimize 0/1 loss

– efficient, with guarantee on approaching $w^*$

[Figure: training error (%) versus number of epochs (from $10^0$ to $10^3$) for our algorithm, pocket, and SVM-stochastic]

empirical study to understand 0/1 loss

– insights on dealing with nonseparable data sets

better neural ensemble approach: AdaBoost + our algorithm

– useful when modeling very complex data sets

(7)

Random Coordinate Descent

Our Algorithm: Random Coordinate Descent

PLR

$w_{\text{new}} = w + [\![\, y_n \langle w, x_n \rangle \le 0 \,]\!] \, (y_n x_n)$

generalized and improved

Random Coordinate Descent (RCD)

$w_{\text{new}} = w + \alpha d$

instead of fixed directions $y_n x_n$, use random directions $d$

instead of a fixed step size 0 or 1, use the optimal step size $\alpha$ with respect to $d$

next: how to compute the optimal step size

(8)

Computing the Optimal Step Size α

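$$\min_{\alpha \in \mathbb{R}} \sum_{n=1}^{N} [\![\, y_n \langle w + \alpha d, x_n \rangle \le 0 \,]\!]$$

Define $\delta_n = \langle d, x_n \rangle$. Then

$$\langle w + \alpha d, x_n \rangle = \begin{cases} \langle w, x_n \rangle & \text{when } \delta_n = 0 \\ \delta_n \left( \delta_n^{-1} \langle w, x_n \rangle + \alpha \right) & \text{when } \delta_n \ne 0 \end{cases}$$

for those $n$ with nonzero $\delta_n$, let $(x'_n, y'_n) \leftarrow \left( \delta_n^{-1} \langle w, x_n \rangle, \; y_n \operatorname{sign}(\delta_n) \right)$

$$\min_{\alpha \in \mathbb{R}} \sum_{\delta_n \ne 0} [\![\, y'_n (x'_n + \alpha) \le 0 \,]\!]$$

optimal $\alpha$ can be computed from these new 1-D examples efficiently by sorting + dynamic programming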
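
One possible way to carry out this 1-D search is sketched below. The slide only says "sorting + dynamic programming"; the running error count used here is an assumed reading of that phrase, and the names `optimal_step` and `training_errors`, the tie handling, and the choice of interval midpoints are illustrative. Each 1-D example changes its status only when $\alpha$ crosses the threshold $-x'_n$, so it suffices to sort the thresholds and sweep once.

```python
def dot(a, b):
    """Inner product <a, b> of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def training_errors(w, X, y):
    """Number of examples with y_n <w, x_n> <= 0."""
    return sum(1 for xn, yn in zip(X, y) if yn * dot(w, xn) <= 0)


def optimal_step(w, d, X, y):
    """Return (alpha, errors) minimizing the 0/1 error of w + alpha * d."""
    fixed = 0    # errors of examples with delta_n = 0 (independent of alpha)
    events = []  # (threshold -x'_n, label y'_n) for examples with delta_n != 0
    for xn, yn in zip(X, y):
        delta, score = dot(d, xn), dot(w, xn)
        if delta == 0:
            fixed += 1 if yn * score <= 0 else 0
        else:
            events.append((-score / delta, yn if delta > 0 else -yn))
    if not events:
        return 0.0, fixed
    events.sort()
    # Below every threshold, exactly the examples with y'_n = +1 are wrong.
    errors = sum(1 for _, label in events if label > 0)
    best_alpha, best_errors = events[0][0] - 1.0, errors
    i = 0
    while i < len(events):
        t = events[i][0]
        j = i
        while j < len(events) and events[j][0] == t:  # handle tied thresholds
            errors += -1 if events[j][1] > 0 else 1   # status flips past t
            j += 1
        # Represent the open interval right of t by its midpoint (or t + 1).
        alpha = (t + events[j][0]) / 2.0 if j < len(events) else t + 1.0
        if errors < best_errors:
            best_alpha, best_errors = alpha, errors
        i = j
    return best_alpha, best_errors + fixed


# Toy data (hypothetical); first coordinate is the constant 1.
X = [(1.0, -2.0), (1.0, -1.0), (1.0, 1.0), (1.0, 2.0)]
y = [-1, -1, 1, 1]
alpha, errors = optimal_step([0.0, -1.0], [0.0, 1.0], X, y)
```

In this usage example the starting vector misclassifies every point, all four thresholds coincide at 1, and any step beyond it fixes them all; the sketch returns a step of 2 with zero errors. Values of $\alpha$ that sit exactly on a threshold are never optimal, since the corresponding example then has margin zero and counts as an error.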

(9)

Choosing Update Directions d

some natural candidates

1 coordinate directions $e_i = (\ldots, 0, 1, 0, \ldots)^T$

2 PLR directions $y_n x_n$

3 sufficiently random directions on the unit sphere $\|d\| = 1$

recall: hard optimization problem

– finite choices like coordinate or PLR stuck in local minima

sufficiently random directions guarantee convergence to global minima $w^*$ in the long run

some even provably help with efficient local search
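
As one concrete instance of the third candidate, a direction can be drawn uniformly from the unit sphere by normalizing a vector of independent standard Gaussians. The slide does not say which distribution is meant by "sufficiently random", so the uniform choice and the helper name `random_unit_direction` are assumptions.

```python
import math
import random


def random_unit_direction(dim, rng):
    """Draw d uniformly from the unit sphere in R^dim (||d|| = 1)."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:  # guard against the (practically impossible) zero vector
            return [c / norm for c in v]


rng = random.Random(0)  # fixed seed only to make the example repeatable
d = random_unit_direction(5, rng)
```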

(10)

Putting Things Together

Random Coordinate Descent

iteratively,

1 pick a direction $d$ from sufficiently random choices

2 transform $(x_n, y_n)$ to $(x'_n, y'_n)$ with $w$ and $d$

3 compute optimal step size $\alpha$ from $(x'_n, y'_n)$

4 $w_{\text{new}} = w + \alpha d$

[Figure: training error (%) versus number of epochs (from $10^0$ to $10^3$) for our algorithm, pocket, and SVM-stochastic]
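
The four steps can be put together as in the sketch below. It is a simplified illustration under stated assumptions, not the authors' implementation: directions are normalized Gaussian vectors, the step size is found by a sort-and-sweep over the 1-D thresholds, the weight vector starts at zero, and the names `best_step` and `rcd` are made up for the example.

```python
import math
import random


def dot(a, b):
    """Inner product <a, b> of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def training_errors(w, X, y):
    """Number of examples with y_n <w, x_n> <= 0."""
    return sum(1 for xn, yn in zip(X, y) if yn * dot(w, xn) <= 0)


def best_step(w, d, X, y):
    """Steps 2 and 3: build the 1-D examples and sweep for the best alpha."""
    events = []  # (threshold -x'_n, label y'_n); delta_n = 0 cannot change
    for xn, yn in zip(X, y):
        delta = dot(d, xn)
        if delta != 0:
            events.append((-dot(w, xn) / delta, yn if delta > 0 else -yn))
    if not events:
        return 0.0
    events.sort()
    errors = sum(1 for _, label in events if label > 0)  # alpha below all
    best_alpha, best_errors = events[0][0] - 1.0, errors
    i = 0
    while i < len(events):
        j = i
        while j < len(events) and events[j][0] == events[i][0]:
            errors += -1 if events[j][1] > 0 else 1
            j += 1
        if errors < best_errors:
            upper = events[j][0] if j < len(events) else events[i][0] + 2.0
            best_alpha, best_errors = (events[i][0] + upper) / 2.0, errors
        i = j
    return best_alpha


def rcd(X, y, iterations=100, seed=0):
    """Random coordinate descent on the 0/1 training error."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for _ in range(iterations):
        d = [rng.gauss(0.0, 1.0) for _ in w]       # step 1: random direction
        norm = math.sqrt(dot(d, d))
        if norm == 0.0:
            continue
        d = [c / norm for c in d]
        alpha = best_step(w, d, X, y)              # steps 2 and 3
        w = [wi + alpha * di for wi, di in zip(w, d)]  # step 4
    return w, training_errors(w, X, y)


# Nonseparable toy set (hypothetical): labels alternate along the line.
X = [(1.0, -2.0), (1.0, -1.0), (1.0, 1.0), (1.0, 2.0)]
y = [-1, 1, -1, 1]
w, err = rcd(X, y, iterations=50, seed=1)
```

Since the sweep always considers the interval containing $\alpha = 0$, a step chosen this way does not increase the training error in exact arithmetic, so no separate pocket is needed in this sketch.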

(11)

Experiments

Comparison as Single Perceptron Algorithms

[Figure: training error (%) and test error (%) of RCD, pocket, and SVM on twelve data sets (au., br., cl., ge., he., pi., io., ri., so., th., vo., yi.)]

training error (0/1 loss): RCD usually lowest; SVM highest

test error: SVM often better

pocket: slow and not the sharpest in both cases

for a single perceptron, RCD does too good a job on 0/1 loss and causes overfitting

(12)

Comparison When Coupled with AdaBoost

[Figure: test error (%) of AdaBoost-RCD, AdaBoost-pocket, AdaBoost-SVM, and the best single perceptron on twelve data sets (au., br., cl., ge., he., pi., io., ri., so., th., vo., yi.)]

single perceptron sufficient on 6/12 sets

AdaBoost-RCD significantly better than any single perceptron on the other half

AdaBoost-SVM cannot improve;

AdaBoost-pocket slow

for modeling very complex data sets with perceptron ensembles, AdaBoost-RCD is the best

(13)

Conclusion

Conclusion

Random Coordinate Descent: an efficient algorithm guaranteed to minimize 0/1 loss of perceptron

theoretical analysis:

proved to converge to $w^*$ and to perform fast local search

empirical study:

RCD the best training error minimizer

– but can cause overfitting

AdaBoost-RCD the best perceptron ensemble approach in test performance

Thank you. Questions?
