(1)

Machine Learning Soundings (機器學習深測)

Lecture 3: Optimization in Deep Learning

Hsuan-Tien Lin (林軒田)

[email protected]

Department of Computer Science & Information Engineering

National Taiwan University (國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Soundings 0/9

(2)

Roadmap

1 Deep Learning Foundations

Lecture 1: Neural Network
automatic pattern feature extraction from layers of neurons with backprop for GD/SGD

Lecture 3: Optimization in Deep Learning
Difficulty of Deep Learning Optimization

2 Deep Learning Models

(3)

Difficulty of Deep Learning Optimization

error surface complicated

• local minima: not as bad as imagined
• saddle points/local maxima: easily escapable (especially with SGD)
• plateau: need larger learning rate η
• ravines: need to avoid oscillation

stability vs. computation trade-off

slow computation of gradient (backprop)
⇒ SGD on minibatch
⇒ ‘unstable’ estimate of gradient

getting more stable estimate? averaging
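The trade-off above can be made concrete with a small experiment. The sketch below is not from the lecture: it assumes a hypothetical 1-D least-squares problem, and the names `mse_gradient` and `minibatch_sg` are made up for illustration. It compares the spread of single minibatch gradients with the spread of averages over 10 minibatch gradients.

```python
import random
import statistics

def mse_gradient(w, examples):
    """Gradient of the mean squared error of the 1-D model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in examples) / len(examples)

rng = random.Random(0)
# Hypothetical data (an assumption for this sketch): y = 3x plus Gaussian noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
data = [(x, 3.0 * x + rng.gauss(0.0, 1.0)) for x in xs]

w = 0.0
full_gradient = mse_gradient(w, data)  # the expensive, stable quantity

def minibatch_sg(batch_size=10):
    """One stochastic gradient (SG) computed on a random minibatch."""
    return mse_gradient(w, rng.sample(data, batch_size))

# 200 single SGs versus 200 averages of 10 SGs each.
single = [minibatch_sg() for _ in range(200)]
averaged = [statistics.fmean(minibatch_sg() for _ in range(10)) for _ in range(200)]

print("full gradient:       ", full_gradient)
print("spread of one SG:    ", statistics.stdev(single))
print("spread of 10-SG mean:", statistics.stdev(averaged))
```

Averaging narrows the spread, but each averaged estimate costs 10 minibatch gradient computations instead of one.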

(4)

Running Average Estimate of Gradient

gradient descent: $\mathbf{w}_t \leftarrow \mathbf{w}_{t-1} - \eta \cdot \mathbf{v}_t$

original minibatch SG

gradient estimate $\mathbf{v}_t = \Delta_t$ from one minibatch SG

averaging by multiple SG

if minibatch SG for $M$ times at $t$-th iteration, each getting $\Delta_t^{(m)}$, more stable gradient estimate by uniform averaging

$\mathbf{v}_t = \frac{1}{M} \sum_{m=1}^{M} \Delta_t^{(m)}$

(needing $M$ times more computation than original minibatch SGD)

speedup by reusing each $\Delta_t = \Delta_t^{(1)}$

$\mathbf{v}_t = \frac{1}{M} \sum_{m=1}^{M} \Delta_{t-m+1}$

(a ‘moving window’ average of SG)
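The ‘moving window’ average can be sketched in a few lines. This is an illustration under stated assumptions, not the lecture's code: the function name `moving_window_average` is hypothetical, gradients are scalars for brevity (vectors in practice), and during the first M − 1 steps the sketch averages over however many gradients exist so far, which the slides do not specify.

```python
from collections import deque

def moving_window_average(gradients, M):
    """Yield v_t = average of the most recent (at most M) stochastic gradients."""
    window = deque(maxlen=M)  # automatically drops the oldest gradient
    for delta in gradients:
        window.append(delta)
        # Assumption: before M gradients are available, average what exists.
        yield sum(window) / len(window)

# Hypothetical scalar stochastic gradients Delta_1 .. Delta_5.
deltas = [4.0, 2.0, 6.0, 0.0, 2.0]
estimates = list(moving_window_average(deltas, M=3))
print(estimates)
```

Each step costs one new gradient computation instead of M, but the sketch has to keep the last M stochastic gradients in memory.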

issue with ‘moving window’ average:

(5)

Averaging SG Non-uniformly

Running Average

$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1 - \beta) \Delta_t$

with $0 \le \beta < 1$ to control how much history to take

$\beta = 0$: original SGD

$\mathbf{v}_t = \sum_{m=1}^{t} \beta^{t-m} (1 - \beta) \Delta_m$

(size-$t$ window, exponentially-decreasing averaging)

SGD with momentum: optimization direction = current SG ($\Delta_t$) + historical inertia ($\mathbf{v}_{t-1}$)
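As a quick check of the running average, the sketch below compares the recursive update with its unrolled form, in which the gradient from step m receives weight β^(t−m)(1 − β). It assumes v_0 = 0 and scalar gradients for brevity; the function names are hypothetical.

```python
def running_average(deltas, beta):
    """Recursive form: v_t = beta * v_{t-1} + (1 - beta) * delta_t, with v_0 = 0."""
    v = 0.0  # assumption: the average starts from zero
    history = []
    for delta in deltas:
        v = beta * v + (1.0 - beta) * delta
        history.append(v)
    return history

def unrolled_sum(deltas, beta):
    """Unrolled form of the last v_t: sum over m of beta^(t-m) * (1-beta) * delta_m."""
    t = len(deltas)
    return sum(beta ** (t - m) * (1.0 - beta) * deltas[m - 1] for m in range(1, t + 1))

# Hypothetical scalar stochastic gradients.
deltas = [4.0, 2.0, 6.0, 0.0, 2.0]

recursive_last = running_average(deltas, beta=0.9)[-1]
print(recursive_last, unrolled_sum(deltas, beta=0.9))  # the two forms agree
print(running_average(deltas, beta=0.0))               # beta = 0 returns the raw SGs
```

Unlike a moving window, the recursive form needs only the previous average, not a buffer of past gradients.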

(6)

Benefits of SGD with Momentum

$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1 - \beta) \Delta_t$

$\mathbf{w}_t = \mathbf{w}_{t-1} - \eta \mathbf{v}_t$

• some variance in SG canceled out
• oscillation across ravine dampened
• shallow local optima/saddle points escaped

SGD with momentum: ‘stabilize’ SG with running average
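The dampening effect can be illustrated with a toy gradient sequence. The sketch below is an assumption-laden illustration, not the lecture's code: `momentum_step` is a hypothetical helper, weights are plain Python lists, and the gradients are hand-made so that the first component flips sign every step (as when bouncing across a ravine) while the second is small but consistent.

```python
def momentum_step(w, v, delta, eta, beta):
    """One update: v_t = beta*v_{t-1} + (1-beta)*delta_t, then w_t = w_{t-1} - eta*v_t."""
    v_new = [beta * vi + (1.0 - beta) * di for vi, di in zip(v, delta)]
    w_new = [wi - eta * vi for wi, vi in zip(w, v_new)]
    return w_new, v_new

w = [0.0, 0.0]
v = [0.0, 0.0]  # assumption: momentum starts at zero
for t in range(1, 51):
    # Hypothetical gradients: +-10 across the ravine, a steady 1 along it.
    steep = 10.0 if t % 2 == 1 else -10.0
    w, v = momentum_step(w, v, [steep, 1.0], eta=0.1, beta=0.9)

print("running average v:", v)
print("weights w:        ", w)
```

In this toy run the alternating component of v stays far below the raw magnitude of 10 because opposite signs cancel, while the consistent component accumulates toward 1.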

(7)

Per-Component Learning Rate

fixed learning rate: $\mathbf{w}_t = \mathbf{w}_{t-1} - \eta \mathbf{v}_t$

per-component learning rate: $\mathbf{w}_t = \mathbf{w}_{t-1} - \boldsymbol{\eta}_t \odot \mathbf{v}_t$

intuition: scales error surface

want: smaller step for larger gradient component

(8)

Running Average of Gradient Magnitude

want: smaller step for larger gradient component, say

$\eta_t = \frac{1}{\|\nabla E(\mathbf{w}_t)\|}$

• full gradient $\nabla E$ not available, SG only
• using $\|\Delta\|$ not very stable

idea: running average of $\Delta_t \odot \Delta_t$

(9)

RMSProp

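$\mathbf{u}_t = \beta \mathbf{u}_{t-1} + (1 - \beta) \Delta_t \odot \Delta_t$

$\boldsymbol{\eta}_t = \eta \,./\, \sqrt{\mathbf{u}_t + \epsilon}$

$\mathbf{w}_t = \mathbf{w}_{t-1} - \boldsymbol{\eta}_t \odot \Delta_t$

RMSProp: SGD + per-component learning rate using running average of magnitude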
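A minimal sketch of one RMSProp update follows, written with plain Python lists. The helper name `rmsprop_step`, the parameter name `eps`, and its default value are assumptions for illustration; the small constant is placed inside the square root here, and other placements are also common.

```python
import math

def rmsprop_step(w, u, delta, eta, beta, eps=1e-8):
    """One update: u_t = beta*u_{t-1} + (1-beta)*delta_t^2 (elementwise),
    then w_t = w_{t-1} - eta / sqrt(u_t + eps) * delta_t (elementwise)."""
    u_new = [beta * ui + (1.0 - beta) * di * di for ui, di in zip(u, delta)]
    w_new = [
        wi - eta / math.sqrt(ui + eps) * di
        for wi, ui, di in zip(w, u_new, delta)
    ]
    return w_new, u_new

# Hypothetical gradient with very different component magnitudes.
delta = [100.0, -0.01]
w, u = rmsprop_step([0.0, 0.0], [0.0, 0.0], delta, eta=0.1, beta=0.0, eps=0.0)
print(w)
```

With β = 0 and no `eps`, each component moves by η in the direction opposite its gradient sign, so the component with the large gradient gets a proportionally smaller learning rate.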

(10)

Adam: Adaptive Moment Estimation

Adam ≈ momentum + RMSProp + global decay

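$\mathbf{v}_t = \beta_1 \mathbf{v}_{t-1} + (1 - \beta_1) \Delta_t$

$\mathbf{u}_t = \beta_2 \mathbf{u}_{t-1} + (1 - \beta_2) \Delta_t \odot \Delta_t$

$\boldsymbol{\eta}_t = \eta \,./\, \sqrt{\mathbf{u}_t + \epsilon} \,./\, \sqrt{t/N}$

$\mathbf{w}_t = \mathbf{w}_{t-1} - \boldsymbol{\eta}_t \odot \mathbf{v}_t$

• momentum in $\mathbf{v}_t$
• RMSProp in $\mathbf{u}_t$
• global decay by $\sqrt{t/N}$
• (some minor correction of estimation)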

Adam usually more aggressive than original SGD (but can also overfit faster)
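The three ingredients can be combined in one update function. The sketch below follows the simplified equations of this slide, reading the decay term as division by √(t/N), and it omits the minor correction of estimation, so it is not the full published Adam algorithm. The name `adam_like_step`, the parameter `N` (whose meaning the slide leaves open), `eps`, and all default values are assumptions for illustration.

```python
import math

def adam_like_step(w, v, u, delta, t, eta, beta1=0.9, beta2=0.999, N=1.0, eps=1e-8):
    """One simplified update at iteration t >= 1 (no bias correction)."""
    # Momentum: running average of the stochastic gradient.
    v_new = [beta1 * vi + (1.0 - beta1) * di for vi, di in zip(v, delta)]
    # RMSProp: running average of the elementwise squared gradient.
    u_new = [beta2 * ui + (1.0 - beta2) * di * di for ui, di in zip(u, delta)]
    # Global decay: shrink every step as t grows (assumed reading of the slide).
    decay = math.sqrt(t / N)
    w_new = [
        wi - eta / math.sqrt(ui + eps) / decay * vi
        for wi, ui, vi in zip(w, u_new, v_new)
    ]
    return w_new, v_new, u_new

# Usage on a hypothetical objective E(w) = w_1^2 + 10 * w_2^2.
w, v, u = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 101):
    delta = [2.0 * w[0], 20.0 * w[1]]  # exact gradient stands in for an SG here
    w, v, u = adam_like_step(w, v, u, delta, t, eta=0.1)
print(w)
```

Setting `beta1 = 0` recovers an RMSProp-style step with decay, and setting `beta2 = 0` makes the rate depend only on the current squared gradient, which may help when experimenting with each ingredient separately.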

Machine Learning Soundings (ᘤ)

Machine Learning Soundings ( 機器學習深測)

Lecture 3: Optimization in Deep Learning

Department of Computer Science

& Information Engineering

National Taiwan University ( 國立台灣大學資訊工程系)

Roadmap

1

Lecture 1: Neural Network

pattern feature extraction

layers of neurons

backprop

Lecture 3: Optimization in Deep Learning Difficulty of Deep Learning Optimization

2 Deep Learning Models

Difficulty of Deep Learning Optimization

error surface complicated

•

•

•

•

stability <> computation trade-off

averaging

Running Average Estimate of Gradient

w t

t−1

t

original minibatch SG

v t

t

averaging by multiple SG

(m) t

v t

M 1

M

m=1

(m) t

speedup by reusing each ∆ t = ∆ (1) t v t

M 1

M

m=1

t−m+1

Averaging SG Non-uniformly

Running Average

v t

t−1

t

v t

t

m=1

t−m

t

with momentum: optimization direction

t

t−1

Benefits of SGD with Momentum

v t

t−1

t w t

w t−1

t

•

•

•

Per-Component Learning Rate

w t

w t−1

t

w t

w t−1

t

t

Running Average of Gradient Magnitude

t

t

•

•

t

t

RMSProp

u t

Machine Learning Soundings (ᘤ)

w _t

_t−1

v _t

^(m) _t

v _t

_M ¹

^(m) _t

speedup by reusing each ∆ t = ∆ ⁽¹⁾ _t v _t

_M ¹

_t−m+1

v _t

_t−1

_t

v _t

^t−m

_t

_t−1

v _t

_t−1

_t w t

w _t−1

w _t

w _t−1

_t

w _t−1

_t

_t

_t

_t

u _t

_t−1

_t

_t

_t

w _t−1

_t

_t

v _t

₁ v _t−1

₁

_t u t

₂ u _t−1

₂

_t

_t

u _t

w _t

w _t−1

_t

v _t

u _t