Academic year: 2022


Machine Learning Soundings (機器學習深測)

Lecture 3: Optimization in Deep Learning

Hsuan-Tien Lin (林軒田)

[email protected]

Department of Computer Science & Information Engineering

National Taiwan University (國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Soundings 0/9


Roadmap

1 Deep Learning Foundations

Lecture 1: Neural Network

automatic pattern feature extraction from layers of neurons with backprop for GD/SGD

Lecture 3: Optimization in Deep Learning

Difficulty of Deep Learning Optimization

2 Deep Learning Models


Difficulty of Deep Learning Optimization

error surface complicated

local minima: not as bad as imagined

saddle points/local maxima: easily escapable (especially with SGD)

plateau: need larger learning rate η

ravines: need to avoid oscillation

stability vs. computation trade-off

slow computation of gradient (backprop)

⇒ SGD on minibatch

⇒ ‘unstable’ estimate of gradient

getting more stable estimate? averaging


Running Average Estimate of Gradient

gradient descent: $w_t \leftarrow w_{t-1} - \eta \cdot v_t$

original minibatch SG

gradient estimate $v_t = \Delta_t$ from one minibatch SG

averaging by multiple SG

if minibatch SG for $M$ times at $t$-th iteration, each getting $\Delta_t^{(m)}$, more stable gradient estimate by uniform averaging

$v_t = \frac{1}{M} \sum_{m=1}^{M} \Delta_t^{(m)}$

(needing $M$ times more computation than original minibatch SGD)

speedup by reusing each $\Delta_t = \Delta_t^{(1)}$

$v_t = \frac{1}{M} \sum_{m=1}^{M} \Delta_{t-m+1}$

(‘moving window’ average of SG)
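As a rough illustration of the ‘moving window’ average, the sketch below keeps the last M stochastic gradients and averages them uniformly. The scalar gradients are made-up values, and dividing by the number of gradients seen so far while fewer than M are available is an assumption; the slide does not say how the first iterations are handled.

```python
from collections import deque


def moving_window_average(gradients, M):
    """Yield v_t, the uniform average of the (up to) M most recent stochastic gradients."""
    window = deque(maxlen=M)  # silently drops the oldest gradient once M are stored
    for delta in gradients:
        window.append(delta)
        # Assumption: before M gradients exist, average over what is available.
        yield sum(window) / len(window)


# Hypothetical scalar stochastic gradients: true gradient 1.0 plus alternating noise.
deltas = [1.0 + (0.5 if t % 2 == 0 else -0.5) for t in range(8)]
estimates = list(moving_window_average(deltas, M=4))
```

Once the window is full, the alternating noise cancels inside it, so the estimate sits at the underlying value even though each individual gradient is off by 0.5.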

issue with ‘moving window’ average:


Averaging SG Non-uniformly

Running Average

$v_t = \beta v_{t-1} + (1 - \beta) \Delta_t$

with $0 \le \beta < 1$ to control how much history to take

$\beta = 0$: original SGD

$v_t = \sum_{m=1}^{t} \beta^{t-m} (1 - \beta) \Delta_m$ (taking $v_0 = 0$)

(size-$t$ window, exponentially-decreasing averaging)

SGD with momentum: optimization direction = current SG ($\Delta_t$) + historical inertia ($v_{t-1}$)
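A small numerical check may help here: unrolling the recursion from a zero start puts weight β^(t−m)(1−β) on the m-th stochastic gradient, so the recursive and explicit forms should agree. The sketch below assumes v_0 = 0 and uses made-up scalar gradients.

```python
def running_average(deltas, beta):
    """Recursive form: v_t = beta * v_{t-1} + (1 - beta) * delta_t, assuming v_0 = 0."""
    v = 0.0
    for delta in deltas:
        v = beta * v + (1.0 - beta) * delta
    return v


def unrolled_sum(deltas, beta):
    """Explicit form: sum over m = 1..t of beta^(t - m) * (1 - beta) * delta_m."""
    t = len(deltas)
    return sum(beta ** (t - m) * (1.0 - beta) * deltas[m - 1] for m in range(1, t + 1))


deltas = [0.8, -0.3, 1.2, 0.5, -0.1]  # hypothetical scalar stochastic gradients
beta = 0.9
recursive = running_average(deltas, beta)
explicit = unrolled_sum(deltas, beta)
no_history = running_average(deltas, 0.0)  # beta = 0 keeps only the latest gradient
```

The recursive form needs only the previous v, which is why it is cheaper than storing a window of past gradients.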


Benefits of SGD with Momentum

$v_t = \beta v_{t-1} + (1 - \beta) \Delta_t$

$w_t = w_{t-1} - \eta v_t$

some variance in SG canceled out

oscillation across ravine dampened

shallow local optima/saddle points escaped

SGD with momentum: ‘stabilize’ SG with running average
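The dampening effect can be seen on a toy gradient sequence. In the sketch below (an illustration with made-up numbers, not an experiment from the lecture), the first component flips sign every step, as it would when bouncing across a ravine, while the second is small but consistent. The running average shrinks the first and preserves the second.

```python
def momentum_direction(gradients, beta):
    """Return the final running average v_t of 2-D gradients, starting from v_0 = (0, 0)."""
    v = [0.0, 0.0]
    for g in gradients:
        v = [beta * vi + (1.0 - beta) * gi for vi, gi in zip(v, g)]
    return v


# Hypothetical ravine: component 0 oscillates with magnitude 1 (across the ravine),
# component 1 is a steady 0.1 (along the ravine).
gradients = [((-1.0) ** t, 0.1) for t in range(200)]
v = momentum_direction(gradients, beta=0.9)
```

For an exactly alternating unit signal the running average settles at magnitude (1 − β)/(1 + β), about 0.053 for β = 0.9, while the steady component converges to its own value of 0.1.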


Per-Component Learning Rate

fixed learning rate: $w_t = w_{t-1} - \eta v_t$

per-component learning rate: $w_t = w_{t-1} - \eta_t \odot v_t$

intuition: scales error surface

want: smaller step for larger gradient component
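To make the ‘scales error surface’ intuition concrete, the sketch below compares one shared rate with hand-picked per-component rates on a badly scaled quadratic. The objective, the rates, and the use of the exact gradient in place of v_t are all assumptions chosen for illustration.

```python
def descend(w, curvature, eta, steps):
    """Iterate w <- w - eta * grad (elementwise) on E(w) = 0.5 * sum(c_i * w_i^2)."""
    for _ in range(steps):
        grad = [c * wi for c, wi in zip(curvature, w)]       # exact gradient of E
        w = [wi - e * g for wi, e, g in zip(w, eta, grad)]   # per-component step
    return w


curvature = [100.0, 1.0]  # hypothetical: steep in the first component, flat in the second
w0 = [1.0, 1.0]
w_fixed = descend(w0, curvature, eta=[0.01, 0.01], steps=50)   # same rate for both
w_scaled = descend(w0, curvature, eta=[0.009, 0.9], steps=50)  # smaller rate where steeper
```

With the shared rate, the flat component barely moves in 50 steps because the rate is capped by the steep one. With the per-component rates both components shrink by the same factor per step.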


Running Average of Gradient Magnitude

want: smaller step for larger gradient component, say $\eta_t = \frac{1}{\|\nabla E(w_t)\|}$

full gradient $\nabla E$ not available, SG only

using $\|\Delta\|$ not very stable

idea: running average of $\Delta_t \odot \Delta_t$


RMSProp

$u_t = \beta u_{t-1} + (1 - \beta) \Delta_t \odot \Delta_t$

$\eta_t = \eta \mathbin{./} \sqrt{u_t + \epsilon}$

$w_t = w_{t-1} - \eta_t \odot \Delta_t$

RMSProp: SGD + per-component learning rate using running average of magnitude
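A minimal sketch of one RMSProp update on plain Python lists follows. The default values for eta, beta, and eps, the placement of eps inside the square root, and the toy quadratic objective are assumptions for illustration; the exact gradient stands in for a minibatch SG.

```python
import math


def rmsprop_step(w, u, delta, eta=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update on lists of floats; returns the new (w, u)."""
    # u: per-component running average of the squared stochastic gradient
    u = [beta * ui + (1.0 - beta) * d * d for ui, d in zip(u, delta)]
    # eta_t: per-component rate, smaller where the running magnitude is larger
    eta_t = [eta / math.sqrt(ui + eps) for ui in u]
    w = [wi - e * d for wi, e, d in zip(w, eta_t, delta)]
    return w, u


# Hypothetical badly scaled objective E(w) = 0.5 * (100 * w_1^2 + w_2^2).
w, u = [1.0, 1.0], [0.0, 0.0]
for _ in range(1000):
    delta = [100.0 * w[0], 1.0 * w[1]]  # exact gradient used in place of a minibatch SG
    w, u = rmsprop_step(w, u, delta)
```

Because each component is divided by its own running magnitude, both coordinates move at a similar pace despite the 100x difference in curvature. Near the minimum the iterate hovers within a few multiples of eta rather than converging exactly.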


Adam: Adaptive Moment Estimation

Adam ≈ momentum + RMSProp + global decay

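$v_t = \beta_1 v_{t-1} + (1 - \beta_1) \Delta_t$

$u_t = \beta_2 u_{t-1} + (1 - \beta_2) \Delta_t \odot \Delta_t$

$\eta_t = \eta \mathbin{./} \sqrt{u_t + \epsilon} \mathbin{./} \sqrt{t/N}$

$w_t = w_{t-1} - \eta_t \odot v_t$

momentum in $v_t$

RMSProp in $u_t$

global decay by $\sqrt{t/N}$

(some minor correction of estimation)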

Adam usually more aggressive than original SGD (but can also overfit faster)
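The update lines on this slide can be sketched as a single step function. This follows the form written here rather than the full published algorithm: the estimation correction mentioned above is omitted, N is treated as a hypothetical user-chosen constant because the slide does not define it, eps is placed inside the square root by assumption, and all default values are assumptions.

```python
import math


def adam_like_step(w, v, u, delta, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, N=1):
    """One update in the slide's form (no estimation correction); t counts from 1."""
    v = [beta1 * vi + (1.0 - beta1) * d for vi, d in zip(v, delta)]      # momentum
    u = [beta2 * ui + (1.0 - beta2) * d * d for ui, d in zip(u, delta)]  # RMSProp
    decay = math.sqrt(t / N)                                             # global decay
    eta_t = [eta / math.sqrt(ui + eps) / decay for ui in u]
    w = [wi - e * vi for wi, e, vi in zip(w, eta_t, v)]
    return w, v, u


# Same hypothetical starting state and gradient, evaluated at two iteration counts.
w_early, _, _ = adam_like_step([1.0], [0.0], [0.0], [2.0], t=1)
w_late, _, _ = adam_like_step([1.0], [0.0], [0.0], [2.0], t=4)
step_early = 1.0 - w_early[0]
step_late = 1.0 - w_late[0]
```

From identical state, the step at t = 4 is half the step at t = 1, which is the global decay at work; the direction comes from the momentum term and the per-component scaling from the RMSProp term.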
