(1)

Backpropagation for Optimization

Applied Deep Learning

March 10th, 2020 http://adl.miulab.tw

(2)

Parameter Optimization

(3)

Notation Summary

$a_i^l$ : output of a neuron
$a^l$ : output vector of a layer
$z_i^l$ : input of the activation function
$z^l$ : input vector of the activation function for a layer
$w_{ij}^l$ : a weight
$W^l$ : a weight matrix
$b_i^l$ : a bias
$b^l$ : a bias vector

(4)

Layer Output Relation – from a to z

[Figure: Layer l−1 with $N_{l-1}$ nodes (outputs $a_1^{l-1}, a_2^{l-1}, \dots, a_j^{l-1}$) fully connected to Layer l with $N_l$ nodes (pre-activations $z_1^l, z_2^l, \dots, z_i^l$ and outputs $a_1^l, a_2^l, \dots, a_i^l$)]

$$z_i^l = \sum_j w_{ij}^l\, a_j^{l-1} + b_i^l$$

(5)

Layer Output Relation – from z to a

[Figure: the same two layers; each output $a_i^l$ is obtained by applying the activation function $\sigma$ to $z_i^l$]

$$\begin{bmatrix} a_1^l \\ a_2^l \\ \vdots \\ a_i^l \\ \vdots \end{bmatrix} = \begin{bmatrix} \sigma(z_1^l) \\ \sigma(z_2^l) \\ \vdots \\ \sigma(z_i^l) \\ \vdots \end{bmatrix} \qquad a_i^l = \sigma(z_i^l), \qquad a^l = \sigma(z^l)$$

(6)

Layer Output Relation

[Figure: Layer l−1 ($N_{l-1}$ nodes) fully connected to Layer l ($N_l$ nodes)]

$$z^l = W^l a^{l-1} + b^l \qquad a^l = \sigma(z^l)$$
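These two relations are all a layer computes. A minimal NumPy sketch of one layer (the sigmoid activation and the layer sizes are illustrative assumptions, not part of the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, b, a_prev):
    """Compute one layer's pre-activation z^l and output a^l."""
    z = W @ a_prev + b      # z^l = W^l a^{l-1} + b^l
    a = sigmoid(z)          # a^l = sigma(z^l)
    return z, a

# Example: Layer l-1 has 3 nodes, Layer l has 2 nodes.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))     # W^l has shape (N_l, N_{l-1})
b = rng.normal(size=2)          # b^l
a_prev = rng.normal(size=3)     # a^{l-1}
z, a = layer_forward(W, b, a_prev)
```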

(7)

Neural Network Formulation

Fully connected feedforward network

$$f: \mathbb{R}^N \rightarrow \mathbb{R}^M$$

[Figure: input vector $x = (x_1, x_2, \dots, x_N)$ → Layer 1 → Layer 2 → … → Layer L → output vector $y = (y_1, y_2, \dots, y_M)$]

(8)

Neural Network Formulation

Fully connected feedforward network

$$f: \mathbb{R}^N \rightarrow \mathbb{R}^M$$

[Figure: the same network; stacking the layer relations from the previous slides gives the whole function as a nested composition]

$$y = f(x) = \sigma\!\left(W^L\,\sigma\!\left(\cdots\,\sigma\!\left(W^1 x + b^1\right)\cdots\right) + b^L\right)$$

(9)

Loss Function for Training

Training data: $(x^1, \hat{y}^1), (x^2, \hat{y}^2), \dots$, where $x$ is the function input and $\hat{y}$ is the target output.

Model: a hypothesis function set $\{f_1, f_2, \dots\}$.

Training: pick the "best" function $f^*$.

A "good" function keeps its outputs close to the targets. Define an example loss function as the sum over the errors of all training samples:

$$C(\theta) = \sum_r C^r(\theta)$$

where $C^r(\theta)$ measures the error of the network on the r-th training sample (e.g. the negative log-likelihood or the squared error).
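A tiny sketch of such a loss, assuming squared error as the per-sample error (the choice of error and the toy linear model are assumptions for illustration):

```python
import numpy as np

def example_loss(f, theta, xs, ys_hat):
    """Sum over the error of all training samples; squared error is assumed here."""
    return sum(np.sum((f(x, theta) - y_hat) ** 2) for x, y_hat in zip(xs, ys_hat))

# Toy usage with a linear "network" f(x; theta) = theta @ x.
f = lambda x, theta: theta @ x
theta = np.ones((2, 3))
xs = [np.arange(3.0), np.ones(3)]
ys_hat = [np.zeros(2), np.ones(2)]
C = example_loss(f, theta, xs, ys_hat)
```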

(10)

Gradient Descent for Neural Network

Computing the gradient involves millions of parameters. To compute it efficiently, we use backpropagation.

Algorithm
Initialization: start at $\theta^0$
while ($\theta^{(i+1)} \neq \theta^{(i)}$) {
    compute the gradient at $\theta^{(i)}$
    update the parameters: $\theta^{(i+1)} = \theta^{(i)} - \eta \nabla C(\theta^{(i)})$
}
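A minimal sketch of this loop (the learning rate, stopping tolerance, and the toy quadratic cost in the usage line are assumptions; for a neural network, grad_C would be computed with backpropagation):

```python
import numpy as np

def gradient_descent(grad_C, theta0, lr=0.1, max_iters=1000, tol=1e-8):
    """Iterate until the parameters stop changing, as in the algorithm above."""
    theta = theta0
    for _ in range(max_iters):
        g = grad_C(theta)                 # compute the gradient at theta^i
        theta_next = theta - lr * g       # update the parameters
        if np.allclose(theta_next, theta, atol=tol):  # theta^{i+1} ~= theta^i -> stop
            return theta_next
        theta = theta_next
    return theta

# Toy usage: minimize C(theta) = ||theta||^2, whose gradient is 2*theta.
theta_star = gradient_descent(lambda th: 2 * th, theta0=np.array([3.0, -4.0]))
```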

(11)

How can we compute the gradients of this many parameters efficiently?

Backpropagation

(12)

Forward vs. Back Propagation

In a feedforward neural network:

forward propagation
○ information flows forward through the network, from input $x$ to output $y$
○ during training, forward propagation continues until it produces a scalar cost $C(\theta)$

back-propagation
○ lets the information from the cost flow backwards through the network in order to compute the gradient
○ can be applied to any function

[Figure: the network from inputs $x_1, \dots, x_N$ to outputs $y_1, \dots, y_M$]

(13)

Chain Rule

[Figure: a chain of functions $w \xrightarrow{f} x \xrightarrow{f} y \xrightarrow{f} z$]

$$\frac{dz}{dw} = \frac{dz}{dy}\cdot\frac{dy}{dx}\cdot\frac{dx}{dw}$$

forward propagation for the cost (left to right); back-propagation for the gradient (right to left)
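A numeric illustration of this picture (using $f(t) = t^2$ at every step is an assumption for the example):

```python
# Forward propagation for the "cost" z.
w = 0.5
x = w ** 2          # x = f(w)
y = x ** 2          # y = f(x)
z = y ** 2          # z = f(y)

# Back-propagation for the gradient dz/dw via the chain rule.
dz_dy = 2 * y
dy_dx = 2 * x
dx_dw = 2 * w
dz_dw = dz_dy * dy_dx * dx_dw
# Check: z = w**8, so dz/dw = 8 * w**7.
assert abs(dz_dw - 8 * w ** 7) < 1e-12
```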

(14)

Gradient Descent for Neural Network

Computing the gradient involves millions of parameters. To compute it efficiently, we use backpropagation.

Algorithm
Initialization: start at $\theta^0$
while ($\theta^{(i+1)} \neq \theta^{(i)}$) {
    compute the gradient at $\theta^{(i)}$
    update the parameters: $\theta^{(i+1)} = \theta^{(i)} - \eta \nabla C(\theta^{(i)})$
}

(15)

[Figure: node i of Layer l, with bias $b_i^l$, receiving $a_1^{l-1}, a_2^{l-1}, \dots, a_j^{l-1}$ from the nodes of Layer l−1]

(16)

[Figure: the same Layer l−1 → Layer l connection diagram as the previous slide]

(17)

[Figure: node i of Layer l with bias $b_i^l$, here fed directly by the inputs $x_1^r, x_2^r, \dots, x_j^r$ of training sample r]

(18)

[Figure: the same Layer l−1 → Layer l connection diagram as before]

(19)

[Figure: the same Layer l−1 → Layer l connection diagram as before]

(20)

[Figure: the chain of layers l−1, l, l+1, …, L (the output layer) with the propagated gradients $\delta^l, \delta^{l+1}, \dots, \delta^{L-1}, \delta^L$ attached to their layers]

$\delta^l$: the propagated gradient corresponding to the l-th layer

Idea: computing $\delta^l$ layer by layer (from $\delta^L$ back to $\delta^1$) is more efficient.

(21)

◉ Idea: from L to 1
○ Initialization: compute $\delta^L$
○ Compute $\delta^l$ based on $\delta^{l+1}$

(22)

◉ Idea: from L to 1
○ Initialization: compute $\boldsymbol{\delta^L}$
○ Compute $\delta^l$ based on $\delta^{l+1}$

$\partial C / \partial y_i$ depends on the loss function.

(23)

◉ Idea: from L to 1
○ Initialization: compute $\boldsymbol{\delta^L}$
○ Compute $\delta^l$ based on $\delta^{l+1}$

$$\delta_n^L = \sigma'(z_n^L)\,\frac{\partial C}{\partial y_n} \qquad\Longrightarrow\qquad \delta^L = \sigma'(z^L)\odot\nabla_y C(y)$$
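A small sketch of this initialization step, assuming sigmoid activations and a squared-error cost $C = \lVert y - \hat{y}\rVert^2$ (both are assumptions; $\partial C/\partial y$ depends on the loss actually used):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def delta_L(z_L, y_hat):
    """delta^L = sigma'(z^L) * dC/dy, elementwise."""
    y = sigmoid(z_L)
    dC_dy = 2.0 * (y - y_hat)        # depends on the loss function (squared error here)
    sigma_prime = y * (1.0 - y)      # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    return sigma_prime * dC_dy

d_L = delta_L(z_L=np.array([0.3, -1.2]), y_hat=np.array([1.0, 0.0]))
```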

(24)

◉ Idea: from L to 1
○ Initialization: compute $\delta^L$
○ Compute $\boldsymbol{\delta^l}$ based on $\boldsymbol{\delta^{l+1}}$

[Figure: a small change $\Delta z_i^l$ at node i of Layer l changes its output by $\Delta a_i^l$, which in turn changes every pre-activation $\Delta z_1^{l+1}, \Delta z_2^{l+1}, \dots, \Delta z_k^{l+1}$ of Layer l+1; therefore all the $\delta_k^{l+1}$ contribute to $\delta_i^l$]

(25)

◉ Idea: from L to 1
○ Initialization: compute $\delta^L$
○ Compute $\boldsymbol{\delta^l}$ based on $\boldsymbol{\delta^{l+1}}$

[Figure: the same picture with the weights and bias $b_i^l$ drawn in; collecting the contributions gives]

$$\delta_i^l = \sigma'(z_i^l)\sum_k w_{ki}^{l+1}\,\delta_k^{l+1}$$

(26)

Rethink the propagation

[Figure: the computation of $\delta_i^l$ viewed as a network running in the opposite direction: $\delta_1^{l+1}, \delta_2^{l+1}, \dots, \delta_k^{l+1}$ from Layer l+1 are the inputs, they are combined through the weights, and node i multiplies the result by $\sigma'(z_i^l)$ to output $\delta_i^l$; since $z_i^l$ is already fixed by the forward pass, $\sigma'(z_i^l)$ is just a constant multiplier]

(27)

[Figure: the backward network for a whole layer: $\delta_1^{l+1}, \delta_2^{l+1}, \dots, \delta_k^{l+1}$ from Layer l+1 pass through the weights and the amplifiers $\sigma'(z_1^l), \sigma'(z_2^l), \dots, \sigma'(z_i^l)$ of Layer l to produce $\delta_1^l, \delta_2^l, \dots, \delta_i^l$]

(28)

◉ Idea: from L to 1
○ Initialization: compute $\delta^L$
○ Compute $\delta^{l-1}$ based on $\delta^l$

$$\delta^L = \sigma'(z^L)\odot\nabla_y C(y) \qquad\qquad \delta^l = \sigma'(z^l)\odot\left(W^{l+1}\right)^{T}\delta^{l+1}$$

[Figure: the full backward network: $\partial C/\partial y_1, \dots, \partial C/\partial y_n$ enter at Layer L, are multiplied by $\sigma'(z_1^L), \dots, \sigma'(z_n^L)$ to give $\delta^L$, then flow through $(W^L)^T$ and the amplifiers $\sigma'(z_1^{L-1}), \dots, \sigma'(z_m^{L-1})$ of Layer L−1, and in general through $(W^{l+1})^T$ to give each $\delta^l$]
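A minimal sketch of this layer-by-layer recursion (the list-of-matrices bookkeeping and the name backward_deltas are illustrative assumptions):

```python
import numpy as np

def backward_deltas(Ws, zs, dC_dy, sigma_prime):
    """Compute delta^L, ..., delta^1 from the cached pre-activations zs = [z^1, ..., z^L],
    the weights Ws = [W^1, ..., W^L], and dC/dy at the output.
    Implements delta^L = sigma'(z^L) * dC/dy and delta^l = sigma'(z^l) * (W^{l+1})^T delta^{l+1}."""
    L = len(zs)
    deltas = [None] * L
    deltas[L - 1] = sigma_prime(zs[L - 1]) * dC_dy
    for l in range(L - 2, -1, -1):                       # from layer L-1 down to layer 1
        deltas[l] = sigma_prime(zs[l]) * (Ws[l + 1].T @ deltas[l + 1])
    return deltas

# Toy usage: a 3 -> 4 -> 2 network with sigmoid activations.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigma_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
zs = [rng.normal(size=4), rng.normal(size=2)]
dC_dy = rng.normal(size=2)
deltas = backward_deltas(Ws, zs, dC_dy, sigma_prime)
```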

(29)

Backpropagation

Forward Pass

[Figure: the forward pass computes the activations $a_1^l, a_2^l, \dots, a_i^l$ of Layer l and $a_1^{l+1}, a_2^{l+1}, \dots, a_k^{l+1}$ of Layer l+1, and so on for every layer]

(30)

Backpropagation

Backward Pass

[Figure: the backward pass runs the reverse-direction network: starting from $\delta_1^{l+1}, \delta_2^{l+1}, \dots, \delta_k^{l+1}$ of Layer l+1, the amplifiers $\sigma'(z_1^l), \sigma'(z_2^l), \dots, \sigma'(z_i^l)$ produce $\delta_1^l, \delta_2^l, \dots, \delta_i^l$ of Layer l, and so on back to the first layer]

(31)

Gradient Descent for Optimization

Algorithm
Initialization: start at $\theta^0$
while ($\theta^{(i+1)} \neq \theta^{(i)}$) {
    compute the gradient at $\theta^{(i)}$ (by backpropagation)
    update the parameters: $\theta^{(i+1)} = \theta^{(i)} - \eta \nabla C(\theta^{(i)})$
}

(32)

Concluding Remarks

[Figure: the weight $w_{ij}^l$ connecting node j of Layer l−1 to node i of Layer l]

$$\frac{\partial C}{\partial w_{ij}^l} = a_j^{l-1}\,\delta_i^l \qquad\text{(for } l = 1:\; a_j^{0} = x_j\text{)}$$

○ Forward Pass: provides $a_j^{l-1}$
○ Backward Pass: provides $\delta_i^l$

Compute the gradient from these two pre-computed terms of the forward and backward passes.
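Putting everything together, a minimal end-to-end sketch for a single training sample (sigmoid activations and squared-error cost are assumptions; the function name backprop and the list-based bookkeeping are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(Ws, bs, x, y_hat):
    """Return dC/dW^l and dC/db^l for every layer, for one training sample."""
    # Forward pass: cache a^{l-1} and z^l for every layer.
    a, activations, zs = x, [x], []
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)

    # Backward pass: delta^L, then delta^l = sigma'(z^l) * (W^{l+1})^T delta^{l+1}.
    sigma_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))
    delta = sigma_prime(zs[-1]) * 2.0 * (activations[-1] - y_hat)
    grads_W, grads_b = [None] * len(Ws), [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):
        grads_W[l] = np.outer(delta, activations[l])   # dC/dw_ij^l = delta_i^l * a_j^{l-1}
        grads_b[l] = delta                             # dC/db_i^l = delta_i^l
        if l > 0:
            delta = sigma_prime(zs[l - 1]) * (Ws[l].T @ delta)
    return grads_W, grads_b

# Toy usage: a 3 -> 4 -> 2 network.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [rng.normal(size=4), rng.normal(size=2)]
grads_W, grads_b = backprop(Ws, bs, x=rng.normal(size=3), y_hat=np.array([1.0, 0.0]))
```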
