Backpropagation for Optimization
Applied Deep Learning
March 10th, 2020 http://adl.miulab.tw
Parameter Optimization
Notation Summary
◉ $a_i^l$ : output of a neuron; $a^l$ : output vector of a layer
◉ $z_i^l$ : input of the activation function; $z^l$ : input vector of the activation function for a layer
◉ $w_{ij}^l$ : a weight; $W^l$ : a weight matrix
◉ $b_i^l$ : a bias; $b^l$ : a bias vector
Layer Output Relation – from a to z
[Figure: layer $l-1$ with $N_{l-1}$ nodes (outputs $a_1^{l-1}, a_2^{l-1}, \dots, a_j^{l-1}, \dots$) feeding layer $l$ with $N_l$ nodes (activation inputs $z_1^l, z_2^l, \dots, z_i^l, \dots$)]

◉ Each activation input in layer $l$ is a weighted sum of the previous layer's outputs plus a bias:
  $z_i^l = \sum_j w_{ij}^l\, a_j^{l-1} + b_i^l$
Layer Output Relation – from z to a
[Figure: within layer $l$, each activation input $z_i^l$ passes through the activation function $\sigma$ to produce the neuron output $a_i^l$]

◉ $a_i^l = \sigma(z_i^l)$, and in vector form $a^l = \sigma(z^l)$ (applied elementwise)
Layer Output Relation
[Figure: layer $l-1$ outputs $a^{l-1}$ feeding layer $l$, which computes $z^l$ and then $a^l$]

◉ $z^l = W^l a^{l-1} + b^l$
◉ $a^l = \sigma(z^l)$
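To make the two relations concrete, here is a minimal NumPy sketch of a single layer. The layer sizes are made up, and the sigmoid is only an assumed choice for the activation function $\sigma$ (the slides do not fix a particular one):

```python
import numpy as np

def sigma(z):
    # Assumed activation: sigmoid, applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: N_{l-1} = 3 nodes in layer l-1, N_l = 2 nodes in layer l
rng = np.random.default_rng(0)
W_l = rng.normal(size=(2, 3))    # weight matrix W^l
b_l = rng.normal(size=2)         # bias vector b^l
a_prev = rng.normal(size=3)      # a^{l-1}: output vector of layer l-1

z_l = W_l @ a_prev + b_l         # z^l = W^l a^{l-1} + b^l
a_l = sigma(z_l)                 # a^l = sigma(z^l)
```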
Neural Network Formulation
◉ Fully connected feedforward network: $f: \mathbb{R}^N \to \mathbb{R}^M$

[Figure: input vector $x = (x_1, x_2, \dots, x_N)$ passes through Layer 1, Layer 2, …, Layer L to produce the output vector $y = (y_1, y_2, \dots, y_M)$]
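A sketch of the full mapping $f: \mathbb{R}^N \to \mathbb{R}^M$ as repeated applications of the per-layer relation above. The layer sizes below are illustrative, not taken from the slides:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Fully connected feedforward network:
    a^0 = x, then z^l = W^l a^{l-1} + b^l and a^l = sigma(z^l) for l = 1..L."""
    a = x
    for W, b in zip(weights, biases):
        a = sigma(W @ a + b)
    return a  # the output vector y

# Illustrative network: N = 3 inputs, one hidden layer of 4 units, M = 2 outputs
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases  = [rng.normal(size=4), rng.normal(size=2)]
y = forward(rng.normal(size=3), weights, biases)
```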
Loss Function for Training
◉ Model: hypothesis function set $\{f_1, f_2, \dots\}$
◉ Training data: $\{(x^1, \hat{y}^1), (x^2, \hat{y}^2), \dots\}$
  ○ $x$: function input; $\hat{y}$: function output
  ○ e.g. $x$ = "It claims too much.", $\hat{y}$ = negative
◉ Training: pick the best function $f^*$
◉ A "good" function keeps the loss small. Define an example loss function as the sum over the error of all training samples:
  $C(\theta) = \sum_r C^r(\theta)$, where $C^r$ is the error on the $r$-th training sample
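A sketch of a loss "summed over the error of all training samples". The squared error used as the per-sample error is an assumption (the slides only say "error"), and forward refers to the network sketch above:

```python
import numpy as np

def sample_error(y, y_hat):
    # Assumed per-sample error: squared Euclidean distance ||y - y_hat||^2
    return float(np.sum((y - y_hat) ** 2))

def total_loss(weights, biases, training_data):
    # C(theta) = sum of the per-sample errors over all training samples
    return sum(sample_error(forward(x, weights, biases), y_hat)
               for x, y_hat in training_data)
```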
Gradient Descent for Neural Network
◉ Computing the gradient involves millions of parameters.
◉ To compute it efficiently, we use backpropagation.

Algorithm (gradient descent)
  Initialization: start at $\theta^0$
  while ($\theta^{(i+1)} \neq \theta^{(i)}$) {
    compute gradient at $\theta^{(i)}$
    update parameters
  }

How can we compute the gradient for this many parameters efficiently?
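The while-loop above can be sketched as follows. The learning rate and the tolerance-based stopping rule are assumptions; the slide only states that the loop runs until the parameters stop changing:

```python
import numpy as np

def gradient_descent(theta0, grad_fn, lr=0.1, max_iters=1000, tol=1e-8):
    """Start at theta^0; repeatedly compute the gradient at theta^(i)
    and update the parameters until they (approximately) stop changing."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        g = grad_fn(theta)                            # compute gradient at theta^(i)
        new_theta = theta - lr * g                    # update parameters
        if np.allclose(new_theta, theta, atol=tol):   # theta^(i+1) == theta^(i)
            break
        theta = new_theta
    return theta

# Toy usage: minimize C(theta) = ||theta||^2, whose gradient is 2 * theta
theta_star = gradient_descent([3.0, -2.0], lambda t: 2 * t)
```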
Backpropagation
Forward v.s. Back Propagation
◉ In a feedforward neural network
  ○ Forward propagation
    ■ Information flows forward through the network, from the input $x$ to the output $y$.
    ■ During training, forward propagation continues onward until it produces a scalar cost $C(\theta)$.
  ○ Back-propagation
    ■ Allows the information from the cost to flow backwards through the network, in order to compute the gradient.
    ■ Can be applied to any function.

[Figure: network from inputs $x_1, x_2, \dots, x_N$ to outputs $y_1, y_2, \dots, y_M$]
Chain Rule
[Figure: a chain of functions $w \xrightarrow{f} x \xrightarrow{f} y \xrightarrow{f} z$; the forward pass propagates values for the cost, the backward pass propagates derivatives for the gradient]

◉ Chain rule along the chain: $\dfrac{dz}{dw} = \dfrac{dz}{dy}\,\dfrac{dy}{dx}\,\dfrac{dx}{dw}$
◉ Forward propagation for cost; back-propagation for gradient.
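A numeric illustration of the chain rule along $w \to x \to y \to z$. The concrete functions are made up for the example:

```python
import math

# Assumed chain (illustrative only): x = w**2, y = sin(x), z = exp(y)
w = 0.5
x = w ** 2
y = math.sin(x)
z = math.exp(y)                      # forward propagation for the value (cost)

# Chain rule, back-propagated for the gradient: dz/dw = dz/dy * dy/dx * dx/dw
dz_dy = math.exp(y)
dy_dx = math.cos(x)
dx_dw = 2 * w
dz_dw = dz_dy * dy_dx * dx_dw

# Finite-difference check that the chained derivative matches a direct estimate
eps = 1e-6
z_shift = math.exp(math.sin((w + eps) ** 2))
assert abs(dz_dw - (z_shift - z) / eps) < 1e-4
```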
[Figures: a neuron $i$ in layer $l$ receives $a_1^{l-1}, a_2^{l-1}, \dots, a_j^{l-1}, \dots$ from layer $l-1$ (or the inputs $x_1^r, x_2^r, \dots, x_j^r$ when $l = 1$), each weighted by $w_{ij}^l$, plus a bias $b_i^l$ attached to a constant input 1]

◉ Since $z_i^l = \sum_j w_{ij}^l\, a_j^{l-1} + b_i^l$, the partial derivatives of $z_i^l$ with respect to the parameters are
  $\dfrac{\partial z_i^l}{\partial w_{ij}^l} = a_j^{l-1}$ (or $x_j^r$ for the first layer) and $\dfrac{\partial z_i^l}{\partial b_i^l} = 1$,
  and both are already available from the forward pass.
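A small numerical check of the forward-pass term above: perturbing $w_{ij}^l$ changes $z_i^l$ at exactly the rate $a_j^{l-1}$. The sizes and indices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
W_l, b_l = rng.normal(size=(2, 3)), rng.normal(size=2)
a_prev = rng.normal(size=3)               # a^{l-1} from the forward pass

i, j, eps = 0, 2, 1e-6
z = W_l @ a_prev + b_l                    # z^l
W_pert = W_l.copy()
W_pert[i, j] += eps                       # perturb w_ij^l
z_pert = W_pert @ a_prev + b_l

# dz_i^l / dw_ij^l  ==  a_j^{l-1}
print((z_pert[i] - z[i]) / eps, a_prev[j])
```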
[Figure: layers $l-1, l, l+1, \dots, L$, with each neuron $i$ of layer $l$ carrying a term $\delta_i^l$; the vectors $\delta^l, \delta^{l+1}, \dots, \delta^{L-1}, \delta^L$ are propagated backwards from the output layer]

◉ $\delta^l$: the propagated gradient corresponding to the $l$-th layer, with components $\delta_i^l = \partial C / \partial z_i^l$
◉ Idea: computing $\delta^l$ layer by layer, from $\delta^L$ down to $\delta^1$, is more efficient
  ○ Initialization: compute $\delta^L$
  ○ Then compute $\delta^l$ based on $\delta^{l+1}$
◉ Initialization: compute $\delta^L$
  ○ $\delta_n^L = \sigma'(z_n^L)\,\dfrac{\partial C}{\partial y_n}$, where $\dfrac{\partial C}{\partial y_n}$ depends on the loss function
  ○ In vector form: $\delta^L = \sigma'(z^L) \odot \nabla_y C$
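A sketch of this initialization for one concrete choice of loss. The squared error $C = \|y - \hat{y}\|^2$ (so $\partial C/\partial y = 2(y - \hat{y})$) and the sigmoid activation are assumptions used only to make the example runnable:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

def delta_L(z_L, y_hat):
    """delta^L = sigma'(z^L) * grad_y C, for the assumed loss C = ||y - y_hat||^2."""
    y = sigma(z_L)                 # network output at the last layer
    grad_y = 2.0 * (y - y_hat)     # dC/dy_n, depends on the loss function
    return sigma_prime(z_L) * grad_y
```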
◉ Compute $\delta^l$ based on $\delta^{l+1}$
  ○ A small change $\Delta z_i^l$ changes the neuron output by $\Delta a_i^l \approx \sigma'(z_i^l)\,\Delta z_i^l$, which in turn perturbs every activation input of layer $l+1$: $\Delta z_k^{l+1} = w_{ki}^{l+1}\,\Delta a_i^l$.

[Figure: neuron $i$ of layer $l$ ($\delta_i^l$) connected to neurons $1, 2, \dots, k, \dots$ of layer $l+1$ ($\delta_1^{l+1}, \delta_2^{l+1}, \dots, \delta_k^{l+1}$); the perturbation $\Delta z_i^l \to \Delta a_i^l \to \Delta z_1^{l+1}, \Delta z_2^{l+1}, \dots, \Delta z_k^{l+1}$]
◉ Summing the contributions through all neurons $k$ of layer $l+1$ gives
  $\delta_i^l = \sigma'(z_i^l) \sum_k w_{ki}^{l+1}\,\delta_k^{l+1}$

[Figure: the terms $\delta_1^{l+1}, \delta_2^{l+1}, \dots, \delta_k^{l+1}$ of layer $l+1$ flowing back into $\delta_i^l$ of layer $l$]
◉ Rethink the propagation
  ○ The backward computation itself looks like a network running in the opposite direction: the $\delta_k^{l+1}$ of layer $l+1$ are the inputs, they are combined through the weights $w_{ki}^{l+1}$, and the result is multiplied by the constant $\sigma'(z_i^l)$ (already fixed by the forward pass) to produce the output $\delta_i^l$.

[Figure: $\delta_1^{l+1}, \delta_2^{l+1}, \dots, \delta_k^{l+1}$ of layer $l+1$ feeding backwards into $\delta_i^l$, with $\sigma'(z_i^l)$ acting as a multiply-by-a-constant node]
[Figure: the backward pass drawn as a network: $\delta_1^{l+1}, \delta_2^{l+1}, \dots, \delta_k^{l+1}$ from layer $l+1$ enter, pass through the weights, are scaled by $\sigma'(z_1^l), \sigma'(z_2^l), \dots, \sigma'(z_i^l)$, and produce $\delta_1^l, \delta_2^l, \dots, \delta_i^l$]
◉ Putting it together, from $L$ down to 1, in matrix form:
  ○ Initialization: $\delta^L = \sigma'(z^L) \odot \nabla_y C$, where $\partial C/\partial y_1, \dots, \partial C/\partial y_n$ depend on the loss function
  ○ Recursion: $\delta^{l} = \sigma'(z^{l}) \odot \left( (W^{l+1})^T \delta^{l+1} \right)$, applied from layer $L-1$ down to layer 1

[Figure: the backward network from the output layer: $\nabla_y C$ enters, is scaled by $\sigma'(z^L)$, multiplied by $(W^L)^T$, scaled by $\sigma'(z^{L-1})$, and so on through $(W^{l+1})^T$ down to $\delta^l$]
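A sketch of this backward recursion in matrix form, reusing sigma_prime from the earlier snippet. The list-indexing convention (lists ordered from layer 1 to L) is an assumption of the sketch:

```python
import numpy as np

def backward_deltas(zs, weights, grad_y):
    """zs[l-1] holds z^l, weights[l-1] holds W^l, grad_y holds dC/dy.
    Returns [delta^1, ..., delta^L], computed from L down to 1:
      delta^L = sigma'(z^L) * grad_y
      delta^l = sigma'(z^l) * ((W^{l+1})^T delta^{l+1})"""
    L = len(zs)
    deltas = [None] * L
    deltas[L - 1] = sigma_prime(zs[L - 1]) * grad_y
    for l in range(L - 2, -1, -1):
        deltas[l] = sigma_prime(zs[l]) * (weights[l + 1].T @ deltas[l + 1])
    return deltas
```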
Backpropagation
Forward Pass
◉ Compute $z^l$ and $a^l$ for every layer, from the input forward, and keep them; this provides the term $\dfrac{\partial z_i^l}{\partial w_{ij}^l} = a_j^{l-1}$ for every weight.

[Figure: activations $a_1^l, a_2^l, \dots, a_i^l$ of layer $l$ feeding $a_1^{l+1}, a_2^{l+1}, \dots, a_k^{l+1}$ of layer $l+1$]
Backpropagation
Backward Pass
◉ Compute $\delta^L$ at the output and propagate it backwards through the reversed network (with $\sigma'(z_i^l)$ as constant multipliers) to obtain $\delta^l$ for every layer; this provides the term $\dfrac{\partial C}{\partial z_i^l} = \delta_i^l$.

[Figure: $\delta_1^{l+1}, \delta_2^{l+1}, \dots, \delta_k^{l+1}$ flowing back through the weights and through $\sigma'(z_1^l), \sigma'(z_2^l), \dots, \sigma'(z_i^l)$ to give $\delta_1^l, \delta_2^l, \dots, \delta_i^l$]
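The forward pass therefore needs to cache $z^l$ and $a^l$ for every layer so that both pre-computed terms are available later. A minimal sketch, continuing the conventions (and the sigma function) of the earlier snippets:

```python
def forward_with_cache(x, weights, biases):
    """Returns (zs, activations) where zs[l-1] = z^l and
    activations[l] = a^l, with activations[0] = a^0 = x."""
    zs, activations = [], [x]
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b        # z^l = W^l a^{l-1} + b^l  (forward-pass term a^{l-1} is cached)
        a = sigma(z)         # a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations
```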
Gradient Descent for Optimization
Algorithm (gradient descent)
  Initialization: start at $\theta^0$
  while ($\theta^{(i+1)} \neq \theta^{(i)}$) {
    compute gradient at $\theta^{(i)}$ (via backpropagation)
    update parameters
  }
Concluding Remarks
[Figure: the weight $w_{ij}^l$ connecting neuron $j$ of layer $l-1$ to neuron $i$ of layer $l$]

◉ The gradient with respect to each weight combines one term from the forward pass and one from the backward pass:
  $\dfrac{\partial C}{\partial w_{ij}^l} = a_j^{l-1}\,\delta_i^l$ (for the first layer, $a_j^0 = x_j$)
◉ Forward pass: compute the activations $a_j^{l-1}$.
◉ Backward pass: compute the propagated gradients $\delta_i^l$.
◉ Compute the gradient based on these two pre-computed terms from the backward and forward passes.
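Putting the two passes together for the concluding formula: each weight gradient is the outer product of the backward term $\delta^l$ and the forward term $a^{l-1}$, and each bias gradient is $\delta^l$ itself. This combines the sketches above (forward_with_cache and backward_deltas), with the squared-error loss still assumed:

```python
import numpy as np

def backprop_gradients(x, y_hat, weights, biases):
    """dC/dW^l = delta^l (a^{l-1})^T and dC/db^l = delta^l,
    built from the cached forward pass and the backward pass."""
    zs, activations = forward_with_cache(x, weights, biases)
    grad_y = 2.0 * (activations[-1] - y_hat)      # dC/dy for the assumed squared error
    deltas = backward_deltas(zs, weights, grad_y)
    grads_W = [np.outer(d, a_prev)                # dC/dw_ij^l = a_j^{l-1} * delta_i^l
               for d, a_prev in zip(deltas, activations[:-1])]
    grads_b = list(deltas)                        # dC/db_i^l = delta_i^l
    return grads_W, grads_b
```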