Backpropagation for Optimization
Applied Deep Learning
March 10th, 2020 http://adl.miulab.tw
Parameter Optimization
Notation Summary
◉ $a_i^l$ : output of a neuron; $a^l$ : output vector of a layer
◉ $z_i^l$ : input of the activation function; $z^l$ : input vector of the activation function for a layer
◉ $w_{ij}^l$ : a weight; $W^l$ : a weight matrix
◉ $b_i^l$ : a bias; $b^l$ : a bias vector
Layer Output Relation – from a to z
[Figure: layer $l-1$ with $N_{l-1}$ nodes (outputs $a_1^{l-1}, a_2^{l-1}, \dots, a_j^{l-1}, \dots$) feeding layer $l$ with $N_l$ nodes (activation inputs $z_1^l, z_2^l, \dots, z_i^l, \dots$)]

◉ Each activation input in layer $l$ is a weighted sum of the previous layer's outputs plus a bias:
  $z_i^l = \sum_j w_{ij}^l\, a_j^{l-1} + b_i^l$
Layer Output Relation – from z to a
[Figure: within layer $l$, each activation input $z_i^l$ passes through the activation function $\sigma$ to produce the neuron output $a_i^l$]

◉ $a_i^l = \sigma(z_i^l)$, and in vector form $a^l = \sigma(z^l)$ (applied elementwise)
Layer Output Relation
[Figure: layer $l-1$ outputs $a^{l-1}$ feeding layer $l$, which computes $z^l$ and then $a^l$]

◉ $z^l = W^l a^{l-1} + b^l$
◉ $a^l = \sigma(z^l)$
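To make the two relations concrete, here is a minimal NumPy sketch of a single layer. The layer sizes are made up, and the sigmoid is only an assumed choice for the activation function $\sigma$ (the slides do not fix a particular one):

```python
import numpy as np

def sigma(z):
    # Assumed activation: sigmoid, applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: N_{l-1} = 3 nodes in layer l-1, N_l = 2 nodes in layer l
rng = np.random.default_rng(0)
W_l = rng.normal(size=(2, 3))    # weight matrix W^l
b_l = rng.normal(size=2)         # bias vector b^l
a_prev = rng.normal(size=3)      # a^{l-1}: output vector of layer l-1

z_l = W_l @ a_prev + b_l         # z^l = W^l a^{l-1} + b^l
a_l = sigma(z_l)                 # a^l = sigma(z^l)
```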
Neural Network Formulation
◉ Fully connected feedforward network: $f: \mathbb{R}^N \to \mathbb{R}^M$

[Figure: input vector $x = (x_1, x_2, \dots, x_N)$ passes through Layer 1, Layer 2, …, Layer L to produce the output vector $y = (y_1, y_2, \dots, y_M)$]
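A sketch of the full mapping $f: \mathbb{R}^N \to \mathbb{R}^M$ as repeated applications of the per-layer relation above. The layer sizes below are illustrative, not taken from the slides:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Fully connected feedforward network:
    a^0 = x, then z^l = W^l a^{l-1} + b^l and a^l = sigma(z^l) for l = 1..L."""
    a = x
    for W, b in zip(weights, biases):
        a = sigma(W @ a + b)
    return a  # the output vector y

# Illustrative network: N = 3 inputs, one hidden layer of 4 units, M = 2 outputs
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases  = [rng.normal(size=4), rng.normal(size=2)]
y = forward(rng.normal(size=3), weights, biases)
```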
Loss Function for Training
◉ Model: hypothesis function set $\{f_1, f_2, \dots\}$
◉ Training data: $\{(x^1, \hat{y}^1), (x^2, \hat{y}^2), \dots\}$
  ○ $x$: function input; $\hat{y}$: function output
  ○ e.g. $x$ = "It claims too much.", $\hat{y}$ = negative
◉ Training: pick the best function $f^*$
◉ A "good" function keeps the loss small. Define an example loss function as the sum over the error of all training samples:
  $C(\theta) = \sum_r C^r(\theta)$, where $C^r$ is the error on the $r$-th training sample
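A sketch of a loss "summed over the error of all training samples". The squared error used as the per-sample error is an assumption (the slides only say "error"), and forward refers to the network sketch above:

```python
import numpy as np

def sample_error(y, y_hat):
    # Assumed per-sample error: squared Euclidean distance ||y - y_hat||^2
    return float(np.sum((y - y_hat) ** 2))

def total_loss(weights, biases, training_data):
    # C(theta) = sum of the per-sample errors over all training samples
    return sum(sample_error(forward(x, weights, biases), y_hat)
               for x, y_hat in training_data)
```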
Gradient Descent for Neural Network
◉ Computing the gradient involves millions of parameters.
◉ To compute it efficiently, we use backpropagation.

Algorithm (gradient descent)
  Initialization: start at $\theta^0$
  while ($\theta^{(i+1)} \neq \theta^{(i)}$) {
    compute gradient at $\theta^{(i)}$
    update parameters
  }

How can we compute the gradient for this many parameters efficiently?
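The while-loop above can be sketched as follows. The learning rate and the tolerance-based stopping rule are assumptions; the slide only states that the loop runs until the parameters stop changing:

```python
import numpy as np

def gradient_descent(theta0, grad_fn, lr=0.1, max_iters=1000, tol=1e-8):
    """Start at theta^0; repeatedly compute the gradient at theta^(i)
    and update the parameters until they (approximately) stop changing."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        g = grad_fn(theta)                            # compute gradient at theta^(i)
        new_theta = theta - lr * g                    # update parameters
        if np.allclose(new_theta, theta, atol=tol):   # theta^(i+1) == theta^(i)
            break
        theta = new_theta
    return theta

# Toy usage: minimize C(theta) = ||theta||^2, whose gradient is 2 * theta
theta_star = gradient_descent([3.0, -2.0], lambda t: 2 * t)
```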
Backpropagation
Forward v.s. Back Propagation
◉ In a feedforward neural network
  ○ Forward propagation
    ■ Information flows forward through the network, from the input $x$ to the output $y$.
    ■ During training, forward propagation continues onward until it produces a scalar cost $C(\theta)$.
  ○ Back-propagation
    ■ Allows the information from the cost to flow backwards through the network, in order to compute the gradient.
    ■ Can be applied to any function.

[Figure: network from inputs $x_1, x_2, \dots, x_N$ to outputs $y_1, y_2, \dots, y_M$]
Chain Rule
[Figure: a chain of functions $w \xrightarrow{f} x \xrightarrow{f} y \xrightarrow{f} z$; the forward pass propagates values for the cost, the backward pass propagates derivatives for the gradient]

◉ Chain rule along the chain: $\dfrac{dz}{dw} = \dfrac{dz}{dy}\,\dfrac{dy}{dx}\,\dfrac{dx}{dw}$
◉ Forward propagation for cost; back-propagation for gradient.
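A numeric illustration of the chain rule along $w \to x \to y \to z$. The concrete functions are made up for the example:

```python
import math

# Assumed chain (illustrative only): x = w**2, y = sin(x), z = exp(y)
w = 0.5
x = w ** 2
y = math.sin(x)
z = math.exp(y)                      # forward propagation for the value (cost)

# Chain rule, back-propagated for the gradient: dz/dw = dz/dy * dy/dx * dx/dw
dz_dy = math.exp(y)
dy_dx = math.cos(x)
dx_dw = 2 * w
dz_dw = dz_dy * dy_dx * dx_dw

# Finite-difference check that the chained derivative matches a direct estimate
eps = 1e-6
z_shift = math.exp(math.sin((w + eps) ** 2))
assert abs(dz_dw - (z_shift - z) / eps) < 1e-4
```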
[Figures: a neuron $i$ in layer $l$ receives $a_1^{l-1}, a_2^{l-1}, \dots, a_j^{l-1}, \dots$ from layer $l-1$ (or the inputs $x_1^r, x_2^r, \dots, x_j^r$ when $l = 1$), each weighted by $w_{ij}^l$, plus a bias $b_i^l$ attached to a constant input 1]

◉ Since $z_i^l = \sum_j w_{ij}^l\, a_j^{l-1} + b_i^l$, the partial derivatives of $z_i^l$ with respect to the parameters are
  $\dfrac{\partial z_i^l}{\partial w_{ij}^l} = a_j^{l-1}$ (or $x_j^r$ for the first layer) and $\dfrac{\partial z_i^l}{\partial b_i^l} = 1$,
  and both are already available from the forward pass.
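A small numerical check of the forward-pass term above: perturbing $w_{ij}^l$ changes $z_i^l$ at exactly the rate $a_j^{l-1}$. The sizes and indices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
W_l, b_l = rng.normal(size=(2, 3)), rng.normal(size=2)
a_prev = rng.normal(size=3)               # a^{l-1} from the forward pass

i, j, eps = 0, 2, 1e-6
z = W_l @ a_prev + b_l                    # z^l
W_pert = W_l.copy()
W_pert[i, j] += eps                       # perturb w_ij^l
z_pert = W_pert @ a_prev + b_l

# dz_i^l / dw_ij^l  ==  a_j^{l-1}
print((z_pert[i] - z[i]) / eps, a_prev[j])
```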
[Figure: layers $l-1, l, l+1, \dots, L$, with each neuron $i$ of layer $l$ carrying a term $\delta_i^l$; the vectors $\delta^l, \delta^{l+1}, \dots, \delta^{L-1}, \delta^L$ are propagated backwards from the output layer]

◉ $\delta^l$: the propagated gradient corresponding to the $l$-th layer, with components $\delta_i^l = \partial C / \partial z_i^l$
◉ Idea: computing $\delta^l$ layer by layer, from $\delta^L$ down to $\delta^1$, is more efficient
  ○ Initialization: compute $\delta^L$
  ○ Then compute $\delta^l$ based on $\delta^{l+1}$
◉ Initialization: compute $\delta^L$
  ○ $\delta_n^L = \sigma'(z_n^L)\,\dfrac{\partial C}{\partial y_n}$, where $\dfrac{\partial C}{\partial y_n}$ depends on the loss function
  ○ In vector form: $\delta^L = \sigma'(z^L) \odot \nabla_y C$
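A sketch of this initialization for one concrete choice of loss. The squared error $C = \|y - \hat{y}\|^2$ (so $\partial C/\partial y = 2(y - \hat{y})$) and the sigmoid activation are assumptions used only to make the example runnable:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

def delta_L(z_L, y_hat):
    """delta^L = sigma'(z^L) * grad_y C, for the assumed loss C = ||y - y_hat||^2."""
    y = sigma(z_L)                 # network output at the last layer
    grad_y = 2.0 * (y - y_hat)     # dC/dy_n, depends on the loss function
    return sigma_prime(z_L) * grad_y
```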
◉ Compute $\delta^l$ based on $\delta^{l+1}$
  ○ A small change $\Delta z_i^l$ changes the neuron output by $\Delta a_i^l \approx \sigma'(z_i^l)\,\Delta z_i^l$, which in turn perturbs every activation input of layer $l+1$: $\Delta z_k^{l+1} = w_{ki}^{l+1}\,\Delta a_i^l$.

[Figure: neuron $i$ of layer $l$ ($\delta_i^l$) connected to neurons $1, 2, \dots, k, \dots$ of layer $l+1$ ($\delta_1^{l+1}, \delta_2^{l+1}, \dots, \delta_k^{l+1}$); the perturbation $\Delta z_i^l \to \Delta a_i^l \to \Delta z_1^{l+1}, \Delta z_2^{l+1}, \dots, \Delta z_k^{l+1}$]
◉ Summing the contributions through all neurons $k$ of layer $l+1$ gives
  $\delta_i^l = \sigma'(z_i^l) \sum_k w_{ki}^{l+1}\,\delta_k^{l+1}$

[Figure: the terms $\delta_1^{l+1}, \delta_2^{l+1}, \dots, \delta_k^{l+1}$ of layer $l+1$ flowing back into $\delta_i^l$ of layer $l$]
◉ Rethink the propagation
  ○ The backward computation itself looks like a network running in the opposite direction: the $\delta_k^{l+1}$ of layer $l+1$ are the inputs, they are combined through the weights $w_{ki}^{l+1}$, and the result is multiplied by the constant $\sigma'(z_i^l)$ (already fixed by the forward pass) to produce the output $\delta_i^l$.

[Figure: $\delta_1^{l+1}, \delta_2^{l+1}, \dots, \delta_k^{l+1}$ of layer $l+1$ feeding backwards into $\delta_i^l$, with $\sigma'(z_i^l)$ acting as a multiply-by-a-constant node]
[Figure: the backward pass drawn as a network: $\delta_1^{l+1}, \delta_2^{l+1}, \dots, \delta_k^{l+1}$ from layer $l+1$ enter, pass through the weights, are scaled by $\sigma'(z_1^l), \sigma'(z_2^l), \dots, \sigma'(z_i^l)$, and produce $\delta_1^l, \delta_2^l, \dots, \delta_i^l$]
◉ Putting it together, from $L$ down to 1, in matrix form:
  ○ Initialization: $\delta^L = \sigma'(z^L) \odot \nabla_y C$, where $\partial C/\partial y_1, \dots, \partial C/\partial y_n$ depend on the loss function
  ○ Recursion: $\delta^{l} = \sigma'(z^{l}) \odot \left( (W^{l+1})^T \delta^{l+1} \right)$, applied from layer $L-1$ down to layer 1

[Figure: the backward network from the output layer: $\nabla_y C$ enters, is scaled by $\sigma'(z^L)$, multiplied by $(W^L)^T$, scaled by $\sigma'(z^{L-1})$, and so on through $(W^{l+1})^T$ down to $\delta^l$]
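A sketch of this backward recursion in matrix form, reusing sigma_prime from the earlier snippet. The list-indexing convention (lists ordered from layer 1 to L) is an assumption of the sketch:

```python
import numpy as np

def backward_deltas(zs, weights, grad_y):
    """zs[l-1] holds z^l, weights[l-1] holds W^l, grad_y holds dC/dy.
    Returns [delta^1, ..., delta^L], computed from L down to 1:
      delta^L = sigma'(z^L) * grad_y
      delta^l = sigma'(z^l) * ((W^{l+1})^T delta^{l+1})"""
    L = len(zs)
    deltas = [None] * L
    deltas[L - 1] = sigma_prime(zs[L - 1]) * grad_y
    for l in range(L - 2, -1, -1):
        deltas[l] = sigma_prime(zs[l]) * (weights[l + 1].T @ deltas[l + 1])
    return deltas
```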
Backpropagation
Forward Pass
◉ Compute $z^l$ and $a^l$ for every layer, from the input forward, and keep them; this provides the term $\dfrac{\partial z_i^l}{\partial w_{ij}^l} = a_j^{l-1}$ for every weight.

[Figure: activations $a_1^l, a_2^l, \dots, a_i^l$ of layer $l$ feeding $a_1^{l+1}, a_2^{l+1}, \dots, a_k^{l+1}$ of layer $l+1$]
Backpropagation
Backward Pass
◉ Compute $\delta^L$ at the output and propagate it backwards through the reversed network (with $\sigma'(z_i^l)$ as constant multipliers) to obtain $\delta^l$ for every layer; this provides the term $\dfrac{\partial C}{\partial z_i^l} = \delta_i^l$.

[Figure: $\delta_1^{l+1}, \delta_2^{l+1}, \dots, \delta_k^{l+1}$ flowing back through the weights and through $\sigma'(z_1^l), \sigma'(z_2^l), \dots, \sigma'(z_i^l)$ to give $\delta_1^l, \delta_2^l, \dots, \delta_i^l$]
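The forward pass therefore needs to cache $z^l$ and $a^l$ for every layer so that both pre-computed terms are available later. A minimal sketch, continuing the conventions (and the sigma function) of the earlier snippets:

```python
def forward_with_cache(x, weights, biases):
    """Returns (zs, activations) where zs[l-1] = z^l and
    activations[l] = a^l, with activations[0] = a^0 = x."""
    zs, activations = [], [x]
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b        # z^l = W^l a^{l-1} + b^l  (forward-pass term a^{l-1} is cached)
        a = sigma(z)         # a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations
```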
Gradient Descent for Optimization
Algorithm (gradient descent)
  Initialization: start at $\theta^0$
  while ($\theta^{(i+1)} \neq \theta^{(i)}$) {
    compute gradient at $\theta^{(i)}$ (via backpropagation)
    update parameters
  }
Concluding Remarks
[Figure: the weight $w_{ij}^l$ connecting neuron $j$ of layer $l-1$ to neuron $i$ of layer $l$]

◉ The gradient with respect to each weight combines one term from the forward pass and one from the backward pass:
  $\dfrac{\partial C}{\partial w_{ij}^l} = a_j^{l-1}\,\delta_i^l$ (for the first layer, $a_j^0 = x_j$)
◉ Forward pass: compute the activations $a_j^{l-1}$.
◉ Backward pass: compute the propagated gradients $\delta_i^l$.
◉ Compute the gradient based on these two pre-computed terms from the backward and forward passes.
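Putting the two passes together for the concluding formula: each weight gradient is the outer product of the backward term $\delta^l$ and the forward term $a^{l-1}$, and each bias gradient is $\delta^l$ itself. This combines the sketches above (forward_with_cache and backward_deltas), with the squared-error loss still assumed:

```python
import numpy as np

def backprop_gradients(x, y_hat, weights, biases):
    """dC/dW^l = delta^l (a^{l-1})^T and dC/db^l = delta^l,
    built from the cached forward pass and the backward pass."""
    zs, activations = forward_with_cache(x, weights, biases)
    grad_y = 2.0 * (activations[-1] - y_hat)      # dC/dy for the assumed squared error
    deltas = backward_deltas(zs, weights, grad_y)
    grads_W = [np.outer(d, a_prev)                # dC/dw_ij^l = a_j^{l-1} * delta_i^l
               for d, a_prev in zip(deltas, activations[:-1])]
    grads_b = list(deltas)                        # dC/db_i^l = delta_i^l
    return grads_W, grads_b
```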