Machine Learning Techniques (機器學習技巧)
Lecture 11: Neural Network
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering,
National Taiwan University (國立台灣大學資訊工程系)
Agenda
Lecture 11: Neural Network
• Random Forest: Theory and Practice
• Neural Network Motivation
• Neural Network Hypothesis
• Neural Network Training
• Deep Neural Networks
Strength-Correlation Decomposition
decomposition (classification):
lim_{T→∞} E_out(G) ≤ ρ · (1 − s²) / s²
• strength s: average voting margin within G
• correlation ρ: similarity between the g_t
• similar result for regression (bias-variance decomposition)
RF good if the trees are diverse and strong
Practice: How Many Trees Needed?
theory: the more, the ‘better’
• NTU KDDCup 2013 Track 1: predicting author-paper relation
• 1 − E_val of thousands of trees: [0.981, 0.985] depending on seed;
  1 − E_out of top 20 teams: [0.98130, 0.98554]
• decision: take 12000 trees with seed 1
cons of RF: may need lots of trees if the random process is too unstable
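The “the more, the better” behavior can be illustrated with a toy simulation (not from the lecture): majority-voting T independent classifiers, each correct with probability p. All numbers and the helper `noisy_vote` are hypothetical; real random-forest trees are correlated, which is exactly why many more trees may be needed before the vote stabilizes.

```python
import random

random.seed(1)

def noisy_vote(T, p=0.6, n_trials=2000):
    """Fraction of trials in which a majority vote of T independent
    classifiers (each correct with probability p) is correct."""
    wins = 0
    for _ in range(n_trials):
        correct = sum(1 for _ in range(T) if random.random() < p)
        if correct > T / 2:
            wins += 1
    return wins / n_trials

# ensemble accuracy grows with T, then flattens out
for T in (1, 11, 101, 1001):
    print(T, noisy_vote(T))
```

With correlated voters the curve flattens more slowly, matching the “lots of trees if the random process is too unstable” caveat.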
Disclaimer
Many parts of this lecture borrow from
Prof. Yaser S. Abu-Mostafa’s slides with permission.
Learning From Data
Yaser S. Abu-Mostafa
California Institute of Technology
Lecture 10: Neural Networks
Sponsored by Caltech’s Provost Office, E&AS Division, and IST
Thursday, May 3, 2012
Biological Inspiration
biological function −→ biological structure
[figure: a biological neuron alongside its artificial counterpart]
Learning From Data - Lecture 10, 8/21
Combining Perceptrons
[figure: perceptrons h1 and h2, each splitting the (x1, x2) plane into + and − regions]
OR(x1, x2) = sign(1.5 + x1 + x2)
AND(x1, x2) = sign(−1.5 + x1 + x2)
(weights on (1, x1, x2); inputs in {−1, +1})
Learning From Data - Lecture 10, 9/21
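The OR and AND constructions can be checked with a small sketch, assuming ±1-valued inputs and the sign activation as on the slide; the helper `perceptron` is illustrative:

```python
import numpy as np

def perceptron(w, x):
    """sign(w · (1, x1, x2)), with the constant 1 prepended for the bias."""
    return int(np.sign(w @ np.concatenate(([1.0], x))))

OR_w  = np.array([ 1.5, 1.0, 1.0])   # sign(1.5 + x1 + x2)
AND_w = np.array([-1.5, 1.0, 1.0])   # sign(-1.5 + x1 + x2)

# enumerate all four ±1 input combinations
for x1 in (-1.0, 1.0):
    for x2 in (-1.0, 1.0):
        x = np.array([x1, x2])
        print(x1, x2, perceptron(OR_w, x), perceptron(AND_w, x))
```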
Creating Layers
f = OR( AND(h1, h̄2), AND(h̄1, h2) )
[figure: two-layer construction; each AND uses bias −1.5 with weights ±1 on h1, h2, and the top OR uses bias 1.5 with weights 1, 1]
Learning From Data - Lecture 10, 10/21
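A minimal sketch of the layered construction, assuming the slide’s weights (bias −1.5 with weights ±1 for the two ANDs, bias 1.5 with weights 1, 1 for the OR); the resulting f is XOR of h1 and h2, which no single perceptron can express:

```python
def sign(s):
    return 1.0 if s > 0 else -1.0

def layered_f(h1, h2):
    """f = OR( AND(h1, not-h2), AND(not-h1, h2) ), wired with the slide's weights."""
    a1 = sign(-1.5 + 1.0 * h1 - 1.0 * h2)   # AND(h1, h̄2)
    a2 = sign(-1.5 - 1.0 * h1 + 1.0 * h2)   # AND(h̄1, h2)
    return sign(1.5 + a1 + a2)              # OR of the two ANDs

for h1 in (-1.0, 1.0):
    for h2 in (-1.0, 1.0):
        print(h1, h2, layered_f(h1, h2))   # +1 exactly when h1 != h2
```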
The Multilayer Perceptron
[figure: the layered network redrawn as a feedforward graph on inputs 1, x1, x2, with first-layer units computing w1 · x and w2 · x]
3 layers, feedforward
Learning From Data - Lecture 10, 11/21
A Powerful Model
[figure: a target region of + and − points, approximated with 8 perceptrons and with 16 perceptrons]
2 red flags: for generalization and for optimization
Learning From Data - Lecture 10, 12/21
Fun Time
The Neural Network
[figure: feedforward network on inputs 1, x_1, x_2, …, x_d; each unit applies θ to its incoming signal s, and the final output is h(x)]
• input: x
• hidden layers: 1 ≤ l < L
• output layer: l = L
Learning From Data - Lecture 10, 13/21
How the Network Operates
[plot: the tanh transition function, sitting between the linear function and the hard threshold and saturating at ±1]

θ(s) = tanh(s) = (e^s − e^{−s}) / (e^s + e^{−s})

w_{ij}^{(l)}:  1 ≤ l ≤ L layers,  0 ≤ i ≤ d^{(l−1)} inputs,  1 ≤ j ≤ d^{(l)} outputs

x_j^{(l)} = θ(s_j^{(l)}) = θ( Σ_{i=0}^{d^{(l−1)}} w_{ij}^{(l)} x_i^{(l−1)} )

Apply x to x_1^{(0)} · · · x_{d^{(0)}}^{(0)} → · · · → x_1^{(L)} = h(x)
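The forward computation x_j^{(l)} = θ(Σ_i w_{ij}^{(l)} x_i^{(l−1)}) can be sketched as follows. The layer sizes and weights are hypothetical; row 0 of each weight matrix plays the role of the bias term x_0 = 1:

```python
import numpy as np

def forward(weights, x):
    """Forward pass: weights[l] has shape (d^(l-1)+1, d^(l)); row 0 is the bias.
    Each layer computes x^(l) = tanh(W^T x^(l-1)) with a constant 1 prepended."""
    a = np.asarray(x, dtype=float)
    for W in weights:
        a = np.tanh(np.concatenate(([1.0], a)) @ W)
    return a

# hypothetical 2-3-1 network with small random weights
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 3)) * 0.5,   # layer 1: 2 inputs (+bias) -> 3 units
           rng.standard_normal((4, 1)) * 0.5]   # layer 2: 3 units (+bias) -> 1 output
h = forward(weights, [0.2, -0.4])
print(h)   # x_1^(L): a single value in (-1, 1), since tanh saturates at ±1
```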
Applying SGD
All the weights w = { w_{ij}^{(l)} } determine h(x)
Error on example (x_n, y_n) is e( h(x_n), y_n ) = e(w)
To implement SGD, we need the gradient
∇e(w):  ∂e(w) / ∂w_{ij}^{(l)}  for all i, j, l
Learning From Data - Lecture 10, 16/21
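Before the efficient trick of the next slide, the “numerically, one by one” option is easy to sketch with central differences. The single-neuron error function below is an illustrative stand-in for the network’s e(w), not the lecture’s full model:

```python
import numpy as np

def numeric_grad(e, w, eps=1e-6):
    """Estimate each partial derivative de/dw_k by a central difference --
    the 'one by one, numerically' option; backprop gets the same answer
    far more cheaply."""
    w = np.asarray(w, dtype=float)
    g = np.zeros_like(w)
    for k in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[k] += eps
        wm[k] -= eps
        g[k] = (e(wp) - e(wm)) / (2 * eps)
    return g

# hypothetical example: squared error of a single tanh unit on (x_n, y_n);
# the first entry of x_n is the constant 1 for the bias
x_n, y_n = np.array([1.0, 0.5, -0.3]), 0.8
e = lambda w: (np.tanh(w @ x_n) - y_n) ** 2
print(numeric_grad(e, [0.1, 0.2, 0.3]))
```

Each gradient entry costs two evaluations of e(w), so a network with many weights makes this per-weight loop expensive; that cost is what motivates the chain-rule trick.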
Computing ∂e(w) / ∂w_{ij}^{(l)}
[figure: the weight w_{ij}^{(l)} carries x_i^{(l−1)} into the signal s_j^{(l)}, which θ turns into x_j^{(l)}]
We can evaluate ∂e(w) / ∂w_{ij}^{(l)} one by one: analytically or numerically.
A trick for efficient computation:

∂e(w) / ∂w_{ij}^{(l)} = ( ∂e(w) / ∂s_j^{(l)} ) × ( ∂s_j^{(l)} / ∂w_{ij}^{(l)} )

We have ∂s_j^{(l)} / ∂w_{ij}^{(l)} = x_i^{(l−1)}

We only need: ∂e(w) / ∂s_j^{(l)} = δ_j^{(l)}
Learning From Data - Lecture 10, 17/21
δ for the Final Layer
δ_j^{(l)} = ∂e(w) / ∂s_j^{(l)}
For the final layer, l = L and j = 1:

δ_1^{(L)} = ∂e(w) / ∂s_1^{(L)}

e(w) = ( x_1^{(L)} − y_n )²
x_1^{(L)} = θ( s_1^{(L)} )
θ′(s) = 1 − θ²(s) for the tanh
Learning From Data - Lecture 10, 18/21
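Putting the three facts together, the chain rule gives δ_1^{(L)} = 2 ( x_1^{(L)} − y_n ) ( 1 − ( x_1^{(L)} )² ). A quick numeric sanity check, with hypothetical values of the signal and target:

```python
import numpy as np

def delta_L(s, y):
    """δ_1^(L) = ∂e/∂s_1^(L) for e = (θ(s) - y)^2 with θ = tanh:
    chain rule gives 2 (θ(s) - y) θ'(s), and θ'(s) = 1 - θ(s)^2."""
    x = np.tanh(s)                       # x_1^(L) = θ(s_1^(L))
    return 2.0 * (x - y) * (1.0 - x ** 2)

s, y = 0.7, 1.0                          # hypothetical signal and target
print(delta_L(s, y))

# compare against a central difference of e(s) = (tanh(s) - y)^2
eps = 1e-6
fd = ((np.tanh(s + eps) - y) ** 2 - (np.tanh(s - eps) - y) ** 2) / (2 * eps)
print(fd)   # closely matches delta_L(s, y)
```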