Slide credit: Hung-Yi Lee & Richard Socher
Review
Recurrent Neural Network
Recurrent Neural Network
Idea: condition the neural network on all previous words and tie the weights at each time step
Assumption: temporal information matters
RNN Language Modeling
[Figure: RNN LM unrolled over "wreck a nice beach": the vector of "START" yields P(next word is "wreck"), the vector of "wreck" yields P(next word is "a"), and so on through "nice" and "beach". Each column goes input (word vector) → hidden (context vector) → output (word probability distribution).]
Idea: pass the information from the previous hidden layer to leverage all contexts
RNNLM Formulation
At each time step,
s_t = f(U x_t + W s_{t-1})
ŷ_t = softmax(V s_t)
where x_t is the vector of the current word and ŷ_t is the probability distribution of the next word.
Recurrent Neural Network Definition
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
s_t = f(U x_t + W s_{t-1}), o_t = softmax(V s_t), where f: tanh, ReLU
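To make the recurrence concrete, here is a minimal NumPy sketch of one time step, following the U/W/V notation of the cited tutorial. The dimensions, the initialization scale, and the name `rnn_step` are illustrative choices, not part of the slides.

```python
import numpy as np

# One RNN time step: s_t = f(U x_t + W s_{t-1}), o_t = softmax(V s_t).
# hidden_size and vocab_size are illustrative; x_t is a one-hot word vector.
hidden_size, vocab_size = 64, 1000
U = np.random.randn(hidden_size, vocab_size) * 0.01   # input -> hidden
W = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden (recurrent)
V = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, s_prev, f=np.tanh):
    """One time step: new hidden state and next-word distribution."""
    s_t = f(U @ x_t + W @ s_prev)   # f: tanh or ReLU
    o_t = softmax(V @ s_t)          # probability of the next word
    return s_t, o_t
```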
Model Training
All model parameters can be updated by gradient descent on the training cost C
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
[Figure: at each time step the predicted output o_t is compared with the target y_t (…, y_{t-1}, y_t, y_{t+1}, …); the per-step costs are summed into the total cost C.]
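A hedged sketch of that cost, assuming a cross-entropy loss per time step and reusing `rnn_step`, `U`, `W`, `V`, and `hidden_size` from the sketch above; the function name and interface are illustrative.

```python
def sequence_loss(xs, ys, f=np.tanh):
    """Cross-entropy summed over time steps: C = sum_t C^(t).
    xs: list of one-hot input word vectors, ys: list of target word indices.
    Reuses rnn_step(), U, W, V from the earlier sketch (illustrative only)."""
    s = np.zeros(hidden_size)            # the "init" state
    C = 0.0
    for x_t, y_t in zip(xs, ys):
        s, o_t = rnn_step(x_t, s, f)
        C += -np.log(o_t[y_t] + 1e-12)   # per-step cost C^(t)
    return C
```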
Outline
Language Modeling
◦ N-gram Language Model
◦ Feed-Forward Neural Language Model
◦ Recurrent Neural Network Language Model (RNNLM)
Recurrent Neural Network
◦ Definition
◦ Training via Backpropagation through Time (BPTT)
◦ Training Issue
Applications
◦ Sequential Input
◦ Sequential Output
◦ Aligned Sequential Pairs (Tagging)
◦ Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)
Backpropagation
[Figure: a feed-forward network with neuron j in layer l−1 connected to neuron i in layer l by weight w_ij. Forward pass: compute the activations a layer by layer from the input x. Backward pass: propagate the error signal from the output back toward the input; the gradient of the cost w.r.t. w_ij combines the forward activation and the backward error signal.]
Backpropagation
[Figure: the backward pass viewed as its own "reversed" network. At the output layer L, the error signal is δ^L = σ'(z^L) ⊙ ∂C/∂y; it is then propagated backward layer by layer through the transposed weight matrices: δ^l = σ'(z^l) ⊙ (W^{l+1})^T δ^{l+1}.]
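As a concrete illustration of this backward pass, here is a short NumPy sketch of the error-signal recursion for a fully connected network; the sigmoid activation and the argument layout (`Ws`, `zs`) are assumptions for the example, not part of the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward_pass(Ws, zs, dC_dy):
    """Propagate the error signal delta backward through the layers.
    Ws: weight matrices W^1..W^L, zs: pre-activations z^1..z^L
    (both saved during the forward pass), dC_dy: dC/dy at the output.
    Returns delta^1..delta^L; sigmoid is an illustrative choice for sigma."""
    sigma_prime = lambda z: sigmoid(z) * (1 - sigmoid(z))
    L = len(Ws)
    deltas = [None] * L
    deltas[L - 1] = sigma_prime(zs[L - 1]) * dC_dy                    # delta^L
    for l in range(L - 2, -1, -1):                                    # layers L-1, ..., 1
        deltas[l] = sigma_prime(zs[l]) * (Ws[l + 1].T @ deltas[l + 1])
    return deltas
```

The weight gradients then follow as ∂C/∂w_ij^l = a_j^{l−1} δ_i^l, combining the forward activations with these error signals.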
Backpropagation through Time (BPTT)
Unfold
◦ Input: init, x_1, x_2, …, x_t
◦ Output: o_t
◦ Target: y_t
[Figure: the recurrence is unfolded through time, init → s_1 → … → s_{t-1} → s_t, with inputs x_1, …, x_t feeding each step and the final output o_t compared against the target y_t.]
Backpropagation through Time (BPTT)
Unfold
◦ Input: init, x_1, x_2, …, x_t
◦ Output: o_t
◦ Target: y_t
[Figure: the error signal from the cost at o_t is propagated backward through the unfolded network, from the output layer down through s_t, s_{t-1}, …, s_1, exactly as in ordinary backpropagation.]
Weights are tied together: every unfolded copy points to the same memory, i.e. the same U, W, V, so the gradient contributions from all time steps are accumulated into the same matrices.
BPTT
Forward Pass: compute s_1, s_2, s_3, s_4, …
Backward Pass: for each per-step cost C^(1), C^(2), C^(3), C^(4), propagate its error back through the unfolded network to step 1.
[Figure: unfolded network with inputs x_1, …, x_4, states s_1, …, s_4, outputs o_1, …, o_4, targets y_1, …, y_4, and per-step costs C^(1), …, C^(4).]
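A sketch of this forward/backward schedule for the toy RNN defined earlier (tanh hidden units and a softmax cross-entropy cost per step are assumed). It follows the slide's per-cost decomposition literally, backpropagating each C^(t) all the way to step 1 and accumulating gradients into the tied U, W, V.

```python
def bptt_grads(xs, ys):
    """Forward pass: compute s_1, s_2, ...; backward pass: for each per-step
    cost C^(t), propagate its error back to step 1. Gradients accumulate into
    the *same* dU, dW, dV because the weights are tied across time.
    Reuses U, W, V, hidden_size, rnn_step from the earlier sketch."""
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    ss, os = [np.zeros(hidden_size)], []
    for x_t in xs:                                   # forward pass
        s_t, o_t = rnn_step(x_t, ss[-1])
        ss.append(s_t); os.append(o_t)
    for t in range(len(xs)):                         # backward pass, one C^(t) at a time
        do = os[t].copy(); do[ys[t]] -= 1.0          # dC^(t)/d(V s_t) for softmax + cross-entropy
        dV += np.outer(do, ss[t + 1])
        ds = V.T @ do                                # error flowing into s_t
        for k in range(t, -1, -1):                   # back through time to step 1
            dz = ds * (1.0 - ss[k + 1] ** 2)         # tanh derivative
            dU += np.outer(dz, xs[k])
            dW += np.outer(dz, ss[k])
            ds = W.T @ dz                            # pass the error to s_{k-1}
    return dU, dW, dV
```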
RNN Training Issue
The gradient is a product of Jacobian matrices, each associated with a step in the forward computation
Multiply the same matrix at each time step during backprop
The gradient becomes very small or very large quickly
vanishing or exploding gradient
Bengio et al., “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. of Neural Networks, 1994. [link]
Pascanu et al., “On the difficulty of training recurrent neural networks,” in ICML, 2013. [link]
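A small numeric illustration (not from the slides) of why this repeated product matters: scaling the recurrent matrix slightly below or above 1 makes the backpropagated signal shrink or grow geometrically over 50 steps.

```python
import numpy as np

# Repeatedly multiplying by the same matrix, as in the product of Jacobians,
# shrinks or blows up the error signal depending on the matrix's spectrum.
np.random.seed(0)
delta = np.random.randn(64)
for scale, label in [(0.9, "vanishing"), (1.1, "exploding")]:
    W_rec = scale * np.eye(64) + 0.01 * np.random.randn(64, 64)
    d = delta.copy()
    for _ in range(50):                    # 50 time steps of backprop
        d = W_rec.T @ d
    print(label, np.linalg.norm(d))        # the two norms differ by several orders of magnitude
```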
Rough Error Surface
The error surface is either very flat or very steep.
[Figure: cost surface over w_1 and w_2 with flat plateaus next to steep cliffs.]
Bengio et al., “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. of Neural Networks, 1994. [link]
Pascanu et al., “On the difficulty of training recurrent neural networks,” in ICML, 2013. [link]
Possible Solutions
Recurrent Neural Network
Exploding Gradient: Clipping
Idea: control the gradient value to avoid exploding
[Figure: on the cost surface over w_1 and w_2, the clipped gradient takes a bounded step instead of a huge jump at the cliff.]
Parameter setting: clipping thresholds from half to ten times the average gradient norm can still yield convergence
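A minimal sketch of clipping by the global gradient norm, one common way to realize this idea; the function name and interface are illustrative, and the comment restates the slide's parameter-setting note.

```python
import numpy as np

def clip_gradient(grads, threshold):
    """Clip by global norm: if ||g|| exceeds the threshold, rescale g so its
    norm equals the threshold (the direction is preserved).
    Heuristic from the slide: pick the threshold between roughly 0.5x and 10x
    the average gradient norm observed during training."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads
```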
Vanishing Gradient: Initialization + ReLU
IRNN
◦ initialize the recurrent weight matrix W as the identity matrix I
◦ use ReLU for activation functions
Le et al., “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units,” arXiv, 2015. [link]
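A minimal sketch of this recipe under the toy dimensions introduced earlier (it reuses `hidden_size` and `vocab_size`); variable names are illustrative, not from the paper.

```python
# IRNN-style setup: identity-initialized recurrent weights + ReLU hidden units.
W_irnn = np.eye(hidden_size)                               # recurrent weights = I
U_irnn = np.random.randn(hidden_size, vocab_size) * 0.01   # input weights, small random
relu = lambda z: np.maximum(0.0, z)

def irnn_step(x_t, s_prev):
    """One recurrent step with ReLU activation and identity-initialized W."""
    return relu(U_irnn @ x_t + W_irnn @ s_prev)
```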
Vanishing Gradient: Gating Mechanism
RNN models temporal sequence information
◦ can handle “long-term dependencies” in theory
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Issue: RNN cannot handle such “long-term dependencies” in practice due to vanishing gradient
Solution: apply a gating mechanism to directly encode the long-distance information
Example: “I grew up in France… I speak fluent French.”
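A deliberately simplified single-gate sketch (not the full LSTM described in the linked post): a learned gate interpolates between keeping the old memory and writing new content, which is how long-distance information such as "France" can survive many steps largely unchanged. It reuses `hidden_size` and `vocab_size` from the earlier sketch; the matrices and the function name are illustrative.

```python
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
Wg = np.random.randn(hidden_size, hidden_size + vocab_size) * 0.01  # gate weights
Wc = np.random.randn(hidden_size, hidden_size + vocab_size) * 0.01  # candidate weights

def gated_step(x_t, c_prev):
    """Gate in [0, 1] decides how much old memory to keep per dimension."""
    v = np.concatenate([c_prev, x_t])
    gate = sigmoid(Wg @ v)              # 1 = keep old memory, 0 = overwrite
    candidate = np.tanh(Wc @ v)         # proposed new content
    return gate * c_prev + (1.0 - gate) * candidate
```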
Extension
Recurrent Neural Network
Bidirectional RNN
h_t = [→h_t ; ←h_t]: the concatenation of the forward and backward hidden states represents (summarizes) the past and future around a single token
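A hedged sketch of this idea: run one RNN left-to-right, another right-to-left, and concatenate their states at each position. It reuses `hidden_size` and step functions shaped like `rnn_step` from the earlier sketch; all names are illustrative.

```python
def birnn(xs, step_fwd, step_bwd):
    """Returns h_t = [h_fwd_t ; h_bwd_t] for every position t.
    step_fwd / step_bwd are step functions like rnn_step (state, output)."""
    s_f = s_b = np.zeros(hidden_size)
    fwd, bwd = [], []
    for x_t in xs:                      # left-to-right: summarizes the past
        s_f, _ = step_fwd(x_t, s_f)
        fwd.append(s_f)
    for x_t in reversed(xs):            # right-to-left: summarizes the future
        s_b, _ = step_bwd(x_t, s_b)
        bwd.append(s_b)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```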
Deep Bidirectional RNN
Each memory layer passes an intermediate representation to the next
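Stacking follows directly from the sketch above, assuming each layer's step functions accept the previous layer's concatenated states as their inputs; a minimal sketch:

```python
def deep_birnn(xs, layers):
    """Each bidirectional layer's concatenated states form the input sequence
    of the next layer. `layers` is a list of (step_fwd, step_bwd) pairs whose
    input sizes must match the previous layer's output size (illustrative)."""
    seq = xs
    for step_fwd, step_bwd in layers:
        seq = birnn(seq, step_fwd, step_bwd)   # intermediate representation
    return seq                                 # top layer feeds the output/prediction
```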
Concluding Remarks
Recurrent Neural Networks
◦ Definition
◦ Issue: Vanishing/Exploding Gradient
◦ Solution:
• Exploding Gradient: Clipping
• Vanishing Gradient: Initialization, ReLU, Gated RNNs
Extension
◦ Bidirectional
◦ Deep RNN