Slide credit from Hung-Yi Lee & Richard Socher
Review
Word Vector
Word2Vec Variants
Skip-gram: predicting the surrounding words given the target word (Mikolov+, 2013)
CBOW (continuous bag-of-words): predicting the target word given the surrounding words (Mikolov+, 2013)
LM (language modeling): predicting the next word given the preceding context (Mikolov+, 2013)
Mikolov et al., “Efficient estimation of word representations in vector space,” in ICLR Workshop, 2013.
Mikolov et al., “Linguistic regularities in continuous space word representations,” in NAACL HLT, 2013.
Word2Vec LM
Goal: predicting the next word given the preceding context
Outline
Language Modeling
◦N-gram Language Model
◦Feed-Forward Neural Language Model
◦Recurrent Neural Network Language Model (RNNLM)
Recurrent Neural Network
◦Definition
◦Training via Backpropagation through Time (BPTT)
◦Training Issue
Applications
◦Sequential Input
◦Sequential Output
◦ Aligned Sequential Pairs (Tagging)
◦ Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)
Language Modeling
Goal: estimate the probability of a word sequence
Example task: determine whether a sequence is grammatical or which candidate makes more sense
Candidates: “recognize speech” or “wreck a nice beach”
Output = “recognize speech” if P(recognize speech) > P(wreck a nice beach)
N-Gram Language Modeling
Goal: estimate the probability of a word sequence
N-gram language model
◦Probability is conditioned on a window of (n-1) previous words
◦Estimate the probability based on the training data
P(beach | nice) = C(nice beach) / C(nice)
C(nice beach): count of “nice beach” in the training data
C(nice): count of “nice” in the training data
Issue: some sequences may not appear in the training data
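The count-based estimate can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and a hypothetical toy corpus; the function name `train_bigram_mle` is not from the slides.

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Build P(word | prev) = C(prev word) / C(prev) from raw counts."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    def prob(word, prev):
        if unigram_counts[prev] == 0:
            return 0.0  # context never seen: no estimate available
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    return prob

# Hypothetical toy corpus (an assumption for illustration only).
corpus = ["it is a nice beach", "what a nice day", "a nice beach walk"]
p = train_bigram_mle(corpus)
p_beach = p("beach", "nice")  # C(nice beach) / C(nice) = 2 / 3
p_car = p("car", "nice")      # bigram never seen in the corpus
```

Here `p_car` comes out as exactly zero, which is the issue stated above: any sequence missing from the training data gets probability 0.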
N-Gram Language Modeling
Training data:
◦The dog ran ……
◦The cat jumped ……
P( jumped | dog ) = 0, P( ran | cat ) = 0
Smoothing: give such unseen sequences some small probability (e.g., 0.0001 each) instead of 0
The smoothed probability is still not accurate.
This happens because we cannot collect all the possible text in the world as training data.
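One common way to assign a small probability to unseen bigrams is add-k smoothing. The slides do not say which smoothing method is meant, so the sketch below is only one possible instance; the value of `k` and the function name are assumptions.

```python
from collections import Counter

def train_add_k_bigram(sentences, k=0.01):
    """Add-k smoothed bigram model: (C(prev word) + k) / (C(prev as context) + k * |V|)."""
    context_counts = Counter()  # how often a word is followed by another word
    bigram_counts = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = sentence.lower().split()
        vocab.update(tokens)
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    vocab_size = len(vocab)

    def prob(word, prev):
        return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * vocab_size)

    return prob, sorted(vocab)

corpus = ["the dog ran", "the cat jumped"]
p, vocab = train_add_k_bigram(corpus)
p_unseen = p("jumped", "dog")  # small but non-zero
p_seen = p("ran", "dog")       # still much larger than the unseen case
total = sum(p(w, "dog") for w in vocab)  # the distribution still sums to 1
```

The unseen bigram now gets a non-zero probability, but that value is set by `k` rather than by any evidence about dogs jumping, which is why the estimate is still not accurate.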
Neural Language Modeling
Idea: estimate the probability not from counts, but from the NN prediction
◦vector of “START” → Neural Network → P(next word is “wreck”)
◦vector of “wreck” → Neural Network → P(next word is “a”)
◦vector of “a” → Neural Network → P(next word is “nice”)
◦vector of “nice” → Neural Network → P(next word is “beach”)
P(“wreck a nice beach”) = P(wreck|START)P(a|wreck)P(nice|a)P(beach|nice)
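The product above can be computed for any model that returns P(next word | previous word). The sketch below sums log-probabilities to avoid underflow on long sequences; the lookup table stands in for the network, and its numbers are hypothetical.

```python
import math

def sentence_log_prob(words, next_word_prob, start="START"):
    """Sum log P(w_i | w_{i-1}) over the sentence, starting from the START token."""
    log_p = 0.0
    prev = start
    for word in words:
        log_p += math.log(next_word_prob(word, prev))
        prev = word
    return log_p

# Hypothetical conditional probabilities standing in for the NN predictions.
table = {
    ("START", "wreck"): 0.01, ("wreck", "a"): 0.2,
    ("a", "nice"): 0.1, ("nice", "beach"): 0.05,
    ("START", "recognize"): 0.02, ("recognize", "speech"): 0.3,
}

def toy_model(word, prev):
    return table[(prev, word)]

lp_beach = sentence_log_prob("wreck a nice beach".split(), toy_model)
lp_speech = sentence_log_prob("recognize speech".split(), toy_model)
best = "recognize speech" if lp_speech > lp_beach else "wreck a nice beach"
```

With these made-up numbers the comparison selects “recognize speech”, mirroring the example task from the Language Modeling slide.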
Neural Language Modeling
Bengio et al., “A Neural Probabilistic Language Model,” in JMLR, 2003.
[Figure: network with input, hidden, and output layers; labels: context vector, probability distribution of the next word]
Issue: fixed context window for conditioning
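A forward pass of a feed-forward neural language model might look like the sketch below. It assumes the context vector is the concatenation of the embeddings of a fixed number of previous words, omits bias terms, and uses random untrained weights; all sizes and names are illustrative, not taken from Bengio et al.

```python
import math
import random

def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

vocab = ["START", "wreck", "a", "nice", "beach"]
word_to_id = {w: i for i, w in enumerate(vocab)}

rng = random.Random(0)
EMBED, HIDDEN, CONTEXT = 4, 8, 2  # assumed sizes; CONTEXT is the fixed window
E = rand_matrix(len(vocab), EMBED, rng)         # one embedding row per word
W_h = rand_matrix(HIDDEN, EMBED * CONTEXT, rng)  # context vector -> hidden
W_o = rand_matrix(len(vocab), HIDDEN, rng)       # hidden -> vocabulary scores

def next_word_distribution(context_words):
    """Return P(next word | exactly CONTEXT previous words)."""
    assert len(context_words) == CONTEXT
    context_vector = []
    for word in context_words:
        context_vector.extend(E[word_to_id[word]])
    hidden = [math.tanh(a) for a in matvec(W_h, context_vector)]
    return softmax(matvec(W_o, hidden))

dist = next_word_distribution(["a", "nice"])
```

The `assert` on the context length makes the stated issue concrete: the model can only condition on a window of fixed size.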
Neural Language Modeling
The input-layer (or hidden-layer) representations of related words are close
◦If P(jump|dog) is large, P(jump|cat) increases accordingly (even if “… cat jump …” does not appear in the data)
[Figure: dog, cat, and rabbit plotted in the space of hidden units h1 and h2]
Smoothing is automatically done
Recurrent Neural Network
Idea: condition the neural network on all previous words and tie the weights at each time step
Assumption: temporal information matters
RNN Language Modeling
[Figure: RNN unrolled over time with input, hidden, and output layers; labels: context vector, word prob dist]
◦vector of “START” → P(next word is “wreck”)
◦vector of “wreck” → P(next word is “a”)
◦vector of “a” → P(next word is “nice”)
◦vector of “nice” → P(next word is “beach”)
Idea: pass the information from the previous hidden layer to leverage all contexts
RNNLM Formulation
At each time step,
◦Input: vector of the current word
◦Output: probability of the next word
Recurrent Neural Network Definition
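http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
Activation function: tanh, ReLU
Model Training
All model parameters can be updated by gradient descent on the cost between the predicted outputs and the targets
[Figure: predicted outputs compared with the targets y_{t-1}, y_t, y_{t+1}]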
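A forward pass of a simple RNN can be sketched as follows, assuming the common formulation s_t = tanh(U x_t + W s_{t-1}) and o_t = softmax(V s_t). The matrix names U, W, V, the one-hot inputs, and the sizes are assumptions for illustration; the weights are random and untrained.

```python
import math
import random

def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def rnn_forward(inputs, U, W, V, s_init):
    """Apply the same U, W, V at every time step and collect states and outputs."""
    states, outputs = [], []
    s_prev = s_init
    for x_t in inputs:
        pre = [a + b for a, b in zip(matvec(U, x_t), matvec(W, s_prev))]
        s_t = [math.tanh(p) for p in pre]
        states.append(s_t)
        outputs.append(softmax(matvec(V, s_t)))
        s_prev = s_t
    return states, outputs

vocab = ["START", "wreck", "a", "nice", "beach"]

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

rng = random.Random(0)
HIDDEN = 3  # assumed hidden size
U = rand_matrix(HIDDEN, len(vocab), rng)
W = rand_matrix(HIDDEN, HIDDEN, rng)
V = rand_matrix(len(vocab), HIDDEN, rng)

inputs = [one_hot(w) for w in ["START", "wreck", "a", "nice"]]
states, outputs = rnn_forward(inputs, U, W, V, [0.0] * HIDDEN)
```

Each entry of `outputs` is a distribution over the next word, and each state depends on every earlier input through `s_prev`.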
Backpropagation
[Figure: weight w_ij^l connects neuron j in layer l-1 to neuron i in layer l]
◦Forward pass: compute the input to each weight (x_j for l = 1, a_j^{l-1} for l > 1)
◦Backward pass: compute the error signal δ_i^l
[Figure: the error signal starts from ∇_y C at layer L and is propagated backward through (W^L)^T, …, (W^{l+1})^T to give δ^l at layer l]
Backpropagation through Time (BPTT)
Unfold
◦Input: init, x_1, x_2, …, x_t
◦Output: o_t
◦Target: y_t
[Figure: the RNN unfolded through time, from init through states s_1, …, s_{t-2}, s_{t-1}, s_t with inputs x_1, …, x_{t-2}, x_{t-1}, x_t; the output o_t is compared with the target y_t to give the cost C, and the error signal is propagated backward through the unfolded layers]
Weights are tied together: the unfolded copies of a weight are pointers to the same memory
BPTT
Forward Pass: Compute s_1, s_2, s_3, s_4, ……
Backward Pass: for each of C^(1), C^(2), C^(3), C^(4)
[Figure: unfolded RNN starting from init with inputs x_1 to x_4, states s_1 to s_4, outputs o_1 to o_4, targets y_1 to y_4, and per-step costs C^(1) to C^(4)]
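BPTT can be checked on a tiny example. The sketch below uses a scalar RNN with a linear output and squared-error cost, s_t = tanh(w s_{t-1} + u x_t), o_t = v s_t, C = Σ_t ½(o_t − y_t)²; this simplified model and its numbers are assumptions chosen so the analytic gradient can be compared against finite differences.

```python
import math

def forward(params, xs, s_init=0.0):
    w, u, v = params
    states = [s_init]  # states[t] holds s_t, with s_0 = init
    outputs = []
    for x in xs:
        states.append(math.tanh(w * states[-1] + u * x))
        outputs.append(v * states[-1])
    return states, outputs

def cost(params, xs, ys):
    _, outputs = forward(params, xs)
    return sum(0.5 * (o - y) ** 2 for o, y in zip(outputs, ys))

def bptt(params, xs, ys):
    """Accumulate dC/dw, dC/du, dC/dv over all time steps (tied weights)."""
    w, u, v = params
    states, outputs = forward(params, xs)
    grad_w = grad_u = grad_v = 0.0
    grad_from_future = 0.0  # dC/ds_t arriving from later time steps
    for t in reversed(range(len(xs))):
        s_t, s_prev = states[t + 1], states[t]
        d_out = outputs[t] - ys[t]
        grad_v += d_out * s_t
        d_state = d_out * v + grad_from_future
        d_pre = d_state * (1.0 - s_t ** 2)  # derivative of tanh
        grad_w += d_pre * s_prev
        grad_u += d_pre * xs[t]
        grad_from_future = d_pre * w
    return grad_w, grad_u, grad_v

def numerical_grad(params, xs, ys, eps=1e-6):
    grads = []
    for i in range(len(params)):
        plus, minus = list(params), list(params)
        plus[i] += eps
        minus[i] -= eps
        grads.append((cost(plus, xs, ys) - cost(minus, xs, ys)) / (2 * eps))
    return tuple(grads)

params = (0.5, -0.3, 0.8)   # hypothetical w, u, v
xs = [1.0, 0.5, -1.0, 2.0]  # four time steps, as in the slide
ys = [0.1, -0.2, 0.3, 0.0]
analytic = bptt(params, xs, ys)
numeric = numerical_grad(params, xs, ys)
```

Because the weights are tied, each gradient is a sum of contributions from all four time steps, and `analytic` should agree with `numeric` up to finite-difference error.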
RNN Training Issue
The gradient is a product of Jacobian matrices, each associated with a step in the forward computation
Multiply the same matrix at each time step during backprop
The gradient becomes very small or very large quickly → vanishing or exploding gradient
Bengio et al., “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. on Neural Networks, 1994.
Pascanu et al., “On the difficulty of training recurrent neural networks,” in ICML, 2013.
Rough Error Surface
[Figure: cost plotted over weights w_1 and w_2]
The error surface is either very flat or very steep
Vanishing/Exploding Gradient Example
[Figure: six panels, each with a horizontal axis from 0 to 35, after 1, 2, 5, 10, 20, and 50 steps]
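A scalar analogue shows the effect. For a linear recurrence s_t = w · s_{t-1}, the derivative of s_t with respect to s_0 is w^t, which plays the role of multiplying the same matrix at every step. The weights 0.9 and 1.1 are hypothetical; the step counts match the panel titles.

```python
def gradient_factor(w, steps):
    """d s_t / d s_0 for the scalar linear recurrence s_t = w * s_{t-1}."""
    return w ** steps

step_counts = [1, 2, 5, 10, 20, 50]
vanishing = [gradient_factor(0.9, t) for t in step_counts]
exploding = [gradient_factor(1.1, t) for t in step_counts]

for t, small, large in zip(step_counts, vanishing, exploding):
    print(f"{t:>2} steps: w=0.9 -> {small:.5f}   w=1.1 -> {large:.2f}")
```

A weight only slightly below or above 1 is enough for the factor to shrink toward zero or blow up within a few dozen steps.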
How to Frame the Learning Problem?
The learning algorithm f maps the input domain X into the output domain Y
f : X → Y
Input domain: word, word sequence, audio signal, click logs
Output domain: single label, sequence tags, tree structure, probability distribution
Network design should leverage input and output domain properties
Input Domain – Sequence Modeling
Idea: aggregate the meaning from all words into a vector
Method:
◦Basic combination: average, sum
◦Neural combination:
◦ Recursive neural network (RvNN)
◦ Recurrent neural network (RNN)
◦ Convolutional neural network (CNN)
[Figure: how to compute an N-dim vector for the sentence 這 (this) 規格 (specification) 有 (have) 誠意 (sincerity)]
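The basic combinations can be sketched directly. The 3-dimensional word vectors below are hypothetical stand-ins for learned embeddings, keyed by the English glosses of the example words.

```python
def vector_sum(vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(dims) for dims in zip(*vectors)]

def vector_average(vectors):
    n = len(vectors)
    return [x / n for x in vector_sum(vectors)]

# Hypothetical embeddings (assumed values, for illustration only).
embeddings = {
    "this": [1.0, 0.0, 2.0],
    "specification": [0.5, 1.0, 0.0],
    "have": [0.0, 0.5, 0.5],
    "sincerity": [2.0, 1.0, 1.0],
}

sentence = ["this", "specification", "have", "sincerity"]
shuffled = ["sincerity", "have", "this", "specification"]

summed = vector_sum([embeddings[w] for w in sentence])
averaged = vector_average([embeddings[w] for w in sentence])
averaged_shuffled = vector_average([embeddings[w] for w in shuffled])
```

Sum and average give the same vector for any word order, which is one motivation for the neural combinations that take structure or order into account.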
Sentiment Analysis
Encode the sequential input into a vector using RNN
[Figure: the input sequence x_1, x_2, …, x_N is encoded by the RNN into a vector, which feeds a network with outputs y_1, y_2, …, y_M]
RNN considers temporal information to learn sentence vectors as the input of classification tasks
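A sketch of this pipeline: run an RNN over the words, keep the last hidden state as the sentence vector, and feed it to a classifier. The vocabulary, sizes, and single sigmoid output are assumptions, and the weights are random, so the predicted score is meaningless until trained.

```python
import math
import random

def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def rand_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
vocab = ["not", "good", "bad", "movie"]  # hypothetical vocabulary
HIDDEN = 4                               # assumed hidden size
E = {w: [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for w in vocab}
W = rand_matrix(HIDDEN, HIDDEN, rng)
w_out = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)]

def encode(words):
    """Return the last hidden state as the sentence vector."""
    state = [0.0] * HIDDEN
    for word in words:
        pre = [a + b for a, b in zip(matvec(W, state), E[word])]
        state = [math.tanh(p) for p in pre]
    return state

def positive_probability(words):
    """Logistic classifier on top of the sentence vector."""
    score = sum(w * h for w, h in zip(w_out, encode(words)))
    return 1.0 / (1.0 + math.exp(-score))

vec_a = encode(["not", "good"])
vec_b = encode(["good", "not"])
prob = positive_probability(["not", "good", "movie"])
```

Unlike a sum or average of word vectors, `vec_a` and `vec_b` differ, because the recurrent state depends on the order in which the words arrive.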
Output Domain – Sequence Prediction
POS Tagging: “推薦我台大後門的餐廳” (“Recommend me a restaurant by NTU's back gate”) → 推薦/VV 我/PN 台大/NR 後門/NN 的/DEG 餐廳/NN
Speech Recognition: audio signal → “大家好” (“Hello, everyone”)
Machine Translation: “How are you doing today?” → “你好嗎?”
The output can be viewed as a sequence of classifications
POS Tagging
Tag a word at each time step
◦Input: word sequence
◦Output: corresponding POS tag sequence
四樓 好 專業
N VA AD
Natural Language Understanding (NLU)
Tag a word at each time step
◦Input: word sequence
◦Output: IOB-format slot tag and intent tag
Word      Tag
<START>   O
just      O
sent      O
email     O
to        O
bob       B-contact_name
about     O
fishing   B-subject
this      I-subject
weekend   I-subject
<END>     send_email (intent)
Result: send_email(contact_name=“bob”, subject=“fishing this weekend”)
Temporal orders for input and output are the same
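Once a tagger has produced one IOB tag per word, the slot values can be read off with a single pass. The helper name `iob_to_slots` is hypothetical; the tokens and tags follow the example above, without the <START> and <END> markers.

```python
def iob_to_slots(tokens, tags):
    """Collect B-/I- tagged spans into a {slot_name: text} dictionary."""
    slots = {}
    current_name, current_words = None, []

    def flush():
        if current_name is not None:
            slots[current_name] = " ".join(current_words)

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_name, current_words = tag[2:], [token]
        elif tag.startswith("I-") and current_name == tag[2:]:
            current_words.append(token)
        else:  # "O" or an I- tag that does not continue the open span
            flush()
            current_name, current_words = None, []
    flush()
    return slots

tokens = "just sent email to bob about fishing this weekend".split()
tags = ["O", "O", "O", "O", "B-contact_name", "O",
        "B-subject", "I-subject", "I-subject"]
slots = iob_to_slots(tokens, tags)
intent = "send_email"  # predicted separately, at the <END> position in the example
```

Because input and output are aligned, `zip(tokens, tags)` is all the alignment the decoder needs.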
超棒 的 醬汁
Machine Translation
Cascade two RNNs, one for encoding and one for decoding
◦Input: word sequences in the source language
◦Output: word sequences in the target language
[Figure: encoder RNN followed by decoder RNN]
Chit-Chat Dialogue Modeling
Cascade two RNNs, one for encoding and one for decoding
◦Input: word sequences in the question
◦Output: word sequences in the response
Temporal ordering for input and output may be different
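The encoder-decoder cascade can be sketched as two small RNNs with greedy decoding. Everything here is an assumption for illustration: the toy vocabularies, the sizes, the zero vector used as the first decoder input, and the random untrained weights, so the generated tokens are arbitrary.

```python
import math
import random

def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def rand_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

rng = random.Random(0)
SRC_VOCAB = ["how", "are", "you"]
TGT_VOCAB = ["<END>", "你", "好", "嗎"]
HIDDEN = 6
E_src = rand_matrix(len(SRC_VOCAB), HIDDEN, rng)  # source word embeddings
E_tgt = rand_matrix(len(TGT_VOCAB), HIDDEN, rng)  # target word embeddings
W_enc = rand_matrix(HIDDEN, HIDDEN, rng)
W_dec = rand_matrix(HIDDEN, HIDDEN, rng)
V_out = rand_matrix(len(TGT_VOCAB), HIDDEN, rng)

def step(W, state, x):
    return [math.tanh(a + b) for a, b in zip(matvec(W, state), x)]

def encode(words):
    """Encoder RNN: fold the source sentence into one state vector."""
    state = [0.0] * HIDDEN
    for word in words:
        state = step(W_enc, state, E_src[SRC_VOCAB.index(word)])
    return state

def decode(state, max_len=5):
    """Decoder RNN: emit target words greedily until <END> or max_len."""
    output = []
    prev = [0.0] * HIDDEN  # assumed start-of-sequence input
    for _ in range(max_len):
        state = step(W_dec, state, prev)
        token_id = argmax(matvec(V_out, state))
        if TGT_VOCAB[token_id] == "<END>":
            break
        output.append(TGT_VOCAB[token_id])
        prev = E_tgt[token_id]
    return output

translation = decode(encode(["how", "are", "you"]))
```

The decoder stops on its own end token rather than after one output per input word, which is why the output length and ordering need not match the input.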
Concluding Remarks
Language Modeling
◦RNNLM
Recurrent Neural Networks
◦Definition
◦Backpropagation through Time (BPTT)
◦Vanishing/Exploding Gradient
Applications
◦Sequential Input: Sequence-Level Embedding
◦Sequential Output: Tagging / Seq2Seq (Encoder-Decoder)