Gated RNN &
Sequence Generation
Hung-yi Lee
李宏毅
Outline
• RNN with Gated Mechanism
• Sequence Generation
• Conditional Sequence Generation
• Tips for Generation
RNN with Gated
Mechanism
Recurrent Neural Network
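• Given function f: $h', y = f(h, x)$
[Figure: the same f is applied at every time step: $(h^0, x^1) \to (h^1, y^1)$, $(h^1, x^2) \to (h^2, y^2)$, $(h^2, x^3) \to (h^3, y^3)$, ……]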
No matter how long the input/output sequence is, we only need one function f
h and h’ are vectors with the same dimension
Deep RNN
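[Figure: two stacked layers unrolled over time. Layer 1 applies $f_1$ to $(h^{t-1}, x^t)$ and outputs $(h^t, y^t)$; layer 2 applies $f_2$ to $(b^{t-1}, y^t)$ and outputs $(b^t, c^t)$; more layers can be stacked in the same way.]
$h', y = f_1(h, x)$
$b', c = f_2(b, y)$
…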
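Bidirectional RNN
[Figure: $f_1$ reads $x^1, x^2, x^3$ in one direction and outputs $a^1, a^2, a^3$; $f_2$ reads the same inputs in the opposite direction and outputs $c^1, c^2, c^3$; $f_3$ combines $a^t$ and $c^t$ into $y^t$.]
$h', a = f_1(h, x)$
$b', c = f_2(b, x)$
$y = f_3(a, c)$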
Naïve RNN
• Given function f: $h', y = f(h, x)$
Ignore bias here
$h' = \sigma(W^h h + W^i x)$
$y = \mathrm{softmax}(W^o h')$
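The naive RNN update can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the random weights and the helper names (`naive_rnn_step`, `softmax`, `sigmoid`) are assumptions, not part of the slides.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def naive_rnn_step(h, x, W_h, W_i, W_o):
    """One application of f: returns (h', y). Biases are ignored."""
    h_next = sigmoid(W_h @ h + W_i @ x)   # h' = sigma(W^h h + W^i x)
    y = softmax(W_o @ h_next)             # y = softmax(W^o h')
    return h_next, y

rng = np.random.default_rng(0)
hidden, n_in, n_out = 4, 3, 5             # illustrative sizes (assumption)
W_h = rng.normal(size=(hidden, hidden))
W_i = rng.normal(size=(hidden, n_in))
W_o = rng.normal(size=(n_out, hidden))

h = np.zeros(hidden)
for x in rng.normal(size=(6, n_in)):      # the same f is reused at every step
    h, y = naive_rnn_step(h, x, W_h, W_i, W_o)
```

However long the input sequence is, only the one set of weights is used.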
LSTM
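c changes slowly: $c^t$ is $c^{t-1}$ added by something
h changes faster: $h^t$ and $h^{t-1}$ can be very different
[Figure: a naive RNN block maps $(x^t, h^{t-1})$ to $(y^t, h^t)$; an LSTM block maps $(x^t, h^{t-1}, c^{t-1})$ to $(y^t, h^t, c^t)$.]
With $[x^t; h^{t-1}]$ denoting the concatenation of $x^t$ and $h^{t-1}$:
$z = \tanh(W [x^t; h^{t-1}])$
$z^i = \sigma(W^i [x^t; h^{t-1}])$
$z^f = \sigma(W^f [x^t; h^{t-1}])$
$z^o = \sigma(W^o [x^t; h^{t-1}])$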
“peephole”: $z = \tanh(W [x^t; h^{t-1}; c^{t-1}])$, where the part of $W$ that multiplies $c^{t-1}$ is diagonal.
$z^i$, $z^f$, $z^o$ are obtained in the same way.
$c^t = z^f \odot c^{t-1} + z^i \odot z$
$h^t = z^o \odot \tanh(c^t)$
$y^t = \sigma(W' h^t)$
LSTM
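[Figure: two LSTM blocks chained over time. The block at step t takes $x^t$, $h^{t-1}$, $c^{t-1}$ and outputs $y^t$, $h^t$, $c^t$; the block at step t+1 takes $x^{t+1}$, $h^t$, $c^t$ and outputs $y^{t+1}$, $h^{t+1}$, $c^{t+1}$.]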
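A minimal NumPy sketch of one LSTM step without peephole, following the gate and cell equations of this section. The function name `lstm_step`, the weight shapes and the random initialization are assumptions for illustration; the output layer $y^t = \sigma(W' h^t)$ is left out.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, W, W_i, W_f, W_o):
    """One LSTM step; every weight matrix acts on the concatenation [x; h_prev]."""
    xh = np.concatenate([x, h_prev])
    z = np.tanh(W @ xh)           # candidate input
    z_i = sigmoid(W_i @ xh)       # input gate
    z_f = sigmoid(W_f @ xh)       # forget gate
    z_o = sigmoid(W_o @ xh)       # output gate
    c = z_f * c_prev + z_i * z    # c^t = z^f ⊙ c^{t-1} + z^i ⊙ z
    h = z_o * np.tanh(c)          # h^t = z^o ⊙ tanh(c^t)
    return h, c

rng = np.random.default_rng(0)
n_in, hidden = 3, 4               # illustrative sizes (assumption)
weights = [rng.normal(size=(hidden, n_in + hidden)) for _ in range(4)]
h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x, h, c, *weights)
```

The cell state is only scaled by the forget gate and added to, which is why c changes slowly while h can change a lot from step to step.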
GRU
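[Figure: GRU block. From $x^t$ and $h^{t-1}$ it computes a reset gate $r$ and an update gate $z$; $r \odot h^{t-1}$ and $x^t$ produce the candidate $h'$; the block outputs $y^t$ and $h^t$.]
$h^t = z \odot h^{t-1} + (1 - z) \odot h'$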
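A sketch of one GRU step in NumPy. Only the final interpolation $h^t = z \odot h^{t-1} + (1 - z) \odot h'$ comes from the slide; the way the gates and the candidate $h'$ are computed here follows the commonly used GRU formulation and is an assumption, as are the names and sizes.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h_prev, W_r, W_z, W):
    """One GRU step; all weights act on a concatenation of x and a state vector."""
    xh = np.concatenate([x, h_prev])
    r = sigmoid(W_r @ xh)                                   # reset gate
    z = sigmoid(W_z @ xh)                                   # update gate
    h_cand = np.tanh(W @ np.concatenate([x, r * h_prev]))   # h' (assumed standard form)
    return z * h_prev + (1.0 - z) * h_cand                  # h^t = z ⊙ h^{t-1} + (1 - z) ⊙ h'

rng = np.random.default_rng(0)
n_in, hidden = 3, 4                                         # illustrative sizes (assumption)
W_r, W_z, W = (rng.normal(size=(hidden, n_in + hidden)) for _ in range(3))
h = np.zeros(hidden)
for x in rng.normal(size=(5, n_in)):
    h = gru_step(x, h, W_r, W_z, W)
```

Because one gate $z$ controls both how much old state is kept and how much new content is written, the GRU needs fewer parameters than the LSTM.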
LSTM: A Search Space Odyssey
• Standard LSTM works well
• Simplified LSTM: coupling the input and forget gates, removing the peephole
• The forget gate is critical for performance
• The output gate activation function is critical
An Empirical Exploration of Recurrent Network Architectures
• Importance: forget > input > output
• A large bias for the forget gate is helpful
• LSTM-f/i/o: removing the forget/input/output gate
• LSTM-b: large bias
Neural Architecture Search with Reinforcement Learning
LSTM From Reinforcement Learning
Sequence Generation
Generation
• Sentences are composed of characters/words
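• Generating one character/word at each time step with an RNN
[Figure: one RNN step f with inputs h, x and outputs h', y.]
x: the token generated at the previous time step (represented by 1-of-N encoding)
y: a distribution over the tokens (sample from the distribution to generate a token)
Example with the vocabulary 你 (you), 我 (I), 他 (he), 是 (is), 很 (very), ……:
x: 1 0 0 0 0 …… 0
y: 0 0 0 0.7 0.3 …… 0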
Generation
• Sentences are composed of characters/words
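• Generating one character/word at each time step with an RNN
[Figure: the RNN unrolled over time. The first input $x^1$ is <BOS>; every sampled token is fed back as the next input.]
$y^1$: P(w|<BOS>), sample 床 (bed)
$y^2$: P(w|<BOS>,床), sample 前 (front)
$y^3$: P(w|<BOS>,床,前), sample 明 (bright)
Continue until <EOS> is generated.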
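The generation loop can be sketched as follows. The toy vocabulary, the untrained random weights and the helper names `step` and `generate` are assumptions for illustration; a real model would use trained parameters.

```python
import numpy as np

VOCAB = ["<BOS>", "<EOS>", "A", "B", "C"]    # toy vocabulary (assumption)
BOS, EOS = 0, 1

def step(h, x, params):
    """Toy RNN cell: returns the new state and a distribution over VOCAB."""
    W_h, W_i, W_o = params
    h = np.tanh(W_h @ h + W_i @ x)
    logits = W_o @ h
    p = np.exp(logits - logits.max())
    return h, p / p.sum()

def generate(params, rng, max_len=20):
    h = np.zeros(params[0].shape[0])
    token, out = BOS, []
    for _ in range(max_len):
        x = np.eye(len(VOCAB))[token]         # 1-of-N encoding of the previous token
        h, p = step(h, x, params)
        token = rng.choice(len(VOCAB), p=p)   # sample from the distribution
        if token == EOS:                      # stop once <EOS> is generated
            break
        out.append(VOCAB[token])
    return out

rng = np.random.default_rng(0)
params = (rng.normal(size=(8, 8)), rng.normal(size=(8, 5)), rng.normal(size=(5, 8)))
sentence = generate(params, rng)
```

The `max_len` cap is a safeguard added here, since an untrained model may take a long time to sample <EOS>.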
Generation
• Training
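[Figure: the RNN unrolled over time. Inputs: <BOS>, 春, 眠; targets for $y^1$, $y^2$, $y^3$: 春, 眠, 不.]
Training data: 春 眠 不 覺 曉 (a line of classical Chinese poetry)
At each step, minimize the cross-entropy between the output distribution and the reference token.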
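A small sketch of the training loss for one sequence: the cross-entropy of the reference token at each step, summed over the steps. The probabilities below are made up for illustration, and `sequence_cross_entropy` is a hypothetical helper name.

```python
import numpy as np

def sequence_cross_entropy(probs, targets):
    """probs[t] is the model's distribution at step t; targets[t] is the reference token id."""
    probs = np.asarray(probs)
    picked = probs[np.arange(len(targets)), targets]   # probability given to each reference token
    return float(-np.sum(np.log(picked)))

# Reference 春 眠 不: the inputs are <BOS>, 春, 眠 and the targets are 春, 眠, 不.
vocab = {"<BOS>": 0, "春": 1, "眠": 2, "不": 3}
targets = [vocab["春"], vocab["眠"], vocab["不"]]
probs = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.2, 0.2, 0.1, 0.5],
]                                                      # made-up model outputs
loss = sequence_cross_entropy(probs, targets)
```

During training the inputs are always the reference tokens, regardless of what the model would have generated.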
Generation
• Images are composed of pixels
• Generating one pixel at each time step with an RNN
Consider an image as a sentence of pixel colors: blue red yellow gray ……
Train an RNN based on these “sentences”.
[Figure: the RNN unrolled from <BOS>; it samples red, then blue, then green, and each sampled pixel is fed back as the next input.]
Generation
• Images are composed of pixels
3 x 3 images
Conditional
Sequence Generation
Conditional Generation
• We don’t want to simply generate some random sentences.
• Generate sentences based on conditions:
Caption Generation: given condition: (an image); output: “A young girl is dancing.”
Chat-bot: given condition: “Hello”; output: “Hello. Nice to see you.”
Conditional Generation
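• Represent the input condition as a vector, and use the vector as the input of the RNN generator
Image Caption Generation
[Figure: the input image goes through a CNN, which outputs a vector; the vector is fed to the RNN generator, which starts from <BOS> and generates “A”, “woman”, ……, “. (period)”.]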
Conditional Generation
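• Represent the input condition as a vector, and use the vector as the input of the RNN generator
• E.g. Machine translation / Chat-bot
[Figure: the encoder reads 機 器 學 習 (“machine learning”); its final vector carries the information of the whole sentence and is fed to the decoder, which generates “machine”, “learning”, “. (period)”.]
Encoder and decoder are jointly trained.
Sequence-to-sequence learning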
Conditional Generation
[Figure: dialogue example with user turns (U: Hi) and machine turns (M: Hi, M: Hello).]
Serban, Iulian V., Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau, “Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models”, 2015
https://www.youtube.com/watch?v=e2MpOmyQJw4
Need to consider longer context during chatting
Dynamic Conditional Generation
[Figure: the encoder reads 機 器 學 習 (“machine learning”) and produces the states $h^1, h^2, h^3, h^4$. Instead of one vector with the information of the whole sentence, the decoder receives a different vector $c^1, c^2, c^3$ at each step while generating “machine”, “learning”, “. (period)”.]
Machine Translation
• Attention-based model
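[Figure: the encoder states $h^1, h^2, h^3, h^4$ of 機 器 學 習; the match function takes $h$ and $z$ and outputs a scalar $\alpha$, e.g. $z^0$ and $h^1$ give $\alpha_0^1$.]
What is match? Design it yourself, for example:
• Cosine similarity of z and h
• Small NN whose input is z and h, and whose output is a scalar
• $\alpha = h^T W z$
The parameters of match are jointly learned with the other parts of the network.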
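The three choices of match listed on this slide could look like the sketch below. The function names, the `tanh` hidden layer and the parameter shapes of the small network are assumptions for illustration.

```python
import numpy as np

def match_cosine(h, z):
    """Cosine similarity of z and h (no learned parameters)."""
    return float(h @ z / (np.linalg.norm(h) * np.linalg.norm(z)))

def match_bilinear(h, z, W):
    """alpha = h^T W z, with W learned jointly with the rest of the network."""
    return float(h @ W @ z)

def match_small_nn(h, z, W1, w2):
    """Small network: input is [h; z], output is a scalar (one hidden layer assumed)."""
    hidden = np.tanh(W1 @ np.concatenate([h, z]))
    return float(w2 @ hidden)

rng = np.random.default_rng(0)
h, z = rng.normal(size=4), rng.normal(size=4)
scores = (
    match_cosine(h, z),
    match_bilinear(h, z, rng.normal(size=(4, 4))),
    match_small_nn(h, z, rng.normal(size=(6, 8)), rng.normal(size=6)),
)
```

Whichever form is used, match returns one scalar per encoder state.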
Machine Translation
• Attention-based model
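[Figure: the match scores $\alpha_0^1, \alpha_0^2, \alpha_0^3, \alpha_0^4$ between $z^0$ and $h^1, h^2, h^3, h^4$ (for 機 器 學 習) pass through a softmax, giving $\hat\alpha_0^1, \hat\alpha_0^2, \hat\alpha_0^3, \hat\alpha_0^4$.]
$c^0 = \sum_i \hat\alpha_0^i h^i$
Example: $\hat\alpha_0 = (0.5, 0.5, 0.0, 0.0)$ gives $c^0 = 0.5h^1 + 0.5h^2$
$c^0$ is the decoder input; the decoder produces $z^1$ and outputs “machine”.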
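A sketch of one attention step: score every encoder state against $z$, normalize with softmax, and take the weighted sum. The bilinear match, the toy encoder states and the helper name `attend` are assumptions chosen so that the weights come out close to the (0.5, 0.5, 0.0, 0.0) example.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def attend(z, H, W):
    """H holds one encoder state h^i per row; returns (alpha_hat, c)."""
    scores = H @ W @ z             # alpha^i = (h^i)^T W z for every i
    alpha_hat = softmax(scores)    # normalized attention weights
    c = alpha_hat @ H              # c = sum_i alpha_hat^i h^i
    return alpha_hat, c

H = np.eye(4)                                 # four toy encoder states h^1..h^4
z0 = np.array([10.0, 10.0, -10.0, -10.0])     # toy decoder state
alpha_hat, c0 = attend(z0, H, np.eye(4))
```

With these toy values almost all the weight falls on $h^1$ and $h^2$, so $c^0$ is approximately $0.5h^1 + 0.5h^2$.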
Machine Translation
• Attention-based model
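[Figure: after “machine” is generated from $c^0$, the new decoder state $z^1$ is matched against $h^1, h^2, h^3, h^4$ (for 機 器 學 習), starting with $\alpha_1^1$.]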
Machine Translation
• Attention-based model
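[Figure: the match scores $\alpha_1^1, \alpha_1^2, \alpha_1^3, \alpha_1^4$ between $z^1$ and $h^1, h^2, h^3, h^4$ pass through a softmax, giving $\hat\alpha_1^1, \hat\alpha_1^2, \hat\alpha_1^3, \hat\alpha_1^4$.]
$c^1 = \sum_i \hat\alpha_1^i h^i$
Example: $\hat\alpha_1 = (0.0, 0.0, 0.5, 0.5)$ gives $c^1 = 0.5h^3 + 0.5h^4$
$c^1$ is fed to the decoder, which produces $z^2$ and outputs “learning”.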
Machine Translation
• Attention-based model
[Figure: $z^2$ is matched against $h^1, h^2, h^3, h^4$ (for 機 器 學 習), starting with $\alpha_2^1$, to compute the next context vector.]
The same process repeats until <EOS> is generated.
Speech Recognition
William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals, “Listen, Attend and Spell”, ICASSP, 2016
Image Caption Generation
[Figure: a CNN applies its filters to the image and yields a vector for each region; $z^0$ is matched against each region vector, e.g. giving a score of 0.7 for one region.]
Image Caption Generation
[Figure: the CNN yields a vector for each region; with $z^0$, the attention weights over the regions are 0.7, 0.1, 0.1, 0.1, 0.0, 0.0. The weighted sum of the region vectors is fed to the decoder, which produces $z^1$ and outputs Word 1.]
Image Caption Generation
[Figure: with $z^1$, the attention weights over the regions become 0.0, 0.8, 0.2, 0.0, 0.0, 0.0. The new weighted sum is fed to the decoder, which produces $z^2$ and outputs Word 2.]
Image Caption Generation
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan
Salakhutdinov, Richard Zemel, Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015
Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville, “Describing Videos by Exploiting Temporal Structure”, ICCV, 2015
Question Answering
• Given a document and a query, output an answer
• bAbI: the answer is a word
• https://research.fb.com/downloads/babi/
• SQuAD: the answer is a sequence of words (in the input document)
• https://rajpurkar.github.io/SQuAD-explorer/
• MS MARCO: the answer is a sequence of words
• http://www.msmarco.org
• MovieQA: Multiple choice question (output a number)
• http://movieqa.cs.toronto.edu/home/
Memory Network
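[Figure: the document is a set of sentence vectors $x^1, x^2, x^3, \dots, x^N$. The query vector q is matched against each $x^n$ to give $\alpha_1, \alpha_2, \alpha_3, \dots, \alpha_N$. The extracted information and q go into a DNN, which outputs the answer.]
Extracted information $= \sum_{n=1}^{N} \alpha_n x^n$
The sentence-to-vector mapping can be jointly trained.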
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus, “End-To-End Memory Networks”, NIPS, 2015
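Memory Network
[Figure: as before, but every sentence now has two jointly learned vectors: $x^n$ is matched against the query q to give $\alpha_n$, and $h^n$ is the vector that is extracted. The extracted information and q go into a DNN, which outputs the answer.]
Extracted information $= \sum_{n=1}^{N} \alpha_n h^n$
Hopping: the extracted information is fed back to update the query for another round of attention.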
Memory Network
[Figure: starting from q, each hop computes attention over the document and extracts information; after several hops, a DNN outputs the answer a.]
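A sketch of the memory-network read with hopping. The dot-product match and the additive query update `q + extracted` are assumptions made for illustration; `memory_network_read` is a hypothetical helper, and the final DNN that maps the result to an answer is left out.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def memory_network_read(q, X, H, hops=2):
    """X[n]: vector matched against the query; H[n]: vector that gets extracted."""
    for _ in range(hops):
        alpha = softmax(X @ q)    # compute attention (dot-product match, an assumption)
        extracted = alpha @ H     # extract information: sum_n alpha_n h^n
        q = q + extracted         # hopping: update the query (an assumption)
    return q                      # would be passed to a DNN that outputs the answer

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))       # six sentences, 4-dimensional vectors (toy sizes)
H = rng.normal(size=(6, 4))
q_final = memory_network_read(rng.normal(size=4), X, H, hops=3)
```

Using separate vectors for matching (X) and for extraction (H) corresponds to the second memory-network figure; setting `H = X` gives the first.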
Reading Comprehension
• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J.
Weston, R. Fergus. NIPS, 2015.
The position of reading head:
Keras has an example:
https://github.com/fchollet/keras/blob/master/examples/babi_memnn.py
Wei Fang, Juei-Yang Hsu, Hung-yi Lee, Lin-Shan Lee, “Hierarchical Attention Model for Improved Machine Comprehension of Spoken Content”, SLT, 2016
Neural Turing Machine
• von Neumann architecture
https://www.quora.com/How-does-the-Von-Neumann-architecture-provide-flexibility-for-program-development
A Neural Turing Machine not only reads from the memory; it also modifies the memory through attention.
Neural Turing Machine
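[Figure: memory vectors $m_0^1, m_0^2, m_0^3, m_0^4$ with attention weights $\hat\alpha_0^1, \hat\alpha_0^2, \hat\alpha_0^3, \hat\alpha_0^4$. The read vector $r^0$, together with $x^1$ and $h^0$, goes into f, which outputs $y^1$.]
Retrieval process: $r^0 = \sum_i \hat\alpha_0^i m_0^i$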
Neural Turing Machine
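[Figure: from $r^0$, $x^1$ and $h^0$, f outputs $y^1$ and three vectors $k^1$, $e^1$, $a^1$. The key $k^1$ is compared with the memory $m_0^1, m_0^2, m_0^3, m_0^4$ to produce the new attention.]
$\alpha_1^i = \cos(m_0^i, k^1)$
$(\hat\alpha_1^1, \hat\alpha_1^2, \hat\alpha_1^3, \hat\alpha_1^4) = \mathrm{softmax}(\alpha_1^1, \alpha_1^2, \alpha_1^3, \alpha_1^4)$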
Neural Turing Machine
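[Figure: the new attention $\hat\alpha_1^1, \hat\alpha_1^2, \hat\alpha_1^3, \hat\alpha_1^4$ and the vectors $e^1$, $a^1$ turn the memory $m_0^1, \dots, m_0^4$ into $m_1^1, \dots, m_1^4$.]
$m_1^i = m_0^i - \hat\alpha_1^i (e^1 \odot m_0^i) + \hat\alpha_1^i a^1$
($\odot$ is element-wise; the entries of $e^1$ are in 0 ~ 1)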
Neural Turing Machine
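[Figure: the process unrolled over two steps. Step 1: f takes $r^0$, $x^1$, $h^0$, outputs $y^1$, and updates the attention to $\hat\alpha_1$ and the memory to $m_1$. Step 2: f takes $r^1$, $x^2$, $h^1$, outputs $y^2$, and updates the attention to $\hat\alpha_2$ and the memory to $m_2$.]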
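A sketch of one memory update and read in NumPy, following the equations of this section: cosine match with the key, softmax, erase-and-add write, then a weighted read. The function name `ntm_step`, the small constant in the cosine denominator and the toy sizes are assumptions; the controller f that produces $k$, $e$, $a$ is not modeled.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def ntm_step(M, k, e, a):
    """M: memory, one vector m^i per row; k: key; e: erase vector in [0, 1]; a: add vector."""
    alpha = np.array([cosine(m, k) for m in M])   # alpha^i = cos(m^i, k)
    alpha_hat = softmax(alpha)
    # m_new^i = m^i - alpha_hat^i (e ⊙ m^i) + alpha_hat^i a
    M_new = M - alpha_hat[:, None] * (e * M) + alpha_hat[:, None] * a
    r = alpha_hat @ M_new                         # read vector for the next step
    return M_new, r, alpha_hat

rng = np.random.default_rng(0)
M0 = rng.normal(size=(4, 3))                      # four memory slots of size 3 (toy sizes)
M1, r1, alpha_hat1 = ntm_step(M0, rng.normal(size=3), rng.uniform(size=3), rng.normal(size=3))
```

Slots with a small attention weight are left almost unchanged, so the write is localized by attention.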
Neural Turing Machine
Wei-Jen Ko, Bo-Hsiang Tseng, Hung-yi Lee,
“Recurrent Neural Network based Language Modeling with Controllable External Memory”, ICASSP, 2017
Stack RNN
Armand Joulin, Tomas Mikolov, Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets, arXiv Pre-Print, 2015
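[Figure: at each step, f takes $x^t$ and the top of the stack and outputs $y^t$, the information to store, and a distribution over the actions Push, Pop, Nothing, e.g. 0.7, 0.2, 0.1. The stack after pushing is multiplied by 0.7, the stack after popping by 0.2, and the unchanged stack by 0.1; the three are added to give the new stack.]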
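The soft stack update in the figure can be sketched as a weighted sum of three candidate stacks. The fixed stack depth, the zero padding at the bottom after a pop and the name `soft_stack_update` are assumptions for illustration.

```python
import numpy as np

def soft_stack_update(stack, new_item, p_push, p_pop, p_nothing):
    """stack: (depth, dim) array, row 0 is the top. Returns the blended new stack."""
    pushed = np.vstack([new_item, stack[:-1]])                    # every element moves down
    popped = np.vstack([stack[1:], np.zeros_like(stack[:1])])     # every element moves up
    return p_push * pushed + p_pop * popped + p_nothing * stack

stack = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])            # toy stack, top first
new_item = np.array([9.0, 9.0])                                   # information to store
new_stack = soft_stack_update(stack, new_item, 0.7, 0.2, 0.1)
```

Because the three actions are blended rather than chosen, the update stays differentiable and can be trained with gradient descent.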
Tips for Generation
Attention
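Bad attention: some components are attended at several steps and others at none, e.g. the caption says (woman) (woman) …… with no “cooking”.
Good attention: each input component has approximately the same attention weight.
E.g. regularization term: $\sum_i \left(\tau - \sum_t \alpha_t^i\right)^2$, where the outer sum is for each component $i$ and the inner sum is over the generation steps $t$.
[Figure: attention weights $\alpha_t^i$ on components $i = 1, \dots, 4$ at generation steps $t = 1, \dots, 4$, producing the words w1 w2 w3 w4.]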
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015
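The regularization term $\sum_i (\tau - \sum_t \alpha_t^i)^2$ is easy to compute from a matrix of attention weights. The two toy attention matrices and the name `attention_regularizer` are assumptions for illustration.

```python
import numpy as np

def attention_regularizer(alpha, tau=1.0):
    """alpha[t, i]: attention weight on component i at generation step t."""
    total_per_component = alpha.sum(axis=0)          # sum_t alpha_t^i
    return float(np.sum((tau - total_per_component) ** 2))

# Two generation steps over two components.
bad = np.array([[1.0, 0.0], [1.0, 0.0]])             # component 1 attended twice, component 2 never
good = np.array([[1.0, 0.0], [0.0, 1.0]])            # each component attended once
penalty_bad = attention_regularizer(bad)
penalty_good = attention_regularizer(good)
```

The penalty is zero only when every component receives a total attention of $\tau$ over the whole generation.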
Mismatch between Train and Test
• Training
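[Figure: given a condition, the RNN is unrolled from <BOS> over the tokens A and B. The reference is A B B, and the reference tokens are used as the inputs.]
$C = \sum_t C_t$
Minimizing the cross-entropy of each component.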
Mismatch between Train and Test
• Generation
[Figure: at generation time, the token generated at each step is fed back as the input of the next step.]
We do not know the reference.
Testing: the inputs are the outputs of the previous time step.
Training: the inputs are the reference.
Exposure Bias
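[Figure: the tree of all A/B sequences starting from <BOS>. Training only follows the reference path, so the other branches are never explored.]
One step wrong, and the rest may be totally wrong.
一步錯,步步錯 (one wrong step, and every step after it is wrong)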
Modifying Training Process?
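[Figure: during training, the model's own output at each step is fed as the input of the next step, and each output is compared with the reference.]
Training is matched to testing.
In practice, it is hard to train in this way.
When we try to decrease the loss for both steps 1 and 2, changing the output of step 1 also changes the input of step 2, so what was learned for step 2 may no longer apply.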
Scheduled Sampling
[Figure: at each step, the input is picked at random: either from the model (its own previous output) or from the reference.]
Scheduled Sampling
• Caption generation on MSCOCO
                        BLEU-4   METEOR   CIDER
Always from reference   28.8     24.2     89.5
Always from model       11.2     15.7     49.7
Scheduled Sampling      30.6     24.3     92.1
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer, Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, arXiv preprint, 2015
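A sketch of how the inputs could be chosen under scheduled sampling: each next input comes from the reference with some probability and from the model otherwise. The helper name `choose_inputs`, the token lists and the value of `p_reference` are assumptions for illustration.

```python
import numpy as np

def choose_inputs(reference, model_outputs, p_reference, rng):
    """For each step, take the reference token with probability p_reference,
    otherwise the token the model generated itself."""
    use_ref = rng.random(len(reference)) < p_reference
    return [r if u else m for r, m, u in zip(reference, model_outputs, use_ref)]

rng = np.random.default_rng(0)
reference = ["A", "B", "B"]
model_outputs = ["B", "A", "B"]          # made-up model predictions
mixed = choose_inputs(reference, model_outputs, p_reference=0.5, rng=rng)
```

Setting `p_reference=1.0` gives the “always from reference” row of the table and `p_reference=0.0` the “always from model” row; a schedule would lower the probability as training proceeds.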
Beam Search
[Figure: the tree of all A/B sequences, with a probability on each branch (0.4, 0.6, 0.9, ……); one path is highlighted in green.]
The green path has the higher score.
It is not possible to check all the paths.
Beam Search
[Figure: the same tree of A/B sequences, where only the best paths are expanded at each step.]
Keep the several best paths at each step.
Beam size = 2
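A small beam search sketch over sequences of A and B. The branch probabilities in `toy_model` are made up so that the greedy choice is not the best sequence; `beam_search` and `toy_model` are hypothetical names.

```python
import math

def beam_search(next_probs, beam_size, length):
    """next_probs(prefix) -> {token: probability}. Returns the best (prefix, log_prob)."""
    beams = [((), 0.0)]
    for _ in range(length):
        candidates = [
            (prefix + (tok,), score + math.log(p))
            for prefix, score in beams
            for tok, p in next_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]        # keep the several best paths
    return beams[0]

def toy_model(prefix):
    # Made-up probabilities: starting with B is less likely but pays off later.
    if not prefix:
        return {"A": 0.6, "B": 0.4}
    if prefix[0] == "A":
        return {"A": 0.4, "B": 0.6}
    return {"A": 0.1, "B": 0.9}

greedy = beam_search(toy_model, beam_size=1, length=3)
beam = beam_search(toy_model, beam_size=2, length=3)
```

With beam size 1 the search is greedy and commits to A at the first step; with beam size 2 it also keeps the path starting with B, which ends up with the higher score.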
Beam Search
https://github.com/tensorflow/tensorflow/issues/654#issuecomment-169009989
The size of beam is 3 in this example.
Better Idea?
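[Figure: instead of a sampled token, the whole output distribution of each step is fed to the next step.]
Suppose I ≈ you at the first step and am ≈ are at the second step. Then, besides “I am ……” and “You are ……”, the sequences “I are ……” and “You am ……” also get a high score.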
Object level vs. Component level
• Minimizing the error defined on component level is not equivalent to improving the generated objects
$C = \sum_t C_t$ (cross-entropy of each step)
Ref: The dog is running fast
Example outputs:
A cat a a a
The dog is is fast
The dog is running fast
Optimize an object-level criterion instead of the component-level cross-entropy.
Object-level criterion: $R(y, \hat{y})$, where $y$ is the generated utterance and $\hat{y}$ is the ground truth.
Gradient descent?
Reinforcement learning?
Start with observation $s_1$
Action $a_1$: “right”, obtain reward $r_1 = 0$
Observation $s_2$
Action $a_2$: “fire” (kill an alien), obtain reward $r_2 = 5$
Observation $s_3$
Reinforcement learning?
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech
Zaremba, “Sequence Level Training with Recurrent Neural Networks”, ICLR, 2016
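[Figure: the generator unrolled from <BOS> over the tokens A and B.]
• Observation: the input at each step
• Action set: the tokens that can be generated (A, B)
• Action taken: the generated token. The action we take influences the observation in the next step.
• Reward: $r = 0$ at the intermediate steps; at the end, the reward is R(“BAA”, reference)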