(1)

Gated RNN & Sequence Generation

Hung-yi Lee

李宏毅

(2)

Outline

• RNN with Gated Mechanism

• Sequence Generation

• Conditional Sequence Generation

• Tips for Generation

(3)

RNN with Gated Mechanism

(4)

Recurrent Neural Network

• Given function f: $h', y = f(h, x)$

[Diagram: the same f applied at every time step: $h^1, y^1 = f(h^0, x^1)$; $h^2, y^2 = f(h^1, x^2)$; $h^3, y^3 = f(h^2, x^3)$; ……]

No matter how long the input/output sequence is, we only need one function f

h and h’ are vectors with the same dimension

(5)

Deep RNN

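$h', y = f_1(h, x)$, $b', c = f_2(b, y)$, …

[Diagram: two stacked layers unrolled over time. Layer 1: $h^t, y^t = f_1(h^{t-1}, x^t)$ for t = 1, 2, 3, ……. Layer 2 takes $y^t$ as its input: $b^t, c^t = f_2(b^{t-1}, y^t)$. Further layers are stacked in the same way.]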

(6)

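Bidirectional RNN

$h', a = f_1(h, x)$, $b', c = f_2(b, x)$

$y = f_3(a, c)$

[Diagram: $f_1$ reads $x^1, x^2, x^3$ in one direction (states $h^0$ to $h^3$, outputs $a^1, a^2, a^3$); $f_2$ reads the same inputs in the opposite direction (states $b^0$ to $b^3$, outputs $c^1, c^2, c^3$); $f_3$ combines $a^t$ and $c^t$ into $y^t$.]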

(7)

Naïve RNN

• Given function f: $h', y = f(h, x)$

$h' = \sigma(W^h h + W^i x)$

$y = \sigma(W^o h')$ (here $\sigma$ is a softmax)

Ignore bias here
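The naive RNN step above can be sketched in plain Python. This is only an illustration under assumptions: the weight values, the hidden size 2 and the 3-token 1-of-N inputs are made up, and the helper names (`matvec`, `naive_rnn_step`) are hypothetical, not from the slides.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def naive_rnn_step(h, x, W_h, W_i, W_o):
    """h' = sigmoid(W_h h + W_i x); y = softmax(W_o h'). Bias is ignored."""
    pre = [a + b for a, b in zip(matvec(W_h, h), matvec(W_i, x))]
    h_next = sigmoid(pre)
    y = softmax(matvec(W_o, h_next))
    return h_next, y

# Toy weights (assumed values, for illustration only).
W_h = [[0.5, -0.2], [0.1, 0.3]]
W_i = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]
W_o = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]

# The same function f is reused at every time step.
h = [0.0, 0.0]
outputs = []
for x in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):
    h, y = naive_rnn_step(h, x, W_h, W_i, W_o)
    outputs.append(y)
```

Each entry of `outputs` is a distribution over the three tokens, and `h` keeps the same dimension from step to step.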

(8)

LSTM

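c changes slowly: $c^t$ is $c^{t-1}$ plus something

h changes faster: $h^t$ and $h^{t-1}$ can be very different

[Diagram: Naive RNN block with inputs $h^{t-1}$, $x^t$ and outputs $h^t$, $y^t$; LSTM block with inputs $c^{t-1}$, $h^{t-1}$, $x^t$ and outputs $c^t$, $h^t$, $y^t$.]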

(9)

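[Diagram: LSTM block with inputs $x^t$, $h^{t-1}$, $c^{t-1}$; the vectors $z$, $z^i$, $z^f$, $z^o$ are computed from $x^t$ and $h^{t-1}$.]

$z = \tanh\left(W \begin{bmatrix} x^t \\ h^{t-1} \end{bmatrix}\right)$

$z^i = \sigma\left(W^i \begin{bmatrix} x^t \\ h^{t-1} \end{bmatrix}\right)$

$z^f = \sigma\left(W^f \begin{bmatrix} x^t \\ h^{t-1} \end{bmatrix}\right)$

$z^o = \sigma\left(W^o \begin{bmatrix} x^t \\ h^{t-1} \end{bmatrix}\right)$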

(10)

“peephole”

$z = \tanh\left(W \begin{bmatrix} x^t \\ h^{t-1} \\ c^{t-1} \end{bmatrix}\right)$, where the part of $W$ that multiplies $c^{t-1}$ is diagonal

$z^i$, $z^f$, $z^o$ are obtained in the same way

(11)

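$c^t = z^f \odot c^{t-1} + z^i \odot z$

$h^t = z^o \odot \tanh(c^t)$

$y^t = \sigma(W' h^t)$

[Diagram: LSTM block with inputs $x^t$, $h^{t-1}$, $c^{t-1}$ and outputs $y^t$, $h^t$, $c^t$, showing where $z$, $z^i$, $z^f$, $z^o$ enter the element-wise products and the sum.]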
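The LSTM equations can be put together into one step function. The sketch below is a minimal illustration in plain Python, assuming no bias, no peephole, and list-based vectors; the function names are hypothetical, and the output layer $y^t = \sigma(W' h^t)$ is left out for brevity.

```python
import math

def affine(W, v):
    """Multiply a matrix (list of rows) by a vector; no bias."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def lstm_step(x, h_prev, c_prev, W, W_i, W_f, W_o):
    xh = x + h_prev                       # concatenate x^t and h^{t-1}
    z = [math.tanh(a) for a in affine(W, xh)]
    z_i = sigmoid(affine(W_i, xh))        # input gate
    z_f = sigmoid(affine(W_f, xh))        # forget gate
    z_o = sigmoid(affine(W_o, xh))        # output gate
    # c^t = z^f * c^{t-1} + z^i * z (element-wise)
    c = [f * cp + i * zz for f, cp, i, zz in zip(z_f, c_prev, z_i, z)]
    # h^t = z^o * tanh(c^t) (element-wise)
    h = [o * math.tanh(cc) for o, cc in zip(z_o, c)]
    return h, c

# With all-zero weights every gate is 0.5 and z is 0,
# so the cell state is simply halved: c^t = 0.5 * c^{t-1}.
zeros = [[0.0, 0.0]]
h_t, c_t = lstm_step([1.0], [0.0], [2.0], zeros, zeros, zeros, zeros)
```

The all-zero example makes the "c changes slowly" point visible: the new cell state is a gated copy of the old one plus a gated update.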

(12)

LSTM

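[Diagram: two LSTM blocks chained over time. The block at step t takes $x^t$, $h^{t-1}$, $c^{t-1}$ and produces $y^t$, $h^t$, $c^t$; the block at step t+1 takes $x^{t+1}$, $h^t$, $c^t$ and produces $y^{t+1}$ and $c^{t+1}$.]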

(13)

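GRU

$h^t = z \odot h^{t-1} + (1 - z) \odot h'$

[Diagram: GRU block with inputs $x^t$, $h^{t-1}$ and outputs $h^t$, $y^t$. The reset gate $r$ and the update gate $z$ are computed from $x^t$ and $h^{t-1}$; the candidate $h'$ is computed from $x^t$ and $r \odot h^{t-1}$.]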
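A GRU step can be sketched the same way. The slide only spells out the final interpolation; the formulas used here for $r$, $z$ and $h'$ follow the common GRU formulation and are an assumption, as are the function names and the omission of biases.

```python
import math

def affine(W, v):
    """Multiply a matrix (list of rows) by a vector; no bias."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def gru_step(x, h_prev, W_r, W_z, W_h):
    xh = x + h_prev                          # concatenate x^t and h^{t-1}
    r = sigmoid(affine(W_r, xh))             # reset gate
    z = sigmoid(affine(W_z, xh))             # update gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    # Candidate h' (assumed form): tanh(W_h [x^t; r * h^{t-1}])
    h_cand = [math.tanh(a) for a in affine(W_h, x + gated)]
    # h^t = z * h^{t-1} + (1 - z) * h' (element-wise)
    return [zi * hp + (1.0 - zi) * hc
            for zi, hp, hc in zip(z, h_prev, h_cand)]

# With all-zero weights: r = z = 0.5 and h' = 0, so h^t = 0.5 * h^{t-1}.
zeros = [[0.0, 0.0]]
h_t = gru_step([1.0], [2.0], zeros, zeros, zeros)
```

Because $z$ and $1 - z$ share one gate, the GRU couples "how much to keep" and "how much to write", which is one reason it has fewer parameters than the LSTM.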

(14)

LSTM: A Search Space Odyssey

• Standard LSTM works well

• Simplified LSTM: coupling the input and forget gates, removing the peephole

• Forget gate is critical for performance

• Output gate activation function is critical

(16)

An Empirical Exploration of Recurrent Network Architectures

• Importance: forget > input > output

• Large bias for forget gate is helpful

LSTM-f/i/o: removing forget/input/output gates

LSTM-b: large bias

(18)

Neural Architecture Search with Reinforcement Learning

[Figure: a standard LSTM cell next to the cell found by reinforcement learning]

(19)

Sequence Generation

(20)

Generation

• Sentences are composed of characters/words

• Generating a character/word at each time step by RNN: $h', y = f(h, x)$

x: the token generated at the last time step (represented by 1-of-N encoding)

y: distribution over the tokens (sample from the distribution to generate a token)

Example with the vocabulary 你 (you), 我 (I), 他 (he), 是 (is), 很 (very), ……:

x = [1, 0, 0, 0, 0, ……, 0]

y = [0, 0, 0, 0.7, 0.3, ……, 0]

(21)

Generation

• Sentences are composed of characters/words

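• Generating a character/word at each time step by RNN

[Diagram: f unrolled over time, starting from $h^0$ with <BOS> as $x^1$; each sampled token becomes the next input.]

$y^1$: P(w|<BOS>), sample ~ 床 ("bed")

$y^2$: P(w|<BOS>,床), sample ~ 前 ("front")

$y^3$: P(w|<BOS>,床,前), sample ~ 明 ("bright")

Until <EOS> is generated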
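The generation loop (sample a token, feed it back, stop at <EOS>) can be sketched as follows. The function `toy_next_token_distribution` is a hypothetical stand-in for the trained RNN, and its probabilities are made up; only the loop structure is the point.

```python
import random

BOS, EOS = "<BOS>", "<EOS>"

def toy_next_token_distribution(prefix):
    """Hypothetical stand-in for the RNN: returns P(w | prefix) as a dict."""
    if len(prefix) >= 4:
        return {EOS: 1.0}            # force an end after three tokens
    return {"A": 0.5, "B": 0.3, EOS: 0.2}

def generate(model, rng, max_len=20):
    tokens = [BOS]
    while len(tokens) < max_len:
        dist = model(tokens)
        words = list(dist)
        weights = [dist[w] for w in words]
        # Sample from the distribution instead of taking the argmax.
        token = rng.choices(words, weights=weights, k=1)[0]
        if token == EOS:
            break
        tokens.append(token)         # the sampled token is the next input
    return tokens[1:]                # drop <BOS>

sentence = generate(toy_next_token_distribution, random.Random(0))
```

Passing a seeded `random.Random` keeps the sketch reproducible; a real generator would replace the toy function with one RNN step that also carries the hidden state.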

(22)

Generation

• Training

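Training data: 春眠不覺曉 (a line of classical Chinese poetry)

[Diagram: f unrolled over time. Inputs $x^1, x^2, x^3$: <BOS>, 春, 眠. Targets for $y^1, y^2, y^3$: 春, 眠, 不.]

Minimizing the cross-entropy between each output $y^t$ and its target token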
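As a small worked example of the training objective, the sketch below sums the cross-entropy over time steps for the first three characters of the training sentence. The predicted distributions are made-up numbers, and `sequence_cross_entropy` is a hypothetical helper name.

```python
import math

def sequence_cross_entropy(predicted, targets):
    """Sum over time steps of -log P(target token at that step)."""
    return sum(-math.log(dist[tok]) for dist, tok in zip(predicted, targets))

# Inputs are the reference shifted by one position; targets are the reference.
inputs = ["<BOS>", "春", "眠"]
targets = ["春", "眠", "不"]

# Assumed model outputs y^1, y^2, y^3 (each a distribution over three tokens).
predicted = [
    {"春": 0.5, "眠": 0.25, "不": 0.25},
    {"春": 0.25, "眠": 0.5, "不": 0.25},
    {"春": 0.25, "眠": 0.25, "不": 0.5},
]

loss = sequence_cross_entropy(predicted, targets)  # 3 * ln 2
```

With a 1-of-N target, the cross-entropy at each step reduces to the negative log-probability of the correct token, which is why only `dist[tok]` appears.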

(23)

Generation

• Images are composed of pixels

• Generating a pixel at each time step by RNN

Consider an image as a “sentence” of pixel colors: blue red yellow gray ……

Train an RNN based on the “sentences”

[Diagram: f unrolled over time, starting from <BOS>: sample ~ red; input red, sample ~ blue; input blue, sample ~ green; ……]

(24)

Generation

• Images are composed of pixels

3 x 3 images

(25)

Conditional Sequence Generation

(26)

Conditional Generation

• We don’t want to simply generate some random sentences.

• Generate sentences based on conditions:

Caption Generation: given condition: [image] → “A young girl is dancing.”

Chat-bot: given condition: “Hello” → “Hello. Nice to see you.”

(27)

Conditional Generation

• Represent the input condition as a vector, and consider the vector as the input of RNN generator

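Image Caption Generation

[Diagram: Input image → CNN → a vector, which is given to the RNN generator. Generator inputs: <BOS>, A, woman, ……; generator outputs: A, woman, ……, . (period)]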

(28)

Conditional Generation

• Represent the input condition as a vector, and consider the vector as the input of RNN generator

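• E.g. Machine translation / Chat-bot

Sequence-to-sequence learning: the encoder and the decoder are jointly trained

[Diagram: the encoder reads 機 器 學 習 ("machine learning"); its last state carries the information of the whole sentence and is passed to the decoder, which generates: machine, learning, . (period)]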

(29)

Conditional Generation

Need to consider longer context during chatting

[Diagram: example dialogues between a user (U) and the machine (M), built from the turns “U: Hi”, “M: Hi” and “M: Hello”]

Serban, Iulian V., Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau, 2015, “Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models”

https://www.youtube.com/watch?v=e2MpOmyQJw4

(30)

Dynamic Conditional Generation

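[Diagram: the encoder reads 機 器 學 習 ("machine learning") and produces $h^1, h^2, h^3, h^4$; the decoder receives a different context vector at each step: $c^1, c^2, c^3$.]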

(31)

Dynamic Conditional Generation

[Diagram: Encoder → information of the whole sentence → Decoder, for the input 機器學習 ("machine learning"); decoder context vectors $c^1, c^2, c^3$.]

(32)

Machine Translation

• Attention-based model

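[Diagram: the encoder reads 機 器 學 習 ("machine learning") and produces $h^1, h^2, h^3, h^4$; match takes $h^1$ and the vector $z^0$ and outputs the scalar $\alpha_0^1$.]

What is match? Design by yourself:

• Cosine similarity of z and h

• Small NN whose input is z and h, output a scalar

• $\alpha = h^T W z$

match is jointly learned with the other parts of the network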

(33)

Machine Translation

• Attention-based model

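[Diagram: the match scores $\alpha_0^1, \alpha_0^2, \alpha_0^3, \alpha_0^4$ for $h^1, h^2, h^3, h^4$ (encoder states of 機 器 學 習, "machine learning") pass through a softmax to give $\hat\alpha_0^1, \hat\alpha_0^2, \hat\alpha_0^3, \hat\alpha_0^4$ = 0.5, 0.5, 0.0, 0.0.]

$c^0 = \sum_i \hat\alpha_0^i h^i = 0.5 h^1 + 0.5 h^2$

$c^0$ is the decoder input: from $z^0$ and $c^0$ the decoder produces $z^1$ and the output word “machine”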
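The attention step (match, softmax, weighted sum) can be sketched in a few lines. Here match is a plain dot product, which corresponds to $\alpha = h^T W z$ with $W$ set to the identity; that choice, the toy vectors and the function names are assumptions for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(z, hs, match=dot):
    """Return normalized attention weights and the context vector c."""
    scores = [match(h, z) for h in hs]          # alpha^i = match(h^i, z)
    weights = softmax(scores)                   # alpha-hat^i
    dim = len(hs[0])
    context = [sum(w * h[d] for w, h in zip(weights, hs)) for d in range(dim)]
    return weights, context

# Toy encoder states: h^1, h^2 point one way, h^3, h^4 another.
hs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
z0 = [2.0, 0.0]
weights, c0 = attend(z0, hs)
```

With these toy values the first two states get the larger, equal weights, so `c0` is dominated by $h^1$ and $h^2$, in the spirit of the 0.5/0.5/0.0/0.0 example.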

(34)

Machine Translation

• Attention-based model

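[Diagram: after “machine” has been generated from $z^0$ and $c^0$, match is applied to the new state $z^1$ and each of $h^1, h^2, h^3, h^4$ (encoder states of 機 器 學 習, "machine learning"), giving $\alpha_1^1$, ……]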

(35)

Machine Translation

• Attention-based model

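[Diagram: with $z^1$, match gives $\alpha_1^1, \alpha_1^2, \alpha_1^3, \alpha_1^4$ for $h^1, h^2, h^3, h^4$ (encoder states of 機 器 學 習, "machine learning"); the softmax gives $\hat\alpha_1^1, \hat\alpha_1^2, \hat\alpha_1^3, \hat\alpha_1^4$ = 0.0, 0.0, 0.5, 0.5.]

$c^1 = \sum_i \hat\alpha_1^i h^i = 0.5 h^3 + 0.5 h^4$

From $z^1$ and $c^1$ the decoder produces $z^2$ and the output word “learning”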

(36)

Machine Translation

• Attention-based model

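[Diagram: after “machine” and “learning” have been generated, match is applied to $z^2$ and each of $h^1, h^2, h^3, h^4$ (encoder states of 機 器 學 習, "machine learning"), giving $\alpha_2^1$, ……]

The same process repeats until <EOS> is generated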

(37)

Speech Recognition

William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals, “Listen, Attend and Spell”, ICASSP, 2016

(38)

Image Caption Generation

[Diagram: a CNN with a set of filters produces a vector for each image region; match compares $z^0$ with a region vector and outputs a score, e.g. 0.7.]

(39)

Image Caption Generation

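[Diagram: CNN → a vector for each region; attention weights 0.7, 0.1, 0.1, 0.1, 0.0, 0.0 over the regions, obtained with $z^0$; the weighted sum of the region vectors is used to produce $z^1$ and Word 1.]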

(40)

Image Caption Generation

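[Diagram: CNN → a vector for each region; after $z^0$ → $z^1$ and Word 1, the attention weights become 0.0, 0.8, 0.2, 0.0, 0.0, 0.0; the new weighted sum is used to produce $z^2$ and Word 2.]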

(41)

Image Caption Generation

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015

(43)

Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville, “Describing Videos by Exploiting Temporal Structure”, ICCV, 2015

(44)

Question Answering

• Given a document and a query, output an answer

• bAbI: the answer is a word

• https://research.fb.com/downloads/babi/

• SQuAD: the answer is a sequence of words (in the input document)

• https://rajpurkar.github.io/SQuAD-explorer/

• MS MARCO: the answer is a sequence of words

• http://www.msmarco.org

• MovieQA: Multiple choice question (output a number)

• http://movieqa.cs.toronto.edu/home/

(45)

Memory Network

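[Diagram: Document sentences $x^1, x^2, x^3, ……, x^N$; Query → vector $q$; Match between $q$ and each $x^n$ gives $\alpha_1, \alpha_2, \alpha_3, ……, \alpha_N$; the extracted information and $q$ go into a DNN, which outputs the Answer.]

Extracted information $= \sum_{n=1}^{N} \alpha_n x^n$

Sentence to vector can be jointly trained.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus, “End-To-End Memory Networks”, NIPS, 2015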

(46)

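Memory Network

[Diagram: Match between the query vector $q$ and the document sentences $x^1, x^2, x^3, ……, x^N$ gives $\alpha_1, \alpha_2, \alpha_3, ……, \alpha_N$; the document is also encoded as a second, jointly learned set of vectors $h^1, h^2, h^3, ……, h^N$, from which the information is extracted; a DNN outputs the Answer.]

Extracted information $= \sum_{n=1}^{N} \alpha_n h^n$

Hopping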

(47)

Memory Network

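[Diagram: hopping. Starting from $q$, each hop computes attention over the document and extracts information as a weighted sum; the result is passed on to the next hop, which computes attention and extracts information again; after the last hop a DNN outputs the answer $a$.]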

(48)

Reading Comprehension

• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. NIPS, 2015.

The position of reading head: [figure]

Keras has an example:

https://github.com/fchollet/keras/blob/master/examples/babi_memnn.py

(49)

Wei Fang, Juei-Yang Hsu, Hung-yi Lee, Lin-Shan Lee, “Hierarchical Attention Model for Improved Machine Comprehension of Spoken Content”, SLT, 2016

(50)

Neural Turing Machine

• von Neumann architecture

https://www.quora.com/How-does-the-Von-Neumann-architecture-provide-flexibility-for-program-development

Neural Turing Machine not only reads from the memory, but also modifies the memory through attention

(51)

Neural Turing Machine

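Retrieval process: $r^0 = \sum_i \hat\alpha_0^i m_0^i$

[Diagram: memory $m_0^1, m_0^2, m_0^3, m_0^4$ with attention weights $\hat\alpha_0^1, \hat\alpha_0^2, \hat\alpha_0^3, \hat\alpha_0^4$; f takes $h^0$, $x^1$ and $r^0$ and outputs $y^1$.]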

(52)

Neural Turing Machine

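$r^0 = \sum_i \hat\alpha_0^i m_0^i$

f (inputs $h^0$, $x^1$, $r^0$; output $y^1$) also outputs $k^1$, $e^1$, $a^1$

$\alpha_1^i = \cos(m_0^i, k^1)$

$\hat\alpha_1^1, \hat\alpha_1^2, \hat\alpha_1^3, \hat\alpha_1^4$ = softmax($\alpha_1^1, \alpha_1^2, \alpha_1^3, \alpha_1^4$)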

(53)

Neural Turing Machine

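$m_1^i = m_0^i - \hat\alpha_1^i \, e^1 \odot m_0^i + \hat\alpha_1^i \, a^1$ (element-wise)

The entries of $e^1$ are in 0 ~ 1

[Diagram: the memory $m_0^1, m_0^2, m_0^3, m_0^4$ (read with $\hat\alpha_0^1, …, \hat\alpha_0^4$) is updated with $e^1$, $a^1$ and the new weights $\hat\alpha_1^1, …, \hat\alpha_1^4$ into $m_1^1, m_1^2, m_1^3, m_1^4$.]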
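The read and write rules of the Neural Turing Machine can be sketched as two small functions. This is an illustration only: memory slots are plain lists, the attention weights are given rather than computed from $k^1$, and the names `ntm_read` and `ntm_write` are hypothetical.

```python
def ntm_read(memory, weights):
    """r = sum_i alpha_hat^i * m^i (weighted sum of memory slots)."""
    dim = len(memory[0])
    return [sum(w * slot[d] for w, slot in zip(weights, memory))
            for d in range(dim)]

def ntm_write(memory, weights, erase, add):
    """m_new^i = m^i - alpha_hat^i * (e * m^i) + alpha_hat^i * a, element-wise."""
    return [
        [m - w * e * m + w * a for m, e, a in zip(slot, erase, add)]
        for slot, w in zip(memory, weights)
    ]

memory = [[1.0, 2.0], [3.0, 4.0]]
weights = [1.0, 0.0]      # attend only to the first slot
erase = [1.0, 0.0]        # entries in 0 ~ 1: erase the first dimension only
add = [0.5, 0.5]

r0 = ntm_read(memory, weights)
new_memory = ntm_write(memory, weights, erase, add)
```

A slot with attention weight 0 is left untouched, while a fully attended slot is first erased where `erase` is 1 and then has `add` written into it.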

(54)

Neural Turing Machine

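[Diagram: the process repeated over time. Step 1: f takes $h^0$, $x^1$ and $r^0$ (read from $m_0^1, …, m_0^4$ with $\hat\alpha_0^1, …, \hat\alpha_0^4$), outputs $y^1$, and updates the memory to $m_1^1, …, m_1^4$ with new weights $\hat\alpha_1^1, …, \hat\alpha_1^4$. Step 2: f takes $h^1$, $x^2$ and $r^1$, outputs $y^2$, and updates the memory to $m_2^1, …, m_2^4$ with $\hat\alpha_2^1, …, \hat\alpha_2^4$.]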

(55)

Neural Turing Machine

Wei-Jen Ko, Bo-Hsiang Tseng, Hung-yi Lee, “Recurrent Neural Network based Language Modeling with Controllable External Memory”, ICASSP, 2017

(56)

Stack RNN

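Armand Joulin, Tomas Mikolov, “Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets”, arXiv preprint, 2015

[Diagram: f takes $x^t$ and outputs $y^t$, the information to store, and a distribution over the actions Push, Pop, Nothing, e.g. 0.7, 0.2, 0.1. The new stack is the weighted sum of the stacks that result from each action, each multiplied by the probability of that action.]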

(57)

Tips for Generation

(58)

Attention

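$\alpha_t^i$: attention weight on component $i$ at time $t$

[Diagram: attention weights $\alpha_t^1, \alpha_t^2, \alpha_t^3, \alpha_t^4$ over four input components at time steps t = 1, 2, 3, 4, generating the words $w_1, w_2, w_3, w_4$.]

Bad Attention: the same component is attended at several time steps, e.g. the generated words are “(woman) (woman) ……” with no “cooking”

Good Attention: each input component has approximately the same attention weight

E.g. Regularization term: $\sum_i \left(\tau - \sum_t \alpha_t^i\right)^2$ (outer sum: for each component; inner sum: over the generation)

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015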
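The regularization term can be computed directly from a table of attention weights. In this sketch `alpha[t][i]` holds the weight on component i at time step t; the two toy tables and the default $\tau = 1$ are assumed values.

```python
def attention_regularizer(alpha, tau=1.0):
    """sum_i (tau - sum_t alpha[t][i])^2 for attention weights alpha[t][i]."""
    n_components = len(alpha[0])
    # Total attention each component receives over the whole generation.
    totals = [sum(step[i] for step in alpha) for i in range(n_components)]
    return sum((tau - total) ** 2 for total in totals)

# Good attention: every component is attended once over the generation.
good = [[1.0, 0.0],
        [0.0, 1.0]]
# Bad attention: the first component is attended twice, the second never.
bad = [[1.0, 0.0],
       [1.0, 0.0]]

good_penalty = attention_regularizer(good)   # 0.0
bad_penalty = attention_regularizer(bad)     # (1 - 2)^2 + (1 - 0)^2 = 2.0
```

Adding this penalty to the training loss pushes the model to spread its attention so that no input component is ignored or attended over and over.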

(59)

Mismatch between Train and Test

• Training

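Reference: A B B

$C = \sum_t C_t$

Minimizing cross-entropy of each component

[Diagram: the generator, given a condition, is trained on the reference A B B. Its inputs are <BOS>, A, B (taken from the reference), and each output distribution over {A, B} is compared with the reference token at that step.]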

(60)

Mismatch between Train and Test

• Generation

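[Diagram: at generation time, starting from <BOS>, the input at each step is the token sampled from the previous output distribution over {A, B}.]

We do not know the reference

Testing: The inputs are the outputs of the last time step.

Training: The inputs are the reference.

This mismatch is called Exposure Bias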

(61)

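[Diagram: two copies of the tree of all possible sequences over {A, B}. In training the model only follows the reference path and never explores the other branches. In testing, one wrong step leads into an unexplored branch, and the rest may be totally wrong.]

One wrong step, and every step after it goes wrong (一步錯，步步錯)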

(62)

Modifying Training Process?

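[Diagram: training with the model's own outputs as the inputs of the next step, starting from <BOS>, and comparing the outputs with the reference A B B.]

Training is matched to testing.

In practice, it is hard to train in this way.

When we try to decrease the loss for both steps 1 and 2 …..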

(63)

Scheduled Sampling

[Diagram: at each step the decoder input comes either from the model (the token it generated at the previous step) or from the reference. Reference: A B B.]
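The choice between "from model" and "from reference" can be sketched as a coin flip at every step. The function below is a simplified illustration: `model_step` is a hypothetical stand-in that maps the previous input token to the model's predicted token, and `p_reference` is the probability of feeding the reference token.

```python
import random

def scheduled_sampling_inputs(reference, model_step, p_reference, rng):
    """Build the decoder inputs for one training sequence."""
    inputs = []
    prev = "<BOS>"
    for ref_token in reference:
        inputs.append(prev)
        predicted = model_step(prev)
        # Coin flip: next input from the reference or from the model.
        prev = ref_token if rng.random() < p_reference else predicted
    return inputs

reference = ["A", "B", "B"]
always_a = lambda token: "A"      # toy model that always predicts "A"

from_reference = scheduled_sampling_inputs(reference, always_a, 1.0, random.Random(0))
from_model = scheduled_sampling_inputs(reference, always_a, 0.0, random.Random(0))
```

In the cited paper the probability of using the reference starts high and is decreased on a schedule during training, which moves the model gradually from the training condition toward the testing condition.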

(64)

Scheduled Sampling

• Caption generation on MSCOCO

                         BLEU-4   METEOR   CIDER
Always from reference     28.8     24.2     89.5
Always from model         11.2     15.7     49.7
Scheduled Sampling        30.6     24.3     92.1

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer, Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, arXiv preprint, 2015

(65)

Beam Search

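[Diagram: tree of sequences over {A, B} with a probability on each branch (values shown: 0.4, 0.6, 0.9). Picking the most probable token at every step does not necessarily give the best-scoring path.]

The green path has higher score.

Not possible to check all the paths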

(66)

Beam Search

[Diagram: the same tree over {A, B}, with only the two best paths kept at each step.]

Keep the several best paths at each step

Beam size = 2
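Beam search can be sketched as follows. The probability table in `toy_next_probs` is made up (it is not the tree on the slide) and was chosen so that greedy decoding, which is beam size 1, misses the best path while beam size 2 finds it.

```python
import math

def toy_next_probs(seq):
    """Hypothetical model: P(next token | sequence so far)."""
    if seq and seq[-1] == "B":
        return {"A": 0.1, "B": 0.9}
    return {"A": 0.6, "B": 0.4}

def beam_search(next_probs, beam_size, length):
    beams = [((), 0.0)]                      # (sequence, log-probability)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for token, p in next_probs(seq).items():
                candidates.append((seq + (token,), score + math.log(p)))
        # Keep only the best `beam_size` paths at each step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams

greedy_best = beam_search(toy_next_probs, beam_size=1, length=3)[0][0]
beam_best = beam_search(toy_next_probs, beam_size=2, length=3)[0][0]
```

Greedy decoding commits to "A" (0.6) at the first step and ends with A A A (0.216), whereas beam size 2 keeps the "B" path alive and ends with B B B (0.4 × 0.9 × 0.9 = 0.324).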

(67)

Beam Search

https://github.com/tensorflow/tensorflow/issues/654#issuecomment-169009989

The size of beam is 3 in this example.

(68)

Better Idea?

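[Diagram: instead of sampling a single token, the whole output distribution of one step is used as the input of the next step, starting from <BOS>.]

Example: the first step gives “I” ≈ “you” (both have a high score) and the second step gives “am” ≈ “are”. The candidate sentences are then:

I am ……

You are ……

I are ……

You am ……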

(69)

Object level vs. Component level

• Minimizing the error defined on component level is not equivalent to improving the generated objects

$C = \sum_t C_t$ (cross-entropy of each step)

Ref: The dog is running fast

Generated sentences:

A cat a a a

The dog is is fast

The dog is running fast

Optimize object-level criterion instead of component-level cross-entropy.

Object-level criterion: $R(y, \hat{y})$, where $y$ is the generated utterance and $\hat{y}$ is the ground truth

Gradient Descent?

(70)

Reinforcement learning?

Start with observation $s_1$ → Action $a_1$: “right” → Obtain reward $r_1 = 0$

Observation $s_2$ → Action $a_2$: “fire” (kill an alien) → Obtain reward $r_2 = 5$

Observation $s_3$ → ……

(71)

Reinforcement learning?

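Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba, “Sequence Level Training with Recurrent Neural Networks”, ICLR, 2016

[Diagram: the generator seen as an agent. Observation: the input at each step, starting from <BOS>; action set: the tokens {A, B}; action taken: the generated token, e.g. B, A, A.]

The action we take influences the observation in the next step

Reward: $r = 0$ at the intermediate steps; at the end, R(“BAA”, reference)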

(72)