(1)

Gated RNN & Sequence Generation

Hung-yi Lee

李宏毅

(2)

Outline

• RNN with Gated Mechanism

• Sequence Generation

• Conditional Sequence Generation

• Tips for Generation

(3)

RNN with Gated Mechanism

(4)

Recurrent Neural Network

• Given function f: $h', y = f(h, x)$

[Diagram: the same f is applied at every step: (h0, x1) → (h1, y1), (h1, x2) → (h2, y2), (h2, x3) → (h3, y3), ……]

No matter how long the input/output sequence is, we only need one function f

h and h’ are vectors with the same dimension

(5)

Deep RNN

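$h', y = f_1(h, x)$, $b', c = f_2(b, y)$, …

[Diagram: the first layer applies f1 along the sequence: (h0, x1) → (h1, y1), (h1, x2) → (h2, y2), (h2, x3) → (h3, y3), ……; the second layer applies f2 to the outputs of the first: (b0, y1) → (b1, c1), (b1, y2) → (b2, c2), (b2, y3) → (b3, c3), ……]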

(6)

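Bidirectional RNN

$h', a = f_1(h, x)$, $b', c = f_2(b, x)$, $y = f_3(a, c)$

[Diagram: f1 reads x1, x2, x3 in one direction, producing h1, h2, h3 and a1, a2, a3; f2 reads the same inputs in the opposite direction, producing b1, b2, b3 and c1, c2, c3; at each position f3 combines a and c into y1, y2, y3.]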

(7)

Naïve RNN

• Given function f: $h', y = f(h, x)$

$h' = \sigma(W^h h + W^i x)$

$y = \mathrm{softmax}(W^o h')$

Ignore bias here
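The naive RNN update on this slide can be sketched in a few lines of NumPy. This is only an illustrative sketch: the layer sizes, the random weights, and the helper names (`rnn_step`, `sigmoid`, `softmax`) are assumptions and do not come from the slides. The same function and the same weights are reused at every time step.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def softmax(v):
    e = np.exp(v - np.max(v))  # subtract the max for numerical stability
    return e / e.sum()

def rnn_step(h, x, W_h, W_i, W_o):
    """One step of f: h' = sigma(W_h h + W_i x), y = softmax(W_o h'). Bias ignored."""
    h_next = sigmoid(W_h @ h + W_i @ x)
    y = softmax(W_o @ h_next)
    return h_next, y

# Illustrative sizes and random weights (assumptions, not from the slides).
rng = np.random.default_rng(0)
hidden, n_in, n_out = 4, 3, 5
W_h = rng.normal(size=(hidden, hidden))
W_i = rng.normal(size=(hidden, n_in))
W_o = rng.normal(size=(n_out, hidden))

h = np.zeros(hidden)                   # h0
outputs = []
for x in rng.normal(size=(6, n_in)):   # a toy input sequence of length 6
    h, y = rnn_step(h, x, W_h, W_i, W_o)
    outputs.append(y)
```

Because h and h' have the same dimension, the loop works for any sequence length.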

(8)

LSTM

c changes slowly: $c^t$ is $c^{t-1}$ plus something

h changes faster: $h^t$ and $h^{t-1}$ can be very different

[Diagram: the naive RNN maps ($h^{t-1}$, $x^t$) to ($h^t$, $y^t$); the LSTM maps ($c^{t-1}$, $h^{t-1}$, $x^t$) to ($c^t$, $h^t$, $y^t$).]

(9)

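$z = \tanh(W [x^t; h^{t-1}])$

$z^i = \sigma(W^i [x^t; h^{t-1}])$

$z^f = \sigma(W^f [x^t; h^{t-1}])$

$z^o = \sigma(W^o [x^t; h^{t-1}])$

Here $[x^t; h^{t-1}]$ is the concatenation of $x^t$ and $h^{t-1}$.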

(10)

“peephole”

$z = \tanh(W [x^t; h^{t-1}; c^{t-1}])$, where the part of W that multiplies $c^{t-1}$ is diagonal

$z^i$, $z^f$, $z^o$ are obtained in the same way

(11)

$c^t = z^f \odot c^{t-1} + z^i \odot z$

$h^t = z^o \odot \tanh(c^t)$

$y^t = \sigma(W' h^t)$
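The LSTM gate and update equations from the preceding slides fit into one step function. The sketch below is a minimal NumPy version without peepholes and without biases; the sizes, the random weights, and the names `lstm_step` and `W_y` (the output matrix) are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, W, W_i, W_f, W_o, W_y):
    """One LSTM step without peephole connections or biases."""
    xh = np.concatenate([x, h_prev])   # [x^t; h^{t-1}]
    z = np.tanh(W @ xh)                # candidate information
    z_i = sigmoid(W_i @ xh)            # input gate
    z_f = sigmoid(W_f @ xh)            # forget gate
    z_o = sigmoid(W_o @ xh)            # output gate
    c = z_f * c_prev + z_i * z         # c^t: gated copy of c^{t-1} plus gated new input
    h = z_o * np.tanh(c)               # h^t
    y = sigmoid(W_y @ h)               # y^t
    return h, c, y

# Illustrative sizes and random weights (assumptions, not from the slides).
rng = np.random.default_rng(0)
n_in, hidden, n_out = 3, 4, 2
W, W_i, W_f, W_o = (rng.normal(size=(hidden, n_in + hidden)) for _ in range(4))
W_y = rng.normal(size=(n_out, hidden))

h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(5, n_in)):   # toy input sequence of length 5
    h, c, y = lstm_step(x, h, c, W, W_i, W_f, W_o, W_y)
```

The cell state c is only rescaled by the forget gate and added to, which is why it changes more slowly than h.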

(12)

LSTM

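[Diagram: two LSTM blocks chained over time. The block at step t takes $x^t$, $h^{t-1}$, $c^{t-1}$ and outputs $y^t$, $h^t$, $c^t$; the block at step t+1 takes $x^{t+1}$, $h^t$, $c^t$ and outputs $y^{t+1}$, $h^{t+1}$, $c^{t+1}$.]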

(13)

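GRU

r: reset gate, z: update gate

$h^t = z \odot h^{t-1} + (1 - z) \odot h'$

[Diagram: r and z are computed from $x^t$ and $h^{t-1}$; the candidate h' is computed from $x^t$ and $r \odot h^{t-1}$; $y^t$ is read out from $h^t$.]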
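A GRU step following the update rule on this slide might look like the sketch below. It uses the slide's convention that z weights the old state and (1 − z) weights the candidate h'. The weight shapes, the tanh used for the candidate, and the function name `gru_step` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h_prev, W_r, W_z, W_h):
    """One GRU step: h^t = z * h^{t-1} + (1 - z) * h' (biases ignored)."""
    xh = np.concatenate([x, h_prev])
    r = sigmoid(W_r @ xh)                                     # reset gate
    z = sigmoid(W_z @ xh)                                     # update gate
    h_cand = np.tanh(W_h @ np.concatenate([x, r * h_prev]))   # candidate h'
    return z * h_prev + (1.0 - z) * h_cand

# Illustrative sizes and random weights (assumptions, not from the slides).
rng = np.random.default_rng(0)
n_in, hidden = 3, 4
W_r, W_z, W_h = (rng.normal(size=(hidden, n_in + hidden)) for _ in range(3))

h = np.zeros(hidden)
for x in rng.normal(size=(5, n_in)):   # toy input sequence of length 5
    h = gru_step(x, h, W_r, W_z, W_h)
```

Compared with the LSTM sketch, there is no separate cell state and only two gates, so the step has fewer weight matrices.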

(14)

LSTM: A Search Space Odyssey

(15)

LSTM: A Search Space Odyssey

• Standard LSTM works well

• Simplified LSTM: coupling the input and forget gates, removing peephole

• Forget gate is critical for performance

• Output gate activation function is critical

(16)

An Empirical Exploration of Recurrent Network Architectures

• Importance: forget > input > output

• Large bias for forget gate is helpful

• LSTM-f/i/o: removing forget/input/output gates

• LSTM-b: large bias

(17)

An Empirical Exploration of Recurrent Network Architectures

(18)

Neural Architecture Search with Reinforcement Learning

[Figure: a standard LSTM cell next to a cell found by reinforcement learning.]

(19)

Sequence Generation

(20)

Generation

• Sentences are composed of characters/words

• Generate one character/word at each time step with an RNN

[Diagram: f maps (h, x) to (h', y).]

x: the token generated at the last time step (represented by 1-of-N encoding), e.g. [1, 0, 0, 0, 0, ……, 0] over the vocabulary 你 (you), 我 (I), 他 (he), 是 (is), 很 (very), ……

y: distribution over the tokens (sampling from the distribution to generate a token), e.g. [0, 0, 0, 0.7, 0.3, ……, 0]

(21)

Generation

• Sentences are composed of characters/words

• Generate one character/word at each time step with an RNN

[Diagram: x1 = <BOS>, y1: P(w|<BOS>), sample 床 (bed); x2 = 床, y2: P(w|<BOS>,床), sample 前 (front); x3 = 前, y3: P(w|<BOS>,床,前), sample 明 (bright); ……]

Until <EOS> is generated
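The generation loop (feed <BOS>, sample a token from the output distribution, feed it back, stop at <EOS>) can be sketched as follows. The function `toy_model` is a hypothetical stand-in for the trained RNN, and its probabilities and the three-token vocabulary are made up for illustration.

```python
import random

VOCAB = ["<EOS>", "A", "B"]

def toy_model(prefix):
    """Hypothetical stand-in for the RNN: returns P(w | prefix) over VOCAB."""
    if len(prefix) >= 4:            # <BOS> plus three tokens: force <EOS>
        return [1.0, 0.0, 0.0]
    if prefix[-1] == "A":
        return [0.2, 0.2, 0.6]
    return [0.1, 0.7, 0.2]          # after <BOS> or B

def generate(model, rng, max_len=20):
    """Sample one token per step until <EOS> is generated."""
    prefix = ["<BOS>"]
    while len(prefix) <= max_len:
        probs = model(prefix)
        token = rng.choices(VOCAB, weights=probs, k=1)[0]   # sample from y
        if token == "<EOS>":
            break
        prefix.append(token)        # the sampled token is the next input
    return prefix[1:]

sentence = generate(toy_model, random.Random(0))
```

Sampling from the distribution instead of always taking the most likely token is what lets the same model produce different sentences.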

(22)

Generation

• Training

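Training data: 春 眠 不 覺 曉 (a line of classical Chinese poetry)

[Diagram: the inputs x1, x2, x3 are <BOS>, 春, 眠; the targets for y1, y2, y3 are 春, 眠, 不; ……]

Training minimizes the cross-entropy between each output y and its target.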

(23)

Generation

• Images are composed of pixels

• Generate one pixel at each time step with an RNN

Consider an image as a “sentence” of pixel colors: blue red yellow gray ……

Train an RNN on these “sentences”

[Diagram: x1 = <BOS>, sample red from y1; x2 = red, sample blue from y2; x3 = blue, sample green from y3; ……]

(24)

Generation

• Images are composed of pixels

3 x 3 images

(25)

Conditional Sequence Generation

(26)

Conditional Generation

• We don’t want to simply generate some random sentences.

• Generate sentences based on conditions:

Caption Generation: given condition: [an image] → “A young girl is dancing.”

Chat-bot: given condition: “Hello” → “Hello. Nice to see you.”

(27)

Conditional Generation

• Represent the input condition as a vector, and consider the vector as the input of RNN generator

Image Caption Generation

[Diagram: Input image → CNN → a vector, which is fed to the RNN generator; starting from <BOS>, the generator outputs “A”, “woman”, ……, “.” (period).]

(28)

Conditional Generation

• Represent the input condition as a vector, and consider the vector as the input of RNN generator

• E.g. Machine translation / Chat-bot

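[Diagram: the encoder reads 機 器 學 習 (“machine learning”) and produces a vector holding the information of the whole sentence; the decoder generates “machine”, “learning”, “.” (period) from it.]

Encoder and decoder are jointly trained.

Sequence-to-sequence learning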

(29)

Conditional Generation

[Diagram: a dialogue between user (U) and machine (M) with the turns “U: Hi”, “M: Hi”, “M: Hello”.]

Serban, Iulian V., Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau, 2015, "Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models."

https://www.youtube.com/watch?v=e2MpOmyQJw4

Need to consider longer context during chatting

(30)

Dynamic Conditional Generation

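[Diagram: the encoder turns 機 器 學 習 into $h^1$, $h^2$, $h^3$, $h^4$; the decoder receives a different context $c^1$, $c^2$, $c^3$ at each step while generating “machine”, “learning”, “.” (period).]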

(31)

Dynamic Conditional Generation

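[Diagram: the encoder output for 機 器 學 習 holds the information of the whole sentence; the decoder receives $c^1$, $c^2$, $c^3$ at successive steps while generating “machine”, “learning”, “.” (period).]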

(32)

Machine Translation

• Attention-based model

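[Diagram: the encoder produces $h^1$, $h^2$, $h^3$, $h^4$ for 機 器 學 習; match takes $h^1$ and $z^0$ and outputs the scalar $\alpha_0^1$.]

What is match? Design by yourself:

• Cosine similarity of z and h

• Small NN whose input is z and h, output a scalar

• $\alpha = h^T W z$

The parameters of match are jointly learned with the other parts of the network.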

(33)

Machine Translation

• Attention-based model

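[Diagram: the match scores $\alpha_0^1, \alpha_0^2, \alpha_0^3, \alpha_0^4$ for 機 器 學 習 pass through a softmax to give $\hat{\alpha}_0^1, \hat{\alpha}_0^2, \hat{\alpha}_0^3, \hat{\alpha}_0^4$, e.g. 0.5, 0.5, 0.0, 0.0.]

$c^0 = \sum_i \hat{\alpha}_0^i h^i = 0.5 h^1 + 0.5 h^2$

$c^0$ is the decoder input: from $z^0$ and $c^0$ the decoder computes $z^1$ and outputs “machine”.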

(34)

Machine Translation

• Attention-based model

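[Diagram: after “machine” is generated from $z^0$ and $c^0$, the new state $z^1$ is matched against $h^1$ to give $\alpha_1^1$, and likewise for $h^2$, $h^3$, $h^4$.]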

(35)

Machine Translation

• Attention-based model

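[Diagram: the match scores $\alpha_1^1, \alpha_1^2, \alpha_1^3, \alpha_1^4$ pass through a softmax to give $\hat{\alpha}_1^1, \hat{\alpha}_1^2, \hat{\alpha}_1^3, \hat{\alpha}_1^4$, e.g. 0.0, 0.0, 0.5, 0.5.]

$c^1 = \sum_i \hat{\alpha}_1^i h^i = 0.5 h^3 + 0.5 h^4$

From $z^1$ and $c^1$ the decoder computes $z^2$ and outputs “learning”.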

(36)

Machine Translation

• Attention-based model

[Diagram: $z^2$ is matched against $h^1$, $h^2$, $h^3$, $h^4$ to give $\alpha_2^1$, ……]

The same process repeats until <EOS> is generated.
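One attention step (match, softmax, weighted sum) can be sketched as below, using the bilinear match $\alpha = h^T W z$ listed among the options for match. The function name `attend`, the dimensions, and the random values are illustrative assumptions; the other match functions would slot into the same place.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))   # subtract the max for numerical stability
    return e / e.sum()

def attend(H, z, W):
    """H: (n, d_h) encoder outputs h^1..h^n as rows; z: (d_z,) decoder state."""
    alpha = H @ W @ z            # match: alpha^i = (h^i)^T W z
    alpha_hat = softmax(alpha)   # normalized attention weights
    c = alpha_hat @ H            # context: c = sum_i alpha_hat^i h^i
    return c, alpha_hat

# Illustrative sizes and random values (assumptions, not from the slides).
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))      # four encoder outputs, e.g. one per input character
z = rng.normal(size=2)           # current decoder state
W = rng.normal(size=(3, 2))      # learned jointly with the rest of the network
c, alpha_hat = attend(H, z, W)
```

The decoder would call `attend` once per output token with its latest state, so each token gets its own context vector.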

(37)

Speech Recognition

William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals, “Listen, Attend and Spell”, ICASSP, 2016

(38)

Image Caption Generation

[Diagram: a CNN applies its filters over the image and produces a vector for each region; $z^0$ is matched against each region vector, e.g. giving a match score of 0.7 for one region.]

(39)

Image Caption Generation

[Diagram: the CNN produces a vector for each region; the attention weights over the regions (0.7, 0.1, 0.1, 0.1, 0.0, 0.0) give a weighted sum of the region vectors, which is fed to the decoder: $z^0$ → $z^1$, producing Word 1.]

(40)

Image Caption Generation

[Diagram: the CNN produces a vector for each region; with $z^1$ the attention weights over the regions become (0.0, 0.8, 0.2, 0.0, 0.0, 0.0); the weighted sum is fed to the decoder: $z^1$ → $z^2$, producing Word 2 after Word 1.]

(41)

Image Caption Generation

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015

(42)

Image Caption Generation

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015

(43)

Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville, “Describing Videos by Exploiting Temporal Structure”, ICCV, 2015

(44)

Question Answering

• Given a document and a query, output an answer

• bAbI: the answer is a word

• https://research.fb.com/downloads/babi/

• SQuAD: the answer is a sequence of words (in the input document)

• https://rajpurkar.github.io/SQuAD-explorer/

• MS MARCO: the answer is a sequence of words

• http://www.msmarco.org

• MovieQA: Multiple choice question (output a number)

• http://movieqa.cs.toronto.edu/home/

(45)

Memory Network

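[Diagram: the query is encoded as a vector q; each sentence of the document is encoded as a vector $x^1, x^2, x^3, ……, x^N$; Match(q, $x^n$) gives the attention weight $\alpha_n$; the extracted information and q are fed to a DNN, which outputs the answer.]

Extracted information $= \sum_{n=1}^{N} \alpha_n x^n$

Sentence to vector can be jointly trained.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus, “End-To-End Memory Networks”, NIPS, 2015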

(46)

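Memory Network

[Diagram: Match(q, $x^n$) still gives the attention weights $\alpha_1, \alpha_2, \alpha_3, ……, \alpha_N$, but the information is extracted from a second set of sentence vectors $h^1, h^2, h^3, ……, h^N$; both sets of vectors are jointly learned. The extracted information and q are fed to a DNN, which outputs the answer.]

Extracted information $= \sum_{n=1}^{N} \alpha_n h^n$

Hopping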

(47)

Memory Network

[Diagram: q → compute attention → extract information → compute attention → extract information → DNN → answer a.]

(48)

Reading Comprehension

• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. NIPS, 2015.

The position of reading head:

Keras has an example:

https://github.com/fchollet/keras/blob/master/examples/babi_memnn.py

(49)

Wei Fang, Juei-Yang Hsu, Hung-yi Lee, Lin-Shan Lee, "Hierarchical Attention Model for Improved Machine Comprehension of Spoken Content", SLT, 2016

(50)

Neural Turing Machine

• von Neumann architecture

https://www.quora.com/How-does-the-Von-Neumann-architecture-provide-flexibility-for-program-development

A Neural Turing Machine not only reads from memory; it also modifies the memory through attention.

(51)

Neural Turing Machine

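Retrieval process

$r^0 = \sum_i \hat{\alpha}_0^i m_0^i$

[Diagram: the memory holds $m_0^1, m_0^2, m_0^3, m_0^4$ with attention weights $\hat{\alpha}_0^1, \hat{\alpha}_0^2, \hat{\alpha}_0^3, \hat{\alpha}_0^4$; f takes $h^0$, $x^1$ and $r^0$ and outputs $y^1$.]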

(52)

Neural Turing Machine

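$r^0 = \sum_i \hat{\alpha}_0^i m_0^i$

[Diagram: f takes $h^0$, $x^1$ and $r^0$ and outputs $y^1$ together with three vectors $k^1$, $e^1$, $a^1$.]

$\alpha_1^i = \cos(m_0^i, k^1)$

$\hat{\alpha}_1^1, \hat{\alpha}_1^2, \hat{\alpha}_1^3, \hat{\alpha}_1^4 = \mathrm{softmax}(\alpha_1^1, \alpha_1^2, \alpha_1^3, \alpha_1^4)$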

(53)

Neural Turing Machine

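$m_1^i = m_0^i - \hat{\alpha}_1^i \, e^1 \odot m_0^i + \hat{\alpha}_1^i \, a^1$ (element-wise; each entry of $e^1$ is in 0 ~ 1)

[Diagram: the memory $m_0^1, m_0^2, m_0^3, m_0^4$ is updated to $m_1^1, m_1^2, m_1^3, m_1^4$ using the new attention weights $\hat{\alpha}_1^1, \hat{\alpha}_1^2, \hat{\alpha}_1^3, \hat{\alpha}_1^4$ and the vectors $k^1$, $e^1$, $a^1$.]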
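The addressing, erase/add write, and read steps of the Neural Turing Machine slides can be sketched together. The function names (`ntm_write`, `ntm_read`), the small epsilon in the cosine, and the toy memory values are assumptions for illustration; in the real model k, e, and a are produced by the controller f.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def cosine(u, v):
    # Small epsilon (an assumption) avoids division by zero for all-zero vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def ntm_write(M, k, e, a):
    """M: (n, d) memory with rows m^i; k: key; e: erase vector in [0, 1]; a: add vector."""
    alpha = np.array([cosine(m, k) for m in M])   # alpha^i = cos(m^i, k)
    alpha_hat = softmax(alpha)
    w = alpha_hat[:, None]                        # one weight per memory row
    M_new = M - w * (e * M) + w * a               # erase, then add, both element-wise
    return M_new, alpha_hat

def ntm_read(M, alpha_hat):
    """Retrieval: r = sum_i alpha_hat^i m^i."""
    return alpha_hat @ M

# Toy memory and controller outputs (made-up values).
M = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
k = np.array([1.0, 0.0])
e = np.array([1.0, 0.0])
a = np.array([0.0, 0.5])
M_new, alpha_hat = ntm_write(M, k, e, a)
r = ntm_read(M_new, alpha_hat)
```

Because the attention weights are soft, every memory row is modified a little, and the whole write stays differentiable.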

(54)

Neural Turing Machine

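[Diagram: the process unrolled over time. At step 1, f takes $h^0$, $x^1$ and $r^0$ (read from memory $m_0$ with weights $\hat{\alpha}_0$) and outputs $y^1$; the memory is updated to $m_1$ using $\hat{\alpha}_1$. At step 2, f takes $h^1$, $x^2$ and $r^1$ and outputs $y^2$; the memory is updated to $m_2$ using $\hat{\alpha}_2$.]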

(55)

Neural Turing Machine

Wei-Jen Ko, Bo-Hsiang Tseng, Hung-yi Lee, “Recurrent Neural Network based Language Modeling with Controllable External Memory”, ICASSP, 2017

(56)

Stack RNN

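Armand Joulin, Tomas Mikolov, “Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets”, arXiv preprint, 2015

[Diagram: f takes $x^t$ and the contents of the stack and outputs $y^t$, the information to store, and a distribution over Push, Pop, Nothing (e.g. 0.7, 0.2, 0.1). The new stack is the weighted sum of the pushed, popped, and unchanged stacks: 0.7 × Push + 0.2 × Pop + 0.1 × Nothing.]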

(57)

Tips for Generation

(58)

Attention

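[Diagram: attention weights $\alpha_t^i$ (superscript i: input component, subscript t: generation time step) for four components over the generated words w1 w2 w3 w4.]

Bad attention: e.g. w1 = (woman), w2 = (woman), ……, no “cooking”

Good attention: each input component has approximately the same attention weight

E.g. regularization term: $\sum_i \left( \tau - \sum_t \alpha_t^i \right)^2$ (outer sum: for each component; inner sum: over the generation)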

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015
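The regularization term on this slide penalizes components whose total attention over the generation deviates from τ. A minimal sketch, assuming the weights are stored as a matrix `alpha[t, i]` and using made-up example values:

```python
import numpy as np

def attention_regularizer(alpha, tau=1.0):
    """alpha[t, i]: attention weight on component i at generation step t."""
    total_per_component = alpha.sum(axis=0)            # sum over the generation
    return float(np.sum((tau - total_per_component) ** 2))

# Each of four components is attended to exactly once.
balanced = np.eye(4)
# Attention is stuck on the first component at all four steps.
stuck = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (4, 1))

penalty_balanced = attention_regularizer(balanced)
penalty_stuck = attention_regularizer(stuck)
```

With τ = 1 the balanced pattern has zero penalty, while the stuck pattern is penalized both for over-attending one component and for ignoring the other three.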

(59)

Mismatch between Train and Test

• Training

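$C = \sum_t C_t$

Minimizing the cross-entropy of each component

[Diagram: given a condition, the generator receives <BOS>, A, B as inputs (taken from the reference); its output distribution over A and B at each step is compared with the reference A B B.]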

(60)

Mismatch between Train and Test

• Generation

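[Diagram: starting from <BOS>, each step outputs a distribution over A and B; the generated token is fed back as the next input.]

We do not know the reference.

Testing: the inputs are the outputs of the last time step.

Training: the inputs are the reference.

Exposure Bias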

(61)

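[Diagram: the tree of all possible A/B sequences, shown once for training and once for testing. Training only follows the reference path and never explores the other branches.]

One wrong step may make the whole sequence totally wrong, because those branches were never explored ……

(Chinese saying: one wrong step, and every step after it is wrong.)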

(62)

Modifying Training Process?

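[Diagram: during training, the model's own outputs are fed back as the next inputs, and the outputs are compared with the reference A B B.]

Training is matched to testing.

When we try to decrease the loss for both steps 1 and 2 …..

In practice, it is hard to train in this way.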

(63)

Scheduled Sampling

[Diagram: at each step, the next input is taken either from the model's own output or from the reference.]
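The choice between the model's output and the reference can be sketched as a coin flip per step. Everything named below is hypothetical: `toy_step` stands in for one step of the generator, and the linear decay in `reference_probability` is only one possible schedule, chosen here for simplicity.

```python
import random

def reference_probability(epoch, total_epochs):
    """Illustrative linear schedule: start with the reference, end with the model."""
    return max(0.0, 1.0 - epoch / total_epochs)

def scheduled_sampling_inputs(reference, model_step, p_reference, rng):
    """Build the input sequence for one training example."""
    inputs, prev = [], "<BOS>"
    for ref_tok in reference:
        inputs.append(prev)
        model_tok = model_step(prev)     # the model's own output at this step
        # Coin flip: next input from the reference or from the model.
        prev = ref_tok if rng.random() < p_reference else model_tok
    return inputs

def toy_step(prev_token):
    """Hypothetical stand-in for the generator: always predicts A."""
    return "A"

reference = ["A", "B", "B"]
early = scheduled_sampling_inputs(reference, toy_step, 1.0, random.Random(0))
late = scheduled_sampling_inputs(reference, toy_step, 0.0, random.Random(0))
```

With probability 1.0 the inputs equal those of ordinary training on the reference; with probability 0.0 they equal those seen at test time.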

(64)

Scheduled Sampling

• Caption generation on MSCOCO

                         BLEU-4   METEOR   CIDER
Always from reference    28.8     24.2     89.5
Always from model        11.2     15.7     49.7
Scheduled Sampling       30.6     24.3     92.1

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer, Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, arXiv preprint, 2015

(65)

Beam Search

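[Diagram: a tree of A/B choices with a probability on each branch (0.4, 0.6, 0.9, ……). One path takes the 0.6 branch at every step; the green path starts with a 0.4 branch and continues with 0.9 and 0.9.]

The green path has higher score.

Not possible to check all the paths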

(66)

Beam Search

[Diagram: the same tree of A/B choices.]

Keep the several best paths at each step

Beam size = 2
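Beam search over a two-token vocabulary can be sketched as follows. The probabilities in `toy_model` are made up so that the greedy choice is not the best path; the function names and the fixed output length are assumptions for illustration.

```python
import math

def toy_model(prefix):
    """Hypothetical P(next token | prefix) over the tokens A and B."""
    if prefix and prefix[0] == "B":
        return {"A": 0.1, "B": 0.9}
    return {"A": 0.6, "B": 0.4}

def beam_search(model, length, beam_size):
    """Keep the beam_size best prefixes at each step; return the best one."""
    beams = [((), 0.0)]                                # (prefix, log-probability)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for token, p in model(prefix).items():
                candidates.append((prefix + (token,), score + math.log(p)))
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = candidates[:beam_size]                 # prune to the beam
    return beams[0]

greedy_path, greedy_score = beam_search(toy_model, length=3, beam_size=1)
beam_path, beam_score = beam_search(toy_model, length=3, beam_size=2)
```

With beam size 1 this reduces to greedy decoding, which commits to A at the first step. With beam size 2 the prefix starting with B survives the first step and ends up with the higher overall probability.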

(67)

Beam Search

https://github.com/tensorflow/tensorflow/issues/654#issuecomment-169009989

The size of beam is 3 in this example.

(68)

Better Idea?

[Diagram: two generators starting from <BOS>, one over the tokens A and B and one over the words “you”, “I”, “are”, “am”.]

I ≈ you (both have a high score), am ≈ are

I am ……

You are ……

I are ……

You am ……

(69)

Object level vs. component level

• Minimizing the error defined on component level is not equivalent to improving the generated objects

$C = \sum_t C_t$ (cross-entropy of each step)

Ref: The dog is running fast

Generated examples:

A cat a a a

The dog is is fast

The dog is running fast

Optimize an object-level criterion instead of the component-level cross-entropy.

Object-level criterion: $R(y, \hat{y})$, where $y$: generated utterance, $\hat{y}$: ground truth

Gradient Descent?

(70)

Reinforcement learning?

Start with observation $s_1$ → Action $a_1$: “right” → Obtain reward $r_1 = 0$

Observation $s_2$ → Action $a_2$: “fire” (kill an alien) → Obtain reward $r_2 = 5$

Observation $s_3$

(71)

Reinforcement learning?

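Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba, “Sequence Level Training with Recurrent Neural Networks”, ICLR, 2016

[Diagram: generation viewed as reinforcement learning. Starting from <BOS>, the input at each step is the observation, the action set is {A, B}, and the generated token is the action taken.]

The action we take influences the observation in the next step

Reward: $r = 0$ at the intermediate steps; at the end, R(“BAA”, reference)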

(72)

Concluding Remarks

• RNN with Gated Mechanism

• Sequence Generation

• Conditional Sequence Generation

• Tips for Generation
