(1)

Slide credit from Hung-Yi Lee

(2)

Attention and Memory

[Diagram] Information from the sensors (e.g. eyes, ears) → Sensory Memory → (Attention) → Working Memory ↔ (Encode / Retrieval) ↔ Long-term Memory

When the input is a very long sequence or an image, pay attention to only part of the input object at each time step.

(3)

Attention and Memory

[Diagram] Information from the sensors (e.g. eyes, ears) → Sensory Memory → (Attention) → Working Memory ↔ (Encode / Retrieval) ↔ Long-term Memory

When the input is a very long sequence or an image, pay attention to only part of the input object at each time step. In an RNN/LSTM, a larger memory implies more parameters; with an attended memory, increasing the memory size does not increase the number of parameters.

(4)

Attention on Sensory Info

[Diagram] Information from the sensors (e.g. eyes, ears) → Sensory Memory → (Attention) → Working Memory ↔ (Encode / Retrieval) ↔ Long-term Memory

(5)

Machine Translation

Sequence-to-sequence learning: the input and output are both sequences, possibly with different lengths.

E.g. 深度學習 → deep learning

[Diagram] The RNN encoder reads 深 度 學 習; its final hidden state carries the information of the whole sentence and is passed to the RNN decoder, which generates "deep", "learning", <END>.
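To make the encoder-decoder picture concrete, here is a minimal sketch in PyTorch (not from the lecture; the vocabulary sizes, dimensions, and module choices are placeholder assumptions): the encoder compresses the whole source sentence into one vector, and the decoder generates the target from it.

```python
# Minimal encoder-decoder sketch (no attention yet); all sizes are placeholders.
import torch
import torch.nn as nn

src_vocab, tgt_vocab, dim = 1000, 1000, 256
embed_src = nn.Embedding(src_vocab, dim)
embed_tgt = nn.Embedding(tgt_vocab, dim)
encoder = nn.GRU(dim, dim, batch_first=True)
decoder = nn.GRU(dim, dim, batch_first=True)
proj = nn.Linear(dim, tgt_vocab)

src = torch.randint(0, src_vocab, (1, 4))   # e.g. 深 度 學 習
tgt = torch.randint(0, tgt_vocab, (1, 3))   # e.g. deep learning <END>

_, h = encoder(embed_src(src))              # h: information of the whole sentence
out, _ = decoder(embed_tgt(tgt), h)         # decoder conditioned on that one vector
logits = proj(out)                          # (1, 3, tgt_vocab) word scores
```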

(6)

Machine Translation with Attention

[Diagram] The decoder's initial state z_0 is matched against each encoder output h_1 … h_4 (for 深 度 學 習), producing a score α_0^1 for the first position.

What is match?
• Cosine similarity of z and h
• A small NN whose input is z and h and whose output is a scalar
• α = h^T W z

How to learn the parameters? They are learned jointly with the rest of the network.
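The three match options above can be written down directly. A small sketch (my own illustration, with placeholder dimensions and random tensors) in PyTorch:

```python
# Three possible "match" functions between decoder state z and encoder output h,
# corresponding to the three options listed on the slide.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256
z = torch.randn(dim)
h = torch.randn(dim)

# Option 1: cosine similarity of z and h
alpha_cos = F.cosine_similarity(h, z, dim=0)

# Option 2: a small NN whose input is (z, h) and whose output is a scalar
small_nn = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))
alpha_nn = small_nn(torch.cat([h, z]))

# Option 3: bilinear form  alpha = h^T W z
W = torch.randn(dim, dim)
alpha_bilinear = h @ W @ z
```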

(7)

Machine Translation with Attention

[Diagram] The scores α_0^1 … α_0^4 are passed through a softmax to obtain α̂_0^1 … α̂_0^4 (e.g. 0.5, 0.5, 0.0, 0.0). The context vector
c_0 = Σ_i α̂_0^i h_i = 0.5 h_1 + 0.5 h_2
is used as the RNN input to produce the decoder state z_1 and the first output word "deep". How to learn the parameters? As noted above, jointly with the rest of the network.
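A sketch of this normalization-and-weighted-sum step (the raw score values are placeholders, chosen so that the softmax gives roughly 0.5, 0.5, 0.0, 0.0):

```python
# alpha_hat = softmax(alpha), then c0 = sum_i alpha_hat_0^i * h_i
import torch

h = torch.randn(4, 256)                        # encoder outputs h1..h4 for 深 度 學 習
alpha = torch.tensor([2.0, 2.0, -5.0, -5.0])   # raw match scores (placeholder values)

alpha_hat = torch.softmax(alpha, dim=0)        # roughly [0.5, 0.5, 0.0, 0.0]
c0 = (alpha_hat.unsqueeze(1) * h).sum(dim=0)   # approximately 0.5*h1 + 0.5*h2
```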

(8)

Machine Translation with Attention

[Diagram] The new decoder state z_1 (obtained from c_0, which generated "deep") is matched against h_1 … h_4 again, producing the score α_1^1 for the first position.

(9)

Machine Translation with Attention

[Diagram] The scores α_1^1 … α_1^4 are passed through a softmax to obtain α̂_1^1 … α̂_1^4 (e.g. 0.0, 0.0, 0.5, 0.5), giving the context vector
c_1 = Σ_i α̂_1^i h_i = 0.5 h_3 + 0.5 h_4,
which is fed to the decoder to produce z_2 and the next word "learning".

(10)

Machine Translation with Attention

[Diagram] The decoder state z_2 is matched against h_1 … h_4 once more (scores α_2^1, …), and the same process repeats until <END> is generated.
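Putting the pieces together, here is a sketch of the full attention-based decoding loop (my own illustration: the bilinear match, the GRU cell, the output projection, and the <END> index are all placeholder assumptions, and the untrained weights make the generated words meaningless):

```python
# At each step: match the decoder state against all encoder outputs, form a
# context vector, feed it to the decoder RNN, and stop when <END> is produced.
import torch
import torch.nn as nn

dim, vocab, END = 256, 1000, 0
h = torch.randn(4, dim)                       # encoder outputs h1..h4
decoder_cell = nn.GRUCell(dim, dim)
proj = nn.Linear(dim, vocab)
W = torch.randn(dim, dim)                     # bilinear match (one of the options)

z = torch.zeros(dim)                          # z0
for _ in range(20):                           # safety limit on output length
    alpha = h @ W @ z                         # match scores for every position
    alpha_hat = torch.softmax(alpha, dim=0)
    c = (alpha_hat.unsqueeze(1) * h).sum(0)   # context vector c_t
    z = decoder_cell(c.unsqueeze(0), z.unsqueeze(0)).squeeze(0)
    word = proj(z).argmax().item()
    if word == END:                           # the same process repeats until <END>
        break
```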

(11)

Speech Recognition with Attention

Chan et al., "Listen, Attend and Spell," arXiv, 2015.

(12)

Image Captioning

Input: image

Output: word sequence

[Diagram] The input image goes through a CNN, which produces a single vector for the whole image; an RNN decoder then generates the word sequence "a woman is …" until <END>.

(13)

Image Captioning with Attention

[Diagram] Instead of one vector for the whole image, the CNN keeps a vector for each region (one per spatial position of the filter outputs). The state z_0 is matched against every region vector, giving a score for each region (e.g. 0.7 for one of them).

(14)

Image Captioning with Attention

[Diagram] The region scores (e.g. 0.7, 0.1, 0.1, 0.1, 0.0, 0.0) are used to take a weighted sum of the region vectors; together with z_0 this produces Word 1 and the next state z_1.

(15)

Image Captioning with Attention

[Diagram] At the next step z_1 attends over the region vectors with new weights (e.g. 0.0, 0.8, 0.2, 0.0, 0.0, 0.0); the weighted sum is fed to the RNN to produce Word 2.
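A sketch of this attention over region vectors (the feature-map size and the state are random placeholders):

```python
# Each spatial position of the CNN feature map is one region vector; the current
# state attends over them with a weighted sum.
import torch

feature_map = torch.randn(256, 4, 4)                 # CNN output: one vector per region
regions = feature_map.flatten(1).T                   # (16, 256): a vector for each region
z = torch.randn(256)                                 # current state z0 / z1 / ...

scores = regions @ z                                 # match each region against z
weights = torch.softmax(scores, dim=0)               # e.g. high weight on one region
context = (weights.unsqueeze(1) * regions).sum(0)    # weighted sum fed to the RNN
```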

(16)

Image Captioning

Good examples

(17)

Image Captioning

Bad examples


(18)

Video Captioning

(19)

Video Captioning


(20)

Reading Comprehension

[Diagram] Each sentence of the document is encoded into a vector x_1, x_2, x_3, …, x_N. The question vector q is matched against every x_n to obtain a weight α_n, and the extracted information
Σ_{n=1}^{N} α_n x_n
is fed to a DNN that outputs the answer. The sentence-to-vector encoding can be jointly trained.
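A sketch of the extraction step above (my own illustration; the sentence vectors are random placeholders, whereas in the lecture they come from a jointly trained sentence-to-vector encoder, and the softmax is just one simple choice of normalization):

```python
# Match the question q against every sentence vector x_n and extract
# sum_n alpha_n * x_n as the information fed to the answering DNN.
import torch

N, dim = 5, 128
x = torch.randn(N, dim)                       # sentence vectors x_1 .. x_N
q = torch.randn(dim)                          # question vector

alpha = torch.softmax(x @ q, dim=0)           # match(q, x_n) for every sentence
extracted = (alpha.unsqueeze(1) * x).sum(0)   # fed to a DNN to produce the answer
```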

(21)

Reading Comprehension

[Diagram] The same architecture with jointly learned sentence representations h_1, h_2, h_3, …, h_N: the extracted information
Σ_{n=1}^{N} α_n h_n
is fed to the DNN, and the attention/extraction can be repeated (hopping).

(22)

Memory Network

[Diagram] Memory Network: starting from the question q, compute attention over the memory and extract information; the extracted information is used to compute attention again and extract further information (multiple hops), and a DNN finally produces the answer a.
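A sketch of the hopping idea (my own illustration; the number of hops, the memory contents, and the way the extracted information updates the query are placeholder assumptions):

```python
# Multi-hop attention over a memory of sentence vectors: the extracted
# information is folded back into the query before the next hop.
import torch

N, dim, hops = 5, 128, 2
memory = torch.randn(N, dim)                  # sentence vectors stored in memory
q = torch.randn(dim)                          # initial question vector

for _ in range(hops):
    alpha = torch.softmax(memory @ q, dim=0)               # compute attention
    extracted = (alpha.unsqueeze(1) * memory).sum(0)       # extract information
    q = q + extracted                                      # updated query for next hop
# the final q (or extracted) goes into a DNN that outputs the answer a
```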

(23)

Memory Network

Multi-hop performance analysis

https://www.facebook.com/Engineering/videos/10153098860532200/

(24)

Special Attention: Spatial Transformers

[Diagram] Feeding the input directly to a CNN gives bad results; with a spatial transformer module that is jointly learned with the CNN, the results are good.

(25)

Attention on Memory

[Diagram] Information from the sensors (e.g. eyes, ears) → Sensory Memory → (Attention) → Working Memory ↔ (Encode / Retrieval) ↔ Long-term Memory

(26)

Neural Turing Machine

[Figure] The von Neumann architecture (for comparison).

A Neural Turing Machine is an advanced RNN/LSTM.

(27)

Neural Turing Machine

[Diagram] The controller RNN (inputs x_1, x_2, outputs y_1, y_2, hidden states h_0, h_1, h_2) reads from a long-term memory with cells m_0^1 … m_0^4 using attention weights α̂_0^1 … α̂_0^4; the retrieval process computes the read vector
r_0 = Σ_i α̂_0^i m_0^i,
which is fed into the controller.

Zhang et al., “Structured Memory for Neural Turing Machines,” arXiv, 2015.
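A sketch of this retrieval step (memory contents and weights are random placeholders):

```python
# NTM read: r0 = sum_i alpha_hat_0^i * m_0^i
import torch

m0 = torch.randn(4, 20)                              # memory cells m_0^1 .. m_0^4
alpha_hat0 = torch.softmax(torch.randn(4), dim=0)    # read weights over the cells

r0 = (alpha_hat0.unsqueeze(1) * m0).sum(0)           # read vector fed to the controller
```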

(28)

Neural Turing Machine

[Diagram] The controller state h_1 (computed from x_1 and the read vector r_0) emits an erase vector e_1, an add vector a_1, and a key k_1. The attention over the memory is updated (simplified) as
α_1^i = (1 − λ) α_0^i + λ cos(m_0^i, k_1),
and a softmax over the α_1^i gives the new weights α̂_1^1 … α̂_1^4.
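A sketch of the simplified address update above (placeholder tensors; λ is fixed by hand here, while in the actual model it would be produced by the controller):

```python
# alpha_1^i = (1 - lam) * alpha_0^i + lam * cos(m_0^i, k_1), then softmax.
import torch
import torch.nn.functional as F

m0 = torch.randn(4, 20)                              # memory cells m_0^i
alpha0 = torch.softmax(torch.randn(4), dim=0)        # previous attention weights
k1 = torch.randn(20)                                 # key emitted by the controller
lam = 0.5                                            # interpolation weight (placeholder)

content = F.cosine_similarity(m0, k1.unsqueeze(0).expand_as(m0), dim=1)
alpha1 = (1 - lam) * alpha0 + lam * content
alpha1_hat = torch.softmax(alpha1, dim=0)            # new normalized weights
```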

(29)

Neural Turing Machine

[Diagram] Using e_1, a_1 and the new weights α̂_1^1 … α̂_1^4, each memory cell is updated (encode process):
m_1^i = m_0^i ⊙ (1 − α̂_1^i e_1) + α̂_1^i a_1    (⊙ is element-wise multiplication)

Zhang et al., “Structured Memory for Neural Turing Machines,” arXiv, 2015.
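A sketch of this write (erase-then-add) step, with placeholder tensors:

```python
# m_1^i = m_0^i * (1 - alpha_hat_1^i * e_1) + alpha_hat_1^i * a_1   (element-wise)
import torch

m0 = torch.randn(4, 20)                              # memory before writing
alpha_hat1 = torch.softmax(torch.randn(4), dim=0)    # write weights
e1 = torch.sigmoid(torch.randn(20))                  # erase vector (values in [0, 1])
a1 = torch.randn(20)                                 # add vector

w = alpha_hat1.unsqueeze(1)                          # (4, 1) for broadcasting over cells
m1 = m0 * (1 - w * e1) + w * a1                      # new memory m_1^i
```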

(30)

Neural Turing Machine

[Diagram] The complete unrolled process: the controller (inputs x_1, x_2, outputs y_1, y_2, states h_0, h_1, h_2, read vectors r_0, r_1) repeatedly reads from and writes to the memory, producing memories m_0^i → m_1^i → m_2^i with attention weights α̂_0^i, α̂_1^i, α̂_2^i.

(31)

Stack RNN

Joulin and Mikolov, "Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets," 2015.

[Diagram] At each step the controller (input x_t, hidden state h_t, output y_t) produces the information to store and a distribution over the actions {Push, Pop, Nothing} (e.g. 0.7, 0.2, 0.1). The new stack is the mixture of the three candidate stacks: 0.7 × (stack after push) + 0.2 × (stack after pop) + 0.1 × (unchanged stack).
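A sketch of the differentiable stack update (my own illustration; the stack depth, the element to store, the zero-fill after a pop, and the action probabilities are placeholder assumptions):

```python
# Mix the pushed, popped, and unchanged stacks with the controller's
# {push, pop, nothing} probabilities (e.g. 0.7 / 0.2 / 0.1).
import torch

depth, dim = 5, 8
stack = torch.zeros(depth, dim)               # current stack, top at index 0
v = torch.randn(dim)                          # information to store
p_push, p_pop, p_nothing = 0.7, 0.2, 0.1      # controller outputs (placeholder)

pushed = torch.cat([v.unsqueeze(0), stack[:-1]])        # push v on top
popped = torch.cat([stack[1:], torch.zeros(1, dim)])    # pop the top element
new_stack = p_push * pushed + p_pop * popped + p_nothing * stack
```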

(32)

Concluding Remarks

• Attention on sensory information: Machine Translation, Speech Recognition, Image Captioning, Question Answering
• Attention on memory: Neural Turing Machine, Stack RNN

[Diagram] Information from the sensors (e.g. eyes, ears) → Sensory Memory → (Attention) → Working Memory ↔ (Encode / Retrieval) ↔ Long-term Memory

(33)

Reference

End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. arXiv Pre-Print, 2015.

Neural Turing Machines. Alex Graves, Greg Wayne, Ivo Danihelka. arXiv Pre-Print, 2014.

Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. Kumar et al. arXiv Pre-Print, 2015.

Neural Machine Translation by Jointly Learning to Align and Translate. D. Bahdanau, K. Cho, Y. Bengio. International Conference on Learning Representations, 2015.

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu et al. arXiv Pre-Print, 2015.

Attention-Based Models for Speech Recognition. Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio. arXiv Pre-Print, 2015.

A Neural Attention Model for Abstractive Sentence Summarization. A. M. Rush, S. Chopra, J. Weston. EMNLP 2015.
