Slide credit: Hung-Yi Lee
Attention and Memory
[Diagram: information from the sensors (e.g. eyes, ears) enters the sensory memory; attention selects what goes into the working memory; encode and retrieval connect the working memory to the long-term memory]
When the input is a very long sequence or an image,
pay attention to only part of the input object at each time.
Attention and Memory
[Diagram repeated: sensory memory → (attention) → working memory → (encode / retrieval) → long-term memory]
When the input is a very long sequence or an image,
pay attention to only part of the input object at each time.
In an RNN/LSTM, a larger memory implies more parameters.
Increasing the memory size does not increase the number of parameters.
Attention on Sensory Info
[Diagram: attention highlighted between the sensory memory (information from the sensors, e.g. eyes, ears) and the working memory]
Machine Translation
Sequence-to-sequence learning: both the input and the output are sequences, possibly with different lengths.
E.g. 深度學習 → deep learning
[Diagram: an encoder RNN reads 深, 度, 學, 習; its final state, carrying the information of the whole sentence, is passed to a decoder RNN, which outputs "deep", "learning", <END>]
Machine Translation with Attention
[Diagram: the encoder produces hidden states $h^1, h^2, h^3, h^4$ for 深, 度, 學, 習; a "match" module compares the initial decoder state $z^0$ with $h^1$ to produce the score $\alpha_0^1$]
What is "match"? Possible choices:
Cosine similarity of $z$ and $h$.
A small NN whose input is $z$ and $h$ and whose output is a scalar.
$\alpha = h^T W z$.
How to learn the parameters of "match"? They are learned jointly with the rest of the network.
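A minimal sketch of the three "match" variants listed above, in NumPy; the function names, shapes, and random parameters are illustrative assumptions, not from the slides.

```python
import numpy as np

def match_cosine(z, h):
    # Cosine similarity between decoder state z and encoder state h.
    return float(z @ h / (np.linalg.norm(z) * np.linalg.norm(h) + 1e-8))

def match_bilinear(z, h, W):
    # alpha = h^T W z, with a learnable matrix W.
    return float(h @ W @ z)

def match_small_nn(z, h, W1, b1, w2):
    # A small NN: concatenate z and h, one hidden layer, scalar output.
    hidden = np.tanh(W1 @ np.concatenate([z, h]) + b1)
    return float(w2 @ hidden)

# Example: score every encoder state h^i against z^0 with the bilinear match.
rng = np.random.default_rng(0)
d = 4
z0 = rng.normal(size=d)
hs = rng.normal(size=(4, d))          # h^1 .. h^4
W = rng.normal(size=(d, d))
scores = [match_bilinear(z0, h, W) for h in hs]
print(scores)                         # unnormalized alpha_0^1 .. alpha_0^4
```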
Machine Translation with Attention
The match scores $\alpha_0^1, \alpha_0^2, \alpha_0^3, \alpha_0^4$ are normalized with a softmax to get $\hat{\alpha}_0^1, \hat{\alpha}_0^2, \hat{\alpha}_0^3, \hat{\alpha}_0^4$, e.g. 0.5, 0.5, 0.0, 0.0.
$$c^0 = \sum_i \hat{\alpha}_0^i h^i = 0.5\,h^1 + 0.5\,h^2$$
$c^0$ is used as the input of the decoder RNN, which moves from state $z^0$ to $z^1$ and outputs the first word, "deep".
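A minimal sketch of the softmax normalization and the context vector $c^0$; the toy encoder states and raw scores are illustrative assumptions chosen so the normalized weights come out near 0.5, 0.5, 0.0, 0.0.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))                 # subtract max for numerical stability
    return e / e.sum()

# Toy encoder states h^1 .. h^4 and raw match scores for decoder step 0.
hs = np.array([[1., 0.], [0., 1.], [2., 2.], [3., 3.]])
alpha0 = np.array([5.0, 5.0, -5.0, -5.0])     # raw match scores

alpha0_hat = softmax(alpha0)                  # ~ [0.5, 0.5, 0.0, 0.0]
c0 = (alpha0_hat[:, None] * hs).sum(axis=0)   # c^0 = sum_i alpha_hat_0^i h^i
print(alpha0_hat, c0)                         # c0 ≈ 0.5*h1 + 0.5*h2
```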
Machine Translation with Attention
[Diagram: after generating "deep", the new decoder state $z^1$ is matched against $h^1, h^2, h^3, h^4$ again, producing the score $\alpha_1^1$, and so on]
Machine Translation with Attention
The new scores are normalized with a softmax, giving $\hat{\alpha}_1^1, \hat{\alpha}_1^2, \hat{\alpha}_1^3, \hat{\alpha}_1^4$, e.g. 0.0, 0.0, 0.5, 0.5.
$$c^1 = \sum_i \hat{\alpha}_1^i h^i = 0.5\,h^3 + 0.5\,h^4$$
$c^1$ is fed to the decoder, which moves from $z^1$ to $z^2$ and outputs the next word, "learning".
Machine Translation with Attention
[Diagram: the decoder state $z^2$ is matched against $h^1, \dots, h^4$ to get $\alpha_2^1$, and so on]
The same process repeats until <END> is generated.
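Putting the pieces together, a minimal sketch of the attention-based decoding loop; `decoder_step` and `output_word` stand in for the decoder RNN and its output layer, which the slides do not specify in detail, so the toy usage below is purely illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def decode_with_attention(hs, z0, match, decoder_step, output_word, max_len=20):
    """hs: encoder states (T, d); z0: initial decoder state.
    match(z, h) -> scalar score; decoder_step(z, c) -> next state;
    output_word(z) -> word string (or '<END>')."""
    z, words = z0, []
    for _ in range(max_len):
        scores = np.array([match(z, h) for h in hs])   # alpha_t^i
        weights = softmax(scores)                      # alpha_hat_t^i
        c = weights @ hs                               # c^t = sum_i alpha_hat_t^i h^i
        z = decoder_step(z, c)                         # z^{t+1}
        w = output_word(z)
        words.append(w)
        if w == "<END>":                               # repeat until <END>
            break
    return words

# Toy usage with stand-in components.
rng = np.random.default_rng(0)
hs = rng.normal(size=(4, 3))
Wz, Wc = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
words = decode_with_attention(
    hs, np.zeros(3),
    match=lambda z, h: float(z @ h),
    decoder_step=lambda z, c: np.tanh(Wz @ z + Wc @ c),
    output_word=lambda z: "<END>" if z.sum() > 0 else "word",
)
print(words)
```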
Speech Recognition with Attention
Chan et al., "Listen, Attend and Spell," arXiv, 2015.
Image Captioning
Input: image
Output: word sequence
[Diagram: the input image goes through a CNN, producing a single vector for the whole image; a decoder RNN then generates "a woman is ...", ending with <END>]
Image Captioning with Attention
[Diagram: the CNN instead produces a vector for each region of the image (one per filter position); a "match" module compares the initial state $z^0$ with each region vector, e.g. giving a score of 0.7 for one region]
Image Captioning with Attention
[Diagram: the region scores (e.g. 0.7, 0.1, 0.1, 0.1, 0.0, 0.0) are used to take a weighted sum of the region vectors; the weighted sum is fed to the decoder, which moves from $z^0$ to $z^1$ and outputs Word 1]
Image Captioning with Attention
[Diagram: at the next step the attention weights change (e.g. 0.0, 0.8, 0.2, 0.0, 0.0, 0.0); the new weighted sum of region vectors is fed to the decoder, which moves from $z^1$ to $z^2$ and outputs Word 2]
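A minimal sketch of attention over image regions, assuming a CNN feature map has already been computed; the (7, 7, 512) feature-map shape and the bilinear match are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Assume a CNN produced a feature map of shape (H, W, C): one C-dim vector per region.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(7, 7, 512))
regions = feature_map.reshape(-1, 512)        # (49, 512): a vector for each region

z = rng.normal(size=512)                      # current decoder state z^t
W = rng.normal(size=(512, 512)) * 0.01        # assumed bilinear match parameters
scores = regions @ W @ z                      # one match score per region
weights = softmax(scores)                     # attention over the 49 regions
context = weights @ regions                   # weighted sum of region vectors
print(weights.shape, context.shape)           # (49,), (512,)
```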
Image Captioning
Good examples
Image Captioning
Bad examples
Video Captioning
Reading Comprehension
[Diagram: the question is encoded into a vector $q$; each sentence of the document is encoded into a vector $x^1, x^2, x^3, \dots, x^N$; a "match" module compares $q$ with each sentence vector to get attention weights $\alpha_1, \alpha_2, \alpha_3, \dots, \alpha_N$]
Extracted information: $\sum_{n=1}^{N} \alpha_n x^n$, which is fed to a DNN to produce the answer.
The sentence-to-vector encoding can be jointly trained with the rest of the model.
Reading Comprehension
[Diagram: as before, the question vector $q$ is matched against the sentence vectors $x^1, \dots, x^N$ to get $\alpha_1, \dots, \alpha_N$; in addition, each sentence is encoded into a second, jointly learned set of vectors $h^1, h^2, h^3, \dots, h^N$]
Extracted information: $\sum_{n=1}^{N} \alpha_n h^n$, fed to a DNN to produce the answer.
The vectors used to compute the attention ($x^n$) and the vectors used to extract the information ($h^n$) can thus be different.
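A minimal sketch of the reading-comprehension attention above: the match is computed against one set of sentence vectors ($x^n$) while the information is extracted from a second set ($h^n$); the dot-product match and all shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
N, d = 5, 8                                  # N sentences, d-dim vectors
x = rng.normal(size=(N, d))                  # sentence vectors used for matching
h = rng.normal(size=(N, d))                  # sentence vectors used for extraction
q = rng.normal(size=d)                       # question vector

alpha = softmax(x @ q)                       # alpha_n = match(q, x^n), normalized
extracted = alpha @ h                        # sum_n alpha_n h^n
# 'extracted' would then be fed to a DNN to produce the answer.
print(alpha, extracted.shape)
```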
Memory Network
Hopping: the extracted information is used to update the query, and the compute-attention / extract-information steps are repeated over several hops; the final result is fed to a DNN to produce the answer $a$ from the question $q$.
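A minimal sketch of hopping; updating the query by adding the extracted vector is one common choice (as in End-To-End Memory Networks) and is used here as an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def memory_network(q, x, h, hops=3):
    """q: question vector; x: (N, d) match vectors; h: (N, d) extraction vectors."""
    u = q
    for _ in range(hops):
        alpha = softmax(x @ u)               # compute attention
        o = alpha @ h                        # extract information
        u = u + o                            # update the query for the next hop
    return u                                 # fed to a DNN to produce the answer a

rng = np.random.default_rng(0)
x, h, q = rng.normal(size=(5, 8)), rng.normal(size=(5, 8)), rng.normal(size=8)
print(memory_network(q, x, h).shape)
```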
Memory Network
Multi-hop performance analysis:
https://www.facebook.com/Engineering/videos/10153098860532200/
Special Attention: Spatial Transformers
[Diagram: feeding the raw input directly into a CNN gives bad results; inserting a jointly learned spatial transformer in front of the CNN gives good results]
Attention on Memory
[Diagram: attention now highlighted on the working memory / long-term memory side (encode and retrieval), rather than on the sensory input]
Neural Turing Machine
[Diagram: the von Neumann architecture, with a processor connected to an external memory]
A Neural Turing Machine is an advanced RNN/LSTM: the network is augmented with an external memory that it reads from and writes to via attention.
Neural Turing Machine
[Diagram: an RNN with inputs $x^1, x^2$, outputs $y^1, y^2$, and hidden states $h^0, h^1, h^2$; the long-term memory consists of the cells $m_0^1, m_0^2, m_0^3, m_0^4$]
Retrieval process: the attention weights $\hat{\alpha}_0^1, \dots, \hat{\alpha}_0^4$ read a vector from memory,
$$r^0 = \sum_i \hat{\alpha}_0^i m_0^i$$
which is fed, together with $x^1$, into the RNN to produce $h^1$.
Zhang et al., "Structured Memory for Neural Turing Machines," arXiv, 2015.
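A minimal sketch of the retrieval process: the read vector is the attention-weighted sum of the memory cells; the memory size and attention values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 10))                 # memory cells m_0^1 .. m_0^4 (4 cells, dim 10)
alpha0 = np.array([0.7, 0.2, 0.1, 0.0])      # attention over the cells, sums to 1

r0 = alpha0 @ M                              # r^0 = sum_i alpha_hat_0^i m_0^i
# r0 is fed, together with the input x^1, into the RNN to produce h^1.
print(r0.shape)                              # (10,)
```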
Neural Turing Machine
[Diagram: from the hidden state $h^1$, the controller emits a key $k^1$, an erase vector $e^1$, and an add vector $a^1$]
The attention over the memory is updated using the key (simplified):
$$\alpha_1^i = (1-\lambda)\,\alpha_0^i + \lambda\,\cos(m_0^i, k^1)$$
The new scores $\alpha_1^1, \dots, \alpha_1^4$ are passed through a softmax to obtain $\hat{\alpha}_1^1, \dots, \hat{\alpha}_1^4$.
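A minimal sketch of the (simplified) address update above: interpolate between the previous attention and the cosine similarity with the emitted key, then normalize with a softmax; the value of lambda and the shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def update_attention(alpha_prev, M, k, lam=0.5):
    # alpha_1^i = (1 - lambda) * alpha_0^i + lambda * cos(m_0^i, k^1), then softmax
    sim = np.array([cosine(m, k) for m in M])
    alpha = (1 - lam) * alpha_prev + lam * sim
    return softmax(alpha)

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 10))                 # memory cells m_0^i
k1 = rng.normal(size=10)                     # key emitted by the controller
alpha0 = np.array([0.7, 0.2, 0.1, 0.0])
print(update_attention(alpha0, M, k1))       # alpha_hat_1^1 .. alpha_hat_1^4
```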
Neural Turing Machine
Encode (writing) process: each memory cell is updated with the erase vector $e^1$ and the add vector $a^1$,
$$m_1^i = m_0^i \odot \left(1 - \hat{\alpha}_1^i\, e^1\right) + \hat{\alpha}_1^i\, a^1 \quad \text{(element-wise)}$$
producing the new memory cells $m_1^1, m_1^2, m_1^3, m_1^4$.
Zhang et al., "Structured Memory for Neural Turing Machines," arXiv, 2015.
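A minimal sketch of the encode (writing) step, with element-wise erase and add; assuming $e^1$ lies in [0, 1] (e.g. a sigmoid output), and with illustrative shapes and attention values.

```python
import numpy as np

def write_memory(M, alpha, e, a):
    # m_1^i = m_0^i * (1 - alpha_hat_1^i * e^1) + alpha_hat_1^i * a^1  (element-wise)
    return M * (1 - alpha[:, None] * e) + alpha[:, None] * a

rng = np.random.default_rng(0)
M0 = rng.normal(size=(4, 10))                # memory before writing
alpha1 = np.array([0.1, 0.6, 0.2, 0.1])      # write attention alpha_hat_1^i
e1 = rng.uniform(size=10)                    # erase vector in [0, 1]
a1 = rng.normal(size=10)                     # add vector

M1 = write_memory(M0, alpha1, e1, a1)        # new memory m_1^1 .. m_1^4
print(M1.shape)
```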
Neural Turing Machine
[Diagram: the process unrolled over time: attention $\hat{\alpha}_0^i$ reads $r^0$ from memory $m_0^i$ for step 1 ($x^1, h^1, y^1$); the memory is updated to $m_1^i$, and attention $\hat{\alpha}_1^i$ reads $r^1$ for step 2 ($x^2, h^2, y^2$); then $\hat{\alpha}_2^i$ and $m_2^i$, and so on]
Stack RNN
[Diagram: at each step the RNN takes $x_t$ and the top of the stack, and produces $h_t$, the output $y_t$, a vector of information to store, and probabilities for Push, Pop, and Nothing (e.g. 0.7, 0.2, 0.1); the new stack is the weighted combination of the three candidate stacks: 0.7 × (push) + 0.2 × (pop) + 0.1 × (nothing)]
Joulin and Mikolov, "Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets," 2015.
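A minimal sketch of the soft stack update: the new stack is a probability-weighted mixture of the stack after a push, after a pop, and left unchanged; the fixed depth and zero padding at the bottom are illustrative assumptions.

```python
import numpy as np

def stack_step(stack, p_push, p_pop, p_noop, v):
    """stack: (depth, d) with the top at row 0; v: vector of information to store."""
    pushed = np.vstack([v, stack[:-1]])                 # push v on top, drop the bottom
    popped = np.vstack([stack[1:], np.zeros_like(v)])   # pop the top, pad the bottom
    noop = stack                                        # leave the stack unchanged
    # New stack = p_push * pushed + p_pop * popped + p_noop * unchanged
    return p_push * pushed + p_pop * popped + p_noop * noop

rng = np.random.default_rng(0)
stack = rng.normal(size=(5, 3))                         # depth-5 stack of 3-dim vectors
v = rng.normal(size=3)                                  # information to store
new_stack = stack_step(stack, 0.7, 0.2, 0.1, v)         # e.g. Push 0.7, Pop 0.2, Nothing 0.1
print(new_stack.shape)                                  # (5, 3)
```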
Concluding Remarks
Attention on sensory information: Machine Translation, Speech Recognition, Image Captioning, Question Answering.
Attention on memory: Neural Turing Machine, Stack RNN.
[Diagram: the memory model revisited: sensory memory (information from the sensors, e.g. eyes, ears) → attention → working memory → encode/retrieval → long-term memory]
Reference
S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus, "End-To-End Memory Networks," arXiv preprint, 2015.
A. Graves, G. Wayne, I. Danihelka, "Neural Turing Machines," arXiv preprint, 2014.
A. Kumar et al., "Ask Me Anything: Dynamic Memory Networks for Natural Language Processing," arXiv preprint, 2015.
D. Bahdanau, K. Cho, Y. Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate," International Conference on Learning Representations, 2015.
K. Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention," arXiv preprint, 2015.
J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, "Attention-Based Models for Speech Recognition," arXiv preprint, 2015.
A. M. Rush, S. Chopra, J. Weston, "A Neural Attention Model for Abstractive Sentence Summarization," EMNLP 2015.