Slide credit: Hung-Yi Lee
Attention and Memory
[Diagram: information from the sensors (e.g. eyes, ears) enters the sensory memory; attention selects what goes into the working memory; encode and retrieval connect the working memory to the long-term memory]
When the input is a very long sequence or an image,
pay attention to only part of the input object at each time.
Attention and Memory
[Diagram repeated: sensory memory → (attention) → working memory → (encode / retrieval) → long-term memory]
When the input is a very long sequence or an image,
pay attention to only part of the input object at each time.
In an RNN/LSTM, a larger memory implies more parameters.
Increasing the memory size does not increase the number of parameters.
Attention on Sensory Info
[Diagram: attention highlighted between the sensory memory (information from the sensors, e.g. eyes, ears) and the working memory]
Machine Translation
Sequence-to-sequence learning: both the input and the output are sequences, possibly with different lengths.
E.g. 深度學習 → deep learning
[Diagram: an encoder RNN reads 深, 度, 學, 習; its final state, carrying the information of the whole sentence, is passed to a decoder RNN, which outputs "deep", "learning", <END>]
Machine Translation with Attention
[Diagram: the encoder produces hidden states $h^1, h^2, h^3, h^4$ for 深, 度, 學, 習; a "match" module compares the initial decoder state $z^0$ with $h^1$ to produce the score $\alpha_0^1$]
What is "match"? Possible choices:
Cosine similarity of $z$ and $h$.
A small NN whose input is $z$ and $h$ and whose output is a scalar.
$\alpha = h^T W z$.
How to learn the parameters of "match"? They are learned jointly with the rest of the network.
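A minimal sketch of the three "match" variants listed above, in NumPy; the function names, shapes, and random parameters are illustrative assumptions, not from the slides.

```python
import numpy as np

def match_cosine(z, h):
    # Cosine similarity between decoder state z and encoder state h.
    return float(z @ h / (np.linalg.norm(z) * np.linalg.norm(h) + 1e-8))

def match_bilinear(z, h, W):
    # alpha = h^T W z, with a learnable matrix W.
    return float(h @ W @ z)

def match_small_nn(z, h, W1, b1, w2):
    # A small NN: concatenate z and h, one hidden layer, scalar output.
    hidden = np.tanh(W1 @ np.concatenate([z, h]) + b1)
    return float(w2 @ hidden)

# Example: score every encoder state h^i against z^0 with the bilinear match.
rng = np.random.default_rng(0)
d = 4
z0 = rng.normal(size=d)
hs = rng.normal(size=(4, d))          # h^1 .. h^4
W = rng.normal(size=(d, d))
scores = [match_bilinear(z0, h, W) for h in hs]
print(scores)                         # unnormalized alpha_0^1 .. alpha_0^4
```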
Machine Translation with Attention
The match scores $\alpha_0^1, \alpha_0^2, \alpha_0^3, \alpha_0^4$ are normalized with a softmax to get $\hat{\alpha}_0^1, \hat{\alpha}_0^2, \hat{\alpha}_0^3, \hat{\alpha}_0^4$, e.g. 0.5, 0.5, 0.0, 0.0.
$$c^0 = \sum_i \hat{\alpha}_0^i h^i = 0.5\,h^1 + 0.5\,h^2$$
$c^0$ is used as the input of the decoder RNN, which moves from state $z^0$ to $z^1$ and outputs the first word, "deep".
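A minimal sketch of the softmax normalization and the context vector $c^0$; the toy encoder states and raw scores are illustrative assumptions chosen so the normalized weights come out near 0.5, 0.5, 0.0, 0.0.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))                 # subtract max for numerical stability
    return e / e.sum()

# Toy encoder states h^1 .. h^4 and raw match scores for decoder step 0.
hs = np.array([[1., 0.], [0., 1.], [2., 2.], [3., 3.]])
alpha0 = np.array([5.0, 5.0, -5.0, -5.0])     # raw match scores

alpha0_hat = softmax(alpha0)                  # ~ [0.5, 0.5, 0.0, 0.0]
c0 = (alpha0_hat[:, None] * hs).sum(axis=0)   # c^0 = sum_i alpha_hat_0^i h^i
print(alpha0_hat, c0)                         # c0 ≈ 0.5*h1 + 0.5*h2
```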
Machine Translation with Attention
[Diagram: after generating "deep", the new decoder state $z^1$ is matched against $h^1, h^2, h^3, h^4$ again, producing the score $\alpha_1^1$, and so on]
Machine Translation with Attention
The new scores are normalized with a softmax, giving $\hat{\alpha}_1^1, \hat{\alpha}_1^2, \hat{\alpha}_1^3, \hat{\alpha}_1^4$, e.g. 0.0, 0.0, 0.5, 0.5.
$$c^1 = \sum_i \hat{\alpha}_1^i h^i = 0.5\,h^3 + 0.5\,h^4$$
$c^1$ is fed to the decoder, which moves from $z^1$ to $z^2$ and outputs the next word, "learning".
Machine Translation with Attention
[Diagram: the decoder state $z^2$ is matched against $h^1, \dots, h^4$ to get $\alpha_2^1$, and so on]
The same process repeats until <END> is generated.
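Putting the pieces together, a minimal sketch of the attention-based decoding loop; `decoder_step` and `output_word` stand in for the decoder RNN and its output layer, which the slides do not specify in detail, so the toy usage below is purely illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def decode_with_attention(hs, z0, match, decoder_step, output_word, max_len=20):
    """hs: encoder states (T, d); z0: initial decoder state.
    match(z, h) -> scalar score; decoder_step(z, c) -> next state;
    output_word(z) -> word string (or '<END>')."""
    z, words = z0, []
    for _ in range(max_len):
        scores = np.array([match(z, h) for h in hs])   # alpha_t^i
        weights = softmax(scores)                      # alpha_hat_t^i
        c = weights @ hs                               # c^t = sum_i alpha_hat_t^i h^i
        z = decoder_step(z, c)                         # z^{t+1}
        w = output_word(z)
        words.append(w)
        if w == "<END>":                               # repeat until <END>
            break
    return words

# Toy usage with stand-in components.
rng = np.random.default_rng(0)
hs = rng.normal(size=(4, 3))
Wz, Wc = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
words = decode_with_attention(
    hs, np.zeros(3),
    match=lambda z, h: float(z @ h),
    decoder_step=lambda z, c: np.tanh(Wz @ z + Wc @ c),
    output_word=lambda z: "<END>" if z.sum() > 0 else "word",
)
print(words)
```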
Speech Recognition with Attention
Chan et al., "Listen, Attend and Spell," arXiv, 2015.
Image Captioning
Input: image
Output: word sequence
[Diagram: the input image goes through a CNN, producing a single vector for the whole image; a decoder RNN then generates "a woman is ...", ending with <END>]
Image Captioning with Attention
[Diagram: the CNN instead produces a vector for each region of the image (one per filter position); a "match" module compares the initial state $z^0$ with each region vector, e.g. giving a score of 0.7 for one region]
Image Captioning with Attention
[Diagram: the region scores (e.g. 0.7, 0.1, 0.1, 0.1, 0.0, 0.0) are used to take a weighted sum of the region vectors; the weighted sum is fed to the decoder, which moves from $z^0$ to $z^1$ and outputs Word 1]
Image Captioning with Attention
[Diagram: at the next step the attention weights change (e.g. 0.0, 0.8, 0.2, 0.0, 0.0, 0.0); the new weighted sum of region vectors is fed to the decoder, which moves from $z^1$ to $z^2$ and outputs Word 2]
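A minimal sketch of attention over image regions, assuming a CNN feature map has already been computed; the (7, 7, 512) feature-map shape and the bilinear match are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Assume a CNN produced a feature map of shape (H, W, C): one C-dim vector per region.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(7, 7, 512))
regions = feature_map.reshape(-1, 512)        # (49, 512): a vector for each region

z = rng.normal(size=512)                      # current decoder state z^t
W = rng.normal(size=(512, 512)) * 0.01        # assumed bilinear match parameters
scores = regions @ W @ z                      # one match score per region
weights = softmax(scores)                     # attention over the 49 regions
context = weights @ regions                   # weighted sum of region vectors
print(weights.shape, context.shape)           # (49,), (512,)
```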
Image Captioning
Good examples
Image Captioning
Bad examples
Video Captioning
Reading Comprehension
[Diagram: the question is encoded into a vector $q$; each sentence of the document is encoded into a vector $x^1, x^2, x^3, \dots, x^N$; a "match" module compares $q$ with each sentence vector to get attention weights $\alpha_1, \alpha_2, \alpha_3, \dots, \alpha_N$]
Extracted information: $\sum_{n=1}^{N} \alpha_n x^n$, which is fed to a DNN to produce the answer.
The sentence-to-vector encoding can be jointly trained with the rest of the model.
Reading Comprehension
[Diagram: as before, the question vector $q$ is matched against the sentence vectors $x^1, \dots, x^N$ to get $\alpha_1, \dots, \alpha_N$; in addition, each sentence is encoded into a second, jointly learned set of vectors $h^1, h^2, h^3, \dots, h^N$]
Extracted information: $\sum_{n=1}^{N} \alpha_n h^n$, fed to a DNN to produce the answer.
The vectors used to compute the attention ($x^n$) and the vectors used to extract the information ($h^n$) can thus be different.
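A minimal sketch of the reading-comprehension attention above: the match is computed against one set of sentence vectors ($x^n$) while the information is extracted from a second set ($h^n$); the dot-product match and all shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
N, d = 5, 8                                  # N sentences, d-dim vectors
x = rng.normal(size=(N, d))                  # sentence vectors used for matching
h = rng.normal(size=(N, d))                  # sentence vectors used for extraction
q = rng.normal(size=d)                       # question vector

alpha = softmax(x @ q)                       # alpha_n = match(q, x^n), normalized
extracted = alpha @ h                        # sum_n alpha_n h^n
# 'extracted' would then be fed to a DNN to produce the answer.
print(alpha, extracted.shape)
```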
Memory Network
Hopping: the extracted information is used to update the query, and the compute-attention / extract-information steps are repeated over several hops; the final result is fed to a DNN to produce the answer $a$ from the question $q$.
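A minimal sketch of hopping; updating the query by adding the extracted vector is one common choice (as in End-To-End Memory Networks) and is used here as an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def memory_network(q, x, h, hops=3):
    """q: question vector; x: (N, d) match vectors; h: (N, d) extraction vectors."""
    u = q
    for _ in range(hops):
        alpha = softmax(x @ u)               # compute attention
        o = alpha @ h                        # extract information
        u = u + o                            # update the query for the next hop
    return u                                 # fed to a DNN to produce the answer a

rng = np.random.default_rng(0)
x, h, q = rng.normal(size=(5, 8)), rng.normal(size=(5, 8)), rng.normal(size=8)
print(memory_network(q, x, h).shape)
```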
Memory Network
Multi-hop performance analysis:
https://www.facebook.com/Engineering/videos/10153098860532200/
Special Attention: Spatial Transformers
[Diagram: feeding the raw input directly into a CNN gives bad results; inserting a jointly learned spatial transformer in front of the CNN gives good results]
Attention on Memory
[Diagram: attention now highlighted on the working memory / long-term memory side (encode and retrieval), rather than on the sensory input]
Neural Turing Machine
[Diagram: the von Neumann architecture, with a processor connected to an external memory]
A Neural Turing Machine is an advanced RNN/LSTM: the network is augmented with an external memory that it reads from and writes to via attention.
Neural Turing Machine
[Diagram: an RNN with inputs $x^1, x^2$, outputs $y^1, y^2$, and hidden states $h^0, h^1, h^2$; the long-term memory consists of the cells $m_0^1, m_0^2, m_0^3, m_0^4$]
Retrieval process: the attention weights $\hat{\alpha}_0^1, \dots, \hat{\alpha}_0^4$ read a vector from memory,
$$r^0 = \sum_i \hat{\alpha}_0^i m_0^i$$
which is fed, together with $x^1$, into the RNN to produce $h^1$.
Zhang et al., "Structured Memory for Neural Turing Machines," arXiv, 2015.
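A minimal sketch of the retrieval process: the read vector is the attention-weighted sum of the memory cells; the memory size and attention values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 10))                 # memory cells m_0^1 .. m_0^4 (4 cells, dim 10)
alpha0 = np.array([0.7, 0.2, 0.1, 0.0])      # attention over the cells, sums to 1

r0 = alpha0 @ M                              # r^0 = sum_i alpha_hat_0^i m_0^i
# r0 is fed, together with the input x^1, into the RNN to produce h^1.
print(r0.shape)                              # (10,)
```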
Neural Turing Machine
[Diagram: from the hidden state $h^1$, the controller emits a key $k^1$, an erase vector $e^1$, and an add vector $a^1$]
The attention over the memory is updated using the key (simplified):
$$\alpha_1^i = (1-\lambda)\,\alpha_0^i + \lambda\,\cos(m_0^i, k^1)$$
The new scores $\alpha_1^1, \dots, \alpha_1^4$ are passed through a softmax to obtain $\hat{\alpha}_1^1, \dots, \hat{\alpha}_1^4$.
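A minimal sketch of the (simplified) address update above: interpolate between the previous attention and the cosine similarity with the emitted key, then normalize with a softmax; the value of lambda and the shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def update_attention(alpha_prev, M, k, lam=0.5):
    # alpha_1^i = (1 - lambda) * alpha_0^i + lambda * cos(m_0^i, k^1), then softmax
    sim = np.array([cosine(m, k) for m in M])
    alpha = (1 - lam) * alpha_prev + lam * sim
    return softmax(alpha)

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 10))                 # memory cells m_0^i
k1 = rng.normal(size=10)                     # key emitted by the controller
alpha0 = np.array([0.7, 0.2, 0.1, 0.0])
print(update_attention(alpha0, M, k1))       # alpha_hat_1^1 .. alpha_hat_1^4
```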
Neural Turing Machine
Encode (writing) process: each memory cell is updated with the erase vector $e^1$ and the add vector $a^1$,
$$m_1^i = m_0^i \odot \left(1 - \hat{\alpha}_1^i\, e^1\right) + \hat{\alpha}_1^i\, a^1 \quad \text{(element-wise)}$$
producing the new memory cells $m_1^1, m_1^2, m_1^3, m_1^4$.
Zhang et al., "Structured Memory for Neural Turing Machines," arXiv, 2015.
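A minimal sketch of the encode (writing) step, with element-wise erase and add; assuming $e^1$ lies in [0, 1] (e.g. a sigmoid output), and with illustrative shapes and attention values.

```python
import numpy as np

def write_memory(M, alpha, e, a):
    # m_1^i = m_0^i * (1 - alpha_hat_1^i * e^1) + alpha_hat_1^i * a^1  (element-wise)
    return M * (1 - alpha[:, None] * e) + alpha[:, None] * a

rng = np.random.default_rng(0)
M0 = rng.normal(size=(4, 10))                # memory before writing
alpha1 = np.array([0.1, 0.6, 0.2, 0.1])      # write attention alpha_hat_1^i
e1 = rng.uniform(size=10)                    # erase vector in [0, 1]
a1 = rng.normal(size=10)                     # add vector

M1 = write_memory(M0, alpha1, e1, a1)        # new memory m_1^1 .. m_1^4
print(M1.shape)
```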
Neural Turing Machine
[Diagram: the process unrolled over time: attention $\hat{\alpha}_0^i$ reads $r^0$ from memory $m_0^i$ for step 1 ($x^1, h^1, y^1$); the memory is updated to $m_1^i$, and attention $\hat{\alpha}_1^i$ reads $r^1$ for step 2 ($x^2, h^2, y^2$); then $\hat{\alpha}_2^i$ and $m_2^i$, and so on]
Stack RNN
[Diagram: at each step the RNN takes $x_t$ and the top of the stack, and produces $h_t$, the output $y_t$, a vector of information to store, and probabilities for Push, Pop, and Nothing (e.g. 0.7, 0.2, 0.1); the new stack is the weighted combination of the three candidate stacks: 0.7 × (push) + 0.2 × (pop) + 0.1 × (nothing)]
Joulin and Mikolov, "Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets," 2015.
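A minimal sketch of the soft stack update: the new stack is a probability-weighted mixture of the stack after a push, after a pop, and left unchanged; the fixed depth and zero padding at the bottom are illustrative assumptions.

```python
import numpy as np

def stack_step(stack, p_push, p_pop, p_noop, v):
    """stack: (depth, d) with the top at row 0; v: vector of information to store."""
    pushed = np.vstack([v, stack[:-1]])                 # push v on top, drop the bottom
    popped = np.vstack([stack[1:], np.zeros_like(v)])   # pop the top, pad the bottom
    noop = stack                                        # leave the stack unchanged
    # New stack = p_push * pushed + p_pop * popped + p_noop * unchanged
    return p_push * pushed + p_pop * popped + p_noop * noop

rng = np.random.default_rng(0)
stack = rng.normal(size=(5, 3))                         # depth-5 stack of 3-dim vectors
v = rng.normal(size=3)                                  # information to store
new_stack = stack_step(stack, 0.7, 0.2, 0.1, v)         # e.g. Push 0.7, Pop 0.2, Nothing 0.1
print(new_stack.shape)                                  # (5, 3)
```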
Concluding Remarks
Attention on sensory information: Machine Translation, Speech Recognition, Image Captioning, Question Answering.
Attention on memory: Neural Turing Machine, Stack RNN.
[Diagram: the memory model revisited: sensory memory (information from the sensors, e.g. eyes, ears) → attention → working memory → encode/retrieval → long-term memory]
Reference
S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus, "End-To-End Memory Networks," arXiv preprint, 2015.
A. Graves, G. Wayne, I. Danihelka, "Neural Turing Machines," arXiv preprint, 2014.
A. Kumar et al., "Ask Me Anything: Dynamic Memory Networks for Natural Language Processing," arXiv preprint, 2015.
D. Bahdanau, K. Cho, Y. Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate," International Conference on Learning Representations, 2015.
K. Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention," arXiv preprint, 2015.
J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, "Attention-Based Models for Speech Recognition," arXiv preprint, 2015.
A. M. Rush, S. Chopra, J. Weston, "A Neural Attention Model for Abstractive Sentence Summarization," EMNLP 2015.