Attention Mechanism
Applied Deep Learning
March 14th, 2022 http://adl.miulab.tw
[Figure: human memory analogy. Sensory input ("breakfast today"), working memory ("what you learned in these lectures"), long-term memory ("summer vacation 10 years ago"); answering a question such as "What is deep learning?" retrieves and organizes stored information. Source: http://henrylo1605.blogspot.tw/2015/05/blog-post_56.html]
Attention and Memory

[Figure: memory architecture. Information from sensors (e.g. eyes, ears) enters sensory memory; attention selects part of it into working memory; encode and retrieval connect working memory with long-term memory.]
◉ Problem: the input is a very long sequence or an image.
Solution: pay attention to a part of the input at each time step.
◉ Problem: a larger memory implies more parameters in an RNN.
Solution: long-term memory increases the memory size without increasing the number of parameters.
Attention on Sensory Info

Machine Translation
◉ Sequence-to-sequence learning: the input and output are both sequences, with different lengths.
◉ E.g. 深度學習 → deep learning
[Figure: encoder-decoder for machine translation. An RNN encoder reads 深 度 學 習 and compresses the information of the whole sentence into one vector; an RNN decoder generates "deep", "learning", <END> from it.]
Machine Translation with Attention

The encoder RNN produces hidden states $h_1, h_2, h_3, h_4$ for the input 深 度 學 習. The initial decoder state $z_0$ is matched against each $h_i$ to produce a score $\alpha_{0i}$.

What is match? Possible choices:
➢ Cosine similarity of $z$ and $h$
➢ A small NN whose input is $z$ and $h$ and whose output is a scalar
➢ $\alpha = h^T W z$

How to learn the parameters? The match parameters are learned jointly with the rest of the network.

The scores are normalized with a softmax to obtain $\hat{\alpha}_{0i}$, and the context vector

$c_0 = \sum_i \hat{\alpha}_{0i} h_i$

is used as the decoder RNN input to generate $z_1$ and the first output word "deep". E.g., if $(\hat{\alpha}_{01}, \hat{\alpha}_{02}, \hat{\alpha}_{03}, \hat{\alpha}_{04}) = (0.5, 0.5, 0.0, 0.0)$, then $c_0 = 0.5 h_1 + 0.5 h_2$.
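A minimal NumPy sketch of this first attention step, assuming the bilinear match $\alpha = h^T W z$ (all names and dimensions here are illustrative, not from the lecture):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8
H = np.random.randn(4, d)      # encoder states h_1..h_4 for 深 度 學 習
z0 = np.random.randn(d)        # initial decoder state
W = np.random.randn(d, d)      # match parameters, learned jointly

alpha = H @ W @ z0             # match scores alpha_{0i} = h_i^T W z_0
alpha_hat = softmax(alpha)     # normalized attention weights
c0 = alpha_hat @ H             # context vector c_0 = sum_i alpha_hat_{0i} h_i
# c0 (together with z0) is the decoder RNN input that produces z1 and "deep"
```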
At the next step, the new decoder state $z_1$ is matched against each $h_i$ to produce scores $\alpha_{1i}$. After the softmax, e.g. $(\hat{\alpha}_{11}, \hat{\alpha}_{12}, \hat{\alpha}_{13}, \hat{\alpha}_{14}) = (0.0, 0.0, 0.5, 0.5)$, the new context vector

$c_1 = \sum_i \hat{\alpha}_{1i} h_i = 0.5 h_3 + 0.5 h_4$

is fed into the decoder RNN to produce $z_2$ and the next word "learning".
Then $z_2$ is matched against the $h_i$ to obtain $\alpha_{2i}$, and the same process repeats until <END> is generated.
Viewed more generally, the decoder state $z_0$ plays the role of a query, and the encoder states $h_1, \ldots, h_4$ play the roles of both keys (matched against the query to produce the weights $\hat{\alpha}_{0i}$) and values (combined into the weighted sum $c_0$).
Dot-Product Attention
◉ Input: a query $q$ and a set of key-value ($k$-$v$) pairs
◉ Output: weighted sum of values
○ Query $q$ is a $d_k$-dim vector
○ Key $k$ is a $d_k$-dim vector
○ Value $v$ is a $d_v$-dim vector

Each weight is the softmax of the inner product of the query and the corresponding key:

$A(q, K, V) = \sum_i \frac{e^{q \cdot k_i}}{\sum_j e^{q \cdot k_j}} v_i$

◉ Stacking the queries into a matrix $Q$, the output is a set of weighted sums of values, with the softmax applied row-wise:

$A(Q, K, V) = \mathrm{softmax}(QK^T) V$
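A minimal NumPy sketch of the matrix form $\mathrm{softmax}(QK^T)V$ (shapes are illustrative):

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """A(Q, K, V) = softmax(Q K^T) V, with the softmax applied row-wise."""
    scores = Q @ K.T                                # inner products, shape (n_q, n_kv)
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # one weighted sum of values per query

Q = np.random.randn(2, 4)   # 2 queries, d_k = 4
K = np.random.randn(3, 4)   # 3 keys, d_k = 4
V = np.random.randn(3, 5)   # 3 values, d_v = 5
print(dot_product_attention(Q, K, V).shape)         # (2, 5)
```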
Attention Applications

Attention is used in many different applications, e.g. speech recognition:
Chan et al., "Listen, Attend and Spell," arXiv, 2015.
Image Captioning
◉ Input: image
◉ Output: word sequence

[Figure: a CNN encodes the input image into a single vector for the whole image; an RNN decoder generates "a woman is ……" until <END>.]
Image Captioning with Attention

Instead of a single vector for the whole image, the CNN keeps a vector for each region (one per filter position). The decoder state $z_0$ is matched against each region vector (e.g. a match score of 0.7 for the first region); the resulting weights, e.g. (0.7, 0.1, 0.1, 0.1, 0.0, 0.0), give a weighted sum of region vectors that is fed to the RNN to produce $z_1$ and word 1. Then $z_1$ attends again, e.g. with weights (0.0, 0.8, 0.2, 0.0, 0.0, 0.0), and the new weighted sum produces $z_2$, and so on.
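A minimal sketch of one captioning step, assuming a plain dot-product match between the decoder state and each region vector (shapes illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 16
regions = np.random.randn(6, d)   # one CNN feature vector per image region
z0 = np.random.randn(d)           # decoder state

weights = softmax(regions @ z0)   # match z0 against every region
context = weights @ regions       # weighted sum of region vectors
# context is fed to the RNN to produce z1 and word 1; repeat with z1, z2, ...
```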
Image Captioning
◉ Good examples
[Figure: example images with the generated captions.]
Video Captioning
Reading Comprehension

The question $q$ is matched against each sentence $x_1, x_2, x_3, \ldots, x_N$ in the document, yielding attention weights $\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_N$. The extracted information

$= \sum_{n=1}^{N} \alpha_n x_n$

is fed into a DNN that outputs the answer.

The sentence encoders can be jointly trained: each sentence is encoded into $h_1, h_2, h_3, \ldots, h_N$, the encoders and the match are jointly learned, and the extracted information becomes

$= \sum_{n=1}^{N} \alpha_n h_n$
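A minimal sketch of the extraction step, assuming dot-product matching against jointly learned sentence encodings (all names illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, N = 32, 5
q = np.random.randn(d)       # encoded question
H = np.random.randn(N, d)    # h_1..h_N: jointly learned sentence encodings

alpha = softmax(H @ q)       # match the question against each sentence
extracted = alpha @ H        # extracted information = sum_n alpha_n h_n
# `extracted` goes into a DNN that outputs the answer
```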
Memory Network
◉ Hopping: the attend-and-extract cycle repeats. Compute attention over the memory and extract information; use the result to compute attention again and extract again. After the final hop, a DNN maps the extracted information to the answer $a$ for the query $q$.
https://www.facebook.com/Engineering/videos/10153098860532200/
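A minimal sketch of hopping, assuming the extracted vector is added to the query between hops (a simplification of real memory networks):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, N, hops = 32, 5, 2
memory = np.random.randn(N, d)    # encoded document sentences
u = np.random.randn(d)            # encoded query q

for _ in range(hops):             # "hopping"
    alpha = softmax(memory @ u)   # compute attention
    extracted = alpha @ memory    # extract information
    u = u + extracted             # carry the result into the next hop

# a DNN maps the final u to the answer a
```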
Conversational QA – CoQA, QuAC
◉ The QA pairs are conversational:
- Q1: Who had a birthday? A1: Jessica
- Q2: How old would she be? A2: 80
- Q3: Did she plan to have any visitors? A3: Yes
- Q4: How many? A4: Three
- Q5: Who? A5: Annie, Melanie, and Josh

Passage: Jessica went to sit in her rocking chair. Today was her birthday and she was turning 80. Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her. Her daughter Melanie and Melanie's husband Josh were coming as well. Jessica had . . .
Attention on Memory

[Figure: memory architecture, now highlighting attention on long-term memory via encode and retrieval between working memory and long-term memory.]
Neural Turing Machine
◉ Analogous to the von Neumann architecture: a controller network reads from and writes to an external memory
◉ A Neural Turing Machine is an advanced RNN/LSTM

Zhang et al., "Structured Memory for Neural Turing Machines," arXiv, 2015.
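A minimal sketch of a content-based read from NTM memory, using cosine-similarity addressing with a sharpening factor beta (a simplification of the full NTM addressing mechanism; names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

N, d = 8, 4
M = np.random.randn(N, d)    # external memory: N slots of width d
k = np.random.randn(d)       # read key emitted by the controller
beta = 2.0                   # sharpening factor

# content-based addressing: cosine similarity between the key and each slot
sims = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
w = softmax(beta * sims)     # attention weights over memory slots
r = w @ M                    # read vector returned to the controller
```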
[Figure: summary. The memory architecture (sensory memory → attention → working memory → encode/retrieval → long-term memory), with attention on sensory input covering machine translation, speech recognition, image captioning, and question answering, and attention on memory covering the Neural Turing Machine and Stack RNN.]