
(1)

Attention Mechanism

Applied Deep Learning

March 14th, 2022 http://adl.miulab.tw

(2)

Organize

http://henrylo1605.blogspot.tw/2015/05/blog-post_56.html

[Figure: memories such as "breakfast today", "what you learned in these lectures", and "summer vacation 10 years ago" are organized in memory; to answer the question "What is deep learning?", attention goes to the relevant memory.]

(3)

Attention and Memory

3

Information from sensors (e.g. eyes, ears) → Sensory Memory → [Attention] → Working Memory → [Encode / Retrieval] ↔ Long-term Memory

Problem: a very long sequence or an image

Solution: pay attention to part of the input at each time step

(4)

Information from sensors (e.g. eyes, ears) → Sensory Memory → [Attention] → Working Memory → [Encode / Retrieval] ↔ Long-term Memory

Problem: a very long sequence or an image

Solution: pay attention to part of the input at each time step

Problem: a larger memory implies more parameters in an RNN

Solution: long-term memory increases the memory size without increasing the number of parameters

(5)

Attention on Sensory Info

5

Information from sensors (e.g. eyes, ears) → Sensory Memory → [Attention] → Working Memory → [Encode / Retrieval] ↔ Long-term Memory

(6)

Sequence-to-sequence learning: the input and output are both sequences, possibly with different lengths.

E.g. 深度學習 → deep learning

[Figure: an RNN encoder reads 深, 度, 學, 習; its final state, which carries the information of the whole sentence, is passed to an RNN decoder that generates "deep", "learning", <END>.]

(7)

Machine Translation with Attention

7

[Figure: the initial decoder state $z_0$ is matched against the encoder hidden states $h_1, h_2, h_3, h_4$ for 深, 度, 學, 習, producing the attention score $\alpha_{01}$.]

What is "match"?

➢ Cosine similarity of $z$ and $h$

➢ A small NN whose input is $z$ and $h$ and whose output is a scalar

➢ $\alpha = h^{T} W z$

How to learn the parameters?
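To make the three options concrete, here is a minimal NumPy sketch (function names and parameter shapes are mine, not from the slides); W1, w2, and W would be learned jointly with the rest of the translation network.

```python
import numpy as np

def match_cosine(z, h):
    # Option 1: cosine similarity of z and h (assumes z and h have the same dimension)
    return float(z @ h / (np.linalg.norm(z) * np.linalg.norm(h)))

def match_small_nn(z, h, W1, w2):
    # Option 2: a small NN whose input is (z, h) and whose output is a scalar
    hidden = np.tanh(W1 @ np.concatenate([z, h]))
    return float(w2 @ hidden)

def match_bilinear(z, h, W):
    # Option 3: alpha = h^T W z
    return float(h @ W @ z)
```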

(8)

[Figure: the scores $\alpha_{01}, \alpha_{02}, \alpha_{03}, \alpha_{04}$ from matching $z_0$ against $h_1, \dots, h_4$ (for 深, 度, 學, 習) are normalized by a softmax into $\hat{\alpha}_{01}, \hat{\alpha}_{02}, \hat{\alpha}_{03}, \hat{\alpha}_{04}$, here 0.5, 0.5, 0.0, 0.0. The context vector $c_0 = \sum_i \hat{\alpha}_{0i} h_i = 0.5\,h_1 + 0.5\,h_2$ is used as the RNN input for the decoder state $z_1$, which generates "deep".]

How to learn the parameters?
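Spelling out the computation above as a worked example (using the 0.5 / 0.5 / 0.0 / 0.0 weights shown on this slide):

$$\hat{\alpha}_{0i} = \frac{\exp(\alpha_{0i})}{\sum_{j=1}^{4} \exp(\alpha_{0j})}, \qquad c_0 = \sum_{i=1}^{4} \hat{\alpha}_{0i} h_i = 0.5\,h_1 + 0.5\,h_2 .$$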

(9)

Machine Translation with Attention

9

[Figure: the context vector $c_0$ and state $z_0$ produce the decoder state $z_1$, which outputs "deep"; $z_1$ is then matched against the encoder states $h_1, \dots, h_4$ for 深, 度, 學, 習, producing the score $\alpha_{11}$.]

(10)

[Figure: the new scores are normalized by a softmax into $\hat{\alpha}_{11}, \hat{\alpha}_{12}, \hat{\alpha}_{13}, \hat{\alpha}_{14}$, here 0.0, 0.0, 0.5, 0.5, giving the context vector $c_1 = \sum_i \hat{\alpha}_{1i} h_i = 0.5\,h_3 + 0.5\,h_4$; $c_1$ is fed to the decoder, whose state $z_2$ generates "learning".]

(11)

Machine Translation with Attention

11

[Figure: $z_2$ is matched against the encoder states again, producing $\alpha_{21}, \dots$]

The same process repeats until <END> is generated.

(12)

[Figure: the same attention diagram annotated with the query, key, and value roles: the decoder state $z_0$ is the query, the encoder states matched against it act as keys, and the states combined by the softmax weights $\hat{\alpha}_{01}, \dots, \hat{\alpha}_{04}$ (here 0.5, 0.5, 0.0, 0.0) into $c_0$ act as values.]

(13)

Dot-Product Attention

Input: a query $q$ and a set of key-value ($k$-$v$) pairs

Output: a weighted sum of the values, where the weight of each value is computed from the inner product of the query and the corresponding key

Query $q$ is a $d_k$-dim vector

Key $k$ is a $d_k$-dim vector

Value $v$ is a $d_v$-dim vector

13
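In symbols, one standard way to write what this slide describes (the formula itself did not survive extraction, so this is a reconstruction): the output for a single query is

$$A(q, K, V) = \sum_i \frac{\exp(q \cdot k_i)}{\sum_j \exp(q \cdot k_j)}\, v_i ,$$

i.e. a softmax over the inner products of the query with each key, used to weight the values.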

(14)

Output: a set of weighted sums of the values; with the queries stacked into a matrix, the softmax is applied row-wise.
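A minimal NumPy sketch of this matrix form (names are mine, not from the deck): the queries are stacked into a matrix Q, the score matrix QKᵀ is normalized with a row-wise softmax, and the result multiplies the value matrix V.

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_kv, d_k), V: (n_kv, d_v) -> output: (n_q, d_v)."""
    scores = Q @ K.T                                # inner products of each query with each key
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # each row is a weighted sum of values

# Example: 2 queries attending over 4 key-value pairs
Q = np.random.randn(2, 8)
K = np.random.randn(4, 8)
V = np.random.randn(4, 16)
print(dot_product_attention(Q, K, V).shape)         # (2, 16)
```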

(15)

Attention is used in all kinds of applications.

Attention Applications

15

(16)

Chan et al., “Listen, Attend and Spell,” arXiv, 2015.

(17)

Image Captioning

Input: image

Output: word sequence

17

[Figure: a CNN encodes the input image into a single vector for the whole image; an RNN decoder conditioned on that vector generates the caption "a woman is ...", ending with <END>.]

(18)

[Figure: the CNN instead produces one vector per image region (one per filter location); the decoder state $z_0$ is matched against each region vector, e.g. giving a score of 0.7 for one region.]

(19)

Image Captioning with Attention

19

[Figure: the match scores over the region vectors, e.g. 0.7, 0.1, 0.1, 0.1, 0.0, 0.0, are used to take a weighted sum of the region vectors; the result is fed to the decoder, whose state $z_1$ generates Word 1.]

(20)

[Figure: at the next step the weights change, e.g. to 0.0, 0.8, 0.2, 0.0, 0.0, 0.0; the new weighted sum of region vectors is fed to the decoder, whose state $z_2$ generates the next word.]
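As a rough sketch of what these image-captioning slides depict (the dot-product match and the variable names are assumptions on my part): the decoder state is matched against every region vector, the scores are softmax-normalized, and the weighted sum of region vectors is fed back to the decoder for the next word.

```python
import numpy as np

def region_attention(z, regions):
    """z: decoder state (d,), regions: CNN region vectors (n_regions, d) -> context (d,)."""
    scores = regions @ z                      # match the decoder state against every region
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # e.g. 0.7, 0.1, 0.1, 0.1, 0.0, 0.0
    return weights @ regions                  # weighted sum of region vectors
```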

(21)

Image Captioning

Good examples

21

(22)
(23)

Video Captioning

23

(24)
(25)

Reading Comprehension

25

Answer

[Figure: the question $q$ is matched against sentence vectors $x_1, x_2, x_3, \dots, x_N$ from the document, giving weights $\alpha_1, \alpha_2, \alpha_3, \dots, \alpha_N$; the extracted information $\sum_{n=1}^{N} \alpha_n x_n$ is fed to a DNN that outputs the answer.]

Sentence encoders can be jointly trained.
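A minimal sketch of the extraction step (the dot-product match and the softmax normalization are my assumptions; the slide only says the sentence encoders can be jointly trained): the question vector is matched against every sentence vector, the weights are normalized, and the weighted sum is what gets passed to the answer DNN.

```python
import numpy as np

def extract_information(q, X):
    """q: question vector (d,), X: sentence vectors (N, d) -> extracted information (d,)."""
    scores = X @ q                        # match the question against each sentence x_n
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # attention weights alpha_1 ... alpha_N
    return alpha @ X                      # sum_n alpha_n * x_n, fed to the answer DNN
```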

(26)

[Figure: as on the previous slide, the question $q$ is matched against sentence vectors $x_1, x_2, x_3, \dots, x_N$ to get weights $\alpha_1, \dots, \alpha_N$, but the extracted information $\sum_{n=1}^{N} \alpha_n h_n$ is now a weighted sum over a second, jointly learned set of sentence vectors $h_1, h_2, h_3, \dots, h_N$; the result is fed to a DNN, and the extraction can be repeated (hopping).]

(27)

Memory Network

27

[Figure: multi-hop attention. Starting from the question $q$, the model computes attention over the memory and extracts information; the extracted result is used to compute attention and extract information again, and after several hops a DNN produces the answer $a$.]
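A sketch of the hopping loop in this figure (how the extracted information updates the query is my assumption; memory-network variants differ in this detail): attention and extraction are repeated for several hops before the DNN produces the answer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_hop_read(q, memory, n_hops=2):
    """q: query (d,), memory: (N, d) -> query refined over several hops."""
    for _ in range(n_hops):
        alpha = softmax(memory @ q)   # compute attention over the memory
        extracted = alpha @ memory    # extract information
        q = q + extracted             # fold the extraction back into the query (assumed)
    return q                          # passed to a DNN that outputs the answer a
```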

(28)

https://www.facebook.com/Engineering/videos/10153098860532200/

(29)

Conversational QA – CoQA, QuAC

The QA pairs are conversational

- Q1: Who had a birthday?
- A1: Jessica
- Q2: How old would she be?
- A2: 80
- Q3: Did she plan to have any visitors?
- A3: Yes
- Q4: How many?
- A4: Three
- Q5: Who?
- A5: Annie, Melanie, and Josh

29

Jessica went to sit in her rocking chair.

Today was her birthday and she was turning 80. Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her.

Her daughter Melanie and Melanie’s husband Josh were coming as well.

Jessica had . . .

(30)

Attention on Memory

30

[Figure: the memory diagram again, now highlighting attention between Working Memory and Long-term Memory (Encode / Retrieval).]

(31)

Neural Turing Machine

[Figure: the Neural Turing Machine alongside the von Neumann architecture.]

The Neural Turing Machine is an advanced RNN/LSTM.

31

Zhang et al., “Structured Memory for Neural Turing Machines,” arXiv, 2015.

(32)

[Figure: the memory diagram once more. Machine translation, speech recognition, image captioning, and question answering use attention on sensory input (information from sensors, e.g. eyes and ears, → Sensory Memory → Working Memory); the Neural Turing Machine and Stack RNN use attention on memory (Working Memory ↔ Long-term Memory via Encode / Retrieval).]
