
(1)

Attention Mechanism

Applied Deep Learning

March 14th, 2022 http://adl.miulab.tw

(2)

Organize

http://henrylo1605.blogspot.tw/2015/05/blog-post_56.html

[Figure: memories such as "breakfast today", "what you learned in these lectures", and "summer vacation 10 years ago" are organized in memory; to answer the question "What is deep learning?", attention goes to the relevant memory.]

(3)

Attention and Memory

3

Information from sensors (e.g. eyes, ears) → Sensory Memory → [Attention] → Working Memory → [Encode / Retrieval] ↔ Long-term Memory

Problem: a very long sequence or an image

Solution: pay attention to part of the input at each time step

(4)

Information from sensors (e.g. eyes, ears) → Sensory Memory → [Attention] → Working Memory → [Encode / Retrieval] ↔ Long-term Memory

Problem: a very long sequence or an image

Solution: pay attention to part of the input at each time step

Problem: a larger memory implies more parameters in an RNN

Solution: long-term memory increases the memory size without increasing the number of parameters

(5)

Attention on Sensory Info

5

Information from sensors (e.g. eyes, ears) → Sensory Memory → [Attention] → Working Memory → [Encode / Retrieval] ↔ Long-term Memory

(6)

Sequence-to-sequence learning: the input and output are both sequences, possibly with different lengths.

E.g. 深度學習 → deep learning

[Figure: an RNN encoder reads 深, 度, 學, 習; its final state, which carries the information of the whole sentence, is passed to an RNN decoder that generates "deep", "learning", <END>.]

(7)

Machine Translation with Attention

7

[Figure: the initial decoder state $z_0$ is matched against the encoder hidden states $h_1, h_2, h_3, h_4$ for 深, 度, 學, 習, producing the attention score $\alpha_{01}$.]

What is "match"?

➢ Cosine similarity of $z$ and $h$

➢ A small NN whose input is $z$ and $h$ and whose output is a scalar

➢ $\alpha = h^{T} W z$

How to learn the parameters?
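To make the three options concrete, here is a minimal NumPy sketch (function names and parameter shapes are mine, not from the slides); W1, w2, and W would be learned jointly with the rest of the translation network.

```python
import numpy as np

def match_cosine(z, h):
    # Option 1: cosine similarity of z and h (assumes z and h have the same dimension)
    return float(z @ h / (np.linalg.norm(z) * np.linalg.norm(h)))

def match_small_nn(z, h, W1, w2):
    # Option 2: a small NN whose input is (z, h) and whose output is a scalar
    hidden = np.tanh(W1 @ np.concatenate([z, h]))
    return float(w2 @ hidden)

def match_bilinear(z, h, W):
    # Option 3: alpha = h^T W z
    return float(h @ W @ z)
```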

(8)

[Figure: the scores $\alpha_{01}, \alpha_{02}, \alpha_{03}, \alpha_{04}$ from matching $z_0$ against $h_1, \dots, h_4$ (for 深, 度, 學, 習) are normalized by a softmax into $\hat{\alpha}_{01}, \hat{\alpha}_{02}, \hat{\alpha}_{03}, \hat{\alpha}_{04}$, here 0.5, 0.5, 0.0, 0.0. The context vector $c_0 = \sum_i \hat{\alpha}_{0i} h_i = 0.5\,h_1 + 0.5\,h_2$ is used as the RNN input for the decoder state $z_1$, which generates "deep".]

How to learn the parameters?
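Spelling out the computation above as a worked example (using the 0.5 / 0.5 / 0.0 / 0.0 weights shown on this slide):

$$\hat{\alpha}_{0i} = \frac{\exp(\alpha_{0i})}{\sum_{j=1}^{4} \exp(\alpha_{0j})}, \qquad c_0 = \sum_{i=1}^{4} \hat{\alpha}_{0i} h_i = 0.5\,h_1 + 0.5\,h_2 .$$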

(9)

Machine Translation with Attention

9

[Figure: the context vector $c_0$ and state $z_0$ produce the decoder state $z_1$, which outputs "deep"; $z_1$ is then matched against the encoder states $h_1, \dots, h_4$ for 深, 度, 學, 習, producing the score $\alpha_{11}$.]

(10)

[Figure: the new scores are normalized by a softmax into $\hat{\alpha}_{11}, \hat{\alpha}_{12}, \hat{\alpha}_{13}, \hat{\alpha}_{14}$, here 0.0, 0.0, 0.5, 0.5, giving the context vector $c_1 = \sum_i \hat{\alpha}_{1i} h_i = 0.5\,h_3 + 0.5\,h_4$; $c_1$ is fed to the decoder, whose state $z_2$ generates "learning".]

(11)

Machine Translation with Attention

11

[Figure: $z_2$ is matched against the encoder states again, producing $\alpha_{21}, \dots$]

The same process repeats until <END> is generated.

(12)

[Figure: the same attention diagram annotated with the query, key, and value roles: the decoder state $z_0$ is the query, the encoder states matched against it act as keys, and the states combined by the softmax weights $\hat{\alpha}_{01}, \dots, \hat{\alpha}_{04}$ (here 0.5, 0.5, 0.0, 0.0) into $c_0$ act as values.]

(13)

Dot-Product Attention

Input: a query $q$ and a set of key-value ($k$-$v$) pairs

Output: a weighted sum of the values, where the weight of each value is computed from the inner product of the query and the corresponding key

Query $q$ is a $d_k$-dim vector

Key $k$ is a $d_k$-dim vector

Value $v$ is a $d_v$-dim vector

13
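In symbols, one standard way to write what this slide describes (the formula itself did not survive extraction, so this is a reconstruction): the output for a single query is

$$A(q, K, V) = \sum_i \frac{\exp(q \cdot k_i)}{\sum_j \exp(q \cdot k_j)}\, v_i ,$$

i.e. a softmax over the inner products of the query with each key, used to weight the values.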

(14)

Output: a set of weighted sums of the values; with the queries stacked into a matrix, the softmax is applied row-wise.
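A minimal NumPy sketch of this matrix form (names are mine, not from the deck): the queries are stacked into a matrix Q, the score matrix QKᵀ is normalized with a row-wise softmax, and the result multiplies the value matrix V.

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_kv, d_k), V: (n_kv, d_v) -> output: (n_q, d_v)."""
    scores = Q @ K.T                                # inner products of each query with each key
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # each row is a weighted sum of values

# Example: 2 queries attending over 4 key-value pairs
Q = np.random.randn(2, 8)
K = np.random.randn(4, 8)
V = np.random.randn(4, 16)
print(dot_product_attention(Q, K, V).shape)         # (2, 16)
```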

(15)

Attention is used in all kinds of applications.

Attention Applications

15

(16)

Chan et al., “Listen, Attend and Spell,” arXiv, 2015.

(17)

Image Captioning

Input: image

Output: word sequence

17

[Figure: a CNN encodes the input image into a single vector for the whole image; an RNN decoder conditioned on that vector generates the caption "a woman is ...", ending with <END>.]

(18)

[Figure: the CNN instead produces one vector per image region (one per filter location); the decoder state $z_0$ is matched against each region vector, e.g. giving a score of 0.7 for one region.]

(19)

Image Captioning with Attention

19

[Figure: the match scores over the region vectors, e.g. 0.7, 0.1, 0.1, 0.1, 0.0, 0.0, are used to take a weighted sum of the region vectors; the result is fed to the decoder, whose state $z_1$ generates Word 1.]

(20)

[Figure: at the next step the weights change, e.g. to 0.0, 0.8, 0.2, 0.0, 0.0, 0.0; the new weighted sum of region vectors is fed to the decoder, whose state $z_2$ generates the next word.]
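As a rough sketch of what these image-captioning slides depict (the dot-product match and the variable names are assumptions on my part): the decoder state is matched against every region vector, the scores are softmax-normalized, and the weighted sum of region vectors is fed back to the decoder for the next word.

```python
import numpy as np

def region_attention(z, regions):
    """z: decoder state (d,), regions: CNN region vectors (n_regions, d) -> context (d,)."""
    scores = regions @ z                      # match the decoder state against every region
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # e.g. 0.7, 0.1, 0.1, 0.1, 0.0, 0.0
    return weights @ regions                  # weighted sum of region vectors
```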

(21)

Image Captioning

Good examples

21

(22)
(23)

Video Captioning

23

(24)
(25)

Reading Comprehension

25

Answer

[Figure: the question $q$ is matched against sentence vectors $x_1, x_2, x_3, \dots, x_N$ from the document, giving weights $\alpha_1, \alpha_2, \alpha_3, \dots, \alpha_N$; the extracted information $\sum_{n=1}^{N} \alpha_n x_n$ is fed to a DNN that outputs the answer.]

Sentence encoders can be jointly trained.
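A minimal sketch of the extraction step (the dot-product match and the softmax normalization are my assumptions; the slide only says the sentence encoders can be jointly trained): the question vector is matched against every sentence vector, the weights are normalized, and the weighted sum is what gets passed to the answer DNN.

```python
import numpy as np

def extract_information(q, X):
    """q: question vector (d,), X: sentence vectors (N, d) -> extracted information (d,)."""
    scores = X @ q                        # match the question against each sentence x_n
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # attention weights alpha_1 ... alpha_N
    return alpha @ X                      # sum_n alpha_n * x_n, fed to the answer DNN
```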

(26)

[Figure: as on the previous slide, the question $q$ is matched against sentence vectors $x_1, x_2, x_3, \dots, x_N$ to get weights $\alpha_1, \dots, \alpha_N$, but the extracted information $\sum_{n=1}^{N} \alpha_n h_n$ is now a weighted sum over a second, jointly learned set of sentence vectors $h_1, h_2, h_3, \dots, h_N$; the result is fed to a DNN, and the extraction can be repeated (hopping).]

(27)

Memory Network

27

[Figure: multi-hop attention. Starting from the question $q$, the model computes attention over the memory and extracts information; the extracted result is used to compute attention and extract information again, and after several hops a DNN produces the answer $a$.]
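A sketch of the hopping loop in this figure (how the extracted information updates the query is my assumption; memory-network variants differ in this detail): attention and extraction are repeated for several hops before the DNN produces the answer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_hop_read(q, memory, n_hops=2):
    """q: query (d,), memory: (N, d) -> query refined over several hops."""
    for _ in range(n_hops):
        alpha = softmax(memory @ q)   # compute attention over the memory
        extracted = alpha @ memory    # extract information
        q = q + extracted             # fold the extraction back into the query (assumed)
    return q                          # passed to a DNN that outputs the answer a
```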

(28)

https://www.facebook.com/Engineering/videos/10153098860532200/

(29)

Conversational QA – CoQA, QuAC

The QA pairs are conversational

- Q1: Who had a birthday?
- A1: Jessica
- Q2: How old would she be?
- A2: 80
- Q3: Did she plan to have any visitors?
- A3: Yes
- Q4: How many?
- A4: Three
- Q5: Who?
- A5: Annie, Melanie, and Josh

29

Jessica went to sit in her rocking chair.

Today was her birthday and she was turning 80. Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her.

Her daughter Melanie and Melanie’s husband Josh were coming as well.

Jessica had . . .

(30)

Attention on Memory

30

[Figure: the memory diagram again, now highlighting attention between Working Memory and Long-term Memory (Encode / Retrieval).]

(31)

Neural Turing Machine

[Figure: the Neural Turing Machine alongside the von Neumann architecture.]

The Neural Turing Machine is an advanced RNN/LSTM.

31

Zhang et al., “Structured Memory for Neural Turing Machines,” arXiv, 2015.

(32)

[Figure: the memory diagram once more. Machine translation, speech recognition, image captioning, and question answering use attention on sensory input (information from sensors, e.g. eyes and ears, → Sensory Memory → Working Memory); the Neural Turing Machine and Stack RNN use attention on memory (Working Memory ↔ Long-term Memory via Encode / Retrieval).]
