# Attention Mechanism

(1)

## Attention Mechanism

(2)

### Organize

http://henrylo1605.blogspot.tw/2015/05/blog-post_56.html

Examples of memories with different lifetimes:

- Breakfast today
- What you learned in these lectures
- Summer vacation 10 years ago

(3)

### Encode Retrieval

Problem: a very long sequence or an image

Solution: pay attention to only part of the input object at each time step

(4)

### Encode Retrieval

Problem: a very long sequence or an image

Solution: pay attention to only part of the input object at each time step

Problem: a larger memory implies more parameters in an RNN

Solution: long-term memory increases memory size without increasing the number of parameters

(5)

## Attention on Sensory Info

(Diagram: Sensory Memory (e.g. eyes, ears) → Attention → Working Memory → Encode/Retrieval → Long-term Memory.)

(6)

◉ Sequence-to-sequence learning: both the input and the output are sequences, possibly with different lengths.

E.g. 深度學習 → deep learning

(Diagram: an Encoder RNN reads the input sentence and compresses the information of the whole sentence into one vector; a Decoder RNN then generates "deep", "learning", <END>.)

(7)

### Machine Translation with Attention

(Diagram: the decoder's initial state 𝑧0 is matched against each encoder hidden state ℎ1, ℎ2, ℎ3, ℎ4, producing a score 𝛼01, 𝛼02, ….)

What is "match"? Some possible choices:

➢ Cosine similarity of 𝑧 and ℎ

➢ A small NN whose input is 𝑧 and ℎ and whose output is a scalar

➢ 𝛼 = ℎᵀ𝑊𝑧

How to learn the parameters of "match"? They are trained jointly with the other parts of the network.

(8)

The scores 𝛼01, 𝛼02, 𝛼03, 𝛼04 are normalized by a softmax into 𝛼̂01, 𝛼̂02, 𝛼̂03, 𝛼̂04, and the context vector is the weighted sum of the encoder states:

𝑐0 = Σᵢ 𝛼̂0ᵢ ℎᵢ

E.g. with weights 0.5, 0.5, 0.0, 0.0: 𝑐0 = 0.5ℎ1 + 0.5ℎ2

𝑐0 is used as the decoder RNN input to compute 𝑧1 and output "deep".
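As a concrete sketch of one attention step (match by dot product, softmax, then weighted sum), with hypothetical toy vectors standing in for the encoder states and decoder state:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical encoder hidden states h1..h4 (one per row)
h = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0],
              [0.1, 0.9]])

# Decoder state z0; same dimension as h so a dot-product "match" works
z0 = np.array([1.0, 0.0])

alpha = h @ z0              # match scores α0i
alpha_hat = softmax(alpha)  # normalized weights α̂0i
c0 = alpha_hat @ h          # context vector c0 = Σ_i α̂0i h_i
```

A learned match (e.g. 𝛼 = ℎᵀ𝑊𝑧) would simply replace the dot product with a parameterized score.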

(9)

### Machine Translation with Attention

(Diagram: after generating "deep", the new decoder state 𝑧1 is matched against ℎ1, …, ℎ4 to compute new scores 𝛼11, 𝛼12, ….)

(10)

The new scores are normalized by a softmax into 𝛼̂11, 𝛼̂12, 𝛼̂13, 𝛼̂14, giving the next context vector:

𝑐1 = Σᵢ 𝛼̂1ᵢ ℎᵢ

E.g. with weights 0.0, 0.0, 0.5, 0.5: 𝑐1 = 0.5ℎ3 + 0.5ℎ4

𝑐1 is used as the decoder RNN input to compute 𝑧2 and output "learning".

(11)

### Machine Translation with Attention

(Diagram: 𝑧2 is again matched against ℎ1, …, ℎ4, yielding scores 𝛼21, ….)

The same process repeats until <END> is generated.

(12)

(Diagram: the same attention computation, relabeled: the decoder state 𝑧0 is the query; the encoder states ℎ1, …, ℎ4 act as the keys, matched against the query and softmaxed into weights, and also as the values, combined by those weights into 𝑐0.)

(13)

◉ Attention maps a query 𝑞 and a set of key-value (𝑘-𝑣) pairs to an output.

◉ Output: weighted sum of values

○ Query 𝑞 is a 𝑑𝑘-dim vector

○ Key 𝑘 is a 𝑑𝑘-dim vector

○ Value 𝑣 is a 𝑑𝑣-dim vector

The weight of each value is computed from the inner product of the query and the corresponding key.

(14)

◉ With several queries stacked into a matrix, the output is a set of weighted sums of values: the softmax is applied row-wise to the matrix of query-key inner products, and the resulting weights combine the values.
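A minimal NumPy sketch of this matrix form, with hypothetical toy matrices (row-wise softmax over the query-key inner products, then multiplication by the values):

```python
import numpy as np

def attention(Q, K, V):
    """Dot-product attention: softmax(Q Kᵀ) V, softmax applied row-wise."""
    scores = Q @ K.T                                     # query-key inner products
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V                                   # weighted sums of values

# Hypothetical toy data: 2 queries, 3 key-value pairs, d_k = 2, d_v = 1
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
V = np.array([[1.0], [2.0], [3.0]])

out = attention(Q, K, V)    # one output row per query, each a convex combination of V
```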

(15)

## Attention Applications


(16)

Chan et al., "Listen, Attend and Spell," arXiv, 2015.

(17)

◉ Input: image

◉ Output: word sequence

(Diagram: a CNN encodes the input image into a single vector for the whole image; a decoder RNN then generates "a woman is …" and finally <END>.)

(18)

(Diagram: instead of one vector for the whole image, the CNN's filter activations provide a vector for each image region; the decoder state 𝑧0 is matched against each region vector, giving weights such as 0.7.)

(19)

### Image Captioning with Attention

(Diagram: with weights 0.7, 0.1, 0.1, 0.1, 0.0, 0.0 over the region vectors, their weighted sum is fed to the decoder to compute 𝑧1 and generate Word 1.)

(20)

(Diagram: the process repeats; new weights 0.0, 0.8, 0.2, 0.0, 0.0, 0.0 over the region vectors give a new weighted sum, used to compute 𝑧2 and generate Word 2.)
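A sketch of one captioning step, using randomly generated values in place of a real CNN feature map and decoder state (all values here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CNN output: a 2x3 grid of regions with 4 channels each,
# flattened into 6 region vectors
feature_map = rng.normal(size=(2, 3, 4))
regions = feature_map.reshape(-1, 4)       # shape (6, 4)

z0 = rng.normal(size=4)                    # decoder state

scores = regions @ z0                      # match z0 against each region
scores = scores - scores.max()
weights = np.exp(scores) / np.exp(scores).sum()  # attention over regions
context = weights @ regions                # weighted sum of region vectors
# `context` would feed the decoder RNN to compute z1 and emit the next word
```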

(21)

Good examples


(22)
(23)

### Video Captioning


(24)
(25)

(Diagram: the question 𝑞 is matched against sentence vectors 𝑥1, 𝑥2, 𝑥3, …, 𝑥𝑁 of the document, producing attention weights 𝛼1, 𝛼2, 𝛼3, …, 𝛼𝑁.)

Extracted information = Σₙ₌₁ᴺ 𝛼ₙ 𝑥ₙ

The answering DNN and the sentence encoders can be jointly trained.

(26)

(Diagram: a variant in which each sentence is encoded into two vectors: 𝑥𝑛 for computing the match with the question 𝑞, and ℎ𝑛 for extraction.)

Extracted information = Σₙ₌₁ᴺ 𝛼ₙ ℎₙ

The two sets of sentence encoders (𝑥1, …, 𝑥𝑁 and ℎ1, …, ℎ𝑁) and the DNN are jointly learned.

(27)

### Memory Network

(Diagram: attention is computed and information extracted repeatedly; each hop's extracted information helps compute the next hop's attention, and the DNN finally produces the answer from 𝑞.)
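A minimal multi-hop sketch, assuming hypothetical toy sentence encodings and a simple additive query update (one common choice, not the only one):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hop(q, X, H):
    """One hop: attend over match vectors X, extract from H, refine the query."""
    alpha = softmax(X @ q)   # attention over the sentences
    extracted = alpha @ H    # extracted information = Σ_n α_n h_n
    return q + extracted     # refined query for the next hop

# Hypothetical document of 3 sentences, each encoded twice
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # for matching
H = np.array([[0.2, 0.1], [0.9, 0.3], [0.4, 0.4]])  # for extraction
q = np.array([1.0, 0.0])                             # question encoding

for _ in range(2):           # two hops
    q = hop(q, X, H)
# a DNN would map the final q to the answer
```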

(28)

(29)

◉ The QA pairs are conversational:

- A1: Jessica
- Q2: How old would she be? A2: 80
- Q3: Did she plan to have any visitors? A3: Yes
- Q4: How many? A4: Three
- Q5: Who? A5: Annie, Melanie, and Josh

Story:

Jessica went to sit in her rocking chair. Today was her birthday and she was turning 80. Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her. Her daughter Melanie and Melanie's husband Josh were coming as well.

(30)

## Attention on Memory

(Diagram: the same memory hierarchy, now focusing on attention over Working Memory and Long-term Memory via Encode/Retrieval.)

(31)

◉ Von Neumann architecture

◉ Neural Turing Machine is an advanced RNN/LSTM.

Zhang et al., "Structured Memory for Neural Turing Machines," arXiv, 2015.

(32)

(Summary diagram: attention over memory covers the Neural Turing Machine and Stack RNN; attention over sensory input covers Machine Translation, Speech Recognition, Image Captioning, and Question Answering. Memory hierarchy: Sensory Memory (e.g. eyes, ears) → Attention → Working Memory → Encode/Retrieval → Long-term Memory.)
