Special Networks
Hung-yi Lee
李宏毅
Announcement
• 11/13 (next Monday) 14:00–17:00: visit to Microsoft Taiwan
• Address: 19F, No. 68, Sec. 5, Zhongxiao E. Rd., Xinyi District, Taipei (MRT Taipei City Hall Station, Exit 3)
• 14:00: meet at Exit 3 of MRT Taipei City Hall Station
• Registration form:
• https://docs.google.com/forms/d/e/1FAIpQLSfs2zloGanjWjJvVkJu8DUe9BlVZ5ugLPIs3FUmMbR9VkF8Fw/viewform?fbzx=-8767653761190698000
Outline
• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory
Convolutional Layer
[Diagram: neurons 1–4 in the convolutional layer, each connected to a window of outputs 1–5 from the previous layer.]
Sparse Connectivity: each neuron only connects to part of the output of the previous layer. The inputs a neuron sees are its receptive field.
Different neurons have different, but overlapping, receptive fields.
Convolutional Layer
[Diagram: the same sparse connections as above; neurons with different receptive fields share one set of weights.]
Sparse Connectivity: each neuron only connects to part of the output of the previous layer.
Parameter Sharing: the neurons with different receptive fields can use the same set of parameters.
Together, these give far fewer parameters than a fully connected layer.
Convolutional Layer
[Diagram: neurons 1–4 over inputs 1–5; neurons 1 and 3 share one set of weights, neurons 2 and 4 another.]
Consider neurons 1 and 3 as "filter 1" (kernel 1), and neurons 2 and 4 as "filter 2" (kernel 2).
Filter (kernel) size: the size of the receptive field of a neuron. Here the stride (the shift between neighboring receptive fields) is 2.
Kernel size, number of filters, and stride are all hyperparameters chosen by the developer.
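A minimal numpy sketch of the idea above (not from the slides): sparse connectivity means each output only sees `kernel size` inputs, and parameter sharing means the same weight vector slides across the signal; the stride sets how far the receptive field shifts between neighbors.

```python
import numpy as np

def conv1d(x, w, stride=1):
    """1D convolution: one filter w slid across signal x."""
    k = len(w)
    # one output per position where the receptive field fits inside x
    return np.array([np.dot(x[i:i + k], w)
                     for i in range(0, len(x) - k + 1, stride)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # signal x1 ... x5
w = np.array([1.0, 0.0, -1.0])            # one filter, kernel size 3
y1 = conv1d(x, w, stride=1)               # 3 outputs
y2 = conv1d(x, w, stride=2)               # stride 2: only 2 outputs
```

Note that with stride 2, the same filter produces half as many outputs, but the weights are still shared across every position.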
Example – 1D Signal + Single Channel
[Diagram: filter outputs 1–4 slide across the inputs x_1 … x_5.]
Input: an audio signal, stock values, …
Task: classification, predicting the future, …
Example – 1D Signal + Multiple Channels
A document: each word is a vector (one channel per dimension).
[Diagram: filter outputs 1–4 slide across the word vectors x_1 … x_7 of "I like this movie very much …".]
Does this kind of receptive field make sense?
Example – 2D Signal + Single Channel
A 6 x 6 black & white image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
[Diagram: neurons 1, 2, 3, …, 16 scan the image left-to-right, top-to-bottom.]
Only one filter is shown here. The receptive field is 3 x 3, and the stride is 1.
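The 6 x 6 example above can be sketched in numpy. The filter values below are illustrative (a diagonal detector), not taken from the slide; with a 3 x 3 receptive field and stride 1, the 6 x 6 image yields a 4 x 4 feature map.

```python
import numpy as np

# The 6x6 black & white image from the slide.
image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])

# One illustrative 3x3 filter that responds to diagonal patterns.
filt = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1]])

def conv2d(img, f, stride=1):
    """Slide filter f over img; each output sums an elementwise product."""
    fh, fw = f.shape
    out_h = (img.shape[0] - fh) // stride + 1
    out_w = (img.shape[1] - fw) // stride + 1
    return np.array([[np.sum(img[i * stride:i * stride + fh,
                                 j * stride:j * stride + fw] * f)
                      for j in range(out_w)] for i in range(out_h)])

fmap = conv2d(image, filt)   # 4x4 feature map
```

The top-left output is large (3) because the image's top-left 3 x 3 window is exactly the diagonal the filter detects.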
Example – 2D Signal + Multiple Channels
A 6 x 6 color image: three 6 x 6 channels (R, G, B) stacked together.
[Diagram: neurons 1, 2, 3, …, 16 scan the image; each sees the same 3 x 3 window in all three channels.]
Only one filter is shown here. The receptive field is 3 x 3 x 3, and the stride is 1.
Padding
Zero Padding, Reflection Padding
Source of images: https://github.com/vdumoulin/conv_arithmetic
Pooling Layer
If layer l − 1 has N nodes, layer l has N / k nodes: every k outputs in layer l − 1 are grouped together, and each output in layer l "summarizes" its k inputs.
Max Pooling: a_1^l = max(a_1^{l−1}, a_2^{l−1}, …, a_k^{l−1})
Average Pooling: a_1^l = (1/k) Σ_{j=1}^{k} a_j^{l−1}
L2 Pooling: a_1^l = sqrt( (1/k) Σ_{j=1}^{k} (a_j^{l−1})² )
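The three pooling operations above, sketched in numpy (group size k = 2; the data is illustrative):

```python
import numpy as np

def pool(a, k, mode="max"):
    """Summarize each group of k consecutive values into one output."""
    groups = a.reshape(-1, k)                  # N/k groups of k values
    if mode == "max":
        return groups.max(axis=1)
    if mode == "average":
        return groups.mean(axis=1)
    if mode == "l2":
        return np.sqrt((groups ** 2).mean(axis=1))
    raise ValueError(mode)

a = np.array([3.0, 4.0, 0.0, 2.0])   # 4 outputs of layer l-1, k = 2
p_max = pool(a, 2, "max")            # [4., 2.]
p_avg = pool(a, 2, "average")        # [3.5, 1.]
p_l2 = pool(a, 2, "l2")
```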
Pooling Layer
Which outputs should be grouped together? Option 1: group the neurons corresponding to the same filter with nearby receptive fields.
[Diagram: convolutional layer outputs 1–4 feed pooling layer outputs 1–2.]
This is subsampling.
Pooling Layer
Which outputs should be grouped together? Option 2: group the neurons with the same receptive field.
[Diagram: convolutional layer outputs 1–4 feed pooling layer outputs 1–2.]
This is the Maxout Network.
How can you know whether the neurons detect the same pattern?
Auto-encoder for CNN
Encoder: Convolution → Pooling → Convolution → Pooling → code
Decoder: Unpooling → Deconvolution → Unpooling → Deconvolution
The reconstruction should be as close as possible to the input.
Unpooling
Unpooling enlarges a feature map (e.g., 14 x 14 → 28 x 28) by placing each value back at the location its max came from and filling the rest with zeros.
Alternative: simply repeat the values.
Source of image: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_segmentation.html
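Both unpooling variants can be sketched in numpy (the max locations below are illustrative placeholders, not remembered from a real pooling pass):

```python
import numpy as np

x = np.array([[5.0, 6.0],
              [7.0, 8.0]])          # pooled 2x2 feature map

# Variant 1: simply repeat each value into a 2x2 block (2x2 -> 4x4).
up_repeat = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

# Variant 2: place each value at the remembered max location of its
# 2x2 block and fill the rest with zeros (mask is an assumed example).
mask = np.zeros((4, 4), dtype=bool)
mask[0, 1] = mask[1, 2] = mask[2, 0] = mask[3, 3] = True
up_maxloc = np.zeros((4, 4))
up_maxloc[mask] = x.ravel()         # one value per block, rest zeros
```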
Deconvolution
Actually, deconvolution is convolution: upsample the input (e.g., insert zeros between values and pad the borders), then apply an ordinary convolution.
Combination of Different Structures
Tara N. Sainath, Ron J. Weiss, Andrew Senior, Kevin W. Wilson, Oriol Vinyals, "Learning the Speech Front-end With Raw Waveform CLDNNs," INTERSPEECH 2015
[Figure: the combined CNN/LSTM/DNN structure, 3 layers deep.]
CNN for Sequence-to-sequence
https://arxiv.org/abs/1705.03122
CNN for Sequence-to-sequence
• Encoder: instead of an RNN producing h_1 … h_4 from the input (e.g., 機器學習, "machine learning"), a stack of CNN layers can produce h_1 … h_4 from the same input.
CNN for Sequence-to-sequence
• Decoder - WaveNet
Outline
• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory
Spatial Transformer Layer
• CNN is not invariant to scaling and rotation
[Figure: the same digit, scaled and rotated, fools the CNN.]
A spatial transformer is itself an NN layer, learned end-to-end together with the rest of the network; it can also transform feature maps, not just the input image.
• How to transform an image/feature map?
[Diagram: the 3 x 3 activations a^{l−1}_{11} … a^{l−1}_{33} of layer l−1 are mapped to the activations a^l_{11} … a^l_{33} of layer l.]
Spatial Transformer Layer – Translation
General layer: a^l_{nm} = Σ_{i=1}^{3} Σ_{j=1}^{3} w^l_{nm,ij} a^{l−1}_{ij}
If we want the translation above: a^l_{nm} = a^{l−1}_{(n−1)m}, i.e.
w^l_{nm,ij} = 1 if i = n − 1 and j = m; w^l_{nm,ij} = 0 otherwise
Spatial Transformer Layer
• How to transform an image/feature map
[Diagram: two examples of layer l−1 (a^{l−1}_{11} … a^{l−1}_{33}) mapped to layer l (a^l_{11} … a^l_{33}); in each, a small NN controls which connections are active.]
The NN controls the connection between the two layers.
Image Transformation
• Expansion, Compression, Translation
Expansion: [x′; y′] = [[2, 0], [0, 2]] [x; y] + [0; 0]
Compression with translation: [x′; y′] = [[0.5, 0], [0, 0.5]] [x; y] + [0.5; 0.5]
Image Transformation
• Rotation by θ°: [x′; y′] = [[cos θ, −sin θ], [sin θ, cos θ]] [x; y] + [0; 0]
https://home.gamer.com.tw/creationDetail.php?sn=792585
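The affine transforms above, as coordinate maps [x′; y′] = A [x; y] + t (values match the slides' examples):

```python
import numpy as np

def affine(A, t, xy):
    """Apply the affine map [x'; y'] = A [x; y] + t."""
    return A @ xy + t

xy = np.array([1.0, 1.0])

# Expansion: A = [[2, 0], [0, 2]], t = [0, 0]
expanded = affine(np.array([[2.0, 0.0], [0.0, 2.0]]), np.zeros(2), xy)

# Rotation by theta: A = [[cos, -sin], [sin, cos]], t = [0, 0]
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated = affine(R, np.zeros(2), xy)   # rotate (1, 1) by 90 degrees
```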
Spatial Transformer Layer
[x′; y′] = [[a, b], [c, d]] [x; y] + [e; f]
6 parameters describe the affine transformation. A small NN looks at layer l−1 and outputs a, b, c, d, e, f, which map each index of layer l to an index of layer l−1, controlling the connections between the two layers.
Spatial Transformer Layer
[x′; y′] = [[a, b], [c, d]] [x; y] + [e; f]
6 parameters describe the affine transformation.
Example: the NN outputs [[a, b], [c, d]] = [[0, 1], [1, 0]] and [e; f] = [−1; −1]; each layer-l index then maps to an integer layer l−1 index, determining which activation is copied where.
Spatial Transformer Layer
[x′; y′] = [[a, b], [c, d]] [x; y] + [e; f]
6 parameters describe the affine transformation.
Example: the NN outputs [[a, b], [c, d]] = [[0, 0.5], [1, 0]] and [e; f] = [0.6; 0.4], so the layer-l index [2; 2] maps to the layer l−1 index [1.6; 2.4].
What is the problem? The target index is fractional; if we round it to the nearest integer, the gradient of the output w.r.t. the 6 parameters is always zero, so the transformer cannot be trained.
Interpolation
[x′; y′] = [[a, b], [c, d]] [x; y] + [e; f]; 6 parameters describe the affine transformation.
The layer-l index [2; 2] maps to the fractional layer l−1 index [1.6; 2.4], which lies between the four activations a^{l−1}_{12}, a^{l−1}_{13}, a^{l−1}_{22}, a^{l−1}_{23}. Interpolate between them, weighting each by its closeness to [1.6; 2.4]:
a^l_{22} = (1 − 0.4)(1 − 0.4) a^{l−1}_{22} + (1 − 0.6)(1 − 0.4) a^{l−1}_{12} + (1 − 0.6)(1 − 0.6) a^{l−1}_{13} + (1 − 0.4)(1 − 0.6) a^{l−1}_{23}
The weights vary smoothly with the fractional index, so the gradient is nonzero: now we can use gradient descent.
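The bilinear interpolation step can be sketched in numpy; the layer l−1 activations below are an illustrative toy map, and the fractional index (1.6, 2.4) matches the slide's example:

```python
import numpy as np

def bilinear_sample(a, x, y):
    """Sample map a at fractional index (x, y) by bilinear interpolation."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0     # distances to the lower integer indices
    return ((1 - dx) * (1 - dy) * a[x0, y0]
            + dx * (1 - dy) * a[x0 + 1, y0]
            + (1 - dx) * dy * a[x0, y0 + 1]
            + dx * dy * a[x0 + 1, y0 + 1])

a_prev = np.arange(16, dtype=float).reshape(4, 4)   # toy layer l-1: a[x, y] = 4x + y
v = bilinear_sample(a_prev, 1.6, 2.4)               # 4*1.6 + 2.4 = 8.8
```

Because each weight is a smooth function of the fractional index, gradients flow back to the 6 affine parameters, which is exactly why the spatial transformer can be trained end-to-end.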
Results: Single = one transformation layer; Multi = multiple transformation layers.
Applications: Street View House Numbers, bird recognition.
Outline
• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory
Feedforward vs. Recurrent
Feedforward: x → f^1 → a^1 → f^2 → a^2 → f^3 → a^3 → f^4 → y
a^t = f^t(a^{t−1}) = σ(W^t a^{t−1} + b^t)   (here t is the layer)
Recurrent: h^t = f(h^{t−1}, x^t) = σ(W^h h^{t−1} + W^i x^t + b^i)   (here t is the time step)
1. A feedforward network does not have an input at each step.
2. A feedforward network has different parameters for each layer.
Idea: apply the gated structure of RNNs in a feedforward network.
GRU → Highway Network
[Diagram: a GRU cell with reset gate r, update gate z, candidate h′, inputs x^t and h^{t−1}, output h^t and y^t.]
To turn a GRU into a highway layer:
• No input x^t at each step: a^{t−1} is the output of the (t−1)-th layer and a^t is the output of the t-th layer.
• No output y^t at each step.
• No reset gate.
Highway Network
• Residual Network: the input is copied and added to the layer output, a^t = a^{t−1} + h′.
  Deep Residual Learning for Image Recognition, http://arxiv.org/abs/1512.03385
• Highway Network: a gate controller decides how much to copy:
  h′ = σ(W a^{t−1}),  z = σ(W′ a^{t−1})
  a^t = z ⊙ a^{t−1} + (1 − z) ⊙ h′
  Training Very Deep Networks, https://arxiv.org/pdf/1507.06228v2.pdf
[Diagram: the same input-to-output mapping realized with different numbers of effective layers.]
Highway Network automatically determines the layers needed!
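A highway layer following the equations above, as a minimal numpy sketch (weights are random placeholders; a real layer would learn them):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 4
W = rng.normal(size=(d, d))    # transform weights
Wg = rng.normal(size=(d, d))   # gate weights (W' in the slide)

def highway(a_prev):
    h = sigmoid(W @ a_prev)    # candidate transform h'
    z = sigmoid(Wg @ a_prev)   # gate: how much to simply copy the input
    return z * a_prev + (1 - z) * h

a0 = rng.normal(size=d)
a1 = highway(a0)
```

When the gate z saturates at 1, the layer reduces to a pure copy, which is how a deep highway network can effectively "skip" layers it does not need.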
Highway Network
Grid LSTM
LSTM: takes memory c and hidden state h plus input x, and outputs (c′, h′) and y.
Grid LSTM: memory for both time and depth. It takes (c, h) along the time dimension and (a, b) along the depth dimension, and outputs (c′, h′) and (a′, b′).
[Diagram: a 2D grid of Grid LSTM blocks; (h, c) flow horizontally through time steps t−1, t, t+1, while (a, b) flow vertically through layers l−1, l, l+1.]
Grid LSTM
[Inside each block: from the incoming hidden vectors, compute the candidate z and the gates z^i, z^f, z^o; then c′ = z^f ⊙ c + z^i ⊙ z and h′ = z^o ⊙ tanh(c′). The depth memory (a, b) is updated the same way, producing (a′, b′).]
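The LSTM block inside a Grid LSTM, sketched for one dimension in numpy (a simplified sketch: weights are random placeholders, and each gate here reads only the incoming hidden vector; Grid LSTM applies such a block along depth as well as time, each dimension keeping its own (h, c) pair):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d = 3
Wz, Wi, Wf, Wo = (rng.normal(size=(d, d)) for _ in range(4))

def lstm_block(h, c):
    z = np.tanh(Wz @ h)         # candidate content
    zi = sigmoid(Wi @ h)        # input gate
    zf = sigmoid(Wf @ h)        # forget gate
    zo = sigmoid(Wo @ h)        # output gate
    c_new = zf * c + zi * z     # update the memory cell
    h_new = zo * np.tanh(c_new)
    return h_new, c_new

h, c = rng.normal(size=d), rng.normal(size=d)
h2, c2 = lstm_block(h, c)
```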
3D Grid LSTM
Adding a third dimension: inputs (h, c), (a, b), (e, f) and outputs (h′, c′), (a′, b′), (e′, f′).
• Images are composed of pixels
[Example: processing a 3 x 3 image with a 3D Grid LSTM, two spatial dimensions plus depth.]
Outline
• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory
Pointer Network
[Task: given the coordinates (x_1, y_1), (x_2, y_2), … of points P1 … PN, an NN reads the coordinates and outputs the indices of the points on the convex hull, e.g., 4 2 7 6 5 3.]
Stories of "just hard-train it" (硬train):
• Fizz Buzz in Tensorflow: http://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/
• https://ochronus.com/fizzbuzz-in-css/
Sequence-to-sequence?
[A standard seq2seq model: the encoder reads 機器學習 ("machine learning") and the decoder outputs "machine learning".]
Can we use the same encoder–decoder here? The encoder reads (x_1, y_1), …, (x_4, y_4); the decoder outputs tokens from the fixed set {1, 2, 3, 4, END}, e.g., 1 4 2.
Problem? The decoder's output set must be fixed in advance, but the valid outputs are the indices of the input points, whose number varies from example to example.
Of course, one can add attention.
Pointer Network
[The encoder produces h_1 … h_4 from (x_1, y_1) … (x_4, y_4); an extra token (x_0, y_0) with encoding h_0 represents END.]
The decoder state z_0 acts as a key: compute attention weights over h_0 … h_4, and use this attention distribution directly as the output distribution (e.g., weights 0.0 for END and 0.5, 0.3, 0.2, 0.0 for the points).
Take the argmax of this distribution (here: output 1); the selected (x_1, y_1) becomes the next decoder input, producing z_1.
What the decoder can output depends on the input.
Pointer Network
In the next step, z_1 attends over h_0 … h_4 again (e.g., weights 0.0 for END and 0.0, 0.1, 0.2, 0.7 for the points), and the argmax selects point 4; (x_4, y_4) becomes the next input, producing z_2, and so on.
The process stops when END has the largest attention weight.
What the decoder can output depends on the input.
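One decoding step of a pointer network, sketched in numpy (vectors are random placeholders; a trained model would produce them): the decoder state attends over the encoder states, and the attention distribution itself is the output distribution over input positions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_points = 4, 4
H = rng.normal(size=(n_points + 1, d))   # encoder states h_0 (END) .. h_4
z = rng.normal(size=d)                   # decoder state: the "key"

scores = H @ z                           # one match score per position
attn = np.exp(scores) / np.exp(scores).sum()   # softmax: output distribution
chosen = int(np.argmax(attn))            # index 0 means END: stop decoding
```

Because the distribution is over the actual input positions, the output set automatically grows and shrinks with the number of input points.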
Applications - Summarization
https://arxiv.org/abs/1704.04368
More Applications
Machine Translation, Chat-bot: pointing lets the model copy rare words, such as names, directly from the input.
User: X寶你好,我是庫洛洛 ("Hi X-Bao, I'm Kurolo")
Machine: 庫洛洛你好,很高興認識你 ("Hi Kurolo, nice to meet you")
Outline
• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory
External Memory
[Architecture: a DNN/RNN maps input to output; a reading head controller positions a reading head over the machine's memory, and the read content feeds into the network.]
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Attain%20(v3).ecm.mp4/index.html
Reading Comprehension
Each sentence in the document becomes a vector (via semantic analysis) and is stored in memory. Given a query, the reading head controller attends over the sentence vectors, and the DNN/RNN produces the answer.
Memory Network
Match the query vector q against each sentence vector x_1 … x_N of the document to get weights α_1 … α_N; the extracted information is
Σ_{n=1}^{N} α_n x_n
which, together with q, feeds a DNN that produces the answer. The sentence-to-vector conversion can be jointly trained.
A refinement: compute the match weights α_n from the vectors x_n, but extract information from separately (jointly) learned vectors h_1 … h_N:
Extracted information = Σ_{n=1}^{N} α_n h_n
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus, "End-To-End Memory Networks", NIPS, 2015
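The memory-network read above, as a minimal numpy sketch (sentence and query vectors are random placeholders): match the query against each sentence vector, softmax the scores, and take the weighted sum.

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 4, 5
X = rng.normal(size=(N, d))      # sentence vectors x_1 .. x_N
q = rng.normal(size=d)           # query vector

match = X @ q                    # match score per sentence
alpha = np.exp(match) / np.exp(match).sum()   # softmax over sentences
extracted = alpha @ X            # sum_n alpha_n x_n
```

In the refined variant, `extracted` would instead be `alpha @ H` for a second, jointly learned set of sentence vectors H.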
Memory Network – Hopping
Multiple-hop: compute attention over the memory and extract information, feed the result back as the new query, compute attention and extract again, and finally a DNN produces the answer a. The position of the reading head is determined by the attention at each hop.
• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. NIPS, 2015.
Keras has an example: https://github.com/fchollet/keras/blob/master/examples/babi_memnn.py
Visual Question Answering
source: http://visualqa.org/
Visual Question Answering
A CNN produces a vector for each region of the image; given the query, the reading head controller attends over the region vectors, and the DNN/RNN produces the answer.
• Huijuan Xu, Kate Saenko. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. arXiv Pre-Print, 2015
External Memory v2
[Architecture: as before, plus a writing head controller that positions a writing head over the machine's memory, so the DNN/RNN can also modify what is stored.]
Neural Turing Machine
Neural Turing Machine
Retrieval process: the memory holds m^0_1 … m^0_4 with attention weights α̂^0_1 … α̂^0_4, and the read vector is
r^0 = Σ_i α̂^0_i m^0_i
r^0 and the input x^1 feed the controller f (with state h^0), which produces the output y^1.
Neural Turing Machine
The controller f also emits a key k^1 (plus vectors e^1 and a^1 for writing). The new attention is
α^1_i = cos(m^0_i, k^1),  then  (α̂^1_1, …, α̂^1_4) = softmax(α^1_1, …, α^1_4)
Neural Turing Machine
Writing: each memory slot is updated with the erase vector e^1 (entries in 0 ~ 1) and the add vector a^1 (element-wise ⊙):
m^1_i = m^0_i − α̂^1_i e^1 ⊙ m^0_i + α̂^1_i a^1
Neural Turing Machine
The cycle repeats: read r^1 from m^1_1 … m^1_4 with the weights α̂^1_i, feed (r^1, x^2) to f with state h^1 to get y^2, then compute the next attention α̂^2_i and memory m^2_i, and so on.
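One NTM read-then-write step following the slides' equations, as a numpy sketch (the memory contents, key, erase, and add vectors are random placeholders a controller would normally emit):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_slots = 3, 4
M = rng.normal(size=(n_slots, d))   # memory slots m_1 .. m_4

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

k = rng.normal(size=d)              # key emitted by the controller
scores = np.array([cosine(m, k) for m in M])
alpha = np.exp(scores) / np.exp(scores).sum()   # softmax attention

r = alpha @ M                       # read: r = sum_i alpha_i m_i

e = np.full(d, 0.5)                 # erase vector, entries in [0, 1]
a = rng.normal(size=d)              # add vector
# write: m_i <- m_i - alpha_i * e (element-wise) m_i + alpha_i * a
M_new = M - alpha[:, None] * e * M + alpha[:, None] * a
```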
Neural Turing Machine for LM
Wei-Jen Ko, Bo-Hsiang Tseng, Hung-yi Lee,
“Recurrent Neural Network based Language Modeling with Controllable External Memory”, ICASSP, 2017
Stack RNN
Armand Joulin, Tomas Mikolov, Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets, arXiv Pre-Print, 2015
At each step, the controller f reads x^t and the top of the stack, produces y^t, the information to store, and a distribution over the actions Push / Pop / Nothing (e.g., 0.7, 0.2, 0.1).
The new stack is the probability-weighted combination of the three outcomes: 0.7 × (pushed stack) + 0.2 × (popped stack) + 0.1 × (unchanged stack), which keeps the whole operation differentiable.
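The differentiable stack update above, sketched in numpy (stack contents and the pushed value are illustrative):

```python
import numpy as np

def soft_stack_update(stack, p_push, p_pop, p_noop, value):
    """Mix the three possible stack outcomes by their probabilities."""
    pushed = np.concatenate(([value], stack[:-1]))   # value goes on top
    popped = np.concatenate((stack[1:], [0.0]))      # top is removed
    return p_push * pushed + p_pop * popped + p_noop * stack

stack = np.array([1.0, 2.0, 3.0, 0.0])
new_stack = soft_stack_update(stack, 0.7, 0.2, 0.1, value=5.0)
```

The new top, 0.7 × 5 + 0.2 × 2 + 0.1 × 1 = 4.0, blends the three outcomes, so gradients can flow through the choice of stack action.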
Concluding Remarks
• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory