Special Networks
Hung-yi Lee
李宏毅
Announcement
• 11/13 (next Monday) 14:00–17:00: visit to Microsoft Taiwan
• Address: 19F, No. 68, Sec. 5, Zhongxiao E. Rd., Xinyi District, Taipei (MRT Taipei City Hall Station, Exit 3)
• 14:00: meet at Exit 3 of MRT Taipei City Hall Station
• Registration form:
• https://docs.google.com/forms/d/e/1FAIpQLSfs2zloGanjWjJvVkJu8DUe9BlVZ5ugLPIs3FUmMbR9VkF8Fw/viewform?fbzx=-8767653761190698000
Outline
• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory
Convolutional Layer
[Diagram: neurons 1–4 in the convolutional layer, each connected to a window of outputs 1–5 from the previous layer.]
Sparse Connectivity: each neuron only connects to part of the output of the previous layer. The inputs a neuron sees are its receptive field.
Different neurons have different, but overlapping, receptive fields.
Convolutional Layer
[Diagram: the same sparse connections as above; neurons with different receptive fields share one set of weights.]
Sparse Connectivity: each neuron only connects to part of the output of the previous layer.
Parameter Sharing: the neurons with different receptive fields can use the same set of parameters.
Together, these give far fewer parameters than a fully connected layer.
Convolutional Layer
[Diagram: neurons 1–4 over inputs 1–5; neurons 1 and 3 share one set of weights, neurons 2 and 4 another.]
Consider neurons 1 and 3 as "filter 1" (kernel 1), and neurons 2 and 4 as "filter 2" (kernel 2).
Filter (kernel) size: the size of the receptive field of a neuron. Here the stride (the shift between neighboring receptive fields) is 2.
Kernel size, number of filters, and stride are all hyperparameters chosen by the developer.
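A minimal numpy sketch of the idea above (not from the slides): sparse connectivity means each output only sees `kernel size` inputs, and parameter sharing means the same weight vector slides across the signal; the stride sets how far the receptive field shifts between neighbors.

```python
import numpy as np

def conv1d(x, w, stride=1):
    """1D convolution: one filter w slid across signal x."""
    k = len(w)
    # one output per position where the receptive field fits inside x
    return np.array([np.dot(x[i:i + k], w)
                     for i in range(0, len(x) - k + 1, stride)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # signal x1 ... x5
w = np.array([1.0, 0.0, -1.0])            # one filter, kernel size 3
y1 = conv1d(x, w, stride=1)               # 3 outputs
y2 = conv1d(x, w, stride=2)               # stride 2: only 2 outputs
```

Note that with stride 2, the same filter produces half as many outputs, but the weights are still shared across every position.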
Example – 1D Signal + Single Channel
[Diagram: filter outputs 1–4 slide across the inputs x_1 … x_5.]
Input: an audio signal, stock values, …
Task: classification, predicting the future, …
Example – 1D Signal + Multiple Channels
A document: each word is a vector (one channel per dimension).
[Diagram: filter outputs 1–4 slide across the word vectors x_1 … x_7 of "I like this movie very much …".]
Does this kind of receptive field make sense?
Example – 2D Signal + Single Channel
A 6 x 6 black & white image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
[Diagram: neurons 1, 2, 3, …, 16 scan the image left-to-right, top-to-bottom.]
Only one filter is shown here. The receptive field is 3 x 3, and the stride is 1.
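The 6 x 6 example above can be sketched in numpy. The filter values below are illustrative (a diagonal detector), not taken from the slide; with a 3 x 3 receptive field and stride 1, the 6 x 6 image yields a 4 x 4 feature map.

```python
import numpy as np

# The 6x6 black & white image from the slide.
image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])

# One illustrative 3x3 filter that responds to diagonal patterns.
filt = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1]])

def conv2d(img, f, stride=1):
    """Slide filter f over img; each output sums an elementwise product."""
    fh, fw = f.shape
    out_h = (img.shape[0] - fh) // stride + 1
    out_w = (img.shape[1] - fw) // stride + 1
    return np.array([[np.sum(img[i * stride:i * stride + fh,
                                 j * stride:j * stride + fw] * f)
                      for j in range(out_w)] for i in range(out_h)])

fmap = conv2d(image, filt)   # 4x4 feature map
```

The top-left output is large (3) because the image's top-left 3 x 3 window is exactly the diagonal the filter detects.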
Example – 2D Signal + Multiple Channels
A 6 x 6 color image: three 6 x 6 channels (R, G, B) stacked together.
[Diagram: neurons 1, 2, 3, …, 16 scan the image; each sees the same 3 x 3 window in all three channels.]
Only one filter is shown here. The receptive field is 3 x 3 x 3, and the stride is 1.
Padding
Zero Padding, Reflection Padding
Source of images: https://github.com/vdumoulin/conv_arithmetic
Pooling Layer
If layer l − 1 has N nodes, layer l has N / k nodes: every k outputs in layer l − 1 are grouped together, and each output in layer l "summarizes" its k inputs.
Max Pooling: a_1^l = max(a_1^{l−1}, a_2^{l−1}, …, a_k^{l−1})
Average Pooling: a_1^l = (1/k) Σ_{j=1}^{k} a_j^{l−1}
L2 Pooling: a_1^l = sqrt( (1/k) Σ_{j=1}^{k} (a_j^{l−1})² )
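The three pooling operations above, sketched in numpy (group size k = 2; the data is illustrative):

```python
import numpy as np

def pool(a, k, mode="max"):
    """Summarize each group of k consecutive values into one output."""
    groups = a.reshape(-1, k)                  # N/k groups of k values
    if mode == "max":
        return groups.max(axis=1)
    if mode == "average":
        return groups.mean(axis=1)
    if mode == "l2":
        return np.sqrt((groups ** 2).mean(axis=1))
    raise ValueError(mode)

a = np.array([3.0, 4.0, 0.0, 2.0])   # 4 outputs of layer l-1, k = 2
p_max = pool(a, 2, "max")            # [4., 2.]
p_avg = pool(a, 2, "average")        # [3.5, 1.]
p_l2 = pool(a, 2, "l2")
```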
Pooling Layer
Which outputs should be grouped together? Option 1: group the neurons corresponding to the same filter with nearby receptive fields.
[Diagram: convolutional layer outputs 1–4 feed pooling layer outputs 1–2.]
This is subsampling.
Pooling Layer
Which outputs should be grouped together? Option 2: group the neurons with the same receptive field.
[Diagram: convolutional layer outputs 1–4 feed pooling layer outputs 1–2.]
This is the Maxout Network.
How can you know whether the neurons detect the same pattern?
Auto-encoder for CNN
Encoder: Convolution → Pooling → Convolution → Pooling → code
Decoder: Unpooling → Deconvolution → Unpooling → Deconvolution
The reconstruction should be as close as possible to the input.
Unpooling
Unpooling enlarges a feature map (e.g., 14 x 14 → 28 x 28) by placing each value back at the location its max came from and filling the rest with zeros.
Alternative: simply repeat the values.
Source of image: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_segmentation.html
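Both unpooling variants can be sketched in numpy (the max locations below are illustrative placeholders, not remembered from a real pooling pass):

```python
import numpy as np

x = np.array([[5.0, 6.0],
              [7.0, 8.0]])          # pooled 2x2 feature map

# Variant 1: simply repeat each value into a 2x2 block (2x2 -> 4x4).
up_repeat = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

# Variant 2: place each value at the remembered max location of its
# 2x2 block and fill the rest with zeros (mask is an assumed example).
mask = np.zeros((4, 4), dtype=bool)
mask[0, 1] = mask[1, 2] = mask[2, 0] = mask[3, 3] = True
up_maxloc = np.zeros((4, 4))
up_maxloc[mask] = x.ravel()         # one value per block, rest zeros
```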
Deconvolution
Actually, deconvolution is convolution: upsample the input (e.g., insert zeros between values and pad the borders), then apply an ordinary convolution.
Combination of Different Structures
Tara N. Sainath, Ron J. Weiss, Andrew Senior, Kevin W. Wilson, Oriol Vinyals, "Learning the Speech Front-end With Raw Waveform CLDNNs," INTERSPEECH 2015
[Figure: the combined CNN/LSTM/DNN structure, 3 layers deep.]
CNN for Sequence-to-sequence
https://arxiv.org/abs/1705.03122
CNN for Sequence-to-sequence
• Encoder: instead of an RNN producing h_1 … h_4 from the input (e.g., 機器學習, "machine learning"), a stack of CNN layers can produce h_1 … h_4 from the same input.
CNN for Sequence-to-sequence
• Decoder - WaveNet
Outline
• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory
Spatial Transformer Layer
• CNN is not invariant to scaling and rotation
[Figure: the same digit, scaled and rotated, fools the CNN.]
A spatial transformer is itself an NN layer, learned end-to-end together with the rest of the network; it can also transform feature maps, not just the input image.
• How to transform an image/feature map?
[Diagram: the 3 x 3 activations a^{l−1}_{11} … a^{l−1}_{33} of layer l−1 are mapped to the activations a^l_{11} … a^l_{33} of layer l.]
Spatial Transformer Layer – Translation
General layer: a^l_{nm} = Σ_{i=1}^{3} Σ_{j=1}^{3} w^l_{nm,ij} a^{l−1}_{ij}
If we want the translation above: a^l_{nm} = a^{l−1}_{(n−1)m}, i.e.
w^l_{nm,ij} = 1 if i = n − 1 and j = m; w^l_{nm,ij} = 0 otherwise
Spatial Transformer Layer
• How to transform an image/feature map
[Diagram: two examples of layer l−1 (a^{l−1}_{11} … a^{l−1}_{33}) mapped to layer l (a^l_{11} … a^l_{33}); in each, a small NN controls which connections are active.]
The NN controls the connection between the two layers.
Image Transformation
• Expansion, Compression, Translation
Expansion: [x′; y′] = [[2, 0], [0, 2]] [x; y] + [0; 0]
Compression with translation: [x′; y′] = [[0.5, 0], [0, 0.5]] [x; y] + [0.5; 0.5]
Image Transformation
• Rotation by θ°: [x′; y′] = [[cos θ, −sin θ], [sin θ, cos θ]] [x; y] + [0; 0]
https://home.gamer.com.tw/creationDetail.php?sn=792585
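The affine transforms above, as coordinate maps [x′; y′] = A [x; y] + t (values match the slides' examples):

```python
import numpy as np

def affine(A, t, xy):
    """Apply the affine map [x'; y'] = A [x; y] + t."""
    return A @ xy + t

xy = np.array([1.0, 1.0])

# Expansion: A = [[2, 0], [0, 2]], t = [0, 0]
expanded = affine(np.array([[2.0, 0.0], [0.0, 2.0]]), np.zeros(2), xy)

# Rotation by theta: A = [[cos, -sin], [sin, cos]], t = [0, 0]
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated = affine(R, np.zeros(2), xy)   # rotate (1, 1) by 90 degrees
```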
Spatial Transformer Layer
[x′; y′] = [[a, b], [c, d]] [x; y] + [e; f]
6 parameters describe the affine transformation. A small NN looks at layer l−1 and outputs a, b, c, d, e, f, which map each index of layer l to an index of layer l−1, controlling the connections between the two layers.
Spatial Transformer Layer
[x′; y′] = [[a, b], [c, d]] [x; y] + [e; f]
6 parameters describe the affine transformation.
Example: the NN outputs [[a, b], [c, d]] = [[0, 1], [1, 0]] and [e; f] = [−1; −1]; each layer-l index then maps to an integer layer l−1 index, determining which activation is copied where.
Spatial Transformer Layer
[x′; y′] = [[a, b], [c, d]] [x; y] + [e; f]
6 parameters describe the affine transformation.
Example: the NN outputs [[a, b], [c, d]] = [[0, 0.5], [1, 0]] and [e; f] = [0.6; 0.4], so the layer-l index [2; 2] maps to the layer l−1 index [1.6; 2.4].
What is the problem? The target index is fractional; if we round it to the nearest integer, the gradient of the output w.r.t. the 6 parameters is always zero, so the transformer cannot be trained.
Interpolation
[x′; y′] = [[a, b], [c, d]] [x; y] + [e; f]; 6 parameters describe the affine transformation.
The layer-l index [2; 2] maps to the fractional layer l−1 index [1.6; 2.4], which lies between the four activations a^{l−1}_{12}, a^{l−1}_{13}, a^{l−1}_{22}, a^{l−1}_{23}. Interpolate between them, weighting each by its closeness to [1.6; 2.4]:
a^l_{22} = (1 − 0.4)(1 − 0.4) a^{l−1}_{22} + (1 − 0.6)(1 − 0.4) a^{l−1}_{12} + (1 − 0.6)(1 − 0.6) a^{l−1}_{13} + (1 − 0.4)(1 − 0.6) a^{l−1}_{23}
The weights vary smoothly with the fractional index, so the gradient is nonzero: now we can use gradient descent.
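The bilinear interpolation step can be sketched in numpy; the layer l−1 activations below are an illustrative toy map, and the fractional index (1.6, 2.4) matches the slide's example:

```python
import numpy as np

def bilinear_sample(a, x, y):
    """Sample map a at fractional index (x, y) by bilinear interpolation."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0     # distances to the lower integer indices
    return ((1 - dx) * (1 - dy) * a[x0, y0]
            + dx * (1 - dy) * a[x0 + 1, y0]
            + (1 - dx) * dy * a[x0, y0 + 1]
            + dx * dy * a[x0 + 1, y0 + 1])

a_prev = np.arange(16, dtype=float).reshape(4, 4)   # toy layer l-1: a[x, y] = 4x + y
v = bilinear_sample(a_prev, 1.6, 2.4)               # 4*1.6 + 2.4 = 8.8
```

Because each weight is a smooth function of the fractional index, gradients flow back to the 6 affine parameters, which is exactly why the spatial transformer can be trained end-to-end.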
Results: Single = one transformation layer; Multi = multiple transformation layers.
Applications: Street View House Numbers, bird recognition.
Outline
• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory
Feedforward vs. Recurrent
Feedforward: x → f^1 → a^1 → f^2 → a^2 → f^3 → a^3 → f^4 → y
a^t = f^t(a^{t−1}) = σ(W^t a^{t−1} + b^t)   (here t is the layer)
Recurrent: h^t = f(h^{t−1}, x^t) = σ(W^h h^{t−1} + W^i x^t + b^i)   (here t is the time step)
1. A feedforward network does not have an input at each step.
2. A feedforward network has different parameters for each layer.
Idea: apply the gated structure of RNNs in a feedforward network.
GRU → Highway Network
[Diagram: a GRU cell with reset gate r, update gate z, candidate h′, inputs x^t and h^{t−1}, output h^t and y^t.]
To turn a GRU into a highway layer:
• No input x^t at each step: a^{t−1} is the output of the (t−1)-th layer and a^t is the output of the t-th layer.
• No output y^t at each step.
• No reset gate.
Highway Network
• Residual Network: the input is copied and added to the layer output, a^t = a^{t−1} + h′.
  Deep Residual Learning for Image Recognition, http://arxiv.org/abs/1512.03385
• Highway Network: a gate controller decides how much to copy:
  h′ = σ(W a^{t−1}),  z = σ(W′ a^{t−1})
  a^t = z ⊙ a^{t−1} + (1 − z) ⊙ h′
  Training Very Deep Networks, https://arxiv.org/pdf/1507.06228v2.pdf
[Diagram: the same input-to-output mapping realized with different numbers of effective layers.]
Highway Network automatically determines the layers needed!
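A highway layer following the equations above, as a minimal numpy sketch (weights are random placeholders; a real layer would learn them):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 4
W = rng.normal(size=(d, d))    # transform weights
Wg = rng.normal(size=(d, d))   # gate weights (W' in the slide)

def highway(a_prev):
    h = sigmoid(W @ a_prev)    # candidate transform h'
    z = sigmoid(Wg @ a_prev)   # gate: how much to simply copy the input
    return z * a_prev + (1 - z) * h

a0 = rng.normal(size=d)
a1 = highway(a0)
```

When the gate z saturates at 1, the layer reduces to a pure copy, which is how a deep highway network can effectively "skip" layers it does not need.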
Highway Network
Grid LSTM
LSTM: takes memory c and hidden state h plus input x, and outputs (c′, h′) and y.
Grid LSTM: memory for both time and depth. It takes (c, h) along the time dimension and (a, b) along the depth dimension, and outputs (c′, h′) and (a′, b′).
[Diagram: a 2D grid of Grid LSTM blocks; (h, c) flow horizontally through time steps t−1, t, t+1, while (a, b) flow vertically through layers l−1, l, l+1.]
Grid LSTM
[Inside each block: from the incoming hidden vectors, compute the candidate z and the gates z^i, z^f, z^o; then c′ = z^f ⊙ c + z^i ⊙ z and h′ = z^o ⊙ tanh(c′). The depth memory (a, b) is updated the same way, producing (a′, b′).]
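The LSTM block inside a Grid LSTM, sketched for one dimension in numpy (a simplified sketch: weights are random placeholders, and each gate here reads only the incoming hidden vector; Grid LSTM applies such a block along depth as well as time, each dimension keeping its own (h, c) pair):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d = 3
Wz, Wi, Wf, Wo = (rng.normal(size=(d, d)) for _ in range(4))

def lstm_block(h, c):
    z = np.tanh(Wz @ h)         # candidate content
    zi = sigmoid(Wi @ h)        # input gate
    zf = sigmoid(Wf @ h)        # forget gate
    zo = sigmoid(Wo @ h)        # output gate
    c_new = zf * c + zi * z     # update the memory cell
    h_new = zo * np.tanh(c_new)
    return h_new, c_new

h, c = rng.normal(size=d), rng.normal(size=d)
h2, c2 = lstm_block(h, c)
```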
3D Grid LSTM
Adding a third dimension: inputs (h, c), (a, b), (e, f) and outputs (h′, c′), (a′, b′), (e′, f′).
• Images are composed of pixels
[Example: processing a 3 x 3 image with a 3D Grid LSTM, two spatial dimensions plus depth.]
Outline
• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory
Pointer Network
[Task: given the coordinates (x_1, y_1), (x_2, y_2), … of points P1 … PN, an NN reads the coordinates and outputs the indices of the points on the convex hull, e.g., 4 2 7 6 5 3.]
Stories of "just hard-train it" (硬train):
• Fizz Buzz in Tensorflow: http://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/
• https://ochronus.com/fizzbuzz-in-css/
Sequence-to-sequence?
[A standard seq2seq model: the encoder reads 機器學習 ("machine learning") and the decoder outputs "machine learning".]
Can we use the same encoder–decoder here? The encoder reads (x_1, y_1), …, (x_4, y_4); the decoder outputs tokens from the fixed set {1, 2, 3, 4, END}, e.g., 1 4 2.
Problem? The decoder's output set must be fixed in advance, but the valid outputs are the indices of the input points, whose number varies from example to example.
Of course, one can add attention.
Pointer Network
[The encoder produces h_1 … h_4 from (x_1, y_1) … (x_4, y_4); an extra token (x_0, y_0) with encoding h_0 represents END.]
The decoder state z_0 acts as a key: compute attention weights over h_0 … h_4, and use this attention distribution directly as the output distribution (e.g., weights 0.0 for END and 0.5, 0.3, 0.2, 0.0 for the points).
Take the argmax of this distribution (here: output 1); the selected (x_1, y_1) becomes the next decoder input, producing z_1.
What the decoder can output depends on the input.
Pointer Network
In the next step, z_1 attends over h_0 … h_4 again (e.g., weights 0.0 for END and 0.0, 0.1, 0.2, 0.7 for the points), and the argmax selects point 4; (x_4, y_4) becomes the next input, producing z_2, and so on.
The process stops when END has the largest attention weight.
What the decoder can output depends on the input.
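One decoding step of a pointer network, sketched in numpy (vectors are random placeholders; a trained model would produce them): the decoder state attends over the encoder states, and the attention distribution itself is the output distribution over input positions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_points = 4, 4
H = rng.normal(size=(n_points + 1, d))   # encoder states h_0 (END) .. h_4
z = rng.normal(size=d)                   # decoder state: the "key"

scores = H @ z                           # one match score per position
attn = np.exp(scores) / np.exp(scores).sum()   # softmax: output distribution
chosen = int(np.argmax(attn))            # index 0 means END: stop decoding
```

Because the distribution is over the actual input positions, the output set automatically grows and shrinks with the number of input points.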
Applications - Summarization
https://arxiv.org/abs/1704.04368
More Applications
Machine Translation, Chat-bot: pointing lets the model copy rare words, such as names, directly from the input.
User: X寶你好,我是庫洛洛 ("Hi X-Bao, I'm Kurolo")
Machine: 庫洛洛你好,很高興認識你 ("Hi Kurolo, nice to meet you")
Outline
• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory
External Memory
[Architecture: a DNN/RNN maps input to output; a reading head controller positions a reading head over the machine's memory, and the read content feeds into the network.]
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Attain%20(v3).ecm.mp4/index.html
Reading Comprehension
Each sentence in the document becomes a vector (via semantic analysis) and is stored in memory. Given a query, the reading head controller attends over the sentence vectors, and the DNN/RNN produces the answer.
Memory Network
Match the query vector q against each sentence vector x_1 … x_N of the document to get weights α_1 … α_N; the extracted information is
Σ_{n=1}^{N} α_n x_n
which, together with q, feeds a DNN that produces the answer. The sentence-to-vector conversion can be jointly trained.
A refinement: compute the match weights α_n from the vectors x_n, but extract information from separately (jointly) learned vectors h_1 … h_N:
Extracted information = Σ_{n=1}^{N} α_n h_n
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus, "End-To-End Memory Networks", NIPS, 2015
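The memory-network read above, as a minimal numpy sketch (sentence and query vectors are random placeholders): match the query against each sentence vector, softmax the scores, and take the weighted sum.

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 4, 5
X = rng.normal(size=(N, d))      # sentence vectors x_1 .. x_N
q = rng.normal(size=d)           # query vector

match = X @ q                    # match score per sentence
alpha = np.exp(match) / np.exp(match).sum()   # softmax over sentences
extracted = alpha @ X            # sum_n alpha_n x_n
```

In the refined variant, `extracted` would instead be `alpha @ H` for a second, jointly learned set of sentence vectors H.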
Memory Network – Hopping
Multiple-hop: compute attention over the memory and extract information, feed the result back as the new query, compute attention and extract again, and finally a DNN produces the answer a. The position of the reading head is determined by the attention at each hop.
• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. NIPS, 2015.
Keras has an example: https://github.com/fchollet/keras/blob/master/examples/babi_memnn.py
Visual Question Answering
source: http://visualqa.org/
Visual Question Answering
A CNN produces a vector for each region of the image; given the query, the reading head controller attends over the region vectors, and the DNN/RNN produces the answer.
• Huijuan Xu, Kate Saenko. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. arXiv Pre-Print, 2015
External Memory v2
[Architecture: as before, plus a writing head controller that positions a writing head over the machine's memory, so the DNN/RNN can also modify what is stored.]
Neural Turing Machine
Neural Turing Machine
Retrieval process: the memory holds m^0_1 … m^0_4 with attention weights α̂^0_1 … α̂^0_4, and the read vector is
r^0 = Σ_i α̂^0_i m^0_i
r^0 and the input x^1 feed the controller f (with state h^0), which produces the output y^1.
Neural Turing Machine
The controller f also emits a key k^1 (plus vectors e^1 and a^1 for writing). The new attention is
α^1_i = cos(m^0_i, k^1),  then  (α̂^1_1, …, α̂^1_4) = softmax(α^1_1, …, α^1_4)
Neural Turing Machine
Writing: each memory slot is updated with the erase vector e^1 (entries in 0 ~ 1) and the add vector a^1 (element-wise ⊙):
m^1_i = m^0_i − α̂^1_i e^1 ⊙ m^0_i + α̂^1_i a^1
Neural Turing Machine
The cycle repeats: read r^1 from m^1_1 … m^1_4 with the weights α̂^1_i, feed (r^1, x^2) to f with state h^1 to get y^2, then compute the next attention α̂^2_i and memory m^2_i, and so on.
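One NTM read-then-write step following the slides' equations, as a numpy sketch (the memory contents, key, erase, and add vectors are random placeholders a controller would normally emit):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_slots = 3, 4
M = rng.normal(size=(n_slots, d))   # memory slots m_1 .. m_4

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

k = rng.normal(size=d)              # key emitted by the controller
scores = np.array([cosine(m, k) for m in M])
alpha = np.exp(scores) / np.exp(scores).sum()   # softmax attention

r = alpha @ M                       # read: r = sum_i alpha_i m_i

e = np.full(d, 0.5)                 # erase vector, entries in [0, 1]
a = rng.normal(size=d)              # add vector
# write: m_i <- m_i - alpha_i * e (element-wise) m_i + alpha_i * a
M_new = M - alpha[:, None] * e * M + alpha[:, None] * a
```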
Neural Turing Machine for LM
Wei-Jen Ko, Bo-Hsiang Tseng, Hung-yi Lee,
“Recurrent Neural Network based Language Modeling with Controllable External Memory”, ICASSP, 2017
Stack RNN
Armand Joulin, Tomas Mikolov, Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets, arXiv Pre-Print, 2015
At each step, the controller f reads x^t and the top of the stack, produces y^t, the information to store, and a distribution over the actions Push / Pop / Nothing (e.g., 0.7, 0.2, 0.1).
The new stack is the probability-weighted combination of the three outcomes: 0.7 × (pushed stack) + 0.2 × (popped stack) + 0.1 × (unchanged stack), which keeps the whole operation differentiable.
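The differentiable stack update above, sketched in numpy (stack contents and the pushed value are illustrative):

```python
import numpy as np

def soft_stack_update(stack, p_push, p_pop, p_noop, value):
    """Mix the three possible stack outcomes by their probabilities."""
    pushed = np.concatenate(([value], stack[:-1]))   # value goes on top
    popped = np.concatenate((stack[1:], [0.0]))      # top is removed
    return p_push * pushed + p_pop * popped + p_noop * stack

stack = np.array([1.0, 2.0, 3.0, 0.0])
new_stack = soft_stack_update(stack, 0.7, 0.2, 0.1, value=5.0)
```

The new top, 0.7 × 5 + 0.2 × 2 + 0.1 × 1 = 4.0, blends the three outcomes, so gradients can flow through the choice of stack action.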
Concluding Remarks
• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory