Advanced Tips for Deep Learning
Hung-yi Lee
Prerequisite: https://www.youtube.com/watch?v=xki61j7z-30
[Recipe of deep learning: Step 1: define a set of functions → Step 2: goodness of function → Step 3: pick the best function → Neural Network. Good results on training data? If NO, go back and improve training. If YES: good results on testing data? If NO: overfitting! If YES: done.]
Recipe of Deep Learning
Do not always blame Overfitting
Deep Residual Learning for Image Recognition http://arxiv.org/abs/1512.03385
[Figure from the cited paper: the deeper network has higher error on both the testing data and the training data, so the problem is not overfitting; the deeper network is simply not well trained. Check the results on training data before blaming overfitting for bad testing results.]
Recipe of Deep Learning
Different approaches for different problems.
e.g., dropout is for improving results on testing data (not training data)
Outline
• Batch Normalization
• New Activation Function
• Tuning Hyperparameters
• Interesting facts (?) about deep learning
• Capsule
• New models for QA
Batch Normalization
Sergey Ioffe, Christian Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, 2015
Feature Scaling
[Figure: a = w₁x₁ + w₂x₂ + b, with loss L. If x₁ takes values like 1, 2, … while x₂ takes values like 100, 200, …, the loss surface is elongated: L changes slowly along w₁ and sharply along w₂, which makes gradient descent difficult. After the two features are scaled to a similar range, the loss surface is much closer to circular and training is easier.]
Make different features have the same scaling
Feature Scaling
For data points x¹, x², x³, …, xʳ, …, xᴿ, compute for each dimension i the mean mᵢ and the standard deviation σᵢ, then normalize:

xᵢʳ ← (xᵢʳ − mᵢ) / σᵢ

After this, the means of all dimensions are 0 and the variances are all 1.
22In general, gradient descent converges much faster with feature scaling than without it.
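This per-dimension standardization is a one-liner in practice; here is a minimal NumPy sketch (the function name and the epsilon guard are mine, not from the slides):

```python
import numpy as np

def feature_scaling(X, eps=1e-8):
    """Standardize each feature dimension to zero mean and unit variance.
    X: array of shape (num_examples, num_features)."""
    mean = X.mean(axis=0)          # m_i for each dimension i
    std = X.std(axis=0)            # sigma_i for each dimension i
    return (X - mean) / (std + eps)

# Example: two features on very different scales (1, 2, ... vs. 100, 200, ...)
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_scaled = feature_scaling(X)
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```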
How about Hidden Layer?
[Figure: x → Layer 1 → a¹ → Layer 2 → a² → …. Feature scaling is applied to the input x; should it also be applied to the hidden-layer outputs a¹, a², …?]
The distribution of each layer's input changes whenever the earlier layers are updated (internal covariate shift). A smaller learning rate can be helpful, but then training is slower.
Difficulty: the statistics of the hidden-layer outputs change during training.
Solution: batch normalization.
Batch: the examples x¹, x², x³ are fed through the network together.
z¹ = W¹x¹, z² = W¹x², z³ = W¹x³; then a¹ = sigmoid(z¹), a² = sigmoid(z²), a³ = sigmoid(z³); then W², and so on.
In practice the whole batch is processed in parallel as one matrix operation: [z¹ z² z³] = W¹[x¹ x² x³].
Batch normalization
x¹, x², x³ → W¹ → z¹, z², z³

μ = (1/3) Σᵢ₌₁³ zⁱ        σ = √( (1/3) Σᵢ₌₁³ (zⁱ − μ)² )

μ and σ depend on the zⁱ.
Note: batch normalization cannot be applied to small batches.
Batch normalization
x¹, x², x³ → W¹ → z¹, z², z³ → compute μ, σ →

z̃ⁱ = (zⁱ − μ) / σ

→ sigmoid → a¹, a², a³

μ and σ depend on the zⁱ. How do we do backpropagation? (Gradients must also flow through μ and σ, since they are functions of the zⁱ.)
Batch normalization
x¹, x², x³ → W¹ → z¹, z², z³ → compute μ, σ →

z̃ⁱ = (zⁱ − μ) / σ        ẑⁱ = γ ⊙ z̃ⁱ + β

μ and σ depend on the zⁱ; γ and β are learned parameters.
Batch normalization
• At testing stage:
x → W¹ → z,   z̃ = (z − μ) / σ,   ẑ = γ ⊙ z̃ + β
μ and σ come from the batch; γ and β are network parameters. We do not have a batch at the testing stage.
Ideal solution: compute μ and σ using the whole training dataset.
Practical solution: compute a moving average of the μ and σ of the batches during training.
[Figure: accuracy vs. number of updates; the batch statistics μ¹, μ¹⁰⁰, μ³⁰⁰, … computed later in training (when accuracy is higher) are more representative.]
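A minimal NumPy sketch of the computation above, including the moving-average trick used at test time (the function names, the momentum value, and the epsilon guard are my choices; framework implementations differ in details such as tracking the running variance per channel):

```python
import numpy as np

def batch_norm_train(Z, gamma, beta, running_mu, running_sigma,
                     momentum=0.99, eps=1e-8):
    """Z: (batch_size, num_units) pre-activations z^i for one batch."""
    mu = Z.mean(axis=0)                      # mean over the batch
    sigma = Z.std(axis=0)                    # std over the batch
    Z_tilde = (Z - mu) / (sigma + eps)       # normalize
    Z_hat = gamma * Z_tilde + beta           # learned scale and shift
    # moving averages kept for test time (the "practical solution")
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_sigma = momentum * running_sigma + (1 - momentum) * sigma
    return Z_hat, running_mu, running_sigma

def batch_norm_test(z, gamma, beta, running_mu, running_sigma, eps=1e-8):
    """At test time there is no batch, so use the moving averages instead."""
    z_tilde = (z - running_mu) / (running_sigma + eps)
    return gamma * z_tilde + beta
```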
Batch normalization - Benefit
• BN reduces training time and makes very deep networks trainable.
• Because there is less internal covariate shift, we can use larger learning rates.
• Less exploding/vanishing gradients
• Especially effective for sigmoid, tanh, etc.
• Learning is less affected by initialization.
• BN reduces the demand for regularization.
z̃ⁱ = (zⁱ − μ) / σ        ẑⁱ = γ ⊙ z̃ⁱ + β
[Figure: if W¹ is multiplied by k, then the zⁱ and also μ and σ are multiplied by k, so z̃ⁱ (and therefore ẑⁱ) keeps the same value. This is why learning is less sensitive to the scale of the weights and to initialization.]
Demo
Activation Function
Günter Klambauer, Thomas Unterthiner, Andreas Mayr, Sepp Hochreiter,
"Self-Normalizing Neural Networks", NIPS, 2017
ReLU
• Rectified Linear Unit (ReLU)
Reasons:
1. Fast to compute
2. Biological reason
3. Equivalent to an infinite number of sigmoids with different biases
4. Handles the vanishing gradient problem
[Plot: a = z for z > 0, a = 0 for z ≤ 0, compared with the sigmoid σ(z).]
[Xavier Glorot, AISTATS’11]
[Andrew L. Maas, ICML’13]
[Kaiming He, arXiv’15]
ReLU - variant
Leaky ReLU:      a = z for z > 0,  a = 0.01z for z ≤ 0
Parametric ReLU:  a = z for z > 0,  a = αz for z ≤ 0, where α is also learned by gradient descent
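A minimal sketch of these two variants (α is just a function argument here; for Parametric ReLU it would be a trainable parameter updated by gradient descent):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """a = z for z > 0, a = alpha * z otherwise.
    alpha is fixed (e.g. 0.01) for Leaky ReLU and learned for Parametric ReLU."""
    return np.where(z > 0, z, alpha * z)
```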
ReLU - variant
Exponential Linear Unit (ELU):  a = z for z > 0,  a = α(eᶻ − 1) for z ≤ 0
Randomized ReLU:  a = z for z > 0,  a = αz for z ≤ 0, where α is sampled from a distribution during training and fixed during testing (similar in spirit to dropout?)
ReLU - variant
Exponential Linear Unit (ELU):  a = z for z > 0,  a = α(eᶻ − 1) for z ≤ 0
Scaled ELU (SELU): multiply both pieces by λ:  a = λz for z > 0,  a = λα(eᶻ − 1) for z ≤ 0
α = 1.6732632423543772848170429916717
λ = 1.0507009873554804934193349852946
https://github.com/bioinf-jku/SNNs
SELU
a = λz for z > 0,  a = λα(eᶻ − 1) for z ≤ 0,  with α = 1.673263242…, λ = 1.050700987…
• Positive and negative values: the whole ReLU family has this property except the original ReLU.
• Saturation region: ELU also has this property.
• Slope larger than 1: only SELU has this property.
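A minimal sketch of SELU with the constants above (pure NumPy; deep learning frameworks ship this as a built-in activation):

```python
import numpy as np

ALPHA = 1.6732632423543772848170429916717
LAMBDA = 1.0507009873554804934193349852946

def selu(z):
    """lambda * z for z > 0, lambda * alpha * (exp(z) - 1) for z <= 0."""
    return LAMBDA * np.where(z > 0, z, ALPHA * (np.exp(z) - 1.0))
```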
SELU
[Figure: a neuron with inputs a₁, …, a_K, weights w₁, …, w_K, z = Σₖ₌₁ᴷ wₖaₖ, output a = f(z).]
The inputs are i.i.d. random variables with mean μ and variance σ² (they do not have to be Gaussian).

μ_z = E[z] = Σₖ₌₁ᴷ E[aₖ]wₖ = μ Σₖ₌₁ᴷ wₖ = μ ∙ Kμ_w = 0

(assuming the inputs have μ = 0 and the weights have Kμ_w = 0)
SELU
[Same setting: z = Σₖ₌₁ᴷ wₖaₖ, a = f(z); the inputs are i.i.d. random variables with mean μ and variance σ².]
μ_z = 0

σ_z² = E[(z − μ_z)²] = E[z²] = E[(a₁w₁ + a₂w₂ + ⋯)²]

Cross terms vanish:  E[aᵢaⱼwᵢwⱼ] = wᵢwⱼE[aᵢ]E[aⱼ] = 0  (since μ = 0)
Square terms:  E[(aₖwₖ)²] = wₖ²E[aₖ²] = wₖ²σ²

σ_z² = Σₖ₌₁ᴷ wₖ²σ² = σ² ∙ Kσ_w² = 1

Target: with μ = 0, σ = 1 for the inputs and μ_w = 0, Kσ_w² = 1 for the weights, z has mean 0 and variance 1. Assuming z is Gaussian, SELU then keeps the output a = f(z) at mean 0 and variance 1 as well.
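The self-normalizing property above can be checked numerically: applying SELU to standard-normal samples should give outputs with mean ≈ 0 and variance ≈ 1. A small sketch (sample size and seed are arbitrary choices):

```python
import numpy as np

ALPHA = 1.6732632423543772848170429916717
LAMBDA = 1.0507009873554804934193349852946
selu = lambda z: LAMBDA * np.where(z > 0, z, ALPHA * (np.exp(z) - 1.0))

np.random.seed(0)
z = np.random.randn(1_000_000)   # z ~ N(0, 1), as assumed in the derivation
a = selu(z)
print(a.mean(), a.var())         # both come out close to 0 and 1
```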
Demo
Source of joke: https://zhuanlan.zhihu.com/p/27336839
"The 93-page proof" (the appendix of the SELU paper); SELU is actually more general.
• "The latest activation neuron: Self-Normalizing Neural Network (SELU)"
[Experimental results on MNIST and CIFAR-10.]
Demo
Swish: a = z ∙ sigmoid(βz)
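A minimal sketch of the Swish activation shown above (β = 1 is a common default; in some variants β is a learned parameter):

```python
import numpy as np

def swish(z, beta=1.0):
    """Swish: z * sigmoid(beta * z)."""
    return z / (1.0 + np.exp(-beta * z))
```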
Hyperparameters
Thanks to 沈昇勳 for providing the figure. Source of image: https://medium.com/intuitionmachine/the-brute-force-method-of-deep-learning-innovation-58b497323ae5 (Denny Britz's graphic)
Grid Search vs. Random Search
http://www.deeplearningbook.org/contents/guidelines.html
[Figure: sampling hyperparameter configurations (e.g., layer depth × layer width) on a grid vs. at random.]
Assumption: the top K results are good enough.
If there are N points, a single random sample lands in the top K with probability K/N. After sampling x times, the probability of hitting the top K at least once is 1 − (1 − K/N)ˣ > 90%.
If N = 1000 and K = 10, then x = 230; if K = 100, then x = 22.
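A small script that reproduces these numbers by solving for the smallest x with 1 − (1 − K/N)ˣ > 0.9 (the function name is mine):

```python
import math

def samples_needed(N, K, prob=0.9):
    """Smallest x such that 1 - (1 - K/N)**x > prob."""
    return math.ceil(math.log(1 - prob) / math.log(1 - K / N))

print(samples_needed(1000, 10))   # 230
print(samples_needed(1000, 100))  # 22
```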
Model-based Hyperparameter Optimization
https://cloud.google.com/blog/big-data/2017/08/hyperparameter-tuning-in-cloud-machine-learning-engine-using-bayesian-optimization
Reinforcement Learning
A controller designs a network → the network is trained → its accuracy is used as the reward. It can design LSTM-like cells, as shown in the previous lecture. This is one kind of meta learning (learning to learn).
[Figures: this kind of search also produced the SWISH activation and learning-rate schedules that can transfer to new tasks.]
Capsule
Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton, “Dynamic Routing Between Capsules”, NIPS, 2017
Capsule
• A neuron outputs a value; a capsule outputs a vector.
[Figure: a capsule takes vectors v¹, v² from lower-level capsules and outputs a vector v.]
A neuron detects one specific pattern (e.g., neuron A and neuron B each detect one type of pattern). A capsule also detects one type of pattern, but:
• Each dimension of v represents a characteristic of the pattern (e.g., two variants of the same pattern might give 1.0 vs. −1.0 in one dimension).
• The norm of v represents the existence of the pattern.
Capsule
[Figure: the inputs v¹, v² from lower-level capsules are transformed and combined into the capsule output v.]

u¹ = W¹v¹    u² = W²v²    s = c₁u¹ + c₂u²    v = Squash(s) = (‖s‖² / (1 + ‖s‖²)) ∙ (s / ‖s‖)

c₁, c₂ are the coupling coefficients. They are not learned by backprop; they are determined by dynamic routing, even during the testing stage (c.f. pooling).
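A minimal NumPy sketch of the squashing function above (the epsilon guard is mine):

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squash(s) = (||s||^2 / (1 + ||s||^2)) * (s / ||s||)."""
    norm_sq = np.sum(s * s)
    return (norm_sq / (1.0 + norm_sq)) * (s / (np.sqrt(norm_sq) + eps))
```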
Dynamic Routing
[Figure: u¹, u², u³ are combined with unknown coefficients into s; Squash(s) gives v. The coefficients are found by dynamic routing.]
b₁⁰ = 0, b₂⁰ = 0, b₃⁰ = 0
For r = 1 to T do:
    c₁ʳ, c₂ʳ, c₃ʳ = softmax(b₁ʳ⁻¹, b₂ʳ⁻¹, b₃ʳ⁻¹)
    sʳ = c₁ʳu¹ + c₂ʳu² + c₃ʳu³
    aʳ = Squash(sʳ)
    bᵢʳ = bᵢʳ⁻¹ + aʳ ∙ uⁱ
The final output is v = aᵀ, computed with the final coupling coefficients c₁ᵀ, c₂ᵀ, c₃ᵀ.
[Figure: the routing iterations unrolled for T = 3, like an RNN; the whole procedure is differentiable, so the transformation matrices W are still learned by backprop through it.]
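A minimal NumPy sketch of the routing loop above for a single output capsule receiving u¹, u², u³. It follows the version written on the slide (a softmax over this capsule's inputs); the paper's full algorithm normalizes over output capsules instead:

```python
import numpy as np

def squash(s, eps=1e-8):
    norm_sq = np.sum(s * s)
    return (norm_sq / (1.0 + norm_sq)) * (s / (np.sqrt(norm_sq) + eps))

def dynamic_routing(U, T=3):
    """U has shape (num_inputs, dim); its rows are u^1, u^2, u^3, ..."""
    b = np.zeros(U.shape[0])                   # b_i^0 = 0
    for _ in range(T):
        c = np.exp(b - b.max()); c /= c.sum()  # softmax -> coupling coefficients c_i^r
        s = (c[:, None] * U).sum(axis=0)       # s^r = sum_i c_i^r u^i
        a = squash(s)                          # a^r = Squash(s^r)
        b = b + U @ a                          # b_i^r = b_i^{r-1} + a^r . u^i
    return a                                   # v = a^T

# usage sketch: v = dynamic_routing(np.random.randn(3, 8), T=3)
```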
Capsule
• Capsule can also be convolutional.
• Simply replace filter with capsule
• Output layer and loss
[Figure: CapsNet. The output layer has one capsule per class; ‖v¹‖ is the confidence of "1", ‖v²‖ is the confidence of "2", and so on. The output capsules are also fed into an NN decoder (with the non-target capsules masked out: ×1 for the target class, ×0 for the others) that is trained to minimize the reconstruction error of the input image.]
Experimental Results
• MNIST
• Each example is an MNIST digit with a random small affine transformation.
• However, the models were never trained with affine transformations.
• CapsNet achieved 79% accuracy on the affNIST test set.
• A traditional convolutional model with a similar number of parameters achieved 66%.
Experimental Results
• Each dimension contains specific information.
[Figure: perturbing one dimension of v¹ at a time and reconstructing with the decoder NN (trained to minimize reconstruction error) shows what each dimension encodes.]
Experimental Results
• MultiMNIST
[Figure: top row: input images; bottom row: reconstructions. R: reconstructed digits, L: true labels.]
Discussion
• Invariance vs. Equivariance
[Figure: the same NN applied to transformed inputs. Invariance: the output does not change. Equivariance: the output changes in a corresponding way.]
Discussion
• Invariance vs. Equivariance
Max pooling has invariance, but not equivariance.
Capsules have both invariance and equivariance.
[Example: max pooling over (3, −1, −3, −1) and over (−3, 1, 0, 3) both output 3, so it cannot tell the two inputs apart ("I don't know the difference"). A capsule outputs v¹ with a large norm in both cases, but the vectors themselves can be different: "I know the difference, but I do not react to it." This behaviour comes from dynamic routing.]
To Learn More ……
• Hinton’s talk:
https://www.youtube.com/watch?v=rTawFwUvnLE
• Keras:
• https://github.com/XifengGuo/CapsNet-Keras
• TensorFlow:
• https://github.com/naturomics/CapsNet-Tensorflow
• PyTorch:
• https://github.com/gram-ai/capsule-networks
• https://github.com/timomernick/pytorch-capsule
• https://github.com/nishnik/CapsNet-PyTorch
Interesting Facts (?) about Deep Learning
Training stuck because … ?
• People believe training gets stuck because the parameters are near a local minimum.
[Figure: loss surface with a local minimum.]
How about saddle points?
http://www.deeplearningbook.org/contents/optimization.html
Training stuck because … ?
• People believe training gets stuck because the parameters are near a critical point. !!!
http://www.deeplearningbook.org/contents/optimization.html
Brute-force Memorization ?
https://arxiv.org/pdf/1611.03530.pdf
Final of 2017 Spring:
https://ntumlds.wordpress.com/2017/03/27/r05922018_drliao/
Demo
Brute-force Memorization ?
• Networks fit simple patterns first, then memorize the exceptions.
https://arxiv.org/pdf/1706.05394.pdf
Knowledge Distillation
Knowledge Distillation
https://arxiv.org/pdf/1503.02531.pdf Do Deep Nets Really Need to be Deep?
https://arxiv.org/pdf/1312.6184.pdf
[Figure: a deep Teacher Net is trained on the training data and produces soft outputs, e.g. "1": 0.7, "7": 0.2, "9": 0.1. A shallow Student Net takes the teacher's soft outputs as its learning target and is trained by cross-entropy minimization on the same training data.]
https://arxiv.org/pdf/1312.6184.pdf
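A minimal sketch of the distillation objective described above: the student is trained with cross-entropy against the teacher's soft outputs (pure NumPy, single example; the temperature knob follows the cited Hinton et al. paper, and the names are mine):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft targets and the student's outputs."""
    soft_targets = softmax(teacher_logits, temperature)   # e.g. "1": 0.7, "7": 0.2, "9": 0.1
    student_probs = softmax(student_logits, temperature)
    return -np.sum(soft_targets * np.log(student_probs + 1e-12))
```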
Deep Learning for Question Answering
Question Answering
• Given a document and a query, output an answer
• bAbI: the answer is a word
• https://research.fb.com/downloads/babi/
• SQuAD: the answer is a sequence of words (in the input document)
• https://rajpurkar.github.io/SQuAD-explorer/
• MS MARCO: the answer is a sequence of words
• http://www.msmarco.org
• MovieQA: Multiple choice question (output a number)
• http://movieqa.cs.toronto.edu/home/
• More: https://github.com/dapurv5/awesome-question-answering
Bi-directional Attention Flow
Demo: http://35.165.153.16:1995
Dynamic Coattention Networks
• Experimental Results
DCN+: https://arxiv.org/pdf/1711.00106.pdf
R-Net
S-Net
MS MARCO
Attention-over-Attention (AoA)
(not on SQuAD)
Reinforced Mnemonic Reader
Multiple-hop
ReasoNet
https://arxiv.org/abs/1609.05284