### Advanced Tips for Deep Learning

### Hung-yi Lee

Prerequisite: https://www.youtube.com/watch?v=xki61j7z-30

**Recipe of Deep Learning**

Step 1: define a set of functions (the neural network). Step 2: define the goodness of a function. Step 3: pick the best function. After training, first ask: good results on the *training* data? If NO, go back and improve the training. If YES, ask: good results on the *testing* data? If NO, the problem is overfitting; if YES, done.

### Do not always blame Overfitting

Deep Residual Learning for Image Recognition, http://arxiv.org/abs/1512.03385

When a deeper network gets worse results on the testing data than a shallower one, it is tempting to blame overfitting. But if the deeper network is also worse on the *training* data, it is simply not well trained. Overfitting means good training results together with bad testing results.

**Recipe of Deep Learning**

### Different approaches for different problems.

e.g. dropout for good results on testing data

### Outline

### • Batch Normalization

### • New Activation Function

### • Tuning Hyperparameters

### • Interesting facts (?) about deep learning

### • Capsule

### • New models for QA

### Batch Normalization

Sergey Ioffe, Christian Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, 2015

### Feature Scaling

Consider 𝑎 = 𝑤_1𝑥_1 + 𝑤_2𝑥_2 + 𝑏. If 𝑥_1 takes values like 1, 2, …… while 𝑥_2 takes values like 100, 200, ……, a change in 𝑤_2 affects the loss L far more than the same change in 𝑤_1, so the error surface is a narrow ellipse that is hard to optimize. After scaling, so that both features take values like 1, 2, ……, the loss contours become more circular and gradient descent behaves well in every direction. Make different features have the same scaling.

### Feature Scaling

Given training examples 𝑥^1, 𝑥^2, 𝑥^3, …, 𝑥^𝑟, …, 𝑥^𝑅, for each dimension i compute the mean 𝑚_𝑖 and the standard deviation 𝜎_𝑖 over the R examples, then normalize:

𝑥_𝑖^𝑟 ← (𝑥_𝑖^𝑟 − 𝑚_𝑖) / 𝜎_𝑖

After this, the means of all dimensions are 0, and the variances are all 1. In general, gradient descent converges much faster with feature scaling than without it.
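The normalization above can be sketched in a few lines of numpy (a minimal illustration; the example array `X` and the function name are mine, not from the slides):

```python
import numpy as np

# Standardize each feature (column) to zero mean and unit variance,
# following the slide's update x_i^r <- (x_i^r - m_i) / sigma_i.
def feature_scaling(X):
    m = X.mean(axis=0)        # per-dimension mean m_i
    sigma = X.std(axis=0)     # per-dimension standard deviation sigma_i
    return (X - m) / sigma

# Two features on very different scales, as in the slide's example.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
X_scaled = feature_scaling(X)
```

After scaling, every column has mean 0 and standard deviation 1, which rounds out the loss contours.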

### How about Hidden Layers?

𝑥^1 → Layer 1 → 𝑎^1 → Layer 2 → 𝑎^2 → …… Feature scaling is applied to the input 𝑥^1; should it also be applied to the hidden-layer outputs 𝑎^1, 𝑎^2? A smaller learning rate can be helpful, but the training would be slower. Difficulty: the statistics of the hidden-layer outputs change during the training …

### Batch normalization

Batch normalization addresses this Internal Covariate Shift. Consider a batch 𝑥^1, 𝑥^2, 𝑥^3. Each example goes through the same layer: 𝑧^𝑖 = 𝑊^1𝑥^𝑖, followed by the sigmoid to give 𝑎^𝑖, then 𝑊^2, and so on. In practice the whole batch is processed in parallel as one matrix product:

(𝑧^1 𝑧^2 𝑧^3) = 𝑊^1 (𝑥^1 𝑥^2 𝑥^3)

### Batch normalization

Compute the mean and standard deviation of 𝑧^1, 𝑧^2, 𝑧^3 over the batch:

𝜇 = (1/3) Σ_{𝑖=1}^{3} 𝑧^𝑖,  𝜎 = √( (1/3) Σ_{𝑖=1}^{3} (𝑧^𝑖 − 𝜇)^2 )

Both 𝜇 and 𝜎 depend on the 𝑧^𝑖. Note: batch normalization cannot be applied to small batches, because 𝜇 and 𝜎 would then be poor estimates of the true statistics.

### Batch normalization

Normalize each 𝑧^𝑖 before the activation:

ǁ𝑧^𝑖 = (𝑧^𝑖 − 𝜇) / 𝜎

and apply the sigmoid to ǁ𝑧^𝑖 to obtain 𝑎^𝑖. How to do backpropagation? Since 𝜇 and 𝜎 depend on the 𝑧^𝑖, the gradients must flow through 𝜇 and 𝜎 as well, not only through the direct path.

### Batch normalization

The network can then scale and shift the normalized values with learned parameters 𝛾 and 𝛽:

ǁ𝑧^𝑖 = (𝑧^𝑖 − 𝜇) / 𝜎,  Ƹ𝑧^𝑖 = 𝛾 ⊙ ǁ𝑧^𝑖 + 𝛽

As before, 𝜇 and 𝜎 depend on the 𝑧^𝑖.

### Batch normalization

• At testing stage: 𝑥 → 𝑊^1 → 𝑧, ǁ𝑧 = (𝑧 − 𝜇)/𝜎, Ƹ𝑧 = 𝛾 ⊙ ǁ𝑧 + 𝛽. Here 𝛾 and 𝛽 are network parameters, but 𝜇 and 𝜎 are **from the batch**, and **we do not have a batch at testing stage.**

Ideal solution: compute 𝜇 and 𝜎 using the whole training dataset.

Practical solution: keep a moving average of the 𝜇 and 𝜎 of the batches during training (𝜇_1, 𝜇_100, 𝜇_300, … as updates accumulate) and use it at testing time.
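The two stages can be sketched in numpy (a minimal sketch of the computation described above, not a framework implementation; the function names and the momentum value are my choices):

```python
import numpy as np

# Training-time batch norm: normalize with the batch's own statistics,
# then scale and shift: z_hat = gamma * (z - mu) / sigma + beta.
def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    mu = Z.mean(axis=0)                  # batch mean
    sigma = Z.std(axis=0)                # batch standard deviation
    Z_tilde = (Z - mu) / (sigma + eps)   # normalized pre-activations
    return gamma * Z_tilde + beta, mu, sigma

# Moving average of mu / sigma, to stand in for batch statistics at test time.
def update_moving_average(avg, new, momentum=0.99):
    return momentum * avg + (1 - momentum) * new

rng = np.random.default_rng(0)
Z = rng.normal(5.0, 3.0, size=(64, 4))   # a batch of pre-activations
Z_hat, mu, sigma = batchnorm_forward(Z, gamma=1.0, beta=0.0)
```

At testing time one would call the same normalization with the accumulated moving-average `mu` and `sigma` instead of the batch statistics.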

### Batch normalization - Benefits

• BN reduces training time and makes very deep networks trainable.

• Because of less covariate shift, we can use larger learning rates.

• Less exploding/vanishing gradient; especially effective for sigmoid, tanh, etc.

• Learning is less affected by initialization.

• BN reduces the demand for regularization.

If the weights 𝑊^1 are multiplied by a factor 𝒌, then 𝑧^𝑖, 𝜇, and 𝜎 are all multiplied by 𝒌 as well, so ǁ𝑧^𝑖 = (𝑧^𝑖 − 𝜇)/𝜎 (and therefore Ƹ𝑧^𝑖 = 𝛾 ⊙ ǁ𝑧^𝑖 + 𝛽) keeps the same value. This is why learning with BN is insensitive to the scale of the weights.

### Demo

### Activation Function

Günter Klambauer, Thomas Unterthiner, Andreas Mayr, Sepp Hochreiter, "Self-Normalizing Neural Networks", NIPS, 2017

### ReLU

• Rectified Linear Unit (ReLU): 𝑎 = 𝑧 for 𝑧 > 0, 𝑎 = 0 for 𝑧 ≤ 0 (compare the sigmoid 𝜎(𝑧)).

**Reasons:**

1. Fast to compute
2. Biological reason
3. Equivalent to an infinite number of sigmoids with different biases
4. Handles the vanishing gradient problem

[Xavier Glorot, AISTATS'11] [Andrew L. Maas, ICML'13] [Kaiming He, arXiv'15]

### ReLU - variant

**Leaky ReLU:** 𝑎 = 𝑧 for 𝑧 > 0, 𝑎 = 0.01𝑧 for 𝑧 ≤ 0

**Parametric ReLU:** 𝑎 = 𝑧 for 𝑧 > 0, 𝑎 = 𝛼𝑧 for 𝑧 ≤ 0, where 𝛼 is also learned by gradient descent

### ReLU - variant

**Exponential Linear Unit (ELU):** 𝑎 = 𝑧 for 𝑧 > 0, 𝑎 = 𝛼(𝑒^𝑧 − 1) for 𝑧 ≤ 0

**Randomized ReLU:** 𝑎 = 𝑧 for 𝑧 > 0, 𝑎 = 𝛼𝑧 for 𝑧 ≤ 0, where 𝛼 is sampled from a distribution during training and fixed during testing. Reminiscent of dropout?

### ReLU - variant

**Exponential Linear Unit (ELU):** 𝑎 = 𝑧 for 𝑧 > 0, 𝑎 = 𝛼(𝑒^𝑧 − 1) for 𝑧 ≤ 0

**Scaled ELU (SELU):** multiply both branches by 𝜆 (https://github.com/bioinf-jku/SNNs):

𝑎 = 𝜆𝑧 for 𝑧 > 0, 𝑎 = 𝜆𝛼(𝑒^𝑧 − 1) for 𝑧 ≤ 0

𝛼 = 1.6732632423543772848170429916717
𝜆 = 1.0507009873554804934193349852946

### SELU

𝑎 = 𝜆𝑧 (𝑧 > 0), 𝑎 = 𝜆𝛼(𝑒^𝑧 − 1) (𝑧 ≤ 0), 𝛼 = 1.673263242…, 𝜆 = 1.050700987…

• Positive and negative values: the whole ReLU family has this property except the original ReLU.

• Saturation region for very negative inputs: ELU also has this property.

• Slope larger than 1 in the positive region: only SELU has this property.
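The family of activations above can be written in a few lines of numpy (a sketch for illustration; the constants are the SELU values quoted above, and the function names are mine):

```python
import numpy as np

# SELU constants from Klambauer et al., truncated to float precision.
ALPHA = 1.6732632423543772
LAMBDA_ = 1.0507009873554805

def relu(z):       return np.where(z > 0, z, 0.0)
def leaky_relu(z): return np.where(z > 0, z, 0.01 * z)
def prelu(z, a):   return np.where(z > 0, z, a * z)   # a is learned in practice
def elu(z, a=1.0): return np.where(z > 0, z, a * (np.exp(z) - 1.0))
def selu(z):       return LAMBDA_ * elu(z, ALPHA)     # scale both branches by lambda

z = np.array([-100.0, -1.0, 0.0, 1.0])
out = selu(z)
```

Note the three properties: `selu` is negative for negative inputs, saturates near `-LAMBDA_ * ALPHA` for very negative inputs, and has slope `LAMBDA_ > 1` on the positive side.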

### SELU

Consider one neuron: 𝑧 = Σ_{𝑘=1}^{𝐾} 𝑤_𝑘𝑎_𝑘, 𝑎 = 𝑓(𝑧). Assume the inputs 𝑎_1, …, 𝑎_𝐾 are i.i.d. random variables with mean 𝜇 and variance 𝜎^2 (they do not have to be Gaussian). Then

𝜇_𝑧 = 𝐸[𝑧] = Σ_{𝑘=1}^{𝐾} 𝐸[𝑎_𝑘]𝑤_𝑘 = 𝜇 Σ_{𝑘=1}^{𝐾} 𝑤_𝑘 = 𝜇 ∙ 𝐾𝜇_𝑤

With 𝜇 = 0 (the target for the previous layer's outputs), 𝜇_𝑧 = 0.

### SELU

With 𝜇_𝑧 = 0, the variance of 𝑧 is

𝜎_𝑧^2 = 𝐸[(𝑧 − 𝜇_𝑧)^2] = 𝐸[𝑧^2] = 𝐸[(𝑎_1𝑤_1 + 𝑎_2𝑤_2 + ⋯)^2]

The cross terms vanish because 𝐸[𝑎_𝑖𝑎_𝑗𝑤_𝑖𝑤_𝑗] = 𝑤_𝑖𝑤_𝑗𝐸[𝑎_𝑖]𝐸[𝑎_𝑗] = 0, and the square terms give 𝐸[(𝑎_𝑘𝑤_𝑘)^2] = 𝑤_𝑘^2 𝐸[𝑎_𝑘^2] = 𝑤_𝑘^2 𝜎^2. Hence

𝜎_𝑧^2 = Σ_{𝑘=1}^{𝐾} 𝑤_𝑘^2 𝜎^2 = 𝜎^2 ∙ 𝐾𝜎_𝑤^2 = 1

when 𝜎 = 1 (the target) and the weights satisfy 𝜇_𝑤 = 0 and 𝐾𝜎_𝑤^2 = 1. Assuming 𝑧 is then approximately Gaussian, SELU maps mean 0 and variance 1 back to mean 0 and variance 1 at the output.
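The derivation above can be checked numerically (an illustration, not a proof; the sample sizes and the choice of a uniform input distribution, which shows the inputs need not be Gaussian, are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 100, 200_000

# Weights satisfying the slide's assumptions: mu_w = 0 and
# K * sigma_w^2 = sum_k w_k^2 = 1.
w = rng.normal(size=K)
w -= w.mean()                     # enforce mu_w = 0
w /= np.sqrt((w ** 2).sum())      # enforce sum_k w_k^2 = 1

# i.i.d. inputs with mean 0 and variance 1 -- uniform, not Gaussian,
# since the derivation does not require Gaussian inputs.
a = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(N, K))
z = a @ w                         # z = sum_k w_k a_k for each sample
```

With these assumptions, the empirical mean of `z` should be close to 0 and its variance close to 1, matching 𝜇_𝑧 = 0 and 𝜎_𝑧^2 = 1.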

### Demo

Source of joke: https://zhuanlan.zhihu.com/p/27336839 (about the 93-page proof in the paper's appendix)

### SELU is actually more general.

• "The newest activation function: Self-Normalizing Neural Network (SELU)" (blog post, in Chinese)

(Results on MNIST and CIFAR-10.)

### Demo

Swish: 𝑎 = 𝑧 ∙ 𝑠𝑖𝑔𝑚𝑜𝑖𝑑(𝛽𝑧)
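The formula 𝑎 = 𝑧 ∙ 𝑠𝑖𝑔𝑚𝑜𝑖𝑑(𝛽𝑧) is a one-liner (a sketch; the function name and default 𝛽 = 1 are mine):

```python
import math

# a = z * sigmoid(beta * z); for large positive z it approaches z,
# and it is exactly 0 at z = 0.
def swish(z, beta=1.0):
    return z / (1.0 + math.exp(-beta * z))

val = swish(10.0)   # close to 10, since sigmoid(10) is nearly 1
```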

### Hyperparameters

Thanks to classmate 沈昇勳 for providing the figure. Source of image: https://medium.com/intuitionmachine/the-brute-force-method-of-deep-learning-innovation-58b497323ae5 (Denny Britz's graphic)

### Grid Search vs. Random Search

http://www.deeplearningbook.org/contents/guidelines.html

With grid search over hyperparameters such as layer depth and layer width, many trials differ only in a hyperparameter that barely matters; random search tries different values on every axis in every trial.

Assumption: the top K of the N candidate points are good enough. Each random sample lands in the top K with probability K/N, so after sampling x times the probability of hitting the top K at least once is 1 − (1 − K/N)^𝑥, and we want this to exceed 90%.

If N = 1000: K = 10 requires x = 230; K = 100 requires x = 22.
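The sample counts on the slide follow directly from solving 1 − (1 − K/N)^x > 0.9 for x (a small sketch; the function name is mine):

```python
import math

# Smallest x with 1 - (1 - K/N)^x > target.
def samples_needed(N, K, target=0.90):
    p = K / N
    x = math.ceil(math.log(1.0 - target) / math.log(1.0 - p))
    while 1.0 - (1.0 - p) ** x <= target:  # guard against float edge cases
        x += 1
    return x
```

For N = 1000 this reproduces the slide's numbers: 230 samples when K = 10, and 22 when K = 100.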

### Model-based Hyperparameter Optimization

https://cloud.google.com/blog/big-data/2017/08/hyperparameter-tuning-in-cloud-machine-learning-engine-using-bayesian-optimization

**Reinforcement Learning**

Design a network, train the network, and use its accuracy as the reward for the designer. This is one kind of meta-learning (or "learn to learn"). It can design LSTM-like cells as shown in the previous lecture, discover activation functions such as Swish, tune the learning rate, and the learned designer can transfer to new tasks.

## Capsule

Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton, “Dynamic Routing Between Capsules”, NIPS, 2017

### Capsule

• Neuron: outputs a value. Capsule: outputs a vector 𝑣.

A neuron detects a specific pattern (Neuron A and Neuron B each detect one type of pattern). A capsule likewise detects one type of pattern, but each dimension of 𝑣 represents a characteristic of the pattern, while the norm of 𝑣 represents the existence of the pattern. For example, 𝑣 = [1.0, …] and 𝑣 = [−1.0, …] indicate the same pattern type with an opposite characteristic in the first dimension.

### Capsule

Given the vectors 𝑣^1, 𝑣^2 output by the capsules below:

𝑢^1 = 𝑊^1𝑣^1, 𝑢^2 = 𝑊^2𝑣^2

𝑠 = 𝑐_1𝑢^1 + 𝑐_2𝑢^2

𝑣 = 𝑆𝑞𝑢𝑎𝑠ℎ(𝑠) = (‖𝑠‖^2 / (1 + ‖𝑠‖^2)) ∙ (𝑠 / ‖𝑠‖)

The coupling coefficients 𝑐_1, 𝑐_2 are not learned by backprop; **c are determined by dynamic routing during the testing stage** (c.f. pooling).

**Dynamic Routing**

Given predictions 𝑢^1, 𝑢^2, 𝑢^3, the coefficients are computed iteratively:

𝑏_1^0 = 0, 𝑏_2^0 = 0, 𝑏_3^0 = 0
For 𝑟 = 1 to T do
  𝑐_1^𝑟, 𝑐_2^𝑟, 𝑐_3^𝑟 = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝑏_1^{𝑟−1}, 𝑏_2^{𝑟−1}, 𝑏_3^{𝑟−1})
  𝑠^𝑟 = 𝑐_1^𝑟𝑢^1 + 𝑐_2^𝑟𝑢^2 + 𝑐_3^𝑟𝑢^3
  𝑎^𝑟 = 𝑆𝑞𝑢𝑎𝑠ℎ(𝑠^𝑟)
  𝑏_𝑖^𝑟 = 𝑏_𝑖^{𝑟−1} + 𝑎^𝑟 ∙ 𝑢^𝑖

The final coefficients are 𝑐_1^𝑇, 𝑐_2^𝑇, 𝑐_3^𝑇, and the capsule output is 𝑣 = 𝑎^𝑇.

Unrolled for T = 3: start from 𝑏_1^0 = 0, 𝑏_2^0 = 0 and 𝑢^1 = 𝑊^1𝑣^1, 𝑢^2 = 𝑊^2𝑣^2; compute 𝑐_1^1, 𝑐_2^1 = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝑏_1^0, 𝑏_2^0), then 𝑠^1 and 𝑎^1 = 𝑆𝑞𝑢𝑎𝑠ℎ(𝑠^1), update 𝑏_𝑖^𝑟 = 𝑏_𝑖^{𝑟−1} + 𝑎^𝑟 ∙ 𝑢^𝑖, and repeat to get 𝑠^2, 𝑎^2 and finally 𝑣 = 𝑎^3. The unrolled computation is like an RNN, and the weights 𝑊^1, 𝑊^2 are also learned by backprop through it.
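The routing loop can be sketched in numpy (a toy illustration of the algorithm above, not the paper's full implementation; `u` holds the prediction vectors 𝑢^𝑖, one per row, and the example values are mine):

```python
import numpy as np

# Squash: v = (||s||^2 / (1 + ||s||^2)) * s / ||s||; the output norm is < 1.
def squash(s):
    norm2 = float(np.dot(s, s))
    return (norm2 / (1.0 + norm2)) * s / (np.sqrt(norm2) + 1e-9)

def dynamic_routing(u, T=3):
    b = np.zeros(u.shape[0])                   # b_i^0 = 0
    for _ in range(T):
        c = np.exp(b - b.max()); c /= c.sum()  # c^r = softmax(b^{r-1})
        s = c @ u                              # s^r = sum_i c_i^r u^i
        a = squash(s)                          # a^r = Squash(s^r)
        b = b + u @ a                          # b_i^r = b_i^{r-1} + a^r . u^i
    return a                                   # v = a^T

u = np.array([[ 1.0, 0.0],
              [ 0.9, 0.1],
              [-1.0, 0.0]])   # the disagreeing prediction gets down-weighted
v = dynamic_routing(u)
```

Predictions that agree with the current output reinforce their own coefficients, which is the "routing by agreement" behavior.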

### Capsule

• A capsule can also be convolutional: simply replace each filter with a capsule.

• Output layer and loss (CapsNet): each output capsule corresponds to a class, with ‖𝑣^1‖ the confidence of "1", ‖𝑣^2‖ the confidence of "2", and so on. In addition, an NN reconstructs the input from the output capsules (the capsule of the correct class is kept, i.e. multiplied by 1, while the others are masked with 0) and is trained to minimize the reconstruction error.

### Experimental Results

• affNIST: each example is an MNIST digit with a random small affine transformation; however, the models were never trained with affine transformations.

• CapsNet achieved 79% accuracy on the affNIST test set; a traditional convolutional model with a similar number of parameters achieved 66%.

### Experimental Results

• Each dimension of a capsule vector contains specific information: perturbing one dimension of 𝑣^1 and passing it through the reconstruction NN (trained to minimize reconstruction error) changes a specific property of the reconstructed digit.

### Experimental Results

• MultiMNIST (overlapping digits). Top: input. Bottom: reconstruction; R denotes the reconstructed digits, L the true labels.

### Discussion

• Invariance vs. Equivariance: an **invariant** NN produces the same output when the input pattern is transformed; an **equivariant** NN produces an output that changes correspondingly with the transformation.

### Discussion

• Invariance vs. Equivariance

Max pooling has invariance, but not equivariance: pooling (3, −1, −3, −1) and (−3, 1, 0, 3) both output 3, so the network does not know the difference between the two inputs. A capsule has both invariance and equivariance: the norms of 𝑣^1 are both large (the pattern exists in both cases), but the vectors themselves can be different. It knows the difference, but does not react to it.

### Dynamic Routing

### To Learn More ……

### • Hinton’s talk:

### https://www.youtube.com/watch?v=rTawFwUvnLE

### • Keras:

• https://github.com/XifengGuo/CapsNet-Keras

### • Tensorflow:

• https://github.com/naturomics/CapsNet-Tensorflow

### • PyTorch

• https://github.com/gram-ai/capsule-networks

• https://github.com/timomernick/pytorch-capsule

• https://github.com/nishnik/CapsNet-PyTorch

### Interesting Facts (?)

### about Deep Learning

### Training stuck because …?

• People believe training gets stuck because the parameters are near a local minimum. But how about saddle points?

http://www.deeplearningbook.org/contents/optimization.html

### Training stuck because …?

• People believe training gets stuck because the parameters are around a critical point, but that is often not the case!

http://www.deeplearningbook.org/contents/optimization.html

### Brute-force Memorization ?

https://arxiv.org/pdf/1611.03530.pdf

Final of 2017 Spring:

https://ntumlds.wordpress.com/2017/03/27/r05922018_drliao/

### Demo

### Brute-force Memorization ?

### • Simple patterns first, then memorize the exceptions

https://arxiv.org/pdf/1706.05394.pdf

### Knowledge Distillation

Knowledge Distillation: https://arxiv.org/pdf/1503.02531.pdf
Do Deep Nets Really Need to be Deep?: https://arxiv.org/pdf/1312.6184.pdf

A Teacher Net (deep) is trained on the training data; its soft output, e.g. "1": 0.7, "7": 0.2, "9": 0.1, becomes the learning target of a Student Net (shallow), which is trained by cross-entropy minimization against those soft labels.
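The student's objective can be sketched as a cross-entropy against the teacher's soft labels (a minimal illustration of the idea above; the function names and example numbers follow the slide's "1": 0.7, "7": 0.2, "9": 0.1):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

# Cross-entropy between the teacher's soft labels and the student's prediction.
def distillation_loss(student_logits, teacher_probs):
    p = softmax(student_logits)
    return -float(np.sum(teacher_probs * np.log(p + 1e-12)))

teacher = np.array([0.7, 0.2, 0.1])                  # soft labels from the teacher
good = distillation_loss(np.log(teacher), teacher)   # student matches the teacher
bad = distillation_loss(np.zeros(3), teacher)        # uniform student prediction
```

Minimizing this loss pushes the student's distribution toward the teacher's; a matching student gives a lower loss than a uniform one.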

### Deep Learning for

### Question Answering

### Question Answering

• Given a document and a query, output an answer

• bAbI: the answer is a word

• https://research.fb.com/downloads/babi/

• SQuAD: the answer is a sequence of words (in the input document)

• https://rajpurkar.github.io/SQuAD-explorer/

• MS MARCO: the answer is a sequence of words

• http://www.msmarco.org

• MovieQA: Multiple choice question (output a number)

• http://movieqa.cs.toronto.edu/home/

• More: https://github.com/dapurv5/awesome-question- answering

### Bi-directional Attention Flow

Demo: http://35.165.153.16:1995

### Dynamic Coattention Networks

### • Experimental Results

DCN+: https://arxiv.org/pdf/1711.00106.pdf

### R-Net

### S-Net

### MS MARCO

**Attention-over-Attention (AoA)** (not on SQuAD)

### Reinforced Mnemonic Reader

**Multiple-hop**

**ReasoNet**

https://arxiv.org/abs/1609.05284