Advanced Tips for Deep Learning
Hung-yi Lee
Prerequisite: https://www.youtube.com/watch?v=xki61j7z-30
[Recipe of deep learning: Step 1: define a set of functions → Step 2: goodness of function → Step 3: pick the best function → Neural Network. Good results on training data? If NO, go back and improve training. If YES: good results on testing data? If NO: overfitting! If YES: done.]
Recipe of Deep Learning
Do not always blame Overfitting
Deep Residual Learning for Image Recognition http://arxiv.org/abs/1512.03385
[Figure from the cited paper: the deeper network has higher error on both the testing data and the training data, so the problem is not overfitting; the deeper network is simply not well trained. Check the results on training data before blaming overfitting for bad testing results.]
Recipe of Deep Learning
Different approaches for different problems.
e.g., dropout is for improving results on testing data (not training data)
Outline
• Batch Normalization
• New Activation Function
• Tuning Hyperparameters
• Interesting facts (?) about deep learning
• Capsule
• New models for QA
Batch Normalization
Sergey Ioffe, Christian Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, 2015
Feature Scaling
[Figure: a = w₁x₁ + w₂x₂ + b, with loss L. If x₁ takes values like 1, 2, … while x₂ takes values like 100, 200, …, the loss surface is elongated: L changes slowly along w₁ and sharply along w₂, which makes gradient descent difficult. After the two features are scaled to a similar range, the loss surface is much closer to circular and training is easier.]
Make different features have the same scaling
Feature Scaling
For data points x¹, x², x³, …, xʳ, …, xᴿ, compute for each dimension i the mean mᵢ and the standard deviation σᵢ, then normalize:

xᵢʳ ← (xᵢʳ − mᵢ) / σᵢ

After this, the means of all dimensions are 0 and the variances are all 1.
22In general, gradient descent converges much faster with feature scaling than without it.
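This per-dimension standardization is a one-liner in practice; here is a minimal NumPy sketch (the function name and the epsilon guard are mine, not from the slides):

```python
import numpy as np

def feature_scaling(X, eps=1e-8):
    """Standardize each feature dimension to zero mean and unit variance.
    X: array of shape (num_examples, num_features)."""
    mean = X.mean(axis=0)          # m_i for each dimension i
    std = X.std(axis=0)            # sigma_i for each dimension i
    return (X - mean) / (std + eps)

# Example: two features on very different scales (1, 2, ... vs. 100, 200, ...)
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_scaled = feature_scaling(X)
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```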
How about Hidden Layer?
[Figure: x → Layer 1 → a¹ → Layer 2 → a² → …. Feature scaling is applied to the input x; should it also be applied to the hidden-layer outputs a¹, a², …?]
The distribution of each layer's input changes whenever the earlier layers are updated (internal covariate shift). A smaller learning rate can be helpful, but then training is slower.
Difficulty: the statistics of the hidden-layer outputs change during training.
Solution: batch normalization.
Batch: the examples x¹, x², x³ are fed through the network together.
z¹ = W¹x¹, z² = W¹x², z³ = W¹x³; then a¹ = sigmoid(z¹), a² = sigmoid(z²), a³ = sigmoid(z³); then W², and so on.
In practice the whole batch is processed in parallel as one matrix operation: [z¹ z² z³] = W¹[x¹ x² x³].
Batch normalization
x¹, x², x³ → W¹ → z¹, z², z³

μ = (1/3) Σᵢ₌₁³ zⁱ        σ = √( (1/3) Σᵢ₌₁³ (zⁱ − μ)² )

μ and σ depend on the zⁱ.
Note: batch normalization cannot be applied to small batches.
Batch normalization
x¹, x², x³ → W¹ → z¹, z², z³ → compute μ, σ →

z̃ⁱ = (zⁱ − μ) / σ

→ sigmoid → a¹, a², a³

μ and σ depend on the zⁱ. How do we do backpropagation? (Gradients must also flow through μ and σ, since they are functions of the zⁱ.)
Batch normalization
x¹, x², x³ → W¹ → z¹, z², z³ → compute μ, σ →

z̃ⁱ = (zⁱ − μ) / σ        ẑⁱ = γ ⊙ z̃ⁱ + β

μ and σ depend on the zⁱ; γ and β are learned parameters.
Batch normalization
• At testing stage:
x → W¹ → z,   z̃ = (z − μ) / σ,   ẑ = γ ⊙ z̃ + β
μ and σ come from the batch; γ and β are network parameters. We do not have a batch at the testing stage.
Ideal solution: compute μ and σ using the whole training dataset.
Practical solution: compute a moving average of the μ and σ of the batches during training.
[Figure: accuracy vs. number of updates; the batch statistics μ¹, μ¹⁰⁰, μ³⁰⁰, … computed later in training (when accuracy is higher) are more representative.]
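A minimal NumPy sketch of the computation above, including the moving-average trick used at test time (the function names, the momentum value, and the epsilon guard are my choices; framework implementations differ in details such as tracking the running variance per channel):

```python
import numpy as np

def batch_norm_train(Z, gamma, beta, running_mu, running_sigma,
                     momentum=0.99, eps=1e-8):
    """Z: (batch_size, num_units) pre-activations z^i for one batch."""
    mu = Z.mean(axis=0)                      # mean over the batch
    sigma = Z.std(axis=0)                    # std over the batch
    Z_tilde = (Z - mu) / (sigma + eps)       # normalize
    Z_hat = gamma * Z_tilde + beta           # learned scale and shift
    # moving averages kept for test time (the "practical solution")
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_sigma = momentum * running_sigma + (1 - momentum) * sigma
    return Z_hat, running_mu, running_sigma

def batch_norm_test(z, gamma, beta, running_mu, running_sigma, eps=1e-8):
    """At test time there is no batch, so use the moving averages instead."""
    z_tilde = (z - running_mu) / (running_sigma + eps)
    return gamma * z_tilde + beta
```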
Batch normalization - Benefit
• BN reduces training time and makes very deep networks trainable.
• Because there is less internal covariate shift, we can use larger learning rates.
• Less exploding/vanishing gradients
• Especially effective for sigmoid, tanh, etc.
• Learning is less affected by initialization.
• BN reduces the demand for regularization.
z̃ⁱ = (zⁱ − μ) / σ        ẑⁱ = γ ⊙ z̃ⁱ + β
[Figure: if W¹ is multiplied by k, then the zⁱ and also μ and σ are multiplied by k, so z̃ⁱ (and therefore ẑⁱ) keeps the same value. This is why learning is less sensitive to the scale of the weights and to initialization.]
Demo
Activation Function
Günter Klambauer, Thomas Unterthiner, Andreas Mayr, Sepp Hochreiter,
"Self-Normalizing Neural Networks", NIPS, 2017
ReLU
• Rectified Linear Unit (ReLU)
Reasons:
1. Fast to compute
2. Biological reason
3. Equivalent to an infinite number of sigmoids with different biases
4. Handles the vanishing gradient problem
[Plot: a = z for z > 0, a = 0 for z ≤ 0, compared with the sigmoid σ(z).]
[Xavier Glorot, AISTATS’11]
[Andrew L. Maas, ICML’13]
[Kaiming He, arXiv’15]
ReLU - variant
Leaky ReLU:      a = z for z > 0,  a = 0.01z for z ≤ 0
Parametric ReLU:  a = z for z > 0,  a = αz for z ≤ 0, where α is also learned by gradient descent
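A minimal sketch of these two variants (α is just a function argument here; for Parametric ReLU it would be a trainable parameter updated by gradient descent):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """a = z for z > 0, a = alpha * z otherwise.
    alpha is fixed (e.g. 0.01) for Leaky ReLU and learned for Parametric ReLU."""
    return np.where(z > 0, z, alpha * z)
```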
ReLU - variant
Exponential Linear Unit (ELU):  a = z for z > 0,  a = α(eᶻ − 1) for z ≤ 0
Randomized ReLU:  a = z for z > 0,  a = αz for z ≤ 0, where α is sampled from a distribution during training and fixed during testing (similar in spirit to dropout?)
ReLU - variant
Exponential Linear Unit (ELU):  a = z for z > 0,  a = α(eᶻ − 1) for z ≤ 0
Scaled ELU (SELU): multiply both pieces by λ:  a = λz for z > 0,  a = λα(eᶻ − 1) for z ≤ 0
α = 1.6732632423543772848170429916717
λ = 1.0507009873554804934193349852946
https://github.com/bioinf-jku/SNNs
SELU
a = λz for z > 0,  a = λα(eᶻ − 1) for z ≤ 0,  with α = 1.673263242…, λ = 1.050700987…
• Positive and negative values: the whole ReLU family has this property except the original ReLU.
• Saturation region: ELU also has this property.
• Slope larger than 1: only SELU has this property.
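A minimal sketch of SELU with the constants above (pure NumPy; deep learning frameworks ship this as a built-in activation):

```python
import numpy as np

ALPHA = 1.6732632423543772848170429916717
LAMBDA = 1.0507009873554804934193349852946

def selu(z):
    """lambda * z for z > 0, lambda * alpha * (exp(z) - 1) for z <= 0."""
    return LAMBDA * np.where(z > 0, z, ALPHA * (np.exp(z) - 1.0))
```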
SELU
[Figure: a neuron with inputs a₁, …, a_K, weights w₁, …, w_K, z = Σₖ₌₁ᴷ wₖaₖ, output a = f(z).]
The inputs are i.i.d. random variables with mean μ and variance σ² (they do not have to be Gaussian).

μ_z = E[z] = Σₖ₌₁ᴷ E[aₖ]wₖ = μ Σₖ₌₁ᴷ wₖ = μ ∙ Kμ_w = 0

(assuming the inputs have μ = 0 and the weights have Kμ_w = 0)
SELU
[Same setting: z = Σₖ₌₁ᴷ wₖaₖ, a = f(z); the inputs are i.i.d. random variables with mean μ and variance σ².]
μ_z = 0

σ_z² = E[(z − μ_z)²] = E[z²] = E[(a₁w₁ + a₂w₂ + ⋯)²]

Cross terms vanish:  E[aᵢaⱼwᵢwⱼ] = wᵢwⱼE[aᵢ]E[aⱼ] = 0  (since μ = 0)
Square terms:  E[(aₖwₖ)²] = wₖ²E[aₖ²] = wₖ²σ²

σ_z² = Σₖ₌₁ᴷ wₖ²σ² = σ² ∙ Kσ_w² = 1

Target: with μ = 0, σ = 1 for the inputs and μ_w = 0, Kσ_w² = 1 for the weights, z has mean 0 and variance 1. Assuming z is Gaussian, SELU then keeps the output a = f(z) at mean 0 and variance 1 as well.
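The self-normalizing property above can be checked numerically: applying SELU to standard-normal samples should give outputs with mean ≈ 0 and variance ≈ 1. A small sketch (sample size and seed are arbitrary choices):

```python
import numpy as np

ALPHA = 1.6732632423543772848170429916717
LAMBDA = 1.0507009873554804934193349852946
selu = lambda z: LAMBDA * np.where(z > 0, z, ALPHA * (np.exp(z) - 1.0))

np.random.seed(0)
z = np.random.randn(1_000_000)   # z ~ N(0, 1), as assumed in the derivation
a = selu(z)
print(a.mean(), a.var())         # both come out close to 0 and 1
```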
Demo
Source of joke: https://zhuanlan.zhihu.com/p/27336839
"The 93-page proof" (the appendix of the SELU paper); SELU is actually more general.
• "The latest activation neuron: Self-Normalizing Neural Network (SELU)"
[Experimental results on MNIST and CIFAR-10.]
Demo
Swish: a = z ∙ sigmoid(βz)
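A minimal sketch of the Swish activation shown above (β = 1 is a common default; in some variants β is a learned parameter):

```python
import numpy as np

def swish(z, beta=1.0):
    """Swish: z * sigmoid(beta * z)."""
    return z / (1.0 + np.exp(-beta * z))
```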
Hyperparameters
Thanks to 沈昇勳 for providing the figure. Source of image: https://medium.com/intuitionmachine/the-brute-force-method-of-deep-learning-innovation-58b497323ae5 (Denny Britz's graphic)
Grid Search vs. Random Search
http://www.deeplearningbook.org/contents/guidelines.html
[Figure: sampling hyperparameter configurations (e.g., layer depth × layer width) on a grid vs. at random.]
Assumption: the top K results are good enough.
If there are N points, a single random sample lands in the top K with probability K/N. After sampling x times, the probability of hitting the top K at least once is 1 − (1 − K/N)ˣ > 90%.
If N = 1000 and K = 10, then x = 230; if K = 100, then x = 22.
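A small script that reproduces these numbers by solving for the smallest x with 1 − (1 − K/N)ˣ > 0.9 (the function name is mine):

```python
import math

def samples_needed(N, K, prob=0.9):
    """Smallest x such that 1 - (1 - K/N)**x > prob."""
    return math.ceil(math.log(1 - prob) / math.log(1 - K / N))

print(samples_needed(1000, 10))   # 230
print(samples_needed(1000, 100))  # 22
```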
Model-based Hyperparameter Optimization
https://cloud.google.com/blog/big-data/2017/08/hyperparameter-tuning-in-cloud-machine-learning-engine-using-bayesian-optimization
Reinforcement Learning
A controller designs a network → the network is trained → its accuracy is used as the reward. It can design LSTM-like cells, as shown in the previous lecture. This is one kind of meta learning (learning to learn).
[Figures: this kind of search also produced the SWISH activation and learning-rate schedules that can transfer to new tasks.]
Capsule
Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton, “Dynamic Routing Between Capsules”, NIPS, 2017
Capsule
• A neuron outputs a value; a capsule outputs a vector.
[Figure: a capsule takes vectors v¹, v² from lower-level capsules and outputs a vector v.]
A neuron detects one specific pattern (e.g., neuron A and neuron B each detect one type of pattern). A capsule also detects one type of pattern, but:
• Each dimension of v represents a characteristic of the pattern (e.g., two variants of the same pattern might give 1.0 vs. −1.0 in one dimension).
• The norm of v represents the existence of the pattern.
Capsule
[Figure: the inputs v¹, v² from lower-level capsules are transformed and combined into the capsule output v.]

u¹ = W¹v¹    u² = W²v²    s = c₁u¹ + c₂u²    v = Squash(s) = (‖s‖² / (1 + ‖s‖²)) ∙ (s / ‖s‖)

c₁, c₂ are the coupling coefficients. They are not learned by backprop; they are determined by dynamic routing, even during the testing stage (c.f. pooling).
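A minimal NumPy sketch of the squashing function above (the epsilon guard is mine):

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squash(s) = (||s||^2 / (1 + ||s||^2)) * (s / ||s||)."""
    norm_sq = np.sum(s * s)
    return (norm_sq / (1.0 + norm_sq)) * (s / (np.sqrt(norm_sq) + eps))
```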
Dynamic Routing
[Figure: u¹, u², u³ are combined with unknown coefficients into s; Squash(s) gives v. The coefficients are found by dynamic routing.]
b₁⁰ = 0, b₂⁰ = 0, b₃⁰ = 0
For r = 1 to T do:
    c₁ʳ, c₂ʳ, c₃ʳ = softmax(b₁ʳ⁻¹, b₂ʳ⁻¹, b₃ʳ⁻¹)
    sʳ = c₁ʳu¹ + c₂ʳu² + c₃ʳu³
    aʳ = Squash(sʳ)
    bᵢʳ = bᵢʳ⁻¹ + aʳ ∙ uⁱ
The final output is v = aᵀ, computed with the final coupling coefficients c₁ᵀ, c₂ᵀ, c₃ᵀ.
[Figure: the routing iterations unrolled for T = 3, like an RNN; the whole procedure is differentiable, so the transformation matrices W are still learned by backprop through it.]
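A minimal NumPy sketch of the routing loop above for a single output capsule receiving u¹, u², u³. It follows the version written on the slide (a softmax over this capsule's inputs); the paper's full algorithm normalizes over output capsules instead:

```python
import numpy as np

def squash(s, eps=1e-8):
    norm_sq = np.sum(s * s)
    return (norm_sq / (1.0 + norm_sq)) * (s / (np.sqrt(norm_sq) + eps))

def dynamic_routing(U, T=3):
    """U has shape (num_inputs, dim); its rows are u^1, u^2, u^3, ..."""
    b = np.zeros(U.shape[0])                   # b_i^0 = 0
    for _ in range(T):
        c = np.exp(b - b.max()); c /= c.sum()  # softmax -> coupling coefficients c_i^r
        s = (c[:, None] * U).sum(axis=0)       # s^r = sum_i c_i^r u^i
        a = squash(s)                          # a^r = Squash(s^r)
        b = b + U @ a                          # b_i^r = b_i^{r-1} + a^r . u^i
    return a                                   # v = a^T

# usage sketch: v = dynamic_routing(np.random.randn(3, 8), T=3)
```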
Capsule
• Capsule can also be convolutional.
• Simply replace filter with capsule
• Output layer and loss
[Figure: CapsNet. The output layer has one capsule per class; ‖v¹‖ is the confidence of "1", ‖v²‖ is the confidence of "2", and so on. The output capsules are also fed into an NN decoder (with the non-target capsules masked out: ×1 for the target class, ×0 for the others) that is trained to minimize the reconstruction error of the input image.]
Experimental Results
• MNIST
• Each example is an MNIST digit with a random small affine transformation.
• However, the models were never trained with affine transformations.
• CapsNet achieved 79% accuracy on the affNIST test set.
• A traditional convolutional model with a similar number of parameters achieved 66%.
Experimental Results
• Each dimension contains specific information.
[Figure: perturbing one dimension of v¹ at a time and reconstructing with the decoder NN (trained to minimize reconstruction error) shows what each dimension encodes.]
Experimental Results
• MultiMNIST
[Figure: top row: input images; bottom row: reconstructions. R: reconstructed digits, L: true labels.]
Discussion
• Invariance vs. Equivariance
[Figure: the same NN applied to transformed inputs. Invariance: the output does not change. Equivariance: the output changes in a corresponding way.]
Discussion
• Invariance vs. Equivariance
Max pooling has invariance, but not equivariance.
Capsules have both invariance and equivariance.
[Example: max pooling over (3, −1, −3, −1) and over (−3, 1, 0, 3) both output 3, so it cannot tell the two inputs apart ("I don't know the difference"). A capsule outputs v¹ with a large norm in both cases, but the vectors themselves can be different: "I know the difference, but I do not react to it." This behaviour comes from dynamic routing.]
To Learn More ……
• Hinton’s talk:
https://www.youtube.com/watch?v=rTawFwUvnLE
• Keras:
• https://github.com/XifengGuo/CapsNet-Keras
• TensorFlow:
• https://github.com/naturomics/CapsNet-Tensorflow
• PyTorch:
• https://github.com/gram-ai/capsule-networks
• https://github.com/timomernick/pytorch-capsule
• https://github.com/nishnik/CapsNet-PyTorch
Interesting Facts (?) about Deep Learning
Training stuck because … ?
• People believe training gets stuck because the parameters are near a local minimum.
[Figure: loss surface with a local minimum.]
How about saddle points?
http://www.deeplearningbook.org/contents/optimization.html
Training stuck because … ?
• People believe training gets stuck because the parameters are near a critical point. !!!
http://www.deeplearningbook.org/contents/optimization.html
Brute-force Memorization ?
https://arxiv.org/pdf/1611.03530.pdf
Final of 2017 Spring:
https://ntumlds.wordpress.com/2017/03/27/r05922018_drliao/
Demo
Brute-force Memorization ?
• Networks fit simple patterns first, then memorize the exceptions.
https://arxiv.org/pdf/1706.05394.pdf
Knowledge Distillation
Knowledge Distillation
https://arxiv.org/pdf/1503.02531.pdf Do Deep Nets Really Need to be Deep?
https://arxiv.org/pdf/1312.6184.pdf
[Figure: a deep Teacher Net is trained on the training data and produces soft outputs, e.g. "1": 0.7, "7": 0.2, "9": 0.1. A shallow Student Net takes the teacher's soft outputs as its learning target and is trained by cross-entropy minimization on the same training data.]
https://arxiv.org/pdf/1312.6184.pdf
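A minimal sketch of the distillation objective described above: the student is trained with cross-entropy against the teacher's soft outputs (pure NumPy, single example; the temperature knob follows the cited Hinton et al. paper, and the names are mine):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft targets and the student's outputs."""
    soft_targets = softmax(teacher_logits, temperature)   # e.g. "1": 0.7, "7": 0.2, "9": 0.1
    student_probs = softmax(student_logits, temperature)
    return -np.sum(soft_targets * np.log(student_probs + 1e-12))
```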
Deep Learning for Question Answering
Question Answering
• Given a document and a query, output an answer
• bAbI: the answer is a word
• https://research.fb.com/downloads/babi/
• SQuAD: the answer is a sequence of words (in the input document)
• https://rajpurkar.github.io/SQuAD-explorer/
• MS MARCO: the answer is a sequence of words
• http://www.msmarco.org
• MovieQA: Multiple choice question (output a number)
• http://movieqa.cs.toronto.edu/home/
• More: https://github.com/dapurv5/awesome-question-answering
Bi-directional Attention Flow
Demo: http://35.165.153.16:1995
Dynamic Coattention Networks
• Experimental Results
DCN+: https://arxiv.org/pdf/1711.00106.pdf
R-Net
S-Net
MS MARCO
Attention-over-Attention (AoA)
(not on SQuAD)
Reinforced Mnemonic Reader
Multiple-hop
ReasoNet
https://arxiv.org/abs/1609.05284