(1)

Advanced Tips for Deep Learning

Hung-yi Lee

Prerequisite: https://www.youtube.com/watch?v=xki61j7z-30

(2)

Recipe of Deep Learning

Step 1: define a set of functions → Step 2: goodness of function → Step 3: pick the best function → Neural Network

Good results on training data? (YES / NO)
Good results on testing data? (YES / NO → Overfitting!)

(3)

Do not always blame Overfitting

Deep Residual Learning for Image Recognition, http://arxiv.org/abs/1512.03385

[Figure: error curves on testing data and training data. The deeper network has higher error on the testing data (overfitting?), but it also has higher error on the training data: it is simply not well trained.]

(4)

Recipe of Deep Learning

Neural Network → Good results on training data? YES → Good results on testing data? YES

Different approaches are used for different problems, e.g. dropout is for getting good results on testing data.

(5)

Outline

• Batch Normalization

• New Activation Function

• Tuning Hyperparameters

• Interesting facts (?) about deep learning

• Capsule

• New models for QA

(6)

Batch Normalization

Sergey Ioffe, Christian Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, 2015

(7)

Feature Scaling

[Figure: a neuron computes a = w1·x1 + w2·x2 + b and feeds into the loss L. If x1 takes values like 1, 2, … while x2 takes values like 100, 200, …, the loss surface over (w1, w2) is a narrow ellipse; after rescaling x2 to the same range (1, 2, …), the loss surface over (w1, w2) becomes much rounder and easier to optimize.]

Make different features have the same scaling

(8)

Feature Scaling

Given training examples x^1, x^2, x^3, …, x^r, …, x^R, consider each dimension i separately: compute the mean m_i and the standard deviation σ_i of that dimension over all R examples, then normalize

x_i^r ← (x_i^r − m_i) / σ_i

After this step, the mean of every dimension is 0 and the variance of every dimension is 1.

In general, gradient descent converges much faster with feature scaling than without it.
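A minimal sketch of this standardization step, assuming the data is stored as a NumPy array of shape (R, d) with one example per row (the function and variable names are illustrative, not from the slides):

import numpy as np

def feature_scaling(x):
    # x: array of shape (R, d), one training example x^r per row
    m = x.mean(axis=0)            # per-dimension mean m_i
    sigma = x.std(axis=0) + 1e-8  # per-dimension standard deviation sigma_i (epsilon avoids division by zero)
    return (x - m) / sigma

# Example: two features with very different ranges (1, 2, ... vs. 100, 200, ...)
x = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
x_scaled = feature_scaling(x)
print(x_scaled.mean(axis=0))  # approximately [0, 0]
print(x_scaled.std(axis=0))   # approximately [1, 1]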

(9)

How about Hidden Layer?

[Figure: x^1 → Layer 1 → a^1 → Layer 2 → a^2 → …  We apply feature scaling to the input x^1; should we also apply feature scaling to the hidden-layer outputs a^1, a^2, …?]

A smaller learning rate can be helpful, but training would then be slower.

Difficulty: the statistics of the hidden-layer outputs change during training (internal covariate shift).

Solution: batch normalization.

(10)

Batch

[Figure: a batch of three examples x^1, x^2, x^3 is passed through the same weights W^1 to get z^1, z^2, z^3, which go through the sigmoid to give a^1, a^2, a^3. In matrix form, [z^1 z^2 z^3] = W^1 [x^1 x^2 x^3], so the whole batch can be processed with a single matrix multiplication.]

(11)

Batch normalization

[Figure: x^1, x^2, x^3 → W^1 → z^1, z^2, z^3; compute μ and σ over the batch.]

μ = (1/3) Σ_{i=1}^{3} z^i
σ = sqrt( (1/3) Σ_{i=1}^{3} (z^i − μ)^2 )

Note that μ and σ depend on z^i.

Note: batch normalization cannot be applied with a small batch size.

(12)

Batch normalization

[Figure: x^1, x^2, x^3 → W^1 → z^1, z^2, z^3 → normalize → z̃^1, z̃^2, z̃^3 → sigmoid → a^1, a^2, a^3.]

z̃^i = (z^i − μ) / σ

μ and σ depend on z^i. How do we do backpropagation through this normalization?

(13)

Batch normalization

[Figure: x^1, x^2, x^3 → W^1 → z^1, z^2, z^3 → normalize → z̃^1, z̃^2, z̃^3 → scale and shift with γ and β → ẑ^1, ẑ^2, ẑ^3.]

z̃^i = (z^i − μ) / σ
ẑ^i = γ ⊙ z̃^i + β

μ and σ depend on z^i.

(14)

Batch normalization

• At testing stage:

x → W^1 → z,   z̃ = (z − μ) / σ,   ẑ = γ ⊙ z̃ + β

μ and σ are computed from the batch; γ and β are network parameters.

We do not have a batch at the testing stage.

Ideal solution: compute μ and σ over the whole training dataset.

Practical solution: keep a moving average of the μ and σ of the batches seen during training.

[Figure: the accumulated moving average of μ over the updates: μ^1, …, μ^100, …, μ^300, …]
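A minimal NumPy sketch of the idea on these two slides, for a single layer and a simple exponential moving average (the momentum value and all names are illustrative assumptions, not from the slides):

import numpy as np

class BatchNorm:
    def __init__(self, dim, momentum=0.99, eps=1e-5):
        self.gamma = np.ones(dim)      # learnable scale gamma
        self.beta = np.zeros(dim)      # learnable shift beta
        self.mu_avg = np.zeros(dim)    # moving average of batch means
        self.var_avg = np.ones(dim)    # moving average of batch variances
        self.momentum, self.eps = momentum, eps

    def __call__(self, z, training):
        if training:
            mu = z.mean(axis=0)        # mean over the batch
            var = z.var(axis=0)        # variance over the batch
            self.mu_avg = self.momentum * self.mu_avg + (1 - self.momentum) * mu
            self.var_avg = self.momentum * self.var_avg + (1 - self.momentum) * var
        else:
            mu, var = self.mu_avg, self.var_avg   # no batch at test time: use moving averages
        z_tilde = (z - mu) / np.sqrt(var + self.eps)
        return self.gamma * z_tilde + self.beta   # z_hat = gamma * z_tilde + beta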

(15)

Batch normalization - Benefit

• BN reduces training time and makes very deep networks trainable.

• Because there is less covariate shift, we can use larger learning rates.

• Less exploding/vanishing gradients
  • Especially effective for sigmoid, tanh, etc.

• Learning is less affected by initialization.

• BN reduces the demand for regularization.

[Figure: x^i → W^1 → z^i → z̃^i → ẑ^i, with z̃^i = (z^i − μ)/σ and ẑ^i = γ ⊙ z̃^i + β. If W^1 is multiplied by k, then z^i, μ, and σ are all multiplied by k, so z̃^i is kept unchanged; this is why initialization matters less.]

(16)
(17)

Demo

(18)

Activation Function

Günter Klambauer, Thomas Unterthiner, Andreas Mayr, Sepp Hochreiter, “Self-Normalizing Neural Networks”, NIPS, 2017

(19)

ReLU

• Rectified Linear Unit (ReLU)

Reasons to use it:
1. Fast to compute
2. Biological reason
3. It is like an infinite number of sigmoids with different biases
4. It alleviates the vanishing gradient problem

[Figure: the ReLU curve, a = z for z > 0 and a = 0 for z ≤ 0, compared with the sigmoid σ(z).]

[Xavier Glorot, AISTATS’11]

[Andrew L. Maas, ICML’13]

[Kaiming He, arXiv’15]

(20)

ReLU - variant

Leaky ReLU: a = z for z > 0, a = 0.01z for z ≤ 0.

Parametric ReLU: a = z for z > 0, a = αz for z ≤ 0, where α is also learned by gradient descent.

(21)

ReLU - variant

Exponential Linear Unit (ELU): a = z for z > 0, a = α(e^z − 1) for z ≤ 0.

Randomized ReLU: a = z for z > 0, a = αz for z ≤ 0, where α is sampled from a distribution during training and fixed during testing. Dropout???

(22)

ReLU - variant

Exponential Linear Unit (ELU): a = z for z > 0, a = α(e^z − 1) for z ≤ 0.

Scaled ELU (SELU): multiply both pieces by λ, i.e. a = λz for z > 0, a = λα(e^z − 1) for z ≤ 0, with

α = 1.6732632423543772848170429916717
λ = 1.0507009873554804934193349852946

https://github.com/bioinf-jku/SNNs
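A minimal sketch of the SELU function with the constants above (plain NumPy; not the reference implementation from the linked repository):

import numpy as np

ALPHA = 1.6732632423543772848170429916717
LAMBDA = 1.0507009873554804934193349852946

def selu(z):
    # a = lambda * z               for z > 0
    # a = lambda * alpha*(e^z - 1) for z <= 0
    return LAMBDA * np.where(z > 0, z, ALPHA * (np.exp(z) - 1))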

(23)

SELU

a = λz for z > 0
a = λα(e^z − 1) for z ≤ 0
α = 1.673263242…, λ = 1.050700987…

• Positive and negative values: the whole ReLU family has this property except the original ReLU.
• Saturation region: ELU also has this property.
• Slope larger than 1: only SELU has this property.

(24)

SELU

[Figure: a neuron with inputs a_1, …, a_k, …, a_K and weights w_1, …, w_k, …, w_K computes z = Σ_{k=1}^{K} w_k a_k and outputs a = f(z).]

The inputs are i.i.d. random variables with mean μ and variance σ² (they do not have to be Gaussian).

μ_z = E[z] = Σ_{k=1}^{K} E[a_k] w_k = μ Σ_{k=1}^{K} w_k = μ ∙ K μ_w

With μ = 0 (and weights chosen so that K μ_w = 0), we get μ_z = 0.

(25)

SELU

[Figure: the same neuron, z = Σ_{k=1}^{K} w_k a_k, a = f(z).]

The inputs are i.i.d. random variables with mean μ and variance σ². Target: μ = 0, σ = 1; weights with μ_w = 0.

μ_z = 0 (from the previous slide)

σ_z² = E[(z − μ_z)²] = E[z²] = E[(a_1 w_1 + a_2 w_2 + ⋯)²]

Since E[(a_k w_k)²] = w_k² E[a_k²] = w_k² σ² and E[a_i a_j w_i w_j] = w_i w_j E[a_i] E[a_j] = 0,

σ_z² = Σ_{k=1}^{K} w_k² σ² = σ² ∙ K σ_w² = 1   when σ = 1 and K σ_w² = 1.

(To analyse the output a = f(z) after the activation, z is further assumed to be Gaussian.)
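A small numerical check of this self-normalizing behaviour, under the assumptions above (inputs with mean 0 and variance 1, weights with μ_w ≈ 0 and K·σ_w² = 1); the layer sizes and seed are arbitrary illustrative choices:

import numpy as np

ALPHA = 1.6732632423543772848170429916717
LAMBDA = 1.0507009873554804934193349852946

def selu(z):
    return LAMBDA * np.where(z > 0, z, ALPHA * (np.exp(z) - 1))

rng = np.random.default_rng(0)
K = 500
a = rng.standard_normal((10000, K))               # inputs: i.i.d. with mean 0, variance 1
for layer in range(10):
    W = rng.standard_normal((K, K)) / np.sqrt(K)  # mu_w ~ 0, K * sigma_w^2 = 1
    a = selu(a @ W)
    print(layer, round(a.mean(), 3), round(a.var(), 3))  # stays close to mean 0, variance 1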

(26)

Demo

(27)

Source of joke: https://zhuanlan.zhihu.com/p/27336839

The proof in the paper is 93 pages long.

SELU is actually more general.

(28)

The newest activation function: Self-Normalizing Neural Network (SELU)

[Figure: experimental results on MNIST and CIFAR-10.]

(29)

Demo

(30)

a = z ∙ sigmoid(βz)
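A one-line sketch of this activation (Swish), with β treated here as a fixed constant, although it can also be a learnable parameter:

import numpy as np

def swish(z, beta=1.0):
    # a = z * sigmoid(beta * z)
    return z / (1.0 + np.exp(-beta * z))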

(31)

Hyperparameters

(32)

Thanks to 沈昇勳 for providing the image. Source of image: https://medium.com/intuitionmachine/the-brute-force-method-of-deep-learning-innovation-58b497323ae5 (Denny Britz’s graphic)

(33)

Grid Search v.s. Random Search

http://www.deeplearningbook.org/contents/guidelines.html

[Figure: hyperparameter configurations plotted over layer depth vs. layer width, one panel for grid search and one for random search.]

Assumption: any of the top K results is good enough.

If there are N points in total, one random sample lands in the top K with probability K/N. After sampling x times, the probability of hitting the top K at least once is 1 − (1 − K/N)^x, and we want this to exceed 90%.

For N = 1000: K = 10 requires x = 230; K = 100 requires x = 22.
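A quick sanity check of these numbers, solving 1 − (1 − K/N)^x > 0.9 for x (plain Python; nothing here beyond the formula on the slide):

import math

def samples_needed(N, K, p=0.9):
    # smallest x with 1 - (1 - K/N)**x > p
    return math.ceil(math.log(1 - p) / math.log(1 - K / N))

print(samples_needed(1000, 10))   # 230
print(samples_needed(1000, 100))  # 22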

(34)

Model-based Hyperparameter Optimization

https://cloud.google.com/blog/big-data/2017/08/hyperparameter-tuning-in-cloud-machine-learning-engine-using-bayesian-optimization

(35)

Reinforcement Learning

[Figure: a controller designs a network → the network is trained → its accuracy is used as the reward.]

It can design LSTM-like cells, as shown in the previous lecture.

This is one kind of meta learning (learning to learn).

(36)

SWISH ...

(37)

SWISH ...

(38)

Learning Rate

(39)

Can transfer to new tasks

(40)

Capsule

Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton, “Dynamic Routing Between Capsules”, NIPS, 2017

(41)

Capsule

• Neuron: outputs a value. Capsule: outputs a vector v.

[Figure: neurons A and B each output a scalar; a capsule outputs a vector, e.g. v^1 or v^2, whose entries (such as 1.0 and −1.0) encode characteristics of the detected pattern.]

• A neuron detects one specific pattern; a capsule also detects one type of pattern.
• Each dimension of v represents a characteristic of the pattern.
• The norm of v represents the existence of the pattern.

(42)

Capsule

[Figure: a capsule takes the vectors v^1, v^2 from the capsules below, transforms and combines them, and squashes the result to produce its output v.]

u^1 = W^1 v^1,  u^2 = W^2 v^2
s = c_1 u^1 + c_2 u^2
v = Squash(s) = (‖s‖² / (1 + ‖s‖²)) ∙ (s / ‖s‖)

The coupling coefficients c_1, c_2 are determined by dynamic routing during the testing stage (c.f. pooling).
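A minimal sketch of one capsule's forward computation under the formulas above, with the coupling coefficients simply passed in (in the real model they come from dynamic routing); the shapes and names are illustrative:

import numpy as np

def squash(s):
    # v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||)
    norm = np.linalg.norm(s)
    return (norm**2 / (1.0 + norm**2)) * (s / (norm + 1e-8))

def capsule_forward(v_in, W, c):
    # v_in: input capsule vectors v^i; W: one matrix W^i per input; c: coupling coefficients c_i
    u = [W_i @ v_i for W_i, v_i in zip(W, v_in)]   # u^i = W^i v^i
    s = sum(c_i * u_i for c_i, u_i in zip(c, u))   # s = sum_i c_i u^i
    return squash(s)                               # v = Squash(s)

# Example with two 4-dimensional input capsules and an 8-dimensional output capsule
rng = np.random.default_rng(0)
v1, v2 = rng.standard_normal(4), rng.standard_normal(4)
W1, W2 = rng.standard_normal((8, 4)), rng.standard_normal((8, 4))
v = capsule_forward([v1, v2], [W1, W2], c=[0.5, 0.5])
print(np.linalg.norm(v))  # always < 1: the norm encodes existence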

(43)

Dynamic Routing

[Figure: u^1, u^2, u^3 are combined with weights yet to be determined into s, which is squashed to give v.]

b_1^0 = 0, b_2^0 = 0, b_3^0 = 0
For r = 1 to T do:
    c_1^r, c_2^r, c_3^r = softmax(b_1^{r−1}, b_2^{r−1}, b_3^{r−1})
    s^r = c_1^r u^1 + c_2^r u^2 + c_3^r u^3
    a^r = Squash(s^r)
    b_i^r = b_i^{r−1} + a^r ∙ u^i
The coefficients actually used are c_1^T, c_2^T, c_3^T, and the output is v = a^T.
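A minimal sketch of the routing loop above (NumPy; u holds the transformed vectors u^i, T is the number of routing iterations; names are illustrative):

import numpy as np

def squash(s):
    norm = np.linalg.norm(s)
    return (norm**2 / (1.0 + norm**2)) * (s / (norm + 1e-8))

def dynamic_routing(u, T=3):
    # u: array of shape (num_inputs, dim), holding u^1 ... u^K
    b = np.zeros(len(u))                    # b_i^0 = 0
    for r in range(T):
        c = np.exp(b) / np.exp(b).sum()     # c^r = softmax(b^{r-1})
        s = (c[:, None] * u).sum(axis=0)    # s^r = sum_i c_i^r u^i
        a = squash(s)                       # a^r = Squash(s^r)
        b = b + u @ a                       # b_i^r = b_i^{r-1} + a^r . u^i
    return a                                # output v = a^T

u = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
print(dynamic_routing(u, T=3))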

(44)

[Figure: the routing iterations unrolled for T = 3. Starting from b_1^0 = 0, b_2^0 = 0, compute c_1^1, c_2^1 = softmax(b_1^0, b_2^0), form s^1 = c_1^1 u^1 + c_2^1 u^2 (with u^1 = W^1 v^1, u^2 = W^2 v^2), squash it to get a^1, update b_i^1 = b_i^0 + a^1 ∙ u^i, and repeat for the second and third iterations; the final a^3 is the capsule output. The unrolled computation is like an RNN, and the transformations W^1, W^2 are also learned by backprop.]

(45)

Capsule

• Capsules can also be convolutional: simply replace each filter with a capsule.
• Output layer and loss:

[Figure: a CapsNet outputs one capsule per class, v^1, v^2, …; the norm of v^1 is the confidence of “1”, the norm of v^2 is the confidence of “2”. The capsule of the target class (kept with ×1, the others masked with ×0) is fed to an NN that reconstructs the input, and the reconstruction error is minimized as part of the loss.]

(46)

Experimental Results

• MNIST

• Each test example is an MNIST digit with a small random affine transformation.

• However, the models were never trained with affine transformations.

• CapsNet achieved 79% accuracy on the affNIST test set.

• A traditional convolutional model with a similar number of parameters achieved 66%.

(47)

Experimental Results

• Each dimension contains specific information.

[Figure: v^1 → NN → minimize reconstruction error.]

(48)

Experimental Results

• MultiMNIST

Top: input. Bottom: reconstruction. R: reconstructed digits, L: true labels.

(49)

Discussion

• Invariance vs. Equivariance

[Figure: two pairs of networks illustrating invariance (different inputs, same output) and equivariance (the output changes along with the input).]

(50)

Discussion

• Invariance vs. Equivariance

Max pooling has invariance, but not equivariance: max-pooling [3, −1, −3, −1] and [−3, 1, 0, 3] both give 3. “I don’t know the difference.”

A capsule has both invariance and equivariance: for the two inputs, the norms of the output v^1 are both large (invariance), but the vectors themselves can be different (equivariance). “I know the difference, but I do not react to it.”

(51)

Dynamic Routing

(52)

To Learn More ……

• Hinton’s talk:

https://www.youtube.com/watch?v=rTawFwUvnLE

• Keras:

• https://github.com/XifengGuo/CapsNet-Keras

• Tensorflow:

• https://github.com/naturomics/CapsNet-Tensorflow

• PyTorch:

• https://github.com/gram-ai/capsule-networks

• https://github.com/timomernick/pytorch-capsule

• https://github.com/nishnik/CapsNet-PyTorch

(53)

Interesting Facts (?)

about Deep Learning

(54)

Training stuck because … ?

• People believe training gets stuck because the parameters are near a local minimum.

How about saddle points?

http://www.deeplearningbook.org/contents/optimization.html

(55)

Training stuck because … ?

• People believe training gets stuck because the parameters are around a critical point. !!!

http://www.deeplearningbook.org/contents/optimization.html

(56)

Brute-force Memorization ?

https://arxiv.org/pdf/1611.03530.pdf

Final of 2017 Spring:

https://ntumlds.wordpress.com/2017/03/27/r05922018_drliao/

(57)

Demo

(58)

Brute-force Memorization ?

• Simple patterns are learned first; exceptions are memorized later.

https://arxiv.org/pdf/1706.05394.pdf

(59)

Knowledge Distillation

Knowledge Distillation: https://arxiv.org/pdf/1503.02531.pdf
Do Deep Nets Really Need to be Deep? https://arxiv.org/pdf/1312.6184.pdf

[Figure: the Teacher Net (deep) processes the training data and outputs e.g. “1”: 0.7, “7”: 0.2, “9”: 0.1. The Student Net (shallow) processes the same training data and uses the teacher’s output as its learning target, trained by cross-entropy minimization.]
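A minimal sketch of the student's objective as described here: cross-entropy against the teacher's soft output. The temperature parameter T is an extra knob commonly used with distillation and is an assumption here, as are the function names:

import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=1.0):
    # Cross-entropy between the teacher's soft output (the learning target,
    # e.g. "1": 0.7, "7": 0.2, "9": 0.1) and the student's prediction.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))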

(60)

https://arxiv.org/pdf/1312.6184.pdf

(61)

Deep Learning for

Question Answering

(62)

Question Answering

• Given a document and a query, output an answer

• bAbI: the answer is a word

• https://research.fb.com/downloads/babi/

• SQuAD: the answer is a sequence of words (in the input document)

• https://rajpurkar.github.io/SQuAD-explorer/

• MS MARCO: the answer is a sequence of words

• http://www.msmarco.org

• MovieQA: Multiple choice question (output a number)

• http://movieqa.cs.toronto.edu/home/

• More: https://github.com/dapurv5/awesome-question-answering

(63)
(64)

Bi-directional Attention Flow

Demo: http://35.165.153.16:1995

(65)
(66)
(67)
(68)
(69)

Dynamic Coattention Networks

(70)

Dynamic Coattention Networks

(71)

Dynamic Coattention Networks

(72)

Dynamic Coattention Networks

• Experimental Results

DCN+: https://arxiv.org/pdf/1711.00106.pdf

(73)

R-Net

(74)

S-net

MS MARCO

(75)

Attention-over-Attention (AoA)

(not on SQuAD)

(76)

Reinforced Mnemonic Reader

(77)

Multiple-hop

ReasoNet

https://arxiv.org/abs/1609.05284

(78)

FusionNet

(79)

Recurrent Entity Networks

• https://arxiv.org/pdf/1612.03969.pdf

(80)
(81)

Query-Reduction Networks for Question Answering

• https://arxiv.org/pdf/1606.04582.pdf

(82)

Query-Reduction Networks for Question Answering

• https://arxiv.org/pdf/1606.04582.pdf

(83)

Acknowledgement

• Thanks to 曹爗文 for spotting the typos on the slides.
