### Advanced Tips for Deep Learning

### Hung-yi Lee

Step 1: define a set of functions → Step 2: goodness of function → Step 3: pick the best function → Neural Network

Good Results on Training Data? YES: check the testing data. NO: go back to the three steps.

Good Results on Testing Data? YES: done. NO: Overfitting!

**Recipe of Deep Learning**

### Do not always blame Overfitting

Figure: results on Testing Data (Overfitting?) and on Training Data (Not well trained).

### Different approaches for different problems.

e.g. dropout for good results on testing data

### Outline

• Batch Normalization

• New Activation Function

• Tuning Hyperparameters

• Interesting facts (?) about deep learning

• Capsule

• New models for QA

### Batch Normalization

### Feature Scaling

Make different features have the same scaling

### Feature Scaling

In general, gradient descent converges much faster with feature scaling than without it.
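A minimal sketch of one common way to do this, standardizing each feature to zero mean and unit standard deviation. The function name `standardize` and the toy data are assumptions for illustration, and the sketch assumes no feature is constant (otherwise the standard deviation would be zero).

```python
import math

def standardize(rows):
    """Scale each feature (column) to zero mean and unit standard deviation."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(dims)
    ]
    return [
        [(r[j] - means[j]) / stds[j] for j in range(dims)]
        for r in rows
    ]

# Toy data (hypothetical): the second feature is 100 times larger than the first.
data = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize(data)
```

After scaling, both features cover the same range, so the loss surface is less elongated along one direction.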

### How about Hidden Layer?

Feature scaling for the input; feature scaling for the output of each hidden layer?

A smaller learning rate can help, but training would be slower.

Difficulty: their statistics change during training.

### Batch normalization

Internal Covariate Shift

**Batch**

### Batch normalization

𝜇 and 𝜎 depend on 𝑧^{𝑖}.

Note: Batch normalization cannot be applied to a small batch.
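A minimal sketch of the training-time computation, assuming the usual form: compute 𝜇 and 𝜎 per dimension over the batch, normalize each 𝑧^{𝑖}, then scale by 𝛾 and shift by 𝛽. The function name, the toy batch, and the small constant `eps` (added for numerical stability) are assumptions, not taken from the slides.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize a batch of pre-activations per dimension, then scale and shift."""
    n = len(batch)
    dims = len(batch[0])
    # mu and sigma are computed from every z in the batch.
    mu = [sum(z[j] for z in batch) / n for j in range(dims)]
    var = [sum((z[j] - mu[j]) ** 2 for z in batch) / n for j in range(dims)]
    sigma = [math.sqrt(v + eps) for v in var]  # eps is an assumed stabilizer
    out = [
        [gamma[j] * (z[j] - mu[j]) / sigma[j] + beta[j] for j in range(dims)]
        for z in batch
    ]
    return out, mu, sigma

# Toy batch of four examples with two dimensions (hypothetical numbers).
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
out, mu, sigma = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With only a few examples in the batch, `mu` and `sigma` become noisy estimates, which is why a small batch is a problem.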

### Batch normalization

How to do backpropagation?

### Batch normalization

### • At testing stage:

𝛾, 𝛽 are network parameters

We do not have a batch at the testing stage.

Ideal solution:

Computing 𝜇 and 𝜎 using the whole training dataset.

Practical solution:

Computing the moving average of 𝜇 and 𝜎 of the batches during training.
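A minimal sketch of the practical solution, assuming an exponential moving average. The class name `RunningStats`, the `momentum` value, and the initial values are hypothetical choices for illustration.

```python
class RunningStats:
    """Track moving averages of the batch mean and std for use at testing time."""

    def __init__(self, dims, momentum=0.9):
        self.momentum = momentum  # assumed value; a tunable choice
        self.mu = [0.0] * dims
        self.sigma = [1.0] * dims

    def update(self, batch_mu, batch_sigma):
        """Call once per training batch with that batch's statistics."""
        m = self.momentum
        self.mu = [m * a + (1 - m) * b for a, b in zip(self.mu, batch_mu)]
        self.sigma = [m * a + (1 - m) * b for a, b in zip(self.sigma, batch_sigma)]

    def normalize(self, z, gamma, beta):
        """Normalize a single test example with the stored averages."""
        return [
            g * (zj - mj) / sj + b
            for zj, mj, sj, g, b in zip(z, self.mu, self.sigma, gamma, beta)
        ]

# Two training batches (toy statistics), then one test example.
stats = RunningStats(dims=1, momentum=0.5)
for batch_mu, batch_sigma in [([2.0], [1.0]), ([4.0], [3.0])]:
    stats.update(batch_mu, batch_sigma)
y = stats.normalize([5.0], gamma=[1.0], beta=[0.0])
```

At testing time a single example is enough, because no batch statistics are computed.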

### Batch normalization - Benefit

• BN reduces training time and makes very deep nets trainable.

• Because of less Covariate Shift, we can use larger learning rates.

• Less exploding/vanishing gradients

• Especially effective for sigmoid, tanh, etc.

• Learning is less affected by initialization.

• BN reduces the demand for regularization.

### Demo

### Activation Function

### ReLU

### • Rectified Linear Unit (ReLU)

**Reason:**

1. Fast to compute
2. Biological reason
3. Infinite sigmoid with different biases
4. Vanishing gradient problem
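A minimal sketch of reasons 1 and 4. ReLU needs only a comparison, and its derivative is 1 wherever the unit is active, while the sigmoid derivative is at most 0.25. The chain comparison below ignores the weights, which is a simplifying assumption, and the function names are illustrative.

```python
import math

def relu(z):
    return z if z > 0 else 0.0

def relu_grad(z):
    # 1 in the active region, 0 otherwise (the value at z == 0 is a convention).
    return 1.0 if z > 0 else 0.0

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

values = [relu(z) for z in [-2.0, 0.0, 3.0]]

# Product of activation derivatives through 10 layers, weights ignored.
depth = 10
sigmoid_chain = sigmoid_grad(0.0) ** depth  # 0.25 ** 10, the best case for sigmoid
relu_chain = relu_grad(1.0) ** depth        # stays 1.0 for active units
```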

### ReLU - variant

α also learned by gradient descent

𝛼 is sampled from a
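These variants give the negative side a slope α instead of zero (commonly called Leaky ReLU when α is fixed and Parametric ReLU when α is learned). A minimal sketch of the learned case for a single input; the learning rate, the starting α, and the upstream gradient are toy assumptions.

```python
def leaky_relu(z, alpha=0.01):
    """Identity for positive z, slope alpha for negative z."""
    return z if z > 0 else alpha * z

def leaky_relu_grad_alpha(z):
    """Derivative of the output with respect to alpha."""
    return 0.0 if z > 0 else z

# One gradient descent step on alpha for a single input (toy numbers).
alpha, lr = 0.25, 0.1
z, upstream = -2.0, 1.0  # upstream = dLoss/d(output), assumed for illustration
alpha -= lr * upstream * leaky_relu_grad_alpha(z)
```

Only inputs on the negative side contribute to the update of α, since the positive side does not depend on it.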

### ReLU - variant

• Positive and negative values: the whole ReLU family has this property except the original ReLU.

• Saturation region: ELU also has this property.

• Slope larger than 1

### SELU

𝜇_{𝑤} = 0
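A minimal sketch of the SELU function, assuming the standard definition: an ELU scaled by a constant λ slightly larger than 1. The two constants are the values published with the SELU paper and are not stated in these slides.

```python
import math

# Constants from the SELU paper (assumed here, not given in the slides).
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(z):
    """Scaled ELU: slope LAMBDA > 1 for z > 0, saturates for very negative z."""
    return LAMBDA * z if z > 0 else LAMBDA * ALPHA * (math.exp(z) - 1.0)

positive_slope = selu(2.0) - selu(1.0)  # equals LAMBDA, slightly above 1
saturation = selu(-100.0)               # approaches -LAMBDA * ALPHA
```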

### Demo

### SELU is actually more general.

MNIST CIFAR-10

### Demo

### 𝑎 = 𝑧 ∙ sigmoid(𝛽𝑧)
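A minimal sketch of this activation, a = z · sigmoid(βz). The default β = 1 is an assumption; very negative inputs would overflow `math.exp` and are not handled here.

```python
import math

def swish(z, beta=1.0):
    """z * sigmoid(beta * z), written as z / (1 + exp(-beta * z))."""
    return z / (1.0 + math.exp(-beta * z))

at_zero = swish(0.0)             # 0.0
linear_case = swish(2.0, 0.0)    # beta = 0 gives z / 2
large_input = swish(20.0)        # close to z, like ReLU
```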

### Hyperparameters

### Grid Search vs. Random Search

Assumption: top K results are good enough
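One way to read this assumption: if there are N candidate configurations and any of the top K is acceptable, then x uniformly random samples (with replacement) hit the top K at least once with probability 1 − (1 − K/N)^x. A minimal sketch of that arithmetic; N, x, the function names, and the numbers are illustrative assumptions.

```python
import math

def hit_probability(n_total, k_good, n_samples):
    """Chance that at least one of n_samples random picks lands in the top k_good."""
    return 1.0 - (1.0 - k_good / n_total) ** n_samples

def samples_needed(n_total, k_good, target=0.9):
    """Smallest number of random samples reaching the target hit probability."""
    miss = 1.0 - k_good / n_total
    return math.ceil(math.log(1.0 - target) / math.log(miss))

# Toy setting: 1000 configurations, the top 10 are good enough.
x = samples_needed(1000, 10, target=0.9)
p = hit_probability(1000, 10, x)
```

The required number of samples depends on the ratio K/N, not on N itself, which is why random search can cover a large grid with far fewer trials.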

### Model-based Hyperparameter Optimization

**Reinforcement Learning**

Design a network → Train the network → Accuracy as reward

It can design LSTM as shown in the previous lecture.

One kind of meta learning (or learn to learn)

### SWISH ...

### Learning Rate

### Can transfer to new tasks

## Capsule

### Capsule

### • Neuron: output a value, Capsule: output a vector

### Capsule

c are determined by dynamic routing during the testing stage.

Squashing 𝑣

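A minimal sketch of squashing and of routing to a single output capsule, following the squash function and routing-by-agreement idea of the CapsNet paper. This is a simplified version: the softmax is taken across the input capsules for one output capsule, whereas the original formulation normalizes across output capsules. The function names, the three iterations, and the toy vectors are assumptions.

```python
import math

def squash(s):
    """Keep the direction of s, shrink its length into [0, 1)."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]

def softmax(b):
    m = max(b)
    e = [math.exp(x - m) for x in b]
    total = sum(e)
    return [x / total for x in e]

def route(u, iterations=3):
    """Route prediction vectors u[i] (one per input capsule) to one output capsule."""
    b = [0.0] * len(u)  # routing logits start at zero
    for _ in range(iterations):
        c = softmax(b)  # coefficients c are computed, not learned
        s = [sum(c[i] * u[i][d] for i in range(len(u))) for d in range(len(u[0]))]
        v = squash(s)
        # Inputs that agree with the output (large dot product) gain weight.
        b = [b[i] + sum(v[d] * u[i][d] for d in range(len(v))) for i in range(len(u))]
    return v, c

# Two inputs that roughly agree and one outlier (toy numbers).
v, c = route([[1.0, 0.0], [1.0, 0.1], [-1.0, 0.0]])
```

The outlier ends up with the smallest coefficient, so it contributes least to the output vector.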
### Capsule

• Capsule can also be convolutional.

• Simply replace filter with capsule

• Output layer and loss

### CapsNet

### NN

Minimize

reconstruction error

### Experimental Results

• MNIST

• Each example is an MNIST digit with a random small affine transformation.

• However, models were never trained with affine transformations

• CapsNet achieved 79% accuracy on the affNIST test set.

• A traditional convolutional model with a similar number of parameters achieved 66%.

### Experimental Results

### • Each dimension contains specific information.

### Experimental Results

### • MultiMNIST

### Discussion

### • Invariance vs. Equivariance

### Dynamic Routing

### To Learn More ……

### • Hinton’s talk:

### Interesting Facts (?) about Deep Learning

### Training stuck because …. ?

### • People believe training gets stuck because the parameters are near a local minimum

How about saddle points?

### Training stuck because …. ?

### • People believe training gets stuck because the parameters are around a critical point
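A critical point is any point where the gradient is zero, which includes saddle points as well as local minima. A minimal sketch, assuming a two-parameter function whose Hessian at the critical point is known, that tells them apart by the signs of the Hessian eigenvalues. The function name and the example functions are illustrative.

```python
import math

def classify_critical_point(hxx, hxy, hyy):
    """Classify via the eigenvalues of the symmetric Hessian [[hxx, hxy], [hxy, hyy]]."""
    mean = (hxx + hyy) / 2.0
    radius = math.sqrt(((hxx - hyy) / 2.0) ** 2 + hxy ** 2)
    low, high = mean - radius, mean + radius  # the two eigenvalues
    if low > 0:
        return "local minimum"
    if high < 0:
        return "local maximum"
    if low < 0 < high:
        return "saddle point"
    return "inconclusive (zero eigenvalue)"

# f(x, y) = x^2 + y^2 has Hessian [[2, 0], [0, 2]] at the origin.
bowl = classify_critical_point(2.0, 0.0, 2.0)
# f(x, y) = x^2 - y^2 has Hessian [[2, 0], [0, -2]] at the origin.
saddle = classify_critical_point(2.0, 0.0, -2.0)
```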

### Brute-force Memorization ?

### Demo

### Brute-force Memorization ?

### • Simple pattern first, then memorize exception

### Knowledge Distillation

### Deep Learning for Question Answering

### Question Answering

• Given a document and a query, output an answer

• bAbI: the answer is a word

• SQuAD: the answer is a sequence of words (in the input document)

• MS MARCO: the answer is a sequence of words

• MovieQA: Multiple choice question (output a number)
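For datasets where the answer is a span of the input document, such as SQuAD, one common output design (an assumption here, not stated in the slides) is to score every token as a possible start and as a possible end, then pick the best pair with start ≤ end. A minimal sketch with hypothetical scores:

```python
def best_span(start_scores, end_scores, max_len=None):
    """Return (start, end) indices maximizing start score + end score, start <= end."""
    best = (float("-inf"), 0, 0)
    for i, s in enumerate(start_scores):
        if max_len is None:
            last = len(end_scores)
        else:
            last = min(len(end_scores), i + max_len)
        for j in range(i, last):
            score = s + end_scores[j]
            if score > best[0]:
                best = (score, i, j)
    return best[1], best[2]

# Toy document and made-up scores from a hypothetical model.
tokens = "the cat sat on the mat".split()
start = [0.1, 0.2, 0.1, 0.3, 2.0, 0.5]
end = [0.0, 0.1, 0.2, 0.1, 0.4, 3.0]
i, j = best_span(start, end)
answer = " ".join(tokens[i:j + 1])
```

A multiple-choice dataset such as MovieQA would instead need one score per candidate answer.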

### Bi-directional Attention Flow

Demo: http://35.165.153.16:1995

### Dynamic Coattention Networks

### • Experimental Results

### R-Net

**S-net**

### MS MARCO

**Attention-over-Attention (AoA)**

(not on SQuAD)

### Reinforced Mnemonic Reader

**Multiple-hop**

**ReasoNet**

