Slide credit: Prof. Hung-Yi Lee
Review
Learning ≈ Looking for a Function
◦ Handwriting recognition: f(image) = "2"
◦ Speech recognition: f(audio) = "你好" ("hello")
◦ Weather forecast: f(weather data up to Thursday) = Saturday's forecast
◦ Playing video games: f(game screen) = "move left"
Machine Learning Framework
Training is to pick the best function given the observed data; testing is to predict the label using the learned function.
◦ Training data: {(x¹, ŷ¹), (x², ŷ²), ⋯}
  e.g., x = "It claims too much." (function input), ŷ = "−" (negative) (function output)
◦ Model: hypothesis function set {f₁, f₂, ⋯}
◦ Training: pick the "best" function f*
◦ Testing data: {(x, ?), ⋯}
◦ Testing: y = f*(x)
How to Train a Model?
Machine Learning Framework (recap)
Training data {(x¹, ŷ¹), (x², ŷ²), ⋯} → model: hypothesis function set {f₁, f₂, ⋯} → training picks the "best" function f* → testing: y = f*(x) on testing data {(x, ?), ⋯}.
Training Procedure
Q1. What is the model? (hypothesis function set)
Q2. What does a "good" function mean?
Q3. How do we pick the "best" function?
Model: hypothesis function set {f₁, f₂, ⋯} → Training: pick the best function f*.
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
What is the Model?
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Classification Task
◦ Sentiment analysis: "這規格有誠意!" ("This spec shows sincerity!") → +, "太爛了吧~" ("This is terrible~") → −
◦ Speech phoneme recognition: audio → /h/
◦ Handwriting recognition: image → "2"
Binary classification: input object → Class A (yes) or Class B (no)
Multi-class classification: input object → Class A, Class B, or Class C
Some cases are not easy to formulate as classification problems.
Target Function
Classification task: y = f(x), where f: R^N → R^M
◦ x: input object to be classified (an N-dim vector)
◦ y: class/label (an M-dim vector)
Assume both x and y can be represented as fixed-size vectors.
Vector Representation Example
Handwriting digit classification: f: R^N → R^M with N = 256 and M = 10
◦ x (image): each pixel corresponds to one element of the vector, 1 for ink and 0 otherwise; a 16 × 16 image gives 256 dimensions.
◦ y (class/label): 10 dimensions for digit recognition; each element indicates "1" or not, "2" or not, "3" or not, ⋯
  e.g., "1" → [1, 0, 0, ⋯]ᵀ, "2" → [0, 1, 0, ⋯]ᵀ
Vector Representation Example
Sentiment analysis: f: R^N → R^M with N = vocabulary size and M = 3
◦ x (word): each element of the vector corresponds to a word in the vocabulary, 1 indicates the word and 0 otherwise; the dimensionality equals the size of the vocabulary; e.g., "love" → [⋯, 0, 1, 0, ⋯]ᵀ
◦ y (class/label): 3 dimensions (positive, negative, neutral); each element indicates "+" or not, "−" or not, "?" or not
  e.g., "+" → [1, 0, 0]ᵀ, "−" → [0, 1, 0]ᵀ
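A minimal sketch of the one-hot encodings described above, assuming NumPy; the toy vocabulary and index choices are illustrative, not from the slides:

```python
import numpy as np

def one_hot(index, dim):
    """Vector of length dim with a 1 at position `index` and 0 elsewhere."""
    v = np.zeros(dim)
    v[index] = 1.0
    return v

# y for digit "2": 10 dimensions, one per digit class
y_digit = one_hot(1, 10)                            # [0, 1, 0, ..., 0]

# x for the word "love": dimensionality = vocabulary size
vocab = ["hate", "love", "ok"]                      # illustrative toy vocabulary
x_word = one_hot(vocab.index("love"), len(vocab))   # [0, 1, 0]
```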
Target Function (recap)
Classification task: y = f(x), where f: R^N → R^M; x is an N-dim vector (the input object to be classified) and y is an M-dim vector (the class/label).
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
A Single Neuron
z = w₁x₁ + w₂x₂ + ⋯ + w_N x_N + b
y = σ(z) = 1 / (1 + e⁻ᶻ)
b: bias; σ: the sigmoid function, used as the activation function.
Each neuron is a very simple function.
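As a concrete illustration of this slide, a single sigmoid neuron computed with NumPy; the weights, bias, and input values are made up for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, 1.0])    # inputs x1..xN (N = 3 here)
w = np.array([0.5, -1.0, 2.0])   # weights w1..wN
b = -0.5                         # bias

z = np.dot(w, x) + b             # z = w1*x1 + ... + wN*xN + b
y = sigmoid(z)                   # y = 1 / (1 + e^(-z))
print(y)
```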
Why Bias?
The bias b can be seen as the weight on an "always on" input (a constant feature 1): even when every input is 0, z = b. The bias term gives the neuron a class prior.
Model Parameters of a Single Neuron
z = w₁x₁ + w₂x₂ + ⋯ + w_N x_N + b,  y = σ(z) = 1 / (1 + e⁻ᶻ)
w and b are the parameters of this neuron.
A Single Neuron
A single neuron maps an N-dim input to one output value y:
y ≥ 0.5 → "2"
y < 0.5 → not "2"
A single neuron can only handle binary classification.
A Layer of Neurons
Handwriting digit classification: f: R^N → R^M
y₁: "1" or not, y₂: "2" or not, y₃: "3" or not, ⋯ (10 neurons for 10 classes)
A layer of neurons can handle multiple possible outputs; the result depends on which output is the maximum ("Which one is max?").
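A sketch of a single layer of sigmoid neurons for 10 classes, with the prediction taken as the maximum output; the shapes and random values below are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

N, M = 256, 10                        # 16x16 image -> 10 digit classes
rng = np.random.default_rng(0)
W = rng.normal(size=(M, N))           # one weight vector per neuron/class
b = np.zeros(M)

x = rng.integers(0, 2, size=N).astype(float)   # a made-up binary "image"
y = sigmoid(W @ x + b)                # one output per class ("1" or not, "2" or not, ...)
predicted_class = int(np.argmax(y))   # "Which one is max?"
```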
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
A Layer of Neurons – Perceptron
Output units all operate separately – no shared weights.
Adjusting the weights moves the location, orientation, and steepness of the cliff.
http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
Expression of Perceptron
z = w₁x₁ + w₂x₂ + b
A perceptron is a linear separator: it can represent AND, OR, NOT, etc., but not XOR.
http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
How to Implement XOR?
A xor B = AB' + A'B

 A  B | A xor B
 0  0 |    0
 0  1 |    1
 1  0 |    1
 1  1 |    0

Combining multiple operations can produce more complicated outputs, as sketched below.
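One possible way to realize XOR with two layers of threshold units, in the spirit of A xor B = AB' + A'B: the hidden units act as OR and NAND, and the output unit ANDs them. The hand-picked weights are an illustration, not the slides' exact construction:

```python
def step(z):
    """Threshold (perceptron-style) activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def xor(a, b):
    h1 = step(a + b - 0.5)       # acts as OR(a, b)
    h2 = step(1.5 - a - b)       # acts as NAND(a, b)
    return step(h1 + h2 - 1.5)   # AND of the two hidden units = XOR(a, b)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))   # reproduces the truth table above
```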
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Neural Networks – Multi-Layer Perceptron
Inputs x₁, x₂ feed hidden units a₁ = σ(z₁) and a₂ = σ(z₂), which in turn feed the output y.

Expression of Multi-Layer Perceptron
◦ Continuous function with 2 layers: combine two opposite-facing threshold functions to make a ridge.
◦ Continuous function with 3 layers: combine two perpendicular ridges to make a bump; add bumps of various sizes and locations to fit any surface.
Multiple layers enhance the model's expressiveness: the model can approximate more complex functions.
http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
Deep Neural Networks (DNN)
Fully connected feedforward network: f: R^N → R^M
Input vector x (x₁, ⋯, x_N) → Layer 1 → Layer 2 → ⋯ → Layer L → output vector y (y₁, ⋯, y_M)
Deep NN: multiple hidden layers.
Notation Definition
a_i^l: output of neuron i at layer l (layer l has N_l nodes; layer l−1 has N_{l−1} nodes)
a^l: the outputs of one layer collected as a vector
Notation Definition
w_ij^l: weight from neuron j (layer l−1) to neuron i (layer l)
W^l: the weights between the two layers collected as a matrix
Notation Definition
b_i^l: bias for neuron i at layer l
b^l: the biases of all neurons at layer l collected as a vector
Notation Definition
z_i^l: input of the activation function for neuron i at layer l
z_i^l = w_i1^l a_1^{l−1} + w_i2^l a_2^{l−1} + ⋯ + b_i^l
z^l: the activation-function inputs at a layer collected as a vector
Notation Summary
a_i^l : output of a neuron                  a^l : output vector of a layer
z_i^l : input of the activation function    z^l : input vector of the activation function for a layer
w_ij^l : a weight                           W^l : a weight matrix
b_i^l : a bias                              b^l : a bias vector
Layer Output Relation
The outputs a^{l−1} of layer l−1 determine the inputs z^l at layer l, which in turn determine the outputs a^l of layer l.
Layer Output Relation – from a to z
z_i^l = w_i1^l a_1^{l−1} + w_i2^l a_2^{l−1} + ⋯ + b_i^l
In matrix form: z^l = W^l a^{l−1} + b^l
Layer Output Relation – from z to a
a_i^l = σ(z_i^l)
In vector form: a^l = σ(z^l)
Layer Output Relation
z^l = W^l a^{l−1} + b^l
a^l = σ(z^l)
Neural Network Formulation
Fully connected feedforward network: f: R^N → R^M, input vector x, output vector y
a^1 = σ(W^1 x + b^1)
a^2 = σ(W^2 a^1 + b^2)
⋯
y = σ(W^L a^{L−1} + b^L)
Neural Network Formulation
Putting the layers together:
y = f(x) = σ(W^L ⋯ σ(W^2 σ(W^1 x + b^1) + b^2) ⋯ + b^L)
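A minimal NumPy sketch of this formulation; the layer sizes and random initialization below are placeholders, not values from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Fully connected feedforward pass: a^l = sigmoid(W^l a^(l-1) + b^l)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

layer_sizes = [4, 5, 3]        # N = 4 inputs, one hidden layer of 5, M = 3 outputs
rng = np.random.default_rng(0)
weights = [rng.normal(size=(m, n)) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(m) for m in layer_sizes[1:]]

y = forward(rng.normal(size=layer_sizes[0]), weights, biases)
print(y)
```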
Activation Function
Candidate activation functions range from boolean (step) and linear functions to bounded, non-linear functions.
Non-Linear Activation Functions
◦ Sigmoid
◦ Tanh
◦ Rectified Linear Unit (ReLU)
Non-linear functions such as these are frequently used in neural networks (written out below for reference).
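The three activation functions named above, expressed with NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # bounded to (0, 1)

def tanh(z):
    return np.tanh(z)                  # bounded to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)          # 0 for z < 0, identity for z >= 0
```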
Why Non-Linearity?
Function approximation:
◦ Without non-linearity, a deep neural network is equivalent to a single linear transform (e.g., W²(W¹x) = (W²W¹)x).
◦ With non-linearity, networks with more layers can approximate more complex functions.
http://cs224d.stanford.edu/lectures/CS224d-Lecture4.pdf
What Does a "Good" Function Mean?
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Function = Model Parameters
Formal definition: the model parameter set θ = {W¹, b¹, W², b², ⋯, W^L, b^L}.
Different parameters W and b give different functions; all such functions together form the function set.
Picking a function f is the same as picking a set of model parameters θ.
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Model Parameter Measurement
Define a function to measure the quality of a parameter set θ:
◦ Evaluating by a loss/cost/error function C(θ): how bad θ is
  Best model parameter set: θ* = arg min_θ C(θ)
◦ Evaluating by an objective/reward function O(θ): how good θ is
  Best model parameter set: θ* = arg max_θ O(θ)
Loss Function Example
Training data: {(x¹, ŷ¹), (x², ŷ²), ⋯}; e.g., x = "It claims too much." (function input), ŷ = "−" (negative) (function output).
A "good" function makes f(xᵏ) close to ŷᵏ on the training data.
Define an example loss function as the sum over the errors of all training samples: C(θ) = Σₖ error(f_θ(xᵏ), ŷᵏ).
Frequent Loss Functions
◦ Square loss
◦ Hinge loss
◦ Logistic loss
◦ Cross entropy loss
◦ Others: large margin, etc.
https://en.wikipedia.org/wiki/Loss_functions_for_classification
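Two of the listed losses written out for a single training sample, assuming the target ŷ and the prediction y are vectors of the same length; the epsilon inside the log is added only for numerical safety:

```python
import numpy as np

def square_loss(y_hat, y):
    """Squared difference between the target y_hat and the prediction y."""
    return np.sum((y_hat - y) ** 2)

def cross_entropy_loss(y_hat, y, eps=1e-12):
    """Cross entropy between the target distribution y_hat and the prediction y."""
    return -np.sum(y_hat * np.log(y + eps))

y_hat = np.array([0.0, 1.0, 0.0])   # one-hot target, e.g. class "2"
y = np.array([0.1, 0.8, 0.1])       # model prediction
print(square_loss(y_hat, y), cross_entropy_loss(y_hat, y))
```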
How Can We Pick the "Best" Function?
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Problem Statement
Given a loss function C(θ) over candidate model parameter sets θ, find the parameter set that minimizes C(θ).
How to solve this optimization problem?
1) Brute force – enumerate all possible θ
2) Calculus – solve ∂C(θ)/∂θ = 0
Issue: the whole space of C(θ) is unknown.
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Gradient Descent for Optimization
Assume that θ has only one variable.
θⁱ: the model at the i-th iteration (θ⁰, θ¹, θ², θ³, ⋯).
Idea: drop a ball on the curve C(θ) and find the position where it stops rolling (a local minimum).
Gradient Descent for Optimization
Assume that θ has only one variable.
◦ Randomly start at θ⁰.
◦ Compute dC(θ⁰)/dθ and update: θ¹ = θ⁰ − η · dC(θ⁰)/dθ
◦ Compute dC(θ¹)/dθ and update: θ² = θ¹ − η · dC(θ¹)/dθ
◦ ⋯
η is the "learning rate".
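A sketch of the one-variable update θ ← θ − η · dC(θ)/dθ on a toy loss C(θ) = (θ − 3)²; the loss, starting point, and learning rate are made up for illustration:

```python
def C(theta):
    return (theta - 3.0) ** 2        # toy loss with its minimum at theta = 3

def dC(theta):
    return 2.0 * (theta - 3.0)       # derivative dC/d(theta)

eta = 0.1                            # learning rate
theta = -4.0                         # starting point theta^0
for i in range(100):
    theta = theta - eta * dC(theta)  # theta^(i+1) = theta^i - eta * dC(theta^i)/d(theta)
print(theta, C(theta))               # theta approaches 3, where the "ball" stops rolling
```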
Gradient Descent for Optimization
Assume that θ has two variables {θ₁, θ₂}.
Gradient Descent for Optimization
Assume that θ has two variables {θ₁, θ₂}.
• Randomly start at θ⁰ = [θ₁⁰, θ₂⁰]ᵀ
• Compute the gradient of C(θ) at θ⁰: ∇C(θ⁰) = [∂C(θ⁰)/∂θ₁, ∂C(θ⁰)/∂θ₂]ᵀ
• Update the parameters: θ¹ = θ⁰ − η∇C(θ⁰)
• Compute the gradient of C(θ) at θ¹ and update again: θ² = θ¹ − η∇C(θ¹)
Gradient Descent for Optimization
In the (θ₁, θ₂) plane, each movement is opposite to the gradient: θ⁰ → θ¹ → θ² → θ³, using ∇C(θ⁰), ∇C(θ¹), ∇C(θ²), ∇C(θ³).

Algorithm
Initialization: start at θ⁰
while (θ⁽ⁱ⁺¹⁾ ≠ θⁱ)
{
  compute the gradient at θⁱ
  update the parameters: θ⁽ⁱ⁺¹⁾ = θⁱ − η∇C(θⁱ)
}
Revisit Neural Network Formulation
Fully connected feedforward network: input vector x → Layer 1 → Layer 2 → ⋯ → Layer L → output vector y.
Gradient Descent for Neural Network
For a neural network, θ collects every weight and bias, so ∇C(θ) contains ∂C/∂w and ∂C/∂b for all of them.
Algorithm
Initialization: start at θ⁰
while (θ⁽ⁱ⁺¹⁾ ≠ θⁱ)
{
  compute the gradient at θⁱ
  update the parameters
}
Gradient Descent for Optimization – Simple Case
A single neuron: z = w₁x₁ + w₂x₂ + b, y = σ(z)
Algorithm
Initialization: start at θ⁰
while (θ⁽ⁱ⁺¹⁾ ≠ θⁱ)
{
  compute the gradient at θⁱ
  update the parameters
}
Gradient Descent for Optimization
Simple Case – Three Parameters & Square Error Loss
Update the three parameters at the t-th iteration:
w₁ ← w₁ − η ∂C/∂w₁,  w₂ ← w₂ − η ∂C/∂w₂,  b ← b − η ∂C/∂b
Square error loss: C(θ) = (ŷ − y)²
Gradient Descent for Optimization
Simple Case – Square Error Loss
Square error loss: C(θ) = (ŷ − y)², with y = σ(z) and z = w₁x₁ + w₂x₂ + b.
Gradient Descent for Optimization
Simple Case – Square Error Loss
By the chain rule and the sigmoid derivative σ'(z) = σ(z)(1 − σ(z)):
∂C/∂w₁ = (∂C/∂y)(∂y/∂z)(∂z/∂w₁) = 2(y − ŷ) · σ(z)(1 − σ(z)) · x₁
Gradient Descent for Optimization
Simple Case – Square Error Loss
Likewise, ∂C/∂w₂ = 2(y − ŷ) · σ(z)(1 − σ(z)) · x₂ and ∂C/∂b = 2(y − ŷ) · σ(z)(1 − σ(z)).
Gradient Descent for Optimization
Simple Case – Three Parameters & Square Error Loss
For the neuron z = w₁x₁ + w₂x₂ + b, y = σ(z), the t-th iteration updates the three parameters with these gradients:
w₁⁽ᵗ⁺¹⁾ = w₁⁽ᵗ⁾ − η · 2(y − ŷ)σ(z)(1 − σ(z)) x₁
w₂⁽ᵗ⁺¹⁾ = w₂⁽ᵗ⁾ − η · 2(y − ŷ)σ(z)(1 − σ(z)) x₂
b⁽ᵗ⁺¹⁾ = b⁽ᵗ⁾ − η · 2(y − ŷ)σ(z)(1 − σ(z))
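The simple case above as code: one gradient computation and parameter update for the neuron y = σ(w₁x₁ + w₂x₂ + b) under the square error loss. The sample values, initial parameters, and learning rate are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2, y_hat = 1.0, 0.5, 1.0         # one training sample (x, y_hat)
w1, w2, b, eta = 0.2, -0.3, 0.0, 0.5  # parameters and learning rate

z = w1 * x1 + w2 * x2 + b
y = sigmoid(z)

# chain rule: dC/dw = dC/dy * dy/dz * dz/dw, with C = (y_hat - y)^2
common = 2.0 * (y - y_hat) * sigmoid(z) * (1.0 - sigmoid(z))
dw1, dw2, db = common * x1, common * x2, common

w1, w2, b = w1 - eta * dw1, w2 - eta * dw2, b - eta * db   # one gradient descent step
```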
Optimization Algorithm
Algorithm
Initialization: set the parameters θ, b at random
while (stopping criteria not met)
{
  for each training sample {x, ŷ}, compute the gradient and update the parameters
}

Gradient Descent for Neural Network
The gradient involves millions of parameters; to compute it efficiently, we use backpropagation.
Gradient Descent Issue
Training data: {(x¹, ŷ¹), (x², ŷ²), ⋯}
The model can only be updated after seeing all training samples → slow.
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Stochastic Gradient Descent (SGD)
◦ Gradient descent: compute the gradient of the loss summed over all training samples {(x¹, ŷ¹), (x², ŷ²), ⋯}, then update.
◦ SGD: pick one training sample xᵏ and update using the loss on that sample alone. If all training samples have the same probability of being picked, the updates follow the full gradient in expectation.
The model can be updated after seeing only one training sample → faster.
Epoch Definition
When running SGD, the model starts at θ⁰; we pick x¹, then x², ⋯, then xᴷ from the training data, updating after each pick.
Seeing all training samples once = one epoch.
Gradient Descent vs. SGD
◦ Gradient descent: sees all examples and updates once per epoch.
◦ Stochastic gradient descent: sees only one example per update; with 20 examples, it updates 20 times in one epoch.
SGD approaches the target point faster than gradient descent.
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Mini-Batch SGD
◦ Batch gradient descent: use all K training samples in each iteration.
◦ Stochastic gradient descent (SGD): pick one training sample xᵏ; use 1 sample in each iteration.
◦ Mini-batch SGD: pick a set of B training samples as a batch b; use those B samples in each iteration. B is the "batch size". A sketch of one training run follows.
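A skeleton of mini-batch SGD on a toy problem, including the per-epoch shuffling recommended in the tips later; the data, model (a single sigmoid neuron with square error), and hyperparameters are all made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                 # K = 20 training samples, N = 2 features
Y = (X[:, 0] + X[:, 1] > 0).astype(float)    # toy binary labels

w, b, eta, B = np.zeros(2), 0.0, 0.5, 4      # parameters, learning rate, batch size

for epoch in range(10):
    order = rng.permutation(len(X))          # shuffle samples before every epoch
    for start in range(0, len(X), B):
        batch = order[start:start + B]       # pick B samples as one batch
        xb, yb = X[batch], Y[batch]
        y = sigmoid(xb @ w + b)
        common = 2.0 * (y - yb) * y * (1.0 - y)   # square-error gradient terms
        w -= eta * (xb.T @ common) / len(batch)   # update after every batch
        b -= eta * common.mean()
```

With B = 1 this reduces to SGD, and with B = K (the whole training set) it becomes batch gradient descent.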
Batch vs. Mini-Batch
Handwriting digit classification: comparing batch size = 1 (SGD) against full-batch gradient descent.
Gradient Descent vs. SGD vs. Mini-Batch
(Plot: training time in seconds versus batch size — 1, 10, 100, 1000, 10000, full — comparing SGD, mini-batch, and gradient descent.)
Training speed: mini-batch > SGD > gradient descent.
Why is mini-batch faster than SGD?
SGD vs. Mini-Batch
◦ SGD: compute z¹ = W¹x separately for each sample, i.e., one matrix-vector multiplication per sample.
◦ Mini-batch SGD: stack the samples of a batch into a matrix and compute z¹ = W¹[x¹ x² ⋯] with a single matrix-matrix multiplication.
Modern computers run one matrix-matrix multiplication faster than the equivalent sequence of matrix-vector multiplications.
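The matrix-multiplication point sketched in NumPy: processing a batch with one matrix-matrix product instead of many matrix-vector products; the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(500, 500))             # weight matrix W^1
xs = rng.normal(size=(500, 64))              # a batch of 64 samples stacked as columns

# SGD-style: one matrix-vector product per sample
zs_loop = np.stack([W1 @ xs[:, k] for k in range(xs.shape[1])], axis=1)

# mini-batch style: a single matrix-matrix product over the whole batch
zs_batch = W1 @ xs

assert np.allclose(zs_loop, zs_batch)        # same numbers; the batched form runs much faster
```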
Big Issue: Local Optima
Neural network training has no guarantee of reaching the globally optimal solution.
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Initialization
Different initialization parameters may result in different trained models.
Do not initialize the parameters to equal values; set them randomly.
Learning Rate
(Plot: cost versus number of parameter updates on the error surface, for learning rates that are very large, large, small, and well chosen.)
The learning rate should be set carefully.
Tips for Mini-Batch Training
◦ Shuffle the training samples before every epoch: otherwise the network might memorize the order in which the samples are fed.
◦ Use a fixed batch size for every epoch: this enables a fast matrix-multiplication implementation of the calculations.
◦ Adapt the learning rate to the batch size: larger batch → smaller learning rate.
http://stackoverflow.com/questions/13693966/neural-net-selecting-data-for-each-mini-batch
Learning Recipe
◦ Training data: labeled pairs (x, ŷ) used to pick the "best" function f*.
◦ Testing data: split into validation and real testing; f* maps x to y.
Learning Recipe
◦ Validation set: we immediately know the performance.
◦ Real testing set: we do not know the performance until submission.
Learning Recipe
If we do not get good results on the training set, modify the training process. Possible reasons:
◦ No good function exists (a bad hypothesis function set) → reconstruct the model architecture.
◦ A good function exists but cannot be found (local optima) → change the training strategy.
Then check whether we get good results on the dev/validation set.
Learning Recipe
◦ Good results on the training set? If no → modify the training process.
◦ Good results on the dev/validation set? If no → prevent overfitting; if yes → done.
Better performance on training but worse performance on dev indicates overfitting.
Overfitting
Possible solutions
◦ more training samples
◦ some tips: dropout, etc.
Concluding Remarks
Model: hypothesis function set {f₁, f₂, ⋯} → training picks the "best" function f*.
Q1. What is the model? → Model architecture
Q2. What does a "good" function mean? → Loss function design
Q3. How do we pick the "best" function? → Optimization