Neural Network Basics
Applied Deep Learning
September 15th, 2022 http://adl.miulab.tw
Learning ≈ Looking for a Function
◉ Handwritten Recognition: f(image of a handwritten digit) = "2"
◉ Speech Recognition: f(audio signal) = "你好" (hello)
◉ Weather Forecast: f(weather data up to Thursday) = forecast for "Saturday"
◉ Play Video Games: f(game screen) = "move left"
Machine Learning Framework
Training is to pick the best function given the observed data; testing is to predict the label using the learned function.
◉ Training Data: {(x¹, ŷ¹), (x², ŷ²), …}
  e.g. x: "It claims too much." (function input), ŷ: negative (function output)
◉ Model: hypothesis function set {f₁, f₂, …}
◉ Training: pick the "best" function f*
◉ Testing Data: {(x, ?), …}
◉ Testing: y = f*(x)
How to Train a Model?
Machine Learning Framework (recap)
Training picks the best function f* from the hypothesis set given the training data {(x¹, ŷ¹), (x², ŷ²), …}; testing predicts y = f*(x) on new inputs.
Training Procedure
◉ Q1. What is the model? (function hypothesis set {f₁, f₂, …})
◉ Q2. What does a "good" function mean?
◉ Q3. How do we pick the "best" function f*?
Training Procedure Outline
◉ Model Architecture
  ✓ A Single Layer of Neurons (Perceptron)
  ✓ Limitation of Perceptron
  ✓ Neural Network Model (Multi-Layer Perceptron)
◉ Loss Function Design
  ✓ Function = Model Parameters
  ✓ Model Parameter Measurement
◉ Optimization
  ✓ Gradient Descent
  ✓ Stochastic Gradient Descent (SGD)
  ✓ Mini-Batch SGD
  ✓ Practical Tips
What is the Model?
Classification Task
◉ Sentiment Analysis: "這規格有誠意!" ("These specs show sincerity!") → +, "太爛了吧~" ("That's so bad~") → −
◉ Speech Phoneme Recognition: audio frame → /h/
◉ Handwritten Recognition: image → "2"
Binary classification: an input object is assigned to Class A (yes) or Class B (no).
Multi-class classification: an input object is assigned to one of Class A, Class B, Class C, ….
Some cases are not easy to formulate as classification problems.
Target Function
◉ Classification Task: $y = f(x)$, where $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$
  ○ x: input object to be classified → an N-dim vector
  ○ y: class/label → an M-dim vector
Assume both x and y can be represented as fixed-size vectors.
Vector Representation Example
◉ Handwriting Digit Classification: $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$
  ○ x: image → a 16 × 16 image gives 16 × 16 = 256 dimensions; each pixel corresponds to an element in the vector (1 for ink, 0 otherwise)
  ○ y: class/label → 10 dimensions for digit recognition; each element indicates one class ("1" or not, "2" or not, "3" or not, …), so "1" → (1, 0, 0, …) and "2" → (0, 1, 0, …)
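A minimal sketch of these vector representations, assuming the 16 × 16 binary image and 10 digit classes from the example (the function names are illustrative, not from the slides):

```python
import numpy as np

def image_to_vector(image):
    """Flatten a 16x16 binary image (1 = ink, 0 = blank) into a 256-dim vector x."""
    image = np.asarray(image)
    assert image.shape == (16, 16)
    return image.reshape(256).astype(np.float32)

def label_to_one_hot(digit, num_classes=10):
    """Encode a digit label as a 10-dim one-hot vector ŷ."""
    y_hat = np.zeros(num_classes, dtype=np.float32)
    y_hat[digit] = 1.0
    return y_hat

print(label_to_one_hot(2))   # the digit 2 -> a 1 in position 2, 0 elsewhere
```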
Vector Representation Example
◉ Sentiment Analysis: $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$
  ○ x: word (e.g. "love") → a vector whose dimension is the size of the vocabulary; each element corresponds to a word in the vocabulary (1 indicates the word, 0 otherwise)
  ○ y: class/label → 3 dimensions (positive, negative, neutral); each element indicates one class ("+" or not, "−" or not, "?" or not)
A Single Neuron
Inputs $x_1, x_2, \dots, x_N$ are weighted by $w_1, w_2, \dots, w_N$ and summed with a bias $b$:
$z = w_1 x_1 + w_2 x_2 + \dots + w_N x_N + b$
The output is $y = \sigma(z)$, where the activation function $\sigma$ is the sigmoid function:
$\sigma(z) = \frac{1}{1 + e^{-z}}$
Each neuron is a very simple function.
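A minimal sketch of this single sigmoid neuron (the helper names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def single_neuron(x, w, b):
    """y = sigmoid(w . x + b) for an N-dim input x."""
    z = np.dot(w, x) + b
    return sigmoid(z)

x = np.array([0.5, -1.0, 2.0])   # N = 3 inputs
w = np.array([1.0, 0.2, -0.5])   # one weight per input
b = 0.1                          # bias: an "always on" feature
print(single_neuron(x, w, b))    # a value in (0, 1)
```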
A Single Neuron
The bias term is an "always on" feature: it behaves like a weight attached to a constant input of 1.
Why Bias?
The bias b shifts the activation $\sigma(z)$ along the z-axis; the bias term gives a class prior.
Model Parameters of A Single Neuron
$y = \sigma(z) = \sigma(w_1 x_1 + \dots + w_N x_N + b)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$
w and b are the parameters of this neuron.
A Single Neuron
A single neuron implements $f: \mathbb{R}^N \rightarrow \mathbb{R}$. For digit recognition we can threshold its output: if y ≥ 0.5 the input is "2"; if y < 0.5 it is not "2".
A single neuron can only handle binary classification.
A Layer of Neurons
◉ Handwriting digit classification: $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$
Every input $x_1, \dots, x_N$ (plus a constant 1 for the bias) feeds each neuron; neuron i outputs $y_i$ ("1" or not, "2" or not, "3" or not, …). With 10 neurons we cover 10 classes.
A layer of neurons can handle multiple possible outputs; the predicted result is the class whose output is the maximum ("which one is max?").
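A hedged sketch of one layer of sigmoid neurons for the 10-class digit example (the 256-dim input matches the earlier 16 × 16 image; weights are random placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b):
    """One layer: each row of W holds the weights of one neuron."""
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 256))   # 10 neurons, 256 inputs each
b = np.zeros(10)
x = rng.integers(0, 2, size=256).astype(float)  # a fake binary image

y = layer(x, W, b)               # 10 scores, one per class
print("predicted class:", int(np.argmax(y)))    # "which one is max?"
```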
A Layer of Neurons – Perceptron
◉ Output units all operate separately – no shared weights
Adjusting the weights moves the location, orientation, and steepness of the cliff formed by each unit's output.
http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
Expression of Perceptron
A perceptron unit computes $y = \sigma(w_1 x_1 + w_2 x_2 + b)$. It can represent AND, OR, NOT, etc., but not XOR, because a single unit is only a linear separator.
http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
How to Implement XOR?
A   B   A xor B
0   0      0
0   1      1
1   0      1
1   1      0

A xor B = A·B' + A'·B
Chaining multiple operations can produce more complicated outputs: from A and B, first compute A' and B' (NOT), then A·B' and A'·B (AND), then their sum A·B' + A'·B (OR).
Neural Networks – Multi-Layer Perceptron
Inputs $x_1, x_2$ feed a layer of hidden units (each computing $z$ and then $a = \sigma(z)$); the hidden units' outputs $a_1, a_2$ in turn feed the output neuron that produces y.
Expression of Multi-Layer Perceptron
◉ Continuous function w/ 2 layers
  ○ Combine two opposite-facing threshold functions to make a ridge
◉ Continuous function w/ 3 layers
  ○ Combine two perpendicular ridges to make a bump
  ○ Add bumps of various sizes and locations to fit any surface
http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
Multiple layers enhance the model's expressiveness → the model can approximate more complex functions.
Deep Neural Networks (DNN)
◉ Fully connected feedforward network: $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$
The input vector x = (x₁, …, x_N) passes through Layer 1, Layer 2, …, Layer L to produce the output vector y = (y₁, …, y_M).
Deep NN: multiple hidden layers.
Notation Definition
Layer l−1 has $N_{l-1}$ nodes and layer l has $N_l$ nodes.
$a_i^l$: output of neuron i at layer l
$a^l = (a_1^l, a_2^l, \dots)$: output of one layer → a vector
Notation Definition
$w_{ij}^l$: the weight from neuron j (layer l−1) to neuron i (layer l)
$W^l$: the weights between layer l−1 and layer l → a matrix
Notation Definition
$b_i^l$: bias for neuron i at layer l
$b^l = (b_1^l, b_2^l, \dots)$: biases of all neurons at layer l → a vector
Notation Definition
$z_i^l$: input of the activation function for neuron i at layer l
$z^l = (z_1^l, z_2^l, \dots)$: activation function inputs at layer l → a vector
Notation Summary
$a_i^l$: output of a neuron        $a^l$: output vector of a layer
$z_i^l$: input of the activation function    $z^l$: input vector of the activation function for a layer
$w_{ij}^l$: a weight               $W^l$: a weight matrix
$b_i^l$: a bias                    $b^l$: a bias vector
Layer Output Relation
How do we get from the previous layer's output $a^{l-1}$ to this layer's activation inputs $z^l$ and outputs $a^l$?
Layer Output Relation – from a to z
$z_i^l = \sum_j w_{ij}^l\, a_j^{l-1} + b_i^l \quad\Rightarrow\quad z^l = W^l a^{l-1} + b^l$
Layer Output Relation – from z to a
$a_i^l = \sigma(z_i^l)$, applied element-wise:
$\begin{bmatrix} a_1^l \\ a_2^l \\ \vdots \end{bmatrix} = \sigma\!\left(\begin{bmatrix} z_1^l \\ z_2^l \\ \vdots \end{bmatrix}\right) \quad\Rightarrow\quad a^l = \sigma(z^l)$
Layer Output Relation
$z^l = W^l a^{l-1} + b^l$
$a^l = \sigma(z^l)$
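A hedged sketch of this layer output relation, using sigmoid as the activation and arbitrary layer sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a_prev, W, b):
    """Map the previous layer's output a^{l-1} to this layer's output a^l."""
    z = W @ a_prev + b    # z^l = W^l a^{l-1} + b^l
    return sigmoid(z)     # a^l = sigma(z^l)

# Layer l-1 has 4 nodes, layer l has 3 nodes, so W is 3x4 and b is 3-dim.
rng = np.random.default_rng(1)
a_prev = rng.random(4)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
print(layer_forward(a_prev, W, b))
```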
Neural Network Formulation
◉ Fully connected feedforward network: $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$, mapping vector x to vector y layer by layer:
$a^1 = \sigma(W^1 x + b^1)$
$a^2 = \sigma(W^2 a^1 + b^2)$
  ⋮
$y = \sigma(W^L a^{L-1} + b^L)$
Neural Network Formulation
◉ Fully connected feedforward network, written as one composed function:
$y = f(x) = \sigma(W^L \cdots \sigma(W^2\, \sigma(W^1 x + b^1) + b^2) \cdots + b^L)$
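A hedged sketch of this full feedforward formulation; the toy layer sizes and random weights are placeholders, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """weights/biases are lists [W^1, ..., W^L] and [b^1, ..., b^L]."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # a^l = sigma(W^l a^{l-1} + b^l)
    return a

# A toy network with N = 4 inputs, one hidden layer of 5 units, M = 3 outputs.
rng = np.random.default_rng(2)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
biases = [np.zeros(5), np.zeros(3)]
x = rng.random(4)
print(forward(x, weights, biases))   # a 3-dim output vector y
```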
Activation Function
Activation functions come in several families: bounded functions (e.g. sigmoid), boolean/threshold functions, linear functions, and other non-linear functions.
Non-Linear Activation Function
◉ Sigmoid
◉ Tanh
◉ Rectified Linear Unit (ReLU)
Non-linear functions are frequently used in neural networks.
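The three common non-linear activations listed above, as a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                  # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)          # zero for negative inputs, identity otherwise

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```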
Why Non-Linearity?
◉ Function approximation
  ○ Without non-linearity, a deep neural network collapses into a single linear transform
  ○ With non-linearity, networks with more layers can approximate more complex functions
http://cs224d.stanford.edu/lectures/CS224d-Lecture4.pdf
What Does a "Good" Function Mean?
Function = Model Parameters
◉ Formal definition
  ○ Different parameters W and b give different functions; the function set corresponds to the model parameter set $\theta = \{W^1, b^1, W^2, b^2, \dots, W^L, b^L\}$
  ○ Picking a function f = picking a set of model parameters θ
Model Parameter Measurement
◉ Define a function to measure the quality of a parameter set θ
  ○ Evaluate with a loss/cost/error function C(θ) → how bad θ is; best parameter set: $\theta^* = \arg\min_\theta C(\theta)$
  ○ Evaluate with an objective/reward function O(θ) → how good θ is; best parameter set: $\theta^* = \arg\max_\theta O(\theta)$
Loss Function Example
Training data: {(x¹, ŷ¹), (x², ŷ²), …}, e.g. x: "It claims too much." (function input), ŷ: negative (function output).
A "good" function f* makes f(x^k; θ) close to ŷ^k for every training sample.
Define an example loss function that sums the error over all training samples:
$C(\theta) = \sum_k L\big(f(x^k; \theta), \hat{y}^k\big)$
Common Loss Functions
◉ Square loss
◉ Hinge loss
◉ Logistic loss
◉ Cross entropy loss
◉ Others: large margin, etc.
https://en.wikipedia.org/wiki/Loss_functions_for_classification
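Hedged sketches of two of the losses above, evaluated on a single training sample (the example numbers are arbitrary):

```python
import numpy as np

def square_loss(y, y_hat):
    """Sum of squared differences between prediction y and label y_hat."""
    return np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)

def cross_entropy_loss(y, y_hat, eps=1e-12):
    """-sum_i y_hat_i * log(y_i) for a probability-like prediction y."""
    y = np.clip(np.asarray(y, dtype=float), eps, 1.0)
    return -np.sum(np.asarray(y_hat) * np.log(y))

y_hat = np.array([0.0, 1.0, 0.0])   # one-hot label
y = np.array([0.1, 0.7, 0.2])       # network output
print(square_loss(y, y_hat), cross_entropy_loss(y, y_hat))
```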
How Can We Pick the "Best" Function?
Problem Statement
◉ Given a loss function and candidate model parameter sets
  ○ Loss function: C(θ)
  ○ Model parameter sets: $\theta = \{W^1, b^1, W^2, b^2, \dots\}$
◉ Find the model parameter set θ* that minimizes C(θ)
How to solve this optimization problem?
◉ 1) Brute force – enumerate all possible θ
◉ 2) Calculus – solve $\nabla C(\theta) = 0$ for θ
Issue: the whole space of C(θ) is unknown.
Gradient Descent for Optimization
◉ Assume that θ has only one variable; $\theta^0, \theta^1, \theta^2, \theta^3, \dots$ denote the model at the i-th iteration on the curve C(θ).
Idea: drop a ball and find the position where it stops rolling (a local minimum).
Gradient Descent for Optimization
◉ Assume that θ has only one variable
  • Randomly start at $\theta^0$
  • Compute $dC(\theta^0)/d\theta$ and update: $\theta^1 = \theta^0 - \eta\, dC(\theta^0)/d\theta$
  • Compute $dC(\theta^1)/d\theta$ and update: $\theta^2 = \theta^1 - \eta\, dC(\theta^1)/d\theta$
  • …
η is the "learning rate".
Gradient Descent for Optimization
◉ Assume that θ has two variables {θ₁, θ₂}
  • Randomly start at $\theta^0 = (\theta_1^0, \theta_2^0)$
  • Compute the gradient of C(θ) at $\theta^0$: $\nabla C(\theta^0) = \big(\partial C(\theta^0)/\partial\theta_1,\ \partial C(\theta^0)/\partial\theta_2\big)$
  • Update parameters: $\theta^1 = \theta^0 - \eta\, \nabla C(\theta^0)$
  • Compute the gradient of C(θ) at $\theta^1$, update again, and repeat
Gradient Descent for Optimization
The parameters move in the direction opposite to the gradient: from $\theta^0$ along $-\eta\nabla C(\theta^0)$ to $\theta^1$, then along $-\eta\nabla C(\theta^1)$ to $\theta^2$, and so on.
Algorithm
  Initialization: start at $\theta^0$
  while ($\theta^{(i+1)} \neq \theta^i$) {
      compute gradient $\nabla C(\theta^i)$
      update parameters: $\theta^{(i+1)} = \theta^i - \eta\, \nabla C(\theta^i)$
  }
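A hedged sketch of this loop on a toy two-variable loss $C(\theta) = (\theta_1 - 3)^2 + (\theta_2 + 1)^2$ (my choice of loss, start point, and learning rate, not from the slides):

```python
import numpy as np

def C(theta):
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

def grad_C(theta):
    return np.array([2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)])

eta = 0.1                       # learning rate
theta = np.array([0.0, 0.0])    # arbitrary starting point theta^0
for i in range(100):
    theta = theta - eta * grad_C(theta)   # theta^{i+1} = theta^i - eta * grad C(theta^i)

print(theta, C(theta))          # converges near (3, -1), where C is minimal
```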
Revisit Neural Network Formulation
◉ Fully connected feedforward network: $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$, with parameters $\theta = \{W^1, b^1, \dots, W^L, b^L\}$

Gradient Descent for Neural Network
Apply the same algorithm, where θ now collects all weight matrices and bias vectors:
  Initialization: start at $\theta^0$
  while ($\theta^{(i+1)} \neq \theta^i$) {
      compute gradient $\nabla C(\theta^i)$
      update parameters: $\theta^{(i+1)} = \theta^i - \eta\, \nabla C(\theta^i)$
  }
Gradient Descent for Optimization – Simple Case
A single sigmoid neuron: $y = \sigma(z) = \sigma(w_1 x_1 + w_2 x_2 + b)$, with parameters θ = {w₁, w₂, b}.
Apply the same algorithm: start at $\theta^0$, then repeatedly compute the gradient and update the parameters.
Gradient Descent for Optimization – Simple Case: Three Parameters & Square Error Loss
◉ Update the three parameters {w₁, w₂, b} at the t-th iteration:
  $w_1^{t+1} = w_1^t - \eta\, \partial C/\partial w_1$, $\quad w_2^{t+1} = w_2^t - \eta\, \partial C/\partial w_2$, $\quad b^{t+1} = b^t - \eta\, \partial C/\partial b$
◉ Square error loss for one sample: $C = (y - \hat{y})^2$, with $y = \sigma(z)$ and $z = w_1 x_1 + w_2 x_2 + b$
◉ Chain rule, using the sigmoid function's derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:
  $\frac{\partial C}{\partial w_1} = \frac{\partial C}{\partial y}\frac{\partial y}{\partial z}\frac{\partial z}{\partial w_1} = 2(y - \hat{y})\,\sigma(z)(1-\sigma(z))\,x_1$
  $\frac{\partial C}{\partial w_2} = 2(y - \hat{y})\,\sigma(z)(1-\sigma(z))\,x_2, \qquad \frac{\partial C}{\partial b} = 2(y - \hat{y})\,\sigma(z)(1-\sigma(z))$
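A hedged sketch of this simple case: one sigmoid neuron, square error loss, chain-rule gradients, and repeated updates (the training values and learning rate are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(w1, w2, b, x1, x2, y_hat, eta=0.5):
    z = w1 * x1 + w2 * x2 + b
    y = sigmoid(z)
    # chain rule: dC/dw = dC/dy * dy/dz * dz/dw, with C = (y - y_hat)^2
    dC_dz = 2.0 * (y - y_hat) * y * (1.0 - y)   # sigma'(z) = y(1 - y)
    w1 -= eta * dC_dz * x1
    w2 -= eta * dC_dz * x2
    b  -= eta * dC_dz * 1.0
    return w1, w2, b

w1, w2, b = 0.1, -0.2, 0.0
for t in range(200):
    w1, w2, b = step(w1, w2, b, x1=1.0, x2=0.5, y_hat=1.0)
print(sigmoid(w1 * 1.0 + w2 * 0.5 + b))   # moves toward the target 1.0
```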
Optimization Algorithm
  Initialization: set the parameters w, b at random
  while (stopping criteria not met) {
      for each training sample (x, ŷ): compute the gradient and update the parameters
  }
Gradient Descent for Neural Network
Computing the gradient involves millions of parameters; to compute it efficiently, we use backpropagation.
  Initialization: start at $\theta^0$
  while ($\theta^{(i+1)} \neq \theta^i$) {
      compute gradient $\nabla C(\theta^i)$
      update parameters: $\theta^{(i+1)} = \theta^i - \eta\, \nabla C(\theta^i)$
  }
Gradient Descent Issue
Training data: {(x¹, ŷ¹), (x², ŷ²), …}
The model can only be updated after seeing all training samples → slow.
Stochastic Gradient Descent (SGD)
◉ Gradient Descent: compute the gradient of the loss summed over all training samples {(x¹, ŷ¹), (x², ŷ²), …}
◉ Stochastic Gradient Descent (SGD)
  ○ Pick one training sample x^k and update using the gradient of its loss only
  ○ If all training samples have the same probability of being picked, the expected update matches gradient descent
The model can be updated after seeing one training sample → faster.
Epoch Definition
◉ When running SGD, the model starts at $\theta^0$ and is updated once per sample: pick x¹, pick x², …, pick x^k, …, pick x^K over the training data {(x¹, ŷ¹), (x², ŷ²), …}, then start again from x¹.
Seeing all training samples once → one epoch.
Gradient Descent vs. SGD
◉ Gradient Descent: update after seeing all examples (one update per epoch)
◉ Stochastic Gradient Descent: sees only one example per update; if there are 20 examples, it updates 20 times in one epoch
SGD approaches the target point faster than gradient descent.
Mini-Batch SGD
◉ Batch Gradient Descent: use all K samples in each iteration
◉ Stochastic Gradient Descent (SGD): pick one training sample x^k; use 1 sample in each iteration
◉ Mini-Batch SGD: pick a set of B training samples as a batch b; use B samples in each iteration (B is the "batch size")
A sketch contrasting the three update styles follows.
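A hedged sketch of the three styles on the earlier single-neuron square-error model: setting B = 20 (all samples) gives batch gradient descent, B = 1 gives SGD, anything in between is mini-batch SGD (the data, sizes, and learning rate are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(w, b, X, Y_hat):
    """Average gradient of (y - y_hat)^2 over the rows of X."""
    y = sigmoid(X @ w + b)
    d = 2.0 * (y - Y_hat) * y * (1.0 - y)
    return X.T @ d / len(Y_hat), d.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                 # K = 20 samples, 2 features
Y_hat = (X[:, 0] + X[:, 1] > 0).astype(float)
w, b, eta, B = np.zeros(2), 0.0, 0.5, 5      # B is the batch size

for epoch in range(10):
    idx = rng.permutation(len(Y_hat))        # shuffle once per epoch
    for start in range(0, len(Y_hat), B):    # B=20: batch GD, B=1: SGD
        batch = idx[start:start + B]
        gw, gb = gradient(w, b, X[batch], Y_hat[batch])
        w, b = w - eta * gw, b - eta * gb
print(w, b)
```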
Batch vs. Mini-Batch
Handwriting digit classification: training curves comparing batch size = 1 (SGD) against full-batch gradient descent.
Gradient Descent v.s. SGD v.s. Mini-Batch
Figure: training time (sec) against batch size, from 1 (SGD) through 10, 100, 1000, 10000 (mini-batch) to full (gradient descent).
Why is mini-batch faster than SGD?
Training speed: mini-batch > SGD > gradient descent.
SGD vs. Mini-Batch
◉ Stochastic Gradient Descent (SGD): compute $z^1 = W^1 x$ separately for each sample, one matrix–vector product at a time
◉ Mini-Batch SGD: stack the samples of a batch side by side and compute all of their $z^1$ values in a single matrix–matrix product
Modern computers run matrix–matrix multiplication faster than a sequence of matrix–vector multiplications.
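A hedged demo of this point: one matrix–matrix product over a batch versus a Python loop of matrix–vector products (the sizes are chosen arbitrarily):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))
X = rng.normal(size=(512, 256))   # 256 input vectors stacked as columns

t0 = time.perf_counter()
Z_loop = np.stack([W @ X[:, k] for k in range(X.shape[1])], axis=1)
t1 = time.perf_counter()
Z_batch = W @ X                   # the whole batch in one matrix-matrix product
t2 = time.perf_counter()

print(np.allclose(Z_loop, Z_batch))            # same result either way
print(f"loop: {t1 - t0:.4f}s, batched: {t2 - t1:.4f}s")
```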
Big Issue: Local Optima
Neural network training has no guarantee of reaching the globally optimal solution.
Initialization
◉ Different initial parameters may result in different trained models
Do not initialize all parameters to the same value → set them randomly.
Learning Rate
Figure: cost against the number of parameter updates for different learning rates (very large, large, small, just right), shown alongside the error surface.
The learning rate should be set carefully.
Tips for Mini-Batch Training
◉ Shuffle the training samples before every epoch
  ○ otherwise the network might memorize the order in which samples are fed
◉ Use a fixed batch size for every epoch
  ○ enables a fast matrix-multiplication implementation of the calculations
◉ Adapt the learning rate to the batch size
  ○ K times the batch size → (theoretically) K times the learning rate
A small batching sketch follows below.
http://stackoverflow.com/questions/13693966/neural-net-selecting-data-for-each-mini-batch
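A hedged sketch of the first two tips: reshuffle every epoch and keep a fixed batch size (here the trailing partial batch is dropped by choice; the data is a placeholder):

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    idx = rng.permutation(len(Y))            # new sample order each epoch
    for start in range(0, len(Y) - batch_size + 1, batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], Y[batch]

rng = np.random.default_rng(0)
X, Y = np.arange(20).reshape(10, 2), np.arange(10)
for epoch in range(2):
    for xb, yb in iterate_minibatches(X, Y, batch_size=4, rng=rng):
        pass  # compute the gradient on (xb, yb) and update the parameters here
```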
Learning Recipe
Training data provides labeled pairs (x, ŷ) for learning the "best" function f*. Testing data is split into a validation set and a real testing set; f* maps each x to a prediction y. On the validation set we immediately know the performance; on the real testing set we do not know the performance until submission.
Learning Recipe
◉ If you do not get good results on the training set, modify the training process. Possible reasons:
  ○ no good function exists (bad hypothesis function set) → reconstruct the model architecture
  ○ cannot find a good function (stuck in local optima) → change the training strategy
◉ If the training results are good, check the results on the dev/validation set.
Learning Recipe
If the results on the training set are good but the results on the dev/validation set are worse → overfitting → apply techniques to prevent overfitting, then re-check the training results. If both are good → done.
Overfitting
◉ Possible solutions
  ○ more training samples
  ○ some tips: dropout, etc.
Concluding Remarks
◉ Q1. What is the model? → Model architecture: the hypothesis function set {f₁, f₂, …}
◉ Q2. What does a "good" function mean? → Loss function design
◉ Q3. How do we pick the "best" function f*? → Optimization