Neural Network Basics
Applied Deep Learning
September 15th, 2022 http://adl.miulab.tw
Learning ≈ Looking for a Function
◉ Handwritten Recognition: f(image of a handwritten digit) = "2"
◉ Speech Recognition: f(audio signal) = "你好" (hello)
◉ Weather Forecast: f(weather data up to Thursday) = forecast for "Saturday"
◉ Play Video Games: f(game screen) = "move left"
Machine Learning Framework
Training is to pick the best function given the observed data; testing is to predict the label using the learned function.
◉ Training Data: {(x¹, ŷ¹), (x², ŷ²), …}
  e.g. x: "It claims too much." (function input), ŷ: negative (function output)
◉ Model: hypothesis function set {f₁, f₂, …}
◉ Training: pick the "best" function f*
◉ Testing Data: {(x, ?), …}
◉ Testing: y = f*(x)
How to Train a Model?
Machine Learning Framework (recap)
Training picks the best function f* from the hypothesis set given the training data {(x¹, ŷ¹), (x², ŷ²), …}; testing predicts y = f*(x) on new inputs.
Training Procedure
◉ Q1. What is the model? (function hypothesis set {f₁, f₂, …})
◉ Q2. What does a "good" function mean?
◉ Q3. How do we pick the "best" function f*?
Training Procedure Outline
◉ Model Architecture
  ✓ A Single Layer of Neurons (Perceptron)
  ✓ Limitation of Perceptron
  ✓ Neural Network Model (Multi-Layer Perceptron)
◉ Loss Function Design
  ✓ Function = Model Parameters
  ✓ Model Parameter Measurement
◉ Optimization
  ✓ Gradient Descent
  ✓ Stochastic Gradient Descent (SGD)
  ✓ Mini-Batch SGD
  ✓ Practical Tips
What is the Model?
Classification Task
◉ Sentiment Analysis: "這規格有誠意!" ("These specs show sincerity!") → +, "太爛了吧~" ("That's so bad~") → −
◉ Speech Phoneme Recognition: audio frame → /h/
◉ Handwritten Recognition: image → "2"
Binary classification: an input object is assigned to Class A (yes) or Class B (no).
Multi-class classification: an input object is assigned to one of Class A, Class B, Class C, ….
Some cases are not easy to formulate as classification problems.
Target Function
◉ Classification Task: $y = f(x)$, where $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$
  ○ x: input object to be classified → an N-dim vector
  ○ y: class/label → an M-dim vector
Assume both x and y can be represented as fixed-size vectors.
Vector Representation Example
◉ Handwriting Digit Classification: $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$
  ○ x: image → a 16 × 16 image gives 16 × 16 = 256 dimensions; each pixel corresponds to an element in the vector (1 for ink, 0 otherwise)
  ○ y: class/label → 10 dimensions for digit recognition; each element indicates one class ("1" or not, "2" or not, "3" or not, …), so "1" → (1, 0, 0, …) and "2" → (0, 1, 0, …)
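A minimal sketch of these vector representations, assuming the 16 × 16 binary image and 10 digit classes from the example (the function names are illustrative, not from the slides):

```python
import numpy as np

def image_to_vector(image):
    """Flatten a 16x16 binary image (1 = ink, 0 = blank) into a 256-dim vector x."""
    image = np.asarray(image)
    assert image.shape == (16, 16)
    return image.reshape(256).astype(np.float32)

def label_to_one_hot(digit, num_classes=10):
    """Encode a digit label as a 10-dim one-hot vector ŷ."""
    y_hat = np.zeros(num_classes, dtype=np.float32)
    y_hat[digit] = 1.0
    return y_hat

print(label_to_one_hot(2))   # the digit 2 -> a 1 in position 2, 0 elsewhere
```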
Vector Representation Example
◉ Sentiment Analysis: $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$
  ○ x: word (e.g. "love") → a vector whose dimension is the size of the vocabulary; each element corresponds to a word in the vocabulary (1 indicates the word, 0 otherwise)
  ○ y: class/label → 3 dimensions (positive, negative, neutral); each element indicates one class ("+" or not, "−" or not, "?" or not)
A Single Neuron
Inputs $x_1, x_2, \dots, x_N$ are weighted by $w_1, w_2, \dots, w_N$ and summed with a bias $b$:
$z = w_1 x_1 + w_2 x_2 + \dots + w_N x_N + b$
The output is $y = \sigma(z)$, where the activation function $\sigma$ is the sigmoid function:
$\sigma(z) = \frac{1}{1 + e^{-z}}$
Each neuron is a very simple function.
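A minimal sketch of this single sigmoid neuron (the helper names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def single_neuron(x, w, b):
    """y = sigmoid(w . x + b) for an N-dim input x."""
    z = np.dot(w, x) + b
    return sigmoid(z)

x = np.array([0.5, -1.0, 2.0])   # N = 3 inputs
w = np.array([1.0, 0.2, -0.5])   # one weight per input
b = 0.1                          # bias: an "always on" feature
print(single_neuron(x, w, b))    # a value in (0, 1)
```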
A Single Neuron
The bias term is an "always on" feature: it behaves like a weight attached to a constant input of 1.
Why Bias?
The bias b shifts the activation $\sigma(z)$ along the z-axis; the bias term gives a class prior.
Model Parameters of A Single Neuron
$y = \sigma(z) = \sigma(w_1 x_1 + \dots + w_N x_N + b)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$
w and b are the parameters of this neuron.
A Single Neuron
A single neuron implements $f: \mathbb{R}^N \rightarrow \mathbb{R}$. For digit recognition we can threshold its output: if y ≥ 0.5 the input is "2"; if y < 0.5 it is not "2".
A single neuron can only handle binary classification.
A Layer of Neurons
◉ Handwriting digit classification: $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$
Every input $x_1, \dots, x_N$ (plus a constant 1 for the bias) feeds each neuron; neuron i outputs $y_i$ ("1" or not, "2" or not, "3" or not, …). With 10 neurons we cover 10 classes.
A layer of neurons can handle multiple possible outputs; the predicted result is the class whose output is the maximum ("which one is max?").
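A hedged sketch of one layer of sigmoid neurons for the 10-class digit example (the 256-dim input matches the earlier 16 × 16 image; weights are random placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b):
    """One layer: each row of W holds the weights of one neuron."""
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 256))   # 10 neurons, 256 inputs each
b = np.zeros(10)
x = rng.integers(0, 2, size=256).astype(float)  # a fake binary image

y = layer(x, W, b)               # 10 scores, one per class
print("predicted class:", int(np.argmax(y)))    # "which one is max?"
```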
A Layer of Neurons – Perceptron
◉ Output units all operate separately – no shared weights
Adjusting the weights moves the location, orientation, and steepness of the cliff formed by each unit's output.
http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
Expression of Perceptron
A perceptron unit computes $y = \sigma(w_1 x_1 + w_2 x_2 + b)$. It can represent AND, OR, NOT, etc., but not XOR, because a single unit is only a linear separator.
http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
How to Implement XOR?
A   B   A xor B
0   0      0
0   1      1
1   0      1
1   1      0

A xor B = A·B' + A'·B
Chaining multiple operations can produce more complicated outputs: from A and B, first compute A' and B' (NOT), then A·B' and A'·B (AND), then their sum A·B' + A'·B (OR).
Neural Networks – Multi-Layer Perceptron
Inputs $x_1, x_2$ feed a layer of hidden units (each computing $z$ and then $a = \sigma(z)$); the hidden units' outputs $a_1, a_2$ in turn feed the output neuron that produces y.
Expression of Multi-Layer Perceptron
◉ Continuous function w/ 2 layers
  ○ Combine two opposite-facing threshold functions to make a ridge
◉ Continuous function w/ 3 layers
  ○ Combine two perpendicular ridges to make a bump
  ○ Add bumps of various sizes and locations to fit any surface
http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
Multiple layers enhance the model's expressiveness → the model can approximate more complex functions.
Deep Neural Networks (DNN)
◉ Fully connected feedforward network: $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$
The input vector x = (x₁, …, x_N) passes through Layer 1, Layer 2, …, Layer L to produce the output vector y = (y₁, …, y_M).
Deep NN: multiple hidden layers.
Notation Definition
Layer l−1 has $N_{l-1}$ nodes and layer l has $N_l$ nodes.
$a_i^l$: output of neuron i at layer l
$a^l = (a_1^l, a_2^l, \dots)$: output of one layer → a vector
Notation Definition
$w_{ij}^l$: the weight from neuron j (layer l−1) to neuron i (layer l)
$W^l$: the weights between layer l−1 and layer l → a matrix
Notation Definition
$b_i^l$: bias for neuron i at layer l
$b^l = (b_1^l, b_2^l, \dots)$: biases of all neurons at layer l → a vector
Notation Definition
$z_i^l$: input of the activation function for neuron i at layer l
$z^l = (z_1^l, z_2^l, \dots)$: activation function inputs at layer l → a vector
Notation Summary
$a_i^l$: output of a neuron        $a^l$: output vector of a layer
$z_i^l$: input of the activation function    $z^l$: input vector of the activation function for a layer
$w_{ij}^l$: a weight               $W^l$: a weight matrix
$b_i^l$: a bias                    $b^l$: a bias vector
Layer Output Relation
How do we get from the previous layer's output $a^{l-1}$ to this layer's activation inputs $z^l$ and outputs $a^l$?
Layer Output Relation – from a to z
$z_i^l = \sum_j w_{ij}^l\, a_j^{l-1} + b_i^l \quad\Rightarrow\quad z^l = W^l a^{l-1} + b^l$
Layer Output Relation – from z to a
$a_i^l = \sigma(z_i^l)$, applied element-wise:
$\begin{bmatrix} a_1^l \\ a_2^l \\ \vdots \end{bmatrix} = \sigma\!\left(\begin{bmatrix} z_1^l \\ z_2^l \\ \vdots \end{bmatrix}\right) \quad\Rightarrow\quad a^l = \sigma(z^l)$
Layer Output Relation
$z^l = W^l a^{l-1} + b^l$
$a^l = \sigma(z^l)$
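A hedged sketch of this layer output relation, using sigmoid as the activation and arbitrary layer sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a_prev, W, b):
    """Map the previous layer's output a^{l-1} to this layer's output a^l."""
    z = W @ a_prev + b    # z^l = W^l a^{l-1} + b^l
    return sigmoid(z)     # a^l = sigma(z^l)

# Layer l-1 has 4 nodes, layer l has 3 nodes, so W is 3x4 and b is 3-dim.
rng = np.random.default_rng(1)
a_prev = rng.random(4)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
print(layer_forward(a_prev, W, b))
```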
Neural Network Formulation
◉ Fully connected feedforward network: $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$, mapping vector x to vector y layer by layer:
$a^1 = \sigma(W^1 x + b^1)$
$a^2 = \sigma(W^2 a^1 + b^2)$
  ⋮
$y = \sigma(W^L a^{L-1} + b^L)$
Neural Network Formulation
◉ Fully connected feedforward network, written as one composed function:
$y = f(x) = \sigma(W^L \cdots \sigma(W^2\, \sigma(W^1 x + b^1) + b^2) \cdots + b^L)$
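A hedged sketch of this full feedforward formulation; the toy layer sizes and random weights are placeholders, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """weights/biases are lists [W^1, ..., W^L] and [b^1, ..., b^L]."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # a^l = sigma(W^l a^{l-1} + b^l)
    return a

# A toy network with N = 4 inputs, one hidden layer of 5 units, M = 3 outputs.
rng = np.random.default_rng(2)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
biases = [np.zeros(5), np.zeros(3)]
x = rng.random(4)
print(forward(x, weights, biases))   # a 3-dim output vector y
```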
Activation Function
Activation functions come in several families: bounded functions (e.g. sigmoid), boolean/threshold functions, linear functions, and other non-linear functions.
Non-Linear Activation Function
◉ Sigmoid
◉ Tanh
◉ Rectified Linear Unit (ReLU)
Non-linear functions are frequently used in neural networks.
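The three common non-linear activations listed above, as a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                  # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)          # zero for negative inputs, identity otherwise

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```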
Why Non-Linearity?
◉ Function approximation
  ○ Without non-linearity, a deep neural network collapses into a single linear transform
  ○ With non-linearity, networks with more layers can approximate more complex functions
http://cs224d.stanford.edu/lectures/CS224d-Lecture4.pdf
What Does a "Good" Function Mean?
Function = Model Parameters
◉ Formal definition
  ○ Different parameters W and b give different functions; the function set corresponds to the model parameter set $\theta = \{W^1, b^1, W^2, b^2, \dots, W^L, b^L\}$
  ○ Picking a function f = picking a set of model parameters θ
Model Parameter Measurement
◉ Define a function to measure the quality of a parameter set θ
  ○ Evaluate with a loss/cost/error function C(θ) → how bad θ is; best parameter set: $\theta^* = \arg\min_\theta C(\theta)$
  ○ Evaluate with an objective/reward function O(θ) → how good θ is; best parameter set: $\theta^* = \arg\max_\theta O(\theta)$
Loss Function Example
Training data: {(x¹, ŷ¹), (x², ŷ²), …}, e.g. x: "It claims too much." (function input), ŷ: negative (function output).
A "good" function f* makes f(x^k; θ) close to ŷ^k for every training sample.
Define an example loss function that sums the error over all training samples:
$C(\theta) = \sum_k L\big(f(x^k; \theta), \hat{y}^k\big)$
Common Loss Functions
◉ Square loss
◉ Hinge loss
◉ Logistic loss
◉ Cross entropy loss
◉ Others: large margin, etc.
https://en.wikipedia.org/wiki/Loss_functions_for_classification
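Hedged sketches of two of the losses above, evaluated on a single training sample (the example numbers are arbitrary):

```python
import numpy as np

def square_loss(y, y_hat):
    """Sum of squared differences between prediction y and label y_hat."""
    return np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)

def cross_entropy_loss(y, y_hat, eps=1e-12):
    """-sum_i y_hat_i * log(y_i) for a probability-like prediction y."""
    y = np.clip(np.asarray(y, dtype=float), eps, 1.0)
    return -np.sum(np.asarray(y_hat) * np.log(y))

y_hat = np.array([0.0, 1.0, 0.0])   # one-hot label
y = np.array([0.1, 0.7, 0.2])       # network output
print(square_loss(y, y_hat), cross_entropy_loss(y, y_hat))
```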
How Can We Pick the "Best" Function?
Problem Statement
◉ Given a loss function and candidate model parameter sets
  ○ Loss function: C(θ)
  ○ Model parameter sets: $\theta = \{W^1, b^1, W^2, b^2, \dots\}$
◉ Find the model parameter set θ* that minimizes C(θ)
How to solve this optimization problem?
◉ 1) Brute force – enumerate all possible θ
◉ 2) Calculus – solve $\nabla C(\theta) = 0$ for θ
Issue: the whole space of C(θ) is unknown.
Gradient Descent for Optimization
◉ Assume that θ has only one variable; $\theta^0, \theta^1, \theta^2, \theta^3, \dots$ denote the model at the i-th iteration on the curve C(θ).
Idea: drop a ball and find the position where it stops rolling (a local minimum).
Gradient Descent for Optimization
◉ Assume that θ has only one variable
  • Randomly start at $\theta^0$
  • Compute $dC(\theta^0)/d\theta$ and update: $\theta^1 = \theta^0 - \eta\, dC(\theta^0)/d\theta$
  • Compute $dC(\theta^1)/d\theta$ and update: $\theta^2 = \theta^1 - \eta\, dC(\theta^1)/d\theta$
  • …
η is the "learning rate".
Gradient Descent for Optimization
◉ Assume that θ has two variables {θ₁, θ₂}
  • Randomly start at $\theta^0 = (\theta_1^0, \theta_2^0)$
  • Compute the gradient of C(θ) at $\theta^0$: $\nabla C(\theta^0) = \big(\partial C(\theta^0)/\partial\theta_1,\ \partial C(\theta^0)/\partial\theta_2\big)$
  • Update parameters: $\theta^1 = \theta^0 - \eta\, \nabla C(\theta^0)$
  • Compute the gradient of C(θ) at $\theta^1$, update again, and repeat
Gradient Descent for Optimization
The parameters move in the direction opposite to the gradient: from $\theta^0$ along $-\eta\nabla C(\theta^0)$ to $\theta^1$, then along $-\eta\nabla C(\theta^1)$ to $\theta^2$, and so on.
Algorithm
  Initialization: start at $\theta^0$
  while ($\theta^{(i+1)} \neq \theta^i$) {
      compute gradient $\nabla C(\theta^i)$
      update parameters: $\theta^{(i+1)} = \theta^i - \eta\, \nabla C(\theta^i)$
  }
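A hedged sketch of this loop on a toy two-variable loss $C(\theta) = (\theta_1 - 3)^2 + (\theta_2 + 1)^2$ (my choice of loss, start point, and learning rate, not from the slides):

```python
import numpy as np

def C(theta):
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

def grad_C(theta):
    return np.array([2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)])

eta = 0.1                       # learning rate
theta = np.array([0.0, 0.0])    # arbitrary starting point theta^0
for i in range(100):
    theta = theta - eta * grad_C(theta)   # theta^{i+1} = theta^i - eta * grad C(theta^i)

print(theta, C(theta))          # converges near (3, -1), where C is minimal
```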
Revisit Neural Network Formulation
◉ Fully connected feedforward network: $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$, with parameters $\theta = \{W^1, b^1, \dots, W^L, b^L\}$

Gradient Descent for Neural Network
Apply the same algorithm, where θ now collects all weight matrices and bias vectors:
  Initialization: start at $\theta^0$
  while ($\theta^{(i+1)} \neq \theta^i$) {
      compute gradient $\nabla C(\theta^i)$
      update parameters: $\theta^{(i+1)} = \theta^i - \eta\, \nabla C(\theta^i)$
  }
Gradient Descent for Optimization – Simple Case
A single sigmoid neuron: $y = \sigma(z) = \sigma(w_1 x_1 + w_2 x_2 + b)$, with parameters θ = {w₁, w₂, b}.
Apply the same algorithm: start at $\theta^0$, then repeatedly compute the gradient and update the parameters.
Gradient Descent for Optimization – Simple Case: Three Parameters & Square Error Loss
◉ Update the three parameters {w₁, w₂, b} at the t-th iteration:
  $w_1^{t+1} = w_1^t - \eta\, \partial C/\partial w_1$, $\quad w_2^{t+1} = w_2^t - \eta\, \partial C/\partial w_2$, $\quad b^{t+1} = b^t - \eta\, \partial C/\partial b$
◉ Square error loss for one sample: $C = (y - \hat{y})^2$, with $y = \sigma(z)$ and $z = w_1 x_1 + w_2 x_2 + b$
◉ Chain rule, using the sigmoid function's derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:
  $\frac{\partial C}{\partial w_1} = \frac{\partial C}{\partial y}\frac{\partial y}{\partial z}\frac{\partial z}{\partial w_1} = 2(y - \hat{y})\,\sigma(z)(1-\sigma(z))\,x_1$
  $\frac{\partial C}{\partial w_2} = 2(y - \hat{y})\,\sigma(z)(1-\sigma(z))\,x_2, \qquad \frac{\partial C}{\partial b} = 2(y - \hat{y})\,\sigma(z)(1-\sigma(z))$
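A hedged sketch of this simple case: one sigmoid neuron, square error loss, chain-rule gradients, and repeated updates (the training values and learning rate are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(w1, w2, b, x1, x2, y_hat, eta=0.5):
    z = w1 * x1 + w2 * x2 + b
    y = sigmoid(z)
    # chain rule: dC/dw = dC/dy * dy/dz * dz/dw, with C = (y - y_hat)^2
    dC_dz = 2.0 * (y - y_hat) * y * (1.0 - y)   # sigma'(z) = y(1 - y)
    w1 -= eta * dC_dz * x1
    w2 -= eta * dC_dz * x2
    b  -= eta * dC_dz * 1.0
    return w1, w2, b

w1, w2, b = 0.1, -0.2, 0.0
for t in range(200):
    w1, w2, b = step(w1, w2, b, x1=1.0, x2=0.5, y_hat=1.0)
print(sigmoid(w1 * 1.0 + w2 * 0.5 + b))   # moves toward the target 1.0
```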
Optimization Algorithm
  Initialization: set the parameters w, b at random
  while (stopping criteria not met) {
      for each training sample (x, ŷ): compute the gradient and update the parameters
  }
Gradient Descent for Neural Network
Computing the gradient involves millions of parameters; to compute it efficiently, we use backpropagation.
  Initialization: start at $\theta^0$
  while ($\theta^{(i+1)} \neq \theta^i$) {
      compute gradient $\nabla C(\theta^i)$
      update parameters: $\theta^{(i+1)} = \theta^i - \eta\, \nabla C(\theta^i)$
  }
Gradient Descent Issue
Training data: {(x¹, ŷ¹), (x², ŷ²), …}
The model can only be updated after seeing all training samples → slow.
Stochastic Gradient Descent (SGD)
◉ Gradient Descent: compute the gradient of the loss summed over all training samples {(x¹, ŷ¹), (x², ŷ²), …}
◉ Stochastic Gradient Descent (SGD)
  ○ Pick one training sample x^k and update using the gradient of its loss only
  ○ If all training samples have the same probability of being picked, the expected update matches gradient descent
The model can be updated after seeing one training sample → faster.
Epoch Definition
◉ When running SGD, the model starts at $\theta^0$ and is updated once per sample: pick x¹, pick x², …, pick x^k, …, pick x^K over the training data {(x¹, ŷ¹), (x², ŷ²), …}, then start again from x¹.
Seeing all training samples once → one epoch.
Gradient Descent vs. SGD
◉ Gradient Descent: update after seeing all examples (one update per epoch)
◉ Stochastic Gradient Descent: sees only one example per update; if there are 20 examples, it updates 20 times in one epoch
SGD approaches the target point faster than gradient descent.
Mini-Batch SGD
◉ Batch Gradient Descent: use all K samples in each iteration
◉ Stochastic Gradient Descent (SGD): pick one training sample x^k; use 1 sample in each iteration
◉ Mini-Batch SGD: pick a set of B training samples as a batch b; use B samples in each iteration (B is the "batch size")
A sketch contrasting the three update styles follows.
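A hedged sketch of the three styles on the earlier single-neuron square-error model: setting B = 20 (all samples) gives batch gradient descent, B = 1 gives SGD, anything in between is mini-batch SGD (the data, sizes, and learning rate are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(w, b, X, Y_hat):
    """Average gradient of (y - y_hat)^2 over the rows of X."""
    y = sigmoid(X @ w + b)
    d = 2.0 * (y - Y_hat) * y * (1.0 - y)
    return X.T @ d / len(Y_hat), d.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                 # K = 20 samples, 2 features
Y_hat = (X[:, 0] + X[:, 1] > 0).astype(float)
w, b, eta, B = np.zeros(2), 0.0, 0.5, 5      # B is the batch size

for epoch in range(10):
    idx = rng.permutation(len(Y_hat))        # shuffle once per epoch
    for start in range(0, len(Y_hat), B):    # B=20: batch GD, B=1: SGD
        batch = idx[start:start + B]
        gw, gb = gradient(w, b, X[batch], Y_hat[batch])
        w, b = w - eta * gw, b - eta * gb
print(w, b)
```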
Batch vs. Mini-Batch
Handwriting digit classification: training curves comparing batch size = 1 (SGD) against full-batch gradient descent.
Gradient Descent v.s. SGD v.s. Mini-Batch
Figure: training time (sec) against batch size, from 1 (SGD) through 10, 100, 1000, 10000 (mini-batch) to full (gradient descent).
Why is mini-batch faster than SGD?
Training speed: mini-batch > SGD > gradient descent.
SGD vs. Mini-Batch
◉ Stochastic Gradient Descent (SGD): compute $z^1 = W^1 x$ separately for each sample, one matrix–vector product at a time
◉ Mini-Batch SGD: stack the samples of a batch side by side and compute all of their $z^1$ values in a single matrix–matrix product
Modern computers run matrix–matrix multiplication faster than a sequence of matrix–vector multiplications.
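A hedged demo of this point: one matrix–matrix product over a batch versus a Python loop of matrix–vector products (the sizes are chosen arbitrarily):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))
X = rng.normal(size=(512, 256))   # 256 input vectors stacked as columns

t0 = time.perf_counter()
Z_loop = np.stack([W @ X[:, k] for k in range(X.shape[1])], axis=1)
t1 = time.perf_counter()
Z_batch = W @ X                   # the whole batch in one matrix-matrix product
t2 = time.perf_counter()

print(np.allclose(Z_loop, Z_batch))            # same result either way
print(f"loop: {t1 - t0:.4f}s, batched: {t2 - t1:.4f}s")
```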
Big Issue: Local Optima
Neural network training has no guarantee of reaching the globally optimal solution.
Initialization
◉ Different initial parameters may result in different trained models
Do not initialize all parameters to the same value → set them randomly.
Learning Rate
Figure: cost against the number of parameter updates for different learning rates (very large, large, small, just right), shown alongside the error surface.
The learning rate should be set carefully.
Tips for Mini-Batch Training
◉ Shuffle the training samples before every epoch
  ○ otherwise the network might memorize the order in which samples are fed
◉ Use a fixed batch size for every epoch
  ○ enables a fast matrix-multiplication implementation of the calculations
◉ Adapt the learning rate to the batch size
  ○ K times the batch size → (theoretically) K times the learning rate
A small batching sketch follows below.
http://stackoverflow.com/questions/13693966/neural-net-selecting-data-for-each-mini-batch
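A hedged sketch of the first two tips: reshuffle every epoch and keep a fixed batch size (here the trailing partial batch is dropped by choice; the data is a placeholder):

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    idx = rng.permutation(len(Y))            # new sample order each epoch
    for start in range(0, len(Y) - batch_size + 1, batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], Y[batch]

rng = np.random.default_rng(0)
X, Y = np.arange(20).reshape(10, 2), np.arange(10)
for epoch in range(2):
    for xb, yb in iterate_minibatches(X, Y, batch_size=4, rng=rng):
        pass  # compute the gradient on (xb, yb) and update the parameters here
```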
Learning Recipe
Training data provides labeled pairs (x, ŷ) for learning the "best" function f*. Testing data is split into a validation set and a real testing set; f* maps each x to a prediction y. On the validation set we immediately know the performance; on the real testing set we do not know the performance until submission.
Learning Recipe
◉ If you do not get good results on the training set, modify the training process. Possible reasons:
  ○ no good function exists (bad hypothesis function set) → reconstruct the model architecture
  ○ cannot find a good function (stuck in local optima) → change the training strategy
◉ If the training results are good, check the results on the dev/validation set.
Learning Recipe
If the results on the training set are good but the results on the dev/validation set are worse → overfitting → apply techniques to prevent overfitting, then re-check the training results. If both are good → done.
Overfitting
◉ Possible solutions
  ○ more training samples
  ○ some tips: dropout, etc.
Concluding Remarks
◉ Q1. What is the model? → Model architecture: the hypothesis function set {f₁, f₂, …}
◉ Q2. What does a "good" function mean? → Loss function design
◉ Q3. How do we pick the "best" function f*? → Optimization