Slide credit: Prof. Hung-Yi Lee
Review
Learning ≈ Looking for a Function
◦ Handwriting recognition: f(image) = "2"
◦ Speech recognition: f(audio) = "你好" ("hello")
◦ Weather forecast: f(weather data up to Thursday) = Saturday's forecast
◦ Playing video games: f(game screen) = "move left"
Machine Learning Framework
Training is to pick the best function given the observed data; testing is to predict the label using the learned function.
◦ Training data: {(x¹, ŷ¹), (x², ŷ²), ⋯}
  e.g., x = "It claims too much." (function input), ŷ = "−" (negative) (function output)
◦ Model: hypothesis function set {f₁, f₂, ⋯}
◦ Training: pick the "best" function f*
◦ Testing data: {(x, ?), ⋯}
◦ Testing: y = f*(x)
How to Train a Model?
Machine Learning Framework (recap)
Training data {(x¹, ŷ¹), (x², ŷ²), ⋯} → model: hypothesis function set {f₁, f₂, ⋯} → training picks the "best" function f* → testing: y = f*(x) on testing data {(x, ?), ⋯}.
Training Procedure
Q1. What is the model? (hypothesis function set)
Q2. What does a "good" function mean?
Q3. How do we pick the "best" function?
Model: hypothesis function set {f₁, f₂, ⋯} → Training: pick the best function f*.
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
What is the Model?
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Classification Task
◦ Sentiment analysis: "這規格有誠意!" ("This spec shows sincerity!") → +, "太爛了吧~" ("This is terrible~") → −
◦ Speech phoneme recognition: audio → /h/
◦ Handwriting recognition: image → "2"
Binary classification: input object → Class A (yes) or Class B (no)
Multi-class classification: input object → Class A, Class B, or Class C
Some cases are not easy to formulate as classification problems.
Target Function
Classification task: y = f(x), where f: R^N → R^M
◦ x: input object to be classified (an N-dim vector)
◦ y: class/label (an M-dim vector)
Assume both x and y can be represented as fixed-size vectors.
Vector Representation Example
Handwriting digit classification: f: R^N → R^M with N = 256 and M = 10
◦ x (image): each pixel corresponds to one element of the vector, 1 for ink and 0 otherwise; a 16 × 16 image gives 256 dimensions.
◦ y (class/label): 10 dimensions for digit recognition; each element indicates "1" or not, "2" or not, "3" or not, ⋯
  e.g., "1" → [1, 0, 0, ⋯]ᵀ, "2" → [0, 1, 0, ⋯]ᵀ
Vector Representation Example
Sentiment analysis: f: R^N → R^M with N = vocabulary size and M = 3
◦ x (word): each element of the vector corresponds to a word in the vocabulary, 1 indicates the word and 0 otherwise; the dimensionality equals the size of the vocabulary; e.g., "love" → [⋯, 0, 1, 0, ⋯]ᵀ
◦ y (class/label): 3 dimensions (positive, negative, neutral); each element indicates "+" or not, "−" or not, "?" or not
  e.g., "+" → [1, 0, 0]ᵀ, "−" → [0, 1, 0]ᵀ
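A minimal sketch of the one-hot encodings described above, assuming NumPy; the toy vocabulary and index choices are illustrative, not from the slides:

```python
import numpy as np

def one_hot(index, dim):
    """Vector of length dim with a 1 at position `index` and 0 elsewhere."""
    v = np.zeros(dim)
    v[index] = 1.0
    return v

# y for digit "2": 10 dimensions, one per digit class
y_digit = one_hot(1, 10)                            # [0, 1, 0, ..., 0]

# x for the word "love": dimensionality = vocabulary size
vocab = ["hate", "love", "ok"]                      # illustrative toy vocabulary
x_word = one_hot(vocab.index("love"), len(vocab))   # [0, 1, 0]
```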
Target Function (recap)
Classification task: y = f(x), where f: R^N → R^M; x is an N-dim vector (the input object to be classified) and y is an M-dim vector (the class/label).
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
A Single Neuron
z = w₁x₁ + w₂x₂ + ⋯ + w_N x_N + b
y = σ(z) = 1 / (1 + e⁻ᶻ)
b: bias; σ: the sigmoid function, used as the activation function.
Each neuron is a very simple function.
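As a concrete illustration of this slide, a single sigmoid neuron computed with NumPy; the weights, bias, and input values are made up for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, 1.0])    # inputs x1..xN (N = 3 here)
w = np.array([0.5, -1.0, 2.0])   # weights w1..wN
b = -0.5                         # bias

z = np.dot(w, x) + b             # z = w1*x1 + ... + wN*xN + b
y = sigmoid(z)                   # y = 1 / (1 + e^(-z))
print(y)
```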
Why Bias?
The bias b can be seen as the weight on an "always on" input (a constant feature 1): even when every input is 0, z = b. The bias term gives the neuron a class prior.
Model Parameters of a Single Neuron
z = w₁x₁ + w₂x₂ + ⋯ + w_N x_N + b,  y = σ(z) = 1 / (1 + e⁻ᶻ)
w and b are the parameters of this neuron.
A Single Neuron
A single neuron maps an N-dim input to one output value y:
y ≥ 0.5 → "2"
y < 0.5 → not "2"
A single neuron can only handle binary classification.
A Layer of Neurons
Handwriting digit classification: f: R^N → R^M
y₁: "1" or not, y₂: "2" or not, y₃: "3" or not, ⋯ (10 neurons for 10 classes)
A layer of neurons can handle multiple possible outputs; the result depends on which output is the maximum ("Which one is max?").
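A sketch of a single layer of sigmoid neurons for 10 classes, with the prediction taken as the maximum output; the shapes and random values below are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

N, M = 256, 10                        # 16x16 image -> 10 digit classes
rng = np.random.default_rng(0)
W = rng.normal(size=(M, N))           # one weight vector per neuron/class
b = np.zeros(M)

x = rng.integers(0, 2, size=N).astype(float)   # a made-up binary "image"
y = sigmoid(W @ x + b)                # one output per class ("1" or not, "2" or not, ...)
predicted_class = int(np.argmax(y))   # "Which one is max?"
```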
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
A Layer of Neurons – Perceptron
Output units all operate separately – no shared weights.
Adjusting the weights moves the location, orientation, and steepness of the cliff.
http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
Expression of Perceptron
z = w₁x₁ + w₂x₂ + b
A perceptron is a linear separator: it can represent AND, OR, NOT, etc., but not XOR.
http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
How to Implement XOR?
A xor B = AB' + A'B

 A  B | A xor B
 0  0 |    0
 0  1 |    1
 1  0 |    1
 1  1 |    0

Combining multiple operations can produce more complicated outputs, as sketched below.
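One possible way to realize XOR with two layers of threshold units, in the spirit of A xor B = AB' + A'B: the hidden units act as OR and NAND, and the output unit ANDs them. The hand-picked weights are an illustration, not the slides' exact construction:

```python
def step(z):
    """Threshold (perceptron-style) activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def xor(a, b):
    h1 = step(a + b - 0.5)       # acts as OR(a, b)
    h2 = step(1.5 - a - b)       # acts as NAND(a, b)
    return step(h1 + h2 - 1.5)   # AND of the two hidden units = XOR(a, b)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))   # reproduces the truth table above
```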
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Neural Networks – Multi-Layer Perceptron
Inputs x₁, x₂ feed hidden units a₁ = σ(z₁) and a₂ = σ(z₂), which in turn feed the output y.

Expression of Multi-Layer Perceptron
◦ Continuous function with 2 layers: combine two opposite-facing threshold functions to make a ridge.
◦ Continuous function with 3 layers: combine two perpendicular ridges to make a bump; add bumps of various sizes and locations to fit any surface.
Multiple layers enhance the model's expressiveness: the model can approximate more complex functions.
http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
Deep Neural Networks (DNN)
Fully connected feedforward network: f: R^N → R^M
Input vector x (x₁, ⋯, x_N) → Layer 1 → Layer 2 → ⋯ → Layer L → output vector y (y₁, ⋯, y_M)
Deep NN: multiple hidden layers.
Notation Definition
a_i^l: output of neuron i at layer l (layer l has N_l nodes; layer l−1 has N_{l−1} nodes)
a^l: the outputs of one layer collected as a vector
Notation Definition
w_ij^l: weight from neuron j (layer l−1) to neuron i (layer l)
W^l: the weights between the two layers collected as a matrix
Notation Definition
b_i^l: bias for neuron i at layer l
b^l: the biases of all neurons at layer l collected as a vector
Notation Definition
z_i^l: input of the activation function for neuron i at layer l
z_i^l = w_i1^l a_1^{l−1} + w_i2^l a_2^{l−1} + ⋯ + b_i^l
z^l: the activation-function inputs at a layer collected as a vector
Notation Summary
a_i^l : output of a neuron                  a^l : output vector of a layer
z_i^l : input of the activation function    z^l : input vector of the activation function for a layer
w_ij^l : a weight                           W^l : a weight matrix
b_i^l : a bias                              b^l : a bias vector
Layer Output Relation
The outputs a^{l−1} of layer l−1 determine the inputs z^l at layer l, which in turn determine the outputs a^l of layer l.
Layer Output Relation – from a to z
z_i^l = w_i1^l a_1^{l−1} + w_i2^l a_2^{l−1} + ⋯ + b_i^l
In matrix form: z^l = W^l a^{l−1} + b^l
Layer Output Relation – from z to a
a_i^l = σ(z_i^l)
In vector form: a^l = σ(z^l)
Layer Output Relation
z^l = W^l a^{l−1} + b^l
a^l = σ(z^l)
Neural Network Formulation
Fully connected feedforward network: f: R^N → R^M, input vector x, output vector y
a^1 = σ(W^1 x + b^1)
a^2 = σ(W^2 a^1 + b^2)
⋯
y = σ(W^L a^{L−1} + b^L)
Neural Network Formulation
Putting the layers together:
y = f(x) = σ(W^L ⋯ σ(W^2 σ(W^1 x + b^1) + b^2) ⋯ + b^L)
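A minimal NumPy sketch of this formulation; the layer sizes and random initialization below are placeholders, not values from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Fully connected feedforward pass: a^l = sigmoid(W^l a^(l-1) + b^l)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

layer_sizes = [4, 5, 3]        # N = 4 inputs, one hidden layer of 5, M = 3 outputs
rng = np.random.default_rng(0)
weights = [rng.normal(size=(m, n)) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(m) for m in layer_sizes[1:]]

y = forward(rng.normal(size=layer_sizes[0]), weights, biases)
print(y)
```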
Activation Function
Candidate activation functions range from boolean (step) and linear functions to bounded, non-linear functions.
Non-Linear Activation Functions
◦ Sigmoid
◦ Tanh
◦ Rectified Linear Unit (ReLU)
Non-linear functions such as these are frequently used in neural networks (written out below for reference).
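The three activation functions named above, expressed with NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # bounded to (0, 1)

def tanh(z):
    return np.tanh(z)                  # bounded to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)          # 0 for z < 0, identity for z >= 0
```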
Why Non-Linearity?
Function approximation:
◦ Without non-linearity, a deep neural network is equivalent to a single linear transform (e.g., W²(W¹x) = (W²W¹)x).
◦ With non-linearity, networks with more layers can approximate more complex functions.
http://cs224d.stanford.edu/lectures/CS224d-Lecture4.pdf
What Does a "Good" Function Mean?
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Function = Model Parameters
Formal definition: the model parameter set θ = {W¹, b¹, W², b², ⋯, W^L, b^L}.
Different parameters W and b give different functions; all such functions together form the function set.
Picking a function f is the same as picking a set of model parameters θ.
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Model Parameter Measurement
Define a function to measure the quality of a parameter set θ:
◦ Evaluating by a loss/cost/error function C(θ): how bad θ is
  Best model parameter set: θ* = arg min_θ C(θ)
◦ Evaluating by an objective/reward function O(θ): how good θ is
  Best model parameter set: θ* = arg max_θ O(θ)
Loss Function Example
Training data: {(x¹, ŷ¹), (x², ŷ²), ⋯}; e.g., x = "It claims too much." (function input), ŷ = "−" (negative) (function output).
A "good" function makes f(xᵏ) close to ŷᵏ on the training data.
Define an example loss function as the sum over the errors of all training samples: C(θ) = Σₖ error(f_θ(xᵏ), ŷᵏ).
Frequent Loss Functions
◦ Square loss
◦ Hinge loss
◦ Logistic loss
◦ Cross entropy loss
◦ Others: large margin, etc.
https://en.wikipedia.org/wiki/Loss_functions_for_classification
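Two of the listed losses written out for a single training sample, assuming the target ŷ and the prediction y are vectors of the same length; the epsilon inside the log is added only for numerical safety:

```python
import numpy as np

def square_loss(y_hat, y):
    """Squared difference between the target y_hat and the prediction y."""
    return np.sum((y_hat - y) ** 2)

def cross_entropy_loss(y_hat, y, eps=1e-12):
    """Cross entropy between the target distribution y_hat and the prediction y."""
    return -np.sum(y_hat * np.log(y + eps))

y_hat = np.array([0.0, 1.0, 0.0])   # one-hot target, e.g. class "2"
y = np.array([0.1, 0.8, 0.1])       # model prediction
print(square_loss(y_hat, y), cross_entropy_loss(y_hat, y))
```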
How Can We Pick the "Best" Function?
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Problem Statement
Given a loss function C(θ) over candidate model parameter sets θ, find the parameter set that minimizes C(θ).
How to solve this optimization problem?
1) Brute force – enumerate all possible θ
2) Calculus – solve ∂C(θ)/∂θ = 0
Issue: the whole space of C(θ) is unknown.
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Gradient Descent for Optimization
Assume that θ has only one variable.
θⁱ: the model at the i-th iteration (θ⁰, θ¹, θ², θ³, ⋯).
Idea: drop a ball on the curve C(θ) and find the position where it stops rolling (a local minimum).
Gradient Descent for Optimization
Assume that θ has only one variable.
◦ Randomly start at θ⁰.
◦ Compute dC(θ⁰)/dθ and update: θ¹ = θ⁰ − η · dC(θ⁰)/dθ
◦ Compute dC(θ¹)/dθ and update: θ² = θ¹ − η · dC(θ¹)/dθ
◦ ⋯
η is the "learning rate".
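A sketch of the one-variable update θ ← θ − η · dC(θ)/dθ on a toy loss C(θ) = (θ − 3)²; the loss, starting point, and learning rate are made up for illustration:

```python
def C(theta):
    return (theta - 3.0) ** 2        # toy loss with its minimum at theta = 3

def dC(theta):
    return 2.0 * (theta - 3.0)       # derivative dC/d(theta)

eta = 0.1                            # learning rate
theta = -4.0                         # starting point theta^0
for i in range(100):
    theta = theta - eta * dC(theta)  # theta^(i+1) = theta^i - eta * dC(theta^i)/d(theta)
print(theta, C(theta))               # theta approaches 3, where the "ball" stops rolling
```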
Gradient Descent for Optimization
Assume that θ has two variables {θ₁, θ₂}.
Gradient Descent for Optimization
Assume that θ has two variables {θ₁, θ₂}.
• Randomly start at θ⁰ = [θ₁⁰, θ₂⁰]ᵀ
• Compute the gradient of C(θ) at θ⁰: ∇C(θ⁰) = [∂C(θ⁰)/∂θ₁, ∂C(θ⁰)/∂θ₂]ᵀ
• Update the parameters: θ¹ = θ⁰ − η∇C(θ⁰)
• Compute the gradient of C(θ) at θ¹ and update again: θ² = θ¹ − η∇C(θ¹)
Gradient Descent for Optimization
In the (θ₁, θ₂) plane, each movement is opposite to the gradient: θ⁰ → θ¹ → θ² → θ³, using ∇C(θ⁰), ∇C(θ¹), ∇C(θ²), ∇C(θ³).

Algorithm
Initialization: start at θ⁰
while (θ⁽ⁱ⁺¹⁾ ≠ θⁱ)
{
  compute the gradient at θⁱ
  update the parameters: θ⁽ⁱ⁺¹⁾ = θⁱ − η∇C(θⁱ)
}
Revisit Neural Network Formulation
Fully connected feedforward network: input vector x → Layer 1 → Layer 2 → ⋯ → Layer L → output vector y.
Gradient Descent for Neural Network
For a neural network, θ collects every weight and bias, so ∇C(θ) contains ∂C/∂w and ∂C/∂b for all of them.
Algorithm
Initialization: start at θ⁰
while (θ⁽ⁱ⁺¹⁾ ≠ θⁱ)
{
  compute the gradient at θⁱ
  update the parameters
}
Gradient Descent for Optimization – Simple Case
A single neuron: z = w₁x₁ + w₂x₂ + b, y = σ(z)
Algorithm
Initialization: start at θ⁰
while (θ⁽ⁱ⁺¹⁾ ≠ θⁱ)
{
  compute the gradient at θⁱ
  update the parameters
}
Gradient Descent for Optimization
Simple Case – Three Parameters & Square Error Loss
Update the three parameters at the t-th iteration:
w₁ ← w₁ − η ∂C/∂w₁,  w₂ ← w₂ − η ∂C/∂w₂,  b ← b − η ∂C/∂b
Square error loss: C(θ) = (ŷ − y)²
Gradient Descent for Optimization
Simple Case – Square Error Loss
Square error loss: C(θ) = (ŷ − y)², with y = σ(z) and z = w₁x₁ + w₂x₂ + b.
Gradient Descent for Optimization
Simple Case – Square Error Loss
By the chain rule and the sigmoid derivative σ'(z) = σ(z)(1 − σ(z)):
∂C/∂w₁ = (∂C/∂y)(∂y/∂z)(∂z/∂w₁) = 2(y − ŷ) · σ(z)(1 − σ(z)) · x₁
Gradient Descent for Optimization
Simple Case – Square Error Loss
Likewise, ∂C/∂w₂ = 2(y − ŷ) · σ(z)(1 − σ(z)) · x₂ and ∂C/∂b = 2(y − ŷ) · σ(z)(1 − σ(z)).
Gradient Descent for Optimization
Simple Case – Three Parameters & Square Error Loss
For the neuron z = w₁x₁ + w₂x₂ + b, y = σ(z), the t-th iteration updates the three parameters with these gradients:
w₁⁽ᵗ⁺¹⁾ = w₁⁽ᵗ⁾ − η · 2(y − ŷ)σ(z)(1 − σ(z)) x₁
w₂⁽ᵗ⁺¹⁾ = w₂⁽ᵗ⁾ − η · 2(y − ŷ)σ(z)(1 − σ(z)) x₂
b⁽ᵗ⁺¹⁾ = b⁽ᵗ⁾ − η · 2(y − ŷ)σ(z)(1 − σ(z))
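The simple case above as code: one gradient computation and parameter update for the neuron y = σ(w₁x₁ + w₂x₂ + b) under the square error loss. The sample values, initial parameters, and learning rate are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2, y_hat = 1.0, 0.5, 1.0         # one training sample (x, y_hat)
w1, w2, b, eta = 0.2, -0.3, 0.0, 0.5  # parameters and learning rate

z = w1 * x1 + w2 * x2 + b
y = sigmoid(z)

# chain rule: dC/dw = dC/dy * dy/dz * dz/dw, with C = (y_hat - y)^2
common = 2.0 * (y - y_hat) * sigmoid(z) * (1.0 - sigmoid(z))
dw1, dw2, db = common * x1, common * x2, common

w1, w2, b = w1 - eta * dw1, w2 - eta * dw2, b - eta * db   # one gradient descent step
```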
Optimization Algorithm
Algorithm
Initialization: set the parameters θ, b at random
while (stopping criteria not met)
{
  for each training sample {x, ŷ}, compute the gradient and update the parameters
}

Gradient Descent for Neural Network
The gradient involves millions of parameters; to compute it efficiently, we use backpropagation.
Gradient Descent Issue
Training data: {(x¹, ŷ¹), (x², ŷ²), ⋯}
The model can only be updated after seeing all training samples → slow.
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Stochastic Gradient Descent (SGD)
◦ Gradient descent: compute the gradient of the loss summed over all training samples {(x¹, ŷ¹), (x², ŷ²), ⋯}, then update.
◦ SGD: pick one training sample xᵏ and update using the loss on that sample alone. If all training samples have the same probability of being picked, the updates follow the full gradient in expectation.
The model can be updated after seeing only one training sample → faster.
Epoch Definition
When running SGD, the model starts at θ⁰; we pick x¹, then x², ⋯, then xᴷ from the training data, updating after each pick.
Seeing all training samples once = one epoch.
Gradient Descent vs. SGD
◦ Gradient descent: sees all examples and updates once per epoch.
◦ Stochastic gradient descent: sees only one example per update; with 20 examples, it updates 20 times in one epoch.
SGD approaches the target point faster than gradient descent.
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Mini-Batch SGD
◦ Batch gradient descent: use all K training samples in each iteration.
◦ Stochastic gradient descent (SGD): pick one training sample xᵏ; use 1 sample in each iteration.
◦ Mini-batch SGD: pick a set of B training samples as a batch b; use those B samples in each iteration. B is the "batch size". A sketch of one training run follows.
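A skeleton of mini-batch SGD on a toy problem, including the per-epoch shuffling recommended in the tips later; the data, model (a single sigmoid neuron with square error), and hyperparameters are all made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                 # K = 20 training samples, N = 2 features
Y = (X[:, 0] + X[:, 1] > 0).astype(float)    # toy binary labels

w, b, eta, B = np.zeros(2), 0.0, 0.5, 4      # parameters, learning rate, batch size

for epoch in range(10):
    order = rng.permutation(len(X))          # shuffle samples before every epoch
    for start in range(0, len(X), B):
        batch = order[start:start + B]       # pick B samples as one batch
        xb, yb = X[batch], Y[batch]
        y = sigmoid(xb @ w + b)
        common = 2.0 * (y - yb) * y * (1.0 - y)   # square-error gradient terms
        w -= eta * (xb.T @ common) / len(batch)   # update after every batch
        b -= eta * common.mean()
```

With B = 1 this reduces to SGD, and with B = K (the whole training set) it becomes batch gradient descent.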
Batch vs. Mini-Batch
Handwriting digit classification: comparing batch size = 1 (SGD) against full-batch gradient descent.
Gradient Descent vs. SGD vs. Mini-Batch
(Plot: training time in seconds versus batch size — 1, 10, 100, 1000, 10000, full — comparing SGD, mini-batch, and gradient descent.)
Training speed: mini-batch > SGD > gradient descent.
Why is mini-batch faster than SGD?
SGD vs. Mini-Batch
◦ SGD: compute z¹ = W¹x separately for each sample, i.e., one matrix-vector multiplication per sample.
◦ Mini-batch SGD: stack the samples of a batch into a matrix and compute z¹ = W¹[x¹ x² ⋯] with a single matrix-matrix multiplication.
Modern computers run one matrix-matrix multiplication faster than the equivalent sequence of matrix-vector multiplications.
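The matrix-multiplication point sketched in NumPy: processing a batch with one matrix-matrix product instead of many matrix-vector products; the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(500, 500))             # weight matrix W^1
xs = rng.normal(size=(500, 64))              # a batch of 64 samples stacked as columns

# SGD-style: one matrix-vector product per sample
zs_loop = np.stack([W1 @ xs[:, k] for k in range(xs.shape[1])], axis=1)

# mini-batch style: a single matrix-matrix product over the whole batch
zs_batch = W1 @ xs

assert np.allclose(zs_loop, zs_batch)        # same numbers; the batched form runs much faster
```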
Big Issue: Local Optima
Neural network training has no guarantee of reaching the globally optimal solution.
Training Procedure Outline
Model Architecture
A Single Layer of Neurons (Perceptron)
Limitation of Perceptron
Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
Function = Model Parameters
Model Parameter Measurement
Optimization
Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch SGD
Practical Tips
Initialization
Different initialization parameters may result in different trained models.
Do not initialize the parameters to equal values; set them randomly.
Learning Rate
(Plot: cost versus number of parameter updates on the error surface, for learning rates that are very large, large, small, and well chosen.)
The learning rate should be set carefully.
Tips for Mini-Batch Training
◦ Shuffle the training samples before every epoch: otherwise the network might memorize the order in which the samples are fed.
◦ Use a fixed batch size for every epoch: this enables a fast matrix-multiplication implementation of the calculations.
◦ Adapt the learning rate to the batch size: larger batch → smaller learning rate.
http://stackoverflow.com/questions/13693966/neural-net-selecting-data-for-each-mini-batch
Learning Recipe
◦ Training data: labeled pairs (x, ŷ) used to pick the "best" function f*.
◦ Testing data: split into validation and real testing; f* maps x to y.
Learning Recipe
◦ Validation set: we immediately know the performance.
◦ Real testing set: we do not know the performance until submission.
Learning Recipe
If we do not get good results on the training set, modify the training process. Possible reasons:
◦ No good function exists (a bad hypothesis function set) → reconstruct the model architecture.
◦ A good function exists but cannot be found (local optima) → change the training strategy.
Then check whether we get good results on the dev/validation set.
Learning Recipe
◦ Good results on the training set? If no → modify the training process.
◦ Good results on the dev/validation set? If no → prevent overfitting; if yes → done.
Better performance on training but worse performance on dev indicates overfitting.
Overfitting
Possible solutions
◦ more training samples
◦ some tips: dropout, etc.
Concluding Remarks
Model: hypothesis function set {f₁, f₂, ⋯} → training picks the "best" function f*.
Q1. What is the model? → Model architecture
Q2. What does a "good" function mean? → Loss function design
Q3. How do we pick the "best" function? → Optimization