Slide credit from Hung-Yi Lee and Mark Chang
Talk Outline
Part I: Introduction to
Machine Learning & Deep Learning Part II: Variants of Neural Nets
Part III: Beyond Supervised Learning
& Recent Trends
2
Talk Outline
Part I: Introduction to
Machine Learning & Deep Learning Part II: Variants of Neural Nets
Part III: Beyond Supervised Learning
& Recent Trends
3
Introduction to Machine Learning &
Deep Learning
PART I
4
Part I: Introduction to ML & DL
Basic Machine Learning
Basic Deep Learning
Toolkits and Learning Recipe
5
Part I: Introduction to ML & DL
Basic Machine Learning
Basic Deep Learning
Toolkits and Learning Recipe
6
Machine Learning
Machine learning has been rising rapidly in recent years
7
Recent Trend
8
What Can Computers Do?
Programs do the things you tell them to do
9
Program for Solving Tasks
Task: predicting positive or negative given a product review
“I love this product!” → +   “It claims too much.” → −   “It’s a little expensive.” → ?
“台灣第一波上市!” (“First batch on sale in Taiwan!”) → 推 (upvote)   “規格好雞肋…” (“The specs are pretty useless…”) → 噓 (downvote)   “樓下買了我才考慮” (“I’ll only consider it after someone else buys it”) → ?
program.py:
  if input contains “love”, “like”, etc. → output = positive
  if input contains “too much”, “bad”, etc. → output = negative
Some tasks are complex, and we don’t know how to write a program to solve them.
10
Learning ≈ Looking for a Function
Task: predicting positive or negative given a product review
“I love this product!” → +   “It claims too much.” → −   “It’s a little expensive.” → ?
“台灣第一波上市!” → 推   “規格好雞肋…” → 噓   “樓下買了我才考慮” → ?
f(review) = label
Given a large amount of data, the machine learns what the function f should be.
11
Learning ≈ Looking for a Function
Speech Recognition: f(audio) = “你好”
Image Recognition: f(image) = “cat”
Go Playing: f(board position) = “5-5” (next move)
Dialogue System: f(“台積電怎麼去” / “How do I get to TSMC?”) = “地址為…現在建議搭乘計程車” (“The address is …; I suggest taking a taxi now”)
12
Framework
Image Recognition: f(image) = “cat”
Model: a set of functions f1, f2, …
e.g. candidate functions f1 and f2, each mapping an image to a label such as “cat”, “dog”, “monkey”, or “snake”
13
Framework
Image Recognition: f(image) = “cat”
Model: a set of functions f1, f2, …
Training Data — function inputs: images; function outputs: labels such as “monkey”, “cat”, “dog” (Supervised Learning)
Goodness of function f: evaluated on the training data to decide which function is better.
14
Framework
Image Recognition: f(image) = “cat”
Step 1 — Model: a set of functions f1, f2, …
Step 2 — Goodness of function f, measured on the training data (“monkey”, “cat”, “dog”)
Step 3 — Pick the “best” function f*
Training is to pick the best function given the observed data (steps 1–3); testing is to predict the label using the learned function (using f*, f*(image) = “cat”).
15
Why Learn Machine Learning?
The AI age: AI may take over much of today’s labor.
A new job market: AI trainers (Machine Learning Expert 機器學習專家, Data Scientist 資料科學家)
16
AI Trainers (AI 訓練師)
Machines can learn by themselves, so why do we need AI trainers?
Battles are fought by the Pokémon, so why do we need Pokémon trainers?
17
AI Trainers
A Pokémon trainer:
  picks suitable Pokémon for battle — Pokémon have different attributes
  the summoned Pokémon may not obey (e.g. Ash’s Charizard)
  needs enough experience
An AI trainer:
  in step 1, picks a suitable model — different models suit different problems
  may fail to find the best function in step 3 (e.g. deep learning)
  needs enough experience
18
AI Trainers
A powerful AI owes much to its AI trainer.
Let’s work toward becoming AI trainers.
19
Machine Learning Map
Scenario: Supervised Learning, Semi-Supervised Learning, Transfer Learning, Unsupervised Learning, Reinforcement Learning
Task: Regression, Classification
Method: Linear Model; Non-linear Model (Deep Learning, SVM, Decision Tree, KNN, etc.)
20
Machine Learning Map
Scenario: Supervised Learning, Semi-Supervised Learning, Transfer Learning, Unsupervised Learning, Reinforcement Learning
Task: Regression, Classification
Method: Linear Model; Non-linear Model (Deep Learning, SVM, Decision Tree, KNN, etc.)
Regression: the output of the target function f is a scalar (a numeric value).
21
Regression
Stock Market Forecast: f(…) = the Dow Jones Industrial Average tomorrow
Self-driving Car: f(…) = steering wheel angle
Recommendation: f(user A, product B) = probability that user A buys product B
22
Example Application
Estimating the Combat Power (CP) of a Pokémon after evolution
f(x) = y, where y is the CP after evolution
The input x is a Pokémon, represented by attributes such as x_cp (CP before evolution), x_hp, x_w, x_h, and x_s.
23
Step 1: Model
Define a set of functions (the model) for f(x) = y, the CP after evolution.
Linear model: y = b + Σ_i w_i · x_i
  x_i: an attribute (feature) of the input x;  w_i: weight;  b: bias (w and b can be any value)
Using a single feature: y = b + w · x_cp
  f1: y = 10.0 + 9.0 · x_cp
  f2: y = 9.8 + 9.2 · x_cp
  f3: y = −0.8 − 1.2 · x_cp
  …… infinitely many functions
24
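As an illustration (not part of the original slides), each choice of (w, b) picks one member of this function set; a minimal Python sketch:

```python
def make_linear_model(w, b):
    """One member f of the function set y = b + w * x_cp."""
    def f(x_cp):
        return b + w * x_cp
    return f

# Three members of the (infinite) function set listed on the slide
f1 = make_linear_model(9.0, 10.0)    # y = 10.0 + 9.0 * x_cp
f2 = make_linear_model(9.2, 9.8)     # y = 9.8  + 9.2 * x_cp
f3 = make_linear_model(-1.2, -0.8)   # y = -0.8 - 1.2 * x_cp
print(f1(100.0))                     # predicted CP after evolution for x_cp = 100
```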
Step 2: Goodness of Function
Model: y = b + w · x_cp (the set of functions f1, f2, …)
Training Data — function input: x^1, x^2, …; function output (scalar): ŷ^1, ŷ^2, …
25
Step 2: Goodness of Function
Training data (10 Pokémon):
  1st Pokémon: (x^1, ŷ^1)
  2nd Pokémon: (x^2, ŷ^2)
  ……
  10th Pokémon: (x^10, ŷ^10)
(Scatter plot: each point (x_cp^n, ŷ^n) is one Pokémon. This is real data.)
Source: https://www.openintro.org/stat/data/?data=pokemon
26
Step 2: Goodness of Function
Model: y = b + w · x_cp
Loss function L — input: a function, output: how bad it is
  L(f) = L(w, b) = Σ_{n=1}^{10} ( ŷ^n − (b + w · x_cp^n) )²
where b + w · x_cp^n is the estimated y based on the input function, ŷ^n − (b + w · x_cp^n) is the estimation error, and the sum runs over the examples.
27
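A minimal sketch of this loss in Python; the (x_cp, ŷ) pairs below are placeholders, while the real ones come from the Pokémon data set cited above:

```python
# Hypothetical (x_cp^n, y_hat^n) training pairs (placeholders, not the real data)
data = [(10.0, 131.0), (20.0, 189.0), (30.0, 215.0)]

def loss(w, b, data):
    """L(w, b) = sum_n ( y_hat^n - (b + w * x_cp^n) )^2"""
    return sum((y_hat - (b + w * x_cp)) ** 2 for x_cp, y_hat in data)

print(loss(9.0, 10.0, data))   # how bad is f1: y = 10.0 + 9.0 * x_cp ?
```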
Step 2: Goodness of Function
Loss function: L(w, b) = Σ_{n=1}^{10} ( ŷ^n − (b + w · x_cp^n) )²
(Figure: each point in the (w, b) plane is one function; the color represents L(w, b). The marked point corresponds to y = −180 − 2 · x_cp.)
28
Step 3: Best Function
Pick the “best” function from the set:
  f* = arg min_f L(f)
  w*, b* = arg min_{w,b} L(w, b) = arg min_{w,b} Σ_{n=1}^{10} ( ŷ^n − (b + w · x_cp^n) )²
29
Step 3: Gradient Descent
Consider a loss function L(w) with one parameter w:  w* = arg min_w L(w)
  (Randomly) pick an initial value w^0
  Compute dL/dw |_{w=w^0}
30
Step 3: Gradient Descent
Consider a loss function L(w) with one parameter w:  w* = arg min_w L(w)
  (Randomly) pick an initial value w^0
  Compute dL/dw |_{w=w^0}
  Update: w^1 ← w^0 − η · dL/dw |_{w=w^0}
31
Step 3: Gradient Descent
Consider a loss function L(w) with one parameter w:  w* = arg min_w L(w)
  (Randomly) pick an initial value w^0
  Compute dL/dw |_{w=w^0}, update w^1 ← w^0 − η · dL/dw |_{w=w^0}
  Compute dL/dw |_{w=w^1}, update w^2 ← w^1 − η · dL/dw |_{w=w^1}
  …… After many iterations we obtain w^0, w^1, w^2, …, w^T.
32
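A sketch of the one-parameter update rule; the toy loss L(w) = (w − 3)² below is an assumption made only so the derivative is easy to write, not the regression loss from the slides:

```python
def dL_dw(w):
    # Derivative of the toy loss L(w) = (w - 3)**2
    return 2 * (w - 3)

eta = 0.1      # learning rate
w = -4.0       # (randomly) picked initial value w^0
for t in range(100):
    w = w - eta * dL_dw(w)   # w^{t+1} <- w^t - eta * dL/dw
print(w)       # approaches the minimizer w* = 3
```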
Step 3: Gradient Descent
How about two parameters?  w*, b* = arg min_{w,b} L(w, b)
  (Randomly) pick initial values w^0, b^0
  Compute ∂L/∂w |_{w=w^0, b=b^0} and ∂L/∂b |_{w=w^0, b=b^0}, then update
    w^1 ← w^0 − η · ∂L/∂w |_{w=w^0, b=b^0},   b^1 ← b^0 − η · ∂L/∂b |_{w=w^0, b=b^0}
  Compute ∂L/∂w |_{w=w^1, b=b^1} and ∂L/∂b |_{w=w^1, b=b^1}, then update
    w^2 ← w^1 − η · ∂L/∂w |_{w=w^1, b=b^1},   b^2 ← b^1 − η · ∂L/∂b |_{w=w^1, b=b^1}
  The gradient is ∇L = [ ∂L/∂w , ∂L/∂b ]ᵀ.
33
Step 3: Gradient Descent
(Figure: the gradient-descent path in the (b, w) plane; the color shows the value of the loss L(w, b).)
34
Step 3: Gradient Descent
In linear regression the loss function L(w, b) is convex, so gradient descent cannot get stuck in a local optimum.
35
Step 3: Gradient Descent
Formulation of ∂L/∂w and ∂L/∂b:
  L(w, b) = Σ_{n=1}^{10} ( ŷ^n − (b + w · x_cp^n) )²
  ∂L/∂w = Σ_{n=1}^{10} 2 ( ŷ^n − (b + w · x_cp^n) ) ( −x_cp^n )
  ∂L/∂b = ?
36
Step 3: Gradient Descent
Formulation of ∂L/∂w and ∂L/∂b:
  L(w, b) = Σ_{n=1}^{10} ( ŷ^n − (b + w · x_cp^n) )²
  ∂L/∂w = Σ_{n=1}^{10} 2 ( ŷ^n − (b + w · x_cp^n) ) ( −x_cp^n )
  ∂L/∂b = Σ_{n=1}^{10} 2 ( ŷ^n − (b + w · x_cp^n) ) ( −1 )
37
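Putting these two partial derivatives to work, a small gradient-descent sketch for the regression; the data pairs and learning rate are placeholders, not the slides’ actual Pokémon data or settings:

```python
# Hypothetical (x_cp^n, y_hat^n) pairs standing in for the 10 training Pokémon
data = [(10.0, 131.0), (20.0, 189.0), (30.0, 215.0), (40.0, 300.0)]

w, b, eta = 0.0, 0.0, 1e-4   # initial parameters and learning rate
for t in range(10000):
    dw = sum(2 * (y - (b + w * x)) * (-x) for x, y in data)   # dL/dw
    db = sum(2 * (y - (b + w * x)) * (-1) for x, y in data)   # dL/db
    w, b = w - eta * dw, b - eta * db                          # gradient step
print(w, b)   # the learned parameters of y = b + w * x_cp
```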
Learned Model
y = b + w · x_cp with b = −188.4, w = 2.7
Average error on the training data (over the errors e^1, e^2, …, e^10 of the 10 training Pokémon) = 31.9
38
Model Generalization
y = b + w · x_cp with b = −188.4, w = 2.7
What we really care about is the error on new data (testing data).
Using another 10 Pokémon as testing data, the average error on the testing data = 35.0, which is larger than the average error on the training data (31.9).
How can we do better?
39
Model Generalization
Select another model: y = b + w1 · x_cp + w2 · x_cp²
Best function: b = −10.3, w1 = 1.0, w2 = 2.7 × 10⁻³
Average error: 15.4 on training data, 18.4 on testing data.
Better! Could it be even better?
40
Model Generalization
Select another model: y = b + w1 · x_cp + w2 · x_cp² + w3 · x_cp³
Best function: b = 6.4, w1 = 0.66, w2 = 4.3 × 10⁻³, w3 = 1.8 × 10⁻⁶
Average error: 15.3 on training data, 18.1 on testing data.
Slightly better. How about a more complex model?
41
Model Generalization
Select another model: y = b + w1 · x_cp + w2 · x_cp² + w3 · x_cp³ + w4 · x_cp⁴
Average error: 14.9 on training data, but 28.8 on testing data.
The results become worse.
42
Model Generalization
Select another model: y = b + w1 · x_cp + w2 · x_cp² + w3 · x_cp³ + w4 · x_cp⁴ + w5 · x_cp⁵
Average error: 12.8 on training data, but 232.1 on testing data.
The results are very bad.
43
Model Selection
The five models, from simplest to most complex:
  1. y = b + w · x_cp
  2. y = b + w1 · x_cp + w2 · x_cp²
  3. y = b + w1 · x_cp + w2 · x_cp² + w3 · x_cp³
  4. y = b + w1 · x_cp + w2 · x_cp² + w3 · x_cp³ + w4 · x_cp⁴
  5. y = b + w1 · x_cp + w2 · x_cp² + w3 · x_cp³ + w4 · x_cp⁴ + w5 · x_cp⁵
A more complex model yields lower error on the training data, if we can truly find the best function.
44
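A quick way to see this pattern is to fit polynomials of increasing degree; the sketch below uses synthetic data (an assumption made purely for illustration), so the numbers differ from the slides, but the qualitative behaviour — training error keeps dropping while testing error eventually grows — is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, 10)                       # 10 synthetic training points
y_train = 3.0 * x_train + 5.0 + rng.normal(0, 2, 10)
x_test = rng.uniform(0, 10, 10)                        # 10 synthetic testing points
y_test = 3.0 * x_test + 5.0 + rng.normal(0, 2, 10)

for degree in range(1, 6):
    coeffs = np.polyfit(x_train, y_train, degree)      # "best function" for this model
    err_train = np.abs(np.polyval(coeffs, x_train) - y_train).mean()
    err_test = np.abs(np.polyval(coeffs, x_test) - y_test).mean()
    # Training error generally decreases with degree; testing error often rises.
    print(degree, round(err_train, 2), round(err_test, 2))
```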
Machine Learning Map
Scenario: Supervised Learning, Semi-Supervised Learning, Transfer Learning, Unsupervised Learning, Reinforcement Learning
Task: Regression, Classification
45
Classification
Binary classification: f(input) = Yes / No
Multi-class classification: f(input) = Class 1, Class 2, …, or Class N
46
Binary Classification – Spam Filtering
(http://spam-filter-review.toptenreviews.com/)
Model: f(e-mail) = 1 (Yes, spam) or 0 (No, not spam)
Simple features: e.g. whether words such as “free” or “talk” appear in the e-mail.
47
Multi-Class – Image Recognition
Model: f(image) = “monkey”, “cat”, or “dog”
48
Multi-Class – Topic Classification
http://top-breaking-news.com/
Model: f(document) = politics (政治), sports (體育), or finance (財經)
Simple features: e.g. whether words such as “president” or “stock” appear in the document.
49
Machine Learning Map
Scenario: Supervised Learning, Semi-Supervised Learning, Transfer Learning, Unsupervised Learning, Reinforcement Learning
Task: Regression, Classification
Method: Linear Model; Non-linear Model (Deep Learning, SVM, Decision Tree, KNN, etc.)
50
Part I: Introduction to ML & DL
Basic Machine Learning
Basic Deep Learning
Toolkits and Learning Recipe
51
Stacked Functions Learned by Machine
A deep learning model f is a very complex function built by stacking simple functions f1, f2, f3, …, like a production line (生產線):
  “台灣第一波上市!” → f1 → f2 → f3 → 推
End-to-end training: what each function should do is learned automatically.
Deep learning usually refers to neural-network-based models.
52
Three Steps for Deep Learning
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
53
Three Steps for Deep Learning
Step 1: define a set of functions ← Neural Network
Step 2: goodness of function
Step 3: pick the best function
54
Neural Network
A neuron is a simple function:
  z = a1 w1 + … + ak wk + … + aK wK + b
  a = σ(z)
a1, …, aK: inputs;  w1, …, wK: weights;  b: bias;  σ: activation function.
55
Neural Network
Example neuron with the Sigmoid activation function σ(z) = 1 / (1 + e^(−z)): with the weights, bias, and inputs shown on the slide, z = 4 and the output is σ(4) ≈ 0.98.
56
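A one-neuron sketch of this computation; the particular inputs, weights, and bias below are just one assignment that gives z = 4 (an assumption, since the slide figure cannot be reproduced here):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """A single neuron: z = sum_k a_k * w_k + b, output a = sigmoid(z)."""
    z = sum(a * w for a, w in zip(inputs, weights)) + bias
    return sigmoid(z)

print(neuron([1, -1], [1, -2], 1))   # z = 4, sigmoid(4) ≈ 0.98
```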
Neural Network
Different connections lead to different network structures.
The weights and biases are the network parameters θ; each neuron has its own weights and bias.
57
Fully Connected Feedforward Network
With the Sigmoid activation σ(z) = 1 / (1 + e^(−z)): for the input (1, −1), the two first-layer neurons compute z = 4 and z = −2, giving outputs σ(4) ≈ 0.98 and σ(−2) ≈ 0.12.
58
Fully Connected Feedforward Network
Feeding the input (1, −1) through the whole network: the first hidden layer outputs (0.98, 0.12), the next layer outputs (0.86, 0.11), and the final output is (0.62, 0.83).
59
Fully Connected Feedforward Network
The network is a function mapping an input vector to an output vector, e.g.
  f(1, −1) = (0.62, 0.83)   f(0, 0) = (0.51, 0.85)
Given the parameters θ, the network defines one function; given only the network structure, it defines a function set.
60
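A forward-pass sketch of a small fully connected network with Sigmoid activations; the weight matrices below are illustrative placeholders rather than the exact values drawn on the slide, the point being only the layer-by-layer rule a = σ(W a_prev + b):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """layers: list of (W, b) pairs; returns the network's output vector."""
    a = np.asarray(x, dtype=float)
    for W, b in layers:
        a = sigmoid(W @ a + b)   # one fully connected layer
    return a

# A hypothetical 2-2-2 network (weights chosen only for illustration)
layers = [
    (np.array([[1.0, -2.0], [-1.0, 1.0]]), np.array([1.0, 0.0])),
    (np.array([[2.0, -1.0], [-2.0, -1.0]]), np.array([0.0, 1.0])),
]
print(forward([1, -1], layers))   # one input vector in, one output vector out
```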
Fully Connected Feedforward Network
Input layer (x1, x2, …, xN) → hidden layers (Layer 1, Layer 2, …, Layer L) → output layer (y1, y2, …, yM)
Each node is a neuron; “deep” means many hidden layers.
61
Why Deep? Universality Theorem
Any continuous function f: R^N → R^M can be realized by a network with a single hidden layer (given enough hidden neurons).
So why “deep” and not “fat”?
62
Fat + Shallow v.s. Thin + Deep
Compare two networks with the same number of parameters: one shallow and wide, one thin and deep.
63
Why Deep
Logic circuits consist of gates:
  Two layers of logic gates can represent any Boolean function.
  Using multiple layers of logic gates makes building some functions much simpler.
Neural networks consist of neurons:
  A network with one hidden layer can represent any continuous function.
  Using multiple layers of neurons makes representing some functions much simpler.
64
Deep = Many Hidden Layers
ImageNet error rates: AlexNet (2012) 16.4%, VGG (2014) 7.3%, GoogleNet (2014) 6.7%
http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf
65
Deep = Many Hidden Layers
AlexNet (2012)
VGG (2014)
GoogleNet (2014)
3.57%
Residual Net (2015)
Taipei 101
16.4% 7.3% 6.7%
Special structure
66
Output Layer
Softmax layer as the output layer.
In an ordinary output layer, y_i = σ(z_i); the outputs can take any value and may not be easy to interpret.
67
Output Layer
Softmax layer as the output layer:
  y_i = e^{z_i} / Σ_{j=1}^{3} e^{z_j}
The outputs act like probabilities: 1 > y_i > 0 and Σ_i y_i = 1.
Example: z = (3, 1, −3) → (e^3, e^1, e^{−3}) ≈ (20, 2.7, 0.05) → y ≈ (0.88, 0.12, ≈0)
68
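A direct sketch of the softmax computation that reproduces the slide’s example:

```python
import math

def softmax(zs):
    """y_i = exp(z_i) / sum_j exp(z_j)"""
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([3, 1, -3]))   # ≈ [0.88, 0.12, ~0]
```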
Example Application
Input: a 16 x 16 image, i.e. 256 values x1, x2, …, x256 (ink → 1, no ink → 0)
Output: y1, y2, …, y10, where each dimension represents the confidence of a digit (is 1, is 2, …, is 0)
Example output: 0.1 for “is 1”, 0.7 for “is 2”, 0.2 for another digit → the image is “2”.
69
Example Application
Handwriting Digit Recognition: Machine(image) = “2”
What is needed is a function whose input is a 256-dim vector and whose output is a 10-dim vector (is 1, is 2, …, is 0): a neural network.
70
Example Application
Input layer (x1, x2, …, xN) → hidden layers (Layer 1, …, Layer L) → output layer (y1, y2, …, y10: is 1, is 2, …, is 0), e.g. the output is “2”.
The network structure defines a function set containing the candidates for handwriting digit recognition.
You need to decide the network structure so that a good function is in your function set.
71
FAQ
Q: How many layers? How many neurons for each layer? — Trial and error + intuition.
Q: Can we design the network structure? — Yes; see the variants of neural networks (next lecture).
Q: Can the structure be automatically determined? — Yes, but this is not widely studied yet.
72
Three Steps for Deep Learning
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
73
Training Data
Preparing training data: images and their labels, e.g. “5”, “0”, “4”, “1”, “1”, “3”, “9”, “2”.
The learning target is defined on the training data.
74
Learning Target
Input: a 16 x 16 image, 256 values (ink → 1, no ink → 0); output: y1, …, y10 through a Softmax layer (is 1, is 2, …, is 0).
The learning target: for an input image of “1”, y1 should have the maximum value; for an input image of “2”, y2 should have the maximum value; and so on.
75
Loss
Given a set of parameters, the network (with a Softmax output) produces y1, …, y10 for an input image.
For an image of “1”, the target is the one-hot vector (1, 0, …, 0); the output should be as close as possible to the target.
The loss l can be the square error or the cross entropy between the network output and the target.
A good function should make the loss of all examples as small as possible.
76
Total Loss
For all R training examples x1, x2, x3, …, xR with targets ŷ1, ŷ2, ŷ3, …, ŷR, the network outputs y1, y2, y3, …, yR with per-example losses l1, l2, l3, …, lR.
Total loss: L = Σ_{r=1}^{R} l^r
Find the network parameters θ* — i.e. the function in the function set — that make the total loss L as small as possible.
77
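A sketch of the total loss with cross entropy as the per-example loss; the predictions and one-hot targets below are invented for illustration (in practice they come from the network outputs and the labels):

```python
import math

def cross_entropy(y, target):
    """l = -sum_i target_i * log(y_i), with y the softmax output and target one-hot."""
    return -sum(t * math.log(p) for p, t in zip(y, target) if t > 0)

def total_loss(outputs, targets):
    """L = sum_r l^r over all R training examples."""
    return sum(cross_entropy(y, t) for y, t in zip(outputs, targets))

# Two hypothetical training examples with 3 classes
outputs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
targets = [[1, 0, 0], [0, 1, 0]]
print(total_loss(outputs, targets))
```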
Three Steps for Deep Learning
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
78
How to pick the best function
Find the network parameters θ* that minimize the total loss L, where θ = {w1, w2, w3, …, b1, b2, b3, …}.
Enumerating all possible values is hopeless: e.g. a speech recognition network with 8 layers of 1000 neurons each has 1000 × 1000 = 10⁶ weights between two adjacent layers, i.e. millions of parameters in total.
79
Gradient Descent
Find the network parameters θ* = {w1, w2, …, b1, b2, …} that minimize the total loss L.
Pick an initial value for each parameter w (random initialization or RBM pre-training; random is usually good enough).
80
Gradient Descent
Pick an initial value for w and compute ∂L/∂w.
If ∂L/∂w is positive, decrease w; if it is negative, increase w.
81
Gradient Descent
Repeat: w ← w − η · ∂L/∂w, where η is called the “learning rate”.
82
Gradient Descent
Repeat w ← w − η · ∂L/∂w until the update is little (∂L/∂w is close to zero).
83
Gradient Descent
Assume that θ has two variables {θ1, θ2}
84
Gradient Descent
𝑤1 𝑤2
Color: Value of Total Loss L
Randomly pick a starting point
85
Gradient Descent
Compute ∂L/∂w1 and ∂L/∂w2, then move by (−η · ∂L/∂w1, −η · ∂L/∂w2) in the (w1, w2) plane.
Hopefully, we eventually reach a minimum…
86
Local Minima
Gradient descent can be very slow on a plateau (∂L/∂w ≈ 0) and can get stuck at a saddle point or a local minimum (∂L/∂w = 0).
87
Local Minima
Gradient descent never guarantees reaching the global minimum: different initial points can reach different minima and thus give different results.
88
Gradient Descent
This is the “learning” of machines in deep learning……
Even AlphaGo uses this approach.
I hope you are not too disappointed :p
(What people imagine vs. what actually happens.)
89
Part I: Introduction to ML & DL
Basic Machine Learning
Basic Deep Learning
Toolkits and Learning Recipe
90
Deep Learning Toolkit
Backpropagation: an efficient way to compute ∂L/∂w in a neural network.
91
Three Steps for Deep Learning
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
Deep learning is so simple……
Now, if you want to find a function and you have lots of function input/output pairs as training data, you can use deep learning.
92
Keras
Keras is an interface to TensorFlow or Theano.
TensorFlow and Theano: very flexible, but they need some effort to learn.
Keras: easy to learn and use (while keeping some flexibility); you can modify it if you can write TensorFlow or Theano.
93
Keras
François Chollet is the author of Keras. He currently works for Google as a deep learning engineer and researcher.
Keras means horn in Greek.
Documentation: http://keras.io/
Examples: https://github.com/fchollet/keras/tree/master/examples
Step-by-step lecture by Prof. Hung-Yi Lee:
  Slides: http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/Keras.pdf
  Lecture recording: https://www.youtube.com/watch?v=qetE6uUoLQA
94
Impressions of Using Keras (使用 Keras 心得)
(Thanks to 沈昇勳 for providing the figure.)
95
Example Application
Handwriting Digit Recognition: Machine(28 x 28 image) = “1”
This is the “Hello world” of deep learning.
MNIST data: http://yann.lecun.com/exdb/mnist/
Keras provides a data set loading function: http://keras.io/datasets/
96
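A minimal Keras sketch for this “Hello world” task, in the spirit of the Keras lecture linked above; the layer sizes and training settings here are illustrative choices, not the ones prescribed by the slides:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
from keras.datasets import mnist

# Load MNIST and flatten each 28 x 28 image into a 784-dim vector
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255
y_train = to_categorical(y_train, 10)   # one-hot targets
y_test = to_categorical(y_test, 10)

# Step 1: define a set of functions (the network structure)
model = Sequential()
model.add(Dense(500, input_dim=784, activation='sigmoid'))
model.add(Dense(500, activation='sigmoid'))
model.add(Dense(10, activation='softmax'))

# Step 2: goodness of function (cross-entropy loss)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Step 3: pick the best function (gradient descent)
model.fit(x_train, y_train, batch_size=100, epochs=20)
print(model.evaluate(x_test, y_test))
```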
Three Steps for Deep Learning
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
Deep learning is so simple……
97
Learning Recipe
After the three steps (define a set of functions → goodness of function → pick the best function), check:
  Good results on training data? If NO, go back and modify the three steps.
  If YES, good results on testing data? If NO, go back; if YES, done.
98
Overfitting
Possible solutions: more training samples; some tips such as dropout, etc.
99
Learning Recipe
Good results on training data? Good results on testing data?
Different approaches are used for different problems, e.g. dropout for good results on testing data.
100
Learning Recipe
Tips toward good results: choosing a proper loss, mini-batch, new activation functions, adaptive learning rate, momentum.
101
Learning Recipe
Training data: pairs (x, ŷ) used to learn the “best” function f*.
102
Learning Recipe
Testing data: a validation set, for which we immediately know the performance, and the real testing data, for which we do not know the performance.
103
Learning Recipe
If we do not get good results on the training set, modify the training process. Possible reasons:
  no good function exists (a bad hypothesis function set) → reconstruct the model architecture
  we cannot find a good function (e.g. stuck in a local optimum) → change the training strategy
104
Learning Recipe
If we get good results on the training set but worse results on the dev/validation set, the model is overfitting → prevent overfitting. Otherwise, we are done.
105
Concluding Remarks
Basic Machine Learning
  1. Define a set of functions
  2. Measure the goodness of functions
  3. Pick the best function
Basic Deep Learning: stacked functions
106
Talk Outline
Part I: Introduction to
Machine Learning & Deep Learning Part II: Variants of Neural Nets
Part III: Beyond Supervised Learning
& Recent Trends
107
PART II
Variants of Neural Networks
108
PART II: Variants of Neural Networks
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
109
PART II: Variants of Neural Networks
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
Widely used in image processing
110
Why CNN for Image
Can the network be simplified by considering the properties of images?
In a fully connected network on raw pixels (x1, x2, …, xN), the first layer learns the most basic classifiers, the second layer uses the first layer as modules to build classifiers, and so on (Zeiler, M. D., ECCV 2014).
111
Why CNN for Image
Some patterns are much smaller than the whole image: a neuron does not have to see the whole image to discover the pattern (e.g. a “beak” detector). Connecting to a small region needs fewer parameters.
112
Why CNN for Image
The same patterns appear in different regions: an “upper-left beak” detector and a “middle beak” detector do almost the same thing, so they can use the same set of parameters.
113
Why CNN for Image
Subsampling the pixels will not change the object: a subsampled bird is still a bird. We can subsample the pixels to make the image smaller, so the network needs fewer parameters to process it.
114
Three Steps for Deep Learning
Step 1: define a set of functions ← Convolutional Neural Network
Step 2: goodness of function
Step 3: pick the best function
Deep learning is so simple……
115
Image Recognition
116
http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
The Whole CNN
Image → [Convolution → Max Pooling] (can repeat many times) → Flatten → Fully Connected Feedforward Network → “cat”, “dog”, ……
117
The Whole CNN
Convolution exploits Property 1 (some patterns are much smaller than the whole image) and Property 2 (the same patterns appear in different regions); Max Pooling exploits Property 3 (subsampling the pixels will not change the object). The Convolution → Max Pooling stage can repeat many times before Flatten.
118
Image Recognition
119
Local Connectivity
120
Neurons connect to a small region
Parameter Sharing
The same feature in different positions
121
Neurons share the same weights
Parameter Sharing
Different features in the same position
122
Neurons have different weights
Convolutional Layers
123
(Figure: input and output volumes with width, height, and depth; neurons in the same depth slice use shared weights.)
Convolutional Layers
124
(Figure: a convolutional layer going from input depth 1 (a1, a2, a3) to output depth 2 (b1, b2 and c1, c2).)
Convolutional Layers
125
(Figure: a convolutional layer with input depth 2 and output depth 2 (units a1–a3, b1–b3, c1–c2, d1–d2).)
Convolutional Layers
126
(Figure: the same input-depth-2, output-depth-2 connectivity, continued.)
Convolutional Layers
127
Hyper-parameters of CNN
Stride: how far the filter moves at each step (e.g. stride = 1 or stride = 2).
Padding: zeros added around the border (e.g. padding = 0 or padding = 1).
128
Example
Input volume (7x7x3), filter (3x3x3), stride = 2, padding = 1 → output volume (3x3x2)
http://cs231n.github.io/convolutional-networks/
129
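The spatial output size follows the usual rule out = (W − F + 2P) / S + 1. A small sketch; my reading of the cs231n demo linked above is that its 7x7 grid already includes the zero-padding border around a 5x5 input, which is an assumption on my part:

```python
def conv_output_size(width, filter_size, padding, stride):
    """Spatial output size of a convolutional layer: (W - F + 2P) / S + 1."""
    return (width - filter_size + 2 * padding) // stride + 1

# 5x5 input, 3x3 filter, padding 1, stride 2 -> 3x3 output (matching the 3x3x2 above)
print(conv_output_size(5, 3, 1, 2))   # 3
# The output depth (2 here) equals the number of filters, not a spatial quantity.
```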
Convolutional Layers
130
http://cs231n.github.io/convolutional-networks/
Convolutional Layers
131
http://cs231n.github.io/convolutional-networks/
Convolutional Layers
132
http://cs231n.github.io/convolutional-networks/
Convolutional Layers
133
http://cs231n.github.io/convolutional-networks/
Pooling Layer
Input (4 x 4, depth = 1):
  1 3 2 4
  5 7 6 8
  0 0 3 3
  5 5 0 0
Pooling operates on non-overlapping 2 x 2 regions and has no weights.
Maximum pooling: Max(1,3,5,7) = 7, Max(2,4,6,8) = 8, Max(0,0,5,5) = 5, Max(3,3,0,0) = 3, giving the output
  7 8
  5 3
Average pooling instead averages each region, e.g. Avg(1,3,5,7) = 4 and Avg(2,4,6,8) = 5.
134
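A numpy sketch of 2 x 2 max pooling reproducing the example above:

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a 2-D array (no weights to learn)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [0, 0, 3, 3],
              [5, 5, 0, 0]])
print(max_pool_2x2(x))   # [[7 8], [5 3]]
```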
Why “Deep” Learning?
135