
Slide credit from Hung-Yi Lee and Mark Chang


(1)

Slide credit from Hung-Yi Lee and Mark Chang

(2)

Talk Outline

Part I: Introduction to Machine Learning & Deep Learning
Part II: Variants of Neural Nets
Part III: Beyond Supervised Learning & Recent Trends

2

(3)

Talk Outline

Part I: Introduction to Machine Learning & Deep Learning
Part II: Variants of Neural Nets
Part III: Beyond Supervised Learning & Recent Trends

3

(4)

Introduction to Machine Learning &

Deep Learning

PART I

4

(5)

Part I: Introduction to ML & DL

Basic Machine Learning

Basic Deep Learning

Toolkits and Learning Recipe

5

(6)

Part I: Introduction to ML & DL

Basic Machine Learning

Basic Deep Learning

Toolkits and Learning Recipe

6

(7)

Machine Learning

Machine learning has been rising rapidly in recent years

7

(8)

Recent Trend

8

(9)

What Can Computers Do?

Programs can do the things you ask them to do

9

(10)

Program for Solving Tasks

Task: predicting positive or negative given a product review

"I love this product!" → +   "It claims too much." → −   "It's a little expensive." → ?
"First launch in Taiwan!" (台灣第一波上市!) → ?   "The specs are pretty useless…" (規格好雞肋…) → ?   "I'll only consider it after my neighbor buys one." (樓下買了我才考慮) → ?

program.py:
if input contains "love", "like", etc. → output = positive
if input contains "too much", "bad", etc. → output = negative

Some tasks are complex, and we don't know how to write a program to solve them.

10

(11)

Learning ≈ Looking for a Function

Task: predicting positive or negative given a product review

"I love this product!" → f → +   "It claims too much." → f → −   "It's a little expensive." → f → ?
"First launch in Taiwan!" → f → ?   "The specs are pretty useless…" → f → ?   "I'll only consider it after my neighbor buys one." → f → ?

Given a large amount of data, the machine learns what the function f should be.

11

(12)

Learning ≈ Looking for a Function

Speech Recognition:  f(audio) = "Hello" (你好)
Image Recognition:   f(image) = "cat"
Go Playing:          f(board position) = "5-5" (next move)
Dialogue System:     f("How do I get to TSMC?" 台積電怎麼去) = "The address is …; taking a taxi is currently recommended." (地址為… 現在建議搭乘計程車)

12

(13)

Framework

Image Recognition:  f(image) = "cat"
Model: a set of functions f1, f2, …
e.g.  f1(·) = "cat",  f1(·) = "dog";   f2(·) = "monkey",  f2(·) = "snake"   (each applied to example images)

13

(14)

Framework

Image Recognition:  f(image) = "cat"
Model: a set of functions f1, f2, …
Goodness of function f: measured on the training data; the better a function fits the labels, the better it is.
Training data (supervised learning): function inputs are example images; function outputs are their labels, e.g. "monkey", "cat", "dog".

14

(15)

Framework

Image Recognition:  f(image) = "cat"
Step 1 (Model): a set of functions f1, f2, …
Step 2 (Goodness of function f): measured on the training data ("monkey", "cat", "dog", …)
Step 3: pick the "best" function f*

Training vs. testing:
Training is to pick the best function given the observed data.
Testing is to predict the label for new data using the learned function f*, e.g. f*(image) = "cat".

15

(16)

Why Learn Machine Learning?

The AI age: AI may take over much of today's labor.
A new job market: AI trainers (AI 訓練師), i.e. machine learning experts and data scientists.

16

(17)

AI Trainers (AI 訓練師)

Don't machines learn by themselves? Why do we need AI trainers?
The Pokémon do the fighting, so why do we need Pokémon trainers?

17

(18)

AI Trainers (AI 訓練師)

Pokémon trainer:
- Picks suitable Pokémon for battle (Pokémon have different types)
- A summoned Pokémon does not always obey (e.g. Ash's Charizard)
- Sufficient experience is needed

AI trainer:
- In Step 1, picks a suitable model (different models suit different problems)
- Step 3 does not always find the best function (e.g. deep learning)
- Sufficient experience is needed

18

(19)

AI Trainers (AI 訓練師)

Behind every powerful AI is a skilled AI trainer.
Let's work toward becoming AI trainers together.

19

(20)

Machine Learning Map

Scenario: Supervised Learning, Semi-Supervised Learning, Transfer Learning, Unsupervised Learning, Reinforcement Learning
Task: Regression, Classification
Method: Linear Model; Non-Linear Model (Deep Learning, SVM, Decision Tree, KNN, etc.)

20

(21)

Machine Learning Map: Regression

Scenario: Supervised Learning, Semi-Supervised Learning, Transfer Learning, Unsupervised Learning, Reinforcement Learning
Task: Regression, Classification
Method: Linear Model; Non-Linear Model (Deep Learning, SVM, Decision Tree, KNN, etc.)

Regression: the output of the target function f is a "scalar" (a single numeric value, 一個數值).

21

(22)

Regression

Stock Market Forecast:  f(…) = the Dow Jones Industrial Average tomorrow
Self-driving Car:       f(…) = steering wheel angle (方向盤角度)
Recommendation:         f(user A, product B) = how likely user A is to buy product B

22

(23)

Example Application

Estimating the Combat Power (CP) of a Pokémon after evolution

f(x) = y = CP after evolution
x: a Pokémon, described by attributes such as x_cp (current CP), x_hp, x_w, x_s, …

23

(24)

Step 1: Model

A set of

function f1, f2

Model

f(x) = y: CP after evolution

Linear model:  y = b + Σ_i w_i · x_i
  x_i: an attribute (feature) of the input x;  w_i: weight;  b: bias
  w and b are parameters and can take any value.

A one-feature case:  y = b + w · x_cp
  e.g.  f1: y = 10.0 + 9.0·x_cp,  f2: y = 9.8 + 9.2·x_cp,  f3: y = −0.8 − 1.2·x_cp,  … (infinitely many functions)

24

(25)

Step 2: Goodness of Function

A set of

function f1, f2

Model

Training Data

Training data:
  function input:  x^1, x^2
  function output (scalar):  ŷ^1, ŷ^2

Model:  y = b + w · x_cp

25

(26)

Step 2: Goodness of Function

Training data

1st Pokémon: (x^1, ŷ^1),  2nd Pokémon: (x^2, ŷ^2),  …,  10th Pokémon: (x^10, ŷ^10)

This is real data (plotted as ŷ versus x_cp).
Source: https://www.openintro.org/stat/data/?data=pokemon

26

(27)

Step 2: Goodness of Function

A set of

function f1, f2

Training Data

Goodness of function f

Loss function L. Input: a function; output: how bad it is.

L(f) = L(w, b) = Σ_{n=1}^{10} ( ŷ^n − (b + w·x_cp^n) )²

  b + w·x_cp^n: the estimated y based on the input function
  ŷ^n − (b + w·x_cp^n): the estimation error, summed over the examples

Model:  y = b + w · x_cp

27
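As a concrete illustration of Steps 1 and 2, here is a minimal NumPy sketch of the linear model y = b + w·x_cp and its squared-error loss. The x_cp and y_hat arrays are made-up placeholder data, not the real Pokémon dataset from the slides.

```python
import numpy as np

# Placeholder training data (not the real Pokemon dataset)
x_cp = np.array([10., 50., 100., 200., 300.])    # CP before evolution
y_hat = np.array([30., 150., 290., 560., 830.])  # observed CP after evolution

def predict(x, w, b):
    """Step 1, the linear model: y = b + w * x_cp."""
    return b + w * x

def loss(w, b, x, y_hat):
    """Step 2: sum of squared estimation errors over the training examples."""
    return np.sum((y_hat - predict(x, w, b)) ** 2)

# Evaluate one candidate function from the function set
print(loss(w=2.7, b=-188.4, x=x_cp, y_hat=y_hat))
```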

(28)

Step 2: Goodness of Function

Loss Function

L(w, b) = Σ_{n=1}^{10} ( ŷ^n − (b + w·x_cp^n) )²

[Figure: each point in the (w, b) plane is a function; the color represents L(w, b). The marked point corresponds to y = −180 − 2·x_cp.]

28

(29)

Step 3: Best Function

A set of

function f1, f2

Model

Training Data

Goodness of function f

Pick the "best" function:

f* = arg min_f L(f)
w*, b* = arg min_{w,b} L(w, b) = arg min_{w,b} Σ_{n=1}^{10} ( ŷ^n − (b + w·x_cp^n) )²

29

(30)

Step 3: Gradient Descent

Consider a loss function L(w) with one parameter w:   w* = arg min_w L(w)

- (Randomly) pick an initial value w^0
- Compute dL/dw |_{w=w^0}

30

(31)

Step 3: Gradient Descent

Consider a loss function L(w) with one parameter w:   w* = arg min_w L(w)

- (Randomly) pick an initial value w^0
- Compute dL/dw |_{w=w^0}, then move in the opposite direction by −η · dL/dw |_{w=w^0}:
  w^1 ← w^0 − η · dL/dw |_{w=w^0}

31

(32)

Step 3: Gradient Descent

Consider a loss function L(w) with one parameter w:   w* = arg min_w L(w)

- (Randomly) pick an initial value w^0
- Compute dL/dw |_{w=w^0}:   w^1 ← w^0 − η · dL/dw |_{w=w^0}
- Compute dL/dw |_{w=w^1}:   w^2 ← w^1 − η · dL/dw |_{w=w^1}
- … many iterations, producing w^1, w^2, …, w^T

32
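A minimal sketch of the update rule w^{t+1} ← w^t − η·dL/dw on a toy one-parameter loss. The quadratic loss, learning rate, and iteration count below are illustrative choices, not values from the slides.

```python
def dL_dw(w):
    # Toy loss L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3)
    return 2 * (w - 3)

eta = 0.1   # learning rate
w = -4.0    # (randomly) picked initial value w^0
for t in range(100):          # many iterations
    w = w - eta * dL_dw(w)    # w^{t+1} <- w^t - eta * dL/dw
print(w)    # converges toward the minimizer w* = 3
```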

(33)

Step 3: Gradient Descent

How about two parameters?   w*, b* = arg min_{w,b} L(w, b)

- (Randomly) pick initial values w^0, b^0
- Compute ∂L/∂w |_{w=w^0, b=b^0} and ∂L/∂b |_{w=w^0, b=b^0}:
  w^1 ← w^0 − η · ∂L/∂w |_{w=w^0, b=b^0},   b^1 ← b^0 − η · ∂L/∂b |_{w=w^0, b=b^0}
- Compute ∂L/∂w |_{w=w^1, b=b^1} and ∂L/∂b |_{w=w^1, b=b^1}:
  w^2 ← w^1 − η · ∂L/∂w |_{w=w^1, b=b^1},   b^2 ← b^1 − η · ∂L/∂b |_{w=w^1, b=b^1}

The gradient is ∇L = [ ∂L/∂w , ∂L/∂b ]ᵀ.

33

(34)

Step 3: Gradient Descent

[Figure: contour plot over (w, b); the color shows the value of the loss L(w, b).]

34

(35)

Step 3: Gradient Descent

[Figure: the loss surface L over (w, b); no local optima.]

In linear regression the loss function is convex, so gradient descent does not get stuck in a local optimum.

35

(36)

Step 3: Gradient Descent

Formulation of ∂L/∂w and ∂L/∂b

L(w, b) = Σ_{n=1}^{10} ( ŷ^n − (b + w·x_cp^n) )²

∂L/∂w = Σ_{n=1}^{10} 2 ( ŷ^n − (b + w·x_cp^n) ) · (−x_cp^n)
∂L/∂b = ?

36

(37)

Step 3: Gradient Descent

Formulation of ∂L/∂w and ∂L/∂b

L(w, b) = Σ_{n=1}^{10} ( ŷ^n − (b + w·x_cp^n) )²

∂L/∂w = Σ_{n=1}^{10} 2 ( ŷ^n − (b + w·x_cp^n) ) · (−x_cp^n)
∂L/∂b = Σ_{n=1}^{10} 2 ( ŷ^n − (b + w·x_cp^n) ) · (−1)

37
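Putting the two-parameter update rule and the analytic partial derivatives together, here is a sketch of gradient descent on (w, b). It again uses placeholder data; the learning rate and iteration count are arbitrary illustrative choices.

```python
import numpy as np

# Placeholder data, not the real Pokemon dataset
x_cp = np.array([10., 50., 100., 200., 300.])
y_hat = np.array([30., 150., 290., 560., 830.])

w, b, eta = 0.0, 0.0, 1e-6   # initial values w^0, b^0 and learning rate (illustrative)
for t in range(100000):
    err = y_hat - (b + w * x_cp)         # y_hat^n - (b + w * x_cp^n)
    grad_w = np.sum(2 * err * (-x_cp))   # dL/dw
    grad_b = np.sum(2 * err * (-1))      # dL/db
    w, b = w - eta * grad_w, b - eta * grad_b
print(w, b)
```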

(38)

Learned Model

y = b + w · x_cp  with  b = −188.4,  w = 2.7

Average error on the training data (computed from the per-example errors e^1, …, e^10) = 31.9

38

(39)

Model Generalization

y = b + w · x_cp  with  b = −188.4,  w = 2.7

What we really care about is the error on new data (testing data).
Using another 10 Pokémon as testing data, the average error = 35.0 > the average error on the training data (31.9).

How can we do better?

39

(40)

Model Generalization

Select another model:  y = b + w1·x_cp + w2·(x_cp)²

Best function:  b = −10.3,  w1 = 1.0,  w2 = 2.7 × 10⁻³
Average error: 15.4 (training), 18.4 (testing)

Better! Could it be even better?

40

(41)

Model Generalization

Select another model:  y = b + w1·x_cp + w2·(x_cp)² + w3·(x_cp)³

Best function:  b = 6.4,  w1 = 0.66,  w2 = 4.3 × 10⁻³,  w3 = 1.8 × 10⁻⁶
Average error: 15.3 (training), 18.1 (testing)

Slightly better. How about a more complex model?

41

(42)

Model Generalization

Select another model:  y = b + w1·x_cp + w2·(x_cp)² + w3·(x_cp)³ + w4·(x_cp)⁴

Average error: 14.9 (training), 28.8 (testing)

The results on the testing data become worse.

42

(43)

Model Generalization

Select another model:  y = b + w1·x_cp + w2·(x_cp)² + w3·(x_cp)³ + w4·(x_cp)⁴ + w5·(x_cp)⁵

Average error: 12.8 (training), 232.1 (testing)

The testing results are very bad.

43

(44)

Model Selection

Models compared on the training data:
1. y = b + w·x_cp
2. y = b + w1·x_cp + w2·(x_cp)²
3. y = b + w1·x_cp + w2·(x_cp)² + w3·(x_cp)³
4. y = b + w1·x_cp + w2·(x_cp)² + w3·(x_cp)³ + w4·(x_cp)⁴
5. y = b + w1·x_cp + w2·(x_cp)² + w3·(x_cp)³ + w4·(x_cp)⁴ + w5·(x_cp)⁵

A more complex model yields lower error on the training data, provided we can truly find the best function in it.

44
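A sketch of the same model-selection experiment on synthetic data, using numpy.polyfit to pick the "best" function of each polynomial degree. The underlying curve, noise level, and data sizes are made up; the pattern to look for is that training error keeps shrinking as the model gets more complex, while testing error eventually grows (the exact numbers depend on the random draw).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n=10):
    # Made-up data: a smooth underlying curve plus noise (not the Pokemon dataset)
    x = rng.uniform(0.0, 3.0, size=n)
    return x, np.sin(x) + rng.normal(0.0, 0.3, size=n)

x_train, y_train = make_data()
x_test, y_test = make_data()

for degree in (1, 2, 3, 4, 5):
    coeffs = np.polyfit(x_train, y_train, degree)   # "best" function in this model
    train_err = np.mean(np.abs(np.polyval(coeffs, x_train) - y_train))
    test_err = np.mean(np.abs(np.polyval(coeffs, x_test) - y_test))
    print(degree, round(train_err, 3), round(test_err, 3))
```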

(45)

Machine Learning Map

Scenario: Supervised Learning, Semi-Supervised Learning, Transfer Learning, Unsupervised Learning, Reinforcement Learning
Task: Regression, Classification (next topic)

45

(46)

Classification

Binary classification:      input → function f → Yes / No
Multi-class classification: input → function f → Class 1, Class 2, …, Class N

46

(47)

Binary Classification – Spam Filtering

(http://spam-filter-review.toptenreviews.com/)

Model:  f(e-mail) → spam? Yes (1) / No (0)
[Figure: e.g. an e-mail containing "free" is classified as 1 (Yes); an e-mail about a "Talk" is classified as 0 (No).]

47

(48)

Multi-Class – Image Recognition

Model:  f(image) → "monkey" / "cat" / "dog"
[Figure: example images of a monkey, a cat, and a dog, each mapped to its class.]

48

(49)

Multi-Class – Topic Classification

http://top-breaking-news.com/

Model:  f(document) → topic, e.g. politics (政治), sports (體育), finance (財經)
[Figure: example features such as whether "president" or "stock" appears in the document.]

49

(50)

Machine Learning Map

Scenario: Supervised Learning, Semi-Supervised Learning, Transfer Learning, Unsupervised Learning, Reinforcement Learning
Task: Regression, Classification
Method: Linear Model; Non-Linear Model (Deep Learning, SVM, Decision Tree, KNN, etc.)

50

(51)

Part I: Introduction to ML & DL

Basic Machine Learning

Basic Deep Learning

Toolkits and Learning Recipe

51

(52)

Stacked Functions Learned by Machine

Production line (生產線): a deep learning model f is a very complex function obtained by stacking simple functions f1, f2, f3, …

e.g.  "First launch in Taiwan!" (台灣第一波上市!) → f1 → f2 → f3 → output

End-to-end training: what each simple function should do is learned automatically.
Deep learning usually refers to neural-network-based models.

52

(53)

Three Steps for Deep Learning

Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function

53

(54)

Three Steps for Deep Learning

Step 1: define a set of functions → Neural Network
Step 2: goodness of function
Step 3: pick the best function

54

(55)

Neural Network

A neuron is a simple function:

z = a1·w1 + … + ak·wk + … + aK·wK + b
a = σ(z)

a1, …, aK: inputs;  w1, …, wK: weights;  b: bias;  σ: activation function

55

(56)

Neural Network

A neuron with the sigmoid activation function:

σ(z) = 1 / (1 + e^(−z))

In the illustrated example, the weighted sum of the inputs plus the bias is z = 4, so the output is σ(4) ≈ 0.98.

56
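A one-neuron sketch reproducing the σ(4) ≈ 0.98 value on the slide. The specific inputs, weights, and bias below are one consistent choice; the exact values in the slide's figure are not fully recoverable from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(a, w, b):
    """A simple function: z = sum_k a_k * w_k + b, output sigma(z)."""
    z = sum(ak * wk for ak, wk in zip(a, w)) + b
    return sigmoid(z)

# One choice consistent with the slide: weighted sum z = 4 -> sigma(4) ~ 0.98
print(neuron(a=[1, -1], w=[1, -2], b=1))   # 0.982...
```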

(57)

Neural Network

Different connections lead to different network structures.
The neurons can have different weights and biases; the weights and biases are the network parameters θ.

57

(58)

Fully Connected Feedforward Network

Sigmoid activation:  σ(z) = 1 / (1 + e^(−z))

Example with input (1, −1):
  first neuron:  weights (1, −2), bias 1  →  z = 1·1 + (−1)·(−2) + 1 = 4  →  σ(4) ≈ 0.98
  second neuron: weights (−1, 1), bias 0  →  z = −2  →  σ(−2) ≈ 0.12

58

(59)

Fully Connected Feedforward Network

[Figure: the forward pass continues through the remaining layers. With input (1, −1) the activations are (0.98, 0.12), then (0.86, 0.11), and the final output is (0.62, 0.83).]

59

(60)

Fully Connected Feedforward Network

Given parameters θ, the network defines a function from an input vector to an output vector:

f( [1, −1] ) = [0.62, 0.83]      f( [0, 0] ) = [0.51, 0.85]

Given a network structure, we define a function set: each choice of parameters gives one function.

60
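A NumPy sketch of the "given parameters θ, define a function" idea as a forward pass. The weights below are illustrative random parameters, not the ones in the slide's figure, so the outputs will differ from (0.62, 0.83).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Fully connected feedforward pass: a_{l+1} = sigma(W_l @ a_l + b_l)."""
    a = x
    for W, b in params:
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
# A 2 -> 2 -> 2 network; a different theta would define a different function
theta = [(rng.normal(size=(2, 2)), rng.normal(size=2)),
         (rng.normal(size=(2, 2)), rng.normal(size=2))]
print(forward(np.array([1.0, -1.0]), theta))
print(forward(np.array([0.0, 0.0]), theta))
```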

(61)

Fully Connected Feedforward Network

Input layer:  x1, x2, …, xN
Hidden layers:  Layer 1, Layer 2, …, Layer L (each node is a neuron)
Output layer:  y1, y2, …, yM

"Deep" means many hidden layers.

61

(62)

Why Deep? Universality Theorem

Any continuous function f: R^N → R^M can be realized by a network with only one hidden layer (given enough hidden neurons).

Then why "deep" and not "fat"?

62

(63)

Fat + Shallow v.s. Thin + Deep

Two networks with the same number of parameters

63

[Figure: a shallow, wide ("fat") network and a deep, narrow ("thin") network, both taking inputs x1, x2, …, xN.]

(64)

Why Deep

Logic circuits consist of gates.
- Two layers of logic gates can represent any Boolean function.
- But using multiple layers of logic gates makes some functions much simpler to build.

A neural network consists of neurons.
- A network with one hidden layer can represent any continuous function.
- But using multiple layers of neurons makes some functions much simpler to represent.

64

(65)

Deep = Many Hidden Layers

AlexNet (2012): 16.4%    VGG (2014): 7.3%    GoogleNet (2014): 6.7%    (error rate)

http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf

65

(66)

Deep = Many Hidden Layers

AlexNet (2012): 16.4%    VGG (2014): 7.3%    GoogleNet (2014): 6.7%    Residual Net (2015): 3.57%

Residual Net uses a special structure (illustrated as being as tall as Taipei 101).

66

(67)

Output Layer

Softmax layer as the output layer

An ordinary output layer:  y1 = σ(z1),  y2 = σ(z2),  y3 = σ(z3)

In general the output of the network can be any value, which may not be easy to interpret.

67

(68)

Output Layer

Softmax layer as the output layer:

y_i = e^(z_i) / Σ_{j=1}^{3} e^(z_j)

The outputs behave like probabilities:  1 > y_i > 0  and  Σ_i y_i = 1.

Example:  (z1, z2, z3) = (3, 1, −3)  →  (e^3, e^1, e^−3) ≈ (20, 2.7, 0.05)  →  (y1, y2, y3) ≈ (0.88, 0.12, ≈0)

68
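A minimal softmax sketch that reproduces the slide's example (3, 1, −3) → (0.88, 0.12, ≈0):

```python
import numpy as np

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j); subtracting max(z) for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

y = softmax(np.array([3.0, 1.0, -3.0]))
print(y.round(2))   # [0.88 0.12 0.  ]
print(y.sum())      # 1.0
```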

(69)

Example Application

Input: a 16 × 16 image = 256 dimensions (x1, x2, …, x256; ink → 1, no ink → 0)
Output: 10 dimensions (y1 = "is 1", y2 = "is 2", …, y10 = "is 0"); each dimension represents the confidence of a digit.

Example output: y1 = 0.1, y2 = 0.7, …, y10 = 0.2  →  the image is "2".

69

(70)

Example Application

Handwriting Digit Recognition:  Machine(image) = "2"

What is needed is a function with a 256-dimensional input vector and a 10-dimensional output vector (y1 = "is 1", …, y10 = "is 0"). The neural network is that function.

70

(71)

Example Application

Input layer (x1, …, xN) → hidden layers (Layer 1 … Layer L) → output layer (y1 = "is 1", …, y10 = "is 0") → "2"

The network structure defines a function set containing the candidates for handwriting digit recognition.
You need to decide the network structure so that a good function is in your function set.

71

(72)

FAQ

Q: How many layers? How many neurons for each layer?  →  Trial and error + intuition.
Q: Can the structure be automatically determined?  →  Yes, but not widely studied yet.
Q: Can we design the network structure?  →  Variants of neural networks (next lecture).

72

(73)

Three Steps for Deep Learning

Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function

73

(74)

Training Data

Preparing training data: images and their labels

The learning target is defined on the training data.

[Example images labeled "5", "0", "4", "1", "9", "2", "1", "3".]

74

(75)

Learning Target

Input: 16 × 16 = 256 dimensions (x1, …, x256; ink → 1, no ink → 0)
Output (through softmax): y1 ("is 1"), y2 ("is 2"), …, y10 ("is 0")

The learning target: if the input image is "1", y1 should have the maximum value; if the input is "2", y2 should have the maximum value; and so on.

75

(76)

Loss

For an input image with target "1", the target vector is (1, 0, 0, …, 0).

The loss l can be the square error or the cross entropy between the network output (y1, …, y10) and the target; the output should be as close to the target as possible.

Given a set of parameters, a good function should make the loss on all examples as small as possible.

76

(77)

Total Loss

For all training data x^1, x^2, x^3, …, x^R with targets ŷ^1, …, ŷ^R, each example contributes a loss l^r.

Total loss:  L = Σ_{r=1}^{R} l^r

Find the network parameters θ* that minimize the total loss L, i.e. find a function in the function set that minimizes L.

77
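A minimal sketch of the total loss L = Σ_r l^r, using cross entropy as the per-example loss l^r. The network outputs and one-hot targets below are placeholders.

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Per-example loss l^r between network output y and one-hot target y_hat."""
    return -np.sum(y_hat * np.log(y + 1e-12))

# Placeholder outputs of a network for R = 3 examples, and their one-hot targets
outputs = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1]), np.array([0.3, 0.3, 0.4])]
targets = [np.array([1., 0., 0.]),    np.array([0., 1., 0.]),    np.array([0., 0., 1.])]

L = sum(cross_entropy(y, y_hat) for y, y_hat in zip(outputs, targets))  # total loss
print(L)
```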

(78)

Three Steps for Deep Learning

Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function

78

(79)

How to pick the best function

Find the network parameters θ* that minimize the total loss L.

Network parameters:  θ = { w1, w2, w3, …, b1, b2, b3, … }

Enumerating all possible values is infeasible: e.g. in speech recognition a network may have 8 layers with 1000 neurons each; between two such layers there are 1000 × 1000 = 10^6 weights, so the network has millions of parameters.

79

(80)

Gradient Descent

Find the network parameters θ = { w1, w2, …, b1, b2, … } that minimize the total loss L.

Pick an initial value for each parameter w: random initialization or RBM pre-training (random is usually good enough).

80

(81)

Gradient Descent

Find the network parameters θ = { w1, w2, …, b1, b2, … } that minimize the total loss L.

- Pick an initial value for w
- Compute ∂L/∂w:
  if it is positive, decrease w; if it is negative, increase w

81

(82)

Gradient Descent

- Pick an initial value for w
- Compute ∂L/∂w and update:  w ← w − η · ∂L/∂w   (η is called the "learning rate")
- Repeat

Find the network parameters θ = { w1, w2, …, b1, b2, … } that minimize the total loss L.

82

(83)

Gradient Descent

- Pick an initial value for w
- Compute ∂L/∂w and update:  w ← w − η · ∂L/∂w
- Repeat until the update becomes small

Find the network parameters θ = { w1, w2, …, b1, b2, … } that minimize the total loss L.

83

(84)

Gradient Descent

Assume that θ has two variables {θ1, θ2}

84

(85)

Gradient Descent

[Figure: contour plot over two parameters (w1, w2); the color shows the total loss L.]
Randomly pick a starting point.

85

(86)

Gradient Descent

Compute ∂L/∂w1 and ∂L/∂w2, then move by (−η·∂L/∂w1, −η·∂L/∂w2).
Hopefully, we will reach a minimum.

86

(87)

Local Minima

[Figure: total loss as a function of a network parameter w.]

Gradient descent can be very slow at a plateau (∂L/∂w ≈ 0), and can get stuck at a saddle point or at a local minimum (∂L/∂w = 0).

87

(88)

Local Minima

Gradient descent never guarantees reaching the global minimum.

Different initial points can reach different minima, and therefore give different results.

88

(89)

Gradient Descent

This is the "learning" of machines in deep learning. Even AlphaGo uses this approach.
I hope you are not too disappointed :p
(What people imagine vs. what actually happens.)

89

(90)

Part I: Introduction to ML & DL

Basic Machine Learning

Basic Deep Learning

Toolkits and Learning Recipe

90

(91)

Deep Learning Toolkit

Backpropagation: an efficient way to compute ∂L/∂w in a neural network.

91
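The lecture does not derive backpropagation here; as a rough illustration of what it computes, the sketch below runs backprop by hand on a tiny made-up 2-2-1 sigmoid network with squared-error loss and checks one gradient entry numerically. All parameters are arbitrary assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny 2-2-1 network with made-up parameters, squared-error loss L = (y - t)^2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)
x, t = np.array([1.0, -1.0]), 1.0

def forward(W1, b1, W2, b2):
    a1 = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ a1 + b2)
    return a1, y

# Backpropagation: propagate dL/dz backward, layer by layer
a1, y = forward(W1, b1, W2, b2)
dz2 = 2 * (y - t) * y * (1 - y)        # dL/dz2
dW2 = np.outer(dz2, a1)                # dL/dW2
dz1 = (W2.T @ dz2) * a1 * (1 - a1)     # dL/dz1
dW1 = np.outer(dz1, x)                 # dL/dW1

# Numerical check of one entry of dL/dW1 (finite difference)
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
_, yp = forward(W1p, b1, W2, b2)
print(dW1[0, 0], ((yp - t) ** 2 - (y - t) ** 2) / eps)  # should closely agree
```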

(92)

Three Steps for Deep Learning

Deep learning is so simple:
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function

If you want to find a function, and you have lots of function input/output pairs as training data, you can use deep learning.

92

(93)

Keras

TensorFlow or Theano: very flexible, but they take some effort to learn.
Keras: an interface to TensorFlow or Theano; easy to learn and use, while still keeping some flexibility. You can modify it if you can write TensorFlow or Theano.

93

(94)

Keras

François Chollet is the author of Keras.

He currently works for Google as a deep learning engineer and researcher.

Keras means horn in Greek

Documentation: http://keras.io/

Example

https://github.com/fchollet/keras/tree/master/examples

Step-by-step lecture by Prof. Hung-Yi Lee

Slide

http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/Keras.pdf

Lecture recording:

https://www.youtube.com/watch?v=qetE6uUoLQA

94

(95)

Experience with using Keras (使用 Keras 心得)

Thanks to 沈昇勳 for providing the figures.

95

(96)

Example Application

Handwriting Digit Recognition

Machine “1”

“Hello world” for deep learning

MNIST Data: http://yann.lecun.com/exdb/mnist/

Keras provides a dataset-loading function: http://keras.io/datasets/

28 x 28

96
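A hedged sketch of this "hello world" in Keras. The layer sizes, activations, optimizer, and epoch count are illustrative choices rather than the exact configuration used in the lecture, and the code assumes the Keras 2 API.

```python
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# Load MNIST via Keras' dataset-loading function
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255
x_test = x_test.reshape(-1, 784).astype("float32") / 255
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

# Step 1: define a set of functions (the network structure)
model = Sequential()
model.add(Dense(500, activation="relu", input_dim=784))
model.add(Dense(500, activation="relu"))
model.add(Dense(10, activation="softmax"))

# Step 2: goodness of function (loss); Step 3: pick the best function (gradient descent)
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=100, epochs=20)
print(model.evaluate(x_test, y_test))
```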

(97)

Three Steps for Deep Learning

Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function

Deep learning is so simple.

97

(98)

Learning Recipe

After Steps 1-3:
- Good results on the training data? If NO, go back and adjust the three steps.
- If YES: good results on the testing data? If NO, deal with overfitting; if YES, done.

98

(99)

Overfitting

Possible solutions: more training samples; some tips, e.g. dropout.

99

(100)

Learning Recipe

Good results on the training data? Good results on the testing data?
Different approaches target different problems, e.g. dropout is for getting good results on the testing data.

100

(101)

Learning Recipe

For good results on the training data: choosing a proper loss, mini-batch training, new activation functions, adaptive learning rates, momentum.

101

(102)

Learning Recipe

Training data: pairs (x, ŷ).
Testing data: split into validation and real testing; the learned "best" function f* maps x to y.

102

(103)

Learning Recipe

Training data: pairs (x, ŷ).
Validation set: we immediately know the performance.
Real testing set: we do not know the performance.

103

(104)

Learning Recipe

If you do not get good results on the training set, modify the training process. Possible reasons:
- No good function exists: bad hypothesis function set → reconstruct the model architecture.
- A good function exists but cannot be found (e.g. local optima) → change the training strategy.

104

(105)

Learning Recipe

Get good results on the training set (if not, modify the training process), then check the dev/validation set; if the results are good there too, you are done.
Better performance on training but worse performance on dev means overfitting → apply techniques to prevent overfitting.

105

(106)

Concluding Remarks

Basic machine learning:
1. Define a set of functions
2. Measure the goodness of functions
3. Pick the best function

Basic deep learning: stacked functions (neural networks).

106

(107)

Talk Outline

Part I: Introduction to Machine Learning & Deep Learning
Part II: Variants of Neural Nets
Part III: Beyond Supervised Learning & Recent Trends

107

(108)

PART II

Variants of Neural Networks

108

(109)

PART II: Variants of Neural Networks

Convolutional Neural Network (CNN)

Recurrent Neural Network (RNN)

109

(110)

PART II: Variants of Neural Networks

Convolutional Neural Network (CNN)

Recurrent Neural Network (RNN)

Widely used in image processing

110

(111)

Why CNN for Image

Can the network be simplified by considering the properties of images?

[Figure: an image represented as pixels x1, x2, …, xN feeds a deep network; the first layer learns the most basic classifiers, the second layer uses the first layer's outputs as modules to build classifiers, and so on. (Zeiler, M. D., ECCV 2014)]

111

(112)

Why CNN for Image

Some patterns are much smaller than the whole image.
A neuron does not have to see the whole image to discover the pattern (e.g. a "beak" detector); connecting to a small region requires fewer parameters.

112

(113)

Why CNN for Image

The same patterns appear in different regions.
An "upper-left beak" detector and a "middle beak" detector do almost the same thing, so they can use the same set of parameters.

113

(114)

Why CNN for Image

Subsampling the pixels will not change the object: a subsampled bird is still a bird.
We can subsample the pixels to make the image smaller, so the network needs fewer parameters to process it.

114

(115)

Three Steps for Deep Learning

Step 1: define a set of functions → Convolutional Neural Network
Step 2: goodness of function
Step 3: pick the best function

Deep learning is so simple.

115

(116)

Image Recognition

116

http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf

(117)

The Whole CNN

Input image → Convolution → Max Pooling → Convolution → Max Pooling (these two stages can repeat many times) → Flatten → Fully Connected Feedforward Network → "cat", "dog", …

117

(118)

The Whole CNN

Convolution → Max Pooling → Convolution → Max Pooling (can repeat many times) → Flatten

Property 1: some patterns are much smaller than the whole image → handled by convolution.
Property 2: the same patterns appear in different regions → handled by convolution.
Property 3: subsampling the pixels does not change the object → handled by max pooling.

118
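A hedged Keras sketch mirroring the Convolution → Max Pooling → Convolution → Max Pooling → Flatten → fully connected pipeline. The filter counts, kernel sizes, and layer widths are illustrative choices, and the code assumes the Keras 2 API.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
# Convolution + Max Pooling, which can repeat many times
model.add(Conv2D(25, (3, 3), activation="relu", input_shape=(28, 28, 1)))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(50, (3, 3), activation="relu"))
model.add(MaxPooling2D((2, 2)))
# Flatten, then a fully connected feedforward network
model.add(Flatten())
model.add(Dense(100, activation="relu"))
model.add(Dense(10, activation="softmax"))
model.summary()
```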

(119)

Image Recognition

119

(120)

Local Connectivity

120

Neurons connect to a small region

(121)

Parameter Sharing

The same feature in different positions

121

Neurons share the same weights

(122)

Parameter Sharing

Different features in the same position

122

Neurons have different weights

(123)

Convolutional Layers

123

[Figure: input and output volumes of a convolutional layer, described by width, height, and depth; the weights are shared across spatial positions.]

(124)

Convolutional Layers

124

[Figure: a convolution example mapping an input of depth 1 (a1, a2, a3) to an output of depth 2 (b1, b2 and c1, c2).]

(125)

Convolutional Layers

125

[Figure: a convolutional layer mapping an input of depth 2 to an output of depth 2.]

(126)

Convolutional Layers

126

[Figure: the same depth-2-to-depth-2 convolutional layer, showing another position of the shared filter.]

(127)

Convolutional Layers

127


(128)

Hyper-parameters of CNN

Stride: how far the filter moves at each step (e.g. stride = 1 or stride = 2).
Zero padding: how many zeros are padded around the border (e.g. padding = 0 or padding = 1).

128

(129)

Example

129

Input volume 7×7×3 (with padding = 1), filter 3×3×3, stride = 2 → output volume 3×3×2.

http://cs231n.github.io/convolutional-networks/
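The spatial output size follows the usual formula (W − F + 2P) / S + 1. In the cs231n figure, the 7×7×3 volume shown appears to already include the zero padding around a 5×5×3 input, which is why the output comes out as 3×3. A small sketch under that assumption:

```python
def conv_output_size(W, F, P, S):
    """Spatial output size of a convolutional layer: (W - F + 2P) / S + 1."""
    assert (W - F + 2 * P) % S == 0, "filter does not tile the input evenly"
    return (W - F + 2 * P) // S + 1

# cs231n example: 5x5 input, 3x3 filter, padding 1, stride 2 -> 3x3 output (depth = number of filters)
print(conv_output_size(W=5, F=3, P=1, S=2))   # 3
```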

(130)

Convolutional Layers

130

http://cs231n.github.io/convolutional-networks/

(131)

Convolutional Layers

131

http://cs231n.github.io/convolutional-networks/

(132)

Convolutional Layers

132

http://cs231n.github.io/convolutional-networks/

(133)

Convolutional Layers

133

http://cs231n.github.io/convolutional-networks/

(134)

Pooling Layer

134

Input (depth = 1):
  1 3 2 4
  5 7 6 8
  0 0 3 3
  5 5 0 0

Non-overlapping 2×2 pooling (no weights to learn):
  Maximum pooling:  Max(1,3,5,7) = 7, Max(2,4,6,8) = 8, Max(0,0,5,5) = 5, Max(3,3,0,0) = 3  →  output  7 8 / 5 3
  Average pooling:  Avg(1,3,5,7) = 4, Avg(2,4,6,8) = 5, …
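A NumPy sketch of non-overlapping 2×2 max and average pooling on the 4×4 example above (the average values for the lower blocks are computed here; they were not clearly recoverable from the slide text):

```python
import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [0, 0, 3, 3],
              [5, 5, 0, 0]], dtype=float)

# Reshape into a 2x2 grid of non-overlapping 2x2 blocks, then reduce each block
blocks = x.reshape(2, 2, 2, 2).swapaxes(1, 2)
print(blocks.max(axis=(2, 3)))    # [[7. 8.] [5. 3.]]
print(blocks.mean(axis=(2, 3)))   # [[4.  5. ] [2.5 1.5]]
```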

(135)

Why “Deep” Learning?

135
