Slide credit from Hung-Yi Lee and Mark Chang
Talk Outline
Part I: Introduction to
Machine Learning & Deep Learning Part II: Variants of Neural Nets
Part III: Beyond Supervised Learning
& Recent Trends
2
Talk Outline
Part I: Introduction to
Machine Learning & Deep Learning Part II: Variants of Neural Nets
Part III: Beyond Supervised Learning
& Recent Trends
3
Introduction to Machine Learning &
Deep Learning
PART I
4
Part I: Introduction to ML & DL
Basic Machine Learning
Basic Deep Learning
Toolkits and Learning Recipe
5
Part I: Introduction to ML & DL
Basic Machine Learning
Basic Deep Learning
Toolkits and Learning Recipe
6
Machine Learning
Machine learning has been rising rapidly in recent years
7
Recent Trend
8
What Can Computers Do?
Programs do the things you tell them to do
9
Program for Solving Tasks
Task: predicting positive or negative given a product review
“I love this product!” → +   “It claims too much.” → −   “It’s a little expensive.” → ?
“台灣第一波上市!” (“First batch on sale in Taiwan!”) → 推 (upvote)   “規格好雞肋…” (“The specs are pretty useless…”) → 噓 (downvote)   “樓下買了我才考慮” (“I’ll only consider it after someone else buys it”) → ?
program.py:
  if input contains “love”, “like”, etc. → output = positive
  if input contains “too much”, “bad”, etc. → output = negative
Some tasks are complex, and we don’t know how to write a program to solve them.
10
Learning ≈ Looking for a Function
Task: predicting positive or negative given a product review
“I love this product!” → +   “It claims too much.” → −   “It’s a little expensive.” → ?
“台灣第一波上市!” → 推   “規格好雞肋…” → 噓   “樓下買了我才考慮” → ?
f(review) = label
Given a large amount of data, the machine learns what the function f should be.
11
Learning ≈ Looking for a Function
Speech Recognition: f(audio) = “你好”
Image Recognition: f(image) = “cat”
Go Playing: f(board position) = “5-5” (next move)
Dialogue System: f(“台積電怎麼去” / “How do I get to TSMC?”) = “地址為…現在建議搭乘計程車” (“The address is …; I suggest taking a taxi now”)
12
Framework
Image Recognition: f(image) = “cat”
Model: a set of functions f1, f2, …
e.g. candidate functions f1 and f2, each mapping an image to a label such as “cat”, “dog”, “monkey”, or “snake”
13
Framework
Image Recognition: f(image) = “cat”
Model: a set of functions f1, f2, …
Training Data — function inputs: images; function outputs: labels such as “monkey”, “cat”, “dog” (Supervised Learning)
Goodness of function f: evaluated on the training data to decide which function is better.
14
Framework
Image Recognition: f(image) = “cat”
Step 1 — Model: a set of functions f1, f2, …
Step 2 — Goodness of function f, measured on the training data (“monkey”, “cat”, “dog”)
Step 3 — Pick the “best” function f*
Training is to pick the best function given the observed data (steps 1–3); testing is to predict the label using the learned function (using f*, f*(image) = “cat”).
15
Why Learn Machine Learning?
The AI age: AI may take over much of today’s labor.
A new job market: AI trainers (Machine Learning Expert 機器學習專家, Data Scientist 資料科學家)
16
AI Trainers (AI 訓練師)
Machines can learn by themselves, so why do we need AI trainers?
Battles are fought by the Pokémon, so why do we need Pokémon trainers?
17
AI Trainers
A Pokémon trainer:
  picks suitable Pokémon for battle — Pokémon have different attributes
  the summoned Pokémon may not obey (e.g. Ash’s Charizard)
  needs enough experience
An AI trainer:
  in step 1, picks a suitable model — different models suit different problems
  may fail to find the best function in step 3 (e.g. deep learning)
  needs enough experience
18
AI Trainers
A powerful AI owes much to its AI trainer.
Let’s work toward becoming AI trainers.
19
Machine Learning Map
Scenario: Supervised Learning, Semi-Supervised Learning, Transfer Learning, Unsupervised Learning, Reinforcement Learning
Task: Regression, Classification
Method: Linear Model; Non-linear Model (Deep Learning, SVM, Decision Tree, KNN, etc.)
20
Machine Learning Map
Scenario: Supervised Learning, Semi-Supervised Learning, Transfer Learning, Unsupervised Learning, Reinforcement Learning
Task: Regression, Classification
Method: Linear Model; Non-linear Model (Deep Learning, SVM, Decision Tree, KNN, etc.)
Regression: the output of the target function f is a scalar (a numeric value).
21
Regression
Stock Market Forecast: f(…) = the Dow Jones Industrial Average tomorrow
Self-driving Car: f(…) = steering wheel angle
Recommendation: f(user A, product B) = probability that user A buys product B
22
Example Application
Estimating the Combat Power (CP) of a Pokémon after evolution
f(x) = y, where y is the CP after evolution
The input x is a Pokémon, represented by attributes such as x_cp (CP before evolution), x_hp, x_w, x_h, and x_s.
23
Step 1: Model
Define a set of functions (the model) for f(x) = y, the CP after evolution.
Linear model: y = b + Σ_i w_i · x_i
  x_i: an attribute (feature) of the input x;  w_i: weight;  b: bias (w and b can be any value)
Using a single feature: y = b + w · x_cp
  f1: y = 10.0 + 9.0 · x_cp
  f2: y = 9.8 + 9.2 · x_cp
  f3: y = −0.8 − 1.2 · x_cp
  …… infinitely many functions
24
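As an illustration (not part of the original slides), each choice of (w, b) picks one member of this function set; a minimal Python sketch:

```python
def make_linear_model(w, b):
    """One member f of the function set y = b + w * x_cp."""
    def f(x_cp):
        return b + w * x_cp
    return f

# Three members of the (infinite) function set listed on the slide
f1 = make_linear_model(9.0, 10.0)    # y = 10.0 + 9.0 * x_cp
f2 = make_linear_model(9.2, 9.8)     # y = 9.8  + 9.2 * x_cp
f3 = make_linear_model(-1.2, -0.8)   # y = -0.8 - 1.2 * x_cp
print(f1(100.0))                     # predicted CP after evolution for x_cp = 100
```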
Step 2: Goodness of Function
Model: y = b + w · x_cp (the set of functions f1, f2, …)
Training Data — function input: x^1, x^2, …; function output (scalar): ŷ^1, ŷ^2, …
25
Step 2: Goodness of Function
Training data (10 Pokémon):
  1st Pokémon: (x^1, ŷ^1)
  2nd Pokémon: (x^2, ŷ^2)
  ……
  10th Pokémon: (x^10, ŷ^10)
(Scatter plot: each point (x_cp^n, ŷ^n) is one Pokémon. This is real data.)
Source: https://www.openintro.org/stat/data/?data=pokemon
26
Step 2: Goodness of Function
Model: y = b + w · x_cp
Loss function L — input: a function, output: how bad it is
  L(f) = L(w, b) = Σ_{n=1}^{10} ( ŷ^n − (b + w · x_cp^n) )²
where b + w · x_cp^n is the estimated y based on the input function, ŷ^n − (b + w · x_cp^n) is the estimation error, and the sum runs over the examples.
27
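A minimal sketch of this loss in Python; the (x_cp, ŷ) pairs below are placeholders, while the real ones come from the Pokémon data set cited above:

```python
# Hypothetical (x_cp^n, y_hat^n) training pairs (placeholders, not the real data)
data = [(10.0, 131.0), (20.0, 189.0), (30.0, 215.0)]

def loss(w, b, data):
    """L(w, b) = sum_n ( y_hat^n - (b + w * x_cp^n) )^2"""
    return sum((y_hat - (b + w * x_cp)) ** 2 for x_cp, y_hat in data)

print(loss(9.0, 10.0, data))   # how bad is f1: y = 10.0 + 9.0 * x_cp ?
```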
Step 2: Goodness of Function
Loss function: L(w, b) = Σ_{n=1}^{10} ( ŷ^n − (b + w · x_cp^n) )²
(Figure: each point in the (w, b) plane is one function; the color represents L(w, b). The marked point corresponds to y = −180 − 2 · x_cp.)
28
Step 3: Best Function
Pick the “best” function from the set:
  f* = arg min_f L(f)
  w*, b* = arg min_{w,b} L(w, b) = arg min_{w,b} Σ_{n=1}^{10} ( ŷ^n − (b + w · x_cp^n) )²
29
Step 3: Gradient Descent
Consider a loss function L(w) with one parameter w:  w* = arg min_w L(w)
  (Randomly) pick an initial value w^0
  Compute dL/dw |_{w=w^0}
30
Step 3: Gradient Descent
Consider a loss function L(w) with one parameter w:  w* = arg min_w L(w)
  (Randomly) pick an initial value w^0
  Compute dL/dw |_{w=w^0}
  Update: w^1 ← w^0 − η · dL/dw |_{w=w^0}
31
Step 3: Gradient Descent
Consider a loss function L(w) with one parameter w:  w* = arg min_w L(w)
  (Randomly) pick an initial value w^0
  Compute dL/dw |_{w=w^0}, update w^1 ← w^0 − η · dL/dw |_{w=w^0}
  Compute dL/dw |_{w=w^1}, update w^2 ← w^1 − η · dL/dw |_{w=w^1}
  …… After many iterations we obtain w^0, w^1, w^2, …, w^T.
32
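A sketch of the one-parameter update rule; the toy loss L(w) = (w − 3)² below is an assumption made only so the derivative is easy to write, not the regression loss from the slides:

```python
def dL_dw(w):
    # Derivative of the toy loss L(w) = (w - 3)**2
    return 2 * (w - 3)

eta = 0.1      # learning rate
w = -4.0       # (randomly) picked initial value w^0
for t in range(100):
    w = w - eta * dL_dw(w)   # w^{t+1} <- w^t - eta * dL/dw
print(w)       # approaches the minimizer w* = 3
```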
Step 3: Gradient Descent
How about two parameters?  w*, b* = arg min_{w,b} L(w, b)
  (Randomly) pick initial values w^0, b^0
  Compute ∂L/∂w |_{w=w^0, b=b^0} and ∂L/∂b |_{w=w^0, b=b^0}, then update
    w^1 ← w^0 − η · ∂L/∂w |_{w=w^0, b=b^0},   b^1 ← b^0 − η · ∂L/∂b |_{w=w^0, b=b^0}
  Compute ∂L/∂w |_{w=w^1, b=b^1} and ∂L/∂b |_{w=w^1, b=b^1}, then update
    w^2 ← w^1 − η · ∂L/∂w |_{w=w^1, b=b^1},   b^2 ← b^1 − η · ∂L/∂b |_{w=w^1, b=b^1}
  The gradient is ∇L = [ ∂L/∂w , ∂L/∂b ]ᵀ.
33
Step 3: Gradient Descent
(Figure: the gradient-descent path in the (b, w) plane; the color shows the value of the loss L(w, b).)
34
Step 3: Gradient Descent
In linear regression the loss function L(w, b) is convex, so gradient descent cannot get stuck in a local optimum.
35
Step 3: Gradient Descent
Formulation of ∂L/∂w and ∂L/∂b:
  L(w, b) = Σ_{n=1}^{10} ( ŷ^n − (b + w · x_cp^n) )²
  ∂L/∂w = Σ_{n=1}^{10} 2 ( ŷ^n − (b + w · x_cp^n) ) ( −x_cp^n )
  ∂L/∂b = ?
36
Step 3: Gradient Descent
Formulation of ∂L/∂w and ∂L/∂b:
  L(w, b) = Σ_{n=1}^{10} ( ŷ^n − (b + w · x_cp^n) )²
  ∂L/∂w = Σ_{n=1}^{10} 2 ( ŷ^n − (b + w · x_cp^n) ) ( −x_cp^n )
  ∂L/∂b = Σ_{n=1}^{10} 2 ( ŷ^n − (b + w · x_cp^n) ) ( −1 )
37
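Putting these two partial derivatives to work, a small gradient-descent sketch for the regression; the data pairs and learning rate are placeholders, not the slides’ actual Pokémon data or settings:

```python
# Hypothetical (x_cp^n, y_hat^n) pairs standing in for the 10 training Pokémon
data = [(10.0, 131.0), (20.0, 189.0), (30.0, 215.0), (40.0, 300.0)]

w, b, eta = 0.0, 0.0, 1e-4   # initial parameters and learning rate
for t in range(10000):
    dw = sum(2 * (y - (b + w * x)) * (-x) for x, y in data)   # dL/dw
    db = sum(2 * (y - (b + w * x)) * (-1) for x, y in data)   # dL/db
    w, b = w - eta * dw, b - eta * db                          # gradient step
print(w, b)   # the learned parameters of y = b + w * x_cp
```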
Learned Model
y = b + w · x_cp with b = −188.4, w = 2.7
Average error on the training data (over the errors e^1, e^2, …, e^10 of the 10 training Pokémon) = 31.9
38
Model Generalization
y = b + w · x_cp with b = −188.4, w = 2.7
What we really care about is the error on new data (testing data).
Using another 10 Pokémon as testing data, the average error on the testing data = 35.0, which is larger than the average error on the training data (31.9).
How can we do better?
39
Model Generalization
Select another model: y = b + w1 · x_cp + w2 · x_cp²
Best function: b = −10.3, w1 = 1.0, w2 = 2.7 × 10⁻³
Average error: 15.4 on training data, 18.4 on testing data.
Better! Could it be even better?
40
Model Generalization
Select another model: y = b + w1 · x_cp + w2 · x_cp² + w3 · x_cp³
Best function: b = 6.4, w1 = 0.66, w2 = 4.3 × 10⁻³, w3 = 1.8 × 10⁻⁶
Average error: 15.3 on training data, 18.1 on testing data.
Slightly better. How about a more complex model?
41
Model Generalization
Select another model: y = b + w1 · x_cp + w2 · x_cp² + w3 · x_cp³ + w4 · x_cp⁴
Average error: 14.9 on training data, but 28.8 on testing data.
The results become worse.
42
Model Generalization
Select another model: y = b + w1 · x_cp + w2 · x_cp² + w3 · x_cp³ + w4 · x_cp⁴ + w5 · x_cp⁵
Average error: 12.8 on training data, but 232.1 on testing data.
The results are very bad.
43
Model Selection
The five models, from simplest to most complex:
  1. y = b + w · x_cp
  2. y = b + w1 · x_cp + w2 · x_cp²
  3. y = b + w1 · x_cp + w2 · x_cp² + w3 · x_cp³
  4. y = b + w1 · x_cp + w2 · x_cp² + w3 · x_cp³ + w4 · x_cp⁴
  5. y = b + w1 · x_cp + w2 · x_cp² + w3 · x_cp³ + w4 · x_cp⁴ + w5 · x_cp⁵
A more complex model yields lower error on the training data, if we can truly find the best function.
44
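A quick way to see this pattern is to fit polynomials of increasing degree; the sketch below uses synthetic data (an assumption made purely for illustration), so the numbers differ from the slides, but the qualitative behaviour — training error keeps dropping while testing error eventually grows — is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, 10)                       # 10 synthetic training points
y_train = 3.0 * x_train + 5.0 + rng.normal(0, 2, 10)
x_test = rng.uniform(0, 10, 10)                        # 10 synthetic testing points
y_test = 3.0 * x_test + 5.0 + rng.normal(0, 2, 10)

for degree in range(1, 6):
    coeffs = np.polyfit(x_train, y_train, degree)      # "best function" for this model
    err_train = np.abs(np.polyval(coeffs, x_train) - y_train).mean()
    err_test = np.abs(np.polyval(coeffs, x_test) - y_test).mean()
    # Training error generally decreases with degree; testing error often rises.
    print(degree, round(err_train, 2), round(err_test, 2))
```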
Machine Learning Map
Scenario: Supervised Learning, Semi-Supervised Learning, Transfer Learning, Unsupervised Learning, Reinforcement Learning
Task: Regression, Classification
45
Classification
Binary classification: f(input) = Yes / No
Multi-class classification: f(input) = Class 1, Class 2, …, or Class N
46
Binary Classification – Spam Filtering
(http://spam-filter-review.toptenreviews.com/)
Model: f(e-mail) = 1 (Yes, spam) or 0 (No, not spam)
Simple features: e.g. whether words such as “free” or “talk” appear in the e-mail.
47
Multi-Class – Image Recognition
Model: f(image) = “monkey”, “cat”, or “dog”
48
Multi-Class – Topic Classification
http://top-breaking-news.com/
Model: f(document) = politics (政治), sports (體育), or finance (財經)
Simple features: e.g. whether words such as “president” or “stock” appear in the document.
49
Machine Learning Map
Scenario: Supervised Learning, Semi-Supervised Learning, Transfer Learning, Unsupervised Learning, Reinforcement Learning
Task: Regression, Classification
Method: Linear Model; Non-linear Model (Deep Learning, SVM, Decision Tree, KNN, etc.)
50
Part I: Introduction to ML & DL
Basic Machine Learning
Basic Deep Learning
Toolkits and Learning Recipe
51
Stacked Functions Learned by Machine
A deep learning model f is a very complex function built by stacking simple functions f1, f2, f3, …, like a production line (生產線):
  “台灣第一波上市!” → f1 → f2 → f3 → 推
End-to-end training: what each function should do is learned automatically.
Deep learning usually refers to neural-network-based models.
52
Three Steps for Deep Learning
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
53
Three Steps for Deep Learning
Step 1: define a set of functions ← Neural Network
Step 2: goodness of function
Step 3: pick the best function
54
Neural Network
A neuron is a simple function:
  z = a1 w1 + … + ak wk + … + aK wK + b
  a = σ(z)
a1, …, aK: inputs;  w1, …, wK: weights;  b: bias;  σ: activation function.
55
Neural Network
Example neuron with the Sigmoid activation function σ(z) = 1 / (1 + e^(−z)): with the weights, bias, and inputs shown on the slide, z = 4 and the output is σ(4) ≈ 0.98.
56
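A one-neuron sketch of this computation; the particular inputs, weights, and bias below are just one assignment that gives z = 4 (an assumption, since the slide figure cannot be reproduced here):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """A single neuron: z = sum_k a_k * w_k + b, output a = sigmoid(z)."""
    z = sum(a * w for a, w in zip(inputs, weights)) + bias
    return sigmoid(z)

print(neuron([1, -1], [1, -2], 1))   # z = 4, sigmoid(4) ≈ 0.98
```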
Neural Network
Different connections lead to different network structures.
The weights and biases are the network parameters θ; each neuron has its own weights and bias.
57
Fully Connected Feedforward Network
With the Sigmoid activation σ(z) = 1 / (1 + e^(−z)): for the input (1, −1), the two first-layer neurons compute z = 4 and z = −2, giving outputs σ(4) ≈ 0.98 and σ(−2) ≈ 0.12.
58
Fully Connected Feedforward Network
Feeding the input (1, −1) through the whole network: the first hidden layer outputs (0.98, 0.12), the next layer outputs (0.86, 0.11), and the final output is (0.62, 0.83).
59
Fully Connected Feedforward Network
The network is a function mapping an input vector to an output vector, e.g.
  f(1, −1) = (0.62, 0.83)   f(0, 0) = (0.51, 0.85)
Given the parameters θ, the network defines one function; given only the network structure, it defines a function set.
60
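A forward-pass sketch of a small fully connected network with Sigmoid activations; the weight matrices below are illustrative placeholders rather than the exact values drawn on the slide, the point being only the layer-by-layer rule a = σ(W a_prev + b):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """layers: list of (W, b) pairs; returns the network's output vector."""
    a = np.asarray(x, dtype=float)
    for W, b in layers:
        a = sigmoid(W @ a + b)   # one fully connected layer
    return a

# A hypothetical 2-2-2 network (weights chosen only for illustration)
layers = [
    (np.array([[1.0, -2.0], [-1.0, 1.0]]), np.array([1.0, 0.0])),
    (np.array([[2.0, -1.0], [-2.0, -1.0]]), np.array([0.0, 1.0])),
]
print(forward([1, -1], layers))   # one input vector in, one output vector out
```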
Fully Connected Feedforward Network
Input layer (x1, x2, …, xN) → hidden layers (Layer 1, Layer 2, …, Layer L) → output layer (y1, y2, …, yM)
Each node is a neuron; “deep” means many hidden layers.
61
Why Deep? Universality Theorem
Any continuous function f: R^N → R^M can be realized by a network with a single hidden layer (given enough hidden neurons).
So why “deep” and not “fat”?
62
Fat + Shallow v.s. Thin + Deep
Compare two networks with the same number of parameters: one shallow and wide, one thin and deep.
63
Why Deep
Logic circuits consist of gates:
  Two layers of logic gates can represent any Boolean function.
  Using multiple layers of logic gates makes building some functions much simpler.
Neural networks consist of neurons:
  A network with one hidden layer can represent any continuous function.
  Using multiple layers of neurons makes representing some functions much simpler.
64
Deep = Many Hidden Layers
ImageNet error rates: AlexNet (2012) 16.4%, VGG (2014) 7.3%, GoogleNet (2014) 6.7%
http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf
65
Deep = Many Hidden Layers
AlexNet (2012)
VGG (2014)
GoogleNet (2014)
3.57%
Residual Net (2015)
Taipei 101
16.4% 7.3% 6.7%
Special structure
66
Output Layer
Softmax layer as the output layer.
In an ordinary output layer, y_i = σ(z_i); the outputs can take any value and may not be easy to interpret.
67
Output Layer
Softmax layer as the output layer:
  y_i = e^{z_i} / Σ_{j=1}^{3} e^{z_j}
The outputs act like probabilities: 1 > y_i > 0 and Σ_i y_i = 1.
Example: z = (3, 1, −3) → (e^3, e^1, e^{−3}) ≈ (20, 2.7, 0.05) → y ≈ (0.88, 0.12, ≈0)
68
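A direct sketch of the softmax computation that reproduces the slide’s example:

```python
import math

def softmax(zs):
    """y_i = exp(z_i) / sum_j exp(z_j)"""
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([3, 1, -3]))   # ≈ [0.88, 0.12, ~0]
```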
Example Application
Input: a 16 x 16 image, i.e. 256 values x1, x2, …, x256 (ink → 1, no ink → 0)
Output: y1, y2, …, y10, where each dimension represents the confidence of a digit (is 1, is 2, …, is 0)
Example output: 0.1 for “is 1”, 0.7 for “is 2”, 0.2 for another digit → the image is “2”.
69
Example Application
Handwriting Digit Recognition: Machine(image) = “2”
What is needed is a function whose input is a 256-dim vector and whose output is a 10-dim vector (is 1, is 2, …, is 0): a neural network.
70
Example Application
Input layer (x1, x2, …, xN) → hidden layers (Layer 1, …, Layer L) → output layer (y1, y2, …, y10: is 1, is 2, …, is 0), e.g. the output is “2”.
The network structure defines a function set containing the candidates for handwriting digit recognition.
You need to decide the network structure so that a good function is in your function set.
71
FAQ
Q: How many layers? How many neurons for each layer? — Trial and error + intuition.
Q: Can we design the network structure? — Yes; see the variants of neural networks (next lecture).
Q: Can the structure be automatically determined? — Yes, but this is not widely studied yet.
72
Three Steps for Deep Learning
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
73
Training Data
Preparing training data: images and their labels, e.g. “5”, “0”, “4”, “1”, “1”, “3”, “9”, “2”.
The learning target is defined on the training data.
74
Learning Target
Input: a 16 x 16 image, 256 values (ink → 1, no ink → 0); output: y1, …, y10 through a Softmax layer (is 1, is 2, …, is 0).
The learning target: for an input image of “1”, y1 should have the maximum value; for an input image of “2”, y2 should have the maximum value; and so on.
75
Loss
Given a set of parameters, the network (with a Softmax output) produces y1, …, y10 for an input image.
For an image of “1”, the target is the one-hot vector (1, 0, …, 0); the output should be as close as possible to the target.
The loss l can be the square error or the cross entropy between the network output and the target.
A good function should make the loss of all examples as small as possible.
76
Total Loss
For all R training examples x1, x2, x3, …, xR with targets ŷ1, ŷ2, ŷ3, …, ŷR, the network outputs y1, y2, y3, …, yR with per-example losses l1, l2, l3, …, lR.
Total loss: L = Σ_{r=1}^{R} l^r
Find the network parameters θ* — i.e. the function in the function set — that make the total loss L as small as possible.
77
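A sketch of the total loss with cross entropy as the per-example loss; the predictions and one-hot targets below are invented for illustration (in practice they come from the network outputs and the labels):

```python
import math

def cross_entropy(y, target):
    """l = -sum_i target_i * log(y_i), with y the softmax output and target one-hot."""
    return -sum(t * math.log(p) for p, t in zip(y, target) if t > 0)

def total_loss(outputs, targets):
    """L = sum_r l^r over all R training examples."""
    return sum(cross_entropy(y, t) for y, t in zip(outputs, targets))

# Two hypothetical training examples with 3 classes
outputs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
targets = [[1, 0, 0], [0, 1, 0]]
print(total_loss(outputs, targets))
```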
Three Steps for Deep Learning
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
78
How to pick the best function
Find the network parameters θ* that minimize the total loss L, where θ = {w1, w2, w3, …, b1, b2, b3, …}.
Enumerating all possible values is hopeless: e.g. a speech recognition network with 8 layers of 1000 neurons each has 1000 × 1000 = 10⁶ weights between two adjacent layers, i.e. millions of parameters in total.
79
Gradient Descent
Find the network parameters θ* = {w1, w2, …, b1, b2, …} that minimize the total loss L.
Pick an initial value for each parameter w (random initialization or RBM pre-training; random is usually good enough).
80
Gradient Descent
Pick an initial value for w and compute ∂L/∂w.
If ∂L/∂w is positive, decrease w; if it is negative, increase w.
81
Gradient Descent
Repeat: w ← w − η · ∂L/∂w, where η is called the “learning rate”.
82
Gradient Descent
Repeat w ← w − η · ∂L/∂w until the update is little (∂L/∂w is close to zero).
83
Gradient Descent
Assume that θ has two variables {θ1, θ2}
84
Gradient Descent
𝑤1 𝑤2
Color: Value of Total Loss L
Randomly pick a starting point
85
Gradient Descent
Compute ∂L/∂w1 and ∂L/∂w2, then move by (−η · ∂L/∂w1, −η · ∂L/∂w2) in the (w1, w2) plane.
Hopefully, we eventually reach a minimum…
86
Local Minima
Gradient descent can be very slow on a plateau (∂L/∂w ≈ 0) and can get stuck at a saddle point or a local minimum (∂L/∂w = 0).
87
Local Minima
Gradient descent never guarantees reaching the global minimum: different initial points can reach different minima and thus give different results.
88
Gradient Descent
This is the “learning” of machines in deep learning……
Even AlphaGo uses this approach.
I hope you are not too disappointed :p
(What people imagine vs. what actually happens.)
89
Part I: Introduction to ML & DL
Basic Machine Learning
Basic Deep Learning
Toolkits and Learning Recipe
90
Deep Learning Toolkit
Backpropagation: an efficient way to compute ∂L/∂w in a neural network.
91
Three Steps for Deep Learning
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
Deep learning is so simple……
Now, if you want to find a function and you have lots of function input/output pairs as training data, you can use deep learning.
92
Keras
Keras is an interface to TensorFlow or Theano.
TensorFlow and Theano: very flexible, but they need some effort to learn.
Keras: easy to learn and use (while keeping some flexibility); you can modify it if you can write TensorFlow or Theano.
93
Keras
François Chollet is the author of Keras. He currently works for Google as a deep learning engineer and researcher.
Keras means horn in Greek.
Documentation: http://keras.io/
Examples: https://github.com/fchollet/keras/tree/master/examples
Step-by-step lecture by Prof. Hung-Yi Lee:
  Slides: http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/Keras.pdf
  Lecture recording: https://www.youtube.com/watch?v=qetE6uUoLQA
94
Impressions of Using Keras (使用 Keras 心得)
(Thanks to 沈昇勳 for providing the figure.)
95
Example Application
Handwriting Digit Recognition: Machine(28 x 28 image) = “1”
This is the “Hello world” of deep learning.
MNIST data: http://yann.lecun.com/exdb/mnist/
Keras provides a data set loading function: http://keras.io/datasets/
96
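A minimal Keras sketch for this “Hello world” task, in the spirit of the Keras lecture linked above; the layer sizes and training settings here are illustrative choices, not the ones prescribed by the slides:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
from keras.datasets import mnist

# Load MNIST and flatten each 28 x 28 image into a 784-dim vector
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255
y_train = to_categorical(y_train, 10)   # one-hot targets
y_test = to_categorical(y_test, 10)

# Step 1: define a set of functions (the network structure)
model = Sequential()
model.add(Dense(500, input_dim=784, activation='sigmoid'))
model.add(Dense(500, activation='sigmoid'))
model.add(Dense(10, activation='softmax'))

# Step 2: goodness of function (cross-entropy loss)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Step 3: pick the best function (gradient descent)
model.fit(x_train, y_train, batch_size=100, epochs=20)
print(model.evaluate(x_test, y_test))
```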
Three Steps for Deep Learning
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
Deep learning is so simple……
97
Learning Recipe
After the three steps (define a set of functions → goodness of function → pick the best function), check:
  Good results on training data? If NO, go back and modify the three steps.
  If YES, good results on testing data? If NO, go back; if YES, done.
98
Overfitting
Possible solutions: more training samples; some tips such as dropout, etc.
99
Learning Recipe
Good results on training data? Good results on testing data?
Different approaches are used for different problems, e.g. dropout for good results on testing data.
100
Learning Recipe
Tips toward good results: choosing a proper loss, mini-batch, new activation functions, adaptive learning rate, momentum.
101
Learning Recipe
Training data: pairs (x, ŷ) used to learn the “best” function f*.
102
Learning Recipe
Testing data: a validation set, for which we immediately know the performance, and the real testing data, for which we do not know the performance.
103
Learning Recipe
If we do not get good results on the training set, modify the training process. Possible reasons:
  no good function exists (a bad hypothesis function set) → reconstruct the model architecture
  we cannot find a good function (e.g. stuck in a local optimum) → change the training strategy
104
Learning Recipe
If we get good results on the training set but worse results on the dev/validation set, the model is overfitting → prevent overfitting. Otherwise, we are done.
105
Concluding Remarks
Basic Machine Learning
  1. Define a set of functions
  2. Measure the goodness of functions
  3. Pick the best function
Basic Deep Learning: stacked functions
106
Talk Outline
Part I: Introduction to
Machine Learning & Deep Learning Part II: Variants of Neural Nets
Part III: Beyond Supervised Learning
& Recent Trends
107
PART II
Variants of Neural Networks
108
PART II: Variants of Neural Networks
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
109
PART II: Variants of Neural Networks
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
Widely used in image processing
110
Why CNN for Image
Can the network be simplified by considering the properties of images?
In a fully connected network on raw pixels (x1, x2, …, xN), the first layer learns the most basic classifiers, the second layer uses the first layer as modules to build classifiers, and so on (Zeiler, M. D., ECCV 2014).
111
Why CNN for Image
Some patterns are much smaller than the whole image: a neuron does not have to see the whole image to discover the pattern (e.g. a “beak” detector). Connecting to a small region needs fewer parameters.
112
Why CNN for Image
The same patterns appear in different regions: an “upper-left beak” detector and a “middle beak” detector do almost the same thing, so they can use the same set of parameters.
113
Why CNN for Image
Subsampling the pixels will not change the object: a subsampled bird is still a bird. We can subsample the pixels to make the image smaller, so the network needs fewer parameters to process it.
114
Three Steps for Deep Learning
Step 1: define a set of functions ← Convolutional Neural Network
Step 2: goodness of function
Step 3: pick the best function
Deep learning is so simple……
115
Image Recognition
116
http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
The Whole CNN
Image → [Convolution → Max Pooling] (can repeat many times) → Flatten → Fully Connected Feedforward Network → “cat”, “dog”, ……
117
The Whole CNN
Convolution exploits Property 1 (some patterns are much smaller than the whole image) and Property 2 (the same patterns appear in different regions); Max Pooling exploits Property 3 (subsampling the pixels will not change the object). The Convolution → Max Pooling stage can repeat many times before Flatten.
118
Image Recognition
119
Local Connectivity
120
Neurons connect to a small region
Parameter Sharing
The same feature in different positions
121
Neurons share the same weights
Parameter Sharing
Different features in the same position
122
Neurons have different weights
Convolutional Layers
123
(Figure: input and output volumes with width, height, and depth; neurons in the same depth slice use shared weights.)
Convolutional Layers
124
(Figure: a convolutional layer going from input depth 1 (a1, a2, a3) to output depth 2 (b1, b2 and c1, c2).)
Convolutional Layers
125
(Figure: a convolutional layer with input depth 2 and output depth 2 (units a1–a3, b1–b3, c1–c2, d1–d2).)
Convolutional Layers
126
(Figure: the same input-depth-2, output-depth-2 connectivity, continued.)
Convolutional Layers
127
Hyper-parameters of CNN
Stride: how far the filter moves at each step (e.g. stride = 1 or stride = 2).
Padding: zeros added around the border (e.g. padding = 0 or padding = 1).
128
Example
Input volume (7x7x3), filter (3x3x3), stride = 2, padding = 1 → output volume (3x3x2)
http://cs231n.github.io/convolutional-networks/
129
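The spatial output size follows the usual rule out = (W − F + 2P) / S + 1. A small sketch; my reading of the cs231n demo linked above is that its 7x7 grid already includes the zero-padding border around a 5x5 input, which is an assumption on my part:

```python
def conv_output_size(width, filter_size, padding, stride):
    """Spatial output size of a convolutional layer: (W - F + 2P) / S + 1."""
    return (width - filter_size + 2 * padding) // stride + 1

# 5x5 input, 3x3 filter, padding 1, stride 2 -> 3x3 output (matching the 3x3x2 above)
print(conv_output_size(5, 3, 1, 2))   # 3
# The output depth (2 here) equals the number of filters, not a spatial quantity.
```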
Convolutional Layers
130
http://cs231n.github.io/convolutional-networks/
Convolutional Layers
131
http://cs231n.github.io/convolutional-networks/
Convolutional Layers
132
http://cs231n.github.io/convolutional-networks/
Convolutional Layers
133
http://cs231n.github.io/convolutional-networks/
Pooling Layer
Input (4 x 4, depth = 1):
  1 3 2 4
  5 7 6 8
  0 0 3 3
  5 5 0 0
Pooling operates on non-overlapping 2 x 2 regions and has no weights.
Maximum pooling: Max(1,3,5,7) = 7, Max(2,4,6,8) = 8, Max(0,0,5,5) = 5, Max(3,3,0,0) = 3, giving the output
  7 8
  5 3
Average pooling instead averages each region, e.g. Avg(1,3,5,7) = 4 and Avg(2,4,6,8) = 5.
134
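A numpy sketch of 2 x 2 max pooling reproducing the example above:

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a 2-D array (no weights to learn)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [0, 0, 3, 3],
              [5, 5, 0, 0]])
print(max_pool_2x2(x))   # [[7 8], [5 3]]
```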
Why “Deep” Learning?
135