Machine Learning Techniques (機器學習技法)
Lecture 12: Neural Network
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Roadmap
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
  Lecture 11: Gradient Boosted Decision Tree (aggregating trees from functional gradient and steepest descent, subject to any error measure)
3 Distilling Implicit Features: Extraction Models

Lecture 12: Neural Network
Motivation
Neural Network Hypothesis
Neural Network Learning
Optimization and Regularization
Linear Aggregation of Perceptrons: Pictorial View
[figure: inputs x_0 = 1, x_1, ..., x_d feed T perceptrons g_1, ..., g_T (each with weight vector w_t); their outputs are aggregated with weights α_1, ..., α_T into G]

G(x) = \text{sign}\Big( \sum_{t=1}^{T} α_t \underbrace{\text{sign}(w_t^T x)}_{g_t(x)} \Big)

• two layers of weights: w_t and α
• two layers of sign functions: in g_t and in G

what boundary can G implement?
Logic Operations with Aggregation
[figure: g_1 and g_2 each split the plane into +1/−1 regions; the network x_0 = 1, x_1, ..., x_d feeds g_1 (weights w_1) and g_2 (weights w_2), aggregated with α_0 = −1, α_1 = +1, α_2 = +1 into G(x)]

G(x) = \text{sign}\big( −1 + g_1(x) + g_2(x) \big)

• g_1(x) = g_2(x) = +1 (TRUE): G(x) = +1 (TRUE)
• otherwise: G(x) = −1 (FALSE)
• G ≡ AND(g_1, g_2); OR and NOT can be implemented similarly
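A minimal Python sketch (not part of the original slides) that verifies the aggregation above: with α = (−1, +1, +1) the sign of the weighted vote reproduces AND(g_1, g_2), and with α = (+1, +1, +1) it reproduces OR(g_1, g_2). The helper name `aggregate` is made up for illustration.

```python
import numpy as np

def aggregate(alphas, g_values):
    """Sign of a weighted vote: sign(alpha_0 * 1 + alpha_1 * g1 + alpha_2 * g2)."""
    return np.sign(alphas[0] + np.dot(alphas[1:], g_values))

# enumerate all four (g1, g2) combinations of +1/-1
for g1 in (+1, -1):
    for g2 in (+1, -1):
        g = np.array([g1, g2])
        and_out = aggregate(np.array([-1.0, 1.0, 1.0]), g)  # alpha_0 = -1 gives AND
        or_out = aggregate(np.array([+1.0, 1.0, 1.0]), g)   # alpha_0 = +1 gives OR
        print(g1, g2, "AND:", int(and_out), "OR:", int(or_out))
```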
Powerfulness and Limitation
[figure: a target boundary separating + and − points, approximated by aggregations of 8 perceptrons and of 16 perceptrons]

• 'convex set' hypotheses implemented: d_VC → ∞, remember? :-)
• powerfulness: enough perceptrons ≈ smooth boundary

[figure: XOR(g_1, g_2) labels on the four quadrants formed by g_1 and g_2]
• limitation: XOR not 'linearly separable' under φ(x) = (g_1(x), g_2(x))

how to implement XOR(g_1, g_2)?
Multi-Layer Perceptrons: Basic Neural Network
• non-separable data: can use more transform
• how about one more layer of AND transform?

XOR(g_1, g_2) = OR(AND(−g_1, g_2), AND(g_1, −g_2))

[figure: x_0 = 1, x_1, ..., x_d feed g_1 (weights w_1) and g_2 (weights w_2); a second layer of AND units (with constant inputs +1 and weights −1) feeds an OR unit, so that G ≡ XOR(g_1, g_2)]

perceptron (simple)
=⇒ aggregation of perceptrons (powerful)
=⇒ multi-layer perceptrons (more powerful)
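A small Python sketch (assumed, not from the slides) of the two-layer construction: the hidden layer computes AND(−g_1, g_2) and AND(g_1, −g_2) as perceptrons, and the output layer ORs them, reproducing XOR(g_1, g_2) on all four input combinations.

```python
import numpy as np

def perceptron(bias, weights, inputs):
    """A single perceptron: sign(bias + weights . inputs)."""
    return np.sign(bias + np.dot(weights, inputs))

def xor_net(g1, g2):
    # hidden layer: AND(-g1, g2) and AND(g1, -g2), each an AND-style perceptron
    h1 = perceptron(-1.0, np.array([-1.0, +1.0]), np.array([g1, g2]))
    h2 = perceptron(-1.0, np.array([+1.0, -1.0]), np.array([g1, g2]))
    # output layer: OR(h1, h2)
    return perceptron(+1.0, np.array([+1.0, +1.0]), np.array([h1, h2]))

for g1 in (+1, -1):
    for g2 in (+1, -1):
        print(g1, g2, "->", int(xor_net(g1, g2)))  # +1 exactly when g1 != g2
```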
Connection to Biological Neurons
[figures: photographs of biological neurons alongside a neural network diagram; image credits: UC Regents Davis campus (brainmaps.org), CC BY 3.0 via Wikimedia Commons; Lauris Rubenis, CC BY 2.0, https://flic.kr/p/fkVuZX; Pedro Ribeiro Simões, CC BY 2.0, https://flic.kr/p/adiV7b]

neural network: bio-inspired model
Fun Time
Let g_0(x) = +1. Which of the following (α_0, α_1, α_2) allows G(x) = \text{sign}\big( \sum_{t=0}^{2} α_t g_t(x) \big) to implement OR(g_1, g_2)?
1. (−3, +1, +1)
2. (−1, +1, +1)
3. (+1, +1, +1)
4. (+3, +1, +1)

Reference Answer: 3
You can easily verify with all four possibilities of (g_1(x), g_2(x)).
Neural Network Hypothesis: Output
[figure: x_0 = 1, x_1, ..., x_d pass through two tanh hidden layers; the OUTPUT node combines the last hidden layer with weights w]

• OUTPUT: simply a linear model with s = w^T φ^{(2)}(φ^{(1)}(x))
• any linear model can be used (remember? :-)):
  - linear classification: h(x) = sign(s), err = 0/1
  - linear regression: h(x) = s, err = squared
  - logistic regression: h(x) = θ(s), err = cross-entropy

will discuss 'regression' with squared error
Neural Network Hypothesis: Transformation
• transformation function of the score (signal) s: any transformation?
  - linear: whole network stays linear & thus less useful
  - sign: discrete & thus hard to optimize for w
• popular choice of transformation: tanh(s)
  - 'analog' approximation of sign: easier to optimize
  - somewhat closer to biological neurons
  - not that new! :-)

tanh(s) = \frac{\exp(s) − \exp(−s)}{\exp(s) + \exp(−s)} = 2θ(2s) − 1

will discuss with tanh as transformation function
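A quick numeric check (my addition, not on the slide) of the identity tanh(s) = 2θ(2s) − 1, where θ is the logistic function θ(s) = 1/(1 + exp(−s)):

```python
import numpy as np

def theta(s):
    """Logistic function theta(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

s = np.linspace(-5, 5, 11)
print(np.allclose(np.tanh(s), 2 * theta(2 * s) - 1))  # True
```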
Neural Network Hypothesis
[figure: a d^{(0)}-d^{(1)}-...-d^{(L)} network; x_0 = 1, x_1, ..., x_d feed tanh hidden layers through weights w_{ij}^{(1)}, w_{jk}^{(2)}, w_{kq}^{(3)}; e.g. s_3^{(2)} is transformed by tanh into x_3^{(2)}]

d^{(0)}-d^{(1)}-...-d^{(L)} Neural Network (NNet), with weights w_{ij}^{(ℓ)}:
  1 ≤ ℓ ≤ L layers, 0 ≤ i ≤ d^{(ℓ−1)} inputs, 1 ≤ j ≤ d^{(ℓ)} outputs

score: s_j^{(ℓ)} = \sum_{i=0}^{d^{(ℓ−1)}} w_{ij}^{(ℓ)} x_i^{(ℓ−1)}

transformed: x_j^{(ℓ)} = \begin{cases} \tanh\big(s_j^{(ℓ)}\big) & \text{if } ℓ < L \\ s_j^{(ℓ)} & \text{if } ℓ = L \end{cases}

apply x as input layer x^{(0)}, go through hidden layers to get x^{(ℓ)}, predict at output layer x_1^{(L)}
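A minimal NumPy sketch of this forward computation (an assumed illustration, not the course's reference code): each layer forms the scores s^{(ℓ)} from x^{(ℓ−1)} including the constant x_0^{(ℓ−1)} = 1, applies tanh on hidden layers, and leaves the final score linear.

```python
import numpy as np

def forward(x, weights):
    """Forward pass of a d(0)-d(1)-...-d(L) NNet.

    weights[l] has shape (d(l-1) + 1, d(l)); row 0 multiplies the constant +1.
    Returns the list of layer outputs x(0), ..., x(L).
    """
    xs = [np.asarray(x, dtype=float)]
    for l, W in enumerate(weights, start=1):
        x_prev = np.concatenate(([1.0], xs[-1]))            # prepend x_0 = 1
        s = W.T @ x_prev                                     # scores s(l)
        xs.append(np.tanh(s) if l < len(weights) else s)     # tanh except at output
    return xs

# example: a 3-5-1 network with small random weights
rng = np.random.default_rng(0)
weights = [0.1 * rng.standard_normal((4, 5)), 0.1 * rng.standard_normal((6, 1))]
print(forward([0.5, -1.0, 2.0], weights)[-1])  # the single output x_1(L)
```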
Physical Interpretation
[figure: the same multi-layer network, with one layer's weights highlighted]

• each layer: a transformation to be learned from data
• φ^{(ℓ)}(x) = \Big( \tanh\big( \sum_{i=0}^{d^{(ℓ−1)}} w_{i1}^{(ℓ)} x_i^{(ℓ−1)} \big), \ldots \Big): whether x 'matches' the weight vectors in pattern

NNet: pattern extraction with layers of connection weights
Fun Time
How many weights {w_{ij}^{(ℓ)}} are there in a 3-5-1 NNet?
1. 9
2. 15
3. 20
4. 26

Reference Answer: 4
There are (3 + 1) × 5 = 20 weights in w_{ij}^{(1)}, and (5 + 1) × 1 = 6 weights in w_{jk}^{(2)}, so 26 in total.
How to Learn the Weights?
[figure: the multi-layer network with weights w_{ij}^{(1)}, w_{jk}^{(2)}, w_{kq}^{(3)}]

• goal: learn all {w_{ij}^{(ℓ)}} to minimize E_in({w_{ij}^{(ℓ)}})
• one hidden layer: simply an aggregation of perceptrons; gradient boosting can determine the hidden neurons one by one
• multiple hidden layers? not easy
• let e_n = (y_n − NNet(x_n))^2: can apply (stochastic) GD after computing ∂e_n/∂w_{ij}^{(ℓ)}!

next: efficient computation of ∂e_n/∂w_{ij}^{(ℓ)}
Computing ∂e_n/∂w_{i1}^{(L)} (Output Layer)

e_n = (y_n − NNet(x_n))^2 = \big(y_n − s_1^{(L)}\big)^2 = \Big(y_n − \sum_{i=0}^{d^{(L−1)}} w_{i1}^{(L)} x_i^{(L−1)}\Big)^2

specially (output layer, 0 ≤ i ≤ d^{(L−1)}):

\frac{∂e_n}{∂w_{i1}^{(L)}} = \frac{∂e_n}{∂s_1^{(L)}} \cdot \frac{∂s_1^{(L)}}{∂w_{i1}^{(L)}} = −2\big(y_n − s_1^{(L)}\big) \cdot x_i^{(L−1)}

generally (1 ≤ ℓ < L, 0 ≤ i ≤ d^{(ℓ−1)}, 1 ≤ j ≤ d^{(ℓ)}):

\frac{∂e_n}{∂w_{ij}^{(ℓ)}} = \frac{∂e_n}{∂s_j^{(ℓ)}} \cdot \frac{∂s_j^{(ℓ)}}{∂w_{ij}^{(ℓ)}} = δ_j^{(ℓ)} \cdot x_i^{(ℓ−1)}

δ_1^{(L)} = −2\big(y_n − s_1^{(L)}\big); how about the others?
Computing δ_j^{(ℓ)} = ∂e_n/∂s_j^{(ℓ)}

s_j^{(ℓ)} \;\xrightarrow{\tanh}\; x_j^{(ℓ)} \;\xrightarrow{w_{jk}^{(ℓ+1)}}\; \big(s_1^{(ℓ+1)}, \ldots, s_k^{(ℓ+1)}, \ldots\big) \;\Longrightarrow\; \cdots \;\Longrightarrow\; e_n

δ_j^{(ℓ)} = \frac{∂e_n}{∂s_j^{(ℓ)}} = \sum_{k=1}^{d^{(ℓ+1)}} \frac{∂e_n}{∂s_k^{(ℓ+1)}} \cdot \frac{∂s_k^{(ℓ+1)}}{∂x_j^{(ℓ)}} \cdot \frac{∂x_j^{(ℓ)}}{∂s_j^{(ℓ)}} = \sum_{k} δ_k^{(ℓ+1)} \, w_{jk}^{(ℓ+1)} \, \tanh'\big(s_j^{(ℓ)}\big)

δ_j^{(ℓ)} can be computed backwards from δ_k^{(ℓ+1)}
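One detail the slide leaves implicit (standard, but worth writing out): for the hidden layers ℓ < L, the tanh derivative can be computed from the x_j^{(ℓ)} already stored during the forward pass,

\tanh'\big(s_j^{(ℓ)}\big) = 1 − \tanh^2\big(s_j^{(ℓ)}\big) = 1 − \big(x_j^{(ℓ)}\big)^2,

so the backward pass needs no extra function evaluations.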
Backpropagation (Backprop) Algorithm

Backprop on NNet
initialize all weights w_{ij}^{(ℓ)}
for t = 0, 1, ..., T
  1 stochastic: randomly pick n ∈ {1, 2, ..., N}
  2 forward: compute all x_i^{(ℓ)} with x^{(0)} = x_n
  3 backward: compute all δ_j^{(ℓ)} subject to x^{(0)} = x_n
  4 gradient descent: w_{ij}^{(ℓ)} ← w_{ij}^{(ℓ)} − η x_i^{(ℓ−1)} δ_j^{(ℓ)}
return g_NNet(x) = \Big( \cdots \tanh\big( \sum_j w_{jk}^{(2)} \cdot \tanh\big( \sum_i w_{ij}^{(1)} x_i \big) \big) \Big)

sometimes steps 1 to 3 are done (in parallel) many times, and the average of x_i^{(ℓ−1)} δ_j^{(ℓ)} is taken for the update in step 4; this is called a mini-batch

basic NNet algorithm: backprop to compute the gradient efficiently
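A compact NumPy sketch of one backprop/SGD step under the squared error e_n = (y_n − NNet(x_n))^2 (an illustrative implementation under the slide's notation, assuming a single linear output; the helper names `forward` and `backprop_step` are my own, not the course's code).

```python
import numpy as np

def forward(x, weights):
    """Return layer outputs x(0), ..., x(L); hidden layers use tanh, output is linear."""
    xs = [np.asarray(x, dtype=float)]
    for l, W in enumerate(weights, start=1):
        s = W.T @ np.concatenate(([1.0], xs[-1]))           # s(l), with x_0 = 1
        xs.append(np.tanh(s) if l < len(weights) else s)
    return xs

def backprop_step(x_n, y_n, weights, eta=0.1):
    """One stochastic gradient descent update on e_n = (y_n - NNet(x_n))^2."""
    xs = forward(x_n, weights)                               # step 2: forward
    L = len(weights)
    delta = -2.0 * (y_n - xs[L])                             # step 3: delta(L) at linear output
    for l in range(L, 0, -1):
        x_prev = np.concatenate(([1.0], xs[l - 1]))
        grad = np.outer(x_prev, delta)                       # x_i(l-1) * delta_j(l)
        if l > 1:
            # delta(l-1)_j = sum_k delta(l)_k w_jk(l) * tanh'(s(l-1)_j), with tanh' = 1 - x^2
            delta = (weights[l - 1][1:, :] @ delta) * (1.0 - xs[l - 1] ** 2)
        weights[l - 1] -= eta * grad                         # step 4: gradient descent
    return weights

# tiny usage example: a 3-5-1 NNet with small random initial weights
rng = np.random.default_rng(1)
weights = [0.1 * rng.standard_normal((4, 5)), 0.1 * rng.standard_normal((6, 1))]
weights = backprop_step(np.array([0.5, -1.0, 2.0]), 0.3, weights)
```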
Fun Time
According to \frac{∂e_n}{∂w_{i1}^{(L)}} = −2\big(y_n − s_1^{(L)}\big) \cdot x_i^{(L−1)}, when would \frac{∂e_n}{∂w_{i1}^{(L)}} = 0?
1. y_n = s_1^{(L)}
2. x_i^{(L−1)} = 0
3. s_i^{(L−1)} = 0
4. all of the above

Reference Answer: 4
Note that x_i^{(L−1)} = tanh(s_i^{(L−1)}) = 0 if and only if s_i^{(L−1)} = 0.
Neural Network Optimization

E_in(w) = \frac{1}{N} \sum_{n=1}^{N} \text{err}\Big( \cdots \tanh\big( \sum_j w_{jk}^{(2)} \cdot \tanh\big( \sum_i w_{ij}^{(1)} x_{n,i} \big) \big), \; y_n \Big)

• generally non-convex when there are multiple hidden layers
  - not easy to reach the global minimum
  - GD/SGD with backprop only gives a local minimum
• different initial w_{ij}^{(ℓ)} =⇒ different local minimum
  - somewhat 'sensitive' to initial weights
  - large weights =⇒ saturate (small gradient)
  - advice: try some random & small ones

NNet: difficult to optimize, but practically works
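A tiny illustration (my own, not from the lecture) of why large initial weights saturate: with tanh the gradient factor 1 − tanh²(s) is nearly zero once |s| is large, so small random initial weights keep the neurons in the responsive region.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10)                       # some input pattern

for scale in (0.1, 1.0, 10.0):
    w = scale * rng.standard_normal(10)           # initial weights of one neuron
    s = w @ x                                     # the neuron's score
    print(f"scale {scale:5.1f}: |s| = {abs(s):7.2f}, tanh'(s) = {1 - np.tanh(s)**2:.2e}")
```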
VC Dimension of Neural Network Model

roughly, with tanh-like transfer functions: d_VC = O(V D), where V = # of neurons, D = # of weights

• pros: can approximate 'anything' if there are enough neurons (V large)
• cons: can overfit if there are too many neurons

NNet: watch out for overfitting!
Regularization for Neural Network

basic choice: our old friend, the weight-decay (L2) regularizer Ω(w) = \sum \big(w_{ij}^{(ℓ)}\big)^2
• 'shrink' weights: large weight → large shrink; small weight → small shrink
• want w_{ij}^{(ℓ)} = 0 (sparse) to effectively decrease d_VC
  - L1 regularizer: \sum \big|w_{ij}^{(ℓ)}\big|, but not differentiable
  - weight-elimination ('scaled' L2) regularizer: large weight → median shrink; small weight → median shrink

weight-elimination regularizer: \sum \frac{\big(w_{ij}^{(ℓ)}\big)^2}{1 + \big(w_{ij}^{(ℓ)}\big)^2}
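A short sketch (my own illustration, assuming the formulas above) of the weight-elimination regularizer and its gradient, which can simply be added to the backprop gradient when the regularized error is minimized:

```python
import numpy as np

def weight_elimination(W, lam):
    """Regularization term lam * sum w^2 / (1 + w^2) and its gradient w.r.t. W."""
    penalty = lam * np.sum(W**2 / (1.0 + W**2))
    grad = lam * 2.0 * W / (1.0 + W**2) ** 2       # d/dw [w^2/(1+w^2)] = 2w/(1+w^2)^2
    return penalty, grad

W = np.array([[0.01, 3.0], [-0.5, 10.0]])
penalty, grad = weight_elimination(W, lam=0.001)
print(penalty)
print(grad)   # large weights get a nearly vanishing gradient, i.e. little extra shrink
```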
Yet Another Regularization: Early Stopping

• GD/SGD (backprop) visits more weight combinations as t increases
• smaller t effectively decreases d_VC
• better to 'stop in the middle': early stopping

[figure: in-sample error, model complexity, and out-of-sample error versus VC dimension, with the best d_VC* in the middle (remember? :-)); E_in and E_test versus iteration t on a log scale, with the best t* well before E_in bottoms out]

when to stop? validation!
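A minimal early-stopping loop as a sketch of how validation picks t* (an assumed illustration; it reuses the hypothetical `forward` and `backprop_step` helpers from the earlier sketch and a held-out validation set):

```python
import numpy as np

def early_stopping_train(X, y, X_val, y_val, weights, epochs=2000, eta=0.1):
    """Run SGD with backprop, track validation error, and keep the best-t weights."""
    rng = np.random.default_rng(0)
    best_err, best_weights, best_t = np.inf, [W.copy() for W in weights], 0
    for t in range(epochs):
        n = rng.integers(len(X))                             # stochastic: pick one example
        weights = backprop_step(X[n], y[n], weights, eta)    # one forward/backward/update
        val_err = np.mean([(y_val[m] - forward(X_val[m], weights)[-1][0]) ** 2
                           for m in range(len(X_val))])
        if val_err < best_err:                               # remember the best iteration t*
            best_err, best_weights, best_t = val_err, [W.copy() for W in weights], t
    return best_weights, best_t
```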
Fun Time
For the weight-elimination regularizer \sum \frac{\big(w_{ij}^{(ℓ)}\big)^2}{1 + \big(w_{ij}^{(ℓ)}\big)^2}, what is \frac{∂\,\text{regularizer}}{∂w_{ij}^{(ℓ)}}?
1. 2 w_{ij}^{(ℓ)} \big/ \big(1 + (w_{ij}^{(ℓ)})^2\big)^1
2. 2 w_{ij}^{(ℓ)} \big/ \big(1 + (w_{ij}^{(ℓ)})^2\big)^2
3. 2 w_{ij}^{(ℓ)} \big/ \big(1 + (w_{ij}^{(ℓ)})^2\big)^3
4. 2 w_{ij}^{(ℓ)} \big/ \big(1 + (w_{ij}^{(ℓ)})^2\big)^4

Reference Answer: 2
Too much calculus in this class, huh? :-)
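For completeness, the quotient-rule calculation behind the answer (my addition, not on the slide):

\frac{\partial}{\partial w}\,\frac{w^2}{1+w^2} = \frac{2w\,(1+w^2) − w^2 \cdot 2w}{(1+w^2)^2} = \frac{2w}{(1+w^2)^2}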
Summary
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

Lecture 12: Neural Network
Motivation: multi-layer for power, with biological inspirations
Neural Network Hypothesis: layered pattern extraction until a linear hypothesis
Neural Network Learning: backprop to compute the gradient efficiently
Optimization and Regularization: tricks on initialization, regularizer, early stopping

• next: making the neural network 'deeper'