Machine Learning Techniques (機器學習技巧)
Lecture 12: Deep Learning
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)
Agenda
Lecture 12: Deep Learning
• Optimization and Overfitting
• Auto-Encoder
• Principal Component Analysis
• Denoising Auto-Encoder
• Deep Neural Network
Optimization and Overfitting
Error Function of Neural Network
$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\left(y_n - \theta\Big(\cdots\,\theta\Big(\sum_j w_{jk}^{(2)}\cdot\theta\Big(\sum_i w_{ij}^{(1)}\,x_i\Big)\Big)\Big)\right)^2$$
• generally non-convex when multiple hidden layers: not easy to reach the global minimum
• GD/SGD with backprop only gives a local minimum
• different initial w_0 =⇒ different local minimum
• somewhat 'sensitive' to initial weights
  • large weights =⇒ saturate (small gradient)
  • advice: try some random & small ones

neural network (NNet): difficult to optimize, but practically works
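As a concrete reading of this error function, here is a minimal numpy sketch, assuming a single hidden layer with tanh as the transfer θ; the layer sizes, toy data, and initialization scale are illustrative, not from the lecture:

```python
import numpy as np

def nnet_forward(X, W1, W2):
    """Forward pass of a 1-hidden-layer NNet with tanh transfer theta."""
    H = np.tanh(X @ W1)        # hidden activations: theta(sum_i w_ij^(1) x_i)
    return np.tanh(H @ W2)     # output: theta(sum_j w_jk^(2) ...)

def e_in(X, y, W1, W2):
    """Squared-error E_in(w) averaged over the N examples."""
    pred = nnet_forward(X, W1, W2).ravel()
    return np.mean((y - pred) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # N=100 examples, d=5 inputs
y = np.sign(X[:, 0])                       # toy targets
W1 = rng.normal(scale=0.1, size=(5, 3))    # small random initial weights,
W2 = rng.normal(scale=0.1, size=(3, 1))    # per the advice above
print(e_in(X, y, W1, W2))
```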
VC Dimension of Neural Networks
roughly, with θ-like transfer functions:

$$d_{\text{VC}} = O(D \log D), \quad D = \text{number of weights}$$

• can implement 'anything' if enough neurons (D large): no need for many layers?
• can overfit if too many neurons

NNet: watch out for overfitting!
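To make the bound tangible, a small sketch that counts D for a given fully-connected architecture and evaluates the O(D log D) estimate; the layer sizes and the one-bias-per-neuron convention are illustrative assumptions:

```python
import numpy as np

def num_weights(layers):
    """D = total number of weights (one bias per neuron assumed) for a
    fully-connected NNet with the given layer sizes, e.g. [d, 3, 1]."""
    return sum((d_in + 1) * d_out for d_in, d_out in zip(layers, layers[1:]))

layers = [10, 8, 3, 1]            # illustrative 10-8-3-1 architecture
D = num_weights(layers)
print(D, D * np.log(D))           # D and the O(D log D) d_VC estimate
```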
Regularization for Neural Network
basic choice: old friend weight-decay (L2) regularizer $\Omega(\mathbf{w}) = \sum \big(w_{ij}^{(\ell)}\big)^2$

• 'shrink' weights: large weight → large shrink; small weight → small shrink
• want $w_{ij}^{(\ell)} = 0$ (sparse) to effectively decrease d_VC
• L1 regularizer: $\sum \big|w_{ij}^{(\ell)}\big|$, but not differentiable
• weight-elimination ('scaled' L2) regularizer: large weight → median shrink; small weight → median shrink

weight-elimination regularizer:

$$\sum \frac{\big(w_{ij}^{(\ell)}\big)^2}{\beta^2 + \big(w_{ij}^{(\ell)}\big)^2}$$
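A minimal numpy comparison of the two penalties, assuming an illustrative β = 1; it shows how weight-elimination saturates for large weights while plain weight-decay keeps growing quadratically:

```python
import numpy as np

def weight_decay(w):
    """Plain L2 (weight-decay) penalty: large weights dominate."""
    return np.sum(w ** 2)

def weight_elimination(w, beta=1.0):
    """'Scaled' L2 penalty: saturates near 1 for |w| >> beta and behaves
    like L2/beta^2 for |w| << beta, so all weights shrink comparably."""
    return np.sum(w ** 2 / (beta ** 2 + w ** 2))

w = np.array([0.01, 0.1, 1.0, 10.0])
print(weight_decay(w), weight_elimination(w))
```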
Yet Another Regularization: Early Stopping
• GD/SGD (backprop) visits more weight combinations as t increases
• smaller t effectively decreases d_VC
• better to 'stop in the middle': early stopping

[figure: E_in and E_out versus training epochs; E_in keeps decreasing while E_out bottoms out at an intermediate epoch, the early-stopping point]
when to stop?
validation!
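A minimal early-stopping loop, assuming a toy linear model trained by GD purely for illustration; the key pattern is tracking validation error each iteration and keeping the best-so-far weights:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)
Xt, yt = X[:150], y[:150]          # training set
Xv, yv = X[150:], y[150:]          # validation set decides when to stop

w = np.zeros(5)
best_err, best_w, best_t = np.inf, w.copy(), 0
for t in range(5000):
    grad = 2 * Xt.T @ (Xt @ w - yt) / len(yt)   # GD step on E_in
    w -= 0.01 * grad
    val_err = np.mean((Xv @ w - yv) ** 2)       # track validation error
    if val_err < best_err:                      # keep best-so-far weights
        best_err, best_w, best_t = val_err, w.copy(), t
print(best_t, best_err)                         # 'stop in the middle'
```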
Auto-Encoder
Learning the Identity Function
identity function: f(x) = x

• a vector function composed of $f_i(\mathbf{x}) = x_i$
• learning each $f_i$: regression with data $(\mathbf{x}_1, y_1 = x_{1,i}), (\mathbf{x}_2, y_2 = x_{2,i}), \ldots, (\mathbf{x}_N, y_N = x_{N,i})$
• learning f: learning all $f_i$ jointly with data $(\mathbf{x}_1, \mathbf{y}_1 = \mathbf{x}_1), (\mathbf{x}_2, \mathbf{y}_2 = \mathbf{x}_2), \ldots, (\mathbf{x}_N, \mathbf{y}_N = \mathbf{x}_N)$

but wait, why learn something known & easily implemented? :-)
Why Learn the Identity Function?

if g(x) ≈ f(x) using some hidden structures on the observed data x_n:

• for unsupervised learning:
  • density estimation: larger (structure match) when g(x) ≈ x better
  • outlier detection: those x where g(x) ≉ x
  ⇒ learning a 'typical' representation of the data
• for supervised learning:
  • hidden structure: essence of x that can be used as Φ(x)
  ⇒ learning an 'informative' representation of the data

auto-encoder: NNet for learning the identity function
Simple Auto-Encoder
simple auto-encoder: a d-d̃-d NNet

• d outputs: backprop easily applies
• d̃ < d: compressed representation; d̃ ≥ d: [over]-complete representation
• data: $(\mathbf{x}_1, \mathbf{y}_1 = \mathbf{x}_1), (\mathbf{x}_2, \mathbf{y}_2 = \mathbf{x}_2), \ldots, (\mathbf{x}_N, \mathbf{y}_N = \mathbf{x}_N)$; often categorized as an unsupervised learning technique
• if x contains binary bits:
  • a naïve solution exists (but unwanted) when [over]-complete
  • regularized weights needed in general
• sometimes constrain $w_{ij}^{(1)} = w_{ji}^{(2)}$ as 'regularization': more sophisticated in calculating the gradient

auto-encoder for representation learning: outputs of hidden neurons serve as Φ(x)
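A minimal numpy sketch of such a d-d̃-d auto-encoder, assuming tanh hidden units, linear output units, and illustrative sizes; the update steps are plain backprop on the squared reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, d_tilde = 200, 8, 3                    # a d - d~ - d auto-encoder
Z = rng.normal(size=(N, d_tilde))            # toy data with hidden structure
X = np.tanh(Z @ rng.normal(size=(d_tilde, d)))

W1 = rng.normal(scale=0.1, size=(d, d_tilde))   # encoder weights w^(1)
W2 = rng.normal(scale=0.1, size=(d_tilde, d))   # decoder weights w^(2)
eta = 0.05
for t in range(2000):
    H = np.tanh(X @ W1)                      # hidden representation Phi(x)
    Xhat = H @ W2                            # reconstruction g(x)
    err = Xhat - X
    gW2 = H.T @ err / N                      # backprop: output layer
    gH = err @ W2.T * (1 - H ** 2)           # backprop: tanh' = 1 - tanh^2
    gW1 = X.T @ gH / N
    W1 -= eta * gW1
    W2 -= eta * gW2
print(np.mean(err ** 2))                     # final reconstruction error
```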
Principal Component Analysis
Linear Auto-Encoder Hypothesis
$$h_k(\mathbf{x}) = \theta\Big(\sum_j w_{jk}^{(2)} \cdot \theta\Big(\sum_i w_{ij}^{(1)}\,x_i\Big)\Big)$$

consider three special conditions:

• constrain $w_{ij}^{(1)} = w_{ji}^{(2)} = w_{ij}$ as 'regularization'; let $W = [w_{ij}]$ of size $d \times \tilde{d}$
• θ does nothing, i.e. linear (like linear regression)
• $\tilde{d} < d$

linear auto-encoder hypothesis: $h(\mathbf{x}) = W W^T \mathbf{x}$
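A quick numerical check of this hypothesis with illustrative sizes; note the reconstruction passes through the rank-at-most-d̃ matrix W W^T:

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_tilde = 6, 2
W = rng.normal(size=(d, d_tilde))               # tied weights W = [w_ij]
x = rng.normal(size=d)
h = W @ (W.T @ x)                               # h(x) = W W^T x: encode, decode
print(h.shape, np.linalg.matrix_rank(W @ W.T))  # rank at most d~
```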
Linear Auto-Encoder Error Function
$$\min_W E_{\text{in}}(W) = \frac{1}{N}\,\big\| X - X W W^T \big\|_F^2$$

let $W W^T = V \Lambda V^T$ such that $V^T V = I$ and $\Lambda$ a diagonal matrix of rank at most $\tilde{d}$ (eigenvalue decomposition):

$$\begin{aligned}
\big\| X - X V \Lambda V^T \big\|_F^2
&= \operatorname{trace}\Big(\big(X - X V \Lambda V^T\big)^T \big(X - X V \Lambda V^T\big)\Big)\\
&= \operatorname{trace}\Big(X^T X - X^T X V \Lambda V^T - V \Lambda V^T X^T X + V \Lambda V^T X^T X V \Lambda V^T\Big)\\
&= \operatorname{trace}\Big(X^T X - \Lambda V^T X^T X V - \Lambda V^T X^T X V + \Lambda V^T X^T X V \Lambda V^T V\Big)\\
&= \operatorname{trace}\Big(V^T X^T X V - \Lambda V^T X^T X V - \Lambda V^T X^T X V + \Lambda^2 V^T X^T X V\Big)\\
&= \operatorname{trace}\Big((I - \Lambda)^2\, V^T X^T X V\Big)
\end{aligned}$$

(each step uses $V^T V = I$ and the cyclic property of the trace)
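The identity above can be verified numerically; a small sketch, with all sizes illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, d_tilde = 50, 5, 2
X = rng.normal(size=(N, d))
W = rng.normal(size=(d, d_tilde))

# eigen-decompose WW^T = V Lam V^T (symmetric PSD, rank at most d~)
lam, V = np.linalg.eigh(W @ W.T)
Lam = np.diag(lam)

lhs = np.linalg.norm(X - X @ V @ Lam @ V.T, 'fro') ** 2
rhs = np.trace((np.eye(d) - Lam) ** 2 @ V.T @ X.T @ X @ V)
print(np.isclose(lhs, rhs))                 # the trace identity holds
```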
Linear Auto-Encoder Algorithm
$$\min_{V,\Lambda}\ \operatorname{trace}\Big((I - \Lambda)^2\, V^T X^T X V\Big)$$

• optimal rank-$\tilde{d}$ $\Lambda$ contains $\tilde{d}$ '1' and $d - \tilde{d}$ '0'
• let $X^T X = U \Sigma U^T$ (eigenvalue decomposition); $V = U$ with the $\lambda_j = 1$ positions matched to the largest $\sigma_i$ (equivalently, the $\lambda_j = 0$ positions matched to the smallest $\sigma_i$) is optimal
• so the optimal column vectors $w_j = v_j$ = top eigenvectors of $X^T X$

optimal linear auto-encoding ≡ principal component analysis (PCA), with $w_j$ being the principal components of the unshifted data
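A minimal numpy sketch of the resulting algorithm, taking the top d̃ eigenvectors of X^T X as the columns of W; sizes and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, d_tilde = 100, 6, 2
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))

# eigen-decompose X^T X = U Sigma U^T; eigh returns eigenvalues ascending
sigma, U = np.linalg.eigh(X.T @ X)
W = U[:, -d_tilde:]                      # top d~ eigenvectors as columns w_j

Xhat = X @ W @ W.T                       # optimal linear auto-encoding
print(np.linalg.norm(X - Xhat, 'fro') ** 2 / N)   # E_in(W)
```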
Denoising Auto-Encoder
Simple Auto-Encoder Revisited
simple auto-encoder: a d-d̃-d NNet

• want: hidden structure to capture the essence of x
• a naïve solution exists (but unwanted) when [over]-complete
• regularized weights needed in general

regularization towards a more robust hidden structure?
Idea of Denoising Auto-Encoder
robust hidden structure should allow g(x̃) ≈ x even when x̃ is slightly different from x

• denoising auto-encoder: run the auto-encoder with data $(\tilde{\mathbf{x}}_1, \mathbf{y}_1 = \mathbf{x}_1), (\tilde{\mathbf{x}}_2, \mathbf{y}_2 = \mathbf{x}_2), \ldots, (\tilde{\mathbf{x}}_N, \mathbf{y}_N = \mathbf{x}_N)$, where $\tilde{\mathbf{x}}_n = \mathbf{x}_n +$ artificial noise
• PCA auto-encoder + Gaussian noise:

$$\min_W E_{\text{in}}(W) = \frac{1}{N}\,\big\| X - (X + \text{noise})\, W W^T \big\|_F^2$$

simply an L2-regularized PCA

artificial noise as regularization! (practically also useful for other types of NNet)
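A minimal sketch of the idea, assuming Gaussian input noise of an illustrative scale; fitting the linear auto-encoder on the noisy inputs here is an illustrative choice, not the lecture's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, d_tilde = 100, 6, 2
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))

# x~_n = x_n + artificial Gaussian noise (scale 0.3 is an arbitrary choice)
X_noisy = X + rng.normal(scale=0.3, size=X.shape)

# fit the linear auto-encoder on the (x~_n, y_n = x_n) pairs; using the top
# eigenvectors of the noisy inputs' Gram matrix is an illustrative choice
sigma, U = np.linalg.eigh(X_noisy.T @ X_noisy)
W = U[:, -d_tilde:]

# denoising objective: reconstruct the *clean* X from the noisy inputs
print(np.linalg.norm(X - X_noisy @ W @ W.T, 'fro') ** 2 / N)
```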
Deep Neural Network
Final remark: hidden layers

[figure: a neural network with inputs 1, x_1, x_2, …, x_d feeding θ(s) units in a hidden layer, combined into the output h(x)]

learned nonlinear transform: interpretation?

(Learning From Data, Lecture 10)
Shallow versus Deep Structures
shallow: few hidden layers; deep: many hidden layers

Shallow:
• efficient to train
• powerful if enough neurons

Deep:
• challenging to train
• needs more structural (model) decisions
• potentially more 'meaningful' features?

deep structure (deep learning) has recently regained attention
Deep Learning Deep Neural Network
Key Techniques behind Deep Learning
• (usually) unsupervised pre-training between hidden layers, such as by a simple/denoising auto-encoder (see the sketch after this list): viewing hidden layers as 'condensing' a low-level representation into a high-level one
• fine-tune with backprop after initializing with those 'good' weights, because direct backprop may get stuck more easily
• speed-up: better optimization algorithms, and faster GPUs
• generalization issue less serious with big (enough) data

currently very useful for vision and speech recognition
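A minimal sketch of greedy layer-wise pre-training with the simple auto-encoder from earlier; layer sizes, learning rate, and epochs are illustrative, and the supervised fine-tuning step is only indicated in a comment:

```python
import numpy as np

def train_autoencoder(X, d_tilde, epochs=500, eta=0.05, seed=0):
    """Pre-train one d - d~ - d auto-encoder by GD; return the encoder
    weights (tanh hidden units, linear reconstruction)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W1 = rng.normal(scale=0.1, size=(d, d_tilde))
    W2 = rng.normal(scale=0.1, size=(d_tilde, d))
    for _ in range(epochs):
        H = np.tanh(X @ W1)
        err = H @ W2 - X                     # reconstruction error
        gW1 = X.T @ (err @ W2.T * (1 - H ** 2)) / len(X)
        gW2 = H.T @ err / len(X)
        W1 -= eta * gW1
        W2 -= eta * gW2
    return W1

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 16))               # raw inputs (illustrative)
weights, H = [], X
for d_tilde in [8, 4]:                       # illustrative layer sizes
    W = train_autoencoder(H, d_tilde)        # pre-train this layer alone
    weights.append(W)
    H = np.tanh(H @ W)                       # condensed representation
                                             # becomes next layer's input
print([W.shape for W in weights])
# 'weights' would now initialize the deep NNet, followed by fine-tuning
# with backprop on the supervised task
```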