Machine Learning Techniques (機器學習技法)
Lecture 13: Deep Learning
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Deep Learning
Roadmap
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

Lecture 12: Neural Network
automatic pattern feature extraction from layers of neurons with backprop for GD/SGD

Lecture 13: Deep Learning
• Deep Neural Network
• Autoencoder
• Denoising Autoencoder
• Principal Component Analysis
Deep Learning Deep Neural Network
Physical Interpretation of NNet Revisited
[figure: a multi-layer NNet with inputs x_0 = 1, x_1, x_2, ..., x_d, layers of tanh units, weights w_ij^(1), w_jk^(2), w_kq^(3), and a hidden output x_3^(2) = tanh(s_3^(2))]
• each layer: pattern feature extracted from data, remember? :-)
• how many neurons? how many layers? —more generally, what structure?
  • subjectively, your design!
  • objectively, validation, maybe?

structural decisions: key issue for applying NNet
Shallow versus Deep Neural Networks
shallow: few (hidden) layers; deep: many layers
Shallow NNet
• more efficient to train (✓)
• simpler structural decisions (✓)
• theoretically powerful enough (✓)

Deep NNet
• challenging to train (×)
• sophisticated structural decisions (×)
• ‘arbitrarily’ powerful (✓)
• more ‘meaningful’? (see next slide)

deep NNet (deep learning): gaining attention in recent years
Meaningfulness of Deep Learning
[figure: recognizing handwritten digits from raw pixels: simple stroke features φ_1, ..., φ_6 are extracted first, then combined through positive/negative weights into the outputs z_1 ("is it a ‘1’?") and z_5 ("is it a ‘5’?")]
• ‘less burden’ for each layer: simple to complex features
• natural for difficult learning tasks with raw features, like vision

deep NNet: currently popular in vision/speech/...
Challenges and Key Techniques for Deep Learning
• difficult structural decisions:
  • subjective with domain knowledge: like convolutional NNet for images
• high model complexity:
  • no big worries if big enough data
  • regularization towards noise-tolerance, like
    • dropout (tolerant when network corrupted)
    • denoising (tolerant when input corrupted)
• hard optimization problem:
  • careful initialization to avoid bad local minimum: called pre-training
• huge computational complexity (worsens with big data):
  • novel hardware/architecture: like mini-batch with GPU

IMHO, careful regularization and initialization are the key techniques
A Two-Step Deep Learning Framework
Simple Deep Learning
1 for ℓ = 1, ..., L, pre-train {w_ij^(ℓ)} assuming w*^(1), ..., w*^(ℓ−1) fixed
  [figure: panels (a)–(d) illustrating layer-by-layer pre-training]
2 train with backprop on the pre-trained NNet to fine-tune all {w_ij^(ℓ)}

will focus on the simplest pre-training technique along with regularization
(a minimal sketch of the two-step framework follows)
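A minimal NumPy sketch of this two-step framework, assuming two hypothetical helpers, pre_train_layer (for example, the autoencoder introduced later in this lecture) and backprop_fine_tune; the names and interfaces are illustrative rather than from the lecture.

```python
import numpy as np

def two_step_deep_learning(X, y, layer_sizes, pre_train_layer, backprop_fine_tune):
    """Sketch of the two-step framework: greedy layer-wise pre-training,
    then fine-tuning of all weights with backprop.
    pre_train_layer and backprop_fine_tune are assumed, user-supplied helpers."""
    weights = []
    Z = X                                      # x^(0): the raw inputs
    for d_out in layer_sizes:                  # for ell = 1, ..., L
        W = pre_train_layer(Z, d_out)          # pre-train w^(ell), earlier layers fixed
        weights.append(W)
        Z = np.tanh(Z @ W)                     # x^(ell): representation for layer ell+1
    return backprop_fine_tune(weights, X, y)   # step 2: fine-tune all {w_ij^(ell)}
```

Note that in this sketch the pre-training loop never looks at the labels y; only the final fine-tuning step does.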
Fun Time
For a deep NNet for written character recognition from raw pixels, which type of features is more likely extracted after the first hidden layer?
1 pixels
2 strokes
3 parts
4 digits

Reference Answer: 2
Simple strokes are likely the ‘next-level’ features that can be extracted from raw pixels.
Deep Learning Autoencoder
Information-Preserving Encoding
• weights: feature transform, i.e. encoding
• good weights: information-preserving encoding
  —next layer: same info. with different representation
• information-preserving: decode accurately after encoding

[figure: the digit-recognition network again, with stroke features φ_1, ..., φ_6 combined by positive/negative weights into the outputs z_1 ("is it a ‘1’?") and z_5 ("is it a ‘5’?")]

idea: pre-train weights towards information-preserving encoding
Information-Preserving Neural Network
[figure: a d–d̃–d NNet with inputs x_0 = 1, x_1, x_2, ..., x_d, a tanh hidden layer, and outputs approximating x_1, ..., x_d; encoding weights w_ij^(1), decoding weights w_ji^(2)]

• autoencoder: d–d̃–d NNet with goal g_i(x) ≈ x_i
  —learning to approximate the identity function
• w_ij^(1): encoding weights; w_ji^(2): decoding weights

why approximate the identity function?
Usefulness of Approximating Identity Function
if g(x) ≈ x using some hidden structures on the observed data x_n

• for supervised learning:
  • the hidden structure (essence) of x can be used as a reasonable transform Φ(x)
  —learning an ‘informative’ representation of data
• for unsupervised learning:
  • density estimation: larger (structure match) when g(x) ≈ x
  • outlier detection: those x where g(x) ≉ x
  —learning a ‘typical’ representation of data

autoencoder: representation-learning through approximating the identity function
Basic Autoencoder
basic autoencoder: d–d̃–d NNet with error function \sum_{i=1}^{d} \big(g_i(x) - x_i\big)^2

• backprop easily applies; shallow and easy to train
• usually d̃ < d: compressed representation
• data: {(x_1, y_1 = x_1), (x_2, y_2 = x_2), ..., (x_N, y_N = x_N)}
  —often categorized as an unsupervised learning technique
• sometimes constrain w_ij^(1) = w_ji^(2) as regularization
  —more sophisticated in calculating the gradient

basic autoencoder in basic deep learning:
{w_ij^(1)} taken as shallowly pre-trained weights (a minimal training sketch follows)
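A minimal NumPy sketch of such a basic autoencoder, written for this note rather than taken from the lecture: a tanh hidden layer with a linear output (one reasonable choice), squared reconstruction error, and plain batch gradient descent.

```python
import numpy as np

def train_basic_autoencoder(X, d_tilde, eta=0.01, epochs=200, seed=0):
    """d - d_tilde - d autoencoder: tanh hidden layer, linear output,
    squared reconstruction error, plain batch gradient descent.
    Biases are omitted and weights are untied for brevity.
    Returns (W1, W2): encoding weights w_ij^(1) and decoding weights w_ji^(2)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, d_tilde))     # encoding weights
    W2 = rng.normal(scale=0.1, size=(d_tilde, d))     # decoding weights
    for _ in range(epochs):
        H = np.tanh(X @ W1)              # hidden representation (the learned transform)
        G = H @ W2                       # reconstruction g(x)
        delta_out = 2 * (G - X) / N      # gradient of the mean squared reconstruction error
        delta_hid = (delta_out @ W2.T) * (1 - H ** 2)   # backprop through tanh
        W2 -= eta * H.T @ delta_out
        W1 -= eta * X.T @ delta_hid
    return W1, W2

# usage: W1 serves as one layer of shallowly pre-trained weights
# X = np.random.default_rng(1).normal(size=(100, 10))
# W1, W2 = train_basic_autoencoder(X, d_tilde=3)
```

The returned W1 is what the deep learning framework above would keep as the shallowly pre-trained weights of one layer.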
Pre-Training with Autoencoders
Deep Learning with Autoencoders
1 for ℓ = 1, ..., L, pre-train {w_ij^(ℓ)} assuming w*^(1), ..., w*^(ℓ−1) fixed,
  by training a basic autoencoder on {x_n^(ℓ−1)} with d̃ = d^(ℓ)
2 train with backprop on the pre-trained NNet to fine-tune all {w_ij^(ℓ)}

many successful pre-training techniques take ‘fancier’ autoencoders with different architectures and regularization schemes
(a sketch of the layer-wise loop follows)
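A hedged sketch of step 1 above, reusing the train_basic_autoencoder helper from the previous sketch (also my own illustration): each layer's encoding weights are learned from the representations x^(ℓ−1) produced by the layers already pre-trained.

```python
import numpy as np

def pretrain_deep_nnet(X, hidden_sizes):
    """Greedy layer-wise pre-training: layer ell is pre-trained as a basic
    autoencoder on the representations produced by layers 1..ell-1.
    Assumes train_basic_autoencoder from the earlier sketch."""
    pretrained = []
    Z = X                                           # x^(0): raw inputs
    for d_ell in hidden_sizes:                      # d^(1), ..., d^(L)
        W1, _ = train_basic_autoencoder(Z, d_ell)   # keep only the encoding weights
        pretrained.append(W1)
        Z = np.tanh(Z @ W1)                         # x^(ell), fed to the next autoencoder
    return pretrained                               # step 2 fine-tunes these with backprop
```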
Fun Time
Suppose training a d–d̃–d autoencoder with backprop takes approximately c · d · d̃ seconds. Then, what is the total number of seconds needed for pre-training a d–d^(1)–d^(2)–d^(3)–1 deep NNet?
1 c(d + d^(1) + d^(2) + d^(3) + 1)
2 c(d · d^(1) · d^(2) · d^(3) · 1)
3 c(d d^(1) + d^(1) d^(2) + d^(2) d^(3) + d^(3))
4 c(d d^(1) · d^(1) d^(2) · d^(2) d^(3) · d^(3))

Reference Answer: 3
Each c · d^(ℓ−1) · d^(ℓ) represents the time for pre-training with one autoencoder to determine one layer of the weights.
Deep Learning Denoising Autoencoder
Regularization in Deep Learning
[figure: the multi-layer NNet again, with inputs x_0 = 1, x_1, ..., x_d, tanh units, and weights w_ij^(1), w_jk^(2), w_kq^(3)]

watch out for overfitting, remember? :-)

high model complexity: regularization needed
• structural decisions/constraints
• weight-decay or weight-elimination regularizers
• early stopping

next: another regularization technique
Reasons of Overfitting Revisited
[figure: overfitting as a function of the number of data points N and the stochastic noise level σ^2]

reasons of serious overfitting:
• data size N ↓       ⇒ overfit ↑
• noise ↑             ⇒ overfit ↑
• excessive power ↑   ⇒ overfit ↑

how to deal with noise?
Dealing with Noise
• direct possibility: data cleaning/pruning, remember? :-)
• a wild possibility: adding noise to data?
• idea: a robust autoencoder should not only let g(x) ≈ x but also allow g(x̃) ≈ x even when x̃ is slightly different from x
• denoising autoencoder: run the basic autoencoder with data
  {(x̃_1, y_1 = x_1), (x̃_2, y_2 = x_2), ..., (x̃_N, y_N = x_N)}, where x̃_n = x_n + artificial noise
  —often used instead of the basic autoencoder in deep learning
• useful for data/image processing: g(x̃) is a denoised version of x̃
• effect: ‘constrain/regularize’ g towards noise-tolerance

denoising: artificial noise/hint as regularization!
—practically also useful for other NNet/models (a minimal sketch follows)
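A hedged sketch of the denoising variant, mirroring the basic autoencoder sketch above (both are my own illustrations): the inputs are corrupted with Gaussian noise, one possible choice of artificial noise, while the reconstruction targets remain the clean x_n.

```python
import numpy as np

def train_denoising_autoencoder(X, d_tilde, noise_std=0.1, eta=0.01, epochs=200, seed=0):
    """Same d - d_tilde - d network and squared error as the basic autoencoder,
    but each pass corrupts the inputs while the targets stay clean."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, d_tilde))
    W2 = rng.normal(scale=0.1, size=(d_tilde, d))
    for _ in range(epochs):
        X_tilde = X + noise_std * rng.normal(size=X.shape)  # x~_n = x_n + artificial noise
        H = np.tanh(X_tilde @ W1)        # encode the corrupted input
        G = H @ W2                       # g(x~): should reconstruct the clean x
        delta_out = 2 * (G - X) / N      # error measured against the clean targets
        delta_hid = (delta_out @ W2.T) * (1 - H ** 2)
        W2 -= eta * H.T @ delta_out
        W1 -= eta * X_tilde.T @ delta_hid
    return W1, W2
```

The noise level noise_std then acts much like a regularization parameter: more corruption pushes g further towards noise-tolerance.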
Fun Time
Which of the following cannot be viewed as a regularization technique?
1 hint the model with artificially-generated noisy data
2 stop gradient descent early
3 add a weight-elimination regularizer
4 all of the above are regularization techniques

Reference Answer: 4
1 is our new friend for regularization, while 2 and 3 are old friends.
Deep Learning Principal Component Analysis
Linear Autoencoder Hypothesis
nonlinear autoencoder: sophisticated
linear autoencoder: simple
—linear: more efficient? less overfitting? linear first, remember? :-)

linear hypothesis for the k-th component:
  h_k(x) = \sum_{j=0}^{\tilde{d}} w_{kj} \Big( \sum_{i=1}^{d} w_{ij} x_i \Big)

consider three special conditions:
• exclude x_0: range of i same as range of k
• constrain w_ij^(1) = w_ji^(2) = w_ij as regularization
  —denote W = [w_ij] of size d × d̃
• assume d̃ < d: ensure a non-trivial solution

linear autoencoder hypothesis: h(x) = W W^T x
Linear Autoencoder Error Function
E_in(h) = E_in(W) = \frac{1}{N} \sum_{n=1}^{N} \left\| x_n - W W^T x_n \right\|^2   with the d × d̃ matrix W

—analytic solution to minimize E_in? but it is a 4th-order polynomial of the w_ij

let's familiarize the problem with linear algebra (be brave! :-))
• eigen-decompose W W^T = V Γ V^T
  • d × d matrix V orthogonal: V V^T = V^T V = I_d
  • d × d matrix Γ diagonal with ≤ d̃ non-zero entries
• W W^T x_n = V Γ V^T x_n
  • V^T x_n: change of orthonormal basis (rotate or reflect)
  • Γ (···): set ≥ d − d̃ components to 0, and scale the others
  • V (···): reconstruct by coefficients and basis (back-rotate)
• x_n = V I V^T x_n: rotate and back-rotate cancel out

next: minimize E_in by optimizing Γ and V
The Optimal Γ
\min_V \min_\Gamma \frac{1}{N} \sum_{n=1}^{N} \Big\| \underbrace{V I V^T x_n}_{x_n} - \underbrace{V \Gamma V^T x_n}_{W W^T x_n} \Big\|^2

• back-rotate V does not affect length: drop V
• \min_\Gamma \sum \|(I - \Gamma)(\text{some vector})\|^2: want many 0's within (I − Γ)
• optimal diagonal Γ with rank ≤ d̃: d̃ diagonal components 1, other components 0
  ⟹ without loss of generality  Γ = \begin{bmatrix} I_{\tilde d} & 0 \\ 0 & 0 \end{bmatrix}

next:  \min_V \sum_{n=1}^{N} \Big\| \underbrace{\begin{bmatrix} 0 & 0 \\ 0 & I_{d-\tilde d} \end{bmatrix}}_{I - \text{optimal } \Gamma} V^T x_n \Big\|^2
The Optimal V
\min_V \sum_{n=1}^{N} \Big\| \begin{bmatrix} 0 & 0 \\ 0 & I_{d-\tilde d} \end{bmatrix} V^T x_n \Big\|^2
  \equiv \max_V \sum_{n=1}^{N} \Big\| \begin{bmatrix} I_{\tilde d} & 0 \\ 0 & 0 \end{bmatrix} V^T x_n \Big\|^2

• d̃ = 1: only the first row v^T of V^T matters:
  \max_v \sum_{n=1}^{N} v^T x_n x_n^T v   subject to  v^T v = 1
• the optimal v satisfies \sum_{n=1}^{N} x_n x_n^T v = \lambda v
  —using a Lagrange multiplier λ, remember? :-)
• optimal v: ‘topmost’ eigenvector of X^T X
• general d̃: {v_j}_{j=1}^{d̃} are the ‘topmost’ eigenvectors of X^T X
  —optimal {w_j} = {v_j with γ_j = 1} = top eigenvectors

linear autoencoder: projecting to orthogonal patterns w_j that ‘match’ {x_n} most
(a quick numerical check follows)
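A quick NumPy check of this conclusion (my own illustration, not part of the lecture): the topmost eigenvector of X^T X attains objective value λ, and a random unit vector does no better.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                  # rows are x_n

# eigen-decompose X^T X; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
v_top = eigvecs[:, -1]                         # 'topmost' eigenvector

def objective(v):                              # sum_n v^T x_n x_n^T v = v^T X^T X v
    return v @ (X.T @ X) @ v

print(np.isclose(objective(v_top), eigvals[-1]))      # True: objective value equals lambda

v_rand = rng.normal(size=5)
v_rand /= np.linalg.norm(v_rand)               # another unit vector
print(objective(v_rand) <= objective(v_top) + 1e-9)   # True: no unit vector does better
```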
Principal Component Analysis
Linear Autoencoder or PCA
1 let x̄ = (1/N) \sum_{n=1}^{N} x_n, and let x_n ← x_n − x̄
2 calculate the d̃ top eigenvectors w_1, w_2, ..., w_d̃ of X^T X
3 return feature transform Φ(x) = W(x − x̄)

• linear autoencoder: maximize \sum (magnitude after projection)^2
• principal component analysis (PCA) from statistics: maximize \sum (variance after projection)
• both useful for linear dimension reduction, though PCA is more popular

linear dimension reduction: useful for data processing
(a short sketch follows)
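A short NumPy sketch of these three steps (my own illustration, not the lecture's reference code), with the rows of W being the d̃ top eigenvectors of X^T X computed after centering.

```python
import numpy as np

def pca_transform(X, d_tilde):
    """Steps 1-3 above: center the data, take the d_tilde top eigenvectors of
    X^T X (as rows of W), and return the transform Phi(x) = W (x - x_bar)."""
    x_bar = X.mean(axis=0)                         # step 1: mean ...
    Xc = X - x_bar                                 #         ... and centering
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)   # step 2: eigenvalues in ascending order
    W = eigvecs[:, -d_tilde:][:, ::-1].T           #         top d_tilde eigenvectors as rows
    return W, x_bar, (lambda x: W @ (x - x_bar))   # step 3: the feature transform Phi

# usage (illustrative): project 10-dimensional points onto 2 principal directions
# X = np.random.default_rng(0).normal(size=(100, 10))
# W, x_bar, phi = pca_transform(X, d_tilde=2)
# Z = (X - x_bar) @ W.T   # the same transform applied to all rows at once
```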
Fun Time
When solving the optimization problem \max_v \sum_{n=1}^{N} v^T x_n x_n^T v subject to v^T v = 1, we know that the optimal v is the ‘topmost’ eigenvector that corresponds to the ‘topmost’ eigenvalue λ of X^T X. Then, what is the optimal objective value of the optimization problem?
1 λ^1
2 λ^2
3 λ^3
4 λ^4

Reference Answer: 1
The objective value of the optimization problem is simply v^T X^T X v, which is λ v^T v, and you know what v^T v must be.