Machine Learning Techniques
(機器學習技法)

Lecture 13: Deep Learning

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)
Deep Learning

Roadmap
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

Lecture 12: Neural Network
automatic pattern feature extraction from layers of neurons, with backprop for GD/SGD

Lecture 13: Deep Learning
• Deep Neural Network
• Autoencoder
• Denoising Autoencoder
• Principal Component Analysis
Deep Learning / Deep Neural Network

Physical Interpretation of NNet Revisited

[figure: a multi-layer NNet with inputs x_0 = 1, x_1, x_2, ..., x_d, a +1 bias unit per layer, tanh neurons, and weights w_ij^(1), w_jk^(2), w_kq^(3); one hidden neuron shown computing x_3^(2) = tanh(s_3^(2))]

• each layer: a pattern feature extracted from data, remember? :-)
• how many neurons? how many layers? more generally, what structure?
  • subjectively, your design!
  • objectively, validation, maybe?

structural decisions: the key issue for applying NNet
Deep Learning / Deep Neural Network

Shallow versus Deep Neural Networks

shallow: few (hidden) layers; deep: many layers

Shallow NNet
• more efficient to train (✓)
• simpler structural decisions (✓)
• theoretically powerful enough (✓)

Deep NNet
• challenging to train (×)
• sophisticated structural decisions (×)
• 'arbitrarily' powerful (✓)
• more 'meaningful'? (see next slide)

deep NNet (deep learning): gaining attention in recent years
Deep Learning / Deep Neural Network

Meaningfulness of Deep Learning

[figure: a small deep NNet for deciding "is it a '1'?" versus "is it a '5'?"; raw pixel inputs feed simple features φ_1, ..., φ_6, which combine through positive and negative weights into higher-level features z_1 and z_5]

• 'less burden' for each layer: simple to complex features
• natural for a difficult learning task with raw features, like vision

deep NNet: currently popular in vision/speech/...
Deep Learning / Deep Neural Network

Challenges and Key Techniques for Deep Learning

• difficult structural decisions:
  • subjective with domain knowledge: like convolutional NNet for images
• high model complexity:
  • no big worries if big enough data
  • regularization towards noise tolerance, like:
    • dropout (tolerant when network corrupted; a rough sketch follows this slide)
    • denoising (tolerant when input corrupted)
• hard optimization problem:
  • careful initialization to avoid bad local minima: called pre-training
• huge computational complexity (worsened with big data):
  • novel hardware/architecture: like mini-batch with GPU

IMHO, careful regularization and initialization are the key techniques
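As a rough illustration of the dropout idea above (training while the network is randomly 'corrupted'), here is a minimal sketch in Python/NumPy of the commonly used 'inverted dropout' on one hidden layer; the keep probability p = 0.5 is an illustrative choice, not a value from the lecture.

import numpy as np

def dropout_forward(h, p=0.5, rng=None):
    # randomly zero hidden activations with probability 1 - p during training,
    # scaling the survivors by 1/p so the expected activation stays unchanged;
    # at test time the layer is used without any mask
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = (rng.random(h.shape) < p) / p
    return h * mask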
Deep Learning / Deep Neural Network

A Two-Step Deep Learning Framework

Simple Deep Learning
1 for ℓ = 1, ..., L, pre-train {w_ij^(ℓ)} assuming w*^(1), ..., w*^(ℓ-1) fixed
  [figure: pre-training proceeds layer by layer, illustrated in stages (a)-(d)]
2 train with backprop on the pre-trained NNet to fine-tune all {w_ij^(ℓ)}

will focus on the simplest pre-training technique, along with regularization
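The two steps can be written as a short skeleton. This is a minimal sketch in Python/NumPy, not the lecture's notation: `pretrain_layer` and `fine_tune` are assumed callables (a basic autoencoder for the former is sketched later), and `layer_dims` lists the hidden-layer sizes d^(1), ..., d^(L).

import numpy as np

def forward(Z, W):
    # one NNet layer: x^(l) = tanh([1, x^(l-1)] w^(l))
    Z1 = np.hstack([np.ones((Z.shape[0], 1)), Z])
    return np.tanh(Z1 @ W)

def two_step_deep_learning(X, y, layer_dims, pretrain_layer, fine_tune):
    weights, Z = [], X                   # Z holds x^(l-1), starting from the raw inputs
    for d_l in layer_dims:               # step 1: greedy layer-wise pre-training
        W = pretrain_layer(Z, d_l)       # learn w^(l) with earlier weights fixed
        weights.append(W)
        Z = forward(Z, W)                # representation fed to the next layer
    return fine_tune(X, y, weights)      # step 2: backprop fine-tunes all w^(l)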
Deep Learning / Deep Neural Network

Fun Time

For a deep NNet for written character recognition from raw pixels, which type of features are more likely extracted after the first hidden layer?
1 pixels
2 strokes
3 parts
4 digits

Reference Answer: 2
Simple strokes are likely the 'next-level' features that can be extracted from raw pixels.
Deep Learning / Autoencoder

Information-Preserving Encoding

• weights: a feature transform, i.e. encoding
• good weights: information-preserving encoding
  (next layer keeps the same info. with a different representation)
• information-preserving: can decode accurately after encoding

[figure: the '1' versus '5' NNet again, with low-level features φ_1, ..., φ_6, outputs z_1 and z_5, and positive/negative weights]

idea: pre-train weights towards information-preserving encoding
Deep Learning / Autoencoder

Information-Preserving Neural Network

[figure: a d-d̃-d NNet with inputs x_0 = 1, x_1, x_2, x_3, ..., x_d, tanh hidden neurons, and outputs ≈ x_1, ≈ x_2, ≈ x_3, ..., ≈ x_d; encoding weights w_ij^(1) and decoding weights w_ji^(2)]

• autoencoder: a d-d̃-d NNet with goal g_i(x) ≈ x_i, that is, learning to approximate the identity function
• w_ij^(1): encoding weights; w_ji^(2): decoding weights

why approximate the identity function?
Deep Learning / Autoencoder

Usefulness of Approximating Identity Function

if g(x) ≈ x using some hidden structures on the observed data x_n:

• for supervised learning:
  • hidden structure (essence) of x can be used as a reasonable transform Φ(x)
  (learning an 'informative' representation of data)
• for unsupervised learning:
  • density estimation: larger (structure match) when g(x) ≈ x
  • outlier detection: those x where g(x) ≉ x (a small sketch follows this slide)
  (learning a 'typical' representation of data)

autoencoder: representation-learning through approximating the identity function
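A small sketch of the outlier-detection use above: flag points whose reconstruction g(x) stays far from x. Here `reconstruct` stands for a trained autoencoder's g(·), and the threshold is a hypothetical choice; neither comes from the lecture.

import numpy as np

def find_outliers(X, reconstruct, threshold=1.0):
    # squared reconstruction error per example; a large error means a poor structure match
    errors = np.sum((reconstruct(X) - X) ** 2, axis=1)
    return np.where(errors > threshold)[0]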
Deep Learning / Autoencoder

Basic Autoencoder

basic autoencoder: a d-d̃-d NNet with error function Σ_{i=1}^{d} (g_i(x) − x_i)²

• backprop easily applies; shallow and easy to train
• usually d̃ < d: a compressed representation
• data: {(x_1, y_1 = x_1), (x_2, y_2 = x_2), ..., (x_N, y_N = x_N)}
  (often categorized as an unsupervised learning technique)
• sometimes constrain w_ij^(1) = w_ji^(2) as regularization
  (more sophisticated gradient calculation)

basic autoencoder in basic deep learning: {w_ij^(1)} taken as shallowly pre-trained weights
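A minimal sketch of the basic autoencoder in Python/NumPy, trained by stochastic gradient descent on the squared reconstruction error. It uses a tanh hidden layer and, for simplicity, a linear output layer (the lecture's figure shows tanh outputs as well); the learning rate, number of epochs, and initialization range are illustrative choices.

import numpy as np

def train_basic_autoencoder(X, d_tilde, eta=0.1, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W1 = rng.uniform(-0.1, 0.1, size=(d + 1, d_tilde))    # encoding weights (with bias row)
    W2 = rng.uniform(-0.1, 0.1, size=(d_tilde + 1, d))    # decoding weights (with bias row)
    for _ in range(epochs):
        for n in rng.permutation(N):                       # SGD over examples
            x = X[n]
            x1 = np.append(1.0, x)                         # [1, x]
            h = np.tanh(x1 @ W1)                           # hidden (encoded) representation
            h1 = np.append(1.0, h)
            g = h1 @ W2                                    # reconstruction g(x) ≈ x
            err = g - x                                    # gradient of squared error (factor 2 absorbed into eta)
            grad_W2 = np.outer(h1, err)
            delta_h = (W2[1:] @ err) * (1.0 - h ** 2)      # backprop through tanh
            grad_W1 = np.outer(x1, delta_h)
            W2 -= eta * grad_W2
            W1 -= eta * grad_W1
    return W1, W2                                          # W1: shallowly pre-trained weights for one layer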
Deep Learning / Autoencoder

Pre-Training with Autoencoders

Deep Learning with Autoencoders
1 for ℓ = 1, ..., L, pre-train {w_ij^(ℓ)} assuming w*^(1), ..., w*^(ℓ-1) fixed,
  by training a basic autoencoder on {x_n^(ℓ-1)} with d̃ = d^(ℓ)
2 train with backprop on the pre-trained NNet to fine-tune all {w_ij^(ℓ)}

many successful pre-training techniques take 'fancier' autoencoders with different architectures and regularization schemes
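For concreteness, this is how the basic autoencoder plugs into step 1 of the framework sketch given earlier; `train_basic_autoencoder` and `two_step_deep_learning` are the hypothetical helpers sketched above, and `backprop_fine_tune`, `X`, `y`, and the layer sizes are assumed to be provided.

def pretrain_layer(Z, d_l):
    W1, _ = train_basic_autoencoder(Z, d_tilde=d_l)   # keep only the encoding weights
    return W1

weights = two_step_deep_learning(X, y, layer_dims=[128, 64, 32],
                                 pretrain_layer=pretrain_layer,
                                 fine_tune=backprop_fine_tune)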
Deep Learning / Autoencoder

Fun Time

Suppose training a d-d̃-d autoencoder with backprop takes approximately c · d · d̃ seconds. Then, what is the total number of seconds needed for pre-training a d-d^(1)-d^(2)-d^(3)-1 deep NNet?
1 c (d + d^(1) + d^(2) + d^(3) + 1)
2 c (d · d^(1) · d^(2) · d^(3) · 1)
3 c (d d^(1) + d^(1) d^(2) + d^(2) d^(3) + d^(3))
4 c (d d^(1) · d^(1) d^(2) · d^(2) d^(3) · d^(3))

Reference Answer: 3
Each c · d^(ℓ-1) · d^(ℓ) represents the time for pre-training with one autoencoder to determine one layer of the weights.
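A quick worked check of the answer with hypothetical layer sizes (not from the lecture):

c, d, d1, d2, d3 = 1.0, 256, 128, 64, 32
total = c * (d * d1 + d1 * d2 + d2 * d3 + d3 * 1)   # one autoencoder per layer; the final layer has 1 output
print(total)                                         # 43040.0 seconds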
Deep Learning / Denoising Autoencoder

Regularization in Deep Learning

[figure: the multi-layer NNet with weights w_ij^(1), w_jk^(2), w_kq^(3) again]

watch out for overfitting, remember? :-)

high model complexity: regularization needed
• structural decisions/constraints
• weight decay or weight elimination regularizers (a one-step sketch follows this slide)
• early stopping

next: another regularization technique
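For concreteness, a minimal sketch of how a weight-decay regularizer changes one gradient-descent update; the augmented-error form, eta, lam, and N are illustrative assumptions, not values from the lecture.

import numpy as np

def weight_decay_step(W, grad_Ein, eta=0.1, lam=0.01, N=1000):
    # one step on the augmented error E_aug(W) = E_in(W) + (lam / N) * sum(W ** 2)
    return W - eta * (grad_Ein + (2.0 * lam / N) * W)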