### Machine Learning Techniques
### (機器學習技法)

### Lecture 13: Deep Learning

Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)

Deep Learning

### Roadmap

1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

### Lecture 12: Neural Network

automatic **pattern feature extraction** from **layers of neurons** with **backprop** for GD/SGD

### Lecture 13: Deep Learning

Deep Neural Network
Autoencoder
Denoising Autoencoder
Principal Component Analysis

Deep Learning Deep Neural Network

### Physical Interpretation of NNet Revisited

[Figure: a multi-layer NNet with inputs x_0 = 1, x_1, x_2, . . . , x_d, bias units +1, tanh neurons in the hidden layers, and weights w_{ij}^{(1)}, w_{jk}^{(2)}, w_{kq}^{(3)}; a score s_3^{(2)} passes through tanh to give the transformed feature x_3^{(2)}]

• each layer: **pattern feature extracted** from data, **remember? :-)**
• how many neurons? how many layers? More generally, **what structure?**
  • subjectively, **your design!**
  • objectively, **validation, maybe?**

structural decisions: the **key issue** for applying NNet
### Shallow versus Deep Neural Networks

shallow: few (hidden) layers; deep: many layers

### Shallow NNet

• more **efficient** to train (✓)
• **simpler** structural decisions (✓)
• theoretically **powerful enough** (✓)

### Deep NNet

• **challenging** to train (×)
• **sophisticated** structural decisions (×)
• **‘arbitrarily’ powerful** (✓)
• more **‘meaningful’?** (see next slide)

deep NNet (deep learning): **gaining attention** in recent years
### Meaningfulness of Deep Learning

[Figure: recognizing handwritten digits: outputs z_1 ("is it a ‘1’?") and z_5 ("is it a ‘5’?") built on mid-level features φ_1, . . . , φ_6 extracted from raw pixels, connected by positive and negative weights]

• **‘less burden’** for each layer: **simple** to **complex** features
• natural for **difficult** learning tasks with **raw features, like vision**

deep NNet: currently popular in **vision/speech/. . .**
### Challenges and Key Techniques for Deep Learning

• difficult **structural decisions:**
  • subjective with **domain knowledge:** like **convolutional NNet** for images
• high **model complexity:**
  • no big worries if **big enough data**
  • **regularization** towards noise tolerance: like
    • **dropout** (tolerant when network corrupted; sketched below)
    • **denoising** (tolerant when input corrupted)
• hard **optimization problem:**
  • **careful initialization** to avoid bad local minima: called **pre-training**
• huge **computational complexity** (worsened with **big data**):
  • novel hardware/architecture: like **mini-batch with GPU**

IMHO, careful **regularization** and **initialization** are the key techniques
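As one concrete instance of the regularization-towards-noise-tolerance idea, here is a minimal sketch of a dropout mask (not from the lecture; the 'inverted dropout' scaling and the rate p = 0.5 are common choices, assumed here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p=0.5, training=True):
    """Randomly zero each hidden activation with probability p.

    Uses 'inverted dropout': surviving units are scaled by 1/(1-p)
    so no rescaling is needed at test time.
    """
    if not training:
        return h
    mask = (rng.random(h.shape) >= p) / (1.0 - p)
    return h * mask

h = np.tanh(rng.standard_normal((4, 8)))   # a batch of hidden activations
print(dropout_forward(h, p=0.5))           # roughly half the entries zeroed
```

Training on such randomly corrupted networks is what makes the final NNet tolerant to corruption, which is the regularization effect the slide alludes to.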

### A Two-Step Deep Learning Framework

Simple Deep Learning

1 for ℓ = 1, . . . , L, **pre-train** {w_{ij}^{(ℓ)}} assuming w_*^{(1)}, . . . , w_*^{(ℓ−1)} fixed
2 **train with backprop** on the **pre-trained** NNet to **fine-tune** all {w_{ij}^{(ℓ)}}

will focus on the **simplest pre-training** technique, along with **regularization**
### Fun Time

For a deep NNet for written character recognition from raw pixels, which type of features are more likely extracted after the first hidden layer?

1 pixels
2 strokes
3 parts
4 digits

### Reference Answer: 2

Simple strokes are likely the ‘next-level’ features that can be extracted from raw pixels.

Deep Learning Autoencoder

### Information-Preserving Encoding

• **weights: feature transform, i.e. encoding**
• **good weights:** information-preserving encoding, so the next layer carries the **same info.** with a **different representation**
• **information-preserving:** can **decode accurately** after **encoding**

[Figure: the ‘1’-versus-‘5’ network again, with outputs z_1, z_5, features φ_1, . . . , φ_6, and positive/negative weights]

idea: **pre-train weights** towards **information-preserving** encoding
### Information-Preserving Neural Network

[Figure: a d-d̃-d NNet mapping x_0 = 1, x_1, . . . , x_d through tanh hidden units back to outputs ≈ x_1, . . . , ≈ x_d, with weights w_{ij}^{(1)} and w_{ji}^{(2)}]

• **autoencoder:** a d-d̃-d NNet with goal g_i(x) ≈ x_i, i.e. learning to **approximate the identity function**
• w_{ij}^{(1)}: encoding weights; w_{ji}^{(2)}: decoding weights (see the forward-pass sketch below)

why **approximate the identity function?**
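A minimal sketch (mine, not the lecture's) of the d-d̃-d forward pass, just to fix the notation: encoding with w^{(1)} produces the hidden representation, decoding with w^{(2)} tries to reproduce the input. Dimensions and initialization are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_tilde = 8, 3                              # d-d̃-d: compress 8 inputs to 3 hidden units

W1 = rng.standard_normal((d, d_tilde)) * 0.1   # encoding weights w_{ij}^{(1)}
b1 = np.zeros(d_tilde)
W2 = rng.standard_normal((d_tilde, d)) * 0.1   # decoding weights w_{ji}^{(2)}
b2 = np.zeros(d)

def g(x):
    """Autoencoder output: decode(encode(x)); the goal is g(x) ≈ x."""
    h = np.tanh(x @ W1 + b1)                   # encoding: the learned representation
    return np.tanh(h @ W2 + b2)                # decoding: reconstruct the input

x = rng.standard_normal(d) * 0.5               # kept small: tanh outputs live in (-1, 1)
print(np.mean((g(x) - x) ** 2))                # reconstruction error (large before training)
```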


### Usefulness of Approximating Identity Function

if **g(x) ≈ x** using some **hidden** structures on the **observed data x_n**

• for supervised learning:
  • hidden structure (essence) of **x** can be used as a reasonable transform Φ(x)
  that is, learning an **‘informative’ representation** of data
• for unsupervised learning:
  • density estimation: larger (structure match) when **g(x) ≈ x**
  • outlier detection: those **x** where **g(x) ≉ x** (see the sketch below)
  that is, learning a **‘typical’ representation** of data

**autoencoder: representation-learning through approximating the identity function**
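To make the outlier-detection use concrete, here is a hedged sketch that scores each point by its reconstruction error ∑_i (g_i(x) − x_i)². For brevity it fits a linear autoencoder via truncated SVD as a stand-in for the tanh autoencoder introduced next; the data and dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

# 'typical' data living near a 2-D plane inside R^5, plus one outlier
Z = rng.standard_normal((200, 2))
A = rng.standard_normal((2, 5))
X = Z @ A + 0.05 * rng.standard_normal((200, 5))
X[0] = 5.0                                 # an atypical point

# linear autoencoder fit by truncated SVD: g(x) = mu + (x - mu) V_k V_k^T
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
Vk = Vt[:2].T                              # d̃ = 2 encoding directions

def g(x):
    return mu + (x - mu) @ Vk @ Vk.T       # encode, then decode

errors = np.sum((X - g(X)) ** 2, axis=1)   # g(x) ≉ x  =>  large error
print(np.argmax(errors))                   # flags index 0, the planted outlier
```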


### Basic Autoencoder

basic **autoencoder:** a d-d̃-d NNet with error function ∑_{i=1}^{d} (g_i(x) − x_i)²

• backprop **easily** applies; **shallow** and **easy** to train
• usually d̃ < d: **compressed** representation
• data: {(x_1, y_1 = x_1), (x_2, y_2 = x_2), . . . , (x_N, y_N = x_N)}
  (often categorized as an **unsupervised learning technique**)
• sometimes constrain w_{ij}^{(1)} = w_{ji}^{(2)} as **regularization**
  (more **sophisticated** in calculating the gradient)

basic **autoencoder** in basic deep learning: {w_{ij}^{(1)}} taken as the shallowly pre-trained weights (a runnable sketch follows)
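Not part of the original slides: a minimal runnable NumPy sketch of this basic autoencoder, with tanh units on both layers and the squared error above. The learning rate, epoch count, and toy data are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def train_basic_autoencoder(X, d_tilde, eta=0.1, epochs=200):
    """Fit a d-d̃-d tanh autoencoder by SGD backprop on the
    squared error sum_i (g_i(x) - x_i)^2."""
    N, d = X.shape
    W1 = rng.standard_normal((d, d_tilde)) * 0.1   # encoding weights w^(1)
    b1 = np.zeros(d_tilde)
    W2 = rng.standard_normal((d_tilde, d)) * 0.1   # decoding weights w^(2)
    b2 = np.zeros(d)
    for _ in range(epochs):
        for n in rng.permutation(N):               # SGD: one example at a time
            x = X[n]
            h = np.tanh(x @ W1 + b1)               # encode
            g = np.tanh(h @ W2 + b2)               # decode
            delta2 = 2 * (g - x) * (1 - g ** 2)    # output-layer sensitivity
            delta1 = (delta2 @ W2.T) * (1 - h ** 2)
            W2 -= eta * np.outer(h, delta2); b2 -= eta * delta2
            W1 -= eta * np.outer(x, delta1); b1 -= eta * delta1
    return W1, b1, W2, b2

# toy data kept inside (-1, 1) so the tanh output can reach it
X = 0.8 * np.tanh(rng.standard_normal((100, 2)) @ rng.standard_normal((2, 6)))
W1, b1, W2, b2 = train_basic_autoencoder(X, d_tilde=2)
G = np.tanh(np.tanh(X @ W1 + b1) @ W2 + b2)
print("mean reconstruction error:", np.mean(np.sum((G - X) ** 2, axis=1)))
```

The tied-weight variant w_{ij}^{(1)} = w_{ji}^{(2)} would replace W2 by W1.T and sum the two gradient contributions into a single update of W1, which is exactly the 'more sophisticated' gradient calculation the slide mentions.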


### Pre-Training with Autoencoders

Deep Learning with Autoencoders

1 for ℓ = 1, . . . , L, **pre-train** {w_{ij}^{(ℓ)}} assuming w_*^{(1)}, . . . , w_*^{(ℓ−1)} fixed,
  by **training a basic autoencoder on** {x_n^{(ℓ−1)}} **with** d̃ = d^{(ℓ)}
2 **train with backprop** on the **pre-trained** NNet to **fine-tune** all {w_{ij}^{(ℓ)}}

many successful **pre-training** techniques take **‘fancier’ autoencoders** with different **architectures** and **regularization schemes** (a stacked sketch follows)
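Again as an illustration rather than the lecture's code: a self-contained sketch of the full recipe, greedily pre-training each layer with a basic autoencoder (step 1) and then fine-tuning every weight with backprop on a supervised task (step 2). Layer sizes, learning rates, and the toy data/target are assumptions; biases are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(4)

def train_autoencoder(X, d_tilde, eta=0.05, epochs=100):
    """Pre-train one layer: fit a basic d-d̃-d tanh autoencoder on X and
    return the encoding weights plus the encoded data for the next layer."""
    N, d = X.shape
    W1 = rng.standard_normal((d, d_tilde)) * 0.1
    W2 = rng.standard_normal((d_tilde, d)) * 0.1
    for _ in range(epochs):
        for n in rng.permutation(N):
            x = X[n]
            h = np.tanh(x @ W1)
            g = np.tanh(h @ W2)
            d2 = 2 * (g - x) * (1 - g ** 2)         # backprop, as before
            d1 = (d2 @ W2.T) * (1 - h ** 2)
            W2 -= eta * np.outer(h, d2)
            W1 -= eta * np.outer(x, d1)
    return W1, np.tanh(X @ W1)

# toy data and target (assumed, for illustration only)
X = 0.8 * np.tanh(rng.standard_normal((100, 3)) @ rng.standard_normal((3, 10)))
y = np.tanh(X.sum(axis=1, keepdims=True))

# step 1: greedy layer-wise pre-training of a 10-5-3-1 network
weights, H = [], X
for d_l in [5, 3]:                                  # d^(1), d^(2)
    W, H = train_autoencoder(H, d_l)                # pre-train layer ℓ on x^(ℓ-1)
    weights.append(W)
weights.append(rng.standard_normal((3, 1)) * 0.1)   # output layer: random init

# step 2: fine-tune ALL weights with backprop on the supervised task
eta = 0.05
for _ in range(200):
    for n in rng.permutation(len(X)):
        acts = [X[n]]
        for W in weights:                           # forward pass, tanh everywhere
            acts.append(np.tanh(acts[-1] @ W))
        delta = 2 * (acts[-1] - y[n]) * (1 - acts[-1] ** 2)
        for l in range(len(weights) - 1, -1, -1):   # backward pass
            grad = np.outer(acts[l], delta)
            delta = (delta @ weights[l].T) * (1 - acts[l] ** 2)
            weights[l] -= eta * grad

out = X
for W in weights:
    out = np.tanh(out @ W)
print("fine-tuned squared error:", np.mean((out - y) ** 2))
```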

### Fun Time

Suppose training a d-d̃-d autoencoder with backprop takes approximately c · d · d̃ seconds. Then, what is the total number of seconds needed for pre-training a d-d^(1)-d^(2)-d^(3)-1 deep NNet?

1 c (d + d^(1) + d^(2) + d^(3) + 1)
2 c (d · d^(1) · d^(2) · d^(3) · 1)
3 c (d d^(1) + d^(1) d^(2) + d^(2) d^(3) + d^(3))
4 c (d d^(1) · d^(1) d^(2) · d^(2) d^(3) · d^(3))

### Reference Answer: 3

Each c · d^(ℓ−1) · d^(ℓ) represents the time for pre-training with one autoencoder to determine one layer of the weights (a numeric check follows).
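A quick numeric check of answer 3 with made-up dimensions (d = 784, d^(1) = 500, d^(2) = 300, d^(3) = 100, c = 1):

```python
c, d, d1, d2, d3 = 1, 784, 500, 300, 100
total = c * (d * d1 + d1 * d2 + d2 * d3 + d3 * 1)   # one autoencoder per layer
print(total)  # 572100 seconds under these toy numbers
```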
Deep Learning Denoising Autoencoder

### Regularization in Deep Learning

[Figure: the multi-layer tanh NNet again, with weights w_{ij}^{(1)}, w_{jk}^{(2)}, w_{kq}^{(3)}]

**watch out for overfitting, remember? :-)**

high **model complexity:** **regularization** needed

• structural decisions/constraints
• weight decay or weight elimination **regularizers** (see the sketch below)
• **early stopping**

next: another **regularization** technique
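As a reminder of how a weight-decay regularizer acts (a sketch under assumed notation, not the lecture's code): minimizing E_aug = E_in + (λ/N) ∑ w² just adds a shrink-towards-zero term to every gradient step. The λ, η, and N below are illustrative.

```python
import numpy as np

def decayed_update(W, grad, eta=0.1, lam=0.01, N=100):
    """One gradient step on E_aug = E_in + (lam/N) * sum(W**2):
    the usual step on E_in plus a shrinking of W towards zero."""
    return W - eta * (grad + 2 * lam / N * W)

W = np.ones((3, 2))
print(decayed_update(W, grad=np.zeros((3, 2))))  # pure decay: W shrinks slightly
```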