### Machine Learning Techniques (機器學習技法)

### Lecture 13: Deep Learning

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)

### Roadmap

1. Embedding Numerous Features: Kernel Models
2. Combining Predictive Features: Aggregation Models
3. Distilling Implicit Features: Extraction Models

### Lecture 12: Neural Network

automatic **pattern feature extraction** from **layers of neurons** with **backprop** for GD/SGD

### Lecture 13: Deep Learning

Deep Neural Network · Autoencoder · Denoising Autoencoder · Principal Component Analysis


### Physical Interpretation of NNet Revisited

[Figure: a fully-connected NNet with inputs x_0 = 1, x_1, x_2, ..., x_d, tanh hidden neurons (plus +1 bias units), and weight layers w_{ij}^{(1)}, w_{jk}^{(2)}, w_{kq}^{(3)}; one hidden neuron's score s_3^{(2)} and output x_3^{(2)} are highlighted.]

- each layer: **pattern feature extracted** from data, **remember? :-)**
- how many neurons? how many layers? —more generally, **what structure?**
  - subjectively, **your design!**
  - objectively, **validation, maybe?**

structural decisions: **key issue** for applying NNet

### Shallow versus Deep Neural Networks

shallow: few (hidden) layers; deep: many layers

### Shallow NNet

- more **efficient** to train (✓)
- **simpler** structural decisions (✓)
- theoretically **powerful enough** (✓)

### Deep NNet

- **challenging** to train (×)
- **sophisticated** structural decisions (×)
- **'arbitrarily' powerful** (✓)
- more **'meaningful'?** (see next slide)

deep NNet (deep learning) **gaining attention** in recent years

### Meaningfulness of Deep Learning

[Figure: recognizing handwritten digits from raw pixels; outputs z_1 ("is it a '1'?") and z_5 ("is it a '5'?") are built from simple stroke features φ_1, ..., φ_6 through positive and negative weights.]

- **'less burden'** for each layer: **simple** to **complex** features
- natural for a **difficult** learning task with **raw features, like vision**

deep NNet: currently popular in **vision/speech/...**


### Challenges and Key Techniques for Deep Learning

- difficult **structural decisions:**
  - subjective with **domain knowledge:** like **convolutional NNet** for images
- high **model complexity:**
  - no big worries if **big enough data**
  - **regularization** towards noise-tolerance, like:
    - **dropout** (tolerant when network corrupted; see the short sketch after this list)
    - **denoising** (tolerant when input corrupted)
- hard **optimization problem:**
  - **careful initialization** to avoid a bad local minimum: called **pre-training**
- huge **computational complexity** (worsened with **big data**):
  - novel hardware/architecture: like **mini-batch with GPU**

IMHO, careful **regularization** and **initialization** are the key techniques
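As a side note on one of the listed techniques, here is a tiny sketch of the dropout idea; the function name and the inverted-rescaling convention are illustrative assumptions, not something the lecture specifies.

```python
import numpy as np

def dropout_mask(H, p=0.5, rng=None):
    """Randomly zero out a fraction p of the hidden activations during
    training (the 'tolerant when network corrupted' idea above); the kept
    units are rescaled by 1/(1-p) so their expected value is unchanged."""
    rng = np.random.default_rng(0) if rng is None else rng
    keep = rng.random(H.shape) >= p          # Boolean mask of surviving units
    return H * keep / (1.0 - p)
```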

### A Two-Step Deep Learning Framework

### Simple Deep Learning

1. for ℓ = 1, ..., L: **pre-train** {w_{ij}^{(ℓ)}} assuming w_*^{(1)}, ..., w_*^{(ℓ−1)} fixed
   [Figures (a)-(d): the network being pre-trained one layer at a time.]
2. **train with backprop** on the **pre-trained** NNet to **fine-tune** all {w_{ij}^{(ℓ)}}

will focus on the **simplest pre-training** technique along with **regularization**


### Fun Time

For a deep NNet for written character recognition from raw pixels, which type of features are more likely extracted after the first hidden layer?

1. pixels
2. strokes
3. parts
4. digits

### Reference Answer: 2

Simple strokes are likely the ‘next-level’ features that can be extracted from raw pixels.


### Information-Preserving Encoding

- **weights: feature transform, i.e. encoding**
- **good weights:** information-preserving encoding —next layer: **same info.** with **different representation**
- **information-preserving:** **decode accurately** after encoding

[Figure: the digit-recognition network with features φ_1, ..., φ_6 again, plus panels (a)-(d) of layer-by-layer pre-training.]

idea: **pre-train weights** towards **information-preserving** encoding

### Information-Preserving Neural Network

[Figure: a d–d̃–d autoencoder; inputs x_0 = 1, x_1, ..., x_d pass through tanh hidden units (plus a +1 bias) to outputs ≈ x_1, ..., ≈ x_d, with encoding weights w_{ij}^{(1)} and decoding weights w_{ji}^{(2)}.]

- **autoencoder:** a d–d̃–d NNet with goal g_i(x) ≈ x_i —learning to **approximate** the **identity function**
- w_{ij}^{(1)}: encoding weights; w_{ji}^{(2)}: decoding weights

why **approximate** the **identity function?**


### Usefulness of Approximating Identity Function

if **g(x) ≈ x** using some **hidden** structures on the **observed data x_n**:

- for supervised learning:
  - the hidden structure (essence) of **x** can be used as a reasonable transform Φ(x)
  —learning an **'informative' representation** of data
- for unsupervised learning:
  - density estimation: larger (structure match) when **g(x) ≈ x**
  - outlier detection: those **x** where **g(x) ≉ x**
  —learning a **'typical' representation** of data

**autoencoder:** **representation-learning through** approximating the identity function


### Basic Autoencoder

basic **autoencoder:** a d–d̃–d NNet with error function ∑_{i=1}^{d} (g_i(x) − x_i)^2

- backprop easily applies; **shallow** and easy to train
- usually d̃ < d: **compressed** representation
- data: {(x_1, y_1 = x_1), (x_2, y_2 = x_2), ..., (x_N, y_N = x_N)}
  —often categorized as an **unsupervised learning technique**
- sometimes constrain w_{ij}^{(1)} = w_{ji}^{(2)} as **regularization**
  —more **sophisticated** in calculating the gradient

basic **autoencoder** in basic deep learning: {w_{ij}^{(1)}} taken as the shallowly pre-trained weights
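As a concrete illustration (not from the lecture), here is a minimal numpy sketch of such a basic d–d̃–d autoencoder trained with backprop on the squared reconstruction error; the hyper-parameters and the linear output layer are illustrative assumptions.

```python
import numpy as np

def train_basic_autoencoder(X, d_tilde, Y=None, eta=0.1, epochs=200, seed=0):
    """Basic d - d_tilde - d autoencoder: tanh hidden layer, linear output,
    squared reconstruction error, plain batch gradient descent (backprop).
    Targets Y default to the inputs themselves (approximating the identity).
    Returns (W1, b1, W2, b2); W1 holds the encoding weights w_ij^(1)."""
    Y = X if Y is None else Y
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, d_tilde))   # encoding weights
    b1 = np.zeros(d_tilde)
    W2 = rng.normal(scale=0.1, size=(d_tilde, d))   # decoding weights
    b2 = np.zeros(d)
    for _ in range(epochs):
        # forward: encode with tanh, decode linearly
        H = np.tanh(X @ W1 + b1)                    # hidden representation
        G = H @ W2 + b2                             # reconstruction g(x)
        # backprop of (1/N) sum_n ||g(x_n) - y_n||^2
        dG = 2.0 * (G - Y) / N
        dW2, db2 = H.T @ dG, dG.sum(axis=0)
        dH = (dG @ W2.T) * (1.0 - H ** 2)           # tanh'(s) = 1 - tanh(s)^2
        dW1, db1 = X.T @ dH, dH.sum(axis=0)
        # gradient-descent update
        W1 -= eta * dW1; b1 -= eta * db1
        W2 -= eta * dW2; b2 -= eta * db2
    return W1, b1, W2, b2
```

The returned W1 and b1 would then give the pre-trained first-layer transform tanh(W1^T x + b1) when the autoencoder is used inside the two-step framework.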


### Pre-Training with Autoencoders

### Deep Learning with Autoencoders

1. for ℓ = 1, ..., L: **pre-train** {w_{ij}^{(ℓ)}} assuming w_*^{(1)}, ..., w_*^{(ℓ−1)} fixed,
   by **training a basic autoencoder on** {x_n^{(ℓ−1)}} **with** d̃ = d^{(ℓ)}
   [Figures (a)-(d): the layers being pre-trained in turn.]
2. **train with backprop** on the **pre-trained** NNet to **fine-tune** all {w_{ij}^{(ℓ)}}

many successful **pre-training** techniques take **'fancier' autoencoders** with different **architectures** and **regularization schemes** (a sketch of step 1 follows below)

### Fun Time

Suppose training a d–d̃–d autoencoder with backprop takes approximately c · d · d̃ seconds. Then, what is the total number of seconds needed for pre-training a d–d^{(1)}–d^{(2)}–d^{(3)}–1 deep NNet?

1. c (d + d^{(1)} + d^{(2)} + d^{(3)} + 1)
2. c (d · d^{(1)} · d^{(2)} · d^{(3)} · 1)
3. c (d d^{(1)} + d^{(1)} d^{(2)} + d^{(2)} d^{(3)} + d^{(3)})
4. c (d d^{(1)} · d^{(1)} d^{(2)} · d^{(2)} d^{(3)} · d^{(3)})

### Reference Answer: 3

Each c · d^{(ℓ−1)} · d^{(ℓ)} represents the time for pre-training with one autoencoder to determine one layer of the weights.

### Regularization in Deep Learning

[Figure: the same multi-layer NNet with weights w_{ij}^{(1)}, w_{jk}^{(2)}, w_{kq}^{(3)} as before.]

**watch out for overfitting, remember? :-)**

high **model complexity:** **regularization** needed

- structural decisions/constraints
- weight decay or weight elimination **regularizers**
- **early stopping**

next: another **regularization** technique

### Reasons of Overfitting Revisited

[Figure: overfit level as a function of the number of data points N and the stochastic noise level σ², from the earlier overfitting lecture.]

reasons of serious overfitting:

- data size N ↓: overfit ↑
- **noise** ↑: overfit ↑
- excessive power ↑: overfit ↑

how to deal with **noise?**


### Dealing with Noise

- direct possibility: **data cleaning/pruning, remember? :-)**
- a **wild** possibility: **adding noise** to data?
- idea: a **robust** autoencoder should not only let **g(x) ≈ x** but also allow **g(x̃) ≈ x** even when **x̃** is slightly different from **x**
- **denoising** autoencoder: run the basic autoencoder with data
  {(x̃_1, y_1 = x_1), (x̃_2, y_2 = x_2), ..., (x̃_N, y_N = x_N)}, where x̃_n = x_n + **artificial noise**
  —often used **instead of the basic autoencoder** in deep learning
- useful for data/image processing: **g(x̃)** is a **denoised** version of **x̃**
- effect: 'constrain/regularize' **g** towards **noise-tolerant** denoising

**artificial noise/hint** as **regularization!**
—practically also useful for other NNet/models (see the sketch below)
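A minimal sketch of the denoising idea, reusing the hypothetical train_basic_autoencoder above; Gaussian artificial noise is one common choice assumed here, not something the lecture prescribes.

```python
def train_denoising_autoencoder(X, d_tilde, noise_std=0.1, seed=0, **ae_kwargs):
    """Denoising autoencoder: corrupt the inputs with artificial noise,
    but keep the clean x_n as the regression targets y_n."""
    rng = np.random.default_rng(seed)
    X_tilde = X + rng.normal(scale=noise_std, size=X.shape)  # x~_n = x_n + noise
    # same machinery as the basic autoencoder, with inputs and targets decoupled
    return train_basic_autoencoder(X_tilde, d_tilde, Y=X, **ae_kwargs)
```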


### Fun Time

Which of the following cannot be viewed as a regularization technique?

1. hint the model with artificially-generated noisy data
2. stop gradient descent early
3. add a weight elimination regularizer
4. all of the above are regularization techniques

### Reference Answer: 4

1 is our new friend for regularization, while 2 and 3 are old friends.


### Linear Autoencoder Hypothesis

nonlinear autoencoder: sophisticated; linear autoencoder: simple
**linear: more efficient? less overfitting? linear first, remember? :-)**

linear hypothesis for the k-th component:

h_k(x) = ∑_{j=0}^{d̃} w_{kj} ( ∑_{i=1}^{d} w_{ij} x_i )

consider three special conditions:

- **exclude x_0**: range of i **same** as range of k
- constrain w_{ij}^{(1)} = w_{ji}^{(2)} = w_{ij}: **regularization**
  —denote W = [w_{ij}] of size d × d̃
- assume d̃ < d: ensure a **non-trivial** solution

linear autoencoder hypothesis: **h(x) = WW^T x**


### Linear Autoencoder Error Function

E_in(h) = E_in(W) = (1/N) ∑_{n=1}^{N} ‖x_n − WW^T x_n‖^2, with d × d̃ matrix W

—analytic solution to minimize E_in? but a **4-th order polynomial of w_{ij}**

let's familiarize the problem with linear algebra (be brave! :-))

- eigen-decompose WW^T = VΓV^T
  - d × d matrix V **orthogonal:** VV^T = V^T V = I_d
  - d × d matrix Γ **diagonal** with ≤ d̃ non-zero components
- WW^T x_n = VΓV^T x_n
  - V^T(x_n): change of orthonormal basis (rotate or reflect)
  - Γ(···): set ≥ d − d̃ components to 0, and **scale** the others
  - V(···): reconstruct by coefficients and basis (back-rotate)
- x_n = VIV^T x_n: **rotate** and **back-rotate** cancel out

next: minimize E_in **by optimizing Γ and V**
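A small numerical check of the decomposition just used (an illustrative sketch, not part of the lecture): for any d × d̃ matrix W, the matrix WW^T is symmetric positive semi-definite, so it eigen-decomposes as VΓV^T with orthogonal V and at most d̃ non-zero diagonal entries in Γ.

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_tilde = 5, 2
W = rng.normal(size=(d, d_tilde))                 # any d x d_tilde matrix

gamma, V = np.linalg.eigh(W @ W.T)                # eigenvalues (ascending) and orthogonal V
assert np.allclose(V @ np.diag(gamma) @ V.T, W @ W.T)   # WW^T = V Gamma V^T
assert np.allclose(V.T @ V, np.eye(d))            # VV^T = V^T V = I_d
print((gamma > 1e-12).sum())                      # rank(WW^T) <= d_tilde; 2 for this draw
```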


### The Optimal Γ

min_V min_Γ (1/N) ∑_{n=1}^{N} ‖ VIV^T x_n − VΓV^T x_n ‖^2
(the first term is x_n, the second is WW^T x_n)

- **back-rotate** does not affect length: the outer V can be dropped
- min_Γ ∑ ‖(I − Γ)(some vector)‖^2: **want many 0's** within (I − Γ)
- optimal diagonal Γ with rank ≤ d̃: d̃ **diagonal components 1**, other components 0
  ⟹ without loss of generality, Γ = [ I_d̃  0 ; 0  0 ]

next: min_V ∑_{n=1}^{N} ‖ [ 0  0 ; 0  I_{d−d̃} ] V^T x_n ‖^2, where the matrix is I − (optimal Γ)

### The Optimal V

min_V ∑_{n=1}^{N} ‖ [ 0  0 ; 0  I_{d−d̃} ] V^T x_n ‖^2  ≡  max_V ∑_{n=1}^{N} ‖ [ I_d̃  0 ; 0  0 ] V^T x_n ‖^2

- d̃ = 1: only the first row v^T of V^T matters:
  max_v ∑_{n=1}^{N} v^T x_n x_n^T v  subject to  v^T v = 1
- optimal v satisfies ∑_{n=1}^{N} x_n x_n^T v = λv
  —**using Lagrange multiplier** λ, remember? :-)
- optimal v: 'topmost' eigenvector of X^T X
- general d̃: {v_j}_{j=1}^{d̃} = 'topmost' eigenvectors of X^T X
  —optimal {w_j} = {v_j with γ_j = 1} = **top eigenvectors**

linear autoencoder: projecting onto orthogonal patterns w_j that 'match' {x_n} most


### Principal Component Analysis

### Linear Autoencoder or PCA

1. let x̄ = (1/N) ∑_{n=1}^{N} x_n, and let x_n ← x_n − x̄
2. calculate the d̃ top eigenvectors w_1, w_2, ..., w_d̃ of X^T X
3. return feature transform Φ(x) = W(x − x̄)

- linear autoencoder: maximize ∑ (magnitude after projection)^2
- principal component analysis (PCA) from statistics: maximize ∑ (variance after projection)
- both are useful for **linear dimension reduction**, though PCA is more popular

**linear dimension reduction:** useful for **data processing**
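A compact numpy sketch of the procedure above (illustrative only; the function and variable names are assumptions):

```python
import numpy as np

def pca_transform(X, d_tilde):
    """PCA / linear autoencoder: return (W, x_bar) where the columns of W
    are the d_tilde top eigenvectors of X^T X computed on mean-shifted data,
    so that Phi(x) = W^T (x - x_bar) is the linear dimension reduction."""
    x_bar = X.mean(axis=0)                          # step 1: mean x_bar
    Xc = X - x_bar                                  # x_n <- x_n - x_bar
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)    # symmetric eigen-decomposition
    top = np.argsort(eigvals)[::-1][:d_tilde]       # indices of the d_tilde largest
    W = eigvecs[:, top]                             # step 2: top eigenvectors as columns
    return W, x_bar

# step 3: the feature transform, e.g. Z = (X - x_bar) @ W
# after W, x_bar = pca_transform(X, d_tilde=2)
```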


### Fun Time

When solving the optimization problem max_v ∑_{n=1}^{N} v^T x_n x_n^T v subject to v^T v = 1, we know that the optimal v is the 'topmost' eigenvector that corresponds to the 'topmost' eigenvalue λ of X^T X. Then, what is the optimal objective value of the optimization problem?

1. λ^1
2. λ^2
3. λ^3
4. λ^4

### Reference Answer: 1

The objective value of the optimization problem is simply v^T X^T X v, which is λ v^T v, and you know what v^T v must be.