## Machine Learning Techniques (機器學習技巧)

### Lecture 12: Deep Learning

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

## Agenda

### Lecture 12: Deep Learning

- Optimization and Overfitting
- Auto-Encoder
- Principal Component Analysis
- Denoising Auto-Encoder
- Deep Neural Network

Deep Learning Optimization and Overfitting

## Error Function of Neural Network

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\left(y_n - \theta\left(\cdots\,\theta\left(\sum_{j} w_{jk}^{(2)} \cdot \theta\left(\sum_{i} w_{ij}^{(1)} x_i\right)\right)\right)\right)^2$$

- generally **non-convex** when there are multiple hidden layers
  - not easy to reach the **global minimum**
  - GD/SGD with **backprop** only gives a **local minimum**
- different initial $\mathbf{w}_0$ ⟹ different **local minimum**
  - somewhat 'sensitive' to initial weights
  - **large weights** ⟹ **saturate** (small gradient)
  - advice: try **some random** & **small** ones

neural network (NNet): **difficult to optimize, but practically works**
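The saturation bullet above can be checked numerically. This is a minimal sketch (names and values are my own, not from the slides): with a tanh transfer function $\theta$, the gradient through a neuron is proportional to $\theta'(s) = 1 - \tanh^2(s)$, which collapses to zero when the pre-activation $s = \mathbf{w}^T\mathbf{x}$ is large, i.e. when the initial weights are large.

```python
import numpy as np

def tanh_grad(s):
    """Derivative of the tanh transfer function: theta'(s) = 1 - tanh(s)^2."""
    return 1.0 - np.tanh(s) ** 2

x = np.ones(10)                  # a fixed input pattern
w_small = np.full(10, 0.1)       # small initial weights (as the slide advises)
w_large = np.full(10, 10.0)      # large initial weights

# pre-activation s = w^T x at one hidden neuron
g_small = tanh_grad(w_small @ x)  # theta'(1)   ~ 0.42: gradient flows
g_large = tanh_grad(w_large @ x)  # theta'(100) ~ 0:    saturated, learning stalls

print(g_small, g_large)
```

This is why "some random & small" initial weights are the standard advice: they keep every neuron in the responsive region of $\theta$ at the start of training.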


## VC Dimension of Neural Networks

roughly, with **θ-like transfer functions**:

$$d_{\text{VC}} = O(D \log D), \quad \text{where } D = \text{\# of weights}$$

- can **implement 'anything'** given **enough neurons** (large $D$): no need for **many layers**?
- can **overfit** with **too many neurons**

NNet: **watch out for overfitting!**


## Regularization for Neural Network

basic choice: our old friend, the **weight-decay (L2) regularizer** $\Omega(\mathbf{w}) = \sum \left(w_{ij}^{(\ell)}\right)^2$

- 'shrinks' weights proportionally: large weight → large shrink; small weight → small shrink
- want $w_{ij}^{(\ell)} = 0$ (**sparse**) to effectively **decrease $d_{\text{VC}}$**
- L1 regularizer $\sum \left|w_{ij}^{(\ell)}\right|$ encourages sparsity, but is **not differentiable**
- **weight-elimination** ('scaled' L2) regularizer: large weight → medium shrink; small weight → medium shrink

$$\text{weight-elimination regularizer:} \quad \sum \frac{\left(w_{ij}^{(\ell)}\right)^2}{\beta^2 + \left(w_{ij}^{(\ell)}\right)^2}$$
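A small sketch (my own code, not from the lecture) comparing the shrinking behavior of the two regularizers. Differentiating one weight-elimination term $w^2/(\beta^2 + w^2)$ gives $2w\beta^2/(\beta^2 + w^2)^2$: unlike plain L2, whose gradient $2w$ grows with the weight, this gradient nearly vanishes for large weights while still pushing small weights toward zero, which is what encourages sparsity.

```python
import numpy as np

def l2_grad(w):
    """Gradient of the weight-decay (L2) term w^2: grows linearly with w."""
    return 2 * w

def weight_elim_grad(w, beta=1.0):
    """Gradient of w^2 / (beta^2 + w^2): 2*w*beta^2 / (beta^2 + w^2)^2."""
    return 2 * w * beta ** 2 / (beta ** 2 + w ** 2) ** 2

w = np.array([0.1, 10.0])   # one small weight, one large weight
g = weight_elim_grad(w)
# small weight:  ~0.196 (almost the same push as L2's 0.2 -> driven toward 0)
# large weight:  ~0.002 (versus L2's 20 -> barely shrunk at all)
print(g, l2_grad(w))
```

So the large weight survives while the small one is eliminated, effectively selecting a sparse subset of connections.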


## Yet Another Regularization: Early Stopping

**GD/SGD (backprop)** visits **more weight combinations** as $t$ increases

- smaller $t$ effectively **decreases $d_{\text{VC}}$**
- better to **'stop in the middle'**: **early stopping**

(Figure: $E_{\text{in}}$ and $E_{\text{out}}$ versus epochs; $E_{\text{in}}$ keeps decreasing while $E_{\text{out}}$ eventually turns upward, and early stopping picks the epoch near the minimum of $E_{\text{out}}$.)

when to stop? **validation!**
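A toy sketch of the early-stopping loop driven by validation error (the data, learning rate, and patience rule are my own illustrative choices, not from the slides): keep the weights with the best validation error seen so far, and stop once validation error has not improved for a while.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy regression data: y = 2x + noise, split into train / validation
X = rng.uniform(-1, 1, size=(200, 1))
y = 2 * X[:, 0] + 0.3 * rng.standard_normal(200)
Xtr, ytr, Xval, yval = X[:150], y[:150], X[150:], y[150:]

w = np.zeros(1)
best_w, best_val = w.copy(), np.inf
eta, patience, bad = 0.05, 20, 0

for t in range(10_000):
    # one GD step on the training squared error
    grad = 2 * Xtr.T @ (Xtr @ w - ytr) / len(ytr)
    w -= eta * grad
    # track validation error; remember the best weights seen so far
    val = np.mean((Xval @ w - yval) ** 2)
    if val < best_val - 1e-6:
        best_val, best_w, bad = val, w.copy(), 0
    else:
        bad += 1
        if bad >= patience:   # validation error stopped improving: stop early
            break

print(t, best_val, best_w)
```

The returned hypothesis is `best_w` (the validation minimizer), not the final `w`, mirroring the "stop in the middle" picture above.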


Deep Learning Auto Encoder

## Learning the Identity Function

the **identity function**: $\mathbf{f}(\mathbf{x}) = \mathbf{x}$

- a **vector function** composed of $f_i(\mathbf{x}) = x_i$
- learning each $f_i$: **regression** with data $(\mathbf{x}_1, y_1 = x_{1,i}),\ (\mathbf{x}_2, y_2 = x_{2,i}),\ \ldots,\ (\mathbf{x}_N, y_N = x_{N,i})$
- learning $\mathbf{f}$: learning all $f_i$ **jointly** with data $(\mathbf{x}_1, \mathbf{y}_1 = \mathbf{x}_1),\ (\mathbf{x}_2, \mathbf{y}_2 = \mathbf{x}_2),\ \ldots,\ (\mathbf{x}_N, \mathbf{y}_N = \mathbf{x}_N)$

but wait, why **learn** something **known** & **easily implemented**? :-)


## Why Learning Identity Function

if $g(\mathbf{x}) \approx f(\mathbf{x})$ using some **hidden structures** of the **observed data** $\mathbf{x}_n$:

- for unsupervised learning:
  - density estimation: larger density (structure match) when $g(\mathbf{x}) \approx \mathbf{x}$ better
  - outlier detection: those $\mathbf{x}$ where $g(\mathbf{x}) \not\approx \mathbf{x}$
  - that is, learning a **'typical' representation** of the data
- for supervised learning:
  - **hidden structure**: the essence of $\mathbf{x}$, which can be used as the transform $\Phi(\mathbf{x})$
  - that is, learning an **'informative' representation** of the data

**auto-encoder**: an NNet for learning the identity function


## Simple Auto-Encoder

a **simple** auto-encoder: a $d$–$\tilde{d}$–$d$ NNet

- $d$ outputs: backprop easily applies
- $\tilde{d} < d$: **compressed** representation; $\tilde{d} \ge d$: **[over]-complete** representation
- data: $(\mathbf{x}_1, \mathbf{y}_1 = \mathbf{x}_1),\ (\mathbf{x}_2, \mathbf{y}_2 = \mathbf{x}_2),\ \ldots,\ (\mathbf{x}_N, \mathbf{y}_N = \mathbf{x}_N)$; often categorized as an **unsupervised learning technique**
- if $\mathbf{x}$ contains binary bits:
  - a naïve solution exists (but is **unwanted**) when **[over]-complete**
  - **regularized** weights are needed in general
- sometimes constrain $w_{ij}^{(1)} = w_{ji}^{(2)}$ as 'regularization'; this makes calculating the gradient more **sophisticated**

**auto-encoder** for **representation learning**: the outputs of the **hidden neurons** serve as $\Phi(\mathbf{x})$
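A runnable sketch of such a $d$–$\tilde{d}$–$d$ auto-encoder with plain backprop (the architecture sizes, data, and learning rate are my own illustrative choices): targets equal inputs, a tanh hidden layer gives the representation $\Phi(\mathbf{x})$, and a linear output layer reconstructs $\mathbf{x}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# data in R^3 that really lives on a 2-d subspace, so d_tilde = 2 can capture it
Z = rng.standard_normal((200, 2))
X = Z @ rng.standard_normal((2, 3)) * 0.5

d, d_tilde = 3, 2
W1 = 0.1 * rng.standard_normal((d, d_tilde))   # encoder weights w^(1), small & random
W2 = 0.1 * rng.standard_normal((d_tilde, d))   # decoder weights w^(2)

def forward(X):
    A = np.tanh(X @ W1)     # hidden representation Phi(x)
    return A, A @ W2        # linear output layer reconstructs x

def mse(X, Xhat):
    return np.mean((X - Xhat) ** 2)

_, Xhat = forward(X)
err0 = mse(X, Xhat)

eta = 0.5
for _ in range(2000):
    A, Xhat = forward(X)
    G = 2 * (Xhat - X) / X.size         # dE/dXhat for the mean squared error
    dW2 = A.T @ G                       # backprop into the decoder
    dA = G @ W2.T
    dW1 = X.T @ (dA * (1 - A ** 2))     # tanh'(s) = 1 - tanh(s)^2
    W1 -= eta * dW1
    W2 -= eta * dW2

_, Xhat = forward(X)
err1 = mse(X, Xhat)
print(err0, err1)   # reconstruction error drops as the identity is learned
```

After training, `np.tanh(X @ W1)` is the compressed 2-d representation $\Phi(\mathbf{x})$ that the slide refers to.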


Deep Learning Principal Component Analysis

## Linear Auto-Encoder Hypothesis

$$h_k(\mathbf{x}) = \theta\left(\sum_{j} w_{jk}^{(2)} \cdot \theta\left(\sum_{i} w_{ij}^{(1)} x_i\right)\right)$$

consider three special conditions:

- constrain $w_{ij}^{(1)} = w_{ji}^{(2)} = w_{ij}$ as 'regularization', and let $\mathrm{W} = [w_{ij}]$ of size $d \times \tilde{d}$
- $\theta$ does nothing, i.e. it is the identity (like linear regression)
- $\tilde{d} < d$

linear auto-encoder hypothesis:

$$\mathbf{h}(\mathbf{x}) = \mathbf{x}^T \mathrm{W} \mathrm{W}^T$$


## Linear Auto-Encoder Error Function

$$\min_{\mathrm{W}} \; E_{\text{in}}(\mathrm{W}) = \frac{1}{N}\left\| \mathrm{X} - \mathrm{X}\mathrm{W}\mathrm{W}^T \right\|_F^2$$

let $\mathrm{W}\mathrm{W}^T = \mathrm{V}\Lambda\mathrm{V}^T$ (eigenvalue decomposition), such that $\mathrm{V}^T\mathrm{V} = \mathrm{V}\mathrm{V}^T = \mathrm{I}$ and $\Lambda$ is a diagonal matrix of rank at most $\tilde{d}$. Then

$$
\begin{aligned}
\left\| \mathrm{X} - \mathrm{X}\mathrm{V}\Lambda\mathrm{V}^T \right\|_F^2
&= \operatorname{trace}\!\Big( \big(\mathrm{X} - \mathrm{X}\mathrm{V}\Lambda\mathrm{V}^T\big)^T \big(\mathrm{X} - \mathrm{X}\mathrm{V}\Lambda\mathrm{V}^T\big) \Big) \\
&= \operatorname{trace}\!\Big( \mathrm{X}^T\mathrm{X} - \mathrm{X}^T\mathrm{X}\mathrm{V}\Lambda\mathrm{V}^T - \mathrm{V}\Lambda\mathrm{V}^T\mathrm{X}^T\mathrm{X} + \mathrm{V}\Lambda\mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V}\Lambda\mathrm{V}^T \Big) \\
&= \operatorname{trace}\!\Big( \mathrm{X}^T\mathrm{X} - \Lambda\mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V} - \Lambda\mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V} + \Lambda\mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V}\Lambda\mathrm{V}^T\mathrm{V} \Big) \\
&= \operatorname{trace}\!\Big( \mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V} - \Lambda\mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V} - \Lambda\mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V} + \Lambda^2\mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V} \Big) \\
&= \operatorname{trace}\!\Big( (\mathrm{I} - \Lambda)^2\, \mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V} \Big)
\end{aligned}
$$

where the third step uses the cyclic property of the trace, and the fourth uses $\mathrm{V}^T\mathrm{V} = \mathrm{V}\mathrm{V}^T = \mathrm{I}$ together with the cyclic property again.

## Linear Auto-Encoder Algorithm

$$\min_{\mathrm{V}, \Lambda} \; \operatorname{trace}\!\Big( (\mathrm{I} - \Lambda)^2\, \mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V} \Big)$$

- the optimal rank-$\tilde{d}$ $\Lambda$ contains $\tilde{d}$ '1's and $d - \tilde{d}$ '0's
- let $\mathrm{X}^T\mathrm{X} = \mathrm{U}\Sigma\mathrm{U}^T$ (eigenvalue decomposition); $\mathrm{V} = \mathrm{U}$ with $\lambda_j = 1$ paired with the $\tilde{d}$ largest eigenvalues $\sigma_j$ is optimal, so that only the $d - \tilde{d}$ smallest eigenvalues remain in the residual
- hence the **optimal column vectors** $\mathbf{w}_j = \mathbf{v}_j$ = **top eigenvectors of** $\mathrm{X}^T\mathrm{X}$

optimal linear auto-encoding ≡ **principal component analysis** (PCA), with the $\mathbf{w}_j$ being the **principal components** of the unshifted data
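The algorithm above is a few lines of NumPy (a sketch on synthetic, unshifted data; variable names are my own). `np.linalg.eigh` returns eigenvalues in ascending order, so the top eigenvectors of $\mathrm{X}^T\mathrm{X}$ are the last columns; the residual $\|\mathrm{X} - \mathrm{X}\mathrm{W}\mathrm{W}^T\|_F^2$ should equal the sum of the $d - \tilde{d}$ discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, d_tilde = 100, 5, 2
# scale the coordinates so two directions clearly dominate
X = rng.standard_normal((N, d)) @ np.diag([3.0, 2.0, 0.3, 0.2, 0.1])

# eigendecomposition of X^T X (symmetric, so eigh applies; eigenvalues ascending)
sigma, U = np.linalg.eigh(X.T @ X)
W = U[:, ::-1][:, :d_tilde]      # top-d_tilde eigenvectors as columns w_j

Xhat = X @ W @ W.T               # linear auto-encoder reconstruction
err = np.sum((X - Xhat) ** 2)
print(err, np.sum(sigma[:d - d_tilde]))   # residual = sum of discarded eigenvalues
```

Note the data are deliberately left unshifted (no mean subtraction), matching the slide's remark that these are principal components of the *unshifted* data.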

Deep Learning Denoising Auto Encoder

## Simple Auto-Encoder Revisited

a **simple** auto-encoder: a $d$–$\tilde{d}$–$d$ NNet

- want: a **hidden structure** that captures the **essence** of $\mathbf{x}$
- a naïve solution exists (but is **unwanted**) when **[over]-complete**
- **regularized** weights are needed in general

regularization towards a more **robust hidden structure**?


## Idea of Denoising Auto-Encoder

a **robust hidden structure** should allow $g(\tilde{\mathbf{x}}) \approx \mathbf{x}$ even when $\tilde{\mathbf{x}}$ is slightly different from $\mathbf{x}$

- **denoising auto-encoder**: run the auto-encoder with data $(\tilde{\mathbf{x}}_1, \mathbf{y}_1 = \mathbf{x}_1),\ (\tilde{\mathbf{x}}_2, \mathbf{y}_2 = \mathbf{x}_2),\ \ldots,\ (\tilde{\mathbf{x}}_N, \mathbf{y}_N = \mathbf{x}_N)$, where $\tilde{\mathbf{x}}_n = \mathbf{x}_n + $ **artificial noise**
- PCA auto-encoder + **Gaussian noise**:

$$\min_{\mathrm{W}} \; E_{\text{in}}(\mathrm{W}) = \frac{1}{N} \left\| \mathrm{X} - (\mathrm{X} + \textbf{noise})\, \mathrm{W}\mathrm{W}^T \right\|_F^2$$

which is simply an L2-regularized PCA

**artificial noise** as **regularization**! Practically also useful for other types of NNet.
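A small sketch of the robustness property $g(\tilde{\mathbf{x}}) \approx \mathbf{x}$ (synthetic data and noise level are my own choices): when the data have a genuine low-dimensional hidden structure, reconstructing a noise-corrupted input through the PCA auto-encoder lands closer to the clean $\mathbf{x}$ than the corrupted input itself, because noise outside the hidden subspace is projected away.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, d_tilde = 500, 5, 2

# data with genuine 2-d hidden structure inside R^5
X = rng.standard_normal((N, d_tilde)) @ rng.standard_normal((d_tilde, d))

# linear (PCA) auto-encoder: top-d_tilde eigenvectors of X^T X
sigma, U = np.linalg.eigh(X.T @ X)
W = U[:, ::-1][:, :d_tilde]

# corrupt the inputs with artificial Gaussian noise; the targets stay clean
X_tilde = X + 0.1 * rng.standard_normal((N, d))

err_raw = np.mean((X - X_tilde) ** 2)            # doing nothing: full noise level
err_ae = np.mean((X - X_tilde @ W @ W.T) ** 2)   # g(x_tilde) via the hidden structure
print(err_raw, err_ae)   # the auto-encoder output is closer to the clean x
```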


Deep Learning Deep Neural Network

Final remark on hidden layers (figure from Learning From Data, Lecture 10): each hidden neuron applies the transfer function $\theta(s)$ to a weighted combination of its inputs, so the hidden layers together form a **learned nonlinear transform** of $\mathbf{x}$. What is the interpretation of that transform?


## Shallow versus Deep Structures

shallow: few hidden layers; deep: many hidden layers

**Shallow**

- efficient to train
- powerful if given enough neurons

**Deep**

- challenging to train
- needs more structural (model) decisions
- 'meaningful' features?

deep structures (deep learning) have **re-gained attention recently**


## Key Techniques behind Deep Learning

- (usually) **unsupervised pre-training** between hidden layers, such as with simple/denoising auto-encoders; this views each hidden layer as **'condensing'** a low-level representation into a high-level one
- **fine-tune with backprop** after initializing with those 'good' weights, because **direct backprop may get stuck more easily**
- speed-up: better optimization algorithms, and faster **GPU**s
- the generalization issue is less serious with **big (enough) data**

currently very useful for **vision and speech recognition**
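The greedy layer-wise pre-training idea above can be sketched in a few lines. This is a minimal illustration under a simplifying assumption of my own: each layer is pre-trained as a *linear* (PCA) auto-encoder on the previous layer's outputs, standing in for the simple/denoising auto-encoders the slide mentions; the collected weights would then initialize a deep network for backprop fine-tuning.

```python
import numpy as np

def pca_autoencoder(H, d_tilde):
    """Pre-train one layer as a linear auto-encoder:
    return the top-d_tilde eigenvectors of H^T H as that layer's weights."""
    _, U = np.linalg.eigh(H.T @ H)   # eigenvalues ascending
    return U[:, ::-1][:, :d_tilde]

rng = np.random.default_rng(4)
X = rng.standard_normal((300, 8))    # raw low-level representation

# greedy layer-wise pre-training: each layer condenses the previous representation
layer_sizes = [6, 4, 2]
weights, H = [], X
for d_tilde in layer_sizes:
    W = pca_autoencoder(H, d_tilde)  # unsupervised training of this layer alone
    weights.append(W)
    H = H @ W                        # hidden outputs feed the next layer

# 'weights' would now initialize a deep NNet, to be fine-tuned with backprop
print([W.shape for W in weights], H.shape)
```

Each pre-trained layer only ever sees the representation produced by the layers below it, which is exactly the "condensing low-level into high-level" picture of the first bullet.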
