Machine Learning Techniques (機器學習技巧)
Lecture 12: Deep Learning
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)
Agenda
Lecture 12: Deep Learning
• Optimization and Overfitting
• Auto-Encoder
• Principal Component Analysis
• Denoising Auto-Encoder
• Deep Neural Network
Optimization and Overfitting
Error Function of Neural Network
$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\left(y_n - \theta\Big(\cdots\,\theta\Big(\sum_j w_{jk}^{(2)}\cdot\theta\Big(\sum_i w_{ij}^{(1)}\,x_i\Big)\Big)\Big)\right)^2$$
• generally non-convex when multiple hidden layers: not easy to reach the global minimum
• GD/SGD with backprop only gives a local minimum
• different initial w_0 =⇒ different local minimum
• somewhat 'sensitive' to initial weights
  • large weights =⇒ saturate (small gradient)
  • advice: try some random & small ones

neural network (NNet): difficult to optimize, but practically works
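As a concrete reading of this error function, here is a minimal numpy sketch, assuming a single hidden layer with tanh as the transfer θ; the layer sizes, toy data, and initialization scale are illustrative, not from the lecture:

```python
import numpy as np

def nnet_forward(X, W1, W2):
    """Forward pass of a 1-hidden-layer NNet with tanh transfer theta."""
    H = np.tanh(X @ W1)        # hidden activations: theta(sum_i w_ij^(1) x_i)
    return np.tanh(H @ W2)     # output: theta(sum_j w_jk^(2) ...)

def e_in(X, y, W1, W2):
    """Squared-error E_in(w) averaged over the N examples."""
    pred = nnet_forward(X, W1, W2).ravel()
    return np.mean((y - pred) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # N=100 examples, d=5 inputs
y = np.sign(X[:, 0])                       # toy targets
W1 = rng.normal(scale=0.1, size=(5, 3))    # small random initial weights,
W2 = rng.normal(scale=0.1, size=(3, 1))    # per the advice above
print(e_in(X, y, W1, W2))
```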
VC Dimension of Neural Networks
roughly, with θ-like transfer functions:

$$d_{\text{VC}} = O(D \log D), \quad D = \text{number of weights}$$

• can implement 'anything' if enough neurons (D large): no need for many layers?
• can overfit if too many neurons

NNet: watch out for overfitting!
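To make the bound tangible, a small sketch that counts D for a given fully-connected architecture and evaluates the O(D log D) estimate; the layer sizes and the one-bias-per-neuron convention are illustrative assumptions:

```python
import numpy as np

def num_weights(layers):
    """D = total number of weights (one bias per neuron assumed) for a
    fully-connected NNet with the given layer sizes, e.g. [d, 3, 1]."""
    return sum((d_in + 1) * d_out for d_in, d_out in zip(layers, layers[1:]))

layers = [10, 8, 3, 1]            # illustrative 10-8-3-1 architecture
D = num_weights(layers)
print(D, D * np.log(D))           # D and the O(D log D) d_VC estimate
```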
Regularization for Neural Network
basic choice: old friend weight-decay (L2) regularizer $\Omega(\mathbf{w}) = \sum \big(w_{ij}^{(\ell)}\big)^2$

• 'shrink' weights: large weight → large shrink; small weight → small shrink
• want $w_{ij}^{(\ell)} = 0$ (sparse) to effectively decrease d_VC
• L1 regularizer: $\sum \big|w_{ij}^{(\ell)}\big|$, but not differentiable
• weight-elimination ('scaled' L2) regularizer: large weight → median shrink; small weight → median shrink

weight-elimination regularizer:

$$\sum \frac{\big(w_{ij}^{(\ell)}\big)^2}{\beta^2 + \big(w_{ij}^{(\ell)}\big)^2}$$
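A minimal numpy comparison of the two penalties, assuming an illustrative β = 1; it shows how weight-elimination saturates for large weights while plain weight-decay keeps growing quadratically:

```python
import numpy as np

def weight_decay(w):
    """Plain L2 (weight-decay) penalty: large weights dominate."""
    return np.sum(w ** 2)

def weight_elimination(w, beta=1.0):
    """'Scaled' L2 penalty: saturates near 1 for |w| >> beta and behaves
    like L2/beta^2 for |w| << beta, so all weights shrink comparably."""
    return np.sum(w ** 2 / (beta ** 2 + w ** 2))

w = np.array([0.01, 0.1, 1.0, 10.0])
print(weight_decay(w), weight_elimination(w))
```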
Yet Another Regularization: Early Stopping
• GD/SGD (backprop) visits more weight combinations as t increases
• smaller t effectively decreases d_VC
• better to 'stop in the middle': early stopping

[figure: E_in and E_out versus training epochs; E_in keeps decreasing while E_out bottoms out at an intermediate epoch, the early-stopping point]
when to stop?
validation!
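A minimal early-stopping loop, assuming a toy linear model trained by GD purely for illustration; the key pattern is tracking validation error each iteration and keeping the best-so-far weights:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)
Xt, yt = X[:150], y[:150]          # training set
Xv, yv = X[150:], y[150:]          # validation set decides when to stop

w = np.zeros(5)
best_err, best_w, best_t = np.inf, w.copy(), 0
for t in range(5000):
    grad = 2 * Xt.T @ (Xt @ w - yt) / len(yt)   # GD step on E_in
    w -= 0.01 * grad
    val_err = np.mean((Xv @ w - yv) ** 2)       # track validation error
    if val_err < best_err:                      # keep best-so-far weights
        best_err, best_w, best_t = val_err, w.copy(), t
print(best_t, best_err)                         # 'stop in the middle'
```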
Auto-Encoder
Learning the Identity Function
identity function: f(x) = x

• a vector function composed of $f_i(\mathbf{x}) = x_i$
• learning each $f_i$: regression with data $(\mathbf{x}_1, y_1 = x_{1,i}), (\mathbf{x}_2, y_2 = x_{2,i}), \ldots, (\mathbf{x}_N, y_N = x_{N,i})$
• learning f: learning all $f_i$ jointly with data $(\mathbf{x}_1, \mathbf{y}_1 = \mathbf{x}_1), (\mathbf{x}_2, \mathbf{y}_2 = \mathbf{x}_2), \ldots, (\mathbf{x}_N, \mathbf{y}_N = \mathbf{x}_N)$

but wait, why learn something known & easily implemented? :-)
Why Learn the Identity Function?

if g(x) ≈ f(x) using some hidden structures on the observed data x_n:

• for unsupervised learning:
  • density estimation: larger (structure match) when g(x) ≈ x better
  • outlier detection: those x where g(x) ≉ x
  ⇒ learning a 'typical' representation of the data
• for supervised learning:
  • hidden structure: essence of x that can be used as Φ(x)
  ⇒ learning an 'informative' representation of the data

auto-encoder: NNet for learning the identity function
Simple Auto-Encoder
simple auto-encoder: a d-d̃-d NNet

• d outputs: backprop easily applies
• d̃ < d: compressed representation; d̃ ≥ d: [over]-complete representation
• data: $(\mathbf{x}_1, \mathbf{y}_1 = \mathbf{x}_1), (\mathbf{x}_2, \mathbf{y}_2 = \mathbf{x}_2), \ldots, (\mathbf{x}_N, \mathbf{y}_N = \mathbf{x}_N)$; often categorized as an unsupervised learning technique
• if x contains binary bits:
  • a naïve solution exists (but unwanted) when [over]-complete
  • regularized weights needed in general
• sometimes constrain $w_{ij}^{(1)} = w_{ji}^{(2)}$ as 'regularization': more sophisticated in calculating the gradient

auto-encoder for representation learning: outputs of hidden neurons serve as Φ(x)
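A minimal numpy sketch of such a d-d̃-d auto-encoder, assuming tanh hidden units, linear output units, and illustrative sizes; the update steps are plain backprop on the squared reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, d_tilde = 200, 8, 3                    # a d - d~ - d auto-encoder
Z = rng.normal(size=(N, d_tilde))            # toy data with hidden structure
X = np.tanh(Z @ rng.normal(size=(d_tilde, d)))

W1 = rng.normal(scale=0.1, size=(d, d_tilde))   # encoder weights w^(1)
W2 = rng.normal(scale=0.1, size=(d_tilde, d))   # decoder weights w^(2)
eta = 0.05
for t in range(2000):
    H = np.tanh(X @ W1)                      # hidden representation Phi(x)
    Xhat = H @ W2                            # reconstruction g(x)
    err = Xhat - X
    gW2 = H.T @ err / N                      # backprop: output layer
    gH = err @ W2.T * (1 - H ** 2)           # backprop: tanh' = 1 - tanh^2
    gW1 = X.T @ gH / N
    W1 -= eta * gW1
    W2 -= eta * gW2
print(np.mean(err ** 2))                     # final reconstruction error
```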
Principal Component Analysis
Linear Auto-Encoder Hypothesis
$$h_k(\mathbf{x}) = \theta\Big(\sum_j w_{jk}^{(2)} \cdot \theta\Big(\sum_i w_{ij}^{(1)}\,x_i\Big)\Big)$$

consider three special conditions:

• constrain $w_{ij}^{(1)} = w_{ji}^{(2)} = w_{ij}$ as 'regularization'; let $W = [w_{ij}]$ of size $d \times \tilde{d}$
• θ does nothing, i.e. linear (like linear regression)
• $\tilde{d} < d$

linear auto-encoder hypothesis: $h(\mathbf{x}) = W W^T \mathbf{x}$
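A quick numerical check of this hypothesis with illustrative sizes; note the reconstruction passes through the rank-at-most-d̃ matrix W W^T:

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_tilde = 6, 2
W = rng.normal(size=(d, d_tilde))               # tied weights W = [w_ij]
x = rng.normal(size=d)
h = W @ (W.T @ x)                               # h(x) = W W^T x: encode, decode
print(h.shape, np.linalg.matrix_rank(W @ W.T))  # rank at most d~
```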
Linear Auto-Encoder Error Function
$$\min_W E_{\text{in}}(W) = \frac{1}{N}\,\big\| X - X W W^T \big\|_F^2$$

let $W W^T = V \Lambda V^T$ such that $V^T V = I$ and $\Lambda$ a diagonal matrix of rank at most $\tilde{d}$ (eigenvalue decomposition):

$$\begin{aligned}
\big\| X - X V \Lambda V^T \big\|_F^2
&= \operatorname{trace}\Big(\big(X - X V \Lambda V^T\big)^T \big(X - X V \Lambda V^T\big)\Big)\\
&= \operatorname{trace}\Big(X^T X - X^T X V \Lambda V^T - V \Lambda V^T X^T X + V \Lambda V^T X^T X V \Lambda V^T\Big)\\
&= \operatorname{trace}\Big(X^T X - \Lambda V^T X^T X V - \Lambda V^T X^T X V + \Lambda V^T X^T X V \Lambda V^T V\Big)\\
&= \operatorname{trace}\Big(V^T X^T X V - \Lambda V^T X^T X V - \Lambda V^T X^T X V + \Lambda^2 V^T X^T X V\Big)\\
&= \operatorname{trace}\Big((I - \Lambda)^2\, V^T X^T X V\Big)
\end{aligned}$$

(each step uses $V^T V = I$ and the cyclic property of the trace)
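The identity above can be verified numerically; a small sketch, with all sizes illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, d_tilde = 50, 5, 2
X = rng.normal(size=(N, d))
W = rng.normal(size=(d, d_tilde))

# eigen-decompose WW^T = V Lam V^T (symmetric PSD, rank at most d~)
lam, V = np.linalg.eigh(W @ W.T)
Lam = np.diag(lam)

lhs = np.linalg.norm(X - X @ V @ Lam @ V.T, 'fro') ** 2
rhs = np.trace((np.eye(d) - Lam) ** 2 @ V.T @ X.T @ X @ V)
print(np.isclose(lhs, rhs))                 # the trace identity holds
```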
Linear Auto-Encoder Algorithm
$$\min_{V,\Lambda}\ \operatorname{trace}\Big((I - \Lambda)^2\, V^T X^T X V\Big)$$

• optimal rank-$\tilde{d}$ $\Lambda$ contains $\tilde{d}$ '1' and $d - \tilde{d}$ '0'
• let $X^T X = U \Sigma U^T$ (eigenvalue decomposition); $V = U$ with the $\lambda_j = 1$ positions matched to the largest $\sigma_i$ (equivalently, the $\lambda_j = 0$ positions matched to the smallest $\sigma_i$) is optimal
• so the optimal column vectors $w_j = v_j$ = top eigenvectors of $X^T X$

optimal linear auto-encoding ≡ principal component analysis (PCA), with $w_j$ being the principal components of the unshifted data
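A minimal numpy sketch of the resulting algorithm, taking the top d̃ eigenvectors of X^T X as the columns of W; sizes and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, d_tilde = 100, 6, 2
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))

# eigen-decompose X^T X = U Sigma U^T; eigh returns eigenvalues ascending
sigma, U = np.linalg.eigh(X.T @ X)
W = U[:, -d_tilde:]                      # top d~ eigenvectors as columns w_j

Xhat = X @ W @ W.T                       # optimal linear auto-encoding
print(np.linalg.norm(X - Xhat, 'fro') ** 2 / N)   # E_in(W)
```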
Denoising Auto-Encoder
Simple Auto-Encoder Revisited
simple auto-encoder: a d-d̃-d NNet

• want: hidden structure to capture the essence of x
• a naïve solution exists (but unwanted) when [over]-complete
• regularized weights needed in general

regularization towards a more robust hidden structure?
Idea of Denoising Auto-Encoder
robust hidden structure should allow g(x̃) ≈ x even when x̃ is slightly different from x

• denoising auto-encoder: run the auto-encoder with data $(\tilde{\mathbf{x}}_1, \mathbf{y}_1 = \mathbf{x}_1), (\tilde{\mathbf{x}}_2, \mathbf{y}_2 = \mathbf{x}_2), \ldots, (\tilde{\mathbf{x}}_N, \mathbf{y}_N = \mathbf{x}_N)$, where $\tilde{\mathbf{x}}_n = \mathbf{x}_n +$ artificial noise
• PCA auto-encoder + Gaussian noise:

$$\min_W E_{\text{in}}(W) = \frac{1}{N}\,\big\| X - (X + \text{noise})\, W W^T \big\|_F^2$$

simply an L2-regularized PCA

artificial noise as regularization! (practically also useful for other types of NNet)
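A minimal sketch of the idea, assuming Gaussian input noise of an illustrative scale; fitting the linear auto-encoder on the noisy inputs here is an illustrative choice, not the lecture's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, d_tilde = 100, 6, 2
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))

# x~_n = x_n + artificial Gaussian noise (scale 0.3 is an arbitrary choice)
X_noisy = X + rng.normal(scale=0.3, size=X.shape)

# fit the linear auto-encoder on the (x~_n, y_n = x_n) pairs; using the top
# eigenvectors of the noisy inputs' Gram matrix is an illustrative choice
sigma, U = np.linalg.eigh(X_noisy.T @ X_noisy)
W = U[:, -d_tilde:]

# denoising objective: reconstruct the *clean* X from the noisy inputs
print(np.linalg.norm(X - X_noisy @ W @ W.T, 'fro') ** 2 / N)
```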
Deep Neural Network
Final remark: hidden layers

[figure: a neural network with inputs 1, x_1, x_2, …, x_d feeding θ(s) units in a hidden layer, combined into the output h(x)]

learned nonlinear transform: interpretation?

(Learning From Data, Lecture 10)
Shallow versus Deep Structures
shallow: few hidden layers; deep: many hidden layers

Shallow:
• efficient to train
• powerful if enough neurons

Deep:
• challenging to train
• needs more structural (model) decisions
• potentially more 'meaningful' features?

deep structure (deep learning) has recently regained attention
Deep Learning Deep Neural Network
Key Techniques behind Deep Learning
• (usually) unsupervised pre-training between hidden layers, such as by a simple/denoising auto-encoder (see the sketch after this list): viewing hidden layers as 'condensing' a low-level representation into a high-level one
• fine-tune with backprop after initializing with those 'good' weights, because direct backprop may get stuck more easily
• speed-up: better optimization algorithms, and faster GPUs
• generalization issue less serious with big (enough) data

currently very useful for vision and speech recognition
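A minimal sketch of greedy layer-wise pre-training with the simple auto-encoder from earlier; layer sizes, learning rate, and epochs are illustrative, and the supervised fine-tuning step is only indicated in a comment:

```python
import numpy as np

def train_autoencoder(X, d_tilde, epochs=500, eta=0.05, seed=0):
    """Pre-train one d - d~ - d auto-encoder by GD; return the encoder
    weights (tanh hidden units, linear reconstruction)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W1 = rng.normal(scale=0.1, size=(d, d_tilde))
    W2 = rng.normal(scale=0.1, size=(d_tilde, d))
    for _ in range(epochs):
        H = np.tanh(X @ W1)
        err = H @ W2 - X                     # reconstruction error
        gW1 = X.T @ (err @ W2.T * (1 - H ** 2)) / len(X)
        gW2 = H.T @ err / len(X)
        W1 -= eta * gW1
        W2 -= eta * gW2
    return W1

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 16))               # raw inputs (illustrative)
weights, H = [], X
for d_tilde in [8, 4]:                       # illustrative layer sizes
    W = train_autoencoder(H, d_tilde)        # pre-train this layer alone
    weights.append(W)
    H = np.tanh(H @ W)                       # condensed representation
                                             # becomes next layer's input
print([W.shape for W in weights])
# 'weights' would now initialize the deep NNet, followed by fine-tuning
# with backprop on the supervised task
```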