Machine Learning Techniques (機器學習技法)
Lecture 13: Deep Learning
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Deep Learning
Roadmap
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

Lecture 12: Neural Network
automatic pattern feature extraction from layers of neurons with backprop for GD/SGD

Lecture 13: Deep Learning
• Deep Neural Network
• Autoencoder
• Denoising Autoencoder
• Principal Component Analysis
Deep Learning Deep Neural Network
Physical Interpretation of NNet Revisited
[figure: a multi-layer NNet with inputs x_0 = 1, x_1, x_2, ..., x_d, layers of tanh units, weights w_ij^(1), w_jk^(2), w_kq^(3), and a hidden output x_3^(2) = tanh(s_3^(2))]
• each layer: pattern feature extracted from data, remember? :-)
• how many neurons? how many layers? —more generally, what structure?
  • subjectively, your design!
  • objectively, validation, maybe?

structural decisions: key issue for applying NNet
Shallow versus Deep Neural Networks
shallow: few (hidden) layers; deep: many layers
Shallow NNet
• more efficient to train (✓)
• simpler structural decisions (✓)
• theoretically powerful enough (✓)

Deep NNet
• challenging to train (×)
• sophisticated structural decisions (×)
• ‘arbitrarily’ powerful (✓)
• more ‘meaningful’? (see next slide)

deep NNet (deep learning): gaining attention in recent years
Meaningfulness of Deep Learning
[figure: recognizing handwritten digits from raw pixels: simple stroke features φ_1, ..., φ_6 are extracted first, then combined through positive/negative weights into the outputs z_1 ("is it a ‘1’?") and z_5 ("is it a ‘5’?")]
• ‘less burden’ for each layer: simple to complex features
• natural for difficult learning tasks with raw features, like vision

deep NNet: currently popular in vision/speech/...
Challenges and Key Techniques for Deep Learning
• difficult structural decisions:
  • subjective with domain knowledge: like convolutional NNet for images
• high model complexity:
  • no big worries if big enough data
  • regularization towards noise-tolerance, like
    • dropout (tolerant when network corrupted)
    • denoising (tolerant when input corrupted)
• hard optimization problem:
  • careful initialization to avoid bad local minimum: called pre-training
• huge computational complexity (worsens with big data):
  • novel hardware/architecture: like mini-batch with GPU

IMHO, careful regularization and initialization are the key techniques
A Two-Step Deep Learning Framework
Simple Deep Learning
1 for ℓ = 1, ..., L, pre-train {w_ij^(ℓ)} assuming w*^(1), ..., w*^(ℓ−1) fixed
  [figure: panels (a)–(d) illustrating layer-by-layer pre-training]
2 train with backprop on the pre-trained NNet to fine-tune all {w_ij^(ℓ)}

will focus on the simplest pre-training technique along with regularization
(a minimal sketch of the two-step framework follows)
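A minimal NumPy sketch of this two-step framework, assuming two hypothetical helpers, pre_train_layer (for example, the autoencoder introduced later in this lecture) and backprop_fine_tune; the names and interfaces are illustrative rather than from the lecture.

```python
import numpy as np

def two_step_deep_learning(X, y, layer_sizes, pre_train_layer, backprop_fine_tune):
    """Sketch of the two-step framework: greedy layer-wise pre-training,
    then fine-tuning of all weights with backprop.
    pre_train_layer and backprop_fine_tune are assumed, user-supplied helpers."""
    weights = []
    Z = X                                      # x^(0): the raw inputs
    for d_out in layer_sizes:                  # for ell = 1, ..., L
        W = pre_train_layer(Z, d_out)          # pre-train w^(ell), earlier layers fixed
        weights.append(W)
        Z = np.tanh(Z @ W)                     # x^(ell): representation for layer ell+1
    return backprop_fine_tune(weights, X, y)   # step 2: fine-tune all {w_ij^(ell)}
```

Note that in this sketch the pre-training loop never looks at the labels y; only the final fine-tuning step does.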
Fun Time
For a deep NNet for written character recognition from raw pixels, which type of features is more likely extracted after the first hidden layer?
1 pixels
2 strokes
3 parts
4 digits

Reference Answer: 2
Simple strokes are likely the ‘next-level’ features that can be extracted from raw pixels.
Deep Learning Autoencoder
Information-Preserving Encoding
• weights: feature transform, i.e. encoding
• good weights: information-preserving encoding
  —next layer: same info. with different representation
• information-preserving: decode accurately after encoding

[figure: the digit-recognition network again, with stroke features φ_1, ..., φ_6 combined by positive/negative weights into the outputs z_1 ("is it a ‘1’?") and z_5 ("is it a ‘5’?")]

idea: pre-train weights towards information-preserving encoding
Information-Preserving Neural Network
[figure: a d–d̃–d NNet with inputs x_0 = 1, x_1, x_2, ..., x_d, a tanh hidden layer, and outputs approximating x_1, ..., x_d; encoding weights w_ij^(1), decoding weights w_ji^(2)]

• autoencoder: d–d̃–d NNet with goal g_i(x) ≈ x_i
  —learning to approximate the identity function
• w_ij^(1): encoding weights; w_ji^(2): decoding weights

why approximate the identity function?
Usefulness of Approximating Identity Function
if g(x) ≈ x using some hidden structures on the observed data x_n

• for supervised learning:
  • the hidden structure (essence) of x can be used as a reasonable transform Φ(x)
  —learning an ‘informative’ representation of data
• for unsupervised learning:
  • density estimation: larger (structure match) when g(x) ≈ x
  • outlier detection: those x where g(x) ≉ x
  —learning a ‘typical’ representation of data

autoencoder: representation-learning through approximating the identity function
Basic Autoencoder
basic autoencoder: d–d̃–d NNet with error function \sum_{i=1}^{d} \big(g_i(x) - x_i\big)^2

• backprop easily applies; shallow and easy to train
• usually d̃ < d: compressed representation
• data: {(x_1, y_1 = x_1), (x_2, y_2 = x_2), ..., (x_N, y_N = x_N)}
  —often categorized as an unsupervised learning technique
• sometimes constrain w_ij^(1) = w_ji^(2) as regularization
  —more sophisticated in calculating the gradient

basic autoencoder in basic deep learning:
{w_ij^(1)} taken as shallowly pre-trained weights (a minimal training sketch follows)
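A minimal NumPy sketch of such a basic autoencoder, written for this note rather than taken from the lecture: a tanh hidden layer with a linear output (one reasonable choice), squared reconstruction error, and plain batch gradient descent.

```python
import numpy as np

def train_basic_autoencoder(X, d_tilde, eta=0.01, epochs=200, seed=0):
    """d - d_tilde - d autoencoder: tanh hidden layer, linear output,
    squared reconstruction error, plain batch gradient descent.
    Biases are omitted and weights are untied for brevity.
    Returns (W1, W2): encoding weights w_ij^(1) and decoding weights w_ji^(2)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, d_tilde))     # encoding weights
    W2 = rng.normal(scale=0.1, size=(d_tilde, d))     # decoding weights
    for _ in range(epochs):
        H = np.tanh(X @ W1)              # hidden representation (the learned transform)
        G = H @ W2                       # reconstruction g(x)
        delta_out = 2 * (G - X) / N      # gradient of the mean squared reconstruction error
        delta_hid = (delta_out @ W2.T) * (1 - H ** 2)   # backprop through tanh
        W2 -= eta * H.T @ delta_out
        W1 -= eta * X.T @ delta_hid
    return W1, W2

# usage: W1 serves as one layer of shallowly pre-trained weights
# X = np.random.default_rng(1).normal(size=(100, 10))
# W1, W2 = train_basic_autoencoder(X, d_tilde=3)
```

The returned W1 is what the deep learning framework above would keep as the shallowly pre-trained weights of one layer.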
Pre-Training with Autoencoders
Deep Learning with Autoencoders
1 for ℓ = 1, ..., L, pre-train {w_ij^(ℓ)} assuming w*^(1), ..., w*^(ℓ−1) fixed,
  by training a basic autoencoder on {x_n^(ℓ−1)} with d̃ = d^(ℓ)
2 train with backprop on the pre-trained NNet to fine-tune all {w_ij^(ℓ)}

many successful pre-training techniques take ‘fancier’ autoencoders with different architectures and regularization schemes
(a sketch of the layer-wise loop follows)
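A hedged sketch of step 1 above, reusing the train_basic_autoencoder helper from the previous sketch (also my own illustration): each layer's encoding weights are learned from the representations x^(ℓ−1) produced by the layers already pre-trained.

```python
import numpy as np

def pretrain_deep_nnet(X, hidden_sizes):
    """Greedy layer-wise pre-training: layer ell is pre-trained as a basic
    autoencoder on the representations produced by layers 1..ell-1.
    Assumes train_basic_autoencoder from the earlier sketch."""
    pretrained = []
    Z = X                                           # x^(0): raw inputs
    for d_ell in hidden_sizes:                      # d^(1), ..., d^(L)
        W1, _ = train_basic_autoencoder(Z, d_ell)   # keep only the encoding weights
        pretrained.append(W1)
        Z = np.tanh(Z @ W1)                         # x^(ell), fed to the next autoencoder
    return pretrained                               # step 2 fine-tunes these with backprop
```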
Fun Time
Suppose training a d–d̃–d autoencoder with backprop takes approximately c · d · d̃ seconds. Then, what is the total number of seconds needed for pre-training a d–d^(1)–d^(2)–d^(3)–1 deep NNet?
1 c(d + d^(1) + d^(2) + d^(3) + 1)
2 c(d · d^(1) · d^(2) · d^(3) · 1)
3 c(d d^(1) + d^(1) d^(2) + d^(2) d^(3) + d^(3))
4 c(d d^(1) · d^(1) d^(2) · d^(2) d^(3) · d^(3))

Reference Answer: 3
Each c · d^(ℓ−1) · d^(ℓ) represents the time for pre-training with one autoencoder to determine one layer of the weights.
Deep Learning Denoising Autoencoder
Regularization in Deep Learning
[figure: the multi-layer NNet again, with inputs x_0 = 1, x_1, ..., x_d, tanh units, and weights w_ij^(1), w_jk^(2), w_kq^(3)]

watch out for overfitting, remember? :-)

high model complexity: regularization needed
• structural decisions/constraints
• weight-decay or weight-elimination regularizers
• early stopping

next: another regularization technique
Reasons of Overfitting Revisited
[figure: overfitting as a function of the number of data points N and the stochastic noise level σ^2]

reasons of serious overfitting:
• data size N ↓       ⇒ overfit ↑
• noise ↑             ⇒ overfit ↑
• excessive power ↑   ⇒ overfit ↑

how to deal with noise?
Dealing with Noise
• direct possibility: data cleaning/pruning, remember? :-)
• a wild possibility: adding noise to data?
• idea: a robust autoencoder should not only let g(x) ≈ x but also allow g(x̃) ≈ x even when x̃ is slightly different from x
• denoising autoencoder: run the basic autoencoder with data
  {(x̃_1, y_1 = x_1), (x̃_2, y_2 = x_2), ..., (x̃_N, y_N = x_N)}, where x̃_n = x_n + artificial noise
  —often used instead of the basic autoencoder in deep learning
• useful for data/image processing: g(x̃) is a denoised version of x̃
• effect: ‘constrain/regularize’ g towards noise-tolerance

denoising: artificial noise/hint as regularization!
—practically also useful for other NNet/models (a minimal sketch follows)
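A hedged sketch of the denoising variant, mirroring the basic autoencoder sketch above (both are my own illustrations): the inputs are corrupted with Gaussian noise, one possible choice of artificial noise, while the reconstruction targets remain the clean x_n.

```python
import numpy as np

def train_denoising_autoencoder(X, d_tilde, noise_std=0.1, eta=0.01, epochs=200, seed=0):
    """Same d - d_tilde - d network and squared error as the basic autoencoder,
    but each pass corrupts the inputs while the targets stay clean."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, d_tilde))
    W2 = rng.normal(scale=0.1, size=(d_tilde, d))
    for _ in range(epochs):
        X_tilde = X + noise_std * rng.normal(size=X.shape)  # x~_n = x_n + artificial noise
        H = np.tanh(X_tilde @ W1)        # encode the corrupted input
        G = H @ W2                       # g(x~): should reconstruct the clean x
        delta_out = 2 * (G - X) / N      # error measured against the clean targets
        delta_hid = (delta_out @ W2.T) * (1 - H ** 2)
        W2 -= eta * H.T @ delta_out
        W1 -= eta * X_tilde.T @ delta_hid
    return W1, W2
```

The noise level noise_std then acts much like a regularization parameter: more corruption pushes g further towards noise-tolerance.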
Fun Time
Which of the following cannot be viewed as a regularization technique?
1 hint the model with artificially-generated noisy data
2 stop gradient descent early
3 add a weight-elimination regularizer
4 all of the above are regularization techniques

Reference Answer: 4
1 is our new friend for regularization, while 2 and 3 are old friends.
Deep Learning Principal Component Analysis
Linear Autoencoder Hypothesis
nonlinear autoencoder: sophisticated
linear autoencoder: simple
—linear: more efficient? less overfitting? linear first, remember? :-)

linear hypothesis for the k-th component:
  h_k(x) = \sum_{j=0}^{\tilde{d}} w_{kj} \Big( \sum_{i=1}^{d} w_{ij} x_i \Big)

consider three special conditions:
• exclude x_0: range of i same as range of k
• constrain w_ij^(1) = w_ji^(2) = w_ij as regularization
  —denote W = [w_ij] of size d × d̃
• assume d̃ < d: ensure a non-trivial solution

linear autoencoder hypothesis: h(x) = W W^T x
Linear Autoencoder Error Function
E_in(h) = E_in(W) = \frac{1}{N} \sum_{n=1}^{N} \left\| x_n - W W^T x_n \right\|^2   with the d × d̃ matrix W

—analytic solution to minimize E_in? but it is a 4th-order polynomial of the w_ij

let's familiarize the problem with linear algebra (be brave! :-))
• eigen-decompose W W^T = V Γ V^T
  • d × d matrix V orthogonal: V V^T = V^T V = I_d
  • d × d matrix Γ diagonal with ≤ d̃ non-zero entries
• W W^T x_n = V Γ V^T x_n
  • V^T x_n: change of orthonormal basis (rotate or reflect)
  • Γ (···): set ≥ d − d̃ components to 0, and scale the others
  • V (···): reconstruct by coefficients and basis (back-rotate)
• x_n = V I V^T x_n: rotate and back-rotate cancel out

next: minimize E_in by optimizing Γ and V
The Optimal Γ
\min_V \min_\Gamma \frac{1}{N} \sum_{n=1}^{N} \Big\| \underbrace{V I V^T x_n}_{x_n} - \underbrace{V \Gamma V^T x_n}_{W W^T x_n} \Big\|^2

• back-rotate V does not affect length: drop V
• \min_\Gamma \sum \|(I - \Gamma)(\text{some vector})\|^2: want many 0's within (I − Γ)
• optimal diagonal Γ with rank ≤ d̃: d̃ diagonal components 1, other components 0
  ⟹ without loss of generality  Γ = \begin{bmatrix} I_{\tilde d} & 0 \\ 0 & 0 \end{bmatrix}

next:  \min_V \sum_{n=1}^{N} \Big\| \underbrace{\begin{bmatrix} 0 & 0 \\ 0 & I_{d-\tilde d} \end{bmatrix}}_{I - \text{optimal } \Gamma} V^T x_n \Big\|^2
The Optimal V
\min_V \sum_{n=1}^{N} \Big\| \begin{bmatrix} 0 & 0 \\ 0 & I_{d-\tilde d} \end{bmatrix} V^T x_n \Big\|^2
  \equiv \max_V \sum_{n=1}^{N} \Big\| \begin{bmatrix} I_{\tilde d} & 0 \\ 0 & 0 \end{bmatrix} V^T x_n \Big\|^2

• d̃ = 1: only the first row v^T of V^T matters:
  \max_v \sum_{n=1}^{N} v^T x_n x_n^T v   subject to  v^T v = 1
• the optimal v satisfies \sum_{n=1}^{N} x_n x_n^T v = \lambda v
  —using a Lagrange multiplier λ, remember? :-)
• optimal v: ‘topmost’ eigenvector of X^T X
• general d̃: {v_j}_{j=1}^{d̃} are the ‘topmost’ eigenvectors of X^T X
  —optimal {w_j} = {v_j with γ_j = 1} = top eigenvectors

linear autoencoder: projecting to orthogonal patterns w_j that ‘match’ {x_n} most
(a quick numerical check follows)
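A quick NumPy check of this conclusion (my own illustration, not part of the lecture): the topmost eigenvector of X^T X attains objective value λ, and a random unit vector does no better.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                  # rows are x_n

# eigen-decompose X^T X; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
v_top = eigvecs[:, -1]                         # 'topmost' eigenvector

def objective(v):                              # sum_n v^T x_n x_n^T v = v^T X^T X v
    return v @ (X.T @ X) @ v

print(np.isclose(objective(v_top), eigvals[-1]))      # True: objective value equals lambda

v_rand = rng.normal(size=5)
v_rand /= np.linalg.norm(v_rand)               # another unit vector
print(objective(v_rand) <= objective(v_top) + 1e-9)   # True: no unit vector does better
```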
Principal Component Analysis
Linear Autoencoder or PCA
1 let x̄ = (1/N) \sum_{n=1}^{N} x_n, and let x_n ← x_n − x̄
2 calculate the d̃ top eigenvectors w_1, w_2, ..., w_d̃ of X^T X
3 return feature transform Φ(x) = W(x − x̄)

• linear autoencoder: maximize \sum (magnitude after projection)^2
• principal component analysis (PCA) from statistics: maximize \sum (variance after projection)
• both useful for linear dimension reduction, though PCA is more popular

linear dimension reduction: useful for data processing
(a short sketch follows)
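A short NumPy sketch of these three steps (my own illustration, not the lecture's reference code), with the rows of W being the d̃ top eigenvectors of X^T X computed after centering.

```python
import numpy as np

def pca_transform(X, d_tilde):
    """Steps 1-3 above: center the data, take the d_tilde top eigenvectors of
    X^T X (as rows of W), and return the transform Phi(x) = W (x - x_bar)."""
    x_bar = X.mean(axis=0)                         # step 1: mean ...
    Xc = X - x_bar                                 #         ... and centering
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)   # step 2: eigenvalues in ascending order
    W = eigvecs[:, -d_tilde:][:, ::-1].T           #         top d_tilde eigenvectors as rows
    return W, x_bar, (lambda x: W @ (x - x_bar))   # step 3: the feature transform Phi

# usage (illustrative): project 10-dimensional points onto 2 principal directions
# X = np.random.default_rng(0).normal(size=(100, 10))
# W, x_bar, phi = pca_transform(X, d_tilde=2)
# Z = (X - x_bar) @ W.T   # the same transform applied to all rows at once
```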
Fun Time
When solving the optimization problem \max_v \sum_{n=1}^{N} v^T x_n x_n^T v subject to v^T v = 1, we know that the optimal v is the ‘topmost’ eigenvector that corresponds to the ‘topmost’ eigenvalue λ of X^T X. Then, what is the optimal objective value of the optimization problem?
1 λ^1
2 λ^2
3 λ^3
4 λ^4

Reference Answer: 1
The objective value of the optimization problem is simply v^T X^T X v, which is λ v^T v, and you know what v^T v must be.