
Machine Learning Techniques (機器學習技法)

Lecture 13: Deep Learning

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)


Roadmap

1 Embedding Numerous Features: Kernel Models

2 Combining Predictive Features: Aggregation Models

3 Distilling Implicit Features: Extraction Models

Lecture 12: Neural Network
automatic pattern feature extraction from layers of neurons with backprop for GD/SGD

Lecture 13: Deep Learning
Deep Neural Network, Autoencoder, Denoising Autoencoder, Principal Component Analysis

Deep Neural Network

Physical Interpretation of NNet Revisited

[Figure: a multi-layer tanh NNet with inputs x_0 = 1, x_1, x_2, . . . , x_d, bias units +1, weights w_ij^(1), w_jk^(2), w_kq^(3), and hidden outputs such as x_3^(2) = tanh(s_3^(2))]

each layer: pattern feature extracted from data, remember? :-)

how many neurons? how many layers? —more generally, what structure?
• subjectively, your design!
• objectively, validation, maybe?

structural decisions: key issue for applying NNet
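To make the layer-by-layer computation behind this picture concrete, here is a minimal NumPy sketch of the forward pass of a fully-connected tanh NNet; the layer sizes and random weights are illustrative assumptions (tanh is used in every layer just for brevity), not values from the lecture.

```python
import numpy as np

def forward(x, weights):
    """Forward pass of a fully-connected tanh NNet.

    weights[l] is a (1 + d_in) x d_out matrix; its leading row multiplies
    the constant +1 bias unit, matching x_0 = 1 in the figure above.
    """
    out = x
    for W in weights:
        z = np.concatenate(([1.0], out))   # prepend the +1 bias unit
        s = z @ W                          # layer scores s^(l)
        out = np.tanh(s)                   # transformed outputs x^(l)
    return out

# hypothetical structure: 5 inputs -> 3 hidden -> 3 hidden -> 1 output
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(6, 3)),
           rng.normal(scale=0.1, size=(4, 3)),
           rng.normal(scale=0.1, size=(4, 1))]
print(forward(rng.normal(size=5), weights))
```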


Shallow versus Deep Neural Networks

shallow: few (hidden) layers; deep: many layers

Shallow NNet
• more efficient to train (○)
• simpler structural decisions (○)
• theoretically powerful enough (○)

Deep NNet
• challenging to train (×)
• sophisticated structural decisions (×)
• ‘arbitrarily’ powerful (○)
• more ‘meaningful’? (see next slide)

deep NNet (deep learning): gaining attention in recent years


Meaningfulness of Deep Learning

[Figure: answering ‘is it a 1?’ (z_1) versus ‘is it a 5?’ (z_5) by combining simple features φ_1, . . . , φ_6 through positive and negative weights]

‘less burden’ for each layer: simple to complex features

natural for difficult learning task with raw features, like vision

deep NNet: currently popular in vision/speech/. . .


Challenges and Key Techniques for Deep Learning

difficult structural decisions:
• subjective with domain knowledge: like convolutional NNet for images

high model complexity:
• no big worries if big enough data
• regularization towards noise-tolerant: like dropout (tolerant when network corrupted) and denoising (tolerant when input corrupted)

hard optimization problem:
• careful initialization to avoid bad local minimum: called pre-training

huge computational complexity (worsened with big data):
• novel hardware/architecture: like mini-batch with GPU

IMHO, careful regularization and initialization are key techniques
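As a concrete illustration of the dropout idea mentioned above, here is a minimal NumPy sketch of randomly corrupting hidden activations during training; the keep probability and sizes are hypothetical, and the rescaling follows the common ‘inverted dropout’ variant rather than anything specified in the lecture.

```python
import numpy as np

def dropout_mask(shape, keep_prob, rng):
    """Randomly zero out hidden units; scale the survivors so the expected
    activation stays the same (the 'inverted dropout' variant)."""
    return (rng.random(shape) < keep_prob) / keep_prob

rng = np.random.default_rng(1)
hidden = np.tanh(rng.normal(size=(4, 8)))      # hypothetical hidden activations
train_hidden = hidden * dropout_mask(hidden.shape, keep_prob=0.5, rng=rng)
# at test time, use `hidden` as-is: no units are dropped
print(train_hidden)
```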


A Two-Step Deep Learning Framework

Simple Deep Learning

1 for ℓ = 1, . . . , L, pre-train {w_ij^(ℓ)} assuming w^(1), . . . , w^(ℓ−1) fixed
2 train with backprop on pre-trained NNet to fine-tune all {w_ij^(ℓ)}

will focus on simplest pre-training technique along with regularization


Fun Time

For a deep NNet for written character recognition from raw pixels, which type of features are more likely extracted after the first hidden layer?

1 pixels
2 strokes
3 parts
4 digits

Reference Answer: 2

Simple strokes are likely the ‘next-level’ features that can be extracted from raw pixels.


Autoencoder

Information-Preserving Encoding

weights: feature transform, i.e. encoding

good weights: information-preserving encoding
—next layer: same info. with different representation

information-preserving: decode accurately after encoding

idea: pre-train weights towards information-preserving encoding


Information-Preserving Neural Network

[Figure: a d-d̃-d NNet mapping inputs x_0 = 1, x_1, . . . , x_d through tanh hidden units back to outputs ≈ x_1, . . . , ≈ x_d, with encoding weights w_ij^(1) and decoding weights w_ji^(2)]

autoencoder: d-d̃-d NNet with goal g_i(x) ≈ x_i
—learning to approximate identity function
• w_ij^(1): encoding weights; w_ji^(2): decoding weights

why approximating identity function?


Usefulness of Approximating Identity Function

if g(x) ≈ x using some hidden structures on the observed data x_n

for supervised learning:
• hidden structure (essence) of x can be used as reasonable transform Φ(x)
—learning ‘informative’ representation of data

for unsupervised learning:
• density estimation: larger (structure match) when g(x) ≈ x
• outlier detection: those x where g(x) ≉ x
—learning ‘typical’ representation of data

autoencoder: representation-learning through approximating identity function
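For instance, the outlier-detection use above can be sketched by scoring each example with its reconstruction error ||g(x) − x||²; the g below is a toy stand-in for a trained autoencoder, purely for illustration.

```python
import numpy as np

def outlier_scores(X, g):
    """Reconstruction error ||g(x) - x||^2 per example; a large score suggests
    x does not fit the 'typical' structure captured by g."""
    return np.sum((g(X) - X) ** 2, axis=1)

# toy stand-in for a trained autoencoder: keep only the first coordinate
g = lambda X: np.column_stack([X[:, 0], np.zeros(len(X))])
X = np.array([[1.0, 0.0], [2.0, 0.1], [0.5, 3.0]])   # last row breaks the pattern
print(outlier_scores(X, g))   # the third score is much larger than the others
```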


Basic Autoencoder

basic autoencoder: d-d̃-d NNet with error function Σ_{i=1}^{d} (g_i(x) − x_i)²

• backprop easily applies; shallow and easy to train
• usually d̃ < d: compressed representation
• data: {(x_1, y_1 = x_1), (x_2, y_2 = x_2), . . . , (x_N, y_N = x_N)}
—often categorized as unsupervised learning technique
• sometimes constrain w_ij^(1) = w_ji^(2) as regularization
—more sophisticated in calculating gradient

basic autoencoder in basic deep learning:
{w_ij^(1)} taken as shallowly pre-trained weights
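A minimal NumPy sketch of such a basic d-d̃-d autoencoder, trained by plain gradient descent on the squared reconstruction error; the tanh hidden layer, linear output layer, learning rate, and toy data are illustrative assumptions (bias units and the tied-weight constraint are omitted for brevity).

```python
import numpy as np

def train_autoencoder(X, d_tilde, lr=0.1, epochs=500, seed=0):
    """Basic d - d_tilde - d autoencoder: tanh hidden layer, linear output,
    squared reconstruction error, full-batch gradient descent."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(d, d_tilde))   # encoding weights w_ij^(1)
    W2 = rng.normal(scale=0.1, size=(d_tilde, d))   # decoding weights w_ji^(2)
    for _ in range(epochs):
        H = np.tanh(X @ W1)                          # hidden representation
        G = 2.0 * (H @ W2 - X) / N                   # gradient of mean squared error w.r.t. output
        dW2 = H.T @ G
        dW1 = X.T @ ((G @ W2.T) * (1.0 - H ** 2))    # backprop through tanh
        W1 -= lr * dW1
        W2 -= lr * dW2
    return W1, W2

# toy data lying (roughly) on a 1-dimensional structure inside 3 dimensions
rng = np.random.default_rng(1)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t, -t]) + 0.05 * rng.normal(size=(200, 3))
W1, W2 = train_autoencoder(X, d_tilde=1)
print(np.mean((np.tanh(X @ W1) @ W2 - X) ** 2))      # mean squared reconstruction error
```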


Pre-Training with Autoencoders

Deep Learning with Autoencoders

1 for ℓ = 1, . . . , L, pre-train {w_ij^(ℓ)} assuming w^(1), . . . , w^(ℓ−1) fixed,
by training basic autoencoder on {x_n^(ℓ−1)} with d̃ = d^(ℓ)
2 train with backprop on pre-trained NNet to fine-tune all {w_ij^(ℓ)}

many successful pre-training techniques take ‘fancier’ autoencoders with different architectures and regularization schemes
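Reusing the train_autoencoder sketch above, step 1 can be written as a greedy layer-by-layer loop; the layer widths below are hypothetical, and the fine-tuning of step 2 is only indicated by a comment.

```python
import numpy as np

def pretrain(X, layer_dims):
    """Greedy pre-training: for each layer l, train a basic autoencoder on the
    current representation x^(l-1) with d_tilde = d^(l), keep its encoding
    weights, and feed the encoded data to the next layer."""
    weights, rep = [], X
    for d_l in layer_dims:
        W1, _ = train_autoencoder(rep, d_tilde=d_l)   # from the earlier sketch
        weights.append(W1)                            # shallowly pre-trained w^(l)
        rep = np.tanh(rep @ W1)                       # x^(l) for the next layer
    return weights

# hypothetical deep NNet structure d - d^(1) - d^(2): 3 - 2 - 1
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
weights = pretrain(X, layer_dims=[2, 1])
# step 2 (not shown): run backprop on the full NNet, starting from `weights`,
# to fine-tune all layers on the actual learning task
```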


Fun Time

Suppose training a d-d̃-d autoencoder with backprop takes approximately c · d · d̃ seconds. Then, what is the total number of seconds needed for pre-training a d-d^(1)-d^(2)-d^(3)-1 deep NNet?

1 c (d + d^(1) + d^(2) + d^(3) + 1)
2 c (d · d^(1) · d^(2) · d^(3) · 1)
3 c (d d^(1) + d^(1) d^(2) + d^(2) d^(3) + d^(3))
4 c (d d^(1) · d^(1) d^(2) · d^(2) d^(3) · d^(3))

Reference Answer: 3

Each c · d^(ℓ−1) · d^(ℓ) represents the time for pre-training with one autoencoder to determine one layer of the weights.
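For a quick check with hypothetical widths (not from the lecture), say d = 784, d^(1) = 100, d^(2) = 50, d^(3) = 10: total time = c (784·100 + 100·50 + 50·10 + 10·1) = 83910 c seconds.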


Denoising Autoencoder

Regularization in Deep Learning

[Figure: the multi-layer tanh NNet with weights w_ij^(1), w_jk^(2), w_kq^(3), as before]

watch out for overfitting, remember? :-)

high model complexity: regularization needed
• structural decisions/constraints
• weight decay or weight elimination regularizers
• early stopping

next: another regularization technique


Reasons of Overfitting Revisited

[Figure: overfit level as a function of the number of data points N and the stochastic noise level σ²]

reasons of serious overfitting:
• data size N ↓: overfit
• noise ↑: overfit
• excessive power ↑: overfit

how to deal with noise?


Dealing with Noise

direct possibility: data cleaning/pruning, remember? :-)

a wild possibility: adding noise to data?

idea: robust autoencoder should not only let g(x) ≈ x but also allow g(x̃) ≈ x even when x̃ is slightly different from x

denoising autoencoder: run basic autoencoder with data
{(x̃_1, y_1 = x_1), (x̃_2, y_2 = x_2), . . . , (x̃_N, y_N = x_N)}, where x̃_n = x_n + artificial noise
—often used instead of basic autoencoder in deep learning

useful for data/image processing: g(x̃) a denoised version of x̃

effect: ‘constrain/regularize’ g towards noise-tolerant denoising

artificial noise/hint as regularization!
—practically also useful for other NNet/models
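A minimal NumPy sketch of constructing such denoising data; the Gaussian noise and its level are illustrative choices, and the earlier (hypothetical) train_autoencoder routine would simply see x̃_n at its input while the squared error is still measured against the clean x_n.

```python
import numpy as np

def make_denoising_data(X, noise_std, seed=0):
    """Denoising-autoencoder data: inputs are corrupted copies x_tilde_n = x_n + noise,
    while the targets remain the clean x_n (artificial noise as a hint/regularizer)."""
    rng = np.random.default_rng(seed)
    X_tilde = X + noise_std * rng.normal(size=X.shape)
    return X_tilde, X   # (inputs, targets)

# usage sketch: train any autoencoder so that g(X_tilde) ≈ X, i.e. feed X_tilde
# through the forward pass while measuring squared error against X
X = np.random.default_rng(3).normal(size=(200, 3))
X_tilde, targets = make_denoising_data(X, noise_std=0.1)
print(np.mean((X_tilde - targets) ** 2))   # roughly noise_std**2
```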


Fun Time

Which of the following cannot be viewed as a regularization technique?

1 hint the model with artificially-generated noisy data
2 stop gradient descent early
3 add a weight elimination regularizer
4 all the above are regularization techniques

Reference Answer: 4

1 is our new friend for regularization, while 2 and 3 are old friends.


Principal Component Analysis

Linear Autoencoder Hypothesis

nonlinear autoencoder: sophisticated
linear autoencoder: simple

linear: more efficient? less overfitting? linear first, remember? :-)

linear hypothesis for k-th component:
h_k(x) = Σ_{j=0}^{d̃} w_kj ( Σ_{i=1}^{d} w_ij x_i )

consider three special conditions:
• exclude x_0: range of i same as range of k
• constrain w_ij^(1) = w_ji^(2) = w_ij as regularization
—denote W = [w_ij] of size d × d̃
• assume d̃ < d: ensure non-trivial solution

linear autoencoder hypothesis: h(x) = W W^T x
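A minimal NumPy sketch of this hypothesis, together with the average squared reconstruction error that the next part minimizes; the data dimensions and the random W are purely illustrative.

```python
import numpy as np

def linear_autoencoder(X, W):
    """Linear autoencoder hypothesis h(x) = W W^T x, applied to each row of X."""
    return X @ W @ W.T          # W W^T is symmetric, so row-wise this equals W W^T x

def E_in(X, W):
    """Average squared reconstruction error (1/N) sum_n ||x_n - W W^T x_n||^2."""
    return np.mean(np.sum((X - linear_autoencoder(X, W)) ** 2, axis=1))

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))   # N = 100 examples, d = 5
W = rng.normal(size=(5, 2))     # d x d_tilde with d_tilde = 2
print(E_in(X, W))
```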


Linear Autoencoder Error Function

E_in(h) = E_in(W) = (1/N) Σ_{n=1}^{N} ||x_n − W W^T x_n||², with d × d̃ matrix W

—analytic solution to minimize E_in? but a 4-th order polynomial of w_ij

let’s familiarize the problem with linear algebra (be brave! :-))

eigen-decompose W W^T = V Γ V^T
• d × d matrix V orthogonal: V V^T = V^T V = I_d
• d × d matrix Γ diagonal with ≤ d̃ non-zero components
• W W^T x_n = V Γ V^T x_n
• V^T(x_n): change of orthonormal basis (rotate or reflect)
• Γ(· · · ): set ≥ d − d̃ components to 0, and scale others
• V(· · · ): reconstruct by coefficients and basis (back-rotate)

x_n = V I V^T x_n: rotate and back-rotate cancel out

next: minimize E_in by optimizing Γ and V


The Optimal Γ

min_V min_Γ (1/N) Σ_{n=1}^{N} || V I V^T x_n − V Γ V^T x_n ||²,
where V I V^T x_n is just x_n and V Γ V^T x_n is W W^T x_n

back-rotate not affecting length: can ignore the outer V

min_Γ Σ ||(I − Γ)(some vector)||²: want many 0’s within (I − Γ)

optimal diagonal Γ with rank ≤ d̃: d̃ diagonal components 1, other components 0
=⇒ without loss of generality, Γ = [ I_d̃ 0 ; 0 0 ]

next: min_V Σ_{n=1}^{N} || [ 0 0 ; 0 I_{d−d̃} ] V^T x_n ||², where the block matrix is I − (optimal Γ)


The Optimal V

min_V Σ_{n=1}^{N} || [ 0 0 ; 0 I_{d−d̃} ] V^T x_n ||² ≡ max_V Σ_{n=1}^{N} || [ I_d̃ 0 ; 0 0 ] V^T x_n ||²

d̃ = 1: only the first row v^T of V^T matters
max_v Σ_{n=1}^{N} v^T x_n x_n^T v subject to v^T v = 1
• optimal v satisfies Σ_{n=1}^{N} x_n x_n^T v = λ v
—using Lagrange multiplier λ, remember? :-)
• optimal v: ‘topmost’ eigenvector of X^T X

general d̃: {v_j}_{j=1}^{d̃} = ‘topmost’ eigenvectors of X^T X
—optimal {w_j} = {v_j with γ_j = 1} = top eigenvectors

linear autoencoder: projecting to orthogonal patterns w_j that ‘match’ {x_n} most


Principal Component Analysis

Linear Autoencoder or PCA

1 let x̄ = (1/N) Σ_{n=1}^{N} x_n, and let x_n ← x_n − x̄
2 calculate d̃ top eigenvectors w_1, w_2, . . . , w_d̃ of X^T X
3 return feature transform Φ(x) = W(x − x̄)

• linear autoencoder: maximize Σ (magnitude after projection)²
• principal component analysis (PCA) from statistics: maximize Σ (variance after projection)

both useful for linear dimension reduction, though PCA more popular

linear dimension reduction: useful for data processing
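A minimal NumPy sketch of this three-step procedure; the toy data and d̃ = 1 are illustrative, and the eigenvectors are stored as columns of W, so the transform below equals the slide’s W(x − x̄).

```python
import numpy as np

def pca(X, d_tilde):
    """Linear autoencoder / PCA: center the data, then take the d_tilde
    'topmost' eigenvectors of X^T X."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)   # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :d_tilde]              # columns w_1, ..., w_d_tilde
    return W, x_bar

def transform(X, W, x_bar):
    """Feature transform: project x - x_bar onto the top eigenvectors
    (the slide's W(x - x_bar), with the w_j stored as columns here)."""
    return (X - x_bar) @ W

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.5], [0.5, 1.0]])   # correlated toy data
W, x_bar = pca(X, d_tilde=1)
Z = transform(X, W, x_bar)          # 1-dimensional features
print(W.ravel(), Z[:3])
```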


Fun Time

When solving the optimization problem max_v Σ_{n=1}^{N} v^T x_n x_n^T v subject to v^T v = 1, we know that the optimal v is the ‘topmost’ eigenvector that corresponds to the ‘topmost’ eigenvalue λ of X^T X. Then, what is the optimal objective value of the optimization problem?

1 λ¹
2 λ²
3 λ³
4 λ⁴

Reference Answer: 1

The objective value of the optimization problem is simply v^T X^T X v, which is λ v^T v, and you know what v^T v must be.



Summary

1 Embedding Numerous Features: Kernel Models

2 Combining Predictive Features: Aggregation Models

3 Distilling Implicit Features: Extraction Models

Lecture 13: Deep Learning
• Deep Neural Network: difficult hierarchical feature extraction problem
• Autoencoder: unsupervised NNet learning of representation
• Denoising Autoencoder: using noise as hints for regularization
• Principal Component Analysis: linear autoencoder variant for data processing

next: extracting ‘prototype’ instead of pattern
