## Machine Learning Techniques (機器學習技巧)

### Lecture 12: Deep Learning

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

## Agenda

### Lecture 12: Deep Learning

- Optimization and Overfitting
- Auto-Encoder
- Principal Component Analysis
- Denoising Auto-Encoder
- Deep Neural Network

Deep Learning Optimization and Overfitting

## Error Function of Neural Network

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\left(y_n - \theta\left(\cdots\,\theta\left(\sum_{j} w_{jk}^{(2)} \cdot \theta\left(\sum_{i} w_{ij}^{(1)} x_i\right)\right)\right)\right)^2$$

- generally **non-convex** when there are multiple hidden layers
  - not easy to reach the **global minimum**
  - GD/SGD with **backprop** only gives a **local minimum**
- different initial $\mathbf{w}_0$ ⟹ different **local minimum**
  - somewhat 'sensitive' to initial weights
  - **large weights** ⟹ **saturate** (small gradient)
  - advice: try **some random** & **small** ones

neural network (NNet): **difficult to optimize, but practically works**
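The saturation bullet above can be checked numerically. This is a minimal sketch (names and values are my own, not from the slides): with a tanh transfer function $\theta$, the gradient through a neuron is proportional to $\theta'(s) = 1 - \tanh^2(s)$, which collapses to zero when the pre-activation $s = \mathbf{w}^T\mathbf{x}$ is large, i.e. when the initial weights are large.

```python
import numpy as np

def tanh_grad(s):
    """Derivative of the tanh transfer function: theta'(s) = 1 - tanh(s)^2."""
    return 1.0 - np.tanh(s) ** 2

x = np.ones(10)                  # a fixed input pattern
w_small = np.full(10, 0.1)       # small initial weights (as the slide advises)
w_large = np.full(10, 10.0)      # large initial weights

# pre-activation s = w^T x at one hidden neuron
g_small = tanh_grad(w_small @ x)  # theta'(1)   ~ 0.42: gradient flows
g_large = tanh_grad(w_large @ x)  # theta'(100) ~ 0:    saturated, learning stalls

print(g_small, g_large)
```

This is why "some random & small" initial weights are the standard advice: they keep every neuron in the responsive region of $\theta$ at the start of training.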


## VC Dimension of Neural Networks

roughly, with **θ-like transfer functions**:

$$d_{\text{VC}} = O(D \log D), \quad \text{where } D = \text{\# of weights}$$

- can **implement 'anything'** given **enough neurons** (large $D$): no need for **many layers**?
- can **overfit** with **too many neurons**

NNet: **watch out for overfitting!**


## Regularization for Neural Network

basic choice: our old friend, the **weight-decay (L2) regularizer** $\Omega(\mathbf{w}) = \sum \left(w_{ij}^{(\ell)}\right)^2$

- 'shrinks' weights proportionally: large weight → large shrink; small weight → small shrink
- want $w_{ij}^{(\ell)} = 0$ (**sparse**) to effectively **decrease $d_{\text{VC}}$**
- L1 regularizer $\sum \left|w_{ij}^{(\ell)}\right|$ encourages sparsity, but is **not differentiable**
- **weight-elimination** ('scaled' L2) regularizer: large weight → medium shrink; small weight → medium shrink

$$\text{weight-elimination regularizer:} \quad \sum \frac{\left(w_{ij}^{(\ell)}\right)^2}{\beta^2 + \left(w_{ij}^{(\ell)}\right)^2}$$
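A small sketch (my own code, not from the lecture) comparing the shrinking behavior of the two regularizers. Differentiating one weight-elimination term $w^2/(\beta^2 + w^2)$ gives $2w\beta^2/(\beta^2 + w^2)^2$: unlike plain L2, whose gradient $2w$ grows with the weight, this gradient nearly vanishes for large weights while still pushing small weights toward zero, which is what encourages sparsity.

```python
import numpy as np

def l2_grad(w):
    """Gradient of the weight-decay (L2) term w^2: grows linearly with w."""
    return 2 * w

def weight_elim_grad(w, beta=1.0):
    """Gradient of w^2 / (beta^2 + w^2): 2*w*beta^2 / (beta^2 + w^2)^2."""
    return 2 * w * beta ** 2 / (beta ** 2 + w ** 2) ** 2

w = np.array([0.1, 10.0])   # one small weight, one large weight
g = weight_elim_grad(w)
# small weight:  ~0.196 (almost the same push as L2's 0.2 -> driven toward 0)
# large weight:  ~0.002 (versus L2's 20 -> barely shrunk at all)
print(g, l2_grad(w))
```

So the large weight survives while the small one is eliminated, effectively selecting a sparse subset of connections.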


## Yet Another Regularization: Early Stopping

**GD/SGD (backprop)** visits **more weight combinations** as $t$ increases

- smaller $t$ effectively **decreases $d_{\text{VC}}$**
- better to **'stop in the middle'**: **early stopping**

(Figure: $E_{\text{in}}$ and $E_{\text{out}}$ versus epochs; $E_{\text{in}}$ keeps decreasing while $E_{\text{out}}$ eventually turns upward, and early stopping picks the epoch near the minimum of $E_{\text{out}}$.)

when to stop? **validation!**
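A toy sketch of the early-stopping loop driven by validation error (the data, learning rate, and patience rule are my own illustrative choices, not from the slides): keep the weights with the best validation error seen so far, and stop once validation error has not improved for a while.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy regression data: y = 2x + noise, split into train / validation
X = rng.uniform(-1, 1, size=(200, 1))
y = 2 * X[:, 0] + 0.3 * rng.standard_normal(200)
Xtr, ytr, Xval, yval = X[:150], y[:150], X[150:], y[150:]

w = np.zeros(1)
best_w, best_val = w.copy(), np.inf
eta, patience, bad = 0.05, 20, 0

for t in range(10_000):
    # one GD step on the training squared error
    grad = 2 * Xtr.T @ (Xtr @ w - ytr) / len(ytr)
    w -= eta * grad
    # track validation error; remember the best weights seen so far
    val = np.mean((Xval @ w - yval) ** 2)
    if val < best_val - 1e-6:
        best_val, best_w, bad = val, w.copy(), 0
    else:
        bad += 1
        if bad >= patience:   # validation error stopped improving: stop early
            break

print(t, best_val, best_w)
```

The returned hypothesis is `best_w` (the validation minimizer), not the final `w`, mirroring the "stop in the middle" picture above.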


Deep Learning Auto Encoder

## Learning the Identity Function

the **identity function**: $\mathbf{f}(\mathbf{x}) = \mathbf{x}$

- a **vector function** composed of $f_i(\mathbf{x}) = x_i$
- learning each $f_i$: **regression** with data $(\mathbf{x}_1, y_1 = x_{1,i}),\ (\mathbf{x}_2, y_2 = x_{2,i}),\ \ldots,\ (\mathbf{x}_N, y_N = x_{N,i})$
- learning $\mathbf{f}$: learning all $f_i$ **jointly** with data $(\mathbf{x}_1, \mathbf{y}_1 = \mathbf{x}_1),\ (\mathbf{x}_2, \mathbf{y}_2 = \mathbf{x}_2),\ \ldots,\ (\mathbf{x}_N, \mathbf{y}_N = \mathbf{x}_N)$

but wait, why **learn** something **known** & **easily implemented**? :-)


## Why Learning Identity Function

if $g(\mathbf{x}) \approx f(\mathbf{x})$ using some **hidden structures** of the **observed data** $\mathbf{x}_n$:

- for unsupervised learning:
  - density estimation: larger density (structure match) when $g(\mathbf{x}) \approx \mathbf{x}$ better
  - outlier detection: those $\mathbf{x}$ where $g(\mathbf{x}) \not\approx \mathbf{x}$
  - that is, learning a **'typical' representation** of the data
- for supervised learning:
  - **hidden structure**: the essence of $\mathbf{x}$, which can be used as the transform $\Phi(\mathbf{x})$
  - that is, learning an **'informative' representation** of the data

**auto-encoder**: an NNet for learning the identity function


## Simple Auto-Encoder

a **simple** auto-encoder: a $d$–$\tilde{d}$–$d$ NNet

- $d$ outputs: backprop easily applies
- $\tilde{d} < d$: **compressed** representation; $\tilde{d} \ge d$: **[over]-complete** representation
- data: $(\mathbf{x}_1, \mathbf{y}_1 = \mathbf{x}_1),\ (\mathbf{x}_2, \mathbf{y}_2 = \mathbf{x}_2),\ \ldots,\ (\mathbf{x}_N, \mathbf{y}_N = \mathbf{x}_N)$; often categorized as an **unsupervised learning technique**
- if $\mathbf{x}$ contains binary bits:
  - a naïve solution exists (but is **unwanted**) when **[over]-complete**
  - **regularized** weights are needed in general
- sometimes constrain $w_{ij}^{(1)} = w_{ji}^{(2)}$ as 'regularization'; this makes calculating the gradient more **sophisticated**

**auto-encoder** for **representation learning**: the outputs of the **hidden neurons** serve as $\Phi(\mathbf{x})$
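A runnable sketch of such a $d$–$\tilde{d}$–$d$ auto-encoder with plain backprop (the architecture sizes, data, and learning rate are my own illustrative choices): targets equal inputs, a tanh hidden layer gives the representation $\Phi(\mathbf{x})$, and a linear output layer reconstructs $\mathbf{x}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# data in R^3 that really lives on a 2-d subspace, so d_tilde = 2 can capture it
Z = rng.standard_normal((200, 2))
X = Z @ rng.standard_normal((2, 3)) * 0.5

d, d_tilde = 3, 2
W1 = 0.1 * rng.standard_normal((d, d_tilde))   # encoder weights w^(1), small & random
W2 = 0.1 * rng.standard_normal((d_tilde, d))   # decoder weights w^(2)

def forward(X):
    A = np.tanh(X @ W1)     # hidden representation Phi(x)
    return A, A @ W2        # linear output layer reconstructs x

def mse(X, Xhat):
    return np.mean((X - Xhat) ** 2)

_, Xhat = forward(X)
err0 = mse(X, Xhat)

eta = 0.5
for _ in range(2000):
    A, Xhat = forward(X)
    G = 2 * (Xhat - X) / X.size         # dE/dXhat for the mean squared error
    dW2 = A.T @ G                       # backprop into the decoder
    dA = G @ W2.T
    dW1 = X.T @ (dA * (1 - A ** 2))     # tanh'(s) = 1 - tanh(s)^2
    W1 -= eta * dW1
    W2 -= eta * dW2

_, Xhat = forward(X)
err1 = mse(X, Xhat)
print(err0, err1)   # reconstruction error drops as the identity is learned
```

After training, `np.tanh(X @ W1)` is the compressed 2-d representation $\Phi(\mathbf{x})$ that the slide refers to.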


Deep Learning Principal Component Analysis

## Linear Auto-Encoder Hypothesis

$$h_k(\mathbf{x}) = \theta\left(\sum_{j} w_{jk}^{(2)} \cdot \theta\left(\sum_{i} w_{ij}^{(1)} x_i\right)\right)$$

consider three special conditions:

- constrain $w_{ij}^{(1)} = w_{ji}^{(2)} = w_{ij}$ as 'regularization', and let $\mathrm{W} = [w_{ij}]$ of size $d \times \tilde{d}$
- $\theta$ does nothing, i.e. it is the identity (like linear regression)
- $\tilde{d} < d$

linear auto-encoder hypothesis:

$$\mathbf{h}(\mathbf{x}) = \mathbf{x}^T \mathrm{W} \mathrm{W}^T$$


## Linear Auto-Encoder Error Function

$$\min_{\mathrm{W}} \; E_{\text{in}}(\mathrm{W}) = \frac{1}{N}\left\| \mathrm{X} - \mathrm{X}\mathrm{W}\mathrm{W}^T \right\|_F^2$$

let $\mathrm{W}\mathrm{W}^T = \mathrm{V}\Lambda\mathrm{V}^T$ (eigenvalue decomposition), such that $\mathrm{V}^T\mathrm{V} = \mathrm{V}\mathrm{V}^T = \mathrm{I}$ and $\Lambda$ is a diagonal matrix of rank at most $\tilde{d}$. Then

$$
\begin{aligned}
\left\| \mathrm{X} - \mathrm{X}\mathrm{V}\Lambda\mathrm{V}^T \right\|_F^2
&= \operatorname{trace}\!\Big( \big(\mathrm{X} - \mathrm{X}\mathrm{V}\Lambda\mathrm{V}^T\big)^T \big(\mathrm{X} - \mathrm{X}\mathrm{V}\Lambda\mathrm{V}^T\big) \Big) \\
&= \operatorname{trace}\!\Big( \mathrm{X}^T\mathrm{X} - \mathrm{X}^T\mathrm{X}\mathrm{V}\Lambda\mathrm{V}^T - \mathrm{V}\Lambda\mathrm{V}^T\mathrm{X}^T\mathrm{X} + \mathrm{V}\Lambda\mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V}\Lambda\mathrm{V}^T \Big) \\
&= \operatorname{trace}\!\Big( \mathrm{X}^T\mathrm{X} - \Lambda\mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V} - \Lambda\mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V} + \Lambda\mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V}\Lambda\mathrm{V}^T\mathrm{V} \Big) \\
&= \operatorname{trace}\!\Big( \mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V} - \Lambda\mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V} - \Lambda\mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V} + \Lambda^2\mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V} \Big) \\
&= \operatorname{trace}\!\Big( (\mathrm{I} - \Lambda)^2\, \mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V} \Big)
\end{aligned}
$$

where the third step uses the cyclic property of the trace, and the fourth uses $\mathrm{V}^T\mathrm{V} = \mathrm{V}\mathrm{V}^T = \mathrm{I}$ together with the cyclic property again.

## Linear Auto-Encoder Algorithm

$$\min_{\mathrm{V}, \Lambda} \; \operatorname{trace}\!\Big( (\mathrm{I} - \Lambda)^2\, \mathrm{V}^T\mathrm{X}^T\mathrm{X}\mathrm{V} \Big)$$

- the optimal rank-$\tilde{d}$ $\Lambda$ contains $\tilde{d}$ '1's and $d - \tilde{d}$ '0's
- let $\mathrm{X}^T\mathrm{X} = \mathrm{U}\Sigma\mathrm{U}^T$ (eigenvalue decomposition); $\mathrm{V} = \mathrm{U}$ with $\lambda_j = 1$ paired with the $\tilde{d}$ largest eigenvalues $\sigma_j$ is optimal, so that only the $d - \tilde{d}$ smallest eigenvalues remain in the residual
- hence the **optimal column vectors** $\mathbf{w}_j = \mathbf{v}_j$ = **top eigenvectors of** $\mathrm{X}^T\mathrm{X}$

optimal linear auto-encoding ≡ **principal component analysis** (PCA), with the $\mathbf{w}_j$ being the **principal components** of the unshifted data
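The algorithm above is a few lines of NumPy (a sketch on synthetic, unshifted data; variable names are my own). `np.linalg.eigh` returns eigenvalues in ascending order, so the top eigenvectors of $\mathrm{X}^T\mathrm{X}$ are the last columns; the residual $\|\mathrm{X} - \mathrm{X}\mathrm{W}\mathrm{W}^T\|_F^2$ should equal the sum of the $d - \tilde{d}$ discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, d_tilde = 100, 5, 2
# scale the coordinates so two directions clearly dominate
X = rng.standard_normal((N, d)) @ np.diag([3.0, 2.0, 0.3, 0.2, 0.1])

# eigendecomposition of X^T X (symmetric, so eigh applies; eigenvalues ascending)
sigma, U = np.linalg.eigh(X.T @ X)
W = U[:, ::-1][:, :d_tilde]      # top-d_tilde eigenvectors as columns w_j

Xhat = X @ W @ W.T               # linear auto-encoder reconstruction
err = np.sum((X - Xhat) ** 2)
print(err, np.sum(sigma[:d - d_tilde]))   # residual = sum of discarded eigenvalues
```

Note the data are deliberately left unshifted (no mean subtraction), matching the slide's remark that these are principal components of the *unshifted* data.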

Deep Learning Denoising Auto Encoder

## Simple Auto-Encoder Revisited

a **simple** auto-encoder: a $d$–$\tilde{d}$–$d$ NNet

- want: a **hidden structure** that captures the **essence** of $\mathbf{x}$
- a naïve solution exists (but is **unwanted**) when **[over]-complete**
- **regularized** weights are needed in general

regularization towards a more **robust hidden structure**?


## Idea of Denoising Auto-Encoder

a **robust hidden structure** should allow $g(\tilde{\mathbf{x}}) \approx \mathbf{x}$ even when $\tilde{\mathbf{x}}$ is slightly different from $\mathbf{x}$

- **denoising auto-encoder**: run the auto-encoder with data $(\tilde{\mathbf{x}}_1, \mathbf{y}_1 = \mathbf{x}_1),\ (\tilde{\mathbf{x}}_2, \mathbf{y}_2 = \mathbf{x}_2),\ \ldots,\ (\tilde{\mathbf{x}}_N, \mathbf{y}_N = \mathbf{x}_N)$, where $\tilde{\mathbf{x}}_n = \mathbf{x}_n + $ **artificial noise**
- PCA auto-encoder + **Gaussian noise**:

$$\min_{\mathrm{W}} \; E_{\text{in}}(\mathrm{W}) = \frac{1}{N} \left\| \mathrm{X} - (\mathrm{X} + \textbf{noise})\, \mathrm{W}\mathrm{W}^T \right\|_F^2$$

which is simply an L2-regularized PCA

**artificial noise** as **regularization**! Practically also useful for other types of NNet.
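A small sketch of the robustness property $g(\tilde{\mathbf{x}}) \approx \mathbf{x}$ (synthetic data and noise level are my own choices): when the data have a genuine low-dimensional hidden structure, reconstructing a noise-corrupted input through the PCA auto-encoder lands closer to the clean $\mathbf{x}$ than the corrupted input itself, because noise outside the hidden subspace is projected away.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, d_tilde = 500, 5, 2

# data with genuine 2-d hidden structure inside R^5
X = rng.standard_normal((N, d_tilde)) @ rng.standard_normal((d_tilde, d))

# linear (PCA) auto-encoder: top-d_tilde eigenvectors of X^T X
sigma, U = np.linalg.eigh(X.T @ X)
W = U[:, ::-1][:, :d_tilde]

# corrupt the inputs with artificial Gaussian noise; the targets stay clean
X_tilde = X + 0.1 * rng.standard_normal((N, d))

err_raw = np.mean((X - X_tilde) ** 2)            # doing nothing: full noise level
err_ae = np.mean((X - X_tilde @ W @ W.T) ** 2)   # g(x_tilde) via the hidden structure
print(err_raw, err_ae)   # the auto-encoder output is closer to the clean x
```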


Deep Learning Deep Neural Network

Final remark on hidden layers (figure from Learning From Data, Lecture 10): each hidden neuron applies the transfer function $\theta(s)$ to a weighted combination of its inputs, so the hidden layers together form a **learned nonlinear transform** of $\mathbf{x}$. What is the interpretation of that transform?


## Shallow versus Deep Structures

shallow: few hidden layers; deep: many hidden layers

**Shallow**

- efficient to train
- powerful if given enough neurons

**Deep**

- challenging to train
- needs more structural (model) decisions
- 'meaningful' features?

deep structures (deep learning) have **re-gained attention recently**


## Key Techniques behind Deep Learning

- (usually) **unsupervised pre-training** between hidden layers, such as with simple/denoising auto-encoders; this views each hidden layer as **'condensing'** a low-level representation into a high-level one
- **fine-tune with backprop** after initializing with those 'good' weights, because **direct backprop may get stuck more easily**
- speed-up: better optimization algorithms, and faster **GPU**s
- the generalization issue is less serious with **big (enough) data**

currently very useful for **vision and speech recognition**
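The greedy layer-wise pre-training idea above can be sketched in a few lines. This is a minimal illustration under a simplifying assumption of my own: each layer is pre-trained as a *linear* (PCA) auto-encoder on the previous layer's outputs, standing in for the simple/denoising auto-encoders the slide mentions; the collected weights would then initialize a deep network for backprop fine-tuning.

```python
import numpy as np

def pca_autoencoder(H, d_tilde):
    """Pre-train one layer as a linear auto-encoder:
    return the top-d_tilde eigenvectors of H^T H as that layer's weights."""
    _, U = np.linalg.eigh(H.T @ H)   # eigenvalues ascending
    return U[:, ::-1][:, :d_tilde]

rng = np.random.default_rng(4)
X = rng.standard_normal((300, 8))    # raw low-level representation

# greedy layer-wise pre-training: each layer condenses the previous representation
layer_sizes = [6, 4, 2]
weights, H = [], X
for d_tilde in layer_sizes:
    W = pca_autoencoder(H, d_tilde)  # unsupervised training of this layer alone
    weights.append(W)
    H = H @ W                        # hidden outputs feed the next layer

# 'weights' would now initialize a deep NNet, to be fine-tuned with backprop
print([W.shape for W in weights], H.shape)
```

Each pre-trained layer only ever sees the representation produced by the layers below it, which is exactly the "condensing low-level into high-level" picture of the first bullet.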
