
(1)

## Machine Learning Techniques (機器學習技法)

### Lecture 12: Deep Learning

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

(2)

Deep Learning

## Agenda

### Optimization and Overfitting
### Auto Encoder
### Principal Component Analysis
### Denoising Auto Encoder
### Deep Neural Network

(3)

Deep Learning Optimization and Overfitting

## Error Function of Neural Network

$$
E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \left( y_n - \theta\Big( \cdots\, \theta\Big( \sum_{j} w_{jk}\, \theta\Big( \sum_{i} w_{ij}\, x_{n,i} \Big) \Big) \cdots \Big) \right)^2
$$

generally

### non-convex

when there are multiple hidden layers

• different initial weights =⇒ different local optimum
• advice: try some random & small ones

neural network (NNet):

### difficult to optimize, but practically works
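To make the nested-θ formula concrete, here is a minimal sketch (assuming NumPy, tanh as θ, and hypothetical helper names `nnet_forward` / `E_in`) that evaluates the squared error of a two-layer tanh NNet under two different random initializations:

```python
import numpy as np

def nnet_forward(x, weights):
    """Forward pass of a tanh NNet: repeatedly apply theta to each layer's weighted sums."""
    a = x
    for W in weights:
        a = np.tanh(W.T @ a)   # theta = tanh here
    return a

def E_in(weights, X, y):
    """Average squared error (1/N) sum_n (y_n - NNet(x_n))^2."""
    preds = np.array([nnet_forward(x, weights)[0] for x in X])
    return np.mean((y - preds) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sign(X[:, 0])

# two different random & small initializations, as the slide advises trying
w1 = [rng.normal(scale=0.1, size=(3, 4)), rng.normal(scale=0.1, size=(4, 1))]
w2 = [rng.normal(scale=0.1, size=(3, 4)), rng.normal(scale=0.1, size=(4, 1))]
print(E_in(w1, X, y), E_in(w2, X, y))
```

Running it prints two different error values, illustrating that different initial weights sit in different parts of the non-convex landscape.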

(4)


## VC Dimension of Neural Networks

roughly, with D = the number of weights in the NNet,

$$d_{\text{VC}} = O(D \log D)$$

• can approximate anything if enough neurons (D large)
  —no need for a more sophisticated model
• can overfit if D is too large

NNet:

### watch out for overfitting!
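Since the bound scales with the number of weights D, a quick way to gauge model complexity is to count D. A small sketch (the helper name `count_weights` is made up here) for fully-connected layers with one bias unit per layer:

```python
def count_weights(layer_sizes):
    """Total number of weights D in a fully-connected NNet.

    layer_sizes = [d, d1, d2, ..., 1]; with a bias unit in each layer,
    layer l contributes (d_{l-1} + 1) * d_l weights.
    """
    return sum((a + 1) * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# a 3-5-1 NNet: (3+1)*5 + (5+1)*1 = 26 weights
print(count_weights([3, 5, 1]))  # -> 26
```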

(5)

## Regularization for Neural Network

basic choice: our old friend, the

### weight-decay

(L2) regularizer

$$\Omega(\mathbf{w}) = \sum_{\ell, i, j} \left( w_{ij}^{(\ell)} \right)^2$$

• want some $w_{ij}^{(\ell)} \approx 0$ to effectively decrease the VC dimension
• weight-elimination ('scaled' L2) regularizer:

$$\sum_{\ell, i, j} \frac{\left( w_{ij}^{(\ell)} \right)^2}{1 + \left( w_{ij}^{(\ell)} \right)^2}$$
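A sketch of the weight-elimination regularizer and its gradient (assuming NumPy; the function names are mine), showing that each large weight contributes at most 1 and is therefore shrunk less than under plain L2:

```python
import numpy as np

def weight_elimination(w):
    """Weight-elimination regularizer: sum of w^2 / (1 + w^2) over all weights."""
    w = np.asarray(w, dtype=float)
    return float(np.sum(w**2 / (1.0 + w**2)))

def weight_elimination_grad(w):
    """Gradient of each term: d/dw [w^2/(1+w^2)] = 2w / (1+w^2)^2."""
    w = np.asarray(w, dtype=float)
    return 2.0 * w / (1.0 + w**2) ** 2

# small weights behave almost like plain L2; a huge weight contributes < 1
print(weight_elimination([0.1, 10.0]))
```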

(6)

## Yet Another Regularization: Early Stopping

gradient descent visits more and more hypotheses as the number of iterations grows, so the effective VC dimension grows with training time —better to stop earlier:

### early stopping

[figure: E_in (bottom curve) and E_out (top curve) versus epochs (0–10000); E_out reaches its minimum before E_in does, marking the early-stopping point]

when to stop?

### validation!
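The early-stopping rule can be sketched as a generic training loop (the function names and the patience heuristic are assumptions, not from the lecture): keep the weights with the best validation error seen so far, and stop once validation stops improving:

```python
def train_with_early_stopping(step, val_error, max_epochs=1000, patience=50):
    """Generic early-stopping loop (a sketch).

    step(epoch)  -> current weights after one more epoch of training
    val_error(w) -> validation error of weights w
    """
    best_w, best_err, since_best = None, float("inf"), 0
    for epoch in range(max_epochs):
        w = step(epoch)
        err = val_error(w)
        if err < best_err:
            best_w, best_err, since_best = w, err, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # validation error stopped improving: stop here
    return best_w, best_err
```

For example, with a toy "model" whose validation error is minimized mid-training, the loop returns the weights from the best epoch rather than the last one.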

(7)


## Fun Time

(8)

Deep Learning Auto Encoder

## Learning the Identity Function

identity function: a

### vector

function g(x) = x, composed of components g_i(x) = x_i

• learning a target f: use data (x_1, y_1), (x_2, y_2), ..., (x_N, y_N) with y_n = f(x_n)
• learning the identity: use data (x_1, y_1), (x_2, y_2), ..., (x_N, y_N) with y_n = x_n

but wait, why learn something that is already known and

### easily implemented? :-)

(9)

## Why Learning the Identity Function

if we can achieve

### g(x) ≈ f(x) using some hidden

structures on the observed data, then:

• for unsupervised learning:
  —learning a 'typical' representation of the data
• for supervised learning:
  —learning an 'informative' representation of the data

### auto-encoder:

an NNet for learning the identity function

(10)

## Simple Auto-Encoder

auto-encoder: a

### d–d̃–d NNet

with d̃ < d (the hidden layer forms a compressed representation; the output layer reconstructs the original representation)

data: (x_1, y_1 = x_1), (x_2, y_2 = x_2), ..., (x_N, **y_N = x_N**)
—often categorized as unsupervised learning

• if the **x_n** contain binary bits, the reconstruction can be treated as per-bit binary classification
• sometimes constrain **w_ij^(1) = w_ji^(2)** as 'regularization'
  —makes the gradient computation more sophisticated
• for further learning: the outputs of the hidden layer serve as the feature transform

### Φ(x)
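A minimal sketch of the d–d̃–d structure with tied weights (assuming NumPy and tanh hidden units; the weights here are random, not trained), showing how the hidden-layer outputs play the role of Φ(x):

```python
import numpy as np

rng = np.random.default_rng(1)

d, d_tilde = 5, 2                              # d–d̃–d auto-encoder, with d̃ < d
W1 = rng.normal(scale=0.1, size=(d, d_tilde))  # encoder weights
W2 = W1.T                                      # tied weights w_ij^(1) = w_ji^(2)

def encode(X):
    """Hidden-layer outputs: the compressed representation Phi(x)."""
    return np.tanh(X @ W1)

def decode(H):
    """Output layer: reconstruct d-dimensional vectors from d̃-dim codes."""
    return H @ W2

X = rng.normal(size=(10, d))
Phi = encode(X)            # shape (10, d_tilde): usable as a feature transform
recon = decode(Phi)        # shape (10, d): approximation of X
err = np.mean((X - recon) ** 2)
```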

(11)


## Fun Time

(12)

Deep Learning Principal Component Analysis

## Linear Auto-Encoder Hypothesis

$$h_k(\mathbf{x}) = \sum_{j} w_{jk} \left( \sum_{i} w_{ij}\, x_i \right)$$

consider three special conditions:

• constrain **w_ij^(1) = w_ji^(2) = w_ij** as 'regularization'
  —let W = [w_ij] of size d × d̃
• **θ** does nothing (like linear regression)
• d̃ < d

linear auto-encoder hypothesis:

### h(x) = W Wᵀ x
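A small sketch of the hypothesis h(x) = W Wᵀ x (assuming NumPy): encode with Wᵀ, decode with W. Note that when W has orthonormal columns, W Wᵀ is a projection, which foreshadows the PCA connection:

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_tilde = 4, 2
W = rng.normal(size=(d, d_tilde))   # one weight matrix, shared by weight tying

def h(x, W):
    """Linear auto-encoder hypothesis h(x) = W W^T x."""
    return W @ (W.T @ x)            # encode with W^T, decode with W

x = rng.normal(size=d)
out = h(x, W)                       # a d-dimensional reconstruction of x

# with orthonormal columns, W W^T is a projection: applying h twice changes nothing
Q, _ = np.linalg.qr(rng.normal(size=(d, d_tilde)))
```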

(13)

## Linear Auto-Encoder Error Function

$$\min_{W}\; E_{\text{in}}(W) = \frac{1}{N}\, \big\| X - X W W^{T} \big\|_{F}^{2}$$

let $W W^{T} = V \Lambda V^{T}$ (eigenvalue decomposition) such that $V^{T} V = V V^{T} = I$ and $\Lambda$ is a diagonal matrix of

### rank at most d̃

optimizing over $\Lambda$ first shows that the optimal $\Lambda$ has diagonal entries in $\{0, 1\}$, so $\Lambda^{2} = \Lambda$; then, using the cyclic property of the trace,

$$
\begin{aligned}
\big\| X - X V \Lambda V^{T} \big\|_{F}^{2}
&= \operatorname{trace}\Big( \big( X - X V \Lambda V^{T} \big)^{T} \big( X - X V \Lambda V^{T} \big) \Big) \\
&= \operatorname{trace}\Big( X^{T} X - X^{T} X V \Lambda V^{T} - V \Lambda V^{T} X^{T} X + V \Lambda V^{T} X^{T} X V \Lambda V^{T} \Big) \\
&= \operatorname{trace}\Big( V^{T} X^{T} X V - \Lambda V^{T} X^{T} X V - \Lambda V^{T} X^{T} X V + \Lambda V^{T} X^{T} X V \Big) \\
&= \operatorname{trace}\Big( (I - \Lambda)\, V^{T} X^{T} X V \Big)
\end{aligned}
$$

(14)

## Linear Auto-Encoder Algorithm

$$\min_{V}\; \operatorname{trace}\Big( (I - \Lambda)\, V^{T} X^{T} X V \Big)$$

• the optimal Λ contains d̃ ones and d − d̃ zeros
• let XᵀX = **U Σ Uᵀ** (eigenvalue decomposition); the trace is minimized by matching the nonzero entries of I − Λ with the (smallest d − d̃ eigenvalues) of XᵀX ⇐⇒ matching the ones of Λ with the largest d̃ eigenvalues

so the

### optimal column vectors w_j = v_j = top eigenvectors of XᵀX

optimal linear auto-encoding ≡ principal component analysis (PCA), with the w_j being the

### principal components

of the unshifted data
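The algorithm can be sketched directly with an eigenvalue decomposition of XᵀX (assuming NumPy; `linear_autoencoder` is my own name, the data are synthetic, and the data are left unshifted as in the slides):

```python
import numpy as np

def linear_autoencoder(X, d_tilde):
    """Columns of W = top-d̃ eigenvectors of X^T X (a PCA-style sketch)."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
    return eigvecs[:, -d_tilde:]                 # keep the top d̃ eigenvectors

rng = np.random.default_rng(3)
# data lying near a 2-dim subspace of R^5, plus small noise
Z = rng.normal(size=(100, 2))
X = Z @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(100, 5))

W = linear_autoencoder(X, d_tilde=2)
recon = X @ W @ W.T            # h(x) = W W^T x applied to every row
err = np.mean((X - recon) ** 2)
```

The reconstruction error is tiny here because the data really do lie near a 2-dimensional subspace, which the top eigenvectors recover.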

(15)


## Fun Time

(16)

Deep Learning Denoising Auto Encoder

## Simple Auto-Encoder Revisited

auto-encoder: a d–d̃–d NNet

• want: the hidden layer to capture the essence of the data
• a naïve solution exists (but it need not capture anything useful)
• regularized weights are needed in general

which regularization pushes towards a more

### robust hidden structure?

(17)

## Idea of Denoising Auto-Encoder

robustness: should allow g(x̃) ≈ x even when

### x̃

is slightly different from x

• denoising auto-encoder: run the auto-encoder with data (x̃_1, y_1 = x_1), (x̃_2, y_2 = x_2), ..., (x̃_N, y_N = x_N), where x̃_n = x_n + artificial noise
• PCA auto-encoder + noise:

$$\min_{W}\; E_{\text{in}}(W) = \frac{1}{N}\, \big\| X - (X + \text{noise})\, W W^{T} \big\|_{F}^{2}$$

—simply L2-regularized PCA: noise as

### regularization!

—practically also useful for other types of NNet
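A sketch of the linear denoising setup (assuming NumPy, synthetic data, and Gaussian artificial noise): learn W from the noisy inputs, but measure reconstruction against the clean targets y_n = x_n:

```python
import numpy as np

rng = np.random.default_rng(4)
Z = rng.normal(size=(200, 2))
X = Z @ rng.normal(size=(2, 6))                  # data near a 2-dim subspace of R^6

sigma = 0.1
X_noisy = X + sigma * rng.normal(size=X.shape)   # x̃_n = x_n + artificial noise

# linear denoising auto-encoder sketch: fit W on the noisy inputs
eigvals, eigvecs = np.linalg.eigh(X_noisy.T @ X_noisy)
W = eigvecs[:, -2:]                              # top-2 eigenvectors

recon = X_noisy @ W @ W.T                        # decode the noisy inputs
denoise_err = np.mean((X - recon) ** 2)          # error versus the clean targets
noise_level = np.mean((X - X_noisy) ** 2)        # ≈ sigma^2
```

Projecting onto the learned subspace discards the noise components orthogonal to it, so the reconstruction ends up closer to the clean data than the noisy input was.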

(18)


## Fun Time

(19)

Deep Learning Deep Neural Network

Final remark: hidden layers

[figure: a one-hidden-layer NNet with inputs 1, x_1, x_2, ..., x_d, weighted sums s, transfer θ(s), and output h(x)]

the hidden layers compute a learned nonlinear transform —interpretation?

(Learning From Data, Lecture 10, 21/21)

(20)

## Shallow versus Deep Structures

shallow: few hidden layers; deep: many hidden layers

shallow structure:

• more efficient to train
• powerful if given enough neurons

deep structure:

• challenging to train
• needing more structural (model) decisions
• arguably more 'meaningful'?

deep structure (deep learning):

### re-gained attention recently

(21)

## Key Techniques behind Deep Learning

• (usually)

### unsupervised pre-training

between hidden layers, such as with a simple/denoising auto-encoder
  —viewing hidden layers as

### 'condensing'

a low-level representation into a high-level one

• **fine-tune with backprop** after initializing with those 'good' weights
  —because good initial weights help the difficult non-convex optimization
• speed-up: better optimization algorithms, and faster hardware
• the generalization issue is less serious with

### big (enough) data

currently very useful for

### vision and speech recognition
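Greedy layer-wise pre-training can be sketched by reusing the linear auto-encoder from the PCA section as the per-layer trainer (an assumption for illustration; the lecture's pre-training would use simple/denoising auto-encoders, and the resulting weights would then initialize backprop fine-tuning):

```python
import numpy as np

def pretrain_layer(X, d_tilde):
    """One greedy pre-training step (linear-auto-encoder sketch):
    learn W for this layer, then 'condense' X into its hidden representation."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    W = eigvecs[:, -d_tilde:]          # top-d̃ eigenvectors, as in the PCA section
    return W, X @ W                    # layer weights and condensed representation

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 8))

# greedy layer-by-layer pre-training: hidden sizes 8 -> 4 -> 2
weights = []
H = X
for d_tilde in (4, 2):
    W, H = pretrain_layer(H, d_tilde)
    weights.append(W)                  # these W's would initialize fine-tuning
```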

(22)


## Fun Time

(23)


## Summary

• Optimization and Overfitting: non-convex error, so try random small initial weights; regularize with weight decay, weight elimination, or early stopping
• Auto Encoder: an NNet for learning the identity function; hidden-layer outputs as a representation Φ(x)
• Principal Component Analysis: the optimal linear auto-encoder projects onto the top eigenvectors of XᵀX
• Denoising Auto Encoder: artificial input noise as regularization
• Deep Neural Network: unsupervised pre-training plus fine-tuning with backprop
