Machine Learning Techniques
(機器學習技法)

Lecture 13: Deep Learning

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)

Deep Learning

Roadmap
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

Lecture 12: Neural Network
automatic pattern feature extraction from layers of neurons with backprop for GD/SGD

Lecture 13: Deep Learning
• Deep Neural Network
• Autoencoder
• Denoising Autoencoder
• Principal Component Analysis

Deep Learning / Deep Neural Network

Physical Interpretation of NNet Revisited

[figure: a multi-layer NNet with inputs x_0 = 1, x_1, x_2, . . . , x_d, bias units +1, tanh neurons, and weights w_ij^(1), w_jk^(2), w_kq^(3); for example, the score s_3^(2) passes through tanh to give x_3^(2)]

each layer: pattern feature extracted from data, remember? :-)

how many neurons? how many layers? —more generally, what structure?
• subjectively, your design!
• objectively, validation, maybe?

structural decisions: key issue for applying NNet
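As a concrete illustration of the 'objectively, validation' route for structural decisions, here is a minimal sketch (not from the lecture; the toy data, the candidate widths, and the use of scikit-learn's MLPClassifier are all assumptions) that picks the hidden-layer width by validation accuracy:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))              # toy features; replace with real data
    y = (X[:, 0] * X[:, 1] > 0).astype(int)      # toy nonlinear target
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    best_width, best_acc = None, -1.0
    for width in (2, 4, 8, 16, 32):              # candidate structural decisions
        net = MLPClassifier(hidden_layer_sizes=(width,), activation='tanh',
                            max_iter=2000, random_state=0).fit(X_tr, y_tr)
        acc = net.score(X_val, y_val)            # judge each structure on validation data
        if acc > best_acc:
            best_width, best_acc = width, acc
    print(best_width, round(best_acc, 3))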


Deep Learning / Deep Neural Network

Shallow versus Deep Neural Networks

shallow: few (hidden) layers; deep: many layers

Shallow NNet
• more efficient to train (✓)
• simpler structural decisions (✓)
• theoretically powerful enough (✓)

Deep NNet
• challenging to train (×)
• sophisticated structural decisions (×)
• 'arbitrarily' powerful (✓)
• more 'meaningful'? (see next slide)

deep NNet (deep learning): gaining attention in recent years


Deep Learning / Deep Neural Network

Meaningfulness of Deep Learning

[figure: deciding 'is it a 1?' (z_1) versus 'is it a 5?' (z_5) from shared mid-level features φ_1, . . . , φ_6, connected by positive and negative weights]

• 'less burden' for each layer: simple to complex features
• natural for difficult learning task with raw features, like vision

deep NNet: currently popular in vision/speech/. . .


Deep Learning / Deep Neural Network

Challenges and Key Techniques for Deep Learning

difficult structural decisions:
• subjective with domain knowledge: like convolutional NNet for images

high model complexity:
• no big worries if big enough data
• regularization towards noise tolerance: like
  • dropout (tolerant when network corrupted)
  • denoising (tolerant when input corrupted)

hard optimization problem:
• careful initialization to avoid bad local minima: called pre-training

huge computational complexity (worsened with big data):
• novel hardware/architecture: like mini-batch with GPU

IMHO, careful regularization and initialization are key techniques
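As a toy illustration of the two noise-tolerance ideas named above (not the lecture's algorithm; the keep probability and noise scale are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    a = np.tanh(rng.normal(size=8))       # some hidden-layer activations
    x = rng.normal(size=8)                # some raw input vector

    # dropout: corrupt the *network* by randomly zeroing hidden units during training
    keep_prob = 0.5                       # hypothetical keep probability
    mask = rng.random(a.shape) < keep_prob
    a_dropped = a * mask / keep_prob      # inverted scaling keeps the expected activation

    # denoising: corrupt the *input* and ask the model to reconstruct the clean x
    x_noisy = x + rng.normal(scale=0.1, size=x.shape)   # train on x_noisy, target clean x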


Deep Learning / Deep Neural Network

A Two-Step Deep Learning Framework

Simple Deep Learning
1 for ℓ = 1, . . . , L, pre-train {w_ij^(ℓ)} assuming w^(1), . . . , w^(ℓ−1) fixed
  [figure: layer-by-layer pre-training, panels (a)-(d)]
2 train with backprop on pre-trained NNet to fine-tune all {w_ij^(ℓ)}

will focus on simplest pre-training technique along with regularization


Deep Learning / Deep Neural Network

Fun Time

For a deep NNet for written character recognition from raw pixels, which type of features are more likely extracted after the first hidden layer?
1 pixels
2 strokes
3 parts
4 digits

Reference Answer: 2
Simple strokes are likely the 'next-level' features that can be extracted from raw pixels.


Deep Learning / Autoencoder

Information-Preserving Encoding

• weights: feature transform, i.e. encoding
• good weights: information-preserving encoding
  —next layer: same info. with different representation
• information-preserving: decode accurately after encoding

[figure: deciding 'is it a 1?' (z_1) versus 'is it a 5?' (z_5) from shared features φ_1, . . . , φ_6 with positive and negative weights]

idea: pre-train weights towards information-preserving encoding


Deep Learning / Autoencoder

Information-Preserving Neural Network

[figure: a d-d̃-d NNet feeding x_0 = 1, x_1, x_2, x_3, . . . , x_d through tanh hidden units back to outputs ≈ x_1, ≈ x_2, ≈ x_3, . . . , ≈ x_d, with weights w_ij^(1) and w_ji^(2)]

• autoencoder: d-d̃-d NNet with goal g_i(x) ≈ x_i
  —learning to approximate the identity function
• w_ij^(1): encoding weights; w_ji^(2): decoding weights

why approximate the identity function?


Deep Learning / Autoencoder

Usefulness of Approximating Identity Function

if g(x) ≈ x using some hidden structures on the observed data x_n

for supervised learning:
• hidden structure (essence) of x can be used as a reasonable transform Φ(x)
  —learning an 'informative' representation of data

for unsupervised learning:
• density estimation: estimated density larger (structure match) when g(x) ≈ x
• outlier detection: those x where g(x) ≉ x
  —learning a 'typical' representation of data

autoencoder: representation learning through approximating the identity function
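For instance, the outlier-detection use above can be written as a tiny sketch (g is a hypothetical reconstruction function, e.g. a trained autoencoder; this is an illustration, not the lecture's code):

    import numpy as np

    def outlier_scores(X, g):
        """g maps an (N, d) array to its (N, d) reconstruction; a large score means g(x) is far from x."""
        return np.sum((g(X) - X) ** 2, axis=1)

    def top_outliers(X, g, k=5):
        return np.argsort(outlier_scores(X, g))[-k:]   # indices of the k most atypical points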


Deep Learning / Autoencoder

Basic Autoencoder

basic autoencoder: d-d̃-d NNet with error function Σ_{i=1}^{d} (g_i(x) − x_i)^2

• backprop easily applies; shallow and easy to train
• usually d̃ < d: compressed representation
• data: {(x_1, y_1 = x_1), (x_2, y_2 = x_2), . . . , (x_N, y_N = x_N)}
  —often categorized as unsupervised learning technique
• sometimes constrain w_ij^(1) = w_ji^(2) as regularization
  —more sophisticated in calculating gradient

basic autoencoder in basic deep learning:
{w_ij^(1)} taken as shallowly pre-trained weights
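A minimal numpy sketch of the basic autoencoder above (not the course's code; the linear output layer, learning rate, and epoch count are simplifying assumptions), trained by backprop on the squared reconstruction error:

    import numpy as np

    def train_autoencoder(X, d_tilde, eta=0.01, epochs=500, seed=0):
        """Basic d-d~-d autoencoder: tanh hidden layer, linear output (a simplifying
        assumption), squared reconstruction error, plain gradient descent."""
        rng = np.random.default_rng(seed)
        N, d = X.shape
        W1 = rng.normal(scale=0.1, size=(d, d_tilde))   # encoding weights w_ij^(1)
        b1 = np.zeros(d_tilde)
        W2 = rng.normal(scale=0.1, size=(d_tilde, d))   # decoding weights w_ji^(2)
        b2 = np.zeros(d)
        for _ in range(epochs):
            H = np.tanh(X @ W1 + b1)           # encode
            G = H @ W2 + b2                    # decode: g(x)
            dG = 2.0 * (G - X) / N             # gradient of mean of sum_i (g_i(x) - x_i)^2
            dW2, db2 = H.T @ dG, dG.sum(axis=0)
            dH = (dG @ W2.T) * (1.0 - H ** 2)  # backprop through tanh
            dW1, db1 = X.T @ dH, dH.sum(axis=0)
            W1 -= eta * dW1
            b1 -= eta * db1
            W2 -= eta * dW2
            b2 -= eta * db2
        return W1, b1, W2, b2

    # usage sketch: X is an (N, d) data matrix; the compressed representation is
    #   W1, b1, W2, b2 = train_autoencoder(X, d_tilde=3)
    #   Phi = np.tanh(X @ W1 + b1)             # d~ < d: compressed features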


Deep Learning / Autoencoder

Pre-Training with Autoencoders

Deep Learning with Autoencoders
1 for ℓ = 1, . . . , L, pre-train {w_ij^(ℓ)} assuming w^(1), . . . , w^(ℓ−1) fixed,
  by training a basic autoencoder on {x_n^(ℓ−1)} with d̃ = d^(ℓ)
  [figure: layer-by-layer pre-training, panels (a)-(d)]
2 train with backprop on pre-trained NNet to fine-tune all {w_ij^(ℓ)}

many successful pre-training techniques take 'fancier' autoencoders with different architectures and regularization schemes
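The scheme above can be sketched as a short loop (a sketch only, reusing the hypothetical train_autoencoder function from the Basic Autoencoder section; step 2, joint fine-tuning with backprop, is omitted):

    import numpy as np

    def pretrain_deep_net(X, layer_widths, eta=0.01, epochs=500):
        """Greedy layer-wise pre-training, layer_widths = [d(1), ..., d(L)].
        Assumes the train_autoencoder sketch above is in scope."""
        weights = []
        Z = X                                     # x^(0): the raw inputs
        for d_l in layer_widths:
            # step 1: pre-train {w_ij^(l)} with a basic autoencoder on {x_n^(l-1)}, d~ = d^(l)
            W1, b1, _, _ = train_autoencoder(Z, d_tilde=d_l, eta=eta, epochs=epochs)
            weights.append((W1, b1))              # keep only the encoding weights
            Z = np.tanh(Z @ W1 + b1)              # x^(l): inputs for pre-training the next layer
        return weights   # step 2 (fine-tuning every layer jointly with backprop) not shown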


Deep Learning / Autoencoder

Fun Time

Suppose training a d-d̃-d autoencoder with backprop takes approximately c · d · d̃ seconds. Then, what is the total number of seconds needed for pre-training a d-d^(1)-d^(2)-d^(3)-1 deep NNet?
1 c · (d + d^(1) + d^(2) + d^(3) + 1)
2 c · (d · d^(1) · d^(2) · d^(3) · 1)
3 c · (d · d^(1) + d^(1) · d^(2) + d^(2) · d^(3) + d^(3) · 1)
4 c · (d · d^(1) · d^(1) · d^(2) · d^(2) · d^(3) · d^(3) · 1)

Reference Answer: 3
Each c · d^(ℓ−1) · d^(ℓ) represents the time for pre-training with one autoencoder to determine one layer of the weights.
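For concreteness, with hypothetical widths d = 784, d^(1) = 500, d^(2) = 250, d^(3) = 100 (numbers not from the lecture), answer 3 gives c · (784·500 + 500·250 + 250·100 + 100·1) = c · 542,100 seconds of pre-training:

    c, d, d1, d2, d3 = 1.0, 784, 500, 250, 100
    print(c * (d*d1 + d1*d2 + d2*d3 + d3*1))   # 542100.0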


Deep Learning / Denoising Autoencoder

Regularization in Deep Learning

[figure: a multi-layer NNet with inputs x_0 = 1, x_1, x_2, . . . , x_d, tanh neurons, and weights w_ij^(1), w_jk^(2), w_kq^(3)]

watch out for overfitting, remember? :-)

high model complexity: regularization needed
• structural decisions/constraints
• weight decay or weight elimination regularizers
• early stopping

next: another regularization technique
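Two of the regularizers listed above can be sketched as follows (a minimal sketch; train_step and val_error are hypothetical callables, and the hyperparameters are arbitrary):

    def weight_decay_step(W, grad, eta=0.01, lam=1e-3):
        # weight decay: descend on E_in(W) + lam * ||W||^2, shrinking weights every step
        return W - eta * (grad + 2.0 * lam * W)

    def early_stopping(W, train_step, val_error, T=10000):
        # early stopping: run T steps but return the weights with the best validation error
        best_W, best_err = W, val_error(W)
        for _ in range(T):
            W = train_step(W)                 # one GD/SGD update, e.g. built from backprop
            err = val_error(W)
            if err < best_err:
                best_W, best_err = W, err
        return best_W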

