
Machine Learning Techniques (機器學習技法)

Lecture 14: Radial Basis Function Network

Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)

Roadmap

1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

Lecture 13: Deep Learning
pre-training with denoising autoencoder (non-linear PCA) and fine-tuning with backprop for NNet with many layers

Lecture 14: Radial Basis Function Network
RBF Network Hypothesis
RBF Network Learning
k-Means Algorithm
k-Means and RBF Network in Action

Gaussian SVM Revisited

g_SVM(x) = sign( Σ_{SV} α_n y_n exp(−γ‖x − x_n‖²) + b )

Gaussian SVM: find α_n to combine Gaussians centered at x_n; achieve large margin in infinite-dimensional space, remember? :-)

Gaussian kernel: also called Radial Basis Function (RBF) kernel
• radial: only depends on distance between x and 'center' x_n
• basis function: to be 'combined'

let g_n(x) = y_n exp(−γ‖x − x_n‖²): then g_SVM(x) = sign( Σ_{SV} α_n g_n(x) + b )
—linear aggregation of selected radial hypotheses

Radial Basis Function (RBF) Network: linear aggregation of radial hypotheses

From Neural Network to RBF Network

[Figure: a Neural Network with inputs x_0 = 1, x_1, …, x_d, hidden tanh units with weights w_ij^(1), w_j1^(2), versus an RBF Network with the same inputs, hidden RBF units (parameterized by centers), and output votes]

• hidden layer different: (inner-product + tanh) versus (distance + Gaussian)
• output layer same: just linear aggregation

RBF Network: historically a type of NNet

RBF Network Hypothesis

h(x) = Output( Σ_{m=1}^{M} β_m RBF(x, µ_m) + b )

key variables: centers µ_m; (signed) votes β_m

[Figure: RBF Network with inputs x_0 = 1, x_1, …, x_d, hidden RBF units at the centers, and an Output node combining the votes]

g_SVM for Gaussian-SVM:
• RBF: Gaussian; Output: sign (binary classification)
• M = #SV; µ_m: SVM support vectors x_m; β_m: α_m y_m from SVM dual

learning: given RBF and Output, decide µ_m and β_m
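To make the hypothesis concrete, here is a minimal NumPy sketch of evaluating an RBF Network; the Gaussian choice of RBF, the function names, and the default γ are illustrative assumptions, not fixed by the slides.

```python
import numpy as np

def gaussian_rbf(x, mu, gamma=1.0):
    # RBF(x, mu) = exp(-gamma * ||x - mu||^2)
    return np.exp(-gamma * np.sum((x - mu) ** 2))

def rbf_network_score(x, centers, betas, b=0.0, gamma=1.0):
    # sum_m beta_m * RBF(x, mu_m) + b, i.e. the value fed into Output
    # (Output = sign for binary classification, identity for regression)
    return sum(beta * gaussian_rbf(x, mu, gamma)
               for beta, mu in zip(betas, centers)) + b
```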

RBF and Similarity

general similarity functions between x and x′:
Neuron(x, x′) = tanh(γ xᵀx′ + 1)
DNASim(x, x′) = EditDistance(x, x′)

kernel: similarity via Z-space inner product —governed by Mercer's condition, remember? :-)
Poly(x, x′) = (1 + xᵀx′)²
Gaussian(x, x′) = exp(−γ‖x − x′‖²)

RBF: similarity via X-space distance —often monotonically non-increasing in distance
Gaussian(x, x′) = exp(−γ‖x − x′‖²)
Truncated(x, x′) = ⟦‖x − x′‖ ≤ 1⟧ (1 − ‖x − x′‖)²

RBF Network: distance-based similarity-to-centers as feature transform
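The radial similarities above translate directly into code; a small sketch, with names chosen here only for illustration. Both are monotonically non-increasing in the distance ‖x − x′‖.

```python
import numpy as np

def gaussian_sim(x, xp, gamma=1.0):
    # Gaussian(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def truncated_sim(x, xp):
    # Truncated(x, x') = [[ ||x - x'|| <= 1 ]] * (1 - ||x - x'||)^2
    d = np.linalg.norm(x - xp)
    return (1.0 - d) ** 2 if d <= 1.0 else 0.0
```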

Fun Time

Which of the following is not a radial basis function?
1 φ(x, µ) = exp(−γ‖x − µ‖²)
2 φ(x, µ) = −√(xᵀx − 2xᵀµ + µᵀµ)
3 φ(x, µ) = ⟦x = µ⟧
4 φ(x, µ) = xᵀx + µᵀµ

Reference Answer: 4

Note that 3 is an extreme case of 1 (Gaussian) with γ → ∞, and 2 contains a ‖x − µ‖² somewhere :-).


Full RBF Network

h(x) = Output( Σ_{m=1}^{M} β_m RBF(x, µ_m) )

• full RBF Network: M = N and each µ_m = x_m
• physical meaning: each x_m influences similar x by β_m

e.g. uniform influence with β_m = 1 · y_m for binary classification:

g_uniform(x) = sign( Σ_{m=1}^{N} y_m exp(−γ‖x − x_m‖²) )

—aggregate each example's opinion subject to similarity

full RBF Network: lazy way to decide µ_m
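A minimal sketch of g_uniform, assuming labels y_n ∈ {−1, +1} and NumPy arrays; the names are illustrative.

```python
import numpy as np

def g_uniform(x, X, y, gamma=1.0):
    # full RBF Network with uniform votes beta_m = y_m
    # X: (N, d) training inputs, y: (N,) labels in {-1, +1}
    sims = np.exp(-gamma * np.sum((X - x) ** 2, axis=1))
    return np.sign(y @ sims)
```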

Nearest Neighbor

g_uniform(x) = sign( Σ_{m=1}^{N} y_m exp(−γ‖x − x_m‖²) )

• exp(−γ‖x − x_m‖²): maximum when x is closest to x_m
—the maximum one often dominates the Σ_{m=1}^{N} term
• take the y_m of the maximum exp(…) instead of voting over all y_m
—selection instead of aggregation

physical meaning: g_nbor(x) = y_m such that x is closest to x_m
—called the nearest neighbor model

can also uniformly aggregate the k nearest neighbors: k nearest neighbor

k nearest neighbor: also lazy but very intuitive
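The selection view leads to a short k-nearest-neighbor sketch; k = 1 recovers g_nbor, and the implementation details are assumptions for illustration.

```python
import numpy as np

def g_knn(x, X, y, k=1, gamma=1.0):
    # selection instead of aggregation: keep only the k most similar examples
    sims = np.exp(-gamma * np.sum((X - x) ** 2, axis=1))
    nearest = np.argsort(-sims)[:k]   # largest similarity = smallest distance
    return np.sign(np.sum(y[nearest]))
```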

Interpolation by Full RBF Network

full RBF Network for squared-error regression:

h(x) = Output( Σ_{m=1}^{N} β_m RBF(x, x_m) )

just linear regression on RBF-transformed data:
z_n = [RBF(x_n, x_1), RBF(x_n, x_2), …, RBF(x_n, x_N)]

optimal β? β = (ZᵀZ)⁻¹Zᵀy, if ZᵀZ invertible, remember? :-)

size of Z? N (examples) by N (centers) —symmetric square matrix

theoretical fact: if the x_n are all different, Z with Gaussian RBF is invertible

optimal β with invertible Z: β = Z⁻¹y
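A sketch of exact interpolation with a full Gaussian RBF Network: build the N-by-N matrix Z and solve Zβ = y rather than forming Z⁻¹ explicitly. The function name and defaults are illustrative.

```python
import numpy as np

def full_rbf_interpolation(X, y, gamma=1.0):
    # Z[n, m] = exp(-gamma * ||x_n - x_m||^2); optimal beta solves Z beta = y
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Z = np.exp(-gamma * sq_dists)
    beta = np.linalg.solve(Z, y)
    return beta, Z
```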

Regularized Full RBF Network

full Gaussian RBF Network for regression: β = Z⁻¹y

g_RBF(x_1) = βᵀz_1 = yᵀZ⁻¹ (first column of Z) = yᵀ[1 0 … 0]ᵀ = y_1

—g_RBF(x_n) = y_n, i.e. E_in(g_RBF) = 0, yeah!! :-)

called exact interpolation for function approximation, but overfitting for learning? :-(

how about regularization? e.g. ridge regression for β instead
—optimal β = (ZᵀZ + λI)⁻¹Zᵀy

seen Z? Z = [Gaussian(x_n, x_m)] = Gaussian kernel matrix K

effect of regularization in different spaces:
• kernel ridge regression: β = (K + λI)⁻¹y
• regularized full RBFNet: β = (ZᵀZ + λI)⁻¹Zᵀy
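A corresponding sketch of the regularized version, using ridge regression on the RBF-transformed data; the default λ and the helper name are illustrative.

```python
import numpy as np

def full_rbf_ridge(X, y, lam=1e-3, gamma=1.0):
    # regularized full RBF Network: beta = (Z^T Z + lambda I)^{-1} Z^T y
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Z = np.exp(-gamma * sq_dists)
    beta = np.linalg.solve(Z.T @ Z + lam * np.eye(len(X)), Z.T @ y)
    return beta
```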

Fewer Centers as Regularization

recall: g_SVM(x) = sign( Σ_{SV} α_m y_m exp(−γ‖x − x_m‖²) + b )
—only ≪ N SVs needed in the 'network'

next: M ≪ N instead of M = N

effect: regularization by constraining the number of centers and voting weights

physical meaning of centers µ_m: prototypes

remaining question: how to extract prototypes?

Fun Time

If x_1 = x_2, what happens in the Z matrix of the full Gaussian RBF Network?
1 the first two rows of the matrix are the same
2 the first two columns of the matrix are different
3 the matrix is invertible
4 the sub-matrix at the intersection of the first two rows and the first two columns contains a constant of 0

Reference Answer: 1

It is easy to see that the first two rows must be the same; so must the first two columns. The two identical rows make the matrix singular; the sub-matrix in 4 contains a constant of 1 = exp(−0) instead of 0.


Good Prototypes: Clustering Problem

if x_1 ≈ x_2, no need for both RBF(x, x_1) and RBF(x, x_2) in the RBFNet
=⇒ cluster x_1 and x_2 by one prototype µ ≈ x_1 ≈ x_2

clustering with prototypes:
• partition {x_n} into disjoint sets S_1, S_2, …, S_M
• choose a µ_m for each S_m
—hope: if x_1, x_2 both ∈ S_m, then µ_m ≈ x_1 ≈ x_2

cluster error with squared error measure:
E_in(S_1, …, S_M; µ_1, …, µ_M) = (1/N) Σ_{n=1}^{N} Σ_{m=1}^{M} ⟦x_n ∈ S_m⟧ ‖x_n − µ_m‖²

goal: with S_1, …, S_M being a partition of {x_n},
min_{S_1,…,S_M; µ_1,…,µ_M} E_in(S_1, …, S_M; µ_1, …, µ_M)
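The cluster error above is easy to compute once each example is assigned to a set; a small sketch, where encoding the sets S_m as an assignment vector is an illustrative choice.

```python
import numpy as np

def cluster_error(X, assignment, centers):
    # E_in = (1/N) * sum_n ||x_n - mu_{m(n)}||^2, where m(n) is the set of x_n
    diffs = X - centers[assignment]
    return np.mean(np.sum(diffs ** 2, axis=1))
```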

Partition Optimization

with S_1, …, S_M being a partition of {x_n},
min_{S_1,…,S_M; µ_1,…,µ_M} Σ_{n=1}^{N} Σ_{m=1}^{M} ⟦x_n ∈ S_m⟧ ‖x_n − µ_m‖²

hard to optimize: joint combinatorial-numerical optimization
two sets of variables: will optimize alternatingly

if µ_1, …, µ_M fixed, then for each x_n:
• ⟦x_n ∈ S_m⟧: choose one and only one subset
• ‖x_n − µ_m‖²: distance to each prototype
optimal chosen subset S_m = the one with minimum ‖x_n − µ_m‖²

for given µ_1, …, µ_M, each x_n 'optimally partitioned' using its closest µ_m

Prototype Optimization

with S_1, …, S_M being a partition of {x_n},
min_{S_1,…,S_M; µ_1,…,µ_M} Σ_{n=1}^{N} Σ_{m=1}^{M} ⟦x_n ∈ S_m⟧ ‖x_n − µ_m‖²

hard to optimize: joint combinatorial-numerical optimization
two sets of variables: will optimize alternatingly

if S_1, …, S_M fixed, just unconstrained optimization for each µ_m:
∇_{µ_m} E_in = −2 Σ_{n=1}^{N} ⟦x_n ∈ S_m⟧ (x_n − µ_m) = −2 ( Σ_{x_n ∈ S_m} x_n − |S_m| µ_m )

optimal prototype µ_m = average of the x_n within S_m

for given S_1, …, S_M, each µ_m 'optimally computed' as the consensus within S_m

k-Means Algorithm

use k prototypes instead of M, historically (different from the k in k nearest neighbor, though)

k-Means Algorithm
1 initialize µ_1, µ_2, …, µ_k: say, as k randomly chosen x_n
2 alternating optimization of E_in: repeatedly
  1 optimize S_1, S_2, …, S_k: each x_n 'optimally partitioned' using its closest µ_m
  2 optimize µ_1, µ_2, …, µ_k: each µ_m 'optimally computed' as the consensus within S_m
  until converged

converged: no change of S_1, S_2, …, S_k anymore
—guaranteed, as E_in decreases during alternating minimization

k-Means: the most popular clustering algorithm, through alternating minimization
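A minimal NumPy sketch of the alternating optimization above. Random initialization from the data and the convergence test on unchanged assignments follow the slide; the remaining details (iteration cap, handling of empty sets) are illustrative assumptions.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignment = None
    for _ in range(max_iter):
        # step 1: optimize S_1, ..., S_k -- each x_n goes to its closest mu_m
        dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
        new_assignment = np.argmin(dists, axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break  # converged: no change of any S_m
        assignment = new_assignment
        # step 2: optimize mu_1, ..., mu_k -- each mu_m is the mean of its set
        for m in range(k):
            members = X[assignment == m]
            if len(members) > 0:
                centers[m] = members.mean(axis=0)
    return centers, assignment
```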

RBF Network Using k-Means

1 run k-Means with k = M to get {µ_m}
2 construct transform Φ(x) from RBF (say, Gaussian) at the µ_m:
  Φ(x) = [RBF(x, µ_1), RBF(x, µ_2), …, RBF(x, µ_M)]
3 run a linear model on {(Φ(x_n), y_n)} to get β
4 return g_RBFNET(x) = LinearHypothesis(β, Φ(x))

using unsupervised learning (k-Means) to assist feature transform—like autoencoder

parameters: M (number of prototypes), RBF (such as γ of Gaussian)

RBF Network: a simple (old-fashioned) model
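Putting the four steps together, a sketch of the full pipeline. It reuses the k_means sketch above; taking ridge-regularized linear regression as the linear model in step 3 and sign as the returned hypothesis are illustrative choices.

```python
import numpy as np

def rbf_transform(X, centers, gamma=1.0):
    # Phi(x_n) = [RBF(x_n, mu_1), ..., RBF(x_n, mu_M)], one row per example
    sq_dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

def train_rbf_network(X, y, M=4, gamma=1.0, lam=1e-3, seed=None):
    centers, _ = k_means(X, M, seed=seed)          # step 1: prototypes {mu_m}
    Phi = rbf_transform(X, centers, gamma)         # step 2: feature transform
    beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)   # step 3
    # step 4: g_RBFNET(x) = sign(beta^T Phi(x)) for binary classification
    return lambda x: np.sign((rbf_transform(np.atleast_2d(x), centers, gamma) @ beta)[0])
```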

Fun Time

For k-Means, consider examples x_n ∈ R² such that all x_{n,1} and x_{n,2} are non-zero. When fixing two prototypes µ_1 = [1, 1] and µ_2 = [−1, 1], which of the following sets is the optimal S_1?
1 {x_n : x_{n,1} > 0}
2 {x_n : x_{n,1} < 0}
3 {x_n : x_{n,2} > 0}
4 {x_n : x_{n,2} < 0}

Reference Answer: 1

Note that S_1 contains the examples that are closer to µ_1 than to µ_2.


Beauty of k-Means

[Figure: k-Means clustering with k = 4]

usually works well with proper k and initialization


Difficulty of k-Means

[Figure: k-Means clustering with k = 2, k = 4, and k = 7]

'sensitive' to k and initialization


RBF Network Using k-Means

[Figure: RBF Network using k-Means centers with k = 2, k = 4, and k = 7]

reasonable performance with proper centers

Full RBF Network

[Figure: full RBF Network (k = N, λ = 0.001), RBF Network with k = 4, and nearest neighbor]

full RBF Network: generally less useful

Fun Time

When coupled with ridge linear regression, which of the following RBF Networks is 'most regularized'?
1 small M and small λ
2 small M and large λ
3 large M and small λ
4 large M and large λ

Reference Answer: 2

small M: fewer weights and hence more regularized; large λ: shorter β and hence more regularized.


Summary

1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

Lecture 14: Radial Basis Function Network
• RBF Network Hypothesis: prototypes instead of neurons as transform
• RBF Network Learning: linear aggregation of prototype 'hypotheses'
• k-Means Algorithm: clustering with alternating optimization
• k-Means and RBF Network in Action: proper choice of # prototypes important

next: extracting features from abstract data
