
Machine Learning Techniques (機器學習技法)

Lecture 14: Radial Basis Function Network

Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)

Roadmap

1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

Lecture 13: Deep Learning
pre-training with denoising autoencoder (non-linear PCA) and fine-tuning with backprop for NNet with many layers

Lecture 14: Radial Basis Function Network
RBF Network Hypothesis
RBF Network Learning
k-Means Algorithm
k-Means and RBF Network in Action

Gaussian SVM Revisited

g_SVM(x) = sign( Σ_{SV} α_n y_n exp(−γ‖x − x_n‖²) + b )

Gaussian SVM: find α_n to combine Gaussians centered at x_n; achieve large margin in infinite-dimensional space, remember? :-)

Gaussian kernel: also called Radial Basis Function (RBF) kernel
• radial: only depends on distance between x and 'center' x_n
• basis function: to be 'combined'

let g_n(x) = y_n exp(−γ‖x − x_n‖²): then g_SVM(x) = sign( Σ_{SV} α_n g_n(x) + b )
—linear aggregation of selected radial hypotheses

Radial Basis Function (RBF) Network: linear aggregation of radial hypotheses

From Neural Network to RBF Network

[Figure: a Neural Network with inputs x_0 = 1, x_1, …, x_d, hidden tanh units with weights w_ij^(1), w_j1^(2), versus an RBF Network with the same inputs, hidden RBF units (parameterized by centers), and output votes]

• hidden layer different: (inner-product + tanh) versus (distance + Gaussian)
• output layer same: just linear aggregation

RBF Network: historically a type of NNet

RBF Network Hypothesis

h(x) = Output( Σ_{m=1}^{M} β_m RBF(x, µ_m) + b )

key variables: centers µ_m; (signed) votes β_m

[Figure: RBF Network with inputs x_0 = 1, x_1, …, x_d, hidden RBF units at the centers, and an Output node combining the votes]

g_SVM for Gaussian-SVM:
• RBF: Gaussian; Output: sign (binary classification)
• M = #SV; µ_m: SVM support vectors x_m; β_m: α_m y_m from SVM dual

learning: given RBF and Output, decide µ_m and β_m
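To make the hypothesis concrete, here is a minimal NumPy sketch of evaluating an RBF Network; the Gaussian choice of RBF, the function names, and the default γ are illustrative assumptions, not fixed by the slides.

```python
import numpy as np

def gaussian_rbf(x, mu, gamma=1.0):
    # RBF(x, mu) = exp(-gamma * ||x - mu||^2)
    return np.exp(-gamma * np.sum((x - mu) ** 2))

def rbf_network_score(x, centers, betas, b=0.0, gamma=1.0):
    # sum_m beta_m * RBF(x, mu_m) + b, i.e. the value fed into Output
    # (Output = sign for binary classification, identity for regression)
    return sum(beta * gaussian_rbf(x, mu, gamma)
               for beta, mu in zip(betas, centers)) + b
```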

RBF and Similarity

general similarity functions between x and x′:
Neuron(x, x′) = tanh(γ xᵀx′ + 1)
DNASim(x, x′) = EditDistance(x, x′)

kernel: similarity via Z-space inner product —governed by Mercer's condition, remember? :-)
Poly(x, x′) = (1 + xᵀx′)²
Gaussian(x, x′) = exp(−γ‖x − x′‖²)

RBF: similarity via X-space distance —often monotonically non-increasing in distance
Gaussian(x, x′) = exp(−γ‖x − x′‖²)
Truncated(x, x′) = ⟦‖x − x′‖ ≤ 1⟧ (1 − ‖x − x′‖)²

RBF Network: distance-based similarity-to-centers as feature transform
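The radial similarities above translate directly into code; a small sketch, with names chosen here only for illustration. Both are monotonically non-increasing in the distance ‖x − x′‖.

```python
import numpy as np

def gaussian_sim(x, xp, gamma=1.0):
    # Gaussian(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def truncated_sim(x, xp):
    # Truncated(x, x') = [[ ||x - x'|| <= 1 ]] * (1 - ||x - x'||)^2
    d = np.linalg.norm(x - xp)
    return (1.0 - d) ** 2 if d <= 1.0 else 0.0
```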

Fun Time

Which of the following is not a radial basis function?
1 φ(x, µ) = exp(−γ‖x − µ‖²)
2 φ(x, µ) = −√(xᵀx − 2xᵀµ + µᵀµ)
3 φ(x, µ) = ⟦x = µ⟧
4 φ(x, µ) = xᵀx + µᵀµ

Reference Answer: 4

Note that 3 is an extreme case of 1 (Gaussian) with γ → ∞, and 2 contains a ‖x − µ‖² somewhere :-).


Full RBF Network

h(x) = Output( Σ_{m=1}^{M} β_m RBF(x, µ_m) )

• full RBF Network: M = N and each µ_m = x_m
• physical meaning: each x_m influences similar x by β_m

e.g. uniform influence with β_m = 1 · y_m for binary classification:

g_uniform(x) = sign( Σ_{m=1}^{N} y_m exp(−γ‖x − x_m‖²) )

—aggregate each example's opinion subject to similarity

full RBF Network: lazy way to decide µ_m
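A minimal sketch of g_uniform, assuming labels y_n ∈ {−1, +1} and NumPy arrays; the names are illustrative.

```python
import numpy as np

def g_uniform(x, X, y, gamma=1.0):
    # full RBF Network with uniform votes beta_m = y_m
    # X: (N, d) training inputs, y: (N,) labels in {-1, +1}
    sims = np.exp(-gamma * np.sum((X - x) ** 2, axis=1))
    return np.sign(y @ sims)
```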

Nearest Neighbor

g_uniform(x) = sign( Σ_{m=1}^{N} y_m exp(−γ‖x − x_m‖²) )

• exp(−γ‖x − x_m‖²): maximum when x is closest to x_m
—the maximum one often dominates the Σ_{m=1}^{N} term
• take the y_m of the maximum exp(…) instead of voting over all y_m
—selection instead of aggregation

physical meaning: g_nbor(x) = y_m such that x is closest to x_m
—called the nearest neighbor model

can also uniformly aggregate the k nearest neighbors: k nearest neighbor

k nearest neighbor: also lazy but very intuitive
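The selection view leads to a short k-nearest-neighbor sketch; k = 1 recovers g_nbor, and the implementation details are assumptions for illustration.

```python
import numpy as np

def g_knn(x, X, y, k=1, gamma=1.0):
    # selection instead of aggregation: keep only the k most similar examples
    sims = np.exp(-gamma * np.sum((X - x) ** 2, axis=1))
    nearest = np.argsort(-sims)[:k]   # largest similarity = smallest distance
    return np.sign(np.sum(y[nearest]))
```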

Interpolation by Full RBF Network

full RBF Network for squared-error regression:

h(x) = Output( Σ_{m=1}^{N} β_m RBF(x, x_m) )

just linear regression on RBF-transformed data:
z_n = [RBF(x_n, x_1), RBF(x_n, x_2), …, RBF(x_n, x_N)]

optimal β? β = (ZᵀZ)⁻¹Zᵀy, if ZᵀZ invertible, remember? :-)

size of Z? N (examples) by N (centers) —symmetric square matrix

theoretical fact: if the x_n are all different, Z with Gaussian RBF is invertible

optimal β with invertible Z: β = Z⁻¹y
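A sketch of exact interpolation with a full Gaussian RBF Network: build the N-by-N matrix Z and solve Zβ = y rather than forming Z⁻¹ explicitly. The function name and defaults are illustrative.

```python
import numpy as np

def full_rbf_interpolation(X, y, gamma=1.0):
    # Z[n, m] = exp(-gamma * ||x_n - x_m||^2); optimal beta solves Z beta = y
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Z = np.exp(-gamma * sq_dists)
    beta = np.linalg.solve(Z, y)
    return beta, Z
```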

Regularized Full RBF Network

full Gaussian RBF Network for regression: β = Z⁻¹y

g_RBF(x_1) = βᵀz_1 = yᵀZ⁻¹ (first column of Z) = yᵀ[1 0 … 0]ᵀ = y_1

—g_RBF(x_n) = y_n, i.e. E_in(g_RBF) = 0, yeah!! :-)

called exact interpolation for function approximation, but overfitting for learning? :-(

how about regularization? e.g. ridge regression for β instead
—optimal β = (ZᵀZ + λI)⁻¹Zᵀy

seen Z? Z = [Gaussian(x_n, x_m)] = Gaussian kernel matrix K

effect of regularization in different spaces:
• kernel ridge regression: β = (K + λI)⁻¹y
• regularized full RBFNet: β = (ZᵀZ + λI)⁻¹Zᵀy
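A corresponding sketch of the regularized version, using ridge regression on the RBF-transformed data; the default λ and the helper name are illustrative.

```python
import numpy as np

def full_rbf_ridge(X, y, lam=1e-3, gamma=1.0):
    # regularized full RBF Network: beta = (Z^T Z + lambda I)^{-1} Z^T y
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Z = np.exp(-gamma * sq_dists)
    beta = np.linalg.solve(Z.T @ Z + lam * np.eye(len(X)), Z.T @ y)
    return beta
```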

Fewer Centers as Regularization

recall: g_SVM(x) = sign( Σ_{SV} α_m y_m exp(−γ‖x − x_m‖²) + b )
—only ≪ N SVs needed in the 'network'

next: M ≪ N instead of M = N

effect: regularization by constraining the number of centers and voting weights

physical meaning of centers µ_m: prototypes

remaining question: how to extract prototypes?

Fun Time

If x_1 = x_2, what happens in the Z matrix of the full Gaussian RBF Network?
1 the first two rows of the matrix are the same
2 the first two columns of the matrix are different
3 the matrix is invertible
4 the sub-matrix at the intersection of the first two rows and the first two columns contains a constant of 0

Reference Answer: 1

It is easy to see that the first two rows must be the same; so must the first two columns. The two identical rows make the matrix singular; the sub-matrix in 4 contains a constant of 1 = exp(−0) instead of 0.


Good Prototypes: Clustering Problem

if x_1 ≈ x_2, no need for both RBF(x, x_1) and RBF(x, x_2) in the RBFNet
=⇒ cluster x_1 and x_2 by one prototype µ ≈ x_1 ≈ x_2

clustering with prototypes:
• partition {x_n} into disjoint sets S_1, S_2, …, S_M
• choose a µ_m for each S_m
—hope: if x_1, x_2 both ∈ S_m, then µ_m ≈ x_1 ≈ x_2

cluster error with squared error measure:
E_in(S_1, …, S_M; µ_1, …, µ_M) = (1/N) Σ_{n=1}^{N} Σ_{m=1}^{M} ⟦x_n ∈ S_m⟧ ‖x_n − µ_m‖²

goal: with S_1, …, S_M being a partition of {x_n},
min_{S_1,…,S_M; µ_1,…,µ_M} E_in(S_1, …, S_M; µ_1, …, µ_M)
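The cluster error above is easy to compute once each example is assigned to a set; a small sketch, where encoding the sets S_m as an assignment vector is an illustrative choice.

```python
import numpy as np

def cluster_error(X, assignment, centers):
    # E_in = (1/N) * sum_n ||x_n - mu_{m(n)}||^2, where m(n) is the set of x_n
    diffs = X - centers[assignment]
    return np.mean(np.sum(diffs ** 2, axis=1))
```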

Partition Optimization

with S_1, …, S_M being a partition of {x_n},
min_{S_1,…,S_M; µ_1,…,µ_M} Σ_{n=1}^{N} Σ_{m=1}^{M} ⟦x_n ∈ S_m⟧ ‖x_n − µ_m‖²

hard to optimize: joint combinatorial-numerical optimization
two sets of variables: will optimize alternatingly

if µ_1, …, µ_M fixed, then for each x_n:
• ⟦x_n ∈ S_m⟧: choose one and only one subset
• ‖x_n − µ_m‖²: distance to each prototype
optimal chosen subset S_m = the one with minimum ‖x_n − µ_m‖²

for given µ_1, …, µ_M, each x_n 'optimally partitioned' using its closest µ_m

Prototype Optimization

with S_1, …, S_M being a partition of {x_n},
min_{S_1,…,S_M; µ_1,…,µ_M} Σ_{n=1}^{N} Σ_{m=1}^{M} ⟦x_n ∈ S_m⟧ ‖x_n − µ_m‖²

hard to optimize: joint combinatorial-numerical optimization
two sets of variables: will optimize alternatingly

if S_1, …, S_M fixed, just unconstrained optimization for each µ_m:
∇_{µ_m} E_in = −2 Σ_{n=1}^{N} ⟦x_n ∈ S_m⟧ (x_n − µ_m) = −2 ( Σ_{x_n ∈ S_m} x_n − |S_m| µ_m )

optimal prototype µ_m = average of the x_n within S_m

for given S_1, …, S_M, each µ_m 'optimally computed' as the consensus within S_m

k-Means Algorithm

use k prototypes instead of M, historically (different from the k in k nearest neighbor, though)

k-Means Algorithm
1 initialize µ_1, µ_2, …, µ_k: say, as k randomly chosen x_n
2 alternating optimization of E_in: repeatedly
  1 optimize S_1, S_2, …, S_k: each x_n 'optimally partitioned' using its closest µ_m
  2 optimize µ_1, µ_2, …, µ_k: each µ_m 'optimally computed' as the consensus within S_m
  until converged

converged: no change of S_1, S_2, …, S_k anymore
—guaranteed, as E_in decreases during alternating minimization

k-Means: the most popular clustering algorithm, through alternating minimization
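A minimal NumPy sketch of the alternating optimization above. Random initialization from the data and the convergence test on unchanged assignments follow the slide; the remaining details (iteration cap, handling of empty sets) are illustrative assumptions.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignment = None
    for _ in range(max_iter):
        # step 1: optimize S_1, ..., S_k -- each x_n goes to its closest mu_m
        dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
        new_assignment = np.argmin(dists, axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break  # converged: no change of any S_m
        assignment = new_assignment
        # step 2: optimize mu_1, ..., mu_k -- each mu_m is the mean of its set
        for m in range(k):
            members = X[assignment == m]
            if len(members) > 0:
                centers[m] = members.mean(axis=0)
    return centers, assignment
```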

RBF Network Using k-Means

1 run k-Means with k = M to get {µ_m}
2 construct transform Φ(x) from RBF (say, Gaussian) at the µ_m:
  Φ(x) = [RBF(x, µ_1), RBF(x, µ_2), …, RBF(x, µ_M)]
3 run a linear model on {(Φ(x_n), y_n)} to get β
4 return g_RBFNET(x) = LinearHypothesis(β, Φ(x))

using unsupervised learning (k-Means) to assist feature transform—like autoencoder

parameters: M (number of prototypes), RBF (such as γ of Gaussian)

RBF Network: a simple (old-fashioned) model
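Putting the four steps together, a sketch of the full pipeline. It reuses the k_means sketch above; taking ridge-regularized linear regression as the linear model in step 3 and sign as the returned hypothesis are illustrative choices.

```python
import numpy as np

def rbf_transform(X, centers, gamma=1.0):
    # Phi(x_n) = [RBF(x_n, mu_1), ..., RBF(x_n, mu_M)], one row per example
    sq_dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

def train_rbf_network(X, y, M=4, gamma=1.0, lam=1e-3, seed=None):
    centers, _ = k_means(X, M, seed=seed)          # step 1: prototypes {mu_m}
    Phi = rbf_transform(X, centers, gamma)         # step 2: feature transform
    beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)   # step 3
    # step 4: g_RBFNET(x) = sign(beta^T Phi(x)) for binary classification
    return lambda x: np.sign((rbf_transform(np.atleast_2d(x), centers, gamma) @ beta)[0])
```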

Fun Time

For k-Means, consider examples x_n ∈ R² such that all x_{n,1} and x_{n,2} are non-zero. When fixing two prototypes µ_1 = [1, 1] and µ_2 = [−1, 1], which of the following sets is the optimal S_1?
1 {x_n : x_{n,1} > 0}
2 {x_n : x_{n,1} < 0}
3 {x_n : x_{n,2} > 0}
4 {x_n : x_{n,2} < 0}

Reference Answer: 1

Note that S_1 contains the examples that are closer to µ_1 than to µ_2.


Beauty of k-Means

[Figure: k-Means clustering with k = 4]

usually works well with proper k and initialization


Difficulty of k-Means

[Figure: k-Means clustering with k = 2, k = 4, and k = 7]

'sensitive' to k and initialization


RBF Network Using k-Means

[Figure: RBF Network using k-Means centers with k = 2, k = 4, and k = 7]

reasonable performance with proper centers

Full RBF Network

[Figure: full RBF Network (k = N, λ = 0.001), RBF Network with k = 4, and nearest neighbor]

full RBF Network: generally less useful

Fun Time

When coupled with ridge linear regression, which of the following RBF Networks is 'most regularized'?
1 small M and small λ
2 small M and large λ
3 large M and small λ
4 large M and large λ

Reference Answer: 2

small M: fewer weights and hence more regularized; large λ: shorter β and hence more regularized.


Summary

1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

Lecture 14: Radial Basis Function Network
• RBF Network Hypothesis: prototypes instead of neurons as transform
• RBF Network Learning: linear aggregation of prototype 'hypotheses'
• k-Means Algorithm: clustering with alternating optimization
• k-Means and RBF Network in Action: proper choice of # prototypes important

next: extracting features from abstract data
