(1)

Machine Learning Techniques
(機器學習技法)

Lecture 14: Radial Basis Function Network

Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)

(2)

Radial Basis Function Network

Roadmap

1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

Lecture 13: Deep Learning
pre-training with denoising autoencoder (non-linear PCA) and fine-tuning with backprop for NNet with many layers

Lecture 14: Radial Basis Function Network
RBF Network Hypothesis
RBF Network Learning
k-Means Algorithm

(3)

Radial Basis Function Network
RBF Network Hypothesis

Gaussian SVM Revisited

$$ g_{\text{SVM}}(x) = \text{sign}\Big( \sum_{\text{SV}} \alpha_n y_n \exp\big(-\gamma \|x - x_n\|^2\big) + b \Big) $$

Gaussian SVM: find the $\alpha_n$ to combine Gaussians centered at the $x_n$; achieve large margin in infinite-dimensional space, remember? :-)

Gaussian kernel: also called the Radial Basis Function (RBF) kernel
• radial: only depends on the distance between x and the 'center' $x_n$
• basis function: to be 'combined'

let $g_n(x) = y_n \exp(-\gamma \|x - x_n\|^2)$: then $g_{\text{SVM}}(x) = \text{sign}\big( \sum_{\text{SV}} \alpha_n g_n(x) + b \big)$,
a linear aggregation of selected radial hypotheses

Radial Basis Function (RBF) Network: linear aggregation of radial hypotheses
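As a concrete illustration, here is a minimal NumPy sketch of one radial hypothesis $g_n$ and the aggregation above; the support vectors, labels, multipliers $\alpha_n$, and bias b are assumed to come from an already-trained SVM, and all names are illustrative.

```python
import numpy as np

def g_n(x, x_n, y_n, gamma=1.0):
    """One radial hypothesis: the label y_n, weighted by the Gaussian similarity of x to x_n."""
    return y_n * np.exp(-gamma * np.sum((x - x_n) ** 2))

def g_svm(x, support_vectors, labels, alphas, b, gamma=1.0):
    """Gaussian SVM as a linear aggregation of the selected radial hypotheses."""
    total = sum(a * g_n(x, sv, y, gamma)
                for a, sv, y in zip(alphas, support_vectors, labels))
    return np.sign(total + b)
```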


(12)

Radial Basis Function Network
RBF Network Hypothesis

From Neural Network to RBF Network

[figure: a Neural Network with inputs x_0 = 1, x_1, ..., x_d, a hidden layer of tanh units with weights w_ij^(1) and output weights w_j1^(2), next to an RBF Network with the same inputs, a hidden layer of RBF units (centers), and a linear output layer (votes)]

• hidden layer different: (inner product + tanh) versus (distance + Gaussian)
• output layer same: just linear aggregation

RBF Network: historically a type of NNet


(17)

Radial Basis Function Network
RBF Network Hypothesis

RBF Network Hypothesis

$$ h(x) = \text{Output}\Big( \sum_{m=1}^{M} \beta_m \, \text{RBF}(x, \mu_m) + b \Big) $$

key variables: centers $\mu_m$; (signed) votes $\beta_m$

[figure: RBF Network with inputs x_0 = 1, x_1, ..., x_d, a hidden layer of RBF units (centers), and an Output unit (votes)]

$g_{\text{SVM}}$ for the Gaussian SVM:
• RBF: Gaussian; Output: sign (binary classification)
• M = #SV; $\mu_m$: the SVM support vectors $x_m$; $\beta_m$: $\alpha_m y_m$ from the SVM dual

learning: given RBF and Output, decide $\mu_m$ and $\beta_m$
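A minimal sketch of the general hypothesis above, with the RBF and Output passed in as functions (the Gaussian RBF with sign output recovers the Gaussian-SVM special case); the function names and the example values are illustrative.

```python
import numpy as np

def rbf_network(x, centers, betas, b=0.0,
                rbf=lambda x, mu: np.exp(-np.sum((x - mu) ** 2)),
                output=np.sign):
    """RBF Network hypothesis: h(x) = Output( sum_m beta_m * RBF(x, mu_m) + b )."""
    score = sum(beta * rbf(x, mu) for beta, mu in zip(betas, centers)) + b
    return output(score)

# example: three centers with signed votes, sign output for binary classification
centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([2.0, 0.0])]
betas = [0.5, 0.8, -1.0]
print(rbf_network(np.array([0.9, 1.1]), centers, betas))
```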


(22)

Radial Basis Function Network
RBF Network Hypothesis

RBF and Similarity

general similarity functions between x and x':
• $\text{Neuron}(x, x') = \tanh(\gamma x^T x' + 1)$
• $\text{DNASim}(x, x') = \text{EditDistance}(x, x')$

kernel: similarity via Z-space inner product, governed by Mercer's condition, remember? :-)
• $\text{Poly}(x, x') = (1 + x^T x')^2$
• $\text{Gaussian}(x, x') = \exp(-\gamma \|x - x'\|^2)$

RBF: similarity via X-space distance, often monotonically non-increasing in the distance
• $\text{Gaussian}(x, x') = \exp(-\gamma \|x - x'\|^2)$
• $\text{Truncated}(x, x') = [\![\|x - x'\| \le 1]\!] \, (1 - \|x - x'\|)^2$

RBF Network: distance-based similarity-to-centers as the feature transform
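A minimal sketch of the truncated RBF above; it could be swapped in for the Gaussian in the earlier rbf_network sketch, since both are similarities that decrease with X-space distance.

```python
import numpy as np

def truncated_rbf(x, center):
    """Truncated RBF: (1 - ||x - center||)^2 within unit distance of the center, 0 outside."""
    dist = np.linalg.norm(x - center)
    return (1.0 - dist) ** 2 if dist <= 1.0 else 0.0
```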


(26)

Radial Basis Function Network
RBF Network Hypothesis

Fun Time

Which of the following is not a radial basis function?
1 $\phi(x, \mu) = \exp(-\gamma \|x - \mu\|^2)$
2 $\phi(x, \mu) = -\sqrt{x^T x - 2 x^T \mu + \mu^T \mu}$
3 $\phi(x, \mu) = [\![x = \mu]\!]$
4 $\phi(x, \mu) = x^T x + \mu^T \mu$

Reference Answer: 4

Note that 3 is an extreme case of 1 (Gaussian) with $\gamma \to \infty$, and 2 contains a $\|x - \mu\|^2$ somewhere :-).
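To see the note concretely: option 2 depends on x only through its distance to $\mu$, since

$$ -\sqrt{x^T x - 2 x^T \mu + \mu^T \mu} = -\sqrt{\|x - \mu\|^2} = -\|x - \mu\|, $$

whereas option 4 equals $\|x\|^2 + \|\mu\|^2$, which cannot be written as a function of $\|x - \mu\|$ alone, hence it is not radial.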


(28)

Radial Basis Function Network
RBF Network Learning

Full RBF Network

$$ h(x) = \text{Output}\Big( \sum_{m=1}^{M} \beta_m \, \text{RBF}(x, \mu_m) \Big) $$

• full RBF Network: M = N and each $\mu_m = x_m$
• physical meaning: each $x_m$ influences similar x by $\beta_m$
• e.g. uniform influence with $\beta_m = 1 \cdot y_m$ for binary classification:

$$ g_{\text{uniform}}(x) = \text{sign}\Big( \sum_{m=1}^{N} y_m \exp\big(-\gamma \|x - x_m\|^2\big) \Big) $$

aggregate each example's opinion subject to similarity

full RBF Network: a lazy way to decide $\mu_m$
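A minimal sketch of $g_{\text{uniform}}$ above, assuming X is an N x d NumPy array of training inputs and y holds the corresponding ±1 labels; the names and the default γ are illustrative.

```python
import numpy as np

def g_uniform(x, X, y, gamma=1.0):
    """Uniform full RBF Network: every example x_m casts its label y_m as a vote,
    weighted by the Gaussian similarity of the query x to x_m."""
    weights = np.exp(-gamma * np.sum((X - x) ** 2, axis=1))  # similarity to each x_m
    return np.sign(weights @ y)                              # aggregated signed vote
```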


(33)

Radial Basis Function Network
RBF Network Learning

Nearest Neighbor

$$ g_{\text{uniform}}(x) = \text{sign}\Big( \sum_{m=1}^{N} y_m \exp\big(-\gamma \|x - x_m\|^2\big) \Big) $$

• $\exp(-\gamma \|x - x_m\|^2)$ is maximum when x is closest to $x_m$; the maximum one often dominates the $\sum_{m=1}^{N}$ term
• so take the $y_m$ of the maximum $\exp(\ldots)$ instead of voting over all $y_m$: selection instead of aggregation
• physical meaning: $g_{\text{nbor}}(x) = y_m$ such that x is closest to $x_m$, called the nearest neighbor model
• can also uniformly aggregate the k nearest neighbors: k nearest neighbor

k nearest neighbor: also lazy but very intuitive
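A minimal sketch of the selection-based variant, assuming the same X, y layout as above; k = 1 gives the nearest neighbor model $g_{\text{nbor}}$.

```python
import numpy as np

def g_knn(x, X, y, k=1):
    """k nearest neighbor: select the k examples closest to x and
    uniformly aggregate their labels (selection instead of full aggregation)."""
    dists = np.sum((X - x) ** 2, axis=1)   # squared distances to every x_m
    nearest = np.argsort(dists)[:k]        # indices of the k closest examples
    return np.sign(np.sum(y[nearest]))     # uniform vote among the k neighbors
```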


(42)

Radial Basis Function Network
RBF Network Learning

Interpolation by Full RBF Network

full RBF Network for squared-error regression (Output is just the identity here):

$$ h(x) = \sum_{m=1}^{N} \beta_m \, \text{RBF}(x, x_m) $$

just linear regression on RBF-transformed data:
$z_n = [\text{RBF}(x_n, x_1), \text{RBF}(x_n, x_2), \ldots, \text{RBF}(x_n, x_N)]$

optimal β? $\beta = (Z^T Z)^{-1} Z^T y$, if $Z^T Z$ is invertible, remember? :-)

size of Z? N (examples) by N (centers): a symmetric square matrix

theoretical fact: if the $x_n$ are all different, Z with the Gaussian RBF is invertible

optimal β with invertible Z: $\beta = Z^{-1} y$
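A minimal NumPy sketch of the exact-interpolation solution $\beta = Z^{-1} y$ above; fit_full_rbf and predict_full_rbf are illustrative names, and γ is an assumed hyperparameter.

```python
import numpy as np

def fit_full_rbf(X, y, gamma=1.0):
    """Exact interpolation: Z[n, m] = exp(-gamma * ||x_n - x_m||^2), beta = Z^{-1} y."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # N x N pairwise squared distances
    Z = np.exp(-gamma * sq_dists)        # symmetric; invertible if the x_n are all different
    return np.linalg.solve(Z, y)         # beta = Z^{-1} y, without forming the inverse explicitly

def predict_full_rbf(x, X, beta, gamma=1.0):
    """h(x) = sum_m beta_m * exp(-gamma * ||x - x_m||^2)."""
    return beta @ np.exp(-gamma * np.sum((X - x) ** 2, axis=1))
```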


(50)

Radial Basis Function Network
RBF Network Learning

Regularized Full RBF Network

full Gaussian RBF Network for regression with $\beta = Z^{-1} y$:

$$ g_{\text{RBF}}(x_1) = \beta^T z_1 = y^T Z^{-1} (\text{first column of } Z) = y^T \begin{bmatrix} 1 & 0 & \ldots & 0 \end{bmatrix}^T = y_1 $$

so $g_{\text{RBF}}(x_n) = y_n$, i.e. $E_{\text{in}}(g_{\text{RBF}}) = 0$, yeah!! :-)

• called exact interpolation for function approximation
• but overfitting for learning? :-(

how about regularization? e.g. ridge regression for β instead:
optimal $\beta = (Z^T Z + \lambda I)^{-1} Z^T y$

seen Z before? $Z = [\text{Gaussian}(x_n, x_m)]$ = the Gaussian kernel matrix K

effect of regularization in different spaces:
• kernel ridge regression: $\beta = (K + \lambda I)^{-1} y$
• regularized full RBFNet: $\beta = (Z^T Z + \lambda I)^{-1} Z^T y$
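A minimal sketch contrasting the two regularized solutions above; λ and γ are illustrative hyperparameters, and the function names are assumptions.

```python
import numpy as np

def gaussian_matrix(X, gamma=1.0):
    """Z = [Gaussian(x_n, x_m)], which is also the Gaussian kernel matrix K."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-gamma * sq_dists)

def fit_regularized_full_rbf(X, y, gamma=1.0, lam=0.1):
    """Regularized full RBF Network: ridge regression in the RBF-transformed space,
    beta = (Z^T Z + lambda I)^{-1} Z^T y."""
    Z = gaussian_matrix(X, gamma)
    return np.linalg.solve(Z.T @ Z + lam * np.eye(len(y)), Z.T @ y)

def fit_kernel_ridge(X, y, gamma=1.0, lam=0.1):
    """Kernel ridge regression with the same matrix: beta = (K + lambda I)^{-1} y."""
    K = gaussian_matrix(X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)
```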

