## Machine Learning Techniques (機器學習技法)

### Lecture 14: Radial Basis Function Network

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

## Roadmap

1. Embedding Numerous Features: Kernel Models
2. Combining Predictive Features: Aggregation Models
3. Distilling Implicit Features: Extraction Models

### Lecture 13: Deep Learning

pre-training with **denoising autoencoder** (non-linear PCA) and fine-tuning with **backprop** for NNet with **many layers**

### Lecture 14: Radial Basis Function Network

- RBF Network Hypothesis
- RBF Network Learning
- k-Means Algorithm
- k-Means and RBF Network in Action

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 1/24

## Gaussian SVM Revisited

g_SVM(x) = sign( Σ_{SV} α_n y_n exp(−γ‖x − x_n‖²) + b )

Gaussian SVM: find α_n to **combine Gaussians** centered at x_n; achieve large margin in **infinite-dimensional space, remember? :-)**

- Gaussian kernel: also called **Radial Basis Function (RBF) kernel**
- radial: depends only on the distance between x and the 'center' x_n
- basis function: to be 'combined'
- let g_n(x) = y_n exp(−γ‖x − x_n‖²): then g_SVM(x) = sign( Σ_{SV} α_n g_n(x) + b ) — a **linear aggregation** of **selected radial** hypotheses

**Radial Basis Function (RBF) Network:** linear aggregation of **radial** hypotheses

## From Neural Network to RBF Network

(figure: two network diagrams over inputs x_0 = 1, x_1, x_2, …, x_d — a Neural Network with tanh hidden units and weights w_ij^(1), w_j1^(2), and an RBF Network with RBF hidden units, **centers**, and **votes**)

- hidden layer different: (inner product + tanh) versus (distance + Gaussian)
- output layer same: **just linear aggregation**

RBF Network: historically **a type of NNet**

## RBF Network Hypothesis

h(x) = Output( Σ_{m=1}^{M} β_m RBF(x, µ_m) + b )

key variables: **centers** µ_m; (signed) **votes** β_m

(figure: RBF Network diagram over inputs x_0 = 1, x_1, …, x_d with RBF hidden units, **centers**, **votes**, and an Output unit)

g_SVM for the Gaussian SVM is one such network:

- RBF: Gaussian; Output: sign (binary classification)
- M = #SV; µ_m: the SVM support vectors x_m; β_m: α_m y_m from the SVM dual

learning: given **RBF** and **Output**, decide µ_m and β_m

## RBF and Similarity

general similarity functions between x and x′:

- Neuron(x, x′) = tanh(γxᵀx′ + 1)
- DNASim(x, x′) = EditDistance(x, x′)

kernel: similarity via Z-space inner product — governed by Mercer's condition, remember? :-)

- Poly(x, x′) = (1 + xᵀx′)²
- Gaussian(x, x′) = exp(−γ‖x − x′‖²)

RBF: similarity via X-space distance — often **monotonically non-increasing** in distance

- Gaussian(x, x′) = exp(−γ‖x − x′‖²)
- Truncated(x, x′) = ⟦‖x − x′‖ ≤ 1⟧ (1 − ‖x − x′‖)²

RBF Network: **distance-based similarity to centers** as **feature transform**
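As a small sketch (Python with NumPy; the function names and test points are illustrative, not from the lecture), the two radial basis functions above can be written directly from their formulas:

```python
import numpy as np

def gaussian_rbf(x, c, gamma=1.0):
    """Gaussian RBF: exp(-gamma * ||x - c||^2), decreasing in distance."""
    return np.exp(-gamma * np.sum((x - c) ** 2))

def truncated_rbf(x, c):
    """Truncated RBF: (1 - ||x - c||)^2 when ||x - c|| <= 1, else 0."""
    r = np.linalg.norm(x - c)
    return (1.0 - r) ** 2 if r <= 1.0 else 0.0

x, c = np.array([0.5, 0.0]), np.array([0.0, 0.0])
print(gaussian_rbf(x, c))    # exp(-0.25)
print(truncated_rbf(x, c))   # (1 - 0.5)^2 = 0.25
```

Both depend on x only through ‖x − c‖ — the defining property of a radial basis function.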

## Fun Time

Which of the following is not a radial basis function?

1. φ(x, µ) = exp(−γ‖x − µ‖²)
2. φ(x, µ) = −√(xᵀx − 2xᵀµ + µᵀµ)
3. φ(x, µ) = ⟦x = µ⟧
4. φ(x, µ) = xᵀx + µᵀµ

**Reference Answer: 4**

Note that 3 is an extreme case of 1 (Gaussian) with γ → ∞, and 2 contains a ‖x − µ‖ somewhere :-) (indeed, xᵀx − 2xᵀµ + µᵀµ = ‖x − µ‖², so 2 is just −‖x − µ‖).

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 6/24

## Full RBF Network

h(x) = Output( Σ_{m=1}^{M} β_m RBF(x, µ_m) )

- **full** RBF Network: M = N and each µ_m = x_m
- physical meaning: each x_m **influences** similar x by β_m
- e.g. uniform influence with β_m = 1 · y_m for binary classification:

g_uniform(x) = sign( Σ_{m=1}^{N} y_m exp(−γ‖x − x_m‖²) )

— aggregate each example's **opinion** subject to **similarity**

full RBF Network: **lazy** way to decide µ_m
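A minimal sketch of g_uniform in Python with NumPy (the function name, the γ = 1 default, and the toy data are illustrative assumptions, not from the lecture):

```python
import numpy as np

def g_uniform(x, X, y, gamma=1.0):
    """Full RBF Network with uniform votes beta_m = 1 * y_m:
    sign( sum_m y_m * exp(-gamma * ||x - x_m||^2) )."""
    sq_dist = np.sum((X - x) ** 2, axis=1)        # ||x - x_m||^2 for every center
    score = np.sum(y * np.exp(-gamma * sq_dist))  # aggregate all opinions
    return 1 if score >= 0 else -1

# toy data: every training example serves as a center (M = N, mu_m = x_m)
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.2, 2.9]])
y = np.array([-1, -1, 1, 1])
print(g_uniform(np.array([0.1, 0.0]), X, y))   # near the -1 cluster -> -1
print(g_uniform(np.array([3.1, 3.0]), X, y))   # near the +1 cluster -> 1
```

Note that nothing is 'trained' here — the lazy full RBF Network simply stores the data and defers all work to prediction time.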

## Nearest Neighbor

g

_{uniform}

(x) = sign
### N

### X

### m=1

### y _{m} exp

### −γkx **− x** m k ^{2}

!

### • exp −γkx **− x** m k ^{2}

:

### maximum

when**x** **closest to** **x** _{m}

—maximum oneoften dominates the

### P N

### m=1

term### •

take### y m

of**maximum** exp(. . .)

instead of### voting

of all### y m

—selectioninstead of

**aggregation**

### •

physical meaning:g

_{nbor}

(x) =### y m

such that**x** **closest to** **x** _{m}

—called

**nearest neighbor**

model
### •

can**uniformly aggregate k** neighbors

also:### k **nearest neighbor** k nearest neighbor:

also

**lazy**

but**very intuitive**

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 8/24
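The selection-instead-of-aggregation idea can be sketched as follows (Python with NumPy; `knn_predict` and the toy data are illustrative assumptions — with k = 1 this is exactly g_nbor):

```python
import numpy as np

def knn_predict(x, X, y, k=3):
    """k nearest neighbor: select the k training examples closest to x
    and uniformly aggregate (vote over) their labels only."""
    sq_dist = np.sum((X - x) ** 2, axis=1)   # distance to every x_m
    nearest = np.argsort(sq_dist)[:k]        # indices of the k closest centers
    return 1 if np.sum(y[nearest]) >= 0 else -1

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2], [3.0, 3.0], [3.2, 2.9]])
y = np.array([-1, -1, -1, 1, 1])
print(knn_predict(np.array([0.1, 0.1]), X, y, k=3))   # three -1 neighbors -> -1
```

Like the full RBF Network, this is lazy: all computation happens at prediction time.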

## Interpolation by Full RBF Network

full RBF Network for squared-error regression (no Output transform needed):

h(x) = Σ_{m=1}^{N} β_m RBF(x, x_m)

- just linear regression on **RBF-transformed data**: z_n = [RBF(x_n, x_1), RBF(x_n, x_2), …, RBF(x_n, x_N)]
- optimal β? β = (ZᵀZ)⁻¹Zᵀy, if ZᵀZ invertible, **remember? :-)**
- size of Z? N (examples) by N (centers) — a symmetric square matrix
- theoretical fact: if the x_n are all different, Z with Gaussian RBF is **invertible**

optimal β with invertible Z: β = Z⁻¹y
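A numeric check of β = Z⁻¹y (Python with NumPy; the helper name, γ = 1, and the toy data are illustrative assumptions):

```python
import numpy as np

def gaussian_rbf_matrix(X, centers, gamma=1.0):
    """Z[n, m] = exp(-gamma * ||x_n - mu_m||^2)."""
    sq = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-gamma * sq)

X = np.array([[0.0], [1.0], [2.5], [4.0]])   # all x_n distinct => Z invertible
y = np.array([0.0, 1.0, -1.0, 0.5])
Z = gaussian_rbf_matrix(X, X)                # N x N symmetric square matrix
beta = np.linalg.solve(Z, y)                 # beta = Z^{-1} y
print(np.allclose(Z @ beta, y))              # True: h(x_n) = y_n exactly
```

The fit passes through every training point — exact interpolation, as the next slide discusses.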

## Regularized Full RBF Network

full Gaussian RBF Network for regression:

### β

=### Z ^{−1} **y** g

RBF### (x

_{1}

### ) = β

^{T}

**z**

_{1}

### = **y**

^{T}

### Z

^{−1}

### (first column of Z) = **y**

^{T}

### 1 0 . . . 0

T### = y

_{1}

—g_{RBF}(x

_{n}

) =y_{n}

, i.e. E_{in}

(g_{RBF}) =0,

**yeah!! :-)**

### •

called**exact interpolation**

for**function approximation**

### •

but**overfitting for learning? :-(**

### •

how about**regularization? e.g.** **ridge**

regression for### β

instead—optimal

### β

= (Z^{T} Z

+### λI) ^{−1} Z ^{T} **y**

### •

seen### Z? Z

= [Gaussian(x_{n}

,**x** _{m}

)] =Gaussian kernel matrix### K

effect of### regularization

in different spaces:kernel

### ridge

regression:### β

= (K+### λI) ^{−1} **y;**

### regularized

full RBFNet:### β

= (Z^{T} Z

+### λI) ^{−1} Z ^{T} **y**

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 10/24
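The ridge variant is a one-line change from the interpolation version (Python with NumPy; helper name, γ = 1, λ = 0.1, and toy data are illustrative assumptions):

```python
import numpy as np

def gaussian_rbf_matrix(X, centers, gamma=1.0):
    """Z[n, m] = exp(-gamma * ||x_n - mu_m||^2)."""
    sq = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-gamma * sq)

X = np.array([[0.0], [1.0], [2.5], [4.0]])
y = np.array([0.0, 1.0, -1.0, 0.5])
Z = gaussian_rbf_matrix(X, X)
lam = 0.1
# regularized full RBFNet: beta = (Z^T Z + lambda I)^{-1} Z^T y
beta = np.linalg.solve(Z.T @ Z + lam * np.eye(len(X)), Z.T @ y)
# predictions no longer interpolate exactly -- the fit is smoothed
print(Z @ beta)
```

With λ > 0 the predictions pull away from the y_n, trading E_in for smoothness.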

## Fewer Centers as Regularization

recall: g_SVM(x) = sign( Σ_{SV} α_m y_m exp(−γ‖x − x_m‖²) + b )

— only the SVs (often far fewer than N) are needed in the 'network'

- next: M ≪ N instead of M = N
- effect: **regularization** by constraining the **number of centers and voting weights**
- physical meaning of the centers µ_m: **prototypes**

remaining question: how to extract the **prototypes**?

## Fun Time

If

**x** _{1}

=**x** _{2}

, what happens in the### Z

matrix of full Gaussian RBF network?### 1

the first two rows of the matrix are the same### 2

the first two columns of the matrix are different### 3

the matrix is invertible### 4

the sub-matrix at the intersection of the first two rows and the first two columns contains a constant of 0### Reference Answer: 1

It is easy to see that the first two rows must be the same; so must the first two columns. The two same rows makes the matrix singular; the sub-matrix in 4 contains a constant of 1 = exp(−0) instead of 0.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 12/24

## Fun Time

If

**x** _{1}

=**x** _{2}

, what happens in the### Z

matrix of full Gaussian RBF network?### 1

the first two rows of the matrix are the same### 2

the first two columns of the matrix are different### 3

the matrix is invertible### 4

the sub-matrix at the intersection of the first two rows and the first two columns contains a constant of 0### Reference Answer: 1

It is easy to see that the first two rows must be the same; so must the first two columns. The two same rows makes the matrix singular; the sub-matrix in 4 contains a constant of 1 = exp(−0) instead of 0.

Radial Basis Function Network k -Means Algorithm

## Good Prototypes: Clustering Problem

if x_1 ≈ x_2, there is **no need** for both RBF(x, x_1) and RBF(x, x_2) in the RBFNet ⇒ **cluster** x_1 and x_2 by **one prototype** µ ≈ x_1 ≈ x_2

- **clustering** with **prototypes**:
  - **partition** {x_n} into disjoint sets S_1, S_2, …, S_M
  - **choose** µ_m for each S_m
  - hope: x_1, x_2 both ∈ S_m ⇔ µ_m ≈ x_1 ≈ x_2
- cluster error with squared error measure:

E_in(S_1, …, S_M; µ_1, …, µ_M) = (1/N) Σ_{n=1}^{N} Σ_{m=1}^{M} ⟦x_n ∈ S_m⟧ ‖x_n − µ_m‖²

goal: with S_1, …, S_M being a partition of {x_n},

min_{S_1,…,S_M; µ_1,…,µ_M} E_in(S_1, …, S_M; µ_1, …, µ_M)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 13/24

## Partition Optimization

with S_1, …, S_M being a partition of {x_n},

min_{S_1,…,S_M; µ_1,…,µ_M} Σ_{n=1}^{N} Σ_{m=1}^{M} ⟦x_n ∈ S_m⟧ ‖x_n − µ_m‖²

- **hard to optimize: joint** combinatorial-numerical optimization
- **two sets** of variables: will optimize **alternatingly**

if µ_1, …, µ_M are **fixed**, then for each x_n:

- ⟦x_n ∈ S_m⟧: choose one and only one subset
- ‖x_n − µ_m‖²: distance to each **prototype**

optimal chosen subset S_m = the one with **minimum** ‖x_n − µ_m‖²

for given µ_1, …, µ_M, each x_n is 'optimally partitioned' using its **closest** µ_m

## Prototype Optimization

with S_1, …, S_M being a partition of {x_n},

min_{S_1,…,S_M; µ_1,…,µ_M} Σ_{n=1}^{N} Σ_{m=1}^{M} ⟦x_n ∈ S_m⟧ ‖x_n − µ_m‖²

- **hard to optimize: joint** combinatorial-numerical optimization
- **two sets** of variables: will optimize **alternatingly**

if S_1, …, S_M are **fixed**, this is just **unconstrained optimization** for each µ_m:

∇_{µ_m} E_in = −2 Σ_{n=1}^{N} ⟦x_n ∈ S_m⟧ (x_n − µ_m) = −2( Σ_{x_n ∈ S_m} x_n − |S_m| µ_m )

optimal prototype µ_m = **average** of the x_n within S_m

for given S_1, …, S_M, each µ_m is 'optimally computed' as the **consensus** within S_m

## k-Means Algorithm

use k **prototypes** instead of M, historically (different from k nearest neighbor, though)

### k-Means Algorithm

1. initialize µ_1, µ_2, …, µ_k: say, as k randomly chosen x_n
2. **alternating optimization** of E_in: repeatedly
   1. optimize S_1, S_2, …, S_k: each x_n 'optimally partitioned' using its closest µ_i
   2. optimize µ_1, µ_2, …, µ_k: each µ_m 'optimally computed' as the consensus within S_m

   until **converged**

converged: no change of S_1, S_2, …, S_k anymore — guaranteed, as E_in **decreases** during alternating minimization

k-Means: the most popular **clustering** algorithm, through **alternating minimization**
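The two alternating steps can be sketched compactly (Python with NumPy; `kmeans`, the seed, and the toy clusters are illustrative assumptions):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Alternating minimization of E_in: partition step, then prototype step."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # init: k random examples
    for _ in range(iters):
        # optimize S_1..S_k: assign each x_n to its closest prototype
        d = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)
        assign = np.argmin(d, axis=1)
        # optimize mu_1..mu_k: each prototype = mean (consensus) of its cluster
        new_mu = np.array([X[assign == m].mean(axis=0) if np.any(assign == m)
                           else mu[m] for m in range(k)])
        if np.allclose(new_mu, mu):                     # converged: no change
            break
        mu = new_mu
    return mu, assign

# two obvious clusters around (0, 0) and (5, 5)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
mu, assign = kmeans(X, k=2)
print(mu)   # prototypes near (0.1, 0.1) and (5.0, 5.0)
```

Each step can only decrease E_in, which is why the alternation is guaranteed to converge.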

## RBF Network Using k-Means

### RBF Network Using k-Means

1. run **k-Means** with k = M to get {µ_m}
2. construct the transform Φ(x) from the RBF (say, Gaussian) at each µ_m: Φ(x) = [RBF(x, µ_1), RBF(x, µ_2), …, RBF(x, µ_M)]
3. run a **linear model** on {(Φ(x_n), y_n)} to get β
4. return g_RBFNET(x) = LinearHypothesis(β, Φ(x))

- uses **unsupervised learning (k-Means)** to assist the **feature transform** — like the **autoencoder**
- parameters: M (number of prototypes), RBF (such as γ of the Gaussian)

RBF Network: a simple (old-fashioned) model

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 17/24
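The four steps above can be sketched end to end (Python with NumPy; all names, the ridge-regression choice of linear model, λ, γ, and the toy data are illustrative assumptions):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Step 1: extract k prototypes via alternating minimization."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(
            np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2), axis=1)
        mu = np.array([X[assign == m].mean(axis=0) if np.any(assign == m)
                       else mu[m] for m in range(k)])
    return mu

def rbf_transform(X, mu, gamma):
    """Step 2: Phi(x) = [RBF(x, mu_1), ..., RBF(x, mu_M)], Gaussian RBF."""
    return np.exp(-gamma * np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2))

def train_rbfnet(X, y, M, gamma=1.0, lam=1e-3):
    mu = kmeans(X, M)                          # step 1: centers
    Phi = rbf_transform(X, mu, gamma)          # step 2: feature transform
    # step 3: linear model (here ridge regression) on the transformed data
    beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)
    # step 4: the final hypothesis g(x) = beta^T Phi(x)
    return lambda Xq: rbf_transform(Xq, mu, gamma) @ beta

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
g = train_rbfnet(X, y, M=2)
print(np.sign(g(X)))   # recovers the training labels
```

Note how k-Means runs without looking at y — the unsupervised step only shapes the feature transform.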

## Fun Time

For k-Means, consider examples x_n ∈ ℝ² such that all x_{n,1} and x_{n,2} are non-zero. When fixing two prototypes µ_1 = [1, 1] and µ_2 = [−1, 1], which of the following sets is the optimal S_1?

1. {x_n : x_{n,1} > 0}
2. {x_n : x_{n,1} < 0}
3. {x_n : x_{n,2} > 0}
4. {x_n : x_{n,2} < 0}

**Reference Answer: 1**

Note that S_1 contains the examples that are closer to µ_1 than to µ_2.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 18/24

## Beauty of k-Means

(figure: a sequence of k-Means iterations on a toy data set with k = 4, alternating partition and prototype updates until convergence)

usually works well with **proper k and initialization**

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 19/24

## Difficulty of k-Means

(figure: clustering results with k = 2, k = 4, and k = 7 under different initializations)

**'sensitive' to** k and initialization

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 20/24

## RBF Network Using k-Means

(figure: decision boundaries of RBF Networks with k = 2, k = 4, and k = 7 centers)

reasonable performance with **proper centers**

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 21/24

## Full RBF Network

(figure: full RBF Network with k = N and λ = 0.001, versus k = 4 and nearest neighbor)

**full RBF Network: generally less useful**

## Fun Time

When coupled with ridge linear regression, which of the following RBF Networks is 'most regularized'?

1. small M and small λ
2. small M and large λ
3. large M and small λ
4. large M and large λ

**Reference Answer: 2**

small M: fewer weights, more regularized; large λ: shorter β, more regularized.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 23/24

## Summary

1. Embedding Numerous Features: Kernel Models
2. Combining Predictive Features: Aggregation Models
3. Distilling Implicit Features: Extraction Models

### Lecture 14: Radial Basis Function Network

- RBF Network Hypothesis: **prototypes instead of neurons as transform**
- RBF Network Learning: **linear aggregation of prototype 'hypotheses'**
- k-Means Algorithm: **clustering with alternating optimization**
- k-Means and RBF Network in Action: **proper choice of # prototypes important**

**next: extracting features from abstract data**