Machine Learning Techniques
(機器學習技法)
Lecture 14: Radial Basis Function Network
Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)
Roadmap

1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

Lecture 13: Deep Learning
pre-training with denoising autoencoder (non-linear PCA) and fine-tuning with backprop for NNet with many layers

Lecture 14: Radial Basis Function Network
• RBF Network Hypothesis
• RBF Network Learning
• k-Means Algorithm
RBF Network Hypothesis

Gaussian SVM Revisited

g_{\text{SVM}}(x) = \text{sign}\Big( \sum_{\text{SV}} \alpha_n y_n \exp(-\gamma \|x - x_n\|^2) + b \Big)

Gaussian SVM: find \alpha_n to combine Gaussians centered at x_n; achieves large margin in infinite-dimensional space, remember? :-)

• Gaussian kernel: also called the Radial Basis Function (RBF) kernel
  • radial: depends only on the distance between x and the 'center' x_n
  • basis function: to be 'combined'
• let g_n(x) = y_n \exp(-\gamma \|x - x_n\|^2): then g_{\text{SVM}}(x) = \text{sign}\big( \sum_{\text{SV}} \alpha_n g_n(x) + b \big),
  a linear aggregation of selected radial hypotheses

Radial Basis Function (RBF) Network: linear aggregation of radial hypotheses
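To make the aggregation view concrete, here is a minimal NumPy sketch of evaluating g_SVM; it assumes the support vectors `sv_x`, coefficients `sv_alpha`, labels `sv_y`, bias `b`, and `gamma` all come from an already-trained Gaussian SVM (the names are illustrative, not from the lecture).

```python
import numpy as np

def g_svm(x, sv_x, sv_alpha, sv_y, b, gamma):
    """Gaussian SVM as a linear aggregation of radial hypotheses
    g_n(x) = y_n * exp(-gamma * ||x - x_n||^2), one per support vector."""
    radial = np.exp(-gamma * np.sum((sv_x - x) ** 2, axis=1))  # shape (#SV,)
    return np.sign(sv_alpha @ (sv_y * radial) + b)
```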
From Neural Network to RBF Network

[Figure: two network diagrams. Neural Network: inputs x_0 = 1, x_1, \ldots, x_d feed tanh hidden units through weights w_{ij}^{(1)}, combined by output weights w_{j1}^{(2)}. RBF Network: the same inputs feed RBF hidden units (centers), combined linearly (votes).]

• hidden layer different: (inner product + tanh) versus (distance + Gaussian)
• output layer same: just linear aggregation

RBF Network: historically a type of NNet
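To spell out the hidden-layer contrast, a minimal sketch of the two unit types (`w`, `mu`, and `gamma` stand in for a learned weight vector, a learned center, and the Gaussian width):

```python
import numpy as np

def nnet_unit(x, w):
    # NNet hidden unit: inner product with the weights, then tanh
    return np.tanh(w @ x)

def rbf_unit(x, mu, gamma):
    # RBF Network hidden unit: distance to a center, then a Gaussian
    return np.exp(-gamma * np.sum((x - mu) ** 2))
```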
RBF Network Hypothesis

h(x) = \text{Output}\Big( \sum_{m=1}^{M} \beta_m \, \text{RBF}(x, \mu_m) + b \Big)

key variables: centers \mu_m; (signed) votes \beta_m

[Figure: RBF Network with inputs x_0 = 1, x_1, \ldots, x_d feeding a hidden layer of RBF units (the centers), linearly combined (the votes) into the Output.]

g_{\text{SVM}} for the Gaussian SVM is the special case with:
• RBF: Gaussian; Output: sign (binary classification)
• M = \#SV; \mu_m: the support vectors x_m; \beta_m: \alpha_m y_m from the SVM dual

learning: given RBF and Output, decide \mu_m and \beta_m
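The hypothesis is easy to state in code. The following sketch treats `rbf` and `output` as given functions, and `centers`/`votes` as the quantities learning must decide; plugging in a Gaussian `rbf` and `np.sign` recovers the g_SVM special case above.

```python
import numpy as np

def rbf_network(x, centers, votes, b, rbf, output):
    """h(x) = Output( sum_{m=1}^{M} beta_m * RBF(x, mu_m) + b )."""
    scores = np.array([rbf(x, mu) for mu in centers])  # one score per center mu_m
    return output(votes @ scores + b)                  # linear aggregation, then Output
```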
RBF and Similarity

general similarity functions between x and x':
Neuron(x, x') = \tanh(\gamma x^T x' + 1)
DNASim(x, x') = \text{EditDistance}(x, x')

kernel: similarity via Z-space inner product (governed by Mercer's condition, remember? :-))
Poly(x, x') = (1 + x^T x')^2
Gaussian(x, x') = \exp(-\gamma \|x - x'\|^2)

RBF: similarity via X-space distance (often monotonically non-increasing in the distance)
Gaussian(x, x') = \exp(-\gamma \|x - x'\|^2), again: it is both a kernel and an RBF
Truncated(x, x') = [\![\, \|x - x'\| \le 1 \,]\!]\, (1 - \|x - x'\|)^2

RBF Network: distance-based similarity-to-centers as feature transform
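For concreteness, minimal implementations of the similarity functions named above; only the last two are radial, i.e. functions of \|x - x'\| alone (DNASim is omitted since it needs a string edit-distance routine):

```python
import numpy as np

def neuron_sim(x, xp, gamma):
    return np.tanh(gamma * (x @ xp) + 1)           # inner-product similarity, not radial

def poly_kernel(x, xp):
    return (1 + x @ xp) ** 2                       # kernel: Z-space inner product

def gaussian(x, xp, gamma):
    return np.exp(-gamma * np.sum((x - xp) ** 2))  # both a kernel and an RBF

def truncated_rbf(x, xp):
    d = np.linalg.norm(x - xp)
    return (1.0 - d) ** 2 if d <= 1 else 0.0       # RBF: depends on the distance only
```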
Fun Time

Which of the following is not a radial basis function?
1 \phi(x, \mu) = \exp(-\gamma \|x - \mu\|^2)
2 \phi(x, \mu) = -\sqrt{x^T x - 2 x^T \mu + \mu^T \mu}
3 \phi(x, \mu) = [\![\, x = \mu \,]\!]
4 \phi(x, \mu) = x^T x + \mu^T \mu

Reference Answer: 4

Note that 3 is an extreme case of 1 (Gaussian) with \gamma \to \infty, and 2 equals -\|x - \mu\| because x^T x - 2 x^T \mu + \mu^T \mu = \|x - \mu\|^2 :-).
RBF Network Learning

Full RBF Network

h(x) = \text{Output}\Big( \sum_{m=1}^{M} \beta_m \, \text{RBF}(x, \mu_m) \Big)

• full RBF Network: M = N and each \mu_m = x_m
• physical meaning: each x_m influences similar x by \beta_m
• e.g. uniform influence with \beta_m = 1 \cdot y_m for binary classification:

g_{\text{uniform}}(x) = \text{sign}\Big( \sum_{m=1}^{N} y_m \exp(-\gamma \|x - x_m\|^2) \Big)

that is, aggregate each example's opinion subject to similarity

full RBF Network: lazy way to decide \mu_m
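A minimal sketch of g_uniform for binary classification, assuming training inputs `X` (rows x_m) and labels `y` in {−1, +1}:

```python
import numpy as np

def g_uniform(x, X, y, gamma):
    """Full RBF Network with uniform influence beta_m = 1 * y_m:
    every training example votes, weighted by Gaussian similarity to x."""
    sim = np.exp(-gamma * np.sum((X - x) ** 2, axis=1))  # similarity to each x_m
    return np.sign(y @ sim)
```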
Nearest Neighbor

g_{\text{uniform}}(x) = \text{sign}\Big( \sum_{m=1}^{N} y_m \exp(-\gamma \|x - x_m\|^2) \Big)

• \exp(-\gamma \|x - x_m\|^2) is maximal when x is closest to x_m; the maximal term often dominates the \sum_{m=1}^{N}
• take the y_m of the maximal \exp(\cdot) instead of voting over all y_m: selection instead of aggregation
• physical meaning: g_{\text{nbor}}(x) = y_m such that x is closest to x_m, called the nearest-neighbor model
• can also uniformly aggregate k neighbors: k nearest neighbor

k nearest neighbor: also lazy but very intuitive
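A minimal k-nearest-neighbor sketch in the same setting, selecting via `argsort` rather than aggregating over all N (an odd `k` avoids ties for binary y):

```python
import numpy as np

def g_knn(x, X, y, k=1):
    """Uniformly aggregate the labels of the k training examples
    closest to x; k = 1 gives the nearest-neighbor model g_nbor."""
    dist = np.sum((X - x) ** 2, axis=1)  # squared distances to all x_m
    nearest = np.argsort(dist)[:k]       # indices of the k closest examples
    return np.sign(np.sum(y[nearest]))
```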
Interpolation by Full RBF Network

full RBF Network for squared-error regression (Output is just the identity here):

h(x) = \sum_{m=1}^{N} \beta_m \, \text{RBF}(x, x_m)

• just linear regression on the RBF-transformed data
  z_n = [\text{RBF}(x_n, x_1), \text{RBF}(x_n, x_2), \ldots, \text{RBF}(x_n, x_N)]
• optimal \beta? \beta = (Z^T Z)^{-1} Z^T y, if Z^T Z invertible, remember? :-)
• size of Z? N (examples) by N (centers): a symmetric square matrix
• theoretical fact: if the x_n are all different, Z with the Gaussian RBF is invertible

optimal \beta with invertible Z: \beta = Z^{-1} y
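A minimal sketch of the exact-interpolation solution: build the N-by-N Gaussian Z and solve Z \beta = y directly (`np.linalg.solve` rather than forming Z^{-1} explicitly), then check E_in = 0 on the training data.

```python
import numpy as np

def fit_full_rbf(X, y, gamma):
    """Full RBF Network regression: Z[n, m] = exp(-gamma * ||x_n - x_m||^2),
    a symmetric N-by-N matrix; beta = Z^{-1} y when the x_n are all different."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    Z = np.exp(-gamma * sq_dist)
    beta = np.linalg.solve(Z, y)  # numerically preferable to computing Z^{-1}
    return beta, Z

# sanity check: the network reproduces every training label exactly
# beta, Z = fit_full_rbf(X, y, gamma)
# assert np.allclose(Z @ beta, y)   # E_in = 0
```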
Regularized Full RBF Network

full Gaussian RBF Network for regression: \beta = Z^{-1} y

g_{\text{RBF}}(x_1) = \beta^T z_1 = y^T Z^{-1} (\text{first column of } Z) = y^T [1\ 0\ \ldots\ 0]^T = y_1

so g_{\text{RBF}}(x_n) = y_n for every n, i.e. E_{\text{in}}(g_{\text{RBF}}) = 0, yeah!! :-)

• called exact interpolation for function approximation
• but overfitting for learning? :-(
• how about regularization? e.g. ridge regression for \beta instead: optimal \beta = (Z^T Z + \lambda I)^{-1} Z^T y
• seen Z before? Z = [\text{Gaussian}(x_n, x_m)] = the Gaussian kernel matrix K

effect of regularization in different spaces:
kernel ridge regression: \beta = (K + \lambda I)^{-1} y;
regularized full RBFNet: \beta = (Z^T Z + \lambda I)^{-1} Z^T y
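A minimal sketch contrasting the two regularized solutions under the Gaussian RBF, where Z coincides with the kernel matrix K; this is a sketch only, and in practice \lambda would be chosen by validation.

```python
import numpy as np

def ridge_rbfnet(Z, y, lam):
    """Regularized full RBFNet: beta = (Z^T Z + lambda I)^{-1} Z^T y."""
    N = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(N), Z.T @ y)

def kernel_ridge(K, y, lam):
    """Kernel ridge regression: beta = (K + lambda I)^{-1} y.
    With the Gaussian RBF, K equals Z, yet the two betas differ:
    the regularization acts in different spaces."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)
```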