Machine Learning Foundations (機器學習基石)
Lecture 12: Nonlinear Transformation
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering,
National Taiwan University (國立台灣大學資訊工程系)
Roadmap

1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
  Lecture 11: Linear Models for Classification
  binary classification via (logistic) regression; multiclass via OVA/OVO decomposition
  Lecture 12: Nonlinear Transformation
  • Quadratic Hypotheses
  • Nonlinear Transform
  • Price of Nonlinear Transform
  • Structured Hypothesis Sets
4 How Can Machines Learn Better?
Quadratic Hypotheses

Linear Hypotheses

up to now: linear hypotheses
• visually: 'line'-like boundary
• mathematically: linear score s = w^T x
• theoretically: d_VC under control :-)
• practically: on some D, large E_in for every line :-(

but limited . . . how to break the limit of linear hypotheses?
Circular Separable

• D not linear separable
• but circular separable by a circle of radius √0.6 centered at the origin:
  h_SEP(x) = sign(−x1^2 − x2^2 + 0.6)

re-derive Circular-PLA, Circular-Regression, blah blah . . . all over again? :-)
Circular Separable and Linear Separable

h(x) = sign( 0.6·1 + (−1)·x1^2 + (−1)·x2^2 )
     = sign( w̃0·z0 + w̃1·z1 + w̃2·z2 )
     = sign( w̃^T z )

with w̃ = (w̃0, w̃1, w̃2) = (0.6, −1, −1) and z = (z0, z1, z2) = (1, x1^2, x2^2).

• {(xn, yn)} circular separable =⇒ {(zn, yn)} linear separable
• x ∈ X ↦ z ∈ Z: the (nonlinear) feature transform Φ

circular separable in X =⇒ linear separable in Z; vice versa?
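As a quick sanity check (a sketch added here, not from the slides; the sampling setup is an assumption for illustration), the transform z = (1, x1^2, x2^2) indeed makes circularly separable data linearly separable, with exactly the weights w̃ = (0.6, −1, −1) from above:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.sign(0.6 - X[:, 0]**2 - X[:, 1]**2)  # circularly separable labels

# feature transform Phi: x -> z = (1, x1^2, x2^2)
Z = np.column_stack([np.ones(len(X)), X[:, 0]**2, X[:, 1]**2])

w_tilde = np.array([0.6, -1.0, -1.0])  # the weights from the slide
print(bool(np.all(np.sign(Z @ w_tilde) == y)))  # True: linearly separable in Z
```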
Linear Hypotheses in Z-Space

(z0, z1, z2) = z = Φ(x) = (1, x1^2, x2^2)
h(x) = h̃(z) = sign(w̃^T Φ(x)) = sign(w̃0 + w̃1 x1^2 + w̃2 x2^2), with w̃ = (w̃0, w̃1, w̃2)

• (0.6, −1, −1): circle (◦ inside)
• (−0.6, +1, +1): circle (◦ outside)
• (0.6, −1, −2): ellipse
• (0.6, −1, +2): hyperbola
• (0.6, +1, +2): constant ◦ :-)

lines in Z-space ⇐⇒ special quadratic curves in X-space
General Quadratic Hypothesis Set

a 'bigger' Z-space with Φ2(x) = (1, x1, x2, x1^2, x1 x2, x2^2)

perceptrons in Z-space ⇐⇒ quadratic hypotheses in X-space

H_Φ2 = { h(x) : h(x) = h̃(Φ2(x)) for some linear h̃ on Z }

• can implement all possible quadratic curve boundaries: circle, ellipse, rotated ellipse, hyperbola, parabola, . . .
  e.g. ellipse 2(x1 + x2 − 3)^2 + (x1 − x2 − 4)^2 = 1 ⇐= w̃^T = [33, −20, −4, 3, 2, 3]
• include lines and constants as degenerate cases

next: learn a good quadratic hypothesis g
Fun Time

Using the transform Φ2(x) = (1, x1, x2, x1^2, x1 x2, x2^2), which of the following weights w̃^T in the Z-space implements the parabola 2 x1^2 + x2 = 1?
1 [−1, 2, 1, 0, 0, 0]
2 [0, 2, 1, 0, −1, 0]
3 [−1, 0, 1, 2, 0, 0]
4 [−1, 2, 0, 0, 0, 1]

Reference Answer: 3

Too simple, uh? :-) Flexibility to implement arbitrary quadratic curves opens new possibilities for minimizing E_in!
Nonlinear Transform
Good Quadratic Hypothesis

Z-space ⇐⇒ X-space
perceptrons ⇐⇒ quadratic hypotheses
good perceptron ⇐⇒ good quadratic hypothesis
separating perceptron ⇐⇒ separating quadratic hypothesis

• want: get good perceptron in Z-space
• known: get good perceptron in X-space with data {(xn, yn)}
• todo: get good perceptron in Z-space with data {(zn = Φ2(xn), yn)}
The Nonlinear Transform Steps

1 transform original data {(xn, yn)} to {(zn = Φ(xn), yn)} by Φ
2 get a good perceptron w̃ using {(zn, yn)} and your favorite linear classification algorithm A
3 return g(x) = sign(w̃^T Φ(x))
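The three steps can be sketched in NumPy (an illustrative addition, not from the slides), here with the circular target from earlier and plain linear regression standing in for the linear algorithm A; the data-generation details are assumptions:

```python
import numpy as np

def phi2(X):
    # quadratic transform Phi_2(x) = (1, x1, x2, x1^2, x1*x2, x2^2), row-wise
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(0.6 - X[:, 0]**2 - X[:, 1]**2)   # circularly separable labels

# step 1: transform {(xn, yn)} to {(zn, yn)}
Z = phi2(X)

# step 2: run a linear algorithm A in Z-space (here: regression weights via pseudo-inverse)
w_tilde = np.linalg.pinv(Z) @ y

# step 3: final hypothesis g(x) = sign(w_tilde^T Phi_2(x))
E_in = np.mean(np.sign(Z @ w_tilde) != y)
print(E_in)  # small, since the target boundary is itself quadratic
```

Swapping in PLA, pocket, or logistic regression for step 2 needs no change to steps 1 and 3.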
Nonlinear Model via Nonlinear Φ + Linear Models

two choices:
• feature transform Φ
• linear model A, not just binary classification

Pandora's box :-): can now freely do quadratic PLA, quadratic regression, cubic regression, . . ., polynomial regression
Feature Transform Φ

[figure: digit data plotted by average intensity and symmetry, '1' versus 'not 1']

not new, not just polynomial:
raw features (pixels) −→ concrete features (intensity, symmetry), via domain knowledge

the force, too good to be true? :-)
Fun Time

Consider the quadratic transform Φ2(x) for x ∈ R^d instead of in R^2. The transform should include all different quadratic, linear, and constant terms formed by (x1, x2, . . . , xd). What is the number of dimensions of z = Φ2(x)?
1 d^2
2 d^2/2 + 3d/2 + 1
3 d^2 + d + 1
4 2d

Reference Answer: 2

The number of different quadratic terms is (d^2 + d)/2; the number of different linear terms is d; the number of constant terms is 1.
Price of Nonlinear Transform
Computation/Storage Price

Q-th order polynomial transform:
Φ_Q(x) = (1, x1, x2, . . . , xd, x1^2, x1 x2, . . . , xd^2, . . . , x1^Q, x1^{Q−1} x2, . . . , xd^Q)
        = 1 (for w̃0) + d̃ (others) dimensions

d̃ = # ways of ≤ Q-combination from d kinds with repetitions
  = C(Q+d, Q) = C(Q+d, d) = O(Q^d)
  = efforts needed for computing/storing z = Φ_Q(x) and w̃

Q large =⇒ difficult to compute/store
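The counting above can be checked in a few lines (an added sketch using Python's math.comb):

```python
from math import comb

def d_tilde(Q, d):
    # monomials of degree <= Q in d variables (combinations with repetition)
    # total C(Q+d, d), minus 1 for the constant term handled by w~0
    return comb(Q + d, d) - 1

print(d_tilde(2, 2))   # 5: Phi_2 on R^2 gives (x1, x2, x1^2, x1*x2, x2^2)
print(d_tilde(50, 2))  # 1325
```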
Model Complexity Price

Q-th order polynomial transform:
Φ_Q(x) = (1, x1, x2, . . . , xd, x1^2, x1 x2, . . . , xd^2, . . . , x1^Q, x1^{Q−1} x2, . . . , xd^Q)
        = 1 (for w̃0) + d̃ (others) dimensions, d̃ = O(Q^d)

• number of free parameters w̃i: d̃ + 1 ≈ d_VC(H_ΦQ)
• d_VC(H_ΦQ) ≤ d̃ + 1, why?
  =⇒ any d̃ + 2 inputs not shattered in Z =⇒ any d̃ + 2 inputs not shattered in X

Q large =⇒ large d_VC
Generalization Issue

which one do you prefer, Φ1 (original x) or Φ4? :-)
• Φ1: 'visually' preferred
• Φ4: E_in(g) = 0, but overkill

1 can we make sure that E_out(g) is close enough to E_in(g)?
2 can we make E_in(g) small enough?

trade-off on d̃(Q):
  higher d̃(Q): (1) :-( , (2) :-D
  lower d̃(Q):  (1) :-D , (2) :-(

how to pick Q? visually, maybe?
Danger of Visual Choices

first of all, can you really 'visualize' when X = R^10? (well, I can't :-))

visualizing X = R^2:
• full Φ2: z = (1, x1, x2, x1^2, x1 x2, x2^2), d_VC = 6
• or z = (1, x1^2, x2^2), d_VC = 3, after visualizing?
• or better z = (1, x1^2 + x2^2), d_VC = 2?
• or even better z = (sign(0.6 − x1^2 − x2^2))?
—careful about your brain's 'model complexity'

for VC-safety, Φ shall be decided without 'peeking' at data
Fun Time

Consider the Q-th order polynomial transform Φ_Q(x) for x ∈ R^2. Recall that d̃ = C(Q+2, 2) − 1. When Q = 50, what is the value of d̃?
1 1126
2 1325
3 2651
4 6211

Reference Answer: 2

It's just a simple calculation, but it shows you how d̃ becomes hundreds of times of d = 2 after the transform.
Structured Hypothesis Sets
Polynomial Transform Revisited

Φ0(x) = (1)
Φ1(x) = (Φ0(x), x1, x2, . . . , xd)
Φ2(x) = (Φ1(x), x1^2, x1 x2, . . . , xd^2)
Φ3(x) = (Φ2(x), x1^3, x1^2 x2, . . . , xd^3)
. . .
Φ_Q(x) = (Φ_{Q−1}(x), x1^Q, x1^{Q−1} x2, . . . , xd^Q)

H_Φ0 ⊂ H_Φ1 ⊂ H_Φ2 ⊂ H_Φ3 ⊂ . . . ⊂ H_ΦQ
  ‖      ‖      ‖      ‖              ‖
  H0     H1     H2     H3    . . .    HQ

structure: nested Hi
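The degree-by-degree construction above can be mirrored directly in code (an illustrative sketch added here; the ordering of same-degree terms is an assumption):

```python
from itertools import combinations_with_replacement
import numpy as np

def poly_transform(x, Q):
    # Phi_Q(x): all monomials of degree <= Q, built degree by degree,
    # mirroring Phi_q = (Phi_{q-1}, degree-q terms)
    feats = [1.0]
    for q in range(1, Q + 1):
        for idx in combinations_with_replacement(range(len(x)), q):
            feats.append(np.prod([x[i] for i in idx]))
    return np.array(feats)

z = poly_transform(np.array([2.0, 3.0]), 2)
print(z)  # (1, x1, x2, x1^2, x1*x2, x2^2) = (1, 2, 3, 4, 6, 9)
```

Since each Φ_q extends Φ_{q−1}, the first C(q+d, d) entries of Φ_Q(x) are exactly Φ_q(x), which is the nesting that makes the hypothesis sets structured.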
Structured Hypothesis Sets

Let g_i = argmin_{h ∈ Hi} E_in(h):

H0 ⊂ H1 ⊂ H2 ⊂ H3 ⊂ . . .
d_VC(H0) ≤ d_VC(H1) ≤ d_VC(H2) ≤ d_VC(H3) ≤ . . .
E_in(g0) ≥ E_in(g1) ≥ E_in(g2) ≥ E_in(g3) ≥ . . .

[figure: error versus VC dimension d_VC — in-sample error decreases, model complexity increases, out-of-sample error is minimized at some d_VC*]

using H1126 won't be good! :-(
Linear Model First

• tempting sin: use H1126, low E_in(g1126), to fool your boss—really? :-( a dangerous path of no return
• safe route: H1 first
  • if E_in(g1) good enough, live happily thereafter :-)
  • otherwise, move right of the curve, with nothing lost except 'wasted' computation

linear model first: simple, efficient, safe, and workable!
Fun Time

Consider two hypothesis sets, H1 and H1126, where H1 ⊂ H1126. Which of the following relationships between d_VC(H1) and d_VC(H1126) is not possible?
1 d_VC(H1) = d_VC(H1126)
2 d_VC(H1) ≠ d_VC(H1126)
3 d_VC(H1) < d_VC(H1126)
4 d_VC(H1) > d_VC(H1126)

Reference Answer: 4

Every input combination that H1 shatters can be shattered by H1126, so d_VC cannot decrease.