Machine Learning Foundations (機器學習基石)
Lecture 14: Regularization
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?

Lecture 13: Hazard of Overfitting
overfitting happens with excessive power, stochastic/deterministic noise, and limited data
Lecture 14: Regularization
Regularized Hypothesis Set
Weight Decay Regularization
Regularization and VC Theory
General Regularizers
Regularization: The Magic

[figure: the same data and target, fit once with an overfit curve and once with a ‘regularized fit’]

• idea: ‘step back’ from H10 to H2
  (nested hypothesis sets: H0 ⊂ H1 ⊂ H2 ⊂ H3 ⊂ · · ·)
• name history: function approximation for ill-posed problems

how to step back?
Stepping Back as Constraint

nested hypothesis sets: H0 ⊂ H1 ⊂ H2 ⊂ H3 ⊂ · · ·

Q-th order polynomial transform for x ∈ R: Φ_Q(x) = (1, x, x², . . . , x^Q)
+ linear regression, denoting w̃ by w

hypothesis w in H10: w_0 + w_1 x + w_2 x² + w_3 x³ + . . . + w_10 x^10
hypothesis w in H2:  w_0 + w_1 x + w_2 x²

that is, H2 = H10 AND ‘constraint that w_3 = w_4 = . . . = w_10 = 0’

step back = constraint
Regression with Constraint

H10 ≡ { w ∈ R^{10+1} }
regression with H10: min_{w ∈ R^{10+1}} E_in(w)

H2 ≡ { w ∈ R^{10+1} while w_3 = w_4 = . . . = w_10 = 0 }
regression with H2: min_{w ∈ R^{10+1}} E_in(w)  s.t.  w_3 = w_4 = . . . = w_10 = 0

step back = constrained optimization of E_in

why don’t you just use w ∈ R^{2+1}? :-)
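As a quick illustration of the constraint view, here is a minimal sketch with assumed toy data (not from the lecture): fitting in H2, i.e. degree-10 polynomial regression with w_3 = . . . = w_10 forced to 0, amounts to plain linear regression on the features (1, x, x²), which is exactly the ‘why not just use w ∈ R^{2+1}’ point.

```python
import numpy as np

# minimal sketch, assumed toy data: regression in H2 (w_3 = ... = w_10 = 0)
# is just linear regression on the first three polynomial features (1, x, x^2)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(20)

Z10 = np.vander(x, N=11, increasing=True)    # features (1, x, x^2, ..., x^10) for H10
Z2 = Z10[:, :3]                              # keep only (1, x, x^2) for H2

w2, *_ = np.linalg.lstsq(Z2, y, rcond=None)  # unconstrained fit over the 3 free weights
w_reg_h2 = np.concatenate([w2, np.zeros(8)]) # pad back to R^{10+1}; w_3..w_10 are 0
print(w_reg_h2)
```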
Regression with Looser Constraint

H2 ≡ { w ∈ R^{10+1} while w_3 = . . . = w_10 = 0 }
regression with H2: min_{w ∈ R^{10+1}} E_in(w)  s.t.  w_3 = . . . = w_10 = 0

H2′ ≡ { w ∈ R^{10+1} while ≥ 8 of w_q = 0 }
regression with H2′: min_{w ∈ R^{10+1}} E_in(w)  s.t.  Σ_{q=0}^{10} [[w_q ≠ 0]] ≤ 3

• more flexible than H2: H2 ⊂ H2′
• less risky than H10: H2′ ⊂ H10

bad news for sparse hypothesis set H2′: NP-hard to solve :-(
Regression with Softer Constraint

H2′ ≡ { w ∈ R^{10+1} while ≥ 8 of w_q = 0 }
regression with H2′: min_{w ∈ R^{10+1}} E_in(w)  s.t.  Σ_{q=0}^{10} [[w_q ≠ 0]] ≤ 3

H(C) ≡ { w ∈ R^{10+1} while ‖w‖² ≤ C }
regression with H(C): min_{w ∈ R^{10+1}} E_in(w)  s.t.  Σ_{q=0}^{10} w_q² ≤ C

• H(C): overlaps with, but is not exactly the same as, H2′
• soft and smooth structure over C ≥ 0:
  H(0) ⊂ H(1.126) ⊂ . . . ⊂ H(1126) ⊂ . . . ⊂ H(∞) = H10

regularized hypothesis w_REG: optimal solution from regularized hypothesis set H(C)
Fun Time
For Q ≥ 1, which of the following hypotheses (weight vector w ∈ R^{Q+1}) is not in the regularized hypothesis set H(1)?
1 wᵀ = [0, 0, . . . , 0]
2 wᵀ = [1, 0, . . . , 0]
3 wᵀ = [1, 1, . . . , 1]
4 wᵀ = [√(1/(Q+1)), √(1/(Q+1)), . . . , √(1/(Q+1))]

Reference Answer: 3
The squared length of w in 3 is Q + 1, which is not ≤ 1.
Matrix Form of Regularized Regression Problem

min_{w ∈ R^{Q+1}} E_in(w) = (1/N) Σ_{n=1}^{N} (wᵀz_n − y_n)²  =  (1/N)(Zw − y)ᵀ(Zw − y)
s.t.  Σ_{q=0}^{Q} w_q²  =  wᵀw ≤ C

• Σ_n . . . = (Zw − y)ᵀ(Zw − y), remember? :-)
• wᵀw ≤ C: feasible w lies within a radius-√C hypersphere

how to solve the constrained optimization problem?
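A tiny sanity check of the matrix form (toy values assumed, for illustration only): the summation form of E_in and (1/N)(Zw − y)ᵀ(Zw − y) agree.

```python
import numpy as np

# tiny sanity check with assumed toy values: summation form of E_in equals
# the matrix form (1/N) (Zw - y)^T (Zw - y)
rng = np.random.default_rng(3)
Z = rng.standard_normal((5, 3))   # N = 5 examples, Q + 1 = 3 features
y = rng.standard_normal(5)
w = rng.standard_normal(3)

e_in_sum = np.mean([(w @ z_n - y_n) ** 2 for z_n, y_n in zip(Z, y)])
e_in_mat = (Z @ w - y) @ (Z @ w - y) / len(y)
assert np.isclose(e_in_sum, e_in_mat)
```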
The Lagrange Multiplier

min_{w ∈ R^{Q+1}} E_in(w) = (1/N)(Zw − y)ᵀ(Zw − y)  s.t.  wᵀw ≤ C

• decreasing direction: −∇E_in(w), remember? :-)
• normal vector of wᵀw = C: w
• if −∇E_in(w) and w are not parallel: can decrease E_in(w) without violating the constraint
• at the optimal solution w_REG: −∇E_in(w_REG) ∝ w_REG

[figure: the constraint wᵀw = C and a contour E_in = const., with w_LIN, the normal w, and −∇E_in]

want: find Lagrange multiplier λ > 0 and w_REG such that
∇E_in(w_REG) + (2λ/N) w_REG = 0
Augmented Error

• if an oracle tells you λ > 0, then solving
  ∇E_in(w_REG) + (2λ/N) w_REG = 0
  (2/N)(ZᵀZ w_REG − Zᵀy) + (2λ/N) w_REG = 0
• optimal solution: w_REG ← (ZᵀZ + λI)⁻¹ Zᵀy
  —called ridge regression in Statistics

• solving ∇E_in(w_REG) + (2λ/N) w_REG = 0 is equivalent to minimizing
  E_aug(w) = E_in(w) + (λ/N) wᵀw
  (the regularizer is wᵀw; the whole quantity is the augmented error E_aug(w))
• regularization with the augmented error instead of the constrained E_in:
  w_REG ← argmin_w E_aug(w) for given λ > 0 (or λ = 0)

minimizing unconstrained E_aug effectively minimizes some C-constrained E_in
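A minimal NumPy sketch of the closed form w_REG ← (ZᵀZ + λI)⁻¹ Zᵀy from the slide; the toy data, the degree-10 transform, and the λ value are assumptions for illustration only.

```python
import numpy as np

def ridge_fit(Z, y, lam):
    """Closed-form ridge regression from the slide: w_REG = (Z^T Z + lam I)^{-1} Z^T y."""
    d = Z.shape[1]
    # solve the linear system rather than forming an explicit inverse
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

# assumed toy data and lambda, for illustration only
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=20)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(20)
Z = np.vander(x, N=11, increasing=True)      # degree-10 polynomial features

w_reg = ridge_fit(Z, y, lam=0.01)
w_lin = ridge_fit(Z, y, lam=0.0)             # lam = 0 recovers plain linear regression
print(np.linalg.norm(w_reg), np.linalg.norm(w_lin))   # the regularized w is shorter
```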
The Results

[figure: fits for λ = 0, λ = 0.0001, λ = 0.01, λ = 1 on the same data and target]
overfitting =⇒ =⇒ =⇒ underfitting

philosophy: a little regularization goes a long way!

call ‘+(λ/N) wᵀw’ weight-decay regularization:
larger λ ⇐⇒ prefer shorter w ⇐⇒ effectively smaller C
—goes with ‘any’ transform + linear model
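A short self-contained sketch (assumed toy data, not the lecture’s data) that sweeps the λ values on the slide with the same closed form; larger λ gives larger E_in but a shorter w, tracing the overfitting-to-underfitting transition.

```python
import numpy as np

# self-contained sweep over the lambda values on the slide (assumed toy data)
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=20)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(20)
Z = np.vander(x, N=11, increasing=True)      # degree-10 polynomial features

for lam in [0.0, 0.0001, 0.01, 1.0]:
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(11), Z.T @ y)   # weight-decay solution
    e_in = np.mean((Z @ w - y) ** 2)
    print(f"lambda={lam:<8} E_in={e_in:.4f} ||w||={np.linalg.norm(w):.1f}")
```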
Some Detail: Legendre Polynomials

min_{w ∈ R^{Q+1}} (1/N) Σ_{n=1}^{N} (wᵀΦ(x_n) − y_n)² + (λ/N) Σ_{q=0}^{Q} w_q²

naïve polynomial transform: Φ(x) = (1, x, x², . . . , x^Q)
—when x_n ∈ [−1, +1], x_n^q is really small, needing a large w_q

normalized polynomial transform: (1, L_1(x), L_2(x), . . . , L_Q(x))
—‘orthonormal basis functions’ called Legendre polynomials

L_1(x) = x
L_2(x) = ½(3x² − 1)
L_3(x) = ½(5x³ − 3x)
L_4(x) = ⅛(35x⁴ − 30x² + 3)
L_5(x) = ⅛(63x⁵ · · ·)
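A small sketch of the normalized transform using NumPy's Legendre helpers (numpy.polynomial.legendre.legvander); the degree Q and the evaluation points are assumptions for illustration, not from the lecture.

```python
import numpy as np
from numpy.polynomial import legendre

# sketch: naive powers vs. the normalized (Legendre) transform
Q = 10
x = np.linspace(-1, 1, 21)

Z_naive = np.vander(x, N=Q + 1, increasing=True)    # columns 1, x, x^2, ..., x^Q
Z_legendre = legendre.legvander(x, Q)               # columns L_0(x), L_1(x), ..., L_Q(x)

# sanity check against the table on the slide: L_2(x) = (3x^2 - 1)/2
assert np.allclose(Z_legendre[:, 2], 0.5 * (3 * x**2 - 1))

# the naive highest-order column is near zero away from x = ±1
print(np.abs(Z_naive[:, Q]).mean(), np.abs(Z_legendre[:, Q]).mean())
```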
Fun Time
When would w_REG equal w_LIN?
1 λ = 0
2 C = ∞
3 C ≥ ‖w_LIN‖²
4 all of the above

Reference Answer: 4
1 and 2 should be easy; 3 means that there is effectively no constraint on w, hence the equivalence.
Regularization and VC Theory

Regularization by Constrained-Minimizing E_in:
min_w E_in(w)  s.t.  wᵀw ≤ C
  ⇕ (C equivalent to some λ)
Regularization by Minimizing E_aug:
min_w E_aug(w) = E_in(w) + (λ/N) wᵀw

VC Guarantee of Constrained-Minimizing E_in:
E_out(w) ≤ E_in(w) + Ω(H(C))

minimizing E_aug: indirectly getting the VC guarantee without confining to H(C)
Another View of Augmented Error

Augmented Error: E_aug(w) = E_in(w) + (λ/N) wᵀw
VC Bound: E_out(w) ≤ E_in(w) + Ω(H)

• regularizer wᵀw = Ω(w): complexity of a single hypothesis
• generalization price Ω(H): complexity of a hypothesis set
• if (λ/N) Ω(w) ‘represents’ Ω(H) well, E_aug is a better proxy of E_out than E_in

minimizing E_aug:
(heuristically) operating with the better proxy;
(technically) enjoying the flexibility of the whole H
Effective VC Dimension

min_{w ∈ R^{d̃+1}} E_aug(w) = E_in(w) + (λ/N) Ω(w)

• model complexity? d_VC(H) = d̃ + 1, because all {w} are ‘considered’ during minimization
• {w} ‘actually needed’: H(C), with some C equivalent to λ
• d_VC(H(C)): effective VC dimension d_EFF(H, A), where A is the algorithm minimizing E_aug

explanation of regularization:
d_VC(H) large, while d_EFF(H, A) small if A regularized
Fun Time
Consider the weight-decay regularization with regression. When increasing λ in A, what would happen to d_EFF(H, A)?
1 d_EFF ↑
2 d_EFF ↓
3 d_EFF = d_VC(H) and does not depend on λ
4 d_EFF = 1126 and does not depend on λ

Reference Answer: 2
larger λ ⇐⇒ smaller C ⇐⇒ smaller H(C) ⇐⇒ smaller d_EFF
General Regularizers Ω(w)

want: constraint in the ‘direction’ of the target function
• target-dependent: some properties of the target, if known
  —e.g. symmetry regularizer: Σ [[q is odd]] w_q²
• plausible: direction towards a smoother or simpler hypothesis
  —stochastic/deterministic noise are both non-smooth
  —e.g. sparsity (L1) regularizer: Σ |w_q| (next slide)
• friendly: easy to optimize
  —e.g. weight-decay (L2) regularizer: Σ w_q²
• bad? :-): no worries, guarded by λ

augmented error = error err + regularizer Ω
regularizer: target-dependent, plausible, or friendly—ringing a bell? :-)
error measure err: user-dependent, plausible, or friendly
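A hedged sketch of a general regularizer plugged into the augmented error, using the symmetry regularizer Σ [[q is odd]] w_q² from the first bullet above; the toy data, weight vectors, and λ value are illustrative assumptions, not from the lecture.

```python
import numpy as np

# hedged sketch: augmented error with a pluggable regularizer Omega(w); here the
# symmetry regularizer sum_{q odd} w_q^2, which penalizes odd-order terms when
# the target is believed to be even
def e_aug(w, Z, y, lam, omega):
    e_in = np.mean((Z @ w - y) ** 2)
    return e_in + (lam / len(y)) * omega(w)

def symmetry_regularizer(w):
    q = np.arange(len(w))
    return np.sum(np.where(q % 2 == 1, w ** 2, 0.0))

x = np.linspace(-1, 1, 15)
y = x ** 2                                   # an even (symmetric) toy target
Z = np.vander(x, N=4, increasing=True)       # features (1, x, x^2, x^3)
w_even = np.array([0.0, 0.0, 1.0, 0.0])
w_odd = np.array([0.0, 1.0, 1.0, 0.5])
print(e_aug(w_even, Z, y, lam=1.0, omega=symmetry_regularizer))  # no penalty
print(e_aug(w_odd, Z, y, lam=1.0, omega=symmetry_regularizer))   # penalized odd terms
```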
L2 and L1 Regularizer

L2 Regularizer
[figure: the constraint region wᵀw = C with an E_in = const. contour, the normal, and −∇E_in at w_LIN]
Ω(w) = Σ_{q=0}^{Q} w_q² = ‖w‖₂²
• convex, differentiable everywhere
• easy to optimize

L1 Regularizer
[figure: the constraint region ‖w‖₁ = C with an E_in = const. contour, sign(w), and −∇E_in at w_LIN]
Ω(w) = Σ_{q=0}^{Q} |w_q| = ‖w‖₁
• convex, not differentiable everywhere
• sparsity in solution

L1 useful if needing a sparse solution
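A hedged sketch comparing the two regularizers with scikit-learn's Ridge and Lasso (the library choice, toy data, and alpha values are assumptions, not from the lecture): at comparable regularization strength, the L1 fit typically zeroes out most polynomial coefficients, while the L2 fit only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# hedged sketch (assumed data and alpha values): L1 tends to produce sparse
# coefficients, L2 only shrinks them
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=30)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(30)
Z = np.vander(x, N=11, increasing=True)[:, 1:]   # features x, x^2, ..., x^10

l2 = Ridge(alpha=0.1).fit(Z, y)
l1 = Lasso(alpha=0.1, max_iter=100_000).fit(Z, y)
print("nonzero L2 coefficients:", np.count_nonzero(l2.coef_))   # typically all 10
print("nonzero L1 coefficients:", np.count_nonzero(l1.coef_))   # typically only a few
```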
The Optimal λ

[figure: expected E_out versus regularization parameter λ, for stochastic noise levels σ² = 0, 0.25, 0.5 and for deterministic noise levels Q_f = 15, 30, 100]

• more noise ⇐⇒ more regularization needed
  —a more bumpy road ⇐⇒ putting on the brakes more
• noise unknown—important to make proper choices

how to choose? stay tuned for the next lecture! :-)
Fun Time
Consider using a regularizer Ω(w) = Σ_{q=0}^{Q} 2^q w_q² to work with Legendre polynomial regression. Which kind of hypothesis does the regularizer prefer?
1 symmetric polynomials satisfying h(x) = h(−x)
2 low-dimensional polynomials
3 high-dimensional polynomials
4 no specific preference

Reference Answer: 2
There is a higher ‘penalty’ for higher-order terms, and hence the regularizer prefers low-dimensional polynomials.