Machine Learning Foundations (機器學習基石)
Lecture 14: Regularization
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?

Lecture 13: Hazard of Overfitting
overfitting happens with excessive power, stochastic/deterministic noise, and limited data
Lecture 14: Regularization
Regularized Hypothesis Set
Weight Decay Regularization
Regularization and VC Theory
General Regularizers
Regularization: The Magic

[figure: the same data and target, fit once with an overfit curve and once with a ‘regularized fit’]

• idea: ‘step back’ from H10 to H2
  (nested hypothesis sets: H0 ⊂ H1 ⊂ H2 ⊂ H3 ⊂ · · ·)
• name history: function approximation for ill-posed problems

how to step back?
Stepping Back as Constraint

nested hypothesis sets: H0 ⊂ H1 ⊂ H2 ⊂ H3 ⊂ · · ·

Q-th order polynomial transform for x ∈ R: Φ_Q(x) = (1, x, x², . . . , x^Q)
+ linear regression, denoting w̃ by w

hypothesis w in H10: w_0 + w_1 x + w_2 x² + w_3 x³ + . . . + w_10 x^10
hypothesis w in H2:  w_0 + w_1 x + w_2 x²

that is, H2 = H10 AND ‘constraint that w_3 = w_4 = . . . = w_10 = 0’

step back = constraint
Regression with Constraint

H10 ≡ { w ∈ R^{10+1} }
regression with H10: min_{w ∈ R^{10+1}} E_in(w)

H2 ≡ { w ∈ R^{10+1} while w_3 = w_4 = . . . = w_10 = 0 }
regression with H2: min_{w ∈ R^{10+1}} E_in(w)  s.t.  w_3 = w_4 = . . . = w_10 = 0

step back = constrained optimization of E_in

why don’t you just use w ∈ R^{2+1}? :-)
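As a quick illustration of the constraint view, here is a minimal sketch with assumed toy data (not from the lecture): fitting in H2, i.e. degree-10 polynomial regression with w_3 = . . . = w_10 forced to 0, amounts to plain linear regression on the features (1, x, x²), which is exactly the ‘why not just use w ∈ R^{2+1}’ point.

```python
import numpy as np

# minimal sketch, assumed toy data: regression in H2 (w_3 = ... = w_10 = 0)
# is just linear regression on the first three polynomial features (1, x, x^2)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(20)

Z10 = np.vander(x, N=11, increasing=True)    # features (1, x, x^2, ..., x^10) for H10
Z2 = Z10[:, :3]                              # keep only (1, x, x^2) for H2

w2, *_ = np.linalg.lstsq(Z2, y, rcond=None)  # unconstrained fit over the 3 free weights
w_reg_h2 = np.concatenate([w2, np.zeros(8)]) # pad back to R^{10+1}; w_3..w_10 are 0
print(w_reg_h2)
```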
Regression with Looser Constraint

H2 ≡ { w ∈ R^{10+1} while w_3 = . . . = w_10 = 0 }
regression with H2: min_{w ∈ R^{10+1}} E_in(w)  s.t.  w_3 = . . . = w_10 = 0

H2′ ≡ { w ∈ R^{10+1} while ≥ 8 of w_q = 0 }
regression with H2′: min_{w ∈ R^{10+1}} E_in(w)  s.t.  Σ_{q=0}^{10} [[w_q ≠ 0]] ≤ 3

• more flexible than H2: H2 ⊂ H2′
• less risky than H10: H2′ ⊂ H10

bad news for sparse hypothesis set H2′: NP-hard to solve :-(
Regression with Softer Constraint

H2′ ≡ { w ∈ R^{10+1} while ≥ 8 of w_q = 0 }
regression with H2′: min_{w ∈ R^{10+1}} E_in(w)  s.t.  Σ_{q=0}^{10} [[w_q ≠ 0]] ≤ 3

H(C) ≡ { w ∈ R^{10+1} while ‖w‖² ≤ C }
regression with H(C): min_{w ∈ R^{10+1}} E_in(w)  s.t.  Σ_{q=0}^{10} w_q² ≤ C

• H(C): overlaps with, but is not exactly the same as, H2′
• soft and smooth structure over C ≥ 0:
  H(0) ⊂ H(1.126) ⊂ . . . ⊂ H(1126) ⊂ . . . ⊂ H(∞) = H10

regularized hypothesis w_REG: optimal solution from regularized hypothesis set H(C)
Fun Time
For Q ≥ 1, which of the following hypotheses (weight vector w ∈ R^{Q+1}) is not in the regularized hypothesis set H(1)?
1 wᵀ = [0, 0, . . . , 0]
2 wᵀ = [1, 0, . . . , 0]
3 wᵀ = [1, 1, . . . , 1]
4 wᵀ = [√(1/(Q+1)), √(1/(Q+1)), . . . , √(1/(Q+1))]

Reference Answer: 3
The squared length of w in 3 is Q + 1, which is not ≤ 1.
Matrix Form of Regularized Regression Problem

min_{w ∈ R^{Q+1}} E_in(w) = (1/N) Σ_{n=1}^{N} (wᵀz_n − y_n)²  =  (1/N)(Zw − y)ᵀ(Zw − y)
s.t.  Σ_{q=0}^{Q} w_q²  =  wᵀw ≤ C

• Σ_n . . . = (Zw − y)ᵀ(Zw − y), remember? :-)
• wᵀw ≤ C: feasible w lies within a radius-√C hypersphere

how to solve the constrained optimization problem?
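A tiny sanity check of the matrix form (toy values assumed, for illustration only): the summation form of E_in and (1/N)(Zw − y)ᵀ(Zw − y) agree.

```python
import numpy as np

# tiny sanity check with assumed toy values: summation form of E_in equals
# the matrix form (1/N) (Zw - y)^T (Zw - y)
rng = np.random.default_rng(3)
Z = rng.standard_normal((5, 3))   # N = 5 examples, Q + 1 = 3 features
y = rng.standard_normal(5)
w = rng.standard_normal(3)

e_in_sum = np.mean([(w @ z_n - y_n) ** 2 for z_n, y_n in zip(Z, y)])
e_in_mat = (Z @ w - y) @ (Z @ w - y) / len(y)
assert np.isclose(e_in_sum, e_in_mat)
```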
The Lagrange Multiplier

min_{w ∈ R^{Q+1}} E_in(w) = (1/N)(Zw − y)ᵀ(Zw − y)  s.t.  wᵀw ≤ C

• decreasing direction: −∇E_in(w), remember? :-)
• normal vector of wᵀw = C: w
• if −∇E_in(w) and w are not parallel: can decrease E_in(w) without violating the constraint
• at the optimal solution w_REG: −∇E_in(w_REG) ∝ w_REG

[figure: the constraint wᵀw = C and a contour E_in = const., with w_LIN, the normal w, and −∇E_in]

want: find Lagrange multiplier λ > 0 and w_REG such that
∇E_in(w_REG) + (2λ/N) w_REG = 0
Augmented Error

• if an oracle tells you λ > 0, then solving
  ∇E_in(w_REG) + (2λ/N) w_REG = 0
  (2/N)(ZᵀZ w_REG − Zᵀy) + (2λ/N) w_REG = 0
• optimal solution: w_REG ← (ZᵀZ + λI)⁻¹ Zᵀy
  —called ridge regression in Statistics

• solving ∇E_in(w_REG) + (2λ/N) w_REG = 0 is equivalent to minimizing
  E_aug(w) = E_in(w) + (λ/N) wᵀw
  (the regularizer is wᵀw; the whole quantity is the augmented error E_aug(w))
• regularization with the augmented error instead of the constrained E_in:
  w_REG ← argmin_w E_aug(w) for given λ > 0 (or λ = 0)

minimizing unconstrained E_aug effectively minimizes some C-constrained E_in
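A minimal NumPy sketch of the closed form w_REG ← (ZᵀZ + λI)⁻¹ Zᵀy from the slide; the toy data, the degree-10 transform, and the λ value are assumptions for illustration only.

```python
import numpy as np

def ridge_fit(Z, y, lam):
    """Closed-form ridge regression from the slide: w_REG = (Z^T Z + lam I)^{-1} Z^T y."""
    d = Z.shape[1]
    # solve the linear system rather than forming an explicit inverse
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

# assumed toy data and lambda, for illustration only
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=20)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(20)
Z = np.vander(x, N=11, increasing=True)      # degree-10 polynomial features

w_reg = ridge_fit(Z, y, lam=0.01)
w_lin = ridge_fit(Z, y, lam=0.0)             # lam = 0 recovers plain linear regression
print(np.linalg.norm(w_reg), np.linalg.norm(w_lin))   # the regularized w is shorter
```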
The Results

[figure: fits for λ = 0, λ = 0.0001, λ = 0.01, λ = 1 on the same data and target]
overfitting =⇒ =⇒ =⇒ underfitting

philosophy: a little regularization goes a long way!

call ‘+(λ/N) wᵀw’ weight-decay regularization:
larger λ ⇐⇒ prefer shorter w ⇐⇒ effectively smaller C
—goes with ‘any’ transform + linear model
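A short self-contained sketch (assumed toy data, not the lecture’s data) that sweeps the λ values on the slide with the same closed form; larger λ gives larger E_in but a shorter w, tracing the overfitting-to-underfitting transition.

```python
import numpy as np

# self-contained sweep over the lambda values on the slide (assumed toy data)
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=20)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(20)
Z = np.vander(x, N=11, increasing=True)      # degree-10 polynomial features

for lam in [0.0, 0.0001, 0.01, 1.0]:
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(11), Z.T @ y)   # weight-decay solution
    e_in = np.mean((Z @ w - y) ** 2)
    print(f"lambda={lam:<8} E_in={e_in:.4f} ||w||={np.linalg.norm(w):.1f}")
```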
Some Detail: Legendre Polynomials

min_{w ∈ R^{Q+1}} (1/N) Σ_{n=1}^{N} (wᵀΦ(x_n) − y_n)² + (λ/N) Σ_{q=0}^{Q} w_q²

naïve polynomial transform: Φ(x) = (1, x, x², . . . , x^Q)
—when x_n ∈ [−1, +1], x_n^q is really small, needing a large w_q

normalized polynomial transform: (1, L_1(x), L_2(x), . . . , L_Q(x))
—‘orthonormal basis functions’ called Legendre polynomials

L_1(x) = x
L_2(x) = ½(3x² − 1)
L_3(x) = ½(5x³ − 3x)
L_4(x) = ⅛(35x⁴ − 30x² + 3)
L_5(x) = ⅛(63x⁵ · · ·)
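A small sketch of the normalized transform using NumPy's Legendre helpers (numpy.polynomial.legendre.legvander); the degree Q and the evaluation points are assumptions for illustration, not from the lecture.

```python
import numpy as np
from numpy.polynomial import legendre

# sketch: naive powers vs. the normalized (Legendre) transform
Q = 10
x = np.linspace(-1, 1, 21)

Z_naive = np.vander(x, N=Q + 1, increasing=True)    # columns 1, x, x^2, ..., x^Q
Z_legendre = legendre.legvander(x, Q)               # columns L_0(x), L_1(x), ..., L_Q(x)

# sanity check against the table on the slide: L_2(x) = (3x^2 - 1)/2
assert np.allclose(Z_legendre[:, 2], 0.5 * (3 * x**2 - 1))

# the naive highest-order column is near zero away from x = ±1
print(np.abs(Z_naive[:, Q]).mean(), np.abs(Z_legendre[:, Q]).mean())
```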
Fun Time
When would w_REG equal w_LIN?
1 λ = 0
2 C = ∞
3 C ≥ ‖w_LIN‖²
4 all of the above

Reference Answer: 4
1 and 2 should be easy; 3 means that there is effectively no constraint on w, hence the equivalence.
Regularization and VC Theory

Regularization by Constrained-Minimizing E_in:
min_w E_in(w)  s.t.  wᵀw ≤ C
  ⇕ (C equivalent to some λ)
Regularization by Minimizing E_aug:
min_w E_aug(w) = E_in(w) + (λ/N) wᵀw

VC Guarantee of Constrained-Minimizing E_in:
E_out(w) ≤ E_in(w) + Ω(H(C))

minimizing E_aug: indirectly getting the VC guarantee without confining to H(C)
Another View of Augmented Error

Augmented Error: E_aug(w) = E_in(w) + (λ/N) wᵀw
VC Bound: E_out(w) ≤ E_in(w) + Ω(H)

• regularizer wᵀw = Ω(w): complexity of a single hypothesis
• generalization price Ω(H): complexity of a hypothesis set
• if (λ/N) Ω(w) ‘represents’ Ω(H) well, E_aug is a better proxy of E_out than E_in

minimizing E_aug:
(heuristically) operating with the better proxy;
(technically) enjoying the flexibility of the whole H
Effective VC Dimension

min_{w ∈ R^{d̃+1}} E_aug(w) = E_in(w) + (λ/N) Ω(w)

• model complexity? d_VC(H) = d̃ + 1, because all {w} are ‘considered’ during minimization
• {w} ‘actually needed’: H(C), with some C equivalent to λ
• d_VC(H(C)): effective VC dimension d_EFF(H, A), where A is the algorithm minimizing E_aug

explanation of regularization:
d_VC(H) large, while d_EFF(H, A) small if A regularized
Fun Time
Consider the weight-decay regularization with regression. When increasing λ in A, what would happen to d_EFF(H, A)?
1 d_EFF ↑
2 d_EFF ↓
3 d_EFF = d_VC(H) and does not depend on λ
4 d_EFF = 1126 and does not depend on λ

Reference Answer: 2
larger λ ⇐⇒ smaller C ⇐⇒ smaller H(C) ⇐⇒ smaller d_EFF
General Regularizers Ω(w)

want: constraint in the ‘direction’ of the target function
• target-dependent: some properties of the target, if known
  —e.g. symmetry regularizer: Σ [[q is odd]] w_q²
• plausible: direction towards a smoother or simpler hypothesis
  —stochastic/deterministic noise are both non-smooth
  —e.g. sparsity (L1) regularizer: Σ |w_q| (next slide)
• friendly: easy to optimize
  —e.g. weight-decay (L2) regularizer: Σ w_q²
• bad? :-): no worries, guarded by λ

augmented error = error err + regularizer Ω
regularizer: target-dependent, plausible, or friendly—ringing a bell? :-)
error measure err: user-dependent, plausible, or friendly
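A hedged sketch of a general regularizer plugged into the augmented error, using the symmetry regularizer Σ [[q is odd]] w_q² from the first bullet above; the toy data, weight vectors, and λ value are illustrative assumptions, not from the lecture.

```python
import numpy as np

# hedged sketch: augmented error with a pluggable regularizer Omega(w); here the
# symmetry regularizer sum_{q odd} w_q^2, which penalizes odd-order terms when
# the target is believed to be even
def e_aug(w, Z, y, lam, omega):
    e_in = np.mean((Z @ w - y) ** 2)
    return e_in + (lam / len(y)) * omega(w)

def symmetry_regularizer(w):
    q = np.arange(len(w))
    return np.sum(np.where(q % 2 == 1, w ** 2, 0.0))

x = np.linspace(-1, 1, 15)
y = x ** 2                                   # an even (symmetric) toy target
Z = np.vander(x, N=4, increasing=True)       # features (1, x, x^2, x^3)
w_even = np.array([0.0, 0.0, 1.0, 0.0])
w_odd = np.array([0.0, 1.0, 1.0, 0.5])
print(e_aug(w_even, Z, y, lam=1.0, omega=symmetry_regularizer))  # no penalty
print(e_aug(w_odd, Z, y, lam=1.0, omega=symmetry_regularizer))   # penalized odd terms
```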
L2 and L1 Regularizer

L2 Regularizer
[figure: the constraint region wᵀw = C with an E_in = const. contour, the normal, and −∇E_in at w_LIN]
Ω(w) = Σ_{q=0}^{Q} w_q² = ‖w‖₂²
• convex, differentiable everywhere
• easy to optimize

L1 Regularizer
[figure: the constraint region ‖w‖₁ = C with an E_in = const. contour, sign(w), and −∇E_in at w_LIN]
Ω(w) = Σ_{q=0}^{Q} |w_q| = ‖w‖₁
• convex, not differentiable everywhere
• sparsity in solution

L1 useful if needing a sparse solution
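A hedged sketch comparing the two regularizers with scikit-learn's Ridge and Lasso (the library choice, toy data, and alpha values are assumptions, not from the lecture): at comparable regularization strength, the L1 fit typically zeroes out most polynomial coefficients, while the L2 fit only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# hedged sketch (assumed data and alpha values): L1 tends to produce sparse
# coefficients, L2 only shrinks them
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=30)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(30)
Z = np.vander(x, N=11, increasing=True)[:, 1:]   # features x, x^2, ..., x^10

l2 = Ridge(alpha=0.1).fit(Z, y)
l1 = Lasso(alpha=0.1, max_iter=100_000).fit(Z, y)
print("nonzero L2 coefficients:", np.count_nonzero(l2.coef_))   # typically all 10
print("nonzero L1 coefficients:", np.count_nonzero(l1.coef_))   # typically only a few
```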
The Optimal λ

[figure: expected E_out versus regularization parameter λ, for stochastic noise levels σ² = 0, 0.25, 0.5 and for deterministic noise levels Q_f = 15, 30, 100]

• more noise ⇐⇒ more regularization needed
  —a more bumpy road ⇐⇒ putting on the brakes more
• noise unknown—important to make proper choices

how to choose? stay tuned for the next lecture! :-)
Fun Time
Consider using a regularizer Ω(w) = Σ_{q=0}^{Q} 2^q w_q² to work with Legendre polynomial regression. Which kind of hypothesis does the regularizer prefer?
1 symmetric polynomials satisfying h(x) = h(−x)
2 low-dimensional polynomials
3 high-dimensional polynomials
4 no specific preference

Reference Answer: 2
There is a higher ‘penalty’ for higher-order terms, and hence the regularizer prefers low-dimensional polynomials.