(1)

Machine Learning Foundations (機器學習基石)

Lecture 14: Regularization

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 0/22

(2)

Regularization

Roadmap

1 When Can Machines Learn?

2 Why Can Machines Learn?

3 How Can Machines Learn?

4 How Can Machines Learn Better?

Lecture 13: Hazard of Overfitting
overfitting happens with excessive power, stochastic/deterministic noise, and limited data

Lecture 14: Regularization

Regularized Hypothesis Set

Weight Decay Regularization

Regularization and VC Theory

General Regularizers

(3)

Regularization Regularized Hypothesis Set

Regularization: The Magic

Figure: 'Data', 'Target', and 'Fit' in the x-y plane; the overfit curve is pulled back to a 'regularized fit'.

idea: 'step back' from H_10 to H_2

H_0 ⊂ H_1 ⊂ H_2 ⊂ H_3 ⊂ · · ·

name history: function approximation for ill-posed problems

how to step back?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 2/22

(4)

Regularization Regularized Hypothesis Set

Stepping Back as Constraint

H_0 ⊂ H_1 ⊂ H_2 ⊂ H_3 ⊂ · · ·

Q-th order polynomial transform for x ∈ R:
Φ_Q(x) = (1, x, x^2, . . . , x^Q)
plus linear regression; denote w̃ by w

hypothesis w in H_10:
w_0 + w_1 x + w_2 x^2 + w_3 x^3 + . . . + w_10 x^10

hypothesis w in H_2:
w_0 + w_1 x + w_2 x^2

that is, H_2 = H_10 AND 'constraint that w_3 = w_4 = . . . = w_10 = 0'

step back = constraint
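As a concrete illustration of the transform and the constraint (a minimal sketch of my own, assuming NumPy, not part of the lecture):

```python
import numpy as np

def poly_transform(x, Q):
    """Phi_Q(x) = (1, x, x^2, ..., x^Q), one row per input point."""
    return np.vander(x, Q + 1, increasing=True)

x = np.linspace(-1, 1, 5)
Z = poly_transform(x, 10)        # each row lives in R^{10+1}

w = np.random.randn(11)          # some hypothesis in H_10
w_h2 = w.copy()
w_h2[3:] = 0.0                   # impose w_3 = ... = w_10 = 0: now a hypothesis in H_2

print(Z @ w)                     # predictions of the H_10 hypothesis
print(Z @ w_h2)                  # predictions of the constrained (H_2) hypothesis
```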

(5)

Regularization Regularized Hypothesis Set

Regression with Constraint

H_10 ≡ { w ∈ R^{10+1} }

regression with H_10:
min_{w ∈ R^{10+1}} E_in(w)

H_2 ≡ { w ∈ R^{10+1} while w_3 = w_4 = . . . = w_10 = 0 }

regression with H_2:
min_{w ∈ R^{10+1}} E_in(w)   s.t.   w_3 = w_4 = . . . = w_10 = 0

step back = constrained optimization of E_in

why don't you just use w ∈ R^{2+1}? :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 4/22

(6)

Regularization Regularized Hypothesis Set

Regression with Looser Constraint

H_2 ≡ { w ∈ R^{10+1} while w_3 = . . . = w_10 = 0 }

regression with H_2:
min_{w ∈ R^{10+1}} E_in(w)   s.t.   w_3 = . . . = w_10 = 0

H'_2 ≡ { w ∈ R^{10+1} while ≥ 8 of w_q = 0 }

regression with H'_2:
min_{w ∈ R^{10+1}} E_in(w)   s.t.   Σ_{q=0}^{10} [[w_q ≠ 0]] ≤ 3

• more flexible than H_2: H_2 ⊂ H'_2
• less risky than H_10: H'_2 ⊂ H_10

bad news for sparse hypothesis set H'_2:
NP-hard to solve :-(

(7)

Regularization Regularized Hypothesis Set

Regression with Softer Constraint

H'_2 ≡ { w ∈ R^{10+1} while ≥ 8 of w_q = 0 }

regression with H'_2:
min_{w ∈ R^{10+1}} E_in(w)   s.t.   Σ_{q=0}^{10} [[w_q ≠ 0]] ≤ 3

H(C) ≡ { w ∈ R^{10+1} while ||w||^2 ≤ C }

regression with H(C):
min_{w ∈ R^{10+1}} E_in(w)   s.t.   Σ_{q=0}^{10} w_q^2 ≤ C

• H(C): overlaps but not exactly the same as H'_2
• soft and smooth structure over C ≥ 0:
H(0) ⊂ H(1.126) ⊂ . . . ⊂ H(1126) ⊂ . . . ⊂ H(∞) = H_10

regularized hypothesis w_REG:
optimal solution from regularized hypothesis set H(C)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 6/22

(9)

Regularization Regularized Hypothesis Set

Fun Time

For Q ≥ 1, which of the following hypotheses (weight vector w ∈ R^{Q+1}) is not in the regularized hypothesis set H(1)?

1  w^T = [0, 0, . . . , 0]
2  w^T = [1, 0, . . . , 0]
3  w^T = [1, 1, . . . , 1]
4  w^T = [√(1/(Q+1)), √(1/(Q+1)), . . . , √(1/(Q+1))]

Reference Answer: 3

The squared length of w in 3 is Q + 1, which is not ≤ 1.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 7/22

(10)

Regularization Weight Decay Regularization

Matrix Form of Regularized Regression Problem

min_{w ∈ R^{Q+1}}  E_in(w) = (1/N) Σ_{n=1}^{N} (w^T z_n − y_n)^2 = (1/N) (Zw − y)^T (Zw − y)
s.t.  Σ_{q=0}^{Q} w_q^2 = w^T w ≤ C

• Σ_n (. . .) = (Zw − y)^T (Zw − y), remember? :-)
• w^T w ≤ C: feasible w within a radius-√C hypersphere

how to solve constrained optimization problem?
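A quick numerical check of the matrix identity above (my own sketch, assuming NumPy; Z, y, w here are arbitrary toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
N, Q = 20, 10
Z = rng.standard_normal((N, Q + 1))     # rows are the transformed inputs z_n
y = rng.standard_normal(N)
w = rng.standard_normal(Q + 1)

E_in_sum = np.mean((Z @ w - y) ** 2)    # (1/N) sum_n (w^T z_n - y_n)^2
r = Z @ w - y
E_in_matrix = (r @ r) / N               # (1/N) (Zw - y)^T (Zw - y)

assert np.isclose(E_in_sum, E_in_matrix)
```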

(11)

Regularization Weight Decay Regularization

The Lagrange Multiplier

min_{w ∈ R^{Q+1}}  E_in(w) = (1/N) (Zw − y)^T (Zw − y)   s.t.   w^T w ≤ C

• decreasing direction: −∇E_in(w), remember? :-)
• normal vector of w^T w = C: w
• if −∇E_in(w) and w not parallel: can decrease E_in(w) without violating the constraint
• at optimal solution w_REG: −∇E_in(w_REG) parallel to w_REG

Figure: level curves E_in = const. around w_lin, the sphere w^T w = C, the constrained optimum w_REG, its normal vector w, and the direction −∇E_in.

want: find Lagrange multiplier λ > 0 and w_REG such that
∇E_in(w_REG) + (2λ/N) w_REG = 0

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 9/22

(12)

Regularization Weight Decay Regularization

Augmented Error

if oracle tells you λ > 0, then solving
∇E_in(w_REG) + (2λ/N) w_REG = 0,
that is,
(2/N) (Z^T Z w_REG − Z^T y) + (2λ/N) w_REG = 0

gives the optimal solution:
w_REG ← (Z^T Z + λI)^{−1} Z^T y
called ridge regression in Statistics

minimizing unconstrained E_aug effectively minimizes some C-constrained E_in
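A minimal NumPy sketch of the closed form w_REG ← (Z^T Z + λI)^{−1} Z^T y; the toy data and the helper name ridge_regression are my own choices, not from the lecture:

```python
import numpy as np

def ridge_regression(Z, y, lam):
    """Weight-decay / ridge solution w_REG = (Z^T Z + lam*I)^(-1) Z^T y."""
    d = Z.shape[1]
    # solve the linear system instead of forming the inverse explicitly
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

# toy data: noisy sine target with a 10th-order polynomial transform
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 15)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(15)
Z = np.vander(x, 11, increasing=True)

w_reg = ridge_regression(Z, y, lam=0.01)
print(w_reg)
```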

(13)

Regularization Weight Decay Regularization

Augmented Error

if oracle tells you λ > 0, then solving
∇E_in(w_REG) + (2λ/N) w_REG = 0
is equivalent to minimizing
E_aug(w) = E_in(w) + (λ/N) w^T w
with regularizer w^T w; E_aug is called the augmented error

regularization with augmented error instead of constrained E_in:
w_REG ← argmin_w E_aug(w)   for given λ > 0 or λ = 0

minimizing unconstrained E_aug effectively minimizes some C-constrained E_in

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 10/22
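As an illustrative check of this equivalence (my own, continuing the hypothetical sketch above), the gradient of E_aug vanishes at the w_REG returned by ridge_regression:

```python
import numpy as np
# assumes Z, y, and w_reg from the ridge_regression sketch above (fit with lam = 0.01)

N = len(y)
lam = 0.01
grad_E_in = (2.0 / N) * (Z.T @ Z @ w_reg - Z.T @ y)   # gradient of E_in at w_REG
grad_E_aug = grad_E_in + (2.0 * lam / N) * w_reg      # plus gradient of (lam/N) w^T w

print(np.max(np.abs(grad_E_aug)))                     # ~0: w_REG minimizes E_aug
```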

(14)

Regularization Weight Decay Regularization

The Results

λ = 0   λ = 0.0001   λ = 0.01   λ = 1

Figure: the fitted curve against 'Data' and 'Target' in the x-y plane, one panel per λ.

overfitting =⇒ =⇒ =⇒ underfitting

philosophy: a little regularization goes a long way!

call '+ (λ/N) w^T w' weight-decay regularization:
larger λ ⇐⇒ prefer shorter w ⇐⇒ effectively smaller C

go with 'any' transform + linear model
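A small follow-up sketch (again mine, reusing the hypothetical ridge_regression helper and the Z, y above) sweeping the λ values from the slide; the shrinking norm of w_REG is the 'prefer shorter w' effect:

```python
import numpy as np
# assumes Z, y, and ridge_regression(...) from the earlier sketch

for lam in [0.0, 0.0001, 0.01, 1.0]:
    # lam = 0 is plain linear regression on the polynomial features (may be ill-conditioned)
    w_reg = ridge_regression(Z, y, lam)
    print(f"lambda = {lam:<8} ||w_REG|| = {np.linalg.norm(w_reg):.3f}")
```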

(15)

Regularization Weight Decay Regularization

Some Detail: Legendre Polynomials

min_{w ∈ R^{Q+1}}  (1/N) Σ_{n=1}^{N} (w^T Φ(x_n) − y_n)^2 + (λ/N) Σ_{q=0}^{Q} w_q^2

• naïve polynomial transform:
Φ(x) = (1, x, x^2, . . . , x^Q)
when x_n ∈ [−1, +1], x_n^q is really small, needing large w_q

• normalized polynomial transform:
(1, L_1(x), L_2(x), . . . , L_Q(x))
'orthonormal basis functions' called Legendre polynomials

L_1(x) = x
L_2(x) = (1/2)(3x^2 − 1)
L_3(x) = (1/2)(5x^3 − 3x)
L_4(x) = (1/8)(35x^4 − 30x^2 + 3)
L_5(x) = (1/8)(63x^5 − · · · )

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 12/22
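A sketch of the normalized transform, assuming NumPy's numpy.polynomial.legendre module (not mentioned in the lecture); the comparison point x = 0.5 is an arbitrary choice of mine:

```python
import numpy as np
from numpy.polynomial import legendre

x = np.linspace(-1, 1, 15)
Q = 10

Z_naive = np.vander(x, Q + 1, increasing=True)   # columns 1, x, ..., x^Q
Z_legendre = legendre.legvander(x, Q)            # columns L_0(x), ..., L_Q(x)

# at an interior point such as x = 0.5, the naive feature x^10 is tiny
# (so a good fit needs a large w_10), while L_10(0.5) keeps a moderate scale
print(0.5 ** 10, legendre.legval(0.5, [0] * Q + [1]))
```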

(17)

Regularization Weight Decay Regularization

Fun Time

When would w_REG equal w_LIN?

1  λ = 0
2  C = ∞
3  C ≥ ||w_LIN||^2
4  all of the above

Reference Answer: 4

1 and 2 shall be easy; 3 means that there is effectively no constraint on w, hence the equivalence.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 13/22

(18)

Regularization Regularization and VC Theory

Regularization and VC Theory

Regularization by Constrained-Minimizing E_in:
min_w E_in(w)   s.t.   w^T w ≤ C

(C equivalent to some λ)

Regularization by Minimizing E_aug:
min_w E_aug(w) = E_in(w) + (λ/N) w^T w

VC Guarantee of Constrained-Minimizing E_in:
E_out(w) ≤ E_in(w) + Ω(H(C))

minimizing E_aug: indirectly getting the VC guarantee without confining to H(C)

(19)

Regularization Regularization and VC Theory

Another View of Augmented Error

Augmented Error:
E_aug(w) = E_in(w) + (λ/N) w^T w

VC Bound:
E_out(w) ≤ E_in(w) + Ω(H)

• regularizer w^T w = Ω(w): complexity of a single hypothesis
• generalization price Ω(H): complexity of a hypothesis set
• if (λ/N) Ω(w) 'represents' Ω(H) well, E_aug is a better proxy of E_out than E_in

minimizing E_aug:
(heuristically) operating with the better proxy;
(technically) enjoying flexibility of whole H

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/22

(20)

Regularization Regularization and VC Theory

Effective VC Dimension

min_{w ∈ R^{d̃+1}}  E_aug(w) = E_in(w) + (λ/N) Ω(w)

model complexity?
• d_VC(H) = d̃ + 1, because {w} 'all considered' during minimization
• {w} 'actually needed': H(C), with some C equivalent to λ
• d_VC(H(C)): effective VC dimension d_EFF(H, A), where the algorithm A minimizes E_aug

explanation of regularization:
d_VC(H) large, while d_EFF(H, A) small if A regularized

(21)

Regularization Regularization and VC Theory

Fun Time

Consider the weight-decay regularization with regression. When increasing λ in A, what would happen with d_EFF(H, A)?

1  d_EFF ↑
2  d_EFF ↓
3  d_EFF = d_VC(H) and does not depend on λ
4  d_EFF = 1126 and does not depend on λ

Reference Answer: 2

larger λ ⇐⇒ smaller C ⇐⇒ smaller H(C) ⇐⇒ smaller d_EFF

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 17/22

(23)

Regularization General Regularizers

General Regularizers Ω(w)

want: constraint in the 'direction' of target function

• target-dependent: some properties of target, if known
  symmetry regularizer: Σ [[q is odd]] w_q^2
• plausible: direction towards smoother or simpler
  (stochastic/deterministic noise both non-smooth)
  sparsity (L1) regularizer: Σ |w_q| (next slide)
• friendly: easy to optimize
  weight-decay (L2) regularizer: Σ w_q^2
• bad? :-): no worries, guarded by λ

augmented error = error err + regularizer Ω
regularizer: target-dependent, plausible, or friendly
ringing a bell? :-)
error measure: user-dependent, plausible, or friendly

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 18/22

(24)

Regularization General Regularizers

L2 and L1 Regularizer

Figure: the L2 ball w^T w = C and the L1 ball ||w||_1 = C against the level curves E_in = const., with w_lin, the sign vector, and the direction −∇E_in.

L2 Regularizer:
Ω(w) = Σ_{q=0}^{Q} w_q^2 = ||w||_2^2
• convex, differentiable everywhere
• easy to optimize

L1 Regularizer:
Ω(w) = Σ_{q=0}^{Q} |w_q| = ||w||_1
• convex, not differentiable everywhere
• sparsity in solution

L1 useful if needing sparse solution
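An illustrative comparison (mine, using scikit-learn's Ridge and Lasso, which the lecture does not mention) on polynomial features; the L1 fit zeroes out many coefficients while the L2 fit only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(50)
Z = np.vander(x, 11, increasing=True)[:, 1:]      # features x, x^2, ..., x^10 (intercept handled by the models)

l2 = Ridge(alpha=0.1).fit(Z, y)                   # weight-decay (L2) regularizer
l1 = Lasso(alpha=0.01, max_iter=100000).fit(Z, y) # sparsity (L1) regularizer

print("L2 coefficients exactly zero:", int(np.sum(l2.coef_ == 0.0)))
print("L1 coefficients exactly zero:", int(np.sum(l1.coef_ == 0.0)))
```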

(25)

Regularization General Regularizers

The Optimal λ

Figure (stochastic noise): expected E_out versus regularization parameter λ, for σ^2 = 0, 0.25, 0.5.
Figure (deterministic noise): expected E_out versus regularization parameter λ, for Q_f = 15, 30, 100.

more noise ⇐⇒ more regularization needed
(a more bumpy road ⇐⇒ putting on the brakes more)

noise unknown: important to make proper choices
how to choose? stay tuned for the next lecture! :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 20/22

(27)

Regularization General Regularizers

Fun Time

Consider using a regularizer Ω(w) = Σ_{q=0}^{Q} 2^q w_q^2 to work with Legendre polynomial regression. Which kind of hypothesis does the regularizer prefer?

1  symmetric polynomials satisfying h(x) = h(−x)
2  low-dimensional polynomials
3  high-dimensional polynomials
4  no specific preference

Reference Answer: 2

There is a higher 'penalty' for higher-order terms, and hence the regularizer prefers low-dimensional polynomials.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 21/22

(28)

Regularization General Regularizers

Summary

1 When Can Machines Learn?

2 Why Can Machines Learn?

3 How Can Machines Learn?

4 How Can Machines Learn Better?

Lecture 13: Hazard of Overfitting
Lecture 14: Regularization
• Regularized Hypothesis Set: original H + constraint
• Weight Decay Regularization: add (λ/N) w^T w in E_aug
• Regularization and VC Theory: regularization decreases d_EFF
• General Regularizers: target-dependent, [plausible], or [friendly]

next: choosing from the so-many models/parameters
