
Machine Learning (NTU, Fall 2010) instructor: Hsuan-Tien Lin

Homework #5

TA in charge: Yu-Cheng Chou
RELEASE DATE: 11/08/2010

DUE DATE: 11/29/2010, 4:00 pm IN CLASS
TA SESSION: 11/25/2010, 6:00 pm IN R110

Unless granted by the instructor in advance, you must turn in a hard copy of your solutions (without the source code) for all problems. For problems marked with (*), please follow the guidelines on the course website and upload your source code to designated places.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for such misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

5.1 Low-Order Transforms: Decision Stump

(1) (5%) Do the first part in Exercise 3.9 of LFD (prove the d_VC of a single transform).

(2) (5%) Do the second part in Exercise 3.9 of LFD (prove the d_VC of the union).

5.2 Data-dependent Transforms

A Transformer thinks the following procedures would work well in getting a low E_out from any two-dimensional data set. Please point out any potential caveats in the procedures:

(1) (5%) Use the feature transform

\Phi(x) = \begin{cases} (\underbrace{0, \cdots, 0}_{n-1}, 1, 0, \cdots) & \text{if } x = x_n \\ (0, \cdots, 0, 0, 0, \cdots) & \text{otherwise} \end{cases}

before running PLA.

(2) (5%) Use the feature transform Φ(x) = (φ_1(x), φ_2(x), ..., φ_N(x)) with

\phi_n(x) = \exp\left( -\frac{\|x - x_n\|_2^2}{\sigma^2} \right)

and some very small σ^2 before running PLA.

(Note: You can use the fact that if x_1, x_2, ..., x_N are all different, the matrix A = [a_mn] with a_mn = φ_m(x_n) is always invertible.)
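For concreteness, here is a minimal sketch (not part of the assignment) that builds the matrix A = [a_mn] with a_mn = φ_m(x_n) mentioned in the note, assuming NumPy; the toy points and the value of σ^2 below are made up purely for illustration.

```python
# A minimal sketch: constructing the data-dependent Gaussian feature matrix
# A[m, n] = phi_m(x_n) for a hypothetical toy 2-D data set. Assumes NumPy;
# the points and sigma^2 are made up for illustration.
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.5], [0.2, 0.9]])   # hypothetical x_1, x_2, x_3
sigma2 = 0.01                                         # a "very small" sigma^2

# pairwise squared distances ||x_m - x_n||_2^2
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
A = np.exp(-sq_dist / sigma2)                         # A[m, n] = phi_m(x_n)

print(A)   # for very small sigma^2, A is numerically close to the identity
```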


5.3 Regularization and Weight Decay

Consider the regularized linear regression formulation

\min_w \; \lambda\, w^T w + \frac{1}{N} \sum_{n=1}^{N} \left( y_n - w^T x_n \right)^2

with some λ > 0.

(1) (5%) Let w_lin be the optimal solution for the plain-vanilla linear regression and w_reg(λ) be the optimal solution for the formulation above. Prove that ‖w_reg(λ)‖ ≤ ‖w_lin‖ for any λ > 0.

(Note: This is one origin of the name “weight decay.”)

Next, consider a more general formulation of regularized learning:

\min_w \; \tilde{E}(w) = \lambda\, w^T w + E_{in}(w)

with some λ ≥ 0, where λ = 0 corresponds to not using regularization.

(2) (5%) Let w_reg(λ) be the optimal solution for the formulation above. Prove that ‖w_reg(λ)‖ is a non-increasing function of λ for λ ≥ 0.

(3) (5%) Assume that E_in is differentiable and use gradient descent to minimize Ẽ:

w_{new} \leftarrow w_{old} - \eta \nabla \tilde{E}(w_{old}).

Show that the update rule above is the same as

w_{new} \leftarrow (1 - 2\eta\lambda)\, w_{old} - \eta \nabla E_{in}(w_{old}).

(Note: This is another origin of the name “weight decay”: w_old decays before being updated by the gradient of E_in.)
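The equivalence in (3) is easy to check numerically. Below is a minimal sketch, assuming NumPy and a made-up quadratic E_in, that compares one gradient step on Ẽ(w) = λ w^T w + E_in(w) with the weight-decay form of the update; the matrix A and vector b are hypothetical.

```python
# A minimal sketch: one gradient-descent step on E_tilde(w) = lambda * w^T w + E_in(w)
# versus the weight-decay update (1 - 2*eta*lambda) * w_old - eta * grad_Ein(w_old).
# Assumes NumPy; E_in(w) = 0.5 * w^T A w - b^T w is a made-up differentiable example.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
A = A @ A.T + np.eye(3)                  # hypothetical positive-definite matrix
b = rng.standard_normal(3)

def grad_Ein(w):
    return A @ w - b                     # gradient of the made-up E_in

lam, eta = 0.1, 0.01
w_old = rng.standard_normal(3)

# gradient of E_tilde(w) is 2*lambda*w + grad_Ein(w)
step_full  = w_old - eta * (2 * lam * w_old + grad_Ein(w_old))
step_decay = (1 - 2 * eta * lam) * w_old - eta * grad_Ein(w_old)

print(np.allclose(step_full, step_decay))   # True: the two updates coincide
```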

5.4 Regularization and Virtual Examples

(1) (5%) Consider the regularized linear regression formulation in Problem 5.3. Let

\tilde{x}_i = (\underbrace{0, \cdots, 0}_{i-1}, \sqrt{N\lambda}, \underbrace{0, \cdots, 0}_{d-i+1})

and ỹ_i = 0. Prove that solving the formulation is equivalent to applying the plain-vanilla linear regression on {(x_n, y_n)}_{n=1}^{N} ∪ {(x̃_i, ỹ_i)}_{i=1}^{d+1}.

(Note: The pairs (x̃_i, ỹ_i) are called virtual examples. The result above shows that regularization can be viewed as using “additional” examples to guide the learning process.)

(2) (5%) Recall that when X^T X is invertible, the solution of the plain-vanilla linear regression is (X^T X)^{-1} X^T Y. Explain how you can use the result from (1) to easily derive the solution of the regularized linear regression formulation.
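As a sanity check on the equivalence claimed in (1), the sketch below (assuming NumPy and made-up data) compares the closed-form solution of the regularized formulation, w = (X^T X + Nλ I)^{-1} X^T Y, with plain-vanilla linear regression on the data set augmented by the d + 1 virtual examples; the data and λ are hypothetical.

```python
# A minimal sketch: regularized linear regression via its closed form versus
# plain-vanilla linear regression on the data augmented with the virtual examples
# x~_i = sqrt(N * lambda) * e_i, y~_i = 0. Assumes NumPy; the data are made up.
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])   # includes x_0 = 1
y = rng.standard_normal(N)
lam = 0.5

# closed form of the regularized formulation: (X^T X + N*lambda*I)^{-1} X^T y
w_reg = np.linalg.solve(X.T @ X + N * lam * np.eye(d + 1), X.T @ y)

# plain-vanilla linear regression on the augmented data set
X_virtual = np.sqrt(N * lam) * np.eye(d + 1)     # rows are the d+1 virtual x~_i
X_aug = np.vstack([X, X_virtual])
y_aug = np.concatenate([y, np.zeros(d + 1)])
w_virtual = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_aug)

print(np.allclose(w_reg, w_virtual))             # True: the two solutions match
```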

5.5 Non-Falsifiability

(1) (3%) Do Exercise 5.2(b) of LFD.

(2) (3%) Do Exercise 5.2(c) of LFD.

(3) (3%) Do Exercise 5.2(d) of LFD.

(4) (3%) Do Exercise 5.2(e) of LFD.

(5) (3%) Do Exercise 5.2(f) of LFD.


5.6 Regularized Linear Regression (*)

(1) (10%) Implement the linear regression algorithm in Problem 4.5. Run the algorithm on the following data set for training:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_6_train.dat

and the following set for testing:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_6_test.dat

Let

g(x) = sign(w^T x).

What is E_in(g) in terms of the 0/1 loss (classification)? How about E_out(g)?

Plot the training examples (x_n, y_n) and the decision boundary w^T x = 0 in the same figure. Use different symbols to distinguish examples with different y_n. Briefly state your findings.

Please check the course policy carefully and do not use sophisticated packages in your solution. You can use standard matrix multiplication and inversion routines.
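One possible way to organize the implementation is sketched below, assuming NumPy and assuming each row of the .dat files holds the input features followed by a ±1 label (check this against the downloaded data); the local file names are placeholders for the files downloaded from the URLs above.

```python
# A minimal sketch (not a reference solution): plain-vanilla linear regression used
# for classification with the 0/1 loss. Assumes NumPy and assumes each row of the
# .dat files is "features ... label"; verify against the downloaded files.
import numpy as np

def load(path):
    data = np.loadtxt(path)
    X = np.hstack([np.ones((data.shape[0], 1)), data[:, :-1]])   # add x_0 = 1
    y = data[:, -1]
    return X, y

def zero_one_error(w, X, y):
    return np.mean(np.sign(X @ w) != y)

X_train, y_train = load("hw5_6_train.dat")   # placeholder local file names
X_test,  y_test  = load("hw5_6_test.dat")

# w_lin = (X^T X)^{-1} X^T y via a linear solve
w_lin = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

print("E_in (0/1):", zero_one_error(w_lin, X_train, y_train))
print("E_out(0/1):", zero_one_error(w_lin, X_test,  y_test))
```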

(2) (10%) Split the given training examples in

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_6_train.dat

into 120 “base” examples (the first 120) and 80 “validation” ones (the last 80).

Ideally, you should do the 120/80 split randomly. Because the given examples are already randomly permuted, however, we use a fixed split for the purpose of this problem.

Implement an algorithm that solves the regularized linear regression formulation in Problem 5.3.

Run the algorithm on the 120 base examples using log_10 λ ∈ {2, 1, 0, −1, ..., −8, −9, −10}. Let g_λ(x) = sign(w_reg(λ)^T x).

Validate g_λ with the 80 validation examples and test it with the test examples in

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_6_test.dat

Plot E_base(g_λ), E_val(g_λ), and E_out(g_λ) on the same figure as a function of log_10 λ, where the base training error E_base is E_in evaluated on only the 120 base examples. Briefly state your findings.
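A possible structure for the λ sweep is sketched below; it assumes NumPy, reuses the load and zero_one_error helpers from the previous sketch, and keeps the same assumptions about the file format and local file names.

```python
# A minimal sketch: regularized linear regression on the 120 base examples over a
# grid of lambda values, evaluated with the 0/1 error on the base, validation, and
# test sets. Reuses the hypothetical load/zero_one_error helpers sketched above.
import numpy as np

X, y = load("hw5_6_train.dat")               # placeholder local file names
X_test, y_test = load("hw5_6_test.dat")
X_base, y_base = X[:120], y[:120]            # fixed 120/80 split
X_val,  y_val  = X[120:], y[120:]

N, dim = X_base.shape
for log_lam in range(2, -11, -1):            # log10(lambda) = 2, 1, ..., -10
    lam = 10.0 ** log_lam
    w = np.linalg.solve(X_base.T @ X_base + N * lam * np.eye(dim),
                        X_base.T @ y_base)
    print(log_lam,
          zero_one_error(w, X_base, y_base),   # E_base(g_lambda)
          zero_one_error(w, X_val,  y_val),    # E_val(g_lambda)
          zero_one_error(w, X_test, y_test))   # E_out(g_lambda)
```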

5.7 Large-Margin Perceptron Classification (*)

(1) (10%) Implement the perceptron learning algorithm in Problem 1.3. Run the algorithm on the following data set for training (until E_in reaches 0):

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_7_train.dat

and the following set for testing:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_7_test.dat

Let w = (w_0, u) be the solution from PLA and g(x) = sign(w^T x). Record the following two items:

• the “thickness” of w: min_n { y_n (w^T x_n) / ‖u‖ }

• the out-of-sample error E_out of g

Repeat the experiment over 100 runs. Plot a histogram of the thickness and another histogram of the out-of-sample error. Briefly state your findings.
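A single run of the experiment could look like the sketch below, assuming NumPy, the load helper from the earlier sketch, and the same assumed .dat format; randomness enters only through the visiting order of the examples, so repeating the call 100 times gives the material for the histograms.

```python
# A minimal sketch: one PLA run on a random visiting order, recording the
# "thickness" min_n y_n * (w^T x_n) / ||u|| and the 0/1 out-of-sample error.
# Assumes NumPy and the hypothetical load helper sketched earlier.
import numpy as np

def pla(X, y, rng):
    w = np.zeros(X.shape[1])
    order = rng.permutation(len(y))
    updated = True
    while updated:                            # keep cycling until E_in reaches 0
        updated = False
        for n in order:
            if np.sign(X[n] @ w) != y[n]:     # treat sign(0) as a mistake
                w = w + y[n] * X[n]
                updated = True
    return w

X_train, y_train = load("hw5_7_train.dat")    # placeholder local file names
X_test,  y_test  = load("hw5_7_test.dat")

rng = np.random.default_rng()
w = pla(X_train, y_train, rng)                # w = (w_0, u)
u = w[1:]
thickness = np.min(y_train * (X_train @ w) / np.linalg.norm(u))
e_out = np.mean(np.sign(X_test @ w) != y_test)
print(thickness, e_out)
```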

(2) (10%) Implement the large-margin perceptron formulation below:

\min_{u, b} \; \frac{1}{2} u^T u \quad \text{subject to} \quad y_n (u^T x_n + b) \ge 1 \; \text{ for } n = 1, 2, \ldots, N.


Run the algorithm on the following data set for training:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_7_train.dat

and the following set for testing:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_7_test.dat

Let w = (b, u) be the solution from the formulation and g(x) = sign(w^T x). Report the following two items:

• the “thickness” of w: min_n { y_n (w^T x_n) / ‖u‖ }

• the out-of-sample error E_out of g

Compare the numbers with the histograms that you get from PLA. Briefly state your findings.

(Note: You can use any general-purpose packages for quadratic programming to solve this problem, but you cannot use any SVM-specific packages.)
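One way to pose the problem to a general-purpose QP solver is sketched below, assuming NumPy and the cvxopt package (any comparable QP package would do); the load helper and file names are the same hypothetical ones as in the earlier sketches, and the tiny ridge on the b entry is only a numerical convenience, not part of the formulation.

```python
# A minimal sketch: the large-margin perceptron as a quadratic program over z = (b, u),
# minimizing (1/2) u^T u subject to y_n (u^T x_n + b) >= 1, written as G z <= h for
# cvxopt.solvers.qp. Assumes NumPy, cvxopt, and the hypothetical load helper above.
import numpy as np
from cvxopt import matrix, solvers

X_train, y_train = load("hw5_7_train.dat")    # placeholder local file name
X_raw = X_train[:, 1:]                        # drop the constant x_0 column
N, d = X_raw.shape

P = np.zeros((d + 1, d + 1))
P[1:, 1:] = np.eye(d)                         # (1/2) z^T P z = (1/2) u^T u
P[0, 0] = 1e-9                                # tiny ridge on b for solver stability only
q = np.zeros((d + 1, 1))
G = -y_train[:, None] * np.hstack([np.ones((N, 1)), X_raw])   # -y_n * (1, x_n^T)
h = -np.ones((N, 1))

sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
z = np.array(sol["x"]).ravel()
b, u = z[0], z[1:]

w = np.concatenate([[b], u])                  # w = (b, u), so g(x) = sign(w^T x)
thickness = np.min(y_train * (X_train @ w) / np.linalg.norm(u))
print(thickness)
```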

