
Machine Learning (NTU, Fall 2010) instructor: Hsuan-Tien Lin

Homework #5

TA in charge: Yu-Cheng Chou
RELEASE DATE: 11/08/2010

DUE DATE: 11/29/2010, 4:00 pm IN CLASS
TA SESSION: 11/25/2010, 6:00 pm IN R110

Unless granted by the instructor in advance, you must turn in a hard copy of your solutions (without the source code) for all problems. For problems marked with (*), please follow the guidelines on the course website and upload your source code to designated places.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for such misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

5.1 Low-Order Transforms: Decision Stump

(1) (5%) Do the first part in Exercise 3.9 of LFD (prove the d_VC of a single transform).

(2) (5%) Do the second part in Exercise 3.9 of LFD (prove the d_VC of the union).

5.2 Data-dependent Transforms

A Transformer thinks the following procedures would work well in getting a low E_out from any two-dimensional data set. Please point out any potential caveats in the procedures:

(1) (5%) Use the feature transform

\Phi(x) = \begin{cases} (\underbrace{0, \cdots, 0}_{n-1}, 1, 0, \cdots) & \text{if } x = x_n \\ (0, \cdots, 0, 0, 0, \cdots) & \text{otherwise} \end{cases}

before running PLA.

(2) (5%) Use the feature transform Φ(x) = (φ_1(x), φ_2(x), ..., φ_N(x)) with

\phi_n(x) = \exp\left( -\frac{\|x - x_n\|_2^2}{\sigma^2} \right)

and some very small σ^2 before running PLA.

(Note: You can use the fact that if x_1, x_2, ..., x_N are all different, the matrix A = [a_mn] with a_mn = φ_m(x_n) is always invertible.)
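For concreteness, here is a minimal sketch (not part of the assignment) that builds the matrix A = [a_mn] with a_mn = φ_m(x_n) mentioned in the note, assuming NumPy; the toy points and the value of σ^2 below are made up purely for illustration.

```python
# A minimal sketch: constructing the data-dependent Gaussian feature matrix
# A[m, n] = phi_m(x_n) for a hypothetical toy 2-D data set. Assumes NumPy;
# the points and sigma^2 are made up for illustration.
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.5], [0.2, 0.9]])   # hypothetical x_1, x_2, x_3
sigma2 = 0.01                                         # a "very small" sigma^2

# pairwise squared distances ||x_m - x_n||_2^2
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
A = np.exp(-sq_dist / sigma2)                         # A[m, n] = phi_m(x_n)

print(A)   # for very small sigma^2, A is numerically close to the identity
```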


5.3 Regularization and Weight Decay

Consider the regularized linear regression formulation

\min_w \; \lambda\, w^T w + \frac{1}{N} \sum_{n=1}^{N} \left( y_n - w^T x_n \right)^2

with some λ > 0.

(1) (5%) Let w_lin be the optimal solution for the plain-vanilla linear regression and w_reg(λ) be the optimal solution for the formulation above. Prove that ‖w_reg(λ)‖ ≤ ‖w_lin‖ for any λ > 0.

(Note: This is one origin of the name “weight decay.”)

Next, consider a more general formulation of regularized learning:

\min_w \; \tilde{E}(w) = \lambda\, w^T w + E_{in}(w)

with some λ ≥ 0, where λ = 0 corresponds to not using regularization.

(2) (5%) Let w_reg(λ) be the optimal solution for the formulation above. Prove that ‖w_reg(λ)‖ is a non-increasing function of λ for λ ≥ 0.

(3) (5%) Assume that E_in is differentiable and use gradient descent to minimize Ẽ:

w_{new} \leftarrow w_{old} - \eta \nabla \tilde{E}(w_{old}).

Show that the update rule above is the same as

w_{new} \leftarrow (1 - 2\eta\lambda)\, w_{old} - \eta \nabla E_{in}(w_{old}).

(Note: This is another origin of the name “weight decay”: w_old decays before being updated by the gradient of E_in.)
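The equivalence in (3) is easy to check numerically. Below is a minimal sketch, assuming NumPy and a made-up quadratic E_in, that compares one gradient step on Ẽ(w) = λ w^T w + E_in(w) with the weight-decay form of the update; the matrix A and vector b are hypothetical.

```python
# A minimal sketch: one gradient-descent step on E_tilde(w) = lambda * w^T w + E_in(w)
# versus the weight-decay update (1 - 2*eta*lambda) * w_old - eta * grad_Ein(w_old).
# Assumes NumPy; E_in(w) = 0.5 * w^T A w - b^T w is a made-up differentiable example.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
A = A @ A.T + np.eye(3)                  # hypothetical positive-definite matrix
b = rng.standard_normal(3)

def grad_Ein(w):
    return A @ w - b                     # gradient of the made-up E_in

lam, eta = 0.1, 0.01
w_old = rng.standard_normal(3)

# gradient of E_tilde(w) is 2*lambda*w + grad_Ein(w)
step_full  = w_old - eta * (2 * lam * w_old + grad_Ein(w_old))
step_decay = (1 - 2 * eta * lam) * w_old - eta * grad_Ein(w_old)

print(np.allclose(step_full, step_decay))   # True: the two updates coincide
```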

5.4 Regularization and Virtual Examples

(1) (5%) Consider the regularized linear regression formulation in Problem 5.3. Let

\tilde{x}_i = (\underbrace{0, \cdots, 0}_{i-1}, \sqrt{N\lambda}, \underbrace{0, \cdots, 0}_{d-i+1})

and ỹ_i = 0. Prove that solving the formulation is equivalent to applying the plain-vanilla linear regression on {(x_n, y_n)}_{n=1}^{N} ∪ {(x̃_i, ỹ_i)}_{i=1}^{d+1}.

(Note: The pairs (x̃_i, ỹ_i) are called virtual examples. The result above shows that regularization can be viewed as using “additional” examples to guide the learning process.)

(2) (5%) Recall that when X^T X is invertible, the solution of the plain-vanilla linear regression is (X^T X)^{-1} X^T Y. Explain how you can use the result from (1) to easily derive the solution of the regularized linear regression formulation.
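As a sanity check on the equivalence claimed in (1), the sketch below (assuming NumPy and made-up data) compares the closed-form solution of the regularized formulation, w = (X^T X + Nλ I)^{-1} X^T Y, with plain-vanilla linear regression on the data set augmented by the d + 1 virtual examples; the data and λ are hypothetical.

```python
# A minimal sketch: regularized linear regression via its closed form versus
# plain-vanilla linear regression on the data augmented with the virtual examples
# x~_i = sqrt(N * lambda) * e_i, y~_i = 0. Assumes NumPy; the data are made up.
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])   # includes x_0 = 1
y = rng.standard_normal(N)
lam = 0.5

# closed form of the regularized formulation: (X^T X + N*lambda*I)^{-1} X^T y
w_reg = np.linalg.solve(X.T @ X + N * lam * np.eye(d + 1), X.T @ y)

# plain-vanilla linear regression on the augmented data set
X_virtual = np.sqrt(N * lam) * np.eye(d + 1)     # rows are the d+1 virtual x~_i
X_aug = np.vstack([X, X_virtual])
y_aug = np.concatenate([y, np.zeros(d + 1)])
w_virtual = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_aug)

print(np.allclose(w_reg, w_virtual))             # True: the two solutions match
```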

5.5 Non-Falsifiability

(1) (3%) Do Exercise 5.2(b) of LFD.

(2) (3%) Do Exercise 5.2(c) of LFD.

(3) (3%) Do Exercise 5.2(d) of LFD.

(4) (3%) Do Exercise 5.2(e) of LFD.

(5) (3%) Do Exercise 5.2(f) of LFD.


5.6 Regularized Linear Regression (*)

(1) (10%) Implement the linear regression algorithm in Problem 4.5. Run the algorithm on the following data set for training:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_6_train.dat

and the following set for testing:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_6_test.dat

Let

g(x) = sign(w^T x).

What is E_in(g) in terms of the 0/1 loss (classification)? How about E_out(g)?

Plot the training examples (x_n, y_n) and the decision boundary w^T x = 0 in the same figure. Use different symbols to distinguish examples with different y_n. Briefly state your findings.

Please check the course policy carefully and do not use sophisticated packages in your solution. You can use standard matrix multiplication and inversion routines.
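One possible way to organize the implementation is sketched below, assuming NumPy and assuming each row of the .dat files holds the input features followed by a ±1 label (check this against the downloaded data); the local file names are placeholders for the files downloaded from the URLs above.

```python
# A minimal sketch (not a reference solution): plain-vanilla linear regression used
# for classification with the 0/1 loss. Assumes NumPy and assumes each row of the
# .dat files is "features ... label"; verify against the downloaded files.
import numpy as np

def load(path):
    data = np.loadtxt(path)
    X = np.hstack([np.ones((data.shape[0], 1)), data[:, :-1]])   # add x_0 = 1
    y = data[:, -1]
    return X, y

def zero_one_error(w, X, y):
    return np.mean(np.sign(X @ w) != y)

X_train, y_train = load("hw5_6_train.dat")   # placeholder local file names
X_test,  y_test  = load("hw5_6_test.dat")

# w_lin = (X^T X)^{-1} X^T y via a linear solve
w_lin = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

print("E_in (0/1):", zero_one_error(w_lin, X_train, y_train))
print("E_out(0/1):", zero_one_error(w_lin, X_test,  y_test))
```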

(2) (10%) Split the given training examples in

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_6_train.dat

into 120 “base” examples (the first 120) and 80 “validation” ones (the last 80).

Ideally, you should do the 120/80 split randomly. Because the given examples are already randomly permuted, however, we use a fixed split for the purpose of this problem.

Implement an algorithm that solves the regularized linear regression formulation in Problem 5.3.

Run the algorithm on the 120 base examples using log_10 λ ∈ {2, 1, 0, −1, ..., −8, −9, −10}. Let g_λ(x) = sign(w_reg(λ)^T x).

Validate g_λ with the 80 validation examples and test it with the test examples in

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_6_test.dat

Plot E_base(g_λ), E_val(g_λ), and E_out(g_λ) on the same figure as a function of log_10 λ, where the base training error E_base is E_in evaluated on only the 120 base examples. Briefly state your findings.
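A possible structure for the λ sweep is sketched below; it assumes NumPy, reuses the load and zero_one_error helpers from the previous sketch, and keeps the same assumptions about the file format and local file names.

```python
# A minimal sketch: regularized linear regression on the 120 base examples over a
# grid of lambda values, evaluated with the 0/1 error on the base, validation, and
# test sets. Reuses the hypothetical load/zero_one_error helpers sketched above.
import numpy as np

X, y = load("hw5_6_train.dat")               # placeholder local file names
X_test, y_test = load("hw5_6_test.dat")
X_base, y_base = X[:120], y[:120]            # fixed 120/80 split
X_val,  y_val  = X[120:], y[120:]

N, dim = X_base.shape
for log_lam in range(2, -11, -1):            # log10(lambda) = 2, 1, ..., -10
    lam = 10.0 ** log_lam
    w = np.linalg.solve(X_base.T @ X_base + N * lam * np.eye(dim),
                        X_base.T @ y_base)
    print(log_lam,
          zero_one_error(w, X_base, y_base),   # E_base(g_lambda)
          zero_one_error(w, X_val,  y_val),    # E_val(g_lambda)
          zero_one_error(w, X_test, y_test))   # E_out(g_lambda)
```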

5.7 Large-Margin Perceptron Classification (*)

(1) (10%) Implement the perceptron learning algorithm in Problem 1.3. Run the algorithm on the following data set for training (until E_in reaches 0):

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_7_train.dat

and the following set for testing:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_7_test.dat

Let w = (w_0, u) be the solution from PLA and g(x) = sign(w^T x). Record the following two items:

• the “thickness” of w: min_n { y_n (w^T x_n) / ‖u‖ }

• the out-of-sample error E_out of g

Repeat the experiment over 100 runs. Plot a histogram of the thickness and another histogram of the out-of-sample error. Briefly state your findings.
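A single run of the experiment could look like the sketch below, assuming NumPy, the load helper from the earlier sketch, and the same assumed .dat format; randomness enters only through the visiting order of the examples, so repeating the call 100 times gives the material for the histograms.

```python
# A minimal sketch: one PLA run on a random visiting order, recording the
# "thickness" min_n y_n * (w^T x_n) / ||u|| and the 0/1 out-of-sample error.
# Assumes NumPy and the hypothetical load helper sketched earlier.
import numpy as np

def pla(X, y, rng):
    w = np.zeros(X.shape[1])
    order = rng.permutation(len(y))
    updated = True
    while updated:                            # keep cycling until E_in reaches 0
        updated = False
        for n in order:
            if np.sign(X[n] @ w) != y[n]:     # treat sign(0) as a mistake
                w = w + y[n] * X[n]
                updated = True
    return w

X_train, y_train = load("hw5_7_train.dat")    # placeholder local file names
X_test,  y_test  = load("hw5_7_test.dat")

rng = np.random.default_rng()
w = pla(X_train, y_train, rng)                # w = (w_0, u)
u = w[1:]
thickness = np.min(y_train * (X_train @ w) / np.linalg.norm(u))
e_out = np.mean(np.sign(X_test @ w) != y_test)
print(thickness, e_out)
```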

(2) (10%) Implement the large-margin perceptron formulation below:

\min_{u, b} \; \frac{1}{2} u^T u \quad \text{subject to} \quad y_n (u^T x_n + b) \ge 1 \; \text{ for } n = 1, 2, \ldots, N.


Run the algorithm on the following data set for training:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_7_train.dat

and the following set for testing:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_7_test.dat

Let w = (b, u) be the solution from the formulation and g(x) = sign(w^T x). Report the following two items:

• the “thickness” of w: min_n { y_n (w^T x_n) / ‖u‖ }

• the out-of-sample error E_out of g

Compare the numbers with the histograms that you get from PLA. Briefly state your findings.

(Note: You can use any general-purpose packages for quadratic programming to solve this problem, but you cannot use any SVM-specific packages.)
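One way to pose the problem to a general-purpose QP solver is sketched below, assuming NumPy and the cvxopt package (any comparable QP package would do); the load helper and file names are the same hypothetical ones as in the earlier sketches, and the tiny ridge on the b entry is only a numerical convenience, not part of the formulation.

```python
# A minimal sketch: the large-margin perceptron as a quadratic program over z = (b, u),
# minimizing (1/2) u^T u subject to y_n (u^T x_n + b) >= 1, written as G z <= h for
# cvxopt.solvers.qp. Assumes NumPy, cvxopt, and the hypothetical load helper above.
import numpy as np
from cvxopt import matrix, solvers

X_train, y_train = load("hw5_7_train.dat")    # placeholder local file name
X_raw = X_train[:, 1:]                        # drop the constant x_0 column
N, d = X_raw.shape

P = np.zeros((d + 1, d + 1))
P[1:, 1:] = np.eye(d)                         # (1/2) z^T P z = (1/2) u^T u
P[0, 0] = 1e-9                                # tiny ridge on b for solver stability only
q = np.zeros((d + 1, 1))
G = -y_train[:, None] * np.hstack([np.ones((N, 1)), X_raw])   # -y_n * (1, x_n^T)
h = -np.ones((N, 1))

sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
z = np.array(sol["x"]).ravel()
b, u = z[0], z[1:]

w = np.concatenate([[b], u])                  # w = (b, u), so g(x) = sign(w^T x)
thickness = np.min(y_train * (X_train @ w) / np.linalg.norm(u))
print(thickness)
```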

