Machine Learning (NTU, Fall 2010) instructor: Hsuan-Tien Lin
Homework #5
RELEASE DATE: 11/08/2010
DUE DATE: 11/29/2010, 4:00 pm IN CLASS
TA SESSION: 11/25/2010, 6:00 pm IN R110
TA in charge: Yu-Cheng Chou
Unless granted by the instructor in advance, you must turn in a hard copy of your solutions (without the source code) for all problems. For problems marked with (*), please follow the guidelines on the course website and upload your source code to designated places.
Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for such misconduct.
Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.
Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.
You should write your solutions in English with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.
5.1 Low-Order Transforms: Decision Stump
(1) (5%) Do the first part in Exercise 3.9 of LFD (prove the dVC of a single transform).
(2) (5%) Do the second part in Exercise 3.9 of LFD (prove the dVC of the union).
5.2 Data-dependent Transforms
A Transformer thinks the following procedures would work well in getting a low Eout from any two-dimensional data set. Please point out any potential caveats in the procedures:
(1) (5%) Use the feature transform
Φ(x) = (0, ⋯, 0, 1, 0, ⋯, 0)   if x = xn (the single 1 preceded by n−1 zeros)
Φ(x) = (0, 0, ⋯, 0)            otherwise,

before running PLA.
(2) (5%) Use the feature transform Φ(x) = (φ1(x), φ2(x), ⋯, φN(x)) with

φn(x) = exp( −‖x − xn‖² / (2σ²) )

and some very small σ² before running PLA.
(Note: You can use the fact that if x1, x2, ⋯, xN are all different, the matrix A = [amn] with amn = φm(xn) is always invertible.)
5.3 Regularization and Weight Decay
Consider the regularized linear regression formulation
min_w   λ w•w + (1/N) ∑_{n=1}^{N} (yn − w•xn)²

with some λ > 0.
(1) (5%) Let wlin be the optimal solution for the plain-vanilla linear regression and wreg(λ) be the optimal solution for the formulation above. Prove that ‖wreg(λ)‖ ≤ ‖wlin‖ for any λ > 0.
(Note: This is one origin of the name “weight decay.”)
Next, consider a more general formulation of regularized learning:
min_w   Ẽ(w) = λ w•w + Ein(w)

with some λ ≥ 0, where λ = 0 corresponds to not using regularization.
(2) (5%) Let wreg(λ) be the optimal solution for the formulation above. Prove that ‖wreg(λ)‖ is a non-increasing function of λ for λ ≥ 0.
(3) (5%) Assume that Ein is differentiable and use gradient descent to minimize Ẽ:

w_new ← w_old − η ∇Ẽ(w_old).
Show that the update rule above is the same as

w_new ← (1 − 2ηλ) w_old − η ∇Ein(w_old).
(Note: This is another origin of the name “weight decay”: w_old decays before being updated by the gradient of Ein.)
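As a quick numeric check (not a proof) of the equivalence in (3), the two update rules can be compared on random data; all names below are illustrative, and the squared error of Problem 5.3 is used for Ein:

```python
import numpy as np

def grad_Ein(w, X, y):
    # gradient of Ein(w) = (1/N) * sum_n (y_n - w.x_n)^2, i.e. -(2/N) X^T (y - X w)
    return -2.0 / len(y) * X.T @ (y - X @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w_old = rng.normal(size=3)
eta, lam = 0.1, 0.05

# one step on the regularized objective E~(w) = lambda w.w + Ein(w)
step_full = w_old - eta * (2 * lam * w_old + grad_Ein(w_old, X, y))
# the equivalent "weight decay" form: shrink w_old first, then apply the Ein gradient
step_decay = (1 - 2 * eta * lam) * w_old - eta * grad_Ein(w_old, X, y)

assert np.allclose(step_full, step_decay)
```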
5.4 Regularization and Virtual Examples
(1) (5%) Consider the regularized linear regression formulation in Problem 5.3. Let

x̃i = (0, ⋯, 0, √(Nλ), 0, ⋯, 0)   (the √(Nλ) preceded by i−1 zeros and followed by d−i+1 zeros)

and ỹi = 0. Prove that solving the formulation is equivalent to applying the plain-vanilla linear regression on {(xn, yn)}_{n=1}^{N} ∪ {(x̃i, ỹi)}_{i=1}^{d+1}.
(Note: The pairs (x̃i, ỹi) are called virtual examples. The result above shows that regularization can be viewed as using “additional” examples to guide the learning process.)
(2) (5%) Recall that when XᵀX is invertible, the solution of the plain-vanilla linear regression is (XᵀX)⁻¹XᵀY. Explain how you can use the result from (1) to easily derive the solution of the regularized linear regression formulation.
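A numeric sketch of the equivalence in (1), assuming the ridge normal equations (XᵀX + NλI)w = XᵀY that follow from the Problem 5.3 formulation; all names and the random data are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 4
X = np.c_[np.ones(N), rng.normal(size=(N, d))]   # include x0 = 1, so d+1 columns
y = rng.normal(size=N)
lam = 0.5

# regularized solution from the normal equations: (X^T X + N*lambda*I)^-1 X^T y
w_reg = np.linalg.solve(X.T @ X + N * lam * np.eye(d + 1), X.T @ y)

# plain least squares on the data augmented with the d+1 virtual examples
# x~_i = sqrt(N*lambda) * e_i, y~_i = 0
X_aug = np.vstack([X, np.sqrt(N * lam) * np.eye(d + 1)])
y_aug = np.concatenate([y, np.zeros(d + 1)])
w_virtual, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

assert np.allclose(w_reg, w_virtual)
```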
5.5 Non-Falsifiability
(1) (3%) Do Exercise 5.2(b) of LFD.
(2) (3%) Do Exercise 5.2(c) of LFD.
(3) (3%) Do Exercise 5.2(d) of LFD.
(4) (3%) Do Exercise 5.2(e) of LFD.
(5) (3%) Do Exercise 5.2(f) of LFD.
5.6 Regularized Linear Regression (*)
(1) (10%) Implement the linear regression algorithm in Problem 4.5. Run the algorithm on the following data set for training:
http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_6_train.dat and the following set for testing
http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_6_test.dat
Let

g(x) = sign(w•x).
What is Ein(g) in terms of the 0/1 loss (classification)? How about Eout(g)?
Plot the training examples (xn, yn) and the decision boundary w•x = 0 in the same figure. Use different symbols to distinguish examples with different yn. Briefly state your findings.
Please check the course policy carefully and do not use sophisticated packages in your solution. You can use standard matrix multiplication and inversion routines.
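The workflow for (1) can be sketched as follows, with synthetic data standing in for hw5_6_train.dat and hw5_6_test.dat (replace the generator with np.loadtxt on the actual files); the plotting step is omitted and all names are illustrative:

```python
import numpy as np

# synthetic stand-in for the course data files; each example is (x1, x2, y)
rng = np.random.default_rng(2)
def make_set(n):
    X = rng.normal(size=(n, 2))
    y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n))
    return np.c_[np.ones(n), X], y           # prepend x0 = 1

X_train, y_train = make_set(200)
X_test, y_test = make_set(1000)

# plain-vanilla linear regression: w = pseudo-inverse of X applied to y
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def zero_one_error(X, y, w):
    # fraction of examples where sign(w.x) disagrees with y
    return np.mean(np.sign(X @ w) != y)

E_in = zero_one_error(X_train, y_train, w)
E_out = zero_one_error(X_test, y_test, w)
```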
(2) (10%) Split the given training examples in
http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_6_train.dat to 120 “base” examples (the first 120) and 80 “validation” ones (the last 80).
Ideally, you should do the 120/80 split randomly. Because the given examples are already randomly permuted, however, we will use a fixed split for the purpose of this problem.
Implement an algorithm that solves the regularized linear regression formulation in Problem 5.3.
Run the algorithm on the 120 base examples using log10 λ ∈ {2, 1, 0, −1, ⋯, −8, −9, −10}. Let gλ(x) = sign(wreg(λ)•x).
Validate gλ with the 80 validation examples and test it with the test examples in
http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_6_test.dat
Plot Ebase(gλ), Eval(gλ), and Eout(gλ) on the same figure as a function of log10 λ, where the base training error Ebase is Ein evaluated on only the 120 base examples. Briefly state your findings.
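The λ sweep over the base/validation split can be sketched as follows (again with synthetic data standing in for the course files; the wreg(λ) closed form assumes the Problem 5.3 normal equations, and the plotting step is omitted):

```python
import numpy as np

# synthetic stand-in for hw5_6_train.dat: 200 examples with x0 = 1 prepended
rng = np.random.default_rng(3)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]
y = np.sign(X @ np.array([0.1, 1.0, -0.5]) + 0.2 * rng.normal(size=200))

X_base, y_base = X[:120], y[:120]        # first 120 examples for training
X_val, y_val = X[120:], y[120:]          # last 80 for validation

def ridge(X, y, lam):
    # wreg(lambda) = (X^T X + N*lambda*I)^-1 X^T y
    N, d1 = X.shape
    return np.linalg.solve(X.T @ X + N * lam * np.eye(d1), X.T @ y)

errs = {}
for log_lam in range(2, -11, -1):        # log10(lambda) from 2 down to -10
    w = ridge(X_base, y_base, 10.0 ** log_lam)
    errs[log_lam] = np.mean(np.sign(X_val @ w) != y_val)

best = min(errs, key=errs.get)           # lambda with the lowest validation error
```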
5.7 Large-Margin Perceptron Classification (*)
(1) (10%) Implement the perceptron learning algorithm in Problem 1.3. Run the algorithm on the following data set for training (until Ein reaches 0):
http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_7_train.dat and the following set for testing
http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_7_test.dat
Let w = (w0, u) be the solution from PLA and g(x) = sign(w•x). Record the following two items:
• the “thickness” of w: min_n yn(w•xn)/‖u‖
• the out-of-sample error Eout of g
Repeat the experiment over 100 runs. Plot a histogram of the thickness and another histogram of the out-of-sample error. Briefly state your findings.
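A single run of the experiment can be sketched as follows on synthetic separable data (a stand-in for hw5_7_train.dat); the histogram step over 100 runs is omitted and all names are illustrative:

```python
import numpy as np

# synthetic, linearly separable stand-in data with a guaranteed margin
rng = np.random.default_rng(4)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]
s = X @ np.array([0.0, 1.0, -1.0])
X, y = X[np.abs(s) > 0.5], np.sign(s[np.abs(s) > 0.5])   # drop near-boundary points

# PLA until Ein reaches 0
w = np.zeros(3)
while True:
    wrong = np.flatnonzero(np.sign(X @ w) != y)
    if wrong.size == 0:
        break                        # Ein = 0
    i = rng.choice(wrong)            # pick a random misclassified example
    w = w + y[i] * X[i]              # PLA update

# thickness = min_n y_n (w . x_n) / ||u||, with u = w without the bias component
u = w[1:]
thickness = np.min(y * (X @ w)) / np.linalg.norm(u)
```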
(2) (10%) Implement the large-margin perceptron formulation below:
min_{u,b}   (1/2) u•u
subject to  yn(u•xn + b) ≥ 1  for n = 1, 2, ⋯, N.
Run the algorithm on the following data set for training:
http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_7_train.dat
and the following set for testing
http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/doc/hw5_7_test.dat
Let w = (b, u) be the solution from the formulation and g(x) = sign(w•x). Report the following two items:
• the “thickness” of w: min_n yn(w•xn)/‖u‖
• the out-of-sample error Eout of g
Compare the numbers with the histograms that you get from PLA. Briefly state your findings.
(Note: You can use any general-purpose packages for quadratic programming to solve this problem, but you cannot use any SVM-specific packages.)
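A minimal sketch of solving the formulation with a general-purpose optimizer (scipy's SLSQP here, standing in for any quadratic-programming package) on a tiny toy data set; the data and names are illustrative only:

```python
import numpy as np
from scipy.optimize import minimize

# toy separable data: positives above, negatives below the diagonal
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):                   # v = (b, u1, u2); minimize (1/2) u.u
    return 0.5 * np.dot(v[1:], v[1:])

# one inequality constraint per example: y_n (u.x_n + b) - 1 >= 0
constraints = [
    {"type": "ineq",
     "fun": (lambda v, xn=xn, yn=yn: yn * (np.dot(v[1:], xn) + v[0]) - 1.0)}
    for xn, yn in zip(X, y)
]
res = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
b, u = res.x[0], res.x[1:]

margins = y * (X @ u + b)           # every margin should be at least 1
```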