
Machine Learning (NTU, Spring 2019) instructor: Hsuan-Tien Lin

Homework #2

RELEASE DATE: 04/05/2019

DUE DATE: 04/30/2019, BEFORE 14:00 ON GRADESCOPE

QUESTIONS ABOUT HOMEWORK MATERIALS ARE WELCOMED ON THE FACEBOOK FORUM.

Please upload your solutions (without the source code) to Gradescope as instructed.

For problems marked with (*), please follow the guidelines on the course website and upload your source code to CEIBA. You are encouraged to (but not required to) include a README to help the TAs check your source code. Any programming language/platform is allowed.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for those kinds of misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English or Chinese with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

This homework set comes with 160 points and 20 bonus points. In general, every homework set would come with a full credit of 160 points, with some possible bonus points.

Descent Methods for Probabilistic SVM

Recall that the probabilistic SVM is based on solving the following optimization problem:

\[
\min_{A,B}\; F(A, B) \;=\; \frac{1}{N}\sum_{n=1}^{N}\ln\Bigl(1+\exp\bigl(-y_n\bigl(A\cdot\bigl(w_{\mathrm{svm}}^{T}\phi(x_n)+b_{\mathrm{svm}}\bigr)+B\bigr)\bigr)\Bigr).
\]

1.

When using gradient descent to minimize F(A, B), we need to compute the gradient first. Let z_n = w_svm^T φ(x_n) + b_svm and p_n = θ(−y_n(A z_n + B)), where θ(s) = exp(s)/(1 + exp(s)) is the usual logistic function. What is the gradient ∇F(A, B) in terms of only y_n, p_n, z_n and N? Prove your answer.
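Whatever closed form you derive for ∇F(A, B), it can be sanity-checked numerically before writing the proof. The following Python sketch is only an illustration on toy values (the arrays z and y are hypothetical stand-ins for the z_n's and the labels y_n, not part of the homework); it evaluates F(A, B) and a central finite-difference estimate of its gradient, against which a derived formula can be compared.

import numpy as np

def F(A, B, z, y):
    # objective of the probabilistic SVM: average of ln(1 + exp(-y_n (A z_n + B)))
    return np.mean(np.log1p(np.exp(-y * (A * z + B))))

def numeric_grad(A, B, z, y, eps=1e-6):
    # central finite-difference estimate of (dF/dA, dF/dB)
    dA = (F(A + eps, B, z, y) - F(A - eps, B, z, y)) / (2 * eps)
    dB = (F(A, B + eps, z, y) - F(A, B - eps, z, y)) / (2 * eps)
    return dA, dB

rng = np.random.default_rng(0)
z = rng.normal(size=20)                  # toy values standing in for z_n
y = rng.choice([-1.0, 1.0], size=20)     # toy labels
print(numeric_grad(1.0, 0.0, z, y))      # compare with your analytic gradient at (A, B) = (1, 0)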

2.

When using the Newton method for minimizing F(A, B) (see Homework 3 of Machine Learning Foundations), we need to compute −(H(F))^{−1} ∇F in each iteration, where H(F) is the Hessian matrix of F at (A, B). Following the notations of Question 1, what is H(F) in terms of only y_n, p_n, z_n and N? Prove your answer.

Extreme Kernel and Overfitting

3.

Assume that there are the same number of positive (y_n = 1) and negative (y_n = −1) examples and, again, that all x_n are different. When using the Gaussian kernel with γ → ∞ in a soft-margin SVM with C > 1, prove or disprove that the optimal α is an all-1 vector.

Blending

4.

Consider the case where the target function f: [0, 1] → R is given by f(x) = x − x² and the input probability distribution is uniform on [0, 1]. Assume that the training set has only two examples generated independently from the input probability distribution and noiselessly by f, and that the learning model is the usual linear regression that minimizes the mean squared error within all hypotheses of the form h(x) = w1 x + w0. What is ḡ(x), the expected value of the hypothesis that the learning algorithm produces (see Page 10 of Lecture 207)? Prove your answer.
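A derived ḡ(x) can also be cross-checked by simulation: repeatedly draw two examples, fit the line through them by least squares, and average the resulting (w1, w0). A rough Python sketch of such a check (the number of trials is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x - x ** 2

trials = 100000
w_sum = np.zeros(2)
for _ in range(trials):
    x = rng.uniform(0.0, 1.0, size=2)        # two noiseless examples from the uniform input distribution
    A = np.column_stack([x, np.ones(2)])     # rows (x_n, 1), so that A @ (w1, w0) = w1 x_n + w0
    w_sum += np.linalg.solve(A, f(x))        # with two points, least squares is the exact interpolant
print(w_sum / trials)                        # Monte Carlo estimate of (E[w1], E[w0]); note g_bar(x) = E[w1] x + E[w0]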

Boosting

5.

Assume that linear regression (for classification) is used within AdaBoost. That is, we need to solve the following weighted-Ein optimization problem with u_n ≥ 0:

\[
\min_{w}\; E_{\mathrm{in}}^{u}(w) \;=\; \frac{1}{N}\sum_{n=1}^{N} u_n\bigl(y_n - w^{T}x_n\bigr)^{2}.
\]

The optimization problem above is equivalent to minimizing the usual Ein of linear regression on some “pseudo data” {(x̃_n, ỹ_n)}_{n=1}^N. Write down your pseudo data (x̃_n, ỹ_n) and prove your answer.

(Hint: There is more than one possible form of pseudo data)

6.

Consider applying the AdaBoost algorithm on a binary classification data set where 78% of the examples are positive. Because there are so many positive examples, the base algorithm within AdaBoost returns a constant classifier g_1(x) = +1 in the first iteration. Let u_+^{(2)} be the individual example weight of each positive example in the second iteration, and u_−^{(2)} be the individual example weight of each negative example in the second iteration. What is u_+^{(2)} / u_−^{(2)}? Prove your answer.

Kernel for Decision Stumps

When talking about non-uniform voting in aggregation, we mentioned that α can be viewed as a weight vector learned from any linear algorithm coupled with the following transform:

\[
\phi(x) = \bigl(g_1(x),\, g_2(x),\, \cdots,\, g_T(x)\bigr).
\]

When studying kernel methods, we mentioned that the kernel is simply a computational short-cut for the inner product (φ(x))^T φ(x′). In this problem, we mix the two topics together using decision stumps as our g_t(x).

7.

Assume that the input vectors contain only integers between −M and M (inclusive), and consider decision stumps of the form

\[
g_{s,i,\theta}(x) = s\cdot\mathrm{sign}\bigl(x_i - \theta\bigr),
\]

where i ∈ {1, 2, · · · , d}, d is the finite dimensionality of the input space, s ∈ {−1, +1}, θ ∈ R, and sign(0) = +1.

Two decision stumps g and ĝ are defined to be the same if g(x) = ĝ(x) for every x ∈ X. Two decision stumps are different if they are not the same. How many different decision stumps are there for the case of d = 2 and M = 5? Explain your answer.

8.

Continuing from the previous problem, let G = { all different decision stumps for X } and enumerate each hypothesis g ∈ G by some index t. Define

\[
\phi_{\mathrm{ds}}(x) = \bigl(g_1(x),\, g_2(x),\, \cdots,\, g_t(x),\, \cdots,\, g_{|G|}(x)\bigr).
\]

For any given (d, M), derive a simple equation that evaluates K_ds(x, x′) = (φ_ds(x))^T φ_ds(x′) efficiently and prove your answer.
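For small (d, M), both Question 7 and Question 8 can be sanity-checked by brute force: enumerate candidate stumps, identify the "different" ones by their output vectors over the finite input space X = {−M, ..., M}^d, and evaluate K_ds(x, x′) directly as the inner product of the resulting φ_ds vectors. The Python sketch below is such a check (the function names are ours, not part of the assignment), against which a derived closed form can be compared.

import numpy as np
from itertools import product

def all_stump_signatures(d, M):
    # enumerate all *different* decision stumps on X = {-M, ..., M}^d,
    # identifying each stump by its outputs on every point of X
    X = np.array(list(product(range(-M, M + 1), repeat=d)))
    thetas = np.append(np.arange(-M, M + 1) - 0.5, M + 0.5)      # one threshold per gap, plus below/above the range
    sigs = set()
    for s in (-1, 1):
        for i in range(d):
            for theta in thetas:
                out = s * np.where(X[:, i] - theta >= 0, 1, -1)  # sign(0) = +1 convention
                sigs.add(tuple(int(v) for v in out))
    return X, sorted(sigs)

def kernel_bruteforce(x, xp, X, sigs):
    # K_ds(x, x') computed directly as the inner product of phi_ds(x) and phi_ds(x')
    idx = {tuple(int(v) for v in row): k for k, row in enumerate(X)}
    a, b = idx[tuple(x)], idx[tuple(xp)]
    return sum(sig[a] * sig[b] for sig in sigs)

X, sigs = all_stump_signatures(d=2, M=5)
print(len(sigs))                                     # number of different stumps (Question 7)
print(kernel_bruteforce((1, -3), (2, 4), X, sigs))   # brute-force K_ds(x, x') for Question 8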


Experiments with Bagging Ridge Regression

First, write a program to implement the (linear) ridge regression algorithm for classification (i.e. use 0/1 error for evaluation). Consider the following data set.

hw2_lssvm_all.dat

Please do add x_0 = 1 to your data. Use the first 400 examples for training to get g and the remaining for testing. Calculate Ein and Eout with the 0/1 error. Consider λ ∈ {0.05, 0.5, 5, 50, 500}.
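A minimal Python sketch of this procedure, using the closed-form ridge solution w = (λI + X^T X)^{-1} X^T y, is given below; the assumed file format (one example per row, features followed by a ±1 label) should be double-checked against the actual data file.

import numpy as np

def load_data(path):
    # assumed format: each row holds the feature values followed by a label in {-1, +1}
    data = np.loadtxt(path)
    X = np.hstack([np.ones((len(data), 1)), data[:, :-1]])   # add x_0 = 1
    return X, data[:, -1]

def ridge_fit(X, y, lam):
    # closed-form ridge regression: w = (lambda I + X^T X)^{-1} X^T y
    return np.linalg.solve(lam * np.eye(X.shape[1]) + X.T @ X, X.T @ y)

def zero_one_error(w, X, y):
    # 0/1 error of the classifier sign(w^T x), with sign(0) taken as +1
    return np.mean(np.where(X @ w >= 0, 1, -1) != y)

X, y = load_data("hw2_lssvm_all.dat")
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]
for lam in [0.05, 0.5, 5, 50, 500]:
    w = ridge_fit(X_tr, y_tr, lam)
    print(lam, zero_one_error(w, X_tr, y_tr), zero_one_error(w, X_te, y_te))   # Ein(g), Eout(g)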

9.

(*) Among all λ, which λ results in the minimum Ein(g)? What is the corresponding Ein(g)?

10.

(*) Among all λ, which λ results in the minimum Eout(g)? What is the corresponding Eout(g)?

Next, write a program to implement bagging on top of ridge regression. Again consider the following data set

hw2_lssvm_all.dat

Please do add x_0 = 1 to your data. Use the first 400 examples for training and the remaining for testing.

Calculate Ein and Eout with the 0/1 error. Note that each ridge regression for classification should take the sign operation before uniform aggregation (with voting). Consider λ ∈ {0.05, 0.5, 5, 50, 500}. Use 400 bootstrapped examples in bagging and 250 iterations of bagging (i.e., 250 g_t's) to get G.
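A sketch of the bagging step on top of the same closed-form ridge solver (the data-format assumption is as before; the random seed and the tie-breaking toward +1 in the vote are arbitrary choices):

import numpy as np

def ridge_fit(X, y, lam):
    # closed-form ridge regression solution
    return np.linalg.solve(lam * np.eye(X.shape[1]) + X.T @ X, X.T @ y)

def bag_vote(X, models):
    # each g_t takes the sign of its ridge output; G aggregates the signs by uniform voting
    votes = sum(np.where(X @ w >= 0, 1, -1) for w in models)
    return np.where(votes >= 0, 1, -1)

data = np.loadtxt("hw2_lssvm_all.dat")                    # assumed format: features, then a +/-1 label
X = np.hstack([np.ones((len(data), 1)), data[:, :-1]])    # add x_0 = 1
y = data[:, -1]
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]

rng = np.random.default_rng(1126)
for lam in [0.05, 0.5, 5, 50, 500]:
    models = []
    for _ in range(250):                                  # 250 iterations of bagging
        idx = rng.integers(0, 400, size=400)              # 400 bootstrapped examples
        models.append(ridge_fit(X_tr[idx], y_tr[idx], lam))
    print(lam, np.mean(bag_vote(X_tr, models) != y_tr),   # Ein(G)
          np.mean(bag_vote(X_te, models) != y_te))        # Eout(G)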

11.

(*) Among all λ, which λ results in the minimum Ein(G)? What is the corresponding Ein(G)?

Compare your results with the one in Question 9 and describe your findings.

12.

(*) Among all λ, which λ results in the minimum Eout(G)? What is the corresponding Eout(G)?

Compare your results with the one in Question 10 and describe your findings.

Experiments with Adaptive Boosting

For Questions 13–16, implement the AdaBoost-Stump algorithm as introduced in Lecture 208. Run the algorithm on the following set for training:

hw2_adaboost_train.dat

and the following set for testing:

hw2_adaboost_test.dat

Use a total of T = 300 iterations (please do not stop earlier than 300), and calculate Ein and Eout with the 0/1 error.

For the decision stump algorithm, please implement the following steps. Any ties can be arbitrarily broken.

(1) For any feature i, sort all the x_{n,i} values to x_{[n],i} such that x_{[n],i} ≤ x_{[n+1],i}.

(2) Consider thresholds within −∞ and all the midpoints (x_{[n],i} + x_{[n+1],i})/2. Test those thresholds with s ∈ {−1, +1} to determine the best (s, θ) combination that minimizes Ein^u using feature i.

(3) Pick the best (s, i, θ) combination by enumerating over all possible i.

For those interested, step 2 can be carried out in O(N) time only!
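For reference, the sketch below implements steps (1)–(3) directly, using the straightforward search rather than the O(N) refinement, and wraps the stump in a plain AdaBoost loop with α_t = ln √((1 − ε_t)/ε_t). It is only one possible implementation; the data-loading line assumes one example per row with features followed by a ±1 label.

import numpy as np

def stump_predict(X, s, i, theta):
    # g_{s,i,theta}(x) = s * sign(x_i - theta), with sign(0) = +1
    return s * np.where(X[:, i] - theta >= 0, 1, -1)

def best_stump(X, y, u):
    # steps (1)-(3): pick the (s, i, theta) minimizing the u-weighted 0/1 error
    best_err, best_param = np.inf, None
    for i in range(X.shape[1]):
        xs = np.sort(X[:, i])                                          # step (1)
        thetas = np.concatenate(([-np.inf], (xs[:-1] + xs[1:]) / 2))   # step (2): -inf and all midpoints
        for theta in thetas:
            for s in (1, -1):
                err = np.sum(u[stump_predict(X, s, i, theta) != y])
                if err < best_err:
                    best_err, best_param = err, (s, i, theta)          # step (3): best over all i as well
    return best_param

def adaboost_stump(X, y, T=300):
    # plain AdaBoost on decision stumps; returns the chosen stumps and their alphas
    u = np.ones(len(y)) / len(y)
    stumps, alphas = [], []
    for _ in range(T):
        s, i, theta = best_stump(X, y, u)
        pred = stump_predict(X, s, i, theta)
        eps = np.sum(u[pred != y]) / np.sum(u)            # weighted error rate (assumed to stay in (0, 1/2))
        scale = np.sqrt((1 - eps) / eps)                  # re-scaling factor
        u = np.where(pred != y, u * scale, u / scale)     # re-weight the examples
        stumps.append((s, i, theta))
        alphas.append(np.log(scale))                      # alpha_t = ln(scale)
    return stumps, alphas

data = np.loadtxt("hw2_adaboost_train.dat")               # assumed format: features, then a +/-1 label
stumps, alphas = adaboost_stump(data[:, :-1], data[:, -1], T=300)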

13.

(*) Plot a figure for t versus Ein(g_t). Should Ein(g_t) be decreasing or increasing? Write down your observations and explanations. What is Ein(g_T)?

14.

(*) Plot a figure for t versus Ein(G_t), where G_t(x) = Σ_{τ=1}^{t} α_τ g_τ(x). That is, G = G_T. Should Ein(G_t) be decreasing or increasing? Write down your observations and explanations. What is Ein(G_T)?

15.

(*) Plot a figure for t versus U_t, where U_t = Σ_{n=1}^{N} u_n^{(t)}. Should U_t be decreasing or increasing? Write down your observations and explanations. What is U_T?

16.

(*) Plot a figure for t versus Eout(G_t) estimated with the test set. Should Eout(G_t) be decreasing or increasing? Write down your observations and explanations. What is Eout(G_T)?


Bonus: Power of Adaptive Boosting

In this part, we will prove that AdaBoost can reach Ein(G_T) = 0 if T is large enough and every hypothesis g_t satisfies ε_t ≤ ε < 1/2. Let U_t be defined as in Question 15. It can be proved (see Lecture 211 of Machine Learning Techniques) that

\[
U_{t+1} \;=\; \frac{1}{N}\sum_{n=1}^{N}\exp\Bigl(-y_n\sum_{\tau=1}^{t}\alpha_\tau g_\tau(x_n)\Bigr),
\]

and that Ein(G_T) ≤ U_{T+1}.

17.

Prove that U_1 = 1 and U_{t+1} = U_t · 2√(ε_t(1 − ε_t)) ≤ U_t · 2√(ε(1 − ε)).

18.

Using the fact that √(ε(1 − ε)) ≤ (1/2) exp(−2(1/2 − ε)²) for ε < 1/2, argue that after T = O(log N) iterations, Ein(G_T) = 0.
