
Machine Learning (NTU, Spring 2019) instructor: Hsuan-Tien Lin

Homework #2

RELEASE DATE: 04/05/2019

DUE DATE: 04/30/2019, BEFORE 14:00 ON GRADESCOPE

QUESTIONS ABOUT HOMEWORK MATERIALS ARE WELCOMED ON THE FACEBOOK FORUM.

Please upload your solutions (without the source code) to Gradescope as instructed.

For problems marked with (*), please follow the guidelines on the course website and upload your source code to CEIBA. You are encouraged to (but not required to) include a README to help the TAs check your source code. Any programming language/platform is allowed.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for those kinds of misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English or Chinese with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

This homework set comes with 160 points and 20 bonus points. In general, every homework set would come with a full credit of 160 points, with some possible bonus points.

Descent Methods for Probabilistic SVM

Recall that the probabilistic SVM is based on solving the following optimization problem:

\[
\min_{A,B}\; F(A, B) \;=\; \frac{1}{N}\sum_{n=1}^{N}\ln\Bigl(1+\exp\bigl(-y_n\bigl(A\cdot\bigl(w_{\mathrm{svm}}^{T}\phi(x_n)+b_{\mathrm{svm}}\bigr)+B\bigr)\bigr)\Bigr).
\]

1.

When using gradient descent to minimize F(A, B), we need to compute the gradient first. Let z_n = w_svm^T φ(x_n) + b_svm and p_n = θ(−y_n(A z_n + B)), where θ(s) = exp(s)/(1 + exp(s)) is the usual logistic function. What is the gradient ∇F(A, B) in terms of only y_n, p_n, z_n and N? Prove your answer.
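Whatever closed form you derive for ∇F(A, B), it can be sanity-checked numerically before writing the proof. The following Python sketch is only an illustration on toy values (the arrays z and y are hypothetical stand-ins for the z_n's and the labels y_n, not part of the homework); it evaluates F(A, B) and a central finite-difference estimate of its gradient, against which a derived formula can be compared.

import numpy as np

def F(A, B, z, y):
    # objective of the probabilistic SVM: average of ln(1 + exp(-y_n (A z_n + B)))
    return np.mean(np.log1p(np.exp(-y * (A * z + B))))

def numeric_grad(A, B, z, y, eps=1e-6):
    # central finite-difference estimate of (dF/dA, dF/dB)
    dA = (F(A + eps, B, z, y) - F(A - eps, B, z, y)) / (2 * eps)
    dB = (F(A, B + eps, z, y) - F(A, B - eps, z, y)) / (2 * eps)
    return dA, dB

rng = np.random.default_rng(0)
z = rng.normal(size=20)                  # toy values standing in for z_n
y = rng.choice([-1.0, 1.0], size=20)     # toy labels
print(numeric_grad(1.0, 0.0, z, y))      # compare with your analytic gradient at (A, B) = (1, 0)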

2.

When using the Newton method for minimizing F(A, B) (see Homework 3 of Machine Learning Foundations), we need to compute −(H(F))^{−1} ∇F in each iteration, where H(F) is the Hessian matrix of F at (A, B). Following the notations of Question 1, what is H(F) in terms of only y_n, p_n, z_n and N? Prove your answer.

Extreme Kernel and Overfitting

3.

Assume that there are the same number of positive (y_n = 1) and negative (y_n = −1) examples and, again, that all x_n are different. When using the Gaussian kernel with γ → ∞ in a soft-margin SVM with C > 1, prove or disprove that the optimal α is an all-1 vector.

Blending

4.

Consider the case where the target function f: [0, 1] → R is given by f(x) = x − x² and the input probability distribution is uniform on [0, 1]. Assume that the training set has only two examples generated independently from the input probability distribution and noiselessly by f, and that the learning model is the usual linear regression that minimizes the mean squared error within all hypotheses of the form h(x) = w1 x + w0. What is ḡ(x), the expected value of the hypothesis that the learning algorithm produces (see Page 10 of Lecture 207)? Prove your answer.
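A derived ḡ(x) can also be cross-checked by simulation: repeatedly draw two examples, fit the line through them by least squares, and average the resulting (w1, w0). A rough Python sketch of such a check (the number of trials is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x - x ** 2

trials = 100000
w_sum = np.zeros(2)
for _ in range(trials):
    x = rng.uniform(0.0, 1.0, size=2)        # two noiseless examples from the uniform input distribution
    A = np.column_stack([x, np.ones(2)])     # rows (x_n, 1), so that A @ (w1, w0) = w1 x_n + w0
    w_sum += np.linalg.solve(A, f(x))        # with two points, least squares is the exact interpolant
print(w_sum / trials)                        # Monte Carlo estimate of (E[w1], E[w0]); note g_bar(x) = E[w1] x + E[w0]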

Boosting

5.

Assume that linear regression (for classification) is used within AdaBoost. That is, we need to solve the following weighted-Ein optimization problem with u_n ≥ 0:

\[
\min_{w}\; E_{\mathrm{in}}^{u}(w) \;=\; \frac{1}{N}\sum_{n=1}^{N} u_n\bigl(y_n - w^{T}x_n\bigr)^{2}.
\]

The optimization problem above is equivalent to minimizing the usual Ein of linear regression on some “pseudo data” {(x̃_n, ỹ_n)}_{n=1}^N. Write down your pseudo data (x̃_n, ỹ_n) and prove your answer.

(Hint: There is more than one possible form of pseudo data)

6.

Consider applying the AdaBoost algorithm on a binary classification data set where 78% of the examples are positive. Because there are so many positive examples, the base algorithm within AdaBoost returns a constant classifier g_1(x) = +1 in the first iteration. Let u_+^{(2)} be the individual example weight of each positive example in the second iteration, and u_−^{(2)} be the individual example weight of each negative example in the second iteration. What is u_+^{(2)} / u_−^{(2)}? Prove your answer.

Kernel for Decision Stumps

When talking about non-uniform voting in aggregation, we mentioned that α can be viewed as a weight vector learned from any linear algorithm coupled with the following transform:

\[
\phi(x) = \bigl(g_1(x),\, g_2(x),\, \cdots,\, g_T(x)\bigr).
\]

When studying kernel methods, we mentioned that the kernel is simply a computational short-cut for the inner product (φ(x))^T φ(x′). In this problem, we mix the two topics together using decision stumps as our g_t(x).

7.

Assume that the input vectors contain only integers between −M and M (inclusive), and consider decision stumps of the form

\[
g_{s,i,\theta}(x) = s\cdot\mathrm{sign}\bigl(x_i - \theta\bigr),
\]

where i ∈ {1, 2, · · · , d}, d is the finite dimensionality of the input space, s ∈ {−1, +1}, θ ∈ R, and sign(0) = +1.

Two decision stumps g and ĝ are defined to be the same if g(x) = ĝ(x) for every x ∈ X. Two decision stumps are different if they are not the same. How many different decision stumps are there for the case of d = 2 and M = 5? Explain your answer.

8.

Continuing from the previous problem, let G = { all different decision stumps for X } and enumerate each hypothesis g ∈ G by some index t. Define

\[
\phi_{\mathrm{ds}}(x) = \bigl(g_1(x),\, g_2(x),\, \cdots,\, g_t(x),\, \cdots,\, g_{|G|}(x)\bigr).
\]

For any given (d, M), derive a simple equation that evaluates K_ds(x, x′) = (φ_ds(x))^T φ_ds(x′) efficiently and prove your answer.
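For small (d, M), both Question 7 and Question 8 can be sanity-checked by brute force: enumerate candidate stumps, identify the "different" ones by their output vectors over the finite input space X = {−M, ..., M}^d, and evaluate K_ds(x, x′) directly as the inner product of the resulting φ_ds vectors. The Python sketch below is such a check (the function names are ours, not part of the assignment), against which a derived closed form can be compared.

import numpy as np
from itertools import product

def all_stump_signatures(d, M):
    # enumerate all *different* decision stumps on X = {-M, ..., M}^d,
    # identifying each stump by its outputs on every point of X
    X = np.array(list(product(range(-M, M + 1), repeat=d)))
    thetas = np.append(np.arange(-M, M + 1) - 0.5, M + 0.5)      # one threshold per gap, plus below/above the range
    sigs = set()
    for s in (-1, 1):
        for i in range(d):
            for theta in thetas:
                out = s * np.where(X[:, i] - theta >= 0, 1, -1)  # sign(0) = +1 convention
                sigs.add(tuple(int(v) for v in out))
    return X, sorted(sigs)

def kernel_bruteforce(x, xp, X, sigs):
    # K_ds(x, x') computed directly as the inner product of phi_ds(x) and phi_ds(x')
    idx = {tuple(int(v) for v in row): k for k, row in enumerate(X)}
    a, b = idx[tuple(x)], idx[tuple(xp)]
    return sum(sig[a] * sig[b] for sig in sigs)

X, sigs = all_stump_signatures(d=2, M=5)
print(len(sigs))                                     # number of different stumps (Question 7)
print(kernel_bruteforce((1, -3), (2, 4), X, sigs))   # brute-force K_ds(x, x') for Question 8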


Experiments with Bagging Ridge Regression

First, write a program to implement the (linear) ridge regression algorithm for classification (i.e. use 0/1 error for evaluation). Consider the following data set.

hw2_lssvm_all.dat

Please do add x_0 = 1 to your data. Use the first 400 examples for training to get g and the remaining for testing. Calculate Ein and Eout with the 0/1 error. Consider λ ∈ {0.05, 0.5, 5, 50, 500}.
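A minimal Python sketch of this procedure, using the closed-form ridge solution w = (λI + X^T X)^{-1} X^T y, is given below; the assumed file format (one example per row, features followed by a ±1 label) should be double-checked against the actual data file.

import numpy as np

def load_data(path):
    # assumed format: each row holds the feature values followed by a label in {-1, +1}
    data = np.loadtxt(path)
    X = np.hstack([np.ones((len(data), 1)), data[:, :-1]])   # add x_0 = 1
    return X, data[:, -1]

def ridge_fit(X, y, lam):
    # closed-form ridge regression: w = (lambda I + X^T X)^{-1} X^T y
    return np.linalg.solve(lam * np.eye(X.shape[1]) + X.T @ X, X.T @ y)

def zero_one_error(w, X, y):
    # 0/1 error of the classifier sign(w^T x), with sign(0) taken as +1
    return np.mean(np.where(X @ w >= 0, 1, -1) != y)

X, y = load_data("hw2_lssvm_all.dat")
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]
for lam in [0.05, 0.5, 5, 50, 500]:
    w = ridge_fit(X_tr, y_tr, lam)
    print(lam, zero_one_error(w, X_tr, y_tr), zero_one_error(w, X_te, y_te))   # Ein(g), Eout(g)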

9.

(*) Among all λ, which λ results in the minimum Ein(g)? What is the corresponding Ein(g)?

10.

(*) Among all λ, which λ results in the minimum Eout(g)? What is the corresponding Eout(g)?

Next, write a program to implement bagging on top of ridge regression. Again consider the following data set

hw2_lssvm_all.dat

Please do add x_0 = 1 to your data. Use the first 400 examples for training and the remaining for testing.

Calculate Ein and Eout with the 0/1 error. Note that each ridge regression for classification should take the sign operation before uniform aggregation (with voting). Consider λ ∈ {0.05, 0.5, 5, 50, 500}. Use 400 bootstrapped examples in bagging and 250 iterations of bagging (i.e., 250 g_t's) to get G.
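A sketch of the bagging step on top of the same closed-form ridge solver (the data-format assumption is as before; the random seed and the tie-breaking toward +1 in the vote are arbitrary choices):

import numpy as np

def ridge_fit(X, y, lam):
    # closed-form ridge regression solution
    return np.linalg.solve(lam * np.eye(X.shape[1]) + X.T @ X, X.T @ y)

def bag_vote(X, models):
    # each g_t takes the sign of its ridge output; G aggregates the signs by uniform voting
    votes = sum(np.where(X @ w >= 0, 1, -1) for w in models)
    return np.where(votes >= 0, 1, -1)

data = np.loadtxt("hw2_lssvm_all.dat")                    # assumed format: features, then a +/-1 label
X = np.hstack([np.ones((len(data), 1)), data[:, :-1]])    # add x_0 = 1
y = data[:, -1]
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]

rng = np.random.default_rng(1126)
for lam in [0.05, 0.5, 5, 50, 500]:
    models = []
    for _ in range(250):                                  # 250 iterations of bagging
        idx = rng.integers(0, 400, size=400)              # 400 bootstrapped examples
        models.append(ridge_fit(X_tr[idx], y_tr[idx], lam))
    print(lam, np.mean(bag_vote(X_tr, models) != y_tr),   # Ein(G)
          np.mean(bag_vote(X_te, models) != y_te))        # Eout(G)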

11.

(*) Among all λ, which λ results in the minimum Ein(G)? What is the corresponding Ein(G)?

Compare your results with the one in Question 9 and describe your findings.

12.

(*) Among all λ, which λ results in the minimum Eout(G)? What is the corresponding Eout(G)?

Compare your results with the one in Question 10 and describe your findings.

Experiments with Adaptive Boosting

For Questions 13–16, implement the AdaBoost-Stump algorithm as introduced in Lecture 208. Run the algorithm on the following set for training:

hw2_adaboost_train.dat

and the following set for testing:

hw2_adaboost_test.dat

Use a total of T = 300 iterations (please do not stop earlier than 300), and calculate Ein and Eout with the 0/1 error.

For the decision stump algorithm, please implement the following steps. Any ties can be arbitrarily broken.

(1) For any feature i, sort all the x_{n,i} values to x_{[n],i} such that x_{[n],i} ≤ x_{[n+1],i}.

(2) Consider thresholds within −∞ and all the midpoints (x_{[n],i} + x_{[n+1],i})/2. Test those thresholds with s ∈ {−1, +1} to determine the best (s, θ) combination that minimizes Ein^u using feature i.

(3) Pick the best (s, i, θ) combination by enumerating over all possible i.

For those interested, step 2 can be carried out in O(N) time only!
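For reference, the sketch below implements steps (1)–(3) directly, using the straightforward search rather than the O(N) refinement, and wraps the stump in a plain AdaBoost loop with α_t = ln √((1 − ε_t)/ε_t). It is only one possible implementation; the data-loading line assumes one example per row with features followed by a ±1 label.

import numpy as np

def stump_predict(X, s, i, theta):
    # g_{s,i,theta}(x) = s * sign(x_i - theta), with sign(0) = +1
    return s * np.where(X[:, i] - theta >= 0, 1, -1)

def best_stump(X, y, u):
    # steps (1)-(3): pick the (s, i, theta) minimizing the u-weighted 0/1 error
    best_err, best_param = np.inf, None
    for i in range(X.shape[1]):
        xs = np.sort(X[:, i])                                          # step (1)
        thetas = np.concatenate(([-np.inf], (xs[:-1] + xs[1:]) / 2))   # step (2): -inf and all midpoints
        for theta in thetas:
            for s in (1, -1):
                err = np.sum(u[stump_predict(X, s, i, theta) != y])
                if err < best_err:
                    best_err, best_param = err, (s, i, theta)          # step (3): best over all i as well
    return best_param

def adaboost_stump(X, y, T=300):
    # plain AdaBoost on decision stumps; returns the chosen stumps and their alphas
    u = np.ones(len(y)) / len(y)
    stumps, alphas = [], []
    for _ in range(T):
        s, i, theta = best_stump(X, y, u)
        pred = stump_predict(X, s, i, theta)
        eps = np.sum(u[pred != y]) / np.sum(u)            # weighted error rate (assumed to stay in (0, 1/2))
        scale = np.sqrt((1 - eps) / eps)                  # re-scaling factor
        u = np.where(pred != y, u * scale, u / scale)     # re-weight the examples
        stumps.append((s, i, theta))
        alphas.append(np.log(scale))                      # alpha_t = ln(scale)
    return stumps, alphas

data = np.loadtxt("hw2_adaboost_train.dat")               # assumed format: features, then a +/-1 label
stumps, alphas = adaboost_stump(data[:, :-1], data[:, -1], T=300)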

13.

(*) Plot a figure for t versus Ein(g_t). Should Ein(g_t) be decreasing or increasing? Write down your observations and explanations. What is Ein(g_T)?

14.

(*) Plot a figure for t versus Ein(G_t), where G_t(x) = Σ_{τ=1}^{t} α_τ g_τ(x). That is, G = G_T. Should Ein(G_t) be decreasing or increasing? Write down your observations and explanations. What is Ein(G_T)?

15.

(*) Plot a figure for t versus U_t, where U_t = Σ_{n=1}^{N} u_n^{(t)}. Should U_t be decreasing or increasing? Write down your observations and explanations. What is U_T?

16.

(*) Plot a figure for t versus Eout(G_t) estimated with the test set. Should Eout(G_t) be decreasing or increasing? Write down your observations and explanations. What is Eout(G_T)?


Bonus: Power of Adaptive Boosting

In this part, we will prove that AdaBoost can reach Ein(G_T) = 0 if T is large enough and every hypothesis g_t satisfies ε_t ≤ ε < 1/2. Let U_t be defined as in Question 15. It can be proved (see Lecture 211 of Machine Learning Techniques) that

\[
U_{t+1} \;=\; \frac{1}{N}\sum_{n=1}^{N}\exp\Bigl(-y_n\sum_{\tau=1}^{t}\alpha_\tau g_\tau(x_n)\Bigr),
\]

and that Ein(G_T) ≤ U_{T+1}.

17.

Prove that U_1 = 1 and U_{t+1} = U_t · 2√(ε_t(1 − ε_t)) ≤ U_t · 2√(ε(1 − ε)).

18.

Using the fact that √(ε(1 − ε)) ≤ (1/2) exp(−2(1/2 − ε)²) for ε < 1/2, argue that after T = O(log N) iterations, Ein(G_T) = 0.
