Machine Learning Techniques (NTU, Spring 2017) instructor: Hsuan-Tien Lin

Homework #3

RELEASE DATE: 05/02/2017 DUE DATE: 05/23/2017, BEFORE 14:00

QUESTIONS ABOUT HOMEWORK MATERIALS ARE WELCOMED ON THE FACEBOOK FORUM.

Unless granted by the instructor in advance, you must turn in a printed/written copy of your solutions (without the source code) for all problems.

For problems marked with (*), please follow the guidelines on the course website and upload your source code to designated places. You are encouraged to (but not required to) include a README to help the TAs check your source code. Any programming language/platform is allowed.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for such misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English or Chinese with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

This homework set comes with 160 points and 40 bonus points. In general, every home- work set would come with a full credit of 160 points, with some possible bonus points.

Boosting

1.

Assume that linear regression (for classification) is used within AdaBoost. That is, we need to solve the weighted-Ein optimization problem.

$$\min_{w}\; E_{in}^{u}(w) = \frac{1}{N} \sum_{n=1}^{N} u_n \left(y_n - w^T x_n\right)^2.$$

The optimization problem above is equivalent to minimizing the usual $E_{in}$ of linear regression on some "pseudo data" $\{(\tilde{x}_n, \tilde{y}_n)\}_{n=1}^{N}$. Write down your pseudo data $(\tilde{x}_n, \tilde{y}_n)$ and prove your answer.

(Hint: There is more than one possible form of pseudo data)
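For readers who want to experiment before deriving the pseudo data, here is a minimal numpy sketch that solves the weighted problem above directly through its normal equations; the variable names (`X`, `y`, `u`) and the use of `np.linalg.lstsq` are illustrative assumptions, and the sketch deliberately does not reveal the pseudo-data form the question asks for.

```python
import numpy as np

def weighted_linear_regression(X, y, u):
    """Minimize E_in^u(w) = (1/N) * sum_n u_n * (y_n - w^T x_n)^2.

    Solves the normal equations X^T U X w = X^T U y with U = diag(u).
    X is N x d, y has length N, u holds the (positive) example weights.
    """
    U = np.diag(u)
    # lstsq handles a possibly singular system more robustly than a plain inverse.
    w, *_ = np.linalg.lstsq(X.T @ U @ X, X.T @ U @ y, rcond=None)
    return w
```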

2.

Consider applying the AdaBoost algorithm on a binary classification data set where 99% of the examples are positive. Because there are so many positive examples, the base algorithm within AdaBoost returns a constant classifier $g_1(x) = +1$ in the first iteration. Let $u_+^{(2)}$ be the individual example weight of each positive example in the second iteration, and $u_-^{(2)}$ be the individual example weight of each negative example in the second iteration. What is $u_+^{(2)} / u_-^{(2)}$? Prove your answer.

Kernel for Decision Stumps

When talking about non-uniform voting in aggregation, we mentioned that α can be viewed as a weight vector learned from any linear algorithm coupled with the following transform:

$$\phi(x) = \big(g_1(x), g_2(x), \cdots, g_T(x)\big).$$


When studying kernel methods, we mentioned that the kernel is simply a computational short-cut for the inner product $(\phi(x))^T \phi(x')$. In this problem, we mix the two topics together using the decision stumps as our $g_t(x)$.
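To make the combination concrete, here is a small sketch (with a hypothetical list `gs` of hypothesis functions) that builds the transform $\phi$ explicitly and evaluates the kernel as a plain inner product; Question 4 below asks you to replace exactly this brute-force inner product with a simple closed-form equation.

```python
import numpy as np

def phi(x, gs):
    """The aggregation transform (g_1(x), g_2(x), ..., g_T(x)) for a list of hypotheses gs."""
    return np.array([g(x) for g in gs])

def kernel(x, x_prime, gs):
    """Brute-force kernel: the explicit inner product (phi(x))^T phi(x')."""
    return phi(x, gs) @ phi(x_prime, gs)
```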

3.

Assume that the input vectors contain only integer components, each between L and R (inclusive). Consider decision stumps of the form

$$g_{s,i,\theta}(x) = s \cdot \mathrm{sign}(x_i - \theta),$$

where $i \in \{1, 2, \cdots, d\}$, $d$ is the finite dimensionality of the input space, $s \in \{-1, +1\}$, $\theta \in \mathbb{R}$, and $\mathrm{sign}(0) = +1$.

Two decision stumps $g$ and $\hat{g}$ are defined as the same if $g(x) = \hat{g}(x)$ for every $x \in \mathcal{X}$. Two decision stumps are different if they are not the same. How many different decision stumps are there for the case of $d = 2$, $L = 1$, and $R = 6$? Explain your answer.

4.

Continuing from the previous question, let G = { all different decision stumps for X } and enumerate each hypothesis g ∈ G by some index t. Define

$$\phi_{ds}(x) = \big(g_1(x), g_2(x), \cdots, g_t(x), \cdots, g_{|G|}(x)\big).$$

Derive a simple equation that evaluates $K_{ds}(x, x') = (\phi_{ds}(x))^T (\phi_{ds}(x'))$ efficiently and prove your answer.

We would give full credit if your solution works for the specific (d, L, R) given by Question 3, and we would give 10 bonus points if your solution works for general (d, L, R). Besides, another 10 bonus points will be awarded if your solution works with the “non-integer” input vectors.

Decision Tree

Impurity functions play an important role in decision tree branching. For binary classification problems, let $\mu_+$ be the fraction of positive examples in a data subset, and $\mu_- = 1 - \mu_+$ be the fraction of negative examples in the data subset.

5.

The Gini index is $1 - \mu_+^2 - \mu_-^2$. What is the maximum value of the Gini index among all $\mu_+ \in [0, 1]$?

Prove your answer.

6.

Following Question 5, there are four possible impurity functions below. We can normalize each impurity function by dividing it with its maximum value among all $\mu_+ \in [0, 1]$. For instance, the classification error is simply $\min(\mu_+, \mu_-)$ and its maximum value is $0.5$. So the normalized classification error is $2\min(\mu_+, \mu_-)$. After normalization, which of the following impurity functions is equivalent to the normalized Gini index? Prove your answer.

[a] the classification error $\min(\mu_+, \mu_-)$.

[b] the squared regression error (used for branching in classification data sets), which is by definition $\mu_+\big(1 - (\mu_+ - \mu_-)\big)^2 + \mu_-\big(-1 - (\mu_+ - \mu_-)\big)^2$.

[c] the entropy, which is $-\mu_+ \ln \mu_+ - \mu_- \ln \mu_-$, with $0 \ln 0 \equiv 0$.

[d] the closeness, which is $1 - |\mu_+ - \mu_-|$.

[e] none of the other choices
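If you would like to sanity-check a candidate answer numerically before writing the proof, the short sketch below evaluates each impurity function and its normalized version on a grid of $\mu_+$ values; the variable names are ad-hoc, and a numerical match is of course not a substitute for the required proof.

```python
import numpy as np

mu_plus = np.linspace(0.0, 1.0, 101)
mu_minus = 1.0 - mu_plus

def normalize(values):
    """Divide an impurity curve by its maximum over mu_plus in [0, 1]."""
    return values / values.max()

gini = 1.0 - mu_plus**2 - mu_minus**2
classification_error = np.minimum(mu_plus, mu_minus)
squared_error = (mu_plus * (1.0 - (mu_plus - mu_minus))**2
                 + mu_minus * (-1.0 - (mu_plus - mu_minus))**2)
# Entropy with the convention 0 ln 0 = 0.
with np.errstate(divide="ignore", invalid="ignore"):
    entropy = -(np.where(mu_plus > 0, mu_plus * np.log(mu_plus), 0.0)
                + np.where(mu_minus > 0, mu_minus * np.log(mu_minus), 0.0))
closeness = 1.0 - np.abs(mu_plus - mu_minus)

for name, curve in [("classification error", classification_error),
                    ("squared error", squared_error),
                    ("entropy", entropy),
                    ("closeness", closeness)]:
    gap = np.max(np.abs(normalize(curve) - normalize(gini)))
    print(f"{name}: max |difference| from normalized Gini = {gap:.4f}")
```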

Experiments with Adaptive Boosting

For Questions 7–13, implement the AdaBoost-Stump algorithm as introduced in Lecture 208. Run the algorithm on the following set for training:

hw3_train.dat

and the following set for testing:

hw3_test.dat

Use a total of T = 300 iterations (please do not stop earlier than 300), and calculate Ein and Eout with the 0/1 error.

For the decision stump algorithm, please implement the following steps. Any ties can be arbitrarily broken.


(1) For any feature $i$, sort all the $x_{n,i}$ values to $x_{[n],i}$ such that $x_{[n],i} \le x_{[n+1],i}$.

(2) Consider thresholds within $-\infty$ and all the midpoints $\frac{x_{[n],i} + x_{[n+1],i}}{2}$. Test those thresholds with $s \in \{-1, +1\}$ to determine the best $(s, \theta)$ combination that minimizes $E_{in}^{u}$ using feature $i$.

(3) Pick the best $(s, i, \theta)$ combination by enumerating over all possible $i$.

For those interested, step (2) can be carried out in $O(N)$ time only!!
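Below is a minimal numpy sketch of the whole pipeline: the three decision-stump steps above (via the straightforward search rather than the $O(N)$ trick) wrapped in the AdaBoost re-weighting rule from lecture. The file names follow the problem statement, while the data format (label in the last column), the function names, and arbitrary tie-breaking are assumptions.

```python
import numpy as np

def load_data(path):
    """Each row: d input values followed by the +/-1 label (an assumption)."""
    data = np.loadtxt(path)
    return data[:, :-1], data[:, -1]

def decision_stump(X, y, u):
    """Steps (1)-(3): return the (s, i, theta) minimizing the weighted 0/1 error."""
    N, d = X.shape
    best = (1.0, None)                       # (weighted error, (s, i, theta))
    for i in range(d):
        xs = np.sort(X[:, i])
        thresholds = np.concatenate(([-np.inf], (xs[:-1] + xs[1:]) / 2.0))
        for theta in thresholds:
            for s in (+1, -1):
                pred = s * np.where(X[:, i] - theta >= 0, 1.0, -1.0)  # sign(0) = +1
                err = np.sum(u * (pred != y)) / np.sum(u)
                if err < best[0]:
                    best = (err, (s, i, theta))
    return best[1], best[0]

def adaboost_stump(X, y, T=300):
    """AdaBoost with decision stumps; returns the stumps, their alphas, and the final weights."""
    N = X.shape[0]
    u = np.full(N, 1.0 / N)
    stumps, alphas = [], []
    for t in range(T):
        (s, i, theta), eps = decision_stump(X, y, u)
        scale = np.sqrt((1.0 - eps) / eps)   # re-weighting factor from lecture (assumes eps > 0)
        pred = s * np.where(X[:, i] - theta >= 0, 1.0, -1.0)
        u = np.where(pred != y, u * scale, u / scale)
        stumps.append((s, i, theta))
        alphas.append(np.log(scale))
    return stumps, alphas, u

def predict(X, stumps, alphas):
    """G(x) = sign(sum_t alpha_t g_t(x))."""
    total = np.zeros(X.shape[0])
    for (s, i, theta), a in zip(stumps, alphas):
        total += a * s * np.where(X[:, i] - theta >= 0, 1.0, -1.0)
    return np.where(total >= 0, 1.0, -1.0)

if __name__ == "__main__":
    X_train, y_train = load_data("hw3_train.dat")
    X_test, y_test = load_data("hw3_test.dat")
    stumps, alphas, _ = adaboost_stump(X_train, y_train, T=300)
    print("Ein(G) =", np.mean(predict(X_train, stumps, alphas) != y_train))
    print("Eout(G) =", np.mean(predict(X_test, stumps, alphas) != y_test))
```

Recording $E_{in}(g_t)$, $E_{in}(G_t)$, $U_t$, and $\epsilon_t$ inside the training loop gives the quantities plotted in Questions 7-13.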

7.

(*) Plot a figure for t versus Ein(gt). What is Ein(g1) and what is α1?

8.

From the figure in the previous question, should Ein(gt) be decreasing or increasing? Write down your observations and explanations.

9.

(*) Plot a figure for $t$ versus $E_{in}(G_t)$, where $G_t(x) = \mathrm{sign}\big(\sum_{\tau=1}^{t} \alpha_\tau g_\tau(x)\big)$. That is, $G = G_T$. What is $E_{in}(G)$?

10.

(*) Plot a figure for $t$ versus $U_t$, where $U_t = \sum_{n=1}^{N} u_n^{(t)}$. What is $U_2$ and what is $U_T$?

11.

(*) Plot a figure for $t$ versus $\epsilon_t$. What is the minimum value of $\epsilon_t$?

12.

(*) Plot a figure for t versus Eout(gt) estimated with the test set. What is Eout(g1)?

13.

(*) Plot a figure for t versus Eout(Gt) estimated with the test set. What is Eout(G)?

Experiments with Unpruned Decision Tree

Implement the simple C&RT algorithm without pruning, using the Gini index as the impurity measure, as introduced in class. For the decision stump used in branching, if you are branching with feature $i$ and direction $s$, please sort all the $x_{n,i}$ values to form (at most) $N + 1$ segments of equivalent $\theta$, and then pick $\theta$ as the median of each segment.

Run the algorithm on the following set for training:

hw3_train.dat

and the following set for testing:

hw3_test.dat
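A compact sketch of the unpruned C&RT recursion under the branching rule above: the Gini-weighted split score follows the lecture, while the dictionary node representation and helper names are assumptions, and only interior thresholds (midpoints between consecutive distinct values, as representatives of the equivalent-$\theta$ segments) are tried, since the two outer segments never reduce impurity.

```python
import numpy as np

def gini(y):
    """Gini impurity 1 - mu_plus^2 - mu_minus^2 of a label vector."""
    if len(y) == 0:
        return 0.0
    mu_plus = np.mean(y == 1)
    return 1.0 - mu_plus**2 - (1.0 - mu_plus)**2

def best_branch(X, y):
    """Pick (i, theta) minimizing |D_left| * gini(D_left) + |D_right| * gini(D_right)."""
    best = (np.inf, None)
    for i in range(X.shape[1]):
        xs = np.unique(X[:, i])                       # sorted distinct values
        for theta in (xs[:-1] + xs[1:]) / 2.0:        # one representative per interior segment
            left = X[:, i] < theta
            score = np.sum(left) * gini(y[left]) + np.sum(~left) * gini(y[~left])
            if score < best[0]:
                best = (score, (i, theta))
    return best[1]

def build_tree(X, y):
    """Unpruned C&RT: branch until the node is pure or all inputs are identical."""
    if gini(y) == 0.0 or np.all(X == X[0]):
        return {"leaf": True, "label": 1.0 if np.sum(y == 1) >= np.sum(y == -1) else -1.0}
    i, theta = best_branch(X, y)
    left = X[:, i] < theta
    return {"leaf": False, "feature": i, "theta": theta,
            "left": build_tree(X[left], y[left]),
            "right": build_tree(X[~left], y[~left])}

def tree_predict(tree, x):
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] < tree["theta"] else tree["right"]
    return tree["label"]
```

Applying `tree_predict` to every example of hw3_train.dat and hw3_test.dat with the 0/1 error then gives the quantities asked for in Questions 14-15; drawing or printing the tree for Question 14 can be done by walking the nested dictionaries.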

14.

(*) Draw the resulting tree (by program or by hand, in any way easily understandable by the TAs).

15.

(*) Continuing from the previous problem, what are Ein and Eout (evaluated with the 0/1 error) of the tree?

16.

(*) Try pruning each leaf of the tree above. What is the lowest Ein that you can get from pruning one leaf? What is the corresponding Eout?

Power of Adaptive Boosting

In this problem, we will prove that AdaBoost can reach $E_{in}(G_T) = 0$ if $T$ is large enough and every hypothesis $g_t$ satisfies $\epsilon_t \le \epsilon < \frac{1}{2}$. Let $U_t$ be defined as in Question 10. It can be proved (see Lecture 11 of Machine Learning Techniques) that

$$U_{t+1} = \frac{1}{N} \sum_{n=1}^{N} \exp\!\left(-y_n \sum_{\tau=1}^{t} \alpha_\tau g_\tau(x_n)\right),$$

and $E_{in}(G_T) \le U_{T+1}$.

17.

(Bonus, 20 points) Prove that $U_1 = 1$ and $U_{t+1} = U_t \cdot 2\sqrt{\epsilon_t(1-\epsilon_t)} \le U_t \cdot 2\sqrt{\epsilon(1-\epsilon)}$.

18.

(Bonus, 20 points) Using the fact that $\sqrt{\epsilon(1-\epsilon)} \le \frac{1}{2}\exp\!\left(-2\left(\frac{1}{2}-\epsilon\right)^2\right)$ for $\epsilon < \frac{1}{2}$, argue that after $T = O(\log N)$ iterations, $E_{in}(G_T) = 0$.
