Machine Learning Techniques (NTU, Spring 2017) instructor: Hsuan-Tien Lin

Homework #3

RELEASE DATE: 05/02/2017 DUE DATE: 05/23/2017, BEFORE 14:00

QUESTIONS ABOUT HOMEWORK MATERIALS ARE WELCOMED ON THE FACEBOOK FORUM.

Unless granted by the instructor in advance, you must turn in a printed/written copy of your solutions (without the source code) for all problems.

For problems marked with (*), please follow the guidelines on the course website and upload your source code to designated places. You are encouraged to (but not required to) include a README to help the TAs check your source code. Any programming language/platform is allowed.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for such misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English or Chinese with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

This homework set comes with 160 points and 40 bonus points. In general, every home- work set would come with a full credit of 160 points, with some possible bonus points.

Boosting

1.

Assume that linear regression (for classification) is used within AdaBoost. That is, we need to solve the weighted-Ein optimization problem.

$$\min_{w}\; E_{in}^{u}(w) = \frac{1}{N} \sum_{n=1}^{N} u_n \left(y_n - w^T x_n\right)^2.$$

The optimization problem above is equivalent to minimizing the usual $E_{in}$ of linear regression on some "pseudo data" $\{(\tilde{x}_n, \tilde{y}_n)\}_{n=1}^{N}$. Write down your pseudo data $(\tilde{x}_n, \tilde{y}_n)$ and prove your answer.

(Hint: There is more than one possible form of pseudo data)
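For readers who want to experiment before deriving the pseudo data, here is a minimal numpy sketch that solves the weighted problem above directly through its normal equations; the variable names (`X`, `y`, `u`) and the use of `np.linalg.lstsq` are illustrative assumptions, and the sketch deliberately does not reveal the pseudo-data form the question asks for.

```python
import numpy as np

def weighted_linear_regression(X, y, u):
    """Minimize E_in^u(w) = (1/N) * sum_n u_n * (y_n - w^T x_n)^2.

    Solves the normal equations X^T U X w = X^T U y with U = diag(u).
    X is N x d, y has length N, u holds the (positive) example weights.
    """
    U = np.diag(u)
    # lstsq handles a possibly singular system more robustly than a plain inverse.
    w, *_ = np.linalg.lstsq(X.T @ U @ X, X.T @ U @ y, rcond=None)
    return w
```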

2.

Consider applying the AdaBoost algorithm on a binary classification data set where 99% of the examples are positive. Because there are so many positive examples, the base algorithm within AdaBoost returns a constant classifier $g_1(x) = +1$ in the first iteration. Let $u_+^{(2)}$ be the individual example weight of each positive example in the second iteration, and $u_-^{(2)}$ be the individual example weight of each negative example in the second iteration. What is $u_+^{(2)} / u_-^{(2)}$? Prove your answer.

Kernel for Decision Stumps

When talking about non-uniform voting in aggregation, we mentioned that α can be viewed as a weight vector learned from any linear algorithm coupled with the following transform:

$$\phi(x) = \big(g_1(x), g_2(x), \cdots, g_T(x)\big).$$


When studying kernel methods, we mentioned that the kernel is simply a computational short-cut for the inner product $(\phi(x))^T \phi(x')$. In this problem, we mix the two topics together using the decision stumps as our $g_t(x)$.
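To make the combination concrete, here is a small sketch (with a hypothetical list `gs` of hypothesis functions) that builds the transform $\phi$ explicitly and evaluates the kernel as a plain inner product; Question 4 below asks you to replace exactly this brute-force inner product with a simple closed-form equation.

```python
import numpy as np

def phi(x, gs):
    """The aggregation transform (g_1(x), g_2(x), ..., g_T(x)) for a list of hypotheses gs."""
    return np.array([g(x) for g in gs])

def kernel(x, x_prime, gs):
    """Brute-force kernel: the explicit inner product (phi(x))^T phi(x')."""
    return phi(x, gs) @ phi(x_prime, gs)
```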

3.

Assume that the input vectors contain only integer components, each between L and R (inclusive). Consider decision stumps of the form

$$g_{s,i,\theta}(x) = s \cdot \mathrm{sign}(x_i - \theta),$$

where $i \in \{1, 2, \cdots, d\}$, $d$ is the finite dimensionality of the input space, $s \in \{-1, +1\}$, $\theta \in \mathbb{R}$, and $\mathrm{sign}(0) = +1$.

Two decision stumps $g$ and $\hat{g}$ are defined as the same if $g(x) = \hat{g}(x)$ for every $x \in \mathcal{X}$. Two decision stumps are different if they are not the same. How many different decision stumps are there for the case of $d = 2$, $L = 1$, and $R = 6$? Explain your answer.

4.

Continuing from the previous question, let G = { all different decision stumps for X } and enumerate each hypothesis g ∈ G by some index t. Define

$$\phi_{ds}(x) = \big(g_1(x), g_2(x), \cdots, g_t(x), \cdots, g_{|G|}(x)\big).$$

Derive a simple equation that evaluates $K_{ds}(x, x') = (\phi_{ds}(x))^T (\phi_{ds}(x'))$ efficiently and prove your answer.

We would give full credit if your solution works for the specific (d, L, R) given by Question 3, and we would give 10 bonus points if your solution works for general (d, L, R). Besides, another 10 bonus points will be awarded if your solution works with the “non-integer” input vectors.

Decision Tree

Impurity functions play an important role in decision tree branching. For binary classification problems, let $\mu_+$ be the fraction of positive examples in a data subset, and $\mu_- = 1 - \mu_+$ be the fraction of negative examples in the data subset.

5.

The Gini index is $1 - \mu_+^2 - \mu_-^2$. What is the maximum value of the Gini index among all $\mu_+ \in [0, 1]$?

Prove your answer.

6.

Following Question 5, there are four possible impurity functions below. We can normalize each impurity function by dividing it with its maximum value among all $\mu_+ \in [0, 1]$. For instance, the classification error is simply $\min(\mu_+, \mu_-)$ and its maximum value is $0.5$. So the normalized classification error is $2\min(\mu_+, \mu_-)$. After normalization, which of the following impurity functions is equivalent to the normalized Gini index? Prove your answer.

[a] the classification error $\min(\mu_+, \mu_-)$.

[b] the squared regression error (used for branching in classification data sets), which is by definition $\mu_+\big(1 - (\mu_+ - \mu_-)\big)^2 + \mu_-\big(-1 - (\mu_+ - \mu_-)\big)^2$.

[c] the entropy, which is $-\mu_+ \ln \mu_+ - \mu_- \ln \mu_-$, with $0 \ln 0 \equiv 0$.

[d] the closeness, which is $1 - |\mu_+ - \mu_-|$.

[e] none of the other choices
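If you would like to sanity-check a candidate answer numerically before writing the proof, the short sketch below evaluates each impurity function and its normalized version on a grid of $\mu_+$ values; the variable names are ad-hoc, and a numerical match is of course not a substitute for the required proof.

```python
import numpy as np

mu_plus = np.linspace(0.0, 1.0, 101)
mu_minus = 1.0 - mu_plus

def normalize(values):
    """Divide an impurity curve by its maximum over mu_plus in [0, 1]."""
    return values / values.max()

gini = 1.0 - mu_plus**2 - mu_minus**2
classification_error = np.minimum(mu_plus, mu_minus)
squared_error = (mu_plus * (1.0 - (mu_plus - mu_minus))**2
                 + mu_minus * (-1.0 - (mu_plus - mu_minus))**2)
# Entropy with the convention 0 ln 0 = 0.
with np.errstate(divide="ignore", invalid="ignore"):
    entropy = -(np.where(mu_plus > 0, mu_plus * np.log(mu_plus), 0.0)
                + np.where(mu_minus > 0, mu_minus * np.log(mu_minus), 0.0))
closeness = 1.0 - np.abs(mu_plus - mu_minus)

for name, curve in [("classification error", classification_error),
                    ("squared error", squared_error),
                    ("entropy", entropy),
                    ("closeness", closeness)]:
    gap = np.max(np.abs(normalize(curve) - normalize(gini)))
    print(f"{name}: max |difference| from normalized Gini = {gap:.4f}")
```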

Experiments with Adaptive Boosting

For Questions 7–13, implement the AdaBoost-Stump algorithm as introduced in Lecture 208. Run the algorithm on the following set for training:

hw3_train.dat

and the following set for testing:

hw3_test.dat

Use a total of T = 300 iterations (please do not stop earlier than 300), and calculate Ein and Eout with the 0/1 error.

For the decision stump algorithm, please implement the following steps. Any ties can be arbitrarily broken.


(1) For any feature $i$, sort all the $x_{n,i}$ values to $x_{[n],i}$ such that $x_{[n],i} \le x_{[n+1],i}$.

(2) Consider thresholds within $-\infty$ and all the midpoints $\frac{x_{[n],i} + x_{[n+1],i}}{2}$. Test those thresholds with $s \in \{-1, +1\}$ to determine the best $(s, \theta)$ combination that minimizes $E_{in}^{u}$ using feature $i$.

(3) Pick the best $(s, i, \theta)$ combination by enumerating over all possible $i$.

For those interested, step (2) can be carried out in $O(N)$ time only!!
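Below is a minimal numpy sketch of the whole pipeline: the three decision-stump steps above (via the straightforward search rather than the $O(N)$ trick) wrapped in the AdaBoost re-weighting rule from lecture. The file names follow the problem statement, while the data format (label in the last column), the function names, and arbitrary tie-breaking are assumptions.

```python
import numpy as np

def load_data(path):
    """Each row: d input values followed by the +/-1 label (an assumption)."""
    data = np.loadtxt(path)
    return data[:, :-1], data[:, -1]

def decision_stump(X, y, u):
    """Steps (1)-(3): return the (s, i, theta) minimizing the weighted 0/1 error."""
    N, d = X.shape
    best = (1.0, None)                       # (weighted error, (s, i, theta))
    for i in range(d):
        xs = np.sort(X[:, i])
        thresholds = np.concatenate(([-np.inf], (xs[:-1] + xs[1:]) / 2.0))
        for theta in thresholds:
            for s in (+1, -1):
                pred = s * np.where(X[:, i] - theta >= 0, 1.0, -1.0)  # sign(0) = +1
                err = np.sum(u * (pred != y)) / np.sum(u)
                if err < best[0]:
                    best = (err, (s, i, theta))
    return best[1], best[0]

def adaboost_stump(X, y, T=300):
    """AdaBoost with decision stumps; returns the stumps, their alphas, and the final weights."""
    N = X.shape[0]
    u = np.full(N, 1.0 / N)
    stumps, alphas = [], []
    for t in range(T):
        (s, i, theta), eps = decision_stump(X, y, u)
        scale = np.sqrt((1.0 - eps) / eps)   # re-weighting factor from lecture (assumes eps > 0)
        pred = s * np.where(X[:, i] - theta >= 0, 1.0, -1.0)
        u = np.where(pred != y, u * scale, u / scale)
        stumps.append((s, i, theta))
        alphas.append(np.log(scale))
    return stumps, alphas, u

def predict(X, stumps, alphas):
    """G(x) = sign(sum_t alpha_t g_t(x))."""
    total = np.zeros(X.shape[0])
    for (s, i, theta), a in zip(stumps, alphas):
        total += a * s * np.where(X[:, i] - theta >= 0, 1.0, -1.0)
    return np.where(total >= 0, 1.0, -1.0)

if __name__ == "__main__":
    X_train, y_train = load_data("hw3_train.dat")
    X_test, y_test = load_data("hw3_test.dat")
    stumps, alphas, _ = adaboost_stump(X_train, y_train, T=300)
    print("Ein(G) =", np.mean(predict(X_train, stumps, alphas) != y_train))
    print("Eout(G) =", np.mean(predict(X_test, stumps, alphas) != y_test))
```

Recording $E_{in}(g_t)$, $E_{in}(G_t)$, $U_t$, and $\epsilon_t$ inside the training loop gives the quantities plotted in Questions 7-13.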

7.

(*) Plot a figure for t versus Ein(gt). What is Ein(g1) and what is α1?

8.

From the figure in the previous question, should Ein(gt) be decreasing or increasing? Write down your observations and explanations.

9.

(*) Plot a figure for $t$ versus $E_{in}(G_t)$, where $G_t(x) = \mathrm{sign}\big(\sum_{\tau=1}^{t} \alpha_\tau g_\tau(x)\big)$. That is, $G = G_T$. What is $E_{in}(G)$?

10.

(*) Plot a figure for $t$ versus $U_t$, where $U_t = \sum_{n=1}^{N} u_n^{(t)}$. What is $U_2$ and what is $U_T$?

11.

(*) Plot a figure for $t$ versus $\epsilon_t$. What is the minimum value of $\epsilon_t$?

12.

(*) Plot a figure for t versus Eout(gt) estimated with the test set. What is Eout(g1)?

13.

(*) Plot a figure for t versus Eout(Gt) estimated with the test set. What is Eout(G)?

Experiments with Unpruned Decision Tree

Implement the simple C&RT algorithm without pruning, using the Gini index as the impurity measure, as introduced in class. For the decision stump used in branching, if you are branching with feature $i$ and direction $s$, please sort all the $x_{n,i}$ values to form (at most) $N + 1$ segments of equivalent $\theta$, and then pick $\theta$ as the median of each segment.

Run the algorithm on the following set for training:

hw3_train.dat

and the following set for testing:

hw3_test.dat
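A compact sketch of the unpruned C&RT recursion under the branching rule above: the Gini-weighted split score follows the lecture, while the dictionary node representation and helper names are assumptions, and only interior thresholds (midpoints between consecutive distinct values, as representatives of the equivalent-$\theta$ segments) are tried, since the two outer segments never reduce impurity.

```python
import numpy as np

def gini(y):
    """Gini impurity 1 - mu_plus^2 - mu_minus^2 of a label vector."""
    if len(y) == 0:
        return 0.0
    mu_plus = np.mean(y == 1)
    return 1.0 - mu_plus**2 - (1.0 - mu_plus)**2

def best_branch(X, y):
    """Pick (i, theta) minimizing |D_left| * gini(D_left) + |D_right| * gini(D_right)."""
    best = (np.inf, None)
    for i in range(X.shape[1]):
        xs = np.unique(X[:, i])                       # sorted distinct values
        for theta in (xs[:-1] + xs[1:]) / 2.0:        # one representative per interior segment
            left = X[:, i] < theta
            score = np.sum(left) * gini(y[left]) + np.sum(~left) * gini(y[~left])
            if score < best[0]:
                best = (score, (i, theta))
    return best[1]

def build_tree(X, y):
    """Unpruned C&RT: branch until the node is pure or all inputs are identical."""
    if gini(y) == 0.0 or np.all(X == X[0]):
        return {"leaf": True, "label": 1.0 if np.sum(y == 1) >= np.sum(y == -1) else -1.0}
    i, theta = best_branch(X, y)
    left = X[:, i] < theta
    return {"leaf": False, "feature": i, "theta": theta,
            "left": build_tree(X[left], y[left]),
            "right": build_tree(X[~left], y[~left])}

def tree_predict(tree, x):
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] < tree["theta"] else tree["right"]
    return tree["label"]
```

Applying `tree_predict` to every example of hw3_train.dat and hw3_test.dat with the 0/1 error then gives the quantities asked for in Questions 14-15; drawing or printing the tree for Question 14 can be done by walking the nested dictionaries.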

14.

(*) Draw the resulting tree (by program or by hand, in any way easily understandable by the TAs).

15.

(*) Continuing from the previous problem, what are Ein and Eout (evaluated with the 0/1 error) of the tree?

16.

(*) Try pruning each leaf of the tree above. What is the lowest Ein that you can get from pruning one leaf? What is the corresponding Eout?

Power of Adaptive Boosting

In this problem, we will prove that AdaBoost can reach $E_{in}(G_T) = 0$ if $T$ is large enough and every hypothesis $g_t$ satisfies $\epsilon_t \le \epsilon < \frac{1}{2}$. Let $U_t$ be defined as in Question 10. It can be proved (see Lecture 11 of Machine Learning Techniques) that

$$U_{t+1} = \frac{1}{N} \sum_{n=1}^{N} \exp\!\left(-y_n \sum_{\tau=1}^{t} \alpha_\tau g_\tau(x_n)\right),$$

and $E_{in}(G_T) \le U_{T+1}$.

17.

(Bonus, 20 points) Prove that $U_1 = 1$ and $U_{t+1} = U_t \cdot 2\sqrt{\epsilon_t(1-\epsilon_t)} \le U_t \cdot 2\sqrt{\epsilon(1-\epsilon)}$.

18.

(Bonus, 20 points) Using the fact that $\sqrt{\epsilon(1-\epsilon)} \le \frac{1}{2}\exp\!\left(-2\left(\frac{1}{2}-\epsilon\right)^2\right)$ for $\epsilon < \frac{1}{2}$, argue that after $T = O(\log N)$ iterations, $E_{in}(G_T) = 0$.
