
Machine Learning (NTU, Fall 2015) instructor: Hsuan-Tien Lin

Homework #2

RELEASE DATE: 10/15/2015

DUE DATE: 11/02/2015 (MONDAY), BEFORE NOON

QUESTIONS ABOUT HOMEWORK MATERIALS ARE WELCOMED ON THE COURSERA FORUM.

Unless granted by the instructor in advance, you must turn in a printed/written copy of your solutions (without the source code) for all problems.

For problems marked with (*), please follow the guidelines on the course website and upload your source code to designated places. You are encouraged to (but not required to) include a README to help the TAs check your source code. Any programming language/platform is allowed.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for such misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English or Chinese with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

This homework set comes with 200 points and 20 bonus points. In general, every homework set would come with a full credit of 200 points, with some possible bonus points.

Questions 1-2 are about noisy targets

1.

Consider the bin model for a hypothesis h that makes an error with probability µ in approximating a deterministic target function f (both h and f output values in {−1, +1}). If we use the same h to approximate a noisy version of f given by

P(x, y) = P(x) P(y|x)

P(y|x) = λ if y = f(x), and 1 − λ otherwise.

What is the probability of error that h makes in approximating the noisy target y? Please provide an explanation of your answer.

2.

Following Question 1, with what value of λ will the performance of h be independent of µ? Please provide an explanation of your answer.

Questions 3-5 are about generalization error, and getting the feel of the bounds numerically.

Please use the simple upper bound N^dvc on the growth function mH(N), assuming that N ≥ 2 and dvc ≥ 2.

3.

For an H with dvc = 10, if you want 95% confidence that your generalization error is at most 0.05, what is the sample size that the VC generalization bound predicts? Please provide the calculation steps of your answer, and round your answer to the closest thousand (that is, your answer should be something like 845000).
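For reference, the implied sample size can be found numerically by iterating the VC bound N = (8/ε²) ln(4 (2N)^dvc / δ) to a fixed point. The sketch below only illustrates that approach (written in Python with an illustrative function name); it is not a required part of the calculation.

```python
import math

def vc_sample_size(d_vc=10, epsilon=0.05, delta=0.05, iters=100):
    """Iterate N = (8 / eps^2) * ln(4 * (2N)^d_vc / delta) to a fixed point,
    using the simple growth-function bound mH(N) <= N^d_vc."""
    N = 1000.0  # arbitrary starting guess
    for _ in range(iters):
        N = (8.0 / epsilon ** 2) * math.log(4.0 * (2.0 * N) ** d_vc / delta)
    return N

# round(vc_sample_size(), -3) rounds the result to the closest thousand
```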

4.

There are a number of bounds on the generalization error ε, all holding with probability at least 1 − δ. Fix dvc = 50 and δ = 0.05 and plot these bounds as a function of N. Use any numerical method to calculate the generalization error ε given by each bound. Which bound is the tightest (smallest) for very large N, say N = 10,000? What is the generalization error calculated by the tightest bound? Note that the Devroye and Parrondo & Van den Broek bounds are implicit in ε.

[a] Original VC bound: ε ≤ √( (8/N) ln( 4 mH(2N) / δ ) )

[b] Variant VC bound: ε ≤ √( (16/N) ln( 2 mH(N) / δ ) )

[c] Rademacher Penalty Bound: ε ≤ √( (2/N) ln( 2N mH(N) ) ) + √( (2/N) ln(1/δ) ) + 1/N

[d] Parrondo and Van den Broek: ε ≤ √( (1/N) ( 2ε + ln( 6 mH(2N) / δ ) ) )

[e] Devroye: ε ≤ √( (1/(2N)) ( 4ε(1 + ε) + ln( 4 mH(N²) / δ ) ) )
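The sketch below (not a required implementation) shows one way the five bounds could be evaluated numerically, using the simple bound mH(N) ≤ N^dvc for the growth function and handling the two implicit bounds [d] and [e] by fixed-point iteration; all function and variable names are illustrative.

```python
import math

def ln_mH(N, d_vc=50):
    # ln of the simple growth-function bound N^d_vc
    return d_vc * math.log(N)

def vc_bounds(N, d_vc=50, delta=0.05):
    a = math.sqrt(8.0 / N * (math.log(4.0 / delta) + ln_mH(2 * N, d_vc)))  # [a]
    b = math.sqrt(16.0 / N * (math.log(2.0 / delta) + ln_mH(N, d_vc)))     # [b]
    c = (math.sqrt(2.0 * (math.log(2.0 * N) + ln_mH(N, d_vc)) / N)
         + math.sqrt(2.0 / N * math.log(1.0 / delta)) + 1.0 / N)           # [c]
    d = e = 1.0
    for _ in range(1000):  # [d] and [e] have epsilon on both sides: iterate to a fixed point
        d = math.sqrt((2.0 * d + math.log(6.0 / delta) + ln_mH(2 * N, d_vc)) / N)
        e = math.sqrt((4.0 * e * (1.0 + e) + math.log(4.0 / delta) + ln_mH(N * N, d_vc)) / (2.0 * N))
    return a, b, c, d, e

# compare the bounds at large and small N, e.g. vc_bounds(10000) and vc_bounds(5)
```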

5.

Continuing from Question 4, for small N , say N = 5, which bound is the tightest (smallest)? What is the generalization error calculated by the tightest bound?

In Questions 6-11, you are asked to play with the growth function or VC-dimension of some hypothesis sets. You should make sure your proofs are rigorous and complete, as they will be carefully checked.

6.

What is the growth function mH(N) of “positive-and-negative intervals on R”? The hypothesis set H of “positive-and-negative intervals” contains the functions which are +1 within one interval [ℓ, r] and −1 elsewhere, as well as the functions which are −1 within one interval [ℓ, r] and +1 elsewhere.

For instance, the hypothesis h1(x) = sign(x(x − 4)) is a negative interval with −1 within [0, 4] and +1 elsewhere, and hence belongs to H. The hypothesis h2(x) = sign((x + 1)(x)(x − 1)) contains two positive intervals in [−1, 0] and [1, ∞) and hence does not belong to H. Please provide proof of your answer.

7.

Continuing from the previous problem, what is the VC-dimension of the “positive-and-negative intervals on R”? Please provide proof of your answer.

8.

What is the growth function mH(N) of “positive donuts in R²”? The hypothesis set H of “positive donuts” contains hypotheses formed by two concentric circles centered at the origin. In particular, each hypothesis is +1 within a “donut” region of a² ≤ x1² + x2² ≤ b² and −1 elsewhere. Without loss of generality, we assume 0 < a < b < ∞. Please provide proof of your answer.

9.

Consider the “polynomial discriminant” hypothesis set of degree D on R, which is given by

H = { h_c | h_c(x) = sign( Σ_{i=0}^{D} c_i x^i ) }

What is the VC-Dimension of such an H? Please provide proof of your answer.

10.

Consider the “simplified decision trees” hypothesis set on R^d, which is given by

H = { h_{t,S} | h_{t,S}(x) = 2 ⟦v ∈ S⟧ − 1, where v_i = ⟦x_i > t_i⟧, S a collection of vectors in {0, 1}^d, t ∈ R^d }

That is, each hypothesis makes a prediction by first using the d thresholds t_i to locate x within one of the 2^d hyper-rectangular regions, and then looking up S to decide whether the region should be +1 or −1. What is the VC-dimension of the “simplified decision trees” hypothesis set?

Please provide proof of your answer.

11.

Consider the “triangle waves” hypothesis set on R, which is given by

H = { h_α | h_α(x) = sign( |(αx) mod 4 − 2| − 1 ), α ∈ R }

Here (z mod 4) is a number z − 4k for some integer k such that z − 4k ∈ [0, 4). For instance, (11.26 mod 4) is 3.26, and (−11.26 mod 4) is 0.74. What is the VC-Dimension of such an H?

Please provide proof of your answer.
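Purely as an implementation-minded aside (not needed for the proof), Python's % operator already reduces a real number into [0, 4) exactly as defined above, so a hypothesis can be checked numerically; the function name is mine, and the problem does not specify the behavior when the argument of sign is exactly 0.

```python
def h_alpha(x, alpha):
    # sign(|(alpha * x) mod 4 - 2| - 1), treating a value of exactly 0 as +1
    return 1 if abs((alpha * x) % 4 - 2) - 1 >= 0 else -1

# the examples in the text, up to floating-point rounding:
# 11.26 % 4 is about 3.26, and (-11.26) % 4 is about 0.74
```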


In Questions 12-15, you are asked to verify some properties or bounds on the growth function and VC-dimension.

12.

Is min_{1≤i≤N−1} 2^i · mH(N − i) an upper bound of the growth function mH(N) for N ≥ dvc ≥ 2? Please provide proof of your answer.

13.

Is 2^⌊√N⌋ a possible growth function mH(N) for some hypothesis set? Please provide proof of your answer.

14.

For hypothesis sets H1, H2, ..., HK with finite, positive VC dimensions dvc(Hk), consider the VC dimension of the intersection of the sets. Prove or disprove that

0 ≤ dvc( ∩_{k=1}^{K} H_k ) ≤ min{ dvc(H_k) }_{k=1}^{K}.

(The VC dimension of an empty set or a singleton set is taken as zero)

15.

For hypothesis sets H1, H2, ..., HK with finite, positive VC dimensions dvc(Hk), consider the VC dimension of the union of the sets. Prove or disprove that

max{ dvc(H_k) }_{k=1}^{K} ≤ dvc( ∪_{k=1}^{K} H_k ) ≤ K − 1 + Σ_{k=1}^{K} dvc(H_k).

For Questions 16-20, you will play with the decision stump algorithm.

In class, we taught about the learning model of “positive and negative rays” (which is simply the one-dimensional perceptron) for one-dimensional data. The model contains hypotheses of the form:

hs,θ(x) = s · sign(x − θ).

The model is frequently named the “decision stump” model and is one of the simplest learning models.

As shown in class, for one-dimensional data, the VC dimension of the decision stump model is 2.

In fact, the decision stump model is one of the few models for which Ein can be minimized efficiently for binary classification by enumerating all possible thresholds. In particular, for N examples, there are at most 2N dichotomies (see page 22 of the class05 slides), and thus at most 2N different Ein values. We can then easily choose the dichotomy that leads to the lowest Ein, where ties can be broken by randomly choosing among the lowest-Ein ones. The chosen dichotomy stands for a combination of some ‘spot’ (range of θ) and s, and commonly the median of the range is chosen as the θ that realizes the dichotomy.
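A minimal sketch of this enumeration is given below, assuming NumPy; the function name, the random tie-breaking, and the choice of candidate thresholds (one point below all data plus the median of every gap) follow the description above, but the details are illustrative rather than prescribed.

```python
import numpy as np

def decision_stump_1d(x, y, rng=None):
    """Return (s, theta, ein) minimizing in-sample 0/1 error of h(x) = s * sign(x - theta),
    enumerating the at most 2N dichotomies of the one-dimensional decision stump."""
    if rng is None:
        rng = np.random.default_rng()
    xs = np.sort(x)
    # candidate thresholds: one below all points, plus the median of each gap between sorted points
    thetas = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0))
    candidates, best_err = [], np.inf
    for s in (+1, -1):
        for theta in thetas:
            err = np.mean(s * np.sign(x - theta) != y)
            if err < best_err:
                best_err, candidates = err, [(s, theta)]
            elif err == best_err:
                candidates.append((s, theta))
    s, theta = candidates[rng.integers(len(candidates))]  # break ties randomly
    return s, theta, best_err
```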

In this problem, you are asked to implement such an algorithm and run your program on an artificial data set. First of all, start by generating one-dimensional data by the procedure below:

a) Generate x by a uniform distribution in [−1, 1].

b) Generate y by s̃(x) + noise, where s̃(x) = sign(x) and the noise flips the result with 20% probability.
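A sketch of this generation procedure, assuming NumPy (the function and parameter names are mine):

```python
import numpy as np

def generate_data(n=20, noise=0.2, rng=None):
    """Step a): x ~ Uniform[-1, 1]; step b): y = sign(x), flipped with probability `noise`."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.where(x >= 0, 1, -1)                 # sign(x), with the measure-zero case x = 0 sent to +1
    y = np.where(rng.random(n) < noise, -y, y)  # flip each label with probability `noise`
    return x, y
```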

16.

For any decision stump hs,θ with θ ∈ [−1, 1], express Eout(hs,θ) as a function of θ and s. Please provide your derivation steps.

17.

(*) Generate a data set of size 20 by the procedure above and run the one-dimensional decision stump algorithm on the data set. Record Ein and compute Eout with the formula above. Repeat the experiment (including data generation, running the decision stump algorithm, and computing Ein and Eout) 5, 000 times. What is the average Ein? Plot a histogram for your Ein distribution.

18.

(*) Continuing from the previous question, what is the average Eout? Plot a histogram for your Eout distribution.
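Questions 17-18 can then be run as a plain loop over the two sketches above; the Eout part is left as a placeholder for whatever closed form you derive in Question 16.

```python
import numpy as np

e_ins, e_outs = [], []
for _ in range(5000):
    x, y = generate_data(20)
    s, theta, e_in = decision_stump_1d(x, y)
    e_ins.append(e_in)
    # e_outs.append(eout_formula(s, theta))  # plug in your own Question 16 formula here
print("average Ein:", np.mean(e_ins))        # then plot histograms of e_ins (and e_outs)
```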


Decision stumps can also work for multi-dimensional data. In particular, each decision stump now deals with a specific dimension i, as shown below.

hs,i,θ(x) = s · sign(xi − θ).

Implement the following decision stump algorithm for multi-dimensional data:

a) for each dimension i = 1, 2, · · · , d, find the best decision stump hs,i,θ using the one-dimensional decision stump algorithm that you have just implemented.

b) return the “best of best” decision stump in terms of Ein. If there is a tie, please randomly choose among the lowest-Ein ones.
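A sketch of steps a) and b), reusing the one-dimensional routine sketched earlier (names illustrative):

```python
import numpy as np

def decision_stump_multi(X, y, rng=None):
    """X has shape (N, d). Return (s, i, theta, ein) for the best single-dimension stump."""
    if rng is None:
        rng = np.random.default_rng()
    candidates, best_err = [], np.inf
    for i in range(X.shape[1]):                    # step a): best stump for each dimension
        s, theta, err = decision_stump_1d(X[:, i], y, rng)
        if err < best_err:
            best_err, candidates = err, [(s, i, theta)]
        elif err == best_err:
            candidates.append((s, i, theta))
    s, i, theta = candidates[rng.integers(len(candidates))]  # step b): best of best, ties at random
    return s, i, theta, best_err
```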

The training data Dtrain is available at:

http://www.csie.ntu.edu.tw/~htlin/course/ml14fall/hw2/hw2_train.dat

The testing data Dtest is available at:

http://www.csie.ntu.edu.tw/~htlin/course/ml14fall/hw2/hw2_test.dat
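Assuming the two files are downloaded locally and each row holds the feature values followed by the ±1 label in whitespace-separated columns (this format is an assumption; please check the actual files), loading could look like:

```python
import numpy as np

def load_data(path):
    data = np.loadtxt(path)           # assumed format: features ... label, whitespace-separated
    return data[:, :-1], data[:, -1]  # (X, y)
```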

19.

(*) Run the algorithm on Dtrain. What is the optimal decision stump returned by your program? What is the Ein of the optimal decision stump?

20.

(*) Use the returned decision stump to predict the label of each example within the Dtest. Report an estimate of Eout by Etest.
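Putting the sketches together, Questions 19-20 then amount to training on Dtrain and scoring the returned stump on Dtest; the file names below assume local copies of the files linked above.

```python
import numpy as np

X_train, y_train = load_data("hw2_train.dat")
X_test, y_test = load_data("hw2_test.dat")

s, i, theta, e_in = decision_stump_multi(X_train, y_train)  # Question 19: the stump and its Ein
predictions = s * np.sign(X_test[:, i] - theta)
e_test = np.mean(predictions != y_test)                     # Question 20: Etest as an estimate of Eout
print(s, i, theta, e_in, e_test)
```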

Bonus: More on Growth Function

21.

In class, we have shown that

B(N, k) ≤ Σ_{i=0}^{k−1} (N choose i)

Show that in fact the equality holds. (Hint: there is an intuitive construction of a specific set of Σ_{i=0}^{k−1} (N choose i) dichotomies, where no subset of k variables can be shattered.)

