
Machine Learning (NTU, Fall 2015) instructor: Hsuan-Tien Lin

Homework #2

RELEASE DATE: 10/15/2015

DUE DATE: 11/02/2015 (MONDAY), BEFORE NOON

QUESTIONS ABOUT HOMEWORK MATERIALS ARE WELCOMED ON THE COURSERA FORUM.

Unless granted by the instructor in advance, you must turn in a printed/written copy of your solutions (without the source code) for all problems.

For problems marked with (*), please follow the guidelines on the course website and upload your source code to designated places. You are encouraged to (but not required to) include a README to help the TAs check your source code. Any programming language/platform is allowed.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for such misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English or Chinese with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

This homework set comes with 200 points and 20 bonus points. In general, every homework set would come with a full credit of 200 points, with some possible bonus points.

Questions 1-2 are about noisy targets

1.

Consider the bin model for a hypothesis h that makes an error with probability µ in approximating a deterministic target function f (both h and f output values in {−1, +1}). If we use the same h to approximate a noisy version of f given by

P(x, y) = P(x) P(y|x)

P(y|x) = λ if y = f(x), and 1 − λ otherwise.

What is the probability of error that h makes in approximating the noisy target y? Please provide an explanation of your answer.

2.

Following Question 1, with what value of λ will the performance of h be independent of µ? Please provide an explanation of your answer.

Questions 3-5 are about generalization error, and getting the feel of the bounds numerically.

Please use the simple upper bound N^dvc on the growth function mH(N), assuming that N ≥ 2 and dvc ≥ 2.

3.

For an H with dvc = 10, if you want 95% confidence that your generalization error is at most 0.05, what is the sample size that the VC generalization bound predicts? Please provide the calculation steps of your answer, and round your answer to the closest thousand (that is, your answer should be something like 845000).
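For reference, the implied sample size can be found numerically by iterating the VC bound N = (8/ε²) ln(4 (2N)^dvc / δ) to a fixed point. The sketch below only illustrates that approach (written in Python with an illustrative function name); it is not a required part of the calculation.

```python
import math

def vc_sample_size(d_vc=10, epsilon=0.05, delta=0.05, iters=100):
    """Iterate N = (8 / eps^2) * ln(4 * (2N)^d_vc / delta) to a fixed point,
    using the simple growth-function bound mH(N) <= N^d_vc."""
    N = 1000.0  # arbitrary starting guess
    for _ in range(iters):
        N = (8.0 / epsilon ** 2) * math.log(4.0 * (2.0 * N) ** d_vc / delta)
    return N

# round(vc_sample_size(), -3) rounds the result to the closest thousand
```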

4.

There are a number of bounds on the generalization error ε, all holding with probability at least 1 − δ. Fix dvc = 50 and δ = 0.05 and plot these bounds as a function of N. Use any numerical method to calculate the generalization error ε given by each bound. Which bound is the tightest (smallest) for very large N, say N = 10,000? What is the generalization error calculated by the tightest bound? Note that the Devroye and Parrondo & Van den Broek bounds are implicit in ε.

[a] Original VC bound: ε ≤ √( (8/N) ln( 4 mH(2N) / δ ) )

[b] Variant VC bound: ε ≤ √( (16/N) ln( 2 mH(N) / δ ) )

[c] Rademacher Penalty Bound: ε ≤ √( (2/N) ln( 2N mH(N) ) ) + √( (2/N) ln(1/δ) ) + 1/N

[d] Parrondo and Van den Broek: ε ≤ √( (1/N) ( 2ε + ln( 6 mH(2N) / δ ) ) )

[e] Devroye: ε ≤ √( (1/(2N)) ( 4ε(1 + ε) + ln( 4 mH(N²) / δ ) ) )
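The sketch below (not a required implementation) shows one way the five bounds could be evaluated numerically, using the simple bound mH(N) ≤ N^dvc for the growth function and handling the two implicit bounds [d] and [e] by fixed-point iteration; all function and variable names are illustrative.

```python
import math

def ln_mH(N, d_vc=50):
    # ln of the simple growth-function bound N^d_vc
    return d_vc * math.log(N)

def vc_bounds(N, d_vc=50, delta=0.05):
    a = math.sqrt(8.0 / N * (math.log(4.0 / delta) + ln_mH(2 * N, d_vc)))  # [a]
    b = math.sqrt(16.0 / N * (math.log(2.0 / delta) + ln_mH(N, d_vc)))     # [b]
    c = (math.sqrt(2.0 * (math.log(2.0 * N) + ln_mH(N, d_vc)) / N)
         + math.sqrt(2.0 / N * math.log(1.0 / delta)) + 1.0 / N)           # [c]
    d = e = 1.0
    for _ in range(1000):  # [d] and [e] have epsilon on both sides: iterate to a fixed point
        d = math.sqrt((2.0 * d + math.log(6.0 / delta) + ln_mH(2 * N, d_vc)) / N)
        e = math.sqrt((4.0 * e * (1.0 + e) + math.log(4.0 / delta) + ln_mH(N * N, d_vc)) / (2.0 * N))
    return a, b, c, d, e

# compare the bounds at large and small N, e.g. vc_bounds(10000) and vc_bounds(5)
```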

5.

Continuing from Question 4, for small N , say N = 5, which bound is the tightest (smallest)? What is the generalization error calculated by the tightest bound?

In Questions 6-11, you are asked to play with the growth function or VC-dimension of some hypothesis sets. You should make sure your proofs are rigorous and complete, as they will be carefully checked.

6.

What is the growth function mH(N) of “positive-and-negative intervals on R”? The hypothesis set H of “positive-and-negative intervals” contains the functions which are +1 within one interval [ℓ, r] and −1 elsewhere, as well as the functions which are −1 within one interval [ℓ, r] and +1 elsewhere.

For instance, the hypothesis h1(x) = sign(x(x − 4)) is a negative interval with −1 within [0, 4] and +1 elsewhere, and hence belongs to H. The hypothesis h2(x) = sign((x + 1)(x)(x − 1)) contains two positive intervals in [−1, 0] and [1, ∞) and hence does not belong to H. Please provide proof of your answer.

7.

Continuing from the previous problem, what is the VC-dimension of the “positive-and-negative intervals on R”? Please provide proof of your answer.

8.

What is the growth function mH(N) of “positive donuts in R²”? The hypothesis set H of “positive donuts” contains hypotheses formed by two concentric circles centered at the origin. In particular, each hypothesis is +1 within a “donut” region of a² ≤ x1² + x2² ≤ b² and −1 elsewhere. Without loss of generality, we assume 0 < a < b < ∞. Please provide proof of your answer.

9.

Consider the “polynomial discriminant” hypothesis set of degree D on R, which is given by

H = { h_c | h_c(x) = sign( Σ_{i=0}^{D} c_i x^i ) }

What is the VC-Dimension of such an H? Please provide proof of your answer.

10.

Consider the “simplified decision trees” hypothesis set on R^d, which is given by

H = { h_{t,S} | h_{t,S}(x) = 2 ⟦v ∈ S⟧ − 1, where v_i = ⟦x_i > t_i⟧, S a collection of vectors in {0, 1}^d, t ∈ R^d }

That is, each hypothesis makes a prediction by first using the d thresholds t_i to locate x within one of the 2^d hyper-rectangular regions, and then looking up S to decide whether the region should be +1 or −1. What is the VC-dimension of the “simplified decision trees” hypothesis set?

Please provide proof of your answer.

11.

Consider the “triangle waves” hypothesis set on R, which is given by

H = { h_α | h_α(x) = sign( |(αx) mod 4 − 2| − 1 ), α ∈ R }

Here (z mod 4) is a number z − 4k for some integer k such that z − 4k ∈ [0, 4). For instance, (11.26 mod 4) is 3.26, and (−11.26 mod 4) is 0.74. What is the VC-Dimension of such an H?

Please provide proof of your answer.
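Purely as an implementation-minded aside (not needed for the proof), Python's % operator already reduces a real number into [0, 4) exactly as defined above, so a hypothesis can be checked numerically; the function name is mine, and the problem does not specify the behavior when the argument of sign is exactly 0.

```python
def h_alpha(x, alpha):
    # sign(|(alpha * x) mod 4 - 2| - 1), treating a value of exactly 0 as +1
    return 1 if abs((alpha * x) % 4 - 2) - 1 >= 0 else -1

# the examples in the text, up to floating-point rounding:
# 11.26 % 4 is about 3.26, and (-11.26) % 4 is about 0.74
```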


In Questions 12-15, you are asked to verify some properties or bounds on the growth function and VC-dimension.

12.

Is min_{1≤i≤N−1} 2^i · mH(N − i) an upper bound of the growth function mH(N) for N ≥ dvc ≥ 2? Please provide proof of your answer.

13.

Is 2^⌊√N⌋ a possible growth function mH(N) for some hypothesis set? Please provide proof of your answer.

14.

For hypothesis sets H1, H2, ..., HK with finite, positive VC dimensions dvc(Hk), consider the VC dimension of the intersection of the sets. Prove or disprove that

0 ≤ dvc( ∩_{k=1}^{K} H_k ) ≤ min{ dvc(H_k) }_{k=1}^{K}.

(The VC dimension of an empty set or a singleton set is taken as zero)

15.

For hypothesis sets H1, H2, ..., HK with finite, positive VC dimensions dvc(Hk), consider the VC dimension of the union of the sets. Prove or disprove that

max{ dvc(H_k) }_{k=1}^{K} ≤ dvc( ∪_{k=1}^{K} H_k ) ≤ K − 1 + Σ_{k=1}^{K} dvc(H_k).

For Questions 16-20, you will play with the decision stump algorithm.

In class, we taught about the learning model of “positive and negative rays” (which is simply the one-dimensional perceptron) for one-dimensional data. The model contains hypotheses of the form:

hs,θ(x) = s · sign(x − θ).

The model is frequently named the “decision stump” model and is one of the simplest learning models.

As shown in class, for one-dimensional data, the VC dimension of the decision stump model is 2.

In fact, the decision stump model is one of the few models for which Ein can be minimized efficiently for binary classification by enumerating all possible thresholds. In particular, for N examples, there are at most 2N dichotomies (see page 22 of the class05 slides), and thus at most 2N different Ein values. We can then easily choose the dichotomy that leads to the lowest Ein, where ties can be broken by randomly choosing among the lowest-Ein ones. The chosen dichotomy stands for a combination of some ‘spot’ (range of θ) and s, and commonly the median of the range is chosen as the θ that realizes the dichotomy.
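A minimal sketch of this enumeration is given below, assuming NumPy; the function name, the random tie-breaking, and the choice of candidate thresholds (one point below all data plus the median of every gap) follow the description above, but the details are illustrative rather than prescribed.

```python
import numpy as np

def decision_stump_1d(x, y, rng=None):
    """Return (s, theta, ein) minimizing in-sample 0/1 error of h(x) = s * sign(x - theta),
    enumerating the at most 2N dichotomies of the one-dimensional decision stump."""
    if rng is None:
        rng = np.random.default_rng()
    xs = np.sort(x)
    # candidate thresholds: one below all points, plus the median of each gap between sorted points
    thetas = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0))
    candidates, best_err = [], np.inf
    for s in (+1, -1):
        for theta in thetas:
            err = np.mean(s * np.sign(x - theta) != y)
            if err < best_err:
                best_err, candidates = err, [(s, theta)]
            elif err == best_err:
                candidates.append((s, theta))
    s, theta = candidates[rng.integers(len(candidates))]  # break ties randomly
    return s, theta, best_err
```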

In this problem, you are asked to implement such an algorithm and run your program on an artificial data set. First of all, start by generating one-dimensional data by the procedure below:

a) Generate x by a uniform distribution in [−1, 1].

b) Generate y by s̃(x) + noise, where s̃(x) = sign(x) and the noise flips the result with 20% probability.
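A sketch of this generation procedure, assuming NumPy (the function and parameter names are mine):

```python
import numpy as np

def generate_data(n=20, noise=0.2, rng=None):
    """Step a): x ~ Uniform[-1, 1]; step b): y = sign(x), flipped with probability `noise`."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.where(x >= 0, 1, -1)                 # sign(x), with the measure-zero case x = 0 sent to +1
    y = np.where(rng.random(n) < noise, -y, y)  # flip each label with probability `noise`
    return x, y
```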

16.

For any decision stump hs,θ with θ ∈ [−1, 1], express Eout(hs,θ) as a function of θ and s. Please provide your derivation steps.

17.

(*) Generate a data set of size 20 by the procedure above and run the one-dimensional decision stump algorithm on the data set. Record Ein and compute Eout with the formula above. Repeat the experiment (including data generation, running the decision stump algorithm, and computing Ein and Eout) 5, 000 times. What is the average Ein? Plot a histogram for your Ein distribution.

18.

(*) Continuing from the previous question, what is the average Eout? Plot a histogram for your Eout distribution.
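Questions 17-18 can then be run as a plain loop over the two sketches above; the Eout part is left as a placeholder for whatever closed form you derive in Question 16.

```python
import numpy as np

e_ins, e_outs = [], []
for _ in range(5000):
    x, y = generate_data(20)
    s, theta, e_in = decision_stump_1d(x, y)
    e_ins.append(e_in)
    # e_outs.append(eout_formula(s, theta))  # plug in your own Question 16 formula here
print("average Ein:", np.mean(e_ins))        # then plot histograms of e_ins (and e_outs)
```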


Decision stumps can also work for multi-dimensional data. In particular, each decision stump now deals with a specific dimension i, as shown below.

hs,i,θ(x) = s · sign(xi − θ).

Implement the following decision stump algorithm for multi-dimensional data:

a) for each dimension i = 1, 2, · · · , d, find the best decision stump hs,i,θ using the one-dimensional decision stump algorithm that you have just implemented.

b) return the “best of best” decision stump in terms of Ein. If there is a tie, please randomly choose among the lowest-Ein ones.
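A sketch of steps a) and b), reusing the one-dimensional routine sketched earlier (names illustrative):

```python
import numpy as np

def decision_stump_multi(X, y, rng=None):
    """X has shape (N, d). Return (s, i, theta, ein) for the best single-dimension stump."""
    if rng is None:
        rng = np.random.default_rng()
    candidates, best_err = [], np.inf
    for i in range(X.shape[1]):                    # step a): best stump for each dimension
        s, theta, err = decision_stump_1d(X[:, i], y, rng)
        if err < best_err:
            best_err, candidates = err, [(s, i, theta)]
        elif err == best_err:
            candidates.append((s, i, theta))
    s, i, theta = candidates[rng.integers(len(candidates))]  # step b): best of best, ties at random
    return s, i, theta, best_err
```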

The training data Dtrain is available at:

http://www.csie.ntu.edu.tw/~htlin/course/ml14fall/hw2/hw2_train.dat

The testing data Dtest is available at:

http://www.csie.ntu.edu.tw/~htlin/course/ml14fall/hw2/hw2_test.dat
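Assuming the two files are downloaded locally and each row holds the feature values followed by the ±1 label in whitespace-separated columns (this format is an assumption; please check the actual files), loading could look like:

```python
import numpy as np

def load_data(path):
    data = np.loadtxt(path)           # assumed format: features ... label, whitespace-separated
    return data[:, :-1], data[:, -1]  # (X, y)
```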

19.

(*) Run the algorithm on Dtrain. What is the optimal decision stump returned by your program? What is the Ein of the optimal decision stump?

20.

(*) Use the returned decision stump to predict the label of each example within the Dtest. Report an estimate of Eout by Etest.
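Putting the sketches together, Questions 19-20 then amount to training on Dtrain and scoring the returned stump on Dtest; the file names below assume local copies of the files linked above.

```python
import numpy as np

X_train, y_train = load_data("hw2_train.dat")
X_test, y_test = load_data("hw2_test.dat")

s, i, theta, e_in = decision_stump_multi(X_train, y_train)  # Question 19: the stump and its Ein
predictions = s * np.sign(X_test[:, i] - theta)
e_test = np.mean(predictions != y_test)                     # Question 20: Etest as an estimate of Eout
print(s, i, theta, e_in, e_test)
```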

Bonus: More on Growth Function

21.

In class, we have shown that

B(N, k) ≤ Σ_{i=0}^{k−1} (N choose i)

Show that in fact the equality holds. (Hint: there is an intuitive construction of a specific set of Σ_{i=0}^{k−1} (N choose i) dichotomies, where no subset of k variables can be shattered.)

