
Homework #2

RELEASE DATE: 10/16/2020 RED BUG FIX: 10/22/2020 16:30 BLUE BUG FIX: 10/24/2020 17:00 GREEN BUG FIX: 11/03/2020 21:10

DUE DATE: EXTENDED TO 11/06/2020, BEFORE 13:00 ON NTU COOL. QUESTIONS ARE WELCOMED ON THE NTU COOL FORUM.

We will instruct you on how to use Gradescope to upload your choices and your scanned/printed solutions later. For problems marked with (*), please follow the guidelines on the course website and upload your source code to Gradescope as well. You are encouraged to (but not required to) include a README to help the TAs check your source code. Any programming language/platform is allowed.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for such misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English or Chinese with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

This homework set comes with 400 points. For each problem, there is one correct choice.

For most of the problems, if you choose the correct answer, you get 20 points; if you choose an incorrect answer, you get −10 points. That is, the expected value of random guessing over the five choices is −4 per problem, and if you can accurately eliminate two of the choices, the expected value of random guessing over the remaining three choices is 0 per problem. For the other problems, the TAs will check your solution in terms of the written explanations and/or code. The solution will be given a score between −20 and 20 based on how logical your solution is.
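To spell out the arithmetic behind these two expected values (assuming one of five choices is guessed uniformly at random):

(1/5)(20) + (4/5)(−10) = 4 − 8 = −4,  and  (1/3)(20) + (2/3)(−10) = 20/3 − 20/3 = 0.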

Perceptrons

1.

Which of the following sets of x ∈ R^3 can be shattered by the 3D perceptron hypothesis set? The set contains all hyperplanes of the form (with our usual notation of x_0 = 1):

h_w(x) = sign( Σ_{i=0}^{3} w_i x_i ).

Choose the correct answer; explain your answer.

[a] {(7, 8, 9), (17, 18, 19), (27, 28, 29)}

[b] {(1, 1, 1), (7, 8, 9), (15, 16, 17), (21, 23, 25)}

[c] {(1, 1, 3), (7, 8, 9), (15, 16, 17), (21, 23, 25)}

[d] {(1, 3, 5), (7, 8, 9), (15, 16, 17), (21, 23, 25)}

[e] {(1, 2, 3), (4, 5, 6), (7, 8, 9), (15, 16, 17), (21, 23, 25)}


2.

What is the growth function of axis-aligned perceptrons in 2D for N ≥ 4? These perceptrons are all perceptrons with w_1 w_2 = 0; that is, they are vertical or horizontal lines on the 2D plane.

Choose the correct answer; explain your answer.

[a] 4N + 4 [b] 4N + 2 [c] 4N [d] 4N − 2 [e] 4N − 4

3.

What is the VC dimension of positively-biased perceptrons in 2D? The positively-biased perceptrons are all perceptrons with w_0 > 0. Choose the correct answer; explain your answer.

[a] 0 [b] 1 [c] 2 [d] 3 [e] 4

Ring Hypothesis Set

4.

The “ring” hypothesis set in R^3 contains hypotheses parameterized by two positive numbers a and b, where

h(x) = +1 if a ≤ x_1² + x_2² + x_3² ≤ b,
       −1 otherwise.

What is the growth function of the hypothesis set? Choose the correct answer; explain your answer.

[a] \binom{N+1}{1} + 1 [b] \binom{N+1}{2} + 1 [c] \binom{N+1}{3} + 1 [d] \binom{N+1}{6} + 1

[e] none of the other choices

5.

Following the previous problem, what is the VC dimension of the ring hypothesis set? Choose the correct answer; explain your answer.

[a] 1 [b] 2 [c] 3 [d] 6

[e] none of the other choices


Deviation from Optimal Hypothesis

6.

In Lecture 7, the VC bound was stated from the perspective of g, the hypothesis picked by the learning algorithm. The bound itself actually quantifies the BAD probability from any hypothesis h in the hypothesis set. That is,

P[ ∃h ∈ H s.t. |Ein(h) − Eout(h)| > ε ] ≤ 4 m_H(2N) exp(−(1/8) ε² N).

Define the best-Ein hypothesis

g = argmin_{h∈H} Ein(h)

and the best-Eout hypothesis (which is optimal but can only be obtained by a “cheating” algorithm)

g* = argmin_{h∈H} Eout(h).

Using the VC bound above, with probability more than 1 − δ, which of the following is an upper bound of Eout(g) − Eout(g*)? Choose the correct answer; explain your answer.

[a] √( (1/(8N)) ln(4 m_H(2N)/δ) )

[b] √( (1/(8N)) ln(m_H(2N)/δ) )

[c] √( (8/N) ln(4 m_H(2N)/δ) )

[d] 2√( (8/N) ln(4 m_H(2N)/δ) )

[e] √( (8/N) ln(8 m_H(2N)/δ) )

The VC Dimension

7.

For a finite hypothesis set H = {h_1, h_2, . . . , h_M}, where each hypothesis is a binary classifier from X to {−1, +1}, what is the largest possible value of d_vc(H)? Choose the correct answer; explain your answer.

[a] M [b] 2M [c] M² [d] ⌊log₂ M⌋ [e] 2^M

8.

A boolean function h : {−1, +1}^k → {−1, +1} is called symmetric if its value does not depend on the permutation of its inputs, i.e., its value only depends on the number of +1's in the input. What is the VC dimension of the set of all symmetric boolean functions? Choose the correct answer; explain your answer.

[a] k − 2 [b] k − 1 [c] k [d] k + 1 [e] k + 2


9.

How many of the following are necessary conditions for d_vc(H) = d? Choose the correct answer; state which conditions correspond to your answer and explain them.

• some set of d distinct inputs is shattered by H

• some set of d distinct inputs is not shattered by H

• any set of d distinct inputs is shattered by H

• any set of d distinct inputs is not shattered by H

• some set of d + 1 distinct inputs is shattered by H

• some set of d + 1 distinct inputs is not shattered by H

• any set of d + 1 distinct inputs is shattered by H

• any set of d + 1 distinct inputs is not shattered by H

[a] 1 [b] 2 [c] 3 [d] 4 [e] 5

10.

Which of the following hypothesis sets is of VC dimension ∞? Choose the correct answer; explain your answer.

[a] the rectangle family: the infinite number of hypotheses where the boundary between ±1 regions of each hypothesis looks like a rectangle (including axis-aligned ones and rotated ones) for x ∈ R^2

[b] the intersected-interval family: the infinite number of hypotheses where the positive region of each hypothesis can be represented as an intersection of any finite number of “positive intervals” for x ∈ R

[c] the sine family: the infinite number of hypotheses {h_α : h_α(x) = sign(sin(α · x))} for x ∈ R

[d] the scaling family: the infinite number of hypotheses {h_α : h_α(x) = sign(α · Σ_{i=1}^{d} x_i)} for x ∈ R^d

[e] none of the other choices

Noise and Error

11.

Consider a binary classification problem where we sample (x, y) from a distribution P with y ∈ {−1, +1}. Now we define a distribution P_τ to be a “noisy” version of P. That is, to sample from P_τ, we first sample (x, y) from P and flip y to −y with probability τ independently. Note that P_0 = P. The distribution P_τ models a situation where our training data are labeled by an unreliable human who mislabels with probability τ.

Define Eout(h, τ) to be the out-of-sample error of h with respect to P_τ. That is, Eout(h, τ) = E_{(x,y)∼P_τ} ⟦ h(x) ≠ y ⟧.

Which of the following relates Eout(h, τ ) to Eout(h, 0)? Choose the correct answer; explain your answer.

[a] Eout(h, 0) = (Eout(h, τ) − 2τ)/(1 − τ) [b] Eout(h, 0) = (2 Eout(h, τ) − τ)/(2 − τ) [c] Eout(h, 0) = (τ − Eout(h, τ))/(1 − 2τ)


12.

Consider x ∈ R^3 and a target function f(x) = argmax_{i=1,2,3} x_i, with ties broken, if any, by choosing the smallest i. Then, assume a process that generates (x, y) by a uniform P(x) within [0, 1]^3 and

P(y|x) = 0.7 if y = f(x),
         0.1 if y = f(x) mod 3 + 1,
         0.2 if y = (f(x) + 1) mod 3 + 1.

The operation “a mod 3” returns the remainder when the integer a is divided by 3. When using the squared error, what is Eout(f) subject to the process above? Choose the correct answer; explain your answer. (Note: This is in some sense the “price of noise.”)

[a] 0.3 [b] 0.6 [c] 0.9 [d] 1.2 [e] 1.5

13.

Following Problem 12, the squared error defines an ideal target function

f*(x) = Σ_{y=1}^{3} y · P(y|x),

as shown on page 11 of the Lecture 8 slides. Unlike the slides, however, we denote this function as f* to avoid confusion with the target function f used for generating the data. Define the squared difference between f and f* to be

Δ(f, f*) = E_{x∼P(x)} (f(x) − f*(x))².

What is the value of Δ(f, f*)? Choose the correct answer; explain your answer. (Note: This quantifies how much the original target function f was dragged by the noise in P(y|x) toward the “new” target function f*.)

[a] 0.01 [b] 0.14 [c] 0.16 [d] 0.25 [e] 0.42


Decision Stump

On page 22 of the Lecture 5 slides (the Fun Time that you should play by yourself), we taught the learning model of “positive and negative rays” (which is simply a one-dimensional perceptron). The model contains hypotheses of the form:

h_{s,θ}(x) = s · sign(x − θ),

where s ∈ {−1, +1} is the “direction” of the ray and θ ∈ R is the threshold. You can take sign(0) = −1 for simplicity. The model is frequently named the “decision stump” model and is one of the simplest learning models. As shown in class, the growth function of the model is 2N and the VC dimension is 2.

14.

When using the decision stump model, given ε = 0.1 and δ = 0.1, among the five choices, what is the smallest N such that the BAD probability of the VC bound (as given in the beginning of Problem 6) is ≤ δ? Choose the correct answer; explain your answer.

[a] 6000 [b] 8000 [c] 10000 [d] 12000 [e] 14000

In fact, the decision stump model is one of the few models for which we can minimize Ein efficiently by enumerating all possible thresholds. In particular, for N examples, there are at most 2N dichotomies (see page 22 of the Lecture 5 slides), and thus at most 2N different Ein values. We can then easily choose the hypothesis that leads to the lowest Ein by the following decision stump learning algorithm.

(1) sort all N examples x_n into a sorted sequence x'_1, x'_2, . . . , x'_N such that x'_1 ≤ x'_2 ≤ . . . ≤ x'_N

(2) for each θ ∈ {−1} ∪ { (x'_i + x'_{i+1})/2 : 1 ≤ i ≤ N − 1 and x'_i ≠ x'_{i+1} } and s ∈ {−1, +1}, calculate Ein(h_{s,θ})

(3) return the h_{s,θ} with the minimum Ein as g; if multiple hypotheses reach the minimum Ein, return the one with the smallest s + θ.

(Hint: CS-majored students are encouraged to think about whether the second step can be carried out efficiently, i.e. O(N), using dxxxxxc pxxxxxxxxxg instead of the naive implementation of O(N²).)
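As a point of reference only, the enumeration in steps (1)-(3) might look like the following Python sketch; this is the naive O(N²) variant, and the names numpy, stump_predict, and fit_decision_stump are illustrative choices of ours, not anything required by the assignment.

import numpy as np

def stump_predict(s, theta, x):
    # h_{s,theta}(x) = s * sign(x - theta), taking sign(0) = -1 as in the problem statement
    return s * np.where(x - theta > 0, 1, -1)

def fit_decision_stump(x, y):
    # Step (1): sort the examples.
    xs = np.sort(x)
    # Step (2): candidate thresholds are -1 plus midpoints of adjacent distinct sorted values.
    thetas = [-1.0] + [(xs[i] + xs[i + 1]) / 2
                       for i in range(len(xs) - 1) if xs[i] != xs[i + 1]]
    best = None
    for s in (-1, +1):
        for theta in thetas:
            e_in = np.mean(stump_predict(s, theta, x) != y)
            key = (e_in, s + theta)  # Step (3): break ties by the smallest s + theta.
            if best is None or key < best[0]:
                best = (key, s, theta, e_in)
    _, s, theta, e_in = best
    return s, theta, e_in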

Next, you are asked to implement such an algorithm and run your program on an artificial data set. We shall start by generating (x, y) with the following procedure, taking the target function f(x) = sign(x):

• Generate x by a uniform distribution in [−1, +1].

• Generate y from x by y = f(x) and then flip y to −y with probability τ independently.
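Similarly, the generating procedure above can be sketched in a few lines of Python; the helper name generate_data and the use of numpy's default_rng are illustrative assumptions rather than part of the assignment.

import numpy as np

def generate_data(n, tau, rng=None):
    # x uniform in [-1, +1]; y = sign(x) (with sign(0) = -1, a measure-zero case here),
    # then flipped to -y with probability tau, independently per example.
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.where(x > 0, 1, -1)
    flip = rng.uniform(size=n) < tau
    y = np.where(flip, -y, y)
    return x, y

Combined with the fitting sketch above, one could, for instance, draw a training set, obtain g, and estimate Eout(g, τ) on a separately generated large test set, as suggested later in Problem 16.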

15.

For θ ∈ [−1, +1], what is Eout(h_{+1,θ}, 0), where Eout(h, τ) is defined in Problem 11? Choose the correct answer; explain your answer.

[a] |θ| [b] (1/2)|θ| [c] 2|θ| [d] 1 − |θ|


16.

(*) Consider τ = 0, which means that your data is noiseless. Generate a data set of size 2 by the procedure above and run the decision stump algorithm on the data set to get g. Repeat the experiment 10000 times, each with a different data set. What is the mean of Eout(g, τ) − Ein(g) within the 10000 results? Choose the closest value. (By extending the results in Problem 11 and Problem 15, you can actually compute any Eout(h_{s,θ}, τ) analytically. But if you do not trust your math derivation, you can get a very accurate estimate of Eout(g, τ) by evaluating g on a separate test data set of size 100000, as guaranteed by Hoeffding's inequality.)

[a] 0.00 [b] 0.02 [c] 0.05 [d] 0.30 [e] 0.40

17.

(*) For τ = 0, generate a data set of size 20 by the procedure above and run the decision stump algorithm on the data set to get g. Repeat the experiment 10000 times, each with a different data set. What is the mean of Eout(g, τ) − Ein(g) within the 10000 results? Choose the closest value.

[a] 0.00 [b] 0.02 [c] 0.05 [d] 0.30 [e] 0.40

18.

(*) For τ = 0.1, generate a data set of size 2 by the procedure above and run the decision stump algorithm on the data set to get g. Repeat the experiment 10000 times, each with a different data set. What is the mean of Eout(g, τ) − Ein(g) within the 10000 results? Choose the closest value.

[a] 0.00 [b] 0.02 [c] 0.05 [d] 0.30 [e] 0.40

19.

(*) For τ = 0.1, generate a data set of size 20 by the procedure above and run the decision stump algorithm on the data set to get g. Repeat the experiment 10000 times, each with a different data set. What is the mean of Eout(g, τ) − Ein(g) within the 10000 results? Choose the closest value.

[a] 0.00 [b] 0.02 [c] 0.05 [d] 0.30 [e] 0.40

20.

(*) For τ = 0.1, generate a data set of size 200 by the procedure above and run the decision stump algorithm on the data set to get g. Repeat the experiment 10000 times, each with a different data set. What is the mean of Eout(g, τ) − Ein(g) within the 10000 results? Choose the closest value.

[a] 0.00 [b] 0.02 [c] 0.05 [d] 0.30 [e] 0.40
