
Homework #4

RELEASE DATE: 11/13/2020 FOR PROBLEMS 1–15
FIRST RELEASE DATE: 11/16/2020 FOR PROBLEMS 16–20

RED BUG FIX: 11/16/2020 23:50 BLUE BUG FIX: 11/25/2020 06:00 GREEN BUG FIX: 11/29/2020 07:15

DUE DATE: 12/04 (THREE WEEKS, YEAH!!), BEFORE 13:00 on Gradescope
RANGE: LECTURES 13–16 + ANY EARLIER KNOWLEDGE

QUESTIONS ARE WELCOMED ON THE NTU COOL FORUM.

We will instruct you on how to use Gradescope to upload your choices and your scanned/printed solutions.

For problems marked with (*), please follow the guidelines on the course website and upload your source code to Gradescope as well. You are encouraged to (but not required to) include a README to help the TAs check your source code. Any programming language/platform is allowed.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for such misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English or Chinese with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

This homework set comes with 400 points. For each problem, there is one correct choice.

For most of the problems, if you choose the correct answer, you get 20 points; if you choose an incorrect answer, you get −10 points. That is, the expected value of random guessing is (1/5)(20) + (4/5)(−10) = −4 points per problem, and if you can eliminate two of the choices accurately, the expected value of random guessing on the remaining three choices would be 0 per problem. For other problems, the TAs will check your solution in terms of the written explanations and/or code. The solution will be given points between [−20, 20] based on how logical your solution is.

Deterministic Noise

1.

(Lecture 13) Consider the target function f (x) = ex. When x is uniformly sampled from [0, 2], and we use all linear hypotheses h(x) = w · x to approximate the target function with respect to the squared error, what is the magnitude of deterministic noise for each x? Choose the correct answer;

explain your answer.

[a] |e^x|

[b] |e^x − ((3 + e^2)/8) x|

[c] |e^x − ((3 + 3e^2)/8) x|

[d] |e^x − (e^2/8) x|

[e] |e^x − (3e^2/8) x|


(Hint: If you want to take page 17 of Lecture 13 for inspiration, please note that the answer on page 17 is not exact. Here, however, we are asking you for an exact answer.)
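If you would like a numerical sanity check of whatever exact expression you derive (an optional, hedged sketch; the NumPy-based Monte Carlo estimate and all variable names are illustrative, not part of the handout):

    import numpy as np

    # The best linear hypothesis h(x) = w*x under the squared error, with x uniform on [0, 2],
    # minimizes E[(e^x - w*x)^2]; over a large sample the minimizer is sum(x*e^x) / sum(x^2).
    rng = np.random.default_rng(0)
    xs = rng.uniform(0.0, 2.0, size=1_000_000)
    w_star = np.sum(xs * np.exp(xs)) / np.sum(xs ** 2)
    print(w_star)  # compare |e^x - w_star*x| against your hand-derived closed form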

Learning Curve

2.

(Lecture 13) Learning curves are important for us to understand the behavior of learning algorithms.

The learning curves that we have plotted in lecture 13 come from polynomial regression with squared error, and we see that the expected Ein curve is always below the expected Eout curve.

Next, we think about whether this behavior is also true in general. Consider the 0/1 error, an arbitrary non-empty hypothesis set H, and a learning algorithm A that returns one h ∈ H with the minimum Ein on any non-empty data set D. That is,

A(D) = argmin_{h∈H} Ein(h).

Assume that each example in D is generated i.i.d. from a distribution P, and define Eout(h) with respect to the distribution. How many of the following statements are always false?

• ED[Ein(A(D))] < ED[Eout(A(D))]

• ED[Ein(A(D))] = ED[Eout(A(D))]

• ED[Ein(A(D))] > ED[Eout(A(D))]

Choose the correct answer; explain your answer.

[a] 0 [b] 1 [c] 2 [d] 3

[e] 1126 (seriously?)

(Hint: Think about the optimal hypothesis h∗ = argmin_{h∈H} Eout(h).)

Noisy Virtual Examples

3.

(Lecture 13) On page 20 of Lecture 13, we discussed adding “virtual examples” (hints) to help combat overfitting. One way of generating virtual examples is to add a small noise to the input vector x ∈ R^{d+1} (including the 0-th component x_0). For each (x_1, y_1), (x_2, y_2), ..., (x_N, y_N) in our training data set, assume that we generate virtual examples (x̃_1, y_1), (x̃_2, y_2), ..., (x̃_N, y_N), where x̃_n is simply x_n + ε and the noise vector ε ∈ R^{d+1} is generated i.i.d. from a multivariate normal distribution N(0, σ^2 · I_{d+1}). Here 0 ∈ R^{d+1} denotes the all-zero vector and I_{d+1} is the identity matrix of size d + 1.

Recall that when training the linear regression model, we need to calculate X^T X first. Define the hinted input matrix

X_h = [ x_1  x_2  ...  x_N  x̃_1  x̃_2  ...  x̃_N ]^T.

What is the expected value E(X_h^T X_h), where the expectation is taken over the (Gaussian-)noise generating process above? Choose the correct answer; explain your answer.


(Note: The choices here “hint” you that the expected value is related to the matrix being inverted for regularized linear regression; see page 10 of Lecture 14. That is, data hinting “by noise” is closely related to regularization. If x contains the pixels of an image, the virtual example is a Gaussian-noise-contaminated image with the same label, e.g. https://en.wikipedia.org/wiki/Gaussian_noise. Adding such noise is a very common technique to generate virtual examples for images.)
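As a small illustration of this data-hinting idea (an optional, hedged sketch assuming a NumPy environment; the function and variable names are mine, not the handout's), the snippet below builds the hinted matrix X_h and label vector y_h by stacking the original examples with Gaussian-noise-contaminated copies:

    import numpy as np

    def make_hinted_data(X, y, sigma, seed=None):
        """Stack the original examples with Gaussian-noise virtual examples.

        X: (N, d+1) matrix whose rows are the x_n (including the x_0 component).
        y: length-N label vector.
        sigma: standard deviation of the i.i.d. noise added to every component.
        Returns (X_h, y_h) with 2N rows: the originals followed by the noisy copies.
        """
        rng = np.random.default_rng(seed)
        noise = rng.normal(0.0, sigma, size=X.shape)   # each row is a draw from N(0, sigma^2 I)
        X_h = np.vstack([X, X + noise])                # x_tilde_n = x_n + noise_n
        y_h = np.concatenate([y, y])                   # virtual examples keep the same labels
        return X_h, y_h

    # Tiny usage example with made-up numbers:
    # X = np.hstack([np.ones((5, 1)), np.random.default_rng(1).normal(size=(5, 3))])
    # y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
    # X_h, y_h = make_hinted_data(X, y, sigma=0.1, seed=0)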

4.

(Lecture 13) Following the previous problem, when training the linear regression model, we also need to calculate X^T y. Define the hinted label vector y_h = [ y^T  y^T ]^T (i.e., y stacked on top of itself). What is the expected value E(X_h^T y_h), where the expectation is taken over the (Gaussian-)noise generating process above?

Choose the correct answer; explain your answer.

[a] 2N X^T y [b] N X^T y

[c] 0 [d] X^T y [e] 2X^T y

Regularization

5.

(Lecture 14) Consider the matrix of input vectors X (as defined in Lecture 9), and assume X^T X to be invertible. That is, X^T X must be symmetric positive definite and can be decomposed as QΓQ^T, where Q is an orthogonal matrix (Q^T Q = QQ^T = I_{d+1}) and Γ is a diagonal matrix that contains the eigenvalues γ_0, γ_1, ..., γ_d of X^T X. Note that the eigenvalues must be positive.

Now, consider a feature transform Φ(x) = Q^T x. The feature transform “rotates” the original x.

After transforming each x_n to z_n = Φ(x_n), denote the new matrix of transformed input vectors as Z. That is, Z = XQ. Then, apply regularized linear regression in the Z-space (see Lecture 12).

That is, solve

min_{w∈R^{d+1}}  (1/N) ||Zw − y||^2 + (λ/N) w^T w.

Denote the optimal solution when λ = 0 as v (i.e. wlin), and the optimal solution when λ > 0 as u (i.e., wreg). What is the ratio ui/vi? Choose the correct answer; explain your answer.

[a] 1/(1 + λ)

[b] 1/(1 + λ)^2

[c] γ_i^2/(γ_i^2 + λ)

[d] γ_i/(γ_i + λ)

[e] γ_i/(γ_i^2 + λ)

(Note: All the choices are of value < 1 if λ > 0. This is the behavior of weight “decay”: wreg is shorter than wlin. That is why the L2-regularizer is also called the weight-decay regularizer.)
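As a side illustration of the weight-decay behavior mentioned in the note (an optional, hedged sketch on made-up data; it only demonstrates that the regularized solution has a smaller norm, not the exact ratio asked above):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 50, 5
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # rows are x_n with x_0 = 1
    y = rng.normal(size=N)

    lam = 3.0
    I = np.eye(d + 1)
    w_lin = np.linalg.solve(X.T @ X, X.T @ y)             # minimizer of (1/N)||Xw - y||^2
    w_reg = np.linalg.solve(X.T @ X + lam * I, X.T @ y)   # minimizer with the (lam/N) w^T w term added

    print(np.linalg.norm(w_reg) < np.linalg.norm(w_lin))  # weight decay: expected to print True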


6.

(Lecture 14) Consider a one-dimensional data set {(x_n, y_n)}_{n=1}^N where each x_n ∈ R and y_n ∈ R.

Then, solve the following one-variable regularized linear regression problem:

min_{w∈R}  (1/N) Σ_{n=1}^N (w · x_n − y_n)^2 + (λ/N) w^2.

If the optimal solution to the problem above is w∗, it can be shown that w∗ is also the optimal solution of

min_{w∈R}  (1/N) Σ_{n=1}^N (w · x_n − y_n)^2   subject to   w^2 ≤ C

with C = (w∗)^2. This allows us to express the relationship between C in the constrained optimization problem and λ in the augmented optimization problem for any λ > 0. What is the relationship? Choose the correct answer; explain your answer.

[a] C = ( (Σ_{n=1}^N x_n y_n) / (Σ_{n=1}^N x_n^2 + λ) )^2

[b] C = ( (Σ_{n=1}^N y_n^2) / (Σ_{n=1}^N x_n^2 + λ) )^2

[c] C = ( (Σ_{n=1}^N x_n^2 y_n^2) / (Σ_{n=1}^N x_n^2 + λ) )^2

[d] C = ( (Σ_{n=1}^N x_n y_n) / (Σ_{n=1}^N y_n^2 + λ) )^2

[e] C = ( (Σ_{n=1}^N x_n^2) / (Σ_{n=1}^N y_n^2 + λ) )^2

(Note: All the choices hint you that a smaller λ corresponds to a bigger C.)


7.

(Lecture 14) Additive smoothing (https://en.wikipedia.org/wiki/Additive_smoothing) is a simple yet useful technique for estimating discrete probabilities. Consider the technique for estimating the head probability of a coin. Let y_1, y_2, ..., y_N denote the flip results from a coin, with y_n = 1 meaning a head and y_n = 0 meaning a tail. Additive smoothing adds 2K “virtual flips”, with K of them being head and the other K being tail. Then, the head probability is estimated by

( (Σ_{n=1}^N y_n) + K ) / (N + 2K).

The estimate can be viewed as the optimal solution of

min_{y∈R}  (1/N) Σ_{n=1}^N (y − y_n)^2 + (2K/N) Ω(y),

where Ω(y) is a “regularizer” to this estimation problem. What is Ω(y)? Choose the correct answer;

explain your answer.

[a] (y + 1)^2 [b] (y + 0.5)^2

[c] y^2 [d] (y − 0.5)^2 [e] (y − 1)^2

8.

(Lecture 14) On page 12 of Lecture 14, we mentioned that the ranges of features may affect regularization. One common technique to align the ranges of features is to consider a “scaling” transformation. Define Φ(x) = Γ^{−1} x, where Γ is a diagonal matrix with positive diagonal values γ_0, γ_1, ..., γ_d. Then, L2-regularized linear regression in the Z-space,

min_{w̃∈R^{d+1}}  (1/N) Σ_{n=1}^N ( w̃^T Φ(x_n) − y_n )^2 + (λ/N) ( w̃^T w̃ ),

is equivalent to regularized linear regression in the X-space,

min_{w∈R^{d+1}}  (1/N) Σ_{n=1}^N ( w^T x_n − y_n )^2 + (λ/N) Ω(w),

with a different regularizer Ω(w). What is Ω(w)? Choose the correct answer; explain your answer.

[a] w^T Γ w [b] w^T Γ^2 w

[c] w^T w [d] w^T Γ^{−2} w [e] w^T Γ^{−1} w


9.

(Lecture 13/14) In the previous problem, regardless of which regularizer you choose, the optimization problem is of the form

min_{w∈R^{d+1}}  (1/N) Σ_{n=1}^N ( w^T x_n − y_n )^2 + (λ/N) Σ_{i=0}^d β_i w_i^2

with positive constants β_i. We will call the problem “scaled regularization.”

Now, consider linear regression with virtual examples. That is, we add K virtual examples (x̃_1, ỹ_1), (x̃_2, ỹ_2), ..., (x̃_K, ỹ_K) to the training data set, and solve

min_{w∈R^{d+1}}  (1/(N+K)) ( Σ_{n=1}^N ( w^T x_n − y_n )^2 + Σ_{k=1}^K ( w^T x̃_k − ỹ_k )^2 ).

We will show that using some “special” virtual examples, which were claimed to be a possible way to combat overfitting in Lecture 13, is related to regularization, another possible way to combat overfitting discussed in Lecture 14.

Let X̃ = [x̃_1 x̃_2 ... x̃_K]^T, ỹ = [ỹ_1, ỹ_2, ..., ỹ_K]^T, and B be a diagonal matrix that contains β_0, β_1, β_2, ..., β_d on its diagonal. Set K = d + 1. For what X̃ and ỹ will the optimal solution of this linear regression be the same as the optimal solution of the scaled regularization problem above? Choose the correct answer; explain your answer.

[a] X̃ = λ I_K, ỹ = 0

[b] X̃ = √λ · √B, ỹ = 0

[c] X̃ = √λ · B, ỹ = 0

[d] X̃ = B, ỹ = √λ · 1

[e] X̃ = B, ỹ = λ · 1

(Note: Both Problem 3 and this problem show that data hinting is closely related to regularization.)

Leave-one-out

10.

(Lecture 15) Consider a binary classification algorithm Amajority, which returns a constant classifier that always predicts the majority class (i.e., the class with more instances in the data set that it sees). As you can imagine, the returned classifier is the best-Ein one among all constant classifiers.

For a binary classification data set with N positive examples and N negative examples, what is Eloocv(Amajority)? Choose the correct answer; explain your answer.

[a] 0 [b] 1/N

[c] 1/2 [d] (N − 1)/N [e] 1

11.

(Lecture 15) Consider the decision stump model and the data generation process mentioned in Problem 16 of Homework 2, and use the generation process to generate a data set of N examples


12.

(Lecture 15) You are given three data points: (x_1, y_1) = (3, 0), (x_2, y_2) = (ρ, 2), (x_3, y_3) = (−3, 0) with ρ ≥ 0, and a choice between two models: constant (all hypotheses are of the form h(x) = w_0) and linear (all hypotheses are of the form h(x) = w_0 + w_1 x). For which value of ρ would the two models be tied using leave-one-out cross-validation with the squared error measure? Choose the correct answer; explain your answer.

[a] √(4 + 9√6)

[b] √(16 + 81√6)

[c] √(9 + 4√6)

[d] √(36 + 16√6)

[e] √(81 + 36√6)

13.

(Lecture 15) Consider a probability distribution P(x, y) that can be used to generate examples (x, y), and suppose we generate K i.i.d. examples from the distribution as validation examples, and store them in Dval. For any fixed hypothesis h, we can show that

Variance_{Dval∼P^K}[ Eval(h) ] = ∗ · Variance_{(x,y)∼P}[ err(h(x), y) ].

Which of the following is ∗? Choose the correct answer; explain your answer.

[a] K [b] 1

[c] 1/√K

[d] 1/K [e] 1/K^2

Learning Principles

14.

(Lecture 16) In Lecture 16, we talked about the probability of fitting the data perfectly when the labels are random. For instance, page 6 of Lecture 16 shows that the probability of fitting the data perfectly with decision stumps is 2N/2^N. Consider 4 vertices of a rectangle in R^2 as input vectors x_1, x_2, x_3, x_4, and a 2D perceptron model that minimizes Ein(w) to the lowest possible value. One way to measure the power of the model is to consider four random labels y_1, y_2, y_3, y_4, each in ±1 and generated by i.i.d. fair coin flips, and then compute

E_{y_1, y_2, y_3, y_4} [ min_{w∈R^{2+1}} Ein(w) ].

For a perfect fitting, min Ein(w) will be 0; for a less perfect fitting (when the data is not linearly separable), min Ein(w) will be some non-zero value. The expectation above averages over all 16 possible combinations of y1, y2, y3, y4. What is the value of the expectation? Choose the correct answer; explain your answer.

[a] 0/64 [b] 1/64 [c] 2/64 [d] 4/64 [e] 8/64

(Note: It can be shown that 1 minus twice the expected value above is the same as the so-called empirical Rademacher complexity of 2D perceptrons. Rademacher complexity, similar to the VC dimension, is another tool to measure the complexity of a hypothesis set. If a hypothesis set shatters some data points, zero Ein can always be achieved and thus Rademacher complexity is 1; if a hypothesis set cannot shatter some data points, Rademacher complexity provides a soft measure of how “perfect” the hypothesis set is.)


15.

(Lecture 16) Consider a binary classifier g such that

P(g(x) = −1 | y = +1) = ε_+,
P(g(x) = +1 | y = −1) = ε_−.

When deploying the classifier to a test distribution with P(y = +1) = P(y = −1) = 1/2, we get Eout(g) = (1/2) ε_+ + (1/2) ε_−. Now, if we deploy the classifier to another test distribution with P(y = +1) = p instead of 1/2, the Eout(g) under this test distribution will then change to a different value. Note that under this test distribution, a constant classifier gc that always predicts +1 will suffer from Eout(gc) = (1 − p), as it errs on all the negative examples. At what p, if its value is between [0, 1], will our binary classifier g be as good as (or as bad as) the constant classifier gc in terms of Eout? Choose the correct answer; explain your answer.

[a] p = (1 − ε_−) / (ε_+ − ε_− + 1)

[b] p = (1 − ε_−) / (ε_− − ε_+ + 1)

[c] p = (1 − ε_+) / (ε_− − ε_+ + 1)

[d] p = (1 − ε_+) / (ε_+ − ε_− + 1)

[e] p = 1/2

Experiments with Regularized Logistic Regression

Consider L2-regularized logistic regression with second-order polynomial transformation.

w_λ = argmin_w  (λ/N) ||w||^2 + (1/N) Σ_{n=1}^N ln( 1 + exp(−y_n w^T Φ_2(x_n)) ).

Here Φ_2 is the second-order polynomial transformation introduced on page 2 of Lecture 12 (with Q = 2), defined as

Φ_2(x) = (1, x_1, x_2, ..., x_d, x_1^2, x_1 x_2, ..., x_1 x_d, x_2^2, x_2 x_3, ..., x_2 x_d, ..., x_d^2).

Given that d = 6 in the following data sets, your Φ_2(x) should be of 28 dimensions (including the constant dimension).
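To make the transform concrete, here is a minimal sketch of one possible implementation of Φ_2 for these data sets (the function name is mine, and it assumes x is already loaded as a length-6 array; check the data-file format yourself):

    import numpy as np

    def phi2(x):
        """Second-order polynomial transform: (1, x_1, ..., x_d, then all x_i*x_j with i <= j)."""
        x = np.asarray(x, dtype=float)
        d = x.shape[0]                      # d = 6 for the homework data sets
        feats = [1.0]                       # the constant dimension
        feats.extend(x)                     # linear terms x_1, ..., x_d
        for i in range(d):                  # quadratic terms, in the order listed above
            for j in range(i, d):
                feats.append(x[i] * x[j])
        return np.array(feats)

    # For d = 6 this yields 1 + 6 + 21 = 28 dimensions:
    # print(phi2(np.zeros(6)).shape)        # (28,)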

Next, we will take the following file as our training data set D:

http://www.csie.ntu.edu.tw/~htlin/course/ml20fall/hw4/hw4_train.dat and the following file as our test data set for evaluating Eout:

http://www.csie.ntu.edu.tw/~htlin/course/ml20fall/hw4/hw4_test.dat

We call the algorithm for solving the problem above A_λ. The problem guides you to use LIBLINEAR (https://www.csie.ntu.edu.tw/~cjlin/liblinear/), a machine learning package developed at our university, to solve this problem. In addition to using the default options, when running LIBLINEAR you need to

• set option -s 0, which corresponds to solving regularized logistic regression

• set option -c C, with a parameter value of C calculated from the λ that you want to use; read

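Purely as a hedged illustration of the mechanics of calling LIBLINEAR (not of how C should be computed from λ, which is part of the problem), here is a minimal sketch using the liblinearutil Python interface on made-up data; the feature dictionaries and the C value are placeholders:

    from liblinearutil import train, predict

    # Tiny synthetic example; replace with your Phi2-transformed homework data.
    y = [+1, +1, -1, -1]
    x = [{1: 1.0, 2: 0.5}, {1: 0.9, 2: 0.4}, {1: -1.0, 2: -0.5}, {1: -0.8, 2: -0.6}]

    # -s 0 selects L2-regularized logistic regression; 1.0 below is a placeholder for
    # the C value that corresponds to your chosen lambda.
    model = train(y, x, '-s 0 -c 1.0')
    p_labels, p_acc, p_vals = predict(y, x, model)   # p_acc[0] is the accuracy in percent
    print('0/1 error:', 1.0 - p_acc[0] / 100.0)

    # Command-line equivalent (with data files written in LIBLINEAR's "label index:value ..." format):
    #   train -s 0 -c <C> transformed_train_file model_file
    #   predict transformed_test_file model_file prediction_file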

16.

Select the best λ∗ in a cheating manner as

argmin_{log_10 λ ∈ {−4, −2, 0, 2, 4}}  Eout(w_λ).

Break the tie, if any, by selecting the largest λ. What is log_10(λ∗)? Choose the closest answer;

provide your command/code.

[a] −4 [b] −2 [c] 0 [d] 2 [e] 4

17.

Select the best λ∗ as

argmin_{log_10 λ ∈ {−4, −2, 0, 2, 4}}  Ein(w_λ).

Break the tie, if any, by selecting the largest λ. What is log_10(λ∗)? Choose the closest answer;

provide your command/code.

[a] −4 [b] −2 [c] 0 [d] 2 [e] 4

18.

Now split the given training examples in D into two sets: the first 120 examples as Dtrain and the remaining 80 as Dval. (Ideally, you should do the 120/80 split randomly. Because the given examples are already randomly permuted, however, we use a fixed split for the purpose of this problem.) Run A_λ on only Dtrain to get w_λ^− (the weight vector within the g^− returned), and validate w_λ^− with Dval to get Eval(w_λ^−). Select the best λ∗ as

argmin_{log_10 λ ∈ {−4, −2, 0, 2, 4}}  Eval(w_λ^−).

Break the tie, if any, by selecting the largest λ. Then, estimate Eout(w_{λ∗}^−) with the test set. What is the value of Eout(w_{λ∗}^−)? Choose the closest answer; provide your command/code.

[a] 0.10 [b] 0.11 [c] 0.12 [d] 0.13 [e] 0.14

19.

For the λ∗ selected in the previous problem, compute w_{λ∗} by running A_{λ∗} with the full training set D. Then, estimate Eout(w_{λ∗}) with the test set. What is the value of Eout(w_{λ∗})? Choose the closest answer; provide your command/code.

[a] 0.10 [b] 0.11 [c] 0.12 [d] 0.13 [e] 0.14


20.

Now split the given training examples in D into five folds, the first 40 examples being fold 1, the next 40 being fold 2, and so on. Again, we take a fixed split because the given examples are already randomly permuted. Select the best λ∗ as

argmin_{log_10 λ ∈ {−4, −2, 0, 2, 4}}  Ecv(A_λ).

Break the tie, if any, by selecting the largest λ. What is the value of Ecv(A_{λ∗})? Choose the closest answer; provide your command/code.

[a] 0.10 [b] 0.11 [c] 0.12 [d] 0.13 [e] 0.14
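As a hedged sketch of the fixed splits described in Problems 18 and 20 (assuming the 200 training examples are already loaded, in the given order, into Python lists y and x; the dummy data below only makes the snippet self-contained):

    # Dummy stand-ins for the 200 loaded training examples, in the given order.
    y = [1.0] * 200
    x = [{1: float(i)} for i in range(200)]

    # Fixed 120/80 split for Problem 18.
    y_train, x_train = y[:120], x[:120]
    y_val, x_val = y[120:], x[120:]

    # Fixed five folds of 40 consecutive examples each for Problem 20; for fold k used as
    # validation, the remaining four folds form the training set.
    folds = [(y[k * 40:(k + 1) * 40], x[k * 40:(k + 1) * 40]) for k in range(5)]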
