
Homework #3

RELEASE DATE: 10/30/2020 RED BUG FIX: 11/04/2020 09:45 BLUE BUG FIX: 11/12/2020 16:30

DUE DATE: 11/20 (THREE WEEKS, YEAH!!), BEFORE 13:00 on Gradescope. QUESTIONS ARE WELCOMED ON THE NTU COOL FORUM.

We will instruct you on how to use Gradescope to upload your choices and your scanned/printed solutions.

For problems marked with (*), please follow the guidelines on the course website and upload your source code to Gradescope as well. You are encouraged to (but not required to) include a README to help the TAs check your source code. Any programming language/platform is allowed.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for such misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English or Chinese with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

This homework set comes with 400 points. For each problem, there is one correct choice.

For most of the problems, if you choose the correct answer, you get 20 points; if you choose an incorrect answer, you get −10 points. That is, the expected value of random guessing is −4 per problem, and if you can eliminate two of the choices accurately, the expected value of random guessing on the remaining three choices would be 0 per problem. For other problems, the TAs will check your solution in terms of the written explanations and/or code. The solution will be given points between [−20, 20] based on how logical your solution is.

Linear Regression

1.

Consider a noisy target y = w_f^T x + ε, where x ∈ R^{d+1} (including the added coordinate x_0 = 1), y ∈ R, w_f ∈ R^{d+1} is an unknown vector, and ε is an i.i.d. noise term with zero mean and variance σ^2. Assume that we run linear regression on a training data set D = {(x_1, y_1), . . . , (x_N, y_N)} generated i.i.d. from some P(x) and the noise process above, and obtain the weight vector w_lin. As briefly discussed in Lecture 9, it can be shown that the expected in-sample error E_in(w_lin) with respect to D is given by:

E_D[E_in(w_lin)] = σ^2 (1 − (d+1)/N).

For σ = 0.1 and d = 11, what is the smallest number of examples N such that E_D[E_in(w_lin)] is no less than 0.006? Choose the correct answer; explain your answer.

[a] 25 [b] 30 [c] 35 [d] 40 [e] 45

2.

As shown in Lecture 9, minimizing E_in(w) for linear regression means solving ∇E_in(w) = 0, which in turn means solving the so-called normal equation

X^T X w = X^T y.

Which of the following statements about the normal equation is correct for any features X and labels y? Choose the correct answer; explain your answer.

[a] There exists at least one solution for the normal equation.

[b] If there exists a solution for the normal equation, E_in(w) = 0 at such a solution.

[c] If there exists a unique solution for the normal equation, E_in(w) = 0 at the solution.

[d] If E_in(w) = 0 at some w, there exists a unique solution for the normal equation.

[e] none of the other choices

3.

In Lecture 9, we introduced the hat matrix H = XX† for linear regression, where X† is the pseudo-inverse of X. The matrix projects the label vector y to the “predicted” vector ŷ = Hy and helps us analyze the error of linear regression.

Assume that X^T X is invertible, which makes H = X(X^T X)^{−1} X^T. Now, consider the following operations on X. Which operation can possibly change H? Choose the correct answer; explain your answer.

[a] multiplying the whole matrix X by 2 (which is equivalent to scaling all input vectors by 2)

[b] multiplying the i-th column of X by i (which is equivalent to scaling the i-th feature by i)

[c] multiplying the n-th row of X by 1/n (which is equivalent to scaling the n-th example by 1/n)

[d] adding three randomly-chosen columns i, j, k to column 1 of X (i.e., x_{n,1} ← x_{n,1} + x_{n,i} + x_{n,j} + x_{n,k})

[e] none of the other choices (i.e. all other choices are guaranteed to keep H unchanged.)

Likelihood and Maximum Likelihood

4.

Consider a coin with an unknown head probability θ. Independently flip this coin N times to get y_1, y_2, . . . , y_N, where y_n = 1 if the n-th flip results in head, and 0 otherwise. Define ν = (1/N) Σ_{n=1}^N y_n. How many of the following statements about ν are true? Choose the correct answer; explain your answer by illustrating why those statements are true.

• Pr(|ν − θ| > ε) ≤ 2 exp(−2ε^2 N) for all N ∈ N and ε > 0.

• ν maximizes likelihood(θ̂) over all θ̂ ∈ [0, 1].

• ν minimizes E_in(ŷ) = (1/N) Σ_{n=1}^N (ŷ − y_n)^2 over all ŷ ∈ R.

• 2 · ν is the negative gradient direction −∇E_in(ŷ) at ŷ = 0.

(Note: θ is similar to the role of the “target function” and θ̂ is similar to the role of the “hypothesis” in our machine learning framework.)

[a] 0 [b] 1 [c] 2 [d] 3 [e] 4

5.

Let y_1, y_2, . . . , y_N be N values generated i.i.d. from a uniform distribution on [0, θ] with some unknown θ. For any θ̂ ≥ max(y_1, y_2, . . . , y_N), what is its likelihood? Choose the correct answer; explain your answer.

[a] (1/θ̂)^N

[b] Σ_{n=1}^N (y_n/θ̂)

[c] Π_{n=1}^N (y_n/θ̂)

[d] max(y_1, . . . , y_N)/θ̂

[e] min(y_1, . . . , y_N)/θ̂

(Hint: Those who are interested in more math [who isn’t? :-)] are encouraged to try to derive the maximum-likelihood estimator.)

Gradient and Stochastic Gradient Descent

6.

In the perceptron learning algorithm, we find one example (x_{n(t)}, y_{n(t)}) that the current weight vector w_t mis-classifies, and then update w_t by

w_{t+1} ← w_t + y_{n(t)} x_{n(t)}.

A variant of the algorithm finds all examples (x_n, y_n) that the weight vector w_t mis-classifies (i.e. y_n ≠ sign(w_t^T x_n)), and then updates w_t by

w_{t+1} ← w_t + (η/N) Σ_{n : y_n ≠ sign(w_t^T x_n)} y_n x_n.

The variant can be viewed as optimizing some E_in(w) that is composed of one of the following point-wise error functions with fixed-learning-rate gradient descent (neglecting any non-differentiable spots of E_in). What is the error function? Choose the correct answer; explain your answer.

[a] err(w, x, y) = |1 − y w^T x|

[b] err(w, x, y) = max(0, −y w^T x)

[c] err(w, x, y) = −y w^T x

[d] err(w, x, y) = min(0, −y w^T x)

[e] err(w, x, y) = max(0, 1 − y w^T x)
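For concreteness, here is a minimal numpy sketch of the variant's update described above; the function and variable names are illustrative assumptions, not anything required by the homework.

    import numpy as np

    def variant_update(w, X, y, eta):
        # One step of the variant: w <- w + (eta/N) * sum over mis-classified n of y_n x_n.
        N = len(y)
        pred = np.sign(X @ w)
        pred[pred == 0] = -1                  # assumed convention: treat sign(0) as -1
        mask = (pred != y)                    # mis-classified examples
        return w + (eta / N) * (y[mask] @ X[mask])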

7.

Besides the error functions introduced in the lectures so far, the following error function, exponential error, is also widely used by some learning models. The exponential error is defined by err_exp(w, x, y) = exp(−y w^T x). If we want to use stochastic gradient descent to minimize an E_in(w) that is composed of the error function, which of the following is the update direction −∇ err_exp(w, x_n, y_n) for the chosen (x_n, y_n) with respect to w_t? Choose the correct answer; explain your answer.

[a] +y_n x_n exp(−y_n w^T x_n)

[b] −y_n x_n exp(−y_n w^T x_n)

[c] +x_n exp(−y_n w^T x_n)

[d] −x_n exp(−y_n w^T x_n)

[e] none of the other choices


Hessian and Newton Method

8.

Let E(w) : R^d → R be a function. Denote the gradient b_E(w) and the Hessian A_E(w) by

b_E(w) = ∇E(w), the d × 1 vector whose i-th component is ∂E/∂w_i (w),

and

A_E(w), the d × d matrix whose (i, j) entry is ∂^2 E/(∂w_i ∂w_j) (w).

Then, the second-order Taylor expansion of E(w) around u is:

E(w) ≈ E(u) + b_E(u)^T (w − u) + (1/2) (w − u)^T A_E(u) (w − u).

Suppose A_E(u) is positive definite. What is the optimal direction v such that w ← u + v minimizes the right-hand side of the Taylor expansion above? Choose the correct answer; explain your answer. (Note that iterative optimization with v is generally called Newton's method.)

[a] +(A_E(u))^{−1} b_E(u)

[b] −(A_E(u))^{−1} b_E(u)

[c] +(A_E(u))^{+1} b_E(u)

[d] −(A_E(u))^{+1} b_E(u)

[e] none of the other choices

9.

Following the previous problem, consider minimizing E_in(w) in the linear regression problem with Newton's method. For any given w_t, what is the Hessian A_E(w_t) with E = E_in? Choose the correct answer; explain your answer.

[a] (2/N) X^T X w_t w_t^T

[b] (2/N) X^T X

[c] (2/N) X X^T

[d] (2/N) X^T y y^T X

[e] none of the other choices


Multinomial Logistic Regression

10.

In Lecture 11, we solve multiclass classification by OVA or OVO decompositions. One alternative to deal with multiclass classification is to extend the original logistic regression model to Multinomial Logistic Regression (MLR). For a K-class classification problem, we will denote the output space Y = {1, 2, · · · , K}. The hypotheses considered by MLR can be indexed by a matrix

W = [ w_1  w_2  · · ·  w_k  · · ·  w_K ] ∈ R^{(d+1) × K},

whose columns are the weight vectors (w_1, · · · , w_K), each of length d + 1. The matrix represents a hypothesis

h_y(x) = exp(w_y^T x) / Σ_{i=1}^K exp(w_i^T x)

that can be used to approximate the target distribution P(y|x) for any (x, y). MLR then seeks the maximum-likelihood solution over all such hypotheses. For a given data set {(x_1, y_1), . . . , (x_N, y_N)} generated i.i.d. from some P(x) and target distribution P(y|x), the likelihood of h_y(x) is proportional to Π_{n=1}^N h_{y_n}(x_n). That is, minimizing the negative log likelihood is equivalent to minimizing an E_in(W) that is composed of the following error function

err(W, x, y) = − ln h_y(x) = − Σ_{k=1}^K ⟦y = k⟧ ln h_k(x).

When minimizing E_in(W) with SGD, we need to compute ∂ err(W, x, y)/∂W_{ik}. What is the value of the partial derivative? Choose the correct answer; explain your answer.

[a] (h_k(x) + ⟦y = k⟧) x_i

[b] (h_k(x) − ⟦y = k⟧) x_i

[c] (−h_k(x) + ⟦y = k⟧) x_i

[d] (−h_k(x) − ⟦y = k⟧) x_i

[e] none of the other choices
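For intuition, here is a minimal numpy sketch of the MLR hypothesis h_k(x) defined above; the names and the max-subtraction step (for numerical stability only) are illustrative assumptions.

    import numpy as np

    def mlr_hypothesis(W, x):
        # Return (h_1(x), ..., h_K(x)) for a (d+1) x K weight matrix W and input x.
        scores = W.T @ x                      # w_k^T x for k = 1, ..., K
        scores = scores - scores.max()        # shift for numerical stability; ratios unchanged
        expo = np.exp(scores)
        return expo / expo.sum()              # exp(w_k^T x) / sum_i exp(w_i^T x)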

11.

Following the previous problem, consider a data set with K = 2 and obtain the optimal solution from MLR as (w_1, w_2). Now, relabel the same data set by replacing y_n with y'_n = 2y_n − 3 to form a binary classification data set. Which of the following is an optimal solution for logistic regression on the binary classification data set? Choose the correct answer; explain your answer.

[a] w_2 + w_1

[b] w_1 − w_2

[c] (1/2)(w_2 − w_1)

[d] 2(w_1 − w_2)

[e] w_2 − w_1


Nonlinear Transformation

12.

Given the following training data set:

x_1 = (0, 1), y_1 = −1
x_2 = (1, −0.5), y_2 = −1
x_3 = (−1, 0), y_3 = −1
x_4 = (−1, 2), y_4 = +1
x_5 = (2, 0), y_5 = +1
x_6 = (1, −1.5), y_6 = +1
x_7 = (0, −2), y_7 = +1

Using the quadratic transform Φ_2(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2), which of the following weights w̃^T in the Z-space can separate all of the training data correctly? Choose the correct answer; (no, you don't need to explain your answer :-)).

[a] [−9, −1, 0, 2, −2, 3]

[b] [−5, −1, 2, 3, −7, 2]

[c] [9, −1, 4, 2, −2, 3]

[d] [2, 1, −4, −2, 7, −4]

[e] [−7, 0, 0, 2, −2, 3]
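If you wish to check a candidate w̃ mechanically, a minimal sketch along these lines (illustrative names only) applies Φ_2 to the training points above and tests the signs:

    import numpy as np

    X = np.array([(0, 1), (1, -0.5), (-1, 0), (-1, 2), (2, 0), (1, -1.5), (0, -2)], dtype=float)
    y = np.array([-1, -1, -1, +1, +1, +1, +1])

    def phi2(x):
        # Quadratic transform (1, x1, x2, x1^2, x1*x2, x2^2).
        x1, x2 = x
        return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

    def separates(w_tilde):
        # True if sign(w_tilde^T phi2(x_n)) equals y_n for every training example.
        Z = np.array([phi2(x) for x in X])
        return bool(np.all(np.sign(Z @ w_tilde) == y))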

13.

Consider the following feature transform, which maps x ∈ R^d to z ∈ R^{1+1}, keeping only the k-th coordinate of x: Φ_(k)(x) = (1, x_k). Let H_k be the set of hypotheses that couples Φ_(k) with perceptrons. Among the following choices, which is the tightest upper bound of d_vc(∪_{k=1}^d H_k) for d ≥ 4? Choose the correct answer; explain your answer. (Hint: You can use the fact that log_2 d ≤ d/2 for d ≥ 4 if needed.)

[a] 2((log_2 log_2 d) + 1)

[b] 2((log_2 d) + 1)

[c] 2((d log_2 d) + 1)

[d] 2(d + 1)

[e] 2(d^2 + 1)

Experiments with Linear and Nonlinear Models

Next, we will play with linear regression, logistic regression, non-linear transform, and their use for binary classification. Please use the following set for training:

https://www.csie.ntu.edu.tw/~htlin/course/ml20fall/hw3/hw3_train.dat and the following set for testing (estimating E_out):

https://www.csie.ntu.edu.tw/~htlin/course/ml20fall/hw3/hw3_test.dat

Each line of the data set contains one (x_n, y_n) with x_n ∈ R^{10}. The first 10 numbers of the line contain the components of x_n in order, and the last number is y_n, which belongs to {−1, +1} ⊆ R. That is, we can use those y_n for either binary classification or regression.
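Assuming the files are plain whitespace-separated text with the 10 feature values followed by the label on each line (as described above), a minimal loading sketch is:

    import numpy as np

    def load_data(path):
        # Each row: 10 feature values followed by a +/-1 label.
        data = np.loadtxt(path)
        return data[:, :-1], data[:, -1]

    # Example usage (assuming the files were downloaded locally):
    # X_train, y_train = load_data("hw3_train.dat")
    # X_test,  y_test  = load_data("hw3_test.dat")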

14.

(*) Add x_{n,0} = 1 to each x_n. Then, implement the linear regression algorithm on page 11 of Lecture 9. What is E_in^{sqr}(w_lin), where E_in^{sqr} denotes the averaged squared error over N examples?

Choose the closest answer; provide your code.

[a] 0.00 [b] 0.20 [c] 0.40

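One possible implementation sketch for Problem 14, assuming numpy's pseudo-inverse and the load_data helper sketched above; this is illustrative, not the required solution:

    import numpy as np

    def linear_regression(X, y):
        # w_lin = pseudo-inverse(X) y, the analytic least-squares solution.
        return np.linalg.pinv(X) @ y

    def squared_error(X, y, w):
        # Averaged squared error over the N examples.
        return float(np.mean((X @ w - y) ** 2))

    # X0 = np.hstack([np.ones((len(X_train), 1)), X_train])   # add x_{n,0} = 1
    # w_lin = linear_regression(X0, y_train)
    # print(squared_error(X0, y_train, w_lin))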

15.

(*) Add x_{n,0} = 1 to each x_n. Then, implement the SGD algorithm for linear regression using the results on pages 10 and 12 of Lecture 11. Pick one example uniformly at random in each iteration, take η = 0.001 and initialize w with w_0 = 0. Run the algorithm until E_in^{sqr}(w_t) ≤ 1.01 E_in^{sqr}(w_lin), and record the total number of iterations taken. Repeat the experiment 1000 times, each with a different random seed. What is the average number of iterations over the 1000 experiments?

Choose the closest answer; provide your code.

[a] 600 [b] 1200 [c] 1800 [d] 2400 [e] 3000
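A minimal sketch of the SGD loop for Problem 15, assuming the per-example squared-error gradient 2(w^T x_n − y_n) x_n; variable names and the stopping check are illustrative:

    import numpy as np

    def sgd_linreg_iterations(X, y, w_lin, eta=0.001, rng=None):
        # Run SGD from w = 0 until E_in^sqr(w) <= 1.01 * E_in^sqr(w_lin); return iteration count.
        rng = np.random.default_rng() if rng is None else rng
        target = 1.01 * np.mean((X @ w_lin - y) ** 2)
        w = np.zeros(X.shape[1])
        t = 0
        while np.mean((X @ w - y) ** 2) > target:
            n = rng.integers(len(y))
            w = w + eta * 2.0 * (y[n] - X[n] @ w) * X[n]    # stochastic gradient step
            t += 1
        return t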

16.

(*) Add x_{n,0} = 1 to each x_n. Then, implement the SGD algorithm for logistic regression by replacing the SGD update step in the previous problem with the one on page 10 of Lecture 11. Pick one example uniformly at random in each iteration, take η = 0.001 and initialize w with w_0 = 0. Run the algorithm for 500 iterations. Repeat the experiment 1000 times, each with a different random seed. What is the average E_in^{ce}(w_500) over the 1000 experiments, where E_in^{ce} denotes the averaged cross-entropy error over N examples? Choose the closest answer; provide your code.

[a] 0.44 [b] 0.50 [c] 0.56 [d] 0.62 [e] 0.68
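A minimal sketch of the logistic-regression SGD step and the cross-entropy error, assuming the update w ← w + η θ(−y_n w^T x_n) y_n x_n with θ(s) = 1/(1 + exp(−s)); illustrative only (Problem 17 differs only in the initial w):

    import numpy as np

    def theta(s):
        # Logistic function.
        return 1.0 / (1.0 + np.exp(-s))

    def cross_entropy_error(X, y, w):
        # Averaged cross-entropy error: (1/N) sum_n ln(1 + exp(-y_n w^T x_n)).
        return float(np.mean(np.log(1.0 + np.exp(-y * (X @ w)))))

    def sgd_logreg(X, y, w0, eta=0.001, T=500, rng=None):
        # Run T iterations of logistic-regression SGD starting from w0.
        rng = np.random.default_rng() if rng is None else rng
        w = w0.copy()
        for _ in range(T):
            n = rng.integers(len(y))
            w = w + eta * theta(-y[n] * (X[n] @ w)) * y[n] * X[n]
        return w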

17.

(*) Repeat the previous problem, but with w initialized by w_0 = w_lin of Problem 14 instead. Repeat the experiment 1000 times, each with a different random seed. What is the average E_in^{ce}(w_500) over the 1000 experiments? Choose the closest answer; provide your code.

[a] 0.44 [b] 0.50 [c] 0.56 [d] 0.62 [e] 0.68

18.

(*) Following Problem 14, what is |E_in^{0/1}(w_lin) − E_out^{0/1}(w_lin)|, where 0/1 denotes the 0/1 error (i.e. using w_lin for binary classification), and E_out^{0/1} is estimated using the test set provided above?

Choose the closest answer; provide your code.

[a] 0.32 [b] 0.36 [c] 0.40 [d] 0.44 [e] 0.48
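A minimal sketch of the 0/1 error used in Problems 18–20 (fraction of sign disagreements); the tie-breaking convention for sign(0) is an assumption:

    import numpy as np

    def zero_one_error(X, y, w):
        # Fraction of examples with sign(w^T x_n) != y_n.
        pred = np.sign(X @ w)
        pred[pred == 0] = -1          # assumed convention for sign(0)
        return float(np.mean(pred != y))

    # gap = abs(zero_one_error(X0_train, y_train, w_lin) - zero_one_error(X0_test, y_test, w_lin))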


19.

(*) Next, consider the following homogeneous order-Q polynomial transform Φ(x) = (1, x_1, x_2, . . . , x_{10}, x_1^2, x_2^2, . . . , x_{10}^2, . . . , x_1^Q, x_2^Q, . . . , x_{10}^Q).

Transform the training and testing data according to Φ(x) with Q = 3, and again implement the linear regression algorithm on page 11 of Lecture 9. What is |E_in^{0/1}(g) − E_out^{0/1}(g)|, where g is the hypothesis returned by the transform + linear regression procedure? Choose the closest answer; provide your code.

[a] 0.32 [b] 0.36 [c] 0.40 [d] 0.44 [e] 0.48
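A minimal sketch of the order-Q transform above (the constant 1 followed by x_i^q for q = 1, ..., Q); illustrative only:

    import numpy as np

    def poly_transform(X, Q):
        # Map each row x in R^10 to (1, x_1..x_10, x_1^2..x_10^2, ..., x_1^Q..x_10^Q).
        cols = [np.ones((len(X), 1))]
        for q in range(1, Q + 1):
            cols.append(X ** q)
        return np.hstack(cols)

    # Z_train = poly_transform(X_train, Q=3)    # then run linear regression as in Problem 14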

20.

(*) Repeat the previous problem, but with Q = 10 instead. What is |E_in^{0/1}(g) − E_out^{0/1}(g)|? Choose the closest answer; provide your code.

[a] 0.32 [b] 0.36 [c] 0.40 [d] 0.44 [e] 0.48
