
Homework #3

RELEASE DATE: 10/30/2020 RED BUG FIX: 11/04/2020 09:45 BLUE BUG FIX: 11/12/2020 16:30

DUE DATE: 11/20 (THREE WEEKS, YEAH!!), BEFORE 13:00 on Gradescope. QUESTIONS ARE WELCOMED ON THE NTU COOL FORUM.

We will instruct you on how to use Gradescope to upload your choices and your scanned/printed solutions.

For problems marked with (*), please follow the guidelines on the course website and upload your source code to Gradescope as well. You are encouraged to (but not required to) include a README to help the TAs check your source code. Any programming language/platform is allowed.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for such misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English or Chinese with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

This homework set comes with 400 points. For each problem, there is one correct choice.

For most of the problems, if you choose the correct answer, you get 20 points; if you choose an incorrect answer, you get −10 points. That is, the expected value of random guessing is −4 per problem, and if you can eliminate two of the choices accurately, the expected value of random guessing on the remaining three choices would be 0 per problem. For other problems, the TAs will check your solution in terms of the written explanations and/or code. The solution will be given points between [−20, 20] based on how logical your solution is.

Linear Regression

1.

Consider a noisy target y = w_f^T x + ε, where x ∈ R^{d+1} (including the added coordinate x_0 = 1), y ∈ R, w_f ∈ R^{d+1} is an unknown vector, and ε is an i.i.d. noise term with zero mean and variance σ^2. Assume that we run linear regression on a training data set D = {(x_1, y_1), . . . , (x_N, y_N)} generated i.i.d. from some P(x) and the noise process above, and obtain the weight vector w_lin. As briefly discussed in Lecture 9, it can be shown that the expected in-sample error E_in(w_lin) with respect to D is given by:

E_D[E_in(w_lin)] = σ^2 (1 − (d+1)/N).

For σ = 0.1 and d = 11, what is the smallest number of examples N such that E_D[E_in(w_lin)] is no less than 0.006? Choose the correct answer; explain your answer.

[a] 25 [b] 30 [c] 35 [d] 40 [e] 45

2.

As shown in Lecture 9, minimizing E_in(w) for linear regression means solving ∇E_in(w) = 0, which in turn means solving the so-called normal equation

X^T X w = X^T y.

Which of the following statements about the normal equation is correct for any features X and labels y? Choose the correct answer; explain your answer.

[a] There exists at least one solution for the normal equation.

[b] If there exists a solution for the normal equation, E_in(w) = 0 at such a solution.

[c] If there exists a unique solution for the normal equation, E_in(w) = 0 at the solution.

[d] If E_in(w) = 0 at some w, there exists a unique solution for the normal equation.

[e] none of the other choices

3.

In Lecture 9, we introduced the hat matrix H = XX† for linear regression, where X† is the pseudo-inverse of X. The matrix projects the label vector y to the “predicted” vector ŷ = Hy and helps us analyze the error of linear regression.

Assume that X^T X is invertible, which makes H = X(X^T X)^{−1} X^T. Now, consider the following operations on X. Which operation can possibly change H? Choose the correct answer; explain your answer.

[a] multiplying the whole matrix X by 2 (which is equivalent to scaling all input vectors by 2)

[b] multiplying the i-th column of X by i (which is equivalent to scaling the i-th feature by i)

[c] multiplying the n-th row of X by 1/n (which is equivalent to scaling the n-th example by 1/n)

[d] adding three randomly-chosen columns i, j, k to column 1 of X (i.e., x_{n,1} ← x_{n,1} + x_{n,i} + x_{n,j} + x_{n,k})

[e] none of the other choices (i.e. all other choices are guaranteed to keep H unchanged.)

Likelihood and Maximum Likelihood

4.

Consider a coin with an unknown head probability θ. Independently flip this coin N times to get y_1, y_2, . . . , y_N, where y_n = 1 if the n-th flip results in head, and 0 otherwise. Define ν = (1/N) Σ_{n=1}^N y_n. How many of the following statements about ν are true? Choose the correct answer; explain your answer by illustrating why those statements are true.

• Pr(|ν − θ| > ε) ≤ 2 exp(−2ε^2 N) for all N ∈ N and ε > 0.

• ν maximizes likelihood(θ̂) over all θ̂ ∈ [0, 1].

• ν minimizes E_in(ŷ) = (1/N) Σ_{n=1}^N (ŷ − y_n)^2 over all ŷ ∈ R.

• 2 · ν is the negative gradient direction −∇E_in(ŷ) at ŷ = 0.

(Note: θ is similar to the role of the “target function” and θ̂ is similar to the role of the “hypothesis” in our machine learning framework.)

[a] 0 [b] 1 [c] 2 [d] 3 [e] 4

5.

Let y_1, y_2, . . . , y_N be N values generated i.i.d. from a uniform distribution on [0, θ] with some unknown θ. For any θ̂ ≥ max(y_1, y_2, . . . , y_N), what is its likelihood? Choose the correct answer; explain your answer.

[a] (1/θ̂)^N

[b] Σ_{n=1}^N (y_n/θ̂)

[c] Π_{n=1}^N (y_n/θ̂)

[d] max(y_1, . . . , y_N)/θ̂

[e] min(y_1, . . . , y_N)/θ̂

(Hint: Those who are interested in more math [who isn’t? :-)] are encouraged to try to derive the maximum-likelihood estimator.)

Gradient and Stochastic Gradient Descent

6.

In the perceptron learning algorithm, we find one example (x_{n(t)}, y_{n(t)}) that the current weight vector w_t mis-classifies, and then update w_t by

w_{t+1} ← w_t + y_{n(t)} x_{n(t)}.

A variant of the algorithm finds all examples (x_n, y_n) that the weight vector w_t mis-classifies (i.e. y_n ≠ sign(w_t^T x_n)), and then updates w_t by

w_{t+1} ← w_t + (η/N) Σ_{n : y_n ≠ sign(w_t^T x_n)} y_n x_n.

The variant can be viewed as optimizing some E_in(w) that is composed of one of the following point-wise error functions with fixed-learning-rate gradient descent (neglecting any non-differentiable spots of E_in). What is the error function? Choose the correct answer; explain your answer.

[a] err(w, x, y) = |1 − y w^T x|

[b] err(w, x, y) = max(0, −y w^T x)

[c] err(w, x, y) = −y w^T x

[d] err(w, x, y) = min(0, −y w^T x)

[e] err(w, x, y) = max(0, 1 − y w^T x)
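For concreteness, here is a minimal numpy sketch of the variant's update described above; the function and variable names are illustrative assumptions, not anything required by the homework.

    import numpy as np

    def variant_update(w, X, y, eta):
        # One step of the variant: w <- w + (eta/N) * sum over mis-classified n of y_n x_n.
        N = len(y)
        pred = np.sign(X @ w)
        pred[pred == 0] = -1                  # assumed convention: treat sign(0) as -1
        mask = (pred != y)                    # mis-classified examples
        return w + (eta / N) * (y[mask] @ X[mask])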

7.

Besides the error functions introduced in the lectures so far, the following error function, exponential error, is also widely used by some learning models. The exponential error is defined by err_exp(w, x, y) = exp(−y w^T x). If we want to use stochastic gradient descent to minimize an E_in(w) that is composed of the error function, which of the following is the update direction −∇ err_exp(w, x_n, y_n) for the chosen (x_n, y_n) with respect to w_t? Choose the correct answer; explain your answer.

[a] +y_n x_n exp(−y_n w^T x_n)

[b] −y_n x_n exp(−y_n w^T x_n)

[c] +x_n exp(−y_n w^T x_n)

[d] −x_n exp(−y_n w^T x_n)

[e] none of the other choices


Hessian and Newton Method

8.

Let E(w) : R^d → R be a function. Denote the gradient b_E(w) and the Hessian A_E(w) by

b_E(w) = ∇E(w), the d × 1 vector whose i-th component is ∂E/∂w_i (w),

and

A_E(w), the d × d matrix whose (i, j) entry is ∂^2 E/(∂w_i ∂w_j) (w).

Then, the second-order Taylor expansion of E(w) around u is:

E(w) ≈ E(u) + b_E(u)^T (w − u) + (1/2) (w − u)^T A_E(u) (w − u).

Suppose A_E(u) is positive definite. What is the optimal direction v such that w ← u + v minimizes the right-hand side of the Taylor expansion above? Choose the correct answer; explain your answer. (Note that iterative optimization with v is generally called Newton's method.)

[a] +(A_E(u))^{−1} b_E(u)

[b] −(A_E(u))^{−1} b_E(u)

[c] +(A_E(u))^{+1} b_E(u)

[d] −(A_E(u))^{+1} b_E(u)

[e] none of the other choices

9.

Following the previous problem, consider minimizing E_in(w) in the linear regression problem with Newton's method. For any given w_t, what is the Hessian A_E(w_t) with E = E_in? Choose the correct answer; explain your answer.

[a] (2/N) X^T X w_t w_t^T

[b] (2/N) X^T X

[c] (2/N) X X^T

[d] (2/N) X^T y y^T X

[e] none of the other choices


Multinomial Logistic Regression

10.

In Lecture 11, we solve multiclass classification by OVA or OVO decompositions. One alternative to deal with multiclass classification is to extend the original logistic regression model to Multinomial Logistic Regression (MLR). For a K-class classification problem, we will denote the output space Y = {1, 2, · · · , K}. The hypotheses considered by MLR can be indexed by a matrix

W = [ w_1  w_2  · · ·  w_k  · · ·  w_K ] ∈ R^{(d+1) × K},

whose columns are the weight vectors (w_1, · · · , w_K), each of length d + 1. The matrix represents a hypothesis

h_y(x) = exp(w_y^T x) / Σ_{i=1}^K exp(w_i^T x)

that can be used to approximate the target distribution P(y|x) for any (x, y). MLR then seeks the maximum-likelihood solution over all such hypotheses. For a given data set {(x_1, y_1), . . . , (x_N, y_N)} generated i.i.d. from some P(x) and target distribution P(y|x), the likelihood of h_y(x) is proportional to Π_{n=1}^N h_{y_n}(x_n). That is, minimizing the negative log likelihood is equivalent to minimizing an E_in(W) that is composed of the following error function

err(W, x, y) = − ln h_y(x) = − Σ_{k=1}^K ⟦y = k⟧ ln h_k(x).

When minimizing E_in(W) with SGD, we need to compute ∂ err(W, x, y)/∂W_{ik}. What is the value of the partial derivative? Choose the correct answer; explain your answer.

[a] (h_k(x) + ⟦y = k⟧) x_i

[b] (h_k(x) − ⟦y = k⟧) x_i

[c] (−h_k(x) + ⟦y = k⟧) x_i

[d] (−h_k(x) − ⟦y = k⟧) x_i

[e] none of the other choices
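For intuition, here is a minimal numpy sketch of the MLR hypothesis h_k(x) defined above; the names and the max-subtraction step (for numerical stability only) are illustrative assumptions.

    import numpy as np

    def mlr_hypothesis(W, x):
        # Return (h_1(x), ..., h_K(x)) for a (d+1) x K weight matrix W and input x.
        scores = W.T @ x                      # w_k^T x for k = 1, ..., K
        scores = scores - scores.max()        # shift for numerical stability; ratios unchanged
        expo = np.exp(scores)
        return expo / expo.sum()              # exp(w_k^T x) / sum_i exp(w_i^T x)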

11.

Following the previous problem, consider a data set with K = 2 and obtain the optimal solution from MLR as (w_1, w_2). Now, relabel the same data set by replacing y_n with y'_n = 2y_n − 3 to form a binary classification data set. Which of the following is an optimal solution for logistic regression on the binary classification data set? Choose the correct answer; explain your answer.

[a] w_2 + w_1

[b] w_1 − w_2

[c] (1/2)(w_2 − w_1)

[d] 2(w_1 − w_2)

[e] w_2 − w_1


Nonlinear Transformation

12.

Given the following training data set:

x_1 = (0, 1), y_1 = −1
x_2 = (1, −0.5), y_2 = −1
x_3 = (−1, 0), y_3 = −1
x_4 = (−1, 2), y_4 = +1
x_5 = (2, 0), y_5 = +1
x_6 = (1, −1.5), y_6 = +1
x_7 = (0, −2), y_7 = +1

Using the quadratic transform Φ_2(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2), which of the following weights w̃^T in the Z-space can separate all of the training data correctly? Choose the correct answer; (no, you don't need to explain your answer :-)).

[a] [−9, −1, 0, 2, −2, 3]

[b] [−5, −1, 2, 3, −7, 2]

[c] [9, −1, 4, 2, −2, 3]

[d] [2, 1, −4, −2, 7, −4]

[e] [−7, 0, 0, 2, −2, 3]
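If you wish to check a candidate w̃ mechanically, a minimal sketch along these lines (illustrative names only) applies Φ_2 to the training points above and tests the signs:

    import numpy as np

    X = np.array([(0, 1), (1, -0.5), (-1, 0), (-1, 2), (2, 0), (1, -1.5), (0, -2)], dtype=float)
    y = np.array([-1, -1, -1, +1, +1, +1, +1])

    def phi2(x):
        # Quadratic transform (1, x1, x2, x1^2, x1*x2, x2^2).
        x1, x2 = x
        return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

    def separates(w_tilde):
        # True if sign(w_tilde^T phi2(x_n)) equals y_n for every training example.
        Z = np.array([phi2(x) for x in X])
        return bool(np.all(np.sign(Z @ w_tilde) == y))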

13.

Consider the following feature transform, which maps x ∈ R^d to z ∈ R^{1+1}, keeping only the k-th coordinate of x: Φ_(k)(x) = (1, x_k). Let H_k be the set of hypotheses that couples Φ_(k) with perceptrons. Among the following choices, which is the tightest upper bound of d_vc(∪_{k=1}^d H_k) for d ≥ 4? Choose the correct answer; explain your answer. (Hint: You can use the fact that log_2 d ≤ d/2 for d ≥ 4 if needed.)

[a] 2((log_2 log_2 d) + 1)

[b] 2((log_2 d) + 1)

[c] 2((d log_2 d) + 1)

[d] 2(d + 1)

[e] 2(d^2 + 1)

Experiments with Linear and Nonlinear Models

Next, we will play with linear regression, logistic regression, non-linear transform, and their use for binary classification. Please use the following set for training:

https://www.csie.ntu.edu.tw/~htlin/course/ml20fall/hw3/hw3_train.dat and the following set for testing (estimating E_out):

https://www.csie.ntu.edu.tw/~htlin/course/ml20fall/hw3/hw3_test.dat

Each line of the data set contains one (x_n, y_n) with x_n ∈ R^{10}. The first 10 numbers of the line contain the components of x_n in order, and the last number is y_n, which belongs to {−1, +1} ⊆ R. That is, we can use those y_n for either binary classification or regression.
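Assuming the files are plain whitespace-separated text with the 10 feature values followed by the label on each line (as described above), a minimal loading sketch is:

    import numpy as np

    def load_data(path):
        # Each row: 10 feature values followed by a +/-1 label.
        data = np.loadtxt(path)
        return data[:, :-1], data[:, -1]

    # Example usage (assuming the files were downloaded locally):
    # X_train, y_train = load_data("hw3_train.dat")
    # X_test,  y_test  = load_data("hw3_test.dat")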

14.

(*) Add x_{n,0} = 1 to each x_n. Then, implement the linear regression algorithm on page 11 of Lecture 9. What is E_in^{sqr}(w_lin), where E_in^{sqr} denotes the averaged squared error over N examples?

Choose the closest answer; provide your code.

[a] 0.00 [b] 0.20 [c] 0.40

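One possible implementation sketch for Problem 14, assuming numpy's pseudo-inverse and the load_data helper sketched above; this is illustrative, not the required solution:

    import numpy as np

    def linear_regression(X, y):
        # w_lin = pseudo-inverse(X) y, the analytic least-squares solution.
        return np.linalg.pinv(X) @ y

    def squared_error(X, y, w):
        # Averaged squared error over the N examples.
        return float(np.mean((X @ w - y) ** 2))

    # X0 = np.hstack([np.ones((len(X_train), 1)), X_train])   # add x_{n,0} = 1
    # w_lin = linear_regression(X0, y_train)
    # print(squared_error(X0, y_train, w_lin))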

15.

(*) Add x_{n,0} = 1 to each x_n. Then, implement the SGD algorithm for linear regression using the results on pages 10 and 12 of Lecture 11. Pick one example uniformly at random in each iteration, take η = 0.001 and initialize w with w_0 = 0. Run the algorithm until E_in^{sqr}(w_t) ≤ 1.01 E_in^{sqr}(w_lin), and record the total number of iterations taken. Repeat the experiment 1000 times, each with a different random seed. What is the average number of iterations over the 1000 experiments?

Choose the closest answer; provide your code.

[a] 600 [b] 1200 [c] 1800 [d] 2400 [e] 3000
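A minimal sketch of the SGD loop for Problem 15, assuming the per-example squared-error gradient 2(w^T x_n − y_n) x_n; variable names and the stopping check are illustrative:

    import numpy as np

    def sgd_linreg_iterations(X, y, w_lin, eta=0.001, rng=None):
        # Run SGD from w = 0 until E_in^sqr(w) <= 1.01 * E_in^sqr(w_lin); return iteration count.
        rng = np.random.default_rng() if rng is None else rng
        target = 1.01 * np.mean((X @ w_lin - y) ** 2)
        w = np.zeros(X.shape[1])
        t = 0
        while np.mean((X @ w - y) ** 2) > target:
            n = rng.integers(len(y))
            w = w + eta * 2.0 * (y[n] - X[n] @ w) * X[n]    # stochastic gradient step
            t += 1
        return t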

16.

(*) Add x_{n,0} = 1 to each x_n. Then, implement the SGD algorithm for logistic regression by replacing the SGD update step in the previous problem with the one on page 10 of Lecture 11. Pick one example uniformly at random in each iteration, take η = 0.001 and initialize w with w_0 = 0. Run the algorithm for 500 iterations. Repeat the experiment 1000 times, each with a different random seed. What is the average E_in^{ce}(w_500) over the 1000 experiments, where E_in^{ce} denotes the averaged cross-entropy error over N examples? Choose the closest answer; provide your code.

[a] 0.44 [b] 0.50 [c] 0.56 [d] 0.62 [e] 0.68
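A minimal sketch of the logistic-regression SGD step and the cross-entropy error, assuming the update w ← w + η θ(−y_n w^T x_n) y_n x_n with θ(s) = 1/(1 + exp(−s)); illustrative only (Problem 17 differs only in the initial w):

    import numpy as np

    def theta(s):
        # Logistic function.
        return 1.0 / (1.0 + np.exp(-s))

    def cross_entropy_error(X, y, w):
        # Averaged cross-entropy error: (1/N) sum_n ln(1 + exp(-y_n w^T x_n)).
        return float(np.mean(np.log(1.0 + np.exp(-y * (X @ w)))))

    def sgd_logreg(X, y, w0, eta=0.001, T=500, rng=None):
        # Run T iterations of logistic-regression SGD starting from w0.
        rng = np.random.default_rng() if rng is None else rng
        w = w0.copy()
        for _ in range(T):
            n = rng.integers(len(y))
            w = w + eta * theta(-y[n] * (X[n] @ w)) * y[n] * X[n]
        return w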

17.

(*) Repeat the previous problem, but with w initialized by w_0 = w_lin of Problem 14 instead. Repeat the experiment 1000 times, each with a different random seed. What is the average E_in^{ce}(w_500) over the 1000 experiments? Choose the closest answer; provide your code.

[a] 0.44 [b] 0.50 [c] 0.56 [d] 0.62 [e] 0.68

18.

(*) Following Problem 14, what is |E_in^{0/1}(w_lin) − E_out^{0/1}(w_lin)|, where 0/1 denotes the 0/1 error (i.e. using w_lin for binary classification), and E_out^{0/1} is estimated using the test set provided above?

Choose the closest answer; provide your code.

[a] 0.32 [b] 0.36 [c] 0.40 [d] 0.44 [e] 0.48
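A minimal sketch of the 0/1 error used in Problems 18–20 (fraction of sign disagreements); the tie-breaking convention for sign(0) is an assumption:

    import numpy as np

    def zero_one_error(X, y, w):
        # Fraction of examples with sign(w^T x_n) != y_n.
        pred = np.sign(X @ w)
        pred[pred == 0] = -1          # assumed convention for sign(0)
        return float(np.mean(pred != y))

    # gap = abs(zero_one_error(X0_train, y_train, w_lin) - zero_one_error(X0_test, y_test, w_lin))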


19.

(*) Next, consider the following homogeneous order-Q polynomial transform Φ(x) = (1, x_1, x_2, . . . , x_{10}, x_1^2, x_2^2, . . . , x_{10}^2, . . . , x_1^Q, x_2^Q, . . . , x_{10}^Q).

Transform the training and testing data according to Φ(x) with Q = 3, and again implement the linear regression algorithm on page 11 of Lecture 9. What is |E_in^{0/1}(g) − E_out^{0/1}(g)|, where g is the hypothesis returned by the transform + linear regression procedure? Choose the closest answer; provide your code.

[a] 0.32 [b] 0.36 [c] 0.40 [d] 0.44 [e] 0.48
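A minimal sketch of the order-Q transform above (the constant 1 followed by x_i^q for q = 1, ..., Q); illustrative only:

    import numpy as np

    def poly_transform(X, Q):
        # Map each row x in R^10 to (1, x_1..x_10, x_1^2..x_10^2, ..., x_1^Q..x_10^Q).
        cols = [np.ones((len(X), 1))]
        for q in range(1, Q + 1):
            cols.append(X ** q)
        return np.hstack(cols)

    # Z_train = poly_transform(X_train, Q=3)    # then run linear regression as in Problem 14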

20.

(*) Repeat the previous problem, but with Q = 10 instead. What is |E_in^{0/1}(g) − E_out^{0/1}(g)|? Choose the closest answer; provide your code.

[a] 0.32 [b] 0.36 [c] 0.40 [d] 0.44 [e] 0.48
