Homework #3
RELEASE DATE: 10/28/2013
DUE DATE: extended to 11/18/2013, BEFORE NOON
QUESTIONS ABOUT HOMEWORK MATERIALS ARE WELCOMED ON THE FORUM.
Unless granted by the instructor in advance, you must turn in a printed/written copy of your solutions (without the source code) for all problems. For problems marked with (*), please follow the guidelines on the course website and upload your source code to designated places.
Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for such misconduct.
Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.
Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.
You should write your solutions in English or Chinese with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.
There are three kinds of regular problems.
• multiple-choice question (MCQ): There are several choices and only one of them is correct. You should choose one and only one.
• multiple-response question (MRQ): There are several choices and none, some, or all of them are correct. You should write down every choice that you think to be correct.
• blank-filling question (BFQ): You should write down the answer that we ask you to fill.
Some problems also come with (+ . . .) that contains additional todo items.
If there are big bonus questions (BBQ) and you choose to tackle them, please simply follow the problem guidelines to write down your solutions.
This homework set comes with 200 points and 40 bonus points. In general, every homework set of ours comes with a full credit of 200 points, with some possible bonus points.
Problems 1-2 are about linear regression
1.
(MCQ) Consider a noisy target y = w_f^T x + ε, where x ∈ R^d (with the added coordinate x_0 = 1), y ∈ R, w_f is an unknown vector, and ε is a noise term with zero mean and σ^2 variance. Assume ε is independent of x and of all other ε's. If linear regression is carried out using a training data set D = {(x_1, y_1), . . . , (x_N, y_N)}, and outputs the parameter vector w_lin, it can be shown that the expected in-sample error E_in with respect to D is given by:

E_D[E_in(w_lin)] = σ^2 (1 − (d + 1)/N)
For σ = 0.1 and d = 8, which among the following choices is the smallest number of examples N that will result in an expected Ein greater than 0.008?
[a] 10 [b] 25 [c] 100 [d] 500
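The expected-error formula above can be evaluated directly. A minimal sketch (σ and d come from the problem statement; the candidate sample sizes are the four choices):

```python
# Expected in-sample error E_D[E_in] = sigma^2 * (1 - (d+1)/N),
# evaluated for each candidate N from Problem 1.
sigma, d = 0.1, 8

def expected_ein(n):
    return sigma**2 * (1 - (d + 1) / n)

for n in (10, 25, 100, 500):
    print(n, expected_ein(n))
```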
2.
(MRQ) Recall that we have introduced the hat matrix H = X(X^T X)^{−1} X^T in class. Assume X^T X is invertible. Which statements about H are true?
[a] H is positive semi-definite.
[b] H is always invertible.
[c] Some eigenvalues of H are bigger than 1.
[d] d + 1 eigenvalues of H are 1.
(+ explanation of your choice)

Problems 3-5 are about error and SGD
3.
(MRQ) Which of the following are upper bounds of ⟦sign(w^T x) ≠ y⟧ for y ∈ {−1, +1}?
[a] err(w) = max(0, 1 − y w^T x)
[b] err(w) = max(0, 1 − y w^T x)^2
[c] err(w) = max(0, y w^T x)
[d] err(w) = θ(−y w^T x)
(+ explanation of your choice)
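A quick numerical check (not a proof) can guide the explanation: compare each candidate err against the 0/1 error ⟦sign(w^T x) ≠ y⟧ as a function of s = y w^T x. Assumptions in this sketch: θ denotes the logistic function from class, and the bracketed names ("hinge" etc.) are our own labels, not from the problem.

```python
import math

def zero_one(s):                 # 0/1 error: 1 iff the sign is wrong (s < 0)
    return 1.0 if s < 0 else 0.0

candidates = {
    "[a] hinge":          lambda s: max(0.0, 1 - s),
    "[b] squared hinge":  lambda s: max(0.0, 1 - s) ** 2,
    "[c] relu":           lambda s: max(0.0, s),
    "[d] logistic(-s)":   lambda s: 1.0 / (1.0 + math.exp(s)),
}

grid = [i / 100.0 for i in range(-300, 301)]   # s in [-3, 3]
for name, err in candidates.items():
    print(name, "upper-bounds 0/1 on this grid:",
          all(err(s) >= zero_one(s) for s in grid))
```

Checking on a finite grid only suggests which bounds hold; the written explanation should argue the inequality for all s.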
4.
(MRQ) Which of the following are differentiable functions of w everywhere?
[a] err(w) = max(0, 1 − y w^T x)
[b] err(w) = max(0, 1 − y w^T x)^2
[c] err(w) = max(0, y w^T x)
[d] err(w) = θ(−y w^T x)
(+ explanation of your choice)
5.
(MCQ) When using SGD on the following error functions and ‘ignoring’ some singular points that are not differentiable, which of the following error functions results in PLA?
[a] err(w) = max(0, 1 − y w^T x)
[b] err(w) = max(0, 1 − y w^T x)^2
[c] err(w) = max(0, y w^T x)
[d] err(w) = θ(−y w^T x)
(+ explanation of your choice)
For Problems 6-10, you will play with the gradient descent algorithm and its variants
6.
(BFQ) Consider a function
E(u, v) = e^u + e^{2v} + e^{uv} + u^2 − 2uv + 2v^2 − 3u − 2v.
What is the gradient ∇E(u, v) around (u, v) = (0, 0)?
(optional: + explanation of your derivations)
7.
(BFQ) In class, we have taught that the update rule of the gradient descent algorithm is
(u_{t+1}, v_{t+1}) = (u_t, v_t) − η ∇E(u_t, v_t).
Please start from (u_0, v_0) = (0, 0) and fix η = 0.01. What is E(u_5, v_5) after five updates?
(optional: + explanation of your derivations)
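The five-update loop can be sketched directly. The partial derivatives below are worked out by hand from the definition of E in Problem 6; double-check the algebra before trusting the output.

```python
import math

def E(u, v):
    return (math.exp(u) + math.exp(2 * v) + math.exp(u * v)
            + u * u - 2 * u * v + 2 * v * v - 3 * u - 2 * v)

def grad_E(u, v):
    # Hand-derived partials of E; verify them yourself.
    du = math.exp(u) + v * math.exp(u * v) + 2 * u - 2 * v - 3
    dv = 2 * math.exp(2 * v) + u * math.exp(u * v) - 2 * u + 4 * v - 2
    return du, dv

u, v, eta = 0.0, 0.0, 0.01
for _ in range(5):                     # five fixed-rate updates
    du, dv = grad_E(u, v)
    u, v = u - eta * du, v - eta * dv
print("E(u5, v5) =", E(u, v))
```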
8.
(BFQ) Continue from Problem 7. Approximate E(u + Δu, v + Δv) by Ê_2(Δu, Δv), where Ê_2 is the second-order Taylor expansion of E around (u, v) = (0, 0). Suppose
Ê_2(Δu, Δv) = b_uu (Δu)^2 + b_vv (Δv)^2 + b_uv (Δu)(Δv) + b_u Δu + b_v Δv + b.
What are the values of (b_uu, b_vv, b_uv, b_u, b_v, b) around (u, v) = (0, 0)?
(+ explanation of your derivations)
9.
(MCQ) Continue from Problem 8. Denote the Hessian matrix by ∇^2 E(u, v), and assume that the Hessian matrix is positive definite. What is the optimal (Δu, Δv) to minimize Ê_2(Δu, Δv)? (The optimal direction is called the Newton direction.)
[a] −(∇^2 E(u, v))^{−1} ∇E(u, v)
[b] +(∇^2 E(u, v))^{−1} ∇E(u, v)
[c] −∇^2 E(u, v) ∇E(u, v)
[d] +∇^2 E(u, v) ∇E(u, v)
(+ explanation of your choice)
10.
(BFQ) Using the Newton direction (without η) to update, please start from (u_0, v_0) = (0, 0). What is E(u_5, v_5) after five updates?
(optional: + explanation of your derivations)
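A sketch of the Newton-direction loop: at each step solve the 2×2 system ∇²E · δ = −∇E (here via the explicit 2×2 inverse) and update with no learning rate. The partial derivatives and Hessian entries are hand-derived from E in Problem 6; verify them before relying on the output.

```python
import math

def E(u, v):
    return (math.exp(u) + math.exp(2 * v) + math.exp(u * v)
            + u * u - 2 * u * v + 2 * v * v - 3 * u - 2 * v)

def grad_E(u, v):
    du = math.exp(u) + v * math.exp(u * v) + 2 * u - 2 * v - 3
    dv = 2 * math.exp(2 * v) + u * math.exp(u * v) - 2 * u + 4 * v - 2
    return du, dv

def hess_E(u, v):
    # Returns (E_uu, E_uv, E_vv); hand-derived second partials.
    huu = math.exp(u) + v * v * math.exp(u * v) + 2
    huv = (1 + u * v) * math.exp(u * v) - 2
    hvv = 4 * math.exp(2 * v) + u * u * math.exp(u * v) + 4
    return huu, huv, hvv

u, v = 0.0, 0.0
for _ in range(5):
    gu, gv = grad_E(u, v)
    huu, huv, hvv = hess_E(u, v)
    det = huu * hvv - huv * huv
    # Newton direction -H^{-1} grad, using the closed-form 2x2 inverse
    du = -(hvv * gu - huv * gv) / det
    dv = -(huu * gv - huv * gu) / det
    u, v = u + du, v + dv
print("E(u5, v5) =", E(u, v))
```

Comparing this value with the one from Problem 7 illustrates how much faster the Newton direction descends here.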
For Problems 11-12, you will play with feature transforms
11.
(MCQ) Consider six inputs x_1 = (1, 1), x_2 = (1, −1), x_3 = (−1, −1), x_4 = (−1, 1), x_5 = (0, 0), x_6 = (1, 0). Which set of those inputs can be shattered by some quadratic, linear, or constant hypotheses of x?
[a] x_1, x_2, x_3
[b] x_1, x_2, x_3, x_4
[c] x_1, x_2, x_3, x_4, x_5
[d] x_1, x_2, x_3, x_4, x_5, x_6
(+ explanation of your choice)
12.
(MCQ) Assume that a transformer peeks the data and decides the following transform Φ “intelligently” from the data of size N. The transform maps x ∈ R^d to z ∈ R^N, where
(Φ(x))_n = z_n = ⟦x = x_n⟧.
Consider a learning algorithm that performs linear classification after the feature transform. What is d_vc(H_Φ)?
[a] 1 [b] d + 1
[c] N + 1 [d] ∞
(+ explanation of your choice)
For Problems 13-15, you will play with linear regression and feature transforms.
Consider the target function:
f(x_1, x_2) = sign(x_1^2 + x_2^2 − 0.6)
Generate a training set of N = 1000 points on X = [−1, 1] × [−1, 1] with uniform probability of picking each x ∈ X . Generate simulated noise by flipping the sign of the output in a random 10% subset of the generated training set.
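One way to generate such a training set (a sketch under our own conventions, not official starter code): draw uniform points on [−1, 1] × [−1, 1], label them with f, then flip the labels of a random 10% subset.

```python
import random

random.seed(0)   # fix the seed only for reproducibility of this sketch

def sign(z):
    return 1 if z > 0 else -1

def generate_data(n=1000, noise_rate=0.1):
    xs = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
    ys = [sign(x1 * x1 + x2 * x2 - 0.6) for x1, x2 in xs]
    for i in random.sample(range(n), int(noise_rate * n)):  # flip exactly 10%
        ys[i] = -ys[i]
    return xs, ys

xs, ys = generate_data()
print(len(xs), xs[0], ys[0])
```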
13.
(MCQ, *) Carry out Linear Regression without transformation, i.e., with feature vector
(1, x_1, x_2),
to find the weight w. What is the closest value to the classification in-sample error (E_in)? Run the experiment 1000 times and take the average E_in in order to reduce variation in your results.
[a] 0.1 [b] 0.3 [c] 0.5 [d] 0.8
Now, transform the training data into the following nonlinear feature vector:
(1, x_1, x_2, x_1 x_2, x_1^2, x_2^2)
Find the vector w̃ that corresponds to the solution of Linear Regression.
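A minimal sketch of Linear Regression on the transformed features, solving the normal equations X^T X w = X^T y by Gaussian elimination in pure Python (in practice you would call a pseudo-inverse or least-squares routine from a linear-algebra library). The noiseless toy data below is our own illustration, not the experiment the problem asks for.

```python
import random

def transform(x1, x2):
    return [1.0, x1, x2, x1 * x2, x1 * x1, x2 * x2]

def lin_reg(X, y):
    """Least-squares weights via the normal equations X^T X w = X^T y."""
    d = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(d)] for i in range(d)]
    b = [sum(r[i] * yn for r, yn in zip(X, y)) for i in range(d)]
    # Gaussian elimination with partial pivoting
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, d):
            f = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * d
    for r in range(d - 1, -1, -1):    # back substitution
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, d))) / A[r][r]
    return w

# Illustration on noiseless data from the target f(x1, x2)
random.seed(1)
X, y = [], []
for _ in range(1000):
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    X.append(transform(x1, x2))
    y.append(1.0 if x1 * x1 + x2 * x2 - 0.6 > 0 else -1.0)
w_tilde = lin_reg(X, y)
print([round(wi, 2) for wi in w_tilde])
```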
14.
(MCQ, *) Which of the following hypotheses is closest to the one you find using Linear Regression on the transformed input? Closest here means agreeing the most with your hypothesis (having the highest probability of agreeing on a randomly selected point).
[a] g(x_1, x_2) = sign(−1 − 0.05 x_1 + 0.08 x_2 + 0.13 x_1 x_2 + 1.5 x_1^2 + 1.5 x_2^2)
[b] g(x_1, x_2) = sign(−1 − 0.05 x_1 + 0.08 x_2 + 0.13 x_1 x_2 + 1.5 x_1^2 + 15 x_2^2)
[c] g(x_1, x_2) = sign(−1 − 0.05 x_1 + 0.08 x_2 + 0.13 x_1 x_2 + 15 x_1^2 + 1.5 x_2^2)
[d] g(x_1, x_2) = sign(−1 − 1.5 x_1 + 0.08 x_2 + 0.13 x_1 x_2 + 0.05 x_1^2 + 0.05 x_2^2)
15.
(MCQ, *) What is the closest value to the classification out-of-sample error E_out of your hypothesis? (Estimate it by generating a new set of 1000 points and adding noise as before. Average over 1000 runs to reduce the variation in your results.)
[a] 0.1 [b] 0.3 [c] 0.5 [d] 0.8
For Problems 16-17, you will derive an algorithm for multinomial (multiclass) logistic regression (MLR).
For a K-class classification problem, we will denote the output space Y = {1, 2, · · · , K}. The hypotheses considered by MLR are indexed by a list of weight vectors (w_1, · · · , w_K), each weight vector of length d + 1. Each list represents a hypothesis
h_y(x) = exp(w_y^T x) / Σ_{i=1}^K exp(w_i^T x)
that can be used to approximate the target distribution P(y|x). MLR then seeks the maximum-likelihood solution over all such hypotheses.
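The hypothesis above is a softmax over the K inner products w_y^T x, and transcribes directly into code. The weight values and input below are made-up illustrations, not from the problem.

```python
import math

def mlr_hypothesis(ws, x):
    """Return [h_1(x), ..., h_K(x)] for weight vectors ws, each length d+1."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in ws]
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

ws = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]   # K = 3 toy weight vectors (d + 1 = 2)
x = [1.0, 0.5]                                 # x_0 = 1 plus one feature
probs = mlr_hypothesis(ws, x)
print(probs, sum(probs))
```

Since the outputs are positive and sum to 1, each hypothesis is a valid distribution over the K classes, which is what makes the maximum-likelihood derivation in Problems 16-17 well-posed.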
16.
(BFQ) For general K, derive an E_in(w_1, · · · , w_K) like page 11 of the Lecture 10 slides by minimizing the negative log likelihood.
(+ derivation steps)
17.
(BFQ) For the E_in derived above, write down its gradient ∇E_in.
(+ derivation steps)

For Problems 18-20, you will play with logistic regression.
18.
(BFQ, *) Implement the fixed learning rate gradient descent algorithm below for logistic regression. Run the algorithm with η = 0.001 and T = 2000 on the following set for training:
http://www.csie.ntu.edu.tw/~htlin/course/ml13fall/hw3/hw3_train.dat
and the following set for testing:
http://www.csie.ntu.edu.tw/~htlin/course/ml13fall/hw3/hw3_test.dat
What is E_out(g) from your algorithm, evaluated using the 0/1 error on the test set?
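A sketch of fixed-learning-rate gradient descent for logistic regression, using the batch gradient (1/N) Σ_n θ(−y_n w^T x_n)(−y_n x_n) from class. It is demonstrated on a tiny made-up dataset; for the homework, read the training and test files from the URLs above instead.

```python
import math

def theta(s):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-s))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def logistic_gd(data, eta=0.001, T=2000):
    """data: list of (x, y) with x including x_0 = 1 and y in {-1, +1}."""
    d = len(data[0][0])
    w = [0.0] * d
    for _ in range(T):
        grad = [0.0] * d
        for x, y in data:
            coef = theta(-y * dot(w, x)) * (-y)
            for i in range(d):
                grad[i] += coef * x[i]
        w = [wi - eta * gi / len(data) for wi, gi in zip(w, grad)]
    return w

# Toy separable data: x = (1, feature), label in {-1, +1}
toy = [([1.0, 2.0], 1), ([1.0, 1.5], 1), ([1.0, -1.0], -1), ([1.0, -2.5], -1)]
w = logistic_gd(toy, eta=0.1, T=500)
errors = sum((1 if dot(w, x) > 0 else -1) != y for x, y in toy)
print("w =", w, "0/1 in-sample errors:", errors)
```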
19.
(BFQ, *) Implement the fixed learning rate gradient descent algorithm below for logistic regression. Run the algorithm with η = 0.01 and T = 2000 on the following set for training:
http://www.csie.ntu.edu.tw/~htlin/course/ml13fall/hw3/hw3_train.dat
and the following set for testing:
http://www.csie.ntu.edu.tw/~htlin/course/ml13fall/hw3/hw3_test.dat
What is E_out(g) from your algorithm, evaluated using the 0/1 error on the test set?
20.
(BFQ, *) Implement the fixed learning rate stochastic gradient descent algorithm below for logistic regression. Instead of randomly choosing n in each iteration, please simply pick the example with the cyclic order n = 1, 2, . . . , N, 1, 2, . . .. Run the algorithm with η = 0.001 and T = 2000 on the following set for training:
http://www.csie.ntu.edu.tw/~htlin/course/ml13fall/hw3/hw3_train.dat
and the following set for testing:
http://www.csie.ntu.edu.tw/~htlin/course/ml13fall/hw3/hw3_test.dat
What is E_out(g) from your algorithm, evaluated using the 0/1 error on the test set?
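The cyclic-order SGD variant differs from the batch version only in that iteration t uses the single example n = t mod N and steps along its stochastic gradient θ(−y_n w^T x_n) y_n x_n. Again sketched on a tiny made-up dataset rather than the homework files.

```python
import math

def theta(s):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-s))

def logistic_sgd_cyclic(data, eta=0.001, T=2000):
    """SGD picking examples in the cyclic order 1, 2, ..., N, 1, 2, ..."""
    d = len(data[0][0])
    w = [0.0] * d
    for t in range(T):
        x, y = data[t % len(data)]        # cyclic pick, no randomness
        s = sum(wi * xi for wi, xi in zip(w, x))
        coef = theta(-y * s) * y
        w = [wi + eta * coef * xi for wi, xi in zip(w, x)]
    return w

toy = [([1.0, 2.0], 1), ([1.0, -1.5], -1), ([1.0, 1.0], 1), ([1.0, -2.0], -1)]
w = logistic_sgd_cyclic(toy, eta=0.1, T=1000)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1 for x, _ in toy]
print("w =", w, "predictions:", preds)
```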
Bonus: Smart ‘Cheating’
21.
(BBQ, 10 points) For a regression problem, the root-mean-square error (RMSE) of a hypothesis h on a test set {(x_n, y_n)}_{n=1}^N is defined as
RMSE(h) = sqrt( (1/N) Σ_{n=1}^N (y_n − h(x_n))^2 ).
Please consider a case of knowing all the x_n and none of the y_n, but being allowed to query RMSE(h) for some h. To construct a hypothesis g with RMSE(g) = 0, what is the least number of queries needed?
22.
(BBQ, 10 points) Continue from Problem 21. For any given hypothesis h, let
h = (h(x_1), h(x_2), · · · , h(x_N)) and y = (y_1, y_2, · · · , y_N).
To compute h^T y, what is the least number of queries?
23.
(BBQ, 20 points) Continue from Problem 22. For any given set of hypotheses {h_1, h_2, · · · , h_K}, to solve
min over (w_1, w_2, · · · , w_K) of RMSE( Σ_{k=1}^K w_k h_k ),
what is the least number of queries?
Answer guidelines. First, please write down your name and school ID number.
Name: School ID:
Then, fill in your answers for MCQ, MRQ and BFQ in the table below.
1: ____   2: ____   3: ____   4: ____
5: ____   6: ____   7: ____   8: ____
9: ____   10: ____  11: ____  12: ____
13: ____  14: ____  15: ____  16: ____
17: ____  18: ____  19: ____  20: ____
Lastly, please write down your solution to those (+ . . .) parts and bonus problems, using as many additional pages as you want.
Each problem is worth 10 points.
• For problems with (+ . . .), the answer in the table is worth 3 points, and the (+ . . .) part is worth 7 points. If your solution to the (+ . . .) part is clearly different from your answer in the table, it is regarded as a suspicious violation of the class policy (plagiarism) and the TAs can deduct more points based on the violation.
• For problems without (+ . . .), the problem is worth 10 points by itself, and the TAs can decide whether to give you partial credit, as long as it is fair to the whole class.