Machine Learning (NTU, Fall 2010) instructor: Hsuan-Tien Lin

Homework #4

TA in charge: Chao-Kai Chiang RELEASE DATE: 10/25/2010 DUE DATE: 11/08/2010, 4:00 pm IN CLASS

TA SESSION: 11/04/2010, 6:00 pm IN R110

Unless granted by the instructor in advance, you must turn in a hard copy of your solutions (without the source code) for all problems. For problems marked with (*), please follow the guidelines on the course website and upload your source code to designated places.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for those kinds of misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

4.1 More on Growth Function and VC Dimension

(1) (5%) Let $H = \{h_1, h_2, \ldots, h_M\}$ with some finite $M$. Prove that $d_{\mathrm{VC}}(H) \le \log_2 M$.

(2) (5%) For hypothesis sets $H_1, H_2, \ldots, H_K$ with finite VC-dimensions $d_{\mathrm{VC}}(H_k)$, derive and prove the tightest lower bound that you can get on $d_{\mathrm{VC}}\left(\bigcap_{k=1}^{K} H_k\right)$.

(3) (5%) For hypothesis sets $H_1, H_2, \ldots, H_K$ with finite VC-dimensions $d_{\mathrm{VC}}(H_k)$, derive and prove the tightest upper bound that you can get on $d_{\mathrm{VC}}\left(\bigcap_{k=1}^{K} H_k\right)$.

(4) (5%) For hypothesis sets $H_1, H_2, \ldots, H_K$ with finite VC-dimensions $d_{\mathrm{VC}}(H_k)$, derive and prove the tightest lower bound that you can get on $d_{\mathrm{VC}}\left(\bigcup_{k=1}^{K} H_k\right)$.

(5) (5%) For hypothesis sets $H_1, H_2, \ldots, H_K$ with finite VC-dimensions $d_{\mathrm{VC}}(H_k)$, derive and prove the tightest upper bound that you can get on $d_{\mathrm{VC}}\left(\bigcup_{k=1}^{K} H_k\right)$.

4.2 The Hat of Linear Regression

(1) (3%) Do Exercise 3.3-1 of LFD.

(2) (3%) Do Exercise 3.3-2 of LFD.

(3) (3%) Do Exercise 3.3-3 of LFD.

(4) (3%) Do Exercise 3.3-4 of LFD.

(5) (3%) Do Exercise 3.3-5 of LFD.

4.3 The Feature Transforms

(1) (4%) Do Exercise 3.6 of LFD.

(2) (6%) Do Exercise 3.7 of LFD.

(3) (5%) Do Exercise 3.11 of LFD.

4.4 Gradient and Newton Directions

Consider a function

$$E(u, v) = e^u + e^{2v} + e^{uv} + u^2 - 3uv + 4v^2 - 3u - 5v.$$

(1) (3%) Approximate $E(u + \Delta u, v + \Delta v)$ by $\hat{E}_1(\Delta u, \Delta v)$, where $\hat{E}_1$ is the first-order Taylor's expansion of $E$ around $(u, v) = (0, 0)$. Suppose $\hat{E}_1(\Delta u, \Delta v) = a_u \Delta u + a_v \Delta v + a$. What are the values of $a_u$, $a_v$, and $a$?

(2) (3%) Minimize $\hat{E}_1$ over all possible $(\Delta u, \Delta v)$ such that $\|(\Delta u, \Delta v)\| = 0.5$. In class, we proved that the optimal column vector $\begin{pmatrix} \Delta u \\ \Delta v \end{pmatrix}$ is parallel to the column vector $-\nabla E(u, v)$, which is called the negative gradient direction. Compute the optimal $(\Delta u, \Delta v)$ and the resulting $E(u + \Delta u, v + \Delta v)$.

(3) (3%) Approximate $E(u + \Delta u, v + \Delta v)$ by $\hat{E}_2(\Delta u, \Delta v)$, where $\hat{E}_2$ is the second-order Taylor's expansion of $E$ around $(u, v) = (0, 0)$. Suppose

$$\hat{E}_2(\Delta u, \Delta v) = b_{uu}(\Delta u)^2 + b_{vv}(\Delta v)^2 + b_{uv}(\Delta u)(\Delta v) + b_u \Delta u + b_v \Delta v + b.$$

What are the values of $b_{uu}$, $b_{vv}$, $b_{uv}$, $b_u$, $b_v$, and $b$?

(4) (3%) Minimize $\hat{E}_2$ over all possible $(\Delta u, \Delta v)$ (regardless of length). Use the fact that $\nabla^2 E(u, v)$ (the Hessian matrix) is positive definite to prove that the optimal column vector is

$$\begin{pmatrix} \Delta u \\ \Delta v \end{pmatrix} = -\left(\nabla^2 E(u, v)\right)^{-1} \nabla E(u, v),$$

which is called the Newton direction.

(5) (3%) Numerically compute the following values:

(a) the vector (∆u, ∆v) of length 0.5 along the Newton direction, and the resulting E(u + ∆u, v + ∆v).

(b) the vector (∆u, ∆v) of length 0.5 that minimizes E(u + ∆u, v + ∆v), and the resulting E(u + ∆u, v + ∆v). (Hint: let ∆u = 0.5 sin θ.)

Compare the values of E(u + ∆u, v + ∆v) in (2), (5a), and (5b). Briefly state your findings.

The negative gradient direction and the Newton direction are quite fundamental for designing optimization algorithms. It is important to understand these directions and put them in your toolbox for designing ML algorithms.
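
To make the two directions concrete, here is a minimal numerical sketch in Python (an illustration, not part of the assignment): it assumes numpy counts as a standard matrix routine under the course policy, approximates the gradient and Hessian of the $E(u, v)$ above at $(0, 0)$ by central finite differences, and then forms a length-0.5 step along the negative gradient direction together with the (unscaled) Newton step. The helper names are hypothetical.

```python
# A minimal numerical sketch (illustration only): central finite differences
# approximate the gradient and Hessian of E(u, v) at (0, 0); we then take a
# length-0.5 step along the negative gradient direction and the Newton step.
# numpy is assumed to count as a "standard matrix multiplication and inversion routine".
import numpy as np

def E(u, v):
    return (np.exp(u) + np.exp(2 * v) + np.exp(u * v)
            + u**2 - 3 * u * v + 4 * v**2 - 3 * u - 5 * v)

def num_grad(f, u, v, h=1e-5):
    # central differences for the two first partial derivatives
    du = (f(u + h, v) - f(u - h, v)) / (2 * h)
    dv = (f(u, v + h) - f(u, v - h)) / (2 * h)
    return np.array([du, dv])

def num_hess(f, u, v, h=1e-4):
    # central differences for the 2x2 Hessian
    duu = (f(u + h, v) - 2 * f(u, v) + f(u - h, v)) / h**2
    dvv = (f(u, v + h) - 2 * f(u, v) + f(u, v - h)) / h**2
    duv = (f(u + h, v + h) - f(u + h, v - h)
           - f(u - h, v + h) + f(u - h, v - h)) / (4 * h**2)
    return np.array([[duu, duv], [duv, dvv]])

g = num_grad(E, 0.0, 0.0)
H = num_hess(E, 0.0, 0.0)

# length-0.5 step along the negative gradient direction
step_gd = -0.5 * g / np.linalg.norm(g)
# Newton step: -(Hessian)^{-1} * gradient (well defined because the Hessian is positive definite here)
step_newton = -np.linalg.solve(H, g)

print("gradient step:", step_gd, "E =", E(*step_gd))
print("Newton step:  ", step_newton, "E =", E(*step_newton))
```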

4.5 Least-squares Linear Regression (*)

(1) (8%) Implement the least-squares linear regression algorithm taught in class to compute the optimal $(d + 1)$-dimensional $w$ that solves

$$\min_{w} \sum_{n=1}^{N} \left(y_n - w^T x_n\right)^2.$$

Run the algorithm on the following set for training (each row represents a pair of $(x_n, y_n)$, where $x_n$ is the "thin" version: the first column is $x_n[1]$, the second one is $x_n[2]$, and the third one is $y_n$):

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/data/hw4_train.dat

and the following set for testing:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/data/hw4_test.dat

Report the $w$ you find. Let $g(x) = \mathrm{sign}(w^T x)$. What is $E_{\mathrm{in}}(g)$ in terms of the 0/1 loss (classification)? How about $E_{\mathrm{out}}(g)$?

Please check the course policy carefully and do not use sophisticated packages in your solution. You can use standard matrix multiplication and inversion routines.
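
For orientation only, the closed-form least-squares solution $w = (X^T X)^{-1} X^T y$ needs nothing beyond standard matrix multiplication and a linear solve. The sketch below is one possible layout in Python, assuming numpy is acceptable under the course policy and that the two .dat files above have been downloaded locally; load_data and zero_one_error are hypothetical helper names, and padding each $x_n$ with a constant 1 follows the $(d+1)$-dimensional convention used in class.

```python
# A minimal sketch (assuming numpy and the three-column data format described
# above: x_n[1], x_n[2], y_n). It pads each x_n with a leading 1 to form the
# (d+1)-dimensional vector, solves the least-squares problem in closed form,
# and reports the 0/1 error of g(x) = sign(w^T x).
import numpy as np

def load_data(path):
    data = np.loadtxt(path)               # each row: x_n[1], x_n[2], y_n
    X = np.hstack([np.ones((data.shape[0], 1)), data[:, :-1]])  # pad with x_n[0] = 1
    y = data[:, -1]
    return X, y

def zero_one_error(w, X, y):
    return np.mean(np.sign(X @ w) != y)

X_train, y_train = load_data("hw4_train.dat")  # downloaded from the course URLs above
X_test, y_test = load_data("hw4_test.dat")

# closed-form least squares: w = (X^T X)^{-1} X^T y, via a linear solve
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

print("w =", w)
print("E_in (0/1) =", zero_one_error(w, X_train, y_train))
print("E_out(0/1) =", zero_one_error(w, X_test, y_test))
```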

4.6 Gradient Descent for Logistic Regression (*)

Consider the formulation (so-called logistic regression)

$$\min_{w} E(w), \qquad \text{(A1)}$$

where $E(w) = \frac{1}{N} \sum_{n=1}^{N} E^{(n)}(w)$, and $E^{(n)}(w) = \ln\left(1 + \exp\left(-y_n w^T x_n\right)\right)$.

(1) (3%) Prove that $\frac{1}{\ln 2} E^{(n)}(w)$ is an upper bound of $[\![\,\mathrm{sign}(w^T x_n) \ne y_n\,]\!]$ for any $w$.

(2) (3%) For a given $(x_n, y_n)$, derive its gradient $\nabla E^{(n)}(w)$.

(3) (8%) Implement the (fixed-step) stochastic gradient descent algorithm below for (A1).

(a) initialize a $(d+1)$-dimensional vector $w^{(0)}$, say, $w^{(0)} \leftarrow (0, 0, \ldots, 0)$.

(b) for $t = 1, 2, \ldots, T$:

• randomly pick one $n$ from $\{1, 2, \ldots, N\}$.

• update $w^{(t)} \leftarrow w^{(t-1)} - \eta \cdot \nabla E^{(n)}\left(w^{(t-1)}\right)$.

Assume that

$$g_1^{(t)}(x) = \mathrm{sign}\left(\left(w^{(t)}\right)^T x\right),$$

where the $w^{(t)}$ are generated from the stochastic gradient descent algorithm above. Run the algorithm with $\eta = 0.001$ and $T = 2000$ on the following set for training:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/data/hw4_train.dat

and the following set for testing:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/data/hw4_test.dat

Plot $E_{\mathrm{in}}\left(g_1^{(t)}\right)$ and $E_{\mathrm{out}}\left(g_1^{(t)}\right)$ as a function of $t$ and briefly state your findings.
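
For the loop structure of part (3), one possible Python sketch is below (again assuming numpy and locally downloaded copies of the two .dat files; grad_single, sgd_logistic, and the other helper names are hypothetical). The per-example gradient it uses is the one part (2) asks you to derive, so treat it as a reference rather than a substitute for the derivation. Plotting $E_{\mathrm{in}}$ and $E_{\mathrm{out}}$ against $t$ can then be done with any plotting tool the course policy allows.

```python
# One possible sketch of the fixed-step stochastic gradient descent in part (3),
# assuming numpy; load_data is the same 1-padding loader as in the Problem 4.5
# sketch, and the .dat files are assumed to be downloaded from the URLs above.
import numpy as np

def load_data(path):
    data = np.loadtxt(path)
    X = np.hstack([np.ones((data.shape[0], 1)), data[:, :-1]])  # pad x_n with 1
    y = data[:, -1]
    return X, y

def grad_single(w, x, y):
    # gradient of E^(n)(w) = ln(1 + exp(-y w^T x)); deriving this is part (2)
    return -y * x / (1.0 + np.exp(y * (w @ x)))

def zero_one_error(w, X, y):
    return np.mean(np.sign(X @ w) != y)

def sgd_logistic(X, y, eta=0.001, T=2000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                 # w^(0) = (0, ..., 0)
    history = [w.copy()]
    for _ in range(T):
        n = rng.integers(len(y))             # randomly pick one n
        w = w - eta * grad_single(w, X[n], y[n])
        history.append(w.copy())             # keep every w^(t) for the curves
    return history

X_train, y_train = load_data("hw4_train.dat")
X_test, y_test = load_data("hw4_test.dat")
ws = sgd_logistic(X_train, y_train)
E_in = [zero_one_error(w, X_train, y_train) for w in ws]
E_out = [zero_one_error(w, X_test, y_test) for w in ws]   # plot both against t
```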

(4) (8%) Implement the (fixed-step) gradient descent algorithm below for (A1).

(a) initialize a $(d+1)$-dimensional vector $w^{(0)}$, say, $w^{(0)} \leftarrow (0, 0, \ldots, 0)$.

(b) for $t = 1, 2, \ldots, T$:

• update $w^{(t)} \leftarrow w^{(t-1)} - \eta \cdot \nabla E\left(w^{(t-1)}\right)$.

Assume that

$$g_2^{(t)}(x) = \mathrm{sign}\left(\left(w^{(t)}\right)^T x\right),$$

where the $w^{(t)}$ are generated from the gradient descent algorithm above. Run the algorithm with $\eta = 0.001$ and $T = 2000$ on the following set for training:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/data/hw4_train.dat

and the following set for testing:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/data/hw4_test.dat

Plot $E_{\mathrm{in}}\left(g_2^{(t)}\right)$ and $E_{\mathrm{out}}\left(g_2^{(t)}\right)$ as a function of $t$, compare it to your plot for $g_1^{(t)}$, and briefly state your findings.
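
Part (4) changes only the update: each step uses the full gradient $\nabla E\left(w^{(t-1)}\right)$ instead of a single-example gradient. Below is a sketch of that loop under the same assumptions as above (numpy; X and y already loaded and 1-padded as in the earlier sketches; grad_full and gd_logistic are hypothetical names). The resulting $E_{\mathrm{in}}$ and $E_{\mathrm{out}}$ curves can be overlaid on those for $g_1^{(t)}$ for the comparison the problem asks for.

```python
# A sketch of the fixed-step (batch) gradient descent of part (4); X and y are
# assumed to be loaded and 1-padded as in the earlier sketches.
import numpy as np

def grad_full(w, X, y):
    # gradient of E(w) = (1/N) sum_n ln(1 + exp(-y_n w^T x_n))
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))    # one factor per example
    return -(X * (y * s)[:, None]).mean(axis=0)

def gd_logistic(X, y, eta=0.001, T=2000):
    w = np.zeros(X.shape[1])                 # w^(0) = (0, ..., 0)
    history = [w.copy()]
    for _ in range(T):
        w = w - eta * grad_full(w, X, y)     # one full-gradient update per t
        history.append(w.copy())
    return history
```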
