Machine Learning (NTU, Fall 2010) instructor: Hsuan-Tien Lin

Homework #4

TA in charge: Chao-Kai Chiang RELEASE DATE: 10/25/2010 DUE DATE: 11/08/2010, 4:00 pm IN CLASS

TA SESSION: 11/04/2010, 6:00 pm IN R110

Unless granted by the instructor in advance, you must turn in a hard copy of your solutions (without the source code) for all problems. For problems marked with (*), please follow the guidelines on the course website and upload your source code to designated places.

Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail the class and/or be kicked out of school and/or receive other punishments for those kinds of misconduct.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

4.1 More on Growth Function and VC Dimension

(1) (5%) Let $H = \{h_1, h_2, \ldots, h_M\}$ with some finite $M$. Prove that $d_{\mathrm{VC}}(H) \le \log_2 M$.

(2) (5%) For hypothesis sets $H_1, H_2, \ldots, H_K$ with finite VC-dimensions $d_{\mathrm{VC}}(H_k)$, derive and prove the tightest lower bound that you can get on $d_{\mathrm{VC}}\left(\bigcap_{k=1}^{K} H_k\right)$.

(3) (5%) For hypothesis sets $H_1, H_2, \ldots, H_K$ with finite VC-dimensions $d_{\mathrm{VC}}(H_k)$, derive and prove the tightest upper bound that you can get on $d_{\mathrm{VC}}\left(\bigcap_{k=1}^{K} H_k\right)$.

(4) (5%) For hypothesis sets $H_1, H_2, \ldots, H_K$ with finite VC-dimensions $d_{\mathrm{VC}}(H_k)$, derive and prove the tightest lower bound that you can get on $d_{\mathrm{VC}}\left(\bigcup_{k=1}^{K} H_k\right)$.

(5) (5%) For hypothesis sets $H_1, H_2, \ldots, H_K$ with finite VC-dimensions $d_{\mathrm{VC}}(H_k)$, derive and prove the tightest upper bound that you can get on $d_{\mathrm{VC}}\left(\bigcup_{k=1}^{K} H_k\right)$.

4.2 The Hat of Linear Regression

(1) (3%) Do Exercise 3.3-1 of LFD.

(2) (3%) Do Exercise 3.3-2 of LFD.

(3) (3%) Do Exercise 3.3-3 of LFD.

(4) (3%) Do Exercise 3.3-4 of LFD.

(5) (3%) Do Exercise 3.3-5 of LFD.

4.3 The Feature Transforms

(1) (4%) Do Exercise 3.6 of LFD.

(2) (6%) Do Exercise 3.7 of LFD.

(3) (5%) Do Exercise 3.11 of LFD.

4.4 Gradient and Newton Directions

Consider a function

$$E(u, v) = e^u + e^{2v} + e^{uv} + u^2 - 3uv + 4v^2 - 3u - 5v.$$

(1) (3%) Approximate $E(u + \Delta u, v + \Delta v)$ by $\hat{E}_1(\Delta u, \Delta v)$, where $\hat{E}_1$ is the first-order Taylor's expansion of $E$ around $(u, v) = (0, 0)$. Suppose $\hat{E}_1(\Delta u, \Delta v) = a_u \Delta u + a_v \Delta v + a$. What are the values of $a_u$, $a_v$, and $a$?

(2) (3%) Minimize $\hat{E}_1$ over all possible $(\Delta u, \Delta v)$ such that $\|(\Delta u, \Delta v)\| = 0.5$. In class, we proved that the optimal column vector $\begin{pmatrix} \Delta u \\ \Delta v \end{pmatrix}$ is parallel to the column vector $-\nabla E(u, v)$, which is called the negative gradient direction. Compute the optimal $(\Delta u, \Delta v)$ and the resulting $E(u + \Delta u, v + \Delta v)$.

(3) (3%) Approximate $E(u + \Delta u, v + \Delta v)$ by $\hat{E}_2(\Delta u, \Delta v)$, where $\hat{E}_2$ is the second-order Taylor's expansion of $E$ around $(u, v) = (0, 0)$. Suppose

$$\hat{E}_2(\Delta u, \Delta v) = b_{uu}(\Delta u)^2 + b_{vv}(\Delta v)^2 + b_{uv}(\Delta u)(\Delta v) + b_u \Delta u + b_v \Delta v + b.$$

What are the values of $b_{uu}$, $b_{vv}$, $b_{uv}$, $b_u$, $b_v$, and $b$?

(4) (3%) Minimize $\hat{E}_2$ over all possible $(\Delta u, \Delta v)$ (regardless of length). Use the fact that $\nabla^2 E(u, v)$ (the Hessian matrix) is positive definite to prove that the optimal column vector is

$$\begin{pmatrix} \Delta u \\ \Delta v \end{pmatrix} = -\left(\nabla^2 E(u, v)\right)^{-1} \nabla E(u, v),$$

which is called the Newton direction.

(5) (3%) Numerically compute the following values:

(a) the vector (∆u, ∆v) of length 0.5 along the Newton direction, and the resulting E(u + ∆u, v + ∆v).

(b) the vector (∆u, ∆v) of length 0.5 that minimizes E(u + ∆u, v + ∆v), and the resulting E(u + ∆u, v + ∆v). (Hint: let ∆u = 0.5 sin θ.)

Compare the values of E(u + ∆u, v + ∆v) in (2), (5a), and (5b). Briefly state your findings.

The negative gradient direction and the Newton direction are quite fundamental for designing optimization algorithms. It is important to understand these directions and put them in your toolbox for designing ML algorithms.
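
To make the two directions concrete, here is a minimal numerical sketch in Python (an illustration, not part of the assignment): it assumes numpy counts as a standard matrix routine under the course policy, approximates the gradient and Hessian of the $E(u, v)$ above at $(0, 0)$ by central finite differences, and then forms a length-0.5 step along the negative gradient direction together with the (unscaled) Newton step. The helper names are hypothetical.

```python
# A minimal numerical sketch (illustration only): central finite differences
# approximate the gradient and Hessian of E(u, v) at (0, 0); we then take a
# length-0.5 step along the negative gradient direction and the Newton step.
# numpy is assumed to count as a "standard matrix multiplication and inversion routine".
import numpy as np

def E(u, v):
    return (np.exp(u) + np.exp(2 * v) + np.exp(u * v)
            + u**2 - 3 * u * v + 4 * v**2 - 3 * u - 5 * v)

def num_grad(f, u, v, h=1e-5):
    # central differences for the two first partial derivatives
    du = (f(u + h, v) - f(u - h, v)) / (2 * h)
    dv = (f(u, v + h) - f(u, v - h)) / (2 * h)
    return np.array([du, dv])

def num_hess(f, u, v, h=1e-4):
    # central differences for the 2x2 Hessian
    duu = (f(u + h, v) - 2 * f(u, v) + f(u - h, v)) / h**2
    dvv = (f(u, v + h) - 2 * f(u, v) + f(u, v - h)) / h**2
    duv = (f(u + h, v + h) - f(u + h, v - h)
           - f(u - h, v + h) + f(u - h, v - h)) / (4 * h**2)
    return np.array([[duu, duv], [duv, dvv]])

g = num_grad(E, 0.0, 0.0)
H = num_hess(E, 0.0, 0.0)

# length-0.5 step along the negative gradient direction
step_gd = -0.5 * g / np.linalg.norm(g)
# Newton step: -(Hessian)^{-1} * gradient (well defined because the Hessian is positive definite here)
step_newton = -np.linalg.solve(H, g)

print("gradient step:", step_gd, "E =", E(*step_gd))
print("Newton step:  ", step_newton, "E =", E(*step_newton))
```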

4.5 Least-squares Linear Regression (*)

(1) (8%) Implement the least-squares linear regression algorithm taught in class to compute the optimal $(d + 1)$-dimensional $w$ that solves

$$\min_{w} \sum_{n=1}^{N} \left(y_n - w^T x_n\right)^2.$$

Run the algorithm on the following set for training (each row represents a pair of $(x_n, y_n)$, where $x_n$ is the "thin" version: the first column is $x_n[1]$, the second one is $x_n[2]$, and the third one is $y_n$):

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/data/hw4_train.dat

and the following set for testing:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/data/hw4_test.dat

Report the $w$ you find. Let $g(x) = \mathrm{sign}(w^T x)$. What is $E_{\mathrm{in}}(g)$ in terms of the 0/1 loss (classification)? How about $E_{\mathrm{out}}(g)$?

Please check the course policy carefully and do not use sophisticated packages in your solution. You can use standard matrix multiplication and inversion routines.
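
For orientation only, the closed-form least-squares solution $w = (X^T X)^{-1} X^T y$ needs nothing beyond standard matrix multiplication and a linear solve. The sketch below is one possible layout in Python, assuming numpy is acceptable under the course policy and that the two .dat files above have been downloaded locally; load_data and zero_one_error are hypothetical helper names, and padding each $x_n$ with a constant 1 follows the $(d+1)$-dimensional convention used in class.

```python
# A minimal sketch (assuming numpy and the three-column data format described
# above: x_n[1], x_n[2], y_n). It pads each x_n with a leading 1 to form the
# (d+1)-dimensional vector, solves the least-squares problem in closed form,
# and reports the 0/1 error of g(x) = sign(w^T x).
import numpy as np

def load_data(path):
    data = np.loadtxt(path)               # each row: x_n[1], x_n[2], y_n
    X = np.hstack([np.ones((data.shape[0], 1)), data[:, :-1]])  # pad with x_n[0] = 1
    y = data[:, -1]
    return X, y

def zero_one_error(w, X, y):
    return np.mean(np.sign(X @ w) != y)

X_train, y_train = load_data("hw4_train.dat")  # downloaded from the course URLs above
X_test, y_test = load_data("hw4_test.dat")

# closed-form least squares: w = (X^T X)^{-1} X^T y, via a linear solve
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

print("w =", w)
print("E_in (0/1) =", zero_one_error(w, X_train, y_train))
print("E_out(0/1) =", zero_one_error(w, X_test, y_test))
```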

4.6 Gradient Descent for Logistic Regression (*)

Consider the formulation (so-called logistic regression)

$$\min_{w} E(w), \qquad \text{(A1)}$$

where $E(w) = \frac{1}{N} \sum_{n=1}^{N} E^{(n)}(w)$, and $E^{(n)}(w) = \ln\left(1 + \exp\left(-y_n w^T x_n\right)\right)$.

(1) (3%) Prove that $\frac{1}{\ln 2} E^{(n)}(w)$ is an upper bound of $[\![\,\mathrm{sign}(w^T x_n) \ne y_n\,]\!]$ for any $w$.

(2) (3%) For a given $(x_n, y_n)$, derive its gradient $\nabla E^{(n)}(w)$.

(3) (8%) Implement the (fixed-step) stochastic gradient descent algorithm below for (A1).

(a) initialize a $(d+1)$-dimensional vector $w^{(0)}$, say, $w^{(0)} \leftarrow (0, 0, \ldots, 0)$.

(b) for $t = 1, 2, \ldots, T$:

• randomly pick one $n$ from $\{1, 2, \ldots, N\}$.

• update $w^{(t)} \leftarrow w^{(t-1)} - \eta \cdot \nabla E^{(n)}\left(w^{(t-1)}\right)$.

Assume that

$$g_1^{(t)}(x) = \mathrm{sign}\left(\left(w^{(t)}\right)^T x\right),$$

where the $w^{(t)}$ are generated from the stochastic gradient descent algorithm above. Run the algorithm with $\eta = 0.001$ and $T = 2000$ on the following set for training:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/data/hw4_train.dat

and the following set for testing:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/data/hw4_test.dat

Plot $E_{\mathrm{in}}\left(g_1^{(t)}\right)$ and $E_{\mathrm{out}}\left(g_1^{(t)}\right)$ as a function of $t$ and briefly state your findings.
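
For the loop structure of part (3), one possible Python sketch is below (again assuming numpy and locally downloaded copies of the two .dat files; grad_single, sgd_logistic, and the other helper names are hypothetical). The per-example gradient it uses is the one part (2) asks you to derive, so treat it as a reference rather than a substitute for the derivation. Plotting $E_{\mathrm{in}}$ and $E_{\mathrm{out}}$ against $t$ can then be done with any plotting tool the course policy allows.

```python
# One possible sketch of the fixed-step stochastic gradient descent in part (3),
# assuming numpy; load_data is the same 1-padding loader as in the Problem 4.5
# sketch, and the .dat files are assumed to be downloaded from the URLs above.
import numpy as np

def load_data(path):
    data = np.loadtxt(path)
    X = np.hstack([np.ones((data.shape[0], 1)), data[:, :-1]])  # pad x_n with 1
    y = data[:, -1]
    return X, y

def grad_single(w, x, y):
    # gradient of E^(n)(w) = ln(1 + exp(-y w^T x)); deriving this is part (2)
    return -y * x / (1.0 + np.exp(y * (w @ x)))

def zero_one_error(w, X, y):
    return np.mean(np.sign(X @ w) != y)

def sgd_logistic(X, y, eta=0.001, T=2000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                 # w^(0) = (0, ..., 0)
    history = [w.copy()]
    for _ in range(T):
        n = rng.integers(len(y))             # randomly pick one n
        w = w - eta * grad_single(w, X[n], y[n])
        history.append(w.copy())             # keep every w^(t) for the curves
    return history

X_train, y_train = load_data("hw4_train.dat")
X_test, y_test = load_data("hw4_test.dat")
ws = sgd_logistic(X_train, y_train)
E_in = [zero_one_error(w, X_train, y_train) for w in ws]
E_out = [zero_one_error(w, X_test, y_test) for w in ws]   # plot both against t
```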

(4) (8%) Implement the (fixed-step) gradient descent algorithm below for (A1).

(a) initialize a $(d+1)$-dimensional vector $w^{(0)}$, say, $w^{(0)} \leftarrow (0, 0, \ldots, 0)$.

(b) for $t = 1, 2, \ldots, T$:

• update $w^{(t)} \leftarrow w^{(t-1)} - \eta \cdot \nabla E\left(w^{(t-1)}\right)$.

Assume that

$$g_2^{(t)}(x) = \mathrm{sign}\left(\left(w^{(t)}\right)^T x\right),$$

where the $w^{(t)}$ are generated from the gradient descent algorithm above. Run the algorithm with $\eta = 0.001$ and $T = 2000$ on the following set for training:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/data/hw4_train.dat

and the following set for testing:

http://www.csie.ntu.edu.tw/~htlin/course/ml10fall/data/hw4_test.dat

Plot $E_{\mathrm{in}}\left(g_2^{(t)}\right)$ and $E_{\mathrm{out}}\left(g_2^{(t)}\right)$ as a function of $t$, compare it to your plot for $g_1^{(t)}$, and briefly state your findings.
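
Part (4) changes only the update: each step uses the full gradient $\nabla E\left(w^{(t-1)}\right)$ instead of a single-example gradient. Below is a sketch of that loop under the same assumptions as above (numpy; X and y already loaded and 1-padded as in the earlier sketches; grad_full and gd_logistic are hypothetical names). The resulting $E_{\mathrm{in}}$ and $E_{\mathrm{out}}$ curves can be overlaid on those for $g_1^{(t)}$ for the comparison the problem asks for.

```python
# A sketch of the fixed-step (batch) gradient descent of part (4); X and y are
# assumed to be loaded and 1-padded as in the earlier sketches.
import numpy as np

def grad_full(w, X, y):
    # gradient of E(w) = (1/N) sum_n ln(1 + exp(-y_n w^T x_n))
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))    # one factor per example
    return -(X * (y * s)[:, None]).mean(axis=0)

def gd_logistic(X, y, eta=0.001, T=2000):
    w = np.zeros(X.shape[1])                 # w^(0) = (0, ..., 0)
    history = [w.copy()]
    for _ in range(T):
        w = w - eta * grad_full(w, X, y)     # one full-gradient update per t
        history.append(w.copy())
    return history
```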
