Machine Learning (NTU, Fall 2008) instructor: Hsuan-Tien Lin

Homework #6

TA in charge: Hanhsing Tu, Room 536
RELEASE DATE: 12/11/2008
DUE DATE: 12/18/2008, 4:00 pm IN CLASS
TA SESSION: 12/17/2008, noon to 2:00 pm IN R106

Unless granted by the instructor in advance, you must turn in a hard copy of your solutions (without the source code) for all problems. For problems marked with (*), please follow the guidelines on the course website and upload your source code to designated places.

Discussions on course materials and homework solutions are encouraged. But you should write the final solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but not copied from.

Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will be punished according to the honesty policy.

You should write your solutions in English with the common math notations introduced in class or in the problems. We do not accept solutions written in any other languages.

6.1 Bayesian Universe

(1) (15%) ASSUME that the universe generates an example $(x, y)$ by the following procedure:

(a) generate $x$ from some probability density function $P(x)$;
(b) use some fixed $(w, \theta)$ to evaluate $\rho = \langle w, x \rangle - \theta$;
(c) generate $y \in \mathbb{R}$ from $\rho$ by the probability density function $P(y \mid \rho) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(y-\rho)^2}{2}\right)$.

If each $(x_n, y_n)$ within $Z = \{(x_n, y_n)\}_{n=1}^{N}$ is generated i.i.d. from the procedure above, what is the likelihood $P(Z \mid (w, \theta))$? Prove that linear regression (see Problem 2.3-(1)) equivalently gives the maximum likelihood estimate of $(w, \theta)$.
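As a sanity check on notation (this is only the immediate i.i.d. factorization from the procedure above, not the full solution), the likelihood takes the form

```latex
P\bigl(Z \mid (w,\theta)\bigr)
  = \prod_{n=1}^{N} P(x_n)\, P(y_n \mid \rho_n)
  = \prod_{n=1}^{N} P(x_n)\,
    \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{\bigl(y_n - \langle w, x_n \rangle + \theta\bigr)^2}{2}\right),
```

where $\rho_n = \langle w, x_n \rangle - \theta$.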

(2) (20%) ASSUME that the universe generates an example $(x, y)$ by the following procedure:

(a) generate $(w, \theta)$ from $P(w, \theta) = \frac{1}{(\sqrt{2\pi})^{d+1} \cdot \sigma^{d+1}} \cdot \exp\left(-\frac{\|w\|_2^2 + \theta^2}{2\sigma^2}\right)$;
(b) generate $x$ from some probability density function $P(x)$;
(c) use the "fixed" $(w, \theta)$ to evaluate $\rho = \langle w, x \rangle - \theta$;
(d) generate $y \in \mathbb{R}$ from $\rho$ by the probability density function $P(y \mid \rho) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(y-\rho)^2}{2}\right)$.

If each $(x_n, y_n)$ within $Z = \{(x_n, y_n)\}_{n=1}^{N}$ is generated i.i.d. from the procedure above, and assume that the constant $P(Z) = Q$, what is the posterior $P((w, \theta) \mid Z)$? Prove that regularized linear regression (see Problem 2.3-(3)) equivalently gives the maximum a posteriori estimate of $(w, \theta)$.

In particular, what is the relationship between $\lambda$ (in Problem 2.3-(3)) and $\sigma$ (here)?
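For reference (just Bayes' rule with the assumed constant $P(Z) = Q$ plugged in), the posterior is

```latex
P\bigl((w,\theta) \mid Z\bigr)
  = \frac{P\bigl(Z \mid (w,\theta)\bigr)\, P(w,\theta)}{P(Z)}
  = \frac{1}{Q}\, P\bigl(Z \mid (w,\theta)\bigr)\, P(w,\theta).
```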

(3) (15%) ASSUME that the universe generates an example $(x, y)$ by the following procedure:

(a) generate $x$ from some probability density function $P(x)$;
(b) use some fixed $(w, \theta)$ to evaluate $\rho = \langle w, x \rangle - \theta$;
(c) evaluate $Q_{+} = \exp(\rho/2)$ and $Q_{-} = \exp(-\rho/2)$;
(d) generate $y \in \{+, -\}$ with the probability distribution $Q_y / (Q_{+} + Q_{-})$.

If each $(x_n, y_n)$ within $Z = \{(x_n, y_n)\}_{n=1}^{N}$ is generated i.i.d. from the procedure above, what is the likelihood $P(Z \mid (w, \theta))$? Prove that logistic regression (see Problem 2.4) equivalently gives the maximum likelihood estimate of $(w, \theta)$.
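Note that steps (c)-(d) generate $y = +$ with exactly the logistic function of $\rho$:

```latex
P(y = + \mid \rho)
  = \frac{Q_+}{Q_+ + Q_-}
  = \frac{\exp(\rho/2)}{\exp(\rho/2) + \exp(-\rho/2)}
  = \frac{1}{1 + \exp(-\rho)}.
```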


6.2 Power of Adaptive Boosting

The adaptive boosting (AdaBoost) algorithm, as shown in the class slides, is as follows:

• Input: $Z = \{(x_n, y_n)\}_{n=1}^{N}$.

• Set $u_n = \frac{1}{N}$ for all $n$.

• For $t = 1, 2, \cdots, T$,

– Learn a simple rule $h_t$ such that $h_t$ solves
$$h_t = \mathop{\mathrm{argmin}}_{h} \sum_{n=1}^{N} u_n \cdot I[y_n \neq h(x_n)],$$
with the help of some base learner $A_b$.

– Compute the weighted error
$$\epsilon_t = \frac{1}{\sum_{m=1}^{N} u_m} \sum_{n=1}^{N} u_n \cdot I[y_n \neq h_t(x_n)]$$
and the confidence
$$\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}.$$

– Emphasize the training examples that do not agree with $h_t$: $u_n = u_n \cdot \exp\left(-\alpha_t y_n h_t(x_n)\right)$.

• Output: combined function $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.

In this problem, we will prove that AdaBoost can reach $\nu(H) = 0$ if $T$ is large enough and every hypothesis $h_t$ satisfies $\epsilon_t \leq \epsilon < \frac{1}{2}$.
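For concreteness, here is a minimal Python sketch of the loop above. The base learner is passed in as a function (names like `base_learn` are illustrative, not prescribed by the homework), and no tie-breaking or degenerate-$\epsilon_t$ handling is shown.

```python
import numpy as np

def adaboost(X, y, T, base_learn):
    """Sketch of the AdaBoost loop above.

    X: (N, d) array of examples; y: (N,) array of +/-1 labels.
    base_learn(X, y, u) returns a classifier h with h(X) in {-1, +1}^N.
    """
    N = len(y)
    u = np.full(N, 1.0 / N)                       # u_n = 1/N for all n
    hs, alphas = [], []
    for t in range(T):
        h = base_learn(X, y, u)                   # h_t minimizes the u-weighted error
        pred = h(X)
        eps = np.sum(u * (pred != y)) / np.sum(u)  # weighted error epsilon_t
        alpha = 0.5 * np.log((1.0 - eps) / eps)    # confidence alpha_t
        u = u * np.exp(-alpha * y * pred)          # emphasize disagreeing examples
        hs.append(h)
        alphas.append(alpha)

    def H(Xq):  # H(x) = sign(sum_t alpha_t h_t(x))
        return np.sign(sum(a * h(Xq) for a, h in zip(alphas, hs)))
    return H
```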

(1) (5%) Let $U^{(t-1)} = \sum_{n=1}^{N} u_n$ at the beginning of the $t$-th iteration. What is $U^{(0)}$?

(2) (10%) According to the AdaBoost algorithm above, for $t \geq 1$, prove that
$$U^{(t)} = \frac{1}{N} \sum_{n=1}^{N} \exp\left(-y_n \sum_{\tau=1}^{t} \alpha_\tau h_\tau(x_n)\right).$$

(3) (5%) By the result in (2), prove that $\nu(H) \leq U^{(T)}$.

(4) (10%) According to the AdaBoost algorithm above, for $t \geq 1$, prove that $U^{(t)} = U^{(t-1)} \cdot 2\sqrt{\epsilon_t(1-\epsilon_t)}$.

(5) (5%) Using $0 \leq \epsilon_t \leq \epsilon < \frac{1}{2}$, for $t \geq 1$, prove that $\sqrt{\epsilon_t(1-\epsilon_t)} \leq \sqrt{\epsilon(1-\epsilon)}$.

(6) (5%) Using $\epsilon < \frac{1}{2}$, prove that $\sqrt{\epsilon(1-\epsilon)} \leq \frac{1}{2} \exp\left(-2\left(\frac{1}{2} - \epsilon\right)^2\right)$.

(7) (5%) Using the results above, prove that $U^{(T)} \leq \exp\left(-2T\left(\frac{1}{2} - \epsilon\right)^2\right)$.

(8) (5%) Using the results above, argue that after $T = O(\log N)$ iterations, $\nu(H) = 0$.
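A sketch of how (8) follows from the earlier parts: since $\nu(H)$ is an average of $N$ zero/one indicators, it can only take values in $\{0, \frac{1}{N}, \frac{2}{N}, \ldots\}$, so it suffices to push $U^{(T)}$ below $\frac{1}{N}$:

```latex
\nu(H) \;\le\; U^{(T)} \;\le\; \exp\!\left(-2T\left(\tfrac{1}{2}-\epsilon\right)^2\right)
\;<\; \frac{1}{N}
\quad\text{whenever}\quad
T \;>\; \frac{\ln N}{2\left(\frac{1}{2}-\epsilon\right)^2} = O(\log N).
```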


6.3 Experiments with Bootstrap Aggregation (*)

(1) (20%) Implement the decision stump learning algorithm $A_{ds}$. That is, let
$$h_{s,i,\theta}(x) = \mathrm{sign}\left(s \cdot (x)_i - \theta\right),$$
where $s \in \{-1, +1\}$, $i \in \{1, 2, \ldots, d\}$, and $\theta \in \mathbb{R}$. Given a weighted training set $Z = \{(x_n, y_n, u_n)\}_{n=1}^{N}$,
$$A_{ds}(Z) = \mathop{\mathrm{argmin}}_{h_{s,i,\theta}} \sum_{n=1}^{N} u_n \cdot I[y_n \neq h_{s,i,\theta}(x_n)].$$

Run the algorithm on the following set for training (with $u_n = \frac{1}{N}$ for all $n$):

http://www.csie.ntu.edu.tw/~htlin/course/ml08fall/data/hw6_train.dat

and the following set for testing:

http://www.csie.ntu.edu.tw/~htlin/course/ml08fall/data/hw6_test.dat

Let $g$ be the decision function returned from $A_{ds}$. Report $\nu(g)$ and $\hat{\pi}(g)$. Briefly state your findings.
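A brute-force Python sketch of $A_{ds}$ follows. It assumes the `.dat` files are whitespace-separated with the label in the last column (an assumption about the data format, not stated in the problem), and it tries, for every sign and dimension, one threshold below all points plus every mid-point between consecutive sorted values.

```python
import numpy as np

def stump_learn(X, y, u):
    """Brute-force A_ds: minimize sum_n u_n * I[y_n != sign(s * x_i - theta)]."""
    N, d = X.shape
    best_err = np.inf
    for i in range(d):
        for s in (-1, +1):
            z = s * X[:, i]
            zs = np.sort(z)
            # thresholds: one below all points, then mid-points between neighbors
            thetas = np.concatenate(([zs[0] - 1.0], (zs[:-1] + zs[1:]) / 2.0))
            for theta in thetas:
                pred = np.where(z - theta > 0, 1, -1)   # sign, with ties sent to -1
                err = np.sum(u * (pred != y))
                if err < best_err:
                    best_err, best_s, best_i, best_theta = err, s, i, theta

    def h(Xq):  # the returned decision function h_{s,i,theta}
        return np.where(best_s * Xq[:, best_i] - best_theta > 0, 1, -1)
    return h

# hypothetical usage: report the training and test error rates of g
train = np.loadtxt("hw6_train.dat")              # features ..., label in last column
test = np.loadtxt("hw6_test.dat")
X, y = train[:, :-1], train[:, -1]
g = stump_learn(X, y, np.full(len(y), 1.0 / len(y)))
print(np.mean(g(X) != y))                        # in-sample error of g
print(np.mean(g(test[:, :-1]) != test[:, -1]))   # test error of g
```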

(2) (30%) Implement the bootstrap aggregation (bagging) algorithm with decision stumps (i.e., use $A_{ds}$ as $A_b$ below):

• Input: $Z = \{(x_n, y_n)\}_{n=1}^{N}$.

• For $t = 1, 2, \ldots, T$,

– generate $Z^{(t)}$ from $Z$ by bootstrapping: uniformly sampling $N$ examples from $Z$ with replacement;

– let $h_t = A_b(Z^{(t)})$ and $\alpha_t = 1$.

• Output: combined function $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.

Use a total of $T = 100$ iterations. Let $H_t(x) = \mathrm{sign}\left(\sum_{\tau=1}^{t} \alpha_\tau h_\tau(x)\right)$. Plot $\nu(H_t)$ and $\hat{\pi}(H_t)$ as functions of $t$ on the same figure. Briefly state your findings.
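A minimal bagging sketch, reusing the hypothetical `stump_learn` above; the bootstrap resample is the only new ingredient, and keeping the list of stumps makes $H_t$ cheap to evaluate for every prefix $t$ when plotting:

```python
import numpy as np

def bagging(X, y, T, base_learn, seed=0):
    """Bagging sketch: T bootstrap rounds, each hypothesis weighted alpha_t = 1."""
    rng = np.random.default_rng(seed)
    N = len(y)
    hs = []
    for t in range(T):
        idx = rng.integers(0, N, size=N)      # sample N examples with replacement
        u = np.full(N, 1.0 / N)               # uniform weights on the resample
        hs.append(base_learn(X[idx], y[idx], u))
    return hs                                 # H_t(x) = sign(sum of first t votes)

# hypothetical usage: training-error curve of H_t for t = 1, ..., T
hs = bagging(X, y, 100, stump_learn)
score = np.zeros(len(y))
for t, h in enumerate(hs, start=1):
    score += h(X)                             # alpha_t = 1 for every stump
    err_t = np.mean(np.sign(score) != y)      # nu(H_t) on the training set
```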

(3) (Bonus 5%) Prove that you can implement an $A_{ds}$ that runs in time $O(N \log N)$ instead of the brute-force implementation that takes $O(N^2)$.
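One standard route to the bonus, sketched below under the assumption that it matches the intended argument: fix $s$ and a dimension, sort the feature values once in $O(N \log N)$, then sweep the threshold from left to right; moving $\theta$ past one more point flips exactly that point's prediction, so the weighted error can be updated in $O(1)$ per step. (Duplicate feature values need a little extra care, omitted here.)

```python
import numpy as np

def stump_1d_fast(z, y, u):
    """O(N log N) stump along one feature z, for s = +1 (sketch)."""
    order = np.argsort(z)                     # O(N log N): the only sort needed
    zs, ys, us = z[order], y[order], u[order]
    # theta below all points: predict +1 everywhere, so err = weight of y = -1
    err = np.sum(us[ys == -1])
    best_err, best_theta = err, zs[0] - 1.0
    for n in range(len(zs)):
        # moving theta just past zs[n] flips example n's prediction to -1
        err += us[n] if ys[n] == +1 else -us[n]
        theta = (zs[n] + zs[n + 1]) / 2.0 if n + 1 < len(zs) else zs[n] + 1.0
        if err < best_err:
            best_err, best_theta = err, theta
    return best_err, best_theta
```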

6.4 Experiments with Adaptive Boosting (*)

Implement the AdaBoost algorithm (as in Problem 6.2 above) with decision stumps (i.e., use $A_{ds}$ as $A_b$).

Run the algorithm on the following set for training:

http://www.csie.ntu.edu.tw/~htlin/course/ml08fall/data/hw6_train.dat

and the following set for testing:

http://www.csie.ntu.edu.tw/~htlin/course/ml08fall/data/hw6_test.dat

(1) (30%) Use a total of $T = 100$ iterations. Let $H_t(x) = \mathrm{sign}\left(\sum_{\tau=1}^{t} \alpha_\tau h_\tau(x)\right)$. Plot $\nu(H_t)$, $\hat{\pi}(H_t)$, AND $U^{(t)}$ (see the definition above) as functions of $t$ on the same figure. Briefly state your findings.
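A matplotlib sketch of the bookkeeping for this part, assuming (as in the earlier problems) that $\nu$ and $\hat{\pi}$ denote the training and test error rates, and reusing the hypothetical `stump_learn` above; the running weighted-vote scores make evaluating every $H_t$ cheap:

```python
import numpy as np
import matplotlib.pyplot as plt

# assumed already loaded: X, y (training) and Xt, yt (test); stump_learn as above
T, N = 100, len(y)
u = np.full(N, 1.0 / N)
score_tr, score_te = np.zeros(N), np.zeros(len(yt))
nu, pi_hat, U = [], [], []
for t in range(T):
    h = stump_learn(X, y, u)
    pred = h(X)
    eps = np.sum(u * (pred != y)) / np.sum(u)
    alpha = 0.5 * np.log((1.0 - eps) / eps)
    u = u * np.exp(-alpha * y * pred)         # AdaBoost re-weighting step
    U.append(np.sum(u))                       # U^{(t)} right after the update
    score_tr += alpha * pred                  # running vote of H_t on training set
    score_te += alpha * h(Xt)                 # ... and on the test set
    nu.append(np.mean(np.sign(score_tr) != y))
    pi_hat.append(np.mean(np.sign(score_te) != yt))

plt.plot(range(1, T + 1), nu, label="nu(H_t)")
plt.plot(range(1, T + 1), pi_hat, label="pi_hat(H_t)")
plt.plot(range(1, T + 1), U, label="U^(t)")
plt.xlabel("t")
plt.legend()
plt.show()
```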

(2) (20%) Compare your plots in Problem 6.3 and Problem 6.4. Briefly state your findings.
