Logistic Regression


For a label-feature pair $(y, x)$ with $y \in \{-1, +1\}$, assume the probability model

$$p(y \mid x) = \frac{1}{1 + e^{-y w^T x}},$$

where $w$ is the parameter to be decided.

Assume $(y_i, x_i)$, $i = 1, \ldots, l$, are the training instances


Logistic Regression (Cont’d)

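Logistic regression finds $w$ by maximizing the following likelihood

$$\max_w \prod_{i=1}^{l} p(y_i \mid x_i). \tag{1}$$

Regularized logistic regression solves

$$\min_w f(w) \equiv \frac{1}{2} w^T w + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right), \tag{2}$$

where $C > 0$ is the regularization parameter.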
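As a minimal sketch, the regularized objective (2) can be evaluated as follows. The names `objective` and `log1p_exp` are illustrative and not from the slides. The data are stored as plain Python lists for readability; a large sparse data set would need a sparse representation instead.

```python
import math

def log1p_exp(t):
    """Return log(1 + e^t) without overflow for large |t|."""
    if t > 0:
        return t + math.log1p(math.exp(-t))
    return math.log1p(math.exp(t))

def objective(w, X, y, C):
    """f(w) = 0.5 * w^T w + C * sum_i log(1 + exp(-y_i w^T x_i))."""
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        loss += log1p_exp(-margin)
    return reg + C * loss
```

At $w = 0$ every instance contributes $\log 2$, so the value is $C\, l \log 2$, which is a convenient sanity check.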


Gradient-descent Methods

Given initial $w^0$ and a constant $\eta \in (0, 1)$.

For $k = 0, 1, \ldots$

Calculate the direction $s^k = -\nabla f(w^k)$

Find $\alpha_k$ satisfying

$$f(w^k + \alpha_k s^k) \le f(w^k) + \eta \alpha_k \nabla f(w^k)^T s^k$$

Update $w^{k+1} = w^k + \alpha_k s^k$.


Gradient

We note that the gradient takes the following form

$$\nabla f(w) = w + C \sum_{i=1}^{l} \left(\frac{e^{-y_i w^T x_i}}{1 + e^{-y_i w^T x_i}}\right)(-y_i x_i) = w + C \sum_{i=1}^{l} \left(\frac{1}{1 + e^{-y_i w^T x_i}} - 1\right) y_i x_i$$
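The second form of the gradient can be sketched as below. The function names `sigmoid` and `gradient` are illustrative, and dense lists are assumed for readability.

```python
import math

def sigmoid(z):
    """Return 1 / (1 + e^{-z}), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def gradient(w, X, y, C):
    """Return w + C * sum_i (sigmoid(y_i w^T x_i) - 1) * y_i * x_i."""
    g = list(w)  # gradient of the regularization term 0.5 * w^T w
    for xi, yi in zip(X, y):
        z = yi * sum(wj * xj for wj, xj in zip(w, xi))
        coef = C * (sigmoid(z) - 1.0) * yi
        for j, xj in enumerate(xi):
            g[j] += coef * xj
    return g
```

Comparing this against a finite-difference approximation of the objective is a simple way to catch sign errors.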


Backtracking Line Search

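To find $\alpha_k$ satisfying

$$f(w^k + \alpha_k s^k) \le f(w^k) + \eta \alpha_k \nabla f(w^k)^T s^k,$$

sequentially check $\alpha_k = 1, 1/2, 1/4, 1/8, \ldots$

Recall the function is

$$f(w) = \frac{1}{2} w^T w + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right).$$

You save time by the property

$$(w + \alpha d)^T x = w^T x + \alpha d^T x,$$

so $w^T x_i$ and $d^T x_i$ need to be computed only once for all trial values of $\alpha$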
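A minimal sketch of backtracking that exploits this property: the inner products $w^T x_i$ and $s^T x_i$ are computed once, and each trial step costs only $O(l)$ work. The function name, the argument order, and the `max_halvings` safeguard are assumptions, not part of the slides. The caller is assumed to pass the gradient `g` evaluated at `w`.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def log1p_exp(t):
    """Return log(1 + e^t) without overflow."""
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))

def backtracking(w, s, g, X, y, C, eta=0.01, max_halvings=30):
    """Try alpha = 1, 1/2, 1/4, ... until the sufficient decrease condition holds."""
    Xw = [dot(xi, w) for xi in X]  # w^T x_i, computed once
    Xs = [dot(xi, s) for xi in X]  # s^T x_i, computed once
    ww, ws, ss = dot(w, w), dot(w, s), dot(s, s)
    gs = dot(g, s)                 # directional derivative grad f(w)^T s

    def f_at(alpha):
        # ||w + alpha s||^2 = w^T w + 2 alpha w^T s + alpha^2 s^T s
        reg = 0.5 * (ww + 2.0 * alpha * ws + alpha * alpha * ss)
        loss = sum(log1p_exp(-yi * (a + alpha * b))
                   for yi, a, b in zip(y, Xw, Xs))
        return reg + C * loss

    f0 = f_at(0.0)
    alpha = 1.0
    for _ in range(max_halvings):
        if f_at(alpha) <= f0 + eta * alpha * gs:
            break
        alpha *= 0.5
    return alpha
```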


Backtracking Line Search (Cont’d)

You can keep

$$(w^{k+1})^T x = (w^k + \alpha_k s^k)^T x$$

for the next iteration

But error propagation is a concern: values updated incrementally over many iterations accumulate rounding error


Stopping Condition

You can use

$$\|\nabla f(w^k)\| \le \epsilon \|\nabla f(w^0)\|$$

This is a relative condition

You may choose $\epsilon = 0.01$

Note that a smaller $\epsilon$ will cause more iterations

You may need to set a maximal number of iterations as well
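Putting the gradient-descent steps and this stopping condition together, a hedged sketch on a small dense data set might look as follows. The helper names, the default `max_iter`, and the return value are assumptions made for illustration. For clarity, the line search here recomputes the objective from scratch at each trial step.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def log1p_exp(t):
    """Return log(1 + e^t) without overflow."""
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))

def f(w, X, y, C):
    return 0.5 * dot(w, w) + C * sum(log1p_exp(-yi * dot(w, xi))
                                     for xi, yi in zip(X, y))

def grad(w, X, y, C):
    g = list(w)
    for xi, yi in zip(X, y):
        # 1/(1 + e^{-z}) - 1 = -1/(1 + e^{z}) = -exp(-log(1 + e^{z}))
        coef = -C * yi * math.exp(-log1p_exp(yi * dot(w, xi)))
        for j, xj in enumerate(xi):
            g[j] += coef * xj
    return g

def gradient_descent(X, y, C, eta=0.01, eps=0.01, max_iter=1000):
    w = [0.0] * len(X[0])  # w^0 = 0
    g = grad(w, X, y, C)
    g0_norm = math.sqrt(dot(g, g))
    iters = 0
    # Relative stopping condition plus a cap on the number of iterations.
    while iters < max_iter and math.sqrt(dot(g, g)) > eps * g0_norm:
        s = [-gj for gj in g]
        fw, gs, alpha = f(w, X, y, C), dot(g, s), 1.0
        while f([wj + alpha * sj for wj, sj in zip(w, s)], X, y, C) > fw + eta * alpha * gs:
            alpha *= 0.5  # backtrack: 1, 1/2, 1/4, ...
        w = [wj + alpha * sj for wj, sj in zip(w, s)]
        g = grad(w, X, y, C)
        iters += 1
    return w, iters
```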


Newton Methods

Newton direction

$$\min_s \; \nabla f(w^k)^T s + \frac{1}{2} s^T \nabla^2 f(w^k) s$$

$w^k$: current iterate

This is the same as solving the Newton linear system

$$\nabla^2 f(w^k) s = -\nabla f(w^k)$$
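One way to see the equivalence, assuming the Hessian is positive definite so that the quadratic model has a unique minimizer, is to set the gradient of the model with respect to $s$ to zero. The symbol $q$ is introduced here only for illustration.

```latex
\begin{align*}
q(s) &= \nabla f(w^k)^T s + \frac{1}{2} s^T \nabla^2 f(w^k) s \\
\nabla_s q(s) &= \nabla f(w^k) + \nabla^2 f(w^k) s = 0 \\
\Longrightarrow\quad \nabla^2 f(w^k) s &= -\nabla f(w^k).
\end{align*}
```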


Newton Methods (Cont’d)

Given initial $w^0$ and a constant $\eta \in (0, 1)$.

For $k = 0, 1, \ldots$

Solve the Newton linear system to obtain the direction $s^k$

Find $\alpha_k$ satisfying

$$f(w^k + \alpha_k s^k) \le f(w^k) + \eta \alpha_k \nabla f(w^k)^T s^k$$

Update $w^{k+1} = w^k + \alpha_k s^k$.


Newton Linear System

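The Hessian $\nabla^2 f(w^k)$ is too large to be stored

$\nabla^2 f(w^k)$: $n \times n$, where $n$ is the number of features

But the Hessian has a special form

$$\nabla^2 f(w) = I + C \sum_{i=1}^{l} y_i x_i \left(\frac{e^{-y_i w^T x_i}}{\left(1 + e^{-y_i w^T x_i}\right)^2}\right) y_i x_i^T = I + C X^T D X$$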


Newton Linear System (Cont’d)

$I$: identity.

$$X = \begin{bmatrix} x_1^T \\ \vdots \\ x_l^T \end{bmatrix}$$

is the data matrix.

$D$ is diagonal with

$$D_{ii} = \frac{e^{-y_i w^T x_i}}{\left(1 + e^{-y_i w^T x_i}\right)^2}$$


Newton Linear System (Cont’d)

Use the conjugate gradient (CG) method to solve the linear system

$$\nabla^2 f(w^k) s = -\nabla f(w^k)$$

Only a sequence of Hessian-vector products is needed

$$\nabla^2 f(w) s = s + C \cdot X^T (D (X s))$$

Therefore, we have a Hessian-free approach
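A minimal sketch of this Hessian-vector product, which never forms the $n \times n$ matrix. The function names are illustrative and dense lists are assumed. The code uses the identity $D_{ii} = \sigma(z_i)(1 - \sigma(z_i))$ with $z_i = y_i w^T x_i$ and $\sigma(z) = 1/(1 + e^{-z})$, which is algebraically the same as the expression for $D_{ii}$ in the slides.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sigmoid(z):
    """Return 1 / (1 + e^{-z}), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hessian_vector_product(w, s, X, y, C):
    """Return (I + C X^T D X) s, accumulating one instance at a time."""
    Hs = list(s)  # identity part: I s = s
    for xi, yi in zip(X, y):
        sig = sigmoid(yi * dot(w, xi))
        d_ii = sig * (1.0 - sig)        # e^{-z} / (1 + e^{-z})^2
        coef = C * d_ii * dot(xi, s)    # C * D_ii * (X s)_i
        for j, xj in enumerate(xi):
            Hs[j] += coef * xj          # add coef * x_i, a row of X^T (D (X s))
    return Hs
```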


Conjugate Gradient

Given $\xi_k < 1$. Let $\bar{s}^0 = 0$, $r^0 = -\nabla f(w^k)$, and $d^0 = r^0$.

For $i = 0, 1, \ldots$ (inner iterations)

If $\|r^i\| \le \xi_k \|\nabla f(w^k)\|$, then output $s^k = \bar{s}^i$ and stop.

$\alpha_i = \|r^i\|^2 / ((d^i)^T \nabla^2 f(w^k) d^i)$.

$\bar{s}^{i+1} = \bar{s}^i + \alpha_i d^i$.

$r^{i+1} = r^i - \alpha_i \nabla^2 f(w^k) d^i$.

$\beta_i = \|r^{i+1}\|^2 / \|r^i\|^2$.

$d^{i+1} = r^{i+1} + \beta_i d^i$.
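The inner iterations above can be sketched as follows. The Hessian enters only through a callable `hess_vec` that returns the Hessian-vector product. The function name, the callable interface, and the `max_inner` cap are assumptions for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def conjugate_gradient(hess_vec, g, xi=0.1, max_inner=1000):
    """Approximately solve H s = -g, where hess_vec(d) returns H d."""
    s = [0.0] * len(g)          # s-bar^0 = 0
    r = [-gj for gj in g]       # r^0 = -grad f(w^k)
    d = list(r)                 # d^0 = r^0
    rr = dot(r, r)
    tol = xi * math.sqrt(dot(g, g))
    for _ in range(max_inner):
        if math.sqrt(rr) <= tol:  # relative stopping condition
            break
        Hd = hess_vec(d)
        alpha = rr / dot(d, Hd)
        s = [sj + alpha * dj for sj, dj in zip(s, d)]
        r = [rj - alpha * hj for rj, hj in zip(r, Hd)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        d = [rj + beta * dj for rj, dj in zip(r, d)]
        rr = rr_new
    return s
```

For example, with a fixed $2 \times 2$ positive definite matrix one could pass `hess_vec = lambda d: [4*d[0] + d[1], d[0] + 3*d[1]]`.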


Conjugate Gradient (Cont’d)

The CG stopping condition

$$\|r^i\| \le \xi_k \|\nabla f(w^k)\|$$

is important

It’s a relative stopping condition. It becomes strict in the end because $\|\nabla f(w^k)\|$ is small

Therefore, we only approximately obtain the Newton direction
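A hedged end-to-end sketch of the Newton method with an approximate CG direction and backtracking line search, on a small dense data set. All function names, the `max_iter` and `max_cg` caps, and the choice to return the list of accepted step sizes are assumptions for illustration. This is not LIBLINEAR's implementation, which uses a trust region.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def log1p_exp(t):
    """Return log(1 + e^t) without overflow."""
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))

def f(w, X, y, C):
    return 0.5 * dot(w, w) + C * sum(log1p_exp(-yi * dot(w, xi))
                                     for xi, yi in zip(X, y))

def grad_and_D(w, X, y, C):
    """Return the gradient and the diagonal entries of D at w."""
    g, D = list(w), []
    for xi, yi in zip(X, y):
        p = math.exp(-log1p_exp(-yi * dot(w, xi)))  # 1 / (1 + e^{-z})
        D.append(p * (1.0 - p))
        for j, xj in enumerate(xi):
            g[j] += C * (p - 1.0) * yi * xj
    return g, D

def hess_vec(d, X, D, C):
    """Return d + C * X^T (D (X d))."""
    out = list(d)
    for xi, dii in zip(X, D):
        coef = C * dii * dot(xi, d)
        for j, xj in enumerate(xi):
            out[j] += coef * xj
    return out

def cg(g, X, D, C, xi_k, max_cg=100):
    """Approximately solve the Newton linear system by conjugate gradient."""
    s, r = [0.0] * len(g), [-gj for gj in g]
    d, rr = list(r), dot(g, g)
    tol = xi_k * math.sqrt(rr)
    for _ in range(max_cg):
        if math.sqrt(rr) <= tol:
            break
        Hd = hess_vec(d, X, D, C)
        a = rr / dot(d, Hd)
        s = [sj + a * dj for sj, dj in zip(s, d)]
        r = [rj - a * hj for rj, hj in zip(r, Hd)]
        rr_new = dot(r, r)
        d = [rj + (rr_new / rr) * dj for rj, dj in zip(r, d)]
        rr = rr_new
    return s

def newton(X, y, C, eta=0.01, xi_k=0.1, eps=0.01, max_iter=100):
    w = [0.0] * len(X[0])  # w^0 = 0
    g, D = grad_and_D(w, X, y, C)
    g0_norm = math.sqrt(dot(g, g))
    alphas = []
    for _ in range(max_iter):
        if math.sqrt(dot(g, g)) <= eps * g0_norm:
            break
        s = cg(g, X, D, C, xi_k)
        fw, gs, alpha = f(w, X, y, C), dot(g, s), 1.0
        while f([wj + alpha * sj for wj, sj in zip(w, s)], X, y, C) > fw + eta * alpha * gs:
            alpha *= 0.5
        w = [wj + alpha * sj for wj, sj in zip(w, s)]
        g, D = grad_and_D(w, X, y, C)
        alphas.append(alpha)
    return w, alphas
```

Returning the step sizes makes it easy to inspect how $\alpha_k$ behaves over the outer iterations.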


Conjugate Gradient (Cont’d)

In addition to line search, trust region is another method to ensure sufficient decrease; see the implementation in LIBLINEAR (Lin et al., 2007), http://www.csie.ntu.edu.tw/~cjlin/liblinear

Note that $\alpha_i$ in CG is different from $\alpha_k$ in the line search procedure

Check Golub and Van Loan (1996) for details of conjugate gradient methods


Homework

Implement

Gradient-descent method with line search

Newton method with line search and CG

in MATLAB, Octave, Python, or R

MATLAB and Octave may be more suitable because of their good support for matrix operations

Train on the data set “kdd2010 (bridge to algebra)” from the LIBSVM Data Sets page, http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets


Homework (Cont’d)

To read data into MATLAB/Octave, check libsvmread.c in the matlab directory of LIBLINEAR

To be more precise, you must build the mex file by

>> mex libsvmread.c

See liblinear/matlab/README for more details
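For those working in Python instead, a minimal sketch of a reader for the LIBSVM text format (`label index:value ...` per line, with feature indices starting at 1) is given below. The function name `parse_libsvm_lines` and its return layout are hypothetical. Rows are kept sparse as dictionaries because data sets of this kind have far too many features for dense storage.

```python
def parse_libsvm_lines(lines):
    """Parse 'label index:value ...' lines into labels and sparse rows.

    Returns (y, rows, n): rows[i] maps a 0-based feature index to its value,
    and n is the largest 1-based feature index seen.
    """
    y, rows, n = [], [], 0
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        y.append(float(parts[0]))
        row = {}
        for item in parts[1:]:
            idx, val = item.split(":")
            row[int(idx) - 1] = float(val)  # LIBSVM indices start at 1
            n = max(n, int(idx))
        rows.append(row)
    return y, rows, n
```

Any iterable of lines works as input, including an open file object.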


Homework (Cont’d)

Let’s use $\eta = 0.01$, $\xi_k = 0.1$, $w^0 = 0$

For the regularization parameter, set $C = 0.1$

It is known that a larger $C$ causes more iterations

You can check the correctness by comparing with the objective function value of LIBLINEAR (option -s 0 for logistic regression)

You may start with a smaller data set


Homework (Cont’d)

For the Newton method, you should observe that in the final iterations the step size $\alpha_k$ becomes 1.

If you don’t see that, you can try a smaller $C = 0.01$ and reduce it to 0.001 or even smaller.

You want to compare the gradient-descent and Newton methods


Homework (Cont’d)

You may also compare yours with LIBLINEAR.

They differ in

MATLAB versus C

line search versus trust region to adjust the Newton direction

We require you to submit

A report of ≤ 4 pages (not including code)

Your code