Logistic Regression

(1)

Logistic Regression

For a label-feature pair (y,x), assume the probability model

$$p(y \mid x) = \frac{1}{1 + e^{-y w^T x}}.$$

$w$ is the parameter to be decided.

Assume $(y_i, x_i)$, $i = 1, \ldots, l$, are training instances.
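As a small illustration of the probability model, the sketch below evaluates $p(y \mid x)$ with NumPy. It assumes labels take values in $\{-1, +1\}$, which the slides do not state explicitly but the form of the model implies; the function name `prob` and the sample vectors are hypothetical.

```python
import numpy as np

def prob(y, x, w):
    """p(y | x) = 1 / (1 + exp(-y * w^T x)); y is assumed to be -1 or +1."""
    return 1.0 / (1.0 + np.exp(-y * np.dot(w, x)))

# Illustrative values only (not from the slides).
w = np.array([0.5, -1.0])
x = np.array([2.0, 0.5])
p_pos = prob(+1, x, w)   # probability of label +1
p_neg = prob(-1, x, w)   # probability of label -1
```

Under this model the two label probabilities always sum to one, since $1/(1+e^{-a}) + 1/(1+e^{a}) = 1$.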

Chih-Jen Lin (National Taiwan Univ.) 1 / 16

(2)

Logistic Regression (Cont’d)

Logistic regression finds w by maximizing the following likelihood

$$\max_w \; \prod_{i=1}^{l} p(y_i \mid x_i). \tag{1}$$

Regularized logistic regression

$$\min_w \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right). \tag{2}$$

$C$: regularization parameter decided by users
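Taking the negative logarithm of the likelihood in (1) turns the product into the sum of log terms in (2); the regularization term is then added. A minimal sketch of evaluating the objective in (2) follows. The function name `objective`, the dense NumPy data matrix `X` (one instance per row), and the toy data are assumptions for illustration.

```python
import numpy as np

def objective(w, X, y, C):
    """0.5 * w^T w + C * sum_i log(1 + exp(-y_i w^T x_i))."""
    z = y * (X @ w)
    # log(1 + exp(-z)) is computed as logaddexp(0, -z) to avoid overflow.
    return 0.5 * (w @ w) + C * np.sum(np.logaddexp(0.0, -z))

# Toy data (illustrative only): 4 instances, 2 features, labels in {-1, +1}.
X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]])
y = np.array([1.0, -1.0, 1.0, 1.0])
f_at_zero = objective(np.zeros(2), X, y, C=0.1)
```

At $w = 0$ every instance contributes $\log 2$, so the value is $C \, l \log 2$, which is a convenient sanity check.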

(3)

Newton Methods

Newton direction

$$\min_s \; \nabla f(w^k)^T s + \frac{1}{2} s^T \nabla^2 f(w^k) s$$

$w^k$: current iterate

This is the same as solving the Newton linear system

$$\nabla^2 f(w^k) s = -\nabla f(w^k)$$
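The equivalence can be checked in one step. Assuming $\nabla^2 f(w^k)$ is positive definite (which holds for the regularized objective, because the logistic loss is convex and the $\frac{1}{2} w^T w$ term is strongly convex), the quadratic model has a unique minimizer where its gradient with respect to $s$ vanishes:

$$\nabla_s \left( \nabla f(w^k)^T s + \frac{1}{2} s^T \nabla^2 f(w^k) s \right) = \nabla f(w^k) + \nabla^2 f(w^k) s = 0 .$$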

(4)

Newton Methods (Cont’d)

Given initial $w^0$ and constant $\eta \in (0, 1)$.

For $k = 0, 1, \ldots$

Solve the Newton linear system to obtain direction $s^k$

Find $\alpha_k$ satisfying

$$f(w^k + \alpha_k s^k) \le f(w^k) + \eta \alpha_k \nabla f(w^k)^T s^k$$

Update $w^{k+1} = w^k + \alpha_k s^k$.
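The loop above can be sketched end to end for a small problem. This is only an illustration under stated assumptions: it forms the Hessian explicitly and calls a dense solver, which is feasible only when the number of features is small (the slides use conjugate gradient instead for large problems). The function names, the iteration cap `max_iter`, the step-size guard, and the synthetic data are all hypothetical.

```python
import numpy as np

def f(w, X, y, C):
    return 0.5 * (w @ w) + C * np.sum(np.logaddexp(0.0, -y * (X @ w)))

def grad(w, X, y, C):
    sigma = 1.0 / (1.0 + np.exp(-y * (X @ w)))
    return w + C * (X.T @ ((sigma - 1.0) * y))

def hess(w, X, y, C):
    sigma = 1.0 / (1.0 + np.exp(-y * (X @ w)))
    D = sigma * (1.0 - sigma)          # equals e^{-z} / (1 + e^{-z})^2
    return np.eye(X.shape[1]) + C * (X.T @ (D[:, None] * X))

def newton(X, y, C, eta=0.01, eps=0.01, max_iter=50):
    w = np.zeros(X.shape[1])
    g0_norm = np.linalg.norm(grad(w, X, y, C))
    for _ in range(max_iter):
        g = grad(w, X, y, C)
        if np.linalg.norm(g) <= eps * g0_norm:   # relative stopping condition
            break
        s = np.linalg.solve(hess(w, X, y, C), -g)  # dense solve (assumption)
        alpha, fw = 1.0, f(w, X, y, C)
        # Backtracking: halve alpha until sufficient decrease holds.
        while f(w + alpha * s, X, y, C) > fw + eta * alpha * (g @ s) and alpha > 1e-10:
            alpha *= 0.5
        w = w + alpha * s
    return w

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = np.where(X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.standard_normal(50) > 0, 1.0, -1.0)
w = newton(X, y, C=0.1)
```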

(5)

Backtracking Line Search

To find $\alpha_k$ satisfying

$$f(w^k + \alpha_k s^k) \le f(w^k) + \eta \alpha_k \nabla f(w^k)^T s^k$$

sequentially check $\alpha_k = 1, 1/2, 1/4, 1/8, \ldots$

Recall the function is

$$\frac{1}{2} w^T w + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right).$$

You save time by the property

$$(w + \alpha d)^T x = w^T x + \alpha d^T x$$
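A sketch of how this property can be used: the products $Xw$ and $Xs$ are computed once, so each trial step size costs only vector operations rather than a new pass over the data. The function name `backtracking`, the cap `max_halvings`, and the toy data are assumptions for illustration; the direction used in the example is the negative gradient, chosen only to keep the example self-contained.

```python
import numpy as np

def backtracking(w, s, g, X, y, C, eta=0.01, max_halvings=30):
    """Return (alpha, X @ (w + alpha * s)) satisfying sufficient decrease."""
    Xw = X @ w            # computed once
    Xs = X @ s            # computed once
    f0 = 0.5 * (w @ w) + C * np.sum(np.logaddexp(0.0, -y * Xw))
    gs = g @ s
    alpha = 1.0
    for _ in range(max_halvings):
        w_new = w + alpha * s
        Xw_new = Xw + alpha * Xs          # (w + alpha s)^T x_i for all i
        f_new = 0.5 * (w_new @ w_new) + C * np.sum(np.logaddexp(0.0, -y * Xw_new))
        if f_new <= f0 + eta * alpha * gs:
            return alpha, Xw_new
        alpha *= 0.5
    raise RuntimeError("line search failed to find a step size")

# Toy example (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = np.where(rng.standard_normal(20) > 0, 1.0, -1.0)
C = 1.0
w = np.zeros(3)
g = w + C * (X.T @ (-0.5 * y))   # gradient at w = 0
s = -g                           # descent direction for the example
alpha, Xw_new = backtracking(w, s, g, X, y, C)
```

Returning the updated products alongside the step size lets the caller reuse them in the next iteration.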

(6)

Backtracking Line Search (Cont’d)

You can keep

$(w^{k+1})^T x = (w^k + \alpha_k s^k)^T x$ for the next iteration

But error propagation is a concern: rounding errors in the incrementally updated values accumulate over iterations

(7)

Stopping Condition

You may use

$$\|\nabla f(w^k)\| \le \epsilon \|\nabla f(w^0)\|$$

This is a relative condition

You may choose

$\epsilon = 0.01$

Note that a smaller $\epsilon$ will cause more Newton iterations

(8)

Newton Linear System

Hessian $\nabla^2 f(w^k)$ too large to be stored

$\nabla^2 f(w^k)$: $n \times n$, $n$: number of features

But Hessian has a special form

$$\nabla^2 f(w) = I + C X^T D X$$

$I$: identity.

$$X = \begin{bmatrix} x_1^T \\ \vdots \\ x_l^T \end{bmatrix}$$

is the data matrix.

(9)

Newton Linear System (Cont’d)

$D$ is diagonal with

$$D_{ii} = \frac{e^{-y_i w^T x_i}}{\left(1 + e^{-y_i w^T x_i}\right)^2}$$

Use the conjugate gradient (CG) method to solve the linear system

$$\nabla^2 f(w^k) s = -\nabla f(w^k)$$

Only a sequence of Hessian-vector products is needed

$$\nabla^2 f(w) s = s + C \cdot X^T (D (X s))$$

Therefore, we have a Hessian-free approach
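The Hessian-vector product can be sketched as follows, evaluating from right to left so that no $n \times n$ matrix is ever formed. The function name and the synthetic data are assumptions; `X` is taken to be a dense NumPy array here, and the diagonal of $D$ is stored as a vector.

```python
import numpy as np

def hessian_vector_product(w, s, X, y, C):
    """Compute (I + C X^T D X) s without forming the Hessian."""
    sigma = 1.0 / (1.0 + np.exp(-y * (X @ w)))
    D = sigma * (1.0 - sigma)        # D_ii = e^{-z_i} / (1 + e^{-z_i})^2
    return s + C * (X.T @ (D * (X @ s)))

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
y = np.where(rng.standard_normal(30) > 0, 1.0, -1.0)
w = rng.standard_normal(4)
s = rng.standard_normal(4)
Hs = hessian_vector_product(w, s, X, y, C=0.1)
```

Each product costs two passes over the data (`X @ s` and `X.T @ ...`), which is what makes the approach practical when $n$ is large.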

(10)

Gradient

We note that the gradient takes the following form

$$\nabla f(w) = w + C \sum_{i=1}^{l} \left( \frac{1}{1 + e^{-y_i w^T x_i}} - 1 \right) y_i x_i$$
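A minimal sketch of the gradient in vectorized form is given below. The function name `gradient` and the synthetic data are assumptions for illustration. A finite-difference comparison against the objective is a simple way to check an implementation like this.

```python
import numpy as np

def gradient(w, X, y, C):
    """w + C * sum_i (1 / (1 + exp(-y_i w^T x_i)) - 1) * y_i * x_i."""
    sigma = 1.0 / (1.0 + np.exp(-y * (X @ w)))
    return w + C * (X.T @ ((sigma - 1.0) * y))

# Synthetic data for illustration only.
rng = np.random.default_rng(2)
X = rng.standard_normal((25, 3))
y = np.where(rng.standard_normal(25) > 0, 1.0, -1.0)
w = rng.standard_normal(3)
g = gradient(w, X, y, C=0.1)
```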

(11)

Conjugate Gradient

Given $\xi_k < 1$. Let $\bar{s}^0 = 0$, $r^0 = -\nabla f(w^k)$, and $d^0 = r^0$.

For $i = 0, 1, \ldots$ (inner iterations)

If $\|r^i\| \le \xi_k \|\nabla f(w^k)\|$, then output $s^k = \bar{s}^i$ and stop.

$\alpha_i = \|r^i\|^2 / ((d^i)^T \nabla^2 f(w^k) d^i)$.

$\bar{s}^{i+1} = \bar{s}^i + \alpha_i d^i$.

$r^{i+1} = r^i - \alpha_i \nabla^2 f(w^k) d^i$.

$\beta_i = \|r^{i+1}\|^2 / \|r^i\|^2$.

$d^{i+1} = r^{i+1} + \beta_i d^i$.
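The inner iterations above translate almost line by line into code. In the sketch below, `hv` is a hypothetical callable that returns the Hessian-vector product for a given vector, and `max_iter` is an added safety cap that is not part of the procedure on the slide. The example system is a small random positive definite matrix, used only for illustration.

```python
import numpy as np

def conjugate_gradient(hv, g, xi=0.1, max_iter=1000):
    """Approximately solve H s = -g, where hv(d) returns H @ d."""
    s = np.zeros_like(g)
    r = -g                    # r^0 = -grad f(w^k)
    d = r.copy()              # d^0 = r^0
    rr = r @ r
    g_norm = np.linalg.norm(g)
    for _ in range(max_iter):
        if np.sqrt(rr) <= xi * g_norm:   # relative stopping condition
            break
        Hd = hv(d)
        alpha = rr / (d @ Hd)
        s = s + alpha * d
        r = r - alpha * Hd
        rr_new = r @ r
        beta = rr_new / rr
        d = r + beta * d
        rr = rr_new
    return s

# Illustrative positive definite system (not from the slides).
rng = np.random.default_rng(3)
B = rng.standard_normal((8, 5))
A = np.eye(5) + B.T @ B
g = rng.standard_normal(5)
s = conjugate_gradient(lambda v: A @ v, g, xi=0.1)
```

Passing the Hessian-vector product as a callable keeps the solver independent of how the product is computed.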

(12)

Conjugate Gradient (Cont’d)

The CG stopping condition

$$\|r^i\| \le \xi_k \|\nabla f(w^k)\|$$

is important

It’s a relative stopping condition. It becomes strict in the end because of small $\|\nabla f(w^k)\|$

Now we only approximately obtain the Newton direction

So convergence is an issue

In addition to line search, trust region is another method to ensure sufficient decrease; see the implementation in LIBLINEAR (Lin et al., 2007)

(13)

Conjugate Gradient (Cont’d)

Note that $\alpha_i$ in CG is different from $\alpha_k$ in the line search procedure

Check Golub and Van Loan (1996) for details of conjugate gradient methods

(14)

Homework

Implement the Newton method with line search and CG in MATLAB, Octave, Python, or R

MATLAB and Octave may be more suitable because of their good support for matrix operations

Train the data set “kdd2010 (bridge to algebra)” at LIBSVM Data Set http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

To read data to MATLAB/Octave, check libsvmread.c in the matlab directory of LIBLINEAR

(15)

Homework (Cont’d)

Let’s use

$\eta = 0.01$, $\xi_k = 0.1$, initial $w^0 = 0$

For logistic regression, set $C = 0.1$

It is known that a larger $C$ causes more iterations

You can check the correctness by comparing with the objective function value of LIBLINEAR (option -s 0 for logistic regression)

You may start with a smaller data set

(16)

Homework (Cont’d)

You want to observe if in final iterations, step size α becomes 1.

If you don’t see that, you can try to use a smaller C = 0.01 and reduce to 0.001 or even smaller.

You may also compare yours with LIBLINEAR.

They differ in

MATLAB versus C

line search versus trust region to adjust the Newton direction

We require you to submit

A report of ≤ 4 pages

Your code