Logistic Regression

(1)

Logistic Regression

For a label-feature pair (y,x), assume the probability model

$$p(y \mid x) = \frac{1}{1 + e^{-y w^T x}}.$$

$w$ is the parameter to be decided

Assume $(y_i, x_i)$, $i = 1, \ldots, l$ are training instances
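
A minimal sketch of the probability model above, assuming labels $y \in \{-1, +1\}$ and dense feature vectors stored as plain Python lists. The function name `prob` and the toy values are illustrative only.

```python
import math

def prob(y, w, x):
    """p(y | x) = 1 / (1 + exp(-y * w^T x)), assuming y is -1 or +1."""
    z = y * sum(wj * xj for wj, xj in zip(w, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # rewrite the formula to avoid overflow when z is very negative
    return ez / (1.0 + ez)

w = [0.5, -1.0]
x = [2.0, 1.0]
# Here w^T x = 0, so both labels are equally likely.
print(prob(+1, w, x), prob(-1, w, x))
```

Under the $\pm 1$ label assumption the two probabilities always sum to one, since $1/(1+e^{-z}) + 1/(1+e^{z}) = 1$.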

Chih-Jen Lin (National Taiwan Univ.) 1 / 16

(2)

Logistic Regression (Cont’d)

Logistic regression finds w by maximizing the following likelihood

$$\max_w \prod_{i=1}^{l} p(y_i \mid x_i). \tag{1}$$

Regularized logistic regression

$$\min_w \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right). \tag{2}$$

$C$: regularization parameter decided by users
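
The regularized objective (2) can be evaluated as in the following sketch. It assumes dense list-of-lists data and $\pm 1$ labels. The helper `log1pexp` and the toy data are illustrative choices, not part of the original slides.

```python
import math

def log1pexp(t):
    """Numerically stable log(1 + exp(t))."""
    if t > 0:
        return t + math.log1p(math.exp(-t))
    return math.log1p(math.exp(t))

def objective(w, X, y, C):
    """f(w) = 0.5 * w^T w + C * sum_i log(1 + exp(-y_i * w^T x_i))."""
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        z = yi * sum(wj * xj for wj, xj in zip(w, xi))
        loss += log1pexp(-z)
    return reg + C * loss

# Toy data (hypothetical): three instances, two features.
X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
y = [1, -1, 1]
# At w = 0 every term is log(2), so f(0) = C * l * log(2).
print(objective([0.0, 0.0], X, y, C=0.1))
```

Computing the log term through `log1pexp` avoids overflow when $-y_i w^T x_i$ is large.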

(3)

Newton Methods

Newton direction

$$\min_s \; \nabla f(w^k)^T s + \frac{1}{2} s^T \nabla^2 f(w^k) s$$

$w^k$: current iterate

This is the same as solving the Newton linear system

$$\nabla^2 f(w^k) s = -\nabla f(w^k)$$
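
A short derivation of why the two formulations agree, assuming $\nabla^2 f(w^k)$ is symmetric positive definite so that the quadratic model has a unique minimizer. The name $q$ for the quadratic model is introduced here only for illustration.

```latex
\begin{aligned}
q(s) &= \nabla f(w^k)^T s + \tfrac{1}{2}\, s^T \nabla^2 f(w^k)\, s, \\
\nabla_s q(s) &= \nabla f(w^k) + \nabla^2 f(w^k)\, s = 0
\;\Longrightarrow\;
\nabla^2 f(w^k)\, s = -\nabla f(w^k).
\end{aligned}
```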

(4)

Newton Methods (Cont’d)

Given initial $w^0$ and a constant $\eta \in (0, 1)$.

For $k = 0, 1, \ldots$

Solve the Newton linear system to obtain direction $s^k$

Find $\alpha_k$ satisfying

$$f(w^k + \alpha_k s^k) \le f(w^k) + \eta \alpha_k \nabla f(w^k)^T s^k$$

Update $w^{k+1} = w^k + \alpha_k s^k$.
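
The outer loop above can be sketched generically as follows. This is an illustration under stated assumptions, not the reference implementation. The callables `f`, `grad`, and `direction`, the caps `max_iter` and `max_halvings`, and the one-dimensional test problem are all hypothetical. The loop ends with the relative gradient-norm test from the Stopping Condition slide.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def newton(f, grad, direction, w0, eta=0.01, eps=0.01, max_iter=100, max_halvings=50):
    """Newton iterations with backtracking line search.

    direction(w, g) should return an (approximate) solution s of
    Hessian(w) s = -g; how it is solved is left to the caller.
    """
    w = list(w0)
    g0_norm = math.sqrt(dot(grad(w), grad(w)))
    for _ in range(max_iter):
        g = grad(w)
        if math.sqrt(dot(g, g)) <= eps * g0_norm:  # relative stopping test
            break
        s = direction(w, g)
        fw, gs, alpha = f(w), dot(g, s), 1.0
        for _ in range(max_halvings):              # try alpha = 1, 1/2, 1/4, ...
            trial = [wi + alpha * si for wi, si in zip(w, s)]
            if f(trial) <= fw + eta * alpha * gs:
                break
            alpha *= 0.5
        w = trial
    return w

# One-instance, one-feature logistic regression with x = 1, y = 1, C = 1.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f(w):
    return 0.5 * w[0] ** 2 + math.log1p(math.exp(-w[0]))

def grad(w):
    return [w[0] + sigmoid(w[0]) - 1.0]

def direction(w, g):
    hessian = 1.0 + sigmoid(w[0]) * (1.0 - sigmoid(w[0]))
    return [-g[0] / hessian]

w_star = newton(f, grad, direction, [0.0], eps=1e-6)
print(w_star, grad(w_star))
```

In the homework setting, `direction` would be replaced by a conjugate gradient solve built on Hessian-vector products.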

(5)

Backtracking Line Search

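To find $\alpha_k$ satisfying

$$f(w^k + \alpha_k s^k) \le f(w^k) + \eta \alpha_k \nabla f(w^k)^T s^k$$

Sequentially check $\alpha_k = 1, 1/2, 1/4, 1/8, \ldots$

Recall the function is

$$\frac{1}{2} w^T w + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right).$$

You save time by the property

$$(w + \alpha d)^T x = w^T x + \alpha d^T x$$

That is, $w^T x_i$ and $d^T x_i$ are computed once and reused for every trial $\alpha$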
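
A sketch of backtracking line search that exploits this property: the inner products $w^T x_i$ and $s^T x_i$ are computed once, so each trial $\alpha$ needs no further pass over the features. The function name `line_search`, the cap `max_halvings`, and the toy data are assumptions made for illustration. The steepest-descent direction is used only to keep the demo self-contained.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def log1pexp(t):
    """Numerically stable log(1 + exp(t))."""
    if t > 0:
        return t + math.log1p(math.exp(-t))
    return math.log1p(math.exp(t))

def line_search(w, s, grad, X, y, C, eta=0.01, max_halvings=30):
    wx = [dot(w, xi) for xi in X]   # w^T x_i, computed once
    sx = [dot(s, xi) for xi in X]   # s^T x_i, computed once
    ww, ws, ss = dot(w, w), dot(w, s), dot(s, s)
    f0 = 0.5 * ww + C * sum(log1pexp(-yi * z) for yi, z in zip(y, wx))
    gs = dot(grad, s)
    alpha = 1.0
    for _ in range(max_halvings):
        # (w + alpha s)^T (w + alpha s) expanded from cached scalars
        reg = 0.5 * (ww + 2.0 * alpha * ws + alpha * alpha * ss)
        loss = sum(log1pexp(-yi * (z + alpha * d)) for yi, z, d in zip(y, wx, sx))
        if reg + C * loss <= f0 + eta * alpha * gs:
            return alpha
        alpha *= 0.5
    return alpha

X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
y = [1, -1, 1]
w = [0.0, 0.0]
grad = [0.05, -0.165]       # gradient of f at w = 0 for this toy data and C = 0.1
s = [-g for g in grad]      # steepest-descent direction, for the demo only
alpha = line_search(w, s, grad, X, y, C=0.1)
print(alpha)
```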

(6)

Backtracking Line Search (Cont’d)

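You can keep

$$(w^{k+1})^T x = (w^k + \alpha_k s^k)^T x$$

for the next iteration

But error propagation is a concern: rounding errors in the incrementally updated values accumulate over iterations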

(7)

Stopping Condition

You may use

$$\|\nabla f(w^k)\| \le \epsilon \|\nabla f(w^0)\|$$

This is a relative condition

You may choose

$$\epsilon = 0.01$$

Note that a smaller $\epsilon$ will cause more Newton iterations

(8)

Newton Linear System

Hessian $\nabla^2 f(w^k)$ is too large to be stored

$\nabla^2 f(w^k)$: $n \times n$, $n$: number of features

But the Hessian has a special form

$$\nabla^2 f(w) = I + C X^T D X$$

$I$: identity.

$$X = \begin{bmatrix} x_1^T \\ \vdots \\ x_l^T \end{bmatrix}$$

is the data matrix.

(9)

Newton Linear System (Cont’d)

$D$ is diagonal with

$$D_{ii} = \frac{e^{-y_i w^T x_i}}{\left(1 + e^{-y_i w^T x_i}\right)^2}$$

Use the conjugate gradient (CG) method to solve the linear system

$$\nabla^2 f(w^k) s = -\nabla f(w^k)$$

Only a sequence of Hessian-vector products is needed

$$\nabla^2 f(w) s = s + C \cdot X^T (D (X s))$$

Therefore, we have a Hessian-free approach
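
The Hessian-vector product can be sketched as below, assuming dense list-of-lists data. The function names and toy values are illustrative. The code uses the identity $D_{ii} = \sigma(z_i)(1 - \sigma(z_i))$ with $z_i = y_i w^T x_i$ and $\sigma(z) = 1/(1+e^{-z})$, which is algebraically the same as the formula for $D_{ii}$ above.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sigmoid(z):
    """Numerically stable 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hessian_vector_product(w, s, X, y, C):
    """Return (I + C X^T D X) s without forming the n x n Hessian."""
    out = list(s)                          # identity part: I s
    for xi, yi in zip(X, y):
        sig = sigmoid(yi * dot(w, xi))
        d_ii = sig * (1.0 - sig)           # i-th diagonal entry of D
        coef = C * d_ii * dot(xi, s)       # C times the i-th entry of D (X s)
        for j, xij in enumerate(xi):
            out[j] += coef * xij           # accumulate X^T (D (X s))
    return out

X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
y = [1, -1, 1]
# With s = e_1 the product is the first column of the Hessian at w = 0.
Hs = hessian_vector_product([0.0, 0.0], [1.0, 0.0], X, y, C=0.1)
print(Hs)
```

Each product costs one pass over the data, so memory stays proportional to the data size rather than $n^2$.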

(10)

Gradient

We note that the gradient takes the following form

$$\nabla f(w) = w + C \sum_{i=1}^{l} \left( \frac{1}{1 + e^{-y_i w^T x_i}} - 1 \right) y_i x_i.$$
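
A sketch of the gradient formula together with a central finite-difference check against the objective. Function names, the step `h`, and the toy data are assumptions for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sigmoid(z):
    """Numerically stable 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def gradient(w, X, y, C):
    """w + C * sum_i (sigmoid(y_i w^T x_i) - 1) * y_i * x_i."""
    g = list(w)
    for xi, yi in zip(X, y):
        coef = C * (sigmoid(yi * dot(w, xi)) - 1.0) * yi
        for j, xij in enumerate(xi):
            g[j] += coef * xij
    return g

def objective(w, X, y, C):
    # -log(sigmoid(z)) equals log(1 + exp(-z))
    loss = sum(-math.log(sigmoid(yi * dot(w, xi))) for xi, yi in zip(X, y))
    return 0.5 * dot(w, w) + C * loss

X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
y = [1, -1, 1]
w = [0.3, -0.2]
g = gradient(w, X, y, C=0.1)
h = 1e-6
for j in range(len(w)):
    wp = list(w); wp[j] += h
    wm = list(w); wm[j] -= h
    fd = (objective(wp, X, y, 0.1) - objective(wm, X, y, 0.1)) / (2.0 * h)
    print(j, g[j], fd)   # the analytic and numerical values should agree closely
```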

(11)

Conjugate Gradient

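Given $\xi_k < 1$. Let $\bar{s}^0 = 0$, $r^0 = -\nabla f(w^k)$, and $d^0 = r^0$.

For $i = 0, 1, \ldots$ (inner iterations)

If $\|r^i\| \le \xi_k \|\nabla f(w^k)\|$, then output $s^k = \bar{s}^i$ and stop.

$\alpha_i = \|r^i\|^2 / ((d^i)^T \nabla^2 f(w^k) d^i)$.

$\bar{s}^{i+1} = \bar{s}^i + \alpha_i d^i$.

$r^{i+1} = r^i - \alpha_i \nabla^2 f(w^k) d^i$.

$\beta_i = \|r^{i+1}\|^2 / \|r^i\|^2$.

$d^{i+1} = r^{i+1} + \beta_i d^i$.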
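
The inner iterations above can be sketched as follows. The Hessian enters only through a callable `hess_vec` that returns $\nabla^2 f(w^k) d$ for a given $d$. The names `conjugate_gradient` and `hess_vec`, the safety cap `max_iter`, and the 2 x 2 test system are assumptions for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def conjugate_gradient(hess_vec, grad, xi=0.1, max_iter=1000):
    """Approximately solve H s = -grad, given only the map d -> H d."""
    s = [0.0] * len(grad)
    r = [-g for g in grad]          # r^0 = -grad
    d = list(r)                     # d^0 = r^0
    rr = dot(r, r)
    tol = xi * math.sqrt(dot(grad, grad))
    for _ in range(max_iter):
        if math.sqrt(rr) <= tol:    # relative stopping condition
            break
        Hd = hess_vec(d)
        alpha = rr / dot(d, Hd)
        s = [si + alpha * di for si, di in zip(s, d)]
        r = [ri - alpha * hdi for ri, hdi in zip(r, Hd)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        d = [ri + beta * di for ri, di in zip(r, d)]
        rr = rr_new
    return s

# Small symmetric positive definite test system: H s = -grad with
# H = [[4, 1], [1, 3]] and grad = [-1, -2], whose solution is [1/11, 7/11].
H = [[4.0, 1.0], [1.0, 3.0]]

def hess_vec(d):
    return [dot(row, d) for row in H]

s = conjugate_gradient(hess_vec, [-1.0, -2.0], xi=1e-10)
print(s)
```

With the homework setting $\xi_k = 0.1$ the loop stops much earlier and returns only an approximate Newton direction.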

(12)

Conjugate Gradient (Cont’d)

The CG stopping condition

$$\|r^i\| \le \xi_k \|\nabla f(w^k)\|$$

is important

It’s a relative stopping condition. It becomes strict in the end because of small $\|\nabla f(w^k)\|$

Now we only approximately obtain the Newton direction

So convergence is an issue

In addition to line search, trust region is another method to ensure sufficient decrease; see the implementation in LIBLINEAR (Lin et al., 2007)

(13)

Conjugate Gradient (Cont’d)

Note that $\alpha_i$ in CG is different from $\alpha_k$ in the line search procedure

Check Golub and Van Loan (1996) for details of conjugate gradient methods

(14)

Homework

Implement the Newton method with line search and CG in MATLAB, Octave, Python, or R

MATLAB and Octave may be more suitable because of their good support for matrix operations

Train the data set “kdd2010 (bridge to algebra)” from the LIBSVM Data Sets page http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

To read data into MATLAB/Octave, check libsvmread.c in the matlab directory of LIBLINEAR

(15)

Homework (Cont’d)

Let’s use

$\eta = 0.01$, $\xi_k = 0.1$, $w = 0$

For logistic regression, set $C = 0.1$

It is known that a larger $C$ causes more iterations

You can check the correctness by comparing with the objective function value of LIBLINEAR (option -s 0 for logistic regression)

You may start with a smaller data set

(16)

Homework (Cont’d)

You want to observe whether the step size $\alpha$ becomes 1 in the final iterations.

If you don’t see that, you can try a smaller $C = 0.01$ and reduce $\epsilon$ to 0.001 or even smaller.

You may also compare yours with LIBLINEAR.

They differ in

MATLAB versus C

line search versus trust region to adjust the Newton direction

We require you to submit

A report of ≤ 4 pages

Your code
