Logistic Regression
For a label-feature pair $(y, x)$, assume the probability model
\[ p(y \mid x) = \frac{1}{1 + e^{-y w^T x}}. \]
$w$ is the parameter to be decided.
Assume $(y_i, x_i)$, $i = 1, \dots, l$, are training instances.
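The probability model can be evaluated directly. The following is a minimal sketch, assuming labels take values in {-1, +1} and vectors are plain Python lists; the helper names `dot` and `prob` and the sample numbers are made up for illustration.

```python
import math

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def prob(y, x, w):
    """p(y | x) = 1 / (1 + exp(-y * w^T x)), assuming y is -1 or +1."""
    z = y * dot(w, x)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # rewritten form avoids overflow when z is very negative
    return ez / (1.0 + ez)

# Illustrative values only (not from the slides).
w = [1.0, -1.0]
x = [2.0, 1.0]
p_pos = prob(+1, x, w)
p_neg = prob(-1, x, w)
print(p_pos, p_neg, p_pos + p_neg)
```

The two probabilities sum to one, which is a quick sanity check that the model is a valid distribution over the two labels.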
Logistic Regression (Cont’d)
Logistic regression finds $w$ by maximizing the following likelihood:
\[ \max_w \prod_{i=1}^{l} p(y_i \mid x_i). \tag{1} \]
Regularized logistic regression:
\[ \min_w \frac{1}{2} w^T w + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right). \tag{2} \]
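As a rough illustration of the regularized objective in (2), the sketch below evaluates it on a tiny made-up data set. The function names, the toy data, and the overflow-safe rewriting of the log term are assumptions added here, not part of the slides.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def log1p_exp_neg(z):
    """log(1 + exp(-z)), computed without overflow for large |z|."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def objective(w, X, y, C):
    """f(w) = 0.5 * w^T w + C * sum_i log(1 + exp(-y_i * w^T x_i))."""
    loss = sum(log1p_exp_neg(yi * dot(w, xi)) for xi, yi in zip(X, y))
    return 0.5 * dot(w, w) + C * loss

# Toy data, made up for illustration: l = 4 instances, n = 2 features.
X = [[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 0.0]]
y = [1, 1, -1, -1]
f0 = objective([0.0, 0.0], X, y, C=0.1)
print(f0)
```

At $w = 0$ every term of the sum equals $\log 2$, so the value should be $C \cdot l \cdot \log 2$; this is a convenient first check for an implementation.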
Gradient-descent Methods
Given initial $w^0$ and constant $\eta \in (0, 1)$.
For $k = 0, 1, \dots$
    Calculate the direction $s^k = -\nabla f(w^k)$
    Find $\alpha_k$ satisfying
    \[ f(w^k + \alpha_k s^k) \le f(w^k) + \eta \alpha_k \nabla f(w^k)^T s^k \]
    Update $w^{k+1} = w^k + \alpha_k s^k$.
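The loop above can be sketched in plain Python as follows. This is only an illustrative sketch: the toy data, the function names, the relative stopping test with `eps`, and the `max_iter` cap are assumptions added for a runnable example, and the step size is found by halving from 1.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def f(w, X, y, C):
    loss = 0.0
    for xi, yi in zip(X, y):
        z = yi * dot(w, xi)
        loss += math.log1p(math.exp(-z)) if z >= 0 else -z + math.log1p(math.exp(z))
    return 0.5 * dot(w, w) + C * loss

def grad(w, X, y, C):
    g = list(w)
    for xi, yi in zip(X, y):
        z = yi * dot(w, xi)
        t = math.exp(-abs(z))
        coef = -t / (1.0 + t) if z >= 0 else -1.0 / (1.0 + t)  # sigma(z) - 1
        for j, xij in enumerate(xi):
            g[j] += C * coef * yi * xij
    return g

def gradient_descent(X, y, C, eta=0.01, eps=0.01, max_iter=1000):
    w = [0.0] * len(X[0])
    g = grad(w, X, y, C)
    g0_norm = math.sqrt(dot(g, g))
    iters = 0
    while iters < max_iter and math.sqrt(dot(g, g)) > eps * g0_norm:
        s = [-gj for gj in g]                 # steepest-descent direction
        fw, slope = f(w, X, y, C), dot(g, s)  # slope = grad^T s < 0
        alpha = 1.0
        trial = [wj + alpha * sj for wj, sj in zip(w, s)]
        while f(trial, X, y, C) > fw + eta * alpha * slope:
            alpha *= 0.5                      # try 1, 1/2, 1/4, ...
            trial = [wj + alpha * sj for wj, sj in zip(w, s)]
        w, g = trial, grad(trial, X, y, C)
        iters += 1
    return w, iters

# Toy data, made up for illustration.
X = [[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 0.0]]
y = [1, 1, -1, -1]
w, iters = gradient_descent(X, y, C=0.1)
print(w, iters)
```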
Gradient
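We note that the gradient takes the following form:
\[
\nabla f(w) = w + C \sum_{i=1}^{l} \left( \frac{e^{-y_i w^T x_i}}{1 + e^{-y_i w^T x_i}} \right) (-y_i x_i)
            = w + C \sum_{i=1}^{l} \left( \frac{1}{1 + e^{-y_i w^T x_i}} - 1 \right) y_i x_i
\]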
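One way to gain confidence in a gradient implementation is to compare it with central finite differences of the objective. The sketch below does that on made-up data; the names `f`, `grad`, `numeric_grad`, the step `h`, and the test point are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def f(w, X, y, C):
    loss = 0.0
    for xi, yi in zip(X, y):
        z = yi * dot(w, xi)
        loss += math.log1p(math.exp(-z)) if z >= 0 else -z + math.log1p(math.exp(z))
    return 0.5 * dot(w, w) + C * loss

def grad(w, X, y, C):
    """w + C * sum_i (1 / (1 + exp(-y_i w^T x_i)) - 1) * y_i * x_i."""
    g = list(w)
    for xi, yi in zip(X, y):
        z = yi * dot(w, xi)
        t = math.exp(-abs(z))
        coef = -t / (1.0 + t) if z >= 0 else -1.0 / (1.0 + t)  # sigma(z) - 1
        for j, xij in enumerate(xi):
            g[j] += C * coef * yi * xij
    return g

def numeric_grad(w, X, y, C, h=1e-6):
    """Central differences, used only as a check on grad()."""
    out = []
    for j in range(len(w)):
        wp, wm = list(w), list(w)
        wp[j] += h
        wm[j] -= h
        out.append((f(wp, X, y, C) - f(wm, X, y, C)) / (2.0 * h))
    return out

# Toy data and test point, made up for illustration.
X = [[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 0.0]]
y = [1, 1, -1, -1]
w = [0.3, -0.2]
g = grad(w, X, y, 0.1)
ng = numeric_grad(w, X, y, 0.1)
max_diff = max(abs(a - b) for a, b in zip(g, ng))
print(g, ng, max_diff)
```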
Backtracking Line Search
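To find $\alpha_k$ satisfying
\[ f(w^k + \alpha_k s^k) \le f(w^k) + \eta \alpha_k \nabla f(w^k)^T s^k \]
sequentially check $\alpha_k = 1, 1/2, 1/4, 1/8, \dots$
Recall the function is
\[ f(w) = \frac{1}{2} w^T w + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right). \]
You save time by the property
\[ (w + \alpha d)^T x = w^T x + \alpha d^T x, \]
so $w^T x_i$ and $d^T x_i$ are computed once per line search and reused for every trial $\alpha_k$.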
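The saving can be sketched as follows: the inner products with $w$ and with the direction are cached, so each trial step only needs a pass over $l$ numbers. The function names, the cap `max_halvings`, the similar expansion of the regularization term, and the toy data (with a deliberately large $C$ so that several halvings occur) are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def log1p_exp_neg(z):
    return math.log1p(math.exp(-z)) if z >= 0 else -z + math.log1p(math.exp(z))

def backtracking(w, s, g, X, y, C, eta=0.01, max_halvings=50):
    """Return a step size satisfying the sufficient decrease condition."""
    Xw = [dot(xi, w) for xi in X]  # w^T x_i, computed once
    Xs = [dot(xi, s) for xi in X]  # s^T x_i, computed once
    ww, ws, ss = dot(w, w), dot(w, s), dot(s, s)

    def f_at(alpha):
        # (w + alpha s)^T x_i = w^T x_i + alpha * s^T x_i, so no new
        # pass over the feature values is needed for a trial alpha.
        reg = 0.5 * (ww + 2.0 * alpha * ws + alpha * alpha * ss)
        loss = sum(log1p_exp_neg(yi * (a + alpha * b))
                   for yi, a, b in zip(y, Xw, Xs))
        return reg + C * loss

    f0, slope = f_at(0.0), dot(g, s)
    alpha = 1.0
    for _ in range(max_halvings):
        if f_at(alpha) <= f0 + eta * alpha * slope:
            break
        alpha *= 0.5
    return alpha

# Toy data, made up for illustration.
X = [[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 0.0]]
y = [1, 1, -1, -1]
C = 10.0
w = [0.0, 0.0]
# At w = 0 each sigmoid equals 1/2, so the gradient is -(C/2) * sum_i y_i x_i.
g = [-0.5 * C * sum(yi * xi[j] for xi, yi in zip(X, y)) for j in range(2)]
s = [-gj for gj in g]
alpha = backtracking(w, s, g, X, y, C)
w_new = [wj + alpha * sj for wj, sj in zip(w, s)]
print(alpha, w_new)
```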
Backtracking Line Search (Cont’d)
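You can keep
\[ (w^{k+1})^T x = (w^k + \alpha_k s^k)^T x \]
for the next iteration.
But error propagation is a concern: rounding errors in the cached values accumulate from one iteration to the next.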
Stopping Condition
You can use
\[ \|\nabla f(w^k)\| \le \epsilon \|\nabla f(w^0)\| \]
This is a relative condition.
You may choose $\epsilon = 0.01$.
Note that a smaller $\epsilon$ will cause more iterations.
You may need to set a maximal number of iterations as well.
Newton Methods
Newton direction
\[ \min_s \; \nabla f(w^k)^T s + \frac{1}{2} s^T \nabla^2 f(w^k) s \]
$w^k$: current iterate
This is the same as solving the Newton linear system
\[ \nabla^2 f(w^k) s = -\nabla f(w^k) \]
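The equivalence between the quadratic subproblem and the linear system follows from setting the gradient with respect to $s$ to zero. A short derivation, relying on the fact that the Hessian of the regularized objective is positive definite (so the stationary point is the unique minimizer):

```latex
\begin{align*}
q(s) &= \nabla f(w^k)^T s + \tfrac{1}{2}\, s^T \nabla^2 f(w^k)\, s, \\
\nabla_s q(s) &= \nabla f(w^k) + \nabla^2 f(w^k)\, s = 0
\quad\Longleftrightarrow\quad
\nabla^2 f(w^k)\, s = -\nabla f(w^k).
\end{align*}
```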
Newton Methods (Cont’d)
Given initial $w^0$ and constant $\eta \in (0, 1)$.
For $k = 0, 1, \dots$
    Solve the Newton linear system to obtain the direction $s^k$
    Find $\alpha_k$ satisfying
    \[ f(w^k + \alpha_k s^k) \le f(w^k) + \eta \alpha_k \nabla f(w^k)^T s^k \]
    Update $w^{k+1} = w^k + \alpha_k s^k$.
Newton Linear System
Hessian $\nabla^2 f(w^k)$ is too large to be stored
$\nabla^2 f(w^k)$: $n \times n$, $n$: number of features
But the Hessian has a special form
\[
\nabla^2 f(w) = I + C \sum_{i=1}^{l} y_i x_i \left( \frac{e^{-y_i w^T x_i}}{(1 + e^{-y_i w^T x_i})^2} \right) y_i x_i^T
              = I + C X^T D X
\]
Newton Linear System (Cont’d)
$I$: identity.
\[ X = \begin{bmatrix} x_1^T \\ \vdots \\ x_l^T \end{bmatrix} \]
is the data matrix.
$D$ is diagonal with
\[ D_{ii} = \frac{e^{-y_i w^T x_i}}{(1 + e^{-y_i w^T x_i})^2} \]
Newton Linear System (Cont’d)
Use the conjugate gradient (CG) method to solve the linear system
\[ \nabla^2 f(w^k) s = -\nabla f(w^k) \]
Only a sequence of Hessian-vector products is needed:
\[ \nabla^2 f(w) s = s + C \cdot X^T (D (X s)) \]
Therefore, we have a Hessian-free approach.
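A Hessian-vector product in this form might look like the sketch below, which never builds the $n \times n$ matrix. The function name, the toy data, and the symmetric rewriting of $D_{ii}$ in terms of $e^{-|z|}$ (used only to avoid overflow) are assumptions added here.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def hessian_vector_product(w, s, X, y, C):
    """Return (I + C X^T D X) s using only products with rows of X."""
    Hs = list(s)  # identity part: I s = s
    for xi, yi in zip(X, y):
        z = yi * dot(w, xi)
        t = math.exp(-abs(z))
        d_ii = t / (1.0 + t) ** 2        # D_ii; the formula is symmetric in z
        scale = C * d_ii * dot(xi, s)    # C * D_ii * (X s)_i
        for j, xij in enumerate(xi):
            Hs[j] += scale * xij         # accumulate C * X^T (D (X s))
    return Hs

# Toy data, made up for illustration.
X = [[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 0.0]]
y = [1, 1, -1, -1]
Hs = hessian_vector_product([0.0, 0.0], [1.0, 0.0], X, y, C=0.1)
print(Hs)
```

At $w = 0$ every $D_{ii}$ equals $1/4$, so the result is the first column of $I + (C/4) X^T X$, which can be checked by hand on small data.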
Conjugate Gradient
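Given $\xi_k < 1$. Let $\bar{s}^0 = 0$, $r^0 = -\nabla f(w^k)$, and $d^0 = r^0$.
For $i = 0, 1, \dots$ (inner iterations)
    If $\|r^i\| \le \xi_k \|\nabla f(w^k)\|$, then output $s^k = \bar{s}^i$ and stop.
    $\alpha_i = \|r^i\|^2 / ((d^i)^T \nabla^2 f(w^k) d^i)$.
    $\bar{s}^{i+1} = \bar{s}^i + \alpha_i d^i$.
    $r^{i+1} = r^i - \alpha_i \nabla^2 f(w^k) d^i$.
    $\beta_i = \|r^{i+1}\|^2 / \|r^i\|^2$.
    $d^{i+1} = r^{i+1} + \beta_i d^i$.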
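The inner iterations above can be sketched as a function that only needs a callable returning Hessian-vector products. The name `conjugate_gradient`, the `hess_vec` callable, the `max_inner` cap, and the small 2-by-2 test system are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def conjugate_gradient(hess_vec, g, xi_k=0.1, max_inner=100):
    """Approximately solve H s = -g, where hess_vec(d) returns H d."""
    s = [0.0] * len(g)
    r = [-gj for gj in g]          # r^0 = -gradient
    d = list(r)                    # d^0 = r^0
    rr = dot(r, r)
    tol = xi_k * math.sqrt(dot(g, g))
    for _ in range(max_inner):
        if math.sqrt(rr) <= tol:   # relative stopping condition
            break
        Hd = hess_vec(d)
        alpha = rr / dot(d, Hd)
        s = [sj + alpha * dj for sj, dj in zip(s, d)]
        r = [rj - alpha * hj for rj, hj in zip(r, Hd)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        d = [rj + beta * dj for rj, dj in zip(r, d)]
        rr = rr_new
    return s

# Small symmetric positive definite system, made up for illustration.
H = [[4.0, 1.0], [1.0, 3.0]]
g = [-1.0, -2.0]                   # so the system is H s = [1, 2]
s = conjugate_gradient(lambda d: [dot(row, d) for row in H], g, xi_k=1e-10)
print(s)
```

With a very small `xi_k` the result approaches the exact solution of the linear system; with `xi_k = 0.1` the loop stops earlier and returns only an approximate direction.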
Conjugate Gradient (Cont’d)
The CG stopping condition
\[ \|r^i\| \le \xi_k \|\nabla f(w^k)\| \]
is important.
It’s a relative stopping condition. It becomes strict in the end because of small $\|\nabla f(w^k)\|$.
Because CG stops once this condition holds, we only approximately obtain the Newton direction.
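Putting the pieces together, an inexact Newton method with CG and backtracking line search might be sketched as below. This is an illustrative sketch on made-up data, not the LIBLINEAR implementation; the function names, the caps `max_inner` and `max_iter`, and the outer stopping test with `eps` are assumptions.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def margins(w, X, y):
    return [yi * dot(w, row) for row, yi in zip(X, y)]

def f(w, X, y, C):
    loss = sum(math.log1p(math.exp(-z)) if z >= 0 else -z + math.log1p(math.exp(z))
               for z in margins(w, X, y))
    return 0.5 * dot(w, w) + C * loss

def grad(w, X, y, C):
    g = list(w)
    for row, yi, z in zip(X, y, margins(w, X, y)):
        t = math.exp(-abs(z))
        coef = -t / (1.0 + t) if z >= 0 else -1.0 / (1.0 + t)  # sigma(z) - 1
        for j, v in enumerate(row):
            g[j] += C * coef * yi * v
    return g

def hess_vec(w, d, X, y, C):
    Hd = list(d)
    for row, z in zip(X, margins(w, X, y)):
        t = math.exp(-abs(z))
        scale = C * t / (1.0 + t) ** 2 * dot(row, d)  # C * D_ii * x_i^T d
        for j, v in enumerate(row):
            Hd[j] += scale * v
    return Hd

def cg(w, g, X, y, C, xi_k, max_inner=50):
    s, r = [0.0] * len(g), [-gj for gj in g]
    d, rr = list(r), dot(g, g)
    tol = xi_k * math.sqrt(rr)
    for _ in range(max_inner):
        if math.sqrt(rr) <= tol:
            break
        Hd = hess_vec(w, d, X, y, C)
        a = rr / dot(d, Hd)
        s = [sj + a * dj for sj, dj in zip(s, d)]
        r = [rj - a * hj for rj, hj in zip(r, Hd)]
        rr_new = dot(r, r)
        d = [rj + (rr_new / rr) * dj for rj, dj in zip(r, d)]
        rr = rr_new
    return s

def newton(X, y, C, eta=0.01, xi_k=0.1, eps=0.01, max_iter=100):
    w = [0.0] * len(X[0])
    g = grad(w, X, y, C)
    g0_norm, steps = math.sqrt(dot(g, g)), []
    while len(steps) < max_iter and math.sqrt(dot(g, g)) > eps * g0_norm:
        s = cg(w, g, X, y, C, xi_k)           # approximate Newton direction
        fw, slope, alpha = f(w, X, y, C), dot(g, s), 1.0
        trial = [wj + alpha * sj for wj, sj in zip(w, s)]
        while f(trial, X, y, C) > fw + eta * alpha * slope:
            alpha *= 0.5
            trial = [wj + alpha * sj for wj, sj in zip(w, s)]
        w, g = trial, grad(trial, X, y, C)
        steps.append(alpha)                   # step sizes, for inspection
    return w, steps

# Toy data, made up for illustration.
X = [[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 0.0]]
y = [1, 1, -1, -1]
w, steps = newton(X, y, C=0.1)
print(w, steps)
```

The returned list of step sizes makes it easy to inspect how the line search behaves from one outer iteration to the next.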
Conjugate Gradient (Cont’d)
In addition to line search, trust region is another method to ensure sufficient decrease; see the implementation in LIBLINEAR (Lin et al., 2007)
http://www.csie.ntu.edu.tw/~cjlin/liblinear
Note that $\alpha_i$ in CG is different from $\alpha_k$ in the line search procedure.
Check Golub and Van Loan (1996) for details of conjugate gradient methods.
Homework
Implement
    Gradient-descent method with line search
    Newton method with line search and CG
in MATLAB, Octave, Python, or R.
MATLAB and Octave may be more suitable because of their good support for matrix operations.
Train the data set “kdd2010 (bridge to algebra)” at LIBSVM Data Sets
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets
Homework (Cont’d)
To read data into MATLAB/Octave, check libsvmread.c in the matlab directory of LIBLINEAR.
To be more precise, you must build the mex file by
>> mex libsvmread.c
See liblinear/matlab/README for more details
Homework (Cont’d)
Let’s use $\eta = 0.01$, $\xi_k = 0.1$, $w^0 = 0$.
For the regularization parameter, set $C = 0.1$.
It is known that a larger $C$ causes more iterations.
You can check the correctness by comparing with the objective function value of LIBLINEAR (option -s 0 for logistic regression).
You may start with a smaller data set.
Homework (Cont’d)
For the Newton method, you should observe that in the final iterations the step size $\alpha_k$ becomes 1.
If you don’t see that, try a smaller $C = 0.01$, then reduce it to 0.001 or even smaller.
You should compare the gradient-descent and Newton methods.
Homework (Cont’d)
You may also compare yours with LIBLINEAR.
They differ in
    MATLAB versus C
    line search versus trust region to adjust the Newton direction
We require you to submit
    A report of ≤ 4 pages (without including code)
    Your code