Sparse Representation and Optimization Methods for L1-regularized Problems
Chih-Jen Lin
Department of Computer Science, National Taiwan University
Outline
Sparse Representation
Existing Optimization Methods
  Coordinate Descent Methods
  Other Methods
Experiments
Sparse Representation
A mathematical way to model a signal, an image, or a document is
$y = Xw = w_1 \begin{bmatrix} x_{11} \\ \vdots \\ x_{l1} \end{bmatrix} + \cdots + w_n \begin{bmatrix} x_{1n} \\ \vdots \\ x_{ln} \end{bmatrix}$
A signal is a linear combination of others
X and y are given
We would like to find w with as few non-zeros as possible (sparsity)
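A minimal sketch of this model in code (the sizes, the random dictionary, and the chosen non-zero positions are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
l, n = 100, 50                      # l: signal length, n: number of columns in X
X = rng.standard_normal((l, n))     # columns of X are the "other" signals
w = np.zeros(n)
w[[3, 17, 42]] = [1.5, -2.0, 0.7]   # only three non-zero coefficients
y = X @ w                           # y is a linear combination of 3 columns of X
print(np.count_nonzero(w), "non-zeros out of", n)
```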
Example: Image Deblurring
Consider
y = Hz
z: original image, H: blur operator, y: observed image
Assume $z = Dw$ with a known dictionary D
Try to
$\min_w \|y - HDw\|$ and get $\hat{w}$
Example: Image Deblurring (Cont’d)
We hope w has few non-zeros, as each image is generated from only a few columns of the dictionary
The restored image is $D\hat{w}$
Example: Face Recognition
Assume a face image is a combination of the same person’s other images
$\begin{bmatrix} x_{11} \\ \vdots \\ x_{l1} \end{bmatrix}$: 1st image, $\quad \begin{bmatrix} x_{12} \\ \vdots \\ x_{l2} \end{bmatrix}$: 2nd image, $\ldots$
l: number of pixels in a face image
Given a face image y and collections $X_1$ and $X_2$ of two persons' faces
Example: Face Recognition (Cont’d)
If
$\min_w \|y - X_1 w\| < \min_w \|y - X_2 w\|$,
predict y as the first person
We hope w has few non-zeros, as noisy images shouldn't be used
Example: Feature Selection
Given
$X = \begin{bmatrix} x_{11} & \ldots & x_{1n} \\ \vdots & & \vdots \\ x_{l1} & \ldots & x_{ln} \end{bmatrix}$,
$x_{i1}, \ldots, x_{in}$: the ith document, $y_i = +1$ or $-1$ (two classes)
We hope to find w such that
$w^T x_i \begin{cases} > 0 & \text{if } y_i = 1, \\ < 0 & \text{if } y_i = -1 \end{cases}$
Example: Feature Selection (Cont’d)
Try to
$\min_w \sum_{i=1}^l e^{-y_i w^T x_i}$
and hope that w is sparse
That is, we assume that each document is generated from important features
$w_i \neq 0 \Rightarrow$ important feature
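A minimal sketch of evaluating this objective (before adding any regularization); X, y, and w are assumed toy data:

```python
import numpy as np

def exp_loss(w, X, y):
    # sum_i exp(-y_i w^T x_i); X: l x n documents, y: labels in {+1, -1}
    return np.sum(np.exp(-y * (X @ w)))

# Features j with w[j] != 0 are the ones regarded as important.
```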
L1-norm Minimization I
Finding w with the smallest number of non-zeros is difficult
$\|w\|_0$: number of non-zeros
Instead, L1-norm minimization:
$\min_w \; C \|y - Xw\|^2 + \|w\|_1$
C: a parameter given by users
L1-norm Minimization II
1-norm versus 2-norm
$\|w\|_1 = |w_1| + \cdots + |w_n|$, $\quad \|w\|_2^2 = w_1^2 + \cdots + w_n^2$
Two figures:
(Figures: $|w|$ and $w^2$ as functions of w)
L1-norm Minimization III
If using the 2-norm, all $w_i$ are non-zero
Using the 1-norm, many $w_i$ may be zero
Smaller C, better sparsity
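For illustration only (this is not one of the methods surveyed later): a hedged proximal-gradient (ISTA) sketch for $\min_w C\|y - Xw\|^2 + \|w\|_1$, useful for observing how the number of non-zeros changes with C; all names and settings are assumptions:

```python
import numpy as np

def soft_threshold(v, t):
    # componentwise solution of min_u 0.5*(u - v)^2 + t*|u|
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_least_squares(X, y, C, iters=500):
    l, n = X.shape
    w = np.zeros(n)
    step = 1.0 / (2 * C * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the smooth part
    for _ in range(iters):
        grad = 2 * C * X.T @ (X @ w - y)               # gradient of C * ||y - Xw||^2
        w = soft_threshold(w - step * grad, step)      # proximal step for step * ||.||_1
    return w

# A smaller C puts relatively more weight on ||w||_1, hence a sparser w.
```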
L1-regularized Classifier
Training data $\{y_i, x_i\}$, $x_i \in R^n$, $y_i = \pm 1$, $i = 1, \ldots, l$
l: # of data, n: # of features
$\min_w \; \|w\|_1 + C \sum_{i=1}^l \xi(w; x_i, y_i)$
$\xi(w; x_i, y_i)$: loss function
Logistic loss: $\log(1 + e^{-y w^T x})$
L1 and L2 losses: $\max(1 - y w^T x, 0)$ and $\max(1 - y w^T x, 0)^2$
We do not consider kernels
L1-regularized Classifier (Cont’d)
$\|w\|_1$ is not differentiable ⇒ causes difficulties in optimization
Loss functions: the logistic loss is twice differentiable, the L2 loss is differentiable, and the L1 loss is not differentiable
We focus on the logistic and L2 losses
Sometimes a bias term is added: $w^T x \Rightarrow w^T x + b$
L1-regularized Classifier (Cont’d)
Many methods are available; we review existing methods and show details of some
Notation:
$f(w) \equiv \|w\|_1 + C \sum_{i=1}^l \xi(w; x_i, y_i)$
is the function to be minimized, and
$L(w) \equiv C \sum_{i=1}^l \xi(w; x_i, y_i)$.
We do not discuss L1-regularized regression, another recently active topic
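A minimal sketch of this notation for the logistic loss, assuming a data matrix X of shape l × n, labels y in {+1, −1}, and a user-given C:

```python
import numpy as np

def L(w, X, y, C):
    # C * sum_i log(1 + exp(-y_i w^T x_i)); logaddexp(0, -m) = log(1 + exp(-m)), computed stably
    return C * np.sum(np.logaddexp(0.0, -y * (X @ w)))

def f(w, X, y, C):
    # ||w||_1 + L(w)
    return np.linalg.norm(w, 1) + L(w, X, y, C)
```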
Existing Optimization Methods
Decomposition Methods
Work on some variables at a time
Cyclic coordinate descent methods: working variables are sequentially or randomly selected
One-variable case:
$\min_z f(w + z e_j) - f(w)$
$e_j$: indicator vector for the jth element
Examples: Goodman (2004); Genkin et al. (2007);
Balakrishnan and Madigan (2005); Tseng and Yun (2007); Shalev-Shwartz and Tewari (2009); Duchi and Singer (2009); Wright (2010)
Decomposition Methods (Cont’d)
Gradient-based working set selection
Higher cost per iteration; larger working set
Examples: Shevade and Keerthi (2003); Tseng and Yun (2007); Yun and Toh (2009)
Active set method
Working set the same as the set of non-zero w elements
Example: Perkins et al. (2003)
Constrained Optimization
Replace w with $w^+ - w^-$:
$\min_{w^+, w^-} \; \sum_{j=1}^n w_j^+ + \sum_{j=1}^n w_j^- + C \sum_{i=1}^l \xi(w^+ - w^-; x_i, y_i)$
subject to $w_j^+ \geq 0$, $w_j^- \geq 0$, $j = 1, \ldots, n$.
Any bound-constrained optimization methods can be used
Examples: Schmidt et al. (2009) used Gafni and Bertsekas (1984); Kazama and Tsujii (2003) used Benson and Moré (2001); we have considered Lin and Moré (1999); Koh et al. (2007): interior-point method
Constrained Optimization (Cont’d)
Equivalent problem with non-smooth constraints:
$\min_w \sum_{i=1}^l \xi(w; x_i, y_i)$ subject to $\|w\|_1 \leq K$.
C replaced by a corresponding K
Reduces to LASSO (Tibshirani, 1996) if $y \in R$ and the least-square loss is used
Examples: Kivinen and Warmuth (1997); Lee et al. (2006); Donoho and Tsaig (2008); Duchi et al. (2008)
Other Methods
Expectation maximization: Figueiredo (2003);
Krishnapuram et al. (2004, 2005).
Stochastic gradient descent: Langford et al. (2009);
Shalev-Shwartz and Tewari (2009)
Modified quasi-Newton: Andrew and Gao (2007);
Yu et al. (2010)
Hybrid: easy method first and then interior-point for faster local convergence (Shi et al., 2010)
Other Methods (Cont’d)
Quadratic approximation followed by coordinate descent: Krishnapuram et al. (2005); Friedman et al. (2010); a kind of Newton approach
Cutting plane method: Teo et al. (2010)
Some methods find a solution path for different C values; e.g., Rosset (2005), Zhao and Yu (2007), Park and Hastie (2007), and Keerthi and Shevade (2007).
Here we focus on a single C
Strengths and Weaknesses of Existing Methods
Convergence speed: higher-order methods (quasi-Newton or Newton) have fast local convergence, but fail to obtain a reasonable model quickly
Implementation effort: higher-order methods are usually more complicated
Large data: if solving linear systems is needed, use iterative methods (e.g., CG) instead of direct methods
Feature correlation: methods working on some variables at a time (e.g., decomposition methods) may be efficient if features are almost independent
Coordinate Descent Methods
Coordinate Descent Methods I
Minimizing the one-variable function
$g_j(z) \equiv |w_j + z| - |w_j| + L(w + z e_j) - L(w)$,
where
$e_j \equiv [\underbrace{0, \ldots, 0}_{j-1}, 1, 0, \ldots, 0]^T$
No closed-form solution
Genkin et al. (2007), Shalev-Shwartz and Tewari (2009), and Yuan et al. (2010)
Coordinate Descent Methods II
They differ in how to minimize this one-variable problem
While $g_j(z)$ is not differentiable, we can have a form similar to a Taylor expansion:
$g_j(z) = g_j(0) + g_j'(0)z + \frac{1}{2} g_j''(\eta z) z^2$
Another representation (for our derivation):
$\min_z \; g_j(z) = |w_j + z| - |w_j| + L_j(z; w) - L_j(0; w)$,
Coordinate Descent Methods III
where
$L_j(z; w) \equiv L(w + z e_j)$
is a function of z
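A hedged sketch of evaluating $g_j(z)$ straight from its definition, reusing the L(...) helper sketched earlier (the helper names are assumptions, and this recomputes L from scratch; real implementations maintain $w^T x_i$ incrementally):

```python
def g_j(z, j, w, X, y, C):
    # g_j(z) = |w_j + z| - |w_j| + L(w + z*e_j) - L(w)
    w_new = w.copy()
    w_new[j] += z
    return (abs(w[j] + z) - abs(w[j])
            + L(w_new, X, y, C) - L(w, X, y, C))
```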
BBR (Genkin et al., 2007) I
They rewrite $g_j(z)$ as
$g_j(z) = g_j(0) + g_j'(0)z + \frac{1}{2} g_j''(\eta z) z^2$, where $0 < \eta < 1$ and
$g_j'(0) \equiv \begin{cases} L_j'(0) + 1 & \text{if } w_j > 0, \\ L_j'(0) - 1 & \text{if } w_j < 0 \end{cases}$  (1)
BBR (Genkin et al., 2007) II
$g_j(z)$ is not differentiable if $w_j = 0$
BBR finds an upper bound $U_j$ of $g_j''(z)$ in a trust region:
$U_j \geq g_j''(z), \; \forall |z| \leq \Delta_j$.
Then $\hat{g}_j(z)$ is an upper-bound function of $g_j(z)$:
$\hat{g}_j(z) \equiv g_j(0) + g_j'(0)z + \frac{1}{2} U_j z^2$.
Any step z satisfying $\hat{g}_j(z) < \hat{g}_j(0)$ leads to
$g_j(z) - g_j(0) = g_j(z) - \hat{g}_j(0) \leq \hat{g}_j(z) - \hat{g}_j(0) < 0$.
BBR (Genkin et al., 2007) III
Convergence not proved (no sufficient decrease condition via line search)
For the logistic loss,
$U_j \equiv C \sum_{i=1}^l x_{ij}^2 \, F(y_i w^T x_i, \Delta_j |x_{ij}|)$,
where
$F(r, \delta) = \begin{cases} 0.25 & \text{if } |r| \leq \delta, \\ \dfrac{1}{2 + e^{|r| - \delta} + e^{\delta - |r|}} & \text{otherwise} \end{cases}$
BBR (Genkin et al., 2007) IV
The sub-problem solved in practice:
$\min_z \; \hat{g}_j(z)$
subject to $|z| \leq \Delta_j$ and $w_j + z \begin{cases} \geq 0 & \text{if } w_j > 0, \\ \leq 0 & \text{if } w_j < 0 \end{cases}$
Update rule:
$d = \min\Big(\max\Big(P\Big(\dfrac{-g_j'(0)}{U_j}, w_j\Big), -\Delta_j\Big), \Delta_j\Big)$,
BBR (Genkin et al., 2007) V
where
$P(z, w) \equiv \begin{cases} z & \text{if } \operatorname{sgn}(w + z) = \operatorname{sgn}(w), \\ -w & \text{otherwise} \end{cases}$
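A hedged sketch of BBR's one-variable update as stated on these slides, for the case $w_j \neq 0$ (the case $w_j = 0$ needs the extra handling described in Genkin et al., 2007); Lp stands for $L_j'(0)$, U for the bound $U_j$, and delta for $\Delta_j$:

```python
import numpy as np

def P(z, w):
    # keep the step on the same side as w; otherwise truncate at -w (i.e., stop at zero)
    return z if np.sign(w + z) == np.sign(w) else -w

def bbr_step(w_j, Lp, U, delta):
    gp = Lp + 1.0 if w_j > 0 else Lp - 1.0       # g'_j(0) from (1)
    d = P(-gp / U, w_j)                          # minimizer of ghat_j, kept on w_j's side
    return min(max(d, -delta), delta)            # project into the trust region |z| <= delta
```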
SCD (Shalev-Shwartz and Tewari, 2009) I
SCD: stochastic coordinate descent
$w = w^+ - w^-$
At each step, randomly select a variable from $\{w_1^+, \ldots, w_n^+, w_1^-, \ldots, w_n^-\}$
One-variable sub-problem:
$\min_z \; g_j(z) \equiv z + L_j(z; w^+ - w^-) - L_j(0; w^+ - w^-)$,
subject to
$w_j^+ + z \geq 0$ or $w_j^- + z \geq 0$,
SCD (Shalev-Shwartz and Tewari, 2009) II
Second-order approximation similar to BBR:
$\hat{g}_j(z) = g_j(0) + g_j'(0)z + \frac{1}{2} U_j z^2$, where
$g_j'(0) = \begin{cases} 1 + L_j'(0) & \text{for } w_j^+ \\ 1 - L_j'(0) & \text{for } w_j^- \end{cases}$ and $U_j \geq g_j''(z), \; \forall z$.
BBR: $U_j$ is an upper bound of $g_j''(z)$ only in the trust region
SCD (Shalev-Shwartz and Tewari, 2009) III
For logistic regression,
$U_j = 0.25\,C \sum_{i=1}^l x_{ij}^2 \geq g_j''(z), \; \forall z$.
Shalev-Shwartz and Tewari (2009) assume $-1 \leq x_{ij} \leq 1, \; \forall i, j$, so a simple upper bound is $U_j = 0.25\,C\,l$.
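A sketch of one SCD update on a selected non-negative variable, say $w_j^+$: minimize the quadratic upper bound subject to $w_j^+ + z \geq 0$. The closed form below is my reading of that sub-problem, not code from the paper; Lp is $L_j'(0)$ at the current $w = w^+ - w^-$:

```python
def scd_step_plus(w_plus_j, Lp, U):
    gp = 1.0 + Lp                # g'_j(0) for the w_j^+ variable
    z = -gp / U                  # unconstrained minimizer of ghat_j
    return max(z, -w_plus_j)     # enforce w_j^+ + z >= 0

# For w_j^-, use g'_j(0) = 1 - L'_j(0) and enforce w_j^- + z >= 0 in the same way.
```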
CDN (Yuan et al., 2010) I
Newton step:
$\min_z \; g_j'(0)z + \frac{1}{2} g_j''(0) z^2$.
That is,
$\min_z \; |w_j + z| - |w_j| + L_j'(0)z + \frac{1}{2} L_j''(0) z^2$.
The second-order term is not replaced by an upper bound
The function value may not be decreasing
CDN (Yuan et al., 2010) II
Assume z is the optimal solution of the sub-problem; a line search is needed
Following Tseng and Yun (2007), accept the step $\lambda z$ if
$g_j(\lambda z) - g_j(0) \leq \sigma \lambda \left(L_j'(0)z + |w_j + z| - |w_j|\right)$
This is slightly different from the traditional form of line search: now $|w_j + z| - |w_j|$ must be taken into consideration
Convergence can be proved
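A hedged sketch of a CDN-style update: the one-variable Newton direction has a closed soft-thresholding form (see Yuan et al., 2010), followed by the backtracking line search with the condition above; g_j, Lp, and Lpp refer to the helper and quantities sketched nearby, and sigma/beta are typical (assumed) constants:

```python
def cdn_direction(w_j, Lp, Lpp):
    # argmin_d |w_j + d| - |w_j| + Lp*d + 0.5*Lpp*d^2
    if Lp + 1.0 <= Lpp * w_j:
        return -(Lp + 1.0) / Lpp
    if Lp - 1.0 >= Lpp * w_j:
        return -(Lp - 1.0) / Lpp
    return -w_j

def cdn_line_search(j, w, d, Lp, X, y, C, sigma=0.01, beta=0.5, max_steps=20):
    rhs = Lp * d + abs(w[j] + d) - abs(w[j])     # the term multiplied by sigma*lambda
    lam = 1.0
    for _ in range(max_steps):
        if g_j(lam * d, j, w, X, y, C) <= sigma * lam * rhs:   # note g_j(0) = 0
            break
        lam *= beta
    return lam * d
```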
Calculating First and Second Order Information I
We have
$L_j'(0) = \left.\dfrac{dL(w + z e_j)}{dz}\right|_{z=0} = \nabla_j L(w)$
$L_j''(0) = \left.\dfrac{d^2 L(w + z e_j)}{dz^2}\right|_{z=0} = \nabla^2_{jj} L(w)$
Calculating First and Second Order Information II
For logistic loss:
$L_j'(0) = C \sum_{i=1}^l y_i x_{ij} \left(\tau(y_i w^T x_i) - 1\right)$,
$L_j''(0) = C \sum_{i=1}^l x_{ij}^2 \, \tau(y_i w^T x_i)\left(1 - \tau(y_i w^T x_i)\right)$,
where
$\tau(s) \equiv \dfrac{1}{1 + e^{-s}}$
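A minimal sketch of these two formulas; x_j below denotes column j of X (the values $x_{ij}$, $i = 1, \ldots, l$):

```python
import numpy as np

def tau(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_Lp_Lpp(j, w, X, y, C):
    t = tau(y * (X @ w))                        # tau(y_i w^T x_i) for all i
    xj = X[:, j]
    Lp = C * np.sum(y * xj * (t - 1.0))         # L'_j(0)
    Lpp = C * np.sum(xj ** 2 * t * (1.0 - t))   # L''_j(0)
    return Lp, Lpp
```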
Other Methods
GLMNET (Friedman et al., 2010) I
A quadratic approximation of L(w):
$f(w + d) - f(w) = \left(\|w + d\|_1 + L(w + d)\right) - \left(\|w\|_1 + L(w)\right)$
$\approx \nabla L(w)^T d + \frac{1}{2} d^T \nabla^2 L(w) d + \|w + d\|_1 - \|w\|_1$.
Then
$w \leftarrow w + d$
Line search is needed for convergence
GLMNET (Friedman et al., 2010) II
But how to handle quadratic minimization with one-norm terms?
GLMNET uses coordinate descent
For logistic regression:
$\nabla L(w) = C \sum_{i=1}^l \left(\tau(y_i w^T x_i) - 1\right) y_i x_i$
$\nabla^2 L(w) = C X^T D X$, where $D \in R^{l \times l}$ is a diagonal matrix with $D_{ii} = \tau(y_i w^T x_i)\left(1 - \tau(y_i w^T x_i)\right)$
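A minimal sketch of forming these quantities for the logistic loss (dense and for illustration only; practical implementations exploit sparsity and never build the full Hessian):

```python
import numpy as np

def glmnet_grad_hess(w, X, y, C):
    t = 1.0 / (1.0 + np.exp(-y * (X @ w)))     # tau(y_i w^T x_i)
    grad = C * X.T @ ((t - 1.0) * y)           # C * sum_i (tau - 1) y_i x_i
    D = t * (1.0 - t)                          # diagonal entries of D
    hess = C * X.T @ (D[:, None] * X)          # C * X^T D X
    return grad, hess
```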
Bundle Methods (Teo et al., 2010) I
Also called the cutting plane method
L(w): a convex loss function
If $w_k$ is the current solution,
$L(w) \geq \nabla L(w_k)^T(w - w_k) + L(w_k) = a_k^T w + b_k, \; \forall w$,
where
$a_k \equiv \nabla L(w_k)$ and $b_k \equiv L(w_k) - a_k^T w_k$.
Bundle Methods (Teo et al., 2010) II
Maintain all earlier cutting planes to form a lower-bound function for L(w):
$L(w) \geq L^{CP}_k(w) \equiv \max_{1 \leq t \leq k} \; a_t^T w + b_t, \; \forall w$.
Obtain $w_{k+1}$ by solving
$\min_w \; \|w\|_1 + L^{CP}_k(w)$.
Bundle Methods (Teo et al., 2010) III
This is a linear program using $w = w^+ - w^-$:
$\min_{w^+, w^-, \zeta} \; \sum_{j=1}^n w_j^+ + \sum_{j=1}^n w_j^- + \zeta$
subject to $a_t^T(w^+ - w^-) + b_t \leq \zeta, \; t = 1, \ldots, k$,
$w_j^+ \geq 0, \; w_j^- \geq 0, \; j = 1, \ldots, n$.
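A hedged sketch of setting up and solving this linear program with scipy.optimize.linprog; A is assumed to hold the rows $a_t^T$ and b the values $b_t$:

```python
import numpy as np
from scipy.optimize import linprog

def bundle_lp(A, b):
    k, n = A.shape
    # decision variables: [w_plus (n), w_minus (n), zeta (1)]
    c = np.concatenate([np.ones(2 * n), [1.0]])
    # a_t^T (w+ - w-) + b_t <= zeta   <=>   a_t^T w+ - a_t^T w- - zeta <= -b_t
    A_ub = np.hstack([A, -A, -np.ones((k, 1))])
    b_ub = -np.asarray(b, dtype=float)
    bounds = [(0, None)] * (2 * n) + [(None, None)]   # w+, w- >= 0; zeta free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    w = res.x[:n] - res.x[n:2 * n]
    return w, res.x[-1]
```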
Experiments
Data
Data set       l        n          # of non-zeros
real-sim       72,309   20,958     3,709,083
news20         19,996   1,355,191  9,097,916
rcv1           677,399  47,236     49,556,258
yahoo-korea    460,554  3,052,939  156,436,656
l: number of data, n: number of features
They are all document sets
4/5 of the data for training and 1/5 for testing
Select the best C by cross-validation on the training set
Compared Methods
Software using $w^T x$ without b:
BBR (Genkin et al., 2007)
SCD (Shalev-Shwartz and Tewari, 2009)
CDN: our coordinate descent implementation
TRON: our Newton implementation for the bound-constrained formulation
OWL-QN (Andrew and Gao, 2007)
BMRM (Teo et al., 2010)
Compared Methods (Cont’d)
Software using $w^T x + b$:
CDN: our coordinate descent implementation
BBR (Genkin et al., 2007)
CGD-GS (Yun and Toh, 2009)
IPM (Koh et al., 2007)
GLMNET (Friedman et al., 2010)
Lassplore (Liu et al., 2009)
Convergence of Objective Values (no b)
(Figures: real-sim and news20)
Test Accuracy
(Figures: real-sim, news20, rcv1, and yahoo-korea)
Convergence of Objective Values (with b)
(Figures: real-sim and news20)
Observations and Conclusions
Decomposition methods are better in the early stage
For the one-variable sub-problem in coordinate descent, use a tight approximation if possible
Newton (IPM, GLMNET) and quasi-Newton (OWL-QN): fast local convergence in the end
We also checked gradient values and sparsity
Complete results (on more data sets) and programs are in Yuan et al. (2010), JMLR 11:3183–3234
References I
G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. In Proceedings of the Twenty Fourth International Conference on Machine Learning (ICML), 2007.
S. Balakrishnan and D. Madigan. Algorithms for sparse linear classifiers in the massive data setting. 2005. URL http://www.stat.rutgers.edu/~madigan/PAPERS/sm.pdf.
S. Benson and J. J. Moré. A limited memory variable metric method for bound constrained minimization. Preprint MCS-P909-0901, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, Illinois, 2001.
D. L. Donoho and Y. Tsaig. Fast solution of l1 minimization problems when the solution may be sparse. IEEE Transactions on Information Theory, 54:4789–4812, 2008.
J. Duchi and Y. Singer. Boosting with structural sparsity. In Proceedings of the Twenty Sixth International Conference on Machine Learning (ICML), 2009.
J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the L1-ball for learning in high dimensions. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008.
M. A. T. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:1150–1159, 2003.
References II
E. M. Gafni and D. P. Bertsekas. Two-metric projection methods for constrained optimization.
SIAM Journal on Control and Optimization, 22:936–964, 1984.
A. Genkin, D. D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(3):291–304, 2007.
J. Goodman. Exponential priors for maximum entropy models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2004.
J. Kazama and J. Tsujii. Evaluation and extension of maximum entropy models with inequality constraints. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 137–144, 2003.
S. S. Keerthi and S. Shevade. A fast tracking algorithm for generalized LARS/LASSO. IEEE Transactions on Neural Networks, 18(6):1826–1830, 2007.
J. Kim, Y. Kim, and Y. Kim. A gradient-based optimization algorithm for LASSO. Journal of Computational and Graphical Statistics, 17(4):994–1009, 2008.
Y. Kim and J. Kim. Gradient LASSO for feature selection. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.
J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132:1–63, 1997.
References III
K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, 2007. URL
http://www.stanford.edu/~boyd/l1_logistic_reg.html.
B. Krishnapuram, A. J. Hartemink, L. Carin, and M. A. T. Figueiredo. A Bayesian approach to joint feature selection and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):1105–1111, 2004.
B. Krishnapuram, L. Carin, M. A. T. Figueiredo, and A. J. Hartemink. Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957–968, 2005.
J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:771–801, 2009.
S.-I. Lee, H. Lee, P. Abbeel, and A. Y. Ng. Efficient l1 regularized logistic regression. In Proceedings of the Twenty-first National Conference on Artificial Intelligence (AAAI-06), pages 1–9, Boston, MA, USA, July 2006.
C.-J. Lin and J. J. Moré. Newton's method for large-scale bound constrained problems. SIAM Journal on Optimization, 9:1100–1127, 1999.
J. Liu, J. Chen, and J. Ye. Large-scale sparse logistic regression. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
References IV
M. Y. Park and T. Hastie. L1 regularization path algorithm for generalized linear models.
Journal of the Royal Statistical Society Series B, 69:659–677, 2007.
S. Perkins, K. Lacker, and J. Theiler. Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.
S. Rosset. Following curved regularized optimization solution paths. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1153–1160, Cambridge, MA, 2005. MIT Press.
V. Roth. The generalized LASSO. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.
M. Schmidt, G. Fung, and R. Rosales. Optimization methods for l1-regularization. Technical Report TR-2009-19, University of British Columbia, 2009.
S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1 regularized loss minimization. In Proceedings of the Twenty Sixth International Conference on Machine Learning (ICML), 2009.
S. K. Shevade and S. S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.
J. Shi, W. Yin, S. Osher, and P. Sajda. A fast hybrid algorithm for large scale l1-regularized logistic regression. Journal of Machine Learning Research, 11:713–741, 2010.
C. H. Teo, S. Vishwanathan, A. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11:311–365, 2010.
References V
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58:267–288, 1996.
P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117:387–423, 2007.
S. J. Wright. Accelerated block-coordinate relaxation for regularized optimization. Technical report, University of Wisconsin, 2010.
J. Yu, S. Vishwanathan, S. Gunter, and N. N. Schraudolph. A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. Journal of Machine Learning Research, 11:1–57, 2010.
G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. A comparison of optimization methods and software for large-scale l1-regularized linear classification. Journal of Machine Learning Research, 11:3183–3234, 2010. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf.
S. Yun and K.-C. Toh. A coordinate gradient descent method for l1-regularized convex minimization. 2009. To appear in Computational Optimization and Applications.
P. Zhao and B. Yu. Stagewise lasso. Journal of Machine Learning Research, 8:2701–2726, 2007.