Sparse Representation and Optimization Methods for L1-regularized Problems
Chih-Jen Lin
Department of Computer Science, National Taiwan University
Outline
Sparse Representation
Existing Optimization Methods
  Coordinate Descent Methods
  Other Methods
Experiments
Sparse Representation
A mathematical way to model a signal, an image, or a document is
$y = Xw = w_1 \begin{bmatrix} x_{11} \\ \vdots \\ x_{l1} \end{bmatrix} + \cdots + w_n \begin{bmatrix} x_{1n} \\ \vdots \\ x_{ln} \end{bmatrix}$
A signal is a linear combination of others
X and y are given
We would like to find w with as few non-zeros as possible (sparsity)
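A minimal sketch of this model in code (the sizes, the random dictionary, and the chosen non-zero positions are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
l, n = 100, 50                      # l: signal length, n: number of columns in X
X = rng.standard_normal((l, n))     # columns of X are the "other" signals
w = np.zeros(n)
w[[3, 17, 42]] = [1.5, -2.0, 0.7]   # only three non-zero coefficients
y = X @ w                           # y is a linear combination of 3 columns of X
print(np.count_nonzero(w), "non-zeros out of", n)
```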
Example: Image Deblurring
Consider
y = Hz
z: original image, H: blur operator, y: observed image
Assume $z = Dw$ with a known dictionary D
Try to
$\min_w \|y - HDw\|$ and get $\hat{w}$
Example: Image Deblurring (Cont’d)
We hope w has few non-zeros, as each image is generated from only a few columns of the dictionary
The restored image is $D\hat{w}$
Example: Face Recognition
Assume a face image is a combination of the same person’s other images
$\begin{bmatrix} x_{11} \\ \vdots \\ x_{l1} \end{bmatrix}$: 1st image, $\quad \begin{bmatrix} x_{12} \\ \vdots \\ x_{l2} \end{bmatrix}$: 2nd image, $\ldots$
l: number of pixels in a face image
Given a face image y and collections $X_1$ and $X_2$ of two persons' faces
Example: Face Recognition (Cont’d)
If
$\min_w \|y - X_1 w\| < \min_w \|y - X_2 w\|$,
predict y as the first person
We hope w has few non-zeros, as noisy images shouldn't be used
Example: Feature Selection
Given
$X = \begin{bmatrix} x_{11} & \ldots & x_{1n} \\ \vdots & & \vdots \\ x_{l1} & \ldots & x_{ln} \end{bmatrix}$,
$x_{i1}, \ldots, x_{in}$: the ith document, $y_i = +1$ or $-1$ (two classes)
We hope to find w such that
$w^T x_i \begin{cases} > 0 & \text{if } y_i = 1, \\ < 0 & \text{if } y_i = -1 \end{cases}$
Example: Feature Selection (Cont’d)
Try to
$\min_w \sum_{i=1}^l e^{-y_i w^T x_i}$
and hope that w is sparse
That is, we assume that each document is generated from important features
$w_i \neq 0 \Rightarrow$ important feature
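A minimal sketch of evaluating this objective (before adding any regularization); X, y, and w are assumed toy data:

```python
import numpy as np

def exp_loss(w, X, y):
    # sum_i exp(-y_i w^T x_i); X: l x n documents, y: labels in {+1, -1}
    return np.sum(np.exp(-y * (X @ w)))

# Features j with w[j] != 0 are the ones regarded as important.
```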
L1-norm Minimization I
Finding w with the smallest number of non-zeros is difficult
$\|w\|_0$: number of non-zeros
Instead, L1-norm minimization:
$\min_w \; C \|y - Xw\|^2 + \|w\|_1$
C: a parameter given by users
L1-norm Minimization II
1-norm versus 2-norm
$\|w\|_1 = |w_1| + \cdots + |w_n|$, $\quad \|w\|_2^2 = w_1^2 + \cdots + w_n^2$
Two figures:
(Figures: $|w|$ and $w^2$ as functions of w)
L1-norm Minimization III
If using the 2-norm, all $w_i$ are non-zero
Using the 1-norm, many $w_i$ may be zero
Smaller C, better sparsity
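For illustration only (this is not one of the methods surveyed later): a hedged proximal-gradient (ISTA) sketch for $\min_w C\|y - Xw\|^2 + \|w\|_1$, useful for observing how the number of non-zeros changes with C; all names and settings are assumptions:

```python
import numpy as np

def soft_threshold(v, t):
    # componentwise solution of min_u 0.5*(u - v)^2 + t*|u|
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_least_squares(X, y, C, iters=500):
    l, n = X.shape
    w = np.zeros(n)
    step = 1.0 / (2 * C * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the smooth part
    for _ in range(iters):
        grad = 2 * C * X.T @ (X @ w - y)               # gradient of C * ||y - Xw||^2
        w = soft_threshold(w - step * grad, step)      # proximal step for step * ||.||_1
    return w

# A smaller C puts relatively more weight on ||w||_1, hence a sparser w.
```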
L1-regularized Classifier
Training data $\{y_i, x_i\}$, $x_i \in R^n$, $y_i = \pm 1$, $i = 1, \ldots, l$
l: # of data, n: # of features
$\min_w \; \|w\|_1 + C \sum_{i=1}^l \xi(w; x_i, y_i)$
$\xi(w; x_i, y_i)$: loss function
Logistic loss: $\log(1 + e^{-y w^T x})$
L1 and L2 losses: $\max(1 - y w^T x, 0)$ and $\max(1 - y w^T x, 0)^2$
We do not consider kernels
L1-regularized Classifier (Cont’d)
$\|w\|_1$ is not differentiable ⇒ causes difficulties in optimization
Loss functions: the logistic loss is twice differentiable, the L2 loss is differentiable, and the L1 loss is not differentiable
We focus on the logistic and L2 losses
Sometimes a bias term is added: $w^T x \Rightarrow w^T x + b$
L1-regularized Classifier (Cont’d)
Many methods are available; we review existing methods and show details of some
Notation:
$f(w) \equiv \|w\|_1 + C \sum_{i=1}^l \xi(w; x_i, y_i)$
is the function to be minimized, and
$L(w) \equiv C \sum_{i=1}^l \xi(w; x_i, y_i)$.
We do not discuss L1-regularized regression, another recently active topic
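A minimal sketch of this notation for the logistic loss, assuming a data matrix X of shape l × n, labels y in {+1, −1}, and a user-given C:

```python
import numpy as np

def L(w, X, y, C):
    # C * sum_i log(1 + exp(-y_i w^T x_i)); logaddexp(0, -m) = log(1 + exp(-m)), computed stably
    return C * np.sum(np.logaddexp(0.0, -y * (X @ w)))

def f(w, X, y, C):
    # ||w||_1 + L(w)
    return np.linalg.norm(w, 1) + L(w, X, y, C)
```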
Existing Optimization Methods
Decomposition Methods
Work on some variables at a time
Cyclic coordinate descent methods: working variables are sequentially or randomly selected
One-variable case:
$\min_z f(w + z e_j) - f(w)$
$e_j$: indicator vector for the jth element
Examples: Goodman (2004); Genkin et al. (2007);
Balakrishnan and Madigan (2005); Tseng and Yun (2007); Shalev-Shwartz and Tewari (2009); Duchi and Singer (2009); Wright (2010)
Decomposition Methods (Cont’d)
Gradient-based working set selection
Higher cost per iteration; larger working set
Examples: Shevade and Keerthi (2003); Tseng and Yun (2007); Yun and Toh (2009)
Active set method
Working set the same as the set of non-zero w elements
Example: Perkins et al. (2003)
Constrained Optimization
Replace w with $w^+ - w^-$:
$\min_{w^+, w^-} \; \sum_{j=1}^n w_j^+ + \sum_{j=1}^n w_j^- + C \sum_{i=1}^l \xi(w^+ - w^-; x_i, y_i)$
subject to $w_j^+ \geq 0$, $w_j^- \geq 0$, $j = 1, \ldots, n$.
Any bound-constrained optimization methods can be used
Examples: Schmidt et al. (2009) used Gafni and Bertsekas (1984); Kazama and Tsujii (2003) used Benson and Moré (2001); we have considered Lin and Moré (1999); Koh et al. (2007): interior-point method
Constrained Optimization (Cont’d)
Equivalent problem with non-smooth constraints:
$\min_w \sum_{i=1}^l \xi(w; x_i, y_i)$ subject to $\|w\|_1 \leq K$.
C replaced by a corresponding K
Reduces to LASSO (Tibshirani, 1996) if $y \in R$ and the least-square loss is used
Examples: Kivinen and Warmuth (1997); Lee et al. (2006); Donoho and Tsaig (2008); Duchi et al. (2008)
Other Methods
Expectation maximization: Figueiredo (2003);
Krishnapuram et al. (2004, 2005).
Stochastic gradient descent: Langford et al. (2009);
Shalev-Shwartz and Tewari (2009)
Modified quasi-Newton: Andrew and Gao (2007);
Yu et al. (2010)
Hybrid: easy method first and then interior-point for faster local convergence (Shi et al., 2010)
Other Methods (Cont’d)
Quadratic approximation followed by coordinate descent: Krishnapuram et al. (2005); Friedman et al. (2010); a kind of Newton approach
Cutting plane method: Teo et al. (2010)
Some methods find a solution path for different C values; e.g., Rosset (2005), Zhao and Yu (2007), Park and Hastie (2007), and Keerthi and Shevade (2007).
Here we focus on a single C
Strengths and Weaknesses of Existing Methods
Convergence speed: higher-order methods (quasi-Newton or Newton) have fast local convergence, but fail to obtain a reasonable model quickly
Implementation effort: higher-order methods are usually more complicated
Large data: if solving linear systems is needed, use iterative methods (e.g., CG) instead of direct methods
Feature correlation: methods working on some variables at a time (e.g., decomposition methods) may be efficient if features are almost independent
Coordinate Descent Methods
Coordinate Descent Methods I
Minimizing the one-variable function
$g_j(z) \equiv |w_j + z| - |w_j| + L(w + z e_j) - L(w)$,
where
$e_j \equiv [\underbrace{0, \ldots, 0}_{j-1}, 1, 0, \ldots, 0]^T$
No closed-form solution
Genkin et al. (2007), Shalev-Shwartz and Tewari (2009), and Yuan et al. (2010)
Coordinate Descent Methods II
They differ in how to minimize this one-variable problem
While $g_j(z)$ is not differentiable, we can have a form similar to a Taylor expansion:
$g_j(z) = g_j(0) + g_j'(0)z + \frac{1}{2} g_j''(\eta z) z^2$
Another representation (for our derivation):
$\min_z \; g_j(z) = |w_j + z| - |w_j| + L_j(z; w) - L_j(0; w)$,
Coordinate Descent Methods III
where
$L_j(z; w) \equiv L(w + z e_j)$
is a function of z
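A hedged sketch of evaluating $g_j(z)$ straight from its definition, reusing the L(...) helper sketched earlier (the helper names are assumptions, and this recomputes L from scratch; real implementations maintain $w^T x_i$ incrementally):

```python
def g_j(z, j, w, X, y, C):
    # g_j(z) = |w_j + z| - |w_j| + L(w + z*e_j) - L(w)
    w_new = w.copy()
    w_new[j] += z
    return (abs(w[j] + z) - abs(w[j])
            + L(w_new, X, y, C) - L(w, X, y, C))
```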
BBR (Genkin et al., 2007) I
They rewrite $g_j(z)$ as
$g_j(z) = g_j(0) + g_j'(0)z + \frac{1}{2} g_j''(\eta z) z^2$, where $0 < \eta < 1$ and
$g_j'(0) \equiv \begin{cases} L_j'(0) + 1 & \text{if } w_j > 0, \\ L_j'(0) - 1 & \text{if } w_j < 0 \end{cases}$  (1)
BBR (Genkin et al., 2007) II
$g_j(z)$ is not differentiable if $w_j = 0$
BBR finds an upper bound $U_j$ of $g_j''(z)$ in a trust region:
$U_j \geq g_j''(z), \; \forall |z| \leq \Delta_j$.
Then $\hat{g}_j(z)$ is an upper-bound function of $g_j(z)$:
$\hat{g}_j(z) \equiv g_j(0) + g_j'(0)z + \frac{1}{2} U_j z^2$.
Any step z satisfying $\hat{g}_j(z) < \hat{g}_j(0)$ leads to
$g_j(z) - g_j(0) = g_j(z) - \hat{g}_j(0) \leq \hat{g}_j(z) - \hat{g}_j(0) < 0$.
BBR (Genkin et al., 2007) III
Convergence not proved (no sufficient decrease condition via line search)
For the logistic loss,
$U_j \equiv C \sum_{i=1}^l x_{ij}^2 \, F(y_i w^T x_i, \Delta_j |x_{ij}|)$,
where
$F(r, \delta) = \begin{cases} 0.25 & \text{if } |r| \leq \delta, \\ \dfrac{1}{2 + e^{|r| - \delta} + e^{\delta - |r|}} & \text{otherwise} \end{cases}$
BBR (Genkin et al., 2007) IV
The sub-problem solved in practice:
$\min_z \; \hat{g}_j(z)$
subject to $|z| \leq \Delta_j$ and $w_j + z \begin{cases} \geq 0 & \text{if } w_j > 0, \\ \leq 0 & \text{if } w_j < 0 \end{cases}$
Update rule:
$d = \min\Big(\max\Big(P\Big(\dfrac{-g_j'(0)}{U_j}, w_j\Big), -\Delta_j\Big), \Delta_j\Big)$,
BBR (Genkin et al., 2007) V
where
$P(z, w) \equiv \begin{cases} z & \text{if } \operatorname{sgn}(w + z) = \operatorname{sgn}(w), \\ -w & \text{otherwise} \end{cases}$
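A hedged sketch of BBR's one-variable update as stated on these slides, for the case $w_j \neq 0$ (the case $w_j = 0$ needs the extra handling described in Genkin et al., 2007); Lp stands for $L_j'(0)$, U for the bound $U_j$, and delta for $\Delta_j$:

```python
import numpy as np

def P(z, w):
    # keep the step on the same side as w; otherwise truncate at -w (i.e., stop at zero)
    return z if np.sign(w + z) == np.sign(w) else -w

def bbr_step(w_j, Lp, U, delta):
    gp = Lp + 1.0 if w_j > 0 else Lp - 1.0       # g'_j(0) from (1)
    d = P(-gp / U, w_j)                          # minimizer of ghat_j, kept on w_j's side
    return min(max(d, -delta), delta)            # project into the trust region |z| <= delta
```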
SCD (Shalev-Shwartz and Tewari, 2009) I
SCD: stochastic coordinate descent
$w = w^+ - w^-$
At each step, randomly select a variable from $\{w_1^+, \ldots, w_n^+, w_1^-, \ldots, w_n^-\}$
One-variable sub-problem:
$\min_z \; g_j(z) \equiv z + L_j(z; w^+ - w^-) - L_j(0; w^+ - w^-)$,
subject to
$w_j^+ + z \geq 0$ or $w_j^- + z \geq 0$,
SCD (Shalev-Shwartz and Tewari, 2009) II
Second-order approximation similar to BBR:
$\hat{g}_j(z) = g_j(0) + g_j'(0)z + \frac{1}{2} U_j z^2$, where
$g_j'(0) = \begin{cases} 1 + L_j'(0) & \text{for } w_j^+ \\ 1 - L_j'(0) & \text{for } w_j^- \end{cases}$ and $U_j \geq g_j''(z), \; \forall z$.
BBR: $U_j$ is an upper bound of $g_j''(z)$ only in the trust region
SCD (Shalev-Shwartz and Tewari, 2009) III
For logistic regression,
$U_j = 0.25\,C \sum_{i=1}^l x_{ij}^2 \geq g_j''(z), \; \forall z$.
Shalev-Shwartz and Tewari (2009) assume $-1 \leq x_{ij} \leq 1, \; \forall i, j$, so a simple upper bound is $U_j = 0.25\,C\,l$.
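A sketch of one SCD update on a selected non-negative variable, say $w_j^+$: minimize the quadratic upper bound subject to $w_j^+ + z \geq 0$. The closed form below is my reading of that sub-problem, not code from the paper; Lp is $L_j'(0)$ at the current $w = w^+ - w^-$:

```python
def scd_step_plus(w_plus_j, Lp, U):
    gp = 1.0 + Lp                # g'_j(0) for the w_j^+ variable
    z = -gp / U                  # unconstrained minimizer of ghat_j
    return max(z, -w_plus_j)     # enforce w_j^+ + z >= 0

# For w_j^-, use g'_j(0) = 1 - L'_j(0) and enforce w_j^- + z >= 0 in the same way.
```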
CDN (Yuan et al., 2010) I
Newton step:
$\min_z \; g_j'(0)z + \frac{1}{2} g_j''(0) z^2$.
That is,
$\min_z \; |w_j + z| - |w_j| + L_j'(0)z + \frac{1}{2} L_j''(0) z^2$.
The second-order term is not replaced by an upper bound
The function value may not be decreasing
CDN (Yuan et al., 2010) II
Assume z is the optimal solution of the sub-problem; a line search is needed
Following Tseng and Yun (2007), accept the step $\lambda z$ if
$g_j(\lambda z) - g_j(0) \leq \sigma \lambda \left(L_j'(0)z + |w_j + z| - |w_j|\right)$
This is slightly different from the traditional form of line search: now $|w_j + z| - |w_j|$ must be taken into consideration
Convergence can be proved
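A hedged sketch of a CDN-style update: the one-variable Newton direction has a closed soft-thresholding form (see Yuan et al., 2010), followed by the backtracking line search with the condition above; g_j, Lp, and Lpp refer to the helper and quantities sketched nearby, and sigma/beta are typical (assumed) constants:

```python
def cdn_direction(w_j, Lp, Lpp):
    # argmin_d |w_j + d| - |w_j| + Lp*d + 0.5*Lpp*d^2
    if Lp + 1.0 <= Lpp * w_j:
        return -(Lp + 1.0) / Lpp
    if Lp - 1.0 >= Lpp * w_j:
        return -(Lp - 1.0) / Lpp
    return -w_j

def cdn_line_search(j, w, d, Lp, X, y, C, sigma=0.01, beta=0.5, max_steps=20):
    rhs = Lp * d + abs(w[j] + d) - abs(w[j])     # the term multiplied by sigma*lambda
    lam = 1.0
    for _ in range(max_steps):
        if g_j(lam * d, j, w, X, y, C) <= sigma * lam * rhs:   # note g_j(0) = 0
            break
        lam *= beta
    return lam * d
```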
Calculating First and Second Order Information I
We have
$L_j'(0) = \left.\dfrac{dL(w + z e_j)}{dz}\right|_{z=0} = \nabla_j L(w)$
$L_j''(0) = \left.\dfrac{d^2 L(w + z e_j)}{dz^2}\right|_{z=0} = \nabla^2_{jj} L(w)$
Calculating First and Second Order Information II
For logistic loss:
$L_j'(0) = C \sum_{i=1}^l y_i x_{ij} \left(\tau(y_i w^T x_i) - 1\right)$,
$L_j''(0) = C \sum_{i=1}^l x_{ij}^2 \, \tau(y_i w^T x_i)\left(1 - \tau(y_i w^T x_i)\right)$,
where
$\tau(s) \equiv \dfrac{1}{1 + e^{-s}}$
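A minimal sketch of these two formulas; x_j below denotes column j of X (the values $x_{ij}$, $i = 1, \ldots, l$):

```python
import numpy as np

def tau(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_Lp_Lpp(j, w, X, y, C):
    t = tau(y * (X @ w))                        # tau(y_i w^T x_i) for all i
    xj = X[:, j]
    Lp = C * np.sum(y * xj * (t - 1.0))         # L'_j(0)
    Lpp = C * np.sum(xj ** 2 * t * (1.0 - t))   # L''_j(0)
    return Lp, Lpp
```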
Other Methods
GLMNET (Friedman et al., 2010) I
A quadratic approximation of L(w):
$f(w + d) - f(w) = \left(\|w + d\|_1 + L(w + d)\right) - \left(\|w\|_1 + L(w)\right)$
$\approx \nabla L(w)^T d + \frac{1}{2} d^T \nabla^2 L(w) d + \|w + d\|_1 - \|w\|_1$.
Then
$w \leftarrow w + d$
Line search is needed for convergence
GLMNET (Friedman et al., 2010) II
But how to handle quadratic minimization with one-norm terms?
GLMNET uses coordinate descent
For logistic regression:
$\nabla L(w) = C \sum_{i=1}^l \left(\tau(y_i w^T x_i) - 1\right) y_i x_i$
$\nabla^2 L(w) = C X^T D X$, where $D \in R^{l \times l}$ is a diagonal matrix with $D_{ii} = \tau(y_i w^T x_i)\left(1 - \tau(y_i w^T x_i)\right)$
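A minimal sketch of forming these quantities for the logistic loss (dense and for illustration only; practical implementations exploit sparsity and never build the full Hessian):

```python
import numpy as np

def glmnet_grad_hess(w, X, y, C):
    t = 1.0 / (1.0 + np.exp(-y * (X @ w)))     # tau(y_i w^T x_i)
    grad = C * X.T @ ((t - 1.0) * y)           # C * sum_i (tau - 1) y_i x_i
    D = t * (1.0 - t)                          # diagonal entries of D
    hess = C * X.T @ (D[:, None] * X)          # C * X^T D X
    return grad, hess
```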
Bundle Methods (Teo et al., 2010) I
Also called the cutting plane method
L(w): a convex loss function
If $w_k$ is the current solution,
$L(w) \geq \nabla L(w_k)^T(w - w_k) + L(w_k) = a_k^T w + b_k, \; \forall w$,
where
$a_k \equiv \nabla L(w_k)$ and $b_k \equiv L(w_k) - a_k^T w_k$.
Bundle Methods (Teo et al., 2010) II
Maintain all earlier cutting planes to form a lower-bound function for L(w):
$L(w) \geq L^{CP}_k(w) \equiv \max_{1 \leq t \leq k} \; a_t^T w + b_t, \; \forall w$.
Obtain $w_{k+1}$ by solving
$\min_w \; \|w\|_1 + L^{CP}_k(w)$.
Bundle Methods (Teo et al., 2010) III
This is a linear program using $w = w^+ - w^-$:
$\min_{w^+, w^-, \zeta} \; \sum_{j=1}^n w_j^+ + \sum_{j=1}^n w_j^- + \zeta$
subject to $a_t^T(w^+ - w^-) + b_t \leq \zeta, \; t = 1, \ldots, k$,
$w_j^+ \geq 0, \; w_j^- \geq 0, \; j = 1, \ldots, n$.
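A hedged sketch of setting up and solving this linear program with scipy.optimize.linprog; A is assumed to hold the rows $a_t^T$ and b the values $b_t$:

```python
import numpy as np
from scipy.optimize import linprog

def bundle_lp(A, b):
    k, n = A.shape
    # decision variables: [w_plus (n), w_minus (n), zeta (1)]
    c = np.concatenate([np.ones(2 * n), [1.0]])
    # a_t^T (w+ - w-) + b_t <= zeta   <=>   a_t^T w+ - a_t^T w- - zeta <= -b_t
    A_ub = np.hstack([A, -A, -np.ones((k, 1))])
    b_ub = -np.asarray(b, dtype=float)
    bounds = [(0, None)] * (2 * n) + [(None, None)]   # w+, w- >= 0; zeta free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    w = res.x[:n] - res.x[n:2 * n]
    return w, res.x[-1]
```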
Experiments
Data
Data set       l        n          # of non-zeros
real-sim       72,309   20,958     3,709,083
news20         19,996   1,355,191  9,097,916
rcv1           677,399  47,236     49,556,258
yahoo-korea    460,554  3,052,939  156,436,656
l: number of data, n: number of features
They are all document sets
4/5 of the data for training and 1/5 for testing
Select the best C by cross-validation on the training set
Compared Methods
Software using $w^T x$ without b:
BBR (Genkin et al., 2007)
SCD (Shalev-Shwartz and Tewari, 2009)
CDN: our coordinate descent implementation
TRON: our Newton implementation for the bound-constrained formulation
OWL-QN (Andrew and Gao, 2007)
BMRM (Teo et al., 2010)
Compared Methods (Cont’d)
Software using $w^T x + b$:
CDN: our coordinate descent implementation
BBR (Genkin et al., 2007)
CGD-GS (Yun and Toh, 2009)
IPM (Koh et al., 2007)
GLMNET (Friedman et al., 2010)
Lassplore (Liu et al., 2009)
Convergence of Objective Values (no b)
(Figures: real-sim and news20)
Test Accuracy
(Figures: real-sim, news20, rcv1, and yahoo-korea)
Convergence of Objective Values (with b)
(Figures: real-sim and news20)
Observations and Conclusions
Decomposition methods are better in the early stage
For the one-variable sub-problem in coordinate descent, use a tight approximation if possible
Newton (IPM, GLMNET) and quasi-Newton (OWL-QN): fast local convergence in the end
We also checked gradient values and sparsity
Complete results (on more data sets) and programs are in Yuan et al. (2010), JMLR 11:3183–3234
References I
G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. In Proceedings of the Twenty Fourth International Conference on Machine Learning (ICML), 2007.
S. Balakrishnan and D. Madigan. Algorithms for sparse linear classifiers in the massive data setting. 2005. URL http://www.stat.rutgers.edu/~madigan/PAPERS/sm.pdf.
S. Benson and J. J. Moré. A limited memory variable metric method for bound constrained minimization. Preprint MCS-P909-0901, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, Illinois, 2001.
D. L. Donoho and Y. Tsaig. Fast solution of l1 minimization problems when the solution may be sparse. IEEE Transactions on Information Theory, 54:4789–4812, 2008.
J. Duchi and Y. Singer. Boosting with structural sparsity. In Proceedings of the Twenty Sixth International Conference on Machine Learning (ICML), 2009.
J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the L1-ball for learning in high dimensions. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008.
M. A. T. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:1150–1159, 2003.
References II
E. M. Gafni and D. P. Bertsekas. Two-metric projection methods for constrained optimization.
SIAM Journal on Control and Optimization, 22:936–964, 1984.
A. Genkin, D. D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(3):291–304, 2007.
J. Goodman. Exponential priors for maximum entropy models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2004.
J. Kazama and J. Tsujii. Evaluation and extension of maximum entropy models with inequality constraints. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 137–144, 2003.
S. S. Keerthi and S. Shevade. A fast tracking algorithm for generalized LARS/LASSO. IEEE Transactions on Neural Networks, 18(6):1826–1830, 2007.
J. Kim, Y. Kim, and Y. Kim. A gradient-based optimization algorithm for LASSO. Journal of Computational and Graphical Statistics, 17(4):994–1009, 2008.
Y. Kim and J. Kim. Gradient LASSO for feature selection. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.
J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132:1–63, 1997.
References III
K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, 2007. URL
http://www.stanford.edu/~boyd/l1_logistic_reg.html.
B. Krishnapuram, A. J. Hartemink, L. Carin, and M. A. T. Figueiredo. A Bayesian approach to joint feature selection and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):1105–1111, 2004.
B. Krishnapuram, L. Carin, M. A. T. Figueiredo, and A. J. Hartemink. Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957–968, 2005.
J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:771–801, 2009.
S.-I. Lee, H. Lee, P. Abbeel, and A. Y. Ng. Efficient l1 regularized logistic regression. In Proceedings of the Twenty-first National Conference on Artificial Intelligence (AAAI-06), pages 1–9, Boston, MA, USA, July 2006.
C.-J. Lin and J. J. Moré. Newton's method for large-scale bound constrained problems. SIAM Journal on Optimization, 9:1100–1127, 1999.
J. Liu, J. Chen, and J. Ye. Large-scale sparse logistic regression. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
References IV
M. Y. Park and T. Hastie. L1 regularization path algorithm for generalized linear models.
Journal of the Royal Statistical Society Series B, 69:659–677, 2007.
S. Perkins, K. Lacker, and J. Theiler. Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.
S. Rosset. Following curved regularized optimization solution paths. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1153–1160, Cambridge, MA, 2005. MIT Press.
V. Roth. The generalized LASSO. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.
M. Schmidt, G. Fung, and R. Rosales. Optimization methods for l1-regularization. Technical Report TR-2009-19, University of British Columbia, 2009.
S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1 regularized loss minimization. In Proceedings of the Twenty Sixth International Conference on Machine Learning (ICML), 2009.
S. K. Shevade and S. S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.
J. Shi, W. Yin, S. Osher, and P. Sajda. A fast hybrid algorithm for large scale l1-regularized logistic regression. Journal of Machine Learning Research, 11:713–741, 2010.
C. H. Teo, S. Vishwanathan, A. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11:311–365, 2010.
References V
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58:267–288, 1996.
P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117:387–423, 2007.
S. J. Wright. Accelerated block-coordinate relaxation for regularized optimization. Technical report, University of Wisconsin, 2010.
J. Yu, S. Vishwanathan, S. Gunter, and N. N. Schraudolph. A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. Journal of Machine Learning Research, 11:1–57, 2010.
G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. A comparison of optimization methods and software for large-scale l1-regularized linear classification. Journal of Machine Learning Research, 11:3183–3234, 2010. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf.
S. Yun and K.-C. Toh. A coordinate gradient descent method for l1-regularized convex minimization. 2009. To appear in Computational Optimization and Applications.
P. Zhao and B. Yu. Stagewise lasso. Journal of Machine Learning Research, 8:2701–2726, 2007.