# Sparse Representation and Optimization Methods for L1-regularized Problems


Chih-Jen Lin

Department of Computer Science, National Taiwan University

### Outline

Sparse Representation

Existing Optimization Methods

Coordinate Descent Methods

Other Methods

Experiments

Sparse Representation


### Sparse Representation

A mathematical way to model a signal, an image, or a document is

$$
y = Xw = w_1 \begin{bmatrix} x_{11} \\ \vdots \\ x_{l1} \end{bmatrix} + \cdots + w_n \begin{bmatrix} x_{1n} \\ \vdots \\ x_{ln} \end{bmatrix}
$$

A signal is a linear combination of others

$X$ and $y$ are given

We would like to find $w$ with as few non-zeros as possible (sparsity)

### Example: Image Deblurring

Consider

$$
y = Hz
$$

$z$: original image, $H$: blur operation, $y$: observed image

Assume $z = Dw$ with known dictionary $D$

Try to

$$
\min_w \|y - HDw\|
$$

and get $\hat{w}$

### Example: Image Deblurring (Cont’d)

We hope $w$ has few nonzeros as each image is generated using only several columns of the dictionary

The restored image is $D\hat{w}$

### Example: Face Recognition

Assume a face image is a combination of the same person’s other images

$$
\begin{bmatrix} x_{11} \\ \vdots \\ x_{l1} \end{bmatrix}: \text{1st image}, \quad
\begin{bmatrix} x_{12} \\ \vdots \\ x_{l2} \end{bmatrix}: \text{2nd image}, \ \ldots
$$

$l$: number of pixels in a face image

Given a face image $y$ and collections of two persons’ faces $X_1$ and $X_2$

### Example: Face Recognition (Cont’d)

If

$$
\min_w \|y - X_1 w\| < \min_w \|y - X_2 w\|,
$$

predict $y$ as the first person

We hope $w$ has few nonzeros as noisy images shouldn’t be used

### Example: Feature Selection

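Given

$$
X = \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{l1} & \cdots & x_{ln} \end{bmatrix},
$$

$x_{i1}, \ldots, x_{in}$: $i$th document

$y_i = +1$ or $-1$ (two classes)

We hope to find $w$ such that

$$
w^T x_i \begin{cases} > 0 & \text{if } y_i = 1, \\ < 0 & \text{if } y_i = -1. \end{cases}
$$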

### Example: Feature Selection (Cont’d)

Try to

$$
\min_w \sum_{i=1}^{l} e^{-y_i w^T x_i}
$$

and hope that $w$ is sparse

That is, we assume that each document is generated from important features

$w_i \neq 0 \Rightarrow$ important features

### L1-norm Minimization I

Finding $w$ with the smallest number of non-zeros is difficult

$\|w\|_0$: number of nonzeros

Instead, L1-norm minimization

$$
\min_w \; C\|y - Xw\|^2 + \|w\|_1
$$

$C$: a parameter given by users

### L1-norm Minimization II

1-norm versus 2-norm

$$
\|w\|_1 = |w_1| + \cdots + |w_n|, \qquad \|w\|_2^2 = w_1^2 + \cdots + w_n^2
$$

Two figures: $|w|$ plotted against $w$, and $w^2$ plotted against $w$

### L1-norm Minimization III

If using 2-norm, all $w_i$ are non-zeros

Using 1-norm, many $w_i$ may be zeros

Smaller $C$, better sparsity
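The contrast between the two norms can be seen on a one-variable toy problem. The sketch below is an illustration that is not part of the slides: it assumes a scalar $w$ and a scalar target $y$, for which $\min_w C(y-w)^2 + |w|$ has the soft-thresholding solution with threshold $1/(2C)$, while $\min_w C(y-w)^2 + w^2$ gives $w = Cy/(C+1)$. The function names are hypothetical.

```python
def l1_solution(y, C):
    """Minimizer of C*(y - w)**2 + |w| over a scalar w (soft-thresholding)."""
    threshold = 1.0 / (2.0 * C)
    if y > threshold:
        return y - threshold
    if y < -threshold:
        return y + threshold
    return 0.0  # exactly zero whenever |y| <= 1/(2C)


def l2_solution(y, C):
    """Minimizer of C*(y - w)**2 + w**2 over a scalar w."""
    return C * y / (C + 1.0)  # shrinks toward zero but never reaches it for y != 0


y = 0.3
for C in (10.0, 1.0, 0.1):
    print(C, l1_solution(y, C), l2_solution(y, C))
```

In this toy setting the 1-norm solution becomes exactly zero once $C \le 1/(2|y|)$, whereas the 2-norm solution only shrinks, which matches the remark that a smaller $C$ gives better sparsity.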


### L1-regularized Classifier

Training data $\{y_i, x_i\}$, $x_i \in R^n$, $i = 1, \ldots, l$, $y_i = \pm 1$

$l$: # of data, $n$: # of features

$$
\min_w \; \|w\|_1 + C \sum_{i=1}^{l} \xi(w; x_i, y_i)
$$

$\xi(w; x_i, y_i)$: loss function

Logistic loss:

$$
\log(1 + e^{-y w^T x})
$$

L1 and L2 losses:

$$
\max(1 - y w^T x, 0) \quad \text{and} \quad \max(1 - y w^T x, 0)^2
$$

We do not consider kernels
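The three losses and the regularized objective can be written down directly. The following is a minimal sketch using dense Python lists; the function names and the tiny data set are illustrative assumptions, not taken from the talk.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def logistic_loss(w, x, y):
    s = y * dot(w, x)
    # log(1 + exp(-s)), arranged so that exp() never overflows
    if s >= 0:
        return math.log1p(math.exp(-s))
    return -s + math.log1p(math.exp(s))


def l1_loss(w, x, y):
    return max(1.0 - y * dot(w, x), 0.0)


def l2_loss(w, x, y):
    return max(1.0 - y * dot(w, x), 0.0) ** 2


def objective(w, X, Y, C, loss):
    """||w||_1 + C * sum_i loss(w; x_i, y_i)."""
    return sum(abs(wj) for wj in w) + C * sum(loss(w, x, y) for x, y in zip(X, Y))


X = [[1.0, 0.0], [0.0, 1.0]]
Y = [1, -1]
w = [1.0, -2.0]
for loss in (logistic_loss, l1_loss, l2_loss):
    print(loss.__name__, objective(w, X, Y, 2.0, loss))
```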


### L1-regularized Classifier (Cont’d)

$\|w\|_1$ not differentiable ⇒ causes difficulties in optimization

Loss functions: logistic loss twice differentiable, L2 loss differentiable, and L1 loss not differentiable

We focus on logistic and L2 loss

$w^T x \Rightarrow w^T x + b$

### L1-regularized Classifier (Cont’d)

Many available methods; we review existing methods and show details of some methods

Notation:

$$
f(w) \equiv \|w\|_1 + C \sum_{i=1}^{l} \xi(w; x_i, y_i)
$$

is the function to be minimized, and

$$
L(w) \equiv C \sum_{i=1}^{l} \xi(w; x_i, y_i).
$$

We do not discuss L1-regularized regression, which is another hot topic recently

Existing Optimization Methods


### Decomposition Methods

Working on some variables at a time

Cyclic coordinate descent methods

Working variables sequentially or randomly selected

One-variable case:

$$
\min_z \; f(w + z e_j) - f(w)
$$

$e_j$: indicator vector for the $j$th element

Examples: Goodman (2004); Genkin et al. (2007); Balakrishnan and Madigan (2005); Tseng and Yun (2007); Shalev-Shwartz and Tewari (2009); Duchi and Singer (2009); Wright (2010)

### Decomposition Methods (Cont’d)

Higher cost per iteration; larger working set

Examples: Shevade and Keerthi (2003); Tseng and Yun (2007); Yun and Toh (2009)

Active set method

Working set the same as the set of non-zero w elements

Examples: Perkins et al. (2003)


### Constrained Optimization

Replace $w$ with $w^+ - w^-$:

$$
\min_{w^+, w^-} \; \sum_{j=1}^{n} w_j^+ + \sum_{j=1}^{n} w_j^- + C \sum_{i=1}^{l} \xi(w^+ - w^-; x_i, y_i)
\quad \text{s.t. } w_j^+ \ge 0, \; w_j^- \ge 0, \; j = 1, \ldots, n.
$$

Any bound-constrained optimization method can be used

Examples: Schmidt et al. (2009) used Gafni and Bertsekas (1984); Kazama and Tsujii (2003) used Benson and Moré (2001); we have considered Lin and Moré (1999); Koh et al. (2007): interior point method
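The variable split can be illustrated in a few lines. This sketch is only an illustration (the helper name `split` is hypothetical): it maps a vector to its nonnegative parts and shows that, when at most one of the two parts is nonzero for each index, the linear term of the bound-constrained problem equals the 1-norm.

```python
def split(w):
    """Return (w_plus, w_minus) with w = w_plus - w_minus and both parts >= 0."""
    w_plus = [max(wj, 0.0) for wj in w]
    w_minus = [max(-wj, 0.0) for wj in w]
    return w_plus, w_minus


w = [1.5, 0.0, -2.0]
w_plus, w_minus = split(w)
recovered = [p - m for p, m in zip(w_plus, w_minus)]
linear_term = sum(w_plus) + sum(w_minus)
print(w_plus, w_minus, recovered, linear_term)
```

The price of this reformulation is a problem with $2n$ variables, in exchange for a smooth objective (for a differentiable loss) with simple bound constraints.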


### Constrained Optimization (Cont’d)

Equivalent problem with non-smooth constraints:

$$
\min_w \; \sum_{i=1}^{l} \xi(w; x_i, y_i) \quad \text{subject to } \|w\|_1 \le K.
$$

$C$ replaced by a corresponding $K$

This goes back to LASSO (Tibshirani, 1996) if $y \in R$ and the least-square loss is used

Examples: Kivinen and Warmuth (1997); Lee et al. (2006); Donoho and Tsaig (2008); Duchi et al.

### Other Methods

Expectation maximization: Figueiredo (2003); Krishnapuram et al. (2004, 2005)

Stochastic gradient descent: Langford et al. (2009); Shalev-Shwartz and Tewari (2009)

Modified quasi Newton: Andrew and Gao (2007); Yu et al. (2010)

Hybrid: easy method first and then interior-point for faster local convergence (Shi et al., 2010)

### Other Methods (Cont’d)

Quadratic approximation followed by coordinate descent: Krishnapuram et al. (2005); Friedman et al. (2010); a kind of Newton approach

Cutting plane method: Teo et al. (2010)

Some methods find a solution path for different C values; e.g., Rosset (2005), Zhao and Yu (2007), Park and Hastie (2007), and Keerthi and Shevade (2007).

Here we focus on a single C


### Strengths and Weaknesses of Existing Methods

Convergence speed: higher-order methods (quasi Newton or Newton) have fast local convergence, but fail to obtain a reasonable model quickly

Implementation efforts: higher-order methods usually more complicated

Large data: if solving linear systems is needed, use iterative (e.g., CG) instead of direct methods

Feature correlation: methods working on some variables at a time (e.g., decomposition methods) may be efficient if features are almost independent

Coordinate Descent Methods


### Coordinate Descent Methods I

Minimizing the one-variable function

$$
g_j(z) \equiv |w_j + z| - |w_j| + L(w + z e_j) - L(w),
$$

where

$$
e_j \equiv [\underbrace{0, \ldots, 0}_{j-1}, 1, 0, \ldots, 0]^T.
$$

No closed form solution

Genkin et al. (2007), Shalev-Shwartz and Tewari (2009), and Yuan et al. (2010)

### Coordinate Descent Methods II

They differ in how to minimize this one-variable problem

While $g_j(z)$ is not differentiable, we can have a form similar to Taylor expansion:

$$
g_j(z) = g_j(0) + g_j'(0) z + \frac{1}{2} g_j''(\eta z) z^2,
$$

Another representation (for our derivation)

$$
\min_z \; g_j(z) = |w_j + z| - |w_j| + L_j(z; w) - L_j(0; w),
$$

### Coordinate Descent Methods III

where

$$
L_j(z; w) \equiv L(w + z e_j)
$$

is a function of $z$

### BBR (Genkin et al., 2007) I

They rewrite $g_j(z)$ as

$$
g_j(z) = g_j(0) + g_j'(0) z + \frac{1}{2} g_j''(\eta z) z^2,
$$

where $0 < \eta < 1$,

$$
g_j'(0) \equiv \begin{cases} L_j'(0) + 1 & \text{if } w_j > 0, \\ L_j'(0) - 1 & \text{if } w_j < 0, \end{cases} \tag{1}
$$

and

### BBR (Genkin et al., 2007) II

$g_j(z)$ is not differentiable if $w_j = 0$

BBR finds an upper bound $U_j$ of $g_j''(z)$ in a trust region

$$
U_j \ge g_j''(z), \quad \forall |z| \le \Delta_j.
$$

Then $\hat{g}_j(z)$ is an upper-bound function of $g_j(z)$:

$$
\hat{g}_j(z) \equiv g_j(0) + g_j'(0) z + \frac{1}{2} U_j z^2.
$$

Any step $z$ satisfying $\hat{g}_j(z) < \hat{g}_j(0)$ leads to

$$
g_j(z) - g_j(0) = g_j(z) - \hat{g}_j(0) \le \hat{g}_j(z) - \hat{g}_j(0) < 0,
$$

### BBR (Genkin et al., 2007) III

Convergence not proved (no sufficient decrease condition via line search)

Logistic loss:

$$
U_j \equiv C \sum_{i=1}^{l} x_{ij}^2 \, F\!\left(y_i w^T x_i, \; \Delta_j |x_{ij}|\right),
$$

where

$$
F(r, \delta) = \begin{cases} 0.25 & \text{if } |r| \le \delta, \\ \dfrac{1}{\cdots} \end{cases}
$$

### BBR (Genkin et al., 2007) IV

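The sub-problem solved in practice:

$$
\min_z \; \hat{g}_j(z) \quad \text{s.t. } |z| \le \Delta_j \text{ and } w_j + z \begin{cases} \ge 0 & \text{if } w_j > 0, \\ \le 0 & \text{if } w_j < 0. \end{cases}
$$

Update rule:

$$
d = \min\left( \max\left( P\!\left(-\frac{g_j'(0)}{U_j}, \, w_j\right), \, -\Delta_j \right), \, \Delta_j \right),
$$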

### BBR (Genkin et al., 2007) V

where

$$
P(z, w) \equiv \begin{cases} z & \text{if } \operatorname{sgn}(w + z) = \operatorname{sgn}(w), \\ -w & \text{otherwise.} \end{cases}
$$
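The BBR update rule is short enough to sketch in code. The version below is a hedged illustration, not the BBR software: it takes $g_j'(0)$, the bound $U_j > 0$, the current $w_j \neq 0$, and the trust-region size $\Delta_j$ as given inputs, and the function names `sgn`, `P` and `bbr_step` are assumptions made for this example.

```python
def sgn(v):
    return (v > 0) - (v < 0)


def P(z, w):
    """Keep the step z if w + z stays on the same side of zero as w; otherwise stop at zero."""
    return z if sgn(w + z) == sgn(w) else -w


def bbr_step(g_prime, U, w_j, delta):
    """d = min(max(P(-g'_j(0)/U_j, w_j), -Delta_j), Delta_j), for w_j != 0 and U > 0."""
    return min(max(P(-g_prime / U, w_j), -delta), delta)


# Unconstrained minimizer -2.0 would cross zero from w_j = 0.5, so the step stops at zero.
print(bbr_step(2.0, 1.0, 0.5, 1.0))
# Unconstrained minimizer 4.0 keeps the sign but is clipped to the trust region.
print(bbr_step(-4.0, 1.0, 0.5, 1.0))
# A small step that satisfies both constraints is taken unchanged.
print(bbr_step(0.2, 1.0, 0.5, 1.0))
```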


### SCD (Shalev-Shwartz and Tewari, 2009) I

SCD: stochastic coordinate descent

$w = w^+ - w^-$

At each step, randomly select a variable from $\{w_1^+, \ldots, w_n^+, w_1^-, \ldots, w_n^-\}$

One-variable sub-problem:

$$
\min_z \; g_j(z) \equiv z + L_j(z; w^+ - w^-) - L_j(0; w^+ - w^-),
$$

subject to

$$
w_j^{k,+} + z \ge 0 \quad \text{or} \quad w_j^{k,-} + z \ge 0,
$$

### SCD (Shalev-Shwartz and Tewari, 2009) II

Second-order approximation similar to BBR:

$$
\hat{g}_j(z) = g_j(0) + g_j'(0) z + \frac{1}{2} U_j z^2,
$$

where

$$
g_j'(0) = \begin{cases} 1 + L_j'(0) & \text{for } w_j^+, \\ 1 - L_j'(0) & \text{for } w_j^-, \end{cases}
\quad \text{and} \quad U_j \ge g_j''(z), \; \forall z.
$$

BBR: $U_j$ an upper bound of $g_j''(z)$ in the trust region

### SCD (Shalev-Shwartz and Tewari, 2009) III

For logistic regression,

$$
U_j = 0.25\, C \sum_{i=1}^{l} x_{ij}^2 \ge g_j''(z), \quad \forall z.
$$

Shalev-Shwartz and Tewari (2009) assume

$$
-1 \le x_{ij} \le 1, \quad \forall i, j,
$$

so a simple upper bound is $U_j = 0.25\, C\, l$.
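One SCD update can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it uses the logistic loss, the bound $U_j = 0.25\,C\sum_i x_{ij}^2$ (assumed positive), and the fact that minimizing the quadratic $\hat{g}_j(z)$ subject to the bound constraint gives $z = \max(-w_j^{\pm}, -g_j'(0)/U_j)$. For brevity the variable to update is passed in rather than drawn at random, and all names and data are hypothetical.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def tau(s):
    return 1.0 / (1.0 + math.exp(-s))


def scd_step(w_plus, w_minus, X, Y, C, j, positive):
    """Update w_j^+ (positive=True) or w_j^- (positive=False) in place; return the step z."""
    w = [p - m for p, m in zip(w_plus, w_minus)]
    # L'_j(0) for the logistic loss
    grad = C * sum(y * x[j] * (tau(y * dot(w, x)) - 1.0) for x, y in zip(X, Y))
    U = 0.25 * C * sum(x[j] ** 2 for x in X)  # global bound on the second derivative
    if positive:
        z = max(-w_plus[j], -(1.0 + grad) / U)
        w_plus[j] += z
    else:
        z = max(-w_minus[j], -(1.0 - grad) / U)
        w_minus[j] += z
    return z


X = [[1.0, 0.5], [-1.0, 0.2]]
Y = [1, -1]
w_plus, w_minus = [0.0, 0.0], [0.0, 0.0]
print(scd_step(w_plus, w_minus, X, Y, 4.0, 0, True), w_plus)
```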


### CDN (Yuan et al., 2010) I

Newton step:

$$
\min_z \; g_j'(0) z + \frac{1}{2} g_j''(0) z^2.
$$

That is,

$$
\min_z \; |w_j + z| - |w_j| + L_j'(0) z + \frac{1}{2} L_j''(0) z^2.
$$

Second-order term not replaced by an upper bound

Function value may not be decreasing
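Unlike $g_j(z)$ itself, the quadratic sub-problem with the $|w_j + z|$ term does have a closed-form minimizer, obtained by checking the cases $w_j + z > 0$, $w_j + z < 0$, and $w_j + z = 0$. The sketch below writes out that case analysis; it assumes $L_j''(0) > 0$, and the function name is hypothetical.

```python
def cdn_newton_direction(w_j, L1, L2):
    """Minimizer of |w_j + z| - |w_j| + L1*z + 0.5*L2*z**2, assuming L2 > 0.

    L1 and L2 stand for L'_j(0) and L''_j(0).
    """
    if L1 + 1.0 <= L2 * w_j:
        return -(L1 + 1.0) / L2   # minimizer lies where w_j + z >= 0
    if L1 - 1.0 >= L2 * w_j:
        return -(L1 - 1.0) / L2   # minimizer lies where w_j + z <= 0
    return -w_j                   # otherwise the kink at w_j + z = 0 is optimal


print(cdn_newton_direction(2.0, 0.0, 1.0))   # moves toward zero without reaching it
print(cdn_newton_direction(0.0, -3.0, 2.0))  # leaves zero because |L'_j(0)| > 1
print(cdn_newton_direction(0.3, 0.5, 1.0))   # steps exactly to zero
```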


### CDN (Yuan et al., 2010) II

Assume $z$ is the optimal solution of the Newton sub-problem; need line search

Following Tseng and Yun (2007),

$$
g_j(\lambda z) - g_j(0) \le \sigma \lambda \left( L_j'(0) z + |w_j + z| - |w_j| \right),
$$

This is slightly different from the traditional form of line search. Now $|w_j + z| - |w_j|$ must be taken into consideration

Convergence can be proved
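A backtracking search for this sufficient decrease condition might look as follows. This is a sketch under assumptions: the one-variable loss $L_j$ is passed in as a callable, $\lambda$ is tried over $1, \beta, \beta^2, \ldots$, and the default values $\sigma = 0.01$ and $\beta = 0.5$ are illustrative choices rather than values stated in the slides.

```python
import math


def cdn_line_search(L_j, w_j, z, L1, sigma=0.01, beta=0.5, max_steps=50):
    """Return the first lam in {1, beta, beta^2, ...} such that
    g_j(lam*z) - g_j(0) <= sigma * lam * (L1*z + |w_j + z| - |w_j|),
    where L1 = L'_j(0). Returns 0.0 if no tried step is accepted.
    """
    def g(t):
        return abs(w_j + t) - abs(w_j) + L_j(t) - L_j(0.0)

    decrease = L1 * z + abs(w_j + z) - abs(w_j)
    lam = 1.0
    for _ in range(max_steps):
        if g(lam * z) - g(0.0) <= sigma * lam * decrease:
            return lam
        lam *= beta
    return 0.0


# One training point x = 1, y = 1, C = 8, current w_j = 0:
# L_j(t) = 8*log(1 + exp(-t)), so L'_j(0) = -4 and L''_j(0) = 2, Newton direction z = 1.5.
lam = cdn_line_search(lambda t: 8.0 * math.log1p(math.exp(-t)), 0.0, 1.5, -4.0)
print(lam)
```

Here the full Newton step already satisfies the condition; a loss whose curvature grows quickly away from the current point forces several halvings.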


### Calculating First and Second Order Information I

We have

$$
L_j'(0) = \left. \frac{dL(w + z e_j)}{dz} \right|_{z=0} = \nabla_j L(w)
$$

$$
L_j''(0) = \left. \frac{d^2 L(w + z e_j)}{dz\,dz} \right|_{z=0} = \nabla^2_{jj} L(w)
$$

### Calculating First and Second Order Information II

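For logistic loss:

$$
L_j'(0) = C \sum_{i=1}^{l} y_i x_{ij} \left( \tau(y_i w^T x_i) - 1 \right),
$$

$$
L_j''(0) = C \sum_{i=1}^{l} x_{ij}^2 \, \tau(y_i w^T x_i) \left( 1 - \tau(y_i w^T x_i) \right),
$$

where

$$
\tau(s) \equiv \frac{1}{1 + e^{-s}}
$$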
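These two formulas translate directly into code and can be sanity-checked against finite differences of $L(w + z e_j)$. The sketch below uses a small made-up data set; all names and numbers are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def tau(s):
    return 1.0 / (1.0 + math.exp(-s))


def L(w, X, Y, C):
    """Logistic loss term C * sum_i log(1 + exp(-y_i w^T x_i))."""
    return C * sum(math.log1p(math.exp(-y * dot(w, x))) for x, y in zip(X, Y))


def L_first(w, X, Y, C, j):
    return C * sum(y * x[j] * (tau(y * dot(w, x)) - 1.0) for x, y in zip(X, Y))


def L_second(w, X, Y, C, j):
    total = 0.0
    for x, y in zip(X, Y):
        t = tau(y * dot(w, x))
        total += x[j] ** 2 * t * (1.0 - t)
    return C * total


X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
Y = [1, -1, 1]
w, C, j, h = [0.2, -0.1], 2.0, 0, 1e-4

w_up = [w[0] + h, w[1]]
w_dn = [w[0] - h, w[1]]
fd_first = (L(w_up, X, Y, C) - L(w_dn, X, Y, C)) / (2 * h)
fd_second = (L(w_up, X, Y, C) - 2 * L(w, X, Y, C) + L(w_dn, X, Y, C)) / (h * h)
print(L_first(w, X, Y, C, j), fd_first)
print(L_second(w, X, Y, C, j), fd_second)
```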


Other Methods


### GLMNET (Friedman et al., 2010) I

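$$
\begin{aligned}
f(w + d) - f(w) &= \left( \|w + d\|_1 + L(w + d) \right) - \left( \|w\|_1 + L(w) \right) \\
&\approx \nabla L(w)^T d + \frac{1}{2} d^T \nabla^2 L(w) d + \|w + d\|_1 - \|w\|_1.
\end{aligned}
$$

Then

$$
w \leftarrow w + d
$$

Line search is needed for convergence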

### GLMNET (Friedman et al., 2010) II

But how to handle quadratic minimization with some one-norm terms?

GLMNET uses coordinate descent

For logistic regression:

$$
\nabla L(w) = C \sum_{i=1}^{l} \left( \tau(y_i w^T x_i) - 1 \right) y_i x_i
$$

$$
\nabla^2 L(w) = C X^T D X,
$$

where $D \in R^{l \times l}$ is a diagonal matrix with

### Bundle Methods (Teo et al., 2010) I

Also called cutting plane method

$L(w)$ a convex loss function

If $w_k$ the current solution,

$$
L(w) \ge \nabla L(w_k)^T (w - w_k) + L(w_k) = a_k^T w + b_k, \quad \forall w,
$$

where

$$
a_k \equiv \nabla L(w_k) \quad \text{and} \quad b_k \equiv L(w_k) - a_k^T w_k.
$$

### Bundle Methods (Teo et al., 2010) II

Maintains all earlier cutting planes to form a lower-bound function for $L(w)$:

$$
L(w) \ge L_k^{CP}(w) \equiv \max_{1 \le t \le k} a_t^T w + b_t, \quad \forall w.
$$

Obtaining $w_{k+1}$ by solving

$$
\min_w \; \|w\|_1 + L_k^{CP}(w).
$$
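The lower-bound property of the cutting planes can be checked numerically. The sketch below is an illustration only: it assumes the logistic loss with its gradient $C \sum_i (\tau(y_i w^T x_i) - 1) y_i x_i$, builds planes at two arbitrarily chosen points, and compares $L_k^{CP}(w)$ with $L(w)$. It does not solve the minimization over $w$, and every name and number is hypothetical.

```python
import math

X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
Y = [1, -1, 1]
C = 1.0


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def tau(s):
    return 1.0 / (1.0 + math.exp(-s))


def L(w):
    return C * sum(math.log1p(math.exp(-y * dot(w, x))) for x, y in zip(X, Y))


def grad_L(w):
    g = [0.0] * len(w)
    for x, y in zip(X, Y):
        coef = C * (tau(y * dot(w, x)) - 1.0) * y
        for j, xj in enumerate(x):
            g[j] += coef * xj
    return g


def cutting_plane(w_k):
    """Return (a_k, b_k) with a_k = grad L(w_k) and b_k = L(w_k) - a_k^T w_k."""
    a = grad_L(w_k)
    return a, L(w_k) - dot(a, w_k)


def L_cp(planes, w):
    """Piecewise-linear lower bound: max over stored planes of a_t^T w + b_t."""
    return max(dot(a, w) + b for a, b in planes)


planes = [cutting_plane(w_k) for w_k in ([0.0, 0.0], [1.0, -1.0])]
for w in ([0.5, 0.5], [-1.0, 2.0], [1.0, -1.0]):
    print(w, L_cp(planes, w), L(w))
```

The bound is tight at the points where planes were generated and loose elsewhere, which is why further planes are added as the iterates move.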


### Bundle Methods (Teo et al., 2010) III

This is a linear program using $w = w^+ - w^-$:

$$
\min_{w^+, w^-, \zeta} \; \sum_{j=1}^{n} w_j^+ + \sum_{j=1}^{n} w_j^- + \zeta
$$

$$
\text{subject to } a_t^T (w^+ - w^-) + b_t \le \zeta, \; t = 1, \ldots, k, \quad w_j^+ \ge 0, \; w_j^- \ge 0, \; j = 1, \ldots, n.
$$

Experiments


### Data

| Data set | $l$ | $n$ | # of non-zeros |
|---|---|---|---|
| real-sim | 72,309 | 20,958 | 3,709,083 |
| news20 | 19,996 | 1,355,191 | 9,097,916 |
| rcv1 | 677,399 | 47,236 | 49,556,258 |
| yahoo-korea | 460,554 | 3,052,939 | 156,436,656 |

$l$: number of data, $n$: number of features

They are all document sets

4/5 for training and 1/5 for testing

Select best $C$ by cross validation on training

### Compared Methods

Software using $w^T x$ without $b$

BBR (Genkin et al., 2007)

SCD (Shalev-Shwartz and Tewari, 2009)

CDN: our coordinate descent implementation

TRON: our Newton implementation for bound-constrained formulation

OWL-QN (Andrew and Gao, 2007)

BMRM (Teo et al., 2010)

### Compared Methods (Cont’d)

Software using $w^T x + b$

CDN: our coordinate descent implementation

BBR (Genkin et al., 2007)

CGD-GS (Yun and Toh, 2009)

IPM (Koh et al., 2007)

GLMNET (Friedman et al., 2010)

Lassplore (Liu et al., 2009)

[Figure panels: real-sim, news20]

[Figure panels: real-sim, news20, rcv1, yahoo-korea]

[Figure panels: real-sim, news20]

### Observations and Conclusions

Decomposition methods better in the early stage

One-variable sub-problem in coordinate descent: use tight approximation if possible

Newton (IPM, GLMNET) and quasi Newton (OWL-QN): fast local convergence in the end

We also checked gradient and sparsity

Complete results (of more data sets) and programs are in Yuan et al. (2010); JMLR 2010 (11), 3183–3234

### References I

G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. In Proceedings of the Twenty Fourth International Conference on Machine Learning (ICML), 2007.

S. Balakrishnan and D. Madigan. Algorithms for sparse linear classifiers in the massive data setting. 2005. URL http://www.stat.rutgers.edu/~madigan/PAPERS/sm.pdf.

S. Benson and J. J. Moré. A limited memory variable metric method for bound constrained minimization. Preprint MCS-P909-0901, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, Illinois, 2001.

D. L. Donoho and Y. Tsaig. Fast solution of l1 minimization problems when the solution may be sparse. IEEE Transactions on Information Theory, 54:4789–4812, 2008.

J. Duchi and Y. Singer. Boosting with structural sparsity. In Proceedings of the Twenty Sixth International Conference on Machine Learning (ICML), 2009.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the L1-ball for learning in high dimensions. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008.

M. A. T. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:1150–1159, 2003.


### References II

E. M. Gafni and D. P. Bertsekas. Two-metric projection methods for constrained optimization. SIAM Journal on Control and Optimization, 22:936–964, 1984.

A. Genkin, D. D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(3):291–304, 2007.

J. Goodman. Exponential priors for maximum entropy models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2004.

J. Kazama and J. Tsujii. Evaluation and extension of maximum entropy models with inequality constraints. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 137–144, 2003.

S. S. Keerthi and S. Shevade. A fast tracking algorithm for generalized LARS/LASSO. IEEE Transactions on Neural Networks, 18(6):1826–1830, 2007.

J. Kim, Y. Kim, and Y. Kim. A gradient-based optimization algorithm for LASSO. Journal of Computational and Graphical Statistics, 17(4):994–1009, 2008.

Y. Kim and J. Kim. Gradient LASSO for feature selection. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.

J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132:1–63, 1997.


### References III

K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, 2007. URL http://www.stanford.edu/~boyd/l1_logistic_reg.html.

B. Krishnapuram, A. J. Hartemink, L. Carin, and M. A. T. Figueiredo. A Bayesian approach to joint feature selection and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):1105–1111, 2004.

B. Krishnapuram, L. Carin, M. A. T. Figueiredo, and A. J. Hartemink. Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957–968, 2005.

J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:771–801, 2009.

S.-I. Lee, H. Lee, P. Abbeel, and A. Y. Ng. Efficient l1 regularized logistic regression. In Proceedings of the Twenty-first National Conference on Artificial Intelligence (AAAI-06), pages 1–9, Boston, MA, USA, July 2006.

C.-J. Lin and J. J. Moré. Newton’s method for large-scale bound constrained problems. SIAM Journal on Optimization, 9:1100–1127, 1999.

J. Liu, J. Chen, and J. Ye. Large-scale sparse logistic regression. In Proceedings of The 15th


### References IV

M. Y. Park and T. Hastie. L1 regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society Series B, 69:659–677, 2007.

S. Perkins, K. Lacker, and J. Theiler. Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.

S. Rosset. Following curved regularized optimization solution paths. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1153–1160, Cambridge, MA, 2005. MIT Press.

V. Roth. The generalized LASSO. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.

M. Schmidt, G. Fung, and R. Rosales. Optimization methods for l1-regularization. Technical Report TR-2009-19, University of British Columbia, 2009.

S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1 regularized loss minimization. In Proceedings of the Twenty Sixth International Conference on Machine Learning (ICML), 2009.

S. K. Shevade and S. S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.

J. Shi, W. Yin, S. Osher, and P. Sajda. A fast hybrid algorithm for large scale l1-regularized logistic regression. Journal of Machine Learning Research, 11:713–741, 2010.

C. H. Teo, S. Vishwanathan, A. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11:311–365, 2010.


### References V

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58:267–288, 1996.

P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117:387–423, 2007.

S. J. Wright. Accelerated block-coordinate relaxation for regularized optimization. Technical report, University of Wisconsin, 2010.

J. Yu, S. Vishwanathan, S. Gunter, and N. N. Schraudolph. A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. Journal of Machine Learning Research, 11:1–57, 2010.

G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. A comparison of optimization methods and software for large-scale l1-regularized linear classification. Journal of Machine Learning Research, 11:3183–3234, 2010. URL http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf.

S. Yun and K.-C. Toh. A coordinate gradient descent method for l1-regularized convex minimization. 2009. To appear in Computational Optimizations and Applications.

P. Zhao and B. Yu. Stagewise lasso. Journal of Machine Learning Research, 8:2701–2726,
