
Gradient Methods for Sparse Optimization with Nonsmooth Regularization


(1)

Paul Tseng

Mathematics, University of Washington Seattle

SIMAI, Roma September 15, 2008

(Joint work with Sylvain Sardy (Univ. Geneva) and Sangwoon Yun (NUS))

(2)

• First-Order Methods for Smooth Optimization

• Application I: Basis Pursuit/Lasso in Compressed Sensing

(3)

• First-Order Methods for Smooth Optimization

• Application I: Basis Pursuit/Lasso in Compressed Sensing

• Optimization with Nonsmooth Regularization

• Block-Coordinate Gradient Descent Method

• Application II: Group Lasso for Logistic Regression

• Application III: Sparse Inverse Covariance Estimation

(4)

• First-Order Methods for Smooth Optimization

• Application I: Basis Pursuit/Lasso in Compressed Sensing

• Optimization with Nonsmooth Regularization

• Block-Coordinate Gradient Descent Method

• Application II: Group Lasso for Logistic Regression

• Application III: Sparse Inverse Covariance Estimation

• Accelerated Gradient Method

• Conclusions & Future Work

(5)

$$\min_{x=(x_1,\ldots,x_n)} f(x)$$

$f: \Re^n \to \Re$ is cont. diff.

(6)

$$\min_{x=(x_1,\ldots,x_n)} f(x)$$

$f: \Re^n \to \Re$ is cont. diff.

Gradient Descent: Given $x \in \Re^n$, choose $H \in \Re^{n\times n}$, $H \succ 0$, and $\alpha > 0$. Update
$$x^{\rm new} = x - \alpha H^{-1}\nabla f(x).$$

(7)

$$\min_{x=(x_1,\ldots,x_n)} f(x)$$

$f: \Re^n \to \Re$ is cont. diff.

Gradient Descent: Given $x \in \Re^n$, choose $H \in \Re^{n\times n}$, $H \succ 0$, and $\alpha > 0$. Update
$$x^{\rm new} = x - \alpha H^{-1}\nabla f(x).$$

Coordinate Minimization: Given $x \in \Re^n$, choose $i \in \{1,\ldots,n\}$. Update
$$x^{\rm new} = \arg\min_{u:\; u_j = x_j \ \forall j \neq i} f(u).$$

Gauss-Seidel: Choose $i$ cyclically, $1, 2, \ldots, n, 1, 2, \ldots$

Gauss-Southwell: Choose $i$ with $\bigl|\tfrac{\partial f}{\partial x_i}(x)\bigr|$ maximum.
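As a concrete illustration of the two update rules above, here is a minimal NumPy sketch (my own, not from the slides): a scaled gradient step with a user-supplied positive definite $H$, and an exact coordinate minimization step with either block-choice rule. The 1-D subproblem is solved with `scipy.optimize.minimize_scalar`; all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def scaled_gradient_step(x, grad_f, H, alpha):
    """x_new = x - alpha * H^{-1} grad f(x), with H positive definite."""
    return x - alpha * np.linalg.solve(H, grad_f(x))

def coordinate_min_step(x, f, grad_f, rule="gauss-southwell", k=0):
    """Minimize f over one coordinate, keeping the others fixed."""
    n = x.size
    if rule == "gauss-southwell":
        i = int(np.argmax(np.abs(grad_f(x))))  # coordinate with largest |df/dx_i|
    else:                                      # Gauss-Seidel: cycle 1, 2, ..., n, 1, 2, ...
        i = k % n

    def f_i(t):
        u = x.copy()
        u[i] = t
        return f(u)

    x_new = x.copy()
    x_new[i] = minimize_scalar(f_i).x          # exact 1-D minimization
    return x_new
```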

(8)

• Coord. min. method is simple, and efficient for large “weakly coupled” problems (off-diagonal blocks of $\nabla^2 f(x)$ not too large).

(9)

• Coord. min. method is simple, and efficient for large “weakly coupled” problems (off-diagonal blocks of $\nabla^2 f(x)$ not too large).

• If $f$ is convex, then every cluster point of the x-sequence is a minimizer (Zadeh ’70). If $f$ is nonconvex, then Gauss-Seidel can cycle (Powell ’73), though Gauss-Southwell still converges.

(10)

• Coord. min. method is simple, and efficient for large “weakly coupled” problems (off-diagonal blocks of $\nabla^2 f(x)$ not too large).

• If $f$ is convex, then every cluster point of the x-sequence is a minimizer (Zadeh ’70). If $f$ is nonconvex, then Gauss-Seidel can cycle (Powell ’73), though Gauss-Southwell still converges.

• Can get stuck at a non-stationary point if $f$ is nondifferentiable. But if the nondifferentiable part is separable, then convergence is possible.

(11)

Example: $\min_{x=(x_1,x_2)} \;(x_1 + x_2)^2 + \tfrac{1}{4}(x_1 - x_2)^2 + |x_1| + |x_2|$
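To see why separability of the nonsmooth part matters, a small sketch of my own runs exact Gauss-Seidel coordinate minimization on this example: each one-dimensional subproblem is a strongly convex quadratic plus $|u|$, so it has a soft-thresholding closed form. The starting point is an arbitrary choice for illustration.

```python
import numpy as np

def scalar_l1_min(a, b):
    """argmin_u  a*u**2 + b*u + |u|   (a > 0): soft-threshold of -b, scaled by 1/(2a)."""
    return -np.sign(b) * max(abs(b) - 1.0, 0.0) / (2.0 * a)

def f(x1, x2):
    return (x1 + x2)**2 + 0.25 * (x1 - x2)**2 + abs(x1) + abs(x2)

# Gauss-Seidel coordinate minimization from an arbitrary starting point.
x1, x2 = 3.0, -2.0
for _ in range(20):
    # minimize over x1 with x2 fixed: the x1^2 coefficient is 1.25, the x1 coefficient is 1.5*x2
    x1 = scalar_l1_min(1.25, 1.5 * x2)
    # minimize over x2 with x1 fixed (same structure by symmetry)
    x2 = scalar_l1_min(1.25, 1.5 * x1)
print(x1, x2, f(x1, x2))   # iterates settle at the minimizer (here the origin)
```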

(12)

$$\min_x \;\|Ax - b\|_2^2 + c\|x\|_1$$

Tibshirani ’96, Fu ’98, Osborne et al. ’98, Chen, Donoho, Saunders ’99, ...

$A \in \Re^{m\times n}$, $b \in \Re^m$, $c > 0$. $\ell_1$-regularization induces soln sparsity.

• Typically $m \ge 1000$, $n \ge 8000$, and $A$ is dense. $\|\cdot\|_1$ is nonsmooth.


• Can reformulate this as a convex QP and solve using an IP method. Chen, Donoho, Saunders ’99

• When the columns of A come from an overcomplete set of basis functions associated with a fast transform (e.g., wavelet packets), this can be solved faster using block-coordinate minimization (Gauss-Southwell). Sardy, Bruce, T ’00
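For intuition, here is a minimal dense-matrix sketch (my own, not the authors’ fast-transform implementation) of Gauss-Seidel coordinate minimization for the Lasso objective above; each one-dimensional subproblem has a soft-thresholding closed form.

```python
import numpy as np

def soft(z, t):
    """Soft-threshold operator: argmin_u 0.5*(u - z)**2 + t*|u|, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_bcm(A, b, c, n_sweeps=50):
    """Gauss-Seidel coordinate minimization for min_x ||Ax - b||_2^2 + c*||x||_1.

    A dense sketch; the variant in the slides exploits fast transforms instead of
    explicit columns of A.
    """
    m, n = A.shape
    x = np.zeros(n)
    r = b - A @ x                       # running residual
    col_sq = (A * A).sum(axis=0)        # ||a_i||^2 for each column
    for _ in range(n_sweeps):
        for i in range(n):
            if col_sq[i] == 0.0:
                continue
            rho = A[:, i] @ r + col_sq[i] * x[i]     # a_i^T (b - A x + a_i x_i)
            x_new = soft(rho, c / 2.0) / col_sq[i]   # exact 1-D minimizer
            r += A[:, i] * (x[i] - x_new)            # keep the residual up to date
            x[i] = x_new
    return x
```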

(13)

[Figure: spectrograms bp.cptable.spectrogram and ew.cptable.spectrogram; horizontal axis microseconds, vertical axis KHz. $m = 2^{11} = 2048$, $c = 4$, $b$: top image, $A$: local cosine transform, all but 4 levels.]

(14)

[Figure: objective value vs. CPU time (log scale) for IP and BCR. Comparing CPU times of IP and BCM (S-Plus, Sun Ultra 1).]

(15)

Can BCM (Gauss-Seidel & Gauss-Southwell) be extended to efficiently solve more general nonsmooth problems?

ML Estimation with $\ell_1$-Regularization

$$\min_x \;-\ell(Ax; b) + c \sum_{i\in J} |x_i|$$

$\ell$ is the log likelihood, $\{A_i\}_{i\notin J}$ are lin. indep. “coarse-scale wavelets”, $c > 0$

• $-\ell(y; b) = \tfrac{1}{2}\|y - b\|_2^2$  (Gaussian noise)

• $-\ell(y; b) = \sum_{i=1}^m (y_i - b_i \ln y_i)$, $y_i \ge 0$  (Poisson noise)
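For reference, the two loss choices written out as code (a trivial sketch of my own; the small `eps` guard for the Poisson log at $y_i = 0$ is an added assumption, not part of the slides).

```python
import numpy as np

def neg_loglik_gaussian(y, b):
    """-l(y; b) = 0.5 * ||y - b||_2^2  (Gaussian noise)."""
    return 0.5 * np.sum((y - b) ** 2)

def neg_loglik_poisson(y, b, eps=1e-12):
    """-l(y; b) = sum_i (y_i - b_i * ln y_i), defined for y_i >= 0  (Poisson noise)."""
    y = np.maximum(y, eps)  # guard the logarithm at the boundary y_i = 0
    return np.sum(y - b * np.log(y))
```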

(16)

$$\min_x \;F_c(x) := f(x) + cP(x)$$

$f: \Re^n \to \Re$ is cont. diff., $c \ge 0$.

$P: \Re^n \to (-\infty, \infty]$ is proper, convex, lsc, and block-separable, i.e., $P(x) = \sum_{I\in\mathcal{C}} P_I(x_I)$ (the $I \in \mathcal{C}$ partition $\{1,\ldots,n\}$).

• $P(x) = \|x\|_1$: Basis Pursuit/Lasso

• $P(x) = \sum_{I\in\mathcal{C}} \|x_I\|_2$: group Lasso

• $P(x) = 0$ if $l \le x \le u$, $\infty$ else: bound-constrained NLP

(17)

$$\min_x \;F_c(x) := f(x) + cP(x)$$

$f: \Re^n \to \Re$ is cont. diff., $c \ge 0$.

$P: \Re^n \to (-\infty, \infty]$ is proper, convex, lsc, and block-separable, i.e., $P(x) = \sum_{I\in\mathcal{C}} P_I(x_I)$ (the $I \in \mathcal{C}$ partition $\{1,\ldots,n\}$).

• $P(x) = \|x\|_1$: Basis Pursuit/Lasso

• $P(x) = \sum_{I\in\mathcal{C}} \|x_I\|_2$: group Lasso

• $P(x) = 0$ if $l \le x \le u$, $\infty$ else: bound-constrained NLP

Idea: Do BCM on a quadratic approximation of $f$.
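What makes this idea cheap is that each of the three regularizers above has a simple proximal map, which is what gives the direction subproblems on the next slides their closed forms when the quadratic term is diagonal. A minimal sketch of my own (not from the slides):

```python
import numpy as np

def prox_l1(z, t):
    """argmin_x 0.5*||x - z||^2 + t*||x||_1  (Basis Pursuit / Lasso penalty)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_group_l2(z, t):
    """argmin_x 0.5*||x - z||^2 + t*||x||_2 for one block (group Lasso penalty)."""
    nz = np.linalg.norm(z)
    return np.zeros_like(z) if nz <= t else (1.0 - t / nz) * z

def prox_box(z, l, u):
    """argmin_{l <= x <= u} 0.5*||x - z||^2: projection onto the box (bound-constrained NLP)."""
    return np.clip(z, l, u)
```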

(18)

For $x \in \operatorname{dom}P$, $\emptyset \neq I \subseteq \{1,\ldots,n\}$, and $H \succ 0$, let $d_H(x; I)$ and $q_H(x; I)$ be the optimal soln and obj. value of the direction subproblem

$$\min_{d:\; d_i = 0 \ \forall i \notin I} \;\Bigl\{\nabla f(x)^T d + \tfrac{1}{2} d^T H d + cP(x + d) - cP(x)\Bigr\}$$

Properties:

• $d_H(x; \{1,\ldots,n\}) = 0 \iff F_c'(x; d) \ge 0 \ \forall d \in \Re^n$ (stationarity).

• $H$ diagonal, $P$ “simple” $\Rightarrow$ $d_H(x; I)$ has “closed form”.

• The case of $H = I$ and $I = \{1,\ldots,n\}$ has been proposed previously. Fukushima & Mine ’81, Daubechies et al. ’04, ...
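For example, with diagonal $H = \operatorname{diag}(h)$ and $P = \|\cdot\|_1$ the subproblem separates by coordinate and the “closed form” is a soft-threshold of a scaled gradient step. A sketch of my own (`I` is an index array; the names are illustrative):

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def direction_l1(x, grad, h, c, I):
    """d_H(x; I) for P = ||.||_1 and diagonal H = diag(h) > 0.

    For each coordinate i in I the subproblem is
        min_d  grad_i*d + 0.5*h_i*d**2 + c*|x_i + d|,
    whose minimizer is a soft-threshold of the scaled gradient step.
    """
    d = np.zeros_like(x)
    u = soft(x[I] - grad[I] / h[I], c / h[I])   # new values of the coordinates in I
    d[I] = u - x[I]
    return d

def q_value(x, grad, h, c, d):
    """q_H(x; I): objective value of the direction subproblem at d (d is zero outside I)."""
    return grad @ d + 0.5 * np.sum(h * d * d) + c * (np.abs(x + d).sum() - np.abs(x).sum())
```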

(19)

Update $x = x + \alpha\, d_H(x; I)$ ($\alpha > 0$) until “convergence.”

Gauss-Seidel: Choose $I \in \mathcal{C}$ cyclically.

Gauss-Southwell: Choose $I$ with
$$q_D(x; I) \le \upsilon\, q_D(x; \{1,\ldots,n\})$$
($0 < \upsilon \le 1$, $D \succ 0$ is diagonal, e.g., $D = I$ or $D = \operatorname{diag}(H)$).

(20)

Update $x = x + \alpha\, d_H(x; I)$ ($\alpha > 0$) until “convergence.”

Gauss-Seidel: Choose $I \in \mathcal{C}$ cyclically.

Gauss-Southwell: Choose $I$ with
$$q_D(x; I) \le \upsilon\, q_D(x; \{1,\ldots,n\})$$
($0 < \upsilon \le 1$, $D \succ 0$ is diagonal, e.g., $D = I$ or $D = \operatorname{diag}(H)$).

Inexact Armijo LS: $\alpha$ = largest element of $\{1, \beta, \beta^2, \ldots\}$ satisfying $F_c(x + \alpha d) - F_c(x) \le 0.1\,\alpha\, q_H(x; I)$ ($0 < \beta < 1$).
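Putting the pieces together, here is a sketch of one BCGD iteration for $F_c(x) = f(x) + c\|x\|_1$ (my own illustration; it reuses `direction_l1` and `q_value` from the previous sketch, takes $D = H = \operatorname{diag}(h)$, and falls back to $I = \{1,\ldots,n\}$ if no candidate block meets the Gauss-Southwell threshold, which that set always satisfies since $\upsilon \le 1$).

```python
import numpy as np

def bcgd_step(x, f, grad_f, h, c, blocks, upsilon=0.9, beta=0.5, sigma=0.1):
    """One BCGD iteration with Gauss-Southwell block choice and inexact Armijo line search."""
    g = grad_f(x)
    full = np.arange(x.size)
    q_full = q_value(x, g, h, c, direction_l1(x, g, h, c, full))
    # Gauss-Southwell: best candidate block; fall back to all coordinates if needed.
    cand = [(q_value(x, g, h, c, direction_l1(x, g, h, c, I)), I) for I in blocks]
    q_best, I = min(cand, key=lambda t: t[0])
    if q_best > upsilon * q_full:
        I = full
    d = direction_l1(x, g, h, c, I)
    q_I = q_value(x, g, h, c, d)
    Fc = lambda z: f(z) + c * np.abs(z).sum()
    alpha = 1.0
    while Fc(x + alpha * d) - Fc(x) > sigma * alpha * q_I:   # sufficient-decrease test
        alpha *= beta                                        # backtrack: 1, beta, beta^2, ...
    return x + alpha * d
```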

(21)

(a) If $\underline{\lambda} I \preceq D, H \preceq \bar{\lambda} I$ ($0 < \underline{\lambda} \le \bar{\lambda}$), then every cluster point of the x-sequence generated by the BCGD method is a stationary point of $F_c$.

(b) If in addition $P$ and $f$ satisfy any of the following assumptions, then the x-sequence converges linearly in the root sense.

A1: $f$ is strongly convex, $\nabla f$ is Lipschitz cont. on $\operatorname{dom}P$.

A2: $f$ is (nonconvex) quadratic. $P$ is polyhedral.

A3: $f(x) = g(Ax) + q^T x$, where $A \in \Re^{m\times n}$, $q \in \Re^n$, $g$ is strongly convex, $\nabla g$ is Lipschitz cont. on $\Re^m$. $P$ is polyhedral.

Note: BCGD has a stronger global convergence property (and cheaper iterations) than BCM.

(22)

$$\min_x \;f(x) + c \sum_{I\in\mathcal{C}} \omega_I \|x_I\|_2$$

Yuan, Lin ’06; Kim³ ’06; Meier, van de Geer, Bühlmann ’06; ...

$c > 0$, $\omega_I > 0$.

$$f(x) = \sum_{j=1}^N \log\bigl(1 + e^{a_j^T x}\bigr) - b_j a_j^T x \qquad (a_j \in \Re^n,\; b_j \in \{0,1\})$$

• $f$ is convex, cont. diff. $\|\cdot\|_2$ is convex, nonsmooth. In prediction of short DNA motifs, $n > 1000$, $N > 11{,}000$.

• BCM-GSeidel has been used (Yuan, Lin ’06), but each iteration is expensive. Every cluster point of the x-sequence is a minimizer (T ’01).

• BCGD-GSeidel is significantly more efficient (Meier et al. ’06). Every cluster point of the x-sequence is a minimizer (T, Yun ’06). Linear convergence?
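For concreteness, a sketch of my own (not Meier et al.’s implementation) of the two ingredients BCGD needs here: the logistic loss and its gradient, and the closed-form block direction when $H_I$ is a multiple of the identity.

```python
import numpy as np

def logistic_f_and_grad(x, A, b):
    """f(x) = sum_j log(1 + exp(a_j^T x)) - b_j * a_j^T x and its gradient.

    Rows of A are the a_j; b is in {0,1}^N.  logaddexp keeps the log term stable.
    """
    z = A @ x
    f = np.sum(np.logaddexp(0.0, z) - b * z)
    grad = A.T @ (1.0 / (1.0 + np.exp(-z)) - b)   # A^T (sigmoid(z) - b)
    return f, grad

def direction_group(xI, gI, hI, c_omega):
    """d_H(x; I) for one group with H_I = hI * Id and penalty c*omega_I*||x_I||_2.

    The minimizer is a block soft-threshold of the scaled gradient step z.
    """
    z = xI - gI / hI
    nz = np.linalg.norm(z)
    u = np.zeros_like(z) if nz <= c_omega / hI else (1.0 - c_omega / (hI * nz)) * z
    return u - xI
```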

(23)

$$\min_{X \in \mathcal{S}^n_+} \;f(X) + c\|X\|_1$$

Meinshausen, Bühlmann ’06; Yuan, Lin ’07; Banerjee, El Ghaoui, d’Aspremont ’07; Friedman, Hastie, Tibshirani ’07

$c > 0$, $\|X\|_1 = \sum_{ij} |X_{ij}|$,

$$f(X) = -\log\det X + \operatorname{tr}(XS) \qquad (S \in \mathcal{S}^n_+ \text{ is the empirical covariance matrix})$$

• $f$ is strictly convex, cont. diff. on its domain, $O(n^3)$ ops to evaluate. $\|\cdot\|_1$ is convex, nonsmooth. In applications, $n$ can exceed 6000.

The Fenchel dual problem (Rockafellar ’70) is a bound-constrained convex program:

$$\min_{W \in \mathcal{S}^n_+,\; \|W - S\| \le c} \;-\log\det(W), \qquad \|Y\| = \max_{ij} |Y_{ij}|.$$

(24)
(25)

• Nesterov’s first-order smoothing method requires $O(n/\epsilon)$ ops.

• Use BCM-GSeidel to solve the dual problem, cycling thru columns $i = 1, \ldots, n$ of $W$. Each iteration reduces (via determinant property & duality) to
$$\min_{\xi \in \Re^{n-1}} \;\tfrac{1}{2}\xi^T W_{\neg i \neg i}\,\xi - S_{\neg i\, i}^T \xi + c\|\xi\|_1.$$
Solve this using an IP method ($O(n^3)$ ops) Banerjee et al ’07 or BCM-GSeidel Friedman et al ’07.

(26)

• Use BCM-GSeidel to solve the dual problem, cycling thru columns $i = 1, \ldots, n$ of $W$. Each iteration reduces (via determinant property & duality) to
$$\min_{\xi \in \Re^{n-1}} \;\tfrac{1}{2}\xi^T W_{\neg i \neg i}\,\xi - S_{\neg i\, i}^T \xi + c\|\xi\|_1.$$
Solve this using an IP method ($O(n^3)$ ops) Banerjee et al ’07 or BCM-GSeidel Friedman et al ’07.

• Can apply BCGD-GSeidel to either primal or dual problem. More efficient?

Applied to the primal, each iteration entails
$$\min_{u \in \Re^n} \;\Bigl\{\operatorname{tr}\bigl((-X^{-1} + S)D\bigr) + \tfrac{1}{2}u^T H u + c\|X + D\|_1\Bigr\}\Bigm|_{D = u e_i^T + e_i u^T}.$$
For diagonal $H$, the minimizing $D$ has closed form! For each trial $\alpha$ in the Armijo LS, $\det(X + \alpha D)$ can be evaluated from $\det X$ and $X^{-1}$ in $O(n^2)$ ops.

Update $X^{-1}$ in $O(n^2)$ ops. Similar application to the dual. Toh, T, Yun, forthcoming.
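A sketch of my own (not Toh, T, Yun’s code) of how both $O(n^2)$ updates can be done: $D = u e_i^T + e_i u^T$ is a rank-2 perturbation, so the matrix determinant lemma gives $\log\det(X + \alpha D)$ and the Sherman-Morrison-Woodbury formula gives the new inverse, both from the current $X^{-1}$ without refactorizing.

```python
import numpy as np

def logdet_and_inverse_rank2_update(X_inv, log_det_X, u, i, alpha):
    """Return log det(X + alpha*D) and (X + alpha*D)^{-1} for D = u e_i^T + e_i u^T.

    Uses only X^{-1} and log det X, in O(n^2) ops; assumes X + alpha*D stays
    positive definite so the log is well defined.
    """
    n = u.size
    e_i = np.zeros(n)
    e_i[i] = 1.0
    U = alpha * np.column_stack([u, e_i])      # U V^T = alpha * (u e_i^T + e_i u^T)
    V = np.column_stack([e_i, u])
    XiU = X_inv @ U                            # O(n^2): the dominant cost
    M = np.eye(2) + V.T @ XiU                  # 2 x 2 capacitance matrix
    new_log_det = log_det_X + np.log(np.linalg.det(M))          # determinant lemma
    new_X_inv = X_inv - XiU @ np.linalg.solve(M, V.T @ X_inv)   # Woodbury update, O(n^2)
    return new_log_det, new_X_inv
```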

(27)

When $f$ is convex and $\nabla f$ is Lipschitz cont. on $\operatorname{dom}P$ with constant $L$, BCGD-GSeidel finds an $\epsilon$-optimal solution in
$$O\!\left(\frac{L\|x^{\rm init} - x^{\rm opt}\|^2}{\epsilon}\right)$$
iterations ($\epsilon > 0$). T, Yun ’08

Can global convergence be accelerated?

(28)

Given $x \in \operatorname{dom}P$ and $\theta \in (0, 1]$, choose $I = \{1,\ldots,n\}$, $H = LI$. Update
$$y = x + \left(\frac{\theta}{\theta^{\rm prev}} - \theta\right)(x - x^{\rm prev}), \qquad x^{\rm new} = y + d_H(y; I), \qquad \theta^{\rm new} = \frac{\sqrt{\theta^4 + 4\theta^2} - \theta^2}{2}$$
until “convergence,” with $\theta^{\rm init} = 1$, $x^{\rm init} \in \operatorname{dom}P$. Nesterov, Auslender, Beck, Teboulle, Lan, Lu, Monteiro, ...

$\theta = O(1/k)$ after $k$ iterations.

This method finds an $\epsilon$-optimal solution in
$$O\!\left(\sqrt{\frac{L\|x^{\rm init} - x^{\rm opt}\|^2}{\epsilon}}\right)$$
iterations.
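A minimal NumPy sketch of this recursion (my own illustration), specialized to $P = c\|\cdot\|_1$ so that $y + d_H(y; I)$ with $H = LI$ is a soft-thresholded gradient step.

```python
import numpy as np

def soft(z, t):
    """Elementwise soft-threshold: argmin_u 0.5*(u - z)**2 + t*|u|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def accelerated_prox_gradient(x0, grad_f, L, c, n_iter=200):
    """Accelerated gradient sketch for min f(x) + c*||x||_1 using the theta recursion above."""
    x = x_prev = x0.copy()
    theta = theta_prev = 1.0
    for _ in range(n_iter):
        y = x + (theta / theta_prev - theta) * (x - x_prev)   # extrapolation step
        x_prev = x
        x = soft(y - grad_f(y) / L, c / L)                    # y + d_H(y; {1,...,n}) for H = L*I
        theta_prev = theta
        theta = (np.sqrt(theta**4 + 4.0 * theta**2) - theta**2) / 2.0
    return x
```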

(29)

1. Nonsmooth regularization induces sparsity in the solution, avoids oversmoothing signals, and is useful for variable selection.

2. The regularized problem can be solved effectively by BCM or BCGD or IP, exploiting the problem structure.

(30)

1. Nonsmooth regularization induces sparsity in the solution, avoids oversmoothing signals, and is useful for variable selection.

2. The regularized problem can be solved effectively by BCM or BCGD or IP, exploiting the problem structure.

3. Extension of BCM, BCGD to handle linear constraints Ax = b is possible, including Support Vector Machine training T, Yun, ’07, ’08.

4. Other applications, including stochastic volatility models Neto, Sardy, T, forthcoming.

(31)

1. Nonsmooth regularization induces sparsity in the solution, avoids oversmoothing signals, and is useful for variable selection.

2. The regularized problem can be solved effectively by BCM or BCGD or IP, exploiting the problem structure.

3. Extension of BCM, BCGD to handle linear constraints Ax = b is possible, including Support Vector Machine training T, Yun, ’07, ’08.

4. Other applications, including stochastic volatility models Neto, Sardy, T, forthcoming.

5. Extension of BCGD to nonconvex nonsmooth regularization is possible (e.g. $\ell_p$-regularization, $0 < p < 1$) Sardy, T, forthcoming. Finds stationary points.

6. Incorporating Nesterov’s acceleration approach within BCGD?

(32)

Grazie!

