(1)

Block-Coordinatewise Methods for Sparse Optimization with Nonsmooth Regularization

Paul Tseng

Mathematics, University of Washington, Seattle

Hong Kong Polytechnic University, February 20, 2008

(Joint work with Sylvain Sardy (Univ. Geneva) and Sangwoon Yun (NUS))

Abstract

This is a talk given at UBC, April 2007.

(2)

Talk Outline

• Block-Coordinate Minimization

  ◦ Properties

  ◦ Application I: Basis Pursuit/Lasso

• Block-Coordinate Gradient Descent

  ◦ Properties

  ◦ Application II: Group Lasso for Logistic Regression

  ◦ Application III: Sparse Inverse Covariance Estimation

• Conclusions & Future Work

(3)

Block-Coordinate Minimization

$\min_{x=(x_1,\dots,x_n)} f(x)$, where $f : \mathbb{R}^n \to \mathbb{R}$ is cont. diff.

Given $x \in \mathbb{R}^n$, choose $i \in \{1,\dots,n\}$. Update
$$x^{\mathrm{new}} = \arg\min_{u \,:\, u_j = x_j \;\forall j \ne i} f(u).$$
Repeat until “convergence”.

Gauss-Seidel: Choose $i$ cyclically, $1, 2, \dots, n, 1, 2, \dots$

Gauss-Southwell: Choose $i$ with $\left|\frac{\partial f}{\partial x_i}(x)\right|$ maximum.
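To make the two selection rules concrete, below is a minimal NumPy sketch for a smooth quadratic $f(x) = \tfrac{1}{2}x^T Q x$, where each coordinate minimization has a closed form. The function name `bcm` and the quadratic test instance are illustrative, not from the talk.

```python
import numpy as np

def bcm(Q, x0, rule="gauss-seidel", iters=200):
    """Block-coordinate minimization of f(x) = 0.5*x'Qx (Q symmetric pos. def.),
    updating one coordinate per iteration; exact coordinate minimization is
    closed-form for a quadratic."""
    x = np.asarray(x0, dtype=float).copy()
    n = len(x)
    for k in range(iters):
        grad = Q @ x                               # gradient of f
        if rule == "gauss-seidel":                 # cycle 1, 2, ..., n, 1, 2, ...
            i = k % n
        else:                                      # Gauss-Southwell: max |df/dx_i|
            i = int(np.argmax(np.abs(grad)))
        x[i] -= grad[i] / Q[i, i]                  # set df/dx_i = 0 exactly
    return x

# 2-d example in the spirit of the next slide: (x1 + x2)^2 + (1/4)(x1 - x2)^2
Q = np.array([[2.5, 1.5], [1.5, 2.5]])
print(bcm(Q, [2.0, -1.0], rule="gauss-southwell"))   # tends to the minimizer 0
```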

(4)

Example: $\min_{x=(x_1,x_2)} (x_1+x_2)^2 + \tfrac{1}{4}(x_1-x_2)^2$

(5)

Properties:

• This method extends to update a block of coordinates at each iteration.

• It is simple, and efficient for large “weakly coupled” problems (off-block-diagonal blocks of $\nabla^2 f(x)$ not too large).

• If f is convex, then every cluster point of the x-sequence is a minimizer. Zadeh ’70

• If f is nonconvex, then G-Seidel can cycle (Powell ’73), but G-Southwell still converges.

• Can get stuck if f is nondifferentiable.

(6)

Example: $\min_{x=(x_1,x_2)} (x_1+x_2)^2 + \tfrac{1}{4}(x_1-x_2)^2 + |x_1+x_2|$

(7)

But, if the nondifferentiable part is separable, then convergence is possible.

Example: $\min_{x=(x_1,x_2)} (x_1+x_2)^2 + \tfrac{1}{4}(x_1-x_2)^2 + |x_1| + |x_2|$
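Here each coordinate subproblem is a one-dimensional quadratic plus an absolute value, which a soft-threshold solves exactly. A minimal sketch of Gauss-Seidel coordinate minimization for this example (function names are illustrative):

```python
import numpy as np

def soft(z, t):
    """argmin_u 0.5*(u - z)^2 + t*|u|  (soft-thresholding)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def bcm_separable(x, iters=50):
    """Coordinate minimization for (x1+x2)^2 + (1/4)(x1-x2)^2 + |x1| + |x2|."""
    Q = np.array([[2.5, 1.5], [1.5, 2.5]])   # Hessian of the smooth part
    x = np.asarray(x, dtype=float).copy()
    for k in range(iters):
        i, j = k % 2, 1 - (k % 2)            # Gauss-Seidel cycling
        # smooth part along coord i is (Q[i,i]/2)*(x_i - z)^2 + const
        z = -Q[i, j] * x[j] / Q[i, i]
        x[i] = soft(z, 1.0 / Q[i, i])        # weight on |x_i| is 1
    return x

print(bcm_separable([2.0, -1.0]))            # converges to the minimizer (0, 0)
```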

(8)

Application I: Basis Pursuit/Lasso

$\min_x\; F_c(x) := \|Ax - b\|_2^2 + c\|x\|_1$

Tibshirani ’96, Fu ’98 Osborne et al. ’98

Chen, Donoho, Saunders ’99 ...

$A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $c \ge 0$.

• Typically $m \ge 1000$, $n \ge 8000$, and A is dense. $\|\cdot\|_1$ is nonsmooth.


• Can reformulate this as a convex QP and solve using an IP method. Chen, Donoho, Saunders ’99

(9)

Assume the columns of A come from an overcomplete set of basis functions associated with a fast transform (e.g., wavelet packets).

BCM for BP:

Given x, choose $I \subseteq \{1,\dots,n\}$ with $|I| = m$ and $\{A_i\}_{i \in I}$ orthogonal. Update
$$x^{\mathrm{new}} = \arg\min_{u \,:\, u_i = x_i \;\forall i \notin I} F_c(u) \qquad \text{(has closed-form soln)}.$$
Repeat until “convergence”.

Gauss-Southwell: Choose I to maximize $\min_{v \in \partial_{x_I} F_c(x)} \|v\|_2$.

• Finds I in $O(n + m \log m)$ operations by the algorithm of Coifman & Wickerhauser.

• x-sequence is bounded & each cluster point minimizes Fc. Sardy, Bruce, T ’00
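A sketch of the closed-form block update, under the assumption that the selected columns A[:, I] are orthonormal, so the block subproblem reduces to componentwise soft-thresholding (helper names are illustrative):

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def bcm_bp_step(A, b, x, I, c):
    """One BCM step for F_c(x) = ||Ax - b||_2^2 + c*||x||_1, assuming the
    columns A[:, I] are orthonormal (A[:, I].T @ A[:, I] = identity)."""
    r = b - A @ x + A[:, I] @ x[I]         # residual with block I's contribution removed
    x = x.copy()
    x[I] = soft(A[:, I].T @ r, c / 2.0)    # closed-form minimizer over x_I
    return x
```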

(10)

Electronic surveillance:

[Figure: spectrograms “bp.cptable.spectrogram” and “ew.cptable.spectrogram”; horizontal axis 0–60 microseconds, vertical axis 50–65 KHz, intensity scale −40 to 20.]

$m = 2^{11} = 2048$, $c = 4$, b: top image, A: local cosine transform, all but 4 levels

(11)

Method efficiency:

[Figure: objective value vs. CPU time (log scale), real-part and imaginary-part panels, comparing BCR and IP.]

Comparing CPU times of IP and BCM (S-Plus, Sun Ultra 1).

(12)

• IP method requires fewer iterations, but each iteration is expensive (many CG steps per iteration). No good preconditioner for CG is known.

• BCM method requires more iterations, but each iteration is cheap.

• Convergence of BCM depends crucially on

  ◦ differentiability of $\|\cdot\|_2^2$

  ◦ separability of $\|\cdot\|_1$

  ◦ convexity of $F_c$ (stationarity ⇒ global minimum)

(13)

Generalization to ML Estimation with $\ell_1$-Regularization?

$$\min_x\; -\ell(Ax;\, b) + c \sum_{i \in J} |x_i|$$

$\ell$ is the log likelihood, $\{A_i\}_{i \notin J}$ are lin. indep. “coarse-scale Wavelets”, $c \ge 0$

• $-\ell(y; b) = \frac{1}{2}\|y - b\|_2^2$  (Gaussian noise)

• $-\ell(y; b) = \sum_{i=1}^{m} (y_i - b_i \ln y_i)$, $y_i \ge 0$  (Poisson noise)

Can solve this problem by adapting IP method. But IP method is slow (many CG steps per IP iteration) Antoniadis, Sardy, T ’04.


Adapt BCM method?

(14)

Optimization with Nonsmooth Regularization

$$\min_x\; F_c(x) := f(x) + cP(x)$$

$f : \mathbb{R}^n \to \mathbb{R}$ is cont. diff., $c \ge 0$.

$P : \mathbb{R}^n \to (-\infty, \infty]$ is proper, convex, lsc, and block-separable, i.e., $P(x) = \sum_{I \in \mathcal{C}} P_I(x_I)$ ($I \in \mathcal{C}$ partition $\{1,\dots,n\}$).

• $P(x) = \|x\|_1$  Basis Pursuit/Lasso

• $P(x) = \sum_{I \in \mathcal{C}} \|x_I\|_2$  group Lasso

• $P(x) = \begin{cases} 0 & \text{if } l \le x \le u \\ \infty & \text{else} \end{cases}$  bound-constrained NLP

Idea: Do BCM on a quadratic approx. of f.

(15)

Block-Coord. Gradient Descent Method

For $x \in \operatorname{dom} P$, $\emptyset \ne I \subseteq \{1,\dots,n\}$, and $H \succ 0$, let $d_H(x; I)$ and $q_H(x; I)$ be the optimal soln and obj. value of the direction subproblem
$$\min_{d \,:\, d_i = 0 \;\forall i \notin I} \left\{ \nabla f(x)^T d + \tfrac{1}{2} d^T H d + cP(x + d) - cP(x) \right\}.$$

Facts:

• $d_H(x; \{1,\dots,n\}) = 0 \iff F_c'(x; d) \ge 0 \;\forall d \in \mathbb{R}^n$.  (stationarity)

• $H$ diagonal, $P$ separable $\Rightarrow$ $d_H(x; I) = \sum_{i \in I} d_H(x; i)$, $q_H(x; I) = \sum_{i \in I} q_H(x; i)$.  (separability)

• $H$ diagonal, $P$ “simple” $\Rightarrow$ $d_H(x; I)$ has a “closed form”.
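For instance, with $P(x) = \|x\|_1$ and diagonal H, the single-coordinate direction and subproblem value reduce to a soft-threshold. A minimal sketch under those assumptions (names illustrative):

```python
def soft(z, t):
    """argmin_u 0.5*(u - z)^2 + t*|u|  (soft-thresholding)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def direction_l1(x_i, g_i, h_ii, c):
    """Closed-form d_H(x; i) and q_H(x; i) for P(x) = ||x||_1, diagonal H:
       minimize over d:  g_i*d + 0.5*h_ii*d^2 + c*(|x_i + d| - |x_i|)."""
    d = soft(x_i - g_i / h_ii, c / h_ii) - x_i
    q = g_i * d + 0.5 * h_ii * d * d + c * (abs(x_i + d) - abs(x_i))
    return d, q
```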

(16)

Given $x \in \operatorname{dom} P$, choose $I \subseteq \{1,\dots,n\}$, $H \succ 0$.

Update $x^{\mathrm{new}} = x + \alpha\, d_H(x; I)$ ($\alpha > 0$) until “convergence”.

Gauss-Seidel: Choose $I \in \mathcal{C}$ cyclically.

Gauss-Southwell: Choose $I$ with
$$q_D(x; I) \le \upsilon\, q_D(x; \{1,\dots,n\})$$
($0 < \upsilon \le 1$, $D \succ 0$ is diagonal, e.g., $D = I$ or $D = \operatorname{diag}(H)$).

Inexact Armijo LS: $\alpha$ = largest element of $\{1, \beta, \beta^2, \dots\}$ satisfying
$$F_c(x + \alpha d) - F_c(x) \le 0.1\, \alpha\, q_H(x; I) \qquad (0 < \beta < 1).$$
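Putting the pieces together: a minimal single-coordinate BCGD sketch for the Basis Pursuit/Lasso objective of Application I, with Gauss-Seidel cycling, diagonal $H = \operatorname{diag}(2A^TA)$, and the inexact Armijo rule above. The function name and the choice of H are illustrative assumptions, not prescriptions from the talk.

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def bcgd_lasso(A, b, c, iters=500, beta=0.5):
    """BCGD for F_c(x) = ||Ax - b||_2^2 + c*||x||_1, one coordinate per
    iteration, with the Armijo rule F_c(x+a*d) - F_c(x) <= 0.1*a*q_H(x;i)."""
    m, n = A.shape
    x = np.zeros(n)
    H = np.maximum(2.0 * np.sum(A * A, axis=0), 1e-12)   # diag of 2*A'A
    Fc = lambda v: np.sum((A @ v - b) ** 2) + c * np.sum(np.abs(v))
    for k in range(iters):
        i = k % n                                        # Gauss-Seidel
        g = 2.0 * A[:, i] @ (A @ x - b)                  # partial derivative of f
        d = soft(x[i] - g / H[i], c / H[i]) - x[i]       # closed-form direction
        q = g * d + 0.5 * H[i] * d * d + c * (abs(x[i] + d) - abs(x[i]))
        alpha, f0 = 1.0, Fc(x)
        xt = x.copy()
        xt[i] = x[i] + alpha * d
        while Fc(xt) - f0 > 0.1 * alpha * q and alpha > 1e-10:
            alpha *= beta                                # backtrack
            xt[i] = x[i] + alpha * d
        x = xt
    return x
```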

(17)

Convergence properties T, Yun ’06:

(a) If $\lambda I \preceq D, H \preceq \bar{\lambda} I$ ($0 < \lambda \le \bar{\lambda}$), then every cluster point of the x-sequence generated by the BCGD method is a stationary point of $F_c$.

(b) If in addition P and f satisfy any of the following assumptions, then the x-sequence converges linearly in the root sense.

A1: f is strongly convex, ∇f is Lipschitz cont. on dom P.

A2: f is (nonconvex) quadratic. P is polyhedral.

A3: $f(x) = g(Ax) + q^T x$, where $A \in \mathbb{R}^{m \times n}$, $q \in \mathbb{R}^n$, g is strongly convex, ∇g is Lipschitz cont. on $\mathbb{R}^m$. P is polyhedral.

Note: BCGD has stronger global convergence property (and cheaper iteration) than BCM.

(18)

Application II: Group Lasso for Logistic Regression

$$\min_x\; f(x) + c \sum_{I \in \mathcal{C}} \omega_I \|x_I\|_2$$

Yuan, Lin ’06; Kim³ ’06; Meier, van de Geer, Bühlmann ’06; ...

$c > 0$, $\omega_I > 0$.

$$f(x) = \sum_{j=1}^{N} \left[ \log\!\left(1 + e^{a_j^T x}\right) - y_j\, a_j^T x \right] \qquad (a_j \in \mathbb{R}^n,\; y_j \in \{0, 1\})$$

• f is convex, cont. diff. $\|\cdot\|_2$ is convex, nonsmooth. In prediction of short DNA motifs, n > 1000, N > 11,000.

• BCM-GSeidel has been used Yuan, Lin ’06, but each iteration is expensive. Every cluster point of the x-sequence is a minimizer T ’01.

• BCGD-GSeidel is significantly more efficient Meier et al ’06. Every cluster point of the x-sequence is a minimizer T, Yun ’06. Linear convergence?
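For one group I with a scalar diagonal Hessian approximation $H = h \cdot \mathrm{Id}$ on that block, the direction subproblem has the “group soft-threshold” closed form sketched below, together with the gradient of the logistic loss above. This is a sketch under those assumptions, not the implementation of Meier et al.; names are illustrative.

```python
import numpy as np

def logistic_grad(A, y, x):
    """Gradient of f(x) = sum_j [ log(1 + exp(a_j'x)) - y_j * a_j'x ],
    with the a_j' stored as the rows of A."""
    p = 1.0 / (1.0 + np.exp(-(A @ x)))      # sigmoid(a_j'x)
    return A.T @ (p - y)

def group_direction(x_I, g_I, h, c_omega):
    """Closed-form d_H(x; I) for P_I(x_I) = ||x_I||_2 and H = h*Identity:
       minimize over d:  g_I'd + 0.5*h*||d||^2 + c_omega*(||x_I + d|| - ||x_I||)."""
    z = x_I - g_I / h                        # unconstrained block step
    nz = np.linalg.norm(z)
    shrink = max(0.0, 1.0 - c_omega / (h * nz)) if nz > 0 else 0.0
    return shrink * z - x_I                  # group soft-threshold minus x_I
```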

(19)

Application III: Sparse Inverse Covariance Estimation

$$\min_{X \in S^n_+}\; f(X) + c\|X\|_1$$

Meinshausen, Bühlmann ’06; Yuan, Lin ’07;

Banerjee, El Ghaoui, d’Aspremont ’07; Friedman, Hastie, Tibshirani ’07; ...

$c > 0$, $\|X\|_1 = \sum_{ij} |X_{ij}|$,

$f(X) = -\log\det X + \operatorname{tr}(XS)$  ($S \in S^n_+$ is the empirical covariance matrix)

• f is strictly convex, cont. diff. on its domain, $O(n^3)$ ops to evaluate. $\|\cdot\|_1$ is convex, nonsmooth. In applications, n can exceed 6000.

The Fenchel dual problem Rockafellar ’70 is a bound-constrained convex program:
$$\min_{W \in S^n_+,\; \|W - S\| \le c} -\log\det(W), \qquad \|Y\| = \max_{ij} |Y_{ij}|.$$

(20)

• IP method requires $O(n^6 \log(1/\epsilon))$ ops to find an $\epsilon$-optimal soln. Impractical!

Nesterov’s first-order smoothing method requires $O(n^{4.5}/\epsilon)$ ops Banerjee et al ’07.

• Use BCM-GSeidel to solve the dual problem, cycling thru columns $i = 1,\dots,n$ of W. Each iteration reduces (via determinant property & duality) to
$$\min_{\xi \in \mathbb{R}^{n-1}}\; \tfrac{1}{2}\, \xi^T W_{\neg i \neg i}\, \xi - S_{\neg i\, i}^T \xi + c\|\xi\|_1.$$
Solve this using IP method ($O(n^3)$ ops) Banerjee et al ’07 or BCM-GSeidel Friedman et al ’07.

• Can apply BCGD-GSeidel to either primal or dual problem. More efficient?

Applied to the primal, each iteration entails
$$\min_{u \in \mathbb{R}^n}\; \Big\{ \operatorname{tr}\big((-X^{-1} + S)D\big) + \tfrac{1}{2} u^T H u + c\|X + D\|_1 \Big\}, \qquad D = u e_i^T + e_i u^T.$$

For diagonal H, the minimizing D has a closed form! For each trial α in the Armijo LS, $\det(X + \alpha D)$ can be evaluated from $\det X$ and $X^{-1}$ in $O(n^2)$ ops.

Update $X^{-1}$ in $O(n^2)$ ops. Similar application to the dual. Global convergence, asymptotic linear convergence, complexity analysis... Toh, T, Yun, forthcoming.
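The $O(n^2)$ determinant evaluation can be carried out with the matrix determinant lemma applied to the rank-2 perturbation $\alpha D = (\alpha u)e_i^T + e_i(\alpha u)^T$. A sketch assuming $\det X$ and $X^{-1}$ are already at hand (names illustrative):

```python
import numpy as np

def det_rank2_update(detX, Xinv, u, i, alpha):
    """det(X + alpha*D) for D = u*e_i' + e_i*u', from det(X) and X^{-1} in
    O(n^2) ops, via det(X + U V') = det(X) * det(I_2 + V' X^{-1} U)
    with U = [alpha*u, e_i] and V = [e_i, alpha*u]."""
    Xinv_u = Xinv @ u                        # O(n^2): X^{-1} u
    Xinv_ei = Xinv[:, i]                     # X^{-1} e_i (a column lookup)
    M = np.array([[1.0 + alpha * Xinv_u[i],      Xinv_ei[i]],
                  [alpha**2 * (u @ Xinv_u),      1.0 + alpha * (u @ Xinv_ei)]])
    return detX * (M[0, 0] * M[1, 1] - M[0, 1] * M[1, 0])
```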

(21)

Conclusions & Future Work

1. Nonsmooth regularization induces sparsity in the solution, avoids oversmoothing signals, and is useful for variable selection.

2. The regularized problem can be solved effectively by BCM or BCGD, exploiting the problem structure.

3. Extension of BCM, BCGD to handle linear constraints Ax = b is possible, including Support Vector Machine training T, Yun, ’07, ’08. Some open questions on efficient implementation and convergence analysis remain.

4. Many other applications, including stochastic volatility models Neto, Sardy, T, forthcoming.

5. Extension of BCGD to nonconvex nonsmooth regularization is possible (e.g., $\ell_p$-regularization, $0 < p < 1$) Sardy, T, forthcoming.
