Gauss-Seidel for Constrained Nonsmooth Optimization and Applications

(1)

Gauss-Seidel for Constrained Nonsmooth Optimization and Applications

Paul Tseng

Mathematics, University of Washington, Seattle

UBC

April 10, 2007

(Joint work with Sylvain Sardy (EPFL) and Sangwoon Yun (UW))

Abstract

This is a talk given at UBC, April 2007.

(2)

Talk Outline

• (Block) Coordinate Minimization

• Application to Basis Pursuit

• Block Coordinate Gradient Descent

– Convergence

– Numerical Tests

• Applications to group Lasso regression and SVM

• Conclusions & Future Work

(3)

Coordinate Minimization

min_{x=(x1,...,xn)} f(x),   f : ℝ^n → ℝ is convex, cont. diff.

Given x ∈ ℝ^n, choose i ∈ {1, ..., n}. Update x_new = arg min_{u : u_j = x_j ∀ j ≠ i} f(u).

Repeat until “convergence”.

Gauss-Seidel: Choose i cyclically, 1, 2,..., n, 1, 2,...

Gauss-Southwell: Choose i with |∂f/∂x_i(x)| maximum.
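A minimal sketch (not from the talk) of this scheme for a smooth strongly convex quadratic f(x) = ½xᵀQx − bᵀx, where the single-coordinate subproblem has a closed form; both index rules are shown. The quadratic form of f and all names are illustrative assumptions.

```python
import numpy as np

def coordinate_minimization(Q, b, x0, rule="gauss-seidel", n_iters=1000, tol=1e-8):
    """Illustrative sketch of coordinate minimization for f(x) = 0.5*x'Qx - b'x,
    Q symmetric positive definite.

    rule = "gauss-seidel"   : cycle through i = 1, ..., n
    rule = "gauss-southwell": pick i with the largest |df/dx_i|
    """
    x = x0.astype(float).copy()
    n = len(x)
    for k in range(n_iters):
        grad = Q @ x - b                        # full gradient of f at x
        if np.max(np.abs(grad)) < tol:
            break
        if rule == "gauss-seidel":
            i = k % n                            # cyclic choice of coordinate
        else:
            i = int(np.argmax(np.abs(grad)))     # largest partial derivative
        # exact minimization over coordinate i: solve df/dx_i = 0
        x[i] -= grad[i] / Q[i, i]
    return x
```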

(4)

Example: min_{x=(x1,x2)} (x1 + x2)² + (1/4)(x1 − x2)²

(5)

• This method extends to update a block of coordinates at each iteration.

• It is simple, and efficient for large “weakly coupled” problems (off-diagonals of ∇²f(x) not too large).

• Every cluster point of the x-sequence is a minimizer (Zadeh ’70).

• If f is nonconvex, then Gauss-Seidel can cycle (Powell ’73), but Gauss-Southwell still converges.

• Can get stuck if f is nondifferentiable.

(6)

Example: min_{x=(x1,x2)} (x1 + x2)² + (1/4)(x1 − x2)² + |x1 + x2|

(7)

But, if the nondifferentiable part is separable, then convergence is possible.

Example: min_{x=(x1,x2)} (x1 + x2)² + (1/4)(x1 − x2)² + |x1| + |x2|

(8)

Block Coord. Minimization for Basis Pursuit

min_x Fc(x) := ‖Ax − b‖₂² + c‖x‖₁   “Basis Pursuit”

A ∈ ℝ^{m×n}, b ∈ ℝ^m, c ≥ 0.

• Typically m ≥ 1000, n ≥ 8000, and A is dense. ‖ · ‖₁ is nonsmooth.


• Can reformulate this as a convex QP and solve using an IP method (Chen, Donoho, Saunders ’99).

(9)

Assume the columns of A come from an overcomplete set of basis functions associated with a fast transform (e.g., wavelet packets).

BCM for BP:

Given x, choose I ⊆ {1, ..., n} with |I| = m and {A_i}_{i∈I} orthogonal. Update x_new = arg min_{u : u_i = x_i ∀ i ∉ I} Fc(u)   (has closed-form solution). Repeat until “convergence”.

Gauss-Southwell: Choose I to maximize min_{v ∈ ∂_{x_I} Fc(x)} ‖v‖₂.

• Finds I in O(n + m log m) operations by the algorithm of Coifman & Wickerhauser.

• The x-sequence is bounded & each cluster point is a minimizer (Sardy, Bruce, Tseng ’00).
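To make the closed-form block update concrete, here is a small sketch assuming the selected columns are orthonormal, so the block subproblem separates into scalar soft-thresholding steps. Function and variable names are illustrative, not from the Sardy-Bruce-Tseng code.

```python
import numpy as np

def soft_threshold(z, tau):
    """Componentwise solution of min_t (t - z)^2 + 2*tau*|t|."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def bcm_block_update(A, b, x, I, c):
    """Illustrative BCM step for Fc(x) = ||Ax - b||^2 + c*||x||_1,
    assuming the selected columns {A_i, i in I} are orthonormal."""
    mask = np.zeros(A.shape[1], dtype=bool)
    mask[np.asarray(I)] = True
    r = b - A[:, ~mask] @ x[~mask]           # residual from the fixed coordinates
    z = A[:, mask].T @ r                     # unregularized block minimizer (orthonormal columns)
    x_new = x.copy()
    x_new[mask] = soft_threshold(z, c / 2.0) # closed-form shrinkage for the l1 term
    return x_new
```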

(10)

Convergence of BCM depends crucially on

• differentiability of ‖ · ‖₂²

• separability of ‖ · ‖₁

• convexity ⇒ global minimum

(11)

Application: Electronic surveillance

[Figure: two spectrograms, bp.cptable.spectrogram and ew.cptable.spectrogram; horizontal axis in microseconds, vertical axis in kHz.]

m = 2^11 = 2048, c = 4, local cosine transform, all but 4 levels

(12)

Method efficiency:

[Figure: objective value vs. CPU time (log scale), real part and imaginary part, comparing BCR and IP.]

Comparing CPU times of IP and BCM (S-Plus, Sun Ultra 1).

(13)

Generalization to ML Estimation

min_x −ℓ(Ax; b) + c Σ_{i∈J} |x_i|

ℓ is the log-likelihood, {A_i}_{i∉J} are lin. indep. “coarse-scale wavelets”, c ≥ 0

• −ℓ(y; b) = ½ ‖y − b‖₂²   (Gaussian noise)

• −ℓ(y; b) = Σ_{i=1}^m (y_i − b_i ln y_i)  (y_i ≥ 0)   (Poisson noise)

Can solve this problem by adapting the IP method, but the IP method is slow (many CG steps per IP iteration).

Adapt BCM method?

(14)

General Problem Model

P1

min_{x=(x1,...,xn)} Fc(x) := f(x) + c P(x)

f : ℝ^N → ℝ is cont. diff. (N ≥ n). c ≥ 0.

P : ℝ^N → (−∞, ∞] is proper, convex, lsc, and block-separable, i.e., P(x) = Σ_{i=1}^n P_i(x_i)  (x_i ∈ ℝ^{n_i}).

• P(x) = ‖x‖₁   (generalized basis pursuit)

• P(x) = 0 if l ≤ x ≤ u, ∞ else   (bound-constrained NLP)

(15)

Block Coord. Gradient Descent Method for P1

Idea: Do BCM on a quadratic approximation of f.

For x ∈ dom P, I ⊆ {1, ..., n}, and H ≻ 0_N, let d_H(x; I) and q_H(x; I) be the optimal solution and objective value of

min_{d : d_i = 0 ∀ i ∉ I} { ∇f(x)ᵀd + ½ dᵀHd + cP(x + d) − cP(x) }   (direction subproblem)

Facts:

• d_H(x; {1, ..., n}) = 0  ⇔  F′_c(x; d) ≥ 0 ∀ d ∈ ℝ^N   (stationarity)

• H diagonal  ⇒  d_H(x; I) = Σ_{i∈I} d_H(x; i),  q_H(x; I) = Σ_{i∈I} q_H(x; i)   (separability)

(16)

BCGD for P1:

Given x ∈ dom P, choose I ⊆ {1, ..., n}, H ≻ 0_N. Let d = d_H(x; I).

Update x_new = x + αd  (α > 0).

Repeat until “convergence.”  (d_H(x; I) has a closed form when H is diagonal and P(·) = ‖ · ‖₁.)

Gauss-Southwell-d: Choose I with ‖d_D(x; I)‖ ≥ υ ‖d_D(x; {1, ..., n})‖  (0 < υ ≤ 1, D ≻ 0_N is diagonal, e.g., D = I or D = diag(H)).

Gauss-Southwell-q: Choose I with q_D(x; I) ≤ υ q_D(x; {1, ..., n}).

Inexact Armijo LS: α = largest element of {s, sβ, sβ², ...} satisfying Fc(x + αd) − Fc(x) ≤ σα q_H(x; I)  (s > 0, 0 < β < 1, 0 < σ < 1).
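A hedged sketch of one BCGD iteration for P(·) = ‖·‖₁ with diagonal H, combining the closed-form direction with the inexact Armijo line search above. The callable interface (f, grad_f) and the default parameters are illustrative assumptions.

```python
import numpy as np

def bcgd_step_l1(f, grad_f, x, I, H_diag, c, s=1.0, beta=0.5, sigma=0.1, max_ls=30):
    """Illustrative BCGD step for Fc(x) = f(x) + c*||x||_1 with diagonal Hessian approx H_diag.

    f, grad_f : callables returning f(x) and its gradient (assumed interface)
    I         : index set of coordinates to update
    """
    g = grad_f(x)
    d = np.zeros_like(x)
    for i in I:
        # closed-form direction: soft-threshold coordinate i of the scaled gradient step
        z = x[i] - g[i] / H_diag[i]
        d[i] = np.sign(z) * max(abs(z) - c / H_diag[i], 0.0) - x[i]
    # model decrease q_H(x; I) used in the Armijo acceptance test
    q = sum(g[i] * d[i] + 0.5 * H_diag[i] * d[i] ** 2
            + c * (abs(x[i] + d[i]) - abs(x[i])) for i in I)
    Fc = lambda y: f(y) + c * np.sum(np.abs(y))
    alpha, Fc_x = s, Fc(x)
    for _ in range(max_ls):
        if Fc(x + alpha * d) - Fc_x <= sigma * alpha * q:
            break
        alpha *= beta                      # backtrack: try s, s*beta, s*beta^2, ...
    return x + alpha * d
```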

(17)

Convergence Results:

(a) If 0 < λ ≤ λ_i(D), λ_i(H) ≤ λ̄ ∀ i, then every cluster point of the x-sequence generated by the BCGD method is a stationary point of Fc.

(b) If in addition P and f satisfy any of the following assumptions and I is chosen by Gauss-Southwell-q, then the x-sequence converges at an R-linear rate.

C1 f is strongly convex, ∇f is Lipschitz cont. on dom P.

C2 f is (nonconvex) quadratic. P is polyhedral.

C3 f(x) = g(Ex) + qᵀx, where E ∈ ℝ^{m×N}, q ∈ ℝ^N, g is strongly convex, ∇g is Lipschitz cont. on ℝ^m. P is polyhedral.

C4 f(x) = max_{y∈Y} {(Ex)ᵀy − g(y)} + qᵀx, where Y ⊆ ℝ^m is polyhedral, E ∈ ℝ^{m×N}, q ∈ ℝ^N, g is strongly convex, ∇g is Lipschitz cont. on ℝ^m. P is polyhedral.

• BCGD has stronger global convergence properties (and cheaper iterations) than BCM.

(18)

Numerical Tests:

• Implement BCGD, with additional acceleration steps, in Matlab.

• Numerical tests on min_x f(x) + c‖x‖₁ with f from the Moré-Garbow-Hillstrom set (least squares), and different c (e.g., c = .1, 1, 10). Initial x = (1, . . . , 1).

• Compared with L-BFGS-B (Zhu, Byrd, Nocedal ’97) and MINOS 5.5.1 (Murtagh, Saunders ’05), applied to a reformulation of P1 with P(x) = ‖x‖₁ as

min_{x⁺≥0, x⁻≥0} f(x⁺ − x⁻) + c eᵀ(x⁺ + x⁻).

• BCGD seems more robust than L-BFGS-B and faster than MINOS on average (on an HP DL360 workstation, Red Hat Linux 3.5). However, MINOS is a general NLP solver and L-BFGS-B is a bound-constrained NLP solver.
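For reference, a minimal sketch of the x⁺/x⁻ reformulation fed to the bound-constrained solvers; the wrapper interface below is an assumption, written so it could be used, e.g., with SciPy's L-BFGS-B.

```python
import numpy as np

def split_reformulation(f, grad_f, c, n):
    """Reformulate min f(x) + c*||x||_1 as a smooth problem in y = (x+, x-) >= 0.
    (Illustrative sketch; f and grad_f are assumed callables.)"""
    def obj(y):
        xp, xm = y[:n], y[n:]
        return f(xp - xm) + c * np.sum(xp + xm)      # l1 term becomes linear
    def grad(y):
        xp, xm = y[:n], y[n:]
        g = grad_f(xp - xm)
        return np.concatenate([g + c, -g + c])       # chain rule on x = x+ - x-
    bounds = [(0.0, None)] * (2 * n)                  # x+ >= 0, x- >= 0
    return obj, grad, bounds

# e.g. usable as scipy.optimize.minimize(obj, y0, method="L-BFGS-B", jac=grad, bounds=bounds)
```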

(19)

f(x)   n     Description

BAL    1000  Brown almost-linear func, nonconvex, dense Hessian.
BT     1000  Broyden tridiagonal func, nonconvex, sparse Hessian.
DBV    1000  Discrete boundary value func, nonconvex, sparse Hessian.
EPS    1000  Extended Powell singular func, convex, 4-block diag. Hessian.
ER     1000  Extended Rosenbrock func, nonconvex, 2-block diag. Hessian.
LFR    1000  f(x) = Σ_{i=1}^n ( x_i − (2/(n+1)) Σ_{j=1}^n x_j − 1 )² + ( (2/(n+1)) Σ_{j=1}^n x_j + 1 )², strongly convex, quad., dense Hessian.
VD     1000  f(x) = Σ_{i=1}^n (x_i − 1)² + ( Σ_{j=1}^n j(x_j − 1) )² + ( Σ_{j=1}^n j(x_j − 1) )⁴, strongly convex, dense ill-conditioned Hessian.

Table 1: Least squares problems from Moré, Garbow, Hillstrom, 1981

(20)

f(x)  c    MINOS (#nz/objec/cpu)   L-BFGS-B (#nz/objec/cpu)   BCGD-GS-q-acc (#nz/objec/cpu)
BAL   1    1000/1000/43.9          1000/1000/.02              1000/1000/.1
BAL   10   1000/9999.9/43.9        1000/9999.9/.03            1000/9999.9/.2
BAL   100  1000/99997.5/44.3       1000/99997.5/.1            1000/99997.5/.1
BT    .1   1000/71.725/100.6       1000/84.00/.02             1000/71.74/.9
BT    1    997/672.41/94.7         981/668.72/.2              1000/626.67/42.4
BT    10   0/1000/56.0             0/1000/.01                 0/1000/.01
DBV   .1   0/0/51.5                999/83.45/.01              0/0/.5
DBV   1    0/0/50.8                0/0/.01                    2/0/.3
DBV   10   0/0/52.5                0/0/.00                    0/0/.01
EPS   1    1000/351.14/60.3        999/352.52/.05             1000/351.14/.3
EPS   10   243/1250/44.2           250/1250/.01               249/1250/.1
EPS   100  0/1250/51.5             0/1250/.01                 0/1250/.01
ER    1    1000/436.25/71.5        1000/436.25/.1             1000/436.25/.1
ER    10   0/500/50.2              500/1721.1/.00             0/500/.3
ER    100  0/500/52.4              0/500/.00                  0/500/.03
LFR   .1   1000/98.5/77.2          1000/98.5/.00              1000/98.5/.03
LFR   1    1000/751/73.8           0/751/.01                  0/751/.01
LFR   10   0/1001/53.3             0/1001/.01                 0/1001/.01
VD    1    1000/937.59/43.0        1000/1000.0/.00            1000/937.66/.5
VD    10   413/6726.80/56.9        974/5.8·10^12/2.3          1000/6726.81/60.3
VD    100  136/55043/57.4          996/75135/.2               1000/55043/88.1

Table 2: Performance of MINOS, L-BFGS-B and BCGD (#nz = number of nonzeros in the solution, objec = final objective, cpu = CPU time), with n = N, x_init = (1, . . . , 1)

(21)

BCGD was recently applied to the Group Lasso for logistic regression (Meier et al. ’06):

min_{x=(x1,...,xn)} −ℓ(x) + c Σ_{i=1}^n ω_i ‖x_i‖₂

c > 0, ω_i > 0.

ℓ : ℝ^N → ℝ is the log-likelihood for linear logistic regression.
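A small sketch of the block update underlying BCGD for this group-penalized objective: with a scaled-identity Hessian approximation on block i, the direction subproblem reduces to grouped (block) soft-thresholding. The scaled-identity choice and all names are illustrative assumptions, not the Meier et al. implementation.

```python
import numpy as np

def group_soft_threshold(z, tau):
    """Solve min_u 0.5*||u - z||^2 + tau*||u||_2 (block shrinkage)."""
    nz = np.linalg.norm(z)
    return np.zeros_like(z) if nz <= tau else (1.0 - tau / nz) * z

def bcgd_group_update(x_blocks, grad_blocks, i, h, c, omega):
    """Illustrative update of block i for the objective -l(x) + c*sum_i omega_i*||x_i||_2,
    using a scaled-identity Hessian approximation h*I on the block (assumed choice)."""
    z = x_blocks[i] - grad_blocks[i] / h                 # gradient step on the smooth part
    x_blocks[i] = group_soft_threshold(z, c * omega[i] / h)
    return x_blocks
```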

Extension to constraints?

(22)

Linearly Constrained Problem

P2

min_{x∈X} f(x)

f : ℝ^n → ℝ is cont. diff.

X = {x = (x1, ..., xn) ∈ ℝ^n | l ≤ x ≤ u, Ax = b},  A ∈ ℝ^{m×n}, b ∈ ℝ^m, l ≤ u.

Special case. Support Vector Machine QP (Vapnik ’82)

min_{0 ≤ x ≤ Ce, aᵀx = 0} ½ xᵀQx − eᵀx

C > 0, a ∈ ℝ^n, eᵀ = (1, ..., 1), Q_ij = a_i a_j K(z_i, z_j), and K : ℝ^p × ℝ^p → ℝ.

K(z_i, z_j) = z_iᵀz_j   (linear kernel)

K(z_i, z_j) = exp(−γ‖z_i − z_j‖²)  (γ > 0)   (Gaussian kernel)
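A short sketch of how the QP matrix Q can be assembled from the labels a and data points z_i under the Gaussian kernel above; purely illustrative, with assumed names.

```python
import numpy as np

def svm_qp_matrix(Z, a, gamma):
    """Illustrative construction of Q_ij = a_i * a_j * K(z_i, z_j)
    with the Gaussian kernel K(u, v) = exp(-gamma * ||u - v||^2).
    Z: n x p data matrix, a: labels in {-1, +1}."""
    sq = np.sum(Z ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T   # pairwise squared distances
    K = np.exp(-gamma * dist2)
    return np.outer(a, a) * K
```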

(23)

Block Coord. Gradient Descent Method for P2

For x ∈ X, I ⊆ {1, ..., n}, and H ≻ 0_n, let d_H(x; I) and q_H(x; I) be the optimal solution and objective value of

min_{d : x + d ∈ X, d_i = 0 ∀ i ∉ I} { ∇f(x)ᵀd + ½ dᵀHd }   (direction subproblem)

BCGD for P2:

Given x ∈ X, choose I ⊆ {1, ..., n}, H ≻ 0_n. Let d = d_H(x; I).

Update x_new = x + αd  (α > 0).

Repeat until “convergence.”

• d is easily calculated when m = 1 and |I| = 2.
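A small SMO-style sketch of that two-coordinate calculation: with one equality constraint and diagonal H, d_j is determined by d_i, so the direction subproblem becomes a one-dimensional quadratic over an interval. This is illustrative, not the talk's exact routine (which also covers the index selection described on the next slide).

```python
import numpy as np

def two_coord_direction(g, H_diag, x, l, u, a, i, j):
    """Illustrative direction for the pair I = {i, j} in BCGD for P2 with one equality a'x = b:
    minimize g'd + 0.5*d'Hd  s.t.  a_i*d_i + a_j*d_j = 0,  l <= x + d <= u,  d_k = 0 otherwise.
    H is diagonal (H_diag); assumes a_j != 0 and x feasible."""
    r = a[i] / a[j]                            # equality constraint enforces d_j = -r * d_i
    curv = H_diag[i] + r * r * H_diag[j]
    di = -(g[i] - r * g[j]) / curv             # unconstrained minimizer along the feasible line
    # intersect with the box constraints on coordinates i and j
    lo, hi = l[i] - x[i], u[i] - x[i]
    bj = sorted([(x[j] - l[j]) / r, (x[j] - u[j]) / r])
    lo, hi = max(lo, bj[0]), min(hi, bj[1])
    di = min(max(di, lo), hi)                  # clip to the feasible interval
    d = np.zeros_like(x)
    d[i], d[j] = di, -r * di
    return d
```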

(24)

Gauss-Southwell-q: Choose I with q_D(x; I) ≤ υ q_D(x; {1, ..., n})  (0 < υ ≤ 1, D ≻ 0_n is diagonal, e.g., D = I or D = diag(H)).

For m = 1, such I with |I| = 2 can be found in O(n) ops by solving a continuous quadratic knapsack problem and finding a “conformal realization” of the solution.

Inexact Armijo LS: α = largest element of {s, sβ, sβ², ...} satisfying x + αd ∈ X, f(x + αd) − f(x) ≤ σα q_H(x; I)  (s > 0, 0 < β < 1, 0 < σ < 1).

(25)

Convergence Results:

(a) If 0 < λ ≤ λ_i(D), λ_i(H) ≤ λ̄ ∀ i, then every cluster point of the x-sequence generated by the BCGD method is a stationary point of P2.

(b) If in addition f satisfies any of the following assumptions and I is chosen by Gauss-Southwell-q, then the x-sequence converges at an R-linear rate.

C1 f is strongly convex, ∇f is Lipschitz cont. on X.

C2 f is (nonconvex) quadratic.

C3 f(x) = g(Ex) + qᵀx, where E ∈ ℝ^{m×n}, q ∈ ℝ^n, g is strongly convex, ∇g is Lipschitz cont. on ℝ^m.

C4 f(x) = max_{y∈Y} {(Ex)ᵀy − g(y)} + qᵀx, where Y ⊆ ℝ^m is polyhedral, E ∈ ℝ^{m×n}, q ∈ ℝ^n, g is strongly convex, ∇g is Lipschitz cont. on ℝ^m.

• For the SVM QP, BCGD has R-linear convergence (with no additional assumption). Related to decomposition methods (Joachims ’98, Platt ’99, Lin et al., ...).

(26)

Numerical Tests:

• Implement BCGD in Fortran for SVM QP (two-class data classification).

• x_init = 0. Cache the most recently used columns of Q.

• On large benchmark problems a7a (p = 122, n = 16100), a8a (p = 123, n = 22696), a9a (p = 123, n = 32561), ijcnn1 (p = 22, n = 49990), w7a (p = 300, n = 24692), and using a nonlinear kernel, BCGD is comparable in CPU time and solution quality with the C++ SVM code LIBSVM (Lin et al.). Using a linear kernel, BCGD is much slower (it doesn’t yet do variable fixing as in LIBSVM).

(27)

Conclusions & Future Work

1. For ML estimation, ℓ1-regularization induces sparsity in the solution and avoids oversmoothing the signals.

2. The resulting estimation problem can be solved effectively by BCM or BCGD, exploiting the problem structure, including the nondifferentiability of the ℓ1-norm.

Which to use? Depends on problem.

3. Applications to denoising, regression, SVM.

4. Improve BCGD speed for the SVM QP with a linear kernel? Efficient implementation for m = 2 constraints?
