Gauss-Seidel for Constrained Nonsmooth Optimization and Applications
Paul Tseng
Mathematics, University of Washington, Seattle
UBC
April 10, 2007
(Joint work with Sylvain Sardy (EPFL) and Sangwoon Yun (UW))
Abstract
This is a talk given at UBC, April 2007.
Talk Outline
• (Block) Coordinate Minimization
• Application to Basis Pursuit
• Block Coordinate Gradient Descent
– Convergence
– Numerical Tests
• Applications to group Lasso regression and SVM
• Conclusions & Future Work
Coordinate Minimization
min_{x=(x_1,...,x_n)} f(x),   f : ℜ^n → ℜ is convex, cont. diff.
Given x ∈ ℜ^n, choose i ∈ {1, ..., n}. Update x^new = argmin_{u : u_j = x_j ∀ j≠i} f(u).
Repeat until “convergence”.
Gauss-Seidel: Choose i cyclically, 1, 2, ..., n, 1, 2, ...
Gauss-Southwell: Choose i with |∂f/∂x_i (x)| maximum.
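To make the two index rules concrete, here is a minimal numpy sketch (not from the talk) of exact coordinate minimization on a convex quadratic; the function `coordinate_minimization` and the data are made up for illustration.

```python
import numpy as np

def coordinate_minimization(Q, b, x0, rule="gauss-seidel", tol=1e-8, max_iter=10000):
    """Minimize f(x) = 0.5*x'Qx - b'x (Q symmetric positive definite) by exact
    single-coordinate minimization; illustrative sketch only."""
    x = x0.astype(float).copy()
    n = len(x)
    for k in range(max_iter):
        g = Q @ x - b                       # gradient of f at x
        if np.linalg.norm(g, np.inf) < tol:
            break
        if rule == "gauss-seidel":          # cycle i = 1, 2, ..., n, 1, 2, ...
            i = k % n
        else:                               # Gauss-Southwell: largest |df/dx_i|
            i = int(np.argmax(np.abs(g)))
        x[i] -= g[i] / Q[i, i]              # exact minimization over coordinate i
    return x

# The slide's example below: f(x) = (x1 + x2)^2 + (1/4)*(x1 - x2)^2
Q = np.array([[2.5, 1.5], [1.5, 2.5]])      # Hessian of that f
b = np.zeros(2)
print(coordinate_minimization(Q, b, np.array([1.0, -2.0]), rule="gauss-southwell"))
```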
Example: min_{x=(x_1,x_2)} (x_1 + x_2)^2 + (1/4)(x_1 − x_2)^2
• This method extends to update a block of coordinates at each iteration.
• It is simple, and efficient for large “weakly coupled” problems (off-diagonals of ∇2f (x) not too large).
• Every cluster point of the x-sequence is a minimizer. Zadeh ’70
• If f is nonconvex, then Gauss-Seidel can cycle (Powell ’73), but Gauss-Southwell still converges.
• Can get stuck if f is nondifferentiable.
Example: min_{x=(x_1,x_2)} (x_1 + x_2)^2 + (1/4)(x_1 − x_2)^2 + |x_1 + x_2|
But, if the nondifferentiable part is separable, then convergence is possible.
Example: min_{x=(x_1,x_2)} (x_1 + x_2)^2 + (1/4)(x_1 − x_2)^2 + |x_1| + |x_2|
Block Coord. Minimization for Basis Pursuit
min_x F_c(x) := ‖Ax − b‖_2^2 + c‖x‖_1   “Basis Pursuit”
A ∈ ℜ^{m×n}, b ∈ ℜ^m, c ≥ 0.
• Typically m ≥ 1000, n ≥ 8000, and A is dense. ‖·‖_1 is nonsmooth.
• Can reformulate this as a convex QP and solve it using an IP method (Chen, Donoho, Saunders ’99).
Assume the columns of A come from an overcomplete set of basis functions associated with a fast transform (e.g., wavelet packets).
BCM for BP:
Given x, choose I ⊆ {1, ..., n} with |I| = m and {A_i}_{i∈I} orthogonal. Update x^new = argmin_{u : u_i = x_i ∀ i∉I} F_c(u)   ← has closed-form soln.
Repeat until “convergence”.
Gauss-Southwell: Choose I to maximize min_{v ∈ ∂_{x_I} F_c(x)} ‖v‖_2.
• Finds I in O(n + m log m) operations by the algorithm of Coifman & Wickerhauser.
• The x-sequence is bounded & each cluster point is a minimizer (Sardy, Bruce, Tseng ’00).
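As a small illustration of the closed-form block update, here is a numpy sketch (not the S-Plus/BCR code from the paper) of one BCM step when the selected columns of A are orthonormal; the function names `soft` and `bcm_block_update` are invented for this sketch.

```python
import numpy as np

def soft(v, t):
    """Elementwise soft-threshold: argmin_u (u - v)^2 + 2*t*|u|."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def bcm_block_update(A, b, x, I, c):
    """One BCM step for F_c(x) = ||Ax - b||_2^2 + c*||x||_1, assuming the columns
    {A_i : i in I} are orthonormal, so the block subproblem is solved in closed
    form by soft-thresholding.  Illustrative sketch only."""
    out = np.setdiff1d(np.arange(A.shape[1]), I)
    r = b - A[:, out] @ x[out]          # residual with block I's contribution removed
    x_new = x.copy()
    x_new[I] = soft(A[:, I].T @ r, c / 2.0)
    return x_new
```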
Convergence of BCM depends crucially on
• differentiability of ‖·‖_2^2
• separability of ‖·‖_1
• convexity ⇒ global minimum
Application: Electronic surveillance
[Figure: spectrograms (bp.cptable, ew.cptable); time in microseconds vs. frequency in kHz.]
m = 2^11 = 2048, c = 4, local cosine transform, all but 4 levels
Method efficiency:
[Figure: objective vs. CPU time (log scale) for the real and imaginary parts, BCR vs. IP.]
Comparing CPU times of IP and BCM (S-Plus, Sun Ultra 1).
Generalization to ML Estimation
min_x −ℓ(Ax; b) + c Σ_{i∈J} |x_i|
ℓ is the log-likelihood, {A_i}_{i∉J} are lin. indep. “coarse-scale wavelets”, c ≥ 0.
• −ℓ(y; b) = (1/2)‖y − b‖_2^2   (Gaussian noise)
• −ℓ(y; b) = Σ_{i=1}^m (y_i − b_i ln y_i)  (y_i ≥ 0)   (Poisson noise)
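For concreteness, a short numpy sketch (not from the talk) of the two negative log-likelihoods above, evaluated at y = Ax; the function names are illustrative.

```python
import numpy as np

def neg_loglik_gaussian(y, b):
    # -l(y; b) = (1/2) * ||y - b||_2^2        (Gaussian noise)
    return 0.5 * np.sum((y - b) ** 2)

def neg_loglik_poisson(y, b):
    # -l(y; b) = sum_i (y_i - b_i * ln y_i)   (Poisson noise; needs y_i > 0)
    return np.sum(y - b * np.log(y))
```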
Can solve this problem by adapting the IP method. But the IP method is slow (many CG steps per IP iteration).
Adapt BCM method?
General Problem Model
P1:  min_{x=(x_1,...,x_n)} F_c(x) := f(x) + cP(x)
f : ℜ^N → ℜ is cont. diff. (N ≥ n). c ≥ 0.
P : ℜ^N → (−∞, ∞] is proper, convex, lsc, and block-separable, i.e., P(x) = Σ_{i=1}^n P_i(x_i)  (x_i ∈ ℜ^{n_i}).
• P(x) = ‖x‖_1   generalized basis pursuit
• P(x) = 0 if l ≤ x ≤ u, ∞ else   bound-constrained NLP
Block Coord. Gradient Descent Method for P1
Idea: Do BCM on a quadratic approximation of f.
For x ∈ dom P, I ⊆ {1, ..., n}, and H ≻ 0, let d_H(x; I) and q_H(x; I) be the optimal soln and obj. value of the direction subproblem
  min_{d : d_i = 0 ∀ i∉I}  ∇f(x)^T d + (1/2) d^T H d + cP(x + d) − cP(x).
Facts:
• d_H(x; {1, ..., n}) = 0 ⇔ F_c′(x; d) ≥ 0 ∀ d ∈ ℜ^N.   (stationarity)
• H diagonal ⇒ d_H(x; I) = Σ_{i∈I} d_H(x; i),  q_H(x; I) = Σ_{i∈I} q_H(x; i).   (separability)
BCGD for P1:
Given x ∈ dom P, choose I ⊆ {1, ..., n} and H ≻ 0. Let d = d_H(x; I). Update x^new = x + αd (α > 0).
Repeat until “convergence.” (d_H(x; I) has a closed form when H is diagonal and P(·) = ‖·‖_1.)
Gauss-Southwell-d: Choose I with ‖d_D(x; I)‖_∞ ≥ υ ‖d_D(x; {1, ..., n})‖_∞  (0 < υ ≤ 1, D ≻ 0 diagonal, e.g., D = I or D = diag(H)).
Gauss-Southwell-q: Choose I with q_D(x; I) ≤ υ q_D(x; {1, ..., n}).
Inexact Armijo LS: α = largest element of {s, sβ, sβ^2, ...} satisfying F_c(x + αd) − F_c(x) ≤ σα q_H(x; I)  (s > 0, 0 < β < 1, 0 < σ < 1).
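To tie the pieces together, here is a numpy sketch of BCGD for P(x) = ‖x‖_1 with a diagonal H = hI, the Gauss-Southwell-d rule, and the inexact Armijo line search. This is not the talk’s Matlab implementation; the function `bcgd_l1`, the default constants, and the small test problem are assumptions for illustration.

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def bcgd_l1(f, grad_f, x0, c, h, upsilon=0.9, sigma=0.1, beta=0.5, s=1.0,
            max_iter=500, tol=1e-8):
    """Sketch of BCGD for min_x f(x) + c*||x||_1 with diagonal H = h*I, the
    Gauss-Southwell-d index rule, and an inexact Armijo line search."""
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        g = grad_f(x)
        # Closed-form direction d_H(x; {1,...,n}) when H = h*I and P = ||.||_1:
        d_full = soft(x - g / h, c / h) - x
        if np.linalg.norm(d_full, np.inf) < tol:
            break
        # Gauss-Southwell-d: keep coordinates with |d_i| >= upsilon * ||d||_inf.
        I = np.abs(d_full) >= upsilon * np.linalg.norm(d_full, np.inf)
        d = np.where(I, d_full, 0.0)
        # q_H(x; I): optimal value of the direction subproblem (<= 0).
        q = np.sum(g * d + 0.5 * h * d * d + c * (np.abs(x + d) - np.abs(x)))
        Fc = f(x) + c * np.sum(np.abs(x))
        alpha = s
        while f(x + alpha * d) + c * np.sum(np.abs(x + alpha * d)) - Fc > sigma * alpha * q:
            alpha *= beta                    # backtrack (inexact Armijo)
        x = x + alpha * d
    return x

# Small made-up least-squares instance: f(x) = ||Ax - b||_2^2.
A = np.array([[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]])
b = np.array([1.0, -1.0])
f = lambda x: np.sum((A @ x - b) ** 2)
grad_f = lambda x: 2.0 * A.T @ (A @ x - b)
print(bcgd_l1(f, grad_f, np.zeros(3), c=0.5, h=2.0 * np.linalg.norm(A, 2) ** 2))
```

Choosing h at least the Lipschitz constant of ∇f makes the unit step α = 1 acceptable in the line search, which is why the example sets h = 2‖A‖_2^2.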
Convergence Results:
(a) If 0 < λ ≤ λ_i(D), λ_i(H) ≤ λ̄ ∀ i, then every cluster point of the x-sequence generated by the BCGD method is a stationary point of F_c.
(b) If in addition P and f satisfy any of the following assumptions and I is chosen by Gauss-Southwell-q, then the x-sequence converges at an R-linear rate.
C1: f is strongly convex, ∇f is Lipschitz cont. on dom P.
C2: f is (nonconvex) quadratic. P is polyhedral.
C3: f(x) = g(Ex) + q^T x, where E ∈ ℜ^{m×N}, q ∈ ℜ^N, g is strongly convex, ∇g is Lipschitz cont. on ℜ^m. P is polyhedral.
C4: f(x) = max_{y∈Y} {(Ex)^T y − g(y)} + q^T x, where Y ⊆ ℜ^m is polyhedral, E ∈ ℜ^{m×N}, q ∈ ℜ^N, g is strongly convex, ∇g is Lipschitz cont. on ℜ^m. P is polyhedral.
• BCGD has a stronger global convergence property (and cheaper iterations) than BCM.
Numerical Tests:
• Implemented BCGD, with additional acceleration steps, in Matlab.
• Numerical tests on min_x f(x) + c‖x‖_1 with f from the Moré-Garbow-Hillstrom set (least squares) and different c (e.g., c = .1, 1, 10). Initial x = (1, ..., 1).
• Compared with L-BFGS-B (Zhu, Byrd, Nocedal ’97) and MINOS 5.5.1 (Murtagh, Saunders ’05), applied to a reformulation of P1 with P(x) = ‖x‖_1 as
  min_{x^+ ≥ 0, x^− ≥ 0}  f(x^+ − x^−) + c e^T(x^+ + x^−)
  (see the sketch of this splitting after the next bullet).
• BCGD seems more robust than L-BFGS-B and faster than MINOS on average (on an HP DL360 workstation, Red Hat Linux 3.5). However, MINOS is a general NLP solver, and L-BFGS-B is a bound-constrained NLP solver.
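A minimal sketch of the x = x^+ − x^− splitting fed to a bound-constrained solver, here SciPy’s L-BFGS-B. This is not the setup used in the tests above; the data, constants, and function names are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up least-squares instance f(x) = ||Ax - b||_2^2 (illustration only).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
c, n = 1.0, A.shape[1]

def f(x):
    return np.sum((A @ x - b) ** 2)

def grad_f(x):
    return 2.0 * A.T @ (A @ x - b)

# Split x = xp - xm with xp, xm >= 0, so c*||x||_1 becomes c*e'(xp + xm).
def obj(z):
    xp, xm = z[:n], z[n:]
    return f(xp - xm) + c * np.sum(z)

def grad(z):
    g = grad_f(z[:n] - z[n:])
    return np.concatenate([g + c, -g + c])

z0 = np.concatenate([np.ones(n), np.zeros(n)])        # corresponds to x_init = (1,...,1)
res = minimize(obj, z0, jac=grad, method="L-BFGS-B", bounds=[(0.0, None)] * (2 * n))
x_opt = res.x[:n] - res.x[n:]
print(np.count_nonzero(np.abs(x_opt) > 1e-6), f(x_opt) + c * np.sum(np.abs(x_opt)))
```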
f(x)   n     Description
BAL    1000  Brown almost-linear function; nonconvex, dense Hessian.
BT     1000  Broyden tridiagonal function; nonconvex, sparse Hessian.
DBV    1000  Discrete boundary value function; nonconvex, sparse Hessian.
EPS    1000  Extended Powell singular function; convex, 4-block diagonal Hessian.
ER     1000  Extended Rosenbrock function; nonconvex, 2-block diagonal Hessian.
LFR    1000  f(x) = Σ_{i=1}^n ( x_i − (2/(n+1)) Σ_{j=1}^n x_j − 1 )^2 + ( (2/(n+1)) Σ_{j=1}^n x_j + 1 )^2; strongly convex, quadratic, dense Hessian.
VD     1000  f(x) = Σ_{i=1}^n (x_i − 1)^2 + ( Σ_{j=1}^n j(x_j − 1) )^2 + ( Σ_{j=1}^n j(x_j − 1) )^4; strongly convex, dense ill-conditioned Hessian.

Table 1: Least squares problems from Moré, Garbow, Hillstrom, 1981
             MINOS                L-BFGS-B             BCGD-GS-q-acc
f(x)   c     #nz/objec/cpu        #nz/objec/cpu        #nz/objec/cpu
BAL    1     1000/1000/43.9       1000/1000/.02        1000/1000/.1
       10    1000/9999.9/43.9     1000/9999.9/.03      1000/9999.9/.2
       100   1000/99997.5/44.3    1000/99997.5/.1      1000/99997.5/.1
BT     .1    1000/71.725/100.6    1000/84.00/.02       1000/71.74/.9
       1     997/672.41/94.7      981/668.72/.2        1000/626.67/42.4
       10    0/1000/56.0          0/1000/.01           0/1000/.01
DBV    .1    0/0/51.5             999/83.45/.01        0/0/.5
       1     0/0/50.8             0/0/.01              2/0/.3
       10    0/0/52.5             0/0/.00              0/0/.01
EPS    1     1000/351.14/60.3     999/352.52/.05       1000/351.14/.3
       10    243/1250/44.2        250/1250/.01         249/1250/.1
       100   0/1250/51.5          0/1250/.01           0/1250/.01
ER     1     1000/436.25/71.5     1000/436.25/.1       1000/436.25/.1
       10    0/500/50.2           500/1721.1/.00       0/500/.3
       100   0/500/52.4           0/500/.00            0/500/.03
LFR    .1    1000/98.5/77.2       1000/98.5/.00        1000/98.5/.03
       1     1000/751/73.8        0/751/.01            0/751/.01
       10    0/1001/53.3          0/1001/.01           0/1001/.01
VD     1     1000/937.59/43.0     1000/1000.0/.00      1000/937.66/.5
       10    413/6726.80/56.9     974/5.8·10^12/2.3    1000/6726.81/60.3
       100   136/55043/57.4       996/75135/.2         1000/55043/88.1

Table 2: Performance of MINOS, L-BFGS-B, and BCGD, with n = N, x_init = (1, ..., 1)
BCGD was recently applied to Group Lasso for logistic regression (Meier et al. ’06):
  min_{x=(x_1,...,x_n)} −ℓ(x) + c Σ_{i=1}^n ω_i ‖x_i‖_2   (c > 0, ω_i > 0)
ℓ : ℜ^N → ℜ is the log-likelihood for linear logistic regression.
Extension to constraints?
Linearly Constrained Problem
P2:  min_{x∈X} f(x)
f : ℜ^n → ℜ is cont. diff.
X = {x = (x_1, ..., x_n) ∈ ℜ^n | l ≤ x ≤ u, Ax = b},  A ∈ ℜ^{m×n}, b ∈ ℜ^m, l ≤ u.
Special case: Support Vector Machine QP (Vapnik ’82)
  min_{0 ≤ x ≤ Ce, a^T x = 0}  (1/2) x^T Q x − e^T x
C > 0, a ∈ ℜ^n, e^T = (1, ..., 1), Q_ij = a_i a_j K(z_i, z_j), and K : ℜ^p × ℜ^p → ℜ.
K(z_i, z_j) = z_i^T z_j   (linear)
K(z_i, z_j) = exp(−γ‖z_i − z_j‖^2)  (γ > 0)   (Gaussian)
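A brief numpy sketch (not from the talk) of forming Q_ij = a_i a_j K(z_i, z_j) with the Gaussian kernel; the data here is synthetic and the function name is invented.

```python
import numpy as np

def gaussian_kernel_matrix(Z, gamma):
    """K_ij = exp(-gamma * ||z_i - z_j||^2) for rows z_i of Z."""
    sq = np.sum(Z ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T     # squared pairwise distances
    return np.exp(-gamma * np.maximum(D2, 0.0))

# Synthetic two-class data: rows of Z are the points z_i, a_i in {-1, +1} are labels.
rng = np.random.default_rng(1)
Z = rng.standard_normal((200, 22))
a = rng.choice([-1.0, 1.0], size=200)

K = gaussian_kernel_matrix(Z, gamma=0.1)
Q = np.outer(a, a) * K                                  # Q_ij = a_i a_j K(z_i, z_j)
```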
Block Coord. Gradient Descent Method for P2
For x ∈ X, I ⊆ {1, ..., n}, and H ≻ 0, let d_H(x; I) and q_H(x; I) be the optimal soln and obj. value of the direction subproblem
  min_{d : x + d ∈ X, d_i = 0 ∀ i∉I}  ∇f(x)^T d + (1/2) d^T H d.
BCGD for P2:
Given x ∈ X, choose I ⊆ {1, ..., n} and H ≻ 0. Let d = d_H(x; I). Update x^new = x + αd (α > 0).
Repeat until “convergence.”
• d is easily calculated when m = 1 and |I| = 2.
Gauss-Southwell-q: Choose I with q_D(x; I) ≤ υ q_D(x; {1, ..., n})  (0 < υ ≤ 1, D ≻ 0 diagonal, e.g., D = I or D = diag(H)).
For m = 1, such an I with |I| = 2 can be found in O(n) ops by solving a continuous quadratic knapsack problem and finding a “conformal realization” of the solution.
Inexact Armijo LS: α = largest element of {s, sβ, sβ^2, ...} satisfying x + αd ∈ X, f(x + αd) − f(x) ≤ σα q_H(x; I)  (s > 0, 0 < β < 1, 0 < σ < 1).
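To illustrate why the two-coordinate direction is cheap when m = 1, here is a numpy sketch of solving the subproblem for the SVM QP over a working set {i, j}. This is not the talk’s Fortran code; the function `two_coord_direction` and its argument names are assumptions, and it assumes the labels a_k ∈ {−1, +1} and a feasible x.

```python
import numpy as np

def two_coord_direction(g, x, i, j, a, C, H):
    """Direction d_H(x; {i, j}) for the SVM QP:
       min  g'd + 0.5*d'Hd   s.t.  a_i*d_i + a_j*d_j = 0,  0 <= x + d <= C*e,
    assuming labels a_k in {-1, +1} and x feasible.  Illustrative sketch only."""
    s = a[i] * a[j]                               # d_j = -s*d_i keeps a'(x + d) = a'x
    curv = H[i, i] + H[j, j] - 2.0 * s * H[i, j]  # curvature along (e_i - s*e_j)
    t = -(g[i] - s * g[j]) / max(curv, 1e-12)     # unconstrained 1-D minimizer
    # Intersect the box constraints 0 <= x_i + t <= C and 0 <= x_j - s*t <= C.
    lo, hi = -x[i], C - x[i]
    if s > 0:
        lo, hi = max(lo, x[j] - C), min(hi, x[j])
    else:
        lo, hi = max(lo, -x[j]), min(hi, C - x[j])
    t = float(np.clip(t, lo, hi))
    d = np.zeros_like(x)
    d[i], d[j] = t, -s * t
    return d
```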
Convergence Results:
(a) If 0 < λ ≤ λ_i(D), λ_i(H) ≤ λ̄ ∀ i, then every cluster point of the x-sequence generated by the BCGD method is a stationary point of P2.
(b) If in addition f satisfies any of the following assumptions and I is chosen by Gauss-Southwell-q, then the x-sequence converges at an R-linear rate.
C1: f is strongly convex, ∇f is Lipschitz cont. on X.
C2: f is (nonconvex) quadratic.
C3: f(x) = g(Ex) + q^T x, where E ∈ ℜ^{m×n}, q ∈ ℜ^n, g is strongly convex, ∇g is Lipschitz cont. on ℜ^m.
C4: f(x) = max_{y∈Y} {(Ex)^T y − g(y)} + q^T x, where Y ⊆ ℜ^m is polyhedral, E ∈ ℜ^{m×n}, q ∈ ℜ^n, g is strongly convex, ∇g is Lipschitz cont. on ℜ^m.
• For the SVM QP, BCGD has R-linear convergence (with no additional assumption). Similar in spirit to decomposition methods (Joachims ’98, Platt ’99, Lin et al., ...).
Numerical Tests:
• Implemented BCGD in Fortran for the SVM QP (two-class data classification).
• x_init = 0. Cache the most recently used columns of Q.
• On large benchmark problems
  a7a (p = 122, n = 16100), a8a (p = 123, n = 22696), a9a (p = 123, n = 32561), ijcnn1 (p = 22, n = 49990), w7a (p = 300, n = 24692)
  and using a nonlinear kernel, BCGD is comparable in CPU time and solution quality with the C++ SVM code LIBSVM (Lin et al.). Using a linear kernel, BCGD is much slower (it doesn’t yet do variable fixing as in LIBSVM).
Conclusions & Future Work
1. For ML estimation, ℓ1-regularization induces sparsity in the solution and avoids oversmoothing the signals.
2. The resulting estimation problem can be solved effectively by BCM or BCGD, exploiting the problem structure, including the nondifferentiability of the ℓ1-norm.
Which to use? Depends on problem.
3. Applications to denoising, regression, SVM, ...
4. Improve BCGD speed for SVM QP using linear kernel? Efficient implementation for m = 2 constraints?