Gauss-Seidel for Constrained Nonsmooth Optimization and Applications
Paul Tseng
Mathematics, University of Washington, Seattle
UBC
April 10, 2007
(Joint work with Sylvain Sardy (EPFL) and Sangwoon Yun (UW))
Abstract
This is a talk given at UBC, April 2007.
Talk Outline
• (Block) Coordinate Minimization
• Application to Basis Pursuit
• Block Coordinate Gradient Descent
– Convergence
– Numerical Tests
• Applications to group Lasso regression and SVM
• Conclusions & Future Work
Coordinate Minimization
min_{x=(x_1,...,x_n)} f(x),   f : ℜ^n → ℜ is convex, cont. diff.
Given x ∈ ℜ^n, choose i ∈ {1, ..., n}. Update x^new = argmin_{u : u_j = x_j ∀ j≠i} f(u).
Repeat until “convergence”.
Gauss-Seidel: Choose i cyclically, 1, 2, ..., n, 1, 2, ...
Gauss-Southwell: Choose i with |∂f/∂x_i (x)| maximum.
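To make the two index rules concrete, here is a minimal numpy sketch (not from the talk) of exact coordinate minimization on a convex quadratic; the function `coordinate_minimization` and the data are made up for illustration.

```python
import numpy as np

def coordinate_minimization(Q, b, x0, rule="gauss-seidel", tol=1e-8, max_iter=10000):
    """Minimize f(x) = 0.5*x'Qx - b'x (Q symmetric positive definite) by exact
    single-coordinate minimization; illustrative sketch only."""
    x = x0.astype(float).copy()
    n = len(x)
    for k in range(max_iter):
        g = Q @ x - b                       # gradient of f at x
        if np.linalg.norm(g, np.inf) < tol:
            break
        if rule == "gauss-seidel":          # cycle i = 1, 2, ..., n, 1, 2, ...
            i = k % n
        else:                               # Gauss-Southwell: largest |df/dx_i|
            i = int(np.argmax(np.abs(g)))
        x[i] -= g[i] / Q[i, i]              # exact minimization over coordinate i
    return x

# The slide's example below: f(x) = (x1 + x2)^2 + (1/4)*(x1 - x2)^2
Q = np.array([[2.5, 1.5], [1.5, 2.5]])      # Hessian of that f
b = np.zeros(2)
print(coordinate_minimization(Q, b, np.array([1.0, -2.0]), rule="gauss-southwell"))
```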
Example: min_{x=(x_1,x_2)} (x_1 + x_2)^2 + (1/4)(x_1 − x_2)^2
• This method extends to update a block of coordinates at each iteration.
• It is simple, and efficient for large “weakly coupled” problems (off-diagonals of ∇2f (x) not too large).
• Every cluster point of the x-sequence is a minimizer. Zadeh ’70
• If f is nonconvex, then Gauss-Seidel can cycle (Powell ’73), but Gauss-Southwell still converges.
• Can get stuck if f is nondifferentiable.
Example: min_{x=(x_1,x_2)} (x_1 + x_2)^2 + (1/4)(x_1 − x_2)^2 + |x_1 + x_2|
But, if the nondifferentiable part is separable, then convergence is possible.
Example: min_{x=(x_1,x_2)} (x_1 + x_2)^2 + (1/4)(x_1 − x_2)^2 + |x_1| + |x_2|
Block Coord. Minimization for Basis Pursuit
min_x F_c(x) := ‖Ax − b‖_2^2 + c‖x‖_1   “Basis Pursuit”
A ∈ ℜ^{m×n}, b ∈ ℜ^m, c ≥ 0.
• Typically m ≥ 1000, n ≥ 8000, and A is dense. ‖·‖_1 is nonsmooth.
• Can reformulate this as a convex QP and solve it using an IP method (Chen, Donoho, Saunders ’99).
Assume the columns of A come from an overcomplete set of basis functions associated with a fast transform (e.g., wavelet packets).
BCM for BP:
Given x, choose I ⊆ {1, ..., n} with |I| = m and {A_i}_{i∈I} orthogonal. Update x^new = argmin_{u : u_i = x_i ∀ i∉I} F_c(u)   ← has closed-form soln.
Repeat until “convergence”.
Gauss-Southwell: Choose I to maximize min_{v ∈ ∂_{x_I} F_c(x)} ‖v‖_2.
• Finds I in O(n + m log m) operations by the algorithm of Coifman & Wickerhauser.
• The x-sequence is bounded & each cluster point is a minimizer (Sardy, Bruce, Tseng ’00).
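As a small illustration of the closed-form block update, here is a numpy sketch (not the S-Plus/BCR code from the paper) of one BCM step when the selected columns of A are orthonormal; the function names `soft` and `bcm_block_update` are invented for this sketch.

```python
import numpy as np

def soft(v, t):
    """Elementwise soft-threshold: argmin_u (u - v)^2 + 2*t*|u|."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def bcm_block_update(A, b, x, I, c):
    """One BCM step for F_c(x) = ||Ax - b||_2^2 + c*||x||_1, assuming the columns
    {A_i : i in I} are orthonormal, so the block subproblem is solved in closed
    form by soft-thresholding.  Illustrative sketch only."""
    out = np.setdiff1d(np.arange(A.shape[1]), I)
    r = b - A[:, out] @ x[out]          # residual with block I's contribution removed
    x_new = x.copy()
    x_new[I] = soft(A[:, I].T @ r, c / 2.0)
    return x_new
```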
Convergence of BCM depends crucially on
• differentiability of ‖·‖_2^2
• separability of ‖·‖_1
• convexity ⇒ global minimum
Application: Electronic surveillance
[Figure: spectrograms (bp.cptable, ew.cptable); time in microseconds vs. frequency in kHz.]
m = 2^11 = 2048, c = 4, local cosine transform, all but 4 levels
Method efficiency:
[Figure: objective vs. CPU time (log scale) for the real and imaginary parts, BCR vs. IP.]
Comparing CPU times of IP and BCM (S-Plus, Sun Ultra 1).
Generalization to ML Estimation
min_x −ℓ(Ax; b) + c Σ_{i∈J} |x_i|
ℓ is the log-likelihood, {A_i}_{i∉J} are lin. indep. “coarse-scale wavelets”, c ≥ 0.
• −ℓ(y; b) = (1/2)‖y − b‖_2^2   (Gaussian noise)
• −ℓ(y; b) = Σ_{i=1}^m (y_i − b_i ln y_i)  (y_i ≥ 0)   (Poisson noise)
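For concreteness, a short numpy sketch (not from the talk) of the two negative log-likelihoods above, evaluated at y = Ax; the function names are illustrative.

```python
import numpy as np

def neg_loglik_gaussian(y, b):
    # -l(y; b) = (1/2) * ||y - b||_2^2        (Gaussian noise)
    return 0.5 * np.sum((y - b) ** 2)

def neg_loglik_poisson(y, b):
    # -l(y; b) = sum_i (y_i - b_i * ln y_i)   (Poisson noise; needs y_i > 0)
    return np.sum(y - b * np.log(y))
```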
Can solve this problem by adapting the IP method. But the IP method is slow (many CG steps per IP iteration).
Adapt BCM method?
General Problem Model
P1:  min_{x=(x_1,...,x_n)} F_c(x) := f(x) + cP(x)
f : ℜ^N → ℜ is cont. diff. (N ≥ n). c ≥ 0.
P : ℜ^N → (−∞, ∞] is proper, convex, lsc, and block-separable, i.e., P(x) = Σ_{i=1}^n P_i(x_i)  (x_i ∈ ℜ^{n_i}).
• P(x) = ‖x‖_1   generalized basis pursuit
• P(x) = 0 if l ≤ x ≤ u, ∞ else   bound-constrained NLP
Block Coord. Gradient Descent Method for P1
Idea: Do BCM on a quadratic approximation of f.
For x ∈ dom P, I ⊆ {1, ..., n}, and H ≻ 0, let d_H(x; I) and q_H(x; I) be the optimal soln and obj. value of the direction subproblem
  min_{d : d_i = 0 ∀ i∉I}  ∇f(x)^T d + (1/2) d^T H d + cP(x + d) − cP(x).
Facts:
• d_H(x; {1, ..., n}) = 0 ⇔ F_c′(x; d) ≥ 0 ∀ d ∈ ℜ^N.   (stationarity)
• H diagonal ⇒ d_H(x; I) = Σ_{i∈I} d_H(x; i),  q_H(x; I) = Σ_{i∈I} q_H(x; i).   (separability)
BCGD for P1:
Given x ∈ dom P, choose I ⊆ {1, ..., n} and H ≻ 0. Let d = d_H(x; I). Update x^new = x + αd (α > 0).
Repeat until “convergence.” (d_H(x; I) has a closed form when H is diagonal and P(·) = ‖·‖_1.)
Gauss-Southwell-d: Choose I with ‖d_D(x; I)‖_∞ ≥ υ ‖d_D(x; {1, ..., n})‖_∞  (0 < υ ≤ 1, D ≻ 0 diagonal, e.g., D = I or D = diag(H)).
Gauss-Southwell-q: Choose I with q_D(x; I) ≤ υ q_D(x; {1, ..., n}).
Inexact Armijo LS: α = largest element of {s, sβ, sβ^2, ...} satisfying F_c(x + αd) − F_c(x) ≤ σα q_H(x; I)  (s > 0, 0 < β < 1, 0 < σ < 1).
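To tie the pieces together, here is a numpy sketch of BCGD for P(x) = ‖x‖_1 with a diagonal H = hI, the Gauss-Southwell-d rule, and the inexact Armijo line search. This is not the talk’s Matlab implementation; the function `bcgd_l1`, the default constants, and the small test problem are assumptions for illustration.

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def bcgd_l1(f, grad_f, x0, c, h, upsilon=0.9, sigma=0.1, beta=0.5, s=1.0,
            max_iter=500, tol=1e-8):
    """Sketch of BCGD for min_x f(x) + c*||x||_1 with diagonal H = h*I, the
    Gauss-Southwell-d index rule, and an inexact Armijo line search."""
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        g = grad_f(x)
        # Closed-form direction d_H(x; {1,...,n}) when H = h*I and P = ||.||_1:
        d_full = soft(x - g / h, c / h) - x
        if np.linalg.norm(d_full, np.inf) < tol:
            break
        # Gauss-Southwell-d: keep coordinates with |d_i| >= upsilon * ||d||_inf.
        I = np.abs(d_full) >= upsilon * np.linalg.norm(d_full, np.inf)
        d = np.where(I, d_full, 0.0)
        # q_H(x; I): optimal value of the direction subproblem (<= 0).
        q = np.sum(g * d + 0.5 * h * d * d + c * (np.abs(x + d) - np.abs(x)))
        Fc = f(x) + c * np.sum(np.abs(x))
        alpha = s
        while f(x + alpha * d) + c * np.sum(np.abs(x + alpha * d)) - Fc > sigma * alpha * q:
            alpha *= beta                    # backtrack (inexact Armijo)
        x = x + alpha * d
    return x

# Small made-up least-squares instance: f(x) = ||Ax - b||_2^2.
A = np.array([[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]])
b = np.array([1.0, -1.0])
f = lambda x: np.sum((A @ x - b) ** 2)
grad_f = lambda x: 2.0 * A.T @ (A @ x - b)
print(bcgd_l1(f, grad_f, np.zeros(3), c=0.5, h=2.0 * np.linalg.norm(A, 2) ** 2))
```

Choosing h at least the Lipschitz constant of ∇f makes the unit step α = 1 acceptable in the line search, which is why the example sets h = 2‖A‖_2^2.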
Convergence Results:
(a) If 0 < λ ≤ λ_i(D), λ_i(H) ≤ λ̄ ∀ i, then every cluster point of the x-sequence generated by the BCGD method is a stationary point of F_c.
(b) If in addition P and f satisfy any of the following assumptions and I is chosen by Gauss-Southwell-q, then the x-sequence converges at an R-linear rate.
C1: f is strongly convex, ∇f is Lipschitz cont. on dom P.
C2: f is (nonconvex) quadratic. P is polyhedral.
C3: f(x) = g(Ex) + q^T x, where E ∈ ℜ^{m×N}, q ∈ ℜ^N, g is strongly convex, ∇g is Lipschitz cont. on ℜ^m. P is polyhedral.
C4: f(x) = max_{y∈Y} {(Ex)^T y − g(y)} + q^T x, where Y ⊆ ℜ^m is polyhedral, E ∈ ℜ^{m×N}, q ∈ ℜ^N, g is strongly convex, ∇g is Lipschitz cont. on ℜ^m. P is polyhedral.
• BCGD has a stronger global convergence property (and cheaper iterations) than BCM.
Numerical Tests:
• Implemented BCGD, with additional acceleration steps, in Matlab.
• Numerical tests on min_x f(x) + c‖x‖_1 with f from the Moré-Garbow-Hillstrom set (least squares) and different c (e.g., c = .1, 1, 10). Initial x = (1, ..., 1).
• Compared with L-BFGS-B (Zhu, Byrd, Nocedal ’97) and MINOS 5.5.1 (Murtagh, Saunders ’05), applied to a reformulation of P1 with P(x) = ‖x‖_1 as
  min_{x^+ ≥ 0, x^− ≥ 0}  f(x^+ − x^−) + c e^T(x^+ + x^−)
  (see the sketch of this splitting after the next bullet).
• BCGD seems more robust than L-BFGS-B and faster than MINOS on average (on an HP DL360 workstation, Red Hat Linux 3.5). However, MINOS is a general NLP solver, and L-BFGS-B is a bound-constrained NLP solver.
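A minimal sketch of the x = x^+ − x^− splitting fed to a bound-constrained solver, here SciPy’s L-BFGS-B. This is not the setup used in the tests above; the data, constants, and function names are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up least-squares instance f(x) = ||Ax - b||_2^2 (illustration only).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
c, n = 1.0, A.shape[1]

def f(x):
    return np.sum((A @ x - b) ** 2)

def grad_f(x):
    return 2.0 * A.T @ (A @ x - b)

# Split x = xp - xm with xp, xm >= 0, so c*||x||_1 becomes c*e'(xp + xm).
def obj(z):
    xp, xm = z[:n], z[n:]
    return f(xp - xm) + c * np.sum(z)

def grad(z):
    g = grad_f(z[:n] - z[n:])
    return np.concatenate([g + c, -g + c])

z0 = np.concatenate([np.ones(n), np.zeros(n)])        # corresponds to x_init = (1,...,1)
res = minimize(obj, z0, jac=grad, method="L-BFGS-B", bounds=[(0.0, None)] * (2 * n))
x_opt = res.x[:n] - res.x[n:]
print(np.count_nonzero(np.abs(x_opt) > 1e-6), f(x_opt) + c * np.sum(np.abs(x_opt)))
```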
f(x)   n     Description
BAL    1000  Brown almost-linear function; nonconvex, dense Hessian.
BT     1000  Broyden tridiagonal function; nonconvex, sparse Hessian.
DBV    1000  Discrete boundary value function; nonconvex, sparse Hessian.
EPS    1000  Extended Powell singular function; convex, 4-block diagonal Hessian.
ER     1000  Extended Rosenbrock function; nonconvex, 2-block diagonal Hessian.
LFR    1000  f(x) = Σ_{i=1}^n ( x_i − (2/(n+1)) Σ_{j=1}^n x_j − 1 )^2 + ( (2/(n+1)) Σ_{j=1}^n x_j + 1 )^2; strongly convex, quadratic, dense Hessian.
VD     1000  f(x) = Σ_{i=1}^n (x_i − 1)^2 + ( Σ_{j=1}^n j(x_j − 1) )^2 + ( Σ_{j=1}^n j(x_j − 1) )^4; strongly convex, dense ill-conditioned Hessian.

Table 1: Least squares problems from Moré, Garbow, Hillstrom, 1981
             MINOS                L-BFGS-B             BCGD-GS-q-acc
f(x)   c     #nz/objec/cpu        #nz/objec/cpu        #nz/objec/cpu
BAL    1     1000/1000/43.9       1000/1000/.02        1000/1000/.1
       10    1000/9999.9/43.9     1000/9999.9/.03      1000/9999.9/.2
       100   1000/99997.5/44.3    1000/99997.5/.1      1000/99997.5/.1
BT     .1    1000/71.725/100.6    1000/84.00/.02       1000/71.74/.9
       1     997/672.41/94.7      981/668.72/.2        1000/626.67/42.4
       10    0/1000/56.0          0/1000/.01           0/1000/.01
DBV    .1    0/0/51.5             999/83.45/.01        0/0/.5
       1     0/0/50.8             0/0/.01              2/0/.3
       10    0/0/52.5             0/0/.00              0/0/.01
EPS    1     1000/351.14/60.3     999/352.52/.05       1000/351.14/.3
       10    243/1250/44.2        250/1250/.01         249/1250/.1
       100   0/1250/51.5          0/1250/.01           0/1250/.01
ER     1     1000/436.25/71.5     1000/436.25/.1       1000/436.25/.1
       10    0/500/50.2           500/1721.1/.00       0/500/.3
       100   0/500/52.4           0/500/.00            0/500/.03
LFR    .1    1000/98.5/77.2       1000/98.5/.00        1000/98.5/.03
       1     1000/751/73.8        0/751/.01            0/751/.01
       10    0/1001/53.3          0/1001/.01           0/1001/.01
VD     1     1000/937.59/43.0     1000/1000.0/.00      1000/937.66/.5
       10    413/6726.80/56.9     974/5.8·10^12/2.3    1000/6726.81/60.3
       100   136/55043/57.4       996/75135/.2         1000/55043/88.1

Table 2: Performance of MINOS, L-BFGS-B, and BCGD, with n = N, x_init = (1, ..., 1)
BCGD was recently applied to Group Lasso for logistic regression (Meier et al. ’06):
  min_{x=(x_1,...,x_n)} −ℓ(x) + c Σ_{i=1}^n ω_i ‖x_i‖_2   (c > 0, ω_i > 0)
ℓ : ℜ^N → ℜ is the log-likelihood for linear logistic regression.
Extension to constraints?
Linearly Constrained Problem
P2:  min_{x∈X} f(x)
f : ℜ^n → ℜ is cont. diff.
X = {x = (x_1, ..., x_n) ∈ ℜ^n | l ≤ x ≤ u, Ax = b},  A ∈ ℜ^{m×n}, b ∈ ℜ^m, l ≤ u.
Special case: Support Vector Machine QP (Vapnik ’82)
  min_{0 ≤ x ≤ Ce, a^T x = 0}  (1/2) x^T Q x − e^T x
C > 0, a ∈ ℜ^n, e^T = (1, ..., 1), Q_ij = a_i a_j K(z_i, z_j), and K : ℜ^p × ℜ^p → ℜ.
K(z_i, z_j) = z_i^T z_j   (linear)
K(z_i, z_j) = exp(−γ‖z_i − z_j‖^2)  (γ > 0)   (Gaussian)
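A brief numpy sketch (not from the talk) of forming Q_ij = a_i a_j K(z_i, z_j) with the Gaussian kernel; the data here is synthetic and the function name is invented.

```python
import numpy as np

def gaussian_kernel_matrix(Z, gamma):
    """K_ij = exp(-gamma * ||z_i - z_j||^2) for rows z_i of Z."""
    sq = np.sum(Z ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T     # squared pairwise distances
    return np.exp(-gamma * np.maximum(D2, 0.0))

# Synthetic two-class data: rows of Z are the points z_i, a_i in {-1, +1} are labels.
rng = np.random.default_rng(1)
Z = rng.standard_normal((200, 22))
a = rng.choice([-1.0, 1.0], size=200)

K = gaussian_kernel_matrix(Z, gamma=0.1)
Q = np.outer(a, a) * K                                  # Q_ij = a_i a_j K(z_i, z_j)
```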
Block Coord. Gradient Descent Method for P2
For x ∈ X, I ⊆ {1, ..., n}, and H ≻ 0, let d_H(x; I) and q_H(x; I) be the optimal soln and obj. value of the direction subproblem
  min_{d : x + d ∈ X, d_i = 0 ∀ i∉I}  ∇f(x)^T d + (1/2) d^T H d.
BCGD for P2:
Given x ∈ X, choose I ⊆ {1, ..., n} and H ≻ 0. Let d = d_H(x; I). Update x^new = x + αd (α > 0).
Repeat until “convergence.”
• d is easily calculated when m = 1 and |I| = 2.
Gauss-Southwell-q: Choose I with q_D(x; I) ≤ υ q_D(x; {1, ..., n})  (0 < υ ≤ 1, D ≻ 0 diagonal, e.g., D = I or D = diag(H)).
For m = 1, such an I with |I| = 2 can be found in O(n) ops by solving a continuous quadratic knapsack problem and finding a “conformal realization” of the solution.
Inexact Armijo LS: α = largest element of {s, sβ, sβ^2, ...} satisfying x + αd ∈ X, f(x + αd) − f(x) ≤ σα q_H(x; I)  (s > 0, 0 < β < 1, 0 < σ < 1).
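To illustrate why the two-coordinate direction is cheap when m = 1, here is a numpy sketch of solving the subproblem for the SVM QP over a working set {i, j}. This is not the talk’s Fortran code; the function `two_coord_direction` and its argument names are assumptions, and it assumes the labels a_k ∈ {−1, +1} and a feasible x.

```python
import numpy as np

def two_coord_direction(g, x, i, j, a, C, H):
    """Direction d_H(x; {i, j}) for the SVM QP:
       min  g'd + 0.5*d'Hd   s.t.  a_i*d_i + a_j*d_j = 0,  0 <= x + d <= C*e,
    assuming labels a_k in {-1, +1} and x feasible.  Illustrative sketch only."""
    s = a[i] * a[j]                               # d_j = -s*d_i keeps a'(x + d) = a'x
    curv = H[i, i] + H[j, j] - 2.0 * s * H[i, j]  # curvature along (e_i - s*e_j)
    t = -(g[i] - s * g[j]) / max(curv, 1e-12)     # unconstrained 1-D minimizer
    # Intersect the box constraints 0 <= x_i + t <= C and 0 <= x_j - s*t <= C.
    lo, hi = -x[i], C - x[i]
    if s > 0:
        lo, hi = max(lo, x[j] - C), min(hi, x[j])
    else:
        lo, hi = max(lo, -x[j]), min(hi, C - x[j])
    t = float(np.clip(t, lo, hi))
    d = np.zeros_like(x)
    d[i], d[j] = t, -s * t
    return d
```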
Convergence Results:
(a) If 0 < λ ≤ λ_i(D), λ_i(H) ≤ λ̄ ∀ i, then every cluster point of the x-sequence generated by the BCGD method is a stationary point of P2.
(b) If in addition f satisfies any of the following assumptions and I is chosen by Gauss-Southwell-q, then the x-sequence converges at an R-linear rate.
C1: f is strongly convex, ∇f is Lipschitz cont. on X.
C2: f is (nonconvex) quadratic.
C3: f(x) = g(Ex) + q^T x, where E ∈ ℜ^{m×n}, q ∈ ℜ^n, g is strongly convex, ∇g is Lipschitz cont. on ℜ^m.
C4: f(x) = max_{y∈Y} {(Ex)^T y − g(y)} + q^T x, where Y ⊆ ℜ^m is polyhedral, E ∈ ℜ^{m×n}, q ∈ ℜ^n, g is strongly convex, ∇g is Lipschitz cont. on ℜ^m.
• For the SVM QP, BCGD has R-linear convergence (with no additional assumption). Similar in spirit to decomposition methods (Joachims ’98, Platt ’99, Lin et al., ...).
Numerical Tests:
• Implemented BCGD in Fortran for the SVM QP (two-class data classification).
• x_init = 0. Cache the most recently used columns of Q.
• On large benchmark problems
  a7a (p = 122, n = 16100), a8a (p = 123, n = 22696), a9a (p = 123, n = 32561), ijcnn1 (p = 22, n = 49990), w7a (p = 300, n = 24692)
  and using a nonlinear kernel, BCGD is comparable in CPU time and solution quality with the C++ SVM code LIBSVM (Lin et al.). Using a linear kernel, BCGD is much slower (it doesn’t yet do variable fixing as in LIBSVM).
Conclusions & Future Work
1. For ML estimation, ℓ1-regularization induces sparsity in the solution and avoids oversmoothing the signals.
2. The resulting estimation problem can be solved effectively by BCM or BCGD, exploiting the problem structure, including the nondifferentiability of the ℓ1-norm.
Which to use? Depends on problem.
3. Applications to denoising, regression, SVM, ...
4. Improve BCGD speed for SVM QP using linear kernel? Efficient implementation for m = 2 constraints?