Convex Optimization
Paul Tseng
Mathematics, University of Washington, Seattle
Optimization Seminar, University of Washington, October 7, 2008
Talk Outline
• A Convex Optimization Problem
• Proximal Gradient Method
• Accelerated Proximal Gradient Method I
• Accelerated Proximal Gradient Method II
• Example: Matrix Game
• Conclusions & Extensions
A Convex Optimization Problem

$$\min_{x \in E} \; f_P(x) := f(x) + P(x)$$

$E$ is a real linear space with norm $\|\cdot\|$.

$E^*$ is the dual space of continuous linear functionals on $E$, with dual norm $\|x^*\|_* = \sup_{\|x\| \le 1} \langle x^*, x \rangle$.

$P : E \to (-\infty, \infty]$ is proper, convex, lsc (and "simple").

$f : E \to \Re$ is convex and differentiable, with $\|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\|$ $\forall x, y \in \operatorname{dom} P$ ($L \ge 0$).
Constrained case: $P \equiv \delta_X$ with $X \subseteq E$ nonempty, closed, convex, where $\delta_X(x) = 0$ if $x \in X$, $\infty$ else.
Examples:
• $E = \Re^n$, $P(x) = \|x\|_1$, $f(x) = \|Ax - b\|_2^2$ (Basis Pursuit/Lasso)
• $E = \Re^{n_1} \times \cdots \times \Re^{n_N}$, $P(x) = w_1\|x_1\|_2 + \cdots + w_N\|x_N\|_2$ ($w_j > 0$), $f(x) = g(Ax)$ with $g(y) = \sum_{i=1}^m \left[\ln(1 + e^{y_i}) - b_i y_i\right]$ (group Lasso)
• $E = \Re^n$, $P \equiv \delta_X$ with $X = \{x \mid x \ge 0,\ x_1 + \cdots + x_n = 1\}$, $f(x) = g^*(Ax)$ with $g(y) = \sum_{i=1}^m y_i \ln y_i$ if $y \ge 0$, $y_1 + \cdots + y_m = 1$; $\infty$ else (matrix game)
• $E = S^n$, $P \equiv \delta_X$ with $X = \{x \mid |x_{ij}| \le \rho\ \forall i, j\}$, $f(x) = g^*(x + s)$ with $g(y) = -\ln\det y$ if $\alpha I \preceq y \preceq \beta I$; $\infty$ else ($\rho, \alpha, \beta > 0$) (covariance selection)
How to solve this (nonsmooth) convex optimization problem? In applications, $m$ and $n$ are large ($m, n \ge 1000$) and $A$ may be dense.

2nd-order methods (Newton, interior-point)? Few iterations, but each iteration can be too expensive (e.g., $O(n^3)$ ops).

1st-order methods (gradient)? Each iteration is cheap (by using a suitable "prox function"), but often too many iterations. Accelerate convergence by interpolation (Nesterov).
Proximal Gradient Method

Let
$$\ell(x; y) := f(y) + \langle \nabla f(y), x - y \rangle + P(x)$$
$$D(x, y) := h(x) - h(y) - \langle \nabla h(y), x - y \rangle \qquad \text{(Bregman, ...)}$$
with $h : E \to (-\infty, \infty]$ strictly convex, differentiable on $X_h \supseteq \operatorname{int}(\operatorname{dom} P)$, and
$$D(x, y) \ge \tfrac{1}{2}\|x - y\|^2 \quad \forall\, x \in \operatorname{dom} P,\ y \in X_h.$$
For $k = 0, 1, \ldots$,
$$x^{k+1} = \arg\min_x \left\{ \ell(x; x^k) + L\, D(x, x^k) \right\}$$
with $x^0 \in \operatorname{dom} P$. Assume $x^k \in X_h$ $\forall k$.

Special cases: steepest descent, gradient-projection (Goldstein, Levitin, Polyak, ...), mirror-descent (Yudin, Nemirovski), iterative thresholding (Daubechies et al.), ...
For the earlier examples, $x^{k+1}$ has closed form when $h$ is chosen suitably (a code sketch for the first case follows this list):
• $E = \Re^n$, $P(x) = \|x\|_1$, $h(x) = \|x\|_2^2/2$.
• $E = \Re^{n_1} \times \cdots \times \Re^{n_N}$, $P(x) = w_1\|x_1\|_2 + \cdots + w_N\|x_N\|_2$ ($w_j > 0$), $h(x) = \|x\|_2^2/2$.
• $E = \Re^n$, $P \equiv \delta_X$ with $X = \{x \mid x \ge 0,\ x_1 + \cdots + x_n = 1\}$, $h(x) = \sum_{j=1}^n x_j \ln x_j$.
• $E = S^n$, $P \equiv \delta_X$ with $X = \{x \mid |x_{ij}| \le \rho\ \forall i, j\}$, $h(x) = \|x\|_F^2/2$.
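To make the first case concrete, here is a minimal NumPy sketch of the resulting iteration (iterative soft-thresholding). It is an illustration, not the talk's code; the regularization weight `lam` and the constant $L = 2\|A\|_2^2$ are assumptions added for the example.

```python
import numpy as np

def prox_grad_lasso(A, b, lam=1.0, num_iter=500):
    """Proximal gradient method for min_x ||Ax - b||_2^2 + lam*||x||_1,
    with h(x) = ||x||_2^2/2, so D(x, y) = ||x - y||_2^2/2 and the update
    x^{k+1} = argmin { l(x; x^k) + L*D(x, x^k) } is soft-thresholding."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    for _ in range(num_iter):
        g = 2.0 * A.T @ (A @ x - b)            # grad f(x^k)
        y = x - g / L                          # gradient step
        x = np.sign(y) * np.maximum(np.abs(y) - lam / L, 0.0)  # prox of (lam/L)*||.||_1
    return x
```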
Fact 1: $f_P(x) \ge \ell(x; y) \ge f_P(x) - \tfrac{L}{2}\|x - y\|^2$ $\forall x, y \in \operatorname{dom} P$.

Fact 2: For any proper convex lsc $\psi : E \to (-\infty, \infty]$ and $z \in X_h$, let
$$z^+ = \arg\min_x \left\{ \psi(x) + D(x, z) \right\}.$$
If $z^+ \in X_h$, then
$$\psi(z^+) + D(z^+, z) \le \psi(x) + D(x, z) - D(x, z^+) \quad \forall x \in \operatorname{dom} P.$$
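Fact 2 is the three-point property of the Bregman prox step. A short derivation sketch (using a standard subdifferential sum rule, valid here since $h$ is differentiable at $z, z^+ \in X_h$): optimality of $z^+$ gives some $s \in \partial\psi(z^+)$ with $s = \nabla h(z) - \nabla h(z^+)$, so convexity of $\psi$ yields
$$\psi(x) \ge \psi(z^+) + \langle \nabla h(z) - \nabla h(z^+),\, x - z^+ \rangle \quad \forall x,$$
and combining this with the three-point identity
$$D(x, z) - D(x, z^+) - D(z^+, z) = -\langle \nabla h(z) - \nabla h(z^+),\, x - z^+ \rangle$$
rearranges into Fact 2.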
Prop. 1: For any $x \in \operatorname{dom} P$,
$$\min\{e_1, \ldots, e_k\} \le \frac{L\, D(x, x^0)}{k}, \quad k = 1, 2, \ldots$$
with $e_k := f_P(x^k) - f_P(x)$.
Proof:
$$\begin{aligned}
f_P(x^{k+1}) &\le \ell(x^{k+1}; x^k) + \tfrac{L}{2}\|x^{k+1} - x^k\|^2 && \text{Fact 1} \\
&\le \ell(x^{k+1}; x^k) + L\, D(x^{k+1}, x^k) \\
&\le \ell(x; x^k) + L\, D(x, x^k) - L\, D(x, x^{k+1}) && \text{Fact 2} \\
&\le f_P(x) + L\, D(x, x^k) - L\, D(x, x^{k+1}), && \text{Fact 1}
\end{aligned}$$
so
$$\begin{aligned}
0 \le L\, D(x, x^{k+1}) &\le L\, D(x, x^k) - e_{k+1} \\
&\le L\, D(x, x^0) - (e_1 + \cdots + e_{k+1}) \\
&\le L\, D(x, x^0) - (k + 1)\min\{e_1, \ldots, e_{k+1}\}.
\end{aligned}$$
We will improve the global convergence rate by interpolation.

Idea: At iteration $k$, use a stepsize of $O(k/L)$ instead of $1/L$ and backtrack towards $x^k$.
Accelerated Proximal Gradient Method I

For $k = 0, 1, \ldots$,
$$\begin{aligned}
y^k &= (1 - \theta_k)\, x^k + \theta_k z^k \\
z^{k+1} &= \arg\min_x \left\{ \ell(x; y^k) + \theta_k L\, D(x, z^k) \right\} \\
x^{k+1} &= (1 - \theta_k)\, x^k + \theta_k z^{k+1}
\end{aligned}$$
with $\theta_{k+1}$ chosen so that $\dfrac{1 - \theta_{k+1}}{\theta_{k+1}^2} \le \dfrac{1}{\theta_k^2}$ ($0 < \theta_{k+1} \le 1$), and with $\theta_0 = 1$, $x^0, z^0 \in \operatorname{dom} P$ (Nesterov, Auslender, Teboulle, Lan, Lu, Monteiro, ...). Assume $z^k \in X_h$ $\forall k$.

For example, $\theta_k = \dfrac{2}{k+2}$ or $\theta_{k+1} = \dfrac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$.
Prop. 2: For any $x \in \operatorname{dom} P$,
$$\min\{e_1, \ldots, e_k\} \le L\, D(x, z^0)\, \theta_{k-1}^2, \quad k = 1, 2, \ldots$$
with $e_k := f_P(x^k) - f_P(x)$.
Proof:
$$\begin{aligned}
f_P(x^{k+1}) &\le \ell(x^{k+1}; y^k) + \tfrac{L}{2}\|x^{k+1} - y^k\|^2 && \text{Fact 1} \\
&= \ell((1-\theta_k)x^k + \theta_k z^{k+1}; y^k) + \tfrac{L}{2}\|(1-\theta_k)x^k + \theta_k z^{k+1} - y^k\|^2 \\
&\le (1-\theta_k)\,\ell(x^k; y^k) + \theta_k\,\ell(z^{k+1}; y^k) + \tfrac{L}{2}\theta_k^2\|z^{k+1} - z^k\|^2 \\
&\le (1-\theta_k)\,\ell(x^k; y^k) + \theta_k\left(\ell(z^{k+1}; y^k) + \theta_k L\, D(z^{k+1}, z^k)\right) \\
&\le (1-\theta_k)\,\ell(x^k; y^k) + \theta_k\left(\ell(x; y^k) + \theta_k L\, D(x, z^k) - \theta_k L\, D(x, z^{k+1})\right) && \text{Fact 2} \\
&\le (1-\theta_k)\, f_P(x^k) + \theta_k\left(f_P(x) + \theta_k L\, D(x, z^k) - \theta_k L\, D(x, z^{k+1})\right) && \text{Fact 1}
\end{aligned}$$
so, subtracting $f_P(x)$ and then dividing by $\theta_k^2$, we have
$$\frac{1}{\theta_k^2}\, e_{k+1} \le \frac{1 - \theta_k}{\theta_k^2}\, e_k + L\, D(x, z^k) - L\, D(x, z^{k+1}), \quad \text{etc.}$$
Thus, the global convergence rate improves from $O(1/k)$ to $O(1/k^2)$ with little extra work per iteration!
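For the same illustrative Lasso setup as before ($P(x) = \lambda\|x\|_1$, $h(x) = \|x\|_2^2/2$, $\theta_k = 2/(k+2)$), a minimal NumPy sketch of APGM I looks as follows; again `lam` and $L = 2\|A\|_2^2$ are assumptions for the example, not part of the talk.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding: prox of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def apgm1_lasso(A, b, lam=1.0, num_iter=500):
    """Accelerated proximal gradient method I for min_x ||Ax - b||_2^2 + lam*||x||_1,
    with h(x) = ||x||_2^2/2 (so D(x, z) = ||x - z||_2^2/2) and theta_k = 2/(k+2)."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    z = x.copy()
    for k in range(num_iter):
        theta = 2.0 / (k + 2)                      # theta_0 = 1
        y = (1 - theta) * x + theta * z            # y^k
        g = 2.0 * A.T @ (A @ y - b)                # grad f(y^k)
        # z^{k+1} = argmin { l(x; y^k) + theta*L*D(x, z^k) }  ->  soft-thresholding
        z = soft_threshold(z - g / (theta * L), lam / (theta * L))
        x = (1 - theta) * x + theta * z            # x^{k+1}
    return x
```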
Comparing PGM with APGM I: assume $P \equiv \delta_X$.

Can also replace $\ell(x; y^k)$ by a certain weighted sum of $\ell(x; y^0), \ell(x; y^1), \ldots, \ell(x; y^k)$. Then...
Accelerated Proximal Gradient Method II

For $k = 0, 1, \ldots$,
$$\begin{aligned}
y^k &= (1 - \theta_k)\, x^k + \theta_k z^k \\
z^{k+1} &= \arg\min_x \left\{ \sum_{i=0}^{k} \frac{\ell(x; y^i)}{\vartheta_i} + L\, h(x) \right\} \\
x^{k+1} &= (1 - \theta_k)\, x^k + \theta_k z^{k+1}
\end{aligned}$$
with $\theta_{k+1}, \vartheta_{k+1}$ chosen so that $\dfrac{1 - \theta_{k+1}}{\theta_{k+1}\vartheta_{k+1}} = \dfrac{1}{\theta_k \vartheta_k}$ ($\vartheta_{k+1} \ge \theta_{k+1} > 0$), and with $\vartheta_0 \ge \theta_0 = 1$, $x^0 \in \operatorname{dom} P$, and $z^0 = \arg\min_{x \in \operatorname{dom} P} h(x)$ (Nesterov, d'Aspremont et al., Lu, ...). Assume $z^k \in X_h$ $\forall k$.

For example, $\vartheta_k = \dfrac{2}{k+1}$, $\theta_k = \dfrac{2}{k+2}$, or $\vartheta_{k+1} = \theta_{k+1} = \dfrac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$.
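Continuing the same illustrative Lasso setup, a minimal NumPy sketch of APGM II with $\vartheta_k = 2/(k+1)$, $\theta_k = 2/(k+2)$, and $h(x) = \|x\|_2^2/2$ (so $z^0 = 0$) is given below; the weighted sum of linearizations is carried by an accumulated gradient and an accumulated weight on $P$, and `lam` and $L$ are again illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding: prox of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def apgm2_lasso(A, b, lam=1.0, num_iter=500):
    """Accelerated proximal gradient method II for min_x ||Ax - b||_2^2 + lam*||x||_1,
    with h(x) = ||x||_2^2/2, z^0 = argmin h = 0, vartheta_k = 2/(k+1), theta_k = 2/(k+2).
    The z-step minimizes sum_i l(x; y^i)/vartheta_i + L*h(x), again soft-thresholding."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2
    n = A.shape[1]
    x = np.zeros(n)                  # x^0 in dom P
    z = np.zeros(n)                  # z^0 = argmin h
    grad_sum = np.zeros(n)           # sum_i grad f(y^i) / vartheta_i
    weight_sum = 0.0                 # sum_i 1 / vartheta_i (total weight on P)
    for k in range(num_iter):
        theta = 2.0 / (k + 2)
        vartheta = 2.0 / (k + 1)
        y = (1 - theta) * x + theta * z
        grad_sum += 2.0 * A.T @ (A @ y - b) / vartheta
        weight_sum += 1.0 / vartheta
        # z^{k+1} = argmin <grad_sum, x> + weight_sum*lam*||x||_1 + (L/2)*||x||^2
        z = soft_threshold(-grad_sum / L, lam * weight_sum / L)
        x = (1 - theta) * x + theta * z
    return x
```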
Prop. 3: For any $x \in \operatorname{dom} P$,
$$\min\{e_1, \ldots, e_k\} \le L\,(h(x) - h(z^0))\, \theta_{k-1}\vartheta_{k-1}, \quad k = 1, 2, \ldots$$
with $e_k := f_P(x^k) - f_P(x)$.
The proof replaces Fact 2 with:

Fact 3: For any proper convex lsc $\psi : E \to (-\infty, \infty]$, let
$$z = \arg\min_x \left\{ \psi(x) + h(x) \right\}.$$
If $z \in X_h$, then
$$\psi(z) + h(z) \le \psi(x) + h(x) - D(x, z) \quad \forall x \in \operatorname{dom} P.$$
Advantage? Possibly better performance on compressed sensing and certain conic programs (Lu).
Example: Matrix Game

$$\min_{x \in X} \max_{v \in V} \langle v, Ax \rangle$$

with $X$ and $V$ unit simplices in $\Re^n$ and $\Re^m$, and $A \in \Re^{m \times n}$. Generate $A_{ij} \sim U[-1, 1]$ with probability $p$; otherwise $A_{ij} = 0$ (Nesterov, Nemirovski).

Set $P \equiv \delta_X$ and $f(x) = g^*(Ax/\mu)$, with $\mu = \dfrac{\epsilon}{2 \ln m}$ ($\epsilon > 0$) and
$$g(v) = \begin{cases} \sum_{i=1}^m v_i \ln v_i & \text{if } v \in V \\ \infty & \text{else} \end{cases} \qquad \left(L = \tfrac{1}{\mu},\ \|\cdot\| = \text{1-norm}\right)$$
• Implement PGM, APGM I & II in Matlab, with $h(x) = \sum_{j=1}^n x_j \ln x_j$ and $L_{\mathrm{init}} = \frac{1}{8\mu}$. Matrix-vector multiplications by $A$, $A^*$ per iteration.
• Initialize $x^0 = z^0 = (\frac{1}{n}, \ldots, \frac{1}{n})$. Terminate when $\max_i\, (Ax^k)_i - \min_j\, (A^* v^k)_j \le \epsilon$, with $v^k \in V$ a weighted sum of the dual vectors associated with $x^0, x^1, \ldots, x^k$.
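As a rough illustration of this setup (not the talk's Matlab code), here is a NumPy sketch of the plain PGM pass on the smoothed matrix game with the entropy prox $h(x) = \sum_j x_j \ln x_j$. It assumes the standard smoothed objective $\mu\, g^*(Ax/\mu) = \mu \ln \sum_i e^{(Ax)_i/\mu}$, uses $L = 1/\mu$ as on the slides, and, as a simplification, tests the gap with the current dual vector rather than the weighted average $v^k$.

```python
import numpy as np

def matrix_game_pgm(A, eps=1e-3, max_iter=100000):
    """Proximal gradient method with entropy prox for the smoothed matrix game
    min_{x in unit simplex} mu * log(sum_i exp((Ax)_i / mu)), mu = eps/(2 ln m).
    The prox step over the simplex has a closed-form multiplicative update."""
    m, n = A.shape
    mu = eps / (2.0 * np.log(m))            # smoothing parameter from the slides
    L = 1.0 / mu                            # Lipschitz constant used on the slides
    x = np.full(n, 1.0 / n)                 # x^0 = (1/n, ..., 1/n)
    v = np.full(m, 1.0 / m)
    for _ in range(max_iter):
        u = A @ x / mu
        v = np.exp(u - u.max())
        v /= v.sum()                        # v = softmax(Ax/mu), a point in V
        grad = A.T @ v                      # gradient of mu * g*(Ax/mu)
        x = x * np.exp(-grad / L)           # entropy-prox (multiplicative) update
        x /= x.sum()
        if (A @ x).max() - (A.T @ v).min() <= eps:   # simplified gap test
            break
    return x, v

# Illustrative sparse random data as in the experiments (assumed, not the talk's script):
# rng = np.random.default_rng(0)
# M = rng.uniform(-1.0, 1.0, size=(100, 1000))
# A = np.where(rng.random((100, 1000)) < 0.01, M, 0.0)
# x, v = matrix_game_pgm(A)
```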
n/m/p           ε       PGM k/cpu (sec)   APGM I k/cpu (sec)   APGM II k/cpu (sec)
1000/100/.01    .001    1082480/1500      3325/5               10510/9
1000/100/.01    .0001   –                 20635/23             61865/45
10000/100/.01   .001    –                 10005/142            10005/128
10000/100/.1    .001    –                 10005/201            10005/185
10000/1000/.01  .001    –                 10005/202            10005/191
10000/1000/.1   .001    –                 10005/706            10005/695

Table 1: Performance of PGM, APGM I & II for different n, m, sparsity p, and solution accuracy ε.
Conclusions & Extensions
1. Accelerated proximal gradient methods are promising in theory and practice. Applicable to convex-concave optimization by using smoothing (Nesterov). Further extension to add cutting planes.
2. Application to matrix completion, where $E = \Re^{m \times n}$ and $P(x) = \|\sigma(x)\|_1$? Or to total-variation image restoration (joint work with Steve Wright)?
3. Extending the interpolation technique to incremental gradient methods and coordinate-wise gradient methods?
The END