Convex Optimization
Paul Tseng
Mathematics, University of Washington, Seattle
Optimization Seminar, University of Washington, October 7, 2008
Talk Outline
• A Convex Optimization Problem
• Proximal Gradient Method
• Accelerated Proximal Gradient Method I
• Accelerated Proximal Gradient Method II
• Example: Matrix Game
• Conclusions & Extensions
A Convex Optimization Problem

$$\min_{x \in E} \; f_P(x) := f(x) + P(x)$$

$E$ is a real linear space with norm $\|\cdot\|$.

$E^*$ is the dual space of continuous linear functionals on $E$, with dual norm $\|x^*\|_* = \sup_{\|x\| \le 1} \langle x^*, x \rangle$.

$P : E \to (-\infty, \infty]$ is proper, convex, lsc (and "simple").

$f : E \to \Re$ is convex and differentiable, with $\|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\|$ $\forall x, y \in \operatorname{dom} P$ ($L \ge 0$).
Constrained case: $P \equiv \delta_X$ with $X \subseteq E$ nonempty, closed, convex, where $\delta_X(x) = 0$ if $x \in X$, $\infty$ else.
Examples:
• $E = \Re^n$, $P(x) = \|x\|_1$, $f(x) = \|Ax - b\|_2^2$ (Basis Pursuit/Lasso)
• $E = \Re^{n_1} \times \cdots \times \Re^{n_N}$, $P(x) = w_1\|x_1\|_2 + \cdots + w_N\|x_N\|_2$ ($w_j > 0$), $f(x) = g(Ax)$ with $g(y) = \sum_{i=1}^m \left[\ln(1 + e^{y_i}) - b_i y_i\right]$ (group Lasso)
• $E = \Re^n$, $P \equiv \delta_X$ with $X = \{x \mid x \ge 0,\ x_1 + \cdots + x_n = 1\}$, $f(x) = g^*(Ax)$ with $g(y) = \sum_{i=1}^m y_i \ln y_i$ if $y \ge 0$, $y_1 + \cdots + y_m = 1$; $\infty$ else (matrix game)
• $E = S^n$, $P \equiv \delta_X$ with $X = \{x \mid |x_{ij}| \le \rho\ \forall i, j\}$, $f(x) = g^*(x + s)$ with $g(y) = -\ln\det y$ if $\alpha I \preceq y \preceq \beta I$; $\infty$ else ($\rho, \alpha, \beta > 0$) (covariance selection)
How to solve this (nonsmooth) convex optimization problem? In applications, $m$ and $n$ are large ($m, n \ge 1000$) and $A$ may be dense.

2nd-order methods (Newton, interior-point)? Few iterations, but each iteration can be too expensive (e.g., $O(n^3)$ ops).

1st-order methods (gradient)? Each iteration is cheap (by using a suitable "prox function"), but often too many iterations. Accelerate convergence by interpolation (Nesterov).
Proximal Gradient Method

Let
$$\ell(x; y) := f(y) + \langle \nabla f(y), x - y \rangle + P(x)$$
$$D(x, y) := h(x) - h(y) - \langle \nabla h(y), x - y \rangle \qquad \text{(Bregman, ...)}$$
with $h : E \to (-\infty, \infty]$ strictly convex, differentiable on $X_h \supseteq \operatorname{int}(\operatorname{dom} P)$, and
$$D(x, y) \ge \tfrac{1}{2}\|x - y\|^2 \quad \forall\, x \in \operatorname{dom} P,\ y \in X_h.$$
For $k = 0, 1, \ldots$,
$$x^{k+1} = \arg\min_x \left\{ \ell(x; x^k) + L\, D(x, x^k) \right\}$$
with $x^0 \in \operatorname{dom} P$. Assume $x^k \in X_h$ $\forall k$.

Special cases: steepest descent, gradient-projection (Goldstein, Levitin, Polyak, ...), mirror-descent (Yudin, Nemirovski), iterative thresholding (Daubechies et al.), ...
For the earlier examples, $x^{k+1}$ has closed form when $h$ is chosen suitably (a code sketch for the first case follows this list):
• $E = \Re^n$, $P(x) = \|x\|_1$, $h(x) = \|x\|_2^2/2$.
• $E = \Re^{n_1} \times \cdots \times \Re^{n_N}$, $P(x) = w_1\|x_1\|_2 + \cdots + w_N\|x_N\|_2$ ($w_j > 0$), $h(x) = \|x\|_2^2/2$.
• $E = \Re^n$, $P \equiv \delta_X$ with $X = \{x \mid x \ge 0,\ x_1 + \cdots + x_n = 1\}$, $h(x) = \sum_{j=1}^n x_j \ln x_j$.
• $E = S^n$, $P \equiv \delta_X$ with $X = \{x \mid |x_{ij}| \le \rho\ \forall i, j\}$, $h(x) = \|x\|_F^2/2$.
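To make the first case concrete, here is a minimal NumPy sketch of the resulting iteration (iterative soft-thresholding). It is an illustration, not the talk's code; the regularization weight `lam` and the constant $L = 2\|A\|_2^2$ are assumptions added for the example.

```python
import numpy as np

def prox_grad_lasso(A, b, lam=1.0, num_iter=500):
    """Proximal gradient method for min_x ||Ax - b||_2^2 + lam*||x||_1,
    with h(x) = ||x||_2^2/2, so D(x, y) = ||x - y||_2^2/2 and the update
    x^{k+1} = argmin { l(x; x^k) + L*D(x, x^k) } is soft-thresholding."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    for _ in range(num_iter):
        g = 2.0 * A.T @ (A @ x - b)            # grad f(x^k)
        y = x - g / L                          # gradient step
        x = np.sign(y) * np.maximum(np.abs(y) - lam / L, 0.0)  # prox of (lam/L)*||.||_1
    return x
```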
Fact 1: $f_P(x) \ge \ell(x; y) \ge f_P(x) - \tfrac{L}{2}\|x - y\|^2$ $\forall x, y \in \operatorname{dom} P$.

Fact 2: For any proper convex lsc $\psi : E \to (-\infty, \infty]$ and $z \in X_h$, let
$$z^+ = \arg\min_x \left\{ \psi(x) + D(x, z) \right\}.$$
If $z^+ \in X_h$, then
$$\psi(z^+) + D(z^+, z) \le \psi(x) + D(x, z) - D(x, z^+) \quad \forall x \in \operatorname{dom} P.$$
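Fact 2 is the three-point property of the Bregman prox step. A short derivation sketch (using a standard subdifferential sum rule, valid here since $h$ is differentiable at $z, z^+ \in X_h$): optimality of $z^+$ gives some $s \in \partial\psi(z^+)$ with $s = \nabla h(z) - \nabla h(z^+)$, so convexity of $\psi$ yields
$$\psi(x) \ge \psi(z^+) + \langle \nabla h(z) - \nabla h(z^+),\, x - z^+ \rangle \quad \forall x,$$
and combining this with the three-point identity
$$D(x, z) - D(x, z^+) - D(z^+, z) = -\langle \nabla h(z) - \nabla h(z^+),\, x - z^+ \rangle$$
rearranges into Fact 2.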
Prop. 1: For any $x \in \operatorname{dom} P$,
$$\min\{e_1, \ldots, e_k\} \le \frac{L\, D(x, x^0)}{k}, \quad k = 1, 2, \ldots$$
with $e_k := f_P(x^k) - f_P(x)$.
Proof:
$$\begin{aligned}
f_P(x^{k+1}) &\le \ell(x^{k+1}; x^k) + \tfrac{L}{2}\|x^{k+1} - x^k\|^2 && \text{Fact 1} \\
&\le \ell(x^{k+1}; x^k) + L\, D(x^{k+1}, x^k) \\
&\le \ell(x; x^k) + L\, D(x, x^k) - L\, D(x, x^{k+1}) && \text{Fact 2} \\
&\le f_P(x) + L\, D(x, x^k) - L\, D(x, x^{k+1}), && \text{Fact 1}
\end{aligned}$$
so
$$\begin{aligned}
0 \le L\, D(x, x^{k+1}) &\le L\, D(x, x^k) - e_{k+1} \\
&\le L\, D(x, x^0) - (e_1 + \cdots + e_{k+1}) \\
&\le L\, D(x, x^0) - (k + 1)\min\{e_1, \ldots, e_{k+1}\}.
\end{aligned}$$
We will improve the global convergence rate by interpolation.

Idea: At iteration $k$, use a stepsize of $O(k/L)$ instead of $1/L$ and backtrack towards $x^k$.
Accelerated Proximal Gradient Method I

For $k = 0, 1, \ldots$,
$$\begin{aligned}
y^k &= (1 - \theta_k)\, x^k + \theta_k z^k \\
z^{k+1} &= \arg\min_x \left\{ \ell(x; y^k) + \theta_k L\, D(x, z^k) \right\} \\
x^{k+1} &= (1 - \theta_k)\, x^k + \theta_k z^{k+1}
\end{aligned}$$
with $\theta_{k+1}$ chosen so that $\dfrac{1 - \theta_{k+1}}{\theta_{k+1}^2} \le \dfrac{1}{\theta_k^2}$ ($0 < \theta_{k+1} \le 1$), and with $\theta_0 = 1$, $x^0, z^0 \in \operatorname{dom} P$ (Nesterov, Auslender, Teboulle, Lan, Lu, Monteiro, ...). Assume $z^k \in X_h$ $\forall k$.

For example, $\theta_k = \dfrac{2}{k+2}$ or $\theta_{k+1} = \dfrac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$.
Prop. 2: For any $x \in \operatorname{dom} P$,
$$\min\{e_1, \ldots, e_k\} \le L\, D(x, z^0)\, \theta_{k-1}^2, \quad k = 1, 2, \ldots$$
with $e_k := f_P(x^k) - f_P(x)$.
Proof:
$$\begin{aligned}
f_P(x^{k+1}) &\le \ell(x^{k+1}; y^k) + \tfrac{L}{2}\|x^{k+1} - y^k\|^2 && \text{Fact 1} \\
&= \ell((1-\theta_k)x^k + \theta_k z^{k+1}; y^k) + \tfrac{L}{2}\|(1-\theta_k)x^k + \theta_k z^{k+1} - y^k\|^2 \\
&\le (1-\theta_k)\,\ell(x^k; y^k) + \theta_k\,\ell(z^{k+1}; y^k) + \tfrac{L}{2}\theta_k^2\|z^{k+1} - z^k\|^2 \\
&\le (1-\theta_k)\,\ell(x^k; y^k) + \theta_k\left(\ell(z^{k+1}; y^k) + \theta_k L\, D(z^{k+1}, z^k)\right) \\
&\le (1-\theta_k)\,\ell(x^k; y^k) + \theta_k\left(\ell(x; y^k) + \theta_k L\, D(x, z^k) - \theta_k L\, D(x, z^{k+1})\right) && \text{Fact 2} \\
&\le (1-\theta_k)\, f_P(x^k) + \theta_k\left(f_P(x) + \theta_k L\, D(x, z^k) - \theta_k L\, D(x, z^{k+1})\right) && \text{Fact 1}
\end{aligned}$$
so, subtracting $f_P(x)$ and then dividing by $\theta_k^2$, we have
$$\frac{1}{\theta_k^2}\, e_{k+1} \le \frac{1 - \theta_k}{\theta_k^2}\, e_k + L\, D(x, z^k) - L\, D(x, z^{k+1}), \quad \text{etc.}$$
Thus, the global convergence rate improves from $O(1/k)$ to $O(1/k^2)$ with little extra work per iteration!
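For the same illustrative Lasso setup as before ($P(x) = \lambda\|x\|_1$, $h(x) = \|x\|_2^2/2$, $\theta_k = 2/(k+2)$), a minimal NumPy sketch of APGM I looks as follows; again `lam` and $L = 2\|A\|_2^2$ are assumptions for the example, not part of the talk.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding: prox of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def apgm1_lasso(A, b, lam=1.0, num_iter=500):
    """Accelerated proximal gradient method I for min_x ||Ax - b||_2^2 + lam*||x||_1,
    with h(x) = ||x||_2^2/2 (so D(x, z) = ||x - z||_2^2/2) and theta_k = 2/(k+2)."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    z = x.copy()
    for k in range(num_iter):
        theta = 2.0 / (k + 2)                      # theta_0 = 1
        y = (1 - theta) * x + theta * z            # y^k
        g = 2.0 * A.T @ (A @ y - b)                # grad f(y^k)
        # z^{k+1} = argmin { l(x; y^k) + theta*L*D(x, z^k) }  ->  soft-thresholding
        z = soft_threshold(z - g / (theta * L), lam / (theta * L))
        x = (1 - theta) * x + theta * z            # x^{k+1}
    return x
```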
Comparing PGM with APGM I: assume $P \equiv \delta_X$.

Can also replace $\ell(x; y^k)$ by a certain weighted sum of $\ell(x; y^0), \ell(x; y^1), \ldots, \ell(x; y^k)$. Then...
Accelerated Proximal Gradient Method II

For $k = 0, 1, \ldots$,
$$\begin{aligned}
y^k &= (1 - \theta_k)\, x^k + \theta_k z^k \\
z^{k+1} &= \arg\min_x \left\{ \sum_{i=0}^{k} \frac{\ell(x; y^i)}{\vartheta_i} + L\, h(x) \right\} \\
x^{k+1} &= (1 - \theta_k)\, x^k + \theta_k z^{k+1}
\end{aligned}$$
with $\theta_{k+1}, \vartheta_{k+1}$ chosen so that $\dfrac{1 - \theta_{k+1}}{\theta_{k+1}\vartheta_{k+1}} = \dfrac{1}{\theta_k \vartheta_k}$ ($\vartheta_{k+1} \ge \theta_{k+1} > 0$), and with $\vartheta_0 \ge \theta_0 = 1$, $x^0 \in \operatorname{dom} P$, and $z^0 = \arg\min_{x \in \operatorname{dom} P} h(x)$ (Nesterov, d'Aspremont et al., Lu, ...). Assume $z^k \in X_h$ $\forall k$.

For example, $\vartheta_k = \dfrac{2}{k+1}$, $\theta_k = \dfrac{2}{k+2}$, or $\vartheta_{k+1} = \theta_{k+1} = \dfrac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$.
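Continuing the same illustrative Lasso setup, a minimal NumPy sketch of APGM II with $\vartheta_k = 2/(k+1)$, $\theta_k = 2/(k+2)$, and $h(x) = \|x\|_2^2/2$ (so $z^0 = 0$) is given below; the weighted sum of linearizations is carried by an accumulated gradient and an accumulated weight on $P$, and `lam` and $L$ are again illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding: prox of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def apgm2_lasso(A, b, lam=1.0, num_iter=500):
    """Accelerated proximal gradient method II for min_x ||Ax - b||_2^2 + lam*||x||_1,
    with h(x) = ||x||_2^2/2, z^0 = argmin h = 0, vartheta_k = 2/(k+1), theta_k = 2/(k+2).
    The z-step minimizes sum_i l(x; y^i)/vartheta_i + L*h(x), again soft-thresholding."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2
    n = A.shape[1]
    x = np.zeros(n)                  # x^0 in dom P
    z = np.zeros(n)                  # z^0 = argmin h
    grad_sum = np.zeros(n)           # sum_i grad f(y^i) / vartheta_i
    weight_sum = 0.0                 # sum_i 1 / vartheta_i (total weight on P)
    for k in range(num_iter):
        theta = 2.0 / (k + 2)
        vartheta = 2.0 / (k + 1)
        y = (1 - theta) * x + theta * z
        grad_sum += 2.0 * A.T @ (A @ y - b) / vartheta
        weight_sum += 1.0 / vartheta
        # z^{k+1} = argmin <grad_sum, x> + weight_sum*lam*||x||_1 + (L/2)*||x||^2
        z = soft_threshold(-grad_sum / L, lam * weight_sum / L)
        x = (1 - theta) * x + theta * z
    return x
```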
Prop. 3: For any $x \in \operatorname{dom} P$,
$$\min\{e_1, \ldots, e_k\} \le L\,(h(x) - h(z^0))\, \theta_{k-1}\vartheta_{k-1}, \quad k = 1, 2, \ldots$$
with $e_k := f_P(x^k) - f_P(x)$.
The proof replaces Fact 2 with:

Fact 3: For any proper convex lsc $\psi : E \to (-\infty, \infty]$, let
$$z = \arg\min_x \left\{ \psi(x) + h(x) \right\}.$$
If $z \in X_h$, then
$$\psi(z) + h(z) \le \psi(x) + h(x) - D(x, z) \quad \forall x \in \operatorname{dom} P.$$
Advantage? Possibly better performance on compressed sensing and certain conic programs (Lu).
Example: Matrix Game

$$\min_{x \in X} \max_{v \in V} \langle v, Ax \rangle$$

with $X$ and $V$ unit simplices in $\Re^n$ and $\Re^m$, and $A \in \Re^{m \times n}$. Generate $A_{ij} \sim U[-1, 1]$ with probability $p$; otherwise $A_{ij} = 0$ (Nesterov, Nemirovski).

Set $P \equiv \delta_X$ and $f(x) = g^*(Ax/\mu)$, with $\mu = \dfrac{\epsilon}{2 \ln m}$ ($\epsilon > 0$) and
$$g(v) = \begin{cases} \sum_{i=1}^m v_i \ln v_i & \text{if } v \in V \\ \infty & \text{else} \end{cases} \qquad \left(L = \tfrac{1}{\mu},\ \|\cdot\| = \text{1-norm}\right)$$
• Implement PGM, APGM I & II in Matlab, with $h(x) = \sum_{j=1}^n x_j \ln x_j$ and $L_{\mathrm{init}} = \frac{1}{8\mu}$. Matrix-vector multiplications by $A$, $A^*$ per iteration.
• Initialize $x^0 = z^0 = (\frac{1}{n}, \ldots, \frac{1}{n})$. Terminate when $\max_i\, (Ax^k)_i - \min_j\, (A^* v^k)_j \le \epsilon$, with $v^k \in V$ a weighted sum of the dual vectors associated with $x^0, x^1, \ldots, x^k$.
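As a rough illustration of this setup (not the talk's Matlab code), here is a NumPy sketch of the plain PGM pass on the smoothed matrix game with the entropy prox $h(x) = \sum_j x_j \ln x_j$. It assumes the standard smoothed objective $\mu\, g^*(Ax/\mu) = \mu \ln \sum_i e^{(Ax)_i/\mu}$, uses $L = 1/\mu$ as on the slides, and, as a simplification, tests the gap with the current dual vector rather than the weighted average $v^k$.

```python
import numpy as np

def matrix_game_pgm(A, eps=1e-3, max_iter=100000):
    """Proximal gradient method with entropy prox for the smoothed matrix game
    min_{x in unit simplex} mu * log(sum_i exp((Ax)_i / mu)), mu = eps/(2 ln m).
    The prox step over the simplex has a closed-form multiplicative update."""
    m, n = A.shape
    mu = eps / (2.0 * np.log(m))            # smoothing parameter from the slides
    L = 1.0 / mu                            # Lipschitz constant used on the slides
    x = np.full(n, 1.0 / n)                 # x^0 = (1/n, ..., 1/n)
    v = np.full(m, 1.0 / m)
    for _ in range(max_iter):
        u = A @ x / mu
        v = np.exp(u - u.max())
        v /= v.sum()                        # v = softmax(Ax/mu), a point in V
        grad = A.T @ v                      # gradient of mu * g*(Ax/mu)
        x = x * np.exp(-grad / L)           # entropy-prox (multiplicative) update
        x /= x.sum()
        if (A @ x).max() - (A.T @ v).min() <= eps:   # simplified gap test
            break
    return x, v

# Illustrative sparse random data as in the experiments (assumed, not the talk's script):
# rng = np.random.default_rng(0)
# M = rng.uniform(-1.0, 1.0, size=(100, 1000))
# A = np.where(rng.random((100, 1000)) < 0.01, M, 0.0)
# x, v = matrix_game_pgm(A)
```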
n/m/p           ε       PGM k/cpu (sec)   APGM I k/cpu (sec)   APGM II k/cpu (sec)
1000/100/.01    .001    1082480/1500      3325/5               10510/9
1000/100/.01    .0001   –                 20635/23             61865/45
10000/100/.01   .001    –                 10005/142            10005/128
10000/100/.1    .001    –                 10005/201            10005/185
10000/1000/.01  .001    –                 10005/202            10005/191
10000/1000/.1   .001    –                 10005/706            10005/695

Table 1: Performance of PGM, APGM I & II for different n, m, sparsity p, and solution accuracy ε.
Conclusions & Extensions
1. Accelerated proximal gradient methods are promising in theory and practice. Applicable to convex-concave optimization by using smoothing (Nesterov). Further extension to add cutting planes.
2. Application to matrix completion, where $E = \Re^{m \times n}$ and $P(x) = \|\sigma(x)\|_1$? Or to total-variation image restoration (joint work with Steve Wright)?
3. Extending the interpolation technique to incremental gradient methods and coordinate-wise gradient methods?
The END