# Accelerated Proximal Gradient Methods for Convex Optimization

Paul Tseng

Mathematics, University of Washington, Seattle

Optimization Seminar, Univ. Washington, October 7, 2008

### Talk Outline

• A Convex Optimization Problem

• Accelerated Proximal Gradient Method I

• Accelerated Proximal Gradient Method II

• Example: Matrix Game

• Conclusions & Extensions

### A Convex Optimization Problem

$$\min_{x \in E} \; f_P(x) := f(x) + P(x)$$

$E$ is a real linear space with norm $\|\cdot\|$.

$E^*$ is the dual space of continuous linear functionals on $E$, with dual norm $\|x^*\|_* = \sup_{\|x\| \le 1} \langle x^*, x \rangle$.

$P : E \to (-\infty, \infty]$ is proper, convex, lsc (and "simple").

$f : E \to \mathbb{R}$ is convex and differentiable, with $\|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\|$ for all $x, y \in \operatorname{dom} P$ ($L \ge 0$).

### Constrained case

$P \equiv \delta_X$ with $X \subseteq E$ nonempty, closed, convex, where

$$\delta_X(x) = \begin{cases} 0 & \text{if } x \in X \\ \infty & \text{else} \end{cases}$$

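### Examples

• Basis Pursuit/Lasso: $E = \mathbb{R}^n$, $P(x) = \|x\|_1$, $f(x) = \|Ax - b\|_2^2$.

• Group Lasso: $E = \mathbb{R}^{n_1} \times \cdots \times \mathbb{R}^{n_N}$, $P(x) = w_1\|x_1\|_2 + \cdots + w_N\|x_N\|_2$ ($w_j > 0$), $f(x) = g(Ax)$ with $g(y) = \sum_{i=1}^m \ln(1 + e^{y_i}) - b_i y_i$.

• Matrix game: $E = \mathbb{R}^n$, $P \equiv \delta_X$ with $X = \{x \mid x \ge 0,\ x_1 + \cdots + x_n = 1\}$, $f(x) = g^*(Ax)$ with

$$g(y) = \begin{cases} \sum_{i=1}^m y_i \ln y_i & \text{if } y \ge 0,\ y_1 + \cdots + y_m = 1 \\ \infty & \text{else} \end{cases}$$

• Covariance selection: $E = \mathcal{S}^n$, $P \equiv \delta_X$ with $X = \{x \mid |x_{ij}| \le \rho \ \forall i, j\}$, $f(x) = g^*(x + s)$ with

$$g(y) = \begin{cases} -\ln \det y & \text{if } \alpha I \preceq y \preceq \beta I \\ \infty & \text{else} \end{cases} \qquad (\rho, \alpha, \beta > 0)$$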

How to solve this (nonsmooth) convex optimization problem? In applications, $m$ and $n$ are large ($m, n \ge 1000$), and $A$ may be dense.

2nd-order methods (Newton, interior-point)? Few iterations, but each iteration can be too expensive (e.g., $O(n^3)$ ops).

1st-order methods (gradient)? Each iteration is cheap (by using a suitable "prox function"), but often too many iterations. Accelerate convergence by interpolation (Nesterov).

Let

$$\ell(x; y) := f(y) + \langle \nabla f(y), x - y \rangle + P(x)$$

$$D(x, y) := h(x) - h(y) - \langle \nabla h(y), x - y \rangle \qquad \text{(Bregman, ...)}$$

with $h : E \to (-\infty, \infty]$ strictly convex, differentiable on $X_h \supseteq \operatorname{int}(\operatorname{dom} P)$, and

$$D(x, y) \ge \tfrac{1}{2}\|x - y\|^2 \quad \forall\, x \in \operatorname{dom} P,\ y \in X_h.$$

For $k = 0, 1, \ldots$,

$$x_{k+1} = \arg\min_x \, \{\ell(x; x_k) + L\, D(x, x_k)\}$$

with $x_0 \in \operatorname{dom} P$. Assume $x_k \in X_h$ for all $k$.

Special cases: steepest descent, gradient-projection (Goldstein, Levitin, Polyak, ...), mirror-descent (Yudin, Nemirovski), iterative thresholding (Daubechies et al.), ...

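For the earlier examples, $x_{k+1}$ has closed form when $h$ is chosen suitably:

• $E = \mathbb{R}^n$, $P(x) = \|x\|_1$, $h(x) = \|x\|_2^2/2$.

• $E = \mathbb{R}^{n_1} \times \cdots \times \mathbb{R}^{n_N}$, $P(x) = w_1\|x_1\|_2 + \cdots + w_N\|x_N\|_2$ ($w_j > 0$), $h(x) = \|x\|_2^2/2$.

• $E = \mathbb{R}^n$, $P \equiv \delta_X$ with $X = \{x \mid x \ge 0,\ x_1 + \cdots + x_n = 1\}$, $h(x) = \sum_{j=1}^n x_j \ln x_j$.

• $E = \mathcal{S}^n$, $P \equiv \delta_X$ with $X = \{x \mid |x_{ij}| \le \rho \ \forall i, j\}$, $h(x) = \|x\|_F^2/2$.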
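The closed forms can be sketched in a few lines. This is a minimal illustration in pure Python, not the talk's Matlab code: the function names are hypothetical, vectors are plain lists, and each function computes $\arg\min_x \{\ell(x; x_k) + L D(x, x_k)\}$ for one of the four choices of $P$ and $h$, given the gradient `grad` $= \nabla f(x_k)$.

```python
import math

def soft_threshold(v, t):
    """Entrywise soft-thresholding: sign(v_j) * max(|v_j| - t, 0)."""
    return [math.copysign(max(abs(a) - t, 0.0), a) for a in v]

def lasso_step(x, grad, L):
    """P(x) = ||x||_1, h(x) = ||x||_2^2 / 2: soft-threshold a gradient step."""
    return soft_threshold([a - g / L for a, g in zip(x, grad)], 1.0 / L)

def block_soft_threshold(v, t):
    """One group of the group-Lasso update: shrink the block's 2-norm by t.
    Apply it to the block x_j - grad_j / L with t = w_j / L."""
    norm = math.sqrt(sum(a * a for a in v))
    scale = max(1.0 - t / norm, 0.0) if norm > 0 else 0.0
    return [scale * a for a in v]

def simplex_entropy_step(x, grad, L):
    """P = indicator of the unit simplex, h(x) = sum_j x_j ln x_j:
    multiplicative update x_j * exp(-grad_j / L), renormalized to sum to 1."""
    shift = min(grad)  # a constant shift cancels in the normalization
    w = [a * math.exp(-(g - shift) / L) for a, g in zip(x, grad)]
    total = sum(w)
    return [a / total for a in w]

def box_step(x, grad, L, rho):
    """P = indicator of {|x_ij| <= rho}, h(x) = ||x||_F^2 / 2:
    clip a gradient step entrywise (x and grad are flattened matrices)."""
    return [min(max(a - g / L, -rho), rho) for a, g in zip(x, grad)]
```

Each update costs $O(n)$ beyond the gradient evaluation, which is what "simple" $P$ buys.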

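Fact 1: $f_P(x) \ge \ell(x; y) \ge f_P(x) - \frac{L}{2}\|x - y\|^2 \quad \forall\, x, y \in \operatorname{dom} P$.

Fact 2: For any proper convex lsc $\psi : E \to (-\infty, \infty]$ and $z \in X_h$, let

$$z_+ = \arg\min_x \, \{\psi(x) + D(x, z)\}.$$

If $z_+ \in X_h$, then

$$\psi(z_+) + D(z_+, z) \le \psi(x) + D(x, z) - D(x, z_+) \quad \forall\, x \in \operatorname{dom} P.$$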

Prop. 1: For any $x \in \operatorname{dom} P$,

$$\min\{e_1, \ldots, e_k\} \le \frac{L\, D(x, x_0)}{k}, \quad k = 1, 2, \ldots$$

with $e_k := f_P(x_k) - f_P(x)$.

Proof:

$$\begin{aligned}
f_P(x_{k+1}) &\le \ell(x_{k+1}; x_k) + \tfrac{L}{2}\|x_{k+1} - x_k\|^2 && \text{(Fact 1)} \\
&\le \ell(x_{k+1}; x_k) + L D(x_{k+1}, x_k) \\
&\le \ell(x; x_k) + L D(x, x_k) - L D(x, x_{k+1}) && \text{(Fact 2)} \\
&\le f_P(x) + L D(x, x_k) - L D(x, x_{k+1}), && \text{(Fact 1)}
\end{aligned}$$

so

$$\begin{aligned}
0 \le L D(x, x_{k+1}) &\le L D(x, x_k) - e_{k+1} \\
&\le L D(x, x_0) - (e_1 + \cdots + e_{k+1}) \\
&\le L D(x, x_0) - (k + 1) \min\{e_1, \ldots, e_{k+1}\}.
\end{aligned}$$

We will improve the global convergence rate by interpolation.

Idea: At iteration $k$, use a stepsize of $O(k/L)$ instead of $1/L$ and backtrack towards $x_k$.

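### Accelerated Proximal Gradient Method I

For $k = 0, 1, \ldots$,

$$\begin{aligned}
y_k &= (1 - \theta_k) x_k + \theta_k z_k \\
z_{k+1} &= \arg\min_x \, \{\ell(x; y_k) + \theta_k L D(x, z_k)\} \\
x_{k+1} &= (1 - \theta_k) x_k + \theta_k z_{k+1} \\
\frac{1 - \theta_{k+1}}{\theta_{k+1}^2} &\le \frac{1}{\theta_k^2} \qquad (0 < \theta_{k+1} \le 1)
\end{aligned}$$

with $\theta_0 = 1$, $x_0, z_0 \in \operatorname{dom} P$ (Nesterov, Auslender, Teboulle, Lan, Lu, Monteiro, ...). Assume $z_k \in X_h$ for all $k$.

For example, $\theta_k = \dfrac{2}{k + 2}$ or $\theta_{k+1} = \dfrac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$.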
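Both example choices of $\theta_k$ can be checked numerically against the condition $(1 - \theta_{k+1})/\theta_{k+1}^2 \le 1/\theta_k^2$. The sketch below is only an illustration with hypothetical helper names. The recursive rule is the positive root of $\theta_{k+1}^2 + \theta_k^2 \theta_{k+1} - \theta_k^2 = 0$, so it meets the condition with equality, and it stays at or below $2/(k+2)$.

```python
import math

def theta_harmonic(k):
    """First example rule: theta_k = 2 / (k + 2)."""
    return 2.0 / (k + 2)

def theta_next(theta):
    """Second example rule: theta_{k+1} computed from theta_k."""
    return (math.sqrt(theta ** 4 + 4.0 * theta ** 2) - theta ** 2) / 2.0

def satisfies_condition(theta_k, theta_k1, rel_tol=1e-9):
    """Check 0 < theta_{k+1} <= 1 and (1 - theta_{k+1}) / theta_{k+1}^2 <= 1 / theta_k^2,
    with a small relative tolerance for floating-point rounding."""
    lhs = (1.0 - theta_k1) / theta_k1 ** 2
    rhs = 1.0 / theta_k ** 2
    return 0.0 < theta_k1 <= 1.0 and lhs <= rhs * (1.0 + rel_tol)

def recursive_thetas(count):
    """Return [theta_0, ..., theta_{count-1}] from the recursive rule, theta_0 = 1."""
    thetas = [1.0]
    while len(thetas) < count:
        thetas.append(theta_next(thetas[-1]))
    return thetas
```

For instance, `recursive_thetas(3)` starts at $\theta_0 = 1$ and gives $\theta_1 = (\sqrt{5} - 1)/2$.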

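Prop. 2: For any $x \in \operatorname{dom} P$,

$$\min\{e_1, \ldots, e_k\} \le L\, D(x, z_0)\, \theta_{k-1}^2, \quad k = 1, 2, \ldots$$

with $e_k := f_P(x_k) - f_P(x)$.

Proof:

$$\begin{aligned}
f_P(x_{k+1}) &\le \ell(x_{k+1}; y_k) + \tfrac{L}{2}\|x_{k+1} - y_k\|^2 && \text{(Fact 1)} \\
&= \ell((1 - \theta_k) x_k + \theta_k z_{k+1}; y_k) + \tfrac{L}{2}\|(1 - \theta_k) x_k + \theta_k z_{k+1} - y_k\|^2 \\
&\le (1 - \theta_k)\,\ell(x_k; y_k) + \theta_k\,\ell(z_{k+1}; y_k) + \tfrac{L \theta_k^2}{2}\|z_{k+1} - z_k\|^2 \\
&\le (1 - \theta_k)\,\ell(x_k; y_k) + \theta_k \big(\ell(z_{k+1}; y_k) + \theta_k L D(z_{k+1}, z_k)\big) \\
&\le (1 - \theta_k)\,\ell(x_k; y_k) + \theta_k \big(\ell(x; y_k) + \theta_k L D(x, z_k) - \theta_k L D(x, z_{k+1})\big) && \text{(Fact 2)} \\
&\le (1 - \theta_k) f_P(x_k) + \theta_k f_P(x) + \theta_k^2 L D(x, z_k) - \theta_k^2 L D(x, z_{k+1}) && \text{(Fact 1)}
\end{aligned}$$

so, subtracting $f_P(x)$ and then dividing by $\theta_k^2$, we have

$$\frac{1}{\theta_k^2}\, e_{k+1} \le \frac{1 - \theta_k}{\theta_k^2}\, e_k + L D(x, z_k) - L D(x, z_{k+1}),$$

etc. (telescope, using the condition on $\theta_{k+1}$ and $\theta_0 = 1$).

Thus, the global convergence rate improves from $O(1/k)$ to $O(1/k^2)$ with little extra work per iteration!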
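The bounds of Prop. 1 and Prop. 2 can be checked on a toy instance. The sketch below is an illustration under stated assumptions, not the talk's experiment: $X$ is the unit simplex, $h$ is the entropy (so $D$ is the Kullback-Leibler divergence), and $f(x) = \frac{1}{2}\|x - c\|_2^2$ with $c \in X$, for which $L = 1$ works with the 1-norm and the minimum value is $0$. All function names are hypothetical.

```python
import math

def kl(x, y):
    """Bregman distance D(x, y) of h(x) = sum_j x_j ln x_j on the simplex."""
    return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

def entropy_step(z, g, step):
    """argmin over the simplex of <g, x> + (1/step) * D(x, z)."""
    shift = min(g)  # constant shift cancels in the normalization
    w = [zj * math.exp(-step * (gj - shift)) for zj, gj in zip(z, g)]
    total = sum(w)
    return [wj / total for wj in w]

def f(x, c):
    """Toy smooth objective f(x) = 0.5 * ||x - c||_2^2 (assumption)."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, c))

def grad_f(x, c):
    return [a - b for a, b in zip(x, c)]

def pgm(c, x0, L, iters):
    """Proximal gradient method; returns [e_1, ..., e_iters]."""
    x, errs = list(x0), []
    for _ in range(iters):
        x = entropy_step(x, grad_f(x, c), 1.0 / L)
        errs.append(f(x, c))
    return errs

def apgm1(c, x0, L, iters):
    """APGM I with theta_k = 2 / (k + 2); returns [e_1, ..., e_iters]."""
    x, z, errs = list(x0), list(x0), []
    for k in range(iters):
        theta = 2.0 / (k + 2)
        y = [(1 - theta) * a + theta * b for a, b in zip(x, z)]
        z = entropy_step(z, grad_f(y, c), 1.0 / (theta * L))
        x = [(1 - theta) * a + theta * b for a, b in zip(x, z)]
        errs.append(f(x, c))
    return errs
```

With $x = c$, the bounds read $\min\{e_1, \ldots, e_k\} \le D(c, x_0)/k$ for PGM and $\le 4 D(c, z_0)/(k+1)^2$ for APGM I. These are worst-case guarantees; on a given instance either method may do much better than its bound.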

Comparing PGM with APGM I: Assume $P \equiv \delta_X$. [Figure not recovered.]

Can also replace $\ell(x; y_k)$ by a certain weighted sum of $\ell(x; y_0), \ell(x; y_1), \ldots, \ell(x; y_k)$. Then...

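### Accelerated Proximal Gradient Method II

For $k = 0, 1, \ldots$,

$$\begin{aligned}
y_k &= (1 - \theta_k) x_k + \theta_k z_k \\
z_{k+1} &= \arg\min_x \left\{ \sum_{i=0}^{k} \frac{\ell(x; y_i)}{\vartheta_i} + L\, h(x) \right\} \\
x_{k+1} &= (1 - \theta_k) x_k + \theta_k z_{k+1} \\
\frac{1 - \theta_{k+1}}{\theta_{k+1} \vartheta_{k+1}} &= \frac{1}{\theta_k \vartheta_k} \qquad (\vartheta_{k+1} \ge \theta_{k+1} > 0)
\end{aligned}$$

with $\vartheta_0 \ge \theta_0 = 1$, $x_0 \in \operatorname{dom} P$, and $z_0 = \arg\min_{x \in \operatorname{dom} P} h(x)$ (Nesterov, d’Aspremont et al., Lu, ...). Assume $z_k \in X_h$ for all $k$.

For example, $\vartheta_k = \dfrac{2}{k + 1}$, $\theta_k = \dfrac{2}{k + 2}$, or $\vartheta_{k+1} = \theta_{k+1} = \dfrac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$.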
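A minimal sketch of APGM II on a toy instance follows. It is an illustration under stated assumptions, not the talk's Matlab code: $X$ is the unit simplex, $h(x) = \sum_j x_j \ln x_j$ (so $z_0$ is the uniform vector), $f(x) = \frac{1}{2}\|x - c\|_2^2$ with $c \in X$, and $\vartheta_k = 2/(k+1)$, $\theta_k = 2/(k+2)$. With this $h$ the $z$-update reduces to a softmax of the accumulated weighted gradients. The name `apgm2` is hypothetical.

```python
import math

def apgm2(c, L, iters):
    """APGM II on the unit simplex with entropy h; returns [e_1, ..., e_iters],
    where e_k = f(x_k) - f(c) = f(x_k) for the toy f(x) = 0.5 * ||x - c||^2."""
    n = len(c)
    z = [1.0 / n] * n        # z_0 = argmin of the entropy over the simplex
    x = list(z)              # x_0 (irrelevant after step 0 since theta_0 = 1)
    G = [0.0] * n            # running sum of grad f(y_i) / vartheta_i
    errs = []
    for k in range(iters):
        theta = 2.0 / (k + 2)
        vartheta = 2.0 / (k + 1)
        y = [(1 - theta) * a + theta * b for a, b in zip(x, z)]
        g = [a - b for a, b in zip(y, c)]            # grad f(y_k)
        G = [s + gj / vartheta for s, gj in zip(G, g)]
        # z_{k+1} = argmin over the simplex of <G, x> + L * h(x):
        # z_j proportional to exp(-G_j / L); shift by min(G) for stability.
        shift = min(G)
        w = [math.exp(-(s - shift) / L) for s in G]
        total = sum(w)
        z = [wj / total for wj in w]
        x = [(1 - theta) * a + theta * b for a, b in zip(x, z)]
        errs.append(0.5 * sum((a - b) ** 2 for a, b in zip(x, c)))
    return errs
```

Unlike APGM I, no Bregman distance to the previous $z_k$ is needed; only the running sum `G` is stored between iterations.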

Prop. 3: For any $x \in \operatorname{dom} P$,

$$\min\{e_1, \ldots, e_k\} \le L\,(h(x) - h(z_0))\,\theta_{k-1} \vartheta_{k-1}, \quad k = 1, 2, \ldots$$

with $e_k := f_P(x_k) - f_P(x)$.

Proof replaces Fact 2 with:

Fact 3: For any proper convex lsc $\psi : E \to (-\infty, \infty]$, let

$$z = \arg\min_x \, \{\psi(x) + h(x)\}.$$

If $z \in X_h$, then

$$\psi(z) + h(z) \le \psi(x) + h(x) - D(x, z) \quad \forall\, x \in \operatorname{dom} P.$$

Advantage? Possibly better performance on compressed sensing and certain conic programs (Lu).

### Example: Matrix Game

$$\min_{x \in X} \max_{v \in V} \, \langle v, Ax \rangle$$

with $X$ and $V$ unit simplices in $\mathbb{R}^n$ and $\mathbb{R}^m$, and $A \in \mathbb{R}^{m \times n}$. Generate $A_{ij} \sim U[-1, 1]$ with probability $p$; otherwise $A_{ij} = 0$ (Nesterov, Nemirovski).

Set $P \equiv \delta_X$ and $f(x) = g^*(Ax/\mu)$, with $\mu = \dfrac{\epsilon}{2 \ln m}$ ($\epsilon > 0$) and

$$g(v) = \begin{cases} \sum_{i=1}^m v_i \ln v_i & \text{if } v \in V \\ \infty & \text{else} \end{cases} \qquad (L = \tfrac{1}{\mu},\ \|\cdot\| = \text{1-norm})$$

• Implement PGM, APGM I & II in Matlab, with $h(x) = \sum_{j=1}^n x_j \ln x_j$ and $L_{\text{init}} = 1$. Matrix-vector multiplications by $A$ and $A^T$ at each iteration.

• Initialize $x_0 = z_0 = (\frac{1}{n}, \ldots, \frac{1}{n})$. Terminate when

$$\max_i \, (A x_k)_i - \min_j \, (A^T v_k)_j \le \epsilon$$

with $v_k \in V$ a weighted sum of dual vectors associated with $x_0, x_1, \ldots, x_k$.
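The role of $\mu$ and of the termination test can be sketched as follows. This is an illustration under an assumption not taken from the slide: that the smooth surrogate is the standard log-sum-exp (entropy) smoothing $\mu \ln \sum_i e^{u_i/\mu}$ of $\max_i u_i$ with $u = Ax$, which overestimates the max by at most $\mu \ln m$, i.e. by $\epsilon/2$ when $\mu = \epsilon/(2 \ln m)$. The function names are hypothetical, and matrices are lists of rows.

```python
import math

def smoothed_max(u, mu):
    """Log-sum-exp smoothing of max_i u_i: mu * ln(sum_i exp(u_i / mu)),
    computed with the max subtracted first to avoid overflow."""
    top = max(u)
    return top + mu * math.log(sum(math.exp((ui - top) / mu) for ui in u))

def dual_vector(u, mu):
    """Gradient of smoothed_max in u: a point v of the unit simplex V."""
    top = max(u)
    w = [math.exp((ui - top) / mu) for ui in u]
    total = sum(w)
    return [wi / total for wi in w]

def duality_gap(A, x, v):
    """max_i (A x)_i - min_j (A^T v)_j; nonnegative for x in X, v in V."""
    Ax = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    ATv = [sum(A[i][j] * v[i] for i in range(len(A))) for j in range(len(x))]
    return max(Ax) - min(ATv)
```

The gap is an upper bound on the distance of $\max_i (Ax)_i$ from the game value, so it gives a computable stopping test.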

| n/m/p | ε | PGM k/cpu (sec) | APGM I k/cpu (sec) | APGM II k/cpu (sec) |
|---|---|---|---|---|
| 1000/100/.01 | .001 | 1082480/1500 | 3325/5 | 10510/9 |
| | .0001 | – | 20635/23 | 61865/45 |
| 10000/100/.01 | .001 | – | 10005/142 | 10005/128 |
| 10000/100/.1 | .001 | – | 10005/201 | 10005/185 |
| 10000/1000/.01 | .001 | – | 10005/202 | 10005/191 |
| 10000/1000/.1 | .001 | – | 10005/706 | 10005/695 |

Table 1: Performance of PGM, APGM I & II for different $n$, $m$, sparsity $p$, and solution accuracy $\epsilon$.

### Conclusions & Extensions

1. The accelerated prox gradient method is promising in theory and practice. It is applicable to convex-concave optimization by using smoothing (Nesterov). Further extension to add cutting planes.

2. Application to matrix completion, where $E = \mathbb{R}^{m \times n}$ and $P(x) = \|\sigma(x)\|_1$? Or to total-variation image restoration (joint work with Steve Wright)?

3. Extending the interpolation technique to incremental gradient methods and coordinate-wise gradient methods?

### The END
