Optimization Methods with Signal Denoising Applications
Paul Tseng
Department of Mathematics, University of Washington, Seattle
Taiwan Normal University, March 2, 2006
(Joint works with Sylvain Sardy (EPFL), Andrew Bruce (MathSoft), and Sangwoon Yun (UW))
Abstract
This talk was given at an optimization seminar in January 2006 and at Taiwan Normal University in March 2006.
Talk Outline
• Basic Problem Model
– Primal-Dual Interior Point Method
– Block Coordinate Minimization Method
– Applications
• General Problem Model
– Block Coordinate Gradient Descent Method
– Convergence
– Numerical Testing (ongoing)
• Conclusions & Future Work
Basic Problem Model
[Figure: a wavelet basis function $b(t)$ and its translates, each sampled at the points $t_1, \dots, t_m$. Each sampled basis function gives a column $B_i = (b_i(t_1), \dots, b_i(t_m))^T$, and the observed noisy signal gives $b = (b(t_1), \dots, b(t_m))^T$. Panels show the observed noisy signal and the denoised signal.]
$$B_1 w_1 + \cdots + B_n w_n = [B_1 \cdots B_n] \begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix} = Bw \qquad (n \ge m)$$
Find w so that Bw − b ≈ 0 and w has “few” nonzeros.
Formulate this as an unconstrained convex optimization problem:
P1: $\min_{w \in \Re^n} \|Bw - b\|_2^2 + c\|w\|_1$ \ ($c > 0$), the "Basis Pursuit" problem of Chen, Donoho, Saunders.

Difficulty: Typically $m \ge 1000$, $n \ge 8000$, and $B$ is dense. $\|\cdot\|_1$ is nonsmooth.
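For concreteness, here is a minimal NumPy sketch of evaluating the P1 objective; the sizes, the random dense $B$, and the function name are illustrative assumptions, not the talk's data:

```python
# Minimal sketch: evaluate the P1 objective F(w) = ||Bw - b||_2^2 + c*||w||_1.
import numpy as np

rng = np.random.default_rng(0)
m, n, c = 100, 800, 0.5          # toy stand-ins for m >= 1000, n >= 8000
B = rng.standard_normal((m, n))  # dense, overcomplete basis (n >= m)
b = rng.standard_normal(m)       # observed noisy signal

def p1_objective(w):
    r = B @ w - b
    return r @ r + c * np.abs(w).sum()

print(p1_objective(np.zeros(n)))  # equals ||b||_2^2 at w = 0
```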
Primal-Dual Interior Point Method for P1
Idea: Reformulate P1 as a convex QP and apply a primal-dual IP method.

QP Reformulation of P1:
$$\min_{w \in \Re^n} \|Bw - b\|_2^2 + c\|w\|_1$$
Substitute $w = w^+ - w^-$ with $w^+ \ge 0$, $w^- \ge 0$, so that $\|w\|_1 = e^T(w^+ + w^-)$:
$$\min_{w^+ \ge 0,\, w^- \ge 0} \|Bw^+ - Bw^- - b\|_2^2 + c\, e^T(w^+ + w^-).$$
Setting $A = [B \ \ {-B}]$, $x = \begin{bmatrix} w^+ \\ w^- \end{bmatrix}$, and $y = b - Ax$, this becomes
$$\min \ \|y\|_2^2 + c\, e^T x \quad \text{s.t.} \quad x \ge 0, \ Ax + y = b.$$
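A quick numerical check of this reformulation (a sketch with hypothetical toy sizes): splitting $w$ into its positive and negative parts makes the QP objective equal the P1 objective.

```python
# Check that the split w = w+ - w-, x = (w+, w-) >= 0, A = [B, -B]
# reproduces the P1 objective, since ||w||_1 = e^T(w+ + w-) at this split.
import numpy as np

rng = np.random.default_rng(1)
m, n, c = 50, 200, 0.5
B = rng.standard_normal((m, n))
b = rng.standard_normal(m)
w = rng.standard_normal(n)

w_plus, w_minus = np.maximum(w, 0), np.maximum(-w, 0)
x = np.concatenate([w_plus, w_minus])
A = np.hstack([B, -B])
y = b - A @ x                     # residual from the constraint Ax + y = b

qp_obj = y @ y + c * x.sum()
p1_obj = np.sum((B @ w - b) ** 2) + c * np.abs(w).sum()
assert np.isclose(qp_obj, p1_obj)
```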
KKT Optimality Conditions for the QP:
$$Ax + y = b,\ x \ge 0; \qquad A^T y + z = ce,\ z \ge 0; \qquad Xz = 0 \qquad (X = \mathrm{diag}[x_1, \dots, x_{2n}]).$$
Perturbed KKT Conditions:
$$Ax + y = b,\ x > 0; \qquad A^T y + z = ce,\ z > 0; \qquad Xz = \mu e \quad (\mu > 0).$$
Primal-Dual IP method: Apply damped Newton method to solve inexactly the perturbed KKT equations while maintaining x > 0, z > 0. Decrease µ after each iteration. Fiacco-McCormick ’68, Karmarkar ’84,...
Method description:
Given $\mu > 0$, $x > 0$, $y$, $z > 0$, solve the Newton equations
$$A\Delta x + \Delta y = b - Ax - y, \qquad A^T\Delta y + \Delta z = ce - A^T y - z, \qquad Z\Delta x + X\Delta z = \mu e - Xz.$$
Update
$$x^{\rm new} = x + .99\,\beta_p \Delta x, \qquad y^{\rm new} = y + .99\,\beta_d \Delta y, \qquad z^{\rm new} = z + .99\,\beta_d \Delta z,$$
$$\mu^{\rm new} = \left(1 - \min\{.99, \beta_p, \beta_d\}\right)\mu, \qquad \text{where } \beta_p = \min_{i:\,\Delta x_i < 0} \frac{x_i}{-\Delta x_i}, \quad \beta_d = \min_{i:\,\Delta z_i < 0} \frac{z_i}{-\Delta z_i}.$$
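Below is a dense, toy-scale sketch of one such iteration. It is assumption-laden: it assembles the full Newton system (real implementations use the reduced CG system described next) and caps $\beta_p, \beta_d$ at 1.

```python
# One damped-Newton IP iteration, dense toy version (sketch only).
import numpy as np

def ip_iteration(A, b, c, x, y, z, mu):
    m, N = A.shape                        # N = 2n after the [B, -B] split
    e, I_m, I_N = np.ones(N), np.eye(m), np.eye(N)
    # Newton equations as one (m + 2N)-dimensional system in (dx, dy, dz):
    #   A dx + dy = r1,  A^T dy + dz = r2,  Z dx + X dz = r3.
    K = np.zeros((m + 2 * N, m + 2 * N))
    K[:m, :N], K[:m, N:N + m] = A, I_m
    K[m:m + N, N:N + m], K[m:m + N, N + m:] = A.T, I_N
    K[m + N:, :N], K[m + N:, N + m:] = np.diag(z), np.diag(x)
    rhs = np.concatenate([b - A @ x - y,
                          c * e - A.T @ y - z,
                          mu * e - x * z])
    d = np.linalg.solve(K, rhs)
    dx, dy, dz = d[:N], d[N:N + m], d[N + m:]
    # Ratio tests: largest steps keeping x and z strictly positive
    # (the cap at 1 is this sketch's assumption).
    bp = min(1.0, np.min(x[dx < 0] / -dx[dx < 0], initial=np.inf))
    bd = min(1.0, np.min(z[dz < 0] / -dz[dz < 0], initial=np.inf))
    x, y, z = x + .99 * bp * dx, y + .99 * bd * dy, z + .99 * bd * dz
    mu = (1.0 - min(.99, bp, bd)) * mu
    return x, y, z, mu
```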
Implementation & Initialization:
• The Newton equations reduce to $(I + AZ^{-1}XA^T)\Delta y = r$, solved by the conjugate gradient (CG) method; a matrix-free sketch follows after this list. Multiplication by $A$ ($m \times 2n$) and by $A^T$ requires $O(m \log m)$ and $O(m(\log m)^2)$ operations, respectively.
• Initialization as in Chen-Donoho-Saunders '96.
• Theoretical convergence? CG preconditioning?
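A matrix-free sketch of the reduced-system solve, assuming user-supplied fast operators `apply_A` and `apply_At` (hypothetical names) for multiplication by $A$ and $A^T$:

```python
# Solve (I + A Z^{-1} X A^T) dy = r by CG without forming the matrix;
# apply_A / apply_At are assumed fast transform routines (hypothetical).
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_reduced_system(apply_A, apply_At, x, z, r):
    m = r.size
    matvec = lambda v: v + apply_A((x / z) * apply_At(v))
    dy, info = cg(LinearOperator((m, m), matvec=matvec), r)
    return dy  # info == 0 signals CG convergence
```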
Block Coord. Minimization Method for P1
Method description:
Given $w$, choose $I \subseteq \{1, \dots, n\}$ with $|I| = m$ and $\{B_i\}_{i \in I}$ orthogonal.
Update
$$w^{\rm new} = \arg\min_{u:\, u_i = w_i\ \forall i \notin I} \|Bu - b\|_2^2 + c\|u\|_1 \qquad \leftarrow \text{closed-form soln (sketched below)}$$
• Choose $I$ to maximize $\displaystyle\min_{v \,\in\, \partial_{u_I}\left(\|Bu - b\|_2^2 + c\|u\|_1\right)\big|_{u=w}} \|v\|_2$. This requires $O(m \log m)$ operations by the algorithm of Coifman & Wickerhauser.
• Theoretical convergence: the $w$-sequence is bounded and each cluster point solves P1.
Convergence of the BCM method depends crucially on:
• differentiability of $\|\cdot\|_2^2$
• separability of $\|\cdot\|_1$
• convexity ($\Rightarrow$ global minimum)
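A minimal sketch of the closed-form block update, assuming the block's columns are orthonormal (the threshold $c/2$ reflects the unhalved least-squares term in P1):

```python
# Exact minimization over block I when the columns {B_i, i in I} are
# orthonormal: the subproblem separates and soft-thresholding solves it.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def bcm_update(B, b, w, I, c):
    mask = np.zeros(w.size, dtype=bool)
    mask[I] = True
    r = b - B[:, ~mask] @ w[~mask]   # residual with block I's terms removed
    w_new = w.copy()
    w_new[I] = soft_threshold(B[:, I].T @ r, c / 2.0)
    return w_new
```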
Application 1: Electronic Surveillance

[Figure: spectrograms of the signal before and after denoising (panels bp.cptable.spectrogram and ew.cptable.spectrogram), time in microseconds vs. frequency in KHz.]

$m = 2^{11} = 2048$, $c = 4$, local cosine transform, all but 4 levels.
Method efficiency:

[Figure: objective value vs. CPU time (log scale) for the real and imaginary parts of the signal, comparing the IP and BCR (BCM) methods.]

Comparing CPU times of the IP and BCM methods (S-Plus, Sun Ultra 1).
ML Estimation P2:
$$\min_w\ {-\ell(Bw; b)} + c\sum_{i \in J} |w_i| \qquad (c > 0)$$
$\ell$ is the log-likelihood; $\{B_i\}_{i \notin J}$ are linearly independent "coarse-scale wavelets".
• $-\ell(y; b) = \frac{1}{2}\|y - b\|_2^2$ (Gaussian noise)
• $-\ell(y; b) = \sum_{i=1}^m (y_i - b_i \ln y_i)$, $y_i \ge 0$ (Poisson noise)
Solve P2 by adapting the IP method. BCM?
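For the Poisson case, a minimal sketch of the P2 objective (the function name and its domain handling are this sketch's assumptions):

```python
# P2 objective with Poisson noise: -l(y; b) = sum_i (y_i - b_i*log(y_i)),
# valid only for y = Bw > 0; only coefficients indexed by J are penalized.
import numpy as np

def p2_objective(B, b, w, c, J):
    y = B @ w
    if np.any(y <= 0):
        return np.inf                    # outside the likelihood's domain
    return np.sum(y - b * np.log(y)) + c * np.abs(w[J]).sum()
```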
Application 2: Γ-Ray Astronomy

[Figure: four panels: measurements, Anscombe-based estimate, TIPSH estimate, and $\ell_1$-penalized likelihood estimate.]
m = 720 · 360, c chosen by CV, Symmlets of order 8 (levels 3-8). Spatially inhomogeneous Poisson noise.
But IP method is slow (many CG steps).
Adapt BCM method?
General Problem Model
P3: $\min_w F_c(w) := f(w) + cP(w)$ \quad ($c \ge 0$)

$f: \Re^N \to \Re$ is smooth.
$P: \Re^N \to (-\infty, \infty]$ is proper, convex, lsc, and separable: $P(w) = \sum_{j=1}^n P_j(w_j)$ for $w = (w_1, \dots, w_n)$.
Examples:
• $P(w) = \|w\|_1$
• $P(w) = \begin{cases} 0 & \text{if } l \le w \le u \\ \infty & \text{else} \end{cases}$
Block Coord. Gradient Descent Method for P3
Idea: Do BCM on a quadratic approximation of $f$.

For $w \in \mathrm{dom}\,P$, $I \subseteq \{1, \dots, n\}$, and $H \succ 0$, let $d_H(w; I)$ and $q_H(w; I)$ be the optimal solution and objective value of the direction subproblem
$$\min_{d:\, d_i = 0\ \forall i \notin I} \left\{ g^T d + \tfrac{1}{2} d^T H d + cP(w + d) - cP(w) \right\}, \qquad g = \nabla f(w).$$

Facts:
• $d_H(w; \{1, \dots, n\}) = 0 \iff F_c'(w; d) \ge 0\ \forall d \in \Re^N$ (stationarity)
• $H$ diagonal $\Rightarrow$ $d_H(w; I) = \sum_{i \in I} d_H(w; i)$ and $q_H(w; I) = \sum_{i \in I} q_H(w; i)$ (separability)
Method description:
Given $w \in \mathrm{dom}\,P$, choose $I \subseteq \{1, \dots, n\}$ and $H \succ 0$. Let $d = d_H(w; I)$.
Update $w^{\rm new} = w + \alpha d$ ($\alpha > 0$), with the following rules (a sketch of one such step, for $P(w) = \|w\|_1$ and diagonal $H$, follows this list):
• $\alpha$ = largest element of $\{1, \beta, \beta^2, \dots\}$ satisfying $F_c(w + \alpha d) - F_c(w) \le \sigma\alpha\, q_H(w; I)$ ($0 < \beta < 1$, $0 < \sigma < 1$). (Armijo)
• $I = \{1\}, \{2\}, \dots, \{n\}, \{1\}, \{2\}, \dots$ (Gauss-Seidel)
• $\|d_D(w; I)\|_\infty \ge \upsilon\, \|d_D(w; \{1, \dots, n\})\|_\infty$ ($0 < \upsilon \le 1$, $D \succ 0$ diagonal, e.g., $D = I$ or $D = \mathrm{diag}(H)$). (Gauss-Southwell-d)
• $q_D(w; I) \le \upsilon\, q_D(w; \{1, \dots, n\})$. (Gauss-Southwell-q)
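The promised sketch of one BCGD step: for $P(w) = \|w\|_1$ and diagonal $H = \mathrm{diag}(h)$, the direction subproblem separates and is solved coordinatewise by soft-thresholding. Here `f` and `grad_f` are assumed user-supplied callables; this is an illustration, not the talk's Matlab implementation.

```python
# One BCGD step for F_c(w) = f(w) + c*||w||_1 with diagonal H = diag(h).
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def bcgd_step(f, grad_f, w, I, h, c, beta=0.5, sigma=0.1):
    g = grad_f(w)
    d = np.zeros_like(w)
    # d_H(w; I): minimize g_i*d + h_i*d^2/2 + c|w_i + d| over each i in I.
    d[I] = soft_threshold(w[I] - g[I] / h[I], c / h[I]) - w[I]
    # q_H(w; I): optimal value of the direction subproblem (<= 0).
    q = (g[I] @ d[I] + 0.5 * (h[I] * d[I]) @ d[I]
         + c * (np.abs(w[I] + d[I]).sum() - np.abs(w[I]).sum()))
    Fc = lambda v: f(v) + c * np.abs(v).sum()
    alpha, F0 = 1.0, Fc(w)
    while Fc(w + alpha * d) - F0 > sigma * alpha * q:  # Armijo rule
        alpha *= beta
    return w + alpha * d
```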
Convergence Results:
(a) If
• $0 < \underline{\lambda} \le \lambda_i(D), \lambda_i(H) \le \bar{\lambda}$ for all $i$,
• $\alpha$ is chosen by the Armijo rule,
• $I$ is chosen by the Gauss-Seidel, Gauss-Southwell-d, or Gauss-Southwell-q rule,
then every cluster point of the $w$-sequence generated by the BCGD method is a stationary point of $F_c$.
(b) If in addition $P$ and $f$ satisfy any of the following assumptions, then the $w$-sequence converges at an R-linear rate (excepting Gauss-Southwell-d):
C1. $f$ is strongly convex and $\nabla f$ is Lipschitz continuous on $\mathrm{dom}\,P$.
C2. $f$ is (possibly nonconvex) quadratic and $P$ is polyhedral.
C3. $f(w) = g(Ew) + q^T w$, where $E \in \Re^{m \times N}$, $q \in \Re^N$, $g$ is strongly convex, and $\nabla g$ is Lipschitz continuous on $\Re^m$; $P$ is polyhedral.
C4. $f(w) = \max_{y \in Y}\{(Ew)^T y - g(y)\} + q^T w$, where $Y \subseteq \Re^m$ is polyhedral, $E \in \Re^{m \times N}$, $q \in \Re^N$, $g$ is strongly convex, and $\nabla g$ is Lipschitz continuous on $\Re^m$; $P$ is polyhedral.
Notes:
• BCGD has a stronger global convergence property (and a cheaper iteration) than BCM.
• The proof of (b) uses a local error bound on $\mathrm{dist}(w, \{\text{stationary points of } F_c\})$.
Numerical Testing (ongoing):
• Implement the BCGD method in Matlab.
• Numerical tests with $f$ from the Moré-Garbow-Hillstrom set and the CUTEr set (Gould, Orban, Toint '05), $P(w) = \|w\|_1$, and different $c$ (e.g., $c = .1, 1, 10$).
• Comparison with MINOS 5.5.1 (Murtagh, Saunders '05), a Fortran implementation of an active-set method, applied to the reformulation of P3 with $P(w) = \|w\|_1$ as
$$\min_{w^+ \ge 0,\, w^- \ge 0} f(w^+ - w^-) + c\, e^T(w^+ + w^-).$$
• Preliminary results are "promising".
f(w)  n     Description
BAL   1000  Brown almost-linear function; nonconvex; dense Hessian.
BT    1000  Broyden tridiagonal function; nonconvex; sparse Hessian.
DBV   1000  Discrete boundary value function; nonconvex; sparse Hessian.
EPS   1000  Extended Powell singular function; convex; 4-block diagonal Hessian.
ER    1000  Extended Rosenbrock function; nonconvex; 2-block diagonal Hessian.
QD1   1000  $f(w) = \left(\sum_{i=1}^n w_i - 1\right)^2$; convex quadratic; rank-1 Hessian.
QD2   1000  $f(w) = \sum_{i=1}^n \left(w_i - \frac{2}{n+1}\sum_{j=1}^n w_j - 1\right)^2 + \left(\frac{2}{n+1}\sum_{j=1}^n w_j + 1\right)^2$; strongly convex quadratic; dense Hessian.
VD    1000  $f(w) = \sum_{i=1}^n (w_i - 1)^2 + \left(\sum_{j=1}^n j(w_j - 1)\right)^2 + \left(\sum_{j=1}^n j(w_j - 1)\right)^4$; strongly convex; dense, ill-conditioned Hessian.

Table 1: Least squares problems from Moré, Garbow, Hillstrom, 1981.
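As a usage illustration, the QD1 entry above in code, paired with the `bcgd_step` sketch from earlier (the pairing is illustrative, not the talk's Matlab implementation):

```python
# QD1 from Table 1: f(w) = (sum_i w_i - 1)^2, convex quadratic, rank-1 Hessian.
import numpy as np

def qd1(w):
    s = w.sum() - 1.0
    return s * s

def qd1_grad(w):
    return 2.0 * (w.sum() - 1.0) * np.ones_like(w)

# e.g.: w = bcgd_step(qd1, qd1_grad, np.ones(1000), I=np.arange(1000),
#                     h=2.0 * np.ones(1000), c=0.1)
```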
f(w)  c    MINOS              BCGD-G-Southwell-d   BCGD-G-Southwell-q
           #nz/objec/cpu      #nz/objec/cpu        #nz/objec/cpu
BAL   1    1000/1000/43.9     1000/1000/.1         1000/1000/.1
      10   1000/9999.9/43.9   1000/9999.9/.1       1000/9999.9/.1
      100  1000/99997.5/44.3  1000/99997.5/.1      1000/99997.5/.2
BT    .1   1000/71.725/134.4  999/71.394/4.5       999/71.394/5.0
      1    999/672.41/95.3    21/672.70/292.6      995/991.06/1.3(?)
      10   0/1000/77.7        0/1000/.01           0/1000/.01
DBV   .1   0/0/52.7           0/4.5E-9/.1          0/4.5E-9/.04
      1    0/0/52.9           0/4.5E-9/.1          0/4.5E-9/.04
      10   0/0/53.0           0/4.5E-9/.01         0/4.5E-9/.01
EPS   1    1000/351.14/58.5   500/351.14/.3        500/351.14/.3
      10   243/1250/45.7      250/1250/.05         250/1250/.05
      100  0/1250/50.7        0/1250/.01           0/1250/.02
ER    1    1000/436.25/72.0   1000/436.25/.5       1000/436.25/.4
      10   0/500/51.5         0/500/.1             0/500/.01
      100  0/500/52.4         0/500/.03            0/500/.01
QD1   .1   1000/.0975/29.9    1/.0975/.01          1/.0975/.02
      1    1000/.75/37.8      1/.75/.01            1/.75/.01
      10   0/1/38.6           0/1/.01              0/1/.01
QD2   .1   1000/98.5/74.2     0/98.5/.01           0/98.5/.03
      1    1000/751/75.8      0/751/.01            0/751/.02
      10   0/1001/53.1        0/1001/.01           0/1001/.01
VD    1    1000/937.59/43.9   1000/937.66/856.3    1000/937.66/869.0
      10   413/6726.80/57.1   1000/6746.74/235.7   999/6746.74/246.9
      100  136/55043/57.8     1000/55078/12.6      1000/55078/13.3

Table 2: Performance of MINOS and BCGD with Gauss-Southwell-d/q rules, $w^{\rm init} = (1, 1, \dots, 1)$. Each entry is #nonzeros/objective/CPU time.
Conclusions & Future Work
1. For ML estimation, the $\ell_1$ penalty imparts parsimony in the coefficients and avoids oversmoothing the signals.
2. The resulting estimation problem can be solved effectively by the IP method or the BCM method, exploiting the problem structure, including the nondifferentiability of the $\ell_1$-norm. Which to use depends on the problem.
3. Problem reformulation may be needed.
4. For the general problem model, we propose the BCGD method. Numerical testing is ongoing.
5. Applications to denoising, regression, SVM?