Optimization Methods with Signal Denoising Applications
Paul Tseng
Department of Mathematics, University of Washington, Seattle
Taiwan Normal University, March 2, 2006
(Joint works with Sylvain Sardy (EPFL), Andrew Bruce (MathSoft), and Sangwoon Yun (UW))
Abstract
This talk was given at an optimization seminar in January 2006 and at Taiwan Normal University in March 2006.
Talk Outline
• Basic Problem Model
– Primal-Dual Interior Point Method
– Block Coordinate Minimization Method
– Applications
• General Problem Model
– Block Coordinate Gradient Descent Method
– Convergence
– Numerical Testing (ongoing)
• Conclusions & Future Work
Basic Problem Model
[Figure: a wavelet basis function $b(t)$ and its translates, each sampled at the points $t_1, \dots, t_m$. Each sampled basis function gives a column $B_i = (b_i(t_1), \dots, b_i(t_m))^T$, and the observed noisy signal gives $b = (b(t_1), \dots, b(t_m))^T$. Panels show the observed noisy signal and the denoised signal.]
$$B_1 w_1 + \cdots + B_n w_n = [B_1 \cdots B_n] \begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix} = Bw \qquad (n \ge m)$$
Find w so that Bw − b ≈ 0 and w has “few” nonzeros.
Formulate this as an unconstrained convex optimization problem:
P1: $\min_{w \in \Re^n} \|Bw - b\|_2^2 + c\|w\|_1$ \ ($c > 0$), the "Basis Pursuit" problem of Chen, Donoho, Saunders.

Difficulty: Typically $m \ge 1000$, $n \ge 8000$, and $B$ is dense. $\|\cdot\|_1$ is nonsmooth.
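For concreteness, here is a minimal NumPy sketch of evaluating the P1 objective; the sizes, the random dense $B$, and the function name are illustrative assumptions, not the talk's data:

```python
# Minimal sketch: evaluate the P1 objective F(w) = ||Bw - b||_2^2 + c*||w||_1.
import numpy as np

rng = np.random.default_rng(0)
m, n, c = 100, 800, 0.5          # toy stand-ins for m >= 1000, n >= 8000
B = rng.standard_normal((m, n))  # dense, overcomplete basis (n >= m)
b = rng.standard_normal(m)       # observed noisy signal

def p1_objective(w):
    r = B @ w - b
    return r @ r + c * np.abs(w).sum()

print(p1_objective(np.zeros(n)))  # equals ||b||_2^2 at w = 0
```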
Primal-Dual Interior Point Method for P1
Idea: Reformulate P1 as a convex QP and apply a primal-dual IP method.

QP Reformulation of P1:
$$\min_{w \in \Re^n} \|Bw - b\|_2^2 + c\|w\|_1$$
Substitute $w = w^+ - w^-$ with $w^+ \ge 0$, $w^- \ge 0$, so that $\|w\|_1 = e^T(w^+ + w^-)$:
$$\min_{w^+ \ge 0,\, w^- \ge 0} \|Bw^+ - Bw^- - b\|_2^2 + c\, e^T(w^+ + w^-).$$
Setting $A = [B \ \ {-B}]$, $x = \begin{bmatrix} w^+ \\ w^- \end{bmatrix}$, and $y = b - Ax$, this becomes
$$\min \ \|y\|_2^2 + c\, e^T x \quad \text{s.t.} \quad x \ge 0, \ Ax + y = b.$$
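A quick numerical check of this reformulation (a sketch with hypothetical toy sizes): splitting $w$ into its positive and negative parts makes the QP objective equal the P1 objective.

```python
# Check that the split w = w+ - w-, x = (w+, w-) >= 0, A = [B, -B]
# reproduces the P1 objective, since ||w||_1 = e^T(w+ + w-) at this split.
import numpy as np

rng = np.random.default_rng(1)
m, n, c = 50, 200, 0.5
B = rng.standard_normal((m, n))
b = rng.standard_normal(m)
w = rng.standard_normal(n)

w_plus, w_minus = np.maximum(w, 0), np.maximum(-w, 0)
x = np.concatenate([w_plus, w_minus])
A = np.hstack([B, -B])
y = b - A @ x                     # residual from the constraint Ax + y = b

qp_obj = y @ y + c * x.sum()
p1_obj = np.sum((B @ w - b) ** 2) + c * np.abs(w).sum()
assert np.isclose(qp_obj, p1_obj)
```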
KKT Optimality Conditions for the QP:
$$Ax + y = b,\ x \ge 0; \qquad A^T y + z = ce,\ z \ge 0; \qquad Xz = 0 \qquad (X = \mathrm{diag}[x_1, \dots, x_{2n}]).$$
Perturbed KKT Conditions:
$$Ax + y = b,\ x > 0; \qquad A^T y + z = ce,\ z > 0; \qquad Xz = \mu e \quad (\mu > 0).$$
Primal-Dual IP method: Apply damped Newton method to solve inexactly the perturbed KKT equations while maintaining x > 0, z > 0. Decrease µ after each iteration. Fiacco-McCormick ’68, Karmarkar ’84,...
Method description:
Given $\mu > 0$, $x > 0$, $y$, $z > 0$, solve the Newton equations
$$A\Delta x + \Delta y = b - Ax - y, \qquad A^T\Delta y + \Delta z = ce - A^T y - z, \qquad Z\Delta x + X\Delta z = \mu e - Xz.$$
Update
$$x^{\rm new} = x + .99\,\beta_p \Delta x, \qquad y^{\rm new} = y + .99\,\beta_d \Delta y, \qquad z^{\rm new} = z + .99\,\beta_d \Delta z,$$
$$\mu^{\rm new} = \left(1 - \min\{.99, \beta_p, \beta_d\}\right)\mu, \qquad \text{where } \beta_p = \min_{i:\,\Delta x_i < 0} \frac{x_i}{-\Delta x_i}, \quad \beta_d = \min_{i:\,\Delta z_i < 0} \frac{z_i}{-\Delta z_i}.$$
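Below is a dense, toy-scale sketch of one such iteration. It is assumption-laden: it assembles the full Newton system (real implementations use the reduced CG system described next) and caps $\beta_p, \beta_d$ at 1.

```python
# One damped-Newton IP iteration, dense toy version (sketch only).
import numpy as np

def ip_iteration(A, b, c, x, y, z, mu):
    m, N = A.shape                        # N = 2n after the [B, -B] split
    e, I_m, I_N = np.ones(N), np.eye(m), np.eye(N)
    # Newton equations as one (m + 2N)-dimensional system in (dx, dy, dz):
    #   A dx + dy = r1,  A^T dy + dz = r2,  Z dx + X dz = r3.
    K = np.zeros((m + 2 * N, m + 2 * N))
    K[:m, :N], K[:m, N:N + m] = A, I_m
    K[m:m + N, N:N + m], K[m:m + N, N + m:] = A.T, I_N
    K[m + N:, :N], K[m + N:, N + m:] = np.diag(z), np.diag(x)
    rhs = np.concatenate([b - A @ x - y,
                          c * e - A.T @ y - z,
                          mu * e - x * z])
    d = np.linalg.solve(K, rhs)
    dx, dy, dz = d[:N], d[N:N + m], d[N + m:]
    # Ratio tests: largest steps keeping x and z strictly positive
    # (the cap at 1 is this sketch's assumption).
    bp = min(1.0, np.min(x[dx < 0] / -dx[dx < 0], initial=np.inf))
    bd = min(1.0, np.min(z[dz < 0] / -dz[dz < 0], initial=np.inf))
    x, y, z = x + .99 * bp * dx, y + .99 * bd * dy, z + .99 * bd * dz
    mu = (1.0 - min(.99, bp, bd)) * mu
    return x, y, z, mu
```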
Implementation & Initialization:
• The Newton equations reduce to $(I + AZ^{-1}XA^T)\Delta y = r$, solved by the conjugate gradient (CG) method; a matrix-free sketch follows after this list. Multiplication by $A$ ($m \times 2n$) and by $A^T$ requires $O(m \log m)$ and $O(m(\log m)^2)$ operations, respectively.
• Initialization as in Chen-Donoho-Saunders '96.
• Theoretical convergence? CG preconditioning?
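A matrix-free sketch of the reduced-system solve, assuming user-supplied fast operators `apply_A` and `apply_At` (hypothetical names) for multiplication by $A$ and $A^T$:

```python
# Solve (I + A Z^{-1} X A^T) dy = r by CG without forming the matrix;
# apply_A / apply_At are assumed fast transform routines (hypothetical).
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_reduced_system(apply_A, apply_At, x, z, r):
    m = r.size
    matvec = lambda v: v + apply_A((x / z) * apply_At(v))
    dy, info = cg(LinearOperator((m, m), matvec=matvec), r)
    return dy  # info == 0 signals CG convergence
```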
Block Coord. Minimization Method for P1
Method description:
Given $w$, choose $I \subseteq \{1, \dots, n\}$ with $|I| = m$ and $\{B_i\}_{i \in I}$ orthogonal.
Update
$$w^{\rm new} = \arg\min_{u:\, u_i = w_i\ \forall i \notin I} \|Bu - b\|_2^2 + c\|u\|_1 \qquad \leftarrow \text{closed-form soln (sketched below)}$$
• Choose $I$ to maximize $\displaystyle\min_{v \,\in\, \partial_{u_I}\left(\|Bu - b\|_2^2 + c\|u\|_1\right)\big|_{u=w}} \|v\|_2$. This requires $O(m \log m)$ operations by the algorithm of Coifman & Wickerhauser.
• Theoretical convergence: the $w$-sequence is bounded and each cluster point solves P1.
Convergence of the BCM method depends crucially on:
• differentiability of $\|\cdot\|_2^2$
• separability of $\|\cdot\|_1$
• convexity ($\Rightarrow$ global minimum)
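A minimal sketch of the closed-form block update, assuming the block's columns are orthonormal (the threshold $c/2$ reflects the unhalved least-squares term in P1):

```python
# Exact minimization over block I when the columns {B_i, i in I} are
# orthonormal: the subproblem separates and soft-thresholding solves it.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def bcm_update(B, b, w, I, c):
    mask = np.zeros(w.size, dtype=bool)
    mask[I] = True
    r = b - B[:, ~mask] @ w[~mask]   # residual with block I's terms removed
    w_new = w.copy()
    w_new[I] = soft_threshold(B[:, I].T @ r, c / 2.0)
    return w_new
```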
Application 1: Electronic Surveillance

[Figure: spectrograms of the signal before and after denoising (panels bp.cptable.spectrogram and ew.cptable.spectrogram), time in microseconds vs. frequency in KHz.]

$m = 2^{11} = 2048$, $c = 4$, local cosine transform, all but 4 levels.
Method efficiency:

[Figure: objective value vs. CPU time (log scale) for the real and imaginary parts of the signal, comparing the IP and BCR (BCM) methods.]

Comparing CPU times of the IP and BCM methods (S-Plus, Sun Ultra 1).
ML Estimation P2:
$$\min_w\ {-\ell(Bw; b)} + c\sum_{i \in J} |w_i| \qquad (c > 0)$$
$\ell$ is the log-likelihood; $\{B_i\}_{i \notin J}$ are linearly independent "coarse-scale wavelets".
• $-\ell(y; b) = \frac{1}{2}\|y - b\|_2^2$ (Gaussian noise)
• $-\ell(y; b) = \sum_{i=1}^m (y_i - b_i \ln y_i)$, $y_i \ge 0$ (Poisson noise)
Solve P2 by adapting the IP method. BCM?
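For the Poisson case, a minimal sketch of the P2 objective (the function name and its domain handling are this sketch's assumptions):

```python
# P2 objective with Poisson noise: -l(y; b) = sum_i (y_i - b_i*log(y_i)),
# valid only for y = Bw > 0; only coefficients indexed by J are penalized.
import numpy as np

def p2_objective(B, b, w, c, J):
    y = B @ w
    if np.any(y <= 0):
        return np.inf                    # outside the likelihood's domain
    return np.sum(y - b * np.log(y)) + c * np.abs(w[J]).sum()
```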
Application 2: Γ-Ray Astronomy

[Figure: four panels: measurements, Anscombe-based estimate, TIPSH estimate, and $\ell_1$-penalized likelihood estimate.]
m = 720 · 360, c chosen by CV, Symmlets of order 8 (levels 3-8). Spatially inhomogeneous Poisson noise.
But IP method is slow (many CG steps).
Adapt BCM method?
General Problem Model
P3: $\min_w F_c(w) := f(w) + cP(w)$ \quad ($c \ge 0$)

$f: \Re^N \to \Re$ is smooth.
$P: \Re^N \to (-\infty, \infty]$ is proper, convex, lsc, and separable: $P(w) = \sum_{j=1}^n P_j(w_j)$ for $w = (w_1, \dots, w_n)$.
Examples:
• $P(w) = \|w\|_1$
• $P(w) = \begin{cases} 0 & \text{if } l \le w \le u \\ \infty & \text{else} \end{cases}$
Block Coord. Gradient Descent Method for P3
Idea: Do BCM on a quadratic approximation of $f$.

For $w \in \mathrm{dom}\,P$, $I \subseteq \{1, \dots, n\}$, and $H \succ 0$, let $d_H(w; I)$ and $q_H(w; I)$ be the optimal solution and objective value of the direction subproblem
$$\min_{d:\, d_i = 0\ \forall i \notin I} \left\{ g^T d + \tfrac{1}{2} d^T H d + cP(w + d) - cP(w) \right\}, \qquad g = \nabla f(w).$$

Facts:
• $d_H(w; \{1, \dots, n\}) = 0 \iff F_c'(w; d) \ge 0\ \forall d \in \Re^N$ (stationarity)
• $H$ diagonal $\Rightarrow$ $d_H(w; I) = \sum_{i \in I} d_H(w; i)$ and $q_H(w; I) = \sum_{i \in I} q_H(w; i)$ (separability)
Method description:
Given $w \in \mathrm{dom}\,P$, choose $I \subseteq \{1, \dots, n\}$ and $H \succ 0$. Let $d = d_H(w; I)$.
Update $w^{\rm new} = w + \alpha d$ ($\alpha > 0$), with the following rules (a sketch of one such step, for $P(w) = \|w\|_1$ and diagonal $H$, follows this list):
• $\alpha$ = largest element of $\{1, \beta, \beta^2, \dots\}$ satisfying $F_c(w + \alpha d) - F_c(w) \le \sigma\alpha\, q_H(w; I)$ ($0 < \beta < 1$, $0 < \sigma < 1$). (Armijo)
• $I = \{1\}, \{2\}, \dots, \{n\}, \{1\}, \{2\}, \dots$ (Gauss-Seidel)
• $\|d_D(w; I)\|_\infty \ge \upsilon\, \|d_D(w; \{1, \dots, n\})\|_\infty$ ($0 < \upsilon \le 1$, $D \succ 0$ diagonal, e.g., $D = I$ or $D = \mathrm{diag}(H)$). (Gauss-Southwell-d)
• $q_D(w; I) \le \upsilon\, q_D(w; \{1, \dots, n\})$. (Gauss-Southwell-q)
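The promised sketch of one BCGD step: for $P(w) = \|w\|_1$ and diagonal $H = \mathrm{diag}(h)$, the direction subproblem separates and is solved coordinatewise by soft-thresholding. Here `f` and `grad_f` are assumed user-supplied callables; this is an illustration, not the talk's Matlab implementation.

```python
# One BCGD step for F_c(w) = f(w) + c*||w||_1 with diagonal H = diag(h).
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def bcgd_step(f, grad_f, w, I, h, c, beta=0.5, sigma=0.1):
    g = grad_f(w)
    d = np.zeros_like(w)
    # d_H(w; I): minimize g_i*d + h_i*d^2/2 + c|w_i + d| over each i in I.
    d[I] = soft_threshold(w[I] - g[I] / h[I], c / h[I]) - w[I]
    # q_H(w; I): optimal value of the direction subproblem (<= 0).
    q = (g[I] @ d[I] + 0.5 * (h[I] * d[I]) @ d[I]
         + c * (np.abs(w[I] + d[I]).sum() - np.abs(w[I]).sum()))
    Fc = lambda v: f(v) + c * np.abs(v).sum()
    alpha, F0 = 1.0, Fc(w)
    while Fc(w + alpha * d) - F0 > sigma * alpha * q:  # Armijo rule
        alpha *= beta
    return w + alpha * d
```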
Convergence Results:
(a) If
• $0 < \underline{\lambda} \le \lambda_i(D), \lambda_i(H) \le \bar{\lambda}$ for all $i$,
• $\alpha$ is chosen by the Armijo rule,
• $I$ is chosen by the Gauss-Seidel, Gauss-Southwell-d, or Gauss-Southwell-q rule,
then every cluster point of the $w$-sequence generated by the BCGD method is a stationary point of $F_c$.
(b) If in addition $P$ and $f$ satisfy any of the following assumptions, then the $w$-sequence converges at an R-linear rate (excepting Gauss-Southwell-d):
C1. $f$ is strongly convex and $\nabla f$ is Lipschitz continuous on $\mathrm{dom}\,P$.
C2. $f$ is (possibly nonconvex) quadratic and $P$ is polyhedral.
C3. $f(w) = g(Ew) + q^T w$, where $E \in \Re^{m \times N}$, $q \in \Re^N$, $g$ is strongly convex, and $\nabla g$ is Lipschitz continuous on $\Re^m$; $P$ is polyhedral.
C4. $f(w) = \max_{y \in Y}\{(Ew)^T y - g(y)\} + q^T w$, where $Y \subseteq \Re^m$ is polyhedral, $E \in \Re^{m \times N}$, $q \in \Re^N$, $g$ is strongly convex, and $\nabla g$ is Lipschitz continuous on $\Re^m$; $P$ is polyhedral.
Notes:
• BCGD has a stronger global convergence property (and a cheaper iteration) than BCM.
• The proof of (b) uses a local error bound on $\mathrm{dist}(w, \{\text{stationary points of } F_c\})$.
Numerical Testing (ongoing):
• Implement the BCGD method in Matlab.
• Numerical tests with $f$ from the Moré-Garbow-Hillstrom set and the CUTEr set (Gould, Orban, Toint '05), $P(w) = \|w\|_1$, and different $c$ (e.g., $c = .1, 1, 10$).
• Comparison with MINOS 5.5.1 (Murtagh, Saunders '05), a Fortran implementation of an active-set method, applied to the reformulation of P3 with $P(w) = \|w\|_1$ as
$$\min_{w^+ \ge 0,\, w^- \ge 0} f(w^+ - w^-) + c\, e^T(w^+ + w^-).$$
• Preliminary results are "promising".
f(w)  n     Description
BAL   1000  Brown almost-linear function; nonconvex; dense Hessian.
BT    1000  Broyden tridiagonal function; nonconvex; sparse Hessian.
DBV   1000  Discrete boundary value function; nonconvex; sparse Hessian.
EPS   1000  Extended Powell singular function; convex; 4-block diagonal Hessian.
ER    1000  Extended Rosenbrock function; nonconvex; 2-block diagonal Hessian.
QD1   1000  $f(w) = \left(\sum_{i=1}^n w_i - 1\right)^2$; convex quadratic; rank-1 Hessian.
QD2   1000  $f(w) = \sum_{i=1}^n \left(w_i - \frac{2}{n+1}\sum_{j=1}^n w_j - 1\right)^2 + \left(\frac{2}{n+1}\sum_{j=1}^n w_j + 1\right)^2$; strongly convex quadratic; dense Hessian.
VD    1000  $f(w) = \sum_{i=1}^n (w_i - 1)^2 + \left(\sum_{j=1}^n j(w_j - 1)\right)^2 + \left(\sum_{j=1}^n j(w_j - 1)\right)^4$; strongly convex; dense, ill-conditioned Hessian.

Table 1: Least squares problems from Moré, Garbow, Hillstrom, 1981.
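As a usage illustration, the QD1 entry above in code, paired with the `bcgd_step` sketch from earlier (the pairing is illustrative, not the talk's Matlab implementation):

```python
# QD1 from Table 1: f(w) = (sum_i w_i - 1)^2, convex quadratic, rank-1 Hessian.
import numpy as np

def qd1(w):
    s = w.sum() - 1.0
    return s * s

def qd1_grad(w):
    return 2.0 * (w.sum() - 1.0) * np.ones_like(w)

# e.g.: w = bcgd_step(qd1, qd1_grad, np.ones(1000), I=np.arange(1000),
#                     h=2.0 * np.ones(1000), c=0.1)
```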
f(w)  c    MINOS              BCGD-G-Southwell-d   BCGD-G-Southwell-q
           #nz/objec/cpu      #nz/objec/cpu        #nz/objec/cpu
BAL   1    1000/1000/43.9     1000/1000/.1         1000/1000/.1
      10   1000/9999.9/43.9   1000/9999.9/.1       1000/9999.9/.1
      100  1000/99997.5/44.3  1000/99997.5/.1      1000/99997.5/.2
BT    .1   1000/71.725/134.4  999/71.394/4.5       999/71.394/5.0
      1    999/672.41/95.3    21/672.70/292.6      995/991.06/1.3(?)
      10   0/1000/77.7        0/1000/.01           0/1000/.01
DBV   .1   0/0/52.7           0/4.5E-9/.1          0/4.5E-9/.04
      1    0/0/52.9           0/4.5E-9/.1          0/4.5E-9/.04
      10   0/0/53.0           0/4.5E-9/.01         0/4.5E-9/.01
EPS   1    1000/351.14/58.5   500/351.14/.3        500/351.14/.3
      10   243/1250/45.7      250/1250/.05         250/1250/.05
      100  0/1250/50.7        0/1250/.01           0/1250/.02
ER    1    1000/436.25/72.0   1000/436.25/.5       1000/436.25/.4
      10   0/500/51.5         0/500/.1             0/500/.01
      100  0/500/52.4         0/500/.03            0/500/.01
QD1   .1   1000/.0975/29.9    1/.0975/.01          1/.0975/.02
      1    1000/.75/37.8      1/.75/.01            1/.75/.01
      10   0/1/38.6           0/1/.01              0/1/.01
QD2   .1   1000/98.5/74.2     0/98.5/.01           0/98.5/.03
      1    1000/751/75.8      0/751/.01            0/751/.02
      10   0/1001/53.1        0/1001/.01           0/1001/.01
VD    1    1000/937.59/43.9   1000/937.66/856.3    1000/937.66/869.0
      10   413/6726.80/57.1   1000/6746.74/235.7   999/6746.74/246.9
      100  136/55043/57.8     1000/55078/12.6      1000/55078/13.3

Table 2: Performance of MINOS and BCGD with Gauss-Southwell-d/q rules, $w^{\rm init} = (1, 1, \dots, 1)$. Each entry is #nonzeros/objective/CPU time.
Conclusions & Future Work
1. For ML estimation, the $\ell_1$ penalty imparts parsimony in the coefficients and avoids oversmoothing the signals.
2. The resulting estimation problem can be solved effectively by the IP method or the BCM method, exploiting the problem structure, including the nondifferentiability of the $\ell_1$-norm. Which to use depends on the problem.
3. Problem reformulation may be needed.
4. For the general problem model, we propose the BCGD method. Numerical testing is ongoing.
5. Applications to denoising, regression, SVM?