# Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization


## Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization¹

P. TSENG²

Communicated by O. L. Mangasarian

Abstract. We study the convergence properties of a (block) coordinate descent method applied to minimize a nondifferentiable (nonconvex) function $f(x_1,\dots,x_N)$ with certain separability and regularity properties. Assuming that f is continuous on a compact level set, the subsequence convergence of the iterates to a stationary point is shown when either f is pseudoconvex in every pair of coordinate blocks from among $N-1$ coordinate blocks or f has at most one minimum in each of $N-2$ coordinate blocks. If f is quasiconvex and hemivariate in every coordinate block, then the assumptions of continuity of f and compactness of the level set may be relaxed further. These results are applied to derive new (and old) convergence results for the proximal minimization algorithm, an algorithm of Arimoto and Blahut, and an algorithm of Han.

They are applied also to a problem of blind source separation.

Key Words. Block coordinate descent, nondifferentiable minimization, stationary point, Gauss–Seidel method, convergence, quasiconvex functions, pseudoconvex functions.

1. Introduction

A popular method for minimizing a real-valued continuously differentiable function f of n real variables, subject to bound constraints, is the (block) coordinate descent method. In this method, the coordinates are partitioned into N blocks and, at each iteration, f is minimized with respect to one of the coordinate blocks while the other coordinates are held fixed. This method, which is related closely to the Gauss–Seidel and SOR methods for equation solving (Ref. 1), was studied early by Hildreth (Ref. 2) and Warga (Ref. 3), and is described in various books on optimization (Refs. 1 and 4–10). Its applications include channel capacity computation (Refs. 11–12), image reconstruction (Ref. 7), dynamic programming (Refs. 13–15), and flow routing (Ref. 16). It may be applied also to the dual of a linearly constrained, strictly convex program to obtain various decomposition methods (see Refs. 6–7, 17–22, and references therein) and parallel SOR methods (Ref. 23).

¹This work was partially supported by the National Science Foundation Grant CCR-9731273.

²Professor, Department of Mathematics, University of Washington, Seattle, Washington.

0022-3239/01/0600-0475 $19.50 © 2001 Plenum Publishing Corporation

Convergence of the (block) coordinate descent method requires typically that f be strictly convex (or quasiconvex or hemivariate) differentiable and, taking into account the bound constraints, have bounded level sets (e.g., Refs. 3–4 and 24–25). Zadeh (Ref. 26; also see Ref. 27) relaxed the strict convexity assumption to pseudoconvexity, which allows f to have a nonunique minimum along coordinate directions. For certain classes of convex functions, the level sets need not be bounded (see Refs. 2, 6–7, 17, 19–22, and references therein). If f is not (pseudo)convex, then an example of Powell (Ref. 28) shows that the method may cycle without approaching any stationary point of f. Nonetheless, convergence can be shown for special cases of non(pseudo)convex f, as when f is quadratic (Ref. 29), or f is strictly pseudoconvex in each of $N-2$ coordinate blocks (Ref. 27), or f has a unique minimum in each coordinate block (Ref. 8, p. 159). If f is not differentiable, the coordinate descent method may get stuck at a nonstationary point even when f is convex (e.g., Ref. 4, p. 94). For this reason, it is perceived generally that the method is unsuitable when f is nondifferentiable. However, an exception occurs when the nondifferentiable part of f is separable. Such a structure for f was considered first by Auslender (Ref. 4, p. 94) in the case where f is strongly convex. This structure is implicit in a decomposition method and projection method of Han (Refs. 18, 30), for which f is the convex dual functional associated with a certain linearly constrained convex program (see Ref. 22 for detailed discussions). This structure arises also in least-square problems where an $\ell_1$-penalty is placed on a subset of the parameters in order to minimize the support (see Refs. 31–33 and references therein).

Motivated by the preceding works, we consider in this paper the nondifferentiable (nonconvex) case where the nondifferentiable part of f is separable. Specifically, we assume that f has the following special form:

$$f(x_1,\dots,x_N) = f_0(x_1,\dots,x_N)+\sum_{k=1}^N f_k(x_k), \tag{1}$$

for some $f_0:\Re^{n_1+\cdots+n_N}\to\Re\cup\{\infty\}$ and some $f_k:\Re^{n_k}\to\Re\cup\{\infty\}$, $k = 1,\dots,N$. Here, $N, n_1,\dots,n_N$ are positive integers. We assume that f is proper, i.e., $f\not\equiv\infty$. We will refer to each $x_k$, $k = 1,\dots,N$, as a coordinate block of $x = (x_1,\dots,x_N)$. We will show that each cluster point of the iterates generated by the (block) coordinate descent method is a stationary point of f, provided that $f_0$ has a certain smoothness property (see Lemma 3.1), f is continuous on a compact level set, and either f is pseudoconvex in every pair of coordinate blocks from among $N-1$ coordinate blocks, or f has at most one minimum in each of $N-2$ coordinate blocks (see Theorem 4.1).

If f is quasiconvex and hemivariate in every coordinate block, then the assumptions of continuity of f and compactness of the level set may be relaxed further (see Proposition 5.1). These results unify and extend some previous results in Refs. 4, 6, 8, 26–27. For example, previous results assumed that f is pseudoconvex and that $f_1,\dots,f_N$ are indicator functions for closed convex sets, whereas we assume only that f is pseudoconvex in every pair of coordinate blocks from among $N-1$ coordinate blocks, with no additional assumption made on $f_1,\dots,f_N$. Previous results also did not consider the case where f is not continuous on its effective domain. Lastly, we apply our results to derive new (and old) convergence results for the proximal minimization algorithm, an algorithm of Arimoto and Blahut (Refs. 11–12), and an algorithm of Han (Ref. 30); see Examples 6.1–6.3.

We also apply them to a problem of blind source separation described in Refs. 31, 33; see Example 6.4.

In our notation, $\Re^m$ denotes the space of m-dimensional real column vectors. For any $x, y\in\Re^m$, we denote by $\langle x, y\rangle$ the Euclidean inner product of x and y, and by $\|x\|$ the Euclidean norm of x, i.e., $\|x\| = \langle x, x\rangle^{1/2}$.

For any set $S\subseteq\Re^m$, we denote by $\operatorname{int}(S)$ the interior of S and denote $\operatorname{bdry}(S) = S\setminus\operatorname{int}(S)$.

For any $h:\Re^m\to\Re\cup\{\infty\}$, we denote by dom h the effective domain of h, i.e.,

$$\operatorname{dom} h = \{x\in\Re^m \mid h(x) < \infty\}.$$

For any $x\in\operatorname{dom} h$ and any $d\in\Re^m$, we denote the (lower) directional derivative of h at x in the direction d by

$$h'(x; d) = \liminf_{\lambda\downarrow 0}\, [h(x+\lambda d)-h(x)]/\lambda.$$

We say that h is quasiconvex if

$$h(x+\lambda d)\le\max\{h(x), h(x+d)\}, \quad \text{for all } x, d \text{ and } \lambda\in[0, 1];$$

h is pseudoconvex if

$$h(x+d)\ge h(x), \quad \text{whenever } x\in\operatorname{dom} h \text{ and } h'(x; d)\ge 0;$$

see Ref. 34, p. 146; and h is hemivariate if h is not constant on any line segment belonging to dom h (Ref. 1). For any nonempty $I\subseteq\{1,\dots,m\}$, we say that $h(x_1,\dots,x_m)$ is pseudoconvex [respectively, has at most one minimum point] in $(x_i)_{i\in I}$ if h, viewed as a function of $(x_i)_{i\in I}$ with the remaining coordinates held fixed, is pseudoconvex [respectively, has at most one minimum point].

2. Block Coordinate Descent Method

We describe formally the block coordinate descent (BCD) method below.

BCD Method.

Initialization. Choose any $x^0 = (x^0_1,\dots,x^0_N)\in\operatorname{dom} f$.

Iteration $r+1$, $r\ge 0$. Given $x^r = (x^r_1,\dots,x^r_N)\in\operatorname{dom} f$, choose an index $s\in\{1,\dots,N\}$ and compute a new iterate $x^{r+1} = (x^{r+1}_1,\dots,x^{r+1}_N)\in\operatorname{dom} f$ satisfying

$$x^{r+1}_s\in\arg\min_{x_s} f(x^r_1,\dots,x^r_{s-1}, x_s, x^r_{s+1},\dots,x^r_N), \tag{2}$$

$$x^{r+1}_j = x^r_j, \quad \forall j\ne s. \tag{3}$$

We note that the minimization in (2) is attained if $X^0 = \{x : f(x)\le f(x^0)\}$ is bounded and f is lower semicontinuous (lsc) on $X^0$, so $X^0$ is compact (Ref. 35). Alternatively, this minimization is attained if f is convex, has a minimum point, and is hemivariate in each coordinate block (but the level sets of f need not be bounded). To ensure convergence, we need further that each coordinate block is chosen sufficiently often in the method. In particular, we will choose the coordinate blocks according to the following rule (see, e.g., Refs. 7–8, 21, 25).

Essentially Cyclic Rule. There exists a constant $T\ge N$ such that every index $s\in\{1,\dots,N\}$ is chosen at least once between the rth iteration and the $(r+T-1)$th iteration, for all r.

A well-known special case of this rule, for which $T = N$, is given below.

Cyclic Rule. Choose $s = k$ at iterations $k, k+N, k+2N, \dots$, for $k = 1,\dots,N$.
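To make the scheme concrete, the sketch below runs the BCD method (2)-(3) with the cyclic rule on a small strictly convex quadratic, for which each block minimization has a closed form. The quadratic, the block partition, and all names are illustrative choices of ours, not data from the paper.

```python
import numpy as np

def bcd_quadratic(Q, b, blocks, x0, sweeps=50):
    """Cyclic BCD for f(x) = (1/2)<x, Qx> - <b, x>, Q symmetric positive definite.

    Each block step solves (2) exactly: x_s = Q_ss^{-1} (b_s - Q_st x_t),
    where t indexes the coordinates held fixed, as in (3)."""
    x = np.array(x0, dtype=float)
    for r in range(sweeps * len(blocks)):
        s = blocks[r % len(blocks)]              # cyclic rule: block s at r, r+N, ...
        t = np.setdiff1d(np.arange(len(b)), s)   # frozen coordinates
        x[s] = np.linalg.solve(Q[np.ix_(s, s)], b[s] - Q[np.ix_(s, t)] @ x[t])
    return x

Q = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
blocks = [np.array([0]), np.array([1, 2])]       # N = 2 coordinate blocks
x = bcd_quadratic(Q, b, blocks, np.zeros(3))
print(np.allclose(x, np.linalg.solve(Q, b)))     # True: converges to the minimizer
```

Since this f is strictly convex, the iterates converge to the unique minimizer, consistent with the classical results cited above.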


3. Stationary Points of f

We say that z is a stationary point of f if $z\in\operatorname{dom} f$ and $f'(z; d)\ge 0$, $\forall d$. We say that z is a coordinatewise minimum point of f if $z\in\operatorname{dom} f$ and

$$f(z+(0,\dots,d_k,\dots,0))\ge f(z), \quad \forall d_k\in\Re^{n_k}, \tag{4}$$

for all $k = 1,\dots,N$. Here and throughout, we denote by $(0,\dots,d_k,\dots,0)$ the vector in $\Re^{n_1+\cdots+n_N}$ whose kth coordinate block is $d_k$ and whose other coordinates are zero. We say that f is regular at $z\in\operatorname{dom} f$ if

$$f'(z; d)\ge 0, \quad \forall d = (d_1,\dots,d_N) \text{ such that } f'(z; (0,\dots,d_k,\dots,0))\ge 0,\ k = 1,\dots,N. \tag{5}$$

This notion of regularity is weaker than that used by Auslender (Ref. 4, p. 93), which entails

$$f'(z; d) = \sum_{k=1}^N f'(z; (0,\dots,d_k,\dots,0)), \quad \text{for all } d = (d_1,\dots,d_N).$$

For example, the function

$$f(x_1, x_2) = \varphi(x_1, x_2)+\varphi(-x_1, x_2)+\varphi(x_1,-x_2)+\varphi(-x_1,-x_2),$$

where

$$\varphi(a, b) = \max\{0,\ a+b-\sqrt{a^2+b^2}\},$$

is regular at $z = (0, 0)$ in the sense of (5), but is not regular in the sense of Ref. 4, p. 93.
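A quick numeric check of this example (assuming the reconstructed $\varphi(a,b)=\max\{0,\,a+b-\sqrt{a^2+b^2}\}$): at $z=(0,0)$ the directional derivative along each coordinate axis is 0, yet along $d=(1,1)$ it equals $2-\sqrt{2}>0$, so the additivity required by Auslender's regularity fails, while (5) holds trivially since f is nonnegative with $f(0,0)=0$.

```python
import numpy as np

# phi as reconstructed above (an assumption of this sketch)
phi = lambda a, b: max(0.0, a + b - np.hypot(a, b))

def f(x1, x2):
    return phi(x1, x2) + phi(-x1, x2) + phi(x1, -x2) + phi(-x1, -x2)

def dirderiv(d1, d2, lam=1e-8):
    # forward-difference estimate of f'((0,0); d)
    return (f(lam * d1, lam * d2) - f(0.0, 0.0)) / lam

print(dirderiv(1, 0))   # 0.0 along e1
print(dirderiv(0, 1))   # 0.0 along e2
print(dirderiv(1, 1))   # approx 2 - sqrt(2): not the sum of the two above
```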

Since (4) implies

$$f'(z; (0,\dots,d_k,\dots,0))\ge 0, \quad \text{for all } d_k,$$

it follows that a coordinatewise minimum point z of f is a stationary point of f whenever f is regular at z. To ensure regularity of f at z, we consider one of the following smoothness assumptions on $f_0$:

(A1) $\operatorname{dom} f_0$ is open and $f_0$ is Gâteaux-differentiable on $\operatorname{dom} f_0$.

(A2) $f_0$ is Gâteaux-differentiable on $\operatorname{int}(\operatorname{dom} f_0)$ and, for every $z\in\operatorname{dom} f\cap\operatorname{bdry}(\operatorname{dom} f_0)$, there exist $k\in\{1,\dots,N\}$ and $d_k\in\Re^{n_k}$ such that $f(z+(0,\dots,d_k,\dots,0)) < f(z)$.

Assumption A1 was considered essentially by Auslender (Ref. 4, Example 2 on p. 94). In contrast to Assumption A1, Assumption A2 allows dom f0 to include boundary points. We will see an application (Example 6.2) where A2 holds but not A1.

Lemma 3.1. Under A1, f is regular at each z∈dom f. Under A2, f is regular at each coordinatewise minimum point z of f.

Proof. Under A1, if $z = (z_1,\dots,z_N)\in\operatorname{dom} f$, then $z\in\operatorname{dom} f_0$. Under A2, if $z = (z_1,\dots,z_N)$ is a coordinatewise minimum point of f, then $z\notin\operatorname{bdry}(\operatorname{dom} f_0)$, so $z\in\operatorname{int}(\operatorname{dom} f_0)$. Thus, under either A1 or A2, $f_0$ is Gâteaux-differentiable at z. Fix any $d = (d_1,\dots,d_N)$ such that

$$f'(z; (0,\dots,d_k,\dots,0))\ge 0, \quad k = 1,\dots,N.$$

Then,

$$f'(z; d) = \langle\nabla f_0(z), d\rangle+\liminf_{\lambda\downarrow 0}\sum_{k=1}^N [f_k(z_k+\lambda d_k)-f_k(z_k)]/\lambda$$

$$\ge \langle\nabla f_0(z), d\rangle+\sum_{k=1}^N \liminf_{\lambda\downarrow 0}\, [f_k(z_k+\lambda d_k)-f_k(z_k)]/\lambda = \langle\nabla f_0(z), d\rangle+\sum_{k=1}^N f_k'(z_k; d_k)$$

$$= \sum_{k=1}^N f'(z; (0,\dots,d_k,\dots,0)) \ge 0. \qquad\square$$

4. Convergence Analysis: I

Our first convergence result unifies and extends a result of Auslender (Ref. 4, p. 95) for the nondifferentiable convex case and some results of Grippo and Sciandrone (Ref. 27), Luenberger (Ref. 8, p. 159), and Zadeh (Ref. 26) for the differentiable case. In what follows, $r\equiv(N-1)\bmod N$ means $r = N-1, 2N-1, 3N-1, \dots$.

Theorem 4.1. Assume that the level set $X^0 = \{x : f(x)\le f(x^0)\}$ is compact and that f is continuous on $X^0$. Then, the sequence $\{x^r = (x^r_1,\dots,x^r_N)\}_{r=0,1,\dots}$ generated by the BCD method using the essentially cyclic rule is defined and bounded. Moreover, the following statements hold:

(a) If $f(x_1,\dots,x_N)$ is pseudoconvex in $(x_k, x_i)$ for every $i, k\in\{1,\dots,N\}$, and if f is regular at every $x\in X^0$, then every cluster point of $\{x^r\}$ is a stationary point of f.

(b) If $f(x_1,\dots,x_N)$ is pseudoconvex in $(x_k, x_i)$ for every $i, k\in\{1,\dots,N-1\}$, if f is regular at every $x\in X^0$, and if the cyclic rule is used, then every cluster point of $\{x^r\}_{r\equiv(N-1)\bmod N}$ is a stationary point of f.

(c) If $f(x_1,\dots,x_N)$ has at most one minimum in $x_k$ for $k = 2,\dots,N-1$, and if the cyclic rule is used, then every cluster point z of $\{x^r\}_{r\equiv(N-1)\bmod N}$ is a coordinatewise minimum point of f. In addition, if f is regular at z, then z is a stationary point of f.

Proof. Since $X^0$ is compact, an induction argument on r shows that $x^{r+1}$ is defined, $f(x^{r+1})\le f(x^r)$, and $x^{r+1}\in X^0$ for all $r = 0, 1, \dots$. Thus, $\{x^r\}$ is bounded. Consider any subsequence $\{x^r\}_{r\in R}$, with $R\subseteq\{0, 1, \dots\}$, converging to some z. For each $j\in\{1,\dots,T\}$, $\{x^{r-T+1+j}\}_{r\in R}$ is bounded, so by passing to a subsequence, if necessary, we can assume that

$$\{x^{r-T+1+j}\}_{r\in R} \text{ converges to some } z^j = (z^j_1,\dots,z^j_N), \quad j = 1,\dots,T.$$

Thus, $z^{T-1} = z$.

Since $\{f(x^r)\}$ converges monotonically and f is continuous on $X^0$, we obtain that

$$f(x^0)\ \ge\ \lim_{r\to\infty} f(x^r) = f(z^1) = \cdots = f(z^T). \tag{6}$$

By further passing to a subsequence, if necessary, we can assume that the index s chosen at iteration $r-T+1+j$, $j\in\{1,\dots,T\}$, is the same for all $r\in R$, which we denote by $s_j$.

For each $j\in\{1,\dots,T\}$, since $s_j$ is chosen at iteration $r-T+1+j$ for $r\in R$, then (2) and (3) yield

$$f(x^{r-T+1+j})\le f(x^{r-T+1+j}+(0,\dots,d_{s_j},\dots,0)), \quad \forall d_{s_j},\ j = 1,\dots,T,$$

$$x^{r-T+1+j}_k = x^{r-T+j}_k, \quad \forall k\ne s_j,\ j = 2,\dots,T.$$

Then, the continuity of f on $X^0$ yields in the limit that

$$f(z^j)\le f(z^j+(0,\dots,d_{s_j},\dots,0)), \quad \forall d_{s_j},\ j = 1,\dots,T,$$

$$z^j_k = z^{j-1}_k, \quad \forall k\ne s_j,\ j = 2,\dots,T. \tag{7}$$

Then, (6) and (7) yield

$$f(z^{j-1})\le f(z^{j-1}+(0,\dots,d_{s_j},\dots,0)), \quad \forall d_{s_j},\ j = 2,\dots,T. \tag{8}$$

(a), (b) Suppose that f is regular at every $x\in X^0$ and that $f(x_1,\dots,x_N)$ is pseudoconvex in $(x_k, x_i)$ for every $i, k\in\{s_1\}\cup\cdots\cup\{s_{T-1}\}$. This holds under the assumption (a) or under the assumption (b), with $\{x^r\}_{r\in R}$ being any convergent subsequence of $\{x^r\}_{r\equiv(N-1)\bmod N}$. We claim that, for $j = 1,\dots,T-1$,

$$f(z^j)\le f(z^j+(0,\dots,d_k,\dots,0)), \quad \forall d_k,\ \forall k = s_1,\dots,s_j. \tag{9}$$

By (7), (9) holds for $j = 1$. Suppose that (9) holds for $j = 1,\dots,l-1$ for some $l\in\{2,\dots,T-1\}$. We show that (9) holds for $j = l$. From (8), we have that

$$f(z^{l-1})\le f(z^{l-1}+(0,\dots,d_{s_l},\dots,0)), \quad \forall d_{s_l},$$

implying

$$f'(z^{l-1}; (0,\dots,z^l_{s_l}-z^{l-1}_{s_l},\dots,0))\ge 0.$$

Also, since (9) holds for $j = l-1$, we have that, for each $k = s_1,\dots,s_{l-1}$,

$$f'(z^{l-1}; (0,\dots,d_k,\dots,0))\ge 0, \quad \forall d_k.$$

Since by (6) $z^{l-1}\in X^0$, so f is regular at $z^{l-1}$, the above two relations imply

$$f'(z^{l-1}; (0,\dots,d_k,\dots,0)+(0,\dots,z^l_{s_l}-z^{l-1}_{s_l},\dots,0))\ge 0, \quad \forall d_k.$$

Since f is pseudoconvex in $(x_k, x_{s_l})$, this yields [also using $z^l = z^{l-1}+(0,\dots,z^l_{s_l}-z^{l-1}_{s_l},\dots,0)$] for $k = s_1,\dots,s_{l-1}$ that

$$f(z^l+(0,\dots,d_k,\dots,0))\ge f(z^{l-1}) = f(z^l), \quad \forall d_k.$$

Since we have also that (7) holds with $j = l$, we see that (9) holds for $j = l$. By induction, (9) holds for all $j = 1,\dots,T-1$.

Since $z^{T-1} = z$ and (9) holds for $j = T-1$, then (4) holds for $k = s_1,\dots,s_{T-1}$. Since $z^{T-1} = z$ and (8) holds (in particular, for $j = T$), then (4) holds for $k = s_T$ also. Since

$$\{1,\dots,N\} = \{s_1\}\cup\cdots\cup\{s_T\},$$

this implies that z is a coordinatewise minimum point of f. Since f is regular at z, then z is in fact a stationary point of f.

(c) Suppose that $f(x_1,\dots,x_N)$ has at most one minimum in $x_k$ for $k = s_2,\dots,s_{T-1}$. This holds under the assumption (c), with $\{x^r\}_{r\in R}$ being any convergent subsequence of $\{x^r\}_{r\equiv(N-1)\bmod N}$. For each $j = 2,\dots,T-1$, since (7) and (8) hold, then the function

$$d_{s_j}\mapsto f(z^j+(0,\dots,d_{s_j},\dots,0))$$

attains its minimum at both $d_{s_j} = 0$ and $d_{s_j} = z^{j-1}_{s_j}-z^j_{s_j}$. By assumption, the minimum point is unique, implying $0 = z^{j-1}_{s_j}-z^j_{s_j}$ or, equivalently, $z^{j-1} = z^j$. Thus, $z^1 = z^2 = \cdots = z^{T-1} = z$ and (7) yields that (4) holds for $k = s_1,\dots,s_{T-1}$. Since $z^{T-1} = z$ and (8) holds (in particular, for $j = T$), then (4) holds for $k = s_T$ also. Since

$$\{1,\dots,N\} = \{s_1\}\cup\cdots\cup\{s_T\},$$

this implies that z is a coordinatewise minimum point of f. If f is regular at z, then z is also a stationary point of f. $\square$

Notice that, if f is pseudoconvex, then f is pseudoconvex in $(x_k, x_i)$ for every $i, k\in\{1,\dots,N\}$; if f is quasiconvex and hemivariate in $x_k$, then f has at most one minimum in $x_k$. The converses do not hold. For example, the 2-variable Rosenbrock function has a unique minimum point but is not quasiconvex. The following 3-variable quadratic function

$$f(x_1, x_2, x_3) = (1/2)x_1^2+(1/2)x_2^2+(1/2)x_3^2+x_1x_3+x_2x_3-x_1x_2$$

is convex in every pair of variables, but is not pseudoconvex. In particular, for $x = (0, 0, 1/2)$ and $d = (1, 1, -1)$, we have $f'(x; d) = 1/2\ge 0$, while $f(x+d) = -7/8 < f(x) = 1/8$. This example generalizes to any quadratic function

$$f(x) = \langle x, Qx\rangle,$$

where $Q\in\Re^{N\times N}$ is symmetric, not positive semidefinite, but whose $2\times 2$ principal submatrices are positive semidefinite. Then, for any d satisfying $\langle d, Qd\rangle < 0$ and any x satisfying $0\le\langle x, Qd\rangle < -(1/2)\langle d, Qd\rangle$, we have that $f'(x; d)\ge 0$, while $f(x+d) < f(x)$.
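The 3-variable counterexample can be checked numerically; the short script below (our illustration) confirms that the directional derivative at x along d is +1/2 while f still drops from 1/8 to −7/8 at x + d, which is exactly the failure of pseudoconvexity.

```python
import numpy as np

def f(x):
    x1, x2, x3 = x
    return 0.5 * (x1**2 + x2**2 + x3**2) + x1 * x3 + x2 * x3 - x1 * x2

def grad(x):
    # gradient of the quadratic above
    return np.array([x[0] + x[2] - x[1],
                     x[1] + x[2] - x[0],
                     x[2] + x[0] + x[1]])

x = np.array([0.0, 0.0, 0.5])
d = np.array([1.0, 1.0, -1.0])
print(grad(x) @ d)       # f'(x; d) = 0.5 >= 0
print(f(x + d), f(x))    # -0.875 < 0.125, so f is not pseudoconvex
```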

Thus, parts (a) and (c) of Theorem 4.1 may be viewed as extensions of two results of Grippo and Sciandrone (Ref. 27, Propositions 5.2, 5.3) for the case of $f_0$ being continuously differentiable and each $f_k$ being the indicator function of some closed convex set. In turn, the first of these results extended a result of Zadeh (Ref. 26) for which $f_k\equiv 0$ for all k. Part (b) makes a less restrictive assumption on f than part (a), though its assumption on the BCD method is more restrictive. Part (b) is sharp in the sense that it is false if instead we assume that f is convex in every coordinate block. This is because the Powell 3-variable example (Ref. 28) is convex in each variable; see Ref. 27, Section 6 for further discussions of the example. We will see an application (Example 6.4) in which part (b) applies but not part (a) nor (c).

5. Convergence Analysis: II

The convergence analysis of the previous section assumes f to be continuous on a bounded level set and makes no use of the special structure (1) of f. In this section, we show that this assumption can be relaxed by exploiting the special structure (1), provided that f is quasiconvex and hemivariate in each coordinate block. More precisely, we will make the following assumptions on $f, f_0, f_1,\dots,f_N$:

(B1) $f_0$ is continuous on $\operatorname{dom} f_0$.

(B2) For each $k\in\{1,\dots,N\}$ and $(x_j)_{j\ne k}$, the function $x_k\mapsto f(x_1,\dots,x_N)$ is quasiconvex and hemivariate.

(B3) $f_0, f_1,\dots,f_N$ are lsc.

We will see some applications (Ref. 6, Section 3.4.3 and Examples 6.1–6.3) for which f satisfies this weaker assumption although it is not strictly convex. In addition, we will make one of the following technical assumptions on $f_0$:

(C1) $\operatorname{dom} f_0$ is open and $f_0$ tends to $\infty$ at every boundary point of $\operatorname{dom} f_0$.

(C2) $\operatorname{dom} f_0 = Y_1\times\cdots\times Y_N$, for some $Y_k\subseteq\Re^{n_k}$, $k = 1,\dots,N$.

In contrast to Assumption C1, Assumption C2 allows $f_0$ to have a finite value on $\operatorname{bdry}(\operatorname{dom} f_0)$. We will see in Example 6.2 a nonseparable function $f_0$ that satisfies Assumptions B1–B3 and C2, but not C1. We show below that Assumptions B1–B3, together with either Assumption C1 or C2, ensure that every cluster point of the iterates generated by the BCD method is a coordinatewise minimum point of f. The proof of this result is patterned after an argument given by Bertsekas and Tsitsiklis (Ref. 6, pp. 220–221; also see Ref. 27), but is complicated by the fact that f is not necessarily differentiable (or even continuous) on its effective domain.

Proposition 5.1. Suppose that $f, f_0, f_1,\dots,f_N$ satisfy Assumptions B1–B3 and that $f_0$ satisfies either Assumption C1 or C2. Also, assume that the sequence $\{x^r = (x^r_1,\dots,x^r_N)\}_{r=0,1,\dots}$ generated by the BCD method using the essentially cyclic rule is defined. Then, either $\{f(x^r)\}\downarrow -\infty$, or else every cluster point $z = (z_1,\dots,z_N)$ is a coordinatewise minimum point of f.

Proof. Since $f(x^0) < \infty$ and $f(x^{r+1})\le f(x^r)$ for all r, then either $\{f(x^r)\}\downarrow -\infty$, or else $\{f(x^r)\}$ converges to some limit and $\{f(x^{r+1})-f(x^r)\}\to 0$. Consider the latter case and let z be any cluster point of $\{x^r\}$. Since f is lsc by Assumption B3, we have

$$f(z)\ \le\ \lim_{r\to\infty} f(x^r)\ <\ \infty,$$

so $z\in\operatorname{dom} f$. We show below that z satisfies (4) for $k = 1,\dots,N$.

First, we claim that, for any infinite subsequence

$$\{x^r\}_{r\in R}\to z, \tag{10}$$

with $R\subseteq\{0, 1, \dots\}$, there holds that

$$\{x^{r+1}\}_{r\in R}\to z. \tag{11}$$

We prove this by contradiction. Suppose that this were not true. Then, there exist an infinite subsequence $R'$ of $R$ and a scalar $\epsilon > 0$ such that $\|x^{r+1}-x^r\|\ge\epsilon$ for all $r\in R'$. By further passing to a subsequence, if necessary, we can assume that there is some nonzero vector d for which

$$\{(x^{r+1}-x^r)/\|x^{r+1}-x^r\|\}_{r\in R'}\to d, \tag{12}$$

and that the same coordinate block, say $x_s$, is chosen at the $(r+1)$st iteration for all $r\in R'$. Moreover, (10) implies that $\{f_0(x^r)\}_{r\in R}$ and $\{f_k(x^r_k)\}_{r\in R}$, $k = 1,\dots,N$, are bounded from below, which together with the convergence of $\{f(x^r)\} = \{f_0(x^r)+\sum_{k=1}^N f_k(x^r_k)\}$ implies that $\{f_0(x^r)\}_{r\in R}$ and $\{f_k(x^r_k)\}_{r\in R}$, $k = 1,\dots,N$, are bounded. Hence, by further passing to a subsequence, if necessary, we can assume that there is some scalar $\theta$ for which

$$\{f_0(x^r)+f_s(x^r_s)\}_{r\in R'}\to\theta. \tag{13}$$

Fix any $\lambda\in[0, \epsilon]$. Let

$$\hat z = z+\lambda d, \tag{14}$$

and for each $r\in R'$, let

$$\hat x^r = x^r+\lambda(x^{r+1}-x^r)/\|x^{r+1}-x^r\|. \tag{15}$$

Then, by (10), (12), and (14),

$$\{\hat x^r\}_{r\in R'}\to\hat z. \tag{16}$$

For each $r\in R'$, we see from (2) that $x^{r+1}$ is obtained from $x^r$ by minimizing f with respect to $x_s$, while the other coordinates are held fixed. Since

$$\lambda/\|x^{r+1}-x^r\|\le\lambda/\epsilon\le 1,$$

so $\hat x^r$ lies on the line segment joining $x^r$ with $x^{r+1}$, this together with $f(x^{r+1})\le f(x^r)$ and the quasiconvexity of $x_s\mapsto f(x^r_1,\dots,x^r_{s-1}, x_s, x^r_{s+1},\dots,x^r_N)$ implies

$$f(\hat x^r)\le f(x^r), \quad \forall r\in R'.$$

Since f is lsc, this and (16) imply $\hat z\in\operatorname{dom} f$. Also, this and (1) and the observation that $x^r$ and $\hat x^r$ differ only in their sth coordinate block imply

$$f_0(\hat x^r)+f_s(\hat x^r_s)\le f_0(x^r)+f_s(x^r_s), \quad \forall r\in R'.$$

This combined with (13) yields

$$\limsup_{r\to\infty,\, r\in R'}\ \{f_0(\hat x^r)+f_s(\hat x^r_s)\}\le\theta. \tag{17}$$

Also, since $\{f(x^{r+1})-f(x^r)\}_{r\in R'}\to 0$, we have equivalently that

$$\{f_0(x^{r+1})+f_s(x^{r+1}_s)-f_0(x^r)-f_s(x^r_s)\}_{r\in R'}\to 0,$$

so (13) implies

$$\{f_0(x^{r+1})+f_s(x^{r+1}_s)\}_{r\in R'}\to\theta. \tag{18}$$

Let

$$\delta = f_0(\hat z)+f_s(\hat z_s)-\theta.$$

Since $f_0$ and $f_s$ are lsc, we have from (16), (17) that $\delta\le 0$. We claim that in fact $\delta = 0$. Suppose that this were not true, so that $\delta < 0$. By (16) and the observation that, for all $r\in R'$, $\hat x^r$ and $x^r$ differ in only their sth coordinate block, we have

$$\{(x^r_1,\dots,x^r_{s-1}, \hat z_s, x^r_{s+1},\dots,x^r_N)\}_{r\in R'}\to\hat z. \tag{19}$$

Moreover, the vector on the left-hand side of (19) is in $\operatorname{dom} f_0$ for all $r\in R'$ sufficiently large. Since $\hat z\in\operatorname{dom} f_0$, this is certainly true under Assumption C1; under Assumption C2, this is also true because $x^r\in\operatorname{dom} f_0$ for all r and $\operatorname{dom} f_0$ has a product structure corresponding to the coordinate blocks. Then, (18) together with (19) and the continuity of $f_0$ on $\operatorname{dom} f_0$ implies that, for all $r\in R'$ sufficiently large, there holds that

$$f_0(x^r_1,\dots,x^r_{s-1}, \hat z_s, x^r_{s+1},\dots,x^r_N)+f_s(\hat z_s)\le f_0(x^{r+1})+f_s(x^{r+1}_s)+\delta/2,$$

or equivalently [via (1) and the observation that $x^r$ and $x^{r+1}$ differ in only their sth coordinate block],

$$f(x^r_1,\dots,x^r_{s-1}, \hat z_s, x^r_{s+1},\dots,x^r_N)\le f(x^{r+1})+\delta/2,$$

a contradiction to the fact that $x^{r+1}$ is obtained from $x^r$ by minimizing f with respect to the sth coordinate block, while the other coordinates are held fixed. Hence, $\delta = 0$ and therefore

$$f_0(\hat z)+f_s(\hat z_s) = \theta.$$

Since the choice of $\lambda$ was arbitrary, we obtain [also using (14)]

$$f_0(z+\lambda d)+f_s(z_s+\lambda d_s) = \theta, \quad \forall\lambda\in[0, \epsilon],$$

where $d_s$ denotes the sth coordinate block of d. Since $x^r$ and $x^{r+1}$ differ in only their sth coordinate block for all $r\in R'$, then all coordinate blocks of d, except $d_s$, are zero [see (12)], and the above relation, together with (1), shows that $f(z+\lambda d)$ is constant (and finite) for all $\lambda\in[0, \epsilon]$, a contradiction to Assumption B2, namely, that f is hemivariate in the sth coordinate block. Hence, (11) holds.

Since (11) holds for any subsequence $\{x^r\}_{r\in R}$ of $\{x^r\}$ converging to z, we can apply (11) to the subsequence $\{x^{r+1}\}_{r\in R}$ to conclude that $\{x^{r+2}\}_{r\in R}\to z$ and so on, yielding

$$\{x^{r+j}\}_{r\in R}\to z, \quad \forall j = 0, 1, \dots, T, \tag{20}$$

where T is the bound specified in the essentially cyclic rule.

We claim that (20), together with Assumption C1 or C2, implies

$$f_0(z)+f_k(z_k)\le f_0(z_1,\dots,z_{k-1}, x_k, z_{k+1},\dots,z_N)+f_k(x_k), \tag{21}$$

for all $x_k$ and all $k\in\{1,\dots,N\}$. To see this, fix any $k\in\{1,\dots,N\}$. Since the coordinate blocks are chosen according to the essentially cyclic rule, there exist some $j\in\{1,\dots,T\}$ and an infinite subsequence $R'\subseteq R$ such that the coordinate block $x_k$ is chosen at the $(r+j)$th iteration for all $r\in R'$. Then, for each $r\in R'$, $x^{r+j}_k$ minimizes

$$f_0(x^{r+j}_1,\dots,x^{r+j}_{k-1}, x_k, x^{r+j}_{k+1},\dots,x^{r+j}_N)+f_k(x_k)$$

over all $x_k$ [see (1), (2), (3)], so that

$$f_0(x^{r+j})+f_k(x^{r+j}_k)\le f_0(x^{r+j}_1,\dots,x^{r+j}_{k-1}, x_k, x^{r+j}_{k+1},\dots,x^{r+j}_N)+f_k(x_k), \quad \forall x_k. \tag{22}$$

Fix any $x_k\in\operatorname{dom} f_k$ such that $(z_1,\dots,z_{k-1}, x_k, z_{k+1},\dots,z_N)\in\operatorname{dom} f_0$. Suppose that Assumption C1 holds, so $\operatorname{dom} f_0$ is open. Since $(z_1,\dots,z_{k-1}, x_k, z_{k+1},\dots,z_N)\in\operatorname{dom} f_0$, then (20) implies that

$$(x^{r+j}_1,\dots,x^{r+j}_{k-1}, x_k, x^{r+j}_{k+1},\dots,x^{r+j}_N)\in\operatorname{dom} f_0,$$

for all $r\in R'$ sufficiently large. Passing to the limit as $r\to\infty$, $r\in R'$, and using the lsc property of $f_k$ and the continuity of $f_0$ on the open set $\operatorname{dom} f_0$, we obtain from (20) and (22) that (21) holds. Suppose instead that Assumption C2 holds, so

$$\operatorname{dom} f_0 = Y_1\times\cdots\times Y_N, \quad \text{for some } Y_1\subseteq\Re^{n_1},\dots,Y_N\subseteq\Re^{n_N}.$$

Then, the first quantity on the right-hand side of (22) is finite for all $r\in R'$. Passing to the limit as $r\to\infty$, $r\in R'$, and using the lsc property of $f_k$ and the continuity of $f_0$ on $\operatorname{dom} f_0$, we obtain from (20) and (22) that (21) holds. If $x_k\notin\operatorname{dom} f_k$ or $(z_1,\dots,z_{k-1}, x_k, z_{k+1},\dots,z_N)\notin\operatorname{dom} f_0$, then the right-hand side of (21) has the extended value $\infty$, so (21) holds trivially. Since the above choice of k was arbitrary, this shows that (21) holds for all $x_k$ and all $k\in\{1,\dots,N\}$. Then, it follows from (1) that (4) holds for all $k = 1,\dots,N$. $\square$

Proposition 5.1 extends a result of Grippo and Sciandrone (Ref. 27, Proposition 5.1) for the special case where each $f_k$ is the indicator function for some closed convex set and $f_0$ is continuously differentiable and (block) coordinatewise strictly pseudoconvex. In turn, the latter result is an extension of a result of Bertsekas and Tsitsiklis (Ref. 6, Proposition 3.9 in Section 3.3.5), which assumes further $f_0$ to be convex. As a corollary of Proposition 5.1, we obtain the following convergence result for the BCD method.

Theorem 5.1. Suppose that $f, f_0, f_1,\dots,f_N$ satisfy Assumptions B1–B3 and that $f_0$ satisfies either Assumption C1 or C2. Also, assume that $\{x : f(x)\le f(x^0)\}$ is bounded. Then, the sequence $\{x^r\}$ generated by the BCD method using the essentially cyclic rule is defined, bounded, and every cluster point is a coordinatewise minimum point of f.

Theorem 5.1 extends a result of Auslender [see Theorem 1.2(a) in Ref. 4, p. 95] for the special case where $f_k$ is convex for all k, $\operatorname{dom} f_0 = Y_1\times\cdots\times Y_N$ for some closed convex sets $Y_k\subseteq\Re^{n_k}$, $k = 1,\dots,N$, and $f_0$ is strongly convex and continuous on $\operatorname{dom} f_0$.


6. Applications

We describe four interesting applications of the BCD method below.

In all applications, the objective function f is not necessarily strictly convex nor differentiable everywhere on its effective domain.

Example 6.1. Proximal Minimization Algorithm. Let $\psi:\Re^n\to\Re\cup\{\infty\}$ be a proper (i.e., $\psi\not\equiv\infty$) lsc function. Fix any scalar $c > 0$, and consider the proper lsc function f defined by

$$f(x, y) = c\|x-y\|^2+\psi(x).$$

Clearly, this function has the form (1) with

$$f_0(x, y) = c\|x-y\|^2, \quad f_1 = \psi, \quad f_2\equiv 0.$$

Applying the BCD method to f yields a method whereby f(x, y) is alternately minimized with respect to x and y. This method has the form

$$x^{r+1}\in\arg\min_x\ c\|x-x^r\|^2+\psi(x), \quad r = 0, 1, \dots,$$

which is the proximal minimization algorithm with fixed parameter c for minimizing $\psi$; see Ref. 6, Section 3.4.3 and Refs. 36–37 and references therein.

It is easily seen that $f, f_0, f_1, f_2$ satisfy Assumptions B1–B3 and that $f_0$ satisfies Assumptions A1 and C1. Moreover, f is regular everywhere on $\operatorname{dom} f$. Then, by Proposition 5.1, if $\psi$ is bounded below (so, f is bounded below), then every cluster point z of the iterates generated by the above proximal minimization algorithm is a stationary point of $\psi$, i.e.,

$$\psi'(z; d)\ge 0, \quad \text{for all } d.$$

Notice that Theorem 4.1 is not applicable here, since f need not be continuous on its level sets.
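As a minimal sketch of this alternating reading of the proximal algorithm, take the illustrative choice $\psi(x) = |x|$ in one dimension (our choice, not the paper's): the x-step is then a soft threshold and the y-step simply copies x.

```python
import numpy as np

def prox_abs(v, tau):
    # argmin_x tau*|x| + (1/2)(x - v)^2, i.e. the soft threshold
    return np.sign(v) * max(abs(v) - tau, 0.0)

c = 2.0
y = 5.0                               # y^0
for r in range(100):
    # x-step: argmin_x c(x - y)^2 + |x|  ==  soft threshold with tau = 1/(2c)
    x = prox_abs(y, 1.0 / (2.0 * c))
    # y-step: argmin_y c(x - y)^2  ==  x
    y = x
print(y)   # 0.0, the minimizer of psi(x) = |x|
```

Each sweep shrinks y toward 0 by 1/(2c), matching the fixed-parameter proximal iteration for this psi.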

Example 6.2. Arimoto–Blahut Algorithm. Let $P_{ij}$, $i = 1,\dots,n$, $j = 1,\dots,m$, be given nonnegative scalars satisfying

$$\sum_j P_{ij} = 1, \quad \text{for all } i.$$

The $P_{ij}$ may be viewed as probabilities. Consider the proper lsc function f defined by

$$f(x, y) = f_0(x, y)+f_1(x)+f_2(y),$$

where

$$f_0(x, y) = \begin{cases}\sum_{j=1}^m\sum_{i=1}^n P_{ij}\, x_i\,\varphi(y_{ij}/x_i), & \text{if } x\ge 0,\ y > 0,\\ \infty, & \text{otherwise,}\end{cases}$$

$$f_1(x) = \begin{cases}0, & \text{if } \sum_{i=1}^n x_i = 1,\\ \infty, & \text{otherwise,}\end{cases} \qquad f_2(y) = \begin{cases}0, & \text{if } \sum_{i=1}^n y_{ij} = 1,\ \forall j = 1,\dots,m,\\ \infty, & \text{otherwise,}\end{cases}$$

with $\varphi(t) = -\log(t)$. In our notation, x is a vector in $\Re^n$ whose ith coordinate is $x_i$, and y is a vector in $\Re^{nm}$ whose $((i-1)m+j)$th coordinate is $y_{ij}$. Applying the BCD method to f yields a method whereby f(x, y) is alternately minimized with respect to x and y. This in turn can be seen to be the Arimoto–Blahut algorithm for computing the capacity of a discrete memoryless communication channel (Refs. 11–12).

It can be verified that $f, f_0, f_1, f_2$ are convex and satisfy Assumptions B1–B3. Convexity of $f_0$ follows from observing that $(a, b)\mapsto a\varphi(b/a)$ is convex. Moreover, f has compact level sets and is continuous on each level set, and $f_0$ satisfies Assumptions A2 and C2. Notice that f is not strictly convex and $f_0$ does not satisfy Assumption A1 or C1. Thus, by Lemma 3.1 and Theorem 5.1 or Theorem 4.1(c), the sequence of iterates generated by the Arimoto–Blahut algorithm is bounded and each cluster point is a stationary point of f. By the convexity of f, this is in fact a minimum point of f. This result matches those obtained in Refs. 11–12. Analogous convergence results are obtained for variants of the Arimoto–Blahut algorithm, whereby we use, for example, $\varphi(t) = t\log(t)$ or $\varphi(t) = 1/t$.
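With $\varphi(t) = -\log t$, both block minimizations reduce to closed-form normalizations, which is the familiar Arimoto–Blahut iteration. The sketch below is our reconstruction of that iteration (it assumes $P_{ij} > 0$ throughout, purely for simplicity) and computes the capacity of a binary symmetric channel in nats.

```python
import numpy as np

def arimoto_blahut(P, iters=200):
    n = P.shape[0]
    x = np.full(n, 1.0 / n)                 # input distribution x^0
    for _ in range(iters):
        # y-step: min over y with sum_i y_ij = 1 gives y_ij = P_ij x_i / sum_i P_ij x_i
        y = P * x[:, None]
        y /= y.sum(axis=0, keepdims=True)
        # x-step: min over the unit simplex gives x_i proportional to exp(sum_j P_ij log y_ij)
        x = np.exp((P * np.log(y)).sum(axis=1))
        x /= x.sum()
    q = x @ P                               # induced output distribution
    return float((x[:, None] * P * np.log(P / q[None, :])).sum())  # mutual information, nats

# binary symmetric channel with crossover probability 0.1
P = np.array([[0.9, 0.1], [0.1, 0.9]])
capacity = arimoto_blahut(P)
entropy = -(0.1 * np.log(0.1) + 0.9 * np.log(0.9))
print(abs(capacity - (np.log(2) - entropy)) < 1e-9)   # True: matches log 2 - H(0.1)
```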

Example 6.3. Han Algorithm. Let f be the proper lsc convex function studied by Han [Ref. 30, (D′)],

$$f(x_1,\dots,x_N) = (1/2)\|x_1+\cdots+x_N-d\|^2+\sum_{k=1}^N f_k(x_k),$$

where d is a given vector in $\Re^m$ and each $f_k:\Re^m\to\Re\cup\{\infty\}$ is a proper lsc convex function. Also see Ref. 18 for a special case where $f_k$ is the support function of a closed convex set. Clearly, f is of the form (1) with

$$f_0(x_1,\dots,x_N) = (1/2)\|x_1+\cdots+x_N-d\|^2.$$

Han proposed in Ref. 30 an algorithm for minimizing f, which may be viewed as an instance of the BCD method using the cyclic rule, as was shown in Ref. 22.

It is seen easily that $f, f_0, f_1,\dots,f_N$ satisfy Assumptions B1–B3 and that $f_0$ satisfies Assumptions A1 and C1. Thus, by Lemma 3.1 and Proposition 5.1 [also see the remark following (3)], if f has a minimum point, then the iterates generated by the Han algorithm are defined and every cluster point is a minimum point of f. This result matches Proposition 4.3 in Ref. 30. On the other hand, by using the convexity of the functions, stronger convergence results can be obtained; see Refs. 22, 38.
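The cyclic-BCD reading of the Han algorithm can be sketched as follows: updating block k amounts to a proximal step on $f_k$ at d minus the sum of the other blocks. We take $f_k = \lambda_k\|\cdot\|_1$ purely for illustration (our choice; Han's setting has general proper lsc convex $f_k$, e.g., support or indicator functions), so each prox is a soft threshold.

```python
import numpy as np

soft = lambda v, tau: np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def han_bcd(d, lams, sweeps=20):
    """Cyclic BCD on f = (1/2)||x_1+...+x_N - d||^2 + sum_k lam_k ||x_k||_1."""
    N = len(lams)
    xs = [np.zeros_like(d) for _ in range(N)]
    for _ in range(sweeps):
        for k in range(N):                           # cyclic rule
            others = sum(xs[j] for j in range(N) if j != k)
            # argmin_x (1/2)||x + others - d||^2 + lam_k ||x||_1
            xs[k] = soft(d - others, lams[k])
    return xs

d = np.array([2.0, -0.3])
xs = han_bcd(d, [0.5, 1.0])
fval = (0.5 * np.sum((sum(xs) - d) ** 2)
        + 0.5 * np.abs(xs[0]).sum() + 1.0 * np.abs(xs[1]).sum())
print(round(fval, 6))   # 0.92, the optimal value (all weight on the cheaper block)
```

Here the optimum puts all the nonzero weight on the block with the smaller penalty, and the cyclic iteration reaches it after two block steps.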

Example 6.4. Blind Source Separation. In Ref. 33, Zibulevsky and Pearlmutter studied an optimization formulation of the blind source separ- ation, whereby an error term of the form

(1兾2σ2)兩兩ASAX兩兩2FC∑

j,t

fjt(stj), is minimized with respect to

A∈ℜmBn and SG[stj]j G1,...,n,t G1,...,T∈ℜnBT.

Here, X∈ℜmBT are the given data; 兩兩·兩兩F denotes the Frobenious norm;

σH0; and each fjt:ℜ>[0, S] is a proper convex function that is continu- ous on its effective domain and has bounded level sets. In Ref. 31, the particular choice of fjt(·)G兩·兩 is used. To ensure the existence of an optimal solution, it was suggested in Ref. 33 that constraints such as

be imposed, where Aidenotes the ith row of A. The objective function of this problem has the form (1) with NG1CnT,

f0(A, s11, . . . , sTn)G(1兾2σ2)兩兩ASAX兩兩2F, f1(A)G0, if兩兩Ai兩兩⁄1, i G1, . . . , m,

S, else,

and fjt, j = 1, . . . , n, t = 1, . . . , T, as given. Notice that minimizing f with respect to A entails minimizing a convex quadratic function over the Cartesian product of m Euclidean balls, while minimizing f with respect to each s_j^t entails minimizing the sum of a convex quadratic function of one variable and a convex function of one variable. Thus, the BCD method applied to this f can be implemented fairly inexpensively. If we replace (23) by the single ball constraint

‖A‖_F ≤ ρ,

for some fixed ρ > 0, then minimizing f with respect to A becomes minimizing a convex quadratic function over a single ball, which can be solved efficiently using, e.g., the Moré–Sorensen method.
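For the Frobenius-ball variant, stationarity of the Lagrangian gives A(SSᵀ + λσ²I) = XSᵀ with multiplier λ ≥ 0, and ‖A(λ)‖_F is decreasing in λ, so λ solves the secular equation ‖A(λ)‖_F = ρ. The sketch below is ours (it assumes SSᵀ is nonsingular and uses plain bisection rather than the safeguarded Newton iteration of the actual Moré–Sorensen method; all names are illustrative):

```python
import numpy as np

def a_update_ball(S, X, sigma, rho, tol=1e-10):
    """Minimize (1/(2*sigma**2)) * ||A S - X||_F**2  s.t.  ||A||_F <= rho,
    by bisection on lam in the stationarity system
        A (S S^T + lam*sigma^2 I) = X S^T,   lam >= 0."""
    G, B = S @ S.T, X @ S.T
    I = np.eye(G.shape[0])

    def A_of(lam):
        # solve A (G + lam*sigma^2 I) = B for A; G is symmetric
        return np.linalg.solve(G + lam * sigma**2 * I, B.T).T

    A = A_of(0.0)                    # unconstrained minimizer (G assumed nonsingular)
    if np.linalg.norm(A, 'fro') <= rho:
        return A
    lo, hi = 0.0, 1.0
    while np.linalg.norm(A_of(hi), 'fro') > rho:   # bracket the multiplier
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol * (1.0 + hi):              # bisect the secular equation
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(A_of(mid), 'fro') > rho:
            lo = mid
        else:
            hi = mid
    return A_of(hi)

# Tiny illustrative instance: the unconstrained minimizer is 2I with
# ||2I||_F = 2*sqrt(2) > rho, so the ball constraint is active.
A = a_update_ball(S=np.eye(2), X=2.0 * np.eye(2), sigma=1.0, rho=1.0)
```

Each trial λ costs one linear solve in the n×n matrix SSᵀ + λσ²I, which is cheap since n (the number of sources) is typically small.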

It is not difficult to see that f is continuous on its effective domain and has compact level sets. Moreover, f is convex in (s_1^1, . . . , s_n^T), f1 is convex, and f0 satisfies Assumption A1. Thus, by Lemma 3.1 and Theorem 4.1(b), the iterates generated by the BCD method using the cyclic rule are defined and every cluster point is a stationary point of f. Notice that f is not pseudoconvex in every pair of coordinate blocks and that f need not have at most one minimum in each s_j^t, so neither Theorem 5.1, nor part (a) of Theorem 4.1, nor part (c) of Theorem 4.1 is applicable here.

Instead of treating each s_j^t as a coordinate block, we can alternatively treat S = [s_j^t]_{j,t} as a single coordinate block. However, minimizing f with respect to S is more difficult. In the case fjt(·) = |·|, this would require solving a large convex quadratic programming problem. A comparison of a primal–dual interior-point method and the BCD method for solving such a problem is given in Ref. 32.
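By contrast, with each s_j^t as its own block and fjt(·) = |·|, the one-variable subproblem is a quadratic plus an absolute value, minimized in closed form by soft-thresholding. The following sketch of this coordinate update is ours (the function names and the completing-the-square derivation in the comments are not from Refs. 31–33):

```python
import numpy as np

def soft(z, tau):
    """Prox of tau*|.|: the unique minimizer of 0.5*(s - z)**2 + tau*|s|."""
    return np.sign(z) * max(abs(z) - tau, 0.0)

def update_s_jt(A, S, X, j, t, sigma):
    """Exact BCD update of the single coordinate s_j^t for
    f = (1/(2*sigma**2)) * ||A S - X||_F**2 + sum_{j,t} |s_j^t|.
    As a function of s = s_j^t alone, completing the square gives
        f(s) = (q/(2*sigma**2)) * (s - a.r/q)**2 + |s| + const,
    with a = column j of A, q = ||a||^2, and r the column-t residual
    excluding the (j, t) term; the minimizer is a soft-threshold step."""
    a = A[:, j]
    r = X[:, t] - A @ S[:, t] + a * S[j, t]   # residual excluding s_j^t
    q = a @ a
    S[j, t] = soft(a @ r / q, sigma**2 / q)
    return S
```

This is the same scalar update that underlies lasso-style coordinate descent, which is why the BCD method here is so inexpensive per pass.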

References

1. ORTEGA, J. M., and RHEINBOLDT, W. C., Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, New York, NY, 1970.

2. HILDRETH, C., A Quadratic Programming Procedure, Naval Research Logistics Quarterly, Vol. 4, pp. 79–85, 1957; see also Erratum, Naval Research Logistics Quarterly, Vol. 4, p. 361, 1957.

3. WARGA, J., Minimizing Certain Convex Functions, SIAM Journal on Applied Mathematics, Vol. 11, pp. 588–593, 1963.

4. AUSLENDER, A., Optimisation: Méthodes Numériques, Masson, Paris, France, 1976.

5. BERTSEKAS, D. P., Nonlinear Programming, 2nd Edition, Athena Scientific, Belmont, Massachusetts, 1999.

6. BERTSEKAS, D. P., and TSITSIKLIS, J. N., Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, Englewood Cliffs, New Jersey, 1989.

7. CENSOR, Y., and ZENIOS, S. A., Parallel Optimization: Theory, Algorithms, and Applications, Oxford University Press, Oxford, United Kingdom, 1997.

8. LUENBERGER, D. G., Linear and Nonlinear Programming, Addison–Wesley, Reading, Massachusetts, 1973.


9. POLAK, E., Computational Methods in Optimization: A Unified Approach, Academic Press, New York, NY, 1971.

10. ZANGWILL, W. I., Nonlinear Programming, Prentice-Hall, Englewood Cliffs, New Jersey, 1969.

11. ARIMOTO, S., An Algorithm for Computing the Capacity of Arbitrary DMCs, IEEE Transactions on Information Theory, Vol. 18, pp. 14–20, 1972.

12. BLAHUT, R., Computation of Channel Capacity and Rate Distortion Functions, IEEE Transactions on Information Theory, Vol. 18, pp. 460–473, 1972.

13. HOWSON, H. R., and SANCHO, N. G. F., A New Algorithm for the Solution of Multistate Dynamic Programming Problems, Mathematical Programming, Vol. 8, pp. 104–116, 1975.

14. KORSAK, A. J., and LARSON, R. E., A Dynamic Programming Successive Approximations Technique with Convergence Proofs, Automatica, Vol. 6, pp. 253–260, 1970.

15. ZUO, Z. Q., and WU, C. P., Successive Approximation Technique for a Class of Large-Scale NLP Problems and Its Application to Dynamic Programming, Journal of Optimization Theory and Applications, Vol. 62, pp. 515–527, 1989.

16. STERN, T. E., A Class of Decentralized Routing Algorithms Using Relaxation, IEEE Transactions on Communications, Vol. 25, pp. 1092–1102, 1977.

17. BREGMAN, L. M., The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming, USSR Computational Mathematics and Mathematical Physics, Vol. 7, pp. 200–217, 1967.

18. HAN, S. P., A Successive Projection Method, Mathematical Programming, Vol. 40, pp. 1–14, 1988.

19. KIWIEL, K. C., Free-Steering Relaxation Methods for Problems with Strictly Convex Costs and Linear Constraints, Mathematics of Operations Research, Vol. 22, pp. 326–349, 1997.

20. LUO, Z. Q., and TSENG, P., On the Convergence Rate of Dual Ascent Methods for Strictly Convex Minimization, Mathematics of Operations Research, Vol. 18, pp. 846–867, 1993.

21. TSENG, P., Dual Ascent Methods for Problems with Strictly Convex Costs and Linear Constraints: A Unified Approach, SIAM Journal on Control and Optimization, Vol. 28, pp. 214–242, 1990.

22. TSENG, P., Dual Coordinate Ascent Methods for Nonstrictly Convex Minimization, Mathematical Programming, Vol. 59, pp. 231–247, 1993.

23. MANGASARIAN, O. L., and DELEONE, R., Parallel Successive Overrelaxation Methods for Symmetric Linear Complementarity Problems and Linear Programs, Journal of Optimization Theory and Applications, Vol. 54, pp. 437–446, 1987.

24. CEA, J., and GLOWINSKI, R., Sur des Méthodes d'Optimisation par Relaxation, Revue Française d'Automatique, Informatique et Recherche Opérationnelle, Vol. R3, pp. 5–32, 1973.

25. SARGENT, R. W. H., and SEBASTIAN, D. J., On the Convergence of Sequential Minimization Algorithms, Journal of Optimization Theory and Applications, Vol. 12, pp. 567–575, 1973.


26. ZADEH, N., A Note on the Cyclic Coordinate Ascent Method, Management Science, Vol. 16, pp. 642–644, 1970.

27. GRIPPO, L., and SCIANDRONE, M., On the Convergence of the Block Nonlinear Gauss–Seidel Method under Convex Constraints, Operations Research Letters, Vol. 26, pp. 127–136, 2000.

28. POWELL, M. J. D., On Search Directions for Minimization Algorithms, Mathematical Programming, Vol. 4, pp. 193–201, 1973.

29. LUO, Z. Q., and TSENG, P., Error Bounds and Convergence Analysis of Feasible Descent Methods: A General Approach, Annals of Operations Research, Vol. 46, pp. 157–178, 1993.

30. HAN, S. P., A Decomposition Method and Its Application to Convex Programming, Mathematics of Operations Research, Vol. 14, pp. 237–248, 1989.

31. BOFILL, P., and ZIBULEVSKY, M., Sparse Underdetermined ICA: Estimating the Mixing Matrix and the Sources Separately, Technical Report UPC-DAC-2000-7, Universitat Politècnica de Catalunya, Barcelona, Spain, 1999.

32. SARDY, S., BRUCE, A., and TSENG, P., Block Coordinate Relaxation Methods for Nonparametric Wavelet Denoising, Journal of Computational and Graphical Statistics, Vol. 9, pp. 361–379, 2000.

33. ZIBULEVSKY, M., and PEARLMUTTER, B., Blind Source Separation by Sparse Decomposition, Technical Report CS99-1, Computer Science Department, University of New Mexico, Albuquerque, New Mexico, 1999.

34. MANGASARIAN, O. L., Nonlinear Programming, McGraw-Hill, New York, NY, 1969.

35. ROCKAFELLAR, R. T., Convex Analysis, Princeton University Press, Princeton, New Jersey, 1970.

36. MARTINET, B., Détermination Approchée d'un Point Fixe d'une Application Pseudo-Contractante: Cas de l'Application Prox, Comptes Rendus des Séances de l'Académie des Sciences, Vol. 274A, pp. 163–165, 1972.

37. ROCKAFELLAR, R. T., Augmented Lagrangians and Applications of the Proximal Point Algorithm in Convex Programming, Mathematics of Operations Research, Vol. 1, pp. 97–116, 1976.

38. BAUSCHKE, H. H., and LEWIS, A. S., Dykstra's Algorithm with Bregman Projections: A Convergence Proof, Optimization (to appear).
