# Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization


## Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization¹

P. TSENG²

Communicated by O. L. Mangasarian

Abstract. We study the convergence properties of a (block) coordinate descent method applied to minimize a nondifferentiable (nonconvex) function $f(x_1,\dots,x_N)$ with certain separability and regularity properties. Assuming that f is continuous on a compact level set, the subsequence convergence of the iterates to a stationary point is shown when either f is pseudoconvex in every pair of coordinate blocks from among $N-1$ coordinate blocks or f has at most one minimum in each of $N-2$ coordinate blocks. If f is quasiconvex and hemivariate in every coordinate block, then the assumptions of continuity of f and compactness of the level set may be relaxed further. These results are applied to derive new (and old) convergence results for the proximal minimization algorithm, an algorithm of Arimoto and Blahut, and an algorithm of Han.

They are applied also to a problem of blind source separation.

Key Words. Block coordinate descent, nondifferentiable minimization, stationary point, Gauss–Seidel method, convergence, quasiconvex functions, pseudoconvex functions.

1. Introduction

A popular method for minimizing a real-valued continuously differentiable function f of n real variables, subject to bound constraints, is the (block) coordinate descent method. In this method, the coordinates are partitioned into N blocks and, at each iteration, f is minimized with respect to one of the coordinate blocks while the other coordinates are held fixed. This method, which is related closely to the Gauss–Seidel and SOR methods for equation solving (Ref. 1), was studied early by Hildreth (Ref. 2) and Warga (Ref. 3), and is described in various books on optimization (Refs. 1 and 4–10). Its applications include channel capacity computation (Refs. 11–12), image reconstruction (Ref. 7), dynamic programming (Refs. 13–15), and flow routing (Ref. 16). It may be applied also to the dual of a linearly constrained, strictly convex program to obtain various decomposition methods (see Refs. 6–7, 17–22, and references therein) and parallel SOR methods (Ref. 23).

¹This work was partially supported by the National Science Foundation Grant CCR-9731273.

²Professor, Department of Mathematics, University of Washington, Seattle, Washington.

0022-3239/01/0600-0475 $19.50 © 2001 Plenum Publishing Corporation

Convergence of the (block) coordinate descent method requires typically that f be strictly convex (or quasiconvex or hemivariate) differentiable and, taking into account the bound constraints, have bounded level sets (e.g., Refs. 3–4 and 24–25). Zadeh (Ref. 26; also see Ref. 27) relaxed the strict convexity assumption to pseudoconvexity, which allows f to have a nonunique minimum along coordinate directions. For certain classes of convex functions, the level sets need not be bounded (see Refs. 2, 6–7, 17, 19–22, and references therein). If f is not (pseudo)convex, then an example of Powell (Ref. 28) shows that the method may cycle without approaching any stationary point of f. Nonetheless, convergence can be shown for special cases of non(pseudo)convex f, as when f is quadratic (Ref. 29), or f is strictly pseudoconvex in each of $N-2$ coordinate blocks (Ref. 27), or f has a unique minimum in each coordinate block (Ref. 8, p. 159). If f is not differentiable, the coordinate descent method may get stuck at a nonstationary point even when f is convex (e.g., Ref. 4, p. 94). For this reason, it is perceived generally that the method is unsuitable when f is nondifferentiable. However, an exception occurs when the nondifferentiable part of f is separable. Such a structure for f was considered first by Auslender (Ref. 4, p. 94) in the case where f is strongly convex. This structure is implicit in a decomposition method and projection method of Han (Refs. 18, 30), for which f is the convex dual functional associated with a certain linearly constrained convex program (see Ref. 22 for detailed discussions). This structure arises also in least-square problems where an $\ell_1$-penalty is placed on a subset of the parameters in order to minimize the support (see Refs. 31–33 and references therein).

Motivated by the preceding works, we consider in this paper the nondifferentiable (nonconvex) case where the nondifferentiable part of f is separable. Specifically, we assume that f has the following special form:

$$f(x_1,\dots,x_N) = f_0(x_1,\dots,x_N)+\sum_{k=1}^N f_k(x_k), \tag{1}$$

for some $f_0:\Re^{n_1+\cdots+n_N}\to\Re\cup\{\infty\}$ and some $f_k:\Re^{n_k}\to\Re\cup\{\infty\}$, $k = 1,\dots,N$. Here, $N, n_1,\dots,n_N$ are positive integers. We assume that f is proper, i.e., $f\not\equiv\infty$. We will refer to each $x_k$, $k = 1,\dots,N$, as a coordinate block of $x = (x_1,\dots,x_N)$. We will show that each cluster point of the iterates generated by the (block) coordinate descent method is a stationary point of f, provided that $f_0$ has a certain smoothness property (see Lemma 3.1), f is continuous on a compact level set, and either f is pseudoconvex in every pair of coordinate blocks from among $N-1$ coordinate blocks, or f has at most one minimum in each of $N-2$ coordinate blocks (see Theorem 4.1).

If f is quasiconvex and hemivariate in every coordinate block, then the assumptions of continuity of f and compactness of the level set may be relaxed further (see Proposition 5.1). These results unify and extend some previous results in Refs. 4, 6, 8, 26–27. For example, previous results assumed that f is pseudoconvex and that $f_1,\dots,f_N$ are indicator functions for closed convex sets, whereas we assume only that f is pseudoconvex in every pair of coordinate blocks from among $N-1$ coordinate blocks, with no additional assumption made on $f_1,\dots,f_N$. Previous results also did not consider the case where f is not continuous on its effective domain. Lastly, we apply our results to derive new (and old) convergence results for the proximal minimization algorithm, an algorithm of Arimoto and Blahut (Refs. 11–12), and an algorithm of Han (Ref. 30); see Examples 6.1–6.3.

We also apply them to a problem of blind source separation described in Refs. 31, 33; see Example 6.4.

In our notation, $\Re^m$ denotes the space of m-dimensional real column vectors. For any $x, y\in\Re^m$, we denote by $\langle x, y\rangle$ the Euclidean inner product of x and y, and by $\|x\|$ the Euclidean norm of x, i.e., $\|x\| = \langle x, x\rangle^{1/2}$.

For any set $S\subseteq\Re^m$, we denote by $\operatorname{int}(S)$ the interior of S and denote $\operatorname{bdry}(S) = S\setminus\operatorname{int}(S)$.

For any $h:\Re^m\to\Re\cup\{\infty\}$, we denote by dom h the effective domain of h, i.e.,

$$\operatorname{dom} h = \{x\in\Re^m \mid h(x) < \infty\}.$$

For any $x\in\operatorname{dom} h$ and any $d\in\Re^m$, we denote the (lower) directional derivative of h at x in the direction d by

$$h'(x; d) = \liminf_{\lambda\downarrow 0}\, [h(x+\lambda d)-h(x)]/\lambda.$$

We say that h is quasiconvex if

$$h(x+\lambda d)\le\max\{h(x), h(x+d)\}, \quad \text{for all } x, d \text{ and } \lambda\in[0, 1];$$

h is pseudoconvex if

$$h(x+d)\ge h(x), \quad \text{whenever } x\in\operatorname{dom} h \text{ and } h'(x; d)\ge 0;$$

see Ref. 34, p. 146; and h is hemivariate if h is not constant on any line segment belonging to dom h (Ref. 1). For any nonempty $I\subseteq\{1,\dots,m\}$, we say that $h(x_1,\dots,x_m)$ is pseudoconvex [respectively, has at most one minimum point] in $(x_i)_{i\in I}$ if h, viewed as a function of $(x_i)_{i\in I}$ with the remaining coordinates held fixed, is pseudoconvex [respectively, has at most one minimum point].

2. Block Coordinate Descent Method

We describe formally the block coordinate descent (BCD) method below.

BCD Method.

Initialization. Choose any $x^0 = (x^0_1,\dots,x^0_N)\in\operatorname{dom} f$.

Iteration $r+1$, $r\ge 0$. Given $x^r = (x^r_1,\dots,x^r_N)\in\operatorname{dom} f$, choose an index $s\in\{1,\dots,N\}$ and compute a new iterate $x^{r+1} = (x^{r+1}_1,\dots,x^{r+1}_N)\in\operatorname{dom} f$ satisfying

$$x^{r+1}_s\in\arg\min_{x_s} f(x^r_1,\dots,x^r_{s-1}, x_s, x^r_{s+1},\dots,x^r_N), \tag{2}$$

$$x^{r+1}_j = x^r_j, \quad \forall j\ne s. \tag{3}$$

We note that the minimization in (2) is attained if $X^0 = \{x : f(x)\le f(x^0)\}$ is bounded and f is lower semicontinuous (lsc) on $X^0$, so $X^0$ is compact (Ref. 35). Alternatively, this minimization is attained if f is convex, has a minimum point, and is hemivariate in each coordinate block (but the level sets of f need not be bounded). To ensure convergence, we need further that each coordinate block is chosen sufficiently often in the method. In particular, we will choose the coordinate blocks according to the following rule (see, e.g., Refs. 7–8, 21, 25).

Essentially Cyclic Rule. There exists a constant $T\ge N$ such that every index $s\in\{1,\dots,N\}$ is chosen at least once between the rth iteration and the $(r+T-1)$th iteration, for all r.

A well-known special case of this rule, for which $T = N$, is given below.

Cyclic Rule. Choose $s = k$ at iterations $k, k+N, k+2N, \dots$, for $k = 1,\dots,N$.
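To make the scheme concrete, the sketch below runs the BCD method (2)-(3) with the cyclic rule on a small strictly convex quadratic, for which each block minimization has a closed form. The quadratic, the block partition, and all names are illustrative choices of ours, not data from the paper.

```python
import numpy as np

def bcd_quadratic(Q, b, blocks, x0, sweeps=50):
    """Cyclic BCD for f(x) = (1/2)<x, Qx> - <b, x>, Q symmetric positive definite.

    Each block step solves (2) exactly: x_s = Q_ss^{-1} (b_s - Q_st x_t),
    where t indexes the coordinates held fixed, as in (3)."""
    x = np.array(x0, dtype=float)
    for r in range(sweeps * len(blocks)):
        s = blocks[r % len(blocks)]              # cyclic rule: block s at r, r+N, ...
        t = np.setdiff1d(np.arange(len(b)), s)   # frozen coordinates
        x[s] = np.linalg.solve(Q[np.ix_(s, s)], b[s] - Q[np.ix_(s, t)] @ x[t])
    return x

Q = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
blocks = [np.array([0]), np.array([1, 2])]       # N = 2 coordinate blocks
x = bcd_quadratic(Q, b, blocks, np.zeros(3))
print(np.allclose(x, np.linalg.solve(Q, b)))     # True: converges to the minimizer
```

Since this f is strictly convex, the iterates converge to the unique minimizer, consistent with the classical results cited above.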


3. Stationary Points of f

We say that z is a stationary point of f if $z\in\operatorname{dom} f$ and $f'(z; d)\ge 0$, $\forall d$. We say that z is a coordinatewise minimum point of f if $z\in\operatorname{dom} f$ and

$$f(z+(0,\dots,d_k,\dots,0))\ge f(z), \quad \forall d_k\in\Re^{n_k}, \tag{4}$$

for all $k = 1,\dots,N$. Here and throughout, we denote by $(0,\dots,d_k,\dots,0)$ the vector in $\Re^{n_1+\cdots+n_N}$ whose kth coordinate block is $d_k$ and whose other coordinates are zero. We say that f is regular at $z\in\operatorname{dom} f$ if

$$f'(z; d)\ge 0, \quad \forall d = (d_1,\dots,d_N) \text{ such that } f'(z; (0,\dots,d_k,\dots,0))\ge 0,\ k = 1,\dots,N. \tag{5}$$

This notion of regularity is weaker than that used by Auslender (Ref. 4, p. 93), which entails

$$f'(z; d) = \sum_{k=1}^N f'(z; (0,\dots,d_k,\dots,0)), \quad \text{for all } d = (d_1,\dots,d_N).$$

For example, the function

$$f(x_1, x_2) = \varphi(x_1, x_2)+\varphi(-x_1, x_2)+\varphi(x_1,-x_2)+\varphi(-x_1,-x_2),$$

where

$$\varphi(a, b) = \max\{0,\ a+b-\sqrt{a^2+b^2}\},$$

is regular at $z = (0, 0)$ in the sense of (5), but is not regular in the sense of Ref. 4, p. 93.
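A quick numeric check of this example (assuming the reconstructed $\varphi(a,b)=\max\{0,\,a+b-\sqrt{a^2+b^2}\}$): at $z=(0,0)$ the directional derivative along each coordinate axis is 0, yet along $d=(1,1)$ it equals $2-\sqrt{2}>0$, so the additivity required by Auslender's regularity fails, while (5) holds trivially since f is nonnegative with $f(0,0)=0$.

```python
import numpy as np

# phi as reconstructed above (an assumption of this sketch)
phi = lambda a, b: max(0.0, a + b - np.hypot(a, b))

def f(x1, x2):
    return phi(x1, x2) + phi(-x1, x2) + phi(x1, -x2) + phi(-x1, -x2)

def dirderiv(d1, d2, lam=1e-8):
    # forward-difference estimate of f'((0,0); d)
    return (f(lam * d1, lam * d2) - f(0.0, 0.0)) / lam

print(dirderiv(1, 0))   # 0.0 along e1
print(dirderiv(0, 1))   # 0.0 along e2
print(dirderiv(1, 1))   # approx 2 - sqrt(2): not the sum of the two above
```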

Since (4) implies

$$f'(z; (0,\dots,d_k,\dots,0))\ge 0, \quad \text{for all } d_k,$$

it follows that a coordinatewise minimum point z of f is a stationary point of f whenever f is regular at z. To ensure regularity of f at z, we consider one of the following smoothness assumptions on $f_0$:

(A1) $\operatorname{dom} f_0$ is open and $f_0$ is Gâteaux-differentiable on $\operatorname{dom} f_0$.

(A2) $f_0$ is Gâteaux-differentiable on $\operatorname{int}(\operatorname{dom} f_0)$ and, for every $z\in\operatorname{dom} f\cap\operatorname{bdry}(\operatorname{dom} f_0)$, there exist $k\in\{1,\dots,N\}$ and $d_k\in\Re^{n_k}$ such that $f(z+(0,\dots,d_k,\dots,0)) < f(z)$.

Assumption A1 was considered essentially by Auslender (Ref. 4, Example 2 on p. 94). In contrast to Assumption A1, Assumption A2 allows dom f0 to include boundary points. We will see an application (Example 6.2) where A2 holds but not A1.

Lemma 3.1. Under A1, f is regular at each z∈dom f. Under A2, f is regular at each coordinatewise minimum point z of f.

Proof. Under A1, if $z = (z_1,\dots,z_N)\in\operatorname{dom} f$, then $z\in\operatorname{dom} f_0$. Under A2, if $z = (z_1,\dots,z_N)$ is a coordinatewise minimum point of f, then $z\notin\operatorname{bdry}(\operatorname{dom} f_0)$, so $z\in\operatorname{int}(\operatorname{dom} f_0)$. Thus, under either A1 or A2, $f_0$ is Gâteaux-differentiable at z. Fix any $d = (d_1,\dots,d_N)$ such that

$$f'(z; (0,\dots,d_k,\dots,0))\ge 0, \quad k = 1,\dots,N.$$

Then,

$$f'(z; d) = \langle\nabla f_0(z), d\rangle+\liminf_{\lambda\downarrow 0}\sum_{k=1}^N [f_k(z_k+\lambda d_k)-f_k(z_k)]/\lambda$$

$$\ge \langle\nabla f_0(z), d\rangle+\sum_{k=1}^N \liminf_{\lambda\downarrow 0}\, [f_k(z_k+\lambda d_k)-f_k(z_k)]/\lambda = \langle\nabla f_0(z), d\rangle+\sum_{k=1}^N f_k'(z_k; d_k)$$

$$= \sum_{k=1}^N f'(z; (0,\dots,d_k,\dots,0)) \ge 0. \qquad\square$$

4. Convergence Analysis: I

Our first convergence result unifies and extends a result of Auslender (Ref. 4, p. 95) for the nondifferentiable convex case and some results of Grippo and Sciandrone (Ref. 27), Luenberger (Ref. 8, p. 159), and Zadeh (Ref. 26) for the differentiable case. In what follows, $r\equiv(N-1)\bmod N$ means $r = N-1, 2N-1, 3N-1, \dots$.

Theorem 4.1. Assume that the level set $X^0 = \{x : f(x)\le f(x^0)\}$ is compact and that f is continuous on $X^0$. Then, the sequence $\{x^r = (x^r_1,\dots,x^r_N)\}_{r=0,1,\dots}$ generated by the BCD method using the essentially cyclic rule is defined and bounded. Moreover, the following statements hold:

(a) If $f(x_1,\dots,x_N)$ is pseudoconvex in $(x_k, x_i)$ for every $i, k\in\{1,\dots,N\}$, and if f is regular at every $x\in X^0$, then every cluster point of $\{x^r\}$ is a stationary point of f.

(b) If $f(x_1,\dots,x_N)$ is pseudoconvex in $(x_k, x_i)$ for every $i, k\in\{1,\dots,N-1\}$, if f is regular at every $x\in X^0$, and if the cyclic rule is used, then every cluster point of $\{x^r\}_{r\equiv(N-1)\bmod N}$ is a stationary point of f.

(c) If $f(x_1,\dots,x_N)$ has at most one minimum in $x_k$ for $k = 2,\dots,N-1$, and if the cyclic rule is used, then every cluster point z of $\{x^r\}_{r\equiv(N-1)\bmod N}$ is a coordinatewise minimum point of f. In addition, if f is regular at z, then z is a stationary point of f.

Proof. Since $X^0$ is compact, an induction argument on r shows that $x^{r+1}$ is defined, $f(x^{r+1})\le f(x^r)$, and $x^{r+1}\in X^0$ for all $r = 0, 1, \dots$. Thus, $\{x^r\}$ is bounded. Consider any subsequence $\{x^r\}_{r\in R}$, with $R\subseteq\{0, 1, \dots\}$, converging to some z. For each $j\in\{1,\dots,T\}$, $\{x^{r-T+1+j}\}_{r\in R}$ is bounded, so by passing to a subsequence, if necessary, we can assume that

$$\{x^{r-T+1+j}\}_{r\in R} \text{ converges to some } z^j = (z^j_1,\dots,z^j_N), \quad j = 1,\dots,T.$$

Thus, $z^{T-1} = z$.

Since $\{f(x^r)\}$ converges monotonically and f is continuous on $X^0$, we obtain that

$$f(x^0)\ \ge\ \lim_{r\to\infty} f(x^r) = f(z^1) = \cdots = f(z^T). \tag{6}$$

By further passing to a subsequence, if necessary, we can assume that the index s chosen at iteration $r-T+1+j$, $j\in\{1,\dots,T\}$, is the same for all $r\in R$, which we denote by $s_j$.

For each $j\in\{1,\dots,T\}$, since $s_j$ is chosen at iteration $r-T+1+j$ for $r\in R$, then (2) and (3) yield

$$f(x^{r-T+1+j})\le f(x^{r-T+1+j}+(0,\dots,d_{s_j},\dots,0)), \quad \forall d_{s_j},\ j = 1,\dots,T,$$

$$x^{r-T+1+j}_k = x^{r-T+j}_k, \quad \forall k\ne s_j,\ j = 2,\dots,T.$$

Then, the continuity of f on $X^0$ yields in the limit that

$$f(z^j)\le f(z^j+(0,\dots,d_{s_j},\dots,0)), \quad \forall d_{s_j},\ j = 1,\dots,T,$$

$$z^j_k = z^{j-1}_k, \quad \forall k\ne s_j,\ j = 2,\dots,T. \tag{7}$$

Then, (6) and (7) yield

$$f(z^{j-1})\le f(z^{j-1}+(0,\dots,d_{s_j},\dots,0)), \quad \forall d_{s_j},\ j = 2,\dots,T. \tag{8}$$

(a), (b) Suppose that f is regular at every $x\in X^0$ and that $f(x_1,\dots,x_N)$ is pseudoconvex in $(x_k, x_i)$ for every $i, k\in\{s_1\}\cup\cdots\cup\{s_{T-1}\}$. This holds under the assumption (a) or under the assumption (b), with $\{x^r\}_{r\in R}$ being any convergent subsequence of $\{x^r\}_{r\equiv(N-1)\bmod N}$. We claim that, for $j = 1,\dots,T-1$,

$$f(z^j)\le f(z^j+(0,\dots,d_k,\dots,0)), \quad \forall d_k,\ \forall k = s_1,\dots,s_j. \tag{9}$$

By (7), (9) holds for $j = 1$. Suppose that (9) holds for $j = 1,\dots,l-1$ for some $l\in\{2,\dots,T-1\}$. We show that (9) holds for $j = l$. From (8), we have that

$$f(z^{l-1})\le f(z^{l-1}+(0,\dots,d_{s_l},\dots,0)), \quad \forall d_{s_l},$$

implying

$$f'(z^{l-1}; (0,\dots,z^l_{s_l}-z^{l-1}_{s_l},\dots,0))\ge 0.$$

Also, since (9) holds for $j = l-1$, we have that, for each $k = s_1,\dots,s_{l-1}$,

$$f'(z^{l-1}; (0,\dots,d_k,\dots,0))\ge 0, \quad \forall d_k.$$

Since by (6) $z^{l-1}\in X^0$, so f is regular at $z^{l-1}$, the above two relations imply

$$f'(z^{l-1}; (0,\dots,d_k,\dots,0)+(0,\dots,z^l_{s_l}-z^{l-1}_{s_l},\dots,0))\ge 0, \quad \forall d_k.$$

Since f is pseudoconvex in $(x_k, x_{s_l})$, this yields [also using $z^l = z^{l-1}+(0,\dots,z^l_{s_l}-z^{l-1}_{s_l},\dots,0)$] for $k = s_1,\dots,s_{l-1}$ that

$$f(z^l+(0,\dots,d_k,\dots,0))\ge f(z^{l-1}) = f(z^l), \quad \forall d_k.$$

Since we have also that (7) holds with $j = l$, we see that (9) holds for $j = l$. By induction, (9) holds for all $j = 1,\dots,T-1$.

Since $z^{T-1} = z$ and (9) holds for $j = T-1$, then (4) holds for $k = s_1,\dots,s_{T-1}$. Since $z^{T-1} = z$ and (8) holds (in particular, for $j = T$), then (4) holds for $k = s_T$ also. Since

$$\{1,\dots,N\} = \{s_1\}\cup\cdots\cup\{s_T\},$$

this implies that z is a coordinatewise minimum point of f. Since f is regular at z, then z is in fact a stationary point of f.

(c) Suppose that $f(x_1,\dots,x_N)$ has at most one minimum in $x_k$ for $k = s_2,\dots,s_{T-1}$. This holds under the assumption (c), with $\{x^r\}_{r\in R}$ being any convergent subsequence of $\{x^r\}_{r\equiv(N-1)\bmod N}$. For each $j = 2,\dots,T-1$, since (7) and (8) hold, then the function

$$d_{s_j}\mapsto f(z^j+(0,\dots,d_{s_j},\dots,0))$$

attains its minimum at both $d_{s_j} = 0$ and $d_{s_j} = z^{j-1}_{s_j}-z^j_{s_j}$. By assumption, the minimum point is unique, implying $0 = z^{j-1}_{s_j}-z^j_{s_j}$ or, equivalently, $z^{j-1} = z^j$. Thus, $z^1 = z^2 = \cdots = z^{T-1} = z$ and (7) yields that (4) holds for $k = s_1,\dots,s_{T-1}$. Since $z^{T-1} = z$ and (8) holds (in particular, for $j = T$), then (4) holds for $k = s_T$ also. Since

$$\{1,\dots,N\} = \{s_1\}\cup\cdots\cup\{s_T\},$$

this implies that z is a coordinatewise minimum point of f. If f is regular at z, then z is also a stationary point of f. $\square$

Notice that, if f is pseudoconvex, then f is pseudoconvex in $(x_k, x_i)$ for every $i, k\in\{1,\dots,N\}$; if f is quasiconvex and hemivariate in $x_k$, then f has at most one minimum in $x_k$. The converses do not hold. For example, the 2-variable Rosenbrock function has a unique minimum point but is not quasiconvex. The following 3-variable quadratic function

$$f(x_1, x_2, x_3) = (1/2)x_1^2+(1/2)x_2^2+(1/2)x_3^2+x_1x_3+x_2x_3-x_1x_2$$

is convex in every pair of variables, but is not pseudoconvex. In particular, for $x = (0, 0, 1/2)$ and $d = (1, 1, -1)$, we have $f'(x; d) = 1/2\ge 0$, while $f(x+d) = -7/8 < f(x) = 1/8$. This example generalizes to any quadratic function

$$f(x) = \langle x, Qx\rangle,$$

where $Q\in\Re^{N\times N}$ is symmetric, not positive semidefinite, but whose $2\times 2$ principal submatrices are positive semidefinite. Then, for any d satisfying $\langle d, Qd\rangle < 0$ and any x satisfying $0\le\langle x, Qd\rangle < -(1/2)\langle d, Qd\rangle$, we have that $f'(x; d)\ge 0$, while $f(x+d) < f(x)$.
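The 3-variable counterexample can be checked numerically; the short script below (our illustration) confirms that the directional derivative at x along d is +1/2 while f still drops from 1/8 to −7/8 at x + d, which is exactly the failure of pseudoconvexity.

```python
import numpy as np

def f(x):
    x1, x2, x3 = x
    return 0.5 * (x1**2 + x2**2 + x3**2) + x1 * x3 + x2 * x3 - x1 * x2

def grad(x):
    # gradient of the quadratic above
    return np.array([x[0] + x[2] - x[1],
                     x[1] + x[2] - x[0],
                     x[2] + x[0] + x[1]])

x = np.array([0.0, 0.0, 0.5])
d = np.array([1.0, 1.0, -1.0])
print(grad(x) @ d)       # f'(x; d) = 0.5 >= 0
print(f(x + d), f(x))    # -0.875 < 0.125, so f is not pseudoconvex
```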

Thus, parts (a) and (c) of Theorem 4.1 may be viewed as extensions of two results of Grippo and Sciandrone (Ref. 27, Propositions 5.2, 5.3) for the case of $f_0$ being continuously differentiable and each $f_k$ being the indicator function of some closed convex set. In turn, the first of these results extended a result of Zadeh (Ref. 26) for which $f_k\equiv 0$ for all k. Part (b) makes a less restrictive assumption on f than part (a), though its assumption on the BCD method is more restrictive. Part (b) is sharp in the sense that it is false if instead we assume that f is convex in every coordinate block. This is because the Powell 3-variable example (Ref. 28) is convex in each variable; see Ref. 27, Section 6 for further discussions of the example. We will see an application (Example 6.4) in which part (b) applies but not part (a) nor (c).

5. Convergence Analysis: II

The convergence analysis of the previous section assumes f to be continuous on a bounded level set and makes no use of the special structure (1) of f. In this section, we show that this assumption can be relaxed by exploiting the special structure (1), provided that f is quasiconvex and hemivariate in each coordinate block. More precisely, we will make the following assumptions on $f, f_0, f_1,\dots,f_N$:

(B1) $f_0$ is continuous on $\operatorname{dom} f_0$.

(B2) For each $k\in\{1,\dots,N\}$ and $(x_j)_{j\ne k}$, the function $x_k\mapsto f(x_1,\dots,x_N)$ is quasiconvex and hemivariate.

(B3) $f_0, f_1,\dots,f_N$ are lsc.

We will see some applications (Ref. 6, Section 3.4.3 and Examples 6.1–6.3) for which f satisfies this weaker assumption although it is not strictly convex. In addition, we will make one of the following technical assumptions on $f_0$:

(C1) $\operatorname{dom} f_0$ is open and $f_0$ tends to $\infty$ at every boundary point of $\operatorname{dom} f_0$.

(C2) $\operatorname{dom} f_0 = Y_1\times\cdots\times Y_N$, for some $Y_k\subseteq\Re^{n_k}$, $k = 1,\dots,N$.

In contrast to Assumption C1, Assumption C2 allows $f_0$ to have a finite value on $\operatorname{bdry}(\operatorname{dom} f_0)$. We will see in Example 6.2 a nonseparable function $f_0$ that satisfies Assumptions B1–B3 and C2, but not C1. We show below that Assumptions B1–B3, together with either Assumption C1 or C2, ensure that every cluster point of the iterates generated by the BCD method is a coordinatewise minimum point of f. The proof of this result is patterned after an argument given by Bertsekas and Tsitsiklis (Ref. 6, pp. 220–221; also see Ref. 27), but is complicated by the fact that f is not necessarily differentiable (or even continuous) on its effective domain.

Proposition 5.1. Suppose that $f, f_0, f_1,\dots,f_N$ satisfy Assumptions B1–B3 and that $f_0$ satisfies either Assumption C1 or C2. Also, assume that the sequence $\{x^r = (x^r_1,\dots,x^r_N)\}_{r=0,1,\dots}$ generated by the BCD method using the essentially cyclic rule is defined. Then, either $\{f(x^r)\}\downarrow -\infty$, or else every cluster point $z = (z_1,\dots,z_N)$ is a coordinatewise minimum point of f.

Proof. Since $f(x^0) < \infty$ and $f(x^{r+1})\le f(x^r)$ for all r, then either $\{f(x^r)\}\downarrow -\infty$, or else $\{f(x^r)\}$ converges to some limit and $\{f(x^{r+1})-f(x^r)\}\to 0$. Consider the latter case and let z be any cluster point of $\{x^r\}$. Since f is lsc by Assumption B3, we have

$$f(z)\ \le\ \lim_{r\to\infty} f(x^r)\ <\ \infty,$$

so $z\in\operatorname{dom} f$. We show below that z satisfies (4) for $k = 1,\dots,N$.

First, we claim that, for any infinite subsequence

$$\{x^r\}_{r\in R}\to z, \tag{10}$$

with $R\subseteq\{0, 1, \dots\}$, there holds that

$$\{x^{r+1}\}_{r\in R}\to z. \tag{11}$$

We prove this by contradiction. Suppose that this were not true. Then, there exist an infinite subsequence $R'$ of $R$ and a scalar $\epsilon > 0$ such that $\|x^{r+1}-x^r\|\ge\epsilon$ for all $r\in R'$. By further passing to a subsequence, if necessary, we can assume that there is some nonzero vector d for which

$$\{(x^{r+1}-x^r)/\|x^{r+1}-x^r\|\}_{r\in R'}\to d, \tag{12}$$

and that the same coordinate block, say $x_s$, is chosen at the $(r+1)$st iteration for all $r\in R'$. Moreover, (10) implies that $\{f_0(x^r)\}_{r\in R}$ and $\{f_k(x^r_k)\}_{r\in R}$, $k = 1,\dots,N$, are bounded from below, which together with the convergence of $\{f(x^r)\} = \{f_0(x^r)+\sum_{k=1}^N f_k(x^r_k)\}$ implies that $\{f_0(x^r)\}_{r\in R}$ and $\{f_k(x^r_k)\}_{r\in R}$, $k = 1,\dots,N$, are bounded. Hence, by further passing to a subsequence, if necessary, we can assume that there is some scalar $\theta$ for which

$$\{f_0(x^r)+f_s(x^r_s)\}_{r\in R'}\to\theta. \tag{13}$$

Fix any $\lambda\in[0, \epsilon]$. Let

$$\hat z = z+\lambda d, \tag{14}$$

and for each $r\in R'$, let

$$\hat x^r = x^r+\lambda(x^{r+1}-x^r)/\|x^{r+1}-x^r\|. \tag{15}$$

Then, by (10), (12), and (14),

$$\{\hat x^r\}_{r\in R'}\to\hat z. \tag{16}$$

For each $r\in R'$, we see from (2) that $x^{r+1}$ is obtained from $x^r$ by minimizing f with respect to $x_s$, while the other coordinates are held fixed. Since

$$\lambda/\|x^{r+1}-x^r\|\le\lambda/\epsilon\le 1,$$

so $\hat x^r$ lies on the line segment joining $x^r$ with $x^{r+1}$, this together with $f(x^{r+1})\le f(x^r)$ and the quasiconvexity of $x_s\mapsto f(x^r_1,\dots,x^r_{s-1}, x_s, x^r_{s+1},\dots,x^r_N)$ implies

$$f(\hat x^r)\le f(x^r), \quad \forall r\in R'.$$

Since f is lsc, this and (16) imply $\hat z\in\operatorname{dom} f$. Also, this and (1) and the observation that $x^r$ and $\hat x^r$ differ only in their sth coordinate block imply

$$f_0(\hat x^r)+f_s(\hat x^r_s)\le f_0(x^r)+f_s(x^r_s), \quad \forall r\in R'.$$

This combined with (13) yields

$$\limsup_{r\to\infty,\, r\in R'}\ \{f_0(\hat x^r)+f_s(\hat x^r_s)\}\le\theta. \tag{17}$$

Also, since $\{f(x^{r+1})-f(x^r)\}_{r\in R'}\to 0$, we have equivalently that

$$\{f_0(x^{r+1})+f_s(x^{r+1}_s)-f_0(x^r)-f_s(x^r_s)\}_{r\in R'}\to 0,$$

so (13) implies

$$\{f_0(x^{r+1})+f_s(x^{r+1}_s)\}_{r\in R'}\to\theta. \tag{18}$$

Let

$$\delta = f_0(\hat z)+f_s(\hat z_s)-\theta.$$

Since $f_0$ and $f_s$ are lsc, we have from (16), (17) that $\delta\le 0$. We claim that in fact $\delta = 0$. Suppose that this were not true, so that $\delta < 0$. By (16) and the observation that, for all $r\in R'$, $\hat x^r$ and $x^r$ differ in only their sth coordinate block, we have

$$\{(x^r_1,\dots,x^r_{s-1}, \hat z_s, x^r_{s+1},\dots,x^r_N)\}_{r\in R'}\to\hat z. \tag{19}$$

Moreover, the vector on the left-hand side of (19) is in $\operatorname{dom} f_0$ for all $r\in R'$ sufficiently large. Since $\hat z\in\operatorname{dom} f_0$, this is certainly true under Assumption C1; under Assumption C2, this is also true because $x^r\in\operatorname{dom} f_0$ for all r and $\operatorname{dom} f_0$ has a product structure corresponding to the coordinate blocks. Then, (18) together with (19) and the continuity of $f_0$ on $\operatorname{dom} f_0$ implies that, for all $r\in R'$ sufficiently large, there holds that

$$f_0(x^r_1,\dots,x^r_{s-1}, \hat z_s, x^r_{s+1},\dots,x^r_N)+f_s(\hat z_s)\le f_0(x^{r+1})+f_s(x^{r+1}_s)+\delta/2,$$

or equivalently [via (1) and the observation that $x^r$ and $x^{r+1}$ differ in only their sth coordinate block],

$$f(x^r_1,\dots,x^r_{s-1}, \hat z_s, x^r_{s+1},\dots,x^r_N)\le f(x^{r+1})+\delta/2,$$

a contradiction to the fact that $x^{r+1}$ is obtained from $x^r$ by minimizing f with respect to the sth coordinate block, while the other coordinates are held fixed. Hence, $\delta = 0$ and therefore

$$f_0(\hat z)+f_s(\hat z_s) = \theta.$$

Since the choice of $\lambda$ was arbitrary, we obtain [also using (14)]

$$f_0(z+\lambda d)+f_s(z_s+\lambda d_s) = \theta, \quad \forall\lambda\in[0, \epsilon],$$

where $d_s$ denotes the sth coordinate block of d. Since $x^r$ and $x^{r+1}$ differ in only their sth coordinate block for all $r\in R'$, then all coordinate blocks of d, except $d_s$, are zero [see (12)], and the above relation, together with (1), shows that $f(z+\lambda d)$ is constant (and finite) for all $\lambda\in[0, \epsilon]$, a contradiction to Assumption B2, namely, that f is hemivariate in the sth coordinate block. Hence, (11) holds.

Since (11) holds for any subsequence $\{x^r\}_{r\in R}$ of $\{x^r\}$ converging to z, we can apply (11) to the subsequence $\{x^{r+1}\}_{r\in R}$ to conclude that $\{x^{r+2}\}_{r\in R}\to z$ and so on, yielding

$$\{x^{r+j}\}_{r\in R}\to z, \quad \forall j = 0, 1, \dots, T, \tag{20}$$

where T is the bound specified in the essentially cyclic rule.

We claim that (20), together with Assumption C1 or C2, implies

$$f_0(z)+f_k(z_k)\le f_0(z_1,\dots,z_{k-1}, x_k, z_{k+1},\dots,z_N)+f_k(x_k), \tag{21}$$

for all $x_k$ and all $k\in\{1,\dots,N\}$. To see this, fix any $k\in\{1,\dots,N\}$. Since the coordinate blocks are chosen according to the essentially cyclic rule, there exist some $j\in\{1,\dots,T\}$ and an infinite subsequence $R'\subseteq R$ such that the coordinate block $x_k$ is chosen at the $(r+j)$th iteration for all $r\in R'$. Then, for each $r\in R'$, $x^{r+j}_k$ minimizes

$$f_0(x^{r+j}_1,\dots,x^{r+j}_{k-1}, x_k, x^{r+j}_{k+1},\dots,x^{r+j}_N)+f_k(x_k)$$

over all $x_k$ [see (1), (2), (3)], so that

$$f_0(x^{r+j})+f_k(x^{r+j}_k)\le f_0(x^{r+j}_1,\dots,x^{r+j}_{k-1}, x_k, x^{r+j}_{k+1},\dots,x^{r+j}_N)+f_k(x_k), \quad \forall x_k. \tag{22}$$

Fix any $x_k\in\operatorname{dom} f_k$ such that $(z_1,\dots,z_{k-1}, x_k, z_{k+1},\dots,z_N)\in\operatorname{dom} f_0$. Suppose that Assumption C1 holds, so $\operatorname{dom} f_0$ is open. Since $(z_1,\dots,z_{k-1}, x_k, z_{k+1},\dots,z_N)\in\operatorname{dom} f_0$, then (20) implies that

$$(x^{r+j}_1,\dots,x^{r+j}_{k-1}, x_k, x^{r+j}_{k+1},\dots,x^{r+j}_N)\in\operatorname{dom} f_0,$$

for all $r\in R'$ sufficiently large. Passing to the limit as $r\to\infty$, $r\in R'$, and using the lsc property of $f_k$ and the continuity of $f_0$ on the open set $\operatorname{dom} f_0$, we obtain from (20) and (22) that (21) holds. Suppose instead that Assumption C2 holds, so

$$\operatorname{dom} f_0 = Y_1\times\cdots\times Y_N, \quad \text{for some } Y_1\subseteq\Re^{n_1},\dots,Y_N\subseteq\Re^{n_N}.$$

Then, the first quantity on the right-hand side of (22) is finite for all $r\in R'$. Passing to the limit as $r\to\infty$, $r\in R'$, and using the lsc property of $f_k$ and the continuity of $f_0$ on $\operatorname{dom} f_0$, we obtain from (20) and (22) that (21) holds. If $x_k\notin\operatorname{dom} f_k$ or $(z_1,\dots,z_{k-1}, x_k, z_{k+1},\dots,z_N)\notin\operatorname{dom} f_0$, then the right-hand side of (21) has the extended value $\infty$, so (21) holds trivially. Since the above choice of k was arbitrary, this shows that (21) holds for all $x_k$ and all $k\in\{1,\dots,N\}$. Then, it follows from (1) that (4) holds for all $k = 1,\dots,N$. $\square$

Proposition 5.1 extends a result of Grippo and Sciandrone (Ref. 27, Proposition 5.1) for the special case where each $f_k$ is the indicator function for some closed convex set and $f_0$ is continuously differentiable and (block) coordinatewise strictly pseudoconvex. In turn, the latter result is an extension of a result of Bertsekas and Tsitsiklis (Ref. 6, Proposition 3.9 in Section 3.3.5), which assumes further $f_0$ to be convex. As a corollary of Proposition 5.1, we obtain the following convergence result for the BCD method.

Theorem 5.1. Suppose that $f, f_0, f_1,\dots,f_N$ satisfy Assumptions B1–B3 and that $f_0$ satisfies either Assumption C1 or C2. Also, assume that $\{x : f(x)\le f(x^0)\}$ is bounded. Then, the sequence $\{x^r\}$ generated by the BCD method using the essentially cyclic rule is defined, bounded, and every cluster point is a coordinatewise minimum point of f.

Theorem 5.1 extends a result of Auslender [see Theorem 1.2(a) in Ref. 4, p. 95] for the special case where $f_k$ is convex for all k, $\operatorname{dom} f_0 = Y_1\times\cdots\times Y_N$ for some closed convex sets $Y_k\subseteq\Re^{n_k}$, $k = 1,\dots,N$, and $f_0$ is strongly convex and continuous on $\operatorname{dom} f_0$.


6. Applications

We describe four interesting applications of the BCD method below.

In all applications, the objective function f is not necessarily strictly convex nor differentiable everywhere on its effective domain.

Example 6.1. Proximal Minimization Algorithm. Let $\psi:\Re^n\to\Re\cup\{\infty\}$ be a proper (i.e., $\psi\not\equiv\infty$) lsc function. Fix any scalar $c > 0$, and consider the proper lsc function f defined by

$$f(x, y) = c\|x-y\|^2+\psi(x).$$

Clearly, this function has the form (1) with

$$f_0(x, y) = c\|x-y\|^2, \quad f_1 = \psi, \quad f_2\equiv 0.$$

Applying the BCD method to f yields a method whereby f(x, y) is alternately minimized with respect to x and y. This method has the form

$$x^{r+1}\in\arg\min_x\ c\|x-x^r\|^2+\psi(x), \quad r = 0, 1, \dots,$$

which is the proximal minimization algorithm with fixed parameter c for minimizing $\psi$; see Ref. 6, Section 3.4.3 and Refs. 36–37 and references therein.

It is easily seen that $f, f_0, f_1, f_2$ satisfy Assumptions B1–B3 and that $f_0$ satisfies Assumptions A1 and C1. Moreover, f is regular everywhere on $\operatorname{dom} f$. Then, by Proposition 5.1, if $\psi$ is bounded below (so, f is bounded below), then every cluster point z of the iterates generated by the above proximal minimization algorithm is a stationary point of $\psi$, i.e.,

$$\psi'(z; d)\ge 0, \quad \text{for all } d.$$

Notice that Theorem 4.1 is not applicable here, since f need not be continuous on its level sets.
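As a minimal sketch of this alternating reading of the proximal algorithm, take the illustrative choice $\psi(x) = |x|$ in one dimension (our choice, not the paper's): the x-step is then a soft threshold and the y-step simply copies x.

```python
import numpy as np

def prox_abs(v, tau):
    # argmin_x tau*|x| + (1/2)(x - v)^2, i.e. the soft threshold
    return np.sign(v) * max(abs(v) - tau, 0.0)

c = 2.0
y = 5.0                               # y^0
for r in range(100):
    # x-step: argmin_x c(x - y)^2 + |x|  ==  soft threshold with tau = 1/(2c)
    x = prox_abs(y, 1.0 / (2.0 * c))
    # y-step: argmin_y c(x - y)^2  ==  x
    y = x
print(y)   # 0.0, the minimizer of psi(x) = |x|
```

Each sweep shrinks y toward 0 by 1/(2c), matching the fixed-parameter proximal iteration for this psi.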

Example 6.2. Arimoto–Blahut Algorithm. Let $P_{ij}$, $i = 1,\dots,n$, $j = 1,\dots,m$, be given nonnegative scalars satisfying

$$\sum_j P_{ij} = 1, \quad \text{for all } i.$$

The $P_{ij}$ may be viewed as probabilities. Consider the proper lsc function f defined by

$$f(x, y) = f_0(x, y)+f_1(x)+f_2(y),$$

where

$$f_0(x, y) = \begin{cases}\sum_{j=1}^m\sum_{i=1}^n P_{ij}\, x_i\,\varphi(y_{ij}/x_i), & \text{if } x\ge 0,\ y > 0,\\ \infty, & \text{otherwise,}\end{cases}$$

$$f_1(x) = \begin{cases}0, & \text{if } \sum_{i=1}^n x_i = 1,\\ \infty, & \text{otherwise,}\end{cases} \qquad f_2(y) = \begin{cases}0, & \text{if } \sum_{i=1}^n y_{ij} = 1,\ \forall j = 1,\dots,m,\\ \infty, & \text{otherwise,}\end{cases}$$

with $\varphi(t) = -\log(t)$. In our notation, x is a vector in $\Re^n$ whose ith coordinate is $x_i$, and y is a vector in $\Re^{nm}$ whose $((i-1)m+j)$th coordinate is $y_{ij}$. Applying the BCD method to f yields a method whereby f(x, y) is alternately minimized with respect to x and y. This in turn can be seen to be the Arimoto–Blahut algorithm for computing the capacity of a discrete memoryless communication channel (Refs. 11–12).

It can be verified that $f, f_0, f_1, f_2$ are convex and satisfy Assumptions B1–B3. Convexity of $f_0$ follows from observing that $(a, b)\mapsto a\varphi(b/a)$ is convex. Moreover, f has compact level sets and is continuous on each level set, and $f_0$ satisfies Assumptions A2 and C2. Notice that f is not strictly convex and $f_0$ does not satisfy Assumption A1 or C1. Thus, by Lemma 3.1 and Theorem 5.1 or Theorem 4.1(c), the sequence of iterates generated by the Arimoto–Blahut algorithm is bounded and each cluster point is a stationary point of f. By the convexity of f, this is in fact a minimum point of f. This result matches those obtained in Refs. 11–12. Analogous convergence results are obtained for variants of the Arimoto–Blahut algorithm, whereby we use, for example, $\varphi(t) = t\log(t)$ or $\varphi(t) = 1/t$.
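With $\varphi(t) = -\log t$, both block minimizations reduce to closed-form normalizations, which is the familiar Arimoto–Blahut iteration. The sketch below is our reconstruction of that iteration (it assumes $P_{ij} > 0$ throughout, purely for simplicity) and computes the capacity of a binary symmetric channel in nats.

```python
import numpy as np

def arimoto_blahut(P, iters=200):
    n = P.shape[0]
    x = np.full(n, 1.0 / n)                 # input distribution x^0
    for _ in range(iters):
        # y-step: min over y with sum_i y_ij = 1 gives y_ij = P_ij x_i / sum_i P_ij x_i
        y = P * x[:, None]
        y /= y.sum(axis=0, keepdims=True)
        # x-step: min over the unit simplex gives x_i proportional to exp(sum_j P_ij log y_ij)
        x = np.exp((P * np.log(y)).sum(axis=1))
        x /= x.sum()
    q = x @ P                               # induced output distribution
    return float((x[:, None] * P * np.log(P / q[None, :])).sum())  # mutual information, nats

# binary symmetric channel with crossover probability 0.1
P = np.array([[0.9, 0.1], [0.1, 0.9]])
capacity = arimoto_blahut(P)
entropy = -(0.1 * np.log(0.1) + 0.9 * np.log(0.9))
print(abs(capacity - (np.log(2) - entropy)) < 1e-9)   # True: matches log 2 - H(0.1)
```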

Example 6.3. Han Algorithm. Let f be the proper lsc convex function studied by Han [Ref. 30, (D′)],

$$f(x_1,\dots,x_N) = (1/2)\|x_1+\cdots+x_N-d\|^2+\sum_{k=1}^N f_k(x_k),$$

where d is a given vector in $\Re^m$ and each $f_k:\Re^m\to\Re\cup\{\infty\}$ is a proper lsc convex function. Also see Ref. 18 for a special case where $f_k$ is the support function of a closed convex set. Clearly, f is of the form (1) with

$$f_0(x_1,\dots,x_N) = (1/2)\|x_1+\cdots+x_N-d\|^2.$$

Han proposed in Ref. 30 an algorithm for minimizing f, which may be viewed as an instance of the BCD method using the cyclic rule, as was shown in Ref. 22.

It is seen easily that $f, f_0, f_1,\dots,f_N$ satisfy Assumptions B1–B3 and that $f_0$ satisfies Assumptions A1 and C1. Thus, by Lemma 3.1 and Proposition 5.1 [also see the remark following (3)], if f has a minimum point, then the iterates generated by the Han algorithm are defined and every cluster point is a minimum point of f. This result matches Proposition 4.3 in Ref. 30. On the other hand, by using the convexity of the functions, stronger convergence results can be obtained; see Refs. 22, 38.
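The cyclic-BCD reading of the Han algorithm can be sketched as follows: updating block k amounts to a proximal step on $f_k$ at d minus the sum of the other blocks. We take $f_k = \lambda_k\|\cdot\|_1$ purely for illustration (our choice; Han's setting has general proper lsc convex $f_k$, e.g., support or indicator functions), so each prox is a soft threshold.

```python
import numpy as np

soft = lambda v, tau: np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def han_bcd(d, lams, sweeps=20):
    """Cyclic BCD on f = (1/2)||x_1+...+x_N - d||^2 + sum_k lam_k ||x_k||_1."""
    N = len(lams)
    xs = [np.zeros_like(d) for _ in range(N)]
    for _ in range(sweeps):
        for k in range(N):                           # cyclic rule
            others = sum(xs[j] for j in range(N) if j != k)
            # argmin_x (1/2)||x + others - d||^2 + lam_k ||x||_1
            xs[k] = soft(d - others, lams[k])
    return xs

d = np.array([2.0, -0.3])
xs = han_bcd(d, [0.5, 1.0])
fval = (0.5 * np.sum((sum(xs) - d) ** 2)
        + 0.5 * np.abs(xs[0]).sum() + 1.0 * np.abs(xs[1]).sum())
print(round(fval, 6))   # 0.92, the optimal value (all weight on the cheaper block)
```

Here the optimum puts all the nonzero weight on the block with the smaller penalty, and the cyclic iteration reaches it after two block steps.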

Example 6.4. Blind Source Separation. In Ref. 33, Zibulevsky and Pearlmutter studied an optimization formulation of the blind source separ- ation, whereby an error term of the form

(1兾2σ2)兩兩ASAX兩兩2FC∑

j,t

fjt(stj), is minimized with respect to

A∈ℜmBn and SG[stj]j G1,...,n,t G1,...,T∈ℜnBT.

Here, X∈ℜmBT are the given data; 兩兩·兩兩F denotes the Frobenious norm;

σH0; and each fjt:ℜ>[0, S] is a proper convex function that is continu- ous on its effective domain and has bounded level sets. In Ref. 31, the particular choice of fjt(·)G兩·兩 is used. To ensure the existence of an optimal solution, it was suggested in Ref. 33 that constraints such as

be imposed, where Aidenotes the ith row of A. The objective function of this problem has the form (1) with NG1CnT,

f0(A, s11, . . . , sTn)G(1兾2σ2)兩兩ASAX兩兩2F, f1(A)G0, if兩兩Ai兩兩⁄1, i G1, . . . , m,

S, else,

and fjt, j = 1, . . . , n, t = 1, . . . , T, as given. Notice that minimizing f with respect to A entails minimizing a convex quadratic function over the Cartesian product of m Euclidean balls, while minimizing f with respect to each s_j^t entails minimizing the sum of a convex quadratic function of one variable and a convex function of one variable. Thus, the BCD method applied to this f can be implemented fairly inexpensively. If we replace (23) by the single ball constraint

‖A‖_F ≤ ρ,

for some fixed ρ > 0, then minimizing f with respect to A becomes minimizing a convex quadratic function over a single ball, which can be solved efficiently using, e.g., the Moré–Sorensen method.
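For the Frobenius-ball variant, stationarity of the Lagrangian gives A(SSᵀ + λσ²I) = XSᵀ with multiplier λ ≥ 0, and ‖A(λ)‖_F is decreasing in λ, so λ solves the secular equation ‖A(λ)‖_F = ρ. The sketch below is ours (it assumes SSᵀ is nonsingular and uses plain bisection rather than the safeguarded Newton iteration of the actual Moré–Sorensen method; all names are illustrative):

```python
import numpy as np

def a_update_ball(S, X, sigma, rho, tol=1e-10):
    """Minimize (1/(2*sigma**2)) * ||A S - X||_F**2  s.t.  ||A||_F <= rho,
    by bisection on lam in the stationarity system
        A (S S^T + lam*sigma^2 I) = X S^T,   lam >= 0."""
    G, B = S @ S.T, X @ S.T
    I = np.eye(G.shape[0])

    def A_of(lam):
        # solve A (G + lam*sigma^2 I) = B for A; G is symmetric
        return np.linalg.solve(G + lam * sigma**2 * I, B.T).T

    A = A_of(0.0)                    # unconstrained minimizer (G assumed nonsingular)
    if np.linalg.norm(A, 'fro') <= rho:
        return A
    lo, hi = 0.0, 1.0
    while np.linalg.norm(A_of(hi), 'fro') > rho:   # bracket the multiplier
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol * (1.0 + hi):              # bisect the secular equation
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(A_of(mid), 'fro') > rho:
            lo = mid
        else:
            hi = mid
    return A_of(hi)

# Tiny illustrative instance: the unconstrained minimizer is 2I with
# ||2I||_F = 2*sqrt(2) > rho, so the ball constraint is active.
A = a_update_ball(S=np.eye(2), X=2.0 * np.eye(2), sigma=1.0, rho=1.0)
```

Each trial λ costs one linear solve in the n×n matrix SSᵀ + λσ²I, which is cheap since n (the number of sources) is typically small.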

It is not difficult to see that f is continuous on its effective domain and has compact level sets. Moreover, f is convex in (s_1^1, . . . , s_n^T), f1 is convex, and f0 satisfies Assumption A1. Thus, by Lemma 3.1 and Theorem 4.1(b), the iterates generated by the BCD method using the cyclic rule are defined and every cluster point is a stationary point of f. Notice that f is not pseudoconvex in every pair of coordinate blocks and that f need not have at most one minimum in each s_j^t, so neither Theorem 5.1, nor part (a) of Theorem 4.1, nor part (c) of Theorem 4.1 is applicable here.

Instead of treating each s_j^t as a coordinate block, we can alternatively treat S = [s_j^t]_{j,t} as a single coordinate block. However, minimizing f with respect to S is more difficult. In the case fjt(·) = |·|, this would require solving a large convex quadratic programming problem. A comparison of a primal–dual interior-point method and the BCD method for solving such a problem is given in Ref. 32.
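By contrast, with each s_j^t as its own block and fjt(·) = |·|, the one-variable subproblem is a quadratic plus an absolute value, minimized in closed form by soft-thresholding. The following sketch of this coordinate update is ours (the function names and the completing-the-square derivation in the comments are not from Refs. 31–33):

```python
import numpy as np

def soft(z, tau):
    """Prox of tau*|.|: the unique minimizer of 0.5*(s - z)**2 + tau*|s|."""
    return np.sign(z) * max(abs(z) - tau, 0.0)

def update_s_jt(A, S, X, j, t, sigma):
    """Exact BCD update of the single coordinate s_j^t for
    f = (1/(2*sigma**2)) * ||A S - X||_F**2 + sum_{j,t} |s_j^t|.
    As a function of s = s_j^t alone, completing the square gives
        f(s) = (q/(2*sigma**2)) * (s - a.r/q)**2 + |s| + const,
    with a = column j of A, q = ||a||^2, and r the column-t residual
    excluding the (j, t) term; the minimizer is a soft-threshold step."""
    a = A[:, j]
    r = X[:, t] - A @ S[:, t] + a * S[j, t]   # residual excluding s_j^t
    q = a @ a
    S[j, t] = soft(a @ r / q, sigma**2 / q)
    return S
```

This is the same scalar update that underlies lasso-style coordinate descent, which is why the BCD method here is so inexpensive per pass.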

References

1. ORTEGA, J. M., and RHEINBOLDT, W. C., Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, New York, NY, 1970.

2. HILDRETH, C., A Quadratic Programming Procedure, Naval Research Logistics Quarterly, Vol. 4, pp. 79–85, 1957; see also Erratum, Naval Research Logistics Quarterly, Vol. 4, p. 361, 1957.

3. WARGA, J., Minimizing Certain Convex Functions, SIAM Journal on Applied Mathematics, Vol. 11, pp. 588–593, 1963.

4. AUSLENDER, A., Optimisation: Méthodes Numériques, Masson, Paris, France, 1976.

5. BERTSEKAS, D. P., Nonlinear Programming, 2nd Edition, Athena Scientific, Belmont, Massachusetts, 1999.

6. BERTSEKAS, D. P., and TSITSIKLIS, J. N., Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, Englewood Cliffs, New Jersey, 1989.

7. CENSOR, Y., and ZENIOS, S. A., Parallel Optimization: Theory, Algorithms, and Applications, Oxford University Press, Oxford, United Kingdom, 1997.

8. LUENBERGER, D. G., Linear and Nonlinear Programming, Addison–Wesley, Reading, Massachusetts, 1973.


9. POLAK, E., Computational Methods in Optimization: A Unified Approach, Academic Press, New York, NY, 1971.

10. ZANGWILL, W. I., Nonlinear Programming, Prentice-Hall, Englewood Cliffs, New Jersey, 1969.

11. ARIMOTO, S., An Algorithm for Computing the Capacity of Arbitrary DMCs, IEEE Transactions on Information Theory, Vol. 18, pp. 14–20, 1972.

12. BLAHUT, R., Computation of Channel Capacity and Rate Distortion Functions, IEEE Transactions on Information Theory, Vol. 18, pp. 460–473, 1972.

13. HOWSON, H. R., and SANCHO, N. G. F., A New Algorithm for the Solution of Multistate Dynamic Programming Problems, Mathematical Programming, Vol. 8, pp. 104–116, 1975.

14. KORSAK, A. J., and LARSON, R. E., A Dynamic Programming Successive Approximations Technique with Convergence Proofs, Automatica, Vol. 6, pp. 253–260, 1970.

15. ZUO, Z. Q., and WU, C. P., Successive Approximation Technique for a Class of Large-Scale NLP Problems and Its Application to Dynamic Programming, Journal of Optimization Theory and Applications, Vol. 62, pp. 515–527, 1989.

16. STERN, T. E., A Class of Decentralized Routing Algorithms Using Relaxation, IEEE Transactions on Communications, Vol. 25, pp. 1092–1102, 1977.

17. BREGMAN, L. M., The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming, USSR Computational Mathematics and Mathematical Physics, Vol. 7, pp. 200–217, 1967.

18. HAN, S. P., A Successive Projection Method, Mathematical Programming, Vol. 40, pp. 1–14, 1988.

19. KIWIEL, K. C., Free-Steering Relaxation Methods for Problems with Strictly Convex Costs and Linear Constraints, Mathematics of Operations Research, Vol. 22, pp. 326–349, 1997.

20. LUO, Z. Q., and TSENG, P., On the Convergence Rate of Dual Ascent Methods for Strictly Convex Minimization, Mathematics of Operations Research, Vol. 18, pp. 846–867, 1993.

21. TSENG, P., Dual Ascent Methods for Problems with Strictly Convex Costs and Linear Constraints: A Unified Approach, SIAM Journal on Control and Optimization, Vol. 28, pp. 214–242, 1990.

22. TSENG, P., Dual Coordinate Ascent Methods for Nonstrictly Convex Minimization, Mathematical Programming, Vol. 59, pp. 231–247, 1993.

23. MANGASARIAN, O. L., and DELEONE, R., Parallel Successive Overrelaxation Methods for Symmetric Linear Complementarity Problems and Linear Programs, Journal of Optimization Theory and Applications, Vol. 54, pp. 437–446, 1987.

24. CEA, J., and GLOWINSKI, R., Sur des Méthodes d'Optimisation par Relaxation, Revue Française d'Automatique, Informatique et Recherche Opérationnelle, Vol. R3, pp. 5–32, 1973.

25. SARGENT, R. W. H., and SEBASTIAN, D. J., On the Convergence of Sequential Minimization Algorithms, Journal of Optimization Theory and Applications, Vol. 12, pp. 567–575, 1973.


26. ZADEH, N., A Note on the Cyclic Coordinate Ascent Method, Management Science, Vol. 16, pp. 642–644, 1970.

27. GRIPPO, L., and SCIANDRONE, M., On the Convergence of the Block Nonlinear Gauss–Seidel Method under Convex Constraints, Operations Research Letters, Vol. 26, pp. 127–136, 2000.

28. POWELL, M. J. D., On Search Directions for Minimization Algorithms, Mathematical Programming, Vol. 4, pp. 193–201, 1973.

29. LUO, Z. Q., and TSENG, P., Error Bounds and Convergence Analysis of Feasible Descent Methods: A General Approach, Annals of Operations Research, Vol. 46, pp. 157–178, 1993.

30. HAN, S. P., A Decomposition Method and Its Application to Convex Programming, Mathematics of Operations Research, Vol. 14, pp. 237–248, 1989.

31. BOFILL, P., and ZIBULEVSKY, M., Sparse Underdetermined ICA: Estimating the Mixing Matrix and the Sources Separately, Technical Report UPC-DAC-2000-7, Universitat Politècnica de Catalunya, Barcelona, Spain, 1999.

32. SARDY, S., BRUCE, A., and TSENG, P., Block Coordinate Relaxation Methods for Nonparametric Wavelet Denoising, Journal of Computational and Graphical Statistics, Vol. 9, pp. 361–379, 2000.

33. ZIBULEVSKY, M., and PEARLMUTTER, B., Blind Source Separation by Sparse Decomposition, Technical Report CS99-1, Computer Science Department, University of New Mexico, Albuquerque, New Mexico, 1999.

34. MANGASARIAN, O. L., Nonlinear Programming, McGraw-Hill, New York, NY, 1969.

35. ROCKAFELLAR, R. T., Convex Analysis, Princeton University Press, Princeton, New Jersey, 1970.

36. MARTINET, B., Détermination Approchée d'un Point Fixe d'une Application Pseudo-Contractante: Cas de l'Application Prox, Comptes Rendus des Séances de l'Académie des Sciences, Vol. 274A, pp. 163–165, 1972.

37. ROCKAFELLAR, R. T., Augmented Lagrangians and Applications of the Proximal Point Algorithm in Convex Programming, Mathematics of Operations Research, Vol. 1, pp. 97–116, 1976.

38. BAUSCHKE, H. H., and LEWIS, A. S., Dykstra's Algorithm with Bregman Projections: A Convergence Proof, Optimization (to appear).
