
A PROXIMAL-LIKE ALGORITHM FOR A CLASS OF NONCONVEX PROGRAMMING

Jein-Shan Chen and Shaohua Pan

Abstract: In this paper, we study a proximal-like algorithm for minimizing a closed proper function $f(x)$ subject to $x \ge 0$, based on the iterative scheme $x^k \in \operatorname{argmin}\{f(x) + \mu_k d(x, x^{k-1})\}$, where $d(\cdot,\cdot)$ is an entropy-like distance function. The algorithm is well-defined under the assumption that the problem has a nonempty and bounded solution set. If, in addition, $f$ is a differentiable quasi-convex function (or $f$ is a differentiable function which is homogeneous with respect to a solution), we show that the sequence generated by the algorithm is convergent (or bounded), and furthermore, that it converges to a solution of the problem (or every accumulation point is a solution of the problem) when the parameter $\mu_k$ approaches zero. Preliminary numerical results are also reported, which further verify the theoretical results obtained.

Key words: proximal algorithm, entropy-like distance, quasi-convex, homogeneous

Mathematics Subject Classification: 26A27, 26B05, 26B35, 49J52, 90C33, 65K05

1 Introduction

The proximal point algorithm for minimizing a convex function $f(x)$ on $\mathbb{R}^n$ generates a sequence $\{x^k\}_{k\in\mathbb{N}}$ by the iterative scheme

$$x^k = \operatorname*{argmin}_{x\in\mathbb{R}^n}\left\{ f(x) + \mu_k \|x - x^{k-1}\|^2 \right\}, \tag{1.1}$$

where $\{\mu_k\}$ is a sequence of positive numbers and $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^n$. This method, which was originally introduced by Martinet [13], is based on the Moreau proximal approximation [14] of $f$, defined by

$$f_\lambda(x) = \inf_{u}\left\{ f(u) + \frac{1}{2\lambda}\|x-u\|^2 \right\}, \qquad \lambda > 0. \tag{1.2}$$

This proximal algorithm was then further developed and studied by Rockafellar [18, 19].

In 1992, Teboulle [16] introduced the so-called entropic proximal map by imitating the proximal map of Moreau, replacing the quadratic distance in (1.1)–(1.2) with the following entropy-like distance, also called the $\varphi$-divergence:

$$d_\varphi(x,y) = \sum_{i=1}^{n} y_i\,\varphi(x_i/y_i), \tag{1.3}$$

The author’s work is partially supported by National Taiwan Normal University.


where $\varphi : \mathbb{R}_+ \to \mathbb{R}$ is a closed proper strictly convex function satisfying certain conditions (see [2, 9, 10, 16, 17]). An important choice of $\varphi$ is $\varphi(t) = t\ln t - t + 1$, for which the corresponding $d_\varphi$ is the well-known Kullback-Leibler entropy function [5, 6, 7, 16] from statistics, which is where the "entropy" terminology stems from.
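For concreteness, substituting this choice of $\varphi$ into (1.3) gives the familiar Kullback-Leibler form (a routine expansion, shown here for the reader's convenience):

$$d_\varphi(x,y) = \sum_{i=1}^{n} y_i\left[\frac{x_i}{y_i}\ln\frac{x_i}{y_i} - \frac{x_i}{y_i} + 1\right] = \sum_{i=1}^{n}\left[ x_i\ln(x_i/y_i) + y_i - x_i \right].$$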

The algorithm associated with the $\varphi$-divergence for minimizing a convex function $f$ subject to the nonnegativity constraint $x \ge 0$ is given as follows:

$$x^0 > 0, \qquad x^k = \operatorname*{argmin}_{x \ge 0}\left\{ f(x) + \mu_k d_\varphi(x, x^{k-1}) \right\}, \tag{1.4}$$

where $\mu_k$ is the same as in (1.1). The algorithm in (1.4) is a proximal-like one and has been studied extensively for convex programming; see [9, 10, 16, 17] and the references therein. In fact, the algorithm (1.4) with $\varphi(t) = -\ln t + t - 1$ was first proposed by Eggermont in [6].

It is worthwhile to point out that the fundamental difference between (1.1) and (1.4) is that the term $d_\varphi$ forces the iterates $\{x^k\}_{k\in\mathbb{N}}$ to stay in the interior of the nonnegative orthant $\mathbb{R}^n_+$; namely, the algorithm in (1.4) automatically generates a positive sequence $\{x^k\}_{k\in\mathbb{N}}$. Similar extensions and convergence results for proximal-like methods using a Bregman distance have also been studied (see, for example, [3, 11, 12]). However, the analysis of the proximal-like method based on a Bregman distance does not carry over to the algorithm defined in (1.4), except for the case $\varphi(t) = t\ln t - t + 1$ where the two distances coincide. As explained in [17], this is due to the fact that one nice property [3, Lemma 3.1], which holds for Bregman distances, does not hold in general for $d_\varphi$. In addition, we also observe that the algorithm in (1.4) was adopted by [4] to solve the problem of minimizing a closed function $f$ over $\mathbb{R}^n$ without assuming convexity of $f$.

In this paper, we wish to employ the $\varphi$-divergence algorithm defined as in (1.4) with

$$\varphi(t) := -\ln t + t - 1 \quad (t > 0) \tag{1.5}$$

to solve the nonconvex optimization problem of the following form:

$$\begin{array}{ll} \min & f(x) \\ \text{s.t.} & x \ge 0, \end{array} \tag{1.6}$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is a closed proper function with $\operatorname{dom} f \supseteq \mathbb{R}^n_{++}$. This particular choice of $\varphi$, used for convex minimization, was also discussed in [9, 17]. Since we do not require convexity of $f$, the algorithm to be studied in this paper is as follows:

$$x^0 > 0, \qquad x^k \in \operatorname*{argmin}_{x \ge 0}\left\{ f(x) + \mu_k d(x, x^{k-1}) \right\}, \tag{1.7}$$

where $\{\mu_k\}$ is a sequence of positive numbers and $d(x,y)$ is specified as follows:

$$d(x,y) := \sum_{i=1}^{n} y_i\,\varphi(x_i/y_i) = \sum_{i=1}^{n} \left[ y_i \ln(y_i/x_i) + (x_i - y_i) \right]. \tag{1.8}$$
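To make the divergence concrete, here is a small Python sketch (our illustration, not from the paper; the function names are ours) that evaluates $d(x,y)$ of (1.8) and its gradient in the first argument, using the convention $0\ln 0 = 0$ adopted in Section 2:

```python
import numpy as np

def entropy_like_d(x, y):
    """d(x, y) = sum_i [ y_i*ln(y_i/x_i) + (x_i - y_i) ] for x > 0, y >= 0.

    The convention 0*ln(0) = 0 lets y have zero components."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    log_terms = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / x), 0.0)
    return float(np.sum(log_terms + (x - y)))

def grad_x_d(x, y):
    """Gradient of d(x, y) in x: d/dx_i = 1 - y_i/x_i = phi'(x_i/y_i)."""
    return 1.0 - np.asarray(y, float) / np.asarray(x, float)
```

A quick sanity check: `entropy_like_d(x, x)` returns 0 for any positive `x`, matching the distance-like property discussed after Property 2.7.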

The main purpose of this paper is to establish convergence results (see Propositions 3.3 and 3.4) for the algorithm (1.7)–(1.8) under some mild assumptions on the problem (1.6).


Throughout this paper, $\mathbb{R}^n$ denotes the space of $n$-dimensional real column vectors, $\mathbb{R}^n_+$ represents the nonnegative orthant in $\mathbb{R}^n$ with interior $\mathbb{R}^n_{++}$, and $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ denote the Euclidean inner product and the Euclidean norm, respectively. For a given function $f$ defined on $\mathbb{R}^n$, if $f$ is differentiable at $x$, the notation $\nabla f(x)$ denotes the gradient of $f$ at $x$, while $(\nabla f(x))_i$ means the $i$th partial derivative of $f$, i.e., the partial derivative with respect to $x_i$.

2 Preliminaries

In this section, we recall some preliminary results that will be used in the next section. We start with the definition of Fejér convergence to a nonempty set with respect to $d(\cdot,\cdot)$.

Definition 2.1. A sequence $\{x^k\}_{k\in\mathbb{N}} \subset \mathbb{R}^n_{++}$ is Fejér convergent to a nonempty set $U \subseteq \mathbb{R}^n_+$ with respect to the divergence $d(\cdot,\cdot)$ if $d(x^k, u) \le d(x^{k-1}, u)$ for each $k$ and any $u \in U$.

Given an extended real-valued function $f : \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$, denote its domain by $\operatorname{dom} f := \{x \in \mathbb{R}^n \mid f(x) < +\infty\}$. The function $f$ is said to be proper if $\operatorname{dom} f \ne \emptyset$ and $f(x) > -\infty$ for any $x \in \operatorname{dom} f$, and $f$ is called closed, which is equivalent to $f$ being lower semicontinuous, if its epigraph is a closed set. Next we present some properties of quasi-convex and homogeneous functions, and recall the definition of a stationary point of a constrained optimization problem.

Definition 2.2 ([1]). Let $f : \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ be a proper function. If

$$f(\alpha x + (1-\alpha)y) \le \max\{f(x), f(y)\}$$

for any $x, y \in \operatorname{dom} f$ and $\alpha \in (0,1)$, then $f$ is called quasi-convex.

It is easy to verify that any convex function is quasi-convex as well as strictly quasi-convex, but the converse is not true. For quasi-convex functions, we have the following results.

Lemma 2.3 ([1]). Let $f : \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ be a proper function. Then,

(a) $f$ is quasi-convex if and only if the level sets $L_f(\gamma) := \{x \in \operatorname{dom} f \mid f(x) \le \gamma\}$ are convex for all $\gamma \in \mathbb{R}$.

(b) If $f$ is differentiable on $\operatorname{dom} f$, then $f$ is quasi-convex if and only if $\langle \nabla f(y),\, x - y\rangle \le 0$ whenever $f(x) \le f(y)$ for any $x, y \in \operatorname{dom} f$.

Definition 2.4 ([15]). A proper function $f : \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ is called homogeneous with respect to $\bar{x} \in \operatorname{dom} f$ with exponent $\kappa > 0$ if for any $x \in \operatorname{dom} f$ and $\lambda \ge 0$,

$$f(\bar{x} + \lambda(x - \bar{x})) - f(\bar{x}) = \lambda^\kappa\,(f(x) - f(\bar{x})).$$

Lemma 2.5 ([15]). Assume that the proper function $f : \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ is differentiable on $\operatorname{dom} f$. If $f$ is homogeneous with respect to $\bar{x} \in \operatorname{dom} f$ with exponent $\kappa > 0$, then

$$f(x) - f(\bar{x}) = \kappa^{-1}\langle \nabla f(x),\, x - \bar{x}\rangle \qquad \forall x \in \operatorname{dom} f.$$
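As a quick illustration (our example, not from the paper), the quadratic $g(x) = \frac{1}{2}x^T M x$ used in Section 4 is homogeneous with respect to $\bar{x} = 0$ with exponent $\kappa = 2$:

$$g(0 + \lambda(x - 0)) - g(0) = \tfrac{1}{2}(\lambda x)^T M (\lambda x) = \lambda^2 g(x) = \lambda^2\big(g(x) - g(0)\big),$$

and, consistently with Lemma 2.5, $\kappa^{-1}\langle \nabla g(x), x - 0\rangle = \tfrac{1}{2}\langle Mx, x\rangle = g(x) - g(0)$.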


Definition 2.6. For a constrained optimization problem $\min_{x\in C} f(x)$, where $C \subseteq \mathbb{R}^n$ is a nonempty convex set, $x^*$ is called a stationary point if $\nabla f(x^*)^T(x - x^*) \ge 0$ for all $x \in C$.

In what follows, we focus on the properties of $\varphi$ given in (1.5) and of the induced function $d(\cdot,\cdot)$, which will be used in the subsequent analysis. First, we summarize some special properties of $\varphi$. Since their verifications are direct computations, we omit the details.

Property 2.7. Let $\varphi : \mathbb{R}_{++} \to \mathbb{R}$ be defined as in (1.5). Then, the following results hold.

(a) $\varphi(t) \ge 0$, and $\varphi(t) = 0$ if and only if $t = 1$.

(b) $\varphi(t)$ is decreasing on $(0,1)$ with $\lim_{t\to 0^+}\varphi(t) = +\infty$, and increasing on $(1,\infty)$ with $\lim_{t\to+\infty}\varphi(t) = +\infty$.

(c) $\varphi(1) = 0$, $\varphi'(1) = 0$, and $\varphi''(1) > 0$.

(d) $\varphi'(t)$ is nondecreasing on $(0,+\infty)$ with $\lim_{t\to+\infty}\varphi'(t) = 1$ and $\lim_{t\to 0^+}\varphi'(t) = -\infty$.

(e) $\varphi'(t) \le \ln t$ for all $t > 0$, and $\varphi'(t) > 0$ for $t \in (1,\infty)$.
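For reference, the omitted computations are one-liners: with $\varphi(t) = -\ln t + t - 1$,

$$\varphi'(t) = 1 - \frac{1}{t}, \qquad \varphi''(t) = \frac{1}{t^2} > 0,$$

from which (a)-(e) follow directly; for instance, $\varphi'(t) = 1 - 1/t \le \ln t$ is the standard logarithm inequality.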

From Property 2.7 (a) and the definition of $d(x,y)$, it is not hard to see that

$$d(x,y) \ge 0 \quad\text{and}\quad d(x,y) = 0 \iff x = y, \qquad \forall x, y \in \mathbb{R}^n_{++}.$$

This means that $d(\cdot,\cdot)$ can be viewed as a distance-like function, though $d(\cdot,\cdot)$ itself cannot be a distance since the triangle inequality does not hold in general. In fact, $d$ is a divergence measure on $\mathbb{R}^n_{++}\times\mathbb{R}^n_{++}$ (see [10]) that enjoys some favorable properties, for example, the ones given by Lemmas 2.8 and 2.9. In addition, we notice that $d : \mathbb{R}^n_{++}\times\mathbb{R}^n_{++} \to \mathbb{R}$ is a continuous function and can be continuously extended to $\mathbb{R}^n_{++}\times\mathbb{R}^n_+$ by adopting the convention $0\ln 0 = 0$, i.e., $d(\cdot,\cdot)$ admits points with zero components in its second argument.

The following two lemmas characterize some crucial properties of the divergence measure.

Lemma 2.8. Let $\varphi$ and $d$ be defined as in (1.5) and (1.8), respectively. Then,

(a) $d(x,z) - d(y,z) \ge \sum_{i=1}^{n} (z_i - y_i)\,\varphi'(y_i/x_i)$ for any $x, y \in \mathbb{R}^n_{++}$ and $z \in \mathbb{R}^n_+$;

(b) for any fixed $y \in \mathbb{R}^n_+$, the level sets $L_x(y,\gamma) := \{x \in \mathbb{R}^n_{++} \mid d(x,y) \le \gamma\}$ are bounded for all $\gamma \ge 0$;

(c) for any fixed $x \in \mathbb{R}^n_{++}$, the level sets $L_y(x,\gamma) := \{y \in \mathbb{R}^n_+ \mid d(x,y) \le \gamma\}$ are bounded for all $\gamma \ge 0$.

Proof. (a) By the definition of the function $d$, we can compute that

$$d(x,z) - d(y,z) = \sum_{i=1}^{n}\left[ z_i\ln(z_i/x_i) + x_i - z_i \right] - \sum_{i=1}^{n}\left[ z_i\ln(z_i/y_i) + y_i - z_i \right] = \sum_{i=1}^{n}\left[ z_i\ln(y_i/x_i) + x_i - y_i \right]. \tag{2.1}$$

Since $\varphi'(y_i/x_i) = 1 - x_i/y_i$, we have $y_i\,\varphi'(y_i/x_i) = y_i - x_i$, i.e.,

$$x_i - y_i = -y_i\,\varphi'(y_i/x_i). \tag{2.2}$$


In addition, using Property 2.7 (e) with $t = y_i/x_i$ and noting that $z_i \ge 0$ for all $i$, we readily have

$$z_i\ln(y_i/x_i) \ge z_i\,\varphi'(y_i/x_i). \tag{2.3}$$

From equations (2.1)–(2.3), we immediately obtain that

$$d(x,z) - d(y,z) \ge \sum_{i=1}^{n}\left[ z_i\,\varphi'(y_i/x_i) - y_i\,\varphi'(y_i/x_i) \right] = \sum_{i=1}^{n} (z_i - y_i)\,\varphi'(y_i/x_i).$$

(b) The boundedness of the level sets $L_x(y,\gamma)$ for all $\gamma \ge 0$ follows directly from Property 2.7 (b).

(c) Let $\psi(t) := t\ln t - t + 1$ $(t \ge 0)$. By the definition of $d(\cdot,\cdot)$, we can verify that

$$d(x,y) = \sum_{i=1}^{n} x_i\,\psi(y_i/x_i) \qquad \forall x \in \mathbb{R}^n_{++} \text{ and } y \in \mathbb{R}^n_+.$$

Thus, for any fixed $x \in \mathbb{R}^n_{++}$, to show that the sets $L_y(x,\gamma)$ are bounded for all $\gamma \ge 0$, it suffices to prove that $\psi$ has bounded level sets, which is clear since $\lim_{t\to+\infty}\psi(t) = +\infty$.

Lemma 2.9. Given any two sequences $\{y^k\}_{k\in\mathbb{N}} \subset \mathbb{R}^n_{++}$ and $\{x^k\}_{k\in\mathbb{N}} \subseteq \mathbb{R}^n_+$,

(a) if $\{y^k\}_{k\in\mathbb{N}}$ converges to $\bar{y} \in \mathbb{R}^n_+$, then $\lim_{k\to+\infty} d(y^k, \bar{y}) = 0$;

(b) if $\{y^k\}_{k\in\mathbb{N}}$ is bounded and $\{x^k\}_{k\in\mathbb{N}}$ is such that $\lim_{k\to+\infty} d(y^k, x^k) = 0$, then we have $\lim_{k\to+\infty}\|y^k - x^k\| = 0$.

Proof. (a) From the definition of $d(\cdot,\cdot)$, it follows that

$$d(y^k, \bar{y}) = \sum_{i=1}^{n}\left[ \bar{y}_i\ln\bar{y}_i - \bar{y}_i\ln y^k_i + (y^k_i - \bar{y}_i) \right].$$

For any $i \in \{1, 2, \ldots, n\}$, if $\bar{y}_i = 0$, clearly $\bar{y}_i\ln\bar{y}_i - \bar{y}_i\ln y^k_i + (y^k_i - \bar{y}_i) \to 0$ as $k \to +\infty$; if $\bar{y}_i > 0$, then $\ln(\bar{y}_i/y^k_i) \to 0$ and $(y^k_i - \bar{y}_i) \to 0$ since $\{y^k_i\} \to \bar{y}_i$, which means that $\bar{y}_i\ln(\bar{y}_i/y^k_i) + (y^k_i - \bar{y}_i) \to 0$ as $k \to +\infty$. The two cases together yield $\lim_{k\to+\infty} d(y^k, \bar{y}) = 0$.

(b) First, by Lemma 2.8 (c) and the fact that $\lim_{k\to+\infty} d(y^k, x^k) = 0$, we may verify that $\{x^k\}_{k\in\mathbb{N}}$ is bounded. Now suppose $\lim_{k\to+\infty}\|y^k - x^k\| \ne 0$. Then, there exist an $\varepsilon > 0$ and a subsequence $\{y^{\sigma(k)}\}_{k\in\mathbb{N}}$ such that $\|y^{\sigma(k)} - x^{\sigma(k)}\| \ge 3\varepsilon$ for all sufficiently large $k$. Since $\{y^{\sigma(k)}\}_{k\in\mathbb{N}}$ is bounded, we can extract a convergent subsequence; without loss of generality, we still write $\{y^{\sigma(k)}\}_{k\in\mathbb{N}} \to y^*$ for this subsequence. From the triangle inequality, it then follows that

$$\|y^{\sigma(k)} - y^*\| + \|y^* - x^{\sigma(k)}\| \ge \|y^{\sigma(k)} - x^{\sigma(k)}\| \ge 3\varepsilon.$$

Since $\{y^{\sigma(k)}\}_{k\in\mathbb{N}} \to y^*$, there exists a positive integer $K$ such that $\|y^{\sigma(k)} - y^*\| \le \varepsilon$ for $k \ge K$. Thus, we have $\|y^* - x^{\sigma(k)}\| \ge 3\varepsilon - \|y^{\sigma(k)} - y^*\| \ge 2\varepsilon$ for $k \ge K$. On the other hand, the boundedness of $\{x^{\sigma(k)}\}_{k\in\mathbb{N}}$ implies that there is a further convergent subsequence $\{x^{\sigma(\gamma(k))}\} \to x^*$. By the same arguments as above, we then obtain $\|y^* - x^*\| \ge \varepsilon$. However, $\lim_{k\to+\infty} d(y^k, x^k) = 0$ yields $\lim_{k\to+\infty} d(y^{\sigma(\gamma(k))}, x^{\sigma(\gamma(k))}) = 0$. Therefore, from the continuity of $d$ and the boundedness of the sequences, we have $d(y^*, x^*) = 0$, which implies $y^* = x^*$. This is a contradiction, and the proof is complete.


3 Main Results

In this section, we establish the convergence results of the proximal-like algorithm (1.7)–(1.8). First, we show that the algorithm is well-defined under the following assumption:

(A1) The solution set of problem (1.6), denoted by $X^*$, is nonempty and bounded.

Lemma 3.1. Let $d$ be defined as in (1.8). Then, under assumption (A1),

(a) the sequence $\{x^k\}_{k\in\mathbb{N}}$ generated by (1.7)–(1.8) is well-defined;

(b) $\{f(x^k)\}_{k\in\mathbb{N}}$ is a decreasing and convergent sequence.

Proof. (a) The proof proceeds by induction. Clearly, when $k = 0$, the conclusion holds since $x^0 > 0$. Suppose that $x^{k-1}$ is well-defined. Let $f^*$ be the optimal value of (1.6). Then, from the iterative scheme (1.7) and the nonnegativity of $d$, it follows that

$$f(x) + \mu_k d(x, x^{k-1}) \ge f^* + \mu_k d(x, x^{k-1}) \quad \text{for all } x \in \mathbb{R}^n_{++}. \tag{3.1}$$

Let $f_k(x) := f(x) + \mu_k d(x, x^{k-1})$ and denote its level sets by

$$L_{f_k}(\gamma) := \{x \in \mathbb{R}^n_{++} : f_k(x) \le \gamma\} \quad \text{for any } \gamma \in \mathbb{R}.$$

Using the inequality in (3.1), we have $L_{f_k}(\gamma) \subseteq L_x(x^{k-1}, \mu_k^{-1}(\gamma - f^*))$. This, together with Lemma 2.8 (b), implies that $L_{f_k}(\gamma)$ is bounded for any $\gamma \ge f^*$. Notice that $L_{f_k}(\gamma) \subseteq X^*$ for any $\gamma \le f^*$ since $\mu_k d(x, x^{k-1}) \ge 0$, and consequently $L_{f_k}(\gamma)$ is also bounded in this case by assumption (A1). This shows that the level sets of $f_k(x)$ are bounded. Also, $f_k(x)$ is lower semicontinuous on $\mathbb{R}^n$. Therefore, the level sets of $f_k(x)$ are compact. Using the lower semicontinuity of $f_k(x)$ again, we conclude that $f_k(x)$ has a global minimum, which may not be unique due to the nonconvexity of $f$; in that case, $x^k$ can be chosen arbitrarily from the set of minimizers of $f_k(x)$. The sequence $\{x^k\}_{k\in\mathbb{N}}$ is thus well-defined.

(b) From the iterative scheme in (1.7), it readily follows that

$$f(x^k) + \mu_k d(x^k, x^{k-1}) \le f(x) + \mu_k d(x, x^{k-1}) \qquad \forall x \in \mathbb{R}^n_{++}. \tag{3.2}$$

Setting $x = x^{k-1}$ in the last inequality, we obtain that

$$f(x^k) + \mu_k d(x^k, x^{k-1}) \le f(x^{k-1}) + \mu_k d(x^{k-1}, x^{k-1}) = f(x^{k-1}),$$

which, by the nonnegativity of $d$ and $\mu_k$, implies that

$$0 \le \mu_k d(x^k, x^{k-1}) \le f(x^{k-1}) - f(x^k).$$

This shows that $\{f(x^k)\}_{k\in\mathbb{N}}$ is a decreasing sequence; furthermore, it is convergent since it is bounded below by the optimal value $f^*$, which is finite by assumption (A1). The proof is thus completed.

By Lemma 3.1 (b), let $\beta := \lim_{k\to+\infty} f(x^k)$ and define the set

$$U := \{x \in \mathbb{R}^n_+ \mid f(x) \le \beta\}. \tag{3.3}$$

Clearly, $X^* \subseteq U$, and consequently $U$ is nonempty by assumption (A1). In what follows, we show that the sequence $\{x^k\}_{k\in\mathbb{N}}$ generated by (1.7)–(1.8) is Fejér convergent to $U$ with respect to $d$ under the following additional assumption on $f$:


(A2) $f$ is a quasi-convex function which is differentiable on $\operatorname{dom} f$.

Lemma 3.2. Let $\{\mu_k\}$ be an arbitrary sequence of positive numbers and $\{x^k\}_{k\in\mathbb{N}}$ be the sequence generated by (1.7)–(1.8). Then, under assumptions (A1) and (A2),

(a) $d(x^k, x) \le d(x^{k-1}, x)$ for any $x \in \mathbb{R}^n_+$ such that $f(x) \le f(x^k)$;

(b) $\{x^k\}_{k\in\mathbb{N}}$ is Fejér convergent to the set $U$ with respect to $d$;

(c) for any $x \in U$, the sequence $\{d(x^k, x)\}_{k\in\mathbb{N}}$ is convergent.

Proof. (a) For any $x \in \mathbb{R}^n_+$ satisfying $f(x) \le f(x^k)$, from Lemma 2.3 (b) it follows that

$$\nabla f(x^k)^T(x - x^k) \le 0.$$

In addition, since $x^k$ is a minimizer of the function $f(x) + \mu_k d(x, x^{k-1})$, we have

$$\nabla f(x^k) + \mu_k \nabla_x d(x^k, x^{k-1}) = 0. \tag{3.4}$$

Combining the last two relations, we obtain that

$$\mu_k\langle \nabla_x d(x^k, x^{k-1}),\, x - x^k\rangle \ge 0 \qquad \forall x \in \mathbb{R}^n_+ \text{ such that } f(x) \le f(x^k).$$

Noting that

$$\nabla_x d(x^k, x^{k-1}) = \big(\varphi'(x^k_1/x^{k-1}_1), \ldots, \varphi'(x^k_n/x^{k-1}_n)\big)^T,$$

and using Lemma 2.8 (a) with $x = x^{k-1}$, $y = x^k$ and $z = x$, it follows that

$$d(x^{k-1}, x) - d(x^k, x) \ge \langle \nabla_x d(x^k, x^{k-1}),\, x - x^k\rangle \ge 0$$

for any $x \in \mathbb{R}^n_+$ satisfying $f(x) \le f(x^k)$. The desired result follows.

(b) By the definition of $U$, any $x \in U$ satisfies $f(x) \le \beta \le f(x^k)$ for all $k$, since $\{f(x^k)\}_{k\in\mathbb{N}}$ is decreasing with limit $\beta$. The claim is then direct from part (a) and Definition 2.1.

(c) The claim follows directly from part (b), since the sequence $\{d(x^k, x)\}_{k\in\mathbb{N}}$ is nonincreasing and bounded below by the nonnegativity of $d$.

Now we are in a position to establish the convergence results of the algorithm.

Proposition 3.3. Let $\{\mu_k\}_{k\in\mathbb{N}}$ be an arbitrary sequence of positive numbers and $\{x^k\}_{k\in\mathbb{N}}$ be generated by (1.7)–(1.8). If assumptions (A1) and (A2) hold, then $\{x^k\}_{k\in\mathbb{N}}$ converges, and

(a) if there exist $\hat{\mu}$ and $\bar{\mu}$ such that $0 < \hat{\mu} < \mu_k \le \bar{\mu}$ for each $k$, then

$$\lim_{k\to+\infty} (\nabla f(x^k))_i \ge 0, \qquad \lim_{k\to+\infty} (\nabla f(x^k))_i\,(x_i - x^k_i) \ge 0$$

for any $x \in \mathbb{R}^n_+$ and $i = 1, 2, \ldots, n$;

(b) if $\lim_{k\to+\infty}\mu_k = 0$, then $\{x^k\}_{k\in\mathbb{N}}$ converges to a solution of (1.6).

Proof. We first prove that the sequence $\{x^k\}_{k\in\mathbb{N}}$ is convergent. By Lemma 3.2 (b), $\{x^k\}_{k\in\mathbb{N}}$ is Fejér convergent to the set $U$ with respect to $d$, which in turn implies that

$$\{x^k\}_{k\in\mathbb{N}} \subseteq \{y \in \mathbb{R}^n_{++} \mid d(y, x) \le d(x^0, x)\} \qquad \forall x \in U.$$

From Lemma 2.8 (b), the set on the right-hand side is bounded, and consequently $\{x^k\}_{k\in\mathbb{N}}$ is bounded. Let $\bar{x}$ be an accumulation point of $\{x^k\}_{k\in\mathbb{N}}$ and $\{x^{k_j}\}$ be a subsequence converging to $\bar{x}$. From the continuity of $f$, it then follows that

$$\lim_{j\to+\infty} f(x^{k_j}) = f(\bar{x}),$$

which, by the definition of $U$, implies that $\bar{x} \in U$. Using Lemma 3.2 (c), the sequence $\{d(x^k, \bar{x})\}_{k\in\mathbb{N}}$ is convergent. Notice that $\lim_{j\to+\infty} d(x^{k_j}, \bar{x}) = 0$ by Lemma 2.9 (a), and therefore $\lim_{k\to+\infty} d(x^k, \bar{x}) = 0$. Applying Lemma 2.9 (b) with $y^k = x^k$ and the constant sequence $x^k \equiv \bar{x}$, we conclude that $\{x^k\}_{k\in\mathbb{N}}$ converges to $\bar{x}$.

(a) By the iterative formula in (1.7), $x^k$ is a minimizer of $f(x) + \mu_k d(x, x^{k-1})$, and hence $\nabla f(x^k) = -\mu_k \nabla_x d(x^k, x^{k-1})$, which means that

$$(\nabla f(x^k))_i = -\mu_k\,\varphi'(x^k_i/x^{k-1}_i), \qquad i = 1, 2, \ldots, n. \tag{3.5}$$

Let $\bar{x}$ be the limit of the sequence $\{x^k\}_{k\in\mathbb{N}}$. Define the index sets

$$I(\bar{x}) := \{i \in \{1, 2, \ldots, n\} \mid \bar{x}_i > 0\} \quad\text{and}\quad J(\bar{x}) := \{i \in \{1, 2, \ldots, n\} \mid \bar{x}_i = 0\}.$$

Clearly, these two disjoint sets form a partition of $\{1, 2, \ldots, n\}$. We proceed by considering the cases $i \in I(\bar{x})$ and $i \in J(\bar{x})$.

Case (1): If $i \in I(\bar{x})$, then $\lim_{k\to+\infty} x^k_i/x^{k-1}_i = 1$ by the convergence of $\{x^k\}_{k\in\mathbb{N}}$. Notice that $\varphi'$ is continuous on $(0,+\infty)$ with $\varphi'(1) = 0$, and hence

$$\lim_{k\to+\infty} \varphi'(x^k_i/x^{k-1}_i) = 0.$$

This, together with (3.5) and the boundedness of $\{\mu_k\}_{k\in\mathbb{N}}$, immediately yields

$$\lim_{k\to+\infty} (\nabla f(x^k))_i = 0.$$

Consequently, by the convergence of $\{x^k\}_{k\in\mathbb{N}}$, we have $\lim_{k\to+\infty} (\nabla f(x^k))_i\,(x_i - x^k_i) = 0$.

Case (2): If $i \in J(\bar{x})$, we show that $\lim_{k\to+\infty} (\nabla f(x^k))_i \ge 0$. Suppose this does not hold. Then $(\nabla f(x^k))_i < 0$ for all sufficiently large $k$. From (3.5) and the assumption on $\{\mu_k\}_{k\in\mathbb{N}}$, it then follows that $\varphi'(x^k_i/x^{k-1}_i) > 0$ for sufficiently large $k$. Thus, by Property 2.7 (c)–(d), $x^k_i > x^{k-1}_i$ for all sufficiently large $k$, which contradicts the fact that $\{x^k\}_{k\in\mathbb{N}} \to \bar{x}$ with $\bar{x}_i = 0$ for all $i \in J(\bar{x})$. Consequently, $\lim_{k\to+\infty} (\nabla f(x^k))_i \ge 0$. Noting that $\lim_{k\to+\infty} (x_i - x^k_i) \ge 0$ since $x_i \ge 0$ and $\lim_{k\to+\infty} x^k_i = \bar{x}_i = 0$, we readily have $\lim_{k\to+\infty} (\nabla f(x^k))_i\,(x_i - x^k_i) \ge 0$.

(b) Since $x^k$ is a minimizer of $f(x) + \mu_k d(x, x^{k-1})$, we have

$$f(x^k) + \mu_k d(x^k, x^{k-1}) \le f(x) + \mu_k d(x, x^{k-1}) \qquad \forall x \in \mathbb{R}^n_{++},$$

which, by the nonnegativity of $d$, implies that

$$f(x^k) \le f(x) + \mu_k d(x, x^{k-1}) \qquad \forall x \in \mathbb{R}^n_{++}. \tag{3.6}$$


Taking the limit $k \to +\infty$ in this inequality and using the continuity of $f$ yields

$$f(\bar{x}) \le f(x) \qquad \forall x \in \mathbb{R}^n_{++},$$

since $\lim_{k\to+\infty} x^k = \bar{x}$, $\lim_{k\to+\infty}\mu_k = 0$, and the sequence $\{d(x, x^{k-1})\}_{k\in\mathbb{N}}$ is bounded by Lemma 2.8 (c). This, together with the continuity of $f$, means that $f(\bar{x}) \le f(x)$ for any $x \in \mathbb{R}^n_+$, and consequently $\bar{x} \in X^*$. We thus complete the proof.

From Proposition 3.3 (a), we see that the sequence $\{x^k\}_{k\in\mathbb{N}}$ converges to a stationary point of (1.6), without requiring $\mu_k \to 0$, provided $f$ is continuously differentiable and quasi-convex.

When $f$ is not quasi-convex, the following proposition states that the sequence $\{x^k\}_{k\in\mathbb{N}}$ generated by (1.7)–(1.8) is bounded and every limit point is a solution of problem (1.6) under (A1) and the following additional assumption:

(A3) $f$ is differentiable on $\operatorname{dom} f$ and homogeneous with respect to a solution of problem (1.6) with exponent $\kappa > 0$.

Proposition 3.4. Let $\{\mu_k\}_{k\in\mathbb{N}}$ be any sequence of positive numbers and $\{x^k\}_{k\in\mathbb{N}}$ be generated by (1.7)–(1.8). If assumptions (A1) and (A3) hold, then $\{x^k\}_{k\in\mathbb{N}}$ is bounded, and

(a) if there exist $\hat{\mu}$ and $\bar{\mu}$ such that $0 < \hat{\mu} < \mu_k \le \bar{\mu}$ for each $k$, then

$$\lim_{k\to+\infty} (\nabla f(x^k))_i \ge 0, \qquad \lim_{k\to+\infty} (\nabla f(x^k))_i\,(x_i - x^k_i) \ge 0$$

for any $x \in \mathbb{R}^n_+$ and $i = 1, 2, \ldots, n$;

(b) if $\lim_{k\to+\infty}\mu_k = 0$, then every limit point of $\{x^k\}_{k\in\mathbb{N}}$ is an optimal solution of (1.6).

Proof. Let $x^* \in X^*$ be such that $f$ is homogeneous with respect to $x^*$ with exponent $\kappa > 0$. Then, using Lemma 2.5, it follows that

$$0 \le f(x^k) - f(x^*) = \kappa^{-1}\langle \nabla f(x^k),\, x^k - x^*\rangle \qquad \forall k.$$

This, together with (3.4), means that

$$\mu_k\langle \nabla_x d(x^k, x^{k-1}),\, x^* - x^k\rangle \ge 0 \qquad \forall k.$$

Using Lemma 2.8 (a) with $x = x^{k-1}$, $y = x^k$ and $z = x^*$ then yields that

$$d(x^{k-1}, x^*) - d(x^k, x^*) \ge \langle \nabla_x d(x^k, x^{k-1}),\, x^* - x^k\rangle \ge 0 \qquad \forall k.$$

This shows that the sequence $\{d(x^k, x^*)\}_{k\in\mathbb{N}}$ is monotonically decreasing, and hence

$$\{x^k\}_{k\in\mathbb{N}} \subseteq \{y \in \mathbb{R}^n_{++} \mid d(y, x^*) \le d(x^0, x^*)\}.$$

Using Lemma 2.8 (b), we conclude that $\{x^k\}_{k\in\mathbb{N}}$ is bounded. Let $\bar{x}$ be an arbitrary limit point of $\{x^k\}_{k\in\mathbb{N}}$ and $\{x^{k_j}\}$ be a subsequence such that $\lim_{j\to+\infty} x^{k_j} = \bar{x}$. Part (a) then follows by the same arguments as in Proposition 3.3 (a). Notice that the inequality (3.6) still holds, and consequently we have

$$f(x^{k_j}) \le f(x) + \mu_{k_j} d(x, x^{k_j-1}) \qquad \forall x \in \mathbb{R}^n_{++}.$$

Taking the limit $j \to +\infty$ and using the same arguments as in Proposition 3.3 (b), we obtain

$$f(\bar{x}) \le f(x) \qquad \forall x \in \mathbb{R}^n_+.$$

This shows that $\bar{x}$ is a solution of problem (1.6). The proof is thus completed.


4 Numerical Experiments

In this section, we verify the theoretical results obtained above by applying the algorithm (1.7)–(1.8) to some differentiable quasi-convex optimization problems of the form (1.6). Since $f$ is quasi-convex if and only if the level sets of $f$ are convex, we generate the quasi-convex function $f(x)$ by composing a convex quadratic function $g(x) := \frac{1}{2}x^T M x$ with a monotonically increasing but nonconvex function $h : \mathbb{R} \to \mathbb{R}$, i.e., $f(x) = h(g(x))$, where $M \in \mathbb{R}^{n\times n}$ is a given symmetric positive semidefinite matrix.
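To spell out the level-set argument behind this construction (a standard fact, added here for completeness): since $h$ is increasing, every level set of $f = h \circ g$ has the form

$$\{x \mid h(g(x)) \le \gamma\} = \{x \mid g(x) \le t_\gamma\}, \qquad t_\gamma := \sup\{t \mid h(t) \le \gamma\},$$

which is convex because $g$ is convex; by Lemma 2.3 (a), $f$ is therefore quasi-convex.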

In our experiments, the matrix $M$ was obtained by setting $M = NN^T$, where $N$ is a square matrix whose nonzero elements were drawn from a normal distribution with mean $-1$ and unit variance. The number of nonzero elements of $N$ was chosen so that $M$ attains approximately the desired nonzero density. The function $h$ is specified below.

Experiment A. We took $h(t) = -\dfrac{1}{1+t}$ $(t \ne -1)$, and generated 10 matrices $M$ of dimension $n = 100$ for each of the approximate nonzero densities 0.1% and 10%. Then, we solved the quasi-convex programming problem (1.6) with $f(x) = -\dfrac{1}{1 + (x^T M x)/2}$.

Experiment B. Set $h(t) = \sqrt{t+1}$ $(t \ge 0)$. We adopted this $h$ and the matrices $M$ from Experiment A to yield the quasi-convex function $f(x) = \sqrt{(x^T M x)/2 + 1}$.

Experiment C. Set $h(t) = \ln(1+t)$ $(t > -1)$. We employed this $h$ and the matrices $M$ from Experiment A to generate the quasi-convex function $f(x) = \ln[1 + (x^T M x)/2]$.

Experiment D. We took $h(t) = \arctan(t) + t + 2$ and used this $h$ and the matrices $M$ from Experiment A to generate the function $f(x) = \arctan[(x^T M x)/2] + (x^T M x)/2 + 2$.
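The following Python sketch (our illustration; the paper's original code was written in MATLAB) reproduces this test setup: a sparse random factor $N$ with normal entries of mean $-1$, $M = NN^T$, and the four objectives of Experiments A-D together with their gradients via the chain rule $\nabla f(x) = h'(g(x))\,Mx$.

```python
import numpy as np

def make_matrix(n=100, density=0.001, seed=0):
    """Build M = N N^T with N a sparse square matrix whose nonzero
    entries are normal with mean -1 and unit variance."""
    rng = np.random.default_rng(seed)
    N = np.zeros((n, n))
    mask = rng.random((n, n)) < density         # approximate nonzero density
    N[mask] = rng.normal(loc=-1.0, scale=1.0, size=mask.sum())
    return N @ N.T                               # symmetric positive semidefinite

def g(x, M):
    return 0.5 * x @ M @ x

# (h, h') pairs for Experiments A-D; f = h(g), grad f = h'(g(x)) * M x.
EXPERIMENTS = {
    "A": (lambda t: -1.0 / (1.0 + t),       lambda t: 1.0 / (1.0 + t) ** 2),
    "B": (lambda t: np.sqrt(t + 1.0),       lambda t: 0.5 / np.sqrt(t + 1.0)),
    "C": (lambda t: np.log(1.0 + t),        lambda t: 1.0 / (1.0 + t)),
    "D": (lambda t: np.arctan(t) + t + 2.0, lambda t: 1.0 / (1.0 + t * t) + 1.0),
}

def objective(name, M):
    """Return (f, grad f) for one of the four test problems."""
    h, dh = EXPERIMENTS[name]
    f = lambda x: h(g(x, M))
    grad = lambda x: dh(g(x, M)) * (M @ x)       # chain rule
    return f, grad
```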

It is not difficult to verify that each $h$ in Experiments A-D is pseudo-convex, and hence the corresponding $f$ is also pseudo-convex by [1, Exercise 3.43]. Thus, every stationary point of (1.6) is a globally optimal solution by [1, Section 3.5]. Moreover, we notice that all test problems have the globally optimal solution $x^* = 0$. In view of this, throughout the experiments we employed an approximate version of the algorithm (1.7)–(1.8), described as follows, to solve the randomly generated test problems.

Approximate entropy-like proximal algorithm

Given $\epsilon > 0$ and $\tau > 0$. Select a starting point $x^0 \in \mathbb{R}^n_{++}$, and set $k := 0$.

For $k = 1, 2, \ldots$ until $|\nabla f(x^k)^T x^k| < \epsilon$ do

1. Use an unconstrained minimization method to approximately solve the problem

$$\min_{x\in\mathbb{R}^n}\left\{ f(x) + \mu_k d(x, x^{k-1}) \right\}, \tag{4.1}$$

and obtain an $x^k$ such that $\|\nabla f(x^k) + \mu_k \nabla_x d(x^k, x^{k-1})\| \le \tau$.

2. Let $k := k + 1$, and then go back to Step 1.

End
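A compact Python rendering of this outer loop is sketched below. It is our illustration, not the authors' MATLAB code: we use scipy's BFGS in place of their BFGS with nonmonotone Armijo line search, and the helper names `objective`, `make_matrix`, `entropy_like_d`, and `grad_x_d` refer to the sketches given earlier.

```python
import numpy as np
from scipy.optimize import minimize

def entropy_prox_solve(f, grad, x0, mu0=1.0, shrink=0.1, eps=1e-5, tau=1e-5,
                       max_outer=50):
    """Approximate entropy-like proximal algorithm for min f(x) s.t. x >= 0."""
    x, mu = np.asarray(x0, float), mu0
    for _ in range(max_outer):
        if abs(grad(x) @ x) < eps:               # stopping rule |grad f(x)^T x| < eps
            break
        x_prev = x.copy()
        # Subproblem (4.1): min f(x) + mu * d(x, x_prev). The log barrier in d
        # blows up near the boundary, which in practice keeps iterates positive.
        fk = lambda z: f(z) + mu * np.sum(x_prev * np.log(x_prev / z) + z - x_prev)
        gk = lambda z: grad(z) + mu * (1.0 - x_prev / z)
        res = minimize(fk, x_prev, jac=gk, method="BFGS",
                       options={"gtol": tau})    # ||grad fk|| <= tau at the solution
        x = np.maximum(res.x, 1e-16)             # guard: iterates must stay positive
        mu *= shrink                             # mu_{k+1} = 0.1 * mu_k
    return x

# Example usage with the Experiment C objective:
# M = make_matrix(100, 0.001); f, grad = objective("C", M)
# x0 = np.random.default_rng(1).uniform(1.0, 2.0, size=100)
# x_star = entropy_prox_solve(f, grad, x0)
```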


We implemented the approximate entropy-like proximal algorithm in MATLAB 6.5. All numerical experiments were done on a PC with a 2.8GHz CPU and 512MB of memory. We chose a BFGS algorithm with Armijo line search to solve (4.1). To improve numerical behavior, we replaced the standard Armijo line search with the nonmonotone line search technique described in [8] to seek a suitable steplength; i.e., we computed the smallest nonnegative integer $l$ such that

$$f_k(x^k + \beta^l d^k) \le W_k + \sigma\beta^l\,\nabla f_k(x^k)^T d^k,$$

where $f_k(x) = f(x) + \mu_k d(x, x^{k-1})$ and $W_k$ is given by

$$W_k := \max\left\{ f_k(x^j) \mid j = k - m_k, \ldots, k \right\},$$

and, for given nonnegative integers $\hat{m}$ and $s$, we set

$$m_k = \begin{cases} 0 & \text{if } k \le s, \\ \min\{m_{k-1} + 1,\ \hat{m}\} & \text{otherwise}. \end{cases}$$

Throughout the experiments, the following parameters were used for the line search:

$$\beta = 0.5, \qquad \sigma = 10^{-4}, \qquad \hat{m} = 5, \qquad s = 5.$$

The parameters $\epsilon$ and $\tau$ in the algorithm were chosen as $\epsilon = 10^{-5}$ and $\tau = 10^{-5}$, respectively.

In addition, we updated the proximal parameter $\mu_k$ by the formula

$$\mu_{k+1} = 0.1\,\mu_k \quad\text{with}\quad \mu_0 = 1,$$

and used $x^0 = \omega$ as the starting point, where the components of $\omega$ were chosen randomly from $[1, 2]$.
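For readers who want to reproduce the inner solver more faithfully than the scipy-based sketch above, here is a minimal Python sketch of the nonmonotone Armijo rule just described (our code; `fk` and `gfk` denote the subproblem objective and its gradient, and `hist` is the list of recent `fk` values that defines $W_k$):

```python
def nonmonotone_armijo(fk, gfk, x, d, hist, beta=0.5, sigma=1e-4):
    """Return a steplength beta**l satisfying the nonmonotone Armijo condition
    fk(x + beta**l * d) <= W_k + sigma * beta**l * gfk(x)^T d."""
    W = max(hist)                        # W_k over the last m_k + 1 iterates
    slope = gfk(x) @ d                   # directional derivative; d is assumed
    step = 1.0                           # to be a descent direction (slope < 0)
    while fk(x + step * d) > W + sigma * step * slope:
        step *= beta                     # l -> l + 1
    return step

def update_history(hist, fval, k, s=5, m_hat=5):
    """Maintain the window {f_k(x^j) : j = k - m_k, ..., k}, where
    m_k = 0 if k <= s, and m_k = min(m_{k-1} + 1, m_hat) otherwise."""
    hist.append(fval)
    window = 1 if k <= s else min(len(hist), m_hat + 1)
    return hist[-window:]
```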

Numerical results for Experiments A-D are summarized in Tables 1-4 of the appendix. In these tables, Obj. represents the value of $f(x)$ at the final iterate, Nf denotes the total number of function evaluations for the objective of subproblem (4.1) in solving each quasi-convex programming problem, Den denotes the approximate nonzero density of $M$, and Time represents the CPU time in seconds for solving each test problem.

From Tables 1-4, we see that the approximate entropy-like proximal algorithm successfully found a stationary point for all test problems in Experiments A-D from the given starting point $x^0 = \omega$, except for the test problems with nonzero density 10% in Experiment A, for which we used the starting point $x^0 = 0.5\omega$ instead. This verifies the theoretical results obtained in the last section. In addition, these tables suggest that the proximal-like algorithm needs more function evaluations for problems whose matrix $M$ has a higher nonzero density.

5 Conclusions

We have considered the proximal-like method defined by (1.7)–(1.8) for a class of nonconvex problems of the form (1.6) and established convergence results for the algorithm under suitable assumptions. Specifically, we have shown that, under assumptions (A1) and (A2), the sequence $\{x^k\}_{k\in\mathbb{N}}$ generated by the algorithm is convergent and, furthermore, converges to a solution of (1.6) when $\mu_k \to 0$; under assumptions (A1) and (A3), the sequence $\{x^k\}_{k\in\mathbb{N}}$ is only guaranteed to be bounded, but every accumulation point is a solution if $\mu_k \to 0$.

Preliminary numerical experiments have also been reported to verify the theoretical results. In future research, we will consider applying the proximal-like algorithm to more extensive classes of nonconvex optimization problems.


Acknowledgement

The authors are grateful to Prof. Fukushima and Prof. Tseng for inspiring them to generate test examples. They also thank two referees for valuable suggestions.

References

[1] M.S. Bazaraa and C.M. Shetty, Nonlinear Programming: Theory and Algorithms, John Wiley & Sons, New York, 1979.

[2] Y. Censor and S.A. Zenios, The proximal minimization algorithm with D-functions, J. Optim. Theory Appl. 73 (1992) 451–464.

[3] G. Chen and M. Teboulle, Convergence analysis of a proximal-like minimization algorithm using Bregman functions, SIAM J. Optim. 3 (1993) 538–543.

[4] S. Chretien and O. Hero, Generalized proximal point algorithms, Technical Report, The University of Michigan, 1998.

[5] J. Eckstein, Nonlinear proximal point algorithms using Bregman functions, with applications to convex programming, Math. Oper. Res. 18 (1993) 206–226.

[6] P.B. Eggermont, Multiplicative iterative algorithms for convex programming, Linear Algebra Appl. 130 (1990) 25–42.

[7] O. Guler, On the convergence of the proximal point algorithm for convex minimization, SIAM J. Control Optim. 29 (1991) 403–419.

[8] L. Grippo, F. Lampariello and S. Lucidi, A nonmonotone line search technique for Newton's method, SIAM J. Numer. Anal. 23 (1986) 707–716.

[9] A. Iusem and M. Teboulle, Convergence rate analysis of nonquadratic proximal methods for convex and linear programming, Math. Oper. Res. 20 (1995) 657–677.

[10] A. Iusem, B. Svaiter and M. Teboulle, Entropy-like proximal methods in convex programming, Math. Oper. Res. 19 (1994) 790–814.

[11] K.C. Kiwiel, Proximal minimization methods with generalized Bregman functions, SIAM J. Control Optim. 35 (1997) 1142–1168.

[12] S. Kabbadj, Méthodes proximales entropiques, Thèse de Doctorat, Université Montpellier II, 1994.

[13] B. Martinet, Perturbation des méthodes d'optimisation. Applications, RAIRO Anal. Numér. 12 (1978) 153–171.

[14] J.J. Moreau, Proximité et dualité dans un espace hilbertien, Bull. Soc. Math. France 93 (1965) 273–299.

[15] B.T. Polyak, Introduction to Optimization, Optimization Software, Inc., 1987.

[16] M. Teboulle, Entropic proximal mappings with applications to nonlinear programming, Math. Oper. Res. 17 (1992) 670–690.

[17] M. Teboulle, Convergence of proximal-like algorithms, SIAM J. Optim. 7 (1997) 1069–1083.

[18] R.T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res. 1 (1976) 97–116.

[19] R.T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim. 14 (1976) 877–898.

Manuscript received 5 January 2006; revised 14 July 2006, 12 April 2007, and 15 February 2008; accepted for publication 18 February 2008.

Jein-Shan Chen
Member of Mathematics Division, National Center for Theoretical Sciences, Taipei Office;
Department of Mathematics, National Taiwan Normal University, Taipei 11677, Taiwan
E-mail address: jschen@math.ntnu.edu.tw

Shaohua Pan
School of Mathematical Sciences, South China University of Technology, Guangzhou 510640, China
E-mail address: shhpan@scut.edu.cn

Appendix


Table 1: Numerical results for Experiment A

              Den=0.1%                        Den=10%
No.   Obj.             Nf     Time(s)    Obj.              Nf      Time(s)
 1    -0.9999958e-0     628   0.31       -0.99999530e-0    52984   21.71
 2    -0.9999974e-0     582   0.31       -0.99999504e-0    32168   16.65
 3    -0.9999950e-0     997   0.39       -0.99999500e-0    45363   27.04
 4    -0.9999955e-0     728   0.37       -0.99999521e-0    45944   18.11
 5    -0.9999964e-0     571   0.30       -0.99999519e-0    49001   26.40
 6    -0.9999960e-0     869   0.52       -0.99999501e-0    61891   29.70
 7    -0.9999995e-0     631   0.33       -0.99999513e-0    40297   20.50
 8    -0.9999958e-0    1048   0.44       -0.99999511e-0    64684   37.60
 9    -0.9999953e-0     461   0.31       -0.99999503e-0    33232   15.03
10    -0.9999988e-0     808   0.37       -0.99999510e-0    31660   21.09

Note: the starting point $x^0 = 0.5\omega$ was used for all test problems with density 10%.

Table 2: Numerical results for Experiment B

              Den=0.1%                       Den=10%
No.   Obj.            Nf     Time(s)    Obj.            Nf      Time(s)
 1    2.0000005e-0     619   0.36       2.0000099e-0    11412   7.67
 2    2.0000099e-0    1599   0.48       2.0000095e-0    11705   7.40
 3    2.0000094e-0     652   0.33       2.0000096e-0     9636   7.32
 4    2.0000093e-0     563   0.28       2.0000098e-0    11041   7.65
 5    2.0000072e-0     791   0.41       2.0000095e-0    10211   7.00
 6    2.0000082e-0     996   0.58       2.0000095e-0    12963   7.51
 7    2.0000054e-0     514   0.22       2.0000100e-0    11631   8.68
 8    2.0000045e-0     622   0.31       2.0000097e-0    10702   6.09
 9    2.0000052e-0     573   0.31       2.0000099e-0    10927   8.43
10    2.0000028e-0     540   0.26       2.0000096e-0    11354   8.18


Table 3: Numerical results for Experiment C

              Den=0.1%                       Den=10%
No.   Obj.            Nf     Time(s)    Obj.            Nf      Time(s)
 1    4.0504722e-6     434   0.25       4.2542862e-6     4473   3.51
 2    1.5335386e-6     506   0.31       4.8823994e-6     7448   5.47
 3    4.7463833e-6     731   0.45       4.9623792e-6     5539   3.51
 4    4.5687416e-6     525   0.31       4.6959189e-6     5664   3.72
 5    3.0131432e-6     529   0.30       4.9137922e-6     5199   3.40
 6    4.8877190e-6     910   0.44       4.9240652e-6     4991   3.90
 7    4.8123156e-6     693   0.44       3.6413794e-6     5228   3.51
 8    4.4356008e-6     843   0.44       4.9984977e-6     5178   4.55
 9    4.3297744e-6     423   0.23       4.4443700e-6     5277   4.47
10    4.0444890e-7     605   0.26       4.9962998e-6     5650   4.31

Table 4: Numerical results for Experiment D

              Den=0.1%                       Den=10%
No.   Obj.            Nf     Time(s)    Obj.            Nf      Time(s)
 1    2.0000050e-0     506   0.30       2.0000048e-0    10384   8.20
 2    2.0000033e-0     538   0.31       2.0000048e-0    11434   7.86
 3    2.0000048e-0     831   0.44       2.0000043e-0    11252   8.84
 4    2.0000042e-0     432   0.25       2.0000049e-0    12107   8.65
 5    2.0000035e-0     396   0.25       2.0000049e-0    10428   7.75
 6    2.0000048e-0     853   0.50       2.0000046e-0    10424   8.05
 7    2.0000050e-0     546   0.33       2.0000045e-0    11610   7.92
 8    2.0000032e-0     693   0.39       2.0000042e-0    12283   8.91
 9    2.0000037e-0     488   0.28       2.0000047e-0    11758   8.06
10    2.0000048e-0     556   0.36       2.0000049e-0    12344   8.55
