An entropy-like proximal algorithm and the exponential multiplier method for symmetric cone programming
Jein-Shan Chen 1
Department of Mathematics, National Taiwan Normal University, Taipei 11677, Taiwan
Shaohua Pan 2
School of Mathematical Sciences, South China University of Technology, Guangzhou 510640, China
April 28, 2006 (revised May 16, 2007)
Abstract. We introduce an entropy-like proximal algorithm for the problem of minimizing a closed proper convex function subject to a symmetric cone constraint. The algorithm is based on a distance-like function that extends the Kullback-Leibler relative entropy to the setting of symmetric cones. As with the proximal algorithm for convex programming over the nonnegative orthant, we show that, under some mild assumptions, the sequence generated by the proposed algorithm is bounded and every accumulation point is a solution of the considered problem. In addition, we present a dual application of the proposed algorithm to the symmetric cone linear program, leading to a multiplier method which is shown to enjoy properties similar to those of the exponential multiplier method in [29].
Key words. Symmetric cone optimization, proximal-like method, entropy-like distance, exponential multiplier method.
1 Introduction
Symmetric cone programming (SCP) provides a unified framework for various mathematical programming problems, including linear programming (LP), second-order cone programming (SOCP), and semidefinite programming (SDP). Such problems arise
1 Member of Mathematics Division, National Center for Theoretical Sciences, Taipei Office. The author's work is partially supported by the National Science Council of Taiwan. E-mail: jschen@math.ntnu.edu.tw.
2 E-mail: shhpan@scut.edu.cn.
naturally in a wide range of applications in engineering, economics, optimal control, management science, combinatorial optimization, and other fields; see [1, 16, 30] and the references therein. The symmetric cone optimization problem, especially the symmetric cone linear program (SCLP), has recently attracted the attention of researchers, with a focus on the development of interior point methods similar to those used in linear programming; see [9, 10, 25]. Although interior-point methods have been successfully applied to solve the SCLP, it is worthwhile to explore other solution methods for more general convex symmetric cone optimization problems.
Let $\mathbb{A} = (\mathbb{V}, \circ, \langle\cdot,\cdot\rangle)$ be a Euclidean Jordan algebra, where $(\mathbb{V}, \langle\cdot,\cdot\rangle)$ is a finite-dimensional inner product space over the real field $\mathbb{R}$ and "$\circ$" denotes the Jordan product, which will be defined in the next section. Let $\mathcal{K}$ be the symmetric cone in $\mathbb{V}$ (introduced in Section 2). In this paper, we consider the following convex symmetric cone program (CSCP):
$$\min\ f(x) \quad \text{s.t.}\quad x \succeq_{\mathcal{K}} 0, \tag{1}$$
where $f : \mathbb{V} \to (-\infty, \infty]$ is a closed proper convex function, and $x \succeq_{\mathcal{K}} 0$ means $x \in \mathcal{K}$. In general, for any $x, y \in \mathbb{V}$, we write $x \succeq_{\mathcal{K}} y$ if $x - y \in \mathcal{K}$ and $x \succ_{\mathcal{K}} y$ if $x - y \in \mathrm{int}(\mathcal{K})$. A function is closed if and only if it is lower semicontinuous (l.s.c. for short), and a function $f$ is proper if $f(x) < \infty$ for at least one $x \in \mathbb{V}$ and $f(x) > -\infty$ for all $x \in \mathbb{V}$.
Notice that the CSCP is indeed a special class of convex programs, and therefore it can in principle be solved via general convex programming methods. One such method is the proximal point algorithm for minimizing a convex function $f(x)$ on $\mathbb{R}^n$, which generates a sequence $\{x^k\}_{k\in\mathbb{N}}$ by the iterative scheme
$$x^k = \mathop{\rm argmin}_{x\in\mathbb{R}^n}\left\{ f(x) + \frac{1}{\mu_k}\|x - x^{k-1}\|^2 \right\}, \tag{2}$$
where $\{\mu_k\}$ is a sequence of positive numbers and $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^n$. This method was originally introduced by Martinet [17], based on the Moreau proximal approximation of $f$ (see [18]), and then further developed and studied by Rockafellar [22, 23]. Later, several generalizations of the proximal algorithm were considered in which the usual quadratic proximal term in (2) is replaced by a nonquadratic distance-like function; see, e.g., [5, 7, 8, 14, 27]. Among others, the proximal algorithms using an entropy-like distance [13, 14, 27, 28], designed for minimizing a convex function $f(x)$ subject to the nonnegative orthant constraint $x \in \mathbb{R}^n_+$, generate the iterates by
$$x^0 \in \mathbb{R}^n_{++}, \qquad x^k = \mathop{\rm argmin}_{x\in\mathbb{R}^n_+}\left\{ f(x) + \frac{1}{\mu_k}\, d_\varphi(x, x^{k-1}) \right\}, \tag{3}$$
where $d_\varphi(\cdot,\cdot) : \mathbb{R}^n_{++}\times\mathbb{R}^n_{++} \to \mathbb{R}_+$ is the entropy-like distance defined by
$$d_\varphi(x, y) = \sum_{i=1}^n y_i\,\varphi(x_i/y_i) \tag{4}$$
with $\varphi$ satisfying certain conditions; see [13, 14, 27, 28]. An important choice of $\varphi$ is $\varphi(t) = t\ln t - t + 1$, for which the corresponding $d_\varphi$, given by
$$d_\varphi(x, y) = \sum_{i=1}^n \left[\, x_i\ln x_i - x_i\ln y_i + y_i - x_i \,\right], \tag{5}$$
is the well-known Kullback-Leibler relative entropy from statistics, which is where the "entropy" terminology stems from. One key feature of entropic proximal methods is that they automatically generate a sequence staying in the interior of $\mathbb{R}^n_+$, i.e., $\{x^k\}_{k\in\mathbb{N}} \subset \mathbb{R}^n_{++}$, and hence they eliminate the combinatorial nature of the problem. One of the main applications of these proximal methods is to the dual of smooth convex programs, yielding twice continuously differentiable nonquadratic augmented Lagrangians and thus allowing the use of Newton's method.
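To illustrate how the scheme (3)-(4) behaves, the following minimal sketch (our own code, not from the references) runs the entropic proximal iteration with $\varphi(t) = t\ln t - t + 1$ on a toy problem over $\mathbb{R}^2_+$. It assumes a smooth convex $f$ and solves each subproblem through the substitution $x = e^z$, a device of ours that keeps the iterates strictly positive.

```python
import numpy as np
from scipy.optimize import minimize

def d_phi(x, y):
    # Entropy-like distance (4)-(5) with phi(t) = t*ln(t) - t + 1.
    return float(np.sum(x * np.log(x / y) + y - x))

def entropic_prox_step(f, x_prev, mu):
    # One step of (3): argmin over x > 0 of f(x) + (1/mu) * d_phi(x, x_prev).
    # The change of variables x = exp(z) makes the problem unconstrained;
    # BFGS then finds the unique interior stationary point of the objective.
    obj = lambda z: f(np.exp(z)) + d_phi(np.exp(z), x_prev) / mu
    return np.exp(minimize(obj, np.log(x_prev), method="BFGS").x)

# Toy problem: f(x) = ||x - a||^2 with a = (1, -0.5), so the minimizer
# over R^2_+ is (1, 0), a point on the boundary of the orthant.
a = np.array([1.0, -0.5])
f = lambda x: float(np.sum((x - a) ** 2))
x = np.ones(2)
for _ in range(50):
    x = entropic_prox_step(f, x, mu=1.0)
print(x)  # all iterates stay in R^2_{++}; x approaches (1, 0)
```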
The main motivation of this paper is to develop unified proximal-type algorithms and their corresponding dual augmented Lagrangian methods for convex symmetric cone optimization problems. Specifically, using Euclidean Jordan algebraic techniques, we introduce an interior proximal-type algorithm for the CSCP which can be viewed as an extension of the entropy-like proximal algorithm defined by (3)-(4) with $\varphi(t) = t\ln t - t + 1$.
For the proposed algorithm, we establish its convergence properties, and we also present a dual application to the SCLP, leading to an exponential multiplier method which is shown to possess properties analogous to the method proposed in [3, 29] for convex programming over the nonnegative orthant.
The paper is organized as follows. Section 2 reviews some basic concepts and materials on Euclidean Jordan algebras which are needed in the analysis of the algorithm. In Section 3, we introduce a distance-like function $H$ to measure how close two points in the symmetric cone $\mathcal{K}$ are, and we establish some of its properties. In addition, we outline a basic proximal-like algorithm built on the measure function $H$. The convergence analysis of the basic algorithm is given in Section 4. In Section 5, we consider a dual application of the algorithm to the SCLP and establish convergence results for the resulting multiplier method. We close with some final remarks in Section 6.
2 Preliminaries on Euclidean Jordan Algebras
In this section, we recall some concepts and results on Euclidean Jordan algebras that will be used in the subsequent sections. More detailed expositions of Euclidean Jordan algebras can be found in Koecher's lecture notes [15] and the monograph by Faraut and Korányi [11].
Let $\mathbb{V}$ be a finite-dimensional vector space endowed with a bilinear mapping $(x, y) \mapsto x \circ y$ from $\mathbb{V}\times\mathbb{V}$ into $\mathbb{V}$. For a given $x \in \mathbb{V}$, let $L(x)$ be the linear operator on $\mathbb{V}$ defined by
$$L(x)y := x \circ y \quad \text{for every } y \in \mathbb{V}.$$
The pair $(\mathbb{V}, \circ)$ is called a Jordan algebra if, for all $x, y \in \mathbb{V}$,
(i) $x \circ y = y \circ x$,
(ii) $x \circ (x^2 \circ y) = x^2 \circ (x \circ y)$, where $x^2 := x \circ x$.
In a Jordan algebra $(\mathbb{V}, \circ)$, $x \circ y$ is called the Jordan product of $x$ and $y$. Note that a Jordan algebra is not associative in general, i.e., $x \circ (y \circ z) = (x \circ y) \circ z$ may fail to hold. If for some element $e \in \mathbb{V}$ we have $x \circ e = e \circ x = x$ for all $x \in \mathbb{V}$, then $e$ is called a unit element of the Jordan algebra $(\mathbb{V}, \circ)$. The unit element, if it exists, is unique; a Jordan algebra does not necessarily have one. For $x \in \mathbb{V}$, let $\zeta(x)$ be the degree of the minimal polynomial of $x$, which can be equivalently defined as
$$\zeta(x) := \min\left\{ k : \{e, x, x^2, \cdots, x^k\} \text{ are linearly dependent} \right\}.$$
Then the rank of $(\mathbb{V}, \circ)$ is well defined by $r := \max\{\zeta(x) : x \in \mathbb{V}\}$.
A Jordan algebra $(\mathbb{V}, \circ)$ over the real field $\mathbb{R}$ with a unit element $e \in \mathbb{V}$ is called a Euclidean Jordan algebra, or formally real Jordan algebra, if there exists a positive definite symmetric bilinear form on $\mathbb{V}$ which is associative; in other words, there exists an inner product on $\mathbb{V}$, denoted by $\langle\cdot,\cdot\rangle_{\mathbb{V}}$, such that for all $x, y, z \in \mathbb{V}$:
(iii) $\langle x \circ y, z\rangle_{\mathbb{V}} = \langle y, x \circ z\rangle_{\mathbb{V}}$.
In a Euclidean Jordan algebra $\mathbb{A} = (\mathbb{V}, \circ, \langle\cdot,\cdot\rangle_{\mathbb{V}})$, we define the set of squares as
$$\mathcal{K} := \left\{ x^2 : x \in \mathbb{V} \right\}.$$
By [11, Theorem III.2.1], $\mathcal{K}$ is a symmetric cone. This means that $\mathcal{K}$ is a self-dual closed convex cone with nonempty interior, and for any two elements $x, y \in \mathrm{int}(\mathcal{K})$ there exists an invertible linear transformation $T : \mathbb{V} \to \mathbb{V}$ such that $T(\mathcal{K}) = \mathcal{K}$ and $T(x) = y$.
Here are two popular examples of Euclidean Jordan algebras. Let $\mathbb{S}^n$ be the space of $n\times n$ real symmetric matrices with the inner product given by $\langle X, Y\rangle_{\mathbb{S}^n} := \mathrm{Tr}(XY)$, where $XY$ is the usual matrix product of $X$ and $Y$ and $\mathrm{Tr}(XY)$ is the trace of the matrix $XY$. Then $(\mathbb{S}^n, \circ, \langle\cdot,\cdot\rangle_{\mathbb{S}^n})$ is a Euclidean Jordan algebra with the Jordan product given by
$$X \circ Y := (XY + YX)/2, \qquad X, Y \in \mathbb{S}^n.$$
In this case, the unit element is the identity matrix $I$ in $\mathbb{S}^n$, and the cone of squares $\mathcal{K}$ is the set $\mathbb{S}^n_+$ of all positive semidefinite matrices in $\mathbb{S}^n$. Next, let $\mathbb{R}^n$ be the Euclidean space of dimension $n$ with the usual inner product $\langle x, y\rangle_{\mathbb{R}^n} = x^T y$. For any $x = (x_1, x_2),\ y = (y_1, y_2) \in \mathbb{R}\times\mathbb{R}^{n-1}$, define
$$x \circ y := (x^T y,\; x_1 y_2 + y_1 x_2).$$
Then $(\mathbb{R}^n, \circ, \langle\cdot,\cdot\rangle_{\mathbb{R}^n})$ is a Euclidean Jordan algebra, also called the algebra of quadratic forms. In this algebra, the unit element is $e = (1, 0)$ and the set of squares is $\mathcal{K} = \{x = (x_1, x_2) \in \mathbb{R}\times\mathbb{R}^{n-1} : \|x_2\| \le x_1\}$, the second-order cone.
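As a quick illustration of these two algebras, the sketch below (our own code) implements both Jordan products and checks the unit elements numerically.

```python
import numpy as np

def jordan_sym(X, Y):
    # Jordan product on S^n: X o Y = (XY + YX) / 2.
    return (X @ Y + Y @ X) / 2

def jordan_soc(x, y):
    # Jordan product of the quadratic forms algebra on R x R^{n-1}:
    # x o y = (x^T y, x_1 y_2 + y_1 x_2).
    return np.concatenate(([x @ y], x[0] * y[1:] + y[0] * x[1:]))

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3)); X = (X + X.T) / 2
assert np.allclose(jordan_sym(X, np.eye(3)), X)   # I is the unit element of S^3

x = rng.standard_normal(4)
e = np.array([1.0, 0.0, 0.0, 0.0])                # unit element e = (1, 0)
assert np.allclose(jordan_soc(x, e), x)
assert np.isclose(jordan_soc(x, x)[0], x @ x)     # (x o x)_1 = ||x||^2
```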
Recall that an element $c \in \mathbb{V}$ is said to be idempotent if $c^2 = c$. Two idempotents $c$ and $q$ are said to be orthogonal if $c \circ q = 0$. One says that $\{c_1, c_2, \cdots, c_k\}$ is a complete system of orthogonal idempotents if
$$c_j^2 = c_j, \qquad c_j \circ c_i = 0 \ \text{ for } j \neq i,\ \ j, i = 1, 2, \cdots, k, \qquad \text{and} \qquad \sum_{j=1}^k c_j = e.$$
An idempotent is said to be primitive if it is nonzero and cannot be written as the sum of two nonzero idempotents. We call a complete system of orthogonal primitive idempotents a Jordan frame. Then we have the second version of the spectral decomposition theorem.
Theorem 2.1 [11, Theorem III.1.2] Suppose that $\mathbb{A} = (\mathbb{V}, \circ, \langle\cdot,\cdot\rangle_{\mathbb{V}})$ is a Euclidean Jordan algebra of rank $r$. Then for any $x \in \mathbb{V}$, there exist a Jordan frame $\{c_1, c_2, \cdots, c_r\}$ and real numbers $\lambda_1(x), \lambda_2(x), \cdots, \lambda_r(x)$, arranged in decreasing order $\lambda_1(x) \ge \lambda_2(x) \ge \cdots \ge \lambda_r(x)$, such that $x = \sum_{j=1}^r \lambda_j(x)\, c_j$.

The numbers $\lambda_j(x)$ (counting multiplicities), which are uniquely determined by $x$, are called the eigenvalues of $x$. We let $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ denote the smallest and the largest eigenvalue, respectively. The trace of $x$, denoted by $\mathrm{tr}(x)$, is given by $\mathrm{tr}(x) := \sum_{j=1}^r \lambda_j(x)$.
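In the algebra $\mathbb{S}^n$, Theorem 2.1 is just the usual symmetric eigendecomposition, with the Jordan frame formed by the rank-one eigenprojections. The following sketch (our own code) verifies the decomposition and the trace identity numerically.

```python
import numpy as np

def spectral_decomposition(X):
    # Theorem 2.1 in S^n: X = sum_j lambda_j * c_j, where the Jordan frame
    # consists of the eigenprojections c_j = u_j u_j^T.
    lam, U = np.linalg.eigh(X)
    lam, U = lam[::-1], U[:, ::-1]   # decreasing order, as in Theorem 2.1
    frames = [np.outer(U[:, j], U[:, j]) for j in range(len(lam))]
    return lam, frames

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 3)); X = (X + X.T) / 2
lam, frames = spectral_decomposition(X)
assert np.allclose(sum(l * c for l, c in zip(lam, frames)), X)
assert np.isclose(lam.sum(), np.trace(X))   # tr(x) = sum of the eigenvalues
```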
Since, by [11, Proposition III.1.5], a Jordan algebra $(\mathbb{V}, \circ)$ with a unit element $e \in \mathbb{V}$ is Euclidean if and only if the symmetric bilinear form $\mathrm{tr}(x \circ y)$ is positive definite, we may define another inner product on $\mathbb{V}$ by
$$\langle x, y\rangle := \mathrm{tr}(x \circ y), \qquad \forall\, x, y \in \mathbb{V}.$$
By the associativity of $\mathrm{tr}(\cdot)$ [11, Proposition II.4.3], the inner product $\langle\cdot,\cdot\rangle$ is associative, i.e., for all $x, y, z \in \mathbb{V}$ it holds that $\langle x, y \circ z\rangle = \langle y, x \circ z\rangle$. Thus, for each $x \in \mathbb{V}$, the linear operator $L(x)$ is symmetric with respect to the inner product $\langle\cdot,\cdot\rangle$ in the sense that
$$\langle L(x)y, z\rangle = \langle y, L(x)z\rangle, \qquad \forall\, y, z \in \mathbb{V}.$$
In addition, we let $\|\cdot\|$ be the norm on $\mathbb{V}$ induced by the inner product $\langle\cdot,\cdot\rangle$, i.e.,
$$\|x\| := \sqrt{\langle x, x\rangle} = \Big(\sum_{j=1}^r \lambda_j^2(x)\Big)^{1/2}, \qquad \forall\, x \in \mathbb{V}.$$
Now we can present two important inequalities for the eigenvalue functions $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$.
Lemma 2.1 Let $x, y \in \mathbb{V}$. Then the minimum eigenvalue of $x + y$ can be bounded as follows:
$$\lambda_{\min}(x) + \lambda_{\min}(y) \;\le\; \lambda_{\min}(x + y) \;\le\; \lambda_{\min}(x) + \lambda_{\max}(y).$$
Proof. The first inequality can be found in [25, Lemma 14], so we only prove the second one. From [25, Lemma 13], we know
$$\lambda_{\min}(x) = \min_{0\neq u\in\mathbb{V}} \frac{\langle u, x \circ u\rangle}{\langle u, u\rangle} \qquad \text{and} \qquad \lambda_{\max}(x) = \max_{0\neq u\in\mathbb{V}} \frac{\langle u, x \circ u\rangle}{\langle u, u\rangle}.$$
Hence, there holds
$$\lambda_{\min}(x + y) = \min_{0\neq u\in\mathbb{V}} \frac{\langle u, (x + y) \circ u\rangle}{\langle u, u\rangle} = \min_{0\neq u\in\mathbb{V}} \frac{\langle u, x \circ u\rangle + \langle u, y \circ u\rangle}{\langle u, u\rangle} \;\le\; \min_{0\neq u\in\mathbb{V}} \frac{\langle u, x \circ u\rangle}{\langle u, u\rangle} + \max_{0\neq u\in\mathbb{V}} \frac{\langle u, y \circ u\rangle}{\langle u, u\rangle} = \lambda_{\min}(x) + \lambda_{\max}(y).$$
Consequently, the second inequality follows. □
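A quick numerical sanity check of Lemma 2.1 in $\mathbb{S}^4$ (our own test code):

```python
import numpy as np

rng = np.random.default_rng(2)
lmin = lambda M: np.linalg.eigvalsh(M)[0]    # smallest eigenvalue
lmax = lambda M: np.linalg.eigvalsh(M)[-1]   # largest eigenvalue
for _ in range(1000):
    A = rng.standard_normal((4, 4)); A = (A + A.T) / 2
    B = rng.standard_normal((4, 4)); B = (B + B.T) / 2
    assert lmin(A) + lmin(B) <= lmin(A + B) + 1e-10   # first inequality
    assert lmin(A + B) <= lmin(A) + lmax(B) + 1e-10   # second inequality
```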
Let $\phi : \mathbb{R} \to \mathbb{R}$ be a scalar-valued function. Then it is natural to define a vector-valued function associated with the Euclidean Jordan algebra $(\mathbb{V}, \circ, \langle\cdot,\cdot\rangle_{\mathbb{V}})$ by
$$\phi^{\rm sc}(x) := \phi(\lambda_1(x))\,c_1 + \phi(\lambda_2(x))\,c_2 + \cdots + \phi(\lambda_r(x))\,c_r, \tag{6}$$
where $x \in \mathbb{V}$ has the spectral decomposition $x = \sum_{j=1}^r \lambda_j(x)\, c_j$. The function $\phi^{\rm sc}$ is called the Löwner operator and enjoys the following property [26].
Lemma 2.2 [26, Theorem 13] For any $x = \sum_{j=1}^r \lambda_j(x)\, c_j$, let $\phi^{\rm sc}$ be defined as in (6). Then $\phi^{\rm sc}$ is (continuously) differentiable at $x$ if and only if, for each $j \in \{1, 2, \cdots, r\}$, $\phi$ is (continuously) differentiable at $\lambda_j(x)$. The derivative of $\phi^{\rm sc}$ at $x$, for any $h \in \mathbb{V}$, is given by
$$(\phi^{\rm sc})'(x)h = \sum_{j=1}^r [\phi^{[1]}(\lambda(x))]_{jj}\,\langle c_j, h\rangle\, c_j + \sum_{1\le j<l\le r} 4\,[\phi^{[1]}(\lambda(x))]_{jl}\; c_j \circ (c_l \circ h)$$
with
$$[\phi^{[1]}(\lambda(x))]_{ij} := \begin{cases} \dfrac{\phi(\lambda_i(x)) - \phi(\lambda_j(x))}{\lambda_i(x) - \lambda_j(x)} & \text{if } \lambda_i(x) \neq \lambda_j(x), \\[2mm] \phi'(\lambda_i(x)) & \text{if } \lambda_i(x) = \lambda_j(x), \end{cases} \qquad i, j = 1, 2, \cdots, r.$$
In fact, the Jacobian $(\phi^{\rm sc})'(\cdot)$ is a linear and symmetric operator, and it can be written as
$$(\phi^{\rm sc})'(x) = \sum_{j=1}^r \phi'(\lambda_j(x))\, Q(c_j) + 2 \sum_{i,j=1,\, i\neq j}^r [\phi^{[1]}(\lambda(x))]_{ij}\, L(c_j) L(c_i), \tag{7}$$
where $Q(x) := 2L^2(x) - L(x^2)$, for any $x \in \mathbb{V}$, is the quadratic representation of $x$.
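In $\mathbb{S}^n$, the Löwner operator (6) simply applies $\phi$ to the eigenvalues while keeping the eigenprojections fixed. A minimal sketch (our own code), with exp/log as the pair of scalar functions used repeatedly later in the paper:

```python
import numpy as np

def lowner(phi, X):
    # Lowner operator (6) on S^n: apply phi to the eigenvalues of X,
    # keeping the Jordan frame (eigenprojections) unchanged.
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(phi(lam)) @ U.T

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3))
X = A @ A.T + np.eye(3)                      # a point in int(K) = S^3_{++}
logX = lowner(np.log, X)
assert np.allclose(lowner(np.exp, logX), X)  # exp^sc and ln^sc are inverses
```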
We next introduce real-valued spectral functions on the Euclidean Jordan algebra. Let $\mathcal{P}$ be the set of all permutations of $r$-dimensional vectors. A subset of $\mathbb{R}^r$ is said to be symmetric if it remains unchanged under every permutation in $\mathcal{P}$ (we adopt the same notation as in [2]).
Definition 2.1 Let $S \subseteq \mathbb{R}^r$ be a symmetric set. A real-valued function $f : S \to \mathbb{R}$ is symmetric if, for every permutation $P \in \mathcal{P}$ and each $s \in S$, we have $f(Ps) = f(s)$.
For any $x \in \mathbb{V}$ with the spectral decomposition $x = \sum_{j=1}^r \lambda_j(x)\, c_j$, we define
$$\mathcal{K}_S := \{x \in \mathbb{V} \mid \lambda(x) \in S\},$$
where $\lambda(x) = (\lambda_1(x), \cdots, \lambda_r(x))^T$ is the spectral vector of $x$ and $S \subseteq \mathbb{R}^r$ is a symmetric set. Let $f : S \to \mathbb{R}$ be a symmetric function. Then the function $F : \mathcal{K}_S \to \mathbb{R}$ given by
$$F(x) := f(\lambda(x)) \tag{8}$$
is the spectral function generated by $f$. Moreover, from [2, Theorem 41], we know that $F$ is (strictly) convex if $f$ is (strictly) convex.
Unless otherwise stated, in the sequel $\mathbb{A} = (\mathbb{V}, \circ, \langle\cdot,\cdot\rangle)$ represents a Euclidean Jordan algebra of rank $r$ with $\dim(\mathbb{V}) = n$. For a closed proper convex function $f : \mathbb{V} \to (-\infty, +\infty]$, we denote the domain of $f$ by $\mathrm{dom}(f) := \{x \in \mathbb{V} \mid f(x) < +\infty\}$. The subdifferential of $f$ at $x_0 \in \mathbb{V}$ is the convex set
$$\partial f(x_0) = \{\xi \in \mathbb{V} \mid f(x) \ge f(x_0) + \langle \xi, x - x_0\rangle \ \ \forall\, x \in \mathbb{V}\}. \tag{9}$$
Since $\langle x, y\rangle = \mathrm{tr}(x \circ y)$ for any $x, y \in \mathbb{V}$, the above subdifferential set can equivalently be written as
$$\partial f(x_0) = \{\xi \in \mathbb{V} \mid f(x) \ge f(x_0) + \mathrm{tr}(\xi \circ (x - x_0)) \ \ \forall\, x \in \mathbb{V}\}. \tag{10}$$
Finally, for a sequence $\{x^k\}_{k\in\mathbb{N}}$, the symbol $\mathbb{N}$ denotes the set of natural numbers.
3 An entropy-like proximal algorithm for SCP
To solve the convex symmetric cone optimization problem CSCP, in this section we propose an entropy-like proximal minimization algorithm defined as follows:
$$x^0 \succ_{\mathcal{K}} 0, \qquad x^k = \mathop{\rm argmin}_{x\,\succeq_{\mathcal{K}}\,0}\left\{ f(x) + \frac{1}{\mu_k} H(x, x^{k-1}) \right\}, \tag{11}$$
where $\mu_k > 0$ and $H : \mathbb{V}\times\mathbb{V} \to (-\infty, +\infty]$ is defined by
$$H(x, y) := \begin{cases} \mathrm{tr}(x \circ \ln x - x \circ \ln y + y - x) & \forall\, x \in \mathcal{K},\ y \in \mathrm{int}(\mathcal{K}), \\ +\infty & \text{otherwise.} \end{cases} \tag{12}$$
This algorithm is indeed of proximal type, except that, instead of the classical quadratic term $\|x - x^{k-1}\|^2$, we adopt the distance-like function $H$ to guarantee that the generated sequence $\{x^k\}_{k\in\mathbb{N}}$ satisfies $x^k \in \mathrm{int}(\mathcal{K})$ for all $k$, thus leading to an "interior" proximal method (see Proposition 4.1).
By the definition of the Löwner operator in Section 2, the function $H(x, y)$ is clearly well defined for all $x, y \in \mathrm{int}(\mathcal{K})$. Moreover, its domain in $x$ can be extended from $\mathrm{int}(\mathcal{K})$ to $\mathcal{K}$ by adopting the convention $0\ln 0 \equiv 0$. The function $H$ is a natural extension of the entropy-like distance in (5) and is used to measure the "distance" between two points in $\mathcal{K}$. In fact, $H$ reduces to the entropy function $d_\varphi$ in (5) when the Euclidean Jordan algebra $\mathbb{A}$ is specified as $(\mathbb{R}^n, \circ, \langle\cdot,\cdot\rangle_{\mathbb{R}^n})$ with "$\circ$" denoting the componentwise product of two vectors in $\mathbb{R}^n$ and $\langle\cdot,\cdot\rangle_{\mathbb{R}^n}$ being the standard dot product $\langle a, b\rangle_{\mathbb{R}^n} = \sum_{i=1}^n a_i b_i$ for $a, b \in \mathbb{R}^n$. As shown in Proposition 3.1 below, most, though not all, of the important properties of $d_\varphi(\cdot,\cdot)$ also hold for $H(\cdot,\cdot)$.
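For concreteness, in the algebra $\mathbb{S}^n$ the function (12) reads $H(X, Y) = \mathrm{tr}(X\ln X - X\ln Y + Y - X)$, with matrix logarithms taken spectrally; up to the linear terms this is the quantum (von Neumann) relative entropy. A minimal sketch (our own code) that also checks Proposition 3.1(c) numerically:

```python
import numpy as np

def H(X, Y):
    # H(X, Y) = tr(X ln X - X ln Y + Y - X) on S^n, with 0*ln(0) = 0,
    # for X positive semidefinite and Y positive definite.
    lamX, _ = np.linalg.eigh(X)
    lamY, UY = np.linalg.eigh(Y)
    tr_xlogx = np.sum(np.where(lamX > 0,
                               lamX * np.log(np.where(lamX > 0, lamX, 1.0)),
                               0.0))
    logY = UY @ np.diag(np.log(lamY)) @ UY.T
    return tr_xlogx - np.trace(X @ logY) + np.trace(Y) - np.trace(X)

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3)); X = A @ A.T + 0.1 * np.eye(3)
B = rng.standard_normal((3, 3)); Y = B @ B.T + 0.1 * np.eye(3)
print(H(X, Y) >= 0, np.isclose(H(X, X), 0.0))   # True True, as in Prop. 3.1(c)
```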
In what follows, we present several technical lemmas that will be used to establish some favorable properties of the distance measure $H$. We start with the extension of the von Neumann trace inequality to Euclidean Jordan algebras, whose proof can be found in [2].
Lemma 3.1 For any $x, y \in \mathbb{V}$, we have $\mathrm{tr}(x \circ y) \le \sum_{j=1}^r \lambda_j(x)\lambda_j(y) = \lambda(x)^T\lambda(y)$, where $\lambda(x)$ and $\lambda(y)$ are the spectral vectors of $x$ and $y$, respectively.
Lemma 3.2 For any $x \in \mathcal{K}$, let $\Phi(x) := \mathrm{tr}(x \circ \ln x)$. Then we have the following results.
(a) $\Phi(x)$ is the spectral function generated by the symmetric entropy function
$$\widehat{\Phi}(u) = \sum_{j=1}^r u_j \ln u_j, \qquad \forall\, u \in \mathbb{R}^r_+. \tag{13}$$
(b) $\Phi(x)$ is continuously differentiable on $\mathrm{int}(\mathcal{K})$ with $\nabla\Phi(x) = \ln x + e$.
(c) $\Phi(x)$ is strictly convex over $\mathcal{K}$.
Proof. (a) Suppose that $x$ has the spectral decomposition $x = \sum_{j=1}^r \lambda_j(x)\, c_j$, and let $\phi(t) = t\ln t$ for $t \in \mathbb{R}_+$. Then, from Section 2, the vector-valued function $x \circ \ln x$ is exactly the Löwner operator $\phi^{\rm sc}(x)$, i.e., $\phi^{\rm sc}(x) = x \circ \ln x$. Clearly, $\phi^{\rm sc}$ is well defined for any $x \in \mathcal{K}$ and
$$\phi^{\rm sc}(x) = x \circ \ln x = \sum_{j=1}^r \lambda_j(x)\ln(\lambda_j(x))\, c_j.$$
Therefore,
$$\Phi(x) = \mathrm{tr}(\phi^{\rm sc}(x)) = \sum_{j=1}^r \lambda_j(x)\ln(\lambda_j(x)) = \widehat{\Phi}(\lambda(x)),$$
where $\widehat{\Phi} : \mathbb{R}^r_+ \to \mathbb{R}$ is given by (13). By Definition 2.1, the function $\widehat{\Phi}$ is clearly symmetric. Consequently, $\Phi(x)$ is the spectral function generated by the symmetric function $\widehat{\Phi}$ in view of (8).
(b) From Lemma 2.2, we know that $\phi^{\rm sc}(x) = x \circ \ln x$ is continuously differentiable on $\mathrm{int}(\mathcal{K})$. Thus, $\Phi(x)$ is also continuously differentiable on $\mathrm{int}(\mathcal{K})$, because $\Phi$ is the composition of the trace function (clearly continuously differentiable) with $\phi^{\rm sc}$. It remains to find its gradient formula. From the fact that $\mathrm{tr}(x \circ y) = \langle x, y\rangle$, we have
$$\Phi(x) = \mathrm{tr}(x \circ \ln x) = \langle x, \ln x\rangle.$$
Applying the chain rule to this inner product of two functions, we obtain
$$\nabla\Phi(x) = \ln x + (\nabla\ln x)\,x = \ln x + (\ln x)'x. \tag{14}$$
On the other hand, from Lemma 2.2 it follows that for any $h \in \mathbb{V}$,
$$(\ln x)'h = \sum_{j=1}^r \frac{1}{\lambda_j(x)}\langle c_j, h\rangle\, c_j + \sum_{1\le j<l\le r} 4\,\frac{\ln(\lambda_j(x)) - \ln(\lambda_l(x))}{\lambda_j(x) - \lambda_l(x)}\; c_j \circ (c_l \circ h).$$
By this, it is easy to compute that
$$(\ln x)'x = \sum_{j=1}^r \frac{1}{\lambda_j(x)}\langle c_j, x\rangle\, c_j + \sum_{1\le j<l\le r} 4\,\frac{\ln(\lambda_j(x)) - \ln(\lambda_l(x))}{\lambda_j(x) - \lambda_l(x)}\; c_j \circ (c_l \circ x) = \sum_{j=1}^r \frac{1}{\lambda_j(x)}\,\lambda_j(x)\, c_j + \sum_{1\le j<l\le r} 4\,\frac{\ln(\lambda_j(x)) - \ln(\lambda_l(x))}{\lambda_j(x) - \lambda_l(x)}\,\lambda_l(x)\; c_j \circ c_l = e,$$
where the last equality holds because $\langle c_j, x\rangle = \lambda_j(x)$, $c_j \circ c_l = 0$ for $j \neq l$, and $\sum_{j=1}^r c_j = e$. This together with equation (14) yields the desired result.
(c) Note that the function $\widehat{\Phi}$ in (13) is strictly convex over $\mathbb{R}^r_+$. Therefore, the conclusion follows immediately from part (a) and [2, Theorem 41]. □
Now we are in a position to summarize some favorable properties of the distance-like function $H$. These properties play an important role in the convergence analysis of the algorithm (11)-(12), as will be seen in the next section.
Proposition 3.1 Let $H(x, y)$ be given as in (12). Then the following results hold.
(a) $H(x, y)$ is continuous on $\mathcal{K}\times\mathrm{int}(\mathcal{K})$, and $H(\cdot, y)$ is strictly convex for any $y \in \mathrm{int}(\mathcal{K})$.
(b) For any fixed $y \in \mathrm{int}(\mathcal{K})$, $H(\cdot, y)$ is continuously differentiable on $\mathrm{int}(\mathcal{K})$ with
$$\nabla_x H(x, y) = \ln x - \ln y.$$
(c) $H(x, y) \ge 0$ for any $x \in \mathcal{K}$ and $y \in \mathrm{int}(\mathcal{K})$; moreover, $H(x, y) = 0$ if and only if $x = y$.
(d) $H(x, y) \ge d(\lambda(x), \lambda(y))$ for any $x \in \mathcal{K}$, $y \in \mathrm{int}(\mathcal{K})$, where $d(\cdot,\cdot)$ is defined by
$$d(u, v) := \sum_{i=1}^r \left[\, u_i\ln u_i - u_i\ln v_i + v_i - u_i \,\right], \qquad \forall\, u \in \mathbb{R}^r_+,\ v \in \mathbb{R}^r_{++}. \tag{15}$$
(e) The level sets of $H(\cdot, y)$ and $H(x, \cdot)$ are bounded for any fixed $y \in \mathrm{int}(\mathcal{K})$ and $x \in \mathcal{K}$, respectively.
Proof. (a) Since $x \circ \ln x$ and $x \circ \ln y$ are continuous in $x \in \mathcal{K}$ and $y \in \mathrm{int}(\mathcal{K})$, and the trace function is also continuous, $H$ is continuous over $\mathcal{K}\times\mathrm{int}(\mathcal{K})$. On the other hand, from the definition of $H$ it follows that
$$H(x, y) = \Phi(x) - \mathrm{tr}(x \circ \ln y) + \mathrm{tr}(y) - \mathrm{tr}(x). \tag{16}$$
Notice that $\Phi(x)$ is strictly convex over $\mathcal{K}$ by Lemma 3.2(c), and the other terms in the above expression are clearly convex in $x$ for any fixed $y \in \mathrm{int}(\mathcal{K})$. Therefore, $H(\cdot, y)$ is strictly convex for any fixed $y \in \mathrm{int}(\mathcal{K})$.
(b) From the expression (16) of $H(x, y)$ and Lemma 3.2(b), the function $H(\cdot, y)$ is clearly continuously differentiable on $\mathrm{int}(\mathcal{K})$. Moreover,
$$\nabla_x H(x, y) = \nabla\Phi(x) - \ln y - e = \ln x - \ln y.$$
(c) From the definition of $\Phi(x)$ and its gradient formula in Lemma 3.2(b),
$$\Phi(x) - \Phi(y) - \langle\nabla\Phi(y), x - y\rangle = \mathrm{tr}(x \circ \ln x) - \mathrm{tr}(y \circ \ln y) - \langle \ln y + e,\, x - y\rangle = \mathrm{tr}(x \circ \ln x) - \mathrm{tr}(x \circ \ln y) - \mathrm{tr}(x) + \mathrm{tr}(y) = H(x, y) \tag{17}$$
for any $x \in \mathcal{K}$ and $y \in \mathrm{int}(\mathcal{K})$. On the other hand, the strict convexity of $\Phi$ established in Lemma 3.2(c) implies that
$$\Phi(x) - \Phi(y) - \langle\nabla\Phi(y), x - y\rangle \ge 0,$$
with equality if and only if $x = y$. These two facts readily give the desired result.
(d) First, from Lemma 3.1 it follows that
$$\mathrm{tr}(x \circ \ln y) \le \sum_{j=1}^r \lambda_j(x)\,\lambda_j(\ln y) = \sum_{j=1}^r \lambda_j(x)\ln(\lambda_j(y)), \qquad \forall\, x \in \mathbb{V},\ y \in \mathrm{int}(\mathcal{K}).$$
Applying this inequality, we obtain for all $x \in \mathcal{K}$ and $y \in \mathrm{int}(\mathcal{K})$,
$$H(x, y) = \mathrm{tr}(x \circ \ln x + y - x) - \mathrm{tr}(x \circ \ln y) \ge \mathrm{tr}(x \circ \ln x + y - x) - \sum_{j=1}^r \lambda_j(x)\ln(\lambda_j(y)) = \sum_{j=1}^r \left[\, \lambda_j(x)\ln(\lambda_j(x)) + \lambda_j(y) - \lambda_j(x) \,\right] - \sum_{j=1}^r \lambda_j(x)\ln(\lambda_j(y)) = \sum_{j=1}^r \left[\, \lambda_j(x)\ln(\lambda_j(x)) - \lambda_j(x)\ln(\lambda_j(y)) + \lambda_j(y) - \lambda_j(x) \,\right] = d(\lambda(x), \lambda(y)).$$
(e) For a fixed $y \in \mathrm{int}(\mathcal{K})$ and $\gamma \ge 0$, let $L_\gamma := \{x \in \mathcal{K} \mid H(x, y) \le \gamma\}$. By part (d), we have $L_\gamma \subseteq \{x \in \mathcal{K} \mid d(\lambda(x), \lambda(y)) \le \gamma\}$. Since $d(\cdot, v)$ has bounded level sets for any $v \in \mathbb{R}^r_{++}$ (see [13, Proposition 4]) and $\lambda(\cdot)$ is continuous, $L_\gamma$ is bounded for all $\gamma \ge 0$. Similarly, $H(x, \cdot)$ has bounded level sets for any $x \in \mathcal{K}$. □
Proposition 3.2 Let $H(x, y)$ be given as in (12). Suppose $\{x^k\}_{k\in\mathbb{N}} \subseteq \mathcal{K}$ and $\{y^k\}_{k\in\mathbb{N}} \subset \mathrm{int}(\mathcal{K})$ are bounded sequences such that $H(x^k, y^k) \to 0$. Then, as $k \to +\infty$, we have
(a) $\lambda_j(x^k) - \lambda_j(y^k) \to 0$ for all $j = 1, 2, \cdots, r$;
(b) $\mathrm{tr}(x^k - y^k) \to 0$.
Proof. (a) By Proposition 3.1(d) and the nonnegativity of $d(\cdot,\cdot)$,
$$H(x^k, y^k) \ge d(\lambda(x^k), \lambda(y^k)) \ge 0.$$
Hence, $H(x^k, y^k) \to 0$ implies $d(\lambda(x^k), \lambda(y^k)) \to 0$. By the definition of $d(\cdot,\cdot)$ in (15),
$$d(\lambda(x^k), \lambda(y^k)) = \sum_{j=1}^r \lambda_j(y^k)\,\varphi\big(\lambda_j(x^k)/\lambda_j(y^k)\big)$$
with $\varphi(t) = t\ln t - t + 1$ $(t \ge 0)$. Since $\varphi(t) \ge 0$ for any $t \ge 0$, each term of the above sum is nonnegative. Therefore, $d(\lambda(x^k), \lambda(y^k)) \to 0$ implies that
$$\lambda_j(y^k)\,\varphi\big(\lambda_j(x^k)/\lambda_j(y^k)\big) \to 0, \qquad j = 1, 2, \cdots, r,$$
which is equivalent to
$$\lambda_j(x^k)\ln(\lambda_j(x^k)) - \lambda_j(x^k)\ln(\lambda_j(y^k)) + \lambda_j(y^k) - \lambda_j(x^k) \to 0, \qquad j = 1, 2, \cdots, r.$$
Since $\{\lambda_j(x^k)\}$ and $\{\lambda_j(y^k)\}$ are bounded, using [6, Lemma A.1] yields $\lambda_j(x^k) - \lambda_j(y^k) \to 0$ for all $j = 1, 2, \cdots, r$.
(b) Since $\mathrm{tr}(x^k - y^k) = \sum_{j=1}^r \big(\lambda_j(x^k) - \lambda_j(y^k)\big)$, the desired result follows from part (a). □
Finally, we present two useful relations for the function $H$; they can easily be verified by direct substitution, using the definition of $H$ in (12) and recalling the nonnegativity of $H$ (see Proposition 3.1(c)).
Proposition 3.3 Let $H(x, y)$ be given as in (12). For all $x, y \in \mathrm{int}(\mathcal{K})$ and $z \in \mathcal{K}$, we have
(a) $H(z, x) - H(z, y) = \mathrm{tr}(z \circ \ln y - z \circ \ln x + x - y)$;
(b) $\mathrm{tr}\big((z - y) \circ (\ln y - \ln x)\big) = H(z, x) - H(z, y) - H(y, x) \le H(z, x) - H(z, y)$.
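A short numerical verification of the three-point identity in Proposition 3.3(b) on random matrices in $\mathbb{S}^3_{++}$ (our own test code, using scipy's matrix logarithm):

```python
import numpy as np
from scipy.linalg import logm

def H(X, Y):
    # H(X, Y) = tr(X ln X - X ln Y + Y - X) for X, Y in S^n_{++}.
    return np.trace(X @ logm(X) - X @ logm(Y) + Y - X).real

rng = np.random.default_rng(5)
mk = lambda: (lambda A: A @ A.T + 0.1 * np.eye(3))(rng.standard_normal((3, 3)))
X, Y, Z = mk(), mk(), mk()
# For symmetric matrices, tr((Z - Y) o (ln Y - ln X)) with the Jordan
# product equals the ordinary trace tr((Z - Y)(ln Y - ln X)).
lhs = np.trace((Z - Y) @ (logm(Y) - logm(X))).real
assert np.isclose(lhs, H(Z, X) - H(Z, Y) - H(Y, X))   # Proposition 3.3(b)
assert lhs <= H(Z, X) - H(Z, Y) + 1e-10               # since H(Y, X) >= 0
```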
4 Convergence analysis of the algorithm
The analysis of the entropy-like proximal algorithm (11)-(12) for the CSCP is similar to that developed by Güler [12] for quadratic proximal methods in convex minimization and to its extension by Chen and Teboulle [4] to proximal-like algorithms using Bregman functions. In what follows, we present a global convergence rate estimate for the entropy-like proximal algorithm (11)-(12) in terms of function values.
We make the following assumptions on the problem CSCP:
(A1) $\inf\{f(x) \mid x \succeq_{\mathcal{K}} 0\} := f_* > -\infty$;
(A2) $\mathrm{dom}(f) \cap \mathrm{int}(\mathcal{K}) \neq \emptyset$, where $\mathrm{dom}(f) := \{x \in \mathbb{V} \mid f(x) < \infty\}$.
In addition, we denote the optimal solution set of the CSCP by $X^* := \{x \in \mathcal{K} \mid f(x) = f_*\}$.
Before proving the convergence results, we must show that the algorithm (11)-(12) is well defined, i.e., that it generates a sequence $\{x^k\}_{k\in\mathbb{N}} \subset \mathrm{int}(\mathcal{K})$. The following technical lemma is needed for the proof.
Lemma 4.1 If $x \succeq_{\mathcal{K}} 0$, $y \succ_{\mathcal{K}} 0$ and $\mathrm{tr}(x \circ y) = 0$, then $x = 0$.
Proof. Clearly, $\mathrm{tr}(x \circ y) = 0$ implies $\langle x, y\rangle = 0$. From the self-duality of $\mathcal{K}$ and [11, Proposition I.1.4], we know that
$$u \in \mathrm{int}(\mathcal{K}) \iff \langle u, v\rangle > 0 \quad \forall\, 0 \neq v \in \mathcal{K}.$$
This immediately implies the desired result $x = 0$. □
Proposition 4.1 Consider the proximal algorithm defined in (11)-(12). For any $y \in \mathrm{int}(\mathcal{K})$ and $\mu > 0$, the following results hold.
(a) Under assumption (A1), the function $f(\cdot) + \mu^{-1}H(\cdot, y)$ has bounded level sets.
(b) If, in addition, assumption (A2) holds, then there exists a unique
$$x(y) = \mathop{\rm argmin}_{x\,\succeq_{\mathcal{K}}\,0}\left\{ f(x) + \frac{1}{\mu}H(x, y) \right\}, \tag{18}$$
and the minimum is attained at $x(y) \in \mathrm{int}(\mathcal{K})$ satisfying
$$\mu^{-1}\big(\ln y - \ln x(y)\big) \in \partial f(x(y)), \tag{19}$$
where $\partial f$ is the subdifferential of $f$ given as in (9).
Proof. (a) Fix $y \in \mathrm{int}(\mathcal{K})$ and $\mu > 0$, and let $F_y : \mathbb{V} \to (-\infty, +\infty]$ be given by $F_y(x) := f(x) + \mu^{-1}H(x, y)$. By Proposition 3.1(e), the level sets
$$L_\gamma := \{ x \in \mathcal{K} \mid H(x, y) \le \gamma\} \tag{20}$$
are bounded for all $\gamma \ge 0$ and any fixed $y \in \mathrm{int}(\mathcal{K})$. Since $f(x) \ge f_*$ on $\mathcal{K}$ by assumption (A1), every level set of $F_y$ is contained in some $L_\gamma$, and hence $F_y$ has bounded level sets.
(b) From part (a), $F_y$ has bounded level sets, which implies the existence of a minimum point $x(y)$; the strict convexity of $F_y$ from Proposition 3.1(a) guarantees its uniqueness. Under assumption (A2), using Proposition 3.1(b) and the optimality condition for (18) gives
$$0 \in \partial f(x(y)) + \mu^{-1}\big(\ln x(y) - \ln y\big) + \partial\delta(x(y)\,|\,\mathcal{K}), \tag{21}$$
where $\delta(z\,|\,\mathcal{K}) = 0$ if $z \succeq_{\mathcal{K}} 0$ and $+\infty$ otherwise. We will show that $\partial\delta(x(y)\,|\,\mathcal{K}) = \{0\}$, whence the desired result follows. We observe that for any $x^k \in \mathrm{int}(\mathcal{K})$ with $x^k \to x \in \mathrm{bd}(\mathcal{K})$,
$$\|\ln x^k\| = \Big(\sum_{j=1}^r [\ln(\lambda_j(x^k))]^2\Big)^{1/2} \to +\infty,$$
since, by the continuity of $\lambda_j(\cdot)$ and of $\ln t$, we have $\lambda_{\min}(x^k) \to \lambda_{\min}(x) = 0$. Consequently,
$$\|\nabla_x H(x^k, y)\| = \|\ln x^k - \ln y\| \ge \|\ln x^k\| - \|\ln y\| \to +\infty.$$
By [21, pp. 251], this means that $H(\cdot, y)$ is essentially smooth, and hence $\partial_x H(x, y) = \emptyset$ for all $x \in \mathrm{bd}(\mathcal{K})$ by [21, Theorem 26.1]. Thus, from (21), we must have $x(y) \in \mathrm{int}(\mathcal{K})$, since otherwise $\partial_x H(x(y), y) = \emptyset$, which contradicts (21). Furthermore, from [21, pp. 226] it follows that
$$\partial\delta(z\,|\,\mathcal{K}) = \{v \in \mathbb{V} \mid v \preceq_{\mathcal{K}} 0,\ \mathrm{tr}(v \circ z) = 0\}.$$
Using Lemma 4.1 with $x(y) \in \mathrm{int}(\mathcal{K})$, we then obtain $\partial\delta(x(y)\,|\,\mathcal{K}) = \{0\}$. Thus, the proof is completed. □

The next result provides the key properties of the algorithm (11)-(12), from which our convergence results will follow.
Proposition 4.2 Let $\{x^k\}_{k\in\mathbb{N}}$ be the sequence generated by the algorithm (11)-(12), and let $\sigma_m := \sum_{k=1}^m \mu_k$. Then the following results hold.
(a) The sequence $\{f(x^k)\}_{k\in\mathbb{N}}$ is nonincreasing.
(b) $\mu_k(f(x^k) - f(x)) \le H(x, x^{k-1}) - H(x, x^k)$ for all $x \in \mathcal{K}$.
(c) $\sigma_m(f(x^m) - f(x)) \le H(x, x^0) - H(x, x^m)$ for all $x \in \mathcal{K}$.
(d) $\{H(x, x^k)\}_{k\in\mathbb{N}}$ is nonincreasing for any $x \in X^*$, provided the optimal set $X^* \neq \emptyset$.
(e) $H(x^k, x^{k-1}) \to 0$ and $\mathrm{tr}(x^k - x^{k-1}) \to 0$, provided the optimal set $X^* \neq \emptyset$.
Proof. (a) By the definition of $x^k$ in (11), we know that
$$f(x^k) + \mu_k^{-1}H(x^k, x^{k-1}) \le f(x^{k-1}) + \mu_k^{-1}H(x^{k-1}, x^{k-1}).$$
Since $H(x^k, x^{k-1}) \ge 0$ and $H(x^{k-1}, x^{k-1}) = 0$ (see Proposition 3.1(c)), we obtain $f(x^k) \le f(x^{k-1})$ for all $k \in \mathbb{N}$.
(b) From (19), we have $-\mu_k^{-1}(\ln x^k - \ln x^{k-1}) \in \partial f(x^k)$. Plugging this into the characterization (10) of $\partial f(x^k)$, we obtain
$$f(x) \ge f(x^k) - \mu_k^{-1}\,\mathrm{tr}\big((x - x^k) \circ (\ln x^k - \ln x^{k-1})\big) \qquad \forall\, x \in \mathcal{K}.$$
This yields
$$\mu_k\big(f(x^k) - f(x)\big) \le \mathrm{tr}\big((x - x^k) \circ (\ln x^k - \ln x^{k-1})\big) = H(x, x^{k-1}) - H(x, x^k) - H(x^k, x^{k-1}) \le H(x, x^{k-1}) - H(x, x^k), \tag{22}$$
where both the equality and the last inequality follow from Proposition 3.3(b).
(c) First, summing the inequalities of part (b) over $k = 1, 2, \cdots, m$ yields
$$\sum_{k=1}^m \mu_k f(x^k) - \sigma_m f(x) \le H(x, x^0) - H(x, x^m) \qquad \forall\, x \in \mathcal{K}. \tag{23}$$
On the other hand, $f(x^m) \le f(x^k)$ for $k \le m$ by part (a). Multiplying this inequality by $\mu_k$ and summing up leads to
$$\sum_{k=1}^m \mu_k f(x^k) \ge \sum_{k=1}^m \mu_k f(x^m) = \sigma_m f(x^m). \tag{24}$$
Now, combining (23) and (24) yields the desired result.
(d) Since $f(x^k) - f(x) = f(x^k) - f_* \ge 0$ for all $x \in X^*$, the result follows immediately from part (b).
(e) From part (d), the sequence $\{H(x, x^k)\}_{k\in\mathbb{N}}$ is nonincreasing for any $x \in X^*$. Together with $H(x, x^k) \ge 0$, this implies that $\{H(x, x^k)\}_{k\in\mathbb{N}}$ is convergent, and thus
$$H(x, x^{k-1}) - H(x, x^k) \to 0 \qquad \forall\, x \in X^*. \tag{25}$$
On the other hand, from (22) in the proof of part (b) together with part (d), we have
$$0 \le \mu_k\big(f(x^k) - f(x)\big) \le H(x, x^{k-1}) - H(x, x^k) - H(x^k, x^{k-1}) \qquad \forall\, x \in X^*,$$
and hence
$$H(x^k, x^{k-1}) \le H(x, x^{k-1}) - H(x, x^k).$$
Using (25) and the nonnegativity of $H(x^k, x^{k-1})$, the first result follows. Moreover, since $\{x^k\}_{k\in\mathbb{N}} \subset \{y \in \mathrm{int}(\mathcal{K}) \mid H(x, y) \le H(x, x^0)\}$ for $x \in X^*$, the sequence $\{x^k\}_{k\in\mathbb{N}}$ is bounded. Thus, the second result follows from the first one and Proposition 3.2(b). □
By now, we have proved that the algorithm (11)-(12) is well defined and satisfies several favorable properties. With these properties, we are ready to establish the convergence results for the proposed algorithm, which is one of the main purposes of this paper.
Proposition 4.3 Let $\{x^k\}_{k\in\mathbb{N}}$ be the sequence generated by the algorithm (11)-(12), and let $\sigma_m := \sum_{k=1}^m \mu_k$. Then, under assumptions (A1) and (A2), the following results hold.
(a) $f(x^m) - f(x) \le \sigma_m^{-1}H(x, x^0)$ for all $x \in \mathcal{K}$.
(b) If $\sigma_m \to +\infty$, then $\lim_{m\to+\infty} f(x^m) = f_*$.
(c) If the optimal set $X^* \neq \emptyset$, the sequence $\{x^k\}_{k\in\mathbb{N}}$ is bounded and every accumulation point is a solution of the problem CSCP.
Proof. (a) This follows from Proposition 4.2(c) and the nonnegativity of $H(x, x^m)$.
(b) Since $\sigma_m \to +\infty$, passing to the limit in part (a) gives
$$\limsup_{m\to+\infty} f(x^m) \le f(x) \qquad \forall\, x \in \mathcal{K}.$$
Together with the fact $f(x^m) \ge f_*$, this yields the desired result.
(c) From Proposition 3.1(e), $H(x, \cdot)$ has bounded level sets for every $x \in \mathcal{K}$. Also, from Proposition 4.2(d), $\{H(x, x^k)\}_{k\in\mathbb{N}}$ is nonincreasing for all $x \in X^*$ if $X^* \neq \emptyset$. Thus, the sequence $\{x^k\}_{k\in\mathbb{N}}$ is bounded and therefore has an accumulation point. Let $\hat{x} \in \mathcal{K}$ be an accumulation point of $\{x^k\}_{k\in\mathbb{N}}$, say $x^{k_j} \to \hat{x}$ for some $k_j \to +\infty$. Since $f$ is lower semicontinuous, we have $f(\hat{x}) \le \liminf_{k_j\to+\infty} f(x^{k_j})$ (see [24, page 8]). On the other hand, we also have $f(x^{k_j}) \to f_*$; hence $f(\hat{x}) = f_*$, which says that $\hat{x}$ is a solution of the CSCP. □
Proposition 4.3(a) provides an estimate of the global rate of convergence similar to the ones obtained by proximal-like algorithms in convex minimization. Analogously, as remarked in [6, Remark 4.1], global convergence of $\{x^k\}$ itself to an optimal solution of (1) can be achieved under the assumption that $\{x^k\} \subset \mathrm{int}(\mathcal{K})$ has a limit point in $\mathrm{int}(\mathcal{K})$. Nonetheless, this assumption is rather stringent.
5 The exponential multiplier method for SCLP
In this section, we present a dual application of the entropy-like proximal algorithm (11)-(12), leading to an exponential multiplier method for the symmetric cone linear program
$$\text{(SCLP)} \qquad \min\Big\{ -b^T y \ :\ c - \sum_{i=1}^m y_i a_i \succeq_{\mathcal{K}} 0,\ y \in \mathbb{R}^m \Big\}, \tag{26}$$
where $c, a_i \in \mathbb{V}$ for $i = 1, 2, \cdots, m$ and $b \in \mathbb{R}^m$. For convenience, we denote the feasible set of (SCLP) by $S = \{y \in \mathbb{R}^m : \mathcal{A}(y) \succeq_{\mathcal{K}} 0\}$, where $\mathcal{A} : \mathbb{R}^m \to \mathbb{V}$ is defined by
$$\mathcal{A}(y) := c - \sum_{i=1}^m y_i a_i. \tag{27}$$
We make the following assumptions on the problem (SCLP):
(S1) the optimal solution set of (SCLP) is nonempty and bounded;
(S2) Slater's condition holds, i.e., there exists $\hat{y} \in \mathbb{R}^m$ such that $\mathcal{A}(\hat{y}) \succ_{\mathcal{K}} 0$.
The Lagrangian function associated with (SCLP) is given by
$$L(y, x) = \begin{cases} -b^T y - \mathrm{tr}\big[x \circ \mathcal{A}(y)\big] & \text{if } x \in \mathcal{K}, \\ +\infty & \text{otherwise.} \end{cases}$$
Therefore, the dual problem corresponding to (SCLP) is given by
$$\text{(DSCLP)} \qquad \max_{x\,\succeq_{\mathcal{K}}\,0}\ \inf_{y\in\mathbb{R}^m} L(y, x),$$
which can be written explicitly as
$$\text{(DSCLP)} \qquad \begin{array}{ll} \max & -\mathrm{tr}(c \circ x) \\ \text{s.t.} & \mathrm{tr}(a_i \circ x) = b_i, \quad i = 1, 2, \cdots, m, \\ & x \succeq_{\mathcal{K}} 0. \end{array}$$
By standard convex analysis arguments from [21], we know that under assumption (S2) there is no duality gap between (SCLP) and (DSCLP). Moreover, the set of optimal solutions of (DSCLP) is nonempty and compact.
To solve the problem (SCLP), we introduce the following multiplier-type algorithm.

Algorithm 5.1 Given $x^0 \in \mathrm{int}(\mathcal{K})$, generate the sequences $\{y^k\}_{k\in\mathbb{N}} \subseteq \mathbb{R}^m$ and $\{x^k\}_{k\in\mathbb{N}} \subset \mathrm{int}(\mathcal{K})$ by
$$y^k = \mathop{\rm argmin}_{y\in\mathbb{R}^m}\left\{ -b^T y + \mu_k^{-1}\,\mathrm{tr}\Big[\exp\big(-\mu_k\mathcal{A}(y) + \ln x^{k-1}\big)\Big] \right\}, \tag{28}$$
$$x^k = \exp\big(-\mu_k\mathcal{A}(y^k) + \ln x^{k-1}\big), \tag{29}$$
where the penalty parameters satisfy $\mu_k > \bar{\mu} > 0$ for all $k \in \mathbb{N}$.
This algorithm can be viewed as a natural extension to symmetric cones of the exponential multiplier method for convex programs over the nonnegative orthant (see, e.g., [3, 29]). In the rest of this section, we show that the algorithm (28)-(29) enjoys similar properties; a small numerical illustration follows.
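The following minimal sketch (our own code, not part of the paper's analysis) runs the iteration (28)-(29) on a tiny SCLP over $\mathbb{S}^2$, using scipy's matrix exponential and logarithm and a generic smooth optimizer for the subproblem (28); the data $b$, $c$, $a$ are invented for illustration.

```python
import numpy as np
from scipy.linalg import expm, logm
from scipy.optimize import minimize

def Acal(y, c, a):
    # A(y) = c - sum_i y_i a_i, as in (27).
    return c - sum(yi * ai for yi, ai in zip(y, a))

def exp_multiplier_method(b, c, a, x0, mu=1.0, iters=20):
    # Exponential multiplier iteration (28)-(29) on S^n.
    x = x0
    for _ in range(iters):
        lx = logm(x).real
        F = lambda y: -b @ y + np.trace(expm(-mu * Acal(y, c, a) + lx)).real / mu
        y = minimize(F, np.zeros(len(b)), method="BFGS").x   # subproblem (28)
        x = expm(-mu * Acal(y, c, a) + lx).real              # update (29)
    return y, x

# Tiny invented instance: min -y s.t. I - y*I >= 0 in S^2, i.e. y <= 1,
# whose optimal value is attained at y = 1.
b = np.array([1.0]); c = np.eye(2); a = [np.eye(2)]
y, x = exp_multiplier_method(b, c, a, x0=np.eye(2))
print(y)  # close to [1.]; x stays in int(S^2_+), as Lemma 5.1(a) predicts
```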
We first present two technical lemmas: one collects some properties of the Löwner operator $\exp(z)$, while the other characterizes the recession cone of the set $S$.
Lemma 5.1 (a) For any $z \in \mathbb{V}$, there always holds $\exp(z) \in \mathrm{int}(\mathcal{K})$.
(b) The function $\exp(z)$ is continuously differentiable everywhere with
$$(\exp(z))'e = \nabla_z(\exp(z))\,e = \exp(z) \qquad \forall\, z \in \mathbb{V}.$$
(c) The function $\mathrm{tr}\big[\exp\big(\sum_{i=1}^m y_i a_i\big)\big]$ is continuously differentiable everywhere with
$$\nabla_y\,\mathrm{tr}\Big[\exp\Big(\sum_{i=1}^m y_i a_i\Big)\Big] = \mathcal{A}^T \exp\Big(\sum_{i=1}^m y_i a_i\Big),$$
where $y \in \mathbb{R}^m$, $a_i \in \mathbb{V}$ for $i = 1, 2, \cdots, m$, and $\mathcal{A}^T \in \mathbb{R}^{m\times n}$ is the matrix corresponding to the linear transformation that maps $x \in \mathbb{V}$ to the $m$-vector whose $i$th component is $\langle a_i, x\rangle$.
Proof. (a) The result is clear since for any $z \in \mathbb{V}$ with $z = \sum_{j=1}^r \lambda_j(z)\, c_j$, all eigenvalues of $\exp(z)$, given by $\exp(\lambda_j(z))$ for $j = 1, 2, \cdots, r$, are strictly positive.
(b) By Lemma 2.2, $\exp(z)$ is clearly continuously differentiable everywhere and
$$(\exp(z))'h = \sum_{j=1}^r \exp(\lambda_j(z))\,\langle c_j, h\rangle\, c_j + \sum_{1\le j<l\le r} 4\,\frac{\exp(\lambda_j(z)) - \exp(\lambda_l(z))}{\lambda_j(z) - \lambda_l(z)}\; c_j \circ (c_l \circ h)$$
for any $h \in \mathbb{V}$. From this formula, we have in particular that
$$(\exp(z))'e = \sum_{j=1}^r \exp(\lambda_j(z))\,\langle c_j, e\rangle\, c_j + \sum_{1\le j<l\le r} 4\,\frac{\exp(\lambda_j(z)) - \exp(\lambda_l(z))}{\lambda_j(z) - \lambda_l(z)}\; c_j \circ (c_l \circ e) = \sum_{j=1}^r \exp(\lambda_j(z))\, c_j = \exp(z),$$
where the second equality holds because $\langle c_j, e\rangle = \mathrm{tr}(c_j) = 1$ and $c_j \circ (c_l \circ e) = c_j \circ c_l = 0$ for $j \neq l$.
(c) The continuous differentiability follows from that of the trace function together with part (b). Notice that
$$\mathrm{tr}\Big[\exp\Big(\sum_{i=1}^m y_i a_i\Big)\Big] = \Big\langle \exp\Big(\sum_{i=1}^m y_i a_i\Big),\ e \Big\rangle.$$
Therefore, applying the chain rule readily yields the desired result. □
Lemma 5.2 Let $S^\infty$ denote the recession cone of the feasible set $S$. Then
$$S^\infty = \{d \in \mathbb{R}^m : \mathcal{A}(d) - c \succeq_{\mathcal{K}} 0\}.$$
Proof. Assume that $d \in \mathbb{R}^m$ satisfies $\mathcal{A}(d) - c \succeq_{\mathcal{K}} 0$. If $d = 0$, clearly $d \in S^\infty$. If $d \neq 0$, take any $y \in S$. Then $\mathcal{A}(y) \succeq_{\mathcal{K}} 0$ and, for any $\tau > 0$,
$$\mathcal{A}(y + \tau d) = c - \sum_{i=1}^m (y_i + \tau d_i)a_i = \mathcal{A}(y) + \tau\big(\mathcal{A}(d) - c\big) \succeq_{\mathcal{K}} 0.$$
This, by the definition of a recession direction [21, pp. 61], shows that $d \in S^\infty$. Therefore, $\{d \in \mathbb{R}^m : \mathcal{A}(d) - c \succeq_{\mathcal{K}} 0\} \subseteq S^\infty$.
Now take any $d \in S^\infty$ and $y \in S$. Then $\mathcal{A}(y + \tau d) \succeq_{\mathcal{K}} 0$ for any $\tau > 0$, and therefore $\lambda_{\min}[\mathcal{A}(y + \tau d)] \ge 0$. This implies that $\lambda_{\min}(\mathcal{A}(d) - c) \ge 0$. Indeed, if not, then using
$$\lambda_{\min}[\mathcal{A}(y + \tau d)] = \lambda_{\min}\big[\mathcal{A}(y) + \tau(\mathcal{A}(d) - c)\big] \le \tau\,\lambda_{\min}(\mathcal{A}(d) - c) + \lambda_{\max}(\mathcal{A}(y)),$$
where the inequality is due to Lemma 2.1, and letting $\tau \to +\infty$, we would obtain $\lambda_{\min}[\mathcal{A}(y + \tau d)] \to -\infty$, which contradicts $\lambda_{\min}[\mathcal{A}(y + \tau d)] \ge 0$. Thus $\mathcal{A}(d) - c \succeq_{\mathcal{K}} 0$. Combining the above discussions, we get $S^\infty = \{d \in \mathbb{R}^m \mid \mathcal{A}(d) - c \succeq_{\mathcal{K}} 0\}$. □
We now turn to the convergence of the algorithm (28)-(29). We first prove that the sequence generated by (28)-(29) is well defined, which is implied by the following lemma.
Lemma 5.3 For any $\mu > 0$, let $F : \mathbb{R}^m \to \mathbb{R}$ be defined by
$$F(y) := -b^T y + \mu^{-1}\,\mathrm{tr}\big[\exp\big(-\mu\mathcal{A}(y) + \ln x^{k-1}\big)\big]. \tag{30}$$
Then, under assumption (S1), the set of minimizers of $F$ is nonempty and bounded.
Proof. By assumption (S1), it suffices to prove that $F(y)$ has bounded level sets. Suppose not. Then there exist $\delta \in \mathbb{R}$ and a sequence $\{y^k\} \subseteq \mathbb{R}^m$ such that
$$\|y^k\| \to +\infty, \qquad \lim_{k\to+\infty} \frac{y^k}{\|y^k\|} = d \neq 0, \qquad \text{and} \qquad F(y^k) \le \delta. \tag{31}$$
Since $\exp(-\mu\mathcal{A}(y) + \ln x^{k-1}) \in \mathrm{int}(\mathcal{K})$ for any $y \in \mathbb{R}^m$ by Lemma 5.1(a), we have
$$\mathrm{tr}\Big[\exp\big(-\mu\mathcal{A}(y) + \ln x^{k-1}\big)\Big] = \sum_{j=1}^r \exp\Big[\lambda_j\big(-\mu\mathcal{A}(y) + \ln x^{k-1}\big)\Big] > 0.$$
Thus, $F(y^k) \le \delta$ implies that the following two inequalities hold:
$$-b^T y^k < \delta, \qquad \sum_{j=1}^r \exp\Big[\lambda_j\big(-\mu\mathcal{A}(y^k) + \ln x^{k-1}\big)\Big] \le \mu(\delta + b^T y^k).$$
From this, we immediately obtain that
$$-b^T d \le 0, \tag{32}$$
$$\exp\Big[\lambda_j\big(-\mu\mathcal{A}(y^k) + \ln x^{k-1}\big)\Big] \le \mu(\delta + b^T y^k), \qquad \forall\, j = 1, 2, \cdots, r. \tag{33}$$
Using the monotonicity of $\ln t$ $(t > 0)$ and the positive homogeneity of $\lambda_j(\cdot)$, we can rewrite (33) as
$$\lambda_j\left( \frac{-\mu\mathcal{A}(y^k)}{\|y^k\|} + \frac{\ln x^{k-1}}{\|y^k\|} \right) \le \frac{\ln\big(\mu(\delta + b^T y^k)\big)}{\|y^k\|}, \qquad \forall\, j = 1, 2, \cdots, r.$$
Since $b^T y^k \le \|b\|\,\|y^k\|$, the numerator on the right-hand side grows at most logarithmically in $\|y^k\|$, so the right-hand side tends to $0$. Passing to the limit on both sides, noting that $\mathcal{A}(y^k)/\|y^k\| \to \mathcal{A}(d) - c$ and $\ln x^{k-1}/\|y^k\| \to 0$, and using the continuity of the eigenvalue maps, we get
$$\lambda_j\big(-\mu(\mathcal{A}(d) - c)\big) \le 0, \qquad \forall\, j = 1, 2, \cdots, r,$$
i.e., $\lambda_j(\mathcal{A}(d) - c) \ge 0$ for all $j$, which in turn means that
$$\mathcal{A}(d) - c \succeq_{\mathcal{K}} 0. \tag{34}$$
From Lemma 5.2, we know that $\{d \in \mathbb{R}^m : \mathcal{A}(d) - c \succeq_{\mathcal{K}} 0\} = S^\infty$. Combining this with (32), we have thus exhibited a nonzero $d \in S^\infty$ with $-b^T d \le 0$, i.e., a recession direction of $S$ along which the objective $-b^T y$ is nonincreasing. This clearly contradicts assumption (S1). Hence, the proof is completed. □
To analyze the convergence of the algorithm (28)-(29), we also need the following result, which states that the sequence $\{x^k\}_{k\in\mathbb{N}}$ generated by (28)-(29) is exactly the sequence produced by the algorithm (11)-(12) when applied to the dual problem (DSCLP).