• 沒有找到結果。

Approximation Accuracy, Gradient Methods, and Error Bound for Structured Convex Optimization1

N/A
N/A
Protected

Academic year: 2022

Share "Approximation Accuracy, Gradient Methods, and Error Bound for Structured Convex Optimization1"

Copied!
34
0
0

加載中.... (立即查看全文)

全文

(1)

Approximation Accuracy, Gradient Methods, and Error Bound for Structured Convex Optimization

1

May 14, 2009 (revised July 7, 2009) Paul Tseng2

Abstract

Convex optimization problems arising in applications, possibly as approximations of in- tractable problems, are often structured and large scale. When the data are noisy, it is of interest to bound the solution error relative to the (unknown) solution of the original noiseless problem. Related to this is an error bound for the linear convergence analysis of first-order gradient methods for solving these problems. Example applications include com- pressed sensing, variable selection in regression, TV-regularized image denoising, and sensor network localization.

Key words. Convex optimization, compressed sensing, ℓ1-regularization, nuclear/trace norm, regression, variable selection, sensor network localization, approximation accuracy, proximal gradient method, error bound, linear convergence.

1 Introduction

Optimization problems arising in application areas such as signal/image denoising, compressed sensing, regression, multi-task learning, classification, sensor network localization, are often large scale, possibly nonconvex or NP-hard, and the data may be noisy. Such problems may be approximated by convex relaxations that are highly structured.

Question 1: How accurate approximations are the convex relaxations?

Specifically, can we bound the error between a solution of the convex relaxation and an (un- known) solution of the original noiseless problem in terms of knowable quantities such as the noise level? This question is meaningful when a solution of the original problem changes (semi)continuously with small perturbations in the data, so the problem has both discrete and continuous nature. Examples include compressed sensing and sensor network localization; see Section 2. A certain noise-aware property of the convex relaxation appears key.

Question 2: How fast can the convex relaxations be solved?

Due to the large problem size, first-order gradient methods seem better suited to exploit struc- tures such as sparsity, (partial) separability, and simple nonsmoothness in the convex relaxations;

1This work is supported by National Science Foundation grant DMS-0511283.

2Department of Mathematics, University of Washington, Seattle, WA 98195, U.S.A.

(tseng@math.washington.edu)

(2)

see Section 3. The asymptotic convergence rate of these methods depend on an error bound on the distance to the solution set of the convex relaxation in terms of a certain residual function.

We prove such an error bound for a class of 2-norm-regularized problems that includes the group lasso for linear and logistic regression; see Section 4. Thus our aims are threefold: exposit on existing results, present a new result, and suggest future research.

We begin with a problem that has received much attention recently: compressed sensing. In the basic version of this problem, we wish to find a sparse representation of a given noiseless signal b0 ∈ ℜm from a dictionary of n elementary signals. This may be formulated as

x|Ax=bmin0♯(x), (1)

where A∈ ℜm×ncomprises the elementary signals for its columns and ♯(x) counts the number of nonzero components in x∈ ℜn. In typical applications, m and n are large (m, n≥ 2000). This problem is known to be difficult (NP-hard) and a popular solution approach is to approximate it by a convex relaxation, with ♯(·) replaced by the 1-norm k· k1.3 This results in a linear program:

x|Ax=bmin0kxk1 (2)

that can be efficiently solved by simplex or interior point methods [28, 103, 105, 150, 154]. More- over, when the optimal value of (1) is sufficiently small and the columns of A are “approximately orthogonal,” which occur with high probability when A is, say, a Gaussian random matrix, the solution of (2) also solves (1), i.e., the relaxation is exact [26, 27, 28, 35, 40, 42, 56, 57, 59, 62, 83, 111, 133, 134]. For a noisy signal b, Ax = b may be inconsistent and we seek a sparse solution x whose residual Ax− b is small in the least square sense, either in the primal form of “lasso”

[131]

x| kAx−bkmin2≤ρkxk1 (3)

(ρ > 0) or in the dual form of “basis pursuit” [30, 43, 44, 121]

minx kAx − bk22+ τkxk1 (4)

(τ > 0). Partial results on the accuracy of this relaxation are known [41, 133, 143]. Various methods have been proposed to solve (3) and (4) for fixed ρ and τ , including interior point method [30], coordinate minimization [121, 122], proximal gradient (with or without smoothing) [13, 14, 34], proximal minimization [155]. Homotopy approaches have been proposed to follow the solution as τ varies between 0 and∞ [54, 63, 110]. The efficiency of these methods depend on the structure of A. The polyhedral and separable structure of k · k1 is key.

A problem related to (4) is image denoising using total variation (TV) regularization:

min

x∈B

n 2 2

kAx − bk22, (5)

where n is even, B2 denotes the unit 2-norm (Euclidea) ball in ℜ2, A ∈ ℜm×n is the adjoint of the discrete (via finite difference) gradient mapping, and b ∈ ℜm. Interior-point, proximal

3We are using the term “relaxation” loosely since ♯(·) majorizes k · k1 only on the unit ∞-norm ball.

(3)

minimization, coordinate minimization, and gradient projection methods have been proposed for its solution [60, 108, 147, 152, 160, 161].

A second problem related to (1) and of growing interest is matrix rank minimization. The basic problem is

x|Ax=bmin0rank(x), (6)

where rank(x) is the rank of a matrix x∈ ℜp×n, and A is a linear mapping from ℜp×n to ℜm, and b0∈ ℜm. In the case of matrix completion, we have Ax = (xij)(i,j)∈A, whereA indexes the known entries of x. This problem is also NP-hard, and a convex relaxation has been proposed whereby rank(x) is replaced by the nuclear/trace normkxknuc (the 1-norm of the singular values of x) [24, 25, 50, 73, 82, 115, 156]. The resulting problem

x|Ax=bmin0kxknuc (7)

has a more complex structure than (2), and only recently have solution methods, including interior point method and dual gradient method, been developed [24, 73, 82] and exactness results been obtained [25, 115]. For noisy b, we may consider, analogous to (4), the relaxation

minx kAx − bk2F + τkxknuc, (8)

where τ > 0 andk·kF denotes the Frobenious-norm. This problem has applications to dimension reduction in multivariate linear regression [75] and multi-task learning [1, 3, 106]. Recently, proximal/gradient methods have been applied to its solution [75, 82, 132]. How well does (8) approximate (6)?

A third problem related to (1) and of recent interest is that of sparse inverse covariance esti- mation, where we seek a positive definite x∈ Sn that is sparse and whose inverse approximates a given sample covariance matrix s∈ Sn. This may be formulated as the nonconvex problem

min

λIx¯λI− log det(x) + hs, xi + τ♯(x) (9)

with 0≤ λ < ¯λ ≤ ∞ and τ > 0, where hs, xi = trace[sx], ♯(x) counts the number of nonzeros in x, I denotes the n×n identity matrix, and  denotes the partial ordering with respect to the cone S+n of positive semidefinite matrices [5, 9, 53, 74, 158]. Replacing ♯(x) by kxk1 := Pni,j=1|xij| yields the convex relaxation

min

λIx¯λI− log det(x) + hs, xi + τkxk1. (10) Interior-point method appears unsuited for solving (10) owing to the large size of the Newton equation to be solved. Block-coordinate minimization methods (with each block corresponding to a row/column of x) and Nesterov’s accelerated gradient methods have been applied to its solution [5, 9, 53, 74]. How well does (10) approximate (9)?

A fourth problem of much recent interest is that of ad hoc wireless sensor network localization [4, 17, 18, 20, 21, 29, 36, 37, 47, 70, 72, 127]. In the basic version of this problem, we have n points z01, . . . , zn0 in ℜd (d ≥ 1). We know the last n − m points (“anchors”) and an estimate dij ≥ 0 of the Euclidean distance d0ij =kzi0− zj0k2 between “neighboring” points i and j for all

(4)

(i, j)∈ A, where A ⊆ ({1, . . . , m}×{1, . . . , n}) ∪ ({1, . . . , n}×{1, . . . , m}). We wish to estimate the first m points (“sensors”). This problem may be formulated as

z1min,...,zm

X

(i,j)∈As

|kzi− zjk22− d2ij| + X

(i,j)∈Aa

|kzi− zj0k22− d2ij|, (11)

where As := {(i, j) ∈ A | i < j ≤ m} and Aa := {(i, j) ∈ A | i ≤ m < j} are the sets of, respectively, sensor-to-sensor and sensor-to-anchor neighboring pairs. Typically, d = 2 and two points are neighbors if the distance between them is below some threshold, e.g., the radio range.

In variants of this problem, constraints such as bounds on distances and angles-of-arrival are also present [16, 29, 37, 70]. This problem is NP-hard for any d ≥ 1. It is closely related to distance geometry problems arising in molecular conformation [19, 90], graph rigidity/realization [2, 36, 47, 127], and max-min/avg dispersion [33, 114, 149]. Letting z := ( z1 · · · zm) and I denote the d×d identity matrix, Biswas and Ye [20, 21] proposed the following convex relaxation of (11):

minx

X

(i,j)∈A

ij(x)− d2ij

subject to x =

y zT

z I



 0,

(12)

where y = ( yij)1≤i,j≤m, and ℓij(x) :=

yii− 2yij + yjj if (i, j)∈ As,

yii− 2zTi zj0+kz0jk2 if (i, j)∈ Aa. (13) The relaxation (12) is a semidefinite program (SDP) and is exact if it has a solution of rank d.

On the other hand, (12) is still difficult to solve by existing methods for SDP when m > 500, and decomposition methods have been proposed [21, 29]. Recently, Wang, Zheng, Boyd, and Ye [148] proposed a further relaxation of (12), called edge-based SDP (ESDP) relaxation, which is solved much faster by an interior point method than (12), and yields solution comparable in approximation accuracy as (12). The ESDP relaxation is obtained by relaxing the constraint x 0 in (12) to require only those principal submatrices of x associated with A to be positive semidefinite. Specifically, the ESDP relaxation is

minx

X

(i,j)∈A

ij(x)− d2ij

subject to x =

y zT z I

 ,

yii yij ziT yij yjj zjT zi zj I

 0 ∀(i, j) ∈ As,

yii zTi zi Id



 0 ∀i ≤ m.

(14)

Notice that the objective function and the positive semidefinite constraints in (14) do not depend on yij, (i, j)6∈ A. How well does (12) approximate (11)? Only partial results are known in the noiseless case, i.e., (11) has zero optimal value [127, 141]. How well does (14) approximate (11)?

The above convex problems share a common structure, namely, they entail minimizing the sum of a smooth (i.e., continuously differentiable) convex function and a “simple” nonsmooth

(5)

convex function. More specifically, they have the form

minx∈E F (x) := f (x) + τ P (x), (15)

where E is a finite-dimensional real linear space endowed with a norm k · k, τ > 0, P : E → (−∞, ∞] is lower semicontinuous (lsc), convex, with domP = {x | P (x) < ∞} closed, and f : E → (−∞, ∞] is convex and smooth on domf, assumed open, and f(x) → ∞ whenever x approaches a boundary point of domf [116]. A well-known special case of (15) is smooth constrained convex optimization, for which P is the indicator function for a nonempty closed convex setX ⊆ E, i.e.,

P (x) =

0 if x∈ X ,

∞ else. (16)

The class of problems (15) was studied in [6, 89] and by others; see [144] and references therein.

For example, (4) corresponds to

E = ℜn, k · k = k · k2, f (x) =kAx − bk22, P (x) =kxk1, (17) (5) corresponds to

E = ℜn, k · k = k · k2, f (x) =kAx − bk22, P (x) = (

0 if x∈ B

n

22,

∞ else, (18)

(8) corresponds to

E = ℜp×n, k · k = k · kF, f (x) =kAx − bk2F, P (x) =kxknuc, (19) and (10) corresponds to

E = Sn, k · k = k · kF, f (x) =− log det(x) + hs, xi, P (x) =kxk1. (20) In the case of 0 < λ < ¯λ < ∞, Lu [74] proposed a reformulation of (10) via Fenchel duality [116, 117], corresponding to

f (x) = sup

λIy¯λI

hx, yi − log det(y), P (x) =

0 ifkx − sk≤ τ,

∞ else, (21)

where kxk = maxi,j|xij|. (Note that P in (21) depends on τ.) Importantly, for (17), (18), (20), (21), P is block-separable, i.e.,

P (x) = X

J∈J

PJ(xJ), (22)

whereJ is some partition of the coordinate indices of x. Then (15) is amenable to solution by (block) coordinatewise methods. Another special case of interest is variable selection in logistic regression [88, 126, 157, 159], corresponding to

E = ℜn, k · k = k · k2, f (x) =

m

X

i=1

log1 + eAix− biAix, P (x) =kxk1, (23)

(6)

with Ai the ith row of A∈ ℜm×n and bi ∈ {0, 1}; also see [55, 123, 124, 125] for variants. A closely related problem is group variable selection (“group lasso”), which uses instead

P (x) = X

J∈J

ωJkxJk2, (24)

with ωJ ≥ 0, in (17) and (23) [88, 157]. The TV image reconstruction model in [147, Eq.

(1.3)] and the TV-L1 image deblurring model in [152, Eq. (1.5)] are special cases of (15) with f quadratic and P of the form (24).

How accurate approximations are the convex relaxations (7), (8), (10), (12), (14), and other related convex relaxations such as that for max-min dispersion, allowing for noisy data? What are the iteration complexities and asymptotic convergence rates of first-order gradient methods for solving these and related problems? First-order methods are attractive since they can exploit sparse or partial separable structure of f and block-separable structure of P .

Throughout, ℜn denotes the space of n-dimensional real column vectors, Sn={x ∈ ℜn×n | x = xT}, andT denotes transpose. For any x∈ ℜnand nonempty J ⊆ {1, ..., n}, xj denotes jth coordinate of x, xJ denotes subvector of x comprising xj, j ∈ J, and kxkρ = Pnj=1|xj|ρ1/ρ for 0 < ρ <∞, and kxk= maxj|xj|. For any x ∈ ℜp×n, xij denotes the (i, j)th entry of x.

2 Approximation Accuracy of the Convex Relaxations

Let aj denote column j of A in (1), normalized so thatkajk2 = 1, for all j. It is easily seen that the relaxation (2) is exact (i.e., its solution also solves (1)) if a1, . . . , anare pairwise orthogonal.

This hints that (2) may remain exact if these columns are approximately orthogonal. Mallat and Zhang [83] introduced the following measure of approximate orthogonality, called “mutual coherence”, in their study of matching pursuit:

µ := max

1≤i6=j≤n

aTi aj . (25)

There exist overcomplete sets with n ≈ m2 and µ ≈ 1/√

m; see [129, pages 265-266] and references therein. This µ is central to the analysis in [40, 41, 42, 56, 57, 59, 62, 83, 133, 134, 143].

In particular, (2) is exact whenever N0 < 12−1+ 1) = O(n14), where N0 denotes the optimal value of (1) [40, Theorem 7], [62, Theorem 1], [56], [133, Theorems A and B]. When b is noisy, it can be shown that the solution xρ of the noise-aware model (3) is close to the solution x0 of the original noiseless problem (1):

kxρ− x0k22 ≤ (δ + ρ)2

1− µ(4N0− 1) (26)

whenever ρ≥ δ := kb − b0k2 and N0 < (12− O(µ))µ−1+ 1; see [41, Theorem 3.1], [143, Theorem 1]. The bound (26) also extends to ρ < δ with some limitations [143, Theorem 1]. In addition to convex relaxation, greedy methods can also be shown to recover or approximate x0 and identify the support of x0 under similar conditions on N0, δ, and ρ; see [41, Theorem 5.1(a)], [133, Theorems A and B], [143, Theorems 3 and 4].

(7)

A different measure of approximate orthogonality, introduced by Cand`es and Tao [28], is the

“restricted isometry” constant µN, defined as the smallest scalar satisfying

(1− µN)kxk22 ≤ kAxk22 ≤ (1 + µN)kxk22 ∀x with ♯(x) ≤ N. (27) In fact, µ2 = µ.4 It is known thatkxρ− x0k2= O(ρ) whenever ρ≥ δ and µ3N0+ 3µ4N0 < 2 [27, Theorem 1]; also see [27, Theorem 2] for a relaxation of the latter condition. For A randomly generated from certain classes of distributions (e.g., Gaussian), this condition holds with high probability (for N0 in the order of n to within log factors); see [26, 28, 39, 45, 111, 135] and references therein for the noiseless case and [27, 38] for the noisy case. The approximation bounds in [27, 38] for (3) require ρ ≥ δ, as well as n = O(m) in [38]. Is extension to ρ < δ possible, as in [143, Theorem 1]?

Can the aforementioned exactness results and error bounds be extended to matrix rank minimization (6) and its convex relaxations (7) and (8)? Recent progress has been made in the noiseless case [25, 115]. The nuclear norm has a more complex structure than the 1-norm.

For the sensor network localization problem (11), its SDP relaxation (12) is exact if the distances dij, (i, j) ∈ A, are exact (i.e., dij = d0ij for all (i, j) ∈ A) and any relative-interior solution (i.e., a point in the relative interior of the solution set) of (12) has rank d [127, Theorem 2]. However, this assumption is quite strong. What can we say in general? Remarkably, a kind of partial exactness still holds. Biswas and Ye [20, Section 4] introduced the notion of individual traces for a feasible solution x of (12), defined as

tri(x) := yii− kzik2, i = 1, . . . , m,

or, equivalently, the diagonals of the Schur complement y− zTz. It can be shown that, for any relative-interior solution x of (12), tri(x) = 0 implies zi is invariant over the solution set of (12) and hence equals zi0, the true position of sensor i, when distances are exact [141, Proposition 4.1]. An analogous result holds for the ESDP relaxation (14) [148, Theorem 2]. Thus, upon finding a relative-interior solution x of (12) or (14), we know that every sensor i with tri(x) = 0 (within numerical accuracy) has zi as its true position. Is this result sharp? Yes, at least when the distances are exact. In this case, the converse holds for the ESDP relaxation; see [113, Theorem 1]. It is not known if the converse holds for the SDP relaxation. On the other hand, an example in [113, Example 2] shows that, in the noisy case where the distances are inexact, tri(x) = 0 is not a reliable indicator of sensor i position accuracy even when x is the unique solution of the SDP/ESDP relaxation. The reason is that the solution set of the SDP/ESDP relaxation can change abruptly under arbitrarily small data perturbation. This contrasts with a second-order cone (SOCP) relaxation of the same problem, whose solution set changes gradually with data perturbation [141, Section 7]. To overcome this difficulty, a noise-aware version of the ESDP relaxation, analogous to (3) for compressed sensing, was proposed in [113, Section 5].

Specifically, for any ρ = (ρij)(i,j)∈A ≥ 0, let

Sresdpρ :=nx| x is feasible for (14) and |ℓij(x)− d2ij| ≤ ρij ∀(i, j) ∈ Ao. (28)

4Why? Since kajk2 = 1 for all j, (27) with N = 2 reduces to 1 − µ2 ≤ 1 + 2aTiajxixj≤ 1 + µ2 for all i 6= j and all xi, xj ∈ ℜ with x2i+ x2j = 1. Since (xi, xj) 7→ 2xixjhas minimum value of −1 and maximum value of 1 on the unit sphere, the smallest µ2 for which this holds is precisely µ given by (25).

(8)

Then Sresdpρ contains the noiseless ESDP solutions whenever ρ ≥ δ := (|d2ij − (d0ij)2|)(i,j)∈A. Moreover, defining its “analytic center” as

xρ := arg min

x∈Sρ

resdp

X

(i,j)∈As

log det

yii yij ziT yij yjj zjT zi zj I

m

X

i=1

log tri(x), (29)

it can be shown that, for ρ > δ sufficiently small, tri(xρ) = 0 is a reliable indicator of sensor i position accuracy; see [113, Theorem 4]. Moreover, for any ρ > δ, we have the following computable bound on the individual sensor position accuracy:

kziρ− zi0k ≤q2|As| + m · tri(xρ)12 ∀i.

Can these results be extended to the SDP relaxation (12) or SOS relaxations [66, 104] or to handle additional constraints such as bounds on distances and angles-of-arrival?

Closely related to sensor network localization is the continuous max-min dispersion problem, whereby, given existing points zm+10 , . . . , zn0 inℜd(d≥ 1), we wish to locate new points z1, . . . , zm inside, say, a box [0, 1]d that are furthest from each other and existing points [33, 149]:

max

z1,...,zm∈[0,1]dmin



i<j≤mmin ωijkzi− zjk2, min

i≤m<jωijkzi− z0jk2



, (30)

with ωij > 0. Replacing “max” by “sum” yields the max-avg dispersion problem [114, Section 4]. It can be shown (by reduction from 0/1-integer program feasibility) that (30) is NP-hard if d is a part of the input, even when m = 1. How accurate approximations are their convex (e.g., SDP, SOCP) relaxations? Another related problem arises in protein structure prediction, whereby the distances between neighboring atoms and their bond angles are known, and we wish to find positions of the atoms that minimize a certain energy function [119]. Although the energy function is complicated and highly nonlinear, one can focus on the most nonlinear terms, such as the Lennard-Jones interactions, in seeking approximate solutions.

3 Gradient Methods for Solving the Convex Relaxations

How to solve (15)? We will assume that ∇f is Lipschitz continuous on a closed convex set X ⊇ domP , i.e.,

k∇f(x) − ∇f(y)k≤ Lkx − yk ∀x, y ∈ X , (31) for some L > 0, where E is the vector space of continuous linear functionals on E, endowed with the dual norm kxk = supkxk≤1hx, xi and hx, xi is the value of x ∈ E at x∈ E. This assumption, which is satisfied by (17), (18), (19), (21), (23), can be relaxed to hold for (20) as well. Owing to its size and structure, (15) is suited for solution by first-order gradient methods, whereby at each iteration f is approximated by a linear function plus a “simple” proximal term.

We describe such methods below. To simplify notation, we denote the linearization of f in F at y∈ X by

F(x; y) := f (y) +h∇f(y), x − yi + τP (x) ∀x. (32)

(9)

3.1 Proximal Gradient Methods

Choose a strictly convex function η : E → (−∞, ∞] that is differentiable on an open set con- tainingX ,5 Then the function

D(x, y) := η(x)− η(y) − h∇η(y), x − yi ∀y ∈ X , ∀x ∈ E,

is nonnegative and zero if and only if x = y, so D acts as a proximity measure. This function was studied by Bregman [23] and many others; see [8, 10, 32, 46, 67, 130] and references therein.

By scaling η if necessary, we assume that D(x, y)≥ 1

2kx − yk2 ∀ x, y ∈ X . (33)

The classical gradient projection method of Goldstein and Levitin, Polyak (see [15, 112]) naturally generalizes to solve (15) using the Bregman function D, with constant stepsize 1/L:

xk+1 = arg min

x



F(x; xk) + L

2D(x, xk)



, k = 0, 1, . . . , (34) where x0 ∈ domP . This method, which we call the proximal gradient (PG) method, is closely related to the mirror descent method of Nemirovski and Yudin [93], as is discussed in [8, 11];

also see [89] for the case of η(·) = 12k · k22. When P is the 1-norm or has the block-separable form (22), the new point xk+1 can be found in closed form, which is a key advantage of this method for large-scale optimization; see [144, 151] and references therein. When P is given by (16) and X is the unit simplex, xk+1can be found in closed form in O(n) floating point operations (flops) by taking η(x) to be the x log x-entropy function [11, Section 5], [98, Lemma 4]. Moreover, the corresponding D satisfies (33) with k · k being the 1-norm [11, Proposition 5.1], [98, Lemma 3].

If η(·) = 12k · k22 is used instead, then xk+1 can still be found in O(n) flops, but this requires using a more complicated algorithm; see [69] and references therein. It can be shown that

F (xk)− inf F ≤ O

L k



∀k,

and hence O(Lǫ) iterations suffice to come within ǫ > 0 of inf F ; see, e.g., [13, Theorem 3.1], [97, Theorem 2.1.14], [112, page 166], [146, Theorem 5.1].

In a series of work [94, 95, 98] (also see [112, page 171]), Nesterov proposed three methods for solving the smooth constrained case (16) that, at each iteration, use either one or two pro- jection steps together with extrapolation to accelerate convergence. These accelerated gradient projection methods generate points{xk} that achieve

F (xk)− inf F ≤ O

L k2



∀k,

so that O(qLǫ) iterations suffice to come within ǫ > 0 of inf F . In [98], it is shown that various large convex-concave optimization problems can be efficiently solved by applying these methods to a smooth approximation with Lipschitz constant L = O(1/ǫ). These methods have inspired

5This assumption can be relaxed to η being differentiable on the interior of X only.

(10)

various extensions and variants [8, Section 5], [13, 61, 71], [97, Section 2.2], [98, 101, 142], as well as applications to compressed sensing, sparse covariance selection, matrix completion, etc.

[14, 5, 64, 74, 75, 82, 96, 132]. In particular, all three methods can be extended to solve (15) in a unified way and achieve O(qLǫ) iteration complexity; see [142] and discussion below. The work per iteration is between O(n) and O(n3) flops for the applications of Section 1. In contrast, the number of iterations for interior point methods is at best O(√

n log1ǫ) and the work per iteration is typically between O(n3) and O(n4) ops. Thus, for moderate ǫ (say, ǫ = .001), moderate L (which may depend on n), and large n (n ≥ 10000), a proximal gradient method can outperform interior point methods.

The first accelerated proximal gradient (APG) method for solving (15) is the simplest, but requires E to be a Hilbert space (i.e., E = E, k · k = ph·, ·i) and η(·) = 12k · k2, so that D(x, y) = 12kx−yk2. For any x0 = x−1 ∈ domP and θ0 = θ−1= 1, it generates (for k = 0, 1, . . .) yk = xk+ θkk−1−1 − 1)(xk− xk−1), (35) xk+1 = arg min

x



F(x; yk) +L

2kx − ykk2



, (36)

θk+1 =

qθk4+ 4θ2k− θk2

2 . (37)

An inductive argument shows that θkk+22 for all k. As k→ ∞, we have θk−1θk =√

1− θk→ 1, so that, by (35), yk is asymptotically an isometric extrapolation from xk−1 to xk. In particular, yk may lie outside of domP . However, since xk, xk−1 ∈ domP , it is readily seen that yk ∈ {2x − w | x, w ∈ domP } (since x + α(x − x−1) = 2x− w with w = (1 − α)x + αx−1). This method was recently proposed by Beck and Teboulle [13] as an extension of Nesterov’s first method [94]; also see [61] for refinements in the unconstrained case of P ≡ 0.

The second APG method imposes no requirement on E or D, and maintains yk ∈ domP , so it is less restrictive than the first method. For any x0, z0 ∈ domP and θ0 = 1, it generates (for k = 0, 1, . . .)

yk = (1− θk)xk+ θkzk, (38)

zk+1 = arg min

x

nF(x; yk) + θkLD(x, zk)o, (39)

xk+1 = (1− θk)xk+ θkzk+1, (40)

with θk+1 given by (37). Since 0 < θk≤ 1, we have from xk, zk ∈ domP that yk∈ domP . In the smooth constrained case (16), this method corresponds to Auslender and Teboulle’s extension [8, Section 5] of Nesterov’s second method [95]; also see [97, page 90]. A variant proposed by Lan, Lu, and Monteiro [71, Section 3] replaces (40) by a PG step from yk.

The third APG method differs from the second method mainly in the computation of zk+1. For any x0∈ domP and z0 = arg min

x∈domP

η(x), θ0 = 1, it generates (for k = 0, 1, . . .) yk by (38),

zk+1 = arg min

x

( k X

i=0

F(x; yi) θi

+ Lη(x) )

, (41)

(11)

and xk+1, θk+1 by (40), (37). Thus, (41) replaces ℓF(x; yk)/θkin (39) by its cumulative sum and replaces D(·, zk) by η(·). In the case of (16), this method is similar to Nesterov’s third method [98] but with only one projection instead of two. In the case of E being a Hilbert space and η(·) = 12k · k2, this method bears some resemblance to an accelerated dual gradient method in [101, Section 4].

The preceding accelerated methods may look unintuitive, but they arise “naturally” from refining the analysis of the PG method, as is discussed in Appendix A. Moreover, these methods are equivalent (i.e., generate the same sequences) when P ≡ 0, E is a Hilbert space, and the Bregman function D is quadratic (i.e., η(·) = 12k · k2). Some extensions of these methods, including cutting planes, estimating L, are discussed in [142]. In particular, L can be estimated using backtracking: increase L and repeat the iteration whenever a suitable sufficient descent condition (e.g., (53) or (57)) is violated; see [13, 94, 142]. Below we summarize the iteration complexity of the PG and APG methods.

Theorem 1 (a) Let {xk} be generated by the PG method (34). For any x ∈ domP , we have F (xk)≤ F (x) +1

kLD(x, x0) ∀k ≥ 1.

(b) AssumeE is a Hilbert space, η(·) = 12k · k2, andX ⊇ {2x − w | x, w ∈ domP }. Let {xk} be generated by the first APG method (35)–(37). For any x∈ domP , we have

F (xk)≤ F (x) + θk−12 LD(x, x0) ∀k ≥ 1.

(c) Let {xk} be generated by the second APG method (37)–(40). For any x ∈ domP , we have F (xk)≤ F (x) + θ2k−1LD(x, z0) ∀k ≥ 1.

(d) Let {xk} be generated by the third APG method (37), (38), (40), (41). For any x ∈ domP , we have

F (xk)≤ F (x) + θ2k−1L(η(x)− η(z0)) ∀k ≥ 1.

A proof of Theorem 1(a)-(c) is given in Appendix A. A proof of part (d) can be found in [142, Corollary 3(a)]. Taking any x satisfying F (x)≤ inf F +ǫ2 in Theorem 1 yields F (xk)≤ inf F + ǫ after k = O(Lǫ) iterations for the PG method and after k = O(qLǫ) iterations for the APG methods.

How can we terminate the PG and APG methods in practice with a guaranteed optimality gap? The bounds in Theorem 1 requires estimating the distance to an 2ǫ-minimizer of F and are rather conservative. In the case where f has the form

f (x) = max

v∈V φ(x, v),

for some saddle function φ and convex set V in a suitable space, duality gap can be used to terminate the methods [92, 98, 99]. The dual problem is maxvQ(v), with dual function

Q(v) := min

x {φ(x, v) + τP (x)}.

(12)

Then we compute (say, every 5 or 10 iterations) a candidate dual solution vk = arg max

v φ(xk, v),

and check that F (xk)− Q(vk)≤ ǫ. In fact, assuming furthermore that domP is bounded, it can be shown using an idea of Nesterov [98, Theorem 3] that

0 ≤ F (xk+1)− Q(¯vk) ≤ θ2kL max

x∈domP(η(x)− η(z0)) ∀k ≥ 0, where xk+1, yk, θk are generated by the third APG method, and we let

vk = arg max

v φ(yk, v), ¯vk = (1− θk)¯vk−1+ θkvk.

with ¯v−1 = 0 [142, Corollary 3(c)]; also see [74, Theorem 2.2] and [99] for related results in the constrained case (16). Analogous bounds hold for the first two APG methods; see [142, Corollaries 1(b) and 2].

When applied to (4), the first APG method yields an accelerated version of the iterative thresholding method of Daubechie et al. [13, 34]. What about the other two methods? How efficiently can these methods be applied to solve (5), (7), (8), (10), (23), and related problems such as those in [48, 123, 124, 125, 153]? When applied to (7) and (8), singular value decom- position is needed at each iteration, and efficiency depends on the cost for this decomposition.

However, only the largest singular values and their associated singular vectors are needed [24].

Can these be efficiently computed or updated? Some progress on this have recently been made [82, 132].

Can the iteration complexity be further improved? The proofs suggest that the convergence rate can be improved to O(k1p) (p > 2) if we can replace k · k2 in the proofs by k · kp; see Appendix A. However, this may require using a higher-order approximation of f in ℓF(·; y), so the improvement would not come “for free”.

3.2 Block-Coordinate Gradient Methods

WhenE is a Hilbert space and P is block-separable (22), we can apply (34) block-coordinatewise, possibly with L and D dynamically adjusted, resulting in a block-coordinate gradient (BCG) method. More precisely, given xk ∈ domP , we choose Jk as the union of some subcollection of indices in J and choose a self-adjoint positive definite linear mapping Hk : E → E (ideally Hk≈ ∇2f (xk)), compute

dk = arg min

d



F(xk+ d; xk) +1

2hHkd, di | dj = 0 ∀j 6∈ Jk



, (42)

and update

xk+1= xk+ αkdk, (43)

with stepsize αk > 0 [144, 146]. This method may be viewed as a coordinate/SOR version of a sequential quadratic programming (SQP) method, and it is related to the variable/gradient

(13)

distribution methods for unconstrained smooth optimization [51, 58, 86, 120] and (block) coor- dinate minimization methods [15, 84, 87, 105, 121, 136, 140]. In the case of Hkd = Ld and Jk comprising all coordinate indices of x, (42)–(43) with αk = 1 reduces to the PG method (34) with η(·) = 12k · k2.

How to choose αkand Jk? Various stepsize rules for smooth optimization [15, 52, 105] can be adapted to this nonsmooth setting. One that works well in theory and practice is an Armijo-type rule adapted from SQP methods:

αk = maxnα∈ {1, β, β2, . . .} | F (xk+ αdk)≤ F (xk) + ασ∆ko, (44) where 0 < β, σ < 1 and ∆k is the difference between the optimal value of (42) and F (xk). This rule requires only function evaluations, and ∆kpredicts the descent from xkalong dk. For global convergence, the index subset Jk is chosen either in a Gauss-Seidel manner, i.e., Jk∪ · · · ∪ Jk+K

covers all subsets of J for some constant K ≥ 0 [31, 88, 107, 140, 144] or Jk is chosen in a Gauss-Southwell manner to satisfy

k≤ ω∆allk ,

where 0 < ω < 1 and ∆allk denotes the analog of ∆k when Jkis replaced by the entire coordinate index set [144, 145, 146]. Moreover, assuming (31) withX = domP , the BCG method using the Gauss-Southwell choice of Jk finds an ǫ-minimizer of F in O(Lǫ) iterations [146, Theorem 5.1].

The above BCG method, which has been successful for compressed sensing and variable selection in regression [88, 126, 159] and can be extended to handle linear constraints as in support vector machine (SVM) training [145], may also be suited for solving (5), (10), (14), (21), and related problems. When applied to (10) with Jk indexing a row/column of x, dk and αk are computable in O(n2) flops using Schur complement and properties of determinant. This method may be similarly applied to Lu’s reformulation (21). This contrasts with the block- coordinate minimization method in [9] which uses O(n3) flops per iteration. This method can also be applied to solve (29) by using a smooth convex penalty 1 max{0, | · | − ρij}2 (θ > 0) for each constraint in (28) of the form | · | ≤ ρij. By choosing each coordinate block to comprise zi, yii, and (yij)j|(i,j)∈A, the computation distributes, with sensor i needing to communicate only with its neighbors when updating its position – an important consideration for practical implementation. The resulting method can accurately position up to 9000 sensors in little over a minute; see [113, Section 7.3]. Can Nesterov’s extrapolation techniques of Section 3.1 be adapted to the BCG method?

3.3 Incremental Gradient Methods

A problem that frequently arises in machine learning and neural network training has the form (15) with

f (x) =

m

X

i=1

fi(x), (45)

where each fi :E → (−∞, ∞] is smooth, convex on an open subset of E containing domP . (The convexity assumption can be relaxed depending on the context.) For example, fi(x) may be the output error for an input-out system (e.g., a classifier), parameterized by x, on the ith training

(14)

example. P given by (16) would confer constraints on x and P (·) = k · k1 would induce a sparse representation. m may be large. A popular approach to minimizing f of the form (45) is by an incremental gradient method (“on-line back-propagation”):

xk+1 = xk− αk∇fik(xk), (46)

with ik chosen cyclically from {1, . . . , m} and stepsize αk > 0 either constant or adjusted dy- namically; see [15, 68, 77, 85, 91, 128, 137] and references therein. When some the ∇fi’s are

“similar”, this incremental method is more efficient than (pure) gradient method since it does not wait for all component gradients∇f1, ...,∇fm to be evaluated before updating x. However, global convergence (i.e.,∇f(xk)→ 0) generally requires αk→ 0, which slows convergence. Re- cently, Blatt, Hero and Gauchman [22] proposed an aggregate version of (46) that approximates

∇f(xk) by

gk=

k

X

ℓ=k−m+1

∇fi(x).

This method achieves global convergence for any sufficiently small constant stepsize αk, but requires O(mn) storage. We can reduce the storage to O(n) by updating a cumulative average of the component gradients:

gksum = gsumk−1+∇fik(xk), gk = m kgksum,

with g−1sum = 0. We then use gk in the PG, APG, or BCG method. The resulting incremental methods share features with recently proposed averaging gradient methods [65, 102]. What are their convergence properties, iteration complexities, and practical performances?

4 Error Bound and Linear Convergence of Gradient Methods

Analogous to superlinear convergence for second-order methods, linear convergence is a good in- dicator of efficiency for first-order methods. Key to a linear convergence analysis is the following Lipschizian error bound on dist(x, ¯X ) := min

s∈ ¯X kx − sk in terms of the norm of the residual R(x) := arg min

d



F(x + d; x) + 1 2kdk2



, (47)

where ¯X denotes the set of minimizers of F , which we assume to be nonempty.

EB condition: For any ζ≥ min F , there exist scalars κ > 0 and ǫ > 0 such that

dist(x, ¯X ) ≤ κkR(x)k whenever F (x)≤ ζ, kR(x)k ≤ ǫ. (48) Under the EB condition, asymptotic linear and even superlinear convergence can be estab- lished for various methods, including interior point, gradient projection, proximal minimiza- tion, coordinate minimization, and coordinate gradient methods – even if ¯X is unbounded; see [12, 49, 80, 81, 139, 144, 145, 146] and references therein. Moreover, the EB condition holds under any of the following conditions; see [144, Theorem 4] as well as [109, Theorem 3.1], [79, Theorem 2.1] for the constrained case (16).

(15)

C1. E = ℜn, f (x) = h(Ax) +hc, xi for all x ∈ ℜn, where A ∈ ℜm×n, c ∈ ℜn, and h is a strongly convex differentiable function on ℜm with ∇h Lipschitz continuous on ℜm. P is polyhedral.

C2. E = ℜn, f (x) = maxy∈Y{hy, Axi − h(y)} + hc, xi for all x ∈ ℜn, where Y is a polyhedral set in ℜm, A ∈ ℜm×n, c ∈ ℜn, and h is a strongly convex differentiable function on ℜm with∇h Lipschitz continuous on ℜm. P is polyhedral.

C3. f is strongly convex and satisfies (31) for some L > 0.

What if f is not strongly convex and P is non-polyhedral? In particular, we are interested in the group lasso for linear and logistic regression (see (17), (23), (24)), for which f is not strongly convex (unless A has full column rank) and P is non-polyhedral (unless J is a singleton for all J ∈ J ). The following new result shows that the error bound (48) holds for the group lasso.

The proof, given in Appendix B, exploits the structure of the weighted sum of 2-norms (24). To our knowledge, this is the first Lipschitzian error bound result for f not strongly convex and P non-polyhedral.

Theorem 2 Suppose that E = ℜn, P has the form (24) with ωJ > 0 for all J ∈ J , and f has the form

f (x) = h(Ax), (49)

where A ∈ ℜm×n, and h : ℜm → (−∞, ∞] is differentiable on domh, which is assumed to be convex and open. Also suppose that (a) h is strongly convex and ∇h is Lipschitz continuous on any compact convex subset of domh, and (b) h(y)→ ∞ whenever y approaches a boundary point of domh. If ¯X 6= ∅, then

{x | F (x) ≤ ζ} is bounded ∀ζ ∈ ℜ, (50)

and the EB condition (48) holds.

The assumptions on h in Theorem 2 are satisfied by h(y) = 1

2ky − bk2, corresponding to linear regression, and

h(y) =

m

X

i=1

log(1 + eyi)− hb, yi with b ∈ {0, 1}m,

corresponding to logistic regression (23). The assumptions are also satisfied by h(y) =−

m

X

i=1

log(yi) +hb, yi with b ≥ 0,

which arises in likelihood estimation under Poisson noise [123]. In the first two examples, h is bounded from below by zero. In the third example, h is unbounded from below but tends to

−∞ sublinearly. Since P given by (24) is homogeneous of degree 1, it is readily seen that ¯X 6= ∅

(16)

for all three examples. (An example of ¯X = ∅ is minxex+ 2x +|x|.) However, ¯X need not be a singleton. An example is

minx

1

2|x1+ x2− 2|2+|x1| + |x2|,

for which ¯X = {(1 − t, t) | t ∈ [0, 1]}. Can Theorem 2 be extended to f satisfying C2 and P given by (24) or to (18) or (19)? Can the constant κ in (48), which determines the convergence ratio, be estimated for compressed sensing (17) in terms of the restricted isometry constant µN? Corollary 1 Under the assumptions of Theorem 2, let {(xk, Hk, Jk, αk)} be generated by the BCG method (42)–(43) with (i) λI  Hk  ¯λI for all k (0 < λ ≤ ¯λ), (ii) {Jk} cycling through J ∈ J , and (iii) {αk} chosen by the Armijo rule (44). If ¯X 6= ∅, then {F (xk)}k∈T converges at least Q-linearly and {xk}k∈T converges at least R-linearly, where T = {0, K, 2K, . . .} and K is the cardinality of J .

Proof. Theorem 2 shows that [144, Assumption 2(a)] is satisfied. By [144, Theorem 1(a)], {F (xk)} ↓, so, by (50), {xk} is bounded. Since F is convex, [144, Assumption 2(b)] is auto- matically satisfied. Also, {xk} is lies in a compact convex subset of domf, over which ∇f is Lipschitz continuous (since ∇h is Lipschitz continuous over the the image of this subset un- der A). Conditions (i)–(iii) imply that the remaining assumptions in [144, Theorem 2(b)] are satisfied. In particular, the restricted Gauss-Seidel rule in [144] holds with the given T . Since {xk} is bounded, [144, Theorem 2(b)] implies that {F (xk)}k∈T converges at least Q-linearly and {xk}k∈T converges at least R-linearly [107].

By Corollary 1, the method in [157], [88, Section 2.2.1] for linear group lasso (corresponding to Hk = ATA) and the method in [88, Section 2.2.2] for logistic group lasso (corresponding to Hk = νkI with νk > 0) attain linear convergence. Linear convergence of the block-coordinate minimization methods used in [147] for TV-regularized image reconstruction, with f (w, u) =

1

2kw−Buk22+γ2kAu−bk22, P (w, u) =PJkwJk2, and in [152] for TV-regularized image deblurring, with f (w, u, z) = 12kw − Buk22 + γ2kAu − b − zk22, P (w, u, z) = PJωJkwJk2 +kzk1, can be analyzed similarly using a generalization of Theorem 2 to allow ωJ = 0 for some J. Can the linear convergence analysis be extended to the APG methods or their variants?

5 Appendix A: Analyzing the PG and APG Methods

Since f is convex and (31) holds withX ⊇ domP , we have from (32) that F (x) ≥ ℓF(x; y) ≥ F (x) − L

2kx − yk2 ∀x ∈ domP, y ∈ X . (51) The following “3-point” property is also key; see [32, Lemma 3.2], [71, Lemma 6], [142, Section 2].

3-Point Property: For any proper lsc convex function ψ :E → (−∞, ∞] and any z ∈ X , if η is differentiable at z+= arg minx{ψ(x) + D(x, z)}, then

ψ(x) + D(x, z)≥ ψ(z+) + D(z+, z) + D(x, z+) ∀x ∈ domP.

(17)

The APG methods can be motivated by analyzing the PG method (34). Let {xk} be gener- ated by the PG method. For any x∈ domP and any k ∈ {0, 1, . . .}, we have

F (xk+1) ≤ ℓF(xk+1; xk) +L

2kxk+1− xkk2

≤ ℓF(xk+1; xk) + LD(xk+1, xk)

≤ ℓF(x; xk) + LD(x, xk)− LD(x, xk+1)

≤ F (x) + LD(x, xk)− LD(x, xk+1) ∀x ∈ domP, (52) where the first and fourth inequalities use (51), the second inequality uses (33), and the third inequality uses (34) and the 3-Point Property with ψ(x) = ℓF(x; xk)/L. Letting ek = F (xk)− F (x) and ∆k= LD(x, xk), this simplifies to the recursion

ek+1 ≤ ∆k− ∆k+1 ∀k ≥ 0.

It follows from ek+1≤ ek that (k + 1)ek+1 ≤ ∆0− ∆k+1 ≤ ∆0, which proves Theorem 1(a).

The above proof suggests that, for faster convergence, we should find a similar recursion as (52) but with L scaled by something tending to zero with k. To do this, we need to modify (34).

Suppose for simplicity E is a Hilbert space and η(·) = 12k · k2 (so that D(x, y) = 12kx − yk2).

One such modification is to replace xk in (34) by some yk∈ X to be determined, yielding (36).

Then, as in the above proof for the PG method, we have F (xk+1) ≤ ℓF(xk+1; yk) +L

2kxk+1− ykk2 (53)

≤ ℓF(y; xk) + L

2ky − ykk2− L

2ky − xk+1k2

≤ F (y) + L

2ky − ykk2−L

2ky − xk+1k2 ∀y ∈ domP, (54) where the first and third inequalities use (51), and the second inequality uses (36) and the 3- Point Property. To get L to be scaled by something tending to zero, set y = (1− θk)xk+ θkx in the above inequality, with x∈ domP arbitrary and 0 < θk ≤ 1 to be determined. We can then factor θk out of x to scale L, yielding

F (xk+1)

≤ F ((1 − θk)xk+ θkx) + L

2k(1 − θk)xk+ θkx− ykk2−L

2k(1 − θk)xk+ θkx− xk+1k2

= F ((1− θk)xk+ θkx) + L

2kkx + (θ−1k − 1)xk− θk−1ykk2−L

2kkx + (θk−1− 1)xk− θk−1xk+1k2, where we have rearranged the terms to look like the recursion (52). We want the two terms insidek k2 to have the form “x− zk” and “x− zk+1”, which we get by setting

zk=−(θk−1− 1)xk+ θ−1k yk and yk by (35). Using also the convexity of F , we then obtain that

F (xk+1) ≤ (1 − θk)F (xk) + θkF (x) + θ2kL

2kx − zkk2− θ2k

L

2kx − zk+1k2 ∀k. (55)

(18)

Letting ek= F (xk)− F (x) and ∆k= L2kx − zkk2, this simplifies to ek+1 ≤ (1 − θk)ek+ θk2k− θ2kk+1. Upon dividing both sides by θ2k, we see that, by choosing θk+1so that θ12

k

= 1−θθ2k+1 k+1

(which, upon solving for θk+1, yields (37)), this rewrites as the recursion

1− θk+1

θk+12 ek+1+ ∆k+1 ≤ 1− θk

θ2k ek+ ∆k, which propagates backwards to yield

1− θk+1

θ2k+1 ek+1+ ∆k+1 ≤ 1− θ0

θ20 e0+ ∆0 ∀k.

Since 1−θθ2k+1 k+1

= θ12 k

, setting θ0 = 1 simplifies this to 1

θk2ek+1 ≤ ∆0− ∆k+1 ≤ ∆0. Also, we have from (35) and taking θ−1 = 1 that z0 = y0 = x0. This proves Theorem 1(b).

The preceding proof depends crucially on rearranging terms inside k k2, so it cannot be directly extended to other Bregman functions D. However, (55) suggests we seek a recursion of the form

F (xk+1) ≤ (1 − θk)F (xk) + θkF (x) + θ2kLD(x, zk)− θ2kLD(x, zk+1) ∀k, (56) which, as our derivation of (52) and (55) suggests, can be achieved by setting zk+1 by (39) and using the 3-Point Property, setting xk+1 by (40) (analogous to our setting of y in (54)), and then choosing yk so that xk+1− yk = θk(zk+1− zk) (which works out to be (38)). Specifically, for any k∈ {0, 1, . . .}, we have

F (xk+1) ≤ ℓF(xk+1; yk) + L

2kxk+1− ykk2

= ℓF((1− θk)xk+ θkzk+1; yk) +Lθk2

2 kzk+1− zkk2

≤ (1 − θk)ℓF(xk; yk) + θkF(zk+1; yk) + θ2kLD(zk+1, zk) (57)

≤ (1 − θk)ℓF(xk; yk) + θkF(x; yk) + θkLD(x, zk)− θkLD(x, zk+1) ∀x ∈ domh, where the first inequality uses (51), the second inequality uses the convexity of ℓF(·; yk) and (33), the last inequality uses the 3-Point Property with ψ(x) = ℓF(x; yk)/(θkL). Using (51) yields (56), from which Theorem 1(c) follows.

The third APG method (see (41)) can be derived and analyzed similarly using an analog of the 3-Point Property for η; see [142, Property 2]. For brevity we omit the details.

6 Appendix B: A Proof of Theorem 2

Througout we assume that the assumptions of Theorem 2 are satisfied. By scaling f by 1/τ , we will assume without loss of generality that τ = 1.

參考文獻

相關文件

When the Hessian of each constraint function is of rank 1 (namely, outer products of some given so-called steer- ing vectors) and the phase spread of the entries of these

Accelerated prox gradient method is promising in theory and practice.. Applicable to convex-concave optimization by using

In this paper, we evaluate whether adaptive penalty selection procedure proposed in Shen and Ye (2002) leads to a consistent model selector or just reduce the overfitting of

Accordingly, we reformulate the image deblur- ring problem as a smoothing convex optimization problem, and then apply semi-proximal alternating direction method of multipliers

Wang, Solving pseudomonotone variational inequalities and pseudo- convex optimization problems using the projection neural network, IEEE Transactions on Neural Network,

Numerical experiments are done for a class of quasi-convex optimization problems where the function f (x) is a composition of a quadratic convex function from IR n to IR and

The purpose of this talk is to analyze new hybrid proximal point algorithms and solve the constrained minimization problem involving a convex functional in a uni- formly convex

In this paper, motivated by Chares’s thesis (Cones and interior-point algorithms for structured convex optimization involving powers and exponentials, 2009), we consider