Approximation Accuracy, Gradient Methods, and Error Bound for Structured Convex Optimization1

(1)

Approximation Accuracy, Gradient Methods, and Error Bound for Structured Convex Optimization

¹

May 14, 2009 (revised July 7, 2009) Paul Tseng²

Abstract

Convex optimization problems arising in applications, possibly as approximations of in- tractable problems, are often structured and large scale. When the data are noisy, it is of interest to bound the solution error relative to the (unknown) solution of the original noiseless problem. Related to this is an error bound for the linear convergence analysis of first-order gradient methods for solving these problems. Example applications include compressed sensing, variable selection in regression, TV-regularized image denoising, and sensor network localization.

Key words. Convex optimization, compressed sensing, ℓ₁-regularization, nuclear/trace norm, regression, variable selection, sensor network localization, approximation accuracy, proximal gradient method, error bound, linear convergence.

1 Introduction

Optimization problems arising in application areas such as signal/image denoising, compressed sensing, regression, multi-task learning, classification, sensor network localization, are often large scale, possibly nonconvex or NP-hard, and the data may be noisy. Such problems may be approximated by convex relaxations that are highly structured.

Question 1: How accurate approximations are the convex relaxations?

Specifically, can we bound the error between a solution of the convex relaxation and an (unknown) solution of the original noiseless problem in terms of knowable quantities such as the noise level? This question is meaningful when a solution of the original problem changes (semi)continuously with small perturbations in the data, so the problem has both discrete and continuous nature. Examples include compressed sensing and sensor network localization; see Section 2. A certain noise-aware property of the convex relaxation appears key.

Question 2: How fast can the convex relaxations be solved?

Due to the large problem size, first-order gradient methods seem better suited to exploit struc- tures such as sparsity, (partial) separability, and simple nonsmoothness in the convex relaxations;

1This work is supported by National Science Foundation grant DMS-0511283.

2Department of Mathematics, University of Washington, Seattle, WA 98195, U.S.A.

([email protected])

(2)

see Section 3. The asymptotic convergence rate of these methods depend on an error bound on the distance to the solution set of the convex relaxation in terms of a certain residual function.

We prove such an error bound for a class of 2-norm-regularized problems that includes the group lasso for linear and logistic regression; see Section 4. Thus our aims are threefold: exposit on existing results, present a new result, and suggest future research.

We begin with a problem that has received much attention recently: compressed sensing. In the basic version of this problem, we wish to find a sparse representation of a given noiseless signal b⁰ ∈ ℜ^m from a dictionary of n elementary signals. This may be formulated as

x|Ax=bmin⁰♯(x), (1)

where A∈ ℜ^m×ncomprises the elementary signals for its columns and ♯(x) counts the number of nonzero components in x∈ ℜⁿ. In typical applications, m and n are large (m, n≥ 2000). This problem is known to be difficult (NP-hard) and a popular solution approach is to approximate it by a convex relaxation, with ♯(·) replaced by the 1-norm k· k1.³ This results in a linear program:

x|Ax=bmin⁰kxk1 (2)

that can be efficiently solved by simplex or interior point methods [28, 103, 105, 150, 154]. More- over, when the optimal value of (1) is sufficiently small and the columns of A are “approximately orthogonal,” which occur with high probability when A is, say, a Gaussian random matrix, the solution of (2) also solves (1), i.e., the relaxation is exact [26, 27, 28, 35, 40, 42, 56, 57, 59, 62, 83, 111, 133, 134]. For a noisy signal b, Ax = b may be inconsistent and we seek a sparse solution x whose residual Ax− b is small in the least square sense, either in the primal form of “lasso”

[131]

x| kAx−bkmin2≤ρkxk1 (3)

(ρ > 0) or in the dual form of “basis pursuit” [30, 43, 44, 121]

minx kAx − bk²2+ τkxk1 (4)

(τ > 0). Partial results on the accuracy of this relaxation are known [41, 133, 143]. Various methods have been proposed to solve (3) and (4) for fixed ρ and τ , including interior point method [30], coordinate minimization [121, 122], proximal gradient (with or without smoothing) [13, 14, 34], proximal minimization [155]. Homotopy approaches have been proposed to follow the solution as τ varies between 0 and∞ [54, 63, 110]. The efficiency of these methods depend on the structure of A. The polyhedral and separable structure of k · k1 is key.

A problem related to (4) is image denoising using total variation (TV) regularization:

min

x∈B

n 2 2

kAx − bk²2, (5)

where n is even, B2 denotes the unit 2-norm (Euclidea) ball in ℜ², A ∈ ℜ^m×n is the adjoint of the discrete (via finite difference) gradient mapping, and b ∈ ℜ^m. Interior-point, proximal

3We are using the term “relaxation” loosely since ♯(·) majorizes k · k1 only on the unit ∞-norm ball.

(3)

minimization, coordinate minimization, and gradient projection methods have been proposed for its solution [60, 108, 147, 152, 160, 161].

A second problem related to (1) and of growing interest is matrix rank minimization. The basic problem is

x|Ax=bmin⁰rank(x), (6)

where rank(x) is the rank of a matrix x∈ ℜ^p×n, and A is a linear mapping from ℜ^p×n to ℜ^m, and b⁰∈ ℜ^m. In the case of matrix completion, we have Ax = (x_ij)_(i,j)∈A, whereA indexes the known entries of x. This problem is also NP-hard, and a convex relaxation has been proposed whereby rank(x) is replaced by the nuclear/trace normkxknuc (the 1-norm of the singular values of x) [24, 25, 50, 73, 82, 115, 156]. The resulting problem

x|Ax=bmin⁰kxknuc (7)

has a more complex structure than (2), and only recently have solution methods, including interior point method and dual gradient method, been developed [24, 73, 82] and exactness results been obtained [25, 115]. For noisy b, we may consider, analogous to (4), the relaxation

minx kAx − bk²F + τkxknuc, (8)

where τ > 0 andk·kF denotes the Frobenious-norm. This problem has applications to dimension reduction in multivariate linear regression [75] and multi-task learning [1, 3, 106]. Recently, proximal/gradient methods have been applied to its solution [75, 82, 132]. How well does (8) approximate (6)?

A third problem related to (1) and of recent interest is that of sparse inverse covariance estimation, where we seek a positive definite x∈ Sⁿ that is sparse and whose inverse approximates a given sample covariance matrix s∈ Sⁿ. This may be formulated as the nonconvex problem

min

λIx¯λI− log det(x) + hs, xi + τ♯(x) (9)

with 0≤ λ < ¯λ ≤ ∞ and τ > 0, where hs, xi = trace[sx], ♯(x) counts the number of nonzeros in x, I denotes the n×n identity matrix, and denotes the partial ordering with respect to the cone S+ⁿ of positive semidefinite matrices [5, 9, 53, 74, 158]. Replacing ♯(x) by kxk1 := ^Pⁿ_i,j=1|xij| yields the convex relaxation

min

λIx¯λI− log det(x) + hs, xi + τkxk1. (10) Interior-point method appears unsuited for solving (10) owing to the large size of the Newton equation to be solved. Block-coordinate minimization methods (with each block corresponding to a row/column of x) and Nesterov’s accelerated gradient methods have been applied to its solution [5, 9, 53, 74]. How well does (10) approximate (9)?

A fourth problem of much recent interest is that of ad hoc wireless sensor network localization [4, 17, 18, 20, 21, 29, 36, 37, 47, 70, 72, 127]. In the basic version of this problem, we have n points z⁰₁, . . . , z_n⁰ in ℜ^d (d ≥ 1). We know the last n − m points (“anchors”) and an estimate d_ij ≥ 0 of the Euclidean distance d⁰ij =kzi⁰− zj⁰k2 between “neighboring” points i and j for all

(4)

(i, j)∈ A, where A ⊆ ({1, . . . , m}×{1, . . . , n}) ∪ ({1, . . . , n}×{1, . . . , m}). We wish to estimate the first m points (“sensors”). This problem may be formulated as

z1min,...,zm

X

(i,j)∈A^s

|kzi− zjk²2− d²ij| + ^X

(i,j)∈A^a

|kzi− zj⁰k²2− d²ij|, (11)

where A^s := {(i, j) ∈ A | i < j ≤ m} and A^a := {(i, j) ∈ A | i ≤ m < j} are the sets of, respectively, sensor-to-sensor and sensor-to-anchor neighboring pairs. Typically, d = 2 and two points are neighbors if the distance between them is below some threshold, e.g., the radio range.

In variants of this problem, constraints such as bounds on distances and angles-of-arrival are also present [16, 29, 37, 70]. This problem is NP-hard for any d ≥ 1. It is closely related to distance geometry problems arising in molecular conformation [19, 90], graph rigidity/realization [2, 36, 47, 127], and max-min/avg dispersion [33, 114, 149]. Letting z := ( z₁ · · · zm) and I denote the d×d identity matrix, Biswas and Ye [20, 21] proposed the following convex relaxation of (11):

minx

X

(i,j)∈A

ℓ_ij(x)− d²ij

subject to x =

y z^T

z I

0,

(12)

where y = ( y_ij)_1≤i,j≤m, and ℓ_ij(x) :=

y_ii− 2yij + y_jj if (i, j)∈ A^s,

yii− 2z^Ti z_j⁰+kz⁰jk² if (i, j)∈ A^a. (13) The relaxation (12) is a semidefinite program (SDP) and is exact if it has a solution of rank d.

On the other hand, (12) is still difficult to solve by existing methods for SDP when m > 500, and decomposition methods have been proposed [21, 29]. Recently, Wang, Zheng, Boyd, and Ye [148] proposed a further relaxation of (12), called edge-based SDP (ESDP) relaxation, which is solved much faster by an interior point method than (12), and yields solution comparable in approximation accuracy as (12). The ESDP relaxation is obtained by relaxing the constraint x 0 in (12) to require only those principal submatrices of x associated with A to be positive semidefinite. Specifically, the ESDP relaxation is

minx

X

(i,j)∈A

ℓ_ij(x)− d²ij

subject to x =

y z^T z I

,





y_ii y_ij z_i^T y_ij y_jj z_j^T z_i z_j I



 0 ∀(i, j) ∈ A^s,

y_ii z^T_i z_i I_d

0 ∀i ≤ m.

(14)

Notice that the objective function and the positive semidefinite constraints in (14) do not depend on y_ij, (i, j)6∈ A. How well does (12) approximate (11)? Only partial results are known in the noiseless case, i.e., (11) has zero optimal value [127, 141]. How well does (14) approximate (11)?

The above convex problems share a common structure, namely, they entail minimizing the sum of a smooth (i.e., continuously differentiable) convex function and a “simple” nonsmooth

(5)

convex function. More specifically, they have the form

minx∈E F (x) := f (x) + τ P (x), (15)

where E is a finite-dimensional real linear space endowed with a norm k · k, τ > 0, P : E → (−∞, ∞] is lower semicontinuous (lsc), convex, with domP = {x | P (x) < ∞} closed, and f : E → (−∞, ∞] is convex and smooth on domf, assumed open, and f(x) → ∞ whenever x approaches a boundary point of domf [116]. A well-known special case of (15) is smooth constrained convex optimization, for which P is the indicator function for a nonempty closed convex setX ⊆ E, i.e.,

P (x) =

0 if x∈ X ,

∞ else. (16)

The class of problems (15) was studied in [6, 89] and by others; see [144] and references therein.

For example, (4) corresponds to

E = ℜⁿ, k · k = k · k2, f (x) =kAx − bk²2, P (x) =kxk1, (17) (5) corresponds to

E = ℜⁿ, k · k = k · k2, f (x) =kAx − bk²2, P (x) = (

0 if x∈ B

n

22,

∞ else, (18)

(8) corresponds to

E = ℜ^p×n, k · k = k · kF, f (x) =kAx − bk²F, P (x) =kxknuc, (19) and (10) corresponds to

E = Sⁿ, k · k = k · kF, f (x) =− log det(x) + hs, xi, P (x) =kxk1. (20) In the case of 0 < λ < ¯λ < ∞, Lu [74] proposed a reformulation of (10) via Fenchel duality [116, 117], corresponding to

f (x) = sup

λIy¯λI

hx, yi − log det(y), P (x) =

0 ifkx − sk∞≤ τ,

∞ else, (21)

where kxk∞ = max_i,j|xij|. (Note that P in (21) depends on τ.) Importantly, for (17), (18), (20), (21), P is block-separable, i.e.,

P (x) = ^X

J∈J

P_J(x_J), (22)

whereJ is some partition of the coordinate indices of x. Then (15) is amenable to solution by (block) coordinatewise methods. Another special case of interest is variable selection in logistic regression [88, 126, 157, 159], corresponding to

E = ℜⁿ, k · k = k · k2, f (x) =

m

X

i=1

log1 + e^Aⁱ^x− biA_ix, P (x) =kxk1, (23)

(6)

with A_i the ith row of A∈ ℜ^m×n and b_i ∈ {0, 1}; also see [55, 123, 124, 125] for variants. A closely related problem is group variable selection (“group lasso”), which uses instead

P (x) = ^X

J∈J

ω_JkxJk2, (24)

with ω_J ≥ 0, in (17) and (23) [88, 157]. The TV image reconstruction model in [147, Eq.

(1.3)] and the TV-L₁ image deblurring model in [152, Eq. (1.5)] are special cases of (15) with f quadratic and P of the form (24).

How accurate approximations are the convex relaxations (7), (8), (10), (12), (14), and other related convex relaxations such as that for max-min dispersion, allowing for noisy data? What are the iteration complexities and asymptotic convergence rates of first-order gradient methods for solving these and related problems? First-order methods are attractive since they can exploit sparse or partial separable structure of f and block-separable structure of P .

Throughout, ℜⁿ denotes the space of n-dimensional real column vectors, Sⁿ={x ∈ ℜ^n×n | x = x^T}, and^T denotes transpose. For any x∈ ℜⁿand nonempty J ⊆ {1, ..., n}, xj denotes jth coordinate of x, xJ denotes subvector of x comprising xj, j ∈ J, and kxkρ = ^Pⁿ_j=1|xj|^ρ^1/ρ for 0 < ρ <∞, and kxk∞= maxj|xj|. For any x ∈ ℜ^p×n, xij denotes the (i, j)th entry of x.

2 Approximation Accuracy of the Convex Relaxations

Let a_j denote column j of A in (1), normalized so thatkajk2 = 1, for all j. It is easily seen that the relaxation (2) is exact (i.e., its solution also solves (1)) if a1, . . . , anare pairwise orthogonal.

This hints that (2) may remain exact if these columns are approximately orthogonal. Mallat and Zhang [83] introduced the following measure of approximate orthogonality, called “mutual coherence”, in their study of matching pursuit:

µ := max

1≤i6=j≤n

a^T_i a_j. (25)

There exist overcomplete sets with n ≈ m² and µ ≈ 1/√

m; see [129, pages 265-266] and references therein. This µ is central to the analysis in [40, 41, 42, 56, 57, 59, 62, 83, 133, 134, 143].

In particular, (2) is exact whenever N⁰ < ¹₂(µ⁻¹+ 1) = O(n¹⁴), where N⁰ denotes the optimal value of (1) [40, Theorem 7], [62, Theorem 1], [56], [133, Theorems A and B]. When b is noisy, it can be shown that the solution x^ρ of the noise-aware model (3) is close to the solution x⁰ of the original noiseless problem (1):

kx^ρ− x⁰k²2 ≤ (δ + ρ)²

1− µ(4N⁰− 1) (26)

whenever ρ≥ δ := kb − b⁰k2 and N⁰ < (¹₂− O(µ))µ⁻¹+ 1; see [41, Theorem 3.1], [143, Theorem 1]. The bound (26) also extends to ρ < δ with some limitations [143, Theorem 1]. In addition to convex relaxation, greedy methods can also be shown to recover or approximate x⁰ and identify the support of x⁰ under similar conditions on N⁰, δ, and ρ; see [41, Theorem 5.1(a)], [133, Theorems A and B], [143, Theorems 3 and 4].

(7)

A different measure of approximate orthogonality, introduced by Cand`es and Tao [28], is the

“restricted isometry” constant µ_N, defined as the smallest scalar satisfying

(1− µN)kxk²2 ≤ kAxk²2 ≤ (1 + µN)kxk²2 ∀x with ♯(x) ≤ N. (27) In fact, µ₂ = µ.⁴ It is known thatkx^ρ− x⁰k2= O(ρ) whenever ρ≥ δ and µ3N⁰+ 3µ_4N⁰ < 2 [27, Theorem 1]; also see [27, Theorem 2] for a relaxation of the latter condition. For A randomly generated from certain classes of distributions (e.g., Gaussian), this condition holds with high probability (for N⁰ in the order of n to within log factors); see [26, 28, 39, 45, 111, 135] and references therein for the noiseless case and [27, 38] for the noisy case. The approximation bounds in [27, 38] for (3) require ρ ≥ δ, as well as n = O(m) in [38]. Is extension to ρ < δ possible, as in [143, Theorem 1]?

Can the aforementioned exactness results and error bounds be extended to matrix rank minimization (6) and its convex relaxations (7) and (8)? Recent progress has been made in the noiseless case [25, 115]. The nuclear norm has a more complex structure than the 1-norm.

For the sensor network localization problem (11), its SDP relaxation (12) is exact if the distances d_ij, (i, j) ∈ A, are exact (i.e., dij = d⁰_ij for all (i, j) ∈ A) and any relative-interior solution (i.e., a point in the relative interior of the solution set) of (12) has rank d [127, Theorem 2]. However, this assumption is quite strong. What can we say in general? Remarkably, a kind of partial exactness still holds. Biswas and Ye [20, Section 4] introduced the notion of individual traces for a feasible solution x of (12), defined as

tr_i(x) := y_ii− kzik², i = 1, . . . , m,

or, equivalently, the diagonals of the Schur complement y− z^Tz. It can be shown that, for any relative-interior solution x of (12), tr_i(x) = 0 implies z_i is invariant over the solution set of (12) and hence equals z_i⁰, the true position of sensor i, when distances are exact [141, Proposition 4.1]. An analogous result holds for the ESDP relaxation (14) [148, Theorem 2]. Thus, upon finding a relative-interior solution x of (12) or (14), we know that every sensor i with tr_i(x) = 0 (within numerical accuracy) has z_i as its true position. Is this result sharp? Yes, at least when the distances are exact. In this case, the converse holds for the ESDP relaxation; see [113, Theorem 1]. It is not known if the converse holds for the SDP relaxation. On the other hand, an example in [113, Example 2] shows that, in the noisy case where the distances are inexact, tr_i(x) = 0 is not a reliable indicator of sensor i position accuracy even when x is the unique solution of the SDP/ESDP relaxation. The reason is that the solution set of the SDP/ESDP relaxation can change abruptly under arbitrarily small data perturbation. This contrasts with a second-order cone (SOCP) relaxation of the same problem, whose solution set changes gradually with data perturbation [141, Section 7]. To overcome this difficulty, a noise-aware version of the ESDP relaxation, analogous to (3) for compressed sensing, was proposed in [113, Section 5].

Specifically, for any ρ = (ρij)_(i,j)∈A ≥ 0, let

S_resdp^ρ :=ⁿx| x is feasible for (14) and |ℓij(x)− d²ij| ≤ ρij ∀(i, j) ∈ A^o. (28)

4Why? Since kajk2 = 1 for all j, (27) with N = 2 reduces to 1 − µ2 ≤ 1 + 2a^Tiajxixj≤ 1 + µ2 for all i 6= j and all xi, xj ∈ ℜ with x²_i+ x²_j = 1. Since (xi, xj) 7→ 2xixjhas minimum value of −1 and maximum value of 1 on the unit sphere, the smallest µ2 for which this holds is precisely µ given by (25).

(8)

Then Sresdp^ρ contains the noiseless ESDP solutions whenever ρ ≥ δ := (|d²ij − (d⁰ij)²|)(i,j)∈A. Moreover, defining its “analytic center” as

x^ρ := arg min

x∈S^ρ

resdp

− ^X

(i,j)∈A^s

log det









y_ii y_ij z_i^T y_ij y_jj z_j^T z_i z_j I







−

m

X

i=1

log tr_i(x), (29)

it can be shown that, for ρ > δ sufficiently small, tr_i(x^ρ) = 0 is a reliable indicator of sensor i position accuracy; see [113, Theorem 4]. Moreover, for any ρ > δ, we have the following computable bound on the individual sensor position accuracy:

kzi^ρ− zi⁰k ≤^q2|A^s| + m · tri(x^ρ)¹² ∀i.

Can these results be extended to the SDP relaxation (12) or SOS relaxations [66, 104] or to handle additional constraints such as bounds on distances and angles-of-arrival?

Closely related to sensor network localization is the continuous max-min dispersion problem, whereby, given existing points z_m+1⁰ , . . . , z_n⁰ inℜ^d(d≥ 1), we wish to locate new points z1, . . . , z_m inside, say, a box [0, 1]^d that are furthest from each other and existing points [33, 149]:

max

z1,...,zm∈[0,1]^dmin

i<j≤mmin ω_ijkzi− zjk2, min

i≤m<jω_ijkzi− z⁰jk2

, (30)

with ω_ij > 0. Replacing “max” by “sum” yields the max-avg dispersion problem [114, Section 4]. It can be shown (by reduction from 0/1-integer program feasibility) that (30) is NP-hard if d is a part of the input, even when m = 1. How accurate approximations are their convex (e.g., SDP, SOCP) relaxations? Another related problem arises in protein structure prediction, whereby the distances between neighboring atoms and their bond angles are known, and we wish to find positions of the atoms that minimize a certain energy function [119]. Although the energy function is complicated and highly nonlinear, one can focus on the most nonlinear terms, such as the Lennard-Jones interactions, in seeking approximate solutions.

3 Gradient Methods for Solving the Convex Relaxations

How to solve (15)? We will assume that ∇f is Lipschitz continuous on a closed convex set X ⊇ domP , i.e.,

k∇f(x) − ∇f(y)k∗≤ Lkx − yk ∀x, y ∈ X , (31) for some L > 0, where E^∗ is the vector space of continuous linear functionals on E, endowed with the dual norm kx^∗k∗ = sup_kxk≤1hx^∗, xi and hx^∗, xi is the value of x^∗ ∈ E^∗ at x∈ E. This assumption, which is satisfied by (17), (18), (19), (21), (23), can be relaxed to hold for (20) as well. Owing to its size and structure, (15) is suited for solution by first-order gradient methods, whereby at each iteration f is approximated by a linear function plus a “simple” proximal term.

We describe such methods below. To simplify notation, we denote the linearization of f in F at y∈ X by

ℓ_F(x; y) := f (y) +h∇f(y), x − yi + τP (x) ∀x. (32)

(9)

3.1 Proximal Gradient Methods

Choose a strictly convex function η : E → (−∞, ∞] that is differentiable on an open set con- tainingX ,⁵ Then the function

D(x, y) := η(x)− η(y) − h∇η(y), x − yi ∀y ∈ X , ∀x ∈ E,

is nonnegative and zero if and only if x = y, so D acts as a proximity measure. This function was studied by Bregman [23] and many others; see [8, 10, 32, 46, 67, 130] and references therein.

By scaling η if necessary, we assume that D(x, y)≥ 1

2kx − yk² ∀ x, y ∈ X . (33)

The classical gradient projection method of Goldstein and Levitin, Polyak (see [15, 112]) naturally generalizes to solve (15) using the Bregman function D, with constant stepsize 1/L:

x^k+1 = arg min

x

ℓF(x; x^k) + L

2D(x, x^k)

, k = 0, 1, . . . , (34) where x⁰ ∈ domP . This method, which we call the proximal gradient (PG) method, is closely related to the mirror descent method of Nemirovski and Yudin [93], as is discussed in [8, 11];

also see [89] for the case of η(·) = ¹₂k · k²2. When P is the 1-norm or has the block-separable form (22), the new point x^k+1 can be found in closed form, which is a key advantage of this method for large-scale optimization; see [144, 151] and references therein. When P is given by (16) and X is the unit simplex, x^k+1can be found in closed form in O(n) floating point operations (flops) by taking η(x) to be the x log x-entropy function [11, Section 5], [98, Lemma 4]. Moreover, the corresponding D satisfies (33) with k · k being the 1-norm [11, Proposition 5.1], [98, Lemma 3].

If η(·) = ¹₂k · k²2 is used instead, then x^k+1 can still be found in O(n) flops, but this requires using a more complicated algorithm; see [69] and references therein. It can be shown that

F (x^k)− inf F ≤ O

L k

∀k,

and hence O(^L_ǫ) iterations suffice to come within ǫ > 0 of inf F ; see, e.g., [13, Theorem 3.1], [97, Theorem 2.1.14], [112, page 166], [146, Theorem 5.1].

In a series of work [94, 95, 98] (also see [112, page 171]), Nesterov proposed three methods for solving the smooth constrained case (16) that, at each iteration, use either one or two projection steps together with extrapolation to accelerate convergence. These accelerated gradient projection methods generate points{x^k} that achieve

F (x^k)− inf F ≤ O

L k²

∀k,

so that O(^q^L_ǫ) iterations suffice to come within ǫ > 0 of inf F . In [98], it is shown that various large convex-concave optimization problems can be efficiently solved by applying these methods to a smooth approximation with Lipschitz constant L = O(1/ǫ). These methods have inspired

5This assumption can be relaxed to η being differentiable on the interior of X only.

(10)

various extensions and variants [8, Section 5], [13, 61, 71], [97, Section 2.2], [98, 101, 142], as well as applications to compressed sensing, sparse covariance selection, matrix completion, etc.

[14, 5, 64, 74, 75, 82, 96, 132]. In particular, all three methods can be extended to solve (15) in a unified way and achieve O(^q^L_ǫ) iteration complexity; see [142] and discussion below. The work per iteration is between O(n) and O(n³) flops for the applications of Section 1. In contrast, the number of iterations for interior point methods is at best O(√

n log¹_ǫ) and the work per iteration is typically between O(n³) and O(n⁴) ops. Thus, for moderate ǫ (say, ǫ = .001), moderate L (which may depend on n), and large n (n ≥ 10000), a proximal gradient method can outperform interior point methods.

The first accelerated proximal gradient (APG) method for solving (15) is the simplest, but requires E to be a Hilbert space (i.e., E^∗ = E, k · k = ^ph·, ·i) and η(·) = ¹₂k · k², so that D(x, y) = ¹₂kx−yk². For any x⁰ = x⁻¹ ∈ domP and θ0 = θ₋₁= 1, it generates (for k = 0, 1, . . .) y^k = x^k+ θ_k(θ_k−1⁻¹ − 1)(x^k− x^k−1), (35) x^k+1 = arg min

x

ℓ_F(x; y^k) +L

2kx − y^kk²

, (36)

θ_k+1 =

qθ_k⁴+ 4θ²_k− θk²

2 . (37)

An inductive argument shows that θ_k≤ _k+2² for all k. As k→ ∞, we have _θ_k−1^θ^k =√

1− θk→ 1, so that, by (35), y^k is asymptotically an isometric extrapolation from x^k−1 to x^k. In particular, y^k may lie outside of domP . However, since x^k, x^k−1 ∈ domP , it is readily seen that y^k ∈ {2x − w | x, w ∈ domP } (since x + α(x − x⁻¹) = 2x− w with w = (1 − α)x + αx⁻¹). This method was recently proposed by Beck and Teboulle [13] as an extension of Nesterov’s first method [94]; also see [61] for refinements in the unconstrained case of P ≡ 0.

The second APG method imposes no requirement on E or D, and maintains y^k ∈ domP , so it is less restrictive than the first method. For any x⁰, z⁰ ∈ domP and θ0 = 1, it generates (for k = 0, 1, . . .)

y^k = (1− θk)x^k+ θ_kz^k, (38)

z^k+1 = arg min

x

nℓ_F(x; y^k) + θ_kLD(x, z^k)^o, (39)

x^k+1 = (1− θk)x^k+ θ_kz^k+1, (40)

with θ_k+1 given by (37). Since 0 < θ_k≤ 1, we have from x^k, z^k ∈ domP that y^k∈ domP . In the smooth constrained case (16), this method corresponds to Auslender and Teboulle’s extension [8, Section 5] of Nesterov’s second method [95]; also see [97, page 90]. A variant proposed by Lan, Lu, and Monteiro [71, Section 3] replaces (40) by a PG step from y^k.

The third APG method differs from the second method mainly in the computation of z^k+1. For any x⁰∈ domP and z⁰ = arg min

x∈domP

η(x), θ₀ = 1, it generates (for k = 0, 1, . . .) y^k by (38),

z^k+1 = arg min

x

( _k X

i=0

ℓ_F(x; yⁱ) θi

+ Lη(x) )

, (41)

(11)

and x^k+1, θ_k+1 by (40), (37). Thus, (41) replaces ℓ_F(x; y^k)/θ_kin (39) by its cumulative sum and replaces D(·, z^k) by η(·). In the case of (16), this method is similar to Nesterov’s third method [98] but with only one projection instead of two. In the case of E being a Hilbert space and η(·) = ¹₂k · k², this method bears some resemblance to an accelerated dual gradient method in [101, Section 4].

The preceding accelerated methods may look unintuitive, but they arise “naturally” from refining the analysis of the PG method, as is discussed in Appendix A. Moreover, these methods are equivalent (i.e., generate the same sequences) when P ≡ 0, E is a Hilbert space, and the Bregman function D is quadratic (i.e., η(·) = ¹₂k · k²). Some extensions of these methods, including cutting planes, estimating L, are discussed in [142]. In particular, L can be estimated using backtracking: increase L and repeat the iteration whenever a suitable sufficient descent condition (e.g., (53) or (57)) is violated; see [13, 94, 142]. Below we summarize the iteration complexity of the PG and APG methods.

Theorem 1 (a) Let {x^k} be generated by the PG method (34). For any x ∈ domP , we have F (x^k)≤ F (x) +1

kLD(x, x⁰) ∀k ≥ 1.

(b) AssumeE is a Hilbert space, η(·) = ¹₂k · k², andX ⊇ {2x − w | x, w ∈ domP }. Let {x^k} be generated by the first APG method (35)–(37). For any x∈ domP , we have

F (x^k)≤ F (x) + θk−1² LD(x, x⁰) ∀k ≥ 1.

(c) Let {x^k} be generated by the second APG method (37)–(40). For any x ∈ domP , we have F (x^k)≤ F (x) + θ²k−1LD(x, z⁰) ∀k ≥ 1.

(d) Let {x^k} be generated by the third APG method (37), (38), (40), (41). For any x ∈ domP , we have

F (x^k)≤ F (x) + θ²k−1L(η(x)− η(z⁰)) ∀k ≥ 1.

A proof of Theorem 1(a)-(c) is given in Appendix A. A proof of part (d) can be found in [142, Corollary 3(a)]. Taking any x satisfying F (x)≤ inf F +^ǫ₂ in Theorem 1 yields F (x^k)≤ inf F + ǫ after k = O(^L_ǫ) iterations for the PG method and after k = O(^q^L_ǫ) iterations for the APG methods.

How can we terminate the PG and APG methods in practice with a guaranteed optimality gap? The bounds in Theorem 1 requires estimating the distance to an ₂^ǫ-minimizer of F and are rather conservative. In the case where f has the form

f (x) = max

v∈V φ(x, v),

for some saddle function φ and convex set V in a suitable space, duality gap can be used to terminate the methods [92, 98, 99]. The dual problem is max_vQ(v), with dual function

Q(v) := min

x {φ(x, v) + τP (x)}.

(12)

Then we compute (say, every 5 or 10 iterations) a candidate dual solution v^k = arg max

v φ(x^k, v),

and check that F (x^k)− Q(v^k)≤ ǫ. In fact, assuming furthermore that domP is bounded, it can be shown using an idea of Nesterov [98, Theorem 3] that

0 ≤ F (x^k+1)− Q(¯v^k) ≤ θ²kL max

x∈domP(η(x)− η(z⁰)) ∀k ≥ 0, where x^k+1, y^k, θ_k are generated by the third APG method, and we let

v^k = arg max

v φ(y^k, v), ¯v^k = (1− θk)¯v^k−1+ θ_kv^k.

with ¯v⁻¹ = 0 [142, Corollary 3(c)]; also see [74, Theorem 2.2] and [99] for related results in the constrained case (16). Analogous bounds hold for the first two APG methods; see [142, Corollaries 1(b) and 2].

When applied to (4), the first APG method yields an accelerated version of the iterative thresholding method of Daubechie et al. [13, 34]. What about the other two methods? How efficiently can these methods be applied to solve (5), (7), (8), (10), (23), and related problems such as those in [48, 123, 124, 125, 153]? When applied to (7) and (8), singular value decomposition is needed at each iteration, and efficiency depends on the cost for this decomposition.

However, only the largest singular values and their associated singular vectors are needed [24].

Can these be efficiently computed or updated? Some progress on this have recently been made [82, 132].

Can the iteration complexity be further improved? The proofs suggest that the convergence rate can be improved to O(_k¹p) (p > 2) if we can replace k · k² in the proofs by k · k^p; see Appendix A. However, this may require using a higher-order approximation of f in ℓF(·; y), so the improvement would not come “for free”.

3.2 Block-Coordinate Gradient Methods

WhenE is a Hilbert space and P is block-separable (22), we can apply (34) block-coordinatewise, possibly with L and D dynamically adjusted, resulting in a block-coordinate gradient (BCG) method. More precisely, given x^k ∈ domP , we choose Jk as the union of some subcollection of indices in J and choose a self-adjoint positive definite linear mapping H^k : E → E (ideally H^k≈ ∇²f (x^k)), compute

d^k = arg min

d

ℓF(x^k+ d; x^k) +1

2hH^kd, di | dj = 0 ∀j 6∈ Jk

, (42)

and update

x^k+1= x^k+ α_kd^k, (43)

with stepsize α_k > 0 [144, 146]. This method may be viewed as a coordinate/SOR version of a sequential quadratic programming (SQP) method, and it is related to the variable/gradient

(13)

distribution methods for unconstrained smooth optimization [51, 58, 86, 120] and (block) coordinate minimization methods [15, 84, 87, 105, 121, 136, 140]. In the case of H^kd = Ld and J_k comprising all coordinate indices of x, (42)–(43) with α_k = 1 reduces to the PG method (34) with η(·) = ¹₂k · k².

How to choose α_kand J_k? Various stepsize rules for smooth optimization [15, 52, 105] can be adapted to this nonsmooth setting. One that works well in theory and practice is an Armijo-type rule adapted from SQP methods:

α_k = maxⁿα∈ {1, β, β², . . .} | F (x^k+ αd^k)≤ F (x^k) + ασ∆_k^o, (44) where 0 < β, σ < 1 and ∆_k is the difference between the optimal value of (42) and F (x^k). This rule requires only function evaluations, and ∆_kpredicts the descent from x^kalong d^k. For global convergence, the index subset J_k is chosen either in a Gauss-Seidel manner, i.e., J_k∪ · · · ∪ Jk+K

covers all subsets of J for some constant K ≥ 0 [31, 88, 107, 140, 144] or Jk is chosen in a Gauss-Southwell manner to satisfy

∆_k≤ ω∆^allk ,

where 0 < ω < 1 and ∆^all_k denotes the analog of ∆_k when J_kis replaced by the entire coordinate index set [144, 145, 146]. Moreover, assuming (31) withX = domP , the BCG method using the Gauss-Southwell choice of J_k finds an ǫ-minimizer of F in O(^L_ǫ) iterations [146, Theorem 5.1].

The above BCG method, which has been successful for compressed sensing and variable selection in regression [88, 126, 159] and can be extended to handle linear constraints as in support vector machine (SVM) training [145], may also be suited for solving (5), (10), (14), (21), and related problems. When applied to (10) with J_k indexing a row/column of x, d^k and α_k are computable in O(n²) flops using Schur complement and properties of determinant. This method may be similarly applied to Lu’s reformulation (21). This contrasts with the block- coordinate minimization method in [9] which uses O(n³) flops per iteration. This method can also be applied to solve (29) by using a smooth convex penalty _2θ¹ max{0, | · | − ρij}² (θ > 0) for each constraint in (28) of the form | · | ≤ ρij. By choosing each coordinate block to comprise z_i, y_ii, and (y_ij)_j|(i,j)∈A, the computation distributes, with sensor i needing to communicate only with its neighbors when updating its position – an important consideration for practical implementation. The resulting method can accurately position up to 9000 sensors in little over a minute; see [113, Section 7.3]. Can Nesterov’s extrapolation techniques of Section 3.1 be adapted to the BCG method?

3.3 Incremental Gradient Methods

A problem that frequently arises in machine learning and neural network training has the form (15) with

f (x) =

m

X

i=1

f_i(x), (45)

where each f_i :E → (−∞, ∞] is smooth, convex on an open subset of E containing domP . (The convexity assumption can be relaxed depending on the context.) For example, f_i(x) may be the output error for an input-out system (e.g., a classifier), parameterized by x, on the ith training

(14)

example. P given by (16) would confer constraints on x and P (·) = k · k1 would induce a sparse representation. m may be large. A popular approach to minimizing f of the form (45) is by an incremental gradient method (“on-line back-propagation”):

x^k+1 = x^k− αk∇fik(x^k), (46)

with i_k chosen cyclically from {1, . . . , m} and stepsize αk > 0 either constant or adjusted dynamically; see [15, 68, 77, 85, 91, 128, 137] and references therein. When some the ∇fi’s are

“similar”, this incremental method is more efficient than (pure) gradient method since it does not wait for all component gradients∇f1, ...,∇fm to be evaluated before updating x. However, global convergence (i.e.,∇f(x^k)→ 0) generally requires αk→ 0, which slows convergence. Re- cently, Blatt, Hero and Gauchman [22] proposed an aggregate version of (46) that approximates

∇f(x^k) by

g^k=

k

X

ℓ=k−m+1

∇fiℓ(x^ℓ).

This method achieves global convergence for any sufficiently small constant stepsize α_k, but requires O(mn) storage. We can reduce the storage to O(n) by updating a cumulative average of the component gradients:

g^k_sum = g_sum^k−1+∇fik(x^k), g^k = m kg^k_sum,

with g⁻¹_sum = 0. We then use g^k in the PG, APG, or BCG method. The resulting incremental methods share features with recently proposed averaging gradient methods [65, 102]. What are their convergence properties, iteration complexities, and practical performances?

4 Error Bound and Linear Convergence of Gradient Methods

Analogous to superlinear convergence for second-order methods, linear convergence is a good indicator of efficiency for first-order methods. Key to a linear convergence analysis is the following Lipschizian error bound on dist(x, ¯X ) := min

s∈ ¯X kx − sk in terms of the norm of the residual R(x) := arg min

d

ℓ_F(x + d; x) + 1 2kdk²

, (47)

where ¯X denotes the set of minimizers of F , which we assume to be nonempty.

EB condition: For any ζ≥ min F , there exist scalars κ > 0 and ǫ > 0 such that

dist(x, ¯X ) ≤ κkR(x)k whenever F (x)≤ ζ, kR(x)k ≤ ǫ. (48) Under the EB condition, asymptotic linear and even superlinear convergence can be estab- lished for various methods, including interior point, gradient projection, proximal minimization, coordinate minimization, and coordinate gradient methods – even if ¯X is unbounded; see [12, 49, 80, 81, 139, 144, 145, 146] and references therein. Moreover, the EB condition holds under any of the following conditions; see [144, Theorem 4] as well as [109, Theorem 3.1], [79, Theorem 2.1] for the constrained case (16).

(15)

C1. E = ℜⁿ, f (x) = h(Ax) +hc, xi for all x ∈ ℜⁿ, where A ∈ ℜ^m×n, c ∈ ℜⁿ, and h is a strongly convex differentiable function on ℜ^m with ∇h Lipschitz continuous on ℜ^m. P is polyhedral.

C2. E = ℜⁿ, f (x) = max_y∈Y{hy, Axi − h(y)} + hc, xi for all x ∈ ℜⁿ, where Y is a polyhedral set in ℜ^m, A ∈ ℜ^m×n, c ∈ ℜⁿ, and h is a strongly convex differentiable function on ℜ^m with∇h Lipschitz continuous on ℜ^m. P is polyhedral.

C3. f is strongly convex and satisfies (31) for some L > 0.

What if f is not strongly convex and P is non-polyhedral? In particular, we are interested in the group lasso for linear and logistic regression (see (17), (23), (24)), for which f is not strongly convex (unless A has full column rank) and P is non-polyhedral (unless J is a singleton for all J ∈ J ). The following new result shows that the error bound (48) holds for the group lasso.

The proof, given in Appendix B, exploits the structure of the weighted sum of 2-norms (24). To our knowledge, this is the first Lipschitzian error bound result for f not strongly convex and P non-polyhedral.

Theorem 2 Suppose that E = ℜⁿ, P has the form (24) with ω_J > 0 for all J ∈ J , and f has the form

f (x) = h(Ax), (49)

where A ∈ ℜ^m×n, and h : ℜ^m → (−∞, ∞] is differentiable on domh, which is assumed to be convex and open. Also suppose that (a) h is strongly convex and ∇h is Lipschitz continuous on any compact convex subset of domh, and (b) h(y)→ ∞ whenever y approaches a boundary point of domh. If ¯X 6= ∅, then

{x | F (x) ≤ ζ} is bounded ∀ζ ∈ ℜ, (50)

and the EB condition (48) holds.

The assumptions on h in Theorem 2 are satisfied by h(y) = 1

2ky − bk², corresponding to linear regression, and

h(y) =

m

X

i=1

log(1 + e^yⁱ)− hb, yi with b ∈ {0, 1}^m,

corresponding to logistic regression (23). The assumptions are also satisfied by h(y) =−

m

X

i=1

log(y_i) +hb, yi with b ≥ 0,

which arises in likelihood estimation under Poisson noise [123]. In the first two examples, h is bounded from below by zero. In the third example, h is unbounded from below but tends to

−∞ sublinearly. Since P given by (24) is homogeneous of degree 1, it is readily seen that ¯X 6= ∅

(16)

for all three examples. (An example of ¯X = ∅ is minxe^x+ 2x +|x|.) However, ¯X need not be a singleton. An example is

minx

1

2|x1+ x2− 2|²+|x1| + |x2|,

for which ¯X = {(1 − t, t) | t ∈ [0, 1]}. Can Theorem 2 be extended to f satisfying C2 and P given by (24) or to (18) or (19)? Can the constant κ in (48), which determines the convergence ratio, be estimated for compressed sensing (17) in terms of the restricted isometry constant µ_N? Corollary 1 Under the assumptions of Theorem 2, let {(x^k, H^k, J_k, α_k)} be generated by the BCG method (42)–(43) with (i) λI H^k ¯λI for all k (0 < λ ≤ ¯λ), (ii) {Jk} cycling through J ∈ J , and (iii) {α^k} chosen by the Armijo rule (44). If ¯X 6= ∅, then {F (x^k)}k∈T converges at least Q-linearly and {x^k}k∈T converges at least R-linearly, where T = {0, K, 2K, . . .} and K is the cardinality of J .

Proof. Theorem 2 shows that [144, Assumption 2(a)] is satisfied. By [144, Theorem 1(a)], {F (x^k)} ↓, so, by (50), {x^k} is bounded. Since F is convex, [144, Assumption 2(b)] is auto- matically satisfied. Also, {x^k} is lies in a compact convex subset of domf, over which ∇f is Lipschitz continuous (since ∇h is Lipschitz continuous over the the image of this subset under A). Conditions (i)–(iii) imply that the remaining assumptions in [144, Theorem 2(b)] are satisfied. In particular, the restricted Gauss-Seidel rule in [144] holds with the given T . Since {x^k} is bounded, [144, Theorem 2(b)] implies that {F (x^k)}k∈T converges at least Q-linearly and {x^k}k∈T converges at least R-linearly [107].

By Corollary 1, the method in [157], [88, Section 2.2.1] for linear group lasso (corresponding to H^k = A^TA) and the method in [88, Section 2.2.2] for logistic group lasso (corresponding to H^k = ν_kI with ν_k > 0) attain linear convergence. Linear convergence of the block-coordinate minimization methods used in [147] for TV-regularized image reconstruction, with f (w, u) =

1

2kw−Buk²2+^γ₂kAu−bk²2, P (w, u) =^P_JkwJk2, and in [152] for TV-regularized image deblurring, with f (w, u, z) = ¹₂kw − Buk²2 + ^γ₂kAu − b − zk²2, P (w, u, z) = ^P_Jω_JkwJk2 +kzk1, can be analyzed similarly using a generalization of Theorem 2 to allow ω_J = 0 for some J. Can the linear convergence analysis be extended to the APG methods or their variants?

5 Appendix A: Analyzing the PG and APG Methods

Since f is convex and (31) holds withX ⊇ domP , we have from (32) that F (x) ≥ ℓF(x; y) ≥ F (x) − L

2kx − yk² ∀x ∈ domP, y ∈ X . (51) The following “3-point” property is also key; see [32, Lemma 3.2], [71, Lemma 6], [142, Section 2].

3-Point Property: For any proper lsc convex function ψ :E → (−∞, ∞] and any z ∈ X , if η is differentiable at z+= arg min_x{ψ(x) + D(x, z)}, then

ψ(x) + D(x, z)≥ ψ(z+) + D(z₊, z) + D(x, z₊) ∀x ∈ domP.

(17)

The APG methods can be motivated by analyzing the PG method (34). Let {x^k} be generated by the PG method. For any x∈ domP and any k ∈ {0, 1, . . .}, we have

F (x^k+1) ≤ ℓF(x^k+1; x^k) +L

2kx^k+1− x^kk²

≤ ℓF(x^k+1; x^k) + LD(x^k+1, x^k)

≤ ℓF(x; x^k) + LD(x, x^k)− LD(x, x^k+1)

≤ F (x) + LD(x, x^k)− LD(x, x^k+1) ∀x ∈ domP, (52) where the first and fourth inequalities use (51), the second inequality uses (33), and the third inequality uses (34) and the 3-Point Property with ψ(x) = ℓ_F(x; x^k)/L. Letting e_k = F (x^k)− F (x) and ∆_k= LD(x, x^k), this simplifies to the recursion

e_k+1 ≤ ∆k− ∆k+1 ∀k ≥ 0.

It follows from e_k+1≤ ek that (k + 1)e_k+1 ≤ ∆0− ∆k+1 ≤ ∆0, which proves Theorem 1(a).

The above proof suggests that, for faster convergence, we should find a similar recursion as (52) but with L scaled by something tending to zero with k. To do this, we need to modify (34).

Suppose for simplicity E is a Hilbert space and η(·) = ¹₂k · k² (so that D(x, y) = ¹₂kx − yk²).

One such modification is to replace x^k in (34) by some y^k∈ X to be determined, yielding (36).

Then, as in the above proof for the PG method, we have F (x^k+1) ≤ ℓF(x^k+1; y^k) +L

2kx^k+1− y^kk² (53)

≤ ℓF(y; x^k) + L

2ky − y^kk²− L

2ky − x^k+1k²

≤ F (y) + L

2ky − y^kk²−L

2ky − x^k+1k² ∀y ∈ domP, (54) where the first and third inequalities use (51), and the second inequality uses (36) and the 3- Point Property. To get L to be scaled by something tending to zero, set y = (1− θk)x^k+ θ_kx in the above inequality, with x∈ domP arbitrary and 0 < θk ≤ 1 to be determined. We can then factor θ_k out of x to scale L, yielding

F (x^k+1)

≤ F ((1 − θk)x^k+ θ_kx) + L

2k(1 − θk)x^k+ θ_kx− y^kk²−L

2k(1 − θk)x^k+ θ_kx− x^k+1k²

= F ((1− θk)x^k+ θ_kx) + L

2θ²_kkx + (θ⁻¹_k − 1)x^k− θ_k⁻¹y^kk²−L

2θ²_kkx + (θ_k⁻¹− 1)x^k− θ_k⁻¹x^k+1k², where we have rearranged the terms to look like the recursion (52). We want the two terms insidek k² to have the form “x− z^k” and “x− z^k+1”, which we get by setting

z^k=−(θ_k⁻¹− 1)x^k+ θ⁻¹_k y^k and y^k by (35). Using also the convexity of F , we then obtain that

F (x^k+1) ≤ (1 − θk)F (x^k) + θ_kF (x) + θ²_kL

2kx − z^kk²− θ²k

L

2kx − z^k+1k² ∀k. (55)

(18)

Letting e_k= F (x^k)− F (x) and ∆k= ^L₂kx − z^kk², this simplifies to e_k+1 ≤ (1 − θk)e_k+ θ_k²∆_k− θ²k∆_k+1. Upon dividing both sides by θ²_k, we see that, by choosing θ_k+1so that _θ¹2

k

= ^1−θ_θ2^k+1 k+1

(which, upon solving for θ_k+1, yields (37)), this rewrites as the recursion

1− θk+1

θ_k+1² e_k+1+ ∆_k+1 ≤ 1− θk

θ²_k e_k+ ∆_k, which propagates backwards to yield

1− θk+1

θ²_k+1 e_k+1+ ∆_k+1 ≤ 1− θ0

θ²₀ e0+ ∆0 ∀k.

Since ^1−θ_θ2^k+1 k+1

= _θ¹2 k

, setting θ0 = 1 simplifies this to 1

θ_k²e_k+1 ≤ ∆0− ∆k+1 ≤ ∆0. Also, we have from (35) and taking θ₋₁ = 1 that z⁰ = y⁰ = x⁰. This proves Theorem 1(b).

The preceding proof depends crucially on rearranging terms inside k k², so it cannot be directly extended to other Bregman functions D. However, (55) suggests we seek a recursion of the form

F (x^k+1) ≤ (1 − θk)F (x^k) + θ_kF (x) + θ²_kLD(x, z^k)− θ²kLD(x, z^k+1) ∀k, (56) which, as our derivation of (52) and (55) suggests, can be achieved by setting z^k+1 by (39) and using the 3-Point Property, setting x^k+1 by (40) (analogous to our setting of y in (54)), and then choosing y^k so that x^k+1− y^k = θ_k(z^k+1− z^k) (which works out to be (38)). Specifically, for any k∈ {0, 1, . . .}, we have

F (x^k+1) ≤ ℓF(x^k+1; y^k) + L

2kx^k+1− y^kk²

= ℓ_F((1− θk)x^k+ θ_kz^k+1; y^k) +Lθ_k²

2 kz^k+1− z^kk²

≤ (1 − θk)ℓ_F(x^k; y^k) + θ_kℓ_F(z^k+1; y^k) + θ²_kLD(z^k+1, z^k) (57)

≤ (1 − θk)ℓ_F(x^k; y^k) + θ_kℓ_F(x; y^k) + θ_kLD(x, z^k)− θkLD(x, z^k+1) ∀x ∈ domh, where the first inequality uses (51), the second inequality uses the convexity of ℓ_F(·; y^k) and (33), the last inequality uses the 3-Point Property with ψ(x) = ℓF(x; y^k)/(θ_kL). Using (51) yields (56), from which Theorem 1(c) follows.

The third APG method (see (41)) can be derived and analyzed similarly using an analog of the 3-Point Property for η; see [142, Property 2]. For brevity we omit the details.

6 Appendix B: A Proof of Theorem 2

Througout we assume that the assumptions of Theorem 2 are satisfied. By scaling f by 1/τ , we will assume without loss of generality that τ = 1.