**PREPRINT**

Department of Mathematics, National Taiwan University

### www.math.ntu.edu.tw/~mathlib/preprint/2011-05.pdf

## Linear convergence analysis of the use of gradient projection methods on total variation problems

### Pengwen Chen and Changfeng Gui

### October, 2011


**Abstract** Optimization problems involving total variation frequently appear in image analysis models, in which the sharp edges of images are preserved. Direct gradient descent methods usually converge very slowly when applied to such optimization problems. Recently, many duality-based gradient projection methods have been proposed to accelerate convergence. In this dual formulation, the cost function of the optimization problem is singular, and the constraint set is not a polyhedral set. In this paper, we establish two inequalities related to projected gradients and show that, under some non-degeneracy conditions, the rate of convergence is linear.

**Keywords** Total variation, gradient projection methods, non-degeneracy conditions, linear convergence, projected gradients.

**Mathematics Subject Classification (2000)** 65K10, 49K35

**1 Introduction**

Pengwen Chen: Department of Mathematics, National Taiwan University, Taiwan. E-mail: pengwen@math.ntu.edu.tw

Changfeng Gui: Department of Mathematics, University of Connecticut, Storrs, CT 06268. E-mail: gui@math.uconn.edu

In this paper, we study the convergence rate of the gradient projection method in solving the following problem:

$$\min_{x\in\Omega} f(x), \qquad (1)$$

where $f$ is a convex, continuously differentiable function whose gradient $\nabla f$ is Lipschitz continuous on a nonempty, convex and closed set $\Omega$. Without further assumptions, it is known that the worst-case rate of convergence can be sub-linear [28]. In this paper, we prove that linear convergence can be obtained when the following two conditions hold:

1. The cost function in (1) can be expressed in the form $f(x) = h_0(E_0 x)$, and the feasible set can be expressed in the form
$$\Omega = \{x\in\mathbb{R}^n : g_i(x)\le 0,\ \text{for } i = 1,\dots,m\}\subset\mathbb{R}^n, \qquad (2)$$
where $g_i(x) = h_i(E_i x)$. Here, each $E_i$, $i = 0,\dots,m$, is a nonzero matrix with $n$ columns, and the $h_i$, $i = 0,\dots,m$, are strongly convex, twice differentiable ($C^2$) functions. (Note that the matrices $E_i$ could be singular.)
2. The minimizer in (1) satisfies a non-degeneracy condition (see Assumption 2).

Gradient projection methods with constant step sizes were ﬁrst proposed by Goldstein [23] and Levitin and Poljak [26]. Bertsekas [3] proposed the Armijo rule along the projection arc and studied its convergence behavior.

This method can identify the set of active inequality constraints in a finite number of iterations (this property is called finite identification). Moreover, this method can be combined with conjugate gradient methods or Newton's method to achieve super-linear convergence [3, 8, 35]. In practice, the gradient projection method is especially well-suited for large-scale problems with simple constraint structures, e.g., a box, a polyhedral set, or a Cartesian product of Euclidean hypercubes or balls. It is well known that the convergence rate of gradient projection methods is linear in the following cases:

1. $f$ is strongly convex [18].
2. $f$ is not strongly convex, but $f(x) = h_0(E_0 x) + q\cdot x$, where $q$ is some vector in $\mathbb{R}^n$, $E_0$ is some matrix, and $h_0$ is a strictly convex and essentially smooth function with $\nabla^2 h_0(E_0 x^*)$ positive definite for a minimizer $x^*$. Additionally, the convex set $\Omega$ must be a polyhedral set [27].

Clearly, the problem mentioned above does not ﬁt either of these two cases.

This research problem originates from the dual formulation of the image restoration model proposed by Rudin, Osher and Fatemi [31], now called the ROF model, in which the total variation is introduced to preserve the sharp edges of images. The ROF model was later modified to successfully tackle image de-noising [33], image de-blurring [15, 34] and image segmentation [14, 32, 11, 7] problems. Because of the total variation term, the Euler-Lagrange equation for this optimization problem becomes highly nonlinear, and many numerical methods converge quite slowly for this problem, especially the explicit gradient descent method.

Many algorithms were proposed to alleviate this numerical diﬃculty [33, 6, 13, 12]. A key breakthrough was made by Chambolle [9, 10]. He proposed a ﬁxed point algorithm and a gradient projection method with constant step size based on the dual formulation of total variation. These two algorithms soon became popular due to their simplicity and acceptable convergence speeds.

Recently, various accelerated algorithms have been proposed [38, 36, 1] based on this dual formulation. These algorithms use different step size rules, including techniques based on the Barzilai-Borwein (BB) method [2, 21, 5] and Nesterov's scheme [29]. The numerical results of these algorithms indicate that these gradient projection methods indeed converge much faster than the original Chambolle gradient projection method. In addition to the algorithms mentioned above, there exist other interesting and efficient algorithms for handling the image restoration problem, e.g., graph cut methods [10, 25] and the split Bregman method [24]. The dual formulation of the ROF model can be expressed as a special case of problem (1). In this paper, we shall study the convergence rate of the gradient projection method used to solve problem (1). This theoretical result can also be used to explain the occurrence of linear convergence in Chambolle's gradient projection method.

According to the framework developed by Nesterov, when $f$ is convex and possesses a Lipschitz continuous gradient, without exploiting the structure of the problem, the cost function in the gradient method converges as $O(1/k)$, where $k$ is the iteration counter [28]. To improve the speed of convergence, Nesterov proposed an accelerated scheme, in which the convergence speed becomes $O(1/k^2)$.¹ In this paper, we point out explicitly under which circumstances the gradient projection method can converge linearly. This provides an explanation as to why accelerated gradient projection methods can converge faster than Nesterov's scheme, as reported in a numerical result [36]. Briefly speaking, two conditions are required for linear convergence. The first condition is non-degeneracy, which stipulates that the gradient projection method has the finite identification property. The second condition specifies the form of the cost function and the constraint set. In a way, the second condition leads to positive curvature of "the constraint surface".

In fact, the slow convergence of Chambolle's gradient projection method can be attributed to two factors. One comes from the cost function $f$, and the other comes from $\{g_i : i\}$.² First, when the system has a large condition number, a constant step size is usually not an optimal choice. Numerical results in the reports [36, 38, 20] indicate that various BB schemes can improve the convergence speed when the step size is estimated by least-squares fits. The second factor comes from the non-polyhedral constraint set. Unlike the polyhedral case, the convergence rate of Chambolle's gradient projection method can be sub-linear in the worst case. This has nothing to do with the condition number of the system, so the improvement from the BB method could be limited or nonexistent (see the examples in section 2.4). In this situation, Nesterov's scheme is a better approach. Which factor dominates depends on whether the non-degeneracy condition is fulfilled. When the non-degeneracy condition holds, the gradient projection method converges linearly (though perhaps very slowly), and can be accelerated by various non-monotone spectral projected gradient methods.

¹ In fact, Nesterov pointed out that $O(1/k^2)$ is optimal when no further information about the problem is exploited.

² Throughout this paper, when no specified range is given for $i$, we mean that $i$ ranges over all its possible values.

In this paper, two conditions are proposed to establish linear convergence of two gradient projection methods: the first method uses gradients, and the second uses projected gradients (defined in Definition 1). A few examples are provided to explain the necessity of the proposed conditions. Linear convergence is established by constructing two inequalities involving the norm of projected gradients. We also study the convergence property of the gradient projection method using projected gradients, a method which was proposed implicitly in [38].

This paper is organized as follows. The formal description of our problem is given in section 2; the linear convergence analysis of the dual formulation is one special case of our problem. Two major inequalities are introduced to prove linear convergence. The proofs of these inequalities are quite lengthy and are given in section 3. In section 4, we present the convergence property of the second gradient projection method and demonstrate the existence of a positive lower bound for its step size. A brief introduction to the dual formulation of the ROF model is given in the appendix.

**2 Problem description and notations**

2.1 Notations

In this paper, we use the following notation. Let $X^*$ be the set of minimizers of $\min_{x\in\Omega} f(x)$. Let $\|\cdot\|$ be the Euclidean norm. Let $x'$ and $x\cdot y$ refer to the transpose and the inner product, respectively. For each $x\in\Omega$, let $\mathcal{A}(x)$ denote the active index set at $x$, $\mathcal{A}(x) = \{i : g_i(x) = 0\}$. Let $[\cdot]^+$ denote the projection onto the set $\Omega$, defined as follows: if $x = [y]^+$, then $x = \arg\min_{z\in\Omega}\|y - z\|^2$; equivalently, for all $z\in\Omega$,
$$(y - x)\cdot(z - x)\le 0. \qquad (3)$$
Finally, by assumption, $\nabla f$ is Lipschitz continuous. Let $L_f$ denote the Lipschitz constant: for all $x, y\in\Omega$,
$$\|\nabla f(x) - \nabla f(y)\|\le L_f\|x - y\|. \qquad (4)$$

When solving an unconstrained minimization problem, a minimizer $x^*$ is characterized by $\nabla f(x^*) = 0$. This criterion is no longer valid when solving a constrained minimization problem. Instead, the following properties show that the projected gradient $\nabla_\Omega f$ plays a similar role in the search for a minimizer: a minimizer $x^*$ satisfies $\nabla_\Omega f(x^*) = 0$.

**Definition 1** [8] Let the tangent cone $T(x)$ be the closure of the cone of all feasible directions, where a direction $v$ is feasible at $x\in\Omega$ if $x + \tau v$ belongs to $\Omega$ for all $\tau > 0$ sufficiently small. The projected gradient $\nabla_\Omega f$ of $f$ is defined by
$$\nabla_\Omega f := -\arg\min_{d\in T(x)}\|-\nabla f(x) - d\| = -\arg\min_{d\in T(x)}\|\nabla f(x) + d\|. \qquad (5)$$

Thus, $-\nabla_\Omega f$ is the steepest feasible descent direction. Note that this definition differs from the one in [8]: a negative sign is added in our definition. Clearly, when $x$ is in the interior of $\Omega$, $\nabla_\Omega f(x) = \nabla f(x)$. As in [8], $\nabla_\Omega f$ has the following properties.

- **(a)** The point $x\in\Omega$ is a stationary point if and only if $\nabla_\Omega f(x) = 0$.
- **(b)** $\nabla f(x)'\nabla_\Omega f(x) = \|\nabla_\Omega f(x)\|^2$, and
$$\min_{d\in T(x),\ \|d\|=1} d\cdot\nabla f(x) = -\|\nabla_\Omega f(x)\|. \qquad (6)$$
- **(c)** $\|\nabla_\Omega f(\cdot)\|$ is lower semi-continuous. If $\nabla f$ is uniformly continuous at $x^*$ and $\lim_{k\to\infty} x_k = x^*$, then
$$\lim_{k\to\infty}\|\nabla_\Omega f(x_k)\| = 0. \qquad (7)$$
- **(d)**
$$\lim_{\alpha\to 0^+}\frac{[x - \alpha\nabla f]^+ - x}{\alpha} = -\nabla_\Omega f. \qquad (8)$$
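For a concrete feel for Definition 1 and property (d), the sketch below (our own toy instance, anticipating Example 1: $f(x) = (x_1 - 1)^2/2$ on the unit disk) computes $\nabla_\Omega f$ on the boundary in closed form and checks the limit in property (d) with a small-step finite difference. All function names are ours.

```python
import math

def grad_f(x):
    # f(x1, x2) = (x1 - 1)^2 / 2, so grad f = (x1 - 1, 0)
    return (x[0] - 1.0, 0.0)

def project_disk(y):
    # [y]^+ : Euclidean projection onto the unit disk
    r = math.hypot(y[0], y[1])
    return y if r <= 1.0 else (y[0] / r, y[1] / r)

def projected_gradient(x):
    # For x on the unit circle, T(x) = {d : d . x <= 0}; project -grad f
    # onto T(x) and flip the sign (Definition 1).
    g = grad_f(x)
    d = (-g[0], -g[1])
    c = d[0] * x[0] + d[1] * x[1]        # normal component of -grad f
    if c > 0:                            # -grad f points out of the cone
        d = (d[0] - c * x[0], d[1] - c * x[1])
    return (-d[0], -d[1])

theta = 0.3
x = (math.cos(theta), math.sin(theta))
pg = projected_gradient(x)

# Property (d): ([x - a*grad f]^+ - x)/a -> -grad_Omega f as a -> 0+
a = 1e-7
gx = grad_f(x)
y = project_disk((x[0] - a * gx[0], x[1] - a * gx[1]))
fd = ((y[0] - x[0]) / a, (y[1] - x[1]) / a)

err = math.hypot(fd[0] + pg[0], fd[1] + pg[1])
norm_pg = math.hypot(*pg)   # equals (1 - cos t) sin t, cf. Eq. (24) below
```

The closed-form norm $(1 - \cos\theta)\sin\theta$ agrees with Eq. (24) of Example 1, and the finite difference matches $-\nabla_\Omega f$ to first order in the step $a$.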

Let $N(x)$ be the polar of the tangent cone $T(x)$; the cone $N(x)$ is called the normal cone to $\Omega$ at $x\in\Omega$. The tangent cone $T(x)$ is in turn the polar of the normal cone $N(x)$. Thus we have the unique orthogonal decomposition of $\nabla f$ (see, for example, Lemmas 2.1 and 2.2 in [37]):
$$-\nabla f(x) = -\nabla_\Omega f(x) + \sum_{i\in\mathcal{A}(x)}\lambda_i(x)\nabla g_i(x),$$
where $\{\lambda_i(x) : i\in\mathcal{A}(x)\}$ is the minimizer of the least squares problem
$$\min_{\lambda_i\ge 0}\Big\|-\nabla f(x) - \sum_{i\in\mathcal{A}(x)}\lambda_i\nabla g_i(x)\Big\|^2. \qquad (9)$$
For a minimizer $x^*\in X^*$, an optimality condition [35] for $x^*$ is
$$-\nabla f(x^*)\in N(x^*). \qquad (10)$$
That is, there exist scalars $\lambda_i\ge 0$ such that
$$-\nabla f(x^*) = \sum_{i\in\mathcal{A}(x^*)}\lambda_i\nabla g_i(x^*). \qquad (11)$$

2.2 Gradient projection methods and basic properties

In this paper, we shall study the convergence rates of the following gradient projection methods in solving problem (1).

**Alg 1 (Gradient projection method)** Starting with some $x_0\in\Omega$, iterate for $k = 0, 1, 2,\dots$,
$$x_{k+1} = x(\alpha_k), \qquad (12)$$
where $x(\alpha)$ and $\alpha_k$ are determined as follows.

1. (a) A constant step size is used with $0 < \alpha_k < 2/L_f$, where $L_f$ is the Lipschitz constant of $\nabla f$:
$$x(\alpha) := [x - \alpha\nabla f]^+. \qquad (13)$$
(b) The Armijo rule is in effect along the projection arc using gradients $\nabla f$: let $\sigma\in(0,1)$, $\beta\in(0,1)$ and $\alpha_0 > 0$. Let $\alpha = \beta^m\alpha_0$, where $m$ is the first nonnegative integer for which
$$f(x) - f(x(\beta^m\alpha_0))\ge\sigma\nabla f(x)\cdot(x - x(\beta^m\alpha_0)), \qquad (14)$$
with
$$x(\alpha) := [x - \alpha\nabla f]^+. \qquad (15)$$
2. The Armijo rule is in effect along the projection arc using projected gradients $\nabla_\Omega f$: let $\sigma\in(0,1)$, $\beta\in(0,1)$ and $\alpha_0 > 0$. Let $\alpha = \beta^m\alpha_0$, where $m$ is the first nonnegative integer for which
$$f(x) - f(x(\beta^m\alpha_0))\ge\sigma\nabla_\Omega f(x)\cdot(x - x(\beta^m\alpha_0)), \qquad (16)$$
with
$$x(\alpha) := [x - \alpha\nabla_\Omega f]^+ \qquad (17)$$
($\nabla_\Omega f$ is defined in Definition 1). This method is mentioned implicitly in [38].

**Convergence analysis:** The sequence $\{x_k : k\}$ generated by these gradient projection methods converges to a stationary point $x^*$: for cases 1(a) and 1(b), see Propositions 2.3.2 and 2.3.3 in [4]; for case 2, see Theorem 5.
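A minimal sketch of Alg 1, case 1(a), is below, applied to a toy disk-constrained instance (our choice of test problem, the same one used in Example 1 below; here $L_f = 1$, so any constant step $0 < \alpha < 2$ is admissible). The iterates decrease $f$ monotonically and approach the minimizer $(1, 0)$.

```python
import math

# Toy instance (ours): f(x) = (x1 - 1)^2 / 2 on the unit disk,
# unique minimizer x* = (1, 0), Lipschitz constant L_f = 1.
def f(x):
    return 0.5 * (x[0] - 1.0) ** 2

def proj(y):
    # [y]^+ : Euclidean projection onto the unit disk
    r = math.hypot(y[0], y[1])
    return y if r <= 1.0 else (y[0] / r, y[1] / r)

# Alg 1, case 1(a): constant step size with 0 < alpha < 2/L_f
x = (0.0, 1.0)
alpha = 1.0
vals = [f(x)]
for _ in range(200):
    g = (x[0] - 1.0, 0.0)                 # grad f(x)
    x = proj((x[0] - alpha * g[0], x[1] - alpha * g[1]))
    vals.append(f(x))
```

On this instance the cost decreases at every step, but (as Example 1 shows analytically) the distance to $x^*$ shrinks only sub-linearly.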

*Remark 1* The discretization of the dual formulation of the ROF model (see Eq. (133)) can be formulated as follows. Let
$$\Omega = \{x\in\mathbb{R}^{2n_1 n_2} : g_i(x) = x_{2i-1}^2 + x_{2i}^2 - 1\le 0,\ i = 1,\dots,n_1 n_2\}.$$
Given a vector $g_I\in\mathbb{R}^{n_1 n_2}$ and a positive parameter $\lambda$, the dual variable $p$ is a minimizer of the problem
$$\min_{x\in\Omega}\{f(x) = \|E_0 x - \lambda g_I\|^2\}, \qquad (18)$$
where $E_0 : \mathbb{R}^{2n_1 n_2}\to\mathbb{R}^{n_1 n_2}$ is the discrete divergence operator $\nabla\cdot$ (see Remark 5). Note that $f$ is not strongly convex. Compared with the proposed problem (1), the dual formulation (18) is a special case. In [10], Chambolle proposes a gradient projection method based on this dual formulation to solve the ROF model. The step size is set to a constant between 0 and 1/4. Clearly, this method is exactly the same as case 1 of Alg. 1.
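Because the constraints $g_i$ involve disjoint coordinate pairs, the projection $[\cdot]^+$ onto this $\Omega$ decouples: each pair $(x_{2i-1}, x_{2i})$ lying outside the unit disk is radially rescaled onto it. A minimal sketch (the helper name is our own):

```python
import math

def project_onto_Omega(x):
    """Project x in R^{2N} onto Omega: each consecutive coordinate pair
    (x_{2i-1}, x_{2i}) is scaled back onto the unit disk.  The projection
    decouples because each constraint involves a disjoint pair."""
    y = list(x)
    for i in range(0, len(y), 2):
        r = math.hypot(y[i], y[i + 1])
        if r > 1.0:
            y[i] /= r
            y[i + 1] /= r
    return y

# Three pairs: the first and last violate their disk constraints.
p = project_onto_Omega([3.0, 4.0, 0.3, -0.4, 0.0, -2.0])
```

This cheap, separable projection is what makes gradient projection methods attractive for the ROF dual.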

2.3 Assumptions on $f$, $g_i$

As mentioned in the introduction, we will prove linear convergence of the above gradient projection methods under the following conditions:

**Assumption 1** $f(x)$ can be expressed in the form $f(x) = h_0(E_0 x)$, with $E_0$ being an $m_0\times n$ matrix and $h_0$ being a strongly convex function on $\mathbb{R}^{m_0}$. Similarly, for each $i$, $g_i$ can be expressed in the form $g_i(x) = h_i(E_i x)$, with $E_i$ being an $m_i\times n$ matrix and $h_i$ being a strongly convex function on $\mathbb{R}^{m_i}$. Here, $\{m_i : i\}$ are natural numbers.

**Assumption 2 (Non-degeneracy)** [8, 18] A minimizer $x^*\in\Omega$ is non-degenerate if the active constraint normals $\{\nabla g_i(x^*) : i\in\mathcal{A}(x^*)\}$ are linearly independent and $\lambda_i(x^*) > 0$ for each $i\in\mathcal{A}(x^*)$. Hence,
$$-\nabla f(x^*)\in\operatorname{ri}(N(x^*)), \qquad (19)$$
where $\operatorname{ri}(\Lambda)$ is the relative interior of $\Lambda\subset\mathbb{R}^n$.

These gradient projection methods all have the following property: there exists a positive constant $c$ such that
$$c\|x_k - x_{k+1}\|^2\le f(x_k) - f(x_{k+1}), \qquad (20)$$
for $k$ sufficiently large. The proof is as follows. For case 1a, the formula can be found in the proof of Proposition 2.3.2 in [4]:
$$f(x_k) - f(x_{k+1})\ge\Big(\frac{1}{\alpha} - \frac{L_f}{2}\Big)\|x_k - x_{k+1}\|^2;$$
for cases 1b and 2, from Eq. (3) we have
$$f(x_k) - f(x_{k+1})\ge\frac{\sigma\|x_k - x_{k+1}\|^2}{\alpha_k}\ge\frac{\sigma\|x_k - x_{k+1}\|^2}{\alpha_0}. \qquad (21)$$

Because of Eq. (20), linear convergence of $\{f(x_k)\}$ implies R-linear convergence of $\{x_k : k\}$ (Lemma 3.1 in [27]).
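Inequality (20) is easy to check numerically. The sketch below (our toy instance, $f = (x_1 - 1)^2/2$ on the unit disk with $\alpha = 1$ and $L_f = 1$) verifies the case-1a bound $(1/\alpha - L_f/2)\|x_k - x_{k+1}\|^2\le f(x_k) - f(x_{k+1})$ along the iterates:

```python
import math

# Toy instance (ours): f = (x1 - 1)^2 / 2 on the unit disk,
# alpha = 1, L_f = 1, so c = 1/alpha - L_f/2 = 1/2 in the case-1a bound.
def f(x):
    return 0.5 * (x[0] - 1.0) ** 2

def proj(y):
    r = math.hypot(y[0], y[1])
    return y if r <= 1.0 else (y[0] / r, y[1] / r)

alpha, L_f = 1.0, 1.0
c = 1.0 / alpha - L_f / 2.0

x = (0.0, 1.0)
holds = True
for _ in range(50):
    x_new = proj((x[0] - alpha * (x[0] - 1.0), x[1]))
    decrease = f(x) - f(x_new)
    drift = (x[0] - x_new[0]) ** 2 + (x[1] - x_new[1]) ** 2
    holds = holds and (c * drift <= decrease + 1e-12)
    x = x_new
```

The small additive tolerance only guards against floating-point rounding; the inequality itself holds at every step.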

2.4 Necessity of the non-degeneracy condition

Before proceeding, we provide a few examples to illustrate the necessity of the proposed conditions, and also to examine the convergence speed of Chambolle's method. It is known that if $f$ is quadratic and of the form $x'Qx - b'x$, where $Q$ is positive definite, then the convergence rate of the gradient projection method is linear (see page 233 of [4]); hence, the non-degeneracy condition is not required to achieve linear convergence in that setting. The cost function $f$ in the dual formulation of the ROF model is quadratic and of the form $x'Qx - b'x$, with $Q = \nabla(\nabla\cdot)$ being merely positive semi-definite. To emphasize the difference, we give a few examples to show that, without either of the proposed conditions being satisfied, linear convergence cannot be obtained. The major reason is that $f(x) - f(x^*)$ cannot be bounded by $\|\nabla_\Omega f(x)\|^2$ when the non-degeneracy condition fails, or when curvature vanishes at $x^*$.

To analyze the iterative behavior of Chambolle's method, we focus on low-dimensional cases subject to one or two constraints. These examples are provided to emphasize the effects of non-polyhedral constraints. In the first example, we consider the positive semi-definite matrix $Q = [1, 0; 0, 0]/2$. The minimizer $x^*\in\mathbb{R}^2$ is deliberately chosen so that $\nabla f(x^*) = 0$; thus, $\lambda_1 = 0$ and the non-degeneracy condition fails. In the second example, we consider a quadratic cost function $f$ with $\nabla f(x^*)\ne 0$ but $-\nabla f(x^*)\notin\operatorname{ri}(N(x^*))$. The occurrence of sub-linear convergence indicates that the non-degeneracy condition, rather than $\nabla f(x^*)\ne 0$, is the crucial condition in this example. Our final example is provided to emphasize the importance of Assumption 1. The minimization problem is deliberately designed so that curvature vanishes at the minimizer. In this example, $-\nabla f(x^*)\in\operatorname{ri}(N(x^*))$ holds, but $g_i(x)$ does not satisfy Assumption 1. Under this circumstance, the gradient projection method using a constant step size (Chambolle's method) still converges sub-linearly, which implies that Assumption 1 is also crucial for achieving linear convergence.

*Example 1* Let $\Omega$ be the unit disk $\{x := (x_1, x_2)\in\mathbb{R}^2 : x_1^2 + x_2^2\le 1\}$.³ Let $f(x_1, x_2) = (x_1 - 1)^2/2$. Then the minimizer $x^*$ is unique; in fact, $x^* = (x_1, x_2) = (1, 0)$. Start with a point $x = (\cos\theta, \sin\theta)$ on the circle with $0 < \theta < \pi/2$, and consider the constant step size $\alpha = 1$. Then, in case 1 of Alg. 1,
$$x(1) = [x - \nabla f(x)]^+ = (1, x_2)/\sqrt{1 + x_2^2}.$$
Compute
$$\|x(1) - x^*\| = \|[x - \nabla f(x)]^+ - x^*\| = \theta + O(\theta^2) \qquad (22)$$
and
$$\|x - x^*\| = \sqrt{2 - 2\cos\theta} = 2\sin(\theta/2).$$
Hence,
$$\lim_{\theta\to 0}\frac{\|x(1) - x^*\|}{\|x - x^*\|} = 1, \qquad (23)$$
which implies that the convergence rate is sub-linear. On the other hand, since
$$\|\nabla_\Omega f(x)\| = (1 - \cos\theta)\sin\theta = \frac{\theta^3}{2} + O(\theta^4) \qquad (24)$$
and
$$f(x) - f(x^*) = \frac{(\cos\theta - 1)^2}{2} = \frac{\theta^4}{8} + O(\theta^6),$$
we can see that $f(x) - f(x^*)$ cannot be bounded by $\|\nabla_\Omega f(x)\|^2$ as $\theta\to 0$.

**Fig. 1** The projection counter vs. cost function error of Nesterov's scheme (Nesterov), the gradient projection method using the optimal step size (Chambolle), and the gradient projection method using projected gradients (PG).

In this example, $\alpha = 1$ is the optimal step size in the following sense: for each point $x$ on the circle, we have
$$\arg\min_\alpha f(x - \alpha\nabla f) = 1.$$
As the sub-linear convergence is caused by the use of projections onto non-polyhedral constraint sets, rather than by the condition number of the system, the convergence speed cannot be improved by the BB method: because $\alpha = 1$ is the optimal choice, the BB method gives exactly the same iterates as the constant step size. Hence, the slow convergence cannot be alleviated with the aid of non-monotone spectral projected gradient methods or conjugate gradient methods. In fact, readers can easily verify that the limit of the ratio in Eq. (23) remains unchanged even when a different step size $\alpha$ is taken, provided that $0 < \alpha < 2/L_f$.

This phenomenon is primarily caused by the "orthogonality" between $\nabla f$ and $\nabla_\Omega f$ as $x\to x^*$ along the boundary of a non-polyhedral set:
$$\lim_{\theta\to 0}\frac{\nabla f\cdot\nabla_\Omega f}{\|\nabla f\|\|\nabla_\Omega f\|} = \lim_{\theta\to 0}\sin\theta = 0,$$
which does not occur in problems with polyhedral constraint sets.

To emphasize the slow convergence of this method, we compare the convergence speed of the gradient projection method using a constant step size (Chambolle) with Nesterov's scheme [29] (Nesterov) and the PG method:
$$x_{k+1} = x_k(\alpha_k) = [x_k - \alpha_k\nabla_\Omega f(x_k)]^+, \quad\text{with}\quad \alpha_k := \arg\min_\alpha f(x_k - \alpha\nabla_\Omega f(x_k)). \qquad (25)$$
That is, the step size in the PG method is selected to be the optimal step size along the projected gradient direction; this is a special case of the gradient projection method using $\nabla_\Omega f$, i.e., case 2 of Alg. 1. The same initial condition is used in each method: $(x_1, x_2) = (0, 1)$. As both the Nesterov method and the PG method use two projections in each iteration, in order to make a fair comparison, the x-axis represents the projection counter. The result shown in Fig. 1 indicates that the Chambolle method is the slowest to converge. The Nesterov method improves upon the convergence speed of the Chambolle method; however, its convergence rate is still sub-linear. In contrast, the PG method has linear convergence, which is proven in the following. Observe that, due to the structure of $f$, the optimal step size $\alpha_{opt}$ is always the one which forces the first component of $x - \alpha\nabla_\Omega f(x)$ to be 1. Then
$$x(\alpha_{opt}) = (x_2, 1 - x_1)/\sqrt{2 - 2x_1} = (\cos(\theta/2), \sin(\theta/2)), \quad\text{if } x = (\cos\theta, \sin\theta).$$
Hence, the convergence rate is indeed linear, with $\lim_{k\to\infty}\|x_{k+1} - x^*\|/\|x_k - x^*\| = 0.5$.

³ In this subsection, though we use $x_k$ for two different objects, with a little care the meaning of $x_k$ should be clear from the context.

*Example 2* Let $\Omega = \{(x_1, x_2, x_3, x_4)\in\mathbb{R}^4 : x_1^2 + x_2^2\le 1,\ x_3^2 + x_4^2\le 1\}$ and $f(x_1, x_2, x_3, x_4) = ((x_1 - 1)^2 + (x_3 - 2)^2)/2$. Then $x^* = (1, 0, 1, 0)$ is the minimizer, with $\nabla f(x^*) = (0, 0, -1, 0)$. Consider the point $x = (\cos\theta, \sin\theta, 1, 0)$ on the surface of $\Omega$. Using similar arguments as in the previous example, we can verify that the convergence rate of the Chambolle method is sub-linear in this higher-dimensional case.

*Example 3* Let $\Omega := \{(x_1, x_2)\in\mathbb{R}^2 : x_1^4\le x_2\}$ and $f(x_1, x_2) = (x_2 + 1)^2/2$. Then the minimizer is $x^* = (0, 0)$. Let $x = (\theta, \theta^4)$. The optimal step size of Chambolle's method is $\alpha = 1$, and $x(1) = (\theta - 4\theta^3, (\theta - 4\theta^3)^4) + O(\theta^4)$. Then
$$\|x(1) - x^*\| = \theta - 4\theta^3 + O(\theta^4), \qquad \|x - x^*\| = \theta + \theta^7/2 + O(\theta^8). \qquad (26)$$
Hence,
$$\lim_{\theta\to 0}\frac{\|x(1) - x^*\|}{\|x - x^*\|} = 1, \qquad (27)$$
which implies that the convergence rate is sub-linear. In fact, in this case, because
$$\|\nabla_\Omega f(x)\| = 4\theta^3 + O(\theta^7), \qquad f(x) - f(x^*) = \theta^4 + O(\theta^8), \qquad (28)$$
$f(x) - f(x^*)$ cannot be bounded by $\|\nabla_\Omega f(x)\|^2$ as $\theta\to 0$.

*Remark 2* In unconstrained optimization problems, the R-linear convergence of the BB method is known [17, 16]. These examples suggest that the two proposed assumptions should be necessary conditions for R-linear convergence of the BB method in this non-polyhedral constrained optimization problem.

*Remark 3* Here, we make a few comments on the PG method. In general, the aforementioned PG method does not converge linearly when the proposed conditions fail. In the first example, when $\Omega$ is replaced by
$$\Omega := \{x\in\mathbb{R}^2 : x_1\le 1 - x_2^m\}\quad\text{for an even integer } m\ge 4,$$
the curvature vanishes at $x^* = (1, 0)$, and the first component of $x(\alpha)$ is always 1 for each $x = (1 - \theta^m, \theta)$, $\theta\ne 0$. In this case, the PG method converges linearly with rate (the limit of the ratio in Eq. (23)) equal to $1 - 1/m$. Therefore, we know that the PG method converges only sub-linearly with $\Omega := \{x\in\mathbb{R}^2 : x_1\le 1 - \exp(-1/x_2^2)\}$ (note that $\exp(-1/x^2)$ is not real analytic). In fact, the PG method does not converge in the third example: observe that the second component of $x(\alpha)$ is always $-1$ and the first component of $x(\alpha)$ tends to $\pm\infty$ as $x$ approaches $x^*$. This result indicates the necessity of the Armijo rule to ensure convergence.

The above examples show that, in the worst case, the gradient projection method using gradients can converge very slowly. It is known that many numerical methods for solving the primal formulation of the ROF model have linear convergence, including the explicit gradient descent method. The natural question that arises is why Chambolle's method is effective, if its convergence rate is potentially only sub-linear. Most numerical experiments show that Chambolle's method converges linearly in solving the dual formulation of the ROF model. This is the motivation for our study of the convergence rates of gradient projection methods.

2.5 Finite identiﬁcation

An important result regarding the non-degeneracy condition is the finite identification property: the set $\mathcal{A}(x^*)$ can be identified in finitely many iterations by the gradient projection methods in Alg. 1. This property guarantees that $x_k$ will enter and remain in a set $\mathcal{B}$ defined in Eq. (32).

Consider the subset $\hat{\mathcal{B}}$ of $\Omega$,
$$\hat{\mathcal{B}} := \{x\in\Omega : g_i(x) = 0\ \text{for } i\in\mathcal{A}(x^*),\ g_i(x) < 0\ \text{for } i\notin\mathcal{A}(x^*)\}, \qquad (29)$$
in which every point has the same active index set as $x^*$ does, i.e.,
$$\mathcal{A}(x) = \mathcal{A}(x^*)\quad\text{for each } x\in\hat{\mathcal{B}}. \qquad (30)$$
Clearly, $\hat{\mathcal{B}}$ contains $x^*$; thus $\hat{\mathcal{B}}$ is nonempty.

For each $x\in\hat{\mathcal{B}}$, let $\{\lambda_i(x) : i\in\mathcal{A}(x^*)\}$ be a minimizer of the least squares problem
$$\min_{\lambda_i\in\mathbb{R}}\Big\|-\nabla f(x) - \sum_{i\in\mathcal{A}(x^*)}\lambda_i\nabla g_i(x)\Big\|^2. \qquad (31)$$

Recall the non-degeneracy condition: $-\nabla f(x^*) = \sum_{i\in\mathcal{A}(x^*)}\lambda_i(x^*)\nabla g_i(x^*)$ with $\lambda_i(x^*) > 0$.

Let $\mathcal{B}$ be a subset of $\hat{\mathcal{B}}$,
$$\mathcal{B} := \{x\in\hat{\mathcal{B}} : \|x - x^*\| < \epsilon\}\quad\text{for some } \epsilon > 0, \qquad (32)$$
such that for each $x\in\mathcal{B}$ we have:

1. $\{\nabla g_i(x) : i\in\mathcal{A}(x^*)\}$ are linearly independent;
2. $\{\lambda_i(x) : i\}$ is determined uniquely, with $\lambda_i(x) > 0$ for $i\in\mathcal{A}(x)$ and $\lambda_i(x) = 0$ for $i\notin\mathcal{A}(x)$.

The existence of $\epsilon$ stems from the $C^1$ continuity of the functions $\{\lambda_i(x)\}$, which is ensured by the following observations:

1. Because $\{\nabla g_i(x^*) : i\in\mathcal{A}(x^*)\}$ are linearly independent, $\{\nabla g_i(x) : i\in\mathcal{A}(x^*)\}$ are linearly independent for each $x\in\mathbb{R}^n$ near $x^*$. That is, $\det(G^\top G)\ne 0$, where $G$ is the matrix whose columns are $\{\nabla g_i(x) : i\in\mathcal{A}(x^*)\}$.
2. When $x$ is near $x^*$, the solution $\{\lambda_i(x) : i\in\mathcal{A}(x^*)\}$ of the least squares problem is a rational function of $\nabla f(x)$ and $\{\nabla g_i(x) : i\in\mathcal{A}(x^*)\}$, where $\nabla f$ and $\nabla g_i$ are $C^1(\mathbb{R}^n)$. As the denominator of this rational function, $\det(G^\top G)$, is nonzero near $x^*$, $\{\lambda_i(x)\}$ is uniquely determined.
3. Because $g_i(x^*) < 0$ for each $i\notin\mathcal{A}(x^*)$, $g_i(x)$ remains negative near $x^*$, which implies that $\lambda_i(x) = 0$ for such $i$.
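As a small numerical illustration of the least squares problem (31) via the normal equations $G^\top G\lambda = -G^\top\nabla f$ (the instance is our own): on the unit disk with $f = (x_1 - 2)^2/2$, the single constraint active at $x^* = (1, 0)$ yields $\lambda_1(x^*) = 0.5 > 0$, so the non-degeneracy condition holds there.

```python
# Normal equations for the least squares problem (31) at x* = (1, 0).
# Toy instance (ours): Omega the unit disk, f(x) = (x1 - 2)^2 / 2, so the
# single active constraint is g1(x) = x1^2 + x2^2 - 1.
grad_f = (-1.0, 0.0)    # grad f(1, 0) = (x1 - 2, 0) = (-1, 0)
grad_g1 = (2.0, 0.0)    # grad g1(1, 0) = (2*x1, 2*x2) = (2, 0)

# G has the single column grad_g1; det(G^T G) = 4 != 0, so lambda is unique.
gtg = grad_g1[0] ** 2 + grad_g1[1] ** 2
rhs = -(grad_g1[0] * grad_f[0] + grad_g1[1] * grad_f[1])   # -G^T grad_f
lam = rhs / gtg         # lambda_1(x*) = 0.5 > 0: Assumption 2 holds
```

With several active constraints the same computation becomes a small linear solve with the Gram matrix $G^\top G$.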

**Theorem 1 (Finite identification)** *Suppose that $f$ and $\{g_i : i = 1,\dots,m\}$ are $C^2$, with Lipschitz continuous gradients. Let $x^*$ be the limit of $\{x_k : k\}$ generated by the gradient projection method Alg. 1. Then $x_k\in\mathcal{B}$ for $k$ sufficiently large, where $\mathcal{B}$ is defined in Eq. (32).*

*Proof* Because the sequence $\{x_k : k\}$ generated by the gradient projection methods converges to a stationary point $x^*$, we have $\lim_{k\to\infty} x_k = x^*$, which implies that $\mathcal{A}(x_k)\subset\mathcal{A}(x^*)$ for $k$ sufficiently large. Assume that there exist an infinite subsequence $K$ and an index $r$ such that $r\in\mathcal{A}(x^*)$ but $r\notin\mathcal{A}(x_k)$ for all $k\in K$. Let $P_k$ be the projection onto the space
$$\{v : v\cdot\nabla g_i(x_k) = 0\ \text{for all } i\in\mathcal{A}(x^*),\ i\ne r\},$$
and let $P$ be the projection onto the space
$$\{v : v\cdot\nabla g_i(x^*) = 0\ \text{for all } i\in\mathcal{A}(x^*),\ i\ne r\}.$$
Then $P_k\nabla g_r(x_k)\in T(x_k)$ for all $k\in K$, because $r\notin\mathcal{A}(x_k)$. Hence, by Eq. (6) we have
$$-\nabla f(x_k)\cdot P\nabla g_r(x_k) = -\nabla f(x_k)\cdot P_k\nabla g_r(x_k) - \nabla f(x_k)\cdot(P - P_k)\nabla g_r(x_k)$$
$$\le\|\nabla_\Omega f(x_k)\|\,\|P_k\nabla g_r(x_k)\| + \|\nabla f(x_k)\|\,\|P - P_k\|\,\|\nabla g_r(x_k)\|.$$
Since $\{x_k\}$ converges to $x^*$, from Eq. (7) and $\lim_{k\to\infty}\|P - P_k\| = 0$, it follows that
$$-\nabla f(x^*)\cdot P\nabla g_r(x^*)\le 0.$$
On the other hand, the linear independence of the active constraint normals guarantees $P\nabla g_r(x^*)\ne 0$. By the non-degeneracy condition,
$$-\nabla f(x^*)\cdot P\nabla g_r(x^*) = \lambda_r(x^*)\|P\nabla g_r(x^*)\|^2 > 0.$$
This contradiction proves that $\mathcal{A}(x_k) = \mathcal{A}(x^*)$ for all $k$ sufficiently large; that is, $x_k\in\hat{\mathcal{B}}$ for all $k$ sufficiently large.

Finally, as $x_k$ converges to $x^*$, for $k$ sufficiently large we have $\|x_k - x^*\| < \epsilon$, i.e., $x_k\in\mathcal{B}$. ⊓⊔
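Theorem 1 can be observed on a small non-degenerate instance (our own toy example): for $f = (x_1 - 2)^2/2$ on the unit disk, the lone constraint $g_1(x) = x_1^2 + x_2^2 - 1$ is active at $x^* = (1, 0)$ with $\lambda_1(x^*) = 0.5 > 0$, and the constant-step iterates reach the boundary after finitely many steps and never leave it.

```python
import math

# Non-degenerate toy instance (ours): f = (x1 - 2)^2 / 2 on the unit disk.
# g1(x) = x1^2 + x2^2 - 1 is active at x* = (1, 0) with lambda_1 = 0.5 > 0.
def step(x, alpha=0.2):
    y = (x[0] - alpha * (x[0] - 2.0), x[1])    # x - alpha * grad f(x)
    r = math.hypot(y[0], y[1])
    return (y[0] / r, y[1] / r) if r > 1.0 else y

x = (-0.5, 0.2)                                # start in the interior
active = []
for _ in range(40):
    x = step(x)
    active.append(abs(math.hypot(x[0], x[1]) - 1.0) < 1e-12)
first_active = active.index(True)              # finite identification step
```

After `first_active` iterations the active set $\{1\}$ is identified and the iterates stay on the boundary surface, as the theorem predicts.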

According to the above finite identification, if $\hat{\mathcal{B}} = \{x^*\}$, then the minimizer is found in finitely many iterations. According to Theorem 2.4 in [35], $\mathcal{B}$ is a class-$C^2$ identifiable surface. (Similarly, $\mathcal{B}$ is a class-$C^p$ identifiable surface if $g_i\in C^p$.)

2.6 The main theorem

In the following theorem, we show linear convergence of the gradient projection methods in Alg 1. The proofs of the two inequalities (33) and (34) and of the existence of positive lower bounds for the step sizes are rather lengthy, and will be given in section 3 and section 4, respectively.

**Theorem 2** *Suppose that Assumptions 1 and 2 hold. Then linear convergence can be obtained.*

*More precisely, there exist a subset $\mathcal{M}\subset\mathcal{B}$ and two positive scalars $c_1, c_2$, such that for each nonminimizer $x\in\mathcal{M}$ the following inequalities hold:*
$$f(x) - f(x^*)\le c_1\|\nabla_\Omega f(x)\|^2, \qquad (33)$$
$$f(x) - f(x(\alpha))\ge c_2\,\alpha\,\|\nabla_\Omega f(x)\|^2. \qquad (34)$$
*Based on these two inequalities, we have linear convergence for nonminimizers $x_k$:*
$$f(x_{k+1}) - f(x^*)\le r\,(f(x_k) - f(x^*)), \qquad (35)$$
*where $r := 1 - c_2\alpha_{\min}/c_1$, with $\alpha_{\min} := \liminf_{k\to\infty}\alpha_k > 0$. Besides, if $x_k$ is a minimizer, then linear convergence is obvious.*

*Proof* According to the finite identification, there exists some positive integer k_0 such that x_k ∈ M for all k ≥ k_0. From inequalities (33, 34), for each k ≥ k_0,

f(x_{k+1}) − f(x^∗) ≤ f(x_k) − f(x^∗) − c_2 α_k ∥∇_Ω f(x_k)∥² ≤ r (f(x_k) − f(x^∗)), (36)

where r := 1 − c_2 α_k / c_1. Finally, the proof is completed by noting that we do not have a diminishing step size, i.e., lim inf_{k→∞} α_k > 0: for case 2, see Theorem 6; for case 1b, see [3]. ⊓⊔
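The contraction (35) can be observed numerically. The following is a small illustrative sketch (not from the paper, and with an arbitrarily chosen quadratic f and unit-ball Ω): projected gradient descent where the unconstrained minimizer lies outside Ω, so the constrained minimizer x^∗ falls on the boundary; the per-step ratios of the objective gaps stay below 1, consistent with a linear rate.

```python
import numpy as np

# Illustrative sketch (not from the paper): projected gradient descent for
# f(x) = 0.5 (x-b)' A (x-b) over the unit ball, with b outside the ball so
# that the minimizer x* lies on the boundary.  The per-step gap ratios
# (f(x_{k+1}) - f*) / (f(x_k) - f*) stay below 1, as in Eq. (35).
A = np.diag([1.0, 4.0])
b = np.array([2.0, 1.0])
f = lambda x: 0.5 * (x - b) @ A @ (x - b)
grad = lambda x: A @ (x - b)

def proj(y):
    """Euclidean projection [y]^+ onto the unit ball."""
    n = np.linalg.norm(y)
    return y if n <= 1.0 else y / n

alpha = 0.25                      # step size 1/L, L = largest eigenvalue of A
x = np.zeros(2)                   # estimate f* by iterating to convergence
for _ in range(5000):
    x = proj(x - alpha * grad(x))
f_star = f(x)

x = np.zeros(2)
ratios = []
for k in range(40):
    x_new = proj(x - alpha * grad(x))
    g0, g1 = f(x) - f_star, f(x_new) - f_star
    if g0 > 1e-12:
        ratios.append(g1 / g0)    # empirical contraction factor r
    x = x_new

print(max(ratios) < 1.0)          # True: linear decrease of the gap
```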

*Remark 4* This framework can be regarded as a generalization of the proof based on ∇f for the unconstrained optimization problem (e.g., page 87 in [4]). For comparison, in [27] Luo et al. proved linear convergence through the following two inequalities:

f(x) − f(x^∗) ≤ c_1 ∥x − [x − ∇f(x)]^+∥², (37)

f(x) − f(x(α)) ≥ c_2 ∥x − [x − ∇f(x)]^+∥² (Eq. (3.3) in [27]), (38)

where Eq. (37) is derived by combining Eqs. (3.2) and (3.4) in [27]. Our approach resembles theirs in observing that lim_{α→0^+} (x − [x − α∇f(x)]^+)/α = ∇_Ω f(x). The main difference is that their convergence result requires the polyhedral assumption on the constraint sets, although it does not require the non-degeneracy condition.
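The limit lim_{α→0^+} (x − [x − α∇f(x)]^+)/α = ∇_Ω f(x) can be checked numerically in a toy setting. In this hypothetical example Ω is the unit ball, x is a boundary point whose projection step is active, and ∇_Ω f(x) reduces to the tangential component ∇f(x) − (x·∇f(x))x; the scaled projection residual approaches it as α shrinks.

```python
import numpy as np

# Toy check (assumptions: Omega is the unit ball; x is a boundary point and
# x - alpha*grad leaves the ball, so the projection is active).
def proj(y):
    n = np.linalg.norm(y)
    return y if n <= 1.0 else y / n

x = np.array([1.0, 0.0])            # point on the boundary
g = np.array([-1.0, 2.0])           # a gradient value with x . g < 0

# For the active ball constraint, the projected gradient is the tangential part
grad_Omega = g - np.dot(x, g) * x   # here: (0, 2)

for alpha in [1e-2, 1e-4, 1e-6]:
    residual = (x - proj(x - alpha * g)) / alpha
    err = np.linalg.norm(residual - grad_Omega)
    print(alpha, err)               # err shrinks like O(alpha)
```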

**3 Proof of two inequalities**

We shall verify inequalities (33, 34) in two cases: (a) x^∗ lies in the interior of Ω; (b) x^∗ lies on the boundary ∂Ω.

When x^∗ is an interior point, the minimization problem is equivalent to an unconstrained optimization problem, for which the linear convergence result is known; see [4] or [27]. In this section, we focus on the case x^∗ ∈ ∂Ω, and on the case that the set B with x^∗ excluded, B \ {x^∗}, is nonempty; otherwise, the minimizer is found in finitely many iterations by Theorem 1.

3.1 Curvature defined by shape operators

Thanks to finite identification, the iterates {x_k} eventually enter and remain in the set B. In the following, we examine its "curvature", which is defined through a shape operator S on B. Keep in mind that B in general is not a surface; its "curvature" is defined through a normal vector field induced by ∇f. Hence our curvature is not a purely geometric property.

**[Curvature of g_i(x) = 0]:** The curvature of each surface {x : g_i(x) = 0} can be defined as follows (see, e.g., [30]). Let n_i(x) be a unit normal vector,

n_i(x) = ∇g_i(x)/∥∇g_i(x)∥ for each x with ∥∇g_i(x)∥ > 0. (39)

Let the tangent plane T_i(x) with respect to the surface be the subspace {y ∈ R^n : y·∇g_i(x) = 0}, and let the projection operator E_{T_i(x)} be

E_{T_i(x)} u := arg min_{y∈T_i(x)} ∥u − y∥. (40)
Then for each tangent vector v in T_i(x), we define the shape operator S_i at x with respect to v by

S_i(v) = lim_{t→0^+} (n_i(x + tv) − n_i(x))/t = E_{T_i(x)} ∇²g_i(x) E_{T_i(x)} v / ∥∇g_i(x)∥. (41)

Then S_i(v) is a tangent vector in T_i(x). When v ≠ 0, the quantity

S_i(v)·v/∥v∥² = v′∇²g_i(x)v / (∥v∥² ∥∇g_i(x)∥) (42)

is called the normal curvature of the surface at x in the direction v. According to Lemma 2.1 in [35], E_{T_i(x)} ∇²g_i(x) E_{T_i(x)} is positive semi-definite, or equivalently, the smallest principal curvature (the smallest eigenvalue) of S_i is nonnegative. Therefore, the positive definiteness of the shape operator S_i is equivalent to the positive definiteness of E_{T_i(x)} ∇²g_i(x) E_{T_i(x)}, or geometrically, the positive principal curvature condition.
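As a concrete sanity check (not part of the paper's argument), take the sphere of radius r, g_1(x) = ∥x∥² − r², so ∇²g_1 = 2I and ∥∇g_1(x)∥ = 2r; the normal curvature (42) then equals 1/r in every tangent direction, matching the classical curvature of a sphere. A short script verifies this against the finite-difference definition (41):

```python
import numpy as np

# Sanity check: for the sphere of radius r, g_1(x) = ||x||^2 - r^2 gives
# Hess g_1 = 2I and ||grad g_1(x)|| = 2r, so the normal curvature (42)
# is 1/r in every tangent direction.
r = 2.0
grad_g = lambda y: 2.0 * y
hess_g = 2.0 * np.eye(3)

x = r * np.array([0.0, 0.0, 1.0])     # a point on the sphere
v = np.array([1.0, 0.0, 0.0])         # a unit tangent vector at x

kappa = (v @ hess_g @ v) / (np.dot(v, v) * np.linalg.norm(grad_g(x)))
print(kappa)                           # 0.5 = 1/r

# Cross-check with the finite-difference definition (41) of S_1(v)
n = lambda y: grad_g(y) / np.linalg.norm(grad_g(y))
t = 1e-6
S1_v = (n(x + t * v) - n(x)) / t
print(S1_v @ v)                        # also close to 0.5
```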

**[Curvature of B]:** In the following, we shall define the "curvature" of the surface B. Thanks to the non-degeneracy condition, we can assign a normal vector field n_f to every point x in R^n according to the gradient vector field ∇f:

n_f(x) := −∇f(x) + ∇_Ω f(x) = lim_{α→0^+} (x − α∇f(x) − [x − α∇f(x)]^+)/α. (43)

Obviously, we have the orthogonal decomposition

−∇f = −∇_Ω f + n_f. (44)

For each x ∈ R^n with n_f(x) ≠ 0, we can assign a unit normal vector field n(x) := n_f(x)/∥n_f(x)∥. Hence, for x ∈ B, we have n_f(x) = ∑_{i∈A(x)} λ_i(x)∇g_i(x) with λ_i(x) > 0.
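The limit in (43) and the orthogonal decomposition (44) can be verified in a toy setting. In this hypothetical example, Ω is the unit ball (so the single active constraint at a boundary point is g_1(x) = ∥x∥² − 1, with ∇g_1 = 2x), and the gradient value is chosen arbitrarily so that the projection is active:

```python
import numpy as np

# Toy verification of Eqs. (43)-(44) (assumption: Omega is the unit ball, so
# the single active constraint at a boundary point is g_1(x) = ||x||^2 - 1).
def proj(y):
    n = np.linalg.norm(y)
    return y if n <= 1.0 else y / n

x = np.array([0.6, 0.8])                 # boundary point, ||x|| = 1
grad_f = np.array([0.5, -1.0])           # a gradient value with x . grad_f < 0

E_x = np.eye(2) - np.outer(x, x)         # projection onto TB_x
grad_Omega = E_x @ grad_f                # E_x(-grad f) = -grad_Omega f, Eq. (47)
n_f = -grad_f + grad_Omega               # Eq. (43), algebraic form

# n_f agrees with the limit in Eq. (43)
alpha = 1e-7
n_f_limit = (x - alpha * grad_f - proj(x - alpha * grad_f)) / alpha
print(np.linalg.norm(n_f_limit - n_f) < 1e-5)          # True

print(abs(np.dot(n_f, grad_Omega)) < 1e-12)            # orthogonality in (44)
lam = np.dot(n_f, 2.0 * x) / np.dot(2.0 * x, 2.0 * x)  # n_f = lam * grad g_1
print(lam > 0)                                         # lambda_1(x) > 0
```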

For each x ∈ B, we can assign a tangent plane (a subspace of R^n),

TB_x := {y ∈ R^n : y·∇g_i(x) = 0, i ∈ A(x)}. (45)

For each vector u ∈ R^n, denote the linear projection operator E_x by

E_x u := arg min_{y∈TB_x} ∥u − y∥. (46)

Readers can easily verify the linearity of E_x and

E_x ∇g_i(x) = 0 for i ∈ A(x), which implies that E_x(−∇f(x)) = −∇_Ω f(x). (47)
**Definition 2 (The shape operator S on B)** Fix a point x ∈ B. Let n be the unit normal vector induced by ∇f as above. For each (tangent) vector v in TB_x, ∇_v n denotes the covariant derivative on B with respect to the tangent vector v,

∇_v n = lim_{t→0^+} (n(x + tv) − n(x))/t. (48)

As the dimension of B is not necessarily n − 1, the vector ∇_v n is not necessarily a tangent vector in TB_x. Thus, the shape operator at x is defined as its projection onto TB_x,

S(v) := E_x ∇_v n. (49)

Unlike each S_i, S depends not only on the surfaces {x : g_i(x) = 0, i ∈ A(x)} but also on the function f. In the following lemma, we investigate the connection between the "curvature" of B and the curvature of each surface {x : g_i(x) = 0}.

**Lemma 1** For each tangent vector v in TB_x,

∥n_f(x)∥ S(v) = ∑_{i∈A(x)} λ_i(x) E_x ∇²g_i(x) v = ∑_{i∈A(x)} λ_i(x) E_x S_i(v) ∥∇g_i(x)∥, (50)

where n_f(x) = ∑_{i∈A(x)} λ_i(x)∇g_i(x). Thus, S(v) is a conical combination of the E_x S_i(v),

S(v) = ∑_{i∈A(x)} [λ_i(x)∥∇g_i(x)∥ / ∥∑_{j∈A(x)} λ_j(x)∇g_j(x)∥] E_x S_i(v). (51)

*Proof* Since n(x) = ∑_{i∈A(x)} λ_i(x)∇g_i(x)/∥n_f(x)∥, we have

S(v) = E_x ∇_v n = ∑_{i∈A(x)} E_x ∇_v (λ_i ∇g_i/∥n_f∥). (52)

For each i, we have

∇_v (λ_i ∇g_i/∥n_f∥) = λ_i ∇²g_i v/∥n_f∥ + terms consisting of ∇g_i. (53)

Thanks to Eq. (47), the terms with ∇g_i vanish after the projection E_x is applied, which justifies the first equality in Eq. (50). Then the definition of S_i in Eq. (41) yields the second expression in Eq. (50). ⊓⊔
Observe that ∇²g_i restricted to TB_x is positive semi-definite, and the triangle inequality ∥∑_{i∈A(x)} λ_i∇g_i∥ ≤ ∑_{i∈A(x)} λ_i∥∇g_i∥ holds for λ_i ≥ 0. Let λ̄_i := λ_i∥∇g_i∥ (∑_{j∈A(x)} λ_j∥∇g_j∥)^{−1} ≥ 0. From Eq. (51), we have

v·S(v) ≥ ∑_{i∈A(x)} λ̄_i v·S_i(v), with ∑_{i∈A(x)} λ̄_i = 1, (54)

which implies that S restricted to TB_x is positive semi-definite. In addition, S is positive definite if at least one of the S_i restricted to TB_x is positive definite, or equivalently, if E_x ∇²g_i E_x (restricted to TB_x) is positive definite.

3.2 The first inequality

In the next lemma, we analyze the leading term of ∇_Ω f(x),

∇_Ω f(x) = F(x − x^∗) + higher order terms, (55)

where F is some positive semi-definite matrix.

**Lemma 2** Assume that the non-degeneracy condition holds. For x ∈ B,

lim_{x→x^∗} ∇_Ω f(x)/∥x − x^∗∥ = E_{x^∗} [∇²f(x^∗) + ∑_{i∈A(x^∗)} λ_i(x^∗) ∇²g_i(x^∗)] lim_{x→x^∗} E_{x^∗} (x − x^∗)/∥x − x^∗∥. (56)

Equivalently, using S_i to replace ∇²g_i(x) (Eq. (41)), we have

∇_Ω f(x) = E_{x^∗} [∇²f(x^∗) + ∑_{i∈A(x^∗)} λ_i(x^∗) ∥∇g_i(x^∗)∥ S_i] E_{x^∗}(x − x^∗) + O(∥x − x^∗∥²). (57)

*Proof* Note that

∇_Ω f(x) = ∇f(x) + ∑_{i∈A(x)} λ_i(x)∇g_i(x), (58)

and

0 = ∇_Ω f(x^∗) = ∇f(x^∗) + ∑_{i∈A(x^∗)} λ_i(x^∗)∇g_i(x^∗). (59)

Using the mean value theorem, and because {λ_i(x), ∇g_i(x)} are continuously differentiable, the difference of these two equations is

∇_Ω f(x) = ∇²f(x)(x − x^∗) + ∑_{i∈A(x^∗)} λ_i(x^∗)(∇g_i(x) − ∇g_i(x^∗)) (60)
 + ∑_{i∈A(x^∗)} (λ_i(x) − λ_i(x^∗))∇g_i(x) + O(∥x − x^∗∥²). (61)

Because ∇_Ω f(x) ∈ TB_x, we can apply E_x to both sides of Eq. (60),

∇_Ω f(x) = E_x ∇²f(x)(x − x^∗) + ∑_{i∈A(x^∗)} λ_i(x^∗) E_x(∇²g_i(x^∗)(x − x^∗)) + O(∥x − x^∗∥²), (62)

where E_x(∇g_i(x)) = 0 is used.

Hence, the leading term of ∇_Ω f(x)/∥x − x^∗∥ is

lim_{x→x^∗} ∇_Ω f(x)/∥x − x^∗∥ = E_{x^∗} [∇²f(x^∗) + ∑_{i∈A(x^∗)} λ_i(x^∗) ∇²g_i(x^∗)] lim_{x→x^∗} E_{x^∗} (x − x^∗)/∥x − x^∗∥. (63)

⊓⊔
This lemma points out that ∇_Ω f(x) can be approximated locally by F(x − x^∗), where the matrix F (restricted to TB_{x^∗}) is

F := E_{x^∗} [∇²f(x^∗) + ∑_{i∈A(x^∗)} λ_i(x^∗) ∇²g_i(x^∗)] E_{x^∗}. (64)

Hence, to obtain a positive lower bound for ∥∇_Ω f(x)∥/∥x − x^∗∥, it suffices that the matrix F be positive definite. This is true if at least one of {∇²g_i(x^∗) : i ∈ A(x^∗)} restricted to TB_{x^∗} is positive definite.
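The local model (55) can be illustrated numerically. In this hypothetical example (not from the paper), Ω is the unit ball in R², f is an arbitrarily chosen quadratic whose minimizer sits on the boundary, and F is assembled from Eq. (64) with the single active constraint g_1(x) = ∥x∥² − 1 (so ∇²g_1 = 2I); the relative error of the approximation ∇_Ω f(x) ≈ F(x − x^∗) shrinks as x approaches x^∗ along the boundary.

```python
import numpy as np

# Numerical illustration of grad_Omega f(x) ~ F (x - x*), Eqs. (55) and (64)
# (assumptions: Omega is the unit ball in R^2, f(x) = 0.5 (x-b)' A (x-b),
# active constraint g_1(x) = ||x||^2 - 1 with Hess g_1 = 2I).
A = np.diag([1.0, 4.0])
b = np.array([2.0, 1.0])
grad = lambda y: A @ (y - b)

def proj(y):
    n = np.linalg.norm(y)
    return y if n <= 1.0 else y / n

x = np.zeros(2)                          # locate x* by projected gradient
for _ in range(20000):
    x = proj(x - 0.25 * grad(x))
x_star = x / np.linalg.norm(x)           # snap onto the boundary

lam = -np.dot(x_star, grad(x_star)) / 2.0        # KKT multiplier, grad g_1 = 2x*
E = np.eye(2) - np.outer(x_star, x_star)         # E_{x*}
F = E @ (A + 2.0 * lam * np.eye(2)) @ E          # Eq. (64)

theta0 = np.arctan2(x_star[1], x_star[0])
errs = []
for t in [1e-2, 1e-3, 1e-4]:                     # move along the boundary
    xt = np.array([np.cos(theta0 + t), np.sin(theta0 + t)])
    Ex = np.eye(2) - np.outer(xt, xt)
    grad_Omega = Ex @ grad(xt)                   # projected gradient at xt
    errs.append(np.linalg.norm(grad_Omega - F @ (xt - x_star))
                / np.linalg.norm(xt - x_star))
print(errs)   # relative error shrinks as xt -> x*, consistent with Eq. (55)
```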

**Theorem 3** Assume that the non-degeneracy condition holds. Suppose that at least one of {∇²g_i(x^∗) : i ∈ A(x^∗)} restricted to TB_{x^∗} is positive definite with smallest eigenvalue µ_i^min, i.e., at least one of the {µ_i^min} is positive. Let µ^min = ∑_{i∈A(x^∗)} λ_i(x^∗) µ_i^min. Then for c_1 ∈ (0, 1/µ^min), there exists δ > 0 such that if x ∈ B with ∥x − x^∗∥ < δ, then

∥x − x^∗∥ ≤ c_1 ∥∇_Ω f(x)∥ and f(x) − f(x^∗) ≤ c_1 ∥∇_Ω f(x)∥². (65)