
PREPRINT

Preprint, Department of Mathematics, National Taiwan University

www.math.ntu.edu.tw/~mathlib/preprint/2011-05.pdf

Linear convergence analysis of the use of gradient projection methods on total variation problems

Pengwen Chen and Changfeng Gui

October, 2011


Abstract Optimization problems using total variation frequently appear in image analysis models, in which the sharp edges of images are preserved. Direct gradient descent methods usually yield very slow convergence when used for such optimization problems. Recently, many duality-based gradient projection methods have been proposed to accelerate the speed of convergence. In this dual formulation, the cost function of the optimization problem is singular, and the constraint set is not a polyhedral set. In this paper, we establish two inequalities related to projected gradients and show that, under some non-degeneracy conditions, the rate of convergence is linear.

Keywords Total variation, gradient projection methods, non-degeneracy conditions, linear convergence, projected gradients.

Mathematics Subject Classification (2000) 65K10, 49K35

1 Introduction

In this paper, we study the convergence rate of the gradient projection method in solving the following problem:

min_{x∈Ω} f(x),   (1)

where f is a convex, continuously differentiable function whose gradient ∇f is Lipschitz continuous on a nonempty, convex and closed set Ω. Without further

Pengwen Chen

Department of Mathematics, National Taiwan University, Taiwan. E-mail: pengwen@math.ntu.edu.tw

Changfeng Gui

Department of Mathematics, University of Connecticut, Storrs, CT 06268. E-mail: gui@math.uconn.edu


assumptions, it is known that the worst-case rate of convergence could be sub-linear [28]. In this paper, we prove that linear convergence can be obtained when the following two conditions hold:

1. The cost function in (1) can be expressed in the form f(x) = h₀(E₀x) and the feasible set can be expressed in the form

Ω = {x ∈ Rⁿ : g_i(x) ≤ 0, for i = 1, . . . , m} ⊂ Rⁿ,   (2)

where g_i(x) = h_i(E_i x). Here, each E_i, i = 0, . . . , m, is a nonzero matrix with n columns, and {h_i : i = 0, . . . , m} are strongly convex, twice differentiable C² functions. (Note that these matrices E_i could be singular.)

2. The minimizer in (1) satisfies a non-degeneracy condition (see Assumption 2).

Gradient projection methods with constant step sizes were first proposed by Goldstein [23] and Levitin and Poljak [26]. Bertsekas [3] proposed the Armijo rule along the projection arc and studied its convergence behavior.

This method can identify the set of active inequality constraints in a finite number of iterations (this property is called finite identification). Moreover, this method can be combined with conjugate gradient methods or Newton's method to achieve super-linear convergence [3, 8, 35]. Practically, the gradient projection method is especially well-suited for large-scale problems with simple constraint structures, e.g., a box, a polyhedral set, or a Cartesian product of Euclidean hypercubes or balls. It is well known that the convergence rate of gradient projection methods is linear in the following cases:

1. f is strongly convex [18].

2. f is not strongly convex, but f(x) = h₀(E₀x) + q·x, where q is some vector in Rⁿ, E₀ is some matrix, and h₀ is a strictly convex and essentially smooth function with ∇²h₀(E₀x*) being positive definite for a minimizer x*. Additionally, the convex set Ω must be a polyhedral set [27].

Clearly, the problem mentioned above does not fit either of these two cases.

This research problem originates from the dual formulation of the image restoration model proposed by Rudin, Osher and Fatemi [31]. Currently, the model is called the ROF model, and the total variation is introduced to preserve the sharp edges of images. The ROF model was later modified to tackle image de-noising [33], image de-blurring [15, 34] and image segmentation [14, 32, 11, 7] problems successfully. Because of the inclusion of the total variation term, the Euler-Lagrange equation for this optimization problem becomes highly nonlinear, and many numerical methods converge quite slowly for this problem, especially when the explicit gradient descent method is employed.

Many algorithms were proposed to alleviate this numerical difficulty [33, 6, 13, 12]. A key breakthrough was made by Chambolle [9, 10]. He proposed a fixed point algorithm and a gradient projection method with constant step size based on the dual formulation of total variation. These two algorithms soon became popular due to their simplicity and acceptable convergence speeds.

Recently, various accelerated algorithms have been proposed [38, 36, 1] based


on this dual formulation. These algorithms use different step size rules, including techniques based on the Barzilai-Borwein (BB) method [2, 21, 5] and Nesterov's scheme [29]. The numerical results of these algorithms indicate that these gradient projection methods indeed converge much faster than the original Chambolle gradient projection method. In addition to the algorithms mentioned above, there exist other interesting and efficient algorithms for handling the image restoration problem, e.g., graph cut methods [10, 25] and the split Bregman method [24]. The dual formulation of the ROF model can be expressed as a special case of the problem (1). In this paper, we shall study the convergence rate of the gradient projection method used to solve the problem (1). This theoretical result can also be used to explain the occurrence of linear convergence in Chambolle's gradient projection method.

According to the framework developed by Nesterov, when f is convex and possesses a Lipschitz continuous gradient, without exploiting the structure of the problem, the cost function in the gradient method converges as O(1/k), where k is the iteration counter [28]. To improve the speed of convergence, Nesterov proposed an accelerated scheme, in which the convergence speed becomes O(1/k²).¹ In this paper, we point out explicitly under which circumstances the gradient projection method can converge linearly. This provides an explanation as to why accelerated gradient projection methods can converge faster than Nesterov's scheme, as reported in a numerical result [36]. Briefly speaking, two conditions are required for linear convergence. The first condition is non-degeneracy, which stipulates that the gradient projection method has the finite identification property. The second condition specifies the form of the cost function and the constraint set. In a way, the second condition leads to positive curvature of "the constraint surface".

In fact, the slow convergence of Chambolle's gradient projection method can be attributed to two factors. One comes from the cost function f, and the other comes from {g_i : i}.² First, when the system has a large condition number, the use of a constant step size is usually not an optimal choice. Numerical results shown in the reports [36, 38, 20] indicate that various BB schemes can improve convergence speed, when the step size is estimated by least-squares fits. The second factor to which slow convergence is attributed comes from the non-polyhedral constraint set. Unlike the polyhedral case, the convergence rate of Chambolle's gradient projection method could be sub-linear in the worst case. This has nothing to do with the condition number of the system, so the improvement from the BB method could be limited or nonexistent (see examples in section 2.4). In this situation, Nesterov's scheme is a better approach. Which factor dominates depends on whether the non-degeneracy condition is fulfilled. When the non-degeneracy condition holds, the gradient projection method converges linearly (though perhaps very slowly), and can be accelerated by various non-monotone spectral projected gradient methods.

¹ In fact, Nesterov pointed out that O(1/k²) is optimal when no further information about the problem is exploited.

² Throughout this paper, when no specified range is given for i, we mean that i ranges over all its possible values.


In this paper, two conditions are proposed to establish linear convergence in two gradient projection methods: the first method uses gradients, and the second method uses projected gradients (defined in Definition 1). A few examples are provided to explain the necessity of the proposed conditions. Linear convergence is established by constructing two inequalities involving the norm of projected gradients. We also study the convergence property of the gradient projection method using projected gradients, a method which was proposed implicitly in [38].

This paper is organized as follows. The formal description of our problem is given in section 2. The linear convergence analysis of the dual formulation is one special case of our problem. Two major inequalities are introduced to prove linear convergence. The proofs of these inequalities themselves are quite lengthy, and are given in section 3. In section 4, we present the convergence property of the second gradient projection method and demonstrate the existence of a positive lower bound for its step size. A brief introduction to the dual formulation of the ROF model is given in the appendix.

2 Problem description and notations

2.1 Notations

In this paper, we use the following notation. Let X* be the set of minimizers of min_{x∈Ω} f(x). Let ∥·∥ be the Euclidean norm. Let xᵀ and x·y denote the transpose and the inner product, respectively. For each x ∈ Ω, let A(x) denote the active index set at x, A(x) = {i : g_i(x) = 0}. Let [·]⁺ denote the projection onto the set Ω, defined as follows: if x = [y]⁺, then x = arg min_{z∈Ω} ∥y − z∥²; equivalently, for all z ∈ Ω,

(y − x) · (z − x) ≤ 0.   (3)

Finally, by assumption, Lipschitz continuity holds for ∇f. Let L_f denote the Lipschitz constant: for all x, y ∈ Ω,

∥∇f(x) − ∇f(y)∥ ≤ L_f ∥x − y∥.   (4)

While solving an unconstrained minimization problem, a minimizer x* is characterized by ∇f(x*) = 0. This criterion is no longer valid when solving a constrained minimization problem. Instead, the following properties show that the projected gradient ∇_Ω f plays a similar role in searching for a minimizer:

A minimizer x* satisfies ∇_Ω f(x*) = 0.

Definition 1 [8] Let the tangent cone T(x) be the closure of the cone of all feasible directions, where a direction v is feasible at x ∈ Ω if x + τv belongs to Ω for all τ > 0 sufficiently small. The projected gradient ∇_Ω f of f is defined by

∇_Ω f := −arg min_{d∈T(x)} ∥−∇f(x) − d∥ = −arg min_{d∈T(x)} ∥∇f(x) + d∥.   (5)


Thus, −∇_Ω f is the steepest feasible descent direction. Note that this definition is different from the one in [8]: a negative sign is added in our definition.

Clearly, when x is in the interior of Ω, ∇_Ω f(x) = ∇f(x). As in [8], ∇_Ω f has the following properties.

– (a) The point x ∈ Ω is a stationary point if and only if ∇_Ω f(x) = 0.

– (b) ∇f(x) · ∇_Ω f(x) = ∥∇_Ω f(x)∥², and

min_{d∈T(x), ∥d∥=1} d · ∇f(x) = −∥∇_Ω f(x)∥.   (6)

– (c) ∥∇_Ω f(·)∥ is lower semi-continuous. If ∇f is uniformly continuous at x* and lim_{k→∞} x_k = x*, then

lim_{k→∞} ∥∇_Ω f(x_k)∥ = 0.   (7)

– (d)

lim_{α→0⁺} ([x − α∇f]⁺ − x)/α = −∇_Ω f.   (8)
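As a numerical illustration (ours, not from the paper), the projected gradient of Definition 1 and properties (b) and (d) can be checked on the unit disk, where the tangent cone at a boundary point x is {d : d · x ≤ 0} and the projection onto it has a closed form:

```python
import numpy as np

def projected_gradient_disk(x, grad):
    """grad_Omega f at a point x of the unit disk (Definition 1):
    project -grad onto the tangent cone T(x) and negate."""
    r = np.linalg.norm(x)
    if r < 1.0 - 1e-12:                 # interior point: T(x) = R^2
        return grad
    n = x / r                           # outward unit normal at a boundary point
    w = -grad
    c = np.dot(w, n)
    d = w if c <= 0 else w - c * n      # projection of -grad f onto T(x)
    return -d

# f(x1, x2) = (x1 - 1)^2 / 2 on the unit disk (the setting of Example 1 below).
theta = 0.3
x = np.array([np.cos(theta), np.sin(theta)])
g = np.array([x[0] - 1.0, 0.0])         # grad f(x)
pg = projected_gradient_disk(x, g)

# property (b): grad f . grad_Omega f = ||grad_Omega f||^2
assert abs(np.dot(g, pg) - np.dot(pg, pg)) < 1e-12
# property (d): ([x - a*grad f]^+ - x)/a -> -grad_Omega f as a -> 0+
a = 1e-6
y = x - a * g
proj = y / max(1.0, np.linalg.norm(y))  # projection onto the disk
assert np.linalg.norm((proj - x) / a + pg) < 1e-4
```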

Let N(x) be the polar of the tangent cone T(x). The cone N(x) is called the normal cone to Ω at x ∈ Ω. The tangent cone T(x) is in turn the polar of the normal cone N(x). Thus we have the unique orthogonal decomposition of −∇f (see, for example, Lemmas 2.1 and 2.2 in [37]):

−∇f(x) = −∇_Ω f(x) + Σ_{i∈A(x)} λ_i(x) ∇g_i(x),

and {λ_i(x) : i ∈ A(x)} is the minimizer of the least squares problem

min_{λ_i≥0} ∥−∇f(x) − Σ_{i∈A(x)} λ_i ∇g_i(x)∥².   (9)

For a minimizer x* ∈ X*, an optimality condition [35] for x* is

−∇f(x*) ∈ N(x*).   (10)

That is, there exist scalars λ_i ≥ 0 such that

−∇f(x*) = Σ_{i∈A(x*)} λ_i ∇g_i(x*).   (11)


2.2 Gradient projection methods and basic properties

In this paper, we shall study the convergence rates of the following gradient projection methods in solving Eq. (1).

Alg 1 (Gradient projection method) Starting with some x₀ ∈ Ω, iterate for k = 0, 1, 2, . . .,

x_{k+1} = x(α_k),   (12)

where x(α) and α_k are determined as follows.

1. (a) A constant step size is used with 0 < α_k < 2/L_f, where L_f is the Lipschitz constant of ∇f:

x(α) := [x − α∇f]⁺.   (13)

(b) The Armijo rule is in effect along the projection arc using gradients ∇f: Let σ ∈ (0, 1), β ∈ (0, 1) and α₀ > 0. Let α = β^m α₀, where m is the first nonnegative integer for which

f(x) − f(x(β^m α₀)) ≥ σ ∇f(x) · (x − x(β^m α₀)),   (14)

with

x(α) := [x − α∇f]⁺.   (15)

2. The Armijo rule is in effect along the projection arc using projected gradients ∇_Ω f: Let σ ∈ (0, 1), β ∈ (0, 1) and α₀ > 0. Let α = β^m α₀, where m is the first nonnegative integer for which

f(x) − f(x(β^m α₀)) ≥ σ ∇_Ω f(x) · (x − x(β^m α₀)),   (16)

with

x(α) := [x − α∇_Ω f]⁺,   (17)

(∇_Ω f is defined in Definition 1). This method is mentioned implicitly in [38].
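As a concrete sketch (our illustration, using the disk-constrained quadratic of Example 1 below), case 1(a) can be implemented in a few lines; here L_f = 1, so any constant step in (0, 2) is admissible:

```python
import numpy as np

def proj_disk(y):
    """Euclidean projection [.]^+ onto the unit disk."""
    r = np.linalg.norm(y)
    return y if r <= 1.0 else y / r

def grad_projection(x0, grad, alpha, iters):
    """Alg 1, case 1(a): x_{k+1} = [x_k - alpha * grad f(x_k)]^+, constant step."""
    x = np.array(x0, float)
    for _ in range(iters):
        x = proj_disk(x - alpha * grad(x))
    return x

# f(x) = (x1 - 1)^2 / 2 on the unit disk, minimizer x* = (1, 0), L_f = 1.
grad = lambda x: np.array([x[0] - 1.0, 0.0])
x = grad_projection([0.0, 1.0], grad, alpha=1.0, iters=2000)
assert np.linalg.norm(x - np.array([1.0, 0.0])) < 0.1   # converges, but slowly
```

The sub-linear convergence predicted in section 2.4 is visible here: even after 2000 iterations the iterate is still of order 1/√k away from x*.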

Convergence analysis: The sequence {x_k : k} generated by these gradient projection methods converges to a stationary point x*: for cases 1(a) and 1(b), see Prop. 2.3.2 and Prop. 2.3.3 in [4]; for case 2, see Theorem 5.

Remark 1 The discretization of the dual formulation of the ROF model (see Eq. (133)) can be formulated as follows. Let

Ω = {x ∈ R^{2n₁n₂} : g_i(x) = x_{2i−1}² + x_{2i}² − 1 ≤ 0, i = 1, . . . , n₁n₂}.

Given a vector g_I ∈ R^{n₁n₂} and a positive parameter λ, the dual variable p is a minimizer of the problem:

min_{x∈Ω} {f(x) = ∥E₀x − λg_I∥²},   (18)

where E₀ : R^{2n₁n₂} → R^{n₁n₂} is the discrete divergence operator ∇· (see Remark 5). Note that f is not strongly convex. Compared with the proposed problem (1), the dual formulation (18) is a special case of the proposed problem. In [10], Chambolle proposes a gradient projection method based on this dual formulation to solve the ROF model. The step size is set to be a constant between 0 and 1/4. Clearly, this method is exactly the same as case 1 in Alg. 1.
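A minimal sketch of this method might look as follows; it is our illustration, not Chambolle's code. We add a factor 1/2 to the cost (which rescales f but not the minimizer), use a standard forward-difference grad/div pair, and take step size tau = 0.24, which respects tau < 2/L_f since L_f = ∥∇·∥² ≤ 8 for this f:

```python
import numpy as np

def grad(u):
    """Forward-difference gradient of an n1 x n2 image; returns (n1, n2, 2)."""
    g = np.zeros(u.shape + (2,))
    g[:-1, :, 0] = u[1:, :] - u[:-1, :]
    g[:, :-1, 1] = u[:, 1:] - u[:, :-1]
    return g

def div(p):
    """Discrete divergence, the negative adjoint of grad."""
    d = np.zeros(p.shape[:2])
    d[:-1, :] += p[:-1, :, 0]
    d[1:, :] -= p[:-1, :, 0]
    d[:, :-1] += p[:, :-1, 1]
    d[:, 1:] -= p[:, :-1, 1]
    return d

def chambolle_dual(g_img, lam, tau=0.24, iters=300):
    """Constant-step gradient projection (case 1(a) of Alg 1) on the dual
    problem min_{|p_i| <= 1} 0.5 * ||div p - lam * g||^2."""
    p = np.zeros(g_img.shape + (2,))
    for _ in range(iters):
        # grad f(p) = -grad(div p - lam*g), so the descent step adds +tau*grad(...)
        q = p + tau * grad(div(p) - lam * g_img)
        norms = np.maximum(1.0, np.sqrt((q ** 2).sum(axis=2)))
        p = q / norms[..., None]        # project each pixel onto the unit disk
    return p

rng = np.random.default_rng(0)
g_img = rng.standard_normal((8, 8))
lam = 0.1
f = lambda q: 0.5 * ((div(q) - lam * g_img) ** 2).sum()
p = chambolle_dual(g_img, lam)
assert np.sqrt((p ** 2).sum(axis=2)).max() <= 1.0 + 1e-12   # p stays feasible
assert f(p) < f(np.zeros_like(p))                           # cost decreased
```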


2.3 Assumptions on f, g_i

As mentioned in the introduction, we will prove linear convergence of the above gradient projection methods under the following conditions:

Assumption 1 f(x) can be expressed in the form f(x) = h₀(E₀x), with E₀ being an m₀ × n matrix and h₀ being a strongly convex function on R^{m₀}. Similarly, for each i, g_i can be expressed in the form g_i(x) = h_i(E_i x), with E_i being an m_i × n matrix and h_i being a strongly convex function on R^{m_i}. Here, {m_i : i} are natural numbers.

Assumption 2 (Non-degeneracy) [8, 18] A minimizer x* ∈ Ω is non-degenerate if the active constraint normals {∇g_i(x*) : i ∈ A(x*)} are linearly independent and λ_i(x*) > 0 for each i ∈ A(x*). Hence,

−∇f(x*) ∈ ri(N(x*)),   (19)

where ri(Λ) is the relative interior of Λ ⊂ Rⁿ.

These gradient projection methods all have the following property: there exists a positive constant c such that

c∥x_k − x_{k+1}∥² ≤ f(x_k) − f(x_{k+1}),   (20)

for k sufficiently large. The proof is given as follows. For case 1a, the formula can be found in the proof of Proposition 2.3.2 in [4]:

f(x_k) − f(x_{k+1}) ≥ (1/α − L_f/2) ∥x_k − x_{k+1}∥².

For cases 1b and 2, from Eq. (3) we have

f(x_k) − f(x_{k+1}) ≥ σ∥x_k − x_{k+1}∥²/α_k ≥ σ∥x_k − x_{k+1}∥²/α₀.   (21)

Because of Eq. (20), linear convergence of {f(x_k)} implies R-linear convergence of {x_k : k} (Lemma 3.1 in [27]).
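For case 1a this sufficient-decrease inequality is easy to observe numerically; the sketch below (our illustration, using f(x) = (x₁ − 1)²/2 on the unit disk with α = 1 and L_f = 1) checks it at every iteration:

```python
import numpy as np

def proj_disk(y):
    """Euclidean projection onto the unit disk."""
    r = np.linalg.norm(y)
    return y if r <= 1.0 else y / r

# Check f(x_k) - f(x_{k+1}) >= (1/alpha - L_f/2) ||x_k - x_{k+1}||^2 along the
# constant-step iteration (case 1a) for f(x) = (x1 - 1)^2 / 2, L_f = 1.
f = lambda x: 0.5 * (x[0] - 1.0) ** 2
alpha, Lf = 1.0, 1.0
x = np.array([0.0, 1.0])
for _ in range(100):
    x_new = proj_disk(x - alpha * np.array([x[0] - 1.0, 0.0]))
    lhs = f(x) - f(x_new)
    rhs = (1.0 / alpha - Lf / 2.0) * np.linalg.norm(x - x_new) ** 2
    assert lhs >= rhs - 1e-12          # sufficient decrease holds at every step
    x = x_new
```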

2.4 Necessity of the non-degeneracy condition

Before proceeding, we provide a few examples to illustrate the necessity of the proposed conditions, and also to examine the convergence speed of Chambolle's method. It is known that if f is quadratic and of the form xᵀQx − bᵀx, where Q is positive definite, then the convergence rate of the gradient projection method is linear (see page 233 in [4]). Hence, the non-degeneracy condition is not required to achieve linear convergence in that setting. The cost function f in the dual formulation of the ROF model is quadratic and of the form xᵀQx − bᵀx, with Q = ∇(∇·) being merely positive semi-definite. To emphasize the difference, we give a few examples to show that, without either of the proposed conditions being satisfied, linear convergence cannot be obtained. The major reason for this is that f(x) − f(x*) cannot be bounded by ∥∇_Ω f(x)∥² when the non-degeneracy condition fails, or when the curvature vanishes at x*.

To analyze the iterative behavior of Chambolle's method, we focus on low-dimensional cases subject to one or two constraints. These examples are provided to emphasize the effects of non-polyhedral constraints. In the first example, we consider the positive semi-definite matrix Q = [1, 0; 0, 0]/2. The minimizer x* ∈ R² is deliberately chosen so that ∇f(x*) = 0. Thus, λ₁ = 0 and the non-degeneracy condition fails. In the second example, we consider a quadratic cost function f with ∇f(x*) ≠ 0, but −∇f(x*) ∉ ri(N(x*)). The occurrence of sub-linear convergence indicates that the non-degeneracy condition is the crucial condition in this example, rather than ∇f(x*) ≠ 0.

Our final example is provided to emphasize the importance of Assumption 1. The minimization problem is deliberately designed so that the curvature vanishes at minimizers. In this example, −∇f(x*) ∈ ri(N(x*)) holds, but g_i(x) does not satisfy Assumption 1. Under this circumstance, the gradient projection method using a constant step size (Chambolle's method) still converges sub-linearly, which implies that Assumption 1 is also a crucial condition to achieve linear convergence.

Example 1 Let Ω be the unit disk {x := (x₁, x₂) ∈ R² : x₁² + x₂² ≤ 1}.³ Let f(x₁, x₂) = (x₁ − 1)²/2. Then the minimizer x* is unique, and in fact x* = (x₁, x₂) = (1, 0). Start with a point x = (cos θ, sin θ) on the circle with 0 < θ < π/2. Consider the constant step size α = 1. Then, in case 1 of Alg. 1,

x(1) = [x − ∇f(x)]⁺ = (1, x₂)/√(1 + x₂²).

Compute

∥x(1) − x*∥ = ∥[x − ∇f(x)]⁺ − x*∥ = θ + O(θ²),   (22)

and

∥x − x*∥ = √(2 − 2 cos θ) = 2 sin(θ/2).

Hence,

lim_{θ→0} ∥x(1) − x*∥ / ∥x − x*∥ = 1,   (23)

which implies that the convergence rate is sub-linear. On the other hand, since

∥∇_Ω f(x)∥ = (1 − cos θ) sin θ = θ³/2 + O(θ⁴)   (24)

and

f(x) − f(x*) = (cos θ − 1)²/2 = θ⁴/8 + O(θ⁶),

we can see that f(x) − f(x*) cannot be bounded by ∥∇_Ω f(x)∥² as θ → 0.
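The limit (23) is easy to observe numerically; this sketch (ours) computes the one-step error ratio for shrinking θ:

```python
import numpy as np

# One step of the constant-step method (alpha = 1) from x = (cos t, sin t):
# the contraction ratio ||x(1) - x*|| / ||x - x*|| tends to 1 as t -> 0.
xstar = np.array([1.0, 0.0])
for theta in [0.1, 0.01, 0.001]:
    x = np.array([np.cos(theta), np.sin(theta)])
    y = x - np.array([x[0] - 1.0, 0.0])       # gradient step, alpha = 1
    x1 = y / np.linalg.norm(y)                # ||y|| > 1, so this is [y]^+
    ratio = np.linalg.norm(x1 - xstar) / np.linalg.norm(x - xstar)
    assert ratio > 1.0 - 2.0 * theta          # one-step ratio approaches 1
```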


Fig. 1 The projection counter vs. cost function error of Nesterov's scheme (Nesterov), the gradient projection method using the optimal step size (Chambolle) and the gradient projection method using projected gradients (PG).

In this example, α = 1 is the optimal step size in the following sense: for each point x on the circle, we have

arg min_α f(x − α∇f) = 1.

As the sub-linear convergence is caused by the use of projections onto non-polyhedral constraint sets, instead of by a low condition number of the system, the convergence speed cannot be improved by the BB method: because α = 1 is the optimal choice, the BB method will give exactly the same iterates as the constant step size does. Hence, the slow convergence cannot be alleviated by the aid of non-monotone spectral projected gradient methods or conjugate gradient methods. In fact, readers can easily verify that the limit of the ratio in Eq. (23) remains unchanged even when a different step size α is taken, provided that 0 < α < 2/L_f.

This phenomenon is primarily caused by the "orthogonality" between ∇f and ∇_Ω f as x → x* along the boundary of a non-polyhedral set:

lim_{θ→0} (∇f · ∇_Ω f) / (∥∇f∥ ∥∇_Ω f∥) = lim_{θ→0} sin θ = 0,

which does not occur in problems with polyhedral constraint sets.

To emphasize the slow convergence of this method, we compare the convergence speed of the gradient projection method using a constant step size (Chambolle) with the Nesterov method (Nesterov's scheme [29]) and the PG method:

x_{k+1} = x_k(α_k) = [x_k − α_k ∇_Ω f(x_k)]⁺, with α_k := arg min_α f(x_k − α∇_Ω f(x_k)).   (25)

That is, the step size in the PG method is selected to be the optimal step size along the projected gradient direction, which is a special case of the gradient

³ In this subsection, though we use x_k for two different objects, with a little care the meaning of x_k should be clear from the context.


projection method using ∇_Ω f, i.e., case 2 in Alg. 1. The same initial condition is used in each method: (x₁, x₂) = (0, 1). As both the Nesterov method and the PG method use two projections in each iteration, in order to make a fair comparison, the x-axis represents the projection counter. The result shown in Fig. 1 indicates that the Chambolle method is the slowest to converge.

The Nesterov method improves upon the convergence speed of the Chambolle method. However, its convergence rate is still sub-linear. In contrast, the PG method has linear convergence, which is proven in the following. Observe that, due to the structure of f, the optimal step size α_opt is always the one which forces the first component of x − α∇_Ω f(x) to be 1. Then

x(α_opt) = (x₂, 1 − x₁)/√(2 − 2x₁) = (cos(θ/2), sin(θ/2)), if x = (cos θ, sin θ).

Hence, the convergence rate is indeed linear, with lim_{k→∞} ∥x_{k+1} − x*∥/∥x_k − x*∥ = 0.5.
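The angle-halving behavior is easy to reproduce. In the sketch below (our illustration), the optimal step along the projected gradient direction works out, from the closed form above, to α = 1/sin²θ at x = (cos θ, sin θ), and the error ratio quickly settles near 0.5:

```python
import numpy as np

def pg_step(x):
    """One exact-line-search PG step (Eq. (25)) for f(x) = (x1 - 1)^2 / 2 on the
    unit disk; on the circle this halves the angle to the minimizer (1, 0)."""
    c, s = x
    w = np.array([1.0 - c, 0.0])               # -grad f(x)
    n = np.array([c, s])                        # unit outward normal
    pg = -(w - np.dot(w, n) * n)                # projected gradient grad_Omega f
    alpha = 1.0 / s ** 2                        # optimal step on the circle
    y = x - alpha * pg
    return y / np.linalg.norm(y)                # project back onto the disk

theta = 1.0
x = np.array([np.cos(theta), np.sin(theta)])
xstar = np.array([1.0, 0.0])
e_prev = np.linalg.norm(x - xstar)
for _ in range(15):
    x = pg_step(x)
    e = np.linalg.norm(x - xstar)
    assert e < 0.75 * e_prev                    # contraction ratio tends to 0.5
    e_prev = e
```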

Example 2 Let Ω = {(x₁, x₂, x₃, x₄) ∈ R⁴ : x₁² + x₂² ≤ 1, x₃² + x₄² ≤ 1} and f(x₁, x₂, x₃, x₄) = ((x₁ − 1)² + (x₃ − 2)²)/2.

Then x* = (1, 0, 1, 0) is the minimizer, with ∇f(x*) = (0, 0, −1, 0). Consider the point x = (cos θ, sin θ, 1, 0) on the surface of Ω. Using similar arguments as in the previous example, we can verify that the convergence rate of the Chambolle method is sub-linear in this higher dimensional case.

Example 3 Let Ω := {(x₁, x₂) ∈ R² : x₁⁴ ≤ x₂}, and f(x₁, x₂) = (x₂ + 1)²/2.

Then the minimizer is x* = (0, 0). Let x = (θ, θ⁴). The optimal step size of Chambolle's method is α = 1, and x(1) = (θ − 4θ³, (θ − 4θ³)⁴) + O(θ⁴). Then

∥x(1) − x*∥ = θ − 4θ³ + O(θ⁴),  ∥x − x*∥ = θ + θ⁷/2 + O(θ⁸).   (26)

Hence,

lim_{θ→0} ∥x(1) − x*∥ / ∥x − x*∥ = 1,   (27)

which implies that the convergence rate is sub-linear. In fact, in this case, because

∥∇_Ω f(x)∥ = 4θ³ + O(θ⁷),  f(x) − f(x*) = θ⁴ + O(θ⁸),   (28)

f(x) − f(x*) cannot be bounded by ∥∇_Ω f(x)∥² as θ → 0.
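The ratio (27) can also be checked numerically. In the sketch below (ours), the projection onto Ω is computed by solving the one-dimensional stationarity condition of the squared distance to the boundary curve z ↦ (z, z⁴), a degree-7 polynomial equation, with np.roots:

```python
import numpy as np

def project_example3(y):
    """Euclidean projection onto Omega = {x2 >= x1^4} for a point y below the
    boundary: minimize ||(z, z^4) - y||^2 over z, i.e. solve the stationarity
    condition 4 z^7 - 4 y2 z^3 + z - y1 = 0 and keep the best real root."""
    r = np.roots([4.0, 0.0, 0.0, 0.0, -4.0 * y[1], 0.0, 1.0, -y[0]])
    z = min((c.real for c in r if abs(c.imag) < 1e-9),
            key=lambda t: (t - y[0]) ** 2 + (t ** 4 - y[1]) ** 2)
    return np.array([z, z ** 4])

# One constant step (alpha = 1) from x = (t, t^4): the error ratio tends to 1.
for theta in [0.2, 0.1, 0.05]:
    x = np.array([theta, theta ** 4])
    y = x - np.array([0.0, x[1] + 1.0])        # x - grad f(x)
    x1 = project_example3(y)
    ratio = np.linalg.norm(x1) / np.linalg.norm(x)   # x* = (0, 0)
    assert ratio > 1.0 - 20.0 * theta ** 2     # ratio = 1 - 4 t^2 + o(t^2)
```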

Remark 2 In unconstrained optimization problems, the R-linear convergence of the BB method is known [17, 16]. These examples suggest that the proposed two assumptions should be necessary conditions for the R-linear convergence of the BB method in this non-polyhedral constrained optimization problem.

Remark 3 Here, we make a few comments on the PG method. In general, the aforementioned PG method does not converge linearly when the proposed conditions fail. In the first example, when Ω is replaced by

Ω := {x ∈ R² : x₁ ≤ 1 − x₂^m} for every even integer m ≥ 4,

the curvature vanishes at x* = (1, 0), and the first component of x(α) is always 1 for each x = (1 − θ^m, θ), θ ≠ 0. In this case, the PG method converges linearly with the rate, the limit of the ratio in Eq. (23), being 1 − 1/m. Therefore, we know that the PG method converges only sub-linearly with Ω := {x ∈ R² : x₁ ≤ 1 − exp(−1/x₂²)} (exp(−1/x₂²) is not real analytic). In fact, the PG method does not converge in the third example: observe that the second component of x(α) is always −1 and the first component of x(α) tends to ±∞ as x approaches x*. This result indicates the necessity of the Armijo rule to ensure convergence.

The above examples show that in the worst case the gradient projection method using gradients could converge very slowly. It is known that many numerical methods for solving the primal formulation of the ROF model have linear convergence, including the explicit gradient descent method. The natural question that arises is why Chambolle's method is effective, if its convergence rate is potentially only sub-linear. Most numerical experiments show that Chambolle's method converges linearly in solving the dual formulation of the ROF model. This is the motivation for our study of the convergence rates of gradient projection methods.

2.5 Finite identification

An important result regarding the non-degeneracy condition is the finite identification property: the set A(x*) can be identified by the gradient projection methods in Alg. 1 in finitely many iterations. This property guarantees that x_k will enter and remain in a set B defined in Eq. (32).

Consider the subset B̂ of Ω,

B̂ := {x ∈ Ω : g_i(x) = 0, i ∈ A(x*); g_i(x) < 0, i ∉ A(x*)},   (29)

in which every point has the same active index set as x* does, i.e.,

A(x) = A(x*) for each x ∈ B̂.   (30)

Clearly, B̂ contains x*. Thus B̂ is nonempty.

For each x ∈ B̂, let {λ_i(x) : i ∈ A(x*)} be a minimizer of the least squares problem

min_{λ_i∈R} ∥−∇f(x) − Σ_{i∈A(x*)} λ_i ∇g_i(x)∥².   (31)

Recall the non-degeneracy condition: −∇f(x*) = Σ_{i∈A(x*)} λ_i(x*) ∇g_i(x*) with λ_i(x*) > 0.

Let B be a subset of B̂,

B := {x ∈ B̂ : ∥x − x*∥ < ϵ} for some ϵ > 0,   (32)

such that for each x ∈ B, we have that

1. {∇g_i(x) : i ∈ A(x*)} are linearly independent;

2. {λ_i(x) : i} is determined uniquely, with λ_i(x) > 0 for i ∈ A(x*) and λ_i(x) = 0 for i ∉ A(x*).

The existence of such an ϵ stems from the C¹ continuity of the functions {λ_i(x)}, which is ensured by the following observations:

1. Because {∇g_i(x*) : i ∈ A(x*)} are linearly independent, {∇g_i(x) : i ∈ A(x*)} are linearly independent for each x ∈ Rⁿ near x*. That is, det(GᵀG) ≠ 0, where G is the matrix whose columns are {∇g_i(x) : i ∈ A(x*)}.

2. When x is near x*, the solution {λ_i(x) : i ∈ A(x*)} of the least squares problem is a rational function of ∇f(x) and {∇g_i(x) : i ∈ A(x*)}, where ∇f, ∇g_i are C¹(Rⁿ). As the denominator of this rational function, det(GᵀG), is nonzero near x*, {λ_i(x)} is uniquely determined.

3. Because g_i(x*) < 0 for each i ∉ A(x*), g_i(x) is negative near x*, which implies that λ_i(x) = 0.
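These multipliers are cheap to compute. The sketch below (our illustration) evaluates the least-squares solution at the minimizer of Example 2, where λ₁ = 0 signals the failure of the non-degeneracy condition:

```python
import numpy as np

def multipliers(grad_f, G):
    """Solve min_lambda || -grad_f - G @ lambda ||^2 (cf. Eq. (31)); near x* the
    solution is unique since det(G^T G) != 0."""
    lam, *_ = np.linalg.lstsq(G, -grad_f, rcond=None)
    return lam

# Example 2 at x* = (1, 0, 1, 0): both constraints are active, and the active
# constraint normals grad g_1 = (2,0,0,0), grad g_2 = (0,0,2,0) form G's columns.
G = np.array([[2.0, 0.0],
              [0.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])
grad_f = np.array([0.0, 0.0, -1.0, 0.0])    # grad f(x*)
lam = multipliers(grad_f, G)
assert np.allclose(lam, [0.0, 0.5])          # lambda_1 = 0: degenerate minimizer
```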

Theorem 1 (Finite identification) Suppose that f, {g_i : i = 1, . . . , m} are C², with Lipschitz continuous gradient functions. Let x* be the limit of {x_k : k} generated by the gradient projection method Alg. 1. Then x_k ∈ B for k sufficiently large, where B is defined in Eq. (32).

Proof Because the sequence {x_k : k} generated by gradient projection methods converges to a stationary point x*, we have lim_{k→∞} x_k = x*, which implies that A(x_k) ⊂ A(x*). Assume that there are an infinite subsequence K and an index r such that r ∈ A(x*) but r ∉ A(x_k) for all k ∈ K. Let P_k be the projection onto the space

{v : v · ∇g_i(x_k) = 0, for all i ∈ A(x*), i ≠ r},

and let P be the projection onto the space

{v : v · ∇g_i(x*) = 0, for all i ∈ A(x*), i ≠ r}.

Then −P_k ∇g_r(x_k) ∈ T(x_k) for all k ∈ K because r ∉ A(x_k). Hence, by Eq. (6) we have

∇f(x_k) · P∇g_r(x_k) = ∇f(x_k) · P_k∇g_r(x_k) + ∇f(x_k) · (P − P_k)∇g_r(x_k)

≤ ∥∇_Ω f(x_k)∥ ∥P_k∇g_r(x_k)∥ + ∥∇f(x_k)∥ ∥P − P_k∥ ∥∇g_r(x_k)∥.

Since {x_k} converges to x*, from Eq. (7) and lim_{k→∞} ∥P − P_k∥ = 0 we obtain

∇f(x*) · P∇g_r(x*) ≤ 0.

On the other hand, the linear independence of the active constraint normals guarantees P∇g_r(x*) ≠ 0. By the non-degeneracy condition,

∇f(x*) · P∇g_r(x*) = λ_r(x*) ∥P∇g_r(x*)∥² > 0.

This contradiction proves that A(x_k) = A(x*) for all k sufficiently large. That is, x_k ∈ B̂ for all k sufficiently large.

Finally, as x_k converges to x*, for k sufficiently large we have ∥x_k − x*∥ < ϵ, i.e., x_k ∈ B.
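A quick numerical illustration of Theorem 1 (ours): with f(x) = (x₁ − 2)²/2 on the unit disk, the minimizer x* = (1, 0) is non-degenerate, since −∇f(x*) = (1, 0) = ½∇g₁(x*) with λ₁ = ½ > 0, and the iterates reach the active set in a few steps and never leave it:

```python
import numpy as np

def proj_disk(y):
    """Euclidean projection onto the unit disk."""
    r = np.linalg.norm(y)
    return y if r <= 1.0 else y / r

# Constant-step gradient projection, f(x) = (x1 - 2)^2 / 2, alpha = 0.5 < 2/L_f.
x = np.array([-0.5, 0.5])
alpha = 0.5
active_from = None
for k in range(50):
    x = proj_disk(x - alpha * np.array([x[0] - 2.0, 0.0]))
    if active_from is None and abs(np.linalg.norm(x) - 1.0) < 1e-12:
        active_from = k                       # first iterate with g_1(x) = 0
assert active_from is not None and active_from < 10   # finite identification
assert abs(np.linalg.norm(x) - 1.0) < 1e-12           # constraint stays active
```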


According to the above finite identification, if B̂ = {x*}, then the minimizer is found in finitely many iterations. According to Theorem 2.4 in [35], B is a class-C² identifiable surface. (Similarly, B is a class-C^p identifiable surface if g_i ∈ C^p.)

2.6 The main theorem

In the following theorem, we show linear convergence of the gradient projection methods in Alg 1. The proofs of the inequalities (33), (34) and of the existence of positive lower bounds for the step sizes are rather lengthy, and will be given in section 3 and section 4.

Theorem 2 Suppose that Assumptions 1 and 2 hold. Then linear convergence can be obtained.

More precisely, there exist a subset M ⊂ B and two positive scalars c₁, c₂, such that for each nonminimizer x ∈ M the following inequalities hold:

f(x) − f(x*) ≤ c₁ ∥∇_Ω f(x)∥²,   (33)

f(x) − f(x(α)) ≥ c₂ α ∥∇_Ω f(x)∥².   (34)

Based on these two inequalities, we have linear convergence for nonminimizers x_k:

f(x_{k+1}) − f(x*) ≤ r (f(x_k) − f(x*)),   (35)

where r := 1 − c₂α_min/c₁, with α_min := lim inf_{k→∞} α_k > 0. Besides, if x_k is a minimizer, then linear convergence is obvious.

Proof According to the finite identification, there exists some positive integer k₀ such that x_k ∈ M for all k ≥ k₀. From inequalities (33), (34), for each k ≥ k₀,

f(x_{k+1}) − f(x*) ≤ f(x_k) − f(x*) − c₂ α_k ∥∇_Ω f(x_k)∥² ≤ r (f(x_k) − f(x*)),   (36)

where r := 1 − c₂α_k/c₁. Finally, the proof is completed by noting that we do not have a diminishing step size, i.e., lim inf_{k→∞} α_k > 0: for case 2, see Theorem 6; for case 1b, see [3]. ⊓⊔

Remark 4 This framework can be regarded as a generalization of the proof based on ∇f in the unconstrained optimization problem (e.g., page 87 in [4]). Also, for comparison with the work [27], Luo et al. proved linear convergence through the following two inequalities:

f(x) − f(x*) ≤ c₁ ∥x − [x − ∇f(x)]⁺∥²,   (37)

f(x) − f(x(α)) ≥ c₂ ∥x − [x − ∇f(x)]⁺∥²  (Eq. (3.3) in [27]),   (38)

where Eq. (37) is derived by combining Eqs. (3.2) and (3.4) in [27]. Our approach bears a resemblance to theirs in observing that lim_{α→0⁺} (x − [x − α∇f(x)]⁺)/α = ∇_Ω f(x). However, the main difference between the two is that their convergence result does require the polyhedral assumption on the constraint sets, even though it does not require the non-degeneracy condition.


3 Proof of two inequalities

We shall verify the inequalities (33), (34) in two cases: (a) x* lies in the interior of Ω; (b) x* lies on the boundary ∂Ω.

When x* is an interior point, the minimization problem is equivalent to an unconstrained optimization problem. The linear convergence result is known; see [4] or [27]. In this section, we shall focus on the case x* ∈ ∂Ω. We also focus on the case that the set B with x* excluded, B − {x*}, is nonempty; otherwise, the minimizer is found in finitely many iterations by Theorem 1.

3.1 Curvature defined by shape operators

Thanks to finite identification, the sequence {x_k : k} eventually enters and remains in the set B. In the following, we examine its "curvature", which is defined through a shape operator S on B. Keep in mind that B in general is not a surface, and its "curvature" is defined through a normal vector field, which is introduced by ∇f. Hence our curvature is not a purely geometric property.

[Curvature of g_i(x) = 0]: The curvature of each surface {x : g_i(x) = 0} can be defined as follows (see, e.g., [30]). Let n_i(x) be a unit normal vector,

n_i(x) = ∇g_i(x)/∥∇g_i(x)∥ for each x with ∥∇g_i(x)∥ > 0.   (39)

Let the tangent plane T_i(x) with respect to the surface be the subspace {y ∈ Rⁿ : y · ∇g_i(x) = 0}, and the projection operator E_{T_i(x)} be

E_{T_i(x)} u := arg min_{y∈T_i(x)} ∥u − y∥.   (40)

Then for each tangent vector v in T_i(x), we define the shape operator S_i at x with respect to v by

S_i(v) = lim_{t→0⁺} (n_i(x + tv) − n_i(x))/t = E_{T_i(x)} ∇²g_i(x) E_{T_i(x)} v / ∥∇g_i(x)∥.   (41)

Then S_i(v) is a tangent vector in T_i(x).

When v ≠ 0, the quantity

S_i(v) · v / ∥v∥² = vᵀ ∇²g_i(x) v / (∥v∥² ∥∇g_i(x)∥)   (42)

is called the normal curvature of the surface at x in the v direction. According to Lemma 2.1 in [35], E_{T_i(x)} ∇²g_i(x) E_{T_i(x)} is positive semi-definite; equivalently, the smallest principal curvature (the smallest eigenvalue) of S_i is nonnegative. Therefore, the positive definiteness of the shape operator S_i is equivalent to the positive definiteness of E_{T_i(x)} ∇²g_i(x) E_{T_i(x)}, or geometrically, to the positive principal curvature condition.
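Eq. (42) is straightforward to evaluate. The sketch below (ours) confirms that the unit circle g(x) = x₁² + x₂² − 1 = 0 has normal curvature 1 at every point:

```python
import numpy as np

def normal_curvature(grad_g, hess_g, v):
    """Eq. (42): v^T hess_g v / (||v||^2 ||grad_g||) for a tangent vector v
    of the surface {g = 0}."""
    assert abs(np.dot(grad_g, v)) < 1e-12       # v must be tangent
    return (v @ hess_g @ v) / ((v @ v) * np.linalg.norm(grad_g))

# g(x) = x1^2 + x2^2 - 1: grad g = 2x, hess g = 2I, so the curvature is
# 2||v||^2 / (||v||^2 * 2||x||) = 1 on the unit circle.
theta = 0.7
x = np.array([np.cos(theta), np.sin(theta)])
grad_g = 2.0 * x
hess_g = 2.0 * np.eye(2)
v = np.array([-x[1], x[0]])                     # tangent direction at x
assert abs(normal_curvature(grad_g, hess_g, v) - 1.0) < 1e-12
```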

[Curvature of B]: In the following, we shall define the “curvature” of the surface B. Thanks to the non-degeneracy condition, we can assign a normal


vector field n_f to every point x in Rⁿ according to the gradient vector field ∇f:

n_f(x) := −∇f(x) + ∇_Ω f(x) = lim_{α→0⁺} (x − α∇f(x) − [x − α∇f(x)]⁺)/α.   (43)

Obviously, we have the orthogonal decomposition

−∇f = −∇_Ω f + n_f.   (44)

For each x ∈ Rⁿ with n_f(x) ≠ 0, we can assign a unit normal vector field n(x) := n_f(x)/∥n_f(x)∥. Hence, for x ∈ B, we have n_f(x) = Σ_{i∈A(x*)} λ_i(x) ∇g_i(x) with λ_i(x) > 0.

For each x∈ B, we can assign a tangent plane (a subspace in Rn), TBx:={y ∈ Rn : y· ∇gi(x) = 0, i∈ A(x)}. (45) For each vector u∈ Rn, denote the linear projection operator Exby

Exu := (arg min

y∈T Bx

∥u − y∥). (46)

Readers can easily verify the linearity of Exand

Ex∇gi(x) = 0 for i∈ A(x), which imples that Ex(−∇f(x)) = −∇f (x).
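To make the decomposition (43)–(44) concrete, here is a minimal numerical sketch (our own toy setup, not from the paper), taking $\Omega$ to be the nonnegative orthant so that $[z]_+ = \max(z, 0)$ componentwise; $n_f$ is then supported on the active coordinates, which makes the orthogonality in (44) transparent.

```python
import numpy as np

def n_f(grad_f, proj, x, alpha=1e-8):
    """Normal field of Eq. (43): (x - a*grad f(x) - [x - a*grad f(x)]_+)/a, small a."""
    z = x - alpha * grad_f(x)
    return (z - proj(z)) / alpha

# Omega = nonnegative orthant, so [z]_+ is the componentwise positive part.
proj = lambda z: np.maximum(z, 0.0)

# f(x) = 0.5*||x - a||^2, with a placed so the face {x_1 = 0} is active at x.
a = np.array([-1.0, 2.0, 0.5])
grad_f = lambda x: x - a

x = np.array([0.0, 1.0, 1.0])   # x_1 = 0 is the active coordinate

nf = n_f(grad_f, proj, x)       # normal component, supported on the active set
bar_grad = grad_f(x) + nf       # projected gradient, rearranging Eq. (44)
```

Here `nf` evaluates to $(-1, 0, 0)$ and `bar_grad` to $(0, -1, 0.5)$; their inner product vanishes, which is the orthogonality in (44).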

Definition 2 (The shape operator $S$ on $\mathcal{B}$) Fix a point $x \in \mathcal{B}$. Let $n$ be the unit normal vector field induced by $\nabla f$ as above. For each (tangent) vector $v$ in $T\mathcal{B}_x$, let $\nabla_v n$ denote the covariant derivative of $n$ with respect to the tangent vector $v$,

$$\nabla_v n = \lim_{t\to 0^+} \frac{n(x+tv) - n(x)}{t}. \quad (48)$$

As the dimension of $\mathcal{B}$ is not necessarily $n-1$, the vector $\nabla_v n$ is not necessarily a tangent vector in $T\mathcal{B}_x$. Thus, the shape operator at $x$ is defined by its projection on $T\mathcal{B}_x$,

$$S(v) := E_x \nabla_v n. \quad (49)$$

Unlike each $S_i$, $S$ depends not only on the surfaces $\{x : g_i(x) = 0\}$, $i \in A(x)$, but also on the function $f$. In the following lemma, we investigate the connection between the "curvature" of $\mathcal{B}$ and the curvature of each surface $\{x : g_i(x) = 0\}$.

Lemma 1 For each tangent vector $v$ in $T\mathcal{B}_x$,

$$\|n_f(x)\|\, S(v) = \sum_{i\in A(x)} \lambda_i(x)\, E_x \nabla^2 g_i(x)\, v = \sum_{i\in A(x)} \lambda_i(x)\, E_x S_i(v)\, \|\nabla g_i(x)\|, \quad (50)$$

where $n_f(x) = \sum_{i\in A(x)} \lambda_i(x)\nabla g_i(x)$. Thus, $S(v)$ is a conical combination of the $E_x S_i(v)$,

$$S(v) = \sum_{i\in A(x)} \frac{\lambda_i(x)\|\nabla g_i(x)\|}{\big\|\sum_{j\in A(x)} \lambda_j(x)\nabla g_j(x)\big\|}\, E_x S_i(v). \quad (51)$$

Proof Since $n(x) = \sum_{i\in A(x)} \lambda_i(x)\nabla g_i(x)/\|n_f(x)\|$, we have

$$S(v) = E_x \nabla_v n = \sum_{i\in A(x)} E_x \nabla_v\big(\lambda_i \nabla g_i/\|n_f\|\big). \quad (52)$$

For each $i$, we have

$$\nabla_v\left(\frac{\lambda_i \nabla g_i}{\|n_f\|}\right) = \frac{\lambda_i \nabla^2 g_i\, v}{\|n_f\|} + \text{terms consisting of } \nabla g_i. \quad (53)$$

Thanks to Eq. (47), the terms with $\nabla g_i$ vanish after the projection $E_x$ is applied, which justifies the first equality in Eq. (50). The definition of $S_i$ in Eq. (41) then yields the second expression in Eq. (50).

Observe that $\nabla^2 g_i$ restricted to $T\mathcal{B}_x$ is positive semi-definite, and that the triangle inequality $\|\sum_{i\in A(x)} \lambda_i \nabla g_i\| \le \sum_{i\in A(x)} \lambda_i \|\nabla g_i\|$ holds for $\lambda_i \ge 0$. Let $\bar\lambda_i := \lambda_i\|\nabla g_i\| \big(\sum_{j\in A(x)} \lambda_j\|\nabla g_j\|\big)^{-1} \ge 0$. From Eq. (51), we have

$$v\cdot S(v) \ge \sum_{i\in A(x)} \bar\lambda_i\, v\cdot S_i(v), \quad \text{with } \sum_{i\in A(x)} \bar\lambda_i = 1, \quad (54)$$

which implies that $S$ restricted to $T\mathcal{B}_x$ is positive semi-definite. In addition, $S$ is positive definite if at least one of the $S_i$ restricted to $T\mathcal{B}_x$ is positive definite, or equivalently, if $E_x \nabla^2 g_i E_x$ (restricted to $T\mathcal{B}_x$) is positive definite.
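Lemma 1 can be sanity-checked by finite differences. In the sketch below (entirely our own toy instance), $\mathcal{B}$ is the equator circle obtained by intersecting the unit sphere $g_1(x) = \|x\|^2 - 1$ with the plane $g_2(x) = x_3$ in $\mathbb{R}^3$, $f(x) = \frac12\|x-a\|^2$, and the multipliers $\lambda_i(x)$ are recovered by decomposing $-\nabla f(x)$ in the basis consisting of the tangent direction, $\nabla g_1$, and $\nabla g_2$.

```python
import numpy as np

a = np.array([2.0, 0.0, 1.0])   # f(x) = 0.5*||x - a||^2, so -grad f(x) = a - x

def point(th):
    """Parametrize B, the equator circle {g1 = g2 = 0}."""
    return np.array([np.cos(th), np.sin(th), 0.0])

def normal_field(th):
    """n(x) = n_f(x)/||n_f(x)|| with n_f = lam1*grad g1 + lam2*grad g2."""
    x = point(th)
    t = np.array([-np.sin(th), np.cos(th), 0.0])  # unit tangent of B at x
    g1 = 2.0 * x                                  # grad g1 (sphere)
    g2 = np.array([0.0, 0.0, 1.0])                # grad g2 (plane)
    # Decompose -grad f(x) = tau*t + lam1*g1 + lam2*g2.
    tau, lam1, lam2 = np.linalg.solve(np.column_stack([t, g1, g2]), a - x)
    nf = lam1 * g1 + lam2 * g2
    return nf / np.linalg.norm(nf), lam1, lam2, np.linalg.norm(nf)

th0, h = 0.0, 1e-6
v = np.array([0.0, 1.0, 0.0])   # unit tangent at x0 = (1, 0, 0)

# Left side of Eq. (50): S(v) = E_x grad_v n, by central differences; at x0
# we have T B_x = span{v}, so E_x amounts to keeping the component along v.
n_plus, _, _, _ = normal_field(th0 + h)
n_minus, _, _, _ = normal_field(th0 - h)
Sv = (((n_plus - n_minus) / (2.0 * h)) @ v) * v

# Right side of Eq. (50) divided by ||n_f||: sum_i lam_i E_x hess g_i v / ||n_f||,
# with hess g1 = 2*I and hess g2 = 0.
_, lam1, lam2, nf_norm = normal_field(th0)
rhs = lam1 * 2.0 * v / nf_norm
```

Both sides come out as $(0, 1/\sqrt{2}, 0)$ here, since $\lambda_1 = \tfrac12$, $\lambda_2 = 1$, and $\|n_f\| = \sqrt{2}$ at $x_0$.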

3.2 The first inequality

In the next lemma, we analyze the leading term of $\overline{\nabla} f(x)$,

$$\overline{\nabla} f(x) = F(x - x^*) + \text{higher order terms}, \quad (55)$$

where $F$ is some positive semi-definite matrix.

Lemma 2 Assume that the non-degeneracy condition holds. For $x \in \mathcal{B}$,

$$\lim_{x\to x^*} \frac{\overline{\nabla} f(x)}{\|x - x^*\|} = E_{x^*}\Big(\nabla^2 f(x^*) + \sum_{i\in A(x^*)} \lambda_i(x^*)\nabla^2 g_i(x^*)\Big) \lim_{x\to x^*} E_{x^*}\frac{x - x^*}{\|x - x^*\|}. \quad (56)$$

Equivalently, using $S_i$ to replace $\nabla^2 g_i(x^*)$ (Eq. (41)), we have

$$\overline{\nabla} f(x) = E_{x^*}\Big(\nabla^2 f(x^*) + \sum_{i\in A(x^*)} \lambda_i(x^*)\|\nabla g_i(x^*)\|\, S_i\Big) E_{x^*}(x - x^*) + O(\|x - x^*\|^2). \quad (57)$$

Proof Note that

$$\overline{\nabla} f(x) = \nabla f(x) + \sum_{i\in A(x^*)} \lambda_i(x)\nabla g_i(x), \quad (58)$$

and

$$0 = \overline{\nabla} f(x^*) = \nabla f(x^*) + \sum_{i\in A(x^*)} \lambda_i(x^*)\nabla g_i(x^*). \quad (59)$$

Using the mean value theorem, because $\{\lambda_i(x), \nabla g_i(x)\}$ are continuously differentiable, the difference of these two equations is

$$\overline{\nabla} f(x) = \nabla^2 f(x^*)(x - x^*) + \sum_{i\in A(x^*)} \lambda_i(x^*)\big(\nabla g_i(x) - \nabla g_i(x^*)\big) \quad (60)$$
$$\qquad\qquad + \sum_{i\in A(x^*)} \big(\lambda_i(x) - \lambda_i(x^*)\big)\nabla g_i(x^*) + O(\|x - x^*\|^2). \quad (61)$$

Because $\overline{\nabla} f(x) \in T\mathcal{B}_x$, we can apply $E_{x^*}$ on both sides of Eq. (60),

$$\overline{\nabla} f(x) = E_{x^*}\nabla^2 f(x^*)(x - x^*) + \sum_{i\in A(x^*)} \lambda_i(x^*)\, E_{x^*}\big(\nabla^2 g_i(x^*)(x - x^*)\big) + O(\|x - x^*\|^2), \quad (62)$$

where $E_{x^*}(\nabla g_i(x^*)) = 0$ is used.

Hence, the leading term of $\overline{\nabla} f(x)/\|x - x^*\|$ is

$$\lim_{x\to x^*} \frac{\overline{\nabla} f(x)}{\|x - x^*\|} = E_{x^*}\Big(\nabla^2 f(x^*) + \sum_{i\in A(x^*)} \lambda_i(x^*)\nabla^2 g_i(x^*)\Big) \lim_{x\to x^*} E_{x^*}\frac{x - x^*}{\|x - x^*\|}. \quad (63)$$

This lemma points out that $\overline{\nabla} f(x)$ can be approximated by $F(x - x^*)$ locally, where the matrix $F$ (restricted to $T\mathcal{B}_{x^*}$) is

$$F := E_{x^*}\Big(\nabla^2 f(x^*) + \sum_{i\in A(x^*)} \lambda_i(x^*)\nabla^2 g_i(x^*)\Big) E_{x^*}. \quad (64)$$

Hence, to obtain a positive lower bound for $\|\overline{\nabla} f(x)\|/\|x - x^*\|$, it suffices that $F$ be positive definite. This holds if at least one of $\{\nabla^2 g_i(x^*) : i \in A(x^*)\}$ restricted to $T\mathcal{B}_{x^*}$ is positive definite.

Theorem 3 Assume that the non-degeneracy condition holds. For each $i \in A(x^*)$, let $\mu_i^{\min}$ be the smallest eigenvalue of $\nabla^2 g_i(x^*)$ restricted to $T\mathcal{B}_{x^*}$, and suppose that at least one of these restrictions is positive definite, i.e., at least one of $\{\mu_i^{\min}\}$ is positive. Let $\mu_{\min} = \sum_{i\in A(x^*)} \lambda_i(x^*)\mu_i^{\min}$. Then for each $c_1 > 1/\mu_{\min}$, there exists $\delta > 0$ such that if $x \in \mathcal{B}$ with $\|x - x^*\| < \delta$, then

$$\|x - x^*\| \le c_1 \|\overline{\nabla} f(x)\| \quad \text{and} \quad f(x) - f(x^*) \le c_1 \|\overline{\nabla} f(x)\|^2. \quad (65)$$
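The inequalities in (65) can be probed numerically. The toy instance below (our own, not from the paper) takes $\Omega$ as the unit disk in $\mathbb{R}^2$ and $f(x) = \frac12\|x - a\|^2$ with $a = (2, 0)$, so that $x^* = (1, 0)$, $\lambda(x^*) = \tfrac12$, and $\mu_{\min} = \lambda(x^*)\cdot 2 = 1$. Along the boundary circle, the ratios $\|x - x^*\|/\|\overline{\nabla} f(x)\|$ and $(f(x) - f(x^*))/\|\overline{\nabla} f(x)\|^2$ tend to $1/2$ and $1/4$, so $c_1 = 1.01$ comfortably satisfies both bounds for this instance.

```python
import numpy as np

a = np.array([2.0, 0.0])
x_star = np.array([1.0, 0.0])                 # minimizer of f over the unit disk
f = lambda x: 0.5 * np.dot(x - a, x - a)

def bar_grad_f(th):
    """Projected gradient on B (the unit circle): tangential part of grad f."""
    x = np.array([np.cos(th), np.sin(th)])
    t = np.array([-np.sin(th), np.cos(th)])   # unit tangent at x
    return np.dot(x - a, t) * t

c1 = 1.01                                     # a constant larger than 1/mu_min = 1
ratios_dist, ratios_gap = [], []
for th in [0.1, 0.05, 0.01, 0.001]:
    x = np.array([np.cos(th), np.sin(th)])
    g = np.linalg.norm(bar_grad_f(th))
    ratios_dist.append(np.linalg.norm(x - x_star) / g)  # tends to 1/2
    ratios_gap.append((f(x) - f(x_star)) / g**2)        # tends to 1/4
```

For this problem $\|\overline{\nabla} f(x)\| = 2|\sin\theta|$ while $\|x - x^*\| = 2|\sin(\theta/2)|$, which explains the limiting ratio $1/2$.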
