CONJUGATE DIRECTION METHODS

Gradient Methods

One of the remarkable properties of the conjugate gradient method is its ability to generate, in a very economical fashion, a set of vectors with a property known as conjugacy. A

set of nonzero vectors{p0, p1, . . . , pl} is said to be conjugate with respect to the symmetric positive deﬁnite matrix A if

p_i^TApj 0, for all i j. (5.5)

It is easy to show that any set of vectors satisfying this property is also linearly independent.

(For a geometrical illustration of conjugate directions see Section 9.4.)

The importance of conjugacy lies in the fact that we can minimizeφ(·) in n steps by successively minimizing it along the individual directions in a conjugate set. To verify this claim, we consider the following conjugate direction method. (The distinction between the conjugate gradient method and the conjugate direction method will become clear as we proceed.) Given a starting point x0∈ IRⁿand a set of conjugate directions{p0, p1, . . . , pn−1}, let us generate the sequence{xk} by setting

xk+1 xk+ αkpk, (5.6)

whereαkis the one-dimensional minimizer of the quadratic functionφ(·) along xk+ αpk, given explicitly by

αk − r_k^Tpk

p^T_kApk

; (5.7)

see (3.55). We have the following result.

Theorem 5.1.

For any x₀∈ IRⁿthe sequence{xk} generated by the conjugate direction algorithm (5.6), (5.7) converges to the solution x^∗of the linear system (5.1) in at most n steps.

P^ROOF. Since the directions{pi} are linearly independent, they must span the whole space IRⁿ. Hence, we can write the difference between x₀and the solution x^∗in the following way:

x^∗− x0 σ0p₀+ σ1p₁+ · · · + σn−1pn−1,

for some choice of scalarsσk. By premultiplying this expression by p_k^TA and using the conjugacy property (5.5), we obtain

σk p_k^TA(x^∗− x0) p^T_kApk

. (5.8)

We now establish the result by showing that these coefﬁcients σk coincide with the step lengthsαkgenerated by the formula (5.7).

If xkis generated by algorithm (5.6), (5.7), then we have xk x0+ α0p0+ α1p1+ · · · + αk−1pk−1.

By premultiplying this expression by p_k^TAand using the conjugacy property, we have that p^T_kA(xk− x0) 0,

and therefore

p^T_kA(x^∗− x0) p^TkA(x^∗− xk) pk^T(b− Axk) −pk^Trk.

By comparing this relation with (5.7) and (5.8), we ﬁnd thatσk αk, giving the result. There is a simple interpretation of the properties of conjugate directions. If the matrix Ain (5.2) is diagonal, the contours of the functionφ(·) are ellipses whose axes are aligned with the coordinate directions, as illustrated in Figure 5.1. We can ﬁnd the minimizer of this function by performing one-dimensional minimizations along the coordinate directions

.

* e₂

x₀

e₁ x₁

.

Figure 5.1 Successive minimizations along the coordinate directions ﬁnd the minimizer of a quadratic with a diagonal Hessian in n iterations.

e x₁

x₂ ³

x x*

e₂

Figure 5.2 Successive minimization along coordinate axes does not ﬁnd the solution in n iterations, for a general convex quadratic.

e₁, e2, . . . , en in turn. When A is not diagonal, its contours are still elliptical, but they are usually no longer aligned with the coordinate directions. The strategy of successive minimization along these directions in turn no longer leads to the solution in n iterations (or even in a ﬁnite number of iterations). This phenomenon is illustrated in the two-dimensional example of Figure 5.2 We can, however, recover the nice behavior of Figure 5.1 if we transform the problem to make A diagonal and then minimize along the coordinate directions. Suppose we transform the problem by deﬁning new variables ˆx as

ˆx S⁻¹x, (5.9)

where S is the n× n matrix deﬁned by

S [p0p1 · · · pn−1],

where{p0, p2, . . . , pn−1} is the set of conjugate directions with respect to A. The quadratic φ deﬁned by (5.2) now becomes

ˆφ(ˆx)^def φ(S ˆx) ¹₂ˆx^T(S^TAS)ˆx − (S^Tb)^Tˆx.

By the conjugacy property (5.5), the matrix S^TASis diagonal, so we can ﬁnd the minimizing value of ˆφ by performing n one-dimensional minimizations along the coordinate directions

of ˆx. Because of the relation (5.9), however, the ith coordinate direction in ˆx-space corre-sponds to the direction piin x-space. Hence, the coordinate search strategy applied to ˆφ is equivalent to the conjugate direction algorithm (5.6), (5.7). We conclude, as in Theorem 5.1, that the conjugate direction algorithm terminates in at most n steps.

Returning to Figure 5.1, we note another interesting property: When the Hessian ma-trix is diagonal, each coordinate minimization correctly determines one of the components of the solution x^∗. In other words, after k one-dimensional minimizations, the quadratic has been minimized on the subspace spanned by e1, e2, . . . , ek. The following theorem proves this important result for the general case in which the Hessian of the quadratic is not necessarily diagonal. (Here and later, we use the notation span{p0, p1, . . . , pk} to denote the set of all linear combinations of the vectors p₀, p1, . . . , pk.) In proving the result we will make use of the following expression, which is easily veriﬁed from the relations (5.4) and (5.6):

rk+1 rk+ αkApk. (5.10)

Theorem 5.2 (Expanding Subspace Minimization).

Let x₀∈ IRⁿbe any starting point and suppose that the sequence{xk} is generated by the conjugate direction algorithm (5.6), (5.7). Then

r_k^Tpi 0, for i 0, 1, . . . , k − 1, (5.11) and xkis the minimizer ofφ(x) ¹₂x^TAx − b^Txover the set

{x | x x0+ span{p0, p1, . . . , pk−1}}. (5.12)

P^ROOF. We begin by showing that a point ˜x minimizes φ over the set (5.12) if and only if r (˜x)^Tpi 0, for each i 0, 1, . . . , k − 1. Let us deﬁne h(σ) φ(x0+ σ0p₀+ · · · + σk−1pk−1), whereσ (σ0, σ1, . . . , σk−1)^T. Since h(σ) is a strictly convex quadratic, it has a unique minimizerσ^∗that satisﬁes

∂h(σ^∗)

∂σi

0, i 0, 1, . . . , k − 1.

By the chain rule, this equation implies that

∇φ(x0+ σ₀^∗p0+ · · · + σk^∗−1pk−1)^Tpi 0, i 0, 1, . . . , k − 1.

By recalling the deﬁnition (5.3), we have for the minimizer ˜x x0+ σ₀^∗p0+ σ₁^∗p2+ · · · + σk^∗−1pk−1on the set (5.12) that r (˜x)^Tpi 0, as claimed.

We now use induction to show that xksatisﬁes (5.11). For the case k 1, we have from the fact that x1 x0+ α0p0minimizesφ along p0that r₁^Tp0 0. Let us now make

the induction hypothesis, namely, that r_k^T₋₁pi 0 for i 0, 1, . . . , k − 2. By (5.10), we have

rk rk−1+ αk−1Apk−1, so that

p_k^T₋₁rk pk^T−1rk−1+ αk−1p^T_k₋₁Apk−1 0,

by the deﬁnition (5.7) ofαk−1. Meanwhile, for the other vectors pi, i 0, 1, . . . , k − 2, we have

p_i^Trk pi^Trk−1+ αk−1p_i^TApk−1 0,

where p_i^Trk−1 0 because of the induction hypothesis and pi^TApk−1 0 because of conjugacy of the vectors pi. We have shown that r_k^Tpi 0, for i 0, 1, . . . , k − 1, so the

proof is complete.

The fact that the current residual rkis orthogonal to all previous search directions, as expressed in (5.11), is a property that will be used extensively in this chapter.

The discussion so far has been general, in that it applies to a conjugate direction method (5.6), (5.7) based on any choice of the conjugate direction set{p0, p1, . . . , pn−1}.

There are many ways to choose the set of conjugate directions. For instance, the eigen-vectorsv1, v2, . . . , vn of A are mutually orthogonal as well as conjugate with respect to A, so these could be used as the vectors{p0, p1, . . . , pn−1}. For large-scale applications, however, computation of the complete set of eigenvectors requires an excessive amount of computation. An alternative approach is to modify the Gram–Schmidt orthogonalization process to produce a set of conjugate directions rather than a set of orthogonal directions.

(This modiﬁcation is easy to produce, since the properties of conjugacy and orthogonality are closely related in spirit.) However, the Gram–Schmidt approach is also expensive, since it requires us to store the entire direction set.

在文檔中 SecondEdition NumericalOptimization JorgeNocedalStephenJ.Wright (頁 121-126)