Gradient Methods
CONJUGATE DIRECTION METHODS
One of the remarkable properties of the conjugate gradient method is its ability to generate, in a very economical fashion, a set of vectors with a property known as conjugacy. A
set of nonzero vectors{p0, p1, . . . , pl} is said to be conjugate with respect to the symmetric positive definite matrix A if
piTApj 0, for all i j. (5.5)
It is easy to show that any set of vectors satisfying this property is also linearly independent.
(For a geometrical illustration of conjugate directions see Section 9.4.)
The importance of conjugacy lies in the fact that we can minimizeφ(·) in n steps by successively minimizing it along the individual directions in a conjugate set. To verify this claim, we consider the following conjugate direction method. (The distinction between the conjugate gradient method and the conjugate direction method will become clear as we proceed.) Given a starting point x0∈ IRnand a set of conjugate directions{p0, p1, . . . , pn−1}, let us generate the sequence{xk} by setting
xk+1 xk+ αkpk, (5.6)
whereαkis the one-dimensional minimizer of the quadratic functionφ(·) along xk+ αpk, given explicitly by
αk − rkTpk
pTkApk
; (5.7)
see (3.55). We have the following result.
Theorem 5.1.
For any x0∈ IRnthe sequence{xk} generated by the conjugate direction algorithm (5.6), (5.7) converges to the solution x∗of the linear system (5.1) in at most n steps.
PROOF. Since the directions{pi} are linearly independent, they must span the whole space IRn. Hence, we can write the difference between x0and the solution x∗in the following way:
x∗− x0 σ0p0+ σ1p1+ · · · + σn−1pn−1,
for some choice of scalarsσk. By premultiplying this expression by pkTA and using the conjugacy property (5.5), we obtain
σk pkTA(x∗− x0) pTkApk
. (5.8)
We now establish the result by showing that these coefficients σk coincide with the step lengthsαkgenerated by the formula (5.7).
If xkis generated by algorithm (5.6), (5.7), then we have xk x0+ α0p0+ α1p1+ · · · + αk−1pk−1.
By premultiplying this expression by pkTAand using the conjugacy property, we have that pTkA(xk− x0) 0,
and therefore
pTkA(x∗− x0) pTkA(x∗− xk) pkT(b− Axk) −pkTrk.
By comparing this relation with (5.7) and (5.8), we find thatσk αk, giving the result. There is a simple interpretation of the properties of conjugate directions. If the matrix Ain (5.2) is diagonal, the contours of the functionφ(·) are ellipses whose axes are aligned with the coordinate directions, as illustrated in Figure 5.1. We can find the minimizer of this function by performing one-dimensional minimizations along the coordinate directions
.
* e2
x0
e1 x1
.
.
xFigure 5.1 Successive minimizations along the coordinate directions find the minimizer of a quadratic with a diagonal Hessian in n iterations.
e x1
x0
x2 3
x x*
e2
1
Figure 5.2 Successive minimization along coordinate axes does not find the solution in n iterations, for a general convex quadratic.
e1, e2, . . . , en in turn. When A is not diagonal, its contours are still elliptical, but they are usually no longer aligned with the coordinate directions. The strategy of successive minimization along these directions in turn no longer leads to the solution in n iterations (or even in a finite number of iterations). This phenomenon is illustrated in the two-dimensional example of Figure 5.2 We can, however, recover the nice behavior of Figure 5.1 if we transform the problem to make A diagonal and then minimize along the coordinate directions. Suppose we transform the problem by defining new variables ˆx as
ˆx S−1x, (5.9)
where S is the n× n matrix defined by
S [p0p1 · · · pn−1],
where{p0, p2, . . . , pn−1} is the set of conjugate directions with respect to A. The quadratic φ defined by (5.2) now becomes
ˆφ(ˆx)def φ(S ˆx) 12ˆxT(STAS)ˆx − (STb)Tˆx.
By the conjugacy property (5.5), the matrix STASis diagonal, so we can find the minimizing value of ˆφ by performing n one-dimensional minimizations along the coordinate directions
of ˆx. Because of the relation (5.9), however, the ith coordinate direction in ˆx-space corre-sponds to the direction piin x-space. Hence, the coordinate search strategy applied to ˆφ is equivalent to the conjugate direction algorithm (5.6), (5.7). We conclude, as in Theorem 5.1, that the conjugate direction algorithm terminates in at most n steps.
Returning to Figure 5.1, we note another interesting property: When the Hessian ma-trix is diagonal, each coordinate minimization correctly determines one of the components of the solution x∗. In other words, after k one-dimensional minimizations, the quadratic has been minimized on the subspace spanned by e1, e2, . . . , ek. The following theorem proves this important result for the general case in which the Hessian of the quadratic is not necessarily diagonal. (Here and later, we use the notation span{p0, p1, . . . , pk} to denote the set of all linear combinations of the vectors p0, p1, . . . , pk.) In proving the result we will make use of the following expression, which is easily verified from the relations (5.4) and (5.6):
rk+1 rk+ αkApk. (5.10)
Theorem 5.2 (Expanding Subspace Minimization).
Let x0∈ IRnbe any starting point and suppose that the sequence{xk} is generated by the conjugate direction algorithm (5.6), (5.7). Then
rkTpi 0, for i 0, 1, . . . , k − 1, (5.11) and xkis the minimizer ofφ(x) 12xTAx − bTxover the set
{x | x x0+ span{p0, p1, . . . , pk−1}}. (5.12)
PROOF. We begin by showing that a point ˜x minimizes φ over the set (5.12) if and only if r (˜x)Tpi 0, for each i 0, 1, . . . , k − 1. Let us define h(σ) φ(x0+ σ0p0+ · · · + σk−1pk−1), whereσ (σ0, σ1, . . . , σk−1)T. Since h(σ) is a strictly convex quadratic, it has a unique minimizerσ∗that satisfies
∂h(σ∗)
∂σi
0, i 0, 1, . . . , k − 1.
By the chain rule, this equation implies that
∇φ(x0+ σ0∗p0+ · · · + σk∗−1pk−1)Tpi 0, i 0, 1, . . . , k − 1.
By recalling the definition (5.3), we have for the minimizer ˜x x0+ σ0∗p0+ σ1∗p2+ · · · + σk∗−1pk−1on the set (5.12) that r (˜x)Tpi 0, as claimed.
We now use induction to show that xksatisfies (5.11). For the case k 1, we have from the fact that x1 x0+ α0p0minimizesφ along p0that r1Tp0 0. Let us now make
the induction hypothesis, namely, that rkT−1pi 0 for i 0, 1, . . . , k − 2. By (5.10), we have
rk rk−1+ αk−1Apk−1, so that
pkT−1rk pkT−1rk−1+ αk−1pTk−1Apk−1 0,
by the definition (5.7) ofαk−1. Meanwhile, for the other vectors pi, i 0, 1, . . . , k − 2, we have
piTrk piTrk−1+ αk−1piTApk−1 0,
where piTrk−1 0 because of the induction hypothesis and piTApk−1 0 because of conjugacy of the vectors pi. We have shown that rkTpi 0, for i 0, 1, . . . , k − 1, so the
proof is complete.
The fact that the current residual rkis orthogonal to all previous search directions, as expressed in (5.11), is a property that will be used extensively in this chapter.
The discussion so far has been general, in that it applies to a conjugate direction method (5.6), (5.7) based on any choice of the conjugate direction set{p0, p1, . . . , pn−1}.
There are many ways to choose the set of conjugate directions. For instance, the eigen-vectorsv1, v2, . . . , vn of A are mutually orthogonal as well as conjugate with respect to A, so these could be used as the vectors{p0, p1, . . . , pn−1}. For large-scale applications, however, computation of the complete set of eigenvectors requires an excessive amount of computation. An alternative approach is to modify the Gram–Schmidt orthogonalization process to produce a set of conjugate directions rather than a set of orthogonal directions.
(This modification is easy to produce, since the properties of conjugacy and orthogonality are closely related in spirit.) However, the Gram–Schmidt approach is also expensive, since it requires us to store the entire direction set.