SEARCH DIRECTIONS FOR LINE SEARCH METHODS


The steepest descent direction $-\nabla f_k$ is the most obvious choice of search direction for a line search method. It is intuitive: among all the directions we could move from $x_k$, it is the one along which $f$ decreases most rapidly. To verify this claim, we appeal again to Taylor's theorem (Theorem 2.1), which tells us that for any search direction $p$ and step-length parameter $\alpha$, we have

$$
f(x_k + \alpha p) = f(x_k) + \alpha p^T \nabla f_k + \tfrac{1}{2} \alpha^2 p^T \nabla^2 f(x_k + tp)\, p, \quad \text{for some } t \in (0, \alpha)
$$

(see (2.6)). The rate of change in $f$ along the direction $p$ at $x_k$ is simply the coefficient of $\alpha$, namely, $p^T \nabla f_k$. Hence, the unit direction $p$ of most rapid decrease is the solution to the problem

$$
\min_p \; p^T \nabla f_k, \quad \text{subject to } \|p\| = 1. \tag{2.13}
$$

Since $p^T \nabla f_k = \|p\|\,\|\nabla f_k\| \cos\theta = \|\nabla f_k\| \cos\theta$, where $\theta$ is the angle between $p$ and $\nabla f_k$, it is easy to see that the minimizer is attained when $\cos\theta = -1$ and

$$
p = -\nabla f_k / \|\nabla f_k\|,
$$

as claimed. As we illustrate in Figure 2.5, this direction is orthogonal to the contours of the function.
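The claim can also be checked numerically. The following is a minimal sketch, assuming a hypothetical gradient value `grad_k` chosen only for illustration: it compares the slope $p^T \nabla f_k$ of the normalized steepest descent direction with the slopes of unit vectors sampled around the circle.

```python
import math

def steepest_descent_direction(grad):
    """Return the unit vector -grad / ||grad||, the solution of (2.13)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        raise ValueError("gradient is zero; every direction has zero slope")
    return [-g / norm for g in grad]

def slope(p, grad):
    """Directional derivative p^T grad, the coefficient of alpha in (2.6)."""
    return sum(pi * gi for pi, gi in zip(p, grad))

# Hypothetical gradient at some iterate x_k (illustrative values only).
grad_k = [3.0, -4.0]
p_sd = steepest_descent_direction(grad_k)

# Sample unit vectors p = (cos t, sin t) in one-degree increments.
samples = [[math.cos(math.radians(d)), math.sin(math.radians(d))]
           for d in range(360)]
best_sampled_slope = min(slope(p, grad_k) for p in samples)

print("steepest descent direction:", p_sd)
print("its slope (equals -||grad_k||):", slope(p_sd, grad_k))
print("smallest sampled slope:", best_sampled_slope)
```

No sampled unit vector achieves a slope below $-\|\nabla f_k\|$, in agreement with the argument based on $\cos\theta$.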

The steepest descent method is a line search method that moves along $p_k = -\nabla f_k$ at every step. It can choose the step length $\alpha_k$ in a variety of ways, as we discuss in Chapter 3. One advantage of the steepest descent direction is that it requires calculation of the gradient $\nabla f_k$ but not of second derivatives. However, it can be excruciatingly slow on difficult problems.
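As a rough illustration of the iteration, here is a minimal sketch of a steepest descent loop. The test function, the starting point, and the step-halving rule for $\alpha_k$ are all assumptions made for this example; the halving rule is a crude stand-in for the step-length strategies of Chapter 3, not one of them.

```python
def f(x):
    # Hypothetical test function: a mildly elongated convex quadratic.
    return 0.5 * (x[0] ** 2 + 25.0 * x[1] ** 2)

def grad_f(x):
    return [x[0], 25.0 * x[1]]

def steepest_descent(x, max_iter=2000, tol=1e-8):
    """Iterate x <- x + alpha_k p_k with p_k = -grad f(x_k).

    alpha_k starts at 1 and is halved until f decreases
    (an illustrative rule only).
    """
    for k in range(max_iter):
        g = grad_f(x)
        if max(abs(gi) for gi in g) < tol:
            return x, k
        alpha = 1.0
        while f([xi - alpha * gi for xi, gi in zip(x, g)]) >= f(x):
            alpha *= 0.5
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x, max_iter

x_final, iterations = steepest_descent([1.0, 1.0])
print("approximate minimizer:", x_final)
print("iterations used:", iterations)
```

Only gradient evaluations are needed, but even on this two-variable quadratic the loop takes many iterations to reach the tolerance, which hints at the slow behavior mentioned above.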

Line search methods may use search directions other than the steepest descent direction. In general, any descent direction (one that makes an angle of strictly less than $\pi/2$ radians with $-\nabla f_k$) is guaranteed to produce a decrease in $f$, provided that the step length is sufficiently small (see Figure 2.6).

Figure 2.5 Steepest descent direction for a function of two variables.

Figure 2.6 A downhill direction $p_k$.

We can verify this claim by using Taylor's theorem. From (2.6), we have that

$$
f(x_k + \epsilon p_k) = f(x_k) + \epsilon p_k^T \nabla f_k + O(\epsilon^2).
$$

When $p_k$ is a downhill direction, the angle $\theta_k$ between $p_k$ and $\nabla f_k$ has $\cos\theta_k < 0$, so that

$$
p_k^T \nabla f_k = \|p_k\|\,\|\nabla f_k\| \cos\theta_k < 0.
$$

It follows that $f(x_k + \epsilon p_k) < f(x_k)$ for all positive but sufficiently small values of $\epsilon$.

Another important search direction, perhaps the most important one of all, is the Newton direction. This direction is derived from the second-order Taylor series approximation to $f(x_k + p)$, which is

$$
f(x_k + p) \approx f_k + p^T \nabla f_k + \tfrac{1}{2} p^T \nabla^2 f_k\, p \stackrel{\rm def}{=} m_k(p). \tag{2.14}
$$

Assuming for the moment that $\nabla^2 f_k$ is positive definite, we obtain the Newton direction by finding the vector $p$ that minimizes $m_k(p)$. By simply setting the derivative of $m_k(p)$ to zero, we obtain the following explicit formula:

$$
p_k^N = -\left(\nabla^2 f_k\right)^{-1} \nabla f_k. \tag{2.15}
$$
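Formula (2.15) can be illustrated with a small computation. The sketch below assumes a hypothetical two-variable test function, $f(x) = x_1^4 + x_1 x_2 + (1 + x_2)^2$, and solves the linear system $\nabla^2 f_k\, p = -\nabla f_k$ rather than forming the inverse explicitly.

```python
def grad_f(x):
    # Gradient of the hypothetical function f(x) = x1^4 + x1*x2 + (1 + x2)^2.
    return [4.0 * x[0] ** 3 + x[1], x[0] + 2.0 * (1.0 + x[1])]

def hess_f(x):
    # Hessian of the same function.
    return [[12.0 * x[0] ** 2, 1.0], [1.0, 2.0]]

def solve_2x2(A, b):
    """Solve A z = b for a 2x2 matrix A by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0.0:
        raise ValueError("matrix is singular")
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def newton_direction(g, H):
    """Return p^N from (2.15) by solving H p = -g."""
    return solve_2x2(H, [-g[0], -g[1]])

x_k = [1.0, 0.0]
g_k, H_k = grad_f(x_k), hess_f(x_k)
p_newton = newton_direction(g_k, H_k)

# Gradient of the model m_k at p^N, namely grad f_k + H_k p^N; it should vanish.
model_grad = [g_k[i] + H_k[i][0] * p_newton[0] + H_k[i][1] * p_newton[1]
              for i in range(2)]
directional_derivative = g_k[0] * p_newton[0] + g_k[1] * p_newton[1]

print("Newton direction:", p_newton)
print("gradient of m_k at p^N:", model_grad)
print("grad f_k^T p^N:", directional_derivative)
```

At this point the Hessian is positive definite, so the computed direction is a stationary point of the model and has a negative directional derivative.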

The Newton direction is reliable when the difference between the true function $f(x_k + p)$ and its quadratic model $m_k(p)$ is not too large. By comparing (2.14) with (2.6), we see that the only difference between these functions is that the matrix $\nabla^2 f(x_k + tp)$ in the third term of the expansion has been replaced by $\nabla^2 f_k$. If $\nabla^2 f$ is sufficiently smooth, this difference introduces a perturbation of only $O(\|p\|^3)$ into the expansion, so that when $\|p\|$ is small, the approximation $f(x_k + p) \approx m_k(p)$ is quite accurate.

The Newton direction can be used in a line search method when $\nabla^2 f_k$ is positive definite, for in this case we have

$$
\nabla f_k^T p_k^N = -(p_k^N)^T \nabla^2 f_k\, p_k^N \le -\sigma_k \|p_k^N\|^2
$$

for some $\sigma_k > 0$. Unless the gradient $\nabla f_k$ (and therefore the step $p_k^N$) is zero, we have that $\nabla f_k^T p_k^N < 0$, so the Newton direction is a descent direction.

Unlike the steepest descent direction, there is a "natural" step length of 1 associated with the Newton direction. Most line search implementations of Newton's method use the unit step $\alpha = 1$ where possible and adjust $\alpha$ only when it does not produce a satisfactory reduction in the value of $f$.

When $\nabla^2 f_k$ is not positive definite, the Newton direction may not even be defined, since $(\nabla^2 f_k)^{-1}$ may not exist. Even when it is defined, it may not satisfy the descent property $\nabla f_k^T p_k^N < 0$, in which case it is unsuitable as a search direction. In these situations, line search methods modify the definition of $p_k$ to make it satisfy the descent condition while retaining the benefit of the second-order information contained in $\nabla^2 f_k$. We describe these modifications in Chapter 3.
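Both failure modes can be seen on tiny examples. The sketch below uses hypothetical $2 \times 2$ matrices and a hypothetical gradient chosen for illustration: an indefinite Hessian for which the Newton direction points uphill, and a singular Hessian for which the direction is undefined.

```python
def is_positive_definite_2x2(H):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return H[0][0] > 0.0 and det > 0.0

def newton_direction(g, H):
    """Solve H p = -g by Cramer's rule; return None if H is singular."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    if det == 0.0:
        return None
    b = [-g[0], -g[1]]
    return [(b[0] * H[1][1] - H[0][1] * b[1]) / det,
            (H[0][0] * b[1] - H[1][0] * b[0]) / det]

# Hypothetical data, chosen only to exhibit the two cases.
g = [1.0, 2.0]
H_indefinite = [[1.0, 0.0], [0.0, -2.0]]
H_singular = [[1.0, 0.0], [0.0, 0.0]]

p = newton_direction(g, H_indefinite)
slope_along_p = g[0] * p[0] + g[1] * p[1]

print("positive definite?", is_positive_definite_2x2(H_indefinite))
print("Newton direction:", p)
print("grad^T p (positive, so not a descent direction):", slope_along_p)
print("direction for singular Hessian:", newton_direction(g, H_singular))
```

A practical code would test for positive definiteness (for instance while attempting a factorization) before accepting the Newton direction.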

Methods that use the Newton direction have a fast rate of local convergence, typically quadratic. After a neighborhood of the solution is reached, convergence to high accuracy often occurs in just a few iterations. The main drawback of the Newton direction is the need for the Hessian $\nabla^2 f(x)$. Explicit computation of this matrix of second derivatives can sometimes be a cumbersome, error-prone, and expensive process. Finite-difference and automatic differentiation techniques described in Chapter 8 may be useful in avoiding the need to calculate second derivatives by hand.

Quasi-Newton search directions provide an attractive alternative to Newton's method in that they do not require computation of the Hessian and yet still attain a superlinear rate of convergence. In place of the true Hessian $\nabla^2 f_k$, they use an approximation $B_k$, which is updated after each step to take account of the additional knowledge gained during the step.

The updates make use of the fact that changes in the gradient provide information about the second derivative of $f$ along the search direction. By using the expression (2.5) from our statement of Taylor's theorem, we have by adding and subtracting the term $\nabla^2 f(x)\, p$ that

$$
\nabla f(x + p) = \nabla f(x) + \nabla^2 f(x)\, p + \int_0^1 \left[ \nabla^2 f(x + tp) - \nabla^2 f(x) \right] p \, dt.
$$

Because $\nabla^2 f(\cdot)$ is continuous, the size of the final integral term is $o(\|p\|)$. By setting $x = x_k$ and $p = x_{k+1} - x_k$, we obtain

$$
\nabla f_{k+1} = \nabla f_k + \nabla^2 f_k (x_{k+1} - x_k) + o(\|x_{k+1} - x_k\|).
$$

When $x_k$ and $x_{k+1}$ lie in a region near the solution $x^*$, within which $\nabla^2 f$ is positive definite, the final term in this expansion is eventually dominated by the $\nabla^2 f_k (x_{k+1} - x_k)$ term, and we can write

$$
\nabla^2 f_k (x_{k+1} - x_k) \approx \nabla f_{k+1} - \nabla f_k. \tag{2.16}
$$

We choose the new Hessian approximation $B_{k+1}$ so that it mimics the property (2.16) of the true Hessian, that is, we require it to satisfy the following condition, known as the secant equation:

$$
B_{k+1} s_k = y_k, \tag{2.17}
$$

where

$$
s_k = x_{k+1} - x_k, \quad y_k = \nabla f_{k+1} - \nabla f_k.
$$
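For a quadratic function the approximation (2.16) holds with equality, so the true Hessian itself satisfies the secant equation. The sketch below checks this on a hypothetical quadratic $f(x) = \tfrac{1}{2} x^T A x - b^T x$; the matrix `A`, the vector `b`, and the two points are illustrative assumptions.

```python
def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

# Hypothetical quadratic f(x) = 0.5 x^T A x - b^T x with constant Hessian A.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]

def grad_f(x):
    return [ax - bi for ax, bi in zip(matvec(A, x), b)]

# Two arbitrary consecutive iterates (illustrative values).
x_k = [0.0, 0.0]
x_next = [0.5, -1.0]

s_k = [xn - xk for xn, xk in zip(x_next, x_k)]                  # s_k = x_{k+1} - x_k
y_k = [gn - gk for gn, gk in zip(grad_f(x_next), grad_f(x_k))]  # y_k = grad f_{k+1} - grad f_k

print("A s_k =", matvec(A, s_k))
print("y_k   =", y_k)
```

For a general smooth function the two vectors agree only approximately, with an error that is $o(\|s_k\|)$.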

Typically, we impose additional conditions on $B_{k+1}$, such as symmetry (motivated by symmetry of the exact Hessian), and a requirement that the difference between successive approximations $B_k$ and $B_{k+1}$ have low rank.

Two of the most popular formulae for updating the Hessian approximation $B_k$ are the symmetric-rank-one (SR1) formula, defined by

$$
B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}, \tag{2.18}
$$

and the BFGS formula, named after its inventors, Broyden, Fletcher, Goldfarb, and Shanno, which is defined by

$$
B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}. \tag{2.19}
$$

Note that the difference between the matrices $B_k$ and $B_{k+1}$ is a rank-one matrix in the case of (2.18) and a rank-two matrix in the case of (2.19). Both updates satisfy the secant equation and both maintain symmetry. One can show that the BFGS update (2.19) generates positive definite approximations whenever the initial approximation $B_0$ is positive definite and $s_k^T y_k > 0$. We discuss these issues further in Chapter 6.
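The two updates are short enough to write out directly. The following is a minimal sketch using plain Python lists; the vectors `s_k` and `y_k`, the identity starting matrix, and the tolerance used to guard the SR1 denominator are assumptions made for this illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def sr1_update(B, s, y):
    """Symmetric-rank-one update (2.18); B is assumed symmetric."""
    r = [yi - bsi for yi, bsi in zip(y, matvec(B, s))]  # y - B s
    denom = dot(r, s)
    if abs(denom) < 1e-12:  # illustrative guard against division by ~0
        raise ValueError("SR1 denominator is (nearly) zero")
    n = len(s)
    return [[B[i][j] + r[i] * r[j] / denom for j in range(n)] for i in range(n)]

def bfgs_update(B, s, y):
    """BFGS update (2.19); B is assumed symmetric, so s^T B = (B s)^T."""
    Bs = matvec(B, s)
    sBs = dot(s, Bs)
    ys = dot(y, s)
    if ys <= 0.0:
        raise ValueError("curvature condition s^T y > 0 fails")
    n = len(s)
    return [[B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / ys
             for j in range(n)] for i in range(n)]

# Hypothetical step and gradient change with s^T y = 3 > 0.
B0 = [[1.0, 0.0], [0.0, 1.0]]
s_k = [1.0, 0.5]
y_k = [2.0, 2.0]

B_sr1 = sr1_update(B0, s_k, y_k)
B_bfgs = bfgs_update(B0, s_k, y_k)

print("SR1  B s:", matvec(B_sr1, s_k))    # should reproduce y_k
print("BFGS B s:", matvec(B_bfgs, s_k))   # should reproduce y_k
```

Both updated matrices are symmetric and map $s_k$ to $y_k$, and the BFGS matrix stays positive definite here because $B_0$ is positive definite and $s_k^T y_k > 0$.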

The quasi-Newton search direction is obtained by using $B_k$ in place of the exact Hessian in the formula (2.15), that is,

$$
p_k = -B_k^{-1} \nabla f_k. \tag{2.20}
$$

Some practical implementations of quasi-Newton methods avoid the need to factorize $B_k$ at each iteration by updating the inverse of $B_k$, instead of $B_k$ itself. In fact, the formula equivalent to the BFGS update (2.19), applied to the inverse approximation $H_k \stackrel{\rm def}{=} B_k^{-1}$, is

$$
H_{k+1} = \left(I - \rho_k s_k y_k^T\right) H_k \left(I - \rho_k y_k s_k^T\right) + \rho_k s_k s_k^T, \quad \rho_k = \frac{1}{y_k^T s_k}. \tag{2.21}
$$

Calculation of $p_k$ can then be performed by using the formula $p_k = -H_k \nabla f_k$. This matrix-vector multiplication is simpler than the factorization/back-substitution procedure that is needed to implement the formula (2.20).
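To see that (2.21) really tracks the inverse of the BFGS matrix, one can update $B_k$ by (2.19) and $H_k$ by (2.21) from matching starting points and multiply the results. The sketch below does this with plain Python lists; the data `s_k`, `y_k`, and `grad_next` are hypothetical values chosen for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def matmul(A, B):
    cols = list(zip(*B))
    return [[dot(row, col) for col in cols] for row in A]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def bfgs_update(B, s, y):
    """Update (2.19) for a symmetric approximation B."""
    Bs, n = matvec(B, s), len(s)
    sBs, ys = dot(s, Bs), dot(y, s)
    return [[B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / ys
             for j in range(n)] for i in range(n)]

def bfgs_inverse_update(H, s, y):
    """Update (2.21): (I - rho s y^T) H (I - rho y s^T) + rho s s^T."""
    n = len(s)
    rho = 1.0 / dot(y, s)
    I = identity(n)
    left = [[I[i][j] - rho * s[i] * y[j] for j in range(n)] for i in range(n)]
    right = [[I[i][j] - rho * y[i] * s[j] for j in range(n)] for i in range(n)]
    core = matmul(matmul(left, H), right)
    return [[core[i][j] + rho * s[i] * s[j] for j in range(n)] for i in range(n)]

def quasi_newton_direction(H, grad):
    """p = -H grad: one matrix-vector product, no factorization."""
    return [-v for v in matvec(H, grad)]

# Hypothetical data; B0 = H0 = I so that H0 is the inverse of B0.
s_k, y_k = [1.0, 0.5], [2.0, 2.0]
B1 = bfgs_update(identity(2), s_k, y_k)
H1 = bfgs_inverse_update(identity(2), s_k, y_k)
product = matmul(H1, B1)

grad_next = [1.0, -1.0]  # illustrative gradient at the new iterate
p_next = quasi_newton_direction(H1, grad_next)

print("H1 * B1 =", product)   # close to the identity matrix
print("H1 y_k  =", matvec(H1, y_k), "(should equal s_k)")
print("p_next  =", p_next)
```

The product is the identity up to rounding error, and $H_{k+1} y_k = s_k$ is the secant equation written in terms of the inverse approximation.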

Two variants of quasi-Newton methods designed to solve large problems—partially separable and limited-memory updating—are described in Chapter 7.

The last class of search directions we preview here is that generated by nonlinear conjugate gradient methods. They have the form

$$
p_k = -\nabla f(x_k) + \beta_k p_{k-1},
$$

where $\beta_k$ is a scalar that ensures that $p_k$ and $p_{k-1}$ are conjugate, an important concept in the minimization of quadratic functions that will be defined in Chapter 5. Conjugate gradient methods were originally designed to solve systems of linear equations $Ax = b$, where the coefficient matrix $A$ is symmetric and positive definite. The problem of solving this linear system is equivalent to the problem of minimizing the convex quadratic function defined by

$$
\phi(x) = \tfrac{1}{2} x^T A x - b^T x,
$$

so it was natural to investigate extensions of these algorithms to more general types of unconstrained minimization problems. In general, nonlinear conjugate gradient directions are much more effective than the steepest descent direction and are almost as simple to compute. These methods do not attain the fast convergence rates of Newton or quasi-Newton methods, but they have the advantage of not requiring storage of matrices. An extensive discussion of nonlinear conjugate gradient methods is given in Chapter 5.
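The equivalence between solving $Ax = b$ and minimizing $\phi$ rests on the identity $\nabla \phi(x) = Ax - b$. The sketch below checks it on a hypothetical symmetric positive definite $2 \times 2$ system; it does not implement a conjugate gradient iteration, whose choice of $\beta_k$ is deferred to Chapter 5.

```python
def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

# Hypothetical symmetric positive definite system A x = b.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]

def phi(x):
    """phi(x) = 0.5 x^T A x - b^T x."""
    Ax = matvec(A, x)
    return (0.5 * sum(xi * axi for xi, axi in zip(x, Ax))
            - sum(bi * xi for bi, xi in zip(b, x)))

def grad_phi(x):
    """grad phi(x) = A x - b, the residual of the linear system."""
    return [axi - bi for axi, bi in zip(matvec(A, x), b)]

# Solve the 2x2 system by Cramer's rule.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
x_star = [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
          (A[0][0] * b[1] - A[1][0] * b[0]) / det]

print("solution of A x = b:", x_star)
print("gradient of phi there:", grad_phi(x_star))

# phi is larger at a few nearby points, as expected at the minimizer.
for d in ([0.1, 0.0], [0.0, -0.1], [0.05, 0.05]):
    nearby = [xi + di for xi, di in zip(x_star, d)]
    print(nearby, phi(nearby) > phi(x_star))
```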

All of the search directions discussed so far can be used directly in a line search framework. They give rise to the steepest descent, Newton, quasi-Newton, and conjugate gradient line search methods. All except conjugate gradients have an analogue in the trust-region framework, as we now discuss.

MODELS FOR TRUST-REGION METHODS

If we set $B_k = 0$ in (2.12) and define the trust region using the Euclidean norm, the trust-region subproblem (2.11) becomes

$$
\min_p \; f_k + p^T \nabla f_k \quad \text{subject to } \|p\|_2 \le \Delta_k.
$$

We can write the solution to this problem in closed form as

$$
p_k = -\frac{\Delta_k \nabla f_k}{\|\nabla f_k\|}.
$$

This is simply a steepest descent step in which the step length is determined by the trust-region radius; the trust-region and line search approaches are essentially the same in this case.
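The closed-form solution can be checked by brute force. The following sketch assumes hypothetical values for $f_k$, $\nabla f_k$, and the radius `delta`, and compares the linear model at the closed-form step with its values on a grid of points inside the trust region.

```python
import math

def trust_region_linear_step(grad, delta):
    """Minimizer of f_k + p^T grad over ||p||_2 <= delta, for grad != 0."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        raise ValueError("gradient is zero; the linear model is constant")
    return [-delta * g / norm for g in grad]

def linear_model(f_k, grad, p):
    return f_k + sum(pi * gi for pi, gi in zip(p, grad))

# Hypothetical data for illustration.
f_k, grad_k, delta = 2.0, [3.0, -4.0], 0.5
p_k = trust_region_linear_step(grad_k, delta)
m_at_p_k = linear_model(f_k, grad_k, p_k)

# Polar grid of trial steps with radius at most delta.
trial_values = [
    linear_model(f_k, grad_k,
                 [r * math.cos(math.radians(d)), r * math.sin(math.radians(d))])
    for r in (0.25 * delta, 0.5 * delta, 0.75 * delta, delta)
    for d in range(0, 360, 5)
]

print("closed-form step:", p_k)
print("model value there:", m_at_p_k)           # f_k - delta * ||grad_k||
print("best value on the grid:", min(trial_values))
```

The step has length exactly $\Delta_k$ and points along $-\nabla f_k$, and no grid point inside the region gives a lower model value.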

A more interesting trust-region algorithm is obtained by choosing $B_k$ to be the exact Hessian $\nabla^2 f_k$ in the quadratic model (2.12). Because of the trust-region restriction $\|p\|_2 \le \Delta_k$, the subproblem (2.11) is guaranteed to have a solution $p_k$ even when $\nabla^2 f_k$ is not positive definite, as we see in Figure 2.4. The trust-region Newton method has proved to be highly effective in practice, as we discuss in Chapter 7.

If the matrix $B_k$ in the quadratic model function $m_k$ of (2.12) is defined by means of a quasi-Newton approximation, we obtain a trust-region quasi-Newton method.

SCALING

The performance of an algorithm may depend crucially on how the problem is formulated. One important issue in problem formulation is scaling. In unconstrained optimization, a problem is said to be poorly scaled if changes to $x$ in a certain direction produce much larger variations in the value of $f$ than do changes to $x$ in another direction. A simple example is provided by the function $f(x) = 10^9 x_1^2 + x_2^2$, which is very sensitive to small changes in $x_1$ but not so sensitive to perturbations in $x_2$.

Poorly scaled functions arise, for example, in simulations of physical and chemical systems where different processes are taking place at very different rates. To be more specific, consider a chemical system in which four reactions occur. Associated with each reaction is a rate constant that describes the speed at which the reaction takes place. The optimization problem is to find values for these rate constants by observing the concentrations of each chemical in the system at different times. The four constants differ greatly in magnitude, since the reactions take place at vastly different speeds. Suppose we have the following rough estimates for the final values of the constants, each correct to within, say, an order of magnitude:

$$
x_1 \approx 10^{-10}, \quad x_2 \approx x_3 \approx 1, \quad x_4 \approx 10^5.
$$

Figure 2.7 Poorly scaled and well scaled problems, and performance of the steepest descent direction.

Before solving this problem we could introduce a new variable $z$ defined by

$$
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
=
\begin{bmatrix}
10^{-10} & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 10^{5}
\end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix},
$$

and then define and solve the optimization problem in terms of the new variable $z$. The optimal values of $z$ will be within about an order of magnitude of 1, making the solution more balanced. This kind of scaling of the variables is known as diagonal scaling.
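In code, diagonal scaling amounts to an elementwise change of variables wrapped around the objective. The sketch below assumes the scaling $x = Dz$ with $D = \mathrm{diag}(10^{-10}, 1, 1, 10^5)$ built from the rough estimates above; the "true" rate constants `x_true` and the helper names are hypothetical and serve only to illustrate the effect.

```python
# Diagonal entries of D, taken from the rough estimates of the rate constants.
SCALES = [1e-10, 1.0, 1.0, 1e5]

def x_from_z(z):
    """Map scaled variables back to the original ones: x = D z."""
    return [d * zi for d, zi in zip(SCALES, z)]

def z_from_x(x):
    """Map original variables to scaled ones: z = D^{-1} x."""
    return [xi / d for d, xi in zip(SCALES, x)]

def scaled_objective(f):
    """Return the objective as a function of z, i.e. z -> f(D z)."""
    return lambda z: f(x_from_z(z))

# Hypothetical "true" constants, each within an order of magnitude of its estimate.
x_true = [3e-10, 0.5, 2.0, 4e4]
z_true = z_from_x(x_true)

print("original variables:", x_true)   # magnitudes span about 15 orders
print("scaled variables:  ", z_true)   # all within an order of magnitude of 1
```

An optimizer would then be applied to `scaled_objective(f)` in place of `f`, and the result mapped back with `x_from_z`.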

Scaling is performed (sometimes unintentionally) when the units used to represent variables are changed. During the modeling process, we may decide to change the units of some variables, say from meters to millimeters. If we do, the range of those variables and their size relative to the other variables will both change.

Some optimization algorithms, such as steepest descent, are sensitive to poor scaling, while others, such as Newton's method, are unaffected by it. Figure 2.7 shows the contours of two convex nearly quadratic functions, the first of which is poorly scaled, while the second is well scaled. For the poorly scaled problem, the one with highly elongated contours, the steepest descent direction does not yield much reduction in the function, while for the well-scaled problem it performs much better. In both cases, Newton's method will produce a much better step, since the second-order quadratic model ($m_k$ in (2.14)) happens to be a good approximation of $f$.
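The contrast can be reproduced on the poorly scaled function $f(x) = 10^9 x_1^2 + x_2^2$ introduced earlier. The sketch below takes one steepest descent step and one Newton step from a hypothetical starting point; the steepest descent step uses the exact minimizing step length $\alpha = g^T g / (g^T H g)$, which is valid for quadratics and is an assumption of this example rather than a rule stated in the text.

```python
HESSIAN_DIAG = [2e9, 2.0]  # Hessian of f is diag(2e9, 2)

def f(x):
    return 1e9 * x[0] ** 2 + x[1] ** 2

def grad_f(x):
    return [2e9 * x[0], 2.0 * x[1]]

def steepest_descent_step(x):
    """One step along -grad f with exact line search (quadratic case)."""
    g = grad_f(x)
    alpha = (sum(gi * gi for gi in g)
             / sum(h * gi * gi for h, gi in zip(HESSIAN_DIAG, g)))
    return [xi - alpha * gi for xi, gi in zip(x, g)]

def newton_step(x):
    """One full Newton step x - H^{-1} grad f (H is diagonal here)."""
    g = grad_f(x)
    return [xi - gi / h for xi, gi, h in zip(x, g, HESSIAN_DIAG)]

x0 = [1e-9, 1.0]  # hypothetical starting point
f_start = f(x0)
f_sd = f(steepest_descent_step(x0))
f_newton = f(newton_step(x0))

print("f at start:                ", f_start)
print("f after steepest descent:  ", f_sd)
print("f after one Newton step:   ", f_newton)
```

From this starting point the best possible step along $-\nabla f$ reduces $f$ only in the ninth decimal place, while the Newton step lands essentially on the minimizer.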

Algorithms that are not sensitive to scaling are preferable, because they can handle poor problem formulations in a more robust fashion. In designing complete algorithms, we try to incorporate scale invariance into all aspects of the algorithm, including the line search or trust-region strategies and convergence tests. Generally speaking, it is easier to preserve scale invariance for line search algorithms than for trust-region algorithms.

EXERCISES

✐

2.1 Compute the gradient $\nabla f(x)$ and Hessian $\nabla^2 f(x)$ of the Rosenbrock function

$$
f(x) = 100 (x_2 - x_1^2)^2 + (1 - x_1)^2. \tag{2.22}
$$

Show that $x^* = (1, 1)^T$ is the only local minimizer of this function, and that the Hessian matrix at that point is positive definite.

✐

2.2 Show that the function $f(x) = 8x_1 + 12x_2 + x_1^2 - 2x_2^2$ has only one stationary point, and that it is neither a maximum nor a minimum, but a saddle point. Sketch the contour lines of $f$.

✐

2.3 Let $a$ be a given $n$-vector, and $A$ be a given $n \times n$ symmetric matrix. Compute the gradient and Hessian of $f_1(x) = a^T x$ and $f_2(x) = x^T A x$.

✐

2.4 Write the second-order Taylor expansion (2.6) for the function $\cos(1/x)$ around a nonzero point $x$, and the third-order Taylor expansion of $\cos(x)$ around any point $x$. Evaluate the second expansion for the specific case of $x = 1$.

✐

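2.5 Consider the function $f : \mathbb{R}^2 \to \mathbb{R}$ defined by $f(x) = \|x\|^2$. Show that the sequence of iterates $\{x_k\}$ defined by

$$
x_k = \left(1 + \frac{1}{2^k}\right) \begin{bmatrix} \cos k \\ \sin k \end{bmatrix}
$$

satisfies $f(x_{k+1}) < f(x_k)$ for $k = 0, 1, 2, \ldots$. Show that every point on the unit circle $\{x \mid \|x\|^2 = 1\}$ is a limit point for $\{x_k\}$. Hint: Every value $\theta \in [0, 2\pi]$ is a limit point of the subsequence $\{\xi_k\}$ defined by

$$
\xi_k = k \ (\mathrm{mod}\ 2\pi) = k - 2\pi \left\lfloor \frac{k}{2\pi} \right\rfloor,
$$

where the operator $\lfloor \cdot \rfloor$ denotes rounding down to the next integer.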

✐

2.6 Prove that all isolated local minimizers are strict. (Hint: Take an isolated local minimizer $x^*$ and a neighborhood $\mathcal{N}$. Show that for any $x \in \mathcal{N}$, $x \neq x^*$, we must have $f(x) > f(x^*)$.)

✐

2.7 Suppose that $f(x) = x^T Q x$, where $Q$ is an $n \times n$ symmetric positive semidefinite matrix. Show using the definition (1.4) that $f(x)$ is convex on the domain $\mathbb{R}^n$. Hint: It may be convenient to prove the following equivalent inequality:

$$
f(y + \alpha (x - y)) - \alpha f(x) - (1 - \alpha) f(y) \le 0,
$$

for all $\alpha \in [0, 1]$ and all $x, y \in \mathbb{R}^n$.

✐

2.8 Suppose that f is a convex function. Show that the set of global minimizers of f is a convex set.

✐

2.9 Consider the function $f(x_1, x_2) = (x_1 + x_2^2)^2$. At the point $x^T = (1, 0)$ we consider the search direction $p^T = (-1, 1)$. Show that $p$ is a descent direction and find all minimizers of the problem (2.10).

✐

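2.10 Suppose that $\tilde{f}(z) = f(x)$, where $x = Sz + s$ for some $S \in \mathbb{R}^{n \times n}$ and $s \in \mathbb{R}^n$. Show that

$$
\nabla \tilde{f}(z) = S^T \nabla f(x), \quad \nabla^2 \tilde{f}(z) = S^T \nabla^2 f(x) S.
$$

(Hint: Use the chain rule to express $d\tilde{f}/dz_j$ in terms of $df/dx_i$ and $dx_i/dz_j$ for all $i, j = 1, 2, \ldots, n$.)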

✐

2.11 Show that the symmetric rank-one update (2.18) and the BFGS update (2.19) are scale-invariant if the initial Hessian approximations $B_0$ are chosen appropriately. That is, using the notation of the previous exercise, show that if these methods are applied to $f(x)$ starting from $x_0 = Sz_0 + s$ with initial Hessian $B_0$, and to $\tilde{f}(z)$ starting from $z_0$ with initial Hessian $S^T B_0 S$, then all iterates are related by $x_k = Sz_k + s$. (Assume for simplicity that the methods take unit step lengths.)

✐

2.12 Suppose that a function $f$ of two variables is poorly scaled at the solution $x^*$. Write two Taylor expansions of $f$ around $x^*$, one along each coordinate direction, and use them to show that the Hessian $\nabla^2 f(x^*)$ is ill-conditioned.

✐

2.13 (For this and the following three questions, refer to the material on "Rates of Convergence" in Section A.2 of the Appendix.) Show that the sequence $x_k = 1/k$ is not Q-linearly convergent, though it does converge to zero. (This is called sublinear convergence.)

✐

2.14 Show that the sequence $x_k = 1 + (0.5)^{2^k}$ is Q-quadratically convergent to 1.

✐

2.15 Does the sequence $x_k = 1/k!$ converge Q-superlinearly? Q-quadratically?

✐

2.16 Consider the sequence $\{x_k\}$ defined by

$$
x_k = \begin{cases} \left(\tfrac{1}{4}\right)^{2^k}, & k \text{ even}, \\ (x_{k-1})/k, & k \text{ odd}. \end{cases}
$$

Is this sequence Q-superlinearly convergent? Q-quadratically convergent? R-quadratically convergent?

CHAPTER 3

Line Search

Source: Numerical Optimization, Second Edition, by Jorge Nocedal and Stephen J. Wright (pages 39-49).