
SEARCH DIRECTIONS FOR LINE SEARCH METHODS

Fundamentals of Unconstrained

The steepest descent direction $-\nabla f_k$ is the most obvious choice of search direction for a line search method. It is intuitive: among all the directions we could move from $x_k$, it is the one along which $f$ decreases most rapidly. To verify this claim, we appeal again to Taylor’s theorem (Theorem 2.1), which tells us that for any search direction $p$ and step-length parameter $\alpha$, we have

$$f(x_k + \alpha p) = f(x_k) + \alpha p^T \nabla f_k + \tfrac{1}{2} \alpha^2 p^T \nabla^2 f(x_k + tp)\, p, \quad \text{for some } t \in (0, \alpha)$$

(see (2.6)). The rate of change in $f$ along the direction $p$ at $x_k$ is simply the coefficient of $\alpha$, namely, $p^T \nabla f_k$. Hence, the unit direction $p$ of most rapid decrease is the solution to the problem

$$\min_p \; p^T \nabla f_k, \quad \text{subject to } \|p\| = 1. \tag{2.13}$$

Since $p^T \nabla f_k = \|p\| \, \|\nabla f_k\| \cos\theta = \|\nabla f_k\| \cos\theta$, where $\theta$ is the angle between $p$ and $\nabla f_k$, it is easy to see that the minimizer is attained when $\cos\theta = -1$ and

$$p = -\nabla f_k / \|\nabla f_k\|,$$

as claimed. As we illustrate in Figure 2.5, this direction is orthogonal to the contours of the function.
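As a numerical sanity check on (2.13), the following Python sketch compares the closed-form direction $-\nabla f_k / \|\nabla f_k\|$ with a brute-force scan over unit vectors in the plane. The test function $f(x) = x_1^2 + 10 x_2^2$, the point $x_k = (1, 1)^T$, and all helper names are illustrative assumptions, not part of the text.

```python
import math

def grad_f(x):
    # Gradient of the illustrative function f(x) = x1^2 + 10*x2^2 (an assumption).
    return [2.0 * x[0], 20.0 * x[1]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x_k = [1.0, 1.0]
g = grad_f(x_k)
g_norm = math.sqrt(dot(g, g))

# Closed-form solution of (2.13): the normalized negative gradient.
p_sd = [-gi / g_norm for gi in g]
slope_sd = dot(p_sd, g)          # directional derivative along p_sd

# Brute-force check: sample unit vectors p(theta) = (cos theta, sin theta)
# and keep the smallest directional derivative p^T grad f.
n_samples = 3600
best_slope = min(
    dot([math.cos(2.0 * math.pi * i / n_samples),
         math.sin(2.0 * math.pi * i / n_samples)], g)
    for i in range(n_samples)
)

print(slope_sd, best_slope, -g_norm)
```

No sampled unit vector should give a smaller slope than `p_sd`, whose slope equals $-\|\nabla f_k\|$ up to rounding.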

The steepest descent method is a line search method that moves along $p_k = -\nabla f_k$ at every step. It can choose the step length $\alpha_k$ in a variety of ways, as we discuss in Chapter 3. One advantage of the steepest descent direction is that it requires calculation of the gradient $\nabla f_k$ but not of second derivatives. However, it can be excruciatingly slow on difficult problems.
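To make the remark about slowness concrete, here is a small Python sketch that counts steepest descent iterations on the quadratic $f(x) = \tfrac{1}{2}(x_1^2 + \kappa x_2^2)$. The test function, the starting points, the tolerance, and the exact line search used for $\alpha_k$ are all assumptions chosen for illustration; the text leaves the choice of step length to Chapter 3.

```python
import math

def steepest_descent_iterations(kappa, x0, tol=1e-8, max_iter=100_000):
    """Count steepest descent iterations on f(x) = 0.5*(x1^2 + kappa*x2^2).

    The step length is the exact minimizer of f along -grad f, which has a
    closed form for a quadratic. This step rule is an assumption made for
    the sketch, not the method prescribed by the text.
    """
    x = list(x0)
    for k in range(max_iter):
        g = [x[0], kappa * x[1]]                  # gradient of f at x
        if math.hypot(g[0], g[1]) <= tol:
            return k
        # For f(x) = 0.5 x^T A x with A = diag(1, kappa):
        # alpha = (g^T g) / (g^T A g).
        alpha = (g[0] ** 2 + g[1] ** 2) / (g[0] ** 2 + kappa * g[1] ** 2)
        x = [x[0] - alpha * g[0], x[1] - alpha * g[1]]
    return max_iter

well_conditioned = steepest_descent_iterations(1.0, [1.0, 1.0])
ill_conditioned = steepest_descent_iterations(100.0, [100.0, 1.0])
print(well_conditioned, ill_conditioned)
```

With $\kappa = 1$ the contours are circles and a single step reaches the minimizer. With $\kappa = 100$ the iterates zigzag across a narrow valley and the count grows by orders of magnitude.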

Figure 2.5 Steepest descent direction for a function of two variables.

Figure 2.6 A downhill direction $p_k$.

Line search methods may use search directions other than the steepest descent direction. In general, any descent direction (one that makes an angle of strictly less than $\pi/2$ radians with $-\nabla f_k$) is guaranteed to produce a decrease in $f$, provided that the step length is sufficiently small (see Figure 2.6). We can verify this claim by using Taylor’s theorem.

From (2.6), we have that

$$f(x_k + \epsilon p_k) = f(x_k) + \epsilon p_k^T \nabla f_k + O(\epsilon^2).$$

When $p_k$ is a downhill direction, the angle $\theta_k$ between $p_k$ and $\nabla f_k$ has $\cos\theta_k < 0$, so that

$$p_k^T \nabla f_k = \|p_k\| \, \|\nabla f_k\| \cos\theta_k < 0.$$

It follows that $f(x_k + \epsilon p_k) < f(x_k)$ for all positive but sufficiently small values of $\epsilon$.

Another important search direction, perhaps the most important one of all, is the Newton direction. This direction is derived from the second-order Taylor series approximation to $f(x_k + p)$, which is

$$f(x_k + p) \approx f_k + p^T \nabla f_k + \tfrac{1}{2} p^T \nabla^2 f_k \, p \stackrel{\text{def}}{=} m_k(p). \tag{2.14}$$

Assuming for the moment that $\nabla^2 f_k$ is positive definite, we obtain the Newton direction by finding the vector $p$ that minimizes $m_k(p)$. By simply setting the derivative of $m_k(p)$ to zero, we obtain the following explicit formula:

$$p_k^N = -\left( \nabla^2 f_k \right)^{-1} \nabla f_k. \tag{2.15}$$

The Newton direction is reliable when the difference between the true function $f(x_k + p)$ and its quadratic model $m_k(p)$ is not too large. By comparing (2.14) with (2.6), we see that the only difference between these functions is that the matrix $\nabla^2 f(x_k + tp)$ in the third term of the expansion has been replaced by $\nabla^2 f_k$. If $\nabla^2 f$ is sufficiently smooth, this difference introduces a perturbation of only $O(\|p\|^3)$ into the expansion, so that when $\|p\|$ is small, the approximation $f(x_k + p) \approx m_k(p)$ is quite accurate.

The Newton direction can be used in a line search method when $\nabla^2 f_k$ is positive definite, for in this case we have

$$\nabla f_k^T p_k^N = -(p_k^N)^T \nabla^2 f_k \, p_k^N \le -\sigma_k \|p_k^N\|^2$$

for some $\sigma_k > 0$. Unless the gradient $\nabla f_k$ (and therefore the step $p_k^N$) is zero, we have that $\nabla f_k^T p_k^N < 0$, so the Newton direction is a descent direction.
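The following Python sketch computes a Newton direction from (2.15) by solving $\nabla^2 f_k \, p = -\nabla f_k$ and checks the descent property numerically. The function $f(x) = x_1^4 + x_1 x_2 + (1 + x_2)^2$, the point $x_k = (1, 1)^T$, and the helper names are assumptions made for the example; the 2-by-2 solver uses Cramer's rule only to keep the sketch dependency-free.

```python
def gradient(x):
    """Gradient of the illustrative function f(x) = x1^4 + x1*x2 + (1 + x2)^2."""
    return [4.0 * x[0] ** 3 + x[1], x[0] + 2.0 * (1.0 + x[1])]

def hessian(x):
    """Hessian of the same function."""
    return [[12.0 * x[0] ** 2, 1.0], [1.0, 2.0]]

def is_positive_definite_2x2(a):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    return a[0][0] > 0.0 and a[0][0] * a[1][1] - a[0][1] * a[1][0] > 0.0

def solve_2x2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0.0:
        raise ValueError("matrix is singular")
    return [
        (b[0] * a[1][1] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - a[1][0] * b[0]) / det,
    ]

x_k = [1.0, 1.0]
g = gradient(x_k)
h = hessian(x_k)
hessian_is_pd = is_positive_definite_2x2(h)

# Newton direction (2.15): solve (hessian) p = -(gradient).
p_newton = solve_2x2(h, [-g[0], -g[1]])

# Descent check: grad^T p should be negative and equal to -p^T H p.
slope = g[0] * p_newton[0] + g[1] * p_newton[1]
hp = [h[0][0] * p_newton[0] + h[0][1] * p_newton[1],
      h[1][0] * p_newton[0] + h[1][1] * p_newton[1]]
curvature = p_newton[0] * hp[0] + p_newton[1] * hp[1]
print(hessian_is_pd, p_newton, slope, -curvature)
```

Solving the linear system rather than forming the inverse explicitly is the usual practice; the two are equivalent in exact arithmetic.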

Unlike the steepest descent direction, the Newton direction has a “natural” step length of 1 associated with it. Most line search implementations of Newton’s method use the unit step $\alpha = 1$ where possible and adjust $\alpha$ only when it does not produce a satisfactory reduction in the value of $f$.

When $\nabla^2 f_k$ is not positive definite, the Newton direction may not even be defined, since $(\nabla^2 f_k)^{-1}$ may not exist. Even when it is defined, it may not satisfy the descent property $\nabla f_k^T p_k^N < 0$, in which case it is unsuitable as a search direction. In these situations, line search methods modify the definition of $p_k$ to make it satisfy the descent condition while retaining the benefit of the second-order information contained in $\nabla^2 f_k$. We describe these modifications in Chapter 3.

Methods that use the Newton direction have a fast rate of local convergence, typically quadratic. After a neighborhood of the solution is reached, convergence to high accuracy often occurs in just a few iterations. The main drawback of the Newton direction is the need for the Hessian $\nabla^2 f(x)$. Explicit computation of this matrix of second derivatives can sometimes be a cumbersome, error-prone, and expensive process. Finite-difference and automatic differentiation techniques described in Chapter 8 may be useful in avoiding the need to calculate second derivatives by hand.
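A one-dimensional Python sketch can illustrate what quadratic convergence looks like in practice. The function $f(x) = e^x - 2x$, whose minimizer is $x^* = \ln 2$, and the starting point $x_0 = 1$ are assumptions picked for the example; the iteration is the pure Newton step with unit step length.

```python
import math

def newton_minimize_1d(x0, steps):
    """Pure Newton iteration (unit step) for f(x) = exp(x) - 2x.

    f'(x) = exp(x) - 2 and f''(x) = exp(x) > 0, so the unique minimizer
    is x* = ln 2. Returns the list of errors |x_k - x*|.
    """
    x_star = math.log(2.0)
    x = x0
    errors = [abs(x - x_star)]
    for _ in range(steps):
        x = x - (math.exp(x) - 2.0) / math.exp(x)   # x - f'(x) / f''(x)
        errors.append(abs(x - x_star))
    return errors

errors = newton_minimize_1d(1.0, 4)
for k, err in enumerate(errors):
    print(k, err)
```

Each error is roughly proportional to the square of the previous one, so the number of correct digits about doubles per iteration until rounding error takes over.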

Quasi-Newton search directions provide an attractive alternative to Newton’s method in that they do not require computation of the Hessian and yet still attain a superlinear rate of convergence. In place of the true Hessian $\nabla^2 f_k$, they use an approximation $B_k$, which is updated after each step to take account of the additional knowledge gained during the step.

The updates make use of the fact that changes in the gradient $g$ provide information about the second derivative of $f$ along the search direction. By using the expression (2.5) from our statement of Taylor’s theorem, we have by adding and subtracting the term $\nabla^2 f(x) p$ that

$$\nabla f(x + p) = \nabla f(x) + \nabla^2 f(x) p + \int_0^1 \left[ \nabla^2 f(x + tp) - \nabla^2 f(x) \right] p \, dt.$$

Because $\nabla^2 f(\cdot)$ is continuous, the size of the final integral term is $o(\|p\|)$. By setting $x = x_k$ and $p = x_{k+1} - x_k$, we obtain

$$\nabla f_{k+1} = \nabla f_k + \nabla^2 f_k (x_{k+1} - x_k) + o(\|x_{k+1} - x_k\|).$$

When $x_k$ and $x_{k+1}$ lie in a region near the solution $x^*$, within which $\nabla^2 f$ is positive definite, the final term in this expansion is eventually dominated by the $\nabla^2 f_k (x_{k+1} - x_k)$ term, and we can write

$$\nabla^2 f_k (x_{k+1} - x_k) \approx \nabla f_{k+1} - \nabla f_k. \tag{2.16}$$

We choose the new Hessian approximation $B_{k+1}$ so that it mimics the property (2.16) of the true Hessian, that is, we require it to satisfy the following condition, known as the secant equation:

$$B_{k+1} s_k = y_k, \tag{2.17}$$

where

$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f_{k+1} - \nabla f_k.$$

Typically, we impose additional conditions on $B_{k+1}$, such as symmetry (motivated by symmetry of the exact Hessian), and a requirement that the difference between successive approximations $B_k$ and $B_{k+1}$ have low rank.

Two of the most popular formulae for updating the Hessian approximation $B_k$ are the symmetric-rank-one (SR1) formula, defined by

$$B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}, \tag{2.18}$$

and the BFGS formula, named after its inventors, Broyden, Fletcher, Goldfarb, and Shanno, which is defined by

$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}. \tag{2.19}$$

Note that the difference between the matrices $B_k$ and $B_{k+1}$ is a rank-one matrix in the case of (2.18) and a rank-two matrix in the case of (2.19). Both updates satisfy the secant equation and both maintain symmetry. One can show that the BFGS update (2.19) generates positive definite approximations whenever the initial approximation $B_0$ is positive definite and $s_k^T y_k > 0$. We discuss these issues further in Chapter 6.
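The two updates can be sketched in a few lines of dependency-free Python and checked against the secant equation (2.17). The vectors `s` and `y`, the identity starting matrix, and the small-denominator guard in `sr1_update` are assumptions added for illustration; matrices are stored as nested lists.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, v):
    return [dot(row, v) for row in m]

def add_outer(m, u, scale):
    """Return m + scale * u u^T for a square matrix m stored as nested lists."""
    n = len(u)
    return [[m[i][j] + scale * u[i] * u[j] for j in range(n)] for i in range(n)]

def sr1_update(b, s, y, eps=1e-12):
    """Symmetric-rank-one update (2.18)."""
    r = [yi - bsi for yi, bsi in zip(y, matvec(b, s))]     # y - B s
    denom = dot(r, s)
    if abs(denom) <= eps:
        # Safeguard added for this sketch; not part of formula (2.18).
        raise ZeroDivisionError("(y - B s)^T s is too close to zero")
    return add_outer(b, r, 1.0 / denom)

def bfgs_update(b, s, y):
    """BFGS update (2.19); b must be symmetric, with s^T B s and y^T s nonzero."""
    bs = matvec(b, s)                                      # B s
    b_minus = add_outer(b, bs, -1.0 / dot(s, bs))          # B - (B s)(B s)^T / s^T B s
    return add_outer(b_minus, y, 1.0 / dot(y, s))          # ... + y y^T / y^T s

b0 = [[1.0, 0.0], [0.0, 1.0]]        # initial approximation (identity)
s = [1.0, 0.5]                       # step x_{k+1} - x_k
y = [2.0, 1.5]                       # gradient change, chosen so that y^T s > 0

b_sr1 = sr1_update(b0, s, y)
b_bfgs = bfgs_update(b0, s, y)
print(matvec(b_sr1, s), matvec(b_bfgs, s))   # compare both with y
```

Both products `B s` agree with `y` up to rounding, and both updated matrices remain symmetric.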

The quasi-Newton search direction is obtained by using $B_k$ in place of the exact Hessian in the formula (2.15), that is,

$$p_k = -B_k^{-1} \nabla f_k. \tag{2.20}$$

Some practical implementations of quasi-Newton methods avoid the need to factorize $B_k$ at each iteration by updating the inverse of $B_k$, instead of $B_k$ itself. In fact, the formula equivalent to the BFGS update (2.19), applied to the inverse approximation $H_k \stackrel{\text{def}}{=} B_k^{-1}$, is

$$H_{k+1} = \left( I - \rho_k s_k y_k^T \right) H_k \left( I - \rho_k y_k s_k^T \right) + \rho_k s_k s_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k}. \tag{2.21}$$

Calculation of $p_k$ can then be performed by using the formula $p_k = -H_k \nabla f_k$. This matrix-vector multiplication is simpler than the factorization/back-substitution procedure that is needed to implement the formula (2.20).
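A short Python sketch can confirm numerically that the inverse-form update (2.21) tracks the inverse of the BFGS matrix from (2.19). The matrices `b0` and `h0`, the vectors `s`, `y`, and `grad`, and the helper names are assumptions chosen for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, v):
    return [dot(row, v) for row in m]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def bfgs_update(b, s, y):
    """Direct BFGS update (2.19) of the Hessian approximation B."""
    n = len(s)
    bs = matvec(b, s)
    sbs, ys = dot(s, bs), dot(y, s)
    return [[b[i][j] - bs[i] * bs[j] / sbs + y[i] * y[j] / ys
             for j in range(n)] for i in range(n)]

def inverse_bfgs_update(h, s, y):
    """Inverse-form update (2.21) of H = B^{-1}."""
    n = len(s)
    rho = 1.0 / dot(y, s)
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    left = [[eye[i][j] - rho * s[i] * y[j] for j in range(n)] for i in range(n)]
    right = [[eye[i][j] - rho * y[i] * s[j] for j in range(n)] for i in range(n)]
    core = matmul(matmul(left, h), right)
    return [[core[i][j] + rho * s[i] * s[j] for j in range(n)] for i in range(n)]

b0 = [[2.0, 0.0], [0.0, 1.0]]
h0 = [[0.5, 0.0], [0.0, 1.0]]       # exact inverse of b0
s = [1.0, 0.5]
y = [2.0, 1.5]

b1 = bfgs_update(b0, s, y)
h1 = inverse_bfgs_update(h0, s, y)
product = matmul(h1, b1)            # close to the identity if the two forms agree

grad = [1.0, -2.0]                  # hypothetical gradient at the new iterate
p = [-v for v in matvec(h1, grad)]  # search direction p = -H grad, no linear solve
print(product, p)
```

Forming `left` and `right` as full matrices keeps the sketch close to (2.21); an implementation concerned with cost would expand the product into rank-one corrections instead.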

Two variants of quasi-Newton methods designed to solve large problems—partially separable and limited-memory updating—are described in Chapter 7.

The last class of search directions we preview here is that generated by nonlinear conjugate gradient methods. They have the form

$$p_k = -\nabla f(x_k) + \beta_k p_{k-1},$$

where $\beta_k$ is a scalar that ensures that $p_k$ and $p_{k-1}$ are conjugate, an important concept in the minimization of quadratic functions that will be defined in Chapter 5. Conjugate gradient methods were originally designed to solve systems of linear equations $Ax = b$, where the coefficient matrix $A$ is symmetric and positive definite. The problem of solving this linear system is equivalent to the problem of minimizing the convex quadratic function defined by

$$\phi(x) = \tfrac{1}{2} x^T A x - b^T x,$$

so it was natural to investigate extensions of these algorithms to more general types of unconstrained minimization problems. In general, nonlinear conjugate gradient directions are much more effective than the steepest descent direction and are almost as simple to compute. These methods do not attain the fast convergence rates of Newton or quasi-Newton methods, but they have the advantage of not requiring storage of matrices. An extensive discussion of nonlinear conjugate gradient methods is given in Chapter 5.
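For the quadratic case, the direction formula $p_k = -\nabla \phi(x_k) + \beta_k p_{k-1}$ can be sketched as the standard linear conjugate gradient iteration. The particular choice of $\beta_k$ (the ratio of successive squared gradient norms), the exact step length, and the 2-by-2 test system are assumptions brought in for illustration; the text defers the definition of $\beta_k$ to Chapter 5.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, v):
    return [dot(row, v) for row in m]

def conjugate_gradient(a, b, x0, tol=1e-12):
    """Minimize phi(x) = 0.5 x^T A x - b^T x for symmetric positive definite A."""
    x = list(x0)
    r = [axi - bi for axi, bi in zip(matvec(a, x), b)]   # gradient of phi: A x - b
    p = [-ri for ri in r]                                # first direction: steepest descent
    for _ in range(len(b)):
        rr = dot(r, r)
        if rr ** 0.5 <= tol:
            break
        ap = matvec(a, p)
        alpha = rr / dot(p, ap)                          # exact minimizer of phi along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri + alpha * api for ri, api in zip(r, ap)] # updated gradient
        beta = dot(r, r) / rr                            # assumed formula for beta_k
        p = [-ri + beta * pi for ri, pi in zip(r, p)]    # p_k = -grad + beta_k p_{k-1}
    return x

a = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(a, b, [0.0, 0.0])
print(x, matvec(a, x))
```

Only matrix-vector products with $A$ and a handful of vectors are needed, so no matrix beyond $A$ itself is stored.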

All of the search directions discussed so far can be used directly in a line search framework. They give rise to the steepest descent, Newton, quasi-Newton, and conjugate gradient line search methods. All except conjugate gradients have an analogue in the trust-region framework, as we now discuss.

MODELS FOR TRUST-REGION METHODS

If we set $B_k = 0$ in (2.12) and define the trust region using the Euclidean norm, the trust-region subproblem (2.11) becomes

$$\min_p \; f_k + p^T \nabla f_k \quad \text{subject to } \|p\|_2 \le \Delta_k.$$

We can write the solution to this problem in closed form as

$$p_k = -\frac{\Delta_k \nabla f_k}{\|\nabla f_k\|}.$$

This is simply a steepest descent step in which the step length is determined by the trust-region radius; the trust-region and line search approaches are essentially the same in this case.
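The closed form for this linear-model subproblem follows from the Cauchy-Schwarz inequality. The short derivation below fills in that step, assuming $\nabla f_k \ne 0$ and writing $\|\cdot\|$ for the Euclidean norm.

```latex
\begin{align*}
p^T \nabla f_k
  &\ge -\|p\| \, \|\nabla f_k\|
   \ge -\Delta_k \|\nabla f_k\|
  && \text{for every } p \text{ with } \|p\| \le \Delta_k, \\
p_k^T \nabla f_k
  &= -\frac{\Delta_k \, \nabla f_k^T \nabla f_k}{\|\nabla f_k\|}
   = -\Delta_k \|\nabla f_k\|
  && \text{for } p_k = -\frac{\Delta_k \nabla f_k}{\|\nabla f_k\|}.
\end{align*}
```

Since $\|p_k\| = \Delta_k$, the step $p_k$ is feasible and attains the lower bound, so it minimizes the linear model over the trust region.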

A more interesting trust-region algorithm is obtained by choosing $B_k$ to be the exact Hessian $\nabla^2 f_k$ in the quadratic model (2.12). Because of the trust-region restriction $\|p\|_2 \le \Delta_k$, the subproblem (2.11) is guaranteed to have a solution $p_k$ even when $\nabla^2 f_k$ is not positive definite, as we see in Figure 2.4. The trust-region Newton method has proved to be highly effective in practice, as we discuss in Chapter 7.

If the matrix $B_k$ in the quadratic model function $m_k$ of (2.12) is defined by means of a quasi-Newton approximation, we obtain a trust-region quasi-Newton method.

SCALING

The performance of an algorithm may depend crucially on how the problem is formulated. One important issue in problem formulation is scaling. In unconstrained optimization, a problem is said to be poorly scaled if changes to $x$ in a certain direction produce much larger variations in the value of $f$ than do changes to $x$ in another direction. A simple example is provided by the function $f(x) = 10^9 x_1^2 + x_2^2$, which is very sensitive to small changes in $x_1$ but not so sensitive to perturbations in $x_2$.

Figure 2.7 Poorly scaled and well scaled problems, and performance of the steepest descent direction.

Poorly scaled functions arise, for example, in simulations of physical and chemical systems where different processes are taking place at very different rates. To be more specific, consider a chemical system in which four reactions occur. Associated with each reaction is a rate constant that describes the speed at which the reaction takes place. The optimization problem is to find values for these rate constants by observing the concentrations of each chemical in the system at different times. The four constants differ greatly in magnitude, since the reactions take place at vastly different speeds. Suppose we have the following rough estimates for the final values of the constants, each correct to within, say, an order of magnitude:

$$x_1 \approx 10^{-10}, \quad x_2 \approx x_3 \approx 1, \quad x_4 \approx 10^5.$$

Before solving this problem we could introduce a new variable $z$ defined by

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 10^{-10} & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 10^5 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix},$$

and then define and solve the optimization problem in terms of the new variable $z$. The optimal values of $z$ will be within about an order of magnitude of 1, making the solution more balanced. This kind of scaling of the variables is known as diagonal scaling.
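In code, diagonal scaling amounts to wrapping the objective so that the optimizer only ever sees the scaled variable. The Python sketch below assumes the scaling $x_i = d_i z_i$ with $d = (10^{-10}, 1, 1, 10^5)$, read off from the magnitude estimates above; the function names and the sample rate constants are hypothetical.

```python
# Rough magnitudes of x1..x4, taken from the estimates in the text.
SCALES = [1e-10, 1.0, 1.0, 1e5]

def to_original(z):
    """Map scaled variables z to the original variables, x_i = d_i * z_i."""
    return [d * zi for d, zi in zip(SCALES, z)]

def to_scaled(x):
    """Map original variables x to scaled variables, z_i = x_i / d_i."""
    return [xi / d for d, xi in zip(SCALES, x)]

def scale_objective(f):
    """Wrap an objective f(x) so that an optimizer can work with z instead."""
    return lambda z: f(to_original(z))

# Hypothetical rate constants, each within an order of magnitude of the estimates.
x_example = [3e-10, 0.5, 2.0, 4e5]
z_example = to_scaled(x_example)
print(z_example)
```

An optimizer would then be applied to `scale_objective(f)` and its result mapped back with `to_original`.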

Scaling is performed (sometimes unintentionally) when the units used to represent variables are changed. During the modeling process, we may decide to change the units of some variables, say from meters to millimeters. If we do, the range of those variables and their size relative to the other variables will both change.

Some optimization algorithms, such as steepest descent, are sensitive to poor scaling, while others, such as Newton’s method, are unaffected by it. Figure 2.7 shows the contours of two convex nearly quadratic functions, the first of which is poorly scaled, while the second is well scaled. For the poorly scaled problem, the one with highly elongated contours, the steepest descent direction does not yield much reduction in the function, while for the well-scaled problem it performs much better. In both cases, Newton’s method will produce a much better step, since the second-order quadratic model ($m_k$ in (2.14)) happens to be a good approximation of $f$.
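The difference in sensitivity can be checked numerically on the function $f(x) = 10^9 x_1^2 + x_2^2$ from earlier in this section. The Python sketch below assumes a diagonal change of variables $x_1 = \sigma z_1$, $x_2 = z_2$ with $\sigma = 10^{-4.5}$, computes Newton and steepest descent steps in both sets of variables, and maps the $z$-steps back to $x$-space. The point $x = (10^{-4}, 1)^T$ and all names are illustrative assumptions.

```python
S1 = 10.0 ** -4.5          # x1 = S1 * z1, x2 = z2 (a diagonal change of variables)

def grad_x(x):
    return [2.0e9 * x[0], 2.0 * x[1]]      # gradient of f(x) = 1e9*x1^2 + x2^2

HESS_X = [2.0e9, 2.0]                      # diagonal of the Hessian of f

def grad_z(z):
    # Chain rule for the scaled function f~(z) = f(S z): grad f~ = S^T grad f.
    g = grad_x([S1 * z[0], z[1]])
    return [S1 * g[0], g[1]]

HESS_Z = [S1 * S1 * 2.0e9, 2.0]            # diagonal of S^T (hess f) S

x = [1.0e-4, 1.0]
z = [x[0] / S1, x[1]]

# Newton steps in each set of variables (diagonal Hessians, so divide entrywise).
newton_x = [-g / h for g, h in zip(grad_x(x), HESS_X)]
newton_z = [-g / h for g, h in zip(grad_z(z), HESS_Z)]
newton_z_mapped = [S1 * newton_z[0], newton_z[1]]

# Steepest descent directions in each set of variables.
sd_x = [-g for g in grad_x(x)]
sd_z = [-g for g in grad_z(z)]
sd_z_mapped = [S1 * sd_z[0], sd_z[1]]

print(newton_x, newton_z_mapped)
print(sd_x, sd_z_mapped)
```

The Newton step computed in $z$ maps back to the same step in $x$, whereas the two steepest descent directions point in very different directions.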

Algorithms that are not sensitive to scaling are preferable, because they can handle poor problem formulations in a more robust fashion. In designing complete algorithms, we try to incorporate scale invariance into all aspects of the algorithm, including the line search or trust-region strategies and convergence tests. Generally speaking, it is easier to preserve scale invariance for line search algorithms than for trust-region algorithms.

EXERCISES

2.1 Compute the gradient $\nabla f(x)$ and Hessian $\nabla^2 f(x)$ of the Rosenbrock function

$$f(x) = 100 (x_2 - x_1^2)^2 + (1 - x_1)^2. \tag{2.22}$$

Show that $x^* = (1, 1)^T$ is the only local minimizer of this function, and that the Hessian matrix at that point is positive definite.

2.2 Show that the function $f(x) = 8x_1 + 12x_2 + x_1^2 - 2x_2^2$ has only one stationary point, and that it is neither a maximum nor a minimum, but a saddle point. Sketch the contour lines of $f$.

2.3 Let $a$ be a given $n$-vector, and $A$ be a given $n \times n$ symmetric matrix. Compute the gradient and Hessian of $f_1(x) = a^T x$ and $f_2(x) = x^T A x$.

2.4 Write the second-order Taylor expansion (2.6) for the function $\cos(1/x)$ around a nonzero point $x$, and the third-order Taylor expansion of $\cos(x)$ around any point $x$. Evaluate the second expansion for the specific case of $x = 1$.

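2.5 Consider the function $f : \mathbb{R}^2 \to \mathbb{R}$ defined by $f(x) = \|x\|^2$. Show that the sequence of iterates $\{x_k\}$ defined by

$$x_k = \left( 1 + \frac{1}{2^k} \right) \begin{bmatrix} \cos k \\ \sin k \end{bmatrix}$$

satisfies $f(x_{k+1}) < f(x_k)$ for $k = 0, 1, 2, \dots$. Show that every point on the unit circle $\{x \mid \|x\|^2 = 1\}$ is a limit point for $\{x_k\}$. Hint: Every value $\theta \in [0, 2\pi]$ is a limit point of the subsequence $\{\xi_k\}$ defined by

$$\xi_k = k \pmod{2\pi} = k - 2\pi \left\lfloor \frac{k}{2\pi} \right\rfloor,$$

where the operator $\lfloor \cdot \rfloor$ denotes rounding down to the next integer.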

2.6 Prove that all isolated local minimizers are strict. (Hint: Take an isolated local minimizer $x^*$ and a neighborhood $\mathcal{N}$. Show that for any $x \in \mathcal{N}$, $x \ne x^*$ we must have $f(x) > f(x^*)$.)

2.7 Suppose that $f(x) = x^T Q x$, where $Q$ is an $n \times n$ symmetric positive semidefinite matrix. Show using the definition (1.4) that $f(x)$ is convex on the domain $\mathbb{R}^n$. Hint: It may be convenient to prove the following equivalent inequality:

$$f(y + \alpha (x - y)) - \alpha f(x) - (1 - \alpha) f(y) \le 0,$$

for all $\alpha \in [0, 1]$ and all $x, y \in \mathbb{R}^n$.

2.8 Suppose that f is a convex function. Show that the set of global minimizers of f is a convex set.

2.9 Consider the function $f(x_1, x_2) = (x_1 + x_2^2)^2$. At the point $x^T = (1, 0)$ we consider the search direction $p^T = (-1, 1)$. Show that $p$ is a descent direction and find all minimizers of the problem (2.10).

2.10 Suppose that $\tilde f(z) = f(x)$, where $x = Sz + s$ for some $S \in \mathbb{R}^{n \times n}$ and $s \in \mathbb{R}^n$. Show that

$$\nabla \tilde f(z) = S^T \nabla f(x), \qquad \nabla^2 \tilde f(z) = S^T \nabla^2 f(x) S.$$

(Hint: Use the chain rule to express $d\tilde f / dz_j$ in terms of $df / dx_i$ and $dx_i / dz_j$ for all $i, j = 1, 2, \dots, n$.)

2.11 Show that the symmetric rank-one update (2.18) and the BFGS update (2.19) are scale-invariant if the initial Hessian approximations $B_0$ are chosen appropriately. That is, using the notation of the previous exercise, show that if these methods are applied to $f(x)$ starting from $x_0 = S z_0 + s$ with initial Hessian $B_0$, and to $\tilde f(z)$ starting from $z_0$ with initial Hessian $S^T B_0 S$, then all iterates are related by $x_k = S z_k + s$. (Assume for simplicity that the methods take unit step lengths.)

2.12 Suppose that a function $f$ of two variables is poorly scaled at the solution $x^*$. Write two Taylor expansions of $f$ around $x^*$ (one along each coordinate direction) and use them to show that the Hessian $\nabla^2 f(x^*)$ is ill-conditioned.

2.13 (For this and the following three questions, refer to the material on “Rates of Convergence” in Section A.2 of the Appendix.) Show that the sequence $x_k = 1/k$ is not Q-linearly convergent, though it does converge to zero. (This is called sublinear convergence.)

2.14 Show that the sequence $x_k = 1 + (0.5)^{2^k}$ is Q-quadratically convergent to 1.

2.15 Does the sequence $x_k = 1/k!$ converge Q-superlinearly? Q-quadratically?

2.16 Consider the sequence $\{x_k\}$ defined by

$$x_k = \begin{cases} \left( \tfrac{1}{4} \right)^{2^k}, & k \text{ even}, \\ (x_{k-1})/k, & k \text{ odd}. \end{cases}$$

Is this sequence Q-superlinearly convergent? Q-quadratically convergent? R-quadratically convergent?

CHAPTER 3

Line Search