THE BFGS METHOD - Quasi-Newton Methods - SecondEdition NumericalOptimization JorgeNocedalStephe

Quasi-Newton Methods

6.1 THE BFGS METHOD

The most popular quasi-Newton algorithm is the BFGS method, named for its discoverers Broyden, Fletcher, Goldfarb, and Shanno. In this section we derive this algorithm (and its close relative, the DFP algorithm) and describe its theoretical properties and practical implementation.

We begin the derivation by forming the following quadratic model of the objective function at the current iterate xk:

mk( p) fk+ ∇ fk^Tp+¹₂p^TBkp. (6.1)

Here Bkis an n× n symmetric positive deﬁnite matrix that will be revised or updated at every iteration. Note that the function value and gradient of this model at p 0 match fkand∇ fk, respectively. The minimizer pkof this convex quadratic model, which we can write explicitly as

pk −Bk⁻¹∇ fk, (6.2)

is used as the search direction, and the new iterate is

xk+1 xk+ αkpk, (6.3)

where the step lengthαk is chosen to satisfy the Wolfe conditions (3.6). This iteration is quite similar to the line search Newton method; the key difference is that the approximate Hessian Bkis used in place of the true Hessian.

Instead of computing Bkafresh at every iteration, Davidon proposed to update it in a simple manner to account for the curvature measured during the most recent step. Suppose that we have generated a new iterate xk+1and wish to construct a new quadratic model, of the form

mk+1( p) fk+1+ ∇ fk^T+1p+¹₂p^TBk+1p.

What requirements should we impose on Bk+1, based on the knowledge gained during the latest step? One reasonable requirement is that the gradient of mk+1should match the gradient of the objective function f at the latest two iterates xkand xk+1. Since∇mk+1(0) is precisely∇ fk+1, the second of these conditions is satisﬁed automatically. The ﬁrst condition can be written mathematically as

∇mk+1(−αkpk) ∇ fk+1− αkBk+1pk ∇ fk. By rearranging, we obtain

Bk+1αkpk ∇ fk+1− ∇ fk. (6.4) To simplify the notation it is useful to deﬁne the vectors

sk xk+1− xk αkpk, yk ∇ fk+1− ∇ fk, (6.5) so that (6.4) becomes

Bk+1sk yk. (6.6)

We refer to this formula as the secant equation.

Given the displacement skand the change of gradients yk, the secant equation requires that the symmetric positive deﬁnite matrix Bk+1map skinto yk. This will be possible only if skand yksatisfy the curvature condition

s_k^Tyk> 0, (6.7)

as is easily seen by premultiplying (6.6) by s_k^T. When f is strongly convex, the inequality (6.7) will be satisﬁed for any two points xkand xk+1(see Exercise 6.1). However, this condition

will not always hold for nonconvex functions, and in this case we need to enforce (6.7) explicitly, by imposing restrictions on the line search procedure that chooses the step length α. In fact, the condition (6.7) is guaranteed to hold if we impose the Wolfe (3.6) or strong Wolfe conditions (3.7) on the line search. To verify this claim, we note from (6.5) and (3.6b) that∇ fk^T+1sk≥ c2∇ fk^Tsk, and therefore

y_k^Tsk≥ (c2− 1)αk∇ fk^Tpk. (6.8) Since c2 < 1 and since pkis a descent direction, the term on the right is positive, and the curvature condition (6.7) holds.

When the curvature condition is satisﬁed, the secant equation (6.6) always has a solution Bk+1. In fact, it admits an inﬁnite number of solutions, since the n(n + 1)/2 degrees of freedom in a symmetric positive deﬁnite matrix exceed the n conditions imposed by the secant equation. The requirement of positive deﬁniteness imposes n additional inequalities—all principal minors must be positive—but these conditions do not absorb the remaining degrees of freedom.

To determine Bk+1 uniquely, we impose the additional condition that among all symmetric matrices satisfying the secant equation, Bk+1is, in some sense, closest to the current matrix Bk. In other words, we solve the problem

minB B − Bk (6.9a)

subject to B B^T, Bsk yk, (6.9b)

where skand yk satisfy (6.7) and Bk is symmetric and positive deﬁnite. Different matrix norms can be used in (6.9a), and each norm gives rise to a different quasi-Newton method.

A norm that allows easy solution of the minimization problem (6.9) and gives rise to a scale-invariant optimization method is the weighted Frobenius norm

AW ≡W^1/2AW^1/2 any matrix satisfying the relation W yk sk. For concreteness, the reader can assume that W ¯G⁻¹k where ¯Gkis the average Hessian deﬁned by follows from Taylor’s theorem, Theorem 2.1. With this choice of weighting matrix W , the

norm (6.10) is non-dimensional, which is a desirable property, since we do not wish the solution of (6.9) to depend on the units of the problem.

With this weighting matrix and this norm, the unique solution of (6.9) is (DFP) Bk+1

This formula is called the DFP updating formula, since it is the one originally proposed by Davidon in 1959, and subsequently studied, implemented, and popularized by Fletcher and Powell.

The inverse of Bk, which we denote by Hk Bk⁻¹,

is useful in the implementation of the method, since it allows the search direction (6.2) to be calculated by means of a simple matrix–vector multiplication. Using the Sherman–

Morrison–Woodbury formula (A.28), we can derive the following expression for the update of the inverse Hessian approximation Hkthat corresponds to the DFP update of Bkin (6.13):

(DFP) Hk+1 Hk−Hkyky_k^THk

y_k^THkyk

+ sks_k^T y_k^Tsk

. (6.15)

Note that the last two terms in the right-hand-side of (6.15) are rank-one matrices, so that Hk

undergoes a rank-two modiﬁcation. It is easy to see that (6.13) is also a rank-two modiﬁcation of Bk. This is the fundamental idea of quasi-Newton updating: Instead of recomputing the approximate Hessians (or inverse Hessians) from scratch at every iteration, we apply a simple modiﬁcation that combines the most recently observed information about the objective function with the existing knowledge embedded in our current Hessian approximation.

The DFP updating formula is quite effective, but it was soon superseded by the BFGS formula, which is presently considered to be the most effective of all quasi-Newton updating formulae. BFGS updating can be derived by making a simple change in the argument that led to (6.13). Instead of imposing conditions on the Hessian approximations Bk, we impose similar conditions on their inverses Hk. The updated approximation Hk+1must be symmetric and positive deﬁnite, and must satisfy the secant equation (6.6), now written as

Hk+1yk sk.

The condition of closeness to Hkis now speciﬁed by the following analogue of (6.9):

minH H − Hk (6.16a)

subject to H H^T, H yk sk. (6.16b)

The norm is again the weighted Frobenius norm described above, where the weight matrix W is now any matrix satisfying W sk yk. (For concreteness, we assume again that W is given by the average Hessian ¯Gkdeﬁned in (6.11).) The unique solution Hk+1to (6.16) is given by

(BFGS) Hk+1 (I − ρksky_k^T)Hk(I− ρkyks_k^T)+ ρksks_k^T, (6.17) withρkdeﬁned by (6.14).

Just one issue has to be resolved before we can deﬁne a complete BFGS algorithm: How should we choose the initial approximation H0? Unfortunately, there is no magic formula that works well in all cases. We can use speciﬁc information about the problem, for instance by setting it to the inverse of an approximate Hessian calculated by ﬁnite differences at x0. Otherwise, we can simply set it to be the identity matrix, or a multiple of the identity matrix, where the multiple is chosen to reﬂect the scaling of the variables.

Algorithm 6.1 (BFGS Method).

Given starting point x0, convergence tolerance > 0, inverse Hessian approximation H₀;

k← 0;

while∇ fk > ;

Compute search direction

pk −Hk∇ fk; (6.18)

Set xk+1 xk+ αkpkwhereαkis computed from a line search procedure to satisfy the Wolfe conditions (3.6);

Deﬁne sk xk+1− xkand yk ∇ fk+1− ∇ fk; Compute Hk+1by means of (6.17);

k← k + 1;

end (while)

Each iteration can be performed at a cost of O(n²) arithmetic operations (plus the cost of function and gradient evaluations); there are no O(n³) operations such as linear system solves or matrix–matrix operations. The algorithm is robust, and its rate of convergence is superlinear, which is fast enough for most practical purposes. Even though Newton’s method converges more rapidly (that is, quadratically), its cost per iteration usually is higher, because of its need for second derivatives and solution of a linear system.

We can derive a version of the BFGS algorithm that works with the Hessian approx-imation Bkrather than Hk. The update formula for Bkis obtained by simply applying the Sherman–Morrison–Woodbury formula (A.28) to (6.17) to obtain

(BFGS) Bk+1 Bk− Bksks_k^TBk

s_k^TBksk

+ yky_k^T y_k^Tsk

. (6.19)

A naive implementation of this variant is not efﬁcient for unconstrained minimization, because it requires the system Bkpk −∇ fkto be solved for the step pk, thereby increasing the cost of the step computation to O(n³). We discuss later, however, that less expensive implementations of this variant are possible by updating Cholesky factors of Bk.

在文檔中 SecondEdition NumericalOptimization JorgeNocedalStephenJ.Wright (頁 155-160)