Quasi-Newton Methods
6.1 THE BFGS METHOD
The most popular quasi-Newton algorithm is the BFGS method, named for its discoverers Broyden, Fletcher, Goldfarb, and Shanno. In this section we derive this algorithm (and its close relative, the DFP algorithm) and describe its theoretical properties and practical implementation.
We begin the derivation by forming the following quadratic model of the objective function at the current iterate xk:
mk( p) fk+ ∇ fkTp+12pTBkp. (6.1)
Here Bkis an n× n symmetric positive definite matrix that will be revised or updated at every iteration. Note that the function value and gradient of this model at p 0 match fkand∇ fk, respectively. The minimizer pkof this convex quadratic model, which we can write explicitly as
pk −Bk−1∇ fk, (6.2)
is used as the search direction, and the new iterate is
xk+1 xk+ αkpk, (6.3)
where the step lengthαk is chosen to satisfy the Wolfe conditions (3.6). This iteration is quite similar to the line search Newton method; the key difference is that the approximate Hessian Bkis used in place of the true Hessian.
Instead of computing Bkafresh at every iteration, Davidon proposed to update it in a simple manner to account for the curvature measured during the most recent step. Suppose that we have generated a new iterate xk+1and wish to construct a new quadratic model, of the form
mk+1( p) fk+1+ ∇ fkT+1p+12pTBk+1p.
What requirements should we impose on Bk+1, based on the knowledge gained during the latest step? One reasonable requirement is that the gradient of mk+1should match the gradient of the objective function f at the latest two iterates xkand xk+1. Since∇mk+1(0) is precisely∇ fk+1, the second of these conditions is satisfied automatically. The first condition can be written mathematically as
∇mk+1(−αkpk) ∇ fk+1− αkBk+1pk ∇ fk. By rearranging, we obtain
Bk+1αkpk ∇ fk+1− ∇ fk. (6.4) To simplify the notation it is useful to define the vectors
sk xk+1− xk αkpk, yk ∇ fk+1− ∇ fk, (6.5) so that (6.4) becomes
Bk+1sk yk. (6.6)
We refer to this formula as the secant equation.
Given the displacement skand the change of gradients yk, the secant equation requires that the symmetric positive definite matrix Bk+1map skinto yk. This will be possible only if skand yksatisfy the curvature condition
skTyk> 0, (6.7)
as is easily seen by premultiplying (6.6) by skT. When f is strongly convex, the inequality (6.7) will be satisfied for any two points xkand xk+1(see Exercise 6.1). However, this condition
will not always hold for nonconvex functions, and in this case we need to enforce (6.7) explicitly, by imposing restrictions on the line search procedure that chooses the step length α. In fact, the condition (6.7) is guaranteed to hold if we impose the Wolfe (3.6) or strong Wolfe conditions (3.7) on the line search. To verify this claim, we note from (6.5) and (3.6b) that∇ fkT+1sk≥ c2∇ fkTsk, and therefore
ykTsk≥ (c2− 1)αk∇ fkTpk. (6.8) Since c2 < 1 and since pkis a descent direction, the term on the right is positive, and the curvature condition (6.7) holds.
When the curvature condition is satisfied, the secant equation (6.6) always has a solution Bk+1. In fact, it admits an infinite number of solutions, since the n(n + 1)/2 degrees of freedom in a symmetric positive definite matrix exceed the n conditions imposed by the secant equation. The requirement of positive definiteness imposes n additional inequalities—all principal minors must be positive—but these conditions do not absorb the remaining degrees of freedom.
To determine Bk+1 uniquely, we impose the additional condition that among all symmetric matrices satisfying the secant equation, Bk+1is, in some sense, closest to the current matrix Bk. In other words, we solve the problem
minB B − Bk (6.9a)
subject to B BT, Bsk yk, (6.9b)
where skand yk satisfy (6.7) and Bk is symmetric and positive definite. Different matrix norms can be used in (6.9a), and each norm gives rise to a different quasi-Newton method.
A norm that allows easy solution of the minimization problem (6.9) and gives rise to a scale-invariant optimization method is the weighted Frobenius norm
AW ≡W1/2AW1/2 any matrix satisfying the relation W yk sk. For concreteness, the reader can assume that W ¯G−1k where ¯Gkis the average Hessian defined by follows from Taylor’s theorem, Theorem 2.1. With this choice of weighting matrix W , the
norm (6.10) is non-dimensional, which is a desirable property, since we do not wish the solution of (6.9) to depend on the units of the problem.
With this weighting matrix and this norm, the unique solution of (6.9) is (DFP) Bk+1
This formula is called the DFP updating formula, since it is the one originally proposed by Davidon in 1959, and subsequently studied, implemented, and popularized by Fletcher and Powell.
The inverse of Bk, which we denote by Hk Bk−1,
is useful in the implementation of the method, since it allows the search direction (6.2) to be calculated by means of a simple matrix–vector multiplication. Using the Sherman–
Morrison–Woodbury formula (A.28), we can derive the following expression for the update of the inverse Hessian approximation Hkthat corresponds to the DFP update of Bkin (6.13):
(DFP) Hk+1 Hk−HkykykTHk
ykTHkyk
+ skskT ykTsk
. (6.15)
Note that the last two terms in the right-hand-side of (6.15) are rank-one matrices, so that Hk
undergoes a rank-two modification. It is easy to see that (6.13) is also a rank-two modification of Bk. This is the fundamental idea of quasi-Newton updating: Instead of recomputing the approximate Hessians (or inverse Hessians) from scratch at every iteration, we apply a simple modification that combines the most recently observed information about the objective function with the existing knowledge embedded in our current Hessian approximation.
The DFP updating formula is quite effective, but it was soon superseded by the BFGS formula, which is presently considered to be the most effective of all quasi-Newton updating formulae. BFGS updating can be derived by making a simple change in the argument that led to (6.13). Instead of imposing conditions on the Hessian approximations Bk, we impose similar conditions on their inverses Hk. The updated approximation Hk+1must be symmetric and positive definite, and must satisfy the secant equation (6.6), now written as
Hk+1yk sk.
The condition of closeness to Hkis now specified by the following analogue of (6.9):
minH H − Hk (6.16a)
subject to H HT, H yk sk. (6.16b)
The norm is again the weighted Frobenius norm described above, where the weight matrix W is now any matrix satisfying W sk yk. (For concreteness, we assume again that W is given by the average Hessian ¯Gkdefined in (6.11).) The unique solution Hk+1to (6.16) is given by
(BFGS) Hk+1 (I − ρkskykT)Hk(I− ρkykskT)+ ρkskskT, (6.17) withρkdefined by (6.14).
Just one issue has to be resolved before we can define a complete BFGS algorithm: How should we choose the initial approximation H0? Unfortunately, there is no magic formula that works well in all cases. We can use specific information about the problem, for instance by setting it to the inverse of an approximate Hessian calculated by finite differences at x0. Otherwise, we can simply set it to be the identity matrix, or a multiple of the identity matrix, where the multiple is chosen to reflect the scaling of the variables.
Algorithm 6.1 (BFGS Method).
Given starting point x0, convergence tolerance > 0, inverse Hessian approximation H0;
k← 0;
while∇ fk > ;
Compute search direction
pk −Hk∇ fk; (6.18)
Set xk+1 xk+ αkpkwhereαkis computed from a line search procedure to satisfy the Wolfe conditions (3.6);
Define sk xk+1− xkand yk ∇ fk+1− ∇ fk; Compute Hk+1by means of (6.17);
k← k + 1;
end (while)
Each iteration can be performed at a cost of O(n2) arithmetic operations (plus the cost of function and gradient evaluations); there are no O(n3) operations such as linear system solves or matrix–matrix operations. The algorithm is robust, and its rate of convergence is superlinear, which is fast enough for most practical purposes. Even though Newton’s method converges more rapidly (that is, quadratically), its cost per iteration usually is higher, because of its need for second derivatives and solution of a linear system.
We can derive a version of the BFGS algorithm that works with the Hessian approx-imation Bkrather than Hk. The update formula for Bkis obtained by simply applying the Sherman–Morrison–Woodbury formula (A.28) to (6.17) to obtain
(BFGS) Bk+1 Bk− BkskskTBk
skTBksk
+ ykykT ykTsk
. (6.19)
A naive implementation of this variant is not efficient for unconstrained minimization, because it requires the system Bkpk −∇ fkto be solved for the step pk, thereby increasing the cost of the step computation to O(n3). We discuss later, however, that less expensive implementations of this variant are possible by updating Cholesky factors of Bk.