UNCONSTRAINED OPTIMIZATION: A REVIEW

The Least-Mean-Square Algorithm

2. Adaptive process, which involves the automatic adjustment of the synaptic weights of the neuron in accordance with the error signal e(i)


$$ y(i) = v(i) = \sum_{k=1}^{M} w_k(i)\, x_k(i) $$

where w_1(i), w_2(i), ..., w_M(i) are the M synaptic weights of the neuron, measured at time i. In matrix form, we may express y(i) as an inner product of the vectors x(i) and w(i) as

$$ y(i) = \mathbf{x}^T(i)\, \mathbf{w}(i) \tag{3.4} $$

where

$$ \mathbf{w}(i) = [w_1(i), w_2(i), \ldots, w_M(i)]^T $$

Note that the notation for a synaptic weight has been simplified here by not including an additional subscript to identify the neuron, since we have only a single neuron to deal with. This practice is followed throughout the book, whenever a single neuron is involved.

The neuron’s output y(i) is compared with the corresponding output d(i) received from the unknown system at time i. Typically, y(i) is different from d(i); hence, their comparison results in the error signal

$$ e(i) = d(i) - y(i) \tag{3.5} $$

The manner in which the error signal e(i) is used to control the adjustments to the neuron’s synaptic weights is determined by the cost function used to derive the adaptive-filtering algorithm of interest. This issue is closely related to that of optimization. It is therefore appropriate to present a review of unconstrained-optimization methods.

The material is applicable not only to linear adaptive filters, but also to neural networks in general.

3.3 UNCONSTRAINED OPTIMIZATION: A REVIEW

Consider a cost function ℰ(w) that is a continuously differentiable function of some unknown weight (parameter) vector w. The function ℰ(w) maps the elements of w into real numbers. It is a measure of how to choose the weight (parameter) vector w of an adaptive-filtering algorithm so that it behaves in an optimum manner. We want to find an optimal solution w* that satisfies the condition

$$ \mathcal{E}(\mathbf{w}^*) \le \mathcal{E}(\mathbf{w}) \tag{3.6} $$

That is, we need to solve an unconstrained-optimization problem, stated as follows:

Minimize the cost function ℰ(w) with respect to the weight vector w.

The necessary condition for optimality is

$$ \nabla \mathcal{E}(\mathbf{w}^*) = \mathbf{0} \tag{3.7} $$

where ∇ is the gradient operator,

$$ \nabla = \left[ \frac{\partial}{\partial w_1}, \frac{\partial}{\partial w_2}, \ldots, \frac{\partial}{\partial w_M} \right]^T \tag{3.8} $$

and ∇ℰ(w) is the gradient vector of the cost function,

$$ \nabla \mathcal{E}(\mathbf{w}) = \left[ \frac{\partial \mathcal{E}}{\partial w_1}, \frac{\partial \mathcal{E}}{\partial w_2}, \ldots, \frac{\partial \mathcal{E}}{\partial w_M} \right]^T \tag{3.9} $$

(Differentiation with respect to a vector is discussed in Note 1 at the end of this chapter.) A class of unconstrained-optimization algorithms that is particularly well suited for the design of adaptive filters is based on the idea of local iterative descent:

Starting with an initial guess denoted by w(0), generate a sequence of weight vectors w(1), w(2), . . ., such that the cost function ℰ(w) is reduced at each iteration of the algorithm, as shown by

$$ \mathcal{E}(\mathbf{w}(n+1)) < \mathcal{E}(\mathbf{w}(n)) \tag{3.10} $$

where w(n) is the old value of the weight vector and w(n + 1) is its updated value.

We hope that the algorithm will eventually converge onto the optimal solution w*. We say “hope” because there is a distinct possibility that the algorithm will diverge (i.e., become unstable) unless special precautions are taken.

In this section, we describe three unconstrained-optimization methods that rely on the idea of iterative descent in one form or another (Bertsekas, 1995).

Method of Steepest Descent

In the method of steepest descent, the successive adjustments applied to the weight vector w are in the direction of steepest descent, that is, in a direction opposite to the gradient vector ∇ℰ(w). For convenience of presentation, we write

$$ \mathbf{g} = \nabla \mathcal{E}(\mathbf{w}) \tag{3.11} $$

Accordingly, the steepest-descent algorithm is formally described by

$$ \mathbf{w}(n+1) = \mathbf{w}(n) - \eta\, \mathbf{g}(n) \tag{3.12} $$

where η is a positive constant called the stepsize, or learning-rate, parameter, and g(n) is the gradient vector evaluated at the point w(n). In going from iteration n to n + 1, the algorithm applies the correction

$$ \Delta \mathbf{w}(n) = \mathbf{w}(n+1) - \mathbf{w}(n) = -\eta\, \mathbf{g}(n) \tag{3.13} $$

Equation (3.13) is in fact a formal statement of the error-correction rule described in the introductory chapter.

To show that the formulation of the steepest-descent algorithm satisfies the condition of Eq. (3.10) for iterative descent, we use a first-order Taylor series expansion around w(n) to approximate ℰ(w(n + 1)) as

$$ \mathcal{E}(\mathbf{w}(n+1)) \approx \mathcal{E}(\mathbf{w}(n)) + \mathbf{g}^T(n)\, \Delta \mathbf{w}(n) $$

the use of which is justified for small η. Substituting Eq. (3.13) into this approximate relation yields

$$ \mathcal{E}(\mathbf{w}(n+1)) \approx \mathcal{E}(\mathbf{w}(n)) - \eta\, \mathbf{g}^T(n)\, \mathbf{g}(n) = \mathcal{E}(\mathbf{w}(n)) - \eta\, \|\mathbf{g}(n)\|^2 $$

which shows that, for a positive learning-rate parameter η, the cost function is decreased as the algorithm progresses from one iteration to the next. The reasoning presented here is approximate in that this end result is true only for small enough learning rates.
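The caveat about "small enough" learning rates can be checked by hand on a simple case. The cost used below, ℰ(w) = ½‖w‖², is an assumption chosen purely for illustration (it does not appear in the text); for it, g(n) = w(n), so the exact cost after one steepest-descent step can be compared with the first-order estimate:

```latex
% Illustrative assumption: \mathcal{E}(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{w}\|^2, so \mathbf{g}(n) = \mathbf{w}(n).
\begin{align*}
\mathbf{w}(n+1) &= \mathbf{w}(n) - \eta\,\mathbf{g}(n) = (1-\eta)\,\mathbf{w}(n) \\
\text{exact:}\quad \mathcal{E}(\mathbf{w}(n+1)) &= (1-\eta)^2\,\mathcal{E}(\mathbf{w}(n))
  = (1 - 2\eta + \eta^2)\,\mathcal{E}(\mathbf{w}(n)) \\
\text{first order:}\quad \mathcal{E}(\mathbf{w}(n)) - \eta\,\|\mathbf{g}(n)\|^2
  &= (1 - 2\eta)\,\mathcal{E}(\mathbf{w}(n))
\end{align*}
```

The two expressions differ only by the η² term, so they agree for small η. For this particular cost the exact value decreases only when (1 − η)² < 1, that is, for 0 < η < 2, whereas the first-order estimate would wrongly suggest a decrease for every positive η.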

The method of steepest descent converges to the optimal solution w* slowly. Moreover, the learning-rate parameter η has a profound influence on its convergence behavior:

• When η is small, the transient response of the algorithm is overdamped, in that the trajectory traced by w(n) follows a smooth path in the w-plane, as illustrated in Fig. 3.2a.

• When η is large, the transient response of the algorithm is underdamped, in that the trajectory of w(n) follows a zigzagging (oscillatory) path, as illustrated in Fig. 3.2b.

• When η exceeds a certain critical value, the algorithm becomes unstable (i.e., it diverges).
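The three regimes listed above can be reproduced with a minimal sketch. It assumes a hypothetical two-weight quadratic cost ℰ(w) = ½(c_1 w_1² + c_2 w_2²) with c = (1, 10), which is not taken from the text; for this assumed cost each coordinate is scaled by (1 − η c_k) per step, so the critical learning rate is 2/10 = 0.2. The function names are illustrative only.

```python
def gradient(w, curvatures):
    """Gradient of the assumed cost E(w) = 0.5 * sum(c_k * w_k**2)."""
    return [c * wk for c, wk in zip(curvatures, w)]


def steepest_descent(w0, eta, curvatures, num_iters):
    """Apply w(n+1) = w(n) - eta * g(n), Eq. (3.12), and return the trajectory."""
    w = list(w0)
    trajectory = [list(w)]
    for _ in range(num_iters):
        g = gradient(w, curvatures)
        w = [wk - eta * gk for wk, gk in zip(w, g)]
        trajectory.append(list(w))
    return trajectory


def sign_changes(values):
    """Count sign flips between consecutive values (a crude zigzag measure)."""
    return sum(1 for a, b in zip(values, values[1:]) if a * b < 0)


curvatures = [1.0, 10.0]  # assumed curvatures; critical eta is 2 / 10 = 0.2
w0 = [4.0, 4.0]           # assumed starting point

for eta in (0.01, 0.15, 0.25):
    path = steepest_descent(w0, eta, curvatures, 30)
    w2 = [point[1] for point in path]
    print(f"eta={eta}: sign changes in w2 = {sign_changes(w2)}, "
          f"final |w2| = {abs(w2[-1]):.3g}")
```

With these assumed values, η = 0.01 gives a smooth (overdamped) path with no sign changes, η = 0.15 makes w_2 alternate in sign while still shrinking (the zigzag case), and η = 0.25 exceeds the critical value, so |w_2| grows without bound.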

Newton’s Method

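For a more elaborate optimization technique, we may look to Newton’s method, the basic idea of which is to minimize the quadratic approximation of the cost function ℰ(w) around the current point w(n); this minimization is performed at each iteration of the algorithm. Specifically, using a second-order Taylor series expansion of the cost function around the point w(n), we may write

$$ \Delta \mathcal{E}(\mathbf{w}(n)) = \mathcal{E}(\mathbf{w}(n+1)) - \mathcal{E}(\mathbf{w}(n)) \approx \mathbf{g}^T(n)\, \Delta \mathbf{w}(n) + \frac{1}{2}\, \Delta \mathbf{w}^T(n)\, \mathbf{H}(n)\, \Delta \mathbf{w}(n) \tag{3.14} $$

As before, g(n) is the M-by-1 gradient vector of the cost function ℰ(w) evaluated at the point w(n). The matrix H(n) is the M-by-M Hessian of ℰ(w), also evaluated at w(n).

The Hessian of ℰ(w) is defined by

$$ \mathbf{H} = \nabla^2 \mathcal{E}(\mathbf{w}) = \begin{bmatrix}
\dfrac{\partial^2 \mathcal{E}}{\partial w_1^2} & \dfrac{\partial^2 \mathcal{E}}{\partial w_1 \partial w_2} & \cdots & \dfrac{\partial^2 \mathcal{E}}{\partial w_1 \partial w_M} \\
\dfrac{\partial^2 \mathcal{E}}{\partial w_2 \partial w_1} & \dfrac{\partial^2 \mathcal{E}}{\partial w_2^2} & \cdots & \dfrac{\partial^2 \mathcal{E}}{\partial w_2 \partial w_M} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial^2 \mathcal{E}}{\partial w_M \partial w_1} & \dfrac{\partial^2 \mathcal{E}}{\partial w_M \partial w_2} & \cdots & \dfrac{\partial^2 \mathcal{E}}{\partial w_M^2}
\end{bmatrix} \tag{3.15} $$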

[Figure 3.2: two panels with axes w1(n) and w2(n), trajectory points marked n = 0, n = 1, n = 2; panel (a) small η, panel (b) large η.]

FIGURE 3.2 Trajectory of the method of steepest descent in a two-dimensional space for two different values of learning-rate parameter: (a) small η, (b) large η. The coordinates w1 and w2 are elements of the weight vector w; they both lie in the w-plane.

Equation (3.15) requires the cost function ℰ(w) to be twice continuously differentiable with respect to the elements of w. Differentiating Eq. (3.14) with respect to Δw, we minimize the resulting change Δℰ(w) when

$$ \mathbf{g}(n) + \mathbf{H}(n)\, \Delta \mathbf{w}(n) = \mathbf{0} $$

Solving this equation for Δw(n) yields

$$ \Delta \mathbf{w}(n) = -\mathbf{H}^{-1}(n)\, \mathbf{g}(n) $$

That is,

$$ \mathbf{w}(n+1) = \mathbf{w}(n) + \Delta \mathbf{w}(n) = \mathbf{w}(n) - \mathbf{H}^{-1}(n)\, \mathbf{g}(n) \tag{3.16} $$

where H^-1(n) is the inverse of the Hessian of ℰ(w).

Generally speaking, Newton’s method converges quickly asymptotically and does not exhibit the zigzagging behavior that sometimes characterizes the method of steepest descent. However, for Newton’s method to work, the Hessian H(n) has to be a positive definite matrix for all n. Unfortunately, in general, there is no guarantee that H(n) is positive definite at every iteration of the algorithm. If the Hessian H(n) is not positive definite, modification of Newton’s method is necessary (Powell, 1987; Bertsekas, 1995).

In any event, a major limitation of Newton’s method is its computational complexity.

Gauss–Newton Method

To deal with the computational complexity of Newton’s method without seriously compromising its convergence behavior, we may use the Gauss–Newton method. To apply this method, we adopt a cost function that is expressed as the sum of error squares. Let

$$ \mathcal{E}(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{n} e^2(i) \tag{3.17} $$

where the scaling factor 1/2 is included to simplify matters in subsequent analysis. All the error terms in this formula are calculated on the basis of a weight vector w that is fixed over the entire observation interval 1 ≤ i ≤ n.

The error signal e(i) is a function of the adjustable weight vector w. Given an operating point w(n), we linearize the dependence of e(i) on w by introducing the new term

$$ e'(i, \mathbf{w}) = e(i) + \left[ \frac{\partial e(i)}{\partial \mathbf{w}} \right]^T_{\mathbf{w} = \mathbf{w}(n)} (\mathbf{w} - \mathbf{w}(n)), \qquad i = 1, 2, \ldots, n $$

Equivalently, by using matrix notation, we may write

$$ \mathbf{e}'(n, \mathbf{w}) = \mathbf{e}(n) + \mathbf{J}(n)\, (\mathbf{w} - \mathbf{w}(n)) \tag{3.18} $$

where e(n) is the error vector

$$ \mathbf{e}(n) = [e(1), e(2), \ldots, e(n)]^T $$

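and J(n) is the n-by-M Jacobian of e(n):

$$ \mathbf{J}(n) = \begin{bmatrix}
\dfrac{\partial e(1)}{\partial w_1} & \dfrac{\partial e(1)}{\partial w_2} & \cdots & \dfrac{\partial e(1)}{\partial w_M} \\
\dfrac{\partial e(2)}{\partial w_1} & \dfrac{\partial e(2)}{\partial w_2} & \cdots & \dfrac{\partial e(2)}{\partial w_M} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial e(n)}{\partial w_1} & \dfrac{\partial e(n)}{\partial w_2} & \cdots & \dfrac{\partial e(n)}{\partial w_M}
\end{bmatrix}_{\mathbf{w} = \mathbf{w}(n)} \tag{3.19} $$

The Jacobian J(n) is the transpose of the M-by-n gradient matrix ∇e(n), where

$$ \nabla \mathbf{e}(n) = [\nabla e(1), \nabla e(2), \ldots, \nabla e(n)] $$

The updated weight vector w(n + 1) is now defined by

$$ \mathbf{w}(n+1) = \arg\min_{\mathbf{w}} \left\{ \frac{1}{2} \|\mathbf{e}'(n, \mathbf{w})\|^2 \right\} \tag{3.20} $$

Using Eq. (3.18) to evaluate the squared Euclidean norm of e'(n, w), we get

$$ \frac{1}{2} \|\mathbf{e}'(n, \mathbf{w})\|^2 = \frac{1}{2} \|\mathbf{e}(n)\|^2 + \mathbf{e}^T(n)\, \mathbf{J}(n)\, (\mathbf{w} - \mathbf{w}(n)) + \frac{1}{2} (\mathbf{w} - \mathbf{w}(n))^T \mathbf{J}^T(n)\, \mathbf{J}(n)\, (\mathbf{w} - \mathbf{w}(n)) $$

Hence, differentiating this expression with respect to w and setting the result equal to zero, we obtain

$$ \mathbf{J}^T(n)\, \mathbf{e}(n) + \mathbf{J}^T(n)\, \mathbf{J}(n)\, (\mathbf{w} - \mathbf{w}(n)) = \mathbf{0} $$

Solving this equation for w, we may thus write, in light of Eq. (3.20),

$$ \mathbf{w}(n+1) = \mathbf{w}(n) - \left( \mathbf{J}^T(n)\, \mathbf{J}(n) \right)^{-1} \mathbf{J}^T(n)\, \mathbf{e}(n) \tag{3.21} $$

which describes the pure form of the Gauss–Newton method.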
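A single Gauss–Newton update of the form w(n + 1) = w(n) − (J^T J)^-1 J^T e can be sketched as follows. This is a minimal illustration, not an implementation from the text: the function names, the small linear solver, and the data are all assumptions. The example uses a linear neuron, e(i) = d(i) − x^T(i)w, for which row i of the Jacobian is −x^T(i); in that special case the linearization is exact, so one step from any starting point lands on the least-squares solution.

```python
def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * m
    for r in range(m - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, m))
        z[r] = (aug[r][m] - tail) / aug[r][r]
    return z


def gauss_newton_step(w, errors, jacobian):
    """One pure Gauss-Newton update: w - (J^T J)^(-1) J^T e."""
    e = errors(w)    # error vector, length n
    J = jacobian(w)  # n rows, M columns
    m = len(w)
    jtj = [[sum(row[a] * row[b] for row in J) for b in range(m)]
           for a in range(m)]
    jte = [sum(row[a] * ei for row, ei in zip(J, e)) for a in range(m)]
    step = solve(jtj, jte)
    return [wk - sk for wk, sk in zip(w, step)]


# Hypothetical data generated by the weights [2, -1] (assumed for illustration).
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
desired = [2.0, -1.0, 1.0, 3.0]


def errors(w):
    """e(i) = d(i) - x^T(i) w for a linear neuron."""
    return [d - sum(wk * xk for wk, xk in zip(w, x))
            for x, d in zip(inputs, desired)]


def jacobian(w):
    """Row i is de(i)/dw = -x^T(i); it does not depend on w here."""
    return [[-xk for xk in x] for x in inputs]


w_next = gauss_newton_step([0.0, 0.0], errors, jacobian)
print(w_next)
```

For a nonlinear error function, the same step would be repeated, re-evaluating the errors and the Jacobian at each new operating point; the sketch does not guard against a singular J^T J beyond raising an exception.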

Unlike Newton’s method, which requires knowledge of the Hessian of the cost function ℰ(w), the Gauss–Newton method requires only the Jacobian of the error vector e(n). However, for the Gauss–Newton iteration to be computable, the matrix product J^T(n)J(n) must be nonsingular.

With regard to the latter point, we recognize that J^T(n)J(n) is always nonnegative definite. To ensure that it is nonsingular, the n-by-M Jacobian J(n) must have full column rank M; that is, the M columns of J(n) in Eq. (3.19) must be linearly independent, which requires n ≥ M. Unfortunately, there is no guarantee that this condition will always hold. To guard against the possibility that J(n) is rank deficient, the customary practice is to add the diagonal matrix δI to the matrix J^T(n)J(n), where I is the identity matrix. The parameter δ is a small positive constant chosen to ensure that

J^T(n)J(n) + δI is positive definite for all n

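On this basis, the Gauss–Newton method is implemented in the slightly modified form

$$ \mathbf{w}(n+1) = \mathbf{w}(n) - \left( \mathbf{J}^T(n)\, \mathbf{J}(n) + \delta \mathbf{I} \right)^{-1} \mathbf{J}^T(n)\, \mathbf{e}(n) \tag{3.22} $$

The effect of the added term δI is progressively reduced as the number of iterations, n, is increased. Note also that the recursive equation (3.22) is the solution of the modified cost function

$$ \mathcal{E}(\mathbf{w}) = \frac{1}{2} \left\{ \sum_{i=1}^{n} e^2(i) + \delta\, \|\mathbf{w} - \mathbf{w}(n)\|^2 \right\} \tag{3.23} $$

where w(n) is the current value of the weight vector w(i).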

In the literature on signal processing, the addition of the term δI in Eq. (3.22) is referred to as diagonal loading. The addition of this term is accounted for by expanding the cost function in the manner described in Eq. (3.23), where we now have two terms (ignoring the scaling factor 1/2):

• The first term, Σ e²(i) summed over i = 1, ..., n, is the standard sum of squared errors, which depends on the training data.

• The second term contains the squared Euclidean norm, ‖w − w(n)‖², which depends on the filter structure. In effect, this term acts as a stabilizer.

The scaling factor δ is commonly referred to as a regularization parameter, and the resulting modification of the cost function is correspondingly referred to as structural regularization. The issue of regularization is discussed in great detail in Chapter 7.
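The stabilizing role of the term δI can be seen in a small sketch with a deliberately rank-deficient Jacobian. Everything here is an illustrative assumption rather than material from the text: the function name, the restriction to M = 2 weights (so that the 2-by-2 system can be solved in closed form), the value δ = 0.1, and the data, whose two input components are always equal so that J^T J is singular.

```python
def loaded_gauss_newton_step(w, errors, jacobian, delta):
    """One update w - (J^T J + delta*I)^(-1) J^T e, written out for M = 2."""
    e = errors(w)
    J = jacobian(w)
    # Entries of the symmetric 2-by-2 matrix J^T J + delta*I.
    a11 = sum(r[0] * r[0] for r in J) + delta
    a12 = sum(r[0] * r[1] for r in J)
    a22 = sum(r[1] * r[1] for r in J) + delta
    # Right-hand side J^T e.
    b1 = sum(r[0] * ei for r, ei in zip(J, e))
    b2 = sum(r[1] * ei for r, ei in zip(J, e))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("J^T J + delta*I is singular")
    # Cramer's rule for the 2-by-2 system.
    s1 = (b1 * a22 - a12 * b2) / det
    s2 = (a11 * b2 - a12 * b1) / det
    return [w[0] - s1, w[1] - s2]


# Hypothetical data: both input components are identical, so the two
# columns of the Jacobian are linearly dependent (rank 1 instead of 2).
inputs = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
desired = [2.0, 4.0, 6.0]  # consistent with any w satisfying w1 + w2 = 2


def errors(w):
    """e(i) = d(i) - x^T(i) w for a linear neuron."""
    return [d - (w[0] * x[0] + w[1] * x[1]) for x, d in zip(inputs, desired)]


def jacobian(w):
    """Row i is -x^T(i)."""
    return [[-x[0], -x[1]] for x in inputs]


try:
    loaded_gauss_newton_step([0.0, 0.0], errors, jacobian, delta=0.0)
except ValueError as exc:
    print("pure Gauss-Newton step failed:", exc)

w = [0.0, 0.0]
for n in range(5):
    w = loaded_gauss_newton_step(w, errors, jacobian, delta=0.1)
    print(n + 1, w)
```

With δ = 0 the step cannot be computed for this data, whereas with δ = 0.1 the iterates remain well defined and, in this example, approach a weight vector with w_1 = w_2 and w_1 + w_2 close to 2. The choice of δ trades a slightly shortened step against numerical stability.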

From: Neural Networks and Learning Machines (pp. 125-131)