Multilayer Perceptrons
3. The Hessian is basic to the formulation of second-order optimization methods as an alternative to back-propagation learning, to be discussed in Section 4.16.
4.10 OPTIMAL ANNEALING AND ADAPTIVE CONTROL OF THE LEARNING RATE
In Section 4.2, we emphasized the popularity of the on-line learning algorithm for two main reasons:
(i) The algorithm is simple, in that its implementation requires a minimal amount of memory, which is used merely to store the old value of the estimated weight vector from one iteration to the next.
(ii) With each example {x, d} being used only once at every time-step, the learning rate assumes a more important role in on-line learning than in batch learning, in that the on-line learning algorithm has the built-in ability to track statistical variations in the environment responsible for generating the training set of examples.
In Amari (1967) and, more recently, Opper (1996), it is shown that optimally annealed on-line learning can operate as fast as batch learning in an asymptotic sense. This issue is explored in what follows.
$$\lim_{n \to \infty} \mathbb{E}[\hat{\mathbf{w}}(n)] = \mathbf{w}^*$$
Optimal Annealing of the Learning Rate
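Let $\mathbf{w}$ denote the vector of synaptic weights in the network, stacked up on top of each other in some orderly fashion. With $\hat{\mathbf{w}}(n)$ denoting the old estimate of the weight vector $\mathbf{w}$ at time-step $n$, let $\hat{\mathbf{w}}(n+1)$ denote the updated estimate of $\mathbf{w}$ on receipt of the “input-desired response” example $\{\mathbf{x}(n+1), \mathbf{d}(n+1)\}$. Correspondingly, let $\mathbf{F}(\mathbf{x}(n+1); \hat{\mathbf{w}}(n))$ denote the vector-valued output of the network produced in response to the input $\mathbf{x}(n+1)$; naturally, the dimension of the function $\mathbf{F}$ must be the same as that of the desired response vector $\mathbf{d}(n)$. Following the defining equation of Eq. (4.3), we may express the instantaneous energy as the squared Euclidean norm of the estimation error, as shown by

$$\mathcal{E}(\mathbf{x}(n), \mathbf{d}(n); \mathbf{w}) = \frac{1}{2}\,\|\mathbf{d}(n) - \mathbf{F}(\mathbf{x}(n); \mathbf{w})\|^2 \tag{4.55}$$

The mean-square error, or expected risk, of the on-line learning problem is defined by

$$J(\mathbf{w}) = \mathbb{E}_{\mathbf{x},\mathbf{d}}[\mathcal{E}(\mathbf{x}, \mathbf{d}; \mathbf{w})] \tag{4.56}$$

where $\mathbb{E}_{\mathbf{x},\mathbf{d}}$ is the expectation operator performed with respect to the example $\{\mathbf{x}, \mathbf{d}\}$. The solution

$$\mathbf{w}^* = \arg\min_{\mathbf{w}}\,[J(\mathbf{w})] \tag{4.57}$$

defines the optimal parameter vector.

The instantaneous gradient vector of the learning process is defined by

$$\mathbf{g}(\mathbf{x}(n), \mathbf{d}(n); \mathbf{w}) = \frac{\partial}{\partial \mathbf{w}}\,\mathcal{E}(\mathbf{x}(n), \mathbf{d}(n); \mathbf{w}) = -(\mathbf{d}(n) - \mathbf{F}(\mathbf{x}(n); \mathbf{w}))\,\mathbf{F}'(\mathbf{x}(n); \mathbf{w}) \tag{4.58}$$

where

$$\mathbf{F}'(\mathbf{x}; \mathbf{w}) = \frac{\partial}{\partial \mathbf{w}}\,\mathbf{F}(\mathbf{x}; \mathbf{w}) \tag{4.59}$$

With the definition of the gradient vector just presented, we may now express the on-line learning algorithm as

$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) - \eta(n)\,\mathbf{g}(\mathbf{x}(n+1), \mathbf{d}(n+1); \hat{\mathbf{w}}(n)) \tag{4.60}$$

or, equivalently,

$$\underbrace{\hat{\mathbf{w}}(n+1)}_{\text{Updated estimate}} = \underbrace{\hat{\mathbf{w}}(n)}_{\text{Old estimate}} + \underbrace{\eta(n)}_{\text{Learning-rate parameter}}\;\underbrace{[\mathbf{d}(n+1) - \mathbf{F}(\mathbf{x}(n+1); \hat{\mathbf{w}}(n))]}_{\text{Error signal}}\;\underbrace{\mathbf{F}'(\mathbf{x}(n+1); \hat{\mathbf{w}}(n))}_{\text{Partial derivative of the network function } \mathbf{F}} \tag{4.61}$$

Given this difference equation, we may go on to describe the ensemble-averaged dynamics of the weight vector $\mathbf{w}$ in the neighborhood of the optimal parameter $\mathbf{w}^*$ by the continuous differential equation

$$\frac{d}{dt}\hat{\mathbf{w}}(t) = -\eta(t)\,\mathbb{E}_{\mathbf{x},\mathbf{d}}[\mathbf{g}(\mathbf{x}(t), \mathbf{d}(t); \hat{\mathbf{w}}(t))] \tag{4.62}$$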
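where $t$ denotes continuous time. Following Murata (1998), the expected value of the gradient vector is approximated by

$$\mathbb{E}_{\mathbf{x},\mathbf{d}}[\mathbf{g}(\mathbf{x}, \mathbf{d}; \hat{\mathbf{w}}(t))] \approx -\mathbf{K}^*(\mathbf{w}^* - \hat{\mathbf{w}}(t)) \tag{4.63}$$

where the ensemble-averaged matrix $\mathbf{K}^*$ is itself defined by

$$\mathbf{K}^* = \mathbb{E}_{\mathbf{x},\mathbf{d}}\!\left[\frac{\partial}{\partial \mathbf{w}}\,\mathbf{g}(\mathbf{x}, \mathbf{d}; \mathbf{w})\right] = \mathbb{E}_{\mathbf{x},\mathbf{d}}\!\left[\frac{\partial^2}{\partial \mathbf{w}^2}\,\mathcal{E}(\mathbf{x}, \mathbf{d}; \mathbf{w})\right] \tag{4.64}$$

The new Hessian $\mathbf{K}^*$ is a positive-definite matrix defined differently from the Hessian $\mathbf{H}$ of Eq. (4.54). However, if the environment responsible for generating the training examples $\{\mathbf{x}, \mathbf{d}\}$ is ergodic, we may then substitute the Hessian $\mathbf{H}$, based on time averaging, for the Hessian $\mathbf{K}^*$, based on ensemble averaging. In any event, using Eq. (4.63) in Eq. (4.62), we find that the continuous differential equation describing the evolution of the estimator $\hat{\mathbf{w}}(t)$ may be approximated as

$$\frac{d}{dt}\hat{\mathbf{w}}(t) \approx \eta(t)\,\mathbf{K}^*(\mathbf{w}^* - \hat{\mathbf{w}}(t)) \tag{4.65}$$

Let the vector $\mathbf{q}$ denote an eigenvector of the matrix $\mathbf{K}^*$, as shown by the defining equation

$$\mathbf{K}^*\mathbf{q} = \lambda\,\mathbf{q} \tag{4.66}$$

where $\lambda$ is the eigenvalue associated with the eigenvector $\mathbf{q}$. We may then introduce the new function

$$\xi(t) = \mathbb{E}_{\mathbf{x},\mathbf{d}}[\mathbf{q}^T\mathbf{g}(\mathbf{x}, \mathbf{d}; \hat{\mathbf{w}}(t))] \tag{4.67}$$

which, in light of Eq. (4.63), may itself be approximated as

$$\xi(t) \approx -\mathbf{q}^T\mathbf{K}^*(\mathbf{w}^* - \hat{\mathbf{w}}(t)) = -\lambda\,\mathbf{q}^T(\mathbf{w}^* - \hat{\mathbf{w}}(t)) \tag{4.68}$$

At each instant of time $t$, the function $\xi(t)$ takes on a scalar value, which may be viewed as an approximate measure of the Euclidean distance between two projections onto the eigenvector $\mathbf{q}$, one due to the optimal parameter $\mathbf{w}^*$ and the other due to the estimator $\hat{\mathbf{w}}(t)$. The value of $\xi(t)$ is therefore reduced to zero if, and when, the estimator $\hat{\mathbf{w}}(t)$ converges to $\mathbf{w}^*$.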
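As a rough illustration of the on-line update of Eq. (4.60) and the linearization of Eq. (4.63), the sketch below uses a hypothetical linear network F(x; w) = w·x with noise-free targets d = w*·x. This model, the two-dimensional weight vector, the constant learning rate, and all function and variable names are assumptions made for the example and are not taken from the text. For this model F'(x; w) = x and the sample estimate of K* is the average of x xᵀ, so the averaged gradient should match -K*(w* - ŵ) up to rounding.

```python
import math
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def gradient(x, d, w):
    """Instantaneous gradient of Eq. (4.58) for the assumed model F(x; w) = w . x."""
    error = d - dot(w, x)              # error signal d - F(x; w)
    return [-error * xi for xi in x]   # for this model F'(x; w) = x

def online_step(w, x, d, eta):
    """One on-line update of Eq. (4.60): w <- w - eta * g."""
    g = gradient(x, d, w)
    return [wi - eta * gi for wi, gi in zip(w, g)]

random.seed(0)
w_star = [1.0, -2.0]                   # assumed optimal weight vector
xs = [[random.gauss(0.0, 1.0) for _ in range(2)] for _ in range(500)]
ds = [dot(w_star, x) for x in xs]      # noise-free desired responses

# Sample estimates of E[g] and of K* at a fixed point w_hat.
w_hat = [0.3, 0.4]
n = len(xs)
mean_g = [0.0, 0.0]
K = [[0.0, 0.0], [0.0, 0.0]]
for x, d in zip(xs, ds):
    g = gradient(x, d, w_hat)
    for i in range(2):
        mean_g[i] += g[i] / n
        for j in range(2):
            K[i][j] += x[i] * x[j] / n

# Right-hand side of Eq. (4.63): -K* (w* - w_hat).
diff = [ws - wh for ws, wh in zip(w_star, w_hat)]
approx = [-dot(K[i], diff) for i in range(2)]

# One pass of the on-line rule over the data with a small constant rate.
w = list(w_hat)
start_dist = math.dist(w, w_star)
for x, d in zip(xs, ds):
    w = online_step(w, x, d, eta=0.05)
final_dist = math.dist(w, w_star)

print("averaged gradient:", mean_g)
print("-K*(w* - w_hat):  ", approx)
print("distance to w*:", start_dist, "->", final_dist)
```

For a nonlinear network the relation in Eq. (4.63) holds only approximately near w*; the linear model is used here only because it makes the two sides agree exactly and keeps the check short.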
From Eqs. (4.65), (4.66), and (4.68), we find that the function $\xi(t)$ is related to the time-varying learning-rate parameter $\eta(t)$ as follows:

$$\frac{d}{dt}\xi(t) = -\lambda\,\eta(t)\,\xi(t) \tag{4.69}$$

This differential equation may be solved to yield

$$\xi(t) = c\,\exp\!\left(-\lambda \int \eta(t)\,dt\right) \tag{4.70}$$

where $c$ is a positive integration constant.
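The step from Eq. (4.69) to Eq. (4.70) can be checked numerically. The sketch below integrates dξ/dt = -λ η(t) ξ(t) with a fixed-step fourth-order Runge-Kutta scheme and compares the result with c exp(-λ ∫ η dt). The schedule η(t) = 1/(1 + t), the values of λ and ξ(0), the step size, and the function names are all assumptions chosen for illustration. With the integral taken from 0 to t, the constant c equals ξ(0), and the closed form reduces to ξ(0)(1 + t)^(-λ).

```python
import math

def integrate_xi(eta, lam, xi0, t_end, dt=1e-3):
    """Integrate d(xi)/dt = -lam * eta(t) * xi from t = 0 with classical RK4."""
    def f(t, xi):
        return -lam * eta(t) * xi

    n_steps = int(round(t_end / dt))
    t, xi = 0.0, xi0
    for _ in range(n_steps):
        k1 = f(t, xi)
        k2 = f(t + dt / 2, xi + dt * k1 / 2)
        k3 = f(t + dt / 2, xi + dt * k2 / 2)
        k4 = f(t + dt, xi + dt * k3)
        xi += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += dt
    return xi

lam = 2.0                    # assumed eigenvalue of K*
xi0 = 1.0                    # assumed initial value xi(0), playing the role of c

def eta(t):
    """Hypothetical learning-rate schedule used only for this check."""
    return 1.0 / (1.0 + t)

t_end = 5.0
numeric = integrate_xi(eta, lam, xi0, t_end)

# Eq. (4.70) with the integral of eta from 0 to t_end equal to ln(1 + t_end).
closed_form = xi0 * math.exp(-lam * math.log(1.0 + t_end))

print(numeric, closed_form)
```

Any other integrable schedule could be substituted for `eta`, provided the integral in the closed-form line is updated to match.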
Following the annealing schedule due to Darken and Moody (1991) that was discussed in Chapter 3 on the LMS algorithm, let the formula

$$\eta(t) = \frac{\tau}{t + \tau}\,\eta_0 \tag{4.71}$$

account for the dependence of the learning rate on time $t$, where $\tau$ and $\eta_0$ are positive tuning parameters. Then, substituting this formula into Eq. (4.70), we find that the corresponding formula for the function $\xi(t)$ is

$$\xi(t) = c\,(t + \tau)^{-\lambda\tau\eta_0} \tag{4.72}$$

For $\xi(t)$ to vanish as time $t$ approaches infinity, we require that the product term $\lambda\tau\eta_0$ in the exponent be large compared with unity, which may be satisfied by setting $\eta_0 = \alpha/\lambda$ for positive $\alpha$.

Now, there remains only the issue of how to choose the eigenvector $\mathbf{q}$. From the previous section, we recall that the convergence speed of the learning curve is dominated by the smallest eigenvalue $\lambda_{\min}$ of the Hessian $\mathbf{H}$. With this Hessian and the new Hessian $\mathbf{K}^*$ tending to behave similarly, a clever choice is to hypothesize that for a sufficiently large number of iterations, the evolution of the estimator $\hat{\mathbf{w}}(t)$ over time $t$ may be considered as a one-dimensional process, running “almost parallel” to the eigenvector of the Hessian $\mathbf{K}^*$ associated with the smallest eigenvalue $\lambda_{\min}$, as illustrated in Fig. 4.15.

FIGURE 4.15 The evolution of the estimator $\hat{\mathbf{w}}(t)$ over time $t$. The ellipses represent contours of the expected risk for varying values of $\mathbf{w}$, assumed to be two-dimensional. (Figure labels: trajectory of $\hat{\mathbf{w}}(t)$; optimal point $\mathbf{w}^*$.)

We may thus set

$$\mathbf{q} = \frac{\mathbb{E}_{\mathbf{x},\mathbf{d}}[\mathbf{g}(\mathbf{x}, \mathbf{d}; \hat{\mathbf{w}})]}{\|\mathbb{E}_{\mathbf{x},\mathbf{d}}[\mathbf{g}(\mathbf{x}, \mathbf{d}; \hat{\mathbf{w}})]\|} \tag{4.73}$$

where the normalization is introduced to make the eigenvector $\mathbf{q}$ assume unit Euclidean length. Correspondingly, the use of this formula in Eq. (4.67) yields

$$\xi(t) = \|\mathbb{E}_{\mathbf{x},\mathbf{d}}[\mathbf{g}(\mathbf{x}, \mathbf{d}; \hat{\mathbf{w}}(t))]\| \tag{4.74}$$

We may now summarize the results of the discussion presented in this section by making the following statements:
1. The choice of the annealing schedule described in Eq. (4.71) satisfies the two conditions

$$\sum_t \eta(t) \to \infty \quad \text{and} \quad \sum_t \eta^2(t) < \infty, \quad \text{as } t \to \infty \tag{4.75}$$

In other words, $\eta(t)$ satisfies the requirements of stochastic approximation theory (Robbins and Monro, 1951).
2. As time $t$ approaches infinity, the function $\xi(t)$ approaches zero asymptotically. In accordance with Eq. (4.68), it follows that the estimator $\hat{\mathbf{w}}(t)$ approaches the optimal estimator $\mathbf{w}^*$ as $t$ approaches infinity.
3. The ensemble-averaged trajectory of the estimator $\hat{\mathbf{w}}(t)$ is almost parallel to the eigenvector of the Hessian $\mathbf{K}^*$ associated with the smallest eigenvalue $\lambda_{\min}$ after a large enough number of iterations.
4. The optimally annealed on-line learning algorithm for a network characterized by