Multilayer Perceptrons

2. generalization of the on-line learning algorithm so that its applicability is broadened by avoiding the need for a prescribed cost function.

To be specific, the ensemble-averaged dynamics of the weight vector $\mathbf{w}$, defined in Eq. (4.62), is now rewritten as⁶

$$\frac{d}{dt}\hat{\mathbf{w}}(t) = -\eta(t)\,\mathbb{E}_{\mathbf{x},d}\big[\mathbf{f}(\mathbf{x}(t), d(t); \hat{\mathbf{w}}(t))\big] \tag{4.77}$$

where the vector-valued function $\mathbf{f}(\cdot, \cdot\,; \cdot)$ denotes flow that determines the change applied to the estimator $\hat{\mathbf{w}}(t)$ in response to the incoming example $\{\mathbf{x}(t), d(t)\}$. The flow $\mathbf{f}$ is required to satisfy the condition

$$\mathbb{E}_{\mathbf{x},d}\big[\mathbf{f}(\mathbf{x}, d; \mathbf{w}^*)\big] = \mathbf{0} \tag{4.78}$$

where $\mathbf{w}^*$ is the optimal value of the weight vector $\mathbf{w}$, as previously defined in Eq. (4.57). In other words, the flow $\mathbf{f}$ must asymptotically converge to the optimal parameter $\mathbf{w}^*$ across time $t$. Moreover, for stability, we also require that the gradient of $\mathbf{f}$ should be a positive-definite matrix. The flow $\mathbf{f}$ includes the gradient vector $\mathbf{g}$ in Eq. (4.62) as a special case.

Equations (4.63) through (4.69), defined previously, apply equally well to Murata’s algorithm. Thereafter, however, the assumption made is that the evolution of the learning rate $\eta(t)$ across time $t$ is governed by a dynamic system that comprises the pair of differential equations

$$\frac{d}{dt}\xi(t) = -\lambda\,\eta(t)\,\xi(t) \tag{4.79}$$

and

$$\frac{d}{dt}\eta(t) = \alpha\,\eta(t)\big(\beta\,\xi(t) - \eta(t)\big) \tag{4.80}$$

where it should be noted that $\eta(t)$ is always positive and $\alpha$ and $\beta$ are positive constants.

The first equation of this dynamic system is a repeat of Eq. (4.69). The second equation of the system is motivated by the corresponding differential equation in the learning of the learning algorithm described in Sompolinsky et al. (1995).⁷

As before, the $\lambda$ in Eq. (4.79) is the eigenvalue associated with the eigenvector $\mathbf{q}$ of the Hessian $\mathbf{K}^*$. Moreover, it is hypothesized that $\mathbf{q}$ is chosen as the particular eigenvector associated with the smallest eigenvalue $\lambda_{\min}$. This, in turn, means that the ensemble-averaged flow $\mathbf{f}$ converges to the optimal parameter $\mathbf{w}^*$ in a manner similar to that previously described, as depicted in Fig. 4.15.
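As a rough numerical check on how the learning rate evolves under Eqs. (4.79) and (4.80), the pair of differential equations can be integrated with a simple forward-Euler scheme. The sketch below is only illustrative: the function name `integrate_learning_rate`, the initial values, the step size, and the constants (chosen with $\alpha > \lambda$) are all assumptions made for the example, not values taken from the text.

```python
def integrate_learning_rate(alpha=3.0, beta=1.0, lam=1.0,
                            eta0=0.5, xi0=1.0, t_end=500.0, dt=0.01):
    """Forward-Euler integration of Eqs. (4.79) and (4.80).

    All default values are illustrative assumptions; alpha > lam is assumed.
    Returns the final time t together with eta(t) and xi(t).
    """
    eta, xi, t = eta0, xi0, 0.0
    for _ in range(int(round(t_end / dt))):
        d_xi = -lam * eta * xi                    # right-hand side of Eq. (4.79)
        d_eta = alpha * eta * (beta * xi - eta)   # right-hand side of Eq. (4.80)
        xi += dt * d_xi
        eta += dt * d_eta
        t += dt
    return t, eta, xi
```

Calling `t, eta, xi = integrate_learning_rate()` and inspecting the products `t * eta` and `t * xi` gives a quick way to see whether both quantities decay in proportion to $1/t$ for large $t$; with the assumed constants, `t * eta` should settle close to $1/\lambda$.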

The asymptotic behavior of the dynamic system described in Eqs. (4.79) and (4.80) is given by the corresponding pair of equations

$$\xi(t) = \frac{1}{\beta}\left(\frac{1}{\lambda} - \frac{1}{\alpha}\right)\frac{1}{t}, \qquad \alpha > \lambda \tag{4.81}$$

and

$$\eta(t) = \frac{c}{t}, \qquad c = \lambda^{-1} \tag{4.82}$$

The important point to note here is that this new dynamic system exhibits the desired annealing of the learning rate $\eta(t)$, namely, $c/t$ for large $t$, which is optimal for any estimator $\hat{\mathbf{w}}(t)$ converging to $\mathbf{w}^*$, as previously discussed.

In light of the considerations just presented, we may now formally describe the Murata adaptive algorithm for on-line learning in discrete time as follows (Murata, 1998; Müller et al., 1998):

$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) - \eta(n)\,\mathbf{f}(\mathbf{x}(n+1), d(n+1); \hat{\mathbf{w}}(n)) \tag{4.83}$$

$$\mathbf{r}(n+1) = \mathbf{r}(n) + \delta\,\mathbf{f}(\mathbf{x}(n+1), d(n+1); \hat{\mathbf{w}}(n)), \qquad 0 < \delta < 1 \tag{4.84}$$

$$\eta(n+1) = \eta(n) + \alpha\,\eta(n)\big(\beta\,\lVert\mathbf{r}(n+1)\rVert - \eta(n)\big) \tag{4.85}$$

The following points are noteworthy in the formulation of this discrete-time system of equations:

• Equation (4.83) is simply the instantaneous discrete-time version of the differential equation of Eq. (4.77).

• Equation (4.84) includes an auxiliary vector $\mathbf{r}(n)$, which has been introduced to account for the continuous-time function $\xi(t)$. Moreover, this second equation of the Murata adaptive algorithm includes a leakage factor $\delta$ whose value controls the running average of the flow $\mathbf{f}$.

• Equation (4.85) is a discrete-time version of the differential equation Eq. (4.80). The updated auxiliary vector $\mathbf{r}(n+1)$ included in Eq. (4.85) links it to Eq. (4.84); in so doing, allowance is made for the linkage between the continuous-time functions $\xi(t)$ and $\eta(t)$ previously defined in Eqs. (4.79) and (4.80).

Unlike the continuous-time dynamic system described in Eqs. (4.79) and (4.80), the asymptotic behavior of the learning-rate parameter $\eta(n)$ in Eq. (4.85) does not converge to zero as the number of iterations, $n$, approaches infinity, thereby violating the requirement for optimal annealing. Accordingly, in the neighborhood of the optimal parameter $\mathbf{w}^*$, we now find that for the Murata adaptive algorithm:

$$\lim_{n \to \infty} \hat{\mathbf{w}}(n) \neq \mathbf{w}^* \tag{4.86}$$

This asymptotic behavior is different from that of the optimally annealed on-line learning algorithm of Eq. (4.76). Basically, the deviation from optimal annealing is attributed to the use of a running average of the flow in Eq. (4.77), the inclusion of which was motivated by the need to account for the algorithm not having access to a prescribed cost function, as was the case in deriving the optimally annealed on-line learning algorithm of Eq. (4.76).

The learning of the learning rule is useful when the optimal $\hat{\mathbf{w}}^*$ varies slowly with time $n$ (i.e., the environment responsible for generating the examples is nonstationary) or when it changes suddenly. On the other hand, the $1/n$ rule is not a good choice in such an environment, because $\eta(n)$ becomes very small for large $n$, causing the $1/n$ rule to lose its learning capability. Basically, the difference between the optimally annealed on-line learning algorithm of Eq. (4.76) and the on-line learning algorithm described in Eqs. (4.83) to (4.85) is that the latter has a built-in mechanism for adaptive control of the learning rate; hence its ability to track variations in the optimal $\hat{\mathbf{w}}^*$.

A final comment is in order: Although the Murata adaptive algorithm is indeed suboptimal insofar as annealing of the learning-rate parameter is concerned, its important virtue is the broadened applicability of on-line learning in a practically implementable manner.
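The discrete-time updates of Eqs. (4.83) to (4.85), together with the tracking behavior just described, can be sketched in a few lines of Python. What follows is a minimal illustration under stated assumptions, not a reference implementation: the flow is taken to be the instantaneous gradient of the squared error of a single linear neuron, the environment is a hypothetical two-input linear model whose optimal weights jump once, and every name and constant (`flow`, `murata_step`, `run`, the values of $\alpha$, $\beta$, $\delta$) is chosen for the example. One liberty is taken and should be read as an assumption of the sketch: the auxiliary vector is updated as a leaky running average, $\mathbf{r} \leftarrow (1-\delta)\mathbf{r} + \delta\mathbf{f}$, so that $\delta$ acts as the leakage factor mentioned in the text.

```python
import math
import random


def flow(x, d, w):
    """Assumed flow f(x, d; w): gradient of 0.5 * (d - w.x)**2 with respect to w."""
    err = d - sum(wi * xi for wi, xi in zip(w, x))
    return [-err * xi for xi in x]


def murata_step(w, r, eta, x, d, alpha, beta, delta):
    """One iteration of the adaptive learning-rate updates."""
    f = flow(x, d, w)
    # Weight update, Eq. (4.83)
    w_new = [wi - eta * fi for wi, fi in zip(w, f)]
    # Leaky running average of the flow (assumed form of Eq. (4.84))
    r_new = [(1.0 - delta) * ri + delta * fi for ri, fi in zip(r, f)]
    r_norm = math.sqrt(sum(ri * ri for ri in r_new))
    # Learning-rate update, Eq. (4.85)
    eta_new = eta + alpha * eta * (beta * r_norm - eta)
    return w_new, r_new, eta_new


def run(n_steps=5000, change_at=3000, seed=0,
        alpha=1.0, beta=0.4, delta=0.05):
    """Run the sketch on a hypothetical environment that changes abruptly once."""
    rng = random.Random(seed)
    w_star = [1.0, -2.0]                  # optimal weights before the change
    w, r, eta = [0.0, 0.0], [0.0, 0.0], 0.05
    etas = []
    for n in range(n_steps):
        if n == change_at:
            w_star = [-1.0, 1.0]          # sudden change in the environment
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        d = sum(ws * xi for ws, xi in zip(w_star, x)) + rng.gauss(0.0, 0.1)
        w, r, eta = murata_step(w, r, eta, x, d, alpha, beta, delta)
        etas.append(eta)
    return w, w_star, etas
```

Running `w, w_star, etas = run()` and plotting `etas` against the iteration index is one way to see the mechanism at work: the learning rate decays while the weights settle, rises again shortly after the change at `change_at`, and levels off at a small nonzero value rather than annealing to zero.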

4.11 GENERALIZATION

In back-propagation learning, we typically start with a training sample and use the back-propagation algorithm to compute the synaptic weights of a multilayer perceptron by loading (encoding) as many of the training examples as possible into the network. The hope is that the neural network so designed will generalize well. A network is said to generalize well when the input–output mapping computed by the network is correct (or nearly so) for test data never used in creating or training the network; the term “generalization” is borrowed from psychology. Here, it is assumed that the test data are drawn from the same population used to generate the training data.

The learning process (i.e., training of a neural network) may be viewed as a “curve-fitting” problem. The network itself may be considered simply as a nonlinear input–output mapping. Such a viewpoint then permits us to look at generalization not as a mystical property of neural networks, but rather simply as the effect of a good nonlinear interpolation of the input data. The network performs useful interpolation primarily because multilayer perceptrons with continuous activation functions lead to output functions that are also continuous.

Figure 4.16a illustrates how generalization may occur in a hypothetical network. The nonlinear input–output mapping represented by the curve depicted in this figure is computed by the network as a result of learning the points labeled as “training data.” The point marked in red on the curve as “generalization” is thus seen as the result of interpolation performed by the network.

A neural network that is designed to generalize well will produce a correct input–output mapping even when the input is slightly different from the examples used to train the network, as illustrated in the figure. When, however, a neural network learns too many input–output examples, the network may end up memorizing the training data. It may do so by finding a feature (due to noise, for example) that is present in the training data, but not true of the underlying function that is to be modeled. Such a phenomenon is referred to as overfitting or overtraining. When the network is overtrained, it loses the ability to generalize between similar input–output patterns.

Ordinarily, loading data into a multilayer perceptron in this way requires the use of more hidden neurons than are actually necessary, with the result that undesired contributions in the input space due to noise are stored in synaptic weights of the network. An example of how poor generalization due to memorization in a neural network may occur is illustrated in Fig. 4.16b for the same data as depicted in Fig. 4.16a. “Memorization” is essentially a “look-up table,” which implies that the input–output mapping computed by the neural network is not smooth. As pointed out in Poggio and Girosi (1990a), smoothness of input–output mapping is closely related to such model-selection criteria as Occam’s razor, the essence of which is to select the “simplest” function in the absence of any prior knowledge to the contrary. In the context of our present discussion, the simplest function means the smoothest function that approximates the mapping for a given error criterion, because such a choice generally demands the fewest computational resources. Smoothness is also natural in many applications, depending on the scale of the phenomenon being studied. It is therefore important to seek a smooth nonlinear mapping for ill-posed input–output relationships, so that the network is able to classify novel patterns correctly with respect to the training patterns (Wieland and Leighton, 1987).

FIGURE 4.16 (a) Properly fitted nonlinear mapping with good generalization. (b) Overfitted nonlinear mapping with poor generalization. [Both panels plot output against input and mark the nonlinear mapping learned through training, the training data points, and the generalization point.]

Sufficient Training-Sample Size for a Valid Generalization

Generalization is influenced by three factors: (1) the size of the training sample and how representative the training sample is of the environment of interest, (2) the architecture of the neural network, and (3) the physical complexity of the problem at hand. Clearly, we have no control over the lattermost factor. In the context of the other two factors, we may view the issue of generalization from two different perspectives:

• The architecture of the network is fixed (hopefully in accordance with the physical complexity of the underlying problem), and the issue to be resolved is that of determining the size of the training sample needed for a good generalization to occur.

• The size of the training sample is fixed, and the issue of interest is that of determining the best architecture of network for achieving good generalization.

Both of these viewpoints are valid in their own individual ways.

In practice, it seems that all we really need for a good generalization is to have the size of the training sample, $N$, satisfy the condition

$$N = O\!\left(\frac{W}{\epsilon}\right) \tag{4.87}$$

where $W$ is the total number of free parameters (i.e., synaptic weights and biases) in the network, $\epsilon$ denotes the fraction of classification errors permitted on test data (as in pattern classification), and $O(\cdot)$ denotes the order of quantity enclosed within. For example, with an error of 10 percent, the number of training examples needed should be about 10 times the number of free parameters in the network.

Equation (4.87) is in accordance with Widrow’s rule of thumb for the LMS algorithm, which states that the settling time for adaptation in linear adaptive temporal filtering is approximately equal to the memory span of an adaptive tapped-delay-line filter divided by the misadjustment (Widrow and Stearns, 1985; Haykin, 2002). The misadjustment in the LMS algorithm plays a role somewhat analogous to the error $\epsilon$ in Eq. (4.87). Further justification for this empirical rule is presented in the next section.

4.12 APPROXIMATIONS OF FUNCTIONS

A multilayer perceptron trained with the back-propagation algorithm may be viewed as a practical vehicle for performing a nonlinear input–output mapping of a general nature. To be specific, let $m_0$ denote the number of input (source) nodes of a multilayer perceptron, and let $M = m_L$ denote the number of neurons in the output layer of the network. The input–output relationship of the network defines a mapping from an $m_0$-dimensional Euclidean input space to an $M$-dimensional Euclidean output space, which is infinitely continuously differentiable when the activation function is likewise. In assessing the capability of the multilayer perceptron from this viewpoint of input–output mapping, the following fundamental question arises:

What is the minimum number of hidden layers in a multilayer perceptron with an input–output mapping that provides an approximate realization of any continuous mapping?

Universal Approximation Theorem

The answer to this question is embodied in the universal approximation theorem⁸ for a nonlinear input–output mapping, which may be stated as follows:

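Let $\varphi(\cdot)$ be a nonconstant, bounded, and monotone-increasing continuous function. Let $I_{m_0}$ denote the $m_0$-dimensional unit hypercube $[0, 1]^{m_0}$. The space of continuous functions on $I_{m_0}$ is denoted by $C(I_{m_0})$. Then, given any function $f \in C(I_{m_0})$ and $\epsilon > 0$, there exist an integer $m_1$ and sets of real constants $\alpha_i$, $b_i$, and $w_{ij}$, where $i = 1, \ldots, m_1$ and $j = 1, \ldots, m_0$ such that we may define

$$F(x_1, \ldots, x_{m_0}) = \sum_{i=1}^{m_1} \alpha_i\,\varphi\!\left(\sum_{j=1}^{m_0} w_{ij} x_j + b_i\right) \tag{4.88}$$

as an approximate realization of the function $f(\cdot)$; that is,

$$\left| F(x_1, \ldots, x_{m_0}) - f(x_1, \ldots, x_{m_0}) \right| < \epsilon$$

for all $x_1, x_2, \ldots, x_{m_0}$ that lie in the input space.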
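To make the form of Eq. (4.88) concrete, the sketch below evaluates a weighted sum of hyperbolic-tangent units applied to affine functions of the inputs, and then builds one particular set of constants by hand for a one-dimensional example. It is an illustration under stated assumptions, not the construction used in any proof of the theorem and not the result of back-propagation training: the target $f(x) = x^2$ on $[0, 1]$, the number of slices, the steepness, and the function names are all hypothetical choices. Each slice of the interval contributes one steep tanh unit that rises by the change in $f$ across the slice, and one extra unit with zero input weight supplies the constant offset so that the whole expression stays in the form of Eq. (4.88).

```python
import math


def network_output(x, alpha, w, b, phi=math.tanh):
    """Evaluate F(x) = sum_i alpha[i] * phi(sum_j w[i][j] * x[j] + b[i])."""
    total = 0.0
    for alpha_i, w_i, b_i in zip(alpha, w, b):
        induced = sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)) + b_i
        total += alpha_i * phi(induced)
    return total


def staircase_parameters(f, n_slices=50, steepness=100.0):
    """Hand-built constants for a one-input network approximating f on [0, 1]."""
    alpha, w, b = [], [], []
    for i in range(n_slices):
        left, right = i / n_slices, (i + 1) / n_slices
        centre = 0.5 * (left + right)
        alpha.append(0.5 * (f(right) - f(left)))  # half the rise of f over the slice
        w.append([steepness])                     # steep unit switching on near centre
        b.append(-steepness * centre)
    # Constant offset f(0) + sum(alpha), realized by a unit that ignores its input.
    offset = f(0.0) + sum(alpha)
    alpha.append(offset / math.tanh(1.0))
    w.append([0.0])
    b.append(1.0)
    return alpha, w, b


def max_error(f, alpha, w, b, n_grid=201):
    """Largest absolute deviation between the network and f on a grid over [0, 1]."""
    grid = [k / (n_grid - 1) for k in range(n_grid)]
    return max(abs(network_output([x], alpha, w, b) - f(x)) for x in grid)
```

For instance, `alpha, w, b = staircase_parameters(lambda x: x * x)` followed by `max_error(lambda x: x * x, alpha, w, b)` reports how closely this hand-built network follows the target on the grid; increasing `n_slices` (and `steepness` with it) is the knob that trades hidden-layer size for accuracy in this sketch.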

The universal approximation theorem is directly applicable to multilayer perceptrons. We first note, for example, that the hyperbolic tangent function used as the nonlinearity in a neural model for the construction of a multilayer perceptron is indeed a nonconstant, bounded, and monotone-increasing function; it therefore satisfies the conditions imposed on the function $\varphi(\cdot)$. Next, we note that Eq. (4.88) represents the output of a multilayer perceptron described as follows:

1. The network has $m_0$ input nodes and a single hidden layer consisting of $m_1$ neurons; the inputs are denoted by $x_1, \ldots, x_{m_0}$.

2. Hidden neuron $i$ has synaptic weights $w_{i1}, \ldots, w_{im_0}$, and bias $b_i$.

3. The network output is a linear combination of the outputs of the hidden neurons,

In document: Neural Networks and Learning Machines (pages 193-198)