Assumption III: The input vector x(n) and the desired response d(n) are jointly Gaussian
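The derivative of a real-valued function f(w) with respect to the m-by-1 vector w is the vector

$$\frac{\partial f}{\partial \mathbf{w}} = \left[\frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2}, \ldots, \frac{\partial f}{\partial w_m}\right]^T$$

Case 1 The function f(w) is defined by the inner product:

$$f(\mathbf{w}) = \mathbf{x}^T\mathbf{w} = \sum_{i=1}^{m} x_i w_i$$

Hence,

$$\frac{\partial f}{\partial w_i} = x_i, \qquad i = 1, 2, \ldots, m$$

or, equivalently, in matrix form,

$$\frac{\partial f}{\partial \mathbf{w}} = \mathbf{x} \tag{3.71}$$

Case 2 The function f(w) is defined by the quadratic form:

$$f(\mathbf{w}) = \mathbf{w}^T\mathbf{R}\mathbf{w} = \sum_{i=1}^{m}\sum_{j=1}^{m} w_i r_{ij} w_j$$

Here, $r_{ij}$ is the ij-th element of the m-by-m matrix R, taken to be symmetric (as a correlation matrix is). Hence,

$$\frac{\partial f}{\partial w_i} = 2\sum_{j=1}^{m} r_{ij} w_j, \qquad i = 1, 2, \ldots, m$$

or, equivalently, in matrix form,

$$\frac{\partial f}{\partial \mathbf{w}} = 2\mathbf{R}\mathbf{w} \tag{3.72}$$

Equations (3.71) and (3.72) provide two useful rules for the differentiation of a real-valued function with respect to a vector.
2. The pseudoinverse of a rectangular matrix is discussed in Golub and Van Loan (1996); see also Chapter 8 of Haykin (2002).
3. The Langevin equation is discussed in Reif (1965). For a fascinating historical account of the Langevin equation, see the tutorial paper on noise by Cohen (2005).
4. The orthogonality transformation described in Eq. (3.56) follows from the eigendecomposition of a square matrix. This topic is described in detail in Chapter 8.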
5. For an early (and perhaps the first) motivational treatment of $H^\infty$ control, the reader is referred to Zames (1981).
The first exposition of optimality of the LMS algorithm in the $H^\infty$ sense was presented in Hassibi et al. (1993). Hassibi et al. (1999) treat the $H^\infty$ theory from an estimation or adaptive-filtering perspective. Hassibi also presents a condensed treatment of robustness of the LMS algorithm in the $H^\infty$ sense in Chapter 5 of Haykin and Widrow (2005).
For books on $H^\infty$ theory from a control perspective, the reader is referred to Zhou and Doyle (1998) and Green and Limebeer (1995).
6. Sensitivity of convergence behavior of the LMS algorithm to variations in the condition number of the correlation matrix $\mathbf{R}_{xx}$, denoted by $\chi(\mathbf{R})$, is demonstrated experimentally in Section 5.7 of the book by Haykin (2002). In Chapter 9 of Haykin (2002), which deals with recursive implementation of the method of least squares, it is also shown that convergence behavior of the resulting algorithm is essentially independent of the condition number $\chi(\mathbf{R})$.
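The two differentiation rules of Eqs. (3.71) and (3.72) can be checked numerically. The following is a minimal sketch, not part of the original notes: it compares a central-difference estimate of the gradient with x and with 2Rw. The vectors x and w, the symmetric matrix R, and the helper names are illustrative assumptions.

```python
# Numerical check of the differentiation rules (3.71) and (3.72).
# x, w, R and all function names are illustrative assumptions.

def inner(x, w):
    """f(w) = x^T w."""
    return sum(xi * wi for xi, wi in zip(x, w))

def quadratic(R, w):
    """f(w) = w^T R w."""
    m = len(w)
    return sum(w[i] * R[i][j] * w[j] for i in range(m) for j in range(m))

def numerical_gradient(f, w, h=1e-6):
    """Central-difference estimate of [df/dw_1, ..., df/dw_m]^T."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += h
        w_minus[i] -= h
        grad.append((f(w_plus) - f(w_minus)) / (2.0 * h))
    return grad

x = [1.0, -2.0, 0.5]
w = [0.3, 0.7, -1.1]
R = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.3],
     [0.0, 0.3, 1.5]]          # symmetric, as Eq. (3.72) requires

grad_inner = numerical_gradient(lambda v: inner(x, v), w)
grad_quad = numerical_gradient(lambda v: quadratic(R, v), w)
two_Rw = [2.0 * sum(R[i][j] * w[j] for j in range(len(w)))
          for i in range(len(w))]

print("Case 1:", grad_inner, "vs x   =", x)
print("Case 2:", grad_quad, "vs 2Rw =", two_Rw)
```

For a nonsymmetric R the same check would instead agree with (R + R^T)w, which is why symmetry matters for the factor of 2.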
PROBLEMS
3.1 (a) Let m(n) denote the mean weight vector of the LMS algorithm at iteration n; that is,

$$\mathbf{m}(n) = \mathbb{E}[\hat{\mathbf{w}}(n)]$$

Using the small-learning-rate parameter theory of Section 3.9, show that

$$\mathbf{m}(n) = (\mathbf{I} - \eta\mathbf{R}_{xx})^n[\mathbf{m}(0) - \mathbf{m}(\infty)] + \mathbf{m}(\infty)$$

where $\eta$ is the learning-rate parameter, $\mathbf{R}_{xx}$ is the correlation matrix of the input vector x(n), and m(0) and $\mathbf{m}(\infty)$ are the initial and final values of m(n), respectively.
(b) Show that for convergence of the LMS algorithm in the mean, the learning-rate parameter $\eta$ must satisfy the condition

$$0 < \eta < \frac{2}{\lambda_{\max}}$$

where $\lambda_{\max}$ is the largest eigenvalue of the correlation matrix $\mathbf{R}_{xx}$.
3.2 Continuing from Problem 3.1, discuss why convergence of the LMS algorithm in the mean is not an adequate criterion for convergence in practice.
3.3 Consider the use of a white-noise sequence of zero mean and variance $\sigma^2$ as the input to the LMS algorithm. Determine the condition for convergence of the algorithm in the mean-square sense.
3.4 In a variant of the LMS algorithm called the leaky LMS algorithm, the cost function to be minimized is defined by

$$\mathcal{E}(n) = \frac{1}{2}e^2(n) + \frac{1}{2}\lambda\|\mathbf{w}(n)\|^2$$

where w(n) is the parameter vector, e(n) is the estimation error, and $\lambda$ is a constant. As in the ordinary LMS algorithm, we have

$$e(n) = d(n) - \mathbf{w}^T(n)\mathbf{x}(n)$$

where d(n) is the desired response corresponding to the input vector x(n).
(a) Show that the time update for the parameter vector of the leaky LMS algorithm is defined by

$$\hat{\mathbf{w}}(n+1) = (1 - \eta\lambda)\hat{\mathbf{w}}(n) + \eta\mathbf{x}(n)e(n)$$

which includes the ordinary LMS algorithm as a special case.
(b) Using the small learning-rate parameter theory of Section 3.9, show that

$$\lim_{n \to \infty}\mathbb{E}[\hat{\mathbf{w}}(n)] = (\mathbf{R}_{xx} + \lambda\mathbf{I})^{-1}\mathbf{r}_{dx}$$

where $\mathbf{R}_{xx}$ is the correlation matrix of x(n), I is the identity matrix, and $\mathbf{r}_{dx}$ is the cross-correlation vector between x(n) and d(n).
3.5 Continuing from Problem 3.4, verify that the leaky LMS algorithm can be “simulated” by adding white noise to the input vector x(n).
(a) What should the variance of this noise be for the condition in part (b) of Problem 3.4 to hold?
(b) When will the simulated algorithm take a form that is practically the same as the leaky LMS algorithm? Justify your answer.
3.6 An alternative to the mean-square error (MSE) formulation of the learning curve that we sometimes find in the literature is the mean-square deviation (MSD) learning curve. Define the weight-error vector

$$\boldsymbol{\varepsilon}(n) = \mathbf{w} - \hat{\mathbf{w}}(n)$$

where w is the parameter vector of the regression model supplying the desired response. This second learning curve is obtained by computing a plot of the MSD

$$D(n) = \mathbb{E}[\|\boldsymbol{\varepsilon}(n)\|^2]$$

versus the number of iterations n.
Using the small-learning-rate-parameter theory of Section 3.9, show that

$$D(\infty) = \lim_{n \to \infty} D(n) = \frac{1}{2}\eta M J_{\min}$$

where $\eta$ is the learning-rate parameter, M is the size of the parameter vector $\hat{\mathbf{w}}$, and $J_{\min}$ is the minimum mean-square error of the LMS algorithm.
3.7 In this problem, we address a proof of the direct-averaging method, assuming ergodicity.
Start with Eq. (3.41), which defines the weight-error vector $\boldsymbol{\varepsilon}(n)$ in terms of the transition matrix A(n) and driving force f(n), which are themselves defined in terms of the input vector x(n) in Eqs. (3.42) and (3.43), respectively; then proceed as follows:
• Set n = 0, and evaluate $\boldsymbol{\varepsilon}(1)$.
• Set n = 1, and evaluate $\boldsymbol{\varepsilon}(2)$.
• Continue in this fashion for a few more iterations.
With these iterated values of $\boldsymbol{\varepsilon}(n)$ at hand, deduce a formula for the transition matrix A(n).
Next, assume that the learning-rate parameter $\eta$ is small enough to justify retaining only the terms that are linear in $\eta$. Hence, show that

$$\mathbf{A}(n) = \mathbf{I} - \eta\sum_{i=1}^{n}\mathbf{x}(i)\mathbf{x}^T(i)$$

which, assuming ergodicity, takes the form

$$\mathbf{A}(n) = \mathbf{I} - \eta\mathbf{R}_{xx}$$

3.8 When the learning-rate parameter $\eta$ is small, the LMS algorithm acts like a low-pass filter with a small cutoff frequency. Such a filter produces an output that is proportional to the average of the input signal.
Using Eq. (3.41), demonstrate this property of the LMS algorithm by considering the simple example of the algorithm using a single parameter.
3.9 Starting with Eq. (3.55) for a small learning-rate parameter, show that under steady-state conditions, the Lyapunov equation

$$\mathbf{R}\mathbf{P}_0(n) + \mathbf{P}_0(n)\mathbf{R} = \eta\sum_{i=0}^{\infty} J_{\min}^{(i)}\mathbf{R}^{(i)}$$

holds, where we have

$$J_{\min}^{(i)} = \mathbb{E}[e_o(n)e_o(n-i)]$$

and

$$\mathbf{R}^{(i)} = \mathbb{E}[\mathbf{x}(n)\mathbf{x}^T(n-i)]$$

for i = 0, 1, 2, .... The matrix $\mathbf{P}_0$ is defined by $\mathbb{E}[\boldsymbol{\varepsilon}_o(n)\boldsymbol{\varepsilon}_o^T(n)]$, and $e_o(n)$ is the irreducible estimation error produced by the Wiener filter.
Computer Experiments
3.10 Repeat the computer experiment of Section 3.10 on linear prediction for the following values of the learning-rate parameter:
(i) $\eta$ = 0.002;
(ii) $\eta$ = 0.01;
(iii) $\eta$ = 0.02.
Comment on your findings in the context of applicability of the small-learning-rate-parameter theory of the LMS algorithm for each value of $\eta$.
3.11 Repeat the computer experiment of Section 3.11 on pattern classification for the distance of separation between the two moons of Fig. 1.8 set at d = 0. Compare the results of your experiment with those in Problem 1.6 on the perceptron and Problem 2.7 on the method of least squares.
3.12 Plot the pattern-classification learning curves of the LMS algorithm applied to the double-moon configuration of Fig. 1.8 for the following values assigned to the distance of separation:
d = 1, d = 0, d = −4
Compare the results of the experiment with the corresponding ones obtained using Rosenblatt’s perceptron in Chapter 1.
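As a numerical companion to Problem 3.1, the sketch below iterates the mean weight vector and compares it with the closed-form expression $(\mathbf{I} - \eta\mathbf{R}_{xx})^n[\mathbf{m}(0) - \mathbf{m}(\infty)] + \mathbf{m}(\infty)$. It rests on two assumptions that are not stated in the problem itself: the mean obeys the recursion m(n+1) = m(n) + η(r_dx − R_xx m(n)), so that m(∞) = R_xx⁻¹ r_dx, and R_xx is diagonal, so its eigenvalues are just its diagonal entries. All numerical values and function names are illustrative.

```python
# Mean weight vector of the LMS algorithm (Problem 3.1), small-eta theory.
# Assumptions: diagonal R_xx (eigenvalues = diagonal entries) and the mean
# recursion m(n+1) = m(n) + eta * (r_dx - R_xx m(n)). Values are illustrative.

def iterate_mean(eta, eigenvalues, r_dx, m0, n_steps):
    """Run the mean recursion for n_steps iterations, component by component."""
    m = list(m0)
    for _ in range(n_steps):
        m = [mi + eta * (ri - lam * mi)
             for mi, ri, lam in zip(m, r_dx, eigenvalues)]
    return m

def closed_form_mean(eta, eigenvalues, r_dx, m0, n):
    """Evaluate (I - eta R_xx)^n [m(0) - m(inf)] + m(inf) for diagonal R_xx."""
    m_inf = [ri / lam for ri, lam in zip(r_dx, eigenvalues)]
    return [(1.0 - eta * lam) ** n * (m0i - mi) + mi
            for lam, m0i, mi in zip(eigenvalues, m0, m_inf)]

eigenvalues = [2.0, 0.5]              # diagonal of R_xx, so lambda_max = 2.0
r_dx = [1.0, 0.25]                    # gives m(inf) = [0.5, 0.5]
m0 = [0.0, 0.0]
eta_bound = 2.0 / max(eigenvalues)    # convergence in the mean needs eta < 1.0

stable = iterate_mean(0.1, eigenvalues, r_dx, m0, 50)
predicted = closed_form_mean(0.1, eigenvalues, r_dx, m0, 50)
unstable = iterate_mean(1.2, eigenvalues, r_dx, m0, 50)   # eta above the bound

print("iterated :", stable)
print("predicted:", predicted)
print("eta = 1.2 exceeds the bound", eta_bound, "->", unstable)
```

With η above 2/λ_max, the component associated with the largest eigenvalue grows without bound while the other component still settles, which is the behavior part (b) of the problem asks the reader to explain.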
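The leaky LMS update of Problem 3.4 can also be explored by simulation. The sketch below is only an illustration under stated assumptions: the input is white Gaussian noise of unit variance (so that R_xx = I and r_dx equals the true weight vector), the desired response is noise-free, and the weights are averaged over the second half of the run as a stand-in for the expectation. Under these assumptions the limit in part (b) reduces to the true weights divided by (1 + λ). The names `leaky_lms_step` and `average_tail_weights`, and all numerical values, are hypothetical.

```python
import random

# Leaky LMS (Problem 3.4): w(n+1) = (1 - eta*leak) * w(n) + eta * x(n) * e(n).
# Assumptions: white unit-variance Gaussian input, noise-free desired response,
# and a time average over the tail of the run in place of the expectation.

def leaky_lms_step(w, x, d, eta, leak):
    """One update of the weight vector; leak = 0 gives the ordinary LMS step."""
    e = d - sum(wi * xi for wi, xi in zip(w, x))      # e(n) = d(n) - w^T(n) x(n)
    return [(1.0 - eta * leak) * wi + eta * xi * e for wi, xi in zip(w, x)]

def average_tail_weights(w_true, leak, eta=0.01, n_steps=20000, seed=0):
    """Run leaky LMS and average the weights over the second half of the run."""
    rng = random.Random(seed)
    w = [0.0] * len(w_true)
    tail_sum = [0.0] * len(w_true)
    tail_start = n_steps // 2
    for n in range(n_steps):
        x = [rng.gauss(0.0, 1.0) for _ in w_true]     # white input, R_xx = I
        d = sum(a * b for a, b in zip(w_true, x))     # noise-free regression model
        w = leaky_lms_step(w, x, d, eta, leak)
        if n >= tail_start:
            tail_sum = [s + wi for s, wi in zip(tail_sum, w)]
    count = n_steps - tail_start
    return [s / count for s in tail_sum]

w_true = [1.0, -0.5]
leak = 0.5
w_lms = average_tail_weights(w_true, leak=0.0)        # ordinary LMS
w_leaky = average_tail_weights(w_true, leak=leak)     # leaky LMS
w_limit = [wi / (1.0 + leak) for wi in w_true]        # (R_xx + leak*I)^{-1} r_dx

print("ordinary LMS  :", w_lms)
print("leaky LMS     :", w_leaky)
print("predicted mean:", w_limit)
```

The leak term shrinks the averaged weights toward zero relative to the ordinary LMS solution, which is the bias that part (b) of Problem 3.4 quantifies.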
ORGANIZATION OF THE CHAPTER
In this chapter, we study the many facets of the multilayer perceptron, which stands for a neural network with one or more hidden layers. After the introductory material presented in Section 4.1, the study proceeds as follows:
1. Sections 4.2 through 4.7 discuss matters relating to back-propagation learning. We begin with some preliminaries in Section 4.2 to pave the way for the derivation of the back-propagation algorithm. This section also includes a discussion of the credit-assignment problem. In Section 4.3, we describe two methods of learning: batch and on-line. In Section 4.4, we present a detailed derivation of the back-propagation algorithm, using the chain rule of calculus; we take a traditional approach in this derivation. In Section 4.5, we illustrate the use of the back-propagation algorithm by solving the XOR problem, an interesting problem that cannot be solved by Rosenblatt’s perceptron. Section 4.6 presents some heuristics and practical guidelines for making the back-propagation algorithm perform better. Section 4.7 presents a pattern-classification experiment on the multilayer perceptron trained with the back-propagation algorithm.
2. Sections 4.8 and 4.9 deal with the error surface. In Section 4.8, we discuss the fundamental role of back-propagation learning in computing partial derivatives of a network-approximating function. We then discuss computational issues relating to the Hessian of the error surface in Section 4.9. In Section 4.10, we discuss two issues: how to fulfill optimal annealing and how to make the learning-rate parameter adaptive.
3. Sections 4.11 through 4.14 focus on various matters relating to the performance of a multilayer perceptron trained with the back-propagation algorithm. In Section 4.11, we discuss the issue of generalization, the very essence of learning. Section 4.12 addresses the approximation of continuous functions by means of multilayer perceptrons. The use of cross-validation as a statistical design tool is discussed in Section 4.13. In Section 4.14, we discuss the issue of complexity regularization, as well as network-pruning techniques.
4. Section 4.15 summarizes the advantages and limitations of back-propagation learning.
5. Having completed the study of back-propagation learning, we next take a different perspective on learning in Section 4.16 by viewing supervised learning as an optimization problem.