
Neural Networks and Learning Machines

Third Edition

Simon Haykin

McMaster University
Hamilton, Ontario, Canada

New York Boston San Francisco

London Toronto Sydney Tokyo Singapore Madrid Mexico City Munich Paris Cape Town Hong Kong Montreal


Neural networks and learning machines / Simon Haykin.—3rd ed.

p. cm.

Rev. ed. of: Neural networks. 2nd ed., 1999.

Includes bibliographical references and index.

ISBN-13: 978-0-13-147139-9
ISBN-10: 0-13-147139-2

1. Neural networks (Computer science) 2. Adaptive filters. I. Haykin, Simon. Neural networks. II. Title.

QA76.87.H39 2008 006.3--dc22

2008034079

Vice President and Editorial Director, ECS: Marcia J. Horton
Associate Editor: Alice Dworkin
Supervisor/Editorial Assistant: Dolores Mars
Editorial Assistant: William Opaluch
Director of Team-Based Project Management: Vince O’Brien
Senior Managing Editor: Scott Disanno
A/V Production Editor: Greg Dulles
Art Director: Jayne Conte
Cover Designer: Bruce Kenselaar
Manufacturing Manager: Alan Fischer
Manufacturing Buyer: Lisa McDowell
Marketing Manager: Tim Galligan

Copyright © 2009 by Pearson Education, Inc., Upper Saddle River, New Jersey 07458.

Pearson Prentice Hall. All rights reserved. Printed in the United States of America. This publication is protected by Copyright and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permission(s), write to: Rights and Permissions Department.

Pearson® is a registered trademark of Pearson plc.

Pearson Education Ltd.
Pearson Education Australia Pty. Limited
Pearson Education Singapore Pte. Ltd.
Pearson Education North Asia Ltd.
Pearson Education Canada, Ltd.
Pearson Educación de Mexico, S.A. de C.V.
Pearson Education–Japan
Pearson Education Malaysia Pte. Ltd.

10 9 8 7 6 5 4 3 2 1

ISBN-13: 978-0-13-147139-9
ISBN-10: 0-13-147139-2


To my wife, Nancy, for her patience and tolerance, and

to the countless researchers in neural networks for their original contributions, the many reviewers for their critical inputs, and many of my graduate students for their keen interest.


Contents

Preface x

Introduction 1
1. What Is a Neural Network? 1
2. The Human Brain 6
3. Models of a Neuron 10
4. Neural Networks Viewed As Directed Graphs 15
5. Feedback 18
6. Network Architectures 21
7. Knowledge Representation 24
8. Learning Processes 34
9. Learning Tasks 38
10. Concluding Remarks 45
Notes and References 46

Chapter 1 Rosenblatt’s Perceptron 47
1.1 Introduction 47
1.2 Perceptron 48
1.3 The Perceptron Convergence Theorem 50
1.4 Relation Between the Perceptron and Bayes Classifier for a Gaussian Environment 55
1.5 Computer Experiment: Pattern Classification 60
1.6 The Batch Perceptron Algorithm 62
1.7 Summary and Discussion 65
Notes and References 66
Problems 66

Chapter 2 Model Building through Regression 68
2.1 Introduction 68
2.2 Linear Regression Model: Preliminary Considerations 69
2.3 Maximum a Posteriori Estimation of the Parameter Vector 71
2.4 Relationship Between Regularized Least-Squares Estimation and MAP Estimation 76
2.5 Computer Experiment: Pattern Classification 77
2.6 The Minimum-Description-Length Principle 79
2.7 Finite Sample-Size Considerations 82
2.8 The Instrumental-Variables Method 86
2.9 Summary and Discussion 88
Notes and References 89
Problems 89

Chapter 3 The Least-Mean-Square Algorithm 91
3.1 Introduction 91
3.2 Filtering Structure of the LMS Algorithm 92
3.3 Unconstrained Optimization: A Review 94
3.4 The Wiener Filter 100
3.5 The Least-Mean-Square Algorithm 102
3.6 Markov Model Portraying the Deviation of the LMS Algorithm from the Wiener Filter 104
3.7 The Langevin Equation: Characterization of Brownian Motion 106
3.8 Kushner’s Direct-Averaging Method 107
3.9 Statistical LMS Learning Theory for Small Learning-Rate Parameter 108
3.10 Computer Experiment I: Linear Prediction 110
3.11 Computer Experiment II: Pattern Classification 112
3.12 Virtues and Limitations of the LMS Algorithm 113
3.13 Learning-Rate Annealing Schedules 115
3.14 Summary and Discussion 117
Notes and References 118
Problems 119

Chapter 4 Multilayer Perceptrons 122
4.1 Introduction 123
4.2 Some Preliminaries 124
4.3 Batch Learning and On-Line Learning 126
4.4 The Back-Propagation Algorithm 129
4.5 XOR Problem 141
4.6 Heuristics for Making the Back-Propagation Algorithm Perform Better 144
4.7 Computer Experiment: Pattern Classification 150
4.8 Back Propagation and Differentiation 153
4.9 The Hessian and Its Role in On-Line Learning 155
4.10 Optimal Annealing and Adaptive Control of the Learning Rate 157
4.11 Generalization 164
4.12 Approximations of Functions 166
4.13 Cross-Validation 171
4.14 Complexity Regularization and Network Pruning 175
4.15 Virtues and Limitations of Back-Propagation Learning 180
4.16 Supervised Learning Viewed as an Optimization Problem 186
4.17 Convolutional Networks 201
4.18 Nonlinear Filtering 203
4.19 Small-Scale Versus Large-Scale Learning Problems 209
4.20 Summary and Discussion 217
Notes and References 219
Problems 221

Chapter 5 Kernel Methods and Radial-Basis Function Networks 230
5.1 Introduction 230
5.2 Cover’s Theorem on the Separability of Patterns 231
5.3 The Interpolation Problem 236
5.4 Radial-Basis-Function Networks 239
5.5 K-Means Clustering 242
5.6 Recursive Least-Squares Estimation of the Weight Vector 245
5.7 Hybrid Learning Procedure for RBF Networks 249
5.8 Computer Experiment: Pattern Classification 250
5.9 Interpretations of the Gaussian Hidden Units 252
5.10 Kernel Regression and Its Relation to RBF Networks 255
5.11 Summary and Discussion 259
Notes and References 261
Problems 263

Chapter 6 Support Vector Machines 268
6.1 Introduction 268
6.2 Optimal Hyperplane for Linearly Separable Patterns 269
6.3 Optimal Hyperplane for Nonseparable Patterns 276
6.4 The Support Vector Machine Viewed as a Kernel Machine 281
6.5 Design of Support Vector Machines 284
6.6 XOR Problem 286
6.7 Computer Experiment: Pattern Classification 289
6.8 Regression: Robustness Considerations 289
6.9 Optimal Solution of the Linear Regression Problem 293
6.10 The Representer Theorem and Related Issues 296
6.11 Summary and Discussion 302
Notes and References 304
Problems 307

Chapter 7 Regularization Theory 313
7.1 Introduction 313
7.2 Hadamard’s Conditions for Well-Posedness 314
7.3 Tikhonov’s Regularization Theory 315
7.4 Regularization Networks 326
7.5 Generalized Radial-Basis-Function Networks 327
7.6 The Regularized Least-Squares Estimator: Revisited 331
7.7 Additional Notes of Interest on Regularization 335
7.8 Estimation of the Regularization Parameter 336
7.9 Semisupervised Learning 342
7.10 Manifold Regularization: Preliminary Considerations 343
7.11 Differentiable Manifolds 345
7.12 Generalized Regularization Theory 348
7.13 Spectral Graph Theory 350
7.14 Generalized Representer Theorem 352
7.15 Laplacian Regularized Least-Squares Algorithm 354
7.16 Experiments on Pattern Classification Using Semisupervised Learning 356
7.17 Summary and Discussion 359
Notes and References 361
Problems 363

Chapter 8 Principal-Components Analysis 367
8.1 Introduction 367
8.2 Principles of Self-Organization 368
8.3 Self-Organized Feature Analysis 372
8.4 Principal-Components Analysis: Perturbation Theory 373
8.5 Hebbian-Based Maximum Eigenfilter 383
8.6 Hebbian-Based Principal-Components Analysis 392
8.7 Case Study: Image Coding 398
8.8 Kernel Principal-Components Analysis 401
8.9 Basic Issues Involved in the Coding of Natural Images 406
8.10 Kernel Hebbian Algorithm 407
8.11 Summary and Discussion 412
Notes and References 415
Problems 418

Chapter 9 Self-Organizing Maps 425
9.1 Introduction 425
9.2 Two Basic Feature-Mapping Models 426
9.3 Self-Organizing Map 428
9.4 Properties of the Feature Map 437
9.5 Computer Experiments I: Disentangling Lattice Dynamics Using SOM 445
9.6 Contextual Maps 447
9.7 Hierarchical Vector Quantization 450
9.8 Kernel Self-Organizing Map 454
9.9 Computer Experiment II: Disentangling Lattice Dynamics Using Kernel SOM 462
9.10 Relationship Between Kernel SOM and Kullback–Leibler Divergence 464
9.11 Summary and Discussion 466
Notes and References 468
Problems 470

Chapter 10 Information-Theoretic Learning Models 475
10.1 Introduction 476
10.2 Entropy 477
10.3 Maximum-Entropy Principle 481
10.4 Mutual Information 484
10.5 Kullback–Leibler Divergence 486
10.6 Copulas 489
10.7 Mutual Information as an Objective Function to Be Optimized 493
10.8 Maximum Mutual Information Principle 494
10.9 Infomax and Redundancy Reduction 499
10.10 Spatially Coherent Features 501
10.11 Spatially Incoherent Features 504
10.12 Independent-Components Analysis 508
10.13 Sparse Coding of Natural Images and Comparison with ICA Coding 514
10.14 Natural-Gradient Learning for Independent-Components Analysis 516
10.15 Maximum-Likelihood Estimation for Independent-Components Analysis 526
10.16 Maximum-Entropy Learning for Blind Source Separation 529
10.17 Maximization of Negentropy for Independent-Components Analysis 534
10.18 Coherent Independent-Components Analysis 541
10.19 Rate Distortion Theory and Information Bottleneck 549
10.20 Optimal Manifold Representation of Data 553
10.21 Computer Experiment: Pattern Classification 560
10.22 Summary and Discussion 561
Notes and References 564
Problems 572

Chapter 11 Stochastic Methods Rooted in Statistical Mechanics 579
11.1 Introduction 580
11.2 Statistical Mechanics 580
11.3 Markov Chains 582
11.4 Metropolis Algorithm 591
11.5 Simulated Annealing 594
11.6 Gibbs Sampling 596
11.7 Boltzmann Machine 598
11.8 Logistic Belief Nets 604
11.9 Deep Belief Nets 606
11.10 Deterministic Annealing 610
11.11 Analogy of Deterministic Annealing with Expectation-Maximization Algorithm 616
11.12 Summary and Discussion 617
Notes and References 619
Problems 621

Chapter 12 Dynamic Programming 627
12.1 Introduction 627
12.2 Markov Decision Process 629
12.3 Bellman’s Optimality Criterion 631
12.4 Policy Iteration 635
12.5 Value Iteration 637
12.6 Approximate Dynamic Programming: Direct Methods 642
12.7 Temporal-Difference Learning 643
12.8 Q-Learning 648
12.9 Approximate Dynamic Programming: Indirect Methods 652
12.10 Least-Squares Policy Evaluation 655
12.11 Approximate Policy Iteration 660
12.12 Summary and Discussion 663
Notes and References 665
Problems 668

Chapter 13 Neurodynamics 672
13.1 Introduction 672
13.2 Dynamic Systems 674
13.3 Stability of Equilibrium States 678
13.4 Attractors 684
13.5 Neurodynamic Models 686
13.6 Manipulation of Attractors as a Recurrent Network Paradigm 689
13.7 Hopfield Model 690
13.8 The Cohen–Grossberg Theorem 703
13.9 Brain-State-in-a-Box Model 705
13.10 Strange Attractors and Chaos 711
13.11 Dynamic Reconstruction of a Chaotic Process 716
13.12 Summary and Discussion 722
Notes and References 724
Problems 727

Chapter 14 Bayesian Filtering for State Estimation of Dynamic Systems 731
14.1 Introduction 731
14.2 State-Space Models 732
14.3 Kalman Filters 736
14.4 The Divergence Phenomenon and Square-Root Filtering 744
14.5 The Extended Kalman Filter 750
14.6 The Bayesian Filter 755
14.7 Cubature Kalman Filter: Building on the Kalman Filter 759
14.8 Particle Filters 765
14.9 Computer Experiment: Comparative Evaluation of Extended Kalman and Particle Filters 775
14.10 Kalman Filtering in Modeling of Brain Functions 777
14.11 Summary and Discussion 780
Notes and References 782
Problems 784

Chapter 15 Dynamically Driven Recurrent Networks 790
15.1 Introduction 790
15.2 Recurrent Network Architectures 791
15.3 Universal Approximation Theorem 797
15.4 Controllability and Observability 799
15.5 Computational Power of Recurrent Networks 804
15.6 Learning Algorithms 806
15.7 Back Propagation Through Time 808
15.8 Real-Time Recurrent Learning 812
15.9 Vanishing Gradients in Recurrent Networks 818
15.10 Supervised Training Framework for Recurrent Networks Using Nonlinear Sequential State Estimators 822
15.11 Computer Experiment: Dynamic Reconstruction of Mackey–Glass Attractor 829
15.12 Adaptivity Considerations 831
15.13 Case Study: Model Reference Applied to Neurocontrol 833
15.14 Summary and Discussion 835
Notes and References 839
Problems 842

Bibliography 845
Index 889


Preface

In writing this third edition of a classic book, I have been guided by the same underlying philosophy of the first edition of the book:

Write an up-to-date treatment of neural networks in a comprehensive, thorough, and read- able manner.

The new edition has been retitled Neural Networks and Learning Machines, in order to reflect two realities:

1. The perceptron, the multilayer perceptron, self-organizing maps, and neurodynamics, to name a few topics, have always been considered integral parts of neural networks, rooted in ideas inspired by the human brain.

2. Kernel methods, exemplified by support-vector machines and kernel principal-components analysis, are rooted in statistical learning theory.

Although, indeed, they share many fundamental concepts and applications, there are some subtle differences between the operations of neural networks and learning machines. The underlying subject matter is therefore much richer when they are studied together, under one umbrella, particularly so when

• ideas drawn from neural networks and machine learning are hybridized to perform improved learning tasks beyond the capability of either one operating on its own, and

• ideas inspired by the human brain lead to new perspectives wherever they are of particular importance.

Moreover, the scope of the book has been broadened to provide detailed treatments of dynamic programming and sequential state estimation, both of which have affected the study of reinforcement learning and supervised learning, respectively, in significant ways.

Organization of the Book

The book begins with an introductory chapter that is motivational, paving the way for the rest of the book which is organized into six parts as follows:

1. Chapters 1 through 4, constituting the first part of the book, follow the classical approach on supervised learning. Specifically,


• Chapter 1 describes Rosenblatt’s perceptron, highlighting the perceptron convergence theorem, and the relationship between the perceptron and the Bayesian classifier operating in a Gaussian environment (a minimal sketch of the perceptron learning rule follows this list).

• Chapter 2 describes the method of least squares as a basis for model building.

The relationship between this method and Bayesian inference for the special case of a Gaussian environment is established. This chapter also includes a discussion of the minimum description length (MDL) principle for model selection.

• Chapter 3 is devoted to the least-mean-square (LMS) algorithm and its convergence analysis. The theoretical framework of the analysis exploits two principles: Kushner’s direct method and the Langevin equation (well known in nonequilibrium thermodynamics).
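To make the fixed-increment rule behind the perceptron convergence theorem concrete, here is a minimal sketch in Python with NumPy; the function name and interface are illustrative assumptions, not code from the book:

```python
import numpy as np

def train_perceptron(X, d, eta=1.0, max_epochs=100):
    """Fixed-increment perceptron learning for labels d in {-1, +1}."""
    Xb = np.hstack([np.ones((len(X), 1)), X])   # fold the bias in as x0 = +1
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(Xb, d):
            y = 1.0 if w @ x >= 0 else -1.0     # hard limiter
            if y != target:
                w += eta * target * x           # update only on a mistake
                errors += 1
        if errors == 0:                          # all patterns classified
            break
    return w
```

For linearly separable patterns (such as the well-separated double-moon case used in the computer experiments), the convergence theorem guarantees that this loop terminates after a finite number of updates.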

These three chapters, though different in conceptual terms, share a common feature: They are all based on a single computational unit. Most importantly, they provide a great deal of insight into the learning process in their own individual ways—a feature that is exploited in subsequent chapters.

Chapter 4, on the multilayer perceptron, is a generalization of Rosenblatt’s perceptron. This rather long chapter covers the following topics:

• the back-propagation algorithm, its virtues and limitations, and its role as an optimum method for computing partial derivatives;

• optimal annealing and adaptive control of the learning rate;

• cross-validation;

• convolutional networks, inspired by the pioneering work of Hubel and Wiesel on visual systems;

• supervised learning viewed as an optimization problem, with attention focused on conjugate-gradient methods, quasi-Newton methods, and the Marquardt–Levenberg algorithm;

• nonlinear filtering;

• last, but by no means least, a contrasting discussion of small-scale versus large-scale learning problems.

2. The next part of the book, consisting of Chapters 5 and 6, discusses kernel methods based on radial-basis function (RBF) networks.

In a way, Chapter 5 may be viewed as an insightful introduction to kernel methods. Specifically, it does the following:

• presents Cover’s theorem as theoretical justification for the architectural structure of RBF networks;

• describes a relatively simple two-stage hybrid procedure for supervised learning, with stage 1 based on the idea of clustering (namely, the K-means algorithm) for computing the hidden layer, and stage 2 using the LMS or the method of least squares for computing the linear output layer of the network (a sketch of this two-stage procedure follows this list);

• presents kernel regression and examines its relation to RBF networks.
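A minimal sketch of that two-stage procedure, in Python with NumPy; the interface, the Gaussian hidden units, and the single shared width sigma are simplifying assumptions:

```python
import numpy as np

def train_rbf(X, d, k=10, sigma=1.0, iters=50, seed=0):
    """Stage 1: K-means for the hidden-layer centers.
    Stage 2: least squares for the linear output layer."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):                       # stage 1: K-means clustering
        labels = np.argmin(((X[:, None, :] - centers[None])**2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # stage 2: Gaussian hidden layer, then least squares on the output weights
    Phi = np.exp(-((X[:, None, :] - centers[None])**2).sum(-1) / (2 * sigma**2))
    Phi = np.hstack([np.ones((len(X), 1)), Phi])  # bias column
    w, *_ = np.linalg.lstsq(Phi, d, rcond=None)
    return centers, w
```

The book’s Chapter 5 uses the LMS algorithm as one option for the second stage; ordinary least squares is used here purely for brevity.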

Chapter 6 is devoted to support vector machines (SVMs), which are commonly recognized as a method of choice for supervised learning. Basically, the SVM is a binary classifier, in the context of which the chapter covers the following topics:

(14)

Preface xiii

• the condition for defining the maximum margin of separation between a pair of linearly separable binary classes;

• quadratic optimization for finding the optimal hyperplane when the two classes are linearly separable and when they are not;

• the SVM viewed as a kernel machine, including discussions of the kernel trick and Mercer’s theorem;

• the design philosophy of SVMs;

• the ε-insensitive loss function and its role in the optimization of regression problems (a one-line implementation follows this list);

• the Representer Theorem, and the roles of Hilbert space and reproducing kernel Hilbert space (RKHS) in its formulation.
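For reference, the ε-insensitive loss is simple enough to state in one line of Python (NumPy assumed): errors of magnitude at most ε cost nothing, and larger errors grow only linearly, which is what gives SVM regression its robustness to small noise.

```python
import numpy as np

def eps_insensitive_loss(d, y, eps=0.1):
    """Vapnik's epsilon-insensitive loss between desired d and predicted y."""
    return np.maximum(np.abs(d - y) - eps, 0.0)
```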

From this description, it is apparent that the underlying theory of support vector machines is built on a strong mathematical background—hence their computational strength as an elegant and powerful tool for supervised learning.

3. The third part of the book involves a single chapter, Chapter 7. This broadly based chapter is devoted to regularization theory, which is at the core of machine learning. The following topics are studied in detail:

• Tikhonov’s classic regularization theory, which builds on the RKHS discussed in Chapter 6. This theory embodies some profound mathematical concepts: the Fréchet differential of the Tikhonov functional, the Riesz representation theorem, the Euler–Lagrange equation, Green’s function, and multivariate Gaussian functions;

• generalized RBF networks and their modification for computational tractability;

• the regularized least-squares estimator, revisited in light of the Representer Theorem;

• estimation of the regularization parameter, using Wahba’s concept of generalized cross-validation;

• semisupervised learning, using labeled as well as unlabeled examples;

• differentiable manifolds and their role in manifold regularization—a role that is basic to designing semisupervised learning machines;

• spectral graph theory for finding a Gaussian kernel in an RBF network used for semisupervised learning;

• a generalized Representer Theorem for dealing with semisupervised kernel machines;

• the Laplacian regularized least-squares (LapRLS) algorithm for computing the linear output layer of the RBF network; here, it should be noted that when the intrinsic regularization parameter (responsible for the unlabeled data) is reduced to zero, the algorithm is correspondingly reduced to the ordinary least-squares algorithm.

This highly theoretical chapter is of profound practical importance. First, it provides the basis for the regularization of supervised-learning machines. Second, it lays down the groundwork for designing regularized semisupervised learning machines.

4. Chapters 8 through 11 constitute the fourth part of the book, dealing with unsupervised learning. Beginning with Chapter 8, four principles of self-organization, intuitively motivated by neurobiological considerations, are presented:

(15)

(i) Hebb’s postulate of learning for self-amplification;

(ii) Competition among the synapses of a single neuron or a group of neurons for limited resources;

(iii) Cooperation among the winning neuron and its neighbors;

(iv) Structural information (e.g., redundancy) contained in the input data.

The main theme of the chapter is threefold:

• Principles (i), (ii), and (iv) are applied to a single neuron, in the course of which Oja’s rule for maximum eigenfiltering is derived; this is a remarkable result obtained through self-organization, which involves bottom-up as well as top-down learning (a sketch of Oja’s rule follows this list). Next, the idea of maximum eigenfiltering is generalized to principal-components analysis (PCA) on the input data for the purpose of dimensionality reduction; the resulting algorithm is called the generalized Hebbian algorithm (GHA).

• Basically, PCA is a linear method, the computing power of which is therefore limited to second-order statistics. In order to deal with higher-order statistics, the kernel method is applied to PCA in a manner similar to that described in Chapter 6 on support vector machines, but with one basic difference: unlike SVM, kernel PCA is performed in an unsupervised manner.

• Unfortunately, in dealing with natural images, kernel PCA can become unmanageable in computational terms. To overcome this computational limitation, GHA and kernel PCA are hybridized into a new on-line unsupervised learning algorithm called the kernel Hebbian algorithm (KHA), which finds applications in image denoising.
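As a sketch of the first item above, here is Oja’s rule for a single linear neuron in Python with NumPy (the interface and the constant step size are simplifying assumptions; the book derives the rule itself):

```python
import numpy as np

def oja_eigenfilter(X, eta=0.01, epochs=100, seed=0):
    """Oja's rule: Hebbian growth plus a normalizing decay term."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:
            y = w @ x                     # neuron output
            w += eta * y * (x - y * w)    # Hebbian term minus decay
    return w
```

Under the usual small-step-size conditions, w converges to the principal eigenvector of the input correlation matrix, which is what makes the neuron a maximum eigenfilter.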

The development of KHA is an outstanding example of what can be accomplished when an idea from machine learning is combined with a complementary idea rooted in neural networks, producing a new algorithm that overcomes their respective practical limitations.

Chapter 9 is devoted to self-organizing maps (SOMs), the development of which follows the principles of self-organization described in Chapter 8. The SOM is a simple algorithm in computational terms, yet highly powerful in its built-in ability to construct organized topographic maps with several useful properties:

• spatially discrete approximation of the input space, responsible for data generation;

• topological ordering, in the sense that the spatial location of a neuron in the topographic map corresponds to a particular feature in the input (data) space;

• input–output density matching;

• input-data feature selection.

The SOM has been applied extensively in practice; the construction of contextual maps and hierarchical vector quantization are presented as two illustrative examples of the SOM’s computing power. What is truly amazing is that the SOM exhibits several interesting properties and solves difficult computational tasks, yet it lacks an objective function that could be optimized. To fill this gap and thereby provide the possibility of improved topographic mapping, the self-organizing map is kernelized. This is done by introducing an entropic function as the objective function to be maximized. Here again, we see the practical benefit of hybridizing ideas rooted in neural networks with complementary kernel-theoretic ones.
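A minimal sketch of one SOM update, showing the competition and cooperation steps on a lattice of neurons (NumPy assumed; the Gaussian neighborhood function and the interface are illustrative choices):

```python
import numpy as np

def som_step(W, grid, x, eta, sigma):
    """One update of a self-organizing map.

    W    : (K, m) weight vectors, one per lattice neuron
    grid : (K, 2) lattice coordinates of the neurons
    """
    winner = np.argmin(((W - x)**2).sum(axis=1))   # competition
    d2 = ((grid - grid[winner])**2).sum(axis=1)    # lattice distances
    h = np.exp(-d2 / (2 * sigma**2))               # cooperative neighborhood
    W += eta * h[:, None] * (x - W)                # adaptation toward x
    return W
```

In practice, eta and sigma are annealed over the ordering and convergence phases of training.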

Chapter 10 exploits principles rooted in Shannon’s information theory as tools for unsupervised learning. This rather long chapter begins by presenting a review of Shannon’s information theory, with particular attention given to the concepts of entropy, mutual information, and the Kullback–Leibler divergence (KLD).

The review also includes the concept of copulas, which, unfortunately, has been largely overlooked for several decades. Most importantly, the copula provides a measure of the statistical dependence between a pair of correlated random variables. In any event, focusing on mutual information as the objective function, the chapter establishes the following principles:

• The Infomax principle, which maximizes the mutual information between the input and output data of a neural system; Infomax is closely related to redundancy reduction.

• The Imax principle, which maximizes the mutual information between the single outputs of a pair of neural systems that are driven by correlated inputs.

• The Imin principle operates in a manner similar to the Imax principle, except that the mutual information between the pair of output random variables is minimized.

• The independent-components analysis (ICA) principle, which provides a powerful tool for the blind separation of a hidden set of statistically independent source signals. Provided that certain operating conditions are satisfied, the ICA principle affords the basis for deriving procedures for recovering the original source signals from a corresponding set of observables that are linearly mixed versions of the source signals. Two specific ICA algorithms are described:

(i) the natural-gradient learning algorithm, which, except for scaling and permutation, solves the ICA problem by minimizing the KLD between a parameterized probability density function and the corresponding factorial distribution (a batch form of this update is sketched after this list);

(ii) the maximum-entropy learning algorithm, which maximizes the entropy of a nonlinearly transformed version of the demixer output; this algorithm, commonly known as the Infomax algorithm for ICA, also exhibits scaling and permutation properties.
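A batch form of the natural-gradient ICA update, assuming zero-mean observations and a tanh score function appropriate for super-Gaussian sources (these modeling choices, and the interface, are illustrative assumptions):

```python
import numpy as np

def natural_gradient_ica(X, eta=0.01, epochs=200, seed=0):
    """Natural-gradient ICA: W <- W + eta * (I - E[phi(y) y^T]) W.

    X : (N, m) array of zero-mean, linearly mixed observations.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    W = np.eye(m) + 0.01 * rng.normal(size=(m, m))   # demixing matrix
    for _ in range(epochs):
        Y = X @ W.T                                   # current source estimates
        phi = np.tanh(Y)                              # assumed score function
        W += eta * (np.eye(m) - (phi.T @ Y) / len(X)) @ W
    return W
```

As the text notes, the recovered sources are subject to the usual scaling and permutation ambiguities.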

Chapter 10 also describes another important ICA algorithm, known as FastICA, which, as the name implies, is computationally fast. This algorithm maximizes a contrast function based on the concept of negentropy, which provides a measure of the non-Gaussianity of a random variable. Continuing with ICA, the chapter goes on to describe a new algorithm known as coherent ICA, the development of which rests on fusion of the Infomax and Imax principles via the use of the copula; coherent ICA is useful for extracting the envelopes of a mixture of amplitude-modulated signals. Finally, Chapter 10 introduces another concept rooted in Shannon’s information theory, namely, rate distortion theory, which is used to develop the last concept in the chapter: information bottleneck. Given the joint distribution of an input vector and a (relevant) output vector, the method is formulated as a constrained optimization problem in such a way that a tradeoff is created between two amounts of information, one pertaining to information contained in the bottleneck vector about the input and the other pertaining to information contained in the bottleneck vector about the output. The chapter then goes on to find an optimal manifold for data representation, using the information bottleneck method.

The final approach to unsupervised learning is described in Chapter 11, using stochastic methods that are rooted in statistical mechanics; the study of statistical mechanics is closely related to information theory. The chapter begins by reviewing the fundamental concepts of Helmholtz free energy and entropy (in a statistical mechanics sense), followed by the description of Markov chains. The stage is then set for describing the Metropolis algorithm for generating a Markov chain, the transition probabilities of which converge to a unique and stable distribution.
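A generic sketch of the Metropolis step for a user-supplied energy function (the Gaussian proposal and the interface are illustrative assumptions):

```python
import numpy as np

def metropolis(energy, x0, n_steps=10000, temp=1.0, step=0.5, seed=0):
    """Metropolis sampling: accept a proposed move with probability
    min(1, exp(-dE / T)), so the chain's stationary distribution is the
    Gibbs distribution exp(-E(x) / T)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        x_new = x + step * rng.normal(size=x.shape)   # symmetric proposal
        dE = energy(x_new) - energy(x)
        if dE <= 0 or rng.random() < np.exp(-dE / temp):
            x = x_new                                  # accept the move
        samples.append(x.copy())
    return np.array(samples)
```

Simulated annealing, discussed next, amounts to running this step while the temperature is gradually lowered.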

The discussion of stochastic methods is completed by describing simulated annealing for global optimization, followed by Gibbs sampling, which can be used as a special form of the Metropolis algorithm. With all this background on statistical mechanics at hand, the stage is set for describing the Boltzmann machine, which, in a historical context, was the first multilayer learning machine discussed in the literature. Unfortunately, the learning process in the Boltzmann machine is very slow, particularly when the number of hidden neurons is large—hence the lack of interest in its practical use. Various methods have been proposed in the literature to overcome the limitations of the Boltzmann machine. The most successful innovation to date is the deep belief net, which distinguishes itself in the clever way in which the following two functions are combined into a powerful machine:

• generative modeling, resulting from bottom-up learning on a layer-by-layer basis and without supervision;

• inference, resulting from top-down learning.

Finally, Chapter 11 describes deterministic annealing to overcome the excessive computational requirements of simulated annealing; the only problem with deterministic annealing is that it could get trapped in a local minimum.

5. Up to this point, the focus of attention in the book has been the formulation of algorithms for supervised learning, semisupervised learning, and unsupervised learning. Chapter 12, constituting the next part of the book all by itself, addresses reinforcement learning, in which learning takes place in an on-line manner as the result of an agent (e.g., robot) interacting with its surrounding environment. In reality, however, dynamic programming lies at the core of reinforcement learning.

Accordingly, the early part of Chapter 12 is devoted to an introductory treatment of Bellman’s dynamic programming, which is then followed by showing that the two widely used methods of reinforcement learning, temporal-difference (TD) learning and Q-learning, can be derived as special cases of dynamic programming (a one-step Q-learning update is sketched after the list below). Both TD-learning and Q-learning are relatively simple, on-line reinforcement learning algorithms that do not require knowledge of transition probabilities. However, their practical applications are limited to situations in which the dimensionality of the state space is of moderate size. In large-scale dynamic systems, the curse of dimensionality becomes a serious issue, making not only dynamic programming, but also its approximate forms, TD-learning and Q-learning, computationally intractable. To overcome this serious limitation, two indirect methods of approximate dynamic programming are described:

• a linear method called the least-squares policy evaluation (LSPE) algorithm, and

• a nonlinear method using a neural network (e.g., multilayer perceptron) as a universal approximator.
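For reference, the one-step Q-learning update mentioned above, in a tabular Python sketch (the dictionary representation and the interface are illustrative assumptions):

```python
def q_update(Q, s, a, r, s_next, actions_next, eta=0.1, gamma=0.9):
    """One Q-learning step on a tabular value function.

    Q            : dict mapping (state, action) pairs to values
    actions_next : iterable of actions available in state s_next
    """
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions_next),
                    default=0.0)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)  # temporal difference
    Q[(s, a)] = Q.get((s, a), 0.0) + eta * td_error
    return Q
```

Note that the update needs only sampled transitions (s, a, r, s'), not the transition probabilities themselves, which is the point made in the text.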

6. The last part of the book, consisting of Chapters 13, 14, and 15, is devoted to the study of nonlinear feedback systems, with an emphasis on recurrent neural networks:

(i) Chapter 13 studies neurodynamics, with particular attention given to the stability problem. In this context, the direct method of Lyapunov is described.

This method embodies two theorems, one dealing with stability of the system and the other dealing with asymptotic stability. At the heart of the method is a Lyapunov function, for which an energy function is usually found to be adequate. With this background theory at hand, two kinds of associative memory are described:

• the Hopfield model, the operation of which demonstrates that a complex system is capable of generating simple emergent behavior;

• the brain-state-in-a-box model, which provides a basis for clustering.

The chapter also discusses properties of chaotic processes and a regularized procedure for their dynamic reconstruction.

(ii) Chapter 14 is devoted to the Bayesian filter, which provides a unifying basis for sequential state estimation algorithms, at least in a conceptual sense. The findings of the chapter are summarized as follows:

• The classic Kalman filter for a linear Gaussian environment is derived with the use of the minimum mean-square-error criterion; in a problem at the end of the chapter, it is shown that the Kalman filter so derived is a special case of the Bayesian filter (one predict–correct cycle is sketched after this list);

• square-root filtering is used to overcome the divergence phenomenon that can arise in practical applications of the Kalman filter;

• the extended Kalman filter (EKF) is used to deal with dynamic systems whose nonlinearity is of a mild sort; the Gaussian assumption is maintained;

• the direct approximate form of the Bayesian filter is exemplified by a new filter called the cubature Kalman filter (CKF); here again, the Gaussian assumption is maintained;

• indirect approximate forms of the Bayesian filter are exemplified by particle filters, the implementation of which can accommodate nonlinearity as well as non-Gaussianity.
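One predict–correct cycle of the classic Kalman filter referred to in the first item above, sketched in Python with NumPy (the matrix names follow the usual state-space conventions; the interface is an assumption):

```python
import numpy as np

def kalman_step(x, P, z, A, H, Q, R):
    """One cycle of the Kalman filter for the linear Gaussian model
    x' = A x + w (process noise cov Q), z = H x + v (measurement noise cov R)."""
    # predict
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # correct
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)     # update with the innovation
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

The square-root filtering mentioned above replaces the covariance update with a numerically safer factored form; the predict–correct structure is unchanged.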

With the essence of Kalman filtering being that of a predictor–corrector, Chapter 14 goes on to describe the possible role of “Kalman-like filtering” in certain parts of the human brain.

(19)

The final chapter of the book, Chapter 15, studies dynamically driven recurrent neural networks. The early part of the chapter discusses different structures (models) for recurrent networks and their computing power, followed by two algorithms for the training of recurrent networks:

• back propagation through time, and

• real-time recurrent learning.

Unfortunately, both of these procedures, being gradient based, are likely to suffer from the so-called vanishing-gradients problem. To mitigate the problem, the use of nonlinear sequential state estimators is described at some length for the supervised training of recurrent networks in a rather novel manner. In this context, the advantages and disadvantages of the extended Kalman filter (simple, but derivative dependent) and the cubature Kalman filter (derivative free, but more complicated mathematically) as sequential state estimators for supervised learning are discussed. The emergence of adaptive behavior, unique to recurrent networks, and the potential benefit of using an adaptive critic to further enhance the capability of recurrent networks are also discussed in the chapter.

An important topic featuring prominently in different parts of the book is supervised learning and semisupervised learning applied to large-scale problems. The concluding remarks of the book assert that this topic is in its early stages of development; most importantly, a four-stage procedure is described for its future development.

Distinct Features of the Book

Over and above the broad scope and thorough treatment of the topics summarized under the organization of the book, distinctive features of the text include the following:

1. Chapters 1 through 7 and Chapter 10 include computer experiments involving the double-moon configuration for generating data for the purpose of binary classification. The experiments range from the simple case of linearly separable patterns to difficult cases of nonseparable patterns. The double-moon configuration, as a running example, is used all the way from Chapter 1 to Chapter 7, followed by Chapter 10, thereby providing an experimental means for studying and comparing the learning algorithms described in those eight chapters.

2. Computer experiments are also included in Chapter 8 on PCA, Chapter 9 on SOM and kernel SOM, and Chapter 15 on dynamic reconstruction of the Mackey–Glass attractor using the EKF and CKF algorithms.

3. Several case studies, using real-life data, are presented:

• Chapter 7 discusses the United States Postal Service (USPS) data for semisupervised learning using the Laplacian RLS algorithm;

• Chapter 8 examines how PCA is applied to handwritten digital data and describes the coding and denoising of images;

• Chapter 10 treats the analysis of natural images by using sparse-sensory coding and ICA;

• Chapter 13 presents dynamic reconstruction applied to the Lorenz attractor by using a regularized RBF network.

(20)

Preface xix

Chapter 15 also includes a section on the model reference adaptive control system as a case study.

4. Each chapter ends with notes and references for further study, followed by end-of-chapter problems that are designed to challenge, and therefore expand, the reader’s expertise.

The glossary at the front of the book has been expanded to include explanatory notes on the methodology used on matters dealing with matrix analysis and probability theory.

5. PowerPoint files of all the figures and tables in the book will be available to instructors and can be found at www.prenhall.com/haykin.

6. Matlab codes for all the computer experiments in the book are available on the Website of the publisher to all those who have purchased copies of the book. These are available to students at www.pearsonhighered.com/haykin.

7. The book is accompanied by a Manual that includes the solutions to all the end-of-chapter problems as well as computer experiments. The manual is available from the publisher, Prentice Hall, only to instructors who use the book as the recommended volume for a course, based on the material covered in the book.

Last, but by no means least, every effort has been expended to make the book error free and, most importantly, readable.

Simon Haykin
Ancaster, Ontario


Acknowledgments

I am deeply indebted to many renowned authorities on neural networks and learning machines around the world, who have provided invaluable comments on selected parts of the book:

Dr. Shun-ichi Amari, The RIKEN Brain Science Institute, Wako City, Japan
Dr. Susanne Becker, Department of Psychology, Neuroscience & Behaviour, McMaster University, Hamilton, Ontario, Canada
Dr. Dimitri Bertsekas, MIT, Cambridge, Massachusetts
Dr. Leon Bottou, NEC Laboratories America, Princeton, New Jersey
Dr. Simon Godsill, University of Cambridge, Cambridge, England
Dr. Geoffrey Gordon, Carnegie-Mellon University, Pittsburgh, Pennsylvania
Dr. Peter Grünwald, CWI, Amsterdam, the Netherlands
Dr. Geoffrey Hinton, Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
Dr. Timo Honkela, Helsinki University of Technology, Helsinki, Finland
Dr. Tom Hurd, Department of Mathematics and Statistics, McMaster University, Ontario, Canada
Dr. Eugene Izhikevich, The Neurosciences Institute, San Diego, California
Dr. Juha Karhunen, Helsinki University of Technology, Helsinki, Finland
Dr. Kwang In Kim, Max-Planck-Institut für Biologische Kybernetik, Tübingen, Germany
Dr. James Lo, University of Maryland at Baltimore County, Baltimore, Maryland
Dr. Klaus Müller, University of Potsdam and Fraunhofer Institut FIRST, Berlin, Germany
Dr. Erkki Oja, Helsinki University of Technology, Helsinki, Finland
Dr. Bruno Olshausen, Redwood Center for Theoretical Neuroscience, University of California, Berkeley, California
Dr. Danil Prokhorov, Toyota Technical Center, Ann Arbor, Michigan
Dr. Kenneth Rose, Electrical and Computer Engineering, University of California, Santa Barbara, California
Dr. Bernhard Schölkopf, Max-Planck-Institut für Biologische Kybernetik, Tübingen, Germany
Dr. Vikas Sindhwani, Department of Computer Science, University of Chicago, Chicago, Illinois
Dr. Sergios Theodoridis, Department of Informatics, University of Athens, Athens, Greece
Dr. Naftali Tishby, The Hebrew University, Jerusalem, Israel
Dr. John Tsitsiklis, Massachusetts Institute of Technology, Cambridge, Massachusetts
Dr. Marc Van Hulle, Katholieke Universiteit, Leuven, Belgium

Several photographs and graphs have been reproduced in the book with permissions provided by Oxford University Press and

Dr. Anthony Bell, Redwood Center for Theoretical Neuroscience, University of California, Berkeley, California
Dr. Leon Bottou, NEC Laboratories America, Princeton, New Jersey
Dr. Juha Karhunen, Helsinki University of Technology, Helsinki, Finland
Dr. Bruno Olshausen, Redwood Center for Theoretical Neuroscience, University of California, Berkeley, California
Dr. Vikas Sindhwani, Department of Computer Science, University of Chicago, Chicago, Illinois
Dr. Naftali Tishby, The Hebrew University, Jerusalem, Israel
Dr. Marc Van Hulle, Katholieke Universiteit, Leuven, Belgium

I thank them all most sincerely.

I am grateful to my graduate students:

1. Yanbo Xue, for his tremendous effort devoted to working on nearly all the computer experiments produced in the book, and also for reading the second page proofs of the book.

2. Karl Wiklund, for proofreading the entire book and making valuable comments for improving it.

3. Haran Arasaratnam, for working on the computer experiment dealing with the Mackey–Glass attractor.

4. Andreas Wendel (Graz University of Technology, Austria), while he was on leave at McMaster University, 2008.

I am grateful to Scott Disanno and Alice Dworkin of Prentice Hall for their support and hard work in the production of the book. Authorization of the use of color in the book by Marcia Horton is truly appreciated; the use of color has made a tremendous difference to the appearance of the book from cover to cover.

I am grateful to Jackie Henry of Aptara Corp. and her staff, including Donald E. Smith, Jr., the proofreader, for the production of the book. I also wish to thank Brian Baker and the copyeditor, Abigail Lin, at Write With, Inc., for their effort in copy-editing the manuscript of the book.

The tremendous effort by my Technical Coordinator, Lola Brooks, in typing several versions of the chapters in the book over the course of 12 months, almost nonstop, is gratefully acknowledged.

Last, but by no means least, I thank my wife, Nancy, for having allowed me the time and space, which I have needed over the last 12 months, almost nonstop, to complete the book in a timely fashion.

Simon Haykin


Abbreviations and Symbols

ABBREVIATIONS

AR autoregressive
BPTT back propagation through time
BM Boltzmann machine
BP back propagation
b/s bits per second
BSB brain-state-in-a-box
BSS blind source (signal) separation
CMM correlation matrix memory
CV cross-validation
DFA deterministic finite-state automata
EKF extended Kalman filter
EM expectation-maximization
FIR finite-duration impulse response
FM frequency-modulated (signal)
GCV generalized cross-validation
GHA generalized Hebbian algorithm
GSLC generalized sidelobe canceler
Hz hertz
ICA independent-components analysis
Infomax maximum mutual information
Imax variant of Infomax
Imin another variant of Infomax
KSOM kernel self-organizing map
KHA kernel Hebbian algorithm
LMS least-mean-square
LR likelihood ratio


LS least-squares
LS-TD least-squares, temporal-difference
LTP long-term potentiation
LTD long-term depression
LRT likelihood ratio test
MAP maximum a posteriori
MCA minor-components analysis
MCMC Markov chain Monte Carlo
MDL minimum description length
MIMO multiple input–multiple output
ML maximum likelihood
MLP multilayer perceptron
MRC model reference control
NARMA nonlinear autoregressive moving average
NARX nonlinear autoregressive with exogenous inputs
NDP neuro-dynamic programming
NW Nadaraya–Watson (estimator)
NWKR Nadaraya–Watson kernel regression
OBD optimal brain damage
OBS optimal brain surgeon
OCR optical character recognition
PAC probably approximately correct
PCA principal-components analysis
PF particle filter
pdf probability density function
pmf probability mass function
QP quadratic programming
RBF radial basis function
RLS recursive least-squares
RLS regularized least-squares
RMLP recurrent multilayer perceptron
RTRL real-time recurrent learning
SIMO single input–multiple output
SIR sequential importance resampling
SIS sequential importance sampling
SISO single input–single output
SNR signal-to-noise ratio
SOM self-organizing map
SRN simple recurrent network (also referred to as Elman’s recurrent network)


SVD singular value decomposition
SVM support vector machine
TD temporal difference
TDNN time-delay neural network
TLFN time-lagged feedforward network
VC Vapnik–Chervonenkis (dimension)
VLSI very-large-scale integration
XOR exclusive OR

IMPORTANT SYMBOLS

a action
a^T b inner product of vectors a and b
ab^T outer product of vectors a and b
$\binom{l}{m}$ binomial coefficient
A ∪ B union of A and B
B inverse of temperature
b_k bias applied to neuron k
cos(a, b) cosine of the angle between vectors a and b
c_{U,V}(u, v) probability density function of copula
D depth of memory
D_{f||g} Kullback–Leibler divergence between probability density functions f and g
D̃ adjoint of operator D
E energy function
E_i energy of state i in statistical mechanics
𝔼 statistical expectation operator
⟨E⟩ average energy
exp exponential
e_av average squared error, or sum of squared errors
e(n) instantaneous value of the sum of squared errors
e_total total sum of error squares
F free energy
f* subset (network) with minimum empirical risk
H Hessian (matrix)
H^{-1} inverse of Hessian H
i square root of -1, also denoted by j
I identity matrix
I Fisher’s information matrix
J mean-square error
J Jacobian (matrix)

P^{1/2} square root of matrix P
P^{T/2} transpose of square root of matrix P
P_{n,n-1} error covariance matrix in Kalman filter theory
k_B Boltzmann constant
log logarithm
L(w) log-likelihood function of weight vector w
l(w) log-likelihood function of weight vector w based on a single example
M_c controllability matrix
M_o observability matrix
n discrete time
p_i probability of state i in statistical mechanics
p_{ij} transition probability from state i to state j
P stochastic matrix
P(e|c) conditional probability of error e given that the input is drawn from class c
P_α^+ probability that the visible neurons of a Boltzmann machine are in state α, given that the network is in its clamped condition (i.e., positive phase)
P_α^- probability that the visible neurons of a Boltzmann machine are in state α, given that the network is in its free-running condition (i.e., negative phase)
r̂_x(j, k; n) estimate of autocorrelation function of x_j(n) and x_k(n)
r̂_{dx}(k; n) estimate of cross-correlation function of d(n) and x_k(n)
R correlation matrix of an input vector
t continuous time
T temperature
𝒯 training set (sample)
tr operator denoting the trace of a matrix
var variance operator
V(x) Lyapunov function of state vector x
v_j induced local field or activation potential of neuron j
w_o optimum value of synaptic weight vector
w_{kj} weight of synapse j belonging to neuron k
w* optimum weight vector
x̄ equilibrium value of state vector x
⟨x_j⟩ average of state x_j in a “thermal” sense
x̂ estimate of x, signified by the use of a caret (hat)
|x| absolute value (magnitude) of x
x* complex conjugate of x, signified by asterisk as superscript
||x|| Euclidean norm (length) of vector x
x^T transpose of vector x, signified by the superscript T
z^{-1} unit-time delay operator
Z partition function
δ_j(n) local gradient of neuron j at time n
Δw small change applied to weight w
∇ gradient operator

∇² Laplacian operator
∇_w J gradient of J with respect to w
∇ · F divergence of vector F
η learning-rate parameter
κ cumulant
π policy
θ_k threshold applied to neuron k (i.e., negative of bias b_k)
λ regularization parameter
λ_k kth eigenvalue of a square matrix
φ_k(·) nonlinear activation function of neuron k
∈ symbol for “belongs to”
∪ symbol for “union of”
∩ symbol for “intersection of”
* symbol for convolution
+ superscript symbol for pseudoinverse of a matrix
+ superscript symbol for updated estimate

Open and closed intervals

• The open interval (a, b) of a variable x signifies that a < x < b.
• The closed interval [a, b] of a variable x signifies that a ≤ x ≤ b.
• The closed-open interval [a, b) of a variable x signifies that a ≤ x < b; likewise for the open-closed interval (a, b], a < x ≤ b.

Minima and Maxima

• The symbol arg min_w f(w) signifies the minimum of the function f(w) with respect to the argument vector w.
• The symbol arg max_w f(w) signifies the maximum of the function f(w) with respect to the argument vector w.


GLOSSARY

NOTATIONS I: MATRIX ANALYSIS

Scalars: Italic lowercase symbols are used for scalars.

Vectors: Bold lowercase symbols are used for vectors.

A vector is defined as a column of scalars. Thus, the inner product of a pair of m-dimensional vectors, x and y, is written as

$$\mathbf{x}^T\mathbf{y} = [x_1, x_2, \ldots, x_m]\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = \sum_{i=1}^{m} x_i y_i$$

where the superscript T denotes matrix transposition. With the inner product being a scalar, we therefore have

$$\mathbf{y}^T\mathbf{x} = \mathbf{x}^T\mathbf{y}$$

Matrices: Bold uppercase symbols are used for matrices.

Matrix multiplication is carried out on a row multiplied by column basis. To illustrate, consider an m-by-k matrix X and a k-by-l matrix Y. The product of these two matrices yields the m-by-l matrix

Z = XY

More specifically, the ij-th component of matrix Z is obtained by multiplying the ith row of matrix X by the jth column of matrix Y, both of which are made up of k scalars.

The outer product of a pair of m-dimensional vectors, x and y, is written as xyT, which is an m-by-m matrix.

NOTATIONS II: PROBABILITY THEORY

Random variables: Italic uppercase symbols are used for random variables. The sample value (i.e., one-shot realization) of a random variable is denoted by the corresponding italic lowercase symbol. For example, we write X for a random variable and x for its sample value.

Random vectors: Bold uppercase symbols are used for random vectors. Similarly, the sample value of a random vector is denoted by the corresponding bold lowercase sym- bol. For example, we write X for a random vector and x for its sample value.

The probability density function (pdf) of a random variable X is thus denoted by p_X(x), which is a function of the sample value x; the subscript X is included as a reminder that the pdf pertains to random variable X.


Introduction

1. WHAT IS A NEURAL NETWORK?

Work on artificial neural networks, commonly referred to as “neural networks,” has been motivated right from its inception by the recognition that the human brain computes in an entirely different way from the conventional digital computer. The brain is a highly complex, nonlinear, and parallel computer (information-processing system). It has the capability to organize its structural constituents, known as neurons, so as to perform certain computations (e.g., pattern recognition, perception, and motor control) many times faster than the fastest digital computer in existence today. Consider, for example, human vision, which is an information-processing task. It is the function of the visual system to provide a representation of the environment around us and, more important, to supply the information we need to interact with the environment.

To be specific, the brain routinely accomplishes perceptual recognition tasks (e.g., recognizing a familiar face embedded in an unfamiliar scene) in approximately 100–200 ms, whereas tasks of much lesser complexity take a great deal longer on a powerful computer.

For another example, consider the sonar of a bat. Sonar is an active echolocation system. In addition to providing information about how far away a target (e.g., a flying insect) is, bat sonar conveys information about the relative velocity of the target, the size of the target, the size of various features of the target, and the azimuth and elevation of the target. The complex neural computations needed to extract all this information from the target echo occur within a brain the size of a plum. Indeed, an echolocating bat can pursue and capture its target with a facility and success rate that would be the envy of a radar or sonar engineer.

How, then, does a human brain or the brain of a bat do it? At birth, a brain already has considerable structure and the ability to build up its own rules of behavior through what we usually refer to as “experience.” Indeed, experience is built up over time, with much of the development (i.e., hardwiring) of the human brain taking place during the first two years from birth, but the development continues well beyond that stage.

A “developing” nervous system is synonymous with a plastic brain: Plasticity permits the developing nervous system to adapt to its surrounding environment. Just as plasticity appears to be essential to the functioning of neurons as information-processing units in the human brain, so it is with neural networks made up of artificial neurons. In its most general form, a neural network is a machine that is designed to model the way in which the brain performs a particular task or function of interest; the network is usually implemented by using electronic components or is simulated in software on a digital computer. In this book, we focus on an important class of neural networks that perform useful computations through a process of learning. To achieve good performance, neural networks employ a massive interconnection of simple computing cells referred to as “neurons” or “processing units.” We may thus offer the following definition of a neural network viewed as an adaptive machine1:

A neural network is a massively parallel distributed processor made up of simple processing units that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:

1. Knowledge is acquired by the network from its environment through a learning process.

2. Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.

The procedure used to perform the learning process is called a learning algorithm, the function of which is to modify the synaptic weights of the network in an orderly fashion to attain a desired design objective.
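In the spirit of this definition, a single processing unit and its synaptic weights can be sketched in a few lines of Python (NumPy assumed; the tanh activation is just one common choice, not the book's prescription):

```python
import numpy as np

def neuron(x, w, b, phi=np.tanh):
    """Model of a neuron: a weighted sum of the inputs plus a bias
    (the induced local field), passed through an activation function phi."""
    v = w @ x + b          # induced local field
    return phi(v)          # output of the neuron
```

A learning algorithm, in these terms, is any orderly procedure for adjusting w and b from experience.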

The modification of synaptic weights provides the traditional method for the design of neural networks. Such an approach is the closest to linear adaptive filter theory, which is already well established and successfully applied in many diverse fields (Widrow and Stearns, 1985; Haykin, 2002). However, it is also possible for a neural network to modify its own topology, which is motivated by the fact that neurons in the human brain can die and new synaptic connections can grow.

Benefits of Neural Networks

It is apparent that a neural network derives its computing power through, first, its massively parallel distributed structure and, second, its ability to learn and therefore generalize. Generalization refers to the neural network’s production of reasonable outputs for inputs not encountered during training (learning). These two information-processing capabilities make it possible for neural networks to find good approximate solutions to complex (large-scale) problems that are intractable. In practice, however, neural networks cannot provide the solution by working individually. Rather, they need to be integrated into a consistent system engineering approach. Specifically, a complex problem of interest is decomposed into a number of relatively simple tasks, and neural networks are assigned a subset of the tasks that match their inherent capabilities. It is important to recognize, however, that we have a long way to go (if ever) before we can build a computer architecture that mimics the human brain.

Neural networks offer the following useful properties and capabilities:

1. Nonlinearity. An artificial neuron can be linear or nonlinear. A neural network, made up of an interconnection of nonlinear neurons, is itself nonlinear. Moreover, the nonlinearity is of a special kind in the sense that it is distributed throughout the network. Nonlinearity is a highly important property, particularly if the underlying physical mechanism responsible for generation of the input signal (e.g., speech signal) is inherently nonlinear.

2. Input–Output Mapping. A popular paradigm of learning, called learning with a teacher, or supervised learning, involves modification of the synaptic weights of a neural network by applying a set of labeled training examples, or task examples. Each example consists of a unique input signal and a corresponding desired (target) response. The network is presented with an example picked at random from the set, and the synaptic weights (free parameters) of the network are modified to minimize the difference between the desired response and the actual response of the network produced by the input signal in accordance with an appropriate statistical criterion. The training of the network is repeated for many examples in the set until the network reaches a steady state where there are no further significant changes in the synaptic weights. The previously applied training examples may be reapplied during the training session, but in a different order. Thus the network learns from the examples by constructing an input–output mapping for the problem at hand. Such an approach brings to mind the study of nonparametric statistical inference, which is a branch of statistics dealing with model-free estimation, or, from a biological viewpoint, tabula rasa learning (Geman et al., 1992); the term “nonparametric” is used here to signify the fact that no prior assumptions are made on a statistical model for the input data. Consider, for example, a pattern-classification task, where the requirement is to assign an input signal representing a physical object or event to one of several prespecified categories (classes). In a nonparametric approach to this problem, the requirement is to “estimate” arbitrary decision boundaries in the input signal space for the pattern-classification task using a set of examples, and to do so without invoking a probabilistic distribution model. A similar point of view is implicit in the supervised-learning paradigm, which suggests a close analogy between the input–output mapping performed by a neural network and nonparametric statistical inference.
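
The training procedure just described — random presentation, error-driven weight adjustment, repetition until the weights settle — can be sketched as follows for a single linear neuron; all names, parameter values, and the stopping rule are illustrative assumptions, and a practical network would use a richer model.

```python
import numpy as np

def train_supervised(examples, w, eta=0.05, tol=1e-6, max_epochs=1000):
    """Learning with a teacher: present labeled examples in random
    (reshuffled) order, reducing the desired-minus-actual error,
    and stop when the weights reach a steady state."""
    rng = np.random.default_rng(0)
    for _ in range(max_epochs):
        w_old = w.copy()
        for i in rng.permutation(len(examples)):
            x, d = examples[i]           # input signal, desired response
            y = np.dot(w, x)             # actual response
            w = w + eta * (d - y) * x    # error-correction adjustment
        if np.linalg.norm(w - w_old) < tol:  # no further significant change
            break
    return w
```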

3. Adaptivity. Neural networks have a built-in capability to adapt their synaptic weights to changes in the surrounding environment. In particular, a neural network trained to operate in a specific environment can be easily retrained to deal with minor changes in the operating environmental conditions. Moreover, when it is operating in a nonstationary environment (i.e., one where statistics change with time), a neural network may be designed to change its synaptic weights in real time. The natural architecture of a neural network for pattern classification, signal processing, and control applications, coupled with the adaptive capability of the network, makes it a useful tool in adaptive pattern classification, adaptive signal processing, and adaptive control. As a general rule, it may be said that the more adaptive we make a system, all the time ensuring that the system remains stable, the more robust its performance will likely be when the system is required to operate in a nonstationary environment. It should be emphasized, however, that adaptivity does not always lead to robustness; indeed, it may do the very opposite. For example, an adaptive system with short time constants may change rapidly and therefore tend to respond to spurious disturbances, causing a drastic degradation in system performance. To realize the full benefits of adaptivity, the principal time constants of the system should be long enough for the system to ignore spurious disturbances, and yet short enough to respond to meaningful changes in the environment; the problem described here is referred to as the stability–plasticity dilemma (Grossberg, 1988).
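
The time-constant trade-off can be seen in miniature with a one-parameter adaptive estimator; everything below (the signal, noise level, and learning rates) is an illustrative assumption, not an example from the text.

```python
import numpy as np

# Track a signal that changes abruptly halfway through, observed in noise.
# The update x_hat += eta * (x - x_hat) has an effective time constant of
# roughly 1/eta samples: a large eta adapts quickly but chases spurious
# disturbances, while a small eta ignores noise but lags behind real change.
rng = np.random.default_rng(0)
signal = np.concatenate([np.zeros(200), np.ones(200)])
observed = signal + 0.3 * rng.standard_normal(signal.size)

for eta in (0.5, 0.02):
    x_hat = 0.0
    track = []
    for x in observed:
        x_hat += eta * (x - x_hat)
        track.append(x_hat)
    jitter = np.std(track[100:200])  # wobble before the change occurs
    print(f"eta={eta}: final estimate {x_hat:.2f}, pre-change jitter {jitter:.3f}")
```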

4. Evidential Response. In the context of pattern classification, a neural network can be designed to provide information not only about which particular pattern to select, but also about the confidence in the decision made. This latter information may be used to reject ambiguous patterns, should they arise, and thereby improve the classification performance of the network.
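
One simple way to realize such an evidential response is to read the network’s outputs as class probabilities and reject the input when the winning probability is too low; the softmax conversion and the threshold value below are assumptions made for this sketch.

```python
import numpy as np

def classify_with_rejection(scores, threshold=0.9):
    """Convert raw class scores to probabilities (softmax) and
    reject the pattern if the top probability falls below threshold."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    k = int(np.argmax(p))
    return (k, p[k]) if p[k] >= threshold else (None, p[k])  # None = rejected

# Two nearly tied scores give low confidence, so the pattern is rejected:
print(classify_with_rejection(np.array([2.0, 1.8, 0.1])))  # (None, ~0.51)
```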

5. Contextual Information. Knowledge is represented by the very structure and activation state of a neural network. Every neuron in the network is potentially affected by the global activity of all other neurons in the network. Consequently, contextual information is dealt with naturally by a neural network.

6. Fault Tolerance. A neural network, implemented in hardware form, has the potential to be inherently fault tolerant, or capable of robust computation, in the sense that its performance degrades gracefully under adverse operating conditions.

For example, if a neuron or its connecting links are damaged, recall of a stored pattern is impaired in quality. However, due to the distributed nature of information stored in the network, the damage has to be extensive before the overall response of the network is degraded seriously. Thus, in principle, a neural network exhibits a graceful degradation in performance rather than catastrophic failure. There is some empirical evidence for robust computation, but usually it is uncontrolled. In order to be assured that the neural network is, in fact, fault tolerant, it may be necessary to take corrective measures in designing the algorithm used to train the network (Kerlirzin and Vallet, 1993).
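
Graceful degradation can be probed directly in simulation by destroying a growing fraction of the connections of a small two-layer network and watching how far the output drifts; the random network, its sizes, and the damage fractions below are purely illustrative assumptions.

```python
import numpy as np

def damage(w, fraction, rng):
    """Simulate faults by zeroing a random fraction of the weights."""
    w = w.copy()
    idx = rng.choice(w.size, size=int(fraction * w.size), replace=False)
    w.flat[idx] = 0.0
    return w

rng = np.random.default_rng(0)
W1 = rng.standard_normal((20, 10))   # hidden-layer weights
w2 = rng.standard_normal(20)         # output-layer weights
x = rng.standard_normal(10)
y_intact = w2 @ np.tanh(W1 @ x)

for f in (0.05, 0.20, 0.50):
    y = damage(w2, f, rng) @ np.tanh(damage(W1, f, rng) @ x)
    print(f"{int(f * 100):2d}% of weights destroyed: |output change| = {abs(y - y_intact):.3f}")
```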

7. VLSI Implementability. The massively parallel nature of a neural network makes it potentially fast for the computation of certain tasks. This same feature makes a neural network well suited for implementation using very-large-scale-integrated (VLSI) technology. One particularly beneficial virtue of VLSI is that it provides a means of capturing truly complex behavior in a highly hierarchical fashion (Mead, 1989).

8. Uniformity of Analysis and Design. Basically, neural networks enjoy universality as information processors. We say this in the sense that the same notation is used in all domains involving the application of neural networks. This feature manifests itself in different ways:

• Neurons, in one form or another, represent an ingredient common to all neural networks.

• This commonality makes it possible to share theories and learning algorithms in different applications of neural networks.

• Modular networks can be built through a seamless integration of modules.

9. Neurobiological Analogy. The design of a neural network is motivated by analogy with the brain, which is living proof that fault-tolerant parallel processing is not only physically possible, but also fast and powerful. Neurobiologists look to (artificial) neural networks as a research tool for the interpretation of neurobiological phenomena. On the other hand, engineers look to neurobiology for new ideas to solve problems more complex than those based on conventional hardwired design techniques. These two viewpoints are illustrated by the following two respective examples:

• In Anastasio (1993), linear system models of the vestibulo-ocular reflex (VOR) are compared to neural network models based on recurrent networks, which are described in Section 6 and discussed in detail in Chapter 15. The vestibulo-ocular reflex is part of the oculomotor system. The function of the VOR is to maintain visual (i.e., retinal) image stability by making eye rotations that are opposite to head rotations. The VOR is mediated by premotor neurons in the vestibular nuclei that receive and process head-rotation signals from vestibular sensory neurons and send the results to the eye-muscle motor neurons. The VOR is well suited for modeling because its input (head rotation) and its output (eye rotation) can be precisely specified. It is also a relatively simple reflex, and the neurophysiological properties of its constituent neurons have been well described. Among the three neural types, the premotor neurons (reflex interneurons) in the vestibular nuclei are the most complex and therefore the most interesting. The VOR has previously been modeled using lumped, linear system descriptors and control theory. These models were useful in explaining some of the overall properties of the VOR, but gave little insight into the properties of its constituent neurons. This situation has been greatly improved through neural network modeling. Recurrent network models of the VOR (programmed using an algorithm called real-time recurrent learning, described in Chapter 15) can reproduce and help explain many of the static, dynamic, nonlinear, and distributed aspects of signal processing by the neurons that mediate the VOR, especially the vestibular nuclei neurons.

• The retina, more than any other part of the brain, is where we begin to put together the relationships between the outside world represented by a visual sense, its physical image projected onto an array of receptors, and the first neural images. The retina is a thin sheet of neural tissue that lines the posterior hemisphere of the eyeball. The retina’s task is to convert an optical image into a neural image for transmission down the optic nerve to a multitude of centers for further analysis. This is a complex task, as evidenced by the synaptic organization of the retina. In all vertebrate retinas, the transformation from optical to neural image involves three stages (Sterling, 1990):

(i) phototransduction by a layer of receptor neurons;

(ii) transmission of the resulting signals (produced in response to light) by chemical synapses to a layer of bipolar cells;

(iii) transmission of these signals, also by chemical synapses, to output neurons that are called ganglion cells.

At both synaptic stages (i.e., from receptor to bipolar cells, and from bipolar to ganglion cells), there are specialized laterally connected neurons called horizontal cells and amacrine cells, respectively. The task of these neurons is to modify the transmission across the synaptic layers. There are also centrifugal elements called interplexiform cells; their task is to convey signals from the inner synaptic layer back to the outer one. Some researchers have built electronic chips that mimic the structure of the retina. These electronic chips are called neuromorphic integrated circuits, a term coined by Mead (1989). A neuromorphic imaging sensor consists of an array of photoreceptors combined with analog circuitry in such a way as to emulate the retina.
