

2. the M-by-M identity matrix I, whose M diagonal elements are unity and whose remaining elements are all zero;

3. the time-averaged M-by-1 cross-correlation vector of the regressor x and the desired response d, which is defined by

$$\hat{\mathbf{r}}_{dx}(N) = \sum_{i=1}^{N} \mathbf{x}_i d_i \qquad (2.31)$$


The correlations $\hat{\mathbf{R}}_{xx}(N)$ and $\hat{\mathbf{r}}_{dx}(N)$ are both averaged over all N examples of the training sample t; hence the use of the term “time averaged.”

Suppose we assign a large value to the variance $\sigma_w^2$, which has the implicit effect of saying that the prior distribution of each element of the parameter vector w is essentially uniform over a wide range of possible values. Under this condition, the parameter $\lambda = \sigma^2/\sigma_w^2$ is essentially zero and the formula of Eq. (2.29) reduces to the ML estimate

$$\hat{\mathbf{w}}_{\mathrm{ML}}(N) = \hat{\mathbf{R}}_{xx}^{-1}(N)\,\hat{\mathbf{r}}_{dx}(N) \qquad (2.32)$$

which supports the point we made previously: The ML estimator relies solely on the observation model exemplified by the training sample t. In the statistics literature on linear regression, the equation

$$\hat{\mathbf{R}}_{xx}(N)\,\hat{\mathbf{w}}_{\mathrm{ML}}(N) = \hat{\mathbf{r}}_{dx}(N) \qquad (2.33)$$

is commonly referred to as the normal equation; the ML estimator is, of course, the solution of this equation. It is also of interest that the ML estimator is an unbiased estimator, in the sense that for an infinitely large training sample t, we find that, in the limit, $\hat{\mathbf{w}}_{\mathrm{ML}}$ converges to the parameter vector w of the unknown stochastic environment, provided that the regressor x(n) and the response d(n) are drawn from jointly ergodic processes, in which case time averages may be substituted for ensemble averages. Under this condition, in Problem 2.4, it is shown that

$$\lim_{N \to \infty} \hat{\mathbf{w}}_{\mathrm{ML}}(N) = \mathbf{w}$$

In contrast, the MAP estimator of Eq. (2.29) is a biased estimator, which therefore prompts us to make the following statement:

In improving the stability of the maximum likelihood estimator through the use of regularization (i.e., the incorporation of prior knowledge), the resulting maximum a posteriori estimator becomes biased.

In short, we have a tradeoff between stability and bias.
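The tradeoff can be illustrated numerically. The following is a minimal sketch, not the book's own experiment: it assumes that Eq. (2.29) has the form $[\hat{\mathbf{R}}_{xx}(N) + \lambda \mathbf{I}]^{-1}\hat{\mathbf{r}}_{dx}(N)$, that the correlations are the unnormalized sums over the N examples, and that the data come from a synthetic linear model. The function names, the true weight vector, the noise level, and the value lam = 50 are all hypothetical choices made for illustration.

```python
import numpy as np

def correlation_stats(X, d):
    """Return (R_xx, r_dx) for regressors X (N-by-M) and responses d (length N)."""
    R_xx = X.T @ X   # sum over i of x_i x_i^T (assumed unnormalized)
    r_dx = X.T @ d   # sum over i of x_i d_i  (assumed unnormalized)
    return R_xx, r_dx

def map_estimate(X, d, lam):
    """Solve [R_xx + lam*I] w = r_dx; lam = 0 gives the ML (normal-equation) solution."""
    R_xx, r_dx = correlation_stats(X, d)
    M = X.shape[1]
    return np.linalg.solve(R_xx + lam * np.eye(M), r_dx)

# Synthetic environment (illustrative assumption, not taken from the text).
rng = np.random.default_rng(0)
N, M = 200, 3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, M))
d = X @ w_true + 0.1 * rng.normal(size=N)

w_ml = map_estimate(X, d, lam=0.0)    # no prior: ML estimate
w_map = map_estimate(X, d, lam=50.0)  # arbitrary regularization strength

print("ML  estimate:", w_ml)
print("MAP estimate:", w_map)
print("norms (ML, MAP):", np.linalg.norm(w_ml), np.linalg.norm(w_map))
```

With lam = 0 the solution satisfies the normal equation exactly; with lam > 0 the estimate is pulled toward the origin, which is the bias that the regularizer trades for a better-conditioned linear system.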

2.4 RELATIONSHIP BETWEEN REGULARIZED LEAST-SQUARES ESTIMATION AND MAP ESTIMATION

We may approach the estimation of the parameter vector w in another way by focusing on a cost function $\mathcal{E}_0(\mathbf{w})$ defined as one-half the sum of the squared expectational errors over the N experimental trials on the environment. Specifically, we write

$$\mathcal{E}_0(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N} \varepsilon_i^2(\mathbf{w})$$

where we have included w in the argument of $\varepsilon_i$ to stress the fact that the uncertainty in the regression model is due to the vector w. Rearranging terms in Eq. (2.16), we obtain

$$\varepsilon_i(\mathbf{w}) = d_i - \mathbf{w}^T\mathbf{x}_i, \qquad i = 1, 2, \ldots, N \qquad (2.34)$$

Substituting this equation into the expression for $\mathcal{E}_0(\mathbf{w})$ yields

$$\mathcal{E}_0(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N} (d_i - \mathbf{w}^T\mathbf{x}_i)^2 \qquad (2.35)$$

which relies solely on the training sample t. Minimizing this cost function with respect to w yields a formula for the ordinary least-squares estimator that is identical to the maximum-likelihood estimator of Eq. (2.32), and hence there is a distinct possibility of obtaining a solution that lacks uniqueness and stability.

To overcome this serious problem, the customary practice is to expand the cost function of Eq. (2.35) by adding a new term as follows:

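$$\mathcal{E}(\mathbf{w}) = \mathcal{E}_0(\mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|^2 = \frac{1}{2}\sum_{i=1}^{N} (d_i - \mathbf{w}^T\mathbf{x}_i)^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2 \qquad (2.36)$$

This expression is identical to the function defined in Eq. (2.28). The inclusion of the squared Euclidean norm $\|\mathbf{w}\|^2$ is referred to as structural regularization. Correspondingly, the scalar $\lambda$ is referred to as the regularization parameter.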

When $\lambda \to 0$, the implication is that we have complete confidence in the observation model exemplified by the training sample t. At the other extreme, when $\lambda \to \infty$, the implication is that we have no confidence in the observation model. In practice, the regularization parameter $\lambda$ is chosen somewhere between these two limiting cases.

In any event, for a prescribed value of the regularization parameter $\lambda$, the solution of the regularized method of least squares, obtained by minimizing the regularized cost function of Eq. (2.36) with respect to the parameter vector w, is identical to the MAP estimate of Eq. (2.29). This particular solution is referred to as the regularized least-squares (RLS) solution.
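The equivalence can be verified in a few lines by setting the gradient of the regularized cost to zero. The derivation below is a sketch under two assumptions that are not spelled out in this passage: the cost is the sum of half the squared errors and $(\lambda/2)\|\mathbf{w}\|^2$, and the correlations are the unnormalized sums $\hat{\mathbf{R}}_{xx}(N) = \sum_i \mathbf{x}_i\mathbf{x}_i^T$ and $\hat{\mathbf{r}}_{dx}(N) = \sum_i \mathbf{x}_i d_i$.

```latex
\begin{align*}
\nabla_{\mathbf{w}}\,\mathcal{E}(\mathbf{w})
  &= -\sum_{i=1}^{N} \left(d_i - \mathbf{w}^T\mathbf{x}_i\right)\mathbf{x}_i + \lambda\mathbf{w} \\
  &= \Big(\sum_{i=1}^{N} \mathbf{x}_i\mathbf{x}_i^T\Big)\mathbf{w}
     - \sum_{i=1}^{N} \mathbf{x}_i d_i + \lambda\mathbf{w} \\
  &= \left[\hat{\mathbf{R}}_{xx}(N) + \lambda\mathbf{I}\right]\mathbf{w} - \hat{\mathbf{r}}_{dx}(N).
\end{align*}
% Setting the gradient to zero gives the regularized normal equation
\[
\left[\hat{\mathbf{R}}_{xx}(N) + \lambda\mathbf{I}\right]\hat{\mathbf{w}} = \hat{\mathbf{r}}_{dx}(N)
\quad\Longrightarrow\quad
\hat{\mathbf{w}} = \left[\hat{\mathbf{R}}_{xx}(N) + \lambda\mathbf{I}\right]^{-1}\hat{\mathbf{r}}_{dx}(N).
\]
```

For $\lambda > 0$ the matrix in brackets is positive definite and therefore invertible, which is the source of the improved stability; for $\lambda = 0$ the expression collapses to the normal equation of the ordinary least-squares problem.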

2.5 COMPUTER EXPERIMENT: PATTERN CLASSIFICATION

In this section, we repeat the computer experiment performed on the pattern-classification problem studied in Chapter 1, where we used the perceptron algorithm. As before, the double-moon structure, providing the training as well as the test data, is that shown in Fig. 1.8. This time, however, we use the method of least squares to perform the classification.
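A least-squares classifier of this kind can be sketched as follows. This is an illustrative reconstruction, not the code behind the book's figures: the moon radius (10), width (6), the sampling scheme, the sample sizes, the random seed, and all function names are assumptions, and the resulting error rate will differ from the figure reported in the text.

```python
import numpy as np

def double_moon(n_per_class, radius=10.0, width=6.0, d=1.0, rng=None):
    """Sample a double-moon data set; the geometry values are assumed, not from the text."""
    rng = np.random.default_rng() if rng is None else rng
    rho = rng.uniform(radius - width / 2, radius + width / 2, size=2 * n_per_class)
    theta = rng.uniform(0.0, np.pi, size=2 * n_per_class)
    x = rho * np.cos(theta)
    y = rho * np.sin(theta)
    # Second half becomes the lower moon: flip vertically, shift right by
    # `radius` and down by the separation distance `d`.
    x[n_per_class:] += radius
    y[n_per_class:] = -y[n_per_class:] - d
    points = np.column_stack([x, y])
    labels = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    return points, labels

def fit_least_squares(points, labels, lam=0.0):
    """Solve [R_xx + lam*I] w = r_dx with a bias input prepended to each point."""
    X = np.column_stack([np.ones(len(points)), points])
    R_xx = X.T @ X
    r_dx = X.T @ labels
    return np.linalg.solve(R_xx + lam * np.eye(X.shape[1]), r_dx)

def classify(w, points):
    """Assign +1 or -1 according to the sign of the linear output."""
    X = np.column_stack([np.ones(len(points)), points])
    return np.where(X @ w >= 0.0, 1.0, -1.0)

rng = np.random.default_rng(1)
train_x, train_d = double_moon(500, d=1.0, rng=rng)
test_x, test_d = double_moon(1000, d=1.0, rng=rng)

w = fit_least_squares(train_x, train_d)
error_rate = np.mean(classify(w, test_x) != test_d)
slope = -w[1] / w[2]   # boundary: w[0] + w[1]*x + w[2]*y = 0

print("weights:", w)
print("boundary slope:", slope)
print("test error rate:", error_rate)
```

Unlike the perceptron, this fit needs no iteration: the weight vector is obtained in a single linear solve, and setting lam above zero turns the same routine into the regularized variant.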

Figure 2.2 presents the results of training the least-squares algorithm for the separation distance between the two moons set at d = 1. The figure shows the decision boundary constructed between the two moons. The corresponding results obtained using the perceptron algorithm for the same setting d = 1 were presented in Fig. 1.9. Comparing these two figures, we make the following interesting observations:

1. The decision boundaries constructed by the two algorithms are both linear, which is intuitively satisfying. The least-squares algorithm discovers the asymmetric manner in which the two moons are positioned relative to each other, as seen by the positive slope of the decision boundary in Fig. 2.2. Interestingly enough, the perceptron algorithm completely ignores this asymmetry by constructing a decision boundary that is parallel to the x-axis.

2. For the separation distance d = 1, the two moons are linearly separable. The perceptron algorithm responds perfectly to this setting; on the other hand, in discovering the asymmetric feature of the double-moon figure, the method of least squares ends up misclassifying the test data, incurring a classification error of 0.8%.

3. Unlike the perceptron algorithm, the method of least squares computes the
