Bayesian-based PLS algorithm

Chapter 3. Bayesian-based PLS

3.2. Bayesian-based PLS algorithm

Our approach is to consider different kinds of probability density functions and to look for the relation between maximum covariance and minimum sum of squared errors under them. In the original three-layer ANN architecture for PLS, the algorithm searches for the maximum variance between the input and hidden layers and for the minimum total squared error between the hidden and output layers. Here we define the total data misfit function as:

$$M = \alpha E_D + \beta E_W \tag{3-1}$$

E_D is the residual squared-error function and E_W is commonly referred to as a regularizing function. To find the optimal value of the regularization parameter λ, we adopt a Bayesian interpretation and calculate the best value of λ using the evidence procedure, an approximate Bayesian scheme reviewed by MacKay [4]. Using Bayes' rule, the posterior probability of the parameter w is given by formula (2-8). To find the value of λ, we now evaluate the evidence. We define the probability of the data given the parameter w and a prior probability on the parameter w; if α and β are known, these two together determine the posterior probability of w. From these formulas, and given some hypotheses, we evaluate the evidence for α and β.
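For orientation, the densities named above can be written in the standard form of the evidence framework [4]. This is a sketch under stated assumptions, not a recovery of the original displayed equations: it assumes E_D and E_W are half sums of squares (of the residuals and of the weights, respectively), Gaussian noise and a Gaussian prior, n data points, and k weights (the symbol k and the normalizers Z_D, Z_W, Z_M are introduced here for illustration).

```latex
\begin{align*}
P(D \mid w, \alpha) &= \frac{\exp(-\alpha E_D)}{Z_D(\alpha)}, &
  Z_D(\alpha) &= \left(\frac{2\pi}{\alpha}\right)^{n/2}, \\
P(w \mid \beta) &= \frac{\exp(-\beta E_W)}{Z_W(\beta)}, &
  Z_W(\beta) &= \left(\frac{2\pi}{\beta}\right)^{k/2}, \\
P(w \mid D, \alpha, \beta) &= \frac{\exp(-M(w))}{Z_M(\alpha, \beta)}, &
  M &= \alpha E_D + \beta E_W, \\
P(D \mid \alpha, \beta) &= \frac{Z_M(\alpha, \beta)}{Z_D(\alpha)\, Z_W(\beta)}.
\end{align*}
```

Under these assumptions the evidence is the ratio of normalizing constants, which is why its logarithm can be differentiated with respect to α and β in closed form.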

We then differentiate the log evidence (3-6) with respect to α and β to find the conditions that are satisfied at its maximum.

First, we differentiate the log evidence with respect to α and set the derivative to zero, which gives:

$$2 \alpha E_D = n - \alpha \, \mathrm{Trace}(A^{-1} B) = n - \gamma \tag{3-7}$$

Next, we differentiate with respect to β and set the derivative to zero. This yields condition (3-8), which relates the most probable value of β to γ.

According to (3-7) and (3-8), we rewrite the new error criterion as:

$$M = \alpha E_D + \beta E_W \approx e^{\mathsf{T}} e + \lambda \, q^{\mathsf{T}} q \tag{3-9}$$

We can then find a suitable value of λ with an iterative procedure: starting from an initial value λ ≥ 0, the value of λ is updated at each iteration from the current estimates.
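The iterative re-estimation of λ can be illustrated with a minimal sketch. It assumes that condition (3-8) has the form 2βE_W = γ, symmetric to (3-7), so that λ = β/α = [γ/(n − γ)]·(E_D/E_W), which matches the form of the re-estimation used in the evidence procedure [4]. For simplicity the sketch applies the update to ordinary ridge regression, not to the PLS hidden layer. The function name `evidence_lambda`, the use of the weight vector `w` in place of q, and the expression γ = Σ s_i/(s_i + λ) over the eigenvalues s_i of XᵀX are assumptions made for illustration.

```python
import numpy as np

def evidence_lambda(X, y, lam=1.0, n_iter=50, tol=1e-8):
    """Re-estimate the regularization parameter lam for ridge regression.

    Assumed update (not taken verbatim from the text):
        lam <- gamma / (n - gamma) * (e^T e) / (w^T w)
    """
    n, k = X.shape
    gram = X.T @ X
    # Eigenvalues of X^T X; clipped to guard against tiny negative round-off.
    s = np.clip(np.linalg.eigvalsh(gram), 0.0, None)
    for _ in range(n_iter):
        # Regularized least-squares solution for the current lam.
        w = np.linalg.solve(gram + lam * np.eye(k), X.T @ y)
        e = y - X @ w
        # Effective number of well-determined parameters.
        gamma = np.sum(s / (s + lam))
        new_lam = gamma / (n - gamma) * (e @ e) / (w @ w)
        converged = abs(new_lam - lam) <= tol * max(lam, 1.0)
        lam = new_lam
        if converged:
            break
    w = np.linalg.solve(gram + lam * np.eye(k), X.T @ y)
    return lam, w
```

Starting from any positive initial value, the loop alternates between solving the regularized problem and re-estimating λ, which mirrors the procedure described above: no separate validation set is needed to choose λ.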

Chapter 4 Experiments and Results

In this chapter, we present simulation results for a sigmoid function, a Gaussian-based spectrum, and data preprocessing. In our simulation experiments, Bayesian-based PLS and PRLS show nearly the same performance, and both perform better than the original PLS.

4.1 Illustration

4.1.1 Synthesized simulation data

In the simulation experiments, we use synthesized test data with added noise to examine the efficiency of our method. The noise is generated from a Gaussian probability density function with zero mean, and we set its standard deviation so as to vary the noise level. The noise-to-signal (N/S) ratio is used as a standard measure of this variation. Given a signal data set signal_i and a zero-mean Gaussian noise data set noise_i, 1 ≤ i ≤ n, we compute the mean and the variance of the signal data set and of the noise data set, and then the N/S ratio.
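The noise-generation step can be sketched as follows. The exact definition of the N/S ratio is not reproduced here, so the sketch assumes it is the ratio of the noise standard deviation to the signal standard deviation; the function names `add_gaussian_noise` and `noise_to_signal_ratio` are hypothetical.

```python
import numpy as np

def add_gaussian_noise(signal, ns_ratio, rng):
    """Add zero-mean Gaussian noise scaled to a target N/S ratio.

    Assumption: N/S ratio = std(noise) / std(signal).
    """
    sigma = ns_ratio * np.std(signal)
    noise = rng.normal(0.0, sigma, size=signal.shape)
    return signal + noise, noise

def noise_to_signal_ratio(signal, noise):
    """Empirical N/S ratio under the same assumed definition."""
    return np.std(noise) / np.std(signal)

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0.0, 2.0 * np.pi, 2000))
noisy, noise = add_gaussian_noise(signal, ns_ratio=0.2, rng=rng)
```

Varying `ns_ratio` then sweeps the noise level in the same way as the horizontal axis of the RMSE and correlation plots.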

4.1.2 Criterion of estimation

Here we use two familiar estimators to verify the performance of our method.

The first is the root mean square error (RMSE). RMSE is one of many ways to quantify the amount by which an estimator differs from the true value of the quantity being estimated, in the manner of a loss function. The second is the correlation coefficient, which indicates the strength and direction of a linear relationship between two variables. The correlation coefficient takes a value between -1 and 1. It measures how well trends in the predicted values follow trends in the actual values. The concept of the correlation coefficient is illustrated in Figure 4.1.

Figure 4.1 A sketch map of correlation coefficient

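Given a set of observations (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), the correlation coefficient is computed as:

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{S_x S_y} \tag{4-4}$$

where \bar{x} and \bar{y} are the sample means and S_x and S_y are the sample standard deviations of x and y.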

Figure 4.2 Root mean square error

Figure 4.2 shows that the main concept of RMSE is to calculate the average distance between the predicted and the desired output data. For accurate prediction, we want the RMSE to be as small as possible.
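Both criteria can be computed in a few lines. The sketch below follows formula (4-4) for the correlation coefficient; the RMSE formula is not written out in the text, so the usual definition (square root of the mean squared difference) is assumed. The function names are hypothetical.

```python
import numpy as np

def rmse(predicted, desired):
    """Root mean square error between predicted and desired outputs."""
    predicted = np.asarray(predicted, dtype=float)
    desired = np.asarray(desired, dtype=float)
    return np.sqrt(np.mean((predicted - desired) ** 2))

def correlation_coefficient(x, y):
    """Sample correlation coefficient as in formula (4-4)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    s_x = np.std(x, ddof=1)  # sample standard deviation (n - 1 divisor)
    s_y = np.std(y, ddof=1)
    return np.sum((x - x.mean()) * (y - y.mean())) / ((n - 1) * s_x * s_y)
```

A perfectly increasing linear relation gives a coefficient of 1 and a perfectly decreasing one gives -1, while RMSE is zero only when every prediction equals its desired value.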

4.1.3 Conditional training

We also calibrate the training data under two conditions: (1) self-calibration and self-prediction (SCSP) and (2) cross validation (CV). To make the difference between SCSP and CV easy to understand, we illustrate them with diagrams: Figure 4.3 shows the principle of SCSP and Figure 4.4 shows CV.

SCSP is the traditional training mode, in which the training data set is also the prediction data set. The result of SCSP is usually ideal if there is no noise hidden in the source data. However, data are usually accompanied by noise, and SCSP is then influenced by this hidden information, so the results may not be as desired.

CV as used here is also called the leave-one-out method, because we select one validation datum from the original training data set and repeat until each observation in the set has been used as validation data. The method helps to avoid overfitting but is computationally expensive. Next, we will compare the regularization technique and CV in simulation and real data experiments.
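The two training conditions can be contrasted with a short sketch. It uses ordinary least squares as a stand-in model (an assumption; the text applies the conditions to PLS, PRLS and Bayesian-based PLS), and the function names are hypothetical. The leave-one-out loop accumulates the prediction error sum of squares (PRESS).

```python
import numpy as np

def fit_ls(X, y):
    """Ordinary least-squares weights (stand-in for the calibration model)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def predict_ls(w, X):
    return X @ w

def scsp_sse(X, y, fit=fit_ls, predict=predict_ls):
    """SCSP: calibrate and predict on the same data; return the squared error sum."""
    residual = y - predict(fit(X, y), X)
    return float(residual @ residual)

def loo_press(X, y, fit=fit_ls, predict=predict_ls):
    """Leave-one-out CV: each observation is predicted by a model fitted without it."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        model = fit(X[keep], y[keep])
        residual = y[i] - predict(model, X[i:i + 1])[0]
        press += residual ** 2
    return float(press)
```

The loop refits the model n times, which is the computational cost mentioned above. For noisy data the PRESS value is never smaller than the SCSP error of the same least-squares model, which is why SCSP alone can look overly optimistic.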


Figure 4.4 Cross validation (CV)

4.2 Simulation data

In this section, we generate sigmoid function data and Gaussian-based spectrum data, and apply a preprocessing procedure, under the SCSP and CV conditions. We calibrate on these different kinds of data set. After prediction, we apply the estimation criteria to examine which of PLS, PRLS and Bayesian-based PLS performs best.

4.2.1 Sigmoid function

In this simulation, we use a combination of sine and cosine functions as the test function:

$$f(x_i) = a_i \sin(x_i) + b_i \cos(x_i), \qquad 0 \le x_i \le 2\pi \tag{4-5}$$

The training data were generated from f(x_i) + ε_i, where x_i was drawn from the uniform distribution on (0, 2π) and the noise ε_i had a Gaussian distribution with zero mean. The training data and the sigmoid function f(x_i) are plotted in Figure 4.5. The training data are highly ill-conditioned.
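A minimal sketch of this data generation follows. The values of the coefficients and of the noise standard deviation are not given in the text, so the sketch assumes constant coefficients `a` and `b` and a hypothetical default `noise_std`; the function name is also hypothetical.

```python
import numpy as np

def make_training_data(n, a=1.0, b=1.0, noise_std=0.1, seed=0):
    """Sample x uniformly on (0, 2*pi) and return x, f(x), and noisy targets.

    Assumptions: constant coefficients a and b, illustrative noise_std.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 2.0 * np.pi, size=n)
    f = a * np.sin(x) + b * np.cos(x)               # formula (4-5)
    y = f + rng.normal(0.0, noise_std, size=n)      # zero-mean Gaussian noise
    return x, f, y

x, f, y = make_training_data(100)
```

Fixing the seed makes the comparison between methods repeatable across noise levels.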

Figure 4.5 Noisy data (points) and sigmoid function (curve)

Figure 4.6 RMSE as a function of N/S ratio under SCSP (PRLS)

Figure 4.8 RMSE as a function of N/S ratio under SCSP (PLS)

Figure 4.9 Correlation coefficient as a function of N/S ratio under SCSP (PRLS)

Figure 4.10 Correlation coefficient as a function of N/S ratio under SCSP (Bayesian-based PLS)

Figure 4.12 Prediction error sum of squares (PRESS) under CV

4.2.2 Gaussian-based spectrum

We generate two Gaussian functions: g(x) with mean 510 and standard deviation 15, and h(x) with mean 540 and standard deviation 10. f(x) is the linear combination of g(x) and h(x), plotted in Figure 4.13.

Figure 4.13 The linear combination of two Gaussian functions with different mean and standard deviation.

The training data set X_i + ε is generated from the linear combination of g(x) and h(x) with added Gaussian noise ε. The training data set is shown in Figure 4.14.
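The construction of the Gaussian-based spectrum can be sketched as follows. The means and standard deviations come from the text; the combination coefficients, the axis range of 450 to 600, the unit peak height of each Gaussian, and the noise level are assumptions chosen only for illustration.

```python
import numpy as np

def gaussian(x, mean, std):
    """Gaussian bump with unit peak height (normalization is an assumption)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2)

def make_spectrum(x, c_g=1.0, c_h=1.0):
    """Linear combination f(x) = c_g * g(x) + c_h * h(x)."""
    g = gaussian(x, 510.0, 15.0)   # g(x): mean 510, standard deviation 15
    h = gaussian(x, 540.0, 10.0)   # h(x): mean 540, standard deviation 10
    return c_g * g + c_h * h

x = np.arange(450.0, 601.0)        # hypothetical axis, 450 to 600 in unit steps
f = make_spectrum(x)
rng = np.random.default_rng(0)
training = f + rng.normal(0.0, 0.05, size=f.shape)  # illustrative noise level
```

Changing `c_g` and `c_h` produces different mixtures of the two components, which is one plausible way to obtain a set of training spectra.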

Figure 4.14 The training data set of Gaussian-based spectrum

Next, we show the calibration results, which allow the performance of the three methods to be compared.

Figure 4.15 RMSE as a function of N/S ratio under SCSP (PRLS)

Figure 4.16 RMSE as a function of N/S ratio under SCSP (Bayesian-based PLS)

Figure 4.18 Correlation coefficient as a function of N/S ratio under SCSP (PRLS)

Figure 4.19 Correlation coefficient as a function of N/S ratio under SCSP (Bayesian-based PLS)

Figure 4.20 Correlation coefficient as a function of N/S ratio under SCSP (PLS)

Figure 4.21 Prediction error sum of squares (PRESS) under CV

From our experimental results, both Bayesian-based PLS and PRLS perform better than PLS, and under the SCSP condition the results of Bayesian-based PLS are close to those of PRLS. Figure 4.12 shows the CV result, in which Bayesian-based PLS is not better than PRLS. We think that this result might be influenced by the selection of the prior and by the data we simulated.

4.2.3 Preprocessing

In this part, we generate the original training data set from the uniform distribution on (-1, 1). We then use the tangent sigmoid function to transform the original training data set into a new one that is Gaussian-based. Our hypothesis concerns whether a training data set that is concentrated around the center or one that is more uniform gives better performance. The following figures illustrate this.

Figures 4.22 and 4.23 show the original training data set X and the new training data set X', which is transformed by the tangent function, respectively.

Figure 4.22 The original training data set X

Figure 4.24 RMSE as a function of FWHM under SCSP

Figure 4.25 Correlation coefficient as a function of FWHM under SCSP

Chapter 5 Discussion

Regarding regularization, almost all inverse problem methods involve a trade-off between two optimizations: agreement between data and solution, and smoothness of the solution. We take the best solution to lie between the unconstrained minimum of agreement and the unconstrained minimum of smoothness, as sketched in Figure 5.1. The open question is how to define or locate the best solution between the “Best smoothness” line and the “Best agreement” line.

Figure 5.1 Where is the best solution

The evaluation criteria, RMSE and correlation coefficient, may also involve a trade-off. In our experiments, we want a low RMSE and a high correlation coefficient to verify the proposed method, so this relationship itself needs verification. We assume that our proposed method and PLS may have different trade-off curves, as shown in Figure 5.2. In further study, we will seek a fundamental proof for this issue.

Figure 5.2 Trade-off curves of Bayesian-based PLS and PLS

For preprocessing, we transformed the original data set into Gaussian form to examine whether performance improves, using different FWHM widths to verify the proposed method. However, the hypothesis for data preprocessing clearly did not meet our expectation. The results after preprocessing might be influenced by the limitation of the tangent function: the data may diverge after the tangent transformation, which would affect the results.

The problem of local and global minima is another issue of concern. We would like to find the best solution that closely approximates the global minimum.

Chapter 6 Conclusions and Future works

6.1 Conclusions

We have established a probability-based analysis method that combines the advantages of regularization with the properties of PLS to form a novel calibration model.

The proposed method, Bayesian-based PLS, is able to reduce the noise hidden in the training data, and it gives better results than the original PLS method when the training data are accompanied by noise during the calibration phase. We can therefore apply our method to on-line analysis systems in further applications.

6.2 Future works

For data preprocessing, we may try other kinds of transfer function (e.g., the arcsine function) to address the data divergence problem and to overcome the limited accuracy of the transformation, so as to obtain better performance in further study. Tracking the best solution between agreement and smoothness is our next objective. We also intend to bring the results closer to the global minimum so that the proposed method can be applied to weight initialization of the back-propagation network. One more issue must be taken into account: the selection of an appropriate prior probably affects the results, so the prior probability needs further study to make sure that we do not use a bad or wrong one.

References

[1] Hsiao TC, Lin CW, Chiang HH, “Partial least squares algorithm for weights initialization of the back-propagation network”, Neurocomputing, vol. 50, pp. 237-247, 2003.

[2] Chen S, Chng ES, Alkadhimi K, “Regularized orthogonal least squares algorithm for constructing radial basis function networks”, International Journal of Control, vol. 64, pp. 829-837, 1996.

[3] Chang SH, Chiou YJ, Yu C, Lin CW, Hsiao TC, “A Novel Multivariate Analysis Method with Noise Reduction”, 4th European Congress for Medical and Biomedical Engineering, 2008.

[4] MacKay DJC, “Bayesian interpolation”, Neural Computation, vol. 4, pp. 415-447, 1992.

[5] Bhandare P, Mendelson Y, Peura RA, Janatsch G, Kruse-Jarres JD, Marbach R, Heise HM, “Multivariate determination of glucose in whole blood using partial least-squares and artificial neural networks based on mid-infrared spectroscopy”, Applied Spectroscopy, vol. 47, pp. 1214-1221, 1993.

[6] Möcks J, Verleger R, “Multivariate methods in biosignal analysis: application of principal component analysis to event-related”, Techniques in the behavioral and neural sciences, vol. 5, pp. 399-458, 1991.

[7] Castellanos G, Delgado E, Daza G, Sanchez LG, Suarez JF, “Feature Selection in Pathology Detection using Hybrid Multidimensional Analysis”, Proceedings of International Conference of EMBS, pp. 5950-5953, 2006.

[8] Oja E, “A simplified neuron model as a principal component analyzer”, Journal of Mathematical Biology, vol. 15, pp. 267-273, 1982.

[9] Harald M, Tormod N, “Multivariate Calibration”, 2nd Edition, John Wiley & Sons, Great Britain, 1996.

[10] Huang KY, “Neural Networks and Pattern Recognition”, 2nd Edition, 維科圖書有限公司, 2003.

[11] Oja E, Karhunen J, “Recursive construction of Karhunen-Loeve expansions for pattern recognition purposes”, Proceedings of 5th Int. Conf. on Pattern Recognition, pp. 1215-1218, 1980.

[12] Hsiao TC, Lin CW, Zeng MT, Chiang Kenny HH, “The Implementation of Partial Least Squares with Artificial Neural Network Architecture”, 20th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, vol. 3, pp. 1341-1343, 1998.

[13] Chen S, Cowan CFN, Grant PM, “Orthogonal least squares learning algorithm for radial basis function networks”, IEEE Transactions on Neural Networks, vol. 2, pp. 302-309, 1991.

[14] Press WH, Vetterling WT, Teukolsky SA, Flannery BP, “Numerical Recipes in C: the art of scientific computing”, 2nd Edition, Cambridge University Press, 1993.

[15] Orr MJL, “Regularization in the selection of radial basis function centers”, Neural Computation, vol. 7, pp. 606-623, 1995.

[16] Hertz J, Krogh A, Palmer R, “Introduction to the Theory of Neural Computation”, Redwood City, California, USA, Addison-Wesley, 1991.

[17] Ham FM, Kostanic I, “A Neural Network Architecture for Partial Least Squares Regression with Supervised Adaptive Modular Hebbian Learning”, Neural, Parallel, Scientific Computation, vol. 6, pp. 35-72, 1998.

[18] Jeffreys H, “Theory of Probability”, Oxford University Press, 1939.

[19] Gull SF, “Bayesian inductive inference and maximum entropy”, Maximum Entropy and Bayesian Methods in Science and Engineering, vol. 1, pp. 53-74, 1988.
