Thesis Organization - Partial Least Squares under a Bayesian Framework

Chapter 1. Introduction

1.4. Thesis Organization

The thesis is organized as follows. The first chapter gives an introduction and the motivation for the research. Chapter 2 describes the calibration models, Bayesian analysis, and Bayesian regularization used in this study.

Chapter 3 discusses the relationship between Bayesian regularization and PLS.

We then propose a novel calibration model, named Bayesian-based PLS, which combines PLS with Bayes’ rule and a regularization technique. Chapter 4 shows the results of the simulation experiments. Chapter 5 discusses these results. Conclusions and future work are given in the final chapter.

Chapter 2 Materials and Methods

2.1 Least Squares (LS)

The least squares (LS) method estimates the parameters of a model by finding the curve that best fits the given data. Classic LS regression minimizes the sum of squared residuals between the data and the model estimates. Suppose the linear model is given by $f(x_i) = a_0 + x_{i1}a_1 + x_{i2}a_2 + \cdots + x_{im}a_m$, $i = 1, 2, \ldots, n$. The LS method uses this model to approximate the given data set. The sum of squared errors (SSE) is calculated as below:

$$ SSE = \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 \tag{2-1} $$

Taking the partial derivative with respect to each $a_j$ and setting it to zero gives:

$$ \frac{\partial\, SSE}{\partial a_j} = -2 \sum_{i=1}^{n} \left( y_i - f(x_i) \right) x_{ij} = 0, \quad j = 1, 2, \ldots, m \tag{2-2} $$

The LS method can also be illustrated as a two-layer ANN architecture, shown in Figure 2.1. We transform the data set into matrix form. The matrix $X = [x_1\ x_2\ x_3\ \cdots\ x_m]$ represents the input data, where the $m$-th column is $x_m = [x_{1m}\ x_{2m}\ x_{3m}\ \cdots\ x_{nm}]^T$. The real output is $Y = [y_1\ y_2\ y_3\ \cdots\ y_n]^T$ and the weight coefficient vector is $a = [a_1\ a_2\ a_3\ \cdots\ a_m]^T$.

The LS procedure in matrix form is defined as：

$$ Y = Xa + \varepsilon \tag{2-3} $$

From (2-3), we calculate the weighting coefficients:

$$ X^T X a \approx X^T Y \tag{2-4} $$

$$ a \approx (X^T X)^{-1} (X^T Y) \tag{2-5} $$

Figure 2.1 Two-layer architecture of LS method
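The matrix form of the LS procedure can be sketched in a few lines of NumPy. This is a minimal illustration on hypothetical toy data, not code from the thesis: the function name `ls_fit` and the data values are assumptions, and the model has no intercept term so that it matches $Y = Xa + \varepsilon$ directly.

```python
import numpy as np

def ls_fit(X, Y):
    """Solve the normal equations X^T X a = X^T Y for a, as in (2-4) and (2-5)."""
    # Solving the linear system avoids forming the explicit inverse (X^T X)^{-1}.
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Hypothetical toy data: n = 5 samples, m = 2 input variables.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
a_true = np.array([0.5, -1.5])
Y = X @ a_true  # noise-free outputs, so LS should recover a_true

a_hat = ls_fit(X, Y)
print(a_hat)
```

With noisy outputs the same call returns the coefficients that minimize the sum of squared residuals rather than the exact generating values.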

2.2 Partial Least Squares (PLS)

PLS is one of the most widely used methods in biomedical spectroscopic analysis. It is a popular technique that generalizes and combines features from principal component analysis (PCA) and multiple regression. The purpose of PLS is to predict or analyze a set of dependent variables from a set of independent variables or predictors. PLS regression is mainly useful when we have to predict a set of dependent variables from a large set of independent variables. It is used to find the fundamental relations between two matrices ($X$ and $Y$), i.e. it is a latent variable approach to modeling the covariance structures in these two spaces. A PLS model tries to find the multidimensional direction in the $X$ space that explains the maximum multidimensional variance direction in the $Y$ space.

We illustrate the general underlying model of multivariate PLS as follows, and show the architecture of the multivariate system when PLS is treated as a three-layer ANN.

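The independent variable matrix $X_{n \times m}$ is decomposed into a matrix $T_{n \times a}$ with the corresponding weighting matrix $P_{a \times m}$, and the dependent variable matrix $Y_{n \times 1}$ is decomposed into the same matrix $T_{n \times a}$ with the corresponding weighting matrix $Q_{a \times 1}$. The mathematical form is as follows:

$$ X_{n \times m} = T_{n \times a} P_{a \times m} + E \tag{2-6} $$

$$ Y_{n \times 1} = Y^{(1)} + Y^{(2)} + \cdots + Y^{(a)} + F = t_1 q_1 + t_2 q_2 + \cdots + t_a q_a + F = T_{n \times a} Q_{a \times 1} + F \tag{2-7} $$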

From formulas (2-6) and (2-7), the mathematical relations used to compute PLS can be illustrated as in Figure 2.2, which shows the regression steps of the PLS decomposition.

Figure 2.2 The computational procedure of PLS

After the derivation, we find that the residual matrices $E_{n \times m}$ and $F_{n \times 1}$ are minimized in the course of decomposing the matrices $X$ and $Y$. The PLS procedure terminates when the number of iterations reaches $a$ ($a \le n$) or when the residual becomes smaller than a minimum threshold.

Ham [17] and Hsiao [12] proposed regarding PLS as a kind of artificial neural network. For this purpose, the transformation between the independent and dependent variables can be represented as a three-layer ANN architecture, shown in Figure 2.3. The PLS learning procedure is illustrated in Figure 2.4.

Figure 2.3 Three-layer architecture of PLS method

Figure 2.4 PLS learning flow chart
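The decomposition in (2-6) and (2-7) can be sketched for a single output column with a NIPALS-style PLS1 loop. This is a hedged illustration rather than the thesis's own implementation: the function name `pls1_fit`, the centering step, the fixed number of components, and the toy data are all assumptions made for the example.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 by successive deflation; returns coefficients b for centered data."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    m = X.shape[1]
    W = np.zeros((m, n_components))  # weight vectors (input to hidden layer)
    P = np.zeros((m, n_components))  # X loadings, the rows of P in X = TP + E
    q = np.zeros(n_components)       # y loadings, the entries of Q in Y = TQ + F
    for k in range(n_components):
        w = Xk.T @ yk                # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                   # score vector t_k
        tt = t @ t
        P[:, k] = Xk.T @ t / tt
        q[k] = yk @ t / tt
        W[:, k] = w
        Xk = Xk - np.outer(t, P[:, k])  # deflate X by t_k p_k
        yk = yk - q[k] * t              # deflate y by t_k q_k
    # Map the latent-space regression back to the original variables.
    return W @ np.linalg.solve(P.T @ W, q)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)

b_pls = pls1_fit(X, y, n_components=3)
print(b_pls)
```

On this toy data, using as many components as there are input variables makes the PLS1 coefficients coincide with the ordinary LS solution on the centered data. Using fewer components gives the reduced-rank model that PLS is normally used for.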

2.3 Bayesian analysis

The term Bayesian refers to methods in probability and statistics named after the Reverend Thomas Bayes. Bayesian methods for inductive inference were first developed in detail early in the twentieth century by the Cambridge geophysicist Sir Harold Jeffreys [18].

Bayesian inference is the statistical inference in which evidence or observations are used to update or to newly calculate the probability that a hypothesis might be true.

Bayesian inference uses a numerical estimate of the degree of belief in a hypothesis before evidence has been observed and calculates a numerical estimate of the degree of belief in the hypothesis after evidence has been observed. The fundamental concept of Bayesian analysis is that the plausibilities of alternative hypotheses are represented by probabilities, and inference is performed by evaluating those probabilities.

In the paper by David J. C. MacKay [4], the Bayesian approach to regularization and model comparison is clarified by studying the inference problem of interpolating noisy data. The concepts and methods described are quite general and can be applied to many other data modelling problems.

Following his study, we can examine the posterior probability distribution to set the regularization constants. The way in which Bayes infers the values of the regularization constants and noise levels has an elegant interpretation in terms of the effective number of parameters determined by the data set.

Two levels of inference are involved in the task of data modelling. Figure 2.5 shows where Bayesian inference fits into the data modelling process and illustrates an abstraction of the part of the scientific process in which data are collected and modelled. At the first level of inference, we assume that one of the models we created is true and fit that model to the data. The second level of inference is model comparison. The two double-framed boxes denote the two steps that involve inference, and it is only in these two steps that Bayes’ rule can be used. At the first level, Bayes’ rule may be used to find the most probable parameter values and the error bars on these parameters. The second inference task requires a quantitative Occam’s razor to penalise over-complex models. Bayes’ rule can assign objective preferences to the alternative models in a way that automatically and quantitatively embodies Occam’s razor [18][19]. Complex models are automatically self-penalizing under Bayes’ rule.

Figure 2.5 Data modeling process [4]

Model comparison is a difficult task because it is not possible simply to choose the model that fits the data best: a more complex model can always fit the data better. Occam’s razor is the principle that the explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference to the observable predictions of the explanatory hypothesis. A problem should be stated in its most basic and simplest form.

Figure 2.6 Why Bayes embodies Occam’s razor [4]

Figure 2.6 shows the intuition for why complex models are penalized. Bayes’ rule rewards models according to how well they predict the actual data. These predictions are quantified by a normalized probability distribution on data sets $D$, and this probability, $P(D \mid H_i)$, is known as the evidence for $H_i$. A simple model $H_1$ makes only a limited range of predictions, $P(D \mid H_1)$. A more powerful model $H_2$, which has more free parameters than $H_1$, is able to predict a larger variety of data sets. However, this means that $H_2$ cannot predict the data sets in region $C_1$ as strongly as $H_1$. Assume that the two models have been assigned equal prior probabilities. Then if the data set falls in region $C_1$, the less powerful model $H_1$ will be more probable than the model $H_2$.

Let us write down Bayes’ rule for the two levels of inference so that we can examine explicitly how Bayesian model comparison works.

Model fitting: At the first level of inference, we assume that one model $H_i$ is true and infer what the model’s parameter $w$ might be given the data $D$. Using Bayes’ rule, the posterior probability of the parameter $w$ is:

$$ P(w \mid D, H_i) = \frac{P(D \mid w, H_i)\, P(w \mid H_i)}{P(D \mid H_i)} \tag{2-8} $$

Model comparison: At the second level of inference, we infer which model is the most sensible given the data. The posterior probability of each model is defined as:

$$ P(H_i \mid D) \propto P(D \mid H_i)\, P(H_i) \tag{2-9} $$

Assuming that we have no reason to assign strongly differing priors $P(H_i)$ to the alternative models, the models $H_i$ are ranked by evaluating the evidence. New models are compared with previous ones by evaluating the evidence for them.

Let us explicitly study the evidence to gain insight into how the Bayesian Occam’s razor works. The evidence is the normalizing constant for equation (2-8)：

$$ P(D \mid H_i) = \int P(D \mid w, H_i)\, P(w \mid H_i)\, dw \tag{2-10} $$

Figure 2.7 shows the quantities that determine the Occam factor for a hypothesis $H_i$ having a single parameter $w$. The dotted line, which represents the prior distribution for the parameter, has width $\Delta^0 w$. The solid line, which represents the posterior distribution, has a single peak at $w_{MP}$ with characteristic width $\Delta w$. The Occam factor is $\Delta w / \Delta^0 w$.

Figure 2.7 The Occam factor [4]

Therefore the evidence is evaluated by taking the best fit likelihood and multiplying it by the Occam factor.

The quantity $\Delta w$ is the posterior uncertainty in $w$. Imagine for simplicity that the prior $P(w \mid H_i)$ is uniform on some large interval $\Delta^0 w$, representing the range of values of $w$ that $H_i$ thought possible before the data arrived. Then $P(w_{MP} \mid H_i) = 1 / \Delta^0 w$. The log of the Occam factor can be interpreted as the amount of information we gain about the model when the data arrive. Comparison of the evidence, $P(D \mid H_i)$, provides a purely objective way to rank hypotheses. Evaluating the evidence is an extension of maximum likelihood model selection: the best-fit likelihood is multiplied by the Occam factor. It is no more computationally difficult than finding the best-fit parameters. The Occam factor automatically penalizes a model that requires fine tuning of its parameters, and it favours models for which the required precision of the parameters is coarse.
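The relation between the evidence integral (2-10) and the product of best-fit likelihood and Occam factor can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the thesis: it uses a hypothetical one-parameter model (Gaussian data with unknown mean $w$ and known noise level), a uniform prior centred at zero, and takes the posterior width as $\Delta w = \sqrt{2\pi}\,\sigma/\sqrt{N}$. All function names and data values are made up for the example.

```python
import numpy as np

def likelihood(w, y, sigma):
    """P(D | w, H): i.i.d. Gaussian likelihood of data y with mean w, known sigma."""
    resid = y[:, None] - w[None, :]
    log_l = (-0.5 * np.sum(resid**2, axis=0) / sigma**2
             - y.size * np.log(sigma * np.sqrt(2.0 * np.pi)))
    return np.exp(log_l)

def evidence(y, sigma, prior_width, n_grid=40001):
    """Riemann-sum evaluation of (2-10) for a uniform prior of the given width."""
    w = np.linspace(-prior_width / 2.0, prior_width / 2.0, n_grid)
    dw = w[1] - w[0]
    return np.sum(likelihood(w, y, sigma) / prior_width) * dw

def occam_approx(y, sigma, prior_width):
    """Best-fit likelihood times the Occam factor (posterior width / prior width)."""
    w_mp = np.array([y.mean()])
    delta_w = np.sqrt(2.0 * np.pi) * sigma / np.sqrt(y.size)
    return likelihood(w_mp, y, sigma)[0] * delta_w / prior_width

y = np.array([0.1, 0.4, 0.2, 0.5, 0.3])  # hypothetical observations
sigma = 0.5
ev_narrow = evidence(y, sigma, prior_width=4.0)   # model with a tight prior range
ev_wide = evidence(y, sigma, prior_width=40.0)    # model with a 10x wider prior range
print(ev_narrow, occam_approx(y, sigma, 4.0), ev_narrow / ev_wide)
```

Both models reach the same best-fit likelihood, so the evidence ratio is set entirely by the Occam factors: the model whose prior spreads over a range ten times wider is penalized by roughly a factor of ten.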

2.4 Bayesian regularization

In this section, we introduce Bayesian regularization and examine the probability distribution used to set the regularization parameter $\lambda$. The selection of $\lambda$ is the key concept of our method, so we adopt Bayesian analysis to infer its optimal value. To infer from the data what value $\lambda$ should have, we evaluate the corresponding probability distribution.

As mentioned earlier, we add the regularization technique to PLS and rewrite the error criterion as:

$$ E = e^T e + \lambda q^T q, \quad \lambda \ge 0 \tag{2-12} $$

where $e^T e$ is the total sum of squared errors and $q$ is the weighting vector that infers the output directly, so that $q^T q$ is its squared norm. The original PLS calibration reduces the total error as far as possible, but if the training data contain noisy signals (outliers), the model may fit the noise. The prediction accuracy for unseen data will then be poor.

The variation of the weighting coefficients, measured by $q^T q$, controls the covariance between the two variables.

As shown in section 1.3, Figure 1-2 gives a brief interpretation of the error criterion and the regularization parameter $\lambda$. The regularization term weighted by $\lambda$ is added to make the calibration curve smooth, without oscillation. Bayesian-based PLS keeps a balance between the smoothness of the curve and the accuracy in the calibration phase.
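The effect of $\lambda$ in the criterion (2-12) can be sketched numerically. The example assumes a linear output layer $e = y - Tq$, for which the minimizer of $e^T e + \lambda q^T q$ has the ridge-type closed form $q = (T^T T + \lambda I)^{-1} T^T y$. This closed form, the function name `regularized_q`, and the toy score matrix are assumptions for illustration and are not taken from the thesis.

```python
import numpy as np

def regularized_q(T, y, lam):
    """Minimize e^T e + lam * q^T q with e = y - T q (ridge-type closed form)."""
    a = T.shape[1]
    return np.linalg.solve(T.T @ T + lam * np.eye(a), T.T @ y)

rng = np.random.default_rng(1)
T = rng.normal(size=(30, 4))  # hypothetical hidden-layer matrix, n = 30, a = 4
y = T @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.3 * rng.normal(size=30)

for lam in (0.0, 1.0, 10.0, 100.0):
    q = regularized_q(T, y, lam)
    e = y - T @ q
    print(f"lambda={lam:6.1f}  e^T e={e @ e:8.3f}  q^T q={q @ q:6.3f}")
```

As $\lambda$ grows, $q^T q$ shrinks while $e^T e$ grows, which is the trade-off between smoothness and calibration accuracy described above.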

Chapter 3 Bayesian-based PLS

In this chapter, we establish a novel analysis method, Bayesian-based PLS, by applying the Bayesian approach to the PLS method. An elegant approach to selecting the regularization parameter is to adopt the Bayesian interpretation and evaluate the evidence to find the best value of the regularization parameter. The evidence procedure we adopt calculates the probability $P(D \mid \alpha, \beta, H)$.

3.1 Data preprocessing

In computer science, data preprocessing describes processing performed on raw data to prepare it for another processing procedure. The result of data preprocessing is the final training data set. Data preprocessing transforms the data into a form that can be processed more easily and effectively for the purpose of the user.

Many methods and techniques are widely used for data preprocessing, including sampling, cleaning, normalization, transformation, denoising, and feature extraction and selection. Sampling is the process of selecting a representative subset from a large population of data. Transformation is usually applied so that the data more closely meet the assumptions of the statistical inference procedure that is to be applied. Denoising eliminates noise from the source data. Normalization organizes the data so that they are more efficient to access and more regular, which typically means conforming to some regularity or rule, or returning from some state of abnormality. Feature extraction is a process of reducing the data to a smaller number of dimensions. As such, it is useful for data visualization, since a complex data set can be visualized effectively when it is reduced to fewer dimensions.

In our study, we incorporate a data preprocessing procedure into our method. A transformation is applied to map the training data set into another form. Here the tangent sigmoid function is selected, as shown in Figure 3.1.

Figure 3.1 Tangent sigmoid function

To construct the three-layer architecture of Bayesian-based PLS, the data preprocessing procedure is added between the input and hidden layers. The flow chart is shown in Figure 3.2. In Figure 3.2, the original data $X$ are generated from a uniform distribution. After the tangent sigmoid transformation, we obtain the new form of the data, $X'$. The probability density function (PDF) of $X'$ is a Gaussian (normal) distribution.

Figure 3.2 The flow chart of data preprocessing
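The preprocessing step can be sketched as an element-wise transform. The code assumes that "tangent sigmoid" means the common definition $\mathrm{tansig}(x) = 2/(1 + e^{-2x}) - 1$, which is mathematically the hyperbolic tangent. The input range and array sizes are hypothetical.

```python
import numpy as np

def tansig(x):
    """Tangent sigmoid 2 / (1 + exp(-2x)) - 1, equal to tanh(x)."""
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

rng = np.random.default_rng(2)
X = rng.uniform(-3.0, 3.0, size=(1000, 5))  # hypothetical raw data, uniformly distributed
X_prime = tansig(X)                          # transformed data X'

print(X_prime.min(), X_prime.max())
```

The transform maps every input into the open interval $(-1, 1)$ and is monotonic, so the ordering of the samples in each variable is preserved.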

3.2 Bayesian-based PLS algorithm

We consider different kinds of probability density functions and try to find the relation between maximum covariance and minimum sum of squared errors for them. The original three-layer ANN architecture of PLS searches for the maximum variance between the input and hidden layers and finds the minimum total sum of squared errors between the hidden and output layers. Here we define the total data misfit function as:

$$ M = \alpha E_D + \beta E_W \tag{3-1} $$

where $E_D$ is the residual squared error function and $E_W$ is commonly referred to as a regularizing function. To find the optimal value of the regularization parameter $\lambda$, we apply the Bayesian interpretation and calculate the best value of $\lambda$ using the evidence procedure, an approximate Bayesian scheme reviewed in MacKay [4]. Using Bayes’ rule, the posterior probability of the parameter $w$ is given by formula (2-8). We now want to evaluate the evidence to find the value of $\lambda$. The probability of the data given the parameter $w$ is:

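$$ P(D \mid w, \alpha, H) = \frac{\exp(-\alpha E_D)}{Z_D(\alpha)}, \qquad Z_D(\alpha) = \left( \frac{2\pi}{\alpha} \right)^{n/2} \tag{3-2} $$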

and the prior probability on the parameter $w$ is:

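$$ P(w \mid \beta, H) = \frac{\exp(-\beta E_W)}{Z_W(\beta)}, \qquad Z_W(\beta) = \left( \frac{2\pi}{\beta} \right)^{k/2} \tag{3-3} $$

where $k$ is the number of parameters in $w$.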

If $\alpha$ and $\beta$ are known, then the posterior probability of the parameter $w$ is:

From the above formulas and the given hypotheses, we evaluate the evidence for $\alpha$ and $\beta$.
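For orientation, the standard evidence-procedure expressions reviewed in MacKay [4] can be written in the convention of (3-1), where $\alpha$ multiplies $E_D$ and $\beta$ multiplies $E_W$. The sketch below assumes quadratic $E_D$ and $E_W$ and a Gaussian approximation to the posterior. The symbols $Z_D$, $Z_W$, $Z_M$ (normalizing constants of the likelihood, prior, and posterior), $k$ (number of parameters), $B = \nabla\nabla E_D$, and $C = \nabla\nabla E_W$ are notation assumed here for illustration and may differ from the thesis's own.

```latex
% Posterior of w when alpha and beta are known
P(w \mid D, \alpha, \beta, H) = \frac{\exp(-M(w))}{Z_M(\alpha, \beta)}

% Evidence for alpha and beta
P(D \mid \alpha, \beta, H) = \frac{Z_M(\alpha, \beta)}{Z_D(\alpha)\, Z_W(\beta)}

% Log evidence, with A = \nabla\nabla M = \alpha B + \beta C evaluated at w_{MP}
\log P(D \mid \alpha, \beta, H)
  = -\alpha E_D^{MP} - \beta E_W^{MP} - \tfrac{1}{2} \log \det A
    + \tfrac{n}{2} \log \alpha + \tfrac{k}{2} \log \beta - \tfrac{n}{2} \log 2\pi
```

Differentiating the last expression with respect to $\alpha$ and $\beta$ leads to conditions of the same form as the ones the text derives from the log evidence.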

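We then differentiate the log evidence (3-6) with respect to $\alpha$ and $\beta$ to find the conditions that are satisfied at the maximum.

First, we differentiate the log evidence with respect to $\alpha$ and set the derivative to zero:

$$ \frac{\partial}{\partial \alpha} \log P(D \mid \alpha, \beta, H) = -E_D - \frac{1}{2} \mathrm{Trace}(A^{-1} B) + \frac{n}{2\alpha} = 0 $$

where $A = \nabla\nabla M$ and $B = \nabla\nabla E_D$. This gives:

$$ 2 \alpha E_D = n - \alpha\, \mathrm{Trace}(A^{-1} B) = n - \gamma \tag{3-7} $$

Then we differentiate with respect to $\beta$ and set the derivative to zero:

$$ \frac{\partial}{\partial \beta} \log P(D \mid \alpha, \beta, H) = -E_W - \frac{1}{2} \mathrm{Trace}(A^{-1} C) + \frac{k}{2\beta} = 0 $$

where $C = \nabla\nabla E_W$ and $k$ is the number of parameters. Because $A = \alpha B + \beta C$, we have $k - \beta\, \mathrm{Trace}(A^{-1} C) = \alpha\, \mathrm{Trace}(A^{-1} B) = \gamma$, and we obtain the following condition for the most probable value of $\beta$:

$$ 2 \beta E_W = \gamma \tag{3-8} $$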

According to (3-7) and (3-8), we rewrite the new error criterion as:

$$ M = \alpha E_D + \beta E_W \approx e^T e + \lambda q^T q \tag{3-9} $$

We can then find the appropriate value of $\lambda$ by starting from an initial value $\lambda \ge 0$ and applying the following iterative procedure. The value of $\lambda$ is updated by the formula as follows:
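One way such an iteration can be realized is sketched below. It is not the thesis's own update formula, which is not reproduced here. The sketch assumes $E_D = \tfrac{1}{2} e^T e$ and $E_W = \tfrac{1}{2} q^T q$ with a linear output $e = y - Tq$, so that $\lambda = \beta / \alpha$ and, combining (3-7) and (3-8), $\lambda = \gamma\, e^T e / \big((n - \gamma)\, q^T q\big)$. The function name `evidence_lambda`, the stopping rule, and the toy data are all hypothetical.

```python
import numpy as np

def evidence_lambda(T, y, lam=1.0, n_iter=200, tol=1e-10):
    """Fixed-point iteration for lambda = beta / alpha (assumed form, see lead-in)."""
    n, a = T.shape
    G = T.T @ T
    s = np.linalg.eigvalsh(G)  # eigenvalues of B = T^T T (Hessian of E_D)
    for _ in range(n_iter):
        q = np.linalg.solve(G + lam * np.eye(a), T.T @ y)  # minimizer for current lambda
        e = y - T @ q
        gamma = np.sum(s / (s + lam))  # gamma = alpha * Trace(A^{-1} B)
        lam_new = gamma * (e @ e) / ((n - gamma) * (q @ q))
        converged = abs(lam_new - lam) <= tol * max(1.0, lam)
        lam = lam_new
        if converged:
            break
    # Recompute q and gamma so that they correspond to the returned lambda.
    q = np.linalg.solve(G + lam * np.eye(a), T.T @ y)
    gamma = np.sum(s / (s + lam))
    return lam, q, gamma

rng = np.random.default_rng(6)
T = rng.normal(size=(50, 4))  # hypothetical hidden-layer matrix
y = T @ np.array([2.0, -1.0, 0.5, 1.0]) + 0.5 * rng.normal(size=50)

lam, q, gamma = evidence_lambda(T, y)
print(lam, gamma)
```

Here $\gamma$ plays the role of the effective number of parameters determined by the data: it lies between 0 and the number of columns of $T$, and approaches the latter when the data determine all weights well.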

Chapter 4 Experiments and Results

In this chapter, we present the results of the simulation experiments, covering the sigmoid function, the Gaussian-based spectrum, and data preprocessing. In our simulation experiments, the results show that Bayesian-based PLS and PRLS perform nearly the same, and that both perform better than the original PLS.

4.1 Illustration

4.1.1 Synthesized simulation data

In the simulation experiments, we use synthesized test data with noise to examine the efficiency of our method. We add noise generated from a Gaussian probability density function with zero mean, and set its standard deviation to vary the noise level. The noise-to-signal (N/S) ratio is used as the standard measure of this variation. We are given a signal data set $signal_i$ and a Gaussian noise data set $noise_i$ with zero mean, $1 \le i \le n$.

The means of the signal and noise data sets are:

The variances of the signal and noise data sets are:

The noise-to-signal (N/S) ratio is
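The noise generation can be sketched as follows. Because the exact N/S formula is not reproduced above, the sketch assumes that the N/S ratio is the ratio of the noise standard deviation to the signal standard deviation. That definition, the function name `add_noise`, and the test signal are assumptions for illustration only.

```python
import numpy as np

def add_noise(signal, ns_ratio, rng):
    """Add zero-mean Gaussian noise with std equal to ns_ratio times the signal std."""
    sigma_noise = ns_ratio * np.std(signal)
    noise = rng.normal(0.0, sigma_noise, size=signal.shape)
    return signal + noise, noise

rng = np.random.default_rng(3)
signal = np.sin(np.linspace(0.0, 2.0 * np.pi, 5000))  # hypothetical clean signal
noisy, noise = add_noise(signal, ns_ratio=0.2, rng=rng)

measured_ratio = np.std(noise) / np.std(signal)
print(measured_ratio)
```

Sweeping `ns_ratio` over a range of values yields data sets of increasing noise level, which is how curves of RMSE against N/S ratio could be produced under this assumed definition.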

4.1.2 Criterion of estimation

Here we use two familiar estimators to verify the performance of our method. The first is the root mean square error (RMSE). RMSE is one of many ways to quantify the amount by which an estimator differs from the true value of the quantity being estimated, in the manner of a loss function. The second is the correlation coefficient, which indicates the strength and direction of a linear relationship between two variables. The correlation coefficient takes values between -1 and 1. It measures how well trends in the predicted values follow trends in the actual values. The concept of the correlation coefficient is illustrated in Figure 4.1.

Figure 4.1 A sketch map of correlation coefficient

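Given a set of observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the formula for computing the correlation coefficient is given by:

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{X})(y_i - \bar{Y})}{S_X S_Y} \tag{4-4} $$

where $\bar{X}$ and $\bar{Y}$ are the sample means and $S_X$ and $S_Y$ are the sample standard deviations.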

Figure 4.2 Root mean square error

Figure 4.2 shows the main idea of RMSE, which is to measure the average distance between the predictions and the desired outputs. For accurate prediction, the RMSE should be as small as possible.
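Both estimators can be written compactly. The sketch below implements RMSE in its usual form, the square root of the mean squared difference, and the correlation coefficient as in (4-4) with sample standard deviations. The function names and the five-point data set are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between desired outputs and predictions."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def correlation(x, y):
    """Correlation coefficient as in (4-4), using sample standard deviations."""
    n = x.size
    sx, sy = np.std(x, ddof=1), np.std(y, ddof=1)
    return np.sum((x - x.mean()) * (y - y.mean())) / ((n - 1) * sx * sy)

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

print(rmse(y_true, y_pred), correlation(y_true, y_pred))
```

A perfectly inverted prediction such as `-y_true` gives a correlation of -1, the lower end of the coefficient's range.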

4.1.3 Conditional training

Here we calibrate the training data under two different conditions: (1) self-calibration and self-prediction (SCSP) and (2) cross validation (CV). We use diagrams to make the difference between SCSP and CV easy to understand. Figure 4.3 shows the principle of SCSP and Figure 4.4 shows CV.

SCSP is a traditional training mode in which the training data set is also the prediction data set. The result of SCSP is usually ideal if there is no noise hidden in the source data. However, data are usually accompanied by noise, and SCSP is influenced by this hidden information, so the results may not meet expectations.

The CV used here is also called the leave-one-out method, because we select one observation from the original training data set as validation data and repeat until every observation in the set has been used as validation data. The method helps to avoid overfitting but is computationally expensive. Next, we compare the regularization technique and CV in simulation and real data experiments.

Figure 4.3 Self-calibration and self-prediction (SCSP)

Figure 4.4 Cross validation (CV)
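The leave-one-out procedure and the resulting prediction error sum of squares (PRESS) can be sketched generically. The sketch uses an ordinary LS model as a stand-in calibration method. The function names, the `fit`/`predict` interface, and the toy data are assumptions for illustration, and any calibration model with the same interface could be passed instead.

```python
import numpy as np

def loo_press(X, y, fit, predict):
    """Leave-one-out cross validation: sum of squared errors on held-out samples."""
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # train on all samples except i
        model = fit(X[keep], y[keep])
        y_hat = predict(model, X[i:i + 1])[0]  # predict the held-out sample
        press += (y[i] - y_hat) ** 2
    return press

def ls_fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ls_predict(a, X):
    return X @ a

rng = np.random.default_rng(7)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, 0.5, -1.0]) + 0.2 * rng.normal(size=20)

press = loo_press(X, y, ls_fit, ls_predict)
print(press)
```

The loop refits the model once per observation, which is the source of the heavy computation mentioned above.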

4.2 Simulation data

In this section, we generate the sigmoid function data and the Gaussian-based spectrum data and apply the preprocessing procedure under the SCSP and CV conditions. We calibrate these different kinds of data sets. After prediction, we apply the estimation criteria to examine which of the PLS, PRLS, and Bayesian-based PLS methods performs best.

4.2.1 Sigmoid function

In this simulation, we use a hybrid of sine and cosine functions as the test function:

$$ f(x_i) = a_i \sin(x_i) + b_i \cos(x_i), \quad 0 \le x \le 2\pi \tag{4-5} $$

The training data were generated from $f(x_i) + \varepsilon_i$, where $x_i$ was drawn from the uniform distribution on $(0, 2\pi)$ and the noise $\varepsilon_i$ had a Gaussian distribution with zero mean. The training data and the sigmoid function $f(x_i)$ are plotted in Figure 4.5. The training data are highly ill-conditioned.

Figure 4.5 Noisy data (points) and sigmoid function (curve)
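A data set of this kind can be generated as sketched below. The coefficients are treated as two fixed constants `a` and `b`, and their values, the sample size, and the noise standard deviation are hypothetical choices, since the values used for Figure 4.5 are not stated here.

```python
import numpy as np

def f(x, a=1.0, b=0.5):
    """Test function of (4-5) with assumed constant coefficients a and b."""
    return a * np.sin(x) + b * np.cos(x)

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0.0, 2.0 * np.pi, size=n)  # inputs drawn uniformly from (0, 2*pi)
eps = rng.normal(0.0, 0.1, size=n)         # zero-mean Gaussian noise
y = f(x) + eps                             # noisy training targets

print(x[:3], y[:3])
```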

Figure 4.6 RMSE as a function of N/S ratio under SCSP (PRLS)

Figure 4.8 RMSE as a function of N/S ratio under SCSP (PLS)

Figure 4.9 Correlation coefficient as a function of N/S ratio under SCSP (PRLS)

Figure 4.10 Correlation coefficient as a function of N/S ratio under SCSP (Bayesian-based PLS)

Figure 4.12 Prediction error sum of squares (PRESS) under CV

4.2.2 Gaussian-based spectrum

We generate two Gaussian functions: $g(x)$ with mean 510 and standard deviation 15, and $h(x)$ with mean 540 and standard deviation 10. $f(x)$ is a linear combination of $g(x)$ and $h(x)$, plotted in Figure 4.13.

Figure 4.13 The linear combination of two Gaussian functions with different mean and standard deviation.

The training data set $X_i + \varepsilon$ is generated as a linear combination of $g(x)$ and $h(x)$ with Gaussian noise $\varepsilon$. The training data set is shown in Figure 4.14.

Figure 4.14 The training data set of Gaussian-based spectrum
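The construction of the spectrum data can be sketched as follows. The means and standard deviations of $g(x)$ and $h(x)$ come from the text. The axis range, the normalization of the Gaussians as probability densities, the random mixing coefficients, the number of samples, and the noise level are all assumptions made for the example.

```python
import numpy as np

def gaussian(x, mean, std):
    """Gaussian function, normalized here as a probability density (an assumption)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

x = np.linspace(450.0, 600.0, 301)  # hypothetical axis with step 0.5
g = gaussian(x, 510.0, 15.0)
h = gaussian(x, 540.0, 10.0)

rng = np.random.default_rng(5)
n = 50
C = rng.uniform(0.0, 1.0, size=(n, 2))  # hypothetical mixing coefficients per sample
clean = C @ np.vstack([g, h])           # each row is a linear combination of g and h
X_train = clean + rng.normal(0.0, 0.001, size=clean.shape)  # add Gaussian noise

print(X_train.shape)
```

Each row of `X_train` is one noisy spectrum, and the corresponding row of `C` could serve as the quantity to be calibrated.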

Next, we show the calibration results, which allow the performance of the three methods to be compared.

Figure 4.15 RMSE as a function of N/S ratio under SCSP (PRLS)

Figure 4.16 RMSE as a function of N/S ratio under SCSP (Bayesian-based PLS)

Figure 4.18 Correlation coefficient as a function of N/S ratio under SCSP (PRLS)
