Using Bayesian Network and Nonparametric Regression to Analyze

2. Literature Reviews

2.3. Using Bayesian Network and Nonparametric Regression to Analyze

A Bayesian network is a graph-based model of joint multivariate probability distributions that captures properties of conditional independence between variables. It consists two components. The first component, G, is a directed acyclic graph (DAG) whose vertices correspond to the random variablesX₁,...,X . The second component,θ , _n describes a conditional distribution for each variable, given its parents in G. Together, these two components specify a unique distribution on X₁,...,X . The graph G _n represents conditional independence assumptions that allow the joint distribution to be decomposed, economizing on the number of parameters. The graph G encodes the Markov Assumption: Each variableX is independent of its non-descendants, given its _i parents in G.

By applying the chain rule of probabilities and properties of conditional independencies, any joint distribution that satisfies the Markov Assumption can be composed into the product form

( ,..., ) ( | ( )),

n i

P X X P X Pa X

∏

ⁱ (2.7)

where Pa^G(X_i)is the set of the parents of X in G. The method of Bayesian network is _i to select a proper graph G with the largest posterior probability after giving the prior probability. Imoto et al. (2002) proposed the criterion

( ) 2 log ( | ) ( | )

i G G G

BNRC G π f x θ π θ λ θd

⎧ ⎫

= − ⎨ ⎬

⎩

∫ ∏

⎭ (2.8) whereπ_Gis the prior probability of the graph G, (π θ λ_G| )is the prior probability of

θGafter giving parametric vectorλ , θ_G=(θ₁^T,...,θ_n^T)is the parametric vector in G,

wherex is the ith observation of jth variable, that is, if the number of observation is n _ij

then the observation of the jth variableX is _j

x

₁_j

,..., x

_nj _{. We use} to represent the parents of

1_j,..., P P_nj

X .j

p

_ij is the vector with -dimension, i.e., the jth variable has parents at the ith observation, and its kth component is . And use nonparametric regression models for capturing the relationship between

qj q_j

is a prescribed set of basis functions (such as Fourior series, polynomial bases, regression spline bases, B-spline bases, wavelet bases and so on),

is the unknown coefficients and

( ) ( )

networks are to find G makes BNRC maximized.

Recently, Imoto et al. (2002) porposed the use of nonparametric additive regression

models for capturing not only linear dependencies but also nonlinear structures between genes. Imoto et al. (2003) improved this method by using Bayesian networks and the nonparametric heteroscedastic regression, which is more resistant to the effect of outliers but it needs much time for determining the optimal graph. Tamada et al. (2003) proposed a motif detection method to estimate gene networks and combine microarray gene expression data and DNA sequences of regulatory regions of genes. It found that the motif information is useful for revising some incorrect relations in the network estimated by microarray alone. Imoto et al. (2003) proposed a method for estimating a gene network based on Bayesian networks from microarray gene expression data together with biological knowledge including protein-protein interactions, protein-DNA interactions, binding site information, existing literature and so on. Its proposed criterion can control the trade-off between microarray information and biological knowledge automatically. Kim et al. (2003) proposed a Bayesian network and nonparametric regression model for constructing a gene network from time series microarray gene data. This method can overcome a shortcoming of the Bayesian network model in the sense of the construction of cyclic regulations.

2.4 Using Nonparametric Regression Method to Analyze the Causal Effect between Variables

Different from linear regression analysis, nonparametric regression uses a roughness penalty that decreases as the fitting curve gets smoother.

Consider the n observations{(x_i,y_i),i=1,...,n}. Assume that the sample is ordered over the interval [a, b] with respect to the predictor values; that is,

b x x

a≤ ₁≤...≤ _n ≤ . To estimate the unknown smooth regression function by explicitly trading off fidelity to the data with smoothness of the estimate. For regression, the residual sum of squares is a natural measure of fidelity to the data, so the roughness

penalty estimator is the minimizer of and the resulting cubic smoothing spline estimategˆbelongs to the Sobolev space

} interpolating spline that passes through each of the responsesy_i, while if

∞

λ→ ,gˆapproaches the linear least squares regression line. The smoothing spline is a linear estimator, so the vector of fitted valuesyˆ_i = gˆ(x_i) can be written as

. ) ˆ A( y

y = λ The matrixA(λ)is called the hat matrix. λ can be chosen to minimize the cross-validation score [17]

where is the spline estimate based on all the observations except , evaluated at . It can be shown that for linear smoothers,

)

the fitted values (Green and Silverman, 1995) [17],

, leverage value at since it measures the potential for the observed response to exert influence on the fitted value.

xi y_i

A variation is generalized cross-validation (GCV) [17], which replaces each value1−A_ii(λ)with their average, . The generalized cross-validation selector of , is the minimizer of

Proposition 1. (Shiau, 1985) Letgˆ_λbe the smoothing spline estimator. Then

S(gˆ_λ)= y^T(I−A(λ))y/n, (2.18) The following two propositions are given in Cheng (2003).

Proposition 2. ) , and is the mode of the posterior density after giving the prior distribution.

( min

ˆ arg S g

g_λ = _g

Proposition 3. Assume that X and Y and E(X)=E(Y)=0, Var(X)=Var(Y)=1 , let , be n i.i.d. observations. Then

n i

x_i, _i), 1,...,

( = E(S(gˆ_λ))→1,asn→∞.

Cheng (2003) consider only the causal relationship between two variables X and Y.

If X is the parent (cause) and Y is the child (effect), then denote the cause-effect relationship by XÆY. more specifically, the causal relationship between X and Y can be described by the following causal models: Y =g X₁( )+ , where is a smoothing ε function and

ε is a random error with mean zero and independent of X. If X cannot be

simultaneously represented as g Y₂( )+ with Y independent of an random errorε^' ε^',

then XÆY but not vice versa.

We shall call the method proposed in Cheng (2003) determing the causal direction between two random variables “SCORE” method. SCORE method simultaneously use nonparametric method to estimate and . Finally gets two score and

(Shiau, 1985), then use decision rules to determine the direction.

g1 g₂ S(gˆ₁)

ˆ ) (g₂ S

We simplify Chia Yu Cheng et al. (2003) as the following flow chart.

Input Gene expression data n

i y

x_i, _i), 1,..,

( =

Normalize data

Use nonparametric regression to estimate smoothing functions. λ =λ^ˆ_GCV

Figure 3: The flowchart of SCORE Decision rules are as follows:

IfS(gˆ ) S(gˆ )then XÆY has a larger posterior probability than YÆX, i.e., X causes Y.

2 1 <

IfS(gˆ₁) S(gˆ₂)then YÆX has a larger posterior probability than XÆY, i.e., Y causes X.

IfS(gˆ_λ) is close to 1, then X ⊥Y (Cheng (2003) suggested , and Lu (2003)

suggested ).

9 . 0 ˆ ) (g_λ >

(ˆ ) 0.8 S g_λ >

We call the 0.9 or 0.8 “the threshold score of independence”.

2.5 Using Smooth Response Surface Methodology to Construct Gene Networks A smooth response surface algorithm proposed by Xu et al. (2002) is a sophisticated data mining technique for analyzing gene expression data and constructing gene networks. It uses a three-dimensional smooth response surface to capture the biological relationship between the target and activator-repressor. It also functionally

ˆ ) (g₁

S S(gˆ₂)

ε +

=g₁(X)

Y X =g₂(Y)+ε

Check Decision rules

Establish Relationship

describes triplets of activators, repressors, and targets, and their regulations in gene expression data. The diagnostic strategy in the algorithm is to evaluate the scores of the triplets so that those with low scores are kept and a regulatory network is constructed based on this information and existing biological knowledge. Xu et al. (2002) applied the method to two yeast gene expression data sets and reported that the predictions based on the identified triplets agree with some experimental data in the literature. This method provides a novel model with attractive mathematical and statistical features that make the algorithm valuable for mining expression or concentration information, help for determining the function of uncharacterized proteins, and can result in a better understanding of coherent pathways.

The smooth response surface method is decided as follows. First, transform the raw data over the time points into the interval [0, 1]. Let A be the activator and B be the repressor. The definition of the activator-repressor-target model is that a target gene C is controlled by both an activator gene A and a repressor gene B. Define a three-dimensional smooth response surface as S(A, B), which is a piecewise linear-quadratic polynomial on [0, 1] × [0, 1]. The triplets that follow the activator-repressor-target relationship should lie closely to the response surface.

The 3D response surface has the same purpose with a high dimensional decision matrix. The function S(A, B) maps two normalized values A and B onto a 3D surface, in order to describe a surface response value C .To ease decision-making, it uses some heuristic rules:

A high + B low Æ C high.

A low + B high Æ C low.

As seen in Figure 3, a triplet (A, B, S(A, B)) represents the biological relationship that follows the pattern of a target S(A, B) controlled by an activator A and a repressor B as described in the activator-repressor-target model. The response surface captures the

biological model with features such as compactness, simplicity and visualization. The imputation and gene filtering steps are applied to remove noise from the data.

For each triplet (A, B, C), is the fitted value of a target C. The residual, , should be small, if the activator-repressor-target relationship is strong.

The residual sum of squares measures the overall variation in C that is not explained in the response surface model. Then the lack-of-fit function RT(A, B, C), i.e., the ratio of the residual sum of squares and the total sum of squares, describes the proportion of variation in C that is not captured by the 3D response surface. A small value of lack-of-fit indicates that there is a strong activator-repressor-target relationship among A, B, and C. The lack-of-fit formula is defined to filter the triplets in the initial screening.

To save storage and computation, only those triplets whose lack-of-fit values do not exceed a given constant RT are kept.

) , ˆ S(A B C=

C Cˆ−

A diagnostic strategy Diag(A, B, C) is applied to check the reliability of the triplets after the initial filtering to measure robustness of the fitted model for each triplet (A, B, C ). Xu et al. (2002) pointed out that the intensity measurement of gene expression at one or two time points may deviate from the model and suggest that the measurement may be faulty and should be treated as an outlier. If such a value occurs at the i-th point, then , i.e., the lack-of-fit of (A, B, C) when the i-th point (or the i-th column) is left out, will differ greatly from RT(A, B, C). Diag(A, B, C) provides a summary measure over all time points for a given triplet. The diagnostic method is developed to refine the selected triplets and a score, which reflects the strength of the triplet interrelationship, is defined to rank the refined triplets. A larger Diag value would suggest that the information for the triplet is unreliable and should be removed for further consideration. Thus the criteria for selecting triplet candidates is:

) , ,

)(

( A B C

RT_i

Diag C

B A Diag RT

C B A

RT( , , )≤ and ( , , )≤ , where RT and Diag are constants as

specified by users.

A final score is defined to measure the strength of the triplet interrelationship.

Score(A, B, C) is a function of the lack-of-fit value and the diagnostic measure, and focuses primarily on the RT(A, B, C) value and secondly on the Diag(A, B, C) values.

Triplets with low values of RT(A, B, C) and Diag(A, B, C) will have low scores, which indicate a close relationship among A, B, and C. Finally, a gene regulatory network is constructed based on the top scoring triplets. Figure 4 gives the flowchart of the smooth response surface algorithm.

The lack-of-fit:

∑⁽^Cⁱ^{− )}^Cⁱ

Figure 4: The data processing flowchart of the Smooth Response Surface algorithm.

3. Methodology

3.1 Motivation

Xu et al. (2002) proposed the activator-repressor-target model to represent that each variable can find its activator and repressor and is controlled by the activator and repressor simultaneously. It is not clear whether there exists such a triplet relationship in reality. In this study, we consider pairs of genes instead of triplets.

Cheng (2003) proposed a score method to determine the relationship between two variables. But this method only determines the causal effect direction but no activate (+) / repress (-) information. Also, there is a problem of determining whether two variables are independent. Therefore, we use the method in next section to improve Cheng’s method.

3.2 Proposed Methods

We simplify our goal as finding the relation between two variables.

If X is the parent (cause) and Y is the child (effect), then we use XÆY to represent the causal effect relation between them. We assume X and Y have the regression relation: , where f is a smoothing function and is a random variable with mean zero and is independent of X. Sometimes, X can be represented as

with Y being independent of

Y=f (X)+ ε ε

g(Y)+ ε' ε at the same time. This will hold when f is a ^' linear function. When this happens, we denote the cause/effect reality byX↔Y.

Let and be the residuals of regressory Y on X and X on Y, respectively.

The core idea of our method is that, if XÆY, then we should have the following result:

1 2 1 ˆ 2 ˆ

R ⊥X and R ⊥ Y, where R = −Y f (X) and R = −X g(Y).

On the other hand, if YÆX, thenR₁ ⊥ X and R₂ ⊥ . Y The possible relationships between X and Y are:

- + - +

-X⎯⎯⁺→Y, X⎯⎯→Y, X←⎯⎯Y, X←⎯⎯Y, X←⎯→Y, X←⎯→Y, X⊥Y.

Arrow is the direction. The sign above the arrow represents activate (+) or repress (-).

Our proposed method is divided into two parts by the patterns of the residuals , =1, 2.

R ii

We summarize the method by the following flow chart.

The Distribution of the Residual R

We can use a graphical display, QQ plot, to check the normality of the residuals.

QQ plots are used to assess whether a data set has a particular distribution, or whether two datasets have the same distribution. If two distributions are the same, then the plot will approximate a straight line. The extreme points have more variability than points toward the center. Or we can use Kolmogorov-Smirnov Goodness-of-Fit Test

Second part , i=1,2.

At least one of the distributions of the residuals is normal distribution.

If the residuals both do not approximate to normal or the sample size is small (<30).

First part

Pre-process: TransformingR , X and Y to an ordinal type data using RANK or RANGE statistics.

2 1,R Decision rules

Step one: Relation test using nonparametric correlations (Kendall’s Tau, Spearman’s Rho) with Bonferroni correction.

Establish Relationship

Step two: Direction test using chi-square test of independence.

Figure 5: The flow chart of our proposed method

(Chakravart, Laha, and Roy, 1967) [5] and Chi square Goodness-of-Fit Test (Snedecor and Cochran, 1989) [24] to compare the distribution of the residuals with normal. If the sample size is small, we suggest skip this part and go to second part.

First part: At least one of the distributions of the residuals is normal distribution.

(We remark that this condition usually does not hold in our study.) In this part we use the following Decision rule:

1. IfR is approximately normal and₁ R isn’t, then XÆY. ₂ 2. IfR is approximately normal and₂ R isn’t, then YÆX. ₁ 3. IfR and₁ R both are approximately normal, then₂ X ↔Y .

4. IfR and₁ R both are not approximately normal or the sample size is small, then ₂ go to second part.

From our experience, the chance of using the first part of our method is quite small.

Second part: When the residuals both are not approximately normal or the sample size is small (<30).

1 2

R and R

In this part, we divide the target into the following steps:

Q1. Is there a relationship between those two variables?

Q2. How do they relate? Repress (-) or activate (+)?

Q3. What is the related direction if they really have relationship?

Step one will solve the Q1 and Q2. Step two will give the answer to Q3.

If we want to know the relationship between two variables, we must confirm whether there is relation between two variables first.

Relationships between variables. To express a relationship between two variables, one way is to compute the correlation coefficient between two variables. We discuss the correlation between X and Y using nonparametric correlations (Kendall Tau and Spearman R). An advantage of nonparametric or rank correlation is that we need not

know the probability distribution functions from which the and are drawn.

However, the slight loss of information in ranking is a small price to pay for a very major advantage: when a correlation is demonstrated to be present nonparametrically, then it is really there! Nonparametric correlation is more robust than linear correlation,

more resistant to unplanned defects in the data s

x_i' y_i's

Spearman Rank-Order Correlation Coefficient (Siegel & Castellan, 1988 and Siegel, 1956) [23, 24]

Suppose we have N data points (x ,_i y_i), i=1,...,N. Let be the rank of among the other x’s, be the rank of among the other y’s, then the rank-order correlation coefficient is defined to be the linear correlation coefficient of the ranks, namely,

Ri x_i If N is larger than 10, the significance of a nonzero value ofr_sis tested by computing

₂

This statistic is distributed approximately as Student’s t distribution with N -2 degrees of freedom. A key point is that this approximation does not depend on the original distribution of thex s_i' and ; it is always the same approximation, and always pretty good.

i' y s

Kendall’s Tau (Helsel & Hirsch, 1995) [12]

Suppose we have N data points (x ,_i ). Now consider all

(

)

points, where a data point cannot be paired with itself, and where the points in either order count as one pair. Let and be a pair of (bivariate) observations.

If and have the same sign, we say that pair is concordant. If they have opposite signs, we say that the pair is discordant.

(x_j

Let C be the number of concordant pairs, and D be the number of discordant pairs, then Kendall’s Tau is defined as

If , or , or both, the comparison is called a ‘tie’. Ties are not counted as concordant or discordant. If the number of ties is large, then Tau has to be replaced by

where be the number of ties involving x and be the number of ties involving y.

Obviously,

Nx N_y

1 1≤ ≤

− τ .

If N is larger than 40, the significance of Kendall's Tau can be tested by calculating a test statistic, t, and compares it to the tabular values of Student's t distribution:

of our proposed method.

Kendall’s Tau is equivalent to Spearman’s Rho (3.1) with regard to the underlying assumptions. But they are not equal in magnitude because their underlying logic and computational formulae are quite different. They have a relation represented as

In order to use the rank correlation test, we must transform the original data to ordinal type data. The following pre-process is necessary.

Pre-process: Transforming , X, and Y to ordinal type. We use the following methods:

2 1,R R

1. RANK:

Let X ,...,₁ X_Nbe continuous data. R_i is the rank of X . We use _i

⎥⎦⎤

⎢⎣⎡ K N Int Rⁱ

/ to translate the original continuous data to a discrete ordinal data type with K classes.

2. RANGE:

Let X ....₁ X_Nbe continuous data. A= max (X ....₁ X_N), B=min (X ....₁ X_N). We use B K

A B

Int Xⁱ ×

⎥⎦⎤

⎢⎣⎡ × − + ×

−

− ₋7 ₋8

10 5 ) 10 1

( to translate the original continuous data to discrete ordinal type data with K classes.

Step one: Relation test.

We use those two nonparametric correlations (Kendall’s Tau and Spearman’s Rho) to perform the rank correlation test. The null hypothesis is that the coefficient (Kendall’s Tau or Spearman’s Rho) is zero. We use the signs of those two coefficients to indicate the way of connection (repress (-) or activate (+)). Next we use Bonferroni correction to combine the above results.

The rank correlation test is a distribution-free test that determines whether there is a monotonic relation between two variables. A monotonic relation exists when any increase in one variable is invariably associated with either an increase or a decrease in the other variables.

Bonferroni Correction (Sidak, 1968, 1971) [21, 22]

The following is the Bonferroni general inequality:

1 1

( ) 1 [

g g

i i i

P A P A

= =

≥ −

∑

I

ⁱ^], (3.7)

where Ai and its complement are any events, g is the number of statements or comparisons in the finite set. In particular, if each Ai is the event that a calculated confidence interval for a particular linear combination of treatments includes the true value of that combination, then the left-hand side of the inequality is the probability that all the confidence intervals simultaneously cover their respective true values. The right-hand side is one minus the sum of the probabilities of each of the intervals missing their true values. Therefore, if simultaneous multiple interval estimates are desired with an overall confidence coefficient 1- , one can construct each interval with confidence coefficient (1- /g), and the Bonferroni inequality ensures that the overall confidence coefficient is at least 1- .

In our simulations, we use 0.0975 for α . So if we apply a significance level of 0.05 to each of the two tests, there is now only a 5% chance that any of them will be

在文檔中估計基因網路及基因間函數結構之統計方法 (頁 12-0)