Identification algorithm with noisy array

In subsection 4.1, we discussed an identification method for data without noise. In this subsection, we consider noisy array data. We assume that every entry of the array (y_{ij}), j = 1, 2, ..., m, switches to its reverse status independently with misclassification probability p; that is,

\[
x_{ij} =
\begin{cases}
y_{ij} & \text{with probability } 1 - p; \\
1 - y_{ij} & \text{with probability } p.
\end{cases}
\tag{6}
\]

Thus, the observed array (x_{ij}) contains misclassification errors. Our goal is to reconstruct directed acyclic Boolean networks from a noisy array of binary data (x_{ij}).
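The noise model of equation (6) can be illustrated with a minimal Python sketch. The function name `add_misclassification_noise` and the `seed` argument are hypothetical and chosen only for this illustration; the array is assumed to be a list of rows of 0/1 values.

```python
import random


def add_misclassification_noise(y, p, seed=None):
    """Flip each binary entry of y independently with probability p, as in equation (6)."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so p = 0 never flips and p = 1 always flips.
    return [[1 - v if rng.random() < p else v for v in row] for row in y]


# Hypothetical noise-free array (y_ij) with two elements and four samples.
y = [[0, 1, 1, 0],
     [1, 1, 0, 0]]
x = add_misclassification_noise(y, p=0.1, seed=0)
```

Each entry is treated independently, so a pair of entries is observed without error with probability (1 - p)^2, which is where the splitting probabilities of Table 5 come from.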

In the first step, we investigate every pair of elements for possible relationships.

Next, we use the probabilistic model of equation (6) to estimate the misclassification probability p. We treat the data in the 2 × 2 table as a multinomial distribution with four cells whose probabilities are q_{00}, q_{01}, q_{10}, q_{11}, respectively, where q_{00} + q_{01} + q_{10} + q_{11} = 1.

The observed data n_{00}, n_{01}, n_{10}, n_{11} are generated from the multinomial distribution with probabilities r_{00}, r_{01}, r_{10}, r_{11}, where r_{00} + r_{01} + r_{10} + r_{11} = 1. The relationship between q_{ij} and r_{ij} is displayed in Table 5 and explained below.

Table 4 Splitting counts caused by misclassification error

(v_i, v_j)   Observed
Actual       00          01          10          11          Total
00           m_{00,00}   m_{00,01}   m_{00,10}   m_{00,11}   m_{00}
01           m_{01,00}   m_{01,01}   m_{01,10}   m_{01,11}   m_{01}
10           m_{10,00}   m_{10,01}   m_{10,10}   m_{10,11}   m_{10}
11           m_{11,00}   m_{11,01}   m_{11,10}   m_{11,11}   m_{11}
Total        n_{00}      n_{01}      n_{10}      n_{11}      n

Table 5 Splitting probabilities caused by the misclassification error

(v_i, v_j)   Observed
Actual       00                             01                             10                             11                             Total
00           q_{00,00} = (1 - p)^2 q_{00}   q_{00,01} = p(1 - p) q_{00}    q_{00,10} = p(1 - p) q_{00}    q_{00,11} = p^2 q_{00}         q_{00}
01           q_{01,00} = p(1 - p) q_{01}    q_{01,01} = (1 - p)^2 q_{01}   q_{01,10} = p^2 q_{01}         q_{01,11} = p(1 - p) q_{01}    q_{01}
10           q_{10,00} = p(1 - p) q_{10}    q_{10,01} = p^2 q_{10}         q_{10,10} = (1 - p)^2 q_{10}   q_{10,11} = p(1 - p) q_{10}    q_{10}
11           q_{11,00} = p^2 q_{11}         q_{11,01} = p(1 - p) q_{11}    q_{11,10} = p(1 - p) q_{11}    q_{11,11} = (1 - p)^2 q_{11}   q_{11}
Total        r_{00}                         r_{01}                         r_{10}                         r_{11}                         1

Because of the misclassification error, a portion of the m_{00} samples may move to the other three cells. We write m_{00,00}, m_{00,01}, m_{00,10}, m_{00,11} for the counts in the four observed cells that originate from m_{00}. Analogous notation is defined for m_{01}, m_{10} and m_{11}. Consequently, the splitting probabilities are calculated from the generating probabilities (q_{00}, q_{01}, q_{10}, q_{11}) as follows: q_{ij,kl} = p^{|i-k|+|j-l|} (1 - p)^{2-|i-k|-|j-l|} q_{ij}, where the exponent |i-k| + |j-l| is the number of entries of the pair that are flipped. Here, we adopt the notation q_{ij,kl} analogous to m_{ij,kl}. The above parameters and splits are shown in Table 4 and Table 5. From these two tables, it is easy to see that the correspondence between the two sets of counts and probabilities is the following:







\[
\begin{aligned}
n_{kl} &= \sum_{i,j=0,1} m_{ij,kl}, & r_{kl} &= \sum_{i,j=0,1} q_{ij,kl}; \\
\text{and} \quad m_{ij} &= \sum_{k,l=0,1} m_{ij,kl}, & q_{ij} &= \sum_{k,l=0,1} q_{ij,kl}.
\end{aligned}
\tag{7}
\]
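As a small illustration of Table 5 and equation (7), the following Python sketch computes the splitting probabilities q_{ij,kl} and the observed-cell probabilities r_{kl} from given values of p and q_{ij}. The function names and the dictionary layout (cells keyed by tuples such as `(0, 1)`) are assumptions made for this example only.

```python
from itertools import product

CELLS = list(product((0, 1), repeat=2))  # (0,0), (0,1), (1,0), (1,1)


def splitting_probabilities(q, p):
    """Return {((i, j), (k, l)): q_{ij,kl}} for all actual/observed cell pairs (Table 5)."""
    split = {}
    for (i, j), (k, l) in product(CELLS, CELLS):
        n_flips = abs(i - k) + abs(j - l)  # number of misclassified entries
        split[(i, j), (k, l)] = p ** n_flips * (1 - p) ** (2 - n_flips) * q[(i, j)]
    return split


def observed_probabilities(q, p):
    """Return {(k, l): r_kl}, the column sums of Table 5 (equation (7))."""
    split = splitting_probabilities(q, p)
    return {kl: sum(split[ij, kl] for ij in CELLS) for kl in CELLS}


# Hypothetical triangular model with q_01 = 0 and 10% misclassification.
q = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.2, (1, 1): 0.3}
r = observed_probabilities(q, p=0.1)
```

Even though q_{01} = 0 in this example, r_{01} is positive: noise moves mass from the other three cells into the cell that the hypothesis forbids, which is what the estimate of p later measures.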

For the complete data {m_{i j,kl}}, the log-likelihood is given by

\[
L = \sum_{i,j,k,l=0,1} m_{ij,kl} \log q_{ij,kl}, \tag{8}
\]

where q_{ij,kl} are the splitting probabilities. Since the complete data {m_{ij,kl}} are not observable, we use the EM algorithm to maximize the log-likelihood. In the E-step, the splitting counts of the complete data {m_{ij,kl}} are evaluated by their conditional expectations given the current values of the parameters, using the following formula:

\[
E_{p,\,q_{00},\,q_{01},\,q_{10},\,q_{11}}\left(m_{ij,kl} \mid n_{kl}\right)
= \frac{n_{kl}\, q_{ij,kl}}{\sum_{i',j'=0,1} q_{i'j',kl}}, \tag{9}
\]

where i, j, k, l = 0, 1. Under each of the hypotheses specified in Table 6, one or two of the probabilities q_{00}, q_{01}, q_{10}, q_{11} are zero. In the M-step, we maximize the conditional expectation of the log-likelihood for the complete data to obtain the maximum likelihood estimates (MLEs) of the parameters. From the MLEs, we can compute the p-score or s-score for every pair of elements, defined as the estimate of the misclassification probability under a prerequisite or similar relationship, respectively.
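The EM iteration described above can be sketched in Python as follows. The E-step follows equation (9). The text does not spell out the M-step updates, so the closed forms used here (q_{ij} = m_{ij}/n and p = expected number of flipped entries divided by 2n, obtained by maximizing (8) with q_{ij,kl} = p^{f}(1 - p)^{2-f} q_{ij}) are an assumption of this sketch, as are the names `em_fit` and `flips`, the starting value `p0`, and the stopping rule.

```python
from itertools import product

CELLS = list(product((0, 1), repeat=2))  # (0,0), (0,1), (1,0), (1,1)


def flips(ij, kl):
    """Number of entries that differ between actual cell ij and observed cell kl."""
    return abs(ij[0] - kl[0]) + abs(ij[1] - kl[1])


def em_fit(n, zero_cells, p0=0.1, max_iter=1000, tol=1e-12):
    """Estimate (p, q) from observed counts n under the hypothesis q[c] = 0 for c in zero_cells.

    n          -- dict mapping each observed cell (k, l) to its count n_kl
    zero_cells -- set of actual cells whose probability is fixed at zero (Table 6)
    p0, max_iter, tol -- illustrative defaults, not taken from the source
    """
    free = [c for c in CELLS if c not in zero_cells]
    q = {c: (1.0 / len(free) if c in free else 0.0) for c in CELLS}
    p = p0
    total = sum(n.values())
    for _ in range(max_iter):
        # E-step, equation (9): split each observed count n_kl over the actual cells.
        m = {}
        for kl in CELLS:
            w = {ij: p ** flips(ij, kl) * (1 - p) ** (2 - flips(ij, kl)) * q[ij]
                 for ij in CELLS}
            denom = sum(w.values())
            for ij in CELLS:
                m[ij, kl] = n[kl] * w[ij] / denom if denom > 0 else 0.0
        # M-step (assumed closed form): cell proportions and average flip rate.
        q_new = {ij: sum(m[ij, kl] for kl in CELLS) / total for ij in CELLS}
        p_new = sum(flips(ij, kl) * m[ij, kl]
                    for ij in CELLS for kl in CELLS) / (2 * total)
        done = abs(p_new - p) < tol
        p, q = p_new, q_new
        if done:
            break
    return p, q


# Hypothetical counts; the estimate of p under q_01 = 0 is the score for v_i ≺ v_j.
counts = {(0, 0): 3760, (0, 1): 340, (1, 0): 3040, (1, 1): 2860}
p_hat, q_hat = em_fit(counts, zero_cells={(0, 1)})
```

Cells constrained to zero stay at zero automatically, because their E-step weights vanish. Like any EM iteration, this sketch may converge to a local maximum, so the starting value can matter.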

Table 6 The six basic relationships and their corresponding probabilistic hypotheses and scores

Relation                   Hypothesis             Score
v_i ≺ \bar{v}_j            q_{00} = 0             p_{v_i ≺ \bar{v}_j}
v_i ≺ v_j                  q_{01} = 0             p_{v_i ≺ v_j}
\bar{v}_i ≺ \bar{v}_j      q_{10} = 0             p_{\bar{v}_i ≺ \bar{v}_j}
\bar{v}_i ≺ v_j            q_{11} = 0             p_{\bar{v}_i ≺ v_j}
v_i ∼ v_j                  q_{01} = q_{10} = 0    s_{v_i ∼ v_j}
v_i ∼ \bar{v}_j            q_{00} = q_{11} = 0    s_{v_i ∼ \bar{v}_j}

In the first step, we determine the most probable relationships between elements and select candidate pairs of genes for the watch list. Next, we reconstruct a directed acyclic Boolean network by integrating the relationships among the selected genes.

For a pair of genes v_i and v_j, we define the p-scores p_{v_i ≺ \bar{v}_j}, p_{v_i ≺ v_j}, p_{\bar{v}_i ≺ \bar{v}_j}, p_{\bar{v}_i ≺ v_j} as the maximum likelihood estimates of p under the triangular models q_{00} = 0, q_{01} = 0, q_{10} = 0, q_{11} = 0, respectively. The s-scores s_{v_i ∼ v_j} and s_{v_i ∼ \bar{v}_j} are the maximum likelihood estimates of p under the diagonal models q_{01} = q_{10} = 0 and q_{00} = q_{11} = 0, respectively.

Using the EM algorithm described above, we can evaluate the s-scores and p-scores for every pair of elements. We use the MLE \hat{p} to measure how well each hypothesis fits: the smaller the score, the stronger the evidence that the corresponding hypothesis is true.

For each pair of elements, we find the diagonal model that has the smaller s-score and the triangular model that has the smallest p-score. Then we evaluate their BIC values by

\[
\mathrm{BIC} = -\log(\text{likelihood}) + \frac{d \log n}{2},
\]

where d is the number of parameters for one possible relationship and n is the number of samples. We treat the model with the smaller BIC value as the most probable relationship for the pair of elements, and the s-p-score is defined as the corresponding score under that model.
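The BIC comparison can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements from the text: the likelihood is taken to be the observed-data multinomial likelihood sum of n_{kl} log r_{kl}, and the parameter counts are d = 3 for a triangular model (p plus two free cell probabilities) and d = 2 for a diagonal model (p plus one free cell probability). The function names are hypothetical.

```python
import math
from itertools import product

CELLS = list(product((0, 1), repeat=2))  # (0,0), (0,1), (1,0), (1,1)


def observed_log_likelihood(n, p, q):
    """Multinomial log-likelihood of observed counts n (up to an additive constant)."""
    ll = 0.0
    for (k, l) in CELLS:
        r_kl = sum(p ** (abs(i - k) + abs(j - l))
                   * (1 - p) ** (2 - abs(i - k) - abs(j - l)) * q[(i, j)]
                   for (i, j) in CELLS)
        if n[(k, l)] > 0:
            if r_kl <= 0:
                return float("-inf")  # observed cell is impossible under (p, q)
            ll += n[(k, l)] * math.log(r_kl)
    return ll


def bic(n, p, q, d):
    """BIC = -log-likelihood + d * log(n) / 2, with n the total sample count."""
    total = sum(n.values())
    return -observed_log_likelihood(n, p, q) + d * math.log(total) / 2


# Hypothetical fitted models for one pair; keep the relationship with the smaller BIC.
counts = {(0, 0): 50, (0, 1): 0, (1, 0): 0, (1, 1): 50}
candidates = {
    "diagonal q01=q10=0": bic(counts, 0.0, {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}, d=2),
    "triangular q01=0": bic(counts, 0.0, {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}, d=3),
}
best = min(candidates, key=candidates.get)
```

When both models fit the counts equally well, as in this toy example, the penalty term favors the diagonal model because it has fewer parameters.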

Next, we rank the s-p-scores of all pairs of elements in ascending order. The smaller the s-p-score is, the more likely the relationship is to be true.

If the samples are generated from a directed acyclic Boolean network, s-p-scores are quite useful for the discovery of pairwise relationships. Here we could consider the maximum compatibility criterion: choose the maximum threshold value such that the selected relationships contain no conflicts [20]. We collect those relationships whose s-p-scores are smaller than a threshold. Known biological results could be helpful for the determination of a threshold. For example, if we know that the relationship v_1 ≺ v_3 is true, then the relationships whose s-p-scores are smaller than p_{v_1 ≺ v_3} should be in our watch list. As more relationships are included in the watch list, we are more likely to observe incompatible ones. In general, we can choose the threshold that admits the maximum number of relationships without any conflicting relationships.
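The threshold selection can be sketched in Python as a scan over the relationships in ascending order of s-p-score that stops at the first conflict. The text does not define what counts as a conflict, so the predicate is passed in by the caller. The example predicate `has_parity_conflict` is purely illustrative: it handles only similarity relationships and flags logically inconsistent sets such as a ∼ b, b ∼ c, a ∼ not c. All names here are hypothetical.

```python
def select_watch_list(scored, has_conflict):
    """Return the longest conflict-free prefix of relationships in ascending score order.

    scored       -- iterable of (s_p_score, relationship) pairs
    has_conflict -- caller-supplied predicate on a list of relationships
    """
    watch_list = []
    for _, rel in sorted(scored, key=lambda t: t[0]):
        if has_conflict(watch_list + [rel]):
            break  # the threshold lies just below this score
        watch_list.append(rel)
    return watch_list


def has_parity_conflict(rels):
    """Illustrative check for relationships (u, v, same): u ~ v if same, else u ~ not v."""
    parent, parity = {}, {}  # parity[x]: 0 if x equals its parent, 1 if it is the negation

    def find(x):
        if x not in parent:
            parent[x], parity[x] = x, 0
        if parent[x] == x:
            return x, 0
        root, parent_parity = find(parent[x])
        parity[x] ^= parent_parity  # parity is now relative to the root
        parent[x] = root
        return root, parity[x]

    for u, v, same in rels:
        (ru, pu), (rv, pv) = find(u), find(v)
        want = 0 if same else 1
        if ru == rv:
            if pu ^ pv != want:
                return True  # contradicts relationships already accepted
        else:
            parent[ru] = rv
            parity[ru] = pu ^ pv ^ want
    return False


scored = [(0.01, ("a", "b", True)), (0.02, ("b", "c", True)),
          (0.05, ("a", "c", False)), (0.06, ("c", "d", True))]
watch = select_watch_list(scored, has_parity_conflict)
```

Stopping at the first conflict, rather than skipping the conflicting relationship and continuing, matches the idea of a single threshold on the s-p-score.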

We now evaluate the computational complexity of the SPAN statistical reconstruction method described above. The key procedure is the computation of the s-p-score for every pair of elements. If the number of elements is m, there are \binom{m}{2} pairs of elements in total, and the complexity of computing the MLEs is O(m^2). We can rank the s-p-scores of all pairs of elements in O(m^2 log m) time. Thus, the time complexity of this statistical reconstruction algorithm is O(m^2 log m) and the memory complexity is O(m^2), as described in [20].

5 Conclusion

We have introduced a variety of models, including classical Boolean networks, probabilistic Boolean networks and directed acyclic Boolean networks, for dealing with genetic regulatory networks. These variants of Boolean networks can be used in the exploration of large genetic networks because of the simple structure of Boolean networks. Based on the reconstruction of Boolean networks, more flexible models, like Bayesian networks, can be applied to investigate more complex problems.

There are several advantages in estimating gene regulatory networks with Boolean networks. First, a variety of software packages have recently been developed for constructing Boolean networks. Matlab implementations of a classical Boolean network toolbox and of probabilistic Boolean networks were developed in [25] and [27]. Moreover, Li and Lu provided a Matlab implementation of the s-p-scoring method [20]. Other genetic regulatory network tools, such as NetBuilder for simulating genetic Boolean networks, are also available [24]. Second, recent research indicates that various complex biological processes can be described by seemingly simplistic Boolean formalisms [34, 35]. The dynamic behaviors of living systems can be explained effectively by Boolean networks [9, 30]. Moreover, for large-scale gene regulatory networks, Kim et al. [17] have used Boolean networks with the chi-square test on yeast cell cycle microarray gene expression data sets. Kauffman et al. [16] have used random Boolean networks to obtain possible interaction rules for transcriptional network models in yeast. Furthermore, the dynamic behaviors of cellular states are also represented by attractors in Boolean networks in [9].

One characteristic of a Boolean network is that all the variables in the graph are binary. If the observed data are continuous or quantized to more than two levels, we need to discretize them. For microarray data, thresholding the expression ratios is one possible approach to discretization. That is, we can treat a gene as on (active) if the log-ratio of its expression is larger than zero and off (inactive) otherwise. In general, biological background knowledge will be helpful for setting discretization thresholds. On the other hand, if the samples are obtained from a time course, then we can consider a gene as on or off according to whether its expression is increasing or decreasing with time.
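Both discretization rules can be written in a few lines of Python. This is a minimal sketch: the function names are hypothetical, and treating an unchanged value in a time course as off is an assumption, since the text only distinguishes increasing from decreasing.

```python
def discretize_log_ratio(log_ratios, threshold=0.0):
    """Return 1 (on) where the log-ratio exceeds the threshold, else 0 (off)."""
    return [1 if x > threshold else 0 for x in log_ratios]


def discretize_time_course(values):
    """Return 1 (on) where expression increased from the previous time point, else 0 (off).

    The result has one entry fewer than the input, because the first
    time point has no predecessor to compare with.
    """
    return [1 if later > earlier else 0 for earlier, later in zip(values, values[1:])]


states = discretize_log_ratio([-1.2, 0.0, 0.3])
trend = discretize_time_course([1.0, 2.0, 1.5, 1.5])
```

The `threshold` argument is the place where biological background knowledge would enter; zero corresponds to the log-ratio rule stated above.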

For future developments on Boolean networks, we can consider more complicated structures such as Boolean networks with time delay. Furthermore, we can develop models of Boolean networks that have more flexible structures than the models proposed in the literature. Since Boolean network models have been shown to be useful for reconstructing genetic networks from real biological gene expression profiles, the evaluation of the effectiveness of Boolean network models will be an important task in the future.

Acknowledgements The authors would like to thank Yang Wang and Arthur Tu for their English editing. This research was partially supported by the National Science Council, National Center for Theoretical Sciences and Center of Mathematical Modeling and Scientific Computing (CMMSC) at the National Chiao Tung University in Taiwan.

References

[1] Akutsu, T., Kuhara, S., Maruyama, O., Miyano, S.: Identification of gene regulatory networks by strategic gene disruptions and gene overexpression. Proc. 9th ACM-SIAM Symp. Discrete Algorithms pp. 695–702 (1998)

[2] Akutsu, T., Miyano, S.: Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pacific Symposium on Biocomputing 4, 17–28 (1999)

[3] Akutsu, T., Miyano, S., Kuhara, S.: Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics 16, 727–734 (2000)

[4] Bornholdt, S.: Less is more in modeling large genetic networks. Science 310(5747), 449–451 (2005)

[5] Dougherty, E.R., Kim, S., Chen, Y.: Coefficient of determination in nonlinear signal processing. Signal Processing 80, 2219–2235 (2000)

[6] Friedman, N., Linial, M., Nachman, I., Pe’er, D.: Using Bayesian networks to analyze expression data. Journal of Computational Biology 7, 601–620 (2000)

[7] Harvey, I., Bossomaier, T.: Time out of joint: attractors in asynchronous random Boolean network. Proceedings of the Fourth European Conference on Artificial Life pp. 67–75 (1997)

[8] Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning 20, 197–243 (1995)

[9] Huang, S.: Gene expression profiling, genetic networks and cellular states: An integrating concept for tumorigenesis and drug discovery. Journal of Molecular Medicine 77, 469–480 (1999)

[10] Imoto, S., Goto, T., Miyano, S.: Estimation of genetic networks and functional structures between genes by using Bayesian network and nonparametric regression. Pacific Symposium on Biocomputing 7, 175–186 (2002)

[11] Imoto, S., Higuchi, T., Goto, T., Tashiro, K., Kuhara, S., Miyano, S.: Combining microarrays and biological knowledge for estimating gene networks via Bayesian networks. Journal of Bioinformatics and Computational Biology 2, 77–98 (2004)

[12] Jensen, F.V.: An introduction to Bayesian networks. University College London Press, London (1996)

[13] Jensen, F.V.: Bayesian networks and decision graphs. Springer, New York (2001)

[14] Kauffman, S.A.: Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology 22(3), 437–467 (1969)

[15] Kauffman, S.A.: The Origins of Order: self-organization and selection in evolution. Oxford University Press, New York (1993)

[16] Kauffman, S.A., Peterson, C., Samuelsson, B., Troein, C.: Random Boolean network models and the yeast transcriptional network. Biophysics 100(25), 14,796–14,799 (2003)

[17] Kim, H., Lee, J.K., Park, T.: Boolean networks using the chi-square test for inferring large-scale gene regulatory networks. BMC Bioinformatics 8, 37 (2007)

[18] Kim, S., Dougherty, E.R., Chen, Y., Sivakumar, K., Meltzer, P., Trent, J.M., Bittner, M.: Multivariate measurement of gene expression relationships. Genomics 67, 201–209 (2000)

[19] Laubenbacher, R., Stigler, B.: A computational algebra approach to the reverse engineering of gene regulatory networks. Journal of Theoretical Biology 299, 523–537 (2004)

[20] Li, L.M. and Lu, H.H.-S.: Explore biological pathways from noisy array data by directed acyclic Boolean networks. Journal of Computational Biology 12(2), 170–185 (2005)

[21] Liang, S., Fuhrman, S., Somogyi, R.: REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. Pacific Symposium on Biocomputing 3, 18–29 (1998)

[22] Moler, E.J., Radisky, D.C., Mian, I.S.: Integrating naive Bayes models and external knowledge to examine copper and iron homeostasis in S. cerevisiae. Physiol Genomics 4(2), 127–135 (2000)

[23] Pearl, J.: Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, San Mateo (1988)

[24] Schilstra, M.J., Bolouri, H.: Modeling the regulation of gene expression in genetic regulatory networks. URL http://strc.herts.ac.uk/bio/maria/NetBuilder

[25] Schwarzer, C.: Matlab random Boolean network toolbox. Swiss Federal Institute of Technology Lausanne (EPFL) (2003). URL http://www.teuscher.ch/rbntoolbox/

[26] Shannon, C.E., Weaver, W.: The mathematical theory of communication. University of Illinois Press (1963)

[27] Shmulevich, I., Dougherty, E.R., Kim, S., Zhang, W.: Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 18(2), 261–274 (2002)

[28] Shmulevich, I., Dougherty, E.R., Zhang, W.: From Boolean to probabilistic Boolean networks as models of genetic regulatory networks. Proceedings of the IEEE 90(11), 1778–1792 (2002)

[29] Shmulevich, I., Dougherty, E.R., Zhang, W.: Gene perturbation and intervention in probabilistic Boolean networks. Bioinformatics 18(10), 1319–1331 (2002)

[30] Shmulevich, I., Gluhovsky, I., Hashimoto, R.F., Dougherty, E.R., Zhang, W.: Steady-state analysis of genetic regulatory networks modelled by probabilistic Boolean networks. Comparative and Functional Genomics 4, 601–608 (2003)

[31] Somogyi, R., Sniegoski, C.A.: Modeling the complexity of genetic networks: Understanding multigene and pleiotropic regulation. Complexity 1, 45–63 (1996)

[32] Sontag, E., Veliz-Cuba, A., Laubenbacher, R., Jarrah, A.S.: The effect of negative feedback loops on the dynamics of Boolean networks. Biophysical Journal 95, 518–526 (2008)

[33] Spirtes, P., Glymour, C., Scheines, R.: Causation, prediction and search. MIT Press, Cambridge, MA (2000)

[34] Szallasi, Z., Liang, S.: Modeling the normal and neoplastic cell cycle with "realistic Boolean genetic networks": their application for understanding carcinogenesis and assessing therapeutic strategies. Pacific Symposium on Biocomputing 3, 66–76 (1998)

[35] Thomas, R., Thieffry, D., Kaufman, M.: Dynamical behaviour of biological regulatory networks I. Biological role of feedback loops and practical use of the concept of the loop-characteristic state. Bulletin of Mathematical Biology 57(2), 247–276 (1995)

[36] Wolfram, S.: Statistical mechanics of cellular automata. Reviews of Modern Physics 55(3), 601–644 (1983)

[37] Wolfram, S.: Universality and complexity in cellular automata. Physica 10D 10(1), 1–35 (1984)

[38] Wuensche, A.: Genomic regulation modeled as a network with basins of attraction. Pacific Symposium on Biocomputing 3, 89–102 (1998)
