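Discussion - Application of the partial coefficient of intrinsic dependence (pCID) to gene selection and gene regulatory network construction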

The CID values of Y given one or two predictors indicate roughly what the pCID values should be. For example, pCID(Y |X₁; X₄) is approximately (0.1747 − 0.1176)/(1 − 0.1176) = 0.065 and pCID(Y |X₆; X₄) is approximately (0.1191 − 0.1176)/(1 − 0.1176) = 0.002 (see Table 3.1). The latter is much smaller than the former, reflecting the different degrees to which Y depends on X₁ and X₆.
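The arithmetic in the two examples above can be written as a small function. This is only a sketch of the approximation quoted in the text: the name `approx_pcid` is hypothetical, and it assumes that 0.1747 and 0.1191 are the CID values of Y given both predictors and that 0.1176 is the CID value of Y given the conditioning variable X₄ alone.

```python
def approx_pcid(cid_both: float, cid_conditioning: float) -> float:
    """Approximate pCID(Y | X; Z) from CID(Y | X, Z) and CID(Y | Z).

    The gain from adding X is rescaled by the share of dependence
    that Z alone leaves unexplained.
    """
    return (cid_both - cid_conditioning) / (1.0 - cid_conditioning)


# Values quoted in the text (Table 3.1).
print(round(approx_pcid(0.1747, 0.1176), 3))  # 0.065
print(round(approx_pcid(0.1191, 0.1176), 3))  # 0.002
```

The denominator makes the result a proportion of the remaining dependence rather than a raw difference, which is why the two examples are comparable.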

After the impact of the more dominant variables was eliminated, the signals from the minor variables were amplified, and the pCID values gradually increased as the number of conditioning variables increased.

The order in which the variables were declared relevant also hinted at their order of priority in statistical dependence. Linear dependence was favored over nonlinear dependence: X₄ was selected ahead of X₁ and X₂ even though 10X₄ and 10 sin(πX₁X₂) contribute the same range to Y in Model (3.1). However, the influence of X₂ (or X₁) was stronger than that of X₅, which had only half the impact of X₄ on Y in the model. X₃ and X₅ had similar CID and pCID values (see Table 3.1), although the ranges of 30(X₃ − 0.5)² and 5X₅ are [0, 7.5] and [0, 5], respectively; relative to the range of its term, X₅ was therefore 1.5 times ‘more influential’ on Y than X₃. pCID values can thus serve as indicators of, and may even quantify, different types of curvilinearity in statistical dependence.

With a relatively large sample size (N = 100), 96% of the simulations correctly selected more than four of the five relevant variables, while the irrelevant variable X₆ was falsely included in only three simulations (Figure 3.6A). With a moderate sample size (N = 50), 22% of the simulations picked all five relevant variables and 41% picked four relevant variables; X₄ was never missed, but X₃ and X₅ were missed in about 20% of the simulations (Figure 3.6B). In addition, about 20% of the simulations declared only X₁, X₂, and X₄ significant (Figure 3.6B).

With a small sample size (N = 25), CID/pCID lost sensitivity in detecting X₅ (79% missed), X₃ (78% missed), X₁ (51% missed), X₂ (44% missed), and X₄ (17% missed) (Figure 3.6C). However, X₆ was selected in 8% of the simulations, which is close to the nominal α = 0.05.

Figure 3.6: Number of times each variable X_i (i = 1, 2, 3, 4, 5, 6) was selected in 100 simulated samples of size (A) N = 100, (B) N = 50, or (C) N = 25 from the model Y = 10 sin(πX₁X₂) + 30(X₃ − 0.5)² + 10X₄ + 5X₅ + ε, where the X_i’s were distributed as U(0, 1) and ε was distributed as N(0, 1).
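For readers who want to reproduce the simulation setting, the model in the caption of Figure 3.6 can be sampled as follows. This is a minimal sketch, not the code used in the study: the function name `simulate_model` and the seed handling are hypothetical, and N(0, 1) is read as the standard normal distribution.

```python
import math
import random


def simulate_model(n: int, seed: int = 0):
    """Draw n samples (x, y) with x = [X1, ..., X6] and y from the model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # X1..X6 ~ U(0, 1); X6 does not enter the model (irrelevant variable).
        x = [rng.random() for _ in range(6)]
        eps = rng.gauss(0.0, 1.0)  # noise term, assumed standard normal
        y = (10 * math.sin(math.pi * x[0] * x[1])
             + 30 * (x[2] - 0.5) ** 2
             + 10 * x[3]
             + 5 * x[4]
             + eps)
        rows.append((x, y))
    return rows


sample = simulate_model(50, seed=1)
print(len(sample))  # 50
```

Fixing the seed makes a replicate reproducible; drawing 100 replicates per sample size would only require looping over 100 different seeds.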

Chapter 4

Application to gene regulatory network

The gene regulation events that occur under a given condition serve as the building blocks of the entire gene regulation network (GRN), which may be reconstructed by connecting multiple regulation modules. An inferred GRN can therefore provide insight into the relationships between the genes of interest in an experiment and help in understanding the biological functions behind complex biological phenomena. More specifically, an inferred GRN, consisting of nodes (representing genes) and edges (representing significant gene-gene interactions), reflects the gene regulation events that may occur concurrently or sequentially under the condition of study. In this study, we focus on inferring a GRN from the results of microarray experiments. This is usually achieved by (1) identifying a pair of significantly associated genes, (2) elongating the regulation path from the gene pair, and then (3) assembling all identified paths to form the complex GRN (Figure 4.1).

This study aims to infer causality in a GRN using CID. A causal connection between a pair of nodes means that one is the origin (source) and the other is the consequence (target) of the association. Such a cause-and-effect relationship is usually expected between a transcription factor (TF) and its target genes, and is usually drawn as a directed edge in the network. Compared with a co-expression GRN (i.e., a network with undirected edges), a cause-and-effect GRN requires more information to assign a direction to each edge. The direction is typically assigned according to known biological evidence, which may not always be available. In this study, we use the asymmetric property of CID (i.e., CID(Y |X) is not necessarily equal to CID(X|Y )) to identify not only the associated gene pairs but also the cause and the effect in a gene regulation event. Asymmetry is a distinctive feature of CID, whereas some conventional methods, including PCC, pPCC and MI, provide symmetric results when considering the association between two variables. More specifically, gene Y is designated as the source and gene X as the target in the GRN if CID(Y |X) > CID(X|Y ).

Figure 4.1: Diagram of the gene regulatory network inference workflow. (A) Identification of a significantly associated gene pair. (B) Regulation path elongation. (C) Assembly of all identified regulation paths.

The pCID method can identify relevant genes in the elongation step. Ideally, a proper stepwise procedure iteratively picks relevant genes according to the magnitude of their association with the target until no further gene significantly increases the amount of association. For example, in Figure 4.1B, CID(Source A|Target A1) would be significant, and we would also expect a significant CID(Source A|Target A1, Target A2) but an insignificant CID(Source A|Target A1, X) for an irrelevant gene X. However, because of the dominant effect of the most influential gene selected in the first step, i.e., Target A1, CID(Source A|Target A1, X) was significant in most cases (see Section 3). The pCID resolves this problem by decomposing only the information in the target variable that is not explained by the first predictor.

4.1 Construction of gene regulatory network by CID/pCID

The inference of a GRN has three steps (Figure 4.1). However, because of the large number of genes monitored simultaneously in a microarray experiment, we developed the following heuristic approach for the first two steps, which is illustrated in Figure 4.2. Given a source gene T₀, CID(T₀|T_i) was computed for each candidate target gene T_i in the first step. The candidate target genes may be all other genes in the same microarray dataset or a user-defined set. To reduce the computational burden, candidate target genes with an insignificant CID(T₀|T_i) value (p-value > 0.05) were considered irrelevant and were excluded from the following steps. Accordingly, the source gene T₀ was discarded as the origin of a regulation path when all CID(T₀|T_i) values were insignificant in the first run. Otherwise, if CID(T₀|T₍₁₎) had the single smallest significant p-value among the results from all candidate target genes, we connected the source gene T₀ and the target gene T₍₁₎. If more than one CID(T₀|T_i) value shared the smallest significant p-value, we selected as T₍₁₎ the gene with the maximum CID(T₀|T_i) value among them.

The direction between the source gene T₀ and the target gene T₍₁₎ was decided by comparing the significance of CID(T₀|T₍₁₎) and CID(T₍₁₎|T₀). If CID(T₀|T₍₁₎) was more significant than CID(T₍₁₎|T₀), or if the two CID values had equal p-values and the CID(T₀|T₍₁₎) value was larger than the CID(T₍₁₎|T₀) value, the direction was from T₀ to T₍₁₎; otherwise, the direction was from T₍₁₎ to T₀. The gene pair (T₀, T₍₁₎) then proceeded to the elongation step.
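The first-step rules above (screening at p-value > 0.05, choosing the smallest p-value, breaking ties by the largest CID value, then comparing the two directions) can be sketched as follows. The function names `pick_first_target` and `t0_is_source` are hypothetical, and the CID estimates and p-values are assumed to have been computed beforehand; the demonstration reuses the first-step values reported in Table 4.1 for the source gene A11.

```python
ALPHA = 0.05


def pick_first_target(cid_given, alpha=ALPHA):
    """cid_given maps each candidate Ti to (CID(T0|Ti), p-value).

    Returns the selected T(1), or None when every value is insignificant
    (T0 is then discarded as the origin of a regulation path).
    """
    significant = {g: vp for g, vp in cid_given.items() if vp[1] <= alpha}
    if not significant:
        return None
    # Smallest p-value first; ties are broken by the largest CID value.
    return min(significant,
               key=lambda g: (significant[g][1], -significant[g][0]))


def t0_is_source(forward, reverse):
    """forward = (CID(T0|T1), p-value); reverse = (CID(T1|T0), p-value)."""
    (f_val, f_p), (r_val, r_p) = forward, reverse
    return f_p < r_p or (f_p == r_p and f_val > r_val)


# First-step values for source gene A11, taken from Table 4.1.
cid_a11 = {"A21": (0.1936, 0.0010), "A22": (0.2028, 0.0010),
           "A31": (0.1612, 0.0010), "A32": (0.1281, 0.0010),
           "B": (0.0129, 0.4136)}
first = pick_first_target(cid_a11)
print(first)                                            # A22
print(t0_is_source(cid_a11[first], (0.1791, 0.0010)))   # True
```

With these values the sketch selects A22 and directs the edge from A11 to A22, which agrees with the worked example in Section 4.2.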

In the elongation step, pCID(T₀|T_j; T₍₁₎) and pCID(T₍₁₎|T_j; T₀) were computed for each remaining candidate target gene T_j to identify the second relevant target gene, T₍₂₎ (Figure 4.2). If all pCID(T₀|T_j; T₍₁₎) and pCID(T₍₁₎|T_j; T₀) values were insignificant, the regulation path stopped and the network consisted of the two nodes (T₀, T₍₁₎). Otherwise, the process continued, and there were two routes by which the regulation path could be extended. If more than one pCID(T₀|T_j; T₍₁₎) or pCID(T₍₁₎|T_j; T₀) value shared the smallest significant p-value among the results from all remaining candidate target genes, we selected as T₍₂₎ the gene with the maximum value among them.

In the first route, we connected T₀ and T₍₂₎ if T₍₂₎ was selected on the basis of the pCID(T₀|T₍₂₎; T₍₁₎) value. The direction was decided from the pCID values in the same way as it was from the CID values: it was from T₀ to T₍₂₎ if pCID(T₀|T₍₂₎; T₍₁₎) was more significant than pCID(T₍₂₎|T₀; T₍₁₎), or if the two pCID values had equal p-values and the pCID(T₀|T₍₂₎; T₍₁₎) value was larger than the pCID(T₍₂₎|T₀; T₍₁₎) value; otherwise, it was from T₍₂₎ to T₀.

In the second route, we connected T₍₁₎ and T₍₂₎ if T₍₂₎ was selected on the basis of the pCID(T₍₁₎|T₍₂₎; T₀) value. The direction was from T₍₁₎ to T₍₂₎ if pCID(T₍₁₎|T₍₂₎; T₀) was more significant than pCID(T₍₂₎|T₍₁₎; T₀), or if the two pCID values had equal p-values and the pCID(T₍₁₎|T₍₂₎; T₀) value was larger than the pCID(T₍₂₎|T₍₁₎; T₀) value; otherwise, it was from T₍₂₎ to T₍₁₎. This completed the first run of the elongation.

Figure 4.2: Illustration of the heuristic approach for regulation path elongation.

We now describe the subsequent steps of GRN construction. In the rth run (r ≥ 2) of the elongation, all possible values of pCID(S|T_j; {T₀, T₍₁₎, . . . , T_(r)} \ S) were computed for each remaining candidate gene T_j and each S ∈ {T₀, T₍₁₎, . . . , T_(r)}. If all of these values were insignificant, the regulation path stopped and the network consisted of the r + 1 nodes (T₀, T₍₁₎, . . . , T_(r)). If more than one of these values shared the smallest significant p-value among the results from all remaining candidate target genes, we selected as T_(r+1) the gene with the maximum value among them. The elongation was repeated until all pCID values in the eth run of the elongation were insignificant (p-value > 0.05). The resulting network would contain e + 1 nodes (T₀, T₍₁₎, . . . , T_(e)).

For example, Figure 4.3 illustrates one GRN construction result. Let T₀ be the source gene and the other genes be the candidate target genes.

In Step (0) of Figure 4.3, we computed the CID value of T₀ given each target gene. CID(T₀|T₍₁₎) had the most significant p-value, so we connected T₀ and T₍₁₎; the direction was from T₀ to T₍₁₎ because CID(T₀|T₍₁₎) > CID(T₍₁₎|T₀).

In Step (1), we selected the target gene T₍₂₎, which could be connected with T₀ or T₍₁₎. We therefore computed pCID(T₀|T_j; T₍₁₎) and pCID(T₍₁₎|T_j; T₀), where T_j was one of the remaining genes. pCID(T₀|T₍₂₎; T₍₁₎) had the most significant p-value, so T₍₂₎ was connected with T₀; the direction was from T₀ to T₍₂₎ because pCID(T₀|T₍₂₎; T₍₁₎) > pCID(T₍₂₎|T₀; T₍₁₎).

In Step (2), the next selected gene, T₍₃₎, could be connected with T₀, T₍₁₎ or T₍₂₎. We computed pCID(T₀|T_j; T₍₁₎, T₍₂₎), pCID(T₍₁₎|T_j; T₀, T₍₂₎) and pCID(T₍₂₎|T_j; T₀, T₍₁₎), where T_j was one of the remaining genes. pCID(T₀|T₍₃₎; T₍₁₎, T₍₂₎) had the most significant p-value, so T₍₃₎ was connected with T₀; the direction was from T₍₃₎ to T₀ because pCID(T₍₃₎|T₀; T₍₁₎, T₍₂₎) > pCID(T₀|T₍₃₎; T₍₁₎, T₍₂₎).

In Step (3), the chosen target gene, T₍₄₎, would be connected with one of the previously selected genes (T₀, T₍₁₎, T₍₂₎ and T₍₃₎). We computed pCID(T₀|T_j; T₍₁₎, T₍₂₎, T₍₃₎), pCID(T₍₁₎|T_j; T₀, T₍₂₎, T₍₃₎), pCID(T₍₂₎|T_j; T₀, T₍₁₎, T₍₃₎) and pCID(T₍₃₎|T_j; T₀, T₍₁₎, T₍₂₎), where T_j was one of the remaining genes. pCID(T₍₂₎|T₍₄₎; T₀, T₍₁₎, T₍₃₎) had the most significant p-value, so T₍₄₎ was connected with T₍₂₎; the direction was from T₍₂₎ to T₍₄₎ because pCID(T₍₂₎|T₍₄₎; T₀, T₍₁₎, T₍₃₎) > pCID(T₍₄₎|T₍₂₎; T₀, T₍₁₎, T₍₃₎).

In Step (4), the chosen target gene, T₍₅₎, would be connected with one of the previously selected genes (T₀, T₍₁₎, T₍₂₎, T₍₃₎ and T₍₄₎). We computed pCID(T₀|T_j; T₍₁₎, T₍₂₎, T₍₃₎, T₍₄₎), pCID(T₍₁₎|T_j; T₀, T₍₂₎, T₍₃₎, T₍₄₎), pCID(T₍₂₎|T_j; T₀, T₍₁₎, T₍₃₎, T₍₄₎), pCID(T₍₃₎|T_j; T₀, T₍₁₎, T₍₂₎, T₍₄₎) and pCID(T₍₄₎|T_j; T₀, T₍₁₎, T₍₂₎, T₍₃₎), where T_j was one of the remaining genes. pCID(T₍₂₎|T₍₅₎; T₀, T₍₁₎, T₍₃₎, T₍₄₎) had the most significant p-value, so T₍₅₎ was connected with T₍₂₎; the direction was from T₍₅₎ to T₍₂₎ because pCID(T₍₅₎|T₍₂₎; T₀, T₍₁₎, T₍₃₎, T₍₄₎) > pCID(T₍₂₎|T₍₅₎; T₀, T₍₁₎, T₍₃₎, T₍₄₎).

In the next step, we looked for the next linked gene, T₍₆₎, but all pCID(S|T_j; {T₀, T₍₁₎, . . . , T₍₅₎} \ S) values were insignificant (p-value > 0.05), where S was one of the previously selected genes T₀, T₍₁₎, T₍₂₎, T₍₃₎, T₍₄₎ and T₍₅₎, so the elongation stopped.

Figure 4.3: Illustration of a simple example of regulation path elongation by the CID/pCID method.
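The stepwise procedure described in this section can be sketched as a greedy loop. The sketch below is an illustration under stated assumptions, not the implementation used in the study: `build_path` and `toy_scorer` are hypothetical names, the scorer callback stands in for an actual CID/pCID estimator with its p-value, and the numbers in `TOY` are invented purely to exercise the logic.

```python
from typing import Callable, FrozenSet, List, Tuple

Score = Tuple[float, float]  # (estimate, p-value)
# scorer(s, t, cond): CID(s|t) if cond is empty, else pCID(s|t; cond).
Scorer = Callable[[str, str, FrozenSet[str]], Score]


def build_path(t0: str, candidates: List[str], scorer: Scorer,
               alpha: float = 0.05) -> List[Tuple[str, str]]:
    """Greedy path construction; returns directed edges (source, target)."""
    # First-run screen: drop candidates with an insignificant CID(T0|Ti).
    remaining = [t for t in candidates
                 if scorer(t0, t, frozenset())[1] <= alpha]
    selected = [t0]
    edges: List[Tuple[str, str]] = []
    while remaining:
        best = None
        for s in selected:
            cond = frozenset(selected) - {s}
            for t in remaining:
                value, p = scorer(s, t, cond)
                # Rank by smallest p-value, then by largest estimate.
                if p <= alpha and (best is None or (p, -value) < best[0]):
                    best = ((p, -value), s, t, cond)
        if best is None:  # every value in this run is insignificant
            break
        (p, neg_value), s, t, cond = best
        rev_value, rev_p = scorer(t, s, cond)
        forward = p < rev_p or (p == rev_p and -neg_value > rev_value)
        edges.append((s, t) if forward else (t, s))
        selected.append(t)
        remaining.remove(t)
    return edges


# Hypothetical scores, for illustration only.
TOY = {
    ("T0", "G1", frozenset()): (0.30, 0.001),
    ("G1", "T0", frozenset()): (0.25, 0.001),
    ("T0", "G2", frozenset()): (0.20, 0.001),
    ("T0", "G3", frozenset()): (0.01, 0.600),
    ("T0", "G2", frozenset({"G1"})): (0.02, 0.400),
    ("G1", "G2", frozenset({"T0"})): (0.10, 0.004),
    ("G2", "G1", frozenset({"T0"})): (0.12, 0.002),
}


def toy_scorer(s: str, t: str, cond: FrozenSet[str]) -> Score:
    # Unlisted combinations are treated as insignificant in this toy setup.
    return TOY.get((s, t, cond), (0.0, 1.0))


print(build_path("T0", ["G1", "G2", "G3"], toy_scorer))
# [('T0', 'G1'), ('G2', 'G1')]
```

In the toy run, G3 is screened out in the first run, G1 is attached to T₀, and G2 is attached to G1 with the arrow reversed because the reverse pCID is more significant. Passing the estimator as a callback keeps the selection logic separate from how CID and its p-value are actually computed.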

4.2 Simulation study

The proposed GRN inference procedure was examined in a simulation study. A pseudo network with six nodes (genes) was generated according to a normal mixture model (Figure 4.4). It contained one source node (A11), four target nodes (A21, A22, A31 and A32), and one node (B) independent of the others. The expression levels of nodes A11 and B were randomly generated from the normal distribution with mean and standard deviation both equal to 1, denoted by N(1, 1).

The expression level of a target node was affected by two factors of its direct source: the expression level and the binding efficiency. This was intended to mimic the situations in which (1) the transcription factor is not expressed, so that the target gene is not regulated by the source gene, and (2) the source gene is expressed but the target gene is still not regulated by it because of the varying binding efficiency of the transcription factor. Let S and T denote the direct source and the target gene, respectively. In the simulated network (Figure 4.4), A11 was the direct source of {A21, A22} and A21 was the direct source of {A31, A32}. If the binding efficiency for a pair of S and T was set to 100b%, then 100(1 − b)% of the objects in the sample were not affected by the expression level of S, and their expression levels were generated from N(−1, 0.25). The binding efficiencies (b) for {A11, A21}, {A11, A22}, {A21, A31}, and {A21, A32} were 0.9, 0.7, 0.9, and 0.8, respectively.

For the 100b% of objects in which the regulation did take place, if the expression level of S in the ith sample was s_i, the expression level of T in the ith sample was randomly generated from N(s_i, 0.25) if s_i > 0 and from N(−1, 0.25) if s_i < 0 (meaning that S was not expressed). Based on statistical theory, the approximate proportions of the target gene's expression values that were actually determined by the expression level of the source gene are indicated next to the arrows in Figure 4.4. The derivation of these proportions is shown in Appendix A. The pseudo network was replicated 100 times for each of the sample sizes N = 25, 50 and 100.
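The data-generating scheme described above can be sketched in a few lines. This is an illustrative reading, not the study's code: the names `simulate_target` and `simulate_network` are hypothetical, the second parameter of N(·, ·) is taken to be the standard deviation (as stated for N(1, 1)), and binding is drawn independently per object with probability b rather than fixing exactly 100b% of the objects.

```python
import random

# Binding efficiency b for each (direct source, target) pair, from the text.
BINDING = {("A11", "A21"): 0.9, ("A11", "A22"): 0.7,
           ("A21", "A31"): 0.9, ("A21", "A32"): 0.8}


def simulate_target(source_levels, b, rng, sd=0.25):
    """Generate target expression levels from its direct source."""
    levels = []
    for s in source_levels:
        regulated = rng.random() < b  # binding occurs with probability b
        if regulated and s > 0:
            levels.append(rng.gauss(s, sd))     # determined by the source
        else:
            levels.append(rng.gauss(-1.0, sd))  # unbound or source not expressed
    return levels


def simulate_network(n, seed=0):
    """Return a dict mapping node name to a list of n expression levels."""
    rng = random.Random(seed)
    data = {"A11": [rng.gauss(1.0, 1.0) for _ in range(n)],
            "B": [rng.gauss(1.0, 1.0) for _ in range(n)]}
    # Dict order places A11's targets before A21's, so sources exist first.
    for (src, tgt), b in BINDING.items():
        data[tgt] = simulate_target(data[src], b, rng)
    return data


data = simulate_network(50, seed=1)
print(sorted(data))  # ['A11', 'A21', 'A22', 'A31', 'A32', 'B']
```

Node B is generated independently of every other node, so it plays the role of the negative control in the later analyses.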

Figure 4.4: Pseudo network for the simulation study. The numbers next to the arrows are the proportions of objects in the sample for which the expression of the target node was actually determined by the expression of the source node.

A pseudo network with six nodes (genes) was generated to assess the proposed procedure of GRN inference (Figure 4.4). Two source genes, A11 and B, were predetermined. The CID and pCID values and their p-values for one particular simulation with sample size N = 50 are shown in Table 4.1 to demonstrate the network reconstruction. Starting from A11, the CID(A11|B) value was insignificant (p-value: 0.4136 > 0.05); hence node B was excluded from the following steps. CID(A11|A21), CID(A11|A22), CID(A11|A31) and CID(A11|A32) all had the minimum p-value (0.0010), and the CID(A11|A22) value (0.2028) was the maximum of these CID values, so A22 was selected as the first node connected to A11. Because CID(A11|A22) and CID(A22|A11) had the same significant p-value (0.0010) and the CID(A11|A22) value (0.2028) was larger than the CID(A22|A11) value (0.1791), the direction was set from A11 to A22.

The computation of pCID(A11|x; A22) and pCID(A22|x; A11) for every other gene x followed and resulted in the selection of A21 as the second node, connected to A11, because pCID(A11|A21; A22) had the smallest p-value (0.0010) and the largest pCID value (0.1013). The direction was set from A11 to A21 because pCID(A11|A21; A22) had the same significant p-value (0.0010) as pCID(A21|A11; A22) and its value (0.1013) was larger than the pCID(A21|A11; A22) value (0.0934). Similarly, the third and fourth targets, A31 and A32, were selected on the basis of pCID(A21|A31; A11, A22) and pCID(A21|A32; A11, A22, A31), respectively. Both were connected from A21: pCID(A21|A31; A11, A22) was as significant as pCID(A31|A21; A11, A22) (p-value: 0.0010) and had a larger value (0.1131 > 0.1123), and pCID(A21|A32; A11, A22, A31) was more significant than pCID(A32|A21; A11, A22, A31) (p-value: 0.0020 < 0.0350) even though its value (0.0553) was smaller than the pCID(A32|A21; A11, A22, A31) value (0.0576).

When the negative-control node B was taken as the source node, all of its CID values were insignificant at the first step of GRN inference, so it was isolated from the other nodes. Therefore, the resulting network was identical to the setting shown in Figure 4.4.

Table 4.1: The estimated CID and pCID values in one of the 100 simulations with sample size N = 50.

CID/pCID                   Estimate (p-value)  CID/pCID                   Estimate (p-value)
CID(A11|A21)               0.1936 (0.0010)
CID(A11|A22)               0.2028 (0.0010)     CID(A22|A11)               0.1791 (0.0010)
CID(A11|A31)               0.1612 (0.0010)
CID(A11|A32)               0.1281 (0.0010)
CID(A11|B)                 0.0129 (0.4136)
pCID(A11|A21;A22)          0.1013 (0.0010)     pCID(A21|A11;A22)          0.0934 (0.0010)
pCID(A11|A31;A22)          0.0639 (0.0020)
pCID(A21|A31;A11,A22)      0.1131 (0.0010)     pCID(A31|A21;A11,A22)      0.1123 (0.0010)
pCID(A21|A32;A11,A22)      0.0929 (0.0010)
pCID(A22|A31;A11,A21)      0.0122 (0.3227)
pCID(A22|A32;A11,A21)      0.0205 (0.1638)
pCID(A11|A32;A21,A22,A31)  0.0123 (0.5465)
pCID(A21|A32;A11,A22,A31)  0.0553 (0.0020)     pCID(A32|A21;A11,A22,A31)  0.0576 (0.0350)
pCID(A22|A32;A11,A21,A31)  0.0162 (0.5415)
pCID(A31|A32;A11,A21,A22)  0.0298 (0.1788)
CID(B|A11)                 0.0036 (0.9999)
CID(B|A21)                 0.0202 (0.2468)
CID(B|A22)                 0.0012 (0.9990)
CID(B|A31)                 0.0137 (0.4905)
CID(B|A32)                 0.0090 (0.6563)

We also collected all networks reconstructed with A11 as the source node in the simulations for N = 25, 50 and 100; networks consisting of the same set of nodes were grouped together, and the groups that occurred at least 5 times are shown in Figure 4.5. Of the one hundred simulations, fourteen recovered the correct network structure for N = 25, sixty-five for N = 50, and eighty-one for N = 100. For N = 25, 54% of the simulations revealed only a partial network; with a larger sample (N = 50), only 10 simulations yielded a partial network; and no partial network was obtained with N = 100. In addition, with N = 25, nodes were sometimes dropped, producing partial networks, when the proportion of the target gene's expression values actually determined by the expression level of the source gene was lower than 76% (Figure 4.4). In other words, the edges between (A11, A22) and (A21, A32) could be missed in the reconstruction of the pseudo network. Similarly, with N = 50, the edge between (A11, A22) could be dropped, as the proportion of A22 expression values actually determined by A11 was lower than 60% (Figure 4.4). In these instances, the GRN was reconstructed accurately with the large sample.

The asymmetric property of CID was used to infer causal effects in the network. When CID(Y |X) was more significant than CID(X|Y ), or pCID(Y |X; Z) was more significant than pCID(X|Y ; Z), Y was declared to be the source of the relationship. In Figure 4.5 and Figure 4.6, the number of arrows pointing in the correct direction is shown beside each arrow outside the parentheses, and the number pointing in the incorrect direction is shown inside the parentheses. In Figure 4.6, we combined all the correct connections between two nodes from the 100 simulations for N = 25, 50 and 100.

When N = 25 and the source node was A11, 88% of the networks connected (A11, A21), 86% connected (A21, A31), 55% connected (A11, A22), and 40% connected (A21, A32); 2% of the networks included the negative control node, B (Figure 4.6 B). When N = 50, 97%, 99%, 82%, and 88% of the networks contained the edges between (A11, A21), (A21, A31), (A11, A22) and (A21, A32), respectively, while 7% of them included the negative control node, B (Figure 4.6 C). When N = 100, 99%, 100%, 99%, and 94% of the networks contained the edges between (A11, A21), (A21, A31), (A11, A22) and (A21, A32), respectively, while 12% of them included the negative control node, B (Figure 4.6 D).

When the negative control node, B, was set as the source gene, 16% (Figure 4.6 B), 21% (Figure 4.6 C) and 26% (Figure 4.6 D) of the networks were significantly built at α = 0.05. However, these false networks arose sporadically, without consensus: among the false networks starting from B, no single combination of nodes appeared more than five times in the 100 simulations for N = 25, 50 or 100. Therefore, the CID/pCID method robustly identified the relationships between nodes and extended the association network.

Figure 4.5: Networks reconstructed with A11 as the source node by the procedure in Section 4.1 (nodes with an insignificant CID were excluded, and the node with the minimum significant CID/pCID p-value was connected; if at least two nodes met this requirement, the node with the maximum CID/pCID value was chosen) from 100 simulations of the pseudo network for N = 25, 50 and 100, respectively. The numbers next to the arrows are the numbers of connections from the source node to the target node; the numbers in parentheses are the numbers of connections in the reverse direction.

Figure 4.6: Pseudo network for the simulation study based on the procedure in Section 4.1 (nodes with an insignificant CID were excluded, and the node with the minimum significant CID/pCID p-value was connected; if at least two nodes met this requirement, the node with the maximum CID/pCID value was chosen). (A) The numbers next to the arrows are the proportions of objects in the sample for which the expression of the target node was actually determined by the expression of the source node. (B), (C) and (D) show the results combining all connections from the 100 simulations when the source node T₀ was A11, for N = 25, 50 and 100, respectively.
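The counts of correctly and incorrectly directed arrows reported in Figures 4.5 and 4.6 amount to a tally over the reconstructed networks. The following sketch shows one way such a tally could be computed; `tally_directions` is a hypothetical helper, and the three networks in `demo` are invented for illustration and are not results from the study.

```python
from collections import Counter

# True directed edges of the pseudo network (Figure 4.4).
TRUE_EDGES = {("A11", "A21"), ("A11", "A22"), ("A21", "A31"), ("A21", "A32")}


def tally_directions(networks):
    """Count, per true edge, how often it was recovered correctly or reversed.

    Each network is a list of directed edges (source, target).
    """
    correct, reversed_ = Counter(), Counter()
    for edges in networks:
        for src, tgt in set(edges):
            if (src, tgt) in TRUE_EDGES:
                correct[(src, tgt)] += 1
            elif (tgt, src) in TRUE_EDGES:
                reversed_[(tgt, src)] += 1
            # Edges touching unrelated nodes (e.g. B) are ignored here.
    return correct, reversed_


# Hypothetical reconstructions, for illustration only.
demo = [
    [("A11", "A21"), ("A21", "A31")],
    [("A21", "A11"), ("A21", "A31"), ("A11", "A22")],
    [("A11", "A21"), ("A11", "B")],
]
correct, reversed_ = tally_directions(demo)
print(correct[("A11", "A21")], reversed_[("A11", "A21")])  # 2 1
```

Dividing each count by the number of simulations gives the percentages quoted in the text.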

The medians and interquartile ranges of selected CID and pCID values from the 100 simulations are shown in Table 4.2. The CID values of A11 given a directly or indirectly associated node were much larger than the CID values of A11 given the irrelevant node B. It can also be observed that CID(A11|A21) > CID(A11|A22), that CID(A11|A31) > CID(A11|A32), and that CID(A11|A21) was larger than the maximum of the CID(A11|A31) and CID(A11|A32) values. Therefore, the CID value can not only detect the existence of an association but also reflect its strength and successfully pick the direct (or strongest) association among all possible connections. In addition, 100% of the CID(A11|A21) and CID(A21|A11) values were declared significant at α = 0.05.

The pCID values further helped to select the next A11-related or A21-related node after eliminating the effects of A21 and A11, respectively. Among these pCID values, 100% of the pCID(A21|A31; A11) values were significant at α = 0.05 and their medians were the largest at every sample size N, so A31 was the most likely to be selected as the A21-related node after eliminating the effect of A11. Furthermore, A22 was likely to be picked to connect with A11, with 63% significance for N = 25 and 100% significance for N = 100; A32 was likely to be picked to connect with A21, with 97% significance for N = 50. In the final step, the chance of A32 being selected in the elongation process to connect with A21 was only 29% for N = 25 but 100% for N = 100; the chance of A22 being selected in the elongation process to connect with A11 was 83% for N = 50. On the other hand, the false positive rates of gene selection using either CID or pCID were all about 0.05.

Table 4.2: Summary of the estimated CID/pCID values in 100 simulations with sample sizes N = 25, 50 and 100. Entries are the median (IQR¹) of the estimates and the proportion of simulations declared significant at α = 0.05 (Sig.).

                            N = 25                  N = 50                  N = 100
CID/pCID                    Median (IQR¹)    Sig.    Median (IQR¹)    Sig.    Median (IQR¹)    Sig.
CID(A11|A21)                0.1967 (0.0534)  1.00    0.2049 (0.0527)  1.00    0.2319 (0.0378)  1.00
CID(A11|A22)                0.1100 (0.0568)  0.86    0.1232 (0.0522)  1.00    0.1402 (0.0331)  1.00
CID(A11|A31)                0.1348 (0.0631)  0.93    0.1457 (0.0610)  1.00    0.1600 (0.0345)  1.00
CID(A11|A32)                0.1130 (0.0708)  0.86    0.1233 (0.0499)  1.00    0.1328 (0.0377)  1.00
CID(A11|B)                  0.0281 (0.0369)  0.06    0.0157 (0.0166)  0.13    0.0119 (0.0077)  0.16
CID(A21|A11)                0.1941 (0.0609)  1.00    0.2024 (0.0510)  1.00    0.2310 (0.0302)  1.00
pCID(A11|A22;A21)           0.0781 (0.0425)  0.74    0.0824 (0.0496)  0.96    0.0842 (0.0304)  1.00
pCID(A11|A31;A21)           0.0359 (0.0320)  0.22    0.0297 (0.0226)  0.55    0.0172 (0.0165)  0.83
pCID(A11|A32;A21)           0.0309 (0.0319)  0.19    0.0221 (0.0212)  0.40    0.0122 (0.0156)  0.72
pCID(A21|A22;A11)           0.0358 (0.0312)  0.19    0.0210 (0.0221)  0.33    0.0091 (0.0140)  0.61
pCID(A21|A31;A11)           0.1301 (0.0431)  1.00    0.1285 (0.0356)  1.00    0.1320 (0.0272)  1.00
pCID(A21|A32;A11)           0.0937 (0.0412)  0.93    0.1017 (0.0350)  1.00    0.0989 (0.0259)  1.00
pCID(A31|A21;A11)           0.1274 (0.0570)  0.92    0.1258 (0.0431)  1.00    0.1397 (0.0215)  1.00
pCID(A11|A22;A21,A31)       0.0764 (0.0536)  0.63    0.0772 (0.0461)  0.88    0.0838 (0.0385)  1.00
pCID(A11|A32;A21,A31)       0.0239 (0.0238)  0.04    0.0156 (0.0182)  0.09    0.0086 (0.0148)  0.23
pCID(A21|A22;A11,A31)       0.0202 (0.0242)  0.11    0.0126 (0.0197)  0.15    0.0009 (0.0156)  0.33
pCID(A21|A32;A11,A31)       0.0517 (0.0381)  0.52    0.0567 (0.0265)  0.97    0.0611 (0.0247)  1.00
pCID(A31|A22;A11,A21)       0.0160 (0.0211)  0.03    0.0057 (0.0137)  0.04   -0.0039 (0.0134)  0.07
pCID(A31|A32;A11,A21)       0.0295 (0.0273)  0.16    0.0237 (0.0238)  0.32    0.0195 (0.0181)  0.68
pCID(A22|A11;A21,A31)       0.0615 (0.0440)  0.18    -                -       0.0611 (0.0238)  0.86
pCID(A32|A21;A11,A31)       -                -       0.0486 (0.0222)  0.41    -                -
pCID(A11|A32;A21,A22,A31)   0.0206 (0.0205)  0.01    -                -       0.0095 (0.0104)  0.14
pCID(A21|A32;A11,A22,A31)   0.0479 (0.0379)  0.29    -                -       0.0584 (0.0238)  1.00
pCID(A22|A32;A11,A21,A31)   0.0237 (0.0211)  0.01    -                -       0.0128 (0.0130)  0.02
pCID(A31|A32;A11,A21,A22)   0.0316 (0.0262)  0.08    -                -       0.0259 (0.0150)  0.41
pCID(A32|A21;A11,A22,A31)   0.0407 (0.0369)  0.02    -                -       0.0493 (0.0171)  0.59
pCID(A11|A22;A21,A31,A32)   -                -       0.0793 (0.0446)  0.83    -                -
pCID(A21|A22;A11,A31,A32)   -                -       0.0123 (0.0189)  0.03    -                -
pCID(A31|A22;A11,A21,A32)   -                -       0.0119 (0.0188)  0.07    -                -
pCID(A32|A22;A11,A21,A31)   -                -       0.0143 (0.0192)  0.02    -                -
pCID(A22|A11;A21,A31,A32)   -                -       0.0626 (0.0341)  0.35    -                -
CID(B|A11)                  0.0273 (0.0285)  0.08    0.0167 (0.0163)  0.07    0.0119 (0.0100)  0.10
CID(B|A21)                  0.0220 (0.0231)  0.06    0.0144 (0.0129)  0.04    0.0103 (0.0072)  0.08
CID(B|A22)                  0.0187 (0.0222)  0.03    0.0114 (0.0117)  0.05    0.0075 (0.0060)  0.04
CID(B|A31)                  0.0199 (0.0239)  0.08    0.0125 (0.0149)  0.08    0.0079 (0.0086)  0.11
CID(B|A32)                  0.0188 (0.0158)  0.05    0.0131 (0.0171)  0.11    0.0078 (0.0064)  0.09

¹ IQR = interquartile range.
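The summaries of the kind reported in Table 4.2 (median, interquartile range, and proportion of significant results) can be computed from the per-simulation estimates and p-values as sketched below. The helper name `summarize` and the input values are hypothetical, and the quartile convention used here (the default "exclusive" method of Python's `statistics.quantiles`) is an assumption; the convention used in the study is not stated.

```python
import statistics


def summarize(estimates, p_values, alpha=0.05):
    """Return (median, IQR, proportion of p-values <= alpha)."""
    q1, _, q3 = statistics.quantiles(estimates, n=4)  # quartile cut points
    median = statistics.median(estimates)
    prop_sig = sum(p <= alpha for p in p_values) / len(p_values)
    return median, q3 - q1, prop_sig


# Hypothetical estimates and p-values from five simulations.
est = [0.1, 0.2, 0.3, 0.4, 0.5]
pvals = [0.001, 0.2, 0.03, 0.05, 0.5]
median, iqr, prop = summarize(est, pvals)
print(median, round(iqr, 2), prop)  # 0.3 0.3 0.6
```

Different quartile conventions can shift the IQR slightly for small samples, so exact agreement with published values should not be expected unless the same convention is used.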

4.3 Arabidopsis microarray data analysis

C-repeat binding factors (CBFs) bind to the promoter regions of downstream cold-regulated (COR) genes and induce COR gene expression under cold stress (Thomashow et al., 2001; McKhann et al., 2008; Zhang et al., 2013). We applied the gene regulation network (GRN) inference to the expression dataset of Arabidopsis thaliana under cold stress to reconstruct the well-known CBF-COR regulatory network. This dataset, from the TAIR database, is described in detail in Section 3.3. After normalization and log2 transformation, the expression values of eight probes, three for C-repeat binding factors (CBF1 (probe ID: 254074_at), CBF2 (probe ID: 254075_at) and CBF3 (probe ID: 254066_at)) and five for the COR gene family (COR6.6 (probe ID: 246481_s_at), COR78 (probe ID: 248337_at), COR47 (probe ID: 259570_at), COR15A (probe ID: 263497_at) and COR15B (probe ID: 263495_at)), were used to construct the GRN by the CID/pCID method.
