Discussion - Identifying Yeast Transcription Factors Using the Relative R² Method

4.1 Performance comparison with existing methods

Five previous studies also tried to identify the yeast cell cycle TFs. Tsai et al. [26] identified 30 cell cycle TFs by applying a statistical method (ANOVA analysis), and Cheng et al. [5] identified 40 cell cycle TFs by applying another statistical method (Fisher’s G test). Cokus et al. [7] identified 12 cell cycle TFs by applying linear regression analysis. Andersson et al. [2] identified 15 cell cycle TFs by applying rule-based modeling. Wu et al. [33] identified 17 cell cycle TFs by using a time-lagged dynamic model of gene regulation (see Table 6). Since these five approaches differ from ours, a performance comparison is needed. As suggested by de Lichtenberg et al. [8], we tested the ability of each method to retrieve the known cell cycle TFs annotated in the MIPS database [18]. The comparison was based on the Jaccard similarity score [21], which measures the overlap between a method’s output and the list of known cell cycle TFs (i.e., the true answers); its definition is given at the end of this subsection. The higher the Jaccard similarity score, the better a method retrieves the known cell cycle TFs. As shown in Table 3, our method has the highest Jaccard similarity score among the six methods. By this measure, our method outperforms the five existing methods.

Before defining the Jaccard similarity score, we first describe its origin. It is derived from the Jaccard coefficient, which measures the similarity between sample set A and sample set B. The Jaccard coefficient is defined as the size of the intersection of the sample sets divided by the size of the union of the sample sets and can be written as

$$J(A,B) = \frac{|A \cap B|}{|A \cup B|}.$$

Given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap that A and B share in their attributes. Each attribute of A and B can be either 0 or 1. The total number of each combination of attributes for A and B is specified as follows: $M_{11}$ is the total number of attributes where A and B both have a value of 1. $M_{01}$ is the total number of attributes where the attribute of A is 0 and the attribute of B is 1. $M_{10}$ is the total number of attributes where the attribute of A is 1 and the attribute of B is 0. $M_{00}$ is the total number of attributes where A and B both have a value of 0. Each attribute must fall into one of these four categories, meaning that

$$M_{11} + M_{01} + M_{10} + M_{00} = n.$$

The Jaccard similarity coefficient, J, is given as

$$J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}.$$

In our study, suppose the attributes of A and B represent the positives according to fact (the known cell cycle TFs) and according to our method, respectively. Then $M_{11}$ counts the true positives and is renamed TP, $M_{01}$ counts the false positives and is renamed FP, and $M_{10}$ counts the false negatives and is renamed FN. Therefore, the Jaccard similarity coefficient, J, is given as

$$J = \frac{TP}{TP + FP + FN}$$

and is renamed the Jaccard similarity score.
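As an illustration, the score J = TP / (TP + FP + FN) can be computed directly from two lists of TF names. The following is a minimal sketch, not the code used in this study; the function name `jaccard_similarity_score` and the placeholder names `TF1` to `TF5` are assumptions made for the example.

```python
def jaccard_similarity_score(predicted, known):
    """Return (TP, FP, FN, J) for a predicted TF list against a known TF list.

    J = TP / (TP + FP + FN), i.e. |intersection| / |union| of the two sets.
    """
    predicted, known = set(predicted), set(known)
    tp = len(predicted & known)   # identified and known
    fp = len(predicted - known)   # identified but not known
    fn = len(known - predicted)   # known but not identified
    denom = tp + fp + fn
    score = tp / denom if denom else 0.0
    return tp, fp, fn, score


# Hypothetical placeholder names, not real annotation data.
known_tfs = {"TF1", "TF2", "TF3", "TF4"}
predicted_tfs = {"TF1", "TF2", "TF5"}

tp, fp, fn, score = jaccard_similarity_score(predicted_tfs, known_tfs)
print(tp, fp, fn, score)  # 2 1 2 0.4
```

True negatives ($M_{00}$) do not enter the score, so it is unaffected by how many TFs are neither known nor identified.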

4.2 Robustness against different cell cycle gene expression datasets

Besides the above analysis, we also applied the relative R² method to another cell cycle gene expression dataset, the alpha38 dataset [19]. This dataset has a sampling interval of 5 minutes and a total of 25 data points. Using our method, we identified 18 cell cycle TFs. Among them, 13 (Ace2, Cin5, Fkh1, Fkh2, Hir3, Mbp1, Mcm1, Rap1, Swi, Swi5, Swi6, Ume6, Yox1) are known cell cycle TFs according to the MIPS database [18], and the remaining five are Fhl1, Ino2, Leu3, Met32 and Yap1. In this analysis, the relative R² method again yields a high Jaccard similarity score of 0.317. Moreover, of the 15 cell cycle TFs identified in this study using the alpha30 dataset [19], 10 are also identified using the alpha38 dataset (see Figure 3). This suggests that our method is robust across different cell cycle gene expression datasets.

4.3 Threshold setting

There are three thresholds to set in the above analysis: $p_0$, $s$ and the significance level of the hypergeometric test. In the relative R² method, we first use the criterion involving $p_0$ to select TFs that have a significant effect on a gene, and then use the criterion involving $s$ to check whether the remaining TFs are able to account for the dynamics of the target gene’s expression (see Methods for details). Since a p-value indicates the significance of a TF’s regulation of the gene, it is reasonable to require that $p_0$ not be too large. As noted by Wang and Li [27], the choice of $p_0$ can be relatively relaxed, while the choice of $s$ should be stricter because $s$ is the main criterion. We suggest choosing $s$ greater than 0.9 to ensure the accuracy of the results. To achieve the highest Jaccard similarity score, we conducted simulations for different cases by varying the values of $p_0$ and $s$ (see Table 4). Finally, $p_0$ was set to 0.72 and $s$ to 0.97.
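The threshold search described above can be sketched as a grid search that keeps the pair of values with the highest Jaccard similarity score. This is only an illustration under stated assumptions: `select_thresholds`, the callable `identify_tfs(p0, s)` and the toy lookup table are hypothetical stand-ins, since the actual identification step is described in Methods and is not reproduced here.

```python
from itertools import product


def select_thresholds(identify_tfs, known_tfs, p0_grid, s_grid):
    """Return (best_score, best_p0, best_s) over all (p0, s) pairs in the grid.

    identify_tfs(p0, s) is assumed to return the TFs identified at those
    thresholds. Ties keep the first pair encountered.
    """
    known = set(known_tfs)
    best = (-1.0, None, None)
    for p0, s in product(p0_grid, s_grid):
        predicted = set(identify_tfs(p0, s))
        union = predicted | known
        # Jaccard similarity score: |intersection| / |union| = TP / (TP + FP + FN)
        score = len(predicted & known) / len(union) if union else 0.0
        if score > best[0]:
            best = (score, p0, s)
    return best


# Hypothetical toy results standing in for the real identification step.
toy_results = {
    (0.5, 0.90): {"TF1", "TF5", "TF6"},
    (0.5, 0.97): {"TF1"},
    (0.7, 0.90): {"TF1", "TF2", "TF3", "TF5", "TF6"},
    (0.7, 0.97): {"TF1", "TF2", "TF3", "TF5"},
}
known_tfs = {"TF1", "TF2", "TF3", "TF4"}

best = select_thresholds(
    lambda p0, s: toy_results[(p0, s)], known_tfs, [0.5, 0.7], [0.90, 0.97]
)
print(best)  # (0.6, 0.7, 0.97)
```

Because the thresholds are tuned against the same list of known TFs that is used for scoring, the selected score is best read as an upper bound for that dataset.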

For the significance level of the hypergeometric test, 0.05 is the common choice. In this case, with $p_0 = 0.72$ and $s = 0.97$, we identified 15 cell cycle TFs, comprising 12 true positives and 3 false positives, and obtained a Jaccard similarity score of 0.308. If we relax the significance level to 0.15 under the same $p_0$ and $s$, we identify 28 cell cycle TFs, comprising 16 true positives (Abf1, Ace2, Fkh1, Fkh2, Hir3, Mbp1, Mcm1, Ndd1, Rfx1, Stb1, Swi4, Swi5, Swi6, Ume6, Yhp1, Yox1) and 12 false positives (Dal81, Dat1, Fhl1, Gal4, Hap4, Msn4, Pdr1, Phd1, Reb1, Tye7, Yap1, Yap5). The Jaccard similarity score for this case is 0.333.
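As a consistency check on the reported scores, the relation J = TP / (TP + FP + FN) can be inverted to recover the number of false negatives, and hence the size of the known TF list (TP + FN), from each reported triple. The sketch below uses only the counts and scores quoted in Sections 4.2 and 4.3; the function name `implied_known_total` is an assumption, and the resulting list size is an inference from rounded scores rather than a figure stated in this section.

```python
def implied_known_total(tp, fp, score):
    """Back-solve TP + FN from J = TP / (TP + FP + FN).

    FN = TP / J - TP - FP, rounded to the nearest integer because the
    reported scores are rounded to three decimals.
    """
    fn = tp / score - tp - fp
    return tp + round(fn)


# (TP, FP, reported Jaccard similarity score) as quoted in the text.
reported = {
    "alpha30, significance level 0.05": (12, 3, 0.308),
    "alpha30, significance level 0.15": (16, 12, 0.333),
    "alpha38": (13, 5, 0.317),
}

for label, (tp, fp, score) in reported.items():
    print(label, implied_known_total(tp, fp, score))
```

All three settings imply the same total of 36 known cell cycle TFs, so the three reported scores are mutually consistent.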
