
CHAPTER 4 RESULTS

4.2 PREDICTION PERFORMANCE

With the evaluation benchmark described in Chapter 3, the prediction performance of the three SVM models can be evaluated fairly and clearly.

4.2.1 Statistically Significant 6-mer Patterns

The prediction sensitivity (Sen.), specificity (Spe.), accuracy (Acc.), and precision (Pre.) of the SVM models built on statistically significant 6-mer patterns for group 1 (all), group 2 (non-CpG), and group 3 (CpG) are given in Tables 4.7, 4.8, and 4.9, respectively. The negative set is randomly extracted from the six negative regions defined in Table 3.3. In general, the larger the window size, the higher the prediction performance. Even so, we chose the models with a window size of 300 as our prediction models, because their prediction accuracy reaches 70% or more in groups 1 and 3 and we want to locate the core promoter region accurately. Moreover, the prediction performance of group 3 (CpG) is better than that of group 1 (all) and group 2 (non-CpG).
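The four measures follow the standard confusion-matrix definitions. The following is a minimal sketch of how they can be computed; the function name and the example counts are illustrative and are not taken from this work. The counts assume equally sized positive and negative sets of 100 sequences each, which is an assumption, although the first row of Table 4.7 is consistent with it.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy, precision) as fractions.

    tp/fn: positive (promoter) windows predicted correctly / missed.
    tn/fp: negative windows predicted correctly / wrongly called positive.
    """
    sensitivity = tp / (tp + fn)                # Sen. = TP / (TP + FN)
    specificity = tn / (tn + fp)                # Spe. = TN / (TN + FP)
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # Acc. = correct / all
    precision = tp / (tp + fp)                  # Pre. = TP / (TP + FP)
    return sensitivity, specificity, accuracy, precision


# Hypothetical counts: 100 positive and 100 negative test sequences.
sen, spe, acc, pre = classification_metrics(tp=51, fn=49, tn=77, fp=23)
print(f"Sen. {sen:.0%}  Spe. {spe:.0%}  Acc. {acc:.0%}  Pre. {pre:.0%}")
```

With balanced sets, accuracy is simply the mean of sensitivity and specificity, which is why a model with low sensitivity can still report a moderate accuracy.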

Table 4.7 Model accuracy of the 6-mer pattern models in group 1 (all).

Negative set   Positive set   Window size   Sen.   Spe.   Acc.   Pre.
Random         -60 ~ +20      80            51%    77%    64%    69%
Random         -100 ~ +50     150           58%    76%    67%    71%
Random         -200 ~ +100    300           62%    77%    70%    73%
Random         -300 ~ +150    450           64%    80%    72%    76%
Random         -400 ~ +200    600           65%    83%    74%    79%

Table 4.8 Model accuracy of the 6-mer pattern models in group 2 (non-CpG island).

Negative set   Positive set   Window size   Sen.   Spe.   Acc.   Pre.
Random         -60 ~ +20      80            22%    85%    53%    59%
Random         -100 ~ +50     150           28%    85%    56%    65%
Random         -200 ~ +100    300           32%    83%    58%    66%
Random         -300 ~ +150    450           32%    83%    58%    66%
Random         -400 ~ +200    600           35%    82%    58%    66%

Table 4.9 Model accuracy of the 6-mer pattern models in group 3 (CpG island).

Negative set   Positive set   Window size   Sen.   Spe.   Acc.   Pre.
Random         -60 ~ +20      80            56%    75%    66%    70%
Random         -100 ~ +50     150           67%    74%    71%    72%
Random         -200 ~ +100    300           75%    76%    76%    76%
Random         -300 ~ +150    450           77%    78%    78%    78%
Random         -400 ~ +200    600           79%    82%    80%    81%

Figure 4.7 shows the distribution of predictions made by the SVM models (-200 ~ +100) based on statistically significant 6-mer patterns for groups 1, 2, and 3 in the region from 3,000 bp upstream to 3,000 bp downstream of the TSS. The sliding window size was set to 300 nt, as determined by the selected window (-200 ~ +100) of the positive set. The window was shifted in steps of 50 nt, and the number of predictions was counted in each window. As the figure shows, groups 1 and 3 have similar prediction performance, and both perform better than group 2.
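The sliding-window scan described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: `predict` is a hypothetical callable standing in for the trained SVM, the function names are invented for the example, and the region is assumed to be exactly 6,000 nt long.

```python
def window_starts(region_len, window=300, step=50):
    """Start offsets of every full window that fits inside the region."""
    return list(range(0, region_len - window + 1, step))


def count_predictions(sequences, predict, window=300, step=50):
    """Count positive predictions at each window position over all sequences.

    `predict` is a hypothetical stand-in for the trained model: it takes one
    window (a string) and returns True when the window is called a promoter.
    """
    region_len = min(len(seq) for seq in sequences)
    starts = window_starts(region_len, window, step)
    counts = [0] * len(starts)
    for seq in sequences:
        for i, start in enumerate(starts):
            if predict(seq[start:start + window]):
                counts[i] += 1
    return starts, counts


# Toy usage: a region of 3,000 nt upstream plus 3,000 nt downstream, with a
# placeholder rule (more than half G) instead of a real classifier.
toy_region = "A" * 3000 + "G" * 3000
starts, counts = count_predictions([toy_region], lambda w: w.count("G") > len(w) / 2)
print(len(starts), sum(counts))
```

Under these assumptions a 6,000-nt region yields 115 window positions, and plotting `counts` against `starts` (shifted by -3000 to be relative to the TSS) gives a distribution of the kind shown in the figure.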

Figure 4.7 Distributions of the predictions of the 6-mer pattern models for groups 1, 2, and 3 in the interval [-3000, +3000] relative to the TSS, based on DBTSS.

4.2.2 Nucleotide Composition

The prediction sensitivity (Sn.), specificity (Sp.), accuracy (Acc.), and precision (Pre.) of the SVM models built on monomer-to-dimer and monomer-to-trimer composition for group 1 are given in Tables 4.10 and 4.11, respectively.

Because extending the features from monomer-to-dimer to monomer-to-trimer did not noticeably improve the results, we chose monomer-to-dimer composition to evaluate groups 2 and 3. The sensitivity, specificity, accuracy, and precision of the SVM models built on monomer-to-dimer composition for groups 2 and 3 are given in Tables 4.12 and 4.13, respectively. The negative set is randomly extracted from the six negative regions defined in Table 3.3. In general, the larger the window size, the higher the prediction performance. Even so, we chose the models with a window size of 300 as our prediction models, because their accuracy exceeds 70% in groups 1 and 3 and we want to locate the core promoter region accurately. Moreover, the prediction performance of group 3 (CpG) is better than that of group 1 (all) and group 2 (non-CpG).
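To make the two feature sets concrete, the sketch below builds a composition vector from a sequence window. It is an illustration under stated assumptions, not the feature extraction used in this work: the function name is invented, k-mers are counted with overlap, and counts are normalized to relative frequencies per k. Under these assumptions, monomer-to-dimer gives 4 + 16 = 20 features and monomer-to-trimer gives 4 + 16 + 64 = 84.

```python
from itertools import product


def composition_features(seq, max_k=2, alphabet="ACGT"):
    """Relative frequencies of all k-mers for k = 1..max_k, in a fixed order.

    Overlapping k-mers are counted; k-mers containing symbols outside the
    alphabet (for example N) are skipped. The normalization is an assumption.
    """
    seq = seq.upper()
    features = []
    for k in range(1, max_k + 1):
        kmers = ["".join(p) for p in product(alphabet, repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        total = 0
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
                total += 1
        features.extend(counts[m] / total if total else 0.0 for m in kmers)
    return features


dimer_vec = composition_features("ACGTACGT", max_k=2)   # monomer to dimer
trimer_vec = composition_features("ACGTACGT", max_k=3)  # monomer to trimer
print(len(dimer_vec), len(trimer_vec))
```

The trimer vector is more than four times longer than the dimer vector, so a negligible accuracy gain from trimers is a reasonable argument for keeping the smaller feature set.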

Table 4.10 Model accuracy of the monomer-to-dimer models in group 1 (all).

Negative set   Positive set   Window size   Sn.    Sp.    Acc.   Pre.
Random         -60 ~ +20      80            69%    69%    69%    69%
Random         -100 ~ +50     150           67%    71%    70%    70%
Random         -200 ~ +100    300           69%    74%    71%    72%
Random         -300 ~ +150    450           70%    75%    72%    74%
Random         -400 ~ +200    600           72%    75%    74%    74%

Table 4.11 Model accuracy of the monomer-to-trimer models in group 1 (all).

Negative set   Positive set   Window size   Sn.    Sp.    Acc.   Pre.
Random         -60 ~ +20      80            66%    71%    69%    70%
Random         -100 ~ +50     150           67%    73%    70%    71%
Random         -200 ~ +100    300           67%    73%    70%    71%
Random         -300 ~ +150    450           69%    76%    73%    74%
Random         -400 ~ +200    600           71%    75%    73%    74%

Table 4.12 Model accuracy of the monomer-to-dimer models in group 2 (non-CpG island).

Negative set   Positive set   Window size   Sn.    Sp.    Acc.   Pre.
Random         -60 ~ +20      80            61%    71%    66%    68%
Random         -100 ~ +50     150           62%    68%    65%    66%
Random         -200 ~ +100    300           65%    69%    67%    68%
Random         -300 ~ +150    450           63%    68%    66%    67%
Random         -400 ~ +200    600           63%    68%    66%    66%

Table 4.13 Model accuracy of the monomer-to-dimer models in group 3 (CpG island).

Negative set   Positive set   Window size   Sn.    Sp.    Acc.   Pre.
Random         -60 ~ +20      80            74%    68%    71%    70%
Random         -100 ~ +50     150           75%    69%    72%    71%
Random         -200 ~ +100    300           76%    71%    74%    73%
Random         -300 ~ +150    450           77%    74%    75%    74%
Random         -400 ~ +200    600           80%    75%    77%    76%

Figure 4.8 shows the distribution of predictions made by the SVM models (-200 ~ +100) based on nucleotide composition for groups 1, 2, and 3 in the region from 3,000 bp upstream to 3,000 bp downstream of the TSS. The sliding window size was set to 300 nt, as determined by the selected window (-200 ~ +100) of the positive set. The window was shifted in steps of 50 nt, and the number of predictions was counted in each window. As the figure shows, groups 1 and 3 have similar prediction performance, and both perform better than group 2.

Figure 4.8 Distributions of the predictions of the nucleotide composition models for groups 1, 2, and 3 in the interval [-3000, +3000] relative to the TSS, based on DBTSS.

4.2.4 DNA Stability

The prediction sensitivity, specificity, accuracy, and precision of the SVM models built on DNA stability for groups 1, 2, and 3 are given in Tables 4.14, 4.15, and 4.16, respectively. The negative set is randomly extracted from the six negative regions defined in Table 3.3. In general, the larger the window size, the higher the prediction performance. Even so, we chose the models with a window size of 300 as our prediction models, because their prediction accuracy is more than 70% and we want to locate the core promoter region accurately. Moreover, the prediction performance of group 3 (CpG) is better than that of group 1 (all) and group 2 (non-CpG).

Table 4.14 Model accuracy of the DNA stability models in group 1 (all).

Table 4.15 Model accuracy of the DNA stability models in group 2 (non-CpG island).

Negative set   Positive set   Window size   Sn.    Sp.    Acc.   Pre.

Table 4.16 Model accuracy of the DNA stability models in group 3 (CpG island).

Negative set   Positive set   Window size   Sn.    Sp.    Acc.   Pre.

Figure 4.9 shows the distribution of predictions made by the SVM models (-200 ~ +100) based on DNA stability for groups 1, 2, and 3 in the region from 3,000 bp upstream to 3,000 bp downstream of the TSS. The sliding window size was set to 300 nt, as determined by the selected window (-200 ~ +100) of the positive set. The window was shifted in steps of 50 nt, and the number of predictions was counted in each window. As the figure shows, groups 1 and 3 have similar prediction performance, and both perform better than group 2.

Figure 4.9 Distributions of the predictions of the DNA stability models for groups 1, 2, and 3 in the interval [-3000, +3000] relative to the TSS, based on DBTSS.

4.2.5 The Prediction Performance of Combinatorial Features

We tested all combinations of the three kinds of regulatory features to find the combination that best improves prediction performance. The prediction sensitivity, specificity, accuracy, and precision of the SVM models built on the combined features for groups 1, 2, and 3 are given in Tables 4.17, 4.18, and 4.19, respectively. We selected the model that combines all three features, which had the highest model accuracy, as our prediction model. Moreover, the prediction performance of group 3 (CpG island) is better than that of group 1 (all) and group 2 (non-CpG island).
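As a rough sketch of what testing the combinations involves, the snippet below enumerates every combination of two or more feature types and joins their vectors. The names are illustrative, and combining features by concatenating their vectors is an assumption; the text here does not state how the features were merged.

```python
from itertools import combinations

# Labels for the three feature types evaluated in Sections 4.2.1 to 4.2.4.
FEATURE_TYPES = ("6-mer patterns", "nucleotide composition", "DNA stability")


def feature_combinations(names, min_size=2):
    """All combinations of at least `min_size` feature types."""
    combos = []
    for size in range(min_size, len(names) + 1):
        combos.extend(combinations(names, size))
    return combos


def combine(vectors):
    """Concatenate per-feature vectors into one input vector (assumed scheme)."""
    combined = []
    for vector in vectors:
        combined.extend(vector)
    return combined


for combo in feature_combinations(FEATURE_TYPES):
    print(" + ".join(combo))
```

Three feature types give three pairs and one triple, so four combined models per group would need to be trained and compared at the chosen window size.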

Table 4.17 Model accuracy of the combinatorial models in group 1 (all).

Window size 300

Table 4.18 Model accuracy of the combinatorial models in group 2 (non-CpG island).

Window size 300

Table 4.19 Model accuracy of the combinatorial models in group 3 (CpG island).

Window size 300

Figure 4.10 shows the distribution of predictions made by the SVM models (-200 ~ +100) based on all three features for groups 1, 2, and 3 in the region from 3,000 bp upstream to 3,000 bp downstream of the TSS. The sliding window size was set to 300 nt, as determined by the selected window (-200 ~ +100) of the positive set. The window was shifted in steps of 50 nt, and the number of predictions was counted in each window. As the figure shows, groups 1 and 3 have similar prediction performance, and both perform better than group 2.

Figure 4.10 Distributions of the predictions of the models combining 6-mer patterns, nucleotide composition, and DNA stability for groups 1, 2, and 3 in the interval [-3000, +3000] relative to the TSS, based on DBTSS.
