
We have implemented PRooF, based on the algorithm described in detail in Chapter 3, for the prediction of −1 and +1 PRF sites in a given sequence. The kernel of PRooF was written in C, and its web server, available for online analysis at [37], was implemented in PHP. To evaluate its function and correctness, we tested PRooF on a number of genomic sequences, each with one or two known PRF sites, from many different species, and compared its results with those obtained by the latest program, FSFinder2 [31, 38]. To reduce the number of false positives, FSFinder2 seems to consider only the two pairs of partially overlapping ORFs whose zero reading frames are the two longest, because Moon et al. [31] reported that these two pairs had the highest probability of containing −1 and +1 PRF sites. However, there currently seems to be no biological evidence to support their observation. In contrast, we used InterProScan to screen out the partially overlapping ORFs whose protein sequences contain no functional motifs/domains. As demonstrated later in our experiments, this functional-bioinformatics approach is very useful for reducing the number of false positives.

In our experiments, the tested sequences were taken from the databases PseudoBase [39] and RECODE [12]. PseudoBase collects RNA pseudoknots, some of which are thought to function as stimulators of −1 PRFs, and RECODE contains translational recoding events in various biological species, including −1 and +1 PRFs. It should be noted that most of the known PRF sites in these tested sequences are putative, because they have never been proven to be functional and simply carry the required slippery sequences and downstream RNA secondary structures. Tables 4.1 and 4.2 show the sequences we used to predict −1 and +1 PRFs, respectively, and the number of their corresponding PRF sites. For convenience of comparison, we used the sequence IDs designated by Moon et al. [31], even though their IDs are inconsistent with those annotated in RECODE. Most sequences listed in Table 4.1 have putative −1 PRFs with longer protein products, whereas only a few sequences, such as RCD71, RCD360–365 and RCD392, have those with shorter products. Moreover, most of the tested sequences bear slippery sequences of the general form X XXY YYZ for −1 PRF, except for a few instances (PKB127, RCD92, 99, 114, 236, 257 and 260) that fit the shorter form Y YYZ. In Table 4.2, all the tested sequences have +1 PRFs that produce longer proteins.

Table 4.1: The tested sequences and their −1 PRF numbers

| Seq. ID | Species | −1 PRF # | Seq. ID | Species | −1 PRF # |
|---|---|---|---|---|---|
| PKB1 | BLV | 1 | RCD96 | Simian retrovirus 2 | 2 |
| PKB2 | BWYV | 1 | RCD97 | Simian T-cell lymphotropic virus 1 | 2 |
| PKB3 | EIAV | 1 | RCD98 | Visna virus | 2 |
| PKB4 | FIV | 1 | RCD99 | Bacteriophage T7 | 1 |
| PKB42 | PLRV-W | 1 | RCD104 | Bacteriophage lambda | 1 |
| PKB43 | PLRV-S | 1 | RCD105 | Cocksfoot mottle virus | 1 |
| PKB44 | CABYV | 1 | RCD106 | D. buzzatii Osvaldo retrotransposon | 1 |
| PKB45 | PEMV | 1 | RCD107 | D. ananassae Tom retrotransposon | 1 |
| PKB46 | BYDV-NY RPV | 1 | RCD108 | Gill-associated virus | 1 |
| PKB80 | MMTV | 2 | RCD110 | T. vaginalis virus 2 | 1 |
| PKB106 | IBV | 1 | RCD114 | B. subtilis | 1 |
| PKB107 | SRV1 gag/pro | 1 | RCD115 | D. melanogaster telomeric retrotransposon Het-A | 1 |
| PKB127 | EAV | 1 | | | |
| PKB128 | BEV | 1 | RCD118 | Enzootic nasal tumor V. | 1 |
| PKB171 | HCV 229E | 1 | RCD233 | Potato leafroll V. | 1 |
| PKB174 | RSV | 1 | RCD235 | IS1 | 1 |
| PKB217 | LDV-C | 1 | RCD236 | IS3 | 1 |
| PKB218 | PRRSV-16244B | 1 | RCD237 | IS2 | 1 |
| PKB233 | PRRSV-LV | 1 | RCD238 | IS911 | 1 |
| PKB240 | BChV | 1 | RCD249 | Cereal yellow dwarf V. RPV-NY | 1 |
| RCD71 | E. coli | 1 | RCD250 | Cereal yellow dwarf V. RPV-Mex | 1 |
| RCD72 | Drosophila TE | 1 | RCD251 | IS150 | 1 |
| RCD73 | Human astrovirus | 1 | RCD252 | IS1221A | 1 |
| RCD79 | Giardiavirus | 1 | RCD257 | Carrot mottle mimic V. | 1 |
| RCD80 | D. melanogaster gypsy TE | 1 | RCD258 | Groundnut rosette V. | 1 |
| RCD82 | HIV type 1 | 1 | RCD260 | PEMV2 | 1 |
| RCD83 | HIV type 2 | 1 | RCD360 | S. typhi | 1 |
| RCD84 | Human T-cell lymphotropic 1 | 2 | RCD361 | S. typhimurium | 1 |
| RCD85 | Human T-cell lymphotropic 2 | 2 | RCD362 | V. cholerae | 1 |
| RCD86 | IAP | 1 | RCD363 | N. meningitidis | 1 |
| RCD88 | S. cerevisiae L-A | 1 | RCD364 | N. gonorrhoeae | 1 |
| RCD89 | Murine hepatitis V. | 1 | RCD365 | N. meningitidis | 1 |
| RCD91 | Mason-Pfizer monkey V. | 2 | RCD375 | M. musculus | 1 |
| RCD92 | Red clover necrotic mosaic V. | 1 | RCD376 | H. sapiens | 1 |
| RCD94 | SIV | 1 | RCD392 | Y. pestis | 1 |
| RCD95 | Simian type D V. 1 | 2 | RCD393 | SARS coronavirus | 1 |

Most tested sequences listed in this table have −1 PRFs that produce longer proteins, whereas a few sequences, such as RCD71, RCD360–365 and RCD392, give shorter proteins instead. The sequences PKB127 and RCD92, 99, 114, 236, 257 and 260 possess −1 PRF slippery sequences that conform to the form Y YYZ. Most of the tested sequences, however, have slippery sequences of the general form X XXY YYZ for their −1 PRFs. Notice that, of the two −1 PRFs of PKB80, one has the slippery sequence X XXY YYZ but the other has Y YYZ.

Table 4.2: The tested sequences and their +1 PRF numbers

| Seq. ID | Species | +1 PRF # | Seq. ID | Species | +1 PRF # |
|---|---|---|---|---|---|
| RCD1 | B. mori | 1 | RCD40 | C. pneumoniae | 1 |
| RCD2 | B. fuckeliana | 1 | RCD41 | C. acetobutylicum | 1 |
| RCD3 | C. elegans | 1 | RCD42 | C. difficile | 1 |
| RCD4 | D. rerio (long form) | 1 | RCD43 | D. ethenogenes | 1 |
| RCD5 | D. rerio (short form) | 1 | RCD44 | D. radiodurans | 1 |
| RCD6 | D. melanogaster | 1 | RCD45 | D. vulgaris | 1 |
| RCD7 | A. nidulellus | 1 | RCD46 | E. faecalis | 1 |
| RCD8 | G. gallus | 1 | RCD47 | E. coli | 1 |
| RCD9 | G. pallida | 1 | RCD48 | H. ducreyi | 1 |
| RCD10 | H. contortus | 1 | RCD49 | H. influenzae | 1 |
| RCD11 | H. sapiens | 1 | RCD50 | P. multocida | 1 |
| RCD12 | H. sapiens | 1 | RCD51 | P. gingivalis | 1 |
| RCD13 | H. sapiens | 1 | RCD52 | P. aeruginosa | 1 |
| RCD14 | H. sapiens | 1 | RCD53 | P. putida | 1 |
| RCD15 | M. auratus | 1 | RCD54 | R. prowazekii | 1 |
| RCD16 | M. musculus | 1 | RCD55 | S. typhimurium | 1 |
| RCD17 | M. musculus | 1 | RCD56 | S. typhi | 1 |
| RCD18 | M. musculus | 1 | RCD57 | S. putrefaciens | 1 |
| RCD19 | N. americanus | 1 | RCD58 | S. mutans | 1 |
| RCD20 | O. volvulus | 1 | RCD59 | S. aureus | 1 |
| RCD21 | P. carinii | 1 | RCD61 | S. pneumoniae | 1 |
| RCD22 | P. pacificus | 1 | RCD62 | S. pyogenes | 1 |
| RCD23 | R. norvegicus | 1 | RCD63 | S. PCC6803 | 1 |
| RCD24 | S. pombe | 1 | RCD64 | T. pallidum | 1 |
| RCD25 | S. japonicus | 1 | RCD65 | V. cholerae | 1 |
| RCD26 | S. octosporus | 1 | RCD66 | X. campestris pv. campestris | 1 |
| RCD27 | T. marmorata | 1 | | | |
| RCD28 | X. laevis | 1 | RCD67 | X. fastidiosa | 1 |
| RCD29 | A. ferrooxidans | 1 | RCD68 | N. meningitidis | 1 |
| RCD30 | A. actinomycetemcomitans | 1 | RCD69 | L. monocytogenes | 1 |
| RCD32 | B. firmus | 1 | RCD366 | B. halodurans | 1 |
| RCD33 | B. subtilis | 1 | RCD367 | B. parapertussis | 1 |
| RCD34 | B. bronchiseptica | 1 | RCD368 | B. sp. APS | 1 |
| RCD35 | B. pertussis | 1 | RCD369 | C. psittaci | 1 |
| RCD36 | B. burgdorferi | 1 | RCD370 | C. psittaci | 1 |
| RCD37 | C. crescentus | 1 | RCD371 | C. tepidum | 1 |
| RCD38 | C. trachomatis | 1 | RCD372 | D. hafniense | 1 |
| RCD39 | C. muridarum | 1 | RCD373 | M. loti | 1 |

A summary of the overall sensitivity and specificity for all the tests is listed in Tables 4.3–4.8, in which we let Sen (Sensitivity) = $\frac{100 \times TP}{TP + FN}$ and Spe (Specificity) = $\frac{100 \times TN}{TN + FP}$, where TP = true positive (i.e., the number of correctly predicted PRF sites), FN = false negative (i.e., the number of known PRF sites that were not predicted), FP = false positive (i.e., the number of incorrectly predicted PRF sites), and TN = true negative (i.e., the number of predicted non-PRF sites that possess a required slippery sequence but are not annotated as PRF sites in the database). The str field denotes the type of the predicted 3’-stimulatory RNA structure, with SL, BH and PK standing for simple stem-loop, bulged helix and H-type pseudoknot, respectively. Unless otherwise specified, all the tests of PRooF and FSFinder2 were run with default parameters.
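As a minimal sketch of the two measures defined above, the following Python functions compute Sen and Spe as percentages from raw counts. The function names and the example counts are illustrative assumptions and are not part of PRooF.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Sen = 100 * TP / (TP + FN), expressed as a percentage."""
    if tp + fn == 0:
        raise ValueError("TP + FN must be positive")
    return 100.0 * tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Spe = 100 * TN / (TN + FP), expressed as a percentage."""
    if tn + fp == 0:
        raise ValueError("TN + FP must be positive")
    return 100.0 * tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    print(sensitivity(tp=9, fn=1))   # 9 of 10 known sites recovered
    print(specificity(tn=95, fp=5))  # 95 of 100 non-PRF candidates rejected
```

Rounding the returned values to the nearest integer gives figures in the same form as those reported in the result tables.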

Table 4.3 lists the experimental results of PRooF and FSFinder2 on the PseudoBase sequences whose −1 PRFs result in longer protein products and whose slippery sequences conform to X XXY YYZ. PRooF identified all the −1 PRF sites in this set of tested sequences, except for those in PKB80 and PKB106. PKB80 actually has two true positives whose slippery sequences are X XXY YYZ and Y YYZ, respectively. The latter was missed by PRooF, as well as by FSFinder2, because the slippery sequence used in the experiment was X XXY YYZ. However, it can be detected by PRooF if Y YYZ is chosen as the slippery sequence. The −1 PRF site in PKB106 was missed by PRooF because only the carboxyl-terminal motif of its protein product is currently registered in the InterPro database. Therefore, if only the region downstream of the slippery site is scanned for potential motifs/domains, the true −1 PRF site in PKB106 can still be detected by PRooF. In contrast to PRooF, FSFinder2 also failed to find the true −1 PRF sites in PKB2, 3 and 4, whose slippery sequences are in fact X XXY YYZ.

Table 4.3: Summary of the PRooF results for predicting the −1 PRFs of longer product on several sequences from PseudoBase using the slippery sequence X XXY YYZ

a) PKB80 has two true positives whose slippery sequences are X XXY YYZ and Y YYZ, respectively. The true positive candidate whose slippery sequence is Y YYZ was missed by PRooF and FSFinder2 because the slippery sequence used was X XXY YYZ. However, it can be found by PRooF if Y YYZ is chosen as the slippery sequence.

b) The −1 PRF site of PKB106 was missed by PRooF because only the carboxyl-terminal motif of its protein product is currently registered in the InterPro database. Therefore, if only the region downstream of the slippery site is scanned for potential motifs, the true −1 PRF site in PKB106 can still be detected by PRooF.

c) In these cases, the stimulatory RNA structures predicted by PRooF are either H-type pseudoknots or bulged helixes, whereas those produced by FSFinder2 are all simple stem-loops.

For the tested sequences from RECODE with −1 PRF sites of longer product, FSFinder2 failed to identify the true −1 PRF sites in RCD91, 96, 104, 107, 110, 115, 237, 238, 251 and 252, as shown in Table 4.4. PRooF, however, missed the sites in only three cases: RCD110, 115 and 252. The main reason for the misses in RCD110 and RCD115 is that the Y’s in their slippery sequence X XXY YYZ are C’s or G’s, instead of the default A’s or U’s. If Y is allowed to be any base within X XXY YYZ, PRooF can still identify the slippery site in RCD110. As for RCD115, another −1 PRF site starting at 1269 nt, instead of the one at 1326 nt reported in RECODE, was found by PRooF when using X XXY YYZ as the slippery sequence with Y being any base. In fact, downstream of 1326 nt we did not detect even a simple stem-loop nearby that could serve as a stable RNA structure to stimulate the programmed −1 frameshifting in RCD115. This observation suggests that the −1 PRF site of RCD115 reported in RECODE may be questionable. For RCD252, the failure of PRooF to identify the −1 PRF site has two causes. (1) The lengths of the ORFs involved in this frameshifting are less than the default minimum length (i.e., 100 nt) in PRooF. (2) The motifs/domains in the −1 PRF protein product are currently not registered in the InterPro database, so the candidate with this −1 PRF site is filtered out by PRooF in the step of verifying potential protein function. Therefore, if the minimum ORF length is set to 40 nt and the verification of protein function is disabled, the true −1 PRF site in RCD252 can, as expected, be detected by PRooF.

Table 4.4: Summary of the PRooF results for predicting the −1 PRFs of longer product on several sequences from RECODE using the slippery sequence X XXY YYZ

a) The slippery sites of RCD110 and RCD115 were missed by PRooF (and FSFinder2) because their Y’s in X XXY YYZ are C’s or G’s, instead of the default A’s or U’s. Nevertheless, PRooF, as well as FSFinder2, can still find the slippery site for RCD110 if the Y used within X XXY YYZ is any base. As for RCD115, PRooF found an alternative −1 PRF site at around 1269 nt, instead of the one reported in RECODE that starts at 1326 nt, when using X XXY YYZ with Y being any base as the slippery sequence.

b) The true positive candidate for RCD252 was also missed by PRooF, because the lengths of the involved ORFs are less than the default minimum length of 100 nt and the motifs/domains of its protein product are not registered in the InterPro database. However, it can still be detected by PRooF if the minimum ORF length is set to 40 nt and the verification of protein function is disabled.

c) In these cases, the stimulatory RNA structures predicted by PRooF are either H-type pseudoknots or bulged helixes, whereas those produced by FSFinder2 are all simple stem-loops.
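The effect of restricting Y to A or U, as opposed to allowing any base, can be illustrated with a short Python sketch that scans an RNA sequence for heptamers of the form X XXY YYZ. This is not the PRooF kernel (which is written in C); the function names are hypothetical, and treating X and Z as unrestricted bases is an assumption made here for illustration.

```python
import re

BASES = "ACGU"


def slippery_regex(y_bases: str = "AU") -> "re.Pattern[str]":
    """Build a regex for X XXY YYZ: three identical X, three identical Y, any Z.

    y_bases lists the bases allowed for Y (default A or U, as in the text).
    A lookahead is used so that overlapping heptamers are all reported.
    """
    y_class = "".join(sorted(set(y_bases.upper())))
    return re.compile(rf"(?=(([{BASES}])\2\2([{y_class}])\3\3[{BASES}]))")


def find_slippery_sites(rna: str, y_bases: str = "AU") -> list:
    """Return (1-based start position, heptamer) for every match in rna."""
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    pattern = slippery_regex(y_bases)
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(rna)]


if __name__ == "__main__":
    seq = "AAGGGCCCUAA"                              # made-up sequence, Y = C
    print(find_slippery_sites(seq))                  # default Y in {A, U}
    print(find_slippery_sites(seq, y_bases="ACGU"))  # Y may be any base
```

With the default setting the made-up heptamer GGGCCCU is not reported because its Y is C; relaxing Y to any base reports it, which mirrors the situation described for RCD110 and RCD115.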

Table 4.5 lists the experimental results obtained by PRooF and FSFinder2 for the tested sequences whose −1 PRF slippery sequences conform to Y YYZ instead of X XXY YYZ. Here PRooF missed the slippery site in RCD114, whereas FSFinder2 missed those in RCD99 and RCD114. PRooF failed to detect the −1 PRF site in RCD114 because the involved ORFs were short and the protein sequence in the region downstream of the slippery site contained no motifs/domains currently registered in the InterPro database. As expected, the site can still be detected by PRooF with a minimum ORF length of 50 nt and with the protein function verified only for the region upstream of the slippery site. Inevitably, both PRooF and FSFinder2 generate more false positives with Y YYZ than with X XXY YYZ. Nevertheless, the numbers of false positives generated by PRooF are still small in all the tested sequences, except for PKB127. In the case of PKB127, PRooF found a total of nine partially overlapping ORFs, five of which were screened out for the lack of possible protein motifs/domains. Subsequently, PRooF identified one true positive −1 PRF site, along with 11 false positives, in the four remaining overlapping ORFs. Notably, six of these 11 false positives were derived from one set of overlapping ORFs and four from another. That is, a single overlapping region gave many false positives in the output. According to the −1 PRF model, however, there should be at most one true −1 PRF site in each pair of overlapping ORFs. Furthermore, our results show that a true −1 PRF site is usually accompanied by a 3’-stimulatory RNA structure of lower free energy. Therefore, the number of false positives in PKB127 could be reduced further if PRooF additionally filtered out the candidates whose predicted RNA structures have high free energy and the redundant candidates from the same overlapping ORFs.

Table 4.5: Summary of the PRooF results for predicting the −1 PRFs of longer product on several sequences from PseudoBase and RECODE using the slippery sequence Y YYZ

a) This true positive −1 PRF site in RCD114 can be detected by PRooF if the minimum ORF length is set to 50 nt and only the region upstream of the slippery site is scanned for potential motifs/domains.

b) In these cases, the stimulatory RNA structures predicted by PRooF are either H-type pseudoknots or bulged helixes, whereas those produced by FSFinder2 are all simple stem-loops.
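The additional filtering suggested for PKB127 could be sketched as follows: discard candidates whose predicted structure has high free energy, then keep at most one candidate (the one with the lowest free energy) per set of overlapping ORFs. This is only an illustration of the idea, not a feature of PRooF as evaluated here; the `Candidate` fields, the ORF-pair labels and the energy threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    orf_pair: str       # hypothetical label of the overlapping ORFs
    position: int       # start of the slippery site (nt)
    free_energy: float  # predicted structure energy in kcal/mol; lower is more stable


def filter_candidates(candidates, max_energy=-5.0):
    """Keep the most stable candidate per ORF pair, dropping unstable ones.

    max_energy is an assumed cutoff: candidates whose predicted structure
    has a free energy above it are discarded before the per-pair selection.
    """
    best = {}
    for cand in candidates:
        if cand.free_energy > max_energy:
            continue  # structure too unstable to act as a stimulator
        current = best.get(cand.orf_pair)
        if current is None or cand.free_energy < current.free_energy:
            best[cand.orf_pair] = cand
    return sorted(best.values(), key=lambda c: c.position)


if __name__ == "__main__":
    # Made-up candidates for illustration only.
    cands = [
        Candidate("ORF1/ORF2", 100, -12.0),
        Candidate("ORF1/ORF2", 150, -6.0),
        Candidate("ORF1/ORF2", 200, -2.0),
        Candidate("ORF3/ORF4", 400, -8.5),
    ]
    for cand in filter_candidates(cands):
        print(cand)
```

Keeping a single candidate per overlapping region follows from the observation that the −1 PRF model allows at most one true site per pair of overlapping ORFs; choosing the lowest-energy candidate is a heuristic and could discard a true site whose stimulator is predicted to be less stable.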

For the sequences with known −1 PRF sites of shorter product, as listed in Table 4.6, PRooF detected all their slippery sites, whereas FSFinder2 failed in the cases of RCD364 and 365. Moreover, the stimulatory RNA structures detected by PRooF are H-type pseudoknots or bulged helixes, whereas all the RNA structures predicted by FSFinder2 are just simple stem-loops. The same pattern can also be observed in the other experiments, as demonstrated in Tables 4.3–4.5.

Table 4.6: Summary of the PRooF results for predicting the −1 PRFs of shorter product on several sequences from RECODE using the slippery sequence X XXY YYZ

Tables 4.7 and 4.8 present the experimental results of detecting +1 PRF sites on several sequences from the RECODE database. The tested sequences used in Table 4.7 are related to the prfB genes from many bacterial genomes, whose frameshifting sites, as mentioned before, have no downstream RNA structures to serve as stimulators. Hence, we tested these sequences with PRooF by selecting CUU URA C (the form most commonly found in the prfB genes) as the slippery sequence, along with detecting their SD-like sequences, but with the detection of stimulatory RNA structures disabled. In Table 4.8, the sequences we tested are related to the oaz genes from several eukaryotic genomes, whose +1 PRF sites have 3’-stimulatory RNA structures. Therefore, we tested them with PRooF using UUU UGA or YCC UGA, which are common in the oaz genes, as the slippery sequence. PRooF had better sensitivity than FSFinder2: it detected the +1 PRF sites on all tested sequences except RCD43, and it predicted H-type pseudoknots or bulged helixes as the stimulatory RNA structures on all sequences except RCD13. The failure to detect the frameshifting site in RCD43 was due to the absence of any pre-defined SD-like sequence upstream of the slippery site. Hence, PRooF can correctly detect it if the detection of SD-like sequences is disabled.

Table 4.7: Summary of the PRooF results for predicting the +1 PRFs on several sequences from RECODE using the slippery sequence CUU URA C and without detecting downstream RNA structure

a) For RCD43, the true positive candidate was missed by PRooF with default parameters. However, it can still be found by PRooF if the detection of SD-like sequences is disabled.

Table 4.8: Summary of the PRooF results for predicting the +1 PRFs on several sequences from RECODE using the slippery sequence UUU UGA or YCC UGA

a) In these cases, the stimulatory RNA structures predicted by PRooF are either H-type pseudoknots or bulged helixes, whereas those produced by FSFinder2 are all simple stem-loops.
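The +1 PRF slippery patterns used above (CUU URA C, UUU UGA and YCC UGA) can be matched by expanding their ambiguity symbols into character classes. The sketch below assumes that R and Y in these patterns follow the IUPAC convention (R = A or G, Y = C or U), which differs from the placeholder use of Y in X XXY YYZ; the helper names are hypothetical and the code does not reproduce the SD-like sequence detection of PRooF.

```python
import re

# Subset of the IUPAC nucleotide codes needed for the patterns in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]",    # purine
    "Y": "[CU]",    # pyrimidine
    "N": "[ACGU]",  # any base
}


def iupac_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a pattern such as 'CUU URA C' into a compiled regex.

    Spaces (codon boundaries) are ignored; a lookahead keeps overlapping hits.
    """
    body = "".join(IUPAC[ch] for ch in pattern.replace(" ", "").upper())
    return re.compile(f"(?=({body}))")


def find_sites(rna: str, patterns) -> list:
    """Return sorted (1-based start, pattern, matched text) for all patterns."""
    rna = rna.upper().replace("T", "U")
    hits = []
    for pat in patterns:
        for m in iupac_to_regex(pat).finditer(rna):
            hits.append((m.start() + 1, pat, m.group(1)))
    return sorted(hits)


if __name__ == "__main__":
    # Made-up sequences for illustration only.
    print(find_sites("GGCUUUGACAA", ["CUU URA C"]))
    print(find_sites("AAUCCUGAUUUUGAGG", ["UUU UGA", "YCC UGA"]))
```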

Generally speaking, the average sensitivity and specificity of PRooF are both better than those of FSFinder2, as shown in Table 4.9. In particular, PRooF greatly improves the sensitivity when compared with FSFinder2 (96 versus 73). In addition, almost all the stimulatory RNA structures predicted by PRooF are either H-type pseudoknots or bulged helixes, except those for PKB42 in Table 4.3, RCD72, 107 and 108 in Table 4.4 and RCD13 in Table 4.8. Recall that H-type pseudoknots and bulged helixes share the structural feature of a bent conformation, and are structurally more complex and more stable than simple stem-loops. Therefore, they are believed to be more effective in promoting the efficiency of −1 PRFs and some +1 PRFs. As for PKB42 and RCD72, 107, 108 and 13, the stimulators found by PRooF are just simple stem-loops, and neither a stable H-type pseudoknot nor a bulged helix was detected downstream of their slippery sites. In contrast, a great number of the stimulatory RNA structures identified by FSFinder2 are just simple stem-loops, because the algorithm employed by FSFinder2 for RNA structure prediction first searches for possible stem-loops (without bulges or interior loops) by examining the nucleotides in both directions from every pivot for possible base pairing, and then considers any two simple stem-loops to be an H-type pseudoknot if they cross each other. In addition, it is worth mentioning that some simple stem-loops predicted by FSFinder2 (such as those for RCD72 and 108) do not seem to be stable RNA structures, since their loops are only 1 nt long, leading to sharp stem-loops.

Table 4.9: The average sensitivity and specificity of −1 and +1 PRF prediction using PRooF and FSFinder2

| −1 and +1 PRF prediction | Average sensitivity | Average specificity |
|---|---|---|
| PRooF | $\frac{149}{149+7} \times 100 = 96$ | $\frac{4288}{4288+37} \times 100 = 99$ |
| FSFinder2 | $\frac{114}{114+42} \times 100 = 73$ | $\frac{4255}{4255+70} \times 100 = 98$ |

The total TP, FN, TN and FP in Tables 4.3–4.8 for −1 and +1 PRF prediction are 149, 7, 4288 and 37, respectively, for PRooF, and 114, 42, 4255 and 70, respectively, for FSFinder2.

Recall that the stimulatory RNA structure in the −1 PRF of HIV-1 was first thought to be a simple stem-loop, but was later shown experimentally to be a bulged helix. Interestingly, the stimulatory RNA structure predicted by PRooF for the −1 PRF of HIV-1 (i.e., RCD82) is indeed a bulged helix, the same as that determined by Gaudin et al. [26] using heteronuclear NMR spectroscopy, whereas the one predicted by FSFinder2 is just a simple stem-loop. It would be worthwhile to determine experimentally the stimulatory RNA structures for −1 and +1 PRF sites in other similar cases, where the structures predicted by PRooF are H-type pseudoknots or bulged helixes but those predicted by FSFinder2 or reported in the literature are just simple stem-loops.

Chapter 5 Conclusions

In this thesis, we studied and designed a bioinformatics approach for automatically
