lin-32 encodes a basic helix-loop-helix transcription factor that is required for the development of several types of neurons, including the touch receptor neurons and the male sensory ray neurons (Krause et al., 1997; Portman and Emmons, 2000). This sequence set contains nine upstream sequences of genes regulated by lin-32, with lengths ranging from 326 to 1050 bp. The GSS2 prediction is listed in Table 7. Its accuracy cannot be verified here because no experimental report is available for comparison. In our opinion, however, the result is plausible because the reactive bases in the consensus sequence are close together and contain a TG group. These TFBS are therefore strong and meaningful signals, and they are possibly the lin-32 binding sites.
hm17r (Tompa et al., 2005)
We also test a sequence set from the assessment system designed by Tompa et al. (2005). The whole dataset in this assessment system comprises three classes, each of which contains 56 sequence sets. The dataset is taken entirely from real genomes, and its TFBS are very hard to find because they have varied features for binding zinc-finger, HTH, HLH, leucine-zipper and other regulators. The most intractable difficulty is the presence of insertion and deletion errors. Because the dataset is designed for assessing TFBS finding tools that run in TCM (zero or more occurrences per sequence) mode, most of its sequence sets are not appropriate for testing the proposed mixed 0-1 linear program, which was originally conceived for OOPS (One Occurrence Per Sequence) mode. The sequence set used here is hm17r, a human DNA sequence set from the real class of Tompa’s dataset.
Every sequence in hm17r is 500 bp long with a TFBS for a dimerized regulator. The prediction is listed in Table 8. Compared with the answer of Tompa et al. (2005), the prediction shows two positional differences: the TFBS position in Seq_3 is -328, whereas Tompa’s answer is -173, and the TFBS position in Seq_9 is -173, whereas Tompa’s answer is -138. In addition, Tompa’s answer gives no occurrence in Seq_5. All of the differing answers provided by Tompa have fewer matches than the TFBS found by GSS2. These tests illustrate that the determined consensus helps to determine most TFBS successfully, and the result can be regarded as good.

Table 6 A Prediction of daf-19 binding sites.
Gene        GSS2                              Score        Swoboda et al. (2000)
regulated   ...GTT.CCATGG.AAC...    Posi.     (total 85)   Motifs            Posi.
che-2       ctgGTTgTCATGGtGACtgc    -57       10           GTTgTCATGGtGAC    -130
daf-19      ttgGTTtCCATGGaAACtac    -109      12           GTTtCCATGGaAAC    -109
osm-1       attGTAtCCATACcAACatc    -1211     9            GCTaCCATGGcAAC    -86
osm-6       catGTTaCCATAGtAACcac    -100      11           GTTaCCATAGtAAC    -100
xbx-1       cccGTTtCCATGGtAACcgt    -79       12           GTTtCCATGGtAAC    -79
dyf-3       ggaGTTtCTATGGgAACgga    -88       11           N/A               N/A
pkd-2       tccGTTtCTATGCaAAAaac    -231      9            N/A               N/A
xbx-4       ctaGTTgCCATGAcAACcgc    -35       11           N/A               N/A

Table 7 A Prediction of lin-32 binding sites.
Gene                    Consensus Sequence       Score
regulated   position    TGAAA (9) TTTCA          (total 78)
hlh-2       -457        tGGAAAtattaaagaATTCTt    7
cfi-1       -738        tTAAAAttaaattatTTTCAa    9
cwp-4       -332        tTTAAAtatatttttTTTCAg    9
egl-46      -239        gTGAAAattgactagATTCAc    9
lin-22      -348        tTGAATtttctgggaTTTCTt    8
mab-3       -184        tTGAAAatttgacttTTCCAc    9
mab-5       -56         gTGAAAtatgtgtcgTTTCAc    10
tbb-4       -300        cAGAAAaagtcaacaTTACAg    8
twk-21      -374        cTGAAAattcaagtaTTTAAa    9

Table 8 A Prediction of DNA motifs in hm17r sequence set.
Seq.        GSS2                               Score        Tompa et al. (2005)
name        ...GGGAA.TTCCC...        Posi.     (total 97)   Motifs                Posi.
Seq_0       actccGGGAAtTTCCCtggcc    -83       10           tccGGGAAtTTCCCtg      -81
Seq_1       gctccGGGAAtTTCCCtggcc    -83       10           tccGGGAAtTTCCCtg      -81
Seq_2       gctccGGGAAtTTCCCtggcc    -85       10           tccGGGAAtTTCCCtg      -83
Seq_3       ctccgGGGAAgTTGGCagtat    -328      8            gcttggaaattccggagc    -173
Seq_4       aaagtGGGAAaTTCCTctgaa    -144      9            gtGGGAAaTTCC          -141
Seq_5       gtatcGGGAAtTGCTCcctcc    -274      8            No instances          N/A
Seq_6       ggcagGGGAAtCTCCCtctcc    -274      9            gGGGAAtCTCC           -270
Seq_7       aatgtGGGATtTTCCCatgag    -79       9            aaatgtGGGATtTTCCC     -80
Seq_8       aatcgTGGAAtTTCCTctgac    -86       8            GGAAtTTCCT            -80
Seq_9       catcgTGGATaTTCCCgggaa    -173      8            attggggatttcctc       -138
Seq_10      gccctGGGGGcTTCCCcgggc    -136      8            tGGGGGcTTCCCc         -132
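The per-site scores in Tables 6 to 8 are consistent with a simple count of the reactive-base positions at which a site agrees with the consensus, ignoring inactive positions. The sketch below illustrates that reading. It is an inference from the table values, not a description of how GSS2 computes its scores, and the function name `match_score` and the use of `.` for inactive positions are assumptions made for illustration.

```python
def match_score(consensus: str, site: str) -> int:
    """Count reactive positions where `site` agrees with `consensus`.

    A '.' in the consensus marks an inactive (spacer) position that is
    ignored. The comparison is case-insensitive.
    """
    if len(consensus) != len(site):
        raise ValueError("consensus and site must have equal length")
    return sum(
        1
        for c, s in zip(consensus.upper(), site.upper())
        if c != "." and c == s
    )


# Table 6 consensus (flanking dots dropped).
DAF19 = "GTT.CCATGG.AAC"
print(match_score(DAF19, "GTTtCCATGGaAAC"))  # daf-19 row: 12
print(match_score(DAF19, "GTTgTCATGGtGAC"))  # che-2 row: 10

# Table 7 consensus: two half-sites separated by 9 inactive bases.
LIN32 = "TGAAA" + "." * 9 + "TTTCA"
print(match_score(LIN32, "GGAAAtattaaagaATTCT"))  # hlh-2 row: 7
```

Summing such scores over all rows of a table reproduces the totals shown in the Score column headers (85, 78 and 97).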
5.7 Software Package: “Global Site Seer v2”
A software package, “Global Site Seer 2.0”, is designed for pattern-free TFBS finding and is available at http://www.iim.nctu.edu.tw/~cjfu/gss2.htm.
Chapter 6 Discussion
6.1 Features of Proposed Methods
This study proposes a mixed 0-1 linear programming approach for searching TFBS under various conditions. Its final result is a mixed 0-1 linear program for solving pattern-free TFBS finding problems. Advantageous features of this approach include:
(i) A pattern-driven design that can search longer patterns than current enumeration approaches. Because only the reactive bases of the consensus sequence are enumerated, the computational time is notably reduced.
(ii) A globally optimal consensus is guaranteed. By the nature of a mixed 0-1 linear program, the consensus sequence with the maximum number of matches is certain to be obtained.
(iii) No prerequisite shared pattern is needed. The proposed method can search for the TFBS of an undiscovered regulation using limited information, such as the length of the regulatory region and the number of reactive bases.
(iv) It can identify TFBS with spacers dispersed in the regulatory region. Most current TFBS finding methods have difficulty searching for patterns that contain inactive bases. In contrast, the proposed method benefits from these inactive bases because they prune the search space.
(v) Structural features can be incorporated. Various structural features of TFBS, e.g. an inverted palindrome or a direct repeat, can be formulated in the proposed method to help prune the search space and improve precision.
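As an illustration of the structural features named in item (v), the sketch below checks whether two half-sites form an inverted palindrome (one is the reverse complement of the other) or a direct repeat. It is a stand-alone check on candidate half-sites, not the constraint formulation used in the proposed linear program, and the function names are hypothetical.

```python
# Translation table for complementing DNA bases, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_inverted_palindrome(left: str, right: str) -> bool:
    """True if `right` is the reverse complement of `left`."""
    return reverse_complement(left).upper() == right.upper()


def is_direct_repeat(left: str, right: str) -> bool:
    """True if both half-sites carry the same bases."""
    return left.upper() == right.upper()


# Half-sites of the consensus sequences reported in Tables 7 and 8.
print(is_inverted_palindrome("TGAAA", "TTTCA"))  # True
print(is_inverted_palindrome("GGGAA", "TTCCC"))  # True
print(is_direct_repeat("TGAAA", "TTTCA"))        # False
```

When such a relation is imposed, the bases of one half-site determine those of the other, so only half of the reactive bases remain free, which is one way to see why the feature prunes the search space.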
This approach also has several weaknesses as follows:
(i) Computational time grows exponentially with the number of reactive bases. A notable advantage over current pattern-driven enumeration methods is that the critical factor in search time is the number of reactive bases rather than the pattern length; nevertheless, limitations on the length of regulatory sites still exist.
(ii) Only one solution is obtained. By the nature of an optimization program, the proposed method cannot search for multiple patterns simultaneously. Finding suboptimal solutions with this approach still requires separate programs in which the previously obtained optimal consensus sequences are excluded.
(iii) It is difficult to search for a consensus with base variability. The proposed method uses consensus sequences composed only of the four distinct nucleotide types. The consensus sequence is a distinct, ideal model of the TFBS, and only exact base matches within sites contribute to the matching score. In fact, however, some reactive bases may be replaceable by other nucleotides that have similar sensitivity to regulators.
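Weaknesses (i) and (ii) can be made concrete with a small brute-force stand-in for the optimization. The sketch below is not the mixed 0-1 linear program itself: it simply enumerates all 4^k assignments of k reactive bases under OOPS mode, which shows the exponential dependence on k, and it accepts a set of excluded consensus sequences so that a second run returns the next-best pattern. The layout notation (`x` for reactive, `.` for inactive), the function names and the toy sequences are all assumptions made for illustration.

```python
from itertools import product


def match_score(consensus: str, site: str) -> int:
    """Count reactive positions ('.' is inactive) where site matches."""
    return sum(
        1
        for c, s in zip(consensus.upper(), site.upper())
        if c != "." and c == s
    )


def best_consensus(sequences, layout, exclude=frozenset()):
    """Exhaustively find the best consensus under OOPS mode.

    `layout` marks reactive positions with 'x' and inactive ones with '.'.
    Every sequence must be at least as long as the layout. Returns
    (total_score, consensus); consensus strings in `exclude` are skipped.
    """
    k = layout.count("x")
    width = len(layout)
    best = (-1, None)
    for bases in product("ACGT", repeat=k):  # 4**k candidates
        it = iter(bases)
        cons = "".join(next(it) if ch == "x" else "." for ch in layout)
        if cons in exclude:
            continue
        # OOPS: each sequence contributes its single best-matching window.
        total = sum(
            max(
                match_score(cons, seq[i:i + width])
                for i in range(len(seq) - width + 1)
            )
            for seq in sequences
        )
        if total > best[0]:
            best = (total, cons)
    return best


# Toy sequences (hypothetical, not taken from the datasets above).
seqs = ["ttGTTaAACcc", "aGTTcAACgg", "ccGTTgAATt"]
first = best_consensus(seqs, "xxx.xxx")
second = best_consensus(seqs, "xxx.xxx", exclude={first[1]})
print(first)   # (17, 'GTT.AAC')
print(second)  # next-best consensus, found by excluding the first
```

Each additional reactive base multiplies the number of candidates by four, whereas adding inactive positions to the layout leaves the candidate count unchanged.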
By virtue of its pattern-driven, mixed 0-1 linear programming design, the proposed method can find the optimal consensus in an acceptable computational time. Its greatest advantage over current heuristic methods is the capability of embedding logical constraints. These logical constraints, which express many kinds of specific features and exclusive rules, notably increase precision and efficiency.
6.2 Issues in Approach Design
Depending on the assumed number of occurrences in each sequence, there are several different search modes for the computer-based determination of transcription factor binding sites. These modes are generally defined in studies of sequence-driven approaches, which apply probability models to search iteratively for the most significant conserved signal. CONSENSUS (Hertz and Stormo, 1995), a statistics-based system for identifying consensus patterns in DNA and protein sequences, provides three search modes: One Occurrence per Sequence (OOPS), One or More Occurrences per Sequence (OMOPS) and Zero or More Occurrences per Sequence (ZMOPS). Another TFBS search tool, MEME (Bailey et al., 1995), can also search for motifs under three different modes: One Occurrence per Sequence (OOPS), Zero or One Occurrence per Sequence (ZOOPS) and Two-Component Mixture (TCM), in which each sequence may contain any number of non-overlapping occurrences of each motif.
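The five modes differ only in how many occurrences each sequence is allowed to contain, which can be summarized as lower and upper bounds on the per-sequence count. The following sketch encodes that summary. The names `Mode` and `satisfies` are hypothetical, TCM is treated simply as zero or more occurrences, and the non-overlap requirement is not modeled.

```python
from enum import Enum


class Mode(Enum):
    OOPS = "one occurrence per sequence"
    ZOOPS = "zero or one occurrence per sequence"
    OMOPS = "one or more occurrences per sequence"
    ZMOPS = "zero or more occurrences per sequence"
    TCM = "two-component mixture (any number of occurrences)"


# (minimum, maximum) occurrences allowed per sequence; None = unbounded.
_BOUNDS = {
    Mode.OOPS: (1, 1),
    Mode.ZOOPS: (0, 1),
    Mode.OMOPS: (1, None),
    Mode.ZMOPS: (0, None),
    Mode.TCM: (0, None),
}


def satisfies(mode: Mode, counts) -> bool:
    """True if every per-sequence occurrence count is allowed by `mode`."""
    low, high = _BOUNDS[mode]
    return all(
        n >= low and (high is None or n <= high)
        for n in counts
    )


# Occurrence counts for three sequences, one of which has no site.
counts = [1, 0, 1]
print(satisfies(Mode.OOPS, counts))   # False
print(satisfies(Mode.ZOOPS, counts))  # True
```

Under this view, OOPS and OMOPS are the modes whose lower bound is one, that is, the modes that assume every sequence in the set carries at least one site.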
Which search mode is appropriate depends on the purpose of the motif finding work. When a sequence set is assembled from an arbitrary combination of upstream sequences of various genes and the purpose is to discover any possible regulation, search tools capable of handling TCM mode are clearly more appropriate. For analyzing the function of a particular regulator, the sequence set should be prepared more carefully from sequences upstream of genes regulated by the target transcription factor. In this case, OOPS and OMOPS are more suitable for finding the DNA motifs precisely.
The proposed approach is designed only for searching sequences in OOPS or OMOPS mode and is very powerful when analyzing a specific regulatory function. It is not appropriate for searching sequences in ZOOPS, ZMOPS and TCM modes for any possible undiscovered regulation.