2011.06.07
林琬瑜、陳筱雯、何家全、賴宜萍 魏昭寧、洪晧瑜、吳冠龍、蔡長蓉
LOCATING WELL-CONSERVED REGION S WITHIN A PAIRWISE ALIGNMENT
1
Paper Info
Author
Kun-Mao Chao, Ross C. Hardison, Webb Miller
Department of Computer Science, Pennsylvania State University, University Park 16802.
Publication
Comput Appl Biosci (1993) 9 (4): 387-396.
doi: 10.1093/bioinformatics/9.4.387
Cited by 21 2
Outline
Introduction
Algorithms
Basic dynamic programming alignment
Affined-gap penalty alignment
Hirschberg’s optimal path method
Suboptimal alignment
Robustness and uniqueness
Implementation
Examples
Discussion & Conclusion
3
R99944016 林琬瑜
INTRODUCTION
4
Introduction
Within a single alignment of 2 DNA/protein sequences..
The optimal path does not necessarily yield th e best equivalence of residues assessed by str uctural or functional criteria
Some regions may be more conserved than others
May reveal a region that possesses an important fu nction
An automatic process is needed to compute t he conservation degree along the alignments 5
Introduction
The reliability of different regions within an alignment is widely appreciated
Vingron and Argos, 1990
Determination of reliable regions in protein seque nce alignment
Zuker, 1991
Suboptimal sequence alignment in molecular biology : alignment with error analysis
Friemann and Schmitz, 1992
A new approach for displaying identities and diffe rences among aligned amino acid sequences
6
Introduction
One approach to obtain such information is to determine suboptimal alignments
Alignments come within a specified tolerance o f the optimum score
Waterman and Byers, 1985
A dynamic programing algorithm to find all solutions in a neighborhood of the optimum
Saqi and Sternberg, 1991
A simple method to generate non-trivial alternative alignments of protein sequences
7
Introduction
Many of these earlier papers modify the cla ssic DP for pairwise sequence alignment so that it computes additional “robustness”
information
The problem is..
Require space proportional to the product of the s equence lengths
Impractical for long protein sequence
8
Purpose of this paper
Extend Hirschberg’s approach to compute additio nal information that in some sense indicates the degree to which the aligned residues are conserv ed.
Two main methods are developed
Can also efficiently handle the case where alignmen ts are constrained
For each position i of the 1st sequence, can be aligned only to positions between L[i] and R[i] of the second s equence
All methods run in linear space and score-only time
9
10
To illustrate the effectiveness
One of the methods is used to locate partic ularly well-conserved regions in:
β-globin gene
Control region
γ-globin gene
5’flank
ALGORITHMS
Basic dynamic programming alignment Affined-gap penalty alignment
Hirschberg’s optimal path method Suboptimal alignment
Left extent finding Right extent finding
Robustness and uniqueness 11
R99922073 陳筱雯
ALGORITHMS – BASIC PREFACE
12
Algorithm Flow
13
Affine-gap penalty alignment
Basic dynamic programming alignment
L score
HR score R
score
Robustness Measure
Compute Region
uniqu e
Dynamic-programming Method
0
M
N
(i, j) (i-1, (i, j- j)
1)
(i-1, j- 1)
14
Dynamic-programming Method
Time complexity
Space complexity
15
Affine Gap Penalties
Myers and Miller (1989)
Each gap is accessed with an additional pen alty.
Compute the highest score path from to .
16
DD DD
DD DD
SS SS
SS SS
II II
II II
(i-1, j- 1)
(i-1, j)
(i, j- 1)
(i, j)
Affine Gap Penalties
D (i , j)=���
{
��(� −1, �(� −1, �)−)� − �− �
Here, a gap of length k is penalized by x+ky.
Twins
17
I (i , j)=���
{
�(�� , � −1(� , � − 1)−)−� − ��
S (i , j)=���
{
�(� −1, � −1�(� , �)� (� , �))+�(��,��)
R99945005 何家全
THE LEFT EXTENT OF SUBOPTIMAL ALI GNMENTS
18
The optimal path
Since the time and space complexity for the simple dynamic programming is poor, we want a more efficient method for alignments.
Hirschberg’s method might be a good choice to find optimal path.
19
Hirschberg’s method
0
M
N
M/2
����� −
+ ¿
����� ¿
20
The optimal path
0
M
N
M/2
21
The optimal path
0
M
N
M/2 M/4
3M/
4
22
The optimal path
Space complexity:
Only one row saved in memory
O(N)
Time complexity:
Each turn with half space removed
MN+0.5MN+0.25MN+…=2MN
O(MN)
23
The optimal path
Optimal path is not enough for biologists.
WE WANT MORE!
24
The suboptimal path region
We need to set a threshold value.
All suboptimal paths with score higher than the threshold would be considered.
25
The modified Hirschberg’s method
0
M
N
M/2
Suboptimal regions
26
The modified Hirschberg’s method
How could we solve the subproblems in the a bove figure?
27
The modified Hirschberg’s method
0
M
N
����� −
����� +¿ ¿
28
The modified Hirschberg’s method
0
M
N
M/2
����� −
+ ¿
����� ¿
29
The modified Hirschberg’s method
0
M
N
M/2
Suboptimal regions
30
The modified Hirschberg’s method
0
M
N
M/2
Suboptimal
regions
����� −
+ ¿
�����
¿L-
value
31
The modified Hirschberg’s method
The L-value means the index of the leftmost column for certain row.
In that, we could always get the L-values f or each row of the concerned matrix.
32
The modified Hirschberg’s method
0
M
N
M/2
33
R99945018 賴宜萍
THE ROBUSTNESS MEASURE FOR EACH E DGE OF AN OPTIMAL PATH
34
Robustness measure
For each edge of an optimal path, let the r obustness measure of that edge be the diffe rence between the optimal score and the hig hest score of all paths [i.e. from (0,0)s t o (M,N)s] that do not use the edge or its t win
RM value
RM[i] >= 0
DD DD
DD DD
SS SS
SS SS
II II
II II
(i-1, j-1) (i-1, j)
(i, j-1) (i, j)
35
The only problem is that
Cannot guarantee that the second highest sc oring edge entering row (s + t)/2 must lie within the rectangle
36
HR[i] value
The highest score over all edges entering r ow i that are not contained within a pendin g subproblem
The only remaining problem is to update HR in linear space and score-only time when a problem is subdivided
37
Strategy
Account for the scores of edges in the regi on denoted Δ, and do this for all i betwee n s and
(s + t)/2 in linear space
38
Strategy
Make a forward pass to propagate values of Score- to nodes along the vertical segment f rom (s, jbest) to ((s + t)/2, jbest)
39
A backward pass beginning at row (s + t)/2 propagates values of Score+ to the right and bottom borders of Γ, so edges across Γ ca n be scored
Strategy
40
41
Strategy
0
M
N
M/2
����� −
����� +¿
¿
HR[i
] i
42
Robustness measure
0
M
N
M/2
HR[i ]
RM[i ]
R99945053 魏昭寧
COMPUTATION WITHIN A REGION
43
Computation within a region
Suppose we have computed the left and right extents of suboptimal paths rel ative to a given threshold score and that we now want to repeat the process for a lar ger threshold.
44
Computation within a region
Our strategy is to cover the region with a small number of non-intersecting upright re ctangles whose total area does not exceed t wice that of the rectangle’s intersection with the given region
45
Computation within a region
The total length of the dark segments in Fig ure is linear in N.
* Lemma. The sum of the segment widths R [i] L[i] + 1 for i ∈I is at most 3N.−
46
0 N
M
Computation within a region
47
0 N
M
0 N
M
Computation within a region
48
Computation within a region
assume that L[0] = 0 ≤ R[0], L[M] ≤ N = R [M], and L and R are monotonic in the sense that, e.g., L[i 1] ≤ L[i] for 1 ≤ i ≤ − M.
M
N
49
0 N
M
Computation within a region
i=
1
50
} }
N
M
} }
Computation within a region
(i0,L[i0])
(i,R[i])
51
N
M
} }
Computation within a region
(i0,L[i0])
(i,R[i])
52
R99945051 洪晧瑜
PROOF
53
Lemma
Suppose i∈I , where I is computed by Figur e 8. Let the intersection of row i with the given region extend from column j to column k, i.e., j = L[i] and k = R[i].
De ne c[i] to be k j + 1 and fi −
C[i] = Σ{c[i] for q∈I and q ≤ i}
Then C[i] ≤ j + 2k.
54
Compute I
55
s i
0
i
5i
8i
17
I
C(i
8)
56
Proof (induction on i)
i = i0(j = j0 = 0, k = k0)
C[i0] = c[i0] ≤ k0 + 1 ≤ j0 + 2k0 Hol d
Let i = i‘ (j = j’, k = k’)
Suppose C[i‘] ≤ j’ + 2k’ Hold
57
Two cases
Case1: C ase2:
58
Case1:
C[i] = ≤ ≤ =
k’ – j j - + 1
j’
≥
k – j + 1
k – j + 1
k - k’
+ (k – j +
1) + (k’ – j + 1)
+ (k - k’)
C[i‘] ≤ j’ + 2k’
C[i‘]
j’ + 2k’
+ 2(k - k’)
+ (j - j’)
j’ + 2k’
j + 2k
59
Case2: (1/2)
(k k’) + ( j j’) − − ≥ =
≥
k’− j + 1 (k − ( j − k’)
j’) m− k’− m+1 j
≤
m−
j’
(k’-
m+1 )
+ 1 (m−
j’)
k’− j + 1
+(m−
j)
+ 1 + 1
+(m−
j) k − k’ ≥ 2(m − j) + 1
(k- k’)
+ (k’-j+1 ) + (j-j’ ) > 2(k’-j +1)≥
+ 1
60
Case2: (2/2)
C[i] = = ≤ =
≤ =
C[i‘] ≤ j’ + 2k’
k − k’
k’− m+1 m−
j
k – j + 1
k − k’ ≥ 2(m − j) + 1 + (k – j +
1) + (m−
j )
+ (k’−
m+1) + (k − + (m− k’)
j )
( j − j’)
(m−
j+1)
+ (m− j+
1) + ( j
− j’)
+ (k − k’)
C[i’]
C[i’]
C[i’]
C[i’] + ( j
− j’)
+ 2(m − j) + 1 + (k − j’ + k’)
2k’
+ ( j − j’)
+ (k - k’)
+ (k − k’)
≥ j + 2k
61
R99944013 吳冠龍
IMPLEMENTATION
62
Implementation
Left-right
determines both the left and right boundaries
Robust
computes the robustness measure for all edges (inc luding horizontal edges) of an optimal alignment
Unique (substantially faster)
only a bit telling if the aligned pair is in all o ptimal alignments
if there is a unique optimal edge entering the mid dle row, then recomputation within the subproblems is not needed to complete the division step
63
Environment
Developed on Sun workstations running SunOS Unix
Written in C
Tool
http://www.csie.ntu.edu.tw/~kmchao/tools/robus t/
64
R99944013 吳冠龍
EXAMPLE - 1
65
Example - 1
Applied to alignments between regulatory re gions of the β-like globin gene clusters o f humans and rabbit
Robustness property is interpreted as indic ating particularly well-conserved sections of the alignment.
66
DNA Binding Site
Represso r
Represso r
Activator Activator
67
Hypersensitive site 4 (HS4) by the β-g lobin locus control region
The study [Lowrey et al., 1992] has implica ted binding sites for the transcription fac tors NFE2/AP1 and GATA1 in HS4 function
The percent identity calculated for this HS 4 core is no greater than for most other se gments in this region
68
Hypersensitive site 4 (HS4) by the β-g lobin locus control region
Rabbit HS4 region:
Human β-like globin gene cluster
% ident.
69
Hypersensitive site 4 (HS4) by the β-g lobin locus control region
GATA1
AFE2 CDP
70
Hypersensitive site 4 (HS4) by the β-g lobin locus control region
The extent of suboptimal alignments support the proposal that the NFE2/AP1 and GATA1 si tes are important for HS4 function.
71
Alignments around constriction points i n the comparison of HS4
72
R99922050 蔡長蓉
EXAMPLE - 2
73
Before the Second Example
Before the Second Example
5’ to the -globin gene ϒ
Several mutations that lead to persistent e xpression of fetal hemoglobin (α2ϒ2) in adu lt life map in this region.
Alignments
Pairwise Alignment
Multiple Alignment
Phylogenetic Footprints
Phylogenetic Footprints
Human and Rabbit Alignment
R99945051 洪晧瑜
DISCUSSION & CONCLUSION
82
Conclusion(1/3)
Simple alignment:
structural or functional criteria?
More conserved region
extend Hirschberg’s approach
linear space and score-only time method
83
Conclusion(2/3)
Basic dynamic programming alignment
Affined-gap penalty alignment
Hirschberg’s optimal path method
Suboptimal alignment
Robustness and uniqueness
84
Conclusion(3/3)
Extents of suboptimal alignments identifies some important sequences:
Core hypersensitive sites in the LCR (HS4)
Exons of the γ -globin genes (5’ to -globinϒ gene)
Obviously these are important DNA segments to subject to functional tests for effects on expression of globin genes.
85