LOCATING WELL-CONSERVED REGION S WITHIN A PAIRWISE ALIGNMENT

(1)

2011.06.07

林琬瑜、陳筱雯、何家全、賴宜萍魏昭寧、洪晧瑜、吳冠龍、蔡長蓉

LOCATING WELL-CONSERVED REGION S WITHIN A PAIRWISE ALIGNMENT

1

(2)

Paper Info

 Author

 Kun-Mao Chao, Ross C. Hardison, Webb Miller

 Department of Computer Science, Pennsylvania State University, University Park 16802.

 Publication

 Comput Appl Biosci (1993) 9 (4): 387-396.

 doi: 10.1093/bioinformatics/9.4.387

 Cited by 21 2

(3)

Outline

 Introduction

 Algorithms

 Basic dynamic programming alignment

 Affined-gap penalty alignment

 Hirschberg’s optimal path method

 Suboptimal alignment

 Robustness and uniqueness

 Implementation

 Examples

 Discussion & Conclusion

3

(4)

R99944016 林琬瑜

INTRODUCTION

4

(5)

Introduction

 Within a single alignment of 2 DNA/protein sequences..

 The optimal path does not necessarily yield th e best equivalence of residues assessed by str uctural or functional criteria

 Some regions may be more conserved than others

 May reveal a region that possesses an important fu nction

 An automatic process is needed to compute t he conservation degree along the alignments 5

(6)

Introduction

 The reliability of different regions within an alignment is widely appreciated

 Vingron and Argos, 1990

 Determination of reliable regions in protein seque nce alignment

 Zuker, 1991

 Suboptimal sequence alignment in molecular biology : alignment with error analysis

 Friemann and Schmitz, 1992

 A new approach for displaying identities and diffe rences among aligned amino acid sequences

6

(7)

Introduction

 One approach to obtain such information is to determine suboptimal alignments

 Alignments come within a specified tolerance o f the optimum score

 Waterman and Byers, 1985

 A dynamic programing algorithm to find all solutions in a neighborhood of the optimum

 Saqi and Sternberg, 1991

 A simple method to generate non-trivial alternative alignments of protein sequences

7

(8)

Introduction

 Many of these earlier papers modify the cla ssic DP for pairwise sequence alignment so that it computes additional “robustness”

information

 The problem is..

 Require space proportional to the product of the s equence lengths

 Impractical for long protein sequence

8

(9)

Purpose of this paper

 Extend Hirschberg’s approach to compute additio nal information that in some sense indicates the degree to which the aligned residues are conserv ed.

 Two main methods are developed

 Can also efficiently handle the case where alignmen ts are constrained

 For each position i of the 1^st sequence, can be aligned only to positions between L[i] and R[i] of the second s equence

 All methods run in linear space and score-only time

9

(10)

10

To illustrate the effectiveness

 One of the methods is used to locate partic ularly well-conserved regions in:

 β-globin gene

 Control region

 γ-globin gene

 5’flank

(11)

ALGORITHMS

Basic dynamic programming alignment Affined-gap penalty alignment

Hirschberg’s optimal path method Suboptimal alignment

Left extent finding Right extent finding

Robustness and uniqueness 11

(12)

R99922073 陳筱雯

ALGORITHMS – BASIC PREFACE

12

(13)

Algorithm Flow

13

Affine-gap penalty alignment

Basic dynamic programming alignment

L score

HR score R

score

Robustness Measure

Compute Region

uniqu e

(14)

Dynamic-programming Method

0 M

N

(i, j) (i-1, (i, j- j)

1)

(i-1, j- 1)

14

(15)

Dynamic-programming Method

 Time complexity

 Space complexity



15

(16)

Affine Gap Penalties

 Myers and Miller (1989)

 Each gap is accessed with an additional pen alty.

 Compute the highest score path from to .



16

(17)

DD DD

SS SS

II II

(i-1, j- 1)

(i-1, j)

(i, j- 1)

(i, j)

Affine Gap Penalties

D (i , j)=��

{

�^�(_{� −1, �}⁽^{� −1, �})₋⁾_{� − �}⁻ ^�

Here, a gap of length k is penalized by x+ky.

Twins

17

I (i , j)=��

{

^�⁽^�^{� , � −1}⁽^{� , � − 1}⁾⁻⁾⁻^{� − �}^�

S (_{i , j})₌_��

{

^�⁽^{� −1, � −1}^{�(� , �)}^{� (� , �)}⁾⁺^�(�^�^,^�^�⁾

(18)

R99945005 何家全

THE LEFT EXTENT OF SUBOPTIMAL ALI GNMENTS

18

(19)

The optimal path

 Since the time and space complexity for the simple dynamic programming is poor, we want a more efficient method for alignments.

 Hirschberg’s method might be a good choice to find optimal path.

19

(20)

Hirschberg’s method

0

M

N

M/2

�� ⁻

+ ¿

�� ^¿

20

(21)

The optimal path

0

M

N

M/2

21

(22)

The optimal path

0

M

N

M/2 M/4

3M/

4

22

(23)

The optimal path

 Space complexity:

 Only one row saved in memory

 O(N)

 Time complexity:

 Each turn with half space removed

 MN+0.5MN+0.25MN+…=2MN

 O(MN)

23

(24)

The optimal path

 Optimal path is not enough for biologists.

 WE WANT MORE!

24

(25)

The suboptimal path region

 We need to set a threshold value.

 All suboptimal paths with score higher than the threshold would be considered.

25

(26)

The modified Hirschberg’s method

0

M

N

M/2

Suboptimal regions

26

(27)

The modified Hirschberg’s method

 How could we solve the subproblems in the a bove figure?

27

(28)

The modified Hirschberg’s method

0

M

N

�� ⁻

�� +¿ ^¿

28

(29)

The modified Hirschberg’s method

0

M

N

M/2

�� ⁻

+ ¿

�� ^¿

29

(30)

The modified Hirschberg’s method

0

M

N

M/2

Suboptimal regions

30

(31)

The modified Hirschberg’s method

0

M

N

M/2

Suboptimal

regions

�� ⁻

+ ¿

��

^¿

L-

value

31

(32)

The modified Hirschberg’s method

 The L-value means the index of the leftmost column for certain row.

 In that, we could always get the L-values f or each row of the concerned matrix.

32

(33)

The modified Hirschberg’s method

0

M

N

M/2

33

(34)

R99945018 賴宜萍

THE ROBUSTNESS MEASURE FOR EACH E DGE OF AN OPTIMAL PATH

34

(35)

Robustness measure

 For each edge of an optimal path, let the r obustness measure of that edge be the diffe rence between the optimal score and the hig hest score of all paths [i.e. from (0,0)s t o (M,N)s] that do not use the edge or its t win

 RM value

 RM[i] >= 0

DD DD

SS SS

II II

(i-1, j-1) (i-1, j)

(i, j-1) (i, j)

35

(36)

The only problem is that

 Cannot guarantee that the second highest sc oring edge entering row (s + t)/2 must lie within the rectangle

36

(37)

HR[i] value

 The highest score over all edges entering r ow i that are not contained within a pendin g subproblem

 The only remaining problem is to update HR in linear space and score-only time when a problem is subdivided

37

(38)

Strategy

 Account for the scores of edges in the regi on denoted Δ, and do this for all i betwee n s and

(s + t)/2 in linear space

38

(39)

Strategy

 Make a forward pass to propagate values of Score^- to nodes along the vertical segment f rom (s, j_best) to ((s + t)/2, j_best)

39

(40)

 A backward pass beginning at row (s + t)/2 propagates values of Score⁺ to the right and bottom borders of Γ, so edges across Γ ca n be scored

Strategy

40

(41)

41

Strategy

0

M

N

M/2

�� ⁻

�� +¿

^¿

HR[i

] _i

(42)

42

Robustness measure

0

M

N

M/2

HR[i ]

RM[i ]

(43)

R99945053 魏昭寧

COMPUTATION WITHIN A REGION

43

(44)

Computation within a region

 Suppose we have computed the left and right extents of suboptimal paths rel ative to a given threshold score and that we now want to repeat the process for a lar ger threshold.

44

(45)

Computation within a region

 Our strategy is to cover the region with a small number of non-intersecting upright re ctangles whose total area does not exceed t wice that of the rectangle’s intersection with the given region

45

(46)

Computation within a region

The total length of the dark segments in Fig ure is linear in N.

* Lemma. The sum of the segment widths R [i] L[i] + 1 for i ∈I is at most 3N.−

46

0 N

M

(47)

Computation within a region

47

0 N

M

(48)

0 N

M

Computation within a region

48

(49)

Computation within a region

 assume that L[0] = 0 ≤ R[0], L[M] ≤ N = R [M], and L and R are monotonic in the sense that, e.g., L[i 1] ≤ L[i] for 1 ≤ i ≤ − M.

M

N

49

(50)

0 N

M

Computation within a region

i=

1

50

} }

(51)

N

M

} }

Computation within a region

(i₀,L[i₀])

(i,R[i])

51

(52)

N

M

} }

Computation within a region

(i₀,L[i₀])

(i,R[i])

52

(53)

R99945051 洪晧瑜

PROOF

53

(54)

Lemma

 Suppose i∈I , where I is computed by Figur e 8. Let the intersection of row i with the given region extend from column j to column k, i.e., j = L[i] and k = R[i].

 De ne c[i] to be k j + 1 and ﬁ −

C[i] = Σ{c[i] for q∈I and q ≤ i}

 Then C[i] ≤ j + 2k.

54

(55)

Compute I

55

(56)

s _i

0

i

₅

i

₈

i

₁

7

I

C(i

₈

)

56

(57)

Proof (induction on i)

 i = i₀(j = j₀= 0, k = k₀)

 C[i₀] = c[i₀] ≤ k₀+ 1 ≤ j₀+ 2k0 Hol d

 Let i = i‘ (j = j’, k = k’)

 Suppose C[i‘] ≤ j’ + 2k’ Hold

57

(58)

Two cases

Case1: C ase2:

58

(59)

Case1:

C[i] = ≤ ≤ =

k’ – j j - + 1

j’

≥

k – j + 1

k - k’

+ (k – j +

1) + (k’ – j + 1)

+ (k - k’)

C[i‘] ≤ j’ + 2k’

C[i‘]

j’ + 2k’

+ 2(k - k’)

+ (j - j’)

j’ + 2k’

j + 2k

59

(60)

Case2: (1/2)

 (k k’) + ( j j’) − − ≥ =

≥

k’− j + 1 (k − ( j − k’)

j’) m− k’− m+1 j

≤

m−

j’

(k’-

m+1 )

+ 1 (m−

j’)

k’− j + 1

+(m−

j)

+ 1 + 1

+(m−

j) k − k’ ≥ 2(m − j) + 1

(k- k’)

+ (k’-j+1 ) + (j-j’ ) > 2(k’-j +1)≥

+ 1

60

(61)

Case2: (2/2)

 C[i] = = ≤ =

≤ =

C[i‘] ≤ j’ + 2k’

k − k’

k’− m+1 m−

j

k – j + 1

k − k’ ≥ 2(m − j) + 1 + (k – j +

1) + (m−

j )

+ (k’−

m+1) + (k − + (m− k’)

j )

( j − j’)

(m−

j+1)

+ (m− j+

1) + ( j

− j’)

+ (k − k’)

C[i’]

C[i’] + ( j

− j’)

+ 2(m − j) + 1 + (k − j’ + k’)

2k’

+ ( j − j’)

+ (k - k’)

+ (k − k’)

≥ j + 2k

61

(62)

R99944013 吳冠龍

IMPLEMENTATION

62

(63)

Implementation

 Left-right

 determines both the left and right boundaries

 Robust

 computes the robustness measure for all edges (inc luding horizontal edges) of an optimal alignment

 Unique (substantially faster)

 only a bit telling if the aligned pair is in all o ptimal alignments

 if there is a unique optimal edge entering the mid dle row, then recomputation within the subproblems is not needed to complete the division step

63

(64)

Environment

 Developed on Sun workstations running SunOS Unix

 Written in C

 Tool

 http://www.csie.ntu.edu.tw/~kmchao/tools/robus t/

64

(65)

R99944013 吳冠龍

EXAMPLE - 1

65

(66)

Example - 1

 Applied to alignments between regulatory re gions of the β-like globin gene clusters o f humans and rabbit

 Robustness property is interpreted as indic ating particularly well-conserved sections of the alignment.

66

(67)

DNA Binding Site

Represso r

Activator Activator

67

(68)

Hypersensitive site 4 (HS4) by the β-g lobin locus control region

 The study [Lowrey et al., 1992] has implica ted binding sites for the transcription fac tors NFE2/AP1 and GATA1 in HS4 function

 The percent identity calculated for this HS 4 core is no greater than for most other se gments in this region

68

(69)

Hypersensitive site 4 (HS4) by the β-g lobin locus control region

Rabbit HS4 region:

Human β-like globin gene cluster

% ident.

69

(70)

Hypersensitive site 4 (HS4) by the β-g lobin locus control region

GATA1

AFE2 CDP

70

(71)

Hypersensitive site 4 (HS4) by the β-g lobin locus control region

 The extent of suboptimal alignments support the proposal that the NFE2/AP1 and GATA1 si tes are important for HS4 function.

71

(72)

Alignments around constriction points i n the comparison of HS4

72

(73)

R99922050 蔡長蓉

EXAMPLE - 2

73

(74)

Before the Second Example

(75)

Before the Second Example

(76)

5’ to the -globin gene ϒ

 Several mutations that lead to persistent e xpression of fetal hemoglobin (α₂ϒ₂) in adu lt life map in this region.

(77)

Alignments

 Pairwise Alignment

 Multiple Alignment

(78)

Phylogenetic Footprints

(79)

Phylogenetic Footprints

(80)

(81)

Human and Rabbit Alignment

(82)

R99945051 洪晧瑜

DISCUSSION & CONCLUSION

82

(83)

Conclusion(1/3)

 Simple alignment:

 structural or functional criteria?

 More conserved region

extend Hirschberg’s approach

linear space and score-only time method

83

(84)

Conclusion(2/3)

 Basic dynamic programming alignment

 Affined-gap penalty alignment

 Hirschberg’s optimal path method

 Suboptimal alignment

 Robustness and uniqueness

84

(85)

Conclusion(3/3)

 Extents of suboptimal alignments identifies some important sequences:

 Core hypersensitive sites in the LCR (HS4)

 Exons of the γ -globin genes (5’ to -globinϒ gene)

 Obviously these are important DNA segments to subject to functional tests for effects on expression of globin genes.

85

LOCATING WELL-CONSERVED REGION S WITHIN A PAIRWISE ALIGNMENT