• 沒有找到結果。

LOCATING WELL-CONSERVED REGION S WITHIN A PAIRWISE ALIGNMENT

N/A
N/A
Protected

Academic year: 2022

Share "LOCATING WELL-CONSERVED REGION S WITHIN A PAIRWISE ALIGNMENT"

Copied!
85
0
0

加載中.... (立即查看全文)

全文

(1)

2011.06.07

林琬瑜、陳筱雯、何家全、賴宜萍 魏昭寧、洪晧瑜、吳冠龍、蔡長蓉

LOCATING WELL-CONSERVED REGION S WITHIN A PAIRWISE ALIGNMENT

1

(2)

Paper Info

Author

Kun-Mao Chao, Ross C. Hardison, Webb Miller

Department of Computer Science, Pennsylvania State University, University Park 16802.

Publication

Comput Appl Biosci (1993) 9 (4): 387-396.

doi: 10.1093/bioinformatics/9.4.387

Cited by 21 2

(3)

Outline

Introduction

Algorithms

Basic dynamic programming alignment

Affined-gap penalty alignment

Hirschberg’s optimal path method

Suboptimal alignment

Robustness and uniqueness

Implementation

Examples

Discussion & Conclusion

3

(4)

R99944016 林琬瑜

INTRODUCTION

4

(5)

Introduction

Within a single alignment of 2 DNA/protein sequences..

The optimal path does not necessarily yield th e best equivalence of residues assessed by str uctural or functional criteria

Some regions may be more conserved than others

May reveal a region that possesses an important fu nction

An automatic process is needed to compute t he conservation degree along the alignments 5

(6)

Introduction

The reliability of different regions within an alignment is widely appreciated

Vingron and Argos, 1990

Determination of reliable regions in protein seque nce alignment

Zuker, 1991

Suboptimal sequence alignment in molecular biology : alignment with error analysis

Friemann and Schmitz, 1992

A new approach for displaying identities and diffe rences among aligned amino acid sequences

6

(7)

Introduction

One approach to obtain such information is to determine suboptimal alignments

Alignments come within a specified tolerance o f the optimum score

Waterman and Byers, 1985

A dynamic programing algorithm to find all solutions in a neighborhood of the optimum

Saqi and Sternberg, 1991

A simple method to generate non-trivial alternative alignments of protein sequences

7

(8)

Introduction

Many of these earlier papers modify the cla ssic DP for pairwise sequence alignment so that it computes additional “robustness”

information

The problem is..

Require space proportional to the product of the s equence lengths

Impractical for long protein sequence

8

(9)

Purpose of this paper

Extend Hirschberg’s approach to compute additio nal information that in some sense indicates the degree to which the aligned residues are conserv ed.

Two main methods are developed

Can also efficiently handle the case where alignmen ts are constrained

For each position i of the 1st sequence, can be aligned only to positions between L[i] and R[i] of the second s equence

All methods run in linear space and score-only time

9

(10)

10

To illustrate the effectiveness

One of the methods is used to locate partic ularly well-conserved regions in:

β-globin gene

Control region

γ-globin gene

5’flank

(11)

ALGORITHMS

Basic dynamic programming alignment Affined-gap penalty alignment

Hirschberg’s optimal path method Suboptimal alignment

Left extent finding Right extent finding

Robustness and uniqueness 11

(12)

R99922073 陳筱雯

ALGORITHMS – BASIC PREFACE

12

(13)

Algorithm Flow

13

Affine-gap penalty alignment

Basic dynamic programming alignment

L score

HR score R

score

Robustness Measure

Compute Region

uniqu e

(14)

Dynamic-programming Method

0

M

N

(i, j) (i-1, (i, j- j)

1)

(i-1, j- 1)

14

(15)

Dynamic-programming Method

Time complexity

Space complexity

 

15

(16)

Affine Gap Penalties

Myers and Miller (1989)

Each gap is accessed with an additional pen alty.

Compute the highest score path from to .

 

16

(17)

DD DD

DD DD

SS SS

SS SS

II II

II II

(i-1, j- 1)

(i-1, j)

(i, j- 1)

(i, j)

Affine Gap Penalties

D (i , j)=���

{

(� −1, �(� −1, �))� − �

 

Here, a gap of length k is penalized by x+ky.

Twins

17

I (i , j)=���

{

(� , � −1(� , � − 1))� − �

 

S (i , j)=���

{

(� −1, � −1�(� , �)� (� , �))+�(�,)

 

(18)

R99945005 何家全

THE LEFT EXTENT OF SUBOPTIMAL ALI GNMENTS

18

(19)

The optimal path

Since the time and space complexity for the simple dynamic programming is poor, we want a more efficient method for alignments.

Hirschberg’s method might be a good choice to find optimal path.

19

(20)

Hirschberg’s method

0

M

N

M/2  

�����

+ ¿

����� ¿

 

20

(21)

The optimal path

0

M

N

M/2

21

(22)

The optimal path

0

M

N

M/2 M/4

3M/

4

22

(23)

The optimal path

Space complexity:

Only one row saved in memory

O(N)

Time complexity:

Each turn with half space removed

MN+0.5MN+0.25MN+…=2MN

O(MN)

23

(24)

The optimal path

Optimal path is not enough for biologists.

 WE WANT MORE!

24

(25)

The suboptimal path region

We need to set a threshold value.

All suboptimal paths with score higher than the threshold would be considered.

25

(26)

The modified Hirschberg’s method

0

M

N

M/2

Suboptimal regions

26

(27)

The modified Hirschberg’s method

How could we solve the subproblems in the a bove figure?

27

(28)

The modified Hirschberg’s method

0

M

N

����� 

�����  +¿ ¿

28

(29)

The modified Hirschberg’s method

0

M

N

M/2  

�����

+ ¿

����� ¿

 

29

(30)

The modified Hirschberg’s method

0

M

N

M/2

Suboptimal regions

30

(31)

The modified Hirschberg’s method

0

M

N

M/2

Suboptimal

regions

�����

 

+ ¿

�����

¿

  L-

value

31

(32)

The modified Hirschberg’s method

The L-value means the index of the leftmost column for certain row.

In that, we could always get the L-values f or each row of the concerned matrix.

32

(33)

The modified Hirschberg’s method

0

M

N

M/2

33

(34)

R99945018 賴宜萍

THE ROBUSTNESS MEASURE FOR EACH E DGE OF AN OPTIMAL PATH

34

(35)

Robustness measure

For each edge of an optimal path, let the r obustness measure of that edge be the diffe rence between the optimal score and the hig hest score of all paths [i.e. from (0,0)s t o (M,N)s] that do not use the edge or its t win

RM value

RM[i] >= 0

DD DD

DD DD

SS SS

SS SS

II II

II II

(i-1, j-1) (i-1, j)

(i, j-1) (i, j)

35

(36)

The only problem is that

Cannot guarantee that the second highest sc oring edge entering row (s + t)/2 must lie within the rectangle

36

(37)

HR[i] value

The highest score over all edges entering r ow i that are not contained within a pendin g subproblem

The only remaining problem is to update HR in linear space and score-only time when a problem is subdivided

37

(38)

Strategy

Account for the scores of edges in the regi on denoted Δ, and do this for all i betwee n s and

(s + t)/2 in linear space

38

(39)

Strategy

Make a forward pass to propagate values of Score- to nodes along the vertical segment f rom (s, jbest) to ((s + t)/2, jbest)

39

(40)

A backward pass beginning at row (s + t)/2 propagates values of Score+ to the right and bottom borders of Γ, so edges across Γ ca n be scored

Strategy

40

(41)

41

Strategy

0

M

N

M/2

�����

 

����� +¿

¿

 

HR[i

] i

(42)

42

Robustness measure

0

M

N

M/2

HR[i ]

RM[i ]

(43)

R99945053 魏昭寧

COMPUTATION WITHIN A REGION

43

(44)

Computation within a region

Suppose we have computed the left and right extents of suboptimal paths rel ative to a given threshold score and that we now want to repeat the process for a lar ger threshold.

44

(45)

Computation within a region

Our strategy is to cover the region with a small number of non-intersecting upright re ctangles whose total area does not exceed t wice that of the rectangle’s intersection with the given region

45

(46)

Computation within a region

The total length of the dark segments in Fig ure is linear in N.

* Lemma. The sum of the segment widths R [i] L[i] + 1 for i ∈I is at most 3N.−

46

0 N

M

(47)

Computation within a region

47

0 N

M

(48)

0 N

M

Computation within a region

48

 

(49)

Computation within a region

assume that L[0] = 0 ≤ R[0], L[M] ≤ N = R [M], and L and R are monotonic in the sense that, e.g., L[i 1] ≤ L[i] for 1 ≤ i ≤ − M.

M

N

49

(50)

0 N

M

Computation within a region

i=

1

50

} }

 

(51)

N

M

} }

 

Computation within a region

(i0,L[i0])

(i,R[i])

51

(52)

N

M

} }

 

Computation within a region

(i0,L[i0])

(i,R[i])

52

(53)

R99945051 洪晧瑜

PROOF

53

(54)

Lemma

Suppose i∈I , where I is computed by Figur e 8. Let the intersection of row i with the given region extend from column j to column k, i.e., j = L[i] and k = R[i].

De ne c[i] to be k j + 1 and fi −

C[i] = Σ{c[i] for q∈I and q ≤ i}

Then C[i] ≤ j + 2k.

54

(55)

Compute I

55

(56)

s i

0

i

5

i

8

i

1

7

I

C(i

8

)

56

(57)

Proof (induction on i)

i = i0(j = j0 = 0, k = k0)

C[i0] = c[i0] ≤ k0 + 1 ≤ j0 + 2k0 Hol d

Let i = i‘ (j = j’, k = k’)

Suppose C[i‘] ≤ j’ + 2k’ Hold

57

(58)

Two cases

Case1: C ase2:

58

(59)

Case1:

C[i] = ≤ ≤ =

k’ – j j - + 1

j’

k – j + 1

k – j + 1

k - k’

+ (k – j +

1) + (k’ – j + 1)

+ (k - k’)

C[i‘] ≤ j’ + 2k’

C[i‘]

j’ + 2k’

+ 2(k - k’)

+ (j - j’)

j’ + 2k’

j + 2k

59

(60)

Case2: (1/2)

(k k’) + ( j j’) − − ≥ =

k’− j + 1 (k − ( j − k’)

j’) m− k’− m+1 j

m−

j’

(k’-

m+1 )

+ 1 (m−

j’)

k’− j + 1

+(m−

j)

+ 1 + 1

+(m−

j) k − k’ ≥ 2(m − j) + 1

(k- k’)

+ (k’-j+1 ) + (j-j’ ) > 2(k’-j +1)≥

+ 1

60

(61)

Case2: (2/2)

C[i] = = ≤ =

≤ =

C[i‘] ≤ j’ + 2k’

k − k’

k’− m+1 m−

j

k – j + 1

k − k’ ≥ 2(m − j) + 1 + (k – j +

1) + (m−

j )

+ (k’−

m+1) + (k − + (m− k’)

j )

( j − j’)

(m−

j+1)

+ (m− j+

1) + ( j

− j’)

+ (k − k’)

C[i’]

C[i’]

C[i’]

C[i’] + ( j

− j’)

+ 2(m − j) + 1 + (k − j’ + k’)

2k’

+ ( j − j’)

+ (k - k’)

+ (k − k’)

j + 2k

61

(62)

R99944013 吳冠龍

IMPLEMENTATION

62

(63)

Implementation

Left-right

determines both the left and right boundaries

Robust

computes the robustness measure for all edges (inc luding horizontal edges) of an optimal alignment

Unique (substantially faster)

only a bit telling if the aligned pair is in all o ptimal alignments

if there is a unique optimal edge entering the mid dle row, then recomputation within the subproblems is not needed to complete the division step

63

(64)

Environment

Developed on Sun workstations running SunOS Unix

Written in C

Tool

http://www.csie.ntu.edu.tw/~kmchao/tools/robus t/

64

(65)

R99944013 吳冠龍

EXAMPLE - 1

65

(66)

Example - 1

Applied to alignments between regulatory re gions of the β-like globin gene clusters o f humans and rabbit

Robustness property is interpreted as indic ating particularly well-conserved sections of the alignment.

66

(67)

DNA Binding Site

Represso r

Represso r

Activator Activator

67

(68)

Hypersensitive site 4 (HS4) by the β-g lobin locus control region

The study [Lowrey et al., 1992] has implica ted binding sites for the transcription fac tors NFE2/AP1 and GATA1 in HS4 function

The percent identity calculated for this HS 4 core is no greater than for most other se gments in this region

68

(69)

Hypersensitive site 4 (HS4) by the β-g lobin locus control region

Rabbit HS4 region:

Human β-like globin gene cluster

% ident.

69

(70)

Hypersensitive site 4 (HS4) by the β-g lobin locus control region

GATA1

AFE2 CDP

70

(71)

Hypersensitive site 4 (HS4) by the β-g lobin locus control region

The extent of suboptimal alignments support the proposal that the NFE2/AP1 and GATA1 si tes are important for HS4 function.

71

(72)

Alignments around constriction points i n the comparison of HS4

72

(73)

R99922050 蔡長蓉

EXAMPLE - 2

73

(74)

Before the Second Example

(75)

Before the Second Example

(76)

5’ to the -globin gene ϒ

Several mutations that lead to persistent e xpression of fetal hemoglobin (α2ϒ2) in adu lt life map in this region.

(77)

Alignments

Pairwise Alignment

Multiple Alignment

(78)

Phylogenetic Footprints

(79)

Phylogenetic Footprints

(80)
(81)

Human and Rabbit Alignment

(82)

R99945051 洪晧瑜

DISCUSSION & CONCLUSION

82

(83)

Conclusion(1/3)

Simple alignment:

structural or functional criteria?

More conserved region

extend Hirschberg’s approach

linear space and score-only time method

83

(84)

Conclusion(2/3)

Basic dynamic programming alignment

Affined-gap penalty alignment

Hirschberg’s optimal path method

Suboptimal alignment

Robustness and uniqueness

84

(85)

Conclusion(3/3)

Extents of suboptimal alignments identifies some important sequences:

Core hypersensitive sites in the LCR (HS4)

Exons of the γ -globin genes (5’ to -globinϒ gene)

Obviously these are important DNA segments to subject to functional tests for effects on expression of globin genes.

85

參考文獻

相關文件

Extranodal non-hodgkin lymphomas of the oral cavity and maxillofacial region: a clinical study of 58 cases and review of the literature.. Regezi JA, Sciubba JJ,

Exclusion criteria were: 1) Lack of initial histological diagnosis; 2) follow-up period of less than one year; 3) lesion with clinical appearance or histological findings white,

Figure 2: Panoramic radiograph view showed a well-defined radiopacity located at the left posterior ramus of the mandible region.. Figure 3: Paranasal sinus view showed a

 This technique is solving a problem by reducing its instance to a smaller one, solving the latte r, and then extending the obtained solution to g et a solution to the

T transforms S into a region R in the xy-plane called the image of S, consisting of the images of all points in S.... So we begin by finding the images of the sides

In addition to line search, trust region is another method to ensure sufficient decrease; see the implementation in LIBLINEAR (Lin et al., 2007)

Should an employer find it necessary to continue the employment of the Class A Foreign Worker(s), the employer shall, within four (4) months prior to the expiration of the

In the size estimate problem studied in [FLVW], the essential tool is a three-region inequality which is obtained by applying the Carleman estimate for the second order