
National Taiwan University

Designing small universal k-mer hitting sets for improved analysis of high throughput sequencing

Orenstein Y, Pellow D, Marçais G, Shamir R, Kingsford C

PLOS Computational Biology. 2017 October; 13(10): e1005777

Hung-Yu Chen, R06945024

Vincent Hwang, B05902122


Outline

· Background

· Methods and results

· Conclusion


Background

· Sequencing datasets keep growing larger.

· New computational ideas are essential for managing and analyzing the data.


Minimizer

· Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA. Reducing storage requirements for biological sequence comparison. Bioinformatics. 2004 December; 20(18): 3363-3369

· Given a sequence of length L, its minimizer is the lexicographically smallest k-mer in it.

· Given a sequence S of any length, the minimizer set of S is the set of minimizers of all L-long substrings (windows) of S.

⟹ Every L-long substring of S is represented in the set.
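The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names `minimizer` and `minimizer_set` and the example sequence are assumptions made here, not part of the slides or the paper.

```python
def minimizer(window, k):
    """Return the lexicographically smallest k-mer of `window`."""
    return min(window[i:i + k] for i in range(len(window) - k + 1))


def minimizer_set(seq, k, L):
    """Collect the minimizer of every L-long window (substring) of `seq`."""
    return {minimizer(seq[i:i + L], k) for i in range(len(seq) - L + 1)}


# Hypothetical example sequence, chosen only for illustration.
print(sorted(minimizer_set("ACGTTGCAAC", k=3, L=5)))
```

By construction, every L-long window of the sequence contains at least one k-mer of the returned set, which is the property stated on the slide.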


Application of Minimizers

· Hashing for read overlapping

· Sparse suffix arrays

· Bloom filters to speed up sequence search


Hashing for read overlapping


Sparse suffix arrays


Bloom filters to speed up sequence search


Universal hitting set (UHS)

· For integers k and L, a set $U_{k,L}$ of k-mers is called a UHS if every possible sequence of length L contains at least one k-mer from $U_{k,L}$.

· For example, the set of all k-mers is a trivial UHS.

· Problem 1. Given k and L, find a smallest UHS of k-mers.


Hits

· A k-mer w hits a string S, denoted w ⊆ S, if w is a substring of S.

· A k-mer set X hits a string S if there exists w ∈ X such that w ⊆ S.

· The UHS in Problem 1 is a set of k-mers $U_{k,L}$ that hits every possible sequence of length L.
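For very small k and L, the definitions of "hits" and of a UHS can be checked by brute force. The sketch below is an illustration only: the names `hits` and `is_uhs` and the default binary alphabet are assumptions made here, and the enumeration of all $|\Sigma|^L$ sequences is far too slow for realistic parameters.

```python
from itertools import product


def hits(kmers, s, k):
    """True if some k-mer of `kmers` occurs as a substring of `s`."""
    return any(s[i:i + k] in kmers for i in range(len(s) - k + 1))


def is_uhs(kmers, k, L, alphabet="01"):
    """Brute-force check over all len(alphabet)**L sequences (small L only)."""
    return all(hits(kmers, "".join(t), k) for t in product(alphabet, repeat=L))


# The set of all k-mers is the trivial UHS.
all_3mers = {"".join(t) for t in product("01", repeat=3)}
print(is_uhs(all_3mers, k=3, L=3))
```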


Advantages of UHS over minimizers

· The set of minimizers may be as large as the complete set of k-mers. The method in this paper can often generate UHSs that are smaller by a factor of nearly k.

· A UHS is universal.

⟹ For any k and L, a UHS needs to be computed only once and can then be reused for every dataset.

⟹ The data structures created for different datasets will contain a comparable set of k-mers.


Using de Bruijn graphs to find UHSs

· Problem 2. Given a complete de Bruijn graph $D_k$ of order k and an integer L, find a smallest set of vertices $U_{k,L}$ such that any path in $D_k$ of length $l = L - k$ passes through at least one vertex of $U_{k,L}$.


Complete de Bruijn graph

· A complete de Bruijn graph of order k over alphabet Σ:

V: $|\Sigma|^k$ vertices, each labelled with a unique k-mer.

E: If there is an edge (u, v) with a (k + 1)-mer label l, then the label of vertex u is the k-prefix of l and the label of vertex v is the k-suffix of l. A complete de Bruijn graph contains all $|\Sigma|^{k+1}$ possible edges of this type.
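As a small illustration of this definition, the complete de Bruijn graph can be built as an adjacency map. The function name `de_bruijn_graph` and the default binary alphabet are assumptions of this sketch; the edge from u to v stands for the (k + 1)-mer whose k-prefix is u and whose k-suffix is v.

```python
from itertools import product


def de_bruijn_graph(k, alphabet="01"):
    """Map each k-mer to its successors in the complete de Bruijn graph.

    The edge u -> u[1:] + c carries the (k+1)-mer label u + c, so u is the
    k-prefix and the successor is the k-suffix of that label.
    """
    vertices = ["".join(t) for t in product(alphabet, repeat=k)]
    return {u: [u[1:] + c for c in alphabet] for u in vertices}


graph = de_bruijn_graph(3)
print(len(graph), sum(len(succ) for succ in graph.values()))
```

The two printed numbers are the vertex and edge counts, which should match $|\Sigma|^k$ and $|\Sigma|^{k+1}$.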


How to find the UHS?

· NP-hard in general (see the supporting information of the paper).

· Heuristic approaches: DOCKS, DOCKSany, DOCKSanyX.


How to find UHS?

1. Generate a complete de Bruijn graph of order k and set l = L − k.

2. Find a decycling vertex set (V-set) X.

3. Remove X from the graph, resulting in a graph G′.

4. Repeatedly remove vertices from G′ and add them to X until no path of length l remains, so that all remaining L-long sequences are hit.

(i) DOCKS (ii) DOCKSany (iii) DOCKSanyX

5. X is the universal hitting set we are searching for.


Decycling de Bruijn graph

· Vertex labeling

· Factor

· Pure cycling register ($PCR_k$)

· V-set


Decycling de Bruijn graph

[Figure: the complete binary de Bruijn graph of order 3, with vertices 000, 001, 010, 011, 100, 101, 110, 111]


Vertex labeling

For a vertex $v = (s_0, s_1, \ldots, s_{k-1})$, calculate the center of mass.

According to the x coordinate of the center of mass, label the vertex I if x = 0, L if x < 0, and R if x > 0.


Vertex labeling example

v = 010111: the x value of the center of mass is > 0 ⟹ R.

[Figure: the bits of v arranged around a circle]


Factor

· A factor is a set of cycles such that every vertex in the graph lies on exactly one of the cycles.

· Each factor is defined by a feedback function $f(s_0, s_1, \ldots, s_{k-1}) = s_k$, which gives the successor $(s_1, \ldots, s_k)$ of each vertex $(s_0, \ldots, s_{k-1})$.


Pure cycling register ($PCR_k$)

· $PCR_k$ is a factor.

· Its feedback function is $f(s_0, s_1, \ldots, s_{k-1}) = s_k = s_0$; that is, for every arc ⟨u, v⟩, $u = (s_0, s_1, \ldots, s_{k-1})$ and $v = (s_1, s_2, \ldots, s_k) = (s_1, s_2, \ldots, s_{k-1}, s_0)$.

· The number of cycles in $PCR_k$ is Z(k), which grows asymptotically as $|\Sigma|^k / k$.

· It has been proved that any cycle in $PCR_k$ is either all I’s or a block of L’s and a block of R’s separated by at most two I’s.
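The cycles of the pure cycling register are the rotation classes of the k-mers, so they can be enumerated by repeatedly rotating each unseen k-mer. The sketch below is an illustration under that reading; the function name `pcr_cycles` and the default binary alphabet are assumptions made here.

```python
from itertools import product


def pcr_cycles(k, alphabet="01"):
    """Partition all k-mers into the cycles of the pure cycling register."""
    seen, cycles = set(), []
    for t in product(alphabet, repeat=k):
        v = "".join(t)
        if v in seen:
            continue
        cycle = []
        while v not in seen:
            seen.add(v)
            cycle.append(v)
            v = v[1:] + v[0]  # PCR successor: shift left and feed s_0 back in
        cycles.append(cycle)
    return cycles


for k in (3, 4, 6):
    # Compare the cycle count Z(k) with the estimate 2**k / k.
    print(k, len(pcr_cycles(k)), 2 ** k / k)
```

Every k-mer lands in exactly one cycle, which is what makes the result a factor.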


$PCR_k$ example

[Figure: the cycles of $PCR_3$ in the binary de Bruijn graph of order 3, with vertices 000, 001, 010, 011, 100, 101, 110, 111]


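Example of a factor that is not $PCR_k$

[Figure: a factor other than $PCR_3$ in the binary de Bruijn graph of order 3, with vertices 000, 001, 010, 011, 100, 101, 110, 111]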


Why $PCR_k$?

Lemmas tell us:

· Every cycle in the graph is either all I’s or contains at least one L and one R.

· Cycles with all I’s are in $PCR_k$.

· For each cycle with at least one L and one R, there exists exactly one cycle in $PCR_k$ whose L block starts at the same vertex as the L block of that cycle.

⟹ We only need to deal with the cycles in $PCR_k$.


V-set

A minimum set of vertices whose removal leaves a graph with no cycles.


V-set

Naïve algorithm:

1. Choose a vertex v and find the cycle of $PCR_k$ that contains v.

2. Choose a vertex u on that cycle and add it to the V-set:

An arbitrary vertex, if the cycle is all I’s.

The first vertex in the L block, otherwise.

3. Remove the cycle from the graph.

4. Repeat until all cycles of $PCR_k$ have been processed.


V-set example

[Figure: the binary de Bruijn graph of order 3, with vertices 000, 001, 010, 011, 100, 101, 110, 111]


V-set example

[Figure: the resulting V-set, with vertices 000, 010, 111, 110]
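One way to sanity-check a candidate V-set is to delete it from the complete de Bruijn graph and test whether the rest is acyclic. The sketch below does this for the four vertices shown in the example (000, 010, 111, 110), using Kahn's topological peeling. The helper names `remaining_graph` and `is_acyclic` are assumptions of this sketch; it verifies a given set and does not implement the center-of-mass construction itself.

```python
from itertools import product


def remaining_graph(k, removed, alphabet="01"):
    """Complete de Bruijn graph of order k with the `removed` vertices deleted."""
    keep = {"".join(t) for t in product(alphabet, repeat=k)} - set(removed)
    return {u: [u[1:] + c for c in alphabet if u[1:] + c in keep] for u in keep}


def is_acyclic(graph):
    """Kahn's algorithm: acyclic iff every vertex can be peeled off in turn."""
    indegree = {v: 0 for v in graph}
    for u in graph:
        for v in graph[u]:
            indegree[v] += 1
    stack = [v for v in graph if indegree[v] == 0]
    peeled = 0
    while stack:
        u = stack.pop()
        peeled += 1
        for v in graph[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                stack.append(v)
    return peeled == len(graph)


print(is_acyclic(remaining_graph(3, {"000", "010", "111", "110"})))
```

Self-loops such as 000 → 000 count as cycles here, so a set that leaves 000 or 111 in the graph is rejected.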


Time complexity analysis

There are Z(k) iterations, and finding the vertex to add costs O(k) time in each iteration.

⟹ $O(k \cdot Z(k)) = O(|\Sigma|^k)$ in total, since Z(k) grows as $|\Sigma|^k / k$.


How to find a small UHS?

1. Generate a complete de Bruijn graph of order k and set l = L − k.

2. Find a decycling vertex set (V-set) X.

3. Remove X from the graph, resulting in a graph G′.

4. Repeatedly remove vertices from G′ and add them to X until no path of length l remains, so that all remaining L-long sequences are hit.

(i) DOCKS (ii) DOCKSany (iii) DOCKSanyX

5. X is the universal hitting set we are searching for.


DOCKS

Define:

D(v, i) = the number of i-long paths starting at v

F(v, i) = the number of i-long paths ending at v

⟹ T(v, l) = the number of l-long paths through v

$T(v, l) = \sum_{i=0}^{l} F(v, i) \cdot D(v, l - i)$

· Calculate D(−, −) and F(−, −) to find T(−, l).

· Choose the vertex with the largest T(−, l) and remove it from the graph.
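The counting step and the greedy removal can be sketched with a simple dynamic program over path lengths. This is a minimal illustration, assuming the input graph is already acyclic (the V-set has been removed) and is given as a successor map. The names `count_paths` and `docks_second_phase`, and the alphabetical tie-breaking, are assumptions made here rather than details taken from the paper.

```python
def count_paths(graph, ell):
    """Return T with T[v] = number of ell-long paths (ell edges) through v.

    D[i][v] = number of i-long paths starting at v,
    F[i][v] = number of i-long paths ending at v.
    The graph must be acyclic so that every walk is a path.
    """
    D = [{v: 1 for v in graph}]
    F = [{v: 1 for v in graph}]
    for _ in range(ell):
        D.append({u: sum(D[-1][w] for w in graph[u]) for u in graph})
        ending = {v: 0 for v in graph}
        for u in graph:
            for w in graph[u]:
                ending[w] += F[-1][u]
        F.append(ending)
    return {v: sum(F[i][v] * D[ell - i][v] for i in range(ell + 1)) for v in graph}


def docks_second_phase(graph, ell):
    """Greedily remove the vertex on the most ell-long paths until none remain."""
    graph = {u: list(ws) for u, ws in graph.items()}
    removed = []
    while graph:
        T = count_paths(graph, ell)
        best = max(sorted(T), key=T.get)  # sorted() makes ties deterministic
        if T[best] == 0:
            break
        removed.append(best)
        del graph[best]
        for u in graph:
            graph[u] = [w for w in graph[u] if w != best]
    return removed


# Binary order-3 graph after removing the V-set {000, 010, 111, 110}.
dag = {"100": ["001"], "001": ["011"], "101": ["011"], "011": []}
print(docks_second_phase(dag, ell=2))  # ell = L - k with L = 5, k = 3
```

The vertices returned here, together with the V-set, form the hitting set for the chosen k and L. Recomputing T from scratch after every removal keeps the sketch short but is not meant to be efficient.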


DOCKS performance (set size)


DOCKS performance (runtime)


DOCKS performance (memory)


DOCKSany

Define:

D(v) = the number of paths starting at v

F(v) = the number of paths ending at v

⟹ T(v) = the number of paths through v

$T(v) = F(v) \cdot D(v)$

· Calculate D(−) and F(−) to find T(−).

· Choose the vertex with the largest T(−) and remove it from the graph.
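On an acyclic graph, D(v) and F(v) can each be computed in one pass over a topological order. The sketch below is an illustration only: the function name `count_any_paths` is hypothetical, and counting the trivial path that consists of v alone in both D(v) and F(v) is an assumption of this sketch.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def count_any_paths(graph):
    """Return T with T[v] = F[v] * D[v] on an acyclic successor map.

    D[v] = number of paths starting at v, F[v] = number of paths ending at v;
    both include the trivial path made of v alone (an assumption made here).
    """
    # TopologicalSorter expects node -> predecessors. Passing successors
    # instead yields an order in which each vertex follows its successors.
    order = list(TopologicalSorter(graph).static_order())
    D = {}
    for v in order:  # successors are handled first
        D[v] = 1 + sum(D[w] for w in graph[v])
    F = {v: 1 for v in graph}
    for u in reversed(order):  # predecessors are handled first
        for w in graph[u]:
            F[w] += F[u]
    return {v: F[v] * D[v] for v in graph}


# Binary order-3 graph after removing the V-set {000, 010, 111, 110}.
dag = {"100": ["001"], "001": ["011"], "101": ["011"], "011": []}
print(count_any_paths(dag))
```

Because the path length no longer appears in the recurrence, each round needs only one pass over the graph instead of one pass per length up to l.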


DOCKSany performance (set size)


DOCKSany performance (runtime)


DOCKSany performance (memory)


DOCKSanyX

Same calculation as DOCKSany.

Remove up to x of the top-scoring vertices in each iteration instead of just one.


DOCKSanyX performance (set size)


DOCKSanyX performance (runtime)


DOCKSanyX performance (memory)


Conclusion

· DOCKS can generate compact sets of k-mers that hit all L-long sequences for any k ≤ 13 and L.

· These compact sets can improve many of the applications that currently use minimizers.
