Designing small universal k-mer hitting sets for improved analysis of high throughput sequencing Orenstein Y, Pellow D, Marçais G, Shamir R, Kingsford C PLOS Computational Biology. 2017 October; 13(10): e1005777


National Taiwan University

Designing small universal k-mer hitting sets for improved analysis of high throughput sequencing

Orenstein Y, Pellow D, Marçais G, Shamir R, Kingsford C

PLOS Computational Biology. 2017 October; 13(10): e1005777

Hung-Yu Chen, R06945024

Vincent Hwang, B05902122


Outline

· Background

· Methods and results

· Conclusion


Background

· Sequencing datasets keep growing in size.

· New computational ideas are essential to manage and analyze the data.


Minimizer

· Michael Roberts, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, James A. Yorke; Reducing storage requirements for biological sequence comparison, Bioinformatics, Volume 20, Issue 18, 12 December 2004, Pages 3363-3369

· Given a sequence of length L, its minimizer is the lexicographically smallest k-mer in it.

· Given a sequence S of any length, the minimizer set is the set of minimizers of every L-long substring (window) of S.

⇒ Every L-long substring of S is represented in the set.
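
The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names `minimizer` and `minimizer_set` are made up here, and plain string comparison is assumed as the lexicographic order.

```python
def minimizer(window: str, k: int) -> str:
    """Return the lexicographically smallest k-mer of `window`."""
    return min(window[i:i + k] for i in range(len(window) - k + 1))


def minimizer_set(s: str, k: int, L: int) -> set[str]:
    """Collect the minimizer of every L-long window of `s`."""
    return {minimizer(s[i:i + L], k) for i in range(len(s) - L + 1)}
```

For instance, `minimizer_set("ACGTACGGTA", k=3, L=5)` scans six windows but keeps only three distinct 3-mers, which is why minimizers reduce storage.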


Application of Minimizers

· Hashing for read overlapping

· Sparse suffix arrays

· Bloom filters to speed up sequence search


Hashing for read overlapping


Sparse suffix arrays


Bloom filters to speed up sequence search


Universal hitting set (UHS)

· For integers k and L, a set $U_{k,L}$ of k-mers is called a UHS if every possible sequence of length L contains at least one k-mer from $U_{k,L}$.

· For example, the set of all k-mers is a trivial UHS.

· Problem 1. Given k and L, find a smallest UHS of k-mers.


Hits

· A k-mer w hits a string S, denoted w ⊆ S, if w is a substring of S.

· A k-mer set X hits a string S if there exists w ∈ X such that w ⊆ S.

· The UHS in Problem 1 is a set of k-mers $U_{k,L}$ that hits every possible sequence of length L.
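
To make the "hits" definition concrete, here is a brute-force check of the UHS property. It is a sketch for tiny k and L only, since it enumerates all $|\Sigma|^L$ sequences; the names `hits` and `is_uhs` and the default alphabet are assumptions of this example, not part of the paper.

```python
from itertools import product


def hits(kmers: set[str], s: str, k: int) -> bool:
    """True if at least one k-mer of `kmers` is a substring of `s`."""
    return any(s[i:i + k] in kmers for i in range(len(s) - k + 1))


def is_uhs(kmers: set[str], k: int, L: int, alphabet: str = "ACGT") -> bool:
    """True if `kmers` hits every possible sequence of length L."""
    return all(hits(kmers, "".join(t), k) for t in product(alphabet, repeat=L))
```

Over the binary alphabet with k = 2 and L = 3, the set {00, 01, 11} passes this check, while {00, 11} fails because it misses the sequence 010.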


Advantages of UHS over minimizers

· The set of minimizers may be as large as the complete set of k-mers. The method in this paper can often generate UHSs smaller by a factor of nearly k.

· A UHS is universal.

⇒ For given k and L, a UHS needs to be computed only once and can then be reused for every dataset.

⇒ The data structures created for different datasets will contain a comparable set of k-mers.


Using de Bruijn graphs to find UHSs

· Problem 2. Given a complete de Bruijn graph $D_k$ of order k and an integer L, find a smallest set of vertices $U_{k,L}$ such that any path in $D_k$ of length l = L − k passes through at least one vertex of $U_{k,L}$.


Complete de Bruijn graph

· A complete de Bruijn graph of order k over alphabet Σ:

V: $|\Sigma|^k$ vertices, each labelled with a unique k-mer.

E: If there is an edge (u, v) with a (k + 1)-mer label l, then the label of vertex u is the k-prefix of l and the label of vertex v is the k-suffix of l. A complete de Bruijn graph contains all $|\Sigma|^{k+1}$ possible edges of this type.
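
As a small illustration of this definition, the complete graph can be built as an adjacency map in which each vertex u points to the k-mers obtained by dropping its first symbol and appending one symbol. The function name `de_bruijn_graph` and the dictionary representation are choices made for this sketch.

```python
from itertools import product


def de_bruijn_graph(k: int, alphabet: str = "ACGT") -> dict[str, list[str]]:
    """Map every k-mer u to its successors v, where the edge label u + v[-1]
    has u as its k-prefix and v as its k-suffix."""
    vertices = ["".join(t) for t in product(alphabet, repeat=k)]
    return {u: [u[1:] + c for c in alphabet] for u in vertices}
```

With the binary alphabet and k = 3 this gives 8 vertices and 16 edges, matching $|\Sigma|^k$ and $|\Sigma|^{k+1}$.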


How to find the UHS?

· NP-hard in general (see the supporting information of the paper).

· Heuristic approaches: DOCKS, DOCKSany, DOCKSanyX.


How to find UHS?

1. Generate a complete de Bruijn graph of order k and set l = L − k.

2. Find a decycling vertex set (V-set), X.

3. Remove X from the graph, resulting in a graph G′.

4. Remove further vertices from G′ and add them to X until all remaining L-long sequences are hit, using one of:

(i) DOCKS (ii) DOCKSany (iii) DOCKSanyX

5. X is the universal hitting set we are searching for.


Decycling de Bruijn graph

· Vertex labeling

· Factor

· Pure cycling register ($PCR_k$)

· V-set


Decycling de Bruijn graph

[Figure: graph on the eight binary 3-mers 000, 001, 010, 011, 100, 101, 110, 111.]


Vertex labeling

For a vertex v = ($s_0, s_1, \ldots, s_{k-1}$), place weight $s_i$ at the i-th of k equally spaced points on the unit circle and calculate the center of mass.

According to the x-coordinate of the center of mass, label the vertex I if x = 0, L if x < 0, and R if x > 0.
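
The labeling rule can be sketched as follows. The slides do not state where on the circle each weight goes, so the placement here is an assumption: position i sits at angle 90° + 360°·i/k, that is, counterclockwise starting from the top. This choice gives the label R for the vertex 010111 used as the example in these slides. Symbols are assumed to be digit characters, and a small tolerance `eps` (also an assumption) absorbs floating-point error.

```python
import math


def center_of_mass_x(v: str) -> float:
    """x-coordinate of the center of mass of weights v[i] placed on the unit
    circle at angle pi/2 + 2*pi*i/k (ASSUMED placement, not from the slides)."""
    k = len(v)
    weights = [int(c) for c in v]
    total = sum(weights)
    if total == 0:
        return 0.0  # no mass at all: treat the center as lying on the axis
    xs = [math.cos(math.pi / 2 + 2 * math.pi * i / k) for i in range(k)]
    return sum(w * x for w, x in zip(weights, xs)) / total


def label(v: str, eps: float = 1e-9) -> str:
    """Label a vertex I, L, or R by the sign of the center-of-mass x value."""
    x = center_of_mass_x(v)
    if abs(x) < eps:
        return "I"
    return "L" if x < 0 else "R"
```

A different but equivalent convention (for example, rotating or mirroring the placement) would swap or shift the labels, so the labels only make sense relative to one fixed placement.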


Vertex labeling example

v = 010111: the x value of the center of mass is > 0 ⇒ R.

[Figure: the six symbols of v placed around a circle.]


Factor

· A factor is a set of cycles such that all vertices in the graph are in exactly one of the cycles.

· Each factor is determined by a feedback function $f(s_0, s_1, \ldots, s_{k-1}) = s_k$, which gives the successor $(s_1, \ldots, s_k)$ of each vertex $(s_0, \ldots, s_{k-1})$.


Pure cycling register ($PCR_k$)

· $PCR_k$ is a factor.

· Its feedback function is $f(s_0, s_1, \ldots, s_{k-1}) = s_k = s_0$; that is, for every arc ⟨u, v⟩ with $u = (s_0, s_1, \ldots, s_{k-1})$, we have $v = (s_1, s_2, \ldots, s_k) = (s_1, s_2, \ldots, s_{k-1}, s_0)$.

· The number of cycles in $PCR_k$ is Z(k), which approaches $|\Sigma|^k / k$ as k grows.

· It is proved that any cycle in $PCR_k$ must be either all I's or a block of L's and a block of R's separated by at most two I's.
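
Since the feedback function simply moves the first symbol to the end, the cycles of $PCR_k$ are the rotation classes of the k-mers. A minimal sketch that enumerates them (the function name `pcr_cycles` is made up for this example):

```python
from itertools import product


def pcr_cycles(k: int, alphabet: str = "01") -> list[list[str]]:
    """Partition all k-mers into the cycles of the pure cycling register."""
    seen: set[str] = set()
    cycles: list[list[str]] = []
    for t in product(alphabet, repeat=k):
        v = "".join(t)
        if v in seen:
            continue
        cycle = []
        u = v
        while u not in seen:
            seen.add(u)
            cycle.append(u)
            u = u[1:] + u[0]  # successor under f = s_0: rotate left by one
        cycles.append(cycle)
    return cycles
```

The number of cycles returned is Z(k); for the binary alphabet it is 4 when k = 3 and 6 when k = 4.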


$PCR_k$ example

[Figure: graph on the eight binary 3-mers 000, 001, 010, 011, 100, 101, 110, 111.]


Factor but not $PCR_k$ example

[Figure: graph on the eight binary 3-mers 000, 001, 010, 011, 100, 101, 110, 111.]


Why $PCR_k$?

Lemmas tell us:

· Every cycle either consists of all I's or contains at least one L and one R.

· Cycles with all I's are in $PCR_k$.

· For each cycle with at least one L and one R, there exists exactly one cycle in $PCR_k$ whose L block starts at the same vertex.

⇒ We only need to deal with cycles in $PCR_k$.


V-set

A minimum set of vertices whose removal leaves a graph with no cycles.


V-set

Naïve algorithm:

1. Choose a vertex v and find the cycle of $PCR_k$ that contains v.

2. Choose a vertex u of that cycle and add it to the V-set:

An arbitrary vertex, if the cycle is all I's.

The first vertex of the L block, otherwise.

3. Remove the cycle from the graph.

4. Repeat until all cycles of $PCR_k$ have been processed.
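
The naïve algorithm can be sketched as below. Two things are assumptions of this sketch rather than statements from the slides: the weights are placed counterclockwise starting from the top of the unit circle (angle 90° + 360°·i/k), and symbols are digit characters. The helper names are made up. For all-I cycles, the "arbitrary" vertex is simply the first one encountered.

```python
import math
from itertools import product


def label(v: str, eps: float = 1e-9) -> str:
    """I / L / R by the sign of the weighted x-coordinate (ASSUMED placement)."""
    k = len(v)
    x = sum(int(c) * math.cos(math.pi / 2 + 2 * math.pi * i / k)
            for i, c in enumerate(v))
    if abs(x) < eps:
        return "I"
    return "L" if x < 0 else "R"


def pcr_cycle(v: str) -> list[str]:
    """The cycle of the pure cycling register that contains v."""
    cycle, u = [v], v[1:] + v[0]
    while u != v:
        cycle.append(u)
        u = u[1:] + u[0]
    return cycle


def v_set(k: int, alphabet: str = "01") -> set[str]:
    """Pick one vertex per PCR cycle, following the naive algorithm."""
    chosen: set[str] = set()
    seen: set[str] = set()
    for t in product(alphabet, repeat=k):
        v = "".join(t)
        if v in seen:
            continue
        cycle = pcr_cycle(v)
        seen.update(cycle)
        labels = [label(u) for u in cycle]
        pick = cycle[0]  # arbitrary choice, used when the cycle is all I's
        for j, u in enumerate(cycle):
            # first vertex of the L block: an L whose cyclic predecessor is not L
            if labels[j] == "L" and labels[j - 1] != "L":
                pick = u
                break
        chosen.add(pick)
    return chosen
```

Under these assumptions, `v_set(3)` selects 000, 010, 110, and 111, one vertex from each of the four cycles of $PCR_3$.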


V-set example

[Figure: graph on the eight binary 3-mers 000, 001, 010, 011, 100, 101, 110, 111.]


V-set example

[Figure: binary 3-mer graph; vertices shown: 000, 010, 110, 111.]


Time complexity analysis

There are Z(k) iterations, and each iteration finds the vertex to add in O(k) time.

⇒ $O(k \cdot Z(k)) = O(|\Sigma|^k)$ in total.


How to find Minimum UHS?

1. Generate a complete de Bruijn graph of order k and set l = L − k.

2. Find a decycling vertex set (V-set), X.

3. Remove X from the graph, resulting in a graph G′.

4. Remove further vertices from G′ and add them to X until all remaining L-long sequences are hit, using one of:

(i) DOCKS (ii) DOCKSany (iii) DOCKSanyX

5. X is the universal hitting set we are searching for.


DOCKS

Define:

D(v, i) = the number of i-long paths starting at v

F(v, i) = the number of i-long paths ending at v

⇒ T(v, l) = the number of l-long paths through v = $\sum_{i=0}^{l} F(v, i) \cdot D(v, l - i)$

· Calculate D(−, −) and F(−, −) to find T(−, l).

· Choose the vertex with the largest T(−, l) and extract it (remove it from the graph and add it to the hitting set).
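
The counting and the greedy step can be sketched on a generic acyclic graph. This is a hedged illustration, not the authors' implementation: the graph is assumed to be a dictionary mapping every vertex (including sinks) to its list of successors, path length is counted in edges, and the names `count_paths`, `through`, and `greedy_hit` are invented here.

```python
def count_paths(graph: dict[str, list[str]], l: int):
    """D[v][i] and F[v][i]: number of i-edge paths starting / ending at v."""
    D = {v: [1] + [0] * l for v in graph}
    F = {v: [1] + [0] * l for v in graph}
    for i in range(1, l + 1):
        for u, succs in graph.items():
            for w in succs:
                D[u][i] += D[w][i - 1]  # extend a path from w backwards by u->w
                F[w][i] += F[u][i - 1]  # extend a path into u forwards by u->w
    return D, F


def through(graph: dict[str, list[str]], l: int) -> dict[str, int]:
    """T(v, l) = sum over i of F(v, i) * D(v, l - i)."""
    D, F = count_paths(graph, l)
    return {v: sum(F[v][i] * D[v][l - i] for i in range(l + 1)) for v in graph}


def greedy_hit(graph: dict[str, list[str]], l: int) -> list[str]:
    """Repeatedly remove the vertex on the most l-long paths until none remain."""
    graph = {u: list(vs) for u, vs in graph.items()}  # work on a copy
    removed: list[str] = []
    while graph:
        T = through(graph, l)
        best = max(T, key=T.get)
        if T[best] == 0:
            break  # no l-long path is left
        removed.append(best)
        del graph[best]
        for u in graph:
            graph[u] = [w for w in graph[u] if w != best]
    return removed
```

On the small graph a→b, a→c, b→d, c→d, d→e with l = 2, vertex d lies on all four 2-edge paths, so a single removal suffices. Recomputing all counts after every removal keeps the sketch short but is not meant to reflect the running time of DOCKS.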


DOCKS performance(set size)


DOCKS performance(runtime)


DOCKS performance(memory)


DOCKSany

Define:

D(v) = the number of paths starting at v

F(v) = the number of paths ending at v

⇒ T(v) = the number of paths through v = F(v) · D(v)

· Calculate D(−) and F(−) to find T(−).

· Choose the vertex with the largest T(−) and extract it (remove it from the graph and add it to the hitting set).
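
The any-length variant needs only one number per vertex in each direction. The sketch below makes two assumptions that the slides leave open: the graph is acyclic and given as a dictionary from every vertex to its successor list, and the trivial path consisting of the vertex alone is counted, so that F(v) · D(v) covers every path through v. The function name `any_length_scores` is made up.

```python
def any_length_scores(graph: dict[str, list[str]]) -> dict[str, int]:
    """T(v) = F(v) * D(v) for paths of any length in an acyclic graph."""
    preds: dict[str, list[str]] = {v: [] for v in graph}
    for u, succs in graph.items():
        for w in succs:
            preds[w].append(u)

    def count(v: str, nbrs: dict[str, list[str]], memo: dict[str, int]) -> int:
        # 1 for the trivial path at v, plus every path continuing via a neighbour
        if v not in memo:
            memo[v] = 1 + sum(count(w, nbrs, memo) for w in nbrs[v])
        return memo[v]

    D: dict[str, int] = {}
    F: dict[str, int] = {}
    for v in graph:
        count(v, graph, D)  # paths starting at v
        count(v, preds, F)  # paths ending at v
    return {v: F[v] * D[v] for v in graph}
```

On the graph a→b, a→c, b→d, c→d, d→e the highest score again belongs to d. The recursion is fine for small examples; a large graph would call for an iterative pass in topological order.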


DOCKSany performance(set size)


DOCKSany performance(runtime)


DOCKSany performance(memory)


DOCKSanyX

Same calculation as DOCKSany.

In each iteration, extract the (at most) x vertices with the largest T(−) instead of just one.


DOCKSanyX performance(set size)


DOCKSanyX performance(runtime)


DOCKSanyX performance(memory)


Conclusion

· DOCKS can generate compact sets of k-mers that hit all L-long sequences for any k ≤ 13 and L.

· These compact sets can improve many of the applications that currently use minimizers.
