Designing small universal k-mer hitting sets for improved analysis of high throughput sequencing Orenstein Y, Pellow D, Marçais G, Shamir R, Kingsford C PLOS Computational Biology. 2017 October; 13(10): e1005777


National Taiwan University

Designing small universal k-mer hitting sets for improved analysis of high throughput sequencing

Orenstein Y, Pellow D, Marçais G, Shamir R, Kingsford C

PLOS Computational Biology. 2017 October; 13(10): e1005777

Hung-Yu Chen, R06945024

Vincent Hwang, B05902122


Outline

· Background

· Methods and results

· Conclusion


Background

· Sequencing datasets keep growing in size.

· New computational ideas are essential to manage and analyze the data.


Minimizer

· Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA. Reducing storage requirements for biological sequence comparison. Bioinformatics. 2004 December; 20(18): 3363–3369

· Given a sequence of length L, its minimizer is the lexicographically smallest k-mer in it.

· Given a sequence S of any length, the minimizer set is the set of minimizers of all L-long substrings of S.

⇒ Every L-long substring of S is represented in the set.
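
The two definitions above can be sketched in a few lines of Python. This is only an illustration on a toy string; the function names `minimizer` and `minimizer_set` are hypothetical and do not come from the cited paper.

```python
def minimizer(window, k):
    """Return the lexicographically smallest k-mer of `window`."""
    return min(window[i:i + k] for i in range(len(window) - k + 1))


def minimizer_set(s, k, L):
    """Collect the minimizer of every L-long window (substring) of `s`."""
    return {minimizer(s[i:i + L], k) for i in range(len(s) - L + 1)}


# Toy input chosen for illustration only.
example = minimizer_set("ACGTACGTTGCA", k=3, L=6)
print(sorted(example))  # ['ACG', 'CGT', 'GCA']
```

Every 6-long window of the toy string contains at least one of the three selected 3-mers, which is exactly the "represented in the set" property.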


Application of Minimizers

· Hashing for read overlapping

· Sparse suffix arrays

· Bloom filters to speed up sequence search


Hashing for read overlapping


Sparse suffix arrays


Bloom filters to speed up sequence search


Universal hitting set (UHS)

· For integers k and L, a set $U_{k,L}$ of k-mers is called a UHS if every possible sequence of length L contains at least one k-mer from $U_{k,L}$.

· For example, the set of all k-mers is a trivial UHS.

· Problem 1. Given k and L, find a smallest UHS of k-mers.


Hits

· A k-mer w hits string S, denoted w ⊆ S, if w is a substring of S.

· A k-mer set X hits string S if there exists w ∈ X such that w ⊆ S.

· The UHS in Problem 1 is a set of k-mers $U_{k,L}$ that hits every possible sequence of length L.
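
For very small k and L, the definitions of "hits" and of a UHS can be checked by brute force. The sketch below is an illustration only: `hits` and `is_uhs` are hypothetical helper names, and enumerating all $|\Sigma|^L$ sequences is feasible only for toy parameters.

```python
from itertools import product


def hits(kmer_set, s, k):
    """True if some k-mer of `kmer_set` occurs as a substring of `s`."""
    return any(s[i:i + k] in kmer_set for i in range(len(s) - k + 1))


def is_uhs(kmer_set, k, L, alphabet="ACGT"):
    """Brute force over all |alphabet|^L sequences, so only for tiny L."""
    return all(hits(kmer_set, "".join(t), k)
               for t in product(alphabet, repeat=L))


# Toy binary example: this 3-mer set hits every sequence of length 6,
# but the length-5 sequence 10011 avoids it.
toy = {"000", "010", "111", "110"}
print(is_uhs(toy, k=3, L=6, alphabet="01"))  # True
print(is_uhs(toy, k=3, L=5, alphabet="01"))  # False
```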


Advantages of UHS over minimizers

· The set of minimizers may be as large as the complete set of k-mers. The method in this paper can often generate UHSs smaller by a factor of nearly k.

· A UHS is universal.

⇒ For any k and L, a UHS needs to be computed only once and can then be reused for every dataset.

⇒ The data structures created for different datasets will contain a comparable set of k-mers.


Using de Bruijn graphs to find UHSs

· Problem 2. Given a complete de Bruijn graph $D_k$ of order k and an integer L, find a smallest set of vertices $U_{k,L}$ such that every path in $D_k$ of length $l = L - k$ passes through at least one vertex of $U_{k,L}$.
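
The path length in Problem 2 follows from counting k-mers. A short derivation, assuming path length is measured in edges:

```latex
\[
\underbrace{L - k + 1}_{\text{$k$-mers (vertices) in an $L$-long sequence}}
\;=\; \underbrace{(L - k)}_{\text{edges on the path}} + 1
\quad\Longrightarrow\quad l = L - k .
\]
```

So hitting every path with $L - k$ edges is the same as hitting every sequence of length L.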


Complete de Bruijn graph

· A complete de Bruijn graph of order k over alphabet Σ:

V: $|\Sigma|^k$ vertices, each labelled with a unique k-mer.

E: If there is an edge (u, v) with a (k + 1)-mer label l, then the label of vertex u is the k-prefix of l and the label of vertex v is the k-suffix of l. A complete de Bruijn graph contains all $|\Sigma|^{k+1}$ possible edges of this type.
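
A complete de Bruijn graph is small enough to build explicitly for toy parameters. The sketch below is an illustration; the adjacency-dictionary representation and the name `de_bruijn_graph` are assumptions made for this example.

```python
from itertools import product


def de_bruijn_graph(k, alphabet="ACGT"):
    """Map each k-mer u to its successors v.

    The edge (u, v) carries the (k+1)-mer label u + v[-1], whose
    k-prefix is u and whose k-suffix is v.
    """
    return {
        "".join(t): ["".join(t[1:]) + c for c in alphabet]
        for t in product(alphabet, repeat=k)
    }


g = de_bruijn_graph(3, alphabet="01")
print(len(g))                           # 8 vertices = 2^3
print(sum(len(vs) for vs in g.values()))  # 16 edges = 2^4
print(g["001"])                         # ['010', '011']
```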


How to find the UHS?

· NP-hard in the general setting, where the k-mer set must hit a given set of L-long sequences (see the supporting information in the paper).

· Heuristic approaches: DOCKS, DOCKSany, DOCKSanyX.


How to find a UHS?

1. Generate a complete de Bruijn graph of order k and set l = L − k.

2. Find a decycling vertex set (V-set), X.

3. Remove X from the graph, resulting in $G'$.

4. Repeatedly remove vertices from $G'$ and add them to X until all remaining L-long sequences are hit, using one of:

(i) DOCKS (ii) DOCKSany (iii) DOCKSanyX

5. X is the universal hitting set we are searching for.


Decycling de Bruijn graph

· Vertex labeling

· Factor

· Pure cycling register ($PCR_k$)

· V-set


Decycling de Bruijn graph

[Figure: the binary de Bruijn graph of order 3, with vertices 000, 001, 010, 011, 100, 101, 110, 111.]


Vertex labeling

For a vertex $v = (s_0, s_1, \ldots, s_{k-1})$, calculate the center of mass.

According to the x-coordinate of the center of mass in the coordinate system, label the vertex I if x = 0, L if x < 0, and R if x > 0.


Vertex labeling example

v = 010111: the x-coordinate of its center of mass is > 0 ⇒ label R.

[Figure: diagram of the bits of v and its center of mass.]


Factor

· A factor is a set of cycles such that every vertex of the graph lies on exactly one of the cycles.

· Each factor is defined by a feedback function $f(s_0, s_1, \ldots, s_{k-1}) = s_k$, which sends vertex $(s_0, s_1, \ldots, s_{k-1})$ to its successor $(s_1, s_2, \ldots, s_k)$.


Pure cycling register ($PCR_k$)

· $PCR_k$ is a factor.

· Its feedback function is $f(s_0, s_1, \ldots, s_{k-1}) = s_k = s_0$; that is, for every arc ⟨u, v⟩, $u = (s_0, s_1, \ldots, s_{k-1}) \Rightarrow v = (s_1, s_2, \ldots, s_k) = (s_1, s_2, \ldots, s_{k-1}, s_0)$.

· The number of cycles in $PCR_k$ is Z(k), which is approximately $|\Sigma|^k / k$ for large k.

· It is proved that any cycle in $PCR_k$ must be either all I’s or a block of L’s and a block of R’s separated by at most two I’s.
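
Because the feedback function of the pure cycling register simply rotates a k-mer by one position, its cycles can be enumerated directly. The sketch below is illustrative; `pcr_cycles` is a hypothetical helper name.

```python
from itertools import product


def pcr_cycles(k, alphabet="01"):
    """Partition all k-mers into the cycles of the pure cycling register."""
    seen, cycles = set(), []
    for t in product(alphabet, repeat=k):
        v = "".join(t)
        if v in seen:
            continue
        cycle = []
        while v not in seen:
            seen.add(v)
            cycle.append(v)
            v = v[1:] + v[0]  # feedback s_k = s_0: rotate left by one
        cycles.append(cycle)
    return cycles


print(pcr_cycles(3))
# [['000'], ['001', '010', '100'], ['011', '110', '101'], ['111']]
print(len(pcr_cycles(10)), 2 ** 10 / 10)  # 108 102.4
```

For the binary alphabet and k = 10, the number of cycles (108) is already close to $|\Sigma|^k / k = 102.4$.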


$PCR_k$ example

[Figure: the cycles of $PCR_3$ in the binary de Bruijn graph of order 3, with vertices 000, 001, 010, 011, 100, 101, 110, 111.]


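Example of a factor that is not $PCR_k$

[Figure: a factor other than $PCR_3$ in the binary de Bruijn graph of order 3, with vertices 000, 001, 010, 011, 100, 101, 110, 111.]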


Why $PCR_k$?

Lemmas tell us:

· Every cycle either consists only of I’s or contains at least one L and one R.

· Cycles consisting only of I’s belong to $PCR_k$.

· For each cycle with at least one L and one R, there exists exactly one cycle in $PCR_k$ such that the first vertex of the L block is the same in both cycles.

⇒ We only need to deal with the cycles in $PCR_k$.


V-set

A minimum set of vertices whose removal leaves a graph with no cycles.


V-set

Naïve algorithm:

1. Choose a vertex v and find the cycle of $PCR_k$ that contains v.

2. Choose a vertex u on that cycle and add it to the V-set:

an arbitrary vertex, if the cycle is all I’s;

the first vertex of the L block, otherwise.

3. Remove the cycle from the graph.

4. Repeat until all cycles of $PCR_k$ have been processed.


V-set example

[Figure: the binary de Bruijn graph of order 3, with vertices 000, 001, 010, 011, 100, 101, 110, 111.]


V-set example

[Figure: the binary de Bruijn graph of order 3 with the vertices 000, 010, 111, and 110 singled out.]
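
Assuming the four vertices listed in this example (000, 010, 111, 110) are the selected V-set, the decycling property can be verified with a standard topological sort (Kahn's algorithm). This check is an illustration only and is not the selection algorithm itself; `is_acyclic_after_removal` is a hypothetical helper name.

```python
from itertools import product


def is_acyclic_after_removal(removed, k=3, alphabet="01"):
    """Kahn's algorithm on the de Bruijn graph of order k minus `removed`."""
    kept = {"".join(t) for t in product(alphabet, repeat=k)} - set(removed)
    succ = {v: [v[1:] + c for c in alphabet if v[1:] + c in kept]
            for v in kept}
    indeg = {v: 0 for v in kept}
    for v in kept:
        for w in succ[v]:
            indeg[w] += 1
    queue = [v for v in kept if indeg[v] == 0]
    visited = 0
    while queue:
        v = queue.pop()
        visited += 1
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return visited == len(kept)  # all vertices sorted <=> no cycle left


print(is_acyclic_after_removal({"000", "010", "111", "110"}))  # True
print(is_acyclic_after_removal({"000", "001", "011", "111"}))  # False
```

The second call also takes one vertex from each cycle of the pure cycling register, yet the cycle 010 → 101 → 010 survives, which shows why the choice of vertex within each cycle matters.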


Time complexity analysis

There are Z(k) iterations, and finding the vertex to add costs O(k) time in each iteration.

⇒ $O(k \cdot Z(k)) = O(|\Sigma|^k)$ in total.
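
The last equality can be checked with the standard necklace-counting formula for Z(k). The formula is a well-known combinatorial identity and is not stated on the slide; $\varphi$ denotes Euler's totient function.

```latex
\[
Z(k) = \frac{1}{k} \sum_{d \mid k} \varphi(d)\, |\Sigma|^{k/d}
\quad\Longrightarrow\quad
k \cdot Z(k) = \sum_{d \mid k} \varphi(d)\, |\Sigma|^{k/d}
= |\Sigma|^{k} + O\!\left(k\, |\Sigma|^{k/2}\right)
= O\!\left(|\Sigma|^{k}\right).
\]
```

The $d = 1$ term contributes $|\Sigma|^k$, and the remaining terms are bounded using $\sum_{d \mid k} \varphi(d) = k$.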


How to find a small UHS?

1. Generate a complete de Bruijn graph of order k and set l = L − k.

2. Find a decycling vertex set (V-set), X.

3. Remove X from the graph, resulting in $G'$.

4. Repeatedly remove vertices from $G'$ and add them to X until all remaining L-long sequences are hit, using one of:

(i) DOCKS (ii) DOCKSany (iii) DOCKSanyX

5. X is the universal hitting set we are searching for.


DOCKS

Define:

D(v, i) = the number of i-long paths starting at v

F(v, i) = the number of i-long paths ending at v

⇒

T(v, l) = the number of l-long paths through v

$= \sum_{i=0}^{l} F(v, i) \cdot D(v, l - i)$

· Calculate D(·, ·) and F(·, ·) to find T(·, l).

· Choose the vertex with the largest T(·, l) and remove it from the graph.
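
The quantities D, F and T can be computed by dynamic programming over path lengths. The sketch below is a minimal illustration, not the DOCKS implementation: it assumes the graph is already acyclic and given as a dictionary of successor lists, and the names `hitting_numbers` and `docks_greedy` are hypothetical. The example graph is the binary order-3 de Bruijn graph with 000, 010, 111 and 110 removed.

```python
def hitting_numbers(succ, l):
    """T(v, l) for every vertex of a DAG given as {vertex: [successors]}."""
    pred = {v: [] for v in succ}
    for u, ws in succ.items():
        for w in ws:
            pred[w].append(u)
    # D[i][v]: number of i-edge paths starting at v; F[i][v]: ending at v.
    D = [{v: 1 for v in succ}]
    F = [{v: 1 for v in succ}]
    for i in range(1, l + 1):
        D.append({v: sum(D[i - 1][w] for w in succ[v]) for v in succ})
        F.append({v: sum(F[i - 1][u] for u in pred[v]) for v in succ})
    return {v: sum(F[i][v] * D[l - i][v] for i in range(l + 1))
            for v in succ}


def docks_greedy(succ, l):
    """Remove the vertex with the largest T(v, l) until no l-long path is left."""
    succ = {v: list(ws) for v, ws in succ.items()}  # work on a copy
    chosen = []
    while succ:
        T = hitting_numbers(succ, l)
        v = max(T, key=T.get)
        if T[v] == 0:
            break
        chosen.append(v)
        del succ[v]
        for ws in succ.values():
            if v in ws:
                ws.remove(v)
    return chosen


remaining = {"001": ["011"], "011": [], "100": ["001"], "101": ["011"]}
print(hitting_numbers(remaining, 2))
# {'001': 1, '011': 1, '100': 1, '101': 0}
```

The only 2-edge path left is 100 → 001 → 011, so each of its three vertices has hitting number 1 and removing any one of them finishes the greedy loop.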

DOCKS performance (set size)

DOCKS performance (runtime)

DOCKS performance (memory)

DOCKSany

Define:

D(v) = the number of paths of any length starting at v

F(v) = the number of paths of any length ending at v

⇒

T(v) = the number of paths through v

= F(v) · D(v)

· Calculate D(·) and F(·) to find T(·).

· Choose the vertex with the largest T(·) and remove it from the graph.
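
Counting paths of any length needs only one number per vertex in each direction. The sketch below is an illustration under two assumptions that are not stated on the slide: the graph is acyclic, and the zero-length path consisting of v alone is counted in both D(v) and F(v). The name `any_length_hitting_numbers` is hypothetical.

```python
def any_length_hitting_numbers(succ):
    """T(v) = F(v) * D(v) on a DAG given as {vertex: [successors]}."""
    pred = {v: [] for v in succ}
    for u, ws in succ.items():
        for w in ws:
            pred[w].append(u)

    def count(v, nbrs, memo):
        # Paths leaving v along `nbrs`, including the zero-length path.
        if v not in memo:
            memo[v] = 1 + sum(count(w, nbrs, memo) for w in nbrs[v])
        return memo[v]

    D, F = {}, {}
    return {v: count(v, pred, F) * count(v, succ, D) for v in succ}


remaining = {"001": ["011"], "011": [], "100": ["001"], "101": ["011"]}
print(any_length_hitting_numbers(remaining))
# {'001': 4, '011': 4, '100': 3, '101': 2}
```

For example, the four paths through 001 are 001, 001 → 011, 100 → 001, and 100 → 001 → 011.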

DOCKSany performance (set size)

DOCKSany performance (runtime)

DOCKSany performance (memory)

DOCKSanyX

Same calculation as DOCKSany.

In each iteration, remove up to x of the top-scoring vertices instead of just one.
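
The only change relative to DOCKSany is the selection step, which can be sketched as follows. The helper name `top_x_vertices` and the toy scores are assumptions made for illustration.

```python
import heapq


def top_x_vertices(T, x):
    """Return up to x vertices with the largest hitting numbers in T."""
    return heapq.nlargest(x, T, key=T.get)


# Toy hitting numbers for a four-vertex graph.
scores = {"001": 4, "011": 4, "100": 3, "101": 2}
print(sorted(top_x_vertices(scores, 2)))  # ['001', '011']
```

Removing several vertices per iteration means the hitting numbers are recomputed less often, at the possible cost of a larger final set.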

DOCKSanyX performance (set size)

DOCKSanyX performance (runtime)

DOCKSanyX performance (memory)
