National Taiwan University
Designing small universal k-mer hitting sets for improved analysis of high throughput sequencing
Orenstein Y, Pellow D, Marçais G, Shamir R, Kingsford C
PLOS Computational Biology. 2017 October; 13(10): e1005777
Hung-Yu Chen, R06945024
Vincent Hwang, B05902122
Outline
· Background
· Methods and results
· Conclusion
Background
· Sequencing datasets keep growing in size.
· New computational ideas are essential to manage and analyze the data.
Minimizer
· Michael Roberts, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, James A. Yorke;
Reducing storage requirements for biological sequence comparison, Bioinformatics, Volume 20, Issue 18, 12 December 2004, Pages 3363−3369
· Given a sequence of length L, the minimizer is the lexicographically smallest k-mer in it.
· Given a sequence S of any length, the minimizer set is the set of minimizers of every L-long substring of S.
⟹ Every L-long substring of S is represented in the set.
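The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names `minimizer` and `minimizer_set` are hypothetical, and plain string comparison is assumed as the lexicographic order.

```python
def minimizer(window: str, k: int) -> str:
    """Return the lexicographically smallest k-mer of `window`."""
    return min(window[i:i + k] for i in range(len(window) - k + 1))


def minimizer_set(s: str, k: int, L: int) -> set[str]:
    """Collect the minimizer of every L-long substring of `s`."""
    return {minimizer(s[i:i + L], k) for i in range(len(s) - L + 1)}


if __name__ == "__main__":
    # Every 5-long window of this toy sequence is represented by one 3-mer.
    print(sorted(minimizer_set("ACGTACGT", k=3, L=5)))
```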
Application of Minimizers
· Hashing for read overlapping
· Sparse suffix arrays
· Bloom filters to speed up sequence search
Universal hitting set (UHS)
· For integers k and L, a set $U_{k,L}$ of k-mers is called a UHS if every possible sequence of length L contains at least one k-mer from $U_{k,L}$.
· For example, the set of all k-mers is a trivial UHS.
· Problem 1. Given k and L, find a smallest UHS of k-mers.
Hits
· A k-mer w hits a string S, denoted w ⊆ S, if w is a substring of S.
· A k-mer set X hits a string S if there exists w ∈ X such that w ⊆ S.
· The UHS in Problem 1 is a set of k-mers $U_{k,L}$ that hits every possible sequence of length L.
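The definitions of "hits" and UHS translate directly into a brute-force check. The sketch below enumerates all sequences of length L, so it is only practical for tiny k and L; the names `hits` and `is_uhs` and the default alphabet are assumptions made for illustration.

```python
from itertools import product


def hits(kmers: set[str], s: str, k: int) -> bool:
    """True if some k-mer of `kmers` is a substring of `s`."""
    return any(s[i:i + k] in kmers for i in range(len(s) - k + 1))


def is_uhs(kmers: set[str], k: int, L: int, alphabet: str = "ACGT") -> bool:
    """True if `kmers` hits every sequence of length L over `alphabet`."""
    return all(hits(kmers, "".join(t), k) for t in product(alphabet, repeat=L))
```

For the binary alphabet with k = 2 and L = 3, the set {00, 01, 11} passes this check, while {00, 11} does not because it misses the sequence 010.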
Advantages of UHS over minimizers
· The set of minimizers may be as large as the complete set of k-mers. The method in this paper can often generate UHSs that are smaller by a factor of nearly k.
· A UHS is universal.
⟹ For any k and L, a UHS needs to be computed only once and can then be reused for every dataset.
⟹ The data structures created for different datasets will contain a comparable set of k-mers.
Using de Bruijn graphs to find UHSs
· Problem 2. Given a complete de Bruijn graph $D_k$ of order k and an integer L, find a smallest set of vertices $U_{k,L}$ such that any path in $D_k$ of length $l = L - k$ passes through at least one vertex of $U_{k,L}$.
Complete de Bruijn graph
· A complete de Bruijn graph of order k over alphabet Σ:
V: $|\Sigma|^k$ vertices, each labelled with a unique k-mer.
E: If there is an edge (u, v) with a (k + 1)-mer label l, then the label of vertex u is the k-prefix of l and the label of vertex v is the k-suffix of l. A complete de Bruijn graph contains all $|\Sigma|^{k+1}$ possible edges of this type.
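A complete de Bruijn graph is small enough to build explicitly for toy values of k. The sketch below assumes the usual convention that an edge runs from the k-prefix to the k-suffix of its (k + 1)-mer label; the function name `de_bruijn_graph` and the adjacency-dictionary representation are illustrative choices, not taken from the paper.

```python
from itertools import product


def de_bruijn_graph(k: int, alphabet: str = "ACGT") -> dict[str, list[str]]:
    """Map every k-mer u to the list of its successors v.

    The edge (u, v) carries the (k+1)-mer label u + c, where c is the last
    character of v, so u is the k-prefix and v the k-suffix of the label.
    """
    vertices = ["".join(t) for t in product(alphabet, repeat=k)]
    return {u: [u[1:] + c for c in alphabet] for u in vertices}


if __name__ == "__main__":
    g = de_bruijn_graph(3, "01")
    print(len(g), sum(len(vs) for vs in g.values()))  # vertex and edge counts
```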
How to find a UHS?
· NP-hard in general (see the supporting information of the paper).
· Heuristic approaches: DOCKS, DOCKSany, DOCKSanyX.
How to find a UHS?
1. Generate a complete de Bruijn graph of order k and set l = L − k.
2. Find a decycling vertex set (V-set), X.
3. Remove X from the graph, resulting in G′.
4. Remove further vertices from G′ and add them to X until every remaining sequence of length L is hit, using one of:
(i) DOCKS (ii) DOCKSany (iii) DOCKSanyX
5. X is the universal hitting set we are searching for.
Decycling de Bruijn graph
· Vertex labeling
· Factor
· Pure cycling register ($PCR_k$)
· V-set
[Figure: the binary de Bruijn graph of order 3, with vertices 000, 001, 010, 011, 100, 101, 110, 111.]
Vertex labeling
For a vertex $v = (s_0, s_1, \ldots, s_{k-1})$, calculate the center of mass.
According to the x-coordinate of the center of mass, label the vertex I if x = 0, L if x < 0, and R if x > 0.
Vertex labeling example
For v = 010111, the x-coordinate of the center of mass is positive ⟹ label R.
[Figure: the digits of v = 010111 arranged around a circle.]
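The slides do not spell out how the center of mass is computed, so the following is a sketch under an explicit assumption: digit $s_j$ is placed as a weight on the unit circle at angle 90° + 360°·j/k (counterclockwise, starting at the top). This placement is chosen here because it reproduces the label R for the example v = 010111. The names `weighted_x`, `label` and the tolerance `EPS` are hypothetical.

```python
import math

EPS = 1e-9  # tolerance for treating a floating-point x as exactly zero


def weighted_x(kmer: str, alphabet: str = "01") -> float:
    """Sum of weight * x-position; it has the same sign as the center of mass."""
    k = len(kmer)
    return sum(
        alphabet.index(ch) * math.cos(math.pi / 2 + 2 * math.pi * j / k)
        for j, ch in enumerate(kmer)
    )


def label(kmer: str, alphabet: str = "01") -> str:
    """Return 'I' if x = 0, 'L' if x < 0, 'R' if x > 0."""
    x = weighted_x(kmer, alphabet)
    if abs(x) < EPS:
        return "I"
    return "L" if x < 0 else "R"
```

Only the sign of x matters for the label, so the sketch skips the division by the total weight.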
Factor
· A factor is a set of cycles such that every vertex of the graph lies on exactly one of the cycles.
· Each factor is defined by a unique feedback function $f(s_0, s_1, \ldots, s_{k-1}) = s_k$.
Pure cycling register ($PCR_k$)
· $PCR_k$ is a factor.
· Its feedback function is $f(s_0, s_1, \ldots, s_{k-1}) = s_k = s_0$; that is, for every arc $\langle u, v \rangle$, $u = (s_0, s_1, \ldots, s_{k-1}) \Rightarrow v = (s_1, s_2, \ldots, s_k) = (s_1, s_2, \ldots, s_{k-1}, s_0)$.
· The number of cycles in $PCR_k$ is $Z(k)$, which is asymptotically $|\Sigma|^k / k$.
· It is proved that any cycle in $PCR_k$ must be either all I's or a block of L's and a block of R's separated by at most two I's.
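Because the feedback function of $PCR_k$ simply moves the first symbol to the end, its cycles are the rotation classes of the k-mers. The sketch below enumerates them for small k; the function name `pcr_cycles` is hypothetical.

```python
from itertools import product


def pcr_cycles(k: int, alphabet: str = "01") -> list[list[str]]:
    """Partition all k-mers into the cycles of the pure cycling register."""
    seen: set[str] = set()
    cycles: list[list[str]] = []
    for t in product(alphabet, repeat=k):
        v = "".join(t)
        if v in seen:
            continue
        cycle = []
        while v not in seen:       # follow v -> (s_1, ..., s_{k-1}, s_0)
            seen.add(v)
            cycle.append(v)
            v = v[1:] + v[0]
        cycles.append(cycle)
    return cycles


if __name__ == "__main__":
    for k in range(1, 9):
        # Z(k) next to the estimate |Sigma|^k / k for the binary alphabet
        print(k, len(pcr_cycles(k)), 2 ** k / k)
```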
$PCR_k$ example
[Figure: the binary de Bruijn graph of order 3 (vertices 000, 001, 010, 011, 100, 101, 110, 111), illustrating $PCR_3$.]
Factor but not $PCR_k$ example
[Figure: the same graph, illustrating a factor that is not $PCR_3$.]
Why $PCR_k$?
Lemmas tell us:
· Every cycle either consists of I's only or contains at least one L and one R.
· Cycles consisting of I's only are in $PCR_k$.
· For each cycle with at least one L and one R, there exists exactly one cycle in $PCR_k$ such that the first vertex of the L block is the same in both cycles.
⟹ We only need to deal with the cycles in $PCR_k$.
V-set
A minimum-size set of vertices whose removal leaves a graph with no cycles.
Naïve algorithm:
1. Choose a vertex v and find the cycle of $PCR_k$ that contains v.
2. Choose a vertex u on that cycle and add it to the V-set:
an arbitrary vertex, if the cycle is all I's;
the first vertex of the L block, otherwise.
3. Remove the cycle from the graph.
4. Repeat until all cycles of $PCR_k$ have been processed.
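The naïve algorithm above can be sketched as follows. The labeling rule is an assumption (digit $s_j$ weighted at angle 90° + 360°·j/k on the unit circle), and all function names are hypothetical. The helper `is_decycling` is added only to check the result on small graphs by repeatedly peeling off vertices without successors.

```python
import math
from itertools import product

EPS = 1e-9  # tolerance for treating a floating-point x as zero


def label(kmer: str, alphabet: str) -> str:
    """Assumed labeling: I if x = 0, L if x < 0, R if x > 0."""
    k = len(kmer)
    x = sum(alphabet.index(c) * math.cos(math.pi / 2 + 2 * math.pi * j / k)
            for j, c in enumerate(kmer))
    return "I" if abs(x) < EPS else ("L" if x < 0 else "R")


def v_set(k: int, alphabet: str = "01") -> set[str]:
    """Pick one vertex from every cycle of PCR_k."""
    seen: set[str] = set()
    chosen: set[str] = set()
    for t in product(alphabet, repeat=k):
        v = "".join(t)
        if v in seen:
            continue
        cycle = []
        while v not in seen:              # walk the rotation cycle of v
            seen.add(v)
            cycle.append(v)
            v = v[1:] + v[0]
        labels = [label(u, alphabet) for u in cycle]
        if "L" not in labels:             # all-I cycle: any vertex will do
            chosen.add(cycle[0])
            continue
        for i, u in enumerate(cycle):     # first vertex of the L block
            if labels[i] == "L" and labels[i - 1] != "L":
                chosen.add(u)
                break
    return chosen


def is_decycling(k: int, removed: set[str], alphabet: str = "01") -> bool:
    """True if the de Bruijn graph minus `removed` has no cycle."""
    rest = {"".join(t) for t in product(alphabet, repeat=k)} - removed
    succ = {v: {v[1:] + c for c in alphabet} & rest for v in rest}
    while True:
        sinks = {v for v in succ if not succ[v]}
        if not sinks:
            return not succ               # leftover vertices lie on cycles
        for v in sinks:
            del succ[v]
        for v in succ:
            succ[v] -= sinks
```

Under this assumed labeling, `v_set(3)` yields the four vertices 000, 010, 110 and 111, one per cycle of $PCR_3$.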
V-set example
[Figure: the binary de Bruijn graph of order 3, with vertices 000, 001, 010, 011, 100, 101, 110, 111.]
[Figure: the vertices 000, 010, 111 and 110.]
Time complexity analysis
There are $Z(k)$ iterations, and finding the vertex to add costs $O(k)$ time in each iteration.
⟹ $O(k \cdot Z(k)) = O(|\Sigma|^k)$ in total.
How to find a minimum UHS?
1. Generate a complete de Bruijn graph of order k and set l = L − k.
2. Find a decycling vertex set (V-set), X.
3. Remove X from the graph, resulting in G′.
4. Remove further vertices from G′ and add them to X until every remaining sequence of length L is hit, using one of:
(i) DOCKS (ii) DOCKSany (iii) DOCKSanyX
5. X is the universal hitting set we are searching for.
DOCKS
Define:
$D(v, i)$ = the number of i-long paths starting at v
$F(v, i)$ = the number of i-long paths ending at v
⟹
$T(v, l)$ = the number of l-long paths through v
$= \sum_{i=0}^{l} F(v, i) \cdot D(v, l - i)$
· Calculate $D(\cdot, \cdot)$ and $F(\cdot, \cdot)$ to find $T(\cdot, l)$.
· Choose the vertex with the largest $T(\cdot, l)$ and extract it from the graph.
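The counting and selection steps above can be sketched with a simple dynamic program. The sketch assumes an acyclic graph given as a dictionary from each vertex to its successors (every vertex must appear as a key) and measures path length in edges; the names `path_counts` and `docks_greedy` and the alphabetical tie-breaking are illustrative assumptions rather than the authors' implementation.

```python
def path_counts(succ: dict, l: int) -> dict:
    """Return T(v, l) = sum_i F(v, i) * D(v, l - i) for every vertex v."""
    pred = {v: [] for v in succ}
    for u in succ:
        for w in succ[u]:
            pred[w].append(u)
    # D[v][i]: i-long paths starting at v; F[v][i]: i-long paths ending at v.
    D = {v: [1] + [0] * l for v in succ}
    F = {v: [1] + [0] * l for v in succ}
    for i in range(1, l + 1):
        for v in succ:
            D[v][i] = sum(D[w][i - 1] for w in succ[v])
            F[v][i] = sum(F[u][i - 1] for u in pred[v])
    return {v: sum(F[v][i] * D[v][l - i] for i in range(l + 1)) for v in succ}


def docks_greedy(succ: dict, l: int) -> list:
    """Repeatedly extract the vertex on the most l-long paths."""
    succ = {v: set(ws) for v, ws in succ.items()}   # work on a copy
    picked = []
    while succ:
        T = path_counts(succ, l)
        best = max(sorted(T), key=T.get)            # ties: alphabetical order
        if T[best] == 0:                            # no l-long path is left
            break
        picked.append(best)
        del succ[best]
        for ws in succ.values():
            ws.discard(best)
    return picked


if __name__ == "__main__":
    chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
    print(path_counts(chain, 2), docks_greedy(chain, 2))
```

Recomputing all counts after each extraction keeps the sketch short at the cost of speed.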
DOCKS performance (set size)
DOCKS performance (runtime)
DOCKS performance (memory)
DOCKSany
Define:
$D(v)$ = the number of paths starting at v
$F(v)$ = the number of paths ending at v
⟹
$T(v)$ = the number of paths through v
$= F(v) \cdot D(v)$
· Calculate $D(\cdot)$ and $F(\cdot)$ to find $T(\cdot)$.
· Choose the vertex with the largest $T(\cdot)$ and extract it from the graph.
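Counting paths of any length needs only one pass in topological order and one in reverse. The sketch below assumes an acyclic graph given as a successor dictionary and counts the single-vertex path of length 0 as a path, which is an assumption of this illustration; the function name `any_length_path_counts` is hypothetical. It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def any_length_path_counts(succ: dict) -> dict:
    """Return T(v) = F(v) * D(v) for every vertex of an acyclic graph."""
    pred = {v: set() for v in succ}
    for u, ws in succ.items():
        for w in ws:
            pred[w].add(u)
    # TopologicalSorter expects node -> predecessors and yields sources first.
    order = list(TopologicalSorter(pred).static_order())
    F, D = {}, {}
    for v in order:                       # paths ending at v
        F[v] = 1 + sum(F[u] for u in pred[v])
    for v in reversed(order):             # paths starting at v
        D[v] = 1 + sum(D[w] for w in succ[v])
    return {v: F[v] * D[v] for v in succ}


if __name__ == "__main__":
    chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
    print(any_length_path_counts(chain))
```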
DOCKSany performance (set size)
DOCKSany performance (runtime)
DOCKSany performance (memory)
DOCKSanyX
Same calculation as DOCKSany.
In each iteration, extract at most x of the top-scoring vertices instead of just one.
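The selection step could look like the sketch below, which takes a dictionary of scores T(v) and returns up to x vertices to extract in one iteration. Skipping vertices with a zero score and breaking ties alphabetically are assumptions of this illustration, as is the function name `top_x_vertices`.

```python
def top_x_vertices(scores: dict, x: int) -> list:
    """Return at most x vertices with the largest positive scores."""
    candidates = [v for v, t in scores.items() if t > 0]
    # Sort by descending score, then by vertex name to make ties deterministic.
    candidates.sort(key=lambda v: (-scores[v], v))
    return candidates[:x]


if __name__ == "__main__":
    print(top_x_vertices({"a": 4, "b": 6, "c": 6, "d": 4}, 2))
```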
DOCKSanyX performance (set size)
DOCKSanyX performance (runtime)
DOCKSanyX performance (memory)