### National Taiwan University

## Designing small universal k-mer hitting sets for improved analysis of high throughput sequencing

Orenstein Y, Pellow D, Marçais G, Shamir R, Kingsford C

PLOS Computational Biology. 2017 October; 13(10): e1005777

### Hung-Yu Chen, R06945024

### Vincent Hwang, B05902122

## Outline

*· Background*

*· Methods and results*

*· Conclusion*

## Background

*· Sequencing datasets keep growing larger.*

*· New computational ideas are essential to manage and analyze the data.*

## Minimizer

*· Michael Roberts, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, James A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, Volume 20, Issue 18, 12 December 2004, Pages 3363–3369.*

*· Given a sequence of length L, the minimizer is the lexicographically smallest k-mer in it.*

*· Given a sequence S of any length, the minimizer set is the set of minimizers of every L-long substring of S.*

*⇒ Every L-long substring of S is represented in the set.*
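The two definitions above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `minimizer` and `minimizer_set` and the example sequence are assumptions made for this example, and plain string comparison is used as the lexicographic order.

```python
def minimizer(window, k):
    """Lexicographically smallest k-mer of one L-long window."""
    return min(window[i:i + k] for i in range(len(window) - k + 1))


def minimizer_set(s, k, L):
    """Minimizers of every L-long substring (window) of s."""
    return {minimizer(s[i:i + L], k) for i in range(len(s) - L + 1)}


# Example sequence chosen only for illustration.
example = minimizer_set("ACGTTGCA", k=3, L=5)
```

Every window contributes one k-mer, so each L-long substring of the input is represented by some element of the returned set.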

## Application of Minimizers

*· Hashing for read overlapping*

*· Sparse suffix arrays*

*· Bloom filters to speed up sequence search*

## Hashing for read overlapping

## Sparse suffix arrays

## Bloom filters to speed up sequence search

## Universal hitting set(UHS)

*· For integers k and L, a set $U_{k,L}$ is called a UHS of k-mers if every possible sequence of length L must contain at least one k-mer in $U_{k,L}$.*

*· For example, the set of all k-mers is a trivial UHS.*

**· Problem 1. Given k and L, find a smallest UHS of k-mers.**


## Hits

*· A k-mer w hits string S, denoted w ⊆ S, if w is a substring in S.*

*· A k-mer set X hits string S if there exists w ∈ X such that w ⊆ S.*

*· The UHS in Problem 1 is a set of k-mers $U_{k,L}$ which hits every possible sequence of length L.*
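For small k and L, the definitions of "hits" and of a UHS can be checked by brute force. The sketch below is not the paper's method; the names `hits` and `is_uhs` are made up for this example, and enumerating all $|\Sigma|^L$ sequences is only feasible for tiny parameters.

```python
from itertools import product


def hits(kmer_set, s, k):
    """True if some k-mer of kmer_set is a substring of s."""
    return any(s[i:i + k] in kmer_set for i in range(len(s) - k + 1))


def is_uhs(kmer_set, k, L, alphabet="ACGT"):
    """True if kmer_set hits every possible sequence of length L."""
    return all(hits(kmer_set, "".join(p), k)
               for p in product(alphabet, repeat=L))


# The set of all k-mers is the trivial UHS (binary alphabet, k = 2, L = 3).
all_2mers = {"".join(p) for p in product("01", repeat=2)}
trivial_ok = is_uhs(all_2mers, 2, 3, alphabet="01")
```

A check like this is useful for validating the output of a heuristic on small instances.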

## Advantages of UHS over minimizers

*· The set of minimizers may be as large as the complete set of k-mers. The method in this paper can often generate UHSs smaller by a factor of nearly k.*

*· UHS is universal.*

*⇒ For any k and L, a UHS needs to be computed only once and can then be used for every dataset.*

*⇒ The data structures created for different datasets will contain a comparable set of k-mers.*

## Using de Bruijn graphs to find UHSs

**· Problem 2. Given a complete de Bruijn graph $D_k$ of order k and an integer L, find a smallest set of vertices $U_{k,L}$ such that any path in $D_k$ of length l = L − k passes through at least one vertex of $U_{k,L}$.**

## Complete de Bruijn graph

*· A complete de Bruijn graph of order k over alphabet Σ:*

*V: $|\Sigma|^k$ vertices, each labelled with a unique k-mer.*

*E: If there is an edge (u, v) with a (k + 1)-mer label l, then the label of vertex u is the k-prefix of l and the label of vertex v is the k-suffix of l. A complete de Bruijn graph contains all $|\Sigma|^{k+1}$ possible edges of this type.*
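A minimal sketch of this construction follows. It assumes the convention that an edge (u, v) labelled by a (k + 1)-mer has u as the k-prefix and v as the k-suffix of the label, so that consecutive vertices overlap in k − 1 symbols. The function name `complete_de_bruijn` is hypothetical.

```python
from itertools import product


def complete_de_bruijn(k, alphabet="ACGT"):
    """Return (vertices, edges) of the complete de Bruijn graph of order k.

    edges maps each (k + 1)-mer label to the pair (u, v), where u is the
    k-prefix and v is the k-suffix of the label.
    """
    vertices = ["".join(p) for p in product(alphabet, repeat=k)]
    edges = {u + c: (u, u[1:] + c) for u in vertices for c in alphabet}
    return vertices, edges


# Binary graph of order 3: 2**3 vertices and 2**4 edges.
vertices, edges = complete_de_bruijn(3, alphabet="01")
```

A path with l = L − k edges visits L − k + 1 vertices, which are exactly the k-mers of one L-long sequence.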

## How to find the UHS?

*· NP-hard in general (see the supporting information of the paper).*

*· Heuristic approaches: DOCKS, DOCKSany, DOCKSanyX.*

## How to find UHS?

*1. Generate a complete de Bruijn graph of order k and set l = L − k.*

*2. Find a decycling vertex set (V-set), X.*

*3. Remove X from the graph, resulting in G′.*

*4. Remove vertices from G′ and add them to S until every remaining L-long sequence is hit, using one of:*

*(i) DOCKS (ii) DOCKSany (iii) DOCKSanyX*

*5. X ∪ S is the universal hitting set we are searching for.*

## Decycling de Bruijn graph

*· Vertices labeling*

*· Factor*

*· Pure cycling register ($PCR_k$)*

*· V-set*

## Decycling de Bruijn graph

[Figure: the binary de Bruijn graph of order 3, with vertices 000, 001, 010, 011, 100, 101, 110, 111.]

## Vertices labeling

*For a vertex $v = (s_0, s_1, \dots, s_{k-1})$, calculate the center of mass.*

*According to the position of the center of mass in the coordinate system, label the vertex I if x = 0, L if x < 0, and R if x > 0.*

## Vertex labeling example

*v = 010111: the x-coordinate of the center of mass is > 0 ⇒ R.*

[Figure: the bits of v placed in the coordinate system.]

## Factor

*· A factor is a set of cycles such that every vertex in the graph is in exactly one of the cycles.*

*· A factor is determined by a feedback function $f(s_0, s_1, \dots, s_{k-1}) = s_k$, which gives the successor of each vertex.*

## Pure cycling register ($PCR_k$)

*· $PCR_k$ is a factor.*

*· Its feedback function is $f(s_0, s_1, \dots, s_{k-1}) = s_k = s_0$; that is, for every arc ⟨u, v⟩, $u = (s_0, s_1, \dots, s_{k-1}) \Rightarrow v = (s_1, s_2, \dots, s_k) = (s_1, s_2, \dots, s_{k-1}, s_0)$.*

*· The number of cycles in $PCR_k$ is Z(k), which approaches $|\Sigma|^k / k$ as k grows.*

*· It is proved that any cycle in $PCR_k$ must be either all I's or a block of L's and a block of R's separated by at most two I's.*
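Because the feedback $s_k = s_0$ simply rotates a k-mer by one position, the cycles of the pure cycling register are the rotation classes of the k-mers. The sketch below enumerates them. The function name `pcr_cycles` is made up for this example.

```python
from itertools import product


def pcr_cycles(k, alphabet="01"):
    """Partition all k-mers into the cycles of the pure cycling register."""
    seen, cycles = set(), []
    for p in product(alphabet, repeat=k):
        v = "".join(p)
        if v in seen:
            continue
        cycle, u = [], v
        while u not in seen:
            seen.add(u)
            cycle.append(u)
            u = u[1:] + u[0]  # successor under the feedback s_k = s_0
        cycles.append(cycle)
    return cycles


cycles_3 = pcr_cycles(3)  # binary alphabet, k = 3
```

For the binary alphabet and k = 3 this gives the four cycles {000}, {001, 010, 100}, {011, 110, 101} and {111}, so Z(3) = 4.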

## $PCR_k$ example

[Figure: the binary de Bruijn graph of order 3 (vertices 000 to 111), illustrating $PCR_3$.]

## Factor but not $PCR_k$ example

[Figure: the binary de Bruijn graph of order 3 (vertices 000 to 111), illustrating a factor that is not $PCR_3$.]

## Why $PCR_k$?

**Lemmas tell us:**

*· Every cycle either consists only of I's or contains at least one L and one R.*

*· Cycles consisting only of I's are in $PCR_k$.*

*· For each cycle with at least one L and one R, there exists exactly one cycle in $PCR_k$ whose L block starts at the same vertex as the L block of that cycle.*

*⇒ We only need to deal with cycles in $PCR_k$.*

## V-set

A minimum set of vertices whose removal leaves a graph with no cycles.

## V-set

Naïve algorithm:

*1. Choose a vertex v and find the cycle of $PCR_k$ that contains v.*

*2. Choose a vertex u of that cycle and add it to the V-set:*

*an arbitrary vertex, if the cycle is all I's;*

*the first vertex in the L block, otherwise.*

*3. Remove the cycle from the graph.*

*4. Repeat until all cycles of $PCR_k$ have been processed.*

## V-set example

[Figure: the binary de Bruijn graph of order 3, with vertices 000, 001, 010, 011, 100, 101, 110, 111.]

## V-set example

[Figure: the vertices 000, 010, 110, 111, one from each cycle of $PCR_3$.]
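The decycling property of a candidate set can be checked directly on a small graph. The sketch below assumes that the four vertices 000, 010, 110 and 111 from this example form the chosen V-set for the binary graph of order 3. It removes them and uses Kahn's topological sort to test whether any cycle is left. The function name `leaves_no_cycle` is hypothetical, and this check is not part of the algorithm on the slides.

```python
from itertools import product


def leaves_no_cycle(k, removed, alphabet="01"):
    """True if the de Bruijn graph of order k minus `removed` is acyclic."""
    alive = {"".join(p) for p in product(alphabet, repeat=k)} - set(removed)
    succ = {u: [u[1:] + c for c in alphabet if u[1:] + c in alive]
            for u in alive}
    indeg = {u: 0 for u in alive}
    for u in alive:
        for w in succ[u]:
            indeg[w] += 1
    # Kahn's algorithm: repeatedly peel off vertices with no incoming edge.
    stack = [u for u in alive if indeg[u] == 0]
    visited = 0
    while stack:
        u = stack.pop()
        visited += 1
        for w in succ[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                stack.append(w)
    # Every vertex is peeled off exactly when no cycle remains.
    return visited == len(alive)


v_set_ok = leaves_no_cycle(3, {"000", "010", "110", "111"})
```

Removing only 000 and 111 is not enough, since the rotation cycle 001 → 010 → 100 → 001 survives.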

## Time complexity analysis

*There are Z(k) iterations, and each iteration finds the vertex to be added in O(k) time.*

*⇒ $O(k \cdot Z(k)) = O(|\Sigma|^k)$ in total.*

## How to find Minimum UHS?

*1. Generate a complete de Bruijn graph of order k and set l = L − k.*

*2. Find a decycling vertex set (V-set), X.*

*3. Remove X from the graph, resulting in G′.*

*4. Remove vertices from G′ and add them to S until every remaining L-long sequence is hit, using one of:*

*(i) DOCKS (ii) DOCKSany (iii) DOCKSanyX*

*5. X ∪ S is the universal hitting set we are searching for.*

## DOCKS

Define:

*D(v, i) = the number of i-long paths starting at v*

*F(v, i) = the number of i-long paths ending at v*

*⇒ T(v, l) = the number of l-long paths through v:*

$$T(v, l) = \sum_{i=0}^{l} F(v, i) \cdot D(v, l - i)$$

*· Calculate D(·, ·) and F(·, ·) for every vertex to find T(·, l).*

*· Choose the vertex with the largest T(v, l) and extract it; repeat until no l-long path remains.*
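The greedy step can be sketched as a small dynamic program. This is an illustrative sketch rather than the authors' implementation. The names `de_bruijn` and `docks_greedy`, the alphabetical tie-breaking, and the choice of {000, 010, 110, 111} as the decycling set for the binary graph with k = 3 are all assumptions made for this example. Path lengths are counted in edges, so l = L − k.

```python
from itertools import product


def de_bruijn(k, alphabet="01"):
    """Successor lists of the complete de Bruijn graph of order k."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return {u: [u[1:] + c for c in alphabet] for u in kmers}


def docks_greedy(graph, decycling_set, l):
    """Greedily remove vertices until no path with l edges remains."""
    removed = set(decycling_set)
    picked = []
    while True:
        alive = sorted(v for v in graph if v not in removed)
        # D[i][v] / F[i][v]: number of i-edge paths starting / ending at v.
        D = [{v: 1 for v in alive}]
        F = [{v: 1 for v in alive}]
        for i in range(1, l + 1):
            d = {v: sum(D[i - 1][w] for w in graph[v] if w not in removed)
                 for v in alive}
            f = {v: 0 for v in alive}
            for u in alive:
                for w in graph[u]:
                    if w not in removed:
                        f[w] += F[i - 1][u]
            D.append(d)
            F.append(f)
        # T(v, l) = sum over i of F(v, i) * D(v, l - i)
        T = {v: sum(F[i][v] * D[l - i][v] for i in range(l + 1))
             for v in alive}
        if not alive or max(T.values()) == 0:
            return picked  # no l-edge path is left
        best = max(alive, key=T.get)  # ties go to the alphabetically first vertex
        removed.add(best)
        picked.append(best)


k, L = 3, 5
v_set = {"000", "010", "110", "111"}  # assumed decycling set
extra = docks_greedy(de_bruijn(k), v_set, L - k)
uhs = v_set | set(extra)
```

The sketch recomputes D and F from scratch after every removal, which keeps it short but is wasteful for realistic k.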

## DOCKS performance(set size)

## DOCKS performance(runtime)

## DOCKS performance(memory)

## DOCKSany

Define:

*D(v) = the number of paths (of any length) starting at v*

*F(v) = the number of paths (of any length) ending at v*

*⇒ T(v) = the number of paths through v = F(v) · D(v)*

*· Calculate D(·) and F(·) for every vertex to find T(·).*

*· Choose the vertex with the largest T(v) and extract it.*
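On an acyclic graph, D(v) and F(v) satisfy simple recurrences over successors and predecessors. The sketch below assumes that the single-vertex path counts as a path, which the slides do not specify. The function name `paths_through` and the four-vertex example graph are made up for illustration; the example is what remains of the binary graph of order 3 after removing 000, 010, 110 and 111.

```python
def paths_through(dag):
    """T(v) = F(v) * D(v) for every vertex of an acyclic graph.

    dag maps each vertex to its list of successors.
    """
    preds = {v: [] for v in dag}
    for u, succs in dag.items():
        for w in succs:
            preds[w].append(u)

    d_memo, f_memo = {}, {}

    def d(v):  # paths starting at v, including the single-vertex path
        if v not in d_memo:
            d_memo[v] = 1 + sum(d(w) for w in dag[v])
        return d_memo[v]

    def f(v):  # paths ending at v, including the single-vertex path
        if v not in f_memo:
            f_memo[v] = 1 + sum(f(u) for u in preds[v])
        return f_memo[v]

    return {v: f(v) * d(v) for v in dag}


example_dag = {"100": ["001"], "001": ["011"], "101": ["011"], "011": []}
T = paths_through(example_dag)
```

The recursion is fine for a toy graph. For graphs with millions of vertices an iterative pass in topological order would be the safer choice.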

## DOCKSany performance(set size)

## DOCKSany performance(runtime)

## DOCKSany performance(memory)

## DOCKSanyX

Same calculation as DOCKSany.

*In each iteration, extract at most X vertices with the largest T(v) instead of just one.*

## DOCKSanyX performance(set size)

## DOCKSanyX performance(runtime)

## DOCKSanyX performance(memory)

