A New Framework for the Selection of Tag SNPs by

(1)

A New Framework for the Selection of Tag SNPs by

Multimarker Haplotypes

Yao-Ting Huang

¹

and Kun-Mao Chao

^2,3,4†

1

Department of Computer Science and Information Engineering National Chung-Cheng University, Chia-Yi, Taiwan

2

Department of Computer Science and Information Engineering

3

Graduate Institute of Biomedical Electronics and Bioinformatics

4

Graduate Institute of Networking and Multimedia National Taiwan University, Taipei, Taiwan

†: Corresponding Author:

Kun-Mao Chao

Department of Computer Science and Information Engineering National Taiwan University

]1 Roosevelt Rd. Sec. 4, Taipei, Taiwan

Email: [email protected] Phone: 886-2-23625336

Fax: 886-2-23628167

(2)

Abstract

This paper proposes a new framework for the selection of tag SNPs based on haplotypes instead of on a single SNP. The tag SNPs found by this framework form a set of haplotypes completely predictive of the alleles of all untyped SNPs. We refer to this problem as MTMH, which is defined as follows: given a set of SNPs, find a minimum subset of SNPs (called tag SNPs) which defines a set of haplotypes completely predictive of the alleles of all untyped SNPs. The MTMH problem is solved by dividing into three subproblems, two of which are shown to be NP-hard. Several exact and approximation algorithms are proposed to solve these subproblems. We describe a framework which integrates these algorithms and develop a program called HapTagger for finding tag SNPs. HapTagger is compared with existing methods as well as the official tagging tool (called Haploview) of the International HapMap project using a variety of real data sets. Our theoretical analysis and experimental results indicate that HapTagger consistently identifies a smaller set of tag SNPs and runs much faster than existing methods. HapTagger avoids the need of incorporating a linkage disequilibrium statistic and thus significantly improves the computational efficiency. We also present an algorithm (specific to HapTagger) for reconstructing alleles of untyped SNPs. It is worth mentioning that these predictive haplotypes selected by HapTagger can be used as signatures of recent positive selection or co- evolution. HapTagger is available at http://www.csie.ntu.edu.tw/∼kmchao/tools/HapTagger/.

Keywords: Algorithm, Haplotype, Linkage Disequilibrium, NP-hardness, SNP

1 Introduction

Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variations observed in the human population. Through recent linkage disequilibrium (LD) analysis across the entire human genome, the SNPs in proximity are shown to usually have strong correlation with each other [1, 9, 14, 17]. The correlation structure of entire genome indicates that the human chromosome can be partitioned into high LD regions interspersed by low LD regions. Within each high LD region, only a small subset of SNPs (called tag SNPs) is sufficient to be typed, whereas the alleles of

(3)

untyped SNPs can be indirectly predicted by typed tag SNPs, due to the strong correlation among them [3, 6, 21, 26, 33, 34]. In 2002, the International HapMap project is launched to characterize the LD patterns in the human genome such that this information can be used to guide the selection of tag SNPs [1, 16]. Recently, with the advent of high-throughput genotyping array (e.g., Affymetrix 500K GeneChip array), the cost of assaying tens of thousands of SNPs has been greatly reduced [22, 24].

As a consequence, genome-wide association studies using tag SNPs together with genotyping array are going to be used for studying complex genetic diseases presumably induced by multiple unknown genes throughout the genome [8, 19, 23]. In contrast to traditional association studies or linkage analysis, the genome-wide association studies using tag SNPs make no assumption on the location of disease genes and is a promising approach for discovering disease susceptibility genes of complex diseases.

However, due to the limited size of the genotyping array, it is difficult to type all tag SNPs to capture the allele of each common SNP on the human genome. Therefore, investigators are usually forced to select a subset of tag SNPs, to prioritize them, or to relax the threshold of LD [1, 10]. But these approaches often sacrifice the statistical power in subsequent association studies or analysis.

In addition, due to the incompleteness of the HapMap data, less common SNPs (e.g., minor allele frequencies less than 5%) which may induce the disease are usually ignored and not captured by current tag SNPs selection programs. To capture these less common SNPs, it is expected that more tag SNPs have to be used. Moreover, a portion of tag SNPs may not be always successfully typed and these missing data may greatly decrease the power of using tag SNPs [34]. To avoid the influence from missing data, it has been shown that additional tag SNPs have to be included into the solution [6, 20]. As a consequence, sophisticated methods for reducing the number of tag SNPs are still highly demanded.

A number of methods have been proposed to identify the minimum tag SNPs using different criteria. Most methods are mainly based on the pairwise LD between two diallelic SNPs [2, 5, 17, 28].

However, these pairwise-based methods tend to produce numerous tag SNPs having little or no correlation with untyped SNPs. The singleton tag SNPs (i.e., SNPs having no correlation with

(4)

others) can even account for more than 50% in their solutions [17]. A few initial studies overcome the limitation of pairwise LD between two diallelic SNPs by further considering the multiallelic LD between a diallelic SNP and a multimarker (multiallelic) haplotype [3, 23, 30, 32]. These approaches can reduce the number of tag SNPs or increase the statistical power but come at the cost of heavy computational overhead, due to the exponential number of possible haplotypes to be tested. Re- cently, de Bakker et al. use a peel-back approach and a multiallelic LD statistic for selecting tag SNPs. The developed program is incorporated into Haploview, which is the official tagging tool used in the International HapMap project [4]. However, Haploview is still quite inefficient, because the number of possible haplotypes to be tested by the LD statsitic still grows exponentially with respect to the number of SNPs. As a consequence, Haploview has to make several restrictions to gain efficiency (e.g., test at most 10,000 haplotypes).

In this paper, we design and implement algorithms for the selection of tag SNPs by multimarker haplotypes. In contrast to previous studies using a single tag SNP to predict the alleles of an untyped SNP, our algorithms search for tag SNPs which define a set of haplotypes completely predictive of the alleles of all untyped SNPs. Moreover, our methods do not rely on any statistic to measure the LD between tag SNPs and untyped SNPs. We start by studying a problem called multimarker haplotype tagging by perfect LD (MHTP), which is defined as follows: given a set of SNPs and a target SNP to be replaced, find a minimum length haplotype completely predictive of alleles at the target SNP. We prove that the MHTP problem is NP-hard and give an approximation algorithm.

This algorithm is used as a subroutine for solving the main problem studied in this paper (referred to as MTMH), which is defined as follows: given a set of SNPs, find a minimum set of tag SNPs which defines a set of haplotypes completely predictive of the alleles of all untyped SNPs. The MTMH problem is solved by dividing into three subproblems, two of which are shown to be NP-hard. Several exact and approximation algorithms are developed to solve these subproblems and their extension for tolerating missing data is also presented. We integrate these algorithms and develop a program called HapTagger for finding tag SNPs by multimarker haplotypes. The HapTagger is compared with the pairwise LD-based approach and the official tagging tool Haploview. Our theoretical and

(5)

experimental results indicate that HapTagger consistently finds a smaller set of tag SNPs on a variety of real data sets and runs much faster than existing methods. The efficiency of various LD statistic and the comparison of distinct methods for reconstructing untyped SNP alleles are also discussed in this paper. It is worth mentioning that these predictive haplotypes selected by HapTagger can be used as the signature of recent positive selection or co-evolution.

2 Algorithms for the Selection of Predictive Haplotypes

In this section, we study the MHTP problem, which aims to find a minimum length haplotype completely predictive of the alleles at a target SNP to be replaced. The algorithm introduced in this section is used as a subroutine for solving the MTMH problem studied in the next section.

Informally speaking, we seek for a haplotype which is always observed together with some alleles at a target SNP. We first formulate the MHTP problem, show the hardness of this problem, and finally give an approximation algorithm.

Formulation and Hardness of the MHTP Problem

Given a k ∗ (n + 1) haplotype matrix, where k is the number of haplotypes and (n + 1) is the number of SNPs. Denote C = {S1, S2, ..., Sn} as the set of n SNPs and ST ∈ C as the target SNP to be/ replaced. Let H = {h₁, h₂, ..., h_k} denote these k haplotypes, where h_i= {0, 1, x}ⁿ⁺¹, and 0, 1, or x denote that the allele at this SNP locus is the major type, minor type, or missing data, respectively.

Note that for unphased genotypes, the phased haplotypes can be inferred by a number of existing methods [21, 27, 29]. The MHTP problem aims to find a minimum length haplotype which is defined by a subset of SNPs in C and is predictive of the alleles of SNP ST. Figure 1 illustrates an example for the MHTP problem. There are five haplotypes (i.e., h1, ..., h5) composed by six SNPs and the target SNP to be replaced is S6. In this example, the haplotype (0, 0) defined by SNPs S1and S3is predictive of SNP S6, because haplotype (0, 0) perfectly co-occurs with all minor alleles at SNP S6

(but never with the major allele). As a consequence, only SNPs S1 and S3 only need to be typed for predicting the alleles at SNP S6. That is, if haplotype (0, 0) is observed at these two SNPs from

(6)

a testing sample, we can predict that this sample contain minor allele at SNP S6. Otherwise, the allele is predicted as the major type. We would like to note that the input may not always contain a feasible solution (i.e., no haplotypes can replace the target SNP). Then the target SNP will be selected as the tag SNP in our final solution (see Section 3).

In this paper, we say that this sort of haplotypes has “perfect LD” with the target SNP, which is close to the definition of perfect LD between two diallelic SNPs (i.e., r²= 1). A formal definition of the MHTP problem is given below.

Input: a k ∗ (n + 1) haplotypes matrix, where C = {S1, S2, ..., Sn} represents a set of n SNPs and ST ∈ C is the target SNP to be replaced./

Output: a minimum subset of SNPs C⁰ ⊂ C which defines a haplotype having perfect LD with SNP ST.

For the example in Figure 1, the set of SNPs C⁰ = {S1, S2, S3} is one feasible solution for MHTP but the minimum solution is C⁰= {S1, S3}, because the haplotype (0, 0) composed of SNPs S1 and S3perfectly co-occurs with all minor alleles at SNP S6. In this paper, we also say that a set of SNPs C⁰ can replace a SNP ST if there exists a haplotype defined by SNPs in C⁰ having perfect LD with SNP ST.

Theorem 1. The MHTP problem is NP-hard.

Proof. We make a reduction from an NP-hard problem called Set Cover [15] to the MHTP problem

with a reduction technique similar to Bafna et al [3]. The set covering (SC) problem is defined as given a collection C of subsets of k elements, find a minimum subcollection C⁰⊂ C (called set cover) such that each element appears in at least one subset of C⁰. Given an instance of the SC problem, a collection C = {C1, C2, ..., Cn} of elements {E1, E2, ..., Ek}, we construct an instance of the MHTP problem by first creating k haplotypes (h1 to hk) with n SNPs (S1 to Sn). Each element Ei ∈ Cj

produces a major allele on haplotype hiat SNP Sj and the remaining positions are all minor alleles.

Then we construct an additional haplotype hk+1with all minor alleles from SNPs S1to Sn. Finally,

(7)

we construct the target SNP to be replaced Sn+1with major alleles only occurred in haplotypes h1

to hk and one minor allele occurred in haplotype hk+1.

Example. Given an instance of the SC problem C = {C1, C2, C3} over elements

{1, 2, 3, 4, 5}, where C1 = {1, 2, 5}, C2 = {3, 4}, C3 = {1, 4}, we construct 6 haplo- types composed of 4 SNPs, where h1 = (0, 1, 0, 0), h2 = (0, 1, 1, 0), h3 = (1, 0, 1, 0), h4= (1, 0, 0, 0), h5= (0, 1, 1, 0), and h6= (1, 1, 1, 1),

Note that the additional haplotype (i.e., hk+1) contains all minor alleles at all SNP loci and the SNP to be replaced (i.e., Sn+1) has only one minor allele occurred in hk+1. Thus, the haplotype hk+1

defined by SNPs S₁ to S_n has perfect LD with the minor allele at S_n+1. For the above example, the haplotype (1,1,1) defined by the first three SNPs indicates the occurrence of minor allele at SNP S4

and thus SNP S4 can be replaced by haplotype (1,1,1). We then show that the minimum set cover for SC implies a minimum subset of SNPs which can replace SNP Sn+1for MHTP, and vice versa.

Example. For the sample example, the minimum set cover for SC is C⁰ = {C1, C2}, which covers all elements E = {1, 2, 3, 4, 5}. On the other hand, the minimum subset of SNPs which can replace SNP S4in the MHTP problem is C^∗= {S1, S2}, since haplotype (1, 1) defined by SNPs S₁and S₂can indicate the occurrence of minor allele at SNP S₄ (i.e., haplotype (1, 1) has perfect LD with SNPs S4).

Consequently, there exists a set cover of size k for the SC problem if and only if there exists a subset of SNPs of size k which can replace SNP Sn+1for the MHTP problem. Therefore, MHTP is NP-hard.

An Approximation Algorithm for the MHTP Problem

In this subsection, we describe an approximation algorithm which solves MHTP by first removing SNPs impossible to form a solution, reducing to an existing NP-hard problem, introducing a greedy algorithm, and finally presenting its extension for handling missing data. To simplify the presenta- tion, the described algorithm focuses on identifying a haplotype which perfectly co-occurs with all

(8)

minor alleles at SNP ST. The algorithm for capturing major alleles is similar. This algorithm starts by removing SNPs impossible to form the solution using the following two steps.

Step 1. Identify the subset of haplotypes HT ⊂ H in which the minor alleles are observed at the target SNP ST. The set of haplotypes containing the major alleles for SNP ST is denoted as H¯T. For the example shown in Figure 1, HT = {h1, h2} and ¯HT = {h3, h4, h5}.

Step 2. Identify the set of SNPs C⁰ ⊆ C which are consistent with SNP ST. A SNP is said to be consistent to SNP ST if either only major alleles or minor alleles (but not both) are observed in haplotypes in HT. On the other hand, a SNP is said to be inconsistent with SNP ST if it contains mixed major and minor alleles observed in haplotypes of HT. For the example shown in Figure 1, SNP S4is inconsistent with SNP S6because one major and one minor alleles are both observed in haplotypes h1 and h2, respectively.

The consistent SNPs in C⁰ have either all major or all minor alleles observed in haplotypes in HT. Consequently, the haplotypes in HT defined by these consistent SNPs have only one pattern (e.g., {0,0,0} in Figure 1). Then, for each consistent SNP Si∈ C⁰, we identify the set of haplotypes Hi ⊂ ¯HT in which the observed allele is complementary to those observed in HT. For the example shown in Figure 1, H1= {h3, h5}, H2= {h3}, and H3= {h4}. The algorithm proceeds by adopting a greedy method which iteratively selects a SNP Si ∈ C⁰ that can incur complementary alleles for most haplotypes in ¯HT (i.e., max{|Hi∩ ¯HT|}), until all haplotypes in ¯HT contain at least one allele complementary to those in HT. After running this algorithm, the haplotypes in HT defined by SNPs in C⁰ perfectly co-occur with all minor alleles at SNP ST, and all haplotypes in ¯HT are distinguished from those in HT. For example, in Figure 1, the SNPs S1 and S3 are selected in order by this algorithm. If no feasible solutions exist, this algorithm will return a null symbol which implies no SNPs can replace the target SNP. The running time of this algorithm is bounded by O(nk²), where n is the number of SNPs and k is the number of haplotypes. A pseudo code of this algorithm (referred to as MHTagger) is given below.

(9)

Algorithm: MHTagger(C, ST)

1 Construct HT and ¯HT containing minor and major alleles for SNP ST, respectively.

2 Identify the set of SNPs C⁰⊆ C which are consistent with SNP ST.

3 For each SNP S_i∈ C⁰, construct H_i⊂ ¯H_T containing alleles complement to those in H_T. 4 R ← φ

5 while ¯HT 6= φ and C⁰6= φ do

6 Let Sj be the SNP Si∈ C⁰ that maximizes |Hi∩ ¯HT|.

7 H¯T ← ¯HT − Hj

8 C⁰← C⁰− Sj

9 R ← RS

Sj

10 end of while 11 if ¯HT 6= φ

12 return φ

13 else

14 return R

Theorem 2. The MHTagger algorithm gives a solution within a factor of O(log k) of the optimal solution.

Proof. Note that Lines 1-3 reduce MHTP to an instance of the set-covering (SC) problem [15] and

Lines 4–11 solve the instance of SC by a greedy algorithm. The greedy algorithm for solving the SC problem has been shown to have O(log n) approximation [7], where n is the number of elements to be covered. The number of elements (to be covered in the SC problem) corresponds to the number of haplotypes in ¯HT in the MHTP problem, where | ¯HT| < |H| = O(k). Therefore, the MHTagger algorithm also gives a solution of O(log k) approximation for the MHTP problem.

Extension for Handling Missing Data

In reality, a portion of SNPs may not be always typed successfully and these missing SNPs can greatly reduce the power of using tag SNPs for association studies. For the example shown in Figure 1,

(10)

although the minimum solution is C⁰= {S1, S3}, we would fail to predict the allele at SNP S6if any of the two SNPs is missing. As pointed out previously [20], the negative effects from missing data can be avoided by selecting a slightly larger set of SNPs for genotyping. Consequently, we extend the MHTagger algorithm for tolerating a fixed amount of missing SNPs, because the missing rates of the genotyping array is usually limited (<∼10%). However, there is a tradeoff between the number of tag SNPs and ability of tolerating missing data. In the following, we use the strict requirement which guarantees that if up to m tag SNPs are missing, it has no effects on predicting the alleles at the target SNP. With loose requirement, the number of tag SNPs can be reduced, but then we will not be able to make the correct prediction in all circumstances. The detailed discussion of tolerating missing data can be found in [20]. An extended definition of the MHTP problem for tolerating missing data is given below.

Input: a k ∗ (n + 1) haplotypes matrix, where C = {S1, S2, ..., Sn} represents a set of n SNPs and ST ∈ C is the target SNP to be replaced; denote m as the number of missing SNPs to be/ tolerated.

Output: a minimum subset of SNPs C⁰ ⊂ C which defines a haplotype having perfect LD with SNP ST, even when up to m SNPs in C⁰ are missing.

The algorithm is briefly described below. Recall that HT and ¯HT are two sets of haplotypes containing the minor or major alleles of SNP ST, respectively. After Line 3 in the MHTagger algorithm, all haplotypes in HT have the same pattern. The remaining steps (i.e., Lines 5-10) are revised for finding a minimum set of SNPs which defines a haplotype pattern having Hamming distance at least (m+1) with each haplotype in ¯HT, whereas the original algorithm only requires the Hamming distance to be at least one. Note that when m SNPs are missing, the Hamming distance between haplotypes in HT and ¯HT decreases at most m and thus is at least equal or greater than one. Therefore, the haplotypes in H_T can still be distinguished from all haplotypes in ¯H_T, which still satisfies the requirement of perfect co-occurrence with all minor alleles at SNP ST.

(11)

3 Algorithms for the Selection of Tag SNPs by Predictive Haplotypes

In this section, we study the problem of MTMH defined as follows: given a set of SNPs, find a minimum set of tag SNPs which defines a set of haplotypes completely predictive of the alleles of all untyped SNPs. The MTMH problem is divided into three subproblems which are separately solved in the following three stages: (1) find a minimum set of tag SNPs based on pairwise perfect LD between diallelic SNPs; (2) for each of the found tag SNP, identify a minimum length haplotype having perfect LD with the tag SNP by solving the MHTP problem; (3) select a minimum subset of tag SNPs which defines a set of haplotypes completely predictive of alleles of all removed tag SNPs.

In the first stage, we describe a linear-time algorithm for finding a minimum set of tag SNPs based on pairwise perfect LD. The second stage iteratively solves the MHTP problem by setting each tag SNP as the target SNP to be replaced and running the MHTagger algorithm to find a haplotype predictive of the alleles at the target SNP (see Section 2). The last stage is shown to be another NP-hard problem and two algorithms are presented.

Stage 1: Finding a Minimum Set of Tag SNPs by Pairwise Perfect LD

The first stage of our algorithm solves the problem of finding a minimum set of tag SNPs based on pairwise perfect LD between diallelic SNPs, which is defined as follows: given a set of SNPs find a minimum subset of SNPs (called tag SNPs) such that each untyped SNP has perfect LD with some tag SNP. A generalization of this problem with arbitrary LD setting (non-perfect LD) has been shown to be NP-hard and numerous methods have been proposed [2, 5, 28]. Existing methods usually take O(n²k) time, where n is the number of SNPs and k is the number of haplotypes, due to the need of computing LD (r²) between all pairs of SNPs. We observe that SNPs in perfect LD usually have identical 0/1 (major/minor alleles) encoding in all haplotype samples. Instead of explicitly computing r²for all pairs of SNPs, we consider SNPs with identical encoding to be perfect LD. Although this looks like a more stringent requirement, our experimental results indicate that

(12)

the solution found by this method is the same as those found by other programs based on explicitly evaluating r² = 1 (see Table 1, Section 4). This is mainly due to the sufficiently large sample size in real data sets, which lead to different frequencies of major and minor alleles at each SNP. Thus, any two SNPs with r²= 1 have identical allele pattern when encoded into 0/1 representation. For example (see Figure 2), SNPs S1and S2are in perfect LD and they contain the same allele pattern (0,0,1,0) observed at all haplotypes. Note that this stringent requirement satisfies various definitions of perfect LD (e.g., r²= 1, D⁰= 1, or no-four-gamete property).

The algorithm (FastPerfectLD ) intrinsically uses a technique similar to the bucket sorting [7]

to divide SNPs into bins of perfect LD and then select a tag SNP from each bin, which is briefly described below. The FastPerfectLD algorithm starts by scanning the first haplotype (e.g., h₁) and divide these SNPs into two groups according to the major or minor alleles observed at this haplotype.

The algorithm recursively divides SNPs in each group into subgroups according to the major and minor alleles observed in the next haplotype, until all haplotypes are tested. The black nodes in Figure 2 illustrates the intermediate groups of SNPs during the execution of this algorithm. After finishing testing all haplotypes, each resulting group (i.e., the white node in Figure 2) stands for a bin of SNPs having perfect LD with each other, whereas the SNPs in distinct bins do not have perfect LD. As a result, we can select any SNP from each bin as the tag SNP and construct a set of tag SNPs as the solution. Note that if we wish to tolerate m missing tag SNPs, we can select arbitrary m + 1 SNPs from each bin as the solution. This algorithm only needs to test at most n SNPs for each of the k haplotypes. The running time of FastPerfectLD is thus bounded by O(nk).

Stage 2: Identifying a Minimum Length Haplotype for Replacing Each Tag SNP

The input of the second stage is the tag SNPs found in the first stage. The subproblem solved in the second stage is defined as follows: given a set of tag SNPs C = {S1, ..., Sn}, for each tag SNP Si

(1 ≤ i ≤ n), find a minimum set of SNPs from C − Si which defines a haplotype having perfect LD with the tag SNP Si. In other words, this stage iteratively solves the MHTP problem (see Section 2)

(13)

by setting each of the n SNPs as the target SNP to be replaced. We apply the MHTagger algorithm for solving MHTP to identify a subset of tag SNPs for replacing the target tag SNP Si. If the MHTagger subroutine returns no feasible solution for a target SNP Si, this irreplaceable tag SNP will be included in the final solution. Let Ri be the subset of tag SNPs found by the MHTagger algorithm which can replace tag SNP Si. We formulate the dependency of replacement among all SNPs as a directed graph called “replacement graph” (Figure 3). The vertices in the replacement graph represent each SNP and vertex Si is connected to vertex Sj with a directed edge if Si ∈ Rj. That is, SNP Si is in the set of tag SNPs which can replace SNP Sj.

The replacement graph is the output of the second stage. Note that the replacement graph may contain cycles (e.g., S₄, S₅, and S₆ in Figure 3). The following lemma describes an additional property of the replacement graph.

Lemma 1. Each vertex in the replacement graph has at most k − 1 incoming edges.

Proof. Recall that the incoming edges in the replacement graph are resulted from the output of

running the MHTagger algorithm. The worst case of the MHTagger algorithm takes place when the target SNP to be replaced ST has only one haplotype carrying the minor allele at this locus, and all other haplotypes carrying major alleles. Thus, we have to select a set of SNPs which produces complementary alleles for remaining (k − 1) haplotypes. Note that the greedy manner of MHTagger guarantees that each selected SNP incurs at least one complementary allele for those (k − 1) haplotypes. Therefore, MHTagger outputs at most k − 1 SNPs as the solution. As a consequence, there are at most k − 1 incoming edges produced for each vertex in the replacement graph.

Stage 3: Reserving a Minimum Subset of Tag SNPs Based on the Replace- ment Graph

The input of the last stage is the replacement graph produced in the second stage. Denote the set of all tag SNPs in the replacement graph as C. The replacement graph gives us the information as to which tag SNPs in C can be replaced. Hence we can select a minimum subset of tag SNPs

(14)

C⁰⊆ C such that the alleles of each removed tag SNP (i.e., C − C⁰) can be predicted by a haplotype defined by tag SNPs in C⁰. However, not all SNPs can be safely removed because tag SNPs of these removed SNPs may be also removed (e.g., SNPs S4, S5, and S6in Figure 3). That is, the alleles of these dependent SNPs can not be completely reconstructed if all of them are removed from the final solution. The Haploview resolves this problem by sequentially removing a tag SNP on the basis of the remaining SNPs in a peel-back manner [10]. However, the tag SNPs removed in the early stage could be used to replace more tag SNPs, and this global dependent relation is not considered. In the last stage, we introduce an improved algorithm which considers the overall dependency among all tag SNPs and selects a smaller set of tag SNPs based on the replacement graph as the final solution.

In the following, we describe two lemmas regarding the set of tag SNPs which can be safely removed.

Lemma 2. A tag SNP with incoming edges can be safely removed if it is not contained within a cycle in the replacement graph.

Proof. A tag SNP with incoming edges in the replacement graph implies that there exist some other

tag SNPs which can replace it. For example, SNP S1 in Figure 3 can be directly removed from the final solution since it can be replaced by SNPs S2 and S3. On the other hand, if the tag SNP is contained in a directed cycle, it can not be safely removed, because each SNP is dependent on others in the cycle for predicting its alleles. For example, SNPs S₄, S₅, S₆ form a directed cycle. If all of them are removed, we can not reconstruct the alleles of these SNPs even though we type all other tag SNPs.

Lemma 3. For each cycle, only one tag SNP needs to be kept while the other tag SNPs in this cycle can be safely removed, if they are not contained in other cycles.

Proof. If a tag SNP within a cycle is kept, we can remove its incoming edges from the graph since the

allele of this SNP will be directly typed and known. Therefore, the cycle can be broken and becomes a simple path, if the remaining tag SNPs are not contained in other cycles. By Lemma 2, the remaining tag SNPs in this simple path can now be safely removed since all of them have incoming edges and are not contained in any cycle.

(15)

By Lemmas 2 and 3, we have to reserve at least one tag SNP in each cycle and to remove its incoming edges from the replacement graph. Note that the outgoing edges cannot be removed.

Otherwise, we will fail to reconstruct the alleles of untyped tag SNPs. Recall that MTMH asks for a minimum set of tag SNPs as the final solution. Therefore, the last stage is solving a problem (referred to as MTSR) defined as follows: given a replacement graph, find a minimum set of vertices such that the removal of their incoming edges breaks all cycles in the replacement graph.

Theorem 3. The MTSR problem is NP-hard.

Proof. Without loss of generality, we assume that the number of incoming edges of each vertex in

the replacement graph is bounded by an integer k (see Lemma 1). We make a reduction from a variant of the vertex cover problem referred to as k-VC [15, 25]. The k-VC problem is known to be NP-hard and is defined as follows: given a graph G=(V,E) with degrees bounded by an integer k ≥ 3, find a minimum subset of vertices V⁰⊆ V (called vertex cover) such that each edge (u, v) ∈ E

has at least one of u and v belonging to V⁰.

Given an instance of the k-VC problem, we construct a new graph ˜G = (Uv ∪ Ue, ˜E), where vertices of Uv = V correspond to original vertices of G and vertices of Ue correspond to each edge of G. For each edge e = (v1, v2) in G, we construct three edges in ˜G which form a directed cycle: an edge from (v1 to v2), an edge from v2to e, and an edge from e to v1. Note that since the degree of each vertex in G is bounded by k, there are at most k directed cycles created for each vertex in ˜G, which produces at most k incoming edges. It is easy to observe that a vertex cover in G implies a set of vertices in ˜G which can break all cycles by removing their incoming edges, because each edge in G corresponds to one cycle in ˜G. Therefore, G has a minimum vertex cover of size c if and only if ˜G has a minimum set of vertices of size c such that the removal of their incoming edges breaks all cycles. In summary, MTSR is NP-hard.

The MTSR problem can be solved by reducing to an NP-hard problem called “minimum feedback vertex set” (MFVS) [15]. The MFVS problem is defined as given a directed graph G = (V, E), find a minimum subset of vertices V⁰⊆ V such that V⁰contains at least one vertex for every directed cycle in G. Let V⁰ be a minimum solution of the MFVS problem. Note that each vertex in a directed

(16)

cycle of G has one incoming and one outgoing edges both contained in this cycle. The removal of incoming edges of all vertices in V⁰ can also break all cycles in G, which implies that V⁰ is also a minimum solution of MTSR. Therefore, MTSR can be solved by applying existing algorithms for MFVS.

The best approximation algorithm for the MFVS problem gives a solution within a factor of O(log v ∗ log log v) of the optimal solution, where v is the number of vertices [13]. However, it

requires solving a linear-programming instance with exponential number of constraints, which is impractical when applying on genome-wide data sets with millions of SNPs. To efficiently break all cycles in the replacement graph, we turn to solve a relaxed problem which asks for a minimum subset of vertices such that the removal of their incoming edges eliminates all back edges in the replacement graph. An edge (u, v) connecting from vertex u to vertex v is said to be a back edge if vertex v is the ancestor of vertex u in the depth-first-search (DFS) traversal of the graph [7]. Note that the DFS traversal in a graph produces back edges if and only if the graph has cycles, which implies that all cycles can be indirectly broken by removing all back edges. Consequently, the solution of this relaxed problem is a feasible solution to MTSR.

This relaxed problem can be solved in polynomial time since each back edge (u, v) uniquely corresponds to one incoming edge of the vertex v and we require that only incoming edges of a vertex can be removed. In the following, we describe an algorithm which removes all back edges in the replacement graph by revising the DFS algorithm. During the DFS traversal, all incoming edges of a vertex v are removed once a back edge (u, v) connecting from a descendant vertex u to an ancestor vertex v is found. Therefore, the removal of incoming edges of vertex v eliminates the back edge (u, v) and indirectly breaks cycles associated with this back edge. We repeat this process until all back edges in this replacement graph are found and removed.

We integrate this algorithm with the FastPerfectLD and MHTagger algorithms introduced in previous subsections and develop a program called HapTagger to solve all subproblems of MTMH in three stages. After the last stage, the set of vertices in the replacement graph without incoming edges indicate those tag SNPs of the output. A pseudo code of HapTagger is given below.

(17)

Algorithm: HapTagger(C)

1 Run FastPerfectLD to find a minimum set of tag SNPs C⁰⊆ C

2 Construct a replacement graph G with vertices corresponding to SNPs in C⁰ 3 for each SNP Si∈ C⁰

4 Run MHTagger(C⁰− Si, Si) to obtain a set of tag SNPs Ri which can replace Si

5 Add directed edges connecting from vertices in Ri to Si in the replacement graph G 6 end of for

7 for each vertex Si in G

8 Conduct a DFS traversal starting from vertex Si

9 if a back edge (u, v) is found during the traversal 10 remove all incoming edges of vertex v from G 11 end of for

12 Identify the set of vertices T in G without incoming edges 13 Return T

The time complexity of HapTagger is analyzed as follows. Line 1 is bounded by the FastPerfectLD algorithm which takes O(nk). Lines 3-6 take O(n²k²) time since for each of the n SNPs, we have to run the MHTagger algorithm which takes O(nk²). Lines 7-11 is bounded by the time of running DFS to traverse the entire graph, which takes O((V + E)) = O(n²) time, where V and E are the numbers of vertices and edges, respectively. Therefore, the entire HapTagger algorithm runs in O(n²k²) time.

4 Experimental Results

We implement the HapTagger algorithm in JAVA for finding tag SNPs by multimarker haplotypes.

HapTagger is freely available on http://www.csie.ntu.edu.tw/∼kmchao/HapTagger/. Due to the inefficiency of pairwise-LD based methods, the FastPerfectLD algorithm in Section 3 is separately implemented and used as a reference for the solutions of pairwise-LD based approaches. These pro-

(18)

grams along with the official tagging tool Haploview [10] used by the International HapMap project are tested on a variety of real data sets. The Haploview can identify tag SNPs in two modes: pairwise or aggressive modes. The Haploview in pairwise mode finds tag SNPs only based on pairwise LD between two diallelic SNPs. The Haploview in aggressive mode identifies tag SNPs by first finding tag SNPs using pairwise LD and then reduce the tag SNPs using a peel-back approach with a multiallelic LD statistic. In the following experiments, the minimum LD threshold required for Haploview in both modes is set to 1.0. We download the phased haplotype data from HapMap and choose the population of Utah residents with ancestry from northern and western Europe (CEU) as our experimental target [1]. The following experiments are conducted under the hardware environment with Pentium 3.2GHz CPU and with 8GB RAM.

Experiments on HapMap ENCODE Data Sets

We first test these programs on ten ENCODE data sets (corresponding to ten 500-kilobase regions) resequenced and genotyped in the HapMap project. Each data set contains 180 haplotype samples originated from 30 CEU trios. Table 1 lists the number of tag SNPs found by HapTagger, FastPerfectLD, and Haploview (in pairwise or aggressive modes) on each ENCODE data set. The Haploview in aggressive mode fails to output a solution in most data sets within a reasonable period of time (e.g., longer than ten days). The results indicate that HapTagger consistently finds a smaller set of tag SNPs with size less than half of other programs. The FastPerfectLD and Haploview in pairwise mode identify the same number of tag SNPs in all data sets. Recall that FastPerfectLD requires that SNPs in perfect LD have identical allele pattern, which is more stringent than the requirement of r²= 1 used by Haploview in pairwise mode. However, the results indicate that this stringent requirement produces the same number of tag SNPs as Haploview in pairwise mode. This phenomenon is mainly due to the sufficiently large samples in HapMap data and SNPs with r²= 1 all have identical allele pattern when encoded into 0/1 representation. The Haploview in aggressive mode outperforms FastPerfectLD and pairwise mode as expected, because it further refines the solution (i.e., the solutions of pairwise mode) by multimarker haplotypes. However, the improvement

(19)

is not significant in comparison with HapTagger. The reason is that Haploview has several default constraints (e.g., allowing up to three SNPs in a haplotype and testing at most 10,000 haplotypes) which prevent it from finding a better solution. It should be noted that the Haploview in aggressive mode fails to output a solution in all data sets within a reasonable period of time when we relax all of its default constraints.

In terms of efficiency, FastPerfectLD is the fastest program because of the internal linear-time algorithm. The HapTagger is able to output a solution in several seconds, whereas Haploview in pairwise mode requires a bit longer time (from four to six minutes). Although the theoretical time complexity of these two programs are the same (i.e., O(n²)), the HapTagger internally employs the FastPerfectLD algorithm to group SNPs in perfect LD and avoid the heavy computation of r²values for each pair of SNPs. Thus, it is slightly faster than Haploview in pairwise mode in practice. The Haploview in aggressive mode is the slowest program which takes at least four days for outputting a solution.

Experiments on HapMap Chromosome Data Sets

We then test these programs on a number of large genome-wide data sets. We download 22 phased haplotype data sets corresponding to human Chromosome 1 to Chromosome 22 from the Phase I release of HapMap data. The Haploview in aggressive mode fails to output a solution for all data sets within a reasonable period of time. The Haploview in pairwise mode also fails to output a solution due to out-of-memory error when we relax all the default constraints but is able to output a solution when all default constraints are retained. Thus we only report the results of Haploview in pairwise mode with its default constraints retained. The Haploview in pairwise mode takes one to four hours to finish processing each data set. The HapTagger returns a solution from one to two hours for each data set. The FastPerfectLD is the fastest program which returns a solution only in several minutes on all data sets.

Table 2 lists the number of tag SNPs found by each program. The HapTagger consistently outperforms Haploview and FastPerfectLD on all data sets as expected, since it further considers

(20)

the LD between a diallelic SNP and a multimarker haplotype instead of only pairwise LD between diallelic SNPs. The FastPerfectLD also consistently outperforms Haploview in pairwise mode since its solution is not restricted by any constraint. The number of tag SNPs found by HapTagger is less than half of the numbers outputted by Haploview or FastPerfectLD. In summary, the HapTagger only requires 23% of original SNPs to be typed as tag SNPs, whereas the tag SNPs identified by Haploview and FastPerfectLD account for roughly 60% of original SNPs. Most tagging programs reduce the number of tag SNPs by relaxing the threshold of LD (e.g., r² ≥ 0.8). It is worth mentioning that HapTagger not only runs faster but also reduces the number of tag SNPs under the requirement of perfect LD.

5 Discussion

Reconstruction of Alleles of Untyped SNPs with HapTagger

In previous methods, the alleles of an untyped SNP can be directly reconstructed by a typed tag SNP. But in HapTagger, the alleles of an untyped SNP have to be reconstructed in a more complex manner. It is because each untyped SNP is now predicted by a haplotype instead of by a single tag SNP, and the predictive haplotype itself may also contain partial untyped SNPs. In other words, we have to resolve the dependency among all SNPs in order to reconstruct the alleles of all untyped SNPs. Nevertheless, we can simply apply the algorithm of topological sorting [7] to obtain the dependency ordering among all SNPs based on the replacement graph introduced in Section 3.

Given an acyclic directed graph, the topological sorting algorithm sorts a vertex Si precedent to a vertex Sj if there is a directed edge from Si to Sj and finally gives a linear ordering of these vertices. Note that the replacement graph is also acyclic because we have broken all cycles after the last stage. Consequently, we can reconstruct alleles of all untyped SNPs one by one by following the linear ordering of these SNPs, which takes O(n²) time.

(21)

An Improved Algorithm for Breaking Cycles in the Replacement Graph

The previous algorithm for breaking cycles by removing all back edges may fail to obtain the optimal solution. Figure 4(A) illustrates an example in which the previous algorithm may not perform well.

If the DFS traversal starts from vertex S1 (instead of vertex S5), we would obtain four back edges (i.e., b1, b2, b3, and b4) and use four SNPs (i.e., S1, S2, S3, and S4) to remove all back edges.

However, the optimal solution is the SNP S5since the removal of its incoming edge also breaks all cycles in this graph, even though this edge is not a back edge.

In order to overcome the limitation of only removing back edges, we consider the removal of other edges which can also break cycles. Although we do not explicitly enumerate all cycles in the replacement graph, each back edge is implicitly associated with some cycles and the removal of this back edge can break these associated cycles. The following lemma indicates that the cycles associated with one back edge can be broken by removing incoming edges of either vertex on this back edge.

Lemma 4. For a directed back edge (u, v), the removal of incoming edges of either vertex u or of vertex v breaks the same set of cycles associated with this back edge.

Proof. Denote the set of cycles broken by removing the back edge (u, v) as Cu,v. The removal of incoming edges of v removes the back edge (u, v) and thus breaks all cycles in Cu,v. Note that each cycle in Cu,v must contain an edge which is also the incoming edge of vertex u, since it has to pass through vertex u to v. As a consequence, the removal of incoming edges of vertex u also breaks all cycles in Cu,v.

By Lemma 4, we can select any of the two vertices on each back edge and remove its incoming edges to break cycles associated with this back edge. Denote the set of vertices at two ends of all back edges as C = {S1, ..., Sn} (Figure 4(B)). The problem is redefined as follows: given a set of back edges B = {b1, ..., bm} discovered during DFS traversal of the replacement graph, find a minimum set of vertices C⁰ ⊆ C such that C⁰ contains at least one vertex from either end of a back edge. The removal of all incoming edges of vertices in C⁰can thus break all cycles in the replacement

(22)

graph. However, this problem becomes NP-hard, which can be easily shown by a reduction from the k-VC problem similar to the proof in Theorem 3. On the positive side, this problem is just an instance of the set covering problem which asks for a minimum subcollection C⁰ ⊆ C such that each element in B is covered by at least one set in C⁰. Therefore, we can employ a typical greedy algorithm which iteratively selects a vertex shared by most back edges, until all back edges have at least one vertex (from either end) selected. For example, in Figure 4, only SNP S5 is selected by this greedy approach as the solution. Furthermore, it is easy to observe that each bi ∈ B appears in exactly two sets in C corresponding to its two end vertices. Therefore, this is a restricted version of the set covering problem with each element in B appears in two sets in C, which is shown to be APX-hard [25] and can be approximated within a factor of 2 of the optimal solution [18].

Efficiency of Various LD Statistic

A number of measures for computing the LD between two diallelic SNPs have been widely used for the selection of tag SNPs (e.g., r², D⁰, or four-gamete property) [5]. On the other hand, only a few studies consider the LD between a diallelic SNP and a multiallelic haplotype for selecting tag SNPs (e.g., multiallelic D⁰or the relative information [11]). One major difference between these two directions is the number of tests required for obtaining a predictive SNP or a predictive haplotype.

For example, on the basis of LD between diallelic SNPs, we can obtain a SNP which is predictive of another SNP ST by computing the correlation coefficient (r²) between SNP ST and all other SNPs, which takes O(n) time, where n is the number of SNPs. On the other hand, to obtain a haplotype predictive of SNP ST, one has to compute the multiallelic LD statistic between a SNP and all possible haplotypes. However, the number of all possible haplotypes grows exponentially with respect to the number of SNPs, because a haplotype can be composed by arbitrary combination of SNPs. On the contrary, HapTagger implicitly estimates the multiallelic LD between a SNP and a haplotype using a combinatorics approach but does not rely on any explicit LD statistic. The major advantage of our approach is that approximation algorithms are allowed for efficiently finding the predictive haplotypes. As indicated by our theoretical and experiment results, HapTagger runs

(23)

much faster than other methods since it avoids the test of exponential number of haplotypes.

Signatures of Positive Selection or Co-evolution

Recent large-scale analysis of recent positive selection using the HapMap data indicates that hu- mans are still under fast evolution [31]. The classical signature of recent positive selection is the elevating-haplotype homozygosity surrounding the favored allele at one SNP (or so called genetic hitchhiking). That is, the haplotypes flanking the favored allele at one SNP under recent positive selection usually show very low sequence diversity. Therefore, it is especially easy for HapTagger to find a haplotype predictive of alleles at one SNP under recent positive selection. As to alleles not at close loci, they might still co-evolve through the heredity due to their functional dependency in the biological pathway. Thus HapTagger is also able to identify these coevolved haplotypes for capturing alleles at one SNP. The tag SNPs selected by previous LD-based methods usually only reflect the extent of past chromosome recombination. It is worth mentioning that the predictive haplotypes selected by HapTagger is not only used for capturing untyped SNP alleles, but these haplotypes may be also considered to be the signature of recent positive selection or co-evolution.

However, the length of haplotypes under positive selection or co-evolution will be reduced because chromosome recombination will break the linkage of SNPs in these haplotypes. Thus, HapTagger seeks for the minimum-length haplotype for replacing a target SNP in the algorithm. In terms of algorithmic process, the requirement of minimum-length haplotype is helpful in the stage 3 of our algorithm, because the dependency (i.e., edges in the replacement graph) among these SNPs can be reduced and less cycles would be generated.

6 Conclusion

In this paper, we designed and implemented algorithms for the selection of tag SNPs using multimarker haplotypes without relying on LD statistic. The tag SNPs found by our algorithms define a set of haplotypes completely predictive of the alleles of all untyped SNPs. Several exact and approximation algorithms are proposed to efficiently find these tag SNPs. We integrated these algorithms

(24)

and implemented a program called HapTagger. Our theoretical analysis and experimental results indicated that HapTagger consistently identifies a smaller set of tag SNPs and runs much faster than existing methods on a variety of real data sets. We also discussed the efficiency of various LD statistic and compared distinct approaches for reconstructing untyped SNP alleles. It is worth mentioning that these predictive haplotypes selected by HapTagger may be the signature of positive selection or co-evolution.

ACKNOWLEDGEMENT

Yao-Ting Huang and Kun-Mao Chao were supported in part by NSC grants 93-2213-E-002-029, 94-2213-E-002-091, and 96-2218-E-194-006 from the National Science Council, Taiwan.

References

[1] Altshuler, D., Brooks, L.D., Chakravarti, A., Collins, F.S., Daly, M.J., and Donnelly, P. 2005.

A haplotype map of the human genome. Nature, 437:1299–1320.

[2] Ao, S.I., Yip, K., Ng, M., Cheung, D., Fong, P.Y., Melhado, I., and Sham,P.C. 2004. CLUSTAG:

hierarchical clustering and graph methods for selecting tag SNPs. Bioinformatics, 21(8):1735–

1736.

[3] Bafna, V., Halld´orsson, B.V., Schwartz, R., Clark, A.G., and Istrail, S. 2003. Haplotypes and informative SNP selection algorithms: don’t block out information. In Proc. RECOMB’03, pages 19–27.

[4] Barrett J.C., Fry B., Maller J., and Daly M.J. 2005. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21(2):263–265.

[5] Carlson, C.S., Eberle, M.A., Rieder, M.J., Yi, Q., Kruglyak, L., and Nickerson, D.A. 2004. Se- lecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet., 74:106–120.

(25)

[6] Chang, C.-J., Huang, Y.-T., and Chao, K.-M. 2006. A greedier appraoch for finding tag SNPs.

Bioinformatics, 22: 685-691.

[7] Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C. 2001. Introduction to algorithms, The MIT Press.

[8] Crawfod, D.C. and Nickerson, D.A. 2005. Definition and clinical importance of haplotypes.

Annu. Rev. Med., 56:303–320.

[9] Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J., and Lander, E.S. 2001. High-resolution haplotype structure in the human genome. Nat. Genet., 29(2):229–232.

[10] de Bakker, P.I., Yelensky, R., Pe’er, I., Gabriel, S.B., Daly, M.J., and Altshuler, D. 2005.

Efficiency and power in genetic association studies. Nat. Genet., pages 1217–1223.

[11] de Bakker, P.I., McVean, G., Sabeti, P.C., Miretti, M.M., Green, T., et al. 2006. A high- resolution HLA and SNP haplotype map for disease association studies in the extended human MHC. Nat. Genet., pages 1166–1172.

[12] Douglas, J.A., Boehnke, M., Gillanders, E., Trent, J.M., and Gruber, S.B. 2001.

Experimentally-derived haplotypes substantially increase the efficiency of linkage disequilib- rium studies. Nat. Genet., 28(4):361–364.

[13] Even, G., Naor, J., Schieber, B., and Sudan, M. 1998. Approximating minimum feedback sets and multicuts in directed graphs. Algorithmica, 20:151–174.

[14] Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., Liu-Cordero, S.N., Rotimi, C., Adeyemo, A., Cooper, R., Ward, R., Lander, E.S., Daly, M.J., and Altshuler, D. 2002. The structure of haplotype blocks in the human genome. Science, 296(5576):2225–2229.

[15] Garey, M.R., and Johnson, D.S. 1979. Computers and intractability, Freeman, New York.

[16] Helmuth, L. 2001. Genome research: map of the human genome 3.0. Science, 293(5530):583–

585.

(26)

[17] Hinds, D.A., Stuve, L.L., Nilsen, G.B., Halperin, E., Eskin, E., Ballinger, D.G., Frazer, K.A., and Cox, D.R. 2005. Whole-genome patterns of common DNA variation in three human pop- ulations. Science, 307:1072–1079.

[18] Houchbaum, D.S. 1982. Approximation algorithms for the set covering and vertex cover prob- lems. SIAM J. Comp., 11:555-556.

[19] Hu, N., Wang, C., Hu, Y., Yang, H.H., Giffen, C., Tang, Z.-Z., Han, X.-Y., Goldstein, A.M., Emmert, M.R., Buetow, K.H., and Taylor, P.R., and Lee, M.P. 2005. Genome-wide asspciation study in esophageal cancer using genechip mapping 10K array. Cancer Research, 65(7):2542–

2546.

[20] Huang, Y.-T., Zhang, K., Chen, T., and Chao, K.-M. 2005. Selecting additional tag SNPs for tolerating missing data in genotyping. BMC Bioinformatics, 6:263.

[21] Huang, Y.-T., Chao, K.-M, and Chen, T. 2005 An approximation algorithm for haplotype inference by pure parsimony. Journal of Computational Biology, 12: 1261-1274.

[22] Kennedy, G.C., Matsuzaki, H., Dong, S., Liu, W.M., Huang, J., Liu, G., Su, X., Cao, M., Chen, W., Zhang, J., Liu, W, Yang, G., Di, X., Ryder, T., He, Z., Surti, U., Phillips, M.S., Boyce-Jacino, M.T., Fodor, S.P., and Jones, K.W. 2003. Large-scale genotyping of complex DNA. Nature Biotechnology, 21:1233–1237.

[23] Lin, S., Chakravarti, A., and Cutler, D.J. 2004. Exhaustive allelic transmission disequilibirum tests as a new approach to genome-wide association studies. Nature Genetics, 36:1181-1188.

[24] Liu, W.M., Di, X., Yang, G., Matsuzaki, H., Huang, J., Mei, R., Ryder, T.B., Webster, T.A., Dong, S., Liu, G., Jones, K.W., Kennedy, G.C., and Kulp, D. 2003. Algorithms for Large Scale Genotyping Microarrays. Bioinformatics, 19:2397-2403.

[25] Papadimitriou, C. H., and Yannakakis, M. 1991. Optimization, approximation, and complexity classes. J. Comput. System Sci., 43:425-440.

(27)

[26] Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P., et al. 2001. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294:1719–1723.

[27] Qin, Z., Niu, T., and Liu, J. 2002. Partitioning-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide Ploymorphisms. Am. J. Hum. Genet., 71:1242–1247.

[28] Qin, Z.S., Gopalakrishnan, S., Abecasis, G.R. 2006. An efficient comprehensive search algorithm for TagSNP selection using linkage disequilibirium criteria. Bioinformatics, 99(11):7335–7339.

[29] Stephens, M., and Donnelly, P. 2003. A comparison of bayesian methods for haplotype recon- struction from population genotype data. Am. J. Hum. Genet., 73:1162–1169.

[30] Stram, D.O., Haiman, C.A., Hirschhorn J.N., Altshuler, D., Kolonel, L.N., Henderson, B.E., and Pike, M.C. 2003. Choosing haplotype-tagging SNPs based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the multiethnic cohort study. Hum. Hered., 55:27–36.

[31] Voight, B.F., Kudaravalli, S., Wen, X., Pritchard, J.K. 2006. A map of recent positive selection in the human genome. PLOS Biology, 446–458.

[32] Weal, M.E., Depondt, C., Macdonald, S.J., Smith, A., Lai, P.S., Shorvon, S.D., Wood, N.W., Goldstein, D.B. 2003. Selection and avalulation of tagging SNPs in the neuronal-sodium-channel gene SCN1A: implications for linkage diequilibrium gene mapping. Am. J. Hum. Genet., 73:551–

565.

[33] Zhang, K., Sun, F., Waterman, M.S., and Chen, T. 2003. Haplotype block partition with limited resources and applications to human chromosome 21 haplotype data. Am. J. Hum.

Genet., 73:63–73.

[34] Zhang, K., Qin, Z.S., Liu, J.S., Chen, T., Waterman, M.S., and Sun, F. 2004. Haplotype block partitioning and tag SNP selection using genotype data and their applications to association studies. Genome Res., 14(5):908–916.

(28)

Figure Legends

Figure 1.

An input example for the MHTP problem. H = {h1, ..., h5}, C = {S1, ..., S5} and ST = S6. {S1, S2, S3} is a set of feasible solution but the minimum solution is C⁰ = {S1, S3}, since the haplotype (0,0) defined by these two SNPs has perfect LD with SNP S6.

Figure 2.

An example for dividing SNPs into bins of perfect LD. The left-hand side is the input example which contains four haplotypes composed of six SNPs. The right-hand side illustrates each execution stage of this algorithm. The black nodes represent the intermediate groups of SNPs during the algorithm.

The white nodes at the leaves of the tree represent the bins of SNPs having perfect LD with each other.

Figure 3.

The replacement graph defined by tag SNPs S1 to Sn. Note that there exists a cycle defined by SNPs S₄, S₅, and S₆.

Figure 4.

An example of a replacement graph composed by five tag SNPs (i.e., S1, ..., S5). The DFS traversal starts from SNP S1 and results in four back edges (i.e., b1, ..., b4). (B) The back edges and vertices are reformulated to elements and sets in a set covering problem, respectively. The element bi is covered by the set Sj if Sj is one vertex on the back edge bi.