The evolutionary landscape of the Mycobacterium tuberculosis genome

(1)

The evolutionary landscape of the Mycobacterium tuberculosis genome

Tai-Chun Wang

a

_{, Feng-Chi Chen}

a,b,c,

⁎

a

Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, 35 Keyen Road, Zhunan, Miaoli County, 350 Taiwan

b

Department of Biological Science and Technology, National Chiao-Tung University, Hsinchu, 300 Taiwan

c_{Department of Dentistry, China Medical University, Taichung, 404 Taiwan}

a b s t r a c t

a r t i c l e i n f o

Article history:

Accepted 27 November 2012 Available online 7 December 2012

Keywords:

Mycobacterium tuberculosis Selection pressure Nonsynonymous substitution Synonymous substitution Neutral substitution rate

Mycobacterium tuberculosis is one of the most deadly human pathogens. The major mechanism for the adapta-tions of M. tuberculosis is nucleotide substitution. Previous studies have relied on the nonsynonymous-to-synonymous substitution rate (dN/dS) ratio as a measurement of selective constraint based on the

assumed selective neutrality of synonymous substitutions. However, this assumption has been shown to be untrue in many cases. In this study, we used the substitution rate in intergenic regions (di) of the M. tuberculosis

genome as the neutral reference, and conducted a genome-wide proﬁling for di, dS, and the rate of insertions/

deletions (indel rate) as compared with the genome of M. canettii using a 50 kb sliding window. We demonstrate significant variations in all of the three evolutionary measurements across the M. tuberculosis genome, even for regions in close vicinity. Furthermore, we identified a total of 233 genes with their dSdeviating significantly from

diwithin the same window. Interestingly, dSalso varies signiﬁcantly in some of the windows, indicating drastic

changes in mutation rate and/or selection pressure within relatively short distances in the M. tuberculosis genome. Importantly, our results indicate that selection on synonymous substitutions is common in the M. tuberculosis genome. Therefore, the dN/dSratio test must be applied carefully for measuring selection

pressure on M. tuberculosis genes.

1. Introduction

Mycobacterium tuberculosis (MTB), the causing pathogen of one of the most deadly diseases, claims millions of lives worldwide each year

(Zhang et al., 2011). The MTB complex (MTBC) belongs to the

slow-growing sublineage of Mycobacteria. Based on the geographical characteristics, MTBC can be classiﬁed into six clusters, including such species as M. tuberculosis, M. bovis, M. africanum, M. microti, M. pinnipedii, and M. canettii (Filliol et al., 2006; Gagneux et al., 2006; Gutacker et al.,

2006; Schurch and van Soolingen, 2012). Members in MTBC, including

M. tuberculosis, M. bovis, M. africanum, and M. canetti, share 99.95% of their genomic sequences and a strictly clonal population structure (Mokrousov et al., 2004; Smith et al., 2009). Compared to more ancient species (e.g. M. marinum), MTBC has shorter but more virulent chromo-somes (Namouchi et al., 2012).

Although most bacterial species acquire new genetic materials via

horizontal gene transfer (Thomas and Nielsen, 2005), it has been

reported that this mechanism rarely occurs to MTBC genomes (Gutierrez et al., 2005; Veyrier et al., 2009, 2011). Therefore, nucleotide substitution is a major mechanism for the emergence of M. tuberculosis pathogenesis. By comparing multiple MTBC genomes, Namouchi and

colleagues indicated that MTBC genomes exhibit signiﬁcant regional

variations in the density of single nucleotide polymorphisms (SNPs)

(Namouchi et al., 2012). This observation implies that MTBC genes at

different genomic positions may be evolving at very different rates. However, the authors did not distinguish between coding and noncod-ing regions when calculatnoncod-ing SNP densities. Their results thus cannot re-ﬂect the variations in SNP density at selectively neutral sites.

Since the majority of the MTB genome is composed of coding se-quences, genomic regions of high SNP density may harbor rapidly evolving genes. Some of these genes may be positively selected for their importance in the adaptations of MTBC to the human environ-ments. Meanwhile, extremely conserved genes are likely to be essen-tial for the survival and/or replication of the bacterium. These two groups of genes are good candidates of drug targets. However, an in-crease in evolutionary rate does not necessarily result from positive selection. An increase in mutation rate or relaxation of selective con-straint can lead to the same result. An adequate reference for neutral substitution rate and a good measurement for selection pressure are thus required to infer the driver of the increased evolutionary rates in the genes of interest.

Abbreviations: MTB, Mycobacterium tuberculosis; MTBC, Mycobacterium tuberculosis complex; dN, Nonsynonysmous substitution rate; dS, Synonymous substation rate; di,

Nu-cleotide substitution rate at intergenic regions; indel, Insertion and deletion; kb, kilobase; SNP, Single nucleotide polymorphism; NCBI, National Center for Biotechnology Informa-tion; CAI, Codon adaptation index.

⁎ Corresponding author at: Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, 35 Keyen Road, Zhunan, Miaoli County, 350 Taiwan.

E-mail addresses:[email protected](T.-C. Wang),[email protected]

(F.-C. Chen).

http://dx.doi.org/10.1016/j.gene.2012.11.033

Contents lists available atSciVerse ScienceDirect

Gene

(2)

One commonly used test for natural selection is the ratio of

nonsynonymous substitution rate (dN) to synonymous substitution

rate (dS) (i.e., the dN/dSratio) (Toll-Riera et al., 2011). In general,

dN/dS> 1 indicates positive selection, and dN/dSb1 is a sign of

nega-tive selection. However, this test is based on the assumption that synonymous substitutions are selectively neutral, which has been questioned particularly in unicellular organisms. It is known that

synonymous substitutions may confer_{ﬁtness effects by affecting}

the efﬁciency and/or accuracy of protein translation (Kryazhimskiy

and Plotkin, 2008). An alternative neutral reference is the

nucleo-tide substitution rate of intergenic regions (di) because intergenic

regions are usually free from selection pressure. Therefore, theoret-ically, by comparing the dSof a gene against the diof the neighboring

intergenic region, we can infer whether the synonymous substitu-tions are selectively neutral or not, and determine whether we

should use dN/dSas the measurement of selection. There are three

possible scenarios in the comparison between dSand di. Firstly, if

dSis approximately equal to di, synonymous substitutions are

prob-ably driven mainly by mutation. Alternatively, if dSis signiﬁcantly

lower than di, synonymous substitutions are likely to be negatively

selected. Finally, if dSis signiﬁcantly larger than di, synonymous

sub-stitutions are possibly driven by positive selection. In the latter two cases, dN/dishould be used instead of dN/dSfor measuring selection

pressure on the gene of interest.

Here, we examine the variations in evolutionary rates in the ge-nomes of multiple MTB strains and the selection pressures imposed on MTB genes. We would like to address the following questions: (1) how applicable is dN/dSin measuring selection pressure on MTB

genes; (2) which MTB genes evolve signiﬁcantly more rapidly or

more slowly than the genome average in terms of, separately, dS,

dN, and dN/dS; and (3) what is the major driving force that leads to

the variations in evolutionary rates among genes.

Our results indicate signiﬁcant variations in di, dSand the rate of

insertions/deletions (indels) across the MTB genome, which suggests ﬂuctuations in local mutation rate as a driving force of nucleotide sub-stitutions. Furthermore, we found that synonymous substitutions in hundreds of MTB genes may be subject to negative or positive selec-tion, indicating noticeable inapplicability of the dN/dSratio test to the

MTB genes. The molecular mechanisms and phenotypic conse-quences of the drastic variations in evolutionary rates in MTB genes are worth further investigations.

2. Materials and methods 2.1. Datasets

The genomic sequences of thirteen strains of Mycobacteria (Table 1) were downloaded from the National Center for Biotechnology Informa-tion (NCBI) athttp://www.ncbi.nlm.nih.gov/. Except for M. marinum, all

of these strains belong to the MTBC. Here, the genomes of M. marinum and M. canettii were used for comparisons with the other MTBC genomes for the calculation of evolutionary rates. The average G + C content is approximately 65% for all of the analyzed genomes.

2.2. Identi_{ﬁcation of orthologous genes}

The gene annotations of the analyzed bacterial genomes were also retrieved from NCBI. The nucleotide sequences of the annotated genes were conceptually translated into peptide sequences, and

input into orthoMCL (Li et al., 2003) with default parameters for

identiﬁcation of orthologous genes between the analyzed species/

strains. OrthoMCL identiﬁed 2358 orthologous genes for the 13

analyzed Mycobacterial genomes. The peptide sequences of the

identiﬁed orthologous genes were then aligned by using MUSCLE

(Edgar, 2004) with default parameters, and then back-translated

to nucleotide sequences for calculations of dN, dS, and the dN/dS

ratio.

2.3. Measurements of local evolutionary rates

To analyze diand indel rate, we used Mauve 2.0 (Darling et al., 2004)

to align the nucleotide sequences of the 13 analyzed genomes. The gaps between alignment blocks were discarded. For the comparison between dSand di, we removed all of the noncoding RNAs from intergenic regions

with reference to the annotations of SIPHT (sRNA identiﬁcation protocol

using high-throughput technologies) (Livny et al., 2008). A 50-kb

non-overlapping sliding window was then used to delineate the aligned

genomic regions for calculations of di and indel rate. Note that a

window contains both genic and intergentic regions. The genic and intergenic regions were demarcated according to the NCBI annotations. dNand dSwere calculated separately for each gene. The intergenic regions

within each window were concatenated for the calculation of di.

There-fore, for each window, we could obtain multiple dN and dS values

(when there are multiple genes in a window), and a single divalue. Of

note, the genes that are located at the boundaries between windows were discarded. In addition, we trimmed 50 nucleotides from both ends of each alignment block to avoid potential alignment errors.

The Codeml module of PAML 4 (Yang, 2007) was used to calculate

dNand dS. The Baseml module of PAML was applied for the calculation

of di. We also calculated the indel rate by analyzing the MAUVE

align-mentﬁles using an in-house PERL script. The indel rate was deﬁned as the total length of insertions and deletions divided by the length of the alignable sequence.

2.4. Identiﬁcation of genes with exceptional evolutionary rates

Since M. tuberculosis H37Rv is genetically close to M. canettii, in many of the cases we observe zero values of dN, dS, or dN/dSwhen

(3)

comparing orthologous genes between the two species. For each evolu-tionary measurement (dN, dS, or dN/dS), we assume that the probability

of a zero value isα. Then the distribution is a α: 1−α mixture of a

point probability mass at 0 and a log-normal distribution, which can be presented as:

F xð Þ ¼ _ð₁_−α_{ÞΦ logx}_ð α; x ¼ 0_{Þ; x > 0}

whereΦ is the function of normal distribution.

Based on this mixture probability distribution, the cut points for determining the outlier evolutionary measurements at 5% error rate should be adjusted by

Average evolutionary rate 2 standard deviation 1−αð Þ:

Notably, the standard deviation should be calculated based on the non-zero values. Also note that, for the analysis of dN/dS, the genes

with dS= 0 were removed. Hence, the sample size in this analysis

was reduced, andα becomes the proportion of dN= 0 in this reduced

set.

We then de_{ﬁned slow- and fast-evolving genes as those that have}

their evolutionary rates fall outside of two standard deviations (adjusted forα) from the average. The genes thus identiﬁed were then functionally analyzed with reference to the TubercuList (http://genolist.pasteur.fr/ TubercuList/) (Lew et al., 2011).

2.5. Measurement of codon usage bias

In this study, we used codon adaptation index (CAI) (Sharp and Li,

1987) as the measurement of codon usage bias for MTB genes. CAI can

be calculated as follows: CAI¼ exp 1. L∑ L l¼1ln wð ið ÞlÞ ; where wi¼ fi max fj

ij ∈ synonumous codons for amino acid½

Here, fiis the frequency of a codon i and fjis the frequency of the

synonymous codons for that amino acid.

3. Results and discussions

3.1. Signiﬁcant variations in diand indel rate across the MTBC genomes

Genomic regions in close proximity generally have similar substitu-tion rates. Meanwhile, recombinasubstitu-tion events may divide neighboring genomic regions into different“linkage blocks”, between which nucleo-tide substitution rates can differ signiﬁcantly. However, if recombina-tion events occur frequently at very small intervals (e.g., tens of base

pairs), the genome would become“homogenized”, and the linkage

blocks would be too small to be meaningful for large-scale studies. For the MTBC genomes, this seems to be the case according to Namouchi et al.'s recent report (Namouchi et al., 2012). By analyzing SNPs in the MTBC genomes, Namouchi and colleagues inferred that the average

size of recombination DNA segments in these genomes is about 50–

58 bp in length, and that recombination events occur at one-ﬁfth the

frequency of mutations, which is far more frequent than previously

known (Namouchi et al., 2012). In other words, the MTBC genomes

are frequently“scrambled” so that larger-scale regional variations in

nucleotide substitution rates may simply reﬂect variations in

back-ground mutation rate and/or selection pressure. Therefore, by applying an adequate-sized widow to the genomic sequences of MTBC, we can obtain dS, dN, and difor comparison and for inferences of selection

pressure on the genes in the window. To this end, weﬁrst aligned

eleven MTBC genomes (Table 1) against the M. canettii genome by

Fig. 1. The genomic landscape of difor eleven Mycobacterial genomes as compared with M. canetti.

Table 2

Spearman's correlation test for diand indel rate between M. tuberculosis H37Rv and

other MTBC strains.

Strain di Indel rate

rho p-Value rho p-Value T. KZN1435 0.978 b2.20E−16 0.843 b2.20E−16 T. F11 0.977 b2.20E−16 0.907 b2.20E−16 T. CDC1551 0.943 b2.20E−16 0.691 b2.20E−16 T. CCDC5079 0.935 b2.20E−16 0.705 b2.20E−16 T. CCDC5180 0.959 b2.20E−16 0.719 b2.20E−16 T. H37Ra 0.994 b2.20E−16 0.929 b2.20E−16 africanum 0.973 b2.20E−16 0.478 3.26E−06 Bovis AF2122 0.973 b2.20E−16 0.862 b2.20E−16 BCG Pasteur 0.968 b2.20E−16 0.871 b2.20E−16 BCG Tokyo 0.977 b2.20E−16 0.842 b2.20E−16

(4)

using MAUVE (Darling et al., 2004). We selected M. canettii because it is the closest outgroup species to the major pathogenic strains of M. tuberculosis, M. bovis, and M. africanum (the M. marinum genome is applied in a latter analysis). We then used a non-overlapping slid-ing window to examine the variations in evolutionary rates between M. canettii and each of the eleven MTBC genomes. Of note, a window

must be large enough to include suf_{ﬁcient intergenic sequences for}

the calculation of di. We tested several different sizes (5 kb, 10 kb,

25 kb, 50 kb, 75 kb, and 100 kb), and determined 50 kb to be the most suitable window size for the purpose of this study. Since the MTBC genomes are approximately 4.4 Mb in length, there are a total of 88 windows for each genome.

Weﬁrst examined the variations in diacross the MTBC genomes.

As shown inFig. 1, diﬂuctuates considerably in the MTBC genomes,

ranging from zero to ~ 0.18. Interestingly, the genomic proﬁles of di

are almost identical among the MTBC strains (Table 2), which reﬂects the shared evolutionary histories and the small genetic distances

among the analyzed genomes. Meanwhile, indel rate (deﬁned as the

total length of the identiﬁed indels divided by the alignable sequence

length) also varies dramatically across the MTBC genomes (Fig. 2).

Nevertheless, the proﬁles of indel rates are less similar among the

strains as compared with di(Table 2). There are several potential

rea-sons for the decreased similarity among the profiles of indel rate. One explanation is potential alignment errors, which may have led to false identification of indels. Of note, we actually trimmed 50 bp from both ends of each alignment block to reduce this possible source of error. Another possibility is genome assembly errors, which may also arti fi-cially increase indel rates (Albers et al., 2011). The third possible rea-son is the presence of low-complexity sequences, which might occur at different frequencies in the MTBC genomes, and lead to regional variations in indel rate. Of course, we cannot exclude the possibility

that such variations in indel rate have reﬂected the high frequency

of genomic rearrangements in the MTBC genomes, as reported previ-ously (Namouchi et al., 2012).

Fig. 2. The genomic landscape of indel rate for eleven Mycobacterial genomes as compared with M. canetti.

(5)

Next, we are interested in comparing different evolutionary mea-surements (di, dS, and indel rate) in the same windows, and see whether

they exhibit similar genomic proﬁles. Since the proﬁles of diare almost

identical among different MTBC genomes, for simplicity, we use only the M. tuberculosis H37Rv genome for subsequent analyses. 3.2. Selection on synonymous substitutions in the genome of M. tuberculosis H37Rv

By comparing di, dS, and indel rate in the same windows, we may be

able to better understand the driving force of MTB genome evolution.

Here di in each window is used as the neutral reference because

intergenic regions should be free from selection pressure in most of the cases. Of note, annotated nocoding RNAs were removed from the intergenic sequences to avoid biased estimations of diin this analysis

(seeMaterials and methods).Fig. 3shows the genomic landscape for

dS, di and indel rate for the H37Rv strain as compared against the

M. canettii genome. Apparently, the proﬁles of the three evolutionary measurements are similar to one another. For instance, in windows # 19, 22, 63, 68 and 82, dS, diand indel rate all increase remarkably as

compared with the neighboring windows. These windows may harbor recombination hotsopts, which can signi_{ﬁcantly increase both mutation} rate and indel rate. Meanwhile, windows with a relatively high indel rate but low dSand diare also observed (e.g. windows # 49–55). However,

the pair-wise Spearman's correlations between the three evolutionary measurements are statistically insigniﬁcant (p-value>0.05 in all three pair-wise comparisons), indicating non-mutation forces (e.g., selection) may have differentially affected different windows. Also noticeable is that dS, diand indel rate may differ by more than ten folds even for two

adjacent windows (e.g. windows # 63 and 64). This is actually consistent with the previous observation that the recombination DNA tracts are fairly short, so that physically close windows may have very different re-combination rates (and very different dS, diand indel rate). It is thus of

interest to use a smaller window size for a better resolution of the varia-tions in evolutionary rates. However, when we use a 5 kb window size for this purpose, the total length of intergenic region in each window is too short (e.g. shorter than 100 bp) for a reliable estimation for di.

Of note, in more than half of the windows, the average dSis larger

than di.This is unexpected because synonymous substitutions are

supposed to be nearly neutral or negatively selected in most of the

cases. We then examined whether signiﬁcant variations in dSexist

in the same windows, which can bias the average dSupwards in the

presence of exceptionally large dSvalues. If this is the case, the

aver-age dSwill be larger than the median dS, which is actually observed in

86% of the 88 windows (Fig. 4). Nevertheless, even if we replace the

average dSwith the median dSfor each of the window, about 40% of

all the windows still have their dSlarger than di(data not shown).

One possible explanation for this observation is that diis

under-estimated in these windows because the intergenic regions include

yet unidentiﬁed regulatory elements (although we have removed

noncoding RNAs from the intergenic regions). Alternatively, the

syn-onymous substitutions in the high-dSgenes may be subject to

posi-tive selection for enhancing translational efﬁciency, or for regulatory reasons related to the secondary structures of mRNAs and/or the binding of noncoding RNAs.

Next, we used the Chi-square test to examine for each gene whether dSis signiﬁcantly smaller (dS≪di) or larger (dS≫di) than

di in the same window. We found a total of 114 and 119 genes,

respectively, that have their dSsigniﬁcantly smaller or larger than di.

Interestingly, these two types of genes occasionally co-occur in the

same window. This observation suggests that dSvaries to a large

extent even within a relatively small distance (i.e. 50 kb).

One possible explanation for the deviation of dS from diis to

achieve better translational efﬁciency. Hence, we examined the

codon usage biases (measured by the codon adaptation index, CAI) for known (non-hypothetical) genes, and compared the CAIs be-tween dS~= digenes and dS≠digenes.Fig. 5shows that the median

CAI is the highest in the genes with dS≪di, followed by genes with

Fig. 4. Comparison between average dSand median dSin the case of M. tb. H37Rv Vs. M. canetti.

Fig. 5. The codon adaptation index (CAI) of three different groups of M. tuberculosis H37Rv genes (dS~= di, dS≫di, and dS≪di). Note that only known genes are included

(6)

back to the intergenic regions, all of the pair-wise differences in CAI

between different gene groups are statistically signiﬁcant

(Supple-mentary Fig. 1; p-valueb0.05 by Wilcoxon Rank Sum test). For the

genes that have their dSdeviate signiﬁcantly from di, the dN/dSratio

test may be inappropriate for measuring selection pressure. The genes with their dN/dSdeviating signiﬁcantly from dN/diare listed in

Supplementary Table 1.

3.3. Identiﬁcation of conserved and rapidly evolving genes in M. tuberculosis H37Rv

One important issue in the development of tuberculosis therapeutics is the identification of drug target genes. As stated previously, both con-served genes and rapidly evolving genes may be good candidates in this regard. Therefore, we compared the M. tuberculosis H37Rv genes with their one-to-one orthologues (as identified by orthoMCL) in M. canetti and M. marinum, respectively, for the identi_{fication of conserved and} rap-idly evolving genes. A total of 2761 H37Rv-canettii-marinum orthologous

gene sets were thus analyzed. We then deﬁne conserved and rapidly

evolving genes as those that have their evolutionary rates fall outside of two standard deviations from the average (seeMaterials and methods). We conducted this analysis separately for dS, dN, and dN/dS. Since the

evo-lutionary rates may occasionally be zero (especially for dN), we used a

mixture distribution of zero and log-normal distribution for each evolu-tionary measurement for this analysis (seeMaterials and methods). It is noteworthy that, for the analysis of dN/dS, we removed those genes

whose dSdeviates signiﬁcantly from diof the same window. However,

we cannot evaluate the deviation of dSfrom diin the comparison between

M. tuberculosis H37Rv and M. marinum because the alignable intergenic regions are too short and fragmented for reliable estimations of di.

Table 3shows that 40.3%, 0.1%, and 11.7% of M. tuberculosis H37Rv

genes have their dN, dS, and dN/dS, respectively, signiﬁcantly lower than

the genome average when compared against the M. canettii genes. Mean-while, the corresponding numbers are 2.9%, 2.2%, and 3.0%, respectively, in the comparison against M. marinum genes. The most possible explana-tion for such differences in the percentage of conserved genes is the short divergence time between M. tuberculosis and M. canettii (Falush, 2009). This short time span is insufﬁcient for nucleotide substitutions to accu-mulate, thus leading to low dNand dSvalues and lack of resolution in

this analysis. The low dNand dSvalues in the M. tuberculosis–M. canettii

comparison also render the estimations of dN/dSunreliable. By

com-parison, the small percentage (~ 3%) of genes conserved between M. tuberculosis and M. marinum are likely to be responsible for fun-damental biological functions of Mycobacteria. These genes thus may serve as candidate drug targets.

Meanwhile, we observed fewer rapidly evolving genes than con-served genes in terms of dN/dSin both sets of pair-wise comparisons.

These high-dN/dSgenes are subject either to positive selection or

re-laxed negative selection. In the former case, these genes may be relat-ed to the adaptations of MTBC to the human environments. While in the latter case, the genes are biologically unimportant for MTBC, and

3.4. Conclusions

In this study, we conducted a genome-wide analysis of dS, dN, dN/

ds, di, and indel rate for MTB. We discovered signiﬁcant ﬂuctuations in

all of these evolutionary rates, suggesting large variations in regional mutation rate and selection pressure across the MTB genome. We also showed that a considerable proportion of the synonymous substitu-tions are subject to natural selection, possibly for enhancing

transla-tional efﬁciency or for other regulatory reasons. The large number

of genes with their dSdeviating from selective neutrality thus calls

for caution in the application of the dN/dSratio as a measurement

for selection pressure on M. tuberculosis genes. Acknowledgments

This study was supported by the National Science Council, Taiwan (under contract number NSC100-3111-B-400-001) and the intramu-ral funding of National Health Research Institutes (PH101-SP-04) to F.-C. Chen. We thank Dr. Pei-Sheng Lin and Dr. Wen-Chang Wang for statistical advice. We also thank Dr. Horng-Yunn Dou for construc-tive comments and suggestions.

Appendix A. Supplementary data

Supplementary data to this article can be found online athttp://

dx.doi.org/10.1016/j.gene.2012.11.033.

References

Albers, C.A., Lunter, G., MacArthur, D.G., McVean, G., Ouwehand, W.H., Durbin, R., 2011. Dindel: accurate indel calls from short-read data. Genome Res. 21 (6), 961–973. Darling, A.C., Mau, B., Blattner, F.R., Perna, N.T., 2004. Mauve: multiple alignment

of conserved genomic sequence with rearrangements. Genome Res. 14 (7), 1394–1403.

Edgar, R.C., 2004. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113.

Falush, D., 2009. Toward the use of genomics to study microevolutionary change in bacteria. PLoS Genet. 5 (10), e1000627.

Filliol, I., et al., 2006. Global phylogeny of Mycobacterium tuberculosis based on single nucleotide polymorphism (SNP) analysis: insights into tuberculosis evolution, phylogenetic accuracy of other DNAﬁngerprinting systems, and recommendations for a minimal standard SNP set. J. Bacteriol. 188 (2), 759–772.

Gagneux, S., et al., 2006. Variable host-pathogen compatibility in Mycobacterium tuberculosis. Proc. Natl. Acad. Sci. U. S. A. 103 (8), 2869–2873.

Gutacker, M.M., et al., 2006. Single-nucleotide polymorphism-based population genetic analysis of Mycobacterium tuberculosis strains from 4 geographic sites. J. Infect. Dis. 193 (1), 121–128.

Gutierrez, M.C., et al., 2005. Ancient origin and gene mosaicism of the progenitor of Mycobacterium tuberculosis. PLoS Pathog. 1 (1), e5.

Kryazhimskiy, S., Plotkin, J.B., 2008. The population genetics of dN/dS. PLoS Genet. 4 (12), e1000304.

Lew, J.M., Kapopoulou, A., Jones, L.M., Cole, S.T., 2011. TubercuList—10 years after. Tuberculosis (Edinb.) 91 (1), 1–7.

Li, L., Stoeckert Jr., C.J., Roos, D.S., 2003. OrthoMCL: identiﬁcation of ortholog groups for eukaryotic genomes. Genome Res. 13 (9), 2178–2189.

(7)

Livny, J., Teonadi, H., Livny, M., Waldor, M.K., 2008. High-throughput, kingdom-wide prediction and annotation of bacterial non-coding RNAs. PLoS One 3, e3197.

Mokrousov, I., Narvskaya, O., Limeschenko, E., Vyazovaya, A., Otten, T., Vyshnevskiy, B., 2004. Analysis of the allelic diversity of the mycobacterial interspersed repetitive units in Mycobacterium tuberculosis strains of the Beijing family: practical implica-tions and evolutionary consideraimplica-tions. J. Clin. Microbiol. 42 (6), 2438–2444. Namouchi, A., Didelot, X., Schock, U., Gicquel, B., Rocha, E.P., 2012. After the bottleneck:

genome-wide diversiﬁcation of the Mycobacterium tuberculosis complex by muta-tion, recombinamuta-tion, and natural selection. Genome Res. 22 (4), 721–734. Schurch, A.C., van Soolingen, D., 2012. DNAﬁngerprinting of Mycobacterium tuberculosis:

from phage typing to whole-genome sequencing. Infect. Genet. Evol. J. Mol. Epidemiol. Evol. Genet. Infect. Dis. 12 (4), 602–609.

Sharp, P.M., Li, W.H., 1987. The codon Adaptation Index—a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15 (3), 1281–1295.

Smith, N.H., Hewinson, R.G., Kremer, K., Brosch, R., Gordon, S.V., 2009. Myths and misconceptions: the origin and evolution of Mycobacterium tuberculosis. Nat. Rev. Microbiol. 7 (7), 537–544.

Thomas, C.M., Nielsen, K.M., 2005. Mechanisms of, and barriers to, horizontal gene transfer between bacteria. Nat. Rev. Microbiol. 3 (9), 711–721.

Toll-Riera, M., Laurie, S., Alba, M.M., 2011. Lineage-speciﬁc variation in intensity of nat-ural selection in mammals. Mol. Biol. Evol. 28 (1), 383–398.

Veyrier, F., Pletzer, D., Turenne, C., Behr, M.A., 2009. Phylogenetic detection of horizon-tal gene transfer during the step-wise genesis of Mycobacterium tuberculosis. BMC Evol. Biol. 9, 196.

Veyrier, F.J., Dufort, A., Behr, M.A., 2011. The rise and fall of the Mycobacterium tuberculosis genome. Trends Microbiol. 19 (4), 156–161.

Yang, Z., 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24 (8), 1586–1591.

Zhang, Y., Zhang, H., Zhou, T., Zhong, Y., Jin, Q., 2011. Genes under positive selection in Mycobacterium tuberculosis. Comput. Biol. Chem. 35 (5), 319–322.