The evolutionary landscape of the Mycobacterium tuberculosis genome
Tai-Chun Wang
a, Feng-Chi Chen
a,b,c,⁎
a
Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, 35 Keyen Road, Zhunan, Miaoli County, 350 Taiwan
b
Department of Biological Science and Technology, National Chiao-Tung University, Hsinchu, 300 Taiwan
cDepartment of Dentistry, China Medical University, Taichung, 404 Taiwan
a b s t r a c t
a r t i c l e i n f o
Article history:
Accepted 27 November 2012 Available online 7 December 2012
Keywords:
Mycobacterium tuberculosis Selection pressure Nonsynonymous substitution Synonymous substitution Neutral substitution rate
Mycobacterium tuberculosis is one of the most deadly human pathogens. The major mechanism for the adapta-tions of M. tuberculosis is nucleotide substitution. Previous studies have relied on the nonsynonymous-to-synonymous substitution rate (dN/dS) ratio as a measurement of selective constraint based on the
assumed selective neutrality of synonymous substitutions. However, this assumption has been shown to be untrue in many cases. In this study, we used the substitution rate in intergenic regions (di) of the M. tuberculosis
genome as the neutral reference, and conducted a genome-wide profiling for di, dS, and the rate of insertions/
deletions (indel rate) as compared with the genome of M. canettii using a 50 kb sliding window. We demonstrate significant variations in all of the three evolutionary measurements across the M. tuberculosis genome, even for regions in close vicinity. Furthermore, we identified a total of 233 genes with their dSdeviating significantly from
diwithin the same window. Interestingly, dSalso varies significantly in some of the windows, indicating drastic
changes in mutation rate and/or selection pressure within relatively short distances in the M. tuberculosis genome. Importantly, our results indicate that selection on synonymous substitutions is common in the M. tuberculosis genome. Therefore, the dN/dSratio test must be applied carefully for measuring selection
pressure on M. tuberculosis genes.
© 2012 Elsevier B.V. All rights reserved.
1. Introduction
Mycobacterium tuberculosis (MTB), the causing pathogen of one of the most deadly diseases, claims millions of lives worldwide each year
(Zhang et al., 2011). The MTB complex (MTBC) belongs to the
slow-growing sublineage of Mycobacteria. Based on the geographical characteristics, MTBC can be classified into six clusters, including such species as M. tuberculosis, M. bovis, M. africanum, M. microti, M. pinnipedii, and M. canettii (Filliol et al., 2006; Gagneux et al., 2006; Gutacker et al.,
2006; Schurch and van Soolingen, 2012). Members in MTBC, including
M. tuberculosis, M. bovis, M. africanum, and M. canetti, share 99.95% of their genomic sequences and a strictly clonal population structure (Mokrousov et al., 2004; Smith et al., 2009). Compared to more ancient species (e.g. M. marinum), MTBC has shorter but more virulent chromo-somes (Namouchi et al., 2012).
Although most bacterial species acquire new genetic materials via
horizontal gene transfer (Thomas and Nielsen, 2005), it has been
reported that this mechanism rarely occurs to MTBC genomes (Gutierrez et al., 2005; Veyrier et al., 2009, 2011). Therefore, nucleotide substitution is a major mechanism for the emergence of M. tuberculosis pathogenesis. By comparing multiple MTBC genomes, Namouchi and
colleagues indicated that MTBC genomes exhibit significant regional
variations in the density of single nucleotide polymorphisms (SNPs)
(Namouchi et al., 2012). This observation implies that MTBC genes at
different genomic positions may be evolving at very different rates. However, the authors did not distinguish between coding and noncod-ing regions when calculatnoncod-ing SNP densities. Their results thus cannot re-flect the variations in SNP density at selectively neutral sites.
Since the majority of the MTB genome is composed of coding se-quences, genomic regions of high SNP density may harbor rapidly evolving genes. Some of these genes may be positively selected for their importance in the adaptations of MTBC to the human environ-ments. Meanwhile, extremely conserved genes are likely to be essen-tial for the survival and/or replication of the bacterium. These two groups of genes are good candidates of drug targets. However, an in-crease in evolutionary rate does not necessarily result from positive selection. An increase in mutation rate or relaxation of selective con-straint can lead to the same result. An adequate reference for neutral substitution rate and a good measurement for selection pressure are thus required to infer the driver of the increased evolutionary rates in the genes of interest.
Abbreviations: MTB, Mycobacterium tuberculosis; MTBC, Mycobacterium tuberculosis complex; dN, Nonsynonysmous substitution rate; dS, Synonymous substation rate; di,
Nu-cleotide substitution rate at intergenic regions; indel, Insertion and deletion; kb, kilobase; SNP, Single nucleotide polymorphism; NCBI, National Center for Biotechnology Informa-tion; CAI, Codon adaptation index.
⁎ Corresponding author at: Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, 35 Keyen Road, Zhunan, Miaoli County, 350 Taiwan.
E-mail addresses:[email protected](T.-C. Wang),[email protected]
(F.-C. Chen).
0378-1119/$– see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.gene.2012.11.033
Contents lists available atSciVerse ScienceDirect
Gene
One commonly used test for natural selection is the ratio of
nonsynonymous substitution rate (dN) to synonymous substitution
rate (dS) (i.e., the dN/dSratio) (Toll-Riera et al., 2011). In general,
dN/dS> 1 indicates positive selection, and dN/dSb1 is a sign of
nega-tive selection. However, this test is based on the assumption that synonymous substitutions are selectively neutral, which has been questioned particularly in unicellular organisms. It is known that
synonymous substitutions may conferfitness effects by affecting
the efficiency and/or accuracy of protein translation (Kryazhimskiy
and Plotkin, 2008). An alternative neutral reference is the
nucleo-tide substitution rate of intergenic regions (di) because intergenic
regions are usually free from selection pressure. Therefore, theoret-ically, by comparing the dSof a gene against the diof the neighboring
intergenic region, we can infer whether the synonymous substitu-tions are selectively neutral or not, and determine whether we
should use dN/dSas the measurement of selection. There are three
possible scenarios in the comparison between dSand di. Firstly, if
dSis approximately equal to di, synonymous substitutions are
prob-ably driven mainly by mutation. Alternatively, if dSis significantly
lower than di, synonymous substitutions are likely to be negatively
selected. Finally, if dSis significantly larger than di, synonymous
sub-stitutions are possibly driven by positive selection. In the latter two cases, dN/dishould be used instead of dN/dSfor measuring selection
pressure on the gene of interest.
Here, we examine the variations in evolutionary rates in the ge-nomes of multiple MTB strains and the selection pressures imposed on MTB genes. We would like to address the following questions: (1) how applicable is dN/dSin measuring selection pressure on MTB
genes; (2) which MTB genes evolve significantly more rapidly or
more slowly than the genome average in terms of, separately, dS,
dN, and dN/dS; and (3) what is the major driving force that leads to
the variations in evolutionary rates among genes.
Our results indicate significant variations in di, dSand the rate of
insertions/deletions (indels) across the MTB genome, which suggests fluctuations in local mutation rate as a driving force of nucleotide sub-stitutions. Furthermore, we found that synonymous substitutions in hundreds of MTB genes may be subject to negative or positive selec-tion, indicating noticeable inapplicability of the dN/dSratio test to the
MTB genes. The molecular mechanisms and phenotypic conse-quences of the drastic variations in evolutionary rates in MTB genes are worth further investigations.
2. Materials and methods 2.1. Datasets
The genomic sequences of thirteen strains of Mycobacteria (Table 1) were downloaded from the National Center for Biotechnology Informa-tion (NCBI) athttp://www.ncbi.nlm.nih.gov/. Except for M. marinum, all
of these strains belong to the MTBC. Here, the genomes of M. marinum and M. canettii were used for comparisons with the other MTBC genomes for the calculation of evolutionary rates. The average G + C content is approximately 65% for all of the analyzed genomes.
2.2. Identification of orthologous genes
The gene annotations of the analyzed bacterial genomes were also retrieved from NCBI. The nucleotide sequences of the annotated genes were conceptually translated into peptide sequences, and
input into orthoMCL (Li et al., 2003) with default parameters for
identification of orthologous genes between the analyzed species/
strains. OrthoMCL identified 2358 orthologous genes for the 13
analyzed Mycobacterial genomes. The peptide sequences of the
identified orthologous genes were then aligned by using MUSCLE
(Edgar, 2004) with default parameters, and then back-translated
to nucleotide sequences for calculations of dN, dS, and the dN/dS
ratio.
2.3. Measurements of local evolutionary rates
To analyze diand indel rate, we used Mauve 2.0 (Darling et al., 2004)
to align the nucleotide sequences of the 13 analyzed genomes. The gaps between alignment blocks were discarded. For the comparison between dSand di, we removed all of the noncoding RNAs from intergenic regions
with reference to the annotations of SIPHT (sRNA identification protocol
using high-throughput technologies) (Livny et al., 2008). A 50-kb
non-overlapping sliding window was then used to delineate the aligned
genomic regions for calculations of di and indel rate. Note that a
window contains both genic and intergentic regions. The genic and intergenic regions were demarcated according to the NCBI annotations. dNand dSwere calculated separately for each gene. The intergenic regions
within each window were concatenated for the calculation of di.
There-fore, for each window, we could obtain multiple dN and dS values
(when there are multiple genes in a window), and a single divalue. Of
note, the genes that are located at the boundaries between windows were discarded. In addition, we trimmed 50 nucleotides from both ends of each alignment block to avoid potential alignment errors.
The Codeml module of PAML 4 (Yang, 2007) was used to calculate
dNand dS. The Baseml module of PAML was applied for the calculation
of di. We also calculated the indel rate by analyzing the MAUVE
align-mentfiles using an in-house PERL script. The indel rate was defined as the total length of insertions and deletions divided by the length of the alignable sequence.
2.4. Identification of genes with exceptional evolutionary rates
Since M. tuberculosis H37Rv is genetically close to M. canettii, in many of the cases we observe zero values of dN, dS, or dN/dSwhen
comparing orthologous genes between the two species. For each evolu-tionary measurement (dN, dS, or dN/dS), we assume that the probability
of a zero value isα. Then the distribution is a α: 1−α mixture of a
point probability mass at 0 and a log-normal distribution, which can be presented as:
F xð Þ ¼ ð1−αÞΦ logxð α; x ¼ 0Þ; x > 0
whereΦ is the function of normal distribution.
Based on this mixture probability distribution, the cut points for determining the outlier evolutionary measurements at 5% error rate should be adjusted by
Average evolutionary rate 2 standard deviation 1−αð Þ:
Notably, the standard deviation should be calculated based on the non-zero values. Also note that, for the analysis of dN/dS, the genes
with dS= 0 were removed. Hence, the sample size in this analysis
was reduced, andα becomes the proportion of dN= 0 in this reduced
set.
We then defined slow- and fast-evolving genes as those that have
their evolutionary rates fall outside of two standard deviations (adjusted forα) from the average. The genes thus identified were then functionally analyzed with reference to the TubercuList (http://genolist.pasteur.fr/ TubercuList/) (Lew et al., 2011).
2.5. Measurement of codon usage bias
In this study, we used codon adaptation index (CAI) (Sharp and Li,
1987) as the measurement of codon usage bias for MTB genes. CAI can
be calculated as follows: CAI¼ exp 1. L∑ L l¼1ln wð ið ÞlÞ ; where wi¼ fi max fj
ij ∈ synonumous codons for amino acid½
Here, fiis the frequency of a codon i and fjis the frequency of the
synonymous codons for that amino acid.
3. Results and discussions
3.1. Significant variations in diand indel rate across the MTBC genomes
Genomic regions in close proximity generally have similar substitu-tion rates. Meanwhile, recombinasubstitu-tion events may divide neighboring genomic regions into different“linkage blocks”, between which nucleo-tide substitution rates can differ significantly. However, if recombina-tion events occur frequently at very small intervals (e.g., tens of base
pairs), the genome would become“homogenized”, and the linkage
blocks would be too small to be meaningful for large-scale studies. For the MTBC genomes, this seems to be the case according to Namouchi et al.'s recent report (Namouchi et al., 2012). By analyzing SNPs in the MTBC genomes, Namouchi and colleagues inferred that the average
size of recombination DNA segments in these genomes is about 50–
58 bp in length, and that recombination events occur at one-fifth the
frequency of mutations, which is far more frequent than previously
known (Namouchi et al., 2012). In other words, the MTBC genomes
are frequently“scrambled” so that larger-scale regional variations in
nucleotide substitution rates may simply reflect variations in
back-ground mutation rate and/or selection pressure. Therefore, by applying an adequate-sized widow to the genomic sequences of MTBC, we can obtain dS, dN, and difor comparison and for inferences of selection
pressure on the genes in the window. To this end, wefirst aligned
eleven MTBC genomes (Table 1) against the M. canettii genome by
Fig. 1. The genomic landscape of difor eleven Mycobacterial genomes as compared with M. canetti.
Table 2
Spearman's correlation test for diand indel rate between M. tuberculosis H37Rv and
other MTBC strains.
Strain di Indel rate
rho p-Value rho p-Value T. KZN1435 0.978 b2.20E−16 0.843 b2.20E−16 T. F11 0.977 b2.20E−16 0.907 b2.20E−16 T. CDC1551 0.943 b2.20E−16 0.691 b2.20E−16 T. CCDC5079 0.935 b2.20E−16 0.705 b2.20E−16 T. CCDC5180 0.959 b2.20E−16 0.719 b2.20E−16 T. H37Ra 0.994 b2.20E−16 0.929 b2.20E−16 africanum 0.973 b2.20E−16 0.478 3.26E−06 Bovis AF2122 0.973 b2.20E−16 0.862 b2.20E−16 BCG Pasteur 0.968 b2.20E−16 0.871 b2.20E−16 BCG Tokyo 0.977 b2.20E−16 0.842 b2.20E−16
using MAUVE (Darling et al., 2004). We selected M. canettii because it is the closest outgroup species to the major pathogenic strains of M. tuberculosis, M. bovis, and M. africanum (the M. marinum genome is applied in a latter analysis). We then used a non-overlapping slid-ing window to examine the variations in evolutionary rates between M. canettii and each of the eleven MTBC genomes. Of note, a window
must be large enough to include sufficient intergenic sequences for
the calculation of di. We tested several different sizes (5 kb, 10 kb,
25 kb, 50 kb, 75 kb, and 100 kb), and determined 50 kb to be the most suitable window size for the purpose of this study. Since the MTBC genomes are approximately 4.4 Mb in length, there are a total of 88 windows for each genome.
Wefirst examined the variations in diacross the MTBC genomes.
As shown inFig. 1, difluctuates considerably in the MTBC genomes,
ranging from zero to ~ 0.18. Interestingly, the genomic profiles of di
are almost identical among the MTBC strains (Table 2), which reflects the shared evolutionary histories and the small genetic distances
among the analyzed genomes. Meanwhile, indel rate (defined as the
total length of the identified indels divided by the alignable sequence
length) also varies dramatically across the MTBC genomes (Fig. 2).
Nevertheless, the profiles of indel rates are less similar among the
strains as compared with di(Table 2). There are several potential
rea-sons for the decreased similarity among the profiles of indel rate. One explanation is potential alignment errors, which may have led to false identification of indels. Of note, we actually trimmed 50 bp from both ends of each alignment block to reduce this possible source of error. Another possibility is genome assembly errors, which may also arti fi-cially increase indel rates (Albers et al., 2011). The third possible rea-son is the presence of low-complexity sequences, which might occur at different frequencies in the MTBC genomes, and lead to regional variations in indel rate. Of course, we cannot exclude the possibility
that such variations in indel rate have reflected the high frequency
of genomic rearrangements in the MTBC genomes, as reported previ-ously (Namouchi et al., 2012).
Fig. 2. The genomic landscape of indel rate for eleven Mycobacterial genomes as compared with M. canetti.
Next, we are interested in comparing different evolutionary mea-surements (di, dS, and indel rate) in the same windows, and see whether
they exhibit similar genomic profiles. Since the profiles of diare almost
identical among different MTBC genomes, for simplicity, we use only the M. tuberculosis H37Rv genome for subsequent analyses. 3.2. Selection on synonymous substitutions in the genome of M. tuberculosis H37Rv
By comparing di, dS, and indel rate in the same windows, we may be
able to better understand the driving force of MTB genome evolution.
Here di in each window is used as the neutral reference because
intergenic regions should be free from selection pressure in most of the cases. Of note, annotated nocoding RNAs were removed from the intergenic sequences to avoid biased estimations of diin this analysis
(seeMaterials and methods).Fig. 3shows the genomic landscape for
dS, di and indel rate for the H37Rv strain as compared against the
M. canettii genome. Apparently, the profiles of the three evolutionary measurements are similar to one another. For instance, in windows # 19, 22, 63, 68 and 82, dS, diand indel rate all increase remarkably as
compared with the neighboring windows. These windows may harbor recombination hotsopts, which can significantly increase both mutation rate and indel rate. Meanwhile, windows with a relatively high indel rate but low dSand diare also observed (e.g. windows # 49–55). However,
the pair-wise Spearman's correlations between the three evolutionary measurements are statistically insignificant (p-value>0.05 in all three pair-wise comparisons), indicating non-mutation forces (e.g., selection) may have differentially affected different windows. Also noticeable is that dS, diand indel rate may differ by more than ten folds even for two
adjacent windows (e.g. windows # 63 and 64). This is actually consistent with the previous observation that the recombination DNA tracts are fairly short, so that physically close windows may have very different re-combination rates (and very different dS, diand indel rate). It is thus of
interest to use a smaller window size for a better resolution of the varia-tions in evolutionary rates. However, when we use a 5 kb window size for this purpose, the total length of intergenic region in each window is too short (e.g. shorter than 100 bp) for a reliable estimation for di.
Of note, in more than half of the windows, the average dSis larger
than di.This is unexpected because synonymous substitutions are
supposed to be nearly neutral or negatively selected in most of the
cases. We then examined whether significant variations in dSexist
in the same windows, which can bias the average dSupwards in the
presence of exceptionally large dSvalues. If this is the case, the
aver-age dSwill be larger than the median dS, which is actually observed in
86% of the 88 windows (Fig. 4). Nevertheless, even if we replace the
average dSwith the median dSfor each of the window, about 40% of
all the windows still have their dSlarger than di(data not shown).
One possible explanation for this observation is that diis
under-estimated in these windows because the intergenic regions include
yet unidentified regulatory elements (although we have removed
noncoding RNAs from the intergenic regions). Alternatively, the
syn-onymous substitutions in the high-dSgenes may be subject to
posi-tive selection for enhancing translational efficiency, or for regulatory reasons related to the secondary structures of mRNAs and/or the binding of noncoding RNAs.
Next, we used the Chi-square test to examine for each gene whether dSis significantly smaller (dS≪di) or larger (dS≫di) than
di in the same window. We found a total of 114 and 119 genes,
respectively, that have their dSsignificantly smaller or larger than di.
Interestingly, these two types of genes occasionally co-occur in the
same window. This observation suggests that dSvaries to a large
extent even within a relatively small distance (i.e. 50 kb).
One possible explanation for the deviation of dS from diis to
achieve better translational efficiency. Hence, we examined the
codon usage biases (measured by the codon adaptation index, CAI) for known (non-hypothetical) genes, and compared the CAIs be-tween dS~= digenes and dS≠digenes.Fig. 5shows that the median
CAI is the highest in the genes with dS≪di, followed by genes with
Fig. 4. Comparison between average dSand median dSin the case of M. tb. H37Rv Vs. M. canetti.
Fig. 5. The codon adaptation index (CAI) of three different groups of M. tuberculosis H37Rv genes (dS~= di, dS≫di, and dS≪di). Note that only known genes are included
back to the intergenic regions, all of the pair-wise differences in CAI
between different gene groups are statistically significant
(Supple-mentary Fig. 1; p-valueb0.05 by Wilcoxon Rank Sum test). For the
genes that have their dSdeviate significantly from di, the dN/dSratio
test may be inappropriate for measuring selection pressure. The genes with their dN/dSdeviating significantly from dN/diare listed in
Supplementary Table 1.
3.3. Identification of conserved and rapidly evolving genes in M. tuberculosis H37Rv
One important issue in the development of tuberculosis therapeutics is the identification of drug target genes. As stated previously, both con-served genes and rapidly evolving genes may be good candidates in this regard. Therefore, we compared the M. tuberculosis H37Rv genes with their one-to-one orthologues (as identified by orthoMCL) in M. canetti and M. marinum, respectively, for the identification of conserved and rap-idly evolving genes. A total of 2761 H37Rv-canettii-marinum orthologous
gene sets were thus analyzed. We then define conserved and rapidly
evolving genes as those that have their evolutionary rates fall outside of two standard deviations from the average (seeMaterials and methods). We conducted this analysis separately for dS, dN, and dN/dS. Since the
evo-lutionary rates may occasionally be zero (especially for dN), we used a
mixture distribution of zero and log-normal distribution for each evolu-tionary measurement for this analysis (seeMaterials and methods). It is noteworthy that, for the analysis of dN/dS, we removed those genes
whose dSdeviates significantly from diof the same window. However,
we cannot evaluate the deviation of dSfrom diin the comparison between
M. tuberculosis H37Rv and M. marinum because the alignable intergenic regions are too short and fragmented for reliable estimations of di.
Table 3shows that 40.3%, 0.1%, and 11.7% of M. tuberculosis H37Rv
genes have their dN, dS, and dN/dS, respectively, significantly lower than
the genome average when compared against the M. canettii genes. Mean-while, the corresponding numbers are 2.9%, 2.2%, and 3.0%, respectively, in the comparison against M. marinum genes. The most possible explana-tion for such differences in the percentage of conserved genes is the short divergence time between M. tuberculosis and M. canettii (Falush, 2009). This short time span is insufficient for nucleotide substitutions to accu-mulate, thus leading to low dNand dSvalues and lack of resolution in
this analysis. The low dNand dSvalues in the M. tuberculosis–M. canettii
comparison also render the estimations of dN/dSunreliable. By
com-parison, the small percentage (~ 3%) of genes conserved between M. tuberculosis and M. marinum are likely to be responsible for fun-damental biological functions of Mycobacteria. These genes thus may serve as candidate drug targets.
Meanwhile, we observed fewer rapidly evolving genes than con-served genes in terms of dN/dSin both sets of pair-wise comparisons.
These high-dN/dSgenes are subject either to positive selection or
re-laxed negative selection. In the former case, these genes may be relat-ed to the adaptations of MTBC to the human environments. While in the latter case, the genes are biologically unimportant for MTBC, and
3.4. Conclusions
In this study, we conducted a genome-wide analysis of dS, dN, dN/
ds, di, and indel rate for MTB. We discovered significant fluctuations in
all of these evolutionary rates, suggesting large variations in regional mutation rate and selection pressure across the MTB genome. We also showed that a considerable proportion of the synonymous substitu-tions are subject to natural selection, possibly for enhancing
transla-tional efficiency or for other regulatory reasons. The large number
of genes with their dSdeviating from selective neutrality thus calls
for caution in the application of the dN/dSratio as a measurement
for selection pressure on M. tuberculosis genes. Acknowledgments
This study was supported by the National Science Council, Taiwan (under contract number NSC100-3111-B-400-001) and the intramu-ral funding of National Health Research Institutes (PH101-SP-04) to F.-C. Chen. We thank Dr. Pei-Sheng Lin and Dr. Wen-Chang Wang for statistical advice. We also thank Dr. Horng-Yunn Dou for construc-tive comments and suggestions.
Appendix A. Supplementary data
Supplementary data to this article can be found online athttp://
dx.doi.org/10.1016/j.gene.2012.11.033.
References
Albers, C.A., Lunter, G., MacArthur, D.G., McVean, G., Ouwehand, W.H., Durbin, R., 2011. Dindel: accurate indel calls from short-read data. Genome Res. 21 (6), 961–973. Darling, A.C., Mau, B., Blattner, F.R., Perna, N.T., 2004. Mauve: multiple alignment
of conserved genomic sequence with rearrangements. Genome Res. 14 (7), 1394–1403.
Edgar, R.C., 2004. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113.
Falush, D., 2009. Toward the use of genomics to study microevolutionary change in bacteria. PLoS Genet. 5 (10), e1000627.
Filliol, I., et al., 2006. Global phylogeny of Mycobacterium tuberculosis based on single nucleotide polymorphism (SNP) analysis: insights into tuberculosis evolution, phylogenetic accuracy of other DNAfingerprinting systems, and recommendations for a minimal standard SNP set. J. Bacteriol. 188 (2), 759–772.
Gagneux, S., et al., 2006. Variable host-pathogen compatibility in Mycobacterium tuberculosis. Proc. Natl. Acad. Sci. U. S. A. 103 (8), 2869–2873.
Gutacker, M.M., et al., 2006. Single-nucleotide polymorphism-based population genetic analysis of Mycobacterium tuberculosis strains from 4 geographic sites. J. Infect. Dis. 193 (1), 121–128.
Gutierrez, M.C., et al., 2005. Ancient origin and gene mosaicism of the progenitor of Mycobacterium tuberculosis. PLoS Pathog. 1 (1), e5.
Kryazhimskiy, S., Plotkin, J.B., 2008. The population genetics of dN/dS. PLoS Genet. 4 (12), e1000304.
Lew, J.M., Kapopoulou, A., Jones, L.M., Cole, S.T., 2011. TubercuList—10 years after. Tuberculosis (Edinb.) 91 (1), 1–7.
Li, L., Stoeckert Jr., C.J., Roos, D.S., 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13 (9), 2178–2189.
Livny, J., Teonadi, H., Livny, M., Waldor, M.K., 2008. High-throughput, kingdom-wide prediction and annotation of bacterial non-coding RNAs. PLoS One 3, e3197.
Mokrousov, I., Narvskaya, O., Limeschenko, E., Vyazovaya, A., Otten, T., Vyshnevskiy, B., 2004. Analysis of the allelic diversity of the mycobacterial interspersed repetitive units in Mycobacterium tuberculosis strains of the Beijing family: practical implica-tions and evolutionary consideraimplica-tions. J. Clin. Microbiol. 42 (6), 2438–2444. Namouchi, A., Didelot, X., Schock, U., Gicquel, B., Rocha, E.P., 2012. After the bottleneck:
genome-wide diversification of the Mycobacterium tuberculosis complex by muta-tion, recombinamuta-tion, and natural selection. Genome Res. 22 (4), 721–734. Schurch, A.C., van Soolingen, D., 2012. DNAfingerprinting of Mycobacterium tuberculosis:
from phage typing to whole-genome sequencing. Infect. Genet. Evol. J. Mol. Epidemiol. Evol. Genet. Infect. Dis. 12 (4), 602–609.
Sharp, P.M., Li, W.H., 1987. The codon Adaptation Index—a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15 (3), 1281–1295.
Smith, N.H., Hewinson, R.G., Kremer, K., Brosch, R., Gordon, S.V., 2009. Myths and misconceptions: the origin and evolution of Mycobacterium tuberculosis. Nat. Rev. Microbiol. 7 (7), 537–544.
Thomas, C.M., Nielsen, K.M., 2005. Mechanisms of, and barriers to, horizontal gene transfer between bacteria. Nat. Rev. Microbiol. 3 (9), 711–721.
Toll-Riera, M., Laurie, S., Alba, M.M., 2011. Lineage-specific variation in intensity of nat-ural selection in mammals. Mol. Biol. Evol. 28 (1), 383–398.
Veyrier, F., Pletzer, D., Turenne, C., Behr, M.A., 2009. Phylogenetic detection of horizon-tal gene transfer during the step-wise genesis of Mycobacterium tuberculosis. BMC Evol. Biol. 9, 196.
Veyrier, F.J., Dufort, A., Behr, M.A., 2011. The rise and fall of the Mycobacterium tuberculosis genome. Trends Microbiol. 19 (4), 156–161.
Yang, Z., 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24 (8), 1586–1591.
Zhang, Y., Zhang, H., Zhou, T., Zhong, Y., Jin, Q., 2011. Genes under positive selection in Mycobacterium tuberculosis. Comput. Biol. Chem. 35 (5), 319–322.