Chapter 2: Materials and methods
2.2. Case data
Contact investigation and demographic data of participants were obtained from Kaohsiung city government department of health. The case data included the following variables: age, sex, nationality (Taiwanese or foreign), management area, marital status (single, married, divorced, or widowed), activity addresses (permanent, residential, and work addresses). The 163 MTB cultures from these cases were conducted 28-loci MIRU-VNTR genotyping at Kaohsiung Medical University Hospital System Laboratory (KMUHSL) from the collaborator professor Po-Liang Lu, and they provided us the MIRU-VNTR pattern results of isolates. Besides, the samples from these 163 cases were also sent to the team of professor Thomas Ioerger for analysis of whole genome sequencing, and we got the FASTA format file from foreign laboratory to detect SNPs
between cases. Finally, we had genetic and epidemiological information of patients in this study.
2.3.1. Selection of MIRU-VNTR loci
We adopted the set of 24-loci in our main analysis. However, there were 155 of 163 samples of cases having complete the selected 24 loci information (Figure 1). Thus, we conducted 24-loci MIRU-VNTR identifying the cluster from these 155 samples. In addition, although many studied would use the international standard 24-loci to analyze [2], there is no golden standard of selecting specific loci to classify clusters to date.
Therefore, we would conduct sensitivity analysis in 9, 10, and 28 loci sets of MIRU-VNTR to classify the genotyped cluster in this study.
First of all, combination of 9 loci was recommended by Qian Gao et.al. research in China, and it showed high discriminatory power to distinguish Beijing strains and non-Beijing strains [18]. Secondly, we adopted 10 loci which was the best typing result according to the report from Taiwan Centers for Disease Control (TCDC) in 2014 [19].
They recommended that it was better to perform with another typing method spacer oligonucleotide typing (spoligotyping) together, but we still try to understand the effect of using only these 10-loci MIRU-VNTR method. Moreover, the last one we selected all of 28-loci which were analyzed in KMUHS laboratory to classify the cluster.
2.3.2. Calculation of Hunter-Gaston discriminatory index (HGDI) from loci
The Hunter-Gaston discriminatory index (HGDI) was used to evaluate the discriminatory power of the selected loci of MIRU-VNTR in study, and its value range between 0 and 1. This index is given by the following equation [19, 20]:
HGDI = 1-∑𝑋𝚤2[ n/(n-1)]
where 𝑋𝚤 is the probability of a single isolate at random belongs to the 𝚤 th cluster group, n is the number of the isolates. The lower value of HGDI shows that the sample isolates are more difficult distinguished. In contrast, the higher value of HGDI means all strains are more distinguished well. We calculated it to understand the performance of classified cluster from our selected sets of loci. Based on the study of Sola et al.[21], their respective HGDI score was that the discriminatory powers were classified as high (HGDI>0.6), moderate (0.6≥HGDI≥0.3), or low (HGDI<0.3) in our study.
2.3.3. The process of PCR in this study
We conducted genotyping using PCR based on 24-loci MIRU-VNTR [2]. First, collaborate laboratory conducted denaturation of DNA. Reaction temperature was increased to 94 °C for 15 min to ensure that all double-stranded DNA molecules were separated into single strands for amplification. Secondly, they conducted annealing. The
temperature was lowered to 59 °C to promote primer binding to the template. Thirdly, extension, the temperature was increased to 72 °C to allow the hybridized primers to be extended. Finally, it finished after cycling the three steps for 22-25 times.
2.3.4. Definition of cluster by MIRU-VNTR
We successfully MIRU-VNTR genotyped 95.1% (155/163) of isolates in this study.
After typing the isolates by MIRU-VNTR, the cluster was defined that the digital code of MIRU-VNTR pattern of isolates from two or more strains were identical. They sharing identical patterns were categorized as the same genotyped clusters [3].
2.4.1. Bioinformatics analysis of WGS
In brief, our samples were sequenced using short read technology, HiSeq X System (Illumina). After using BWA (version 0.7.12) to map reads to the H37Rv reference genome (GenBank accession number NC_000962.2), detection of genomic variants SNPs were identified as sites with a majority of nucleotides differing from the reference among reads covering each site. Small indels (1-3bp) were identified from the .sam files generated by BWA, and larger indels (up to 20kb) were identified by drops in coverage below 10 reads, and edits to the genome sequence were generated by building a local contig of overlapping reads that spans the site. Using MUMMER (version 3.20) to align
genome sequences, SNPs were extracted according to the following criteria: coverage ≥ 10x (covered by at least 10 reads) with purity ≥ 70% (conversion to non-reference
nucleotide). To filter out potential false-positive SNPs attributable to mapping errors, we omitted the variants detected in proline-proline-glutamic acid (PPE), proline-glutamic acid_polymorphic guanine-cytosine-rich sequence (PE_PGRS) genes, and repetitive regions [22], because of sequences in these regions not unique.
2.4.2. Definition of cluster by WGS
To date there was no uniform standard for cut-off values of SNPs, and different country studies had applied various cut-offs. However, some recent studies using WGS for detecting TB transmission employed isolates with 12 SNPs difference as the maximum threshold related to recent transmission chains [23-25], and this concept was early proposed by Walker et al [13]. In our study, we employed 12 SNPs cut-offs for main definition genotyped cluster of WGS, but other thresholds of SNP were considered in sensitivity analysis.
2.5.1. Definition of epidemiological correlation
Kaohsiung city government department of health authorized to give us the de-identified TB patients data about epidemiological information on TB contact
investigation. However, epidemiological information of cases was difficult to obtain completely. Thus, we defined as two definitions as epidemiological correlation in our study. One was epidemiological linking. If any case was named by other patients before they became patients, they had epidemiological linking. The other one was geographical correlation, and it was based on the activity addresses of cases recorded in the data. We defined the threshold of distance was 5 km, thus it meant if the distance difference of two cases locations under 5 km, they had geographical correlation. However, there was no evidence to support how far was the distance of TB transmission, because it was difficult to know the all activity place of all patients. Therefore, we also conducted sensitivity analyses: the threshold of distance as 1 and 3 km as geographical correlation to understand how different distance of definition affected clusters of two genotyping methods with geographical relationship.
2.5.2. Calculating distance and mapping of activity locations from patients
We adopted Euclidean distance as our distance form, and this is the common way to understand distance calculation. It is given by the following equation:
𝑑(𝑥, 𝑦) = √∑(𝑥𝑖 − 𝑦𝑖)2
𝑛
𝑖=1
where 𝑥 and 𝑦 are the different patient locations. Distance difference were calculated by dist function in R package “stats”. Moreover, we map patient's activity locations in
order to understand their activity distribution. First, we geocoded the all activity addresses of each TB patient by Taiwan Geospatial One-Stop Portal (TGOS), and drew locations on map with R 3.4.2 using GIS package.