A Fault-tolerant Method for HLA Typing with PacBio Data
Speaker: Chia-Jung Chang
Advisors: Dr. Pei-Lung Chen and Prof. Kun-Mao Chao
Outline
Introduction
Simulation
Methods
Experiments
Discussion
Conclusion
Introduction
HLA genes
PacBio Sequencing Technology
HLA genotyping
Classical HLA Genes
Erlich et al., Immunity (2001)
Mackay et al., N Engl J Med (2000)
HLA Database
HLA Class I
Gene A B C E F G
Alleles 2,579 3,285 2,133 15 22 50
Proteins 1,833 2,459 1,507 6 4 16
Nulls 121 109 63 0 0 2
HLA Class II
Gene DRA DRB DQA1 DQB1 DPA1 DPB1 DMA DMB DOA DOB
Alleles 7 1,512 51 509 37 248 7 13 12 13
Proteins 2 1,118 32 337 19 205 4 7 3 5
Nulls 0 33 1 13 0 6 0 0 1 0
Regions of interest
Exons 2,3:
HLA-A, -B, -C
Exon 2
HLA-DRB1, -DQB1, -DPB1
Others
A Glimpse
Comparison of NGS Technologies
From the University of Pennsylvania and The Children’s Hospital of Philadelphia
PacBio SMRT Sequencing
Developed by Pacific Biosciences
Single Molecule Real Time sequencing
Time for PacBio
Read Length
PacBio - Error Rate
PacBio - Error Profile
Sequencing Protocols
Two Types of Reads
From PacBio Technical Note
Targeted Sequencing
Sequencing specific areas of interest
vs. whole-genome sequencing
Benefits
Compound Mutations and Haplotype Phasing
Repeat Expansions
Full-Length Transcripts and Splice Variants
Minor Variants and Quasispecies
SNP Detection and Validation
Barcode Technology
48 pairs of 16bp barcodes attached to targets
e.g., 48 samples can be sequenced in parallel
Barcode 5' Barcode 3'
Primer Primer
HLA Genotyping
HLA matching before organ transplantation
Serological (antibody based) approaches
Resolution is not enough
DNA-based
Sanger as the gold standard
NGS
Illumina
Roche 454
Ion Torrent
PacBio
Why Not and Why PacBio?
Why not PacBio?
High error rate
Sample identification error when multiplexing
Why PacBio?
Reads are long enough to sequence exon 2 and exon 3 of class I HLA genes at the same time, which can solve the ambiguous allele combination problem
Why CCS instead of CLR?
Both are used to detect variants
CLR has more reads for consensus
How to identify samples?
Align barcode
CLR might lead to more barcode calling errors
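The barcode alignment step above can be sketched as follows. This is a minimal illustration only, not the barcode caller used in the talk: the function names, the edit-distance threshold `max_dist`, and the rule of rejecting ties are all assumptions, and the read prefix is compared at a fixed length for simplicity.

```python
from typing import Dict, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def assign_sample(read: str, barcodes: Dict[str, str],
                  max_dist: int = 2) -> Optional[str]:
    """Return the sample whose barcode is closest to the read prefix.

    Returns None when the best match is too distant or tied, so that
    doubtful reads are discarded instead of becoming noise reads.
    `max_dist` is a hypothetical threshold chosen for illustration.
    """
    k = len(next(iter(barcodes.values())))  # assumes equal-length barcodes
    prefix = read[:k]
    scored = sorted((edit_distance(prefix, bc), name)
                    for name, bc in barcodes.items())
    best_dist, best_name = scored[0]
    if best_dist > max_dist:
        return None
    if len(scored) > 1 and scored[1][0] == best_dist:
        return None
    return best_name
```

A read whose 16 bp prefix differs from one barcode by a single substitution is assigned to that sample; a read whose prefix matches no barcode within the threshold is left unassigned. Reads assigned to the wrong sample are the noise reads that the typing method later has to tolerate.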
An illustration of the problem
Simulation
The target sequence for each allele
The samples in a multiplexing sequencing experiment
The pool of the reads in an experiment
Noise reads
The Target Sequence
• The HLA database contains only CDS sequences for most of the alleles
Three HLA Loci and Their Corresponding Reference Alleles
A B DRB1
reference A*01:01:01:01 B*07:02:01 DRB1*01:01:01
start 380 400 5400
length 1100 950 600
#unique alleles 2335 3075 1388
Samples in an Experiment
Type 1 Type 2 Type 3
#samples 12 24 48
#reads/allele 40 20 10
Alleles of a sample
Taiwan Minnan population
http://www.allelefrequencies.net
30% of homozygous samples
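Drawing the alleles of one simulated sample could look like the sketch below. It assumes alleles are drawn in proportion to population frequencies and that a sample is forced to be homozygous with probability 0.30. The function name, the redraw rule for heterozygous samples, and the frequency table in the usage note are illustrative assumptions, not values taken from allelefrequencies.net.

```python
import random
from typing import Dict, Optional, Tuple


def draw_sample(freqs: Dict[str, float],
                homozygous_rate: float = 0.30,
                rng: Optional[random.Random] = None) -> Tuple[str, str]:
    """Draw one (allele, allele) pair for a simulated sample.

    `freqs` maps allele name to population frequency (hypothetical input).
    With probability `homozygous_rate` the sample is homozygous; otherwise
    a second, different allele is drawn from the same frequencies.
    """
    if sum(1 for w in freqs.values() if w > 0) < 2:
        raise ValueError("need at least two alleles with positive frequency")
    rng = rng or random.Random()
    alleles = list(freqs)
    weights = list(freqs.values())
    first = rng.choices(alleles, weights=weights, k=1)[0]
    if rng.random() < homozygous_rate:
        return (first, first)
    while True:  # redraw until the second allele differs from the first
        second = rng.choices(alleles, weights=weights, k=1)[0]
        if second != first:
            lo, hi = sorted((first, second))
            return (lo, hi)
```

For example, `draw_sample({"allele_1": 0.5, "allele_2": 0.3, "allele_3": 0.2}, rng=random.Random(0))` returns one pair; the allele names and frequencies here are placeholders.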
The Pool of Reads
Produced by PBSIM
Ono, Y., Asai, K., Hamada, M.: PBSIM: PacBio reads simulator–toward accurate genome assembly. Bioinformatics 29(1) (January 2013) 119–121
CCS reads
length-mean=450
length-sd=170
accuracy-mean=0.98
accuracy-sd=0.02
Simulation of Correct Reads and Noise Reads
Pre-processing
Bayes' Theorem (BayesTyping0)
Denote the reads as r1, ..., rn and a pair of alleles as ai, aj.
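The formulas on these slides did not survive in the text, so the following is only a generic formulation in the same notation, not necessarily the one presented. It assumes the reads are conditionally independent given the allele pair and that each read comes from either allele with equal probability.

```latex
\[
P(a_i, a_j \mid r_1, \dots, r_n)
  = \frac{P(a_i, a_j) \prod_{k=1}^{n} P(r_k \mid a_i, a_j)}
         {\sum_{(a_s, a_t)} P(a_s, a_t) \prod_{k=1}^{n} P(r_k \mid a_s, a_t)},
\qquad
P(r_k \mid a_i, a_j) = \tfrac{1}{2} P(r_k \mid a_i) + \tfrac{1}{2} P(r_k \mid a_j)
\]
```

Under these assumptions the predicted genotype is the pair (ai, aj) with the largest posterior; with a uniform prior this reduces to maximizing the product of per-read likelihoods.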
Bayes' Theorem (cont'd)
Bayes' Theorem (cont'd)
To Tolerate Noise Reads (BayesTyping1)
Assume there are m noise reads
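One way to make a Bayesian allele-pair score tolerate m noise reads is sketched below. The slides do not preserve the actual BayesTyping1 formula, so this is an assumed stand-in: for each candidate pair, the m reads that fit the pair worst are treated as noise and left out of the score. The input format (per-read log-likelihoods for each allele), the 50/50 mixture, the uniform prior, and all function names are assumptions.

```python
import math
from itertools import combinations_with_replacement
from typing import Dict, Iterable, List, Tuple


def pair_read_loglik(ll_i: float, ll_j: float) -> float:
    """log(0.5*exp(ll_i) + 0.5*exp(ll_j)), computed stably (finite inputs)."""
    hi = max(ll_i, ll_j)
    return hi + math.log(0.5 * math.exp(ll_i - hi) + 0.5 * math.exp(ll_j - hi))


def score_pair(read_logliks: List[Dict[str, float]],
               a_i: str, a_j: str, m: int = 0) -> float:
    """Log-likelihood of the reads under (a_i, a_j), ignoring the m worst reads.

    read_logliks[k][a] is log P(r_k | a), assumed to be precomputed elsewhere.
    With m = 0 every read counts; with m > 0 the m lowest per-read terms are
    dropped as presumed noise reads (an assumption of this sketch).
    """
    per_read = sorted(pair_read_loglik(r[a_i], r[a_j]) for r in read_logliks)
    return sum(per_read[m:])


def best_pair(read_logliks: List[Dict[str, float]],
              alleles: Iterable[str], m: int = 0) -> Tuple[str, str]:
    """Return the highest-scoring allele pair under a uniform prior."""
    candidates = combinations_with_replacement(sorted(alleles), 2)
    return max(candidates,
               key=lambda p: score_pair(read_logliks, p[0], p[1], m))
```

With m = 0 a few misassigned reads from another sample can pull the prediction toward that sample's allele, because every read must be explained by the pair. Setting m to the expected number of noise reads removes that pressure, at the cost of discarding some information when the pool is in fact clean.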
Experiments
For Type 1 experiments (40 reads/allele), when typing HLA-A, NGSengine successfully predicted only 274 pairs of alleles (22.83%).
On the other hand, BayesTyping0 successfully predicted 1193 pairs of alleles (99.42%).
Type 1 Type 2 Type 3
#samples 12 24 48
#reads/allele 40 20 10
Experiments without noise reads
A B DRB1
Type 1 99.92% 99.92% 100%
Type 2 99.50% 99.21% 100%
Type 3 97.63% 96.87% 99.98%
HLA-A
HLA-B
HLA-DRB1
Type 2 HLA with Different m
Noise Reads from Pools Containing Different Numbers of Samples
Homozygous and Heterozygous Samples
• Fisher’s exact test
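For reference, a two-sided Fisher's exact test on a 2x2 table (for instance, homozygous vs. heterozygous samples against correct vs. incorrect predictions) can be computed directly from the hypergeometric distribution. This is a minimal sketch of the standard test; the function name and the table layout are assumptions, and no counts from the experiments are reproduced here.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))  # tolerance for float ties
    return min(1.0, total)
```

Here a and b would hold the correct and incorrect counts for one group, and c and d those for the other; a small p-value suggests that typing accuracy differs between the two groups.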
Conclusion
BayesTyping1 can tolerate, to some degree, sequencing errors introduced by the PacBio sequencing technology and noise reads introduced by false barcode identifications.
It is better to multiplex 12 or 24 samples instead of 48 to maintain high accuracy.