A Fault-tolerant Method for HLA Typing with PacBio Data

(1)

A Fault-tolerant Method for HLA Typing with PacBio Data

Speaker: Chia-Jung Chang

Advisors: Dr. Pei-Lung Chen and Prof. Kun-Mao Chao

(2)

Outline

Introduction

Simulation

Methods

Experiments

Discussion

Conclusion

(3)

Introduction

HLA genes

PacBio Sequencing Technology

HLA genotyping

(4)

Classical HLA Genes

Erlich et al., Immunity (2001)

Mackay et al., N Engl J Med (2000)

(5)

HLA Database

HLA Class I

Gene A B C E F G

Alleles 2,579 3,285 2,133 15 22 50

Proteins 1,833 2,459 1,507 6 4 16

Nulls 121 109 63 0 0 2

HLA Class II

Gene DRA DRB DQA1 DQB1 DPA1 DPB1 DMA DMB DOA DOB

Alleles 7 1,512 51 509 37 248 7 13 12 13

Proteins 2 1,118 32 337 19 205 4 7 3 5

Nulls 0 33 1 13 0 6 0 0 1 0

(6)

Regions of interest

Exons 2,3:

HLA-A, -B, -C

Exon 2

HLA-DRB1, -DQB1, -DPB1

Others

(7)

A Glimpse

(8)

Comparison of NGS Technologies

From the University of Pennsylvania and The Children’s Hospital of Philadelphia

(9)

PacBio SMRT Sequencing

Developed by Pacific Biosciences

Single Molecule Real Time sequencing

(10)

PacBio SMRT Sequencing

(11)

Time for PacBio

(12)

Read Length

(13)

PacBio - Error Rate

(14)

PacBio - Error Profile

(15)

Sequencing Protocols

(16)

Two Types of Reads

From PacBio Technical Note

(17)

Targeted Sequencing

Sequencing specific areas of interest

vs. whole genome sequencing

Benefits

Compound Mutations and Haplotype Phasing

Repeat Expansions

Full-Length Transcripts and Splice Variants

Minor Variants and Quasispecies

SNP Detection and Validation


(18)

Barcode Technology

48 pairs of 16bp barcodes attached to targets

e.g., 48 samples can be sequenced in parallel

Barcode 5' Barcode 3'

Primer Primer

(19)

HLA Genotyping

HLA matching before organ transplantation

Serological (antibody based) approaches

Resolution is not enough

DNA-based

Sanger as the gold standard

NGS

Illumina

Roche 454

Ion Torrent

PacBio

(20)

Why Not and Why PacBio?

Why not PacBio?

High error rate

Sample identification error when multiplexing

Why PacBio?

Long enough to sequence exon 2 and exon 3 of class I HLA genes at the same time, which can solve the ambiguous allele combination problem
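The ambiguity can be sketched with a toy example. The allele names and variant labels below are hypothetical placeholders, not entries from the HLA database. Two different allele pairs show the same variants per exon when exon 2 and exon 3 are read separately, but they differ once a single read spans both exons.

```python
from itertools import combinations

# Hypothetical alleles, each reduced to (exon 2 variant, exon 3 variant).
ALLELES = {
    "X*01": ("e2_a", "e3_a"),
    "X*02": ("e2_b", "e3_b"),
    "X*03": ("e2_a", "e3_b"),
    "X*04": ("e2_b", "e3_a"),
}

def unphased(pair):
    """What separate exon reads show: the variants per exon, without linkage."""
    first, second = (ALLELES[name] for name in pair)
    return (frozenset({first[0], second[0]}), frozenset({first[1], second[1]}))

def phased(pair):
    """What a read spanning both exons shows: the two whole haplotypes."""
    return frozenset(ALLELES[name] for name in pair)

def ambiguous_groups():
    """Groups of allele pairs that share one unphased genotype."""
    groups = {}
    for pair in combinations(sorted(ALLELES), 2):
        groups.setdefault(unphased(pair), []).append(pair)
    return [group for group in groups.values() if len(group) > 1]

for group in ambiguous_groups():
    print("indistinguishable without phasing:", group)
    print("distinct once phased:", len({phased(pair) for pair in group}) == len(group))
```

Here the pairs (X*01, X*02) and (X*03, X*04) are the only ambiguous group, and phasing separates them.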

(21)

Why CCS instead of CLR?

Both are used to detect variants

CLR has more reads for consensus

How to identify samples?

Align barcode

CLR might lead to more barcode calling error
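Sample identification by barcode alignment could look roughly like the sketch below. It assumes that the barcode occupies the start of each read and that a read is assigned to the uniquely closest barcode by edit distance; the barcode sequences, the `max_distance` threshold, and the function names are illustrative assumptions, not the pipeline used in the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # match or substitution
            ))
        previous = current
    return previous[-1]

def assign_sample(read_prefix, barcodes, max_distance=3):
    """Return the sample whose barcode is uniquely closest, else None."""
    scored = sorted(
        (edit_distance(read_prefix, barcode), name)
        for name, barcode in barcodes.items()
    )
    best_distance, best_name = scored[0]
    tie = len(scored) > 1 and scored[1][0] == best_distance
    if best_distance > max_distance or tie:
        return None  # too noisy or ambiguous to call
    return best_name

# Hypothetical 16 bp barcodes for two samples.
BARCODES = {
    "sample_1": "ACGTACGTACGTACGT",
    "sample_2": "TTGCATGCAAGGCCTA",
}

print(assign_sample("ACGTACGAACGTACGT", BARCODES))  # one substitution from sample_1
print(assign_sample("GGGGGGGGGGGGGGGG", BARCODES))  # far from both barcodes
```

A read with a higher error rate is more likely to land closer to the wrong barcode, which is how a read from one sample ends up as a noise read in another.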

(22)

An illustration of the problem

(23)

An illustration of the problem

(24)

Simulation

The target sequence for each allele

The samples in a multiplexing sequencing experiment

The pool of the reads in an experiment

Noise reads

(25)

The Target Sequence

• HLA database only contains CDS sequences for most of the alleles

(26)

Three HLA Loci and Their Corresponding Reference Alleles

Locus A B DRB1

reference A*01:01:01:01 B*07:02:01 DRB1*01:01:01

start 380 400 5400

length 1100 950 600

#unique alleles 2335 3075 1388

(27)

Samples in an Experiment

Type 1 Type 2 Type 3

#samples 12 24 48

#reads/allele 40 20 10

Alleles of a sample

Taiwan Minnan population

http://www.allelefrequencies.net

30% of homozygous samples
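One way to draw a sample's alleles under these settings is sketched below. It assumes that each sample is homozygous with probability 0.30 and that alleles are drawn in proportion to their population frequency; the allele names and frequencies are placeholders, not values from allelefrequencies.net, and the exact sampling scheme used in the study is not stated on the slide.

```python
import random

def draw_genotype(frequencies, homozygous_rate=0.30, rng=random):
    """Draw one sample's allele pair from a frequency table."""
    names = list(frequencies)
    weights = [frequencies[name] for name in names]
    first = rng.choices(names, weights=weights, k=1)[0]
    if rng.random() < homozygous_rate:
        return (first, first)
    # Heterozygous case: draw the second allele among the remaining ones.
    others = [name for name in names if name != first]
    other_weights = [frequencies[name] for name in others]
    second = rng.choices(others, weights=other_weights, k=1)[0]
    return tuple(sorted((first, second)))

# Placeholder frequency table (not real population data).
FREQUENCIES = {"allele_1": 0.4, "allele_2": 0.3, "allele_3": 0.2, "allele_4": 0.1}

rng = random.Random(0)
samples = [draw_genotype(FREQUENCIES, rng=rng) for _ in range(12)]
print(samples)
```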

(28)

The Pool of Reads

Produced by PBSIM

Ono, Y., Asai, K., Hamada, M.: PBSIM: PacBio reads simulator–toward accurate genome assembly. Bioinformatics 29(1) (January 2013) 119–121

CCS reads

length-mean=450

length-sd=170

accuracy-mean=0.98

accuracy-sd=0.02
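To give a feel for what these four parameters imply, the sketch below draws a length and an accuracy for each read from normal distributions with the listed means and standard deviations. This is only an illustration: it does not reproduce PBSIM's internal models, and the clipping bounds (`min_length`, accuracy limited to [0, 1]) are assumptions.

```python
import random

def draw_read_profile(rng, length_mean=450, length_sd=170,
                      accuracy_mean=0.98, accuracy_sd=0.02, min_length=50):
    """Draw (length, accuracy) for one simulated CCS read."""
    length = max(min_length, round(rng.gauss(length_mean, length_sd)))
    accuracy = min(1.0, max(0.0, rng.gauss(accuracy_mean, accuracy_sd)))
    return length, accuracy

rng = random.Random(1)
profiles = [draw_read_profile(rng) for _ in range(5000)]
mean_length = sum(length for length, _ in profiles) / len(profiles)
short_reads = sum(1 for length, _ in profiles if length < 600) / len(profiles)
print(f"mean length: {mean_length:.0f} bp")
print(f"fraction of reads shorter than 600 bp: {short_reads:.2f}")
```

With a mean of 450 bp, many reads under this toy model are shorter than the 600 to 1100 bp target regions, so a single read often covers only part of a target.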

(29)

Simulation of Correct Reads and Noise Reads

(30)

Pre-processing

(31)

Bayes’ Theorem (BayesTyping0)

Denote the reads as r₁... r_n and a pair of alleles as a_i, a_j.
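The equations on these slides did not survive extraction, so the following is a generic sketch of Bayesian genotype calling in this notation, not necessarily BayesTyping0 itself. It assumes a uniform prior over allele pairs, independent reads, a read drawn from either allele of the pair with probability 1/2, and a substitution-only error model on pre-aligned sequences of equal length. The sequences and the `error_rate` value are illustrative.

```python
import math
from itertools import combinations_with_replacement

def log_read_given_allele(read, allele, error_rate=0.02):
    """log P(r | a) under an independent, substitution-only error model."""
    log_match = math.log(1 - error_rate)
    log_mismatch = math.log(error_rate / 3)
    return sum(log_match if r == a else log_mismatch for r, a in zip(read, allele))

def log_read_given_pair(read, allele_i, allele_j, error_rate=0.02):
    """log P(r | a_i, a_j): the read comes from either allele with probability 1/2."""
    li = log_read_given_allele(read, allele_i, error_rate)
    lj = log_read_given_allele(read, allele_j, error_rate)
    high = max(li, lj)  # factor out the larger term for numerical stability
    return high + math.log(0.5 * math.exp(li - high) + 0.5 * math.exp(lj - high))

def posterior_over_pairs(reads, alleles, error_rate=0.02):
    """P(a_i, a_j | r_1..r_n) for every unordered pair, with a uniform prior."""
    scores = {}
    for name_i, name_j in combinations_with_replacement(sorted(alleles), 2):
        scores[(name_i, name_j)] = sum(
            log_read_given_pair(read, alleles[name_i], alleles[name_j], error_rate)
            for read in reads
        )
    high = max(scores.values())
    weights = {pair: math.exp(score - high) for pair, score in scores.items()}
    total = sum(weights.values())
    return {pair: weight / total for pair, weight in weights.items()}

# Toy alleles and reads (hypothetical sequences).
ALLELES = {"a1": "ACGTACGT", "a2": "ACGTTCGT", "a3": "TCGTACGA"}
READS = ["ACGTACGT", "ACGTTCGT", "ACGTACGT", "ACGTTCGT", "ACGAACGT"]

posterior = posterior_over_pairs(READS, ALLELES)
print(max(posterior, key=posterior.get))
```

Working in log space keeps the product over reads from underflowing when there are many reads.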

(32)

Bayes’ Theorem (cont’d)

(33)

Bayes’ Theorem (cont’d)

(34)

To Tolerate Noise Reads (BayesTyping1)

Assume there are m noise reads
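The slide does not preserve how the m noise reads enter the model. One plausible reading, shown here purely as an assumption, is a trimmed likelihood: for each candidate allele pair, the m reads that fit it worst are treated as noise and left out of the score. The per-read log-likelihood values below are made-up numbers for illustration.

```python
def trimmed_log_likelihood(read_log_likelihoods, m):
    """Sum per-read log-likelihoods after discarding the m worst-fitting reads."""
    if m < 0:
        raise ValueError("m must be non-negative")
    ranked = sorted(read_log_likelihoods, reverse=True)
    kept = ranked[: max(len(ranked) - m, 0)]
    return sum(kept)

def best_pair(per_pair_log_likelihoods, m):
    """Pick the allele pair with the highest trimmed log-likelihood."""
    return max(
        per_pair_log_likelihoods,
        key=lambda pair: trimmed_log_likelihood(per_pair_log_likelihoods[pair], m),
    )

# Hypothetical per-read log-likelihoods for two candidate pairs.
# The last two reads are noise reads that happen to fit the wrong pair.
SCORES = {
    ("a1", "a2"): [-1.0, -1.2, -0.9, -1.1, -30.0, -28.0],
    ("a3", "a4"): [-9.0, -9.5, -8.7, -9.2, -2.0, -2.1],
}

print(best_pair(SCORES, m=0))  # noise reads dominate the score
print(best_pair(SCORES, m=2))  # two worst reads per pair are ignored
```

In this toy case the call flips from (a3, a4) to (a1, a2) once two reads are allowed to be noise.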

(35)

Experiments

For Type 1 experiments (40 reads/allele), when typing HLA-A, NGSengine correctly predicted only 274 pairs of alleles (22.83%).

In contrast, BayesTyping0 correctly predicted 1193 pairs of alleles (99.42%).

Type 1 Type 2 Type 3

#samples 12 24 48

#reads/allele 40 20 10

(36)

Experiments without noise reads

A B DRB1

Type 1 99.92% 99.92% 100%

Type 2 99.50% 99.21% 100%

Type 3 97.63% 96.87% 99.98%

(37)

HLA-A

(38)

HLA-B

(39)

HLA-DRB1

(40)

Type 2 HLA with Different m

(41)

Noise Reads from Pools Containing Different Numbers of Samples

(42)

Homozygous and Heterozygous Samples

• Fisher’s exact test
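Fisher's exact test on a 2×2 table (for instance, homozygous vs. heterozygous samples against correct vs. incorrect calls) can be computed from the hypergeometric distribution. The sketch below is a small standard-library implementation of the two-sided test; the counts in the example are a textbook table, not results from this study.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def table_probability(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = table_probability(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p for p in (table_probability(x) for x in range(low, high + 1))
        if p <= observed * (1 + 1e-9)
    )

# Textbook example table [[3, 1], [1, 3]].
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))
```

A large p-value would indicate no detectable difference in typing accuracy between the two groups.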

(43)

Conclusion

BayesTyping1 can tolerate, to some degree, both sequencing errors, which are introduced by the PacBio sequencing technology, and noise reads, which are introduced by false barcode identifications.

It is better to multiplex 12 or 24 samples instead of 48 samples to maintain high accuracy.

(44)

A Fault-tolerant Method for HLA Typing with PacBio Data