Statistical Issues on Genomic
Composite Biomarker Classifiers
Jen-pei Liu1,2 and Shein-Chung Chow3
1 Division of Biometry, Department of Agronomy
National Taiwan University, Taipei, Taiwan
2 Division of Biostatistics and Bioinformatics
National Health Research Institutes, Zhunan, Taiwan
3 Department of Biostatistics and Bioinformatics
Duke University School of Medicine Durham, NC 27706, U.S.A
At
2006 Joint Statistical Meeting Seattle, Washington, U.S.A.
Outlines
Introduction
Selection of Genes and Representation
Distribution
Agreement and Reproducibility
Estimation of Treatment Effects in
Targeted Clinical Trials
Introduction
Post HGP (Human Genome Project) Era
Pharmacogenetics and Pharmacogenomics
Biochip Products
Targeted Clinical Trials
Personalized Medicine
Diagnosis and Individualized Treatment
Introduction
TAILORx (JCO, 2006)
Patients
10,000 patients with early-stage breast cancer
ER+ and/or PR+, HER2/neu-negative, and not spread to
lymph nodes
Diagnostic Device
Likelihood of distant recurrence
Based on 21 prospectively selected genes in
paraffin-embedded tumor tissue
Introduction
TAILORx (JCO, 2006)
Treatment
Recurrence scores
>25: chemotherapy + hormonal therapy
<11: hormonal therapy
11 – 25: randomization
Standardized combination chemotherapy + adjuvant
hormonal therapy
Adjuvant hormonal therapy
Follow-up:
10 years, additional 20 years after initial
Introduction
Issues in GCB classifiers
Selection of genes
Functional form for the overall representation of
expression levels
Distribution of GCB classifiers and determination
of thresholds
Evaluation of agreement and reproducibility for
GCB classifiers
Selection of Genes
and Representation
Differentially expressed genes for diagnosis
of molecular targets
Current hypothesis
Ho: μTi - μCi = 0 vs. Ha: μTi - μCi ≠ 0, i=1,…,G
Statistical significance based on the hypothesis of
difference does not take into consideration the magnitude of the expression levels of differentially expressed genes or their biological significance
Fold change does not take into consideration of
Selection of Genes
and Representation
Statistical hypothesis for identification of
differentially expressed genes should take into consideration biological significance
Ho: δLi ≤ μTi - μCi ≤ δUi vs.
Ha: μTi - μCi < δLi or μTi - μCi > δUi
Selection of Genes
and Representation
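A two one-sided test
$T_{Li} = \dfrac{\bar{Y}_{Ti} - \bar{Y}_{Ci} - \delta_{Li}}{\sqrt{s_i^2\left(\frac{1}{n_{Ti}} + \frac{1}{n_{Ci}}\right)}}$ and $T_{Ui} = \dfrac{\bar{Y}_{Ti} - \bar{Y}_{Ci} - \delta_{Ui}}{\sqrt{s_i^2\left(\frac{1}{n_{Ti}} + \frac{1}{n_{Ci}}\right)}}$
Gene i is claimed to be differentially expressed at the α significance level if
$T_{Ui} > t(\alpha, n_{Ti} + n_{Ci} - 2)$ or $T_{Li} < -t(\alpha, n_{Ti} + n_{Ci} - 2)$, i = 1,...,G,
where $t(\alpha, \nu)$ is the upper α quantile of the t distribution with ν degrees of freedom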
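The two one-sided test on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that s_i^2 is the pooled two-sample variance and that the caller supplies the critical value `t_crit` (for example from a t table). The function names and the expression values are hypothetical.

```python
import math
from statistics import mean, variance


def interval_test_statistics(y_test, y_ctrl, delta_l, delta_u):
    """Return (T_L, T_U, df) for one gene on the log2 scale."""
    n_t, n_c = len(y_test), len(y_ctrl)
    df = n_t + n_c - 2
    # Assumption: s_i^2 is the pooled variance of the two groups.
    s2 = ((n_t - 1) * variance(y_test) + (n_c - 1) * variance(y_ctrl)) / df
    se = math.sqrt(s2 * (1.0 / n_t + 1.0 / n_c))
    diff = mean(y_test) - mean(y_ctrl)
    return (diff - delta_l) / se, (diff - delta_u) / se, df


def is_differentially_expressed(y_test, y_ctrl, delta_l, delta_u, t_crit):
    """t_crit is the upper alpha quantile of t with n_T + n_C - 2 df."""
    t_l, t_u, _ = interval_test_statistics(y_test, y_ctrl, delta_l, delta_u)
    return t_u > t_crit or t_l < -t_crit


# Hypothetical log2 expression values for one gene, 4 arrays per group.
test_group = [5.1, 5.3, 4.9, 5.2]
ctrl_group = [2.0, 2.2, 1.9, 2.1]
# Limits of -1 and 1 on the log2 scale correspond to a two-fold change.
# 1.943 is the upper 5% point of the t distribution with 6 df.
print(is_differentially_expressed(test_group, ctrl_group, -1.0, 1.0, 1.943))
```

Unlike a test of zero difference, a gene with a small but very precisely estimated difference inside (δLi, δUi) is not flagged.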
Selection of Genes
and Representation
The average type I error rate is controlled at
the nominal level
The power function is parabola-shaped, symmetric,
and reaches its minimum at (δLi + δUi)/2
Simulation studies were conducted to compare
with current methods
Unadjusted, Bonferroni adjustment, fold changes
Selection of Genes
and Representation
Functional Form
Differentially expressed genes
Over-expressed in T and under-expressed in C
Under-expressed in T and over-expressed in C
Representation
Difference of expression levels of differentially expressed genes between the test and control
Ratio of expression levels of differentially
Distributions
Genomic Composite Biomarker (GCB)
Classifier
The number of differentially expressed genes for
diagnosis of certain molecular targets is a random variable
The number of genes in GCB classifier is a random
variable
The expression level for each selected gene in
Distributions
Only consider the linear function
X = w1Y1 + w2Y2 + … + wgYg,
g is the number of differentially
expressed genes selected from a pool of a total of G genes
Yi is the expression level of gene i
on the log2 scale
Distributions
Suppose all weights are equal and Yi are i.i.d. with
mean μ and variance σ2 and the number of
differentially expressed genes follows a Poisson distribution with mean λ
The distribution of X can be expressed by
convolution
$F(x) = \sum_{g=0}^{\infty} p_g\, S^{g*}(x),$
where $p_g$ is the probability that the number of differentially expressed genes is g, and $S^{g*}$ is the g-fold convolution of the distribution of the $Y_i$
Distributions
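E(X) = λμ
Var(X) = λσ2 + λμ2
Asymptotic normality
Let $\bar{Y} = \frac{1}{g}\sum_{i=1}^{g} Y_i$, and $s^2 = \frac{1}{g-1}\sum_{i=1}^{g} (Y_i - \bar{Y})^2$
An estimate of variance of X is given as $\widehat{Var}(X) = m\bar{Y}^2 + ms^2$
$Z = \dfrac{X - E(X)}{\sqrt{\widehat{Var}(X)}} \to N(0,1)$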
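The two moments E(X) = λμ and Var(X) = λσ2 + λμ2 can be checked by simulation. The sketch below assumes unit weights and normally distributed Yi. The slide only requires i.i.d. Yi with mean μ and variance σ2, so normality is an extra assumption made here for convenience. The function names and parameter values are hypothetical.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw from Poisson(lam) by Knuth's multiplication method (small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_gcb(lam, mu, sigma, n_sims, seed=0):
    """Simulate X = Y_1 + ... + Y_g with g ~ Poisson(lam), Y_i ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_sims):
        g = poisson_draw(lam, rng)
        scores.append(sum(rng.gauss(mu, sigma) for _ in range(g)))
    mean_x = sum(scores) / n_sims
    var_x = sum((x - mean_x) ** 2 for x in scores) / (n_sims - 1)
    return mean_x, var_x


lam, mu, sigma = 10.0, 2.0, 1.0
mean_x, var_x = simulate_gcb(lam, mu, sigma, n_sims=20000)
print("simulated mean", mean_x, "theory", lam * mu)
print("simulated variance", var_x, "theory", lam * (sigma ** 2 + mu ** 2))
```

The simulated variance is well above λσ2 because the random number of selected genes contributes the extra term λμ2.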
Agreement and Reproducibility
Only recognition and evaluation of agreement
and reproducibility:
Dobbin et al. (Clinical Cancer Research, 2005)
Larkin et al. (Nature Methods, 2005)
Irizarry et al. (Nature Methods, 2005)
Members of the Toxicogenomic Research Consortium
(Nature Methods, 2005)
Tan, et al. (Nucleic Acids Research, 2003)
Yauk, et al. (Nucleic Acids Research, 2004)
Agreement and Reproducibility
Correlation coefficient
A measure for association
Not a measure for similarity (or agreement)
Euclidean distance
A measure for agreement
Agreement and Reproducibility
Example
Case I      Case II     Case III
X1   X2     X1   X2     X1   X2
1    1      1    2      1    4
2    2      2    4      2    8
3    3      3    6      3    12
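The three cases in the table can be checked directly. The sketch below computes the Pearson correlation and the Euclidean distance for each case. The helper names are hypothetical.

```python
import math

CASES = {
    "I": ([1, 2, 3], [1, 2, 3]),
    "II": ([1, 2, 3], [2, 4, 6]),
    "III": ([1, 2, 3], [4, 8, 12]),
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


for name, (x1, x2) in CASES.items():
    print(name, round(pearson(x1, x2), 6), round(euclidean(x1, x2), 3))
```

All three cases have a correlation of 1, but only Case I has a distance of 0. The correlation cannot tell perfect agreement apart from a proportional disagreement.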
Agreement and Reproducibility
Rejecting the hypothesis of no correlation cannot prove
agreement or reproducibility
With 5000 genes, a correlation of 0.05 is
significantly different from 0 at the 1% level
Use of concordance correlation coefficient
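The claim about 5000 genes can be verified with a short calculation. The sketch below uses the Fisher z-transformation and a normal approximation. That is an assumption made here for simplicity, and the slides do not say which test was used. The function name is hypothetical.

```python
import math


def correlation_p_value(r, n):
    """Two-sided p-value for H0: correlation = 0, via Fisher's z (approximate)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    # erfc(|z| / sqrt(2)) equals the two-sided standard normal tail area.
    return math.erfc(abs(z) / math.sqrt(2.0))


p_5000 = correlation_p_value(0.05, 5000)
p_100 = correlation_p_value(0.05, 100)
print("n = 5000:", p_5000)
print("n = 100: ", p_100)
```

With n = 5000 the p-value is far below 0.01, even though a correlation of 0.05 indicates essentially no agreement. The same correlation with n = 100 is nowhere near significant.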
Agreement and Reproducibility
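We want to evaluate agreement of expression levels of two replicates of genes,
i.e., $(Y_{1i}, Y_{2i})$, i=1,...,G, and
$\begin{pmatrix} Y_{1i} \\ Y_{2i} \end{pmatrix} \sim N\left[\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}\right]$
Concordance Correlation Coefficient (CCC)
$\rho = \dfrac{2\sigma_{12}}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2}$
The sample estimate is given as
$\hat{\rho} = \dfrac{2s_{12}}{s_1^2 + s_2^2 + (\bar{y}_1 - \bar{y}_2)^2}$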
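A sample CCC is easy to compute from its definition. The sketch below assumes the sample variances and covariance use the divisor G - 1. This is a choice made here, and other versions of the estimator use the divisor G. The function name is hypothetical, and the data are the three cases from the earlier example slide.

```python
def sample_ccc(y1, y2):
    """Sample concordance correlation coefficient (divisor G - 1 assumed)."""
    g = len(y1)
    m1, m2 = sum(y1) / g, sum(y2) / g
    s11 = sum((a - m1) ** 2 for a in y1) / (g - 1)
    s22 = sum((b - m2) ** 2 for b in y2) / (g - 1)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / (g - 1)
    return 2.0 * s12 / (s11 + s22 + (m1 - m2) ** 2)


case_1 = sample_ccc([1, 2, 3], [1, 2, 3])
case_2 = sample_ccc([1, 2, 3], [2, 4, 6])
case_3 = sample_ccc([1, 2, 3], [4, 8, 12])
print(case_1, case_2, case_3)
```

The three cases share a Pearson correlation of 1, but the CCC falls as the second replicate drifts from the first in both location and scale.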
Agreement and Reproducibility
Hypothesis for Agreement
HO: ρ ≤ ρa vs. HA: ρ > ρa , where ρa is the minimal required level of agreement
Reject HO and conclude agreement if the
lower limit of the 100(1-α)% asymptotic C.I. is
greater than ρa
Use the concept of generalized pivotal
quantities (GPQ)
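A Monte Carlo lower limit for the CCC based on generalized pivotal quantities can be sketched as follows. The construction is one common Cholesky-type GPQ for a bivariate normal sample, written in terms of corrected sums of squares. It is an assumption that it matches the authors' definitions in every detail. The function names and the data are hypothetical.

```python
import math
import random


def ccc_gpq_draws(y1, y2, n_draws=5000, seed=0):
    """Sorted Monte Carlo draws of a GPQ for the CCC (needs at least 3 pairs)."""
    rng = random.Random(seed)
    g = len(y1)
    m1, m2 = sum(y1) / g, sum(y2) / g
    # Corrected sums of squares and cross-products.
    a11 = sum((u - m1) ** 2 for u in y1)
    a22 = sum((v - m2) ** 2 for v in y2)
    a12 = sum((u - m1) * (v - m2) for u, v in zip(y1, y2))
    a11_2 = max(a11 - a12 ** 2 / a22, 0.0)  # residual sum of squares of 1 given 2
    draws = []
    for _ in range(n_draws):
        u2 = rng.gammavariate((g - 1) / 2.0, 2.0)   # chi-square with g-1 df
        u12 = rng.gammavariate((g - 2) / 2.0, 2.0)  # chi-square with g-2 df
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        r_s22 = a22 / u2
        r_s11_2 = a11_2 / u12
        r_s12 = r_s22 * (a12 / a22 - z1 * math.sqrt(r_s11_2 / a22))
        r_s11 = r_s11_2 + r_s12 ** 2 / r_s22
        var_diff = max(r_s11 + r_s22 - 2.0 * r_s12, 0.0)
        r_mu = (m1 - m2) - z2 * math.sqrt(var_diff / g)
        draws.append(2.0 * r_s12 / (r_s11 + r_s22 + r_mu ** 2))
    return sorted(draws)


def ccc_lower_limit(y1, y2, alpha=0.05, **kwargs):
    """Lower 100(1 - alpha)% limit: the alpha quantile of the GPQ draws."""
    draws = ccc_gpq_draws(y1, y2, **kwargs)
    return draws[int(alpha * len(draws))]


rep1 = [float(i) for i in range(1, 21)]
rep2 = [v + 0.1 * (-1) ** i for i, v in enumerate(rep1)]  # close replicate
shifted = [v + 5.0 for v in rep2]                         # systematic shift
lower_close = ccc_lower_limit(rep1, rep2)
lower_shifted = ccc_lower_limit(rep1, shifted)
print(lower_close, lower_shifted)
```

Agreement would be concluded when the lower limit exceeds the prespecified ρa.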
Agreement and Reproducibility
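$U_2 \sim \chi^2(G-1)$, $U_{1.2} \sim \chi^2(G-2)$, $Z_1 \sim N(0,1)$, and $Z_2 \sim N(0,1)$; $U_2$, $U_{1.2}$, $Z_1$, $Z_2$ are independent
With sample variances $s_1^2$, $s_2^2$ and sample covariance $s_{12}$, define
$R_{\sigma_2^2} = \dfrac{(G-1)s_2^2}{U_2},$
$R_{\sigma_{12}} = R_{\sigma_2^2}\left[\dfrac{s_{12}}{s_2^2} - Z_1\sqrt{\dfrac{s_{1.2}^2}{U_{1.2}\, s_2^2}}\right],$
$R_{\sigma_1^2} = \dfrac{(G-1)s_{1.2}^2}{U_{1.2}} + \dfrac{R_{\sigma_{12}}^2}{R_{\sigma_2^2}},$
Agreement and Reproducibility
$R_{\mu_1-\mu_2} = (\bar{y}_1 - \bar{y}_2) - Z_2\sqrt{\dfrac{R_{\sigma_1^2}}{G} + \dfrac{R_{\sigma_2^2}}{G} - \dfrac{2R_{\sigma_{12}}}{G}},$
where $s_{1.2}^2 = s_1^2 - \dfrac{s_{12}^2}{s_2^2}$.
A GPQ for CCC is given as
$R_{\rho} = \dfrac{2R_{\sigma_{12}}}{R_{\sigma_1^2} + R_{\sigma_2^2} + R_{\mu_1-\mu_2}^2}$
A 100(1-α)% C.I. can be obtained by Monte Carlo method
Agreement and Reproducibility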
Reproducibility
Dobbin et al. (Clinical Cancer Research, 2005)
Reproducibility of 4 different labs based on
Affymetrix Human Genome U133A array
Each of 12 tumor tissues was divided into 6
blocks
Agreement and Reproducibility
Site 1   AABBCCDDEEFFGHIJKL
Site 2   AABBCDEFGGHHIIJJKL
Site 3   ABCCDDFGGHHIJKKLL
Site 4   ABCDEEFFGHIIJJKLL
The underlined sample was a failed sample
Agreement and Reproducibility
Source of variation
Between laboratory (αi)
Between samples (βj)
Imbalance
Model: Two-way classification random-effects
model without interaction
Yijk = μ + αi + βj + eijk,
i = 1,…,a; j = 1,…,b; k = 1,…,nij;
where μ is an unknown constant, αi ~ N(0, σα2),
βj ~ N(0, σβ2), and eijk ~ N(0, σe2)
Agreement and Reproducibility
Parameters of Reproducibility
$\rho_\alpha = \dfrac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\beta^2 + \sigma_e^2}$
Hypothesis of Reproducibility
HO: ρα ≤ ρα0 vs. HA: ρα > ρα0 ,
where ρα0 is the minimal required level for
reproducibility
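For intuition, ρα can be estimated in the balanced case from the usual ANOVA mean squares. The sketch below is a method-of-moments illustration for balanced data only. It is not the GPQ procedure the slides propose for imbalanced data. The function name and the toy data are hypothetical.

```python
def reproducibility_balanced(y):
    """Estimate rho_alpha from balanced data y[i][j] = list of n replicates
    for laboratory i and sample j (two-way random model, no interaction)."""
    a, b, n = len(y), len(y[0]), len(y[0][0])
    values = [v for lab in y for cell in lab for v in cell]
    grand = sum(values) / (a * b * n)
    lab_means = [sum(v for cell in lab for v in cell) / (b * n) for lab in y]
    sample_means = [
        sum(v for i in range(a) for v in y[i][j]) / (a * n) for j in range(b)
    ]
    ss_total = sum((v - grand) ** 2 for v in values)
    ss_lab = b * n * sum((m - grand) ** 2 for m in lab_means)
    ss_sample = a * n * sum((m - grand) ** 2 for m in sample_means)
    ss_err = max(ss_total - ss_lab - ss_sample, 0.0)
    ms_lab = ss_lab / (a - 1)
    ms_sample = ss_sample / (b - 1)
    ms_err = ss_err / (a * b * n - a - b + 1)
    # Method-of-moments estimates, truncated at zero.
    var_lab = max((ms_lab - ms_err) / (b * n), 0.0)
    var_sample = max((ms_sample - ms_err) / (a * n), 0.0)
    return var_lab / (var_lab + var_sample + ms_err)


# Toy data without error: lab effects 0 and 2, sample effects 0, 1 and 2.
toy = [[[alpha + beta] * 2 for beta in (0.0, 1.0, 2.0)] for alpha in (0.0, 2.0)]
rho_hat = reproducibility_balanced(toy)
print(rho_hat)
```

In the toy data the laboratory variance estimate is 2 and the sample variance estimate is 1 with no error, so the estimate equals 2/3.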
Agreement and Reproducibility
Reject the null hypothesis if the lower
limit of the 100(1-α)% C.I. for ρα is greater
than ρα0
For the imbalanced case and small sample
sizes, apply the method of generalized
pivotal quantity (GPQ) to obtain the
exact C.I.
Agreement and Reproducibility
Application of GPQ to obtain the 100(1-α)% C.I. for $\rho_\alpha$
The generalized pivotal quantity for $\rho_\alpha$ is given as
$T(Y, y, \xi) = \dfrac{A}{A+B+C},$
where
$A = \max\left[0, \dfrac{1}{b}\left(\dfrac{ss(\alpha)}{U_A} - \lambda_{\max}(D_0)\dfrac{ss_2(e)}{U_E}\right)\right],$
$B = \max\left[0, \dfrac{1}{a}\left(\dfrac{ss(\beta)}{U_B} - \lambda_{\max}(D_0)\dfrac{ss_2(e)}{U_E}\right)\right],$ and
$C = \dfrac{ss_2(e)}{U_E}$
Agreement and Reproducibility
$U_A \sim \chi^2(a-1)$, $U_B \sim \chi^2(b-1)$, $U_E \sim \chi^2(n-ab-a-b+2)$;
$P$ is an ab×ab orthogonal matrix with the first row being $\mathbf{1}_{ab}'/\sqrt{ab}$ such that
$\dfrac{1}{b}P(I_a \otimes J_b)P' = diag(1, I_{a-1}, 0, 0)$ and $\dfrac{1}{a}P(J_a \otimes I_b)P' = diag(1, 0, I_{b-1}, 0)$;
$D_0$ is the (a+b-2)×(a+b-2) submatrix of $P\,diag\{1/n_{ij}\}\,P'$ corresponding
to $z_\alpha$ and $z_\beta$, the components for α and β of the orthogonal transformation of the vector of cell means
Agreement and Reproducibility
$ss_0(e) = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n_{ij}} (Y_{ijk} - \bar{Y}_{ij.})^2 = y'Hy$, $H = I_n - diag\{J_{n_{ij}}/n_{ij}\}$
There exists an orthogonal matrix $G = (G_1 : G_2 : G_3)$ such that
$H = G\,diag(I_{n-ab}, 0)\,G' = (G_1 : G_2 : G_3)\,diag(I_{a+b-2}, I_{n-ab-a-b+2}, 0)\,(G_1 : G_2 : G_3)' = G_1G_1' + G_2G_2'$;
$ss_0(e) = ss_1(e) + ss_2(e) = y'G_1G_1'y + y'G_2G_2'y$;
$w = (w_\alpha', w_\beta')' = (z_\alpha', z_\beta')' + [\lambda_{\max}(D_0) I_{a+b-2} - D_0]^{1/2} G_1'y$;
$ss(\alpha) = w_\alpha'w_\alpha$ and $ss(\beta) = w_\beta'w_\beta$
Treatment Effects for
Targeted Clinical Trial
Enrichment design for targeted clinical
trials
Patients with positive diagnosis for the
molecular targets were randomized into the test drug or control group
The diagnostic device for the molecular
targets is not a perfect device
Patients with positive diagnosis may not
Treatment Effects for
Targeted Clinical Trial
Enrichment design for targeted clinical
trials
The objective of targeted clinical trials is to
estimate the treatment effect of the test drug
Due to the FPR, the observed mean
difference between the test and control groups under-estimates the true treatment effect
Treatment Effects for
Targeted Clinical Trial
Enrichment design for targeted clinical
trials
The expected value of mean difference
$E(\bar{y}_T - \bar{y}_C) = \mathrm{PPV}\,(\mu_{+T} - \mu_{+C}) + (1 - \mathrm{PPV})(\mu_{-T} - \mu_{-C})$
Under-estimation of the true treatment effect
becomes more severe as the prevalence rate of molecular targets decreases
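The link between prevalence and under-estimation can be illustrated numerically. The sketch below assumes that the observed mean difference is a PPV-weighted mixture of the effects in patients with and without the molecular target. The PPV is computed from sensitivity, specificity and prevalence by Bayes' rule. The function names and all numbers are hypothetical.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV of the diagnostic device by Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def expected_observed_difference(ppv, effect_pos, effect_neg=0.0):
    """Expected test-minus-control mean difference among diagnosed-positive
    patients: a mixture of the effects with and without the target."""
    return ppv * effect_pos + (1.0 - ppv) * effect_neg


true_effect = 10.0  # hypothetical effect in patients with the target
for prevalence in (0.5, 0.2, 0.1):
    ppv = positive_predictive_value(0.9, 0.9, prevalence)
    observed = expected_observed_difference(ppv, true_effect)
    print(prevalence, round(ppv, 3), round(observed, 2))
```

With the device accuracy held fixed, a lower prevalence gives a lower PPV, and the expected observed difference moves further from the true effect of 10.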
Treatment Effects under
Targeted Clinical Trials
The true status of molecular targets for each
patient is not available
The positive predictive value can be
estimated from the clinical validation trials for the medical diagnostic device
Under the normality assumption, the EM algorithm
can be applied to estimate the true treatment effect of the test drug for the patients with
molecular targets by assuming the PPV is a constant
Treatment Effects under
Targeted Clinical Trials
Let $Y_{ij}$ be the observation of patient j
receiving treatment i; i=T,C; j=1,...,$n_i$
Let $\pi_{ij}$
be the indicator for molecular target
$\pi_{ij}$ = 1(0) if patient j in treatment group i
has (does not have) the molecular target
The $\pi_{ij}$ are i.i.d.
Treatment Effects under
Targeted Clinical Trials
Under the normality assumption, given a value $\pi_{ij}$, for treatment i, $y_{ij}$ has a normal distribution with density
$f(y_{ij} \mid \pi_{ij}) \propto \dfrac{1}{\sigma}\exp\left\{-\dfrac{1}{2\sigma^2}\left[\pi_{ij}(y_{ij} - \mu_{+i})^2 + (1 - \pi_{ij})(y_{ij} - \mu_{-i})^2\right]\right\}$ (a)
The conditional probability given $y_{ij}$
$P(\pi_{ij} = 1 \mid y_{ij}) = 1\big/\left\{1 + \exp\left[(\mu_{+i} - \mu_{-i})(\mu_{+i} + \mu_{-i} - 2y_{ij})/2\sigma^2\right]\right\}$ (b)
The log-likelihood
$l = -n_i\log\sigma - \sum_{j=1}^{n_i}\left[\pi_{ij}(y_{ij} - \mu_{+i})^2 + (1 - \pi_{ij})(y_{ij} - \mu_{-i})^2\right]/2\sigma^2$ (c)
Treatment Effects under
Targeted Clinical Trials
Procedure
(1) E step: substitute "current" estimates of $\mu_{+T}$, $\mu_{-T}$,
$\mu_{+C}$, $\mu_{-C}$, $\sigma^2$ into (b) to obtain provisional
values for the expectation of $\pi_{ij}$
(2) M Step: Obtain the MLE of $\mu_{+T}$, $\mu_{-T}$,
$\mu_{+C}$, $\mu_{-C}$, $\sigma^2$ after replacing $\pi_{ij}$ in (c) by its provisional value
(3) Repeat (1) and (2) until the estimates
converge
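The E and M steps above can be sketched as a small EM routine. This is an illustration under stated assumptions, not the authors' code. It assumes a common variance across arms and a known, constant PPV, which enters the E step as prior odds. With PPV = 0.5 the E step reduces to formula (b). It also assumes starting values that place the target-positive mean above the target-negative mean. The function names and the simulated data are hypothetical.

```python
import math
import random


def e_step(y, mu_pos, mu_neg, sigma2, ppv):
    """Provisional expectations of pi_ij given y_ij (prior odds from the PPV)."""
    prior_odds = (1.0 - ppv) / ppv
    weights = []
    for v in y:
        expo = (mu_pos - mu_neg) * (mu_pos + mu_neg - 2.0 * v) / (2.0 * sigma2)
        weights.append(1.0 / (1.0 + prior_odds * math.exp(min(expo, 700.0))))
    return weights


def em_targeted(y_by_arm, ppv, max_iter=500, tol=1e-8):
    """y_by_arm maps 'T' and 'C' to response lists.
    Returns ({arm: (mu_pos, mu_neg)}, sigma2)."""
    n = sum(len(y) for y in y_by_arm.values())
    est, ss = {}, 0.0
    for arm, y in y_by_arm.items():
        m = sum(y) / len(y)
        ss_arm = sum((v - m) ** 2 for v in y)
        sd = math.sqrt(ss_arm / len(y))
        est[arm] = (m + sd, m - sd)  # assumed starting values
        ss += ss_arm
    sigma2 = ss / n
    for _ in range(max_iter):
        new_est, ss = {}, 0.0
        for arm, y in y_by_arm.items():
            w = e_step(y, est[arm][0], est[arm][1], sigma2, ppv)
            sw = max(sum(w), 1e-12)
            snw = max(len(y) - sum(w), 1e-12)
            # M step: weighted means for patients with and without the target.
            mu_pos = sum(wi * v for wi, v in zip(w, y)) / sw
            mu_neg = sum((1.0 - wi) * v for wi, v in zip(w, y)) / snw
            ss += sum(wi * (v - mu_pos) ** 2 + (1.0 - wi) * (v - mu_neg) ** 2
                      for wi, v in zip(w, y))
            new_est[arm] = (mu_pos, mu_neg)
        new_sigma2 = ss / n
        change = max(abs(new_sigma2 - sigma2),
                     *(abs(p - q) for arm in est
                       for p, q in zip(new_est[arm], est[arm])))
        est, sigma2 = new_est, new_sigma2
        if change < tol:
            break
    return est, sigma2


def simulate_arm(n, mu_pos, mu_neg, sigma, ppv, rng):
    """Hypothetical data generator used only for this demonstration."""
    return [rng.gauss(mu_pos if rng.random() < ppv else mu_neg, sigma)
            for _ in range(n)]


rng = random.Random(2006)
data = {"T": simulate_arm(400, 10.0, 0.0, 1.0, 0.8, rng),
        "C": simulate_arm(400, 4.0, 0.0, 1.0, 0.8, rng)}
est, sigma2 = em_targeted(data, ppv=0.8)
naive = sum(data["T"]) / 400 - sum(data["C"]) / 400
em_effect = est["T"][0] - est["C"][0]
print("naive difference:", naive)
print("EM estimate of the effect in patients with the target:", em_effect)
```

In this simulated setting the true effect in patients with the target is 6. The naive difference of arm means is diluted by the patients without the target who were diagnosed positive.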
Discussion and Summary
Issues in evaluation of quality and
utility of GCB classifiers
Identification of differentially expressed
genes
Distribution of GCB classifiers
Agreement and reproducibility
Estimation of treatment effects for
Discussion and Summary
Identification of differentially expressed
genes
Determination of thresholds
Definition of type I error
Sample size estimation
Optimal functional form for an overall
Discussion and Summary
Distribution of GCB classifiers
Correlated expression levels among genes
Expression levels are not identically
distributed
Estimation of weights
Exact distribution
Determination of decision thresholds and
evaluation of their systematic bias
Discussion and Summary
Agreement and Reproducibility
Determination of minimal required threshold
Correlation of expression levels among
genes
Estimation of Treatment Effects
Two or more molecular targets for different
pathways
Variability of the estimated positive