by
Jen-pei Liu, Ph.D.
National Taiwan University and National Health Research Institutes
Presented at Feng-Chia University, Taichung, Taiwan, November 17, 2005
An Introduction to Statistical
Evaluation of Drug Products
Introduction
Effectiveness of Drug Products
Equivalence of Generic Drug Products
Estimation of Shelf-life of Drug Products
Quality Control of Drug Products
Evaluation of Diagnostic Devices
Summary
Introduction
Evidence from clinical trials must prove that the drug
is efficacious – drug is better than no drug
Inference from the sample (patients in trials) to the
targeted population (patients in clinical practice)
A decision process for clinical hypotheses based on
the trial objectives through statistical testing procedures
Introduction - Clinical Trials?
FDA (21 CFR 312.3, April 1994)
A clinical trial is the clinical investigation of a drug which is administered or dispensed to, or used involving one or more human subjects.
Chow and Liu (2004)
A clinical trial is the clinical investigation in which treatments are administered, dispensed or used
involving one or more human subjects for evaluation of the treatments.
Introduction –
Three Key Components
Experimental unit
A subject from a targeted population under study. For example
Healthy human subjects
Introduction –
Three Key Components
Treatment
It could be a placebo or any combination of
A new pharmaceutical entity
A new diet
A surgical procedure
A diagnostic test
Introduction –
Three Key Components
Evaluation
--Efficacy analysis
Clinical endpoints
--Safety assessment
Adverse experience
Laboratory test results
--Quality of life assessment
--Pharmacoeconomics analysis
--Outcomes research
Introduction – Statistical Designs
Parallel Group Designs
The patients are randomized to one of two or more groups, each group being allocated to a different treatment.
Advantages
Simple and easy to implement.
Less complicated analysis and interpretation.
Drawbacks
Relatively large variability
Effectiveness of Drug Products
Example: Farlow et al (JAMA 1992; 268: 2523-2529)
Randomized, double-blind, parallel groups
Objective
To compare tacrine (20, 40, 80 mg per day) versus placebo for probable Alzheimer’s disease
Null hypothesis
No difference in ADAS-cog scale between 80 mg of tacrine and placebo.
Alternative hypothesis
There exists a true difference in ADAS-cog scale between 80 mg of tacrine and placebo.
Effectiveness of Drug Products
Example: The NINDS rt-PA Stroke Study Group (NEJM 1996; 335: 841-7)
Objective for Part I
A greater proportion of patients with acute ischemic stroke treated with t-PA, as compared with those given placebo, have early improvement (>= 4 from baseline on NIHSS).
Primary efficacy endpoint
Proportion of patients with improvement
Null hypothesis
No difference in the proportions of patients with improvement between t-PA and placebo.
Alternative hypothesis
The minimal difference in the proportions of patients with improvement between t-PA and placebo is at least 24%.
Effectiveness of Drug Products
Statistical Hypothesis
$H_0: P_T - P_R = 0$ vs. $H_a: P_T - P_R \ge 24\%$
A statistically significant difference indicates that
the new drug is better than the control.
Effectiveness of Drug Products
Decision Based on Results
True State                   Decision: No difference           Decision: Minimal difference of 24%
No difference                Correct                           Type I Error (false positive)
Minimal difference of 24%    Type II Error (false negative)    Correct
Effectiveness of Drug Products
Decision Based on Results
Significance level: The consumer’s risk
The chance that the decision based on the results concludes there is a minimal difference of 24% improvement between t-PA and placebo when in fact there is no difference.
Power = 1 - producer’s risk
The chance that the decision based on the results concludes there is a minimal difference of 24% improvement between t-PA and placebo when in fact there is one.
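The two risks above can be put into numbers with a rough normal-approximation sketch. The response rates (20% on placebo, 44% on t-PA) and the group sizes are hypothetical values chosen only to give a 24% difference; they are not taken from the NINDS study.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_control, p_treat, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample Z-test for proportions.

    Normal approximation with equal group sizes; the negligible chance of
    rejecting in the wrong direction is ignored.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)          # critical value for the significance level
    p_bar = (p_control + p_treat) / 2          # pooled rate under the null hypothesis
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = sqrt(p_control * (1 - p_control) / n_per_group
                  + p_treat * (1 - p_treat) / n_per_group)
    delta = abs(p_treat - p_control)
    return z.cdf((delta - z_crit * se_null) / se_alt)


# Hypothetical rates: 20% improvement on placebo, 44% on the new drug.
for n in (50, 100, 150):
    print(n, round(approx_power(0.20, 0.44, n), 3))
```

Power rises with the number of patients per group, while the significance level stays fixed at the value chosen in advance.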
Effectiveness of Drug Products
Statistical Testing Procedures
Step 1
State the null and alternative hypotheses
Null hypothesis: the one to be questioned
No difference in the proportions of patients with improvement between t-PA and placebo.
Alternative hypothesis: the one of particular interest to investigators
The minimal difference in the proportions of patients with improvement between t-PA and placebo is at least 24%.
Effectiveness of Drug Products
Statistical Testing Procedures
Step 2
Choose an appropriate test statistic, such as the two-sample Z-statistic or t-statistic.
Step 3
Select the nominal significance level
the risk of type I error you are willing to commit
Effectiveness of Drug Products
Statistical Testing Procedures
Step 4
Determine the critical value, rejection region and decision
rule
For large samples, two-sided alternative and α= 0.05, the critical value is z(0.025) = 1.96 and rejection region will be the one such that the absolute value of the test statistic is greater than 1.96.
Decision rule
reject the null hypothesis if the resulting test statistic is in the rejection region.
Effectiveness of Drug Products
Statistical Testing Procedures
Step 1 to step 4 should be determined and
pre-specified in the Statistical Method
section of the protocol before initiation of
the study.
Effectiveness of Drug Products
Statistical Testing Procedures
Step 5
When the study is completed, compute the value of the test statistic specified in Step 2 (protocol).
Step 6
Make decision based on the resulting value of the test statistic and decision rule specified in Step 4 (protocol).
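Steps 2 to 6 can be sketched for two proportions as follows. This is a minimal illustration using the large-sample two-sample Z-statistic with a pooled standard error; the patient counts are hypothetical and are not the NINDS results.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x_treat, n_treat, x_control, n_control, alpha=0.05):
    """Large-sample two-sided Z-test for a difference in two proportions."""
    p_t = x_treat / n_treat
    p_c = x_control / n_control
    # Step 2: test statistic, with the standard error pooled under the null hypothesis
    pooled = (x_treat + x_control) / (n_treat + n_control)
    se = sqrt(pooled * (1 - pooled) * (1 / n_treat + 1 / n_control))
    z_stat = (p_t - p_c) / se
    # Steps 3 and 4: nominal significance level and critical value
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Two-sided p-value for the observed statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    # Step 6: decision rule fixed in advance
    return {
        "z": z_stat,
        "critical_value": z_crit,
        "p_value": p_value,
        "reject_null": abs(z_stat) > z_crit,
    }


# Hypothetical counts: 60 of 120 improved on treatment, 42 of 120 on placebo.
print(two_proportion_z_test(60, 120, 42, 120))
```

Comparing the statistic with the critical value and comparing the p-value with the significance level lead to the same decision.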
Effectiveness of Drug Products
Statistical Testing Procedures
Conclusion
Reject the null hypothesis
Sampling error is an unlikely explanation of the
discrepancy between the null hypothesis and the observed
values, and the alternative hypothesis is accepted at a 5% risk of type I error.
Fail to reject null hypothesis
The sampling error is a likely explanation and the data fail to provide sufficient evidence to doubt the validity of the null hypothesis.
Effectiveness of Drug Products
P - value
The p-value is the chance, if there is no difference in the proportions of
patients with improvement between the two
groups (i.e., the null hypothesis is true), of
obtaining a difference at least
as large as the observed difference.
If p-value is small, it implies that the observed
difference is unlikely to occur if there is no
difference in the proportions of patients with
improvement between t-PA and placebo.
Effectiveness of Drug Products
P - value
How small must the p-value be to
conclude that there exists a true difference in the proportions of patients with improvement between t-PA and placebo?
It depends upon the risk that the investigator is
willing to take of committing a type I error.
Nominal significance level = risk of type I error
(The chance of concluding that there is a true difference in the proportions of patients with
improvement between t-PA and placebo when in fact there is no difference)
Effectiveness of Drug Products
P - value
If the observed p-value < the nominal significance level (i.e., the
observed p-value < risk of type I error), then conclude that there exists a true difference in the proportions of patients with improvement
between t-PA and placebo
The nominal significance level = 5% or 1%
The p-value for the observed difference in the proportions of patients
with improvement between t-PA and placebo is 0.015.
If the nominal significance level is 5%, then it is concluded that there
is a difference in the proportions of patients with improvement
between t-PA and placebo in the target population of patients with acute ischemic stroke.
Effectiveness of Drug Products
P - value
We cannot make the same decision if the
nominal significance level is chosen to be 1%.
Always report the observed p-value
and let readers and reviewers judge the
strength of evidence for themselves; do not
report only "p-value < 0.05".
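The dependence of the decision on the chosen significance level can be shown in a few lines. The function name is illustrative only; the p-value 0.015 is the one quoted above.

```python
def decisions(p_value, levels=(0.05, 0.01)):
    """Map each nominal significance level to True if the null hypothesis is rejected."""
    return {alpha: p_value < alpha for alpha in levels}


# The same observed p-value leads to different decisions at 5% and 1%.
print(decisions(0.015))  # {0.05: True, 0.01: False}
```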
Equivalence of Generic Drug Products
New Drug Development (Innovative Drugs)
Length: an average of 12 years
Cost: an average of 800 million US dollars
Success rate:
1 out of 10000 molecules screened
60% failure rate during clinical development
Equivalence of Generic Drug Products
Abbreviated New Drug Application (Generic
Drugs)
After the patent of the innovative drug has
expired, all other manufacturers can produce the same drug product
Patents of most innovative drugs expire by
2005: big market
Requires evidence of bioequivalence between
Equivalence of Generic Drug Products
Pharmacokinetic Measures
Absorption, Distribution, Metabolism, Elimination
Based on the plasma concentrations of active ingredients C0, C1,…,CK measured at 0,
Equivalence of Generic Drug Products
Total Exposures
AUC (0-tK), AUC (0-∞)
Peak Exposure
Cmax – peak drug concentration
Partial Exposure
Partial AUC: AUC(0-ti)
Other Measures
Equivalence of Generic Drug Products
Equivalence hypothesis
$\theta = \mu_T - \mu_R$
$H_0: \mu_T - \mu_R \le \theta_L$ or $\mu_T - \mu_R \ge \theta_U$ vs. $H_a: \theta_L < \mu_T - \mu_R < \theta_U$
Equivalence of Generic Drug Products
- Average Bioequivalence
Two one-sided hypotheses:
$H_{0L}: \mu_T - \mu_R \le \theta_L$ vs. $H_{aL}: \mu_T - \mu_R > \theta_L$
and
$H_{0U}: \mu_T - \mu_R \ge \theta_U$ vs. $H_{aU}: \mu_T - \mu_R < \theta_U$
The parameter space of $H_0$ is the union of the parameter spaces of $H_{0L}$ and $H_{0U}$.
The parameter space of $H_a$ is the intersection of the parameter spaces of $H_{aL}$ and $H_{aU}$.
Equivalence of Generic Drug Products
- Average Bioequivalence
Schuirmann’s Two One-sided Tests
Procedure (TOST, 1987)
Conclude ABE if
$T_L = (f - \theta_L)/v(f) > t(\alpha, n_1+n_2-2)$
and
$T_U = (f - \theta_U)/v(f) < -t(\alpha, n_1+n_2-2)$,
where f is the LSE for θ
Equivalence of Generic Drug Products
- Average Bioequivalence
Confidence Interval Approach
If a (1-2α)100% confidence interval for the difference μT - μR or the ratio μ’T/μ’R is within the acceptance limits recommended by the regulatory agency, then accept the test formulation; otherwise reject it. Westlake (1981)
α = 5% ⇒ 90% C.I.
Log scale: μT - μR: ±0.2231
Original scale: μ’T/μ’R: (80%, 125%)
TOST is operationally equivalent to the CI approach
This is the requirement of most health regulatory agencies in the world
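The equivalence of the two approaches can be checked with a small sketch on the log scale. It assumes the denominator of the TOST statistics is the estimated standard error of the estimated difference, takes the critical value from a t table (1.717 is the tabulated upper 5% point for 22 degrees of freedom), and uses a hypothetical estimate and standard error.

```python
from math import exp, log


def average_bioequivalence(diff_log, se_diff, t_crit,
                           lower=log(0.80), upper=log(1.25)):
    """Assess ABE on the log scale by TOST and by the confidence interval.

    diff_log: estimated difference of log-scale means (test minus reference)
    se_diff:  its estimated standard error (assumed meaning of the denominator)
    t_crit:   upper alpha point of the t distribution with n1+n2-2 df
    """
    # Two one-sided test statistics
    t_lower = (diff_log - lower) / se_diff
    t_upper = (diff_log - upper) / se_diff
    tost_pass = t_lower > t_crit and t_upper < -t_crit
    # (1 - 2*alpha)100% confidence interval
    ci_low = diff_log - t_crit * se_diff
    ci_high = diff_log + t_crit * se_diff
    ci_pass = lower < ci_low and ci_high < upper
    return {
        "tost_pass": tost_pass,
        "ci_pass": ci_pass,
        "ratio_ci": (exp(ci_low), exp(ci_high)),  # back on the original scale
    }


# Hypothetical study: 24 subjects in total, so 22 degrees of freedom.
print(average_bioequivalence(diff_log=0.05, se_diff=0.06, t_crit=1.717))
```

Both criteria give the same answer because each one-sided test inequality is a rearrangement of one confidence limit falling inside the acceptance limits.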
Estimation of Shelf-life
Shelf-life (expiration dating period)
Time interval during which a drug product is expected to remain within the specifications, provided that it is
stored under the conditions defined on the container label
Expiration date
The date placed on the container label of a drug product designating the time prior to which a batch of the
product is expected to remain within the approved shelf life, if stored under defined conditions, and after which it must not be used.
Estimation of Shelf-life
ICH Q1A(R2) guidance (2003) P.16
“An approach for analyzing data of quantitative attribute that is expected to change with time is to determine the time at which the 95% one-sided confidence limit for the mean curve
intersects the acceptance criterion”
ICH Q1E guidance (2004) p.11
A two-sided 95% confidence interval or 95% one-sided upper or lower confidence interval can also be used.
One-sided lower limit: known degradation
One-sided upper limit: known impurities
Two-sided interval: unknown situation about increase or decrease of the assay with time
Estimation of Shelf-life
[Figure: degradation curve with its 95% lower confidence limit and the lower specification limit; x-axis: storage time (month), 0 to 12; y-axis: % of label claim]
2006/8/24 Copyright by Jen-pei Liu, PhD
Estimation of Shelf-life
Only consider the case where the drug product
characteristic decreases linearly with time.
Model:
$Y_j = \alpha + \beta X_j + \varepsilon_j, \quad j = 1, 2, \ldots, n$
$Y_j$: jth response of assay at time $X_j$,
$\alpha$: Intercept (batch effect),
$\beta$: Slope (degradation rate),
$X_j$: time at which $Y_j$ is observed,
$\varepsilon_j$: random error ~ $N(0, \sigma^2)$.
Estimation of Shelf-life
Construct a (1-2α)100% C.I. for X for which the pth upper quantile of the distribution of Y given X is equal to some specified value η.
The pth upper quantile of the distribution of Y given X is α+βX+σzp, where zp is the pth upper quantile of a standard normal distribution.
The values of X for which the hypothesis H0: [(η - α - zpσ)/ β] ≤ X
is not rejected at the 2α significance level constitute a (1-2α)100% C.I. for X.
Estimation of Shelf-life
Stability study: mean degradation => p = 0.5 => zp = 0.
H0: [(η - α)/β] ≤ X
=> H0: η - α - βX ≤ 0 vs. Ha: η - α - βX > 0
=> H0: α + βX ≥ η vs. Ha: α + βX < η
=> H0: (η - α)/β ≤ X vs. Ha: (η - α)/β > X
Estimation of Shelf-life
Stability study: mean degradation => p = 0.5 => zp = 0.
H0: [(η - α)/ β ] ≤ X => H0: η - α – βX ≤ 0
The set of values of X for which H0 is not rejected at the 2α significance level is
A = {X: [η - (a + bX)]2 ≤ t2
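A minimal sketch of the ICH approach for a single batch: fit the straight-line model by least squares and find the time at which the one-sided 95% lower confidence limit for the mean crosses the specification limit. The assay values and the 95% specification limit are hypothetical, the critical value 2.353 is the tabulated upper 5% t point for 3 degrees of freedom, and the search assumes the lower limit crosses the specification once in the search range.

```python
from math import sqrt


def fit_line(times, values):
    """Ordinary least squares fit of values = a + b * time."""
    n = len(times)
    x_bar = sum(times) / n
    y_bar = sum(values) / n
    sxx = sum((x - x_bar) ** 2 for x in times)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(times, values))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(times, values))
    s = sqrt(sse / (n - 2))  # residual standard deviation
    return a, b, s, x_bar, sxx, n


def lower_limit(x, fit, t_crit):
    """One-sided lower confidence limit for the mean response at time x."""
    a, b, s, x_bar, sxx, n = fit
    return a + b * x - t_crit * s * sqrt(1 / n + (x - x_bar) ** 2 / sxx)


def shelf_life(times, values, spec_limit, t_crit, max_time=120.0, tol=1e-6):
    """Time at which the lower confidence limit meets the specification (bisection)."""
    fit = fit_line(times, values)
    if lower_limit(0.0, fit, t_crit) < spec_limit:
        return 0.0
    if lower_limit(max_time, fit, t_crit) >= spec_limit:
        return max_time
    lo, hi = 0.0, max_time
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if lower_limit(mid, fit, t_crit) >= spec_limit:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical stability data (% of label claim) at 0, 3, 6, 9 and 12 months.
months = [0, 3, 6, 9, 12]
assay = [100.1, 99.0, 98.2, 96.9, 96.1]
print(shelf_life(months, assay, spec_limit=95.0, t_crit=2.353))
```

The estimated shelf-life is shorter than the time at which the fitted line itself reaches the specification, because the confidence limit lies below the line.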
Estimation of Shelf-life
Common intercept, common slope
Different intercepts, common slope
Common intercept, different slopes
Different intercepts, different slopes
Quality Control of Drug Products
Sampling Plan and Acceptance Criteria
Content uniformity of dosage units
USP/NF general chapter [905]
Dissolution Testing
USP/NF general chapter [711]
Disintegration Testing
Disintegration Testing
USP/NF general chapter [701]
p9
Disintegration Testing
Let Y be the disintegration time. Again we assume that Y follows a normal distribution with mean μ and variance σ2 .
Also, let p = P{0 < Y < UL},
where UL denotes the specified limit. Since the disintegration test involves only one acceptance criterion at both stages of the sampling plan, the exact probability can be computed. Let
C11 = {all six units disintegrate completely},
C12 = {one unit fails to disintegrate completely},
C13 = {two units fail to disintegrate completely},
C21 = {11 of 12 additional units disintegrate completely},
C22 = {all 12 additional units disintegrate completely}.
Then the exact probability of passing the
disintegration test is given as follows:
$$P\{\text{pass}\} = P\{C_{11}\} + P\{C_{21} \cup C_{22} \mid C_{12}\}\,P\{C_{12}\} + P\{C_{22} \mid C_{13}\}\,P\{C_{13}\}$$
$$= p^{6} + \left\{\binom{12}{11}p^{11}(1-p) + p^{12}\right\}\binom{6}{1}p^{5}(1-p) + p^{12}\binom{6}{2}p^{4}(1-p)^{2}$$
$$= p^{6} + 6p^{17}(1-p) + 87p^{16}(1-p)^{2}.$$
It can easily be verified that if the desired probability of passing the disintegration test is 0.5, p is approximately 0.831. If, in addition, the specified time limit, UL, is 30 min, it follows that
$$p = P\{Y < UL\} = P\{Y < 30\} = P\left\{\frac{Y-\mu}{\sigma} < \frac{30-\mu}{\sigma}\right\} = P\{Z < Z(0.169)\} = 0.831,$$
where Z is a standard normal variable and Z(0.169) is the 16.9% upper quantile of a standard normal distribution. Therefore
$$\frac{30-\mu}{\sigma} = Z(0.169) = 0.957.$$
Hence the contour for μ and σ2 is a linear decreasing function of σ given by
$$\mu = 30 - 0.957\sigma,$$
where 0.957 = Z(0.169).
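The figures quoted above can be checked numerically. The sketch below evaluates the passing probability in its expanded form, p^6 + 6p^17(1-p) + 87p^16(1-p)^2, solves for the p that gives a 50% chance of passing, and converts it into the contour for the mean. Function names are illustrative.

```python
from statistics import NormalDist


def prob_pass(p):
    """Probability of passing the two-stage disintegration test.

    p is the probability that a single unit disintegrates within the limit.
    """
    q = 1 - p
    return p ** 6 + 6 * p ** 17 * q + 87 * p ** 16 * q ** 2


def solve_p(target=0.5, tol=1e-10):
    """Bisection for the unit-level probability giving the target pass probability."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prob_pass(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def mean_on_contour(sigma, upper_limit=30.0, target=0.5):
    """Mean disintegration time on the contour for a given standard deviation."""
    z = NormalDist().inv_cdf(solve_p(target))
    return upper_limit - z * sigma


p_half = solve_p(0.5)
print(round(p_half, 3), round(NormalDist().inv_cdf(p_half), 3))
print(round(mean_on_contour(2.0), 2))
```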
Simplest Situation:
Binary Outcomes from marker test (+, -)
Binary Classification of Disease (Yes, No)
Design Matrix for Diagnostic Marker Tests
Correct (“Gold Standard”) True State of Disease
Diagnosis Made from Marker Test    Present (D)    Absent (D̄)    Total
Positive (T)                        a (1-β)        b (α)          m1
Negative (T̄)                        c (β)          d (1-α)        m2
Total                               n1             n2             N
Evaluation of Diagnostic Devices
Retrospective Sampling Plan (case-control)
Sensitivity (True Positive rate): Capacity for making
a correct diagnosis in subjects with the disease
Estimated Sensitivity:
100% x a/(a+c)
Specificity (True Negative rate): Capacity for
making a correct diagnosis in subjects without disease
Estimated Specificity:
100% x d/(b+d)
Evaluation of Diagnostic Devices
Positive Predictive Value (Positive Predictive Accuracy):
the proportion of subjects with the disease given the positive results.
= 100% x a/(a+b)
Negative Predictive Value (Negative Predictive Accuracy):
the proportion of subjects without the disease given the negative results.
= 100% x d/(c+d)
False positive rate: given the positive results, the proportion
of subjects without the disease
= 1 - positive predictive value = 100% x b/(a+b)
False negative rate: given the negative results, the proportion
of subjects with the disease
= 1 - negative predictive value = 100% x c/(c+d)
Other Definitions of False Positive Rate and False Negative Rate
False positive rate: given the subjects without the disease, the
proportion of subjects with positive results = b/(b+d) = b/n2
False negative rate: given the subjects with the disease, the
proportion of subjects with negative results = c/(a+c) = c/n1
False positive rate = 1 - specificity
False negative rate = 1 - sensitivity
Evaluation of Diagnostic Devices
Example (Feinstein, 2002)
New Marker Test Result    Diseased Cases    Nondiseased Control    Total
Positive                  46                2                      48
Negative                  4                 48                     52
Total                     50                50                     100
Evaluation of Diagnostic Devices
Data from Example 2 (Feinstein, 2002)
Sensitivity = 100% x 46/50 = 92.0%
Specificity = 100% x 48/50 = 96.0%
Prevalence = 100% x 50/100 = 50.0%
Positive Predictive Value = 100% x 46/48 = 95.8%
Negative Predictive Value = 100% x 48/52 = 92.3%
False Positive Rate = 100% x 2/48 = 4.2%
False Negative Rate = 100% x 4/52 = 7.7%
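The calculations above can be reproduced from the four cell counts. In this sketch a, b, c and d follow the design matrix (a = diseased and positive, b = nondiseased and positive, c = diseased and negative, d = nondiseased and negative), and the false positive and false negative rates use the predictive-value definitions applied in the example. The function name is illustrative.

```python
def diagnostic_summary(a, b, c, d):
    """Accuracy measures (as proportions) from a 2x2 diagnostic table."""
    total = a + b + c + d
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "prevalence": (a + c) / total,
        "positive_predictive_value": a / (a + b),
        "negative_predictive_value": d / (c + d),
        # Predictive-value based definitions, as used in the example
        "false_positive_rate": b / (a + b),
        "false_negative_rate": c / (c + d),
    }


# Counts from the Feinstein (2002) example.
for name, value in diagnostic_summary(a=46, b=2, c=4, d=48).items():
    print(f"{name}: {100 * value:.1f}%")
```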
Evaluation of Diagnostic Devices
Type of Diagnostic Markers
Binary Test Results (+,-)
Multiple Categorical Results
Abnormality Rating
Severity Rating
Urine test: None, trace, 1+, 2+
HER2 test: 0, 1+, 2+, 3+
Continuous Test Results
PSA
Intraocular Pressure
Glucose tolerance test
Gene expression level
Evaluation of Diagnostic Devices
To convert a ranking scale or a continuous measurement into a binary outcome (+,–), we need a cutoff point or threshold.
Example:
FBG > 126mg/dL DM (+)
≤ 126mg/dL DM (–)
S-T Depression in Exercise Stress Test Class D < 1.5 min CAD (+)
Evaluation of Diagnostic Devices
At a specific threshold, relationship of
sensitivity, specificity, false positive and false negative rates can be interpreted through
hypothesis testing:
H0: Absence of the disease
H1: Presence of the disease
α = Pr[Type I Error]
= Pr[test positive | no disease]
β = Pr[Type II Error]
= Pr[test negative | disease]
Evaluation of Diagnostic Devices
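[Figure: distributions of the variable X for the Normal and Diseased groups, with means μN and μD and a threshold between them; the areas α and β are marked at the threshold; Specificity = 1-α, Sensitivity = 1-β]
Evaluation of Diagnostic Devices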
Sensitivity = Pr[test positive | disease] = 1 - β
= power of the statistical procedure
Specificity = Pr[test negative | no disease]
= 1 - α
α↑ ⇒ β↓ ⇒ (1-β)↑
A test with a high sensitivity also has a high incorrect positive rate but a low incorrect negative rate.
A test with a high specificity also has a high incorrect
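The trade-off between α and β can be illustrated with two normal distributions and a moving threshold. The means and standard deviations below are hypothetical, and a result above the threshold is assumed to count as test positive.

```python
from statistics import NormalDist


def error_rates(threshold, normal, diseased):
    """Return (alpha, beta) when values above the threshold are called positive."""
    alpha = 1 - normal.cdf(threshold)   # Pr[test positive | no disease]
    beta = diseased.cdf(threshold)      # Pr[test negative | disease]
    return alpha, beta


# Hypothetical marker distributions for the two populations.
normal = NormalDist(mu=100, sigma=10)
diseased = NormalDist(mu=130, sigma=15)

for cutoff in (105, 115, 125):
    alpha, beta = error_rates(cutoff, normal, diseased)
    print(cutoff,
          "specificity", round(1 - alpha, 3),
          "sensitivity", round(1 - beta, 3))
```

Lowering the threshold raises α and lowers β, so sensitivity is gained at the cost of specificity.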
Evaluation of Diagnostic Devices
At each individual threshold (cut-off), sensitivity and
specificity can be computed.
A Receiver Operating Characteristic (ROC) curve is a
graphic presentation of sensitivity against 1-specificity.
It is a path in the unit square, from the lower left
corner to the upper right corner. In fact, it can be viewed as a cumulative distribution function.
Evaluation of Diagnostic Devices
For a useless marker test, the ROC curve is a straight
line at a 45° angle.
The area under the ROC curve provides a summary index
of diagnostic accuracy across all possible thresholds.
The area under the ROC curve ranges from 0.5
(50%) to 1.0 (100%).
For a useless marker test, the area under the ROC curve is
50%, which is the same as flipping a fair coin.
For non-inferiority or equivalence test based on the paired
ROC curve area, see Liu, et al. (2005, Statistics in
[Figure: ROC curves for a perfect test, an ordinary test, and a useless test; x-axis: 1-specificity, y-axis: sensitivity, both from 0.0 to 1.0. Source: Feinstein (2002)]
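An empirical ROC curve and its area can be sketched as follows. The marker values are hypothetical, a value at or above the threshold is assumed to count as positive, and the area is computed with the trapezoidal rule.

```python
def roc_points(diseased_scores, normal_scores):
    """Empirical ROC points (1 - specificity, sensitivity), one per threshold."""
    thresholds = sorted(set(diseased_scores) | set(normal_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every observed value
    for t in thresholds:
        sensitivity = sum(s >= t for s in diseased_scores) / len(diseased_scores)
        false_pos = sum(s >= t for s in normal_scores) / len(normal_scores)
        points.append((false_pos, sensitivity))
    return points


def area_under_curve(points):
    """Trapezoidal area under the path from (0, 0) to (1, 1)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical marker values for diseased and nondiseased subjects.
diseased = [4.1, 5.3, 6.0, 6.8, 7.5]
nondiseased = [2.2, 3.0, 3.9, 4.5, 5.1]
print(area_under_curve(roc_points(diseased, nondiseased)))
```

Completely separated groups give an area of 1.0, and identical groups give 0.5, matching the perfect and useless tests in the figure.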
Summary
Descriptive Statistics
Description of characteristics and
estimation of special attributes of drug and device products
Inferential Statistics
Decision-making tool for approval of drug and device products for marketing
References
Chow, SC and Liu, JP (2004) Design and
Analysis of Clinical Trials, 2nd Ed. Wiley
Chow, SC and Liu, JP (2000) Design and
Analysis of Bioavailability and Bioequivalence Studies, Marcel Dekker, Inc.
Chow, SC and Liu, JP (1995) Statistical
Design and Analysis in Pharmaceutical Sciences, Marcel Dekker, Inc.