Design and Analysis of Clinical Trials

(1)

Instructor:

Jen-pei Liu, Ph.D.

Department of Statistics

National Cheng-Kung University Division of Biostatistics,

National Health Research Institutes

Lecture III:

Statistical Principles for Analysis of Clinical Data

Design and Analysis of

Clinical Trials

(2)

Statistical Methods for

Biotechnology Products II

Statistical Principles for Analysis of Clinical Data

Instructor: Jen-pei Liu, Ph.D. Division of Biometry

Department of Agronomy

National Taiwan University, and

Division of Biostatistics and Bioinformatics National Health Research Institutes

(3)

Types of Data



Continuous Endpoints

 Numerical discrete data

  Heartbeats per minute
  Total NIHSS
  Total Hamilton Rating Scale for Depression
  Total Alzheimer’s Disease Assessment Scale

(4)

Types of Data



Continuous Endpoints

 Numerical continuous data

  Age
  Weight

 ALT

 Peak flow rate (liters per minute)

 FEV

(5)

Types of Data



Categorical Endpoints



Nominal scale data

Classification of patients according to their attributes

  Gender
  Race

 Occurrence of a particular adverse reaction

(6)

Types of Data



Ordered (ordinal scale) categorical data

  A certain order among different categories
  Symptom score: 0 = no symptom, 1 = mild, 2 = moderate, 3 = severe
  Severity of adverse reactions
  Severity of disease

(7)

Types of Data

Censored Endpoints



Time to the occurrence of a pre-defined

event



Time (continuous) and occurrence

(categorical)



The occurrence of the event may not be observed for some patients. The time to the occurrence of the event for these subjects is then censored.

(8)

Types of Data

 Chapman, et al (NEJM 1991; 324: 788-94)

The use of prednisone in reduction of relapse

within 21 days of the treatment of acute asthma in the emergency room.

 Primary endpoint

Time to unscheduled visit to clinics because of worsening asthma.

(11)

Types of Data

Cross-sectional vs. longitudinal data



Cross-sectional data (snapshot at one time point)

Clinical data are collected and evaluated at a particular time point during the trial



Longitudinal data (snapshots at several time points)

Clinical data are collected and evaluated over a series of time points during the trial

(12)

Example

Knapp et al (JAMA 1994; 271: 985-991)

 A multi-center trial with 33 centers

  Double-blind, randomized, 4 parallel groups
  Forced escalation
  30 weeks of randomized treatment
  6 visits: the start of randomized treatment (baseline) and weeks 6, 12, 18, 24, and 30

 Cross-sectional data

CIBI and ADAS-cog evaluated at the start of randomized treatment

 Longitudinal

(13)

Types of Comparison



Within-group (patient) comparison

Comparison of the changes within the same

patients at different time points during the

trial.



Between-group (patient) comparison

Comparison between groups of patients

under different treatments.

(14)

Example: Major depressive disorder

Stark and Hardison (JCP, 1985;46,53-58)
Cohn and Wilcox (JCP,1985:46,21-31)

  Double-blind, randomized, three parallel groups
  One-week placebo washout period
  Fluoxetine vs. imipramine vs. placebo
  6 weeks of randomized treatments
  Primary efficacy endpoint

HAM-D score at the last follow-up visit

 Within each group

Change from baseline in HAM-D score

 Between groups

(16)

Endpoints

 Raw measurements at a time point.

 Change at a time point from baseline.

  Percent change at a time point from baseline.
  Clinically meaningful targeted value attained at a time point, e.g., sitting DBP ≤ 85 mm Hg

 Selection of time points should be able to

measure the effect of the intervention.

(17)

Selection of Endpoints

 Endpoints should reflect the change of clinical status

caused by the intervention.

 Endpoints should be sensitive to the change of

clinical status caused by the intervention.

 Endpoints should be validated.

 Raw measurements at a time point can only measure

the static clinical status.

 Change at a time point from baseline can measure

the magnitude of the change of clinical status caused by the intervention.

 Change from baseline has the same unit as the raw

(18)

Selection of Endpoints

 Percent change at a time point from baseline

measures the relative magnitude of the change of clinical status caused by the intervention.

  Percent change from baseline is unitless.
  The same percent change may reflect different magnitudes of change

(19)

Selection of Endpoints

  One of the key inclusion criteria for clinical trials in the treatment of mild to moderate essential hypertension is a sitting DBP between 95 and 115 mm Hg.
  Three changes from baseline: 115 → 105, 105 → 95, 95 → 85.
  Percent changes from baseline: 8.7%, 9.5%, 10.5%
  Only 95 → 85 reaches the clinically meaningful
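The arithmetic behind this example can be checked with a small sketch. The names `pairs`, `TARGET_DBP`, and `summarize` are illustrative, and the 85 mm Hg target is taken from the earlier Endpoints slide; nothing here is the lecturer's own code.

```python
# Sitting DBP (mm Hg) at baseline and after treatment; each pair is a 10 mm Hg drop.
pairs = [(115, 105), (105, 95), (95, 85)]
TARGET_DBP = 85  # assumed clinically meaningful target (sitting DBP <= 85 mm Hg)


def summarize(baseline, final):
    """Return (change, percent change, target attained) for one patient."""
    change = final - baseline
    percent_change = 100.0 * change / baseline
    return change, percent_change, final <= TARGET_DBP


results = [summarize(b, f) for b, f in pairs]
for (b, f), (chg, pct, hit) in zip(pairs, results):
    print(f"{b} -> {f}: change {chg} mm Hg, percent change {pct:.1f}%, target reached: {hit}")
```

The same absolute change of 10 mm Hg gives three different percent changes, and only the last patient attains the targeted value.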

(20)

Selection of Endpoints

 Endpoints should reflect clinically meaningful

interpretation and applicability.

 Clinically meaningful targeted value >

change from baseline > percent change from baseline.

 Clinical investigators should have

responsibility for determination of the efficacy endpoints used in the clinical trials.

(21)

Selection of Endpoints

                            LDL            HDL            TG
Targeted value              < 100 mg/dL    40-60 mg/dL    < 150 mg/dL
Bile acid binding resin     15-30%         3-5%           no change
Nicotinic acid              5-25%          15-35%         15-25%
Fibric acid                 5-20%          10-20%         20-50%
HMG-CoA inhibitor           18-55%         3-5%           7-30%

(22)

Descriptive Statistics

All statistics are estimates with sampling errors

 Continuous Data

 Central tendency

Mean ȳ: arithmetic average of all observations
Median: the middle observation

  Dispersion

Standard deviation s
Minimum: the smallest observation
Maximum: the largest observation
Range: maximum minus minimum

 Log-transformation: Mean on the log-scale

(23)

Descriptive Statistics

  Presentation of results
  Individual groups
  Comparative difference
  Example: Adkinson, et al (NEJM 1997;336:324-31)

Immunotherapy for asthma in allergic children

Endpoint: PEFR    Placebo    Immunotherapy    Mean Difference
N                 60         61

(24)

Categorical Data



Proportion of the patients with a certain attribute: the number of the patients with the attribute divided by the total number of the patients in the group



Present both counts and proportions: m, p



Chapman, et al (NEJM 1991; 324: 788-94)

The use of prednisone in reduction of relapse within 21 days of the treatment of acute asthma in the emergency room

(25)

Characteristics          Prednisone     Placebo
N                        48             45
Smoking status
  Current                21 (50.0%)     13 (31.0%)
  Former                  5 (11.9%)      6 (14.3%)
  Never                  16 (38.1%)     24 (54.8%)
Use of oral steroids for previous exacerbations
  Yes                    15 (36.6%)     13 (32.5%)
  No                     26 (63.4%)     27 (67.5%)

Standard deviations are usually not given for categorical data because, given the number of the patients in each group, they can be calculated directly from the proportion. The standard deviation of a proportion is at the maximum

(26)

Measures for comparison

between groups

Difference in the proportions

 Relative risk

The ratio of the proportions of the test group to the control.

 Odds ratio

The ratio of the odds of the test group to the control.

 Odds

The ratio of the number of patients with the attribute to the number without the attribute.

(27)

The US Physicians’ Health Study (NEJM 1989; 321: 129-35)

          Aspirin            Placebo
N         11037              11034
MI        139 (1.26%)        239 (2.17%)
No MI     10898 (98.74%)     10795 (97.83%)

Difference in proportion of MI = 1.26% - 2.17% = -0.91% (an average of 91 fewer MIs per 10,000 subjects)

Relative risk of MI for aspirin = 1.26% / 2.17% = 0.581 (aspirin reduces the risk of MI by 42%)

Odds ratio of MI for aspirin = (139 / 10898) / (239 / 10795) = 1.275% / 2.214% = 0.576 (aspirin reduces the odds of MI by 42%)

Difference in proportions and relative risk can only be used in prospective studies, while the odds ratio can be used in both prospective and retrospective studies.
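As a rough check of the three measures, the following sketch recomputes them from the counts in the Physicians’ Health Study table. The function name `two_by_two_measures` and its argument names are hypothetical.

```python
def two_by_two_measures(events_test, n_test, events_ctrl, n_ctrl):
    """Risk difference, relative risk and odds ratio for a 2x2 table."""
    p_test = events_test / n_test
    p_ctrl = events_ctrl / n_ctrl
    # Odds = number with the attribute divided by number without it.
    odds_test = events_test / (n_test - events_test)
    odds_ctrl = events_ctrl / (n_ctrl - events_ctrl)
    return {
        "risk_difference": p_test - p_ctrl,
        "relative_risk": p_test / p_ctrl,
        "odds_ratio": odds_test / odds_ctrl,
    }


# Aspirin: 139 MIs among 11037; placebo: 239 MIs among 11034.
phs = two_by_two_measures(139, 11037, 239, 11034)
for name, value in phs.items():
    print(f"{name}: {value:.4f}")
```

Because MI is rare in both groups, the relative risk and the odds ratio are nearly equal here.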

(28)

Categorical Endpoints

 Difference in proportions provides the absolute

magnitude of difference.

  Both relative risk and odds ratio give the relative magnitude of difference.
  50% → 25% and 0.05% → 0.025% both yield a relative risk of 50%, but the differences in proportion are 25% and 0.025%, respectively.
  Relative risk and odds ratio are appropriate when the proportion of the event for the control group is small (<5%).

 When the proportion of the event is small (<5%),

(29)

Censored Data



Kaplan-Meier curve (Actuarial probabilities)

The proportions of the patients with occurrence of a pre-defined event over a period of time.



Median survival

The time to the pre-defined event (e.g. death) occurring in 50% of the patients.



Hazard ratio

The ratio of the hazard of the occurrence of a pre-defined event in the test group to that in the control group

(30)

Example: Crawford, et al

(NEJM 1989; 321: 419-24)

 A controlled trial of leuprolide with and without

flutamide in prostatic carcinoma

  Randomized, double-blind, 2 parallel groups
  Primary endpoint: overall survival

Treatment Median Survival

Leuprolide + flutamide 35.6 Months

(31)

[Figure: Kaplan-Meier Estimates of the Risk of Serious CV Events in the APC Trial by Treatment Arm. Vertical axis: estimated probability of CV death, MI, stroke, or CHF. Horizontal axis: months since first dose (6 to 36). Arms: C 200 BID, C 400 BID, Placebo. Log-rank statistic 8.74 (p = 0.013).]

(32)


(33)

Inferential Statistics



Inference from the sample to the target

population



A decision process for clinical hypotheses

based on the trial objective through

(34)

Example: Farlow et al

(JAMA 1992; 268: 2523-2529)

 Randomized, double-blind, parallel groups

 Objective

To compare the tacrine (20, 40, 80 mg per day) versus placebo for probable Alzheimer’s disease

 Null hypothesis

No difference in ADAS-cog scale between 80 mg of tacrine and placebo.

 Alternative hypothesis

There exists a true difference in ADAS-cog scale between 80 mg of tacrine and placebo.

(35)

Example: The NINDS rt-PA Stroke

Study Group (NEJM 1996; 335:

841-7)

  Objective for Part I

A greater proportion of patients with acute ischemic stroke treated with t-PA, as compared with those given placebo, have early improvement (>= 4 from baseline on NIHSS).

 Primary efficacy endpoint

Proportion of patients with improvement

 Null hypothesis

No difference in the proportions of patients with improvement between t-PA and placebo.

 Alternative hypothesis

The minimal difference in the proportions of patients with improvement between t-PA and placebo is at least 24%.

(36)

Decision Based on Results

                       Decision based on results
True State             No difference                     Minimal difference of 24%
No difference          Correct                           Type I error (false positive)
Minimal difference     Type II error (false negative)    Correct

(37)

Decision Based on Results

  Significance level: the consumer’s risk

The chance that the decision based on the results concludes there is a minimal difference of 24% in improvement between t-PA and placebo when in fact there is no difference.

  Power = 1 – producer’s risk

The chance that the decision based on the results concludes there is a minimal difference of 24% in improvement between t-PA and placebo when in fact there is one.

(38)

Statistical Testing Procedures



Step 1

State the null and alternative hypotheses

 Null hypothesis: the one to be questioned

No difference in the proportions of patients with improvement between t-PA and placebo.

 Alternative hypothesis: the one of particular interest to investigators

The minimal difference in the proportions of patients with improvement between t-PA and

(39)

Statistical Testing Procedures



Step 2

Choose an appropriate test statistic, such as the two-sample t-statistic.



Step 3

  Select the nominal significance level: the risk of type I error you are willing to commit. Usually 5%.

(40)

Statistical Testing Procedures

 Step 4

  Determine the critical value, rejection region and decision rule

For large samples, a two-sided alternative and α = 0.05, the critical value is z(0.025) = 1.96 and the rejection region is the set of values for which the absolute value of the test statistic is greater than 1.96.

  Decision rule

Reject the null hypothesis if the resulting test statistic is in the rejection region.
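A minimal sketch of Step 4 for the large-sample, two-sided case, using only the Python standard library. The names `alpha`, `z_crit`, and `reject_null` are illustrative.

```python
from statistics import NormalDist

alpha = 0.05  # nominal significance level chosen in Step 3
# Upper alpha/2 percentile of the standard normal distribution.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)


def reject_null(test_statistic, critical_value=z_crit):
    """Two-sided decision rule: reject when |statistic| exceeds the critical value."""
    return abs(test_statistic) > critical_value


print(f"critical value: {z_crit:.3f}")
print(reject_null(2.5), reject_null(-1.5))
```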

(41)

Statistical Testing Procedures

Step 1 to step 4 should be determined and

pre-specified in the Statistical Method

section of the protocol before initiation of

the study.

(42)

Statistical Testing Procedures



Step 5

When the study is completed or the data are available for interim analysis, compute the value of the test statistic specified in Step 2 (protocol).



Step 6

Make decision based on the resulting value of the test statistic and decision rule specified in Step 4 (protocol).

(43)

Statistical Testing Procedures

 Conclusion

 Reject the null hypothesis

The sampling error is an unlikely explanation of the discrepancy between the null hypothesis and the observed values, and the alternative hypothesis is accepted at a 5% risk of type I error.

 Fail to reject null hypothesis

The sampling error is a likely explanation and the data fail to provide sufficient evidence to doubt the validity of the null hypothesis.

(44)

P - value



The p-value is the chance of obtaining a mean difference at least as large as the observed mean difference if there is no difference in ADAS-cog between the two groups (i.e., the null hypothesis is true).



If p-value is small, it implies that the observed

difference is unlikely to occur if there is no

difference in ADAS-cog scale between 80mg of

tacrine and placebo.

(45)

P - value

  How small must the p-value be to conclude that there exists a true difference in ADAS-cog scale between 80 mg of tacrine and placebo?

 It depends upon the risk that the investigator is

willing to take for committing type I error.

 Nominal significance level = risk of type I error

(The chance of concluding existence of a true difference in ADAS-cog when in fact there is no difference)

(46)

P - value

  If the observed p-value < the nominal significance level (i.e., the observed p-value < the risk of type I error), then conclude there exists a true difference in ADAS-cog.

  The nominal significance level = 5% or 1%

  The p-value for the observed difference in mean ADAS-cog is 0.015.

  If the nominal significance level is 5%, then it is concluded that there is a difference in ADAS-cog between 80 mg of tacrine and placebo in the target population of patients with probable Alzheimer’s disease.

(47)

P - value



We cannot make the same decision if the nominal significance level is chosen to be 1%.



Always report the observed p-value and let readers and reviewers judge the strength of evidence by themselves; do not report only "p-value < 0.05".

(48)

Confidence Interval

 Example

Adkinson, et al (NEJM 1997; 336: 324-31)

Immunotherapy for asthma in allergic children

Endpoint          Placebo         Immuno.         Mean Diff.      95% C.I.         P-value
PEFR
  N               60              61
  Baseline        84.8 ± 8.6      81.9 ± 10.8     -2.9 ± 9.8      (-0.6, 6.4)      0.17
  Change          -1.4 ± 11.1     2.1 ± 11.1      3.8 ± 11.1      (-7.8, 0.1)      0.05
  P-value         0.11            0.24
Symptom score
  Baseline        0.37 ± 0.35     0.34 ± 0.27     0.03 ± 0.31     (-0.09, 0.14)    0.98

(49)

Confidence Interval



Estimates about the true population difference.



Random intervals which can be different if the

same trial is repeated.



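A 95% confidence interval means that, if the same trial were repeated many times, about 95% of the intervals constructed this way would cover the true difference in average PEFR between the two groups; (-7.8, 0.1) is the interval observed in this trial.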

(51)

Statistical Testing Procedures

---Continuous data



Within-group

  Parametric methods: paired t-test
  Nonparametric methods: Wilcoxon signed rank test



Between-group

  Parametric methods: unpaired t-test, analysis of variance
  Nonparametric methods:

(52)

Computation of Test Statistics

  A test statistic quantifies whether the discrepancy between the observed descriptive statistic and the hypothetical value assumed under the null hypothesis exceeds the sampling error under the null hypothesis.

  A test statistic usually takes the form of a ratio: Difference / Standard Error

where:

 Difference

Difference between the observed descriptive statistic and the hypothetical value assumed under the null hypothesis.

 Standard error of descriptive statistic

(53)

Computation of Test Statistics



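Large sample z-statistic

\[ Z = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}, \qquad |Z| \ge Z(\alpha/2) \]



Unpaired t-statistic

\[ t = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \qquad |t| > t(\alpha/2,\; n_1 + n_2 - 2) \]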
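The two statistics on this slide can be sketched from summary statistics as follows. This is an illustration only: the input numbers are made up, the function names are hypothetical, and the pooled standard deviation `s` is assumed to be the usual degrees-of-freedom weighted combination of the two sample variances, which the slide does not spell out.

```python
from math import sqrt


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2, delta0=0.0):
    """Large-sample z-statistic; delta0 is the difference assumed under the null."""
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return ((mean1 - mean2) - delta0) / se


def pooled_t(mean1, sd1, n1, mean2, sd2, n2, delta0=0.0):
    """Unpaired t-statistic with a pooled standard deviation (assumed definition)."""
    df = n1 + n2 - 2
    s_pooled = sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    se = s_pooled * sqrt(1 / n1 + 1 / n2)
    return ((mean1 - mean2) - delta0) / se, df


# Hypothetical summary statistics, not taken from any trial in the lecture.
z = two_sample_z(10.0, 4.0, 40, 8.0, 5.0, 60)
t, df = pooled_t(10.0, 4.0, 40, 8.0, 5.0, 60)
print(f"z = {z:.4f}, t = {t:.4f} on {df} degrees of freedom")
```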

(54)

Computation of Confidence Intervals

 A general formula for 95% C.I.

Descriptive statistic ± z(0.025) × standard error

where

Descriptive statistics

  mean, difference in means
  proportion, difference in proportions
  odds, odds ratio
  hazards, hazard ratio

z(0.025): the upper 2.5th percentile of a standard normal distribution (1.96).

(55)

Computation of Confidence Intervals



For continuous data and small samples (n < 30), use the percentile of Student's t-distribution.



Standard error of descriptive statistic

(56)

Confidence Intervals – Continuous Data



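One-sample

  Large-sample (n ≥ 30)
  Small-sample (n < 30)

Paired sample

  Small-sample (n < 30)

Large-sample interval:

\[ \left( \bar{Y} - Z(\alpha/2)\, s/\sqrt{n},\;\; \bar{Y} + Z(\alpha/2)\, s/\sqrt{n} \right) \]

Small-sample interval:

\[ \left( \bar{Y} - t(\alpha/2,\, n-1)\, s/\sqrt{n},\;\; \bar{Y} + t(\alpha/2,\, n-1)\, s/\sqrt{n} \right) \]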

(57)

Confidence Intervals – Continuous Data

  Two independent samples

  Large-sample (n ≥ 30)

\[ (\bar{Y}_1 - \bar{Y}_2) \pm Z(\alpha/2) \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]

  Small-sample (n < 30)

\[ (\bar{Y}_1 - \bar{Y}_2) \pm t(\alpha/2,\; n_1 + n_2 - 2)\; s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \]

  Nonparametric confidence intervals are also available, see Chow and Liu (2004).

(58)

Pulmonary Function in Patients Receiving Prednisone and Those Receiving Placebo before and after Treatment in the Emergency Room and Improvement over the Course of Treatment.*

                     Before Treatment              At Discharge                  Percent Improvement #
Variable             Prednisone     Placebo        Prednisone     Placebo        Prednisone      Placebo
                     (N=34)         (N=34)         (N=47)         (N=44)         (N=34)          (N=34)
PEFR (liters/min)    246 ± 119      225 ± 106      323 ± 134      321 ± 136      +47.9 ± 52.0    +58.4 ± 78.1
FVC (liters)         2.83 ± 1.4     2.67 ± 1.1     3.55 ± 1.1     3.47 ± 1.1     +44.6 ± 59.1    +36.4 ± 36.6
FEV₁ (liters)        1.84 ± 0.9     1.59 ± 0.7     2.27 ± 0.9     2.24 ± 0.9     +39.2 ± 33.6    +47.3 ± 37.6
MMEFR (liters/sec)   1.42 ± 0.8     1.23 ± 0.7     1.75 ± 1.1     1.70 ± 1.0     +56.4 ± 56.1    53.2 ± 52.6
V₅₀ (liters/sec)     1.68 ± 0.9     1.45 ± 0.9     1.98 ± 1.2     1.99 ± 1.1     +48.7 ± 44.6    +57.2 ± 56.3
V₂₅ (liters/sec)     0.86 ± 0.5     0.60 ± 0.4     0.76 ± 0.5     0.76 ± 0.5     +34.4 ± 61.4    +43.4 ± 68.0

* Plus-minus values are means ± SD. PEFR denotes peak expiratory flow rate, MMEFR mean midexpiratory flow rate, V₅₀ instantaneous flow rate at 50 percent of vital capacity, and V₂₅ instantaneous flow rate at 25 percent of vital capacity.

# Expressed as a percentage of pretreatment results. Source: Chapman, et al (1991)

(59)

Pulmonary Function before Treatment in the Emergency Room and at the Time of Discharge and Improvement Observed during the First Home Visit after Treatment, According to the Presence or Absence of Relapse at Any Time during the 21-Day Follow-up.*

                     Before Treatment              At Discharge                  Visit 1 (% change from discharge value)
Variable             Prednisone     Placebo        Prednisone     Placebo        Prednisone      Placebo
                     (N=51)         (N=17)         (N=67)         (N=24)         (N=61)          (N=21)
PEFR (liters/min)    233 ± 103      235 ± 129      333 ± 136      309 ± 136      +16.2 ± 24.9    +9.17 ± 25.5
FVC (liters)         2.86 ± 1.32    2.36 ± 0.99    3.62 ± 1.14    3.21 ± 1.02    +6.6 ± 16.9     +0.5 ± 16.5
FEV₁ (liters)        1.74 ± 0.80    1.55 ± 0.80    2.26 ± 0.85    2.24 ± 0.85    +12.2 ± 22.7    -2.7 ± 17.0 #1
MMEFR (liters/sec)   1.34 ± 0.73    1.37 ± 0.90    1.65 ± 1.03    1.95 ± 1.07    +29.6 ± 74.6    -7.2 ± 28.6 #2
V₅₀ (liters/sec)     1.59 ± 0.91    1.54 ± 0.98    1.87 ± 1.12    2.27 ± 1.22    +34.8 ± 83.6    -3.6 ± 31.2 #3
V₂₅ (liters/sec)     0.72 ± 0.39    0.66 ± 0.48    0.71 ± 0.46    0.90 ± 0.56    +33.7 ± 76.3    -7.0 ± 31.9 #2

* Plus-minus values are means ± SD. PEFR denotes peak expiratory flow rate, MMEFR mean midexpiratory flow rate, V₅₀ instantaneous flow rate at 50 percent of vital capacity, and V₂₅ instantaneous flow rate at 25 percent of vital capacity.

#1 p<0.05 for the comparison with the nonrelapse group. #2 p<0.005 for the comparison with the nonrelapse group. #3 p<0.01 for the comparison with the nonrelapse group.

(60)

Wilcoxon Signed Rank Test

-

within-group comparison

  Compute the change from baseline in morning PEF = Visit – baseline
  Take the absolute value of the differences
  Rank the absolute differences from the smallest to the largest
  Define a new sign variable: 1 if the change from baseline is positive, 0 if the change is negative.

(61)

Wilcoxon Signed Rank Test

---

for within-group comparison



Multiply the rank of the absolute difference by the sign variable. This is called the signed rank.



Sum the signed ranks.



Compare the sum of the signed ranks with the critical values from the table.

(62)

Patient                                            Abs.              Sign
Number   Placebo   Hydrochlorothiazide   Diff.    Diff.   Rank   Sign   Variable   Signed Rank
1        211       181                   -30      30      9      -      0          0
2        210       172                   -38      38      10     -      0          0
3        210       196                   -14      14      4      -      0          0
4        203       191                   -12      12      2      -      0          0
5        196       167                   -29      29      7.5    -      0          0
6        190       161                   -29      29      7.5    -      0          0
7        191       178                   -13      13      3      -      0          0
8        177       160                   -17      17      5      -      0          0
9        173       149                   -24      24      6      -      0          0
10       170       119                   -51      51      11     -      0          0
11       163       156                   -7       7       1      -      0          0
Total                                                                              0

Critical values: w(0.025, 11) = 11, w(0.975, 11) = 55
Sum of signed ranks = 0 < w(0.025, 11) = 11
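The signed rank procedure can be traced on the eleven patients in this table with the sketch below. The helper names are hypothetical; ties in the absolute differences receive the average of the ranks they occupy, and zero differences are dropped, which is a common convention that the slides do not state.

```python
def average_ranks(values):
    """Rank values from smallest (1) to largest, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def signed_rank_sums(before, after):
    """Return (sum of positive signed ranks, sum of negative ranks, ranks)."""
    diffs = [a - b for b, a in zip(before, after) if a != b]  # drop zero differences
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus, ranks


placebo = [211, 210, 210, 203, 196, 190, 191, 177, 173, 170, 163]
hctz = [181, 172, 196, 191, 167, 161, 178, 160, 149, 119, 156]
w_plus, w_minus, ranks = signed_rank_sums(placebo, hctz)
print(w_plus, w_minus, ranks)
```

Every difference is negative, so the sum of the signed ranks is 0, below the lower critical value of 11 quoted on the slide.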

(63)

Wilcoxon Rank Sum Test



Combine all observations from two

independent samples.



Rank all observations in the combined sample.



Sum all ranks from the test group.

This is called the sum of ranks.



Compare the sum of ranks from the test group

with the critical values from the table (Chow

and Liu, 1998).

(64)

12.3 Mann-Whitney-Wilcoxon Test



Two independent random samples



Example

 Effectiveness of two CRA training programs

for GCP certification

 8 divisions of a large drug firm

  50 junior CRAs from each division
  # of junior CRAs passed the GCP

(65)

12.3 Mann-Whitney-Wilcoxon Test



Example

Training Program
1      2
28     33
31     29
27     35
25     30

(66)

12.3 Mann-Whitney-Wilcoxon Test



Method

 Rank the observations in the combined

sample from the smallest (1) to the largest (n1+n2)

 In case of ties, use the averaged rank

(67)

12.3 Mann-Whitney-Wilcoxon Test



Example

Training Program

1 2

obs. Rank obs. Rank

28 3 33 7

31 6 29 4

27 2 35 8

25 1 30 5

(68)

12.3 Mann-Whitney-Wilcoxon Test

Exact Method for Small Samples(n1+n2 ≤ 30)

1. Null hypothesis: H₀: The population relative

frequency distributions for 1 and 2 are identical.

2. Alternative hypothesis: H_a: The population relative frequency distributions are shifted with respect to their relative location (a two-tailed test). Or H_a: The population relative frequency distribution for population 1 is shifted to the right of the relative frequency distribution for population 2 (a one-tailed test).

(69)

12.3 Mann-Whitney-Wilcoxon Test

3. Test statistic:

For a two-tailed test, use U, the smaller of

\[ U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - T_1 \quad \text{and} \quad U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - T_2, \]

where T1 and T2 are the rank sums for samples 1 and 2, respectively.

For a one-tailed test, use U1.
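A short sketch of the rank sums and U statistics for the training program example (program 1: 28, 31, 27, 25; program 2: 33, 29, 35, 30). The function names are illustrative, and ties are given averaged ranks as described in the Method slide.

```python
def average_ranks(values):
    """Rank values from smallest (1) to largest, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(sample1, sample2):
    """Return (T1, T2, U1, U2) from ranks in the combined sample."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    t1 = sum(ranks[:n1])
    t2 = sum(ranks[n1:])
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - t1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - t2
    return t1, t2, u1, u2


t1, t2, u1, u2 = mann_whitney_u([28, 31, 27, 25], [33, 29, 35, 30])
print(t1, t2, u1, u2, "U =", min(u1, u2))
```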

(70)

12.3 Mann-Whitney-Wilcoxon Test

4. Rejection region:

a. For the two-tailed test and a given value of α, reject H0 if U ≤ U0, where P(U ≤ U0) = α/2.

[Note: U0 is the value such that P(U ≤ U0) is equal to half of α.]

b. For a one-tailed test and a given value of α, reject H0 if U1 ≤ U0, where P(U1 ≤ U0) = α.

(71)

12.3 Mann-Whitney-Wilcoxon Test

Method for larger Samples (n1+n2>30 )

1. Null hypothesis: H₀: The population relative

frequency distributions for 1 and 2 are identical.

2. Alternative hypothesis: H_a: The population

relative frequency distributions are not identical. Or Ha:The population relative

frequency distribution for 1 is shifted to the right (or left) of the relative frequency

(72)

12.3 Mann-Whitney-Wilcoxon Test

3. Test statistic:

\[ Z = \frac{U - n_1 n_2 / 2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1) / 12}} \]

and let U = U1 for a one-sided alternative.







(73)

12.3 Mann-Whitney-Wilcoxon Test

4. Rejection region:

Reject H0 if z > z(α/2) or z < -z(α/2) for a two-tailed test. For a one-tailed test, place all of α in one tail of the z distribution. To detect a shift in distribution 1 to the right of distribution 2, let U = U1 and reject H0 when z < -z(α). To detect a shift in the opposite direction, let U = U1 and reject H0 when z > z(α).

Tabulated values of z are given in Table 3 in the Appendix.

(74)

12.3 Mann-Whitney-Wilcoxon Test

Example:

\[ n_1 = 4, \quad n_2 = 4, \quad n_1 + n_2 = 8 \]

\[ U_1 = (4)(4) + \frac{4(4 + 1)}{2} - 12 = 14, \qquad U_2 = (4)(4) - 14 = 2 \]

U = 2 (the smaller of U1 and U2)

Two-tailed test, α = 0.05
One-tailed: P(U ≤ 1) = 0.0286
Two-tailed: 2P(U ≤ 1) = 0.0572

Fail to reject H₀ of no difference between training programs.
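The tail probability used in this example can be reproduced by brute-force enumeration. The sketch below assumes no ties, so that under the null hypothesis every choice of n1 ranks out of n1 + n2 is equally likely; `exact_lower_tail` is a hypothetical helper, not a library function.

```python
from itertools import combinations


def exact_lower_tail(n1, n2, u0):
    """P(U1 <= u0) under H0 by enumerating all rank assignments (no ties)."""
    total = 0
    hits = 0
    for subset in combinations(range(1, n1 + n2 + 1), n1):
        u1 = n1 * n2 + n1 * (n1 + 1) // 2 - sum(subset)
        total += 1
        if u1 <= u0:
            hits += 1
    return hits / total


p_one_tailed = exact_lower_tail(4, 4, 1)
print(f"P(U <= 1) = {p_one_tailed:.4f}, doubled = {2 * p_one_tailed:.4f}")
```

Only 2 of the 70 possible rank assignments give U ≤ 1, so an observed U of 2 does not fall in the rejection region.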

(75)

The Number of Subjects with Malignant Neoplasms in the Beta Carotene Component of the Physician’s Health Study

Malignant Neoplasm     Beta Carotene     Placebo
N                      11036             11035
Yes                    1273              1293
  Year 1-2             120               130
  Year 3-4             157               136
  Year 5-9             500               567
  >=10 years           496               460
No                     9763              9742

(76)

The Number of Subjects with Clinical Improvement

                       rt-PA     Placebo
Part 1
  0 – 90 Min     N     71        68
                 Yes   36        31
                 No    35        37
  91 – 180 Min   N     73        79
                 Yes   31        26
                 No    42        53
Part 2
  0 – 90 Min     N     86        77
                 Yes   51        30
                 No    35        47
  91 – 180 Min   N     82        88
                 Yes   29        35

(77)

Summary of Frequencies of Transition of Status Score from Baseline to Visit 3

Treatment: Placebo

Visit 3

Baseline     Terrible   Poor   Fair   Good   Excellent   Total
Terrible     0          0      0      0      0           0
Poor         4          2      3      1      1           11
Fair         4          1      9      4      2           20
Good         1          3      2      4      9           19
Excellent    2          0      0      2      3           7
Total        11         6      14     11     15          57

Treatment: Test Drug

Visit 3

Baseline Terrible Poor Fair Good Excellent Total

Terrible 0 0 2 1 0 3

Poor 0 3 2 1 4 10

Fair 2 0 5 3 8 18

(78)

Statistical Testing Procedures

---Categorical data

 Within-group

 McNemar test for two categories

 Stuart-Maxwell-Bhapkar test for more than two categories

 Between-group

 2×2 Table

 Fisher’s exact test

 Pearson’s chi-square test

 2×2 tables for ordered categories

 Mantel-Haenszel test

  Logistic regression
  Poisson regression
  Log-linear model

(79)

Data Structure of Binary Endpoint for

a Parallel Two-group Trial

Binary Response

Treatment     No            Yes           Total
Test Drug     Y₁₀ (P₁₀)     Y₁₁ (P₁₁)     Y₁. (1)
Placebo       Y₂₀ (P₂₀)     Y₂₁ (P₂₁)     Y₂. (1)

(80)

Categorical Data



Large sample tests---One sample

\[ z = \frac{p - p_0}{\sqrt{v(p)}}, \qquad |z| \ge z(\alpha/2) \]



Paired samples---McNemar test

\[ X_M^2 = \frac{(y_{10} - y_{01})^2}{y_{10} + y_{01}}, \qquad X_M^2 \ge \chi^2(\alpha, 1) \]

(81)

Computation of Statistics for Hypothesis Based on Binary Data for a Parallel Two-group Trial

                           Binary Response
Treatment                  No                      Yes                     Total
Test Drug    O             y₁₀                     y₁₁                     y₁.
             E             m₁₀                     m₁₁
             O-E           y₁₀ - m₁₀               y₁₁ - m₁₁
             (O-E)²/E      (y₁₀ - m₁₀)²/m₁₀        (y₁₁ - m₁₁)²/m₁₁
Placebo      O             y₂₀                     y₂₁                     y₂.
             E             m₂₀                     m₂₁
             O-E           y₂₀ - m₂₀               y₂₁ - m₂₁
             (O-E)²/E      (y₂₀ - m₂₀)²/m₂₀        (y₂₁ - m₂₁)²/m₂₁
Total                      y.₀                     y.₁                     y..

X²_P = sum of the four (O-E)²/E cells

(82)

Two independent samples

Large sample test with cell size greater than 5

  Pearson’s chi-square test

\[ X_P^2 = \sum_{i=1}^{2} \sum_{j=0}^{1} \frac{(y_{ij} - m_{ij})^2}{m_{ij}}, \qquad m_{ij} = \frac{y_{i.}\, y_{.j}}{N}, \quad i = 1, 2 \text{ and } j = 0, 1, \qquad X_P^2 > \chi^2(\alpha, 1) \]

  Randomization chi-square test

\[ X_R^2 = \frac{(y_{11} - m_{11})^2}{v_{11}} = \frac{N - 1}{N}\, X_P^2, \qquad v_{11} = \frac{y_{1.}\, y_{2.}\, y_{.0}\, y_{.1}}{N^2 (N - 1)} \]

  Fisher’s exact test

Computation of p-value for the 2×2 tables at least as extreme as the
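To illustrate the two large-sample statistics, the sketch below applies them to one 2×2 table from the clinical improvement data shown earlier (Part 1, 0 to 90 minutes: rt-PA 36 improved and 35 not; placebo 31 improved and 37 not). The function name and argument order are hypothetical.

```python
def chi_square_2x2(y11, y10, y21, y20):
    """Pearson and randomization chi-square statistics for a 2x2 table.

    y11/y10: test-group responders/non-responders;
    y21/y20: control-group responders/non-responders.
    """
    row = [y11 + y10, y21 + y20]
    col = [y10 + y20, y11 + y21]  # col[0] non-responders, col[1] responders
    n = sum(row)
    observed = [[y10, y11], [y20, y21]]
    x2_pearson = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            x2_pearson += (observed[i][j] - expected) ** 2 / expected
    m11 = row[0] * col[1] / n
    v11 = row[0] * row[1] * col[0] * col[1] / (n ** 2 * (n - 1))
    x2_random = (y11 - m11) ** 2 / v11
    return x2_pearson, x2_random


x2_p, x2_r = chi_square_2x2(36, 35, 31, 37)
print(f"Pearson: {x2_p:.4f}, randomization: {x2_r:.4f}")
```

The two statistics differ only by the factor (N - 1)/N, so they are nearly identical unless the table is small.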

(83)

Confidence Intervals – Categorical Data



One-sample

  Large-sample (n ≥ 30)

\[ p \pm z(\alpha/2) \sqrt{v(p)}, \qquad v(p) = \frac{1}{n}\, p (1 - p) \]

  Paired sample

\[ (p_{10} - p_{01}) \pm z(\alpha/2) \sqrt{v}, \qquad v = \frac{1}{n} \left[ (p_{10} + p_{01}) - (p_{10} - p_{01})^2 \right] \]

(84)

Confidence Intervals – Categorical Data



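Two independent samples

  Difference in proportions

\[ (p_{11} - p_{21}) \pm z(\alpha/2) \sqrt{v(d)}, \qquad v(d) = \frac{p_{11} (1 - p_{11})}{n_1} + \frac{p_{21} (1 - p_{21})}{n_2} \]

  Relative risk

\[ \exp\left\{ \ln(RR) \pm z(\alpha/2) \sqrt{\frac{1 - p_{11}}{y_{11}} + \frac{1 - p_{21}}{y_{21}}} \right\}, \qquad RR = \frac{p_{11}}{p_{21}} \]

  Odds ratio

\[ \exp\left\{ \ln(OR) \pm z(\alpha/2) \sqrt{\frac{1}{y_{11}} + \frac{1}{y_{10}} + \frac{1}{y_{21}} + \frac{1}{y_{20}}} \right\}, \qquad OR = \frac{p_{11}\, p_{20}}{p_{10}\, p_{21}} \]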
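A minimal sketch of the three intervals, assuming large samples. It is applied to counts that appear elsewhere in these slides: 147 of 312 rt-PA and 122 of 312 placebo subjects with improvement (totals of the four strata in the NINDS table), the beta carotene counts 1273 of 11036 versus 1293 of 11035, and the aspirin MI counts. Function names are illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # z(0.025) for a 95% interval


def diff_ci(y11, n1, y21, n2, z=Z):
    """Difference in proportions with its confidence limits."""
    p1, p2 = y11 / n1, y21 / n2
    d = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return d, d - z * se, d + z * se


def rr_ci(y11, n1, y21, n2, z=Z):
    """Relative risk with limits computed on the log scale."""
    p1, p2 = y11 / n1, y21 / n2
    rr = p1 / p2
    se_log = sqrt((1 - p1) / y11 + (1 - p2) / y21)
    return rr, exp(log(rr) - z * se_log), exp(log(rr) + z * se_log)


def or_ci(y11, n1, y21, n2, z=Z):
    """Odds ratio with limits computed on the log scale."""
    y10, y20 = n1 - y11, n2 - y21
    odds_ratio = (y11 * y20) / (y10 * y21)
    se_log = sqrt(1 / y11 + 1 / y10 + 1 / y21 + 1 / y20)
    return odds_ratio, exp(log(odds_ratio) - z * se_log), exp(log(odds_ratio) + z * se_log)


ninds_diff = diff_ci(147, 312, 122, 312)
beta_rr = rr_ci(1273, 11036, 1293, 11035)
aspirin_or = or_ci(139, 11037, 239, 11034)
print([round(x, 4) for x in ninds_diff])
print([round(x, 2) for x in beta_rr])
print([round(x, 3) for x in aspirin_or])
```

Ratios are exponentiated back from the log scale, which is why their intervals are not symmetric around the point estimate.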

(85)

Comparison of Proportions of Subject

with Improvement for the NINDS Trial

Treatment     N       Improvement      Difference (Variance)     95% Confidence Interval

rt-PA         312     147 (47.12%)     0.0801 (0.0016)           (0.0027, 0.1576)

(86)

Summary of Estimated Odds Ratio and Relative Risk for Malignant Neoplasm Due to Beta-carotene in U.S. Physicians’ Study

Beta-carotene

(N = 11036) Placebo (N = 11036) Odds Ratio (95% C.I.) Relative Risk (95% C.I.)

1273 (11.53%) 1293 (11.72%) 0.98 (0.91, 1.07) 0.98 (0.92, 1.06)

(87)

Data Structure of Binary Endpoint with

H strata for a Parallel Two-group Trial

              Binary Response
Treatment     No                 Yes                Total
Test Drug     Y_h10 (P_h10)      Y_h11 (P_h11)      Y_h1. (1)
Placebo       Y_h20 (P_h20)      Y_h21 (P_h21)      Y_h2. (1)
Total         Y_h.0              Y_h.1              Y_h..

(88)

Combining Results of 2×2 Tables

from Different Strata

Mantel-Haenszel’s Technique

 For each 2×2 Table, following the randomization chi-square

test, compute the expected number of patients who respond to the test drug and variance of the observed number of patients who respond to the test drug.

 Add the observed numbers, expected numbers, and variances

over all strata.

 Square the difference between the sums of the observed and

expected numbers.

 Divide the squared difference in sums by the sum of the

(89)

Summary of Results of Binary

Endpoint from H Strata

Strata     Observed Frequency     Expected Frequency               Difference            Variance
1          y₁₁₁                   m₁₁₁ = y₁₁. y₁.₁ / N₁            y₁₁₁ - m₁₁₁           v₁
...
h          y_h11                  m_h11 = y_h1. y_h.1 / N_h        y_h11 - m_h11         v_h
...
H          y_H11                  m_H11 = y_H1. y_H.1 / N_H        y_H11 - m_H11         v_H
Sum        Σy_h11                 Σm_h11                           Σ(y_h11 - m_h11)      Σv_h

(90)

Combining 2×2 tables across strata



Cochran-Mantel-Haenszel test

\[ X_{MH}^2 = \frac{\left[ \sum_{h=1}^{H} (y_{h11} - m_{h11}) \right]^2}{\sum_{h=1}^{H} v_h}, \qquad X_{MH}^2 > \chi^2(\alpha, 1) \]

\[ m_{h11} = \frac{y_{h1.}\, y_{h.1}}{N_h}, \qquad v_h = \frac{y_{h1.}\, y_{h2.}\, y_{h.0}\, y_{h.1}}{N_h^2 (N_h - 1)}, \qquad h = 1, \ldots, H \]

Common odds ratio:

\[ OR_c = \exp(a), \qquad a = \frac{\sum_{h=1}^{H} (y_{h11} - m_{h11})}{\sum_{h=1}^{H} v_h} \]

100(1 - α)% confidence interval:

\[ \exp\left\{ a \pm \frac{z(\alpha/2)}{\sqrt{\sum_{h=1}^{H} v_h}} \right\} \]

(91)

Summary of Comparison in Proportions of the Subjects with Clinical Improvement between rt-PA and Placebo, Adjusted for Part and Time from Onset to Treatment, in the NINDS Trial

Part   Time (min)    Observed Frequency (Oh)   Expected Frequency (Eh)   Oh – Eh    Vh         (Oh – Eh)²/Vh
1      0 – 90        36                        34.2230                   1.7770     8.7351     0.3615
1      91 – 180      31                        27.3750                   3.6250     8.9513     1.4680
2      0 – 90        51                        42.7362                   8.2638     10.2188    6.6829
2      91 – 180      29                        30.8706                   -1.8706    10.0230    0.3491
Sum                  147                       135.2048                  11.7952    37.9281    8.8615

The observed and expected frequencies refer to subjects with an improvement in the rt-PA group.
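The per-stratum columns of this table can be recomputed from the four 2×2 tables of clinical improvement given earlier. The sketch below follows the Mantel-Haenszel steps (sum the observed counts, expected counts, and variances, then square the summed difference and divide by the summed variance); the tuple layout and function name are illustrative choices.

```python
# (rt-PA improved, rt-PA N, placebo improved, placebo N) for each stratum
strata = [
    (36, 71, 31, 68),  # Part 1, 0-90 min
    (31, 73, 26, 79),  # Part 1, 91-180 min
    (51, 86, 30, 77),  # Part 2, 0-90 min
    (29, 82, 35, 88),  # Part 2, 91-180 min
]


def mantel_haenszel(tables):
    """Return summed O, E, V and the Mantel-Haenszel chi-square statistic."""
    sum_o = sum_e = sum_v = 0.0
    for y11, n1, y21, n2 in tables:
        n = n1 + n2
        responders = y11 + y21
        non_responders = n - responders
        expected = n1 * responders / n
        variance = n1 * n2 * responders * non_responders / (n ** 2 * (n - 1))
        sum_o += y11
        sum_e += expected
        sum_v += variance
    x2_mh = (sum_o - sum_e) ** 2 / sum_v
    return sum_o, sum_e, sum_v, x2_mh


sum_o, sum_e, sum_v, x2_mh = mantel_haenszel(strata)
print(f"O = {sum_o:.0f}, E = {sum_e:.4f}, V = {sum_v:.4f}, X2_MH = {x2_mh:.4f}")
```

Note that the Mantel-Haenszel statistic squares the summed difference before dividing by the summed variance, so it is not the same quantity as the total of the last column, which adds the stratum-level values.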

(92)

Computation of Kaplan-Meier Survival

 Divide the time into intervals by the time points where the

pre-defined event (death) occurred.

 For each interval, count the number of the patients who were

alive at the beginning of the interval and the number of the patients who were still alive at the end of the interval.

  Compute the survival rate for each interval as the number of the patients still alive at the end of the interval divided by the number of the patients alive at the beginning of the interval.

 For the time point where pre-defined event occurred, the

(93)

Time in Months to Progression of the Patients with Stage II or IIIA Ovarian Carcinoma by Low-grade or Well-differentiated Cancer

Patient Number Time in Months Censored Cell Grade

1 0.92 Yes Low Grade

6 12.40 No Low Grade

(94)

19 6.55 Yes High Grade

22 9.84 No High Grade

(95)

Data Layout for Computation of Kaplan-Meier Estimates of Survival Function

Ordered Distinct    Number of    Number of Censored     Number in
Event Time          Events       in [y(k), y(k+1)]      Risk Set     S(y)
y(0) = 0            d₀ = 0       m₀                     n₀           1
y(1)                d₁           m₁                     n₁           1 - d₁/n₁
y(2)                d₂           m₂                     n₂           (1 - d₁/n₁)(1 - d₂/n₂)
...
y(k)                d_k          m_k                    n_k          (1 - d₁/n₁)(1 - d₂/n₂)…(1 - d_k/n_k)

(96)

Computation of Kaplan-Meier Estimates of Survival Function for Patients with Low-grade Cancer

Ordered Distinct     Number of    Number of Censored     Number in
Progression Time     Events       in [y(k), y(k+1)]      Risk Set     S(y)
0                    0            0                      15           1
0.92                 1            0                      15           0.9333
2.93                 1            0                      14           0.8667
5.76                 1            0                      13           0.8000
6.41                 1            0                      12           0.7333
10.16                1            4                      11           0.6667
15.20                1            1                      6            0.5556

(97)

Kaplan-Meier Estimates of Proportions for Patients with Ovarian Carcinoma by Low-grade or Well-differentiated Cancer

Time in Months   Censored   Estimated Proportion   Estimated Variance   Standard Error   Lower 95% Limit   Upper 95% Limit
0.92             No         0.93333                0.004148             0.06440          0.80710           1.00000
2.93             No         0.86667                0.007703             0.08777          0.69464           1.00000
5.76             No         0.80000                0.010666             0.10328          0.59758           1.00000
6.41             No         0.73333                0.013037             0.11418          0.50955           0.95712
10.16            No         0.66667                0.014814             0.12171          0.42811           0.90523
12.40            Yes        0.66667
12.93            Yes        0.66667
13.85            Yes        0.66667
14.70            Yes        0.66667
15.20            No         0.55556                0.020575             0.14344          0.27441           0.83670
23.32            Yes        0.55556
24.47            Yes        0.55556
25.33            Yes        0.55556
36.38            Yes        0.55556
39.67            Yes        0.55556
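The variance, standard error, and confidence limit columns are consistent with Greenwood's formula, Var[S(t)] = S(t)² Σ d_k / [n_k (n_k − d_k)], followed by a normal-approximation interval truncated to [0, 1]. The lecture does not name the formula, so the sketch below is written under that assumption; `risk_table` and `greenwood_intervals` are illustrative names.

```python
import math

# (event time, number at risk, number of events) for the low-grade group.
risk_table = [(0.92, 15, 1), (2.93, 14, 1), (5.76, 13, 1),
              (6.41, 12, 1), (10.16, 11, 1), (15.20, 6, 1)]


def greenwood_intervals(risk_table, z=1.96):
    """Kaplan-Meier estimate with Greenwood variance and a truncated normal CI.

    Assumes n > d at every event time (otherwise the variance term is undefined).
    """
    survival = 1.0
    running_sum = 0.0  # cumulative sum of d / (n * (n - d))
    out = []
    for t, n, d in risk_table:
        survival *= 1.0 - d / n
        running_sum += d / (n * (n - d))
        variance = survival ** 2 * running_sum
        se = math.sqrt(variance)
        lower = max(0.0, survival - z * se)
        upper = min(1.0, survival + z * se)
        out.append((t, survival, variance, se, lower, upper))
    return out


intervals = greenwood_intervals(risk_table)
for t, s, var, se, lo, hi in intervals:
    print(f"{t:6.2f}  S = {s:.5f}  var = {var:.6f}  se = {se:.5f}  CI = ({lo:.5f}, {hi:.5f})")
```

The truncation at 1 explains why the first three upper limits in the table are exactly 1.00000.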

(98)

Statistical Testing Procedures

---Censored data



Within-group

 France-Lewis-Kay and Liu-Chow test



Between-group

 Log-rank test: difference later in time

 Gehan’s test: difference early in time

 Cox’s proportional hazards model

(99)

Logrank Test for Comparison of

Two Independent Survival Curves



Divide the time into intervals by the time points where the pre-defined event (death) occurred.

This gives a series of 2×2 tables stratified by the time points where the pre-defined event occurred.

Apply the Mantel-Haenszel technique to combine the results.

(100)

Data Structure of Comparing Two Survival Functions at y(k) by Log-rank Method

                    Status
Treatment    Event    No Event         Total
Test Drug    d_1k     n_1k − d_1k      n_1k
Placebo      d_2k     n_2k − d_2k      n_2k
Total        d_k      n_k − d_k        n_k

k = 1,…, K

(101)

Computation of Log-rank Statistic

Ordered Distinct Event Time   Observed Number of Events   Expected Number of Events   Difference      Variance
y(1)                          d_11                        e_11                        d_11 − e_11     v_11
y(2)                          d_12                        e_12                        d_12 − e_12     v_12
…
y(k)                          d_1k                        e_1k                        d_1k − e_1k     v_1k
Total                         d_1                         e_1                         d_1 − e_1       v_1

where e_1k = n_1k d_k / n_k and v_1k = n_1k n_2k d_k (n_k − d_k) / [n_k² (n_k − 1)].

(102)

Censored Data



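Logrank test statistic

\[
X_{LR} = \frac{(d_1 - e_1)^2}{v_1}, \qquad
d_1 = \sum_{k=1}^{K} d_{1k}, \quad
e_1 = \sum_{k=1}^{K} e_{1k}, \quad
v_1 = \sum_{k=1}^{K} v_{1k},
\]
\[
e_{1k} = \frac{n_{1k}\, d_k}{n_k}, \qquad
v_{1k} = \frac{n_{1k}\, n_{2k}\, d_k\, (n_k - d_k)}{n_k^2\, (n_k - 1)}, \qquad k = 1, \ldots, K.
\]

Reject the null hypothesis at level α if
\[
X_{LR} > \chi^2(\alpha, 1).
\]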

(103)

Confidence Intervals – Censored Data



Hazard ratio

Point estimate:
\[
\hat{\lambda} = \exp\left[\frac{d_1 - e_1}{v_1}\right]
\]

Confidence interval:
\[
\exp\left[\frac{d_1 - e_1}{v_1} \pm \frac{z(\alpha/2)}{\sqrt{v_1}}\right]
\]















(104)

Computation of Log-rank Test Statistic for the Data of the Patients with Stage II or IIIA Ovarian Carcinoma

Time in months   d_1k   d_2k   d_k   n_1k   n_2k   n_k   e_1k      d_1k − e_1k   v_1k
0.92             1      0      1     15     20     35    0.42857   0.57143       0.24490
1.12             0      1      1     14     20     34    0.41176   -0.41176      0.24221
2.89             0      1      1     14     19     33    0.42424   -0.42424      0.24426
2.92             1      0      1     14     18     32    0.43750   0.56250       0.24609
4.51             0      1      1     13     18     31    0.41935   -0.41935      0.24350
5.76             1      0      1     13     17     30    0.43333   0.56667       0.24556
6.41             1      0      1     12     17     29    0.41379   0.58621       0.24257
6.55             0      1      1     11     17     28    0.39286   -0.39286      0.23852
9.21             0      1      1     11     16     27    0.40741   -0.40741      0.24143
9.57             0      1      1     11     15     26    0.42308   -0.42308      0.24408
10.16            1      1      2     11     12     23    0.95652   0.04348       0.47637
11.56            0      1      1     10     11     21    0.47619   -0.47619      0.24943
11.78            0      1      1     10     10     20    0.50000   -0.50000      0.25000
12.14            0      2      2     10     9      19    1.05263   -1.05263      0.47091
12.17            0      1      1     10     7      17    0.58824   -0.58824      0.24221
12.34            0      1      1     10     6      16    0.62500   -0.62500      0.23438
12.57            0      1      1     9      5      14    0.64286   -0.64286      0.22959
12.89            0      1      1     9      4      13    0.69231   -0.69231      0.21302
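Each row of this table follows from the 2×2 table at that event time: the expected number of events in group 1 is n_1k d_k / n_k and the variance is n_1k n_2k d_k (n_k − d_k) / [n_k² (n_k − 1)]. A minimal Python sketch that recomputes three of the rows is given below; the function name `logrank_terms` is illustrative.

```python
def logrank_terms(n1, n2, d):
    """Expected events in group 1 and the variance at one event time.

    n1, n2 : numbers at risk in groups 1 and 2 just before the event time
    d      : total number of events (both groups) at the event time
    """
    n = n1 + n2
    expected = n1 * d / n
    variance = n1 * n2 * d * (n - d) / (n ** 2 * (n - 1))
    return expected, variance


# (time, d_1k, d_2k, n_1k, n_2k) for three rows of the table.
selected_rows = [(0.92, 1, 0, 15, 20), (10.16, 1, 1, 11, 12), (12.14, 0, 2, 10, 9)]
for t, d1, d2, n1, n2 in selected_rows:
    e1, v1 = logrank_terms(n1, n2, d1 + d2)
    print(f"{t:6.2f}  e_1k = {e1:.5f}  d_1k - e_1k = {d1 - e1:.5f}  v_1k = {v1:.5f}")
```

Summing the differences and the variances over all event times gives the totals d_1 − e_1 and v_1 that enter the log-rank statistic.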

(105)

Logrank test

\[
X_{LR} = \frac{(-5.33279)^2}{5.10898} = 5.5664 > \chi^2(0.05, 1) = 3.84
\]

Estimate of hazard ratio is

\[
\exp[-5.33279/5.10898] = 0.35211
\]

95% CI for hazard ratio

\[
\exp\left[(-5.33279/5.10898) \pm 1.96/\sqrt{5.10898}\right] = (0.14794, 0.83805)
\]
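The three quantities on this slide depend only on the totals d_1 − e_1 = −5.33279 and v_1 = 5.10898. The sketch below recomputes them in Python; the function name `logrank_summary` is illustrative, and 1.96 is the usual normal quantile for a 95% interval.

```python
import math


def logrank_summary(diff_sum, var_sum, z=1.96):
    """Log-rank chi-square, hazard ratio estimate, and its confidence interval.

    diff_sum : total of observed minus expected events in group 1 (d_1 - e_1)
    var_sum  : total variance (v_1)
    """
    chi_square = diff_sum ** 2 / var_sum
    log_hr = diff_sum / var_sum
    half_width = z / math.sqrt(var_sum)
    ci = (math.exp(log_hr - half_width), math.exp(log_hr + half_width))
    return chi_square, math.exp(log_hr), ci


chi_square, hazard_ratio, ci = logrank_summary(-5.33279, 5.10898)
print(f"X_LR = {chi_square:.4f}, significant at 5%: {chi_square > 3.84}")
print(f"hazard ratio = {hazard_ratio:.4f}, 95% CI = ({ci[0]:.5f}, {ci[1]:.5f})")
```

A hazard ratio below 1 with an interval that excludes 1 agrees with the rejection of the null hypothesis by the chi-square test.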

(106)

Statistical Testing Procedures

---Longitudinal data



Multivariate analysis of variance



Regression methods



Random effects models



Repeated measurement models



Time series



Proportional odds models

(107)

Robustness of Statistical

Analysis Reproducibility

 All statistical procedures are derived from some assumptions

 Robustness of statistical procedures

 Use of statistical methods with fewer assumptions to be verified

 Use of model-free methods vs. model-dependent methods

 Use of simpler parsimonious models

(108)

Robustness of Statistical

Analysis Reproducibility

 Robustness of Results

 Consistency in

 Estimated treatment effects

 Primary conclusion of the trial

 With respect to

 different statistical procedures

parametric vs. nonparametric

 different statistical assumptions

normal vs. non-normal

 Limitation of the data

sub-group analyses

 Different analyzed datasets

(109)

(110)

(111)

(112)

(113)

(114)

Two-sided versus One-sided Hypotheses

Evans et al (NEJM 1997; 337: 1412-8)



Randomized, double-blind, two parallel

groups



Objective

To compare the low-dose inhaled budesonide (400 μg bid) plus theophylline (250 or 375 mg bid) and high-dose inhaled budesonide (800 μg bid) for

(115)

Two-sided versus One-sided Hypotheses

 Two-sided hypothesis

 Null hypothesis

No difference in average FVC between the two groups

 Alternative hypothesis

There exists a difference in average FVC between the two groups

 One-sided hypothesis

 Null hypothesis

No difference in average FVC between the two groups

 Alternative hypothesis

The low-dose inhaled budesonide plus theophylline improves FVC more than the high-dose inhaled budesonide

(116)

Two-sided versus One-sided Hypotheses

 Two-sided hypotheses are to evaluate the existence of a difference between the test drug and control. The difference may be either positive (better) or negative (worse).

 Two-sided hypotheses are for newly developed pharmaceuticals with unknown and unproved efficacy and safety.

 One-sided hypotheses are to prove that the

(117)

Two-sided versus One-sided Hypotheses

 With a significance level of 5%, the level of proof is 1/20 and 1/40 for one-sided and two-sided hypotheses, respectively. For two trials (US FDA requirement), the level of proof is 1/400 and 1/1600 for one-sided and two-sided hypotheses, respectively.

 With a significance level of 5%, the required sample size is about 27% larger (at 80% power) for a two-sided hypothesis than for a one-sided hypothesis.

 Most regulatory agencies suggest (require) two-sided hypotheses for approval.

 The protocol needs to specify whether one-sided or two-sided hypotheses are used and provide the justification if a one-sided hypothesis is elected.
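The 27% figure and the levels of proof can be checked numerically. The sketch below assumes the usual normal-approximation sample size formula, in which the required size is proportional to (z_α + z_β)², and a power of 80%; neither assumption is stated on the slide. The function name `sample_size_ratio` is illustrative.

```python
from statistics import NormalDist


def sample_size_ratio(alpha=0.05, power=0.80):
    """Ratio of two-sided to one-sided sample size at the same alpha and power.

    Assumes n is proportional to (z_alpha + z_beta)^2 (normal approximation).
    """
    z = NormalDist().inv_cdf
    z_beta = z(power)
    two_sided = (z(1 - alpha / 2) + z_beta) ** 2
    one_sided = (z(1 - alpha) + z_beta) ** 2
    return two_sided / one_sided


ratio = sample_size_ratio()
print(f"two-sided / one-sided sample size: {ratio:.3f}")

# Chance of a false claim in the favorable direction for a single trial.
one_sided_level = 0.05       # 1/20
two_sided_level = 0.05 / 2   # 1/40
# Two independent trials both have to succeed, so the levels multiply.
two_trial_levels = (one_sided_level ** 2, two_sided_level ** 2)  # 1/400, 1/1600
print(two_trial_levels)
```

Under these assumptions the increase depends on the chosen power; with a higher power the relative increase is somewhat smaller.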

(118)

Missing Values

Little and Rubin (1987), Little (1995)



Missing patterns

 Dropouts

The data are missing after the visit where patients withdrew from the study

 Intermittent

The patients complete the study but a few visits are missed by the patients

(119)

Missing Values

Little and Rubin (1987), Little (1995)



Missing patterns

 Incomplete data

One or a few items of some scales or scores (NIHSS, ADAS, HAM-D, nasal symptom) are missing.

Diary entries for symptoms, PEFR, or awakenings are missing for a few days.

(120)

Missing Mechanism

Y_o: data observed

Y_m: data supposed to be observed but missing

 Missing completely at random (MCAR)

Missing mechanism is independent of both Y_o and Y_m (ignorable data)

Inference on complete data is valid but less efficient (Lachin, 1988)

 Missing at random (MAR)

Missing mechanism is independent of Y_m only.

e.g., older males are more likely to have missing values than young males or females

(121)

Missing Mechanism



Informative missing

Missing mechanism is dependent on Y_m, e.g., termination due to hepatotoxicity.



Assumptions:

Missing patterns: Diggle and Kenward (1994)

Completely random dropout: Diggle (1989), Ridout (1991).

(122)

Analysis Sets

 Intention-to-treat (all randomized) sets

All randomized patients according to their randomized treatments

 Per protocol (evaluable) sets

Inclusion of the patients if

 Requirement of a minimal exposure to the treatments

 The availability of measurements of the primary efficacy

endpoints

 The absence of major protocol violations

 The choice of analysis sets

 To minimize bias

(123)

Analysis Sets

 The intention-to-treat set >= the per protocol set

 For a superiority trial

 The intention-to-treat set is conservative

 The per protocol set maximizes the chance of proving the efficacy

 For an equivalence trial, the roles are reversed

 Perform both analyses – sensitivity

 Consistent results – increase confidence

(124)

Lachin (2000)



An unbiased trial

 unbiased in estimation of the treatment effects

 unbiased in testing, i.e., controlling the type I error rate



Randomization

 a sufficient condition for provision of an unbiased trial

(125)

Lachin (2000)



Two other necessary conditions

 The outcomes should be evaluated in a like and unbiased manner

Blinding (or masking)

 Missing data, if any, do not bias the comparison between treatments

(126)

Lachin (2000)



Available methods (Simon and Simonoff, 1986; Little, 1988) can disprove MCAR but cannot prove it.

Many other assumptions for MCAR or MAR are in fact untestable.

Many methods for imputation of missing values are also untestable.

(127)

Lachin (2000)



Last Observation Carried Forward (LOCF)

Another imputation method

Assumes that the last observation is an unbiased estimate of the missing values
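Mechanically, LOCF replaces each missing visit with the most recent observed value of the same patient. A minimal Python sketch follows; the function name `locf` and the example readings are hypothetical and serve only to show the mechanics, not to endorse the method.

```python
def locf(visits):
    """Carry the last observed value forward; None marks a missing visit.

    Visits before the first observed value stay None because there is
    nothing to carry forward.
    """
    filled = []
    last = None
    for value in visits:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical FEV1 readings (liters) for a patient who dropped out after visit 3.
dropout = [2.1, 2.3, 2.4, None, None]
print(locf(dropout))

# Hypothetical intermittent pattern: visit 2 was missed, later visits were attended.
intermittent = [2.1, None, 2.5, 2.6]
print(locf(intermittent))
```

The imputed values are only as good as the assumption that the outcome would have stayed at its last observed level, which is one of the untestable assumptions discussed above.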

(128)

Lachin (2000)



The only complete solution of the “missing data” problem is not to have them (Cochran, 1957, p. 82)

The best way to deal with the problem of missing data is to have as little missing data as possible (Lachin, 2000; Begg, 2000)

(129)

(130)

(131)

(132)

(133)

(134)

Intention-to-treat Analysis and

Design

Peto, et al (1976); Lachin (2000)

 Intention-to-treat analysis

All patients are included in the analysis, including those who:

did not receive treatment

received the wrong treatment

withdrew due to AE or other reasons

(135)

Intention-to-treat Analysis and

Design

Peto, et al (1976); Lachin (2000)

 Intention-to-treat design

 The ITT principle requires complete follow-up of all randomized patients

 All patients should be followed and all scheduled evaluations should be performed until the death of the patient or the end of the study, irrespective of withdrawal from treatment

 In an ITT design, withdrawal from treatment does not

(136)

(137)

Adjustment of Covariates

 Covariates are factors that affect the primary efficacy endpoints

 prognostic, risk, or confounding factors

 age, gender, race, disease severity, etc.

 Patient-specific covariates

Covariates measured before randomization

 Baseline FEV₁, FVC, etc.

 Time-dependent covariates

Covariates measured after randomization

 May be affected by the treatments

 CD

(138)

Adjustment of Covariates

 Stratification based on known covariates before randomization and conduct of the trials.

 Adjustment for covariates in the analysis improves the precision of the estimated treatment effects.

 The estimated treatment effect is unbiased without adjustment for covariates as long as assignment of treatments is random.

 Avoid adjusting the primary endpoints for covariates measured after randomization

(139)

Baseline Comparisons

 Objectives

 Description of patient characteristics with respect to inclusion and exclusion criteria – targeted population

 Measurements of initial disease severity

 Comparability between treatment groups

 Referenced values

 Change from baseline

(140)

Baseline Comparisons

 Baseline Data

 Demographic data: age, gender, race

 Disease factors

 Entry criteria

 Duration, stage, severity of disease

 Baseline values of primary efficacy and safety endpoints

 Concomitant illness

 Relevant previous diseases

 Relevant previous treatments

(141)

Baseline Comparisons



Multiple Baseline

The FDA guidelines (Boyarsky and Paulson, 1987) for benign prostatic hyperplasia

 Baseline measurements are collected in a placebo run-in period of 28 days.

 Stability of the disease state

 Placebo effects

 Existence

 Estimation

(142)

Baseline Comparisons

 Stabilization of baseline

Use of the Hotelling T² statistic with a Helmert matrix for p multiple baselines. In the jth row:

 The first j-1 elements are 0

 The jth element is 1

 The rest are 1/(j-p), j=1,…, p-1.

 Combination of stabilized baseline

 Clinical judgment

 Average of three measurements of diastolic BP at 1-minute intervals if each measurement does not differ from the average by more than 5 mm Hg.
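A Helmert-type contrast matrix for p baseline measurements can be built row by row. The sketch below assumes that the elements after the jth position equal −1/(p − j), so that each row sums to zero and compares baseline j with the mean of the later baselines; this reading of the slide is an assumption, and the function name `helmert_rows` is illustrative. The Hotelling T² test itself is not implemented here.

```python
def helmert_rows(p):
    """Return a (p-1) x p contrast matrix as a list of rows.

    Row j (j = 1, ..., p-1) has j-1 zeros, then 1, then p-j entries equal to
    -1/(p-j), so it contrasts baseline j with the mean of baselines j+1..p.
    """
    rows = []
    for j in range(1, p):
        row = [0.0] * (j - 1) + [1.0] + [-1.0 / (p - j)] * (p - j)
        rows.append(row)
    return rows


def apply_contrasts(rows, baselines):
    """Contrast values for one patient's vector of p baseline measurements."""
    return [sum(c * y for c, y in zip(row, baselines)) for row in rows]


contrast_matrix = helmert_rows(3)
for row in contrast_matrix:
    print(row)

# Hypothetical diastolic BP baselines (mm Hg) for one patient.
print(apply_contrasts(contrast_matrix, [92.0, 90.0, 88.0]))
```

If the baseline is stable, the contrasts of every patient should scatter around zero; the multivariate test then asks whether their mean vector differs from zero.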

(143)

Statistical justifications

Generalized least squares procedure (O’Brien, 1984)

1 ' ' ' 1 1 ' 1 ' 1 ' 1 ' 1 ' 1 ' 1 1 1 1 1 average Simple ) 1 1 ( 1 variance Estimated 1 1 1 (EGLS) squares least d generalize Estimated 1 1 as given is Variance 1 1 1 S Y Z S P N H N v S Y S X Y X p p i p i p p p x p p p i p p i i p p p i p i                  

(144)

Issues of Active Control Trials

 Efficacy of a treatment should be established against placebo

 Equivalence or non-inferiority of test treatment to the active control

 Both superior to placebo

 Both inferior to placebo

 The efficacy of the active control in the relevant indication has been clearly established and quantified in well-designed and well-documented superiority trials, and the active control can be reliably expected to exhibit similar efficacy in the contemplated active control trial

(145)

Issues of Active Control Trials

 Same design features

 Inclusion/exclusion criteria

 Dose

 Primary endpoints

 Objective of equivalence or non-inferiority must be stated in the protocols with the equivalence limits

 Inclusion of a concurrent placebo control for internal validity

 Superiority of active concurrent control to placebo

(146)

Multicenter Trials

 Conducted under a single protocol

 SOP for conduct and evaluations

 Centralized data management system

 Questions for analysis

 Separate interpretation for small centers

 Domination by several large centers

 Centers out of line, reasons?

 Trend in wrong direction?

(147)

Multicenter Trials



Purpose of analysis of a multicenter trial

 Verify a consistent treatment effect

 Obtain an estimate of the overall treatment effect

 Fixed or random effects for centers

 Definition of the overall treatment effect

 Stratified analysis

(148)

Multicenter Trials

 Analysis of a multicenter trial under a mixed effects model can be very complicated (Fleiss, 1986) and only approximate results are available (Chakravorti and Grizzle, 1975; Mielke and McHugh, 1965)

 Selection of centers

Requirement of a minimum number of patients

Expertise and experience of investigators

Special equipment provided by center

(149)

Multicenter Trials



Center is more appropriately considered as a fixed effect than as a random effect (Fleiss, 1986; Goldberg and Koury, 1990)

The weights in a stratified analysis may not be optimal in the presence of treatment-by-center interaction; they reflect the efforts of centers to enroll and retain patients and do not represent the composition of the target population.

(150)

Descriptive Statistics for Site j

of a Multicenter Trial

                              Treatment
Statistics             Placebo    Test Drug    Difference
N                      n_Pj       n_Tj
Mean                   Ȳ_Pj       Ȳ_Tj         d_j = Ȳ_Pj − Ȳ_Tj
Standard deviation     s_Pj       s_Tj         s_j
Confidence interval    CI_Pj      CI_Tj        CI_j

\[
s_j^2 = \frac{(n_{Pj} - 1)\, s_{Pj}^2 + (n_{Tj} - 1)\, s_{Tj}^2}{n_{Pj} + n_{Tj} - 2}
\]
\[
CI_{ij} = \bar{Y}_{ij} \pm \frac{s_{ij}}{\sqrt{n_{ij}}}\; t(\alpha/2,\; n_{ij} - 1), \qquad i = T, P
\]
\[
CI_j = (\bar{Y}_{Pj} - \bar{Y}_{Tj}) \pm s_j \sqrt{w_j}\; t(\alpha/2,\; n_{Pj} + n_{Tj} - 2); \qquad
w_j = (1/n_{Pj}) + (1/n_{Tj})
\]

(151)

Definition of an overall treatment effect

 Simple average of the treatment effects over all centers

Estimate:
\[
\bar{d} = \frac{1}{J} \sum_{j=1}^{J} d_j = \frac{1}{J} \sum_{j=1}^{J} (\bar{Y}_{Pj\cdot} - \bar{Y}_{Tj\cdot})
\]

Estimated variance:
\[
v(\bar{d}) = \frac{1}{J^2} \sum_{j=1}^{J} s_j^2\, w_j
\]

The test based on the point estimate and its estimated variance is still valid regardless of treatment-by-center interaction and sample size.

 Test for qualitative interaction (Gail and Simon, 1985)

\[
Q^- = \sum_{j=1}^{J} \frac{d_j^2}{s_j^2}\, I[d_j > 0] \quad \text{and} \quad
Q^+ = \sum_{j=1}^{J} \frac{d_j^2}{s_j^2}\, I[d_j < 0]
\]
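The simple-average estimate of the overall treatment effect and its estimated variance can be computed from the per-center summaries alone. The sketch below uses the pooled variance s_j² and the weight w_j = 1/n_Pj + 1/n_Tj of each center; the two centers in the example are hypothetical, and the function names are illustrative.

```python
import math


def site_summary(n_p, mean_p, sd_p, n_t, mean_t, sd_t):
    """Return (d_j, s_j^2, w_j) for one center.

    d_j is placebo mean minus test-drug mean, s_j^2 is the pooled variance,
    and w_j = 1/n_Pj + 1/n_Tj.
    """
    d = mean_p - mean_t
    pooled_var = ((n_p - 1) * sd_p ** 2 + (n_t - 1) * sd_t ** 2) / (n_p + n_t - 2)
    w = 1.0 / n_p + 1.0 / n_t
    return d, pooled_var, w


def overall_effect(sites):
    """Simple average of the center effects and its estimated variance."""
    summaries = [site_summary(*site) for site in sites]
    n_centers = len(summaries)
    d_bar = sum(d for d, _, _ in summaries) / n_centers
    variance = sum(s2 * w for _, s2, w in summaries) / n_centers ** 2
    return d_bar, variance


# Hypothetical center summaries: (n_Pj, mean_Pj, sd_Pj, n_Tj, mean_Tj, sd_Tj).
sites = [(10, 5.0, 2.0, 10, 3.0, 2.0),
         (20, 6.0, 2.0, 20, 5.0, 2.0)]
d_bar, variance = overall_effect(sites)
z_statistic = d_bar / math.sqrt(variance)
print(f"overall effect = {d_bar:.3f}, variance = {variance:.3f}, z = {z_statistic:.3f}")
```

Every center receives the same weight 1/J in this estimate, regardless of how many patients it enrolled, which is what distinguishes it from a stratified analysis weighted by center size.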

(152)

Example of Multiplicity

Lepor (NEJM 1996; 335: 533-9)

 Targeted population

Patients with benign prostatic hyperplasia

 Treatment

 Terazosin (10 mg daily), an α1 blocker

 Finasteride (5 mg daily), a 5α-reductase inhibitor

 Design: 2×2 factorial design

Group Terazosin Finasteride

Ⅰ Placebo Placebo

(153)

Example of Multiplicity

 Primary efficacy endpoint

 Sum of AUA symptom score

 Peak urinary flow rate (ml/sec)

 Statistical parameters

 Raw measurements

 Change from baseline

 Time of evaluations

Baseline and 2, 4, 13, 26, 39, and 52 weeks post-randomization

 Sub-groups

 Race: Caucasian vs. non-Caucasian

 Age: < 65 vs. >= 65 years old

(154)

Summary of Possible Number of

Comparisons

Item                      Number of Comparisons
Pairwise Comparison
  Primary                 3
  Secondary               3
Visit                     7
Primary Endpoint          2
Response                  2
Race                      2

Baseline Severity of Disease