
Published online 12 September 2005 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/sim.2358

Tests of equivalence and non-inferiority for diagnostic accuracy based on the paired areas under ROC curves

Jen-Pei Liu(1,2,*,†), Mi-Chia Ma(3,‡), Chin-yu Wu(3) and Jia-Yen Tai(1)

1 Division of Biometry, Department of Agronomy, National Taiwan University, Taipei, Taiwan
2 Division of Biostatistics and Bioinformatics, National Health Research Institutes, Zhunan, Taiwan
3 Department of Statistics, National Cheng-kung University, Tainan, Taiwan

SUMMARY

Assessment of equivalence or non-inferiority in accuracy between two diagnostic procedures often involves comparisons of paired areas under the receiver operating characteristic (ROC) curves. With some pre-specified clinically meaningful limits, the current approach to evaluating equivalence is to perform the two one-sided tests (TOST) based on the difference in paired areas under ROC curves estimated by the non-parametric method. We propose to use the standardized difference for assessing equivalence or non-inferiority in diagnostic accuracy based on paired areas under ROC curves between two diagnostic procedures. The bootstrap technique is also suggested for both the non-parametric method and the standardized difference approach. A simulation study was conducted to investigate empirically the size and power of the four methods for various combinations of distributions, data types, sample sizes, and different correlations. Simulation results demonstrate that the bootstrap procedure of the standardized difference approach not only can adequately control the type I error rate at the nominal level but also provides equivalent power under both symmetrical and skewed distributions. A numerical example using published data illustrates the proposed methods. Copyright © 2005 John Wiley & Sons, Ltd.

KEY WORDS: paired areas under ROC curves; non-parametric method; standardized difference; two one-sided tests

*Correspondence to: Jen-Pei Liu, Division of Biometry, Department of Agronomy, National Taiwan University, 1, Section 4, Roosevelt Road, Taipei, Taiwan.
†E-mail: jpliu@ntu.edu.tw
‡E-mail: mcma@ibm.stat.ncku.edu.tw
§E-mail: cywu@ibm.stat.ncku.edu.tw
E-mail: r93621209@ntu.edu.tw

Contract/grant sponsor: Taiwan National Science Grant; contract/grant numbers: NSC 92-2118-M-006-001, NSC 93-2118-M-006-002

1. INTRODUCTION

For a new diagnostic procedure that is less invasive, less expensive or easier to administer, it is critical to determine whether its diagnostic accuracy is equivalent or non-inferior to that of the current standard procedure. If the objective of a diagnostic trial is to demonstrate that the diagnostic accuracy of the new procedure is within a pre-specified margin of the current standard procedure, the trial is referred to as an equivalence trial. On the other hand, if the purpose is to show that the diagnostic accuracy of the new procedure is not worse than that of an existing diagnostic procedure by more than some pre-determined limit, the trial is referred to as a non-inferiority trial or one-sided equivalence study. To reduce the variability between subjects, similar to bioequivalence studies [1], a matched-pair design is often employed to evaluate the new and the standard diagnostic procedures in the same subjects. It follows that the endpoints for assessing diagnostic accuracy are correlated. Current statistical methods for evaluation of equivalence or non-inferiority are based on paired binary endpoints such as sensitivity, specificity or proportion of correct diagnosis. These methods include those proposed by Liu et al. [2] and Hsueh et al. [3] for the difference in proportions of correct diagnosis and that proposed by Tang et al. [4] for the ratio of proportions of correct diagnosis.

However, sensitivity, specificity, and proportion of correct diagnosis depend upon some specified decision thresholds and cannot provide an overall characterization of the accuracy of the diagnostic procedure. On the other hand, the receiver operating characteristic (ROC) curve is a summary measure for the accuracy of diagnostic procedures. An ROC curve is a plot of the sensitivity (or true positive rate) on the y-axis versus the false positive rate (1 − specificity) on the x-axis in the unit square. The curve is constructed by changing the decision thresholds that define positive and negative test results. Therefore, an ROC curve incorporates both sensitivity and specificity and accounts for the inherent trade-offs between them as the decision thresholds change [5]. Define $X$ and $Y$ as the random variables of the measurements for the diseased patients and non-diseased subjects, respectively. The ROC curve area can then be formulated as

$$\theta = P(X > Y) \tag{1}$$

It follows that the ROC curve area is the probability that a randomly selected diseased patient has a test result indicating greater suspicion than that of a randomly chosen normal subject [6]. Obuchowski [7] and Zhou et al. [8] applied the two one-sided tests (TOST) to evaluate the two-sided equivalence of two diagnostic procedures based on the non-parametric estimates of the paired ROC curve areas proposed by DeLong et al. [9]. However, its performance in terms of size and power was not thoroughly investigated. On the other hand, one might accept a new diagnostic procedure if it can provide a diagnostic accuracy no worse than the standard but at the same time is safer, easier to administer or costs less. Therefore, the one-sided non-inferiority hypothesis is more relevant in assessment of equivalence between diagnostic procedures. Since the ROC curve area is a measure of the separation of the distribution of the diseased patients from that of the non-diseased subjects, we propose to use the standardized difference for evaluation of equivalence and non-inferiority between diagnostic procedures. In addition, a simulation study was conducted to empirically investigate and compare the size and power of the non-parametric and proposed methods and their respective bootstrap versions. In Section 2, the non-parametric method for evaluation of the equivalence and non-inferiority hypotheses is reviewed. The standardized difference approach is given in Section 3. In addition, a bootstrap method is also suggested for the non-parametric method and the standardized difference approach in this section. Simulation results are presented in Section 4. In Section 5, a numerical example using a published data set illustrates the proposed method. Discussion and final remarks are provided in Section 6.

2. EQUIVALENCE AND NON-INFERIORITY HYPOTHESES

Let $\theta_1$ and $\theta_2$ be the paired ROC curve areas for the new and the standard diagnostic tests, respectively. The hypothesis for testing equivalence based on the ROC curve areas between the two diagnostic procedures is given as

$$H_0: \theta_1 - \theta_2 \ge \theta_U \ \text{or}\ \theta_1 - \theta_2 \le \theta_L \quad \text{versus} \quad H_1: \theta_L < \theta_1 - \theta_2 < \theta_U \tag{2}$$

where $\theta_L < \theta_U$ are some pre-determined clinically meaningful equivalence limits. This hypothesis can be further decomposed into two one-sided hypotheses:

$$H_{0l}: \theta_1 - \theta_2 \le \theta_L \quad \text{versus} \quad H_{1l}: \theta_1 - \theta_2 > \theta_L \tag{3}$$

and

$$H_{0u}: \theta_1 - \theta_2 \ge \theta_U \quad \text{versus} \quad H_{1u}: \theta_1 - \theta_2 < \theta_U \tag{4}$$

Because the one-sided hypothesis (3) is to verify that the ROC curve area of the new diagnostic test is not smaller than that of the standard test by more than a pre-specified limit, it is referred to as the non-inferiority hypothesis. Similarly, the one-sided hypothesis (4) is referred to as the non-superiority hypothesis.

Suppose a sample of $N$ ($N = N_A + N_N$) individuals undergo a new and the standard diagnostic procedures for predicting a disease and that the test results are based on continuous measurements. We follow the convention that for both tests, higher values of the results are assumed to be associated with the disease of interest. Also suppose that $N_A$ of these individuals truly have the disease and the other $N_N$ ($= N - N_A$) individuals do not have the disease. Denote $X_{hi}$ as the values of the measurements of diagnostic test $h$ from the diseased patients, $i = 1, \ldots, N_A$; $h = 1$ (new), 2 (standard). $Y_{hj}$ are similarly defined for the non-diseased subjects, $j = 1, \ldots, N_N$; $h = 1$ (new), 2 (standard).

A non-parametric consistent estimate of the ROC curve area for a diagnostic procedure based on Mann–Whitney U statistic is given as [9, 10]

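$$\hat\theta_h = \frac{1}{N_A N_N} \sum_{j=1}^{N_N} \sum_{i=1}^{N_A} \psi(X_{hi}, Y_{hj}) \tag{5}$$

where

$$\psi(X, Y) = \begin{cases} 1 & X > Y \\ \tfrac{1}{2} & X = Y \\ 0 & X < Y \end{cases}$$

An asymptotic estimated variance of $\hat\theta_1 - \hat\theta_2$ is given as [8, 9]

$$\widehat{\mathrm{var}}(\hat\theta_1 - \hat\theta_2) = \frac{1}{N_A}\left[s_{10}^{1,1} + s_{10}^{2,2} - 2 s_{10}^{1,2}\right] + \frac{1}{N_N}\left[s_{01}^{1,1} + s_{01}^{2,2} - 2 s_{01}^{1,2}\right]$$

where

$$s_{10}^{h,h'} = \frac{1}{N_A - 1} \sum_{i=1}^{N_A} \left[V_{10}^{h}(X_i) - \hat\theta_h\right]\left[V_{10}^{h'}(X_i) - \hat\theta_{h'}\right], \quad h, h' = 1, 2$$

$$s_{01}^{h,h'} = \frac{1}{N_N - 1} \sum_{j=1}^{N_N} \left[V_{01}^{h}(Y_j) - \hat\theta_h\right]\left[V_{01}^{h'}(Y_j) - \hat\theta_{h'}\right], \quad h, h' = 1, 2$$

$$V_{10}^{h}(X_i) = \frac{1}{N_N} \sum_{j=1}^{N_N} \psi(X_{hi}, Y_{hj}), \quad i = 1, \ldots, N_A; \ h = 1, 2$$

and

$$V_{01}^{h}(Y_j) = \frac{1}{N_A} \sum_{i=1}^{N_A} \psi(X_{hi}, Y_{hj}), \quad j = 1, \ldots, N_N; \ h = 1, 2$$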
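The estimate in (5) and the variance of the paired difference can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' implementation; the function names and the toy data are hypothetical.

```python
def psi(x, y):
    """Mann-Whitney kernel: 1 if x > y, 1/2 if tied, 0 if x < y."""
    if x > y:
        return 1.0
    return 0.5 if x == y else 0.0


def auc_components(xs, ys):
    """Return (theta_hat, V10, V01) for one diagnostic test.

    xs: results of the diseased patients, ys: results of the non-diseased subjects.
    """
    v10 = [sum(psi(x, y) for y in ys) / len(ys) for x in xs]
    v01 = [sum(psi(x, y) for x in xs) / len(xs) for y in ys]
    return sum(v10) / len(v10), v10, v01


def _s(a, b, ta, tb):
    """Sample covariance of two component lists around their AUC estimates."""
    return sum((u - ta) * (v - tb) for u, v in zip(a, b)) / (len(a) - 1)


def paired_auc_difference(x1, y1, x2, y2):
    """Return (theta1_hat - theta2_hat, estimated variance of the difference).

    x1[i] and x2[i] must come from the same diseased patient,
    y1[j] and y2[j] from the same non-diseased subject.
    """
    t1, a1, b1 = auc_components(x1, y1)
    t2, a2, b2 = auc_components(x2, y2)
    s10 = _s(a1, a1, t1, t1) + _s(a2, a2, t2, t2) - 2 * _s(a1, a2, t1, t2)
    s01 = _s(b1, b1, t1, t1) + _s(b2, b2, t2, t2) - 2 * _s(b1, b2, t1, t2)
    return t1 - t2, s10 / len(x1) + s01 / len(y1)


# Hypothetical toy data: three diseased and three non-diseased subjects.
x_new, y_new = [3.1, 4.0, 5.2], [1.0, 2.4, 3.1]
x_std, y_std = [2.9, 3.5, 5.0], [1.2, 3.0, 3.3]
diff, var = paired_auc_difference(x_new, y_new, x_std, y_std)
print(diff, var)
```

The double loop costs on the order of $N_A N_N$ kernel evaluations per test, which is acceptable for the sample sizes considered here.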

Zhou et al. [8] suggested that the non-inferiority of the new diagnostic procedure is concluded at the $\alpha$ significance level if

$$Z_l = \frac{\hat\theta_1 - \hat\theta_2 - \theta_L}{\sqrt{\widehat{\mathrm{var}}(\hat\theta_1 - \hat\theta_2)}} \ge z_\alpha$$

where $z_\alpha$ is the upper $100\alpha$th percentile of the standard normal distribution. Similarly, the non-superiority hypothesis (4) is rejected if

$$Z_u = \frac{\hat\theta_1 - \hat\theta_2 - \theta_U}{\sqrt{\widehat{\mathrm{var}}(\hat\theta_1 - \hat\theta_2)}} \le -z_\alpha$$

It follows that by the intersection–union principle [11], the two-sided equivalence between the new and standard diagnostic procedures is concluded at the $\alpha$ significance level by the TOST if both the non-inferiority and non-superiority hypotheses are rejected at the $\alpha$ significance level.
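The decision rule of the TOST can be written as a small helper. The sketch below assumes that an estimated difference and its estimated variance are already available; the function name `tost` and the numbers in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def tost(diff, var, lower, upper, alpha=0.05):
    """Two one-sided tests for H1: lower < theta1 - theta2 < upper.

    diff: estimated difference, var: its estimated variance.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # upper 100*alpha-th percentile
    se = sqrt(var)
    z_l = (diff - lower) / se
    z_u = (diff - upper) / se
    non_inferior = z_l >= z_alpha
    non_superior = z_u <= -z_alpha
    return {
        "Z_l": z_l,
        "Z_u": z_u,
        "non_inferior": non_inferior,
        "non_superior": non_superior,
        "equivalent": non_inferior and non_superior,
    }


# Hypothetical inputs: difference of -0.02 with standard error 0.03, limits of +/-0.1.
print(tost(-0.02, 0.03 ** 2, -0.1, 0.1))
```

Because equivalence requires both one-sided rejections, each at level $\alpha$, no multiplicity adjustment is applied in this sketch.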

3. THE STANDARDIZED DIFFERENCE APPROACH

As shown in (1), the ROC curve area depends upon the ability of a diagnostic procedure to separate the distribution of the measurements of the diseased patients from that of the non-diseased subjects. Therefore, the ROC curve area is a function of the distance in location between the distributions of $X$ and $Y$. To take into consideration the variation of $X$ and $Y$, a measure for the distance in location between the distributions of $X$ and $Y$ is the standardized difference defined as below:

$$\delta = \frac{\mu_A - \mu_N}{\sqrt{\sigma_A^2 + \sigma_N^2}}$$

where $\mu_A, \mu_N$ and $\sigma_A^2, \sigma_N^2$ are the means and variances of the distributions of the measurements from the diseased patients and the non-diseased subjects, respectively. Theoretically, the value of the ROC curve area is between 0.5 and 1 and the possible range of the standardized difference can be from $-\infty$ to $\infty$. However, in practice, for some aberrant diagnostic procedure, an observed ROC curve area smaller than 0.5 is possible. On the other hand, for most diagnostic tests, the standardized differences are finite quantities. Therefore, assessment of equivalence and non-inferiority in the ROC curve areas between two diagnostic procedures could be based on their corresponding standardized differences. In other words, if two diagnostic procedures have equivalent ROC curve areas, then the difference of their standardized differences should be small and within some clinically meaningful limits.

Let $X_i = (X_{1i}, X_{2i})$ be a $2 \times 1$ vector for the results of the new and standard diagnostic tests of diseased patient $i$, $i = 1, \ldots, N_A$. We also assume that $X_i$ follows a bivariate normal distribution with mean vector $(\mu_{A1}, \mu_{A2})$ and covariance matrix $\Sigma_A$ with elements $\sigma_{A1}^2$, $\sigma_{A2}^2$, $\rho_A \sigma_{A1} \sigma_{A2}$, where $\rho_A$ is the correlation between measurements of the new and standard diagnostic tests in the diseased patients. Similarly, denote $Y_j = (Y_{1j}, Y_{2j})$ as the $2 \times 1$ vector for the results of the new and standard diagnostic tests for non-diseased subject $j$ with mean vector $(\mu_{N1}, \mu_{N2})$ and covariance matrix $\Sigma_N$ with elements $\sigma_{N1}^2$, $\sigma_{N2}^2$, $\rho_N \sigma_{N1} \sigma_{N2}$, where $\rho_N$ is the correlation between measurements of the new and standard diagnostic tests in the non-diseased subjects, $j = 1, \ldots, N_N$. In addition, $X_i$ and $Y_j$ are assumed to be mutually independent. It follows that the standardized difference for diagnostic procedure $h$ is defined as

$$\delta_h = \frac{\mu_{Ah} - \mu_{Nh}}{\sqrt{\sigma_{Ah}^2 + \sigma_{Nh}^2}}, \quad h = 1, 2$$

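Under the normality assumption, it follows that

$$\theta_h = P(X_h > Y_h) = P(X_h - Y_h > 0) = P\left\{ \frac{X_h - Y_h - (\mu_{Ah} - \mu_{Nh})}{\sqrt{\sigma_{Ah}^2 + \sigma_{Nh}^2}} > \frac{-(\mu_{Ah} - \mu_{Nh})}{\sqrt{\sigma_{Ah}^2 + \sigma_{Nh}^2}} \right\}$$

Since $X_h - Y_h$ follows a normal distribution, it follows that

$$\theta_h = P\left( Z > -\frac{\mu_{Ah} - \mu_{Nh}}{\sqrt{\sigma_{Ah}^2 + \sigma_{Nh}^2}} \right) = P\left( Z < \frac{\mu_{Ah} - \mu_{Nh}}{\sqrt{\sigma_{Ah}^2 + \sigma_{Nh}^2}} \right) = P(Z < \delta_h)$$

and $\delta_h = \Phi^{-1}(\theta_h)$, $h = 1, 2$, where $Z$ is a standard normal random variable and $\Phi(\cdot)$ is the cumulative standard normal distribution. The relationship between $\theta_h$ and $\delta_h$ is well known for the normal distribution; see Hauck et al. [12] and Reiser and Guttman [13]. As a result, the difference in the ROC curve areas between the new and standard diagnostic procedures can be transformed into the difference in standardized differences as

$$\delta_1 - \delta_2 = \Phi^{-1}(\theta_1) - \Phi^{-1}(\theta_2) \tag{6}$$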
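As a quick numerical illustration of the mapping between the ROC curve area and the standardized difference under normality, the following sketch uses `statistics.NormalDist` from the Python standard library. The means and variances are hypothetical values chosen only for the example.

```python
from math import sqrt
from statistics import NormalDist


def standardized_difference(mu_a, mu_n, var_a, var_n):
    """delta = (mu_A - mu_N) / sqrt(sigma_A^2 + sigma_N^2)."""
    return (mu_a - mu_n) / sqrt(var_a + var_n)


def area_from_delta(delta):
    """theta = P(Z < delta) for a standard normal Z."""
    return NormalDist().cdf(delta)


def delta_from_area(theta):
    """delta = inverse standard normal CDF evaluated at theta."""
    return NormalDist().inv_cdf(theta)


# Hypothetical populations: diseased N(1, 1), non-diseased N(0, 1).
delta = standardized_difference(1.0, 0.0, 1.0, 1.0)
theta = area_from_delta(delta)
print(delta, theta, delta_from_area(theta))
```

In this example the standardized difference is $1/\sqrt{2}$, and converting the resulting area back recovers the same value up to floating-point error.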

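The above relationship between $\theta_h$ and $\delta_h$ in (6) is true for a location-scale family of distributions of the difference. The two-sided equivalence hypothesis based on the standardized difference can therefore be formulated as follows:

$$H_0: \delta_1 - \delta_2 \ge \delta_U \ \text{or}\ \delta_1 - \delta_2 \le \delta_L \quad \text{versus} \quad H_1: \delta_L < \delta_1 - \delta_2 < \delta_U \tag{7}$$

where $\delta_L < \delta_U$ are some pre-determined clinically meaningful equivalence limits which can be determined from $\theta_L$ and $\theta_U$ by the relationships $\delta_L = \Phi^{-1}(\theta_2 + \theta_L) - \Phi^{-1}(\theta_2)$ and $\delta_U = \Phi^{-1}(\theta_2 + \theta_U) - \Phi^{-1}(\theta_2)$.

Similarly, the one-sided non-inferiority hypothesis based on the standardized difference is given as

$$H_{0l}: \delta_1 - \delta_2 \le \delta_L \quad \text{versus} \quad H_{1l}: \delta_1 - \delta_2 > \delta_L$$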
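The conversion of equivalence limits from the ROC curve area scale to the standardized difference scale can be sketched as follows. The value 0.70 for the area of the standard test and the limits of ±0.10 are assumptions made only for this illustration, and `delta_limits` is a hypothetical helper name.

```python
from statistics import NormalDist


def delta_limits(theta2, theta_l, theta_u):
    """Map area-scale limits (theta_l, theta_u) to the standardized difference scale.

    theta2 is the ROC curve area of the standard test; theta2 + theta_l and
    theta2 + theta_u must lie strictly between 0 and 1.
    """
    inv = NormalDist().inv_cdf
    base = inv(theta2)
    return inv(theta2 + theta_l) - base, inv(theta2 + theta_u) - base


lower, upper = delta_limits(0.70, -0.10, 0.10)
print(lower, upper)
```

Limits that are symmetric on the area scale are in general not symmetric on the standardized difference scale, because the inverse normal transformation is nonlinear.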

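$\delta_h$ can be consistently estimated by replacing the population moments by their corresponding sample moments as

$$\hat\delta_h = \frac{\bar X_h - \bar Y_h}{\sqrt{s_{Ah}^2 + s_{Nh}^2}}, \quad h = 1, 2$$

Furthermore, the estimated asymptotic variances and covariance of $\hat\delta_1$ and $\hat\delta_2$ are given as

$$\widehat{\mathrm{var}}(\hat\delta_1) = \frac{1}{s_{A1}^2 + s_{N1}^2}\left[\frac{s_{A1}^2}{N_A} + \frac{s_{N1}^2}{N_N}\right] + \frac{(\bar X_1 - \bar Y_1)^2}{2(s_{A1}^2 + s_{N1}^2)^3}\left[\frac{s_{A1}^4}{N_A - 1} + \frac{s_{N1}^4}{N_N - 1}\right]$$

$$\widehat{\mathrm{var}}(\hat\delta_2) = \frac{1}{s_{A2}^2 + s_{N2}^2}\left[\frac{s_{A2}^2}{N_A} + \frac{s_{N2}^2}{N_N}\right] + \frac{(\bar X_2 - \bar Y_2)^2}{2(s_{A2}^2 + s_{N2}^2)^3}\left[\frac{s_{A2}^4}{N_A - 1} + \frac{s_{N2}^4}{N_N - 1}\right]$$

$$\widehat{\mathrm{cov}}(\hat\delta_1, \hat\delta_2) = \frac{1}{\sqrt{(s_{A1}^2 + s_{N1}^2)(s_{A2}^2 + s_{N2}^2)}}\left[\frac{\hat\rho_A s_{A1} s_{A2}}{N_A} + \frac{\hat\rho_N s_{N1} s_{N2}}{N_N}\right] + \frac{(\bar X_1 - \bar Y_1)(\bar X_2 - \bar Y_2)}{2\sqrt{(s_{A1}^2 + s_{N1}^2)^3 (s_{A2}^2 + s_{N2}^2)^3}}\left[\frac{N_A \hat\rho_A^2 s_{A1}^2 s_{A2}^2}{(N_A - 1)^2} + \frac{N_N \hat\rho_N^2 s_{N1}^2 s_{N2}^2}{(N_N - 1)^2}\right]$$

where

$$\hat\rho_A = \frac{\sum_{i=1}^{N_A} (X_{1i} - \bar X_1)(X_{2i} - \bar X_2)}{\sqrt{\sum_{i=1}^{N_A} (X_{1i} - \bar X_1)^2 \sum_{i=1}^{N_A} (X_{2i} - \bar X_2)^2}}, \quad \hat\rho_N = \frac{\sum_{j=1}^{N_N} (Y_{1j} - \bar Y_1)(Y_{2j} - \bar Y_2)}{\sqrt{\sum_{j=1}^{N_N} (Y_{1j} - \bar Y_1)^2 \sum_{j=1}^{N_N} (Y_{2j} - \bar Y_2)^2}}$$

$$s_{Ah}^2 = \frac{1}{N_A - 1} \sum_{i=1}^{N_A} (X_{hi} - \bar X_h)^2 \quad \text{and} \quad s_{Nh}^2 = \frac{1}{N_N - 1} \sum_{j=1}^{N_N} (Y_{hj} - \bar Y_h)^2, \quad h = 1, 2$$

An estimated asymptotic variance of $\hat\delta_1 - \hat\delta_2$ is then given as

$$\widehat{\mathrm{var}}(\hat\delta_1 - \hat\delta_2) = \widehat{\mathrm{var}}(\hat\delta_1) + \widehat{\mathrm{var}}(\hat\delta_2) - 2\,\widehat{\mathrm{cov}}(\hat\delta_1, \hat\delta_2)$$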
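The sample versions of the standardized difference, its variance and the covariance between the two tests can be coded directly from their definitions. The sketch below is one possible reading of those expressions using only the standard library; the function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def delta_and_var(xs, ys):
    """Return (delta_hat, estimated variance of delta_hat) for one test."""
    n_a, n_n = len(xs), len(ys)
    s_a, s_n = variance(xs), variance(ys)  # sample variances with n - 1
    total = s_a + s_n
    d = mean(xs) - mean(ys)
    var = (s_a / n_a + s_n / n_n) / total
    var += d ** 2 / (2 * total ** 3) * (s_a ** 2 / (n_a - 1) + s_n ** 2 / (n_n - 1))
    return d / sqrt(total), var


def corr(a, b):
    """Pearson correlation of two equally long lists."""
    ma, mb = mean(a), mean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / sqrt(saa * sbb)


def cov_delta(x1, y1, x2, y2):
    """Estimated covariance between delta_hat_1 and delta_hat_2."""
    n_a, n_n = len(x1), len(y1)
    a1, a2, n1, n2 = variance(x1), variance(x2), variance(y1), variance(y2)
    t1, t2 = a1 + n1, a2 + n2
    r_a, r_n = corr(x1, x2), corr(y1, y2)
    d1, d2 = mean(x1) - mean(y1), mean(x2) - mean(y2)
    first = (r_a * sqrt(a1 * a2) / n_a + r_n * sqrt(n1 * n2) / n_n) / sqrt(t1 * t2)
    second = d1 * d2 / (2 * sqrt(t1 ** 3 * t2 ** 3)) * (
        n_a * r_a ** 2 * a1 * a2 / (n_a - 1) ** 2
        + n_n * r_n ** 2 * n1 * n2 / (n_n - 1) ** 2
    )
    return first + second


# Hypothetical toy data for the new (1) and standard (2) tests.
x1, y1 = [2.0, 3.0, 4.5, 5.0], [0.0, 1.0, 2.0, 2.5]
x2, y2 = [1.5, 3.2, 4.0, 4.1], [0.3, 0.8, 2.4, 2.0]
(d1, v1), (d2, v2) = delta_and_var(x1, y1), delta_and_var(x2, y2)
var_diff = v1 + v2 - 2 * cov_delta(x1, y1, x2, y2)
print(d1 - d2, var_diff)
```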

By Slutsky's theorem [14], asymptotically, $[(\hat\delta_1 - \hat\delta_2) - (\delta_1 - \delta_2)] / \sqrt{\widehat{\mathrm{var}}(\hat\delta_1 - \hat\delta_2)}$ follows a standard normal distribution.

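The non-inferiority of the new diagnostic procedure is concluded at the $\alpha$ significance level if

$$Z_l = \frac{\hat\delta_1 - \hat\delta_2 - \delta_L}{\sqrt{\widehat{\mathrm{var}}(\hat\delta_1 - \hat\delta_2)}} \ge z_\alpha$$

Similarly, the non-superiority hypothesis is rejected at the $\alpha$ significance level if

$$Z_u = \frac{\hat\delta_1 - \hat\delta_2 - \delta_U}{\sqrt{\widehat{\mathrm{var}}(\hat\delta_1 - \hat\delta_2)}} \le -z_\alpha$$

The equivalence of the new diagnostic procedure to the standard is declared at the $\alpha$ significance level if both the non-inferiority and non-superiority hypotheses are rejected at the $\alpha$ significance level.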

Both the non-parametric method and the standardized difference approach are asymptotic procedures. Normal approximation might not be adequate even with a moderately large sample size. Therefore, we suggest using the bootstrap technique [15, 16] to empirically obtain the sampling distributions of the test statistics for the two methods. In addition, TOST is operationally equivalent to the confidence interval approach [1], which also provides a probable range for the parameters of interest, i.e. the difference in the ROC curve areas and in the standardized differences. As a result, the confidence interval approach is employed for the bootstrap method. The following is the outline of the bootstrap procedures applied to the non-parametric method and the standardized difference approach for evaluation of equivalence and non-inferiority of diagnostic accuracy based on the ROC curve areas:

1. Generate $B$ independent bootstrap samples of size $N_A$ by sampling with replacement from the bivariate vectors of the observed measurements of the diseased patients. Similarly, generate $B$ independent bootstrap samples of size $N_N$ by sampling with replacement from the bivariate vectors of the observed measurements of the non-diseased subjects.

2. Calculate the estimated difference in the areas under the ROC curves, $\hat\theta_1 - \hat\theta_2$, or in the standardized differences, $\hat\delta_1 - \hat\delta_2$, from the bootstrap samples.

3. Repeat (1) and (2) a large number of times, say 2000 times or more.

4. Compute the lower and upper limits of the $(1 - 2\alpha)100$ per cent bootstrap confidence interval as the $\alpha 100$ per cent and the $(1 - \alpha)100$ per cent quantiles of the bootstrap distribution.

5. Equivalence between the new and standard diagnostic tests is concluded at the $\alpha$ significance level if the $(1 - 2\alpha)100$ per cent bootstrap confidence interval is completely contained within $(\theta_L, \theta_U)$ for the ROC curve area or within $(\delta_L, \delta_U)$ for the standardized difference. Non-inferiority of the new diagnostic test is concluded if the lower limit of the $(1 - 2\alpha)100$ per cent bootstrap confidence interval is greater than $\theta_L$ ($\delta_L$).
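The five steps above can be sketched as a percentile bootstrap in Python. This is a minimal illustration for the difference in ROC curve areas, assuming paired results are stored as `(new, standard)` tuples; the function names, the simple order-statistic quantile rule and the seed are choices made for the example, not details taken from the paper.

```python
import random


def auc(xs, ys):
    """Mann-Whitney estimate of P(X > Y), counting ties as 1/2."""
    total = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def auc_difference(diseased, healthy):
    """diseased, healthy: lists of (new, standard) result pairs."""
    x1, x2 = zip(*diseased)
    y1, y2 = zip(*healthy)
    return auc(x1, y1) - auc(x2, y2)


def bootstrap_interval(diseased, healthy, stat=auc_difference,
                       B=2000, alpha=0.05, seed=0):
    """(1 - 2*alpha) percentile interval from B paired bootstrap resamples."""
    rng = random.Random(seed)
    values = []
    for _ in range(B):
        # Resample whole pairs so the within-subject correlation is preserved.
        d = [rng.choice(diseased) for _ in diseased]
        h = [rng.choice(healthy) for _ in healthy]
        values.append(stat(d, h))
    values.sort()
    return values[round(alpha * B)], values[round((1 - alpha) * B) - 1]


def decide(lower, upper, limit_l, limit_u):
    """Step 5: interval inside the limits, or lower bound above the lower limit."""
    return {
        "equivalent": limit_l < lower and upper < limit_u,
        "non_inferior": lower > limit_l,
    }


# Hypothetical toy data.
diseased = [(3.1, 2.9), (4.0, 3.5), (5.2, 5.0), (2.8, 3.3), (4.4, 4.1)]
healthy = [(1.0, 1.2), (2.4, 3.0), (3.1, 3.3), (1.9, 1.5), (2.2, 2.6)]
lo, hi = bootstrap_interval(diseased, healthy, B=500)
print(lo, hi, decide(lo, hi, -0.1, 0.1))
```

Passing a different `stat` function, for example one that returns the difference in estimated standardized differences, gives the bootstrap version of the standardized difference approach with limits $(\delta_L, \delta_U)$ in `decide`.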

4. SIMULATION STUDY

The design for the simulation study is a 2×2 factorial design. The two factors considered in the simulation study are method (with levels standardized difference and non-parametric) and confidence interval procedure (with levels asymptotic and bootstrap). FORTRAN 90 and IMSL's STAT/LIBRARY FORTRAN subroutines were used in the simulation study to investigate and compare empirically the size and power of the four methods. Symmetrical equivalence limits of ±0.1 were chosen throughout the study for the difference in the ROC curve areas between the new and standard diagnostic tests. The equivalence limits for the standardized difference approach were then obtained by the relationship between $\theta_h$ and $\delta_h$ given in (6). To investigate the impact of symmetric and skewed distributions of the measurements, the data were generated from the bivariate normal and bivariate exponential distributions. However, ordinal data are commonly recorded measurements of diagnostic procedures. As a result, ordinal data of 5 categories were also generated.

For the normal data, $N_A + N_N$ vectors were generated from a bivariate normal distribution with mean vector (0, 0) and a covariance matrix with equal variances of 1 and covariance $\rho$. The first $N_N$ bivariate normal vectors represent the measurements of the two diagnostic tests from the non-diseased subjects with mean vector (0, 0) and a covariance matrix with equal variances of 1 and covariance $\rho$. The other $N_A$ bivariate normal vectors represent the measurements of the two diagnostic tests from the diseased subjects with the same covariance matrix and mean vector $(\sqrt{2}\,\Phi^{-1}(\theta_1), \sqrt{2}\,\Phi^{-1}(\theta_2))$, where $\theta_1$ and $\theta_2$ are the desired ROC curve areas for the new and standard diagnostic tests, respectively. For the ordinal data, bivariate normal variates were first generated. The possible range of the normal variates was divided into 5 intervals of equal length. The intervals represent the ordered categories, and scores from 1 to 5 were assigned according to the ascending order of the intervals. Ordinal data were then generated by assigning each normal variate the score of the interval into which it falls. The method proposed by Moran [17] was used to generate correlated exponential data.
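The normal data generation described above can be sketched with the Python standard library. The original study used FORTRAN 90 with IMSL routines, so this is only an illustrative re-expression under the stated unit-variance setup; the function names and the seed are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def diseased_mean(theta):
    """Mean shift giving ROC curve area theta when both variances equal 1."""
    return sqrt(2.0) * NormalDist().inv_cdf(theta)


def bivariate_normal(n, mean1, mean2, rho, rng):
    """n pairs with unit variances and correlation rho."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # rho * z1 + sqrt(1 - rho^2) * z2 has unit variance and correlation rho with z1.
        pairs.append((mean1 + z1, mean2 + rho * z1 + sqrt(1.0 - rho ** 2) * z2))
    return pairs


def simulate_normal(n_a, n_n, theta1, theta2, rho, seed=0):
    """Return (diseased, healthy) lists of (new, standard) measurements."""
    rng = random.Random(seed)
    healthy = bivariate_normal(n_n, 0.0, 0.0, rho, rng)
    diseased = bivariate_normal(
        n_a, diseased_mean(theta1), diseased_mean(theta2), rho, rng
    )
    return diseased, healthy


# One simulated data set at the lower equivalence limit: theta1 - theta2 = -0.1.
diseased, healthy = simulate_normal(100, 100, 0.60, 0.70, 0.5)
print(len(diseased), len(healthy), diseased_mean(0.70))
```

With unit variances the standardized difference equals the mean shift divided by $\sqrt{2}$, which is why the mean is set to $\sqrt{2}\,\Phi^{-1}(\theta_h)$.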

Real examples given by Zhou et al. [8], Obuchowski [18], Parker and DeLong [19], and Pepe [20] suggest a range of the ROC curve areas from 0.52 to 0.85. Therefore, for the normal distribution, the ROC curve areas of the standard diagnostic procedure are selected from 0.6 to 0.85 by an increment of 0.05 for the simulation. We assume that the correlations between the measurements of the new and standard diagnostic tests are the same for the diseased patients and the non-diseased subjects. Three values of $\rho_A = \rho_N = \rho = 0.1, 0.5, 0.9$ were chosen to study the impact of low, moderate, and high correlations on the four methods. The health regulatory agencies of some countries request that, for marketing approval of in vitro diagnostic tests, the results of diagnostic accuracy such as sensitivity, specificity and the ROC curve area based on 200 samples from each of 3 medical centres, for a total of 600 samples, be submitted. Therefore, to investigate the size and power of the proposed methods, four different total sample sizes of 70, 150, 200, and 400 with an equal number of diseased patients and non-diseased subjects are selected. We believe that these combinations of the ROC curve areas, correlations, and sample sizes cover most situations for evaluation of equivalence and non-inferiority between the new and standard diagnostic tests. With respect to the exponential distribution and the ordinal data, the combinations of 0.7 and 0.8 for the ROC curve area with correlations of 0.5 and 0.9 and the sample size of 200 are investigated in the simulation. For each of the combinations, 2000 random samples were generated. The number of bootstrap samples is set to be 2000. For a 5 per cent nominal significance level, a simulation study with 2000 random samples implies that 95 per cent of empirical sizes evaluated at the equivalence limits will be within 0.04513 and 0.05487 if the proposed methods can adequately control the size at the nominal level of 0.05.
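The limits 0.04513 and 0.05487 can be checked numerically. The sketch below assumes that the number of rejections among the 2000 simulated samples is binomial with success probability 0.05. Under that assumption, 0.05 plus or minus one binomial standard error reproduces the quoted limits; the band at 1.96 standard errors is printed only for comparison.

```python
from math import sqrt

p, n_sim = 0.05, 2000
se = sqrt(p * (1 - p) / n_sim)  # binomial standard error of an empirical size

one_se_band = (p - se, p + se)
wide_band = (p - 1.96 * se, p + 1.96 * se)

print(round(se, 5))
print(tuple(round(v, 5) for v in one_se_band))
print(tuple(round(v, 5) for v in wide_band))
```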

Table I presents the empirical sizes for the equivalence and non-inferiority hypotheses under the normal distribution. From Table I, the empirical sizes of the one-sided non-inferiority test are larger than those of the two-sided equivalence test. However, for evaluation of the equivalence hypothesis, when the total sample size is 70, the correlation is 0.1, and the ROC curve area of the standard diagnostic test is smaller than 0.80, the empirical sizes of all methods are either close to 0 or very low. The same situation is also observed when the correlation is 0.5 and the ROC curve area is smaller than 0.70. Therefore, for evaluation of equivalence, all four methods are extremely conservative when the sample size is 70. On the other hand, for the same combinations, the empirical sizes for evaluation of the non-inferiority hypothesis remain quite close to the nominal level of 0.05.

The empirical sizes greater than 0.05487 are highlighted in bold in Tables I, III and IV. From Table I, the empirical size increases either as the correlation between the new and standard diagnostic tests increases or as the ROC curve area of the standard test increases. On the other hand, the empirical sizes of the non-parametric method and its bootstrap version in general are larger than those of the standardized difference approach. Furthermore, for both the non-parametric method and the standardized difference approach, the empirical sizes of the bootstrap procedure are smaller than those of its asymptotic counterpart. For the non-parametric method, there are a total of 131 out of 288 empirical sizes (45.49 per cent) greater than 0.05487. The maximum empirical size for the non-parametric method can reach as high as 0.0930 and 0.0760, respectively, for the asymptotic method and bootstrap procedure. However, the maximum empirical size for the standardized difference approach is 0.0670 and 0.0615, respectively, for the asymptotic method and bootstrap procedure. When the correlation is 0.5 and the ROC curve area of the standard test is above 0.75, most of the empirical sizes of the non-parametric method and its bootstrap version for both the equivalence and non-inferiority hypotheses are larger than 0.05487. This indicates that when the correlation between the new and standard diagnostic tests is at least 0.5 and the ROC curve area is large, the non-parametric method is liberal in testing the equivalence and non-inferiority hypotheses. On the other hand, 88.9 per cent of the empirical sizes of combinations provided by the bootstrap version of the standardized difference approach are below 0.05487 and only one empirical size (0.7 per cent) is above 0.06, while 32 empirical sizes (22.2 per cent) of the bootstrap version of the non-parametric method are at least 0.06. Therefore, simulation results demonstrate that the bootstrap version of the standardized difference approach can adequately control the type I error rate at the nominal level.

Table I. Empirical sizes of equivalence and non-inferiority testing under normal distribution with equivalence limit of 0.1 based on the ROC curve area.

                                        ROC curve area for the standard diagnostic test
Hypothesis        ρ     N     Method    0.60     0.65     0.70     0.75     0.80     0.85
Equivalence       0.1   70    Nonpar    0.0000   0.0000   0.0000   0.0045   0.0280   0.0705
                              BNP       0.0000   0.0000   0.0050   0.0035   0.0290   0.0630
                              SD        0.0000   0.0005   0.0000   0.0005   0.0155   0.0520
                              BSD       0.0000   0.0005   0.0000   0.0005   0.0170   0.0445
                        150   Nonpar    0.0090   0.0255   0.0390   0.0470   0.0610   0.0615
                              BNP       0.0090   0.0285   0.0405   0.0470   0.0580   0.0590
                              SD        0.0135   0.0310   0.0435   0.0500   0.0515   0.0435
                              BSD       0.0115   0.0290   0.0400   0.0495   0.0470   0.0410
                        200   Nonpar    0.0320   0.0510   0.0425   0.0500   0.0540   0.0685
                              BNP       0.0325   0.0510   0.0435   0.0485   0.0540   0.0660
                              SD        0.0360   0.0535   0.0425   0.0440   0.0510   0.0650
                              BSD       0.0335   0.0490   0.0410   0.0440   0.0480   0.0615
                        400   Nonpar    0.0535   0.0460   0.0465   0.0595   0.0605   0.0620
                              BNP       0.0540   0.0470   0.0455   0.0565   0.0605   0.0610
                              SD        0.0570   0.0455   0.0480   0.0525   0.0580   0.0470
                              BSD       0.0565   0.0450   0.0450   0.0515   0.0550   0.0440
                  0.5   70    Nonpar    0.0050   0.0100   0.0190   0.0410   0.0665   0.0820
                              BNP       0.0060   0.0115   0.0195   0.0400   0.0610   0.0710
                              SD        0.0050   0.0125   0.0185   0.0355   0.0580   0.0480
                              BSD       0.0055   0.0105   0.0195   0.0370   0.0520   0.0450
                        150   Nonpar    0.0410   0.0465   0.0520   0.0600   0.0550   0.0680
                              BNP       0.0425   0.0425   0.0510   0.0585   0.0510   0.0640
                              SD        0.0380   0.0530   0.0485   0.0885   0.0475   0.0620
                              BSD       0.0395   0.0510   0.0485   0.0565   0.0465   0.0585
                        200   Nonpar    0.0525   0.0485   0.0600   0.0545   0.0570   0.0775
                              BNP       0.0515   0.0480   0.0595   0.0530   0.0540   0.0700
                              SD        0.0515   0.0470   0.0575   0.0490   0.0450   0.0555
                              BSD       0.0500   0.0440   0.0560   0.0480   0.0405   0.0510
                        400   Nonpar    0.0535   0.0520   0.0515   0.0500   0.0550   0.0610
                              BNP       0.0515   0.0510   0.0500   0.0480   0.0520   0.0575
                              SD        0.0470   0.0515   0.0510   0.0480   0.0480   0.0465
                              BSD       0.0445   0.0495   0.0510   0.0450   0.0460   0.0485
                  0.9   70    Nonpar    0.0490   0.0540   0.0685   0.0760   0.0850   0.0930
                              BNP       0.0455   0.0480   0.0610   0.0605   0.0615   0.0760
                              SD        0.0540   0.0520   0.0500   0.0540   0.0630   0.0625
                              BSD       0.0435   0.0433   0.0430   0.0435   0.0445   0.0425
                        150   Nonpar    0.0565   0.0580   0.0575   0.0626   0.0625   0.0765
                              BNP       0.0525   0.0520   0.0490   0.0550   0.0535   0.0655
                              SD        0.0545   0.0550   0.0530   0.0545   0.0555   0.0545
                              BSD       0.0475   0.0415   0.0450   0.0420   0.0465   0.0425
                        200   Nonpar    0.0615   0.0555   0.0600   0.0650   0.0730   0.0625
                              BNP       0.0570   0.0525   0.0560   0.0590   0.0650   0.0590
                              SD        0.0620   0.0495   0.0475   0.0600   0.0570   0.0670
                              BSD       0.0565   0.0445   0.0405   0.0450   0.0465   0.0490
                        400   Nonpar    0.0455   0.0565   0.0550   0.0605   0.0520   0.0600
                              BNP       0.0445   0.0520   0.0510   0.0565   0.0470   0.0555
                              SD        0.0515   0.0420   0.0505   0.0500   0.0495   0.0480
                              BSD       0.0425   0.0370   0.0460   0.0460   0.0435   0.0405
Non-inferiority   0.1   70    Nonpar    0.0475   0.0480   0.0510   0.0490   0.0600   0.0745
                              BNP       0.0470   0.0515   0.0515   0.0495   0.0600   0.0670
                              SD        0.0440   0.0520   0.0515   0.0480   0.0545   0.0550
                              BSD       0.0430   0.0510   0.0500   0.0445   0.0515   0.0485
                        150   Nonpar    0.0485   0.0615   0.0545   0.0545   0.0620   0.0615
                              BNP       0.0485   0.0605   0.0560   0.0545   0.0590   0.0590
                              SD        0.0475   0.0620   0.0570   0.0550   0.0525   0.0435
                              BSD       0.0465   0.0595   0.0535   0.0560   0.0485   0.0410
                        200   Nonpar    0.0435   0.0600   0.0440   0.0520   0.0545   0.0685
                              BNP       0.0440   0.0590   0.0450   0.0505   0.0545   0.0660
                              SD        0.0445   0.0590   0.0450   0.0460   0.0510   0.0650
                              BSD       0.0425   0.0555   0.0430   0.0460   0.0485   0.0615
                        400   Nonpar    0.0535   0.0460   0.0465   0.0595   0.0605   0.0620
                              BNP       0.0540   0.0470   0.0455   0.0565   0.0605   0.0610
                              SD        0.0570   0.0455   0.0480   0.0525   0.0580   0.0470
                              BSD       0.0565   0.0450   0.0450   0.0515   0.0550   0.0440
                  0.5   70    Nonpar    0.0520   0.0470   0.0485   0.0650   0.0715   0.0820
                              BNP       0.0530   0.0470   0.0500   0.0640   0.0670   0.0710
                              SD        0.0520   0.0485   0.0470   0.0565   0.0595   0.0485
                              BSD       0.0460   0.0460   0.0485   0.0545   0.0580   0.0455
                        150   Nonpar    0.0470   0.0485   0.0525   0.0600   0.0550   0.0680
                              BNP       0.0480   0.0455   0.0520   0.0590   0.0510   0.0640
                              SD        0.0430   0.0540   0.0495   0.0585   0.0475   0.0620
                              BSD       0.0450   0.0520   0.0455   0.0585   0.0465   0.0585
                        200   Nonpar    0.0525   0.0485   0.0605   0.0545   0.0570   0.0775
                              BNP       0.0515   0.0480   0.0600   0.0530   0.0540   0.0700
                              SD        0.0515   0.0470   0.0575   0.0490   0.0450   0.0555
                              BSD       0.0500   0.0440   0.0560   0.0480   0.0405   0.0510
                        400   Nonpar    0.0535   0.0520   0.0515   0.0500   0.0550   0.0610
                              BNP       0.0515   0.0510   0.0500   0.0480   0.0520   0.0575
                              SD        0.0470   0.0515   0.0510   0.0480   0.0480   0.0465
                              BSD       0.0445   0.0495   0.0510   0.0450   0.0460   0.0485
                  0.9   70    Nonpar    0.0490   0.0540   0.0685   0.0760   0.0800   0.0930
                              BNP       0.0455   0.0480   0.0610   0.0605   0.0615   0.0760
                              SD        0.0540   0.0520   0.0500   0.0540   0.0630   0.0625
                              BSD       0.0435   0.0435   0.0430   0.0435   0.0445   0.0425
                        150   Nonpar    0.0565   0.0580   0.0575   0.0620   0.0625   0.0765
                              BNP       0.0525   0.0520   0.0490   0.0550   0.0535   0.0655
                              SD        0.0545   0.0500   0.0530   0.0545   0.0555   0.0545
                              BSD       0.0475   0.0415   0.0450   0.0420   0.0465   0.0425
                        200   Nonpar    0.0615   0.0555   0.0600   0.0650   0.0730   0.0625
                              BNP       0.0570   0.0525   0.0560   0.0590   0.0650   0.0590
                              SD        0.0620   0.0495   0.0475   0.0600   0.0570   0.0670
                              BSD       0.0565   0.0445   0.0405   0.0450   0.0465   0.0490
                        400   Nonpar    0.0455   0.0565   0.0550   0.0605   0.0520   0.0600
                              BNP       0.0445   0.0520   0.0510   0.0565   0.0470   0.0555
                              SD        0.0515   0.0420   0.0505   0.0500   0.0495   0.0480
                              BSD       0.0425   0.0370   0.0460   0.0460   0.0435   0.0405

Note: ρ: Common correlation coefficient of the measurements between the new and standard diagnostic tests. N: Total sample size. Nonpar: Non-parametric method. BNP: Bootstrap procedure of the non-parametric method. SD: Standardized difference approach. BSD: Bootstrap procedure of the standardized difference approach.

Table II provides the empirical powers under the normal distribution when the difference of the ROC curve areas is 0.05. From Table II, the empirical powers of the one-sided non-inferiority hypothesis are greater than those of the two-sided equivalence hypothesis. In addition, the empirical power increases as either the sample size increases or the correlation between the new and standard diagnostic tests increases. On the other hand, all four methods provide comparable powers for these combinations. When the correlation between the new and standard diagnostic tests is 0.1, the maximum empirical power of the four methods for the equivalence and non-inferiority hypotheses is only about 0.68, provided by a total sample of 400. On the other hand, sufficient power (>0.8) can be provided with a total sample size of 150 when the correlation between the new and standard diagnostic tests is 0.9 and the ROC curve area of the standard test is at least 0.75. Figures 1 and 2 present the power curves of the two-sided equivalence and non-inferiority hypotheses, respectively, when the total sample size is 200, the correlation is 0.9, and the ROC curve area of the standard diagnostic test is 0.7. For the two-sided equivalence hypothesis, the power curves increase monotonically as the difference in the ROC curve area increases from −0.15 to 0. They reach the maximum at 0 and then decrease monotonically as the difference in the ROC curve area increases from 0 to 0.15. The power curves almost overlap each other in most of the range from −0.15 to 0. However, at the equivalence limits of ±0.1, the size of the non-parametric method is larger than 0.05. In addition, the empirical power curve of the equivalence hypothesis is symmetrical about 0. On the other hand, for the one-sided non-inferiority hypothesis, the power curves are monotonic increasing functions of the difference in the ROC curve areas.

Table III presents the empirical sizes at the equivalence limit of 0.1 and the empirical powers at $\theta_1 - \theta_2 = 0.05$ for the exponential distribution when the total sample size is 200. As shown in Table III, the empirical sizes of the asymptotic non-parametric method and its bootstrap version for both the equivalence and non-inferiority hypotheses range from 0.1145 to 0.5090 and the empirical sizes of the standardized difference approach range from 0.075 to 0.127. Therefore, neither the asymptotic and bootstrap procedures of the non-parametric method nor the asymptotic standardized difference approach can control the size at the 5 per cent nominal level when the distribution is skewed. On the other hand, the empirical sizes of the bootstrap version of the standardized difference approach are all below 0.05. Therefore, the bootstrap method of the standardized


Table II. Empirical powers of equivalence and non-inferiority testing under normal distribution with equivalence limit of 0.05 based on the ROC curve area.

Columns 0.60 to 0.85: ROC curve area for the standard diagnostic test.

| Hypothesis | ρ | N | Method | 0.60 | 0.65 | 0.70 | 0.75 | 0.80 | 0.85 |
|---|---|---|---|---|---|---|---|---|---|
| Equivalence | 0.1 | 70 | Nonpar | 0.0000 | 0.0000 | 0.0005 | 0.0025 | 0.0285 | 0.1195 |
| | | | BNP | 0.0000 | 0.0000 | 0.0010 | 0.0025 | 0.0305 | 0.1160 |
| | | | SD | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0005 | 0.0255 |
| | | | BSD | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0020 | 0.0325 |
| | | 150 | Nonpar | 0.0160 | 0.0405 | 0.0830 | 0.1510 | 0.2550 | 0.3695 |
| | | | BNP | 0.0155 | 0.0465 | 0.0865 | 0.1505 | 0.2505 | 0.3640 |
| | | | SD | 0.0220 | 0.0425 | 0.0655 | 0.1250 | 0.1990 | 0.2650 |
| | | | BSD | 0.0210 | 0.0415 | 0.0605 | 0.1170 | 0.1860 | 0.2540 |
| | | 200 | Nonpar | 0.1205 | 0.1575 | 0.1940 | 0.2570 | 0.3425 | 0.4530 |
| | | | BNP | 0.1245 | 0.1560 | 0.1995 | 0.2565 | 0.3440 | 0.4490 |
| | | | SD | 0.1310 | 0.1510 | 0.1890 | 0.2265 | 0.2780 | 0.3315 |
| | | | BSD | 0.1220 | 0.1450 | 0.1850 | 0.2240 | 0.2745 | 0.3250 |
| | | 400 | Nonpar | 0.3655 | 0.4115 | 0.4550 | 0.4800 | 0.5695 | 0.6780 |
| | | | BNP | 0.3650 | 0.4090 | 0.4520 | 0.4775 | 0.5660 | 0.6790 |
| | | | SD | 0.3615 | 0.3850 | 0.4165 | 0.4285 | 0.4725 | 0.5200 |
| | | | BSD | 0.3605 | 0.3845 | 0.4145 | 0.4220 | 0.4650 | 0.5150 |
| | 0.5 | 70 | Nonpar | 0.0055 | 0.0120 | 0.0335 | 0.0785 | 0.1490 | 0.2850 |
| | | | BNP | 0.0080 | 0.0125 | 0.0340 | 0.0815 | 0.1445 | 0.2730 |
| | | | SD | 0.0055 | 0.0165 | 0.0220 | 0.0535 | 0.1015 | 0.1890 |
| | | | BSD | 0.0065 | 0.0120 | 0.0190 | 0.0555 | 0.0880 | 0.1660 |
| | | 150 | Nonpar | 0.2120 | 0.2655 | 0.2975 | 0.3355 | 0.4245 | 0.4735 |
| | | | BNP | 0.2115 | 0.2670 | 0.2970 | 0.3355 | 0.4190 | 0.4620 |
| | | | SD | 0.2320 | 0.2755 | 0.2765 | 0.3030 | 0.3510 | 0.3660 |
| | | | BSD | 0.2225 | 0.2645 | 0.2655 | 0.2920 | 0.3360 | 0.3580 |
| | | 200 | Nonpar | 0.3265 | 0.3355 | 0.3870 | 0.4330 | 0.5010 | 0.6045 |
| | | | BNP | 0.3240 | 0.3370 | 0.3845 | 0.4315 | 0.4925 | 0.5945 |
| | | | SD | 0.3280 | 0.3250 | 0.3625 | 0.3795 | 0.4160 | 0.4660 |
| | | | BSD | 0.3190 | 0.3205 | 0.3505 | 0.3685 | 0.4110 | 0.4475 |
| | | 400 | Nonpar | 0.5575 | 0.5835 | 0.6125 | 0.6540 | 0.7530 | 0.8480 |
| | | | BNP | 0.5565 | 0.5845 | 0.6170 | 0.6525 | 0.7465 | 0.8375 |
| | | | SD | 0.5655 | 0.5620 | 0.5690 | 0.5965 | 0.6480 | 0.7130 |
| | | | BSD | 0.5645 | 0.5530 | 0.5630 | 0.5890 | 0.6395 | 0.7010 |
| | 0.9 | 70 | Nonpar | 0.4470 | 0.4765 | 0.4985 | 0.5385 | 0.5830 | 0.6475 |
| | | | BNP | 0.4185 | 0.4505 | 0.4710 | 0.5120 | 0.5465 | 0.6045 |
| | | | SD | 0.4905 | 0.5050 | 0.5145 | 0.5500 | 0.5610 | 0.6095 |
| | | | BSD | 0.4500 | 0.4685 | 0.4755 | 0.5095 | 0.5140 | 0.5420 |
| | | 150 | Nonpar | 0.7050 | 0.7510 | 0.7785 | 0.8100 | 0.8450 | 0.9090 |
| | | | BNP | 0.6900 | 0.7340 | 0.7630 | 0.8000 | 0.8335 | 0.8930 |
| | | | SD | 0.7500 | 0.7740 | 0.7935 | 0.8035 | 0.8340 | 0.8800 |
| | | | BSD | 0.7230 | 0.7400 | 0.7735 | 0.7840 | 0.8115 | 0.8550 |
| | | 200 | Nonpar | 0.8255 | 0.8490 | 0.8705 | 0.8910 | 0.9245 | 0.9590 |
| | | | BNP | 0.8170 | 0.8410 | 0.8615 | 0.8795 | 0.9175 | 0.9530 |
| | | | SD | 0.8570 | 0.8565 | 0.8740 | 0.8790 | 0.9175 | 0.9385 |
| | | | BSD | 0.8440 | 0.8380 | 0.8615 | 0.8665 | 0.9035 | 0.9260 |
| | | 400 | Nonpar | 0.9795 | 0.9850 | 0.9890 | 0.9930 | 0.9985 | 0.9980 |
| | | | BNP | 0.9770 | 0.9840 | 0.9895 | 0.9925 | 0.9985 | 0.9980 |
| | | | SD | 0.9885 | 0.9920 | 0.9895 | 0.9915 | 0.9965 | 0.9960 |
| | | | BSD | 0.9860 | 0.9890 | 0.9880 | 0.9915 | 0.9960 | 0.9950 |
| Non-inferiority | 0.1 | 70 | Nonpar | 0.1245 | 0.1425 | 0.1455 | 0.1415 | 0.1585 | 0.1745 |
| | | | BNP | 0.1230 | 0.1395 | 0.1395 | 0.1425 | 0.1595 | 0.1730 |
| | | | SD | 0.1165 | 0.1425 | 0.1505 | 0.1560 | 0.2010 | 0.2370 |
| | | | BSD | 0.1205 | 0.1435 | 0.1530 | 0.1635 | 0.2035 | 0.2355 |
| | | 150 | Nonpar | 0.2080 | 0.1965 | 0.2200 | 0.2305 | 0.2600 | 0.2870 |
| | | | BNP | 0.2045 | 0.1955 | 0.2190 | 0.2210 | 0.2515 | 0.2780 |
| | | | SD | 0.2050 | 0.2040 | 0.2350 | 0.2490 | 0.3055 | 0.3825 |
| | | | BSD | 0.2045 | 0.2080 | 0.2365 | 0.2485 | 0.3010 | 0.3780 |
| | | 200 | Nonpar | 0.2360 | 0.2535 | 0.2760 | 0.3045 | 0.3525 | 0.4560 |
| | | | BNP | 0.2360 | 0.2510 | 0.2790 | 0.3030 | 0.3535 | 0.4520 |
| | | | SD | 0.2330 | 0.2500 | 0.2670 | 0.2815 | 0.2930 | 0.3350 |
| | | | BSD | 0.2285 | 0.2445 | 0.2650 | 0.2800 | 0.2910 | 0.3290 |
| | | 400 | Nonpar | 0.3690 | 0.4165 | 0.4580 | 0.4810 | 0.5700 | 0.6780 |
| | | | BNP | 0.3685 | 0.4150 | 0.4555 | 0.4785 | 0.5665 | 0.6790 |
| | | | SD | 0.3640 | 0.3920 | 0.4195 | 0.4295 | 0.4730 | 0.5200 |
| | | | BSD | 0.3630 | 0.3930 | 0.4175 | 0.4230 | 0.4655 | 0.5150 |
| | 0.5 | 70 | Nonpar | 0.1845 | 0.1930 | 0.1815 | 0.2095 | 0.2055 | 0.2490 |
| | | | BNP | 0.1770 | 0.1870 | 0.1765 | 0.2075 | 0.1955 | 0.2335 |
| | | | SD | 0.1720 | 0.1960 | 0.1970 | 0.2395 | 0.2450 | 0.3340 |
| | | | BSD | 0.1755 | 0.1955 | 0.1965 | 0.2400 | 0.2405 | 0.3255 |
| | | 150 | Nonpar | 0.2875 | 0.3165 | 0.3025 | 0.3225 | 0.3610 | 0.3665 |
| | | | BNP | 0.2785 | 0.3090 | 0.2945 | 0.3140 | 0.3455 | 0.3585 |
| | | | SD | 0.2755 | 0.3185 | 0.3245 | 0.3550 | 0.4305 | 0.4740 |
| | | | BSD | 0.2745 | 0.3190 | 0.3245 | 0.3545 | 0.4255 | 0.4625 |
| | | 200 | Nonpar | 0.3470 | 0.3475 | 0.3935 | 0.4350 | 0.5010 | 0.6045 |
| | | | BNP | 0.3440 | 0.3485 | 0.3910 | 0.4335 | 0.4925 | 0.5945 |
| | | | SD | 0.3450 | 0.3365 | 0.3675 | 0.3835 | 0.4165 | 0.4660 |
| | | | BSD | 0.3365 | 0.3320 | 0.3550 | 0.3720 | 0.4110 | 0.4475 |
| | | 400 | Nonpar | 0.5575 | 0.5835 | 0.6125 | 0.6540 | 0.7530 | 0.8480 |
| | | | BNP | 0.5565 | 0.5845 | 0.6170 | 0.6525 | 0.7465 | 0.8375 |
| | | | SD | 0.5655 | 0.5620 | 0.5690 | 0.5965 | 0.6480 | 0.7130 |
| | | | BSD | 0.5645 | 0.5530 | 0.5630 | 0.5890 | 0.6395 | 0.7010 |
| | 0.9 | 70 | Nonpar | 0.4915 | 0.5055 | 0.5155 | 0.5500 | 0.5610 | 0.6095 |
| | | | BNP | 0.4515 | 0.4690 | 0.4770 | 0.5095 | 0.5140 | 0.5420 |
| | | | SD | 0.4505 | 0.4770 | 0.5005 | 0.5385 | 0.5830 | 0.6475 |
| | | | BSD | 0.4225 | 0.4510 | 0.4735 | 0.5120 | 0.5465 | 0.6045 |
| | | 150 | Nonpar | 0.7500 | 0.7740 | 0.7935 | 0.8035 | 0.8340 | 0.8800 |
| | | | BNP | 0.7230 | 0.7400 | 0.7735 | 0.7840 | 0.8115 | 0.8550 |
| | | | SD | 0.7050 | 0.7510 | 0.7785 | 0.8100 | 0.8450 | 0.9090 |
| | | | BSD | 0.6900 | 0.7340 | 0.7630 | 0.8000 | 0.8335 | 0.8930 |
| | | 200 | Nonpar | 0.8255 | 0.8490 | 0.8705 | 0.8910 | 0.9245 | 0.9590 |
| | | | BNP | 0.8170 | 0.8410 | 0.8615 | 0.8795 | 0.9175 | 0.9530 |
| | | | SD | 0.8570 | 0.8565 | 0.8740 | 0.8790 | 0.9175 | 0.9385 |
| | | | BSD | 0.8440 | 0.8380 | 0.8615 | 0.8665 | 0.9035 | 0.9260 |
| | | 400 | Nonpar | 0.9795 | 0.9850 | 0.9890 | 0.9930 | 0.9985 | 0.9980 |
| | | | BNP | 0.9770 | 0.9840 | 0.9895 | 0.9925 | 0.9985 | 0.9980 |
| | | | SD | 0.9885 | 0.9920 | 0.9895 | 0.9915 | 0.9965 | 0.9960 |
| | | | BSD | 0.9860 | 0.9890 | 0.9880 | 0.9915 | 0.9960 | 0.9950 |

Note: ρ: Common correlation coefficient of the measurements between the new and standard diagnostic tests. Nonpar: Non-parametric method. BNP: Bootstrap procedure of the non-parametric method. SD: Standardized difference approach. BSD: Bootstrap procedure of the standardized difference approach.

[Figure 1: plot of power (vertical axis, 0.0 to 1.0) against the difference θ1 − θ2 (horizontal axis, −0.15 to 0.15) for the non-parametric, SD and BSD methods.]

Figure 1. The empirical power curve of equivalence testing under normal distribution when the ROC curve area of the standard diagnostic test is 0.7, N = 200 and ρ = 0.9.

difference approach can control the size at the 5 per cent nominal level and is robust to skewed distributions.
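The bootstrap idea applied to the non-parametric estimate can be sketched as follows. This is a minimal illustration only: it assumes a percentile bootstrap that resamples subjects (keeping each subject's pair of measurements together) within the diseased and non-diseased groups, which may differ from the procedure used in the simulation. The function names `auc` and `bootstrap_lower_limit` and the toy data are hypothetical.

```python
import random


def auc(diseased, healthy):
    """Mann-Whitney estimate of the ROC curve area; ties count one half."""
    wins = sum((x > y) + 0.5 * (x == y) for x in diseased for y in healthy)
    return wins / (len(diseased) * len(healthy))


def bootstrap_lower_limit(dis_pairs, non_pairs, n_boot=1000, alpha=0.05, seed=1):
    """Approximate lower percentile limit for AUC(new) - AUC(standard).

    Each element of dis_pairs / non_pairs is (new_test, standard_test)
    for one subject, so resampling subjects preserves the pairing.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        d = rng.choices(dis_pairs, k=len(dis_pairs))
        h = rng.choices(non_pairs, k=len(non_pairs))
        new = auc([p[0] for p in d], [p[0] for p in h])
        std = auc([p[1] for p in d], [p[1] for p in h])
        diffs.append(new - std)
    diffs.sort()
    return diffs[int(alpha * n_boot)]


# Toy data (hypothetical): 30 non-diseased and 30 diseased subjects.
rng = random.Random(2005)
non_pairs = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(30)]
dis_pairs = [(rng.gauss(1.2, 1.0), rng.gauss(1.2, 1.0)) for _ in range(30)]
lower = bootstrap_lower_limit(dis_pairs, non_pairs)
print("lower limit for the AUC difference:", round(lower, 4))
print("non-inferior at a margin of -0.1:", lower > -0.1)
```

Under this sketch, non-inferiority would be concluded when the lower limit exceeds the (negative) non-inferiority margin.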

Table IV provides the empirical sizes at an equivalence limit of 0.1 and the empirical powers at a difference of 0.05 in the ROC curve areas for the ordinal data with 5 categories when the total sample size is 200. From Table IV, the empirical sizes of the asymptotic non-parametric method and its bootstrap version for both equivalence and non-inferiority hypotheses range from 0.0760 to 0.1155. On the other hand, the empirical sizes of the asymptotic method and bootstrap procedure of the standardized difference approach range from 0.0650 to 0.1115 for both equivalence and non-inferiority hypotheses. Although the empirical sizes of the standardized difference approach are smaller than those of the non-parametric method, the empirical sizes of all four methods for both two-sided equivalence and


[Figure 2: plot of power (vertical axis, 0.0 to 1.0) against the difference θ1 − θ2 (horizontal axis, −0.15 to 0.15) for the non-parametric, SD and BSD methods.]

Figure 2. The empirical power curve of non-inferiority testing under normal distribution when the ROC curve area of the standard diagnostic test is 0.7, N = 200 and ρ = 0.9.

one-sided non-inferiority hypotheses are all above 0.05487. Therefore, the simulation results indicate that for the ordinal data, no method investigated in the simulation study can control the size at the 5 per cent nominal level.

5. NUMERICAL EXAMPLE

In a study by Masaryk et al. [21], two radiologists used three-dimensional magnetic resonance angiography (MRA) to evaluate the degree of arterial atherosclerotic stenosis of 65 carotid arteries (left and right) in 36 patients. These patients also underwent intra-arterial digital subtraction angiography (DSA), which is considered the gold standard for characterizing the degree of stenosis. The goals of the study were to estimate the accuracy of MRA for each reader, using the area under the ROC curve as the index of diagnostic accuracy, and to compare the accuracy of the two radiologists. This data set was used by Obuchowski [22] to illustrate the analysis of clustered ROC curve data. Past records showed that the average ROC curve area of experienced readers is 0.98. We use this data set solely to illustrate the proposed methods for non-inferiority hypotheses on diagnostic accuracy between two readers based on the ROC curve areas. The paired measurements of 33 patients obtained from left carotid arteries were used in the example. Here, reader 2 is considered to be the experienced reader and serves as the so-called 'active control', and reader 1 is the newly trained reader. Therefore, we want to verify whether the diagnostic accuracy of the new reader is not worse than that of the experienced reader. Because an average ROC curve area of 0.98 is quite high, a non-inferiority margin of −0.05 in the difference of the ROC


Table III. Empirical sizes and powers of equivalence and non-inferiority testing under the exponential distribution based on the ROC curve area for N = 200.

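Size is evaluated at an equivalence limit of 0.1 and power at a difference of 0.05. The values 0.7 and 0.8 in the column headings are the ROC curve area for the standard diagnostic test.

| Hypothesis | ρ | Method | Size (0.7) | Size (0.8) | Power (0.7) | Power (0.8) |
|---|---|---|---|---|---|---|
| Equivalence | 0.5 | Nonpar | 0.1165 | 0.2795 | 0.4560 | 0.6645 |
| | | BNP | 0.1145 | 0.2710 | 0.4490 | 0.6595 |
| | | SD | 0.0750 | 0.0925 | 0.3615 | 0.4310 |
| | | BSD | 0.0410 | 0.0370 | 0.2775 | 0.3030 |
| | 0.9 | Nonpar | 0.2095 | 0.5090 | 0.8685 | 0.9730 |
| | | BNP | 0.1965 | 0.4815 | 0.8595 | 0.9690 |
| | | SD | 0.1030 | 0.1270 | 0.8180 | 0.8495 |
| | | BSD | 0.0455 | 0.0395 | 0.7385 | 0.7330 |
| Non-inferiority | 0.5 | Nonpar | 0.1165 | 0.2795 | 0.4660 | 0.6680 |
| | | BNP | 0.1145 | 0.2710 | 0.4595 | 0.6625 |
| | | SD | 0.0750 | 0.0925 | 0.3800 | 0.4405 |
| | | BSD | 0.0410 | 0.0370 | 0.3015 | 0.3215 |
| | 0.9 | Nonpar | 0.2095 | 0.5090 | 0.8685 | 0.9730 |
| | | BNP | 0.1965 | 0.4815 | 0.8595 | 0.9690 |
| | | SD | 0.1030 | 0.1270 | 0.8180 | 0.8495 |
| | | BSD | 0.0455 | 0.0395 | 0.7385 | 0.7330 |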

Note: ρ: Common correlation coefficient of the measurements between the new and standard diagnostic tests. Nonpar: Non-parametric method. BNP: Bootstrap procedure of the non-parametric method. SD: Standardized difference approach. BSD: Bootstrap procedure of the standardized difference approach.

curve area is considered in the example. Under the normal assumption and a non-inferiority margin of −0.05 based on the ROC curve area, because $\Phi^{-1}(0.98) = 2.05375$ and $\Phi^{-1}(0.93) = 1.47579$, the corresponding non-inferiority margin for the standardized difference approach is $1.47579 - 2.05375 = -0.57796$.
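The conversion of the margin from the ROC curve area scale to the standardized difference scale can be reproduced with the inverse standard normal distribution function. The sketch below uses Python's standard library; the function name `standardized_margin` is illustrative, and the calculation assumes, as in the text, that the reference area is 0.98 and the margin on the area scale is −0.05.

```python
from statistics import NormalDist


def standardized_margin(reference_auc, auc_margin):
    """Map a margin on the ROC-area scale to the standardized-difference
    scale under the normal assumption: Phi^{-1}(ref + margin) - Phi^{-1}(ref)."""
    phi_inv = NormalDist().inv_cdf
    return phi_inv(reference_auc + auc_margin) - phi_inv(reference_auc)


margin = standardized_margin(0.98, -0.05)
print(round(margin, 5))
```

Because the inverse normal function is non-linear, the same margin of −0.05 on the area scale would map to a different standardized margin for a different reference area.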

The non-parametric estimates of the ROC curve area are 0.988 for reader 1 and 0.984 for reader 2. The standard deviation of $\hat{\theta}_1 - \hat{\theta}_2$ is estimated to be 0.0056. It follows that $Z_L = 9.64$, which is greater than $Z_{0.05} = 1.645$. Therefore, as compared to reader 2, the non-inferiority of reader 1 with respect to diagnostic accuracy based on the ROC curve area can be concluded at the 5 per cent significance level with respect to the non-inferiority limit of −0.05. The 95 per cent lower bootstrap confidence limit for the difference in the ROC curve areas based on the non-parametric method is 0, which is greater than −0.05. Therefore, the same conclusion for the non-inferiority hypothesis is also reached at the 5 per cent significance level.

On the other hand, under the normal assumption, the estimated standardized differences are 1.86948 for reader 1 and 1.63863 for reader 2, and the estimated standard deviation of their difference is 0.2452. Therefore, for a non-inferiority margin of −0.57796 on the standardized difference scale (corresponding to −0.05 on the ROC curve area scale), $Z_L = 3.30$, which is also greater than $Z_{0.05} = 1.645$. As compared to reader 2, the non-inferiority of reader 1 can be concluded at the 5 per cent significance level with the non-inferiority margin of −0.57796. The 95 per cent lower bootstrap confidence limit for the difference between the two standardized differences is 0.080610, which is greater than the non-inferiority margin of −0.57796. It follows that the same conclusion for the non-inferiority hypothesis is also reached at the 5 per cent significance level by the bootstrap version of the standardized difference approach.
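The two test statistics reported in this example can be checked by direct arithmetic. The sketch below assumes that the statistic has the form (estimate for reader 1 − estimate for reader 2 − margin) / standard deviation, with negative margins of −0.05 and −0.57796; this form is an assumption that reproduces the reported values, and the function name `noninferiority_z` is illustrative.

```python
from statistics import NormalDist


def noninferiority_z(est_new, est_std, margin, sd):
    """Wald-type statistic for H0: (new - standard) <= margin (assumed form)."""
    return (est_new - est_std - margin) / sd


z_crit = NormalDist().inv_cdf(0.95)  # one-sided 5 per cent critical value

# Non-parametric method on the ROC-area scale.
z_area = noninferiority_z(0.988, 0.984, -0.05, 0.0056)

# Standardized difference approach under the normal assumption.
z_std = noninferiority_z(1.86948, 1.63863, -0.57796, 0.2452)

print(round(z_area, 2), round(z_std, 2), round(z_crit, 3))
print("non-inferiority concluded by both:", z_area > z_crit and z_std > z_crit)
```

Both statistics exceed the critical value, in agreement with the conclusions stated in the text.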


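Table IV. Empirical sizes and powers of equivalence and non-inferiority testing under the ordinal data based on the ROC curve area for N = 200.

Size is evaluated at an equivalence limit of 0.1 and power at a difference of 0.05. The values 0.7 and 0.8 in the column headings are the ROC curve area for the standard diagnostic test.

| Hypothesis | ρ | Method | Size (0.7) | Size (0.8) | Power (0.7) | Power (0.8) |
|---|---|---|---|---|---|---|
| Equivalence | 0.5 | Nonpar | 0.0875 | 0.0775 | 0.4380 | 0.5430 |
| | | BNP | 0.0885 | 0.0760 | 0.4360 | 0.5405 |
| | | SD | 0.0720 | 0.0765 | 0.3690 | 0.4435 |
| | | BSD | 0.0670 | 0.0650 | 0.3440 | 0.4150 |
| | 0.9 | Nonpar | 0.1155 | 0.1125 | 0.8115 | 0.8805 |
| | | BNP | 0.1115 | 0.1090 | 0.8055 | 0.8765 |
| | | SD | 0.0830 | 0.1115 | 0.7110 | 0.7890 |
| | | BSD | 0.0730 | 0.0925 | 0.6895 | 0.7685 |
| Non-inferiority | 0.5 | Nonpar | 0.0875 | 0.0775 | 0.4480 | 0.5445 |
| | | BNP | 0.0885 | 0.0760 | 0.4465 | 0.5420 |
| | | SD | 0.0720 | 0.0765 | 0.3820 | 0.4455 |
| | | BSD | 0.0670 | 0.0650 | 0.3580 | 0.4180 |
| | 0.9 | Nonpar | 0.1155 | 0.1125 | 0.8115 | 0.8805 |
| | | BNP | 0.1115 | 0.1090 | 0.8055 | 0.8765 |
| | | SD | 0.0830 | 0.1115 | 0.7110 | 0.7890 |
| | | BSD | 0.0730 | 0.0925 | 0.6895 | 0.7685 |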

Note: ρ: Common correlation coefficient of the measurements between the new and standard diagnostic tests. Nonpar: Non-parametric method. BNP: Bootstrap procedure of the non-parametric method. SD: Standardized difference approach. BSD: Bootstrap procedure of the standardized difference approach.

6. DISCUSSION

The technology of diagnostic tests for disease identification and staging has advanced rapidly. In particular, after completion of the Human Genome Project, tests based on gene chips or biochips may provide quick, inexpensive, non-invasive and easy-to-use tools for diagnosis of diseases. Furthermore, the importance of diagnostic tests increases as more targeted clinical trials are conducted for the individualized treatment of patients in the genomic era [23, 24]. However, the diagnostic accuracy of any newly developed diagnostic technology must be rigorously evaluated and approved by the health regulatory agencies before its routine use. One approach is to verify whether the diagnostic accuracy of the new diagnostic procedure is equivalent to, or not worse than, that of the current standard procedure, given the other advantages offered by the new procedure. Because the ROC curve area is a measure of the separation of the distribution of the measurements of the diseased patients from that of the non-diseased subjects, we proposed to use the standardized difference for evaluation of equivalence and non-inferiority between the new and standard diagnostic procedures. A FORTRAN program for computation of all four methods is available from the authors upon request.

Simulation results indicate that the non-parametric method may inflate the size considerably when the underlying distribution is skewed. On the other hand, in terms of size and power, the standardized difference approach is a very competitive alternative to the non-parametric


method. In particular, the bootstrap method of the standardized difference approach not only controls the size at the nominal level for both normal and exponential distributions but also provides sufficient power. Therefore, the simulation results suggest that the bootstrap method of the standardized difference approach is quite robust to the skewness of the distribution and to the selection of the equivalence limits under skewed distributions. As a result, we recommend the bootstrap method of the standardized difference approach to evaluate the two-sided equivalence and one-sided non-inferiority hypotheses based on the paired areas under the ROC curves between the new and standard diagnostic procedures when the measurements of both procedures are continuous.

However, all four methods, including the non-parametric method, fail to adequately control the size for the ordinal data. One of the possible reasons for the poor performance of the four methods with respect to ordinal data is that the variance of the ordinal (categorical) data is a function of the mean. The restricted maximum likelihood estimator (RMLE) of the variance of categorical data obtained at the equivalence limit should be used for testing the equivalence or non-inferiority hypotheses. Wald-type asymptotic methods for evaluation of equivalence or non-inferiority for categorical data generally will inflate the size [2]. As a result, extreme caution should be taken when evaluating equivalence and non-inferiority based on the ROC curve areas computed from ordinal data. Further research on assessment of equivalence for ordinal data is warranted since they are the most commonly recorded form of data in diagnostic procedures.

Equivalence limits should be determined jointly by clinicians, radiologists, and statisticians and pre-specified in the study protocol before the conduct of the study. Furthermore, determination of equivalence limits is not an easy task, and many factors, such as the usage of diagnostic tests, the accuracy of the standard diagnostic test, the feasibility of the required sample size and many others, should be considered. Because any equivalence or non-inferiority diagnostic study includes the standard diagnostic test, it is also an active control equivalence study. Therefore, the issues of assay sensitivity, constancy assumption, and reasons for selection of equivalence limits should be adequately addressed in the protocol [25]. Population bioequivalence and individual bioequivalence [26–29] have been suggested to evaluate bioequivalence for approval of generic drugs and to assess equivalence between diagnostic technologies [30, 31]. However, both population and individual bioequivalence are based on aggregate criteria of population average, intra-subject variability and variance of the subject-by-formulation interaction. It turns out that two totally different distributions with different averages and variances can be concluded to be individually bioequivalent [32]. Because of this drawback and other issues, the US FDA [33] and other health regulatory agencies in the world still employ average bioequivalence as the criterion for approval of generic drugs. Therefore, we also focus on average accuracy for assessment of equivalence between diagnostic tests. However, equivalence or non-inferiority on variability between diagnostic procedures is equally important and requires further research.

ACKNOWLEDGEMENTS

We would like to thank the two anonymous reviewers for their careful, thoughtful and thorough review and comments, which greatly improved the content and presentation of our work. This work is partially supported by the Taiwan National Science Grants: NSC 92-2118-M-006-001 and NSC 93-2118-M-006-002.


REFERENCES

1. Schuirmann DJ. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of bioavailability. Journal of Pharmacokinetics and Biopharmaceutics 1987; 15:657–680.

2. Liu JP, Hsueh HM, Hsieh E, Chen JJ. Tests for equivalence or non-inferiority for paired binary data. Statistics in Medicine 2002; 21:231–245.

3. Hsueh HM, Liu JP, Chen JJ. Unconditional exact tests for equivalence or non-inferiority for paired binary endpoints. Biometrics 2001; 57:478–483.

4. Tang NS, Tang ML, Chan ISF. On tests of equivalence via non-unity relative risk for matched-pair design. Statistics in Medicine 2003; 22:1217–1233.

5. Metz CE. Basic principles of ROC analysis. Seminars in Nuclear Medicine 1978; VIII:283–298.

6. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143:29–36.

7. Obuchowski N. Testing for equivalence of diagnostic tests. American Journal of Roentgenology 1997; 168:13–17.

8. Zhou XH, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine. Wiley: New York, 2002; 188–192.

9. DeLong E, DeLong D, Clarke-Pearson D. Comparing the areas under two or more correlated receiver operating characteristic curves: a non-parametric approach. Biometrics 1988; 44:837–845.

10. Sen PK. On some convergence properties of U-statistics. Calcutta Statistical Association Bulletin 1960; 10:1–18.

11. Berger RL. Multiparameter hypothesis testing in acceptance testing. Technometrics 1982; 24:295–300.

12. Hauck WW, Hyslop T, Anderson S. Generalized treatment effects for clinical trials. Statistics in Medicine 2000; 19:887–899.

13. Reiser B, Guttman I. Statistical inference for P(Y < X): the normal case. Technometrics 1986; 28:253–257.

14. Serfling RJ. Approximation Theorems of Mathematical Statistics. Wiley: New York, 1978; 19–21.

15. Efron B, Tibshirani RJ. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science 1986; 1:54–77.

16. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Chapman & Hall: New York, 1993; 168–176.

17. Moran PAP. Testing for correlation between non-negative variates. Biometrika 1967; 54:385–394.

18. Obuchowski NA. Receiver operating characteristic curves and their use in radiology. Radiology 2003; 229:3–8.

19. Parker CB, DeLong ER. ROC methodology within a monitoring framework. Statistics in Medicine 2003; 22:3473–3488.

20. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press: Oxford, U.K., 2003; 96–127.

21. Masaryk AM, Ross JS, DiCello MC, Modic MT, Paranandi L, Masaryk TJ. 3DFT MR angiography of the carotid bifurcation: potential and limitations as a screening examination. Radiology 1991; 179:797–804.

22. Obuchowski N. Nonparametric analysis of clustered ROC curve data. Biometrics 1997; 53:567–578.

23. Simon R, Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clinical Cancer Research 2004; 10:6759–6763.

24. Maitournam A, Simon R. On the efficiency of targeted clinical trials. Statistics in Medicine 2005; 24:329–339.

25. Chow SC, Liu JP. Design and Analysis of Clinical Trials (2nd edn). Wiley: New York, 2004; 250–265.

26. Anderson S, Hauck WW. Consideration of individual bioequivalence. Journal of Pharmacokinetics and Biopharmaceutics 1990; 18:259–273.

27. Hauck WW, Anderson S. Types of bioequivalence and related statistical considerations. International Journal of Clinical Pharmacology, Therapeutics and Toxicology 1992; 30:181–187.

28. Chen ML. Individual bioequivalence: a regulatory update. Journal of Biopharmaceutical Statistics 1997; 7:5–11.

29. US FDA. Guidance for Industry on Statistical Approaches to Establishing Bioequivalence. CDER, FDA: Rockville, MD, 2001; 3–7.

30. Obuchowski N. Film-screen versus digitized mammography: assessment of clinical equivalence. American Journal of Roentgenology 1999; 173:889–894.

31. Obuchowski N. Can electronic images replace hard-copy film? Defining and testing the equivalence of diagnostic tests. Statistics in Medicine 2001; 20:2845–2863.

32. Liu JP. Statistical evaluation of individual bioequivalence. Communications in Statistics, Theory and Methods 1998; 27:1433–1451.

33. US FDA. Guidance for Industry on Bioavailability and Bioequivalence Studies for Orally Administered Drug Products: General Considerations. CDER, FDA: Rockville, MD, 2003; 11.
