Optimal sample sizes for precise interval estimation of Welch's procedure under various allocation and cost considerations


Gwowen Shieh & Show-Li Jan

Published online: 30 July 2011. © Psychonomic Society, Inc. 2011

Abstract Welch’s (Biometrika 29: 350–362, 1938) procedure has emerged as a robust alternative to Student’s t test for comparing the means of two normal populations with unknown and possibly unequal variances. To facilitate the advocated statistical practice of confidence intervals and further improve the potential applicability of Welch’s procedure, in the present article, we consider exact approaches to optimize sample size determinations for precise interval estimation of the difference between two means under various allocation and cost considerations. The desired precision of a confidence interval is assessed with respect to the control of expected half-width, and to the assurance probability of interval half-width within a designated value. Furthermore, the design schemes in terms of participant allocation and cost constraints include (a) giving the ratio of group sizes, (b) specifying one sample size, (c) attaining maximum precision performance for a fixed cost, and (d) meeting a specified precision level for the least cost. The proposed methods provide useful alternatives to the conventional sample size procedures. Also, the developed programs expand the degree of generality for the existing statistical software packages and can be accessed at brm.psychonomic-journals.org/content/supplemental.

Keywords Behrens–Fisher problem · Precision · Study design

Introduction

The fundamental results and associated usages of standard parametric procedures (such as Student’s t, ANOVA F, and ordinary least squares regression) are well documented in the literature. One important assumption underlying these traditional methods is that of equal population variances. Although the homogeneity of variance formulation provides a convenient and useful setup, it is not unusual for the homoscedasticity assumption to be violated in actual applications. For example, Grissom (2000) emphasized that there are theoretical reasons to expect and empirical results to document the existence of heteroscedasticity in clinical data. Moreover, Grissom and Kim (2005, pp. 10–14) provided additional explanations for the intrinsic causes of variance heterogeneity in real data. Notably, Grissom recommended employing suitable techniques that are superior to the traditional inferential methods under various conditions of heteroscedasticity.

Comparing two normal means whose population variances may be unequal is the well-known Behrens–Fisher problem (Kim & Cohen, 1998). Accordingly, Welch’s (1938) approximate t procedure has been recognized as a satisfactory and robust solution over the two-sample t of the Behrens–Fisher problem. The same notion was independently suggested by Smith (1936) and Satterthwaite (1946); hence, the technique is sometimes referred to as the Smith–Welch–Satterthwaite procedure. The method not only is covered in

Electronic supplementary material The online version of this article (doi:10.3758/s13428-011-0139-z) contains supplementary material, which is available to authorized users.

G. Shieh (*)
Department of Management Science, National Chiao Tung University, 1001 Ta Hsueh Road, Hsinchu, Taiwan 30050
e-mail: [email protected]

S.-L. Jan
Department of Applied Mathematics, Chung Yuan Christian University, Chungli, Taiwan 32023, Republic of China
e-mail: [email protected]


introductory textbooks of statistics and quantitative methods but also is available in several commonly used statistics packages, for example, Excel, Minitab, SAS, and SPSS. However, most research in this area is concerned with null hypothesis significance tests for detecting mean differences, for example, Best and Rayner (1987) and Wang (1971). This dominance of hypothesis testing for making statistical inferences does not occur exclusively in the Behrens–Fisher problem. It more broadly reflects the longstanding and prevalent practice of significance tests in applied research across many scientific fields. As a compelling alternative, there has been a growing awareness of the use of confidence intervals instead of hypothesis tests for inference-making purposes; see, for example, Hahn and Meeker (1991), Harlow, Mulaik, and Steiger (1997), Kline (2004), and Smithson (2003). From both practical and scientific standpoints, it may be more informative to provide a reliable estimate of the magnitude of the examined effect, rather than simply to decide whether or not a finding is statistically significant. Accordingly, Wilkinson and the American Psychological Association’s Task Force on Statistical Inference (1999) and the sixth edition of the Publication Manual of the American Psychological Association (APA, 2010) called for the greater use of confidence intervals. However, interval estimation procedures are intrinsically stochastic in nature. From a study-planning point of view, researchers may wish to credibly address specific research questions and confirm meaningful treatment differences, so that the resulting confidence interval will meet the designated precision requirements. Hence, it is of practical interest and methodological importance to develop sample size procedures for precise interval estimation in the context of the Behrens–Fisher problem.

To ensure precision of the resulting confidence intervals, the notion of expected half-width for sample size calculations is frequently introduced in standard texts. However, considerable attention has focused on the criterion of tolerance probability of interval half-width within a given value. For example, see Beal (1989), Kelley, Maxwell, and Rausch (2003), Kupper and Hafner (1989), and Liu (2009) for related discussion in the context of estimating the mean difference between two normal populations with homoscedasticity. The empirical illustration in Kupper and Hafner shows that it typically requires a larger sample size to meet the necessary assurance of tolerance probability than the control of a designated expected half-width. Therefore, the sample sizes computed by the expected half-width approach tend to be inadequate to guarantee the desired tolerance level of interval half-width. Consequently, the assurance probability approach is recommended over the expected width criterion for sample size determination. However, it is noteworthy that the two principles of expected width and assurance probability are closely related to the two standard criteria of unbiasedness and consistency in statistical point estimation, respectively. In other words, these two measures impose unique and distinct aspects of precision characteristics on the resulting confidence intervals, and each principle has conceptual and empirical implications in its own right.

Within the framework of the Behrens–Fisher problem, Wang and Kupper (1997) derived a formula to compute the necessary sample size for a selected tolerance probability when the sample size ratio is given. Although the suggested sample size technique accommodates the more realistic situation of variance heterogeneity, three essential caveats of the results in Wang and Kupper should be pointed out. First, their theoretical presentations and algebraic expressions are noticeably awkward. The formulation is complicated in form, and the complexity requires intensive, cumbersome evaluations. Furthermore, to our knowledge, there is no computer algorithm available for performing the necessary computation. Therefore, their result is of less practical value in application. Second, they suggest fixing the proportion of standard deviations as the allocation ratio to determine the optimal sample sizes for a designated tolerance level so that the total sample size is minimized. But the simplified algorithm employed by Wang and Kupper fails to take into account the underlying metric of integer sample sizes and often leads to suboptimal results. It is shown below in our numerical investigation that their procedure is not guaranteed to give the correct optimal sample sizes. Third, although there are mixed opinions on the effectiveness of expected width, they did not address the issue of how to perform the sample size calculations so that the expected confidence interval half-width will attain the planned precision. Thus, the results in Wang and Kupper should be clarified and extended with more transparent explications and exact computations. Note that the assurance probability for achieving a desired interval width can be further modified as a conditional probability, given that the confidence interval includes the true parameter. As was reported in Beal (1989), corresponding sample sizes computed with the conditional consideration are almost identical to or at most only slightly larger than those calculated with the aforementioned unconditional or tolerance probability approach. Our calculations confirm that this phenomenon also holds in the Behrens–Fisher problem. Hence, the conditional criterion presented in Wang and Kupper will not be considered further in this article.

In view of the potential variance heterogeneity one might encounter in applied work, the present article contributes to the applications of Welch’s (1938) procedure by providing feasible sample size methodology for constructing precise confidence intervals under two distinct perspectives. One method gives the minimum sample size such that the expected confidence interval half-width is within the designated bound. The other approach provides the sample size needed to guarantee, with a given tolerance probability, that the half-width of a confidence interval will not exceed the planned value. Furthermore, conventional sample size calculations do not consider allocation schemes with participant constraints or cost implications. However, researchers have explored design strategies that take into account the impact of different constraints of the sample scheme and project funding while maintaining adequate power (Allison, Allison, Faith, Paultre, & Pi-Sunyer, 1997, and references therein). Jan and Shieh (2011) considered the problem of determining optimal sample sizes to meet a designated power for Welch’s test under various allocation and cost considerations that call for independent random samples from two normal populations with possibly unequal variances. The same principles would apply for a study seeking a precise estimate of the mean difference between two treatments. It is well known that there exists a direct connection between hypothesis testing and interval estimation, although the two procedures are philosophically different in the power and precision viewpoints. Not surprisingly, the sample size required to test a hypothesis regarding the specific value of a parameter with desired power can be markedly different from the sample size needed to obtain adequate precision of interval estimation in the same study. Since there are crucial and useful tactics for study design other than the minimization of total sample size, it is prudent to present a comprehensive account of design configurations in terms of various participant and budget constraints. In this article, exact methods are presented to give proper sample sizes when either the ratio of group sizes is fixed in advance or one sample size is fixed. In addition, detailed procedures are provided to determine the optimal sample sizes to maximize the precision for a given total cost and to minimize the cost for a specified precision. Finally, corresponding SAS computer codes are developed to facilitate computations of the exact necessary sample size in actual applications.

Precise interval estimation

In line with the advocated practice of greater use of confidence intervals, we attempt to develop the sample size methodology under precision consideration for Welch’s (1938) approximate t procedure in the context of the Behrens–Fisher problem. Consider independent random samples from two normal populations with the following formulations:

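$$X_{ij} \sim N(\mu_i, \sigma_i^2),$$

where $\mu_1$, $\mu_2$, $\sigma_1^2$, and $\sigma_2^2$ are unknown parameters, $j = 1, \ldots, N_i$, and $i = 1$ and $2$. To detect the difference between two group means, the well-known Welch’s pivotal quantity is of the form

$$V = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{(S_1^2/N_1 + S_2^2/N_2)^{1/2}},$$

where $\bar{X}_1 = \sum_{j=1}^{N_1} X_{1j}/N_1$, $\bar{X}_2 = \sum_{j=1}^{N_2} X_{2j}/N_2$, $S_1^2 = \sum_{j=1}^{N_1} (X_{1j} - \bar{X}_1)^2/(N_1 - 1)$, and $S_2^2 = \sum_{j=1}^{N_2} (X_{2j} - \bar{X}_2)^2/(N_2 - 1)$. Accordingly, Welch proposed the approximate distribution for $V$:

$$V \mathrel{\dot{\sim}} t(\hat{\nu}), \tag{1}$$

where $t(\hat{\nu})$ is the $t$ distribution with degrees of freedom $\hat{\nu}$ and $\hat{\nu} = \hat{\nu}(N_1, N_2, S_1^2, S_2^2)$ with

$$\frac{1}{\hat{\nu}} = \frac{1}{N_1 - 1}\left\{\frac{S_1^2/N_1}{S_1^2/N_1 + S_2^2/N_2}\right\}^2 + \frac{1}{N_2 - 1}\left\{\frac{S_2^2/N_2}{S_1^2/N_1 + S_2^2/N_2}\right\}^2.$$

Thus, an approximate $100(1 - \alpha)\%$ two-sided confidence interval of the mean difference $(\mu_1 - \mu_2)$ is of the form $(L, U)$, where $L = \bar{X}_1 - \bar{X}_2 - t_{\hat{\nu},\alpha/2}(S_1^2/N_1 + S_2^2/N_2)^{1/2}$, $U = \bar{X}_1 - \bar{X}_2 + t_{\hat{\nu},\alpha/2}(S_1^2/N_1 + S_2^2/N_2)^{1/2}$, and $t_{\hat{\nu},\alpha/2}$ is the $100(1 - \alpha/2)$th percentile of the $t$ distribution $t(\hat{\nu})$ with degrees of freedom $\hat{\nu}$. For ease of presentation, the half-width of the $100(1 - \alpha)\%$ two-sided confidence interval is denoted by

$$H = t_{\hat{\nu},\alpha/2}(S_1^2/N_1 + S_2^2/N_2)^{1/2}. \tag{2}$$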

It is clear that the actual half-width $H$ depends on the sample sizes $N_1$ and $N_2$, the confidence coefficient $1 - \alpha$, as well as on the variance estimates $S_1^2$ and $S_2^2$. More importantly, both $S_1^2$ and $S_2^2$ are scaled chi-square random variables with degrees of freedom $(N_1 - 1)$ and $(N_2 - 1)$, respectively, and thus jointly determine the distributional feature of the half-width $H$ of a confidence interval. When planning a study for ensuring that the confidence interval is narrow enough to produce meaningful findings, researchers must consider the stochastic nature of sample variances.
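As a small illustration of the degrees-of-freedom formula and Eq. 2, the following Python sketch evaluates $\hat{\nu}$ and $H$ from summary statistics. The function names `welch_df` and `half_width` are hypothetical, the inputs are illustrative values rather than data from the article, and the critical value is supplied by the caller (2.101 is the tabled upper 2.5% point of the $t$ distribution with 18 degrees of freedom).

```python
import math

def welch_df(n1, n2, s1_sq, s2_sq):
    """Approximate degrees of freedom nu-hat of Welch's procedure."""
    u1 = s1_sq / n1                      # S1^2 / N1
    u2 = s2_sq / n2                      # S2^2 / N2
    w1 = u1 / (u1 + u2)                  # share of the first group
    w2 = u2 / (u1 + u2)                  # share of the second group
    return 1.0 / (w1 ** 2 / (n1 - 1) + w2 ** 2 / (n2 - 1))

def half_width(t_crit, n1, n2, s1_sq, s2_sq):
    """Half-width H of Eq. 2 for a caller-supplied critical value."""
    return t_crit * math.sqrt(s1_sq / n1 + s2_sq / n2)

# Equal sizes and equal sample variances: nu-hat reduces to 2 * (10 - 1).
nu_equal = welch_df(10, 10, 1.0, 1.0)
# Unequal sizes and variances: nu-hat lies between min(N1, N2) - 1 and N1 + N2 - 2.
nu_unequal = welch_df(10, 30, 4.0, 1.0)
h = half_width(2.101, 10, 10, 1.0, 1.0)
print(nu_equal, nu_unequal, h)
```

Because $\hat{\nu}$ and $H$ both depend on the observed variances, repeating the calculation with other values of `s1_sq` and `s2_sq` shows directly why the half-width is a random quantity at the planning stage.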

For the purpose of advanced research design, it is desirable to determine the sample sizes required to achieve the designated precision properties of a confidence interval. Two useful principles concern the control of the expected half-width and the tolerance probability of the half-width within a preassigned value. Specifically, it is necessary to determine the required sample size such that the expected half-width of a $100(1 - \alpha)\%$ confidence interval is within the given bound

$$E[H] = \delta, \tag{3}$$

where the expectation $E[H]$ is taken with respect to the joint distribution of $S_1^2$ and $S_2^2$, and $\delta$ $(> 0)$ is a constant. On the other hand, one may compute the sample size needed to guarantee, with a given tolerance probability, that the half-width of a $100(1 - \alpha)\%$ confidence interval will not exceed the planned value

$$P\{H < \omega\} = 1 - \gamma, \tag{4}$$

where $(1 - \gamma)$ is the specified tolerance level, and $\omega$ $(> 0)$ is a constant.

To simplify presentation and computation, the following alternative formulation for $H$ is derived:

$$H = t_{\hat{\nu},\alpha/2}(K G/\kappa)^{1/2}, \tag{5}$$

where $\kappa = N_1 + N_2 - 2$; $K = (N_1 - 1)S_1^2/\sigma_1^2 + (N_2 - 1)S_2^2/\sigma_2^2 \sim \chi^2(\kappa)$; $G = (\sigma_1^2/N_1)\{B/p\} + (\sigma_2^2/N_2)\{(1 - B)/(1 - p)\}$; $p = (N_1 - 1)/\kappa$; and $B = \{(N_1 - 1)S_1^2/\sigma_1^2\}/K \sim \mathrm{Beta}\{(N_1 - 1)/2, (N_2 - 1)/2\}$. Note that the random variables $K$ and $B$ are independent. Also, it can be shown that

$$1/\hat{\nu} = B_1^2/(N_1 - 1) + B_2^2/(N_2 - 1), \tag{6}$$

where $B_2 = 1 - B_1$ and $B_1 = (\sigma_1^2/N_1)\{B/p\} / [(\sigma_1^2/N_1)\{B/p\} + (\sigma_2^2/N_2)\{(1 - B)/(1 - p)\}]$. Hence, both $G$ and $\hat{\nu}$ are functions of the random variable $B$.
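The algebra behind Eqs. 5 and 6 can be checked numerically. The Python sketch below draws one pair of normal samples, forms $K$, $B$, $G$, and $B_1$ as defined above, and compares $KG/\kappa$ with $S_1^2/N_1 + S_2^2/N_2$ and Eq. 6 with the direct formula for $1/\hat{\nu}$. The sample sizes, standard deviations, and random seed are illustrative assumptions, not values from the article.

```python
import random

random.seed(1)
n1, n2 = 8, 15                 # illustrative sample sizes (assumption)
sigma1, sigma2 = 2.0, 1.0      # illustrative standard deviations (assumption)

def sample_variance(xs):
    """Unbiased sample variance with divisor len(xs) - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

s1_sq = sample_variance([random.gauss(0.0, sigma1) for _ in range(n1)])
s2_sq = sample_variance([random.gauss(0.0, sigma2) for _ in range(n2)])

# Quantities defined with Eq. 5.
kappa = n1 + n2 - 2
p = (n1 - 1) / kappa
K = (n1 - 1) * s1_sq / sigma1 ** 2 + (n2 - 1) * s2_sq / sigma2 ** 2
B = ((n1 - 1) * s1_sq / sigma1 ** 2) / K
u1 = sigma1 ** 2 / n1 * B / p
u2 = sigma2 ** 2 / n2 * (1 - B) / (1 - p)
G = u1 + u2
B1 = u1 / G

# Eq. 2 versus Eq. 5: the squared standard error of the mean difference.
var_direct = s1_sq / n1 + s2_sq / n2
var_eq5 = K * G / kappa

# Direct formula for 1 / nu-hat versus Eq. 6.
w1 = (s1_sq / n1) / var_direct
inv_nu_direct = w1 ** 2 / (n1 - 1) + (1 - w1) ** 2 / (n2 - 1)
inv_nu_eq6 = B1 ** 2 / (n1 - 1) + (1 - B1) ** 2 / (n2 - 1)

print(var_direct, var_eq5)
print(inv_nu_direct, inv_nu_eq6)
```

The two printed pairs agree up to floating-point rounding for any seed, since the identities are exact rather than distributional.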

It is clear from the distinct formulations in Eqs. 2 and 5 that the underlying core distribution of H transforms from the joint distribution of two independent chi-square random variables to the joint distribution of a chi-square random variable K and a beta random variable B. The suggested transformation appears at first sight to be of not much use, but actually it greatly simplifies our analytical and computational illustrations. Note that the product form of a chi-square random variable K and other terms associated with a beta random variable B in Eq. 5 permit more transparent representations than those presented in Wang and Kupper (1997). Moreover, a beta distribution is bounded by 0 and 1, and requires less computational effort than a chi-square distribution. Therefore, the numerical computation of exact values of E[H] and P{H < ω} can be conducted with the evaluations of both the one-dimensional integration with respect to a beta probability distribution function, and the cumulative distribution function of a chi-square random variable. Since all related functions are readily available in major statistical packages, the exact computations can be performed with current computing capabilities.

In order to permit a practical treatment of sample size planning, additional concerns are considered to accommodate the participant and cost constraints in practical situations. In the next two sections, we will synthesize the ideas of Jan and Shieh (2011) and Kupper and Hafner (1989) to develop exact procedures of precise interval estimation with four different design and budget settings under the expected width and tolerance probability considerations, respectively. All calculations are performed using programs written with SAS/IML (SAS Institute, 2008a), and they are available in the supplementary files.

Expected width consideration

With the distributional properties described in Eqs. 5 and 6, the assessment of expected half-width $E[H]$ in Eq. 3 can be simplified as

$$E[H] = E_K[K^{1/2}]\, E_B[t_{\hat{\nu},\alpha/2}\, G^{1/2}] / \kappa^{1/2}. \tag{7}$$

It follows from the standard result of a chi-square distribution with $\kappa$ degrees of freedom that $E_K[K^{1/2}] = 2^{1/2}\,\Gamma\{(\kappa + 1)/2\}/\Gamma\{\kappa/2\}$. Moreover, the expectation $E_B[t_{\hat{\nu},\alpha/2}\, G^{1/2}]$ is taken with respect to the distribution of $B$ and does not permit a closed-form expression. Although the expected width can still be numerically evaluated for all proper model configurations, it is prudent to focus on those with significant implications. To simplify the exposition, the following two allocation constraints are considered because of their potential usefulness. First, the ratio $r = N_2/N_1$ between the two group sizes may be fixed in advance, so the goal is to find the minimum sample size $N_1$ ($N_2 = rN_1$) required to achieve the selected precision level. Second, one of the two sample sizes, say, $N_2$, may be determined in advance, so the smallest size $N_1$ required to satisfy the specified precision should be determined.
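Eq. 7 can be evaluated with a one-dimensional numerical integration over $B$. The Python sketch below is one possible implementation using only the standard library; it is not the authors' SAS/IML program. The numerical choices (a midpoint rule over the beta density, Simpson's rule for the $t$ distribution function, Newton's method for the $t$ percentile) and all function names are assumptions made for illustration, and the routine is intended for moderate sample sizes where the beta density vanishes at both endpoints.

```python
import math
from statistics import NormalDist

def t_pdf(x, nu):
    """Density of Student's t with (possibly fractional) nu degrees of freedom."""
    log_c = math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2) - 0.5 * math.log(nu * math.pi)
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))

def t_cdf(x, nu, steps=200):
    """t distribution function: 0.5 plus Simpson's rule for the density on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, nu) + t_pdf(x, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 0.5 + total * h / 3

def t_upper_quantile(tail, nu):
    """Upper-tail percentile by Newton's method, started at the normal percentile."""
    x = NormalDist().inv_cdf(1 - tail)
    for _ in range(50):
        step = (t_cdf(x, nu) - (1 - tail)) / t_pdf(x, nu)
        x -= step
        if abs(step) < 1e-10:
            break
    return x

def expected_half_width(n1, n2, var1, var2, alpha=0.05, grid=400):
    """E[H] of Eq. 7 for population variances var1 and var2."""
    kappa = n1 + n2 - 2
    p = (n1 - 1) / kappa
    a1, a2 = (n1 - 1) / 2, (n2 - 1) / 2
    # Closed form for E[K^(1/2)] with K ~ chi-square(kappa).
    e_sqrt_k = math.sqrt(2) * math.exp(math.lgamma((kappa + 1) / 2) - math.lgamma(kappa / 2))
    log_c = math.lgamma(a1 + a2) - math.lgamma(a1) - math.lgamma(a2)
    total = 0.0
    for i in range(grid):
        b = (i + 0.5) / grid                      # midpoint of the i-th cell
        weight = math.exp(log_c + (a1 - 1) * math.log(b) + (a2 - 1) * math.log1p(-b))
        if weight < 1e-12:                        # negligible beta density
            continue
        u1 = var1 / n1 * b / p
        u2 = var2 / n2 * (1 - b) / (1 - p)
        g = u1 + u2
        b1 = u1 / g
        nu = 1 / (b1 ** 2 / (n1 - 1) + (1 - b1) ** 2 / (n2 - 1))   # Eq. 6
        total += t_upper_quantile(alpha / 2, nu) * math.sqrt(g) * weight
    return e_sqrt_k * (total / grid) / math.sqrt(kappa)

eh_32 = expected_half_width(32, 32, 1.0, 1.0)
eh_31 = expected_half_width(31, 31, 1.0, 1.0)
print(round(eh_32, 4), round(eh_31, 4))
```

With equal unit variances and a balanced design, the value for $N_1 = N_2 = 32$ should fall just below 0.5 while the value for $N_1 = N_2 = 31$ should exceed it, which is the behavior an incremental search for $E[H] \le 0.5$ relies on.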

Sample size ratio is fixed

Consider that the sample size ratio $r = N_2/N_1$ is preassigned, and without loss of generality, the ratio is assumed as $r \ge 1$. Thus, for a specified precision $\delta$, a simple incremental search can be conducted to find the minimum sample size $N_1$ such that $E[H] \le \delta$ for the chosen confidence level $(1 - \alpha)$ and error variances $(\sigma_1^2, \sigma_2^2)$. Note that the expected half-width is asymptotically equivalent to $E[H] \approx z_{\alpha/2}(\sigma_1^2/N_1 + \sigma_2^2/N_2)^{1/2}$, where $z_{\alpha/2}$ is the upper $100(\alpha/2)$th percentile of the standard normal distribution. The particular result provides a convenient initial value for $N_1$. Accordingly, it is more efficient to start the computation process with the sample size $N_{1Z}$, which is the smallest integer that satisfies the inequality

$$N_{1Z} \ge z_{\alpha/2}^2(\sigma_1^2 + \sigma_2^2/r)/\delta^2. \tag{8}$$
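The starting value in Eq. 8 is straightforward to compute. The Python sketch below uses the standard library's normal percentile; the function name `start_n1_fixed_ratio` is hypothetical, and the result is only the asymptotic starting point for the incremental search, not the exact solution.

```python
import math
from statistics import NormalDist

def start_n1_fixed_ratio(delta, alpha, sigma1, sigma2, r):
    """Smallest integer N1Z satisfying Eq. 8 for a fixed ratio r = N2 / N1."""
    z = NormalDist().inv_cdf(1 - alpha / 2)       # upper 100(alpha/2)th percentile
    bound = z ** 2 * (sigma1 ** 2 + sigma2 ** 2 / r) / delta ** 2
    return math.ceil(bound)

n1_start_r1 = start_n1_fixed_ratio(0.5, 0.05, 1.0, 1.0, 1)
n1_start_r2 = start_n1_fixed_ratio(0.5, 0.05, 1.0, 1.0, 2)
print(n1_start_r1, n1_start_r2)
```

For $\sigma_1 = \sigma_2 = 1$ and $\delta = 0.5$, these starting values sit slightly below the exact sample sizes reported in Table 1, as expected when the normal percentile replaces the larger $t$ percentile.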


For demonstration, when $\delta = 0.5$ and $1 - \alpha = 0.95$, the sample sizes $N_1$ and $N_2 = r \cdot N_1$ are presented in Table 1 for selected values of $r = 1$, 2, and 3; $\sigma_1 = 1/3$, 1/2, 1, 2, and 3; and $\sigma_2 = 1$. The actual expected half-width $E[H]$ is also listed, and the values are slightly less than the nominal value of 0.5.

One sample size is fixed

Assume the sample size $N_2$ of the second group is held constant, and that it is desirable to find the proper sample size $N_1$ to achieve the selected precision in terms of expected half-width. Just as in the previous case, the minimum sample size $N_1$ needed to ensure confidence intervals with the specified expected half-width $\delta$ can be found by a simple iterative search for the chosen confidence level $(1 - \alpha)$ and parameter values $(\sigma_1^2, \sigma_2^2)$. In this case, the starting sample size $N_{1Z}$, based on the asymptotic approximation, is the smallest integer that satisfies the inequality

$$N_{1Z} \ge \sigma_1^2 / \{(\delta/z_{\alpha/2})^2 - \sigma_2^2/N_2\}. \tag{9}$$

Note that the chosen sample size $N_2$ should not be too small: for $N_2 < \sigma_2^2/(\delta/z_{\alpha/2})^2$, the right-hand side of Eq. 9 is negative, so neither the initial value $N_{1Z}$ nor the resulting $N_1$ is meaningful. In addition, the resulting $N_{1Z}$ and $N_1$ values are unbounded and impractical if one considers a value of $N_2 = \sigma_2^2/(\delta/z_{\alpha/2})^2$. Accordingly, Table 2 presents the computed sample size $N_1$ and the actual expected half-width with chosen value $N_2$ for the same settings with $\delta = 0.5$, $1 - \alpha = 0.95$, and the five standard deviation settings of $\sigma_1$ and $\sigma_2$ in Table 1.
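The starting value in Eq. 9 and the lower limit on $N_2$ can be sketched together in Python. The function name `start_n1_fixed_n2` is hypothetical, and returning `None` when $N_2 \le \sigma_2^2/(\delta/z_{\alpha/2})^2$ is a design choice of this illustration rather than part of the described procedure.

```python
import math
from statistics import NormalDist

def start_n1_fixed_n2(delta, alpha, sigma1, sigma2, n2):
    """Smallest integer N1Z satisfying Eq. 9, or None when N2 is too small."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    slack = (delta / z) ** 2 - sigma2 ** 2 / n2   # denominator of Eq. 9
    if slack <= 0:
        # The second group alone already uses up the precision budget,
        # so no finite N1 can satisfy the asymptotic requirement.
        return None
    return math.ceil(sigma1 ** 2 / slack)

n1_start = start_n1_fixed_n2(0.5, 0.05, 1.0, 1.0, 40)
n1_infeasible = start_n1_fixed_n2(0.5, 0.05, 1.0, 1.0, 15)
print(n1_start, n1_infeasible)
```

With $\sigma_2 = 1$ and $\delta = 0.5$, the threshold $\sigma_2^2/(\delta/z_{\alpha/2})^2$ is about 15.4, which is why $N_2 = 15$ is rejected while $N_2 = 40$ yields a usable starting value.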

In addition to the prescribed allocation constraints of participants, it is often sensible to consider cost and effectiveness issues when research funding is limited. Moreover, the costs of obtaining subjects may differ across the two groups. Suppose $c_1$ and $c_2$ are the costs per subject in the first and second groups, respectively; then, the total cost of the study is $C = c_1N_1 + c_2N_2$. Thus, the following two questions arise naturally in choosing the optimal sample sizes. First, how can the maximum precision be achieved in a study with a limited budget? Second, what is the least cost for an investigation to maintain its desired level of precision? In general, balanced group sizes do not necessarily yield the optimal solution in the aforementioned two scenarios. This assertion can be easily justified from the simplified asymptotic approximation $E[H] \approx z_{\alpha/2}(\sigma_1^2/N_1 + \sigma_2^2/N_2)^{1/2}$, which implies that the optimal sample size allocation ratio for the appraisals of cost and precision is

$$\frac{N_2}{N_1} = \theta, \tag{10}$$

where $\theta = \sigma_2 c_1^{1/2} / (\sigma_1 c_2^{1/2})$. Although this identity reveals the obvious disadvantage of a naive, balanced design, it has its own weakness as a rule of thumb. It is readily seen from Eq. 7 that the exact properties of the expected half-width depend on the joint distribution of a chi-square random variable $K$ and a beta random variable $B$. The resulting behavior of $E[H]$ for finite sample sizes can be notably different from that of asymptotic theory. Hence, the simple guideline of Eq. 10 does not guarantee an optimal result when the sample sizes are small. Instead, the identity is employed as a benchmark in the following detailed and systematic presentation of optimal sample size allocation.

Total cost is fixed and expected width needs to be minimized

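It can be shown under a fixed value of total cost $C = c_1N_{1Z} + c_2N_{2Z}$ and $N_{2Z}/N_{1Z} = \theta$ that the resulting sample sizes are

$$N_{1Z} = \frac{C\,\sigma_1 c_2^{1/2}}{c_1\sigma_1 c_2^{1/2} + c_2\sigma_2 c_1^{1/2}} \quad \text{and} \quad N_{2Z} = \frac{C\,\sigma_2 c_1^{1/2}}{c_1\sigma_1 c_2^{1/2} + c_2\sigma_2 c_1^{1/2}}. \tag{11}$$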

Table 1 Computed sample sizes (N1, N2) and expected half-width E[H] when the sample size ratio r = N2/N1 is fixed, with δ = 0.5 and 1 – α = 0.95

      σ1:σ2 = 1/3:1     σ1:σ2 = 1/2:1     σ1:σ2 = 1:1       σ1:σ2 = 2:1       σ1:σ2 = 3:1
r     N1   N2   E[H]    N1   N2   E[H]    N1   N2   E[H]    N1   N2   E[H]    N1   N2   E[H]
1     19   19   .4959   21   21   .4947   32   32   .4980   79   79   .4973   156  156  .4988
2     11   22   .4788   13   26   .4843   25   50   .4901   71   142  .4989   148  296  .4995


As was described previously, although this sample size combination minimizes the magnitude $(\sigma_1^2/N_1 + \sigma_2^2/N_2)^{1/2}$ or the asymptotic expected half-width $z_{\alpha/2}(\sigma_1^2/N_1 + \sigma_2^2/N_2)^{1/2}$, it may be suboptimal with respect to the actual precision level $E[H]$. In practice, the sample sizes need to be integers, and it is unlikely that the values of $N_{1Z}$ and $N_{2Z}$ in Eq. 11 are actually whole numbers. Consequently, any adjustment or rounding of $N_{1Z}$ and $N_{2Z}$ will introduce further inexactness into the optimization analysis. To find the exact solution, a detailed precision calculation and comparison is performed for the sample size combinations with $N_1$ from $N_{1\min}$ to $N_{1\max}$ and $N_2 = \mathrm{Floor}\{(C - c_1N_1)/c_2\}$, where $N_{1\min} = \mathrm{Max}\{\mathrm{Floor}(N_{1Z}) - 10, 5\}$, $N_{1\max} = \mathrm{Ceil}\{(C - c_2N_{2\min})/c_1\}$, $N_{2\min} = \mathrm{Max}\{\mathrm{Floor}(N_{2Z}) - 10, 5\}$, the function Floor(a) returns the largest integer that is less than or equal to a, and Ceil(a) returns the smallest integer that is greater than or equal to a. Note that the constants of 10 and 5 are chosen to prevent computation error and to ensure that an optimal solution is covered. Thus, the optimal sample size allocation is the one giving the maximum precision or minimum expected half-width. For illustration, numerical results are presented in Table 3 for $(c_1, c_2) = (1, 1)$, $(1, 2)$, and $(1, 3)$, and fixed total cost $C = 30$, 40, 60, 150, and 240 in accordance with the standard deviation combinations reported in the previous two tables. The results in Table 3 reveal that the actual expected half-width for a given total cost increases considerably as the unit cost $c_2$ increases from 1 to 3. Furthermore, the simplified allocation scheme does not yield the optimal sample sizes in several cases. For example, the optimal sample sizes are $N_1 = 24$ and $N_2 = 18$ for $(\sigma_1, \sigma_2) = (1, 1)$ and $(c_1, c_2) = (1, 2)$, in contrast with the result of $N_{1Z} = 24.8528$ and $N_{2Z} = 17.5736$ computed by Eq. 11. Correspondingly, the optimal ratio $N_2/N_1 = 18/24 = 0.7500$ is slightly greater than the ratio computed with the simple formula presented in Eq. 10: $\theta = (1 \cdot 1^{1/2})/(1 \cdot 2^{1/2}) = 0.7071$.
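The benchmark values of Eqs. 10 and 11 and the integer search window described above can be sketched in Python as follows. The function name `fixed_cost_candidates` is hypothetical. As a loudly labeled simplification, the candidates are ranked here by the asymptotic quantity $\sigma_1^2/N_1 + \sigma_2^2/N_2$ rather than by the exact $E[H]$ that the described procedure uses, so the ranking is only a stand-in.

```python
import math

def fixed_cost_candidates(cost, c1, c2, sigma1, sigma2):
    """Benchmark ratio and sizes (Eqs. 10 and 11) plus the integer search window."""
    theta = sigma2 * math.sqrt(c1) / (sigma1 * math.sqrt(c2))          # Eq. 10
    denom = c1 * sigma1 * math.sqrt(c2) + c2 * sigma2 * math.sqrt(c1)
    n1z = cost * sigma1 * math.sqrt(c2) / denom                        # Eq. 11
    n2z = cost * sigma2 * math.sqrt(c1) / denom
    n1_min = max(math.floor(n1z) - 10, 5)
    n2_min = max(math.floor(n2z) - 10, 5)
    n1_max = math.ceil((cost - c2 * n2_min) / c1)
    pairs = []
    for n1 in range(n1_min, n1_max + 1):
        n2 = math.floor((cost - c1 * n1) / c2)    # spend what remains on group 2
        if n2 >= 2:
            pairs.append((n1, n2))
    return theta, n1z, n2z, pairs

sigma1, sigma2 = 1.0, 1.0
theta, n1z, n2z, pairs = fixed_cost_candidates(60, 1, 2, sigma1, sigma2)

# Stand-in criterion (assumption): asymptotic variance term instead of exact E[H].
best = min(pairs, key=lambda nn: sigma1 ** 2 / nn[0] + sigma2 ** 2 / nn[1])
print(round(theta, 4), round(n1z, 4), round(n2z, 4), best)
```

For this configuration the stand-in ranking happens to select the same integer pair as the exact result quoted in the text; that agreement is not guaranteed in general, which is the reason the exact $E[H]$ is evaluated over the whole window.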

Target expected width is fixed and total cost needs to be minimized

In this case, the large sample approximation shows that in order to ensure the nominal expected half-width $\delta = z_{\alpha/2}(\sigma_1^2/N_{1Z} + \sigma_2^2/N_{2Z})^{1/2}$ while minimizing total cost $C = c_1N_{1Z} + c_2N_{2Z}$, the best sample size combination is

$$N_{1Z} = \frac{\theta\sigma_1^2 + \sigma_2^2}{\theta(\delta/z_{\alpha/2})^2} \quad \text{and} \quad N_{2Z} = \frac{\theta\sigma_1^2 + \sigma_2^2}{(\delta/z_{\alpha/2})^2}, \tag{12}$$

where $\theta$ is the optimal ratio defined in Eq. 10. Similar to the usage of sample sizes in Eq. 11, the computed values of $N_{1Z}$ and $N_{2Z}$ in Eq. 12 are modified to expedite a screening of sample size combinations in order to find the optimal allocation that maintains the desired expected half-width with the least cost. Specifically, the exact precision computation and cost evaluation are conducted for sample size combinations with $N_1$ from $N_{1\min}$ to $N_{1\max}$ satisfying the required precision, where $N_{1\min} = \mathrm{Max}[\mathrm{Floor}(N_{1Z}) - 10, \mathrm{Ceil}\{\sigma_1^2/(\delta/z_{\alpha/2})^2\}, 6]$, $N_{1\max} = \mathrm{Ceil}[\sigma_1^2/\{(\delta/z_{\alpha/2})^2 - \sigma_2^2/N_{2\min}\}] + 20$, and $N_{2\min} = \mathrm{Max}[\mathrm{Floor}(N_{2Z}) - 10, \mathrm{Ceil}\{\sigma_2^2/(\delta/z_{\alpha/2})^2\}, 6]$. The constants of 6, 20, and 10 are chosen to prevent computation error and to enhance the optimal search. For each fixed value of $N_1$, the matching sample size $N_2$ is calculated to satisfy the required expected half-width. Thus, the optimal sample size allocation is the one giving the smallest cost while maintaining the specified expected half-width value. In cases in which there is more than one combination yielding the same least cost, the one producing the maximum precision is reported. Table 4 provides the corresponding optimal sample size allocation, cost, and actual expected half-width for the configurations of $(c_1, c_2) = (1, 1)$, $(1, 2)$, and $(1, 3)$, and the five standard deviation settings of $\sigma_1$ and $\sigma_2$. It is clear that the total cost for a required precision and for fixed standard deviations increases substantially as the unit cost $c_2$ changes from 1 to 3. The optimal allocations have the simple ratio $\theta$ for the three cases of $(\sigma_1, \sigma_2) = (1, 1)$, $(2, 1)$, and $(3, 1)$ when $(c_1, c_2) = (1, 1)$. However, most of the sample size ratios are close to, but different from, the ratio $\theta$. The largest discrepancy occurs with the case $N_2/N_1 = 22/8 = 2.7500$ for $(c_1, c_2) = (1, 1)$ and $(\sigma_1, \sigma_2) = (1/3, 1)$, whereas the approximate ratio is $\theta = (1 \cdot 1^{1/2})/\{(1/3) \cdot 1^{1/2}\} = 3$.

Table 2 Computed sample sizes (N1, N2) and expected half-width E[H] when the sample size N2 is fixed, with δ = 0.5 and 1 – α = 0.95

σ1:σ2 = 1/3:1     σ1:σ2 = 1/2:1     σ1:σ2 = 1:1       σ1:σ2 = 2:1       σ1:σ2 = 3:1
N1   N2   E[H]    N1   N2   E[H]    N1   N2   E[H]    N1   N2   E[H]    N1   N2   E[H]
7    24   .4888   12   25   .4982   27   40   .4970   78   80   .4993   166  100  .4989
6    27   .4831   10   30   .4897   23   60   .4927   71   140  .4993   152  200  .4994
5    30   .4910   9    35   .4843   21   80   .4958   69   200  .4978   148  300  .4993

Table 3 Computed sample sizes (N1, N2) and expected half-width E[H] when the total cost is fixed, with 1 – α = 0.95

       σ1:σ2 = 1/3:1           σ1:σ2 = 1/2:1           σ1:σ2 = 1:1             σ1:σ2 = 2:1             σ1:σ2 = 3:1
c1:c2  Cost  N1   N2   E[H]    Cost  N1   N2   E[H]    Cost  N1   N2   E[H]    Cost  N1   N2   E[H]    Cost  N1   N2   E[H]
1:1    30    8    22   .4960   40    13   27   .4779   60    30   30   .5150   150   100  50   .4833   240   180  60   .5081
1:2    30    6    12   .6726   40    10   15   .6231   60    24   18   .6285   150   88   31   .5517   240   162  39   .5615
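The benchmark sizes of Eq. 12 and the bounds of the screening window can be computed as in the Python sketch below. The function name `min_cost_start` is hypothetical, and the sketch stops at the window; the exact screening of each $(N_1, N_2)$ pair by $E[H]$ and cost is not reproduced here. It also assumes $\sigma_2^2/(\delta/z_{\alpha/2})^2$ is not an integer, so that the denominator used for $N_{1\max}$ stays positive.

```python
import math
from statistics import NormalDist

def min_cost_start(delta, alpha, c1, c2, sigma1, sigma2):
    """Benchmark sizes of Eq. 12 and the bounds of the screening window."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    theta = sigma2 * math.sqrt(c1) / (sigma1 * math.sqrt(c2))     # Eq. 10
    unit = (delta / z) ** 2                                       # (delta / z)^2
    n2z = (theta * sigma1 ** 2 + sigma2 ** 2) / unit              # Eq. 12
    n1z = n2z / theta
    n1_min = max(math.floor(n1z) - 10, math.ceil(sigma1 ** 2 / unit), 6)
    n2_min = max(math.floor(n2z) - 10, math.ceil(sigma2 ** 2 / unit), 6)
    n1_max = math.ceil(sigma1 ** 2 / (unit - sigma2 ** 2 / n2_min)) + 20
    return n1z, n2z, n1_min, n1_max, n2_min

n1z, n2z, n1_min, n1_max, n2_min = min_cost_start(0.5, 0.05, 1, 1, 1.0, 1.0)
print(round(n1z, 4), round(n2z, 4), n1_min, n1_max, n2_min)
```

For $(\sigma_1, \sigma_2) = (1, 1)$ and equal unit costs, the benchmark sizes are a little under 31 per group, consistent with the exact allocation in Table 4 being slightly larger.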

Tolerance probability consideration

Instead of the expected half-width criterion, a useful alternative approach for sample size determination is to ensure that the actual confidence interval half-width will not exceed the planned bound with a given tolerance probability. For analytic clarity and computational ease, the probability $P\{H < \omega\}$ given in Eq. 4 is expressed as

$$P\{H < \omega\} = E_B\left[F_K\left\{(\kappa/G)(\omega/t_{\hat{\nu},\alpha/2})^2\right\}\right], \tag{13}$$

where $F_K(\cdot)$ is the cumulative distribution function of $K \sim \chi^2(\kappa)$. Note that the expression in Eq. 13 provides a clearer and more concise exposition of the assurance probability of precision than does Eq. 14 in Wang and Kupper (1997). The formulation also expedites the subsequent computational task for various participant and cost constraints. Since there may be several possible sample sizes $N_1$ and $N_2$ that meet the required tolerance level, it is worthwhile to consider the same practical circumstances as in the case of expected interval half-width. Accordingly, the examinations presented here simplify and expand the existing and limited results in Wang and Kupper.
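The step from Eq. 5 to Eq. 13 can be written out as follows. It uses only facts stated earlier: $K$ and $B$ are independent, and both $G$ and $\hat{\nu}$ are functions of $B$ alone.

```latex
\begin{align*}
P\{H < \omega\}
  &= P\left\{ t_{\hat{\nu},\alpha/2}\,(K G/\kappa)^{1/2} < \omega \right\} \\
  &= P\left\{ K < (\kappa/G)\,(\omega/t_{\hat{\nu},\alpha/2})^{2} \right\} \\
  &= E_B\left[ P\left\{ K < (\kappa/G)\,(\omega/t_{\hat{\nu},\alpha/2})^{2} \,\middle|\, B \right\} \right] \\
  &= E_B\left[ F_K\left\{ (\kappa/G)\,(\omega/t_{\hat{\nu},\alpha/2})^{2} \right\} \right].
\end{align*}
```

Conditional on $B$, the bound $(\kappa/G)(\omega/t_{\hat{\nu},\alpha/2})^2$ is a constant, so the inner probability is simply the chi-square distribution function evaluated at that constant.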

Sample size ratio is fixed

With the allocation ratio $r = N_2/N_1 \ge 1$, specified width $\omega$, tolerance probability $(1 - \gamma)$, confidence coefficient $(1 - \alpha)$, and error variances $(\sigma_1^2, \sigma_2^2)$, a straightforward iterative process is performed to find the minimum sample size $N_1$, such that $P\{H < \omega\} \ge 1 - \gamma$. To simplify the incremental search, the initial value of $N_1$ in the algorithm is based on Eq. 8 with $\delta = \omega$, because the optimal solutions here for large values of $(1 - \gamma)$ are greater than those of the expected interval width approach with the same interval bound. This situation is similar to those noted in Kupper and Hafner (1989) for the traditional two-sample problem. More concrete examples are presented in Table 5 for $(1 - \gamma) = 0.90$ and $\omega = 0.5$. For ease of comparison, the other parameter configurations of $(1 - \alpha)$, $(\sigma_1^2, \sigma_2^2)$, and $r$ are identical to those in Table 1. In addition to its complex formulation, the numerical calculation of Wang and Kupper (1997) is also questionable. Specifically, for the settings of $\omega = 0.3$, $(1 - \alpha) = 0.95$, $(\sigma_1^2, \sigma_2^2) = (1, 2)$, and $r = 1$, our computations yield the optimal sample sizes $N_1 = N_2 = 139$ and $N_1 = N_2 = 149$ for $(1 - \gamma) = 0.80$ and 0.95, respectively. The corresponding results reported in Table 1 of Wang and Kupper are $N_1 = N_2 = 138$ and $N_1 = N_2 = 144$. Note that the SAS procedure PROC POWER (SAS Institute, 2008b) provides the useful feature of finding the optimal sample sizes $N_1 = N_2$ $(r = 1)$ for the desired tolerance probability with confidence intervals of mean difference under the homogeneous variances assumption. However, it does not consider the corresponding sample size calculations for the Behrens–Fisher problem with arbitrary sample size ratio $r \ge 1$, as is illustrated here.
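The tolerance probability in Eq. 13 can be evaluated numerically along the same lines as the expected half-width. The Python sketch below is an illustration using only the standard library, not the authors' SAS/IML code. The midpoint rule over the beta density, the Simpson and Newton steps for the $t$ percentile, the series for the chi-square distribution function, and every function name are assumptions of this sketch.

```python
import math
from statistics import NormalDist

def t_pdf(x, nu):
    """Density of Student's t with (possibly fractional) nu degrees of freedom."""
    log_c = math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2) - 0.5 * math.log(nu * math.pi)
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))

def t_cdf(x, nu, steps=200):
    """t distribution function: 0.5 plus Simpson's rule for the density on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, nu) + t_pdf(x, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 0.5 + total * h / 3

def t_upper_quantile(tail, nu):
    """Upper-tail percentile by Newton's method, started at the normal percentile."""
    x = NormalDist().inv_cdf(1 - tail)
    for _ in range(50):
        step = (t_cdf(x, nu) - (1 - tail)) / t_pdf(x, nu)
        x -= step
        if abs(step) < 1e-10:
            break
    return x

def chi2_cdf(x, k):
    """Chi-square distribution function via the series for the incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = k / 2, x / 2
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return min(1.0, total * math.exp(a * math.log(z) - z - math.lgamma(a)))

def tolerance_probability(n1, n2, var1, var2, omega, alpha=0.05, grid=400):
    """P{H < omega} of Eq. 13 for population variances var1 and var2."""
    kappa = n1 + n2 - 2
    p = (n1 - 1) / kappa
    a1, a2 = (n1 - 1) / 2, (n2 - 1) / 2
    log_c = math.lgamma(a1 + a2) - math.lgamma(a1) - math.lgamma(a2)
    total = 0.0
    for i in range(grid):
        b = (i + 0.5) / grid                      # midpoint of the i-th cell
        weight = math.exp(log_c + (a1 - 1) * math.log(b) + (a2 - 1) * math.log1p(-b))
        if weight < 1e-12:                        # negligible beta density
            continue
        u1 = var1 / n1 * b / p
        u2 = var2 / n2 * (1 - b) / (1 - p)
        g = u1 + u2
        b1 = u1 / g
        nu = 1 / (b1 ** 2 / (n1 - 1) + (1 - b1) ** 2 / (n2 - 1))   # Eq. 6
        t_crit = t_upper_quantile(alpha / 2, nu)
        total += chi2_cdf(kappa / g * (omega / t_crit) ** 2, kappa) * weight
    return total / grid

p_139 = tolerance_probability(139, 139, 1.0, 2.0, 0.3)
p_149 = tolerance_probability(149, 149, 1.0, 2.0, 0.3)
print(round(p_139, 3), round(p_149, 3))
```

For the setting $\omega = 0.3$, $(1 - \alpha) = 0.95$, and $(\sigma_1^2, \sigma_2^2) = (1, 2)$ discussed above, the two probabilities should land near the tolerance levels 0.80 and 0.95 associated with $N_1 = N_2 = 139$ and 149, subject to the accuracy of the numerical scheme chosen here.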

Table 4 Computed sample sizes (N1, N2), cost, and expected half-width E[H] when the total cost needs to be minimized, with target expected half-width δ = 0.5 and 1 – α = 0.95

       σ1:σ2 = 1/3:1           σ1:σ2 = 1/2:1           σ1:σ2 = 1:1             σ1:σ2 = 2:1             σ1:σ2 = 3:1
c1:c2  Cost  N1   N2   E[H]    Cost  N1   N2   E[H]    Cost  N1   N2   E[H]    Cost  N1   N2   E[H]    Cost  N1   N2   E[H]
1:1    30    8    22   .4960   37    12   25   .4982   64    32   32   .4980   141   94   47   .4987   248   186  62   .4998
1:2    51    9    21   .4987   60    16   22   .4998   93    37   28   .4995   182   106  38   .4999   303   205  49   .4992


One sample size is fixed

A different restriction of the design setting is to find the minimum sample size, say, N1, that ensures a required tolerance probability when the other sample size, N2, is fixed in advance. With the substitution of δ = ω in Eq. 9, the resulting sample size is utilized as the starting value for the incremental search of the optimal solution. The corresponding results with (1 − γ) = 0.90 and ω = 0.5 are listed in Table 6 for the same configurations of (1 − α) = 0.95, (σ₁², σ₂²), and N2 as in Table 2. It is clear that the computed sample size N1 in Table 6 is larger than that for the same setting in Table 2. Since there is no explicit lower bound for N2, it is possible that the specified N2 is too small and the matching N1 is unbounded, in that no finite N1 attains the required tolerance probability. Thus, the iterative search for the optimal N1 is programmed to terminate when N1 reaches the value 1,001, because the resulting sample size combination appears to be impractical or unusual.
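The capped search for N1 with N2 fixed can be sketched in Python as follows. This is an illustration under stated assumptions, not the article's program: the tolerance probability is approximated (Satterthwaite degrees of freedom at the planning variances, a Wilson–Hilferty transformation, and a Cornish–Fisher t quantile) instead of being computed exactly, the search starts at N1 = 2 and not at the Eq. 9 solution, and all function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD = NormalDist()

def t_quantile(p, df):
    """Approximate Student-t quantile (Cornish-Fisher expansion in 1/df)."""
    z = _STD.inv_cdf(p)
    g1 = (z**3 + z) / 4
    g2 = (5 * z**5 + 16 * z**3 + 3 * z) / 96
    g3 = (3 * z**7 + 19 * z**5 + 17 * z**3 - 15 * z) / 384
    g4 = (79 * z**9 + 776 * z**7 + 1482 * z**5 - 1920 * z**3 - 945 * z) / 92160
    return z + g1 / df + g2 / df**2 + g3 / df**3 + g4 / df**4

def tolerance_prob(n1, n2, sd1, sd2, omega, alpha=0.05):
    """Approximate P{H < omega} for the half-width H of Welch's interval."""
    a, b = sd1**2 / n1, sd2**2 / n2
    v = a + b
    df = v**2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))   # Satterthwaite df at planning values
    t = t_quantile(1 - alpha / 2, df)
    x = (omega / t) ** 2 / v
    k = 2 / (9 * df)
    # Wilson-Hilferty: (chi2_df / df)^(1/3) is roughly Normal(1 - k, k)
    return _STD.cdf((x ** (1 / 3) - (1 - k)) / math.sqrt(k))

def min_n1_given_n2(n2, sd1, sd2, omega, target, alpha=0.05, cap=1001):
    """Smallest N1 below the cap that reaches the target, or None if none does."""
    for n1 in range(2, cap):
        if tolerance_prob(n1, n2, sd1, sd2, omega, alpha) >= target:
            return n1
    return None  # N1 reached the cap: the fixed N2 is too small for this target

feasible = min_n1_given_n2(n2=40, sd1=1.0, sd2=1.0, omega=0.5, target=0.90)
too_small = min_n1_given_n2(n2=10, sd1=1.0, sd2=1.0, omega=0.5, target=0.90)
print(feasible, too_small)
```

With σ1 = σ2 = 1 (assumed for the 1:1 configuration) and N2 = 40, the sketch should return N1 = 38, as in Table 6. With N2 = 10 the group 2 variance term alone keeps the half-width too wide, so the loop runs to the cap and returns `None`.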

In the following section, we will turn our attention to the budget issue with varying unit cost per subject in each group.

Total cost is fixed and tolerance probability needs to be maximized

The notion of maximizing the tolerance level with a fixed value of total cost C = c1N1 + c2N2 is considered, where c1 and c2 are the known costs for each participant of the two groups. To find the best sample size allocation, the prescribed logic and algorithm under the expected width criterion is applied to the optimization of cost and tolerance probability with the substitution of the precision criterion P{H < ω} for E[H]. With a selected set of designated total costs C = 50, 60, 80, 180, and 300 and heterogeneity levels, the optimal sample sizes are summarized in Table 7 for ω = 0.5, 1 − α = 0.95, and three unit cost settings. As was described earlier for the expected half-width consideration in Table 3, the results in Table 7 show the same behavior, in that the actual tolerance probability for a given total cost decreases substantially as the unit cost c2 increases from 1 to 3. Therefore, researchers should be cautious about the prominent impact of heterogeneity on precision performance when resources are limited.
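The fixed-cost allocation can be illustrated with a small enumeration in Python: for every affordable N1, the remaining budget is spent on group 2 and the design with the largest tolerance probability is kept. The sketch makes several assumptions that are not part of the article: P{H < ω} is approximated (Satterthwaite degrees of freedom at the planning variances, Wilson–Hilferty transformation, Cornish–Fisher t quantile) in place of the exact computation, only the largest affordable N2 is tried for each N1, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD = NormalDist()

def t_quantile(p, df):
    """Approximate Student-t quantile (Cornish-Fisher expansion in 1/df)."""
    z = _STD.inv_cdf(p)
    g1 = (z**3 + z) / 4
    g2 = (5 * z**5 + 16 * z**3 + 3 * z) / 96
    g3 = (3 * z**7 + 19 * z**5 + 17 * z**3 - 15 * z) / 384
    g4 = (79 * z**9 + 776 * z**7 + 1482 * z**5 - 1920 * z**3 - 945 * z) / 92160
    return z + g1 / df + g2 / df**2 + g3 / df**3 + g4 / df**4

def tolerance_prob(n1, n2, sd1, sd2, omega, alpha=0.05):
    """Approximate P{H < omega} for the half-width H of Welch's interval."""
    a, b = sd1**2 / n1, sd2**2 / n2
    v = a + b
    df = v**2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))   # Satterthwaite df at planning values
    t = t_quantile(1 - alpha / 2, df)
    x = (omega / t) ** 2 / v
    k = 2 / (9 * df)
    # Wilson-Hilferty: (chi2_df / df)^(1/3) is roughly Normal(1 - k, k)
    return _STD.cdf((x ** (1 / 3) - (1 - k)) / math.sqrt(k))

def best_allocation_fixed_cost(total_cost, c1, c2, sd1, sd2, omega, alpha=0.05):
    """Return (N1, N2, P) maximizing the approximate P subject to c1*N1 + c2*N2 <= C."""
    best = None
    n1 = 2
    while c1 * n1 + 2 * c2 <= total_cost:             # keep at least 2 units in group 2
        # Largest affordable N2; the small offset guards against floating-point rounding.
        n2 = math.floor((total_cost - c1 * n1) / c2 + 1e-9)
        p = tolerance_prob(n1, n2, sd1, sd2, omega, alpha)
        if best is None or p > best[2]:
            best = (n1, n2, p)
        n1 += 1
    return best

best = best_allocation_fixed_cost(total_cost=80, c1=1, c2=1, sd1=1.0, sd2=1.0, omega=0.5)
print(best)
```

For C = 80 with equal unit costs and σ1 = σ2 = 1 (assumed for the 1:1 configuration), the enumeration should select the balanced design (40, 40) shown in Table 7, with an approximate tolerance probability near the tabulated .9402. The same function can be called with C = 200, c1 = 1, c2 = 0.2, σ1 = 2.3, and σ2 = 2.7 to explore the numerical example, although an approximate criterion may pick a neighboring allocation.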

Target tolerance probability is fixed and total cost needs to be minimized

In contrast with the previous case in which the total cost was fixed, the cost and precision assessment can be conversely performed by finding the optimal sample sizes to minimize cost when the target tolerance level is given. The utility of this procedure for the evaluation of expected half-width is extended to accommodate the precision criterion of assurance probability that the interval half-width is enclosed in the desirable range. To demonstrate the interrelation of the parameter configurations, numerical results are presented in Table 8 for the target tolerance probability 1 − γ = 0.90, ω = 0.5, and 1 − α = 0.95, along with several combinations of unit costs (c1, c2) and standard deviations (σ1, σ2). Similar to the expected width situation, the resulting total cost for fixed values of tolerance probability and standard deviations increases drastically as the unit cost c2 changes from 1 to 3. It is suggested in Wang and Kupper (1997, p. 735) that the optimal sample size ratio is N2/N1 = σ2/σ1 for the problem of minimizing the total sample size. However, none of the optimal allocation ratios in their Table 5 agrees with this guideline. Essentially, a systematic search and detailed inspection of sample size combinations is required to find the optimal allocation that attains the desired precision while giving the least total sample size. This extra procedure and resulting merit in sample size determination is not addressed in Wang and Kupper (1997). In contrast, all of the issues are considered in our suggested procedure and the developed program.

Table 5 Computed sample sizes (N1, N2) and tolerance probability P{H < ω} when sample size ratio r = N2/N1 is fixed with ω = 0.5, 1 − γ = 0.90, and 1 − α = 0.95

r   σ1:σ2    N1    N2   P{H < ω}
1   1/3:1    26    26   .9285
1   1/2:1    27    27   .9058
1   1:1      39    39   .9137
1   2:1      91    91   .9017
1   3:1     176   176   .9098
2   1/3:1    14    28   .9406
2   1/2:1    16    32   .9246
2   1:1      31    62   .9310
2   2:1      84   168   .9048
2   3:1     168   336   .9009
3   1/3:1    10    30   .9348
3   1/2:1    13    39   .9357
3   1:1      28    84   .9086
3   2:1      82   246   .9094
3   3:1     166   498   .9048

Table 6 Computed sample sizes (N1, N2) and tolerance probability P{H < ω} when sample size N2 is fixed with ω = 0.5, 1 − γ = 0.90, and 1 − α = 0.95

σ1:σ2    N1    N2   P{H < ω}
1/3:1   199    24   .9000
1/3:1    13    27   .9075
1/3:1     9    30   .9084
1/2:1    60    25   .9001
1/2:1    18    30   .9156
1/2:1    14    35   .9247
1:1      38    40   .9126
1:1      31    60   .9239
1:1      28    80   .9009
2:1      94    80   .9076
2:1      86   140   .9115
2:1      83   200   .9076
3:1     189   100   .9057
3:1     174   200   .9086
3:1     169   300   .9020
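For the minimum-cost problem under the tolerance probability criterion, the systematic search over sample size combinations can be sketched in Python as a brute-force enumeration: for each N1, the smallest N2 that reaches the target is found, and the cheapest feasible pair is kept. The sketch rests on assumptions that are not taken from the article: P{H < ω} is approximated (Satterthwaite degrees of freedom at the planning variances, Wilson–Hilferty transformation, Cornish–Fisher t quantile) in place of the exact computation, the grid limit `n_max` and the tie-breaking rule (higher probability among equal costs) are arbitrary choices, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD = NormalDist()

def t_quantile(p, df):
    """Approximate Student-t quantile (Cornish-Fisher expansion in 1/df)."""
    z = _STD.inv_cdf(p)
    g1 = (z**3 + z) / 4
    g2 = (5 * z**5 + 16 * z**3 + 3 * z) / 96
    g3 = (3 * z**7 + 19 * z**5 + 17 * z**3 - 15 * z) / 384
    g4 = (79 * z**9 + 776 * z**7 + 1482 * z**5 - 1920 * z**3 - 945 * z) / 92160
    return z + g1 / df + g2 / df**2 + g3 / df**3 + g4 / df**4

def tolerance_prob(n1, n2, sd1, sd2, omega, alpha=0.05):
    """Approximate P{H < omega} for the half-width H of Welch's interval."""
    a, b = sd1**2 / n1, sd2**2 / n2
    v = a + b
    df = v**2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))   # Satterthwaite df at planning values
    t = t_quantile(1 - alpha / 2, df)
    x = (omega / t) ** 2 / v
    k = 2 / (9 * df)
    # Wilson-Hilferty: (chi2_df / df)^(1/3) is roughly Normal(1 - k, k)
    return _STD.cdf((x ** (1 / 3) - (1 - k)) / math.sqrt(k))

def min_cost_design(c1, c2, sd1, sd2, omega, target, alpha=0.05, n_max=300):
    """Return (N1, N2, cost, P) with the least cost among designs reaching the target."""
    best = None  # (cost, -P, N1, N2): lower cost first, then higher probability
    for n1 in range(2, n_max + 1):
        for n2 in range(2, n_max + 1):
            cost = c1 * n1 + c2 * n2
            if best is not None and cost > best[0]:
                break                      # a larger N2 can only cost more
            p = tolerance_prob(n1, n2, sd1, sd2, omega, alpha)
            if p >= target:
                cand = (cost, -p, n1, n2)
                if best is None or cand < best:
                    best = cand
                break                      # smallest feasible N2 for this N1
    if best is None:
        return None
    cost, neg_p, n1, n2 = best
    return n1, n2, cost, -neg_p

design = min_cost_design(c1=1, c2=2, sd1=1.0, sd2=1.0, omega=0.5, target=0.90)
print(design)
```

With c1:c2 = 1:2 and σ1 = σ2 = 1 (assumed for the 1:1 configuration), the result should land at or close to the cost of 114 reported in Table 8. Because the criterion is approximate and several allocations can share the same cost, the selected pair need not coincide with the tabulated (44, 35).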

Numerical example

To illustrate the usefulness and discrepancy of the proposed sample size procedures under various situations of precision criteria and design schemes, we extend the numerical demonstration in Jan and Shieh (2011) from hypothesis testing to interval estimation for the difference of ability tests administered online and in the laboratory. Since the demographic structure of online samples can differ from that of offline samples acquired in traditional laboratory settings (Ihme, Lemke, Lieder, Martin, Muller, & Schmidt, 2009), the planning parameter values are chosen as μLab = 11, μOnline = 10, σLab = 2.3, and σOnline = 2.7 to reflect the underlying treatment effect and heteroscedasticity. Moreover, online testing has the advantages of ease of obtaining a large sample and low cost. It would seem sensible that more samples could be obtained online rather than offline. The determination of actual sample sizes depends on the precision properties that the researcher wants to ensure for the resulting confidence intervals, as well as on other essential design features. First, it is intuitively reasonable to consider the expected width criterion. Suppose that the sample ratio is NOnline/NLab = 4. It follows that the sample sizes NLab = 110 and NOnline = 440 are required for the 95% confidence intervals of mean differences to have the expected interval half-width δ ≤ 0.5. On the other hand, if the sample size for the online sample is fixed at NOnline = 400, then it would need NLab = 115 to meet the same precision. To account for a budgetary concern where the total cost is C = 200 and the respective unit costs per subject are cLab = 1 and cOnline = 0.2, the optimal allocation of sample sizes is NLab = 132 and NOnline = 340, thus producing the maximum precision within the cost constraint. Conversely, the sample size combination NLab = 125 and NOnline = 328 induces the lowest cost C = 190.6, while ensuring the expected interval half-width E[H] ≤ 0.5. The computed sample sizes and the corresponding actual values of expected interval half-width are summarized in Table 9 for ease of discussion.

Table 7 Computed sample sizes (N1, N2) and tolerance probability P{H < ω} when the total cost is fixed with half-width ω = 0.5 and 1 − α = 0.95

c1:c2   σ1:σ2   Cost    N1    N2   P{H < ω}
1:1     1/3:1     50    13    37   .9988
1:1     1/2:1     60    20    40   .9988
1:1     1:1       80    40    40   .9402
1:1     2:1      180   120    60   .9937
1:1     3:1      300   225    75   .9925
1:2     1/3:1     50    10    20   .4885
1:2     1/2:1     60    16    22   .5128
1:2     1:1       80    34    23   .2394
1:2     2:1      180   106    37   .4723
1:2     3:1      300   204    48   .4765
1:3     1/3:1     50    11    13   .1546
1:3     1/2:1     60    15    15   .1615
1:3     1:1       80    38    14   .0679
1:3     2:1      180   105    25   .1127
1:3     3:1      300   201    33   .0986

Table 8 Computed sample sizes (N1, N2), cost, and tolerance probability P{H < ω} when the total cost needs to be minimized with target tolerance probability 1 − γ = 0.90, ω = 0.5, and 1 − α = 0.95

c1:c2   σ1:σ2   Cost    N1    N2   P{H < ω}
1:1     1/3:1     39    10    29   .9141
1:1     1/2:1     47    16    31   .9042
1:1     1:1       78    39    39   .9137
1:1     2:1      161   107    54   .9002
1:1     3:1      276   207    69   .9032
1:2     1/3:1     67    11    28   .9091
1:2     1/2:1     77    19    29   .9036
1:2     1:1      114    44    35   .9072
1:2     2:1      210   120    45   .9017
1:2     3:1      338   226    56   .9057

Alternatively, it may be necessary for the confidence interval half-width to fall within a designated bound with a specified assurance level. Assume that the tolerance probability is 1 − γ = 0.90 and that the bound for the 95% confidence interval half-width is ω = 0.5. A study with the sample ratio r = NOnline/NLab = 4 must have the sample sizes NLab = 125 and NOnline = 500 to meet the precision specification. When the online sample is predetermined at NOnline = 400, the computation shows that the laboratory group must have at least the sample size NLab = 134 in order to satisfy the designated precision. In the case of limited total cost C = 200, with cLab = 1 and cOnline = 0.2, the best set of sample sizes is NLab = 133 and NOnline = 335, and the resulting tolerance level is the highest for all sample sizes NLab and NOnline with NLab + (0.2)NOnline ≤ 200. However, for the tolerance probability 1 − γ = 0.90 and 95% confidence interval half-width ω = 0.5, the minimum cost is C = 211 for the optimal sample sizes NLab = 143 and NOnline = 340. These results and associated tolerance probabilities are also presented in Table 9. It is noteworthy that the computed sample sizes under the expected width consideration are smaller than those of the tolerance probability criterion. The only exception is the third case, with fixed total cost C = 200. In that case, the optimal sample sizes NLab = 132 and NOnline = 340 yield the expected half-width 0.4878, whereas the best sample size combination NLab = 133 and NOnline = 335 gives a tolerance level of merely 0.7253 < 1 − γ = 0.90. These contrasting behaviors may be useful for researchers to justify their design strategy and financial support. The reader is referred to Ihme et al. (2009) for further details about the comparison of ability tests administered online and in the laboratory.
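Several of the figures in this example can be checked approximately with a short Python sketch that evaluates the cost, the expected half-width, and the tolerance probability for the designs in rows II to IV of Table 9. It does not reproduce the exact computations of the article: both E[H] and P{H < ω} are obtained by treating the Welch variance estimate as a scaled chi-square variable with the Satterthwaite degrees of freedom evaluated at the planning values, with a Wilson–Hilferty transformation for the probability and a Cornish–Fisher expansion for the t quantile. All function and variable names are hypothetical.

```python
import math
from statistics import NormalDist

_STD = NormalDist()

def t_quantile(p, df):
    """Approximate Student-t quantile (Cornish-Fisher expansion in 1/df)."""
    z = _STD.inv_cdf(p)
    g1 = (z**3 + z) / 4
    g2 = (5 * z**5 + 16 * z**3 + 3 * z) / 96
    g3 = (3 * z**7 + 19 * z**5 + 17 * z**3 - 15 * z) / 384
    g4 = (79 * z**9 + 776 * z**7 + 1482 * z**5 - 1920 * z**3 - 945 * z) / 92160
    return z + g1 / df + g2 / df**2 + g3 / df**3 + g4 / df**4

def variance_and_df(n1, n2, sd1, sd2):
    """Variance of the mean difference and Satterthwaite df at the planning values."""
    a, b = sd1**2 / n1, sd2**2 / n2
    v = a + b
    return v, v**2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

def expected_half_width(n1, n2, sd1, sd2, alpha=0.05):
    """Approximate E[H] = t * sqrt(v) * E[sqrt(chi2_df / df)]."""
    v, df = variance_and_df(n1, n2, sd1, sd2)
    root_mean = math.sqrt(2 / df) * math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    return t_quantile(1 - alpha / 2, df) * math.sqrt(v) * root_mean

def tolerance_prob(n1, n2, sd1, sd2, omega, alpha=0.05):
    """Approximate P{H < omega} via the Wilson-Hilferty transformation."""
    v, df = variance_and_df(n1, n2, sd1, sd2)
    x = (omega / t_quantile(1 - alpha / 2, df)) ** 2 / v
    k = 2 / (9 * df)
    return _STD.cdf((x ** (1 / 3) - (1 - k)) / math.sqrt(k))

SD_LAB, SD_ONLINE = 2.3, 2.7      # planning standard deviations from the example
C_LAB, C_ONLINE = 1.0, 0.2        # unit costs per participant from the example
designs = [(115, 400), (132, 340), (125, 328), (134, 400), (133, 335), (143, 340)]

for n_lab, n_online in designs:
    cost = C_LAB * n_lab + C_ONLINE * n_online
    e_h = expected_half_width(n_lab, n_online, SD_LAB, SD_ONLINE)
    p = tolerance_prob(n_lab, n_online, SD_LAB, SD_ONLINE, omega=0.5)
    print(f"({n_lab}, {n_online})  cost={cost:6.1f}  E[H]={e_h:.4f}  P(H<0.5)={p:.4f}")
```

The printed costs follow directly from C = cLab NLab + cOnline NOnline, for instance 125 + 0.2(328) = 190.6 and 143 + 0.2(340) = 211. The expected half-widths and tolerance probabilities are approximations and may differ from the Table 9 entries in the third decimal place.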

Conclusions

In order to enhance the applicability of confidence intervals and the fundamental usefulness of Welch’s (1938) procedure, in the present article, we present the corresponding sample size techniques under various precision principles and design schemes. The precision criteria consist of the control of the expected width and the assurance of tolerance probability of confidence intervals. The design perspective includes four different allocation constraints and cost considerations. Detailed sample size tables are provided to help researchers have a better understanding of the intrinsic relationships that exist between the optimal sample sizes and the associated model, precision, and design configurations. Since existing software packages do not accommodate sample size calculations with the same degree of generality as is illustrated in this article, computer programs are developed to facilitate the use of the suggested procedures. The proposed sample size methodology should be useful for behavioral and other areas of social sciences to plan two-group comparison studies in which variances differ across groups.

Author Note The authors thank the editor, Gregory Francis, for enhancing the clarity of the article’s presentation, Professor Chao-Ying Joanne Peng of Indiana University, and an anonymous referee, whose suggestions extended and strengthened its content immensely.

Table 9 Computed sample sizes (N1, N2) for precise interval estimation under various participant and cost constraints when σ1 = 2.3, σ2 = 2.7, 1 − α = 0.95, δ = 0.5, 1 − γ = 0.90, ω = 0.5, c1 = 1, and c2 = 0.2

                                                                  Expected width         Tolerance probability
Design                                                            (N1, N2)     E[H]      (N1, N2)     P{H < ω}
I.   Fixed allocation ratio: r = N2/N1 = 4                        (110, 440)   0.4986    (125, 500)   0.9084
II.  One sample size is fixed: N2 = 400                           (115, 400)   0.4990    (134, 400)   0.9068
III. Fixed cost: C = 200                                          (132, 340)   0.4878    (133, 335)   0.7253
IV.  Fixed target precision: δ = 0.5, ω = 0.5, and 1 − γ = 0.90   (125, 328)   0.4998    (143, 340)   0.9004

References

Allison, D. B., Allison, R. L., Faith, M. S., Paultre, F., & Pi-Sunyer, X. (1997). Power and money: Designing statistically powerful studies while minimizing financial costs. Psychological Methods, 2, 20–33.

American Psychological Association. (2010). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.

Beal, S. L. (1989). Sample size determination for confidence intervals on the population mean and on the difference between two population means. Biometrics, 45, 969–977.

Best, D. J., & Rayner, J. C. W. (1987). Welch’s approximate solution for the Behrens–Fisher problem. Technometrics, 29, 205–210.

Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and Clinical Psychology, 68, 155–165.

Grissom, R. J., & Kim, J. J. (2005). Effect sizes for research: A broad practical approach. Mahwah: Erlbaum.

Hahn, G. J., & Meeker, W. Q. (1991). Statistical intervals: A guide for practitioners. New York: Wiley.

Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (1997). What if there were no significance tests? Mahwah: Erlbaum.

Ihme, J. M., Lemke, F., Lieder, K., Martin, F., Muller, J. C., & Schmidt, S. (2009). Comparison of ability tests administered online and in the laboratory. Behavior Research Methods, 41, 1183–1189.

Jan, S. L., & Shieh, G. (2011). Optimal sample sizes for Welch’s test under various allocation and cost considerations. Behavior Research Methods. doi:10.3758/s13428-011-0095-7.

Kelley, K., Maxwell, S. E., & Rausch, J. R. (2003). Obtaining power or obtaining precision: Delineating methods of sample-size planning. Evaluation & the Health Professions, 26, 258–287.

Kim, S. H., & Cohen, A. S. (1998). On the Behrens–Fisher problem: A review. Journal of Educational and Behavioral Statistics, 23, 356–377.

Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.

Kupper, L. L., & Hafner, K. B. (1989). How appropriate are popular sample size formulas? The American Statistician, 43, 101–105.

Liu, X. S. (2009). Sample size and the width of the confidence interval for mean difference. British Journal of Mathematical and Statistical Psychology, 62, 201–215.

SAS Institute. (2008a). SAS/IML User’s Guide, Version 9.2. Cary: SAS Institute Inc.

SAS Institute. (2008b). SAS/STAT User’s Guide, Version 9.2. Cary: SAS Institute Inc.

Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components. Biometrics Bulletin, 2, 110–114.

Smith, H. F. (1936). The problem of comparing the results of two experiments with unequal errors. Journal of the Council for Scientific and Industrial Research, 9, 211–212.

Smithson, M. (2003). Confidence intervals. Thousand Oaks: Sage.

Wang, Y. Y. (1971). Probabilities of the type I errors of the Welch tests for the Behrens–Fisher problem. Journal of the American Statistical Association, 66, 605–608.

Wang, Y., & Kupper, L. L. (1997). Optimal sample sizes for estimating the difference in means between two normal populations treating confidence interval length as a random variable. Communications in Statistics - Theory and Methods, 26, 727–741.

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362.

Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.