The Power of Statistics in the Design of a Study


Journal of Taiwan Normal University: Education, 2002, 47(2), 231-240.

Jung-Chang Tang, National Chiayi University
Shu-Hui Lee, Gonchen Elementary School

An analysis of the power of the statistical methods used is very important in designing any research project or investigation. The major concepts regarding statistical power include the sample size of the study, the desired detectable effect size, the specified Type I error rate, and the sensitivity of the statistical test. The purpose of this article is to explain the procedures a researcher should employ to make statistical power explicit in the design of his or her study. Two empirical investigations are described to illustrate Type I and Type II error by examining the sample size, statistical tests, measures, treatment effects, and within-group variability of these two studies. Finally, some procedures for considering statistical power in experimental design are suggested.

Key words: statistical power, Type I error, Type II error, effect size

Statistical power is the estimated probability that a statistical test of the null hypothesis will yield statistical significance when the null hypothesis is, in fact, false for the population (Chow, 1996). Studies with too little statistical power can frequently lead to erroneous conclusions, whereas studies with high statistical power are very likely to detect the effects of treatments and interventions.

Although statistical power is very important, it has received less attention in the behavioral and social sciences than it deserves (Cohen, 1990). It has been routine in many areas to run studies with low levels of power (Lipsey, 1990). With low power, a study can hardly detect the treatment effect and reach a correct conclusion. The explanation for the low levels of power is that many studies use samples that are too small and that researchers are more concerned with Type I errors than with statistical power (Murphy & Myors, 1998). To draw more investigators' attention to power issues, this critique examines the relationship among Type I error, Type II error, and power using two existing studies (Connell, 1987; Roth & McManis, 1972): Roth and McManis's study is used for the analysis of Type I error, and Connell's study for the analysis of Type II error. The critique may also remind investigators to consider power when designing a study, so as to avoid conducting an ineffective study and to raise the probability of conducting successful research, that is, of correctly rejecting the null hypothesis.

The Major Concepts about Statistical Power

Researchers may invoke statistical significance testing whenever they have a random sample from a population. Two types of errors can occur when making inferences based on a statistical hypothesis test (Cohen, 1982; Chow, 1996): a Type I error occurs if the null hypothesis is rejected when it should not be (the probability of this is called alpha); and a Type II error results from the failure to reject a null hypothesis

when it should be rejected (the probability of this is called beta). Statistical power is the estimated probability of avoiding a Type II error and depends on the ability of one's statistical test to detect true differences of a particular size (Cohen, 1996; Gall, Borg, & Gall, 1996). It generally depends on four things: the sample size, the desired detectable effect size, the specified Type I error rate, and the statistical test (Lipsey, 1990).

Type I Error Rate and Power

The standard, or decision criterion, used in hypothesis testing has a crucial impact on statistical power. Traditionally, the standards used to test statistical hypotheses have been set at .05 or .01 (Mohr, 1990; Murphy & Myors, 1998). Setting a more lenient standard, such as .50, makes it easier to reject the null hypothesis, and so more easily leads to Type I errors in those cases where the null is actually true. However, it also reduces the probability of committing a Type II error and thereby increases the statistical power of the study.

On the other hand, if researchers make it very difficult to reject the null hypothesis, they minimize Type I errors but increase the probability of Type II errors. Thus, minimizing Type I errors actually reduces statistical power (Cohen, 1996).

Effect Size and Power

Effect size is a key concept in statistical power analysis. Measures of effect size provide a standardized index of how much impact treatments have on the dependent variable (Lipsey, 1990). Other things being equal, the larger the effect produced by the treatment on a dependent variable for the population, the more likely it is that statistical significance will be attained and the greater the statistical power (Lipsey, 1990; Murphy & Myors, 1998). The effect size is larger or smaller depending on the relative values of the difference between the means and the variance (Cohen, 1996). Other things being equal, the larger the difference between the means, the greater the effect size; and the larger the variance, the smaller the effect size. When the effect size is very small, it may be hard to detect the treatment effect accurately and consistently in study samples. By contrast, if the effect size is relatively large, the treatment effect is easy to detect even in a relatively small sample.

Sample Size and Power

In practice, sample size is probably the most important determinant of power (Kraemer & Thiemann, 1987). With a sufficiently large N, virtually any statistic will be significantly different from zero, and almost any null hypothesis that is tested will be rejected. On the other hand, with a small N, researchers may not have enough power to reliably detect the effects of even the most substantial treatment. Because sampling error is greater for small samples and almost negligible for very large samples, it follows that sample size is a major determinant of the probability of errors in statistical conclusions and thus of statistical power. Large samples provide very precise estimates of population parameters, whereas small samples produce results that can be unstable and untrustworthy.

Statistical Test and Power

The statistical test itself is one of the factors determining statistical power. Different statistical tests do not necessarily have the same statistical power when they are applied to the same data. Three aspects of the statistical significance test are important in this regard (Lipsey, 1990). First, nonparametric tests, those that use only rank order or categorical scales, generally have

less inherent power than parametric tests, those that use continuous scores. Second, directional tests (e.g., a one-tailed t test) have greater statistical power than nondirectional tests (e.g., a two-tailed t test). Third, when no factors extraneous to the treatment contribute to population variability on the dependent measure, simple group comparison tests, such as the t test and one-way ANOVA, are quite adequate. However, when important extraneous factors influence the dependent variables in a study, failing to use a statistical test that accounts for them can greatly reduce the ability of the study to detect real treatment effects.

Criteria for Selecting Empirical Studies

The search for these studies was undertaken in the following sequence. First, computer searches were run on the Educational Resources Information Center (ERIC) database (1970 to 1997) and the PsycINFO database (1970 to 1997), using appropriate combinations of the following terms: pretest, posttest, experimental, statistical power, Type I error, Type II error, effect size. Second, studies were selected from areas relevant to Special Education or Psychology, in which we took an interest. The third selection criterion was a small subsample size (N < 10) together with a report of the research methods detailed enough to allow for evaluation. Finally, the studies were searched manually for those analyzed by t test or ANOVA, because these analyses are readily interpreted.

Methods for Analyzing Studies

Two empirical studies that met these criteria are described in turn by examining the sample size, statistical tests, measures, treatment effects, and variability within groups in each study.

The Study by Roth and McManis (1972)

The purpose of this study was to test the effects of social reinforcement on the Block Design performance of persons with mental retardation who were medically diagnosed as organically impaired and of those who displayed no signs of organicity. The subjects were 40 adults with mental retardation ranging in IQ from 57 to 83. Half of the subjects were medically diagnosed as organically impaired and the other half presented no signs of organicity. Subjects in the two major diagnostic groups were equally divided by sex and experimental condition, which yielded eight groups of five subjects each. The groups were equated for Wechsler Adult Intelligence Scale (WAIS) Full-Scale IQ, chronological age (CA), and pretest scores on the Block Design test.

After a three-week interval, all subjects were retested under their appropriate treatment conditions. Subjects in the control group were again administered

the Block Design test. Subjects in the experimental group received a positive verbal statement from the examiner immediately after they made each correct block placement.

In analyzing the treatment effects, four-factor (treatment, diagnosis, sex, test time) analyses of variance (ANOVA) were conducted on pretest and posttest scores, and three-factor ANOVAs were employed to analyze the difference scores between pretest and posttest. The results showed that differential reinforcement effects occurred with organic and nonorganic subjects. Nonorganic subjects of both sexes increased significantly in accuracy scores while decreasing in speed scores. On the other hand, organic males significantly increased in speed scores and decreased in accuracy scores.

This study might have committed a Type I error for several reasons, addressed below. First, there were only five subjects per subgroup. With five subjects in each subgroup, even one outlier could influence the mean. For example, in this study the mean and standard deviation for the organic male subgroup were 46.20 and 22.97, respectively, in posttest accuracy score. The possibility is high that at least one outlier occurred. The smaller the group, the greater the variance around the mean. Moreover, even very few outliers can produce "significant findings" if extreme enough. Extremely non-normal distributions resulting from the presence of outliers are likely to lead to serious bias in the reported α when parametric procedures are applied (Cohen, 1982). Besides, small samples may yield "flukier" statistics. Thus, the significant differences between experimental and control groups in accuracy and speed scores in this study might not represent the results that would occur in the population. Possibly social reinforcement has no effect in the population, in which case this study committed a Type I error.

Second, the number of significance tests in this study was large. The use of multiple statistical tests, as in ANOVA designs that contain main effects and interaction effects and in analyses of multiple dependent variables, can lead to a situation where the chance of committing at least one Type I error is substantially increased (Newman & Frass, 1998). Additionally, using the .05 level for many tests escalates the experimentwise Type I error rate (Cohen, 1990). This study involved a large number of statistical tests of the treatment means (20 pairwise comparisons for each dependent variable). Yet the investigators set a single level, .05, as the critical value for each null hypothesis test. It was therefore easier for them to reject the null hypotheses than it would have been at .01 or .001. The easier it is to reject the null hypothesis, the higher the chance of committing a Type I error.

Third, the authors drew conclusions about differences that had not been tested for significance. For example, the investigators stated that "The nonorganic subjects made a significant gain (p < .05) in mean accuracy score from pretest to posttest, but the organic subjects decreased slightly from pretest to posttest" (p. 185). Research reports of this kind, in which a number of hypotheses are evaluated statistically, also discuss other "findings" that were not subjected to such evaluation. Such conclusions risk a Type I error when the overall effect has been tested but the appropriate follow-up comparisons have not been made (Cohen, 1982).

Fourth, the authors used ANOVA to analyze absolute treatment effects and relative treatment effects

respectively. Furthermore, they separated the Block Design scores into accuracy scores and speed scores and analyzed each of them. Thus, ANOVA had to be carried out four times to complete the study. Each additional analysis that rejects a null hypothesis increases the chance of making a Type I error.

The Study by Connell (1987)

The purpose of this study was to evaluate the effects of stress reduction information on the state anxiety of newly admitted patients. Twelve newly admitted patients over the age of sixty were selected as subjects. They were randomly assigned to the group receiving information or to the group not receiving information. The State Anxiety Inventory Test (SAI) was administered first, followed by the Stresses in Institutional Scale Questionnaire (SIS). Subjects were then asked to rank, in priority order, the three most severe stresses anticipated or currently experienced. Information about the elicited stresses was then given to those subjects who were assigned to the information group. One week later, with the questions randomly rearranged, the researcher readministered the SAI to all subjects.

t-test values were obtained to determine whether differences in state anxiety scores occurred between those admittees who received stress reduction information and those who did not. No significant differences in pretest scores of state anxiety were obtained between the two groups. Similarly, state anxiety scores following the intervention indicated no significant differences between the two groups.

This study might have committed a Type II error for several reasons, addressed below. First, the subsample sizes (N1 = 5 and N2 = 7) were small. Statistical significance is concerned with sampling error, the expectable discrepancies between sample and corresponding population values for a given statistic. Sampling error is greater for small samples and almost negligible for very large samples. The t value (see Footnote 1) depends partly on its denominator, which reflects the subsample sizes and variances. The larger the subsample sizes, the smaller the denominator of t, and therefore the larger the t altogether (Mohr, 1990). In this study, the subsample sizes were small. Thus, sampling error could be large and t would be small. The null hypothesis could easily be retained in this study, and a Type II error could therefore easily be committed.

Second, the treatment effects were weak. The magnitude of the treatment effect in the sample reflects the magnitude of the true population parameter. The larger the sample difference of means, the larger the numerator of t, and therefore the larger the t altogether (Mohr, 1990). In this study, the means of the pretest and posttest anxiety scores for the experimental group were 52.6 and 52.2, respectively, and no significant difference in anxiety was found in this group [t(4) = -.7580, p > .05]. The treatment effects were weak because the information about elicited stresses was given to the experimental group for only one week. How many patients read and understood the information? Was too much information given? Did the intervention need to run for more than two weeks? The investigator seemed to give no answers. Unless the quality of the intervention is improved, how can the treatment effects be expected to be strong enough to be detected and to lead to rejection of the null hypothesis? Thus, the study might have committed a Type II error because of an improper study design.
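The arithmetic behind the two case analyses can be sketched in a few lines. The first function gives the chance of at least one Type I error across 20 tests at α = .05, the situation described for Roth and McManis; it assumes the tests are independent, which pairwise comparisons on the same groups are not, so the figure (roughly .64) is illustrative only. The second gives a rough power estimate for subsamples of 5 and 7, as in Connell's study. It uses a normal approximation, which tends to be optimistic for samples this small, and the effect sizes passed to it are hypothetical values, not estimates from either study. The function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one Type I error across n_tests tests.

    Assumes the tests are independent (an assumption, not a fact about
    the Roth and McManis comparisons).
    """
    return 1.0 - (1.0 - alpha) ** n_tests


def approx_power_two_sample(d: float, n1: int, n2: int, alpha: float = 0.05) -> float:
    """Normal-approximation power of a two-tailed, two-group comparison.

    d is a hypothetical standardized mean difference (mean difference
    divided by the pooled standard deviation).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    # Expected size of the test statistic when the true effect is d.
    shift = d * sqrt(n1 * n2 / (n1 + n2))
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    print("familywise rate, 20 tests at .05:",
          round(familywise_error_rate(0.05, 20), 3))
    for d in (0.2, 0.5, 0.8):  # hypothetical effect sizes
        print("d =", d, "approx. power with n = 5 and 7:",
              round(approx_power_two_sample(d, 5, 7), 3))
```

Even for a hypothetically large effect (d = .8), the approximate power with subsamples of 5 and 7 stays far below the .80 level usually judged adequate, which is consistent with the argument that such a design is prone to Type II error.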

Third, the variability within each subgroup was large. The standard deviations for the control and experimental groups were 13.4 and 17.98, respectively, at pretest. The subsample variances, and therefore the population subgroup variances, are a crucial factor affecting the t value: the larger the variances, the larger the denominator of t, and therefore the smaller the t statistic altogether (Mohr, 1990). Significant results in hypothesis testing are thus less likely when variances are large. The subjects within each subgroup were not homogeneous in this study. As for sex, females were more than four times as numerous as males. Furthermore, the extraneous variables were not well controlled. The variances would be too large for the investigator to reject the null hypotheses. Thus, it is possible that a Type II error was made.

Finally, the coarse scale that resulted from modifying the Likert scale to a yes-or-no format not only reduced the sensitivity of this study but also reduced the reliability and validity of the dependent scores. Changing the three-point rating scale to a two-point rating scale decreased the sensitivity of the study to treatment effects. In addition, anxiety scores are continuous, but the modification turned them into ordinal variables; transforming the scores into a rating scale decreases the sensitivity of the study and may cause it to fail to detect the treatment effects. A coarse instrument can hardly detect treatment effects reliably and may be unable to measure anxiety exactly because of measurement error.

Procedures for Considering Power in Designing a Study

The illustrations above show that there exists a strong relationship among statistical power, Type I error, and Type II error. Investigators need to consider power and balance Type I and Type II error before conducting their studies. The following are some procedures investigators need to consider.

Carry Out a Power Analysis before Beginning a Study

Power analysis can be used for planning a study. It supports appropriate decisions about the number of subjects that should be included in a study, estimates the likelihood that the study will reject the null hypothesis, determines what sorts of effects can be reliably detected in a study, and informs rational decisions about the standards used to define statistical significance (Murphy & Myors, 1998). Because power is a function of the sample size, the effect size in the population, and the decision criterion used to determine statistical significance (Type I error), investigators can solve for any of the four values, given the other three. For example, suppose the desired power is .80 for detecting a small treatment effect (ES = .25), with α = .05 for a one-tailed test. According to the power charts, a sample of about 200 subjects per group is necessary (Lipsey, 1990). If researchers want power to be .95, they need about 350 subjects in each group (Lipsey, 1990).

There is no rule about how much power is enough. However, power of .80 or above is usually judged to be adequate (Cohen, 1988; Murphy & Myors, 1998). With a criterion power level and an alpha set, the researcher can proceed with a power analysis. The considerations are (a) enhancing the operative effect size to make the design as efficient as possible, (b) determining the appropriate sample size, (c) relaxing the error risk criterion (α and β) if necessary to accommodate limits on sample

size or effect size, and (d) employing an appropriate statistical test.

Enhance the Operative Effect Size

Although the effect size in the population may be conceived of as fixed, the sample effect size (the operative effect size) that is used to estimate the effect size in the population may be increased. Effect size enhancements are more cost-effective to engineer than sample size increases. This is because what determines effect size has to do with treatment strength, treatment integrity, sampling, measurement, and experimental control. One of the most common effect size measures is the difference between the treatment and control group means divided by the pooled standard deviation. Therefore, the estimated effect size can be enhanced by maximizing the treatment effect and minimizing the subsample variances.

First, maximize the treatment effect. The magnitude of the treatment effect in the sample reflects the magnitude of the true population parameter. The larger the sample difference of means, the larger the numerator of the test statistic, and therefore the larger the effect size and the lower the chance of committing a Type II error. The investigator should therefore choose treatment conditions that differ strongly on the independent variable. In addition, the experimental treatment needs to be implemented over a sufficient period of time for the magnitude of the treatment effect to show. The stronger the treatment effect, the larger the effect size in the corresponding population and the more power the study has.

Second, minimize the subsample variances. If the subsample variances are small, the population variances are probably small as well. The denominator of the test statistic is then small and the effect size, therefore, is large. If the effect size is large, the design will easily detect the difference between the experimental and control conditions and reduce the chance of making a Type II error. Thus, statistical power will increase. The investigator needs to employ good design, such as random sampling and random assignment, to reduce sample variance. The investigator should also select reliable and valid instruments to reduce measurement error and increase effect size. If the measurement in a study is reliable and valid, it will be sensitive enough to detect the differences between treatments and the differences caused by the intervention. Failure to consider score reliability adequately in substantive research is very serious, because effect sizes and power against Type II error are both attenuated by measurement error. This is especially important when dealing with noisy data, such as questionnaire responses, or with processes that are difficult to measure precisely.

Increase the Sample Size

Sample sizes should be large enough to reduce sampling error. A small sample easily produces fluky and unstable statistics. Even very few outliers in a study can produce statistical significance between different means and thus increase the chance of committing a Type I error. On the other hand, a small sample has higher sampling error than a large one. This produces a larger denominator of t and makes it hard to reject the null hypothesis, so a Type II error is easily made. Thus, enlarging the sample size is an important and very useful way to raise statistical power. Researchers need to adopt as large a sample as is feasible, given cost and the availability of subjects.
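The sample sizes quoted earlier from Lipsey's (1990) power charts (about 200 subjects per group for power .80, and about 350 for power .95, at ES = .25 with a one-tailed α = .05) can be approximated with a short calculation. The sketch below is not the method behind those charts; it uses the common normal-approximation formula for comparing two equal-sized groups, n = 2((z_α + z_power) / ES)², and the function name and defaults are illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size: float, power: float = 0.80,
                alpha: float = 0.05, one_tailed: bool = True) -> int:
    """Approximate subjects needed per group for a two-group comparison.

    Normal approximation with equal group sizes; effect_size is the
    standardized mean difference (ES) to be detected.
    """
    z = NormalDist()
    # Critical value for the chosen Type I error rate.
    z_alpha = z.inv_cdf(1.0 - alpha if one_tailed else 1.0 - alpha / 2.0)
    # Quantile matching the desired power (1 - beta).
    z_power = z.inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    for target in (0.80, 0.95):
        print("power", target, "-> about", n_per_group(0.25, power=target),
              "subjects per group")
```

Under these assumptions the results fall just under 200 and just under 350 per group, in line with the chart values. Because ES enters the formula squared, halving the detectable effect size roughly quadruples the required sample, which is one way to see why enhancing the operative effect size is described above as more cost-effective than adding subjects.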

Determine Appropriate Decision Criteria

The level set for alpha influences the likelihood of statistical significance and power. Traditionally, alpha is set at .05, .01, or some other similarly low level in order to avoid committing a Type I error. Doing so, however, inevitably increases the chance of making a Type II error and reduces statistical power. Thus, the investigator needs to balance the two risks in choosing the significance level. If such an error risk analysis is not possible, it is better for the researcher to set α = β when the potential treatment effects are of practical significance (Lipsey, 1990). This practice at least assures that Type II error is given the same emphasis that is conventionally given to Type I error.

Employ Appropriate Statistical Test

A statistical test reflects certain assumptions about the sampling and the nature of the data (Lipsey, 1990). Different tests may have different formulations of the sampling error estimate and of the critical values needed for significance. For example, when their assumptions are approximately met, parametric procedures tend to be more powerful than nonparametric procedures. Conversely, when distributional assumptions are grossly violated, nonparametric procedures may be more powerful than parametric procedures (Cohen, 1982). Thus, researchers have to use an appropriate statistical test to detect the treatment effect of a study in order to increase statistical power.

Conclusion

Statistical power is determined by the sample size, the effect size in the population, the Type I error rate that is set, and the statistical test. Researchers need to consider power before conducting a study in order to correctly reject the null hypothesis when it is false. They may use a power analysis to assess and adjust the sample size, the operative effect size, and the Type I error rate so as to attain a proper level of power. Moreover, studies designed with statistical power in mind are likely to use large samples and sensitive procedures. Attention to power also directs researchers toward the effect size and forces them to think about the strength of the study effects, rather than only about whether a particular effect is significant.

Footnotes

1. The t statistic is

$$ t = \frac{\bar{X}_t - \bar{X}_c}{s\sqrt{\dfrac{1}{n_t} + \dfrac{1}{n_c}}} $$

where $\bar{X}_t$ and $\bar{X}_c$ are the sample means for the treatment and control groups respectively, $n_t$ and $n_c$ are the sample sizes for those groups, and $s$ is the pooled within-samples standard deviation used to estimate the population standard deviation.
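As a small illustration of Footnote 1, the sketch below computes t from summary statistics. The `t_statistic` function follows the footnote's formula; `pooled_sd` uses the standard pooled-variance formula, which the article does not spell out and is therefore an assumption here. All numbers in the example are hypothetical and are not taken from either study.

```python
from math import sqrt


def pooled_sd(sd_t: float, sd_c: float, n_t: int, n_c: int) -> float:
    """Pooled within-samples standard deviation (standard formula, assumed)."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return sqrt(pooled_var)


def t_statistic(mean_t: float, mean_c: float, s: float, n_t: int, n_c: int) -> float:
    """t as in Footnote 1: mean difference over s * sqrt(1/n_t + 1/n_c)."""
    return (mean_t - mean_c) / (s * sqrt(1.0 / n_t + 1.0 / n_c))


if __name__ == "__main__":
    # Hypothetical values: the same 5-point mean difference and s = 10,
    # evaluated with small and then with larger groups.
    s = pooled_sd(10.0, 10.0, 5, 5)
    print("n = 5 per group: t =", round(t_statistic(55.0, 50.0, s, 5, 5), 3))
    print("n = 50 per group: t =", round(t_statistic(55.0, 50.0, s, 50, 50), 3))
    # Doubling the pooled standard deviation halves t.
    print("n = 50, s doubled: t =", round(t_statistic(55.0, 50.0, 2 * s, 50, 50), 3))
```

Holding the mean difference fixed, larger groups shrink the denominator and raise t, while a larger pooled standard deviation does the opposite. This is the mechanism behind the sample size and variance arguments made for Connell's study.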

References

Chow, S. L. (1996). Statistical significance: Rationale, validity and utility. Thousand Oaks, CA: SAGE Publications.

Cohen, B. H. (1996). Explaining psychological statistics. Pacific Grove, CA: Brooks/Cole Publishing.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.

Cohen, P. (1982). To be or not to be: Control and balancing of Type I and Type II errors. Evaluation and Program Planning, 5, 247-253.

Connell, P. (1987). Effects of stress reduction information on anxiety. Salt Lake, UT: The American Society on Aging. (ERIC Document Reproduction Service No. ED 284 078)

Gall, M. D., Borg, W. R., & Gall, J. P. (1996). Educational research: An introduction (6th ed.). New York: Longman.

Kraemer, H. C., & Thiemann, S. (1987). How many subjects? Statistical power analysis in research. Newbury Park, CA: SAGE Publications.

Lipsey, M. W. (1990). Design sensitivity: Statistical power for experimental research. Newbury Park, CA: SAGE Publications.

Mohr, L. B. (1990). Understanding significance testing. Newbury Park, CA: SAGE Publications.

Murphy, K. R., & Myors, B. (1998). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests. Mahwah, NJ: Lawrence Erlbaum Associates.

Newman, I., & Frass, J. W. (1998). The responsibility of educational researchers to make appropriate decisions about the error rate unit in which Type I error adjustments are based: A thoughtful process not a mechanical one. Chicago: Midwestern Educational Research Association. (ERIC Document Reproduction Service No. ED 427 020)

Roth, G., & McManis, D. L. (1972). Social reinforcement effects on Block Design performance of organic and nonorganic retarded adults. American Journal of Mental Deficiency, 77(2), 181-189.

About the Authors

Jung-Chang Tang is an Assistant Professor in the Department of Special Education at National Chiayi University. E-mail: j52a@yahoo.com.tw

Shu-Hui Lee, M.Ed. from Tennessee State University, U.S.A., is an English teacher at Gonchen Elementary School in Yunlin County.

Received: March 6, 2002. Revised: August 16, 2002. Accepted: September 10, 2002.

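Chinese-Language Abstract

How Statistical Power Should Be Considered in Research Design

Jung-Chang Tang, National Chiayi University
Shu-Hui Lee, Gonchen Elementary School, Yunlin County

When conducting research, it is important to evaluate the statistical power of the study beforehand. The statistical power of a study is related to the sample size, the effect size, the Type I error rate, and the statistical test employed. This article takes two empirical studies as examples and, by examining their sample sizes, statistical tests, treatment effects, measurement instruments, and within-group variability, analyzes the likelihood of Type I and Type II error in each. Finally, it recommends raising statistical power in the following ways: analyzing statistical power before the study, enhancing the operative effect size, increasing the sample size, determining an appropriate Type I error rate, and employing an appropriate statistical test.

Key words: statistical power, Type I error, Type II error, effect size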
