
14.5 Significance Testing Using Permutation Tests

Significance tests tell us whether an observed effect, such as a difference between two means or a correlation between two variables, could reasonably occur “just by chance” in selecting a random sample. If not, we have evidence that the effect observed in the sample reflects an effect that is present in the population. The reasoning of tests goes like this:

1. Choose a statistic that measures the effect you are looking for.

2. Construct the sampling distribution that this statistic would have if the effect were not present in the population.

3. Locate the observed statistic on this distribution. A value in the main body of the distribution could easily occur just by chance. A value in the tail would rarely occur by chance and so is evidence that something other than chance is operating.

FIGURE 14.19 The P-value of a statistical test is found from the sampling distribution the statistic would have if the null hypothesis were true. It is the probability of a result at least as extreme as the value we actually observed.

The statement that the effect we seek is not present in the population is the null hypothesis, H0. The probability, calculated taking the null hypothesis to be true, that we would observe a statistic value as extreme or more extreme than the one we did observe is the P-value. Figure 14.19 illustrates the idea of a P-value. Small P-values are evidence against the null hypothesis and in favor of a real effect in the population. The reasoning of statistical tests is indirect and a bit subtle but is by now familiar. Tests based on resampling don’t change this reasoning. They find P-values by resampling calculations rather than from formulas and so can be used in settings where traditional tests don’t apply.

Because P-values are calculated acting as if the null hypothesis were true, we cannot resample from the observed sample as we did earlier. In the absence of bias, resampling from the original sample creates a bootstrap distribution centered at the observed value of the statistic. If the null hypothesis is in fact not true, this value may be far from the parameter value stated by the null hypothesis. We must estimate what the sampling distribution of the statistic would be if the null hypothesis were true. That is, we must obey this rule:

RESAMPLING FOR SIGNIFICANCE TESTS

To estimate the P-value for a test of significance, estimate the sampling distribution of the test statistic when the null hypothesis is true by resampling in a manner that is consistent with the null hypothesis.

EXAMPLE 14.11 Do new “directed reading activities” improve the reading ability of elementary school students, as measured by their Degree of Reading Power (DRP) scores? A study assigns students at random to either the new method (treatment group, 21 students) or traditional teaching methods (control group, 23 students). The DRP scores at the end of the study appear in Table 14.3. In Example 7.14 (page 489) we applied the two-sample t test to these data.

TABLE 14.3 Degree of Reading Power scores for third-graders

Treatment group             Control group
24  61  59  46  43  53  |   42  33  46  37  62  20
43  44  52  43  57  49  |   43  41  10  42  53  48
58  67  62  57  56  33  |   55  19  17  55  37  85
71  49  54              |   26  54  60  28  42

To apply resampling, we will start with the difference between the sample means as a measure of the effect of the new activities:

statistic = x̄treatment − x̄control

The null hypothesis H0 for the resampling test is that the teaching method has no effect on the distribution of DRP scores. If H0 is true, the DRP scores in Table 14.3 do not depend on the teaching method. Each student has a DRP score that describes that child and is the same no matter which group the child is assigned to. The observed difference in group means just reflects the accident of random assignment to the two groups.

Now we can see how to resample in a way that is consistent with the null hypothesis: imitate many repetitions of the random assignment of students to treatment and control groups, with each student always keeping his or her DRP score unchanged.

Because resampling in this way scrambles the assignment of students to groups, tests based on resampling are called permutation tests, from the mathematical name for scrambling a collection of things.

Here is an outline of the permutation test procedure for comparing the mean DRP scores in Example 14.11:

Choose 21 of the 44 students at random to be the treatment group; the other 23 are the control group. This is an ordinary SRS, chosen without replacement. It is called a permutation resample. Calculate the mean DRP score in each group, using the individual DRP scores in Table 14.3. The difference between these means is our statistic.

Repeat this resampling from the 44 students hundreds of times. The distribution of the statistic from these resamples estimates the sampling distribution under the condition that H0 is true. It is called a permutation distribution.

The value of the statistic actually observed in the study was

x̄treatment − x̄control = 51.476 − 41.522 = 9.954

Locate this value on the permutation distribution to get the P-value.

Original data:  24, 61 | 42, 33, 46, 37    x̄1 − x̄2 = 42.5 − 39.5 = 3.0
Resample 1:     33, 61 | 24, 42, 46, 37    x̄1 − x̄2 = 47 − 37.25 = 9.75
Resample 2:     37, 42 | 24, 61, 33, 46    x̄1 − x̄2 = 39.5 − 41 = −1.5
Resample 3:     33, 46 | 24, 61, 42, 37    x̄1 − x̄2 = 39.5 − 41 = −1.5

FIGURE 14.20 The idea of permutation resampling. The top box shows the outcomes of a study with four subjects in one group and two in the other. The boxes below show three permutation resamples. The values of the statistic for many such resamples form the permutation distribution.

Figure 14.20 illustrates permutation resampling on a small scale. The top box shows the results of a study with four subjects in the treatment group and two subjects in the control group. A permutation resample chooses an SRS of four of the six subjects to form the treatment group. The remaining two are the control group. The results of three permutation resamples appear below the original results, along with the statistic (difference of group means) for each.
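The text’s own software is S-PLUS (see “Using software” below). As a rough illustration of the outline above, here is a minimal Python sketch using NumPy; the scores come from Table 14.3, but the variable names, the arbitrary random seed, and the use of NumPy are my own assumptions rather than anything in the text.

    import numpy as np

    # DRP scores from Table 14.3
    treatment = np.array([24, 61, 59, 46, 43, 53,
                          43, 44, 52, 43, 57, 49,
                          58, 67, 62, 57, 56, 33,
                          71, 49, 54])
    control = np.array([42, 33, 46, 37, 62, 20,
                        43, 41, 10, 42, 53, 48,
                        55, 19, 17, 55, 37, 85,
                        26, 54, 60, 28, 42])

    rng = np.random.default_rng(seed=1)            # arbitrary seed
    observed = treatment.mean() - control.mean()   # 51.476 - 41.522 = 9.954

    pooled = np.concatenate([treatment, control])
    n_treat = len(treatment)                       # 21
    B = 999                                        # number of permutation resamples

    perm_diffs = np.empty(B)
    for b in range(B):
        shuffled = rng.permutation(pooled)         # random reassignment to the two groups
        perm_diffs[b] = shuffled[:n_treat].mean() - shuffled[n_treat:].mean()

    # One-sided P-value: proportion of resamples at least as large as the observed value
    # (the refinement described later in this section would use (count + 1) / (B + 1))
    p_value = np.mean(perm_diffs >= observed)
    print(observed, p_value)

Because the resamples are random, the estimated P-value will vary slightly from run to run; the discussion of sources of variation later in this section takes this up.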

EXAMPLE 14.12 Figure 14.21 shows the permutation distribution of the difference of means based on 999 permutation resamples from the DRP data in Table 14.3. This is a resampling estimate of the sampling distribution of the statistic when the null hypothesis H0 is true. As H0 suggests, the distribution is centered at 0 (no effect). The solid vertical line in the figure marks the location of the statistic for the original sample, 9.954. Use the permutation distribution exactly as if it were the sampling distribution: the P-value is the probability that the statistic takes a value at least as extreme as 9.954 in the direction given by the alternative hypothesis.

FIGURE 14.21 The permutation distribution of the statistic x̄treatment − x̄control based on the DRP scores of 44 students. The dashed line marks the mean of the permutation distribution: it is very close to zero, the value specified by the null hypothesis. The solid vertical line marks the observed difference in means, 9.954. Its location in the right tail shows that a value this large is unlikely to occur when the null hypothesis is true.

We seek evidence that the treatment increases DRP scores, so the alternative hypothesis is that the distribution of the statistic x̄treatment − x̄control is centered not at 0 but at some positive value. Large values of the statistic are evidence against the null hypothesis in favor of this one-sided alternative. The permutation test P-value is the proportion of the 999 resamples that give a result at least as great as 9.954. A look at the resampling results finds that 14 of the 999 resamples gave a value 9.954 or larger, so the estimated P-value is 14/999, or 0.014.

Here is a last refinement. Recall from Chapter 8 that we can improve the estimate of a population proportion by adding two successes and two failures to the sample.

It turns out that we can similarly improve the estimate of the P-value by adding one sample result more extreme than the observed statistic. The final permutation test estimate of the P-value is

(14 + 1)/(999 + 1) = 15/1000 = 0.015

The data give good evidence that the new method beats the standard method.

Figure 14.21 shows that the permutation distribution has a roughly normal shape. Because the permutation distribution approximates the sampling distribution, we now know that the sampling distribution is close to normal. When the sampling distribution is close to normal, we can safely apply the usual two-sample t test. The t test in Example 7.14 gives P = 0.013, very close to the P-value from the permutation test.

Using software

In principle, you can program almost any statistical software to do a permutation test. It is more convenient to use software that automates the process of resampling, calculating the statistic, forming the permutation distribution, and finding the P-value. The menus in S-PLUS allow you to request permutation tests along with standard tests whenever they make sense. The permutation distribution in Figure 14.21 is one output. Another is this summary of the test results:

Number of Replications: 999

Summary Statistics:
         Observed     Mean     SE  alternative  p.value
score       9.954  0.07153  4.421      greater    0.015

By giving “greater” as the alternative hypothesis, the output makes it clear that 0.015 is the one-sided P-value.

Permutation tests in practice

Permutation tests versus t tests. We have analyzed the data in Table 14.3 both by the two-sample t test (in Chapter 7) and by a permutation test.

Comparing the two approaches brings out some general points about permutation tests versus traditional formula-based tests.

The hypotheses for the t test are stated in terms of the two population means,

H0: µtreatment − µcontrol = 0
Ha: µtreatment − µcontrol > 0

The permutation test hypotheses are more general. The null hypothesis is “same distribution of scores in both groups,” and the one-sided alternative is “scores in the treatment group are systematically higher.” These more general hypotheses imply the t hypotheses if we are interested in mean scores and the two distributions have the same shape.

The plug-in principle says that the difference of sample means estimates the difference of population means. The t statistic starts with this difference. We used the same statistic in the permutation test, but that was a choice: we could use the difference of 25% trimmed means or any other statistic that measures the effect of treatment versus control.
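To make this choice of statistic concrete, here is a brief sketch (my own illustration, not the book’s software) of a permutation test routine that accepts the statistic as a function, shown with the difference of 25% trimmed means; the function names are assumptions, and scipy.stats.trim_mean(x, 0.25) trims 25% of the observations from each tail.

    import numpy as np
    from scipy.stats import trim_mean

    def permutation_test(x, y, statistic, B=999, seed=None):
        """One-sided permutation P-value for statistic(x, y); large values count as extreme."""
        rng = np.random.default_rng(seed)
        observed = statistic(x, y)
        pooled = np.concatenate([x, y])
        n = len(x)
        perm = np.empty(B)
        for b in range(B):
            s = rng.permutation(pooled)            # random reassignment to the two groups
            perm[b] = statistic(s[:n], s[n:])
        return (np.sum(perm >= observed) + 1) / (B + 1)

    def trimmed_diff(x, y):
        # difference of 25% trimmed means as the measure of effect
        return trim_mean(x, 0.25) - trim_mean(y, 0.25)

    # e.g. permutation_test(treatment, control, trimmed_diff) with the Table 14.3 arrays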

The t test statistic is based on standardizing the difference of means in a clever way to get a statistic that has a t distribution when H0 is true.

The permutation test works directly with the difference of means (or some other statistic) and estimates the sampling distribution by resampling. No formulas are needed.

The t test gives accurate P-values if the sampling distribution of the difference of means is at least roughly normal. The permutation test gives accurate P-values even when the sampling distribution is not close to normal.

The permutation test is useful even if we plan to use the two-sample t test. Rather than relying on normal quantile plots of the two samples and the central limit theorem, we can directly check the normality of the sampling distribution by looking at the permutation distribution. Permutation tests provide a “gold standard” for assessing two-sample t tests. If the two P-values differ considerably, it usually indicates that the conditions for the two-sample t don’t hold for these data. Because permutation tests give accurate P-values even when the sampling distribution is skewed, they are often used when accuracy is very important. Here is an example.

EXAMPLE 14.13 In Example 14.6, we looked at the difference in means between repair times for 1664 Verizon (ILEC) customers and 23 customers of competing companies (CLECs). Figure 14.8 (page 14-19) shows both distributions.

Penalties are assessed if a significance test concludes at the 1% significance level that CLEC customers are receiving inferior service. The alternative hypothesis is one-sided because the Public Utilities Commission wants to know if CLEC customers are disadvantaged.

Because the distributions are strongly skewed and the sample sizes are very different, two-sample t tests are inaccurate. An inaccurate testing procedure might declare 3% of tests significant at the 1% level when in fact the two groups of customers are treated identically, so that only 1% of tests should in the long run be significant. Errors like this would cost Verizon substantial sums of money.

Verizon performs permutation tests with 500,000 resamples for high accuracy, using custom software based on S-PLUS. Depending on the preferences of each state’s regulators, one of three statistics is chosen: the difference in means, x̄1 − x̄2; the pooled-variance t statistic; or a modified t statistic in which only the standard deviation of the larger group is used to determine the standard error. The last statistic prevents the large variation in the small group from inflating the standard error.

To perform a permutation test, we randomly regroup the total set of repair times into two groups that are the same sizes as the two original samples. This is consistent with the null hypothesis that CLEC versus ILEC has no effect on repair time. Each repair time appears exactly once in each resample, but some repair times from the ILEC group move to CLEC, and vice versa. We calculate the test statistic for each resample and create its permutation distribution. The P-value is the proportion of the resamples with statistics that exceed the observed statistic.
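The text does not give the formula for the modified t statistic, so the sketch below is an assumption about one natural form: the difference of means divided by a standard error built from the larger sample’s standard deviation alone. The function name is illustrative, and such a statistic could be passed to a permutation routine like the one sketched earlier.

    import numpy as np

    def modified_t(x, y):
        # t-like statistic whose standard error uses only the larger sample's SD
        # (assumed form; the text describes only the idea)
        n1, n2 = len(x), len(y)
        s_large = np.std(x, ddof=1) if n1 >= n2 else np.std(y, ddof=1)
        return (np.mean(x) - np.mean(y)) / (s_large * np.sqrt(1 / n1 + 1 / n2))

    # For example, with hypothetical arrays of repair times:
    #   permutation_test(clec_times, ilec_times, modified_t, B=500_000)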

Here are the P-values for the three tests on the Verizon data, using 500,000 permutations. The corresponding t test P-values, obtained by comparing the t statistic with t critical values, are shown for comparison.

Test statistic             t test P-value    Permutation test P-value
x̄1 − x̄2                                                0.0183
Pooled t statistic              0.0045                  0.0183
Modified t statistic            0.0044                  0.0195

The t test results are off by a factor of more than 4 because they do not take skewness into account. The t test suggests that the differences are significant at the 1% level, but the more accurate P-values from the permutation test indicate otherwise. Figure 14.22 shows the permutation distribution of the first statistic, the difference in sample means. The strong skewness implies that t tests will be inaccurate.

FIGURE 14.22 The permutation distribution of the difference of means x̄1 − x̄2 for the Verizon repair time data.

If you read Chapter 15, on nonparametric tests, you will find further comparison of permutation tests with rank tests, as well as with tests based on normal distributions.

Data from an entire population. A subtle difference between confidence intervals and significance tests is that confidence intervals require the distinction between sample and population, but tests do not. If we have data on an entire population (say, all employees of a large corporation), we don’t need a confidence interval to estimate the difference between the mean salaries of male and female employees. We can calculate the means for all men and for all women and get an exact answer. But it still makes sense to ask, “Is the difference in means so large that it would rarely occur just by chance?” A test and its P-value answer that question.

Permutation tests are a convenient way to answer such questions. In carrying out the test we pay no attention to whether the data are a sample or an entire population. The resampling assigns the full set of observed salaries at random to men and women and builds a permutation distribution from repeated random assignments. We can then see if the observed difference in mean salaries is so large that it would rarely occur if gender did not matter.

When are permutation tests valid? The two-sample t test starts from the condition that the sampling distribution of x̄1 − x̄2 is normal. This is the case if both populations have normal distributions, and it is approximately true for large samples from nonnormal populations because of the central limit theorem. The central limit theorem helps explain the robustness of the two-sample t test. The test works well when both populations are symmetric, especially when the two sample sizes are similar.

The permutation test completely removes the normality condition. But resampling in a way that moves observations between the two groups requires that the two populations be identical when the null hypothesis is true: not only are their means the same, but also their spreads and shapes. Our preferred version of the two-sample t allows different standard deviations in the two groups, so the shapes are both normal but need not have the same spread.

In Example 14.13, the distributions are strongly skewed, ruling out the t test. The permutation test is valid if the repair time distributions for Verizon customers and CLEC customers have the same shape, so that they are identical under the null hypothesis that the centers (the means) are the same.

Fortunately, the permutation test is robust. That is, it gives accurate P-values when the two population distributions have somewhat different shapes, say, when they have slightly different standard deviations.

Sources of variation. Just as in the case of bootstrap confidence intervals, permutation tests are subject to two sources of random variability: the original sample is chosen at random from the population, and the resamples are chosen at random from the sample. Again as in the case of the bootstrap, the added variation due to resampling is usually small and can be made as small as we like by increasing the number of resamples. For example, Verizon uses 500,000 resamples.

For most purposes, 999 resamples are sufficient. If stakes are high or P-values are near a critical value (for example, near 0.01 in the Verizon case), increase the number of resamples. Here is a helpful guideline: if the true (one-sided) P-value is p, the standard deviation of the estimated P-value is approximately √(p(1 − p)/B), where B is the number of resamples. You can choose B to obtain a desired level of accuracy.
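As a small numerical illustration of this guideline (my own arithmetic, not from the text): for the DRP test, where the estimated one-sided P-value was about 0.015 and B = 999, the formula gives a standard deviation of roughly 0.004, and driving it down to about 0.001 would take on the order of 15,000 resamples.

    import math

    def p_value_sd(p, B):
        # approximate standard deviation of a P-value estimated from B resamples
        return math.sqrt(p * (1 - p) / B)

    print(p_value_sd(0.015, 999))          # about 0.0038

    # to reach a standard deviation of about 0.001, solve p(1 - p)/B = 0.001**2
    B_needed = 0.015 * (1 - 0.015) / 0.001 ** 2
    print(round(B_needed))                 # 14775, roughly 15,000 resamples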

Permutation tests in other settings

The bootstrap procedure can replace many different formula-based confidence intervals, provided that we resample in a way that matches the setting.

Permutation testing is also a general method that we can adapt to various settings.

GENERAL PROCEDURE FOR PERMUTATION TESTS

To carry out a permutation test based on a statistic that measures the size of an effect of interest:
