Chapter 1. Bootstrap Method


1 Introduction

1.1 The Practice of Statistics

Statistics is the science of learning from experience, especially experience that arrives a little bit at a time. Most people are not natural-born statisticians. Left to our own devices we are not very good at picking out patterns from a sea of noisy data. To put it another way, we are all too good at picking out non-existent patterns that happen to suit our purposes. Statistical theory attacks the problem from both ends. It provides optimal methods for finding a real signal in a noisy background, and also provides strict checks against the overinterpretation of random patterns.

Statistical theory attempts to answer three basic questions:

1. Data Collection: How should I collect my data?

2. Summary: How should I analyze and summarize the data that I’ve collected?

3. Statistical Inference: How accurate are my data summaries?

The bootstrap is a recently developed technique for making certain kinds of statistical inferences. It is only recently developed because it requires modern computer power to simplify the often intricate calculations of traditional statistical theory.

1.2 Motivating Example

We now illustrate the three basic statistical concepts just mentioned using a front-page news story from the New York Times of January 27, 1987. A study was done to see if small aspirin doses would prevent heart attacks in healthy middle-aged men. The data for the aspirin study were collected in a particularly efficient way: by a controlled, randomized, double-blind study. One half of the subjects received aspirin and the other half received a control substance, or placebo, with no active ingredients. The subjects were randomly assigned to the aspirin or placebo groups. Both the subjects and the supervising physicians were blind to the assignments, with the statisticians keeping a secret code of who received which substance.

Scientists, like everyone else, want the subject they are working on to succeed. The elaborate precautions of a controlled, randomized, blinded experiment guard against seeing benefits that don’t exist, while maximizing the chance of detecting a genuine positive effect.


The summary statistics in the study are very simple:

                  heart attacks
                  (fatal plus non-fatal)    subjects
aspirin group:          104                 11,037
placebo group:          189                 11,034

What strikes the eye here is the lower rate of heart attacks in the aspirin group. The ratio of the two rates is

$$\hat{\theta} = \frac{104/11037}{189/11034} = 0.55.$$

It suggests that the aspirin-takers have only 55% as many heart attacks as the placebo-takers.

Of course we are not interested in ˆθ. What we would like to know is θ, the true ratio, that is the ratio we would see if we could treat all subjects, and not just a sample of them.

The tough question is how do we know that ˆθ might not come out much less favorably if the experiment were run again?

This is where statistical inference comes in. Statistical theory allows us to make the following inference: the true value of θ lies in the interval 0.43 < θ < 0.70 with 95% confidence.

Note that

θ = ˆθ + (θ − ˆθ) = 0.55 + [θ − ˆθ(ω0)],

where θ and ˆθ(ω₀) (= 0.55) are two numbers. In statistics, we use the random variable θ − ˆθ(ω) to describe θ − ˆθ(ω₀). Since ω cannot be observed exactly, we instead study the fluctuation of θ − ˆθ(ω) among all ω. If, for most ω, θ − ˆθ(ω) is around zero, we can conclude statistically that θ is close to 0.55 (= ˆθ(ω₀)). (Recall the definition of consistency.) If P(ω : |θ − ˆθ(ω)| < 0.1) = 0.95, we claim with 95% confidence that |θ − 0.55| is no more than 0.1.

The aspirin study also tracked strokes. The results are as follows:

                  strokes    subjects
aspirin group:      119      11,037
placebo group:       98      11,034

For strokes, the ratio of the two rates is

$$\hat{\theta} = \frac{119/11037}{98/11034} = 1.21.$$

It now looks like taking aspirin is actually harmful. However, the interval for the true stroke ratio θ turns out to be 0.93 < θ < 1.59 with 95% confidence. This includes the neutral value θ = 1, at which aspirin would be no better or worse than placebo. In the language of statistical hypothesis testing, aspirin was found to be significantly beneficial for preventing heart attacks, but not significantly harmful for causing strokes.

In the above discussion, we use the sampling distribution of ˆθ(ω) to develop intervals in which the true value of θ lies with a high confidence level. The task of the data analyst is to find the sampling distribution of the chosen estimator ˆθ. In practice, this quite often amounts to finding the right statistical table to look up.

Quite often, these tables are constructed based on the model-based sampling theory approach to statistical inference. This approach starts with the assumption that the data arise as a sample from some conceptual probability distribution, f. When f is completely specified, we derive the distribution of ˆθ. Recall that ˆθ is a function of the observed data. In deriving its distribution, those data are viewed as random variables (why?). Uncertainties of our inferences can then be measured. Traditional parametric inference utilizes a priori assumptions about the shape of f. For the above example, we rely on the binomial distribution, the large-sample approximation of the binomial distribution, and the estimate of θ.

However, we sometimes need to figure out f intelligently. Consider a sample of weights of 27 rats (n = 27); the data are

57, 60, 52, 49, 56, 46, 51, 63, 49, 57, 59, 54, 56, 59, 57, 52, 52, 61, 59, 53, 59, 51, 51, 56, 58, 46, 53.

The sample mean of these data = 54.6667, standard deviation = 4.5064 with cv = 0.0824.

For illustration, what if we wanted an estimate of the standard error of the cv? Clearly, this would be a nonstandard problem. First, we may need to start with a parametric assumption on f. (How will you do it?) Alternatively, we may construct a nonparametric estimator of f (in essence) from the sample data. Then we can invoke either the Monte Carlo method or a large-sample method to approximate the standard error.

Here, we will provide an alternative to the above approach. Consider the following nonparametric bootstrap method, which relies on the empirical distribution function. As a demonstration, we apply the bootstrap method to the stroke example.

1. Create two populations: the first consisting of 119 ones and 11037 − 119 = 10918 zeros, and the second consisting of 98 ones and 11034 − 98 = 10936 zeros.

2. (Monte Carlo Resampling) Draw with replacement a sample of 11037 items from the first population, and a sample of 11034 items from the second population.

Each of these is called a bootstrap sample.

3. Derive the bootstrap replicate of ˆθ:

$$\hat{\theta}^{*} = \frac{\text{prop. of ones in bootstrap sample \#1}}{\text{prop. of ones in bootstrap sample \#2}}.$$

4. Repeat this process (1-3) a large number of times, say 1000 times, and obtain 1000 bootstrap replicates ˆθ^∗.

As an illustration, the standard deviation turned out to be 0.17 in a batch of 1000 replicates that we generated. Also a rough 95% confidence interval is (0.93, 1.60) which is derived by taking the 25th and 975th largest of the 1000 replicates.
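The four steps above can be sketched in R as follows. This is only an illustrative sketch, not the code used to produce the numbers quoted above: the seed is arbitrary, and the resampling is done with `rbinom`, using the fact that the number of ones in a with-replacement sample of size n from a 0/1 population is a single binomial draw. The variable names are our own.

```r
# Nonparametric bootstrap of the stroke rate ratio (illustrative sketch).
set.seed(1)                 # arbitrary seed, only for reproducibility
n1 <- 11037; y1 <- 119      # aspirin group: subjects, strokes
n2 <- 11034; y2 <- 98       # placebo group: subjects, strokes
B  <- 1000                  # number of bootstrap replicates

# Drawing n items with replacement from a population of ones and zeros and
# counting the ones is equivalent to one Binomial(n, proportion of ones) draw.
ones1 <- rbinom(B, n1, y1 / n1)
ones2 <- rbinom(B, n2, y2 / n2)
theta_star <- (ones1 / n1) / (ones2 / n2)   # bootstrap replicates of theta-hat

theta_hat <- (y1 / n1) / (y2 / n2)          # observed ratio
se_boot   <- sd(theta_star)                 # bootstrap standard error
ci_boot   <- sort(theta_star)[c(25, 975)]   # 25th and 975th ordered replicates
```

Because the replicates are random, `se_boot` and `ci_boot` will vary slightly from run to run and need not reproduce 0.17 and (0.93, 1.60) exactly.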


Remark:

1. Initiated by Efron in 1979, the basic bootstrap approach uses Monte Carlo sampling to generate an empirical estimate of the ˆθ’s sampling distribution.

2. Monte Carlo sampling builds an estimate of the sampling distribution by randomly drawing a large number of samples of size n from a population, and calculating for each one the associated value of the statistic ˆθ. The relative frequency distribution of these ˆθ values is an estimate of the sampling distribution for that statistic. The larger the number of samples of size n, the more accurate the relative frequency distribution of these estimates will be.

3. With the bootstrap method, the basic sample is treated as the population and a Monte Carlo-style procedure is conducted on it. This is done by randomly drawing, with replacement, a large number of resamples of size n from this original sample (also of size n). So, although each resample will have the same number of elements as the original sample, it could include some of the original data points more than once and omit others. Therefore, each of these resamples will randomly depart from the original sample. And because the elements in these resamples vary slightly, the statistic ˆθ calculated from each resample will take on slightly different values.

4. The central assertion of the bootstrap method is that the relative frequency distribution of these resampled values, the $\hat{\theta}_{F_n}$'s, is an estimate of the sampling distribution of ˆθ.

5. How do we determine the number of bootstrap replicates?

Assignment 1. Do a small computer experiment to repeat the above process a few times and check whether you get the identical answers every time (with different random seeds).

Assignment 2. Read Ch. 11.4 of Rice’s book. Comment on randomization, placebo effect, observational studies and fishing expedition.

Assignment 3. Do problems 1, 19 and 28 in Section 11.6 of Rice’s book.

Now we come back to the cv example. First, we draw a random subsample of size 27 with replacement. Thus, while a weight of 63 appears in the actual sample, perhaps it would not appear in the subsample; or it could appear more than once. Similarly, there are 3 occurrences of the weight 57 in the actual sample, but perhaps the resample would have, by chance, no values of 57. The point here is that a random sample of size 27 is taken from the original 27 data values. This is the first bootstrap resample with replacement (b = 1). From this resample, one computes ˆµ, ˆse(ˆµ) and the cv and stores these in memory. Second, the whole process is repeated B times (where we will let B = 1,000 reps for this example). Thus, we generate 1000 resampled data sets (b = 1, 2, 3, . . . , 1000) and from each of these we compute ˆµ, ˆse(ˆµ) and the cv and store these values. Third, we obtain the standard error of the cv by taking the standard deviation of the 1000 cv values (corresponding to the 1000 bootstrap samples). The process is simple. In this case, the standard error is 0.00917.
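The cv calculation just described can be sketched in a few lines of R. This is a minimal sketch under our own naming (`cv`, `cv_star`, `se_cv`) and with an arbitrary seed, so the resulting standard error will be close to, but generally not identical with, the value reported above.

```r
# Bootstrap standard error of the coefficient of variation (illustrative sketch).
set.seed(1)   # arbitrary seed, only for reproducibility
weights <- c(57, 60, 52, 49, 56, 46, 51, 63, 49, 57, 59, 54, 56, 59,
             57, 52, 52, 61, 59, 53, 59, 51, 51, 56, 58, 46, 53)

cv <- function(x) sd(x) / mean(x)   # sd() uses the n - 1 divisor

B <- 1000
# Each replicate: resample 27 weights with replacement and recompute the cv.
cv_star <- replicate(B, cv(sample(weights, replace = TRUE)))

cv_hat <- cv(weights)   # cv of the original sample
se_cv  <- sd(cv_star)   # bootstrap standard error of the cv
```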

1.3 Odds Ratio

If an event has probability P(A) of occurring, the odds of A occurring are defined to be

$$\mathrm{odds}(A) = \frac{P(A)}{1 - P(A)}.$$

Now suppose that X denotes the event that an individual is exposed to a potentially harmful agent and that D denotes the event that the individual becomes diseased. We denote the complementary events as ¯X and ¯D. The odds of an individual contracting the disease given that he is exposed are

$$\mathrm{odds}(D|X) = \frac{P(D|X)}{1 - P(D|X)}$$

and the odds of contracting the disease given that he is not exposed are

$$\mathrm{odds}(D|\bar{X}) = \frac{P(D|\bar{X})}{1 - P(D|\bar{X})}.$$

The odds ratio

$$\Delta = \frac{\mathrm{odds}(D|X)}{\mathrm{odds}(D|\bar{X})}$$

is a measure of the influence of exposure on subsequent disease.

We will consider how the odds and odds ratio could be estimated by sampling from a population with joint and marginal probabilities defined as in the following table:

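         ¯D      D
¯X      π00     π01     π0.
X       π10     π11     π1.
        π.0     π.1     1

With this notation,

$$P(D|X) = \frac{\pi_{11}}{\pi_{10} + \pi_{11}}, \qquad P(D|\bar{X}) = \frac{\pi_{01}}{\pi_{00} + \pi_{01}},$$

so that

$$\mathrm{odds}(D|X) = \frac{\pi_{11}}{\pi_{10}}, \qquad \mathrm{odds}(D|\bar{X}) = \frac{\pi_{01}}{\pi_{00}},$$

and the odds ratio is

$$\Delta = \frac{\pi_{11}\pi_{00}}{\pi_{01}\pi_{10}},$$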

the product of the diagonal probabilities in the preceding table divided by the product of the off-diagonal probabilities.

Now we will consider three possible ways to sample this population to study the relationship of disease and exposure.

• Random sample: From such a sample, we could estimate all the probabilities directly. However, if the disease is rare, the total sample size would have to be quite large to guarantee that a substantial number of diseased individuals was included.

• Prospective study: A fixed number of exposed and nonexposed individuals are sampled and then followed through time. The incidences of disease in those two groups are compared. In this case the data allow us to estimate and compare P (D|X) and P (D| ¯X) and, hence, the odds ratio. The aspirin study described in the previous section can be viewed as this type of study.

• Retrospective study: A fixed number of diseased and undiseased individuals are sampled and the incidences of exposure in the two groups are compared. From such data we can directly estimate P (X|D) and P (X| ¯D). Since the marginal counts of diseased and nondiseased are fixed, we cannot estimate the joint probabilities or the important conditional probabilities P (D|X) and P (D| ¯X). Observe that

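$$P(X|D) = \frac{\pi_{11}}{\pi_{01} + \pi_{11}}, \qquad 1 - P(X|D) = \frac{\pi_{01}}{\pi_{01} + \pi_{11}},$$

$$\mathrm{odds}(X|D) = \frac{\pi_{11}}{\pi_{01}}, \qquad \mathrm{odds}(X|\bar{D}) = \frac{\pi_{10}}{\pi_{00}}.$$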

We thus see that the odds ratio can also be expressed as odds(X|D)/odds(X| ¯D).

Now we describe the study of Vianna, Greenwald, and Davies (1971) to illustrate the retrospective study. In this study they collected data comparing the percentages of tonsillectomies for a group of patients suffering from Hodgkin’s disease and a comparable control group:

              Tonsillectomy    No Tonsillectomy
Hodgkin’s          67                 34
Control            43                 64

Recall that the odds ratio can be expressed as odds(X|D)/odds(X|¯D), and an estimate of it is n00n11/(n01n10), the product of the diagonal counts divided by the product of the off-diagonal counts. The data of Vianna, Greenwald, and Davies give the odds ratio estimate

$$\frac{67 \times 64}{43 \times 34} = 2.93.$$

According to this study, the odds of contracting Hodgkin’s disease are increased by about a factor of three by undergoing a tonsillectomy.

As well as having a point estimate 2.93, it would be useful to attach an approximate standard error to the estimate to indicate its uncertainty. We will use simulation (parametric bootstrap) to approximate the distribution of ˆ∆. To do so, we need to generate random numbers according to a statistical model for the counts in the table of Vianna, Greenwald, and Davies. The model is that the count in the first row and first column, N₁₁, is binomially distributed with n = 101 and probability π₁₁. The count in the second row and second column, N₂₂, is binomially distributed with n = 107 and probability π₂₂. The distribution of the random variable

$$\hat{\Delta} = \frac{N_{11} N_{22}}{(101 - N_{11})(107 - N_{22})}$$

is thus determined by the two binomial distributions, and we could approximate it arbitrarily well by drawing a large number of samples from them. Since the probabilities π₁₁ and π₂₂ are unknown, they are estimated from the observed counts by ˆπ₁₁ = 67/101 = 0.663 and ˆπ₂₂ = 64/107 = 0.598. One thousand realizations generated on a computer give a standard deviation of 0.89.
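The parametric bootstrap just described can be sketched in R as below. The seed and variable names are our own choices, so the simulated standard deviation should be near 0.89 but will not match it exactly.

```r
# Parametric bootstrap for the odds ratio estimate (illustrative sketch).
set.seed(1)               # arbitrary seed, only for reproducibility
n1 <- 101; n2 <- 107      # row totals: Hodgkin's patients, controls
p11_hat <- 67 / n1        # estimate of pi_11
p22_hat <- 64 / n2        # estimate of pi_22
B <- 1000

# Simulate the two binomial counts under the fitted model.
N11 <- rbinom(B, n1, p11_hat)
N22 <- rbinom(B, n2, p22_hat)
delta_star <- (N11 * N22) / ((n1 - N11) * (n2 - N22))

delta_hat <- (67 * 64) / (43 * 34)   # observed odds ratio estimate
se_delta  <- sd(delta_star)          # simulated standard error
```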

2 Bootstrap Method

The bootstrap method introduced in Efron (1979) is a very general resampling procedure for estimating the distributions of statistics based on independent observations. It has been shown to be successful in many situations and is now accepted as an alternative to asymptotic methods. In fact, in some settings it is better than asymptotic methods such as the traditional normal approximation and the Edgeworth expansion. However, there are counterexamples in which the bootstrap produces wrong answers, i.e., it provides inconsistent estimators.

Consider the problem of estimating variability of location estimates by the Bootstrap method. If we view the observations x1, x2, . . . , xn as realizations of independent random variables with common distribution function F , it is appropriate to investigate the variability and sampling distribution of a location estimate calculated from a sample of size n. Suppose we denote the location estimate as ˆθ. Note that ˆθ is a function of the random variables X1, X2, . . . , Xn and hence has a probability distribution, its sampling distribution, which is determined by n and F . We would like to know this sampling distribution, but we are faced with two problems:

1. we don’t know F , and

2. even if we knew F , ˆθ may be such a complicated function of X1, X2, . . . , Xn that finding its distribution would exceed our analytic abilities.

First we address the second problem. Suppose we knew F. How could we find the probability distribution of ˆθ without going through incredibly complicated analytic calculations? The computer comes to our rescue: we can do it by simulation. We generate many, many samples, say B in number, of size n from F; from each sample we calculate the value of ˆθ. The empirical distribution of the resulting values $\hat{\theta}_1^*, \hat{\theta}_2^*, \ldots, \hat{\theta}_B^*$ is an approximation to the distribution function of ˆθ, which is good if B is very large. If we wish to know the standard deviation of ˆθ, we can find a good approximation to it by calculating the standard deviation of the collection of values $\hat{\theta}_1^*, \hat{\theta}_2^*, \ldots, \hat{\theta}_B^*$. We can make these approximations arbitrarily accurate by taking B to be arbitrarily large.

Assignment 4. Explain or prove that the simulation we just described will give a good approximation of the distribution function of ˆθ.

All this would be well and good if we knew F, but we don’t. So what do we do? We will consider two different cases. In the first case, F is known up to an unknown parameter η, i.e. F(x|η). Without knowing η, the above approximation cannot be used. The idea of the parametric bootstrap is to simulate data from F(x|ˆη), where ˆη should be a good estimate of η. In this way it utilizes the structure of F.

In the second case, F is completely unknown. The idea of the nonparametric bootstrap is to simulate data from the empirical cdf F_n. Here F_n is a discrete probability distribution that gives probability 1/n to each observed value x1, · · · , xn. A sample of size n from Fn is thus a sample of size n drawn with replacement from the collection x1, · · · , xn. The standard deviation of ˆθ is then estimated by

$$s_{\hat{\theta}} = \sqrt{\frac{1}{B} \sum_{i=1}^{B} (\theta_i^* - \bar{\theta}^*)^2}$$

where $\theta_1^*, \ldots, \theta_B^*$ are produced from B samples of size n drawn with replacement from the collection x₁, · · · , x_n.

Now we use a simple example to illustrate this idea. Suppose n = 2 and we observe X₍₁₎ = c < X₍₂₎ = d. Then $X_1^*, X_2^*$ are independently distributed with

$$P(X_i^* = c) = P(X_i^* = d) = 1/2, \quad i = 1, 2.$$

The pair $(X_1^*, X_2^*)$ therefore takes on the four possible pairs of values (c, c), (c, d), (d, c), (d, d), each with probability 1/4. Thus $\theta^* = (X_1^* + X_2^*)/2$ takes on the values c, (c + d)/2, d with probabilities 1/4, 1/2, 1/4, respectively, so that $\theta^* - (c + d)/2$ takes on the values (c − d)/2, 0, (d − c)/2 with probabilities 1/4, 1/2, 1/4, respectively.
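For very small n, the exact bootstrap distribution in this example can be checked by brute-force enumeration. The sketch below assumes the illustrative values c = 1 and d = 3 and uses a hypothetical helper `exact_boot_mean` that lists all nⁿ equally likely resamples; it is only practical for tiny n.

```r
# Exact bootstrap distribution of the sample mean by full enumeration
# (illustrative sketch; the number of resamples grows as n^n).
exact_boot_mean <- function(x) {
  n <- length(x)
  # All n^n ordered index vectors, one per row.
  idx <- expand.grid(rep(list(seq_len(n)), n))
  means <- apply(idx, 1, function(i) mean(x[i]))
  tab <- table(means) / n^n            # each resample has probability n^(-n)
  data.frame(value = as.numeric(names(tab)), prob = as.numeric(tab))
}

boot_dist <- exact_boot_mean(c(1, 3))  # c = 1, d = 3 are illustrative values
```

With these values `boot_dist` lists the three possible means c, (c + d)/2, d together with their probabilities, matching the calculation above.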

For the above example, we can easily calculate the bootstrap distribution. It is easy to imagine, however, that this computation becomes too complicated to carry out directly if n is large. Therefore, simple random sampling was proposed to generate the bootstrap distribution. In the bootstrap literature, a variety of alternatives to simple random sampling have been suggested.

We now restate the above (generic) nonparametric bootstrap procedure as a sequence of steps. Refer to Efron and Tibshirani (1993) for detailed discussions. Consider the case where a random sample of size n is drawn from an unspecified probability distribution, F. The basic steps in the bootstrap procedure are

Step 1. Construct an empirical probability distribution, Fn, from the sample by placing a probability of 1/n at each point, x1, x2, · · · , xn of the sample. This is the empirical distribution function of the sample, which is the nonparametric maximum likelihood estimate of the population distribution, F .

Step 2. From the empirical distribution function, Fn, draw a random sample of size n with replacement. This is a resample.

Step 3. Calculate the statistic of interest, T_n, for this resample, yielding T_n^∗.

Step 4. Repeat steps 2 and 3 B times, where B is a large number, in order to create B resamples. The practical size of B depends on the tests to be run on the data. Typically, B is at least 1000 when an estimate of a confidence interval around T_n is required.

Step 5. Construct the relative frequency histogram from the B values of $T_n^*$ by placing a probability of 1/B at each point $T_{n,1}^*, T_{n,2}^*, \ldots, T_{n,B}^*$. The distribution obtained is the bootstrapped estimate of the sampling distribution of T_n. This distribution can now be used to make inferences about the parameter θ, which is to be estimated by T_n.
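Steps 1 to 5 can be collected into a small generic R function. The following is a sketch only: the function name `bootstrap`, the exponential test data, and the choice of the median as the statistic are our own illustrative assumptions, and the percentile interval shown is just one way to use the resulting replicates.

```r
# Generic nonparametric bootstrap (illustrative sketch of Steps 1-5).
bootstrap <- function(x, statistic, B = 1000) {
  n <- length(x)
  # Steps 1-2: sampling with replacement from x is sampling from F_n.
  # Step 3: evaluate the statistic on each resample.
  # Step 4: repeat B times.
  replicate(B, statistic(sample(x, size = n, replace = TRUE)))
}

set.seed(1)                      # arbitrary seed, only for reproducibility
x <- rexp(50)                    # illustrative data, not from the text
t_star <- bootstrap(x, median, B = 1000)

# Step 5: the B replicates, each with mass 1/B, estimate the sampling
# distribution of the statistic.
se_hat <- sd(t_star)
ci_hat <- quantile(t_star, c(0.025, 0.975))
```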

We now introduce notation to describe the bootstrap method formally. Assume the data X₁, · · · , X_n are independent and identically distributed (iid) samples from a k-dimensional population distribution F, and consider the problem of estimating the distribution

$$H_n(x) = P\{R_n \le x\},$$

where $R_n = R_n(T_n, F)$ is a real-valued functional of F and $T_n = T_n(X_1, \ldots, X_n)$ is a statistic of interest. Let $X_1^*, \ldots, X_n^*$ be a “bootstrap” sample drawn iid from $F_n$, the empirical distribution based on $X_1, \ldots, X_n$, and let $T_n^* = T_n(X_1^*, \ldots, X_n^*)$ and $R_n^* = R_n(T_n^*, F_n)$. $F_n$ is constructed by placing a mass 1/n at each observation $X_i$. Thus $F_n$ may be represented as

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x), \quad -\infty < x < \infty.$$

A bootstrap estimator of H_n is

Hˆ_n(x) = P_∗{R^∗_n ≤ x},

where, for given X1, · · · , Xn, P_∗ is the conditional probability with respect to the random generator of bootstrap samples. Since the bootstrap samples are generated from Fn, this method is called the nonparametric bootstrap. Note that ˆH_n(x) depends on F_n and hence is itself a random variable. To be specific, ˆH_n(x) will change as the data {x₁, · · · , x_n} change.

[Figure: histogram titled “Bootstrap Distribution”; vertical axis “Density” (0.0 to 0.4), horizontal axis from −2 to 4.]

Recall that a bootstrap analysis is run to assess the accuracy of some primary statistical results. This produces bootstrap statistics, like standard errors or confidence intervals, which are assessments of error for the primary results.

As illustration, we consider the following three examples.

Example 1. Suppose that $X_1, \ldots, X_n \sim N(\mu, 1)$ and $R_n = \sqrt{n}(\bar{X}_n - \mu)$. Consider the estimation of

$$P(a) = P\{R_n > a \mid N(\mu, 1)\}.$$

The nonparametric bootstrap method will estimate P(a) by

$$P_{NB}(a) = P\{\sqrt{n}(\bar{X}_n^* - \bar{X}_n) > a \mid F_n\}.$$

To be specific, we observe data $x_1, \ldots, x_n$ with mean $\bar{x}_n$. Let $Y_1, \ldots, Y_n$ denote a bootstrap sample of n observations drawn independently from $F_n$ and let $\bar{Y}_n = n^{-1} \sum_{i=1}^{n} Y_i$. Then P(a) is estimated by

$$P_{NB}(a) = P\{\sqrt{n}(\bar{Y}_n - \bar{x}_n) > a \mid F_n\}.$$

In principle, $P_{NB}(a)$ can be found by considering all $n^n$ possible bootstrap samples. If all $X_i$’s are distinct, then the number of different possible resamples equals the number of distinct ways of placing n indistinguishable objects into n numbered boxes, the boxes being allowed to contain any number of objects. It is known that this number equals $C(2n - 1, n) \approx (n\pi)^{-1/2} 2^{2n-1}$. When n = 10 (20, respectively), C(2n − 1, n) = 92378 (≈ 6.9 × 10¹⁰, respectively). For small values of n, it is often feasible to calculate a bootstrap estimate exactly. However, for larger samples, say n ≥ 10, this becomes infeasible even with today’s computer technology.
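The counts quoted above are easy to reproduce. The short R sketch below (variable names are ours) compares the exact number of distinct resamples, its Stirling-type approximation, and the number nⁿ of ordered resamples for n = 10 and n = 20.

```r
# Number of distinct bootstrap resamples when all observations are distinct.
n <- c(10, 20)
distinct <- choose(2 * n - 1, n)             # exact count C(2n - 1, n)
stirling <- (n * pi)^(-1/2) * 2^(2 * n - 1)  # approximation quoted in the text
ordered  <- n^n                              # ordered resamples, for comparison

data.frame(n, distinct, stirling, ordered)
```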

Natural questions to ask are as follows:

• What are computationally efficient ways to bootstrap?

• Can we get bootstrap-like answers without Monte Carlo?

Moreover, we need to address the question of “evaluating” the performance of the bootstrap method. For the above particular problem, we need to estimate $P_{NB}(a) - P(a)$ or $\sup_a |P_{NB}(a) - P(a)|$. As a remark, $P_{NB}(a)$ is a random variable since $F_n$ is random. Efron (1992) proposed using the jackknife to give error estimates for bootstrap quantities.

Suppose that additional information on F is available. Then it is reasonable to utilize this information in the bootstrap method. For example, suppose F is known to be normally distributed with unknown mean µ and variance 1. It is natural to use $\bar{x}_n$ to estimate µ and then estimate $P(a) = P\{R_n > a \mid N(\mu, 1)\}$ by

$$P_{PB}(a) = P\{\sqrt{n}(\bar{Y}_n - \bar{x}_n) > a \mid N(\bar{x}_n, 1)\}.$$

Since the bootstrap samples are generated from $N(\bar{x}_n, 1)$, which utilizes the information from a parametric form of F, this method is called the parametric bootstrap. In this case, it can be shown that $P_{PB}(a) = P(a)$ for all realizations of $\bar{X}_n$. However, if F is known to be normally distributed with unknown mean µ and unknown variance σ², $P_{PB}(a)$ is no longer equal to P(a).
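A small simulation can illustrate the difference between the two estimates in the known-variance case. The sketch below makes several assumptions of its own: µ = 3, n = 200, a = 1 and the seed are arbitrary, and both bootstrap probabilities are approximated by Monte Carlo rather than computed exactly.

```r
# Nonparametric vs parametric bootstrap estimates of P(a) (illustrative sketch).
set.seed(1)                        # arbitrary seed, only for reproducibility
n <- 200; a <- 1; B <- 4000
x <- rnorm(n, mean = 3, sd = 1)    # mu = 3 is an arbitrary illustrative value
xbar <- mean(x)

# Nonparametric bootstrap: resample from the empirical distribution F_n.
r_nb <- replicate(B, sqrt(n) * (mean(sample(x, replace = TRUE)) - xbar))
p_nb <- mean(r_nb > a)

# Parametric bootstrap: resample from N(xbar, 1).
r_pb <- replicate(B, sqrt(n) * (mean(rnorm(n, xbar, 1)) - xbar))
p_pb <- mean(r_pb > a)

# Under N(mu, 1), R_n is exactly N(0, 1), so P(a) = 1 - Phi(a).
p_true <- pnorm(a, lower.tail = FALSE)
```

Here `p_pb` differs from `p_true` only through Monte Carlo error, whereas `p_nb` also reflects the sampling variability of F_n.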

Assignment 5. (a) Show that $P_{PB}(a) = \Phi(a/s_n)$ where $s_n^2 = (n - 1)^{-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2$.

(b) Prove that $P_{PB}(a)$ is a consistent estimate of P(a) for fixed a.

(c) Prove that $\sup_a |P_{PB}(a) - P(a)| \stackrel{P}{\to} 0$.

For the question of finding $P_{NB}(a)$, we can in principle write down the characteristic function and then apply the inversion formula. However, this is a nontrivial job. Therefore, Efron (1979) suggested approximating $P_{NB}(a)$ by Monte Carlo resampling. (That is, resamples of the same size as the sample may be drawn repeatedly from the original sample, the value of a statistic computed for each individual resample, and the bootstrap statistic approximated by taking an average of an appropriate function of these numbers.)

Example 1. (cont.) Let us consider a sample containing two hundred values generated randomly from a standard normal population N(0, 1). This is the original sample. In this example, the sampling distribution of the arithmetic mean is approximately normal with a mean roughly equal to 0 and a standard deviation approximately equal to 1/√200. Now, let us apply the nonparametric bootstrap method to infer the result. One thousand five hundred resamples are drawn from the original sample, and the arithmetic mean is calculated for each resample. These calculations are performed by using R functions as follows.

Step 1. Randomly draw two hundred points from a standard normal population.

    gauss <- rnorm(200, 0, 1)

Step 2. Perform the nonparametric bootstrap study (1500 resamples).

    bootmean <- numeric(1500)   # storage for the resample means
    for (i in 1:1500) bootmean[i] <- mean(sample(gauss, replace = TRUE))

Step 3. Do the normalization and comparison with N(0, 1).

    # center at the original sample mean and scale by sqrt(n)
    bootdistribution <- sqrt(200) * (bootmean - mean(gauss))
    hist(bootdistribution, freq = FALSE, main = "Bootstrap Distribution", xlab = "")
    # overlay the N(0, 1) density in red
    x <- seq(-4, 4, 0.001); y <- (1 / sqrt(2 * pi)) * exp(-x^2 / 2)
    points(x, y, col = 2)

Now we state Levy’s Inversion Formula which is taken from Chapter 6.2 of Chung (1974).

Theorem 1 If x₁ < x₂ and x₁ and x₂ are points of continuity of F, then we have

$$F(x_2) - F(x_1) = \lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-itx_1} - e^{-itx_2}}{it} f(t)\, dt,$$

where f(t) is the characteristic function.

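Example 2. (Estimating the probability of success) Consider a probability distribution F putting all of its mass at zero or one. Let θ(F) = P(X = 1) = p. Consider $R(X, F) = \bar{X} - \theta(F) = \hat{p} - p$. Having observed X = x, the bootstrap sample is

$$X_1^*, \ldots, X_n^* \sim \mathrm{Bin}(1, \theta(F_n)) = \mathrm{Bin}(1, \bar{x}_n).$$

Note that

$$R(X^*, F_n) = \bar{X}_n^* - \bar{x}_n, \qquad E_*(\bar{X}_n^* - \bar{x}_n) = 0, \qquad \mathrm{Var}_*(\bar{X}_n^* - \bar{x}_n) = \frac{\bar{x}_n(1 - \bar{x}_n)}{n}.$$

Recall that $n\bar{X}_n^* \sim \mathrm{Bin}(n, \bar{x}_n)$ and $n\bar{X}_n \sim \mathrm{Bin}(n, p)$. It is known that if $\min\{n\bar{x}_n, n(1 - \bar{x}_n)\} \ge 5$, then approximately

$$\frac{n\bar{X}_n^* - n\bar{x}_n}{\sqrt{n\bar{x}_n(1 - \bar{x}_n)}} = \frac{\sqrt{n}(\bar{X}_n^* - \bar{x}_n)}{\sqrt{\bar{x}_n(1 - \bar{x}_n)}} \sim N(0, 1);$$

and if $\min\{np, n(1 - p)\} \ge 5$, then approximately

$$\frac{n\bar{X}_n - np}{\sqrt{np(1 - p)}} = \frac{\sqrt{n}(\bar{X}_n - p)}{\sqrt{p(1 - p)}} \sim N(0, 1).$$

Based on the above approximation results, we conclude that the bootstrap method works if $\min\{n\bar{x}_n, n(1 - \bar{x}_n)\} \ge 5$. The question that remains to be studied is whether

$$P\{\min(n\bar{X}_n, n(1 - \bar{X}_n)) \ge 5\} \to 1.$$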
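The moment formulas of Example 2 can be checked numerically. The sketch below assumes illustrative values n = 100 and p = 0.3 and an arbitrary seed; it draws the bootstrap means directly from the binomial distribution, as justified by the fact that n times the bootstrap mean is Bin(n, x̄ₙ).

```r
# Bootstrap for a success probability (illustrative sketch).
set.seed(1)                 # arbitrary seed, only for reproducibility
n <- 100; p <- 0.3          # illustrative values, not from the text
x <- rbinom(n, 1, p)        # observed 0/1 sample
xbar <- mean(x)

B <- 5000
xbar_star <- rbinom(B, n, xbar) / n     # bootstrap sample means
z_star <- sqrt(n) * (xbar_star - xbar) / sqrt(xbar * (1 - xbar))

var_theory <- xbar * (1 - xbar) / n     # Var_* from the formula above
var_boot   <- var(xbar_star)            # Monte Carlo approximation
cond_ok    <- min(n * xbar, n * (1 - xbar)) >= 5   # rule-of-thumb condition
```

When `cond_ok` is `TRUE`, a histogram of `z_star` should look roughly standard normal.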

Example 3. (Estimating the median) Suppose we are interested in finding the distribution of n^1/2{F_n⁻¹(1/2) − F⁻¹(1/2)} where F_n⁻¹(1/2) and F⁻¹(1/2) are the sample and population median respectively. Set θ(F ) = F⁻¹(1/2). The normal approximation for this distribution will be discussed in Chapter 2. In this section, we consider the bootstrap approximation of the above distribution.

Consider n = 2m − 1. Then the sample median F_n⁻¹(1/2) = X_(m) where X₍₁₎ ≤ X₍₂₎ ≤ · · · ≤ X_(n). Let N_i^∗ denote the number of times xi is selected in the bootstrap sampling procedure.

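Set $N^* = (N_1^*, \ldots, N_n^*)$. It follows easily that $N^*$ follows a multinomial distribution with n trials and selection probabilities $(n^{-1}, \ldots, n^{-1})$. Denote the order statistics of $x_1, \ldots, x_n$ by $x_{(1)} \le \cdots \le x_{(n)}$. Set $N_{[i]}^*$ to be the number of times $x_{(i)}$ is chosen. Then for $1 \le \ell < n$, we have

$$\begin{aligned}
\mathrm{Prob}_*(X_{(m)}^* > x_{(\ell)}) &= \mathrm{Prob}_*\{N_{[1]}^* + \cdots + N_{[\ell]}^* \le m - 1\} \\
&= \mathrm{Prob}\left\{\mathrm{Bin}\left(n, \frac{\ell}{n}\right) \le m - 1\right\} \\
&= \sum_{j=0}^{m-1} C(n, j) \left(\frac{\ell}{n}\right)^{j} \left(1 - \frac{\ell}{n}\right)^{n-j}.
\end{aligned}$$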

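Or,

$$\mathrm{Prob}_*(T^* = x_{(\ell)} - x_{(m)}) = \mathrm{Prob}\left\{\mathrm{Bin}\left(n, \frac{\ell - 1}{n}\right) \le m - 1\right\} - \mathrm{Prob}\left\{\mathrm{Bin}\left(n, \frac{\ell}{n}\right) \le m - 1\right\}.$$

When n = 13, we have

ℓ              2 or 12    3 or 11    4 or 10    5 or 9    6 or 8    7
probability    0.0015     0.0142     0.0550     0.1242    0.1936    0.2230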
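The table for n = 13 follows directly from the binomial formula above and can be recomputed with `pbinom`. The sketch below uses our own variable names; the probabilities for ℓ = 1 and ℓ = 13, which the table omits, are included and are negligible.

```r
# Exact bootstrap distribution of the sample median for n = 13 (sketch).
n <- 13; m <- 7
l <- 1:n
# P*(T* = x_(l) - x_(m)) = P{Bin(n, (l-1)/n) <= m-1} - P{Bin(n, l/n) <= m-1}
prob <- pbinom(m - 1, n, (l - 1) / n) - pbinom(m - 1, n, l / n)
data.frame(l, prob = round(prob, 4))
```

Given the ordered data, the bootstrap estimate of the mean squared error of the median would then be `sum((sort(x) - sort(x)[m])^2 * prob)` for a data vector `x` of length 13.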

Quite often we use the mean squared error to measure the performance of an estimator, t(X), of θ(F); that is, $E_F T^2 = E_F(t(X) - \theta(F))^2$, where T = t(X) − θ(F). We can then use the bootstrap to estimate $E_F T^2$. For the sample median with n = 13, the bootstrap estimate of $E_F T^2$ is

$$E_*(T^*)^2 = \sum_{\ell=1}^{13} [x_{(\ell)} - x_{(7)}]^2 \, \mathrm{Prob}_*\{T^* = x_{(\ell)} - x_{(7)}\}.$$

It is known that $E_F T^2$ is approximately $[4n f^2(\theta)]^{-1}$ for large n when F has a bounded continuous density f. A natural question to ask is whether $E_*(T^*)^2$ is close to $E_F T^2$.

3 Validity of the Bootstrap Method

We now give a brief discussion of the validity of the bootstrap method. First, we state several central limit theorems and a bound on their approximation error, which will be used in proving that the bootstrap can provide a good approximation of the distribution of n^{1/2}(ˆp − p).

3.1 Central Limit Theorem

Perhaps the most widely known version of the CLT is

Theorem 2 (Lindeberg-Levy) Let {Xi} be iid with mean µ and finite variance σ². Then

$$\sqrt{n}\left(\frac{1}{n} \sum_{i=1}^{n} X_i - \mu\right) \stackrel{d}{\to} N(0, \sigma^2).$$

The above theorem can be generalized to independent random variables which are not necessarily identically distributed.

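Theorem 3 (Lindeberg-Feller) Let {X_i} be independent with means {µ_i}, finite variances {σ_i²}, and distribution functions {F_i}. Suppose that $B_n^2 = \sum_{i=1}^{n} \sigma_i^2$ satisfies

$$\frac{\sigma_n^2}{B_n^2} \to 0, \quad B_n \to \infty \quad \text{as } n \to \infty.$$

Then $n^{-1} \sum_{i=1}^{n} X_i$ is $AN(n^{-1} \sum_{i=1}^{n} \mu_i, \; n^{-2} B_n^2)$ if and only if the following Lindeberg condition is satisfied:

$$B_n^{-2} \sum_{i=1}^{n} \int_{|t - \mu_i| > \epsilon B_n} (t - \mu_i)^2 \, dF_i(t) \to 0, \quad n \to \infty, \quad \text{each } \epsilon > 0.$$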

In the theorems previously considered, asymptotic normality was asserted for a sequence of sums $\sum_{i=1}^{n} X_i$ generated by a single sequence X₁, X₂, . . . of random variables. More generally, we may consider a double array of random variables

$$\begin{array}{l}
X_{11}, X_{12}, \cdots, X_{1K_1}; \\
X_{21}, X_{22}, \cdots, X_{2K_2}; \\
\qquad \vdots \\
X_{n1}, X_{n2}, \cdots, X_{nK_n}; \\
\qquad \vdots
\end{array}$$

For each n ≥ 1, there are Kn random variables {Xnj, 1 ≤ j ≤ Kn}. It is assumed that Kn→ ∞. The case Kn = n is called a “triangular” array.

Denote by $F_{nj}$ the distribution function of $X_{nj}$. Also, put $\mu_{nj} = E X_{nj}$,

$$A_n = E \sum_{j=1}^{K_n} X_{nj} = \sum_{j=1}^{K_n} \mu_{nj}, \qquad B_n^2 = \mathrm{Var}\left(\sum_{j=1}^{K_n} X_{nj}\right).$$

We then have the following theorem.

Theorem 4 (Lindeberg-Feller) Let {X_nj : 1 ≤ j ≤ K_n; n = 1, 2, . . .} be a double array with independent random variables within rows. Then the “uniform asymptotic negligibility” condition

$$\max_{1 \le j \le K_n} P(|X_{nj} - \mu_{nj}| > \tau B_n) \to 0, \quad n \to \infty, \quad \text{each } \tau > 0,$$

and the asymptotic normality condition that $\sum_{j=1}^{K_n} X_{nj}$ is $AN(A_n, B_n^2)$ together hold if and only if the Lindeberg condition

$$B_n^{-2} \sum_{j=1}^{K_n} \int_{|t - \mu_{nj}| > \epsilon B_n} (t - \mu_{nj})^2 \, dF_{nj}(t) \to 0, \quad n \to \infty, \quad \text{each } \epsilon > 0,$$

is satisfied.

As a note, independence is assumed only within rows; the rows themselves may be arbitrarily dependent.

Corollary 1 Suppose that, for some v > 2,

$$\sum_{j=1}^{K_n} E|X_{nj} - \mu_{nj}|^v = o(B_n^v), \quad n \to \infty.$$

Then $\sum_{j=1}^{K_n} X_{nj}$ is $AN(A_n, B_n^2)$.

3.2 Approximation Error of CLT

It is of both theoretical and practical interest to characterize the error of approximation in the CLT. In this section, we just consider the i.i.d. case. The convergence in the Central Limit Theorem is not uniform in the underlying distribution. For any fixed sample size n, there are distributions for which the normal approximation to the distribution function of $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ is arbitrarily poor. However, there is an upper bound, due to Berry (1941) and Esseen (1942), on the error of the Central Limit Theorem approximation, which shows that the convergence is uniform over the class of distributions for which $E|X - \mu|^3/\sigma^3$ is bounded above by a finite bound. We state this theorem without proof in one dimension.

Theorem 5 If $X_1, \ldots, X_n$ are i.i.d. with distribution $F$ and if $S_n = X_1 + \cdots + X_n$, then there exists a constant $c$ (independent of $F$) such that

$$\sup_x \left| P\left[ \frac{S_n - E S_n}{\sqrt{\mathrm{Var}(S_n)}} \le x \right] - \Phi(x) \right| \le \frac{c}{\sqrt{n}} \, \frac{E|X_1 - E X_1|^3}{[\mathrm{Var}(X_1)]^{3/2}}$$

for all $F$ with finite third moment.

Note that $c$ in the above theorem is a universal constant. Various authors have sought the best constant $c$. Originally, $c$ was set to $33/4$, but it has since been sharpened: the best constant is greater than 0.4097 and less than 0.7975. When $x$ is sufficiently large while $n$ remains fixed, the quantity $P[(S_n - E S_n)/\sqrt{\mathrm{Var}(S_n)} \le x]$ becomes so close to 1 that the bound above is too crude. The problem in this case may be characterized as one of approximating large deviation probabilities, with the object of attention becoming the relative error in approximating

$$1 - P\left[ (S_n - E S_n)/\sqrt{\mathrm{Var}(S_n)} \le x \right]$$

by $1 - \Phi(x)$ when $x \to \infty$.
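As a numerical illustration of Theorem 5, the following Python sketch treats Bernoulli(p) summands, for which S_n is Bin(n, p) and the left-hand side of the bound can be computed exactly. The function names are illustrative assumptions, and the constant c = 0.7975 is simply the upper value quoted above, not a claim about the best constant.

```python
import math

def std_normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def kolmogorov_distance_binomial(n, p):
    """sup_x |P[(S_n - np)/sqrt(np(1-p)) <= x] - Phi(x)| for S_n ~ Bin(n, p).

    The cdf of S_n is a step function, so the supremum is attained at a
    jump point, either just before or exactly at the jump.
    """
    sd = math.sqrt(n * p * (1.0 - p))
    cdf = 0.0
    worst = 0.0
    for k in range(n + 1):
        z = (k - n * p) / sd
        target = std_normal_cdf(z)
        worst = max(worst, abs(cdf - target))      # left limit at the jump
        cdf += math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        worst = max(worst, abs(cdf - target))      # value at the jump
    return worst

def berry_esseen_bound(n, p, c=0.7975):
    """Right-hand side of Theorem 5 for Bernoulli(p) summands (c is assumed)."""
    third_abs_moment = p * (1.0 - p) * (p**2 + (1.0 - p) ** 2)  # E|X - p|^3
    sigma_cubed = (p * (1.0 - p)) ** 1.5
    return c * third_abs_moment / (sigma_cubed * math.sqrt(n))

if __name__ == "__main__":
    for n in (10, 50, 200):
        print(n, kolmogorov_distance_binomial(n, 0.1), berry_esseen_bound(n, 0.1))
```

Both columns shrink at the rate 1/√n, with the exact distance staying below the bound.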

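When we have information about the third and higher moments of the underlying distribution, we may often improve on the normal approximation by considering higher-order terms in the expansion of the characteristic function. This leads to asymptotic expansions known as Edgeworth Expansions. We present without proof the next two terms in the Edgeworth Expansion of the distribution function of $\sqrt{n}(\bar{X}_n - \mu)/\sigma$:

$$P\left[ \sqrt{n}(\bar{X}_n - \mu)/\sigma \le x \right] \approx \Phi(x) - \frac{\beta_1 (x^2 - 1)}{6\sqrt{n}} \phi(x) - \left[ \frac{\beta_2 (x^3 - 3x)}{24 n} + \frac{\beta_1^2 (x^5 - 10x^3 + 15x)}{72 n} \right] \phi(x),$$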

where β₁= E(X − µ)³/σ³ and β₂ = E(X − µ)⁴/σ⁴− 3 are the coefficient of skewness and the coefficient of kurtosis, respectively, and where φ(x) represents the density of the standard normal distribution. This approximation is to be understood in the sense that the difference of the two sides when multiplied by n tends to zero as n → ∞. Assuming the fourth moment exists, it is valid under the condition that

$$\limsup_{|t| \to \infty} |E(\exp\{itX\})| < 1.$$

This condition is known as Cramér’s Condition. It holds, in particular, if the underlying distribution has a nonzero absolutely continuous component. The expansion to the term involving $1/\sqrt{n}$ is valid if the third moment exists, provided only that the underlying distribution is nonlattice, and even for lattice distributions it is valid provided a correction for continuity is made. See Feller (Vol. 2, Chap. XVI.4) for details.

Let us inspect this approximation. If we stop at the first term, we have the approximation given by the Central Limit Theorem. The next term is of order $n^{-1/2}$ and represents a correction for skewness, since this term is zero if $\beta_1 = 0$. In particular, if the underlying distribution is symmetric, the Central Limit Theorem approximation is accurate up to terms of order $1/n$. The remaining term is a correction for kurtosis (and skewness) of order $1/n$.

The Edgeworth Expansion is an asymptotic expansion, which means that the series obtained by adding further terms with $n$ fixed need not converge. In particular, expanding to further terms for fixed $n$ may make the accuracy worse. There are a number of books treating the more advanced theory of Edgeworth and allied expansions. The review by Bhattacharya (1990) treats the more mathematical aspects of the theory and the book of Barndorff-Nielsen and Cox (1989) the more statistical. Hall (1992) is concerned with the application of the Edgeworth Expansion to the bootstrap.
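To see the skewness correction at work, the Python sketch below compares the plain normal approximation and the Edgeworth approximation with the exact distribution of the standardized mean of Exp(1) variables. The choice of the Exp(1) distribution is an assumption made for illustration (it has µ = σ = 1, β₁ = 2 and β₂ = 6, and its sum follows an Erlang distribution with a closed-form cdf); the function names are hypothetical.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def edgeworth_cdf(x, n, beta1, beta2):
    """Normal cdf plus the 1/sqrt(n) and 1/n Edgeworth correction terms."""
    skew_term = beta1 * (x**2 - 1.0) / (6.0 * math.sqrt(n))
    order_n_term = (beta2 * (x**3 - 3.0 * x) / (24.0 * n)
                    + beta1**2 * (x**5 - 10.0 * x**3 + 15.0 * x) / (72.0 * n))
    return normal_cdf(x) - skew_term * normal_pdf(x) - order_n_term * normal_pdf(x)

def exact_cdf_exponential_mean(x, n):
    """P[sqrt(n)(Xbar_n - 1) <= x] for i.i.d. Exp(1) data.

    S_n = X_1 + ... + X_n is Erlang(n, 1), whose cdf at s equals
    1 - sum_{k<n} exp(-s) s^k / k!.
    """
    s = n + x * math.sqrt(n)
    if s <= 0.0:
        return 0.0
    term = math.exp(-s)
    total = term
    for k in range(1, n):
        term *= s / k
        total += term
    return 1.0 - total

if __name__ == "__main__":
    n = 10
    for x in (-1.0, 0.0, 1.0):
        print(x, exact_cdf_exponential_mean(x, n), normal_cdf(x),
              edgeworth_cdf(x, n, beta1=2.0, beta2=6.0))
```

At x = 0 both 1/n terms vanish, so any improvement over the normal approximation there comes from the skewness term alone.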

3.3 Estimation of the Probability of Success

We now discuss whether the bootstrap method gives a consistent estimate of the distribution of $n^{1/2}(\hat{p}_n - p)$. For simplicity, we use an asymptotic analysis. Note that as $n \to \infty$, the distribution $F_n$ from which the bootstrap samples are drawn changes with $n$. This differs from asymptotic analyses in which the underlying distribution $F$ never changes with $n$.

Two different approaches are used to address this question. Since the underlying distribution $F_n$ changes with $n$, the first approach uses the double array CLT, and the second uses an approximation result, the Berry-Esseen bound.

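Proof 1: Denote the bootstrap sample at sample size $n$ by $Y_{n1}, \ldots, Y_{nn}$, drawn from $Bin(1, \bar{x}_n)$. Here $Bin(1, \bar{x}_n)$ plays the role of $F_{nj}$ in the double array CLT. Then $\mu_{nj} = \bar{x}_n$, $A_n = n\bar{x}_n$, $K_n = n$, and $B_n^2 = n\bar{x}_n(1 - \bar{x}_n)$.

1. Check the UAN condition. Since $|Y_{nj} - \mu_{nj}| \le 1$, as soon as $\tau B_n > 1$ we have

$$P\left( |Y_{nj} - \mu_{nj}| > \tau \sqrt{n \bar{x}_n (1 - \bar{x}_n)} \right) = 0.$$

2. Check the Lindeberg condition. For the same reason, as soon as $\epsilon B_n > 1$,

$$\int_{|t - \mu_{nj}| > \epsilon B_n} (t - \mu_{nj})^2 \, dF_{nj}(t) = 0.$$

These imply that $\sum_{j=1}^{n} Y_{nj}$ is $AN(n\bar{x}_n, n\bar{x}_n(1 - \bar{x}_n))$.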

It is well known that $\sqrt{n}(\hat{p} - p)$ is asymptotically normally distributed with mean 0 and variance $p(1 - p)$. If $\bar{X}_n$ converges to $p$ with probability 1, we then conclude that, for almost all realizations, the bootstrap distribution of $\sqrt{n}(\bar{Y}_n - \bar{x}_n)$, where $\bar{Y}_n = n^{-1}\sum_{j=1}^{n} Y_{nj}$, also converges to the normal distribution with mean 0 and variance $p(1 - p)$. This justifies the claim that the bootstrap method is consistent. As a remark, the bootstrap method is most useful when we do not know how to carry out an asymptotic analysis. In such a case, justifying that the bootstrap method is consistent is a challenging problem.

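Proof 2: Using the Berry-Esseen bound, we have

$$\sup_x \left| P\left[ \frac{n\bar{Y}_n - n\bar{x}_n}{\sqrt{n\bar{x}_n(1 - \bar{x}_n)}} \le x \right] - \Phi(x) \right| \le \frac{c}{\sqrt{n}} \, \frac{E|Y - \bar{x}_n|^3}{[\bar{x}_n(1 - \bar{x}_n)]^{3/2}}.$$

If $\bar{x}_n(1 - \bar{x}_n)$ is bounded away from zero, the right-hand side tends to zero, because $E|Y - \bar{x}_n|^3 \le 1$. This gives a justification that the bootstrap method is consistent.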
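The consistency argument can be checked empirically. The following Python sketch resamples a fixed 0/1 data set with replacement and records √n(Ȳ_n − x̄_n) for each bootstrap sample; the spread of these values should be close to x̄_n(1 − x̄_n). The function name, the number of replications B and the seed are illustrative assumptions.

```python
import math
import random

def bootstrap_proportion_roots(x, B=2000, seed=0):
    """Return xbar and B bootstrap values of sqrt(n) * (Ybar* - xbar).

    x is a list of 0/1 observations. Each bootstrap sample is n draws with
    replacement from x, i.e. n i.i.d. Bin(1, xbar) variables.
    """
    rng = random.Random(seed)
    n = len(x)
    xbar = sum(x) / n
    roots = []
    for _ in range(B):
        ybar = sum(rng.choices(x, k=n)) / n
        roots.append(math.sqrt(n) * (ybar - xbar))
    return xbar, roots

if __name__ == "__main__":
    data = [1] * 60 + [0] * 140          # hypothetical sample with xbar = 0.3
    xbar, roots = bootstrap_proportion_roots(data)
    mean_root = sum(roots) / len(roots)
    var_root = sum((r - mean_root) ** 2 for r in roots) / len(roots)
    print("bootstrap variance:", var_root, "target:", xbar * (1 - xbar))
```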

3.4 Statistical Functionals

Many statistics, including the sample mean, the sample median and the sample variance, are consistent estimators of their corresponding population quantities: the sample mean $\bar{X}$ of the expectation $E(X)$, the $p$th sample quantile of the $p$th population quantile $F^{-1}(p)$, the $k$th sample moment $\sum (X_i - \bar{X})^k/n$ of the $k$th population moment $E[X - E(X)]^k$, etc. Any such population quantity is a function of the distribution $F$ of the $X_i$ and can therefore be written as $h(F)$, where $h$ is a real-valued function defined over a collection $\mathcal{F}$ of distributions $F$. The mean $h(F) = E_F(X)$, for example, is defined over the class $\mathcal{F}$ of all $F$ with finite expectation.

Statistics which are representable as functionals h(F ) are so-called statistical functionals.

To establish the connection between the sequence of sample statistics and the functional $h(F)$ that it estimates, define the sample cdf $\hat{F}_n$ by

$$\hat{F}_n(x) = \frac{\text{Number of } X_i \le x}{n}.$$

This is the cdf of a distribution that assigns probability $1/n$ to each of the $n$ sample values $X_1, X_2, \ldots, X_n$. For the examples mentioned so far and many others, it turns out that the standard estimator of $h(F)$ based on $n$ observations is equal to $h(\hat{F}_n)$, the plug-in estimator of $h(F)$. When $h(F) = E[X - E(X)]^k$, it is seen that

$$h(\hat{F}_n) = \frac{1}{n}(X_1 - \bar{X})^k + \cdots + \frac{1}{n}(X_n - \bar{X})^k.$$
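The sample cdf and the plug-in estimator of a central moment can be sketched in a few lines of Python. The function names are hypothetical; the code only mirrors the two definitions above.

```python
def ecdf(sample):
    """Return the sample cdf F_hat_n as a function of x."""
    n = len(sample)

    def F_hat(x):
        # Proportion of observations that are <= x.
        return sum(1 for xi in sample if xi <= x) / n

    return F_hat

def plug_in_central_moment(sample, k):
    """h(F_hat_n) for h(F) = E[X - E(X)]^k: each point has mass 1/n."""
    n = len(sample)
    xbar = sum(sample) / n
    return sum((xi - xbar) ** k for xi in sample) / n

if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    F_hat = ecdf(data)
    print(F_hat(2.5), plug_in_central_moment(data, 2))
```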

Note that $\sum (X_i - \bar{X})^k/n$ can be viewed as a function of $n$ variables or as a function of $\hat{F}_n$. Suppose we want to evaluate the performance of an estimator $\hat{\theta}_n$ of some parameter $\theta$ or functional $h(F)$; an example is the sample median $\hat{\theta}_n$ as an estimator of the population median. We can use the following as a measure:

$$\lambda_n(F) = P_F\left\{ \sqrt{n}[\hat{\theta}_n - h(F)] \le a \right\}.$$

Note that the population median can be written as $F^{-1}(0.5)$, and the sample median can then be viewed as the plug-in estimate $\hat{F}_n^{-1}(0.5)$. Again, we can estimate $\lambda_n(F)$ by the plug-in estimator $\lambda_n(\hat{F}_n)$, in which the distribution $F$ of the $X$’s is replaced by the distribution $\hat{F}_n$. In addition, the subscript $F$, which governs the distribution of $\hat{\theta}_n$, must also be changed to $\hat{F}_n$. To see what this last step means, write

$$\hat{\theta}_n = \theta(X_1, \ldots, X_n), \qquad (1)$$

that is, express $\hat{\theta}_n$ not as a function of $\hat{F}_n$ but directly as a function of the sample $(X_1, \ldots, X_n)$.

The dependence of the distribution of $\hat{\theta}_n$ on $F$ results from the fact that $X_1, \ldots, X_n$ is a sample from $F$. To replace $F$ by $\hat{F}_n$ in the distribution governing $\hat{\theta}_n$, we must therefore replace (1) by

$$\hat{\theta}_n^* = \theta(X_1^*, \ldots, X_n^*), \qquad (2)$$

where $X_1^*, \ldots, X_n^*$ is a sample from $\hat{F}_n$. With this notation, $\lambda_n(\hat{F}_n)$ can now be written formally as

$$\lambda_n(\hat{F}_n) = P_{\hat{F}_n}\left\{ \sqrt{n}[\hat{\theta}_n^* - h(\hat{F}_n)] \le a \right\}. \qquad (3)$$

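When $\lambda_n(\hat{F}_n)$ is too complicated to compute directly, we can approximate it by $\lambda^*_{B,n}$, as suggested in Efron (1979). Here $B$ bootstrap samples are drawn from $\hat{F}_n$, and

$$\lambda^*_{B,n} = \frac{1}{B} \sum_{b=1}^{B} I\left\{ \sqrt{n}[\hat{\theta}^*_{n,b} - h(\hat{F}_n)] \le a \right\}, \qquad (4)$$

where $\hat{\theta}^*_{n,b}$ denotes the value of $\hat{\theta}^*_n$ computed from the $b$th bootstrap sample and $I\{\cdot\}$ is the indicator function.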

Here we don’t give any discussion of theoretical properties of the plug-in estimator λn( ˆFn), such as consistency and asymptotic normality. In much of the bootstrap literature, the bootstrap is said to work if λn( ˆFn) is consistent for estimating λn(F ) in the sense that λn( ˆFn) → λn(F ) for all F under consideration.
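One common way to approximate λ_n(F̂_n) by simulation is to draw B samples from F̂_n and report the fraction of them for which √n[θ̂*_n − h(F̂_n)] ≤ a. The Python sketch below assumes that reading of λ*_{B,n} and uses the sample median as the estimator; the function name, the default B and the seed are hypothetical choices.

```python
import math
import random
import statistics

def bootstrap_lambda(sample, a, B=1000, seed=0, estimator=statistics.median):
    """Monte Carlo approximation of P_{F_hat_n}{ sqrt(n)[theta*_n - h(F_hat_n)] <= a }.

    Drawing n values with replacement from `sample` is the same as drawing
    an i.i.d. sample of size n from the sample cdf F_hat_n.
    """
    rng = random.Random(seed)
    n = len(sample)
    theta_hat = estimator(sample)            # plug-in value h(F_hat_n)
    hits = 0
    for _ in range(B):
        resample = rng.choices(sample, k=n)
        theta_star = estimator(resample)
        if math.sqrt(n) * (theta_star - theta_hat) <= a:
            hits += 1
    return hits / B

if __name__ == "__main__":
    # Hypothetical data set of 51 values, used only to exercise the function.
    data = [((i * 37) % 101) / 10 for i in range(51)]
    for a in (-5.0, 0.0, 5.0):
        print(a, bootstrap_lambda(data, a))
```

Reusing the same seed for every value of a means all evaluations share the same bootstrap samples, so the resulting approximation is a proper (nondecreasing) distribution function in a.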

4 Inconsistent Bootstrap Estimator

Bickel and Freedman (1981) and Loh (1984) showed that the bootstrap estimators of the distributions of the extreme-order statistics are inconsistent. Let $X_{(n)}$ be the maximum of i.i.d. random variables $X_1, \ldots, X_n$ from $F$ with $F(\theta) = 1$ for some $\theta$, and let $X^*_{(n)}$ be the maximum of $X^*_1, \ldots, X^*_n$, which are i.i.d. from the empirical distribution $F_n$. Although $X_{(n)} \to \theta$, it never equals $\theta$. But

$$P_*\{X^*_{(n)} = X_{(n)}\} = 1 - (1 - n^{-1})^n \to 1 - e^{-1},$$

which leads to the inconsistency of the bootstrap estimator.
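The limiting value 1 − e⁻¹ ≈ 0.632 is easy to check numerically. The Python sketch below evaluates the exact probability and also estimates it by resampling a Uniform(0, 1) sample, for which θ = 1. The uniform distribution, the function names, and the values of B and the seed are assumptions made for illustration.

```python
import math
import random

def prob_boot_max_equals_sample_max(n):
    """Exact P*{X*_(n) = X_(n)} when the sample maximum is unique."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def simulate_boot_max(n, B=5000, seed=0):
    """Monte Carlo estimate of the same probability for a Uniform(0, 1) sample."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    x_max = max(x)
    hits = sum(1 for _ in range(B) if max(rng.choices(x, k=n)) == x_max)
    return hits / B

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, prob_boot_max_equals_sample_max(n))
    print("limit:", 1.0 - math.exp(-1.0))
    print("simulated, n = 50:", simulate_boot_max(50))
```

The bootstrap maximum therefore has an atom of mass about 0.632 at X_(n), which no continuous limiting distribution of the true maximum can reproduce.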

The reason for the inconsistency of the bootstrap is that the bootstrap samples are drawn from $F_n$, which is not exactly $F$. Therefore, the bootstrap may fail due to a lack of “continuity.” We now illustrate that the bootstrap can produce wrong solutions, i.e., it provides some inconsistent estimators. Refer to Shao (1994) for further references. We focus on the case where the data $X_1, \ldots, X_n$ are i.i.d. samples from a $k$-dimensional population distribution $F$ and on the problem of estimating the distribution

$$H_n(x) = P\{R_n \le x\}, \qquad (5)$$

where $R_n = R_n(T_n, F)$ is a real-valued functional of $F$ and $T_n = T_n(X_1, \ldots, X_n)$ is a statistic of interest. A bootstrap estimator of $H_n$ is

$$\hat{H}_n(x) = P_*\{R_n^* \le x\}, \qquad (6)$$

where $R_n^* = R_n(T_n^*, F_n)$. Let $\mu = E X_1$; $\theta = g(\mu)$, where $g$ is a function from $\mathbb{R}^k$ to $\mathbb{R}$; $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$ be the sample mean; and $T_n = g(\bar{X}_n)$. Under the conditions that $\mathrm{Var}(X_1) = \Sigma < \infty$ and $g$ is first-order continuously differentiable at $\mu$,

$$\sup_x \left| P_*\{\sqrt{n}(T_n^* - T_n) \le x\} - P\{\sqrt{n}(T_n - \theta) \le x\} \right| \to 0 \quad \text{a.s.}, \qquad (7)$$

where $T_n^* = g(\bar{X}_n^*)$ and $\bar{X}_n^* = n^{-1}\sum_{i=1}^n X_i^*$.

Consider the situation where $g$ is second-order continuously differentiable at $\mu$ with $\nabla^2 g(\mu) \ne 0$ but $\nabla g(\mu) = 0$. Using the Taylor expansion and $\nabla g(\mu) = 0$, we obtain that

$$T_n - \theta = \frac{1}{2}(\bar{X}_n - \mu)' \nabla^2 g(\mu)(\bar{X}_n - \mu) + o_P(n^{-1}). \qquad (8)$$

This implies

$$n(T_n - \theta) \to_d \frac{1}{2} Z_\Sigma' \nabla^2 g(\mu) Z_\Sigma, \qquad (9)$$

where $Z_\Sigma$ is a random $k$-vector having the normal distribution with mean 0 and covariance matrix $\Sigma$. From (9), $\sqrt{n}(T_n - \theta) \to_P 0$, and, therefore, result (7) is not useful when $\nabla g(\mu) = 0$; we need to consider the bootstrap estimator of the distribution of $n(T_n - \theta)$ in this case.

Let $R_n = n(T_n - \theta)$, $R_n^* = n(T_n^* - T_n)$, and $H_n$ and $\hat{H}_n$ be given by (5) and (6), respectively. Babu (1984) pointed out that $\hat{H}_n$ is inconsistent in this case. Similar to (8),

$$T_n^* - T_n = \nabla g(\bar{X}_n)'(\bar{X}_n^* - \bar{X}_n) + \frac{1}{2}(\bar{X}_n^* - \bar{X}_n)' \nabla^2 g(\bar{X}_n)(\bar{X}_n^* - \bar{X}_n) + o_P(n^{-1}) \quad \text{a.s.} \qquad (10)$$

By the continuity of $\nabla^2 g$ and Theorem 2.1 of Bickel and Freedman (1981), for almost all given sequences $X_1, X_2, \ldots$,

$$\frac{n}{2}(\bar{X}_n^* - \bar{X}_n)' \nabla^2 g(\bar{X}_n)(\bar{X}_n^* - \bar{X}_n) \to_d \frac{1}{2} Z_\Sigma' \nabla^2 g(\mu) Z_\Sigma. \qquad (11)$$

From $\nabla g(\mu) = 0$,

$$\sqrt{n}\,\nabla g(\bar{X}_n) = \sqrt{n}\,\nabla^2 g(\mu)(\bar{X}_n - \mu) + o_P(1) \to_d \nabla^2 g(\mu) Z_\Sigma. \qquad (12)$$

Hence, for almost all given $X_1, X_2, \ldots$, the conditional distribution of $n \nabla g(\bar{X}_n)'(\bar{X}_n^* - \bar{X}_n)$ does not have a limit. It follows from (10) and (11) that for almost all given $X_1, X_2, \ldots$, the conditional distribution of $n(T_n^* - T_n)$ does not have a limit. Therefore, $\hat{H}_n$ is inconsistent as an estimator of $H_n$.

The symptom of this problem in the present case is that $\nabla g(\bar{X}_n)$ is not necessarily equal to zero when $\nabla g(\mu) = 0$. As a result, the expansion in (10), compared with the expansion in (8), has an extra nonzero term $\nabla g(\bar{X}_n)'(\bar{X}_n^* - \bar{X}_n)$ which does not converge to zero fast enough, and, therefore, $\hat{H}_n$ cannot mimic $H_n$.
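A one-dimensional special case makes the failure concrete. Take g(x) = x² and µ = 0 (both chosen here purely for illustration), so that R_n = n X̄_n² is positive with probability 1 for continuous data and H_n(0) = 0. The bootstrap version n(X̄*_n² − X̄_n²), however, is nonpositive whenever |X̄*_n| ≤ |X̄_n|, and how often that happens depends on √n X̄_n/σ̂, which does not settle down as n grows. The Python sketch below fixes two data sets that differ only in this standardized mean; all names are hypothetical.

```python
import math
import random

def boot_frac_nonpositive(data, B=1000, seed=0):
    """Bootstrap estimate of H_n(0): fraction of n*(Xbar*^2 - Xbar^2) <= 0, for g(x) = x^2."""
    rng = random.Random(seed)
    n = len(data)
    xbar = sum(data) / n
    hits = 0
    for _ in range(B):
        xbar_star = sum(rng.choices(data, k=n)) / n
        if n * (xbar_star**2 - xbar**2) <= 0:
            hits += 1
    return hits / B

def sample_with_standardized_mean(z, n=100, seed=1):
    """Hypothetical helper: a N(0, 1) sample shifted so that sqrt(n) * Xbar / sd equals z."""
    rng = random.Random(seed)
    raw = [rng.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(raw) / n
    sd = math.sqrt(sum((r - m) ** 2 for r in raw) / n)
    return [r - m + z * sd / math.sqrt(n) for r in raw]

if __name__ == "__main__":
    for z in (0.0, 0.5, 2.0):
        data = sample_with_standardized_mean(z)
        print("z =", z, "bootstrap estimate of H_n(0):", boot_frac_nonpositive(data))
```

By the normal approximation to the bootstrap mean, the estimate is roughly 0.5 − Φ(−2|z|), so it ranges from near 0 to near 0.5 depending on the realized sample, while the target H_n(0) is 0 in every case.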

5 Bias Reduction via the Bootstrap Principle

In this section, we use an example to illustrate bias reduction via the bootstrap principle. Consider $\theta_0 = \theta(F_0) = \mu^3$, where $\mu = \int x \, dF_0(x)$. Set $\hat{\theta} = \theta(F_n) = \bar{X}^3$. Elementary calculations show that

$$E\{\theta(F_n) \mid F_0\} = E\Big\{ \mu + n^{-1}\sum_{i=1}^n (X_i - \mu) \Big\}^3 = \mu^3 + n^{-1} 3\mu\sigma^2 + n^{-2}\gamma, \qquad (13)$$

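where $\gamma = E(X_1 - \mu)^3$ denotes the population skewness. Using the nonparametric bootstrap, we obtain in direct analogy to (13)

$$E\{\theta(F_n^*) \mid F_n\} = \bar{X}^3 + n^{-1} 3\bar{X}\hat{\sigma}^2 + n^{-2}\hat{\gamma},$$

where $\hat{\sigma}^2 = n^{-1}\sum (X_i - \bar{X})^2$ and $\hat{\gamma} = n^{-1}\sum (X_i - \bar{X})^3$ denote the sample variance and skewness, respectively. Using the bootstrap principle, $E\{\theta(F_n^*) \mid F_n\} - \theta(F_n)$ is used to estimate $\theta(F_n) - \theta(F_0)$. Note that $\theta_0 = \theta(F_n) - (\theta(F_n) - \theta_0)$. Hence $\theta_0$ can be estimated by $\theta(F_n) - [E\{\theta(F_n^*) \mid F_n\} - \theta(F_n)]$, that is, by $2\theta(F_n) - E\{\theta(F_n^*) \mid F_n\}$. Therefore the bootstrap bias-reduced estimate is $2\bar{X}^3 - (\bar{X}^3 + n^{-1}3\bar{X}\hat{\sigma}^2 + n^{-2}\hat{\gamma})$, or

$$\hat{\theta}_{NB} = \bar{X}^3 - n^{-1} 3\bar{X}\hat{\sigma}^2 - n^{-2}\hat{\gamma}.$$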

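Now we check whether $\hat{\theta}_{NB}$ really reduces bias. Observe that for general distributions with finite third moments,

$$E(\bar{X}^3) = \mu^3 + n^{-1}3\mu\sigma^2 + n^{-2}\gamma,$$

$$E(\bar{X}\hat{\sigma}^2) = \mu\sigma^2 + n^{-1}(\gamma - \mu\sigma^2) - n^{-2}\gamma,$$

$$E(\hat{\gamma}) = \gamma(1 - 3n^{-1} + 2n^{-2}).$$

It follows that

$$E(\hat{\theta}_{NB}) - \theta_0 = n^{-2}3(\mu\sigma^2 - \gamma) + n^{-3}6\gamma - n^{-4}2\gamma$$

for general distributions. The bias of $\hat{\theta}_{NB}$ is therefore of order $n^{-2}$, compared with the order $n^{-1}$ bias of $\bar{X}^3$ in (13).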
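The closed-form bootstrap expectation of X̄*³ can be verified on a tiny sample by enumerating all nⁿ equally likely resamples. The Python sketch below does this and also computes the bias-reduced estimate X̄³ − 3X̄σ̂²/n − γ̂/n². The function names and the three-point data set are hypothetical; exhaustive enumeration is only practical for very small n.

```python
import itertools

def sample_moments(x):
    """Return Xbar, sigma_hat^2 and gamma_hat, all with divisor n."""
    n = len(x)
    xbar = sum(x) / n
    var_hat = sum((xi - xbar) ** 2 for xi in x) / n
    gamma_hat = sum((xi - xbar) ** 3 for xi in x) / n
    return xbar, var_hat, gamma_hat

def bootstrap_expectation_formula(x):
    """E{theta(F_n^*) | F_n} from the closed-form expression."""
    n = len(x)
    xbar, var_hat, gamma_hat = sample_moments(x)
    return xbar**3 + 3.0 * xbar * var_hat / n + gamma_hat / n**2

def bootstrap_expectation_enumerated(x):
    """Same expectation, averaging (Xbar*)^3 over all n^n resamples."""
    n = len(x)
    total = 0.0
    for resample in itertools.product(x, repeat=n):
        total += (sum(resample) / n) ** 3
    return total / n**n

def bias_reduced_estimate(x):
    """Bootstrap bias-reduced estimate 2*theta(F_n) - E{theta(F_n^*) | F_n}."""
    n = len(x)
    xbar, var_hat, gamma_hat = sample_moments(x)
    return xbar**3 - 3.0 * xbar * var_hat / n - gamma_hat / n**2

if __name__ == "__main__":
    data = [0.0, 0.0, 3.0]
    print(bootstrap_expectation_formula(data), bootstrap_expectation_enumerated(data))
    print(bias_reduced_estimate(data))
```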

6 Jackknife

One of the central goals of data analysis is an estimate of the uncertainties in fit parameters. Sometimes standard methods for getting these errors are unavailable or inconvenient. In that case we may resort to a couple of useful statistical tools that have become popular since the advent of fast computers. One is called the “jackknife” (because one should always have this tool handy) and the other the “bootstrap”. The jackknife is one of the earliest techniques for obtaining reliable statistical estimators. Here we describe the jackknife method, which was invented in 1949 by Quenouille and developed further by Tukey in 1958. As the father of EDA, John Tukey attempted to use the jackknife to explore how a model is influenced by subsets of observations when outliers are present. The name “jackknife” was coined by Tukey to imply that the method is an all-purpose statistical tool.

Quenouille (1949) introduced a technique for reducing the bias of a serial correlation estimator based on splitting the sample into two half-samples. In his 1956 paper he generalized this idea to splitting the sample into $g$ groups of size $h$ each, $n = gh$, and explored its general applicability. The jackknife requires less computational power than more recent techniques such as the bootstrap method.

Suppose we have a sample $x = (x_1, x_2, \ldots, x_n)$ and an estimator $\hat{\theta} = s(x)$. The jackknife focuses on the samples that leave out one observation at a time:

$$x_{(i)} = (x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$$

for $i = 1, 2, \ldots, n$, called jackknife samples. The $i$th jackknife sample consists of the data set with the $i$th observation removed. Let $\hat{\theta}_{(i)} = s(x_{(i)})$ be the $i$th jackknife replication of $\hat{\theta}$.

The jackknife estimate of standard error is defined by

$$\widehat{se}_{jack} = \left[ \frac{n-1}{n} \sum_{i=1}^{n} (\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)})^2 \right]^{1/2},$$

where $\hat{\theta}_{(\cdot)} = \sum_{i=1}^n \hat{\theta}_{(i)}/n$.
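The jackknife standard error translates directly into code. The following Python sketch is a minimal version, with a hypothetical function name, that accepts any estimator taking a list of numbers. For the sample mean the result coincides with the usual s/√n, which gives a convenient sanity check.

```python
import math
import statistics

def jackknife_se(x, estimator):
    """Jackknife estimate of the standard error of estimator(x)."""
    n = len(x)
    # i-th jackknife replication: the estimator applied without observation i.
    replications = [estimator(x[:i] + x[i + 1:]) for i in range(n)]
    center = sum(replications) / n
    return math.sqrt((n - 1) / n * sum((r - center) ** 2 for r in replications))

if __name__ == "__main__":
    data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]   # hypothetical sample
    print(jackknife_se(data, statistics.mean))
    print(statistics.stdev(data) / math.sqrt(len(data)))
```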

The jackknife only works well for linear statistics (e.g., mean). It fails to give accurate estimation for non-smooth (e.g., median) and nonlinear (e.g., correlation coefficient) cases.

Thus improvements to this technique were developed. We now consider the delete-$d$ jackknife. Instead of leaving out one observation at a time, we leave out $d$ observations. Therefore, the size of a delete-$d$ jackknife sample is $n - d$, and there are $C(n, d)$ jackknife samples. Let $\hat{\theta}_{(s)}$ denote $\hat{\theta}$ applied to the data set with the subset $s$ of $d$ observations removed. The formula for the delete-$d$ jackknife estimate of s.e. is

$$\left[ \frac{n-d}{d\,C(n,d)} \sum_{s} (\hat{\theta}_{(s)} - \hat{\theta}_{(\cdot)})^2 \right]^{1/2},$$

where $\hat{\theta}_{(\cdot)} = \sum_s \hat{\theta}_{(s)}/C(n, d)$ and the sum is over all subsets $s$ of size $d$ chosen without replacement from $x_1, x_2, \ldots, x_n$. It can be shown that the delete-$d$ jackknife is consistent for the median if $\sqrt{n}/d \to 0$ and $n - d \to \infty$. Roughly speaking, it is preferable to choose a $d$ such that $\sqrt{n} < d < n$ for the delete-$d$ jackknife estimation of standard error.
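A direct implementation of the delete-d formula enumerates every way of keeping n − d observations. The Python sketch below does so with `itertools.combinations`; the function name is hypothetical, and full enumeration is only feasible when C(n, d) is small (in practice one would typically sample subsets at random, which is not shown here).

```python
import itertools
import math
import statistics

def delete_d_jackknife_se(x, estimator, d):
    """Delete-d jackknife standard error using all C(n, d) jackknife samples."""
    n = len(x)
    replications = [
        estimator([x[i] for i in kept])
        for kept in itertools.combinations(range(n), n - d)   # indices retained
    ]
    count = len(replications)                                  # equals C(n, d)
    center = sum(replications) / count
    scale = (n - d) / (d * count)
    return math.sqrt(scale * sum((r - center) ** 2 for r in replications))

if __name__ == "__main__":
    data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0]   # hypothetical sample
    for d in (1, 2, 3):
        print(d, delete_d_jackknife_se(data, statistics.mean, d),
              delete_d_jackknife_se(data, statistics.median, d))
```

With d = 1 this reduces to the ordinary jackknife, and for the sample mean every choice of d returns s/√n.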

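We have just described how to obtain a standard error estimate of an estimator $\hat{\theta}$ based on a sample of size $n$. Now we demonstrate that the jackknife can be used to reduce the bias of the estimate $\hat{\theta}$. If the bias of $\hat{\theta}$ is of the order $n^{-1}$, we write

$$E_F(\hat{\theta}) - \theta = \frac{a_1(F)}{n} + \frac{a_2(F)}{n^2} + \cdots.$$

Since each jackknife replication is based on $n - 1$ observations,

$$E_F(\hat{\theta}_{(\cdot)}) = \theta + \frac{a_1(F)}{n-1} + \frac{a_2(F)}{(n-1)^2} + \cdots.$$

To estimate the bias, the jackknife gives

$$\widehat{Bias}_{jack} = (n-1)(\hat{\theta}_{(\cdot)} - \hat{\theta}).$$

Then the jackknife estimate of $\theta$ is

$$\tilde{\theta} = \hat{\theta} - \widehat{Bias}_{jack}.$$

Note that

$$\tilde{\theta} = n\hat{\theta} - (n-1)\hat{\theta}_{(\cdot)}.$$

Taking expectations of this expression, the $a_1(F)$ terms cancel, so the bias of the jackknife estimate $\tilde{\theta}$ is of the order $n^{-2}$.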
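A short Python sketch of the jackknife bias correction follows. As a check, it is applied to the plug-in variance (divisor n), whose bias is exactly of the form a₁/n; in that case the corrected estimate equals the usual unbiased sample variance with divisor n − 1. The function names are hypothetical.

```python
import statistics

def jackknife_bias_corrected(x, estimator):
    """Return (theta_tilde, bias_jack) for the estimator applied to the list x."""
    n = len(x)
    theta_hat = estimator(x)
    theta_dot = sum(estimator(x[:i] + x[i + 1:]) for i in range(n)) / n
    bias_jack = (n - 1) * (theta_dot - theta_hat)
    return theta_hat - bias_jack, bias_jack

def plug_in_variance(x):
    """Variance of the empirical distribution (divisor n), a biased estimator."""
    n = len(x)
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / n

if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    corrected, bias = jackknife_bias_corrected(data, plug_in_variance)
    print(plug_in_variance(data), bias, corrected, statistics.variance(data))
```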

7 Resampling Methods

The term “resampling” has been applied to a variety of techniques for statistical inference, among which stochastic permutation and the bootstrap are the most characteristic. There are at least four major types of resampling methods: the randomization exact test, cross-validation, the jackknife, and the bootstrap. Although today they are unified under a common theme, it is important to note that these four techniques were developed by different people at different periods of time for different purposes. In this chapter, we have already discussed two of them. In this section, we describe the randomization exact test and cross-validation.

The randomization exact test is also known as the permutation test. This test was developed by R.A. Fisher (1935), the founder of classical statistical testing. It was noted that with a large sample the exact Fisher test is not feasible because of the computational difficulty (before the age of powerful computers). Hence, in his later years Fisher lost interest in the permutation method because there were no computers in his day to automate such a laborious method. As a remedy, people suggested that a randomly selected subset of the possible permutations could provide the benefits of the permutation test without excessive computational cost.
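The idea of using a random subset of permutations can be sketched as follows in Python. This is only one possible setup and is an assumption throughout: two groups are compared through the absolute difference of their means, group labels are reshuffled B times, and the p-value counts the observed arrangement as one of the permutations. The function name and defaults are hypothetical.

```python
import random

def permutation_test_mc(group_a, group_b, B=2000, seed=0):
    """Monte Carlo permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    def abs_mean_diff(values):
        first, second = values[:n_a], values[n_a:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = abs_mean_diff(pooled)
    extreme = 0
    for _ in range(B):
        rng.shuffle(pooled)                       # random re-assignment of labels
        if abs_mean_diff(pooled) >= observed - 1e-12:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (extreme + 1) / (B + 1)

if __name__ == "__main__":
    treated = [11.0, 12.0, 13.0, 14.0, 15.0]      # hypothetical measurements
    control = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(permutation_test_mc(treated, control))
```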

The randomization exact test is a test procedure in which data are randomly re-assigned so that an exact p-value is calculated based on the permuted data. Let’s look at the following