Bootstrap and Jackknife Calculations in R Version 6 April 2004

(1)

Version 6 April 2004

These notes work through a simple example to show how one can programRto do both jackknife and bootstrap sampling. We start with bootstrapping.

Bootstrap Calculations

Rhas a number of nice features for easy calculation of bootstrap estimates and confidence intervals. To see how to use these features, consider the following 25 observations:

8.26 6.33 10.4 5.27 5.35 5.61 6.12 6.19 5.2 7.01 8.74 7.78 7.02 6 6.5 5.8 5.12 7.41 6.52 6.21 12.28 5.6 5.38 6.6 8.74

Suppose we wish to estimate the coefficient of variation, CV =√

Var/ x. Let’s do this with a bootstrap estimator.

First, let’s put the data into a vector, which we will call^x,

> x <-c(8.26, 6.33, 10.4, 5.27, 5.35, 5.61, 6.12, 6.19, 5.2, 7.01, 8.74, 7.78, 7.02, 6, 6.5, 5.8, 5.12, 7.41, 6.52, 6.21, 12.28, 5.6, 5.38, 6.6, 8.74)

Now let’s define a functon in ^R, which we will call ^CV, to compute the coefficient of variation,

> CV <- function(x) sqrt(var(x))/mean(x) So, let’s compute the CV

> CV(x) [1] 0.2524712

To generate a single bootstrap sample from this data vector, we use the command

>sample(x,replace=T)

which generates a bootstrap sample of the data vectorxby sampling with replacement.

Hence, to compute the CV using a single bootstrap sample,

> CV(sample(x,replace=T)) [1] 0.2242572

The particular value that^Rreturns for you will be different as the sample is random.

Some other useful commands:

> sum(x)returns the sum of the elements in^x

> mean(x)returns the mean of the elements inx

> var(x)returns the sample variance, i.e.,P

i(x− x)²/(n− 1)

> length(x)returns the number of items in^x(i.e., the sample size n)

(2)

Note that the^sumcommand is fairly general, for example

> sum((x-mean(x))^ 2) computesP

i(x− x)²

So, lets now generate 1000 bootstrap samples. We first need to specify a vector of real values of lenght 1000, which we will callboot

> boot <-numeric(1000)

We now generate 1000 samples, and assign the CV for bootstrap sample i as the ith element in the vector^boot, using a^forloop

for (i in 1:1000) boot[i] <- CV(sample(x,replace=T))

The mean and variance of this collection of bootstrap samples are easily obtained using themeanandvarcommands (again, your values may differ),

> mean(boot) [1] 0.2404653

> var(boot) [1] 0.00193073

A plot of the histogram of these values follows using hist(boot)

Likewise, the value corresponding to the (say) upper 97.5

> quantile(boot,0.975) [1] 0.3176385

while the value corresponding to the lower 2.5% follows from

> quantile(boot,0.025) [1] 0.153469

Recall from the notes that the estimate of the bias is given by the difference between the mean of the bootstrap values and the initial estimate,

(3)

> bias <- mean(boot) - CV(x)

and an bootstrap-corrected estimate of the CV is just the original estimate minus the bias,

> CV(x) - bias [1] 0.2644771

Assuming normality, the approximate 95% confidence interval is given by CVd± 1.96p

Var(bootstrap) (or adjusting for the bias an lower and upper values of

> CV(x) - bias - 1.96*sqrt(var(boot)) [1] 0.1783546

> CV(x) - bias + 1.96*sqrt(var(boot)) [1] 0.3505997

Efron’s confident limit (Equation 11 on resampling notes) has an upper and lower value of

> quantile(boot,0.975) [1] 0.3176385

and

> quantile(boot,0.025) [1] 0.153469

While Hall’s confidence limits (Equation 12) has an upper and lower value of

> 2*CV(x) - quantile(boot,0.025) [1] 0.3514734

and

> 2*CV(x) - quantile(boot,0.975) [1] 0.1873039

Jackknife Calculations

We now turn to jackknifing the sample. Recall from the randomization notes that this involves two steps. First, we generate a jackknife sample which has value xiremoved and then compute the ith partial estimate of the test statistic using this sample,

bθi(x1· · · xi−1, xi,· · · xn)

We then turn this ith partial estimate into the ith pseudovalue bθ_i^∗using (Equation 5c in random notes)

bθ^∗_i = nbθ− (n − 1)bθi

(4)

where bθis the estimate using the full data.

Let’s see how to code this in^Rusing the previous vector^xof data with our test statistic again being the coefficient of variation (and hence our functionCVpreviously defined).

We first focus on generating the ith partial estimate and ith pseudovalue. We need to take the original data vectorxand turn it into a vector (which we denotejack) of lenght n− 1 as follows. First, we need to specify toRthat we are creating the jackknife sample vector of the n− 1 sampled points

jack <- numeric(length(x)-1)

As before, we will use the commandlenght(x)in place of n. We also need to specify to Rthat we will be generating a vectorpseudoof the n pseudovalues

pseudo <- numeric(length(x))

Next, we need to fill in the elements of the^jacksample vector as follows. For j < i, the jth element ofjackis the same as the jth element ofx; for j = i we exclude the value of x, while for j > i, the j− 1th element of^jackis the jth element ofx. We can state all this using a logicalif .. elsestatement within aforloop,

for (j in 1:length(x)) if(j < i) jack[j] <- x[j]

else if(j > i) jack[j-1] <- x[j]

We can then compute the ith pseudovalue (for the CV) as follows:

pseudo[i] <- lenght(x)*CV(x) -(lenght(x)-1)*CV(jack)

Finally, we top this all off by looping through the n possible i values, giving the final code as

jack <- numeric(length(x)-1) pseudo <- numeric(length(x)) for (i in 1:length(x)) { for (j in 1:length(x))

{if(j < i) jack[j] <- x[j] else if(j > i) jack[j-1] <- x[j]} pseudo[i] <- length(x)*CV(x) -(length(x)-1)*CV(jack)}

Note the use of the parenthesis ({, }) to delimit the appropriate elements in each loop. The mean and variance of the pseudovalues are easily found using

> mean(pseudo) [1] 0.2617376

> var(pseudo) [1] 0.07262871

Likewise, a histogram of the pseudovalues is generated using hist(pseudo)

Recall that the mean of the pseudovalues is the bootstrap estimator, while var(pseudo)/n is the variance of this estimator,

>var(pseudo)/length(x) [1] 0.002905148

(5)

An approximate 95% confidence interval is given by mean(pseudo)±t0.975,n−1p

var(pseudo)/n Using^R, the upper and lower limits become

> mean(pseudo) + qt(0.975,length(x)-1)*sqrt(var(pseudo)/length(x)) [1] 0.3729806

> mean(pseudo) - qt(0.975,length(x)-1)*sqrt(var(pseudo)/length(x)) [1] 0.1504947

Giving the approximate 95% jackknife confidence interval as 0.150 to 0.372.

Here’s a summary of the various estimated values, variances, and confidence intervals

Method Estimated CV Variance 95% interval

Original Estimate 0.252

Jackknife 0.262 0.0029 0.150 - 0.373

Bootstrap 0.264 0.0019

Bootstrap (normality) 0.178 - 0.351

Bootstrap (Efron) 0.153 - 0.318

Bootstrap (Hall) 0.187 - 0.351