Monte Carlo methods for Statistical Inference:

Resampling

Hung Chen

hchen@math.ntu.edu.tw Department of Mathematics

National Taiwan University

17th March 2004

Meet at NS 104 On Wednesday from 9:10 to 12.

Outline

• Introduction

– Nonparametric Bootstrap

– Classical Paradigm on Inference – Error in Bootstrap Inference

– R-programming

• Applications of Bootstrap – Confidence Intervals

– Regressions

– Hypothesis Tests – Censored data

• Resampling Methods

– Permutation Tests – The Jackknife

– Cross Validation and Model Selection

• References

– Bickel, P. and Freedman, D. (1981) Some asymp- totic theory for the bootstrap. Annals of Statistics, 9, 1196-1217.

– Davison, A.C. and Hinkley, D.V. (1997). Boot- strap Methods and their Application. Cambridge:

Cambridge University Press.

– DiCicco, T.J and Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11, 189- 228.

– Efron, B. (1979) Bootstrap Methods: Another

Look at the Jackknife. Annals of Statistics 7 1¡X26.

– Efron, B. and R.J. Tibshirani (1993) An Introduc- tion to the Bootstrap. New York: Chapman and Hall.

– http://www-stat.stanford.edu/∼susan/courses/s208/

– Refer to Rnews 2002 December issue on Resam- pling Methods in R: The boot Package.

Bootstrapping

Bootstrapping is a general approach to statistical in- ference based on building a sampling distribution for a statistic by resampling from the data at hand.

• The term bootstrapping,due to Efron (AS, 1979), is an allusion to the expression pulling oneself up by one’s bootstraps.

• It uses the sample data as a population from which repeated samples are drawn.

• Two R libraries for bootstrapping are associated with extensive treatments of the subject:

– Efron and Tibshirani’s (1993) bootstrap library – Davison and Hinkley’s (1997) boot library

• There are several forms of the bootstrap, and, addi- tionally, several other resampling methods that are related to it, such as jackknifing, cross-validation, ran- domization tests, and permutation tests.

Nonparametric Bootstrapping Scientific question:

• Why Sample?

– Sampling can provide reliable information at far less cost than a census.

– Samples can be collected more quickly than cen- suses.

– Sampling can lead to more accurate estimates than censuses (!).

∗ When samples are used a small number of well- trained interviewers can spend more time get- ting high quality responses from a few sampled people.

– With probability sampling, you can quantify the sampling error from a survey.

– Sampling is necessary when a unit must be de- stroyed to be measured (e.g., breaking apart a Chips Ahoy! Cookie to measure the number of chocolate chips)

• Suppose that we draw a sample S = {X_{1}, X_{2}, . . . , X_{n}}
from a population P = {x_{1}, x_{2}, . . . , x_{N}} where N is
very much larger than n.

• Abstraction: Suppose that with any design, with
or without replacement, the probability of including
unit i in the sample is π_{i} (> 0), for i = 1, 2, . . . , N .

– The Horvitz-Thompson (1952) estimator for the

population total X_{T} is defined as
Yˆ_{HT} = X

i∈S

x_{i}
π_{i}

where S contains the distinct units only and hence the size of S could be less than the number n of units drawn.

– If the sampling is without replacement, the size of S must be n.

– Under a sampling with replacement, it can be shown that for a fixed sample size n,

π_{i} = 1 − (1 − p_{i})^{n}

where p_{i} is the probability of selecting the ith unit
of the population on each draw.

– Under a simple random sampling without replace-
ment, it can be shown that for a fixed sample size
n, π_{i} = n/N .

• For simplicity, think of the elements of the popula- tion as scalar values, but they could just as easily be vectors (i.e., multivariate).

• Now suppose that we are interested in some statis- tic T = t(S) as an estimate of the corresponding population parameter θ = t(P).

– θ could be a vector of parameters and T the cor- responding vector of estimates.

– In inference, we are interested in describing the random variable t(S) − t(P) which varies with S.

∗ Find the distribution of t(S) − t(P).

∗ How do we describe the distribution of t(S) − t(P)?

∗ Chebyschev’s inequality, Asymptotic analysis, ...

Essential idea of the nonparametric bootstrap is as fol- lows:

• Draw a sample of size n from among the elements of S, sampling with replacement.

Call the resulting bootstrap sample S_{1}^{∗} = {X_{11}^{∗} , X_{12}^{∗} , . . . , X_{1n}^{∗} }.

• In effect, we are treating the sample S as an esti-
mate of the population P; that is, each element X_{i}
of S is selected for the bootstrap sample with prob-
ability 1/n, mimicking the original selection of the
sample S from the population P.

• Repeat this procedure a large number of times, P,

selecting many bootstrap samples; the bth such boot-
strap sample is denoted S_{b}^{∗} = {X_{b1}^{∗} , X_{b2}^{∗} , . . . , X_{bn}^{∗} }.

The population is to the sample as the sample is to the bootstrap samples.

• Compute the statistic T for each of the bootstrap
samples; that is T_{b}^{∗} = t(S_{b}^{∗}).

– Let B denote the number of times on resampling.

– In theory, we can determine the distribution of
T^{∗} − T when B → ∞.

– We are doing simulation essentially.

• Suppose that we observe the sample
X = X_{1}, . . . , X_{n} → F (θ),^{iid}

statistic

θ = θ(Xˆ _{1}, . . . , X_{n}) = θ(X).

– Denote the empirical distribution by
F_{n}(x) = #{X_{i} ≤ x}

n .

– Think of the parameter θ = θ(F ) and the statistic
θ as functionals ˆˆ θ(F_{n}).

– The idea of bootstrap is to exploit the analogy
θ(F_{n}) − θ(F ) : θ(F_{n}^{∗}) − θ(F_{n})

where F_{n}^{∗} denotes the empirical distribution of a
sample from F_{n}.

• How do we find the sampling distribution of ¯X −
E(X) when X_{1}, . . . , X_{n} are from exponential distri-
bution?

– How do I utilize the information available to me?

– Think of parametric bootstrap.

• In the parametric bootstrap, the distribution func- tion of the population of interest, F , is assumed known up to a finite set of unknown parameters θ.

– ˆF is F with θ replaced by its sample estimate (of some kind).

• Algorithm of parametric bootstrap:

– Obtain estimates of the parameters that charac- terize the distribution within the assumed family.

– Generate B random samples each of size n from
the estimated distribution, and for each sample,
compute an estimator T_{b}^{∗} of the same functional

– The distribution of the T_{b}^{∗}’s is used to make in-
ferences about the distribution of T .

• Assume that the distribution of T_{b}^{∗} around the orig-
inal estimate T is analogous to the sampling dis-
tribution of the estimator T around the population
parameter θ.

– Consider the problem of correcting the bias of T as an estimate of θ.

∗ Let B denote the number of times of resam- pling.

∗ The average of the bootstrapped statistics is
T¯^{∗} = ˆE^{∗}(T^{∗}) =

P_{B}

b=1 T_{b}^{∗}

R .

∗ Bootstrap estimate of the bias would be ¯T ^{∗}− T .

∗ Recall that the bias is E(T ) − θ.

∗ Improve the estimate T by T − ( ¯T ^{∗} − T ).

– How do we estimates the sampling variance or sampling distribution of T ?

Note that the random selection of bootstrap samples is not an essential aspect of the nonparametric boot- strap:

• At least in principle, we could enumerate all boot-
strap samples of size n. Then we could calculate
E^{∗}(T^{∗}) and V ar^{∗}(T^{∗}) exactly, rather than having to
estimate them.

• The number of bootstrap samples n^{n}, however, is
astronomically large unless n is tiny.

Error in Bootstrap Inference

There are two sources of error in bootstrap inference:

(1) the error induced by using a particular sample S to represent the population

(2) the sampling error produced by failing to enumerate all bootstrap samples.

– This source of error can be controlled by making the number of bootstrap replications R sufficiently large.

• A traditional approach to statistical inference is to make assumptions about the structure of the popu- lation (e.g., an assumption of normality), and, along with the stipulation of random sampling, to use

these assumptions to derive the sampling distribu- tion of T , on which classical inference is based.

– In certain instances, the exact distribution of T may be intractable, and so we instead derive its asymptotic distribution.

– This familiar approach has three potentially im- portant deficiencies:

1. If the assumptions about the population are wrong, then the corresponding sampling distribution of the statistic may be seriously inaccurate.

2. If asymptotic results are relied upon, these may not hold to the required level of accuracy in a relatively small sample.

3. The approach requires sufficient mathematical

prowess to derive the sampling distribution of the statistic of interest. In some cases, such a derivation may be prohibitively difficult.

• Background:

Some of the theory involves functional Taylor ex- pansions.

θ − θ ≈ θˆ ^{0}(F )(F_{n} − F ),

where F_{n} − F can be approximated by a Brownian
bridge.

By the same reasoning,

θˆ^{∗} − ˆθ ≈ θ^{0}(F_{n})(F_{n}^{∗} − F_{n}).

Again, F_{n}^{∗} − F_{n} can be approximated by a Brownian
bridge.

• If this statistics is sufficiently smooth so that the

functional derivatives θ^{0}(F ) ≈ θ^{0}(F_{n}), then the pro-
cedure gets the right SE.

R-programming

In this demonstration, we consider estimating the pa- rameter θ of an exponential distribution.

• Data generation with θ = 2:

options(digits=3) & x<- rexp(20,2) print(theta<- sd(x))

• Parametric bootstrap

lambda<- 1/mean(x) & thetas<- 1:1000 for (i in 1:1000)

thetas[i]<- sd(rexp(30,lambda)) c(mean(thetas),sd(thetas))

quantile(thetas,c(0.025,0.975))

theta-rev(quantile(thetas-theta,c(0.025,0.975)))

• Nonparametric bootstrap thetas2<- 1:1000

for (i in 1:1000)

thetas[i]<- sd(sample(x,repl=T)) c(mean(thetas2),sd(thetas2))

quantile(thetas2,c(0.025,0.975))

theta-rev(quantile(thetas2-theta,c(0.025,0.975)))

• Monte Carlo estimate (knowing the truth) thetas3<- 1:10000 for (i in 1:100000)

thetas3<- sd(rexp(30,2)) c(mean(thetas3),sd(thetas3))

Bootstrap Confidence Intervals

There are several approaches to constructing bootstrap confidence intervals.

• The normal-theory interval assumes that the statis- tic T is normally distributed (which is often approx- imately the case for statistics in sufficiently large samples), and uses the bootstrap estimate of sam- pling variance, and perhaps of bias, to construct a 100(1 − α)-percent confidence interval of the form

θ = (T − ˆB^{∗}) ± z_{1−α}SEˆ ^{∗}(T^{∗}).

Here,

– ˆB^{∗} and ˆSE are the bootstrap estimate of the bias
and standard error of T .

– z_{1−α/2} is the 1 − α/2 quantile of the standard-
normal distribution.

• Bootstrap percentile interval: It is to use the em-
pirical quantiles of T_{b}^{∗} to form a confidence interval
for θ.

T_{(lower)}^{∗} < θ < T_{(upper)}^{∗}
where T^{∗}

(1), T^{∗}

(2), . . . , T^{∗}

(B) are the ordered bootstrap replicates of the statistic; lower = [(B +1)α/2]; upper

= [(B + 1)(1 − α/2)]; and the square brackets indicate rounding to the nearest integer.

– Although they do not artificially assume normal- ity, percentile confidence intervals often do not perform well.

∗ We have a lot of experience with approximate transformations to normality.

∗ Suppose there is a monotonically increasing trans- formation g and a constant τ such that the ran- dom variable

Z = c[g( ˆθ^{∗}) − g(θ)]

has a symmetric distribution about zero.

∗ Let H be the distribution function of Z. Then
G_{ˆ}

θ^{∗}(t) = H(c[g(t) − g(θ)]),
and

θˆ^{∗(1−α)} = g^{−1}(g( ˆθ^{∗}) + z^{(1−α)}/c),
where z^{(1−α)} is the 1 − α quantile of Z.

– How do we employ this concept on vector-valued parameters?

∗ Do we restrict to an ellipsoidal region?

∗ If it is, the shape is determined by the covari- ances of the estimators.

∗ How about the one-sided confidence interval?

• The bootstrap t interval: Determine the following distribution by the bootstrap method.

θˆ^{∗} − ˆθ
s( ˆθ^{∗})

• Bias-corrected, accelerated (or BC_{a}) percentile in-
tervals:

Steps:

– Calculate

z = Φ^{−1}

" P_{B}

b=1 1(T_{b}^{∗} ≤ T )
B + 1

#

where Φ^{−1}(·) is the standard-normal quantile func-
tion.

– If the bootstrap sampling distribution is symmet- ric, and if T is unbiased, then this proportion will be close to 0.5, and the correction factor z will be close to 0.

– Let T_{(−i)} represent the value of T produced when
the ith observation is deleted from the sample;

there are n of these quantities.

Let ¯T represent the average of the T_{(−i)}.

Calculate

a =

P_{n}

i=1[T_{(−i)} − ¯T ]^{3}
6

hP_{n}

i=1(T_{(−i)} − ¯T )^{2}

i_{3/2}.

– With the correction factors z and a in hand, com- pute

a_{1} = Φ

"

z + z − z_{1−α/2}

1 − a(z − z_{1−α/2})

#

a_{2} = Φ

"

z + z + z_{1−α/2}

1 − a(z − z_{1−α/2})

# .

– The values a_{1} and a_{2} are used to locate the end-
points of the corrected percentile confidence in-

terval:

T_{(lower}∗) < θ < T_{(upper}∗)

where lower^{∗} = [Ba_{1}] and upper^{∗} = [Ba_{2}].

When the correction factors a and z are both 0, it corresponds to the (uncorrected) percentile in- terval.

Bootstrapping Regressions
Consider (x_{1}, x_{2}, y)_{i} for i = 1, . . . , n.

y = β_{0} + β_{1}x_{1} + β_{2}x_{2} +
with E() = 0 and V ar() = σ^{2}.

Assume the _{i}s are independent.

• Parametric bootstrap:

– Obtain ˆβ_{0}, ˆβ_{1}, ˆβ_{2}

– Sample ^{∗}_{i} ∼ iid N(0,σ^{2})

– Take y_{i}^{∗} = ˆβ_{0} + ˆβ_{1}x_{i1} + ˆβ_{2}x_{i2} + ^{∗}_{i}

– Obtain ˆβ_{0}^{∗}, ˆβ_{1}^{∗}, ˆβ_{2}^{∗} and repeat many times

• Resampling residuals:

– Obtain ˆβ , ˆβ , and ˆβ .

– Calculate residuals ^{∗}_{i} = y_{i} − ˆβ_{0} − ˆβ_{1}x_{i1} − ˆβ_{2}x_{i2}

– Sample ^{∗}_{i} by drawing with replacement from {ˆ_{i}}
– Add ^{∗}_{i} to ˆβ_{0} + ˆβ_{1}x_{i1} + ˆβ_{2}x_{i2}.

– Obtain ˆβ_{0}^{∗}, ˆβ_{1}^{∗}, ˆβ_{2}^{∗} and repeat many times.

– It assumes that the distribution of is the same in all regions of the model.

– Better efficiency but is not robust to getting the wrong model.

• Resampling cases:

– Sample (x^{∗}_{i1}, x^{∗}_{i2}, y_{i}^{∗})_{i} by drawing with replacement
from (x_{i1}, x_{i2}, y)_{i}.

– Obtain ˆβ_{0}^{∗}, ˆβ_{1}^{∗}, ˆβ_{2}^{∗} and repeat many times.

– It is less efficient but it preserves the relationship

between Y and (X_{1}, X_{2}) is better. (Think of the
case that the variance is not homogeneous.)

Resampling Censored Data

In many practical settings, data is censored and so the usual bootstrap is not applicable.

• Random right-censored data:

Such data is typically comprised of the bivariate ob-
servations (Y_{i}, D_{i}) where

Y_{i} = min(X_{i}, C_{i}) D_{i} = I(X_{i} ≤ C_{i})

where X_{i} ∼ F and C_{i} ∼ G independently and I(A)
is the indicator function of the event A.

– Example: remission times for patients with a type of leukaemia

∗ The patients were divided into those who re- ceived maintenance chemotherapy and those who did not.

∗ We are interested in the median remission time for the two groups.

– survfit: Computes an estimate of a survival curve for censored data using either the Kaplan-Meier or the Fleming-Harrington method or computes the predicted survivor function for a Cox proportional hazards model.

– data(aml, package="boot")

fit<- survfit(Surv(time,group)~cens,data=aml) plot(fit)

• Algorithm 1:

– Nonparametric estimates of F and G are given by
the Kaplan-Meier estimates ˆF and ˆG, the latter
being obtained by replacing d_{i} by 1 − d_{i}.

– θ = F_{T}^{−1}(0.5) − F_{C}^{−1}(0.5)

– Sampling X_{1}^{∗}, . . . , X_{n}^{∗} from ˆF and independently
sampling C_{1}^{∗}, . . . , C_{n}^{∗} from ˆG.

– (Y_{i}^{∗}, D_{i}^{∗}) can then be found from (X_{i}^{∗}, C_{i}^{∗}) in the
same way as for the original data.

– Efron (1981) showed that this is identical to re- sampling with replacement from the original pairs.

aml.fun<- function(data){

surv<-survfit(Surv(time,group)~cens,data=data) out<- NULL

st <- 1

for (s in 1:length(surv$strata)) { inds <- st:(st+surv$strata[s]-1)

md<-min(surv$time[inds[1-surv$surv[inds]>=0.5]])

st <- st+surv$strata[s]

out<- c(out,md) }

}

aml.case<- censboot(aml,aml.fun,R=499,strata

=aml$group)

– Refer to the R function censboot.

• Conditional bootstrap:

– This approach conditions the resampling on the observed censoring pattern since this is, in effect, an ancillary statistic.

– Sample X_{1}^{∗}, . . . , X_{n}^{∗} from ˆF as before.

– If the ith observation is censored then we set

C_{i} = y_{i} and if it is not censored we sample an
observation from the estimated conditional distri-
bution of C_{i} given that C_{i} > y_{i}.

– Having thus obtained X_{1}^{∗}, . . . , X_{n}^{∗} and C_{1}^{∗}, . . . , C_{n}^{∗}
we proceed as before.

– Technical problem: Suppose that the maximum
value of y_{1}, . . . , y_{n}, y_{k} say, is a censored observa-
tion. Then X_{k}^{∗} < y_{k} and C_{k}^{∗} = y_{k} so the bootstrap
observation will always be uncensored.

Alternatively, if the the maximum value is uncen- sored, the estimated conditional distribution does not exist.

In order to overcome these problems we add one
extra point to the data set which has an observed
value greater than max{y_{1}, . . . , y_{n}} and has the op-

posite value of the censoring indicator to the max- imum.

– R-program

aml.s1<-survfit(Surv(time,cens)~group,data=aml) aml.s2<-survfit(Surv(time-0.001*cens,1-cens)~1,

data=aml)

aml.cond<-censboot(aml,aml.fun,R=499,strata=

aml$group,F.surv=aml.s1,G.surv=aml.s2, sim="cond")

Bootstrap Hypothesis Tests

It is also possible to use the bootstrap to construct an empirical sampling distribution for a test statistic.

• To be added later on.

Permutation Tests Randomization Methods:

When an hypothesis of interest does not have an obvi- ous test statistic, a randomization test may be useful.

• Compare an observed configuration of outcomes with all possible configurations.

• The randomization procedure does not depend on assumptions about the data generating process, so it is usable in a wide range of applications.

• In most situations the null hypothesis for the test is that all outcomes are equally likely, and the null hy- pothesis is rejected if the observed outcome belongs to a subset that has a low probability under the null

the alternative hypothesis.

• Suppose that we want to test whether the means of two data generating processes are equal.

– The decision would be based on observations of two samples of results using the two treatments.

– There are several statistical tests for this null hy- pothesis, both parametric and nonparametric, that might be used.

– Most tests would use either the differences in the means of the samples, the numbers of observa- tions in each sample that are greater than the overall mean or median, or the overall ranks of the observations in one sample.

– Any of these test statistics could be used as a test

statistic in a randomization test.

Test the difference in the two sample means

Consider two samples, x_{1}, x_{2}, . . . , x_{m} and y_{1}, y_{2}, . . . , y_{n}.
The chosen test statistic is t_{0} = ¯x − y.¯

• Without making any assumptions about the distri- butions of the two populations, the significance of the test statistic (that is, a measure of how extreme the observed difference is) can be estimated by con- sidering all configurations of the observations among the two treatment groups.

– This is done by computing the same test statistic for each possible arrangement of the observations, and then ranking the observed value of the test statistic within the set of all computed values.

– Consider a different configuration of the same set

of observations, y_{1}, x_{2}, . . . , x_{m} and x_{1}, y_{2}, . . . , y_{n} in
which an observation from each set has been in-
terchanged with one from the other set.

The same kind of test statistic, namely the differ-
ence in the sample means, is computed. Let t_{1} be
the value of the test statistic for this combination.

– Consider all possible different configurations, in which other values of the original samples have been switched.

Compute the test statistic. Continuing this way through the full set of x’s, we would eventually obtain C(n + m, n) different configurations, and a value of the test statistic for each one of these artificial samples.

ization of a random sample from that distribution under the null hypothesis.

– The empirical significance of the value correspond- ing to the observed configuration could then be computed simply as the rank of the observed value in the set of all values.

– Because there may be a very large number of all possible configurations, we may wish to sample randomly from the possible configurations rather than considering them all.

– When a sample of the configurations is used, the test is sometimes called an approximate random- ization test.

• Randomization tests have been used on small data

sets for a long time. Refer to Fisher’s famous lady tasting tea experiment in which a randomization test is used.

• Refer to Fisher (1935). Because such tests can require very extensive computations, their use has been limited until recently.

• Fisher’s randomization test: Fisher (1935) gave a permutation justification for the usual test for n paired observations.

– In his example (Darwin’s Zea data) y_{i} and d_{i} =|

x_{i} − y_{i} | were real numbers representing plant
height for treated and untreated plants.

– Darwin conducted an experiment to examine the

superiority of cross-fertilized plants over self-fertilized

plants.

∗ 15 pairs of plants were used. Each pair con- sisted of one cross-fertilized plant and one self- fertilized plant which germinated at the same time and grew in the same pot.

∗ The plants were measured at a fixed time after planting and the difference in heights between the cross- and self-fertilized plants are recorded in eighths of an inch.

∗ This data can be found in the package of boot with name darwin.

– Darwin had calculated the mean difference.

– Fisher gave a way of calibrating this by calculating
S_{n} = _{1}d_{1} + · · · + _{n}d_{n}

and considering all 2^{n+1} possible sums = ±1 with
S_{0}.

• Manly, B.F.J. (1997) Randomization, bootstrap and Monte Carlo method in biology, 2nd ed. Chapman &

Hall, London.

The Jackknife

Jackknife methods make use of systematic partitions of a data set to estimate properties of an estimator computed from the full sample.

• Quenouille (1949, 1956) suggested the technique to
estimate (and, hence, reduce) the bias of an esti-
mator ˆθ_{n}.

• Tukey coined the term jackknife to refer to the method, and also showed that the method is useful in esti- mating the variance of an estimator.

• Suppose, we have a random sample, X_{1}, X_{2}, . . . , X_{n},
from which we compute a statistic T as an estimate
of a parameter θ in the population from which the
sample was drawn.

In the jackknife method,

– Partition the given data set into r groups each of size k. (For simplicity, we will assume that the number of observations n is kr.)

– Remove the jth group from the sample, and com-
pute the estimate, T_{−j} from the reduced sample.

– Consider T_{j}^{∗} = rT − (r − 1)T_{−j} which is called pseu-
dovalues. The mean of the pseudo values, J (T ), is
called the jackknife estimator corresponding to T :

J (T ) = rT − (r − 1)

P_{r}

j=1 T_{−j}

r .

– In most applications, it is common to take k = 1 or r = n.

the bias of the estimator T .

• The Jackknife Bias Correction:

– Suppose that we can express the bias of ˆθ_{n} as a
power series in n^{−1}.

θˆ_{n} − θ = a_{1}

n + a_{2}

n^{2} + a_{3}

n^{3} + · · ·

where the numerators are unknowns depending on the real distribution F .

– For J (T ), we have

∞

X

q=1

a_{q}

n^{q} + (n − 1)

∞

X

q=1

a_{q}

n^{q} + θ

−(n − 1)

∞

X

q=1

a_{q}

(n − 1)^{q} + θ

= a_{2}

1

n(n − 1)

+ a_{3} 1

n^{2} − 1

(n − 1)^{2}

– The bias of jackknife estimate J (T ) is in n^{−2}.
– Jackknife gives an estimate of the bias by:

Biasˆ _{jack} = (n − 1)(J (T ) − T ).

• The Jackknife Variance Estimate
V arˆ _{jack} =

P_{r}

j=1(T_{j}^{∗} − T )^{2}
r(r − 1) .

– Monte Carlo studies that it is often conservative;

that is, it often overestimates the variance (see Efron, 1982).

• Refer to Gentle (2002) for higher-order bias correc- tion, the generalized jackknife, and the delete-k jack- knife.

Cross Validation and Model Selection

Cross-validation and bootstrapping are both methods for estimating generalization error based on resampling (Efron and Tibshirani, 1993).

• Cross validation is useful in model building.

• In regression model building the standard problem is, given a set of potential regressors, choose a rel- atively small subset that provide a good fit to the data.

– Standard techniques include stepwise regression and all best subsets.

– If all the data are used in fitting the model, how- ever, we have no method to validate the model.

– A simple method to select potential regressors is to divide the sample into half, to fit the model us- ing one half, and to check the fit using the second half.

∗ The regressors to include in the model can be based on comparisons of the predictions made for the second half with the actual data.

– Instead of dividing the sample into half, we could form multiple partial data sets with overlap.

∗ One way would be to leave out just one obser- vation at a time.

∗ The method of variable selection called PRESS, suggested by Allen (1971), does this. (See also Allen, 1974.)

• Cross validation is a common method of selecting smoothing parameters.

– Think of choosing window width for window esti- mate in regression.

• The resulting estimates of generalization error are often used for choosing among various models.

• Apparent Error and True Error:

Consider the problem of predicting Y using some
function of X such that E[Y − g(X)]^{2} is as well as
possible.

– Usually, g(X) is determined be a training sample
(x_{i}, y_{i})’s.

– For a new point, (x_{0}, y_{0}), how well does ˆg(x_{0})

– Let L(y, g) be a measure of the error between an observed value y and the predicted value g(x).

(Usually, L is the square, [y − g(x)]^{2}.)

• For prediction error,

– Recall the residual of sum squares we learned in regression analysis.

∗ Can we use RSS to do model selection?

∗ RSS is typically smaller than the true error be- cause the fit was chosen so as to minimize it.

∗ RSS is the so-called apparent error.

• Define the excess error as the random variable
D(Y, P_{(X,Y )}) = E_{P}

(X,Y )[L(Y_{0}, g(Xˆ _{0}))]−E _{ˆ}

P_{(X,Y )}[L(Y_{0}, g(Xˆ _{0}))],
where ˆP_{(X,Y )} is the estimated cumulative distribu-

tion function of (X, Y ).

– If ˆP_{(X,Y )} is the empirical CDF the density is just
1/n at the sample points, so

E _{ˆ}

P_{(X,Y )}[L(Y_{0}, g(Xˆ _{0}))] = 1
n

X

i

L(Y_{i}, g(Xˆ _{i})).

– This quantity, which is easy to compute, is the apparent error.

• Cross validation methods (and other resampling meth- ods) can be used to estimate the true error.

Cross Validation Consider model selection.

• In k-fold cross-validation, you divide the data into k subsets of (approximately) equal size v.

• Train the model k times, each time leaving out one of the subsets from training (estimating unknown parameters etc.), but using only the omitted subset to compute whatever chosen error criterion.

• If k equals the sample size, this is called leave-one- out.

– Leave-one-out cross-validation is also easily con- fused with jackknifing since both involve omitting each training case in turn.

– Jackknifing can be used to estimate the bias of the training error and hence to estimate the gen- eralization error, but this process is more compli- cated than leave-one-out cross-validation

• Leave-v-out is a more elaborate and expensive ver- sion of cross-validation that involves leaving out all possible subsets of v cases.

• For an insightful discussion of the limitations of cross- validatory choice among several learning methods, see Stone (1977).

– Leave-one-out cross-validation often works well for estimating generalization error for continuous error functions such as the mean squared error, but it may perform poorly for discontinuous er-

ror functions such as the number of misclassified cases.

– In the latter case, k-fold cross-validation is pre- ferred.

– But if k gets too small, the error estimate is pes- simistically biased because of the difference in training-set size between the full-sample analysis and the cross-validation analysis.

– A value of 10 for k is popular for estimating gen- eralization error.

• Refer to Chapter 7.11 of HTF book on bootstrap method.

Shao (1993, JASA) obtained the surprising result that for selecting subsets of inputs in a linear regression, the

probability of selecting the best does not converge to 1 (as the sample size n goes to infinity) for leave-v-out cross-validation unless the proportion v/n approaches 1.

• To obtain an intuitive understanding, let’s review what is generalization error.

• Generalization error can be broken down into three additive parts,

– noise variance

– estimation variance

– squared estimation bias

• Noise variance is the same for all subsets of inputs.

• Bias is nonzero for subsets that are not good, but

that the function to be learned is linear.

Hence the generalization error of good subsets will differ only in the estimation variance.

• The estimation variance is (2p/t)s^{2} where p is the
number of inputs in the subset, t is the training set
size, and s^{2} is the noise variance.

– The best subset is better than other good subsets only because the best subset has (by definition) the smallest value of p.

– But the t in the denominator means that differ- ences in generalization error among the good sub- sets will all go to zero as t goes to infinity.

– Therefore it is difficult to guess which subset is best based on the generalization error even when

t is very large.

• It is well known that unbiased estimates of the gen-
eralization error, such as those based on AIC, F P E,
and C_{p}, do not produce consistent estimates of the
best subset (e.g., see Stone, 1979).

• References:

– Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-validation, JASA 78, 316-331.

– Efron, B. and Tibshirani, R.J. (1997). Improve- ments on cross-validation: The .632+ bootstrap method. JASA 92, 548-560.

– Stone, M. (1977). Asymptotics for and against cross-validation. Biometrika 64, 29-35.