
Monte Carlo methods for Statistical Inference: Resampling

Hung Chen

hchen@math.ntu.edu.tw Department of Mathematics

National Taiwan University

17th March 2004

Meet at NS 104 on Wednesday from 9:10 to 12:00.


Outline

• Introduction

– Nonparametric Bootstrap

– Classical Paradigm on Inference

– Error in Bootstrap Inference

– R-programming

• Applications of Bootstrap

– Confidence Intervals

– Regressions

– Hypothesis Tests

– Censored data

• Resampling Methods

– Permutation Tests

– The Jackknife

– Cross Validation and Model Selection

• References

– Bickel, P. and Freedman, D. (1981). Some asymptotic theory for the bootstrap. Annals of Statistics, 9, 1196-1217.

– Davison, A.C. and Hinkley, D.V. (1997). Bootstrap Methods and their Application. Cambridge: Cambridge University Press.

– DiCiccio, T.J. and Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11, 189-228.

– Efron, B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics, 7, 1-26.

– Efron, B. and Tibshirani, R.J. (1993). An Introduction to the Bootstrap. New York: Chapman and Hall.

– http://www-stat.stanford.edu/∼susan/courses/s208/

– Refer to the December 2002 issue of R News, Resampling Methods in R: The boot Package.


Bootstrapping

Bootstrapping is a general approach to statistical inference based on building a sampling distribution for a statistic by resampling from the data at hand.

• The term bootstrapping, due to Efron (Annals of Statistics, 1979), is an allusion to the expression pulling oneself up by one's bootstraps.

• It uses the sample data as a population from which repeated samples are drawn.

• Two R libraries for bootstrapping are associated with extensive treatments of the subject:

– Efron and Tibshirani's (1993) bootstrap library

– Davison and Hinkley's (1997) boot library

• There are several forms of the bootstrap and, additionally, several other resampling methods that are related to it, such as jackknifing, cross-validation, randomization tests, and permutation tests.

Nonparametric Bootstrapping

Scientific question:

• Why Sample?

– Sampling can provide reliable information at far less cost than a census.

– Samples can be collected more quickly than censuses.

– Sampling can lead to more accurate estimates than censuses (!).

∗ When samples are used, a small number of well-trained interviewers can spend more time getting high-quality responses from a few sampled people.

– With probability sampling, you can quantify the sampling error from a survey.

– Sampling is necessary when a unit must be destroyed to be measured (e.g., breaking apart a Chips Ahoy! cookie to measure the number of chocolate chips).

• Suppose that we draw a sample S = {X1, X2, . . . , Xn} from a population P = {x1, x2, . . . , xN} where N is very much larger than n.

• Abstraction: Suppose that with any design, with or without replacement, the probability of including unit i in the sample is πi (> 0), for i = 1, 2, . . . , N .

– The Horvitz-Thompson (1952) estimator for the population total $X_T$ is defined as

$$\hat{Y}_{HT} = \sum_{i \in S} \frac{x_i}{\pi_i},$$

where S contains the distinct units only and hence the size of S could be less than the number n of units drawn.

– If the sampling is without replacement, the size of S must be n.

– Under sampling with replacement, it can be shown that for a fixed sample size n,

$$\pi_i = 1 - (1 - p_i)^n,$$

where $p_i$ is the probability of selecting the ith unit of the population on each draw.

– Under simple random sampling without replacement, it can be shown that for a fixed sample size n, $\pi_i = n/N$.
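As a rough illustration of the Horvitz-Thompson estimator under the two designs just described, the following R sketch uses a simulated population. The population values, N, n, and the size-proportional draw probabilities are all hypothetical choices made for this example only.

```r
# Horvitz-Thompson estimate of a population total (illustrative sketch).
# The population, N, n and the draw probabilities are hypothetical.
set.seed(1)
N <- 1000                       # population size
n <- 50                         # number of draws
pop <- rgamma(N, shape = 2)     # simulated population values x_1, ..., x_N

# (a) Simple random sampling without replacement: pi_i = n / N
s_wor <- sample(N, n)
ht_wor <- sum(pop[s_wor] / (n / N))

# (b) Sampling with replacement, draw probabilities p_i proportional to size
p <- pop / sum(pop)
draws <- sample(N, n, replace = TRUE, prob = p)
s_wr <- unique(draws)           # S keeps the distinct units only
pi_wr <- 1 - (1 - p[s_wr])^n    # inclusion probabilities 1 - (1 - p_i)^n
ht_wr <- sum(pop[s_wr] / pi_wr)

c(true_total = sum(pop), ht_wor = ht_wor, ht_wr = ht_wr)
```

Under design (a) the estimate reduces to N times the sample mean, which is a quick sanity check on the formula.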

• For simplicity, think of the elements of the population as scalar values, but they could just as easily be vectors (i.e., multivariate).

• Now suppose that we are interested in some statistic T = t(S) as an estimate of the corresponding population parameter θ = t(P).

– θ could be a vector of parameters and T the corresponding vector of estimates.

– In inference, we are interested in describing the random variable t(S) − t(P), which varies with S.

∗ Find the distribution of t(S) − t(P).

∗ How do we describe the distribution of t(S) − t(P)?

∗ Chebyshev's inequality, asymptotic analysis, ...

The essential idea of the nonparametric bootstrap is as follows:

• Draw a sample of size n from among the elements of S, sampling with replacement. Call the resulting bootstrap sample $S^*_1 = \{X^*_{11}, X^*_{12}, \ldots, X^*_{1n}\}$.

• In effect, we are treating the sample S as an estimate of the population P; that is, each element $X_i$ of S is selected for the bootstrap sample with probability 1/n, mimicking the original selection of the sample S from the population P.

• Repeat this procedure a large number of times, B, selecting many bootstrap samples; the bth such bootstrap sample is denoted $S^*_b = \{X^*_{b1}, X^*_{b2}, \ldots, X^*_{bn}\}$. The population is to the sample as the sample is to the bootstrap samples.

• Compute the statistic T for each of the bootstrap samples; that is, $T^*_b = t(S^*_b)$.

– Let B denote the number of resamples.

– In theory, we can determine the distribution of $T^* - T$ as B → ∞.

– We are essentially doing simulation.
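The resampling steps above can be sketched in a few lines of R. This is only a minimal illustration: the data are simulated, and the choice of the median as the statistic and of B = 2000 are assumptions made for the example.

```r
# Nonparametric bootstrap of the sample median (illustrative sketch).
set.seed(2)
x <- rexp(25, rate = 1)         # hypothetical observed sample S
B <- 2000                       # number of bootstrap samples
t_obs <- median(x)              # T = t(S)

# T*_b = t(S*_b): draw n values from S with replacement, recompute the statistic
t_star <- replicate(B, median(sample(x, replace = TRUE)))

# The distribution of T*_b - T stands in for that of T - theta
summary(t_star - t_obs)
sd(t_star)                      # bootstrap standard error of T
```

`sample(x, replace = TRUE)` draws a resample of the same size as `x`, so each element is selected with probability 1/n on every draw.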

• Suppose that we observe the sample $X = (X_1, \ldots, X_n) \overset{iid}{\sim} F(\theta)$ and compute the statistic

$$\hat\theta = \hat\theta(X_1, \ldots, X_n) = \hat\theta(X).$$

– Denote the empirical distribution by

$$F_n(x) = \frac{\#\{X_i \le x\}}{n}.$$

– Think of the parameter θ = θ(F) and the statistic $\hat\theta = \theta(F_n)$ as functionals.

– The idea of the bootstrap is to exploit the analogy

$$\theta(F_n) - \theta(F) \;:\; \theta(F_n^*) - \theta(F_n),$$

where $F_n^*$ denotes the empirical distribution of a sample from $F_n$.

• How do we find the sampling distribution of $\bar X - E(X)$ when $X_1, \ldots, X_n$ are from an exponential distribution?

– How do I utilize the information available to me?

– Think of parametric bootstrap.

• In the parametric bootstrap, the distribution function of the population of interest, F, is assumed known up to a finite set of unknown parameters θ.

– $\hat F$ is F with θ replaced by its sample estimate (of some kind).

• Algorithm of the parametric bootstrap:

– Obtain estimates of the parameters that characterize the distribution within the assumed family.

– Generate B random samples, each of size n, from the estimated distribution, and for each sample compute an estimator $T^*_b$ of the same functional form as the original estimator T.

– The distribution of the $T^*_b$'s is used to make inferences about the distribution of T.

• Assume that the distribution of $T^*_b$ around the original estimate T is analogous to the sampling distribution of the estimator T around the population parameter θ.

– Consider the problem of correcting the bias of T as an estimate of θ.

∗ Let B denote the number of resamples.

∗ The average of the bootstrapped statistics is

$$\bar T^* = \hat E^*(T^*) = \frac{\sum_{b=1}^{B} T^*_b}{B}.$$

∗ The bootstrap estimate of the bias would be $\bar T^* - T$.

∗ Recall that the bias is E(T) − θ.

∗ Improve the estimate T by $T - (\bar T^* - T)$.

– How do we estimate the sampling variance or sampling distribution of T?
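A small R sketch may help make the bias correction concrete, and it also shows one common answer to the variance question: use the sample variance of the bootstrap replicates. The statistic chosen here (the plug-in variance, which divides by n and is biased downward) and the simulated data are assumptions for illustration.

```r
# Bootstrap bias correction and variance estimate (illustrative sketch).
set.seed(3)
x <- rnorm(15)                                    # hypothetical sample
plug_in_var <- function(v) mean((v - mean(v))^2)  # biased: divides by n
B <- 2000
t_obs  <- plug_in_var(x)
t_star <- replicate(B, plug_in_var(sample(x, replace = TRUE)))

bias_hat    <- mean(t_star) - t_obs   # bootstrap estimate of the bias
t_corrected <- t_obs - bias_hat       # T - (mean of T* - T)
var_hat     <- var(t_star)            # bootstrap estimate of Var(T)
c(T = t_obs, bias = bias_hat, corrected = t_corrected, se = sqrt(var_hat))
```

Because the plug-in variance underestimates on average, the estimated bias should come out negative and the corrected value should be larger than T.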

Note that the random selection of bootstrap samples is not an essential aspect of the nonparametric bootstrap:

• At least in principle, we could enumerate all bootstrap samples of size n. Then we could calculate $E^*(T^*)$ and $Var^*(T^*)$ exactly, rather than having to estimate them.

• The number of bootstrap samples, $n^n$, however, is astronomically large unless n is tiny.
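For a tiny sample, full enumeration is feasible and can be sketched as below. The three data values are hypothetical; with n = 3 there are only 27 ordered bootstrap samples, each equally likely.

```r
# Exact bootstrap by full enumeration, feasible only for tiny n (sketch).
x <- c(2.1, 3.4, 5.0)                          # hypothetical sample, n = 3
n <- length(x)
idx <- expand.grid(rep(list(seq_len(n)), n))   # all n^n index vectors
t_all <- apply(idx, 1, function(i) mean(x[i])) # T* for every bootstrap sample

nrow(idx)                                      # 27 = 3^3 equally likely samples
e_star <- mean(t_all)                          # exact E*(T*)
v_star <- mean((t_all - e_star)^2)             # exact Var*(T*)
c(E = e_star, Var = v_star)
```

For the sample mean, the exact bootstrap expectation equals the sample mean and the exact bootstrap variance equals the plug-in variance divided by n, so no simulation is needed in this special case.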

Error in Bootstrap Inference

There are two sources of error in bootstrap inference:

(1) the error induced by using a particular sample S to represent the population

(2) the sampling error produced by failing to enumerate all bootstrap samples.

– This source of error can be controlled by making the number of bootstrap replications B sufficiently large.

• A traditional approach to statistical inference is to make assumptions about the structure of the population (e.g., an assumption of normality) and, along with the stipulation of random sampling, to use these assumptions to derive the sampling distribution of T, on which classical inference is based.

– In certain instances, the exact distribution of T may be intractable, and so we instead derive its asymptotic distribution.

– This familiar approach has three potentially important deficiencies:

1. If the assumptions about the population are wrong, then the corresponding sampling distribution of the statistic may be seriously inaccurate.

2. If asymptotic results are relied upon, these may not hold to the required level of accuracy in a relatively small sample.

3. The approach requires sufficient mathematical prowess to derive the sampling distribution of the statistic of interest. In some cases, such a derivation may be prohibitively difficult.

• Background: Some of the theory involves functional Taylor expansions,

$$\hat\theta - \theta \approx \theta'(F)(F_n - F),$$

where $F_n - F$ can be approximated by a Brownian bridge. By the same reasoning,

$$\hat\theta^* - \hat\theta \approx \theta'(F_n)(F_n^* - F_n).$$

Again, $F_n^* - F_n$ can be approximated by a Brownian bridge.

• If the statistic is sufficiently smooth so that the functional derivatives satisfy $\theta'(F) \approx \theta'(F_n)$, then the procedure gets the right SE.

R-programming

In this demonstration, we draw data from an exponential distribution and study the sample standard deviation as the statistic.

• Data generation with rate 2 (so the true standard deviation is 1/2):

options(digits=3)
x <- rexp(20, 2)
print(theta <- sd(x))

• Parametric bootstrap

lambda <- 1/mean(x)                          # estimated rate
thetas <- numeric(1000)
for (i in 1:1000)
  thetas[i] <- sd(rexp(length(x), lambda))   # resample size matches the data
c(mean(thetas), sd(thetas))
quantile(thetas, c(0.025, 0.975))
theta - rev(quantile(thetas - theta, c(0.025, 0.975)))

• Nonparametric bootstrap

thetas2 <- numeric(1000)
for (i in 1:1000)
  thetas2[i] <- sd(sample(x, replace=TRUE))
c(mean(thetas2), sd(thetas2))
quantile(thetas2, c(0.025, 0.975))
theta - rev(quantile(thetas2 - theta, c(0.025, 0.975)))

• Monte Carlo estimate (knowing the truth)

thetas3 <- numeric(10000)
for (i in 1:10000)
  thetas3[i] <- sd(rexp(length(x), 2))
c(mean(thetas3), sd(thetas3))


Bootstrap Confidence Intervals

There are several approaches to constructing bootstrap confidence intervals.

• The normal-theory interval assumes that the statistic T is normally distributed (which is often approximately the case for statistics in sufficiently large samples), and uses the bootstrap estimate of sampling variance, and perhaps of bias, to construct a 100(1 − α)-percent confidence interval of the form

$$\theta = (T - \hat B^*) \pm z_{1-\alpha/2}\,\widehat{SE}^*(T^*).$$

Here,

– $\hat B^*$ and $\widehat{SE}^*(T^*)$ are the bootstrap estimates of the bias and standard error of T.

– $z_{1-\alpha/2}$ is the 1 − α/2 quantile of the standard normal distribution.

• Bootstrap percentile interval: use the empirical quantiles of $T^*_b$ to form a confidence interval for θ,

$$T^*_{(lower)} < \theta < T^*_{(upper)},$$

where $T^*_{(1)}, T^*_{(2)}, \ldots, T^*_{(B)}$ are the ordered bootstrap replicates of the statistic; lower = [(B + 1)α/2]; upper = [(B + 1)(1 − α/2)]; and the square brackets indicate rounding to the nearest integer.

– Although they do not artificially assume normality, percentile confidence intervals often do not perform well.

∗ We have a lot of experience with approximate transformations to normality.

∗ Suppose there is a monotonically increasing transformation g and a constant c such that the random variable

$$Z = c\,[g(\hat\theta) - g(\theta)]$$

has a symmetric distribution about zero.

∗ Let H be the distribution function of Z. Then the distribution function of $\hat\theta$ is

$$G_{\hat\theta}(t) = H(c\,[g(t) - g(\theta)]),$$

and

$$\hat\theta^*_{(1-\alpha)} = g^{-1}\big(g(\hat\theta) + z_{(1-\alpha)}/c\big),$$

where $z_{(1-\alpha)}$ is the 1 − α quantile of Z.

– How do we employ this concept on vector-valued parameters?

∗ Do we restrict to an ellipsoidal region?

∗ If so, the shape is determined by the covariances of the estimators.

∗ How about the one-sided confidence interval?
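The normal-theory and percentile intervals described above can be sketched in R as follows. The simulated data, the use of the sample mean as T, and B = 999 are assumptions for this example; B = 999 is chosen so that (B + 1)α/2 is a whole number.

```r
# Normal-theory and percentile bootstrap intervals (illustrative sketch).
set.seed(5)
x <- rexp(40, rate = 2)               # hypothetical sample
alpha <- 0.05
B <- 999
t_obs  <- mean(x)
t_star <- replicate(B, mean(sample(x, replace = TRUE)))

# Normal-theory interval: (T - bias) +/- z * SE
bias_hat <- mean(t_star) - t_obs
se_hat   <- sd(t_star)
ci_norm  <- (t_obs - bias_hat) + c(-1, 1) * qnorm(1 - alpha / 2) * se_hat

# Percentile interval: ordered replicates at positions lower and upper
t_sorted <- sort(t_star)
lower <- round((B + 1) * alpha / 2)         # 25 when B = 999
upper <- round((B + 1) * (1 - alpha / 2))   # 975 when B = 999
ci_perc <- t_sorted[c(lower, upper)]

rbind(normal = ci_norm, percentile = ci_perc)
```

For a skewed sample like this one, the percentile interval is typically not symmetric around T, while the normal-theory interval is symmetric by construction.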

• The bootstrap t interval: Determine the distribution of the following quantity by the bootstrap method:

$$\frac{\hat\theta^* - \hat\theta}{s(\hat\theta^*)}.$$

• Bias-corrected, accelerated (or BCa) percentile intervals:

Steps:

– Calculate

$$z = \Phi^{-1}\left[\frac{\sum_{b=1}^{B} 1(T^*_b \le T)}{B + 1}\right],$$

where $\Phi^{-1}(\cdot)$ is the standard normal quantile function.

– If the bootstrap sampling distribution is symmetric, and if T is unbiased, then this proportion will be close to 0.5, and the correction factor z will be close to 0.

– Let $T_{(-i)}$ represent the value of T produced when the ith observation is deleted from the sample; there are n of these quantities. Let $\bar T$ represent the average of the $T_{(-i)}$. Calculate

$$a = \frac{\sum_{i=1}^{n} (\bar T - T_{(-i)})^3}{6\left[\sum_{i=1}^{n} (T_{(-i)} - \bar T)^2\right]^{3/2}}.$$

– With the correction factors z and a in hand, compute

$$a_1 = \Phi\left[z + \frac{z - z_{1-\alpha/2}}{1 - a(z - z_{1-\alpha/2})}\right], \qquad a_2 = \Phi\left[z + \frac{z + z_{1-\alpha/2}}{1 - a(z + z_{1-\alpha/2})}\right].$$

– The values $a_1$ and $a_2$ are used to locate the endpoints of the corrected percentile confidence interval:

$$T^*_{(lower)} < \theta < T^*_{(upper)},$$

where lower = $[B a_1]$ and upper = $[B a_2]$. When the correction factors a and z are both 0, this corresponds to the (uncorrected) percentile interval.
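A compact R sketch of the BCa steps is given below, under several assumptions: the data are simulated, the statistic is the sample mean, and the acceleration uses cubed deviations of the form $\bar T - T_{(-i)}$ (the sign convention of Efron and Tibshirani, 1993). The index guards `max(1, ...)` and `min(B, ...)` are an addition of this sketch to avoid out-of-range positions. In practice, `boot.ci` in the boot package offers a tested implementation.

```r
# BCa percentile interval computed by hand (illustrative sketch).
set.seed(6)
x <- rexp(30, rate = 1)          # hypothetical sample
stat <- function(v) mean(v)
n <- length(x); B <- 1999; alpha <- 0.05
t_obs  <- stat(x)
t_star <- sort(replicate(B, stat(sample(x, replace = TRUE))))

# Bias-correction factor z
z <- qnorm(sum(t_star <= t_obs) / (B + 1))

# Acceleration a from the leave-one-out values T_(-i)
t_jack <- sapply(seq_len(n), function(i) stat(x[-i]))
d <- mean(t_jack) - t_jack
a <- sum(d^3) / (6 * sum(d^2)^1.5)

# Adjusted percentile levels a1, a2 and the interval endpoints
zq <- qnorm(1 - alpha / 2)
a1 <- pnorm(z + (z - zq) / (1 - a * (z - zq)))
a2 <- pnorm(z + (z + zq) / (1 - a * (z + zq)))
lower <- max(1, round(B * a1))   # guard against index 0
upper <- min(B, round(B * a2))   # guard against index B + 1
c(lower = t_star[lower], upper = t_star[upper])
```

Setting both `z` and `a` to 0 in this code recovers the levels α/2 and 1 − α/2, that is, the uncorrected percentile interval.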

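Bootstrapping Regressions

Consider $(x_1, x_2, y)_i$ for i = 1, . . . , n, with

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon,$$

where E(ε) = 0 and Var(ε) = σ². Assume the $\epsilon_i$'s are independent.

• Parametric bootstrap:

– Obtain $\hat\beta_0, \hat\beta_1, \hat\beta_2$.

– Sample $\epsilon^*_i \sim$ iid $N(0, \hat\sigma^2)$, where $\hat\sigma^2$ is the estimated error variance.

– Take $y^*_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \epsilon^*_i$.

– Obtain $\hat\beta^*_0, \hat\beta^*_1, \hat\beta^*_2$ from the $y^*_i$ and repeat many times.

• Resampling residuals:

– Obtain $\hat\beta_0$, $\hat\beta_1$, and $\hat\beta_2$.

– Calculate residuals $\hat\epsilon_i = y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \hat\beta_2 x_{i2}$.

– Sample $\epsilon^*_i$ by drawing with replacement from $\{\hat\epsilon_i\}$.

– Add $\epsilon^*_i$ to $\hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2}$ to obtain $y^*_i$.

– Obtain $\hat\beta^*_0, \hat\beta^*_1, \hat\beta^*_2$ and repeat many times.

– It assumes that the distribution of ε is the same in all regions of the model.

– It has better efficiency but is not robust to getting the wrong model.

• Resampling cases:

– Sample $(x^*_{i1}, x^*_{i2}, y^*_i)$ by drawing with replacement from the observed cases $(x_{i1}, x_{i2}, y_i)$.

– Obtain $\hat\beta^*_0, \hat\beta^*_1, \hat\beta^*_2$ and repeat many times.

– It is less efficient, but it better preserves the relationship between Y and $(X_1, X_2)$. (Think of the case where the variance is not homogeneous.)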
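The residual-resampling and case-resampling schemes can be compared side by side in R. The sketch below uses simulated data with hypothetical coefficients, and B = 500 is an arbitrary choice; it reports the bootstrap standard errors of the three coefficients under each scheme.

```r
# Residual resampling vs case resampling for a linear model (sketch).
set.seed(7)
n  <- 50
x1 <- runif(n); x2 <- rnorm(n)            # hypothetical regressors
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 0.5)
dat <- data.frame(y, x1, x2)
fit <- lm(y ~ x1 + x2, data = dat)
B <- 500

# Resampling residuals: keep the design fixed, resample fitted errors
res_boot <- t(replicate(B, {
  d <- dat
  d$y <- fitted(fit) + sample(resid(fit), replace = TRUE)
  coef(lm(y ~ x1 + x2, data = d))
}))

# Resampling cases: resample whole rows (x_i1, x_i2, y_i)
case_boot <- t(replicate(B, {
  coef(lm(y ~ x1 + x2, data = dat[sample(n, replace = TRUE), ]))
}))

rbind(residuals = apply(res_boot, 2, sd), cases = apply(case_boot, 2, sd))
```

With homoscedastic errors, as simulated here, the two sets of standard errors should be broadly similar; they would be expected to diverge if the error variance depended on the regressors.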


Resampling Censored Data

In many practical settings, data are censored and so the usual bootstrap is not applicable.

• Random right-censored data: Such data typically comprise the bivariate observations $(Y_i, D_i)$, where

$$Y_i = \min(X_i, C_i), \qquad D_i = I(X_i \le C_i),$$

with $X_i \sim F$ and $C_i \sim G$ independently, and I(A) is the indicator function of the event A.

– Example: remission times for patients with a type of leukaemia

∗ The patients were divided into those who received maintenance chemotherapy and those who did not.

∗ We are interested in the median remission time for the two groups.

– survfit: Computes an estimate of a survival curve for censored data using either the Kaplan-Meier or the Fleming-Harrington method, or computes the predicted survivor function for a Cox proportional hazards model.

– R code:

library(survival)                 # Surv, survfit
data(aml, package="boot")
fit <- survfit(Surv(time, cens) ~ group, data=aml)
plot(fit)

• Algorithm 1:

– Nonparametric estimates of F and G are given by the Kaplan-Meier estimates $\hat F$ and $\hat G$, the latter being obtained by replacing $d_i$ by $1 - d_i$.

– $\theta = F_T^{-1}(0.5) - F_C^{-1}(0.5)$

– Sample $X^*_1, \ldots, X^*_n$ from $\hat F$ and independently sample $C^*_1, \ldots, C^*_n$ from $\hat G$.

– $(Y^*_i, D^*_i)$ can then be found from $(X^*_i, C^*_i)$ in the same way as for the original data.

– Efron (1981) showed that this is identical to resampling with replacement from the original pairs.

library(survival)                 # Surv, survfit
library(boot)                     # censboot, aml
aml.fun <- function(data) {
  surv <- survfit(Surv(time, cens) ~ group, data=data)
  out <- NULL
  st <- 1
  for (s in 1:length(surv$strata)) {
    inds <- st:(st + surv$strata[s] - 1)
    md <- min(surv$time[inds[1 - surv$surv[inds] >= 0.5]])   # group median
    st <- st + surv$strata[s]
    out <- c(out, md)
  }
  out                             # return the vector of group medians
}
aml.case <- censboot(aml, aml.fun, R=499, strata=aml$group)

– Refer to the R function censboot.

• Conditional bootstrap:

– This approach conditions the resampling on the observed censoring pattern since this is, in effect, an ancillary statistic.

– Sample $X^*_1, \ldots, X^*_n$ from $\hat F$ as before.

– If the ith observation is censored, then we set $C^*_i = y_i$; if it is not censored, we sample an observation from the estimated conditional distribution of $C_i$ given that $C_i > y_i$.

– Having thus obtained $X^*_1, \ldots, X^*_n$ and $C^*_1, \ldots, C^*_n$, we proceed as before.

– Technical problem: Suppose that the maximum value of $y_1, \ldots, y_n$, $y_k$ say, is a censored observation. Then $X^*_k < y_k$ and $C^*_k = y_k$, so the bootstrap observation will always be uncensored. Alternatively, if the maximum value is uncensored, the estimated conditional distribution does not exist. In order to overcome these problems, we add one extra point to the data set which has an observed value greater than $\max\{y_1, \ldots, y_n\}$ and has the opposite value of the censoring indicator to the maximum.

– R program:

aml.s1 <- survfit(Surv(time, cens) ~ group, data=aml)
aml.s2 <- survfit(Surv(time - 0.001*cens, 1 - cens) ~ 1, data=aml)
aml.cond <- censboot(aml, aml.fun, R=499, strata=aml$group,
                     F.surv=aml.s1, G.surv=aml.s2, sim="cond")


Bootstrap Hypothesis Tests

It is also possible to use the bootstrap to construct an empirical sampling distribution for a test statistic.

• To be added later on.


Permutation Tests

Randomization Methods:

When a hypothesis of interest does not have an obvious test statistic, a randomization test may be useful.

• Compare an observed configuration of outcomes with all possible configurations.

• The randomization procedure does not depend on assumptions about the data generating process, so it is usable in a wide range of applications.

• In most situations the null hypothesis for the test is that all outcomes are equally likely, and the null hypothesis is rejected if the observed outcome belongs to a subset that has a low probability under the null hypothesis but a relatively higher probability under the alternative hypothesis.

• Suppose that we want to test whether the means of two data generating processes are equal.

– The decision would be based on observations of two samples of results using the two treatments.

– There are several statistical tests for this null hypothesis, both parametric and nonparametric, that might be used.

– Most tests would use either the differences in the means of the samples, the numbers of observations in each sample that are greater than the overall mean or median, or the overall ranks of the observations in one sample.

– Any of these test statistics could be used as a test statistic in a randomization test.

Test the difference in the two sample means

Consider two samples, $x_1, x_2, \ldots, x_m$ and $y_1, y_2, \ldots, y_n$. The chosen test statistic is $t_0 = \bar x - \bar y$.

• Without making any assumptions about the distributions of the two populations, the significance of the test statistic (that is, a measure of how extreme the observed difference is) can be estimated by considering all configurations of the observations among the two treatment groups.

– This is done by computing the same test statistic for each possible arrangement of the observations, and then ranking the observed value of the test statistic within the set of all computed values.

– Consider a different configuration of the same set of observations, $y_1, x_2, \ldots, x_m$ and $x_1, y_2, \ldots, y_n$, in which an observation from each set has been interchanged with one from the other set. The same kind of test statistic, namely the difference in the sample means, is computed. Let $t_1$ be the value of the test statistic for this combination.

– Consider all possible different configurations, in which other values of the original samples have been switched. Compute the test statistic for each. Continuing this way through the full set of x's, we would eventually obtain C(n + m, n) different configurations, and a value of the test statistic for each one of these artificial samples.

– … realization of a random sample from that distribution under the null hypothesis.

– The empirical significance of the value corresponding to the observed configuration could then be computed simply as the rank of the observed value in the set of all values.

– Because there may be a very large number of possible configurations, we may wish to sample randomly from the possible configurations rather than considering them all.

– When a sample of the configurations is used, the test is sometimes called an approximate randomization test.
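An approximate randomization test for the difference in two sample means can be sketched as follows. The two small data vectors are hypothetical, the number of sampled configurations (5000) is arbitrary, and counting the observed configuration in both numerator and denominator is one common convention rather than the only one.

```r
# Approximate randomization test for a difference in means (sketch).
set.seed(8)
x <- c(12.1, 14.3, 11.8, 15.2, 13.9, 14.8)   # hypothetical treatment values
y <- c(10.4, 12.0, 11.1, 13.2, 10.9)         # hypothetical control values
m <- length(x)
pooled <- c(x, y)
t0 <- mean(x) - mean(y)                      # observed statistic

# Sample configurations at random instead of enumerating all C(n + m, n)
n_perm <- 5000
t_perm <- replicate(n_perm, {
  idx <- sample(length(pooled), m)           # group labels reassigned at random
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided empirical significance; the observed configuration counts as one
p_value <- (sum(abs(t_perm) >= abs(t0)) + 1) / (n_perm + 1)
p_value
```

With only 11 observations there are C(11, 6) = 462 configurations, so full enumeration would also be feasible here; random sampling is shown because it is what scales to larger samples.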

• Randomization tests have been used on small data sets for a long time. Refer to Fisher's famous lady tasting tea experiment, in which a randomization test is used.

• Refer to Fisher (1935). Because such tests can require very extensive computations, their use has been limited until recently.

• Fisher's randomization test: Fisher (1935) gave a permutation justification for the usual test for n paired observations.

– In his example (Darwin's Zea data), $x_i$ and $y_i$ were real numbers representing plant heights for treated and untreated plants, and $d_i = |x_i - y_i|$.

– Darwin conducted an experiment to examine the superiority of cross-fertilized plants over self-fertilized plants.

∗ 15 pairs of plants were used. Each pair consisted of one cross-fertilized plant and one self-fertilized plant which germinated at the same time and grew in the same pot.

∗ The plants were measured at a fixed time after planting, and the differences in heights between the cross- and self-fertilized plants were recorded in eighths of an inch.

∗ These data can be found in the boot package under the name darwin.

– Darwin had calculated the mean difference.

– Fisher gave a way of calibrating this by calculating

$$S_n = \epsilon_1 d_1 + \cdots + \epsilon_n d_n$$

and considering all $2^n$ possible sums, with $\epsilon_i = \pm 1$, in comparison with the observed value $S_0$.

• Manly, B.F.J. (1997). Randomization, Bootstrap and Monte Carlo Methods in Biology, 2nd ed. London: Chapman & Hall.
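Fisher's sign-flipping calibration for paired differences can be sketched in R by enumerating every sign vector. The eight differences below are hypothetical (they are not Darwin's data), and the small tolerance in the comparison is an addition of this sketch to protect against floating-point rounding.

```r
# Exact sign-flip randomization test for paired differences (sketch).
d <- c(4.9, -2.1, 3.3, 6.0, 1.2, 5.4, -0.8, 2.7)  # hypothetical differences
n <- length(d)
s0 <- sum(d)                                       # observed sum S_0

# All 2^n sign vectors (e_1, ..., e_n) with e_i = +1 or -1
signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), n)))
s_all <- as.vector(signs %*% abs(d))               # S_n for every sign vector

nrow(signs)                                        # 256 = 2^8
p_one_sided <- mean(s_all >= s0 - 1e-9)            # share of sums at least S_0
p_one_sided
```

The same code applied to the `darwin` data from the boot package would enumerate 2^15 sign vectors, which is still small enough to handle exactly.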


The Jackknife

Jackknife methods make use of systematic partitions of a data set to estimate properties of an estimator computed from the full sample.

• Quenouille (1949, 1956) suggested the technique to estimate (and, hence, reduce) the bias of an estimator $\hat\theta_n$.

• Tukey coined the term jackknife to refer to the method, and also showed that the method is useful in estimating the variance of an estimator.

• Suppose we have a random sample, $X_1, X_2, \ldots, X_n$, from which we compute a statistic T as an estimate of a parameter θ in the population from which the sample was drawn. In the jackknife method,

– Partition the given data set into r groups each of size k. (For simplicity, we will assume that the number of observations n is kr.)

– Remove the jth group from the sample, and compute the estimate $T_{-j}$ from the reduced sample.

– Consider $T^*_j = rT - (r-1)T_{-j}$, which are called pseudovalues. The mean of the pseudovalues, J(T), is called the jackknife estimator corresponding to T:

$$J(T) = rT - (r-1)\frac{\sum_{j=1}^{r} T_{-j}}{r}.$$

– In most applications, it is common to take k = 1, so that r = n.

– … the bias of the estimator T.

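• The Jackknife Bias Correction:

– Suppose that we can express the bias of $\hat\theta_n$ as a power series in $n^{-1}$:

$$E(\hat\theta_n) - \theta = \frac{a_1}{n} + \frac{a_2}{n^2} + \frac{a_3}{n^3} + \cdots,$$

where the numerators are unknowns depending on the real distribution F.

– For J(T) with k = 1 (so r = n), we have

$$E[J(T)] - \theta = \sum_{q=1}^{\infty}\frac{a_q}{n^q} + (n-1)\left[\sum_{q=1}^{\infty}\frac{a_q}{n^q} + \theta\right] - (n-1)\left[\sum_{q=1}^{\infty}\frac{a_q}{(n-1)^q} + \theta\right]$$

$$= -a_2\left[\frac{1}{n(n-1)}\right] + a_3\left[\frac{1}{n^2} - \frac{1}{(n-1)^2}\right] + \cdots$$

– The bias of the jackknife estimate J(T) is of order $n^{-2}$.

– The jackknife gives an estimate of the bias by

$$\widehat{Bias}_{jack} = (n-1)(\bar T_{(\cdot)} - T) = T - J(T),$$

where $\bar T_{(\cdot)} = \frac{1}{n}\sum_{j=1}^{n} T_{-j}$ is the average of the leave-one-out estimates.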

• The Jackknife Variance Estimate:

$$\widehat{Var}_{jack} = \frac{\sum_{j=1}^{r}(T^*_j - T)^2}{r(r-1)},$$

where the $T^*_j = rT - (r-1)T_{-j}$ are the pseudovalues.

– Monte Carlo studies show that it is often conservative; that is, it often overestimates the variance (see Efron, 1982).

• Refer to Gentle (2002) for higher-order bias correction, the generalized jackknife, and the delete-k jackknife.
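The delete-one jackknife (k = 1, r = n) can be sketched in R as below. The simulated data and the choice of the plug-in variance as the statistic are assumptions for illustration; the variance estimate follows the formula in the text, centering the pseudovalues at T.

```r
# Delete-one jackknife: estimator, bias and variance estimates (sketch).
set.seed(10)
x <- rexp(20, rate = 1)                     # hypothetical sample
stat <- function(v) mean((v - mean(v))^2)   # plug-in variance (biased)
n <- length(x)
t_full <- stat(x)

t_minus <- sapply(seq_len(n), function(j) stat(x[-j]))  # T_{-j}
pseudo  <- n * t_full - (n - 1) * t_minus               # pseudovalues
j_est   <- mean(pseudo)                                 # J(T)

bias_jack <- (n - 1) * (mean(t_minus) - t_full)         # equals T - J(T)
var_jack  <- sum((pseudo - t_full)^2) / (n * (n - 1))   # formula from the text
c(T = t_full, J = j_est, bias = bias_jack, var = var_jack)
```

For this particular statistic the bias is exactly of the form $a_1/n$, so the jackknife removes it completely and J(T) coincides with the usual unbiased sample variance `var(x)`.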


Cross Validation and Model Selection

Cross-validation and bootstrapping are both methods for estimating generalization error based on resampling (Efron and Tibshirani, 1993).

• Cross validation is useful in model building.

• In regression model building the standard problem is, given a set of potential regressors, to choose a relatively small subset that provides a good fit to the data.

– Standard techniques include stepwise regression and all best subsets.

– If all the data are used in fitting the model, however, we have no method to validate the model.

– A simple method to select potential regressors is to divide the sample into halves, to fit the model using one half, and to check the fit using the second half.

∗ The regressors to include in the model can be based on comparisons of the predictions made for the second half with the actual data.

– Instead of dividing the sample into halves, we could form multiple partial data sets with overlap.

∗ One way would be to leave out just one observation at a time.

∗ The method of variable selection called PRESS, suggested by Allen (1971), does this. (See also Allen, 1974.)
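The leave-one-out idea behind PRESS can be sketched in R in two ways: by explicitly refitting the model n times, and by the standard shortcut for linear least squares in which the leave-one-out residual equals $e_i / (1 - h_{ii})$. The simulated data, the helper names `press_loop` and `press_hat`, and the assumption that the response column is called `y` are all specific to this example.

```r
# PRESS (sum of squared leave-one-out prediction errors), two ways (sketch).
set.seed(11)
n  <- 40
x1 <- rnorm(n); x2 <- rnorm(n)           # hypothetical regressors
y  <- 1 + 2 * x1 + rnorm(n)              # x2 is irrelevant by construction
dat <- data.frame(y, x1, x2)

# Explicit version: refit without observation i, predict observation i
press_loop <- function(formula, data) {
  errs <- sapply(seq_len(nrow(data)), function(i) {
    fit_i <- lm(formula, data = data[-i, ])
    data$y[i] - predict(fit_i, newdata = data[i, ])   # assumes response 'y'
  })
  sum(errs^2)
}

# Shortcut for linear models: leave-one-out residual = e_i / (1 - h_ii)
press_hat <- function(formula, data) {
  fit <- lm(formula, data = data)
  sum((resid(fit) / (1 - hatvalues(fit)))^2)
}

c(loop = press_loop(y ~ x1, dat), shortcut = press_hat(y ~ x1, dat),
  bigger_model = press_hat(y ~ x1 + x2, dat))
```

Unlike the residual sum of squares, PRESS does not automatically decrease when a regressor is added, which is what makes it usable for comparing candidate subsets.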

• Cross validation is a common method of selecting smoothing parameters.

– Think of choosing the window width for a window estimate in regression.

• The resulting estimates of generalization error are often used for choosing among various models.

• Apparent Error and True Error: Consider the problem of predicting Y using some function g of X such that $E[Y - g(X)]^2$ is as small as possible.

– Usually, g(X) is determined by a training sample of $(x_i, y_i)$'s.

– For a new point, $(x_0, y_0)$, how well does $\hat g(x_0)$ predict $y_0$?

– Let L(y, g) be a measure of the error between an observed value y and the predicted value g(x). (Usually, L is the squared error, $[y - g(x)]^2$.)

• For prediction error,

– Recall the residual sum of squares (RSS) we learned in regression analysis.

∗ Can we use RSS to do model selection?

∗ RSS is typically smaller than the true error because the fit was chosen so as to minimize it.

∗ RSS is the so-called apparent error.

• Define the excess error as the random variable

$$D(Y, P_{(X,Y)}) = E_{P_{(X,Y)}}[L(Y_0, \hat g(X_0))] - E_{\hat P_{(X,Y)}}[L(Y_0, \hat g(X_0))],$$

where $\hat P_{(X,Y)}$ is the estimated cumulative distribution function of (X, Y).

– If $\hat P_{(X,Y)}$ is the empirical CDF, the probability mass is just 1/n at each of the sample points, so

$$E_{\hat P_{(X,Y)}}[L(Y_0, \hat g(X_0))] = \frac{1}{n}\sum_{i} L(Y_i, \hat g(X_i)).$$

– This quantity, which is easy to compute, is the apparent error.

• Cross validation methods (and other resampling methods) can be used to estimate the true error.

Cross Validation

Consider model selection.

• In k-fold cross-validation, you divide the data into k subsets of (approximately) equal size v.

• Train the model k times, each time leaving out one of the subsets from training (estimating unknown parameters, etc.), and using only the omitted subset to compute the chosen error criterion.

• If k equals the sample size, this is called leave-one-out.

– Leave-one-out cross-validation is easily confused with jackknifing since both involve omitting each training case in turn.

– Jackknifing can be used to estimate the bias of the training error and hence to estimate the generalization error, but this process is more complicated than leave-one-out cross-validation.

• Leave-v-out is a more elaborate and expensive version of cross-validation that involves leaving out all possible subsets of v cases.

• For an insightful discussion of the limitations of cross-validatory choice among several learning methods, see Stone (1977).

– Leave-one-out cross-validation often works well for estimating generalization error for continuous error functions such as the mean squared error, but it may perform poorly for discontinuous error functions such as the number of misclassified cases.

– In the latter case, k-fold cross-validation is preferred.

– But if k gets too small, the error estimate is pessimistically biased because of the difference in training-set size between the full-sample analysis and the cross-validation analysis.

– A value of 10 for k is popular for estimating generalization error.
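A 10-fold cross-validation for choosing among models can be sketched in R as follows. Everything specific here is an assumption made for illustration: the simulated data, the family of polynomial regressions indexed by degree, and the squared-error criterion. The apparent error is computed alongside for contrast.

```r
# 10-fold cross-validation vs apparent error for polynomial degree (sketch).
set.seed(12)
n <- 100
x <- runif(n, -2, 2)                         # hypothetical predictor
y <- sin(2 * x) + rnorm(n, sd = 0.3)         # hypothetical response
dat <- data.frame(x, y)
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))   # random fold labels

cv_mse <- function(degree) {
  sq_err <- numeric(n)
  for (j in seq_len(k)) {
    test <- fold == j                        # the omitted subset
    fit  <- lm(y ~ poly(x, degree), data = dat[!test, ])
    sq_err[test] <- (dat$y[test] - predict(fit, newdata = dat[test, ]))^2
  }
  mean(sq_err)
}
apparent_mse <- function(degree) {
  mean(resid(lm(y ~ poly(x, degree), data = dat))^2)
}

degrees <- 1:8
rbind(apparent = sapply(degrees, apparent_mse), cv = sapply(degrees, cv_mse))
```

The apparent error can only go down as the degree increases, because the models are nested, whereas the cross-validated error need not, which is why the latter is the one used for choosing among models.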

• Refer to Chapter 7.11 of HTF book on bootstrap method.

Shao (1993, JASA) obtained the surprising result that for selecting subsets of inputs in a linear regression, the probability of selecting the best does not converge to 1 (as the sample size n goes to infinity) for leave-v-out cross-validation unless the proportion v/n approaches 1.

• To obtain an intuitive understanding, let's review what generalization error is.

• Generalization error can be broken down into three additive parts:

– noise variance

– estimation variance

– squared estimation bias

• Noise variance is the same for all subsets of inputs.

• Bias is nonzero for subsets that are not good, but it is zero for all good subsets, since we are assuming that the function to be learned is linear. Hence the generalization error of good subsets will differ only in the estimation variance.

• The estimation variance is $(2p/t)s^2$, where p is the number of inputs in the subset, t is the training set size, and $s^2$ is the noise variance.

– The best subset is better than other good subsets only because the best subset has (by definition) the smallest value of p.

– But the t in the denominator means that differences in generalization error among the good subsets will all go to zero as t goes to infinity.

– Therefore it is difficult to guess which subset is best based on the generalization error even when t is very large.

• It is well known that unbiased estimates of the generalization error, such as those based on AIC, FPE, and Cp, do not produce consistent estimates of the best subset (e.g., see Stone, 1979).

• References:

– Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-validation, JASA 78, 316-331.

– Efron, B. and Tibshirani, R.J. (1997). Improvements on cross-validation: The .632+ bootstrap method. JASA 92, 548-560.

– Stone, M. (1977). Asymptotics for and against cross-validation. Biometrika 64, 29-35.
