
Monte Carlo Methods for Statistical Inference:
Variance Reduction Techniques

Hung Chen
hchen@math.ntu.edu.tw
Department of Mathematics
National Taiwan University
3rd March 2004

Outline

• Numerical Integration
  1. Introduction
  2. Quadrature Integration
  3. Composite Rules
  4. Richardson's Improvement Formula
  5. Improper Integrals
• Monte Carlo Methods
  1. Introduction
  2. Variance Reduction Techniques
  3. Importance Sampling
• References:

  – Lange, K. (1999). Numerical Analysis for Statisticians. Springer-Verlag, New York.
  – Robert, C.P. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer-Verlag.
  – Thisted, R.A. (1996). Elements of Statistical Computing: Numerical Computing. Chapman & Hall.
  – Venables, W.N. and Smith, D.M. An Introduction to R.
    (http://www.ats.ucla.edu/stat/books/)

Monte-Carlo Integration

Integration is fundamental to statistical inference.

• Evaluation of probabilities, means, variances, and mean squared errors can all be thought of as integrals.
• Very often it is not feasible to evaluate the integral of a given function analytically, and alternative methods must be adopted.
• Approximate answers are presented in this lecture.

Suppose we wish to evaluate I = ∫ f(x) dx.

• Riemann integral: The definition starts with a collection of sub-intervals Δx_i satisfying ∪_i Δx_i = [a, b]; the collection {Δx_i} is called a partition P of [a, b].
  – The mesh of this partition is defined to be the largest size of the sub-intervals, mesh(P) = max_i |Δx_i|.
  – Define a finite sum, S_n = Σ_{i=1}^n f(x_i) Δx_i, where x_i ∈ Δx_i is any point.
  – If the limit lim_{mesh(P) ↓ 0} S_n exists, then it is called the integral of f on [a, b] and is denoted by ∫_a^b f(x) dx.
• This construction demonstrates that any numerical approximation of ∫_a^b f(x) dx will have two features:
  (i) Selection of sample points which partition the interval
  (ii) A finite number of function evaluations at these sample points

Now consider the problem of evaluating θ = ∫ g(x) f(x) dx, where f(x) is a density function.

Using the law of large numbers, we can evaluate θ easily.

• Sample X_1, ..., X_n independently from f and form

  θ̂ = (1/n) Σ_{i=1}^n g(X_i),

  Var(θ̂) = (1/n) ∫ [g(x) − θ]² f(x) dx.

• The precision of θ̂ is proportional to 1/√n, whereas in one-dimensional numerical integration n points can achieve a precision of O(1/n⁴).
• Question: We can use the Riemann integral to evaluate definite integrals. Then why do we need Monte Carlo integration?
  – As the number of dimensions d increases, the number of points required to achieve a fair estimate of the integral increases dramatically, i.e., proportionally to n^d.
  – Even when d is small, if the function to be integrated is irregular, it is inefficient to use the regular methods of integration.
  – Numerical methods are known to be less efficient than simulation algorithms for dimension d larger than 4, since the numerical error is then of order O(n^{−4/d}).
  – The intuitive reason behind this phenomenon is that a numerical approach like the Riemann sum method basically covers the whole space with a grid. When the dimension of the space increases, the number of grid points necessary to obtain a given precision increases too.

Numerical Integration

Also named "quadrature," it concerns the evaluation of the integral

  I = ∫_a^b f(x) dx.

It is equivalent to solving for the value I = y(b) of the differential equation

  dy/dx = f(x)

with the boundary condition y(a) = 0.

• When f is a simple function, I can be evaluated easily.
• The underlying idea is to approximate f by a simple function which can be easily integrated on [a, b] and which agrees with f at the sampled points.
• The technique of finding a smooth curve passing through a set of points is also called curve fitting.
  – One way to implement this idea is to sample N + 1 points and find an order-N polynomial passing through those points.
  – The integral of f over that region (containing the N + 1 points) can be approximated by the integral of the polynomial over the same region.
  – Given N + 1 sample points there is a unique polynomial passing through them, though there are several methods to obtain it. We will use Lagrange's method to find this polynomial, which we will call Lagrange's interpolating polynomial.
• If we fit a polynomial to the sample points over the whole interval [a, b], we may end up with a high-order polynomial whose coefficients are themselves difficult to determine.
• Instead, focus on a smaller region of [a, b], say [x_k, x_{k+n}], containing the points x_k, x_{k+1}, ..., x_{k+n}.
• Let P_{k+i} = (x_{k+i}, f(x_{k+i})) be the pairs of sampled points and function values; they are called knots.
• Let p_{k,k+n}(x) denote the polynomial of degree less than or equal to n that interpolates P_k, P_{k+1}, ..., P_{k+n}. Now the question becomes:

  Given P_k, ..., P_{k+n}, find the polynomial p_{k,k+n}(x) such that p_{k,k+n}(x_{k+i}) = f(x_{k+i}), 0 ≤ i ≤ n.

To understand the construction of p_{k,k+n}(x), we look at the case n = 0 first.

The case n = 0 gives the so-called extended midpoint rule for finding I:

1. Pick N large.
2. Let x_i = a + (i − 1/2)h for i = 1, ..., N, where h = (b − a)/N.
3. Let f_i = f(x_i).
4. Then I ≈ h Σ_i f_i.

Sample code:

emr <- function(f, a, b, n = 1000) {
  h <- (b - a) / n
  h * sum(f(seq(a + h / 2, b, by = h)))
}

This is the simplest thing to do.

Extending these ideas to an arbitrary value of n, the polynomial takes the form

  p_{k,k+n}(x) = Σ_{i=0}^n f(x_{k+i}) L_{k+i}(x),

where

  L_{k+i}(x) = ∏_{j=0, j≠i}^{n} (x − x_{k+j}) / (x_{k+i} − x_{k+j}).

Note that

  L_{k+i}(x) = 1 if x = x_{k+i};  0 if x = x_{k+j}, j ≠ i;  in between otherwise.

Therefore, p_{k,k+n}(x) satisfies the requirement that it pass through the n + 1 knots, and it is an order-n polynomial:

  f(x) ≈ p_{k,k+n}(x) = f(x_k)L_k(x) + f(x_{k+1})L_{k+1}(x) + ··· + f(x_{k+n})L_{k+n}(x).

Approximate ∫_{x_k}^{x_{k+n}} f(x) dx by the quantity ∫_{x_k}^{x_{k+n}} p_{k,k+n}(x) dx:

  ∫_{x_k}^{x_{k+n}} f(x) dx ≈ w_k f(x_k) + w_{k+1} f(x_{k+1}) + ··· + w_{k+n} f(x_{k+n}),

where

  w_{k+i} = ∫_{x_k}^{x_{k+n}} L_{k+i}(x) dx.

We now calculate these weights to derive a few well-known numerical integration techniques.

• Assume that the sample points x_k, x_{k+1}, ... are uniformly spaced with spacing h > 0.
• Any point x ∈ [x_k, x_{k+n}] can now be represented by x = x_k + sh, where s takes the values 0, 1, ..., n at the sample points and other values in between.
• In this parametrization,

  L_{k+i}(x_k + sh) = ∏_{j=0, j≠i}^{n} (s − j) / (i − j),

  so that

  w_{k+i} = h ∫_0^n L_{k+i}(x_k + hs) ds.

  – For n = 1, the weights are given by w_k = h ∫_0^1 (1 − s) ds = h/2 and w_{k+1} = h ∫_0^1 s ds = h/2. The integral value is given by

    ∫_{x_k}^{x_{k+1}} f(x) dx ≈ (h/2)[f(x_k) + f(x_{k+1})].

  This approximation is called the Trapezoidal rule, because the integral is equal to the area of the trapezoid formed by the two knots.
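The weight derivation above can be checked numerically. Below is a minimal sketch in Python (the lecture's own snippets are in R; the function names here are illustrative) that evaluates w_{k+i} = h ∫_0^n L_{k+i}(x_k + sh) ds with a fine midpoint rule:

```python
def lagrange_basis(i, n):
    """Return L_{k+i} as a function of s, with nodes at s = 0, 1, ..., n."""
    def L(s):
        val = 1.0
        for j in range(n + 1):
            if j != i:
                val *= (s - j) / (i - j)
        return val
    return L

def weight(i, n, h=1.0, steps=20000):
    """w_{k+i} = h * integral_0^n L_{k+i}(x_k + s h) ds, via the midpoint rule."""
    L = lagrange_basis(i, n)
    ds = n / steps
    return h * ds * sum(L((m + 0.5) * ds) for m in range(steps))

# n = 1 recovers the trapezoidal weights h/2, h/2;
# n = 2 recovers the weights h/3, 4h/3, h/3 of Simpson's 1/3 rule.
print([round(weight(i, 1), 4) for i in range(2)])
print([round(weight(i, 2), 4) for i in range(3)])
```

With h = 1, this prints [0.5, 0.5] for n = 1 and [0.3333, 1.3333, 0.3333] for n = 2, matching the weights derived on these slides.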

  – For n = 2, the weights are given by

    w_k = h ∫_0^2 (1/2)(s² − 3s + 2) ds = h/3,
    w_{k+1} = −h ∫_0^2 (s² − 2s) ds = 4h/3,
    w_{k+2} = h ∫_0^2 (1/2)(s² − s) ds = h/3.

  The integral value is given by

    ∫_{x_k}^{x_{k+2}} f(x) dx ≈ (h/3)[f(x_k) + 4f(x_{k+1}) + f(x_{k+2})].

  This rule is called Simpson's 1/3 rule.

  – For n = 3, we obtain Simpson's 3/8 rule, given by

    ∫_{x_k}^{x_{k+3}} f(x) dx ≈ (3h/8){f(x_k) + 3[f(x_{k+1}) + f(x_{k+2})] + f(x_{k+3})}.

Composite Rules:

Considering the whole interval [a, b], we divide it into N sub-intervals of equal width.

• Utilize a sliding window on [a, b] by including only a small number n of these sub-intervals at a time. That is,

  ∫_a^b f(x) dx = Σ_{k=0}^{N/n − 1} ∫_{x_{kn}}^{x_{(k+1)n}} f(x) dx,

  where each of the integrals on the right side can be approximated using the basic rules derived earlier.
• Summing the basic rules over the sub-intervals to obtain an approximation over [a, b] gives rise to composite rules.

• Composite Trapezoidal Rule (with N sub-intervals of width h = (b − a)/N):

  I ≈ h [ (f(a) + f(b))/2 + Σ_{i=1}^{N−1} f(x_i) ].

  The error in this approximation is −(1/12) h² f⁽²⁾(ξ)(b − a) for some ξ ∈ (a, b).

• Composite Simpson's 1/3 Rule: The number of sample points N + 1 should be odd, i.e., the number of sub-intervals should be even. The integral approximation is given by

  I ≈ (h/3) [ f(a) + f(b) + 4 Σ_{i odd} f(x_i) + 2 Σ_{i even} f(x_i) ].

  The error associated with this approximation is −(1/180) h⁴ f⁽⁴⁾(ξ)(b − a) for some ξ ∈ (a, b).
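As a quick check of the composite rules and their error orders, here is a Python sketch (the lecture's own code is in R; names are illustrative) applied to ∫_0^1 e^x dx = e − 1:

```python
import math

def composite_trapezoid(f, a, b, n):
    # I ≈ h [ (f(a) + f(b))/2 + sum_{i=1}^{n-1} f(x_i) ],  h = (b - a)/n
    h = (b - a) / n
    return h * ((f(a) + f(b)) / 2 + sum(f(a + i * h) for i in range(1, n)))

def composite_simpson(f, a, b, n):
    # n must be even; I ≈ (h/3)[f(a) + 4*sum_odd + 2*sum_even + f(b)]
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    odd = sum(f(a + i * h) for i in range(1, n, 2))
    even = sum(f(a + i * h) for i in range(2, n, 2))
    return (h / 3) * (f(a) + 4 * odd + 2 * even + f(b))

exact = math.e - 1  # ∫_0^1 e^x dx
print(abs(composite_trapezoid(math.exp, 0, 1, 100) - exact))  # O(h^2) error
print(abs(composite_simpson(math.exp, 0, 1, 100) - exact))    # O(h^4) error
```

With n = 100 the trapezoidal error is of order 10⁻⁵ while Simpson's is of order 10⁻¹⁰, consistent with the h² and h⁴ error terms above.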

Richardson's Improvement Formula

Suppose that we use F[h] as an approximation of I computed using spacing h. Then

  I = F[h] + Chⁿ + O(h^m),

where C is a constant and m > n. An improvement of F[h] is possible if there is a separation between n and m.

• Eliminate the error term Chⁿ by evaluating I for two different values of h and mixing the results appropriately.
• Assume that F is evaluated at two spacings, h and rh, with r > 1.
• For the sample spacing rh,

  I = F[rh] + C rⁿ hⁿ + O(h^m).

  We then have

  rⁿ I − I = rⁿ F[h] − F[rh] + O(h^m).

  Rearranging,

  I = (rⁿ F[h] − F[rh]) / (rⁿ − 1) + O(h^m).

  The first term on the right can now be used as an approximation for I, with error term O(h^m).
• This removal of Chⁿ from the error, using evaluations of F[h] at two different values of h, is called Richardson's Improvement Formula.

This result, when applied to numerical integration, is called Romberg's Integration.
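The cancellation above is easy to demonstrate. In this Python sketch (illustrative names, not from the lecture), F[h] is the composite trapezoidal rule, whose leading error term is Ch²; one Richardson step with r = 2 then leaves an O(h⁴) error:

```python
import math

def trapezoid(f, a, b, n):
    # composite trapezoidal rule with n sub-intervals; F[h] with h = (b - a)/n
    h = (b - a) / n
    return h * ((f(a) + f(b)) / 2 + sum(f(a + i * h) for i in range(1, n)))

def richardson(f, a, b, n, r=2, order=2):
    # F[h] has error C h^order + O(h^m); combining F[h] and F[rh] cancels it:
    #   I ≈ (r^order F[h] - F[rh]) / (r^order - 1)
    Fh = trapezoid(f, a, b, n * r)   # spacing h
    Frh = trapezoid(f, a, b, n)      # spacing r h
    return (r ** order * Fh - Frh) / (r ** order - 1)

exact = math.e - 1
print(abs(trapezoid(math.exp, 0, 1, 100) - exact))   # O(h^2) error
print(abs(richardson(math.exp, 0, 1, 50) - exact))   # h^2 term cancelled
```

For the trapezoidal rule with r = 2, one Richardson step reproduces composite Simpson's rule; iterating the step over successively halved spacings is exactly Romberg's Integration.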

Improper Integrals

How do we revise the above methods to handle the following cases?

• The integrand goes to a finite limit at finite upper and lower limits, but cannot be evaluated right at one of the limits (e.g., sin x / x at x = 0).
• The upper limit of integration is ∞ or the lower limit is −∞.
• There is a singularity at one of the limits (e.g., x^{−1/2} at x = 0).

Commonly used techniques:

1. Change of variables.
   For example, if a > 0 and f(t) → 0 faster than t^{−2} → 0 as t → ∞, then we can use u = 1/t as follows:

   ∫_a^∞ f(t) dt = ∫_0^{1/a} (1/u²) f(1/u) du.

   This also works if b < 0 and the lower limit is −∞.

2. Break up the integral into pieces.
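The change of variables can be combined with the extended midpoint rule from earlier, which conveniently never evaluates the transformed integrand at u = 0. A Python sketch (illustrative names), using f(t) = 1/(1 + t²) with ∫_1^∞ f(t) dt = π/4:

```python
import math

def f(t):
    return 1.0 / (1.0 + t * t)      # decays like t^{-2}; ∫_1^∞ f = π/4

def integral_a_to_inf(f, a, n=100000):
    # ∫_a^∞ f(t) dt = ∫_0^{1/a} u^{-2} f(1/u) du; the extended midpoint
    # rule avoids evaluating at the awkward endpoint u = 0
    b = 1.0 / a
    h = b / n
    return h * sum((1.0 / (u * u)) * f(1.0 / u)
                   for u in ((i + 0.5) * h for i in range(n)))

print(integral_a_to_inf(f, 1.0))    # ≈ 0.785398 = π/4
```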

Monte-Carlo Method

The main goal of this technique is to estimate the quantity θ, where

  θ = ∫_R g(x) f(x) dx = E[g(X)],

for a random variable X distributed according to the density function f(x); here g(x) is any function on R such that θ and E[g²(X)] are bounded.

Classical Monte Carlo approach:

• Suppose that we have tools to simulate independent and identically distributed samples from f(x), call them X₁, X₂, ..., X_n. Then one can approximate θ by the quantity

  θ̂_n = (1/n) Σ_{i=1}^n g(X_i).

• The variance of θ̂_n is given by n⁻¹ Var(g(X)).

For this approach, the samples from f(x) are generated in an i.i.d. fashion. To get a good estimate, we need the variance to go to zero, and hence the number of samples to go to infinity. In practical situations, however, the sample size never goes to infinity. This raises an interesting question: can a better estimate be obtained under a given computing constraint? We now consider three widely used techniques to accomplish this task.
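The classical estimator and its n⁻¹ Var(g(X)) variance can be sketched in a few lines of Python (the lecture's own code is in R; the toy target E[X²] = 1 for X ~ N(0, 1) is an assumed example, not from the lecture):

```python
import math
import random

def mc_estimate(g, sampler, n):
    """Classical Monte Carlo: average g over n i.i.d. draws from f."""
    xs = [sampler() for _ in range(n)]
    est = sum(g(x) for x in xs) / n
    # estimated standard error: sqrt(Var(g(X)) / n)
    s2 = sum((g(x) - est) ** 2 for x in xs) / (n - 1)
    return est, math.sqrt(s2 / n)

random.seed(0)
# E[X^2] = 1 for X ~ N(0, 1); Var(X^2) = 2, so the standard error is sqrt(2/n)
est, se = mc_estimate(lambda x: x * x, lambda: random.gauss(0, 1), 100000)
print(est, se)
```

The reported standard error shrinks only like 1/√n, which is what motivates the variance reduction techniques that follow.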

Using Antithetic Variables

This method depends on generating averages from samples which have negative covariance between them, causing the overall variance to go down.

• Let Y₁ and Y₂ be two identically distributed random variables with mean θ. Then

  Var((Y₁ + Y₂)/2) = Var(Y₁)/2 + Cov(Y₁, Y₂)/2.

  – If Y₁ and Y₂ are independent, the last term is zero.
  – If Y₁ and Y₂ are positively correlated, then Var((Y₁ + Y₂)/2) > Var(Y₁)/2.
  – If they are negatively correlated, the resulting variance is reduced.

• Question: How do we obtain random variables Y₁ and Y₂ with identical distributions but negative correlation?
• Illustration:
  – Let X₁, X₂, ..., X_n be independent random variables with distribution functions F₁, F₂, ..., F_n.
  – Let g be a monotone function.
  – Using the inverse transform method, the X_i's can be generated according to X_i = F_i⁻¹(U_i), for U_i ~ UNIF[0, 1].
  – Define

    Y₁ = g(F₁⁻¹(U₁), F₂⁻¹(U₂), ..., F_n⁻¹(U_n)).

    Since U and 1 − U are identically distributed and negatively correlated random variables, define

    Y₂ = g(F₁⁻¹(1 − U₁), F₂⁻¹(1 − U₂), ..., F_n⁻¹(1 − U_n)).

  – For a monotone function g, Y₁ and Y₂ are negatively correlated.
  – Utilizing negatively correlated pairs not only reduces the resulting variance of the sample average but also reduces the computation time, as only half the samples need to be generated from UNIF[0, 1].
• Estimate θ by

  θ̃ = (1/2n) Σ_{i=1}^n [g(U_i) + g(1 − U_i)].

  – If f is symmetric around μ, take Y_i = 2μ − X_i.
  – See Geweke (1988) for the implementation of this idea.
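A minimal Python sketch of θ̃ (the monotone toy target g(u) = e^u, with θ = E[e^U] = e − 1, is an assumed example, not from the lecture):

```python
import math
import random

def plain_mc(n):
    """Plain estimate of θ = E[e^U] = e - 1, U ~ UNIF[0, 1]."""
    return sum(math.exp(random.random()) for _ in range(n)) / n

def antithetic_mc(n):
    """θ~ = (1/2n) Σ [g(U_i) + g(1 - U_i)]; g monotone ⇒ Cov(g(U), g(1-U)) < 0."""
    s = 0.0
    for _ in range(n):
        u = random.random()
        s += math.exp(u) + math.exp(1.0 - u)
    return s / (2 * n)

random.seed(1)
print(abs(plain_mc(20000) - (math.e - 1)))       # error of order 1/sqrt(n)
print(abs(antithetic_mc(20000) - (math.e - 1)))  # typically much smaller
```

For this g, the per-pair variance drops from about 0.24 to about 0.004, while only n uniforms are generated for 2n function evaluations.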

Variance Reduction by Conditioning:

Let Y and Z be two random variables. In general, we have

  Var[E(Y | Z)] = Var(Y) − E[Var(Y | Z)] ≤ Var(Y).

• The two random variables Y and E(Y | Z) have the same mean. Therefore E(Y | Z) is a better random variable to simulate and average when estimating θ.
• How do we find an appropriate Z such that E(Y | Z) has significantly lower variance than Y?
• Example: Estimate π.
  – Generate V_i = 2U_i − 1, i = 1, 2, and set

    I = 1 if V₁² + V₂² ≤ 1, and 0 otherwise,

    so that E(I) = π/4.
  – Improve the estimate of E(I) by using E(I | V₁). Note that

    E[I | V₁ = v] = P(V₁² + V₂² ≤ 1 | V₁ = v) = P(V₂² ≤ 1 − v²) = (1 − v²)^{1/2}.

  – The conditional variance equals

    Var[(1 − V₁²)^{1/2}] ≈ 0.0498,

    which is smaller than Var(I) ≈ 0.1686.
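The π example above, sketched in Python (illustrative names; both estimators are scaled by 4 since E(I) = π/4):

```python
import math
import random

def crude_pi(n):
    """4 * mean of I, with I = 1{V1^2 + V2^2 <= 1}, V_i = 2U_i - 1."""
    hits = 0
    for _ in range(n):
        v1, v2 = 2 * random.random() - 1, 2 * random.random() - 1
        if v1 * v1 + v2 * v2 <= 1:
            hits += 1
    return 4 * hits / n

def conditioned_pi(n):
    """4 * mean of E(I | V1) = (1 - V1^2)^{1/2}; a lower-variance summand."""
    s = sum(math.sqrt(1 - (2 * random.random() - 1) ** 2) for _ in range(n))
    return 4 * s / n

random.seed(2)
print(crude_pi(100000), conditioned_pi(100000), math.pi)
```

The conditioned estimator also needs only one uniform per draw instead of two, so the variance reduction (0.0498 versus 0.1686 per summand) comes at reduced cost.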

Variance Reduction using Control Variates:

Estimate θ, the expected value of a function g of random variables X = (X₁, X₂, ..., X_n).

• Assume that we know the expected value of another function h of these random variables; call it μ.
• For any constant a, define a random variable W_a according to

  W_a = g(X) + a[h(X) − μ].

• We can utilize the sample averages of W_a to estimate θ, since E(W_a) = θ.
• Observe that

  Var(W_a) = Var(g(X)) + a² Var(h(X)) + 2a Cov(g(X), h(X)).

  It follows easily that the minimizer of Var(W_a) as a function of a is

  a* = −Cov(g(X), h(X)) / Var(h(X)).

  – Estimate θ by averaging observations of

    g(X) − [Cov(g(X), h(X)) / Var(h(X))] [h(X) − μ].

  – The resulting variance of W_{a*} is given by

    Var(W_{a*}) = Var(g(X)) − [Cov(g(X), h(X))]² / Var(h(X)).

• Example: Use the "sample mean" to reduce the variance of the estimate of the "sample median."
  – Find the median of a Poisson random variable X with mean 11.5.
  – Note that μ = 11.5.
  – Modify the usual estimate to

    x̃ − [ĉov(x̃, x̄) / v̂ar(x̄)] (x̄ − 11.5),

    where x̃ and x̄ are the median and mean of the sampled data, and the covariance and variance of the two statistics are estimated from the simulation.

Example on Control Variate Method

In general, suppose there are p control variates W₁, ..., W_p, and Z generally varies with each W_i, i.e.,

  θ̂ = Z − Σ_{i=1}^p β_i [W_i − E(W_i)].

• The optimal coefficients β_i are those of the multiple regression of Z on W₁, ..., W_p.
• How do we find estimates of the regression coefficients between Z and the W's?

Importance Sampling

Another technique commonly used for reducing variance in Monte Carlo methods is importance sampling.

Importance sampling differs from the classical Monte Carlo method in that instead of sampling from f(x), one samples from another density h(x), and computes the estimate of θ using averages of g(x)f(x)/h(x), instead of g(x), evaluated at those samples.

• Rearrange the definition of θ as follows:

  θ = ∫ g(x) f(x) dx = ∫ [g(x) f(x) / h(x)] h(x) dx.

  Here h(x) can be any density function as long as the support of h(x) contains the support of f(x).

• Sample X₁, ..., X_n from h(x) and compute the estimate

  θ̂_n = (1/n) Σ_{i=1}^n g(X_i) f(X_i) / h(X_i).

  It can be seen that the mean of θ̂_n is θ, and its variance is

  Var(θ̂_n) = (1/n) ( E_h[ (g(X) f(X) / h(X))² ] − θ² ).

• Recall that the variance associated with the classical Monte Carlo estimator differs in the first term: in that case, the first term is given by E_f[g(X)²].
• It is possible that a suitable choice of h can reduce the estimator variance below that of the classical Monte Carlo estimator.

• By Jensen's inequality, we have a lower bound on the first term:

  E_h[ g²(X) f²(X) / h²(X) ] ≥ ( E_h[ g(X) f(X) / h(X) ] )² = ( ∫ g(x) f(x) dx )².

  In practice, for importance sampling, we generally seek a probability density h that is nearly proportional to the integrand |g(x)| f(x).

• Example taken from Robert and Casella:
  – Let X be a Cauchy random variable with parameters (0, 1), i.e., X is distributed according to the density function

    f(x) = 1 / [π(1 + x²)],

    and let g(x) = 1(x > 2) be an indicator function.
  – Estimate

    θ = Pr(X > 2) = 1/2 − (1/π) tan⁻¹(2) ≈ 0.1476.

  – Method 1:
    * Generate X₁, X₂, ..., X_n as a random sample from f(x).
    * θ̂_n is just the frequency of sampled values larger than 2:

      θ̂_n = (1/n) Σ_{i=1}^n 1(X_i > 2).

    * The variance of this estimator is simply θ(1 − θ)/n, or 0.126/n.
  – Method 2:
    * Utilize the fact that the density f(x) is symmetric around 0, so θ is just half of the probability Pr{|X| > 2}.
    * Generating the X_i's as i.i.d. Cauchy, one can estimate θ by

      θ̂_n = (1/2n) Σ_{i=1}^n 1(|X_i| > 2).

    * The variance of this estimator is θ(1 − 2θ)/(2n), or 0.052/n.
  – Method 3:
    * Write θ as the following integral:

      θ = 1/2 − ∫_0^2 1 / [π(1 + x²)] dx.

    * Generate X₁, X₂, ..., X_n as a random sample from UNIF(0, 2).
    * Define

      θ̂_n = 1/2 − (2/n) Σ_i 1 / [π(1 + X_i²)].

    * Its variance is given by 0.0092/n.

  – Let y = 1/x and write θ as the integral

    θ = ∫_0^{1/2} x^{−2} / [π(1 + x^{−2})] dx = ∫_0^{1/2} 1 / [π(1 + x²)] dx.

    Using i.i.d. samples from UNIF[0, 1/2] and evaluating this integrand at them, one can further reduce the estimator variance.

• Importance sampling:
  – Select h so that its support is {x > 2}.
  – For x > 2,

    f(x) = 1 / [π(1 + x²)]

    is closely matched by

    h(x) = 2 / x².

  – Note that the cdf associated with h is 1 − 2/x.
  – Sample X = 2/U, with U ~ UNIF(0, 1), and let

    φ(x) = 1(x > 2) f(x)/h(x) = x² / [2π(1 + x²)] = 1 / [2π(1 + x^{−2})].

  – With θ̂_h = n⁻¹ Σ_i φ(x_i), this is equivalent to the y = 1/x variant of Method 3.
  – Var(θ̂_h) ≈ 9.3 × 10⁻⁵ / n.
