
Monte Carlo Methods for Statistical Inference:

Variance Reduction Techniques

Hung Chen

hchen@math.ntu.edu.tw
Department of Mathematics

National Taiwan University

3rd March 2004

Outline

- Numerical Integration
  1. Introduction
  2. Quadrature Integration
  3. Composite Rules
  4. Richardson's Improvement Formula
  5. Improper Integrals
- Monte Carlo Methods
  1. Introduction
  2. Variance Reduction Techniques
  3. Importance Sampling
- References:

  - Lange, K. (1999). Numerical Analysis for Statisticians. Springer-Verlag, New York.
  - Robert, C.P. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer-Verlag.
  - Thisted, R.A. (1996). Elements of Statistical Computing: Numerical Computing. Chapman & Hall.
  - Venables, W.N. and Smith, D.M. An Introduction to R.
    (http://www.ats.ucla.edu/stat/books/#DownloadableBooks)

Monte-Carlo Integration

Integration is fundamental to statistical inference.

- Evaluation of probabilities, means, variances, and mean squared errors can all be thought of as integrals.
- Very often it is not feasible to solve for the integral of a given function via analytical techniques, and alternative methods are adopted.
- The approximate answers are presented in this lecture.

Suppose we wish to evaluate I = ∫ f(x) dx.

- Riemann integral: The definition starts with subintervals Δx_i such that ∪_i Δx_i = [a, b]; the collection {Δx_i} is called a partition P of [a, b].
  - The mesh of this partition is defined to be the largest size of the subintervals, mesh(P) = max_i |Δx_i|.
  - Define a finite sum S_n = Σ_{i=1}^n f(x_i) Δx_i, where x_i ∈ Δx_i is any point.
  - If the quantity lim_{mesh(P) → 0} S_n exists, then it is called the integral of f on [a, b] and is denoted by ∫_a^b f(x) dx.
- This construction demonstrates that any numerical approximation of ∫_a^b f(x) dx will have two features:
  (i) selection of sample points which partition the interval;
  (ii) a finite number of function evaluations at these sample points.

Now consider the problem of evaluating θ = ∫ φ(x) f(x) dx, where f(x) is a density function.

Using the law of large numbers, we can evaluate θ easily.

- Sample X_1, ..., X_n independently from f and form

    θ̂ = (1/n) Σ_{i=1}^n φ(x_i),   Var(θ̂) = (1/n) ∫ [φ(x) - θ]^2 f(x) dx.

- The precision of θ̂ is proportional to 1/√n. In numerical integration, n points can achieve the precision of O(1/n^4).
- Question: We can use the Riemann integral to evaluate definite integrals. Then why do we need Monte Carlo integration?
  - As the number of dimensions d increases, the number of points required to achieve a fair estimate of the integral increases dramatically, i.e., proportionally to n^d.
  - Even when d is small, if the function to be integrated is irregular, it would be inefficient to use the regular methods of integration.

  - It is known that numerical methods become inefficient compared with simulation algorithms for dimension d larger than 4, since the error is then of order O(n^{-4/d}).
  - The intuitive reason behind this phenomenon is that a numerical approach like the Riemann sum method basically covers the whole space with a grid. When the dimension of the space increases, the number of grid points necessary to obtain a given precision increases too.

Numerical Integration

Also called "quadrature," it refers to the evaluation of the integral

  I = ∫_a^b f(x) dx.

It is equivalent to solving for the value I = y(b) of the differential equation

  dy/dx = f(x)

with the boundary condition y(a) = 0.

- When f is a simple function, I can be evaluated easily.
- The underlying idea is to approximate f by a simple function which can be easily integrated on [a, b] and which agrees with f at the sampled points.
- The technique of finding a smooth curve passing through a set of points is also called curve fitting.
  - One way to implement this idea is to sample N + 1 points and find an order-N polynomial passing through those points.
  - The integral of f over that region (containing the N + 1 points) can be approximated by the integral of the polynomial over the same region.
  - Given N + 1 sample points there is a unique polynomial passing through them, though there are several methods to obtain it. We will use Lagrange's method to find this polynomial, which we will call Lagrange's interpolating polynomial.

- If we fit a polynomial to the sample points over the whole interval [a, b], we may end up with a high-order polynomial whose coefficients are difficult to determine.
- Instead, focus on a smaller region of [a, b], say [x_k, x_{k+n}], containing the points x_k, x_{k+1}, ..., x_{k+n}.
- Let P_{k+i} = (x_{k+i}, f(x_{k+i})) be the pairs of sample points and function values; they are called knots.
- Let p_{k,k+n}(x) denote the polynomial of degree less than or equal to n that interpolates P_k, P_{k+1}, ..., P_{k+n}. Now the question becomes:

  Given P_k, ..., P_{k+n}, find the polynomial p_{k,k+n}(x) such that p_{k,k+n}(x_{k+i}) = f(x_{k+i}), 0 ≤ i ≤ n.

To understand the construction of p_{k,k+n}(x), we look at the case n = 0 first.

This gives the so-called extended midpoint rule for finding I.

1. Pick N large.
2. Let x_i = a + (i - 1/2)h for i = 1, ..., N, where h = (b - a)/N.
3. Let f_i = f(x_i).
4. Then I ≈ h Σ_i f_i.

Sample code:

emr <- function(f, a, b, n = 1000) {
  # extended midpoint rule: evaluate f at the n midpoints a + h/2, a + 3h/2, ..., b - h/2
  h <- (b - a) / n
  h * sum(f(seq(a + h/2, b, by = h)))
}

This is the simplest thing to do.
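As a quick check of the sketch above, apply the midpoint rule to a function whose integral is known (an illustrative choice, not from the slides):

emr(sin, 0, pi)   # integral of sin over [0, pi]; returns a value very close to 2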

Extending these ideas to an arbitrary value of n, the polynomial takes the form

  p_{k,k+n}(x) = Σ_{i=0}^n f(x_{k+i}) L_{k+i}(x),

where

  L_{k+i}(x) = Π_{j=0, j≠i}^n (x - x_{k+j}) / (x_{k+i} - x_{k+j}).

Note that L_{k+i}(x) equals 1 at x = x_{k+i}, equals 0 at x = x_{k+j} for j ≠ i, and takes values in between elsewhere.

Therefore, p_{k,k+n}(x) satisfies the requirement that it should pass through the n + 1 knots and is an order-n polynomial.

This leads to

  f(x) ≈ p_{k,k+n}(x) = f(x_k)L_k(x) + f(x_{k+1})L_{k+1}(x) + ... + f(x_{k+n})L_{k+n}(x).

Approximate ∫_{x_k}^{x_{k+n}} f(x) dx by the quantity ∫_{x_k}^{x_{k+n}} p_{k,k+n}(x) dx:

  ∫_{x_k}^{x_{k+n}} f(x) dx ≈ w_k f(x_k) + w_{k+1} f(x_{k+1}) + ... + w_{k+n} f(x_{k+n}),

where

  w_{k+i} = ∫_{x_k}^{x_{k+n}} L_{k+i}(x) dx.

We now calculate these weights to derive a few well-known numerical integration techniques.

- Assume that the sample points x_k, x_{k+1}, ... are uniformly spaced with spacing h > 0.
- Any point x ∈ [x_k, x_{k+n}] can now be represented by x = x_k + sh, where s takes the values 0, 1, 2, ..., n at the sample points and other values in between.
- The weights follow from

    L_{k+i}(x_k + sh) = Π_{j=0, j≠i}^n (s - j) / (i - j),

  so that

    w_{k+i} = h ∫_0^n L_{k+i}(x_k + sh) ds.

  - For n = 1, the weights are given by w_k = h ∫_0^1 (1 - s) ds = h/2 and w_{k+1} = h ∫_0^1 s ds = h/2. The integral value is given by

      ∫_{x_k}^{x_{k+1}} f(x) dx ≈ (h/2)[f(x_k) + f(x_{k+1})].

    This approximation is called the trapezoidal rule, because the value equals the area of the trapezoid formed by the two knots.

  - For n = 2, the weights are given by

      w_k = h ∫_0^2 (1/2)(s^2 - 3s + 2) ds = h/3,
      w_{k+1} = h ∫_0^2 (2s - s^2) ds = 4h/3,
      w_{k+2} = h ∫_0^2 (1/2)(s^2 - s) ds = h/3.

    The integral value is given by

      ∫_{x_k}^{x_{k+2}} f(x) dx ≈ (h/3)[f(x_k) + 4f(x_{k+1}) + f(x_{k+2})].

    This rule is called Simpson's 1/3 rule.

  - For n = 3, we obtain Simpson's 3/8 rule, given by

      ∫_{x_k}^{x_{k+3}} f(x) dx ≈ (3h/8){f(x_k) + 3[f(x_{k+1}) + f(x_{k+2})] + f(x_{k+3})}.

Composite Rules:

Considering the whole interval, we divide it into N sub-intervals of equal width.

- Utilize a sliding window on [a, b] by including only a small number n of these sub-intervals at a time. That is,

    ∫_a^b f(x) dx = Σ_{k=0}^{N/n - 1} ∫_{x_{kn}}^{x_{kn+n}} f(x) dx,

  where each of the integrals on the right side can be approximated using the basic rules derived earlier.
- The summation of basic rules over sub-intervals to obtain an approximation over [a, b] gives rise to composite rules.

- Composite Trapezoidal Rule:

    I ≈ h [ (f(a) + f(b))/2 + Σ_{i=1}^{n-1} f(x_i) ].

  The error in this approximation is given by (1/12) h^2 f''(ξ)(b - a) for ξ ∈ (a, b). (An R sketch of both composite rules appears after this list.)

- Composite Simpson's 1/3 Rule: The number of samples n + 1 should be odd, or the number of intervals n should be even. The integral approximation is given by

    I ≈ (h/3) [ f(a) + f(b) + 4 Σ_{i odd} f(x_i) + 2 Σ_{i even, 0 < i < n} f(x_i) ].

  The error associated with this approximation is (1/180) h^4 f^(4)(ξ)(b - a) for ξ ∈ (a, b).
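The two composite rules above translate directly into short R functions. A minimal sketch (the function names ctrap and csimp are illustrative):

## Composite trapezoidal rule on n equal sub-intervals.
ctrap <- function(f, a, b, n = 1000) {
  h <- (b - a) / n
  x <- seq(a, b, length.out = n + 1)         # n + 1 equally spaced sample points
  h * ((f(a) + f(b)) / 2 + sum(f(x[2:n])))   # endpoints weighted 1/2, interior points weighted 1
}

## Composite Simpson's 1/3 rule (n must be even).
csimp <- function(f, a, b, n = 1000) {
  if (n %% 2 != 0) stop("n must be even")
  h  <- (b - a) / n
  fx <- f(seq(a, b, length.out = n + 1))
  # endpoints + 4 * odd-indexed interior points + 2 * even-indexed interior points
  (h / 3) * (fx[1] + fx[n + 1] + 4 * sum(fx[seq(2, n, by = 2)]) + 2 * sum(fx[seq(3, n - 1, by = 2)]))
}

c(trapezoid = ctrap(sin, 0, pi), simpson = csimp(sin, 0, pi))   # both close to 2; Simpson converges faster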

Richardson's Improvement Formula

Suppose that we use F[h] as an approximation of I computed using spacing h. Then,

  I = F[h] + C h^n + O(h^m),

where C is a constant and m > n. An improvement of F[h] is possible if there is a separation between n and m.

- The idea is to eliminate the error term C h^n by evaluating I for two different values of h and mixing the results appropriately.
- Assume that I is evaluated for two values of h: h_1 and h_2.
- Let h_2 > h_1 and h_2/h_1 = r, where r > 1.

- For the sample spacing h_2 = r h_1,

    I = F[r h_1] + C r^n h_1^n + O(h_1^m).

  We have

    r^n I - I = r^n F[h_1] - F[r h_1] + O(h_1^m).

  Rearranging,

    I = (r^n F[h_1] - F[r h_1]) / (r^n - 1) + O(h_1^m).

  The first term on the right can now be used as an approximation for I, with the error term given by O(h_1^m).
- This removal of C h^n from the error, using two evaluations of F[h] at two different values of h, is called Richardson's Improvement Formula.

This result, when applied to numerical integration, is called Romberg integration.
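A minimal sketch of one Richardson step, assuming the ctrap() function sketched above (whose leading error term is C h^2, so n = 2 here) and r = 2:

## One Richardson extrapolation step applied to the composite trapezoidal rule.
richardson <- function(f, a, b, n = 500, r = 2) {
  F1 <- ctrap(f, a, b, n * r)      # approximation with spacing h1
  F2 <- ctrap(f, a, b, n)          # approximation with spacing r * h1
  (r^2 * F1 - F2) / (r^2 - 1)      # the O(h^2) error term cancels
}
richardson(sin, 0, pi)             # much closer to 2 than either trapezoidal estimate alone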

Improper Integrals

How do we revise the above methods to handle the following cases?

- The integrand goes to a finite limit at finite upper and lower limits, but cannot be evaluated right at one of the limits (e.g., sin x / x at x = 0).
- The upper limit of integration is ∞, or the lower limit is -∞.
- There is a singularity at a limit (e.g., x^{-1/2} at x = 0).

Commonly used techniques:

1. Change of variables

For example, if a > 0 and f(t) → 0 faster than t^{-2} → 0 as t → ∞, then we can use u = 1/t as follows:

  ∫_a^∞ f(t) dt = ∫_0^{1/a} (1/u^2) f(1/u) du.

This also works if b < 0 and the lower limit is -∞ (a small numerical check appears after this list).

Refer to Lange for additional advice and examples.

2. Break up the integral into pieces
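A quick numerical check of the change-of-variables trick, using the emr() midpoint-rule sketch above (the midpoint rule never evaluates the integrand at u = 0); the integrand e^{-t} is an illustrative choice, not from the slides:

## Integral of exp(-t) over [1, Inf) equals exp(-1); after u = 1/t it becomes
## the integral of exp(-1/u)/u^2 over (0, 1].
emr(function(u) exp(-1/u) / u^2, 0, 1)   # approximately 0.3679 = exp(-1)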

Monte-Carlo Method

The main goal in this technique is to estimate the quantity θ, where

  θ = ∫_R g(x) f(x) dx = E[g(X)],

for a random variable X distributed according to the density function f(x). Here g(x) is any function on R such that θ and E[g^2(X)] are bounded.

Classical Monte Carlo approach:

- Suppose that we have tools to simulate independent and identically distributed samples from f(x), call them X_1, X_2, ..., X_n; then one can approximate θ by the quantity

    θ̂_n = (1/n) Σ_{i=1}^n g(X_i).

- The variance of θ̂_n is given by n^{-1} Var(g(X)).

For this approach, the samples from f(x) are generated in an i.i.d. fashion.
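A minimal R sketch of this classical estimator; taking g(x) = x^2 with X standard normal (so θ = 1) is an illustrative choice, not from the slides:

## Plain Monte Carlo estimate of theta = E[g(X)] with i.i.d. draws from f.
g <- function(x) x^2
set.seed(1)
n <- 10000
x <- rnorm(n)                        # i.i.d. samples from f (here standard normal)
theta.hat <- mean(g(x))              # (1/n) * sum of g(X_i)
se.hat    <- sd(g(x)) / sqrt(n)      # estimated standard error, of order 1/sqrt(n)
c(estimate = theta.hat, std.error = se.hat)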

In order to get a good estimate, we need the variance to go to zero, and hence the number of samples must go to infinity.

In practical situations, the sample size never goes to infinity. This raises an interesting question of whether a better estimate can be obtained under a given computing constraint. We now consider three widely used techniques to accomplish this task.

Using Antithetic Variables

This method depends on generating averages from the samples which have negative covariance between them, causing overall variance to go down.

- Let Y_1 and Y_2 be two identically distributed random variables with mean θ. Then,

    Var((Y_1 + Y_2)/2) = Var(Y_1)/2 + Cov(Y_1, Y_2)/2.

  - If Y_1 and Y_2 are independent, then the last term is zero.
  - If Y_1 and Y_2 are positively correlated, then Var((Y_1 + Y_2)/2) > Var(Y_1)/2, which is worse than averaging two independent samples.
  - If they are negatively correlated, the resulting variance is further reduced.

- Question: How can we obtain random variables Y_1 and Y_2 with identical distributions but negative correlation?
- Illustration:
  - Let X_1, X_2, ..., X_n be independent random variables with distribution functions F_1, F_2, ..., F_n.
  - Let g be a monotone function.
  - Using the inverse transform method, the X_i's can be generated according to X_i = F_i^{-1}(U_i), for U_i ~ UNIF[0, 1].
  - Define

      Y_1 = g(F_1^{-1}(U_1), F_2^{-1}(U_2), ..., F_n^{-1}(U_n)).

    Since U and 1 - U are identically distributed and negatively correlated random variables, define

      Y_2 = g(F_1^{-1}(1 - U_1), F_2^{-1}(1 - U_2), ..., F_n^{-1}(1 - U_n)).

  - For a monotone function g, Y_1 and Y_2 are negatively correlated.
  - Utilizing negatively correlated pairs not only reduces the variance of the sample average but also reduces the computation time, as only half the samples need to be generated from UNIF[0, 1].

- Estimate θ by

    θ̃ = (1/2n) Σ_{i=1}^n [g(U_i) + g(1 - U_i)].

  - If f is symmetric around μ, take Y_i = 2μ - X_i.
  - See Geweke (1988) for the implementation of this idea.
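A minimal sketch of the antithetic estimator θ̃ above; the monotone choice g(u) = e^u (so θ = e - 1) is illustrative, not from the slides:

## Antithetic variables for theta = E[g(U)], U ~ UNIF[0,1], with g monotone.
g <- function(u) exp(u)
set.seed(1)
n <- 5000
u <- runif(n)
theta.plain <- mean(g(runif(2 * n)))              # plain Monte Carlo, 2n function evaluations
theta.anti  <- mean((g(u) + g(1 - u)) / 2)        # antithetic estimator, also 2n evaluations
c(plain = theta.plain, antithetic = theta.anti)   # both near e - 1 = 1.718; the antithetic one varies less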


Variance Reduction by Conditioning:

Let Y and Z be two random variables. In general, we have

  Var[E(Y | Z)] = Var(Y) - E[Var(Y | Z)] ≤ Var(Y).

- The two random variables Y and E(Y | Z) have the same mean. Therefore E(Y | Z) is a better random variable to simulate and average when estimating θ.
- How do we find an appropriate Z such that E(Y | Z) has significantly lower variance than Y?
- Example: Estimate π.

  - We can do it by taking V_i = 2U_i - 1, i = 1, 2, and setting I = 1 if V_1^2 + V_2^2 ≤ 1 and I = 0 otherwise, so that E(I) = π/4.
  - Improve the estimate of E(I) by using E(I | V_1). Note that

      E[I | V_1 = v] = P(V_1^2 + V_2^2 ≤ 1 | V_1 = v) = P(V_2^2 ≤ 1 - v^2) = (1 - v^2)^{1/2}.

  - The conditional variance equals

      Var[(1 - V_1^2)^{1/2}] ≈ 0.0498,

    which is smaller than Var(I) ≈ 0.1686.
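A minimal sketch of this conditioning example in R (both estimators target E(I) = π/4, so multiplying by 4 estimates π):

## Variance reduction by conditioning: crude indicator vs. E(I | V1).
set.seed(1)
n  <- 10000
v1 <- runif(n, -1, 1)
v2 <- runif(n, -1, 1)
I    <- as.numeric(v1^2 + v2^2 <= 1)    # crude estimator of pi/4
cond <- sqrt(1 - v1^2)                  # conditional estimator E(I | V1)
c(crude = 4 * mean(I), conditioned = 4 * mean(cond))   # both near pi
c(var.crude = var(I), var.conditioned = var(cond))     # roughly 0.169 vs 0.050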


Variance Reduction using Control Variates:

Estimate θ, which is the expected value of a function g of random variables X = (X_1, X_2, ..., X_n).

- Assume that we know the expected value of another function h of these random variables; call it μ.
- For any constant a, define a random variable W_a according to

    W_a = g(X) + a[h(X) - μ].

- We can utilize the sample averages of W_a to estimate θ, since E(W_a) = θ.
- Observe that

    Var(W_a) = Var(g(X)) + a^2 Var(h(X)) + 2a Cov(g(X), h(X)).

It follows easily that the minimizer of Var(W_a) as a function of a is

  a* = -Cov(g(X), h(X)) / Var(h(X)).

  - Estimate θ by averaging observations of

      g(X) - [Cov(g(X), h(X)) / Var(h(X))] [h(X) - μ].

  - The resulting variance of W_{a*} is given by

      Var(W_{a*}) = Var(g(X)) - [Cov(g(X), h(X))]^2 / Var(h(X)).

- Example: Use the "sample mean" to reduce the variance of the estimate of the "sample median."
  - Find the median of a Poisson random variable X with mean 11.5.

  - Note that μ = 11.5.
  - Modify the usual estimate to

      x̃ - [cov(median, mean) / s^2] (x̄ - 11.5),

    where x̃ and x̄ are the median and mean of the sampled data, and s^2 is the sample variance.
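A minimal sketch of the control variate method; the target E[e^U] with control variate h(U) = U (known mean μ = 1/2) is an illustrative choice, not the Poisson example from the slides:

## Control variate estimator W_a = g(X) + a[h(X) - mu] with estimated optimal a.
set.seed(1)
n  <- 10000
u  <- runif(n)
gu <- exp(u)                          # g(X): we want its mean (e - 1)
hu <- u                               # h(X): control variate with known mean 0.5
a  <- -cov(gu, hu) / var(hu)          # estimated a* = -Cov(g, h) / Var(h)
w  <- gu + a * (hu - 0.5)
c(plain = mean(gu), control.variate = mean(w))    # both near e - 1
c(var.plain = var(gu), var.control = var(w))      # variance drops substantially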

Example on Control Variate Method

In general, suppose there are p control variates W_1, ..., W_p, and Z generally varies with each W_i, i.e.,

  θ̂ = Z - Σ_{i=1}^p β_i [W_i - E(W_i)].

- This amounts to a multiple regression of Z on W_1, ..., W_p.
- How do we find the estimates of the correlation coefficients between Z and the W's?

Importance Sampling

Another technique commonly used for reducing variance in Monte Carlo methods is importance sampling.

Importance sampling differs from the classical Monte Carlo method in that, instead of sampling from f(x), one samples from another density h(x) and computes the estimate of θ using averages of g(x)f(x)/h(x), rather than g(x), evaluated at those samples.

- Rearrange the definition of θ as follows:

    θ = ∫ g(x) f(x) dx = ∫ [g(x) f(x) / h(x)] h(x) dx.

  Here h(x) can be any density function as long as the support of h(x) contains the support of f(x).

Sample X_1, X_2, ..., X_n from h(x) and compute the estimate

  θ̂_n = (1/n) Σ_{i=1}^n g(X_i) f(X_i) / h(X_i).

It can be seen that the mean of θ̂_n is θ, and its variance is

  Var(θ̂_n) = (1/n) ( E_h[ (g(X) f(X) / h(X))^2 ] - θ^2 ).

- Recall that the variance associated with the classical Monte Carlo estimator differs in the first term. In that case, the first term is given by E_f[g(X)^2].
- It is possible that a suitable choice of h can reduce the estimator variance below that of the classical Monte Carlo estimator.

- By Jensen's inequality, we have a lower bound on the first term:

    E_h[ g^2(X) f^2(X) / h^2(X) ] ≥ ( E_h[ g(X) f(X) / h(X) ] )^2 = ( ∫ g(x) f(x) dx )^2.

- In practice, for importance sampling, we generally seek a probability density h that is nearly proportional to |g| f, so that g(x)f(x)/h(x) is nearly constant.

- Example taken from Robert and Casella:
  - Let X be a Cauchy random variable with parameters (0, 1), i.e., X is distributed according to the density function

      f(x) = 1 / [π(1 + x^2)],

    and let g(x) = 1(x > 2) be an indicator function.
  - Estimate

      θ = Pr(X > 2) = 1/2 - arctan(2)/π = 0.1476.

  - Method 1:
    - Generate X_1, X_2, ..., X_n as a random sample from f(x).
    - θ̂_n is just the frequency of sampled values larger than 2:

        θ̂_n = (1/n) Σ_{i=1}^n 1(X_i > 2).

    - The variance of this estimator is simply θ(1 - θ)/n, or 0.126/n.
  - Method 2:
    - Utilize the fact that the density f(x) is symmetric around 0, so θ is just half of the probability Pr{|X| > 2}.
    - Generating the X_i's as i.i.d. Cauchy, one can estimate θ by

        θ̂_n = (1/2n) Σ_{i=1}^n 1(|X_i| > 2).

    - The variance of this estimator is θ(1 - 2θ)/(2n), or 0.052/n.

  - Method 3:
    - Write θ as the following integral:

        θ = 1/2 - ∫_0^2 1/[π(1 + x^2)] dx.

    - Generate X_1, X_2, ..., X_n as a random sample from UNIF(0, 2).
    - Define

        θ̂_n = 1/2 - (2/n) Σ_i 1/[π(1 + X_i^2)].

    - Its variance is given by 0.0285/n.

  - Let y = 1/x and write θ as the integral

      θ = ∫_0^{1/2} y^{-2} / [π(1 + y^{-2})] dy = ∫_0^{1/2} 1/[π(1 + y^2)] dy.

    Using i.i.d. samples from UNIF[0, 1/2] and evaluating this integrand at those samples, one can further reduce the estimator variance.

- Importance sampling:
  - Select h so that its support is {x > 2}.
  - For x > 2, the density f(x) = 1/[π(1 + x^2)] is closely matched by h(x) = 2/x^2.
  - Note that the cdf associated with h is 1 - 2/x.
  - Sample X = 2/U with U ~ UNIF(0, 1), and let

      φ(x) = 1(x > 2) f(x)/h(x) = 1/[2π(1 + x^{-2})].

  - With θ̂_h = n^{-1} Σ_i φ(x_i), this is equivalent to the UNIF[0, 1/2] approach above.
  - Var(θ̂_h) ≈ 9.3 × 10^{-5}/n.
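A minimal sketch comparing these estimators in R; all four target θ = Pr(X > 2) ≈ 0.1476 and roughly reproduce the variance ordering discussed above:

## Four estimators of theta = Pr(X > 2) for X ~ Cauchy(0, 1).
set.seed(1)
n  <- 100000
x  <- rcauchy(n)
u2 <- runif(n, 0, 2)                              # for Method 3
u  <- runif(n)                                    # for the importance sampler X = 2/U
t1 <- mean(x > 2)                                 # Method 1: raw frequency
t2 <- mean(abs(x) > 2) / 2                        # Method 2: exploit symmetry
t3 <- 1/2 - mean(2 / (pi * (1 + u2^2)))           # Method 3: uniform on (0, 2)
th <- mean(1 / (2 * pi * (1 + (u / 2)^2)))        # importance sampling with h(x) = 2/x^2
c(method1 = t1, method2 = t2, method3 = t3, importance = th)   # all near 0.1476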
