Monte Carlo Methods for Statistical Inference:

Variance Reduction Techniques

Hung Chen

hchen@math.ntu.edu.tw
Department of Mathematics

National Taiwan University

3rd March 2004

Outline

Numerical Integration
1. Introduction
2. Quadrature Integration
3. Composite Rules
4. Richardson's Improvement Formula
5. Improper Integrals

Monte Carlo Methods
1. Introduction
2. Variance Reduction Techniques
3. Importance Sampling

References:

- Lange, K. (1999). Numerical Analysis for Statisticians. Springer-Verlag, New York.
- Robert, C.P. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer-Verlag.
- Thisted, R.A. (1996). Elements of Statistical Computing: Numerical Computing. Chapman & Hall.
- Venables, W.N. and Smith, D.M. An Introduction to R.
  (http://www.ats.ucla.edu/stat/books/#DownloadableBooks)

Monte-Carlo Integration

Integration is fundamental to statistical inference.

Evaluation of probabilities, means, variances, and mean squared errors can all be thought of as integrals.

Very often it is not feasible to evaluate the integral of a given function analytically, and alternative methods must be adopted.

The approximate methods are presented in this lecture.

Suppose we wish to evaluate I = \int f(x)\,dx.

Riemann integral: The definition starts with a collection of sub-intervals \Delta x_1, \ldots, \Delta x_n with \cup_i \Delta x_i = [a, b]; the set \{\Delta x_i\} is called a partition P of [a, b].

- The mesh of this partition is defined to be the largest size of the sub-intervals, mesh(P) = \max_i |\Delta x_i|.

- Define a finite sum

  S_n = \sum_{i=1}^n f(x_i)\,\Delta x_i,

  where x_i \in \Delta x_i is any point.

- If the limit \lim_{mesh(P) \to 0} S_n exists, then it is called the integral of f on [a, b] and is denoted by \int_a^b f(x)\,dx.

This construction demonstrates that any numerical approximation of \int_a^b f(x)\,dx will have two features:

(i) Selection of sample points which partition the interval

(ii) A finite number of function evaluations at these sample points

Now consider the problem of evaluating \theta = \int g(x) f(x)\,dx, where f(x) is a density function.

Using the law of large numbers, we can evaluate \theta easily.

Sample X_1, \ldots, X_n independently from f and form

\hat{\theta} = \frac{1}{n} \sum_{i=1}^n g(X_i), \qquad \mathrm{Var}(\hat{\theta}) = \frac{1}{n} \int [g(x) - \theta]^2 f(x)\,dx.

The precision of \hat{\theta} is proportional to 1/\sqrt{n}, whereas in numerical integration n points can achieve a precision of O(n^{-4}).
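A minimal sketch of this plain Monte Carlo estimator (illustrative Python; the function names are not from the notes):

```python
import random
import statistics

def mc_mean(g, sample_x, n=100_000, seed=0):
    """Plain Monte Carlo: estimate E[g(X)] by an i.i.d. sample average."""
    rng = random.Random(seed)
    vals = [g(sample_x(rng)) for _ in range(n)]
    est = sum(vals) / n
    var_est = statistics.variance(vals) / n  # estimates Var(g(X)) / n
    return est, var_est

# E[X^2] for X ~ UNIF[0, 1] is 1/3; the error shrinks like 1/sqrt(n)
est, var_est = mc_mean(lambda x: x * x, lambda rng: rng.random())
```

The returned `var_est` is the usual plug-in estimate of the estimator's variance, matching the 1/\sqrt{n} precision statement above.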

Question: We can use the Riemann integral to evaluate definite integrals. Then why do we need Monte Carlo integration?

- As the number of dimensions d increases, the number of points required to achieve a fair estimate of the integral increases dramatically, i.e., proportionally to n^d.

- Even when d is small, if the function to be integrated is irregular, it is inefficient to use the regular methods of integration.

- Numerical methods are known to be less efficient than simulation algorithms for dimension d larger than 4, since the error is then of order O(n^{-4/d}).

- The intuitive reason behind this phenomenon is that a numerical approach like the Riemann sum method basically covers the whole space with a grid. When the dimension of the space increases, the number of points on the grid necessary to obtain a given precision increases too.

Numerical Integration

Also named "quadrature," it is related to the evaluation of the integral

I = \int_a^b f(x)\,dx.

It is equivalent to solving for the value I = y(b) of the differential equation

\frac{dy}{dx} = f(x)

with the boundary condition y(a) = 0.

When f is a simple function, I can be evaluated easily.

The underlying idea is to approximate f by a simple function which can be easily integrated on [a, b] and which agrees with f at the sampled points.

The technique of finding a smooth curve passing through a set of points is also called curve fitting.

- One way to implement this idea is to sample N + 1 points and find an order-N polynomial passing through those points.

- The integral of f over that region (containing the N + 1 points) can be approximated by the integral of the polynomial over the same region.

- Given N + 1 sample points there is a unique polynomial of degree at most N passing through these points, though there are several methods to obtain it. We will use Lagrange's method to find this polynomial, which we will call the Lagrange interpolating polynomial.

Fitting a polynomial to the sample points over the whole interval [a, b] may end up with a high-order polynomial whose coefficients are difficult to determine.

Instead, focus on a smaller region of [a, b], say [x_k, x_{k+n}], containing the points x_k, x_{k+1}, \ldots, x_{k+n}.

Let P_{k+i} = (x_{k+i}, f(x_{k+i})) be the pairs of the sampled points and the function values; they are called knots.

Let p_{k,k+n}(x) denote the polynomial of degree less than or equal to n that interpolates P_k, P_{k+1}, \ldots, P_{k+n}. Now the question becomes:

Given P_k, \ldots, P_{k+n}, find the polynomial p_{k,k+n}(x) such that

p_{k,k+n}(x_{k+i}) = f(x_{k+i}), \qquad 0 \le i \le n.

To understand the construction of p_{k,k+n}(x), we look at the case n = 0 first. It yields the so-called extended midpoint rule for finding I.

1. Pick N large.

2. Let x_i = a + (i - 1/2)h for i = 1, \ldots, N, where h = (b - a)/N.

3. Let f_i = f(x_i).

4. Then I \approx h \sum_i f_i.

Sample code:

emr <- function(f, a, b, n = 1000) {
  h <- (b - a) / n
  h * sum(f(seq(a + h/2, b, by = h)))
}

This is the simplest thing to do.

Extending these ideas to an arbitrary value of n, the polynomial takes the form

p_{k,k+n}(x) = \sum_{i=0}^n f(x_{k+i})\,L_{k+i}(x),

where

L_{k+i}(x) = \prod_{j=0,\, j \ne i}^{n} \frac{x - x_{k+j}}{x_{k+i} - x_{k+j}}.

Note that L_{k+i}(x) equals 1 at x = x_{k+i}, equals 0 at x = x_{k+j} for j \ne i, and takes intermediate values elsewhere.

Therefore, p_{k,k+n}(x) satisfies the requirement that it should pass through the n + 1 knots, and it is an order-n polynomial.

This leads to

f(x) \approx p_{k,k+n}(x) = f(x_k)L_k(x) + f(x_{k+1})L_{k+1}(x) + \cdots + f(x_{k+n})L_{k+n}(x).

Approximate \int_{x_k}^{x_{k+n}} f(x)\,dx by the quantity \int_{x_k}^{x_{k+n}} p_{k,k+n}(x)\,dx:

\int_{x_k}^{x_{k+n}} f(x)\,dx \approx w_k f(x_k) + w_{k+1} f(x_{k+1}) + \cdots + w_{k+n} f(x_{k+n}),

where

w_{k+i} = \int_{x_k}^{x_{k+n}} L_{k+i}(x)\,dx.
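These weights can be checked numerically. The sketch below (illustrative Python, not from the notes) evaluates each Lagrange basis polynomial on a fine grid and integrates it with a midpoint rule; for three equispaced nodes with h = 1 it recovers the Simpson 1/3 weights h/3, 4h/3, h/3 derived later:

```python
def lagrange_basis(i, nodes):
    """L_i(x) = prod over j != i of (x - x_j) / (x_i - x_j)."""
    def L(x):
        p = 1.0
        for j, xj in enumerate(nodes):
            if j != i:
                p *= (x - xj) / (nodes[i] - xj)
        return p
    return L

def quadrature_weights(nodes, m=20_000):
    """w_i = integral of L_i over [nodes[0], nodes[-1]], by a fine midpoint rule."""
    a, b = nodes[0], nodes[-1]
    h = (b - a) / m
    ws = []
    for i in range(len(nodes)):
        L = lagrange_basis(i, nodes)
        ws.append(h * sum(L(a + (k + 0.5) * h) for k in range(m)))
    return ws

w = quadrature_weights([0.0, 1.0, 2.0])  # equispaced nodes, spacing h = 1
```

Changing the node list to two or four equispaced points reproduces the trapezoidal and Simpson 3/8 weights in the same way.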

We now calculate these weights to derive a few well-known numerical integration techniques.

Assume that the sample points x_k, x_{k+1}, \ldots are uniformly spaced with spacing h > 0.

Any point x \in [x_k, x_{k+n}] can now be represented by x = x_k + sh, where s takes the values 0, 1, 2, \ldots, n at the sample points and other values in between.

In this parametrization,

L_{k+i}(x_k + sh) = \prod_{j=0,\, j \ne i}^{n} \frac{s - j}{i - j},

so the weight is

w_{k+i} = h \int_0^n L_{k+i}(x_k + sh)\,ds.

- For n = 1, the weights are given by w_k = h \int_0^1 (1 - s)\,ds = h/2 and w_{k+1} = h \int_0^1 s\,ds = h/2. The integral value is given by

  \int_{x_k}^{x_{k+1}} f(x)\,dx \approx \frac{h}{2}\,[f(x_k) + f(x_{k+1})].

  This approximation is called the trapezoidal rule, because the approximation equals the area of the trapezoid formed by the two knots.

- For n = 2, the weights are given by

  w_k = h \int_0^2 \frac{1}{2}(s^2 - 3s + 2)\,ds = \frac{h}{3},
  \qquad
  w_{k+1} = h \int_0^2 (2s - s^2)\,ds = \frac{4h}{3},
  \qquad
  w_{k+2} = h \int_0^2 \frac{1}{2}(s^2 - s)\,ds = \frac{h}{3}.

  The integral value is given by

  \int_{x_k}^{x_{k+2}} f(x)\,dx \approx \frac{h}{3}\,[f(x_k) + 4 f(x_{k+1}) + f(x_{k+2})].

  This rule is called Simpson's 1/3 rule.

- For n = 3, we obtain Simpson's 3/8 rule:

  \int_{x_k}^{x_{k+3}} f(x)\,dx \approx \frac{3h}{8}\, \{ f(x_k) + 3[f(x_{k+1}) + f(x_{k+2})] + f(x_{k+3}) \}.

Composite Rules:

Considering the whole interval [a, b], we divide it into N sub-intervals of equal width and utilize a sliding window on [a, b] covering only a small number n of these sub-intervals at a time. That is,

\int_a^b f(x)\,dx = \sum_{j=0}^{N/n - 1} \int_{x_{jn}}^{x_{jn+n}} f(x)\,dx,

where each of the integrals on the right side can be approximated using the basic rules derived earlier.

Summing the basic rules over the sub-intervals to obtain an approximation over [a, b] gives rise to composite rules.

Composite Trapezoidal Rule:

I \approx h \left( \frac{f(a) + f(b)}{2} + \sum_{i=1}^{n-1} f(x_i) \right).

The error in this approximation is given by \frac{1}{12} h^2 f^{(2)}(\xi)\,(b - a) for some \xi \in (a, b).
Composite Simpson's 1/3 Rule: The number of sample points n + 1 should be odd, or equivalently the number of sub-intervals n should be even. The integral approximation is given by

I \approx \frac{h}{3} \left( f(a) + f(b) + 4 \sum_{i \text{ odd}} f(x_i) + 2 \sum_{i \text{ even},\, 0 < i < n} f(x_i) \right).

The error associated with this approximation is \frac{1}{90} h^4 f^{(4)}(\xi)\,(b - a) for some \xi \in (a, b).
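Both composite rules can be sketched in a few lines (illustrative Python; the earlier sample code in these notes is in R):

```python
import math

def composite_trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n equal sub-intervals."""
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    return h * ((f(a) + f(b)) / 2 + interior)

def composite_simpson(f, a, b, n):
    """Composite Simpson's 1/3 rule; n must be even."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + i * h) for i in range(1, n, 2))   # odd interior points
    s += 2 * sum(f(a + i * h) for i in range(2, n, 2))   # even interior points
    return h / 3 * s

# integral of sin on [0, pi] is exactly 2
t = composite_trapezoid(math.sin, 0.0, math.pi, 100)
s = composite_simpson(math.sin, 0.0, math.pi, 100)
```

With the same 100 sub-intervals, Simpson's error (order h^4) is several orders of magnitude smaller than the trapezoidal error (order h^2), matching the error terms above.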

Richardson's Improvement Formula

Suppose that we use F[h] as an approximation of I computed with spacing h. Then

I = F[h] + C h^n + O(h^m),

where C is a constant and m > n. An improvement of F[h] is possible if there is a separation between n and m.

We eliminate the error term C h^n by evaluating I for two different values of h and mixing the results appropriately.

Assume that I is evaluated for two spacings h_1 and h_2, with h_2 > h_1 and h_2/h_1 = r, where r > 1.

For the sample spacing h_2 = r h_1,

I = F[r h_1] + C r^n h_1^n + O(h_1^m).

Combining this with the expansion at h_1, we have

r^n I - I = r^n F[h_1] - F[r h_1] + O(h_1^m).

Rearranging,

I = \frac{r^n F[h_1] - F[r h_1]}{r^n - 1} + O(h_1^m).

The first term on the right can now be used as an approximation for I, with the error term given by O(h_1^m).

This removal of C h^n from the error, using two evaluations of F[h] at two different values of h, is called Richardson's improvement formula.

This result, when applied to numerical integration, is called Romberg integration.
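For the composite trapezoidal rule the leading error term is C h^2, so with r = 2 one Richardson step gives I \approx (4 F[h] - F[2h])/3. A minimal sketch (illustrative Python; function names are not from the notes):

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule F[h] with h = (b - a)/n."""
    h = (b - a) / n
    return h * ((f(a) + f(b)) / 2 + sum(f(a + i * h) for i in range(1, n)))

def richardson(f, a, b, n, r=2, p=2):
    """One Richardson step: cancel the C h^p error term of F[h]."""
    fine = trapezoid(f, a, b, n)         # spacing h
    coarse = trapezoid(f, a, b, n // r)  # spacing r*h
    return (r**p * fine - coarse) / (r**p - 1)

t = trapezoid(math.exp, 0.0, 1.0, 64)
rich = richardson(math.exp, 0.0, 1.0, 64)
exact = math.e - 1.0
```

The improved estimate is accurate to O(h^4), one Romberg step beyond the plain trapezoidal result.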

Improper Integrals

How do we revise the above methods to handle the following cases?

- The integrand tends to a finite limit at a finite upper or lower limit, but cannot be calculated right at that limit (e.g., \sin x / x at x = 0).

- The upper limit of integration is \infty, or the lower limit is -\infty.

- There is a singularity at a limit of integration (e.g., x^{-1/2} at x = 0).
Commonly used techniques:

1. Change of variables

   For example, if a > 0 and f(t) \to 0 faster than t^{-2} \to 0 as t \to \infty, then we can use u = 1/t as follows:

   \int_a^{\infty} f(t)\,dt = \int_0^{1/a} \frac{1}{u^2}\, f\!\left(\frac{1}{u}\right) du.

   This also works if b < 0 and the lower limit is -\infty. Refer to Lange (1999) for additional advice and examples.

2. Break up the integral into pieces.
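The change of variables in technique 1 can be sketched as follows (illustrative Python, using f(t) = 1/(1 + t^2) and a = 2 as an example, integrated with a midpoint rule):

```python
import math

def integrate_tail(f, a, m=10_000):
    """Approximate the integral of f from a to infinity (a > 0) via u = 1/t:
    it equals the integral of u^{-2} f(1/u) over (0, 1/a]."""
    b = 1.0 / a
    h = b / m
    total = 0.0
    for k in range(m):
        u = (k + 0.5) * h           # midpoint of the k-th cell avoids u = 0
        total += f(1.0 / u) / (u * u)
    return h * total

val = integrate_tail(lambda t: 1.0 / (1.0 + t * t), 2.0)
exact = math.pi / 2 - math.atan(2.0)   # arctan gives the exact tail integral
```

The midpoint rule is a natural pairing here because it never evaluates the transformed integrand at u = 0.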

Monte-Carlo Method

The main goal in this technique is to estimate the quantity \theta, where

\theta = \int_R g(x) f(x)\,dx = E[g(X)],

for a random variable X distributed according to the density function f(x). Here g(x) is any function on R such that \theta and E[g^2(X)] are bounded.

Classical Monte Carlo approach:

Suppose that we have tools to simulate independent and identically distributed samples from f(x), call them X_1, X_2, \ldots, X_n. Then one can approximate \theta by the quantity

\hat{\theta}_n = \frac{1}{n} \sum_{i=1}^n g(X_i).

The variance of \hat{\theta}_n is given by n^{-1}\,\mathrm{Var}(g(X)).

For this approach, the samples from f(x) are generated in an i.i.d. fashion. In order to get a good estimate, we need the variance to go to zero, and hence the number of samples to go to infinity.

In practical situations, the sample size never goes to infinity. This raises an interesting question: can a better estimate be obtained under a given computing constraint? We now consider three widely used techniques to accomplish this task.

Using Antithetic Variables

This method depends on generating averages from samples which have negative covariance between them, causing the overall variance to go down.

Let Y_1 and Y_2 be two identically distributed random variables with mean \theta. Then

\mathrm{Var}\!\left(\frac{Y_1 + Y_2}{2}\right) = \frac{\mathrm{Var}(Y_1)}{2} + \frac{\mathrm{Cov}(Y_1, Y_2)}{2}.

- If Y_1 and Y_2 are independent, then the last term is zero.

- If Y_1 and Y_2 are positively correlated, then \mathrm{Var}((Y_1 + Y_2)/2) > \mathrm{Var}(Y_1)/2.

- If they are negatively correlated, the resulting variance is reduced.
Question: How can we obtain random variables Y_1 and Y_2 with identical distributions but negative correlation?

Illustration:

- Let X_1, X_2, \ldots, X_n be independent random variables with distribution functions F_1, F_2, \ldots, F_n.

- Let g be a monotone function.

- Using the inverse transform method, the X_i's can be generated according to X_i = F_i^{-1}(U_i), for U_i \sim UNIF[0, 1].

- Define

  Y_1 = g(F_1^{-1}(U_1), F_2^{-1}(U_2), \ldots, F_n^{-1}(U_n)).

  Since U and 1 - U are identically distributed and negatively correlated random variables, define

  Y_2 = g(F_1^{-1}(1 - U_1), F_2^{-1}(1 - U_2), \ldots, F_n^{-1}(1 - U_n)).

- For a monotone function g, Y_1 and Y_2 are negatively correlated.

- Utilizing negatively correlated functions not only reduces the resulting variance of the sample average but also reduces the computation time, as only half the samples need to be generated from UNIF[0, 1].

Estimate \theta by

\tilde{\theta} = \frac{1}{2n} \sum_{i=1}^n [g(U_i) + g(1 - U_i)].

- If f is symmetric around \mu, take Y_i = 2\mu - X_i.

- See Geweke (1988) for the implementation of this idea.
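A sketch of the antithetic estimator with the monotone function g(u) = e^u on UNIF[0, 1] (illustrative Python, not from the notes):

```python
import math
import random
import statistics

def antithetic(g, n=20_000, seed=1):
    """Compare the plain estimator of E[g(U)] with the antithetic-pair version."""
    rng = random.Random(seed)
    us = [rng.random() for _ in range(n)]
    plain = [g(u) for u in us]
    paired = [(g(u) + g(1.0 - u)) / 2.0 for u in us]  # negatively correlated pair
    return (statistics.mean(plain), statistics.variance(plain),
            statistics.mean(paired), statistics.variance(paired))

# g is monotone, so g(U) and g(1 - U) are negatively correlated; E[e^U] = e - 1
m1, v1, m2, v2 = antithetic(math.exp)
```

Because g is monotone, the per-observation variance of the paired average is far below that of the plain draws, while each pair reuses a single uniform variate.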

Variance Reduction by Conditioning:

Let Y and Z be two random variables. In general, we have

\mathrm{Var}[E(Y \mid Z)] = \mathrm{Var}(Y) - E[\mathrm{Var}(Y \mid Z)] \le \mathrm{Var}(Y).

The two random variables Y and E(Y \mid Z) have the same mean. Therefore E(Y \mid Z) is a better random variable to simulate and average when estimating \theta.

How do we find an appropriate Z such that E(Y \mid Z) has significantly lower variance than Y?
Example: Estimate \pi.

- Generate V_i = 2U_i - 1, i = 1, 2, and set

  I = 1 if V_1^2 + V_2^2 \le 1, and I = 0 otherwise,

  so that E(I) = \pi/4.

- Improve the estimate of E(I) by using E(I \mid V_1). Note that

  E[I \mid V_1 = v] = P(V_1^2 + V_2^2 \le 1 \mid V_1 = v) = P(V_2^2 \le 1 - v^2) = (1 - v^2)^{1/2}.

- The conditional variance equals

  \mathrm{Var}[(1 - V_1^2)^{1/2}] \approx 0.0498,

  which is smaller than \mathrm{Var}(I) \approx 0.1686.
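A sketch of this comparison (illustrative Python; 4 times either sample mean estimates \pi):

```python
import random
import statistics

def pi_by_conditioning(n=50_000, seed=2):
    """Compare I = 1(V1^2 + V2^2 <= 1) with its conditional mean E(I | V1)."""
    rng = random.Random(seed)
    raw, cond = [], []
    for _ in range(n):
        v1 = 2.0 * rng.random() - 1.0
        v2 = 2.0 * rng.random() - 1.0
        raw.append(1.0 if v1 * v1 + v2 * v2 <= 1.0 else 0.0)  # I
        cond.append((1.0 - v1 * v1) ** 0.5)                    # E(I | V1)
    return raw, cond

raw, cond = pi_by_conditioning()
pi_raw = 4.0 * statistics.mean(raw)
pi_cond = 4.0 * statistics.mean(cond)
```

The sample variances of the two sequences should land near the 0.1686 and 0.0498 values quoted above.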

Variance Reduction Using Control Variates:

We estimate \theta, the expected value of a function g of random variables X = (X_1, X_2, \ldots, X_n).

Assume that we know the expected value of another function h of these random variables, call it \mu.

For any constant a, define a random variable W_a according to

W_a = g(X) + a[h(X) - \mu].

We can utilize sample averages of W_a to estimate \theta, since E(W_a) = \theta.

Observe that

\mathrm{Var}(W_a) = \mathrm{Var}(g(X)) + a^2\,\mathrm{Var}(h(X)) + 2a\,\mathrm{Cov}(g(X), h(X)).

It follows easily that the minimizer of \mathrm{Var}(W_a) as a function of a is

a^* = -\frac{\mathrm{Cov}(g(X), h(X))}{\mathrm{Var}(h(X))}.

- Estimate \theta by averaging observations of

  g(X) - \frac{\mathrm{Cov}(g(X), h(X))}{\mathrm{Var}(h(X))}\,[h(X) - \mu].

- The resulting variance of W_{a^*} is given by

  \mathrm{Var}(W_{a^*}) = \mathrm{Var}(g(X)) - \frac{[\mathrm{Cov}(g(X), h(X))]^2}{\mathrm{Var}(h(X))}.
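A sketch using h(U) = U, with known mean 1/2, as a control variate for g(U) = e^U (illustrative Python; here a^* is estimated from the same sample):

```python
import math
import random
import statistics

def control_variate(g, h, mu_h, sample_x, n=20_000, seed=3):
    """Estimate E[g(X)] using h(X), with known mean mu_h, as a control variate."""
    rng = random.Random(seed)
    xs = [sample_x(rng) for _ in range(n)]
    gs = [g(x) for x in xs]
    hs = [h(x) for x in xs]
    gbar, hbar = statistics.mean(gs), statistics.mean(hs)
    cov = sum((u - gbar) * (v - hbar) for u, v in zip(gs, hs)) / (n - 1)
    a_star = -cov / statistics.variance(hs)      # plug-in minimizer of Var(W_a)
    ws = [gi + a_star * (hi - mu_h) for gi, hi in zip(gs, hs)]
    return (statistics.mean(gs), statistics.variance(gs),
            statistics.mean(ws), statistics.variance(ws))

# E[e^U] = e - 1, with control variate h(U) = U, E[U] = 1/2
mg, vg, mw, vw = control_variate(math.exp, lambda x: x, 0.5,
                                 lambda rng: rng.random())
```

Because e^U and U are strongly positively correlated, the control-variate average has a much smaller variance than the plain average, at the cost of one extra covariance estimate.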

Example: Use the sample mean to reduce the variance of the estimate of the sample median.

- Find the median of a Poisson random variable X with mean \mu.

- Note that \mu = 11.5.

- Modify the usual estimate to

  \tilde{x} - \frac{\widehat{\mathrm{Cov}}(\text{median}, \text{mean})}{s^2}\,(\bar{x} - 11.5),

  where \tilde{x} and \bar{x} are the median and mean of the sampled data, and s^2 is the sample variance.

Example on the Control Variate Method

In general, suppose there are p control variates W_1, \ldots, W_p, and Z generally varies with each W_i, i.e.,

\hat{\theta} = Z - \sum_{i=1}^p \beta_i\,[W_i - E(W_i)].

This is the multiple regression of Z on W_1, \ldots, W_p.

How do we find estimates of these coefficients from the correlations between Z and the W's?

Importance Sampling

Another technique commonly used for reducing variance in Monte Carlo methods is importance sampling.

Importance sampling differs from the classical Monte Carlo method in that instead of sampling from f(x), one samples from another density h(x) and computes the estimate of \theta using averages of g(x)f(x)/h(x), instead of g(x), evaluated at those samples.

Rearrange the definition of \theta as follows:

\theta = \int g(x) f(x)\,dx = \int \frac{g(x) f(x)}{h(x)}\, h(x)\,dx.

Here h(x) can be any density function as long as the support of h(x) contains the support of f(x).

Sample X_1, X_2, \ldots, X_n from h(x) and compute the estimate

\hat{\theta}_n = \frac{1}{n} \sum_{i=1}^n \frac{g(X_i) f(X_i)}{h(X_i)}.

It can be seen that the mean of \hat{\theta}_n is \theta and its variance is

\mathrm{Var}(\hat{\theta}_n) = \frac{1}{n} \left( E_h\!\left[ \left( \frac{g(X) f(X)}{h(X)} \right)^{\!2} \right] - \theta^2 \right).

Recall that the variance associated with the classical Monte Carlo estimator differs only in the first term: in that case, the first term is given by E_f[g(X)^2].

It is possible that a suitable choice of h can reduce the estimator variance below that of the classical estimator.

By Jensen's inequality, we have a lower bound on the first term:

E_h\!\left[ \frac{g^2(X) f^2(X)}{h^2(X)} \right] \ge \left( E_h\!\left[ \frac{g(X) f(X)}{h(X)} \right] \right)^{\!2} = \left( \int g(x) f(x)\,dx \right)^{\!2}.

In practice, for importance sampling, we generally seek a probability density h that is nearly proportional to |g(x)|\,f(x).

Example taken from Robert and Casella:

- Let X be a Cauchy random variable with parameters (0, 1), i.e. X is distributed according to the density function

  f(x) = \frac{1}{\pi(1 + x^2)},

  and let g(x) = 1(x > 2) be an indicator function.

- Estimate

  \theta = \Pr(X > 2) = \frac{1}{2} - \frac{1}{\pi} \tan^{-1} 2 \approx 0.1476.

- Method 1:

  Generate X_1, X_2, \ldots, X_n as a random sample from f(x). Then \hat{\theta}_n is just the frequency of sampled values larger than 2:

  \hat{\theta}_n = \frac{1}{n} \sum_{i=1}^n 1(X_i > 2).

  The variance of this estimator is simply \theta(1 - \theta)/n \approx 0.126/n.

- Method 2:

  Utilize the fact that the density f(x) is symmetric around 0, so \theta is just half of the probability \Pr(|X| > 2). Generating the X_i's as i.i.d. Cauchy, one can estimate \theta by

  \hat{\theta}_n = \frac{1}{2n} \sum_{i=1}^n 1(|X_i| > 2).

  The variance of this estimator is \theta(1 - 2\theta)/(2n) \approx 0.052/n.

- Method 3:

  Write \theta as the integral

  \theta = \frac{1}{2} - \int_0^2 \frac{1}{\pi(1 + x^2)}\,dx.

  Generate X_1, X_2, \ldots, X_n as a random sample from UNIF(0, 2) and define

  \hat{\theta}_n = \frac{1}{2} - \frac{2}{n} \sum_i \frac{1}{\pi(1 + X_i^2)}.

  Its variance is given by 0.0092/n.

- Alternatively, let y = 1/x and write \theta as the integral

  \theta = \int_0^{1/2} \frac{y^{-2}}{\pi(1 + y^{-2})}\,dy.

  Using i.i.d. samples from UNIF[0, 1/2] to evaluate this integrand, one can further reduce the estimator variance.

Importance sampling:

- Select h so that its support is \{x > 2\}.

- For x > 2, the density

  f(x) = \frac{1}{\pi(1 + x^2)}

  is closely matched by

  h(x) = \frac{2}{x^2}.

- Note that the cdf associated with h is 1 - 2/x.

- Sample X = 2/U with U \sim UNIF(0, 1), and let

  \phi(x) = 1(x > 2)\,\frac{f(x)}{h(x)} = \frac{1}{2\pi(1 + x^{-2})}.

- With \hat{\theta}_h = n^{-1} \sum_i \phi(x_i), this is equivalent to the y = 1/x form of Method 3.

- \mathrm{Var}(\hat{\theta}_h) \approx 9.3 \times 10^{-5}/n.
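This final importance-sampling scheme can be sketched as follows (illustrative Python, not from the notes):

```python
import math
import random
import statistics

def cauchy_tail(n=50_000, seed=4):
    """Importance sampling for theta = P(X > 2), X standard Cauchy,
    using h(x) = 2/x^2 on (2, inf); sample X = 2/U with U ~ UNIF(0, 1]."""
    rng = random.Random(seed)
    phis = []
    for _ in range(n):
        u = 1.0 - rng.random()                  # U in (0, 1], avoids division by 0
        x = 2.0 / u                             # X ~ h, supported on [2, inf)
        f = 1.0 / (math.pi * (1.0 + x * x))     # Cauchy density
        h = 2.0 / (x * x)                       # importance density
        phis.append(f / h)                      # phi(x) = 1(x > 2) f(x) / h(x)
    return statistics.mean(phis), statistics.variance(phis)

est, var_phi = cauchy_tail()
exact = 0.5 - math.atan(2.0) / math.pi          # theta, approximately 0.1476
```

The sample variance of the weights should sit near the 9.3 x 10^-5 figure quoted above, several orders of magnitude below the naive indicator-based estimator.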