
Theorem 2.1 Let X be a random variable and let a, b, and c be constants. Then for any functions g_1(x) and g_2(x) whose expectations exist,

a. E(ag_1(X) + bg_2(X) + c) = aEg_1(X) + bEg_2(X) + c.

b. If g_1(x) ≥ 0 for all x, then Eg_1(X) ≥ 0.

c. If g_1(x) ≥ g_2(x) for all x, then Eg_1(X) ≥ Eg_2(X).

d. If a ≤ g_1(x) ≤ b for all x, then a ≤ Eg_1(X) ≤ b.

Example 2.4 (Minimizing distance) Find the value of b that minimizes the distance E(X − b)^2. We have

E(X − b)^2 = E(X − EX + EX − b)^2

= E(X − EX)^2 + (EX − b)^2 + 2E((X − EX)(EX − b))

= E(X − EX)^2 + (EX − b)^2,

where the cross term vanishes because EX − b is a constant and E(X − EX) = 0. Hence E(X − b)^2 is minimized by choosing b = EX.
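As a quick numerical illustration (a minimal sketch in Python; the exponential sample and the grid of candidate values of b are arbitrary choices), the empirical mean squared error is minimized at approximately the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # any sample would do

bs = np.linspace(0.0, 4.0, 401)                # grid of candidate b values
mse = [np.mean((x - b) ** 2) for b in bs]

print(bs[int(np.argmin(mse))], x.mean())       # minimizer is close to the sample mean
```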

When evaluating expectations of nonlinear functions of X, we can proceed in one of two ways. From the definition of Eg(X), we could directly calculate

Eg(X) = \int_{−∞}^{∞} g(x) f_X(x) dx.

But we could also find the pdf f_Y(y) of Y = g(X), in which case

Eg(X) = EY = \int_{−∞}^{∞} y f_Y(y) dy.
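Both routes can be checked numerically. Below is a minimal sketch (assuming X ∼ exponential(λ) with λ = 2 and g(x) = x^2, so that Y = X^2 has pdf f_Y(y) = f_X(√y)/(2√y) for y > 0; these choices are purely illustrative):

```python
import numpy as np
from scipy.integrate import quad

lam = 2.0
f_X = lambda x: np.exp(-x / lam) / lam          # exponential(lam) pdf
g = lambda x: x ** 2

# Route 1: integrate g(x) against the pdf of X.
direct, _ = quad(lambda x: g(x) * f_X(x), 0, np.inf)

# Route 2: derive the pdf of Y = g(X) and integrate y against it.
f_Y = lambda y: f_X(np.sqrt(y)) / (2 * np.sqrt(y))
via_Y, _ = quad(lambda y: y * f_Y(y), 0, np.inf)

print(direct, via_Y)                             # both approx 2 * lam**2 = 8
```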

3 Moments and moment generating functions

Definition 3.1 For each integer n, the nth moment of X (or F_X(x)), µ′_n, is

µ′_n = EX^n.

The nth central moment of X, µ_n, is

µ_n = E(X − µ)^n,

where µ = µ′_1 = EX.

Theorem 3.1 The variance of a random variable X is its second central moment, VarX = E(X − EX)^2. The positive square root of VarX is the standard deviation of X.


Example 3.1 (Exponential variance) Let X ∼ exponential(λ). We have calculated EX = λ, so

VarX = E(X − λ)^2 = \int_0^{∞} (x − λ)^2 \frac{1}{λ} e^{−x/λ} dx = λ^2.

Theorem 3.2 If X is a random variable with finite variance, then for any constants a and b, Var(aX + b) = a^2 VarX.

The variance can be calculated using an alternative formula:

VarX = EX^2 − (EX)^2.

Example 3.2 (Binomial variance) Let X ∼ binomial(n, p), that is,

P(X = x) = \binom{n}{x} p^x (1 − p)^{n−x}, x = 0, 1, . . . , n.

We have already seen that EX = np. Now we calculate

EX^2 = \sum_{x=0}^{n} x^2 \binom{n}{x} p^x (1 − p)^{n−x}

= n \sum_{x=1}^{n} x \binom{n−1}{x−1} p^x (1 − p)^{n−x}   (using x\binom{n}{x} = n\binom{n−1}{x−1})

= n \sum_{y=0}^{n−1} (y + 1) \binom{n−1}{y} p^{y+1} (1 − p)^{n−1−y}   (substituting y = x − 1)

= np \sum_{y=0}^{n−1} y \binom{n−1}{y} p^{y} (1 − p)^{n−1−y} + np \sum_{y=0}^{n−1} \binom{n−1}{y} p^{y} (1 − p)^{n−1−y}

= np[(n − 1)p] + np,

since the first sum is the mean of a binomial(n − 1, p) random variable and the second sum is a total probability, equal to 1. Hence, the variance is

VarX = EX^2 − (EX)^2 = n(n − 1)p^2 + np − (np)^2 = np(1 − p).
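The derivation of EX^2 and the variance formula are easy to verify by direct summation of the pmf (a small sketch; n and p are arbitrary):

```python
import numpy as np
from scipy.stats import binom

n, p = 12, 0.3
x = np.arange(n + 1)
pmf = binom.pmf(x, n, p)

ex2 = np.sum(x ** 2 * pmf)                    # EX^2 by direct summation
print(ex2 - (n * p) ** 2, n * p * (1 - p))    # both equal np(1-p) = 2.52
```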

The moment generating function (mgf), as its name suggests, can be used to generate moments.

In practice, it is often easier to calculate moments directly than to use the mgf. The main use of the mgf is not to generate moments, but to help in characterizing a distribution.

This property can lead to some extremely powerful results when used properly.

Definition 3.2 Let X be a random variable with cdf F_X. The moment generating function (mgf) of X (or F_X), denoted by M_X(t), is

M_X(t) = Ee^{tX},

provided that the expectation exists for t in some neighborhood of 0. That is, there is an h > 0 such that Ee^{tX} exists for all t in −h < t < h. If the expectation does not exist in a neighborhood of 0, we say that the moment generating function does not exist.

More explicitly, we can write the mgf of X as

M_X(t) = \int_{−∞}^{∞} e^{tx} f_X(x) dx if X is continuous,

or

M_X(t) = \sum_{x} e^{tx} P(X = x) if X is discrete.

Theorem 3.3 If X has mgf M_X(t), then

EX^n = M_X^{(n)}(0),

where we define

M_X^{(n)}(0) = \frac{d^n}{dt^n} M_X(t) \Big|_{t=0}.

That is, the nth moment is equal to the nth derivative of M_X(t) evaluated at t = 0.

Proof: Assuming that we can differentiate under the integral sign, we have

\frac{d}{dt} M_X(t) = \frac{d}{dt} \int_{−∞}^{∞} e^{tx} f_X(x) dx

= \int_{−∞}^{∞} \left( \frac{d}{dt} e^{tx} \right) f_X(x) dx

= \int_{−∞}^{∞} x e^{tx} f_X(x) dx

= EXe^{tX}.

Thus, \frac{d}{dt} M_X(t) \big|_{t=0} = EXe^{tX} \big|_{t=0} = EX. Proceeding in an analogous manner, we can establish that

\frac{d^n}{dt^n} M_X(t) \Big|_{t=0} = EX^n e^{tX} \Big|_{t=0} = EX^n. □

Example 3.3 (Gamma mgf) The gamma pdf is

f(x) = \frac{1}{Γ(α)β^α} x^{α−1} e^{−x/β}, 0 < x < ∞, α > 0, β > 0,

where Γ(α) = \int_0^{∞} t^{α−1} e^{−t} dt denotes the gamma function. The mgf is given by

M_X(t) = \frac{1}{Γ(α)β^α} \int_0^{∞} e^{tx} x^{α−1} e^{−x/β} dx

= \frac{1}{Γ(α)β^α} \int_0^{∞} x^{α−1} e^{−x/(β/(1−βt))} dx

= \frac{1}{Γ(α)β^α} Γ(α) \left( \frac{β}{1 − βt} \right)^{α}

= \left( \frac{1}{1 − βt} \right)^{α} if t < \frac{1}{β}.

If t ≥ 1/β, then the quantity 1 − βt is nonpositive and the integral is infinite. Thus, the mgf of the gamma distribution exists only if t < 1/β.

The mean of the gamma distribution is given by

EX = \frac{d}{dt} M_X(t) \Big|_{t=0} = \frac{αβ}{(1 − βt)^{α+1}} \Big|_{t=0} = αβ.
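The differentiation in Theorem 3.3 can be carried out symbolically for the gamma mgf. A minimal sketch with sympy (α and β stay symbolic; the variance αβ^2 printed at the end is a standard gamma fact, included only as a check):

```python
import sympy as sp

t, alpha, beta = sp.symbols('t alpha beta', positive=True)
M = (1 - beta * t) ** (-alpha)          # gamma mgf, valid for t < 1/beta

EX = sp.diff(M, t, 1).subs(t, 0)        # first moment
EX2 = sp.diff(M, t, 2).subs(t, 0)       # second moment

print(sp.simplify(EX))                  # alpha*beta
print(sp.simplify(EX2 - EX ** 2))       # variance: alpha*beta**2
```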

Example 3.4 (Binomial mgf) The binomial mgf is

M_X(t) = \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^x (1 − p)^{n−x}

= \sum_{x=0}^{n} \binom{n}{x} (pe^t)^x (1 − p)^{n−x}.

The binomial formula gives

\sum_{x=0}^{n} \binom{n}{x} u^x v^{n−x} = (u + v)^n.

Hence, letting u = pe^t and v = 1 − p, we have

M_X(t) = [pe^t + (1 − p)]^n.

The following theorem shows how a distribution can be characterized.

Theorem 3.4 Let F_X(x) and F_Y(y) be two cdfs all of whose moments exist.

a. If X and Y have bounded supports, then F_X(u) = F_Y(u) for all u if and only if EX^r = EY^r for all integers r = 0, 1, 2, . . ..

b. If the moment generating functions exist and M_X(t) = M_Y(t) for all t in some neighborhood of 0, then F_X(u) = F_Y(u) for all u.

Theorem 3.5 (Convergence of mgfs) Suppose {X_i, i = 1, 2, . . .} is a sequence of random variables, each with mgf M_{X_i}(t). Furthermore, suppose that

\lim_{i→∞} M_{X_i}(t) = M_X(t) for all t in a neighborhood of 0,

and M_X(t) is an mgf. Then there is a unique cdf F_X whose moments are determined by M_X(t) and, for all x where F_X(x) is continuous, we have

\lim_{i→∞} F_{X_i}(x) = F_X(x).

That is, convergence, for |t| < h, of mgfs to an mgf implies convergence of cdfs.

Before going to the next example, we first mention an important limit result, one that has wide applicability in statistics. The proof of this lemma may be found in many standard calculus texts.

Lemma 3.1 Let a_1, a_2, . . . be a sequence of numbers converging to a, that is, \lim_{n→∞} a_n = a. Then

\lim_{n→∞} \left( 1 + \frac{a_n}{n} \right)^n = e^a.

Example 3.5 (Poisson approximation) The binomial distribution is characterized by two quantities, denoted by n and p. It is taught that the Poisson approximation is valid “when n is large and np is small,” and rules of thumb are sometimes given.

The Poisson(λ) pmf is given by

P(X = x) = \frac{λ^x}{x!} e^{−λ}, x = 0, 1, 2, . . . ,

where λ is a positive constant. The approximation states that if X ∼ binomial(n, p) and Y ∼ Poisson(λ), with λ = np, then

P(X = x) ≈ P(Y = x)

for large n and small np. We now show the mgfs converge, lending credence to this approximation.

Recall that

M_X(t) = [pe^t + (1 − p)]^n.

For the Poisson(λ) distribution, we can calculate

M_Y(t) = e^{λ(e^t − 1)},

and if we define p = λ/n, then

M_X(t) = \left[ 1 + \frac{1}{n}(e^t − 1)(np) \right]^n = \left[ 1 + \frac{1}{n}(e^t − 1)λ \right]^n.

Now set a_n = a = (e^t − 1)λ and apply the above lemma to get

\lim_{n→∞} M_X(t) = e^{λ(e^t − 1)} = M_Y(t).

The Poisson approximation can be quite good even for moderate p and n.
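The convergence of the mgfs is easy to watch numerically (a sketch; λ, t, and the values of n are arbitrary choices):

```python
import numpy as np

lam, t = 3.0, 0.5
target = np.exp(lam * (np.exp(t) - 1.0))   # Poisson(lam) mgf at t

for n in [10, 100, 1000, 10000]:
    binom_mgf = (1.0 + lam * (np.exp(t) - 1.0) / n) ** n  # p = lam / n
    print(n, binom_mgf, target)            # converges to the Poisson mgf
```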


Theorem 3.6 For any constants a and b, the mgf of the random variable aX + b is given by

M_{aX+b}(t) = e^{bt} M_X(at).

Proof: By definition,

M_{aX+b}(t) = Ee^{(aX+b)t} = e^{bt} Ee^{(at)X} = e^{bt} M_X(at). □

4 Differentiating under an integral sign

The purpose of this section is to characterize conditions under which this operation is legitimate.

We will also discuss interchanging the order of differentiation and summation. Many of these conditions can be established using standard theorems from calculus, and the detailed proofs can be found in most calculus textbooks, so they will not be presented here.

Theorem 4.1 (Leibnitz’s Rule) If f(x, θ), a(θ), and b(θ) are differentiable with respect to θ, then

\frac{d}{dθ} \int_{a(θ)}^{b(θ)} f(x, θ) dx = f(b(θ), θ) \frac{d}{dθ} b(θ) − f(a(θ), θ) \frac{d}{dθ} a(θ) + \int_{a(θ)}^{b(θ)} \frac{∂}{∂θ} f(x, θ) dx.

Notice that if a(θ) and b(θ) are constants, we have a special case of Leibnitz’s rule:

\frac{d}{dθ} \int_a^b f(x, θ) dx = \int_a^b \frac{∂}{∂θ} f(x, θ) dx.

Thus, in general, if we have the integral of a differentiable function over a finite range, differentiation of the integral poses no problem. If the range of integration is infinite, however, problems can arise.

The question of whether interchanging the order of differentiation and integration is justified is really a question of whether limits and integration can be interchanged, since a derivative is a special kind of limit. Recall that if f(x, θ) is differentiable, then

\frac{∂}{∂θ} f(x, θ) = \lim_{δ→0} \frac{f(x, θ + δ) − f(x, θ)}{δ},

so we have

\int_{−∞}^{∞} \frac{∂}{∂θ} f(x, θ) dx = \int_{−∞}^{∞} \lim_{δ→0} \frac{f(x, θ + δ) − f(x, θ)}{δ} dx,

while

\frac{d}{dθ} \int_{−∞}^{∞} f(x, θ) dx = \lim_{δ→0} \int_{−∞}^{∞} \frac{f(x, θ + δ) − f(x, θ)}{δ} dx.

The following theorems are all corollaries of Lebesgue’s Dominated Convergence Theorem.


Theorem 4.2 Suppose the function h(x, y) is continuous at y_0 for each x, and there exists a function g(x) satisfying

i. |h(x, y)| ≤ g(x) for all x and y,

ii. \int_{−∞}^{∞} g(x) dx < ∞.

Then

\lim_{y→y_0} \int_{−∞}^{∞} h(x, y) dx = \int_{−∞}^{∞} \lim_{y→y_0} h(x, y) dx.

The key condition in this theorem is the existence of a dominating function g(x), with a finite integral, which ensures that the integrals cannot be too badly behaved.

Theorem 4.3 Suppose f(x, θ) is differentiable at θ = θ_0, that is,

\lim_{δ→0} \frac{f(x, θ_0 + δ) − f(x, θ_0)}{δ} = \frac{∂}{∂θ} f(x, θ) \Big|_{θ=θ_0}

exists for every x, and there exist a function g(x, θ_0) and a constant δ_0 > 0 such that

i. \left| \frac{f(x, θ_0 + δ) − f(x, θ_0)}{δ} \right| ≤ g(x, θ_0) for all x and all |δ| ≤ δ_0,

ii. \int_{−∞}^{∞} g(x, θ_0) dx < ∞.

Then

\frac{d}{dθ} \int_{−∞}^{∞} f(x, θ) dx \Big|_{θ=θ_0} = \int_{−∞}^{∞} \frac{∂}{∂θ} f(x, θ) \Big|_{θ=θ_0} dx. (4)

It is important to realize that although we seem to be treating θ as a variable, the statement of the theorem is for one value of θ. That is, for each value θ_0 for which f(x, θ) is differentiable at θ_0 and satisfies conditions (i) and (ii), the order of integration and differentiation can be interchanged. Often the distinction between θ and θ_0 is not stressed and (4) is written

\frac{d}{dθ} \int_{−∞}^{∞} f(x, θ) dx = \int_{−∞}^{∞} \frac{∂}{∂θ} f(x, θ) dx. (5)
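Equation (5) can be sanity-checked numerically. In this sketch, f(x, θ) = e^{−θx^2} on (0, ∞) is an arbitrary choice, and the left side of (5) is approximated by a central difference:

```python
import numpy as np
from scipy.integrate import quad

theta = 2.0
F = lambda th: quad(lambda x: np.exp(-th * x ** 2), 0, np.inf)[0]

# Left side of (5): numerical derivative of the integral.
h = 1e-5
lhs = (F(theta + h) - F(theta - h)) / (2 * h)

# Right side of (5): integral of the partial derivative.
rhs, _ = quad(lambda x: -x ** 2 * np.exp(-theta * x ** 2), 0, np.inf)

print(lhs, rhs)   # both approx -(1/4) * sqrt(pi) * theta**(-3/2)
```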

Example 4.1 (Interchanging integration and differentiation-I) Let X have the exponential(λ) pdf given by f(x) = (1/λ)e^{−x/λ}, 0 < x < ∞, and suppose we want to calculate

\frac{d}{dλ} EX^n = \frac{d}{dλ} \int_0^{∞} x^n \frac{1}{λ} e^{−x/λ} dx

for integer n > 0. If we could move the differentiation inside the integral, we would have

\frac{d}{dλ} EX^n = \int_0^{∞} \frac{∂}{∂λ} \left( x^n \frac{1}{λ} e^{−x/λ} \right) dx

= \int_0^{∞} \frac{x^n}{λ^2} \left( \frac{x}{λ} − 1 \right) e^{−x/λ} dx

= \frac{1}{λ^2} EX^{n+1} − \frac{1}{λ} EX^n.

To justify the interchange of integration and differentiation, we bound the derivative as follows:

\left| \frac{∂}{∂λ} \left( x^n \frac{1}{λ} e^{−x/λ} \right) \right| = \frac{x^n e^{−x/λ}}{λ^2} \left| \frac{x}{λ} − 1 \right| ≤ \frac{x^n e^{−x/λ}}{λ^2} \left( \frac{x}{λ} + 1 \right),

since x/λ > 0. For some constant δ_0 satisfying 0 < δ_0 < λ, take

g(x, λ) = \frac{x^n e^{−x/(λ+δ_0)}}{(λ − δ_0)^2} \left( \frac{x}{λ − δ_0} + 1 \right).

We then have

\left| \frac{∂}{∂λ} \left( x^n \frac{1}{λ} e^{−x/λ} \right) \Big|_{λ=λ_0} \right| ≤ g(x, λ) for all λ_0 such that |λ_0 − λ| ≤ δ_0.

Since the exponential distribution has all of its moments, \int_{−∞}^{∞} g(x, λ) dx < ∞ as long as λ − δ_0 > 0, so the interchange of integration and differentiation is justified.

Note that this example gives us a recursion relation for the moments of the exponential distribution,

EX^{n+1} = λ EX^n + λ^2 \frac{d}{dλ} EX^n.
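Since the exponential(λ) moments are known to be EX^n = n! λ^n, the recursion can be verified symbolically (a minimal sketch with sympy; n = 4 is an arbitrary fixed integer):

```python
import sympy as sp

lam = sp.symbols('lambda', positive=True)
n = 4                                    # any fixed positive integer

EXn = sp.factorial(n) * lam ** n         # EX^n = n! * lambda^n
EXn1 = sp.factorial(n + 1) * lam ** (n + 1)

rhs = lam * EXn + lam ** 2 * sp.diff(EXn, lam)
print(sp.simplify(rhs - EXn1))           # 0: the recursion holds
```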

Example 4.2 (Interchanging summation and differentiation) Let X be a discrete random variable with the geometric distribution

P(X = x) = θ(1 − θ)^x, x = 0, 1, . . . , 0 < θ < 1.

We have that \sum_{x=0}^{∞} θ(1 − θ)^x = 1 and, provided that the operations are justified,

\frac{d}{dθ} \sum_{x=0}^{∞} θ(1 − θ)^x = \sum_{x=0}^{∞} \frac{d}{dθ} \left[ θ(1 − θ)^x \right]

= \sum_{x=0}^{∞} \left[ (1 − θ)^x − θx(1 − θ)^{x−1} \right]

= \frac{1}{θ} \sum_{x=0}^{∞} θ(1 − θ)^x − \frac{1}{1 − θ} \sum_{x=0}^{∞} xθ(1 − θ)^x.

Since \sum_{x=0}^{∞} θ(1 − θ)^x = 1 for all 0 < θ < 1, its derivative is 0. So we have

\frac{1}{θ} \sum_{x=0}^{∞} θ(1 − θ)^x − \frac{1}{1 − θ} \sum_{x=0}^{∞} xθ(1 − θ)^x = 0,

that is,

\frac{1}{θ} − \frac{1}{1 − θ} EX = 0, or EX = \frac{1}{θ} − 1.
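A direct numerical check of this mean (a sketch; θ and the truncation point are arbitrary, and the discarded geometric tail is negligible):

```python
import numpy as np

theta = 0.3
x = np.arange(0, 2000)                   # truncate the infinite sum
pmf = theta * (1 - theta) ** x

print(pmf.sum())                          # approx 1
print((x * pmf).sum(), 1 / theta - 1)     # EX approx (1 - theta) / theta
```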

Theorem 4.4 Suppose that the series \sum_{x=0}^{∞} h(θ, x) converges for all θ in an interval (a, b) of real numbers and

i. \frac{∂}{∂θ} h(θ, x) is continuous in θ for each x,

ii. \sum_{x=0}^{∞} \frac{∂}{∂θ} h(θ, x) converges uniformly on every closed bounded subinterval of (a, b).

Then

\frac{d}{dθ} \sum_{x=0}^{∞} h(θ, x) = \sum_{x=0}^{∞} \frac{∂}{∂θ} h(θ, x).

The condition of uniform convergence is the key one to the theorem. Recall that a series converges uniformly if its sequence of partial sums converges uniformly.

Example 4.3 (Continuation of Example 4.2) Since h(θ, x) = θ(1 − θ)^x and

\frac{∂}{∂θ} h(θ, x) = (1 − θ)^x − θx(1 − θ)^{x−1},

the uniform convergence of \sum_{x=0}^{∞} \frac{∂}{∂θ} h(θ, x) can be verified as follows. Define

S_n(θ) = \sum_{x=0}^{n} \left[ (1 − θ)^x − θx(1 − θ)^{x−1} \right].

The convergence will be uniform on [c, d] ⊂ (0, 1) if, given ε > 0, we can find an N such that

n > N ⟹ |S_n(θ) − S_∞(θ)| < ε for all θ ∈ [c, d].

Since

\sum_{x=0}^{n} (1 − θ)^x = \frac{1 − (1 − θ)^{n+1}}{θ},

and

\sum_{x=0}^{n} θx(1 − θ)^{x−1} = θ \sum_{x=0}^{n} \left( −\frac{∂}{∂θ} (1 − θ)^x \right) = −θ \frac{d}{dθ} \sum_{x=0}^{n} (1 − θ)^x = −θ \frac{d}{dθ} \left[ \frac{1 − (1 − θ)^{n+1}}{θ} \right]

= \frac{[1 − (1 − θ)^{n+1}] − (n + 1)θ(1 − θ)^n}{θ},

we have

S_n(θ) = \frac{1 − (1 − θ)^{n+1}}{θ} − \frac{[1 − (1 − θ)^{n+1}] − (n + 1)θ(1 − θ)^n}{θ} = (n + 1)(1 − θ)^n.

It is clear that, for 0 < θ < 1, S_∞(θ) = \lim_{n→∞} S_n(θ) = 0. Moreover, for θ ∈ [c, d] ⊂ (0, 1) we have the bound S_n(θ) = (n + 1)(1 − θ)^n ≤ (n + 1)(1 − c)^n, which does not depend on θ and tends to 0 as n → ∞, so the convergence is uniform on any closed bounded interval. Therefore, the series of derivatives converges uniformly and the interchange of differentiation and summation is justified.
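The uniform convergence is also visible numerically: the supremum of S_n(θ) = (n + 1)(1 − θ)^n over a closed interval tends to 0 (a sketch; the interval [0.1, 0.9] and the values of n are arbitrary):

```python
import numpy as np

theta = np.linspace(0.1, 0.9, 81)        # a closed interval [c, d] in (0, 1)
for n in [10, 50, 100, 200]:
    sup = np.max((n + 1) * (1 - theta) ** n)
    print(n, sup)                        # sup over [c, d] tends to 0
```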

Theorem 4.5 Suppose the series \sum_{x=0}^{∞} h(θ, x) converges uniformly on [a, b] and that, for each x, h(θ, x) is a continuous function of θ. Then

\int_a^b \sum_{x=0}^{∞} h(θ, x) dθ = \sum_{x=0}^{∞} \int_a^b h(θ, x) dθ.
