
5.5.3 Convergence in Distribution

Definition 5.5.10

A sequence of random variables, $X_1, X_2, \ldots$, converges in distribution to a random variable $X$ if
$$\lim_{n\to\infty} F_{X_n}(x) = F_X(x)$$
at all points $x$ where $F_X(x)$ is continuous.

Example (Maximum of uniforms)

If $X_1, X_2, \ldots$ are iid uniform(0,1) and $X_{(n)} = \max_{1 \le i \le n} X_i$, let us examine whether $X_{(n)}$ converges in distribution.

As $n \to \infty$, we have for any $\varepsilon > 0$,
$$P(|X_{(n)} - 1| \ge \varepsilon) = P(X_{(n)} \le 1 - \varepsilon) = P(X_i \le 1 - \varepsilon,\ i = 1, \ldots, n) = (1 - \varepsilon)^n,$$
which goes to 0. However, if we take $\varepsilon = t/n$, we then have
$$P(X_{(n)} \le 1 - t/n) = (1 - t/n)^n \to e^{-t},$$
which, upon rearranging, yields
$$P\bigl(n(1 - X_{(n)}) \le t\bigr) \to 1 - e^{-t};$$
that is, the random variable $n(1 - X_{(n)})$ converges in distribution to an exponential(1) random variable.
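Numerically, this limit is easy to see. Below is a minimal simulation sketch (not from the text; the sample size, replication count, and seed are arbitrary choices) comparing the empirical distribution of $n(1 - X_{(n)})$ with the exponential(1) cdf.

```python
# Illustrative simulation: for iid uniform(0,1) samples, the rescaled maximum
# n(1 - X_(n)) should behave like an exponential(1) random variable.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 20_000

# Each row is one iid uniform(0,1) sample of size n; take the row-wise maximum.
x_max = rng.uniform(0.0, 1.0, size=(reps, n)).max(axis=1)
scaled = n * (1.0 - x_max)                    # the rescaled maximum n(1 - X_(n))

for t in (0.5, 1.0, 2.0):
    empirical = np.mean(scaled <= t)          # estimate of P(n(1 - X_(n)) <= t)
    print(f"t = {t}: empirical {empirical:.4f} vs 1 - e^(-t) = {1 - np.exp(-t):.4f}")
```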

Note that although we talk of a sequence of random variables converging in distribution, it is really the cdfs that converge, not the random variables. In this very fundamental way convergence in distribution is quite different from convergence in probability or convergence almost surely.

Theorem 5.5.12

If the sequence of random variables, X1, X2, . . ., converges in probability to a random variable X, the sequence also converges in distribution to X.


Theorem 5.5.13

The sequence of random variables, X1, X2, . . ., converges in probability to a constant µ if and only if the sequence also converges in distribution to µ. That is, the statement

$$P(|X_n - \mu| > \varepsilon) \to 0 \quad \text{for every } \varepsilon > 0$$
is equivalent to
$$P(X_n \le x) \to \begin{cases} 0 & \text{if } x < \mu, \\ 1 & \text{if } x > \mu. \end{cases}$$

Theorem 5.5.14 (Central limit theorem)

Let $X_1, X_2, \ldots$ be a sequence of iid random variables whose mgfs exist in a neighborhood of 0 (that is, $M_{X_i}(t)$ exists for $|t| < h$, for some positive $h$). Let $EX_i = \mu$ and $\operatorname{Var} X_i = \sigma^2 > 0$. (Both $\mu$ and $\sigma^2$ are finite since the mgf exists.) Define $\bar X_n = (1/n)\sum_{i=1}^n X_i$. Let $G_n(x)$ denote the cdf of $\sqrt{n}(\bar X_n - \mu)/\sigma$. Then, for any $x$, $-\infty < x < \infty$,
$$\lim_{n\to\infty} G_n(x) = \int_{-\infty}^x \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy;$$
that is, $\sqrt{n}(\bar X_n - \mu)/\sigma$ has a limiting standard normal distribution.

Theorem 5.5.15 (Stronger form of the central limit theorem)

Let $X_1, X_2, \ldots$ be a sequence of iid random variables with $EX_i = \mu$ and $0 < \operatorname{Var} X_i = \sigma^2 < \infty$. Define $\bar X_n = (1/n)\sum_{i=1}^n X_i$. Let $G_n(x)$ denote the cdf of $\sqrt{n}(\bar X_n - \mu)/\sigma$. Then, for any $x$, $-\infty < x < \infty$,
$$\lim_{n\to\infty} G_n(x) = \int_{-\infty}^x \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy;$$
that is, $\sqrt{n}(\bar X_n - \mu)/\sigma$ has a limiting standard normal distribution.

The proof is almost identical to that of Theorem 5.5.14, except that characteristic functions are used instead of mgfs.

Example (Normal approximation to the negative binomial)

Suppose $X_1, \ldots, X_n$ are a random sample from a negative binomial($r, p$) distribution. Recall that
$$EX = \frac{r(1-p)}{p}, \qquad \operatorname{Var} X = \frac{r(1-p)}{p^2},$$
and the central limit theorem tells us that
$$\frac{\sqrt{n}\,\bigl(\bar X - r(1-p)/p\bigr)}{\sqrt{r(1-p)/p^2}}$$
is approximately N(0, 1). The approximate probability calculations are much easier than the exact calculations. For example, if $r = 10$, $p = \tfrac12$, and $n = 30$, an exact calculation would be

$$P(\bar X \le 11) = P\Bigl(\sum_{i=1}^{30} X_i \le 330\Bigr) = \sum_{x=0}^{330} \binom{300 + x - 1}{x} \Bigl(\frac12\Bigr)^{300+x} = 0.8916,$$
where we use the fact that $\sum_i X_i$ is negative binomial($nr, p$). The CLT gives us the approximation
$$P(\bar X \le 11) = P\Bigl(\frac{\sqrt{30}(\bar X - 10)}{\sqrt{20}} \le \frac{\sqrt{30}(11 - 10)}{\sqrt{20}}\Bigr) \approx P(Z \le 1.2247) = .8888.$$
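The two numbers quoted above can be checked directly. The following sketch (an illustration; it assumes SciPy, whose `nbinom` counts failures and therefore matches the sum above) computes the exact negative binomial probability and the CLT approximation.

```python
# Exact negative binomial cdf versus CLT approximation for r = 10, p = 1/2, n = 30.
import numpy as np
from scipy import stats

r, p, n = 10, 0.5, 30
mean, var = r * (1 - p) / p, r * (1 - p) / p**2   # EX = 10, VarX = 20

# Exact: the sum of the X_i is negative binomial(nr, p), so P(Xbar <= 11) = P(sum <= 330).
exact = stats.nbinom.cdf(330, n * r, p)

# CLT: P(Xbar <= 11) is approximately P(Z <= sqrt(n)(11 - mean)/sd).
z = np.sqrt(n) * (11 - mean) / np.sqrt(var)
approx = stats.norm.cdf(z)

print(f"exact  P(Xbar <= 11) = {exact:.4f}")   # compare with 0.8916 above
print(f"z statistic          = {z:.4f}")       # about 1.2247
print(f"normal approximation = {approx:.4f}")  # compare with .8888 above
```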

Theorem 5.5.17 (Slutsky’s theorem)

If $X_n \to X$ in distribution and $Y_n \to a$, a constant, in probability, then

(a) $Y_n X_n \to aX$ in distribution;

(b) $X_n + Y_n \to X + a$ in distribution.

Example (Normal approximation with estimated variance)

Suppose that
$$\frac{\sqrt{n}(\bar X_n - \mu)}{\sigma} \to N(0, 1),$$
but the value of $\sigma$ is unknown. We know $S_n \to \sigma$ in probability. By Exercise 5.32, $\sigma/S_n \to 1$ in probability. Hence, Slutsky's theorem tells us
$$\frac{\sqrt{n}(\bar X_n - \mu)}{S_n} = \frac{\sigma}{S_n}\, \frac{\sqrt{n}(\bar X_n - \mu)}{\sigma} \to N(0, 1).$$
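A quick simulation makes this concrete. The sketch below (the exponential data, sample size, and seed are illustrative assumptions, not from the text) checks that the studentized mean $\sqrt{n}(\bar X_n - \mu)/S_n$ behaves approximately like a standard normal.

```python
# Studentized mean with estimated variance: approximately N(0,1), per Slutsky.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 20_000
mu = 1.0                                        # exponential(1): mean 1, sd 1

x = rng.exponential(mu, size=(reps, n))
xbar = x.mean(axis=1)
s_n = x.std(axis=1, ddof=1)                     # sample standard deviation S_n
t_stat = np.sqrt(n) * (xbar - mu) / s_n

print("mean (should be near 0):", round(t_stat.mean(), 3))
print("variance (should be near 1):", round(t_stat.var(), 3))
print("P(T <= 1.645) (should be near 0.95):", round(np.mean(t_stat <= 1.645), 3))
```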

5.5.4 The Delta Method

First, we look at a motivating example.

Example 5.5.19 (Estimating the odds)

Suppose we observe $X_1, X_2, \ldots, X_n$ independent Bernoulli($p$) random variables. The typical parameter of interest is $p$, but another popular parameter is the odds, $\frac{p}{1-p}$. As we would estimate $p$ by $\hat p = \sum_i X_i / n$, we might consider using $\frac{\hat p}{1 - \hat p}$ as an estimate of $\frac{p}{1-p}$. But what are the properties of this estimator? How might we estimate the variance of $\frac{\hat p}{1 - \hat p}$?

Definition

If a function $g(x)$ has derivatives of order $r$, that is, $g^{(r)}(x) = \frac{d^r}{dx^r} g(x)$ exists, then for any constant $a$, the Taylor polynomial of order $r$ about $a$ is
$$T_r(x) = \sum_{i=0}^r \frac{g^{(i)}(a)}{i!}\, (x - a)^i.$$

Theorem (Taylor)

If $g^{(r)}(a) = \frac{d^r}{dx^r} g(x)\big|_{x=a}$ exists, then
$$\lim_{x\to a} \frac{g(x) - T_r(x)}{(x - a)^r} = 0.$$

Since we are interested in approximations, we are just going to ignore the remainder. There are, however, many explicit forms, one useful one being

$$g(x) - T_r(x) = \int_a^x \frac{g^{(r+1)}(t)}{r!}\, (x - t)^r\, dt.$$

Now we consider the multivariate case of the Taylor series. Let $T_1, \ldots, T_k$ be random variables with means $\theta_1, \ldots, \theta_k$, and define $T = (T_1, \ldots, T_k)$ and $\theta = (\theta_1, \ldots, \theta_k)$. Suppose there is a differentiable function $g(T)$ (an estimator of some parameter) for which we want an approximate estimate of variance. Define
$$g_i'(\theta) = \frac{\partial}{\partial t_i}\, g(t)\Big|_{t_1=\theta_1, \ldots, t_k=\theta_k}.$$
The first-order Taylor series expansion of $g$ about $\theta$ is
$$g(t) = g(\theta) + \sum_{i=1}^k g_i'(\theta)(t_i - \theta_i) + \text{Remainder}.$$
For our statistical approximation we forget about the remainder and write
$$g(t) \approx g(\theta) + \sum_{i=1}^k g_i'(\theta)(t_i - \theta_i).$$


Now, take expectation on both sides to get

$$E_\theta\, g(T) \approx g(\theta) + \sum_{i=1}^k g_i'(\theta)\, E_\theta(T_i - \theta_i) = g(\theta).$$
We can now approximate the variance of $g(T)$ by
$$\operatorname{Var}_\theta g(T) \approx E_\theta\bigl([g(T) - g(\theta)]^2\bigr) \approx E_\theta\Bigl(\Bigl(\sum_{i=1}^k g_i'(\theta)(T_i - \theta_i)\Bigr)^2\Bigr) = \sum_{i=1}^k [g_i'(\theta)]^2 \operatorname{Var}_\theta T_i + 2\sum_{i>j} g_i'(\theta)\, g_j'(\theta) \operatorname{Cov}_\theta(T_i, T_j).$$

This approximation is very useful because it gives us a variance formula for a general function, using only simple variances and covariances.

Example (Continuation of Example 5.5.19)

In our above notation, take $g(p) = \frac{p}{1-p}$, so $g'(p) = \frac{1}{(1-p)^2}$ and
$$\operatorname{Var}\Bigl(\frac{\hat p}{1 - \hat p}\Bigr) \approx [g'(p)]^2 \operatorname{Var}(\hat p) = \Bigl[\frac{1}{(1-p)^2}\Bigr]^2 \frac{p(1-p)}{n} = \frac{p}{n(1-p)^3},$$
giving us an approximation for the variance of our estimator.
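The quality of this approximation can be checked by simulation. In the sketch below (the values of $p$ and $n$ and the seed are hypothetical choices), the Monte Carlo variance of $\hat p/(1-\hat p)$ is compared with $p/(n(1-p)^3)$.

```python
# Monte Carlo check of the delta-method variance approximation for the odds estimator.
import numpy as np

rng = np.random.default_rng(2)
p, n, reps = 0.3, 200, 100_000

phat = rng.binomial(n, p, size=reps) / n
odds_hat = phat / (1.0 - phat)                 # the estimator phat/(1 - phat)

mc_var = odds_hat.var()                        # simulated variance of the estimator
delta_var = p / (n * (1.0 - p) ** 3)           # first-order (delta-method) value

print(f"Monte Carlo variance : {mc_var:.6f}")
print(f"Approximation        : {delta_var:.6f}")
```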

Example (Approximate mean and variance)

Suppose $X$ is a random variable with $E_\mu X = \mu \ne 0$. If we want to estimate a function $g(\mu)$, a first-order approximation would give us
$$g(X) = g(\mu) + g'(\mu)(X - \mu).$$
If we use $g(X)$ as an estimator of $g(\mu)$, we can say that approximately
$$E_\mu g(X) \approx g(\mu)$$
and
$$\operatorname{Var}_\mu g(X) \approx [g'(\mu)]^2 \operatorname{Var}_\mu X.$$

Theorem 5.5.24 (Delta method)

Let $Y_n$ be a sequence of random variables that satisfies $\sqrt{n}(Y_n - \theta) \to N(0, \sigma^2)$ in distribution. For a given function $g$ and a specific value of $\theta$, suppose that $g'(\theta)$ exists and is not 0. Then
$$\sqrt{n}\,[g(Y_n) - g(\theta)] \to N(0, \sigma^2[g'(\theta)]^2)$$
in distribution.

Proof: The Taylor expansion of $g(Y_n)$ around $Y_n = \theta$ is
$$g(Y_n) = g(\theta) + g'(\theta)(Y_n - \theta) + \text{remainder},$$
where the remainder $\to 0$ as $Y_n \to \theta$. Since $Y_n \to \theta$ in probability, it follows that the remainder $\to 0$ in probability. By applying Slutsky's theorem (a),
$$g'(\theta)\,\sqrt{n}(Y_n - \theta) \to g'(\theta)X, \quad \text{where } X \sim N(0, \sigma^2).$$
Since $\sqrt{n}[g(Y_n) - g(\theta)]$ differs from $g'(\theta)\sqrt{n}(Y_n - \theta)$ only by a term that goes to 0 in probability, it follows that
$$\sqrt{n}\,[g(Y_n) - g(\theta)] \to N(0, \sigma^2[g'(\theta)]^2). \qquad \square$$

Example

Suppose now that we have the mean of a random sample, $\bar X$. For $\mu \ne 0$, we have
$$\sqrt{n}\Bigl(\frac{1}{\bar X} - \frac{1}{\mu}\Bigr) \to N\Bigl(0, \Bigl(\frac{1}{\mu}\Bigr)^4 \operatorname{Var}_\mu X_1\Bigr)$$
in distribution.
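A simulation along the same lines (the normal data, $\mu$, $n$, and seed below are illustrative assumptions) confirms that the variance of $\sqrt{n}(1/\bar X - 1/\mu)$ is close to $(1/\mu)^4 \operatorname{Var}_\mu X_1$.

```python
# Delta method for g(x) = 1/x: simulated variance versus (1/mu)^4 * Var X_1.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 10_000
mu, sigma2 = 2.0, 1.0                          # X_i ~ N(2, 1), so mu != 0

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
stat = np.sqrt(n) * (1.0 / x.mean(axis=1) - 1.0 / mu)

print("simulated variance :", round(stat.var(), 4))
print("(1/mu)^4 * Var X_1 :", (1.0 / mu) ** 4 * sigma2)   # 0.0625 here
```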

There are two extensions of the basic Delta method that we need to deal with to complete our treatment. The first concerns the possibility that $g'(\theta) = 0$.

(Second-order Delta Method)

Let $Y_n$ be a sequence of random variables that satisfies $\sqrt{n}(Y_n - \theta) \to N(0, \sigma^2)$ in distribution. For a given function $g$ and a specific value of $\theta$, suppose that $g'(\theta) = 0$ and $g''(\theta)$ exists and is not 0. Then
$$n[g(Y_n) - g(\theta)] \to \sigma^2\, \frac{g''(\theta)}{2}\, \chi^2_1$$
in distribution.
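As a concrete (hypothetical) instance, take $g(y) = y^2$ and $\theta = 0$, so that $g'(0) = 0$ and $g''(0) = 2$; then $n\,g(\bar Y_n)$ should behave like $\sigma^2 \chi^2_1$. The sketch below checks the limiting mean and variance.

```python
# Second-order delta method with g(y) = y^2 at theta = 0: n*g(Ybar) ~ sigma^2 * chi^2_1.
import numpy as np

rng = np.random.default_rng(4)
n, reps, sigma2 = 400, 20_000, 1.0

y = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
stat = n * y.mean(axis=1) ** 2                 # n[g(Ybar) - g(0)] with g(y) = y^2

# A chi-square(1) variable has mean 1 and variance 2.
print("mean (expect about sigma^2 = 1)      :", round(stat.mean(), 3))
print("variance (expect about 2*sigma^4 = 2):", round(stat.var(), 3))
```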

Next we consider the extension of the basic Delta method to the multivariate case.

Theorem 5.5.28

Let $X_1, \ldots, X_n$ be a random sample with $E(X_{ij}) = \mu_i$ and $\operatorname{Cov}(X_{ik}, X_{jk}) = \sigma_{ij}$. For a given function $g$ with continuous first partial derivatives and a specific value of $\mu = (\mu_1, \ldots, \mu_p)$ for which
$$\tau^2 = \sum_i \sum_j \sigma_{ij}\, \frac{\partial g(\mu)}{\partial \mu_i}\, \frac{\partial g(\mu)}{\partial \mu_j} > 0,$$
$$\sqrt{n}\,[g(\bar X_1, \ldots, \bar X_p) - g(\mu_1, \ldots, \mu_p)] \to N(0, \tau^2)$$
in distribution.
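For a concrete (hypothetical) illustration, take $g(m_1, m_2) = m_1/m_2$, the ratio of two means, with partial derivatives $1/\mu_2$ and $-\mu_1/\mu_2^2$ at $\mu$. The sketch below (the bivariate normal data, means, and covariance matrix are assumptions made for this example) compares the simulated variance of $\sqrt{n}[g(\bar X_1, \bar X_2) - g(\mu_1, \mu_2)]$ with $\tau^2$.

```python
# Multivariate delta method for the ratio of means g(m1, m2) = m1/m2.
import numpy as np

rng = np.random.default_rng(5)
n, reps = 400, 10_000
mu = np.array([1.0, 2.0])
cov = np.array([[1.0, 0.3],
                [0.3, 0.5]])                            # sigma_ij

grad = np.array([1.0 / mu[1], -mu[0] / mu[1] ** 2])     # partial derivatives of g at mu
tau2 = grad @ cov @ grad                                # sum_ij sigma_ij g_i g_j

x = rng.multivariate_normal(mu, cov, size=(reps, n))    # shape (reps, n, 2)
xbar = x.mean(axis=1)                                   # sample mean vectors, shape (reps, 2)
stat = np.sqrt(n) * (xbar[:, 0] / xbar[:, 1] - mu[0] / mu[1])

print("simulated variance :", round(stat.var(), 4))
print("tau^2              :", round(float(tau2), 4))
```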
