Chapter 5. Hypothesis Testing


1 Nested Hypotheses

In this chapter we provide a theoretical discussion on testing of statistical hypotheses.

In 1928, Neyman and Pearson proposed a general recipe for hypothesis testing, now known as the likelihood ratio test. Neyman and Pearson (1933) then presented the Neyman-Pearson Fundamental Lemma, which clarified many of the complex problems in testing statistical hypotheses. In this chapter we treat the likelihood ratio test and two related test statistics, each based on the maximum likelihood method. For the general case only the asymptotic distributions of the test statistics have been established.

Let X₁, . . . , X_n be iid with distribution F_θ belonging to a family F = {F_θ, θ ∈ Θ}, where Θ ⊂ R^s. Let the distributions F_θ possess densities or mass functions f(x; θ). Assume that the information matrix I(θ) exists and is positive definite.

Suppose we are concerned with statistical tests of r < s independent equality restrictions on the (s × 1) parameter vector θ⁰, which we represent by the implicit side relations

$$R_i(\theta) = 0, \quad i = 1, 2, \ldots, r. \tag{1}$$

The vectors that satisfy these equations form an (s − r)-dimensional subspace Θ⁰ of the parameter space Θ, and we shall consider the null hypothesis that θ⁰ lies in this subspace.

In the set-up of hypothesis testing, we consider the null hypothesis H₀ : θ⁰ ∈ Θ⁰ (to be tested), where Θ⁰ is the subset of Θ determined by the r (≤ s) restrictions given by (1). The restrictions under review are thus set within the context of a wider parent model, which provides the maintained hypothesis and defines the alternative hypothesis. In the case of a simple hypothesis H₀ : θ = θ⁰, we have Θ⁰ = {θ⁰}, and the functions R_i(θ) may be taken to be

$$R_i(\theta) = \theta_i - \theta_i^0, \quad 1 \le i \le s.$$

In the case of a composite hypothesis, the set Θ⁰ contains more than one element and we necessarily have r < s.

In this classical setting the null hypothesis is sometimes called a nested hypothesis, with obvious reference to the relative position of Θ⁰ and Θ. In the following discussions, θ can range freely over Θ, so that functions of θ are several times over differentiable in respect of all its elements, at least in the neighborhood of θ⁰. This property continues to hold under H₀. Since Θ⁰ and Θ − Θ⁰ together form the entire well-behaved parameter space Θ, we can differentiate functions of θ at θ⁰ ∈ Θ⁰ in all directions, including those leading to a passage into the alternative parameter space Θ − Θ⁰. We shall make use of this facility in deriving the three tests of this section.

Note that each restriction, R_i(θ) = 0, is in some (re-)parametrization equivalent to putting one parameter equal to zero, or suppressing it. This usually means a simplification of the model, and in this sense hypotheses express simplifying assumptions, as is reflected by the corresponding reduction in the number of dimensions of the parameter space. Starting from a loosely specified very general model with a surfeit of parameters, we may arrange successive simplifications in a definite order so as to nest subspaces of ever smaller dimensionality within one another. In obeisance to the principle of parsimony we should then move down along this sequence, paring down the number of parameters and thus gradually carving the final sparse specification out of the overblown parent model in which it is concealed, testing all the while to see how far we can go. But simplifying assumptions or nested hypotheses, and the need for relevant statistical tests, do of course also arise outside this specific context in the natural pursuit of parsimony.

2 Constrained Estimation

In the previous section we discussed statistical tests of r independent equality restrictions on the s × 1 parameter vector θ⁰. The simplest way of estimating θ⁰ subject to (1) is to eliminate r parameters. This can be done by adopting a transformation η = R(θ), taking care to define the first r elements of the vector function so that they correspond to the restrictions (1), that is, R_i(θ) = 0 for 1 ≤ i ≤ r. The remaining functions R_i(θ) for i = r + 1, . . . , s can be chosen at will, provided the existence of the inverse transformation θ = R⁻¹(η) is assured. Whenever possible the identity is a popular choice. Under H₀ we now have

$$\eta^0 = \begin{bmatrix} 0 \\ \eta_*^0 \end{bmatrix},$$

and the remaining s − r elements, which form the subvector η∗, can be estimated without constraint.

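This will yield an estimator $\hat\eta_*$ with covariance matrix $\mathrm{Var}(\hat\eta_*)$. The constrained estimator of the full parameter vector and its covariance matrix are then

$$\tilde\eta = \begin{bmatrix} 0 \\ \hat\eta_* \end{bmatrix}, \qquad \mathrm{Var}(\tilde\eta) = \begin{bmatrix} 0 & 0 \\ 0 & \mathrm{Var}(\hat\eta_*) \end{bmatrix}.$$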

Then we can find the constrained estimator of the original parameter vector θ⁰.

This is a practical method of estimation, but it is not very helpful when we wish to examine the asymptotic distribution of the constrained estimate; for this purpose we turn to constrained maximization of the likelihood function by the Lagrange Multiplier method.

To begin with, we rewrite the restrictions (1) as an r × 1 vector function g_R(θ) = 0. Likewise we write

$$G_R(\theta) = \left( \frac{\partial}{\partial\theta_j} R_i(\theta) \right)_{r \times s}.$$

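The new maximand is $L(\theta) - g_R(\theta)^T \mu$, with µ a vector of r Lagrange multipliers. Differentiation yields s + r first-order conditions that must be satisfied by the constrained estimators $\tilde\theta$ and $\tilde\mu$, namely,

$$Q(\tilde\theta) - G_R(\tilde\theta)^T \tilde\mu = 0, \qquad g_R(\tilde\theta) = 0, \tag{2}$$

where

$$Q(\theta)^T = \left( \frac{\partial}{\partial\theta_j} L(\theta) \right)_{1 \times s}.$$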

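Now we examine these estimators under H₀, when they are appropriate. Since maximum likelihood estimators are consistent, $\tilde\theta$ converges to θ⁰ ∈ Θ⁰ as the sample size increases. We suppose that $\tilde\theta$ is sufficiently close to θ⁰ to justify several large-sample approximations, as follows:

$$G_R(\tilde\theta)^T \tilde\mu \approx G_R(\theta^0)^T \tilde\mu, \tag{3}$$

$$Q(\tilde\theta) \approx Q(\theta^0) + H(\theta^0)(\tilde\theta - \theta^0), \tag{4}$$

$$g_R(\tilde\theta) \approx G_R(\theta^0)(\tilde\theta - \theta^0), \tag{5}$$

where

$$H(\theta) = \left( \frac{\partial^2}{\partial\theta_j \partial\theta_k} L(\theta) \right)_{s \times s}.$$

H₀ is used in (5), where we take it that g_R(θ⁰) = 0.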

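Upon substitution of these approximations into (2) and some rearrangement of the terms we obtain a system of r + s simultaneous linear equations

$$\begin{bmatrix} -H(\theta^0) & G_R(\theta^0)^T \\ G_R(\theta^0) & 0 \end{bmatrix} \begin{bmatrix} \tilde\theta - \theta^0 \\ \tilde\mu \end{bmatrix} \approx \begin{bmatrix} Q(\theta^0) \\ 0 \end{bmatrix}$$

or again

$$\begin{bmatrix} -\frac{1}{n} H(\theta^0) & G_R(\theta^0)^T \\ G_R(\theta^0) & 0 \end{bmatrix} \begin{bmatrix} \sqrt{n}(\tilde\theta - \theta^0) \\ \frac{1}{\sqrt{n}} \tilde\mu \end{bmatrix} \approx \begin{bmatrix} \frac{1}{\sqrt{n}} Q(\theta^0) \\ 0 \end{bmatrix}. \tag{6}$$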

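It is known that $-n^{-1} H(\theta^0) \xrightarrow{P} I(\theta^0)$ by the law of large numbers. Upon substitution in (6) this gives

$$\begin{bmatrix} \sqrt{n}(\tilde\theta - \theta^0) \\ \frac{1}{\sqrt{n}} \tilde\mu \end{bmatrix} \approx \begin{bmatrix} I(\theta^0) & G_R(\theta^0)^T \\ G_R(\theta^0) & 0 \end{bmatrix}^{-1} \begin{bmatrix} \frac{1}{\sqrt{n}} Q(\theta^0) \\ 0 \end{bmatrix}. \tag{7}$$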

It follows directly from the multivariate Lindeberg-Lévy CLT that

$$\frac{1}{\sqrt{n}} Q(\theta^0) \xrightarrow{d} N(0, I(\theta^0)).$$

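Observe that

$$\begin{bmatrix} I(\theta^0) & G_R(\theta^0)^T \\ G_R(\theta^0) & 0 \end{bmatrix}^{-1} = \begin{bmatrix} I^{-1}(\theta^0) - I^{-1}(\theta^0) G_R(\theta^0)^T [A(\theta^0)]^{-1} G_R(\theta^0) I^{-1}(\theta^0) & I^{-1}(\theta^0) G_R(\theta^0)^T [A(\theta^0)]^{-1} \\ [A(\theta^0)]^{-1} G_R(\theta^0) I^{-1}(\theta^0) & -[A(\theta^0)]^{-1} \end{bmatrix},$$

where $A(\theta^0) = G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T$. (The lower-right block carries a minus sign, as multiplying out the two matrices confirms.) It follows that the r + s vector on the left of (7) is also asymptotically normal with zero mean and with covariance matrix

$$\begin{bmatrix} I(\theta^0) & G_R(\theta^0)^T \\ G_R(\theta^0) & 0 \end{bmatrix}^{-1} \begin{bmatrix} I(\theta^0) & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} I(\theta^0) & G_R(\theta^0)^T \\ G_R(\theta^0) & 0 \end{bmatrix}^{-1} \tag{8}$$

$$= \begin{bmatrix} I^{-1}(\theta^0) - I^{-1}(\theta^0) G_R(\theta^0)^T [A(\theta^0)]^{-1} G_R(\theta^0) I^{-1}(\theta^0) & \cdots \\ \cdots & [G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T]^{-1} \end{bmatrix}.$$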

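This yields that the asymptotic variance of $\sqrt{n}(\tilde\theta - \theta^0)$ is

$$I^{-1}(\theta^0) - I^{-1}(\theta^0) G_R(\theta^0)^T \left[ G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T \right]^{-1} G_R(\theta^0) I^{-1}(\theta^0) \tag{9}$$

$$= I^{-1/2}(\theta^0) \left( I_s - I^{-1/2}(\theta^0) G_R(\theta^0)^T \left[ G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T \right]^{-1} G_R(\theta^0) I^{-1/2}(\theta^0) \right) I^{-1/2}(\theta^0),$$

where $I_s$ denotes the identity matrix of order s.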

The matrix in parentheses has the same structure, and hence much the same properties, as the “projection matrix” of the linear regression model in its generalized least squares version. It is idempotent, as is readily verified, and hence its rank equals its trace. This trace is the difference of the traces of two terms. The first is a unit matrix of order s, with trace s; the second term is itself idempotent and of rank r, since G_R(θ⁰) has rank r, and hence of trace r. Altogether the rank of (9) is s − r.

As for $\tilde\mu$, we seldom determine these estimates explicitly, and there is little interest in their asymptotic covariance matrix. For the discussion of the Lagrange multiplier test we mention that the asymptotic variance of $n^{-1/2}\tilde\mu$ is

$$\left[ G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T \right]^{-1}. \tag{10}$$
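The rank argument for (9) can be checked on a small case. The sketch below uses a hypothetical 2 × 2 information matrix and the single restriction θ₁ − θ₂ = 0 (so s = 2, r = 1); these numbers are illustrative assumptions, not taken from the text. It forms the covariance matrix (9) with exact rational arithmetic and checks that its product with I(θ⁰), which is similar to the matrix in parentheses in (9), is idempotent with trace s − r.

```python
from fractions import Fraction as Fr

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical inputs: a positive definite I(theta0) and G_R(theta0)
# for the single restriction theta_1 - theta_2 = 0.
info = [[Fr(2), Fr(1)], [Fr(1), Fr(3)]]
G = [[Fr(1), Fr(-1)]]

info_inv = inv2(info)
A = matmul(matmul(G, info_inv), transpose(G))       # 1 x 1 matrix G I^{-1} G^T
A_inv = [[1 / A[0][0]]]

# Second term of (9): I^{-1} G^T A^{-1} G I^{-1}
correction = matmul(matmul(matmul(matmul(info_inv, transpose(G)), A_inv), G), info_inv)
cov = [[info_inv[i][j] - correction[i][j] for j in range(2)] for i in range(2)]

# cov * I(theta0) = I^{-1/2} (matrix in parentheses) I^{1/2}:
# same trace and same idempotency as the matrix in parentheses.
M = matmul(cov, info)
trace = M[0][0] + M[1][1]

print("covariance (9):", cov)
print("trace:", trace)
```

In this example the restriction direction carries no variance: `matmul(G, cov)` is the zero row, which is the covariance counterpart of g_R(θ̃) = 0.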

3 Hypothesis Testing By Likelihood Methods

Let H₀ denote a null hypothesis to be tested. Typically, we may represent H₀ as a specified family F₀ of distributions for the data. For any test procedure T , we shall denote by Tn the version based on a sample of size n. The function

γ_n(T, F ) = P_F(T_n rejects H₀),

defined for a distribution function F, is called the power function of T_n (or of T). For F ∈ F₀, γ_n(T, F) represents the probability of a Type I error. The quantity

$$\alpha_n(T, \mathcal{F}_0) = \sup_{F \in \mathcal{F}_0} \gamma_n(T, F)$$

is called the size of the test. For F ∉ F₀, the quantity

$$\beta_n(T, F) = 1 - \gamma_n(T, F)$$

represents the probability of a Type II error. Usually, attention is confined to consistent tests: for fixed F ∉ F₀, β_n(T, F) → 0 as n → ∞. Attention is also usually confined to unbiased tests: for F ∉ F₀, γ_n(T, F) ≥ α_n(T, F₀).

A general way to compare two such test procedures is through their power functions. In this regard we shall use the concept of asymptotic relative efficiency (ARE).

For two test procedures T_A and T_B, suppose that a performance criterion is tightened in such a way that the respective sample sizes n_A and n_B for T_A and T_B to perform “equivalently” tend to ∞ but have ratio n_A/n_B tending to some limit. Then the limit represents the ARE of procedure T_B relative to procedure T_A and is denoted by e(T_B, T_A).

The earliest approach to ARE was introduced by Pitman (1949). In this approach, two test sequences T = {T_n} and U = {U_n} are compared as the Type I and Type II error probabilities tend to positive limits α and β, respectively. In order that α_n → α > 0 and simultaneously β_n → β > 0, it is necessary to consider β_n(·) evaluated at an alternative F⁽ⁿ⁾ converging at a suitable rate to the null hypothesis F₀.

In justification of this approach, we might argue that large sample sizes would be relevant in practice only if the alternative of interest were close to the null hypothesis and thus hard to distinguish with only a small sample.
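The need for alternatives that approach the null hypothesis can be seen in a simple case. The sketch below uses the one-sided z-test of a normal mean with known variance, an example chosen here for illustration and not taken from the text; the function name `z_test_power` is hypothetical. At a fixed alternative the power tends to 1 (consistency), so the Type II error has no positive limit, whereas along the local sequence θ_n = ∆/√n the power stays constant.

```python
import math
from statistics import NormalDist

def z_test_power(theta, n, sigma=1.0, alpha=0.05):
    """Power of the level-alpha one-sided z-test of H0: theta = 0
    against theta > 0, based on n normal observations with known sigma."""
    std = NormalDist()
    z_alpha = std.inv_cdf(1 - alpha)
    return 1 - std.cdf(z_alpha - math.sqrt(n) * theta / sigma)

sizes = (25, 100, 400, 1600)

# Fixed alternative theta = 0.2: power increases to 1 as n grows.
fixed = [z_test_power(0.2, n) for n in sizes]

# Local alternatives theta_n = 2 / sqrt(n): power does not depend on n.
local = [z_test_power(2.0 / math.sqrt(n), n) for n in sizes]

print("fixed alternative:", [round(p, 4) for p in fixed])
print("local alternatives:", [round(p, 4) for p in local])
```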

3.1 Test Statistics for A Simple Null Hypothesis

Although the theory of this section is of most value for composite null hypotheses, it is convenient to begin with a simple null hypothesis. Consider testing H₀ : θ = θ⁰.

A likelihood ratio statistic,

$$\Lambda_n = \frac{\mathrm{Lik}(\theta^0; x)}{\sup_{\theta \in \Theta} \mathrm{Lik}(\theta; x)},$$

was introduced by Neyman and Pearson (1928). Clearly, Λ_n takes values in the interval [0, 1], and H₀ is to be rejected for sufficiently small values of Λ_n. Equivalently, the test may be carried out in terms of the statistic

$$\lambda_n = -2 \log \Lambda_n,$$

which turns out to be more convenient for asymptotic considerations. For finite n, the null distribution of λ_n will generally depend on n and on the form of the pdf of X. However, for regular problems there is a uniform limiting result as n → ∞.

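Expanding λ_n in a Taylor series about $\hat\theta$, we get

$$\lambda_n = -2 \left\{ -\sum_{i=1}^n \log f(X_i; \hat\theta) + \sum_{i=1}^n \log f(X_i; \theta^0) \right\} = 2 \left\{ \frac{1}{2} (\theta^0 - \hat\theta)^T \left( -\sum_{i=1}^n \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(X_i; \theta) \Big|_{\theta = \theta^*} \right) (\theta^0 - \hat\theta) \right\},$$

where $\theta^*$ lies between $\hat\theta$ and $\theta^0$; the first-order term vanishes because $\hat\theta$ solves the likelihood equations. Since $\hat\theta$ is consistent, so is $\theta^*$, and

$$\lambda_n = n (\hat\theta - \theta^0)^T \left( -\frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(X_i; \theta) \Big|_{\theta = \theta^0} \right) (\hat\theta - \theta^0) + o_P(1).$$

By the asymptotic normality of $\hat\theta$ and the convergence of $-\frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(X_i; \theta) \big|_{\theta = \theta^0}$ to $I(\theta^0)$, λ_n has, under H₀, a limiting chi-squared distribution on s degrees of freedom.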

Let ˆθ_n denote a consistent, asymptotically normal, and asymptotically efficient sequence of solutions of the likelihood equations. Denote the efficient scores

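$$q(x; \theta) = (q_1(x; \theta), \ldots, q_s(x; \theta))^T, \qquad q_j(x; \theta) = \frac{\partial}{\partial\theta_j} \log f(x; \theta).$$

Replacing the matrix $-\frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(X_i; \theta) \big|_{\theta = \theta^0}$ by $I(\hat\theta_n)$, we get a second statistic,

$$W_n = n (\hat\theta_n - \theta^0)^T I(\hat\theta_n) (\hat\theta_n - \theta^0),$$

which was introduced by Wald (1943).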

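Write $Q(\theta) = \sum_{i=1}^n q(X_i; \theta)$. A second large-sample equivalent to λ_n can be obtained from the expansion

$$\hat\theta - \theta^0 = n^{-1} I^{-1}(\theta^0) Q(\theta^0) + o_P(n^{-1/2}).$$

This gives a third statistic,

$$V_n = n [n^{-1} Q(\theta^0)]^T I^{-1}(\theta^0) [n^{-1} Q(\theta^0)] = n^{-1} Q(\theta^0)^T I^{-1}(\theta^0) Q(\theta^0),$$

which was introduced by Rao (1948).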

The three statistics differ somewhat in computational features. Note that Rao’s statistic does not require explicit computation of the maximum likelihood estimates.

Nevertheless, all three statistics have the same limiting chi-squared distribution, with s degrees of freedom, under the null hypothesis. The limiting distribution can be found by the following lemma.


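Lemma 1 Under regularity conditions,

(i) $n^{-1/2} Q(\theta^0) \xrightarrow{d} N(0, I(\theta^0))$;

(ii) $n^{1/2} (\hat\theta_n - \theta^0) \xrightarrow{d} N(0, I^{-1}(\theta^0))$;

(iii) $n (\hat\theta_n - \theta^0)^T I(\theta^0) (\hat\theta_n - \theta^0) \xrightarrow{d} \chi^2_s$;

(iv) $n^{-1} Q(\theta^0)^T I^{-1}(\theta^0) Q(\theta^0) \xrightarrow{d} \chi^2_s$;

(v) $\lambda_n - n (\hat\theta_n - \theta^0)^T I(\theta^0) (\hat\theta_n - \theta^0) \xrightarrow{P} 0$.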

Remark. Consider a sequence of n independent trials, with s possible outcomes for each trial. Let θ_j denote the probability of occurrence of the jth outcome in any given trial, and let N_j denote the number of occurrences of the jth outcome in the series of n trials. The MLEs of the θ_j are N_j/n. The three test statistics λ_n, W_n and V_n for testing H₀ : θ = θ⁰ against H_A : θ ≠ θ⁰ are easily seen to be

$$\lambda_n = 2 \sum_{j=1}^s N_j \log\left( \frac{N_j}{n \theta_j^0} \right), \qquad W_n = \sum_{j=1}^s \frac{(N_j - n \theta_j^0)^2}{N_j}, \qquad V_n = \sum_{j=1}^s \frac{(N_j - n \theta_j^0)^2}{n \theta_j^0}.$$

Both W_n and V_n are referred to as chi-squared goodness of fit statistics; the latter is often called the Pearson chi-squared statistic. Its large sample properties were first derived by Pearson (1900).
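As a minimal sketch of the three formulas in the Remark, the function below (the name `multinomial_tests` and the example counts are illustrative assumptions) evaluates λ_n, W_n and V_n from observed counts and hypothesized probabilities. It assumes every count is positive, since W_n divides by N_j.

```python
import math

def multinomial_tests(counts, theta0):
    """Return (lambda_n, W_n, V_n) for H0: theta = theta0.

    counts -- observed counts N_1, ..., N_s (all assumed positive)
    theta0 -- hypothesized cell probabilities, summing to one
    """
    n = sum(counts)
    expected = [n * t for t in theta0]
    lr = 2 * sum(c * math.log(c / e) for c, e in zip(counts, expected))
    wald = sum((c - e) ** 2 / c for c, e in zip(counts, expected))
    score = sum((c - e) ** 2 / e for c, e in zip(counts, expected))
    return lr, wald, score

# Hypothetical data: 100 trials, three outcomes.
lr, wald, score = multinomial_tests([18, 55, 27], [0.25, 0.5, 0.25])
print(f"lambda_n = {lr:.4f}, W_n = {wald:.4f}, V_n = {score:.4f}")
```

The three values differ in a finite sample even though they share a limiting distribution; only V_n uses the hypothesized probabilities in the denominators, so it needs no estimate beyond the counts themselves.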

Let us now consider the behavior of λ_n, W_n and V_n under “local” alternatives, that is, for a sequence {θ_n} of the form

$$\theta_n = \theta^0 + n^{-1/2} \Delta,$$

where ∆ = (∆₁, . . . , ∆_s)^T. Suppose that the convergences expressed in the above lemma may be established uniformly for θ in a neighborhood of θ⁰. It then would follow that, under θ_n,

$$n^{1/2} (\hat\theta - \theta^0) = n^{1/2} (\hat\theta - \theta_n) + \Delta \xrightarrow{d} N(\Delta, I^{-1}(\theta^0)),$$

$$n^{-1/2} Q(\theta^0) = I(\theta^0)\, n^{1/2} (\hat\theta - \theta^0) + o_{P_{\theta_n}}(1) \xrightarrow{d} N(I(\theta^0) \Delta, I(\theta^0)),$$

and

$$\lambda_n - W_n \xrightarrow{P_{\theta_n}} 0.$$

It then follows that the statistics λ_n, W_n and V_n each converge in distribution to $\chi^2_s(\Delta^T I(\theta^0) \Delta)$, the noncentral chi-squared distribution with s degrees of freedom and noncentrality parameter $\Delta^T I(\theta^0) \Delta$.

Therefore, under appropriate regularity conditions, the statistics λ_n, W_n and V_n are asymptotically equivalent in distribution, both under the null hypothesis and under local alternatives converging sufficiently fast. However, at fixed alternatives these equivalences are not anticipated to hold.

3.2 Likelihood Confidence Region

Due to the duality between tests and confidence regions, families of tests can generate confidence regions. Let {δ(x, θ)} be a family of tests such that δ(x, θ⁰) is a test of level of significance α for testing H₀ : θ = θ⁰. Each δ(x, θ) takes the value 1 or 0: when δ(x, θ⁰) = 1 we reject H₀; otherwise we do not reject H₀.


Define the subset C(x) of Θ by C(x) = {θ : δ(x, θ) = 0}. This is just the set of all θ that would not be rejected if we observe X = x and used the given family of tests.

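Based on the likelihood ratio tests, we can construct a 1 − α likelihood-based confidence region of θ by C(x) = {θ : λ_n(θ) ≤ χ²_s(1 − α)}, where λ_n(θ) denotes the statistic λ_n for testing the null value θ and χ²_s(1 − α) is the 1 − α quantile of the chi-squared distribution with s degrees of freedom.

Remark. Under regularity conditions, the asymptotic distribution of $\sqrt{n}(\hat\theta - \theta^0)$ is usually normal, which leads to confidence ellipsoids for θ. However, the confidence regions of θ obtained from the likelihood ratio tests are not necessarily ellipsoids.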
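The contrast drawn in the Remark shows up already for s = 1. The sketch below assumes a Bernoulli model with k successes in n trials (an illustrative choice, not from the text) and inverts the likelihood ratio test by bisection; the helper names are hypothetical, and the constant is the 0.95 quantile of the chi-squared distribution with one degree of freedom. The normal-theory (Wald) interval is symmetric about the estimate, while the likelihood interval is not.

```python
import math

CHI2_1_95 = 3.841458820694124  # 0.95 quantile of chi-squared, 1 df

def lr_stat(p, k, n):
    """lambda_n for H0: success probability = p (requires 0 < k < n)."""
    p_hat = k / n
    return 2 * (k * math.log(p_hat / p)
                + (n - k) * math.log((1 - p_hat) / (1 - p)))

def bisect_root(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f changes sign exactly once there."""
    sign_lo = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def lr_interval(k, n, crit=CHI2_1_95):
    """Likelihood-based region {p : lambda_n(p) <= crit}, an interval here."""
    p_hat = k / n
    g = lambda p: lr_stat(p, k, n) - crit
    return bisect_root(g, 1e-12, p_hat), bisect_root(g, p_hat, 1 - 1e-12)

def wald_interval(k, n, crit=CHI2_1_95):
    """Symmetric interval from the normal approximation to p_hat."""
    p_hat = k / n
    half = math.sqrt(crit * p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

print("likelihood interval:", lr_interval(3, 20))
print("Wald interval:      ", wald_interval(3, 20))
```

With 3 successes in 20 trials the Wald interval extends below zero, whereas the likelihood interval stays inside (0, 1) and is longer on the right of the estimate than on the left.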

3.3 Likelihood Ratio Test

This test is based on the likelihood function. If H₀ holds, and hence θ⁰ ∈ Θ⁰ = {θ ∈ Θ, R_i(θ) = 0, 1 ≤ i ≤ r}, the unconstrained maximum L(ˆθ) should be close to the constrained maximum L(˜θ). We therefore consider the likelihood ratio

$$\Lambda_n = \frac{\sup_{\theta \in \Theta^0} \mathrm{Lik}(\theta; x)}{\sup_{\theta \in \Theta} \mathrm{Lik}(\theta; x)} = \frac{\mathrm{Lik}(\tilde\theta; x)}{\mathrm{Lik}(\hat\theta; x)}.$$

As all likelihoods are positive, and as the constrained maximum cannot exceed the unconstrained maximum, 0 < Λ_n ≤ 1. Equivalently, we use the quantity

$$\lambda_n = -2 \log \Lambda_n = 2 [L(\hat\theta) - L(\tilde\theta)],$$

which is therefore always nonnegative. We shall show that under H₀ it is asymptotically distributed as chi-square with r degrees of freedom,

$$\lambda_n \xrightarrow{d} \chi^2_r,$$

so that λ_n can serve as a test statistic for H₀.

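Recall that $Q(\theta) = \sum_{i=1}^n q(x_i; \theta)$. Once more we examine a Taylor series expansion,

$$L(\tilde\theta) - L(\hat\theta) = Q(\hat\theta)^T (\tilde\theta - \hat\theta) + \frac{1}{2} (\tilde\theta - \hat\theta)^T H(\hat\theta) (\tilde\theta - \hat\theta) + o_P(1), \tag{11}$$

where

$$H(\theta) = \left( \sum_{i=1}^n \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(x_i; \theta) \right)_{s \times s}.$$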

The first term on the right-hand side vanishes since $Q(\hat\theta) = 0$; by the consistency of $\hat\theta$,

$$-\frac{1}{n} H(\hat\theta) \xrightarrow{P} I(\theta^0).$$

We simplify (11) accordingly, and substitute the result in λ_n. This gives

$$\lambda_n = [\sqrt{n}(\tilde\theta - \hat\theta)]^T I(\theta^0) [\sqrt{n}(\tilde\theta - \hat\theta)] + o_P(1). \tag{12}$$

Then it follows from (7) that under H₀

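$$\sqrt{n}(\tilde\theta - \theta^0) \approx I^{-1}(\theta^0) \left\{ I_s - G_R(\theta^0)^T [G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T]^{-1} G_R(\theta^0) I^{-1}(\theta^0) \right\} \frac{1}{\sqrt{n}} Q(\theta^0)$$

and that

$$\sqrt{n}(\hat\theta - \theta^0) = I^{-1}(\theta^0) \frac{1}{\sqrt{n}} Q(\theta^0) + o_P(1),$$

so that

$$\sqrt{n}(\tilde\theta - \hat\theta) \approx -I^{-1}(\theta^0) G_R(\theta^0)^T [G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T]^{-1} G_R(\theta^0) I^{-1}(\theta^0) \frac{1}{\sqrt{n}} Q(\theta^0). \tag{13}$$

Recall that

$$\frac{1}{\sqrt{n}} Q(\theta^0) \xrightarrow{d} N(0, I(\theta^0)),$$

so that

$$\frac{1}{\sqrt{n}} I^{-1/2}(\theta^0) Q(\theta^0) \xrightarrow{d} \epsilon \sim N(0, I_s). \tag{14}$$

Here ε is a vector of s standard normal variates that are uncorrelated and hence independent, and $I_s$ is the identity matrix of order s. By (13) and (14), moreover,

$$\sqrt{n}(\tilde\theta - \hat\theta) \approx -I^{-1}(\theta^0) G_R(\theta^0)^T [G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T]^{-1} G_R(\theta^0) I^{-1}(\theta^0) I^{1/2}(\theta^0)\, \epsilon.$$

Upon substituting this into (12) we finally obtain

$$\lambda_n \approx \epsilon^T I^{-1/2}(\theta^0) G_R(\theta^0)^T [G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T]^{-1} G_R(\theta^0) I^{-1/2}(\theta^0)\, \epsilon. \tag{15}$$

This is a quadratic form in independent standard normal variates, with a nonstochastic idempotent coefficient matrix that is of rank r because G_R(θ⁰) has rank r; it is therefore distributed as chi-square with r degrees of freedom. And this is what we set out to prove.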

3.4 Wald’s Test

Note that $R_i(\theta) = \theta_i - \theta_i^0$, 1 ≤ i ≤ s, for the simple null hypothesis H₀ : θ = θ⁰. For composite hypotheses, this test is based on the vector $b_\theta = (R_1(\theta), \ldots, R_r(\theta))^T$ and the estimate $\hat\theta$ which maximizes L(θ) without imposing the restrictions. If H₀ holds, the $R_i(\theta^0)$ are zero, and the $R_i(\hat\theta)$ should presumably be close to zero. Recall that $\sqrt{n}(\hat\theta - \theta^0)$ under H₀ is asymptotically normal with mean zero and covariance $I^{-1}(\theta^0)$. By a first-order expansion of $b_\theta$ about θ⁰, this implies that $\sqrt{n}\, b_{\hat\theta}$ under H₀ is asymptotically normal with mean zero (since $b_{\theta^0} = 0$) and covariance $G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T$.

We use this in the quadratic form: under H₀,

$$\sqrt{n}\, b_{\hat\theta}^T \left[ G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T \right]^{-1} b_{\hat\theta} \sqrt{n} = n\, b_{\hat\theta}^T \left[ G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T \right]^{-1} b_{\hat\theta} \xrightarrow{d} \chi^2_r.$$

Since

$$\left[ G_R(\hat\theta) I^{-1}(\hat\theta) G_R(\hat\theta)^T \right]^{-1} \xrightarrow{P} \left[ G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T \right]^{-1}$$

by the consistency of $\hat\theta$, Slutsky’s theorem gives

$$W_n = n\, b_{\hat\theta}^T \left\{ G_R(\hat\theta) I^{-1}(\hat\theta) G_R(\hat\theta)^T \right\}^{-1} b_{\hat\theta} \xrightarrow{d} \chi^2_r. \tag{16}$$

This establishes the asymptotic distribution of Wald’s test statistic under the null hypothesis.

We can therefore construct the test statistic from the original unconstrained estimate $\hat\theta$ and its covariance matrix estimate, with the help of the restriction functions R_i of (1) and their first derivatives, which form G_R.


3.5 Lagrange Multiplier Test

This test is also known as the Rao efficient score test or as the chi-square test. It is based on the score vector n⁻¹Q(θ) and the estimate ˜θ which maximizes L(θ) subject to the restrictions R_i(θ) = 0, 1 ≤ i ≤ r. When we evaluate this vector at the constrained estimate ˜θ the result is n⁻¹Q(˜θ), and under H₀ this should be close to the value of the score vector at the unconstrained estimate, which is of course zero. Rao (1948) introduced the statistic

V_n= n⁻¹Q(˜θ)^T[var(Q(˜θ))]⁻¹Q(˜θ).

Before we work on the details, we first motivate this test statistic. Assume that the specification of Θ⁰ may equivalently be given as θ = g(β), where β = (β₁, . . . , β_{s−r}) ranges through an open subset of R^{s−r}. In terms of the MLE $\hat\beta$ of β, $\tilde\theta$ may be represented as

$$\tilde\theta = g(\hat\beta) = (g_1(\hat\beta), \ldots, g_s(\hat\beta)).$$

Denote by J(β) the information matrix for the β-formulation of the model, and let

$$t_\beta = \left( \frac{1}{n} \sum_{i=1}^n \frac{\partial \log f(X_i; g(\beta))}{\partial\beta_1}, \ \ldots, \ \frac{1}{n} \sum_{i=1}^n \frac{\partial \log f(X_i; g(\beta))}{\partial\beta_{s-r}} \right)^T.$$

Then Rao’s statistic is

$$V_n = n\, t_{\hat\beta}^T J^{-1}(\hat\beta)\, t_{\hat\beta}$$

or, in the original formulation,

$$V_n = n^{-1} Q(\tilde\theta)^T [\mathrm{var}(Q(\tilde\theta))]^{-1} Q(\tilde\theta).$$

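In order to examine the asymptotic distribution of $Q(\tilde\theta)$ we once more start off with a Taylor series

$$Q(\tilde\theta) = Q(\hat\theta) + H(\hat\theta)(\tilde\theta - \hat\theta) + o_P(\sqrt{n}) \tag{17}$$

and use $Q(\hat\theta) = 0$ and $n^{-1} H(\hat\theta) = -I(\theta^0) + o_P(1)$, as before. This yields

$$\frac{1}{\sqrt{n}} Q(\tilde\theta) = I(\theta^0) \sqrt{n}(\hat\theta - \tilde\theta) + o_P(1),$$

so that, by (13) and (14), the asymptotic variance of $n^{-1/2} Q(\tilde\theta)$ is

$$G_R(\theta^0)^T [G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T]^{-1} G_R(\theta^0).$$

This s × s matrix has rank r < s, so its inverse below is to be read as a generalized inverse, of which $I^{-1}(\theta^0)$ is one choice. We conclude that

$$\frac{1}{\sqrt{n}} Q(\tilde\theta)^T \left\{ G_R(\theta^0)^T [G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T]^{-1} G_R(\theta^0) \right\}^{-1} Q(\tilde\theta) \frac{1}{\sqrt{n}}$$

$$= \sqrt{n}(\hat\theta - \tilde\theta)^T I(\theta^0) \left\{ G_R(\theta^0)^T [G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T]^{-1} G_R(\theta^0) \right\}^{-1} I(\theta^0) (\hat\theta - \tilde\theta) \sqrt{n} + o_P(1)$$

$$= \epsilon^T I^{-1/2}(\theta^0) G_R(\theta^0)^T [G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T]^{-1} G_R(\theta^0) I^{-1/2}(\theta^0)\, \epsilon + o_P(1), \tag{18}$$

where ε denotes the limiting $N(0, I_s)$ vector of $n^{-1/2} I^{-1/2}(\theta^0) Q(\theta^0)$ in (14). By (15) this is the likelihood ratio test statistic, which we know to be asymptotically chi-square (r) distributed. This will also hold if we replace I(θ⁰) and G_R(θ⁰) in (18) by consistent estimates, as in the test statistic

$$V_n = \frac{1}{n} Q(\tilde\theta)^T \left\{ G_R(\tilde\theta)^T [G_R(\tilde\theta) I^{-1}(\tilde\theta) G_R(\tilde\theta)^T]^{-1} G_R(\tilde\theta) \right\}^{-1} Q(\tilde\theta). \tag{19}$$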


This is based on the second expression in (18), and we have replaced θ⁰ by the constrained estimator ˜θ - just as the score vectors Q here take their form from the unconstrained estimation problem, but their values from the constrained estimator ˜θ. In order to calculate the test statistic, ˜θ is the only estimate we need determine.

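By (2) and (7),

$$Q(\tilde\theta) \approx G_R(\tilde\theta)^T \tilde\mu$$

and

$$\tilde\mu \approx [G_R(\theta^0) I^{-1}(\theta^0) G_R(\theta^0)^T]^{-1} G_R(\theta^0) I^{-1}(\theta^0) Q(\theta^0),$$

so that

$$V_n \approx \frac{1}{\sqrt{n}} \tilde\mu^T \left[ G_R(\tilde\theta) I^{-1}(\tilde\theta) G_R(\tilde\theta)^T \right] \frac{1}{\sqrt{n}} \tilde\mu.$$

Note that $[G_R(\tilde\theta) I^{-1}(\tilde\theta) G_R(\tilde\theta)^T]^{-1}$ estimates the asymptotic variance (10) of $n^{-1/2} \tilde\mu$. Under H₀ the test statistic may thus be regarded as a standardized quadratic form in the estimated Lagrange multipliers. This may explain the usual name of the test.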
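The three statistics of Sections 3.3 to 3.5 can be compared on a small composite hypothesis. The sketch below assumes n independent pairs of Bernoulli outcomes with success probabilities (p₁, p₂) and the single restriction R(θ) = p₁ − p₂ = 0, so that s = 2, r = 1 and G_R = (1, −1); this model, the function name `two_sample_tests` and the counts are illustrative assumptions, not part of the text. The score statistic is computed through the Lagrange multiplier form given just above.

```python
import math

def two_sample_tests(k1, k2, n):
    """LR, Wald and Lagrange multiplier statistics for H0: p1 = p2.

    k1, k2 -- successes in each of the two coordinates over n pairs
              (assumes 0 < k1 < n and 0 < k2 < n)
    """
    p1, p2 = k1 / n, k2 / n          # unconstrained MLE
    pt = (k1 + k2) / (2 * n)         # constrained MLE (common value)

    def loglik(a, b):
        return (k1 * math.log(a) + (n - k1) * math.log(1 - a)
                + k2 * math.log(b) + (n - k2) * math.log(1 - b))

    # Likelihood ratio: 2 [L(hat) - L(tilde)]
    lr = 2 * (loglik(p1, p2) - loglik(pt, pt))

    # Wald: n b^2 / (G I^{-1}(hat) G^T), with I^{-1} = diag(p(1 - p))
    wald = n * (p1 - p2) ** 2 / (p1 * (1 - p1) + p2 * (1 - p2))

    # Lagrange multiplier: Q(tilde) = G^T mu, so mu is the first score component
    mu = n * (p1 - pt) / (pt * (1 - pt))
    score = mu ** 2 * (2 * pt * (1 - pt)) / n
    return lr, wald, score

lr, wald, score = two_sample_tests(30, 45, 100)
print(f"LR = {lr:.4f}, Wald = {wald:.4f}, LM = {score:.4f}")
```

Only the Wald statistic uses the unconstrained variance estimate and only the Lagrange multiplier statistic uses the constrained one, which is why the three values differ in a finite sample while sharing the chi-square (1) limit under H₀.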

3.6 Discussion

All we have shown is that under the null hypothesis the three test statistics have the same asymptotic distribution; in the alternative case they need not at all have the same asymptotic behavior. Again, in a finite sample the three test statistics usually take different values, and at the present level of generality nothing can be said about their exact distribution, under the null hypothesis or otherwise. Without further knowledge of the finite-sample properties of the tests in a particular application they are of little use, since it is hard to tell what significance levels from an asymptotic distribution mean if the evidence comes from a small sample. In certain applications it has, however, proved possible to derive the exact distribution of one or other of the present test statistics or of its transformation. Failing this, the finite-sample performance of the tests may always be established by simulation studies.

In certain cases the asymptotic test statistic is thus taken as the starting point of a further investigation, as a suggestion for an exact approach. This applies in particular to the Likelihood Ratio test; as it is the oldest of the three it has been studied more intensively than the other two. It was introduced by Fisher in the 1920s and adopted in Neyman and Pearson’s testing methodology of a decade later. The next test was that of Wald (1943), and the most recent is the Lagrange Multiplier test (Rao 1948).

If we do not have the benefit of further analyses, and a large sample that inspires some confidence in the validity of asymptotic results, the choice between the three tests may be affected by expediency. Each requires different computations. For the likelihood ratio test we must maximize the likelihood function twice, with and without restrictions; for the Wald test we need unconstrained parameter estimates with their covariance matrix; and for Rao’s score test, constrained estimates and their covariance matrix. Some ingredients may be easier to obtain than others.

Let T = {T_n} and U = {U_n} be two sequences of tests for the same problem based on sample sizes n. The limiting ratio of the sample sizes n_U and n_T required to achieve the same limiting power at the same sequence of alternatives, when the significance levels of the two test sequences also have the same limit, is the Pitman ARE of T relative to U,

$$e_P(T, U) = \lim_n \frac{n_U}{n_T}.$$

If, for example, e_P(T, U) = 1/2, this means that the sequence U_n requires approximately half as many observations as the sequence T_n to achieve the same asymptotic results.

4 Pitman Efficiency

Suppose that the distribution F under consideration may be indexed by a set Θ ⊂ R, and consider a simple null hypothesis

H0 : θ = θ0

to be tested against alternatives

θ > θ₀.

Consider the comparison of test sequences T = {T_n} satisfying the following conditions, relative to a neighborhood θ0 ≤ θ ≤ θ0+ δ of the null hypothesis.

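Pitman Conditions

• (P1) For some continuous strictly increasing distribution function G, and functions µ_n(θ) and σ_n(θ), the F_θ-distribution of (T_n − µ_n(θ))/σ_n(θ) converges to G uniformly in [θ₀, θ₀ + δ]:

$$\sup_{\theta_0 \le \theta \le \theta_0 + \delta} \ \sup_{-\infty < t < \infty} \left| P_\theta\left( \frac{T_n - \mu_n(\theta)}{\sigma_n(\theta)} \le t \right) - G(t) \right| \to 0, \quad n \to \infty.$$

• (P2) For θ ∈ [θ₀, θ₀ + δ], µ_n(θ) is k times differentiable, with $\mu_n^{(1)}(\theta_0) = \cdots = \mu_n^{(k-1)}(\theta_0) = 0 < \mu_n^{(k)}(\theta_0)$.

• (P3) For some function d(n) → ∞ and some constant c > 0,

$$\sigma_n(\theta_0) \sim c\, \frac{\mu_n^{(k)}(\theta_0)}{d(n)}, \quad n \to \infty.$$

• (P4) For $\theta_n = \theta_0 + O([d(n)]^{-1/k})$,

$$\mu_n^{(k)}(\theta_n) \sim \mu_n^{(k)}(\theta_0), \quad n \to \infty.$$

• (P5) For $\theta_n = \theta_0 + O([d(n)]^{-1/k})$,

$$\sigma_n(\theta_n) \sim \sigma_n(\theta_0), \quad n \to \infty.$$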

Theorem 1 (Pitman-Noether). (i) Let T = {T_n} satisfy (P1)-(P5). Consider testing H₀ by critical regions $\{T_n > u_{\alpha_n}\}$ with

$$\alpha_n = P_{\theta_0}(T_n > u_{\alpha_n}) \to \alpha,$$

where 0 < α < 1. For 0 < β < 1 − α and $\theta_n = \theta_0 + O([d(n)]^{-1/k})$, we have

$$\beta_n(\theta_n) = P_{\theta_n}(T_n \le u_{\alpha_n}) \to \beta$$

if and only if

$$\frac{(\theta_n - \theta_0)^k}{k!} \cdot \frac{d(n)}{c} \to G^{-1}(1 - \alpha) - G^{-1}(\beta). \tag{20}$$

(ii) Let $T_A = \{T_{An}\}$ and $T_B = \{T_{Bn}\}$ satisfy (P1)-(P5) with common G, k and d(n) in (P1)-(P3). Let d(n) = n^q, q > 0. Then the Pitman ARE of T_A relative to T_B is given by

$$e_P(T_A, T_B) = \left( \frac{c_B}{c_A} \right)^{1/q}.$$

Proof. Check that, by (P1),

$$\beta_n(\theta_n) - G\left( \frac{u_{\alpha_n} - \mu_n(\theta_n)}{\sigma_n(\theta_n)} \right) \to 0, \quad n \to \infty.$$

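Then β_n(θ_n) → β if and only if

$$\frac{u_{\alpha_n} - \mu_n(\theta_n)}{\sigma_n(\theta_n)} \to G^{-1}(\beta). \tag{21}$$

Likewise (check), α_n → α if and only if

$$\frac{u_{\alpha_n} - \mu_n(\theta_0)}{\sigma_n(\theta_0)} \to G^{-1}(1 - \alpha). \tag{22}$$

It follows (check, utilizing (P5)) that (21) and (22) together are equivalent to (22) and

$$\frac{\mu_n(\theta_n) - \mu_n(\theta_0)}{\sigma_n(\theta_0)} \to G^{-1}(1 - \alpha) - G^{-1}(\beta) \tag{23}$$

together. By (P2) and (P3),

$$\frac{\mu_n(\theta_n) - \mu_n(\theta_0)}{\sigma_n(\theta_0)} \sim \frac{\mu_n^{(k)}(\tilde\theta_n)}{\mu_n^{(k)}(\theta_0)} \cdot \frac{(\theta_n - \theta_0)^k}{k!} \cdot \frac{d(n)}{c}, \quad n \to \infty,$$

where $\theta_0 \le \tilde\theta_n \le \theta_n$. Thus, by (P4), (23) is equivalent to (20). This completes the proof of (i).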

Now consider tests based on T_A and T_B, having sizes $\alpha_{An} \to \alpha$ and $\alpha_{Bn} \to \alpha$. Let 0 < β < 1 − α. Let {θ_n} be a sequence of alternatives of the form

$$\theta_n = \theta_0 + A [d(n)]^{-1/k}.$$

It follows by (i) that if h(n) is the sample size at which T_B performs “equivalently” to T_A with sample size n, that is, at which T_B and T_A have the same limiting power 1 − β for the given sequence of alternatives, so that

$$\beta_{An}(\theta_n) \to \beta, \qquad \beta_{Bh(n)}(\theta_n) \to \beta,$$

then we must have d(h(n)) proportional to d(n) and

$$\frac{(\theta_n - \theta_0)^k}{k!} \cdot \frac{d(n)}{c_A} \sim \frac{(\theta_n - \theta_0)^k}{k!} \cdot \frac{d(h(n))}{c_B},$$

or

$$\frac{d(h(n))}{d(n)} \to \frac{c_B}{c_A}.$$

For d(n) = n^q, this yields (h(n)/n)^q → c_B/c_A, that is, h(n)/n → (c_B/c_A)^{1/q}, proving (ii).
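Part (ii) reduces the comparison of two tests to the ratio of their constants in (P3). The sketch below evaluates that formula; the function name `pitman_are` is hypothetical. The numerical example assumes the standard constants for a normal shift model with d(n) = n^{1/2}, namely c proportional to σ for the t-test and to σ√(π/3) for the Wilcoxon test; these constants are quoted as assumptions and are not derived in the text above.

```python
import math

def pitman_are(c_a, c_b, q):
    """e_P(T_A, T_B) = (c_B / c_A)^(1/q) for d(n) = n^q, as in part (ii)."""
    return (c_b / c_a) ** (1.0 / q)

# Assumed constants (up to a common factor) for normal data with
# standard deviation sigma; not derived in the text.
sigma = 2.0
c_t = sigma                               # t-test
c_w = sigma * math.sqrt(math.pi / 3)      # Wilcoxon test

are_w_vs_t = pitman_are(c_w, c_t, q=0.5)
print(f"e_P(Wilcoxon, t) under normality: {are_w_vs_t:.4f}")
```

Under these assumptions the value is 3/π, whatever σ is, since only the ratio of the two constants enters.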


4.1 Rank Tests for Comparing Two Treatments

For comparing a new treatment or procedure with the standard method, N subjects (patients, students, etc.) are divided at random into a group of n who will receive the new treatment and a control group of m who will be treated by the standard method.

At the termination of the study, the subjects are ranked either directly or according to some response that measures the success of the treatment, such as a test score in an educational or psychological investigation. The hypothesis H₀ of no treatment effect is rejected, and the superiority of the new treatment acknowledged, if the n treated subjects rank sufficiently high. (Here it is assumed that the success of the treatment is indicated by an increased response; if instead the aim is to decrease the response, H₀ is rejected when the n treated subjects rank sufficiently low.)

Let the ranks of the treated subjects be denoted by S₁, . . . , S_n, where we shall assume that they are numbered in increasing order. Denote the sum of the treatment ranks by W_S = S₁ + · · · + S_n. The hypothesis H₀ is then rejected and the treatment judged to be effective when W_S is sufficiently large, say, when W_S ≥ c. Here the constant c is determined by the equation

$$P_{H_0}(W_S \ge c) = \alpha.$$

The test defined above is known as the Wilcoxon rank-sum test.

Let X₁, . . . , X_m and Y₁, . . . , Y_n be independent, the X’s identically distributed with distribution F and the Y’s identically distributed with distribution G. Here the Y’s are responses to a treatment. Then H₀ : F = G and H_a : Y is stochastically larger than X, i.e., G(t) ≤ F(t) for all t but G ≠ F.

Let the ranks of the X’s be denoted by R₁, . . . , R_m. If we substitute R’s for X’s and S’s for Y’s in the two-sample t-test statistic, we obtain

$$\left( \frac{nm}{N} \right)^{1/2} \frac{\frac{1}{n} \sum_{i=1}^n S_i - \frac{1}{m} \sum_{j=1}^m R_j}{\left\{ (N - 2)^{-1} \left[ \sum_{i=1}^n \left( S_i - \frac{N+1}{2} \right)^2 + \sum_{j=1}^m \left( R_j - \frac{N+1}{2} \right)^2 \right] \right\}^{1/2}}.$$

This statistic is equivalent to the Wilcoxon statistic W_S, the sum of the ranks of the treatment group. Write W_XY for the number of pairs (X_i, Y_j) with X_i < Y_j. It can be shown that

$$W_S - \frac{1}{2} n(n + 1) = W_{XY}.$$

W_XY is usually known as the Mann-Whitney statistic. Let φ(X_i, Y_j) = 1 if X_i < Y_j, and 0 otherwise. Then

$$W_{XY} = \sum_{i=1}^m \sum_{j=1}^n \varphi(X_i, Y_j). \tag{24}$$
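The relation between the rank sum W_S and the Mann-Whitney count W_XY can be checked directly on data. The sketch below assumes there are no ties among the pooled observations; the function name and the sample values are illustrative only.

```python
def rank_sum_and_mann_whitney(xs, ys):
    """Return (W_S, W_XY) for control sample xs and treated sample ys.

    Assumes all pooled observations are distinct (no ties).
    """
    pooled = sorted(xs + ys)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    w_s = sum(rank[y] for y in ys)                       # sum of treatment ranks
    w_xy = sum(1 for x in xs for y in ys if x < y)       # pairs with X_i < Y_j, as in (24)
    return w_s, w_xy

xs = [1.2, 3.4, 2.2, 5.0]   # m = 4 controls
ys = [2.9, 6.1, 4.8]        # n = 3 treated
w_s, w_xy = rank_sum_and_mann_whitney(xs, ys)
n = len(ys)
print(w_s, w_xy, w_s - n * (n + 1) // 2)
```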

We shall prove that W_XY is asymptotically normal as m and n tend to infinity. The method of proof consists in replacing the variable W_XY by a sum of independent random variables, which is asymptotically equivalent to W_XY and to which the central limit theorem can then be applied. It is natural for this purpose to try a sum of the form

$$S = \sum_{i=1}^m a_i(X_i) + \sum_{j=1}^n b_j(Y_j), \tag{25}$$

but how should one choose the functions a_i and b_j? The following “projection method”, introduced in a different context by Hajek (1961), produces the a_i and b_j most likely to succeed in the sense of minimizing E(W_XY − S)². This approach is due to Hoeffding (1948), and is applicable to a large class of statistics, the so-called U-statistics. Note that

$$\theta(F, G) = \int F \, dG = P(X \le Y).$$

An unbiased estimator of θ(F, G) is

$$U = \frac{1}{nm} \sum_{i=1}^m \sum_{j=1}^n I(X_i \le Y_j),$$

which, for continuous F and G, equals W_XY/(nm) with probability one. A statistic that can be written in this form, as an average of a kernel over all pairs of observations, is called a U-statistic. The popularity of this projection method is due to Hajek (1968), who gives the following result.

Lemma 2 (Hoeffding) Let Z₁, . . . , Z_n be independent random variables and S = S(Z₁, . . . , Z_n) any statistic satisfying E(S²) < ∞. Then the random variable

$$S^* = \sum_{i=1}^n E(S \mid Z_i) - (n - 1) E(S)$$

satisfies E(S*) = E(S) and

$$E(S - S^*)^2 = \mathrm{Var}(S) - \mathrm{Var}(S^*).$$

The random variable S* is called the projection of S on Z₁, . . . , Z_n. Note that it is conveniently a sum of independent random variables (identically distributed when the Z_i are i.i.d. and S is symmetric in its arguments). In cases where E(S − S*)² → 0 at a suitable rate as n → ∞, the asymptotic normality of S may be established by applying classical theory to S*.
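Both claims of the lemma can be verified exactly on a small discrete example. The sketch below assumes three independent variables, each uniform on {0, 1, 2}, and takes S to be their maximum; this setup is a hypothetical illustration. Conditional expectations are computed by enumeration with exact rational arithmetic.

```python
from fractions import Fraction as Fr
from itertools import product

values = [0, 1, 2]   # support of each Z_i (uniform, hypothetical)
n = 3
stat = max           # the statistic S(Z_1, ..., Z_n)

points = list(product(values, repeat=n))   # all equally likely outcomes
prob = Fr(1, len(points))

def expect(f):
    """Exact expectation of f(Z_1, ..., Z_n)."""
    return sum(prob * f(z) for z in points)

def s_fun(z):
    return Fr(stat(z))

mean_s = expect(s_fun)

def cond_mean(i, v):
    """E(S | Z_i = v), by averaging over the remaining coordinates."""
    selected = [z for z in points if z[i] == v]
    return Fr(sum(stat(z) for z in selected), len(selected))

def projection(z):
    """S* = sum_i E(S | Z_i) - (n - 1) E(S), evaluated at outcome z."""
    return sum(cond_mean(i, z[i]) for i in range(n)) - (n - 1) * mean_s

def variance(f):
    m = expect(f)
    return expect(lambda z: (f(z) - m) ** 2)

gap = expect(lambda z: (s_fun(z) - projection(z)) ** 2)
print("E(S) =", mean_s, " E(S*) =", expect(projection))
print("E(S - S*)^2 =", gap, " Var(S) - Var(S*) =", variance(s_fun) - variance(projection))
```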

Proof of Hoeffding’s Lemma. Without loss of generality, we can assume that E(S) = 0. Consider the problem of finding the sum

$$T = \sum_{i=1}^n k_i(Z_i) \tag{26}$$

for which E(S − T)² is as small as possible; the minimizing T may be considered the “projection” of S onto the linear space formed by the functions T. Let

$$r_i(z_i) = E(S \mid Z_i = z_i) \tag{27}$$

be the conditional expectation of S given Z_i = z_i, and let

$$S^* = \sum_{i=1}^n r_i(Z_i). \tag{28}$$

That S* is the desired minimizing function is an immediate consequence of the following identity, which holds for all statistics S with mean zero and all T of the form (26) for which the required expectations exist:

$$E(S - T)^2 = E(S - S^*)^2 + E(S^* - T)^2. \tag{29}$$

To prove this identity, write

$$E(S - T)^2 = E[(S - S^*) + (S^* - T)]^2.$$


Squaring the right-hand side proves (29) if it can be shown that

E[(S − S^∗)(S^∗− T )] = 0. (30)

Since the left-hand side of (30) is the sum over i of the expectations of

$$[r_i(Z_i) - k_i(Z_i)](S - S^*), \tag{31}$$

it is enough to show that the expectation of (31) is zero for all i. We shall prove this by showing that the conditional expectation of (31) given Z_i is zero. In the conditional expectation of this product, the first factor can be taken out of the expectation sign since it depends only on Z_i, so that it is finally only necessary to show that the conditional expectation of S − S* given Z_i is zero. Now

$$E[(S - S^*) \mid Z_i] = E\Big\{ S - r_i(Z_i) - \sum_{j \ne i} r_j(Z_j) \ \Big|\ Z_i \Big\}.$$

From the definition of r_i(Z_i), it is seen that the conditional expectation of S − r_i(Z_i) given Z_i is zero. On the other hand, since Z_i and Z_j are independent, the conditional expectation of r_j(Z_j) given Z_i is equal to the unconditional expectation of r_j(Z_j), which by the definition of r_j is equal to E(S) and hence equal to zero. This completes the proof of (30) and therefore of (29).

A useful special case of (29) is obtained by putting T = 0, which gives after rearrangement

$$E(S - S^*)^2 = E(S^2) - E(S^{*2}) = \mathrm{Var}(S) - \mathrm{Var}(S^*). \tag{32}$$

Before we apply Hoeffding’s lemma to the statistic W_XY of (24), we calculate the expectation and variance of W_XY. Set θ = (F, G). Since

$$E_\theta[\varphi(X, Y)] = P_\theta[X < Y],$$

we obtain

$$E_\theta(W_{XY}) = mnp, \tag{33}$$

where p = P_θ[X < Y]. Similarly, we have

$$\mathrm{Var}_\theta(W_{XY}) = nmp(1 - p) + nm(n - 1)(q_1 - p^2) + nm(m - 1)(q_2 - p^2), \tag{34}$$

where q₁ = P_θ[X₁ < min(Y₁, Y₂)] and q₂ = P_θ[Y₁ > max(X₁, X₂)].

Note that under H₀, if F is continuous, p = 1/2 while q₁ = q₂ = 1/3, since, among three independent identically distributed variables, each one is equally likely to be the minimum or the maximum. We then have E_θ(W_XY) = mn/2 and V ar_θ(W_XY) = mn(N + 1)/12 under H₀.
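The null moments mn/2 and mn(N + 1)/12 can be confirmed exactly for small samples. Under H₀ with continuous F, every choice of n treatment ranks out of 1, . . . , N is equally likely, so the sketch below enumerates all such choices and uses W_XY = W_S − n(n + 1)/2; the function name `null_moments` is hypothetical.

```python
from fractions import Fraction as Fr
from itertools import combinations

def null_moments(m, n):
    """Exact mean and variance of W_XY under H0, by enumerating all
    equally likely sets of treatment ranks among 1..N (N = m + n)."""
    N = m + n
    values = [sum(ranks) - n * (n + 1) // 2
              for ranks in combinations(range(1, N + 1), n)]
    mean = Fr(sum(values), len(values))
    var = Fr(sum(v * v for v in values), len(values)) - mean ** 2
    return mean, var

for m, n in [(4, 3), (5, 4)]:
    mean, var = null_moments(m, n)
    print(m, n, mean, var, Fr(m * n, 2), Fr(m * n * (m + n + 1), 12))
```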

Put

ψ(x, y) = φ(x, y) − p. (35)

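Note that

$$E[\psi(X_\alpha, Y_\beta) \mid X_i = x] = \begin{cases} E\psi(x, Y_\beta) & \text{if } \alpha = i, \\ 0 & \text{if } \alpha \ne i, \end{cases}$$

and

$$E[\psi(X_\alpha, Y_\beta) \mid Y_j = y] = \begin{cases} E\psi(X_\alpha, y) & \text{if } \beta = j, \\ 0 & \text{if } \beta \ne j. \end{cases}$$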


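Put ψ₁₀(x) = E_Y ψ(x, Y) and ψ₀₁(y) = E_X ψ(X, y). The projection of W_XY − mnp given by Hoeffding’s lemma is $n \sum_{i=1}^m \psi_{10}(X_i) + m \sum_{j=1}^n \psi_{01}(Y_j)$. Consider

$$U = \sqrt{m} \left[ \frac{1}{m} \sum_{i=1}^m \psi_{10}(X_i) + \frac{1}{n} \sum_{j=1}^n \psi_{01}(Y_j) \right]$$

and $S = \sqrt{m}\,[(mn)^{-1} W_{XY} - p]$. Note that, by (34),

$$\mathrm{Var}(S) = q_1 - p^2 + \frac{m}{n}(q_2 - p^2) + O(1/n),$$

$$\mathrm{Var}(U) = \mathrm{Var}(\psi_{10}(X)) + \frac{m}{n} \mathrm{Var}(\psi_{01}(Y)),$$

$$E(S - U)^2 = \mathrm{Var}(S) - \mathrm{Var}(U).$$

Observe that Var(ψ₁₀(X)) = q₁ − p² and Var(ψ₀₁(Y)) = q₂ − p². Indeed, for j ≠ k, the independence of Y_j and Y_k gives E ψ(x₁, Y_j)ψ(x₁, Y_k) = [ψ₁₀(x₁)]², and hence

$$E_X[\psi_{10}(X)]^2 = E\,\psi(X, Y_j)\psi(X, Y_k) = \mathrm{Cov}(\psi(X, Y_j), \psi(X, Y_k)) = q_1 - p^2;$$

the argument for ψ₀₁ is the same. We then conclude that E(S − U)² → 0.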

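Theorem 2 Suppose that F and G are continuous and that 0 < P_θ[X < Y] < 1. Then

$$\frac{S - E_\theta(S)}{\sqrt{\mathrm{Var}_\theta(S)}} \xrightarrow{d} N(0, 1) \quad \text{as } \min(n, m) \to \infty.$$

Remark. Reject H₀ when

$$\frac{W_{XY} - \frac{1}{2} nm}{\sqrt{\frac{1}{12} nm(N + 1)}} \ge z(1 - \alpha).$$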
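The rejection rule of the Remark translates into a few lines of code. The sketch below assumes no ties and samples large enough for the normal approximation to be reasonable; the function name `wilcoxon_rank_sum_test` and the data are illustrative assumptions.

```python
import math
from statistics import NormalDist

def wilcoxon_rank_sum_test(xs, ys, alpha=0.05):
    """One-sided large-sample Wilcoxon / Mann-Whitney test of H0: F = G
    against 'Y stochastically larger than X'. Assumes no ties.

    Returns (W_XY, standardized statistic, reject H0?).
    """
    m, n = len(xs), len(ys)
    N = m + n
    w_xy = sum(1 for x in xs for y in ys if x < y)
    z = (w_xy - m * n / 2) / math.sqrt(m * n * (N + 1) / 12)
    reject = z >= NormalDist().inv_cdf(1 - alpha)
    return w_xy, z, reject

controls = [float(v) for v in range(10)]          # hypothetical responses
treated = [v + 20.0 for v in range(10)]           # clearly shifted upward
print(wilcoxon_rank_sum_test(controls, treated))
```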

4.2 Pitman efficiency of the Wilcoxon rank-sum test to the two-sample t-test

We turn now to the comparison of the performance of the Wilcoxon and two-sample t tests. At first sight it would appear that a good reason for using the Wilcoxon is that it has a guaranteed probability of type I error and a good reason against using the Wilcoxon is its inefficient use of the data. We assume that the X’s and Y ’s have the same variance σ² and means µ₁ and µ₂. Although the t test does not have a guaranteed probability of type I error, if n and m are moderately large, H₀ is true, and F has a finite second moment, then the probability of type I error of the t test is fairly close to that specified by the normal model.

Recall that the two-sample t statistic is given by

$$T = \sqrt{\frac{nm}{N}} \, \frac{\bar{Y} - \bar{X}}{s_2}, \tag{36}$$

where

$$s_2^2 = \frac{\sum_{i=1}^m (X_i - \bar{X})^2 + \sum_{j=1}^n (Y_j - \bar{Y})^2}{N - 2}. \tag{37}$$