Chapter 5. Hypothesis Testing
1 Nested Hypotheses
In this chapter we provide a theoretical discussion on testing of statistical hypotheses.
Neyman and Pearson (1933) presented the Neyman-Pearson Fundamental Lemma, which unfolded the various complex problems in testing statistical hypotheses. Earlier, in 1928, Neyman and Pearson had proposed a general recipe for hypothesis testing, known as the likelihood ratio test. In this chapter we treat this test and two related test statistics, each based on the maximum likelihood method. For the general case only the asymptotic distributions of the test statistics have been established. Let $X_1, \ldots, X_n$ be iid with distribution $F_\theta$ belonging to a family $\mathcal{F} = \{F_\theta, \theta \in \Theta\}$, where $\Theta \subset \mathbb{R}^s$. Let the distributions $F_\theta$ possess densities or mass functions $f(x; \theta)$. Assume that the information matrix $I(\theta)$ exists and is positive definite.
Suppose we are concerned with statistical tests of $r < s$ independent equality restrictions on the $(s \times 1)$ parameter vector $\theta_0$, which we represent by the implicit side relations
\[
R_i(\theta) = 0, \quad i = 1, 2, \ldots, r. \tag{1}
\]
The vectors that satisfy these equations form an $(s - r)$-dimensional subspace $\Theta_0$ of the parameter space $\Theta$, and we shall consider the null hypothesis that $\theta_0$ lies in this subspace.
In the set-up of hypothesis testing, we consider the null hypothesis $H_0 : \theta_0 \in \Theta_0$ (to be tested), where $\Theta_0$ is the subset of $\Theta$ determined by the set of $r$ ($\le s$) restrictions given by (1). The restrictions under review are thus set within the context of a wider parent model, which provides the maintained hypothesis and defines the alternative hypothesis. In the case of a simple hypothesis $H_0 : \theta = \theta_0$, we have $\Theta_0 = \{\theta_0\}$, and the functions $R_i(\theta)$ may be taken to be
Ri(θ) = θi− θi0, 1 ≤ i ≤ s.
In the case of a composite hypothesis, the set Θ0 contains more than one element and we necessarily have r < s.
In this classical setting the null hypothesis is sometimes called a nested hypothesis, with obvious reference to the relative position of Θ0 and Θ. In the following discussions, θ can range freely over Θ, so that functions of θ are several times over differentiable in respect of all its elements, at least in the neighborhood of θ0. This property continues to hold under H0. Since Θ0 and Θ − Θ0 together form the entire well-behaved parameter space Θ, we can differentiate functions of θ at θ0 ∈ Θ0 in all directions, including those leading to a passage into the alternative parameter space Θ − Θ0. We shall make use of this facility in deriving the three tests of this section.
Note that each restriction, $R_i(\theta) = 0$, is in some (re-)parametrization equivalent to putting one parameter equal to zero, or suppressing it. This usually means a simplification of the model, and in this sense hypotheses express simplifying assumptions, as is reflected by the corresponding reduction in the number of dimensions of the parameter space. Starting from a loosely specified, very general model with a surfeit of parameters, we may arrange successive simplifications in a definite order so as to nest subspaces of ever smaller dimensionality within one another. In obeisance to the principle of parsimony we should then move down along this sequence, paring down the number of parameters and thus gradually carving the final sparse specification out of the overblown parent model in which it is concealed, testing all the while to see how far we can go. But simplifying assumptions or nested hypotheses, and the need for relevant statistical tests, do of course also arise outside this specific context in the natural pursuit of parsimony.
2 Constrained Estimation
In the previous section we discussed statistical tests of $r$ independent equality restrictions on the $s \times 1$ parameter vector $\theta_0$. The simplest way of estimating $\theta_0$ subject to (1) is to eliminate $r$ parameters. This can be done by adopting a transformation $\eta = R(\theta)$, taking care to define the first $r$ elements of the vector function in such a way that they correspond to the restrictions (1), that is, $R_i(\theta) = 0$ for $1 \le i \le r$. The remaining functions $R_i(\theta)$, $i = r + 1, \ldots, s$, can be chosen at will, provided the existence of the inverse transformation $\theta = R^{-1}(\eta)$ is assured. Whenever possible the identity is a popular choice. Under $H_0$ we now have
\[
\eta_0 = \begin{bmatrix} 0 \\ \eta^*_0 \end{bmatrix},
\]
and the remaining $s - r$ elements, which form the subvector $\eta^*$, can be estimated without constraint. This will yield an estimator $\hat\eta^*$ with covariance matrix $Var(\hat\eta^*)$. The constrained estimator of the full parameter vector is then
\[
\tilde\eta = \begin{bmatrix} 0 \\ \hat\eta^* \end{bmatrix}, \qquad
Var(\tilde\eta) = \begin{bmatrix} 0 & 0 \\ 0 & Var(\hat\eta^*) \end{bmatrix}.
\]
The constrained estimator of the original parameter vector $\theta_0$ then follows from the inverse transformation, $\tilde\theta = R^{-1}(\tilde\eta)$.
This is a practical method of estimation, but it is not very helpful when we wish to examine the asymptotic distribution of the constrained estimate; for this purpose we turn to constrained maximization of the likelihood function by the Lagrange Multiplier method.
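To begin with, we rewrite the restrictions (1) as an $r \times 1$ vector function $g_R(\theta) = 0$. Likewise we write
\[
G_R(\theta) = \left( \frac{\partial}{\partial\theta_j} R_i(\theta) \right)_{r \times s}.
\]
The new maximand is $L(\theta) - g_R(\theta)^T \mu$, with $\mu$ a vector of $r$ Lagrange multipliers. Differentiation yields $s + r$ first-order conditions that must be satisfied by the constrained estimators $\tilde\theta$ and $\tilde\mu$, namely,
\[
Q(\tilde\theta) - G_R(\tilde\theta)^T \tilde\mu = 0, \qquad g_R(\tilde\theta) = 0, \tag{2}
\]
where
\[
Q(\theta)^T = \left( \frac{\partial}{\partial\theta_j} L(\theta) \right)_{1 \times s}.
\]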
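Now we examine these estimators under $H_0$, when they are appropriate. As maximum likelihood estimators are consistent, $\tilde\theta$ converges to $\theta_0 \in \Theta_0$ as the sample size increases. We suppose indeed that $\tilde\theta$ is sufficiently close to $\theta_0$ to justify several large-sample approximations, as follows:
\[
G_R(\tilde\theta)^T \tilde\mu \approx G_R(\theta_0)^T \tilde\mu, \tag{3}
\]
\[
Q(\tilde\theta) \approx Q(\theta_0) + H(\theta_0)(\tilde\theta - \theta_0), \tag{4}
\]
\[
g_R(\tilde\theta) \approx G_R(\theta_0)(\tilde\theta - \theta_0), \tag{5}
\]
where
\[
H(\theta) = \left( \frac{\partial^2}{\partial\theta_j \partial\theta_k} L(\theta) \right)_{s \times s}.
\]
$H_0$ is used in (5), where we take it that $g_R(\theta_0) = 0$.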
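Upon substitution of these approximations into (2) and some rearrangement of the terms we obtain a system of $r + s$ simultaneous linear equations
\[
\begin{bmatrix} -H(\theta_0) & G_R(\theta_0)^T \\ G_R(\theta_0) & 0 \end{bmatrix}
\begin{bmatrix} \tilde\theta - \theta_0 \\ \tilde\mu \end{bmatrix}
\approx
\begin{bmatrix} Q(\theta_0) \\ 0 \end{bmatrix}
\]
or again
\[
\begin{bmatrix} -\frac{1}{n} H(\theta_0) & G_R(\theta_0)^T \\ G_R(\theta_0) & 0 \end{bmatrix}
\begin{bmatrix} \sqrt{n}(\tilde\theta - \theta_0) \\ \frac{1}{\sqrt{n}} \tilde\mu \end{bmatrix}
\approx
\begin{bmatrix} \frac{1}{\sqrt{n}} Q(\theta_0) \\ 0 \end{bmatrix}. \tag{6}
\]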
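It is known that $-n^{-1} H(\theta_0) \xrightarrow{P} I(\theta_0)$ by the law of large numbers. Upon substitution in (6) this gives
\[
\begin{bmatrix} \sqrt{n}(\tilde\theta - \theta_0) \\ \frac{1}{\sqrt{n}} \tilde\mu \end{bmatrix}
\approx
\begin{bmatrix} I(\theta_0) & G_R(\theta_0)^T \\ G_R(\theta_0) & 0 \end{bmatrix}^{-1}
\begin{bmatrix} \frac{1}{\sqrt{n}} Q(\theta_0) \\ 0 \end{bmatrix}. \tag{7}
\]
It follows directly from the multivariate Lindeberg-Lévy CLT that
\[
\frac{1}{\sqrt{n}} Q(\theta_0) \xrightarrow{d} N(0, I(\theta_0)).
\]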
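Observe that
\[
\begin{bmatrix} I(\theta_0) & G_R(\theta_0)^T \\ G_R(\theta_0) & 0 \end{bmatrix}^{-1}
=
\begin{bmatrix}
I^{-1}(\theta_0) - I^{-1}(\theta_0) G_R(\theta_0)^T [A(\theta_0)]^{-1} G_R(\theta_0) I^{-1}(\theta_0) & I^{-1}(\theta_0) G_R(\theta_0)^T [A(\theta_0)]^{-1} \\
[A(\theta_0)]^{-1} G_R(\theta_0) I^{-1}(\theta_0) & -[A(\theta_0)]^{-1}
\end{bmatrix},
\]
where $A(\theta_0) = G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T$. It follows that the $(r + s)$-vector on the left of (7) is also asymptotically normal with zero mean and with covariance matrix
\[
\begin{bmatrix} I(\theta_0) & G_R(\theta_0)^T \\ G_R(\theta_0) & 0 \end{bmatrix}^{-1}
\begin{bmatrix} I(\theta_0) & 0 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} I(\theta_0) & G_R(\theta_0)^T \\ G_R(\theta_0) & 0 \end{bmatrix}^{-1} \tag{8}
\]
\[
=
\begin{bmatrix}
I^{-1}(\theta_0) - I^{-1}(\theta_0) G_R(\theta_0)^T [A(\theta_0)]^{-1} G_R(\theta_0) I^{-1}(\theta_0) & \cdots \\
\cdots & [G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T]^{-1}
\end{bmatrix}.
\]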
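The partitioned inverse and the covariance matrix (8) can be checked numerically. The sketch below uses exact rational arithmetic on a made-up example with $s = 2$ and $r = 1$; the particular matrices `info` and `G` are hypothetical and chosen only for illustration. The check uses $-[A(\theta_0)]^{-1}$ as the lower-right block of the inverse, which is what makes the product equal the identity, while the lower-right block of the covariance (8) comes out as $+[A(\theta_0)]^{-1}$.

```python
from fractions import Fraction as Fr

def matmul(A, B):
    """Plain matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Hypothetical example: s = 2 parameters, r = 1 restriction.
info = [[Fr(2), Fr(1)], [Fr(1), Fr(3)]]   # I(theta0), symmetric positive definite
G = [[Fr(1), Fr(-1)]]                     # G_R(theta0), an r x s matrix

det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
info_inv = [[info[1][1] / det, -info[0][1] / det],
            [-info[1][0] / det, info[0][0] / det]]

GIinv = matmul(G, info_inv)               # G I^{-1}, 1 x 2
IinvGt = transpose(GIinv)                 # I^{-1} G^T, 2 x 1 (I^{-1} is symmetric)
A = matmul(GIinv, transpose(G))[0][0]     # A = G I^{-1} G^T, a scalar because r = 1

top_left = [[info_inv[i][j] - IinvGt[i][0] * GIinv[0][j] / A for j in range(2)]
            for i in range(2)]
candidate = [top_left[0] + [IinvGt[0][0] / A],
             top_left[1] + [IinvGt[1][0] / A],
             [GIinv[0][0] / A, GIinv[0][1] / A, -1 / A]]   # lower-right block is -A^{-1}

M = [info[0] + [G[0][0]], info[1] + [G[0][1]], G[0] + [Fr(0)]]
identity = matmul(M, candidate)           # should be the 3 x 3 identity

# Covariance (8): M^{-1} diag(I, 0) M^{-1}
D = [info[0] + [Fr(0)], info[1] + [Fr(0)], [Fr(0)] * 3]
cov = matmul(matmul(candidate, D), candidate)
```

Exact fractions avoid any rounding question, so the identity can be checked with plain equality.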
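This yields that the asymptotic variance of $\sqrt{n}(\tilde\theta - \theta_0)$ is
\[
I^{-1}(\theta_0) - I^{-1}(\theta_0) G_R(\theta_0)^T \left[ G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T \right]^{-1} G_R(\theta_0) I^{-1}(\theta_0) \tag{9}
\]
\[
= I^{-1/2}(\theta_0) \left\{ I_s - I^{-1/2}(\theta_0) G_R(\theta_0)^T \left[ G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T \right]^{-1} G_R(\theta_0) I^{-1/2}(\theta_0) \right\} I^{-1/2}(\theta_0),
\]
where $I_s$ denotes the identity matrix of order $s$.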
The middle factor in (9) has the same structure, and hence much the same properties, as the "projection matrix" of the linear regression model in its generalized least squares version. It is idempotent, as is readily verified, and hence its rank equals its trace. This trace is the difference of the traces of two terms. The first is a unit matrix of order $s$, with trace $s$; the second term is itself idempotent and of rank $r$, the rank of $G_R(\theta_0)$, and hence of trace $r$. Altogether the rank of (9) is $s - r$.
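The idempotency and the trace argument can be verified on a small example. The following sketch assumes a hypothetical diagonal information matrix $I(\theta_0) = \mathrm{diag}(4, 9, 1)$, so that $I^{-1/2}(\theta_0)$ is rational, together with $r = 2$ made-up linear restrictions on $s = 3$ parameters.

```python
from fractions import Fraction as Fr

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

s, r = 3, 2
# I^{-1/2} for the hypothetical information matrix I = diag(4, 9, 1)
root_inv = [[Fr(1, 2), 0, 0], [0, Fr(1, 3), 0], [0, 0, Fr(1)]]
# Hypothetical Jacobian of the restrictions theta_1 = theta_3 and theta_2 = theta_3
G = [[Fr(1), Fr(0), Fr(-1)], [Fr(0), Fr(1), Fr(-1)]]

B = matmul(G, root_inv)                    # G I^{-1/2}, r x s
A = matmul(B, transpose(B))                # G I^{-1} G^T, r x r
detA = A[0][0] * A[1][1] - A[0][1] * A[1][0]
A_inv = [[A[1][1] / detA, -A[0][1] / detA], [-A[1][0] / detA, A[0][0] / detA]]

P = matmul(matmul(transpose(B), A_inv), B)  # I^{-1/2} G^T A^{-1} G I^{-1/2}
M = [[(1 if i == j else 0) - P[i][j] for j in range(s)] for i in range(s)]
trace_M = sum(M[i][i] for i in range(s))    # equals the rank, since M is idempotent
```

Here `M` is the middle factor of (9); its trace should equal $s - r = 1$.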
As for $\tilde\mu$, we seldom determine these estimates explicitly, and there is little interest in their asymptotic covariance matrix; for the discussion of the Lagrange multiplier test we mention that the asymptotic variance of $n^{-1/2} \tilde\mu$ is
\[
\left[ G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T \right]^{-1}. \tag{10}
\]
3 Hypothesis Testing By Likelihood Methods
Let H0 denote a null hypothesis to be tested. Typically, we may represent H0 as a specified family F0 of distributions for the data. For any test procedure T , we shall denote by Tn the version based on a sample of size n. The function
γn(T, F ) = PF(Tn rejects H0),
defined for a distribution function $F$, is called the power function of $T_n$ (or of $T$). For $F \in \mathcal{F}_0$, $\gamma_n(T, F)$ represents the probability of a Type I error. The quantity
\[
\alpha_n(T, \mathcal{F}_0) = \sup_{F \in \mathcal{F}_0} \gamma_n(T, F)
\]
is called the size of the test. For $F \notin \mathcal{F}_0$, the quantity
\[
\beta_n(T, F) = 1 - \gamma_n(T, F)
\]
represents the probability of a Type II error. Usually, attention is confined to consistent tests: for fixed $F \notin \mathcal{F}_0$, $\beta_n(T, F) \to 0$ as $n \to \infty$. Attention is also usually confined to unbiased tests: for $F \notin \mathcal{F}_0$, $\gamma_n(T, F) \ge \alpha_n(T, \mathcal{F}_0)$.
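As a concrete illustration of the power function, size, consistency and unbiasedness, consider a one-sided z-test. The sketch below assumes (these choices are not part of the text) that $X_1, \ldots, X_n$ are iid $N(\theta, 1)$, that $H_0 : \theta = 0$ is tested against $\theta > 0$, and that $H_0$ is rejected when $\sqrt{n} \bar{X}$ exceeds the upper $\alpha$ normal quantile. In this model the power function has the closed form $1 - \Phi(z_{1-\alpha} - \sqrt{n}\theta)$.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

def power(theta, n, alpha=0.05):
    """Power gamma_n(theta) of the one-sided z-test in the N(theta, 1) model."""
    z = std_normal.inv_cdf(1 - alpha)          # critical value
    return 1 - std_normal.cdf(z - sqrt(n) * theta)

size = power(0.0, 25)                          # power at the null equals the size
type_two_error = 1 - power(0.5, 25)            # beta_n at the alternative theta = 0.5
```

Evaluating `power(0.5, n)` for growing `n` shows the Type II error shrinking toward zero, which is the consistency property; `power(theta, n) >= size` for `theta > 0` is unbiasedness.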
A general way to compare two such test procedures is through their power functions. In this regard we shall use the concept of asymptotic relative efficiency (ARE). For two test procedures $T_A$ and $T_B$, suppose that a performance criterion is tightened in such a way that the respective sample sizes $n_A$ and $n_B$ for $T_A$ and $T_B$ to perform "equivalently" tend to $\infty$ but have ratio $n_A / n_B$ tending to some limit. Then the limit represents the ARE of procedure $T_B$ relative to procedure $T_A$ and is denoted by $e(T_B, T_A)$.
The earliest approach to ARE was introduced by Pitman (1949). In this approach, two test sequences $T = \{T_n\}$ and $U = \{U_n\}$ are compared as the Type I and Type II error probabilities tend to positive limits $\alpha$ and $\beta$, respectively. In order that $\alpha_n \to \alpha > 0$ and simultaneously $\beta_n \to \beta > 0$, it is necessary to consider $\beta_n(\cdot)$ evaluated at an alternative $F^{(n)}$ converging at a suitable rate to the null hypothesis $F_0$.
In justification of this approach, we might argue that large sample sizes would be relevant in practice only if the alternative of interest were close to the null hypothesis and thus hard to distinguish with only a small sample.
3.1 Test Statistics for A Simple Null Hypothesis
Although the theory of this section is of most value for composite null hypotheses, it is convenient to begin with a simple null hypothesis. Consider testing $H_0 : \theta = \theta_0$.
The likelihood ratio statistic
\[
\Lambda_n = \frac{Lik(\theta_0; x)}{\sup_{\theta \in \Theta} Lik(\theta; x)}
\]
was introduced by Neyman and Pearson (1928). Clearly, $\Lambda_n$ takes values in the interval $[0, 1]$, and $H_0$ is to be rejected for sufficiently small values of $\Lambda_n$. Equivalently, the test may be carried out in terms of the statistic
\[
\lambda_n = -2 \log \Lambda_n,
\]
which turns out to be more convenient for asymptotic considerations. For finite $n$, the null distribution of $\lambda_n$ will generally depend on $n$ and on the form of the pdf of $X$. However, for regular problems there is a uniform limiting result as $n \to \infty$.
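Expanding $\lambda_n$ in a Taylor series about $\hat\theta$, we get
\[
\lambda_n = -2 \left\{ -\sum_{i=1}^n \log f(X_i; \hat\theta) + \sum_{i=1}^n \log f(X_i; \theta_0) \right\}
= 2 \left\{ \frac{1}{2} (\theta_0 - \hat\theta)^T \left( -\sum_{i=1}^n \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(X_i; \theta) \Big|_{\theta = \theta^*} \right) (\theta_0 - \hat\theta) \right\},
\]
where $\theta^*$ lies between $\hat\theta$ and $\theta_0$; the first-order term vanishes because $\hat\theta$ solves the likelihood equations. Since $\hat\theta$, and hence $\theta^*$, is consistent,
\[
\lambda_n = n (\hat\theta - \theta_0)^T \left( -\frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(X_i; \theta) \Big|_{\theta = \theta_0} \right) (\hat\theta - \theta_0) + o_P(1).
\]
By the asymptotic normality of $\hat\theta$ and the convergence of $-\frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(X_i; \theta) \big|_{\theta = \theta_0}$ to $I(\theta_0)$, $\lambda_n$ has, under $H_0$, a limiting chi-squared distribution with $s$ degrees of freedom.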
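Let $\hat\theta_n$ denote a consistent, asymptotically normal, and asymptotically efficient sequence of solutions of the likelihood equations. Denote the efficient scores by
\[
q(\theta) = (q_1(x; \theta), \ldots, q_s(x; \theta))^T, \quad \text{where} \quad q_j(x; \theta) = \frac{\partial}{\partial\theta_j} \log f(x; \theta).
\]
Replacing the matrix $-\frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(X_i; \theta) \big|_{\theta = \theta_0}$ by $I(\hat\theta_n)$, we get a second statistic,
\[
W_n = n (\hat\theta_n - \theta_0)^T I(\hat\theta_n) (\hat\theta_n - \theta_0),
\]
which was introduced by Wald (1943).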
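Write $Q(\theta) = \sum_{i=1}^n q(X_i; \theta)$. A second large-sample equivalent to $\lambda_n$ can be obtained from
\[
\hat\theta - \theta_0 = n^{-1} I^{-1}(\theta_0) Q(\theta_0) + o_P(n^{-1/2}).
\]
Substituting this into the quadratic form gives a third statistic,
\[
V_n = n [n^{-1} Q(\theta_0)]^T I^{-1}(\theta_0) [n^{-1} Q(\theta_0)] = n^{-1} Q(\theta_0)^T I^{-1}(\theta_0) Q(\theta_0),
\]
which was introduced by Rao (1948).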
The three statistics differ somewhat in computational features. Note that Rao’s statistic does not require explicit computation of the maximum likelihood estimates.
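Nevertheless all three statistics have the same limiting chi-squared distribution with $s$ degrees of freedom under the null hypothesis. The limiting distribution can be found from the following lemma.
Lemma 1 Under regularity conditions,
(i) $n^{-1/2} Q(\theta_0) \xrightarrow{d} N(0, I(\theta_0))$;
(ii) $n^{1/2} (\hat\theta_n - \theta_0) \xrightarrow{d} N(0, I^{-1}(\theta_0))$;
(iii) $n (\hat\theta_n - \theta_0)^T I(\theta_0) (\hat\theta_n - \theta_0) \xrightarrow{d} \chi^2_s$;
(iv) $n^{-1} Q(\theta_0)^T I^{-1}(\theta_0) Q(\theta_0) \xrightarrow{d} \chi^2_s$;
(v) $\lambda_n - n (\hat\theta_n - \theta_0)^T I(\theta_0) (\hat\theta_n - \theta_0) \xrightarrow{P} 0$.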
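How good the chi-squared approximation is for a finite sample can be examined exactly in simple models. The sketch below assumes a Bernoulli model with a made-up null value $p_0 = 0.3$ and $n = 200$, and computes the exact size of the test that rejects when $\lambda_n$ exceeds the $0.95$ quantile of $\chi^2_1$. The constant `3.841458820694124` is that quantile; the function names are illustrative only.

```python
from math import comb, log

def lr_stat(k, n, p0):
    """lambda_n = 2[L(p_hat) - L(p0)] for k successes in n Bernoulli trials."""
    def loglik(p):
        out = 0.0                      # terms with a zero count are dropped (0 log 0 = 0)
        if k > 0:
            out += k * log(p)
        if k < n:
            out += (n - k) * log(1 - p)
        return out
    return 2 * (loglik(k / n) - loglik(p0))

def exact_size(n, p0, critical=3.841458820694124):
    """Exact null rejection probability: sum the binomial pmf over the rejection region."""
    return sum(comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
               for k in range(n + 1) if lr_stat(k, n, p0) > critical)

size_200 = exact_size(200, 0.3)
```

Because the binomial distribution is discrete, the exact size is close to, but not exactly, the nominal 5%.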
Remark. Consider a sequence of $n$ independent trials, with $s$ possible outcomes for each trial. Let $\theta_j$ denote the probability of occurrence of the $j$th outcome in any given trial. Let $N_j$ denote the number of occurrences of the $j$th outcome in the series of $n$ trials. The MLEs of the $\theta_j$ are $N_j / n$. The three test statistics $\lambda_n$, $W_n$ and $V_n$ for testing $H_0 : \theta = \theta_0$ against $H_A : \theta \ne \theta_0$ are easily seen to be
\[
\lambda_n = 2 \sum_{j=1}^s N_j \log\left( \frac{N_j}{n \theta_{0j}} \right), \qquad
W_n = \sum_{j=1}^s \frac{(N_j - n \theta_{0j})^2}{N_j}, \qquad
V_n = \sum_{j=1}^s \frac{(N_j - n \theta_{0j})^2}{n \theta_{0j}}.
\]
Both $W_n$ and $V_n$ are referred to as chi-squared goodness of fit statistics; the latter is often called the Pearson chi-squared statistic. Its large-sample properties were first derived by Pearson (1900).
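The three multinomial statistics of the remark are straightforward to compute. The following sketch evaluates them on made-up counts; the data and the function name are illustrative only.

```python
from math import log

def multinomial_tests(counts, probs0):
    """Return (lambda_n, W_n, V_n) for observed counts N_j and null probabilities theta_0j."""
    n = sum(counts)
    # Cells with N_j = 0 contribute nothing to the likelihood ratio statistic.
    lr = 2 * sum(c * log(c / (n * p)) for c, p in zip(counts, probs0) if c > 0)
    # W_n divides by the observed counts, so it needs every N_j > 0.
    wald = sum((c - n * p) ** 2 / c for c, p in zip(counts, probs0))
    # V_n is the Pearson statistic, dividing by the expected counts.
    score = sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs0))
    return lr, wald, score

lr, wald, score = multinomial_tests([18, 55, 27], [0.25, 0.5, 0.25])
```

On these counts the three values differ noticeably even though they share the same limiting null distribution, which illustrates that the equivalence is asymptotic only.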
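Let us now consider the behavior of $\lambda_n$, $W_n$ and $V_n$ under "local" alternatives, that is, for a sequence $\{\theta_n\}$ of the form
\[
\theta_n = \theta_0 + n^{-1/2} \Delta,
\]
where $\Delta = (\Delta_1, \ldots, \Delta_s)^T$. Suppose that the convergences expressed in the above lemma may be established uniformly for $\theta$ in a neighborhood of $\theta_0$. It then would follow that, under $\theta_n$,
\[
n^{1/2} (\hat\theta - \theta_0) = n^{1/2} (\hat\theta - \theta_n) + \Delta \xrightarrow{d} N(\Delta, I^{-1}(\theta_0)),
\]
\[
n^{-1/2} Q(\theta_0) = I(\theta_0) \, n^{1/2} (\hat\theta - \theta_0) + o_{P_{\theta_n}}(1) \xrightarrow{d} N(I(\theta_0) \Delta, I(\theta_0)),
\]
and
\[
\lambda_n - W_n \xrightarrow{P_{\theta_n}} 0.
\]
It then follows that the statistics $\lambda_n$, $W_n$ and $V_n$ each converge in distribution to $\chi^2_s(\Delta^T I(\theta_0) \Delta)$, the noncentral chi-squared distribution with $s$ degrees of freedom and noncentrality parameter $\Delta^T I(\theta_0) \Delta$.
Therefore, under appropriate regularity conditions, the statistics $\lambda_n$, $W_n$ and $V_n$ are asymptotically equivalent in distribution, both under the null hypothesis and under local alternatives converging sufficiently fast. However, at fixed alternatives these equivalences are not anticipated to hold.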
3.2 Likelihood Confidence Region
Due to the duality between tests and confidence regions, families of tests can generate confidence regions. Let {δ(x, θ)} be a family of tests such that {δ(x, θ0)} is a test of level of significance α for testing H0 : θ = θ0. Note that δ(x, θ) takes values either 1 or 0. When δ(x, θ) = 1, we reject. Otherwise, we cannot reject H0.
Define the subset C(x) of Θ by C(x) = {θ : δ(x, θ) = 0}. This is just the set of all θ that would not be rejected if we observe X = x and used the given family of tests.
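Based on the likelihood ratio tests, we can construct a $1 - \alpha$ likelihood-based confidence region for $\theta$ by $C(x) = \{\theta : \lambda_n(\theta) \le \chi^2_s(1 - \alpha)\}$, where $\lambda_n(\theta)$ is the statistic $\lambda_n$ for testing the simple null value $\theta$ and $\chi^2_s(1 - \alpha)$ is the $1 - \alpha$ quantile of the $\chi^2_s$ distribution.
Remark. Under regularity conditions, the asymptotic distribution of $\sqrt{n}(\hat\theta - \theta_0)$ is usually normal, which leads to confidence ellipsoids for $\theta$. However, the confidence regions for $\theta$ obtained from the likelihood ratio tests are not necessarily ellipsoids.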
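The difference between the two kinds of region already shows up in one dimension. The sketch below assumes a Bernoulli model with made-up data (3 successes in 20 trials), inverts the likelihood ratio test on a grid of candidate values, and compares the result with the symmetric interval based on the normal approximation. The grid search is a simple illustrative device, not an efficient algorithm.

```python
from math import log, sqrt

def lr_stat(k, n, p):
    """lambda_n(p) for k successes in n Bernoulli trials, with 0 < p < 1."""
    def loglik(q):
        out = 0.0
        if k > 0:
            out += k * log(q)
        if k < n:
            out += (n - k) * log(1 - q)
        return out
    return 2 * (loglik(k / n) - loglik(p))

def likelihood_interval(k, n, critical=3.841458820694124, grid=10000):
    """Smallest and largest grid values of p that the LR test does not reject."""
    inside = [i / grid for i in range(1, grid) if lr_stat(k, n, i / grid) <= critical]
    return min(inside), max(inside)

def wald_interval(k, n, z=1.959963984540054):
    """Symmetric interval p_hat +/- z * sqrt(p_hat (1 - p_hat) / n)."""
    p = k / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

lo, hi = likelihood_interval(3, 20)
wald_lo, wald_hi = wald_interval(3, 20)
```

For these data the likelihood-based interval stays inside $(0, 1)$ and is longer to the right of $\hat{p} = 0.15$ than to the left, while the symmetric interval has a negative lower endpoint.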
3.3 Likelihood Ratio Test
This test is based on the likelihood function. If $H_0$ holds, and hence $\theta_0 \in \Theta_0 = \{\theta \in \Theta : R_i(\theta) = 0, 1 \le i \le r\}$, the unconstrained maximum $L(\hat\theta)$ should be close to the constrained maximum $L(\tilde\theta)$. We therefore consider the likelihood ratio
\[
\Lambda_n = \frac{\sup_{\theta \in \Theta_0} Lik(\theta; x)}{\sup_{\theta \in \Theta} Lik(\theta; x)} = \frac{Lik(\tilde\theta; x)}{Lik(\hat\theta; x)}.
\]
As all likelihoods are positive, and as the constrained maximum cannot exceed the unconstrained maximum, $0 < \Lambda_n \le 1$. Equivalently, we use the quantity
\[
\lambda_n = -2 \log \Lambda_n = 2 [L(\hat\theta) - L(\tilde\theta)],
\]
which is therefore always nonnegative. We shall show that under $H_0$ it is asymptotically distributed as chi-square with $r$ degrees of freedom, or
\[
\lambda_n \xrightarrow{d} \chi^2_r,
\]
so that $\lambda_n$ can serve as a test statistic for $H_0$.
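Recall that $Q(\theta) = \sum_{i=1}^n q(x_i; \theta)$. Once more we examine a Taylor series expansion
\[
L(\tilde\theta) - L(\hat\theta) = Q(\hat\theta)^T (\tilde\theta - \hat\theta) + \frac{1}{2} (\tilde\theta - \hat\theta)^T H(\hat\theta) (\tilde\theta - \hat\theta) + o_P(1), \tag{11}
\]
where
\[
H(\theta) = \left( \sum_{i=1}^n \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(x_i; \theta) \right)_{s \times s}.
\]
The first term on the right-hand side vanishes since $Q(\hat\theta) = 0$; by the consistency of $\hat\theta$,
\[
-\frac{1}{n} H(\hat\theta) \xrightarrow{P} I(\theta_0).
\]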
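We simplify (11) accordingly, and substitute the result in $\lambda_n$. This gives
\[
\lambda_n = [\sqrt{n}(\tilde\theta - \hat\theta)]^T I(\theta_0) [\sqrt{n}(\tilde\theta - \hat\theta)] + o_P(1). \tag{12}
\]
Then it follows from (7) that under $H_0$
\[
\sqrt{n}(\tilde\theta - \theta_0) \approx I^{-1}(\theta_0) \left\{ I_s - G_R(\theta_0)^T [G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T]^{-1} G_R(\theta_0) I^{-1}(\theta_0) \right\} \frac{1}{\sqrt{n}} Q(\theta_0),
\]
with $I_s$ the identity matrix of order $s$, and that
\[
\sqrt{n}(\hat\theta - \theta_0) = I^{-1}(\theta_0) \frac{1}{\sqrt{n}} Q(\theta_0) + o_P(1),
\]
so that
\[
\sqrt{n}(\tilde\theta - \hat\theta) \approx -I^{-1}(\theta_0) G_R(\theta_0)^T [G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T]^{-1} G_R(\theta_0) I^{-1}(\theta_0) \frac{1}{\sqrt{n}} Q(\theta_0). \tag{13}
\]
Recall that
\[
\frac{1}{\sqrt{n}} Q(\theta_0) \xrightarrow{d} N(0, I(\theta_0)),
\]
so that
\[
\frac{1}{\sqrt{n}} I^{-1/2}(\theta_0) Q(\theta_0) \xrightarrow{d} \epsilon \sim N(0, I_s). \tag{14}
\]
Here $\epsilon$ is a vector of $s$ standard normal variates that are uncorrelated and hence independent. By (13) and (14), moreover,
\[
\sqrt{n}(\tilde\theta - \hat\theta) \approx -I^{-1}(\theta_0) G_R(\theta_0)^T [G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T]^{-1} G_R(\theta_0) I^{-1}(\theta_0) I^{1/2}(\theta_0) \epsilon.
\]
Upon substituting this into (12) we finally obtain
\[
\lambda_n \approx \epsilon^T I^{-1/2}(\theta_0) G_R(\theta_0)^T [G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T]^{-1} G_R(\theta_0) I^{-1/2}(\theta_0) \epsilon. \tag{15}
\]
This is a quadratic form in independent standard normal variates, with a nonstochastic idempotent coefficient matrix that is of rank $r$ because $G_R(\theta_0)$ has rank $r$; it is therefore distributed as chi-square with $r$ degrees of freedom. And this is what we set out to prove.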
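A small composite example may help fix ideas. The sketch below assumes a hypothetical model that is not discussed in the text: iid pairs $(A_i, B_i)$ of independent Bernoulli variables with success probabilities $\theta = (p_1, p_2)$, and the single restriction $R(\theta) = p_1 - p_2 = 0$, so $s = 2$ and $r = 1$. Under the restriction the constrained MLE of both probabilities is the pooled proportion. The p-value uses the fact that a $\chi^2_1$ tail probability equals $\mathrm{erfc}(\sqrt{x/2})$.

```python
from math import erfc, log, sqrt

def bern_loglik(k, n, p):
    """Log-likelihood of k successes in n Bernoulli(p) trials (0 log 0 treated as 0)."""
    out = 0.0
    if k > 0:
        out += k * log(p)
    if k < n:
        out += (n - k) * log(1 - p)
    return out

def lr_equal_proportions(k1, k2, n):
    """lambda_n = 2[L(theta_hat) - L(theta_tilde)] for H0: p1 = p2, with n pairs."""
    p1, p2 = k1 / n, k2 / n                   # unconstrained MLE
    pooled = (k1 + k2) / (2 * n)              # constrained MLE of the common value
    unconstrained = bern_loglik(k1, n, p1) + bern_loglik(k2, n, p2)
    constrained = bern_loglik(k1, n, pooled) + bern_loglik(k2, n, pooled)
    return 2 * (unconstrained - constrained)

stat = lr_equal_proportions(60, 45, 100)      # made-up counts
p_value = erfc(sqrt(stat / 2))                # P(chi-square with 1 df > stat)
```

The statistic is referred to $\chi^2_r$ with $r = 1$, not to $\chi^2_s$, because only one restriction is being tested.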
3.4 Wald’s Test
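Note that $R_i(\theta) = \theta_i - \theta_{i0}$, $1 \le i \le s$, for the simple null hypothesis $H_0 : \theta = \theta_0$. For composite hypotheses, this test is based on the vector $b_\theta = (R_1(\theta), \ldots, R_r(\theta))^T$ and the estimate $\hat\theta$ which maximizes $L(\theta)$ without being subjected to the restrictions. If $H_0$ holds the $R_i(\theta_0)$ are zero, and the $R_i(\hat\theta)$ should presumably be close to zero. Recall that $\sqrt{n}(\hat\theta - \theta_0)$ under $H_0$ is asymptotically normal with mean zero and covariance $I^{-1}(\theta_0)$. By a first-order expansion of $b_\theta$ about $\theta_0$, this implies that $\sqrt{n} \, b_{\hat\theta}$ under $H_0$ is asymptotically normal with mean zero ($b_{\theta_0} = 0$) and covariance $G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T$.
We use this in the quadratic form: under $H_0$,
\[
\sqrt{n} \, b_{\hat\theta}^T \left[ G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T \right]^{-1} b_{\hat\theta} \sqrt{n}
= n \, b_{\hat\theta}^T \left[ G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T \right]^{-1} b_{\hat\theta} \xrightarrow{d} \chi^2_r.
\]
Since
\[
\left[ G_R(\hat\theta) I^{-1}(\hat\theta) G_R(\hat\theta)^T \right]^{-1} \xrightarrow{P} \left[ G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T \right]^{-1}
\]
by the consistency of $\hat\theta$, Slutsky's theorem gives
\[
W_n = n \, b_{\hat\theta}^T \left\{ G_R(\hat\theta) I^{-1}(\hat\theta) G_R(\hat\theta)^T \right\}^{-1} b_{\hat\theta} \xrightarrow{d} \chi^2_r. \tag{16}
\]
This establishes the asymptotic distribution of Wald's test statistic under the null hypothesis.
We can therefore construct the test statistic from the original unconstrained estimate $\hat\theta$ and its covariance matrix estimate, with the help of the restriction functions $R_i$ of (1) and their first derivatives, which form $G_R$.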
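The construction of $W_n$ from the unconstrained estimate, its estimated covariance and the derivatives of the restriction can be sketched for a simple case. The example assumes the same hypothetical setting as before: iid pairs of independent Bernoulli variables with $\theta = (p_1, p_2)$ and the restriction $R(\theta) = p_1 - p_2$, so that $G_R = (1, -1)$ and $I^{-1}(\theta) = \mathrm{diag}(p_1(1 - p_1), p_2(1 - p_2))$.

```python
def wald_equal_proportions(k1, k2, n):
    """W_n = n b^T [G I^{-1} G^T]^{-1} b for H0: p1 - p2 = 0, with n pairs."""
    p1, p2 = k1 / n, k2 / n                       # unconstrained MLE theta_hat
    b_hat = p1 - p2                               # restriction evaluated at theta_hat
    grad = (1.0, -1.0)                            # G_R: derivative of p1 - p2
    inv_info = (p1 * (1 - p1), p2 * (1 - p2))     # diagonal of I^{-1}(theta_hat)
    a_hat = sum(g * v * g for g, v in zip(grad, inv_info))   # G I^{-1} G^T, a scalar
    return n * b_hat ** 2 / a_hat

w_stat = wald_equal_proportions(60, 45, 100)      # made-up counts
```

Only the unconstrained estimate is needed here; no constrained maximization is carried out.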
3.5 Lagrange Multiplier Test
This test is also known as the Rao efficient score test or as the chi-square test. It is based on the score vector $n^{-1} Q(\theta)$ and the estimate $\tilde\theta$ which maximizes $L(\theta)$ subject to the restrictions $R_i(\theta) = 0$, $1 \le i \le r$. When we evaluate this vector at the constrained estimate $\tilde\theta$ the result is $n^{-1} Q(\tilde\theta)$, and under $H_0$ this should be close to the value of the score vector at the unconstrained estimate, which is of course zero. Rao (1948) introduced the statistic
\[
V_n = n^{-1} Q(\tilde\theta)^T [var(Q(\tilde\theta))]^{-1} Q(\tilde\theta).
\]
Before we work out the details, we first motivate this test statistic. Assume that the specification of $\Theta_0$ may equivalently be given in terms of functions of $\beta$, where $\beta = (\beta_1, \ldots, \beta_{s-r})$ ranges through an open subset of $\mathbb{R}^{s-r}$. In terms of the MLE $\hat\beta$ of $\beta$, $\tilde\theta$ may be represented as
\[
\tilde\theta = g(\hat\beta) = (g_1(\hat\beta), \ldots, g_s(\hat\beta)).
\]
Denote by $J(\beta)$ the information matrix for the $\beta$-formulation of the model, and let
\[
t_\beta = \left( \frac{1}{n} \sum_{i=1}^n \frac{\partial \log f(X_i; g(\beta))}{\partial\beta_1}, \ldots, \frac{1}{n} \sum_{i=1}^n \frac{\partial \log f(X_i; g(\beta))}{\partial\beta_{s-r}} \right)^T.
\]
Then Rao's statistic is
\[
V_n = n \, t_{\hat\beta}^T J^{-1}(\hat\beta) \, t_{\hat\beta}
\]
or, in the original formulation,
\[
V_n = n^{-1} Q(\tilde\theta)^T [var(Q(\tilde\theta))]^{-1} Q(\tilde\theta).
\]
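In order to examine the asymptotic distribution of $Q(\tilde\theta)$ we once more start off with a Taylor series
\[
Q(\tilde\theta) = Q(\hat\theta) + H(\hat\theta)(\tilde\theta - \hat\theta) + o_P(\sqrt{n}) \tag{17}
\]
and use $Q(\hat\theta) = 0$ and $n^{-1} H(\hat\theta) = -I(\theta_0) + o_P(1)$, as before. This yields
\[
\frac{1}{\sqrt{n}} Q(\tilde\theta) = I(\theta_0) \sqrt{n}(\hat\theta - \tilde\theta) + o_P(1),
\]
so that, by (13) and (14), the asymptotic variance of $\frac{1}{\sqrt{n}} Q(\tilde\theta)$ is
\[
G_R(\theta_0)^T [G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T]^{-1} G_R(\theta_0).
\]
This $s \times s$ matrix has rank $r < s$, so its inverse below is to be read as a generalized inverse; $I^{-1}(\theta_0)$ is one such choice. We conclude that
\[
\frac{1}{\sqrt{n}} Q(\tilde\theta)^T \left\{ G_R(\theta_0)^T [G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T]^{-1} G_R(\theta_0) \right\}^{-1} Q(\tilde\theta) \frac{1}{\sqrt{n}}
\]
\[
= \sqrt{n}(\hat\theta - \tilde\theta)^T I(\theta_0) \left\{ G_R(\theta_0)^T [G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T]^{-1} G_R(\theta_0) \right\}^{-1} I(\theta_0) (\hat\theta - \tilde\theta) \sqrt{n} + o_P(1)
\]
\[
= \epsilon^T I^{-1/2}(\theta_0) G_R(\theta_0)^T [G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T]^{-1} G_R(\theta_0) I^{-1/2}(\theta_0) \epsilon + o_P(1), \tag{18}
\]
and this by (15) is the likelihood ratio test statistic, which we know to be asymptotically chi-square ($r$) distributed. This will also hold if we replace $\theta_0$ in (18) by a consistent estimate, as in the test statistic
\[
V_n = \frac{1}{n} Q(\tilde\theta)^T \left\{ G_R(\tilde\theta)^T [G_R(\tilde\theta) I^{-1}(\tilde\theta) G_R(\tilde\theta)^T]^{-1} G_R(\tilde\theta) \right\}^{-1} Q(\tilde\theta). \tag{19}
\]
This is based on the first expression in (18), with $\theta_0$ replaced by the constrained estimator $\tilde\theta$, just as the score vectors $Q$ here take their form from the unconstrained estimation problem, but their values from the constrained estimator $\tilde\theta$. In order to calculate the test statistic, $\tilde\theta$ is the only estimate we need to determine.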
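By (2) and (7),
\[
Q(\tilde\theta) \approx G_R(\tilde\theta)^T \tilde\mu
\quad \text{and} \quad
\tilde\mu \approx [G_R(\theta_0) I^{-1}(\theta_0) G_R(\theta_0)^T]^{-1} G_R(\theta_0) I^{-1}(\theta_0) Q(\theta_0),
\]
so that
\[
V_n \approx \frac{1}{\sqrt{n}} \tilde\mu^T \left[ G_R(\tilde\theta) I^{-1}(\tilde\theta) G_R(\tilde\theta)^T \right] \frac{1}{\sqrt{n}} \tilde\mu.
\]
Note that $[G_R(\tilde\theta) I^{-1}(\tilde\theta) G_R(\tilde\theta)^T]^{-1}$ estimates the asymptotic variance (10) of $\frac{1}{\sqrt{n}} \tilde\mu$. Under $H_0$ the test statistic, and with it the asymptotically equivalent likelihood ratio test statistic, may thus be regarded as a standardized quadratic form in the estimated Lagrange multipliers. This may explain the usual name of the test.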
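The link between the score vector and the Lagrange multipliers can be seen in a small example. The sketch assumes the hypothetical paired-Bernoulli model again, with $\theta = (p_1, p_2)$ and $R(\theta) = p_1 - p_2$. It computes the statistic in the familiar form $n^{-1} Q(\tilde\theta)^T I^{-1}(\tilde\theta) Q(\tilde\theta)$, which is an assumption of this sketch rather than the exact expression of the text, and compares it with the quadratic form in the multiplier obtained from $Q(\tilde\theta) = G_R(\tilde\theta)^T \tilde\mu$.

```python
def score_equal_proportions(k1, k2, n):
    """Score (LM) statistic for H0: p1 = p2 with n pairs; needs 0 < pooled < 1."""
    pooled = (k1 + k2) / (2 * n)               # constrained MLE of the common value
    v = pooled * (1 - pooled)                  # diagonal entry of I^{-1}(theta_tilde)
    score = [(k1 - n * pooled) / v, (k2 - n * pooled) / v]   # Q(theta_tilde)
    stat = sum(q * q * v for q in score) / n   # n^{-1} Q^T I^{-1} Q
    # With G_R = (1, -1), Q(theta_tilde) = G_R^T mu gives mu = first score component.
    multiplier = score[0]
    stat_mu = multiplier ** 2 * (2 * v) / n    # n^{-1} mu [G I^{-1} G^T] mu
    return stat, stat_mu

v_stat, v_stat_mu = score_equal_proportions(60, 45, 100)     # made-up counts
```

Only the constrained estimate enters the computation, and the two expressions agree, in line with the multiplier interpretation above.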
3.6 Discussion
All we have shown is that under the null hypothesis the three test statistics have the same asymptotic distribution; in the alternative case they need not at all have the same asymptotic behavior. Again, in a finite sample the three test statistics usually take different values, and at the present level of generality nothing can be said about their exact distribution, under the null hypothesis or otherwise. Without further knowledge of the finite-sample properties of the tests in a particular application they are of little use, since it is hard to tell what significance levels from an asymptotic distribution mean if the evidence comes from a small sample. In certain applications it has, however, proved possible to derive the exact distribution of one or other of the present test statistics or of its transformation. Failing this, the finite-sample performance of the tests may always be established by simulation studies.
In certain cases the asymptotic test statistic is thus taken as the starting point of a further investigation, as a suggestion for an exact approach. This applies in particular to the Likelihood Ratio test; as it is the oldest of the three it has been studied more intensively than the other two. It was introduced by Fisher in the 1920s and adopted in Neyman and Pearson’s testing methodology of a decade later. The next test was that of Wald (1943), and the most recent is the Lagrange Multiplier test (Rao 1948).
If we do not have the benefit of further analyses, and a large sample that inspires some confidence in the validity of asymptotic results, the choice between the three tests may be affected by expediency. Each requires different computations. For the likelihood ratio test we must maximize the likelihood function twice, with and without restrictions;
for the Wald test we need unconstrained parameter estimates with their covariance matrix; and for Rao’s score test, constrained estimates and their covariance matrix.
Some ingredients may be easier to obtain than others.
Let $T = \{T_n\}$ and $U = \{U_n\}$ be two sequences of tests for the same problem based on sample sizes $n$. The limiting ratio of the sample sizes $n_U$ and $n_T$ required to achieve the same limiting power $\gamma_n(T, F)$ evaluated at the same sequence of alternatives, when the significance levels of the two test sequences also have the same limit, is the Pitman ARE of $T$ relative to $U$,
\[
e_P(T, U) = \lim_{n} \frac{n_U}{n_T}.
\]
If, for example, $e_P(T, U) = 1/2$, this means that the sequence $U_n$ requires approximately half as many observations as the sequence $T_n$ to achieve the same asymptotic results.
4 Pitman Efficiency
Suppose that the distribution F under consideration may be indexed by a set Θ ⊂ R, and consider a simple null hypothesis
H0 : θ = θ0
to be tested against alternatives
θ > θ0.
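Consider the comparison of test sequences $T = \{T_n\}$ satisfying the following conditions, relative to a neighborhood $\theta_0 \le \theta \le \theta_0 + \delta$ of the null hypothesis.
Pitman Conditions
• (P1) For some continuous strictly increasing distribution function $G$, and functions $\mu_n(\theta)$ and $\sigma_n(\theta)$, the $F_\theta$-distribution of $(T_n - \mu_n(\theta)) / \sigma_n(\theta)$ converges to $G$ uniformly in $[\theta_0, \theta_0 + \delta]$:
\[
\sup_{\theta_0 \le \theta \le \theta_0 + \delta} \; \sup_{-\infty < t < \infty} \left| P_\theta\left( \frac{T_n - \mu_n(\theta)}{\sigma_n(\theta)} \le t \right) - G(t) \right| \to 0, \quad n \to \infty.
\]
• (P2) For $\theta \in [\theta_0, \theta_0 + \delta]$, $\mu_n(\theta)$ is $k$ times differentiable, with $\mu_n^{(1)}(\theta_0) = \cdots = \mu_n^{(k-1)}(\theta_0) = 0 < \mu_n^{(k)}(\theta_0)$.
• (P3) For some function $d(n) \to \infty$ and some constant $c > 0$,
\[
\sigma_n(\theta_0) \sim \frac{c \, \mu_n^{(k)}(\theta_0)}{d(n)}, \quad n \to \infty.
\]
• (P4) For $\theta_n = \theta_0 + O([d(n)]^{-1/k})$,
\[
\mu_n^{(k)}(\theta_n) \sim \mu_n^{(k)}(\theta_0), \quad n \to \infty.
\]
• (P5) For $\theta_n = \theta_0 + O([d(n)]^{-1/k})$,
\[
\sigma_n(\theta_n) \sim \sigma_n(\theta_0), \quad n \to \infty.
\]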
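Theorem 1 (Pitman-Noether). (i) Let $T = \{T_n\}$ satisfy (P1)-(P5). Consider testing $H_0$ by critical regions $\{T_n > u_{\alpha_n}\}$ with
\[
\alpha_n = P_{\theta_0}(T_n > u_{\alpha_n}) \to \alpha,
\]
where $0 < \alpha < 1$. For $0 < \beta < 1 - \alpha$ and $\theta_n = \theta_0 + O([d(n)]^{-1/k})$, we have
\[
\beta_n(\theta_n) = P_{\theta_n}(T_n \le u_{\alpha_n}) \to \beta
\]
if and only if
\[
\frac{(\theta_n - \theta_0)^k}{k!} \cdot \frac{d(n)}{c} \to G^{-1}(1 - \alpha) - G^{-1}(\beta). \tag{20}
\]
(ii) Let $T_A = \{T_{An}\}$ and $T_B = \{T_{Bn}\}$ satisfy (P1)-(P5) with common $G$, $k$ and $d(n)$ in (P1)-(P3). Let $d(n) = n^q$, $q > 0$. Then the Pitman ARE of $T_A$ relative to $T_B$ is given by
\[
e_P(T_A, T_B) = \left( \frac{c_B}{c_A} \right)^{1/q}.
\]
Proof. Check that, by (P1),
\[
\beta_n(\theta_n) - G\left( \frac{u_{\alpha_n} - \mu_n(\theta_n)}{\sigma_n(\theta_n)} \right) \to 0, \quad n \to \infty.
\]
Then $\beta_n(\theta_n) \to \beta$ if and only if
\[
\frac{u_{\alpha_n} - \mu_n(\theta_n)}{\sigma_n(\theta_n)} \to G^{-1}(\beta). \tag{21}
\]
Likewise (check), $\alpha_n \to \alpha$ if and only if
\[
\frac{u_{\alpha_n} - \mu_n(\theta_0)}{\sigma_n(\theta_0)} \to G^{-1}(1 - \alpha). \tag{22}
\]
It follows (check, utilizing (P5)) that (21) and (22) together are equivalent to (22) and
\[
\frac{\mu_n(\theta_n) - \mu_n(\theta_0)}{\sigma_n(\theta_0)} \to G^{-1}(1 - \alpha) - G^{-1}(\beta) \tag{23}
\]
together. By (P2) and (P3),
\[
\frac{\mu_n(\theta_n) - \mu_n(\theta_0)}{\sigma_n(\theta_0)} \sim \frac{\mu_n^{(k)}(\tilde\theta_n)}{\mu_n^{(k)}(\theta_0)} \cdot \frac{(\theta_n - \theta_0)^k}{k!} \cdot \frac{d(n)}{c}, \quad n \to \infty,
\]
where $\theta_0 \le \tilde\theta_n \le \theta_n$. Thus, by (P4), (23) is equivalent to (20). This completes the proof of (i).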
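Now consider tests based on $T_A$ and $T_B$, having sizes $\alpha_{An} \to \alpha$ and $\alpha_{Bn} \to \alpha$. Let $0 < \beta < 1 - \alpha$. Let $\{\theta_n\}$ be a sequence of alternatives of the form
\[
\theta_n = \theta_0 + A [d(n)]^{-1/k},
\]
with $A$ a constant. Let $h(n)$ be the sample size at which $T_B$ performs "equivalently" to $T_A$ with sample size $n$, that is, at which $T_B$ and $T_A$ have the same limiting power $1 - \beta$ for the given sequence of alternatives, so that
\[
\beta_{An}(\theta_n) \to \beta, \qquad \beta_{B h(n)}(\theta_n) \to \beta.
\]
It follows by (i) that we must have $d(h(n))$ proportional to $d(n)$ and
\[
\frac{(\theta_n - \theta_0)^k}{k!} \cdot \frac{d(n)}{c_A} \sim \frac{(\theta_n - \theta_0)^k}{k!} \cdot \frac{d(h(n))}{c_B},
\]
or
\[
\frac{d(h(n))}{d(n)} \to \frac{c_B}{c_A}.
\]
For $d(n) = n^q$, this yields $(h(n)/n)^q \to c_B / c_A$, that is, $h(n)/n \to (c_B / c_A)^{1/q}$, proving (ii).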
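Part (ii) of the theorem reduces the efficiency comparison to the constants $c$ of condition (P3). As an illustration that is not taken from the text, consider testing a normal location parameter with known $\sigma$, and compare the test based on the sample mean with the sign test. Under the stated assumptions both satisfy the conditions with $k = 1$ and $d(n) = n^{1/2}$: for the mean, $\mu_n(\theta) = \theta$ and $\sigma_n(\theta_0) = \sigma / \sqrt{n}$, so $c = \sigma$; for the proportion of positive observations, $\mu_n'(0) = 1 / (\sigma \sqrt{2\pi})$ and $\sigma_n(0) = 1 / (2 \sqrt{n})$, so $c = \sigma \sqrt{\pi / 2}$.

```python
from math import pi, sqrt

def pitman_are(c_a, c_b, q):
    """e_P(T_A, T_B) = (c_B / c_A) ** (1 / q) when d(n) = n ** q."""
    return (c_b / c_a) ** (1.0 / q)

sigma = 1.0                        # hypothetical known standard deviation
c_mean = sigma                     # constant in (P3) for the sample-mean test
c_sign = sigma * sqrt(pi / 2)      # constant in (P3) for the sign test

# Efficiency of the sign test relative to the mean test under normality.
are_sign_vs_mean = pitman_are(c_sign, c_mean, q=0.5)
```

The result is $2/\pi \approx 0.64$, the classical efficiency of the sign test relative to the mean-based test under normality; it does not depend on $\sigma$.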
4.1 Rank Tests for Comparing Two Treatments
For comparing a new treatment or procedure with the standard method, $N$ subjects (patients, students, etc.) are divided at random into a group of $n$ who will receive the new treatment and a control group of $m$ who will be treated by the standard method.
At the termination of the study, the subjects are ranked either directly or according to some response that measures the success of the treatment, such as a test score in an educational or psychological investigation. The hypothesis $H_0$ of no treatment effect is rejected, and the superiority of the new treatment acknowledged, if the $n$ treated subjects rank sufficiently high. (Here it is assumed that the success of the treatment is indicated by an increased response; if instead the aim is to decrease the response, $H_0$ is rejected when the $n$ treated subjects rank sufficiently low.)
Let the ranks of the treated subjects be denoted by S1, . . . , Sn, where we shall assume that they are numbered in increasing order. Denote the sum of the treatment ranks WS = S1+ · · · + Sn. The hypothesis H0 is then rejected and the treatment judged to be effective when WS is sufficiently large, say, when WS ≥ c. Here the constant c is determined by the equation
PH0(WS ≥ c) = α.
The test defined above is known as the Wilcoxon rank-sum test.
Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be independent, the $X$'s identically distributed with distribution $F$ and the $Y$'s identically distributed with distribution $G$. Here the $Y$'s are responses to a treatment. Then $H_0 : F = G$ and $H_a$ : $Y$ is stochastically larger than $X$, i.e., $G(t) \le F(t)$ for all $t$ but $G \ne F$.
Let the ranks of the $X$'s be denoted by $R_1, \ldots, R_m$. If we substitute $R$'s for $X$'s and $S$'s for $Y$'s in the two-sample t-test statistic, we obtain
\[
\left( \frac{nm}{N} \right)^{1/2}
\frac{\frac{1}{n} \sum_{i=1}^n S_i - \frac{1}{m} \sum_{j=1}^m R_j}
{\left\{ (N - 2)^{-1} \left[ \sum_{i=1}^n \left( S_i - \frac{N+1}{2} \right)^2 + \sum_{j=1}^m \left( R_j - \frac{N+1}{2} \right)^2 \right] \right\}^{1/2}}.
\]
This statistic is equivalent to the Wilcoxon statistic $W_S$, the sum of the ranks of the treatment group. Let $W_{XY}$ denote the number of pairs $(X_i, Y_j)$ with $X_i < Y_j$. It can be shown that
\[
W_S - \frac{1}{2} n(n + 1) = W_{XY}.
\]
$W_{XY}$ is usually known as the Mann-Whitney statistic. Let $\phi(X_i, Y_j) = 1$ if $X_i < Y_j$, and $0$ otherwise. Then
\[
W_{XY} = \sum_{i=1}^m \sum_{j=1}^n \phi(X_i, Y_j). \tag{24}
\]
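The relation between the rank sum and the Mann-Whitney count can be checked directly on data. The sketch below uses made-up observations without ties; the function names are illustrative only.

```python
def rank_sum(xs, ys):
    """W_S: sum of the ranks of the ys in the pooled sample (assumes no ties)."""
    pooled = sorted(xs + ys)
    rank = {value: position + 1 for position, value in enumerate(pooled)}
    return sum(rank[y] for y in ys)

def mann_whitney(xs, ys):
    """W_XY: number of pairs (x, y) with x < y, as in (24)."""
    return sum(1 for x in xs for y in ys if x < y)

controls = [1.1, 3.4, 2.0, 5.6]     # hypothetical X sample, m = 4
treated = [2.7, 6.1, 4.8]           # hypothetical Y sample, n = 3

w_s = rank_sum(controls, treated)
w_xy = mann_whitney(controls, treated)
n = len(treated)
```

Here the treated observations have ranks 3, 5 and 7 in the pooled sample, so the rank sum and the pair count differ by exactly $n(n+1)/2 = 6$.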
Next we shall prove that $W_{XY}$ is asymptotically normal as $m$ and $n$ tend to infinity.
The method of proof consists in replacing the variable $W_{XY}$ by a sum of independent random variables which is asymptotically equivalent to $W_{XY}$ and to which the central limit theorem can then be applied. It is natural for this purpose to try a sum of the form
\[
S = \sum_{i=1}^m a_i(X_i) + \sum_{j=1}^n b_j(Y_j), \tag{25}
\]
but how should one choose the functions $a_i$ and $b_j$? The following "projection method", introduced in a different context by Hajek (1961), produces the $a_i$ and $b_j$ most likely to succeed in the sense of minimizing $E(W_{XY} - S)^2$. This approach is due to Hoeffding (1948), and is applicable to a large class of statistics, the so-called U-statistics. Note that
\[
\theta(F, G) = \int F \, dG = P(X \le Y).
\]
An unbiased estimator of $\theta(F, G)$ is
\[
U = \frac{1}{nm} \sum_{i=1}^m \sum_{j=1}^n I(X_i \le Y_j),
\]
which equals $W_{XY} / (nm)$ when $F$ and $G$ are continuous, so that ties have probability zero. A statistic that can be written in this form is called a U-statistic. Note that the popularity of this projection method is due to Hajek (1968), who gives the following result.
Lemma 2 (Hoeffding) Let $Z_1, \ldots, Z_n$ be independent random variables and $S = S(Z_1, \ldots, Z_n)$ any statistic satisfying $E(S^2) < \infty$. Then the random variable
\[
S^* = \sum_{i=1}^n E(S \mid Z_i) - (n - 1) E(S)
\]
satisfies $E(S^*) = E(S)$ and
\[
E(S - S^*)^2 = Var(S) - Var(S^*).
\]
The random variable $S^*$ is called the projection of $S$ on $Z_1, \ldots, Z_n$. Note that it is conveniently a sum of independent random variables. In cases where $E(S - S^*)^2 \to 0$ at a suitable rate as $n \to \infty$, the asymptotic normality of $S$ may be established by applying classical theory to $S^*$.
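The two claims of the lemma can be verified exactly on a small discrete example. The sketch below assumes three independent Bernoulli variables with different, made-up success probabilities and a made-up nonlinear statistic, and enumerates all eight outcomes with exact fractions.

```python
from fractions import Fraction as Fr
from itertools import product

probs = [Fr(1, 2), Fr(1, 3), Fr(1, 4)]        # hypothetical P(Z_i = 1)

def stat(z):
    """A hypothetical statistic S(Z_1, Z_2, Z_3) that is not a sum of one-variable terms."""
    return z[0] * z[1] + z[2]

outcomes = list(product([0, 1], repeat=3))

def weight(z):
    """Probability of the outcome z under independence."""
    w = Fr(1)
    for zi, p in zip(z, probs):
        w *= p if zi else 1 - p
    return w

def expect(f):
    return sum(weight(z) * f(z) for z in outcomes)

mean_s = expect(stat)

def cond_exp(i, value):
    """E(S | Z_i = value), computed by restricting to outcomes with z_i = value."""
    num = sum(weight(z) * stat(z) for z in outcomes if z[i] == value)
    den = sum(weight(z) for z in outcomes if z[i] == value)
    return num / den

def projection(z):
    """S* = sum_i E(S | Z_i) - (n - 1) E(S)."""
    n = len(z)
    return sum(cond_exp(i, z[i]) for i in range(n)) - (n - 1) * mean_s

var_s = expect(lambda z: (stat(z) - mean_s) ** 2)
var_proj = expect(lambda z: (projection(z) - expect(projection)) ** 2)
mse = expect(lambda z: (stat(z) - projection(z)) ** 2)
```

Because the $Z_i$ here have different distributions, the summands of the projection are independent but not identically distributed.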
Proof of Hoeffding's Lemma. Without loss of generality, we can assume that $E(S) = 0$. Consider the problem of finding the sum
\[
T = \sum_{i=1}^n k_i(Z_i) \tag{26}
\]
for which $E(S - T)^2$ is as small as possible; the minimizing $T$ may be considered the "projection" of $S$ onto the linear space formed by the functions $T$. Let
\[
r_i(z_i) = E(S \mid Z_i = z_i) \tag{27}
\]
be the conditional expectation of $S$ given $Z_i = z_i$, and let
\[
S^* = \sum_{i=1}^n r_i(Z_i). \tag{28}
\]
That $S^*$ is the desired minimizing function is an immediate consequence of the following identity, which holds for all statistics $S$ with mean zero and all $T$ of the form (26) for which the required expectations exist:
\[
E(S - T)^2 = E(S - S^*)^2 + E(S^* - T)^2. \tag{29}
\]
To prove this identity, write
\[
E(S - T)^2 = E[(S - S^*) + (S^* - T)]^2.
\]
Squaring the right-hand side proves (29) if it can be shown that
\[
E[(S - S^*)(S^* - T)] = 0. \tag{30}
\]
Since the left-hand side of (30) is the sum over $i$ of the expectations of
\[
[r_i(Z_i) - k_i(Z_i)](S - S^*), \tag{31}
\]
it is enough to show that the expectation of (31) is zero for all $i$. We shall prove this by showing that the conditional expectation of (31) given $Z_i$ is zero. In the conditional expectation of this product, the first factor can be taken out of the expectation sign since it depends only on $Z_i$, so that it is finally only necessary to show that the conditional expectation of $S - S^*$ given $Z_i$ is zero. Now
\[
E[(S - S^*) \mid Z_i] = E\Big\{ S - r_i(Z_i) - \sum_{j \ne i} r_j(Z_j) \,\Big|\, Z_i \Big\}.
\]
From the definition of $r_i(Z_i)$, it is seen that the conditional expectation of $S - r_i(Z_i)$ given $Z_i$ is zero. On the other hand, since $Z_i$ and $Z_j$ are independent, the conditional expectation of $r_j(Z_j)$ given $Z_i$ is equal to the unconditional expectation of $r_j(Z_j)$, which by the definition of $r_j$ is equal to $E(S)$ and hence equal to zero. This completes the proof of (30) and therefore of (29).
A useful special case of (29) is obtained by putting $T = 0$, which gives after rearrangement
\[
E(S - S^*)^2 = E(S^2) - E(S^{*2}) = Var(S) - Var(S^*). \tag{32}
\]
Before we apply Hoeffding's lemma to the $W_{XY}$-statistic (24), we calculate the expectation and variance of $W_{XY}$. Set $\theta = (F, G)$. Then
\[
E_\theta[\phi(X, Y)] = P_\theta[X < Y]
\]
and we obtain
\[
E_\theta(W_{XY}) = mnp, \tag{33}
\]
where $p = P_\theta[X < Y]$. Similarly, we have
\[
Var_\theta(W_{XY}) = nmp(1 - p) + nm(n - 1)(q_1 - p^2) + nm(m - 1)(q_2 - p^2), \tag{34}
\]
where $q_1 = P_\theta[X_1 < \min(Y_1, Y_2)]$ and $q_2 = P_\theta[Y_1 > \max(X_1, X_2)]$.
Note that under $H_0$, if $F$ is continuous, $p = 1/2$ while $q_1 = q_2 = 1/3$, since, among three independent identically distributed variables, each one is equally likely to be the minimum or the maximum. We then have $E_\theta(W_{XY}) = mn/2$ and $Var_\theta(W_{XY}) = mn(N + 1)/12$ under $H_0$.
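For small samples the null mean and variance can be confirmed by enumeration. The sketch below relies on the fact that under $H_0$ with continuous $F$ every set of $n$ ranks out of $1, \ldots, N$ is equally likely for the treated group, and on the identity $W_{XY} = W_S - n(n+1)/2$; the sample sizes are arbitrary.

```python
from fractions import Fraction as Fr
from itertools import combinations

def null_moments(m, n):
    """Exact null mean and variance of W_XY by enumerating all rank assignments."""
    N = m + n
    values = [sum(ranks) - n * (n + 1) // 2            # W_XY for this set of treated ranks
              for ranks in combinations(range(1, N + 1), n)]
    mean = Fr(sum(values), len(values))
    var = Fr(sum(v * v for v in values), len(values)) - mean ** 2
    return mean, var

mean_43, var_43 = null_moments(4, 3)
```

With $m = 4$ and $n = 3$ the formulas give $mn/2 = 6$ and $mn(N + 1)/12 = 8$, which the enumeration over the 35 possible rank sets reproduces.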
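Put
\[
\psi(x, y) = \phi(x, y) - p. \tag{35}
\]
Note that
\[
E[\psi(X_\alpha, Y_\beta) \mid X_i = x] =
\begin{cases} E\psi(x, Y_\beta) & \text{if } \alpha = i, \\ 0 & \text{if } \alpha \ne i, \end{cases}
\]
and
\[
E[\psi(X_\alpha, Y_\beta) \mid Y_j = y] =
\begin{cases} E\psi(X_\alpha, y) & \text{if } \beta = j, \\ 0 & \text{if } \beta \ne j. \end{cases}
\]
Put $\psi_{10}(x) = E_Y \psi(x, Y)$ and $\psi_{01}(y) = E_X \psi(X, y)$. By Hoeffding's lemma, the projection of $W_{XY} - mnp$ is $n \sum_{i=1}^m \psi_{10}(X_i) + m \sum_{j=1}^n \psi_{01}(Y_j)$. Consider
\[
U = \sqrt{m} \left[ \frac{1}{m} \sum_{i=1}^m \psi_{10}(X_i) + \frac{1}{n} \sum_{j=1}^n \psi_{01}(Y_j) \right]
\]
and $S = \sqrt{m} [(mn)^{-1} W_{XY} - p]$. Note that
\[
Var(S) \to q_1 - p^2 + \frac{m}{n} (q_2 - p^2), \qquad
Var(U) = Var(\psi_{10}(X)) + \frac{m}{n} Var(\psi_{01}(Y)), \qquad
E(S - U)^2 = Var(S) - Var(U).
\]
Observe that $Var(\psi_{10}(X)) = q_1 - p^2$ and $Var(\psi_{01}(Y)) = q_2 - p^2$. (Indeed, for $j \ne k$, $E\psi(x_1, Y_j)\psi(x_1, Y_k) = [\psi_{10}(x_1)]^2$ and
\[
E_X[\psi_{10}(X)]^2 = E\psi(X, Y_j)\psi(X, Y_k) = Cov(\psi(X, Y_j), \psi(X, Y_k)).)
\]
We then conclude that $E(S - U)^2 \to 0$.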
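Theorem 2 Suppose that $F$ and $G$ are continuous and that $0 < P_\theta[X < Y] < 1$. Then
\[
\frac{S - E_\theta(S)}{\sqrt{Var_\theta(S)}} \xrightarrow{d} N(0, 1) \quad \text{as } \min(n, m) \to \infty.
\]
Remark. Reject $H_0$ when
\[
\frac{W_{XY} - \frac{1}{2} nm}{\sqrt{\frac{1}{12} nm(N + 1)}} \ge z(1 - \alpha).
\]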
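The rejection rule of the remark can be sketched as a small function. The data below are made up, ties are assumed absent, and $z(1 - \alpha)$ is taken to be the $1 - \alpha$ quantile of the standard normal distribution.

```python
from math import sqrt
from statistics import NormalDist

def wilcoxon_z_test(xs, ys, alpha=0.05):
    """Large-sample one-sided Wilcoxon/Mann-Whitney test of H0: F = G (no ties assumed)."""
    m, n = len(xs), len(ys)
    w_xy = sum(1 for x in xs for y in ys if x < y)          # Mann-Whitney count (24)
    z = (w_xy - m * n / 2) / sqrt(m * n * (m + n + 1) / 12)  # standardized under H0
    p_value = 1 - NormalDist().cdf(z)
    reject = z >= NormalDist().inv_cdf(1 - alpha)
    return z, p_value, reject

controls = [1, 2, 3, 4, 5, 6, 7, 8]            # hypothetical X sample
treated = [9, 10, 11, 12, 13, 14, 15, 16]      # hypothetical Y sample, all larger
z, p_value, reject = wilcoxon_z_test(controls, treated)
```

For samples this small the exact null distribution of $W_{XY}$ could be used instead; the normal approximation is shown only because it is the rule stated above.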
4.2 Pitman efficiency of the Wilcoxon rank-sum test relative to the two-sample t-test
We turn now to the comparison of the performance of the Wilcoxon and two-sample t tests. At first sight it would appear that a good reason for using the Wilcoxon is that it has a guaranteed probability of type I error and a good reason against using the Wilcoxon is its inefficient use of the data. We assume that the X’s and Y ’s have the same variance σ2 and means µ1 and µ2. Although the t test does not have a guaranteed probability of type I error, if n and m are moderately large, H0 is true, and F has a finite second moment, then the probability of type I error of the t test is fairly close to that specified by the normal model.
Recall that the two-sample t statistic is given by
\[
T = \sqrt{\frac{nm}{N}} \; \frac{\bar{Y} - \bar{X}}{s}, \tag{36}
\]
where
\[
s^2 = \frac{\sum_{i=1}^m (X_i - \bar{X})^2 + \sum_{j=1}^n (Y_j - \bar{Y})^2}{N - 2}. \tag{37}
\]
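For reference, the statistic can be computed as follows. The sketch assumes that $T$ is studentized by $s$, the square root of the pooled variance in (37), and uses made-up data.

```python
from math import sqrt

def two_sample_t(xs, ys):
    """Two-sample t statistic with pooled variance; xs has size m, ys has size n."""
    m, n = len(xs), len(ys)
    N = m + n
    xbar = sum(xs) / m
    ybar = sum(ys) / n
    pooled_var = (sum((x - xbar) ** 2 for x in xs)
                  + sum((y - ybar) ** 2 for y in ys)) / (N - 2)   # s^2 of (37)
    return sqrt(n * m / N) * (ybar - xbar) / sqrt(pooled_var)

t_stat = two_sample_t([1, 2, 3], [2, 4, 6])
```

In this example $\bar{X} = 2$, $\bar{Y} = 4$ and $s^2 = 10/4$, so $T = \sqrt{1.5} \cdot 2 / \sqrt{2.5}$.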