Topic 3: Tests in Parametric Models

Hypothesis Testing By Likelihood Methods

• Let H₀ denote a null hypothesis to be tested. Typically, we may represent H₀ as a specified family F₀ of distributions for the data.

• For any test procedure T, we shall denote by T_n the version based on a sample of size n.

• The function

β_n(T, F ) = P_F(T_n rejects H₀),

defined for distribution function F , is called the power function of T_n (or of T ).

– For F ∈ F₀, β_n(T, F ) represents the probability of a Type I error.

– The quantity

α_n(T, F₀) = sup_{F ∈ F₀} β_n(T, F)

is called the size of the test.

– For F ∉ F₀, the quantity 1 − β_n(T, F) represents the probability of a Type II error.

• Usually, attention is confined to consistent tests: for fixed F ∉ F₀, β_n(T, F) → 1 as n → ∞.

• Also, usually attention is confined to unbiased tests: for F ∉ F₀, β_n(T, F) ≥ α_n(T, F₀).

A general way to compare two such test procedures is through their power functions. In this regard we shall use the concept of asymptotic relative efficiency (ARE).

• For two test procedures T_A and T_B, suppose that a performance criterion is tightened in such a way that the respective sample sizes n_A and n_B for T_A and T_B to perform “equivalently” tend to ∞ but have ratio n_A/n_B tending to some limit. Then the limit represents the ARE of procedure T_B relative to procedure T_A and is denoted by e(T_B, T_A).

• The earliest approach to ARE was introduced by Pitman (1949). In this approach, two test sequences T = {T_n} and U = {U_n} are compared as the Type I and Type II error probabilities tend to positive limits α and 1 − β, respectively.


• In order that αn → α > 0 and simultaneously 1 − βn → 1 − β > 0, it is necessary to consider β_n(·) evaluated at an alternative F⁽ⁿ⁾ converging at a suitable rate to the null hypothesis F₀.

• In justification of this approach, we might argue that large sample sizes would be relevant in practice only if the alternative of interest were close to the null hypothesis and thus hard to distinguish with only a small sample.

To demonstrate the above point, we consider the following example.

Example 3.11 Let X₁, . . . , X_n be iid with X₁ ∼ N(µ, 1).

• Test H₀ : µ = 0 versus H₁ : µ = µ₀ > 0.

• Construct a test with Type I error probability α = 0.05 and Type II error probability β = 0.2005. (In this example β denotes the Type II error probability rather than the power.)

• Reject H₀ if √n X̄_n > 1.645.

• Note that

β = P(√n X̄_n ≤ 1.645 | µ = µ₀) = Φ(1.645 − √n µ₀).

• If n → ∞ and µ₀ is a fixed positive constant, β → 0.

• To ensure β = 0.2005, we need 1.645 − √n µ₀ = −0.84, that is, µ₀ = 2.485 n^{−1/2}.

• Do you notice that µ₀ then changes with n, so that it is no longer a fixed alternative?
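The point of Example 3.11 can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `type2_error` and the fixed alternative µ₀ = 0.5 are illustrative choices, not part of the notes.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def type2_error(n, mu0, crit=1.645):
    """beta = Phi(crit - sqrt(n) * mu0) for the test rejecting when sqrt(n) * xbar > crit."""
    return STD.cdf(crit - n ** 0.5 * mu0)

for n in (10, 100, 10000):
    fixed = type2_error(n, mu0=0.5)                # fixed alternative (illustrative value)
    local = type2_error(n, mu0=2.485 / n ** 0.5)   # alternative shrinking at rate n^{-1/2}
    print(n, fixed, local)
```

With the fixed alternative the Type II error probability collapses to 0 as n grows, while the shrinking alternative keeps it at the same positive value for every n.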

Test Statistics for A Simple Null Hypothesis

Although the theory of the following three tests is of most value for composite null hypotheses, it is convenient to begin with a simple null hypothesis. Consider testing H₀ : θ = θ⁰ ∈ R^s versus H₁ : θ ≠ θ⁰.

Likelihood Ratio Test

• A likelihood ratio statistic,

Λ_n = L(θ⁰; x) / sup_{θ∈Θ} L(θ; x),

was introduced by Neyman and Pearson (1928).

• Λ_n takes values in the interval [0, 1] and H₀ is to be rejected for sufficiently small values of Λ_n.

• The rationale behind LR tests is that when H₀ is true, Λ_n tends to be close to 1, whereas when H₁ is true, Λ_n tends to be close to 0.

• The test may be carried out in terms of the statistic λ_n = −2 log Λ_n.

• For finite n, the null distribution of λn will generally depend on n and on the form of pdf of X.

• LR tests are closely related to MLE’s.

• Denote the MLE by θ̂. For asymptotic analysis, expand the log-likelihood at θ⁰ in a Taylor series about θ̂. The first-order term vanishes because the score is zero at θ̂, and we get

λ_n = −2 { −Σ_{i=1}^n log f(X_i; θ̂) + Σ_{i=1}^n log f(X_i; θ⁰) }

= 2 { ½ (θ⁰ − θ̂)^T [ −Σ_{i=1}^n ∂²/∂θ_j∂θ_k log f(X_i; θ)|_{θ=θ*} ] (θ⁰ − θ̂) },

where θ* lies between θ̂ and θ⁰.

• Since θ* is consistent,

λ_n = n(θ̂ − θ⁰)^T [ −(1/n) Σ_{i=1}^n ∂²/∂θ_j∂θ_k log f(X_i; θ)|_{θ=θ⁰} ] (θ̂ − θ⁰) + o_P(1).

By the asymptotic normality of θ̂ and

−n⁻¹ Σ_{i=1}^n ∂²/∂θ_j∂θ_k log f(X_i; θ)|_{θ=θ⁰} →_P I(θ⁰),

λ_n has, under H₀, a limiting chi-squared distribution on s degrees of freedom.

Example 3.12 Consider the testing problem H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀ based on iid X₁, . . . , X_n from the uniform distribution U(0, θ).

• L(θ₀; x) = θ₀^{−n} 1{x_(n) ≤ θ₀}.

• θ̂ = x_(n) (MLE) and sup_{θ∈Θ} L(θ; x) = x_(n)^{−n}.

• We have

Λ_n = (X_(n)/θ₀)^n if X_(n) ≤ θ₀, and Λ_n = 0 if X_(n) > θ₀.

• Reject H₀ if X_(n) > θ₀ or X_(n)/θ₀ < c^{1/n}.

• What is the asymptotic distribution of λ_n?

• What is P(n log(X_(n)/θ₀) ≤ c) where c < 0? The limiting distribution is not the χ²₁ distribution that the regular theory would give for s = 1.

(Why???)
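One way to explore the question in Example 3.12 is by simulation. Under H₀, P(X_(n)/θ₀ ≤ e^{c/n}) = (e^{c/n})ⁿ = e^c for c < 0, and the sketch below compares that value with an empirical frequency. The function names, the seed, and the values n = 50, θ₀ = 2, c = −1 are illustrative assumptions.

```python
import math
import random

def max_stat(n, theta0, rng):
    """Return n * log(X_(n) / theta0) for one U(0, theta0) sample of size n."""
    # 1 - random() lies in (0, 1], so the logarithm below is always defined.
    x_max = max(theta0 * (1.0 - rng.random()) for _ in range(n))
    return n * math.log(x_max / theta0)

def empirical_prob(n, theta0, c, reps=20000, seed=0):
    """Monte Carlo estimate of P(n log(X_(n)/theta0) <= c) under H0."""
    rng = random.Random(seed)
    hits = sum(max_stat(n, theta0, rng) <= c for _ in range(reps))
    return hits / reps

c = -1.0
est = empirical_prob(n=50, theta0=2.0, c=c)
print(est, math.exp(c))  # empirical frequency next to e^c
```

The uniform family violates the usual regularity conditions because its support depends on θ, which is why the chi-squared theory with s degrees of freedom cannot be taken for granted here.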

Example 3.13 Consider the testing problem H₀ : σ² = σ₀² versus H₁ : σ² ≠ σ₀² based on iid X₁, . . . , X_n from the normal distribution N(µ₀, σ²).

• L(θ⁰; x) = (2πσ₀²)^{−n/2} exp[ −Σ_i (x_i − µ₀)² / (2σ₀²) ].

• σ̂² = n⁻¹ Σ_i (x_i − µ₀)² (MLE) and sup_{θ∈Θ} L(θ; x) = (2πσ̂²)^{−n/2} exp(−n/2).

• We have

Λ_n = (σ̂²/σ₀²)^{n/2} exp{ n/2 − Σ_i (x_i − µ₀)² / (2σ₀²) },

or, under H₀,

λ_n = −n { ln( (1/n) Σ_{i=1}^n Z_i² ) + ( 1 − (1/n) Σ_{i=1}^n Z_i² ) },

where Z₁, . . . , Z_n are iid N(0, 1).

• Fact: Using the CLT, we have

( n⁻¹ Σ_{i=1}^n Z_i² − 1 ) / √(2/n) →_d N(0, 1)

or

(n/2) ( (1/n) Σ_{i=1}^n Z_i² − 1 )² →_d χ²₁.

• Note that ln u ≈ −(1 − u) − (1 − u)²/2 when u is near 1, and n⁻¹ Σ_{i=1}^n Z_i² → 1 in probability by the LLN.

• A common question in Taylor series approximation is how many terms to keep. In this example, the question is whether to use the first-order approximation ln u ≈ −(1 − u) instead of the second-order approximation used above. With only the first-order approximation, we would face the difficulty of finding lim_n a_n b_n when lim_n a_n = ∞ and lim_n b_n = 0.

• We conclude that λn has a limiting chi-squared distribution with 1 degree of freedom.
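The χ²₁ limit in Example 3.13 can be checked by simulation. The sketch below is illustrative only: it writes λ_n = −n{ln u + 1 − u} with u = σ̂²/σ₀², draws samples under H₀, and counts how often λ_n exceeds 3.841, the upper 5% point of χ²₁. The function names, seed, and sample sizes are assumptions.

```python
import math
import random

def lr_stat(sample, mu0, sigma0_sq):
    """lambda_n = -n * (ln u + 1 - u), where u = sigma_hat^2 / sigma0^2."""
    n = len(sample)
    u = sum((x - mu0) ** 2 for x in sample) / (n * sigma0_sq)
    return -n * (math.log(u) + 1.0 - u)

def rejection_rate(n, reps=4000, crit=3.841, seed=1):
    """Fraction of H0 samples (mu0 = 0, sigma0^2 = 1) with lambda_n > crit."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if lr_stat(sample, mu0=0.0, sigma0_sq=1.0) > crit:
            rejections += 1
    return rejections / reps

rate = rejection_rate(n=200)
print(rate)  # should be near the nominal level 0.05
```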

The Wald Test

• Let θ̂_n denote a consistent, asymptotically normal, and asymptotically efficient sequence of solutions of the likelihood equations, so that

√n(θ̂_n − θ) →_d N(0, I⁻¹(θ)) as n → ∞.

• Because I(θ) is continuous in θ, we have I(θ̂_n) →_P I(θ) as n → ∞.

• Replacing the matrix

−n⁻¹ Σ_{i=1}^n ∂²/∂θ_j∂θ_k log f(X_i; θ)|_{θ=θ⁰}

by I(θ̂_n) in the large-sample approximation of λ_n, we get a second statistic,

W_n = n(θ̂_n − θ⁰)^T I(θ̂_n)(θ̂_n − θ⁰),

which was introduced by Wald (1943).

• By Slutsky’s theorem, Wn converges in distribution to χ²_s.

• For the construction of a confidence region, one generates {θ⁰ : W_n ≤ χ²_{s,α}}, which is an ellipsoid in R^s.

• As a remark, for the construction of a confidence region based on λ_n, one generates {θ⁰ : λ_n ≤ χ²_{s,α}}, which is not necessarily an ellipsoid in R^s.

The Rao Score Test

• Both the Wald and likelihood ratio tests require evaluation of θ̂_n. Now we consider a test for which this is not necessary.

• Denote the likelihood score vector

q(x; θ) = (q₁(x; θ), . . . , q_s(x; θ))^T, where q_j(x; θ) = ∂/∂θ_j log f(x; θ).

• Write Q(θ) = Σ_{i=1}^n q(X_i; θ). By the central limit theorem, n^{−1/2} Q(θ⁰) →_d N(0, I(θ⁰)).

• A third statistic,

V_n = [n^{−1/2} Q(θ⁰)]^T I⁻¹(θ⁰) [n^{−1/2} Q(θ⁰)] = n⁻¹ Q(θ⁰)^T I⁻¹(θ⁰) Q(θ⁰),

was introduced by Rao (1948).

Again, it has a limiting χ²_s distribution.

Example 3.14 Consider a sample X₁, . . . , X_n from the logistic distribution with density

f_θ(x) = e^{x−θ} / (1 + e^{x−θ})².

• q(x; θ) = −1 + 2e^{x−θ}/(1 + e^{x−θ}) and

Q(θ⁰) = −n + 2 Σ_{i=1}^n e^{x_i−θ⁰} / (1 + e^{x_i−θ⁰}).

• I(θ) = 1/3 for all θ.

• The Rao score test is therefore based on the test statistic

√(3/n) Σ_{i=1}^n (e^{x_i−θ⁰} − 1) / (1 + e^{x_i−θ⁰}),

whose square is V_n.

• In this case, the MLE does not have an explicit expression and therefore the Wald and likelihood ratio tests are less convenient.
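The score statistic of Example 3.14 needs no iterative fitting, which a short sketch makes concrete. The code below uses the identity (eᵗ − 1)/(eᵗ + 1) = tanh(t/2) to avoid overflow, and compares the statistic with the two-sided normal cutoff 1.96. The function names, the inverse-cdf sampler, the seed, and the simulation sizes are illustrative assumptions.

```python
import math
import random

def rao_score_stat(xs, theta0):
    """Signed root of V_n for the logistic location model, where I(theta) = 1/3."""
    n = len(xs)
    # (e^t - 1) / (e^t + 1) = tanh(t / 2), which is numerically stable for large |t|.
    q_total = sum(math.tanh((x - theta0) / 2.0) for x in xs)
    return math.sqrt(3.0 / n) * q_total

def logistic_sample(n, theta, rng):
    """Inverse-cdf sampling: X = theta + log(U / (1 - U)) with U uniform on (0, 1)."""
    out = []
    for _ in range(n):
        u = rng.random()
        while u == 0.0:  # guard against log(0)
            u = rng.random()
        out.append(theta + math.log(u / (1.0 - u)))
    return out

def rejection_rate(n, theta0, reps=4000, seed=2):
    """Fraction of H0 samples with |statistic| > 1.96 (two-sided 5% normal cutoff)."""
    rng = random.Random(seed)
    hits = sum(abs(rao_score_stat(logistic_sample(n, theta0, rng), theta0)) > 1.96
               for _ in range(reps))
    return hits / reps

rate = rejection_rate(n=100, theta0=1.0)
print(rate)  # should be near 0.05
```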

The three test statistics we discuss are asymptotically equivalent under H₀. However, they do differ in computation and ease of interpretation.

• All three statistics have the same limiting chi-squared distribution with s degrees of freedom under the null hypothesis. The limiting distribution can be found by the following lemma.

Lemma 1 Under regularity conditions,

(i) n^{1/2}(θ̂_n − θ⁰) →_d N(0, I⁻¹(θ⁰));

(ii) n(θ̂_n − θ⁰)^T I(θ⁰)(θ̂_n − θ⁰) →_d χ²_s;

(iii) n⁻¹ Q(θ⁰)^T I⁻¹(θ⁰) Q(θ⁰) →_d χ²_s;

(iv) λ_n − n(θ̂_n − θ⁰)^T I(θ⁰)(θ̂_n − θ⁰) →_P 0.

• Both the likelihood ratio test and the Wald test require calculating an efficient estimator ˆθ_n, while the Rao test does not and is therefore the most convenient from the computational point of view.

• The Wald test, being based on the studentized difference I^{1/2}(θ̂_n)[√n(θ̂_n − θ)], is more easily interpretable and has the advantage that it immediately yields confidence regions for θ.

• The Wald test has the drawback, not shared by the other two, that it is only asymptotically but not exactly invariant under reparametrization.

For simplicity, consider s = 1 and η = g(θ). Here we assume that g is differentiable and strictly increasing. The Wald statistic for testing η = η⁰ (= g(θ⁰)) is

[g(θ̂_n) − g(θ⁰)] √(n I*(η̂_n)) = √(n I(θ̂_n)) (θ̂_n − θ⁰) · [g(θ̂_n) − g(θ⁰)]/(θ̂_n − θ⁰) · 1/g′(θ̂_n),

where I*(η) = I(θ)/[g′(θ)]² denotes the Fisher information in the η parametrization. The product of the second and third factors tends to 1 as θ̂_n → θ⁰ but typically will differ from 1 for finite n.
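The lack of exact invariance is easy to see numerically. The sketch below uses a Bernoulli model with I(θ) = 1/(θ(1 − θ)) and the logit reparametrization g(θ) = log(θ/(1 − θ)); both the model and the numbers n = 50, θ̂ = 0.6, θ⁰ = 0.5 are illustrative assumptions, not taken from the notes.

```python
import math

def wald_theta(theta_hat, theta0, n):
    """Signed Wald statistic in the theta parametrization of a Bernoulli model."""
    info = 1.0 / (theta_hat * (1.0 - theta_hat))  # Fisher information I(theta_hat)
    return math.sqrt(n * info) * (theta_hat - theta0)

def wald_eta(theta_hat, theta0, n):
    """Signed Wald statistic for eta = logit(theta)."""
    def g(t):
        return math.log(t / (1.0 - t))
    # I(theta) / g'(theta)^2 with g'(theta) = 1 / (theta (1 - theta))
    info_eta = theta_hat * (1.0 - theta_hat)
    return (g(theta_hat) - g(theta0)) * math.sqrt(n * info_eta)

w_theta = wald_theta(0.6, 0.5, n=50)
w_eta = wald_eta(0.6, 0.5, n=50)
print(w_theta, w_eta)  # close to each other, but not equal
```

The two values agree only approximately, and the gap shrinks as θ̂_n approaches θ⁰.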


Example 3.15 Consider a sequence of n independent trials, with s possible outcomes for each trial.

• Let θj denote the probability of occurrence of the jth outcome in any given trial.

• Let Nj denote the number of occurrences of the jth outcome in the series of n trials.

• The MLEs of the θ_j's are N_j/n.

• The three test statistics λ_n, W_n and V_n for testing H₀ : θ = θ⁰ against H₁ : θ ≠ θ⁰ are easily seen to be

λ_n = 2 Σ_{j=1}^s N_j log( N_j/(nθ_j⁰) ), W_n = Σ_{j=1}^s (N_j − nθ_j⁰)²/N_j, V_n = Σ_{j=1}^s (N_j − nθ_j⁰)²/(nθ_j⁰).

• Both W_n and V_n are referred to as chi-squared goodness of fit statistics; the latter is often called the Pearson chi-squared statistic. Its large sample properties were first derived by Pearson (1900).

Pearson’s chi-square statistic is easily remembered as

χ² = Σ (Observed − Expected)² / Expected.

• Let us now consider the behavior of λ_n, W_n and V_n under “local” alternatives, that is, for a sequence {θ_n} of the form

θ_n = θ⁰ + n^{−1/2}∆, where ∆ = (∆₁, . . . , ∆_s)^T.

• Suppose that the convergences expressed in the above lemma may be established uniformly in θ for θ in a neighborhood of θ⁰.

• It then would follow that

n^{1/2}(θ̂ − θ⁰) = n^{1/2}(θ̂ − θ_n) + ∆ →_d N(∆, I⁻¹(θ⁰)),

n^{−1/2} Q(θ⁰) = I(θ⁰) n^{1/2}(θ̂ − θ⁰) + o_{P_θn}(1) →_d N(I(θ⁰)∆, I(θ⁰)),

and

λ_n − W_n → 0 in P_θn-probability.

• It then follows that the statistics λ_n, W_n and V_n each converge in distribution to χ²_s(∆^T I(θ⁰)∆).

• Therefore, under appropriate regularity conditions, the statistics λ_n, W_n and V_n are asymptotically equivalent in distribution, both under the null hypothesis and under local alternatives converging sufficiently fast.

• However, at fixed alternatives these equivalences are not anticipated to hold.

Example 3.16 (Testing a Genetic Theory)

• In experiments on pea breeding, Mendel observed the different kinds of seeds obtained by crosses from peas with round yellow seeds and peas with wrinkled green seeds.

• Possible types of progeny were: (1) round yellow; (2) wrinkled yellow; (3) round green; and (4) wrinkled green.

• Assume the seeds are produced independently. We can think of each seed as being the outcome of a multinomial trial with possible outcomes numbered 1, 2, 3, 4 as above and associated probabilities of occurrence θ1, θ₂, θ₃, θ₄.

• Mendel’s theory predicted that θ1 = 9/16, θ2 = θ₃ = 3/16, θ4 = 1/16.

• Data: n = 556, n1 = 315, n2 = 101, n3 = 108, n4 = 32.

• Pearson’s chi-square statistic is

(315 − 556 × 9/16)²/312.75 + (3.25)²/104.25 + (3.75)²/104.25 + (2.75)²/34.75 = 0.47,

which has a p value of about 0.93 when referred to a χ²₃ table.

There is insufficient evidence to reject Mendel’s hypothesis. (Why don’t we state that we accept Mendel’s hypothesis?)
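The three multinomial statistics of Example 3.15 can be evaluated directly on Mendel's counts. The sketch below uses only the standard library; the helper `chi2_sf_3df` relies on the closed-form upper tail of the χ² distribution with 3 degrees of freedom, and all function and variable names are illustrative.

```python
import math

def chi2_sf_3df(x):
    """Upper tail probability of a chi-squared variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

def multinomial_stats(counts, probs):
    """Return (lambda_n, W_n, V_n) for H0: theta = probs."""
    n = sum(counts)
    expected = [n * p for p in probs]
    lam = 2.0 * sum(c * math.log(c / e) for c, e in zip(counts, expected))
    wald = sum((c - e) ** 2 / c for c, e in zip(counts, expected))
    pearson = sum((c - e) ** 2 / e for c, e in zip(counts, expected))
    return lam, wald, pearson

counts = [315, 101, 108, 32]           # Mendel's data from Example 3.16
probs = [9 / 16, 3 / 16, 3 / 16, 1 / 16]
lam, wald, pearson = multinomial_stats(counts, probs)
p_value = chi2_sf_3df(pearson)
print(lam, wald, pearson, p_value)
```

All three statistics come out close to one another on these data, in line with their asymptotic equivalence under H₀.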

Topic 5: Tests in Nonparametric Models

Sign, permutation, and rank tests

• In a nonparametric problem, a UMP, UMPU, or UMPI test usually does not exist.

• Nonparametric tests are derived using some intuitively appealing ideas. They are commonly referred to as distribution-free tests, since almost no assumption is imposed on the population under consideration.

• Sign test:

– Let X₁, . . . , X_n be iid random variables from F, u be a fixed constant, and p = F(u).

– Consider the problem of testing H₀ : p ≤ p₀ versus H₁ : p > p₀, or testing H₀ : p = p₀ versus H₁ : p ≠ p₀, where p₀ is a fixed constant in (0, 1).

– Let ∆_i = 1{X_i − u ≤ 0}, i = 1, . . . , n. Then ∆₁, . . . , ∆_n are iid binary random variables with p = P(∆_i = 1).

– For testing H₀ : p ≤ p₀ versus H₁ : p > p₀, it follows from the Neyman-Pearson lemma and monotone likelihood ratio that the test

T*(Y) = 1 if Y > m; γ if Y = m; 0 if Y < m

is UMP among tests based on the ∆_i's, where Y = Σ_{i=1}^n ∆_i.

– For testing H₀ : p = p₀ versus H₁ : p ≠ p₀, the test

T*(Y) = 1 if Y < c₁ or Y > c₂; γ_i if Y = c_i, i = 1, 2; 0 if c₁ < Y < c₂

is UMPU among tests based on the ∆_i's.

– Since Y is equal to the number of nonnegative signs of the (u − X_i)'s, tests based on T* are called sign tests.

– One can easily extend the sign tests to the case where p = P(X₁ ∈ B).

• Let (X₁, Y₁), . . . , (X_n, Y_n) (matched pairs) be iid random variables from F. By using ∆_i = 1{X_i − Y_i − u ≤ 0}, one can obtain sign tests for hypotheses concerning P(X₁ − Y₁ ≤ u).
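For the one-sided sign test, the constants m and γ are fixed by requiring size exactly α at p = p₀, that is, P(Y > m) + γ P(Y = m) = α with Y ∼ Bin(n, p₀). The following is a minimal sketch of that calculation; the function names and the example values n = 20, p₀ = 0.5, α = 0.05 are illustrative assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def ump_sign_test(n, p0, alpha):
    """Find (m, gamma) with P(Y > m) + gamma * P(Y = m) = alpha under p = p0."""
    tail = 0.0  # P(Y > m) for the current candidate m
    for m in range(n, -1, -1):
        pm = binom_pmf(m, n, p0)
        if tail + pm > alpha:
            return m, (alpha - tail) / pm
        tail += pm
    raise ValueError("alpha must be less than 1")

def size(n, p0, m, gamma):
    """Rejection probability of the randomized test at p = p0."""
    upper = sum(binom_pmf(k, n, p0) for k in range(m + 1, n + 1))
    return upper + gamma * binom_pmf(m, n, p0)

m, gamma = ump_sign_test(n=20, p0=0.5, alpha=0.05)
print(m, gamma, size(20, 0.5, m, gamma))
```

The randomization constant γ is what lets a discrete statistic attain the level α exactly.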

• Permutation tests:

– Let X_i1, . . . , X_in_i, i = 1, 2, be two independent samples, iid from F_i, i = 1, 2, respectively. Here the F_i's are cdf's on R.

– Think of the two-sample problem in a parametric (normal) setting. Problems of this type arise from the comparison of two treatments.

– Remove the parametric assumption and assume that the F_i's are in the nonparametric family F containing all continuous cdf's on R.

– Consider the problem of testing

H₀ : F₁ = F₂ versus H₁ : F₁ ≠ F₂.

– Let X = (X_ij, j = 1, . . . , n_i, i = 1, 2), n = n₁ + n₂, and α be a given significance level. A test T(X) satisfying

(1/n!) Σ_{z∈π(x)} T(z) = α

is called a permutation test, where π(x) is the set of n! points obtained from x ∈ Rⁿ by permuting the components of x.

• For rank tests, we only consider Wilcoxon rank-sum test.
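In practice a permutation test is often carried out through a permutation p-value. The sketch below is one common variant, not the only one: it uses the difference of sample means as the statistic (an assumption, since the notes do not fix a statistic) and enumerates all splits of the pooled data into two groups, which is equivalent to enumerating all n! permutations because the statistic ignores the order within each group. Rejecting when the p-value is at most α gives a nonrandomized test with permutation level at most α, rather than exactly α as in the definition above.

```python
from itertools import combinations

def permutation_p_value(x, y):
    """One-sided exact permutation p-value for the statistic mean(y) - mean(x)."""
    pooled = list(x) + list(y)
    n_y = len(y)
    n_x = len(pooled) - n_y
    total = sum(pooled)

    def stat(y_sum):
        # Difference of group means when the second group has sum y_sum.
        return y_sum / n_y - (total - y_sum) / n_x

    observed = stat(sum(y))
    count = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_y):
        n_splits += 1
        # Small tolerance so that ties with the observed value are counted.
        if stat(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            count += 1
    return count / n_splits

p_val = permutation_p_value([1, 2, 3], [4, 5, 6])
print(p_val)  # the observed split is the most extreme of the 20 possible splits
```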

Rank Tests for Comparing Two Treatments

• For comparing a new treatment or procedure with the standard method, N subjects (patients, students, etc.) are divided at random into a group of n who will receive a new treatment and a control group of m who will be treated by the standard method.

• At the termination of the study, the subjects are ranked either directly or according to some response that measures the success of the treatment, such as a test score in an educational or psychological investigation.

• The hypothesis H₀ of no treatment effect is rejected, and the superiority of the new treatment acknowledged, if the n treated subjects rank sufficiently high. (Here it is assumed that the success of the treatment is indicated by an increased response; if instead the aim is to decrease the response, H₀ is rejected when the n treated subjects rank sufficiently low.)

• Let the ranks of the treated subjects be denoted by S1, . . . , S_n, where we shall assume that they are numbered in increasing order. Denote the sum of the treatment ranks W_S = S₁ + · · · + S_n.

• The hypothesis H0 is then rejected and the treatment judged to be effective when WS is sufficiently large, say, when WS ≥ c. Here the constant c is determined by the equation

P_H₀(W_S ≥ c) = α.

• The test defined above is known as the Wilcoxon rank-sum test.

• Let X₁, . . . , X_m and Y₁, . . . , Y_n be independent, the X's identically distributed with distribution F and the Y's identically distributed with distribution G. Here the Y's are responses to a treatment.

• Then H₀ : F = G and H_a : Y is stochastically larger than X, i.e., G(t) ≤ F(t) for all t but G ≠ F.

• Let the ranks of the X's be denoted by R₁, . . . , R_m. If we substitute R's for X's and S's for Y's in the two-sample t-test statistic, we obtain

(nm/N)^{1/2} [ (1/n) Σ_{i=1}^n S_i − (1/m) Σ_{j=1}^m R_j ] / { (N − 2)⁻¹ [ Σ_{i=1}^n (S_i − (N+1)/2)² + Σ_{j=1}^m (R_j − (N+1)/2)² ] }^{1/2}.

• This statistic is equivalent to the Wilcoxon statistic WS, the sum of the ranks of the treatment group.

– Write W_XY as the number of pairs (X_i, Y_j) with X_i < Y_j.

– It can be shown that

W_S − ½ n(n + 1) = W_XY.

– W_XY is usually known as the Mann-Whitney statistic.

– Let φ(X_i, Y_j) = 1 if X_i < Y_j, and 0 otherwise. Then

W_XY = Σ_{i=1}^m Σ_{j=1}^n φ(X_i, Y_j). (1)

– We shall prove that W_XY is asymptotically normal as m and n tend to infinity.

• The method of proof consists in replacing the variable WXY by a sum of independent random variables, which is asymptotically equivalent to WXY and to which the central limit theorem can then be applied.

• It is natural for this purpose to try a sum of the form

S = Σ_{i=1}^m a_i(X_i) + Σ_{j=1}^n b_j(Y_j) (2)

but how should one choose the functions a_i and b_j?

• The following “projection method”, introduced in a different context by Hajek (1961), produces the a_i and b_j most likely to succeed in the sense of minimizing E(W_XY − S)².

• This approach is due to Hoeffding (1948), and is applicable to a large class of statistics, the so-called U-statistics.

– Note that

θ(F, G) = ∫ F dG = P(X ≤ Y).

– An unbiased estimator of θ(F, G) is

U = (1/(nm)) Σ_{i=1}^m Σ_{j=1}^n I(X_i ≤ Y_j),

which equals W_XY/(nm) when there are no ties.

– A statistic that can be written in this form is called a U-statistic.

– Note that the popularity of this projection method is due to Hajek (1968), who gives the following result.

Lemma 2 (Hoeffding) Let Z₁, . . . , Z_n be independent random variables and S = S(Z₁, . . . , Z_n) any statistic satisfying E(S²) < ∞. Then the random variable

S* = Σ_{i=1}^n E(S|Z_i) − (n − 1)E(S)

satisfies E(S*) = E(S) and

E(S − S*)² = Var(S) − Var(S*).

Remarks:

1. The random variable S* is called the projection of S on Z₁, . . . , Z_n.

2. Note that it is conveniently a sum of independent random variables.

3. In cases where E(S − S*)² → 0 at a suitable rate as n → ∞, the asymptotic normality of S may be established by applying classical theory to S*.

Proof of Hoeffding’s Lemma.

• Without loss of generality, we can assume that E(S) = 0.

• Consider the problem of finding the sum

T = Σ_{i=1}^n k_i(Z_i) (3)

for which E(S − T)² is as small as possible; the minimizing T may be considered the “projection” of S onto the linear space formed by the functions T.

• Let

r_i(z_i) = E(S|Z_i = z_i) (4)

be the conditional expectation of S given Z_i = z_i, and let

S* = Σ_{i=1}^n r_i(Z_i). (5)

That S* is the desired minimizing function is an immediate consequence of the following identity, which holds for all statistics S with mean zero and all T of the form (3) for which the required expectations exist:

E(S − T)² = E(S − S*)² + E(S* − T)². (6)

• To prove the above identity, write

E(S − T )² = E[(S − S^∗) + (S^∗ − T )]².

– Squaring the right-hand side proves (6) if it can be shown that

E[(S − S*)(S* − T)] = 0. (7)

– Since the left-hand side of (7) is the sum of the expectations of

[r_i(Z_i) − k_i(Z_i)](S − S*) (8)

it is enough to show that the expectation of (8) is zero for all i.

– We shall prove this by showing that the conditional expectation of (8) given Zi is zero.

– In the conditional expectation of this product, the first factor can be taken out of the expectation sign since it depends only on Z_i, so that it is finally only necessary to show that the conditional expectation of S − S^∗ given Z_i is zero.

– Now

E[(S − S*)|Z_i] = E{ S − r_i(Z_i) − Σ_{j≠i} r_j(Z_j) | Z_i }.

– From the definition of r_i(Z_i), it is seen that the conditional expectation of S − r_i(Z_i) given Zi is zero.

– On the other hand, since Z_i and Z_j are independent, the conditional expectation of r_j(Z_j) given Zi is equal to the unconditional expectation of r_j(Z_j), which by the definition of rj is equal to E(S) and hence equal to zero.

– This completes the proof of (7) and therefore of (6).

• A useful special case of (6) is obtained by putting T = 0, which gives after rearrangement

E(S − S*)² = E(S²) − E(S*²) = Var(S) − Var(S*). (9)

• Before we apply Hoeffding’s lemma to the W_XY statistic (1), we will calculate the expectation and variance of W_XY.

• Set θ = (F, G). Since E_θ[φ(X, Y)] = P_θ[X < Y], we obtain

E_θ(W_XY) = mnp (10)

where p = P_θ[X < Y].

• Similarly, we have

Var_θ(W_XY) = nmp(1 − p) + nm(n − 1)(q₁ − p²) + nm(m − 1)(q₂ − p²) (11)

where q₁ = P_θ[X₁ < min(Y₁, Y₂)] and q₂ = P_θ[Y₁ > max(X₁, X₂)].


• Note that under H0, if F is continuous, p = 1/2 while q₁ = q₂ = 1/3, since, among three independent identically distributed variables, each one is equally likely to be the minimum or the maximum.

• We then have E_θ(W_XY) = mn/2 and Var_θ(W_XY) = mn(N + 1)/12 under H₀.
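The null mean and variance can be confirmed exactly for small samples. The sketch below relies on two facts: W_XY = W_S − n(n + 1)/2, and, under H₀ with continuous F, every set of n treatment ranks out of 1, . . . , N is equally likely. The function name and the choice m = 3, n = 4 are illustrative.

```python
from fractions import Fraction
from itertools import combinations

def null_moments(m, n):
    """Exact mean and variance of W_XY under H0 by enumerating all treatment rank sets."""
    N = m + n
    # W_XY = W_S - n(n+1)/2 for each equally likely set of n treatment ranks.
    values = [sum(ranks) - n * (n + 1) // 2
              for ranks in combinations(range(1, N + 1), n)]
    mean = Fraction(sum(values), len(values))
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

mean, var = null_moments(m=3, n=4)
print(mean, var)                          # exact values from enumeration
print(Fraction(3 * 4, 2), Fraction(3 * 4 * (3 + 4 + 1), 12))  # mn/2 and mn(N+1)/12
```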

• Put

ψ(x, y) = φ(x, y) − p. (12)

Note that

E[ψ(X_α, Y_β)|X_i = x] = Eψ(x, Y_β) if α = i, and 0 if α ≠ i,

and

E[ψ(X_α, Y_β)|Y_j = y] = Eψ(X_α, y) if β = j, and 0 if β ≠ j.

• Put ψ₁₀(x) = E_Y ψ(x, Y) and ψ₀₁(y) = E_X ψ(X, y).

• The projection of W_XY − mnp by Hoeffding’s lemma is n Σ_{i=1}^m ψ₁₀(X_i) + m Σ_{j=1}^n ψ₀₁(Y_j). Consider

U = √m [ (1/m) Σ_{i=1}^m ψ₁₀(X_i) + (1/n) Σ_{j=1}^n ψ₀₁(Y_j) ] and S = √m [ (mn)⁻¹ W_XY − p ].

• Note that

Var(S) → q₁ − p² + (m/n)(q₂ − p²), Var(U) = Var(ψ₁₀(X)) + (m/n) Var(ψ₀₁(Y)), E(S − U)² = Var(S) − Var(U).

Observe that Var(ψ₁₀(X)) = q₁ − p² and Var(ψ₀₁(Y)) = q₂ − p². (Indeed, for j ≠ k, Eψ(x₁, Y_j)ψ(x₁, Y_k) = [ψ₁₀(x₁)]² and

E_X[ψ₁₀(X)]² = Eψ(X, Y_j)ψ(X, Y_k) = Cov(ψ(X, Y_j), ψ(X, Y_k)).)

We then conclude that E(S − U )² → 0.

Theorem 1 Suppose that F and G are continuous and that 0 < P_θ[X < Y] < 1. Then

(S − E_θ(S)) / √Var_θ(S) →_d N(0, 1) as min(n, m) → ∞.

Remark. Reject H₀ when

(W_XY − ½nm) / √( (1/12) nm(N + 1) ) ≥ z(1 − α).
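The rejection rule in the remark translates directly into code. The sketch below computes W_XY by counting pairs, computes W_S from pooled ranks so that the identity W_S − n(n + 1)/2 = W_XY can be checked, and applies the normal approximation. It assumes there are no ties; the data values and function names are hypothetical.

```python
from statistics import NormalDist

def mann_whitney_count(xs, ys):
    """W_XY: number of pairs (x, y) with x < y."""
    return sum(1 for x in xs for y in ys if x < y)

def rank_sum(xs, ys):
    """W_S: sum of the ranks of the ys in the pooled sample (no ties assumed)."""
    pooled = sorted(list(xs) + list(ys))
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    return sum(rank[y] for y in ys)

def wilcoxon_z(xs, ys):
    """Standardized W_XY using the null mean nm/2 and null variance nm(N+1)/12."""
    m, n = len(xs), len(ys)
    w_xy = mann_whitney_count(xs, ys)
    return (w_xy - 0.5 * n * m) / ((n * m * (m + n + 1) / 12.0) ** 0.5)

def reject_h0(xs, ys, alpha=0.05):
    """One-sided test: reject when the standardized statistic is at least z(1 - alpha)."""
    return wilcoxon_z(xs, ys) >= NormalDist().inv_cdf(1.0 - alpha)

controls = [1.2, 3.4, 2.2, 5.1, 4.0]       # hypothetical X sample (m = 5)
treated = [6.3, 4.8, 7.7, 5.9, 8.1, 3.9]   # hypothetical Y sample (n = 6)
w_xy = mann_whitney_count(controls, treated)
w_s = rank_sum(controls, treated)
print(w_xy, w_s, wilcoxon_z(controls, treated), reject_h0(controls, treated))
```

For samples this small, the exact null distribution of W_S would normally be preferred over the normal approximation; the example only illustrates the formula.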

Pitman efficiency of the Wilcoxon rank-sum test to the two-sample t-test

We turn now to the comparison of the performance of the Wilcoxon and two-sample t tests. At first sight it would appear that a good reason for using the Wilcoxon test is that it has a guaranteed probability of type I error, and that a good reason against using it is its inefficient use of the data.

• We assume that the X’s and Y ’s have the same variance σ² and means µ₁ and µ₂.

• Although the t test does not have a guaranteed probability of type I error, if n and m are moderately large, H0 is true, and F has a finite second moment, then the probability of type I error of the t test is fairly close to that specified by the normal model.

• Recall that the two-sample t statistic is given by

T = √(nm/N) (Ȳ − X̄)/s₂ (13)

where

s₂² = [ Σ_{i=1}^m (X_i − X̄)² + Σ_{j=1}^n (Y_j − Ȳ)² ] / (N − 2). (14)

• We start by obtaining an approximation to the critical value and power of the t test. Note that s₂² →_P σ² as min(n, m) → ∞. It follows from Slutsky’s theorem and the central limit theorem that when µ₁ = µ₂, T converges in law to a N(0, 1) random variable as min(n, m) → ∞.

• Then the t test that rejects H₀ when T ≥ t_{N−2}(1 − α) has approximately level α regardless of the shape of F and G, and z(1 − α) is an approximate critical value, as we claimed above.

• If µ₁ ≠ µ₂, let δ = (µ₂ − µ₁)/σ. Then, arguing as above, if √(nm/N) δ stays bounded, T − √(nm/N) δ has approximately a N(0, 1) distribution for all F and G with σ² < ∞. We can then approximate the probability P_θ(T ≥ t_{N−2}(1 − α)) by

β_T = P_θ[T ≥ z(1 − α)] = 1 − Φ(z(1 − α) − √(nm/N) δ) = Φ(z(α) + √(nm/N) δ).

• For the Wilcoxon test,

β_W = P_θ[ W_XY ≥ ½nm + z(1 − α) √( (1/12) nm(N + 1) ) ]

= P_θ[ (W_XY − E_θ(W_XY)) / √Var_θ(W_XY) ≥ ( nm(½ − p) + z(1 − α) √( (1/12) nm(N + 1) ) ) / √Var_θ(W_XY) ]

≈ 1 − Φ( ( nm(½ − p) + z(1 − α) √( (1/12) nm(N + 1) ) ) / √Var_θ(W_XY) ).

• Consider the case that X ∼ N(µ₁, σ²), Y ∼ N(µ₂, σ²), n = m and α = 0.05, with δ = (µ₂ − µ₁)/σ = 0.5.

• Suppose we want to have power β = 0.9.

For the t-test, solve

−1.645 + √( (N/2)²/N ) · 0.5 = 1.282

and get N = 16 · (2.927)² ≈ 137.

For the Wilcoxon test:

p = P_θ(X < Y) = Φ( (µ₂ − µ₁)/(√2 σ) ), q₁ = P(Z₁ < ∆/√2, Z₂ < ∆/√2), q₂ = P(Z₁ < ∆/√2, Z₃ < ∆/√2),

where ∆ = (µ₂ − µ₁)/σ, Z₁ = [X₁ − Y₁ − (µ₁ − µ₂)]/(√2 σ), Z₂ = [X₁ − Y₂ − (µ₁ − µ₂)]/(√2 σ), Z₃ = [X₂ − Y₁ − (µ₁ − µ₂)]/(√2 σ).

• Note that (Z₁, Z₂) ∼ N(0, 0, 1, 1, 1/2) and (Z₁, Z₃) ∼ N(0, 0, 1, 1, 1/2). When ∆ = 0.5, p = 0.638 and q₁ = q₂ = 0.483, so β_W ≈ Φ(−1.729 + 0.355 √(N/2)) = 0.9. Hence, N ≈ 144.
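The sample-size comparison can be reproduced numerically. The sketch below takes σ = 1 without loss of generality, obtains p and q₁ by midpoint-rule integration of φ(x) P(Y > x)^k (q₂ = q₁ by symmetry in the equal-variance normal shift model), and uses the exact variance formula (11) rather than its large-sample simplification, so the Wilcoxon answer may differ slightly from the rounded value in the text. Function names, the integration grid, and the search loop are illustrative assumptions.

```python
import math
from statistics import NormalDist

STD = NormalDist()

def t_test_total_n(delta, alpha=0.05, power=0.9):
    """Total N (with n = m = N/2) solving z(alpha) + sqrt(N/4) * delta = z(power)."""
    z_sum = STD.inv_cdf(1.0 - alpha) + STD.inv_cdf(power)
    return 4.0 * (z_sum / delta) ** 2

def p_and_q(delta, steps=4000, lim=8.0):
    """p = P(X < Y) and q1 = P(X1 < min(Y1, Y2)) for X ~ N(0,1), Y ~ N(delta,1)."""
    h = 2.0 * lim / steps
    p = q = 0.0
    for k in range(steps):
        x = -lim + (k + 0.5) * h          # midpoint of the k-th cell
        weight = STD.pdf(x) * h
        surv = 1.0 - STD.cdf(x - delta)   # P(Y > x)
        p += weight * surv
        q += weight * surv * surv
    return p, q

def wilcoxon_power(k, p, q, alpha=0.05):
    """Normal approximation to the power with n = m = k, using variance formula (11)."""
    var = k * k * p * (1.0 - p) + 2.0 * k * k * (k - 1) * (q - p * p)
    crit = STD.inv_cdf(1.0 - alpha) * math.sqrt(k * k * (2 * k + 1) / 12.0)
    return 1.0 - STD.cdf((k * k * (0.5 - p) + crit) / math.sqrt(var))

n_t = t_test_total_n(delta=0.5)
p_val, q_val = p_and_q(0.5)
k = 2
while wilcoxon_power(k, p_val, q_val) < 0.9:
    k += 1
print(round(n_t, 1), round(p_val, 3), round(q_val, 3), 2 * k)
```

The ratio of the two total sample sizes gives a finite-sample impression of the Pitman efficiency of the Wilcoxon test relative to the t-test under normality.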