### Financial Time Series I

### Topic 4: Discrete Data: Contingency Tables

### Hung Chen

### Department of Mathematics National Taiwan University

### 10/30/2002

OUTLINE

1. Probability Model
   – Mean and Variance
   – Limiting Distribution
   – Mining of Association Rules in Basket Analysis
2. Goodness of Fit Test
   – Embedding and Nested Models
   – Test of Independence
3. Logistic Regression for Binary Response
   – Logit or Probit
   – Likelihood Equation
   – Likelihood Ratio Test
4. Tests for a Simple Null Hypothesis
   – Likelihood Ratio Test
   – Wald Test
   – Rao's Score Test

Contingency Tables

We start with a probability model to describe data summarized in a contingency table.

• Consider a sequence of n independent trials, with k possible outcomes for each trial.

For a 2 × 2 table, k = 4 and n is the total number of observations.

• Let p_{j} denote the probability of occurrence of the jth outcome in any given trial (Σ^{k}_{j=1} p_{j} = 1).

• Let n_{j} denote the number of occurrences of the jth outcome in the series of n trials (Σ^{k}_{j=1} n_{j} = n). (n_{1}, . . . , n_{k}) is called the "cell frequency vector" associated with the n trials.

• The exact distribution of (n_{1}, . . . , n_{k}) is the
multinomial distribution M N (n, p) where
p = (p_{1}, . . . , p_{k}).

• E(n_{i}) = np_{i}, Var(n_{i}) = np_{i}(1 − p_{i}), and Cov(n_{i}, n_{j}) = −np_{i}p_{j} for i ≠ j, so that E(n_{1}, . . . , n_{k}) = np and Cov((n_{1}, . . . , n_{k})) = n(D_{p} − p^{t}p), where D_{p} = diag(p).

• Let p̂ = n^{−1}(n_{1}, . . . , n_{k}) be the vector of sample proportions, and set U_{n} = √n(p̂ − p). Then E(U_{n}) = 0 and Cov(U_{n}) = D_{p} − p^{t}p.
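The moment formulas above can be checked by simulation; the following is a minimal sketch (standard library only; the sample size, probabilities, and replication count are made up for illustration):

```python
import random

def multinomial_sample(n, p, rng):
    # draw one cell-frequency vector (n_1, ..., n_k) from MN(n, p)
    counts = [0] * len(p)
    for j in rng.choices(range(len(p)), weights=p, k=n):
        counts[j] += 1
    return counts

rng = random.Random(0)
n, p = 100, [0.1, 0.2, 0.3, 0.4]
reps = [multinomial_sample(n, p, rng) for _ in range(20000)]

# E(n_i) = n p_i, and Cov(n_1, n_2) = -n p_1 p_2 = -2 here
means = [sum(r[i] for r in reps) / len(reps) for i in range(len(p))]
cov12 = sum(r[0] * r[1] for r in reps) / len(reps) - means[0] * means[1]
print(means, cov12)
```

The negative covariance reflects the constraint Σ n_{j} = n: a surplus in one cell forces a deficit elsewhere.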

We now use the "Cramér-Wold device" to prove asymptotic multivariate normality of the cell frequency vector.

Theorem. The random vector U_{n} converges
in distribution to k-variate normal with mean
0 and covariance D_{p} − p^{t}p.

• Write U_{n} = (u_{1}, . . . , u_{k}), where u_{j} = n_{j}/√n − √n p_{j}. For fixed (λ_{1}, . . . , λ_{k}), compute the characteristic function E exp(it Σ^{k}_{j=1} λ_{j}u_{j}).

• Observe that

E exp(it Σ^{k}_{j=1} λ_{j}u_{j}) = E exp(Σ^{k}_{j=1} itλ_{j}(n_{j}/√n − √n p_{j}))

= exp(−it√n Σ^{k}_{j=1} λ_{j}p_{j}) · E exp((it/√n) Σ^{k}_{j=1} λ_{j}n_{j})

= exp(−it√n Σ^{k}_{j=1} λ_{j}p_{j}) · (Σ^{k}_{j=1} p_{j} exp(itλ_{j}/√n))^{n}

= (Σ^{k}_{j=1} p_{j} exp((it/√n)(λ_{j} − Σ^{k}_{i=1} λ_{i}p_{i})))^{n}

= (Σ^{k}_{j=1} p_{j}[1 + (it/√n)(λ_{j} − Σ^{k}_{i=1} λ_{i}p_{i}) − (t^{2}/2n)(λ_{j} − Σ^{k}_{i=1} λ_{i}p_{i})^{2} + o(n^{−1})])^{n}

= (1 − (t^{2}/2n) Σ^{k}_{j=1} p_{j}(λ_{j} − Σ^{k}_{i=1} λ_{i}p_{i})^{2} + o(n^{−1}))^{n}

→ exp(−(t^{2}/2)(λ_{1}, . . . , λ_{k})(D_{p} − p^{t}p)(λ_{1}, . . . , λ_{k})^{t}).

• For each fixed (λ_{1}, . . . , λ_{k}), the limit is the characteristic function of the normal distribution with mean 0 and variance (λ_{1}, . . . , λ_{k})(D_{p} − p^{t}p)(λ_{1}, . . . , λ_{k})^{t}; since this holds for every λ, the Cramér-Wold device yields the k-variate normal limit with mean vector 0 and covariance matrix D_{p} − p^{t}p.

Assumptions:

• Every individual in the population under study can be classified as falling into one and only one of k categories; we say that the categories are mutually exclusive and exhaustive.

• A randomly selected member of the population will fall into one of the k categories according to the vector of cell probabilities

p = (p_{1}, p_{2}, . . . , p_{k}),

where Σ^{k}_{i=1} p_{i} = 1.

• Here the cells are strung out into a line for purposes of indexing only; their arrangement and ordering do not reflect anything about the characteristics of individuals falling into a particular cell.

• The p_{i} reflect the relative frequency of each
category in the population.

• Mining of association rules in basket analysis:

– A basket bought at the food store consists of items such as the following: Apples, Bread, Coke, Milk, Tissues.

– Data on all baskets are available (through cash registers).

– Goal: discover association rules of the form

Bread & Milk => Coke & Tissue.

– This analysis is also called linkage analysis or item analysis.

– Properties of association rules:

∗ The support of the rule is the proportion of baskets with Bread & Milk & Coke & Tissue.

∗ The confidence of the rule is Sup(Bread & Milk & Coke & Tissue)/Sup(Bread & Milk), which is simply the estimated conditional probability in statistical terms.

∗ The lift of the rule is Sup(B&M&C&T)/[Sup(B&M) Sup(C&T)]. How do you connect it with P(A ∩ B)/[P(A)P(B)]?

– Search for rules with high confidence and support:

∗ Will the results be affected by randomness?

∗ Add the requirement that the rule is statistically significant in the test against independence (i.e., against lift = 1).

∗ The number of such tests to be performed in a moderate problem reaches tens of thousands.

– You can put all of them in a huge contingency table.
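The three measures can be computed directly from basket data; a small sketch on made-up baskets (all items and counts hypothetical):

```python
# each basket is the set of items bought together; data are made up
baskets = [
    {"Bread", "Milk", "Coke", "Tissue"},
    {"Bread", "Milk", "Coke"},
    {"Bread", "Milk"},
    {"Coke", "Tissue"},
    {"Milk", "Apples"},
]

def support(itemset):
    # proportion of baskets containing every item in itemset
    return sum(itemset <= b for b in baskets) / len(baskets)

lhs, rhs = {"Bread", "Milk"}, {"Coke", "Tissue"}
sup = support(lhs | rhs)                    # support of the rule
conf = sup / support(lhs)                   # estimated P(rhs | lhs)
lift = sup / (support(lhs) * support(rhs))  # analogue of P(A ∩ B)/[P(A)P(B)]
print(sup, conf, lift)  # 0.2, 0.333..., 0.833...
```

Here lift < 1, so Bread & Milk baskets contain Coke & Tissue slightly less often than independence would predict.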

2 × 2 Tables

• As an example, we might be interested in whether hair color is related to eye color.

Conduct a study by collecting a random sample and counting the number of people who fall in each cell of the cross-classification determined by hair color and eye color.

• When the cells are defined in terms of the categories of two or more variables, a structure relating to the nature of the data is imposed. The natural structure for two variables is often a rectangular array with columns corresponding to the categories of one variable and rows to categories of the second variable; three variables create layers of two-way tables, and so on.

• The simplest contingency table is based on four cells, and the categories depend on two variables. The four cells are arranged in a 2 × 2 table whose two rows correspond to the categorical variable A and whose two columns correspond to the second categorical variable B. Double subscripts refer to the position of the cells in our arrangement.

• The first subscript gives the category number of variable A, the second that of variable B, and the two-dimensional array is displayed as a grid with two rows and two columns.

• The probability p_{ij} is the probability of an
individual being in category i of variable
A and category j of variable B. Usually,
we have some theory in mind which can be
checked in terms of hypothesis testing such
as

H_{0} : p = π (π a fixed value).

• Then the problem can be phrased as n observations from the k-cell multinomial distribution with cell probabilities p_{1}, . . . , p_{k}. Then we encounter the problem of proving asymptotic multivariate normality of cell frequency vectors.

• To test H_{0}, we can proceed with the Pearson chi-square test, which rejects H_{0} if X^{2} is too large, where

X^{2} = Σ^{k}_{i=1} (n_{i} − nπ_{i})^{2}/(nπ_{i}).

This test statistic was first derived by Pearson (1900). Then we need to answer two questions. The first is what magnitude of X^{2} counts as "too large." The second is whether the Pearson chi-square test is a reasonable testing procedure. These questions will be tackled by deriving the asymptotic distribution of the Pearson chi-square statistic under H_{0} and under a local alternative to H_{0}.
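The statistic itself is a one-line computation; a sketch with hypothetical counts, where 7.815 is the 0.95 quantile of the chi-square distribution with k − 1 = 3 degrees of freedom (the calibration justified below):

```python
def pearson_chi2(counts, pi0):
    # X^2 = sum over cells of (n_i - n pi_i)^2 / (n pi_i)
    n = sum(counts)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(counts, pi0))

# hypothetical data: n = 100 observations over k = 4 equally likely cells
counts = [30, 20, 26, 24]
x2 = pearson_chi2(counts, [0.25] * 4)
print(x2, x2 > 7.815)  # 2.08, False: no evidence against H_0
```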

• Using matrix notation, X^{2} can be written as

X^{2} = U_{n}D_{π}^{−1}U_{n}^{t},

where U_{n} = √n(p̂ − π), p̂ = n^{−1}(n_{1}, . . . , n_{k}), and D_{π} = diag(π).

• Let g(x) = xD_{π}^{−1}x^{t} for x = (x_{1}, . . . , x_{k}). Evidently, g is a continuous function of x. It can be shown that U_{n} →^{d} U, where U has the multivariate normal distribution N(0, D_{π} − π^{t}π). Then we have

U_{n}D_{π}^{−1}U_{n}^{t} →^{d} UD_{π}^{−1}U^{t}.

Thus the asymptotic distribution of X^{2} under H_{0} is the distribution of UD_{π}^{−1}U^{t}, where U has the N(0, D_{π} − π^{t}π) distribution. This reduces the problem to finding the distribution of a quadratic form of a multivariate normal random vector. The above process is the so-called δ method.

• Now we state without proof the following general result on the distribution of a quadratic form of a multivariate normal random variable. It can be found in Chapter 3b of Rao (1973) and Chapter 3.5 of Serfling (1980).

Theorem. If X = (X_{1}, . . . , X_{d}) has the multivariate normal distribution N(0, Σ) and Y = XAX^{t} for some symmetric matrix A, then L[Y] = L[Σ^{d}_{i=1} λ_{i}Z_{i}^{2}], where Z_{1}^{2}, . . . , Z_{d}^{2} are independent chi-square variables with one degree of freedom each and λ_{1}, . . . , λ_{d} are the eigenvalues of A^{1/2}Σ(A^{1/2})^{t}.

• Applying the above theorem to the present problem, we see that L[UD_{π}^{−1}U^{t}] = L[Σ^{k}_{i=1} λ_{i}Z_{i}^{2}], where the λ_{i} are the eigenvalues of

B = D_{π}^{−1/2}(D_{π} − π^{t}π)D_{π}^{−1/2} = I − √π^{t}√π,

where √π = (√π_{1}, . . . , √π_{k}).

• Now it remains to find the eigenvalues of B. Since B^{2} = B and B is symmetric, the eigenvalues of B are all either 1 or 0. Moreover,

Σ^{k}_{i=1} λ_{i} = tr(B) = k − 1.

• Therefore, we establish the result that under the simple hypothesis H_{0}, Pearson's chi-square statistic X^{2} has an asymptotic chi-square distribution with k − 1 degrees of freedom.
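The χ^{2}_{k−1} limit can be checked by simulation: since a chi-square variable with k − 1 degrees of freedom has mean k − 1, the simulated X^{2} values under H_{0} should average about k − 1. A sketch with hypothetical cell probabilities (standard library only):

```python
import random

def pearson_chi2(counts, pi0):
    n = sum(counts)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(counts, pi0))

rng = random.Random(1)
p = [0.2, 0.3, 0.5]        # hypothetical null cell probabilities, k = 3
k, n = len(p), 200
stats = []
for _ in range(4000):
    counts = [0] * k
    for j in rng.choices(range(k), weights=p, k=n):
        counts[j] += 1
    stats.append(pearson_chi2(counts, p))

# chi-square with k - 1 = 2 degrees of freedom has mean 2
print(sum(stats) / len(stats))
```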

Remarks:

• We already examined the limiting distribution of the Pearson chi-square statistic under H_{0} by employing the δ method.

• In essence, the δ method requires two ingredients: first, a random variable (which we denote here by θ̂_{n}) whose distribution depends on a real-valued parameter θ in such a way that

L[√n(θ̂_{n} − θ)] → N(0, σ^{2}(θ)); (1)

and second, a function f(x) that can be differentiated at x = θ, so that it possesses the following expansion about θ:

f(x) = f(θ) + (x − θ)f′(θ) + o(|x − θ|) as x → θ. (2)

• The δ method for finding approximate means and variances (asymptotic mean and asymptotic variance) of a function of a random variable is justified by the following theorem.

Theorem (The one-dimensional δ method). If θ̂_{n} is a real-valued random variable and θ is a real-valued parameter such that (1) holds, and if f is a function satisfying (2), then the asymptotic distribution of f(θ̂_{n}) is given by

L[√n(f(θ̂_{n}) − f(θ))] → N(0, σ^{2}(θ)[f′(θ)]^{2}). (3)
Proof. Set Ω_{n} = R, Ω = Ω_{1} × Ω_{2} × · · · × Ω_{n} × · · · = ×^{∞}_{n=1}Ω_{n}, and let P_{n} be the probability distribution of θ̂_{n} on R. Note that Ω is the set of all sequences {t_{n}} such that t_{n} ∈ Ω_{n}. We define two subsets of Ω:

S = {{t_{n}} ∈ Ω : t_{n} − θ = O(n^{−1/2})},

T = {{t_{n}} ∈ Ω : f(t_{n}) − f(θ) − (t_{n} − θ)f′(θ) = o(n^{−1/2})}.

Since f satisfies (2), S ⊂ T. By (1), we have

n^{1/2}(θ̂_{n} − θ) = O_{P}(1) and hence θ̂_{n} − θ = O_{P}(n^{−1/2}). (4)

Note that S occurs in probability, and hence T also occurs in probability since S ⊂ T. Finally,

f(θ̂_{n}) − f(θ) − (θ̂_{n} − θ)f′(θ) = o_{P}(n^{−1/2}) (5)

or

√n(f(θ̂_{n}) − f(θ)) = √n(θ̂_{n} − θ)f′(θ) + o_{P}(1). (6)

Now let V_{n} = √n(f(θ̂_{n}) − f(θ)), U_{n} = √n(θ̂_{n} − θ), and g(x) = xf′(θ) for all real numbers x. Then (6) may be rewritten as

V_{n} = g(U_{n}) + o_{P}(1),

and (3) follows from (1) and Slutsky's theorem.

Goodness-of-Fit to Composite Multinomial Models

Consider a sample from a population in genetic equilibrium with respect to a single gene with two alleles. If we assume the three different genotypes are identifiable, we are led to suppose that there are three types of individuals whose frequencies are given by the so-called Hardy-Weinberg proportions

p_{1} = θ^{2}, p_{2} = 2θ(1 − θ), p_{3} = (1 − θ)^{2},
where 0 < θ < 1.

• In the Hardy-Weinberg model, the probability model describing the data is multinomial with parameter falling in

Θ = {θ : θ_{i} ≥ 0, 1 ≤ i ≤ 3, Σ^{3}_{i=1} θ_{i} = 1}.

• The theory we want to test can be described by a multinomial with parameter falling in

Θ_{0} = {(η^{2}, 2η(1 − η), (1 − η)^{2}) : 0 ≤ η ≤ 1},

which is a one-dimensional curve in the two-dimensional parameter space Θ.

• To test the adequacy of the Hardy-Weinberg model means testing H_{0} : θ ∈ Θ_{0} versus H_{1} : θ ∈ Θ_{1}, where Θ_{1} = Θ − Θ_{0}.

In general, we can describe Θ_{0} parametrically as

Θ_{0} = {(θ_{1}(η), . . . , θ_{k}(η)) : η ∈ Ξ},

where η = (η_{1}, . . . , η_{q})^{T}, Ξ is a subset of q-dimensional space, and the map η → (θ_{1}(η), . . . , θ_{k}(η))^{T} takes Ξ into Θ_{0}. To avoid trivialities we assume q < k − 1.

Now we consider the likelihood ratio test for
H_{0} versus H_{1}.

• Let p(n_{1}, . . . , n_{k}, θ) denote the frequency function.

– Maximize p(n_{1}, . . . , n_{k}, θ) over θ ∈ Θ_{0}.

– Denote the maximizer by η̂ = (η̂_{1}, . . . , η̂_{q}).

– The logarithm of sup_{θ∈Θ_{0}} L(θ; x) is Σ^{k}_{i=1} n_{i} log θ_{i}(η̂) up to a constant.

• The logarithm of sup_{θ∈Θ} L(θ; x) is Σ^{k}_{i=1} n_{i} log(n_{i}/n) up to a constant.

• Suppose that we can define θ′_{j} = g_{j}(θ), j = 1, . . . , r, where the g_{j} are chosen so that H_{0} becomes equivalent to: (θ′_{1}, . . . , θ′_{q})^{T} ranges over an open subset of R^{q} and θ′_{j} = θ_{0j}, j = q + 1, . . . , r, for specified θ_{0j}.

For example, to test the Hardy-Weinberg model we set θ′_{1} = θ_{1}, θ′_{2} = θ_{2} − 2√θ_{1}(1 − √θ_{1}) and test H_{0} : θ′_{2} = 0.

• Applying the standard result on likelihood ratio tests, under H_{0}, λ_{n} approximately has a χ^{2}_{r−q} distribution for large n.

Example 1. Consider the Hardy-Weinberg model.

• η̂ = (2n_{1} + n_{2})/2n.

• Reject H_{0} if λ_{n} ≥ χ^{2}_{1}(1 − α), with

θ(η̂) = (((2n_{1} + n_{2})/2n)^{2}, (2n_{1} + n_{2})(2n_{3} + n_{2})/2n^{2}, ((2n_{3} + n_{2})/2n)^{2})^{T}.
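The whole test fits in a few lines; a sketch on made-up genotype counts, where 3.841 is the χ^{2}_{1}(0.95) critical value:

```python
import math

# hypothetical genotype counts (AA, Aa, aa)
n1, n2, n3 = 42, 50, 8
n = n1 + n2 + n3

eta = (2 * n1 + n2) / (2 * n)   # MLE of eta under the Hardy-Weinberg model
theta = [eta ** 2, 2 * eta * (1 - eta), (1 - eta) ** 2]

# lambda_n = 2 sum n_i log(n_i / (n theta_i(eta_hat)))
lam = 2 * sum(c * math.log(c / (n * t)) for c, t in zip([n1, n2, n3], theta))
print(eta, lam, lam >= 3.841)
```

For these counts the statistic stays below the critical value, so Hardy-Weinberg equilibrium is not rejected.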

The Wald statistic and the Rao score statistic are also approximately χ^{2}_{r−q} distributed for large n under H_{0}.

• Wald statistic:

W_{n} = Σ^{k}_{j=1} [N_{j} − nθ_{j}(η̂)]^{2}/N_{j}.

• Rao score statistic:

S_{n} = Σ^{k}_{j=1} [N_{j} − nθ_{j}(η̂)]^{2}/(nθ_{j}(η̂)).

• S_{n} is exactly Pearson's χ^{2} statistic

SUM (Observed − Expected)^{2}/Expected

with Expected = nθ_{j}(η̂); W_{n} replaces the denominator by the observed counts and is asymptotically equivalent.

Example 2 (The Fisher Linkage Model).

• A self-crossing of maize heterozygous on two characteristics (starchy versus sugary; green base leaf versus white base leaf) leads to four possible offspring types: (1) sugary-white; (2) sugary-green; (3) starchy-white; (4) starchy-green.

• (N_{1}, . . . , N_{4}) has an MN(n, θ_{1}, . . . , θ_{4}) distribution.

• Fisher (1958) specifies that

θ_{1} = (2 + η)/4, θ_{2} = θ_{3} = (1 − η)/4, θ_{4} = η/4,

where η is an unknown number between 0 and 1.

• To test the validity of the linkage model we would take

Θ_{0} = {((2 + η)/4, (1 − η)/4, (1 − η)/4, η/4) : 0 ≤ η ≤ 1},

a "one-dimensional curve" in the three-dimensional parameter space Θ.

• The likelihood equation under H_{0} becomes

n_{1}/(2 + η) − (n_{2} + n_{3})/(1 − η) + n_{4}/η = 0.

• We obtain critical values from the χ^{2}_{1} table.
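The likelihood equation has a unique root in (0, 1): its left-hand side is strictly decreasing in η and runs from +∞ near 0 to −∞ near 1, so bisection finds η̂. A sketch with made-up offspring counts:

```python
def score(eta, n1, n2, n3, n4):
    # left-hand side of the likelihood equation under H_0
    return n1 / (2 + eta) - (n2 + n3) / (1 - eta) + n4 / eta

def solve_eta(n1, n2, n3, n4):
    lo, hi = 1e-9, 1 - 1e-9   # score > 0 near 0 and < 0 near 1
    for _ in range(200):
        mid = (lo + hi) / 2
        if score(mid, n1, n2, n3, n4) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# hypothetical counts for offspring types (1)-(4)
eta_hat = solve_eta(58, 20, 22, 4)
print(eta_hat)
```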

Testing Independence of Classification in Contingency Tables

• Many important characteristics have only two categories.

• An individual either is or is not inoculated against a disease; is or is not a smoker; is male or female; and so on.

• We often want to know whether such characteristics are linked or are independent.

For example, do smoking and lung cancer have any relation to each other?

• Let us call the possible categories or states of the first characteristic A and ¯A and of the second B and ¯B.

– A randomly selected individual from the population can be one of four types: AB, A ¯B, ¯AB, ¯A ¯B.

– Denote the probabilities of these types by
θ_{11}, θ_{12}, θ_{21}, θ_{22}, respectively.

• Independent classification means that the events (being an A) and (being a B) are independent, or, in terms of the θ_{ij},

θ_{ij} = (θ_{i1} + θ_{i2})(θ_{1j} + θ_{2j}).

• The data are assembled in what is called a 2 × 2 contingency table.

• Testing independence can be put as H_{0} : θ ∈ Θ_{0} versus H_{1} : θ ∉ Θ_{0}, where Θ_{0} is a two-dimensional subset of Θ given by

Θ_{0} = {(η_{1}η_{2}, η_{1}(1 − η_{2}), η_{2}(1 − η_{1}), (1 − η_{1})(1 − η_{2})) : 0 ≤ η_{1}, η_{2} ≤ 1}.

The χ^{2} test has 1 degree of freedom.

• For θ ∈ Θ_{0}, ˆη_{1} = (n_{11} + n_{12})/n and ˆη_{2} =
(n_{11} + n_{21})/n.

• Pearson's statistic is

n Σ^{2}_{i=1} Σ^{2}_{j=1} [N_{ij} − (N_{i1} + N_{i2})(N_{1j} + N_{2j})/n]^{2} / [(N_{i1} + N_{i2})(N_{1j} + N_{2j})].

• Pearson's statistic can be rewritten as Z^{2}, where

Z = (N_{11}/(N_{11} + N_{21}) − N_{12}/(N_{12} + N_{22})) √[(N_{11} + N_{21})(N_{12} + N_{22})n / ((N_{11} + N_{12})(N_{21} + N_{22}))].

Thus,

Z = √n [P̂(A|B) − P̂(A| ¯B)] [P̂(B)P̂( ¯B) / (P̂(A)P̂( ¯A))]^{1/2},

where P̂ is the empirical distribution.

• The sign of Z indicates the direction of the deviation from independence.

a × b Contingency Tables

• Consider contingency tables for two nonnumerical characteristics having a and b states, respectively, a, b ≥ 2.

• If we take a sample of size n from a population and classify its members according to each characteristic, we obtain a vector N_{ij}, i = 1, . . . , a, j = 1, . . . , b, where N_{ij} is the number of individuals of type i for characteristic 1 and type j for characteristic 2.

• {N_{ij} : 1 ≤ i ≤ a, 1 ≤ j ≤ b} is multinomially distributed with cell probabilities {θ_{ij} : 1 ≤ i ≤ a, 1 ≤ j ≤ b}, where

θ_{ij} = P(a randomly selected individual is of type i for characteristic 1 and type j for characteristic 2).

• The hypothesis that the characteristics are assigned independently becomes H_{0} : θ_{ij} = η_{i1}η_{j2} for 1 ≤ i ≤ a, 1 ≤ j ≤ b, where the η_{i1}, η_{j2} are nonnegative and Σ^{a}_{i=1} η_{i1} = Σ^{b}_{j=1} η_{j2} = 1.

• The N_{ij} can be arranged in an a × b contingency table. Write C_{j} = Σ^{a}_{i=1} N_{ij} and R_{i} = Σ^{b}_{j=1} N_{ij}.

• Pearson's χ^{2} for the hypothesis of independence is

n Σ^{a}_{i=1} Σ^{b}_{j=1} (N_{ij} − R_{i}C_{j}/n)^{2}/(R_{i}C_{j}),

which has approximately a χ^{2}_{(a−1)(b−1)} distribution under H_{0}.
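A direct implementation for an arbitrary a × b table, illustrated on a made-up 2 × 3 table (df = (2 − 1)(3 − 1) = 2, with 0.95 critical value 5.991):

```python
def independence_chi2(table):
    # Pearson chi-square for independence:
    # sum of (N_ij - R_i C_j / n)^2 / (R_i C_j / n) over all cells
    a, b = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    R = [sum(row) for row in table]                          # row totals
    C = [sum(table[i][j] for i in range(a)) for j in range(b)]  # column totals
    return sum(
        (table[i][j] - R[i] * C[j] / n) ** 2 / (R[i] * C[j] / n)
        for i in range(a) for j in range(b)
    )

x2 = independence_chi2([[20, 30, 50],
                        [30, 20, 50]])
print(x2, x2 > 5.991)  # 4.0, False
```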

Logistic Regression for Binary Response

• Consider Bernoulli responses Y that can only take on the values 0 and 1. Examples are

– medical trials where at the end of the trial the patient has either recovered (Y = 1) or has not recovered (Y = 0),

– election polls where a voter either supports a proposition (Y = 1) or does not (Y = 0),

– market research where a potential customer either desires a new product (Y = 1) or does not (Y = 0),

– multiple-choice tests where an examinee either gets the correct answer (Y = 1) or does not (Y = 0).

• Assume the distribution of Y depends on
the known covariate vector z in R^{p}.

• Assume that the data are grouped or replicated so that for each fixed i, we observe the number of successes X_{i} = Σ^{m_{i}}_{j=1} Y_{ij}, where Y_{ij} is the response on the jth of the m_{i} trials in block i, 1 ≤ i ≤ k.

Thus, we observe independent X_{1}, . . . , X_{k} with X_{i} binomial Bin(m_{i}, π_{i}), where π_{i} = π(z_{i}) is the probability of success for a case with covariate vector z_{i}.

• Consider the logistic transform g(π), usu- ally called the logit, which is

η = g(π) = log[π/(1 − π)].

• We choose the following parametric model for π(z):

logit(π(z)) = z^{T}β.

This model allows each component of z to take values on R.

• The above model is called the logistic linear regression model.

In practice, the probit g_{1}(π) = Φ^{−1}(π), where Φ is the N(0, 1) cdf, and the log-log transform g_{2}(π) = log[− log(1 − π)] are also used.

• The log likelihood ℓ(π(β)) ≡ ℓ_{N}(β) of β = (β_{1}, . . . , β_{p})^{T} is, if N = Σ^{k}_{i=1} m_{i},

ℓ_{N}(β) = Σ^{p}_{j=1} β_{j}T_{j} − Σ^{k}_{i=1} m_{i} log(1 + exp(z_{i}^{T}β)) + Σ^{k}_{i=1} log (m_{i} choose X_{i}),

where T_{j} = Σ^{k}_{i=1} z_{ij}X_{i}.

• The likelihood equations are

Z^{T}(X − µ) = 0,

where Z = (z_{ij})_{k×p}, by observing that

µ_{i} = E(X_{i}) = m_{i}π_{i} and E(T_{j}) = Σ^{k}_{i=1} z_{ij}µ_{i}.

• The MLE β̂ of β solves E_{β}(T_{j}) = T_{j}, j = 1, . . . , p.

• To solve the above nonlinear equations, we use the Newton-Raphson algorithm.

• The Fisher information matrix is Z^{T}WZ, where W = diag{m_{i}π_{i}(1 − π_{i})}_{k×k}.
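Newton-Raphson with the Fisher information (Fisher scoring) iterates β ← β + (Z^{T}WZ)^{−1}Z^{T}(X − µ). A self-contained sketch for p = 2 on made-up grouped binomial data (the small 2 × 2 solve is written out by hand):

```python
import math

def fisher_scoring(z, m, x, steps=25):
    # z: k covariate rows (z_i0, z_i1); m: trials per block; x: successes
    beta = [0.0, 0.0]
    for _ in range(steps):
        pi = [1 / (1 + math.exp(-(beta[0] * zi[0] + beta[1] * zi[1]))) for zi in z]
        mu = [mi * p for mi, p in zip(m, pi)]            # E(X_i) = m_i pi_i
        w = [mi * p * (1 - p) for mi, p in zip(m, pi)]   # diagonal of W
        # score Z^T (X - mu) and information Z^T W Z (2 x 2 here)
        u0 = sum(zi[0] * (xi - mui) for zi, xi, mui in zip(z, x, mu))
        u1 = sum(zi[1] * (xi - mui) for zi, xi, mui in zip(z, x, mu))
        j00 = sum(wi * zi[0] * zi[0] for wi, zi in zip(w, z))
        j01 = sum(wi * zi[0] * zi[1] for wi, zi in zip(w, z))
        j11 = sum(wi * zi[1] * zi[1] for wi, zi in zip(w, z))
        det = j00 * j11 - j01 * j01
        beta = [beta[0] + (j11 * u0 - j01 * u1) / det,
                beta[1] + (j00 * u1 - j01 * u0) / det]
    return beta

# hypothetical grouped data with covariate rows z_i = (1, z_i1)
z = [[1, -2], [1, -1], [1, 0], [1, 1], [1, 2]]
m = [50] * 5
x = [5, 13, 25, 37, 45]
beta_hat = fisher_scoring(z, m, x)
print(beta_hat)
```

This iteration is the same computation as iteratively reweighted least squares, with weights m_{i}π_{i}(1 − π_{i}) refreshed at each step.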

Testing

• Let ω = {η : η_{i} = z^{T}_{i}β, β ∈ R^{p}}. Consider two different kinds of tests.

– Let Ω = R^{k}. Test H_{0} : η ∈ ω versus H_{1} : η ∈ Ω \ ω.

– Let ω_{0} be a q-dimensional linear subspace of ω with q < p. Test H_{0} : η ∈ ω_{0} versus H_{1} : η ∈ ω \ ω_{0}.

• For the first set of hypotheses, the MLEs of π_{i} and µ_{i} are X_{i}/m_{i} and X_{i}, respectively. The log-likelihood ratio test statistic is

2 Σ^{k}_{i=1} [X_{i} log(X_{i}/µ̂_{i}) + X′_{i} log(X′_{i}/µ̂′_{i})],

where X′_{i} = m_{i} − X_{i} and µ̂′_{i} = m_{i} − µ̂_{i}.

– Note that it just measures the distance between the fit µ̂ based on the model ω and the data.

– By the multivariate delta method, it has asymptotically a χ^{2}_{k−p} distribution for η ∈ ω as m_{i} → ∞, i = 1, . . . , k < ∞.

• For the second set of hypotheses, the log-likelihood ratio test statistic is

2 Σ^{k}_{i=1} [X_{i} log(µ̂_{i}/µ̂_{0i}) + X′_{i} log(µ̂′_{i}/µ̂′_{0i})],

where µ̂_{0} is the MLE of µ under H_{0} and µ̂′_{0i} = m_{i} − µ̂_{0i}.

It has an asymptotic χ^{2}_{p−q} distribution as m_{i} → ∞, i = 1, . . . , k < ∞.

Tests for A Simple Null Hypothesis

• Let X_{1}, . . . , X_{n} be iid with X_{1} ∼ N (θ, 1).

– Test H_{0} : θ = 0 versus H_{1} : θ = θ_{0} > 0.

– How do we find a good test for the above simple hypothesis?

• Consider testing H_{0} : θ = θ^{0} ∈ R^{s} versus H_{1} : θ ≠ θ^{0}.

• We consider three large sample tests.

Likelihood Ratio Test

• The likelihood ratio statistic

Λ_{n} = L(θ^{0}; x) / sup_{θ∈Θ} L(θ; x)

was introduced by Neyman and Pearson (1928).

• Λ_{n} takes values in the interval [0, 1], and H_{0} is to be rejected for sufficiently small values of Λ_{n}.

• The rationale behind LR tests is that when H_{0} is true, Λ_{n} tends to be close to 1, whereas when H_{1} is true, Λ_{n} tends to be close to 0.

• The test may be carried out in terms of the statistic

λ_{n} = −2 log Λ_{n}.

• For finite n, the null distribution of λ_{n} will
generally depend on n and on the form of
pdf of X.

• LR tests are closely related to MLE’s.

• Denote the MLE by θ̂. For asymptotic analysis, expanding λ_{n} about θ̂ in a Taylor series (the first-order term vanishes because the score is zero at θ̂), we get

λ_{n} = −2[Σ^{n}_{i=1} log f(X_{i}, θ^{0}) − Σ^{n}_{i=1} log f(X_{i}, θ̂)]

= 2 · (1/2)(θ^{0} − θ̂)^{T}[−Σ^{n}_{i=1} (∂^{2}/∂θ_{j}∂θ_{k}) log f(X_{i}; θ)|_{θ=θ^{∗}}](θ^{0} − θ̂),

where θ^{∗} lies between θ̂ and θ^{0}.

• Since θ^{∗} is consistent,

λ_{n} = n(θ̂ − θ^{0})^{T}[−(1/n) Σ^{n}_{i=1} (∂^{2}/∂θ_{j}∂θ_{k}) log f(X_{i}; θ)|_{θ=θ^{0}}](θ̂ − θ^{0}) + o_{P}(1).

By the asymptotic normality of θ̂ and

−n^{−1} Σ^{n}_{i=1} (∂^{2}/∂θ_{j}∂θ_{k}) log f(X_{i}; θ)|_{θ=θ^{0}} →^{P} I(θ^{0}),

λ_{n} has, under H_{0}, a limiting chi-squared distribution with s degrees of freedom.

Example 3. Consider the testing problem H_{0} : σ^{2} = σ^{2}_{0} versus H_{1} : σ^{2} ≠ σ^{2}_{0} based on iid X_{1}, . . . , X_{n} from the normal distribution N(µ_{0}, σ^{2}).

• L(θ^{0}; x) = (2πσ^{2}_{0})^{−n/2} exp(−Σ_{i}(x_{i} − µ_{0})^{2}/2σ^{2}_{0}).

• σ̂^{2} = n^{−1} Σ_{i}(x_{i} − µ_{0})^{2} (MLE) and

sup_{θ∈Θ} L(θ; x) = (2πσ̂^{2})^{−n/2} exp(−n/2).

• We have

Λ_{n} = (σ̂^{2}/σ^{2}_{0})^{n/2} exp(n/2 − Σ_{i}(x_{i} − µ_{0})^{2}/2σ^{2}_{0}),

or, under H_{0},

λ_{n} = −n[ln(n^{−1} Σ^{n}_{i=1} Z_{i}^{2}) − (n^{−1} Σ^{n}_{i=1} Z_{i}^{2} − 1)],

where Z_{1}, . . . , Z_{n} are iid N(0, 1).

• Fact: using the CLT, we have

(n^{−1} Σ^{n}_{i=1} Z_{i}^{2} − 1)/√(2/n) →^{d} N(0, 1)

or

(n/2)(n^{−1} Σ^{n}_{i=1} Z_{i}^{2} − 1)^{2} →^{d} χ^{2}_{1}.

• Note that ln u ≈ −(1 − u) − (1 − u)^{2}/2 when u is near 1, and n^{−1} Σ^{n}_{i=1} Z_{i}^{2} → 1 in probability by the LLN.

• A common question in Taylor series approximation is how many terms to keep. In this example, it refers to using the first-order approximation ln u ≈ −(1 − u) in contrast to the second-order approximation used here. If we used only the first-order approximation, we would end up with the difficulty of finding lim_{n} a_{n}b_{n} when lim_{n} a_{n} = ∞ and lim_{n} b_{n} = 0.

• We conclude that λ_{n} has a limiting chi-squared
distribution with 1 degree of freedom.

The Wald Test

• Let θ̂_{n} denote a consistent, asymptotically normal, and asymptotically efficient sequence of solutions of the likelihood equations:

√n(θ̂_{n} − θ) →^{d} N(0, I^{−1}(θ))

as n → ∞.

• Because I(θ) is continuous in θ, we have

I(θ̂_{n}) →^{P} I(θ)

as n → ∞.

• Replacing the matrix

−(1/n) Σ^{n}_{i=1} (∂^{2}/∂θ_{j}∂θ_{k}) log f(X_{i}; θ)|_{θ=θ^{0}}

by I(θ̂_{n}) in the large-sample approximation of λ_{n}, we get a second statistic,

W_{n} = n(θ̂_{n} − θ^{0})^{T}I(θ̂_{n})(θ̂_{n} − θ^{0}),

which was introduced by Wald (1943).

• By Slutsky’s theorem, W_{n} converges in dis-
tribution to χ^{2}_{s}.

• For the construction of a confidence region, one generates {θ^{0} : W_{n} ≤ χ^{2}_{s,α}}, which is an ellipsoid in R^{s}.

• As a remark, for the construction of a confidence region based on λ_{n}, one generates {θ^{0} : λ_{n} ≤ χ^{2}_{s,α}}, which is not necessarily an ellipsoid in R^{s}.

The Rao Score Tests

• Both the Wald and likelihood ratio tests require evaluation of θ̂_{n}. Now we consider a test for which this is not necessary.

• Denote the likelihood score vector by

q(x; θ) = (q_{1}(x; θ), . . . , q_{s}(x; θ))^{T},

where

q_{j}(x; θ) = (∂/∂θ_{j}) log f(x; θ).

• Write Q(θ) = Σ^{n}_{i=1} q(X_{i}; θ). By the central limit theorem,

n^{−1/2}Q(θ^{0}) →^{d} N(0, I(θ^{0})).

• A third statistic,

V_{n} = [n^{−1/2}Q(θ^{0})]^{T}I^{−1}(θ^{0})[n^{−1/2}Q(θ^{0})] = n^{−1}Q(θ^{0})^{T}I^{−1}(θ^{0})Q(θ^{0}),

was introduced by Rao (1948).

Again, it has a limiting χ^{2}_{s} distribution.

Example 4. Consider a sample X_{1}, . . . , X_{n} from the logistic distribution with density

f_{θ}(x) = e^{x−θ}/(1 + e^{x−θ})^{2}.

• q(x; θ) = −1 + 2e^{x−θ}/(1 + e^{x−θ}) and

Q(θ^{0}) = −n + 2 Σ^{n}_{i=1} e^{x_{i}−θ^{0}}/(1 + e^{x_{i}−θ^{0}}).

• I(θ) = 1/3 for all θ.

• The Rao score test therefore rejects H_{0} for large absolute values of the test statistic

√(3/n) Σ^{n}_{i=1} (e^{x_{i}−θ^{0}} − 1)/(1 + e^{x_{i}−θ^{0}}).

• In this case, the MLE does not have an explicit expression, and therefore the Wald and likelihood ratio tests are less convenient.
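The score test needs only θ^{0}, not the MLE; a sketch on a small made-up sample, rejecting at level 0.05 when the standardized score exceeds 1.96 in absolute value:

```python
import math

def rao_score_stat(xs, theta0):
    # standardized score Q(theta0) / sqrt(n I), with I(theta) = 1/3
    q = sum(-1 + 2 * math.exp(x - theta0) / (1 + math.exp(x - theta0))
            for x in xs)
    return q / math.sqrt(len(xs) / 3)

xs = [-1.2, 0.4, 2.1, -0.3, 1.5, 0.8, -2.0, 0.9]  # hypothetical observations
z = rao_score_stat(xs, theta0=0.0)
print(z, abs(z) > 1.96)
```

Note each summand equals tanh((x_{i} − θ^{0})/2), so the statistic is bounded termwise between −1 and 1.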

Example 5. Consider a sequence of n independent trials, with s possible outcomes for each trial.

• Let θ_{j} denote the probability of occurrence
of the jth outcome in any given trial.

• Let N_{j} denote the number of occurrences of
the jth outcome in the series of n trials.

• The MLE of θ_{j}’s are N_{j}/n.

• The three test statistics λ_{n}, W_{n}, and V_{n} for testing H_{0} : θ = θ^{0} against H_{1} : θ ≠ θ^{0} are easily seen to be

λ_{n} = 2 Σ^{s}_{j=1} N_{j} log(N_{j}/(nθ^{0}_{j})),

W_{n} = Σ^{s}_{j=1} (N_{j} − nθ^{0}_{j})^{2}/N_{j},

V_{n} = Σ^{s}_{j=1} (N_{j} − nθ^{0}_{j})^{2}/(nθ^{0}_{j}).

• Both W_{n} and V_{n} are referred to as chi-squared goodness-of-fit statistics; the latter is often called the Pearson chi-squared statistic.

The large-sample properties were first derived by Pearson (1900).

Pearson's chi-square statistic is easily remembered as

χ^{2} = SUM (Observed − Expected)^{2}/Expected.
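The three statistics can be compared on one made-up data set; under H_{0} they are typically close to one another, and here all stay far below the χ^{2}_{2}(0.95) critical value 5.991 (s = 3 outcomes, so s − 1 = 2 degrees of freedom):

```python
import math

def three_stats(counts, theta0):
    # lambda_n (LRT), W_n (Wald), V_n (Rao score = Pearson) for H_0: theta = theta0
    n = sum(counts)
    lam = 2 * sum(c * math.log(c / (n * t)) for c, t in zip(counts, theta0))
    wald = sum((c - n * t) ** 2 / c for c, t in zip(counts, theta0))
    score = sum((c - n * t) ** 2 / (n * t) for c, t in zip(counts, theta0))
    return lam, wald, score

counts = [52, 60, 88]          # hypothetical counts, n = 200
theta0 = [0.25, 0.30, 0.45]    # simple null hypothesis
lam, wald, score = three_stats(counts, theta0)
print(lam, wald, score)  # all about 0.12, far below 5.991
```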

Example 6. (Testing a Genetic Theory)

• In experiments on pea breeding, Mendel observed the different kinds of seeds obtained by crosses from peas with round yellow seeds and peas with wrinkled green seeds.

• Possible types of progeny were: (1) round yellow; (2) wrinkled yellow; (3) round green;

and (4) wrinkled green.

• Assume the seeds are produced independently. We can think of each seed as being the outcome of a multinomial trial with possible outcomes numbered 1, 2, 3, 4 as above and associated probabilities of occurrence θ_{1}, θ_{2}, θ_{3}, θ_{4}.

• Mendel’s theory predicted that θ_{1} = 9/16,
θ_{2} = θ_{3} = 3/16, θ_{4} = 1/16.

• Data: n = 556, n_{1} = 315, n_{2} = 101, n_{3} =
108, n_{4} = 32.

• Pearson's chi-square statistic is

(315 − 556 × 9/16)^{2}/312.75 + (3.25)^{2}/104.25 + (3.75)^{2}/104.25 + (2.75)^{2}/34.75 = 0.47, which has a p-