### Financial Time Series I

### Topic 4: Discrete Data: Contingency Tables

### Hung Chen

### Department of Mathematics National Taiwan University

### 10/30/2002

OUTLINE

1. Probability Model
   – Mean and Variance
   – Limiting Distribution
   – Mining of Association Rules in Basket Analysis
2. Goodness of Fit Test
   – Embedding and Nested Models
   – Test of Independence
3. Logistic Regression for Binary Response
   – Logit or Probit
   – Likelihood Equation
   – Likelihood Ratio Test
4. Tests for a Simple Null Hypothesis
   – Likelihood Ratio Test
   – Wald Test
   – Rao's Score Test

Contingency Tables

We start with a probability model to describe data summarized in a contingency table.

• Consider a sequence of n independent trials, with k possible outcomes for each trial.

For a 2 × 2 table, k = 4 and n is the total number of observations.

• Let p_{j} denote the probability of occurrence of the jth outcome in any given trial (Σ^{k}_{j=1} p_{j} = 1).

• Let n_{j} denote the number of occurrences of the jth outcome in the series of n trials (Σ^{k}_{j=1} n_{j} = n). (n_{1}, . . . , n_{k}) is called the "cell frequency vector" associated with the n trials.

• The exact distribution of (n_{1}, . . . , n_{k}) is the
multinomial distribution M N (n, p) where
p = (p_{1}, . . . , p_{k}).

• E(n_{i}) = np_{i}, Var(n_{i}) = np_{i}(1 − p_{i}), and Cov(n_{i}, n_{j}) = −np_{i}p_{j} for i ≠ j, so that E(n_{1}, . . . , n_{k}) = np and Cov((n_{1}, . . . , n_{k})) = n(D_{p} − p^{t}p), where D_{p} = diag(p).

• Let p̂ = n^{−1}(n_{1}, . . . , n_{k}) be the vector of sample proportions, and set U_{n} = √n(p̂ − p). Then E(U_{n}) = 0 and Cov(U_{n}) = D_{p} − p^{t}p.
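The moment formulas above can be checked by simulation; the following is a minimal sketch (standard library only; the sample size, probabilities, and replication count are made up for illustration):

```python
import random

def multinomial_sample(n, p, rng):
    # draw one cell-frequency vector (n_1, ..., n_k) from MN(n, p)
    counts = [0] * len(p)
    for j in rng.choices(range(len(p)), weights=p, k=n):
        counts[j] += 1
    return counts

rng = random.Random(0)
n, p = 100, [0.1, 0.2, 0.3, 0.4]
reps = [multinomial_sample(n, p, rng) for _ in range(20000)]

# E(n_i) = n p_i, and Cov(n_1, n_2) = -n p_1 p_2 = -2 here
means = [sum(r[i] for r in reps) / len(reps) for i in range(len(p))]
cov12 = sum(r[0] * r[1] for r in reps) / len(reps) - means[0] * means[1]
print(means, cov12)
```

The negative covariance reflects the constraint Σ n_{j} = n: a surplus in one cell forces a deficit elsewhere.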

We now use the "Cramér-Wold device" to prove asymptotic multivariate normality of the cell frequency vector.

Theorem. The random vector U_{n} converges
in distribution to k-variate normal with mean
0 and covariance D_{p} − p^{t}p.

• Write U_{n} = (u_{1}, . . . , u_{k}), where u_{j} = n_{j}/√n − √n p_{j}. For fixed (λ_{1}, . . . , λ_{k}), compute the characteristic function E exp(it Σ^{k}_{j=1} λ_{j}u_{j}).

• Observe that

E exp(it Σ^{k}_{j=1} λ_{j}u_{j}) = E exp(Σ^{k}_{j=1} itλ_{j}(n_{j}/√n − √n p_{j}))

= exp(−it√n Σ^{k}_{j=1} λ_{j}p_{j}) · E exp((it/√n) Σ^{k}_{j=1} λ_{j}n_{j})

= exp(−it√n Σ^{k}_{j=1} λ_{j}p_{j}) · (Σ^{k}_{j=1} p_{j} exp(itλ_{j}/√n))^{n}

= (Σ^{k}_{j=1} p_{j} exp((it/√n)(λ_{j} − Σ^{k}_{i=1} λ_{i}p_{i})))^{n}

= (Σ^{k}_{j=1} p_{j}[1 + (it/√n)(λ_{j} − Σ^{k}_{i=1} λ_{i}p_{i}) − (t^{2}/2n)(λ_{j} − Σ^{k}_{i=1} λ_{i}p_{i})^{2} + o(n^{−1})])^{n}

= (1 − (t^{2}/2n) Σ^{k}_{j=1} p_{j}(λ_{j} − Σ^{k}_{i=1} λ_{i}p_{i})^{2} + o(n^{−1}))^{n}

→ exp(−(t^{2}/2)(λ_{1}, . . . , λ_{k})(D_{p} − p^{t}p)(λ_{1}, . . . , λ_{k})^{t}).

• For each fixed (λ_{1}, . . . , λ_{k}), the limit is the characteristic function of the normal distribution with mean 0 and variance (λ_{1}, . . . , λ_{k})(D_{p} − p^{t}p)(λ_{1}, . . . , λ_{k})^{t}; since this holds for every λ, the Cramér-Wold device yields the k-variate normal limit with mean vector 0 and covariance matrix D_{p} − p^{t}p.

Assumptions:

• Every individual in the population under study can be classified as falling into one and only one of k categories; we say that the categories are mutually exclusive and exhaustive.

• A randomly selected member of the population will fall into one of the k categories according to the vector of cell probabilities

p = (p_{1}, p_{2}, . . . , p_{k}),

where Σ^{k}_{i=1} p_{i} = 1.

• Here the cells are strung out into a line for purposes of indexing only; their arrangement and ordering do not reflect anything about the characteristics of individuals falling into a particular cell.

• The p_{i} reflect the relative frequency of each
category in the population.

• Mining of association rules in basket analysis:

– A basket bought at the food store consists of items such as the following: Apples, Bread, Coke, Milk, Tissues.

– Data on all baskets are available (through cash registers).

– Goal: discover association rules of the form

Bread & Milk => Coke & Tissue.

– This analysis is also called linkage analysis or item analysis.

– Properties of association rules:

∗ The support of the rule is the proportion of baskets with Bread & Milk & Coke & Tissue.

∗ The confidence of the rule is Sup(Bread & Milk & Coke & Tissue)/Sup(Bread & Milk), which is simply the estimated conditional probability in statistical terms.

∗ The lift of the rule is Sup(B&M&C&T)/[Sup(B&M) Sup(C&T)]. How do you connect it with P(A ∩ B)/[P(A)P(B)]?

– Search for rules with high confidence and support:

∗ Will the results be affected by randomness?

∗ Add the requirement that the rule is statistically significant in the test against independence (i.e., against lift = 1).

∗ The number of such tests to be performed in a moderate problem reaches tens of thousands.

– You can put all of them in a huge contingency table.
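The three measures can be computed directly from basket data; a small sketch on made-up baskets (all items and counts hypothetical):

```python
# each basket is the set of items bought together; data are made up
baskets = [
    {"Bread", "Milk", "Coke", "Tissue"},
    {"Bread", "Milk", "Coke"},
    {"Bread", "Milk"},
    {"Coke", "Tissue"},
    {"Milk", "Apples"},
]

def support(itemset):
    # proportion of baskets containing every item in itemset
    return sum(itemset <= b for b in baskets) / len(baskets)

lhs, rhs = {"Bread", "Milk"}, {"Coke", "Tissue"}
sup = support(lhs | rhs)                    # support of the rule
conf = sup / support(lhs)                   # estimated P(rhs | lhs)
lift = sup / (support(lhs) * support(rhs))  # analogue of P(A ∩ B)/[P(A)P(B)]
print(sup, conf, lift)  # 0.2, 0.333..., 0.833...
```

Here lift < 1, so Bread & Milk baskets contain Coke & Tissue slightly less often than independence would predict.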

2 × 2 Tables

• As an example, we might be interested in whether hair color is related to eye color.

Conduct a study by collecting a random sample and counting the number of people who fall in each cell of the cross-classification determined by hair color and eye color.

• When the cells are defined in terms of the categories of two or more variables, a structure relating to the nature of the data is imposed. The natural structure for two variables is often a rectangular array with columns corresponding to the categories of one variable and rows to categories of the second variable; three variables create layers of two-way tables, and so on.

• The simplest contingency table is based on four cells, and the categories depend on two variables. The four cells are arranged in a 2 × 2 table whose two rows correspond to the categorical variable A and whose two columns correspond to the second categorical variable B. Double subscripts refer to the position of the cells in our arrangement.

• The first subscript gives the category number of variable A, the second that of variable B, and the two-dimensional array is displayed as a grid with two rows and two columns.

• The probability p_{ij} is the probability of an
individual being in category i of variable
A and category j of variable B. Usually,
we have some theory in mind which can be
checked in terms of hypothesis testing such
as

H_{0} : p = π (π a fixed value).

• Then the problem can be phrased as n observations from the k-cell multinomial distribution with cell probabilities p_{1}, . . . , p_{k}. Then we encounter the problem of proving asymptotic multivariate normality of cell frequency vectors.

• To test H_{0}, we can proceed with the Pearson chi-square test, which rejects H_{0} if X^{2} is too large, where

X^{2} = Σ^{k}_{i=1} (n_{i} − nπ_{i})^{2}/(nπ_{i}).

This test statistic was first derived by Pearson (1900). Then we need to answer two questions. The first is what magnitude of X^{2} counts as "too large." The second is whether the Pearson chi-square test is a reasonable testing procedure. These questions will be tackled by deriving the asymptotic distribution of the Pearson chi-square statistic under H_{0} and under a local alternative to H_{0}.
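The statistic itself is a one-line computation; a sketch with hypothetical counts, where 7.815 is the 0.95 quantile of the chi-square distribution with k − 1 = 3 degrees of freedom (the calibration justified below):

```python
def pearson_chi2(counts, pi0):
    # X^2 = sum over cells of (n_i - n pi_i)^2 / (n pi_i)
    n = sum(counts)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(counts, pi0))

# hypothetical data: n = 100 observations over k = 4 equally likely cells
counts = [30, 20, 26, 24]
x2 = pearson_chi2(counts, [0.25] * 4)
print(x2, x2 > 7.815)  # 2.08, False: no evidence against H_0
```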

• Using matrix notation, X^{2} can be written as

X^{2} = U_{n}D_{π}^{−1}U_{n}^{t},

where U_{n} = √n(p̂ − π), p̂ = n^{−1}(n_{1}, . . . , n_{k}), and D_{π} = diag(π).

• Let g(x) = xD_{π}^{−1}x^{t} for x = (x_{1}, . . . , x_{k}). Evidently, g is a continuous function of x. It can be shown that U_{n} →^{d} U, where U has the multivariate normal distribution N(0, D_{π} − π^{t}π). Then we have

U_{n}D_{π}^{−1}U_{n}^{t} →^{d} UD_{π}^{−1}U^{t}.

Thus the asymptotic distribution of X^{2} under H_{0} is the distribution of UD_{π}^{−1}U^{t}, where U has the N(0, D_{π} − π^{t}π) distribution. This reduces the problem to finding the distribution of a quadratic form of a multivariate normal random vector. The above process is the so-called δ method.

• Now we state without proof the following general result on the distribution of a quadratic form of a multivariate normal random variable. It can be found in Chapter 3b of Rao (1973) and Chapter 3.5 of Serfling (1980).

Theorem. If X = (X_{1}, . . . , X_{d}) has the multivariate normal distribution N(0, Σ) and Y = XAX^{t} for some symmetric matrix A, then L[Y] = L[Σ^{d}_{i=1} λ_{i}Z_{i}^{2}], where Z_{1}^{2}, . . . , Z_{d}^{2} are independent chi-square variables with one degree of freedom each and λ_{1}, . . . , λ_{d} are the eigenvalues of A^{1/2}Σ(A^{1/2})^{t}.

• Applying the above theorem to the present problem, we see that L[UD_{π}^{−1}U^{t}] = L[Σ^{k}_{i=1} λ_{i}Z_{i}^{2}], where the λ_{i} are the eigenvalues of

B = D_{π}^{−1/2}(D_{π} − π^{t}π)D_{π}^{−1/2} = I − √π^{t}√π,

where √π = (√π_{1}, . . . , √π_{k}).

• Now it remains to find the eigenvalues of B. Since B^{2} = B and B is symmetric, the eigenvalues of B are all either 1 or 0. Moreover,

Σ^{k}_{i=1} λ_{i} = tr(B) = k − 1.

• Therefore, we establish the result that under the simple hypothesis H_{0}, Pearson's chi-square statistic X^{2} has an asymptotic chi-square distribution with k − 1 degrees of freedom.
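The χ^{2}_{k−1} limit can be checked by simulation: since a chi-square variable with k − 1 degrees of freedom has mean k − 1, the simulated X^{2} values under H_{0} should average about k − 1. A sketch with hypothetical cell probabilities (standard library only):

```python
import random

def pearson_chi2(counts, pi0):
    n = sum(counts)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(counts, pi0))

rng = random.Random(1)
p = [0.2, 0.3, 0.5]        # hypothetical null cell probabilities, k = 3
k, n = len(p), 200
stats = []
for _ in range(4000):
    counts = [0] * k
    for j in rng.choices(range(k), weights=p, k=n):
        counts[j] += 1
    stats.append(pearson_chi2(counts, p))

# chi-square with k - 1 = 2 degrees of freedom has mean 2
print(sum(stats) / len(stats))
```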

Remarks:

• We already examined the limiting distribution of the Pearson chi-square statistic under H_{0} by employing the δ method.

• In essence, the δ method requires two ingredients: first, a random variable (which we denote here by θ̂_{n}) whose distribution depends on a real-valued parameter θ in such a way that

L[√n(θ̂_{n} − θ)] → N(0, σ^{2}(θ)); (1)

and second, a function f(x) that can be differentiated at x = θ, so that it possesses the following expansion about θ:

f(x) = f(θ) + (x − θ)f′(θ) + o(|x − θ|) as x → θ. (2)

• The δ method for finding approximate means and variances (asymptotic mean and asymptotic variance) of a function of a random variable is justified by the following theorem.

Theorem (The one-dimensional δ method). If θ̂_{n} is a real-valued random variable and θ is a real-valued parameter such that (1) holds, and if f is a function satisfying (2), then the asymptotic distribution of f(θ̂_{n}) is given by

L[√n(f(θ̂_{n}) − f(θ))] → N(0, σ^{2}(θ)[f′(θ)]^{2}). (3)
Proof. Set Ω_{n} = R, Ω = Ω_{1} × Ω_{2} × · · · × Ω_{n} × · · · = ×^{∞}_{n=1}Ω_{n}, and let P_{n} be the probability distribution of θ̂_{n} on R. Note that Ω is the set of all sequences {t_{n}} such that t_{n} ∈ Ω_{n}. We define two subsets of Ω:

S = {{t_{n}} ∈ Ω : t_{n} − θ = O(n^{−1/2})},

T = {{t_{n}} ∈ Ω : f(t_{n}) − f(θ) − (t_{n} − θ)f′(θ) = o(n^{−1/2})}.

Since f satisfies (2), S ⊂ T. By (1), we have

n^{1/2}(θ̂_{n} − θ) = O_{P}(1) and hence θ̂_{n} − θ = O_{P}(n^{−1/2}). (4)

Note that S occurs in probability, and hence T also occurs in probability since S ⊂ T. Finally,

f(θ̂_{n}) − f(θ) − (θ̂_{n} − θ)f′(θ) = o_{P}(n^{−1/2}) (5)

or

√n(f(θ̂_{n}) − f(θ)) = √n(θ̂_{n} − θ)f′(θ) + o_{P}(1). (6)

Now let V_{n} = √n(f(θ̂_{n}) − f(θ)), U_{n} = √n(θ̂_{n} − θ), and g(x) = xf′(θ) for all real numbers x. Then (6) may be rewritten as

V_{n} = g(U_{n}) + o_{P}(1),

and (3) follows from (1) and Slutsky's theorem.

Goodness-of-Fit to Composite Multinomial Models

Consider a sample from a population in genetic equilibrium with respect to a single gene with two alleles. If we assume the three different genotypes are identifiable, we are led to suppose that there are three types of individuals whose frequencies are given by the so-called Hardy-Weinberg proportions

p_{1} = θ^{2}, p_{2} = 2θ(1 − θ), p_{3} = (1 − θ)^{2},
where 0 < θ < 1.

• In the Hardy-Weinberg model, the probability model describing the data is multinomial with parameter falling in

Θ = {θ : θ_{i} ≥ 0, 1 ≤ i ≤ 3, Σ^{3}_{i=1} θ_{i} = 1}.

• The theory we want to test can be described by a multinomial with parameter falling in

Θ_{0} = {(η^{2}, 2η(1 − η), (1 − η)^{2}) : 0 ≤ η ≤ 1},

which is a one-dimensional curve in the two-dimensional parameter space Θ.

• To test the adequacy of the Hardy-Weinberg model means testing H_{0} : θ ∈ Θ_{0} versus H_{1} : θ ∈ Θ_{1}, where Θ_{1} = Θ − Θ_{0}.

In general, we can describe Θ_{0} parametrically as

Θ_{0} = {(θ_{1}(η), . . . , θ_{k}(η)) : η ∈ Ξ},

where η = (η_{1}, . . . , η_{q})^{T}, Ξ is a subset of q-dimensional space, and the map η → (θ_{1}(η), . . . , θ_{k}(η))^{T} takes Ξ into Θ_{0}. To avoid trivialities we assume q < k − 1.

Now we consider the likelihood ratio test for
H_{0} versus H_{1}.

• Let p(n_{1}, . . . , n_{k}, θ) denote the frequency function.

– Maximize p(n_{1}, . . . , n_{k}, θ) over θ ∈ Θ_{0}.

– Denote the maximizer by η̂ = (η̂_{1}, . . . , η̂_{q}).

– The logarithm of sup_{θ∈Θ_{0}} L(θ; x) is Σ^{k}_{i=1} n_{i} log θ_{i}(η̂) up to a constant.

• The logarithm of sup_{θ∈Θ} L(θ; x) is Σ^{k}_{i=1} n_{i} log(n_{i}/n) up to a constant.

• Suppose that we can define θ′_{j} = g_{j}(θ), j = 1, . . . , r, where the g_{j} are chosen so that H_{0} becomes equivalent to: (θ′_{1}, . . . , θ′_{q})^{T} ranges over an open subset of R^{q} and θ′_{j} = θ_{0j}, j = q + 1, . . . , r, for specified θ_{0j}.

For example, to test the Hardy-Weinberg model we set θ′_{1} = θ_{1}, θ′_{2} = θ_{2} − 2√θ_{1}(1 − √θ_{1}) and test H_{0} : θ′_{2} = 0.

• Applying the standard result on likelihood ratio tests, under H_{0}, λ_{n} approximately has a χ^{2}_{r−q} distribution for large n.

Example 1. Consider the Hardy-Weinberg model.

• η̂ = (2n_{1} + n_{2})/2n.

• Reject H_{0} if λ_{n} ≥ χ^{2}_{1}(1 − α), with

θ(η̂) = (((2n_{1} + n_{2})/2n)^{2}, (2n_{1} + n_{2})(2n_{3} + n_{2})/2n^{2}, ((2n_{3} + n_{2})/2n)^{2})^{T}.
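The whole test fits in a few lines; a sketch on made-up genotype counts, where 3.841 is the χ^{2}_{1}(0.95) critical value:

```python
import math

# hypothetical genotype counts (AA, Aa, aa)
n1, n2, n3 = 42, 50, 8
n = n1 + n2 + n3

eta = (2 * n1 + n2) / (2 * n)   # MLE of eta under the Hardy-Weinberg model
theta = [eta ** 2, 2 * eta * (1 - eta), (1 - eta) ** 2]

# lambda_n = 2 sum n_i log(n_i / (n theta_i(eta_hat)))
lam = 2 * sum(c * math.log(c / (n * t)) for c, t in zip([n1, n2, n3], theta))
print(eta, lam, lam >= 3.841)
```

For these counts the statistic stays below the critical value, so Hardy-Weinberg equilibrium is not rejected.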

The Wald statistic and the Rao score statistic are also approximately χ^{2}_{r−q} distributed for large n under H_{0}.

• Wald statistic:

W_{n} = Σ^{k}_{j=1} [N_{j} − nθ_{j}(η̂)]^{2}/N_{j}.

• Rao score statistic:

S_{n} = Σ^{k}_{j=1} [N_{j} − nθ_{j}(η̂)]^{2}/(nθ_{j}(η̂)).

• S_{n} is exactly Pearson's χ^{2} statistic

SUM (Observed − Expected)^{2}/Expected

with Expected = nθ_{j}(η̂); W_{n} replaces the denominator by the observed counts and is asymptotically equivalent.

Example 2 (The Fisher Linkage Model).

• A self-crossing of maize heterozygous on two characteristics (starchy versus sugary; green base leaf versus white base leaf) leads to four possible offspring types: (1) sugary-white; (2) sugary-green; (3) starchy-white; (4) starchy-green.

• (N_{1}, . . . , N_{4}) has an MN(n, θ_{1}, . . . , θ_{4}) distribution.

• Fisher (1958) specifies that

θ_{1} = (2 + η)/4, θ_{2} = θ_{3} = (1 − η)/4, θ_{4} = η/4,

where η is an unknown number between 0 and 1.

• To test the validity of the linkage model we would take

Θ_{0} = {((2 + η)/4, (1 − η)/4, (1 − η)/4, η/4) : 0 ≤ η ≤ 1},

a "one-dimensional curve" in the three-dimensional parameter space Θ.

• The likelihood equation under H_{0} becomes

n_{1}/(2 + η) − (n_{2} + n_{3})/(1 − η) + n_{4}/η = 0.

• We obtain critical values from the χ^{2}_{1} table.
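The likelihood equation has a unique root in (0, 1): its left-hand side is strictly decreasing in η and runs from +∞ near 0 to −∞ near 1, so bisection finds η̂. A sketch with made-up offspring counts:

```python
def score(eta, n1, n2, n3, n4):
    # left-hand side of the likelihood equation under H_0
    return n1 / (2 + eta) - (n2 + n3) / (1 - eta) + n4 / eta

def solve_eta(n1, n2, n3, n4):
    lo, hi = 1e-9, 1 - 1e-9   # score > 0 near 0 and < 0 near 1
    for _ in range(200):
        mid = (lo + hi) / 2
        if score(mid, n1, n2, n3, n4) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# hypothetical counts for offspring types (1)-(4)
eta_hat = solve_eta(58, 20, 22, 4)
print(eta_hat)
```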

Testing Independence of Classification in Contingency Tables

• Many important characteristics have only two categories.

• An individual either is or is not inoculated against a disease; is or is not a smoker; is male or female; and so on.

• We often want to know whether such characteristics are linked or are independent.

For example, do smoking and lung cancer have any relation to each other?

• Let us call the possible categories or states of the first characteristic A and ¯A and of the second B and ¯B.

– A randomly selected individual from the population can be one of four types: AB, A ¯B, ¯AB, ¯A ¯B.

– Denote the probabilities of these types by
θ_{11}, θ_{12}, θ_{21}, θ_{22}, respectively.

• Independent classification means that the events (being an A) and (being a B) are independent, or, in terms of the θ_{ij},

θ_{ij} = (θ_{i1} + θ_{i2})(θ_{1j} + θ_{2j}).

• The data are assembled in what is called a 2 × 2 contingency table.

• Testing independence can be put as H_{0} : θ ∈ Θ_{0} versus H_{1} : θ ∉ Θ_{0}, where Θ_{0} is a two-dimensional subset of Θ given by

Θ_{0} = {(η_{1}η_{2}, η_{1}(1 − η_{2}), η_{2}(1 − η_{1}), (1 − η_{1})(1 − η_{2})) : 0 ≤ η_{1}, η_{2} ≤ 1}.

The χ^{2} test has 1 degree of freedom.

• For θ ∈ Θ_{0}, ˆη_{1} = (n_{11} + n_{12})/n and ˆη_{2} =
(n_{11} + n_{21})/n.

• Pearson's statistic is

n Σ^{2}_{i=1} Σ^{2}_{j=1} [N_{ij} − (N_{i1} + N_{i2})(N_{1j} + N_{2j})/n]^{2} / [(N_{i1} + N_{i2})(N_{1j} + N_{2j})].

• Pearson's statistic can be rewritten as Z^{2}, where

Z = (N_{11}/(N_{11} + N_{21}) − N_{12}/(N_{12} + N_{22})) √[(N_{11} + N_{21})(N_{12} + N_{22})n / ((N_{11} + N_{12})(N_{21} + N_{22}))].

Thus,

Z = √n [P̂(A|B) − P̂(A| ¯B)] [P̂(B)P̂( ¯B) / (P̂(A)P̂( ¯A))]^{1/2},

where P̂ is the empirical distribution.

• The sign of Z indicates the direction of the deviation from independence.

a × b Contingency Tables

• Consider contingency tables for two nonnumerical characteristics having a and b states, respectively, a, b ≥ 2.

• If we take a sample of size n from a population and classify its members according to each characteristic, we obtain a vector N_{ij}, i = 1, . . . , a, j = 1, . . . , b, where N_{ij} is the number of individuals of type i for characteristic 1 and type j for characteristic 2.

• {N_{ij} : 1 ≤ i ≤ a, 1 ≤ j ≤ b} is multinomially distributed with cell probabilities {θ_{ij} : 1 ≤ i ≤ a, 1 ≤ j ≤ b}, where

θ_{ij} = P(a randomly selected individual is of type i for characteristic 1 and type j for characteristic 2).

• The hypothesis that the characteristics are assigned independently becomes H_{0} : θ_{ij} = η_{i1}η_{j2} for 1 ≤ i ≤ a, 1 ≤ j ≤ b, where the η_{i1}, η_{j2} are nonnegative and Σ^{a}_{i=1} η_{i1} = Σ^{b}_{j=1} η_{j2} = 1.

• The N_{ij} can be arranged in an a × b contingency table. Write C_{j} = Σ^{a}_{i=1} N_{ij} and R_{i} = Σ^{b}_{j=1} N_{ij}.

• Pearson's χ^{2} for the hypothesis of independence is

n Σ^{a}_{i=1} Σ^{b}_{j=1} (N_{ij} − R_{i}C_{j}/n)^{2}/(R_{i}C_{j}),

which has approximately a χ^{2}_{(a−1)(b−1)} distribution under H_{0}.
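A direct implementation for an arbitrary a × b table, illustrated on a made-up 2 × 3 table (df = (2 − 1)(3 − 1) = 2, with 0.95 critical value 5.991):

```python
def independence_chi2(table):
    # Pearson chi-square for independence:
    # sum of (N_ij - R_i C_j / n)^2 / (R_i C_j / n) over all cells
    a, b = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    R = [sum(row) for row in table]                          # row totals
    C = [sum(table[i][j] for i in range(a)) for j in range(b)]  # column totals
    return sum(
        (table[i][j] - R[i] * C[j] / n) ** 2 / (R[i] * C[j] / n)
        for i in range(a) for j in range(b)
    )

x2 = independence_chi2([[20, 30, 50],
                        [30, 20, 50]])
print(x2, x2 > 5.991)  # 4.0, False
```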

Logistic Regression for Binary Response

• Consider Bernoulli responses Y that can only take on the values 0 and 1. Examples are

– medical trials where at the end of the trial the patient has either recovered (Y = 1) or has not recovered (Y = 0),

– election polls where a voter either supports a proposition (Y = 1) or does not (Y = 0),

– market research where a potential customer either desires a new product (Y = 1) or does not (Y = 0),

– multiple-choice tests where an examinee either gets the correct answer (Y = 1) or does not (Y = 0).

• Assume the distribution of Y depends on
the known covariate vector z in R^{p}.

• Assume that the data are grouped or replicated so that for each fixed i, we observe the number of successes X_{i} = Σ^{m_{i}}_{j=1} Y_{ij}, where Y_{ij} is the response on the jth of the m_{i} trials in block i, 1 ≤ i ≤ k.

Thus, we observe independent X_{1}, . . . , X_{k} with X_{i} binomial Bin(m_{i}, π_{i}), where π_{i} = π(z_{i}) is the probability of success for a case with covariate vector z_{i}.

• Consider the logistic transform g(π), usu- ally called the logit, which is

η = g(π) = log[π/(1 − π)].

• We choose the following parametric model for π(z):

logit(π(z)) = z^{T}β.

This model allows each component of z to take values on R.

• The above model is called the logistic linear regression model.

In practice, the probit g_{1}(π) = Φ^{−1}(π), where Φ is the N(0, 1) cdf, and the log-log transform g_{2}(π) = log[− log(1 − π)] are also used.

• The log likelihood ℓ(π(β)) ≡ ℓ_{N}(β) of β = (β_{1}, . . . , β_{p})^{T} is, if N = Σ^{k}_{i=1} m_{i},

ℓ_{N}(β) = Σ^{p}_{j=1} β_{j}T_{j} − Σ^{k}_{i=1} m_{i} log(1 + exp(z_{i}^{T}β)) + Σ^{k}_{i=1} log (m_{i} choose X_{i}),

where T_{j} = Σ^{k}_{i=1} z_{ij}X_{i}.

• The likelihood equations are

Z^{T}(X − µ) = 0,

where Z = (z_{ij})_{k×p}, by observing that

µ_{i} = E(X_{i}) = m_{i}π_{i} and E(T_{j}) = Σ^{k}_{i=1} z_{ij}µ_{i}.

• The MLE β̂ of β solves E_{β}(T_{j}) = T_{j}, j = 1, . . . , p.

• To solve the above nonlinear equations, we use the Newton-Raphson algorithm.

• The Fisher information matrix is Z^{T}WZ, where W = diag{m_{i}π_{i}(1 − π_{i})}_{k×k}.
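Newton-Raphson with the Fisher information (Fisher scoring) iterates β ← β + (Z^{T}WZ)^{−1}Z^{T}(X − µ). A self-contained sketch for p = 2 on made-up grouped binomial data (the small 2 × 2 solve is written out by hand):

```python
import math

def fisher_scoring(z, m, x, steps=25):
    # z: k covariate rows (z_i0, z_i1); m: trials per block; x: successes
    beta = [0.0, 0.0]
    for _ in range(steps):
        pi = [1 / (1 + math.exp(-(beta[0] * zi[0] + beta[1] * zi[1]))) for zi in z]
        mu = [mi * p for mi, p in zip(m, pi)]            # E(X_i) = m_i pi_i
        w = [mi * p * (1 - p) for mi, p in zip(m, pi)]   # diagonal of W
        # score Z^T (X - mu) and information Z^T W Z (2 x 2 here)
        u0 = sum(zi[0] * (xi - mui) for zi, xi, mui in zip(z, x, mu))
        u1 = sum(zi[1] * (xi - mui) for zi, xi, mui in zip(z, x, mu))
        j00 = sum(wi * zi[0] * zi[0] for wi, zi in zip(w, z))
        j01 = sum(wi * zi[0] * zi[1] for wi, zi in zip(w, z))
        j11 = sum(wi * zi[1] * zi[1] for wi, zi in zip(w, z))
        det = j00 * j11 - j01 * j01
        beta = [beta[0] + (j11 * u0 - j01 * u1) / det,
                beta[1] + (j00 * u1 - j01 * u0) / det]
    return beta

# hypothetical grouped data with covariate rows z_i = (1, z_i1)
z = [[1, -2], [1, -1], [1, 0], [1, 1], [1, 2]]
m = [50] * 5
x = [5, 13, 25, 37, 45]
beta_hat = fisher_scoring(z, m, x)
print(beta_hat)
```

This iteration is the same computation as iteratively reweighted least squares, with weights m_{i}π_{i}(1 − π_{i}) refreshed at each step.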

Testing

• Let ω = {η : η_{i} = z^{T}_{i}β, β ∈ R^{p}}. Consider two different kinds of tests.

– Let Ω = R^{k}. Test H_{0} : η ∈ ω versus H_{1} : η ∈ Ω \ ω.

– Let ω_{0} be a q-dimensional linear subspace of ω with q < p. Test H_{0} : η ∈ ω_{0} versus H_{1} : η ∈ ω \ ω_{0}.

• For the first set of hypotheses, the MLEs of π_{i} and µ_{i} are X_{i}/m_{i} and X_{i}, respectively. The log-likelihood ratio test statistic is

2 Σ^{k}_{i=1} [X_{i} log(X_{i}/µ̂_{i}) + X′_{i} log(X′_{i}/µ̂′_{i})],

where X′_{i} = m_{i} − X_{i} and µ̂′_{i} = m_{i} − µ̂_{i}.

– Note that it just measures the distance between the fit µ̂ based on the model ω and the data.

– By the multivariate delta method, it has asymptotically a χ^{2}_{k−p} distribution for η ∈ ω as m_{i} → ∞, i = 1, . . . , k < ∞.

• For the second set of hypotheses, the log-likelihood ratio test statistic is

2 Σ^{k}_{i=1} [X_{i} log(µ̂_{i}/µ̂_{0i}) + X′_{i} log(µ̂′_{i}/µ̂′_{0i})],

where µ̂_{0} is the MLE of µ under H_{0} and µ̂′_{0i} = m_{i} − µ̂_{0i}.

It has an asymptotic χ^{2}_{p−q} distribution as m_{i} → ∞, i = 1, . . . , k < ∞.

Tests for A Simple Null Hypothesis

• Let X_{1}, . . . , X_{n} be iid with X_{1} ∼ N (θ, 1).

– Test H_{0} : θ = 0 versus H_{1} : θ = θ_{0} > 0.

– How do we find a good test for the above simple hypothesis?

• Consider testing H_{0} : θ = θ^{0} ∈ R^{s} versus H_{1} : θ ≠ θ^{0}.

• We consider three large sample tests.

Likelihood Ratio Test

• The likelihood ratio statistic

Λ_{n} = L(θ^{0}; x) / sup_{θ∈Θ} L(θ; x)

was introduced by Neyman and Pearson (1928).

• Λ_{n} takes values in the interval [0, 1], and H_{0} is to be rejected for sufficiently small values of Λ_{n}.

• The rationale behind LR tests is that when H_{0} is true, Λ_{n} tends to be close to 1, whereas when H_{1} is true, Λ_{n} tends to be close to 0.

• The test may be carried out in terms of the statistic

λ_{n} = −2 log Λ_{n}.

• For finite n, the null distribution of λ_{n} will
generally depend on n and on the form of
pdf of X.

• LR tests are closely related to MLE’s.

• Denote the MLE by θ̂. For asymptotic analysis, expanding λ_{n} about θ̂ in a Taylor series (the first-order term vanishes because the score is zero at θ̂), we get

λ_{n} = −2[Σ^{n}_{i=1} log f(X_{i}, θ^{0}) − Σ^{n}_{i=1} log f(X_{i}, θ̂)]

= 2 · (1/2)(θ^{0} − θ̂)^{T}[−Σ^{n}_{i=1} (∂^{2}/∂θ_{j}∂θ_{k}) log f(X_{i}; θ)|_{θ=θ^{∗}}](θ^{0} − θ̂),

where θ^{∗} lies between θ̂ and θ^{0}.

• Since θ^{∗} is consistent,

λ_{n} = n(θ̂ − θ^{0})^{T}[−(1/n) Σ^{n}_{i=1} (∂^{2}/∂θ_{j}∂θ_{k}) log f(X_{i}; θ)|_{θ=θ^{0}}](θ̂ − θ^{0}) + o_{P}(1).

By the asymptotic normality of θ̂ and

−n^{−1} Σ^{n}_{i=1} (∂^{2}/∂θ_{j}∂θ_{k}) log f(X_{i}; θ)|_{θ=θ^{0}} →^{P} I(θ^{0}),

λ_{n} has, under H_{0}, a limiting chi-squared distribution with s degrees of freedom.

Example 3. Consider the testing problem H_{0} : σ^{2} = σ^{2}_{0} versus H_{1} : σ^{2} ≠ σ^{2}_{0} based on iid X_{1}, . . . , X_{n} from the normal distribution N(µ_{0}, σ^{2}).

• L(θ^{0}; x) = (2πσ^{2}_{0})^{−n/2} exp(−Σ_{i}(x_{i} − µ_{0})^{2}/2σ^{2}_{0}).

• σ̂^{2} = n^{−1} Σ_{i}(x_{i} − µ_{0})^{2} (MLE) and

sup_{θ∈Θ} L(θ; x) = (2πσ̂^{2})^{−n/2} exp(−n/2).

• We have

Λ_{n} = (σ̂^{2}/σ^{2}_{0})^{n/2} exp(n/2 − Σ_{i}(x_{i} − µ_{0})^{2}/2σ^{2}_{0}),

or, under H_{0},

λ_{n} = −n[ln(n^{−1} Σ^{n}_{i=1} Z_{i}^{2}) − (n^{−1} Σ^{n}_{i=1} Z_{i}^{2} − 1)],

where Z_{1}, . . . , Z_{n} are iid N(0, 1).

• Fact: using the CLT, we have

(n^{−1} Σ^{n}_{i=1} Z_{i}^{2} − 1)/√(2/n) →^{d} N(0, 1)

or

(n/2)(n^{−1} Σ^{n}_{i=1} Z_{i}^{2} − 1)^{2} →^{d} χ^{2}_{1}.

• Note that ln u ≈ −(1 − u) − (1 − u)^{2}/2 when u is near 1, and n^{−1} Σ^{n}_{i=1} Z_{i}^{2} → 1 in probability by the LLN.

• A common question in Taylor series approximation is how many terms to keep. In this example, it refers to using the first-order approximation ln u ≈ −(1 − u) in contrast to the second-order approximation used here. If we used only the first-order approximation, we would end up with the difficulty of finding lim_{n} a_{n}b_{n} when lim_{n} a_{n} = ∞ and lim_{n} b_{n} = 0.

• We conclude that λ_{n} has a limiting chi-squared
distribution with 1 degree of freedom.

The Wald Test

• Let θ̂_{n} denote a consistent, asymptotically normal, and asymptotically efficient sequence of solutions of the likelihood equations:

√n(θ̂_{n} − θ) →^{d} N(0, I^{−1}(θ))

as n → ∞.

• Because I(θ) is continuous in θ, we have

I(θ̂_{n}) →^{P} I(θ)

as n → ∞.

• Replacing the matrix

−(1/n) Σ^{n}_{i=1} (∂^{2}/∂θ_{j}∂θ_{k}) log f(X_{i}; θ)|_{θ=θ^{0}}

by I(θ̂_{n}) in the large-sample approximation of λ_{n}, we get a second statistic,

W_{n} = n(θ̂_{n} − θ^{0})^{T}I(θ̂_{n})(θ̂_{n} − θ^{0}),

which was introduced by Wald (1943).

• By Slutsky’s theorem, W_{n} converges in dis-
tribution to χ^{2}_{s}.

• For the construction of a confidence region, one generates {θ^{0} : W_{n} ≤ χ^{2}_{s,α}}, which is an ellipsoid in R^{s}.

• As a remark, for the construction of a confidence region based on λ_{n}, one generates {θ^{0} : λ_{n} ≤ χ^{2}_{s,α}}, which is not necessarily an ellipsoid in R^{s}.

The Rao Score Tests

• Both the Wald and likelihood ratio tests require evaluation of θ̂_{n}. Now we consider a test for which this is not necessary.

• Denote the likelihood score vector by

q(x; θ) = (q_{1}(x; θ), . . . , q_{s}(x; θ))^{T},

where

q_{j}(x; θ) = (∂/∂θ_{j}) log f(x; θ).

• Write Q(θ) = Σ^{n}_{i=1} q(X_{i}; θ). By the central limit theorem,

n^{−1/2}Q(θ^{0}) →^{d} N(0, I(θ^{0})).

• A third statistic,

V_{n} = [n^{−1/2}Q(θ^{0})]^{T}I^{−1}(θ^{0})[n^{−1/2}Q(θ^{0})] = n^{−1}Q(θ^{0})^{T}I^{−1}(θ^{0})Q(θ^{0}),

was introduced by Rao (1948).

Again, it has a limiting χ^{2}_{s} distribution.

Example 4. Consider a sample X_{1}, . . . , X_{n} from the logistic distribution with density

f_{θ}(x) = e^{x−θ}/(1 + e^{x−θ})^{2}.

• q(x; θ) = −1 + 2e^{x−θ}/(1 + e^{x−θ}) and

Q(θ^{0}) = −n + 2 Σ^{n}_{i=1} e^{x_{i}−θ^{0}}/(1 + e^{x_{i}−θ^{0}}).

• I(θ) = 1/3 for all θ.

• The Rao score test therefore rejects H_{0} for large absolute values of the test statistic

√(3/n) Σ^{n}_{i=1} (e^{x_{i}−θ^{0}} − 1)/(1 + e^{x_{i}−θ^{0}}).

• In this case, the MLE does not have an explicit expression, and therefore the Wald and likelihood ratio tests are less convenient.
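The score test needs only θ^{0}, not the MLE; a sketch on a small made-up sample, rejecting at level 0.05 when the standardized score exceeds 1.96 in absolute value:

```python
import math

def rao_score_stat(xs, theta0):
    # standardized score Q(theta0) / sqrt(n I), with I(theta) = 1/3
    q = sum(-1 + 2 * math.exp(x - theta0) / (1 + math.exp(x - theta0))
            for x in xs)
    return q / math.sqrt(len(xs) / 3)

xs = [-1.2, 0.4, 2.1, -0.3, 1.5, 0.8, -2.0, 0.9]  # hypothetical observations
z = rao_score_stat(xs, theta0=0.0)
print(z, abs(z) > 1.96)
```

Note each summand equals tanh((x_{i} − θ^{0})/2), so the statistic is bounded termwise between −1 and 1.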

Example 5. Consider a sequence of n independent trials, with s possible outcomes for each trial.

• Let θ_{j} denote the probability of occurrence
of the jth outcome in any given trial.

• Let N_{j} denote the number of occurrences of
the jth outcome in the series of n trials.

• The MLE of θ_{j}’s are N_{j}/n.

• The three test statistics λ_{n}, W_{n}, and V_{n} for testing H_{0} : θ = θ^{0} against H_{1} : θ ≠ θ^{0} are easily seen to be

λ_{n} = 2 Σ^{s}_{j=1} N_{j} log(N_{j}/(nθ^{0}_{j})),

W_{n} = Σ^{s}_{j=1} (N_{j} − nθ^{0}_{j})^{2}/N_{j},

V_{n} = Σ^{s}_{j=1} (N_{j} − nθ^{0}_{j})^{2}/(nθ^{0}_{j}).

• Both W_{n} and V_{n} are referred to as chi-squared goodness-of-fit statistics; the latter is often called the Pearson chi-squared statistic.

The large-sample properties were first derived by Pearson (1900).

Pearson's chi-square statistic is easily remembered as

χ^{2} = SUM (Observed − Expected)^{2}/Expected.
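The three statistics can be compared on one made-up data set; under H_{0} they are typically close to one another, and here all stay far below the χ^{2}_{2}(0.95) critical value 5.991 (s = 3 outcomes, so s − 1 = 2 degrees of freedom):

```python
import math

def three_stats(counts, theta0):
    # lambda_n (LRT), W_n (Wald), V_n (Rao score = Pearson) for H_0: theta = theta0
    n = sum(counts)
    lam = 2 * sum(c * math.log(c / (n * t)) for c, t in zip(counts, theta0))
    wald = sum((c - n * t) ** 2 / c for c, t in zip(counts, theta0))
    score = sum((c - n * t) ** 2 / (n * t) for c, t in zip(counts, theta0))
    return lam, wald, score

counts = [52, 60, 88]          # hypothetical counts, n = 200
theta0 = [0.25, 0.30, 0.45]    # simple null hypothesis
lam, wald, score = three_stats(counts, theta0)
print(lam, wald, score)  # all about 0.12, far below 5.991
```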

Example 6. (Testing a Genetic Theory)

• In experiments on pea breeding, Mendel observed the different kinds of seeds obtained by crosses from peas with round yellow seeds and peas with wrinkled green seeds.

• Possible types of progeny were: (1) round yellow; (2) wrinkled yellow; (3) round green;

and (4) wrinkled green.

• Assume the seeds are produced independently. We can think of each seed as being the outcome of a multinomial trial with possible outcomes numbered 1, 2, 3, 4 as above and associated probabilities of occurrence θ_{1}, θ_{2}, θ_{3}, θ_{4}.

• Mendel’s theory predicted that θ_{1} = 9/16,
θ_{2} = θ_{3} = 3/16, θ_{4} = 1/16.

• Data: n = 556, n_{1} = 315, n_{2} = 101, n_{3} =
108, n_{4} = 32.

• Pearson's chi-square statistic is

(315 − 556 × 9/16)^{2}/312.75 + (3.25)^{2}/104.25 + (3.75)^{2}/104.25 + (2.75)^{2}/34.75 = 0.47, which has a p-