### Chapter 4. Method of Maximum Likelihood

### 1 Introduction

Many statistical procedures are based on statistical models which specify the conditions under which the data are generated. Usually the assumption is made that the set of observations x_{1}, . . . , x_{n} is a set of (i) independent random variables (ii) identically distributed with common pdf f(x_{i}, θ). Once this model is specified, the statistician tries to find optimal solutions to his problem (usually related to the inference on a set of parameters θ ∈ Θ ⊂ R^{k} characterizing the uncertainty about the model).

The procedure just described is not always easy to carry out. In fact, when confronted with a set of data, three attitudes are possible:

• The statistician may be a “pessimist” who does not believe in any particular model f(x, θ). In this case he must be satisfied with descriptive methods (like exploratory data analysis) without the possibility of inductive inference.

• The statistician may be an “optimist” who strongly believes in one model. In this case the analysis is straightforward and optimal solutions may often be easily obtained.

• The statistician may be a “realist”: he would like to specify a particular model f(x, θ) in order to get operational results, but he may have either some doubt about the validity of this hypothesis or some difficulty in choosing a particular parametric family.

Let us illustrate this kind of preoccupation with an example. Suppose that the parameter of interest is the “center” of some population. In many situations, the statistician may argue that, due to a central limit effect, the data are generated by a normal pdf. In this case the problem is restricted to the problem of inference on µ, the mean of the population. But in some cases, he may have some doubt about these central limit effects and may suspect some skewness and/or some kurtosis or he may suspect that some observations are generated by other models (leading to the presence of outliers).

In this context three types of question may be raised to avoid gross errors in the prediction, or in the inference:

• Does the optimal solution, computed for the assumed model f(x, θ), still have “good” properties if the true model is a little different?

• Are the optimal solutions computed for other models near to the original one really substantially different?

• Is it possible to compute (exactly or approximately) optimal solutions for a wider class of models based on very few assumptions?

The first question is concerned with the sensitivity of a given criterion to the hypotheses (criterion robustness). In the second question, it is the sensitivity of the inference which is analyzed (inference robustness). The last question may be viewed as a tentative first step towards the development of nonparametric methods (i.e. methods based on a very large parametric space).

### 2 Information Bound

Any statistical inference starts from a basic family of probability measures, expressing our prior knowledge about the nature of the probability measures from which the observations originate. Formally, a model P is a collection of probability measures P on (X, A), where X is the sample space with a σ-field of subsets A.

If

P = {P_{θ} : θ ∈ Θ}, Θ ⊂ R^{s}

for some s, then P is a parametric model. On the other hand, if

P = {all P on (X, A)},

then P is often referred to as a nonparametric model.

Suppose that we have a fully specified parametric family of models. Denote the parameter of interest by θ. Suppose that we wish to calculate from the data a single value representing the “best estimate” that we can make of the unknown parameter. We call such a problem one of point estimation.

Define the information matrix as the s × s matrix

I(θ) = ‖I_{ij}(θ)‖,

where

I_{ij}(θ) = E_{θ}[ (∂ log f(X; θ)/∂θ_{i}) (∂ log f(X; θ)/∂θ_{j}) ].

When s = 1, I(θ) is known as the Fisher information. Under regularity conditions, we have

E_{θ}[ ∂ log f(X; θ)/∂θ_{i} ] = 0 (1)

and

I_{ij}(θ) = cov[ ∂ log f(X; θ)/∂θ_{i}, ∂ log f(X; θ)/∂θ_{j} ].

Being a covariance matrix, I(θ) is then positive semidefinite, and positive definite unless the (∂/∂θ_{i}) log f(X; θ), i = 1, . . . , s, are affinely dependent (and hence, by (1), linearly dependent). When the density also has second derivatives, we have the following alternative expression for I_{ij}(θ):

I_{ij}(θ) = −E[ ∂^{2} log f(X; θ)/∂θ_{i}∂θ_{j} ].

To make the above statements precise, we make the following assumptions when s = 1:

(i) Θ is an open interval (finite, infinite, or semi-infinite).

(ii) The distributions P_{θ} have common support, so that without loss of generality the set A = {x : p_{θ}(x) > 0} is independent of θ. (2)

(iii) For any x in A and θ in Θ, the derivative p′_{θ}(x) = ∂p_{θ}(x)/∂θ exists and is finite.
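As a small numerical sketch (not part of the formal development), the two expressions for the Fisher information can be compared for the N(θ, 1) density, whose information is identically 1; the value of θ, the grid, and the use of a simple Riemann sum are illustrative choices.

```python
import numpy as np

# Check numerically that Var[d/dtheta log f] and -E[d^2/dtheta^2 log f]
# agree for f = N(theta, 1), where both equal 1. Riemann-sum sketch;
# theta and the grid are arbitrary illustrative choices.
theta = 0.7
x = np.linspace(theta - 10.0, theta + 10.0, 200001)
dx = x[1] - x[0]
f = np.exp(-(x - theta) ** 2 / 2) / np.sqrt(2 * np.pi)

score = x - theta                    # d/dtheta log f(x; theta)
I_var = np.sum(score ** 2 * f) * dx  # E[score^2]; here E[score] = 0
I_hess = np.sum(f) * dx              # -E[-1]: second derivative of log f
                                     # is -1, so this is the mass of f

print(I_var, I_hess)                 # both approximately 1
```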

Lemma 1 (i) If (2) holds, and the derivative with respect to θ of the left side of

∫ f(x; θ) dµ(x) = 1 (3)

can be obtained by differentiating under the integral sign, then

E_{θ}[ ∂ log f(X; θ)/∂θ ] = 0 and I(θ) = var_{θ}[ ∂ log f(X; θ)/∂θ ]. (4)

(ii) If, in addition, the second derivative with respect to θ of log f(x; θ) exists for all x and θ, and the second derivative with respect to θ of the left side of (3) can be obtained by differentiating twice under the integral sign, then

I(θ) = −E_{θ}[ ∂^{2} log f(X; θ)/∂θ^{2} ].

Let us now derive the information inequality for s = 1.

Theorem 1 Suppose (2) and (4) hold, and that I(θ) > 0. Let δ be any statistic with E_{θ}(δ^{2}) < ∞ for which the derivative with respect to θ of E_{θ}(δ) exists and can be obtained by differentiating under the integral sign. Then

var_{θ}(δ) ≥ [∂E_{θ}(δ)/∂θ]^{2} / I(θ).

Proof. For any estimator δ of g(θ) and any function ψ(x, θ) with finite second moment, the covariance inequality states that

var_{θ}(δ) ≥ [cov(δ, ψ)]^{2} / var(ψ). (5)

Denote g(θ) = E_{θ}δ and set

ψ(X, θ) = ∂ log f(X; θ)/∂θ.

If differentiation under the integral sign is permitted in E_{θ}δ, it then follows that

cov(δ, ψ) = ∫ δ(x) [f′(x; θ)/f(x; θ)] f(x; θ) dx = g′(θ),

and hence

var_{θ}(δ) ≥ [g′(θ)]^{2} / var_{θ}[ ∂ log f(X; θ)/∂θ ].

This completes the proof of the theorem.

If δ is an unbiased estimator of θ based on n iid observations, then

var_{θ}(δ) ≥ 1/(nI(θ)).

The above inequality provides a lower bound for the variance of any unbiased estimator; the quantity 1/(nI(θ)) is known as the “Cramer-Rao lower bound.” Likewise, we can also derive the information inequality for general s. We begin by generalizing the correlation inequality to one involving several ψ_{i} (i = 1, . . . , r).
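A small simulation can illustrate the bound; the normal mean model, sample size, and replication count below are arbitrary illustrative choices. For N(θ, 1) the information is I(θ) = 1, and the sample mean attains the bound 1/(nI(θ)).

```python
import numpy as np

# Monte Carlo sketch of the Cramer-Rao bound for the mean of N(theta, 1):
# I(theta) = 1, so 1/(n I(theta)) = 1/n, attained by the sample mean.
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 50, 20000
samples = rng.normal(theta, 1.0, size=(reps, n))
means = samples.mean(axis=1)       # the unbiased estimator in each replication

empirical_var = means.var()
cr_bound = 1.0 / (n * 1.0)         # 1/(n I(theta))
print(empirical_var, cr_bound)     # both approximately 0.02
```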

Theorem 2 For any unbiased estimator δ of g(θ) and any functions ψ_{i}(x, θ) with finite second moments, we have

var(δ) ≥ γ′C^{−1}γ, (6)

where γ = (γ_{1}, · · · , γ_{r})′ and C = ‖C_{ij}‖ are defined by

γ_{i} = cov(δ, ψ_{i}), C_{ij} = cov(ψ_{i}, ψ_{j}). (7)

Proof. Replace Y by δ and X_{i} by ψ_{i}(X, θ) in the following lemma. Then the fact that ρ^{∗2} ≤ 1 yields the theorem.

Let (X_{1}, . . . , X_{r}) and Y be random variables with finite second moments, and consider the correlation coefficient corr(Σ_{i} a_{i}X_{i}, Y). Its maximum value ρ^{∗} over all (a_{1}, . . . , a_{r}) is the multiple correlation coefficient between Y and the vector (X_{1}, . . . , X_{r}).

Lemma 2 Let (X_{1}, . . . , X_{r}) and Y have finite second moments, let γ_{i} = cov(X_{i}, Y), and let Σ be the covariance matrix of the X’s. Without loss of generality, suppose Σ is positive definite. Then

ρ^{∗2} = γ′Σ^{−1}γ / var(Y). (8)

Proof. Since a correlation coefficient is invariant under scale changes, the a’s maximizing (8) are not uniquely determined. Without loss of generality, we therefore impose the condition var(Σ_{i} a_{i}X_{i}) = a′Σa = 1. In view of a′Σa = 1,

corr(Σ_{i} a_{i}X_{i}, Y) = a′γ/√var(Y).

The problem then becomes that of maximizing a′γ subject to a′Σa = 1. Using the method of undetermined multipliers, one maximizes instead

a′γ − (λ/2)a′Σa (9)

with respect to a and then determines λ so as to satisfy a′Σa = 1. Differentiation of (9) with respect to the a_{i} leads to a system of linear equations with the unique solution

a = (1/λ)Σ^{−1}γ, (10)

and the side condition a′Σa = 1 gives

λ = ±√(γ′Σ^{−1}γ).

Substituting these values of λ into (10), one finds that

a = ±Σ^{−1}γ/√(γ′Σ^{−1}γ),

and the maximum value ρ^{∗} of corr(Σ_{i} a_{i}X_{i}, Y) is therefore the positive root of (8).

Note that always 0 ≤ ρ^{∗} ≤ 1, and that ρ^{∗} = 1 if and only if constants a_{1}, . . . , a_{r} and b exist such that Y = Σ_{i} a_{i}X_{i} + b.
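The identity in Lemma 2 is easy to check by simulation; the regression coefficients and noise level below are arbitrary illustrative choices, and in the exact affine case ρ* should equal 1.

```python
import numpy as np

# Sketch of Lemma 2: rho*^2 = gamma' Sigma^{-1} gamma / var(Y), with
# rho* = 1 when Y is an exact affine function of the X's.
rng = np.random.default_rng(1)
X = rng.normal(size=(100000, 2))
Y_exact = 3 * X[:, 0] - 2 * X[:, 1] + 5           # affine in the X's
Y_noisy = Y_exact + rng.normal(scale=2.0, size=len(X))

def rho_star_sq(X, Y):
    Sigma = np.cov(X, rowvar=False)               # covariance matrix of the X's
    gamma = np.array([np.cov(X[:, j], Y)[0, 1] for j in range(X.shape[1])])
    return gamma @ np.linalg.solve(Sigma, gamma) / Y.var(ddof=1)

r_exact = rho_star_sq(X, Y_exact)
r_noisy = rho_star_sq(X, Y_noisy)
print(r_exact, r_noisy)   # 1 (up to rounding); roughly 13/17 for the noisy Y
```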

Let us now state the information inequality for the multiparameter case
in which θ = (θ_{1}, . . . , θ_{s}).

Theorem 3 Suppose that (1) holds and that I(θ) is positive definite. Let δ be any statistic with E_{θ}(δ^{2}) < ∞ for which the derivative of E_{θ}(δ) with respect to θ_{i} exists for each i and can be obtained by differentiating under the integral sign. Then

var_{θ}(δ) ≥ α′I^{−1}(θ)α, (11)

where α is the vector with ith element

α_{i} = ∂E_{θ}(δ(X))/∂θ_{i}.

Proof. If the functions ψ_{i} of Theorem 2 are taken to be ψ_{i} = (∂/∂θ_{i}) log f(X; θ), the theorem follows immediately.

Under regularity conditions on the class of estimators θ̂_{n} under consideration, it may be asserted that if θ̂_{n} is AN(θ, n^{−1}Σ(θ)), then the condition

Σ(θ) − I(θ)^{−1} is nonnegative definite

must hold. (Read Ch. 2.6 and 2.7 of Lehmann (1983) for further details.) In this respect, an estimator θ̂_{n} which is AN(θ, n^{−1}I(θ)^{−1}) is “optimal.” (Such an estimator need not exist.)

The following definition is thus motivated. An estimator θ̂_{n} which is AN(θ, n^{−1}I(θ)^{−1}) is called asymptotically efficient, or best asymptotically normal (BAN). Under suitable regularity conditions, an asymptotically efficient estimate exists. One approach toward finding such estimates is the method of maximum likelihood. Neyman (1949) pointed out that these large-sample criteria were also satisfied by other estimates; he defined a class of best asymptotically normal estimates. So far, we have described three desirable properties of θ̂_{n}: unbiasedness, consistency, and efficiency. We now describe a general procedure that produces an asymptotically unbiased, consistent, and asymptotically efficient estimator.

### 3 Maximum Likelihood Methodology

Many statistical techniques were invented in the nineteenth century by experimental scientists who personally applied their methods to authentic data sets.

In these conditions the limits of what is computationally feasible are spontaneously observed. Until quite recently these limits were set by the capacity of the human calculator, equipped with pencil and paper and with such aids as the slide rule, tables of logarithms, and other convenient tables, which have been in constant use from the seventeenth century until well into the twentieth.

Until the advent of the electronic computer, the powers of the human operator set the standard. This restriction has left its mark on statistical technique, and many new developments have taken place since it was lifted.

The first result of this modern computing revolution is that estimates defined by nonlinear equations can be established as a matter of routine by the appropriate iterative algorithms. This permits the use of nonlinear functional forms. Although the progress of computing technology made nonlinear estimation possible, the statistical theory of Maximum Likelihood provided techniques and respectability. Its principle was first put forward as a novel and original method of deriving estimators by R.A. Fisher in the early 1920s. It very soon proved to be a fertile approach to statistical inference in general, and was widely adopted; but the exact properties of the ensuing estimators and test procedures were only gradually discovered.

Let observations x = (x_{1}, . . . , x_{n}) be realized values of random variables X = (X_{1}, . . . , X_{n}), and suppose that the random vector X has density f_{X}(x; θ) with respect to some σ-finite measure ν. Here θ is the scalar parameter to be determined. The likelihood function corresponding to an observed vector x from the density f_{X}(x; θ) is written

lik_{X}(θ′; x) = f_{X}(x; θ′),

whose logarithm is denoted by L(θ′; x). When the X_{i} are iid with probability density f(x; θ) with respect to a σ-finite measure µ,

f_{X}(x; θ) = ∏_{i=1}^{n} f(x_{i}; θ).

If the parameter space is Ω, then the maximum likelihood estimate (MLE) θ̂ = θ̂(x) is that value of θ′ maximizing lik_{X}(θ′; x), or equivalently its logarithm L(θ′; x), over Ω. That is,

L(θ̂; x) ≥ L(θ′; x) (θ′ ∈ Ω). (12)

L(θ; x) is called the log-likelihood. Note that L is regarded as a function of θ with x fixed.

An MLE may not exist. It certainly exists if Ω is compact and f(x; θ) is upper semicontinuous in θ for all x. As an example of what can go wrong, consider U(θ, θ + 1). Later on, we shall use the shorthand notation L(θ) for L(θ; x) and L′(θ), L″(θ), . . . for its derivatives with respect to θ. (Note that f is said to be upper semicontinuous if {x | f(x) < α} is an open set for every α.)
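When no closed form is convenient, the maximization in (12) can be carried out numerically. The following sketch does this by a crude grid search for the exponential density f(x; θ) = θ e^{−θx}, whose closed-form MLE is 1/x̄; the true θ, sample size, and grid are illustrative choices.

```python
import numpy as np

# Sketch: computing an MLE by direct numerical maximization of the
# log-likelihood, here for f(x; theta) = theta * exp(-theta * x),
# whose closed-form MLE is 1/xbar.
rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / 1.5, size=1000)     # true theta = 1.5

def log_lik(theta):
    # log-likelihood of an iid exponential sample, vectorized over theta
    return len(x) * np.log(theta) - theta * x.sum()

grid = np.linspace(0.01, 5.0, 100000)
theta_hat = grid[np.argmax(log_lik(grid))]
print(theta_hat, 1 / x.mean())    # grid maximizer vs closed-form MLE
```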

Fisher was the first to study and establish optimum properties of estimates obtained by maximizing the likelihood function, using criteria such as consistency and efficiency (involving asymptotic variances) in large samples. At that time, however, the computations involved were hardly practicable, and this prevented a widespread adoption of these methods. Fortunately, computer technology has since become generally accessible, and Maximum Likelihood (ML) methodology is now widely used.

It is a constant theme of the history of the method that the use of ML techniques is not always accompanied by a clear appreciation of their limitations. Le Cam (1953) complains that

· · · although all efforts at a proof of the general existence of [asymptotically] efficient estimates · · · as well as a proof of the efficiency of ML estimates were obviously inaccurate and although accurate proofs of similar statements always referred not to the general case but to particular classes of estimates · · · a general belief became established that the above statements are true in the most general sense.

As an illustration, consider the famous Neyman-Scott (1948) problem. In this example, the MLE is not even consistent.

Example 1. Estimation of a Common Variance. Let X_{αj} (j = 1, . . . , r) be independently distributed according to N(θ_{α}, σ^{2}), α = 1, . . . , n. The MLEs are

θ̂_{α} = X_{α·}, σ̂^{2} = (1/rn) Σ_{α} Σ_{j} (X_{αj} − X_{α·})^{2}.

Furthermore, these are the unique solutions of the likelihood equations.

However, in the present case, the MLE of σ^{2} is not even consistent. To see this, note that the statistics

S_{α}^{2} = Σ_{j} (X_{αj} − X_{α·})^{2}

are identically and independently distributed with expectation

E(S_{α}^{2}) = (r − 1)σ^{2},

so that Σ_{α} S_{α}^{2}/n → (r − 1)σ^{2} and hence

σ̂^{2} → [(r − 1)/r] σ^{2} in probability.
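The inconsistency is easy to see in simulation; the values of r, n, and σ² below are illustrative choices.

```python
import numpy as np

# Simulation sketch of the Neyman-Scott inconsistency: with r fixed and
# the number of groups n growing, the MLE of sigma^2 converges to
# (r-1)/r * sigma^2 rather than sigma^2.
rng = np.random.default_rng(3)
r, n, sigma2 = 2, 50000, 4.0
theta = rng.normal(size=(n, 1))                  # nuisance group means
X = rng.normal(theta, np.sqrt(sigma2), size=(n, r))

group_means = X.mean(axis=1, keepdims=True)      # the MLEs theta_hat_alpha
sigma2_mle = ((X - group_means) ** 2).sum() / (r * n)
print(sigma2_mle)    # near (r-1)/r * sigma2 = 2.0, far from sigma2 = 4.0
```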

Example 2. Suppose X_{1}, X_{2}, . . . , X_{n} is a random sample from a uniform distribution U(0, θ). The likelihood function is

lik(θ; x) = 1/θ^{n}, 0 < x_{1}, . . . , x_{n} < θ.

Clearly it cannot be maximized with respect to θ by differentiation. However, it is not difficult to see that θ̂_{n} = X_{(n)}, with density function nt^{n−1}/θ^{n} for t ∈ (0, θ). Then

E(θ̂_{n}) = nθ/(n + 1),

which is a biased estimator of θ. (But it is asymptotically unbiased.)
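The bias formula can be checked by simulation; the values of θ, n, and the number of replications below are illustrative choices.

```python
import numpy as np

# Sketch of Example 2: the MLE X_(n) for U(0, theta) has mean
# n*theta/(n+1), so it is biased but asymptotically unbiased.
rng = np.random.default_rng(4)
theta, n, reps = 3.0, 10, 200000
X = rng.uniform(0, theta, size=(reps, n))
theta_hat = X.max(axis=1)              # the MLE in each replication
print(theta_hat.mean())                # near n*theta/(n+1) = 30/11
```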

3.1 Efficient Likelihood Estimation

Notwithstanding the example discussed by Neyman and Scott (1948), we will show that, under regularity conditions, the ML estimates are consistent, asymptotically normal, and asymptotically efficient. For simplicity, our treatment will be confined to the case of a one-dimensional parameter.

We begin with the following regularity assumptions:

(A0) The distributions P_{θ} of the observations are distinct (otherwise, θ cannot
be estimated consistently).

(A1) The distributions P_{θ} have common support.

(A2) The observations are X = (X_{1}, . . . , X_{n}), where the X_{i} are iid with probability density f(x_{i}, θ) with respect to µ.

(A3) The parameter space Ω contains an open interval ω of which the true
parameter value θ_{0} is an interior point.

Theorem 4 Under assumptions (A0)-(A2),

P_{θ_{0}}{f(X_{1}, θ_{0}) · · · f(X_{n}, θ_{0}) > f(X_{1}, θ) · · · f(X_{n}, θ)} → 1

as n → ∞ for any fixed θ ≠ θ_{0}.

Proof. The inequality is equivalent to

(1/n) Σ_{i=1}^{n} log [f(X_{i}, θ)/f(X_{i}, θ_{0})] < 0.

By the strong law of large numbers, the left side tends with probability 1 toward

E_{θ_{0}} log[f(X, θ)/f(X, θ_{0})].

Since − log is strictly convex, Jensen’s inequality shows that

E_{θ_{0}} log[f(X, θ)/f(X, θ_{0})] < log E_{θ_{0}}[f(X, θ)/f(X, θ_{0})] = 0, (13)

and the result follows. When θ_{0} is the true value, the above proof gives a meaning to the numerical value of the Kullback-Leibler information number: the likelihood ratio converges to zero exponentially fast, at a rate governed by I(θ_{0}, θ).

Remark 1. Define the Kullback-Leibler information number

I(θ, η) = E_{θ} log[ f(X, θ)/f(X, η) ].

Note that I(θ, η) ≥ 0, with equality holding if and only if f(x, θ) = f(x, η). I(θ, η) is a measure of the ability of the likelihood ratio to distinguish between f(X, θ) and f(X, η) when the former is true.
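For two unit-variance normals the Kullback-Leibler number has the closed form I(θ, η) = (θ − η)²/2, which the following numerical sketch verifies; the grid and parameter values are illustrative choices.

```python
import numpy as np

# Numerical sketch of the Kullback-Leibler number for N(theta, 1) vs
# N(eta, 1), where I(theta, eta) = (theta - eta)^2 / 2 in closed form.
theta, eta = 0.0, 1.5
x = np.linspace(-12.0, 12.0, 400001)
dx = x[1] - x[0]
f_theta = np.exp(-(x - theta) ** 2 / 2) / np.sqrt(2 * np.pi)
f_eta = np.exp(-(x - eta) ** 2 / 2) / np.sqrt(2 * np.pi)

# E_theta log(f_theta / f_eta), computed by a Riemann sum
kl = np.sum(f_theta * np.log(f_theta / f_eta)) * dx
print(kl, (theta - eta) ** 2 / 2)   # both approximately 1.125
```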

Remark 2. If θ̂_{n} is an MLE of θ and g is a function, then g(θ̂_{n}) is an MLE of g(θ). When g is one-to-one, this holds obviously. If g is many-to-one, the result still holds when the derivative of g is nonzero.

By Theorem 4, the density of X at the true θ_{0} exceeds that at any other fixed θ with high probability when n is large. We do not know θ_{0}, but we can determine the value θ̂ of θ which maximizes the density of X. However, Theorem 4 cannot guarantee that the MLE is consistent, since the law of large numbers in the proof applies to each fixed θ ≠ θ_{0} separately, not to all θ ≠ θ_{0} simultaneously. However, if Ω is finite, the MLE θ̂_{n} exists, it is unique with probability tending to 1, and it is consistent.

The following theorem is motivated by the simple fact obtained by differentiating ∫ f(x, θ)µ(dx) = 1 with respect to θ, namely

E_{θ_{0}}[ f′(X, θ_{0})/f(X, θ_{0}) ] = 0.

Theorem 5 Let X_{1}, . . . , X_{n} satisfy assumptions (A0)-(A3) and suppose that for almost all x, f(x, θ) is differentiable with respect to θ in ω, with derivative f′(x, θ). Then with probability tending to 1 as n → ∞, the likelihood equation

(∂/∂θ)[f(x_{1}, θ) · · · f(x_{n}, θ)] = 0 (14)

or, equivalently, the equation

L′(θ; x) = Σ_{i=1}^{n} f′(x_{i}, θ)/f(x_{i}, θ) = 0 (15)

has a root θ̂_{n} = θ̂_{n}(x_{1}, . . . , x_{n}) such that θ̂_{n}(X_{1}, . . . , X_{n}) tends to the true value θ_{0} in probability.

Proof. Let a be small enough that (θ_{0} − a, θ_{0} + a) ⊂ ω, and let

S_{n} = {x : L(θ_{0}; x) > L(θ_{0} − a; x) and L(θ_{0}; x) > L(θ_{0} + a; x)}. (16)

By Theorem 4, P_{θ_{0}}(S_{n}) → 1. For any x ∈ S_{n} there thus exists a value θ_{0} − a < θ̂_{n} < θ_{0} + a at which L(θ) has a local maximum, so that L′(θ̂_{n}) = 0. Hence for any a > 0 sufficiently small, there exists a sequence θ̂_{n} = θ̂_{n}(a) of roots such that

P_{θ_{0}}(|θ̂_{n} − θ_{0}| < a) → 1.

It remains to show that we can determine such a sequence which does not depend on a.

Let θ̂_{n}^{∗} be the root closest to θ_{0}. (This exists because the limit of a sequence of roots is again a root, by continuity.) Then clearly P_{θ_{0}}(|θ̂_{n}^{∗} − θ_{0}| < a) → 1, and this completes the proof.

Remarks. 1. This theorem does not establish the existence of a consistent estimator sequence since, with the true value θ_{0} unknown, the data do not tell us which root to choose so as to obtain a consistent sequence. An exception, of course, is the case in which the root is unique.

2. It should also be emphasized that existence of a root θ̂_{n} is not asserted for all x (or, for a given n, even for any x). This does not affect consistency, which only requires θ̂_{n} to be defined on a set S_{n}′, the probability of which tends to 1 as n → ∞.

The above theorem establishes the existence of a consistent root of the likelihood equation. The next theorem asserts that any such sequence is asymptotically normal and efficient.

Theorem 6 Suppose that X_{1}, . . . , X_{n} are iid and satisfy the assumptions (A0)-(A3), that the integral ∫ f(x, θ) dµ(x) can be twice differentiated under the integral sign, and that there exists a third derivative satisfying

|∂^{3} log f(x, θ)/∂θ^{3}| ≤ M(x) (17)

for all x ∈ A, θ_{0} − c < θ < θ_{0} + c, with E_{θ_{0}}[M(X)] < ∞. Then any consistent sequence θ̂_{n} = θ̂_{n}(X_{1}, . . . , X_{n}) of roots of the likelihood equation satisfies

√n(θ̂_{n} − θ_{0}) →^{d} N(0, 1/I(θ_{0})).

Proof. For any fixed x, expand L′(θ̂_{n}) about θ_{0}:

L′(θ̂_{n}) = L′(θ_{0}) + (θ̂_{n} − θ_{0})L″(θ_{0}) + (1/2)(θ̂_{n} − θ_{0})^{2}L^{(3)}(θ_{n}^{∗}),

where θ_{n}^{∗} lies between θ_{0} and θ̂_{n}. By assumption, the left side is zero, so that

√n(θ̂_{n} − θ_{0}) = (1/√n)L′(θ_{0}) / [ −(1/n)L″(θ_{0}) − (1/2n)(θ̂_{n} − θ_{0})L^{(3)}(θ_{n}^{∗}) ],

where it should be remembered that L(θ), L′(θ), and so on are functions of (X_{1}, . . . , X_{n}) as well as of θ. The desired result follows if we can show that

(1/√n)L′(θ_{0}) →^{d} N[0, I(θ_{0})], (18)

that

−(1/n)L″(θ_{0}) →^{P} I(θ_{0}), (19)

and that

(1/n)L^{(3)}(θ_{n}^{∗}) is bounded in probability. (20)

Of the above statements, (18) follows from the fact that

(1/√n)L′(θ_{0}) = √n · (1/n) Σ_{i=1}^{n} [ f′(X_{i}, θ_{0})/f(X_{i}, θ_{0}) − E_{θ_{0}}( f′(X_{i}, θ_{0})/f(X_{i}, θ_{0}) ) ],

since the expectation term is zero, and then from the CLT and the definition of I(θ).

Next,

−(1/n)L″(θ_{0}) = (1/n) Σ_{i=1}^{n} [ f′^{2}(X_{i}, θ_{0}) − f(X_{i}, θ_{0})f″(X_{i}, θ_{0}) ] / f^{2}(X_{i}, θ_{0}).

By the law of large numbers, this tends in probability to

I(θ_{0}) − E_{θ_{0}}[ f″(X_{i}, θ_{0})/f(X_{i}, θ_{0}) ] = I(θ_{0}).

Finally,

(1/n)L^{(3)}(θ) = (1/n) Σ_{i=1}^{n} ∂^{3} log f(X_{i}, θ)/∂θ^{3},

so that by (17)

|(1/n)L^{(3)}(θ_{n}^{∗})| < (1/n)[M(X_{1}) + · · · + M(X_{n})]

with probability tending to 1. The right side tends in probability to E_{θ_{0}}[M(X)], and this completes the proof.

Remarks. 1. This is a strong result. It establishes several major properties of the MLE in addition to its consistency. The MLE is asymptotically normal, which is of great help for the derivation of (asymptotically valid) tests; it is asymptotically unbiased; and it is asymptotically efficient, since the variance of its limiting distribution equals the Cramer-Rao lower bound.

2. As a rule we wish to supplement the parameter estimates by an estimate of their (asymptotic) variance. This will permit us to assess (asymptotic) t-ratios and (asymptotic) confidence intervals. Although the variance may depend on the unknown parameter, we can simply plug the MLE into it to obtain an estimate of the variance.
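The plug-in idea in Remark 2 can be sketched for the exponential model, where I(θ) = 1/θ², the MLE is 1/x̄, and the estimated asymptotic standard deviation is θ̂/√n; the parameter values and simulation sizes below are illustrative choices. If the plug-in variance is adequate, the standardized errors should be roughly standard normal.

```python
import numpy as np

# Sketch: estimating the asymptotic variance by plugging the MLE into
# 1/(n I(theta)). For Exponential(theta), I(theta) = 1/theta^2 and the
# MLE is 1/xbar, so the estimated sd is theta_hat / sqrt(n).
rng = np.random.default_rng(7)
theta0, n, reps = 2.0, 400, 5000
x = rng.exponential(1 / theta0, size=(reps, n))
theta_hat = 1 / x.mean(axis=1)

se_est = theta_hat / np.sqrt(n)        # sqrt(1 / (n I(theta_hat)))
z = (theta_hat - theta0) / se_est      # standardized estimation errors
print(z.mean(), z.std())               # roughly 0 and 1
```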

The usual iterative methods for solving the likelihood equation L′(θ) = 0 are based on replacing L′(θ) by the linear terms of its Taylor expansion about an approximate solution θ̃. Suppose we can use an estimation method, such as the method of moments, to find a good estimate of θ; denote it by θ̃. Then it is quite natural to use θ̃ as the initial solution of the iterative method. Denote the MLE by θ̂. This leads to the approximation

0 = L′(θ̂) ≈ L′(θ̃) + (θ̂ − θ̃)L″(θ̃),

and hence to

θ̂ ≈ θ̃ − L′(θ̃)/L″(θ̃).

The procedure is then iterated according to the above scheme.
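The scheme above can be sketched for the Cauchy location model, a standard case in which the likelihood equation is nonlinear; the sample median serves as the √n-consistent starting value. The true location and sample size are illustrative choices.

```python
import numpy as np

# Sketch of the iteration theta <- theta - L'(theta)/L''(theta) for the
# Cauchy location model f(x; theta) = 1/(pi (1 + (x - theta)^2)),
# started from the sample median.
rng = np.random.default_rng(5)
x = rng.standard_cauchy(2000) + 3.0        # true location 3.0

def L1(t):                                 # L'(t) for the Cauchy model
    d = x - t
    return np.sum(2 * d / (1 + d ** 2))

def L2(t):                                 # L''(t) for the Cauchy model
    d = x - t
    return np.sum(2 * (d ** 2 - 1) / (1 + d ** 2) ** 2)

theta = np.median(x)                       # initial approximate solution
for _ in range(5):                         # iterate the one-step update
    theta = theta - L1(theta) / L2(theta)

print(theta, L1(theta))   # theta near 3.0; the score is essentially zero
```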

The following is a justification for the use of the above one-step approximation as an estimator of θ.

Theorem 7 Suppose that the assumptions of Theorem 6 hold and that θ̃ is not only a consistent but a √n-consistent estimator of θ; that is, √n(θ̃ − θ_{0}) is bounded in probability, so that θ̃ tends to θ_{0} at least at the rate of 1/√n. Then the estimator sequence

δ_{n} = θ̃ − L′(θ̃)/L″(θ̃) (21)

is asymptotically efficient.

Proof. As in the proof of Theorem 6, expand L′(θ̃) about θ_{0} as

L′(θ̃_{n}) = L′(θ_{0}) + (θ̃_{n} − θ_{0})L″(θ_{0}) + (1/2)(θ̃_{n} − θ_{0})^{2}L^{(3)}(θ_{n}^{∗}),

where θ_{n}^{∗} lies between θ_{0} and θ̃_{n}. Substituting this expression into (21) and simplifying, we find

√n(δ_{n} − θ_{0}) = (1/√n)L′(θ_{0}) / [ −(1/n)L″(θ̃_{n}) ] + √n(θ̃_{n} − θ_{0}) × [ 1 − L″(θ_{0})/L″(θ̃_{n}) − (1/2)(θ̃_{n} − θ_{0})L^{(3)}(θ_{n}^{∗})/L″(θ̃_{n}) ]. (22)

Suppose we can show that the expression in square brackets on the right-hand side of (22) tends to zero in probability and that L″(θ̃_{n})/L″(θ_{0}) → 1 in probability. The theorem then follows. These facts follow from θ_{n}^{∗} → θ_{0} in probability and the expansion

(1/n)L″(θ̃_{n}) = (1/n)L″(θ_{0}) + (1/n)(θ̃_{n} − θ_{0})L^{(3)}(θ_{n}^{∗∗}),

where θ_{n}^{∗∗} is between θ_{0} and θ̃_{n}.

3.2 The Multi-parameter Case

So far we have discussed only the case in which the distribution depends on a single parameter θ. When extending this theory to probability models involving several parameters θ_{1}, . . . , θ_{s}, one may be interested either in the simultaneous estimation of these parameters (or certain functions of them) or in the estimation of only part of them. Part of the parameters may be of intrinsic interest, while the rest represent nuisance or incidental parameters that are necessary for a proper statistical model but of no interest in themselves. For instance, if we are only interested in estimating σ^{2} in the Neyman-Scott problem, then the θ_{α} are called nuisance parameters.

Let (X_{1}, . . . , X_{n}) be iid with a distribution that depends on θ = (θ_{1}, . . . , θ_{s}) and satisfies assumptions (A0)-(A3). The information matrix I(θ) is an s × s matrix with elements I_{jk}(θ), j, k = 1, . . . , s, defined by

I_{jk}(θ) = cov[ ∂ log f(X, θ)/∂θ_{j}, ∂ log f(X, θ)/∂θ_{k} ].

We shall now show under regularity conditions that with probability tending to 1 there exist solutions θ̂_{n} = (θ̂_{1n}, . . . , θ̂_{sn}) of the likelihood equations

(∂/∂θ_{j})[f(x_{1}, θ) · · · f(x_{n}, θ)] = 0, j = 1, . . . , s,

or equivalently

∂L(θ)/∂θ_{j} = 0, j = 1, . . . , s,

such that θ̂_{jn} is consistent for estimating θ_{j} and asymptotically efficient in the sense of having asymptotic variance [I(θ)]^{−1}_{jj}.
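As a concrete sketch of the matrix I(θ), consider the two-parameter model N(µ, σ²) with θ = (µ, σ²), whose information matrix is diag(1/σ², 1/(2σ⁴)); the Monte Carlo estimate below uses illustrative parameter values and sample size.

```python
import numpy as np

# Monte Carlo sketch of the s x s information matrix for N(mu, s2) with
# theta = (mu, s2); the exact matrix is diag(1/s2, 1/(2 s2^2)).
rng = np.random.default_rng(6)
mu, s2 = 1.0, 2.0
x = rng.normal(mu, np.sqrt(s2), size=500000)

# the two components of the score vector (d log f/d mu, d log f/d s2)
score = np.stack([(x - mu) / s2,
                  (x - mu) ** 2 / (2 * s2 ** 2) - 1 / (2 * s2)])
I_hat = score @ score.T / len(x)      # sample version of E[score score']

print(I_hat)                          # near diag(0.5, 0.125)
print(np.array([[1 / s2, 0.0], [0.0, 1 / (2 * s2 ** 2)]]))
```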

We first state some assumptions:

(A) There exists an open subset ω of Ω containing the true parameter point θ^{0} such that for almost all x the density f(x, θ) admits all third derivatives (∂^{3}/∂θ_{j}∂θ_{k}∂θ_{ℓ})f(x, θ) for all θ ∈ ω.

(B) The first and second logarithmic derivatives of f satisfy the equations

E_{θ}[ ∂ log f(X, θ)/∂θ_{j} ] = 0 for j = 1, . . . , s,

and

I_{jk}(θ) = E_{θ}[ (∂ log f(X, θ)/∂θ_{j}) (∂ log f(X, θ)/∂θ_{k}) ] = E_{θ}[ −∂^{2} log f(X, θ)/∂θ_{j}∂θ_{k} ].

(C) Since the s × s matrix I(θ) is a covariance matrix, it is positive semidefinite. We shall assume that the I_{jk}(θ) are finite and that the matrix I(θ) is positive definite for all θ in ω, and hence that the s statistics

∂ log f(X, θ)/∂θ_{1}, . . . , ∂ log f(X, θ)/∂θ_{s}

are affinely independent with probability 1.

(D) Finally, we shall suppose that there exist functions M_{jkℓ} such that

|∂^{3} log f(x, θ)/∂θ_{j}∂θ_{k}∂θ_{ℓ}| ≤ M_{jkℓ}(x) for all θ ∈ ω,

where m_{jkℓ} = E_{θ^{0}}[M_{jkℓ}(X)] < ∞ for all j, k, ℓ.

Theorem 8 Let X_{1}, . . . , X_{n} be iid, each with a density f(x, θ) (with respect to µ) which satisfies (A0)-(A2) and assumptions (A)-(D) above. Then with probability tending to 1 as n → ∞, there exist solutions θ̂_{n} = θ̂_{n}(X_{1}, . . . , X_{n}) of the likelihood equations such that

(i) θ̂_{jn} is consistent for estimating θ_{j},

(ii) √n(θ̂_{n} − θ) is asymptotically normal with (vector) mean zero and covariance matrix [I(θ)]^{−1}, and

(iii) θ̂_{jn} is asymptotically efficient in the sense that

√n(θ̂_{jn} − θ_{j}) →^{d} N(0, [I(θ)]^{−1}_{jj}).

Proof. (i) Existence and Consistency. To prove the existence, with probability tending to 1, of a consistent sequence of solutions of the likelihood equations, we shall consider the behavior of the log-likelihood L(θ) on the sphere Q_{a} with center at the true point θ^{0} and radius a. We will show that for any sufficiently small a the probability tends to 1 that L(θ) < L(θ^{0}) at all points θ on the surface of Q_{a}, and hence that L(θ) has a local maximum in the interior of Q_{a}. Since at a local maximum the likelihood equations must be satisfied, it will follow that for any a > 0, with probability tending to 1 as n → ∞, the likelihood equations have a solution θ̂_{n}(a) within Q_{a}, and the proof can be completed as in the one-dimensional case.

To obtain the needed facts concerning the behavior of the likelihood on Q_{a} for small a, we expand the log-likelihood about the true point θ^{0} and divide by n to find

(1/n)L(θ) − (1/n)L(θ^{0}) = (1/n) Σ_{j} A_{j}(x)(θ_{j} − θ^{0}_{j}) + (1/2n) Σ_{j} Σ_{k} B_{jk}(x)(θ_{j} − θ^{0}_{j})(θ_{k} − θ^{0}_{k}) + (1/6n) Σ_{j} Σ_{k} Σ_{ℓ} (θ_{j} − θ^{0}_{j})(θ_{k} − θ^{0}_{k})(θ_{ℓ} − θ^{0}_{ℓ}) Σ_{i=1}^{n} γ_{jkℓ}(x_{i})M_{jkℓ}(x_{i})

= S_{1} + S_{2} + S_{3},

where

A_{j}(x) = ∂L(θ)/∂θ_{j} |_{θ=θ^{0}}, B_{jk}(x) = ∂^{2}L(θ)/∂θ_{j}∂θ_{k} |_{θ=θ^{0}},

and where by assumption (D)

0 ≤ |γ_{jkℓ}(x)| ≤ 1.

To prove that the maximum of this difference for θ on Q_{a} is negative with probability tending to 1 if a is sufficiently small, we will show that with high probability the maximum of S_{2} is negative while S_{1} and S_{3} are small compared to S_{2}. The basic tools for showing this are the facts that, by (B) and the law of large numbers,

(1/n)A_{j}(x) = (1/n) ∂L(θ)/∂θ_{j} |_{θ=θ^{0}} → 0 in probability (23)

and

(1/n)B_{jk}(x) = (1/n) ∂^{2}L(θ)/∂θ_{j}∂θ_{k} |_{θ=θ^{0}} → −I_{jk}(θ^{0}) in probability. (24)

Let us begin with S_{1}. On Q_{a} we have

|S_{1}| ≤ (1/n) a Σ_{j} |A_{j}(X)|.

For any given a, it follows from (23) that |A_{j}(X)|/n < a^{2} and hence that |S_{1}| < sa^{3} with probability tending to 1. Next consider

2S_{2} = Σ_{j} Σ_{k} [ −I_{jk}(θ^{0}) ](θ_{j} − θ^{0}_{j})(θ_{k} − θ^{0}_{k}) + Σ_{j} Σ_{k} [ (1/n)B_{jk}(X) − (−I_{jk}(θ^{0})) ](θ_{j} − θ^{0}_{j})(θ_{k} − θ^{0}_{k}).

For the second term it follows from an argument analogous to that for S_{1} that its absolute value is less than s^{2}a^{3} with probability tending to 1. The first term is a negative (nonrandom) quadratic form in the variables (θ_{j} − θ^{0}_{j}). By an orthogonal transformation this can be reduced to diagonal form Σ λ_{i}ξ_{i}^{2}, with Q_{a} becoming Σ ξ_{i}^{2} = a^{2}. Since I(θ^{0}) is positive definite, the λ’s are all negative; number them so that λ_{s} ≤ λ_{s−1} ≤ · · · ≤ λ_{1} < 0. Then Σ λ_{i}ξ_{i}^{2} ≤ λ_{1} Σ ξ_{i}^{2} = λ_{1}a^{2}. Combining the first and second terms, we see that there exist c > 0 and a_{0} > 0 such that for a < a_{0},

S_{2} < −ca^{2}

with probability tending to 1.

Finally, with probability tending to 1,

(1/n) Σ_{i} M_{jkℓ}(X_{i}) < 2m_{jkℓ},

and hence |S_{3}| < ba^{3} on Q_{a}, where

b = (s^{3}/3) Σ_{j} Σ_{k} Σ_{ℓ} m_{jkℓ}.

Combining the three inequalities, we see that

max(S_{1} + S_{2} + S_{3}) < −ca^{2} + (b + s)a^{3},

which is less than zero if a < c/(b + s), and this completes the proof of (i).

The proof of part (ii) of Theorem 8 is basically the same as that of Theorem 6. However, the single equation derived there from the expansion of θ̂_{n} − θ_{0} is now replaced by a system of s equations which must be solved for the differences (θ̂_{jn} − θ^{0}_{j}). In preparation, it will be convenient to consider quite generally a set of random linear equations in s unknowns,

Σ_{k=1}^{s} A_{jkn}Y_{kn} = T_{jn} (j = 1, . . . , s). (25)

Lemma 3 Let (T_{1n}, . . . , T_{sn}) be a sequence of random vectors tending weakly to (T_{1}, . . . , T_{s}), and suppose that for each fixed j and k, A_{jkn} is a sequence of random variables tending in probability to constants a_{jk} for which the matrix A = ‖a_{jk}‖ is nonsingular. Let B = ‖b_{jk}‖ = A^{−1}. Then if the distribution of (T_{1}, . . . , T_{s}) has a density with respect to Lebesgue measure over E_{s}, the solutions of (25) tend weakly to the solutions (Y_{1}, . . . , Y_{s}) of Σ_{k=1}^{s} a_{jk}Y_{k} = T_{j}, 1 ≤ j ≤ s, given by Y_{j} = Σ_{k=1}^{s} b_{jk}T_{k}.

In generalization of the proof of Theorem 6, expand ∂L(θ)/∂θ_{j} = L′_{j}(θ)
about θ^{0} to obtain

L′_{j}(θ) = L′_{j}(θ^{0}) + \sum_{k}(θ_{k} − θ_{k}^{0})L′′_{jk}(θ^{0}) + \frac{1}{2}\sum_{k}\sum_{\ell}(θ_{k} − θ_{k}^{0})(θ_{\ell} − θ^{0}_{\ell})L^{(3)}_{jk\ell}(θ^{∗}), (26)

where L′′_{jk} and L^{(3)}_{jk\ell} denote the indicated second and third derivatives of L and
where θ^{∗} is a point on the line segment connecting θ and θ^{0}. In this expansion,
replace θ by a solution ˆθ_{n} of the likelihood equations, which by part (i) of
this theorem can be assumed to exist with probability tending to 1 and to be
consistent. The left side of (26) is then zero, and the resulting equations can be
written as

\sqrt{n}\sum_{k}(ˆθ_{k} − θ^{0}_{k})\left[\frac{1}{n}L′′_{jk}(θ^{0}) + \frac{1}{2n}\sum_{\ell}(ˆθ_{\ell} − θ^{0}_{\ell})L^{(3)}_{jk\ell}(θ^{∗})\right] = −\frac{1}{\sqrt{n}}L′_{j}(θ^{0}). (27)
These have the form (25) with

Y_{kn} = \sqrt{n}(ˆθ_{k} − θ^{0}_{k}), (28)

A_{jkn} = \frac{1}{n}L′′_{jk}(θ^{0}) + \frac{1}{2n}\sum_{\ell}(ˆθ_{\ell} − θ^{0}_{\ell})L^{(3)}_{jk\ell}(θ^{∗}), (29)

T_{jn} = −\frac{1}{\sqrt{n}}L′_{j}(θ^{0}) = −\frac{1}{\sqrt{n}}\left[\sum_{i=1}^{n}\frac{∂}{∂θ_{j}}\log f(X_{i}, θ)\right]_{θ=θ^{0}}. (30)

Since E_{θ^{0}}[(∂/∂θ_{j}) log f(X_{i}, θ)]_{θ=θ^{0}} = 0, the multivariate central limit theorem shows
that (T_{1n}, . . . , T_{sn}) has a limiting multivariate normal distribution with mean zero
and covariance matrix I(θ^{0}).

On the other hand, it is easy to see, again in parallel to the proof of Theorem 6, that

A_{jkn} \xrightarrow{P} a_{jk} = E\left[\frac{∂^{2}}{∂θ_{j}∂θ_{k}} \log f(X_{1}, θ)\right]_{θ=θ^{0}} = −I_{jk}(θ^{0}).

The limit distribution of the Y's is therefore that of the solution (Y_{1}, . . . , Y_{s}) of the equations

\sum_{k=1}^{s} I_{jk}(θ^{0})Y_{k} = T_{j}

(replacing T_{j} by −T_{j}, which has the same distribution), where T = (T_{1}, . . . , T_{s})
is multivariate normal with mean zero and covariance matrix I(θ^{0}). It follows that
the distribution of Y is that of [I(θ^{0})]^{−1}T, which is multivariate normal with
mean zero and covariance matrix [I(θ^{0})]^{−1}.
This completes the proof of asymptotic normality and efficiency.
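The conclusion can be checked by simulation. The sketch below assumes the normal model f(x, θ) with θ = (µ, σ²), for which I(θ)⁻¹ = diag(σ², 2σ⁴) and the MLEs are the sample mean and the (divisor-n) sample variance; it compares the Monte Carlo covariance of √n(ˆθ_{n} − θ⁰) with [I(θ⁰)]⁻¹:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 1.0, 4.0            # true theta^0 = (mu, sigma^2)
n, reps = 2000, 5000

xs = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
mu_hat = xs.mean(axis=1)         # MLE of mu
s2_hat = xs.var(axis=1)          # MLE of sigma^2 (divisor n)

# Monte Carlo covariance of sqrt(n)(theta_hat - theta^0) ...
Z = np.sqrt(n) * np.column_stack([mu_hat - mu, s2_hat - sigma2])
emp_cov = np.cov(Z.T)

# ... should approximate [I(theta^0)]^{-1} = diag(sigma^2, 2*sigma^4).
inv_info = np.diag([sigma2, 2 * sigma2**2])
print(emp_cov.round(2))
print(inv_info)
```

The empirical covariance matrix is close to diag(4, 32), with off-diagonal entries near zero, as the theorem predicts.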

3.3 Efficiency and Adaptiveness

If the distribution of the X_{i} depends on θ = (θ_{1}, . . . , θ_{s}), it is interesting to
compare the estimation of θ_{j} when the other parameters are unknown with the
situation in which they are known. Such a question arises naturally when some
of the parameters are nuisance parameters. For instance, consider estimating
the location parameter ˜µ of a location family f(x − ˜µ), or the median ˜µ of a
symmetric density f. Then ˜µ is the parameter of interest and f is the nuisance parameter.

If f is known and continuously differentiable, the best asymptotic mean-squared
error attainable for estimating ˜µ is (nI)^{−1}, where

I = \int \frac{[f′(x)]^{2}}{f(x)}\,dx < ∞.

The question to be asked is: when can we estimate ˜µ asymptotically as well
without knowing f as when f is known? A necessary condition, called the
orthogonality condition, is given in Stein (1956). If there exists an estimator
achieving the bound (nI)^{−1} when f is unknown, it is called an adaptive estimate
of ˜µ. According to Stein's condition, such an estimator does exist for this
problem. Completely definite results for this problem were obtained by Beran
(1974) and Stone (1975).

Note that this problem is a so-called semiparametric estimation problem, in which ˜µ is the parametric component and f is the nonparametric component. Recently, the problem of estimating and testing hypotheses about the parametric component in the presence of an infinite-dimensional nuisance parameter (the nonparametric component) has attracted a lot of attention. The main concerns are whether there exists either an adaptive or an efficient estimate of the parametric component, and whether there is a practical procedure for finding it.

We now consider the finite-dimensional case and derive the orthogonality
condition given in Stein (1956). It was seen that under regularity conditions
there exist estimator sequences ˆθ_{jn} of θ_{j}, when the other parameters are known,
which are asymptotically efficient in the sense that

\sqrt{n}(ˆθ_{jn} − θ_{j}) \xrightarrow{d} N\left(0, \frac{1}{I_{jj}(θ)}\right).

When the other parameters are unknown,

\sqrt{n}(ˆθ_{jn} − θ_{j}) \xrightarrow{d} N(0, [I(θ)]^{−1}_{jj}).

These imply that

\frac{1}{I_{jj}(θ)} ≤ [I(θ)]^{−1}_{jj}. (31)

Stein (1956) raised the question of whether we can estimate θ_{j} equally well
whether or not the other parameters are known. This leads to the question
of efficiency and adaptiveness.

The two sides of (31) are equal if

I_{ij}(θ) = 0 for all j ≠ i, (32)

as is seen from the definition of the inverse of a matrix; in fact, (32) is also
necessary for equality in (31), by the following fact.

Fact. Let

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}

be a partitioned matrix with A_{22} square and nonsingular, and let

B = \begin{pmatrix} I & −A_{12}A^{−1}_{22} \\ 0 & I \end{pmatrix}.

Note that

BA = \begin{pmatrix} A_{11} − A_{12}A^{−1}_{22}A_{21} & 0 \\ A_{21} & A_{22} \end{pmatrix}.

It follows easily that |A| = |A_{11} − A_{12}A^{−1}_{22}A_{21}| · |A_{22}|. Since A_{22} is nonsingular,
(A^{−1})_{11} = (A_{11})^{−1} if A_{12} is a zero matrix.
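A small numerical check of inequality (31) and of the Fact, using an arbitrary positive definite matrix as a stand-in for I(θ) with s = 3:

```python
import numpy as np

# An arbitrary positive definite stand-in for I(theta), s = 3.
I_theta = np.array([[2.0, 0.7, 0.3],
                    [0.7, 1.5, 0.4],
                    [0.3, 0.4, 1.2]])
inv_I = np.linalg.inv(I_theta)

# Inequality (31): 1/I_jj <= [I^{-1}]_jj for every j (strict here, since
# the off-diagonal entries are nonzero).
for j in range(3):
    print(1 / I_theta[j, j], "<=", inv_I[j, j])

# The Fact: if A_12 = 0 (so A is block triangular), (A^{-1})_11 = (A_11)^{-1}.
A = I_theta.copy()
A[0, 1:] = 0.0
print(np.isclose(np.linalg.inv(A)[0, 0], 1 / A[0, 0]))
```

Zeroing the off-diagonal entries, as in (32), makes the two sides of (31) coincide, which is the content of the adaptiveness condition.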

Equality in (31) for all j implies that I(θ) is diagonal. If the efficient estimator
of θ_{j} when the remaining parameters are known does not actually depend on
them, then θ_{j} can be estimated without loss of efficiency when these parameters
are unknown; this situation can be viewed as a rather trivial example of the idea
of adaptive estimation. On the other hand, it is known that 1/(nI_{jj}(θ)) is the
smallest asymptotic mean-squared error attainable for estimating θ_{j} when the
other parameters are known. If an estimator does achieve such a bound, it is
called an efficient estimator. Stein (1956) states that adaptation is not possible
unless I_{jk}(θ) = 0 for all k ≠ j.

We now study the bound [I(θ)]^{−1}_{11}. Write I(θ) as a partitioned matrix

\begin{pmatrix} I_{11}(θ) & I_{1·}(θ) \\ I_{1·}^{T}(θ) & I_{··}(θ) \end{pmatrix},

where I_{1·}(θ) = (I_{12}(θ), . . . , I_{1s}(θ)) and I_{··}(θ) is the lower-right submatrix of I(θ)
of size (s − 1) × (s − 1). Then

[I(θ)]^{−1}_{11} = \frac{1}{I_{11}(θ) − I_{1·}(θ)[I_{··}(θ)]^{−1}I_{1·}^{T}(θ)}.
Recall that

I_{ij}(θ) = E\left(\frac{∂}{∂θ_{i}} \log f(X, θ) \cdot \frac{∂}{∂θ_{j}} \log f(X, θ)\right).

Consider the minimization problem

\min_{a_{j}} E\left(\frac{∂}{∂θ_{1}} \log f(X, θ) − \sum_{j=2}^{s} a_{j}\frac{∂}{∂θ_{j}} \log f(X, θ)\right)^{2},

and denote the minimizer by a^{0} = (a_{20}, . . . , a_{s0})^{T}. By simple algebra, a^{0} is
the solution of the normal equations I_{··}(θ)a^{0} = I_{1·}(θ), that is, a^{0} = I_{··}^{−1}(θ)I_{1·}(θ). This leads
to

E\left(\frac{∂}{∂θ_{1}} \log f(X, θ) − \sum_{j=2}^{s} a_{j0}\frac{∂}{∂θ_{j}} \log f(X, θ)\right)^{2} = I_{11}(θ) − I_{1·}(θ)[I_{··}(θ)]^{−1}I_{1·}^{T}(θ).

Hence

[I(θ)]^{−1}_{11} = \left\{\min_{a_{j}} E\left(\frac{∂}{∂θ_{1}} \log f(X, θ) − \sum_{j=2}^{s} a_{j}\frac{∂}{∂θ_{j}} \log f(X, θ)\right)^{2}\right\}^{−1}.

If a_{j0} = 0 for all j, then [I(θ)]^{−1}_{11} = 1/I_{11}(θ); that is, adaptation is possible.
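These identities are easy to verify numerically. The sketch below, again with an arbitrary positive definite stand-in for I(θ), solves the normal equations for a⁰ and checks that the reciprocal of the minimized quadratic form equals [I(θ)]^{−1}_{11}:

```python
import numpy as np

# An arbitrary positive definite stand-in for I(theta), s = 3.
I_theta = np.array([[2.0, 0.7, 0.3],
                    [0.7, 1.5, 0.4],
                    [0.3, 0.4, 1.2]])
I11 = I_theta[0, 0]
I1dot = I_theta[0, 1:]             # (I_12, ..., I_1s)
Idd = I_theta[1:, 1:]              # lower-right (s-1) x (s-1) block

# E(score_1 - a'score_rest)^2 = I11 - 2a'I1dot + a'Idd a, minimized at a^0,
# the solution of the normal equations Idd a^0 = I1dot.
a0 = np.linalg.solve(Idd, I1dot)
q_min = I11 - I1dot @ a0           # = I11 - I1dot [Idd]^{-1} I1dot'

# The reciprocal of the minimum is exactly [I(theta)^{-1}]_{11}.
print(1 / q_min, np.linalg.inv(I_theta)[0, 0])
```

The two printed numbers agree to machine precision, which is just the Schur complement identity stated above.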

As illustrations, we consider three examples. The first example concerns the estimation of the regression coefficients of a linear regression; the next two concern the estimation of the parametric component in a semiparametric model. The two particular models we consider are the partial spline model (Wahba, 1984) and the two-sample proportional hazards model (Cox, 1972).

Example 1. (Linear Regression) Assume that y = β_{0} + β_{1}x + ε with
ε ∼ N(0, σ^{2}). Let (ˆβ_{0}, ˆβ_{1}) denote the least squares estimate. It follows easily
that

Var\begin{pmatrix} ˆβ_{0} \\ ˆβ_{1} \end{pmatrix} = σ^{2}\begin{pmatrix} n & \sum_{i} X_{i} \\ \sum_{i} X_{i} & \sum_{i} X_{i}^{2} \end{pmatrix}^{−1} = σ^{2}\begin{pmatrix} \frac{\sum_{i} X_{i}^{2}}{n\sum_{i}(X_{i} − \bar X)^{2}} & \frac{−\bar X}{\sum_{i}(X_{i} − \bar X)^{2}} \\ \frac{−\bar X}{\sum_{i}(X_{i} − \bar X)^{2}} & \frac{1}{\sum_{i}(X_{i} − \bar X)^{2}} \end{pmatrix}.

When β_{0} is known, the variance of the least squares estimate of β_{1} is σ^{2}/\sum_{i} X_{i}^{2}.
Hence the necessary and sufficient condition guaranteeing adaptiveness is that
\bar X = 0. When we write the model in matrix form, the condition \bar X = 0 says
that the two vectors (1, · · · , 1)^{T} and (X_{1}, · · · , X_{n})^{T} are orthogonal. Note
that (1, · · · , 1)^{T} and (X_{1}, · · · , X_{n})^{T} are associated with β_{0} and β_{1}, respectively.

On the other hand, we can use the above derivation. Observe that

\frac{∂ \log f}{∂β_{0}} = \frac{Y − β_{0} − β_{1}X}{σ^{2}} = \frac{ε}{σ^{2}}, \qquad \frac{∂ \log f}{∂β_{1}} = \frac{(Y − β_{0} − β_{1}X)X}{σ^{2}} = \frac{εX}{σ^{2}}.

Then

\min_{a} E\left(\frac{∂ \log f}{∂β_{1}} − a\frac{∂ \log f}{∂β_{0}}\right)^{2} = \min_{a} \frac{1}{σ^{2}} E(X − a)^{2} = \frac{1}{σ^{2}} Var(X).
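A short numerical illustration of this comparison, with arbitrary (hypothetical) design points: the variance of ˆβ_{1} with β_{0} known is never larger, and centering the design so that \bar X = 0 makes the two agree:

```python
import numpy as np

sigma2 = 1.0
x = np.array([-2.0, -1.0, 0.5, 1.0, 2.5])    # hypothetical design points

var_b1_unknown = sigma2 / np.sum((x - x.mean())**2)   # beta_0 unknown
var_b1_known = sigma2 / np.sum(x**2)                  # beta_0 known
print(var_b1_known, "<=", var_b1_unknown)

# Centering the design (x_bar = 0) makes the two variances coincide.
xc = x - x.mean()
print(np.isclose(sigma2 / np.sum(xc**2),
                 sigma2 / np.sum((xc - xc.mean())**2)))
```

This is the orthogonality condition of the example in miniature: subtracting the mean makes the column of ones orthogonal to the design column.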

Example 2. (Partial Spline Model) Assume that Y = βX + g(T) + ε with
ε ∼ N(0, σ^{2}). Suppose that g ∈ L_{2}[a, b], the set of all square integrable functions
on the interval [a, b]. Taking proper care of some mathematical subtleties, g(T)
can be written as \sum_{j=1}^{∞} b_{j}φ_{j}(T), where {φ_{j}} is a complete basis of L_{2}[a, b].

Observe that

\frac{∂ \log f}{∂β} = \frac{εX}{σ^{2}}, \qquad \frac{∂ \log f}{∂b_{j}} = \frac{εφ_{j}(T)}{σ^{2}}.
Then

\min_{a_{j}} E\left(\frac{∂ \log f}{∂β} − \sum_{j} a_{j}\frac{∂ \log f}{∂b_{j}}\right)^{2} = \min_{a_{j}} \frac{1}{σ^{4}} E\left[ε^{2}\left(X − \sum_{j} a_{j}φ_{j}(T)\right)^{2}\right] = \frac{1}{σ^{2}} \min_{h} E(X − h(T))^{2}.

Therefore, h(T ) = E(X|T ). It means that when E(X|T ) = 0 or X and T are uncorrelated, the adaption is possible. Otherwise, the efficient bound for estimating β is

σ^{2}

E(X − E(X|T ))^{2} = σ^{2}
EV ar(X|T ).

Refer to Chen (1988) and Speckman (1988) for further references and the construction of efficient estimates of β.
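The bound of Example 2 can be illustrated by simulation. The sketch below uses a hypothetical design with E(X|T) = T and Var(X|T) = τ², a hypothetical nuisance function g, and an infeasible "oracle" estimator that centers X and Y by their true conditional means given T (in practice these conditional means would have to be estimated nonparametrically); its rescaled variance approaches the bound σ²/E[Var(X|T)]:

```python
import numpy as np

rng = np.random.default_rng(2)
beta, sigma2, tau2 = 2.0, 1.0, 0.25   # hypothetical model constants
n, reps = 500, 4000

def g(t):                              # hypothetical nuisance function
    return np.sin(2 * np.pi * t)

b_hats = np.empty(reps)
for r in range(reps):
    t = rng.uniform(size=n)
    x = t + rng.normal(0, np.sqrt(tau2), n)    # E(X|T)=T, Var(X|T)=tau2
    y = beta * x + g(t) + rng.normal(0, np.sqrt(sigma2), n)
    # Infeasible oracle: center X and Y by their true conditional means,
    # E(X|T) = T and E(Y|T) = beta*T + g(T), then run least squares.
    xt = x - t
    yt = y - (beta * t + g(t))
    b_hats[r] = np.sum(xt * yt) / np.sum(xt**2)

# n * Var(beta_hat) should approach the bound sigma^2 / E[Var(X|T)].
print(n * b_hats.var(), sigma2 / tau2)
```

If X is instead generated so that E(X|T) = 0, the same oracle attains the full-information bound σ²/E(X²), which is the adaptive case discussed above.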

Example 3. Let t_{1}, · · · , t_{n} be fixed constants. Suppose that X_{i} ∼
Bin(1, F (t_{i})) and Y_{i} ∼ Bin(1, F^{θ}(t_{i})). We now derive a lower bound on the