Chapter 4. Method of Maximum Likelihood


1 Introduction

Many statistical procedures are based on statistical models which specify the conditions under which the data are generated. Usually the assumption is made that the observations x₁, . . . , x_n are realizations of random variables that are (i) independent and (ii) identically distributed with common pdf f(x_i, θ). Once this model is specified, the statistician tries to find optimal solutions to his problem (usually related to inference on a set of parameters θ ∈ Θ ⊂ R^k characterizing the uncertainty about the model).

The procedure just described is not always easy to carry out. In fact, when confronted with a set of data three attitudes are possible:

• The statistician may be a “pessimist” who does not believe in any particular model f(x, θ). In this case he must be satisfied with descriptive methods (like exploratory data analysis) without the possibility of inductive inference.

• The statistician may be an “optimist” who strongly believes in one model. In this case the analysis is straightforward and optimal solutions may often be easily obtained.

• The statistician may be a “realist”: he would like to specify a particular model f(x, θ) in order to get operational results, but he may have either some doubt about the validity of this hypothesis or some difficulty in choosing a particular parametric family.

Let us illustrate this kind of preoccupation with an example. Suppose that the parameter of interest is the “center” of some population. In many situations, the statistician may argue that, due to a central limit effect, the data are generated by a normal pdf. In this case the problem is restricted to the problem of inference on µ, the mean of the population. But in some cases, he may have some doubt about these central limit effects and may suspect some skewness and/or some kurtosis or he may suspect that some observations are generated by other models (leading to the presence of outliers).

In this context three types of question may be raised to avoid gross errors in the prediction, or in the inference:

• Does the optimal solution, computed for the assumed model f(x, θ), still have “good” properties if the true model is a little different?


• Are the optimal solutions computed for other models near to the original one really substantially different?

• Is it possible to compute (exactly or approximately) optimal solutions for a wider class of models based on very few assumptions?

The first question is concerned with the sensitivity of a given criterion to the hypotheses (criterion robustness). In the second question, it is the sensitivity of the inference which is analyzed (inference robustness). The last question may be viewed as a tentative first step towards the development of nonparametric methods (i.e. methods based on a very large parametric space).

2 Information Bound

Any statistical inference starts from a basic family of probability measures, expressing our prior knowledge about the nature of the probability measures from which the observations originate. Formally, a model 𝒫 is a collection of probability measures P on (𝒳, 𝒜), where 𝒳 is the sample space and 𝒜 is a σ-field of subsets of 𝒳. If

$$ \mathcal{P} = \{P_\theta : \theta \in \Theta\}, \qquad \Theta \subset \mathbb{R}^s $$

for some s, then 𝒫 is a parametric model. On the other hand, if

$$ \mathcal{P} = \{\text{all } P \text{ on } (\mathcal{X}, \mathcal{A})\}, $$

then 𝒫 is often referred to as a nonparametric model.

Suppose that we have a fully specified parametric family of models. Denote the parameter of interest by θ. Suppose that we wish to calculate from the data a single value representing the “best estimate” that we can make of the unknown parameter. We call such a problem one of point estimation.

Define the information matrix as the s × s matrix I(θ) = ‖I_ij(θ)‖, where

$$ I_{ij}(\theta) = E_\theta\left[ \frac{\partial \log f(X;\theta)}{\partial\theta_i} \, \frac{\partial \log f(X;\theta)}{\partial\theta_j} \right]. $$

When s = 1, I(θ) is known as the Fisher information. Under regularity conditions, we have

$$ E_\theta\left[ \frac{\partial}{\partial\theta_i} \log f(X;\theta) \right] = 0 \tag{1} $$


and

$$ I_{ij}(\theta) = \operatorname{cov}_\theta\left[ \frac{\partial}{\partial\theta_i} \log f(X;\theta),\ \frac{\partial}{\partial\theta_j} \log f(X;\theta) \right]. $$

Being a covariance matrix, I(θ) is positive semidefinite, and it is positive definite unless the (∂/∂θ_i) log f(X; θ), i = 1, . . . , s, are affinely dependent (and hence, by (1), linearly dependent). When the density also has second derivatives, we have the following alternative expression for I_ij(θ):

$$ I_{ij}(\theta) = -E_\theta\left[ \frac{\partial^2}{\partial\theta_i\,\partial\theta_j} \log f(X;\theta) \right]. $$
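The two expressions for the information can be checked exactly in a small discrete model. The sketch below assumes a Bernoulli(p) observation, a choice made here only for illustration and not taken from the text; all function names are hypothetical. It evaluates E[score], E[score²] and −E[second derivative] by summing over x ∈ {0, 1}.

```python
def bernoulli_pmf(x, p):
    """f(x; p) for a Bernoulli(p) variable, x in {0, 1}."""
    return p if x == 1 else 1.0 - p


def score(x, p):
    """First derivative of log f(x; p) with respect to p."""
    return x / p - (1 - x) / (1.0 - p)


def second_derivative(x, p):
    """Second derivative of log f(x; p) with respect to p."""
    return -x / p**2 - (1 - x) / (1.0 - p) ** 2


def expectation(h, p):
    """Exact expectation of h(X, p) under Bernoulli(p): a sum over x in {0, 1}."""
    return sum(h(x, p) * bernoulli_pmf(x, p) for x in (0, 1))


def fisher_information_three_ways(p):
    """Return (E[score], E[score^2], -E[second derivative]) at parameter p."""
    mean_score = expectation(score, p)
    info_outer = expectation(lambda x, q: score(x, q) ** 2, p)
    info_hessian = -expectation(second_derivative, p)
    return mean_score, info_outer, info_hessian


if __name__ == "__main__":
    p = 0.3
    # The last two entries should agree with 1 / (p * (1 - p)).
    print(fisher_information_three_ways(p), 1.0 / (p * (1.0 - p)))
```

In this model the score has mean zero and both information expressions reduce to 1/[p(1 − p)], which is what the printed values can be compared against.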

To make the above statements rigorous, we make the following assumptions when s = 1:

(i) Θ is an open interval (finite, infinite, or semi-infinite).

(ii) The distributions P_θ have common support, so that without loss of generality the set A = {x : p_θ(x) > 0} is independent of θ.

(iii) For any x in A and θ in Θ, the derivative p′_θ(x) = ∂p_θ(x)/∂θ exists and is finite.

These three assumptions are referred to collectively as (2).

Lemma 1 (i) If (2) holds, and the derivative with respect to θ of the left side of

$$ \int f(x;\theta)\, d\mu(x) = 1 \tag{3} $$

can be obtained by differentiating under the integral sign, then

$$ E_\theta\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right] = 0 $$

and

$$ I(\theta) = \operatorname{var}_\theta\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right]. \tag{4} $$

(ii) If, in addition, the second derivative with respect to θ of log f(x; θ) exists for all x and θ and the second derivative with respect to θ of the left side of (3) can be obtained by differentiating twice under the integral sign, then

$$ I(\theta) = -E_\theta\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right]. $$

Let us now derive the information inequality for s = 1.

Theorem 1 Suppose (2) and (4) hold, and that I(θ) > 0. Let δ be any statistic with E_θ(δ²) < ∞ for which the derivative with respect to θ of E_θ(δ) exists and can be obtained by differentiating under the integral sign. Then

$$ \operatorname{var}_\theta(\delta) \ge \frac{\left[\frac{\partial}{\partial\theta}E_\theta(\delta)\right]^2}{I(\theta)}. $$


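Proof. For any estimator δ of g(θ) and any function ψ(x, θ) with finite second moment, the covariance inequality states that

$$ \operatorname{var}_\theta(\delta) \ge \frac{[\operatorname{cov}(\delta,\psi)]^2}{\operatorname{var}(\psi)}. \tag{5} $$

Denote g(θ) = E_θ δ and set

$$ \psi(X,\theta) = \frac{\partial}{\partial\theta}\log f(X;\theta). $$

If differentiation under the integral sign is permitted in E_θ δ, it then follows, since E_θ ψ = 0, that

$$ \operatorname{cov}(\delta,\psi) = \int \delta(x)\,\frac{f'(x;\theta)}{f(x;\theta)}\,f(x;\theta)\,dx = g'(\theta), $$

and hence

$$ \operatorname{var}_\theta(\delta) \ge \frac{[g'(\theta)]^2}{\operatorname{var}_\theta\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right]}. $$

By (4) the denominator equals I(θ). This completes the proof of the theorem.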

If δ is an unbiased estimator of θ based on n iid observations, each with Fisher information I(θ), then the information in the sample is nI(θ) and

$$ \operatorname{var}_\theta(\delta) \ge \frac{1}{nI(\theta)}. $$

The above inequality provides a lower bound for the variance of any unbiased estimator. The quantity 1/[nI(θ)] is known as the “Cramér-Rao lower bound.” Likewise, we can also obtain the information inequality for general s. We begin by generalizing the covariance inequality (5) to one involving many ψ_i (i = 1, . . . , r).
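Before turning to that generalization, a worked instance of the scalar bound may help. The N(θ, σ²) model with σ² known is an illustrative assumption made here, not an example from the text.

```latex
% Worked example (illustrative assumption): X_1, ..., X_n iid N(\theta, \sigma^2), \sigma^2 known.
\begin{align*}
\log f(x;\theta) &= -\tfrac12 \log(2\pi\sigma^2) - \frac{(x-\theta)^2}{2\sigma^2}, \\
\frac{\partial}{\partial\theta}\log f(x;\theta) &= \frac{x-\theta}{\sigma^2}, \qquad
I(\theta) = \operatorname{var}_\theta\!\left[\frac{X-\theta}{\sigma^2}\right] = \frac{1}{\sigma^2}, \\
\operatorname{var}_\theta(\bar X) &= \frac{\sigma^2}{n} = \frac{1}{nI(\theta)} .
\end{align*}
```

So the sample mean, which is unbiased for θ, attains the lower bound in this model.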

Theorem 2 For any unbiased estimator δ of g(θ) and any functions ψ_i(x, θ) with finite second moments, we have

$$ \operatorname{var}(\delta) \ge \gamma' C^{-1} \gamma, \tag{6} $$

where γ′ = (γ₁, · · · , γ_r) and C = ‖C_ij‖ are defined by

$$ \gamma_i = \operatorname{cov}(\delta,\psi_i), \qquad C_{ij} = \operatorname{cov}(\psi_i,\psi_j). \tag{7} $$

Proof. Replace Y by δ and X_i by ψ_i(X, θ) in Lemma 2 below. Then the fact that ρ*² ≤ 1 yields the theorem.

Let (X₁, . . . , X_r) and Y be random variables with finite second moment, and consider the correlation coefficient corr(Σ_i a_i X_i, Y). Its maximum value ρ* over all (a₁, . . . , a_r) is the multiple correlation coefficient between Y and the vector (X₁, . . . , X_r).


Lemma 2 Let (X₁, . . . , X_r) and Y have finite second moment, let γ_i = cov(X_i, Y) and let Σ be the covariance matrix of the X's. Without loss of generality, suppose Σ is positive definite. Then

$$ \rho^{*2} = \frac{\gamma'\Sigma^{-1}\gamma}{\operatorname{var}(Y)}. \tag{8} $$

Proof. Since a correlation coefficient is invariant under scale changes, the a's maximizing the correlation are not uniquely determined. Without loss of generality, we therefore impose the condition var(Σ_i a_i X_i) = a′Σa = 1. In view of a′Σa = 1,

$$ \operatorname{corr}\Bigl(\sum_i a_i X_i,\, Y\Bigr) = \frac{a'\gamma}{\sqrt{\operatorname{var}(Y)}}. $$

The problem then becomes that of maximizing a′γ subject to a′Σa = 1. Using the method of undetermined multipliers, one maximizes instead

$$ a'\gamma - \frac{\lambda}{2}\, a'\Sigma a \tag{9} $$

with respect to a and then determines λ so as to satisfy a′Σa = 1. Differentiation of (9) with respect to the a_i leads to a system of linear equations with the unique solution

$$ a = \frac{1}{\lambda}\Sigma^{-1}\gamma, \tag{10} $$

and the side condition a′Σa = 1 gives

$$ \lambda = \pm\sqrt{\gamma'\Sigma^{-1}\gamma}. $$

Substituting these values of λ into (10), one finds that

$$ a = \pm\frac{\Sigma^{-1}\gamma}{\sqrt{\gamma'\Sigma^{-1}\gamma}}, $$

and the maximum value of corr(Σ_i a_i X_i, Y), ρ*, is therefore the positive root of (8).

Note that always 0 ≤ ρ* ≤ 1, and that ρ* is 1 if and only if constants a₁, . . . , a_r and b exist such that Y = Σ_i a_i X_i + b.
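The closed form (8) can be compared with a direct search over coefficient vectors a. The sketch below assumes r = 2 and a made-up set of second moments (SIGMA, GAMMA and VAR_Y are hypothetical values chosen only so that the joint covariance matrix is positive definite); the helper names are likewise illustrative.

```python
def solve_2x2(m, v):
    """Solve m @ x = v for a 2x2 matrix m (list of rows) by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [
        (v[0] * m[1][1] - m[0][1] * v[1]) / det,
        (m[0][0] * v[1] - m[1][0] * v[0]) / det,
    ]


def squared_corr(a, sigma, gamma, var_y):
    """Squared correlation between a'X and Y, computed from second moments."""
    cov_ay = a[0] * gamma[0] + a[1] * gamma[1]
    var_ax = sum(a[i] * sigma[i][j] * a[j] for i in range(2) for j in range(2))
    return cov_ay**2 / (var_ax * var_y)


def multiple_corr_squared(sigma, gamma, var_y):
    """Closed form (8): gamma' Sigma^{-1} gamma / var(Y)."""
    w = solve_2x2(sigma, gamma)  # w = Sigma^{-1} gamma
    return (gamma[0] * w[0] + gamma[1] * w[1]) / var_y


# Hypothetical second moments, chosen only for illustration.
SIGMA = [[2.0, 1.0], [1.0, 3.0]]
GAMMA = [1.0, 2.0]
VAR_Y = 4.0

if __name__ == "__main__":
    rho2 = multiple_corr_squared(SIGMA, GAMMA, VAR_Y)
    a_star = solve_2x2(SIGMA, GAMMA)  # proportional to the maximizer in the proof
    print(rho2, squared_corr(a_star, SIGMA, GAMMA, VAR_Y))
```

Any other direction a should give a squared correlation no larger than the closed-form value, which is the content of the lemma.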

Let us now state the information inequality for the multiparameter case in which θ = (θ₁, . . . , θ_s).

Theorem 3 Suppose that (1) holds and that I(θ) is positive definite. Let δ be any statistic with E_θ(δ²) < ∞ for which the derivative of E_θ(δ) with respect to θ_i exists for each i and can be obtained by differentiating under the integral sign. Then

$$ \operatorname{var}_\theta(\delta) \ge \alpha' I^{-1}(\theta)\,\alpha, \tag{11} $$

where α′ is the row matrix with ith element

$$ \alpha_i = \frac{\partial}{\partial\theta_i} E_\theta[\delta(X)]. $$

Proof. If the functions ψ_i of Theorem 2 are taken to be ψ_i = (∂/∂θ_i) log f(X; θ), this theorem follows immediately.

Under regularity conditions on the class of estimators θ̂_n under consideration, it may be asserted that if θ̂_n is AN(θ, n⁻¹Σ(θ)), that is, asymptotically normal with mean θ and covariance matrix n⁻¹Σ(θ), then the condition

Σ(θ) − I(θ)⁻¹ is nonnegative definite

must hold. (See Sections 2.6 and 2.7 of Lehmann (1983) for further details.) In this respect, an estimator θ̂_n which is AN(θ, n⁻¹I(θ)⁻¹) is “optimal.” (Such an estimator need not exist.)

The following definition is thus motivated. An estimator θ̂_n which is AN(θ, n⁻¹I(θ)⁻¹) is called asymptotically efficient, or best asymptotically normal (BAN). Under suitable regularity conditions, an asymptotically efficient estimate exists. One approach toward finding such estimates is the method of maximum likelihood. Neyman (1949) pointed out that these large-sample criteria are also satisfied by other estimates, and he defined a class of best asymptotically normal estimates. So far, we have described three desirable properties of θ̂_n: unbiasedness, consistency, and efficiency. We now describe a general procedure that produces an asymptotically unbiased, consistent, and asymptotically efficient estimator.

3 Maximum Likelihood Methodology

Many statistical techniques were invented in the nineteenth century by experimental scientists who personally applied their methods to authentic data sets. In these conditions the limits of what is computationally feasible are spontaneously observed. Until quite recently these limits were set by the capacity of the human calculator, equipped with pencil and paper and with such aids as the slide rule, tables of logarithms, and other convenient tables, which have been in constant use from the seventeenth century until well into the twentieth. Until the advent of the electronic computer, the powers of the human operator set the standard. This restriction has left its mark on statistical technique, and many new developments have taken place since it was lifted.

The first result of this modern computing revolution is that estimates defined by nonlinear equations can be established as a matter of routine by the appropriate iterative algorithms. This permits the use of nonlinear functional forms. Although the progress of computing technology made nonlinear estimation possible, the statistical theory of Maximum Likelihood provided techniques and respectability. Its principle was first put forward as a novel and original method of deriving estimators by R.A. Fisher in the early 1920s. It very soon proved to be a fertile approach to statistical inference in general, and was widely adopted; but the exact properties of the ensuing estimators and test procedures were only gradually discovered.

Let observations x = (x₁, . . . , x_n) be realized values of random variables X = (X₁, . . . , X_n), and suppose that the random vector X has density f_X(x; θ) with respect to some σ-finite measure ν. Here θ is the scalar parameter to be estimated. The likelihood function corresponding to an observed vector x from the density f_X(x; θ) is written

$$ \mathrm{Lik}_X(\theta'; x) = f_X(x; \theta'), $$

whose logarithm is denoted by L(θ′; x). When the X_i are iid with probability density f(x; θ) with respect to a σ-finite measure µ,

$$ f_X(x;\theta) = \prod_{i=1}^n f(x_i;\theta). $$

If the parameter space is Ω, then the maximum likelihood estimate (MLE) θ̂ = θ̂(x) is that value of θ′ maximizing Lik_X(θ′; x), or equivalently its logarithm L(θ′; x), over Ω. That is,

$$ L(\hat\theta; x) \ge L(\theta'; x) \qquad (\theta' \in \Omega). \tag{12} $$

L(θ, x) is called the log-likelihood. Note that L is regarded as a function of θ with x fixed.

An MLE may not exist. It certainly exists if Ω is compact and f(x; θ) is upper semicontinuous in θ for all x. (Recall that f is said to be upper semicontinuous if {x | f(x) < α} is an open set for every real α.) As an example, consider U(θ, θ + 1): the likelihood equals 1 for every θ between x_(n) − 1 and x_(1) and 0 elsewhere, so an MLE exists but is not unique. Later on, we shall use the shorthand notation L(θ) for L(θ, x) and L′(θ), L″(θ), . . . for its derivatives with respect to θ.

Fisher was the first to study and establish optimum properties of estimates obtained by maximizing the likelihood function, using criteria such as consistency and efficiency (involving asymptotic variances) in large samples. At that time, however, the computations involved were hardly practicable, which prevented a widespread adoption of these methods. Fortunately, computer technology has since become generally accessible, and Maximum Likelihood (ML) methodology is now widely used.


It is a constant theme of the history of the method that the use of ML techniques is not always accompanied by a clear appreciation of their limitations. Le Cam (1953) complains that

· · · although all efforts at a proof of the general existence of [asymptotically] efficient estimates · · · as well as a proof of the efficiency of ML estimates were obviously inaccurate and although accurate proofs of similar statements always referred not to the general case but to particular classes of estimates · · · a general belief became established that the above statements are true in the most general sense.

As an illustration, consider the famous Neyman-Scott (1948) problem. In this example, the MLE is not even consistent.

Example 1. Estimation of a Common Variance. Let X_αj (j = 1, . . . , r) be independently distributed according to N(θ_α, σ²), α = 1, . . . , n. The MLEs are

$$ \hat\theta_\alpha = \bar X_{\alpha\cdot}, \qquad \hat\sigma^2 = \frac{1}{rn}\sum_{\alpha=1}^n\sum_{j=1}^r (X_{\alpha j} - \bar X_{\alpha\cdot})^2. $$

Furthermore, these are the unique solutions of the likelihood equations.

However, in the present case, the MLE of σ² is not even consistent as n → ∞ with r fixed. To see this, note that the statistics

$$ S_\alpha^2 = \sum_{j=1}^r (X_{\alpha j} - \bar X_{\alpha\cdot})^2 $$

are independent and identically distributed with expectation E(S_α²) = (r − 1)σ², so that Σ_α S_α²/n → (r − 1)σ² and hence

$$ \hat\sigma^2 \to \frac{r-1}{r}\,\sigma^2 \quad \text{in probability.} $$
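The limit (r − 1)σ²/r can be seen in a small simulation. The sketch below is only illustrative: the group means are drawn from an arbitrary uniform range (an assumption made here, since their values do not affect the result), and the function name is hypothetical.

```python
import random


def neyman_scott_mle(n_groups, r, sigma, seed=0):
    """Simulate n_groups groups of r normal observations with distinct means
    and return the MLE of the common variance sigma^2."""
    rng = random.Random(seed)
    total_ss = 0.0
    for _ in range(n_groups):
        theta_alpha = rng.uniform(-10.0, 10.0)  # arbitrary nuisance mean
        xs = [rng.gauss(theta_alpha, sigma) for _ in range(r)]
        group_mean = sum(xs) / r  # MLE of theta_alpha
        total_ss += sum((x - group_mean) ** 2 for x in xs)  # S_alpha^2
    return total_ss / (r * n_groups)


if __name__ == "__main__":
    # With r = 2 and sigma = 1 the estimates should settle near 1/2, not 1.
    for n in (100, 1000, 20000):
        print(n, neyman_scott_mle(n, r=2, sigma=1.0))
```

Increasing the number of groups does not remove the bias, because each new group brings its own mean to be estimated.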

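Example 2. Suppose X₁, X₂, . . . , X_n is a random sample from a uniform distribution U(0, θ). The likelihood function is

$$ \mathrm{Lik}_X(\theta; x) = \frac{1}{\theta^n}, \qquad 0 < x_1, \ldots, x_n < \theta, $$

and zero otherwise. Clearly the likelihood cannot be maximized with respect to θ by differentiation. However, since θ⁻ⁿ is decreasing in θ and θ cannot be smaller than the largest observation, it is not difficult to find that θ̂_n = X_(n), with density function ntⁿ⁻¹/θⁿ, where t ∈ (0, θ). Then

$$ E(\hat\theta_n) = \frac{n\theta}{n+1}, $$

so θ̂_n is a biased estimator of θ. (But it is asymptotically unbiased.)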
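A quick Monte Carlo check of the bias formula is sketched below. The function names and the particular values of n, θ and the number of replications are illustrative assumptions.

```python
import random


def mean_of_sample_maximum(n, theta, reps, seed=0):
    """Monte Carlo estimate of E[X_(n)] for samples of size n from U(0, theta)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += max(rng.uniform(0.0, theta) for _ in range(n))
    return total / reps


def exact_mean_of_mle(n, theta):
    """E[theta_hat_n] = n * theta / (n + 1), as derived in Example 2."""
    return n * theta / (n + 1)


if __name__ == "__main__":
    for n in (5, 50, 500):
        print(n, mean_of_sample_maximum(n, 1.0, 2000), exact_mean_of_mle(n, 1.0))
```

The simulated mean stays below θ for every n, while the gap θ/(n + 1) shrinks as n grows, in line with asymptotic unbiasedness.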


3.1 Efficient Likelihood Estimation

The example discussed by Neyman and Scott (1948) shows that the MLE need not be consistent in general. We will show, however, that under regularity conditions the ML estimates are consistent, asymptotically normal, and asymptotically efficient. For simplicity, our treatment will be confined to the case of a 1-dimensional parameter.

We begin with the following regularity assumptions:

(A0) The distributions P_θ of the observations are distinct (otherwise, θ cannot be estimated consistently).

(A1) The distributions P_θ have common support.

(A2) The observations are X = (X₁, . . . , X_n), where the X_i are iid with probability density f (x_i, θ) with respect to µ.

(A3) The parameter space Ω contains an open interval ω of which the true parameter value θ₀ is an interior point.

Theorem 4 Under assumptions (A0)-(A2),

$$ P_{\theta_0}\{ f(X_1,\theta_0)\cdots f(X_n,\theta_0) > f(X_1,\theta)\cdots f(X_n,\theta) \} \to 1 \quad \text{as } n\to\infty $$

for any fixed θ ≠ θ₀.

Proof. The inequality is equivalent to

$$ \frac1n\sum_{i=1}^n \log\,[f(X_i,\theta)/f(X_i,\theta_0)] < 0. $$

By the strong law of large numbers, the left side tends with probability 1 toward E_θ₀ log[f(X, θ)/f(X, θ₀)]. Since − log is strictly convex, Jensen's inequality shows that

$$ E_{\theta_0}\log\,[f(X,\theta)/f(X,\theta_0)] < \log E_{\theta_0}[f(X,\theta)/f(X,\theta_0)] = 0, \tag{13} $$

and the result follows.

The above proof gives a meaning to the numerical value of the Kullback-Leibler information number I(θ₀, θ) defined in Remark 1: when θ₀ is the true value, the likelihood ratio converges to zero exponentially fast, at rate I(θ₀, θ).

Remark 1. Define the Kullback-Leibler information number

$$ I(\theta,\eta) = E_\theta\left[\log\frac{f(X,\theta)}{f(X,\eta)}\right]. $$

Note that I(θ, η) ≥ 0, with equality holding if and only if f(x, θ) = f(x, η). I(θ, η) is a measure of the ability of the likelihood ratio to distinguish between f(X, θ) and f(X, η) when the former is true.
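The link between Theorem 4 and the Kullback-Leibler number can be illustrated numerically. The sketch below assumes the N(θ, 1) family, for which I(θ₀, θ) = (θ₀ − θ)²/2; the family and the function names are choices made here for illustration. The average log likelihood ratio under θ₀ should be close to −I(θ₀, θ), and in particular negative.

```python
import math
import random


def normal_log_density(x, theta):
    """Log density of N(theta, 1) at x."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - theta) ** 2


def kl_normal(theta, eta):
    """Kullback-Leibler number I(theta, eta) for N(theta, 1) versus N(eta, 1)."""
    return 0.5 * (theta - eta) ** 2


def average_log_ratio(theta0, theta, n, seed=0):
    """(1/n) * sum_i log[f(X_i, theta) / f(X_i, theta0)], X_i drawn from N(theta0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(theta0, 1.0)
        total += normal_log_density(x, theta) - normal_log_density(x, theta0)
    return total / n


if __name__ == "__main__":
    print(average_log_ratio(0.0, 1.0, 20000), -kl_normal(0.0, 1.0))
```

Since the log likelihood ratio is n times this average, it behaves like −n·I(θ₀, θ), which is the exponential rate mentioned after the proof of Theorem 4.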

Remark 2. If θ̂_n is an MLE of θ and if g is a function, then g(θ̂_n) is an MLE of g(θ). When g is one-to-one, this holds obviously. If g is many-to-one, this result holds again when the derivative of g is nonzero.

By Theorem 4, the density of X at the true θ₀ exceeds that at any other fixed θ with high probability when n is large. We do not know θ₀, but we can determine the value θ̂ of θ which maximizes the density of X. However, Theorem 4 cannot guarantee that the MLE is consistent, since for that the law of large numbers would have to be applied in (13) for all θ ≠ θ₀ simultaneously. However, if Ω is finite, the MLE θ̂_n exists, it is unique with probability tending to 1, and it is consistent.

The following theorem is motivated by the simple fact obtained by differentiating ∫ f(x, θ) µ(dx) = 1 with respect to θ, which leads to

$$ E_{\theta_0}\left[\frac{f'(X,\theta_0)}{f(X,\theta_0)}\right] = 0. $$

Theorem 5 Let X₁, . . . , X_n satisfy assumptions (A0)-(A3) and suppose that for almost all x, f(x, θ) is differentiable with respect to θ in ω, with derivative f′(x, θ). Then with probability tending to 1 as n → ∞, the likelihood equation

$$ \frac{\partial}{\partial\theta}\,[f(x_1,\theta)\cdots f(x_n,\theta)] = 0 \tag{14} $$

or, equivalently, the equation

$$ L'(\theta, x) = \sum_{i=1}^n \frac{f'(x_i,\theta)}{f(x_i,\theta)} = 0 \tag{15} $$

has a root θ̂_n = θ̂_n(x₁, . . . , x_n) such that θ̂_n(X₁, . . . , X_n) tends to the true value θ₀ in probability.

Proof. Let a be small enough so that (θ₀ − a, θ₀ + a) ⊂ ω, and let

$$ S_n = \{x : L(\theta_0,x) > L(\theta_0 - a,x) \text{ and } L(\theta_0,x) > L(\theta_0+a,x)\}. \tag{16} $$

By Theorem 4, P_θ₀(S_n) → 1. For any x ∈ S_n there thus exists a value θ₀ − a < θ̂_n < θ₀ + a at which L(θ) has a local maximum, so that L′(θ̂_n) = 0. Hence for any a > 0 sufficiently small, there exists a sequence θ̂_n = θ̂_n(a) of roots such that

$$ P_{\theta_0}(|\hat\theta_n - \theta_0| < a) \to 1. $$

It remains to show that we can determine such a sequence which does not depend on a.

Let θ̂*_n be the root closest to θ₀. (This exists because the limit of a sequence of roots is again a root by the continuity of L(θ).) Then clearly P_θ₀(|θ̂*_n − θ₀| < a) → 1, and this completes the proof.

Remarks. 1. This theorem does not establish the existence of a consistent estimator sequence since, with the true value θ₀ unknown, the data do not tell us which root to choose so as to obtain a consistent sequence. An exception, of course, is the case in which the root is unique.

2. It should also be emphasized that existence of a root θ̂_n is not asserted for all x (or, for a given n, even for any x). This does not affect consistency, which only requires θ̂_n to be defined on a set S_n′, the probability of which tends to 1 as n → ∞.

The above theorem establishes the existence of a consistent root of the likelihood equation. The next theorem asserts that any such sequence is asymptotically normal and efficient.

Theorem 6 Suppose that X₁, . . . , X_n are iid and satisfy the assumptions (A0)-(A3), that the integral ∫ f(x, θ) dµ(x) can be twice differentiated under the integral sign, and that a third derivative exists satisfying

$$ \left|\frac{\partial^3}{\partial\theta^3}\log f(x,\theta)\right| \le M(x) \tag{17} $$

for all x ∈ A and θ₀ − c < θ < θ₀ + c, with E_θ₀[M(X)] < ∞. Then any consistent sequence θ̂_n = θ̂_n(X₁, . . . , X_n) of roots of the likelihood equation satisfies

$$ \sqrt n\,(\hat\theta_n - \theta_0) \xrightarrow{d} N\left(0, \frac{1}{I(\theta_0)}\right). $$

Proof. For any fixed x, expand L′(θ̂_n) about θ₀:

$$ L'(\hat\theta_n) = L'(\theta_0) + (\hat\theta_n-\theta_0)L''(\theta_0) + \tfrac12(\hat\theta_n-\theta_0)^2 L^{(3)}(\theta_n^*), $$

where θ*_n lies between θ₀ and θ̂_n. By assumption, the left side is zero, so that

$$ \sqrt n\,(\hat\theta_n-\theta_0) = \frac{(1/\sqrt n)\,L'(\theta_0)}{-(1/n)L''(\theta_0) - (1/2n)(\hat\theta_n-\theta_0)L^{(3)}(\theta_n^*)}, $$

where it should be remembered that L(θ), L′(θ), and so on are functions of (X₁, . . . , X_n) as well as θ. The desired result follows if we can show that

$$ \frac{1}{\sqrt n}L'(\theta_0) \xrightarrow{d} N[0, I(\theta_0)], \tag{18} $$

that

$$ -\frac1n L''(\theta_0) \xrightarrow{P} I(\theta_0), \tag{19} $$

and that

$$ \frac1n L^{(3)}(\theta_n^*) \ \text{is bounded in probability.} \tag{20} $$

Of the above statements, (18) follows from the fact that

$$ \frac{1}{\sqrt n}L'(\theta_0) = \sqrt n\,\frac1n\sum_{i=1}^n\left[\frac{f'(X_i,\theta_0)}{f(X_i,\theta_0)} - E_{\theta_0}\frac{f'(X_i,\theta_0)}{f(X_i,\theta_0)}\right], $$

since the expectation term is zero, and then from the CLT and the definition of I(θ).

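Next,

$$ -\frac1n L''(\theta_0) = \frac1n\sum_{i=1}^n \frac{f'^2(X_i,\theta_0) - f(X_i,\theta_0)\,f''(X_i,\theta_0)}{f^2(X_i,\theta_0)}. $$

By the law of large numbers, this tends in probability to

$$ I(\theta_0) - E_{\theta_0}\frac{f''(X_i,\theta_0)}{f(X_i,\theta_0)} = I(\theta_0), $$

since the expectation of f″/f vanishes when ∫ f(x, θ) dµ(x) = 1 is differentiated twice under the integral sign. Finally,

$$ \frac1n L^{(3)}(\theta) = \frac1n\sum_{i=1}^n \frac{\partial^3}{\partial\theta^3}\log f(X_i,\theta), $$

so that by (17)

$$ \left|\frac1n L^{(3)}(\theta_n^*)\right| < \frac1n\,[M(X_1)+\cdots+M(X_n)] $$

with probability tending to 1. The right side tends in probability to E_θ₀[M(X)], and this completes the proof.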

Remarks. 1. This is a strong result. It establishes several major properties of the MLE in addition to its consistency. The MLE is asymptotically normal, which is of great help for the derivation of (asymptotically valid) tests; it is asymptotically unbiased; and it is asymptotically efficient, since the variance of its limiting distribution equals the Cramér-Rao lower bound.

2. As a rule we wish to supplement the parameter estimates by an estimate of their (asymptotic) variance. This will permit us to assess (asymptotic) t-ratios and (asymptotic) confidence intervals. Although the variance may depend on the unknown parameter, we can simply substitute the MLE for it to obtain an estimate of the variance.
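Both remarks can be illustrated by simulation. The sketch below assumes an exponential model with rate λ, f(x; λ) = λe^{−λx}, for which the MLE is 1/X̄ and I(λ) = 1/λ²; the model, the sample sizes and the function names are assumptions made here for illustration. The variance of √n(λ̂ − λ₀) over repeated samples should be near 1/I(λ₀) = λ₀², and the plug-in variance λ̂²/n gives an approximate confidence interval.

```python
import math
import random
import statistics


def exponential_rate_mle(xs):
    """MLE of the rate lam for f(x; lam) = lam * exp(-lam * x): 1 / sample mean."""
    return len(xs) / sum(xs)


def scaled_mle_errors(lam0, n, reps, seed=0):
    """Draw `reps` samples of size n; return sqrt(n) * (lam_hat - lam0) for each."""
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        xs = [rng.expovariate(lam0) for _ in range(n)]
        errors.append(math.sqrt(n) * (exponential_rate_mle(xs) - lam0))
    return errors


def wald_interval(xs, z=1.96):
    """Approximate 95% interval lam_hat +/- z * lam_hat / sqrt(n), using the
    plug-in variance 1 / (n * I(lam_hat)) with I(lam) = 1 / lam^2."""
    lam_hat = exponential_rate_mle(xs)
    half_width = z * lam_hat / math.sqrt(len(xs))
    return lam_hat - half_width, lam_hat + half_width


if __name__ == "__main__":
    errs = scaled_mle_errors(lam0=2.0, n=200, reps=2000)
    # The sample variance should be near lam0^2 = 4.
    print(statistics.mean(errs), statistics.variance(errs))
```

The small positive mean of the scaled errors reflects the finite-sample bias of 1/X̄, which vanishes as n grows.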

The usual iterative methods for solving the likelihood equation L′(θ) = 0 are based on replacing L′(θ) by the linear terms of its Taylor expansion about an approximate solution θ̃. Suppose we can use an estimation method such as the method of moments to find a good estimate θ̃ of θ. Then it is quite natural to use θ̃ as the initial value of the iteration. Denote the MLE by θ̂. This leads to the approximation

$$ 0 = L'(\hat\theta) \approx L'(\tilde\theta) + (\hat\theta-\tilde\theta)\,L''(\tilde\theta), $$

and hence to

$$ \hat\theta \approx \tilde\theta - \frac{L'(\tilde\theta)}{L''(\tilde\theta)}. $$

The procedure is then iterated according to the above scheme.
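The iteration can be sketched for a model with no closed-form MLE. The example below assumes the logistic location family, for which L′(θ) = Σ tanh((x_i − θ)/2) and L″(θ) < 0 everywhere; the family, the stopping rule and the function names are illustrative choices made here, not part of the text.

```python
import math


def logistic_score(theta, xs):
    """L'(theta) for the logistic location family: sum of tanh((x_i - theta) / 2)."""
    return sum(math.tanh((x - theta) / 2.0) for x in xs)


def logistic_score_derivative(theta, xs):
    """L''(theta): -(1/2) * sum of sech^2((x_i - theta) / 2); always negative."""
    return -0.5 * sum(1.0 / math.cosh((x - theta) / 2.0) ** 2 for x in xs)


def newton_raphson_mle(xs, theta_start, tol=1e-10, max_iter=50):
    """Iterate theta <- theta - L'(theta) / L''(theta) from theta_start."""
    theta = theta_start
    for _ in range(max_iter):
        step = logistic_score(theta, xs) / logistic_score_derivative(theta, xs)
        theta -= step
        if abs(step) < tol:
            return theta
    raise RuntimeError("Newton-Raphson did not converge")


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0]  # symmetric about 3, so the root is 3
    start = sum(data) / len(data) - 1.0  # deliberately offset starting value
    print(newton_raphson_mle(data, start))
```

In practice the sample mean or median would be a natural starting value for this family; the offset start is used only to make the iteration visible. Convergence is not guaranteed for arbitrary starting values, which is why the sketch caps the number of iterations.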

The following is a justification for the use of the above one-step approximation as an estimator of θ.

Theorem 7 Suppose that the assumptions of Theorem 6 hold and that θ̃ is not only a consistent but a √n-consistent estimator of θ, that is, that √n(θ̃ − θ₀) is bounded in probability, so that θ̃ tends to θ₀ at least at the rate of 1/√n. Then the estimator sequence

$$ \delta_n = \tilde\theta - \frac{L'(\tilde\theta)}{L''(\tilde\theta)} \tag{21} $$

is asymptotically efficient.

Proof. As in the proof of Theorem 6, expand L′(θ̃_n) about θ₀ as

$$ L'(\tilde\theta_n) = L'(\theta_0) + (\tilde\theta_n-\theta_0)L''(\theta_0) + \tfrac12(\tilde\theta_n-\theta_0)^2L^{(3)}(\theta_n^*), $$

where θ*_n lies between θ₀ and θ̃_n. Substituting this expression into (21) and simplifying, we find

$$ \sqrt n\,(\delta_n-\theta_0) = \frac{(1/\sqrt n)\,L'(\theta_0)}{-(1/n)L''(\tilde\theta_n)} + \sqrt n\,(\tilde\theta_n-\theta_0)\left[1 - \frac{L''(\theta_0)}{L''(\tilde\theta_n)} - \frac12(\tilde\theta_n-\theta_0)\frac{L^{(3)}(\theta_n^*)}{L''(\tilde\theta_n)}\right]. \tag{22} $$

The theorem follows if we can show that the expression in square brackets on the right-hand side of (22) tends to zero in probability and that L″(θ̃_n)/L″(θ₀) → 1 in probability. Both facts follow from θ*_n → θ₀ in probability together with the expansion

$$ \frac1n L''(\tilde\theta_n) = \frac1n L''(\theta_0) + \frac1n(\tilde\theta_n-\theta_0)L^{(3)}(\theta_n^{**}), $$

where θ**_n is between θ₀ and θ̃_n.
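A simulation can suggest what the one-step estimator gains over its starting value. The sketch below assumes the Cauchy location family with the sample median as the √n-consistent preliminary estimate; for this family I(θ) = 1/2, so the efficient asymptotic variance is 2, while the median has asymptotic variance π²/4 ≈ 2.47. The family, the sample sizes and the function names are illustrative assumptions.

```python
import math
import random
import statistics


def cauchy_score(theta, xs):
    """L'(theta) for the Cauchy location family."""
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in xs)


def cauchy_score_derivative(theta, xs):
    """L''(theta) for the Cauchy location family."""
    return sum(
        2.0 * ((x - theta) ** 2 - 1.0) / (1.0 + (x - theta) ** 2) ** 2 for x in xs
    )


def one_step_estimate(xs, theta_start):
    """delta_n of (21): a single Newton step from a preliminary estimate."""
    return theta_start - cauchy_score(theta_start, xs) / cauchy_score_derivative(
        theta_start, xs
    )


def compare_median_and_one_step(theta0, n, reps, seed=0):
    """Return n * (sample variance) of the median and of the one-step estimator."""
    rng = random.Random(seed)
    medians, one_steps = [], []
    for _ in range(reps):
        # Inverse-CDF draw from the Cauchy distribution centred at theta0.
        xs = [theta0 + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
        med = statistics.median(xs)
        medians.append(med)
        one_steps.append(one_step_estimate(xs, med))
    return n * statistics.variance(medians), n * statistics.variance(one_steps)


if __name__ == "__main__":
    print(compare_median_and_one_step(theta0=0.0, n=100, reps=500))
```

The scaled variance of the one-step estimator should come out closer to 2 than that of the median, without any further iteration.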


3.2 The Multi-parameter Case

So far we have discussed the case in which the distribution depends on a single parameter θ. When extending this theory to probability models involving several parameters θ₁, . . . , θ_s, one may be interested either in the simultaneous estimation of these parameters (or certain functions of them) or in the estimation of only some of them. In the latter case, the parameters to be estimated are of intrinsic interest, and the rest are nuisance or incidental parameters that are necessary for a proper statistical model but of no interest in themselves. For instance, if we are only interested in estimating σ² in the Neyman-Scott problem, then the θ_α are nuisance parameters.

Let (X₁, . . . , X_n) be iid with a distribution that depends on θ = (θ₁, . . . , θ_s) and satisfies assumptions (A0)-(A3). The information matrix I(θ) is an s × s matrix with elements I_jk(θ), j, k = 1, . . . , s, defined by

$$ I_{jk}(\theta) = \operatorname{cov}_\theta\left[\frac{\partial}{\partial\theta_j}\log f(X,\theta),\ \frac{\partial}{\partial\theta_k}\log f(X,\theta)\right]. $$

We shall now show under regularity conditions that with probability tending to 1 there exist solutions θ̂_n = (θ̂_1n, . . . , θ̂_sn) of the likelihood equations

$$ \frac{\partial}{\partial\theta_j}\,[f(x_1,\theta)\cdots f(x_n,\theta)] = 0, \qquad j = 1,\ldots,s, $$

or equivalently

$$ \frac{\partial}{\partial\theta_j}\,[L(\theta)] = 0, \qquad j=1,\ldots,s, $$

such that θ̂_jn is consistent for estimating θ_j and asymptotically efficient in the sense of having asymptotic variance [I(θ)]⁻¹_jj.

We first state some assumptions:

(A) There exists an open subset ω of Ω containing the true parameter point θ⁰ such that for almost all x the density f(x, θ) admits all third derivatives (∂³/∂θ_j∂θ_k∂θ_ℓ) f(x, θ) for all θ ∈ ω.

(B) The first and second logarithmic derivatives of f satisfy the equations

$$ E_\theta\left[\frac{\partial}{\partial\theta_j}\log f(X,\theta)\right] = 0 \qquad \text{for } j=1,\ldots,s, $$

and

$$ I_{jk}(\theta) = E_\theta\left[\frac{\partial}{\partial\theta_j}\log f(X,\theta)\cdot\frac{\partial}{\partial\theta_k}\log f(X,\theta)\right] = E_\theta\left[-\frac{\partial^2}{\partial\theta_j\,\partial\theta_k}\log f(X,\theta)\right]. $$

(C) Since the s × s matrix I(θ) is a covariance matrix, it is positive semidefinite. We shall assume that the I_jk(θ) are finite and that the matrix I(θ) is positive definite for all θ in ω, and hence that the s statistics

$$ \frac{\partial}{\partial\theta_1}\log f(X,\theta),\ \ldots,\ \frac{\partial}{\partial\theta_s}\log f(X,\theta) $$

are affinely independent with probability 1.

(D) Finally, we shall suppose that there exist functions M_jkℓ such that

$$ \left|\frac{\partial^3}{\partial\theta_j\,\partial\theta_k\,\partial\theta_\ell}\log f(x,\theta)\right| \le M_{jk\ell}(x) \qquad \text{for all } \theta\in\omega, $$

where m_jkℓ = E_θ⁰[M_jkℓ(X)] < ∞ for all j, k, ℓ.

Theorem 8 Let X₁, . . . , X_n be iid, each with a density f(x, θ) (with respect to µ) which satisfies (A0)-(A2) and assumptions (A)-(D) above. Then with probability tending to 1 as n → ∞, there exist solutions θ̂_n = θ̂_n(X₁, . . . , X_n) of the likelihood equations such that

(i) θ̂_jn is consistent for estimating θ_j,

(ii) √n(θ̂_n − θ) is asymptotically normal with (vector) mean zero and covariance matrix [I(θ)]⁻¹, and

(iii) θ̂_jn is asymptotically efficient in the sense that

$$ \sqrt n\,(\hat\theta_{jn}-\theta_j) \xrightarrow{\mathcal L} N\bigl(0, [I(\theta)]^{-1}_{jj}\bigr). $$

Proof. (i) Existence and Consistency. To prove the existence, with probability tending to 1, of a sequence of solutions of the likelihood equations which is consistent, we shall consider the behavior of the log likelihood L(θ) on the sphere Q_a with center at the true point θ⁰ and radius a. We will show that for any sufficiently small a the probability tends to 1 that L(θ) < L(θ⁰) at all points θ on the surface of Q_a, and hence that L(θ) has a local maximum in the interior of Q_a. Since at a local maximum the likelihood equations must be satisfied, it will follow that for any a > 0, with probability tending to 1 as n → ∞, the likelihood equations have a solution θ̂_n(a) within Q_a, and the proof can be completed as in the one-dimensional case.

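To obtain the needed facts concerning the behavior of the likelihood on Q_a for small a, we expand the log likelihood about the true point θ⁰ and divide by n to find

$$
\begin{aligned}
\frac1n L(\theta) - \frac1n L(\theta^0)
&= \frac1n\sum_j A_j(x)(\theta_j-\theta_j^0)
 + \frac{1}{2n}\sum_j\sum_k B_{jk}(x)(\theta_j-\theta^0_j)(\theta_k-\theta^0_k) \\
&\quad + \frac{1}{6n}\sum_j\sum_k\sum_\ell(\theta_j-\theta^0_j)(\theta_k-\theta^0_k)(\theta_\ell-\theta^0_\ell)\sum_{i=1}^n\gamma_{jk\ell}(x_i)\,M_{jk\ell}(x_i) \\
&= S_1+S_2+S_3,
\end{aligned}
$$

where

$$ A_j(x) = \frac{\partial}{\partial\theta_j}L(\theta)\Big|_{\theta=\theta^0}, \qquad B_{jk}(x) = \frac{\partial^2}{\partial\theta_j\,\partial\theta_k}L(\theta)\Big|_{\theta=\theta^0}, $$

and where by assumption (D)

$$ 0 \le |\gamma_{jk\ell}(x)| \le 1. $$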

To prove that the maximum of this difference for θ on Q_a is negative with probability tending to 1 if a is sufficiently small, we will show that with high probability the maximum of S₂ is negative while S₁ and S₃ are small compared to S₂. The basic tools for showing this are the facts that by (B) and the law of large numbers

$$ \frac1n A_j(X) = \frac1n\,\frac{\partial}{\partial\theta_j}L(\theta)\Big|_{\theta=\theta^0} \to 0 \quad \text{in probability} \tag{23} $$

and

$$ \frac1n B_{jk}(X) = \frac1n\,\frac{\partial^2}{\partial\theta_j\,\partial\theta_k}L(\theta)\Big|_{\theta=\theta^0} \to -I_{jk}(\theta^0) \quad \text{in probability.} \tag{24} $$

Let us begin with S₁. On Q_a we have

$$ |S_1| \le \frac1n\,a\sum_j |A_j(X)|. $$

For any given a, it follows from (23) that |A_j(X)|/n < a² and hence that |S₁| < sa³ with probability tending to 1. Next consider

$$ 2S_2 = \sum_j\sum_k\bigl[-I_{jk}(\theta^0)(\theta_j-\theta^0_j)(\theta_k-\theta^0_k)\bigr] + \sum_j\sum_k\Bigl\{\frac1n B_{jk}(X) - \bigl[-I_{jk}(\theta^0)\bigr]\Bigr\}(\theta_j-\theta^0_j)(\theta_k-\theta^0_k). $$

For the second term it follows from an argument analogous to that for S₁ that its absolute value is less than s²a³ with probability tending to 1. The first term is a negative (nonrandom) quadratic form in the variables (θ_j − θ_j⁰). By an orthogonal transformation this can be reduced to diagonal form Σλ_iξ_i² with Q_a becoming Σξ_i² = a². Suppose that the λ's, which are all negative, are numbered so that λ_s ≤ λ_{s−1} ≤ · · · ≤ λ₁ < 0. Then Σλ_iξ_i² ≤ λ₁Σξ_i² = λ₁a². Combining the first and second terms, we see that there exist c > 0, a₀ > 0 such that for a < a₀

$$ S_2 < -ca^2 $$

with probability tending to 1.

Finally, with probability tending to 1,

$$ \left|\frac1n\sum_{i=1}^n M_{jk\ell}(X_i)\right| < 2m_{jk\ell} $$

and hence |S₃| < ba³ on Q_a, where

$$ b = \frac{s^3}{3}\sum_j\sum_k\sum_\ell m_{jk\ell}. $$

Combining the three inequalities, we see that

$$ \max(S_1+S_2+S_3) < -ca^2 + (b+s)a^3, $$

which is less than zero if a < c/(b + s), and this completes the proof of (i).

The proof of part (ii) of Theorem 8 is basically the same as that of Theorem 6. However, the single equation derived there from the expansion of θ̂_n − θ₀ is now replaced by a system of s equations which must be solved for the differences (θ̂_jn − θ⁰_j). In preparation, it will be convenient to consider quite generally a set of random linear equations in s unknowns,

$$ \sum_{k=1}^s A_{jkn}Y_{kn} = T_{jn} \qquad (j=1,\ldots,s). \tag{25} $$

Lemma 3 Let (T_1n, . . . , T_sn) be a sequence of random vectors tending weakly to (T₁, . . . , T_s) and suppose that for each fixed j and k, A_jkn is a sequence of random variables tending in probability to constants a_jk for which the matrix A = ‖a_jk‖ is nonsingular. Let B = ‖b_jk‖ = A⁻¹. Then if the distribution of (T₁, . . . , T_s) has a density with respect to Lebesgue measure over E_s, the solutions (Y_1n, . . . , Y_sn) of (25) tend in law to the solutions (Y₁, . . . , Y_s) of Σ_{k=1}^s a_jk Y_k = T_j, 1 ≤ j ≤ s, given by Y_j = Σ_{k=1}^s b_jk T_k.

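In generalization of the proof of Theorem 6, expand ∂L(θ)/∂θ_j = L′_j(θ) about θ⁰ to obtain

$$ L'_j(\theta) = L'_j(\theta^0) + \sum_k(\theta_k-\theta_k^0)L''_{jk}(\theta^0) + \frac12\sum_k\sum_\ell(\theta_k-\theta^0_k)(\theta_\ell-\theta^0_\ell)L^{(3)}_{jk\ell}(\theta^*), \tag{26} $$

where L″_jk and L⁽³⁾_jkℓ denote the indicated second and third derivatives of L and where θ* is a point on the line segment connecting θ and θ⁰. In this expansion, replace θ by a solution θ̂_n of the likelihood equations, which by part (i) of this theorem can be assumed to exist with probability tending to 1 and to be consistent. The left side of (26) is then zero and the resulting equations can be written as

$$ \sqrt n\sum_k(\hat\theta_k-\theta^0_k)\left[\frac1n L''_{jk}(\theta^0) + \frac{1}{2n}\sum_\ell(\hat\theta_\ell-\theta^0_\ell)L^{(3)}_{jk\ell}(\theta^*)\right] = -\frac{1}{\sqrt n}L'_j(\theta^0). \tag{27} $$

These have the form (25) with

$$ Y_{kn} = \sqrt n\,(\hat\theta_k-\theta_k^0), \tag{28} $$

$$ A_{jkn} = \frac1n L''_{jk}(\theta^0) + \frac{1}{2n}\sum_\ell(\hat\theta_\ell-\theta_\ell^0)L^{(3)}_{jk\ell}(\theta^*), \tag{29} $$

$$ T_{jn} = -\frac{1}{\sqrt n}L'_j(\theta^0) = -\frac{1}{\sqrt n}\left[\sum_{i=1}^n\frac{\partial}{\partial\theta_j}\log f(X_i,\theta)\right]_{\theta=\theta^0}. \tag{30} $$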

Since E_θ⁰[(∂/∂θ_j) log f(X_i, θ)] = 0 at θ = θ⁰, the multivariate central limit theorem shows that (T_1n, . . . , T_sn) has a limiting multivariate normal distribution with mean zero and covariance matrix I(θ⁰).

On the other hand, it is easy to see, again in parallel to the proof of Theorem 6, that

$$ A_{jkn} \xrightarrow{P} a_{jk} = E_{\theta^0}\left[\frac{\partial^2}{\partial\theta_j\,\partial\theta_k}\log f(X,\theta)\right]_{\theta=\theta^0} = -I_{jk}(\theta^0). $$

The limit distribution of the Y's is therefore that of the solution (Y₁, . . . , Y_s) of the equations

$$ \sum_{k=1}^s I_{jk}(\theta^0)\,Y_k = T_j, $$

where T = (T₁, . . . , T_s) is multivariate normal with mean zero and covariance matrix I(θ⁰). (The sign of a_jk = −I_jk(θ⁰) has been absorbed into T, since −T has the same distribution as T.) It follows that the distribution of Y is that of [I(θ⁰)]⁻¹T, which is a multivariate normal distribution with zero mean and covariance matrix [I(θ⁰)]⁻¹. This completes the proof of asymptotic normality and efficiency.
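As a concrete instance of the limiting covariance [I(θ)]⁻¹, the following worked example may be useful. The N(µ, σ²) model with θ = (µ, σ²) is an illustrative assumption made here and is not taken from the text.

```latex
% Worked example (illustrative assumption): X_1, ..., X_n iid N(\mu, \sigma^2), \theta = (\mu, \sigma^2).
\begin{align*}
\log f(x,\theta) &= -\tfrac12\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}, \\
\frac{\partial^2}{\partial\mu^2}\log f &= -\frac{1}{\sigma^2}, \qquad
\frac{\partial^2}{\partial\mu\,\partial\sigma^2}\log f = -\frac{x-\mu}{\sigma^4}, \qquad
\frac{\partial^2}{\partial(\sigma^2)^2}\log f = \frac{1}{2\sigma^4} - \frac{(x-\mu)^2}{\sigma^6}, \\
I(\theta) &= \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 1/(2\sigma^4) \end{pmatrix}, \qquad
[I(\theta)]^{-1} = \begin{pmatrix} \sigma^2 & 0 \\ 0 & 2\sigma^4 \end{pmatrix}.
\end{align*}
```

Here the likelihood equations have the solution µ̂ = X̄ and σ̂² = (1/n)Σ(X_i − X̄)², and the theorem gives √n(µ̂ − µ) → N(0, σ²) and √n(σ̂² − σ²) → N(0, 2σ⁴) in law.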

3.3 Efficiency and Adaptiveness

If the distribution of the X_i depends on θ = (θ₁, . . . , θ_s), it is interesting to compare the estimation of θ_j when the other parameters are unknown with the situation in which they are known. Such a question arises naturally when some of the parameters are nuisance parameters. For instance, consider estimating µ̃ for a location family f(x − µ̃), or the median µ̃ of a symmetric density f. Then µ̃ is the parameter of interest and f is the nuisance parameter.

If f is known and continuously differentiable, the best asymptotic mean-squared error attainable for estimating µ̃ is (nI)⁻¹, where

$$I = \int \frac{f'^2(x)}{f(x)}\,dx < \infty.$$

The question to be asked is when we can estimate µ̃ asymptotically as well not knowing f as knowing f. A necessary condition, called the orthogonality condition, is given in Stein (1956). If there exists an estimate achieving the bound (nI)⁻¹ when f is unknown, it is called an adaptive estimate of µ̃. Using this condition, Stein indicated that such an estimator does exist for this problem. Completely definite results for this problem were obtained by Beran (1974) and Stone (1975).

Note that this problem is a so-called semiparametric estimation problem in which µ̃ is the parametric component and f is the nonparametric component. Recently, the problem of estimating and testing hypotheses about the parametric component in the presence of an infinite-dimensional nuisance parameter (the nonparametric component) has attracted a lot of attention. The main concerns are whether there exists an adaptive or efficient estimate of the parametric component and whether there is a practical procedure for finding it.

We now consider the finite-dimensional case and derive the orthogonality condition of Stein (1956). It was seen that under regularity conditions there exist estimator sequences θ̂_jn of θ_j, when the other parameters are known, which are asymptotically efficient in the sense that

$$\sqrt{n}\,(\hat\theta_{jn} - \theta_j) \xrightarrow{d} N\left(0, \frac{1}{I_{jj}(\theta)}\right).$$

When the other parameters are unknown,

$$\sqrt{n}\,(\hat\theta_{jn} - \theta_j) \xrightarrow{d} N\left(0, [I(\theta)]^{-1}_{jj}\right).$$

These imply that

$$\frac{1}{I_{jj}(\theta)} \le [I(\theta)]^{-1}_{jj}. \tag{31}$$

Stein (1956) raised the question whether we can estimate θ_j equally well whether or not the other parameters are known. This leads to the question of efficiency and adaptiveness.

The two sides of (31) are equal if

$$I_{ij}(\theta) = 0 \quad \text{for all } j \ne i, \tag{32}$$

as is seen from the definition of the inverse of a matrix, and in fact (32) is also necessary for equality in (31) by the following facts.

Fact. Let

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$$

be a partitioned matrix with A₂₂ square and nonsingular, and let

$$B = \begin{pmatrix} I & -A_{12}A_{22}^{-1} \\ 0 & I \end{pmatrix}.$$

Note that

$$BA = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} & 0 \\ A_{21} & A_{22} \end{pmatrix}.$$

It follows easily that |A| = |A₁₁ − A₁₂A₂₂⁻¹A₂₁| · |A₂₂|. Since A₂₂ is nonsingular, (A⁻¹)₁₁ = (A₁₁)⁻¹ if A₁₂ is a zero matrix.
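The determinant identity can be checked numerically on a small case. The sketch below assumes a hypothetical symmetric 3 × 3 matrix `A` partitioned so that A₁₁ is a scalar and A₂₂ is 2 × 2. It verifies |A| = |A₁₁ − A₁₂A₂₂⁻¹A₂₁| · |A₂₂| and also that the (1, 1) entry of A⁻¹, obtained from the cofactor formula, is the reciprocal of A₁₁ − A₁₂A₂₂⁻¹A₂₁. All names are illustrative only.

```python
from fractions import Fraction as F

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * det2([[m[1][1], m[1][2]], [m[2][1], m[2][2]]])
            - m[0][1] * det2([[m[1][0], m[1][2]], [m[2][0], m[2][2]]])
            + m[0][2] * det2([[m[1][0], m[1][1]], [m[2][0], m[2][1]]]))

# Hypothetical symmetric 3x3 matrix, partitioned with A11 a 1x1 block.
A = [[F(5), F(1), F(2)],
     [F(1), F(4), F(1)],
     [F(2), F(1), F(3)]]
a11 = A[0][0]
a12 = A[0][1:]                      # row vector A12
a21 = [A[1][0], A[2][0]]            # column vector A21
a22 = [row[1:] for row in A[1:]]    # 2x2 block A22

# A22^{-1} A21 via the explicit 2x2 inverse.
d = det2(a22)
a22_inv = [[a22[1][1] / d, -a22[0][1] / d], [-a22[1][0] / d, a22[0][0] / d]]
a22_inv_a21 = [sum(a22_inv[i][k] * a21[k] for k in range(2)) for i in range(2)]

# A11 - A12 A22^{-1} A21, a scalar in this partition.
schur = a11 - sum(a12[k] * a22_inv_a21[k] for k in range(2))

# (A^{-1})_{11} from the cofactor formula: det(A22) / det(A).
inv_11 = det2(a22) / det3(A)

print(det3(A) == schur * det2(a22), inv_11 == 1 / schur)
```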

Equality in (31) for every j implies that I(θ) is diagonal. Suppose the efficient estimator of θ_j depends on the remaining parameters and yet θ_j can be estimated without loss of efficiency when these parameters are unknown. The situation can then be viewed as a rather trivial example of the idea of adaptive estimation. On the other hand, it is known that [nI_jj(θ)]⁻¹ is the smallest asymptotic mean-squared error attainable for estimating θ_j when the other parameters are known. If an estimator achieves this bound, it is called an efficient estimator. Stein (1956) then states that adaptation is not possible unless I_jk(θ) = 0 for k ≠ j.

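We now study the bound [I(θ)]⁻¹₁₁. Write I(θ) as a partitioned matrix

$$I(\theta) = \begin{pmatrix} I_{11}(\theta) & I_{1\cdot}(\theta) \\ I_{1\cdot}^T(\theta) & I_{\cdot\cdot}(\theta) \end{pmatrix},$$

where I_1·(θ) = (I₁₂(θ), . . . , I_1s(θ)) and I_··(θ) is the lower right submatrix of I(θ) of size (s − 1) × (s − 1). Then

$$[I(\theta)]^{-1}_{11} = \frac{1}{I_{11}(\theta) - I_{1\cdot}(\theta)\,[I_{\cdot\cdot}(\theta)]^{-1}\, I_{1\cdot}^T(\theta)}.$$

Recall that

$$I_{ij}(\theta) = E\left( \frac{\partial}{\partial\theta_i} \log f(X, \theta) \cdot \frac{\partial}{\partial\theta_j} \log f(X, \theta) \right).$$

Consider the minimization problem

$$\min_{a_j} E\left[ \frac{\partial}{\partial\theta_1} \log f(X, \theta) - \sum_{j=2}^s a_j \frac{\partial}{\partial\theta_j} \log f(X, \theta) \right]^2,$$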

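and denote the minimizer by a⁰ = (a₂₀, . . . , a_s0)^T. By simple algebra, a⁰ is the solution of the normal equations I_··(θ)a⁰ = I_1·^T(θ), that is, a⁰ = [I_··(θ)]⁻¹I_1·^T(θ). This leads to

$$E\left[ \frac{\partial}{\partial\theta_1} \log f(X, \theta) - \sum_{j=2}^s a_{j0} \frac{\partial}{\partial\theta_j} \log f(X, \theta) \right]^2 = I_{11}(\theta) - I_{1\cdot}(\theta)\,[I_{\cdot\cdot}(\theta)]^{-1}\, I_{1\cdot}^T(\theta).$$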

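Equivalently,

$$[I(\theta)]^{-1}_{11} = \left\{ \min_{a_j} E\left[ \frac{\partial}{\partial\theta_1} \log f(X, \theta) - \sum_{j=2}^s a_j \frac{\partial}{\partial\theta_j} \log f(X, \theta) \right]^2 \right\}^{-1}.$$

If a_j0 = 0 for j = 2, . . . , s, then [I(θ)]⁻¹₁₁ = 1/I₁₁(θ); that is, adaptation is possible.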
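The minimization can be illustrated numerically. Expanding the square and using the definition of I_ij(θ), the objective equals I₁₁ − 2 I_1· a + aᵀ I_·· a. The sketch below assumes a hypothetical 3 × 3 information matrix `info` (not taken from the text), solves the normal equations for a⁰, and compares the bound for θ₁ when the other parameters are unknown with the bound 1/I₁₁(θ) when they are known.

```python
from fractions import Fraction as F

# Hypothetical 3x3 Fisher information matrix; theta_1 is the parameter of interest.
info = [[F(5), F(1), F(2)],
        [F(1), F(4), F(1)],
        [F(2), F(1), F(3)]]
i11 = info[0][0]
i1dot = info[0][1:]                    # I_1. = (I_12, I_13)
idd = [row[1:] for row in info[1:]]    # I_.. (lower right 2x2 block)

def objective(a):
    """E[(score_1 - sum_j a_j score_j)^2] expressed through I(theta)."""
    cross = sum(a[k] * i1dot[k] for k in range(2))
    quad = sum(a[j] * idd[j][k] * a[k] for j in range(2) for k in range(2))
    return i11 - 2 * cross + quad

# Solve the normal equations I_.. a0 = I_1.^T by Cramer's rule.
det = idd[0][0] * idd[1][1] - idd[0][1] * idd[1][0]
a0 = [(i1dot[0] * idd[1][1] - idd[0][1] * i1dot[1]) / det,
      (idd[0][0] * i1dot[1] - idd[1][0] * i1dot[0]) / det]

min_value = objective(a0)       # I_11 - I_1. I_..^{-1} I_1.^T
bound_unknown = 1 / min_value   # [I(theta)^{-1}]_11
bound_known = 1 / i11           # 1 / I_11(theta)

print(a0, min_value, bound_known, bound_unknown)
```

In this example a⁰ is nonzero, so the bound with unknown nuisance parameters is strictly larger than 1/I₁₁(θ) and adaptation is not possible.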

As illustrations, we will consider three examples. The first example is on the estimation of the regression coefficients of a linear regression, and the next two examples are on the estimation of the parametric component in a semiparametric model. The two particular models we consider are the partial spline model (Wahba, 1984) and the two-sample proportional hazard model (Cox, 1972).

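Example 1. (Linear Regression) Assume that y = β₀ + β₁x + ε and ε ∼ N(0, σ²). Let (β̂₀, β̂₁) denote the least squares estimate. It follows easily that

$$\mathrm{Var}\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix} = \sigma^2 \begin{pmatrix} n & \sum_i X_i \\ \sum_i X_i & \sum_i X_i^2 \end{pmatrix}^{-1} = \sigma^2 \begin{pmatrix} \dfrac{\sum_i X_i^2}{n \sum_i (X_i - \bar X)^2} & -\dfrac{\bar X}{\sum_i (X_i - \bar X)^2} \\[2ex] -\dfrac{\bar X}{\sum_i (X_i - \bar X)^2} & \dfrac{1}{\sum_i (X_i - \bar X)^2} \end{pmatrix}.$$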

When β₀ is known, the variance of the least squares estimate of β₁ is σ²/∑_i X_i²; when β₀ is unknown, it is σ²/∑_i (X_i − X̄)². Hence the necessary and sufficient condition for adaptiveness is that X̄ = 0. When we write the model in matrix form, the condition X̄ = 0 means that the two vectors (1, · · · , 1)^T and (X₁, · · · , X_n)^T are orthogonal. Note that (1, · · · , 1)^T and (X₁, · · · , X_n)^T are associated with β₀ and β₁, respectively.
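A small numerical illustration of the condition X̄ = 0 follows. It is only a sketch: the design points and the function name `slope_variances` are hypothetical, and σ² is set to 1 for convenience. The function returns the variance of the least squares estimate of β₁ when β₀ is known and when it is unknown.

```python
from fractions import Fraction as F

def slope_variances(xs, sigma2=F(1)):
    """Return (Var of beta1-hat with beta0 known, Var with beta0 unknown)."""
    n = len(xs)
    xbar = sum(xs) / n
    sum_sq = sum(x * x for x in xs)           # sum of X_i^2
    sxx = sum((x - xbar) ** 2 for x in xs)    # sum of (X_i - Xbar)^2
    return sigma2 / sum_sq, sigma2 / sxx

centered = [F(-2), F(-1), F(0), F(1), F(2)]   # design with Xbar = 0
shifted = [x + 3 for x in centered]           # same spread, Xbar = 3

known_c, unknown_c = slope_variances(centered)
known_s, unknown_s = slope_variances(shifted)

print(known_c, unknown_c)   # equal: no loss from not knowing beta0
print(known_s, unknown_s)   # known-beta0 variance is strictly smaller
```

For the centered design the two variances coincide, while shifting the design away from zero makes knowledge of β₀ informative about β₁.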

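On the other hand, we can use the above derivation. Observe that

$$\frac{\partial \log f}{\partial\beta_0} = \frac{Y - \beta_0 - \beta_1 X}{\sigma^2} = \frac{\epsilon}{\sigma^2}, \qquad \frac{\partial \log f}{\partial\beta_1} = \frac{(Y - \beta_0 - \beta_1 X)X}{\sigma^2} = \frac{\epsilon X}{\sigma^2}.$$

Then

$$\min_a E\left( \frac{\partial \log f}{\partial\beta_1} - a \frac{\partial \log f}{\partial\beta_0} \right)^2 = \min_a \frac{1}{\sigma^2} E(X - a)^2 = \frac{1}{\sigma^2} \mathrm{Var}(X).$$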

Example 2. (Partial Spline Model) Assume that Y = βX + g(T) + ε and ε ∼ N(0, σ²). Suppose that g ∈ L₂[a, b], the set of all square integrable functions on the interval [a, b]. With proper care of some mathematical subtleties, g(T) can be written as ∑_{j=1}^∞ b_j φ_j(T), where {φ_j} is a complete basis of L₂[a, b].

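Observe that

$$\frac{\partial \log f}{\partial\beta} = \frac{\epsilon X}{\sigma^2}, \qquad \frac{\partial \log f}{\partial b_j} = \frac{\epsilon\, \phi_j(T)}{\sigma^2}.$$

Then

$$\min_{a_j} E\left[ \frac{\partial \log f}{\partial\beta} - \sum_j a_j \frac{\partial \log f}{\partial b_j} \right]^2 = \min_{a_j} E\left\{ \frac{\epsilon^2}{\sigma^4} \left[ X - \sum_j a_j \phi_j(T) \right]^2 \right\} = \frac{1}{\sigma^2} \min_h E(X - h(T))^2.$$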

Therefore, the minimizer is h(T) = E(X|T). This means that adaptation is possible when E(X|T) = 0, in which case X is uncorrelated with every function of T. Otherwise, the efficient bound for estimating β is

$$\frac{\sigma^2}{E(X - E(X|T))^2} = \frac{\sigma^2}{E\,\mathrm{Var}(X|T)}.$$

Refer to Chen (1988) and Speckman (1988) for further references and the construction of an efficient estimate of β.
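To make the bound of Example 2 concrete, the sketch below evaluates E Var(X|T) for a small discrete joint distribution of (X, T). The pmf, the choice σ² = 1, and the function name are assumptions made only for illustration. The comparison value σ²/E(X²) is the bound when g is known, which follows from I_ββ = E(ε²X²)/σ⁴ = E(X²)/σ².

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (X, T): keys are (x, t), values are probabilities.
pmf = {(F(0), 0): F(1, 4), (F(2), 0): F(1, 4),
       (F(1), 1): F(1, 4), (F(5), 1): F(1, 4)}

def expected_conditional_variance(pmf):
    """E[Var(X | T)] = E[(X - E(X | T))^2] for a finite joint pmf."""
    p_t, ex_t = {}, {}
    for (x, t), p in pmf.items():
        p_t[t] = p_t.get(t, 0) + p
        ex_t[t] = ex_t.get(t, 0) + p * x
    cond_mean = {t: ex_t[t] / p_t[t] for t in p_t}   # h(t) = E(X | T = t)
    return sum(p * (x - cond_mean[t]) ** 2 for (x, t), p in pmf.items())

sigma2 = F(1)
ev = expected_conditional_variance(pmf)
bound_g_unknown = sigma2 / ev                        # sigma^2 / E Var(X | T)
ex2 = sum(p * x * x for (x, _), p in pmf.items())    # E(X^2)
bound_g_known = sigma2 / ex2                         # sigma^2 / E(X^2)

print(ev, bound_g_unknown, bound_g_known)
```

Here E(X|T) is not zero, so the bound with g unknown exceeds the bound with g known and adaptation is not possible for this distribution.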

Example 3. Let t₁, · · · , t_n be fixed constants. Suppose that X_i ∼ Bin(1, F (t_i)) and Y_i ∼ Bin(1, F^θ(t_i)). We now derive a lower bound on the