
# Chapter 4. Method of Maximum Likelihood


### 1 Introduction

Many statistical procedures are based on statistical models that specify the conditions under which the data are generated. Usually the assumption is made that the observations $x_1, \ldots, x_n$ are realizations of random variables that are (i) independent and (ii) identically distributed with common pdf $f(x_i, \theta)$. Once this model is specified, the statistician tries to find optimal solutions to the problem at hand (usually inference on a set of parameters $\theta \in \Theta \subset \mathbb{R}^k$ characterizing the uncertainty about the model).

The procedure just described is not always easy to carry out. In fact, when confronted with a set of data three attitudes are possible:

• The statistician may be a “pessimist” who does not believe in any particular model f (x, θ). In this case he must be satisfied with descriptive methods (like exploratory data analysis) without the possibility of inductive inference.

• The statistician may be an “optimist” who strongly believes in one model. In this case the analysis is straightforward and optimal solutions may often be easily obtained.

• The statistician may be a “realist”: he would like to specify a particular model f (x, θ) in order to get operational results, but he may have either some doubt about the validity of this hypothesis or some difficulty in choosing a particular parametric family.

Let us illustrate this kind of preoccupation with an example. Suppose that the parameter of interest is the “center” of some population. In many situations, the statistician may argue that, due to a central limit effect, the data are generated by a normal pdf. In this case the problem is restricted to the problem of inference on µ, the mean of the population. But in some cases, he may have some doubt about these central limit effects and may suspect some skewness and/or some kurtosis or he may suspect that some observations are generated by other models (leading to the presence of outliers).

In this context three types of question may be raised to avoid gross errors in the prediction, or in the inference:

• Does the optimal solution, computed for the assumed model f (x, θ), still have “good” properties if the true model is a little different?

• Are the optimal solutions computed for other models near to the original one really substantially different?

• Is it possible to compute (exactly or approximately) optimal solutions for a wider class of models based on very few assumptions?

The first question is concerned with the sensitivity of a given criterion to the hypotheses (criterion robustness). In the second question, it is the sensitivity of the inference which is analyzed (inference robustness). The last question may be viewed as a tentative first step towards the development of nonparametric methods (i.e. methods based on a very large parametric space).

### 2 Information Bound

Any statistical inference starts from a basic family of probability measures expressing our prior knowledge about the nature of the probability measures from which the observations originate. Formally, a model $\mathcal{P}$ is a collection of probability measures $P$ on $(\mathcal{X}, \mathcal{A})$, where $\mathcal{X}$ is the sample space with a σ-field of subsets $\mathcal{A}$. If

$$\mathcal{P} = \{P_\theta : \theta \in \Theta\}, \qquad \Theta \subset \mathbb{R}^s$$

for some $s$, then $\mathcal{P}$ is a parametric model. On the other hand, if

$$\mathcal{P} = \{\text{all } P \text{ on } (\mathcal{X}, \mathcal{A})\},$$

then $\mathcal{P}$ is often referred to as a nonparametric model.

Suppose that we have a fully specified parametric family of models. Denote the parameter of interest by θ. Suppose that we wish to calculate from the data a single value representing the “best estimate” that we can make of the unknown parameter. We call such a problem one of point estimation.

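Define the information matrix as the $s \times s$ matrix

$$I(\theta) = \|I_{ij}(\theta)\|,$$

where

$$I_{ij}(\theta) = E_\theta\left[\frac{\partial \log f(X;\theta)}{\partial\theta_i}\,\frac{\partial \log f(X;\theta)}{\partial\theta_j}\right].$$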

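When $s = 1$, $I(\theta)$ is known as the Fisher information. Under regularity conditions, we have

$$E_\theta\left[\frac{\partial}{\partial\theta_i}\log f(X;\theta)\right] = 0 \tag{1}$$

and

$$I_{ij}(\theta) = \mathrm{cov}\left[\frac{\partial}{\partial\theta_i}\log f(X;\theta),\ \frac{\partial}{\partial\theta_j}\log f(X;\theta)\right].$$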

Being a covariance matrix, $I(\theta)$ is positive semidefinite, and it is positive definite unless the $(\partial/\partial\theta_i)\log f(X;\theta)$, $i = 1, \ldots, s$, are affinely dependent (and hence, by (1), linearly dependent). When the density also has second derivatives, we have the following alternative expression for $I_{ij}(\theta)$:

$$I_{ij}(\theta) = -E_\theta\left[\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log f(X;\theta)\right].$$
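The identities above can be checked numerically on a concrete model. The following is a minimal sketch, assuming a Poisson(θ) model (an illustrative choice not made in the text), for which the score is x/θ − 1 and the Fisher information is 1/θ. The truncation point `upper` and all function names are assumptions of this sketch.

```python
import math

def poisson_pmf(x, theta):
    """Poisson probability f(x; theta), computed on the log scale for stability."""
    return math.exp(-theta + x * math.log(theta) - math.lgamma(x + 1))

def score(x, theta):
    """d/dtheta log f(x; theta) = x/theta - 1."""
    return x / theta - 1.0

def second_derivative(x, theta):
    """d^2/dtheta^2 log f(x; theta) = -x/theta^2."""
    return -x / theta ** 2

def expectation(g, theta, upper=100):
    """E_theta[g(X)], with the infinite sum truncated at `upper` (an assumption)."""
    return sum(g(x) * poisson_pmf(x, theta) for x in range(upper))

theta = 2.5
mean_score = expectation(lambda x: score(x, theta), theta)
info_from_variance = expectation(lambda x: score(x, theta) ** 2, theta)
info_from_curvature = -expectation(lambda x: second_derivative(x, theta), theta)

print(mean_score, info_from_variance, info_from_curvature, 1.0 / theta)
```

In this model the mean score vanishes, as in (1), and the variance form and the second-derivative form of the information both agree with 1/θ up to truncation and rounding error.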

To make the above statements precise, we make the following assumptions when $s = 1$:

(i) Θ is an open interval (finite, infinite, or semi-infinite).

(ii) The distributions $P_\theta$ have common support, so that without loss of generality the set $A = \{x : p_\theta(x) > 0\}$ is independent of θ.

(iii) For any $x$ in $A$ and θ in Θ, the derivative $p'_\theta(x) = \partial p_\theta(x)/\partial\theta$ exists and is finite.

We refer to assumptions (i)-(iii) collectively as (2).

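Lemma 1 (i) If (2) holds, and the derivative with respect to θ of the left side of

$$\int f(x;\theta)\,d\mu(x) = 1 \tag{3}$$

can be obtained by differentiating under the integral sign, then

$$E_\theta\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right] = 0$$

and

$$I(\theta) = \mathrm{var}_\theta\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right]. \tag{4}$$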

(ii) If, in addition, the second derivative with respect to θ of $\log f(x;\theta)$ exists for all $x$ and θ, and the second derivative with respect to θ of the left side of (3) can be obtained by differentiating twice under the integral sign, then

$$I(\theta) = -E_\theta\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right].$$

Let us now derive the information inequality for $s = 1$.

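Theorem 1 Suppose (2) and (4) hold, and that $I(\theta) > 0$. Let δ be any statistic with $E_\theta(\delta^2) < \infty$ for which the derivative with respect to θ of $E_\theta(\delta)$ exists and can be obtained by differentiating under the integral sign. Then

$$\mathrm{var}_\theta(\delta) \ge \frac{\left[\frac{\partial}{\partial\theta}E_\theta(\delta)\right]^2}{I(\theta)}.$$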

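Proof. For any estimator δ of $g(\theta)$ and any function $\psi(x,\theta)$ with finite second moment, the covariance inequality states that

$$\mathrm{var}_\theta(\delta) \ge \frac{[\mathrm{cov}(\delta,\psi)]^2}{\mathrm{var}(\psi)}. \tag{5}$$

Denote $g(\theta) = E_\theta\delta$ and set

$$\psi(X,\theta) = \frac{\partial}{\partial\theta}\log f(X;\theta).$$

Since $E_\theta\psi = 0$, we have $\mathrm{cov}(\delta,\psi) = E_\theta(\delta\psi)$. If differentiation under the integral sign is permitted in $E_\theta\delta$, it then follows that

$$\mathrm{cov}(\delta,\psi) = \int \delta(x)\,\frac{f'(x;\theta)}{f(x;\theta)}\,f(x;\theta)\,dx = g'(\theta)$$

and hence

$$\mathrm{var}_\theta(\delta) \ge \frac{[g'(\theta)]^2}{\mathrm{var}_\theta\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right]}.$$

This completes the proof of the theorem.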
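A small simulation can make the bound of Theorem 1 concrete. The sketch below assumes n iid observations from an exponential distribution with mean θ (an illustrative model not used in the text), for which the information in one observation is 1/θ² and the information in the sample is n/θ². It compares two unbiased estimators of θ: the sample mean, which attains the bound θ²/n, and n times the sample minimum, which does not. All names and the choices of θ, n, and the number of replications are hypothetical.

```python
import random
import statistics

def simulate(theta, n, reps, seed=0):
    """Draw `reps` samples of size n; return the two estimators for each sample."""
    rng = random.Random(seed)
    means, scaled_mins = [], []
    for _ in range(reps):
        sample = [rng.expovariate(1.0 / theta) for _ in range(n)]
        means.append(sum(sample) / n)        # unbiased for theta
        scaled_mins.append(n * min(sample))  # also unbiased for theta
    return means, scaled_mins

theta, n = 2.0, 20
means, scaled_mins = simulate(theta, n, reps=20000)

bound = theta ** 2 / n                       # 1 / (n I(theta)), with I(theta) = 1 / theta^2
var_mean = statistics.variance(means)
var_scaled_min = statistics.variance(scaled_mins)

print(bound, var_mean, var_scaled_min)
```

Under these assumptions the simulated variance of the sample mean should sit close to the bound, while the variance of the scaled minimum is far above it.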

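If δ is an unbiased estimator of θ based on $n$ iid observations, the information in the sample is $nI(\theta)$, where $I(\theta)$ is the information in a single observation, and the inequality becomes

$$\mathrm{var}_\theta(\delta) \ge \frac{1}{nI(\theta)}.$$

This inequality provides a lower bound for the variance of any unbiased estimator. The quantity $1/[nI(\theta)]$ is known as the “Cramér-Rao lower bound.”

Likewise, we can also obtain the information inequality for general $s$. We begin by generalizing the covariance inequality (5) to one involving many $\psi_i$ $(i = 1, \ldots, r)$.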

Theorem 2 For any unbiased estimator δ of $g(\theta)$ and any functions $\psi_i(x,\theta)$ with finite second moments, we have

$$\mathrm{var}(\delta) \ge \gamma' C^{-1}\gamma, \tag{6}$$

where $\gamma' = (\gamma_1, \ldots, \gamma_r)$ and $C = \|C_{ij}\|$ are defined by

$$\gamma_i = \mathrm{cov}(\delta,\psi_i), \qquad C_{ij} = \mathrm{cov}(\psi_i,\psi_j). \tag{7}$$

Proof. Replace $Y$ by δ and $X_i$ by $\psi_i(X,\theta)$ in the following lemma. Then the fact that $\rho^{*2} \le 1$ yields the theorem.

Let $(X_1, \ldots, X_r)$ and $Y$ be random variables with finite second moments, and consider the correlation coefficient $\mathrm{corr}(\sum_i a_iX_i, Y)$. Its maximum value $\rho^*$ over all $(a_1, \ldots, a_r)$ is the multiple correlation coefficient between $Y$ and the vector $(X_1, \ldots, X_r)$.

Lemma 2 Let $(X_1, \ldots, X_r)$ and $Y$ have finite second moments, let $\gamma_i = \mathrm{cov}(X_i, Y)$, and let Σ be the covariance matrix of the $X$'s. Without loss of generality, suppose Σ is positive definite. Then

$$\rho^{*2} = \frac{\gamma'\Sigma^{-1}\gamma}{\mathrm{var}(Y)}. \tag{8}$$

Proof. Since a correlation coefficient is invariant under scale changes, the $a$'s maximizing the correlation are not uniquely determined. Without loss of generality, we therefore impose the condition $\mathrm{var}(\sum_i a_iX_i) = a'\Sigma a = 1$. In view of $a'\Sigma a = 1$,

$$\mathrm{corr}\Big(\sum_i a_iX_i,\ Y\Big) = \frac{a'\gamma}{\sqrt{\mathrm{var}(Y)}}.$$

The problem then becomes that of maximizing $a'\gamma$ subject to $a'\Sigma a = 1$. Using the method of undetermined multipliers, one maximizes instead

$$a'\gamma - \frac{\lambda}{2}\,a'\Sigma a \tag{9}$$

with respect to $a$ and then determines λ so as to satisfy $a'\Sigma a = 1$. Differentiation of (9) with respect to the $a_i$ leads to a system of linear equations with the unique solution

$$a = \frac{1}{\lambda}\,\Sigma^{-1}\gamma, \tag{10}$$

and the side condition $a'\Sigma a = 1$ gives

$$\lambda = \pm\sqrt{\gamma'\Sigma^{-1}\gamma}.$$

Substituting these values of λ into (10), one finds that

$$a = \pm\frac{\Sigma^{-1}\gamma}{\sqrt{\gamma'\Sigma^{-1}\gamma}},$$

and the maximum value of $\mathrm{corr}(\sum_i a_iX_i, Y)$, $\rho^*$, is therefore the positive square root of the right side of (8).

Note that always $0 \le \rho^* \le 1$, and that $\rho^*$ is 1 if and only if constants $a_1, \ldots, a_r$ and $b$ exist such that $Y = \sum_i a_iX_i + b$.
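Lemma 2 can be verified numerically for r = 2. The sketch below uses hypothetical values of Σ, γ, and var(Y), chosen only so that the joint covariance is valid. It computes the maximizing vector from the proof and compares the resulting correlation with (8) and with a crude random search over other coefficient vectors. The helper names are assumptions of this sketch.

```python
import math
import random

# Hypothetical second moments, chosen only for illustration.
SIGMA = [[2.0, 0.5], [0.5, 1.0]]   # covariance matrix of (X1, X2)
GAMMA = [1.0, 0.3]                 # gamma_i = cov(X_i, Y)
VAR_Y = 1.5                        # var(Y)

def solve_2x2(m, v):
    """Return m^{-1} v for a 2 x 2 matrix m."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(m[1][1] * v[0] - m[0][1] * v[1]) / det,
            (m[0][0] * v[1] - m[1][0] * v[0]) / det]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def corr_with_y(a):
    """corr(sum_i a_i X_i, Y) = a'gamma / sqrt(a'Sigma a * var(Y))."""
    sigma_a = [dot(row, a) for row in SIGMA]
    return dot(a, GAMMA) / math.sqrt(dot(a, sigma_a) * VAR_Y)

sigma_inv_gamma = solve_2x2(SIGMA, GAMMA)
q = dot(GAMMA, sigma_inv_gamma)                        # gamma' Sigma^{-1} gamma
a_star = [x / math.sqrt(q) for x in sigma_inv_gamma]   # maximizer with a'Sigma a = 1
rho_star = corr_with_y(a_star)

# Crude random search: no other coefficient vector should do better.
rng = random.Random(0)
best_random = max(corr_with_y([rng.uniform(-1, 1), rng.uniform(-1, 1)])
                  for _ in range(5000))

print(rho_star ** 2, q / VAR_Y, best_random)
```

The squared correlation at `a_star` matches γ′Σ⁻¹γ / var(Y), and the random search never exceeds `rho_star`.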

Let us now state the information inequality for the multiparameter case in which θ = (θ1, . . . , θs).

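Theorem 3 Suppose that (1) holds and that $I(\theta)$ is positive definite. Let δ be any statistic with $E_\theta(\delta^2) < \infty$ for which the derivative of $E_\theta(\delta)$ with respect to $\theta_i$ exists for each $i$ and can be obtained by differentiating under the integral sign. Then

$$\mathrm{var}_\theta(\delta) \ge \alpha' I^{-1}(\theta)\,\alpha, \tag{11}$$

where $\alpha'$ is the row matrix with $i$th element

$$\alpha_i = \frac{\partial}{\partial\theta_i}E_\theta(\delta(X)).$$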

Proof. If the functions $\psi_i$ of Theorem 2 are taken to be $\psi_i = (\partial/\partial\theta_i)\log f(X;\theta)$, this theorem follows immediately.

Under regularity conditions on the class of estimators $\hat\theta_n$ under consideration, it may be asserted that if $\hat\theta_n$ is $AN(\theta, n^{-1}\Sigma(\theta))$, then the condition

$$\Sigma(\theta) - I(\theta)^{-1} \text{ is nonnegative definite}$$

must hold. (See Ch. 2.6 and 2.7 of Lehmann (1983) for further details.) In this respect, an estimator $\hat\theta_n$ which is $AN(\theta, n^{-1}I(\theta)^{-1})$ is “optimal.” (Such an estimator need not exist.)

The following definition is thus motivated. An estimator $\hat\theta_n$ which is $AN(\theta, n^{-1}I(\theta)^{-1})$ is called asymptotically efficient, or best asymptotically normal (BAN). Under suitable regularity conditions, an asymptotically efficient estimate exists. One approach toward finding such estimates is the method of maximum likelihood. Neyman (1949) pointed out that these large-sample criteria were also satisfied by other estimates. He defined a class of best asymptotically normal estimates. So far, we have described three desirable properties of $\hat\theta_n$: unbiasedness, consistency, and efficiency. We now describe a general procedure that produces an asymptotically unbiased, consistent, and asymptotically efficient estimator.

### 3 Maximum Likelihood Methodology

Many statistical techniques were invented in the nineteenth century by experimental scientists who personally applied their methods to authentic data sets. In these conditions the limits of what is computationally feasible are spontaneously observed. Until quite recently these limits were set by the capacity of the human calculator, equipped with pencil and paper and with such aids as the slide rule, tables of logarithms, and other convenient tables, which have been in constant use from the seventeenth century until well into the twentieth. Until the advent of the electronic computer, the powers of the human operator set the standard. This restriction has left its mark on statistical technique, and many new developments have taken place since it was lifted.

The first result of this modern computing revolution is that estimates defined by nonlinear equations can be established as a matter of routine by the appropriate iterative algorithms. This permits the use of nonlinear functional forms. Although the progress of computing technology made nonlinear estimation possible, the statistical theory of Maximum Likelihood provided techniques and respectability. Its principle was first put forward as a novel and original method of deriving estimators by R.A. Fisher in the early 1920s. It very soon proved to be a fertile approach to statistical inference in general, and was widely adopted; but the exact properties of the ensuing estimators and test procedures were only gradually discovered.

Let the observations $x = (x_1, \ldots, x_n)$ be realized values of random variables $X = (X_1, \ldots, X_n)$, and suppose that the random vector $X$ has density $f_X(x;\theta)$ with respect to some σ-finite measure ν. Here θ is the scalar parameter to be determined. The likelihood function corresponding to an observed vector $x$ from the density $f_X(x;\theta)$ is written

$$\mathrm{Lik}_X(\theta'; x) = f_X(x;\theta'),$$

whose logarithm is denoted by $L(\theta'; x)$. When the $X_i$ are iid with probability density $f(x;\theta)$ with respect to a σ-finite measure µ,

$$f_X(x;\theta) = \prod_{i=1}^{n} f(x_i;\theta).$$

If the parameter space is Ω, then the maximum likelihood estimate (MLE) $\hat\theta = \hat\theta(x)$ is that value of $\theta'$ maximizing $\mathrm{Lik}_X(\theta'; x)$, or equivalently its logarithm $L(\theta'; x)$, over Ω. That is,

$$L(\hat\theta; x) \ge L(\theta'; x) \qquad (\theta' \in \Omega). \tag{12}$$

$L(\theta; x)$ is called the log-likelihood. Note that $L$ is regarded as a function of θ with $x$ fixed.

An MLE may not exist. It certainly exists if Ω is compact and $f(x;\theta)$ is upper semicontinuous in θ for all $x$. As an example, consider $U(\theta, \theta+1)$. Later on, we shall use the shorthand notation $L(\theta)$ for $L(\theta; x)$ and $L'(\theta), L''(\theta), \ldots$ for its derivatives with respect to θ. (Note that $f$ is said to be upper semicontinuous if $\{x \mid f(x) < \alpha\}$ is an open set for every real α.)

Fisher was the first to study and establish optimum properties of estimates obtained by maximizing the likelihood function, using criteria such as consistency and efficiency (involving asymptotic variances) in large samples. At that time, however, the computations involved were hardly practicable, and this prevented a widespread adoption of these methods. Since then computer technology has become generally accessible, and Maximum Likelihood (ML) methodology is now widely used.

It is a constant theme of the history of the method that the use of ML techniques is not always accompanied by a clear appreciation of their limitations. Le Cam (1953) complains that

· · · although all efforts at a proof of the general existence of [asymptotically] efficient estimates · · · as well as a proof of the efficiency of ML estimates were obviously inaccurate and although accurate proofs of similar statements always referred not to the general case but to particular classes of estimates · · · a general belief became established that the above statements are true in the most general sense.

As an illustration, consider the famous Neyman-Scott (1948) problem. In this example, the MLE is not even consistent.

Example 1. Estimation of a Common Variance. Let $X_{\alpha j}$ $(j = 1, \ldots, r)$ be independently distributed according to $N(\theta_\alpha, \sigma^2)$, $\alpha = 1, \ldots, n$. The MLEs are

$$\hat\theta_\alpha = \bar X_{\alpha\cdot}, \qquad \hat\sigma^2 = \frac{1}{rn}\sum_{\alpha}\sum_{j}(X_{\alpha j} - \bar X_{\alpha\cdot})^2.$$

Furthermore, these are the unique solutions of the likelihood equations.

However, in the present case, the MLE of $\sigma^2$ is not even consistent. To see this, note that the statistics

$$S_\alpha^2 = \sum_j (X_{\alpha j} - \bar X_{\alpha\cdot})^2$$

are independent and identically distributed with expectation

$$E(S_\alpha^2) = (r-1)\sigma^2,$$

so that $\sum_\alpha S_\alpha^2/n \to (r-1)\sigma^2$ and hence

$$\hat\sigma^2 \to \frac{r-1}{r}\,\sigma^2 \quad \text{in probability.}$$
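The inconsistency in Example 1 is easy to see in simulation. The following is a minimal sketch, assuming r = 2 observations per group, σ = 2, and arbitrary group means (all hypothetical choices). With many groups, the MLE of σ² settles near (r − 1)σ²/r = 2 rather than σ² = 4.

```python
import random

def neyman_scott_mle(n_groups, r, sigma, seed=0):
    """MLE of sigma^2 in the Neyman-Scott model: pooled within-group SS / (r n)."""
    rng = random.Random(seed)
    total_ss = 0.0
    for alpha in range(n_groups):
        theta_alpha = float(alpha % 10)          # arbitrary group means (assumption)
        xs = [rng.gauss(theta_alpha, sigma) for _ in range(r)]
        group_mean = sum(xs) / r                 # MLE of theta_alpha
        total_ss += sum((x - group_mean) ** 2 for x in xs)
    return total_ss / (r * n_groups)

sigma, r = 2.0, 2
sigma2_hat = neyman_scott_mle(n_groups=20000, r=r, sigma=sigma)
limit = (r - 1) / r * sigma ** 2                 # probability limit of the MLE

print(sigma2_hat, limit, sigma ** 2)
```

Increasing the number of groups does not help, because each new group brings its own nuisance parameter θ_α along with its r observations.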

Example 2. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a uniform distribution $U(0, \theta)$. The likelihood function is

$$\mathrm{Lik}(\theta; x) = \frac{1}{\theta^n}, \qquad 0 < x_1, \ldots, x_n < \theta.$$

Clearly the likelihood cannot be maximized with respect to θ by differentiation. However, it is not difficult to find that $\hat\theta_n = X_{(n)}$, the largest observation, which has density function $nt^{n-1}/\theta^n$ for $t \in (0, \theta)$. Then

$$E(\hat\theta_n) = \frac{n\theta}{n+1},$$

so $\hat\theta_n$ is a biased estimator of θ. (But it is asymptotically unbiased.)
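The bias in Example 2 can be checked by simulation. This sketch assumes θ = 3 and two sample sizes (hypothetical choices) and compares the simulated mean of the MLE, the sample maximum, with nθ/(n + 1).

```python
import random

def mean_of_mle(theta, n, reps, seed=0):
    """Average of the MLE max(X_1, ..., X_n) over `reps` simulated U(0, theta) samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += max(rng.uniform(0.0, theta) for _ in range(n))
    return total / reps

theta = 3.0
results = {}
for n in (5, 50):
    simulated = mean_of_mle(theta, n, reps=10000)
    exact = n * theta / (n + 1)
    results[n] = (simulated, exact)
    print(n, simulated, exact)
```

The simulated mean stays below θ for both sample sizes, and the gap shrinks as n grows, in line with asymptotic unbiasedness.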


#### 3.1 Efficient Likelihood Estimation

The example discussed by Neyman and Scott (1948) shows that ML estimates are not consistent in general. We will show that, under regularity conditions, the ML estimates are consistent, asymptotically normal, and asymptotically efficient. For simplicity, our treatment will be confined to the case of a 1-dimensional parameter.

We begin with the following regularity assumptions:

(A0) The distributions Pθ of the observations are distinct (otherwise, θ cannot be estimated consistently).

(A1) The distributions Pθ have common support.

(A2) The observations are X = (X1, . . . , Xn), where the Xi are iid with probability density f (xi, θ) with respect to µ.

(A3) The parameter space Ω contains an open interval ω of which the true parameter value θ0 is an interior point.

Theorem 4 Under assumptions (A0)-(A2),

$$P_{\theta_0}\{f(X_1,\theta_0)\cdots f(X_n,\theta_0) > f(X_1,\theta)\cdots f(X_n,\theta)\} \to 1 \quad \text{as } n \to \infty$$

for any fixed $\theta \ne \theta_0$.

Proof. The inequality is equivalent to

$$\frac{1}{n}\sum_{i=1}^{n}\log\left[\frac{f(X_i,\theta)}{f(X_i,\theta_0)}\right] < 0.$$

By the strong law of large numbers, the left side tends with probability 1 toward

$$E_{\theta_0}\log\left[\frac{f(X,\theta)}{f(X,\theta_0)}\right].$$

Since $-\log$ is strictly convex, Jensen's inequality shows that

$$E_{\theta_0}\log\left[\frac{f(X,\theta)}{f(X,\theta_0)}\right] < \log E_{\theta_0}\left[\frac{f(X,\theta)}{f(X,\theta_0)}\right] = 0, \tag{13}$$

and the result follows.

When $\theta_0$ is the true value, the above proof gives a meaning to the numerical value of the Kullback-Leibler information number defined in Remark 1: the likelihood ratio converges to zero exponentially fast, at rate $I(\theta_0, \theta)$.

Remark 1. Define the Kullback-Leibler information number

$$I(\theta,\eta) = E_\theta\left(\log\frac{f(X,\theta)}{f(X,\eta)}\right).$$

Note that $I(\theta,\eta) \ge 0$, with equality holding if and only if $f(x,\theta) = f(x,\eta)$. $I(\theta,\eta)$ is a measure of the ability of the likelihood ratio to distinguish between $f(X,\theta)$ and $f(X,\eta)$ when the former is true.
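The link between Theorem 4 and the Kullback-Leibler number can be seen numerically. The sketch below assumes the N(θ, 1) location model (an illustrative choice), for which I(θ, η) = (θ − η)²/2. It draws a sample under θ0 and checks that the average log likelihood ratio for a fixed θ ≠ θ0 is close to −I(θ0, θ), and in particular negative.

```python
import math
import random

def log_density(x, theta):
    """log f(x; theta) for the N(theta, 1) model."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - theta) ** 2

def kl_normal(theta, eta):
    """I(theta, eta) = E_theta log[f(X, theta) / f(X, eta)] for unit-variance normals."""
    return 0.5 * (theta - eta) ** 2

def mean_log_ratio(sample, theta, theta0):
    """(1/n) sum_i log[f(X_i, theta) / f(X_i, theta0)]."""
    return sum(log_density(x, theta) - log_density(x, theta0) for x in sample) / len(sample)

rng = random.Random(1)
theta0, theta = 0.0, 1.0                       # true value and a fixed alternative
sample = [rng.gauss(theta0, 1.0) for _ in range(50000)]
avg = mean_log_ratio(sample, theta, theta0)

print(avg, -kl_normal(theta0, theta))
```

Because the average log ratio is near −I(θ0, θ), the likelihood ratio itself behaves roughly like exp(−n I(θ0, θ)).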

Remark 2. If ˆθn is an MLE of θ and if g is a function, then g(ˆθn) is an MLE of g(θ). When g is one-to-one, it holds obviously. If g is many-to-one, this result holds again when the derivative of g is nonzero.

By Theorem 4, the density of X at the true θ0 exceeds that at any other fixed θ with high probability when n is large. We do not know θ0, but we can determine the value ˆθ of θ which maximizes the density of X. However, Theorem 4 alone cannot guarantee that the MLE is consistent, since the law of large numbers underlying (13) would have to hold for all θ ≠ θ0 simultaneously.

However, if Ω is finite, the MLE ˆθn exists, it is unique with probability tending to 1, and it is consistent.

The following theorem is motivated by a simple fact: differentiating $\int f(x,\theta)\,\mu(dx) = 1$ with respect to θ leads to

$$E_{\theta_0}\left[\frac{f'(X,\theta_0)}{f(X,\theta_0)}\right] = 0.$$

Theorem 5 Let $X_1, \ldots, X_n$ satisfy assumptions (A0)-(A3) and suppose that for almost all $x$, $f(x,\theta)$ is differentiable with respect to θ in ω, with derivative $f'(x,\theta)$. Then with probability tending to 1 as $n \to \infty$, the likelihood equation

$$\frac{\partial}{\partial\theta}\,[f(x_1,\theta)\cdots f(x_n,\theta)] = 0 \tag{14}$$

or, equivalently, the equation

$$L'(\theta, x) = \sum_{i=1}^{n}\frac{f'(x_i,\theta)}{f(x_i,\theta)} = 0 \tag{15}$$

has a root $\hat\theta_n = \hat\theta_n(x_1, \ldots, x_n)$ such that $\hat\theta_n(X_1, \ldots, X_n)$ tends to the true value $\theta_0$ in probability.

Proof. Let $a$ be small enough so that $(\theta_0 - a, \theta_0 + a) \subset \omega$, and let

$$S_n = \{x : L(\theta_0, x) > L(\theta_0 - a, x) \text{ and } L(\theta_0, x) > L(\theta_0 + a, x)\}. \tag{16}$$

By Theorem 4, $P_{\theta_0}(S_n) \to 1$. For any $x \in S_n$ there thus exists a value $\theta_0 - a < \hat\theta_n < \theta_0 + a$ at which $L(\theta)$ has a local maximum, so that $L'(\hat\theta_n) = 0$. Hence for any $a > 0$ sufficiently small, there exists a sequence $\hat\theta_n = \hat\theta_n(a)$ of roots such that

$$P_{\theta_0}(|\hat\theta_n - \theta_0| < a) \to 1.$$

It remains to show that we can determine such a sequence which does not depend on $a$.

Let $\hat\theta_n$ be the root closest to $\theta_0$. (This exists because the limit of a sequence of roots is again a root by the continuity of $L(\theta)$.) Then clearly $P_{\theta_0}(|\hat\theta_n - \theta_0| < a) \to 1$, and this completes the proof.

Remarks. 1. This theorem does not establish the existence of a consistent estimator sequence since, with the true value $\theta_0$ unknown, the data do not tell us which root to choose so as to obtain a consistent sequence. An exception, of course, is the case in which the root is unique.

2. It should also be emphasized that existence of a root $\hat\theta_n$ is not asserted for all $x$ (or, for a given $n$, even for any $x$). This does not affect consistency, which only requires $\hat\theta_n$ to be defined on a set $S'_n$, the probability of which tends to 1 as $n \to \infty$.

The above theorem establishes the existence of a consistent root of the likelihood equation. The next theorem asserts that any such sequence is asymptotically normal and efficient.

Theorem 6 Suppose that $X_1, \ldots, X_n$ are iid and satisfy assumptions (A0)-(A3), that the integral $\int f(x,\theta)\,d\mu(x)$ can be twice differentiated under the integral sign, and that a third derivative exists satisfying

$$\left|\frac{\partial^3}{\partial\theta^3}\log f(x,\theta)\right| \le M(x) \tag{17}$$

for all $x \in A$, $\theta_0 - c < \theta < \theta_0 + c$, with $E_{\theta_0}[M(X)] < \infty$. Then any consistent sequence $\hat\theta_n = \hat\theta_n(X_1, \ldots, X_n)$ of roots of the likelihood equation satisfies

$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} N\left(0, \frac{1}{I(\theta_0)}\right).$$

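Proof. For any fixed $x$, expand $L'(\hat\theta_n)$ about $\theta_0$:

$$L'(\hat\theta_n) = L'(\theta_0) + (\hat\theta_n - \theta_0)L''(\theta_0) + \frac{1}{2}(\hat\theta_n - \theta_0)^2 L'''(\theta_n^*),$$

where $\theta_n^*$ lies between $\theta_0$ and $\hat\theta_n$. By assumption, the left side is zero, so that

$$\sqrt{n}\,(\hat\theta_n - \theta_0) = \frac{(1/\sqrt{n})\,L'(\theta_0)}{-(1/n)L''(\theta_0) - (1/2n)(\hat\theta_n - \theta_0)L'''(\theta_n^*)},$$

where it should be remembered that $L(\theta)$, $L'(\theta)$, and so on are functions of $(X_1, \ldots, X_n)$ as well as θ. The desired result follows if we can show that

$$\frac{1}{\sqrt{n}}L'(\theta_0) \xrightarrow{d} N[0, I(\theta_0)], \tag{18}$$

that

$$-\frac{1}{n}L''(\theta_0) \xrightarrow{P} I(\theta_0), \tag{19}$$

and that

$$\frac{1}{n}L'''(\theta_n^*) \text{ is bounded in probability.} \tag{20}$$

Of the above statements, (18) follows from the fact that

$$\frac{1}{\sqrt{n}}L'(\theta_0) = \sqrt{n}\,\frac{1}{n}\sum_{i=1}^{n}\left[\frac{f'(X_i,\theta_0)}{f(X_i,\theta_0)} - E_{\theta_0}\frac{f'(X_i,\theta_0)}{f(X_i,\theta_0)}\right],$$

since the expectation term is zero, and then from the CLT and the definition of $I(\theta)$.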

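Next,

$$-\frac{1}{n}L''(\theta_0) = \frac{1}{n}\sum_{i=1}^{n}\frac{f'^2(X_i,\theta_0) - f(X_i,\theta_0)f''(X_i,\theta_0)}{f^2(X_i,\theta_0)}.$$

By the law of large numbers, this tends in probability to

$$I(\theta_0) - E_{\theta_0}\frac{f''(X_i,\theta_0)}{f(X_i,\theta_0)} = I(\theta_0).$$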

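Finally,

$$\frac{1}{n}L'''(\theta) = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial^3}{\partial\theta^3}\log f(X_i,\theta),$$

so that by (17)

$$\left|\frac{1}{n}L'''(\theta_n^*)\right| < \frac{1}{n}[M(X_1) + \cdots + M(X_n)]$$

with probability tending to 1. The right side tends in probability to $E_{\theta_0}[M(X)]$, and this completes the proof.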

Remarks. 1. This is a strong result. It establishes several major properties of the MLE in addition to its consistency. The MLE is asymptotically normal, which is of great help for the derivation of (asymptotically valid) tests; it is asymptotically unbiased; and it is asymptotically efficient, since the variance of its limiting distribution equals the Cramér-Rao lower bound.

2. As a rule we wish to supplement the parameter estimates by an estimate of their (asymptotic) variance. This will permit us to assess (asymptotic) t-ratios and (asymptotic) confidence intervals. Although the variance may depend on the unknown parameter, we can simply plug in the MLE to get an estimate of the variance.

The usual iterative methods for solving the likelihood equation $L'(\theta) = 0$ are based on replacing $L'(\theta)$ by the linear terms of its Taylor expansion about an approximate solution $\tilde\theta$. Suppose we can use an estimation method such as the method of moments to find a good estimate of θ; denote it by $\tilde\theta$. Then it is quite natural to use $\tilde\theta$ as the initial value of the iteration. Denote the MLE by $\hat\theta$. This leads to the approximation

$$0 = L'(\hat\theta) \approx L'(\tilde\theta) + (\hat\theta - \tilde\theta)L''(\tilde\theta),$$

and hence to

$$\hat\theta = \tilde\theta - \frac{L'(\tilde\theta)}{L''(\tilde\theta)}.$$

The procedure is then iterated according to the above scheme.
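The iteration can be sketched for a model in which the likelihood equation has no closed-form solution. The example below assumes the Cauchy location model with density 1/{π[1 + (x − θ)²]} (an illustrative choice not made in the text) and starts from the sample median rather than a moment estimator, since the Cauchy distribution has no moments. The function names, tolerance, and iteration cap are assumptions of this sketch.

```python
import math
import random
import statistics

def score(theta, xs):
    """L'(theta) for the Cauchy location model."""
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in xs)

def curvature(theta, xs):
    """L''(theta) for the Cauchy location model."""
    return sum(2.0 * ((x - theta) ** 2 - 1.0) / (1.0 + (x - theta) ** 2) ** 2 for x in xs)

def newton_mle(xs, start, tol=1e-10, max_iter=50):
    """Iterate theta <- theta - L'(theta) / L''(theta) until the step is below tol."""
    theta = start
    for _ in range(max_iter):
        step = score(theta, xs) / curvature(theta, xs)
        theta -= step
        if abs(step) < tol:
            break
    return theta

rng = random.Random(42)
true_theta = 1.0
# Cauchy draws via the inverse CDF: theta + tan(pi (U - 1/2)).
xs = [true_theta + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(200)]

start = statistics.median(xs)      # initial approximate solution
theta_hat = newton_mle(xs, start)

print(start, theta_hat, score(theta_hat, xs))
```

The Cauchy likelihood equation can have several roots, so the result depends on the starting value; a starting value close to the truth, such as the median, keeps the iteration near the consistent root.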

The following is a justification for the use of the above one-step approximation as an estimator of θ.

Theorem 7 Suppose that the assumptions of Theorem 6 hold and that $\tilde\theta$ is not only a consistent but a $\sqrt{n}$-consistent estimator of θ, that is, that $\sqrt{n}\,(\tilde\theta - \theta_0)$ is bounded in probability, so that $\tilde\theta$ tends to $\theta_0$ at least at the rate of $1/\sqrt{n}$. Then the estimator sequence

$$\delta_n = \tilde\theta - \frac{L'(\tilde\theta)}{L''(\tilde\theta)} \tag{21}$$

is asymptotically efficient.

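Proof. As in the proof of Theorem 6, expand $L'(\tilde\theta_n)$ about $\theta_0$ as

$$L'(\tilde\theta_n) = L'(\theta_0) + (\tilde\theta_n - \theta_0)L''(\theta_0) + \frac{1}{2}(\tilde\theta_n - \theta_0)^2 L'''(\theta_n^*),$$

where $\theta_n^*$ lies between $\theta_0$ and $\tilde\theta_n$. Substituting this expression into (21) and simplifying, we find

$$\sqrt{n}\,(\delta_n - \theta_0) = \frac{(1/\sqrt{n})\,L'(\theta_0)}{-(1/n)L''(\tilde\theta_n)} + \sqrt{n}\,(\tilde\theta_n - \theta_0)\left[1 - \frac{L''(\theta_0)}{L''(\tilde\theta_n)} - \frac{1}{2}(\tilde\theta_n - \theta_0)\frac{L'''(\theta_n^*)}{L''(\tilde\theta_n)}\right]. \tag{22}$$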

Suppose we can show that the expression in square brackets on the right-hand side of (22) tends to zero in probability and that $L''(\tilde\theta_n)/L''(\theta_0) \to 1$ in probability. The theorem then follows. Both facts follow from $\tilde\theta_n \to \theta_0$ in probability and the expansion

$$\frac{1}{n}L''(\tilde\theta_n) = \frac{1}{n}L''(\theta_0) + \frac{1}{n}(\tilde\theta_n - \theta_0)L'''(\theta_n^{**}),$$

where $\theta_n^{**}$ is between $\theta_0$ and $\tilde\theta_n$.
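Theorem 7 can be illustrated by simulation. The sketch below again assumes the Cauchy location model (an illustrative choice), for which I(θ) = 1/2. The sample median is √n-consistent with asymptotic variance π²/(4n), and one Newton step from the median should bring n times the variance down toward 1/I(θ) = 2. The sample size, replication count, and function names are hypothetical.

```python
import math
import random
import statistics

def score(theta, xs):
    """L'(theta) for the Cauchy location model."""
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in xs)

def curvature(theta, xs):
    """L''(theta) for the Cauchy location model."""
    return sum(2.0 * ((x - theta) ** 2 - 1.0) / (1.0 + (x - theta) ** 2) ** 2 for x in xs)

def one_step(start, xs):
    """The one-step estimator (21): a single Newton step from the starting value."""
    return start - score(start, xs) / curvature(start, xs)

rng = random.Random(7)
theta0, n, reps = 0.0, 100, 2000
medians, one_steps = [], []
for _ in range(reps):
    xs = [theta0 + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
    m = statistics.median(xs)          # sqrt(n)-consistent starting estimator
    medians.append(m)
    one_steps.append(one_step(m, xs))

n_var_median = n * statistics.variance(medians)
n_var_one_step = n * statistics.variance(one_steps)

print(n_var_median, math.pi ** 2 / 4)  # asymptotic value for the median
print(n_var_one_step, 2.0)             # 1 / I(theta) for the Cauchy model
```

At a finite sample size the simulated values only approximate the asymptotic ones, but the one-step estimator should show a visibly smaller variance than the median it starts from.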


#### 3.2 The Multi-parameter Case

So far we have discussed the case in which the distribution depends on a single parameter θ. When extending this theory to probability models involving several parameters $\theta_1, \ldots, \theta_s$, one may be interested either in the simultaneous estimation of these parameters (or certain functions of them) or in the estimation of only some of them. In the latter case, one part of the parameter vector is of intrinsic interest and the rest represents nuisance or incidental parameters that are necessary for a proper statistical model but of no interest in themselves. For instance, if we are only interested in estimating $\sigma^2$ in the Neyman-Scott problem, then the $\theta_\alpha$ are nuisance parameters.

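Let $(X_1, \ldots, X_n)$ be iid with a distribution that depends on $\theta = (\theta_1, \ldots, \theta_s)$ and satisfies assumptions (A0)-(A3). The information matrix $I(\theta)$ is an $s \times s$ matrix with elements $I_{jk}(\theta)$, $j, k = 1, \ldots, s$, defined by

$$I_{jk}(\theta) = \mathrm{cov}\left[\frac{\partial}{\partial\theta_j}\log f(X,\theta),\ \frac{\partial}{\partial\theta_k}\log f(X,\theta)\right].$$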

We shall now show under regularity conditions that with probability tending to 1 there exist solutions $\hat\theta_n = (\hat\theta_{1n}, \ldots, \hat\theta_{sn})$ of the likelihood equations

$$\frac{\partial}{\partial\theta_j}\,[f(x_1,\theta)\cdots f(x_n,\theta)] = 0, \qquad j = 1, \ldots, s,$$

or equivalently

$$\frac{\partial}{\partial\theta_j}\,[L(\theta)] = 0, \qquad j = 1, \ldots, s,$$

such that $\hat\theta_{jn}$ is consistent for estimating $\theta_j$ and asymptotically efficient in the sense of having asymptotic variance $[I(\theta)]^{-1}_{jj}$.

We first state some assumptions:

(A) There exists an open subset ω of Ω containing the true parameter point $\theta_0$ such that for almost all $x$ the density $f(x,\theta)$ admits all third derivatives $(\partial^3/\partial\theta_j\,\partial\theta_k\,\partial\theta_\ell)f(x,\theta)$ for all $\theta \in \omega$.

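(B) The first and second logarithmic derivatives of $f$ satisfy the equations

$$E_\theta\left[\frac{\partial}{\partial\theta_j}\log f(X,\theta)\right] = 0 \qquad \text{for } j = 1, \ldots, s,$$

and

$$I_{jk}(\theta) = E_\theta\left[\frac{\partial}{\partial\theta_j}\log f(X,\theta)\cdot\frac{\partial}{\partial\theta_k}\log f(X,\theta)\right] = E_\theta\left[-\frac{\partial^2}{\partial\theta_j\,\partial\theta_k}\log f(X,\theta)\right].$$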

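(C) Since the $s \times s$ matrix $I(\theta)$ is a covariance matrix, it is positive semidefinite. We shall assume that the $I_{jk}(\theta)$ are finite and that the matrix $I(\theta)$ is positive definite for all θ in ω, and hence that the $s$ statistics

$$\frac{\partial}{\partial\theta_1}\log f(X,\theta),\ \ldots,\ \frac{\partial}{\partial\theta_s}\log f(X,\theta)$$

are affinely independent with probability 1.

(D) Finally, we shall suppose that there exist functions $M_{jk\ell}$ such that

$$\left|\frac{\partial^3}{\partial\theta_j\,\partial\theta_k\,\partial\theta_\ell}\log f(x,\theta)\right| \le M_{jk\ell}(x) \qquad \text{for all } \theta \in \omega,$$

where $m_{jk\ell} = E_{\theta_0}[M_{jk\ell}(X)] < \infty$ for all $j, k, \ell$.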

Theorem 8 Let $X_1, \ldots, X_n$ be iid, each with a density $f(x,\theta)$ (with respect to µ) which satisfies (A0)-(A2) and assumptions (A)-(D) above. Then with probability tending to 1 as $n \to \infty$, there exist solutions $\hat\theta_n = \hat\theta_n(X_1, \ldots, X_n)$ of the likelihood equations such that

(i) $\hat\theta_{jn}$ is consistent for estimating $\theta_j$,

(ii) $\sqrt{n}\,(\hat\theta_n - \theta)$ is asymptotically normal with (vector) mean zero and covariance matrix $[I(\theta)]^{-1}$, and

(iii) $\hat\theta_{jn}$ is asymptotically efficient in the sense that

$$\sqrt{n}\,(\hat\theta_{jn} - \theta_j) \xrightarrow{\mathcal{L}} N(0, [I(\theta)]^{-1}_{jj}).$$

Proof. (i) Existence and Consistency. To prove the existence, with probability tending to 1, of a sequence of solutions of the likelihood equations which is consistent, we shall consider the behavior of the log likelihood $L(\theta)$ on the sphere $Q_a$ with center at the true point $\theta_0$ and radius $a$. We will show that for any sufficiently small $a$ the probability tends to 1 that $L(\theta) < L(\theta_0)$ at all points θ on the surface of $Q_a$, and hence that $L(\theta)$ has a local maximum in the interior of $Q_a$. Since the likelihood equations must be satisfied at a local maximum, it will follow that for any $a > 0$, with probability tending to 1 as $n \to \infty$, the likelihood equations have a solution $\hat\theta_n(a)$ within $Q_a$, and the proof can be completed as in the one-dimensional case.

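To obtain the needed facts concerning the behavior of the likelihood on $Q_a$ for small $a$, we expand the log likelihood about the true point $\theta_0$, whose components we write as $\theta_{0j}$, and divide by $n$ to find

$$\frac{1}{n}L(\theta) - \frac{1}{n}L(\theta_0) = \frac{1}{n}\sum_j A_j(x)(\theta_j - \theta_{0j}) + \frac{1}{2n}\sum_j\sum_k B_{jk}(x)(\theta_j - \theta_{0j})(\theta_k - \theta_{0k})$$

$$\qquad + \frac{1}{6n}\sum_j\sum_k\sum_\ell (\theta_j - \theta_{0j})(\theta_k - \theta_{0k})(\theta_\ell - \theta_{0\ell})\sum_{i=1}^{n}\gamma_{jk\ell}(x_i)M_{jk\ell}(x_i) = S_1 + S_2 + S_3,$$

where

$$A_j(x) = \frac{\partial}{\partial\theta_j}L(\theta)\Big|_{\theta=\theta_0}, \qquad B_{jk}(x) = \frac{\partial^2}{\partial\theta_j\,\partial\theta_k}L(\theta)\Big|_{\theta=\theta_0},$$

and where by assumption (D)

$$0 \le |\gamma_{jk\ell}(x)| \le 1.$$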

To prove that the maximum of this difference for θ on $Q_a$ is negative with probability tending to 1 if $a$ is sufficiently small, we will show that with high probability the maximum of $S_2$ is negative while $S_1$ and $S_3$ are small compared to $S_2$. The basic tools for showing this are the facts that by (B) and the law of large numbers

$$\frac{1}{n}A_j(X) = \frac{1}{n}\,\frac{\partial}{\partial\theta_j}L(\theta)\Big|_{\theta=\theta_0} \to 0 \quad \text{in probability} \tag{23}$$

and

$$\frac{1}{n}B_{jk}(X) = \frac{1}{n}\,\frac{\partial^2}{\partial\theta_j\,\partial\theta_k}L(\theta)\Big|_{\theta=\theta_0} \to -I_{jk}(\theta_0) \quad \text{in probability.} \tag{24}$$

Let us begin with $S_1$. On $Q_a$ we have

$$|S_1| \le \frac{1}{n}\,a\sum_j |A_j(X)|.$$

For any given $a$, it follows from (23) that $|A_j(X)|/n < a^2$ and hence that $|S_1| < sa^3$ with probability tending to 1. Next consider

$$2S_2 = \sum_j\sum_k\left[-I_{jk}(\theta_0)(\theta_j - \theta_{0j})(\theta_k - \theta_{0k})\right] + \sum_j\sum_k\left\{\frac{1}{n}B_{jk}(X) - [-I_{jk}(\theta_0)]\right\}(\theta_j - \theta_{0j})(\theta_k - \theta_{0k}).$$

For the second term it follows from an argument analogous to that for $S_1$ that its absolute value is less than $s^2a^3$ with probability tending to 1. The first term is a negative (nonrandom) quadratic form in the variables $(\theta_j - \theta_{0j})$. By an orthogonal transformation this can be reduced to the diagonal form $\sum_i\lambda_i\xi_i^2$, with $Q_a$ becoming $\sum_i\xi_i^2 = a^2$. The λ's are all negative; number them so that $\lambda_s \le \lambda_{s-1} \le \cdots \le \lambda_1 < 0$. Then $\sum_i\lambda_i\xi_i^2 \le \lambda_1\sum_i\xi_i^2 = \lambda_1a^2$. Combining the first and second terms, we see that there exist $c > 0$, $a_0 > 0$ such that for $a < a_0$

$$S_2 < -ca^2$$

with probability tending to 1.

Finally, with probability tending to 1,

$$\left|\frac{1}{n}\sum_i M_{jk\ell}(X_i)\right| < 2m_{jk\ell}$$

and hence $|S_3| < ba^3$ on $Q_a$, where

$$b = \frac{s^3}{3}\sum_j\sum_k\sum_\ell m_{jk\ell}.$$

Combining the three inequalities, we see that

$$\max(S_1 + S_2 + S_3) < -ca^2 + (b + s)a^3,$$

which is less than zero if $a < c/(b + s)$, and this completes the proof of (i).

The proof of part (ii) of Theorem 8 is basically the same as that of Theorem 6. However, the single equation derived there from the expansion of $\hat\theta_n - \theta_0$ is now replaced by a system of $s$ equations which must be solved for the differences $(\hat\theta_{jn} - \theta_{0j})$. In preparation, it will be convenient to consider quite generally a set of random linear equations in $s$ unknowns

$$\sum_{k=1}^{s}A_{jkn}Y_{kn} = T_{jn} \qquad (j = 1, \ldots, s). \tag{25}$$

Lemma 3 Let $(T_{1n}, \ldots, T_{sn})$ be a sequence of random vectors tending weakly to $(T_1, \ldots, T_s)$, and suppose that for each fixed $j$ and $k$, $A_{jkn}$ is a sequence of random variables tending in probability to constants $a_{jk}$ for which the matrix $A = \|a_{jk}\|$ is nonsingular. Let $B = \|b_{jk}\| = A^{-1}$. Then if the distribution of $(T_1, \ldots, T_s)$ has a density with respect to Lebesgue measure over $E_s$, the solutions $(Y_{1n}, \ldots, Y_{sn})$ of (25) tend in law to the solutions $(Y_1, \ldots, Y_s)$ of $\sum_{k=1}^{s}a_{jk}Y_k = T_j$, $1 \le j \le s$, given by $Y_j = \sum_{k=1}^{s}b_{jk}T_k$.

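In generalization of the proof of Theorem 6, expand $\partial L(\theta)/\partial\theta_j = L'_j(\theta)$ about $\theta_0$ to obtain

$$L'_j(\theta) = L'_j(\theta_0) + \sum_k(\theta_k - \theta_{0k})L''_{jk}(\theta_0) + \frac{1}{2}\sum_k\sum_\ell(\theta_k - \theta_{0k})(\theta_\ell - \theta_{0\ell})L'''_{jk\ell}(\theta^*), \tag{26}$$

where $L''_{jk}$ and $L'''_{jk\ell}$ denote the indicated second and third derivatives of $L$ and where $\theta^*$ is a point on the line segment connecting θ and $\theta_0$. In this expansion, replace θ by a solution $\hat\theta_n$ of the likelihood equations, which by part (i) of this theorem can be assumed to exist with probability tending to 1 and to be consistent. The left side of (26) is then zero, and the resulting equations can be written as

$$\sqrt{n}\sum_k(\hat\theta_k - \theta_{0k})\left[\frac{1}{n}L''_{jk}(\theta_0) + \frac{1}{2n}\sum_\ell(\hat\theta_\ell - \theta_{0\ell})L'''_{jk\ell}(\theta^*)\right] = -\frac{1}{\sqrt{n}}L'_j(\theta_0). \tag{27}$$

These have the form (25) with

$$Y_{kn} = \sqrt{n}\,(\hat\theta_k - \theta_{0k}), \tag{28}$$

$$A_{jkn} = \frac{1}{n}L''_{jk}(\theta_0) + \frac{1}{2n}\sum_\ell(\hat\theta_\ell - \theta_{0\ell})L'''_{jk\ell}(\theta^*), \tag{29}$$

$$T_{jn} = -\frac{1}{\sqrt{n}}L'_j(\theta_0) = -\frac{1}{\sqrt{n}}\left[\sum_{i=1}^{n}\frac{\partial}{\partial\theta_j}\log f(X_i,\theta)\right]_{\theta=\theta_0}. \tag{30}$$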

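Since $E_{\theta_0}[(\partial/\partial\theta_j)\log f(X_i,\theta)] = 0$, the multivariate central limit theorem shows that $(T_{1n}, \ldots, T_{sn})$ has a limiting multivariate normal distribution with mean zero and covariance matrix $I(\theta_0)$.

On the other hand, it is easy to see, again in parallel to the proof of Theorem 6, that

$$A_{jkn} \xrightarrow{P} a_{jk} = E_{\theta_0}\left[\frac{\partial^2}{\partial\theta_j\,\partial\theta_k}\log f(X,\theta)\Big|_{\theta=\theta_0}\right] = -I_{jk}(\theta_0).$$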

The limit distribution of the $Y$'s is therefore that of the solution $(Y_1, \ldots, Y_s)$ of the equations

$$\sum_{k=1}^{s}I_{jk}(\theta_0)Y_k = T_j,$$

where $T = (T_1, \ldots, T_s)$ is multivariate normal with mean zero and covariance matrix $I(\theta_0)$. (The sign of $a_{jk} = -I_{jk}(\theta_0)$ is immaterial here because $T$ and $-T$ have the same distribution.) It follows that the distribution of $Y$ is that of $[I(\theta_0)]^{-1}T$, which is a multivariate normal distribution with zero mean and covariance matrix $[I(\theta_0)]^{-1}$. This completes the proof of asymptotic normality and efficiency.

If the distribution of the Xi depends on θ = (θ1, . . . , θs), it is interesting to compare the estimation of θj when the other parameters are unknown with the situation in which they are known. Such a question arises naturally in the case that part of parameters are the nuisance parameter. For instance, consider estimating ˜µ for a location family f (x − ˜µ) or the median. ˜µ, of a symmetric density, f . Then ˜µ is the parameter of interest and f is the nuisance parameter.

If $f$ is known and continuously differentiable, the best asymptotic mean-squared error attainable for estimating $\tilde\mu$ is $(nI)^{-1}$ where

$$I = \int \frac{f'(x)^2}{f(x)}\, dx < \infty.$$

The question to be asked is when we can estimate $\tilde\mu$ asymptotically as well not knowing $f$ as knowing $f$. A necessary condition, called the orthogonality condition, is given in Stein (1956). An estimate achieving the bound $(nI)^{-1}$ when $f$ is unknown is called an adaptive estimate of $\tilde\mu$. Using this condition, Stein indicated that such an estimator does exist for this problem. Completely definite results for this problem were obtained by Beran (1974) and Stone (1975).


Note that this problem is a so-called semiparametric estimation problem in which $\tilde\mu$ is the parametric component and $f$ is the nonparametric component. Recently, the problem of estimating and testing hypotheses about the parametric component in the presence of an infinite-dimensional nuisance parameter (the nonparametric component) has attracted a lot of attention. The main concerns are whether there exists an adaptive or efficient estimate of the parametric component and whether there is a practical procedure to find it.

We now consider the finite-dimensional case and derive the orthogonality condition given in Stein (1956). It was seen that under regularity conditions there exist estimator sequences $\hat\theta_{jn}$ of $\theta_j$, when the other parameters are known, which are asymptotically efficient in the sense that

$$\sqrt{n}\,(\hat\theta_{jn} - \theta_j) \xrightarrow{d} N\left(0, \frac{1}{I_{jj}(\theta)}\right).$$

When the other parameters are unknown,

$$\sqrt{n}\,(\hat\theta_{jn} - \theta_j) \xrightarrow{d} N\left(0, [I(\theta)]^{-1}_{jj}\right).$$

These imply that

$$\frac{1}{I_{jj}(\theta)} \le [I(\theta)]^{-1}_{jj}. \tag{31}$$

Stein (1956) raised the question whether we can estimate $\theta_j$ equally well whether or not the other parameters are known. This leads to the question of efficiency and adaptiveness.

The two sides of (31) are equal if

$$I_{ij}(\theta) = 0 \quad \text{for all } j \ne i, \tag{32}$$

as is seen from the definition of the inverse of a matrix, and in fact (32) is also necessary for equality in (31) by the following fact.

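Fact. Let

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$$

be a partitioned matrix with $A_{22}$ square and nonsingular, and let

$$B = \begin{pmatrix} I & -A_{12}A_{22}^{-1} \\ 0 & I \end{pmatrix}.$$

Note that

$$BA = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} & 0 \\ A_{21} & A_{22} \end{pmatrix}.$$

It follows easily that $|A| = |A_{11} - A_{12}A_{22}^{-1}A_{21}| \cdot |A_{22}|$. Since $A_{22}$ is nonsingular, when $A$ is nonsingular we have $(A^{-1})_{11} = (A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}$, so $(A^{-1})_{11} = (A_{11})^{-1}$ if $A_{12}$ is a zero matrix. Conversely, if $A$ is symmetric and positive definite, as an information matrix is, then $A_{12}A_{22}^{-1}A_{21} = 0$ forces $A_{12} = 0$.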
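The Fact can be checked on a small example. The sketch below uses a hypothetical symmetric positive definite $3 \times 3$ matrix (an assumption for illustration, not from the text), with $A_{11}$ the top-left scalar, and verifies both the determinant identity and the relation between $(A^{-1})_{11}$ and the Schur complement $A_{11} - A_{12}A_{22}^{-1}A_{21}$.

```python
from fractions import Fraction

def det(m):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, x in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * x * det(minor)
    return total

def inverse(m):
    """Gauss-Jordan inverse with exact Fraction arithmetic."""
    n = len(m)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Hypothetical symmetric positive definite matrix, partitioned with A11 = A[0][0].
A = [[Fraction(v) for v in row] for row in [[4, 1, 2], [1, 3, 0], [2, 0, 5]]]
a11 = A[0][0]
a12 = A[0][1:]                      # row vector A12 (A21 is its transpose here)
a22 = [row[1:] for row in A[1:]]
a22_inv = inverse(a22)

# Schur complement A11 - A12 A22^{-1} A21.
schur = a11 - sum(a12[i] * a22_inv[i][j] * a12[j]
                  for i in range(2) for j in range(2))

inv11 = inverse(A)[0][0]            # (A^{-1})_{11}
print(schur, det(A), inv11)

# With A12 = 0 the (1,1) entry of the inverse reduces to 1 / A11.
A_block_diag = [[4, 0, 0], [0, 3, 0], [0, 0, 5]]
inv11_diag = inverse(A_block_diag)[0][0]
```

In this example $(A^{-1})_{11}$ exceeds $1/A_{11}$, which is the strict form of inequality (31) when the off-diagonal block is nonzero.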


Equality in (31) for every $j$ implies that $I(\theta)$ is diagonal. Suppose the efficient estimator of $\theta_j$ depends on the remaining parameters and yet $\theta_j$ can be estimated without loss of efficiency when these parameters are unknown. The situation can then be viewed as a rather trivial example of the idea of adaptive estimation. On the other hand, it is known that $1/I_{jj}(\theta)$ is the smallest asymptotic mean-squared error attainable for estimating $\theta_j$. If an estimator achieves this bound, it is called an efficient estimator. Stein (1956) then states that adaptation is not possible unless $I_{jk}(\theta) = 0$ for $k \ne j$.

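We now study the bound $[I(\theta)]^{-1}_{11}$. Write $I(\theta)$ as a partitioned matrix

$$I(\theta) = \begin{pmatrix} I_{11}(\theta) & I_{1\cdot}(\theta) \\ I_{1\cdot}^T(\theta) & I_{\cdot\cdot}(\theta) \end{pmatrix},$$

where $I_{1\cdot}(\theta) = (I_{12}(\theta), \ldots, I_{1s}(\theta))$ and $I_{\cdot\cdot}(\theta)$ is the lower right submatrix of $I(\theta)$ with size $(s-1) \times (s-1)$. Then

$$[I(\theta)]^{-1}_{11} = \frac{1}{I_{11}(\theta) - I_{1\cdot}(\theta)\,[I_{\cdot\cdot}(\theta)]^{-1} I_{1\cdot}^T(\theta)}.$$

Recall that

$$I_{ij}(\theta) = E\left( \frac{\partial}{\partial\theta_i} \log f(X, \theta) \cdot \frac{\partial}{\partial\theta_j} \log f(X, \theta) \right).$$

Consider the minimization problem

$$\min_{a_j} E\left[ \frac{\partial}{\partial\theta_1} \log f(X, \theta) - \sum_{j=2}^{s} a_j \frac{\partial}{\partial\theta_j} \log f(X, \theta) \right]^2$$

and denote the minimizer by $a_0 = (a_{20}, \ldots, a_{s0})^T$. By simple algebra, $a_0$ is the solution of the normal equations $I_{\cdot\cdot}(\theta)\, a_0 = I_{1\cdot}^T(\theta)$, that is, $a_0 = I_{\cdot\cdot}^{-1}(\theta)\, I_{1\cdot}^T(\theta)$. It leads to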

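$$E\left[ \frac{\partial}{\partial\theta_1} \log f(X, \theta) - \sum_{j=2}^{s} a_{j0} \frac{\partial}{\partial\theta_j} \log f(X, \theta) \right]^2 = I_{11}(\theta) - I_{1\cdot}(\theta)\,[I_{\cdot\cdot}(\theta)]^{-1} I_{1\cdot}^T(\theta).$$

Equivalently,

$$\frac{1}{[I(\theta)]^{-1}_{11}} = \min_{a_j} E\left[ \frac{\partial}{\partial\theta_1} \log f(X, \theta) - \sum_{j=2}^{s} a_j \frac{\partial}{\partial\theta_j} \log f(X, \theta) \right]^2.$$

If $a_{j0} = 0$ for all $j$, then $[I(\theta)]^{-1}_{11} = 1/I_{11}(\theta)$, and adaptation is possible.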
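The projection argument can be traced numerically. Writing the expected square through the information matrix gives $Q(a) = I_{11} - 2\, I_{1\cdot}\, a + a^T I_{\cdot\cdot}\, a$. The sketch below uses a hypothetical $3 \times 3$ information matrix (an assumption for illustration), solves the normal equations for $a_0$, and compares $Q(a_0)$ with the reciprocal of $[I(\theta)]^{-1}_{11}$ computed independently from cofactors.

```python
from fractions import Fraction

# Hypothetical information matrix with s = 3 parameters (illustration only).
info = [[Fraction(v) for v in row] for row in [[4, 1, 2], [1, 3, 1], [2, 1, 5]]]
i11 = info[0][0]
i1 = info[0][1:]                       # row vector (I_12, I_13)
i22 = [row[1:] for row in info[1:]]    # lower right 2x2 block

def solve2(m, b):
    """Solve the 2x2 linear system m a = b by Cramer's rule."""
    d = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(b[0] * m[1][1] - m[0][1] * b[1]) / d,
            (m[0][0] * b[1] - m[1][0] * b[0]) / d]

def q(a):
    """E[(score_1 - sum_j a_j score_j)^2] expressed through the information matrix."""
    cross = sum(a[j] * i1[j] for j in range(2))
    quad = sum(a[j] * i22[j][k] * a[k] for j in range(2) for k in range(2))
    return i11 - 2 * cross + quad

def det3(m):
    """Determinant of a 3x3 matrix by expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

a0 = solve2(i22, i1)                   # normal equations: I.. a0 = I1.^T
det22 = i22[0][0] * i22[1][1] - i22[0][1] * i22[1][0]
inv11 = det22 / det3(info)             # [I^{-1}]_{11} via the cofactor formula

print(a0, q(a0), 1 / inv11)
```

Setting `a = [0, 0]` returns $I_{11}$ itself, so the gap between `q([0, 0])` and `q(a0)` measures the information lost by not knowing the other parameters.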

As illustrations, we will consider three examples. The first example is on the estimation of regression coefficients in a linear regression, and the next two are on the estimation of the parametric component in a semiparametric model. The two particular models we consider are the partial spline model (Wahba, 1984) and the two-sample proportional hazards model (Cox, 1972).


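Example 1. (Linear Regression) Assume that $y = \beta_0 + \beta_1 x + \epsilon$ and $\epsilon \sim N(0, \sigma^2)$. Let $(\hat\beta_0, \hat\beta_1)$ denote the least squares estimate. It follows easily that

$$\mathrm{Var}\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix} = \sigma^2 \begin{pmatrix} n & \sum_i X_i \\ \sum_i X_i & \sum_i X_i^2 \end{pmatrix}^{-1} = \sigma^2 \begin{pmatrix} \dfrac{\sum_i X_i^2}{n \sum_i (X_i - \bar X)^2} & \dfrac{-\bar X}{\sum_i (X_i - \bar X)^2} \\[2ex] \dfrac{-\bar X}{\sum_i (X_i - \bar X)^2} & \dfrac{1}{\sum_i (X_i - \bar X)^2} \end{pmatrix}.$$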

When $\beta_0$ is known, the variance of the least squares estimate of $\beta_1$ is $\sigma^2 / \sum_i X_i^2$. Since $\sum_i (X_i - \bar X)^2 = \sum_i X_i^2 - n \bar X^2$, the necessary and sufficient condition guaranteeing adaptiveness is that $\bar X = 0$. When we write the model in matrix form, the condition $\bar X = 0$ says that the two vectors $(1, \cdots, 1)^T$ and $(X_1, \cdots, X_n)^T$ are orthogonal. Note that $(1, \cdots, 1)^T$ and $(X_1, \cdots, X_n)^T$ are associated with $\beta_0$ and $\beta_1$, respectively.
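As a small numerical illustration of the condition $\bar X = 0$, the sketch below compares the variance of the slope estimate with $\beta_0$ unknown, $\sigma^2 / \sum_i (X_i - \bar X)^2$, against the variance with $\beta_0$ known, $\sigma^2 / \sum_i X_i^2$. The design points and the value $\sigma^2 = 1$ are assumptions chosen for illustration, and `slope_variances` is a hypothetical helper name.

```python
from fractions import Fraction

def slope_variances(xs, sigma2=Fraction(1)):
    """Return (variance with beta0 unknown, variance with beta0 known) for the slope."""
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x * x for x in xs)
    det = n * s2 - s1 * s1                    # determinant of [[n, s1], [s1, s2]]
    var_unknown = sigma2 * Fraction(n, det)   # (2,2) entry of sigma^2 (X'X)^{-1}
    var_known = sigma2 / s2                   # regression through a known intercept
    return var_unknown, var_known

raw = [1, 2, 3, 4, 5]           # mean 3: the two design columns are not orthogonal
centered = [-2, -1, 0, 1, 2]    # mean 0: the two design columns are orthogonal

print(slope_variances(raw))
print(slope_variances(centered))
```

With the centered design the two variances coincide, while with the uncentered design knowing $\beta_0$ gives a strictly smaller variance.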

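On the other hand, we can use the above derivation. Observe that

$$\frac{\partial \log f}{\partial\beta_0} = \frac{Y - \beta_0 - \beta_1 X}{\sigma^2} = \frac{\epsilon}{\sigma^2}, \qquad \frac{\partial \log f}{\partial\beta_1} = \frac{(Y - \beta_0 - \beta_1 X)\, X}{\sigma^2} = \frac{\epsilon X}{\sigma^2}.$$

Then

$$\min_a E\left( \frac{\partial \log f}{\partial\beta_1} - a \frac{\partial \log f}{\partial\beta_0} \right)^2 = \min_a \frac{1}{\sigma^2} E(X - a)^2 = \frac{1}{\sigma^2} \mathrm{Var}(X).$$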

Example 2. (Partial Spline Model) Assume that $Y = \beta X + g(T) + \epsilon$ and $\epsilon \sim N(0, \sigma^2)$. Suppose that $g \in L_2[a, b]$, the set of all square integrable functions on the interval $[a, b]$. Taking proper care of some mathematical subtleties, $g(T)$ can be written as $\sum_{j=1}^{\infty} b_j \phi_j(T)$ where $\{\phi_j\}$ is a complete basis of $L_2[a, b]$.

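Observe that

$$\frac{\partial \log f}{\partial\beta} = \frac{\epsilon X}{\sigma^2}, \qquad \frac{\partial \log f}{\partial b_j} = \frac{\epsilon\, \phi_j(T)}{\sigma^2}.$$

Then

$$\min_{a_j} E\left( \frac{\partial \log f}{\partial\beta} - \sum_j a_j \frac{\partial \log f}{\partial b_j} \right)^2 = \min_{a_j} E\, \frac{\epsilon^2}{\sigma^4} \left( X - \sum_j a_j \phi_j(T) \right)^2 = \frac{1}{\sigma^2} \min_h E(X - h(T))^2.$$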

Therefore, h(T ) = E(X|T ). It means that when E(X|T ) = 0 or X and T are uncorrelated, the adaption is possible. Otherwise, the efficient bound for estimating β is

σ2

E(X − E(X|T ))2 = σ2 EV ar(X|T ).

Refer to Chen (1988) and Speckman (1988) for further references and the construction of efficient estimates of $\beta$.
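To make the bound in Example 2 concrete, the sketch below evaluates $E X^2$ and $E\,\mathrm{Var}(X \mid T)$ for a small discrete joint distribution of $(X, T)$. The distribution and the value $\sigma^2 = 1$ are hypothetical, chosen only so that $E(X \mid T)$ is not identically zero. The bound with $g$ known is taken to be $\sigma^2 / E X^2$, which corresponds to $h = 0$ in the display above.

```python
from fractions import Fraction

# Hypothetical joint distribution of (X, T): four (x, t) pairs, probability 1/4 each.
joint = {(-1, 0): Fraction(1, 4), (1, 0): Fraction(1, 4),
         (1, 1): Fraction(1, 4), (3, 1): Fraction(1, 4)}

def conditional_mean(dist, t):
    """E(X | T = t) for a discrete joint distribution {(x, t): probability}."""
    p_t = sum(p for (_, tt), p in dist.items() if tt == t)
    return sum(x * p for (x, tt), p in dist.items() if tt == t) / p_t

# E X^2: per-observation information for beta (times sigma^2) when g is known.
e_x2 = sum(p * x * x for (x, _), p in joint.items())

# E (X - E(X|T))^2 = E Var(X|T): the same quantity when g is unknown.
e_resid2 = sum(p * (x - conditional_mean(joint, t)) ** 2
               for (x, t), p in joint.items())

sigma2 = Fraction(1)                   # assumed error variance
bound_g_known = sigma2 / e_x2
bound_g_unknown = sigma2 / e_resid2
print(bound_g_known, bound_g_unknown)
```

Here $E(X \mid T)$ takes the values 0 and 2, so the bound with $g$ unknown is strictly larger than the bound with $g$ known and adaptation is not possible for this distribution.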

Example 3. Let t1, · · · , tn be fixed constants. Suppose that Xi ∼ Bin(1, F (ti)) and Yi ∼ Bin(1, Fθ(ti)). We now derive a lower bound on the
