### Methods for Statistical Prediction

### Financial Time Series I

### Topic 2: Methods of Estimation

### Hung Chen

### Department of Mathematics National Taiwan University

### 9/28/2002

OUTLINE

1. Motivating Examples
2. Statistical Models
3. Maximum Likelihood Estimates
4. Substitution Principles
   – Method of Moments
   – Frequency Substitution
5. Method of Least Squares

Motivating Examples

Example 1. Censored exponentially distributed survival times

• Suppose W is a nonnegative random variable having an exponential distribution with mean θ > 0. Its pdf is given by

f_{W}(w; θ) = θ^{−1} exp(−w/θ) I_{(0,∞)}(w),

where the indicator function I_{(0,∞)}(w) = 1 for w > 0 and is zero elsewhere. The distribution function is given by

F_{W}(w; θ) = {1 − exp(−w/θ)} I_{(0,∞)}(w).

• In survival or reliability analyses, a study to observe random variables W_{1}, . . . , W_{n} will generally be terminated in practice before all of these random variables are able to be observed.

– Let y = (y_{1}^{T}, . . . , y_{n}^{T})^{T} denote the observed data, where y_{j} = (c_{j}, δ_{j})^{T} and δ_{j} = 0 or 1 according as the observation W_{j} is censored or uncensored at c_{j} (j = 1, . . . , n).

– If the observation W_{j} is uncensored, its realized value w_{j} is equal to c_{j}.

– If the observation W_{j} is censored at c_{j}, then w_{j} is some value greater than c_{j} (j = 1, . . . , n).

– In a medical study, it is commonly assumed that censoring is caused by a competing risk.

Under this assumption, (W_{1}, R_{1}), . . . , (W_{n}, R_{n}) are iid, C_{i} = min(W_{i}, R_{i}), and δ_{i} = 1_{{W_{i} ≤ R_{i}}}.

Here R is a nonnegative random variable, assumed independent of W.

• Approach I: Model C directly.

– Derive the density function of C. Note that

P (C ≤ y) = P (min(W, R) ≤ y) = 1 − P (W > y, R > y) = 1 − e^{−y/θ}[1 − F_{R}(y)]

and hence

f_{C}(c) = θ^{−1}e^{−c/θ}[1 − F_{R}(c)] + e^{−c/θ}f_{R}(c).

– Assuming R is exponentially distributed with mean λ, we have

f_{C}(c) = ((θ + λ)/(θλ)) exp(−((θ + λ)/(θλ)) c).

Hence, C is again exponentially distributed with mean θλ/(θ + λ).

– How do we estimate θ?

We should use the information contained in δ_{j}. Note that δ is a Bernoulli random variable with probability of success

P (W ≤ R) = ∫_{0}^{∞} ∫_{0}^{r} θ^{−1}e^{−y/θ}f_{R}(r) dy dr = ∫_{0}^{∞} f_{R}(r)[1 − exp(−r/θ)] dr = 1 − ∫_{0}^{∞} e^{−r/θ}f_{R}(r) dr.

When R is exponentially distributed with mean λ, we have

P (W ≤ R) = λ/(λ + θ).

By the law of large numbers, we consider using n^{−1} Σ_{i} δ_{i} to estimate P (W ≤ R).

• Approach II: Method of Maximum Likelihood

– We have iid observations (C_{1}, δ_{1}), . . . , (C_{n}, δ_{n}) and need to find the probability density function of (C, δ). Observe that

P (C ≤ c, δ = 1) = P (W ≤ R, W ≤ c)
= ∫_{0}^{c} ∫_{0}^{r} f_{W}(w)f_{R}(r) dw dr + ∫_{c}^{∞} ∫_{0}^{c} f_{W}(w)f_{R}(r) dw dr
= ∫_{0}^{c} F_{W}(r)f_{R}(r) dr + ∫_{c}^{∞} F_{W}(c)f_{R}(r) dr
= ∫_{0}^{c} F_{W}(r)f_{R}(r) dr + F_{W}(c)[1 − F_{R}(c)].

Differentiating with respect to c gives

f (C = c, δ = 1) = f_{W}(c)[1 − F_{R}(c)].

By the same argument, we have

f (C = c, δ = 0) = [1 − F_{W}(c)]f_{R}(c).

– The likelihood function is

∏_{i} (f_{W}(w_{i})[1 − F_{R}(w_{i})])^{δ_{i}} (f_{R}(w_{i})[1 − F_{W}(w_{i})])^{1−δ_{i}}
= ∏_{i} (f_{W}(w_{i}))^{δ_{i}} [1 − F_{W}(w_{i})]^{1−δ_{i}} · ∏_{i} (f_{R}(w_{i}))^{1−δ_{i}} [1 − F_{R}(w_{i})]^{δ_{i}}.

– For simplicity, we relabel the observations such that W_{1}, . . . , W_{r} denote the r uncensored observations and W_{r+1}, . . . , W_{n} the n − r censored observations.

The likelihood function for θ formed on the basis of y (dropping the factors involving f_{R} and F_{R}, which do not depend on θ) is given by

∏_{i=1}^{r} [θ^{−1} exp(−c_{i}/θ)] ∏_{i=r+1}^{n} {1 − [1 − exp(−c_{i}/θ)]} = θ^{−r} exp(−Σ_{i=1}^{n} c_{i}/θ).

– In this case, the MLE of θ can be derived explicitly from the standard differentiation technique:

ˆθ = Σ_{i=1}^{n} c_{i}/r.

– Rewrite ˆθ as

(n^{−1} Σ_{i=1}^{n} c_{i})/(r/n).

It can be shown that ˆθ will converge to θ in probability.
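
The following R sketch simulates censored data under hypothetical values θ = 2 (mean of W) and λ = 3 (mean of the censoring time R) and computes ˆθ = Σ_{i} c_{i}/r; it is only an illustration of the formula above.

```r
set.seed(1)
n      <- 1000
theta  <- 2    # hypothetical mean of the survival time W
lambda <- 3    # hypothetical mean of the censoring time R
w <- rexp(n, rate = 1/theta)     # survival times
r <- rexp(n, rate = 1/lambda)    # censoring times
c.obs <- pmin(w, r)              # observed times C_i = min(W_i, R_i)
delta <- as.numeric(w <= r)      # 1 = uncensored, 0 = censored

# MLE: total observed time divided by the number of uncensored observations
theta.hat <- sum(c.obs) / sum(delta)
theta.hat                        # should be close to theta = 2
```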

Remarks:

• The exponential distribution is often used to model lifetimes or waiting times.

• Suppose that we consider modeling the lifetime of an electronic component, T, as an exponential random variable with parameter θ. The implication is as follows:

P (T > t + s | T > s) = P (T > t + s and T > s)/P (T > s) = P (T > t + s)/P (T > s) = e^{−(t+s)/θ}/e^{−s/θ} = exp(−t/θ).

This is the so-called memoryless property of the exponential distribution.

• Does it make sense to use the exponential distribution to model human lifetimes?

Compare the probability that a 16-year-old will live at least 10 more years with the probability that an 80-year-old will live at least 10 more years.
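
Under the exponential model both conditional probabilities equal exp(−10/θ), regardless of the current age; a quick numerical check (the mean lifetime θ = 75 years is a hypothetical value chosen only for the illustration):

```r
theta <- 75   # hypothetical mean lifetime in years
surv <- function(t) pexp(t, rate = 1/theta, lower.tail = FALSE)  # P(T > t)
surv(26) / surv(16)   # 16-year-old surviving 10 more years
surv(90) / surv(80)   # 80-year-old surviving 10 more years
exp(-10/theta)        # both ratios equal this value
```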

• Hazard function h(t): It is defined as

h(t) = f (t)/(1 − F (t)) (= −(d/dt) log S(t)),

where S(t) = P (T > t) = 1 − F (t). It can be thought of as the instantaneous death rate for individuals who are alive at time t.

If an individual is alive at time t, the probability that that individual will die in the time interval (t, t + δ) is, assuming that the density function is continuous at t,

P (t ≤ T ≤ t + δ | T ≥ t) = P (t ≤ T ≤ t + δ)/P (T ≥ t) = [F (t + δ) − F (t)]/[1 − F (t)] ≈ δf (t)/[1 − F (t)].

• For an exponential random variable T with mean θ, its hazard function is 1/θ (a constant function).

As a remark, the expectation of an exponential random variable is θ.

Do you think that the connection between the expectation and the hazard function is a coincidence? Is there any intuitive explanation?

• Usually, the hazard function of human lifetimes is assumed to be of bathtub shape.

How would you model the density function of human lifetimes?

Example 2. Model heterogeneous data by finite-mixture models

• In the problem considered by Do and McLachlan (1984), the population of interest consists of rats from g species G_{1}, . . . , G_{g} that are consumed by owls in some unknown proportions π_{1}, . . . , π_{g}.

• The problem is to estimate the π_{i} on the basis of the observation vector W containing measurements recorded on a sample of size n of rat skulls taken from owl pellets.

The rats constitute part of an owl’s diet, and indigestible material is regurgitated as a pellet.

• Using a conditioning argument, the underlying population can be modeled as consisting of g distinct groups G_{1}, . . . , G_{g} in some unknown proportions π_{1}, . . . , π_{g}, where the conditional pdf of W given membership of the ith group G_{i} is f_{i}(w).

• Let y = (w_{1}^{T}, . . . , w_{n}^{T})^{T} denote the observed random sample obtained from the mixture density

f (w; (π_{1}, . . . , π_{g−1})) = Σ_{i=1}^{g} π_{i} f_{i}(w).

• The log likelihood function for (π_{1}, . . . , π_{g−1}) formed from the observed data y is given by

Σ_{i=1}^{n} log ( Σ_{j=1}^{g} π_{j} f_{j}(w_{i}) ).

• On differentiating the log likelihood function with respect to π_{j} (j = 1, . . . , g − 1), and noting that π_{g} = 1 − π_{1} − · · · − π_{g−1}, we obtain

Σ_{i=1}^{n} [ f_{j}(w_{i})/f (w_{i}; (π_{1}, . . . , π_{g−1})) − f_{g}(w_{i})/f (w_{i}; (π_{1}, . . . , π_{g−1})) ] = 0, for j = 1, . . . , g − 1.

This clearly does not yield an explicit solution for (π_{1}, . . . , π_{g−1})^{T}, so the equations have to be solved numerically.
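
As an illustration of the lack of a closed form, the following R sketch maximizes the log likelihood numerically for a two-component mixture whose component densities are assumed known (hypothetically N(0, 1) and N(3, 1)); only the mixing proportion is estimated.

```r
set.seed(2)
n <- 500
pi.true <- 0.3
# hypothetical known component densities: N(0,1) and N(3,1)
w <- ifelse(runif(n) < pi.true, rnorm(n, 0, 1), rnorm(n, 3, 1))

loglik <- function(p) sum(log(p * dnorm(w, 0, 1) + (1 - p) * dnorm(w, 3, 1)))

# one-dimensional numerical maximization of the log likelihood over p in (0, 1)
fit <- optimize(loglik, interval = c(0, 1), maximum = TRUE)
fit$maximum   # numerical MLE of the mixing proportion
```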

Statistical models

• Most studies and experiments, scientific or industrial, large scale or small, produce data whose analysis is the ultimate object of the endeavor.

– Compare the efficiency of two ways of doing something under similar conditions, such as: brewing coffee; reducing pollution; treating a disease; producing energy; learning a maze; and so on.

– Abstraction: It can be thought of as a problem of comparing the efficacy of two methods applied to the members of a certain population.

– Run m + n independent experiments as follows: m + n members of the population are picked at random, m of these are assigned to the first method, and the remaining n are assigned to the second method.

– In comparing two drugs A and B, we would administer drug A to m and drug B to n randomly selected patients and then measure temperature, blood pressure, have the patient rated quantitatively for improvement by physicians, and so on.

– Random variability would come primarily from differing responses among patients to the same drug, but also from error in the measurements and variation in the purity of the drugs.

– one-sample location model for measurements:

Let X_{1}, . . . , X_{n} be the n determinations of µ. Write

X_{i} = µ + ε_{i}, 1 ≤ i ≤ n,

where ε = (ε_{1}, . . . , ε_{n}) is the vector of errors.

– two-sample problem:

Let X_{1}, . . . , X_{n} be the n samples from the population with distribution F and Y_{1}, . . . , Y_{m} be the m samples from the population with distribution G.

• Many statistical procedures are based on statistical models which specify under which conditions the data are generated.

• Usually the assumption is made that the set of observations x_{1}, . . . , x_{n} is a set of (i) independent random variables (ii) identically distributed with common pdf f (x_{i}, θ).

• Once this model is specified, the statistician tries to find optimal solutions to his problem (usually related to inference on a set of parameters θ ∈ Θ ⊂ R^{k} characterizing the uncertainty about the model).

Does this statement fit the two-sample problem just mentioned?

• Any statistical inference starts from a basic family of probability measures, expressing our prior knowledge about the nature of the probability measures from which the observations originate.

A model P is a collection of probability measures P on (X , A), where X is the sample space with a σ-field of subsets A.

• If P = {Pθ : θ ∈ Θ}, Θ ⊂ R^{k} for some k, then P is a parametric model.

• Example 3. Bernoulli trials

– Consider a new model of automobile which is being produced in large numbers.

– Choose one at random from the production line and observe whether or not it suffers a mechanical breakdown within two years.

– In each trial, there are only two possible observations. The sample space consists of two elements: 1 (representing breakdown) and 0 (representing no breakdown).

– The inherent variability in the situation is described by a probability distribution which in this case is defined by a single number θ, the probability of breakdown.

– The possible probability distribution on the sample space can be described by a Bernoulli trial with an unknown parameter θ between 0 and 1.

• Example 4. The parameter is a function.

– Suppose we have a large batch of seeds stored under constant conditions of temperature and humidity.

– In the course of time seeds die.

Suppose that at time t a proportion π(t) of the stored seeds are still alive.

– At each of times t_{1}, t_{2}, . . . , t_{s} we take a random sample of n seeds and observe how many are still alive.

– A typical observation consists of an ordered set (r_{1}, r_{2}, . . . , r_{s}) of integers, r_{i} being the number of seeds observed to be alive at time t_{i}.

– The appropriate distribution for describing the variable element in this situation is

p(r_{1}, r_{2}, . . . , r_{s}) = ∏_{i=1}^{s} C(n, r_{i})[π(t_{i})]^{r_{i}}[1 − π(t_{i})]^{n−r_{i}}.

Here π(·) is an unknown function. In this example, the parameter is a function.

– Isotonic regression problem: π(t) is necessarily a non-increasing function of t, taking values between 0 and 1.

Can we find a parametric model for π(t)?

Related Issues:

• Suppose that we have a fully specified parametric family of models. Denote the parameter of interest by θ.

• Suppose that we wish to calculate from the data a single value representing the “best estimate” that we can make of the unknown parameter. We call such a problem one of point estimation.

• Instead of point estimation, we can estimate the parameter by giving a confidence interval which is associated with the probability of covering the true value.

When we say that a 95% CI of θ is (0.3, 0.7), it does not mean that there is a 95% probability of θ ∈ (0.3, 0.7).

Such a claim does not make any sense since

– Although θ is unknown, it is still a fixed number.

– (0.3, 0.7) is a known fixed interval.

– θ is either in (0.3, 0.7) or not in that interval. It will not be sometimes in (0.3, 0.7) and sometimes not.

– The precise meaning of probability 0.95 will be discussed later on.

– The probability 0.95 refers to the probability that θ is in a random interval.

Here (0.3, 0.7) is one realization of that random interval.

• Distinction between data and random variables:

In statistics, we deal with data only.

Why do we need to introduce random variables?

Attitudes on Models:

• The statistician may be a “pessimist” who does not believe in any particular model f (x, θ).

In this case he must be satisfied with descriptive methods (like exploratory data analysis) without the possibility of inductive inference.

• The statistician may be an “optimist” who strongly believes in one model. In this case the analysis is straightforward and optimal solutions may often be easily obtained.

• The statistician may be a “realist”: he would like to specify a particular model f (x, θ) in order to get operational results, but he may have either some doubt about the validity of this hypothesis or some difficulty in choosing a particular parametric family.

Let us illustrate this kind of preoccupation with an example.

• Suppose that the parameter of interest is the “center” of some population.

• In many situations, the statistician may argue that, due to a central limit effect, the data are generated by a normal pdf.

• In this case the problem is restricted to the problem of inference on µ, the mean of the population.

• But in some cases, he may have some doubt about these central limit effects and may suspect some skewness and/or some kurtosis, or he may suspect that some observations are generated by other models (leading to the presence of outliers).

In this context three types of question may be raised to avoid gross errors in the prediction, or in the inference:

– Does the optimal solution, computed for the assumed model f (x, θ), still have “good” properties if the true model is a little different?

This question is concerned with the sensitivity of a given criterion to the hypotheses (criterion robustness).

Question: Validity of the one-sample t-test. Partial Answer: Central Limit Theorem.

– Are the optimal solutions computed for other models near to the original one really substantially different?

In this question, it is the sensitivity of the inference that is analyzed (inference robustness).

Maximum Likelihood Estimates

• The true distribution on the sample space can be labeled by a parameter θ taking values in a finite-dimensional Euclidean space.

• We further assume the family {Pθ : θ ∈ Θ} (Θ ⊂ R^{k}) possesses density functions {pθ : θ ∈ Θ} with respect to some natural measure on the sample space, such as counting measure if the sample space is discrete or Lebesgue measure when it is not.

– In the discrete case, pθ(x) is the probability of the point x when θ is the true parameter.

– In the continuous case, pθ(x) is the probability density at x when θ is the true parameter.

• x: the observed set of values obtained in an experiment.

• Consider p(x, θ) as a function of θ for fixed x.

p(x, θ) is called the likelihood function.

We also write it L(θ, x).

L(θ, x) gives the probability of observing x for each θ when X is discrete.

• Idea: Find the value ˆθ of the parameter which is most plausible after we have observed the data.

• A maximum likelihood estimate ˆθ is any element of Θ such that

p(x, ˆθ(x)) = max_{θ∈Θ} p(x, θ).

• This principle was first put forward as a novel and original method of deriving estimators by R.A. Fisher in 1922. It very soon proved to be a fertile approach to statistical inference in general, and was widely adopted; but the exact properties of the ensuing estimators and test procedures were only gradually discovered.

• How do we find the maximum of L(θ, x)?

– A systematic way we learn in calculus is to transform a maximization problem into a root-finding problem.

– The above strategy may not always work.

Refer to the uniform example.

– Computational feasibility:

Before 1960, one could rely only on the capacity of the human calculator, equipped with pencil and paper and with such aids as the slide rule, tables of logarithms, and other convenient tables.

The advent of the electronic computer removed the restriction of the human operator.

– The estimates defined by nonlinear equations can be established as a matter of routine by the appropriate iterative algorithms.

Examples:

• Example 5. Suppose θ = 0 or 1 (Θ = {0, 1}) and p(x, θ) is given by the following table.

p(x, θ)   x = 0   x = 2
θ = 0       0       1
θ = 1      0.1     0.9

Suppose that we observe two observations 2 and 2.

How do we get them?

Abstraction:

– X: a discrete random variable with pmf p(x, θ)

– 2 and 2 are realizations of X_{1} and X_{2}.

– What is the pmf of (X_{1}, X_{2})?

Then

L(0, (2, 2)) = 1,  L(1, (2, 2)) = (0.9)^{2} = 0.81,

and ˆθ((2, 2)) = 0.

• Example 6. If x_{1}, . . . , x_{n} are i.i.d. according to the Poisson distribution P(λ), the likelihood is

L(λ, x) = λ^{Σ_{i} x_{i}} e^{−nλ} / ∏_{i} x_{i}!.

This is maximized by

ˆλ = Σ_{i} x_{i}/n,

which is therefore the MLE of λ.

In this example, Θ = (0, ∞) and k = 1.

Use rpois(20, 3) to generate 20 observations from P(3). They are 2, 3, 3, 5, 6, 3, 0, 5, 3, 2, 2, 2, 2, 2, 4, 1, 3, 2, 7, 5.

Then ˆλ = 3.1.

Why do we need to introduce X?
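
The numerical illustration above can be reproduced directly in R; the listed draws are typed in rather than regenerated with rpois, so the result matches ˆλ = 3.1.

```r
x <- c(2, 3, 3, 5, 6, 3, 0, 5, 3, 2, 2, 2, 2, 2, 4, 1, 3, 2, 7, 5)
lambda.hat <- mean(x)   # MLE of the Poisson mean: sum(x)/n
lambda.hat              # 3.1
```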

• Example 7. Let X_{1}, . . . , X_{n} be i.i.d. according to the uniform distribution U (0, θ), so that the likelihood is

L(θ, x) = 1/θ^{n} if 0 ≤ x_{i} ≤ θ for all i, and 0 otherwise.

We can no longer differentiate L(θ, x) to get the MLE.

By direct maximization, the MLE is equal to x_{(n)}, the largest observation.

• Example 8. Consider n items whose times to failure X_{1}, . . . , X_{n} form a sample from an E(θ) distribution (i.e., p(x) = θ exp(−θx) for x > 0).

Suppose the items are inspected only at discrete times 1, 2, . . . , k, so that we really observe Y_{1}, . . . , Y_{n} where, for j = 1, . . . , n,

Y_{j} = ℓ if ℓ − 1 < X_{j} ≤ ℓ, ℓ = 1, . . . , k, and Y_{j} = k + 1 if X_{j} > k.

Suppose n = 20, k = 5, and θ = 3. The x_{i}s are 5.19, 0.06, 2.37, 4.38, 4.98, 13.02, 0.34, 7.26, 0.67, 1.96, 3.82, 0.27, 1.83, 3.48, 3.03, 1.90, 6.42, 7.49, 5.67, 6.27 and the y_{i}s are 6, 1, 3, 5, 5, 6, 1, 6, 1, 2, 4, 1, 2, 4, 4, 2, 6, 6, 6, 6.

Let N_{i} = number of indices j such that Y_{j} = i, i = 1, . . . , k + 1. Then the multinomial vector N = (N_{1}, . . . , N_{k+1}) is sufficient for θ and the likelihood function of N is

L(θ, n_{1}, . . . , n_{k+1}) = [n!/(n_{1}! · · · n_{k+1}!)] ∏_{j=1}^{k+1} p_{j}(θ)^{n_{j}},

where p_{j}(θ) = exp(−[j − 1]θ) − exp(−jθ) for 1 ≤ j ≤ k and p_{k+1}(θ) = exp(−kθ).

Question: How do we solve this problem?
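
One simple option is to maximize the grouped-data log likelihood numerically. A minimal R sketch using the y values listed above (the multinomial constant is dropped because it does not involve θ):

```r
y <- c(6, 1, 3, 5, 5, 6, 1, 6, 1, 2, 4, 1, 2, 4, 4, 2, 6, 6, 6, 6)
k <- 5
n.count <- tabulate(y, nbins = k + 1)   # N_1, ..., N_{k+1}

loglik <- function(theta) {
  p <- c(exp(-(0:(k - 1)) * theta) - exp(-(1:k) * theta),  # p_j, j = 1,...,k
         exp(-k * theta))                                   # p_{k+1}
  sum(n.count * log(p))
}

# one-dimensional numerical maximization over a plausible range for theta
optimize(loglik, interval = c(0.01, 5), maximum = TRUE)$maximum
```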

Limitations on MLE

• It is a constant theme of the history of the method that the use of ML techniques is not always accompanied by a clear appreciation of their limitations.

• Example 9. (Neyman-Scott (1948) problem)

In this example, the MLE is not even consistent.

Refer to J. Neyman and E.L. Scott, Consistent estimates based on partially consistent observations, Econometrica 16, 1–32 (1948).

Estimation of a Common Variance:

Let X_{αj} (j = 1, . . . , r) be independently distributed according to N (θ_{α}, σ^{2}), α = 1, . . . , n.

The MLEs are

ˆθ_{α} = X_{α·},   ˆσ^{2} = (1/(rn)) Σ_{α=1}^{n} Σ_{j=1}^{r} (X_{αj} − X_{α·})^{2}.

Furthermore, these are the unique solutions of the likelihood equations.

However, in the present case, the MLE of σ^{2} is not even consistent.

To see this, note that the statistics

S_{α}^{2} = Σ_{j=1}^{r} (X_{αj} − X_{α·})^{2}

are identically independently distributed with expectation

E(S_{α}^{2}) = (r − 1)σ^{2},

so that Σ_{α} S_{α}^{2}/n → (r − 1)σ^{2} and hence

ˆσ^{2} → ((r − 1)/r) σ^{2} in probability.
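
A small simulation illustrates the inconsistency; the values n = 2000, r = 2, θ_{α} = 0, and σ = 1 below are assumptions made only for the illustration, and ˆσ^{2} settles near (r − 1)σ^{2}/r = 0.5 rather than near σ^{2} = 1.

```r
set.seed(3)
n <- 2000; r <- 2; sigma <- 1
x <- matrix(rnorm(n * r, mean = 0, sd = sigma), nrow = n)  # row alpha holds X_{alpha,1},...,X_{alpha,r}
centered <- sweep(x, 1, rowMeans(x))          # subtract each row (group) mean
sigma2.hat <- sum(centered^2) / (r * n)       # MLE of sigma^2
sigma2.hat                                    # close to (r - 1)/r = 0.5, not 1
```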

• Example 10. (Non-existence of the MLE)

If Y_{1}, . . . , Y_{n} are i.i.d. according to the Poisson distribution P (λ), suppose for each i we observe only whether Y_{i} is 0 or positive, and let

X_{i} = 0 if Y_{i} = 0, and X_{i} = 1 if Y_{i} > 0.

Then

P (X_{i} = 0) = exp(−λ), P (X_{i} = 1) = 1 − exp(−λ),

and the likelihood is

L(λ) = [1 − exp(−λ)]^{Σ x_{i}} exp(−λ Σ (1 − x_{i})).

This is maximized by

ˆλ = − log(1 − x̄),

provided Σ (1 − x_{i}) > 0.

When all the x_{i}’s are equal to 1, the likelihood becomes

L(λ) = [1 − exp(−λ)]^{n},

which is an increasing function of λ. In this case, the likelihood does not attain its maximum for any finite λ and the MLE does not exist. (Does it make sense?)

Discussions:

– For any fixed n, the probability P (X_{1} = · · · = X_{n} = 1) = (1 − exp(−λ))^{n} tends to 1 as λ → ∞. Thus there will exist values of λ for which the probability is close to 1 that the MLE is undefined.

– For any fixed λ, the probability P (X_{1} = · · · = X_{n} = 1) = (1 − exp(−λ))^{n} tends to 0 as n → ∞.

Iterative Procedures

In applications MLE’s typically do not have analytic forms and some numerical methods have to be used to compute MLE’s.

It is usually possible to assume that the MLE emerges as a solution of the likelihood equations. Namely,

∂/∂θ_{i} log p(x, θ) = 0, i = 1, · · · , k.

Symbolically, the equations we have to solve may be written

Dθ ℓ(x, θ) = 0,

where ℓ(x, θ) = log p(x, θ) and Dθ is the vector differential operator whose ith component is ∂/∂θ_{i}.

A commonly used numerical method is the Newton-Raphson iteration method.

• Solve the likelihood equation L^{(1)}(θ, x) = 0 iteratively.

• Replace L^{(1)}(θ, x) by the linear terms of its Taylor expansion about a starting value ˆθ^{(0)}.

• Replace the likelihood equation with the equation

L^{(1)}(ˆθ^{(0)}, x) + L^{(2)}(ˆθ^{(0)}, x)(θ − ˆθ^{(0)}) = 0.

The solution for θ,

ˆθ^{(1)} = ˆθ^{(0)} − [L^{(2)}(ˆθ^{(0)}, x)]^{−1}L^{(1)}(ˆθ^{(0)}, x),

serves as a first approximation to the solution of the likelihood equation.

• Iterate the above procedure by replacing ˆθ^{(0)} by ˆθ^{(1)}, and so on.

• In general,

ˆθ^{(t+1)} = ˆθ^{(t)} − [∂^{2}L(θ)/∂θ∂θ^{T} |_{θ=ˆθ^{(t)}}]^{−1} [∂L(θ)/∂θ |_{θ=ˆθ^{(t)}}].

– The laborious aspect of this iterative procedure is the inversion of the matrix ∂^{2}L(ˆθ^{(t)})/∂θ∂θ^{T} at the tth stage.

– If our initial approximation ˆθ^{(0)} is good, then ∂^{2}L(ˆθ^{(0)})/∂θ∂θ^{T} will be near ∂^{2}L(ˆθ^{(t)})/∂θ∂θ^{T} in non-pathological conditions.

We can often use the former matrix at each stage of the procedure.

– It often happens that terms awkward to calculate appear in ∂^{2}L(ˆθ^{(t)})/∂θ∂θ^{T} but not in its expected value.

Sometimes, we replace ∂^{2}L(θ)/∂θ∂θ^{T} by its expected value E[∂^{2}L(θ)/∂θ∂θ^{T}], where the expectation is taken under Pθ.

This method is known as the Fisher-scoring method.

In most instances, E[∂^{2}L(θ)/∂θ∂θ^{T}] is simply the negative information matrix discussed in the second topic.

• Issues on implementation:

– Specification of the starting point: To ensure a sequence ˆθ^{(t)} which converges to ˆθ, it is required that ˆθ^{(0)} is sufficiently close to the root ˆθ.

– Take any estimator ˆθ^{(0)} which satisfies that √n(ˆθ^{(0)} − θ) is bounded in probability.

– Specification of the stopping rule: for example, stop when the change in ˆθ^{(t)} or in the log likelihood between successive iterations falls below a prescribed tolerance.
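
As a concrete illustration of the iteration in a case where the answer is known in closed form, the following R sketch applies Newton-Raphson to the Poisson log likelihood of Example 6, using a simple stopping rule; the starting value and tolerance are arbitrary choices.

```r
x <- c(2, 3, 3, 5, 6, 3, 0, 5, 3, 2, 2, 2, 2, 2, 4, 1, 3, 2, 7, 5)
n <- length(x); s <- sum(x)

score   <- function(lam) s / lam - n    # first derivative of the log likelihood
hessian <- function(lam) -s / lam^2     # second derivative

lam <- 1                                # arbitrary starting value
repeat {
  lam.new <- lam - score(lam) / hessian(lam)   # Newton-Raphson update
  if (abs(lam.new - lam) < 1e-8) break         # stopping rule
  lam <- lam.new
}
lam.new   # converges to mean(x) = 3.1
```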

• Example 11. Probit Analysis

– Suppose the probability π(s) that an individual responds to the level s of a stimulus can be expressed in the form

π(s) = Φ((s − µ)/σ) = (1/√(2π)) ∫_{−∞}^{(s−µ)/σ} e^{−z^{2}/2} dz.

– The level s_{i} of the stimulus is applied to n_{i} individuals (i = 1, . . . , r) and the numbers m_{i} (i = 1, . . . , r) of responses at the different levels are observed.

– Determine the MLE of µ and σ.

– ℓ(x, (µ, σ)) = constant + Σ_{i} {m_{i} log π(s_{i}) + (n_{i} − m_{i}) log(1 − π(s_{i}))}, and the likelihood equations are

Σ_{i} [(m_{i} − n_{i}π_{i})/(π_{i}(1 − π_{i}))] ∂π(s_{i})/∂µ = 0,

Σ_{i} [(m_{i} − n_{i}π_{i})/(π_{i}(1 − π_{i}))] ∂π(s_{i})/∂σ = 0,

where π_{i} = π(s_{i}).

– Obtain initial approximations µ_{0} and σ_{0} to the solution.

– If the π(s_{i})s were known, the plot of the points (s_{i}, Φ^{−1}(π(s_{i}))) would lie on the straight line

Φ^{−1}(π) = (s − µ)/σ.

Since m_{i}/n_{i} is an estimate of π(s_{i}), we can fit a straight line to this set of points to yield estimates of µ and σ.

– The Hessian matrix is a rather complicated expression.

If we use the Fisher-scoring method, it is replaced by the 2 × 2 matrix (written row by row)

[ Σ_{i} −n_{i}/(π_{i}(1−π_{i})) (∂π(s_{i})/∂µ)^{2} ,  Σ_{i} −n_{i}/(π_{i}(1−π_{i})) (∂π(s_{i})/∂µ)(∂π(s_{i})/∂σ) ;
  Σ_{i} −n_{i}/(π_{i}(1−π_{i})) (∂π(s_{i})/∂µ)(∂π(s_{i})/∂σ) ,  Σ_{i} −n_{i}/(π_{i}(1−π_{i})) (∂π(s_{i})/∂σ)^{2} ].
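
In modern R this model can be fitted directly as a generalized linear model with a probit link; the stimulus levels and response counts below are hypothetical, and µ and σ are recovered from the fitted intercept and slope since Φ(β_{0} + β_{1}s) = Φ((s − µ)/σ) gives µ = −β_{0}/β_{1} and σ = 1/β_{1}.

```r
# hypothetical dose-response data: r = 5 stimulus levels
s <- c(1, 2, 3, 4, 5)          # stimulus levels
n <- c(50, 50, 50, 50, 50)     # individuals tested per level
m <- c(5, 14, 26, 38, 46)      # number responding at each level

fit <- glm(cbind(m, n - m) ~ s, family = binomial(link = "probit"))
beta <- coef(fit)
mu.hat    <- -beta[1] / beta[2]
sigma.hat <- 1 / beta[2]
c(mu.hat, sigma.hat)
```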

Method of Moments

• It is the oldest method of deriving point estimators.

Proposed by Karl Pearson (1894).

• Consider a parametric problem where X_{1}, . . . , X_{n} are i.i.d. random variables from Pθ, θ ∈ Θ ⊂ R^{k}.

Suppose that m_{1}(θ), . . . , m_{k}(θ) are the first k moments of the population we are sampling from:

m_{j}(θ) = Eθ(X_{1}^{j}).

• Define the jth sample moment ˆm_{j} by

ˆm_{j} = (1/n) Σ_{i=1}^{n} X_{i}^{j} = E_{F_{n}}(X^{j}).

• Suppose we want to estimate q(θ) which can be expressed as

q(θ) = g(m_{1}(θ), . . . , m_{k}(θ)),

where g is a continuous function.

• The method of moments estimate of q(θ) is

T (X) = g( ˆm_{1}, . . . , ˆm_{k}).

• Basic ideas:

– Law of Large Numbers: ˆm_{j} →^{P} m_{j}(θ).

– Continuity of g.

Example 12. Consider the estimation of µ
and σ^{2} if X_{1}, . . . , X_{n} are a random sample from
a population with mean µ and variance σ^{2}.
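
A method-of-moments answer replaces the population moments by sample moments: ˆµ = ˆm_{1} and ˆσ^{2} = ˆm_{2} − ˆm_{1}^{2}. A minimal R check on simulated data (the values µ = 5 and σ = 2 are assumed only for illustration):

```r
set.seed(4)
x <- rnorm(200, mean = 5, sd = 2)
mu.hat     <- mean(x)                 # first sample moment
sigma2.hat <- mean(x^2) - mean(x)^2   # second sample moment minus square of the first
c(mu.hat, sigma2.hat)
```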

Example 13.

• Normal Mixtures. Consider an industrial setting with a production process in control, so that the outcome follows a known distribution, which we shall take to be the standard normal distribution.

However, it is suspected that the production process has become contaminated, with the contaminating portion following some other unknown normal distribution N (η, τ^{2}).

A sample x_{1}, . . . , x_{n} of the output is drawn.

The X_{i}'s are therefore assumed to be i.i.d. according to the distribution

pN (0, 1) + (1 − p)N (η, τ^{2}).

Applying the method of moments, we have

m_{1}(p, η, τ ) = E(X_{i}) = (1 − p)η,
m_{2}(p, η, τ ) = E(X_{i}^{2}) = p + (1 − p)(η^{2} + τ^{2}),
m_{3}(p, η, τ ) = E(X_{i}^{3}) = (1 − p)η(η^{2} + 3τ^{2}),

and thus obtain the estimating equations

(1 − p)η = x̄,
p + (1 − p)(η^{2} + τ^{2}) = n^{−1} Σ_{i} x_{i}^{2},
(1 − p)η(η^{2} + 3τ^{2}) = n^{−1} Σ_{i} x_{i}^{3}.

– Do you know how to express (p, η, τ ) as
functions of m_{1}, m_{2} and m_{3}?

– In general, how can we know whether the above task is possible?

– Implicit Function Theorem in advanced calculus.

• For the above example, suppose τ = 1. The resulting estimators of η and 1 − p are

ˆη = (n^{−1} Σ_{i} X_{i}^{2} − 1)/X̄,  and  1 − ˆp = X̄^{2}/(n^{−1} Σ_{i} X_{i}^{2} − 1).
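
A short R check of these two formulas on simulated data (p = 0.7, η = 4, and τ = 1 are assumed purely for illustration):

```r
set.seed(5)
n <- 5000
p <- 0.7; eta <- 4                       # assumed true values (tau = 1)
x <- ifelse(runif(n) < p, rnorm(n, 0, 1), rnorm(n, eta, 1))

eta.hat     <- (mean(x^2) - 1) / mean(x)
one.minus.p <- mean(x)^2 / (mean(x^2) - 1)
c(eta.hat, 1 - one.minus.p)              # compare with eta = 4 and p = 0.7
```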

The Frequency Substitution Principle

• Suppose we observe n multinomial trials in which the values v_{1}, . . . , v_{k} of the population being sampled are known, but their respective probabilities p_{1}, . . . , p_{k} are completely unknown.

• Let N_{i} denote the number of indices j such that X_{j} = v_{i}. Then (N_{1}, . . . , N_{k}) has a multinomial distribution with parameters (n, p_{1}, . . . , p_{k}).

Here Σ_{i} N_{i} = n, n is any natural number, and (p_{1}, . . . , p_{k}) is any vector in

{(p_{1}, . . . , p_{k}) : p_{i} ≥ 0, Σ_{i} p_{i} = 1}.

• If (N_{1}, . . . , N_{k}) has an M(n, p_{1}, . . . , p_{k}) distribution, then

p(n_{1}, . . . , n_{k}) = [n!/(n_{1}! · · · n_{k}!)] p_{1}^{n_{1}} · · · p_{k}^{n_{k}},

E(N_{i}) = np_{i}, V ar(N_{i}) = np_{i}(1 − p_{i}), and Cov(N_{i}, N_{j}) = −np_{i}p_{j} for i ≠ j.

• The intuitive estimate of p_{i} is N_{i}/n, the proportion of sample values equal to v_{i}.

• Suppose we want to estimate a continuous function q(p_{1}, . . . , p_{k}).

The frequency substitution principle gives the estimate by replacing the unknown population frequencies p_{1}, . . . , p_{k} by the observable sample frequencies N_{1}/n, . . . , N_{k}/n. That is,

T (N_{1}, . . . , N_{k}) = q(N_{1}/n, . . . , N_{k}/n).

• Basic ideas:

– Law of Large Numbers: N_{j}/n →^{P} p_{j}.

– Continuity:

A function f is said to be continuous at x_{0} if f (x_{0}+) and f (x_{0}−) exist and f (x_{0}+) = f (x_{0}−) = f (x_{0}).

Refer to any advanced calculus book for details.

Example 14. Estimation in 2 × 2 tables

• Consider n independent trials, the outcome of each classified according to two criteria, as A or Ā, and as B or B̄.

For example, a series of operations is being classified according to the gender of the patient and the success or failure of the treatment.

• The results can be displayed in a 2 × 2 table as shown below,

        B         B̄
A     n_{AB}    n_{AB̄}    n_{A}
Ā     n_{ĀB}    n_{ĀB̄}    n_{Ā}
      n_{B}     n_{B̄}     n

where n_{AB} is the number of cases having both attributes A and B, and so on.

• The joint distribution of the four cell entries is then multinomial with parameters (n, p_{AB}, p_{AB̄}, p_{ĀB}, p_{ĀB̄}).

• A standard measure of the degree of association of the attributes A and B is the cross-product ratio (also called the odds ratio)

ρ = (p_{AB} p_{ĀB̄})/(p_{AB̄} p_{ĀB}).

– Use the fact that p_{AB} = p_{A}p_{B|A}, where p_{A} and p_{B|A} denote the probability of A and the conditional probability of B given A, respectively. It leads to

ρ = (p_{B|A} p_{B̄|Ā})/(p_{B|Ā} p_{B̄|A}).

– Think of A as maternal age no more than 20, Ā as maternal age greater than 20, B as birthweight no more than 2,500 gms, and B̄ as birthweight greater than 2,500 gms. The odds ratio can be used to relate the risk of an underweight baby to the maternal age.

– The attributes A and B are said to be positively associated if

p_{B|A} > p_{B|Ā} and p_{B̄|Ā} > p_{B̄|A},

and these conditions imply that ρ > 1.

– In the case of negative dependence, the above inequalities are reversed.

– Independence of A and B is characterized by equality instead of inequality and hence by ρ = 1.

• The odds ratio ρ is estimated by replacing the cell probabilities p_{AB}, . . . by the corresponding frequencies n_{AB}/n, . . ..
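
A minimal R sketch of this frequency substitution estimate, using hypothetical cell counts:

```r
# hypothetical 2x2 table of counts: rows A / not-A, columns B / not-B
n.AB <- 30;    n.AnotB    <- 70
n.notAB <- 20; n.notAnotB <- 180

# frequency substitution: replace each cell probability by its sample frequency;
# the factors 1/n cancel, so the estimate uses the counts directly
rho.hat <- (n.AB * n.notAnotB) / (n.AnotB * n.notAB)
rho.hat
```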

The Method of Least Squares

• It became widely used early in the nineteenth century by Gauss for estimation in problems of astronomical measurement.

• Suppose that water is being pumped through a container to which an amount of dye has been added.

Every few seconds the concentration of dye is measured in the water leaving the container.

It is expected that the concentration of dye will decrease linearly over time.

Since the measuring equipment may not be perfectly accurate, it may not be possible to interpret the measurements exactly, and the mixing may not behave exactly as predicted.

To determine the rate at which the concentration decreases, the experimenter would have to approximate the data by a straight line, a line that best approximates the data in some sense. A common approach is to employ the method of least squares.

• The model above is called a linear model because it is a linear combination of the model functions 1 and x.

Here Y_{x} is the concentration of dye observed at time x.

The model can be written as

Y_{x} = θ_{1} + θ_{2}x + ε_{x},

where Y_{x} is often called the response observed at x, (θ_{1}, θ_{2}) is a 2-vector of unknown parameters, x is an explanatory variable (or covariate), and ε_{x} is random error.

Our data is (x, Y_{x}) and ε_{x} cannot be observed.

x can be random or nonrandom.

• Nonlinear models are also used. A common example is an exponential model such as

Y_{t} = θ_{1} exp(θ_{2}t) + ε_{t}.

Here the model is a nonlinear function of the parameter θ_{2}.

• In either case, we can write the observations (x_{i}, y_{i}) in the form

Y_{i} = g_{i}(θ_{1}, . . . , θ_{k}) + ε_{i}, 1 ≤ i ≤ n,

where

– The g_{i} are known functions and the real numbers θ_{1}, . . . , θ_{k} are unknown parameters of interest.

– The parameters (θ_{1}, . . . , θ_{k}) can vary freely over a set Ω contained in R^{k}.

– The ε_{i} satisfy the following restrictions:

E(ε_{i}) = 0, 1 ≤ i ≤ n,
V ar(ε_{i}) = σ^{2}, 1 ≤ i ≤ n,
Cov(ε_{i}, ε_{j}) = 0, 1 ≤ i < j ≤ n.

– E(Y_{i}) = g_{i}(θ_{1}, . . . , θ_{k}) with unknown θ_{1}, . . . , θ_{k}.
Example 15.

• Suppose that we want to find out how increasing the amount x of a certain chemical or fertilizer in the soil increases the amount y of that chemical in the plants grown in that soil.

– Nine samples of soil were treated with different amounts x of phosphorus.

– Y is the amount of phosphorus found in corn plants grown for 38 days in the different samples of soil.

– The (x_{i}, y_{i}) are (1, 64), (4, 71), (5, 54), (9, 81), (11, 76), (13, 93), (23, 77), (23, 95), (28, 109).

• Assume the relationship between x and y can be approximated well by the random model y_{i} = θ_{1} + θ_{2}x_{i} + ε_{i}.

• A least squares estimator of (θ_{1}, θ_{2}) is defined to be the minimizer of

Q(θ_{1}, θ_{2}) = Σ_{i=1}^{9} (y_{i} − θ_{1} − θ_{2}x_{i})^{2}.

We then run into an optimization problem.

Note that

– Q(θ_{1}, θ_{2}) is a quadratic function of (θ_{1}, θ_{2}).

– There is no restriction on the range of (θ_{1}, θ_{2}). (i.e., (θ_{1}, θ_{2}) ∈ R^{2}, which is an open set.)

It follows from vector calculus that the least squares estimate (ˆθ_{1}, ˆθ_{2}) must satisfy the equations

∂Q(θ_{1}, θ_{2})/∂θ_{j} = 0, j = 1, 2.

If a constraint were imposed, we might need to use the method of Lagrange multipliers to find the minimizer.

– Differentiation leads to the following normal equations

Σ_{i} (y_{i} − θ_{1} − θ_{2}x_{i}) = 0,
Σ_{i} x_{i}(y_{i} − θ_{1} − θ_{2}x_{i}) = 0.

The sample regression line is 61.58 + 1.42x.
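
The fitted line can be reproduced in R with lm(), which solves exactly these normal equations:

```r
x <- c(1, 4, 5, 9, 11, 13, 23, 23, 28)
y <- c(64, 71, 54, 81, 76, 93, 77, 95, 109)
fit <- lm(y ~ x)
coef(fit)   # intercept about 61.58, slope about 1.42
```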

• If some of ε_{1}, · · · , ε_{n} have more chance of being small than others, it might seem more sensible to estimate θ_{1} and θ_{2} by minimizing some weighted sum of squares

Σ_{i=1}^{n} w_{i}(y_{i} − θ_{1} − θ_{2}x_{i})^{2},

the w_{i} being weights which are larger for those i for which ε_{i} is liable to be small and smaller for those for which ε_{i} is liable to be large.

Optimization and Least Squares

• The word optimization denotes either the minimization or maximization of a function.

• Consider a real-valued function h with domain D in R^{k}. The function h is said to have a local maximum at a point θ^{∗} ∈ D if there exists a real number δ > 0 such that h(θ) ≤ h(θ^{∗}) for all θ ∈ D satisfying ‖θ − θ^{∗}‖ ≤ δ.

A local minimum is defined in a similar way, but with the inequality h(θ) ≤ h(θ^{∗}) reversed.

If the inequality h(θ) ≤ h(θ^{∗}) is replaced by a strict inequality,

h(θ) < h(θ^{∗}), θ ∈ D, θ ≠ θ^{∗},

we have a strict local maximum; and if the sense of the inequality h(θ) < h(θ^{∗}) is reversed, we have a strict local minimum.

• We say that the function h has a global (absolute) maximum (strict global maximum) at θ^{∗} if h(θ) ≤ h(θ^{∗}) [h(θ) < h(θ^{∗})] holds for every θ ∈ D.

Thus a function may have many local maxima, each with a different value of h(θ), say, h(θ_{j}^{0}), j = 1, . . . , ℓ.

The global maximum can always be chosen from among these local maxima by comparing their values and choosing one such that

h(θ^{∗}) ≥ h(θ_{j}^{0}), j = 1, . . . , ℓ,

where θ^{∗} ∈ {θ_{j}^{0}, j = 1, . . . , ℓ}.

It is clear that every global maximum (minimum) is also a local maximum (minimum); however, the converse of this statement is, in general, not true.

Only when h(θ) is a convex function on a convex set D ⊂ R^{k} is every local minimum of h at θ ∈ D also a global minimum of h over D (correspondingly, for a concave h, every local maximum is global).

• Minimization of a one-dimensional function h(θ), without any restrictions on θ, by Newton’s method:

– Assume that h(θ) has at least two continuous derivatives and that it is bounded below.

– Approximate h(θ) by a quadratic function that we can minimize, and use the minimizer of the simpler function as the new estimate of the minimizer of h(θ).

The process is then repeated from this new point.

– To form a quadratic approximation, let θ^{(t)} be the current estimate of the solution θ^{∗}, and consider a Taylor series expansion of h about the point θ^{(t)}:

h(θ^{(t)} + s) = h(θ^{(t)}) + sh′(θ^{(t)}) + (1/2)s^{2}h′′(θ^{(t)}) + · · · .

The original minimization problem can then be approximated using the Taylor series expansion:

h(θ^{∗}) = min_{θ} h(θ) = min_{s} h(θ^{(t)} + s)
= min_{s} [h(θ^{(t)}) + sh′(θ^{(t)}) + (1/2)s^{2}h′′(θ^{(t)}) + · · ·]
≈ min_{s} [h(θ^{(t)}) + sh′(θ^{(t)}) + (1/2)s^{2}h′′(θ^{(t)})].

– To minimize the quadratic, take the derivative with respect to s and set it equal to zero, giving

s = −h′(θ^{(t)})/h′′(θ^{(t)}).

Since s is an approximation to the step that would take us from θ^{(t)} to the solution θ^{∗} of the original problem, the algorithm is defined by the formula

θ^{(t+1)} = θ^{(t)} − h′(θ^{(t)})/h′′(θ^{(t)}).

• Optimization in many dimensions with linear regression

– Consider Example 15, in which Q(θ_{1}, θ_{2}) can be written as

Q(θ_{1}, θ_{2}) = (θ_{1}, θ_{2}) [ 9, Σ_{i} x_{i} ; Σ_{i} x_{i}, Σ_{i} x_{i}^{2} ] (θ_{1}, θ_{2})^{T} − 2(θ_{1}, θ_{2}) (Σ_{i} y_{i}, Σ_{i} x_{i}y_{i})^{T} + Σ_{i} y_{i}^{2},

where the 2 × 2 matrix is written row by row.

– How do we differentiate a quadratic form θ^{T}Aθ? Here A is a k × k symmetric matrix.

Result: ∂(θ^{T}Aθ)/∂θ = 2Aθ.

– How do we differentiate θ^{T}b? Here b is a k × 1 column vector.

Result: ∂(θ^{T}b)/∂θ = b.

– Matrix formulation of the linear model:

y = Xθ + ε.

Here y = (y_{1}, . . . , y_{n})^{T}, X = (x_{ij})_{n×k}, and ε = (ε_{1}, . . . , ε_{n})^{T}. Observe that

(y − Xθ)^{T}(y − Xθ) = θ^{T}X^{T}Xθ − 2θ^{T}X^{T}y + y^{T}y.

Differentiation leads to the normal equations

2X^{T}Xθ − 2X^{T}y = 0.

Any solution of the above is an LSE of θ. If X is of full rank, in which case (X^{T}X)^{−1} exists, then there is a unique LSE, which is

ˆθ = (X^{T}X)^{−1}X^{T}y.

– In R, the function solve inverts matrices and solves systems of linear equations; solve(A) inverts A and solve(A, b) solves A %*% x = b.

If the system is over-determined, the least-squares fit is found, but matrices of less than full rank give an error.

– Consider the simple linear regression. It turns out that

X^{T}X = [ n, Σ_{i} x_{i} ; Σ_{i} x_{i}, Σ_{i} x_{i}^{2} ].

The matrix is invertible if and only if some of the x_{i}’s are different.
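
The normal equations for Example 15 can be solved directly with solve(), reproducing the lm() coefficients shown earlier:

```r
x <- c(1, 4, 5, 9, 11, 13, 23, 23, 28)
y <- c(64, 71, 54, 81, 76, 93, 77, 95, 109)
X <- cbind(1, x)                              # design matrix with an intercept column
theta.hat <- solve(t(X) %*% X, t(X) %*% y)    # solves (X^T X) theta = X^T y
theta.hat
```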

• Optimization in Many Dimensions: New- ton’s Method

– Newton’s method (also called the Newton-Raphson method) is a widely used and often-studied method for minimization.

– The method requires use of both the gradient vector and the Hessian matrix in computations; hence it places more burden on the user to supply derivatives of the objective function than does the steepest descent method learned in calculus (the gradient vector defines the direction of maximum local increase).

– Write the Taylor series in matrix/vector form. In two dimensions, the second-order Taylor series approximation is

h(θ_{1} + s_{1}, θ_{2} + s_{2}) ≈ h(θ_{1}, θ_{2}) + s_{1}D^{(1,0)}h(θ_{1}, θ_{2}) + s_{2}D^{(0,1)}h(θ_{1}, θ_{2})
+ (1/2)[s_{1}^{2}D^{(2,0)}h(θ_{1}, θ_{2}) + 2s_{1}s_{2}D^{(1,1)}h(θ_{1}, θ_{2}) + s_{2}^{2}D^{(0,2)}h(θ_{1}, θ_{2})].
– Let ∇^{2}h be the constant matrix of second partial derivatives of h at θ — the so-called Hessian matrix:

(∇^{2}h)_{ij} = ∂^{2}h(θ)/∂θ_{i}∂θ_{j}.

– With the gradient and Hessian notation, the Taylor series in many dimensions takes the form

Q(θ + p) = h(θ) + p^{T}∇h(θ) + (1/2)p^{T}∇^{2}h(θ)p.

– When θ^{(t)} is close to θ^{∗}, we can expect that the above quadratic function will approximate h(θ) well.

To obtain the step p, we now minimize this quadratic as a function of p by forming its gradient with respect to p,

∇_{p}Q(p) = ∇_{p}[p^{T}∇h(θ) + (1/2)p^{T}∇^{2}h(θ)p] = ∇h(θ) + ∇^{2}h(θ)p,

and setting it equal to zero:

∇^{2}h(θ)p = −∇h(θ).

This is a set of k linear equations in the k unknowns p = (p_{1}, . . . , p_{k})^{T}.

These linear equations are called the Newton equations.

If ∇^{2}h(θ) is positive definite, this suggests the general iterative scheme

θ^{(t+1)} = θ^{(t)} + p = θ^{(t)} − [∇^{2}h(θ^{(t)})]^{−1}∇h(θ^{(t)}).

– When h(θ) is closely approximated by Q(p) in the neighborhood of θ^{∗}, convergence will normally be at a quadratic rate if the Hessian is positive definite at each step.

– One problem with Newton’s method is that the Hessian may not be positive definite at each iteration.

Thus the method requires modification to ensure that the resultant method is acceptable but still retains the desirable characteristics of Newton’s method.

Recall the nonlinear regression. If a least-squares approach were used, the following optimization problem would be obtained:

min_{θ_{0},θ_{1}} Σ_{i=1}^{n} [Y_{i} − θ_{0} exp(θ_{1}T_{i})]^{2}.

This is called a nonlinear least-squares problem. No analytic solution can be found. More details will be given when we discuss MLE later on.
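
In R such problems can be attacked numerically, for instance with nls(); in the sketch below the data are simulated under the hypothetical values θ_{0} = 2 and θ_{1} = −0.5, and the starting values passed to nls() are rough guesses.

```r
set.seed(6)
t.obs <- seq(0, 10, by = 0.5)
y.obs <- 2 * exp(-0.5 * t.obs) + rnorm(length(t.obs), sd = 0.05)

fit <- nls(y.obs ~ th0 * exp(th1 * t.obs),
           start = list(th0 = 1, th1 = -0.1))   # rough starting values
coef(fit)   # should be near (2, -0.5)
```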

Prediction

• Suppose we have a random vector (or variable) X with E(X^{T}X) < ∞ and a random variable Y.

One may wish to predict the value of Y based on an observed value of X.

Let g(X) be the predictor with E[g(X)]^{2} < ∞.

• As a motivating example, a stock holder wants to predict the value of his holdings at some time in the future (Y ) on the basis of his past experience with the market and his portfolio (X).

• Suppose we use a linear function of X (instead of a nonlinear function) to predict Y. What is the best linear predictor under mean squared error?

– Suppose that E(X^{2}) and E(Y^{2}) are finite and X and Y are not constant. Then the unique best zero-intercept linear predictor is obtained by taking

a = a_{0} = E(XY )/E(X^{2}),