Methods for Statistical Prediction Financial Time Series I Topic 2: Methods of Estimation Hung Chen Department of Mathematics National Taiwan University 9/28/2002

56  Download (0)

Full text


Methods for Statistical Prediction

Financial Time Series I

Topic 2: Methods of Estimation

Hung Chen

Department of Mathematics National Taiwan University



OUTLINE 1. Motivated Examples 2. Statistical Models

3. Maximum Likelihood Estimates 4. Substitution Principles

– Method of Moments – Frequency Substitution 5. Method of Least Squares


Motivated Examples

Example 1. Censored exponentially distributed survival times

• Suppose W is a nonnegative random vari- able having an exponential distribution with mean θ > 0. Its pdf is given by

fW(w; θ) = θ−1 exp(−w/θ)I(0,∞)(w), where the indicator function I(0,∞)(w) = 1 for w > 0 and is zero elsewhere. The distri- bution function is given by

FW(w; θ) = {1 − exp(−w/θ)}I(0,∞)(w).

• In survival or reliability analyses, a study to observe a random variable W1, . . . , Wn will generally be terminated in practice before all of these random variables are able to be observed.

– Let y = (y1T, . . . , ynT)T denote the ob- served data, where yj = (cj, δj)T and δj = 0 or 1 according as the observa- tion Wj is censored or uncensored at cj (j = 1, . . . , n).


– If the observation Wj is uncensored, its realized value wj is equal to cj.

– If the observation Wj is censored at cj, then wj is some value greater than cj (j = 1, . . . , n).

– In medical study, it is commonly assumed that the censored data is caused by com- peting risk.

Under this assumption, it is assumed that

(W1, R1), . . . , (Wn, Rn) are iid, Ci = min(Wi, Ri), and δi = 1{Wi≤Ri}.

Here R is a nonnegative random variable.

• Approach 1: Model C directly.

– Derive the density function of C. Note that

P (C ≤ y) = P (min(W, R) ≤ y)

= 1 − P (W > y, R > y) = 1 − e−y/θ[1 − FR(y)]

and hence

fC(c) = θ−1e−c/θ[1−FR(c)]+e−c/θfR(c).

– Assuming R is exponentially distributed


with mean λ, we have fC(c) = θ + λ

θλ exp

−θ + λ θλ c


Hence, C is again exponentially distributed with mean θλ/(θ + λ).

– How do we estimate θ?

We should use information contained in δj. Note that δ is a Bernoulli random variable with probability of success

P (W ≤ R) = Z0Z0r θ−1e−y/θfR(r)dydr

= Z0 fR(r)[1 − exp(−r/θ)]dr

= 1 − Z0 e−r/θfR(r)dr.

When R is exponentially distributed with mean λ, we have

P (W ≤ R) = λ λ + θ.

By the law of large numbers, we consider using n−1 Pi δi to estimate P (W ≤ R).

• Approach II: Method of Maximum Likeli- hood

– We have iid observations (C1, δ1), . . . , (Cn, δn) and need to find the probability density


function of (C, δ). Observe that

P (C ≤ c, δ = 1) = P (W ≤ R, W ≤ c)

= Z0c Z0r fW(w)fR(r)dwdr + ZcZ0c fW(w)fR(r)dwdr

= Z0c FW(r)fR(r)dr + Zc FW(c)fR(r)dr

= Z0c FW(r)fR(r)dr + FW(c)[1 − FR(c)].


f (C = c, δ = 1) = fW(c)[1 − FR(c)].

By the same argument, we have

f (C = c, δ = 0) = [1 − FW(c)]fR(c).

– The likelihood function is


i (fW(wi) [1 − FR(wi)])δi(fR(wi) [1 − FW(wi)])1−δi

= Y

i (fW(wi))δi [1 − FW(wi)]1−δi


i (fR(wi))1−δi [1 − FR(wi)]δi . – For simplicity, we relabel the observa-

tions such that W1, . . . , Wr denote the r uncensored observations and Wr+1, . . . , Wn the n − r censored observations.

The likelihood function for θ formed on


the basis of y is given by



i=1−1 exp(−wi/θ)] Yn

i=r+1{1 − [1 − exp(−wi/θ)]}

= θ−r exp(− Xn


– In this case, the MLE of θ can be derived explicitly from the standard differentia- tion technique.

θ =ˆ Xn


– Rewrite ˆθ as

n−1 nX



It can be shown that ˆθ will converge to θ in probability.


• The exponential distribution is often used to model lifetimes or waiting times.

• Suppose that we consider modeling the life- time of an electronic component, T , as an exponential random variable with parame- ter θ. Its implication is as follows:

P (T > t + s|T > s) = P (T > t + s and T > s) P (T > s)


= P (T > t + s)

P (T > s) = e−(t+s)/θ e−s/θ

= exp(−t/θ).

This is so-called memoryless property of ex- ponential distribution.

• Does it make sense to use exponential dis- tribution to model human lifetimes?

Compare the probability that a 16-year-old will live at least 10 more years and the prob- ability that a 80-year-old will live at least 10 more years.

• Hazard function h(t): It is defined as h(t) = f (t)

1 − F (t)(= − d

dt log S(t)),

where S(t) = P (T > t) = 1 − F (t). I can be thought of as the instantaneous death rate for individuals who are alive at time t.

If an individual is alive at time t, the prob- ability that that individual will die in the time interval (t, t + δ) is, assuming that the density function is continuous at t,

P (t ≤ T ≤ t + δ|T ≥ t) = P (t ≤ T ≤ t + δ) P (T ≥ t)


= F (T ≤ t + δ) − F (t)

1 − F (t) ≈ δf (t) 1 − F (t).

• For an exponential random variable T with mean θ, its hazard function is 1/θ (a con- stant function).

As a remark, the expectation of an exponen- tial random variable is θ.

Do you think that the connection between the expectation and the hazard function is a coincidence? Is there any intuitive expla- nation?

• Usually, the hazard function of human life- times is assumed to be of bathtub shape.

How would you model the density function of human lifetimes?


Example 2. Model heterogeneous data by finite-mixture models

• In the problem considered by Do and McLach- lan (1984), the population of interest con- sists of rats from g species G1, . . . , Gg, that are consumed by owls in some unknown pro- portions π1, . . . , πg.

• The problem is to estimate the π on the ba- sis of the observation vector W containing measurements recorded on a sample of size n of rat skulls taken from owl pellets.

The rats constitute part of an owl’s diet, and indigestible material is regurgitated as a pellet.

• Use the argument of conditioning, the un- derlying population can be modeled as con- sisting of g distinct groups G1, . . . , Gg in some unknown proportions π1, . . . , πg, and where the conditional pdf of W given mem- bership of the ith group Gi is fi(w).

• Let y = (w1T, . . . , wnT)T denote the observed random sample obtained from the mixture



f (w; (π1, . . . , πg−1)) = Xg


• The log likelihood function for (π1, . . . , πg−1) can be formed from the observed data y is given by



i=1 log





• On differentiating log likelihood function with respect to πj (j = 1, . . . , g − 1), we obtain





f (wi; (π1, . . . , πg−1)) − fg(wi)

f (wi; (π1, . . . , πg−1))

= 0, for j = 1, . . . , g −1. It clearly does not yield

an explicit solution for (π1, . . . , πg−1)T.


Statistical models

• Most studies and experiments, scientific or industrial, large scale or small, produce data whose analysis is the ultimate object of the endeavor.

– Compare the efficiency of two ways of do- ing something under similar conditions such as: brewing coffee; reducing pol- lution; treating a disease; producing en- ergy; learning a maze; and so on.

– Abstraction: It can be thought of as a problem of comparing the efficacy of two methods applied to the members of a cer- tain population.

– Run m + n independent experiments as follows: m + n members of the popula- tion are picked at random and m of these are assigned to the first method and the remaining n are assigned to the second method.

– In comparing two drugs A and B we would administer drug A to m and drug B to n randomly selected patients and then


measure temperature, blood pressure, have the patient rated quantitatively for im- provement by physicians, and so on.

– Random variability would come primar- ily from differing responses among pa- tients to the same drug, but also from error in the measurements and variation in the purity of the drugs.

– one sample location model for measure- ment:

Let X1, . . . , Xn be the n determinations of µ. Write

Xi = µ + i, 1 ≤ i ≤ n,

where  = (1, . . . , n) is the vector of errors.

– two-sample problem:

Let X1, . . . , Xn be the n samples from the population with distribution F and Y1, . . . , Ym be the m samples from the population with distribution G.

• Many statistical procedures are based on sta- tistical models which specify under which conditions the data are generated.


• Usually the assumption is made that the set of observations x1, . . . , xn is a set of (i) in- dependent random variables (ii) identically distributed with common pdf f (xi, θ).

• Once this model is specified, the statistician tries to find optimal solutions to his problem (usually related to the inference on a set of parameters θ ∈ Θ ⊂ Rk, characterizing the uncertainty about the model).

Does this statement fit to the just-mentioned two-sample problem?

• Any statistical inference starts from a basic family of probability measures, expressing our prior knowledge about the nature of the probability measures from where the obser- vations originate.

A model P is a collection of probability mea- sures P on (X , A) where X is the sample space with a σ-field of subsets A.

• If P = {Pθ : θ ∈ Θ}, Θ ⊂ Rk for some k, then P is a parametric model.

• Example 3. Bernoulli trials


– Consider a new model of automobile which is being produced in large numbers.

– Choose one at random from the produc- tion line and observe whether or not it suffers a mechanical breakdown within two years.

– In each trial, there are only two possible observations. The sample space consists of two elements 1 (representing break- down) and 0 (representing no breakdown).

– The inherent variability in the situation is described by a probability distribution which in this case is defined by a single number θ, the probability of breakdown.

– The possible probability distribution on the sample space can be described by a Bernoulli trial with an unknown param- eter θ between 0 and 1.

• Example 4. The parameter is a function.

– Suppose we have a large batch of seeds stored under constant conditions of tem- perature and humidity.

– In the course of time seeds die.


Suppose that at time t a proportion π(t) of the stored seeds are still alive.

– At each of times t1, t2, . . . , ts we take a random sample of n seeds and observe how many are still alive.

– A typical observation consists of an or- dered set (r1, r2, . . . , rs) of integers, ri being the number of seeds observed to be alive at time ti.

– The appropriate distribution for describ- ing the variable element in this situation is

p(r1, r2, . . . , rs) = Ys

i=1C(n, ri)[π(ti)]ri[1−π(ti)]n−ri. Here π(·) is an unknown distribution. In

this example, the parameter is a func- tion.

– Isotonic regression problem: π(t) is nec- essarily a non-increasing function of t, taking values between 0 and 1.

Can we find a parametric model for π(t)?


Related Issues:

• Suppose that we have a fully specified para- metric family of models. Denote the param- eter of interest by θ.

• Suppose that we wish to calculate from the data a single value representing the “best estimate” that we can make of the unknown parameter. We call such a problem one of point estimation.

• Instead of point estimation, we can estimate the parameter by giving a confidence inter- val which is associated with the probability of covering the true value.

When we say that a 95% CI of θ is (0.3, 0.7), it does not mean that there is a 95% prob- ability of θ ∈ (0.3, 0.7).

Such a claim does not make any sense since – Although θ is unknown, it is still a fixed


– (0.3, 0.7) is a known fixed interval.

– θ is either in (0.3, 0.7) or not in that in- terval. It will not be sometimes in (0.3, 0.7) or sometimes not in.


– The precise meaning of probability 0.95 will be discussed later on.

– The probability 0.95 refers to the proba- bility that θ is in a random interval.

Here (0.3, 0.7) is one realization of that random interval.

• Distinction between data and random vari- ables:

In statistics, we deal with data only.

Why do we need to introduce random vari- ables?

Attitudes on Models:

• The statistician may be a “pessimist” who does not believe in any particular model f (x, θ).

In this case he must be satisfied with de- scriptive methods (like exploratory data anal- ysis) without the possibility of inductive in- ference.

• The statistician may be an “optimist” who strongly believes in one model. In this case the analysis is straightforward and optimal solutions may often be easily obtained.


• The statistician may be “realist”: he would like to specify a particular model f (x, θ) in order to get operational results but he may have either some doubt about the validity of this hypothesis or some difficulty in choosing a particular parametric family.

Let us illustrate this kind of preoccupation with an example.

• Suppose that the parameter of interest is the

“center” of some population.

• In many situations, the statistician may ar- gue that, due to a central limit effect, the data are generated by a normal pdf.

• In this case the problem is restricted to the problem of inference on µ, the mean of the population.

• But in some cases, he may have some doubt about these central limit effects and may suspect some skewness and/or some kurtosis or he may suspect that some observations are generated by other models (leading to the presence of outliers).


In this context three types of question may be raised to avoid gross errors in the predic- tion, or in the inference:

– Does the optimal solution, computed for assumed model f (x, θ), still have “good”

properties if the true model is a little dif- ferent?

This question is concerned with the sensi- tivity of a given criterion to the hypothe- ses (criterion robustness).

Question: Validity of one-sample t-test Partial Answer: Central Limit Theo- rem

– Are the optimal solutions computed for other models near to the original one re- ally substantially different?

In this question, it is the sensitivity of the inference that is analyzed (inference robustness).


Maximum Likelihood Estimates

• The true distribution on the sample space can be labeled by a parameter θ taking val- ues in a finite-dimensional Euclidean space.

• We further assume the family {Pθ : θ ∈ Θ}

(Θ ⊂ Rk) possesses density functions {pθ : θ ∈ Θ} with respect to some natural mea- sure on the sample space, such as counting measure if the sample space is discrete or Lebesgue measure when it is not.

– In the discrete case, pθ(x) is the proba- bility of the point x when θ is the true parameter.

– In the continuous case, pθ(x) is the prob- ability density at x when θ is the true parameter.

• x: the observed set of values obtained in an experiment.

• Consider p(x, θ) as a function of θ for fixed x.

p(x, θ) is called the likelihood function.

We also write it L(θ, x).


L(θ, x) gives the probability of observing x for each θ when X is discrete.

• Idea: Find the value ˆθ of the parameter which is most plausible after we have ob- served the data.

• A maximum likelihood estimate ˆθ is any el- ement of Θ such that

p(x, θ(x)) = max

θ∈Θ p(x, θ).

• This principle was first put forward as a novel and original method of deriving es- timators by R.A. Fisher in 1922. It very soon proved to be a fertile approach to sta- tistical inference in general, and was widely adopted; but the exact properties of the en- suring estimators and test procedures were only gradually discovered.

• How do we find the maximum of L(θ, x)?

– A systematic way we learn in calculus is to transform a maximization problem to a root-finding problem.

– The above strategy may not always work.

Refer to the the uniform example.


– Computationally feasibility:

Before 1960, we only rely on the capacity of the human calculator, equipped with pencil and paper and with such aids as the slide rule, tables of logarithms, and other convenient tables.

The advent of electronic computer removes the restriction of the human operator.

– The estimates defined by nonlinear equa- tions can be established as a matter of routine by the appropriate iterative al- gorithms.


• Example 5. Suppose θ = 0 or 1 (Θ = {0, 1}) and p(x, θ) is given by the following table.

p(x, θ) x = 0 x = 2

θ = 1 0 1

θ = 2 0.1 0.9

Suppose that we observe two observations 2 and 2.

How do we get them?



– X: a discrete random variable with pmf p(x, θ)

– 2 and 2 are realizations of X1 and X2. – What is the pmf of (X1, X2)?


L(0, (2, 2)) = 1 L(1, (2, 2)) = (0.9)2 and ˆθ((2, 2)) = 0.

• Example 6. If x1, . . . , xn are i.i.d. accord- ing to the Poisson distribution P(λ), the likelihood is

L(λ, x) = λPixie−nλ/ Y

i xi!.

This is maximized by λ =ˆ X

i xi/n

which is therefore the MLE of λ.

In this example, Θ = (0, ∞) and k = 1.

Use rpois(20, 3) to generate 20 observations

from P(3). They are 2, 3, 3, 5, 6, 3, 0, 5, 3, 2, 2, 2, 2, 2, 4, 1, 3, 2, 7, 5.

Then ˆλ = 3.1.

Why do we need to introduce X?


• Example 7. Let X1, . . . , Xn be i.i.d. ac- cording to the uniform distribution U (0, θ), so that the likelihood is

L(θ, x) =

1/θn if 0 ≤ xi ≤ θ for all i 0 otherwise.

We can no longer differentiate L(θ, x) to get the MLE.

By direct maximization, the MLE is equal to x(n).

• Example 8. Consider n items whose times to failure X1, . . . , Xn form a sample from an E(θ) distribution. (i.e., p(x) = θ exp(−θx) for x > 0)

Suppose the items are inspected only at dis- crete times 1, 2, . . . , k so that we really ob- serve Y1, . . . , Yn where, for j = 1, . . . , n,

Yj = ` if ` − 1 < Xj ≤ `, ` = 1, . . . , k

= k + 1 if Xj > k.

Suppose n = 20, k = 5, and θ = 3. xis are 5.19, 0.06, 2.37, 4.38, 4.98, 13.02, 0.34, 7.26, 0.67, 1.96, 3.82, 0.27, 1.83, 3.48, 3.03, 1.90, 6.42, 7.49, 5.67, 6.27 and yis are 6, 1, 3, 5, 5, 6, 1, 6, 1, 2, 4, 1, 2, 4, 4, 2, 6, 6, 6, 6.


Let Ni = number of indices j such that Yj = i, i = 1, . . . , k + 1. Then the multinomial vec- tor N = (N1, . . . , Nk+1) is sufficient for θ and the likelihood function of N is

L(θ, n1, . . . , nk+1) = n!

n1! · · · nk+1!



j=1 pnjj(θ), where pj(θ) = exp(−[j − 1]θ) − exp(−jθ) for 1 ≤ j ≤ k and pk+1(θ) = exp(−kθ).

Question: How do we solve this problem?

Limitations on MLE

• It is a constant theme of the history of the method that the use of ML techniques is not always accompanied by a clear appreciation of their limitations.

• Example 9. (Neyman-Scott (1948) prob- lem)

In this example, the MLE is not even con- sistent.

Refer to J. Neyman and E.L. Scott. Con- sistent estimate based on partially consis- tent observations. Econometrica 16 1-32 (1948).

Estimation of a Common Variance:


Let Xαj (j = 1, . . . , r) be independently

distributed according to N (θα, σ2), α = 1, . . . , n.

The MLEs are

θˆα = Xα·, σˆ2 = 1 rn



α=1 r


j=1(Xαj − Xα·)2. Furthermore, these are the unique solutions of the likelihood equations.

However, in the present case, the MLE of σ2 is not even consistent.

To see this, note that the statistics Sα2 = Xr

j=1(Xαj − Xα·)2

are identically independently distributed with expectation

E(Sα2) = (r − 1)σ2

so that PSα2/n → (r − 1)σ2 and hence ˆ

σ2 → r − 1

r σ2 in probability.

• Example 10. (Non-existence of MLE) If Y1, . . . , Yn are i.i.d. according to the Pois- son distribution P (λ). Suppose for each i we observe only when Yi is 0 or positive and



Xi =

0 if Yi = 0 1 if Yi > 0.


P (Xi = 0) = exp(−λ), P (Xi = 1) = 1−exp(−λ), and the likelihood is

L(λ) = [1 − exp(−λ)]Pxi exp(−λ X[1 − xi]).

This is maximized by

λ = − log(1 − ¯ˆ x), provided P(1 − xi) > 0.

When all the x’s are = 1, the likelihood be- comes

L(λ) = [1 − exp(−λ)]n,

which is an increasing function of λ. In this case, the likelihood does not take on its max- imum for any finite λ and the MLE does not exist. (Does it make sense?)


– For any fixed n, the probability P (X1 =

· · · = Xn = 1) = (1 − exp(−λ))n tends to 1 as λ → ∞. Thus there will exist


values of λ for which the probability is close to 1 that the MLE is undefined.

– For any fixed λ, the probability P (X1 =

· · · = Xn = 1) = (1 − exp(−λ))n tends to 0 as n → ∞.

Iterative Procedures

In applications MLE’s typically do not have an- alytic forms and some numerical methods have to be used to compute MLE’s.

It is usually possible to assume that MLE emerges as a solution of the likelihood equations. Namely,

∂θi log p(x, θ) = 0, i = 1, · · · , k.

Symbolically, the equations we have to solve may be written

Dθ`(x, θ) = 0,

where `(x, θ) = log p(x, θ) and Dθ is the vec- tor differential operator whose ith component is


A commonly used numerical method is the Newton- Raphson iteration method.

• Solve the likelihood equation L(1)(θ, x) = 0 iteratively.


• Replace L(1)(θ, x) by the linear terms of its Taylor expansion about a starting value ˆθ(0).

• Replace the likelihood equation with the equa- tion

L(1)(ˆθ(0), x) + L(2)(ˆθ(0), x)(θ − ˆθ(0)) = 0.

The solution for θ is

θˆ(1) = ˆθ(0) − [L(2)(ˆθ(0), x)]−1L(1)(ˆθ(0), x), as a first approximation to the solution of the likelihood equation.

• Iterative the above procedure by replacing θˆ(0) by ˆθ(1) and so on.

• In general,

θˆ(t+1) = ˆθ(t)−∂L(θ)







. – The laborious aspect of this iterative pro-

cedure is the inversion of the matrix ∂2L(θ(t))/∂θ∂θT at the tth stage.

– If our initial approximation ˆθ(0) is good,

then ∂2L(θ(0))/∂θ∂θT will be near ∂2L(θ(t))/∂θ∂θT in non-pathological conditions.


We can often use the former matrix at each stage of the procedure.

– It often happens that terms awkward to calculate appear in ∂2L(θ(t))/∂θ∂θT but not in its expected value.

Sometimes, we replace ∂2L(θ)/∂θ∂θT by its expected value E[∂2L(θ)/∂θ∂θT], where the expectation is taken under Pθ.

This method is known as the Fisher-scoring method.

In most instances, E[∂2L(θ)/∂θ∂θT] is simply the negative information matrix discussed in the second topic.

• Issues on implementation:

– Specification of the starting point: To en- sure a sequence ˆθ(t) which converges to θ, it requires that ˆˆ θ(t) is sufficiently close to the root ˆθ.

– Take any estimator which satisfies √

n(ˆθ(0)− θ) is bounded in probability.

– Specification of the stopping rule:

• Example 11. Probit Analysis


– Suppose the probability π(s) that an in- dividual responds to the level s of a stim- ulus can be expressed in the form

π(s) = Φ

s − µ σ

= 1


Z (x−µ)/σ

−∞ e−z2/2dz.

– The level si of the stimulus is applied to ni individuals (i = 1, . . . , r) and the numbers mi (i = 1, . . . , r) of responses at the different levels are observed.

– Determine MLE of µ and σ.

– `(x, (µ, σ)) = constant+Pi{mi log π(si)+

(ni − mi) log(1 − π(si))} and the likeli- hood equations are



mi − niπi πi(1 − πi)


∂µ = 0,



mi − niπi πi(1 − πi)


∂σ = 0.

– Obtain initial approximations µ0 and σ0 to their solution.

– Suppose the π(si)s are known.

The plot of the points (si, Φ−1(π(si)))


would lie on the straight line Φ−1(π) = s − µ

σ .

Since mi/ni is an estimate of π(si), we can fit a straight line to this set of points to yield estimates of µ and σ.

– The Hessian matrix is a rather compli- cated expression.

If we use Fisher-scoring method, it is given by

Pi −ni πi(1−πi)




Pi −ni πi(1−πi)





Pi −ni πi(1−πi)





Pi −ni πi(1−πi)






Method of Moments

• It is the oldest method of deriving point es- timators.

Proposed by Karl Pearson (1894).

• Consider a parametric problem where X1, . . . , Xn are i.i.d. random variables from Pθ, θ ∈

Θ ⊂ Rk.

Suppose that m1(θ), . . . , mk(θ) are the first k moments of the population we are sam- pling from.

mj(θ) = Eθ(X1j).

• Define the jth sample moment ˆmj by ˆ

mj = 1 n



i=1 Xij = EFn(Xj).

• Suppose we want to estimate q(θ) which can be expressed as

q(θ) = g(m1(θ), . . . , mk(θ)), where g is a continuous function.

• The method of moments estimate of q(θ) is T (X) = g( ˆm1, . . . ,mˆ k).


• Basic ideas:

– Law of Large Numbers:


mj −→ mP j(θ) – Continuity:

Example 12. Consider the estimation of µ and σ2 if X1, . . . , Xn are a random sample from a population with mean µ and variance σ2.

Example 13.

• Normal Mixtures. Consider an industrial setting with a production process in con- trol, so that the outcome follows a known distribution, which we shall take to be the standard normal distribution.

However, it is suspected that the production process has become contaminated, with the contaminating portion following some other unknown normal distribution N (η, τ2).

A sample x1, . . . , xn of the output is drawn.

The X0s are therefore assumed to be i.i.d. ac- cording to the distribution

pN (0, 1) + (1 − p)N (η, τ2).


Apply the method of moments, we have m1(p, η, τ ) = E(Xi) = (1 − p)η,

m2(p, η, τ ) = E(Xi2) = p + (1 − p)(η2 + τ2), m3(p, η, τ ) = E(Xi3) = (1 − p)η(η2 + 3τ2) and thus obtain the estimating equations

(1 − p)η = ¯x, p + (1 − p)(η2 + τ2) = n−1 X

i x2i, (1 − p)η(η2 + 3τ2) = n−1 X

i x3i.

– Do you know how to express (p, η, τ ) as functions of m1, m2 and m3?

– In general, how can we know whether the above task is possible?

– Implicit Function Theorem in advanced calculus.

• For the above example, suppose τ = 1. The resulting estimators of η and 1 − p are


η = n−1 Pi Xi2 − 1

X¯ , and 1− ˆp =


n−1 Pi Xi2 − 1.


The Frequency Substitution Principle

• Suppose we observe n multinomial trials in which the values v1, . . . , vk of the popula- tion being sampled are known, but their re- spective probabilities p1, . . . , pk are completely unknown.

• Let Ni denote the number of indices j such that Xj = vi. Then (N1, . . . , Nk) has a

multinomial distribution with parameter (n, p1, . . . , pk).

Here Pi Ni = n and n is any natural number while (p1, . . . , pk) is any vector in

{(p1, . . . , pk) : pi ≥ 0, X

i pi = 1}.

• If (N1, . . . , Nk) has a M(n, p1, . . . , pk), p(n1, . . . , nk) = n!

n1! · · · nk!pn11 · · · pnkk, E(Ni) = npi, V ar(Ni) = npi(1 − pi), and Cov(Ni, Nj) = −npipj for i 6= j.

• The intuitive estimate of pi is Ni/n, the pro- portion of sample values equal to vi.

• Suppose we want to estimate a continuous function q(p1, . . . , pk).


The frequency substitution principle will give the estimate by replacing the unknown pop- ulation frequencies p1, . . . , pk by the observ- able sample frequencies N1/n, . . . , Nk/n. That is

T (N1, . . . , Nk) = q(N1/n, . . . , Nk/n).

• Basic ideas:

– Law of Large Numbers:

Nj/n −→ pP j – Continuity:

A function f is said to be continuous at x0 if f (x0+) and f (x0−) exist and if

f (x0+) = f (x0−) = f (x0).

Refer to any advanced calculus book for details.

Example 14. Estimation in 2 × 2 tables

• Consider n independent trials, the outcome of each classified according to two criteria, as A or ¯A, and as B or ¯B.

For example, a series of operations is being


classified according to the gender of the pa- tient and the success or failure of the treat- ment.

• The results can be displayed in a 2 × 2 table as show below.

B B¯

A nAB nA ¯B nA A n¯ AB¯ nA ¯¯B nA¯

nB nB¯ n

where nAB is the number of cases having both attributes A and B, and so on.

• The joint distribution of the four cell en- tries is then multinomial with parameters (n, pAB, pAB¯ , pA ¯B, pA ¯¯B).

• A standard measure of the degree of associ- ation of the attributes A and B is the cross- product ratio (also called odds ratio)

ρ = pABpA ¯¯B

pAB¯ pA ¯B.

– Use the fact that pAB = pApB|A where pA and pB|A denote the probability of A and the conditional probability of B given A,


respectively. It leads to ρ = pB|ApB|A¯

pB| ¯ApB| ¯¯ A


– Think of A as the maternal age is no more than 20, ¯A as the maternal age is greater than 20, B as the birthweight is no more than 2, 500gms, and ¯B as the birthweight is greater than 2, 500gms. The odds ratio can be used to associate the risk of underweight baby to the mater- nal age.

– The attributes A and B are said to be positively associated if

pB|A > pB| ¯A and pB| ¯¯ A > pB|A¯ , and these conditions imply that ρ > 1.

– In the case of negative dependence, the above inequalities are reversed.

– Independence of A and B is character- ized by equality instead of inequality and hence by ρ = 1.

• The odds ratio ρ is estimated by replacing the cell probabilities pAB, . . . by the corre- sponding frequencies nAB/n, . . ..


The Method of Least Squares

• It became widely used early in the nine- teenth century by Gauss for estimation in problems of astronomical measurement.

• Suppose that water is being pumped through a container to which an amount of dye has been added.

Every few seconds the concentration of dye is measured in the water leaving the con- tainer.

It is expected that the concentration of dye will decrease linearly over time.

Since the measuring equipment maynot be perfectly accurate, it may not be possible to interpret the measurements exactly, and the mixing may not behave exactly as predicted.

The determine the rate at which the concen- tration decreases, the experimenter would have to approximate the data by a straight line, a line that best approximated the data in some sense. A common approach is to employ the method of least squares.

• The model above is called a linear model


because it is a linear combination of the model functions 1 and x.

x refers to the concentration of dye.

The model can be written as Yx = θ1 + θ2x + x,

where Yx is often called the response ob- served at x, (θ1, θ2) is a 2-vector of unknown parameters, x is an explanatory variable (or covariate), and x is random error.

Our data is (x, Yx) and x cannot be ob- served.

x can be random or nonrandom.

• Nonlinear models are also used. A common example is an exponential model such as

Yt = θ1exp(θ2t) + t.

Here the model is a nonlinear function of the parameter β.

• In either case, we can write the observations (xi, yi)0s in the form,

Yi = gi1, . . . , θk) + i, 1 ≤ i ≤ n.



– The gi are known functions and the real numbers θ1, . . . , θk are unknown param- eters of interest.

– The parameters (θ1, . . . , θk) can vary freely over a set Ω contained in Rk.

– The i satisfy the following restriction:

E(i) = 0, 1 ≤ i ≤ n, V ar(i) = σ2, 1 ≤ i ≤ n, Cov(i, j) = 0, 1 ≤ i < j ≤ n.

– E(Yi) = gi1, . . . , θk) with unknown θ1, . . . , θk. Example 15.

• Suppose that we want to find out how in- creasing the amount x of a certain chemical or fertilizer in the soil increases the amount y of that chemical in the plants grown in that soil.

– Nine samples of soil were treated with different amounts x of phosphorus.

– Y is the amount of phosphorus found in corn plants grown for 38 days in the dif- ferent samples of soil.


– (xi, yi) are (1, 64), (4, 71), (5, 54), (9, 81), (11, 76), (13, 93), (23, 77), (23, 95), (28, 109).

• Assume the relationship between x and y can be approximated well by a random model yi = θ1 + θ2xi + i.

• A least squares estimator of (θ1, θ2) is de- fined to be the minimizer of

Q(θ1, θ2) = X9

i=1(yi − θ1 − θ2xi)2.

We then run into an optimization problem.

Note that

– Q(θ1, θ2) is a quadratic function of (θ1, θ2).

– There is no restriction on the ranger of (θ1, θ2). (i.e., (θ1, θ2) ∈ R2 which falls in an open set.)

It follows from vector calculus that the least squares estimate ( ˆθ1, ˆθ2) must sat- isfy the equations

∂θjQ(θ1, θ2) = 0, j = 1, 2.

If the constraint is imposed, we may need to use the method of Lagrange multiplier to find the minimizer.


– Differentiation leads to the following nor- mal equations


i (yi − θ1 − θ2xi) = 0


i xi(yi − θ1 − θ2xi) = 0.

The sample regression line is 61.58+1.42x.

• If some of {1, · · · , n} have more chance of being small than others it might seem more sensible to estimate θ1 and θ2 by minimizing some weighted sum of squares



i=1wi(yi − θ1 − θ2xi)2,

the ws being weights which are larger for those is for which i is liable to be small and small for i liable to be large.

Optimization and Least Squares

• The word optimization denotes either the minimization or maximization of a function.

• Consider a real-valued function h with do- main D in Rk. The function h is said to have a local maximum at point θ ∈ D if there exists a real number δ > 0 such


that h(θ) ≤ h(θ) for all θ ∈ D satisfying kθ − θk ≤ δ.

Define a local minimum in a similar way, but in the sense that inequality h(θ) ≤ h(θ) is reversed.

If the inequality h(θ) ≤ h(θ) is replaced by a strict inequality

h(θ) < h(θ), θ ∈ D, θ 6= θ,

we have a strict local maximum; and if the sense of the inequality h(θ) < h(θ) is re- versed, we have a strict local minimum.

• We say that the function h has a global (ab- solute) maximum (strict global maximum) at θ if h(θ) ≤ h(θ), [h(θ) < h(θ)] holds for every θ ∈ D.

Thus a function may have many local max- ima, each with a different value of h(θ), say, h(θ0j), j = 1, . . . , `.

The global maximum can always be chosen from among these local maxima by compar- ing their values and choosing one such that

h(θ) ≥ h(θj0), j = 1, . . . , `,


where θ ∈ {θj0, j = 1, . . . , `}.

It is clear that every global maximum (min- imum) is also a local maximum (minimum);

however, the converse of this statement is, in general, not true.

Only when h(θ) is a convex function in Rk and D ⊂ Rk is a convex set is every local extremum of h at θ ∈ D also a global ex- tremum of h over D.

• Minimization of a one-dimensional function h(θ), without any restrictions on θ, by New- ton’s method:

– Assume that h(θ) has at least two contin- uous derivatives and that it is bounded below.

– Approximate h(θ) by a quadratic func- tion that we can minimize, and use the minimizer of the simpler function as the new estimate of the minimizer of h(θ).

The process is then repeated from this new point.

– To form a quadratic approximation, let θ(t) be the current estimate of the solu-


tion θ, and consider a Taylor series ex- pansion of h about the point θ(t):

h(θ(t)+s) = h(θ(t))+sh0(t))+1

2s2h00(t))+· · · . The original minimization problem can

be approximated using a Taylor series ex- pansion

h(θ) = min

θ h(θ) = mins h(θ(t) + s)

= mins

h(θ(t)) + sh0(t)) + 1

2s2h00(t)) + · · ·

≈ mins

h(θ(t)) + sh0(t)) + 1


. – To minimize the quadratic, take the deriva-

tive with respect to s and set it equal to zero giving

s = −h0(t)) h00(t)).

Since s is an approximation to the step that would take us from θ(t) to the solu- tion θ of the original problem, and the algorithm is defined by the formula

θ(t+1) = θ(t) − h0(t)) h00(t)).


• Optimization in many dimensions with lin- ear regression

– Consider Example 15 in which Q(θ1, θ2) can be written as

1, θ2)

9 Pi xi

Pi xi Pi x2i

θ1 θ2

−2(θ1, θ2)

Pi yi

Pi xiyi


i yi2. – How do we differentiate a quadratic form


Here A is a k × k square matrix and symmetric.


∂θθTAθ = 2Aθ.

– How do we differentiate θTb?

Here b is a k × 1 column vector.


∂θθTb = b.

– Matrix formulation of the linear model:

y = X θ + .

Here y = (y1, . . . , yn)T, X = (xij)n×k, and  = (1, . . . , n)T. Observe that

(y−X θ)T(y−X θ) = θTXTX θ−2θTXTy+yTy.


Differentiation leads to the normal equa- tions

2XTX θ − 2XTy = 0,

Any solution of the above is an LSE of θ. If X is of full rank, in which case (XTX )−1 exists, then there is a unique LSE which is

θ = (Xˆ TX )−1XTy.

– In R, the function solve inverts matri- ces and solve systems of linear equations;

solve(A) inverts A and solve(A, b) solves A% ∗ %x = b.

If the system is over-determined, the least- squares fit is found, but matrices of less than full rank give an error.

– Consider the simple linear regression. It turns out that


n Pi xi

Pi xi Pi x2i

The matrix is invertible if and only if some xi’s are different.

• Optimization in Many Dimensions: New- ton’s Method


– Newton’s method (also called the Newton- Raphson method) is a widely used and often-studied method for minimization.

– The method requires use of both the gra- dient vector and the Hessian matrix in computations; hence it places more bur- den on the user to supply derivatives of the objective function than does the steep- est descent method learned in calculus (the gradient vector defines the direction of maximum local increase).

– Write the Taylor series in matrix/vector form. In two dimensions, the second- order Taylor series approximation is

h(θ1 + s1, θ2 + s2) ≈ h(θ1, θ2) + s1D(1,0)h(θ1, θ2) + s2D(0,1)h(θ1, θ2) + 1



s21D(2,0)h(θ1, θ2) + 2s1s2D(1,1)h(θ1, θ2) + s22D(0,2)h(θ1, θ2)


. – Let 52h be the constant matrix of sec-

ond partial derivatives of h at θj-the so- called Hessian matrix:

52hij = ∂2h(θ)

∂θi∂θj .


– If the notations for the gradient and Hes- sian matrix are used, we can write down Taylor series in many dimensions which takes the form

Q(θ+p) = h(θ)+pT5h(θ)+1


– When θ(t) is close to θ we can expect that the above quadratic function will approximate h(θ).

To obtain the step p, we now minimize this quadratic as a function of p by form- ing its gradient with respect to p

5pQ(p) = 5p

pT 5 h(θ) + 1

2pT 52 h(θ)p

= 5h(θ) + 52h(θ)p and setting it equal to zero

52h(θ)p = − 5 h(θ).

This is a set of n linear equations in the k unknowns p = (p1, . . . , pk)T.

These linear equations are called the New- ton equations.

If 52h(θ) is positive definite, this sug-


gests the general iterative scheme

θ(t+1) = θ(t)+p = θ(t)−[52h(θ(t))]−15h(θ).

– When h(θ) is closely approximated by Q(p) in the neighborhood of θ, conver- gence will normally be at a quadratic rate if the Hessian is positive definite at each step.

– One problem with Newton’s method is that the Hessian may not be positive def- inite at each iteration.

Thus the method requires modification to insure that the resultant method is acceptable but still retains the desirable characteristics of Newton’s method.

Recall the nonlinear regression. If a least- squares approach were used, the following opti- mization problem would be obtained



i=1[Yi − θ0 exp(θ1Ti)]2.

This is called a nonlinear least-squares prob- lem. No analytic solution can be found. More details will be given when we discuss MLE later on.



• Suppose we have a random vector (or vari- able) X with EXTX < ∞ and a random variable Y .

One may wish to predict the value of Y based on an observed value of X.

Let g(X) be the predictor with E[g(X)]2 <


• As a motivated example, a stock holder wants to predict the value of his holdings at some time in the future (Y ) on the basis of his past experience with the market and his port- folio (X).

• Suppose we use a linear function of X (in- stead of nonlinear function) to predict of Y . What is the best linear predictor under mean squared error?

– Suppose that E(X2) and E(Y 2) are fi- nite and X and Y are not constant. Then the unique best zero intercept linear pre- dictor is obtained by taking

a = a0 = E(XY )/E(X2),




Related subjects :