### Methods for Statistical Prediction

### Financial Time Series I

### Topic 2: Methods of Estimation

### Hung Chen

### Department of Mathematics National Taiwan University

### 9/28/2002

OUTLINE

1. Motivating Examples
2. Statistical Models
3. Maximum Likelihood Estimates
4. Substitution Principles
   – Method of Moments
   – Frequency Substitution
5. Method of Least Squares

Motivating Examples

Example 1. Censored exponentially distributed survival times

• Suppose W is a nonnegative random variable having an exponential distribution with mean θ > 0. Its pdf is given by

f_{W}(w; θ) = θ^{−1} exp(−w/θ) I_{(0,∞)}(w),

where the indicator function I_{(0,∞)}(w) = 1 for w > 0 and is zero elsewhere. The distribution function is given by

F_{W}(w; θ) = {1 − exp(−w/θ)} I_{(0,∞)}(w).

• In survival or reliability analyses, a study to observe random variables W_{1}, . . . , W_{n} will generally be terminated in practice before all of these random variables are able to be observed.

– Let y = (y_{1}^{T}, . . . , y_{n}^{T})^{T} denote the observed data, where y_{j} = (c_{j}, δ_{j})^{T} and δ_{j} = 0 or 1 according as the observation W_{j} is censored or uncensored at c_{j} (j = 1, . . . , n).

– If the observation W_{j} is uncensored, its realized value w_{j} is equal to c_{j}.

– If the observation W_{j} is censored at c_{j}, then w_{j} is some value greater than c_{j} (j = 1, . . . , n).

– In a medical study, it is commonly assumed that censoring is caused by a competing risk.

Under this assumption, (W_{1}, R_{1}), . . . , (W_{n}, R_{n}) are iid, C_{i} = min(W_{i}, R_{i}), and δ_{i} = 1_{{W_{i} ≤ R_{i}}}.

Here R is a nonnegative random variable, assumed independent of W.

• Approach I: Model C directly.

– Derive the density function of C. Note that

P (C ≤ y) = P (min(W, R) ≤ y) = 1 − P (W > y, R > y) = 1 − e^{−y/θ}[1 − F_{R}(y)]

and hence

f_{C}(c) = θ^{−1}e^{−c/θ}[1 − F_{R}(c)] + e^{−c/θ}f_{R}(c).

– Assuming R is exponentially distributed with mean λ, we have

f_{C}(c) = ((θ + λ)/(θλ)) exp(−((θ + λ)/(θλ)) c).

Hence, C is again exponentially distributed with mean θλ/(θ + λ).

– How do we estimate θ?

We should use the information contained in δ_{j}. Note that δ is a Bernoulli random variable with probability of success

P (W ≤ R) = ∫_{0}^{∞} ∫_{0}^{r} θ^{−1}e^{−y/θ}f_{R}(r) dy dr = ∫_{0}^{∞} f_{R}(r)[1 − exp(−r/θ)] dr = 1 − ∫_{0}^{∞} e^{−r/θ}f_{R}(r) dr.

When R is exponentially distributed with mean λ, we have

P (W ≤ R) = λ/(λ + θ).

By the law of large numbers, we consider using n^{−1} Σ_{i} δ_{i} to estimate P (W ≤ R).

• Approach II: Method of Maximum Likelihood

– We have iid observations (C_{1}, δ_{1}), . . . , (C_{n}, δ_{n}) and need to find the probability density function of (C, δ). Observe that

P (C ≤ c, δ = 1) = P (W ≤ R, W ≤ c)
= ∫_{0}^{c} ∫_{0}^{r} f_{W}(w)f_{R}(r) dw dr + ∫_{c}^{∞} ∫_{0}^{c} f_{W}(w)f_{R}(r) dw dr
= ∫_{0}^{c} F_{W}(r)f_{R}(r) dr + ∫_{c}^{∞} F_{W}(c)f_{R}(r) dr
= ∫_{0}^{c} F_{W}(r)f_{R}(r) dr + F_{W}(c)[1 − F_{R}(c)].

Differentiating with respect to c gives

f (C = c, δ = 1) = f_{W}(c)[1 − F_{R}(c)].

By the same argument, we have

f (C = c, δ = 0) = [1 − F_{W}(c)]f_{R}(c).

– The likelihood function is

∏_{i} (f_{W}(w_{i})[1 − F_{R}(w_{i})])^{δ_{i}} (f_{R}(w_{i})[1 − F_{W}(w_{i})])^{1−δ_{i}}
= ∏_{i} (f_{W}(w_{i}))^{δ_{i}} [1 − F_{W}(w_{i})]^{1−δ_{i}} · ∏_{i} (f_{R}(w_{i}))^{1−δ_{i}} [1 − F_{R}(w_{i})]^{δ_{i}}.

– For simplicity, we relabel the observations such that W_{1}, . . . , W_{r} denote the r uncensored observations and W_{r+1}, . . . , W_{n} the n − r censored observations.

The likelihood function for θ formed on the basis of y (dropping the factors involving f_{R} and F_{R}, which do not depend on θ) is given by

∏_{i=1}^{r} [θ^{−1} exp(−c_{i}/θ)] ∏_{i=r+1}^{n} {1 − [1 − exp(−c_{i}/θ)]} = θ^{−r} exp(−Σ_{i=1}^{n} c_{i}/θ).

– In this case, the MLE of θ can be derived explicitly from the standard differentiation technique:

ˆθ = Σ_{i=1}^{n} c_{i}/r.

– Rewrite ˆθ as

(n^{−1} Σ_{i=1}^{n} c_{i})/(r/n).

It can be shown that ˆθ will converge to θ in probability.
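
The following R sketch simulates censored data under hypothetical values θ = 2 (mean of W) and λ = 3 (mean of the censoring time R) and computes ˆθ = Σ_{i} c_{i}/r; it is only an illustration of the formula above.

```r
set.seed(1)
n      <- 1000
theta  <- 2    # hypothetical mean of the survival time W
lambda <- 3    # hypothetical mean of the censoring time R
w <- rexp(n, rate = 1/theta)     # survival times
r <- rexp(n, rate = 1/lambda)    # censoring times
c.obs <- pmin(w, r)              # observed times C_i = min(W_i, R_i)
delta <- as.numeric(w <= r)      # 1 = uncensored, 0 = censored

# MLE: total observed time divided by the number of uncensored observations
theta.hat <- sum(c.obs) / sum(delta)
theta.hat                        # should be close to theta = 2
```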

Remarks:

• The exponential distribution is often used to model lifetimes or waiting times.

• Suppose that we consider modeling the lifetime of an electronic component, T, as an exponential random variable with parameter θ. The implication is as follows:

P (T > t + s | T > s) = P (T > t + s and T > s)/P (T > s) = P (T > t + s)/P (T > s) = e^{−(t+s)/θ}/e^{−s/θ} = exp(−t/θ).

This is the so-called memoryless property of the exponential distribution.

• Does it make sense to use the exponential distribution to model human lifetimes?

Compare the probability that a 16-year-old will live at least 10 more years with the probability that an 80-year-old will live at least 10 more years.
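
Under the exponential model both conditional probabilities equal exp(−10/θ), regardless of the current age; a quick numerical check (the mean lifetime θ = 75 years is a hypothetical value chosen only for the illustration):

```r
theta <- 75   # hypothetical mean lifetime in years
surv <- function(t) pexp(t, rate = 1/theta, lower.tail = FALSE)  # P(T > t)
surv(26) / surv(16)   # 16-year-old surviving 10 more years
surv(90) / surv(80)   # 80-year-old surviving 10 more years
exp(-10/theta)        # both ratios equal this value
```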

• Hazard function h(t): It is defined as

h(t) = f (t)/(1 − F (t)) (= −(d/dt) log S(t)),

where S(t) = P (T > t) = 1 − F (t). It can be thought of as the instantaneous death rate for individuals who are alive at time t.

If an individual is alive at time t, the probability that that individual will die in the time interval (t, t + δ) is, assuming that the density function is continuous at t,

P (t ≤ T ≤ t + δ | T ≥ t) = P (t ≤ T ≤ t + δ)/P (T ≥ t) = [F (t + δ) − F (t)]/[1 − F (t)] ≈ δf (t)/[1 − F (t)].

• For an exponential random variable T with mean θ, its hazard function is 1/θ (a constant function).

As a remark, the expectation of an exponential random variable is θ.

Do you think that the connection between the expectation and the hazard function is a coincidence? Is there any intuitive explanation?

• Usually, the hazard function of human lifetimes is assumed to be of bathtub shape.

How would you model the density function of human lifetimes?

Example 2. Model heterogeneous data by finite-mixture models

• In the problem considered by Do and McLachlan (1984), the population of interest consists of rats from g species G_{1}, . . . , G_{g} that are consumed by owls in some unknown proportions π_{1}, . . . , π_{g}.

• The problem is to estimate the π_{i} on the basis of the observation vector W containing measurements recorded on a sample of size n of rat skulls taken from owl pellets.

The rats constitute part of an owl’s diet, and indigestible material is regurgitated as a pellet.

• Using a conditioning argument, the underlying population can be modeled as consisting of g distinct groups G_{1}, . . . , G_{g} in some unknown proportions π_{1}, . . . , π_{g}, where the conditional pdf of W given membership of the ith group G_{i} is f_{i}(w).

• Let y = (w_{1}^{T}, . . . , w_{n}^{T})^{T} denote the observed random sample obtained from the mixture density

f (w; (π_{1}, . . . , π_{g−1})) = Σ_{i=1}^{g} π_{i} f_{i}(w).

• The log likelihood function for (π_{1}, . . . , π_{g−1}) formed from the observed data y is given by

Σ_{i=1}^{n} log ( Σ_{j=1}^{g} π_{j} f_{j}(w_{i}) ).

• On differentiating the log likelihood function with respect to π_{j} (j = 1, . . . , g − 1), and noting that π_{g} = 1 − π_{1} − · · · − π_{g−1}, we obtain

Σ_{i=1}^{n} [ f_{j}(w_{i})/f (w_{i}; (π_{1}, . . . , π_{g−1})) − f_{g}(w_{i})/f (w_{i}; (π_{1}, . . . , π_{g−1})) ] = 0, for j = 1, . . . , g − 1.

This clearly does not yield an explicit solution for (π_{1}, . . . , π_{g−1})^{T}, so the equations have to be solved numerically.
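
As an illustration of the lack of a closed form, the following R sketch maximizes the log likelihood numerically for a two-component mixture whose component densities are assumed known (hypothetically N(0, 1) and N(3, 1)); only the mixing proportion is estimated.

```r
set.seed(2)
n <- 500
pi.true <- 0.3
# hypothetical known component densities: N(0,1) and N(3,1)
w <- ifelse(runif(n) < pi.true, rnorm(n, 0, 1), rnorm(n, 3, 1))

loglik <- function(p) sum(log(p * dnorm(w, 0, 1) + (1 - p) * dnorm(w, 3, 1)))

# one-dimensional numerical maximization of the log likelihood over p in (0, 1)
fit <- optimize(loglik, interval = c(0, 1), maximum = TRUE)
fit$maximum   # numerical MLE of the mixing proportion
```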

Statistical models

• Most studies and experiments, scientific or industrial, large scale or small, produce data whose analysis is the ultimate object of the endeavor.

– Compare the efficiency of two ways of doing something under similar conditions, such as: brewing coffee; reducing pollution; treating a disease; producing energy; learning a maze; and so on.

– Abstraction: It can be thought of as a problem of comparing the efficacy of two methods applied to the members of a certain population.

– Run m + n independent experiments as follows: m + n members of the population are picked at random, m of these are assigned to the first method, and the remaining n are assigned to the second method.

– In comparing two drugs A and B, we would administer drug A to m and drug B to n randomly selected patients and then measure temperature, blood pressure, have the patient rated quantitatively for improvement by physicians, and so on.

– Random variability would come primarily from differing responses among patients to the same drug, but also from error in the measurements and variation in the purity of the drugs.

– one-sample location model for measurements:

Let X_{1}, . . . , X_{n} be the n determinations of µ. Write

X_{i} = µ + ε_{i}, 1 ≤ i ≤ n,

where ε = (ε_{1}, . . . , ε_{n}) is the vector of errors.

– two-sample problem:

Let X_{1}, . . . , X_{n} be the n samples from the population with distribution F and Y_{1}, . . . , Y_{m} be the m samples from the population with distribution G.

• Many statistical procedures are based on statistical models which specify under which conditions the data are generated.

• Usually the assumption is made that the set of observations x_{1}, . . . , x_{n} is a set of (i) independent random variables (ii) identically distributed with common pdf f (x_{i}, θ).

• Once this model is specified, the statistician tries to find optimal solutions to his problem (usually related to inference on a set of parameters θ ∈ Θ ⊂ R^{k} characterizing the uncertainty about the model).

Does this statement fit the two-sample problem just mentioned?

• Any statistical inference starts from a basic family of probability measures, expressing our prior knowledge about the nature of the probability measures from which the observations originate.

A model P is a collection of probability measures P on (X , A), where X is the sample space with a σ-field of subsets A.

• If P = {Pθ : θ ∈ Θ}, Θ ⊂ R^{k} for some k, then P is a parametric model.

• Example 3. Bernoulli trials

– Consider a new model of automobile which is being produced in large numbers.

– Choose one at random from the production line and observe whether or not it suffers a mechanical breakdown within two years.

– In each trial, there are only two possible observations. The sample space consists of two elements: 1 (representing breakdown) and 0 (representing no breakdown).

– The inherent variability in the situation is described by a probability distribution which in this case is defined by a single number θ, the probability of breakdown.

– The possible probability distribution on the sample space can be described by a Bernoulli trial with an unknown parameter θ between 0 and 1.

• Example 4. The parameter is a function.

– Suppose we have a large batch of seeds stored under constant conditions of temperature and humidity.

– In the course of time seeds die.

Suppose that at time t a proportion π(t) of the stored seeds are still alive.

– At each of times t_{1}, t_{2}, . . . , t_{s} we take a random sample of n seeds and observe how many are still alive.

– A typical observation consists of an ordered set (r_{1}, r_{2}, . . . , r_{s}) of integers, r_{i} being the number of seeds observed to be alive at time t_{i}.

– The appropriate distribution for describing the variable element in this situation is

p(r_{1}, r_{2}, . . . , r_{s}) = ∏_{i=1}^{s} C(n, r_{i})[π(t_{i})]^{r_{i}}[1 − π(t_{i})]^{n−r_{i}}.

Here π(·) is an unknown function. In this example, the parameter is a function.

– Isotonic regression problem: π(t) is necessarily a non-increasing function of t, taking values between 0 and 1.

Can we find a parametric model for π(t)?

Related Issues:

• Suppose that we have a fully specified parametric family of models. Denote the parameter of interest by θ.

• Suppose that we wish to calculate from the data a single value representing the “best estimate” that we can make of the unknown parameter. We call such a problem one of point estimation.

• Instead of point estimation, we can estimate the parameter by giving a confidence interval which is associated with the probability of covering the true value.

When we say that a 95% CI of θ is (0.3, 0.7), it does not mean that there is a 95% probability of θ ∈ (0.3, 0.7).

Such a claim does not make any sense since

– Although θ is unknown, it is still a fixed number.

– (0.3, 0.7) is a known fixed interval.

– θ is either in (0.3, 0.7) or not in that interval. It will not be sometimes in (0.3, 0.7) and sometimes not.

– The precise meaning of probability 0.95 will be discussed later on.

– The probability 0.95 refers to the probability that θ is in a random interval.

Here (0.3, 0.7) is one realization of that random interval.

• Distinction between data and random variables:

In statistics, we deal with data only.

Why do we need to introduce random variables?

Attitudes on Models:

• The statistician may be a “pessimist” who does not believe in any particular model f (x, θ).

In this case he must be satisfied with descriptive methods (like exploratory data analysis) without the possibility of inductive inference.

• The statistician may be an “optimist” who strongly believes in one model. In this case the analysis is straightforward and optimal solutions may often be easily obtained.

• The statistician may be a “realist”: he would like to specify a particular model f (x, θ) in order to get operational results, but he may have either some doubt about the validity of this hypothesis or some difficulty in choosing a particular parametric family.

Let us illustrate this kind of preoccupation with an example.

• Suppose that the parameter of interest is the “center” of some population.

• In many situations, the statistician may argue that, due to a central limit effect, the data are generated by a normal pdf.

• In this case the problem is restricted to the problem of inference on µ, the mean of the population.

• But in some cases, he may have some doubt about these central limit effects and may suspect some skewness and/or some kurtosis, or he may suspect that some observations are generated by other models (leading to the presence of outliers).

In this context three types of question may be raised to avoid gross errors in the prediction, or in the inference:

– Does the optimal solution, computed for the assumed model f (x, θ), still have “good” properties if the true model is a little different?

This question is concerned with the sensitivity of a given criterion to the hypotheses (criterion robustness).

Question: Validity of the one-sample t-test. Partial Answer: Central Limit Theorem.

– Are the optimal solutions computed for other models near to the original one really substantially different?

In this question, it is the sensitivity of the inference that is analyzed (inference robustness).

Maximum Likelihood Estimates

• The true distribution on the sample space can be labeled by a parameter θ taking values in a finite-dimensional Euclidean space.

• We further assume the family {Pθ : θ ∈ Θ} (Θ ⊂ R^{k}) possesses density functions {pθ : θ ∈ Θ} with respect to some natural measure on the sample space, such as counting measure if the sample space is discrete or Lebesgue measure when it is not.

– In the discrete case, pθ(x) is the probability of the point x when θ is the true parameter.

– In the continuous case, pθ(x) is the probability density at x when θ is the true parameter.

• x: the observed set of values obtained in an experiment.

• Consider p(x, θ) as a function of θ for fixed x.

p(x, θ) is called the likelihood function.

We also write it L(θ, x).

L(θ, x) gives the probability of observing x for each θ when X is discrete.

• Idea: Find the value ˆθ of the parameter which is most plausible after we have observed the data.

• A maximum likelihood estimate ˆθ is any element of Θ such that

p(x, ˆθ(x)) = max_{θ∈Θ} p(x, θ).

• This principle was first put forward as a novel and original method of deriving estimators by R.A. Fisher in 1922. It very soon proved to be a fertile approach to statistical inference in general, and was widely adopted; but the exact properties of the ensuing estimators and test procedures were only gradually discovered.

• How do we find the maximum of L(θ, x)?

– A systematic way we learn in calculus is to transform a maximization problem into a root-finding problem.

– The above strategy may not always work.

Refer to the uniform example.

– Computational feasibility:

Before 1960, one could rely only on the capacity of the human calculator, equipped with pencil and paper and with such aids as the slide rule, tables of logarithms, and other convenient tables.

The advent of the electronic computer removed the restriction of the human operator.

– The estimates defined by nonlinear equations can be established as a matter of routine by the appropriate iterative algorithms.

Examples:

• Example 5. Suppose θ = 0 or 1 (Θ = {0, 1}) and p(x, θ) is given by the following table.

p(x, θ)   x = 0   x = 2
θ = 0       0       1
θ = 1      0.1     0.9

Suppose that we observe two observations 2 and 2.

How do we get them?

Abstraction:

– X: a discrete random variable with pmf p(x, θ)

– 2 and 2 are realizations of X_{1} and X_{2}.

– What is the pmf of (X_{1}, X_{2})?

Then

L(0, (2, 2)) = 1,  L(1, (2, 2)) = (0.9)^{2} = 0.81,

and ˆθ((2, 2)) = 0.

• Example 6. If x_{1}, . . . , x_{n} are i.i.d. according to the Poisson distribution P(λ), the likelihood is

L(λ, x) = λ^{Σ_{i} x_{i}} e^{−nλ} / ∏_{i} x_{i}!.

This is maximized by

ˆλ = Σ_{i} x_{i}/n,

which is therefore the MLE of λ.

In this example, Θ = (0, ∞) and k = 1.

Use rpois(20, 3) to generate 20 observations from P(3). They are 2, 3, 3, 5, 6, 3, 0, 5, 3, 2, 2, 2, 2, 2, 4, 1, 3, 2, 7, 5.

Then ˆλ = 3.1.

Why do we need to introduce X?
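
The numerical illustration above can be reproduced directly in R; the listed draws are typed in rather than regenerated with rpois, so the result matches ˆλ = 3.1.

```r
x <- c(2, 3, 3, 5, 6, 3, 0, 5, 3, 2, 2, 2, 2, 2, 4, 1, 3, 2, 7, 5)
lambda.hat <- mean(x)   # MLE of the Poisson mean: sum(x)/n
lambda.hat              # 3.1
```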

• Example 7. Let X_{1}, . . . , X_{n} be i.i.d. according to the uniform distribution U (0, θ), so that the likelihood is

L(θ, x) = 1/θ^{n} if 0 ≤ x_{i} ≤ θ for all i, and 0 otherwise.

We can no longer differentiate L(θ, x) to get the MLE.

By direct maximization, the MLE is equal to x_{(n)}, the largest observation.

• Example 8. Consider n items whose times to failure X_{1}, . . . , X_{n} form a sample from an E(θ) distribution (i.e., p(x) = θ exp(−θx) for x > 0).

Suppose the items are inspected only at discrete times 1, 2, . . . , k, so that we really observe Y_{1}, . . . , Y_{n} where, for j = 1, . . . , n,

Y_{j} = ℓ if ℓ − 1 < X_{j} ≤ ℓ, ℓ = 1, . . . , k, and Y_{j} = k + 1 if X_{j} > k.

Suppose n = 20, k = 5, and θ = 3. The x_{i}s are 5.19, 0.06, 2.37, 4.38, 4.98, 13.02, 0.34, 7.26, 0.67, 1.96, 3.82, 0.27, 1.83, 3.48, 3.03, 1.90, 6.42, 7.49, 5.67, 6.27 and the y_{i}s are 6, 1, 3, 5, 5, 6, 1, 6, 1, 2, 4, 1, 2, 4, 4, 2, 6, 6, 6, 6.

Let N_{i} = number of indices j such that Y_{j} = i, i = 1, . . . , k + 1. Then the multinomial vector N = (N_{1}, . . . , N_{k+1}) is sufficient for θ and the likelihood function of N is

L(θ, n_{1}, . . . , n_{k+1}) = [n!/(n_{1}! · · · n_{k+1}!)] ∏_{j=1}^{k+1} p_{j}(θ)^{n_{j}},

where p_{j}(θ) = exp(−[j − 1]θ) − exp(−jθ) for 1 ≤ j ≤ k and p_{k+1}(θ) = exp(−kθ).

Question: How do we solve this problem?
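
One simple option is to maximize the grouped-data log likelihood numerically. A minimal R sketch using the y values listed above (the multinomial constant is dropped because it does not involve θ):

```r
y <- c(6, 1, 3, 5, 5, 6, 1, 6, 1, 2, 4, 1, 2, 4, 4, 2, 6, 6, 6, 6)
k <- 5
n.count <- tabulate(y, nbins = k + 1)   # N_1, ..., N_{k+1}

loglik <- function(theta) {
  p <- c(exp(-(0:(k - 1)) * theta) - exp(-(1:k) * theta),  # p_j, j = 1,...,k
         exp(-k * theta))                                   # p_{k+1}
  sum(n.count * log(p))
}

# one-dimensional numerical maximization over a plausible range for theta
optimize(loglik, interval = c(0.01, 5), maximum = TRUE)$maximum
```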

Limitations on MLE

• It is a constant theme of the history of the method that the use of ML techniques is not always accompanied by a clear appreciation of their limitations.

• Example 9. (Neyman-Scott (1948) problem)

In this example, the MLE is not even consistent.

Refer to J. Neyman and E.L. Scott, Consistent estimates based on partially consistent observations, Econometrica 16, 1–32 (1948).

Estimation of a Common Variance:

Let X_{αj} (j = 1, . . . , r) be independently distributed according to N (θ_{α}, σ^{2}), α = 1, . . . , n.

The MLEs are

ˆθ_{α} = X_{α·},   ˆσ^{2} = (1/(rn)) Σ_{α=1}^{n} Σ_{j=1}^{r} (X_{αj} − X_{α·})^{2}.

Furthermore, these are the unique solutions of the likelihood equations.

However, in the present case, the MLE of σ^{2} is not even consistent.

To see this, note that the statistics

S_{α}^{2} = Σ_{j=1}^{r} (X_{αj} − X_{α·})^{2}

are identically independently distributed with expectation

E(S_{α}^{2}) = (r − 1)σ^{2},

so that Σ_{α} S_{α}^{2}/n → (r − 1)σ^{2} and hence

ˆσ^{2} → ((r − 1)/r) σ^{2} in probability.
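
A small simulation illustrates the inconsistency; the values n = 2000, r = 2, θ_{α} = 0, and σ = 1 below are assumptions made only for the illustration, and ˆσ^{2} settles near (r − 1)σ^{2}/r = 0.5 rather than near σ^{2} = 1.

```r
set.seed(3)
n <- 2000; r <- 2; sigma <- 1
x <- matrix(rnorm(n * r, mean = 0, sd = sigma), nrow = n)  # row alpha holds X_{alpha,1},...,X_{alpha,r}
centered <- sweep(x, 1, rowMeans(x))          # subtract each row (group) mean
sigma2.hat <- sum(centered^2) / (r * n)       # MLE of sigma^2
sigma2.hat                                    # close to (r - 1)/r = 0.5, not 1
```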

• Example 10. (Non-existence of the MLE)

If Y_{1}, . . . , Y_{n} are i.i.d. according to the Poisson distribution P (λ), suppose for each i we observe only whether Y_{i} is 0 or positive, and let

X_{i} = 0 if Y_{i} = 0, and X_{i} = 1 if Y_{i} > 0.

Then

P (X_{i} = 0) = exp(−λ), P (X_{i} = 1) = 1 − exp(−λ),

and the likelihood is

L(λ) = [1 − exp(−λ)]^{Σ x_{i}} exp(−λ Σ (1 − x_{i})).

This is maximized by

ˆλ = − log(1 − x̄),

provided Σ (1 − x_{i}) > 0.

When all the x_{i}’s are equal to 1, the likelihood becomes

L(λ) = [1 − exp(−λ)]^{n},

which is an increasing function of λ. In this case, the likelihood does not attain its maximum for any finite λ and the MLE does not exist. (Does it make sense?)

Discussions:

– For any fixed n, the probability P (X_{1} = · · · = X_{n} = 1) = (1 − exp(−λ))^{n} tends to 1 as λ → ∞. Thus there will exist values of λ for which the probability is close to 1 that the MLE is undefined.

– For any fixed λ, the probability P (X_{1} = · · · = X_{n} = 1) = (1 − exp(−λ))^{n} tends to 0 as n → ∞.

Iterative Procedures

In applications MLE’s typically do not have analytic forms and some numerical methods have to be used to compute MLE’s.

It is usually possible to assume that the MLE emerges as a solution of the likelihood equations. Namely,

∂/∂θ_{i} log p(x, θ) = 0, i = 1, · · · , k.

Symbolically, the equations we have to solve may be written

Dθ ℓ(x, θ) = 0,

where ℓ(x, θ) = log p(x, θ) and Dθ is the vector differential operator whose ith component is ∂/∂θ_{i}.

A commonly used numerical method is the Newton-Raphson iteration method.

• Solve the likelihood equation L^{(1)}(θ, x) = 0 iteratively.

• Replace L^{(1)}(θ, x) by the linear terms of its Taylor expansion about a starting value ˆθ^{(0)}.

• Replace the likelihood equation with the equation

L^{(1)}(ˆθ^{(0)}, x) + L^{(2)}(ˆθ^{(0)}, x)(θ − ˆθ^{(0)}) = 0.

The solution for θ,

ˆθ^{(1)} = ˆθ^{(0)} − [L^{(2)}(ˆθ^{(0)}, x)]^{−1}L^{(1)}(ˆθ^{(0)}, x),

serves as a first approximation to the solution of the likelihood equation.

• Iterate the above procedure by replacing ˆθ^{(0)} by ˆθ^{(1)}, and so on.

• In general,

ˆθ^{(t+1)} = ˆθ^{(t)} − [∂^{2}L(θ)/∂θ∂θ^{T} |_{θ=ˆθ^{(t)}}]^{−1} [∂L(θ)/∂θ |_{θ=ˆθ^{(t)}}].

– The laborious aspect of this iterative procedure is the inversion of the matrix ∂^{2}L(ˆθ^{(t)})/∂θ∂θ^{T} at the tth stage.

– If our initial approximation ˆθ^{(0)} is good, then ∂^{2}L(ˆθ^{(0)})/∂θ∂θ^{T} will be near ∂^{2}L(ˆθ^{(t)})/∂θ∂θ^{T} in non-pathological conditions.

We can often use the former matrix at each stage of the procedure.

– It often happens that terms awkward to calculate appear in ∂^{2}L(ˆθ^{(t)})/∂θ∂θ^{T} but not in its expected value.

Sometimes, we replace ∂^{2}L(θ)/∂θ∂θ^{T} by its expected value E[∂^{2}L(θ)/∂θ∂θ^{T}], where the expectation is taken under Pθ.

This method is known as the Fisher-scoring method.

In most instances, E[∂^{2}L(θ)/∂θ∂θ^{T}] is simply the negative information matrix discussed in the second topic.

• Issues on implementation:

– Specification of the starting point: To ensure a sequence ˆθ^{(t)} which converges to ˆθ, it is required that ˆθ^{(0)} is sufficiently close to the root ˆθ.

– Take any estimator ˆθ^{(0)} which satisfies that √n(ˆθ^{(0)} − θ) is bounded in probability.

– Specification of the stopping rule: for example, stop when the change in ˆθ^{(t)} or in the log likelihood between successive iterations falls below a prescribed tolerance.
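
As a concrete illustration of the iteration in a case where the answer is known in closed form, the following R sketch applies Newton-Raphson to the Poisson log likelihood of Example 6, using a simple stopping rule; the starting value and tolerance are arbitrary choices.

```r
x <- c(2, 3, 3, 5, 6, 3, 0, 5, 3, 2, 2, 2, 2, 2, 4, 1, 3, 2, 7, 5)
n <- length(x); s <- sum(x)

score   <- function(lam) s / lam - n    # first derivative of the log likelihood
hessian <- function(lam) -s / lam^2     # second derivative

lam <- 1                                # arbitrary starting value
repeat {
  lam.new <- lam - score(lam) / hessian(lam)   # Newton-Raphson update
  if (abs(lam.new - lam) < 1e-8) break         # stopping rule
  lam <- lam.new
}
lam.new   # converges to mean(x) = 3.1
```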

• Example 11. Probit Analysis

– Suppose the probability π(s) that an individual responds to the level s of a stimulus can be expressed in the form

π(s) = Φ((s − µ)/σ) = (1/√(2π)) ∫_{−∞}^{(s−µ)/σ} e^{−z^{2}/2} dz.

– The level s_{i} of the stimulus is applied to n_{i} individuals (i = 1, . . . , r) and the numbers m_{i} (i = 1, . . . , r) of responses at the different levels are observed.

– Determine the MLE of µ and σ.

– ℓ(x, (µ, σ)) = constant + Σ_{i} {m_{i} log π(s_{i}) + (n_{i} − m_{i}) log(1 − π(s_{i}))}, and the likelihood equations are

Σ_{i} [(m_{i} − n_{i}π_{i})/(π_{i}(1 − π_{i}))] ∂π(s_{i})/∂µ = 0,

Σ_{i} [(m_{i} − n_{i}π_{i})/(π_{i}(1 − π_{i}))] ∂π(s_{i})/∂σ = 0,

where π_{i} = π(s_{i}).

– Obtain initial approximations µ_{0} and σ_{0} to the solution.

– If the π(s_{i})s were known, the plot of the points (s_{i}, Φ^{−1}(π(s_{i}))) would lie on the straight line

Φ^{−1}(π) = (s − µ)/σ.

Since m_{i}/n_{i} is an estimate of π(s_{i}), we can fit a straight line to this set of points to yield estimates of µ and σ.

– The Hessian matrix is a rather complicated expression.

If we use the Fisher-scoring method, it is replaced by the 2 × 2 matrix (written row by row)

[ Σ_{i} −n_{i}/(π_{i}(1−π_{i})) (∂π(s_{i})/∂µ)^{2} ,  Σ_{i} −n_{i}/(π_{i}(1−π_{i})) (∂π(s_{i})/∂µ)(∂π(s_{i})/∂σ) ;
  Σ_{i} −n_{i}/(π_{i}(1−π_{i})) (∂π(s_{i})/∂µ)(∂π(s_{i})/∂σ) ,  Σ_{i} −n_{i}/(π_{i}(1−π_{i})) (∂π(s_{i})/∂σ)^{2} ].
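
In modern R this model can be fitted directly as a generalized linear model with a probit link; the stimulus levels and response counts below are hypothetical, and µ and σ are recovered from the fitted intercept and slope since Φ(β_{0} + β_{1}s) = Φ((s − µ)/σ) gives µ = −β_{0}/β_{1} and σ = 1/β_{1}.

```r
# hypothetical dose-response data: r = 5 stimulus levels
s <- c(1, 2, 3, 4, 5)          # stimulus levels
n <- c(50, 50, 50, 50, 50)     # individuals tested per level
m <- c(5, 14, 26, 38, 46)      # number responding at each level

fit <- glm(cbind(m, n - m) ~ s, family = binomial(link = "probit"))
beta <- coef(fit)
mu.hat    <- -beta[1] / beta[2]
sigma.hat <- 1 / beta[2]
c(mu.hat, sigma.hat)
```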

Method of Moments

• It is the oldest method of deriving point estimators.

Proposed by Karl Pearson (1894).

• Consider a parametric problem where X_{1}, . . . , X_{n} are i.i.d. random variables from Pθ, θ ∈ Θ ⊂ R^{k}.

Suppose that m_{1}(θ), . . . , m_{k}(θ) are the first k moments of the population we are sampling from:

m_{j}(θ) = Eθ(X_{1}^{j}).

• Define the jth sample moment ˆm_{j} by

ˆm_{j} = (1/n) Σ_{i=1}^{n} X_{i}^{j} = E_{F_{n}}(X^{j}).

• Suppose we want to estimate q(θ) which can be expressed as

q(θ) = g(m_{1}(θ), . . . , m_{k}(θ)),

where g is a continuous function.

• The method of moments estimate of q(θ) is

T (X) = g( ˆm_{1}, . . . , ˆm_{k}).

• Basic ideas:

– Law of Large Numbers: ˆm_{j} →^{P} m_{j}(θ).

– Continuity of g.

Example 12. Consider the estimation of µ
and σ^{2} if X_{1}, . . . , X_{n} are a random sample from
a population with mean µ and variance σ^{2}.
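
A method-of-moments answer replaces the population moments by sample moments: ˆµ = ˆm_{1} and ˆσ^{2} = ˆm_{2} − ˆm_{1}^{2}. A minimal R check on simulated data (the values µ = 5 and σ = 2 are assumed only for illustration):

```r
set.seed(4)
x <- rnorm(200, mean = 5, sd = 2)
mu.hat     <- mean(x)                 # first sample moment
sigma2.hat <- mean(x^2) - mean(x)^2   # second sample moment minus square of the first
c(mu.hat, sigma2.hat)
```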

Example 13.

• Normal Mixtures. Consider an industrial setting with a production process in control, so that the outcome follows a known distribution, which we shall take to be the standard normal distribution.

However, it is suspected that the production process has become contaminated, with the contaminating portion following some other unknown normal distribution N (η, τ^{2}).

A sample x_{1}, . . . , x_{n} of the output is drawn.

The X_{i}'s are therefore assumed to be i.i.d. according to the distribution

pN (0, 1) + (1 − p)N (η, τ^{2}).

Applying the method of moments, we have

m_{1}(p, η, τ ) = E(X_{i}) = (1 − p)η,
m_{2}(p, η, τ ) = E(X_{i}^{2}) = p + (1 − p)(η^{2} + τ^{2}),
m_{3}(p, η, τ ) = E(X_{i}^{3}) = (1 − p)η(η^{2} + 3τ^{2}),

and thus obtain the estimating equations

(1 − p)η = x̄,
p + (1 − p)(η^{2} + τ^{2}) = n^{−1} Σ_{i} x_{i}^{2},
(1 − p)η(η^{2} + 3τ^{2}) = n^{−1} Σ_{i} x_{i}^{3}.

– Do you know how to express (p, η, τ ) as
functions of m_{1}, m_{2} and m_{3}?

– In general, how can we know whether the above task is possible?

– Implicit Function Theorem in advanced calculus.

• For the above example, suppose τ = 1. The resulting estimators of η and 1 − p are

ˆη = (n^{−1} Σ_{i} X_{i}^{2} − 1)/X̄,  and  1 − ˆp = X̄^{2}/(n^{−1} Σ_{i} X_{i}^{2} − 1).
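
A short R check of these two formulas on simulated data (p = 0.7, η = 4, and τ = 1 are assumed purely for illustration):

```r
set.seed(5)
n <- 5000
p <- 0.7; eta <- 4                       # assumed true values (tau = 1)
x <- ifelse(runif(n) < p, rnorm(n, 0, 1), rnorm(n, eta, 1))

eta.hat     <- (mean(x^2) - 1) / mean(x)
one.minus.p <- mean(x)^2 / (mean(x^2) - 1)
c(eta.hat, 1 - one.minus.p)              # compare with eta = 4 and p = 0.7
```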

The Frequency Substitution Principle

• Suppose we observe n multinomial trials in which the values v_{1}, . . . , v_{k} of the population being sampled are known, but their respective probabilities p_{1}, . . . , p_{k} are completely unknown.

• Let N_{i} denote the number of indices j such that X_{j} = v_{i}. Then (N_{1}, . . . , N_{k}) has a multinomial distribution with parameters (n, p_{1}, . . . , p_{k}).

Here Σ_{i} N_{i} = n, n is any natural number, and (p_{1}, . . . , p_{k}) is any vector in

{(p_{1}, . . . , p_{k}) : p_{i} ≥ 0, Σ_{i} p_{i} = 1}.

• If (N_{1}, . . . , N_{k}) has an M(n, p_{1}, . . . , p_{k}) distribution, then

p(n_{1}, . . . , n_{k}) = [n!/(n_{1}! · · · n_{k}!)] p_{1}^{n_{1}} · · · p_{k}^{n_{k}},

E(N_{i}) = np_{i}, V ar(N_{i}) = np_{i}(1 − p_{i}), and Cov(N_{i}, N_{j}) = −np_{i}p_{j} for i ≠ j.

• The intuitive estimate of p_{i} is N_{i}/n, the proportion of sample values equal to v_{i}.

• Suppose we want to estimate a continuous function q(p_{1}, . . . , p_{k}).

The frequency substitution principle gives the estimate by replacing the unknown population frequencies p_{1}, . . . , p_{k} by the observable sample frequencies N_{1}/n, . . . , N_{k}/n. That is,

T (N_{1}, . . . , N_{k}) = q(N_{1}/n, . . . , N_{k}/n).

• Basic ideas:

– Law of Large Numbers: N_{j}/n →^{P} p_{j}.

– Continuity:

A function f is said to be continuous at x_{0} if f (x_{0}+) and f (x_{0}−) exist and f (x_{0}+) = f (x_{0}−) = f (x_{0}).

Refer to any advanced calculus book for details.

Example 14. Estimation in 2 × 2 tables

• Consider n independent trials, the outcome of each classified according to two criteria, as A or Ā, and as B or B̄.

For example, a series of operations is being classified according to the gender of the patient and the success or failure of the treatment.

• The results can be displayed in a 2 × 2 table as shown below,

        B         B̄
A     n_{AB}    n_{AB̄}    n_{A}
Ā     n_{ĀB}    n_{ĀB̄}    n_{Ā}
      n_{B}     n_{B̄}     n

where n_{AB} is the number of cases having both attributes A and B, and so on.

• The joint distribution of the four cell entries is then multinomial with parameters (n, p_{AB}, p_{AB̄}, p_{ĀB}, p_{ĀB̄}).

• A standard measure of the degree of association of the attributes A and B is the cross-product ratio (also called the odds ratio)

ρ = (p_{AB} p_{ĀB̄})/(p_{AB̄} p_{ĀB}).

– Use the fact that p_{AB} = p_{A}p_{B|A}, where p_{A} and p_{B|A} denote the probability of A and the conditional probability of B given A, respectively. It leads to

ρ = (p_{B|A} p_{B̄|Ā})/(p_{B|Ā} p_{B̄|A}).

– Think of A as maternal age no more than 20, Ā as maternal age greater than 20, B as birthweight no more than 2,500 gms, and B̄ as birthweight greater than 2,500 gms. The odds ratio can be used to relate the risk of an underweight baby to the maternal age.

– The attributes A and B are said to be positively associated if

p_{B|A} > p_{B|Ā} and p_{B̄|Ā} > p_{B̄|A},

and these conditions imply that ρ > 1.

– In the case of negative dependence, the above inequalities are reversed.

– Independence of A and B is characterized by equality instead of inequality and hence by ρ = 1.

• The odds ratio ρ is estimated by replacing the cell probabilities p_{AB}, . . . by the corresponding frequencies n_{AB}/n, . . ..
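
A minimal R sketch of this frequency substitution estimate, using hypothetical cell counts:

```r
# hypothetical 2x2 table of counts: rows A / not-A, columns B / not-B
n.AB <- 30;    n.AnotB    <- 70
n.notAB <- 20; n.notAnotB <- 180

# frequency substitution: replace each cell probability by its sample frequency;
# the factors 1/n cancel, so the estimate uses the counts directly
rho.hat <- (n.AB * n.notAnotB) / (n.AnotB * n.notAB)
rho.hat
```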

The Method of Least Squares

• It became widely used early in the nineteenth century by Gauss for estimation in problems of astronomical measurement.

• Suppose that water is being pumped through a container to which an amount of dye has been added.

Every few seconds the concentration of dye is measured in the water leaving the container.

It is expected that the concentration of dye will decrease linearly over time.

Since the measuring equipment may not be perfectly accurate, it may not be possible to interpret the measurements exactly, and the mixing may not behave exactly as predicted.

To determine the rate at which the concentration decreases, the experimenter would have to approximate the data by a straight line, a line that best approximates the data in some sense. A common approach is to employ the method of least squares.

• The model above is called a linear model because it is a linear combination of the model functions 1 and x.

Here Y_{x} is the concentration of dye observed at time x.

The model can be written as

Y_{x} = θ_{1} + θ_{2}x + ε_{x},

where Y_{x} is often called the response observed at x, (θ_{1}, θ_{2}) is a 2-vector of unknown parameters, x is an explanatory variable (or covariate), and ε_{x} is random error.

Our data is (x, Y_{x}) and ε_{x} cannot be observed.

x can be random or nonrandom.

• Nonlinear models are also used. A common example is an exponential model such as

Y_{t} = θ_{1} exp(θ_{2}t) + ε_{t}.

Here the model is a nonlinear function of the parameter θ_{2}.

• In either case, we can write the observations (x_{i}, y_{i}) in the form

Y_{i} = g_{i}(θ_{1}, . . . , θ_{k}) + ε_{i}, 1 ≤ i ≤ n,

where

– The g_{i} are known functions and the real numbers θ_{1}, . . . , θ_{k} are unknown parameters of interest.

– The parameters (θ_{1}, . . . , θ_{k}) can vary freely over a set Ω contained in R^{k}.

– The ε_{i} satisfy the following restrictions:

E(ε_{i}) = 0, 1 ≤ i ≤ n,
V ar(ε_{i}) = σ^{2}, 1 ≤ i ≤ n,
Cov(ε_{i}, ε_{j}) = 0, 1 ≤ i < j ≤ n.

– E(Y_{i}) = g_{i}(θ_{1}, . . . , θ_{k}) with unknown θ_{1}, . . . , θ_{k}.
Example 15.

• Suppose that we want to find out how increasing the amount x of a certain chemical or fertilizer in the soil increases the amount y of that chemical in the plants grown in that soil.

– Nine samples of soil were treated with different amounts x of phosphorus.

– Y is the amount of phosphorus found in corn plants grown for 38 days in the different samples of soil.

– The (x_{i}, y_{i}) are (1, 64), (4, 71), (5, 54), (9, 81), (11, 76), (13, 93), (23, 77), (23, 95), (28, 109).

• Assume the relationship between x and y can be approximated well by the random model y_{i} = θ_{1} + θ_{2}x_{i} + ε_{i}.

• A least squares estimator of (θ_{1}, θ_{2}) is defined to be the minimizer of

Q(θ_{1}, θ_{2}) = Σ_{i=1}^{9} (y_{i} − θ_{1} − θ_{2}x_{i})^{2}.

We then run into an optimization problem.

Note that

– Q(θ_{1}, θ_{2}) is a quadratic function of (θ_{1}, θ_{2}).

– There is no restriction on the range of (θ_{1}, θ_{2}). (i.e., (θ_{1}, θ_{2}) ∈ R^{2}, which is an open set.)

It follows from vector calculus that the least squares estimate (ˆθ_{1}, ˆθ_{2}) must satisfy the equations

∂Q(θ_{1}, θ_{2})/∂θ_{j} = 0, j = 1, 2.

If a constraint were imposed, we might need to use the method of Lagrange multipliers to find the minimizer.

– Differentiation leads to the following normal equations

Σ_{i} (y_{i} − θ_{1} − θ_{2}x_{i}) = 0,
Σ_{i} x_{i}(y_{i} − θ_{1} − θ_{2}x_{i}) = 0.

The sample regression line is 61.58 + 1.42x.
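
The fitted line can be reproduced in R with lm(), which solves exactly these normal equations:

```r
x <- c(1, 4, 5, 9, 11, 13, 23, 23, 28)
y <- c(64, 71, 54, 81, 76, 93, 77, 95, 109)
fit <- lm(y ~ x)
coef(fit)   # intercept about 61.58, slope about 1.42
```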

• If some of ε_{1}, · · · , ε_{n} have more chance of being small than others, it might seem more sensible to estimate θ_{1} and θ_{2} by minimizing some weighted sum of squares

Σ_{i=1}^{n} w_{i}(y_{i} − θ_{1} − θ_{2}x_{i})^{2},

the w_{i} being weights which are larger for those i for which ε_{i} is liable to be small and smaller for those for which ε_{i} is liable to be large.

Optimization and Least Squares

• The word optimization denotes either the minimization or maximization of a function.

• Consider a real-valued function h with domain D in R^{k}. The function h is said to have a local maximum at a point θ^{∗} ∈ D if there exists a real number δ > 0 such that h(θ) ≤ h(θ^{∗}) for all θ ∈ D satisfying ‖θ − θ^{∗}‖ ≤ δ.

A local minimum is defined in a similar way, but with the inequality h(θ) ≤ h(θ^{∗}) reversed.

If the inequality h(θ) ≤ h(θ^{∗}) is replaced by a strict inequality,

h(θ) < h(θ^{∗}), θ ∈ D, θ ≠ θ^{∗},

we have a strict local maximum; and if the sense of the inequality h(θ) < h(θ^{∗}) is reversed, we have a strict local minimum.

• We say that the function h has a global (absolute) maximum (strict global maximum) at θ^{∗} if h(θ) ≤ h(θ^{∗}) [h(θ) < h(θ^{∗})] holds for every θ ∈ D.

Thus a function may have many local maxima, each with a different value of h(θ), say, h(θ_{j}^{0}), j = 1, . . . , ℓ.

The global maximum can always be chosen from among these local maxima by comparing their values and choosing one such that

h(θ^{∗}) ≥ h(θ_{j}^{0}), j = 1, . . . , ℓ,

where θ^{∗} ∈ {θ_{j}^{0}, j = 1, . . . , ℓ}.

It is clear that every global maximum (minimum) is also a local maximum (minimum); however, the converse of this statement is, in general, not true.

Only when h(θ) is a convex function on a convex set D ⊂ R^{k} is every local minimum of h at θ ∈ D also a global minimum of h over D (correspondingly, for a concave h, every local maximum is global).

• Minimization of a one-dimensional function h(θ), without any restrictions on θ, by Newton’s method:

– Assume that h(θ) has at least two continuous derivatives and that it is bounded below.

– Approximate h(θ) by a quadratic function that we can minimize, and use the minimizer of the simpler function as the new estimate of the minimizer of h(θ).

The process is then repeated from this new point.

– To form a quadratic approximation, let θ^{(t)} be the current estimate of the solution θ^{∗}, and consider a Taylor series expansion of h about the point θ^{(t)}:

h(θ^{(t)} + s) = h(θ^{(t)}) + sh′(θ^{(t)}) + (1/2)s^{2}h′′(θ^{(t)}) + · · · .

The original minimization problem can then be approximated using the Taylor series expansion:

h(θ^{∗}) = min_{θ} h(θ) = min_{s} h(θ^{(t)} + s)
= min_{s} [h(θ^{(t)}) + sh′(θ^{(t)}) + (1/2)s^{2}h′′(θ^{(t)}) + · · ·]
≈ min_{s} [h(θ^{(t)}) + sh′(θ^{(t)}) + (1/2)s^{2}h′′(θ^{(t)})].

– To minimize the quadratic, take the derivative with respect to s and set it equal to zero, giving

s = −h′(θ^{(t)})/h′′(θ^{(t)}).

Since s is an approximation to the step that would take us from θ^{(t)} to the solution θ^{∗} of the original problem, the algorithm is defined by the formula

θ^{(t+1)} = θ^{(t)} − h′(θ^{(t)})/h′′(θ^{(t)}).

• Optimization in many dimensions with linear regression

– Consider Example 15, in which Q(θ_{1}, θ_{2}) can be written as

Q(θ_{1}, θ_{2}) = (θ_{1}, θ_{2}) [ 9, Σ_{i} x_{i} ; Σ_{i} x_{i}, Σ_{i} x_{i}^{2} ] (θ_{1}, θ_{2})^{T} − 2(θ_{1}, θ_{2}) (Σ_{i} y_{i}, Σ_{i} x_{i}y_{i})^{T} + Σ_{i} y_{i}^{2},

where the 2 × 2 matrix is written row by row.

– How do we differentiate a quadratic form θ^{T}Aθ? Here A is a k × k symmetric matrix.

Result: ∂(θ^{T}Aθ)/∂θ = 2Aθ.

– How do we differentiate θ^{T}b? Here b is a k × 1 column vector.

Result: ∂(θ^{T}b)/∂θ = b.

– Matrix formulation of the linear model:

y = Xθ + ε.

Here y = (y_{1}, . . . , y_{n})^{T}, X = (x_{ij})_{n×k}, and ε = (ε_{1}, . . . , ε_{n})^{T}. Observe that

(y − Xθ)^{T}(y − Xθ) = θ^{T}X^{T}Xθ − 2θ^{T}X^{T}y + y^{T}y.

Differentiation leads to the normal equations

2X^{T}Xθ − 2X^{T}y = 0.

Any solution of the above is an LSE of θ. If X is of full rank, in which case (X^{T}X)^{−1} exists, then there is a unique LSE, which is

ˆθ = (X^{T}X)^{−1}X^{T}y.

– In R, the function solve inverts matrices and solves systems of linear equations; solve(A) inverts A and solve(A, b) solves A %*% x = b.

If the system is over-determined, the least-squares fit is found, but matrices of less than full rank give an error.

– Consider the simple linear regression. It turns out that

X^{T}X = [ n, Σ_{i} x_{i} ; Σ_{i} x_{i}, Σ_{i} x_{i}^{2} ].

The matrix is invertible if and only if some of the x_{i}’s are different.
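
The normal equations for Example 15 can be solved directly with solve(), reproducing the lm() coefficients shown earlier:

```r
x <- c(1, 4, 5, 9, 11, 13, 23, 23, 28)
y <- c(64, 71, 54, 81, 76, 93, 77, 95, 109)
X <- cbind(1, x)                              # design matrix with an intercept column
theta.hat <- solve(t(X) %*% X, t(X) %*% y)    # solves (X^T X) theta = X^T y
theta.hat
```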

• Optimization in Many Dimensions: New- ton’s Method

– Newton’s method (also called the Newton-Raphson method) is a widely used and often-studied method for minimization.

– The method requires use of both the gradient vector and the Hessian matrix in computations; hence it places more burden on the user to supply derivatives of the objective function than does the steepest descent method learned in calculus (the gradient vector defines the direction of maximum local increase).

– Write the Taylor series in matrix/vector form. In two dimensions, the second-order Taylor series approximation is

h(θ_{1} + s_{1}, θ_{2} + s_{2}) ≈ h(θ_{1}, θ_{2}) + s_{1}D^{(1,0)}h(θ_{1}, θ_{2}) + s_{2}D^{(0,1)}h(θ_{1}, θ_{2})
+ (1/2)[s_{1}^{2}D^{(2,0)}h(θ_{1}, θ_{2}) + 2s_{1}s_{2}D^{(1,1)}h(θ_{1}, θ_{2}) + s_{2}^{2}D^{(0,2)}h(θ_{1}, θ_{2})].
– Let ∇^{2}h be the constant matrix of second partial derivatives of h at θ — the so-called Hessian matrix:

(∇^{2}h)_{ij} = ∂^{2}h(θ)/∂θ_{i}∂θ_{j}.

– With the gradient and Hessian notation, the Taylor series in many dimensions takes the form

Q(θ + p) = h(θ) + p^{T}∇h(θ) + (1/2)p^{T}∇^{2}h(θ)p.

– When θ^{(t)} is close to θ^{∗}, we can expect that the above quadratic function will approximate h(θ) well.

To obtain the step p, we now minimize this quadratic as a function of p by forming its gradient with respect to p,

∇_{p}Q(p) = ∇_{p}[p^{T}∇h(θ) + (1/2)p^{T}∇^{2}h(θ)p] = ∇h(θ) + ∇^{2}h(θ)p,

and setting it equal to zero:

∇^{2}h(θ)p = −∇h(θ).

This is a set of k linear equations in the k unknowns p = (p_{1}, . . . , p_{k})^{T}.

These linear equations are called the Newton equations.

If ∇^{2}h(θ) is positive definite, this suggests the general iterative scheme

θ^{(t+1)} = θ^{(t)} + p = θ^{(t)} − [∇^{2}h(θ^{(t)})]^{−1}∇h(θ^{(t)}).

– When h(θ) is closely approximated by Q(p) in the neighborhood of θ^{∗}, convergence will normally be at a quadratic rate if the Hessian is positive definite at each step.

– One problem with Newton’s method is that the Hessian may not be positive definite at each iteration.

Thus the method requires modification to ensure that the resultant method is acceptable but still retains the desirable characteristics of Newton’s method.

Recall the nonlinear regression. If a least-squares approach were used, the following optimization problem would be obtained:

min_{θ_{0},θ_{1}} Σ_{i=1}^{n} [Y_{i} − θ_{0} exp(θ_{1}T_{i})]^{2}.

This is called a nonlinear least-squares problem. No analytic solution can be found. More details will be given when we discuss MLE later on.
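
In R such problems can be attacked numerically, for instance with nls(); in the sketch below the data are simulated under the hypothetical values θ_{0} = 2 and θ_{1} = −0.5, and the starting values passed to nls() are rough guesses.

```r
set.seed(6)
t.obs <- seq(0, 10, by = 0.5)
y.obs <- 2 * exp(-0.5 * t.obs) + rnorm(length(t.obs), sd = 0.05)

fit <- nls(y.obs ~ th0 * exp(th1 * t.obs),
           start = list(th0 = 1, th1 = -0.1))   # rough starting values
coef(fit)   # should be near (2, -0.5)
```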

Prediction

• Suppose we have a random vector (or variable) X with E(X^{T}X) < ∞ and a random variable Y.

One may wish to predict the value of Y based on an observed value of X.

Let g(X) be the predictor with E[g(X)]^{2} < ∞.

• As a motivating example, a stock holder wants to predict the value of his holdings at some time in the future (Y ) on the basis of his past experience with the market and his portfolio (X).

• Suppose we use a linear function of X (instead of a nonlinear function) to predict Y. What is the best linear predictor under mean squared error?

– Suppose that E(X^{2}) and E(Y^{2}) are finite and X and Y are not constant. Then the unique best zero-intercept linear predictor is obtained by taking

a = a_{0} = E(XY )/E(X^{2}),