
Chapter 6: Principle of Data Reduction

6.1 Introduction

An experimenter uses the information in a sample X1, . . . , Xn to make inferences about an unknown parameter θ. If the sample size n is large, then the observed sample x1, . . . , xn is a long list of numbers that may be hard to interpret. Any statistic, T(X), defines a form of data reduction or data summary. For example, the sample mean, the sample variance, the largest observation, and the smallest observation are four statistics that might be used to summarize some key features of the sample.

The statistic summarizes the data in that, rather than reporting the entire sample x, it reports only the value T(x) = t. For example, two samples x and y will be treated as carrying the same information about θ whenever T(x) = T(y). The advantages and consequences of this type of data reduction are the topics of this chapter.

We study three principles of data reduction.

1. The Sufficiency Principle promotes a method of data reduction that does not discard information about θ while achieving some summarization of the data.

2. The Likelihood Principle describes a function of the parameter, determined by the observed sample, that contains all the information about θ available from the sample.

3. The Equivariance Principle prescribes yet another method of data reduction that still preserves some important features of the model.


6.2 The Sufficiency Principle

Sufficiency Principle: If T(X) is a sufficient statistic for θ, then any inference about θ should depend on the sample X only through the value T(X). That is, if x and y are two sample points such that T(x) = T(y), then the inference about θ should be the same whether X = x or X = y is observed.

6.2.1 Sufficient Statistics

Definition 6.2.1 A statistic T (X) is a sufficient statistic for θ if the conditional distribution of the sample X given the value of T (X) does not depend on θ.

To use this definition to verify that a statistic T (X) is a sufficient statistic for θ, we must verify that P (X = x|T (X) = T (x)) does not depend on θ. Since {X = x} is a subset of {T (X) = T (x)},

$$
P_\theta(X = x \mid T(X) = T(x)) \;=\; \frac{P_\theta(X = x \text{ and } T(X) = T(x))}{P_\theta(T(X) = T(x))}
\;=\; \frac{P_\theta(X = x)}{P_\theta(T(X) = T(x))} \;=\; \frac{p(x\mid\theta)}{q(T(x)\mid\theta)},
$$

where p(x|θ) is the joint pmf/pdf of the sample X and q(t|θ) is the pmf/pdf of T (X). Thus, T (X) is a sufficient statistic for θ if and only if, for every x, the above ratio is constant as a function of θ.

Theorem 6.2.1 If p(x|θ) is the joint pdf or pmf of X and q(t|θ) is the pdf or pmf of T (X), then T (X) is a sufficient statistic for θ if, for every x in the sample space, the ratio p(x|θ)/q(T (x)|θ) is constant as a function of θ.

Example 6.2.1 (Binomial sufficient statistic) Let X1, . . . , Xn be iid Bernoulli random variables with parameter θ, 0 < θ < 1. Then T(X) = X1 + · · · + Xn is a sufficient statistic for θ. Note that T(X) counts the number of Xi's that equal 1, so T(X) ∼ binomial(n, θ). The ratio of pmfs is thus

$$
\frac{p(x\mid\theta)}{q(T(x)\mid\theta)}
= \frac{\prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i}}{\binom{n}{t}\theta^{t}(1-\theta)^{n-t}}
= \frac{\theta^{\sum x_i}(1-\theta)^{\sum(1-x_i)}}{\binom{n}{t}\theta^{t}(1-\theta)^{n-t}}
= \frac{1}{\binom{n}{t}} = \frac{1}{\binom{n}{\sum_{i=1}^n x_i}},
\qquad \text{where } t = \sum_{i=1}^n x_i.
$$

Since this ratio does not depend on θ, by Theorem 6.2.1, T(X) is a sufficient statistic for θ.
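As a quick sanity check of this computation, the following short script (an illustrative sketch, not part of the text) enumerates all binary samples of a small size and verifies that the ratio p(x|θ)/q(T(x)|θ) equals 1/C(n, t) for two arbitrary values of θ.

```python
# Hypothetical check of Example 6.2.1: for every binary sample x of length n,
# p(x|theta) / q(T(x)|theta) should equal 1 / C(n, t), whatever theta is.
from itertools import product
from math import comb

n = 4
for theta in (0.2, 0.7):                                   # two arbitrary parameter values
    for x in product((0, 1), repeat=n):
        t = sum(x)
        p_x = theta**t * (1 - theta)**(n - t)              # joint Bernoulli pmf p(x|theta)
        q_t = comb(n, t) * theta**t * (1 - theta)**(n - t) # binomial(n, theta) pmf of T at t
        assert abs(p_x / q_t - 1 / comb(n, t)) < 1e-12     # the ratio is free of theta
print("ratio equals 1/C(n, t) for every x and both theta values")
```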


Example 6.2.2 (Normal sufficient statistic) Let X1, . . . , Xn be iid N(µ, σ²), where σ² is known. We wish to show that the sample mean, T(X) = X̄, is a sufficient statistic for µ. The joint pdf of the sample is

$$
f(x\mid\mu) = \prod_{i=1}^n (2\pi\sigma^2)^{-1/2}\exp\{-(x_i - \mu)^2/(2\sigma^2)\}
= (2\pi\sigma^2)^{-n/2}\exp\Big\{-\Big[\sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2\Big]\Big/(2\sigma^2)\Big\}.
$$

Recall that the sample mean X̄ ∼ N(µ, σ²/n). Thus, the ratio of pdfs is

$$
\frac{p(x\mid\mu)}{q(\bar{x}\mid\mu)}
= \frac{(2\pi\sigma^2)^{-n/2}\exp\{-[\sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2]/(2\sigma^2)\}}
       {(2\pi\sigma^2/n)^{-1/2}\exp\{-n(\bar{x} - \mu)^2/(2\sigma^2)\}}
= n^{-1/2}(2\pi\sigma^2)^{-(n-1)/2}\exp\Big\{-\sum_{i=1}^n (x_i - \bar{x})^2/(2\sigma^2)\Big\},
$$

which does not depend on µ. By Theorem 6.2.1, the sample mean is a sufficient statistic for µ.
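The same conclusion can be checked numerically; the sketch below (an assumed illustration, not part of the text) evaluates the ratio p(x|µ)/q(x̄|µ) for one fixed sample at several values of µ and finds the same number each time.

```python
# Hypothetical numerical check of Example 6.2.2: the ratio f(x|mu) / q(xbar|mu)
# is the same for every mu, so Xbar is sufficient for mu (sigma^2 known).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma2, n = 2.0, 6
x = rng.normal(1.0, np.sqrt(sigma2), size=n)    # one fixed sample
xbar = x.mean()

def ratio(mu):
    joint = np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))   # p(x|mu)
    marg = norm.pdf(xbar, loc=mu, scale=np.sqrt(sigma2 / n))      # pdf of Xbar at xbar
    return joint / marg

print([round(ratio(mu), 8) for mu in (-1.0, 0.0, 2.5)])           # three identical values
```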

Example 6.2.3 (Sufficient order statistics) Let X1, . . . , Xn be iid from a pdf f, where we are unable to specify any more information about the pdf. It then follows that

$$
f_{X_{(1)},\dots,X_{(n)}}(x) = \begin{cases} n!\,\prod_{i=1}^n f_X(x_i), & \text{if } x_1 < \cdots < x_n,\\ 0, & \text{otherwise},\end{cases}
$$

where (x_{(1)}, . . . , x_{(n)}) is the order statistic. By Theorem 6.2.1, the order statistic is a sufficient statistic.

Of course, this is not much of a reduction, but we should not expect more with so little information about the density f. However, even if we specify more about the density, we still may not be able to get much of a sufficient reduction. For example, suppose that f is the Cauchy pdf f(x|θ) = 1/{π[1 + (x − θ)²]} or the logistic pdf f(x|θ) = e^{−(x−θ)}/(1 + e^{−(x−θ)})². It turns out that, outside of the exponential family of distributions, it is rare to have a sufficient statistic of smaller dimension than the size of the sample, so in many cases the order statistics are the best that we can do.

Theorem 6.2.2 (Factorization Theorem) Let f (x|θ) denote the joint pdf or pmf of a sample X. A statistic T (X) is a sufficient statistic for θ if and only if there exist functions g(t|θ) and h(x) such that, for all sample points x and all parameter points θ,

f (x|θ) = g(T (x)|θ)h(x). (6.1)


Proof: We give the proof only for discrete distributions. Suppose T (X) is a sufficient statistic.

Choose g(t|θ) = Pθ(T (X) = t) and h(x) = P (X = x|T (X) = T (x)). Because T (X) is sufficient, the conditional probability h(x) does not depend on θ. Thus,

$$
f(x\mid\theta) = P_\theta(X = x)
= P_\theta(X = x \text{ and } T(X) = T(x))
= P_\theta(T(X) = T(x))\,P(X = x \mid T(X) = T(x))
= g(T(x)\mid\theta)h(x).
$$

So factorization (6.1) has been exhibited. Also, the last two lines above imply that g(T (x)|θ) is the pmf of T (X).

Now assume the factorization (6.1) exists. Let q(t|θ) be the pmf of T(X). Define A_{T(x)} = {y : T(y) = T(x)}. Then

$$
\frac{f(x\mid\theta)}{q(T(x)\mid\theta)}
= \frac{g(T(x)\mid\theta)h(x)}{q(T(x)\mid\theta)}
= \frac{g(T(x)\mid\theta)h(x)}{\sum_{y\in A_{T(x)}} g(T(y)\mid\theta)h(y)}
\qquad \Big(\text{since } q(T(x)\mid\theta) = \sum_{y\in A_{T(x)}} f(y\mid\theta)\Big)
$$
$$
= \frac{g(T(x)\mid\theta)h(x)}{g(T(x)\mid\theta)\sum_{y\in A_{T(x)}} h(y)}
\qquad (\text{since } T \text{ is constant on } A_{T(x)})
= \frac{h(x)}{\sum_{y\in A_{T(x)}} h(y)}.
$$

Since the ratio does not depend on θ, by Theorem 6.2.1, T(X) is a sufficient statistic for θ. □

To use the Factorization Theorem to find a sufficient statistic, we factor the joint pdf of the sample into two parts. One part does not depend on θ; it constitutes the h(x) function. The other part depends on θ, and usually it depends on the sample x only through some function T(x); this function is a sufficient statistic for θ.

Example 6.2.4 (Normal sufficient statistic) Let X1, . . . , Xn be iid N(µ, σ²), where σ² is known. The pdf can be factored as

$$
f(x\mid\mu) = (2\pi\sigma^2)^{-n/2}\exp\Big\{-\sum_{i=1}^n (x_i-\bar{x})^2/(2\sigma^2)\Big\}\,
\exp\{-n(\bar{x}-\mu)^2/(2\sigma^2)\}.
$$

We can define

$$
g(t\mid\mu) = \exp\{-n(t-\mu)^2/(2\sigma^2)\}
$$

by defining T(x) = x̄, and

$$
h(x) = (2\pi\sigma^2)^{-n/2}\exp\Big\{-\sum_{i=1}^n (x_i-\bar{x})^2/(2\sigma^2)\Big\}.
$$

Thus, by the Factorization Theorem, T(X) = X̄ is a sufficient statistic for µ.

Example 6.2.5 (Uniform sufficient statistic) Let X1, . . . , Xn be iid observations from the discrete uniform distribution on 1, . . . , θ. That is, the unknown parameter, θ, is a positive integer and the pmf of Xi is

$$
f(x\mid\theta) = \begin{cases} 1/\theta, & x = 1, 2, \dots, \theta,\\ 0, & \text{otherwise}.\end{cases}
$$

Thus, the joint pmf of X1, . . . , Xn is

$$
f(x\mid\theta) = \begin{cases} \theta^{-n}, & x_i \in \{1, \dots, \theta\} \text{ for } i = 1, \dots, n,\\ 0, & \text{otherwise}.\end{cases}
$$

Let I_A(x) be the indicator function of the set A; that is, it is equal to 1 if x ∈ A and equal to 0 otherwise. Let N = {1, 2, . . .} be the set of positive integers and let N_θ = {1, 2, . . . , θ}. Then the joint pmf of X1, . . . , Xn is

$$
f(x\mid\theta) = \prod_{i=1}^n \theta^{-1} I_{N_\theta}(x_i) = \theta^{-n}\prod_{i=1}^n I_{N_\theta}(x_i).
$$

Defining T(x) = max_i x_i, we see that

$$
\prod_{i=1}^n I_{N_\theta}(x_i) = \Big(\prod_{i=1}^n I_N(x_i)\Big) I_{N_\theta}(T(x)).
$$

Thus we have the factorization

$$
f(x\mid\theta) = \theta^{-n} I_{N_\theta}(T(x)) \Big(\prod_{i=1}^n I_N(x_i)\Big).
$$

By the factorization theorem, T(X) = max_i X_i is a sufficient statistic for θ.
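Sufficiency of the maximum can also be seen directly from Definition 6.2.1; the enumeration below (a hypothetical illustration, not part of the text) verifies that P(X = x | max_i X_i = t) = 1/(t^n − (t − 1)^n) for every θ, so the conditional distribution given T is free of θ.

```python
# Hypothetical check of Example 6.2.5: under the discrete uniform on {1, ..., theta},
# P(X = x | max X_i = t) = 1 / (t^n - (t-1)^n), which does not involve theta.
from itertools import product

n = 3
for theta in (4, 7):
    for t in range(1, theta + 1):
        p_t = (t**n - (t - 1)**n) / theta**n            # pmf of T = max X_i at t
        for x in product(range(1, theta + 1), repeat=n):
            if max(x) == t:
                cond = (1 / theta**n) / p_t             # P(X = x | T = t)
                assert abs(cond - 1 / (t**n - (t - 1)**n)) < 1e-12
print("conditional distribution of the sample given its maximum is free of theta")
```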

Example 6.2.6 (Normal sufficient statistic, both parameters unknown) Assume that X1, . . . , Xn are iid N(µ, σ²), and that both µ and σ² are unknown, so the parameter vector is θ = (µ, σ²). Let T1(x) = x̄ and T2(x) = s² = Σ_{i=1}^n (x_i − x̄)²/(n − 1). Then

$$
f(x\mid\theta) = (2\pi\sigma^2)^{-n/2}\exp\Big\{-\Big[\sum_{i=1}^n (x_i-\bar{x})^2 + n(\bar{x}-\mu)^2\Big]\Big/(2\sigma^2)\Big\}
= (2\pi\sigma^2)^{-n/2}\exp\big\{-\big(n(t_1-\mu)^2 + (n-1)t_2\big)\big/(2\sigma^2)\big\}.
$$


Let h(x) = 1. By the factorization theorem, T (X) = (T1(X), T2(X)) = ( ¯X, S2) is a sufficient statistic for (µ, σ2).

The results can be generalized to the exponential family of distributions.

Theorem 6.2.3 Let X1, . . . , Xn be iid observations from a pdf or pmf f(x|θ) that belongs to an exponential family given by

$$
f(x\mid\theta) = h(x)\,c(\theta)\exp\Big\{\sum_{i=1}^k w_i(\theta)t_i(x)\Big\},
$$

where θ = (θ1, . . . , θd), d ≤ k. Then

$$
T(X) = \Big(\sum_{j=1}^n t_1(X_j),\ \dots,\ \sum_{j=1}^n t_k(X_j)\Big)
$$

is a sufficient statistic for θ.
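For the N(µ, σ²) family, for instance, t1(x) = x and t2(x) = x², so the theorem gives (Σ X_i, Σ X_i²) as a sufficient statistic; the snippet below (an assumed illustration, not part of the text) computes it and recovers (x̄, s²) from it.

```python
# Hypothetical illustration of Theorem 6.2.3 for N(mu, sigma^2): with t1(x) = x and
# t2(x) = x^2, the natural sufficient statistic is T = (sum x_i, sum x_i^2).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=1000)
T = (x.sum(), (x**2).sum())                      # T(X) = (sum t1(X_j), sum t2(X_j))
n = x.size
xbar = T[0] / n                                  # (xbar, s^2) is a one-to-one function of T,
s2 = (T[1] - n * xbar**2) / (n - 1)              # so it carries the same information
print(T)
print(xbar, s2, x.mean(), x.var(ddof=1))         # the two computations agree
```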

6.2.2 Minimal Sufficient Statistics

In any problem, there are many sufficient statistics. For example,

1. It is always true that the complete sample, X, is a sufficient statistic.

2. Any one-to-one function of a sufficient statistic is also a sufficient statistic.

Recall that the purpose of a sufficient statistic is to achieve data reduction without loss of infor- mation about the parameter θ; thus, a statistic that achieves the most data reduction while still retaining all the information about θ might be considered preferable.

Definition 6.2.2 A sufficient statistic T(X) is called a minimal sufficient statistic if, for any other sufficient statistic T′(X), T(x) is a function of T′(x).

To say that T(x) is a function of T′(x) simply means that if T′(x) = T′(y), then T(x) = T(y).

Let T = {t : t = T(x) for some x ∈ X} be the image of X under T(x). Then T(x) partitions the sample space into sets A_t, t ∈ T, defined by A_t = {x : T(x) = t}. If {B_{t′} : t′ ∈ T′} are the partition sets for T′(x) and {A_t : t ∈ T} are the partition sets for T(x), then Definition 6.2.2 states that every B_{t′} is a subset of some A_t. Thus, the partition associated with a minimal sufficient statistic is the coarsest possible partition for a sufficient statistic, and a minimal sufficient statistic achieves the greatest possible data reduction for a sufficient statistic.


Example 6.2.7 (Two Normal sufficient statistics) Suppose that X1, . . . , Xn are observations from N(µ, σ²) with σ² known. We know that T(X) = X̄ is a sufficient statistic for µ. The factorization theorem shows that T′(X) = (X̄, S²) is also a sufficient statistic for µ. T(X) can be written as a function of T′(X) by defining the function r(a, b) = a. Then T(x) = x̄ = r(x̄, s²) = r(T′(x)).

Theorem 6.2.4 Let f(x|θ) be the pmf or pdf of a sample X. Suppose there exists a function T(x) such that, for every two sample points x and y, the ratio f(x|θ)/f(y|θ) is constant as a function of θ if and only if T(x) = T(y). Then T(X) is a minimal sufficient statistic for θ.

Proof: To simplify the proof, we assume f(x|θ) > 0 for all x ∈ X and θ. First we show that T(X) is a sufficient statistic. Let T = {t : t = T(x) for some x ∈ X} be the image of X under T(x). Define the partition sets induced by T(x) as A_t = {x : T(x) = t}. For each A_t, choose and fix one element x_t ∈ A_t. For any x ∈ X, x_{T(x)} is the fixed element that is in the same set, A_{T(x)}, as x. Since x and x_{T(x)} are in the same set A_{T(x)}, T(x) = T(x_{T(x)}) and, hence, f(x|θ)/f(x_{T(x)}|θ) is constant as a function of θ. Thus, we can define a function on X by h(x) = f(x|θ)/f(x_{T(x)}|θ), and h does not depend on θ. Define a function on T by g(t|θ) = f(x_t|θ). Then it can be seen that

$$
f(x\mid\theta) = f(x_{T(x)}\mid\theta)\,\frac{f(x\mid\theta)}{f(x_{T(x)}\mid\theta)} = g(T(x)\mid\theta)h(x),
$$

and, by the factorization theorem, T(X) is a sufficient statistic for θ.

Now to show that T(X) is minimal, let T′(X) be any other sufficient statistic. By the factorization theorem, there exist functions g′ and h′ such that f(x|θ) = g′(T′(x)|θ)h′(x). Let x and y be any two sample points with T′(x) = T′(y). Then

$$
\frac{f(x\mid\theta)}{f(y\mid\theta)} = \frac{g'(T'(x)\mid\theta)h'(x)}{g'(T'(y)\mid\theta)h'(y)} = \frac{h'(x)}{h'(y)}.
$$

Since this ratio does not depend on θ, the assumptions of the theorem imply that T(x) = T(y). Thus, T(x) is a function of T′(x) and T(X) is minimal. □

Example 6.2.8 (Normal minimal sufficient statistic) Let X1, . . . , Xn be iid N(µ, σ²), both µ and σ² unknown. Let x and y denote two sample points, and let (x̄, s_x²) and (ȳ, s_y²) be the sample means and variances corresponding to the x and y samples, respectively. Then the ratio of densities is

$$
\frac{f(x\mid\mu,\sigma^2)}{f(y\mid\mu,\sigma^2)}
= \exp\big\{\big[-n(\bar{x}^2-\bar{y}^2) + 2n\mu(\bar{x}-\bar{y}) - (n-1)(s_x^2 - s_y^2)\big]\big/(2\sigma^2)\big\}.
$$


This ratio will be constant as a function of µ and σ² if and only if x̄ = ȳ and s_x² = s_y². Thus, by Theorem 6.2.4, (X̄, S²) is a minimal sufficient statistic for (µ, σ²).
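The ratio criterion can be illustrated numerically; in the sketch below (assumed, not from the text) a permutation of a sample shares its (x̄, s²) and gives a log-ratio of densities that is the same (indeed zero) at every parameter value, while a sample with a different (x̄, s²) gives a log-ratio that changes with the parameter.

```python
# Hypothetical check of Example 6.2.8: log f(x|mu,sigma^2) - log f(y|mu,sigma^2) is
# constant in (mu, sigma^2) exactly when x and y share the same (xbar, s^2).
import numpy as np
from scipy.stats import norm

def log_ratio(x, y, mu, sigma):
    return norm.logpdf(x, mu, sigma).sum() - norm.logpdf(y, mu, sigma).sum()

x = np.array([1.0, 2.0, 4.0, 7.0])
y = x[::-1].copy()                     # same xbar and s^2 (a permutation of x)
z = np.array([1.0, 2.0, 4.0, 8.0])     # different xbar and s^2

for mu, sigma in [(0.0, 1.0), (3.0, 2.0)]:
    print(round(log_ratio(x, y, mu, sigma), 6), round(log_ratio(x, z, mu, sigma), 6))
# first column: identical (zero) for both parameter values; second column: changes
```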

Example 6.2.9 (Uniform minimal sufficient statistic) Suppose X1, . . . , Xn are iid uniform observations on the interval (θ, θ + 1), −∞ < θ < ∞. Then the joint pdf of X is

$$
f(x\mid\theta) = \begin{cases} 1, & \theta < x_i < \theta + 1,\ i = 1, \dots, n,\\ 0, & \text{otherwise},\end{cases}
$$

which can be written as

$$
f(x\mid\theta) = \begin{cases} 1, & \max_i x_i - 1 < \theta < \min_i x_i,\\ 0, & \text{otherwise}.\end{cases}
$$

Thus, for two sample points x and y, the numerator and denominator of the ratio f(x|θ)/f(y|θ) will be positive for the same values of θ if and only if min_i x_i = min_i y_i and max_i x_i = max_i y_i. Thus, we have that T(X) = (X(1), X(n)) is a minimal sufficient statistic.

This example shows that the dimension of a minimal sufficient statistic may not match the dimension of the parameter. A minimal sufficient statistic is not unique. Any one-to-one function of a minimal sufficient statistic is also a minimal sufficient statistic. For example, T′(X) = (X(n) − X(1), (X(n) + X(1))/2) is also a minimal sufficient statistic in Example 6.2.9, and T′(X) = (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i²) is also a minimal sufficient statistic in Example 6.2.8.

6.2.3 Ancillary Statistics

Definition 6.2.3 A statistic S(X) whose distribution does not depend on the parameter θ is called an ancillary statistic.

Alone, an ancillary statistic contains no information about θ. An ancillary statistic is an observation on a random variable whose distribution is fixed and known, unrelated to θ. Paradoxically, an ancillary statistic, when used in conjunction with other statistics, sometimes does contain valuable information for inferences about θ.

Example 6.2.10 (Uniform ancillary statistic) Let X1, . . . , Xn be iid uniform observations on the interval (θ, θ + 1), −∞ < θ < ∞. The range statistic R = X(n)− X(1) is an ancillary statistic.


Recall that the cdf of each Xi is

$$
F(x\mid\theta) = \begin{cases} 0, & x \le \theta,\\ x - \theta, & \theta < x < \theta + 1,\\ 1, & \theta + 1 \le x.\end{cases}
$$

Thus, the joint pdf of X(1) and X(n) is

$$
g(x_{(1)}, x_{(n)}\mid\theta) = \begin{cases} n(n-1)(x_{(n)} - x_{(1)})^{n-2}, & \theta < x_{(1)} < x_{(n)} < \theta + 1,\\ 0, & \text{otherwise}.\end{cases}
$$

Making the transformation R = X(n) − X(1) and M = (X(n) + X(1))/2, we see that the joint pdf of R and M is

$$
h(r, m\mid\theta) = \begin{cases} n(n-1)r^{n-2}, & 0 < r < 1,\ \theta + r/2 < m < \theta + 1 - r/2,\\ 0, & \text{otherwise}.\end{cases}
$$

Thus, the pdf of R is

$$
h(r\mid\theta) = \int_{\theta + r/2}^{\theta + 1 - r/2} n(n-1)r^{n-2}\,dm = n(n-1)r^{n-2}(1-r), \qquad 0 < r < 1.
$$

This is a beta pdf with α = n − 1 and β = 2. Thus, the distribution of R does not depend on θ, and R is ancillary.
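A short simulation (a hypothetical sketch, not from the text) makes the same point: whatever θ is, the simulated ranges are indistinguishable from draws from a Beta(n − 1, 2) distribution.

```python
# Hypothetical simulation for Example 6.2.10: the range of a uniform(theta, theta+1)
# sample follows Beta(n-1, 2) regardless of theta.
import numpy as np
from scipy.stats import beta, kstest

rng = np.random.default_rng(2)
n = 5
for theta in (-10.0, 0.0, 3.7):
    x = rng.uniform(theta, theta + 1, size=(20000, n))
    r = x.max(axis=1) - x.min(axis=1)                     # range of each simulated sample
    print(theta, kstest(r, beta(n - 1, 2).cdf).pvalue)    # large p-value for every theta
```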

Example 6.2.11 (Location and scale family ancillary statistic) Let X1, . . . , Xn be iid observations from a location family with cdf F(x − θ), −∞ < θ < ∞. Show that the range statistic R = X(n) − X(1) is an ancillary statistic.

Let X1, . . . , Xn be iid observations from a scale family with cdf F (x/σ), σ > 0. Show that any statistic that depends on the sample only through the n − 1 values X1/Xn, . . . , Xn−1/Xn is an ancillary statistic.

6.2.4 Sufficient, Ancillary, and Complete Statistics

A minimal sufficient statistic is a statistic that has achieved the maximal amount of data reduction possible while still retaining all the information about the parameter θ. Intuitively, a minimal sufficient statistic eliminates all the extraneous information in the sample, retaining only that piece with information about θ. Since the distribution of an ancillary statistic does not depend on θ, it


might be suspected that a minimal sufficient statistic is unrelated to an ancillary statistic. However, this is not necessarily the case. Recall Example 6.2.9 in which X1, . . . , Xn were iid obs from a uniform(θ, θ + 1) distribution. We have pointed out that the statistic (X(n)− X(1), (X(n)+ X(1))/2) is a minimal sufficient statistic, and in Example 6.2.10 we showed that X(n)− X(1) is an ancillary statistic. Thus, in this case, the ancillary statistic is an important component of the minimal sufficient statistic. Certainly, the ancillary statistic and the minimal sufficient statistic are not independent.

The following example shows that an ancillary statistic can sometimes give important informa- tion for inference about θ.

Example 6.2.12 (Ancillary precision) Let X1 and X2 be iid obs from the discrete distribution that satisfies

P(X = θ) = P(X = θ + 1) = P(X = θ + 2) = 1/3,

where θ, the unknown parameter, is any integer. It can be shown, with an argument similar to that in Example 6.2.9, that (R, M), where R = X(n) − X(1) and M = (X(n) + X(1))/2, is a minimal sufficient statistic. To see how R might give information about θ, consider a sample point (r, m), where m is an integer. If we consider only m, then θ must be one of three values: either θ = m, θ = m − 1, or θ = m − 2. With only the information that M = m, all three θ values are possible. But now suppose we also learn that R = 2. Then it must be the case that X(1) = m − 1 and X(2) = m + 1, and the only possible value for θ is θ = m − 1. Thus, knowledge of the value of the ancillary statistic R has increased our knowledge about θ. Of course, knowledge of R alone would give us no information about θ.
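The enumeration below (an assumed illustration with a hypothetical true value θ = 5) lists, for every possible sample of size two, the θ values consistent with the observed (R, M); whenever R = 2 the list collapses to the single value m − 1.

```python
# Hypothetical enumeration for Example 6.2.12: theta must satisfy
# theta <= min(x) and max(x) <= theta + 2, so the candidates are {max(x)-2, ..., min(x)}.
from itertools import product

theta = 5                                             # hypothetical true value
for x1, x2 in product(range(theta, theta + 3), repeat=2):
    lo, hi = min(x1, x2), max(x1, x2)
    r, m = hi - lo, (hi + lo) / 2
    candidates = list(range(hi - 2, lo + 1))          # theta values consistent with (r, m)
    print(f"x = ({x1},{x2})  R = {r}  M = {m}  candidates: {candidates}")
# every line with R = 2 has exactly one candidate, namely m - 1
```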

For many important situations, however, our intuition that a minimal sufficient statistic is independent of any ancillary statistic is correct. A description of situations in which this occurs relies on the next definition.

Definition 6.2.4 Let f (t|θ) be a family of pdfs or pmfs for a statistic T (X). The family of probability distributions is called complete if Eθg(T ) = 0 for all θ implies Pθ(g(T ) = 0) = 1 for all θ. Equivalently, T (X) is called a complete statistic.


Example 6.2.13 (Binomial complete sufficient statistic) Suppose that T has a binomial (n, p) distribution, 0 < p < 1. Let g be a function such that Epg(T ) = 0. Then

$$
0 = E_p g(T) = \sum_{t=0}^n g(t)\binom{n}{t}p^t(1-p)^{n-t}
= (1-p)^n \sum_{t=0}^n g(t)\binom{n}{t}\Big(\frac{p}{1-p}\Big)^{t}
$$

for all p, 0 < p < 1. Let r = p/(1 − p), 0 < r < ∞. The last expression is a polynomial of degree n in r, where the coefficient of r^t is g(t)\binom{n}{t}. For the polynomial to be 0 for all r, each coefficient must be 0. Thus, g(t) = 0 for t = 0, 1, . . . , n. Since T takes on the values 0, 1, . . . , n with probability 1, this yields that P_p(g(T) = 0) = 1 for all p. Hence, T is a complete statistic.

Example 6.2.14 Let X1, . . . , Xn be iid uniform(0, θ) observations, 0 < θ < ∞. Show that T(X) = max_i X_i is a complete statistic.

Using an argument similar to that used in one of the previous examples, we can see that T(X) = max_i X_i is a sufficient statistic and the pdf of T(X) is

$$
f(t\mid\theta) = \begin{cases} n t^{n-1}\theta^{-n}, & 0 < t < \theta,\\ 0, & \text{otherwise}.\end{cases}
$$

Suppose g(t) is a function satisfying Eθg(T) = 0 for all θ. Since Eθg(T) is constant as a function of θ, its derivative with respect to θ is 0. Thus we have that

$$
0 = \frac{d}{d\theta} E_\theta g(T) = \frac{d}{d\theta}\int_0^\theta g(t)\,n t^{n-1}\theta^{-n}\,dt
= \theta^{-n}\frac{d}{d\theta}\int_0^\theta n g(t)t^{n-1}\,dt + \Big(\frac{d}{d\theta}\theta^{-n}\Big)\int_0^\theta n g(t)t^{n-1}\,dt
= \theta^{-n} n g(\theta)\theta^{n-1} + 0
= \theta^{-1} n g(\theta),
$$

where the second term vanishes because ∫_0^θ n g(t)t^{n−1} dt = θ^n Eθg(T) = 0.

Since nθ^{−1} ≠ 0, it must be that g(θ) = 0. This is true for every θ > 0; hence, T is a complete statistic.

Theorem 6.2.5 (Basu’s Theorem) If T (X) is a complete and minimal sufficient statistic, then T (X) is independent of every ancillary statistic.

Proof: We give the proof only for discrete distributions. Let S(X) be any ancillary statistic.

Then P (S(X) = s) does not depend on θ since S(X) is ancillary. Also the conditional probability,

P (S(X) = s|T (X) = t) = P (X ∈ {x : S(x) = s}|T (X) = t),


does not depend on θ because T (X) is a sufficient statistic (recall the definition!). Thus, to show that S(X) and T (X) are independent, it suffices to show that

P (S(X) = s|T (X) = t) = P (S(X) = s) (6.2)

for all possible values t ∈ T. Now,

$$
P(S(X) = s) = \sum_{t\in\mathcal{T}} P(S(X) = s \mid T(X) = t)\,P_\theta(T(X) = t).
$$

Furthermore, since Σ_{t∈T} Pθ(T(X) = t) = 1, we can write

$$
P(S(X) = s) = \sum_{t\in\mathcal{T}} P(S(X) = s)\,P_\theta(T(X) = t).
$$

Therefore, if we define the statistic

$$
g(t) = P(S(X) = s \mid T(X) = t) - P(S(X) = s),
$$

the above two equations show that

$$
E_\theta g(T) = \sum_{t\in\mathcal{T}} g(t)\,P_\theta(T(X) = t) = 0
$$

for all θ. Since T(X) is a complete statistic, this implies that g(t) = 0 for all possible values t ∈ T. Hence, (6.2) is verified. □

It should be noted that the "minimality" of the sufficient statistic was not used in the proof of Basu's theorem. Indeed, the theorem is true with this word omitted, because a fundamental property of a complete sufficient statistic is that it is also minimal sufficient.

Theorem 6.2.6 (Complete statistics in the exponential family) Let X1, . . . , Xn be iid observations from an exponential family with pdf or pmf of the form

$$
f(x\mid\theta) = h(x)\,c(\theta)\exp\Big(\sum_{j=1}^k w_j(\theta)t_j(x)\Big),
$$

where θ = (θ1, θ2, . . . , θk). Then the statistic

$$
T(X) = \Big(\sum_{i=1}^n t_1(X_i),\ \sum_{i=1}^n t_2(X_i),\ \dots,\ \sum_{i=1}^n t_k(X_i)\Big)
$$

is complete if {(w1(θ), . . . , wk(θ)) : θ ∈ Θ} contains an open set in R^k.


Example 6.2.15 (Using Basu’s theorem) Let X1, . . . , Xn be iid exponential observations with parameter θ. Consider computing the expected value of

$$
g(X) = \frac{X_n}{X_1 + \cdots + X_n}.
$$

We first note that the exponential distributions form a scale parameter family and thus g(X) is an ancillary statistic. The exponential distributions also form an exponential family with t(x) = x and so, by Theorem 6.2.6,

$$
T(X) = \sum_{i=1}^n X_i
$$

is a complete statistic and, by Theorem 6.2.3, T(X) is a sufficient statistic. Hence, by Basu's theorem, T(X) and g(X) are independent. Thus we have

$$
\theta = E_\theta X_n = E_\theta[T(X)g(X)] = (E_\theta T(X))(E_\theta g(X)) = n\theta\,E_\theta g(X).
$$

Hence, for any θ, Eθg(X) = 1/n.
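A simulation sketch (assumed, not part of the text) of this calculation: independence of T and g shows up as a near-zero correlation, and the sample mean of g is close to 1/n.

```python
# Hypothetical simulation for Example 6.2.15 with iid exponential(theta) data:
# T = sum(X_i) and g = X_n / sum(X_i) are independent (Basu), and E[g] = 1/n.
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 2.5, 4, 200_000
x = rng.exponential(scale=theta, size=(reps, n))   # scale parameterization: E[X] = theta
T = x.sum(axis=1)
g = x[:, -1] / T
print("corr(T, g) ~", round(np.corrcoef(T, g)[0, 1], 4))   # near 0
print("mean of g  ~", round(g.mean(), 4), "   1/n =", 1 / n)
```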

6.3 The Likelihood Principle

6.3.1 The Likelihood Function

Definition 6.3.1 Let f (x|θ) denote the joint pdf or pmf of the sample X = (X1, . . . , Xn). Then, given that X = x is observed, the function of θ defined by

L(θ|x) = f (x|θ) is called the likelihood function.

This definition almost seems to be defining the likelihood function to be the same as the pdf or pmf. The only distinction between these two functions is which variable is considered fixed and which is varying. When we consider the pdf or pmf f (x|θ), we are considering θ as fixed and x as the variable; when we consider the likelihood function L(θ|x), we are considering x to be the observed sample point and θ to be varying over all possible parameter values. If we compare the likelihood function at two parameter points and find that

L(θ1|x) > L(θ2|x),


then the sample we actually observed is more likely to have occurred if θ = θ1 than if θ = θ2, which can be interpreted as saying that θ1 is a more plausible value for the true value of θ than is θ2. We carefully use the word “plausible” rather than “probable” because we often think of θ as a fixed value. Furthermore, although f (x|θ), as a function of x, is a pdf, there is no guarantee that L(θ|x), as a function of θ, is a pdf.
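As a concrete illustration (a minimal sketch with assumed numbers, not from the text), the likelihood of a binomial observation can be evaluated at two parameter points and compared directly.

```python
# Hypothetical example: likelihood of theta given x = 7 successes in n = 10 trials.
from math import comb

def L(theta, x, n=10):
    # likelihood function L(theta|x) = f(x|theta) for a binomial(n, theta) observation
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

x = 7
print(L(0.7, x), L(0.3, x))   # L(0.7|x) > L(0.3|x): 0.7 is the more plausible value here
```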

LIKELIHOOD PRINCIPLE: If x and y are two sample points such that L(θ|x) is proportional to L(θ|y), that is, there exists a constant C(x, y) such that

L(θ|x) = C(x, y)L(θ|y)

for all θ, then the conclusions drawn from x and y should be identical.

Note that C(x, y) may be different for different (x, y) pairs but C(x, y) does not depend on θ.

6.4 The Equivariance Principle

The first type of equivariance might be called measurement equivariance. It prescribes that the inference made should not depend on the measurement scale that is used.

The second type of equivariance, actually an invariance, might be called formal invariance. It states that if two inference problems have the same formal structure in terms of the mathematical model used, then the same inference procedure should be used in both problems. The elements of the model that must be the same are: Θ, the parameter space; {f (x|θ) : θ ∈ Θ}, the set of pdfs or pmfs for the sample; and the set of allowable inferences and consequences of wrong inferences.

For example, Θ may be Θ = {θ : θ > 0} in two problems. But in one problem θ may be the average price of a dozen eggs and in another problem θ may refer to the average height of giraffes.

Yet, formal invariance equates these two parameter spaces since they both refer to the same set of real numbers.

Equivariance Principle: If Y = g(X) is a change of measurement scale such that the model for Y has the same formal structure as the model for X, then an inference procedure should be both measurement equivariant and formally equivariant.

Example 6.4.1 (Binomial equivariance) Let X ∼ binomial(n, p) with n known and p unknown. Let T(x) be the estimate of p that is used when X = x is observed. Rather than using the number of successes, X, to make an inference about p, we could use the number of failures, Y = n − X. We


can see that Y ∼ binomial(n, q = 1 − p). Let T*(y) be the estimate of q that is used when Y = y is observed, so that 1 − T*(y) is the estimate of p when Y = y is observed. If x successes are observed, then the estimate of p is T(x). But if there are x successes, then there are n − x failures and 1 − T*(n − x) is also an estimate of p. Measurement equivariance requires that these two estimates be equal, that is, T(x) = 1 − T*(n − x), since the change from X to Y is just a change in measurement scale. Furthermore, the formal structures of the inference problems based on X and Y are the same. X and Y both have binomial(n, θ) distributions, 0 ≤ θ ≤ 1. So formal invariance requires that T(z) = T*(z) for all z = 0, . . . , n. Thus, measurement and formal invariance together require that

T(x) = 1 − T*(n − x) = 1 − T(n − x).

If we consider only estimators satisfying the above equality, then we have greatly reduced and simplified the set of estimators we are willing to consider. Whereas the specification of an arbitrary estimator requires the specification of T(0), . . . , T(n), the specification of an estimator satisfying the above equality requires the specification only of T(0), . . . , T([n/2]), where [n/2] is the greatest integer not larger than n/2. This is the type of data reduction that is always achieved by the equivariance principle. The inference to be made for some sample points determines the inference to be made for other sample points.
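As a small check (an assumed illustration, not from the text), the usual estimator T(x) = x/n satisfies the equivariance constraint, so specifying it on x = 0, . . . , [n/2] determines it everywhere.

```python
# Hypothetical check for Example 6.4.1: T(x) = x / n satisfies T(x) = 1 - T(n - x).
n = 10
T = lambda x: x / n
assert all(abs(T(x) - (1 - T(n - x))) < 1e-12 for x in range(n + 1))
print("T(x) = x/n satisfies the equivariance constraint for all x = 0, ...,", n)
```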
