
Chapter 2. Order Statistics

1 The Order Statistics

For a sample of independent observations X1, X2, . . . , Xn on a distribution F, the ordered sample values

X(1) ≤ X(2) ≤ · · · ≤ X(n),

or, in more explicit notation,

X(1:n) ≤ X(2:n) ≤ · · · ≤ X(n:n),

are called the order statistics. If F is continuous, then with probability 1 the order statistics of the sample take distinct values (and conversely).

There is an alternative way to visualize order statistics that, although it does not necessarily yield simple expressions for the joint density, does allow simple derivation of many important properties of order statistics. It can be called the quantile function representation.

The quantile function (or inverse distribution function, if you wish) is defined by

F−1(y) = inf{x : F (x) ≥ y}. (1)

Now it is well known that if U is a Uniform(0,1) random variable, then F−1(U) has distribution function F. Moreover, if we envision U1, . . . , Un as being iid Uniform(0,1) random variables and X1, . . . , Xn as being iid random variables with common distribution F, then

$$(X_{(1)}, \ldots, X_{(n)}) \stackrel{d}{=} (F^{-1}(U_{(1)}), \ldots, F^{-1}(U_{(n)})), \qquad (2)$$

where $\stackrel{d}{=}$ is to be read as "has the same distribution as."
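As a quick illustration of this representation, here is a minimal NumPy sketch, taking F to be the standard exponential so that F−1 has the closed form −log(1 − y); the seed and sample size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Quantile function of the standard exponential, F(x) = 1 - exp(-x):
# F^{-1}(y) = -log(1 - y).
def F_inv(y):
    return -np.log1p(-y)

u = rng.uniform(size=n)        # U_1, ..., U_n iid Uniform(0,1)
x = F_inv(u)                   # X_i = F^{-1}(U_i) has cdf F

# F^{-1} is nondecreasing, so transforming the sorted uniforms gives
# exactly the order statistics of the transformed sample, as in (2).
print(np.sort(x))
print(F_inv(np.sort(u)))       # same values in the same order
```

Since F−1 is nondecreasing, it maps sorted uniforms to sorted values, which is precisely the content of (2).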

1.1 The Quantiles and Sample Quantiles

Let F be a distribution function (continuous from the right, as usual). The right-continuity of F can be obtained from the following fact:

F(x + hn) − F(x) = P(x < X ≤ x + hn),

where {hn} is a sequence of real numbers such that 0 < hn ↓ 0 as n → ∞. It follows from the continuity property of probability (P(limn An) = limn P(An) if lim An exists) that

$$\lim_{n\to\infty} [F(x + h_n) - F(x)] = 0,$$

and hence that F is right-continuous. Let D be the set of all discontinuity points of F and n be a positive integer. Set

$$D_n = \left\{ x \in D : P(X = x) \ge \frac{1}{n} \right\}.$$


Since F(∞) − F(−∞) = 1, the number of elements in Dn cannot exceed n. Clearly D = ∪n Dn, and it follows that D is countable. In other words, the set of discontinuity points of a distribution function F is countable. We then conclude that every distribution function F admits the decomposition

F(x) = αFd(x) + (1 − α)Fc(x), (0 ≤ α ≤ 1),

where Fd and Fc are both distribution functions, Fd is a step function, and Fc is continuous. Moreover, the above decomposition is unique.

Let λ denote the Lebesgue measure on B, the σ-field of Borel sets in R. It follows from the Lebesgue decomposition theorem that we can write Fc(x) = βFs(x) + (1 − β)Fac(x), where 0 ≤ β ≤ 1, Fs is singular with respect to λ, and Fac is absolutely continuous with respect to λ. On the other hand, the Radon-Nikodym theorem implies that there exists a nonnegative Borel-measurable function f on R such that

$$F_{ac}(x) = \int_{-\infty}^{x} f \, d\lambda,$$

where f is called the Radon-Nikodym derivative. This says that every distribution function F admits a unique decomposition

$$F(x) = \alpha_1 F_d(x) + \alpha_2 F_s(x) + \alpha_3 F_{ac}(x), \qquad x \in \mathbb{R},$$

where $\alpha_i \ge 0$ and $\sum_{i=1}^{3} \alpha_i = 1$.

For 0 < p < 1, the pth quantile or fractile of F is defined as

$$\xi_p = F^{-1}(p) = \inf\{x : F(x) \ge p\}.$$

This definition is motivated by the following observation:

• If F is continuous and strictly increasing, F−1 is defined by F−1(y) = x when y = F (x).

• If F has a discontinuity at x0, suppose that F (x0−) < y < F (x0) = F (x0+). In this case, although there exists no x for which y = F (x), F−1(y) is defined to be equal to x0.

• Now consider the case that F is not strictly increasing. Suppose that

$$F(x) \begin{cases} < y & \text{for } x < a, \\ = y & \text{for } a \le x \le b, \\ > y & \text{for } x > b. \end{cases}$$

Then any value a ≤ x ≤ b could be chosen for x = F−1(y). The convention in this case is to define F−1(y) = a.


Now we prove that if U is uniformly distributed over the interval (0, 1), then $X = F_X^{-1}(U)$ has cumulative distribution function $F_X(x)$. The proof is straightforward:

$$P(X \le x) = P[F_X^{-1}(U) \le x] = P[U \le F_X(x)] = F_X(x).$$

Note that discontinuities of F become converted into flat stretches of F−1 and flat stretches of F into discontinuities of F−1.

In particular, ξ1/2 = F−1(1/2) is called the median of F. Note that ξp satisfies F(ξp−) ≤ p ≤ F(ξp).

The function F−1(t), 0 < t < 1, is called the inverse function of F . The following proposition, giving useful properties of F and F−1, is easily checked.

Lemma 1 Let F be a distribution function. The function F−1(t), 0 < t < 1, is nondecreasing and left-continuous, and satisfies

(i) F−1(F (x)) ≤ x, −∞ < x < ∞, (ii) F (F−1(t)) ≥ t, 0 < t < 1.

Hence

(iii) F (x) ≥ t if and only if x ≥ F−1(t).

Corresponding to a sample {X1, X2, . . . , Xn} of observations on F, the sample pth quantile is defined as the pth quantile of the sample distribution function Fn, that is, as $F_n^{-1}(p)$. Regarding the sample pth quantile as an estimator of ξp, we denote it by $\hat{\xi}_{pn}$, or simply by $\hat{\xi}_p$ when convenient.

Since the set of order statistics is equivalent to the sample distribution function Fn, its role is fundamental even if not always explicit. Thus, for example, the sample mean may be regarded as the mean of the order statistics, and the sample pth quantile may be expressed as

$$\hat{\xi}_{pn} = \begin{cases} X_{(np:n)} & \text{if } np \text{ is an integer,} \\ X_{([np]+1:n)} & \text{if } np \text{ is not an integer.} \end{cases}$$
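A direct transcription of this rule into code (a sketch; 1-based ranks are converted to 0-based indices, and exact integrality of np is assumed to be testable in floating point):

```python
import numpy as np

def sample_quantile(x, p):
    # X_(np:n) if np is an integer, X_([np]+1:n) otherwise (1-based ranks).
    n = len(x)
    xs = np.sort(x)
    k = int(n * p) if (n * p) == int(n * p) else int(n * p) + 1
    return xs[k - 1]            # convert the 1-based rank to a 0-based index

rng = np.random.default_rng(1)
data = rng.normal(size=101)
print(sample_quantile(data, 0.5))   # sample median, near 0 for N(0,1) data
```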

1.2 Functions of Order Statistics

Here we consider statistics which may be expressed as functions of order statistics. A variety of short-cut procedures for quick estimates of location or scale parameters, or for quick tests of related hypotheses, are provided in the form of linear functions of order statistics, that is, statistics of the form

$$\sum_{i=1}^{n} c_{ni} X_{(i:n)}.$$


We term such statistics "L-estimates." For example, the sample range X(n:n) − X(1:n) belongs to this class. Another example is given by the α-trimmed mean

$$\frac{1}{n - 2[n\alpha]} \sum_{i=[n\alpha]+1}^{n-[n\alpha]} X_{(i:n)},$$

which is a popular competitor of $\bar{X}$ for robust estimation of location. The asymptotic distribution theory of L-statistics takes quite different forms, depending on the character of the coefficients {cni}.

The representations of $\bar{X}$ and $\hat{\xi}_{pn}$ in terms of order statistics are a bit artificial. On the other hand, for many useful statistics, the most natural and efficient representations are in terms of order statistics. Examples are the extreme values X1:n and Xn:n and the sample range Xn:n − X1:n.

1.3 General Properties

Theorem 1 (1) $P(X_{(k)} \le x) = \sum_{i=k}^{n} C(n,i)[F(x)]^i [1-F(x)]^{n-i}$ for −∞ < x < ∞.

(2) The density of X(k) is given by $nC(n-1,k-1) F^{k-1}(x) [1-F(x)]^{n-k} f(x)$.

(3) The joint density of X(k1) and X(k2) is given by

$$\frac{n!}{(k_1-1)!(k_2-k_1-1)!(n-k_2)!}\,[F(x_{(k_1)})]^{k_1-1}\,[F(x_{(k_2)}) - F(x_{(k_1)})]^{k_2-k_1-1}\,[1 - F(x_{(k_2)})]^{n-k_2}\,f(x_{(k_1)})\,f(x_{(k_2)})$$

for k1 < k2 and x(k1) < x(k2).

(4) The joint pdf of all the order statistics is $n!\,f(z_1) f(z_2) \cdots f(z_n)$ for −∞ < z1 < · · · < zn < ∞.

(5) Define V = F(X). Then V is uniformly distributed over (0, 1).

Proof. (1) The event {X(k) ≤ x} occurs if and only if at least k out of X1, X2, . . . , Xn are less than or equal to x.

(2) The formula for the density of X(k) can be shown from the fact that

$$\frac{d}{dp} \sum_{i=k}^{n} C(n,i) p^i (1-p)^{n-i} = nC(n-1,k-1) p^{k-1} (1-p)^{n-k}.$$

Heuristically, the k − 1 smallest observations are ≤ x, the n − k largest are > x, and the probability that X(k) falls into a small interval of length dx about x is f(x) dx.


1.4 Conditional Distribution of Order Statistics

In the following two theorems, we relate the conditional distribution of order statistics (conditioned on another order statistic) to the distribution of order statistics from a population whose distribution is a truncated form of the original population distribution function F(x).

Theorem 2 Let X1, X2, . . . , Xn be a random sample from an absolutely continuous population with cdf F(x) and density function f(x), and let X(1:n) ≤ X(2:n) ≤ · · · ≤ X(n:n) denote the order statistics obtained from this sample. Then the conditional distribution of X(j:n), given that X(i:n) = xi for i < j, is the same as the distribution of the (j − i)th order statistic obtained from a sample of size n − i from a population whose distribution is simply F(x) truncated on the left at xi.

Proof. From the marginal density function of X(i:n) and the joint density function of X(i:n) and X(j:n), we have the conditional density function of X(j:n), given that X(i:n) = xi, as

$$f_{X_{(j:n)}}(x_j \mid X_{(i:n)} = x_i) = \frac{f_{X_{(i:n)},X_{(j:n)}}(x_i, x_j)}{f_{X_{(i:n)}}(x_i)} = \frac{(n-i)!}{(j-i-1)!(n-j)!} \left[ \frac{F(x_j) - F(x_i)}{1 - F(x_i)} \right]^{j-i-1} \left[ \frac{1 - F(x_j)}{1 - F(x_i)} \right]^{n-j} \frac{f(x_j)}{1 - F(x_i)}.$$

Here i < j ≤ n and xi ≤ xj < ∞. The result follows easily by realizing that {F(xj) − F(xi)}/{1 − F(xi)} and f(xj)/{1 − F(xi)} are the cdf and density function of the population whose distribution is obtained by truncating the distribution F(x) on the left at xi.

Theorem 3 Let X1, X2, . . . , Xn be a random sample from an absolutely continuous population with cdf F(x) and density function f(x), and let X(1:n) ≤ X(2:n) ≤ · · · ≤ X(n:n) denote the order statistics obtained from this sample. Then the conditional distribution of X(i:n), given that X(j:n) = xj for j > i, is the same as the distribution of the ith order statistic in a sample of size j − 1 from a population whose distribution is simply F(x) truncated on the right at xj.

Proof. From the marginal density function of X(i:n) and the joint density function of X(i:n) and X(j:n), we have the conditional density function of X(i:n), given that X(j:n) = xj, as

$$f_{X_{(i:n)}}(x_i \mid X_{(j:n)} = x_j) = \frac{f_{X_{(i:n)},X_{(j:n)}}(x_i, x_j)}{f_{X_{(j:n)}}(x_j)} = \frac{(j-1)!}{(i-1)!(j-i-1)!} \left[ \frac{F(x_i)}{F(x_j)} \right]^{i-1} \left[ \frac{F(x_j) - F(x_i)}{F(x_j)} \right]^{j-i-1} \frac{f(x_i)}{F(x_j)}.$$

Here 1 ≤ i < j and −∞ < xi ≤ xj. The proof is completed by noting that F(xi)/F(xj) and f(xi)/F(xj) are the cdf and density function of the population whose distribution is obtained by truncating the distribution F(x) on the right at xj.

1.5 Computer Simulation of Order Statistics

In this section, we will discuss some methods of simulating order statistics from a distribution F(x). First of all, it should be mentioned that a straightforward way of simulating order statistics is to generate a pseudorandom sample from the distribution F(x) and then sort the sample through an efficient algorithm like quick-sort. This general method (being time-consuming and expensive) may be avoided in many instances by making use of some of the distributional properties to be established now.

For example, suppose we wish to generate the complete sample (x(1), . . . , x(n)) or even a Type II censored sample (x(1), . . . , x(r)) from the standard exponential distribution. This may be done simply by generating a pseudorandom sample y1, . . . , yr from the standard exponential distribution first, and then setting

$$x_{(i)} = \sum_{j=1}^{i} \frac{y_j}{n - j + 1}, \qquad i = 1, 2, \ldots, r.$$

The reason is as follows:

Theorem 4 Let X(1) ≤ X(2) ≤ · · · ≤ X(n) be the order statistics from the standard exponential distribution. Then, the random variables Z1, Z2, . . . , Zn, where

Zi = (n − i + 1)(X(i) − X(i−1)), i = 1, . . . , n,

with X(0) ≡ 0, are statistically independent and also have standard exponential distributions.

Proof. Note that the joint density function of X(1), X(2), . . . , X(n) is

$$f_{1,2,\ldots,n:n}(x_1, x_2, \ldots, x_n) = n! \exp\left( -\sum_{i=1}^{n} x_i \right), \qquad 0 \le x_1 < x_2 < \cdots < x_n < \infty.$$

Now let us consider the transformation

$$Z_1 = nX_{(1)}, \quad Z_2 = (n-1)(X_{(2)} - X_{(1)}), \quad \ldots, \quad Z_n = X_{(n)} - X_{(n-1)},$$

or the equivalent transformation

$$X_{(1)} = \frac{Z_1}{n}, \quad X_{(2)} = \frac{Z_1}{n} + \frac{Z_2}{n-1}, \quad \ldots, \quad X_{(n)} = \frac{Z_1}{n} + \frac{Z_2}{n-1} + \cdots + Z_n.$$

After noting that the Jacobian of this transformation is 1/n! and that $\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} z_i$, we immediately obtain the joint density function of Z1, Z2, . . . , Zn to be

$$f_{Z_1,Z_2,\ldots,Z_n}(z_1, z_2, \ldots, z_n) = \exp\left( -\sum_{i=1}^{n} z_i \right), \qquad 0 \le z_1, \ldots, z_n < \infty.$$
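This gives a sorting-free way to simulate a Type II censored exponential sample; the following sketch (with arbitrary n and r) implements the display for x(i) above and compares it with brute-force sorting:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 10, 6                   # put n items on test, observe the first r failures

# x_(i) = sum_{j=1}^{i} y_j / (n - j + 1) with y_j iid standard exponential.
y = rng.exponential(size=r)
x = np.cumsum(y / (n - np.arange(1, r + 1) + 1))
print(x)                       # one draw of (X_(1:n), ..., X_(r:n))

# Brute-force comparison: sort a full sample and keep the smallest r.
print(np.sort(rng.exponential(size=n))[:r])   # same law, not the same numbers
```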


If we wish to generate order statistics from the Uniform(0, 1) distribution, we may use the following two theorems and avoid sorting once again. For example, if we only need the ith order statistic u(i), it may simply be generated as a pseudorandom observation from a Beta(i, n − i + 1) distribution.

Theorem 5 For the Uniform(0, 1) distribution, the random variables V1 = U(i)/U(j) and V2 = U(j), 1 ≤ i < j ≤ n, are statistically independent, with V1 and V2 having Beta(i, j − i) and Beta(j, n − j + 1) distributions, respectively.

Proof. From Theorem 1(3), we have the joint density function of U(i) and U(j) (1 ≤ i < j ≤ n) to be

$$f_{i,j:n}(u_i, u_j) = \frac{n!}{(i-1)!(j-i-1)!(n-j)!}\, u_i^{i-1} (u_j - u_i)^{j-i-1} (1-u_j)^{n-j}, \qquad 0 < u_i < u_j < 1.$$

Now upon making the transformation V1 = U(i)/U(j) and V2 = U(j) and noting that the Jacobian of this transformation is v2, we derive the joint density function of V1 and V2 to be

$$f_{V_1,V_2}(v_1, v_2) = \frac{(j-1)!}{(i-1)!(j-i-1)!}\, v_1^{i-1} (1-v_1)^{j-i-1} \times \frac{n!}{(j-1)!(n-j)!}\, v_2^{j-1} (1-v_2)^{n-j},$$

0 < v1 < 1, 0 < v2 < 1. From the above equation it is clear that the random variables V1 and V2 are statistically independent, and also that they are distributed as Beta(i, j − i) and Beta(j, n − j + 1), respectively.

Theorem 6 For the Uniform(0, 1) distribution, the random variables

$$V_1 = \frac{U_{(1)}}{U_{(2)}}, \quad V_2 = \left( \frac{U_{(2)}}{U_{(3)}} \right)^2, \quad \ldots, \quad V_{n-1} = \left( \frac{U_{(n-1)}}{U_{(n)}} \right)^{n-1}, \quad V_n = U_{(n)}^n$$

are all independent Uniform(0, 1) random variables.

Proof. Let X(1) < X(2) < · · · < X(n) denote the order statistics from the standard exponential distribution. Then upon making use of the facts that X = − log U has a standard exponential distribution and that − log u is a monotonically decreasing function in u, we immediately have $X_{(i)} \stackrel{d}{=} -\log U_{(n-i+1)}$. This yields

$$V_i = \left( \frac{U_{(i)}}{U_{(i+1)}} \right)^i \stackrel{d}{=} \left( \frac{e^{-X_{(n-i+1)}}}{e^{-X_{(n-i)}}} \right)^i = \exp[-i(X_{(n-i+1)} - X_{(n-i)})] \stackrel{d}{=} \exp(-Y_{n-i+1})$$

upon using the above theorem, where the Yi are independent standard exponential random variables; and exp(−Y) is Uniform(0, 1) when Y is standard exponential.

The just-described methods of simulating uniform order statistics may also be used easily to generate order statistics from any known distribution F(x) for which F−1(·) is relatively easy to compute. We may simply obtain the order statistics x(1), . . . , x(n) from the required distribution F(·) by setting x(i) = F−1(u(i)).
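For example, Theorem 6 can be run in reverse to generate the whole ordered uniform sample without sorting, and then transformed through any convenient F−1; a sketch (the standard exponential is chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8

# Theorem 6 in reverse: with V_1, ..., V_n iid Uniform(0,1), set
# U_(n) = V_n^(1/n) and U_(i) = U_(i+1) * V_i^(1/i) for i = n-1, ..., 1.
v = rng.uniform(size=n)
u = np.empty(n)
u[-1] = v[-1] ** (1.0 / n)
for i in range(n - 2, -1, -1):         # 0-based index i corresponds to rank i+1
    u[i] = u[i + 1] * v[i] ** (1.0 / (i + 1))

# u is already ordered: U_(1) <= ... <= U_(n); no sorting was needed.
def F_inv(y):                          # standard exponential quantile function
    return -np.log1p(-y)

print(F_inv(u))                        # order statistics from F
```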


2 Large Sample Properties of Sample Quantile

2.1 An Elementary Proof

Consider the sample pth quantile, $\hat{\xi}_{pn}$, which is X([np]) or X([np]+1) depending on whether np is an integer (here [np] denotes the integer part of np). For simplicity, we discuss the properties of X([np]), where p ∈ (0, 1) and n is large. This will in turn inform us of the properties of $\hat{\xi}_{pn}$.

We first consider the case that X is uniformly distributed over [0, 1]. Let U([np]) denote the sample pth quantile. If i = [np], we have

$$nC(n-1, i-1) = \frac{n!}{(i-1)!(n-i)!} = \frac{\Gamma(n+1)}{\Gamma(i)\Gamma(n-i+1)} = \frac{1}{B(i, n-i+1)}.$$

Elementary computations beginning with Theorem 1(2) yield U([np]) ∼ Beta(i0, n − i0 + 1), where i0 = [np]. Note that

$$E\, U_{([np])} = \frac{[np]}{n+1} \to p, \qquad n \operatorname{Cov}\left( U_{([np_1])}, U_{([np_2])} \right) = \frac{n[np_1](n+1-[np_2])}{(n+1)^2(n+2)} \to p_1(1-p_2), \quad p_1 \le p_2.$$

Using these facts and Chebyshev's inequality, we can show easily that $U_{([np])} \stackrel{P}{\to} p$ at rate $n^{-1/2}$. This raises the question of whether we can claim that

$$\hat{\xi}_{pn} \stackrel{P}{\to} \xi_p.$$

Recall that U = F (X). If F is absolutely continuous with finite positive density f at ξp, it is expected that the above claim holds.

Recall that $U_{([np])} \stackrel{P}{\to} p$ at rate $n^{-1/2}$. The next question is: what is the distribution of $\sqrt{n}(U_{([np])} - p)$? Note that U([np]) is a Beta([np], n − [np] + 1) random variable. Thus, it can be expressed as

$$U_{([np])} = \frac{\sum_{i=1}^{i_0} V_i}{\sum_{i=1}^{i_0} V_i + \sum_{i=i_0+1}^{n+1} V_i},$$

where the Vi's are iid Exp(1) random variables. Observe that

$$\sqrt{n}\left( \frac{\sum_{i=1}^{[np]} V_i}{\sum_{i=1}^{[np]} V_i + \sum_{i=[np]+1}^{n+1} V_i} - p \right) = \frac{ \frac{1}{\sqrt{n}}\left\{ (1-p)\left( \sum_{i=1}^{[np]} V_i - [np] \right) - p\left( \sum_{i=[np]+1}^{n+1} V_i - (n-[np]+1) \right) + \left[ (1-p)[np] - p(n-[np]+1) \right] \right\} }{ \sum_{i=1}^{n+1} V_i / n },$$

and $(\sqrt{n})^{-1}\{(1-p)[np] - p(n-[np]+1)\} \to 0$. Since E(Vi) = 1 and Var(Vi) = 1, from the central limit theorem it follows that

$$\frac{\sum_{i=1}^{i_0} V_i - i_0}{\sqrt{i_0}} \stackrel{d}{\to} N(0, 1)$$


and

$$\frac{\sum_{i=i_0+1}^{n+1} V_i - (n-i_0+1)}{\sqrt{n-i_0+1}} \stackrel{d}{\to} N(0, 1).$$

Consequently,

$$\frac{\sum_{i=1}^{i_0} V_i - i_0}{\sqrt{n}} \stackrel{d}{\to} N(0, p) \qquad \text{and} \qquad \frac{\sum_{i=i_0+1}^{n+1} V_i - (n-i_0+1)}{\sqrt{n}} \stackrel{d}{\to} N(0, 1-p).$$

Since $\sum_{i=1}^{i_0} V_i$ and $\sum_{i=i_0+1}^{n+1} V_i$ are independent for all n,

$$(1-p)\, \frac{\sum_{i=1}^{[np]} V_i - [np]}{\sqrt{n}} - p\, \frac{\sum_{i=[np]+1}^{n+1} V_i - (n-[np]+1)}{\sqrt{n}} \stackrel{d}{\to} N(0, p(1-p)).$$

Now, using the weak law of large numbers, we have

$$\frac{1}{n+1} \sum_{i=1}^{n+1} V_i \stackrel{P}{\to} 1.$$

Hence, by Slutsky's Theorem,

$$\sqrt{n}\left( U_{([np])} - p \right) \stackrel{d}{\to} N(0, p(1-p)).$$

For an arbitrary cdf F, we have X([np]) = F−1(U([np])). Upon expanding F−1(U([np])) in a Taylor series around the point E(U([np])) = [np]/(n + 1), we get

$$X_{([np])} \stackrel{d}{=} F^{-1}(p) + (U_{([np])} - p)\{ f(F^{-1}(D_n)) \}^{-1},$$

where the random variable Dn is between U([np]) and p. This can be rearranged as

$$\sqrt{n}\left( X_{([np])} - F^{-1}(p) \right) \stackrel{d}{=} \sqrt{n}\left( U_{([np])} - p \right) \{ f(F^{-1}(D_n)) \}^{-1}.$$

When f is continuous at F−1(p), it follows that $f(F^{-1}(D_n)) \stackrel{P}{\to} f(F^{-1}(p))$ as n → ∞. Applying the delta method then yields

$$\sqrt{n}\left( X_{([np])} - F^{-1}(p) \right) \stackrel{d}{\to} N\left( 0, \frac{p(1-p)}{[f(F^{-1}(p))]^2} \right).$$
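A quick Monte Carlo check of this limit (a sketch with arbitrary sample size and replication count; F = N(0, 1) and p = 1/2, for which the limiting variance p(1 − p)/f²(F−1(p)) equals π/2):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, reps = 2000, 0.5, 5000

# F = N(0,1): xi_p = 0 and f(xi_p) = 1/sqrt(2*pi), so the limiting
# variance of sqrt(n) * (X_([np]) - xi_p) is p(1-p)/f^2 = pi/2.
samples = rng.normal(size=(reps, n))
x_np = np.sort(samples, axis=1)[:, n // 2 - 1]   # X_([np]) for p = 1/2
z = np.sqrt(n) * x_np
print(z.var(), np.pi / 2)                        # empirical vs theoretical
```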

2.2 A Probability Inequality for $|\hat{\xi}_p - \xi_p|$

We shall use the following result of Hoeffding (1963) to show that $P(|\hat{\xi}_{pn} - \xi_p| > \varepsilon) \to 0$ exponentially fast.

Lemma 2 Let Y1, . . . , Yn be independent random variables satisfying P(a ≤ Yi ≤ b) = 1 for each i, where a < b. Then, for t > 0,

$$P\left( \sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} E(Y_i) \ge nt \right) \le \exp\left( -2nt^2/(b-a)^2 \right).$$


Remark. Suppose that Y1, Y2, . . . , Yn are independent and identically distributed random variables. Using the Berry-Esseen theorem, we have

$$P\left( \sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} E(Y_i) \ge nt \right) = 1 - \Phi\left( \frac{t\sqrt{n}}{\sqrt{\operatorname{Var}(Y_1)}} \right) + \text{Error},$$

where

$$|\text{Error}| \le \frac{c}{\sqrt{n}} \cdot \frac{E|Y_1 - E Y_1|^3}{[\operatorname{Var}(Y_1)]^{3/2}}.$$

Theorem 7 Let 0 < p < 1. Suppose that ξp is the unique solution x of F(x−) ≤ p ≤ F(x). Then, for every ε > 0,

$$P\left( |\hat{\xi}_{pn} - \xi_p| > \varepsilon \right) \le 2\exp\left( -2n\delta_\varepsilon^2 \right) \quad \text{for all } n,$$

where $\delta_\varepsilon = \min\{ F(\xi_p + \varepsilon) - p,\; p - F(\xi_p - \varepsilon) \}$.

Proof. Let ε > 0. Write

$$P\left( |\hat{\xi}_{pn} - \xi_p| > \varepsilon \right) = P\left( \hat{\xi}_{pn} > \xi_p + \varepsilon \right) + P\left( \hat{\xi}_{pn} < \xi_p - \varepsilon \right).$$

By Lemma 1(iii),

$$P\left( \hat{\xi}_{pn} > \xi_p + \varepsilon \right) = P\left( p > F_n(\xi_p + \varepsilon) \right) = P\left( \sum_{i=1}^{n} I(X_i > \xi_p + \varepsilon) > n(1-p) \right) = P\left( \sum_{i=1}^{n} V_i - \sum_{i=1}^{n} E(V_i) > n\delta_1 \right),$$

where Vi = I(Xi > ξp + ε) and δ1 = F(ξp + ε) − p. Likewise,

$$P\left( \hat{\xi}_{pn} < \xi_p - \varepsilon \right) = P\left( F_n(\xi_p - \varepsilon) \ge p \right) = P\left( \sum_{i=1}^{n} W_i - \sum_{i=1}^{n} E(W_i) \ge n\delta_2 \right),$$

where Wi = I(Xi < ξp − ε) and δ2 = p − F(ξp − ε). Therefore, utilizing Hoeffding's lemma, we have

$$P\left( \hat{\xi}_{pn} > \xi_p + \varepsilon \right) \le \exp\left( -2n\delta_1^2 \right) \qquad \text{and} \qquad P\left( \hat{\xi}_{pn} < \xi_p - \varepsilon \right) \le \exp\left( -2n\delta_2^2 \right).$$

Putting $\delta_\varepsilon = \min\{\delta_1, \delta_2\}$, the proof is completed.


2.3 Asymptotic Normality

Theorem 8 Let 0 < p < 1. Suppose that F possesses a density f in a neighborhood of ξp and f is positive and continuous at ξp. Then

$$\sqrt{n}\left( \hat{\xi}_{pn} - \xi_p \right) \stackrel{d}{\to} N\left( 0, \frac{p(1-p)}{[f(\xi_p)]^2} \right).$$

Proof. We only consider p = 1/2. Note that ξp is the unique median since f(ξp) > 0. First, we consider the case that n is odd (i.e., n = 2m − 1). Then

$$P\left( \sqrt{n}\left( X_{(m)} - F^{-1}(1/2) \right) \le t \right) = P\left( X_{(m)} \le F^{-1}(1/2) + t/\sqrt{n} \right).$$

Let Sn be the number of X's that exceed $F^{-1}(1/2) + t/\sqrt{n}$. Then

$$X_{(m)} \le F^{-1}(1/2) + \frac{t}{\sqrt{n}} \quad \text{if and only if} \quad S_n \le m - 1 = \frac{n-1}{2}.$$

That is, Sn is a $\mathrm{Bin}(n, 1 - F(F^{-1}(1/2) + tn^{-1/2}))$ random variable. Setting F−1(1/2) = 0, write $p_n = 1 - F(n^{-1/2}t)$. Note that

$$P\left( \sqrt{n}\left( X_{(m)} - F^{-1}(1/2) \right) \le t \right) = P\left( S_n \le \frac{n-1}{2} \right) = P\left( \frac{S_n - np_n}{\sqrt{np_n(1-p_n)}} \le \frac{\frac{1}{2}(n-1) - np_n}{\sqrt{np_n(1-p_n)}} \right).$$

Recall that pn depends on n. Write

$$\frac{S_n - np_n}{\sqrt{np_n(1-p_n)}} = \sum_{i=1}^{n} \frac{Y_i - p_n}{\sqrt{np_n(1-p_n)}} = \sum_{i=1}^{n} Y_{in},$$

where Yi ∼ Bin(1, pn); the summands Yin thus form a double array. Now utilize the Berry-Esseen Theorem to obtain

$$P\left( S_n \le \frac{n-1}{2} \right) - \Phi\left[ \frac{\frac{1}{2}(n-1) - np_n}{\sqrt{np_n(1-p_n)}} \right] \to 0$$

as n → ∞. Writing

$$\frac{\frac{1}{2}(n-1) - np_n}{\sqrt{np_n(1-p_n)}} \approx \frac{\sqrt{n}\left( \frac{1}{2} - p_n \right)}{1/2} = \frac{\sqrt{n}\left( F(n^{-1/2}t) - \frac{1}{2} \right)}{1/2} = 2t \cdot \frac{F(n^{-1/2}t) - F(0)}{n^{-1/2}t} \to 2t f(0),$$

we obtain

$$\Phi\left[ \frac{\frac{1}{2}(n-1) - np_n}{\sqrt{np_n(1-p_n)}} \right] \approx \Phi(2f(0) \cdot t),$$

or

$$\sqrt{n}\left( X_{(m)} - F^{-1}(1/2) \right) \stackrel{d}{\to} N\left( 0, \frac{1}{4f^2(F^{-1}(1/2))} \right).$$


When n is even (i.e., n = 2m), both $P[\sqrt{n}(X_{(m)} - F^{-1}(1/2)) \le t]$ and $P[\sqrt{n}(X_{(m+1)} - F^{-1}(1/2)) \le t]$ tend to $\Phi(2f(F^{-1}(1/2)) \cdot t)$.

We have just proved asymptotic normality of $\hat{\xi}_p$ in the case that F possesses a derivative at the point ξp. So far we have discussed in detail the asymptotic normality of a single quantile. This discussion extends in a natural manner to the asymptotic joint normality of a fixed number of quantiles. This is made precise in the following result.

Theorem 9 Let 0 < p1 < · · · < pk < 1. Suppose that F has a density f in neighborhoods of ξp1, . . . , ξpk and that f is positive and continuous at ξp1, . . . , ξpk. Then $(\hat{\xi}_{p_1}, \ldots, \hat{\xi}_{p_k})$ is asymptotically normal with mean vector (ξp1, . . . , ξpk) and covariances σij/n, where

$$\sigma_{ij} = \frac{p_i(1-p_j)}{f(\xi_{p_i}) f(\xi_{p_j})} \quad \text{for } 1 \le i \le j \le k, \qquad \sigma_{ij} = \sigma_{ji} \text{ for } i > j.$$

Suppose that we have a sample of size n from a normal distribution N(µ, σ²). Let mn denote the median of this sample. Then, because $f(\mu) = (\sqrt{2\pi}\sigma)^{-1}$,

$$\sqrt{n}(m_n - \mu) \stackrel{d}{\to} N\left( 0, \frac{1/4}{f^2(\mu)} \right) = N(0, \pi\sigma^2/2).$$

Compare mn with $\bar{X}_n$ as an estimator of µ. We conclude immediately that $\bar{X}_n$ is better than mn, since the latter has a much larger variance. Now consider the above problem again with the Cauchy distribution C(µ, σ) with density function

$$f(x) = \frac{1}{\pi\sigma} \cdot \frac{1}{1 + [(x-\mu)/\sigma]^2}.$$

What is your conclusion? (Exercise)
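A small simulation sketch contrasting the two estimators under both models (sample sizes are arbitrary); since f(µ) = 1/(πσ) for the Cauchy, the median is asymptotically N(µ, π²σ²/(4n)), while the sample mean of Cauchy data does not converge at all:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 501, 2000

for name, draw in [("normal", rng.standard_normal),
                   ("cauchy", rng.standard_cauchy)]:
    x = draw((reps, n))
    # Sampling variability of the two location estimators across replications:
    # for the Cauchy, the mean's "variance" blows up while the median's stays small.
    print(name, x.mean(axis=1).var(), np.median(x, axis=1).var())
```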

2.4 A Measure of Dispersion Based on Quantiles

The joint normality of a fixed number of central order statistics can be used to construct simultaneous confidence regions for two or more population quantiles. As an illustration, we now consider the semi-interquartile range, $R = \frac{1}{2}(\xi_{3/4} - \xi_{1/4})$. (Note that the parameter σ in C(µ, σ) is the semi-interquartile range.) It is used as an alternative to σ to measure the spread of the data. A natural estimate of R is $\hat{R}_n = \frac{1}{2}(\hat{\xi}_{3/4} - \hat{\xi}_{1/4})$. Theorem 9 gives the joint distribution of $\hat{\xi}_{1/4}$ and $\hat{\xi}_{3/4}$. We can use the following result, due to Cramér and Wold (1936), which reduces the convergence of multivariate distribution functions to the convergence of univariate distribution functions.

Theorem 10 In R^k, the random vectors Xn converge in distribution to the random vector X if and only if each linear combination of the components of Xn converges in distribution to the same linear combination of the components of X.


By Theorem 9 and the Cramér-Wold device, we have

$$\hat{R} \sim AN\left( R, \frac{1}{64n}\left( \frac{3}{[f(\xi_{1/4})]^2} - \frac{2}{f(\xi_{1/4}) f(\xi_{3/4})} + \frac{3}{[f(\xi_{3/4})]^2} \right) \right).$$

For F = N(µ, σ²), we have

$$\hat{R} \sim AN\left( 0.6745\sigma, \frac{(0.7867)^2 \sigma^2}{n} \right).$$

2.5 Confidence Intervals for Population Quantiles

Assume that F possesses a density f in a neighborhood of ξp and f is positive and continuous at ξp. For simplicity, consider p = 1/2. Then

$$\sqrt{n}\left( \hat{\xi}_{1/2} - \xi_{1/2} \right) \stackrel{d}{\to} N\left( 0, \frac{1}{4f^2(\xi_{1/2})} \right).$$

Therefore, we can derive a confidence interval for ξ1/2 if either f(ξ1/2) is known or a good estimator of f(ξ1/2) is available. A natural question to ask, then, is how to estimate f(ξ1/2).

Here we propose two estimates. The first estimate is

$$\hat{f}(\xi_{1/2}) = \frac{\#\{ i : X_i \in (\hat{\xi}_{1/2} - h_n, \hat{\xi}_{1/2} + h_n] \}}{2nh_n},$$

which is motivated by

$$\frac{F(\xi_{1/2} + h_n) - F(\xi_{1/2} - h_n)}{2h_n} \approx f(\xi_{1/2}).$$

The second is based on

$$\frac{X_{([n/2]+\ell)} - X_{([n/2]-\ell)}}{2\ell/n},$$

which estimates $1/f(\xi_{1/2})$ directly, where $\ell = O(n^d)$ for 0 < d < 1.
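Combining the asymptotic normality above with the second estimate gives an approximate confidence interval for the median; a sketch (the data, the exponent d = 0.6, and the 95% level are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(loc=3.0, size=400)
n = len(x)
xs = np.sort(x)

med = np.median(x)
ell = int(n ** 0.6)                 # l = O(n^d) with d = 0.6, illustrative
# (X_[n/2+l] - X_[n/2-l]) / (2l/n) estimates 1/f(xi_{1/2}).
inv_f_hat = (xs[n // 2 + ell] - xs[n // 2 - ell]) / (2 * ell / n)

# Asymptotic sd of the sample median is 1/(2 f(xi_{1/2}) sqrt(n)).
half = stats.norm.ppf(0.975) * inv_f_hat / (2 * np.sqrt(n))
print(med - half, med + half)       # approximate 95% CI for the median
```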

2.6 Distribution-Free Confidence Interval

When F is absolutely continuous, F(F−1(p)) = p and, hence, we have

$$P(X_{(i:n)} \le F^{-1}(p)) = P(F(X_{(i:n)}) \le p) = P(U_{(i:n)} \le p) = \sum_{r=i}^{n} C(n,r) p^r (1-p)^{n-r}. \qquad (3)$$

Now for i < j, consider

$$P(X_{(i:n)} \le F^{-1}(p)) = P(X_{(i:n)} \le F^{-1}(p), X_{(j:n)} < F^{-1}(p)) + P(X_{(i:n)} \le F^{-1}(p), X_{(j:n)} \ge F^{-1}(p)) = P(X_{(j:n)} < F^{-1}(p)) + P(X_{(i:n)} \le F^{-1}(p) \le X_{(j:n)}).$$

Since X(j:n) is absolutely continuous, this equation can be written as

$$P\left( X_{(i:n)} \le F^{-1}(p) \le X_{(j:n)} \right) = P\left( X_{(i:n)} \le F^{-1}(p) \right) - P\left( X_{(j:n)} \le F^{-1}(p) \right) = \sum_{r=i}^{j-1} C(n,r) p^r (1-p)^{n-r}, \qquad (4)$$


where the last equality follows from (3). Thus, we have a confidence interval [X(i:n), X(j:n)] for F−1(p) whose confidence coefficient α(i, j), given by (4), is free of F and can be read from a table of binomial probabilities. If p and the desired confidence level α0 are specified, we choose i and j so that α(i, j) exceeds α0. Because α(i, j) is a step function, the interval we obtain usually tends to be conservative. Further, the choice of i and j is not unique, and the choice which makes (j − i) small appears reasonable. For a given n and p, the binomial pmf C(n, r)p^r(1 − p)^{n−r} increases as r increases up to around [np], and then decreases. So if we want to make (j − i) small, we should start with i and j close to [np] and gradually increase (j − i) until α(i, j) exceeds α0.
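A sketch of this construction, using a brute-force search over pairs (i, j) rather than the outward expansion just described (the binomial cdf supplies α(i, j)):

```python
import numpy as np
from scipy import stats

def order_stat_ci(n, p, level):
    # Coverage of [X_(i), X_(j)] is sum_{r=i}^{j-1} C(n,r) p^r (1-p)^(n-r),
    # i.e. binom.cdf(j-1) - binom.cdf(i-1); search for the shortest (i, j).
    k = max(int(n * p), 1)
    best = None
    for i in range(1, k + 1):
        for j in range(k + 1, n + 1):
            cov = stats.binom.cdf(j - 1, n, p) - stats.binom.cdf(i - 1, n, p)
            if cov >= level and (best is None or j - i < best[1] - best[0]):
                best = (i, j)
    return best

i, j = order_stat_ci(n=50, p=0.5, level=0.95)
rng = np.random.default_rng(7)
xs = np.sort(rng.exponential(size=50))
print((i, j), (xs[i - 1], xs[j - 1]))   # ranks and the realized interval
```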

2.7 Q-Q Plot

Wilk and Gnanadesikan (1968) proposed a graphical, rather informal, method of testing the goodness-of-fit of a hypothesized distribution to given data. It essentially plots the quantile function of one cdf against that of another cdf. When the latter cdf is the empirical cdf defined below, order statistics come into the picture. The empirical cdf, to be denoted by Fn(x) for all real x, represents the proportion of sample values that do not exceed x. It has jumps of magnitude 1/n at X(i:n), 1 ≤ i ≤ n. Thus, the order statistics represent the values taken by $F_n^{-1}(p)$, the sample quantile function.

The Q-Q plot is the graphical representation of the points $(F^{-1}(p_i), X_{(i:n)})$, for probability levels pi such as pi = i/(n + 1), where population quantiles are recorded along the horizontal axis and the sample quantiles along the vertical axis. If the sample is in fact from F, we expect the Q-Q plot to be close to a straight line. If not, the plot may show nonlinearity at the upper or lower ends, which may be an indication of the presence of outliers. If the nonlinearity shows up at other points as well, one could question the validity of the assumption that the parent cdf is F.
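A minimal plotting sketch (the plotting positions pi = i/(n + 1) and the N(0, 1) reference distribution are illustrative choices):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
x = np.sort(rng.normal(loc=10, scale=2, size=100))   # sample order statistics
n = len(x)

p = np.arange(1, n + 1) / (n + 1)   # plotting positions p_i = i/(n+1)
theo = stats.norm.ppf(p)            # quantiles of the hypothesized N(0,1)

plt.scatter(theo, x, s=10)
plt.xlabel("N(0,1) quantiles")
plt.ylabel("sample order statistics")
plt.show()   # roughly linear: slope ~ scale, intercept ~ location
```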

3 Empirical Distribution Function

Consider an i.i.d. sequence {Xi} with distribution function F. For each sample of size n, {X1, . . . , Xn}, a corresponding sample (empirical) distribution function Fn is constructed by placing at each observation Xi a mass 1/n. Thus Fn may be represented as

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x), \qquad -\infty < x < \infty.$$

(The definition for F defined on R^k is completely analogous.)
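An implementation sketch of Fn as a callable function of x; searchsorted performs the counting:

```python
import numpy as np

def ecdf(sample):
    # Return F_n as a callable: F_n(x) = (1/n) * #{i : X_i <= x}.
    xs = np.sort(np.asarray(sample))
    def Fn(x):
        # side="right" counts observations <= x
        return np.searchsorted(xs, x, side="right") / len(xs)
    return Fn

rng = np.random.default_rng(9)
Fn = ecdf(rng.uniform(size=200))
print(Fn(0.5))    # a draw of Binomial(200, 0.5)/200, close to 0.5
```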

For each fixed sample, {X1, . . . , Xn}, Fn(·) is a distribution function, considered as a function of x. On the other hand, for each fixed value of x, Fn(x) is a random variable, considered as a function of the sample. In a view encompassing both features, Fn(·) is a random distribution function and thus may be treated as a particular stochastic process (a random element of a suitable space).

Note that the exact distribution of nFn(x) is simply binomial(n, F (x)). It follows immediately that

Theorem 11 1. E[Fn(x)] = F(x).

2. $\operatorname{Var}\{F_n(x)\} = \frac{F(x)[1-F(x)]}{n} \to 0$ as n → ∞.

3. For each fixed x, −∞ < x < ∞, Fn(x) is $AN\left( F(x), \frac{F(x)[1-F(x)]}{n} \right)$.

The sample distribution function is utilized in statistical inference in several ways. Firstly, its most direct application is for estimation of the population distribution function F. Besides pointwise estimation of F(x) for each x, it is also of interest to characterize globally the estimation of F by Fn. For each fixed x, the strong law of large numbers implies that

$$F_n(x) \stackrel{a.s.}{\to} F(x).$$

We now describe the Glivenko-Cantelli Theorem, which ensures that the ecdf converges uniformly almost surely to the true distribution function.

Theorem 12

$$P\left\{ \sup_x |F_n(x) - F(x)| \to 0 \right\} = 1.$$

Proof. Let ε > 0. Find an integer k > 1/ε and numbers

$$-\infty = x_0 < x_1 \le x_2 \le \cdots \le x_{k-1} < x_k = \infty,$$

such that $F(x_j^-) \le j/k \le F(x_j)$ for j = 1, . . . , k − 1. Note that if $x_{j-1} < x_j$, then $F(x_j^-) - F(x_{j-1}) \le \varepsilon$. From the strong law of large numbers,

$$F_n(x_j) \stackrel{a.s.}{\to} F(x_j) \quad \text{and} \quad F_n(x_j^-) \stackrel{a.s.}{\to} F(x_j^-) \qquad \text{for } j = 1, \ldots, k-1.$$

Hence,

$$\Delta_n = \max\left( |F_n(x_j) - F(x_j)|,\; |F_n(x_j^-) - F(x_j^-)|,\; j = 1, \ldots, k-1 \right) \stackrel{a.s.}{\to} 0.$$

Let x be arbitrary and find j such that $x_{j-1} \le x < x_j$. Then,

$$F_n(x) - F(x) \le F_n(x_j^-) - F(x_{j-1}) \le F_n(x_j^-) - F(x_j^-) + \varepsilon,$$

and

$$F_n(x) - F(x) \ge F_n(x_{j-1}) - F(x_j^-) \ge F_n(x_{j-1}) - F(x_{j-1}) - \varepsilon.$$

This implies that

$$\sup_x |F_n(x) - F(x)| \le \Delta_n + \varepsilon \stackrel{a.s.}{\to} \varepsilon.$$

Since this holds for all ε > 0, the theorem follows.


3.1 Kolmogorov-Smirnov Test

To this effect, a very useful measure of closeness of Fn to a hypothesized F0 is the Kolmogorov-Smirnov distance

$$D_n = \sup_{-\infty < x < \infty} |F_n(x) - F_0(x)|.$$

A related problem is to express confidence bands for F(x), −∞ < x < ∞. Thus, for selected functions a(x) and b(x), it is of interest to compute probabilities of the form

$$P\left( F_n(x) - a(x) \le F(x) \le F_n(x) + b(x),\; -\infty < x < \infty \right).$$

The general problem is quite difficult. However, in the simplest case, namely a(x) = b(x) = d, the problem reduces to the computation of P(Dn < d).

For the case of F 1-dimensional, an exponential-type probability inequality for Dn was established by Dvoretzky, Kiefer, and Wolfowitz (1956).

Theorem 13 The distribution of Dn under H0 is the same for all continuous distributions.

Proof. For simplicity we give the proof for F0 strictly increasing. Then F0 has an inverse F0−1, and as u ranges over (0, 1), F0−1(u) ranges over all the possible values of X. Thus

$$D_n = \sup_{0<u<1} |F_n(F_0^{-1}(u)) - F_0(F_0^{-1}(u))| = \sup_{0<u<1} |F_n(F_0^{-1}(u)) - u|.$$

Next note that

$$F_n(F_0^{-1}(u)) = [\text{number of } X_i \le F_0^{-1}(u)]/n = [\text{number of } F_0(X_i) \le u]/n.$$

Let Ui = F0(Xi). Then U1, . . . , Un are a sample from Uniform(0, 1), since

$$P[F_0(X_i) \le u] = P[X_i \le F_0^{-1}(u)] = F_0(F_0^{-1}(u)) = u, \qquad 0 < u < 1.$$

Thus,

$$D_n = \sup_{0<u<1} |F_n(u) - u|,$$

where Fn(u) is the empirical distribution of the uniform sample U1, . . . , Un, and the distribution of Dn does not depend on F0.
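The distribution-free property is easy to check by simulation: transform samples from two different parents through their own cdfs and compare the resulting Dn values (a sketch; n and the replication count are arbitrary):

```python
import numpy as np
from scipy import stats

def ks_stat(u):
    # D_n for data already transformed to (0,1): the sup of |F_n(u) - u|
    # is attained at a jump, so compare U_(k) with k/n and (k-1)/n.
    u = np.sort(u)
    n = len(u)
    k = np.arange(1, n + 1)
    return max(np.max(k / n - u), np.max(u - (k - 1) / n))

rng = np.random.default_rng(10)
n, reps = 50, 3000
d_exp = [ks_stat(stats.expon.cdf(rng.exponential(size=n))) for _ in range(reps)]
d_norm = [ks_stat(stats.norm.cdf(rng.normal(size=n))) for _ in range(reps)]
print(np.mean(d_exp), np.mean(d_norm))   # essentially identical distributions
```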

We now give an important fact which is used in Donsker (1952) to give a rigorous proof of the Kolmogorov-Smirnov Theorems.

Theorem 14 The distribution of the order statistic (Y(1), . . . , Y(n)) of n iid random variables Y1, Y2, . . . from the uniform distribution on [0, 1] can also be obtained as the distribution of the ratios

$$\left( \frac{S_1}{S_{n+1}}, \frac{S_2}{S_{n+1}}, \ldots, \frac{S_n}{S_{n+1}} \right),$$

where Sk = T1 + · · · + Tk, k ≥ 1, and T1, T2, . . . is an iid sequence of (mean 1) exponentially distributed random variables.


Intuitively, if the Ti are regarded as the successive times between occurrences of some phenomena, then Sn+1 is the time to the (n + 1)st occurrence and, in units of Sn+1, the occurrence times should be randomly distributed because of the lack-of-memory and independence properties.

Recall that $D_n = \sup_{0<u<1} |F_n(u) - u|$, where Fn(u) is the empirical distribution of the uniform sample U1, . . . , Un. We then have

$$\sqrt{n}\, D_n = \sqrt{n} \sup_{0<u<1} |F_n(u) - u| = \sqrt{n} \max_{k \le n} \left| Y_{(k)} - \frac{k}{n} \right| \stackrel{d}{=} \sqrt{n} \max_{k \le n} \left| \frac{S_k}{S_{n+1}} - \frac{k}{n} \right| = \frac{n}{S_{n+1}} \max_{k \le n} \left| \frac{S_k - k}{\sqrt{n}} - \frac{k}{n} \cdot \frac{S_{n+1} - n}{\sqrt{n}} \right|.$$

Theorem 15 Let F be defined on R. There exists a finite positive constant C (not depending on F) such that

$$P(D_n \ge d) \le C \exp(-2nd^2), \qquad d > 0,$$

for all n = 1, 2, . . ..

Moreover,

Theorem 16 Let F be 1-dimensional and continuous. Then

$$\lim_{n\to\infty} P(n^{1/2} D_n < d) = 1 - 2\sum_{j=1}^{\infty} (-1)^{j+1} \exp(-2j^2 d^2), \qquad d > 0.$$
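The series converges very rapidly; a sketch evaluating it numerically (truncation at 100 terms is far more than needed):

```python
import numpy as np

def kolmogorov_cdf(d, terms=100):
    # lim P(sqrt(n) D_n < d) = 1 - 2 sum_{j>=1} (-1)^(j+1) exp(-2 j^2 d^2)
    j = np.arange(1, terms + 1)
    return 1 - 2 * np.sum((-1.0) ** (j + 1) * np.exp(-2 * j**2 * d**2))

print(kolmogorov_cdf(1.358))   # ~0.95: the classical 5% critical value
```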

Secondly, we consider goodness-of-fit test statistics based on the sample distribution function. The null hypothesis in the simple case is H0 : F = F0, where F0 is specified. A useful procedure is the Kolmogorov-Smirnov test statistic

$$\Delta_n = \sup_{-\infty < x < \infty} |F_n(x) - F_0(x)|,$$

which reduces to Dn under the null hypothesis. More broadly, a class of such statistics is obtained by introducing weight functions:

$$\sup_{-\infty < x < \infty} |w(x)[F_n(x) - F_0(x)]|.$$

Another important class of statistics is based on the Cramér-von Mises test statistic

$$C_n = n \int_{-\infty}^{\infty} [F_n(x) - F_0(x)]^2 \, dF_0(x),$$

and takes the general form $n \int w(F_0(x))[F_n(x) - F_0(x)]^2 \, dF_0(x)$. For example, for $w(t) = [t(1-t)]^{-1}$, each discrepancy Fn(x) − F0(x) becomes weighted by the reciprocal of its standard deviation (under H0), yielding the Anderson-Darling statistic.


3.2 Stieltjes Integral

If [a, b] is a compact interval, a set of points {x0, x1, . . . , xn} satisfying the inequalities

$$a = x_0 < x_1 < \cdots < x_n = b$$

is called a partition of [a, b]. Write $\Delta f_k = f(x_k) - f(x_{k-1})$ for k = 1, 2, . . . , n. If there exists a positive number M such that

$$\sum_{k=1}^{n} |\Delta f_k| \le M$$

for all partitions of [a, b], then f is said to be of bounded variation on [a, b]. Let F (x) be a function of bounded variation and continuous from the left such as a distribution function.

Given a finite interval (a, b) and a function f(x), we can form the sum

$$J_n = \sum_{i=1}^{n} f(x_i') [F(x_i) - F(x_{i-1})]$$

for a division of (a, b) by points xi such that a < x1 < · · · < xn < b and arbitrary $x_i' \in (x_{i-1}, x_i)$. It may be noted that in Riemann integration a similar sum is considered with the length of the interval (xi − xi−1) instead of F(xi) − F(xi−1). If J = limn→∞ Jn exists as the length of each interval → 0, then J is called the Stieltjes integral of f(x) with respect to F(x) and is denoted by

$$J = \int_a^b f(x) \, dF(x).$$

The improper integral is defined by

$$\lim_{a \to -\infty,\, b \to \infty} \int_a^b f(x) \, dF(x) = \int f(x) \, dF(x).$$

One point of departure from Riemann integration is that it is necessary to specify whether the end points are included in the integration or not. From the definition it is easily shown that

$$\int_a^b f(x) \, dF(x) = \int_{a+0}^b f(x) \, dF(x) + f(a)[F(a+0) - F(a)],$$

where a+0 indicates that the end point a is not included. If F(x) jumps at a, then

$$\int_a^b f(x) \, dF(x) - \int_{a+0}^b f(x) \, dF(x) = f(a)[F(a+0) - F(a)] \ne 0,$$

so that the integral taken over an interval that reduces to a point need not be zero. We shall follow the convention that the lower end point is always included but not the upper end point.

With this convention, we see that $\int_a^b dF(x) = F(b) - F(a)$. If there exists a function p(x) such that $F(x) = \int_{-\infty}^x p(t) \, dt$, the Stieltjes integral reduces to a Riemann integral:

$$\int f(x) \, dF(x) = \int f(x) p(x) \, dx.$$

Theorem 17 Let α be a step function defined on [a, b] with jumps αk at xk, k = 1, . . . , n. Then

$$\int_a^b f(x) \, d\alpha(x) = \sum_{k=1}^{n} f(x_k) \alpha_k.$$
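A sketch of Theorem 17 for a purely discrete F (the fair-die example is illustrative):

```python
import numpy as np

def stieltjes_step(f, jump_points, jump_sizes):
    # Integral of f against a step function alpha with jumps alpha_k at x_k.
    return sum(f(x) * a for x, a in zip(jump_points, jump_sizes))

# Example: E[X^2] for a fair die, integrating f(x) = x^2 against its cdf.
faces = np.arange(1, 7)
probs = np.full(6, 1 / 6)
print(stieltjes_step(lambda t: t**2, faces, probs))   # 91/6 ~ 15.167
```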


4 Sample Density Functions

Recall that F′ = f. This suggests that we can use the derivative of Fn to estimate f. In particular, we consider

$$f_n(x) = \frac{F_n(x + b_n) - F_n(x - b_n)}{2b_n}.$$

Observe that $2nb_n f_n(x) \sim \mathrm{Bin}(n, F(x + b_n) - F(x - b_n))$, and we have

$$E f_n(x) = \frac{1}{2nb_n} \cdot n[F(x + b_n) - F(x - b_n)] \approx \frac{1}{2nb_n} \cdot n \cdot 2b_n f(x) = f(x) \quad \text{if } b_n \to 0,$$

$$\operatorname{var}(f_n(x)) = \frac{1}{(2nb_n)^2} \cdot n[F(x + b_n) - F(x - b_n)][1 - F(x + b_n) + F(x - b_n)] \approx \frac{1}{4nb_n^2} \cdot 2b_n f(x) = \frac{f(x)}{2nb_n} \quad \text{if } nb_n \to \infty.$$

A natural question is to find an optimal choice of bn. To do so, we need to find the magnitude of the bias, Efn(x) − f(x). Since the above estimate can be viewed as the widely used kernel estimate with kernel W(z) = 1/2 if |z| ≤ 1 and = 0 otherwise, we will find the magnitude of the bias for the following kernel estimate of f(x) instead:

$$f_n(x) = \frac{1}{nb_n} \sum_{i=1}^{n} W\left( \frac{x - X_i}{b_n} \right),$$

where W is an integrable nonnegative weight function. Typically, W is chosen to be a density function with $\int t W(t)\,dt = 0$ and $\int t^2 W(t)\,dt = \alpha \ne 0$. We have

$$E f_n(x_0) = \frac{1}{b_n} E\, W\left( \frac{x_0 - X}{b_n} \right) = \frac{1}{b_n} \int W\left( \frac{x_0 - x}{b_n} \right) f(x) \, dx = \int W(t) f(x_0 - b_n t) \, dt = \int W(t) \left[ f(x_0) - b_n t f'(x_0 - \theta_t b_n t) \right] dt = f(x_0) - b_n \int t W(t) f'(x_0 - \theta_t b_n t) \, dt.$$

When $\int t W(t)\,dt \ne 0$, we have

$$E f_n(x_0) = f(x_0) - b_n f'(x_0) \int t W(t) \, dt + o(b_n).$$

When $\int t W(t)\,dt = 0$ and $\int t^2 W(t)\,dt \ne 0$, we have

$$E f_n(x_0) = \int W(t) \left[ f(x_0) - b_n t f'(x_0) + \frac{b_n^2}{2} t^2 f''(x_0 - \theta_t b_n t) \right] dt = f(x_0) + \frac{b_n^2}{2} f''(x_0) \int t^2 W(t) \, dt + o(b_n^2).$$

Therefore, balancing the squared bias against the variance f(x)/(2nbn), the optimal bandwidth is $b_n = O(n^{-1/3})$ when $\int t W(t)\,dt \ne 0$ (assuming f′ exists), and $b_n = O(n^{-1/5})$ when $\int t W(t)\,dt = 0$ and $\int t^2 W(t)\,dt \ne 0$ (assuming f″ exists).
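A closing sketch: the rectangular kernel W(z) = 1/2 on |z| ≤ 1 from the difference quotient above is symmetric, so ∫tW(t)dt = 0 and the n^{−1/5} rate applies (sample size and evaluation point are arbitrary):

```python
import numpy as np

def kernel_density(x0, data, bn, W):
    # f_n(x0) = (1/(n*bn)) * sum_i W((x0 - X_i)/bn)
    return np.mean(W((x0 - data) / bn)) / bn

def W_rect(z):
    return 0.5 * (np.abs(z) <= 1)     # the difference-quotient kernel above

rng = np.random.default_rng(11)
data = rng.normal(size=4000)
bn = len(data) ** (-1 / 5)            # symmetric kernel: b_n = O(n^(-1/5))
print(kernel_density(0.0, data, bn, W_rect))
print(1 / np.sqrt(2 * np.pi))         # true N(0,1) density at 0
```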


5 Applications of Order Statistics

The following is a brief list of settings in which order statistics might have a significant role.

1. Robust Location Estimates. Suppose that n independent measurements are available, and we wish to estimate their assumed common mean. It has long been recognized that the sample mean, though attractive from many viewpoints, suffers from an extreme sensitivity to outliers and model violations. Estimates based on the median or the average of central order statistics are less sensitive to model assumptions. A particularly well-known application of this observation is the accepted practice of using trimmed means (ignoring highest and lowest scores) in evaluating Olympic figure skating performances.

2. Detection of Outliers. If one is confronted with a set of measurements and is concerned with determining whether some have been incorrectly made or reported, attention naturally focuses on certain order statistics of the sample. Usually the largest one or two and/or the smallest one or two are deemed most likely to be outliers. Typically we ask questions like the following: If the observations really were iid, what is the probability that the largest order statistic would be as large as the suspiciously large value we have observed?

3. Censored Sampling. Consider life-testing experiments, in which a fixed number n of items are placed on test and the experiment is terminated as soon as a prescribed number r have failed. The observed lifetimes are thus X1:n ≤ · · · ≤ Xr:n, whereas the lifetimes Xr+1:n ≤ · · · ≤ Xn:n remain unobserved.

4. Waiting for the Big One. Disastrous floods and destructive earthquakes recur throughout history. Dam construction has long focused on the so-called 100-year flood. Presumably the dams are built big enough and strong enough to handle any water flow to be encountered except for a level expected to occur only once every 100 years. Architects in California are particularly concerned with construction designed to withstand "the big one," presumably an earthquake of enormous strength, perhaps a "100-year quake." Whether one agrees or not with the 100-year disaster philosophy, it is obvious that designers of dams and skyscrapers, and even doghouses, should be concerned with the distribution of large order statistics from a possibly dependent, possibly not identically distributed sequence.

After the disastrous flood of February 1st, 1953, in which the sea-dikes broke in several parts of the Netherlands and nearly two thousand people were killed, the Dutch government appointed a committee (the so-called Delta-committee) to recommend an appropriate level for the dikes (called the Delta-level since), because no specific statistical study had been done to fix a safer level for the sea-dikes before 1953. The Dutch government set as the standard for the sea-dikes that at any time in a given year the sea level exceeds the level of the dikes with probability 1/10,000. A statistical group from the Mathematical Centre in Amsterdam headed by D. van Dantzig showed that high tides occurring during certain dangerous windstorms (to ensure independence) within the dangerous winter months December, January, and February (for homogeneity) follow closely an exponential distribution if the smaller high tides are neglected.

If we model the annual maximum flood by a random variable Z, the Dutch government therefore wanted to determine the (1 − q)-quantile

$$F^{-1}(1-q) = \inf\{ t \in \mathbb{R} : F(t) \ge 1-q \}$$

of Z, where F denotes the distribution of Z and q has the value 10−4.

5. Strength of Materials. The adage that a chain is no stronger than its weakest link underlies much of the theory of strength of materials, whether they are threads, sheets, or blocks. By considering failure potential in infinitesimally small sections of the material, one quickly is led to strength distributions associated with limits of distributions of sample minima. Of course, if we stick to the finite chain with n links, its strength would be the minimum of the strengths of its n component links, again an order statistic.

6. Reliability. The example of a cord composed of n threads can be extended to lead us to reliability applications of order statistics. It may be that failure of one thread will cause the cord to break (the weakest link), but more likely the cord will function as long as k (a number less than n) of the threads remain unbroken. As such, it is an example of a k out of n system commonly discussed in reliability settings. With regard to tire failure, the automobile is an example of a 4 out of 5 system (remember the spare). Borrowing terminology from electrical systems, the n out of n system is also known as a series system: any component failure is disastrous. The 1 out of n system is known as a parallel system; it will function as long as any of the components survives. The life of the k out of n system is clearly Xn−k+1:n, the (n − k + 1)st smallest of the component lifetimes, or, equivalently, the time until fewer than k components are functioning. General systems can be regarded as perhaps complicated hierarchies of parallel and series subsystems, and the study of system lifetime will necessarily involve distributions of order statistics.

7. Quality Control. Take a comfortable chair and watch the daily production of Snickers candy bars pass by on the conveyor belt. Each candy bar should weigh
