### Chapter 3. Asymptotic Methods

### 1 Modes of Convergence of a Sequence of Random Variables

Because exact calculations are often difficult, we make use of asymptotic results. For example, we rely on approximations of probabilities when computing significance levels and setting confidence intervals. In this process, we use the following facts: the Law of Large Numbers, the Central Limit Theorem, and the approximation of the binomial distribution by the normal or Poisson distribution, among others. The essence of asymptotic methods is approximation.

We approximate functions, random variables, probability distributions, means, variances, and covariances. However, we need to understand what kind of approximation we are using. The strong law of large numbers and the central limit theorem illustrate the two main types of limit theorems in probability.

Strong limit theorems. Given a sequence of functions X_{1}(w), X_{2}(w), . . ., show that there is a limit function X(w) such that

P(w : lim_{n} X_{n}(w) = X(w)) = 1.

Weak limit theorems. Given a sequence of functions X_{1}(w), X_{2}(w), . . ., show that lim_{n} P(w : X_{n}(w) < x) exists for every x.

There is a great difference between strong and weak theorems, which will become more apparent. A dramatic example of this: on ([0, 1), B_{1}([0, 1))) with P being Lebesgue measure, define, for n even,

X_{n}(w) = 0 if w < 1/2, and X_{n}(w) = 1 if 1/2 ≤ w < 1.

For n odd,

X_{n}(w) = 1 if w < 1/2, and X_{n}(w) = 0 if 1/2 ≤ w < 1.

For all n, P(w : X_{n}(w) < x) = P(w : X_{1}(w) < x). But for every w ∈ [0, 1),

lim sup_{n} X_{n}(w) = 1, lim inf_{n} X_{n}(w) = 0.
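A quick simulation makes the contrast concrete (this code is my own illustration, not part of the notes): every X_{n} has the same distribution, yet for each fixed w the sequence never settles down.

```python
import random

# Alternating indicator functions from the example on ([0,1), Lebesgue measure).
def X(n, w):
    # n even: indicator of [1/2, 1); n odd: indicator of [0, 1/2)
    if n % 2 == 0:
        return 1 if w >= 0.5 else 0
    return 1 if w < 0.5 else 0

random.seed(0)
ws = [random.random() for _ in range(100_000)]

# Weak sense: P(X_n = 1) = 1/2 for every n, so all X_n share one distribution.
for n in (1, 2):
    p = sum(X(n, w) for w in ws) / len(ws)
    print(f"P(X_{n} = 1) ~ {p:.3f}")

# Strong sense: for any fixed w the values oscillate, so limsup = 1, liminf = 0.
print([X(n, ws[0]) for n in range(1, 9)])
```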

In this chapter, we will attempt to understand these asymptotic calculations.

1.1 The O, o Notation

Before discussing the concept of convergence for random variables, we give a quick review of ways of comparing the magnitudes of two sequences. A notation that is especially useful for keeping track of the order of an approximation is the “big O, little o.” Let {a_{n}} and {β_{n}} be two sequences of real numbers. We have the following three concepts of comparison:

a_{n} = O(β_{n}) if the ratio a_{n}/β_{n} is bounded for large n; that is, if there exist a number K and an integer n(K) such that if n ≥ n(K), then |a_{n}| ≤ K|β_{n}|.

a_{n} = o(β_{n}) if the ratio a_{n}/β_{n} converges to 0 as n → ∞.

a_{n} ∼ β_{n} iff a_{n}/β_{n} = c + o(1), c ≠ 0, as n → ∞.

Fact. (1) O(a_{n})O(β_{n}) = O(a_{n}β_{n}), (2) O(a_{n})o(β_{n}) = o(a_{n}β_{n}), (3) o(a_{n})o(β_{n}) = o(a_{n}β_{n}), (4) o(1) + O(n^{−1/2}) + O(n^{−1}) = o(1). The order of magnitude of a finite sum is the largest order of magnitude of the summands.
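These relations can be checked numerically. The sequences below are my own choices for illustration: a_{n} = 3n + 10 is O(n) (indeed a_{n} ∼ 3n), while log n = o(n).

```python
import math

# Ratios that witness O(.) (bounded) and o(.) (tending to 0).
for n in (10, 100, 10_000, 1_000_000):
    a_ratio = (3 * n + 10) / n        # bounded, -> 3   (a_n = O(n), a_n ~ 3n)
    b_ratio = math.log(n) / n         # -> 0            (log n = o(n))
    print(f"n={n:>9}: a_n/n = {a_ratio:.5f}, log(n)/n = {b_ratio:.6f}")
```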

Example. The Taylor expansion of a function f(·) about the value c can be stated as

f(x) = f(c) + (x − c)f′(c) + o(|x − c|) as x → c.

In general, we have the following.

Theorem 1 (Taylor). Let the function f have a finite nth derivative f^{(n)} everywhere in the open interval (a, b) and an (n − 1)th derivative f^{(n−1)} continuous in the closed interval [a, b]. Let x ∈ [a, b]. For each point y ∈ [a, b], y ≠ x, there exists a point z interior to the interval joining x and y such that

f(y) = f(x) + Σ^{n−1}_{k=1} [f^{(k)}(x)/k!] (y − x)^{k} + [f^{(n)}(z)/n!] (y − x)^{n},

or

f(y) = f(x) + Σ^{n−1}_{k=1} [f^{(k)}(x)/k!] (y − x)^{k} + o(|y − x|^{n−1}) as y → x.
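A numeric check of the remainder behavior (my own sketch; f = exp is my choice for convenience, so every derivative is exp as well): with n = 3, the error of the expansion is O(h^{3}), and error/h^{3} stays near f^{(3)}(z)/3! ≈ 1/6.

```python
import math

def taylor(x, y, n):
    # sum_{k=0}^{n-1} f^(k)(x)/k! * (y - x)^k with f = exp, so f^(k) = exp
    return sum(math.exp(x) * (y - x) ** k / math.factorial(k) for k in range(n))

x, n = 0.0, 3
for h in (0.1, 0.01, 0.001):
    err = abs(math.exp(x + h) - taylor(x, x + h, n))
    # the ratio err / h^n approaches f'''(z)/3! ~ 1/6 as h -> 0
    print(f"h={h}: error = {err:.3e}, error/h^3 = {err / h ** 3:.4f}")
```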

1.2 Convergence of Stochastic Sequences

Now we consider probabilistic versions of these order-of-magnitude relations. Let A_{n} and B_{n} be sequences of real random variables. Then

A_{n} = O_{p}(B_{n}) iff for every ε > 0, there exist a constant M(ε) and an integer N(ε) such that if n ≥ N(ε), then

P{|A_{n}/B_{n}| ≤ M(ε)} ≥ 1 − ε.

A_{n} = o_{p}(B_{n}) iff for every ε > 0, lim_{n→∞} P{|A_{n}/B_{n}| ≤ ε} = 1.

A_{n} ≈ B_{n} iff A_{n} = B_{n} + o_{p}(B_{n}).

If X_{n} is a vector, we say that X_{n} = o_{p}(β_{n}) if ‖X_{n}‖ = o_{p}(β_{n}), where ‖X_{n}‖ denotes the length of the vector X_{n}.
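The canonical example of stochastic boundedness is the sample mean. The simulation below (my own illustration) takes i.i.d. Uniform(0, 1) draws and shows that A_{n} = |X̄_{n} − 1/2| = O_{p}(n^{−1/2}): the scaled deviation √n·A_{n} stays within a fixed bound with high probability, uniformly in n.

```python
import random
import statistics

random.seed(1)

def scaled_dev(n):
    # sqrt(n) * |Xbar_n - 1/2| for one sample of size n
    xbar = statistics.fmean(random.random() for _ in range(n))
    return abs(xbar - 0.5) * n ** 0.5

for n in (100, 10_000):
    reps = [scaled_dev(n) for _ in range(300)]
    # P(sqrt(n)|Xbar - 1/2| <= 1) is high and does not degrade as n grows
    frac = sum(r <= 1.0 for r in reps) / len(reps)
    print(f"n={n}: P(sqrt(n)|Xbar - 1/2| <= 1) ~ {frac:.2f}")
```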

Let X_{1}, X_{2}, . . . and X be random variables on a probability space (Ω, A, P ).

As an example, we take a measurement from an experiment in a laboratory.

Usually the outcome of the experiment cannot be predicted with certainty. To handle this situation, we introduce the probability of A (a collection of possible outcomes) as the fraction of times that the outcome of the experiment results in A in a large number of trials of the experiment. The individual outcomes of an experiment are called elementary events. Here, Ω is the set of all elementary events, also called the sample space. A is a class of subsets of Ω to which we can assign probability. For each set A ∈ A we assign a value P(A), called the probability of A. Note that P is a set function over the members of A.

What kind of A would suffice for our needs? From experience, four kinds of operations on sets (intersection, complement, union, and set difference) are convenient and useful tools for describing events. It is then quite natural to require that A contain the events formed by such operations. Such a class of sets is called a Boolean field. We would also like to consider unions of countable sequences of sets (events). We therefore require that A be a Borel field, or σ-field: a class containing the unions of all countable sequences of its sets (and therefore countable intersections) and closed under complementation.

For example, Ω can be a set of numbers or a subinterval of the real line.

The context that is necessary for the strong limit theorems we want to prove is:

Definition A probability space consists of a triple (Ω, F, P) where

(i) Ω is a space of points w; Ω is called the sample space and the points w sample points. It is a nonempty set that represents the collection of all possible outcomes of an experiment.

(ii) F is a σ-field of subsets of Ω. It includes the empty set as well as the set Ω and is closed under the set operations of complements and finite or countable unions and intersections. The elements of F are called measurable events, or simply events.

(iii) P(·) is a probability measure on F; henceforth we refer to P simply as a probability. It is an assignment of probabilities to events in F subject to the conditions that

1. 0 ≤ P(F) ≤ 1, for each F ∈ F,

2. P(∅) = 0, P(Ω) = 1,

3. P(∪_{i}F_{i}) = Σ_{i} P(F_{i}) for any finite or countable sequence of mutually exclusive events F_{i}, i = 1, 2, . . ., belonging to F.

### 2 Remarks on measure and integration

A pair (Ω, F) consisting of a set Ω and a σ-field F of subsets of Ω is called a measurable space. For any given Ω, there is one trivial σ-field: the collection containing exactly two elements, the empty set and Ω. However, this σ-field is of little use in applications.

Consider the set of real numbers R, which is uncountably infinite. We define the Lebesgue measure of intervals in R to be their length. This definition and the properties of measure determine the Lebesgue measure of many, but not all, subsets of R. The collection of subsets of R we consider, and for which Lebesgue measure is defined, is the collection of Borel sets defined below.

Let C be the collection of all finite open intervals on R. Then B = σ(C) is called the Borel σ-field. The elements of B are called Borel sets.

• All intervals (finite or infinite), open sets, and closed sets are Borel sets.

These facts can be shown easily by the following identities:

(a, ∞) = ∪^{∞}_{n=1}(a, a + n), (−∞, a) = ∪^{∞}_{n=1}(a − n, a), [a, b] = ((−∞, a) ∪ (b, ∞))^{c},

[a, ∞) = ∪^{∞}_{n=1}[a, a + n), (−∞, a] = ((a, ∞))^{c}, (a, b] = (−∞, b] ∩ (a, ∞),

{a} = ∩^{∞}_{n=1}(a − 1/n, a + 1/n).

This means that every countable set of numbers is Borel: if A = {a_{1}, a_{2}, . . .}, then

A = ∪^{∞}_{k=1}{a_{k}}.

Hence the set of rational numbers is Borel, as is its complement, the set of irrational numbers. There are, however, sets which are not Borel, and we have just seen that any non-Borel set must have uncountably many points.

• B = σ(O), where O is the collection of all open sets.

• The Borel σ-field B^{k} on the k-dimensional Euclidean space R^{k} can be similarly defined.

• Let C ⊂ R^{k} be a Borel set and let B_{C} = {C ∩ B : B ∈ B^{k}}. Then (C, B_{C}) is a measurable space and B_{C} is called the Borel σ-field on C. (In statistics, we quite often need such spaces, for instance when considering conditional probabilities.)

The closure properties of F ensure that the usual applications of set operations in representing events do not lead to nonmeasurable events for which no (consistent) assignment of probability is possible.

The required countable additivity property (3) gives probabilities a sufficiently rich structure for doing calculations and approximations involving limits. Two immediate consequences of (3) are the following so-called continuity properties: if A_{1} ⊂ A_{2} ⊂ · · · is a nondecreasing sequence of events in F then, thinking of ∪^{∞}_{n=1}A_{n} as the limiting event for such sequences,

P(∪^{∞}_{n=1}A_{n}) = lim_{n} P(A_{n}).

To prove this, disjointify {A_{n}} by B_{n} = A_{n} − A_{n−1}, n ≥ 1, A_{0} = ∅, and apply (iii) to ∪^{∞}_{n=1}B_{n} = ∪^{∞}_{n=1}A_{n}. By considering complements, one gets for decreasing measurable events A_{1} ⊃ A_{2} ⊃ · · · that

P(∩^{∞}_{n=1}A_{n}) = lim_{n} P(A_{n}).

Example 1. Suppose that {X_{t} : 0 ≤ t < ∞} is a continuous-time Markov chain with a finite or countable state space S. The Markov property here refers to the property that the conditional distribution of the future, given past and present states of the process, does not depend on the past. The conditional probabilities p_{ij}(s, t) = P(X_{t} = j|X_{s} = i), 0 ≤ s < t, are collectively referred to as the transition probability law for the process. In the case p_{ij}(s, t) is a function of t − s, the transition law is called time-homogeneous, and we write p_{ij}(s, t) = p_{ij}(t − s). Write p(t_{0}) = ((p_{ij}(t_{0}))), where p_{ij}(t_{0}) gives the probability that the process will be in state j at time t_{0} if it is initially at state i. We assume that lim_{t→0} p(t) = I, where I is the identity matrix. This means that with probability 1, the process spends a positive (but variable) amount of time in the initial state i before moving to a different state j. Set

q_{ij} = lim_{t→0} [p_{ij}(t) − p_{ij}(0)]/t = lim_{t→0} [p_{ij}(t) − δ_{ij}]/t,

which are referred to as the infinitesimal transition rates. Write Q = ((q_{ij})), the infinitesimal generator.

Assume the Markov chain has initial state i and let T_{0} = inf{t > 0 : X_{t} ≠ i}. An important question is finding the distribution of T_{0}.

Choose and fix t > 0, and let A denote the event {T_{0} > t}. For each integer n ≥ 1 define the finite-dimensional event

A_{n} = {X_{(m/2^{n})t} = i for m = 0, 1, . . . , 2^{n}}.

The events A_{n} are decreasing as n increases, and

A = lim_{n→∞} A_{n} = ∩^{∞}_{n=1}A_{n} = {X_{u} = i for all u in [0, t] which are binary rational multiples of t} = {T_{0} > t},

since there is some u of the form u = (m/2^{n})t ≤ t in every nondegenerate interval. By the continuity property above, the Markov property, and time homogeneity, P(T_{0} > t) = lim_{n} P(A_{n}) = lim_{n} [p_{ii}(t/2^{n})]^{2^{n}} = e^{q_{ii}t}, so T_{0} has an exponential distribution with parameter −q_{ii}.
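This can be seen in a simulation (my own sketch, not from the notes): discretize time in steps of size h and let the chain leave state i with probability qh per step, where q = −q_{ii}. The resulting holding time is approximately exponential with rate q.

```python
import math
import random

random.seed(2)
q, h = 2.0, 1e-3   # q = -q_ii (leaving rate), h = time step

def holding_time():
    # geometric number of steps with per-step exit probability q*h,
    # converted back to continuous time
    t = 0.0
    while random.random() >= q * h:   # stay put with probability 1 - q*h
        t += h
    return t

times = [holding_time() for _ in range(10_000)]
t0 = 1.0
emp = sum(t > t0 for t in times) / len(times)
print(f"P(T_0 > {t0}) ~ {emp:.3f}  vs  exp(q_ii t_0) = {math.exp(-q * t0):.3f}")
```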

2.0.1 The Homogeneous Poisson Process, the Poisson Distribution, and the Exponential Distribution

In survival analysis, we are often interested in studying whether a particular event occurs or not. In this case, we can think in terms of death (denoted by state 1) and survival (denoted by state 0), using the language of Markov chains.

We have just described a very special chain with two states, in which state 1 is an absorbing state. As an illustration, consider this simplest chain with only two states, 0 and 1. Usually, we would like to know the sojourn time at state 0, which we denote by T. We know that

P(T < t + δ|T ≥ t) ≈ λ(t)δ,

where λ(t) is the hazard function of T. Let T_{0} denote a fixed time and δ = T_{0}/n, where n ∈ N. Using the Markov property, we have

P(T ≥ T_{0}) = P(T ≥ (n − 1)δ) P(T ≥ T_{0} | T ≥ (n − 1)δ)
= P(T ≥ (n − 2)δ) P(T ≥ (n − 1)δ | T ≥ (n − 2)δ) P(T ≥ T_{0} | T ≥ (n − 1)δ).

Continuing in this fashion, we have

P(T ≥ T_{0}) ≈ ∏^{n−1}_{i=0} [1 − λ(iδ)δ] = exp{Σ^{n−1}_{i=0} ln[1 − λ(iδ)δ]}
≈ exp{−Σ^{n−1}_{i=0} λ(iδ)δ} → exp[−∫^{T_{0}}_{0} λ(t)dt].

This is the commonly seen form of the survival function written in terms of the hazard function. If λ(t) = λ_{0}, then T is an exponentially distributed random variable.
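The product-limit approximation above can be checked numerically (my own sketch): with a constant hazard λ_{0}, the product ∏(1 − λ_{0}δ) converges to the exponential survival function exp(−λ_{0}T_{0}) as the grid is refined.

```python
import math

lam0, T0 = 1.5, 2.0   # constant hazard and fixed horizon (my choices)
for n in (10, 100, 10_000):
    delta = T0 / n
    prod = 1.0
    for i in range(n):
        prod *= 1 - lam0 * delta   # one-step survival probability
    print(f"n={n:>6}: product = {prod:.5f},  exp(-lam0*T0) = {math.exp(-lam0 * T0):.5f}")
```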

Now we consider a different kind of chain, with no absorbing states. This is usually seen in connection with Poisson processes and queues. The occurrences of a sequence of discrete events can often be realistically modelled as a Poisson process. The defining characteristic of such a process is that the time intervals between successive events are exponentially distributed. It is still a multi-state Markov chain with no absorbing state. For the purpose of illustration, we describe discrete-time Markov chains. Such a chain is often discussed in terms of being finite, aperiodic, and irreducible. Finiteness means that there is a finite number of possible states. The aperiodicity assumption is that there is no state such that a return to that state is possible only at t_{0}, 2t_{0}, 3t_{0}, . . . transitions later, where t_{0} is an integer exceeding 1. If the transition matrix of a Markov chain with states E_{1}, E_{2}, E_{3}, E_{4} is, for example,

P =

[ 0    0    0.6  0.4 ]
[ 0    0    0.3  0.7 ]
[ 0.5  0.5  0    0   ]
[ 0.2  0.8  0    0   ],

then the Markov chain is periodic. If the Markovian random variable starts (at time 0) in E_{1}, then at time 1 it must be in either E_{3} or E_{4}, at time 2 it must be in either E_{1} or E_{2}, and in general it can visit E_{1} only at times 2, 4, 6, . . .. It is therefore periodic. The irreducibility assumption implies that any state can eventually be reached from any other state, if not in one step then after several steps, except for the case of Markov chains with absorbing states.
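Iterating the distribution of the chain under this transition matrix makes the period-2 structure visible (a small sketch of my own): starting in E_{1}, all mass sits on {E_{3}, E_{4}} at odd times and on {E_{1}, E_{2}} at even times.

```python
# Transition matrix from the example above.
P = [
    [0.0, 0.0, 0.6, 0.4],
    [0.0, 0.0, 0.3, 0.7],
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0],
]

def step(dist):
    # one transition: new_j = sum_i dist_i * P[i][j]
    return [sum(dist[i] * P[i][j] for i in range(4)) for j in range(4)]

dist = [1.0, 0.0, 0.0, 0.0]   # start in E_1 at time 0
for t in range(1, 5):
    dist = step(dist)
    print(f"t={t}:", [round(p, 3) for p in dist])
```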

Now we come back to the chain associated with the Poisson process. Given a sequence of discrete events occurring at times t_{0}, t_{1}, t_{2}, t_{3}, . . ., the intervals between successive events are Δt_{1} = (t_{1} − t_{0}), Δt_{2} = (t_{2} − t_{1}), Δt_{3} = (t_{3} − t_{2}), and so on. Assume the transition law is time-homogeneous. By the above argument, each Δt_{i} is again exponentially distributed. By the definition of a Markov chain, these intervals Δt_{i} are treated as independent random variables drawn from an exponentially distributed population, i.e., a population with the density function f(x) = λ exp(−λx) for some fixed constant λ.

Now we state the fundamental properties that define a Poisson process, and from these properties we derive the Poisson distribution. Suppose that a sequence of random events occur during some time interval. These events form a homogeneous Poisson process if the following two conditions are met:

(1) The occurrence of any event in the time interval (a, b) is independent of the occurrence of any event in the time interval (c, d), whenever (a, b) and (c, d) do not overlap.

(2) There is a constant λ > 0 such that for any sufficiently small time interval (t, t + h), h > 0, the probability that exactly one event occurs in (t, t + h) is independent of t and equals λh + o(h), and the probability that more than one event occurs in (t, t + h) is o(h).

Condition 2 has two implications. The first is time homogeneity: the probability of an event in the time interval (t, t + h) is independent of t. Second, this condition means that the probability of an event occurring in a small time interval is (up to a small-order term) proportional to the length of the interval (with fixed proportionality constant λ). Thus the probability of no events in the interval (t, t + h) is 1 − λh + o(h), and the probability of one or more events in the interval (t, t + h) is λh + o(h).

Various naturally occurring phenomena follow, or very nearly follow, these two conditions. Suppose a cellular protein degrades spontaneously, and the quantity of this protein in the cell is maintained at a constant level by the continual generation of new proteins at approximately the degradation rate.

The number of proteins that degrade in any given time interval approximately satisfies conditions 1 and 2. The justification for assuming condition 1 in the model is that the number of proteins in the cell is essentially constant and that the spontaneous nature of the degradation process makes the independence assumption reasonable. Condition 2 follows by dividing time into small subintervals and using a Bernoulli random variable to indicate whether an event occurs in each: when np is small, the probability of at least one success in n Bernoulli trials is approximately np.

We now show that under conditions 1 and 2, the number N of events that occur up to any arbitrary time t has a Poisson distribution with parameter λt.

At time 0 the value of N is necessarily 0, and at any later time t the possible values of N are 0, 1, 2, 3, . . .. We denote the probability that N = j at any given time t by P_{j}(t). We would like to assess how P_{j}(t) behaves as a function of j and t.

The event that N = 0 at time t + h occurs only if no events occur in (0, t) and also no events occur in (t, t + h). Thus for small h,

P_{0}(t + h) = P_{0}(t)(1 − λh + o(h)) = P_{0}(t)(1 − λh) + o(h).

This equality follows from conditions 1 and 2.

The event that N = 1 at time t + h can occur in two ways. The first is that N = 1 at time t and no event occurs in the time interval (t, t + h); the second is that N = 0 at time t and exactly one event occurs in the

time interval (t, t + h). This gives

P_{1}(t + h) = P_{0}(t)(λh) + P_{1}(t)(1 − λh) + o(h),

where the term o(h) is the sum of two terms, both of which are o(h). Finally, for j = 2, 3, . . ., the event that N = j at time t + h can occur in three different ways. The first is that N = j at time t and no event occurs in the time interval (t, t + h). The second is that N = j − 1 at time t and exactly one event occurs in (t, t + h). The final possibility is that N ≤ j − 2 at time t and two or more events occur in (t, t + h). Thus, for j = 2, 3, . . .,

P_{j}(t + h) = P_{j−1}(t)(λh) + P_{j}(t)(1 − λh) + o(h).

The above discussion leads to

[P_{0}(t + h) − P_{0}(t)]/h = −λP_{0}(t) + o(h)/h,

[P_{j}(t + h) − P_{j}(t)]/h = λP_{j−1}(t) − λP_{j}(t) + o(h)/h, j = 1, 2, 3, . . ..

Letting h → 0, we get

(d/dt)P_{0}(t) = −λP_{0}(t),

and

(d/dt)P_{j}(t) = λP_{j−1}(t) − λP_{j}(t), j = 1, 2, 3, . . . .

The P_{j}(t) are subject to the conditions

P_{0}(0) = 1, P_{j}(0) = 0, j = 1, 2, 3, . . . .

The probability of the system still being in state 0 at time t, P_{0}(t) = exp(−λt), is obtained easily from the first equation. Substituting this into the equation for P_{1} gives

(1/λ) dP_{1}(t)/dt + P_{1}(t) = e^{−λt}.

From this,

(d/dt)(P_{1}(t) exp(λt)) = λ.

We have

P_{1}(t) = e^{−λt}λt.

By induction, the probability of the nth state at time t is

P_{n}(t) = e^{−λt}(λt)^{n}/n!.

This is the probability distribution for a simple Poisson counting process, representing the probability that exactly n events will have occurred by time t. The sum of these probabilities for n = 0 to ∞ equals 1, because the exponential exp(−λt) factors out of the sum, and the sum of the remaining factors is just the power series expansion of exp(λt).
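The connection between exponential inter-arrival times and Poisson counts can be verified by simulation (my own sketch): generate exponential gaps with rate λ, count the events falling in [0, t], and compare the empirical frequencies with P_{n}(t) = e^{−λt}(λt)^{n}/n!.

```python
import math
import random

random.seed(3)
lam, t = 2.0, 1.5   # rate and horizon (my choices), so lam*t = 3

def count_events():
    # accumulate exponential inter-arrival times until they exceed t
    s, n = 0.0, 0
    while True:
        s += random.expovariate(lam)
        if s > t:
            return n
        n += 1

counts = [count_events() for _ in range(20_000)]
for n in range(4):
    emp = counts.count(n) / len(counts)
    theo = math.exp(-lam * t) * (lam * t) ** n / math.factorial(n)
    print(f"P(N={n}): simulated {emp:.4f}, Poisson {theo:.4f}")
```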

It is worth noting that since the distribution of intervals between successive occurrences is exponential, the Poisson process is stationary: any time can be taken as the initial time t = 0, so the probability of n occurrences in an interval of time depends only on the length of the interval, not on when the interval occurs. The expected number of occurrences by time t is given by

E(n, t) = Σ^{∞}_{i=0} iP_{i}(t) = λt.

The distribution of the time between successive events is exponential, so the (random) time until the kth event occurs is the sum of k independent exponentially distributed times. Let t_{0} be some fixed value of t. If the time until the kth event occurs exceeds t_{0}, then the number of events occurring before time t_{0} is less than k, and conversely. This means that the probability that k − 1 or fewer events occur before time t_{0} must be identical to the probability that the time until the kth event occurs exceeds t_{0}. In other words, it must be true that

e^{−λt_{0}} [1 + (λt_{0}) + (λt_{0})^{2}/2! + · · · + (λt_{0})^{k−1}/(k − 1)!] = [λ^{k}/Γ(k)] ∫^{∞}_{t_{0}} x^{k−1} exp(−λx)dx.

This equation can also be established by repeated integration by parts of the right-hand side.
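A numeric check of the identity (my own sketch): the left-hand side is the Poisson probability of at most k − 1 events by t_{0}; the right-hand side, a gamma tail probability, is approximated below with the trapezoid rule on a truncated range (the tail beyond t_{0} + 40 is negligible for these parameters).

```python
import math

lam, t0, k = 1.0, 2.0, 3   # illustrative parameter choices

# left-hand side: e^{-lam t0} * sum_{j=0}^{k-1} (lam t0)^j / j!
lhs = math.exp(-lam * t0) * sum((lam * t0) ** j / math.factorial(j) for j in range(k))

# right-hand side: lam^k / Gamma(k) * integral_{t0}^{inf} x^{k-1} e^{-lam x} dx
def integrand(x):
    return x ** (k - 1) * math.exp(-lam * x)

a, b, m = t0, t0 + 40.0, 40_000
h = (b - a) / m
integral = h * ((integrand(a) + integrand(b)) / 2
                + sum(integrand(a + i * h) for i in range(1, m)))
rhs = lam ** k / math.gamma(k) * integral

print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")
```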

2.1 Counting measure and Lebesgue measure

First, we consider counting measure, in which Ω is a finite or countable set. Then probabilities are defined for all subsets F of Ω once they are specified for singletons, so F is the collection of all subsets of Ω. Thus, if f is a probability mass function (p.m.f.) for singletons, i.e., f(w) ≥ 0 for all w ∈ Ω and Σ_{w} f(w) = 1, then one may define P(F) = Σ_{w∈F} f(w). The function P so defined on the class of all subsets of Ω is countably additive, i.e., P satisfies (3). So (Ω, F, P) is easily seen to be a probability space. In this case the probability measure P is determined by the probabilities of singletons {w}.

In the case Ω is not finite or countable, e.g., when Ω is the real line or the space of all infinite sequences of 0's and 1's, the counting measure formulation is in general no longer possible. We consider Lebesgue measure. The Lebesgue measure µ_{0} of a set containing only one point must be zero. In fact, since

{a} ⊂ (a − 1/n, a + 1/n)

for every positive integer n, we must have µ_{0}({a}) = 0. Hence, the Lebesgue measure of a set containing countably many points must also be zero. Instead, for example in the case Ω = R^{1}, one is often given a piecewise continuous probability density function (p.d.f.) f, i.e., f is nonnegative, integrable, and ∫^{∞}_{−∞} f(x)dx = 1. For an interval I = (a, b) or (b, ∞), −∞ ≤ a < b ≤ ∞, one then assigns the probability P(I) = ∫^{b}_{a} f(x)dx, by a Riemann integral. The Lebesgue measure of a set containing uncountably many points can be zero, positive and finite, or infinite. We may not compute the Lebesgue measure of an uncountable set by adding up the Lebesgue measure of its individual members, because there is no way to add up uncountably many numbers.

This set function P may be extended to the class C comprising all finite unions F = ∪_{j}I_{j} of pairwise disjoint intervals I_{j} by setting P(F) = Σ_{j} P(I_{j}). The class C is a field, i.e., the empty set and Ω belong to C, and C is closed under complements and finite intersections (and therefore finite unions). But since C is not a σ-field, the usual sequentially applied operations on events may lead to events outside of C for which probabilities have not been defined. A theorem from measure theory, the Caratheodory Extension Theorem, asserts that there is a unique countably additive extension of P from a field C to the smallest σ-field that contains C. In the case of C above, this σ-field is called the Borel σ-field B^{1} on R^{1}, and its sets are called the Borel sets of R^{1}.

In general, such an extension of P to the power set σ-field, that is, the collection of all subsets of R^{1}, is not possible. The same considerations apply to all measures (i.e., countably additive nonnegative set functions µ defined on a σ-field with µ(∅) = 0), whether the measure of Ω is 1 or not. The measure µ_{0} = m, which is defined first for each interval I as the length of the interval and then extended uniquely to B^{1}, is called Lebesgue measure on R^{1}. Similarly, one defines Lebesgue measure on R^{k} (k ≥ 2), whose Borel σ-field B^{k} is the smallest σ-field that contains all k-dimensional rectangles I = I_{1} × I_{2} × · · · × I_{k}, with I_{j} a one-dimensional rectangle (interval) of the previous type. The Lebesgue measure of a rectangle is the product of the lengths of its sides, i.e., its volume. Lebesgue measure on R^{k} has the property that the space can be decomposed into a countable union of measurable sets of finite Lebesgue measure; such measures are said to be sigma-finite. All measures referred to in this note are sigma-finite.

2.2 Extension

A finitely additive measure µ on a field F is a real-valued (possibly ∞), nonnegative function with domain F such that for A, B ∈ F with A ∩ B = ∅,

µ(A ∪ B) = µ(A) + µ(B).

The extension problem for measures is: given a finitely additive measure µ_{0} on a field F_{0}, when does there exist a measure µ on F(F_{0}) agreeing with µ_{0} on F_{0}? A measure has certain continuity properties:

Theorem 2 Let µ be a measure on the σ-field F. If A_{n} ↓ A, A_{n} ∈ F, and if µ(A_{n}) < ∞ for some n, then

lim_{n} µ(A_{n}) = µ(A).

Also, if A_{n} ↑ A, A_{n} ∈ F, then

lim_{n} µ(A_{n}) = µ(A).

These are called continuity from above and below. Certainly, if µ_{0} is to be extended, then a minimum requirement is that µ_{0} be continuous on its domain. Call µ_{0} continuous from above at ∅ if whenever A_{n} ∈ F_{0}, A_{n} ↓ ∅, and µ_{0}(A_{n}) < ∞ for some n, then

lim_{n} µ_{0}(A_{n}) = 0.

The finiteness assumption matters: consider, under Lebesgue measure, the example

A_{1} = [1, ∞), A_{2} = [2, ∞), A_{3} = [3, ∞), . . . .

Then ∩^{∞}_{k=1}A_{k} = ∅, so µ(∩^{∞}_{k=1}A_{k}) = 0, but lim_{n→∞} µ(A_{n}) = ∞.

Caratheodory Extension Theorem. If µ_{0} on F_{0} is continuous from above at ∅, then there is a unique measure µ on F(F_{0}) agreeing with µ_{0} on F_{0} (see Halmos, p. 54).

The extension of a measure µ from a field C, as provided by the Caratheodory Extension Theorem stated above, is unique and may be expressed by the formula

µ(F) = inf Σ_{n} µ(C_{n}), (F ∈ F),

where the summation is over a countable collection C_{1}, C_{2}, . . . of sets in C whose union contains F, and the infimum is over all such collections.

As suggested by the construction of measures on B^{k} outlined above, starting from their specifications on a class of rectangles, if two measures µ_{1} and µ_{2} on a σ-field F agree on a subclass A ⊂ F closed under finite intersections, with Ω ∈ A, then they agree on the smallest σ-field, denoted σ(A), that contains A. The σ-field σ(A) is called the σ-field generated by A. On a metric space S, the σ-field B = B(S) generated by the class of all open sets is called the Borel σ-field.

2.3 Lebesgue integral

An indicator function g from R to R is a function which takes only the values 0 and 1. We call

A = {x ∈ R : g(x) = 1}

the set indicated by g. We define the Lebesgue integral of g to be

∫_{R} g dµ_{0} = µ_{0}(A).

A simple function h from R to R is a linear combination of indicators, i.e., a function of the form h(x) = Σ^{n}_{k=1} c_{k}g_{k}(x), where each g_{k} is of the form

g_{k}(x) = 1 if x ∈ A_{k}, 0 if x ∉ A_{k},

and each c_{k} is a real number. We define the Lebesgue integral of h to be Σ^{n}_{k=1} c_{k}µ_{0}(A_{k}). Let f be a nonnegative function defined on R, possibly taking the value ∞ at some points. We define the Lebesgue integral of f to be

∫_{R} f dµ_{0} = sup{∫_{R} h dµ_{0} : h is simple and h(x) ≤ f(x) for every x ∈ R}.

It is possible that this integral is infinite. If it is finite, we say that f is integrable.
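This supremum over simple functions can be computed explicitly in a toy case (my own construction): for f(x) = x on [0, 1], take h_{n} = Σ_{k} (k/2^{n}) 1_{A_{k}} with A_{k} = [k/2^{n}, (k + 1)/2^{n}), so that h_{n} ≤ f and µ_{0}(A_{k}) = 2^{−n}. The simple-function integrals increase to the true value 1/2.

```python
# Lower approximation of the Lebesgue integral of f(x) = x on [0, 1]
# by simple functions on a dyadic partition of the range.
for n in (2, 4, 8, 16):
    pieces = 2 ** n
    # sum_k c_k * mu_0(A_k) with c_k = k/2^n and mu_0(A_k) = 1/2^n
    lower = sum((k / pieces) * (1 / pieces) for k in range(pieces))
    print(f"n={n:>2}: integral of h_n = {lower:.6f}")
```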

Finally, let f be a function defined on R, possibly taking the value ∞ at some points and the value −∞ at other points. We define the positive and negative parts of f to be

f^{+}(x) = max{f(x), 0}, f^{−}(x) = max{−f(x), 0},

respectively, and we define the Lebesgue integral of f to be

∫_{R} f dµ_{0} = ∫_{R} f^{+} dµ_{0} − ∫_{R} f^{−} dµ_{0},

provided the right-hand side is not of the form ∞ − ∞. If both ∫_{R} f^{+} dµ_{0} and ∫_{R} f^{−} dµ_{0} are finite (or equivalently, ∫_{R} |f| dµ_{0} < ∞, since |f| = f^{+} + f^{−}), we say that f is integrable.

Let f be a function defined on R, possibly taking the value ∞ at some points and the value −∞ at other points. Let A be a subset of R. We define

∫_{A} f dµ_{0} = ∫_{R} 1_{A}f dµ_{0}.

The Lebesgue integral just defined is related to the Riemann integral in one very important way: if the Riemann integral ∫^{b}_{a} f(x)dx is defined, then the Lebesgue integral ∫_{[a,b]} f dµ_{0} agrees with the Riemann integral. The Lebesgue integral has two important advantages over the Riemann integral. The first is that the Lebesgue integral is defined for more functions, as we show in the following examples.

Example 2. Let Q be the set of rational numbers in [0, 1] and consider f = 1_{Q}. Being a countable set, Q has Lebesgue measure zero, and so the Lebesgue integral of f over [0, 1] is ∫_{[0,1]} f dµ_{0} = 0. To compute the Riemann integral ∫^{1}_{0} f(x)dx, we choose partition points 0 = x_{0} < x_{1} < · · · < x_{n} = 1 and divide the interval [0, 1] into subintervals [x_{0}, x_{1}], [x_{1}, x_{2}], . . . , [x_{n−1}, x_{n}]. In each subinterval [x_{k−1}, x_{k}] there is a rational point q_{k}, where f(q_{k}) = 1, and also an irrational point r_{k}, where f(r_{k}) = 0. We approximate the Riemann integral from above by the upper sum 1 and from below by the lower sum 0. No matter how fine we take the partition of [0, 1], the upper sum is always 1 and the lower sum is always 0. Since these two do not converge to a common value as the partition becomes finer, the Riemann integral is not defined.
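The measure-zero claim for Q can itself be made concrete (my own sketch): enumerate the rationals in [0, 1] and cover the kth one by an interval of length ε/2^{k}. The total length of the cover is at most ε, for every ε > 0, so µ_{0}(Q ∩ [0, 1]) = 0.

```python
from fractions import Fraction

def rationals_01(max_den):
    # enumerate distinct rationals p/q in [0, 1] with denominator <= max_den
    seen = []
    for q in range(1, max_den + 1):
        for p in range(q + 1):
            r = Fraction(p, q)
            if r not in seen:
                seen.append(r)
    return seen

rats = rationals_01(12)
eps = 0.01
# cover the k-th rational by an interval of length eps / 2^k
total = sum(eps / 2 ** k for k in range(1, len(rats) + 1))
print(f"{len(rats)} rationals covered, total cover length <= {total:.6f} < {eps}")
```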

Example 3. Consider the function

f(x) = ∞ if x = 0, 0 if x ≠ 0.

Every simple function which lies between 0 and f is of the form

h(x) = y if x = 0, 0 if x ≠ 0,

for some y ∈ [0, ∞), and thus has Lebesgue integral

∫_{R} h dµ_{0} = yµ_{0}({0}) = 0.

It follows that ∫_{R} f dµ_{0} = 0. Now consider the Riemann integral ∫^{∞}_{−∞} f(x)dx, which for this function f is the same as the Riemann integral ∫^{1}_{−1} f(x)dx. When we partition [−1, 1] into subintervals, one of these will contain the point 0, and when we compute the upper approximating sum for ∫^{1}_{−1} f(x)dx, this point will contribute ∞ times the length of the subinterval containing it. Thus the upper approximating sum is ∞. On the other hand, the lower approximating sum is 0, and again the Riemann integral does not exist.

The Lebesgue integral has all the linearity and comparison properties one would expect of an integral. In particular, for any two functions f and g and any real constant c,

∫_{R} (f + g) dµ_{0} = ∫_{R} f dµ_{0} + ∫_{R} g dµ_{0},

∫_{R} cf dµ_{0} = c ∫_{R} f dµ_{0},

∫_{R} f dµ_{0} ≤ ∫_{R} g dµ_{0} when f(x) ≤ g(x) for all x,

and, for disjoint sets A and B,

∫_{A∪B} f dµ_{0} = ∫_{A} f dµ_{0} + ∫_{B} f dµ_{0}.

There are three convergence theorems satisfied by the Lebesgue integral. In each of these the situation is that there is a sequence of functions f_{n}, n = 1, 2, . . ., converging pointwise to a limiting function f. Pointwise convergence just means that

lim_{n→∞} f_{n}(x) = f(x) for every x ∈ R.

There are no such theorems for the Riemann integral, because the Riemann integral of the limiting function f is too often not defined. Before we state the theorems, we give two examples of pointwise convergence which arise in probability theory.

Example 4. Consider a sequence of normal densities, each with variance 1 and the nth having mean n:

f_{n}(x) = (1/√(2π)) exp(−(x − n)^{2}/2).

These converge pointwise to the zero function f = 0. We have ∫_{R} f_{n} dµ_{0} = 1 for every n, but ∫_{R} f dµ_{0} = 0.

Example 5. Consider a sequence of normal densities, each with mean 0 and the nth having variance 1/n:

f_{n}(x) = (√n/√(2π)) exp(−x^{2}/(2n^{−1})).

These converge pointwise to the function

f(x) = ∞ if x = 0, 0 if x ≠ 0.

We have ∫_{R} f_{n} dµ_{0} = 1 for every n, but ∫_{R} f dµ_{0} = 0.
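A numeric version of this example (my own sketch, with the integral approximated by the trapezoid rule on [−10, 10], outside which the tails are negligible for these n): each f_{n} integrates to 1, yet f_{n}(x) → 0 for any fixed x ≠ 0, so the limit of the integrals differs from the integral of the pointwise limit.

```python
import math

def f(n, x):
    # N(0, 1/n) density
    return math.sqrt(n / (2 * math.pi)) * math.exp(-n * x * x / 2)

def integral(n, m=200_000):
    # trapezoid rule for the integral of f_n over [-10, 10]
    a, b = -10.0, 10.0
    h = (b - a) / m
    s = (f(n, a) + f(n, b)) / 2 + sum(f(n, a + i * h) for i in range(1, m))
    return s * h

for n in (1, 10, 100):
    print(f"n={n:>3}: integral ~ {integral(n):.4f}, f_n(0.5) = {f(n, 0.5):.2e}")
```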

Theorem 3 (Fatou's Lemma) Let f_{n}, n = 1, 2, . . ., be a sequence of nonnegative functions converging pointwise to a function f. Then

∫_{R} f dµ_{0} ≤ lim inf_{n→∞} ∫_{R} f_{n} dµ_{0}.

The key assumption in Fatou's Lemma is that all the functions take only nonnegative values. Fatou's Lemma does not assume much, but it is not entirely satisfying because it does not conclude that

∫_{R} f dµ_{0} = lim_{n→∞} ∫_{R} f_{n} dµ_{0}.

There are two sets of assumptions which permit this stronger conclusion.

Theorem 4 (Monotone Convergence Theorem) Let f_{n}, n = 1, 2, . . . be a sequence of functions converging pointwise to a function f . Assume that

0 ≤ f_{1}(x) ≤ f_{2}(x) ≤ · · · for every x ∈ R.

Then

∫_{R}f dµ_{0} = lim_{n→∞} ∫_{R}f_{n}dµ_{0},

where both sides are allowed to be ∞.

Theorem 5 (Dominated Convergence Theorem) Let f_{n}, n = 1, 2, . . . be a sequence of functions converging pointwise to a function f . Assume that there is a nonnegative integrable function g (i.e., ∫_{R}gdµ_{0} < ∞) such that

|f_{n}(x)| ≤ g(x) for every x ∈ R and every n.

Then

∫_{R}f dµ_{0} = lim_{n→∞} ∫_{R}f_{n}dµ_{0},

and both sides will be finite.
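As a numerical sanity check (my own toy example, not one from the text), take f_{n}(x) = (1 + x/n)^{n}e^{−2x} on [0, ∞): since (1 + x/n)^{n} ≤ e^{x}, every f_{n} is dominated by the integrable g(x) = e^{−x}, and f_{n} converges pointwise to e^{−x}, whose integral is 1. The dominated convergence theorem then predicts that the integrals of f_{n} tend to 1:

```python
import math

def f_n(n, x):
    # f_n(x) = (1 + x/n)^n * exp(-2x); since (1 + x/n)^n <= exp(x) for x >= 0,
    # every f_n is dominated by the integrable g(x) = exp(-x)
    return (1 + x / n) ** n * math.exp(-2 * x)

def midpoint_integral(f, a, b, steps=20000):
    # crude midpoint rule standing in for the integral over [a, b]
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

# pointwise limit is exp(x) * exp(-2x) = exp(-x), with integral 1 over [0, oo);
# the tail beyond 40 is negligible, so [0, 40] is a fine proxy
for n in (1, 10, 100, 1000):
    print(n, round(midpoint_integral(lambda x: f_n(n, x), 0.0, 40.0), 4))
```

For n = 1 the integral is exactly 3/4, and the values climb toward 1 as the theorem predicts.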

2.4 Related results in probability theory

Theorem 6 (Bounded Convergence Theorem) Suppose that X_{n} converges to X in probability and that there exists a constant M such that P (|X_{n}| ≤ M ) = 1. Then E(X_{n}) → E(X).

Proof. Let {x_{i}} be a partition of R such that F_{X} is continuous at each x_{i}. Then

∑_{i} x_{i}P {x_{i} < X_{n} ≤ x_{i+1}} ≤ E(X_{n}) ≤ ∑_{i} x_{i+1}P {x_{i} < X_{n} ≤ x_{i+1}},

and taking limits (convergence in probability implies convergence in distribution, so P {x_{i} < X_{n} ≤ x_{i+1}} → P {x_{i} < X ≤ x_{i+1}}) we have

∑_{i} x_{i}P {x_{i} < X ≤ x_{i+1}} ≤ lim inf E(X_{n}) ≤ lim sup E(X_{n}) ≤ ∑_{i} x_{i+1}P {x_{i} < X ≤ x_{i+1}}.

As max |x_{i+1} − x_{i}| → 0, the left and right sides converge to E(X), giving the theorem.

Theorem 7 (Monotone Convergence Theorem) Suppose 0 ≤ X_{n} ≤ X and X_{n}
converges to X in probability. Then lim_{n→∞}E(X_{n}) = E(X).

Proof. For M > 0,

E(X) ≥ E(X_{n}) ≥ E(X_{n} ∧ M ) → E(X ∧ M ),

where the convergence on the right follows from the bounded convergence theorem. It follows that

E(X ∧ M ) ≤ lim inf_{n→∞} E(X_{n}) ≤ lim sup_{n→∞} E(X_{n}) ≤ E(X).

Letting M → ∞ gives E(X ∧ M ) ↑ E(X), which completes the proof.

Theorem 8 (Dominated Convergence Theorem) Assume X_{n} and Y_{n} converge to X and Y , respectively, in probability. Also, |X_{n}| ≤ Y_{n} and E(Y_{n}) → E(Y ) < ∞. Then lim_{n→∞}E(X_{n}) = E(X).

Its proof follows from Fatou's Lemma.
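A small Monte Carlo sketch of the bounded convergence theorem (my own toy example, not the text's): with X ~ Uniform(0, 1) and X_{n} = X^{1+1/n}, each X_{n} is bounded by M = 1 and X_{n} → X, so E(X_{n}) = 1/(2 + 1/n) should approach E(X) = 1/2.

```python
import random

random.seed(0)
xs = [random.random() for _ in range(100000)]   # samples of X ~ Uniform(0, 1)

def mean_Xn(n):
    # X_n = X**(1 + 1/n) is bounded by M = 1 and converges to X pointwise
    return sum(x ** (1 + 1 / n) for x in xs) / len(xs)

# bounded convergence: E(X_n) = 1/(2 + 1/n) -> E(X) = 1/2
for n in (1, 10, 100):
    print(n, round(mean_Xn(n), 3))
```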

### 3 Modes of Convergence

On Ω there is defined a sequence of real-valued functions X_{1}(w), X_{2}(w), . . . which are random variables in the sense of the following definition.

Definition A function X(w) defined on Ω is called a random variable if, for every Borel set B in the real line R, the set {w : X(w) ∈ B} is in F . (That is, X(w) is a measurable function on (Ω, F ).)

3.1 Convergence in Distribution

Suppose we flip a fair coin 400 times and want to find out the probability
of getting heads between 190 and 210. A standard practice is to invoke the
Central Limit Theorem to get an approximation of the above probability. Let
S_{400} denote the number of heads in the 400 flips. For this particular problem,
our major concern is P (190 ≤ S_{400} ≤ 210) or whether this probability can be
approximated well by P (−1.05 ≤ Z ≤ 1.05). Here Z is a standard normal ran-
dom variable. In this example, we need the concept of convergence in distribution.
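This approximation is easy to check numerically. Below is a sketch (standard library only, my own code) comparing the exact binomial probability with the normal approximation; the ±1.05 endpoints come from the continuity correction, (189.5 − 200)/10 = −1.05 and (210.5 − 200)/10 = 1.05.

```python
import math

# exact: S_400 ~ Binomial(400, 1/2), P(190 <= S_400 <= 210)
exact = sum(math.comb(400, k) for k in range(190, 211)) / 2 ** 400

def Phi(z):
    # standard normal distribution function via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# CLT: S_400 is approximately N(200, 10^2), so with continuity correction
# P(190 <= S_400 <= 210) ~ P(-1.05 <= Z <= 1.05)
approx = Phi(1.05) - Phi(-1.05)
print(round(exact, 4), round(approx, 4))
```

Both numbers agree to about three decimal places, which is the practical content of convergence in distribution here.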

Consider distribution functions F_{1}(·), F_{2}(·), . . . and F (·). Let X_{1}, X_{2}, . . . and X
denote random variables (not necessarily on a common probability space) hav-
ing these distributions, respectively. We say that X_{n} converges in distribution
(or in law) to X if

lim_{n→∞} F_{n}(t) = F (t), for all t which are continuity points of F .

This is written X_{n} →^{d} X or X_{n} →^{L} X or F_{n} →^{w} F . What are convergent here are not the values of the random variables themselves, but the probabilities with which the random variables assume certain values. If X_{n} →^{d} X, then the distribution of X_{n} can be well approximated for large n by the distribution of X. This observation is extremely useful since F_{X} is often easier to compute than F_{X_{n}}.

In general, we would like to say that the distribution of the random variables X_{n} converges to the distribution of X if F_{n}(x) = P (X_{n} < x) → F (x) = P (X < x) for every x ∈ R. But this is a bit too strong. We now use an example to illustrate why we require convergence only at the continuity points of F . Consider random variables X_{n} which take values 1 − n^{−1} or 1 + n^{−1}, each with probability 1/2. Heuristically, we would want the values of X_{n} to be more and more concentrated about 1. Note that the distribution of X_{n} is

F_{n}(x) =

0, x < 1 − n^{−1}

1/2, 1 − n^{−1} ≤ x < 1 + n^{−1}
1, x ≥ 1 + n^{−1}.

By calculation, we have F_{n}(x) → F^{∗}(x) as n → ∞ where

F^{∗}(x) =

0, x < 1,
1/2, x = 1,
1, x > 1.

On the other hand, consider the random variable X taking value 1 with probability 1. The distribution of X is

F (x) =

0, x < 1,
1, x ≥ 1.

Apparently, not much should be assumed about what happens for x at a discontinuity point of F (x). Therefore, we only require convergence in distribution at continuity points of F . Read Example 14.3-2 (p. 467) of Bishop, Feinberg and Holland (1975) for direct verification that F_{n} →^{w} F . Another important tool for establishing convergence in distribution is the moment-generating function or characteristic function. Read Example 14.3-3 (p. 467) of Bishop, Feinberg and Holland (1975). In a later section, we will use this tool to prove the central limit theorem (Chung [1974], Theorem 6.4.4).
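The example above can be verified by direct computation; the sketch below (illustrative code, not from the text) evaluates F_{n} at a large n and shows convergence at the continuity points but not at the jump.

```python
def F_n(n, x):
    # distribution of X_n, which takes values 1 - 1/n and 1 + 1/n w.p. 1/2 each
    if x < 1 - 1 / n:
        return 0.0
    if x < 1 + 1 / n:
        return 0.5
    return 1.0

def F(x):
    # distribution of X = 1 with probability 1
    return 1.0 if x >= 1 else 0.0

# at continuity points of F the limit matches F ...
print(F_n(10**6, 0.9), F(0.9))   # both 0.0
print(F_n(10**6, 1.1), F(1.1))   # both 1.0
# ... but at the discontinuity x = 1, F_n(1) = 1/2 for every n while F(1) = 1
print(F_n(10**6, 1.0), F(1.0))
```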

When we talk about convergence in distribution, w never comes into the picture. As an example, flip a fair coin once. Let X = 1 if we get a head and X = 0 otherwise. On the other hand, set Y = 1 − X. It is obvious that X and Y have the same distribution, even though X(w) ≠ Y (w) for every w. As a remark, the random variable X is a function of w but we can never observe w.

3.2 Convergence with Probability 1

Next, we discuss convergence with probability 1 (or strongly, almost surely, al-
most everywhere, etc.) which is closely related to the convergence of sequences
of functions in advanced calculus. This criterion of convergence is of partic-
ular importance in the probability limit theorems known as the laws of large
numbers. This is defined in terms of the entire sequence of random variables
X_{1}, X_{2}, . . . , X_{n}, . . .. Regarding such a sequence as a new random variable with
realized value x_{1}, x_{2}, . . . , x_{n}, . . ., we may say that this realized sequence either
does or does not converge in the ordinary sense to a limit x. If the probability
that it does so is unity, then we say that X_{n} → X almost certainly. Consider
random variables X1, X2, · · · and X, we say that Xn converges with probability
1 (or almost surely) to X if

P (w : lim_{n→∞} X_{n}(w) = X(w)) = 1.

This is written X_{n} →^{wp1} X, n → ∞. To better understand this convergence, we give the following equivalent condition:

lim_{n→∞} P (|X_{m} − X_{n}| < ε for all m ≥ n) = 1, for every ε > 0.

Suppose we have to deal with questions of convergence when no limit
is in evidence. For convergence almost surely, this is immediately reducible
to the numerical case where the Cauchy criterion is applicable. Specifically,
{X_{n}} converges a.s. if and only if there exists a null set N such that for every w ∈ Ω − N and every ε > 0, there exists m(w, ε) such that

n′ > n ≥ m(w, ε) ⇒ |X_{n}(w) − X_{n′}(w)| ≤ ε.

Or, for any positive ε and η, there is an n_{0} such that

P {|X_{n} − X_{m}| > ε for at least one m ≥ n} < η

for all n ≥ n_{0}. As almost sure convergence depends on the simultaneous behavior of X_{n} for all n ≥ n_{0}, it is obviously more difficult to handle, but the following sufficient criterion is useful. If ∑^{∞}_{n=1}E{|X_{n} − X|^{p}} < ∞ for some p > 0, then X_{n} → X almost surely. This criterion follows from the observation (the last step is Markov's inequality):

P (|X_{m} − X| > ε for some m ≥ n) = P (∪^{∞}_{m=n}{|X_{m} − X| > ε})

≤ ∑^{∞}_{m=n} P (|X_{m} − X| > ε) ≤ ε^{−p} ∑^{∞}_{m=n} E|X_{m} − X|^{p} → 0.

3.2.1 Consistency of the Empirical Distribution Function

Let X_{1}, . . . , X_{n} be independent identically distributed random variables on R
with distribution function F (x) = P (X ≤ x). The nonparametric maximum-
likelihood estimate of F is the sample distribution function or empirical distri-
bution function defined as

F_{n}(x) = (1/n) ∑^{n}_{i=1} I_{[X_{i},∞)}(x).

Thus, F_{n}(x) is the proportion of the observations that are less than or equal to x.

For each fixed x, the strong law of large numbers implies that F_{n}(x) →^{a.s.} F (x), because we may consider the I_{[X_{i},∞)}(x) as i.i.d. random variables with mean F (x). Thus, F_{n}(x) is a strongly consistent estimate of F (x) for every x.

The following corollary improves on this observation in two ways. First, the set of probability one on which convergence takes place may be chosen to be independent of x. Second, the convergence is uniform in x. This assertion, that the empirical distribution function converges uniformly almost surely to the true distribution function, is known as the Glivenko-Cantelli Theorem.

COROLLARY. P {sup_{x}|F_{n}(x) − F (x)| → 0} = 1.

Proof. Let ε > 0. Find an integer k > 1/ε and numbers −∞ = x_{0} < x_{1} ≤ x_{2} ≤ · · · ≤ x_{k−1} < x_{k} = ∞, such that

F (x^{−}_{j}) ≤ j/k ≤ F (x_{j})

for j = 1, . . . , k − 1. [F (x^{−}_{j}) may be considered notation for P (X < x_{j}).] Note that if x_{j−1} < x_{j} then F (x^{−}_{j}) − F (x_{j−1}) ≤ ε. From the strong law of large numbers, F_{n}(x_{j}) →^{a.s.} F (x_{j}) and F_{n}(x^{−}_{j}) →^{a.s.} F (x^{−}_{j}) for j = 1, . . . , k − 1. Hence,

Δ_{n} = max(|F_{n}(x_{j}) − F (x_{j})|, |F_{n}(x^{−}_{j}) − F (x^{−}_{j})|, j = 1, . . . , k − 1) →^{a.s.} 0.

Let x be arbitrary and find j such that x_{j−1} < x ≤ x_{j}. Then,

F_{n}(x) − F (x) ≤ F_{n}(x^{−}_{j}) − F (x_{j−1}) ≤ F_{n}(x^{−}_{j}) − F (x^{−}_{j}) + ε,

and

F_{n}(x) − F (x) ≥ F_{n}(x_{j−1}) − F (x^{−}_{j}) ≥ F_{n}(x_{j−1}) − F (x_{j−1}) − ε.

This implies that

sup_{x} |F_{n}(x) − F (x)| ≤ Δ_{n} + ε →^{a.s.} ε.

Since this holds for all ε > 0, the corollary follows.
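A simulation sketch of the Glivenko-Cantelli theorem (my own setup: Uniform(0, 1) samples, so F (x) = x; for a general F one can reduce to this case by the probability integral transform). Since F_{n} is a step function, the supremum of |F_{n}(x) − F (x)| is attained just before or at the jump points, which the helper below exploits.

```python
import random

random.seed(1)

def ks_distance(sample):
    # sup_x |F_n(x) - F(x)| against F(x) = x (Uniform(0, 1));
    # for the step function F_n the sup occurs at the jumps of F_n
    xs = sorted(sample)
    n = len(xs)
    return max(max(abs((i + 1) / n - x), abs(i / n - x))
               for i, x in enumerate(xs))

data = [random.random() for _ in range(10000)]
for n in (10, 100, 1000, 10000):
    print(n, round(ks_distance(data[:n]), 3))   # sup distance shrinks with n
```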

3.2.2 Law of Large Numbers

The weak (strong) law of large numbers states that the sample mean is a weakly (strongly) consistent estimate of the population mean. The weak law of large numbers says that if X_{1}, . . . , X_{n} are i.i.d. random variables with finite first moment µ, then for every ε > 0 we have

P (|X̄_{n} − µ| > ε) → 0

as n → ∞. The argument using Chebyshev's inequality, under a finite second moment, shows that

P (|X̄_{n} − µ| > ε) → 0

at rate 1/n. On the other hand, we can show that X̄_{n} converges to µ weakly (strongly) as long as E|X| < ∞.

3.3 Convergence in Probability

We say that X_{n} converges in probability to X as n → ∞ if, for any positive ε,

lim_{n→∞} P (w : |X_{n}(w) − X(w)| > ε) = 0.

This is written X_{n} →^{P} X, as n → ∞. A necessary and sufficient condition for such convergence is that for any positive ε and η there is an n_{0} such that
P (w : |X_{n}(w) − X(w)| > ε) < η for all n ≥ n_{0}.

A numerical constant c can always be viewed as a degenerate random variable C whose distribution has all of its probability concentrated on the single value c.

As an example, the weak law of large numbers states that the random variable sample mean converges in probability to a population mean (a constant).

Now we use the following theorem and example to illustrate the difference between convergence with probability 1 and convergence in probability. For convergence in probability, one needs, for every ε > 0, that the probability that X_{n} is within ε of X tends to one. For convergence almost surely, one needs, for every ε > 0, that the probability that X_{k} stays within ε of X for all k ≥ n tends to one as n tends to infinity.

Theorem 9 The sequence {X_{n}} of random variables converges to a random
variable X with probability 1 if and only if

lim_{n→∞} P {∪^{∞}_{m=n}(|X_{m} − X| ≥ ε)} = 0

for every ε > 0.

By the above theorem, convergence in probability is weaker than con- vergence with probability 1. The following example is used to illustrate the difference.

Example 6. Let Ω = [0, 1], and let S be the class of all Borel sets on Ω. Let
P be the Lebesgue measure. For any positive integer n, choose integer m with
2^{m} ≤ n < 2^{m+1}. Clearly, n → ∞ if and only if m → ∞. We can write n ≥ 1 as
n = 2^{m}+ k, k = 0, 1, . . . , 2^{m}− 1. Let us define X_{n} on Ω by

X_{n}(w) =

1, if w ∈ [k/2^{m}, (k + 1)/2^{m}],
0, otherwise,

if n = 2^{m} + k. Then X_{n} is a random variable which satisfies

P {|X_{n}(w)| ≥ ε} =

1/2^{m}, if 0 < ε < 1,
0, if ε ≥ 1,

so that X_{n} →^{P} 0. However, X_{n} does not converge to 0 with probability 1. In
fact, for any w ∈ [0, 1], there are an infinite number of intervals of the form
[k/2^{m}, (k + 1)/2^{m}] which contain w. Such a sequence of intervals depends on
w. Let us denote it by

{[k_{m}/2^{m}, (k_{m} + 1)/2^{m}], m = 1, 2, . . .},

and let n_{m} = 2^{m} + k_{m}. Then X_{n_{m}}(w) = 1, but X_{n}(w) = 0 if n ≠ n_{m}. It follows that {X_{n}} does not converge at w. Since w is arbitrary, X_{n} does not converge with probability 1 to any random variable.
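A short sketch of Example 6 (illustrative code, not from the text): for a fixed w the indicator fires exactly once in every dyadic block 2^{m} ≤ n < 2^{m+1}, so X_{n}(w) keeps returning to 1 and cannot converge, while the interval lengths P (|X_{n}| ≥ ε) = 2^{−m} shrink to 0.

```python
def X(n, w):
    # X_n of Example 6: write n = 2^m + k with 0 <= k < 2^m and test
    # whether w lies in the sliding interval [k/2^m, (k+1)/2^m]
    m = n.bit_length() - 1          # 2^m <= n < 2^(m+1)
    k = n - 2 ** m
    return 1 if k / 2 ** m <= w <= (k + 1) / 2 ** m else 0

w = 0.3
hits = [n for n in range(1, 2 ** 10) if X(n, w) == 1]
print(hits[:6])   # one hit per dyadic block: X_n(w) = 1 infinitely often

# interval lengths P(|X_n| >= eps) = 2^(-m) -> 0: X_n -> 0 in probability
print([2.0 ** -(n.bit_length() - 1) for n in (1, 10, 100, 1000)])
```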

3.3.1 Borel-Cantelli Lemma

First, we give an example to illustrate the difference between convergence in probability and convergence in distribution. Consider {X_{n}} where X_{n} is uniformly distributed on the set of points {1/n, 2/n, . . . , 1}. It can be shown easily that X_{n} →^{L} X where X is uniformly distributed over (0, 1). Can we answer the question whether X_{n} →^{P} X?

Next, we give the Borel-Cantelli Lemma and the concept of infinitely often, which are often used in proving strong laws of large numbers. For events A_{j}, j = 0, 1, . . ., the event {A_{j} i.o.} (read A_{j} infinitely often) stands for the event that infinitely many A_{j} occur.

THE BOREL-CANTELLI LEMMA. If ∑^{∞}_{j=1}P (A_{j}) < ∞, then P {A_{j} i.o.} = 0. Conversely, if the A_{j} are independent and ∑^{∞}_{j=1}P (A_{j}) = ∞, then P {A_{j} i.o.} = 1.

Proof. (The general half) If infinitely many of the A_{j} occur, then for all n, at least one A_{j} with j ≥ n occurs. Hence,

P {A_{j} i.o.} ≤ P (∪^{∞}_{j=n}A_{j}) ≤ ∑^{∞}_{j=n} P (A_{j}) → 0.

The proof of the converse can be found in standard probability textbooks.

A typical example of the use of the Borel-Cantelli Lemma occurs in coin
tossing. Let X_{1}, X_{2}, . . . be a sequence of independent Bernoulli trials with
probability of success on the nth trial equal to p_{n}. What is the probability of
an infinite number of successes? Or, equivalently, what is P {X_{n} = 1 i.o.}?

From the Borel-Cantelli Lemma and its converse, this probability is zero or one depending on whether ∑ p_{n} < ∞ or not. If p_{n} = 1/n^{2}, for example, then P {X_{n} = 1 i.o.} = 0. If p_{n} = 1/n, then P {X_{n} = 1 i.o.} = 1.
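The dichotomy rests only on the behavior of the series ∑ p_{n}, which is easy to examine numerically (a sketch, not from the text):

```python
# Borel-Cantelli: P(X_n = 1 i.o.) is 0 or 1 according as sum(p_n) converges
def partial_sum(p, terms):
    return sum(p(n) for n in range(1, terms + 1))

# p_n = 1/n^2: partial sums stabilize near pi^2/6, so the series converges
# and with probability one only finitely many successes occur
print([round(partial_sum(lambda n: 1 / n ** 2, t), 4) for t in (10, 1000, 100000)])

# p_n = 1/n: partial sums ~ log(n) keep growing, so the series diverges and,
# by independence, successes occur infinitely often with probability one
print([round(partial_sum(lambda n: 1 / n, t), 2) for t in (10, 1000, 100000)])
```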

The Borel-Cantelli Lemma is useful in dealing with problems involving almost sure convergence because X_{n} →^{a.s.} X is equivalent to

P {|X_{n} − X| > ε i.o.} = 0, for all ε > 0.

3.4 Convergence in rth Mean

We say that X_{n} converges in rth mean to X if

lim_{n→∞} E|X_{n} − X|^{r} = 0.

This is written X_{n} →^{rth} X, n → ∞. We say that X is dominated by Y if |X| ≤ Y almost surely, and that the sequence {X_{n}} is dominated by Y iff this is true for each X_{n} with the same Y . We say that X or {X_{n}} is uniformly bounded iff the Y above may be taken to be a constant. Observe that, for any ε > 0,

E|X_{n} − X|^{r} = E|X_{n} − X|^{r}1_{{|X_{n}−X|≤ε}} + E|X_{n} − X|^{r}1_{{|X_{n}−X|>ε}} ≤ ε^{r} + 2^{r}E Y^{r}1_{{|X_{n}−X|>ε}},

using |X_{n} − X| ≤ |X_{n}| + |X| ≤ 2Y on the second event. We then conclude that X_{n} →^{rth} X if X_{n} →^{P} X and {X_{n}} is dominated by some Y that belongs to L^{r}, since the last term then tends to 0.

We now use a Chebyshev type of "weak law of large numbers" to demonstrate a method for determining the large sample behavior of linear combinations of random variables.

Theorem (Chebyshev). Let X_{1}, X_{2}, . . . be uncorrelated with means µ_{1}, µ_{2}, . . . and variances σ_{1}^{2}, σ_{2}^{2}, . . .. If ∑^{n}_{i=1}σ_{i}^{2} = o(n^{2}), n → ∞, then

(1/n) ∑^{n}_{i=1} X_{i} − (1/n) ∑^{n}_{i=1} µ_{i} →^{P} 0.
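A Monte Carlo sketch of the theorem under an assumed setup of my own (not the text's): independent normal X_{i} with µ_{i} = 0 and σ_{i}^{2} = √i, so ∑^{n}_{i=1}σ_{i}^{2} is of order n^{3/2} = o(n^{2}) and the centered sample mean should shrink toward 0.

```python
import random

random.seed(2)

def centered_mean(n):
    # X_i ~ N(0, sqrt(i)): mu_i = 0, sigma_i^2 = i^(1/2), so the sum of
    # variances grows like (2/3) n^(3/2) = o(n^2), as Chebyshev requires
    return sum(random.gauss(0, i ** 0.25) for i in range(1, n + 1)) / n

# Var(sample mean) ~ (2/3)/sqrt(n) -> 0, so realizations concentrate near 0
for n in (100, 1000, 10000):
    print(n, round(centered_mean(n), 3))
```

Note that `random.gauss` takes the standard deviation, so σ_{i} = i^{1/4} gives variance √i.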