### Methods for Statistical Prediction

### Financial Time Series I

### Topic 1: Review on Hypothesis Testing

### Hung Chen

### Department of Mathematics National Taiwan University

### 9/26/2002

OUTLINE

1. Fundamental Concepts
2. Neyman-Pearson Paradigm
3. Examples
4. Optimal Test
5. Observational Studies
6. Likelihood Ratio Test
7. One-sample and Two-sample Tests

Motivating Example on Hypothesis Testing

ESP experiment: guess the color of 52 cards, drawn with replacement.

• Experiment: Generate data to test the hypotheses.

• T : number of correct guesses in 10 trials

• H_{0} : T ∼ Bin(10, 0.5) versus H_{1} : T ∼ Bin(10, p) with p > 1/2

• Consider the test statistic T and the rejection region R = {8, 9, 10}.

• Compute the probability of committing a Type I error:

α = P (T ∈ R) = P (T ≥ 8) = 0.0439 + 0.0098 + 0.0010 = 0.0547.

• When the rejection region is R = {7, 8, 9, 10},

α = P (T ≥ 7) = 0.1172 + 0.0547 = 0.1719.

• Calculation of power when R = {8, 9, 10}: we compute what the power will be under various values of p.

p = 0.6:  P (T ≥ 8 | p = 0.6) = 0.1673
p = 0.7:  P (T ≥ 8 | p = 0.7) = 0.3828
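These level and power calculations are easy to verify numerically. A minimal sketch in plain Python (the helper name `binom_tail` is ours, not from the notes):

```python
from math import comb

def binom_tail(k, n, p):
    """P(T >= k) for T ~ Bin(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

alpha = binom_tail(8, 10, 0.5)      # level of the test with R = {8, 9, 10}
power_06 = binom_tail(8, 10, 0.6)   # power at p = 0.6
power_07 = binom_tail(8, 10, 0.7)   # power at p = 0.7
print(round(alpha, 4), round(power_06, 4), round(power_07, 4))
# 0.0547 0.1673 0.3828
```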


• Idea: A statistical test of a hypothesis is a rule which assigns each possible observation to one of two exclusive categories: consistent with the hypothesis under consideration and not consistent with the hypothesis.

• Will we make mistake?

Two Types of Error

                               Reality
                        H_{0} true      H_{0} false
Test says reject H_{0}  Type I Error    Good
Test cannot reject H_{0} Good           Type II Error

• Usually, P (Type I Error) is denoted by α and P (Type II Error) is denoted by β.

• In ESP experiment, α increases when R moves from {8, 9, 10} to {7, 8, 9, 10} but β decreases.

• Statistical hypothesis testing is a formal means of choosing between two distributions on the basis of a particular statistic or random variable generated from one of them.

– How do we accommodate the uncertainty in the observed data?

– How do we evaluate a method?

• Neyman-Pearson Paradigm
– Null hypothesis H_{0}

– Alternative hypothesis H_{A} or H_{1}

– The objective is to select one of the two based on the available data.

– A crucial feature of hypothesis testing is that the two competing hypotheses are not treated in the same way: one is given the benefit of the doubt, the other has the burden of proof.

The one that gets the benefit of the doubt is called the null hypothesis. The other is called the alternative hypothesis.

– By definition, the default is H_{0}. When we carry out a test, we are asking whether the available data is significant evidence in favor of H_{1}. We are not testing whether H_{1} is true; rather, we are testing whether the evidence supporting H_{1} is statistically significant.

– The conclusion of a hypothesis test is that we either reject the null hypothesis (and accept the alternative) or we fail to reject the null hypothesis.

Failing to reject H_{0} does not quite mean
that the evidence supports H_{0}; rather, it
means that the evidence does not strongly
favor H_{1}.

Again, H_{0} gets the benefit of the doubt.

– Examples:

∗ Suppose we want to determine if stocks picked by experts generally perform better than stocks picked by darts. We might conduct a hypothesis test to determine if the available data should persuade us that the experts do better. In this case, we would have

H_{0}: experts not better than darts
H_{1}: experts better than darts

∗ Suppose we are skeptical about the effectiveness of a new product in promoting dense hair growth. We might conduct a test to determine if the data shows that the new product stimulates hair growth. This suggests

H_{0}: New product does not promote
hair growth

H_{1}: New product does promote hair
growth

Choosing the hypotheses this way puts
the onus on the new product; unless
there is strong evidence in favor of H_{1},
we stick with H_{0}.

∗ Suppose we are considering changing the packaging of a product in the hope of boosting sales. Switching to a new package is costly, so we will only undertake the switch if there is significant evidence that sales will increase.

We might test-market the change in one or two cities and then evaluate the results using a hypothesis test. Since the burden of proof is on the new package, we should set the hypotheses as follows:

H_{0}: New package does not increase
sales

H_{1}: New package does increase sales
– There are two types of hypotheses: simple hypotheses, which completely specify the distribution, and composite hypotheses, which do not.

– Simple hypotheses test one value of the parameter against another, the form of the distribution remaining fixed.

– Here is an example where both hypotheses are composite:

H_{0}: X_{i} is Poisson with unknown parameter
H_{1}: X_{i} is not Poisson

Steps for setting up a test:

1. Define the null hypothesis H_{0} (devil's advocate).
Put the hypothesis that you don't believe as H_{0}.

2. Define the alternative H_{A} (one-sided/two-sided).

3. Find the test statistic.
Use heuristic or systematic methods.

4. Decide on the probability α of a Type I error that you are willing to take.

5. Compute the probability, under the null hypothesis, of observing data at least as extreme as the data actually observed: the p-value.

6. Compare the p-value to α; if it is smaller, reject H_{0}.
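For the ESP experiment, these steps can be sketched in a few lines of Python (our own illustration; the function names are not from the notes):

```python
from math import comb

def p_value_binom(t_obs, n, p0):
    """Step 5: P(T >= t_obs) under H0: T ~ Bin(n, p0),
    for the one-sided alternative p > p0."""
    return sum(comb(n, j) * p0**j * (1 - p0)**(n - j)
               for j in range(t_obs, n + 1))

def run_test(t_obs, n=10, p0=0.5, alpha=0.05):
    """Step 6: reject H0 when the p-value is below alpha."""
    p = p_value_binom(t_obs, n, p0)
    return p, p < alpha

print(run_test(8))   # (0.0546875, False): 8 correct out of 10 just misses alpha = 0.05
```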


Example 1: Sex bias in graduate admission

• The graduate division of the University of California at Berkeley attempted to study the possibility that sex bias operated in graduate admissions in 1973 by examining admission data.

• In this case, what does the hypothesis of no sex bias correspond to? It is natural to translate this into

P [Admit | Male] = P [Admit | Female].

• Data

– There were 8,442 men who applied for admission to graduate school that quarter, and 4,321 women.

– About 44% of the men and 35% of the women were admitted.

– How do we perform this two-sample test?

• What is the conclusion?

– two-sample test:

z = (0.44 − 0.35) / √(0.44 × 0.56/8442 + 0.35 × 0.65/4321) = 9.948715.

– p-value is 1.283 × 10^{−22} when H_{1} is the one-sided alternative P [Admit | Male] > P [Admit | Female].

– p-value is 2.566 × 10^{−22} when H_{1} is the two-sided alternative P [Admit | Male] ≠ P [Admit | Female].
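The z statistic is easy to reproduce (a plain-Python sketch using the unpooled-variance form of the two-sample proportion test):

```python
from math import sqrt, erfc

# Berkeley admissions: unpooled two-sample z statistic for proportions
p_m, n_m = 0.44, 8442   # men: admission rate, number of applicants
p_f, n_f = 0.35, 4321   # women: admission rate, number of applicants

se = sqrt(p_m * (1 - p_m) / n_m + p_f * (1 - p_f) / n_f)
z = (p_m - p_f) / se
p_one_sided = 0.5 * erfc(z / sqrt(2))   # upper-tail normal probability

print(round(z, 4))          # 9.9487
print(p_one_sided < 1e-20)  # True: overwhelming evidence of a difference
```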


Example 2: Effectiveness of Therapy

• Suppose that a new drug is being considered with a view to curing a certain disease.

• How do we evaluate its effectiveness?

• The drug is given to n patients suffering from the disease and the number x of cures is noted.

• We wish to test the hypothesis that there is at least a 50–50 chance of a cure by this drug, based on the following data:

x cures among n patients.

• Put the problem in the following framework of statistical test:

– The sample space X is simple: it is the set {0, 1, 2, . . . , n}. (That is, X can take on the values 0, 1, 2, . . . , n.)

– The family {P_{θ}} of possible distributions
on X is (assuming independent patients)

the family of binomial distributions, parametrized by the real parameter θ taking values in

[0, 1].

– θ is being interpreted as the probability of cure.

– X ∼ Bin(n, θ)

– The stated hypothesis defines the subset
Θ_{0} = [1/2, 1] of the parameter space.

H_{0} : θ ≥ 1/2

– In this situation, there is only a small class of tests which seem worth considering on a purely intuitive basis.

We will only consider those for which the set of x taken to be consistent with Θ_{0} has the form {x : x ≥ k}.

– Question: Does it make sense to consider x cures out of n patients consistent with Θ_{0}, while x + 1 cures are not?

– What is a reasonable test?


A recipe: Optimal tests for simple hypotheses

• Null hypothesis H_{0} : f = f_{0}

• Alternative hypothesis H_{A} : f = f_{1}

• Want to find a rejection region R such that the errors of both types are as small as possible. Here

∫_{R} f_{0}(x) dx = α  and  1 − β = ∫_{R} f_{1}(x) dx.

• Neyman-Pearson Lemma:

For testing f_{0}(x) against f_{1}(x), a critical region of the form

{x : Λ(x) = f_{1}(x)/f_{0}(x) ≥ K},

where K is a constant, has the greatest power (smallest β) in the class of tests with the same α.

– Let R denote the rejection region determined by Λ(x) and let S denote the rejection region of any other testing procedure.

– α_{R} = ∫_{R} f_{0}(x)dx, α_{S} = ∫_{S} f_{0}(x)dx, and α_{R}, α_{S} ≤ α.

– Comparing the powers,

(1 − β_{R}) − (1 − β_{S}) = ∫_{R} f_{1}dx − ∫_{S} f_{1}dx = ∫_{R∩S^{c}} f_{1}dx − ∫_{S∩R^{c}} f_{1}dx.

Since f_{1} ≥ Kf_{0} on R and f_{1} ≤ Kf_{0} on R^{c}, we have

(1 − β_{R}) − (1 − β_{S}) ≥ K (∫_{R∩S^{c}} f_{0}dx − ∫_{S∩R^{c}} f_{0}dx) = K (∫_{R} f_{0}dx − ∫_{S} f_{0}dx) = K(α_{R} − α_{S}).

– When α_{R} = α_{S} = α, we get 1 − β_{R} ≥ 1 − β_{S}, i.e., β_{R} ≤ β_{S}.


Why is the Neyman-Pearson framework accepted?

• A test whose error probabilities are as small as possible is clearly desirable. However, we cannot choose the critical region in such a way that α(θ) and β(θ) are simultaneously uniformly minimized.

By taking the critical region as the empty set, we can make α(θ) = 0, and by taking the critical region as the sample space, we can make β(θ) = 0. Hence a test which uniformly minimized both error-probability functions would have to have zero error probabilities, and usually no such test exists.

• The modification suggested by Neyman and Pearson is based on the fact that in most circumstances our attitudes to the hypotheses Θ_{0} and Θ − Θ_{0} are different: we are often asking if there is sufficient evidence to reject the hypothesis Θ_{0}.

In terms of the two possible errors, this may be translated into the statement that often the Type I error is more serious than the Type II error.

• We should control the probability of the Type I error at some pre-assigned small value α, and then, subject to this control, look for a test which uniformly minimizes the function describing the probabilities of Type II error.

• Is this asymmetry between H_{0} and H_{1} reasonable? Can you come up with an example with a business application?

– Suppose we use this testing technique in searching for regions of the genome that resemble other regions that are known to have significant biological activity.

– One way of doing this is to align the known and unknown regions and compute statistics based on the number of matches.

– To determine significant values of these statistics, a (more complicated) version of the following is done.

Thresholds (critical values) are set so that if the matches occur at random and the probability of a match is 1/2, then the probability of exceeding the threshold (a Type I error) is smaller than α.

– No one really believes that H_{0} is true and possible types of alternatives are vaguely known at best, but computation under H_{0} is easy.

Now we use the following example to motivate the Neyman-Pearson lemma. We start from the simplest possible situation, that where Θ has only two elements θ_{0} and θ_{1}, say, and where Θ_{0} = {θ_{0}}, Θ − Θ_{0} = {θ_{1}}. Note that a hypothesis which specifies a set in the parameter space containing only one element is called a simple hypothesis. Thus we are now considering testing a simple null hypothesis against a simple alternative. In this case, the power function of any test reduces to a single number, and we examine the question of the existence of a most-powerful test of given significance level α.

Revisit the example of x cures out of n patients, now with n = 5. We wish to test

H_{0} : p = 0.5 versus H_{1} : p = 0.3.

• The probability distribution of X is

X = x               0      1      2      3      4      5
p = 0.5             0.031  0.156  0.313  0.313  0.156  0.031
p = 0.3             0.168  0.360  0.309  0.132  0.028  0.003
f_{1}(x)/f_{0}(x)   5.419  2.308  0.987  0.422  0.179  0.097

• Think of the meaning of likelihood ratio f_{1}(x)/f_{0}(x).

• We consider all possible nonrandomized tests of significance level 0.2.

critical region   α      1 − β
{0}               0.031  0.168
{1}               0.156  0.360
{4}               0.156  0.028
{5}               0.031  0.003
{0, 5}            0.062  0.171
{0, 1}            0.187  0.528
{0, 4}            0.187  0.196
{1, 5}            0.187  0.363
{4, 5}            0.187  0.031

• The best test is the one with critical region {0, 1}. Can you give a reason for that? Or, can you find a rule?

Try to think in terms of the likelihood ratio by noting

f_{1}(x) = [f_{1}(x)/f_{0}(x)] · f_{0}(x).

As a hint, compare the two tests {0, 1} and {0, 4}, which have the same α. Observe that their powers are

1 − β_{{0,1}} = P_{p=0.3}(X = 0) + P_{p=0.3}(X = 1)
1 − β_{{0,4}} = P_{p=0.3}(X = 0) + P_{p=0.3}(X = 4).

Compare P_{p=0.3}(X = 4) to P_{p=0.3}(X = 1).
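A brute-force check (our own sketch) that {0, 1} maximizes power among nonrandomized tests of level 0.2 in this problem:

```python
from itertools import combinations
from math import comb

n = 5
f0 = [comb(n, x) * 0.5**n for x in range(n + 1)]                 # pmf under p = 0.5
f1 = [comb(n, x) * 0.3**x * 0.7**(n - x) for x in range(n + 1)]  # pmf under p = 0.3

best = None
for r in range(1, n + 2):
    for region in combinations(range(n + 1), r):
        alpha = sum(f0[x] for x in region)
        if alpha <= 0.2:                       # keep only level-0.2 tests
            power = sum(f1[x] for x in region)
            if best is None or power > best[1]:
                best = (region, power)

print(best)   # best region is (0, 1), with power about 0.528 as in the table
```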

• Conclusion: The critical region determined by {x : f_{1}(x)/f_{0}(x) ≥ c} is quite intuitive. Suppose that we set out to order points in the sample space according to the amount of evidence they provide for P_{1} rather than P_{0}. We should naturally order them according to the value of the ratio f_{1}(x)/f_{0}(x); any x for which this ratio is large provides evidence that P_{1} rather than P_{0} is the true underlying probability distribution. The Neyman-Pearson analysis gives us a basis for choosing c so that

P_{0}({x : f_{1}(x)/f_{0}(x) ≥ c}) = α.

Now we use the Neyman-Pearson lemma to derive UMP tests in the following two examples.

Example 3. Suppose that X is a sample of size 1. We wish to test whether it comes from

N (0, 1) or the double exponential distribution
DE(0, 2) with the pdf 4^{−1} exp(−|x|/2).

• Can you guess the form of the testing procedure?

• Since P (f_{1}(X) = cf_{0}(X)) = 0, there is a unique nonrandomized UMP test.

• The UMP test is T_{∗}(x) = 1 if and only if

(π/8) exp(x^{2} − |x|) > c^{2}

for some c > 0, which is equivalent to |x| > t or |x| < 1 − t for some t > 1/2.

• Suppose that α < 1/4. For t ≤ 1, the rejection region {|x| > t} ∪ {|x| < 1 − t} has null probability

P_{0}(|X| > t) + P_{0}(|X| < 1 − t) ≥ P_{0}(|X| > 1) ≈ 0.3173 > α.

Hence t should be greater than 1 and α = Φ(−t) + 1 − Φ(t).

Thus t = Φ^{−1}(1 − α/2) and T_{∗}(X) = I_{(t,∞)}(|X|).

• Why does the UMP test reject H_{0} when |X| is large?

• The power of T_{∗} under H_{1} is

E_{1}[T_{∗}(X)] = P_{1}(|X| > t) = 1 − (1/4) ∫_{−t}^{t} e^{−|x|/2} dx = e^{−t/2}.


Example 4. Let X_{1}, . . . , X_{n} be iid binary
random variables with p = P (X_{1} = 1). Sup-
pose that we wish to test H_{0} : p = p_{0} versus
H_{1} : p = p_{1}, where 0 < p_{0} < p_{1} < 1.

• Since P (f_{1}(X) = cf_{0}(X)) ≠ 0 for some c, we may need to consider a randomized UMP test.

• A UMP test of size α is

T_{∗}(Y) = 1 if λ(Y) > c;  γ if λ(Y) = c;  0 if λ(Y) < c,

where Y = Σ_{i=1}^{n} X_{i} and

λ(Y) = (p_{1}/p_{0})^{Y} ((1 − p_{1})/(1 − p_{0}))^{n−Y}.

• Since λ(Y) is increasing in Y, there is an integer m > 0 such that

T_{∗}(Y) = 1 if Y > m;  γ if Y = m;  0 if Y < m,

where m and γ satisfy

α = E_{0}[T_{∗}(Y)] = P_{0}(Y > m) + γ P_{0}(Y = m).

• Since Y has the binomial distribution Bin(n, p_{0}) under H_{0}, we can determine m and γ from

α = Σ_{j=m+1}^{n} (n choose j) p_{0}^{j}(1 − p_{0})^{n−j} + γ (n choose m) p_{0}^{m}(1 − p_{0})^{n−m}.

• Unless

α = Σ_{j=m+1}^{n} (n choose j) p_{0}^{j}(1 − p_{0})^{n−j}

for some integer m, the UMP test is a randomized test.
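Finding m and γ numerically (our own sketch; the values n = 10, p_0 = 0.5, α = 0.05 are illustrative, not from the notes):

```python
from math import comb

def randomized_ump(n, p0, alpha):
    """Find m and gamma with alpha = P0(Y > m) + gamma * P0(Y = m)."""
    pmf = [comb(n, j) * p0**j * (1 - p0)**(n - j) for j in range(n + 1)]
    tail = 0.0                      # P0(Y > m), accumulated scanning m downward
    for m in range(n, -1, -1):
        if tail + pmf[m] > alpha:   # including Y = m would overshoot: randomize there
            gamma = (alpha - tail) / pmf[m]
            return m, gamma
        tail += pmf[m]
    raise ValueError("alpha too large")

m, gamma = randomized_ump(10, 0.5, 0.05)
print(m, round(gamma, 3))   # 8 0.893
```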

• Do you notice that the UMP test T_{∗} does
not depend on p_{1}?

– Neyman-Pearson lemma tells us that we should put those x into rejection region according to its likelihood ratio until the level of test achieves α.

– Think of two hypothesis-testing problems: the first is H_{0} : p = p_{0} versus H_{1} : p = p_{1}, and the second is H_{0} : p = p_{0} versus H_{1} : p = p_{2}, where p_{1} > p_{0} and p_{2} > p_{0}.

– For the above two testing problems, both likelihood ratios increase as y increases.


– T_{∗} is in fact a UMP test for testing H_{0} : p = p_{0} versus H_{1} : p > p_{0}.

• Suppose that there is a test T_{∗} of size α such
that for every P_{1} ∈ P, T_{∗} is UMP for testing
H_{0} versus the hypothesis P = P_{1}.

Then T_{∗} is UMP for testing H_{0} versus H_{1}.
Example: Suppose we have reason to believe that the true average monthly return on stocks selected by darts is 1.5%. We want to choose between H_{0} : µ = 1.5 versus H_{1} : µ ≠ 1.5, where µ is the true mean monthly return.

• We need to select a significance level α. Let’s
pick α = 0.05. This means that there is at
most a 5% chance that we will mistakenly
reject H_{0} if in fact H_{0} is true (Type I error).

It says nothing about the chances that we
will mistakenly stick with H_{0} if in fact H_{1}
is true (Type II error).

• Large sample hypothesis test. Let's suppose we have samples X_{1}, · · · , X_{n} with n > 30.

– The first step in choosing between our hypotheses is computing the following test statistic:

Z = (X̄ − µ_{0}) / (σ/√n).

– If the null hypothesis is true, then the test statistic Z has approximately a standard normal distribution by the central limit theorem.

– The test: If Z < −z_{α/2} or Z > z_{α/2}, we reject the null hypothesis; otherwise, we stick with the null hypothesis. (Recall that z_{α/2} is defined by the requirement that the area to the right of z_{α/2} under N(0, 1) is α/2; with α = 0.05, the cutoff is z_{0.025} = 1.96.)

• T-test for a normal population.

Suppose now that we don't necessarily have a large sample but we do have a normal population. Consider the same hypotheses as before.

– Now our test statistic becomes

t = (X̄ − µ_{0}) / (s/√n).

– Under H_{0}, the test statistic t has a t-distribution with n − 1 degrees of freedom.

– Consider the mean return on darts. Suppose we have n = 20 observations (the 1-month contests) with a sample mean of −1.0 and a sample standard deviation of 7.2.

Our test statistic is t = (−1.0 − 1.5)/(7.2/√20) = −1.55. The threshold for rejection is t_{19,0.025} = 2.093. Since |t| < 2.093, we cannot reject H_{0}.
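The same calculation from summary statistics, in plain Python (the critical value 2.093 is the tabulated t_{19,0.025} quoted in the notes):

```python
from math import sqrt

# Darts example: one-sample t statistic from summary statistics
n, xbar, s, mu0 = 20, -1.0, 7.2, 1.5
t = (xbar - mu0) / (s / sqrt(n))
t_crit = 2.093                      # t_{19, 0.025}, from a t table

print(round(t, 2))                  # -1.55
print(abs(t) < t_crit)              # True: cannot reject H0
```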

Example: Consider the effect of a packag-
ing change on sales of a product. Let µ be the
(unknown) mean increase in sales due to the
change. We have data available from a test-
marketing study. We will not undertake the
change unless there is strong evidence in favor
of increased sales. We should therefore set up
the test like this: H_{0} : µ ≤ 0 versus H_{1} : µ > 0.

• Note that this is a one-sided test.

• This formulation implies that a large X̄ (i.e., large increases in sales in a test market) will support H_{1} (i.e., cause us to switch to the new package), but negative values of X̄ (reflecting decreased sales) support H_{0}.

• The packaging example: Suppose that based on test-marketing in 36 stores we observe a sample mean increase in sales of 13.6 units per week with a sample standard deviation of 42.

Is the observed increase significant at level α = 0.05? To answer this, we compute the test statistic Z = 13.6/(42/√36) = 1.94.

Our cutoff is z_{α} = 1.645. Since Z > z_{α}, the increase is significant.
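The one-sided large-sample test, sketched in plain Python:

```python
from math import sqrt, erfc

# Packaging example: one-sided large-sample z test of H0: mu <= 0
n, xbar, s = 36, 13.6, 42.0
z = (xbar - 0) / (s / sqrt(n))
p_value = 0.5 * erfc(z / sqrt(2))   # upper-tail normal probability

print(round(z, 2))          # 1.94
print(z > 1.645)            # True: significant at alpha = 0.05
```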


Observational Studies

• An observational study on sex bias in admissions to the Graduate Division at the University of California, Berkeley, was carried out in the fall quarter of 1973. Bickel, P., O'Connell, J.W., and Hammel, E. (1975), "Is there a sex bias in graduate admissions?" Science 187, 398-404.

– There were 8,442 men who applied for admission to graduate school that quarter, and 4,321 women.

– About 44% of the men and 35% of the women were admitted.

– Assuming that the men and women were on the whole equally well qualified (and there is no evidence to the contrary), the difference in admission rates looks like a very strong piece of evidence to show that men and women are treated differently in the admission procedure.

• Admissions to graduate work are made separately for each major. By looking at each major separately, it should have been possible to identify the ones which discriminated against the women.

– At Berkeley, there are over a hundred majors.

– Look at the six largest majors, each of which had over five hundred applicants. (Together they accounted for over one third of the total number of applicants to the campus.)

– In each major, the percentage of female applicants who were admitted is roughly equal to the percentage of male applicants admitted.

– The only exception is major A, which appears to discriminate against men: it admitted 82% of the women, and only 62% of the men.

– When all six majors are taken together, they admitted 44% of the male applicants, and only 30% of the female applicants; the difference is 14%.

• Admissions data in the six largest majors


            Men                     Women
Major  applicants  % admitted  applicants  % admitted
A         825          62          108         82
B         560          63           25         68
C         325          37          593         34
D         417          33          375         35
E         191          28          393         24
F         373           6          341          7

• What is going on? An explanation:

– The first two majors were easy to get into. Over 50% of the men applied to these two.

– The other four majors were much harder to get into. Over 90% of the women applied to these four.

– There was an effect due to the choice of major, confounded with the effect due to sex. When the choice of major is controlled for, as in the above table, there is little difference in the admissions rates for men and women.

• An experiment is controlled when the investigators determine which subjects will be the controls and which will get the treatment, for instance, by tossing a coin.

• Statisticians distinguish carefully between controlled experiments and observational studies.

– Studies of the effects of smoking are necessarily observational: nobody is going to smoke for ten years just to please a statistician.

– Many problems can be studied only observationally, and all observational studies have to deal with the problems of confounding.

– For the admissions example, it is wrong to compare campus-wide admission rates without controlling for the choice of major. We have to make comparisons within homogeneous subgroups.

– This was not a controlled, randomized experiment, however; sex was not randomly assigned to the applicants.

• An alternative analysis: Compare the weighted average admission rates for men and women, weighting each major by its total number of applicants. Consider

(933/4526) × 62% + (585/4526) × 63% + (918/4526) × 37% + (792/4526) × 33% + (584/4526) × 28% + (714/4526) × 6%

and the analogous average for women, which lead to 39% for men versus 43% for women.
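The weighted averages can be reproduced directly from the table (a short sketch):

```python
# Berkeley's six largest majors: (applicants_men, pct_admitted_men,
#                                 applicants_women, pct_admitted_women)
majors = {
    "A": (825, 62, 108, 82),
    "B": (560, 63, 25, 68),
    "C": (325, 37, 593, 34),
    "D": (417, 33, 375, 35),
    "E": (191, 28, 393, 24),
    "F": (373, 6, 341, 7),
}

total = sum(m + w for m, _, w, _ in majors.values())   # 4526 applicants in all
men_rate = sum((m + w) * pm for m, pm, w, _ in majors.values()) / total
women_rate = sum((m + w) * pw for m, _, w, pw in majors.values()) / total

print(round(men_rate), round(women_rate))   # 39 43
```

Weighting both sexes by the same (total-applicant) weights removes the confounding with choice of major, and the apparent 14-point gap nearly reverses.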

Hypothesis Testing by Likelihood Methods

Example: Let X_{1}, . . . , X_{n} be iid with X_{1} ∼ N(µ, 1).

• Test H_{0} : µ = 0 versus H_{1} : µ = µ_{0} > 0.

• Construct a test with α = 0.05 and β = 0.2005.

• Reject H_{0} if √n X̄_{n} > 1.645.

• Note that β = P (√n X̄_{n} ≤ 1.645 | µ = µ_{0}) = Φ(1.645 − √n µ_{0}).

• If n → ∞ and µ_{0} is a fixed positive constant, β → 0.

• To ensure β = 0.2005, it requires that 1.645 − √n µ_{0} = −0.84, or µ_{0} = 2.485 n^{−1/2}.

• Do you notice that µ_{0} changes with n? It is no longer a fixed alternative.
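To see that this choice of µ_0 pins the Type II error at about 0.2005 for every n, a quick check (`Phi` is the standard normal cdf, built from `erfc`):

```python
from math import erfc, sqrt

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * erfc(-x / sqrt(2))

for n in (10, 100, 1000):
    mu0 = 2.485 / sqrt(n)                # the shrinking (local) alternative
    beta = Phi(1.645 - sqrt(n) * mu0)    # always Phi(-0.84)
    print(n, round(mu0, 4), round(beta, 4))
```

The Type II error is the same for all n because √n µ_0 is held fixed at 2.485; a fixed µ_0 would instead drive β to 0.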

Test Statistics for a Simple Null Hypothesis

Consider testing H_{0} : θ = θ^{0} ∈ R^{s} versus H_{1} : θ ≠ θ^{0}.
32

Likelihood Ratio Test

• The likelihood ratio statistic

Λ_{n} = L(θ^{0}; x) / sup_{θ∈Θ} L(θ; x)

was introduced by Neyman and Pearson (1928).

• Λ_{n} takes values in the interval [0, 1] and H_{0} is to be rejected for sufficiently small values of Λ_{n}.

• The rationale behind LR tests is that when H_{0} is true, Λ_{n} tends to be close to 1, whereas when H_{1} is true, Λ_{n} tends to be close to 0.

• The test may be carried out in terms of the statistic

λ_{n} = −2 log Λ_{n}.

• For finite n, the null distribution of λ_{n} will generally depend on n and on the form of the pdf of X.

• LR tests are closely related to MLE’s.

• Denote the MLE by θ̂. For an asymptotic analysis, expand λ_{n} in a Taylor series around θ̂. Writing log L(θ; x) = Σ_{i=1}^{n} log f(X_{i}; θ), and using the fact that the first derivative of the log-likelihood vanishes at θ̂,

λ_{n} = −2 [ Σ_{i=1}^{n} log f(X_{i}; θ^{0}) − Σ_{i=1}^{n} log f(X_{i}; θ̂) ]
      = (θ^{0} − θ̂)^{T} [ −Σ_{i=1}^{n} ∂^{2}/∂θ_{j}∂θ_{k} log f(X_{i}; θ)|_{θ=θ^{∗}} ] (θ^{0} − θ̂),

where θ^{∗} lies between θ̂ and θ^{0}.

• Since θ̂ is consistent, so is θ^{∗}, and

λ_{n} = n (θ̂ − θ^{0})^{T} [ −(1/n) Σ_{i=1}^{n} ∂^{2}/∂θ_{j}∂θ_{k} log f(X_{i}; θ)|_{θ=θ^{0}} ] (θ̂ − θ^{0}) + o_{P}(1).

By the asymptotic normality of θ̂ and

−(1/n) Σ_{i=1}^{n} ∂^{2}/∂θ_{j}∂θ_{k} log f(X_{i}; θ)|_{θ=θ^{0}} →_{P} I(θ^{0}),

λ_{n} has, under H_{0}, a limiting chi-squared distribution with s degrees of freedom.

Example: Consider the testing problem H_{0} : θ = θ_{0} versus H_{1} : θ ≠ θ_{0} based on iid X_{1}, . . . , X_{n} from the uniform distribution U(0, θ).

• L(θ_{0}; x) = θ_{0}^{−n} 1{x_{(n)} < θ_{0}}

• θ̂ = x_{(n)} (the MLE) and sup_{θ∈Θ} L(θ; x) = x_{(n)}^{−n}

• We have

Λ_{n} = (X_{(n)}/θ_{0})^{n} if X_{(n)} ≤ θ_{0}, and Λ_{n} = 0 if X_{(n)} > θ_{0}.

• Reject H_{0} if X_{(n)} > θ_{0} or X_{(n)}/θ_{0} < c^{1/n}.

• What is the asymptotic distribution of λ_{n} = −2n log(X_{(n)}/θ_{0})?

• What is P (n log(X_{(n)}/θ_{0}) ≤ c) for c < 0? It does not lead to the χ^{2}_{1} limit suggested by the general theory. (Why?)
Example: Consider the testing problem H_{0} : σ^{2} = σ_{0}^{2} versus H_{1} : σ^{2} ≠ σ_{0}^{2} based on iid X_{1}, . . . , X_{n} from the normal distribution N(µ_{0}, σ^{2}) with µ_{0} known.

• L(σ_{0}^{2}; x) = (2πσ_{0}^{2})^{−n/2} exp(−Σ_{i}(x_{i} − µ_{0})^{2}/(2σ_{0}^{2}))

• σ̂^{2} = n^{−1} Σ_{i}(x_{i} − µ_{0})^{2} (the MLE) and

sup_{θ∈Θ} L(θ; x) = (2πσ̂^{2})^{−n/2} exp(−n/2).

• We have

Λ_{n} = (σ̂^{2}/σ_{0}^{2})^{n/2} exp( n/2 − Σ_{i}(x_{i} − µ_{0})^{2}/(2σ_{0}^{2}) ),

or, under H_{0}, writing u_{n} = n^{−1} Σ_{i=1}^{n} Z_{i}^{2} where Z_{1}, . . . , Z_{n} are iid N(0, 1),

λ_{n} = −n [ ln u_{n} − (u_{n} − 1) ].

• Fact: Using the CLT, we have

(n^{−1} Σ_{i=1}^{n} Z_{i}^{2} − 1) / √(2/n) →_{d} N(0, 1),

or equivalently

(n/2) (n^{−1} Σ_{i=1}^{n} Z_{i}^{2} − 1)^{2} →_{d} χ^{2}_{1}.

• Note that ln u ≈ −(1 − u) − (1 − u)^{2}/2 when u is near 1, and n^{−1} Σ_{i=1}^{n} Z_{i}^{2} → 1 in probability by the LLN; hence λ_{n} ≈ (n/2)(n^{−1} Σ_{i=1}^{n} Z_{i}^{2} − 1)^{2}.

• A common question in Taylor series approximation is how many terms to keep. In this example, it refers to using the first-order approximation ln u ≈ −(1 − u) in contrast to the second-order approximation above. If we use only the first-order approximation, we end up with the difficulty of finding lim_{n} a_{n}b_{n} when lim_{n} a_{n} = ∞ and lim_{n} b_{n} = 0.

• We conclude that λ_{n} has a limiting chi-squared
distribution with 1 degree of freedom.
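A quick simulation (our own sketch) supporting the χ²_1 limit: the average of λ_n should approach 1, the mean of χ²_1.

```python
import random
from math import log

random.seed(1)
n, reps = 200, 5000

lam = []
for _ in range(reps):
    u = sum(random.gauss(0, 1) ** 2 for _ in range(n)) / n   # (1/n) sum Z_i^2
    lam.append(-n * (log(u) - (u - 1)))                      # lambda_n under H0

print(round(sum(lam) / reps, 2))   # close to 1, the mean of chi^2_1
```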


1. We begin with the simplest case of a test. Suppose we are inclined to believe that some (unknown) population mean µ has the value µ_{0}, where µ_{0} is some (known) number. We have samples X_{1}, . . . , X_{n} from the underlying population and we want to test our hypothesis that µ = µ_{0}. Thus, we have

H_{0} : µ = µ_{0}
H_{1} : µ ≠ µ_{0}

What sort of evidence would lead us to reject H_{0} in favor of H_{1}? Naturally, a sample mean far from µ_{0} would support H_{1} while one close to µ_{0} would not. Hypothesis testing makes this intuition precise.

2. This is a two-sided or two-tailed test because sample means that are very large or very small count as evidence against H_{0}. In a one-sided test, only values in one direction are evidence against the null hypothesis. We treat that case later.

3. Example: Suppose we have reason to believe that the true average monthly return on stocks selected by darts is 1.5%. (See Dart Investment Fund in the casebook for background and data.) We want to choose between

H_{0} : µ = 1.5
H_{1} : µ ≠ 1.5,

where µ is the true mean monthly return.

4. We need to select a significance level α. Let's pick α = 0.05. This means that there is at most a 5% chance that we will mistakenly reject H_{0} if in fact H_{0} is true (Type I error). It says nothing about the chances that we will mistakenly stick with H_{0} if in fact H_{1} is true (Type II error).

5. Large sample hypothesis test. Let's suppose we have samples X_{1}, . . . , X_{n} with n > 30. The first step in choosing between our hypotheses is computing the following test statistic:

Z = (X̄ − µ_{0}) / (σ/√n).

I am temporarily assuming that we know σ.

6. Remember that we know µ_{0} (it's part of the null hypothesis we've formulated), even though we don't know µ.

7. If the null hypothesis is true, then the test statistic Z has approximately a standard normal distribution.

8. Now we carry out the test: If Z < −z_{α/2} or Z > z_{α/2}, we reject the null hypothesis; otherwise, we stick with the null hypothesis. (Recall that z_{α/2} is defined by the requirement that the area to the right of z_{α/2} under N(0, 1) is α/2. Thus, with α = 0.05, the cutoff is z_{0.025} = 1.96.)

9. … the test statistic Z lands in the set of points having absolute value greater than z_{α/2}. This set is called the rejection region for the test.

10. Every hypothesis test has this general form: we compute a test statistic from data, then check if the test statistic lands inside or outside the rejection region. The rejection region depends on α but not on the data.

11. Notice that saying

−z_{α/2} < (X̄ − µ_{0})/(σ/√n) < z_{α/2}

is equivalent to saying

X̄ − z_{α/2} σ/√n < µ_{0} < X̄ + z_{α/2} σ/√n.

So, here is another way to think of the test we just did. We found a confidence interval for the mean and checked to see if µ_{0} lands in that interval. If µ_{0} lands inside, we don't reject H_{0}; if µ_{0} lands outside, we do reject H_{0}.

12. This supports our intuition that we should reject H_{0} if X̄ is far from µ_{0}.

13. As usual, if we don't know σ we replace it with the sample standard deviation s.

14. T-test for normal population. Suppose now that we don't necessarily have a large sample but we do have a normal population. Consider the same hypotheses as before. Now our test statistic becomes

t = (X̄ − µ_{0}) / (s/√n).

15. Under the null hypothesis, the test statistic t has a t-distribution with n − 1 degrees of freedom.

16. Now we carry out the test. Reject if

t < −t_{n−1,α/2} or t > t_{n−1,α/2};

otherwise, do not reject.

17. As before, rejecting based on this rule is equivalent to rejecting whenever µ_{0} falls outside the confidence interval for µ.

18. Example: Let's continue with the hypothesis test for the mean return on darts. As above, µ_{0} = 1.5 and α = 0.05. Suppose we have n = 20 observations (the 1-month contests) with a sample mean of −1.0 and a sample standard deviation of 7.2. Our test statistic is therefore

t = (X̄ − µ_{0})/(s/√n) = (−1.0 − 1.5)/(7.2/√20) = −1.55.

The threshold for rejection is t_{19,0.025} = 2.093. Since our test statistic t has an absolute value smaller than the cutoff, we cannot reject the null hypothesis. In other words, based on a significance level of 0.05 the evidence does not significantly support the view that µ ≠ 1.5.

19. … we stick with the null hypothesis. The smaller we make α, the harder it is to reject H_{0}.

20. If a test leads us to reject H_{0}, we say that the results are significant at level α.

P-Values

1. There is something rather arbitrary about the choice of α. Why should we use α = 0.05 rather than 0.01, 0.10 or some other value? What if we would have rejected H_{0} at α = 0.10 but fail to reject it because we chose α = 0.05? Should we change our choice of α?

2. Changing α after a hypothesis test is "cheating" in a precise sense. Recall that, by the definition of α, the probability of a Type I error is at most α. Thus, fixing α gives us a guarantee on the effectiveness of the test. If we change α, we lose this guarantee.

3. Nevertheless, there is an acceptable way to report what would have happened had we chosen a different significance level. This is based on something called the p-value of a test.

4. The p-value is the smallest significance level (i.e., the smallest α) at which H_{0} would be rejected, for a given test statistic. It is therefore a measure of how significant the evidence in favor of H_{1} is: the smaller the p-value, the more compelling the evidence.

5. Example: Consider the test of mean returns on stocks picked by darts, as above. To simplify the present discussion, let's suppose we have 30 data points, rather than 20. (See Dart Investment Fund in the casebook for background and data.) As before the hypotheses are

H_{0} : µ = 1.5
H_{1} : µ ≠ 1.5

Let's suppose that our sample mean X̄ (based on 30 observations) is −0.8 and the sample standard deviation is 6.1. Since we are assuming a large sample, our test statistic is

Z = (X̄ − 1.5)/(s/√n) = (−0.8 − 1.5)/(6.1/√30) = −2.06.

With a significance level of α = 0.05, we get z_{α/2} = 1.96, and our rejection region would be

Z < −1.96 or Z > 1.96.

So, in this case, Z = −2.06 would be significant: it is sufficiently far from zero to cause us to reject H_{0}.

We now ask, what is the smallest α at which we would reject H_{0}, based on Z = −2.06. We are asking for the smallest α such that −2.06 < −z_{α/2}; i.e., the smallest α such that z_{α/2} < 2.06. To find this value, we look up 2.06 in the normal table. This gives us …
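The normal-table lookup can also be done numerically: the smallest such α is the two-sided p-value 2(1 − Φ(|Z|)). A sketch (`Phi` is the standard normal cdf, built from `erfc`):

```python
from math import erfc, sqrt

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * erfc(-x / sqrt(2))

z = (-0.8 - 1.5) / (6.1 / sqrt(30))
p_value = 2 * (1 - Phi(abs(z)))   # smallest alpha at which H0 is rejected

print(round(z, 3), round(p_value, 3))   # -2.065 0.039
```

Since the p-value is below 0.05 but above 0.01, we would reject H_0 at α = 0.05 but not at α = 0.01, consistent with the discussion above.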