Methods for Statistical Prediction Financial Time Series I Topic 1: Review on Hypothesis Testing Hung Chen Department of Mathematics National Taiwan University 9/26/2002






OUTLINE

1. Fundamental Concepts
2. Neyman-Pearson Paradigm
3. Examples
4. Optimal Test
5. Observational Studies
6. Likelihood Ratio Test
7. One-sample and Two-sample Tests


Motivating Example on Hypothesis Testing

ESP experiment: guess the color of a card drawn from a deck of 52 cards, with replacement.

• Experiment: Generate data to test the hypotheses.

• T : number of correct guesses in 10 trials

• H0 : T ∼ Bin(10, 0.5) versus H1 : T ∼ Bin(10, p) with p > 1/2

• Consider the test statistic T and the rejection region R = {8, 9, 10}.

• Compute the probability of committing a Type I error:

α = P (T ∈ R) = P (T > 7) = 0.0439 + 0.0098 + 0.0010 = 0.0547.

• When the rejection region is R = {7, 8, 9, 10},

α = P (T > 6) = 0.1172 + P (T > 7) = 0.1719.

• Calculation of power when R = {8, 9, 10}. We compute the power under various values of p:

p = 0.6: P (T > 7 | p = 0.6) = 0.1673
p = 0.7: P (T > 7 | p = 0.7) = 0.3828
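The size and power figures for the ESP experiment can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the helper names `binom_pmf` and `rejection_prob` are illustrative and not part of the original slides.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(T = k) for T ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def rejection_prob(region, n, p):
    """Probability that T falls in the rejection region when T ~ Bin(n, p)."""
    return sum(binom_pmf(k, n, p) for k in region)

n = 10
# Type I error probabilities (computed under H0: p = 0.5)
alpha_small = rejection_prob({8, 9, 10}, n, 0.5)
alpha_large = rejection_prob({7, 8, 9, 10}, n, 0.5)
# Power of R = {8, 9, 10} under two alternatives
power_06 = rejection_prob({8, 9, 10}, n, 0.6)
power_07 = rejection_prob({8, 9, 10}, n, 0.7)

print(f"alpha, R = {{8,9,10}}:   {alpha_small:.4f}")
print(f"alpha, R = {{7,8,9,10}}: {alpha_large:.4f}")
print(f"power at p = 0.6:      {power_06:.4f}")
print(f"power at p = 0.7:      {power_07:.4f}")
```

Enlarging the rejection region raises α and, for any fixed alternative, raises the power as well, which is the trade-off discussed next.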



• Idea: A statistical test of a hypothesis is a rule which assigns each possible observation to one of two exclusive categories: consistent with the hypothesis under consideration, or not consistent with the hypothesis.

• Will we make mistakes?

Two Types of Error

                                H0 true         H0 false
Test says reject H0             Type I Error    Good
Test says cannot reject H0      Good            Type II Error

• Usually, P (Type I Error) is denoted by α and P (Type II Error) is denoted by β.

• In the ESP experiment, α increases when R moves from {8, 9, 10} to {7, 8, 9, 10}, but β decreases.


• Statistical hypothesis testing is a formal means of choosing between two distributions on the basis of a particular statistic or random variable generated from one of them.

– How do we accommodate the uncertainty in the observed data?

– How do we evaluate a method?

• Neyman-Pearson Paradigm

– Null hypothesis H0

– Alternative hypothesis HA or H1

– The objective is to select one of the two based on the available data.

– A crucial feature of hypothesis testing is that the two competing hypotheses are not treated in the same way: one is given the benefit of the doubt, the other has the burden of proof. The one that gets the benefit of the doubt is called the null hypothesis. The other is called the alternative hypothesis.

– By definition, the default is H0. When we carry out a test, we are asking whether the available data are significant evidence in favor of H1. We are not testing whether H1 is true; rather, we are testing whether the evidence supporting H1 is statistically significant.

– The conclusion of a hypothesis test is that we either reject the null hypothesis (and accept the alternative) or we fail to reject the null hypothesis. Failing to reject H0 does not quite mean that the evidence supports H0; rather, it means that the evidence does not strongly favor H1. Again, H0 gets the benefit of the doubt.

– Examples:

∗ Suppose we want to determine if stocks picked by experts generally perform better than stocks picked by darts. We might conduct a hypothesis test to determine if the available data should persuade us that the experts do better. In this case, we would have

H0: experts not better than darts
H1: experts better than darts

∗ Suppose we are skeptical about the effectiveness of a new product in promoting dense hair growth. We might conduct a test to determine if the data show that the new product stimulates hair growth. This suggests

H0: New product does not promote hair growth
H1: New product does promote hair growth

Choosing the hypotheses this way puts the onus on the new product; unless there is strong evidence in favor of H1, we stick with H0.

∗ Suppose we are considering changing the packaging of a product in the hope of boosting sales. Switching to a new package is costly, so we will only undertake the switch if there is significant evidence that sales will increase. We might test-market the change in one or two cities and then evaluate the results using a hypothesis test. Since the burden of proof is on the new package, we should set the hypotheses as follows:

H0: New package does not increase sales
H1: New package does increase sales

– There are two types of hypotheses: simple ones, where the hypothesis completely specifies the distribution, and composite ones, where it does not.

– Simple hypotheses test one value of the parameter against another, the form of the distribution remaining fixed.

– Here is an example in which both hypotheses are composite:

H0: Xi is Poisson with unknown parameter
H1: Xi is not Poisson


Steps for setting up a test:

1. Define the null hypothesis H0 (devil's advocate). Put the hypothesis that you don't believe as H0.

2. Define the alternative HA (one-sided or two-sided).

3. Find the test statistic. Use heuristic or systematic methods.

4. Decide on the Type I error probability α that you are willing to accept.

5. Compute the p-value: the probability, under the null hypothesis, of observing data at least as extreme as the data actually observed.

6. Compare the p-value to α; if it is smaller, reject H0.



Example 1: Sex bias in graduate admissions

• The graduate division of the University of California at Berkeley attempted to study the possibility that sex bias operated in graduate admissions in 1973 by examining admission data.

• In this case, what does the hypothesis of no sex bias correspond to? It is natural to translate this into

P [Admit | Male] = P [Admit | Female].

• Data

– There were 8,442 men who applied for admission to graduate school that quarter, and 4,321 women.

– About 44% of the men and 35% of the women were admitted.

– How do we perform this two-sample test?

• What is the conclusion?

– The two-sample test statistic is

$$ z = \frac{0.44 - 0.35}{\sqrt{\dfrac{0.44 \times 0.56}{8442} + \dfrac{0.35 \times 0.65}{4321}}} = 9.948715. $$

– The p-value is about 1.3 × 10⁻²³ when H1 is the one-sided alternative P [Admit | Male] > P [Admit | Female]. (The value 1.283 × 10⁻²² is the standard normal density at z = 9.9487, not the tail probability.)

– The p-value is about 2.6 × 10⁻²³ when H1 is the two-sided alternative P [Admit | Male] ≠ P [Admit | Female].
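The two-sample calculation can be checked numerically. The sketch below assumes the unpooled standard error used in the formula of this example and evaluates the normal tail with `math.erfc`, which stays accurate far out in the tail; the function name `upper_tail` is illustrative.

```python
from math import sqrt, erfc

def upper_tail(z):
    """P(Z > z) for a standard normal Z, computed via the complementary error function."""
    return 0.5 * erfc(z / sqrt(2.0))

# Admission rates and numbers of applicants quoted in the example
p_m, n_m = 0.44, 8442   # men
p_f, n_f = 0.35, 4321   # women

# Unpooled standard error of the difference of two sample proportions
se = sqrt(p_m * (1 - p_m) / n_m + p_f * (1 - p_f) / n_f)
z = (p_m - p_f) / se

p_one_sided = upper_tail(z)      # H1: P[Admit|Male] > P[Admit|Female]
p_two_sided = 2 * p_one_sided    # H1: the two probabilities differ

print(f"z = {z:.4f}")
print(f"one-sided p-value = {p_one_sided:.3e}")
print(f"two-sided p-value = {p_two_sided:.3e}")
```

Either way the p-value is vanishingly small, so the campus-wide difference in admission rates is far too large to attribute to chance; whether it reflects bias is a separate question taken up under observational studies.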



Example 2: Effectiveness of Therapy

• Suppose that a new drug is being considered with a view to curing a certain disease.

• How do we evaluate its effectiveness?

• The drug is given to n patients suffering from the disease and the number x of cures is noted.

• We wish to test the hypothesis that there is at least a 50-50 chance of a cure by this drug, based on the following data: x cures among n patients.

• Put the problem in the following framework of a statistical test:

– The sample space X is simple: it is the set {0, 1, 2, . . . , n} (i.e., X can take on the values 0, 1, 2, . . . , n).

– The family {Pθ} of possible distributions on X is (assuming independent patients) the family of binomial distributions, parametrized by the real parameter θ taking values in [0, 1].

– θ is interpreted as the probability of cure.

– X ∼ Bin(n, θ)

– The stated hypothesis defines the subset Θ0 = [1/2, 1] of the parameter space:

H0 : θ ≥ 1/2

– In this situation, only a small class of tests seems worth considering on a purely intuitive basis. We will only consider those for which the set of x taken to be consistent with Θ0 has the form {x : x ≥ k}.

– Question: Would it make sense to consider x cures out of n patients as consistent with Θ0, while x + 1 cures were not?

– What is a reasonable test?



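A recipe: Optimal tests for simple hypotheses

• Null hypothesis H0 : f = f0

• Alternative hypothesis HA : f = f1

• We want to find a rejection region R such that the errors of both types are as small as possible, where

$$ \alpha = \int_R f_0(x)\,dx \quad\text{and}\quad 1 - \beta = \int_R f_1(x)\,dx. $$

• Neyman-Pearson Lemma: For testing f0(x) against f1(x), a critical region of the form

$$ \Lambda(x) = \frac{f_1(x)}{f_0(x)} \ge K, $$

where K is a constant, has the greatest power (smallest β) in the class of tests with the same α.

– Let R denote the rejection region determined by Λ(x) and S denote the rejection region of any other testing procedure.

– $\alpha_R = \int_R f_0(x)\,dx$, $\alpha_S = \int_S f_0(x)\,dx$, with αR, αS ≤ α.

– The difference in power between the two tests is

$$ (1-\beta_R) - (1-\beta_S) = \int_R f_1\,dx - \int_S f_1\,dx = \int_{R\cap S^c} f_1\,dx - \int_{S\cap R^c} f_1\,dx. $$

Since f1 ≥ K f0 in R and −f1 ≥ −K f0 in R^c, we have

$$ (1-\beta_R) - (1-\beta_S) \ge K\left(\int_{R\cap S^c} f_0\,dx - \int_{S\cap R^c} f_0\,dx\right) = K\left(\int_R f_0\,dx - \int_S f_0\,dx\right) = K(\alpha_R - \alpha_S). $$

– When αR = αS = α, the right-hand side is zero, so the test with region R has power at least as large as the test with region S, i.e. βR ≤ βS.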



Why is the Neyman-Pearson framework accepted?

• A test whose error probabilities are as small as possible is clearly desirable. However, we cannot choose the critical region in such a way that α(θ) and β(θ) are simultaneously uniformly minimized. By taking the critical region as the empty set, we can make α(θ) = 0, and by taking the critical region as the sample space, we can make β(θ) = 0. Hence a test which uniformly minimized both error-probability functions would need to have zero error probabilities, and usually no such test exists.

• The modification suggested by Neyman and Pearson is based on the fact that in most circumstances our attitudes to the hypotheses Θ0 and Θ − Θ0 are different: we are often asking if there is sufficient evidence to reject the hypothesis Θ0. In terms of the two possible errors, this may be translated into the statement that often the Type I error is more serious than the Type II error.

• We should control the probability of the Type I error at some pre-assigned small value α, and then, subject to this control, look for a test which uniformly minimizes the function describing the probabilities of Type II error.

• Is this asymmetry between H0 and H1 reasonable? Can you come up with an example from a business application?

– Suppose we use this testing technique in searching for regions of the genome that resemble other regions that are known to have significant biological activity.

– One way of doing this is to align the known and unknown regions and compute statistics based on the number of matches.

– To determine significant values of these statistics, a (more complicated) version of the following is done. Thresholds (critical values) are set so that if the matches occur at random and the probability of a match is 1/2, then the probability of exceeding the threshold (a Type I error) is smaller than α.

– No one really believes that H0 is true, and possible types of alternatives are vaguely known at best, but computation under H0 is easy.

Now we use the following example to motivate the Neyman-Pearson lemma. We start from the simplest possible situation, where Θ has only two elements, θ0 and θ1 say, and where Θ0 = {θ0}, Θ − Θ0 = {θ1}. Note that a hypothesis which specifies a set in the parameter space containing only one element is called a simple hypothesis. Thus we are now considering testing a simple null hypothesis against a simple alternative. In this case, the power function of any test reduces to a single number, and we examine the question of the existence of a most powerful test of given significance level α.

Revisit the example of x cures out of n patients when n = 5. We wish to test

H0 : p = 0.5 versus H1 : p = 0.3.


• The probability distribution of X is

X = x           0      1      2      3      4      5
p = 0.5       0.031  0.156  0.313  0.313  0.156  0.031
p = 0.3       0.168  0.360  0.309  0.132  0.028  0.002
f1(x)/f0(x)   5.378  2.305  0.988  0.423  0.181  0.078

(The ratios are computed from the unrounded probabilities.)

• Think of the meaning of the likelihood ratio f1(x)/f0(x).

• We consider all possible nonrandomized tests of significance level 0.2.

critical region    α      1 − β
{0}              0.031    0.168
{1}              0.156    0.360
{4}              0.156    0.028
{5}              0.031    0.002
{0, 5}           0.062    0.171
{0, 1}           0.187    0.528
{0, 4}           0.187    0.196
{1, 5}           0.187    0.363
{4, 5}           0.187    0.031

• The best test is the one with critical region {0, 1}. Can you give a reason for that? Or, can you find a rule?

Try to think in terms of the likelihood ratio by noting

$$ f_1(x) = \frac{f_1(x)}{f_0(x)} \cdot f_0(x). $$

As a hint, compare the two tests {0, 1} and {0, 4}, which have the same α. Observe that their powers are

1 − β{0,1} = P{p=0.3}(X = 0) + P{p=0.3}(X = 1)
1 − β{0,4} = P{p=0.3}(X = 0) + P{p=0.3}(X = 4).

Compare P{p=0.3}(X = 4) to P{p=0.3}(X = 1).
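The search over nonrandomized tests can be automated. The following sketch enumerates every nonempty critical region for n = 5, keeps those with α ≤ 0.2, and reports the most powerful one alongside the ordering of sample points by likelihood ratio. Variable names such as `candidates` and `ratio_order` are illustrative choices, not notation from the slides.

```python
from itertools import combinations
from math import comb

def pmf(n, p):
    """List of P(X = k), k = 0..n, for X ~ Bin(n, p)."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

n = 5
f0 = pmf(n, 0.5)   # distribution under H0
f1 = pmf(n, 0.3)   # distribution under H1

# Every nonempty subset of {0,...,5} is a possible critical region.
candidates = []
for size in range(1, n + 2):
    for region in combinations(range(n + 1), size):
        alpha = sum(f0[x] for x in region)
        if alpha <= 0.2:                      # significance level 0.2
            power = sum(f1[x] for x in region)
            candidates.append((region, alpha, power))

# Most powerful admissible region
best = max(candidates, key=lambda c: c[2])

# Sample points ordered by decreasing likelihood ratio f1/f0
ratio_order = sorted(range(n + 1), key=lambda x: f1[x] / f0[x], reverse=True)

for region, alpha, power in sorted(candidates, key=lambda c: -c[2]):
    print(f"{str(set(region)):10s} alpha = {alpha:.3f}  power = {power:.3f}")
print("best region:", set(best[0]))
print("points by decreasing likelihood ratio:", ratio_order)
```

The best region consists of the points with the largest likelihood ratios, added in order until the level is exhausted, which is the rule the hint is pointing toward.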

• Conclusion: The critical region determined by {x : f1(x)/f0(x) ≥ c} is quite intuitive.

Suppose that we set out to order points in the sample space according to the amount of evidence they provide for P1 rather than P0. We should naturally order them according to the value of the ratio f1(x)/f0(x); any x for which this ratio is large provides evidence that P1 rather than P0 is the true underlying probability distribution. The Neyman-Pearson analysis gives us a basis for choosing c so that

$$ P_0\left\{ x : \frac{f_1(x)}{f_0(x)} \ge c \right\} = \alpha. $$

Now we use the Neyman-Pearson lemma to derive UMP tests in the following two examples.

Example 3. Suppose that X is a sample of size 1. We wish to test whether it comes from N (0, 1) or the double exponential distribution DE(0, 2) with the pdf 4⁻¹ exp(−|x|/2).

• Can you guess the testing procedure?

• Since P (f1(X) = c f0(X)) = 0, there is a unique nonrandomized UMP test.

• The UMP test has T(x) = 1 if and only if

$$ \frac{\pi}{8} \exp(x^2 - |x|) > c^2 $$

for some c > 0, which is equivalent to |x| > t or |x| < 1 − t for some t > 1/2.

• Suppose that α < 1/4. We use

$$ \alpha = E_0[T(X)] = P_0(|X| > t) + P_0(|X| < 1 - t). $$

If t ≤ 1, then P0(|X| > t) ≥ P0(|X| > 1) = 0.3173 > α. Hence t should be greater than 1, the set {|x| < 1 − t} is empty, and α = Φ(−t) + 1 − Φ(t).

Thus, t = Φ⁻¹(1 − α/2) and T(X) = I(t,∞)(|X|).

• Why does the UMP test reject H0 when |X| is large?

• The power of T under H1 is

$$ E_1[T(X)] = P_1(|X| > t) = 1 - \frac{1}{4}\int_{-t}^{t} e^{-|x|/2}\,dx = e^{-t/2}. $$
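As a numerical illustration of Example 3, the sketch below picks α = 0.05 (an assumed value, chosen only for illustration), computes the cutoff t = Φ⁻¹(1 − α/2), and checks the closed-form power e^{−t/2} against a midpoint-rule integration of the DE(0, 2) density. The helper `de_power_numeric` is hypothetical.

```python
from math import exp
from statistics import NormalDist

std_normal = NormalDist()          # N(0, 1)

alpha = 0.05                       # assumed level, must be below 1/4
t = std_normal.inv_cdf(1 - alpha / 2)

# Size of the test under H0: P0(|X| > t)
size = 2 * (1 - std_normal.cdf(t))

# Closed-form power under H1 (double exponential with scale 2)
power = exp(-t / 2)

def de_power_numeric(t, steps=20000):
    """P1(|X| > t), integrating the density exp(-|x|/2)/4 over [-t, t] by the midpoint rule."""
    h = 2 * t / steps
    inside = sum(0.25 * exp(-abs(-t + (i + 0.5) * h) / 2) for i in range(steps)) * h
    return 1 - inside

print(f"t = {t:.4f}, size = {size:.4f}")
print(f"power (closed form) = {power:.4f}")
print(f"power (numeric)     = {de_power_numeric(t):.4f}")
```

The cutoff exceeds 1, consistent with the argument above, and the power is modest because a single observation carries little information for separating the two distributions.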



Example 4. Let X1, . . . , Xn be iid binary random variables with p = P (X1 = 1). Suppose that we wish to test H0 : p = p0 versus H1 : p = p1, where 0 < p0 < p1 < 1.

• Since P (f1(X) = c f0(X)) ≠ 0, we may need to consider a randomized UMP test.

• A UMP test of size α is

$$ T(Y) = \begin{cases} 1 & \lambda(Y) > c \\ \gamma & \lambda(Y) = c \\ 0 & \lambda(Y) < c, \end{cases} $$

where $Y = \sum_{i=1}^n X_i$ and

$$ \lambda(Y) = \left(\frac{p_1}{p_0}\right)^{Y} \left(\frac{1 - p_1}{1 - p_0}\right)^{n - Y}. $$

• Since λ(Y ) is increasing in Y , there is an integer m > 0 such that

$$ T(Y) = \begin{cases} 1 & Y > m \\ \gamma & Y = m \\ 0 & Y < m, \end{cases} $$

where m and γ satisfy

α = E0[T(Y )] = P0(Y > m) + γ P0(Y = m).

• Since Y has the binomial distribution Bin(n, p), we can determine m and γ from

$$ \alpha = \sum_{j=m+1}^{n} \binom{n}{j} p_0^{j} (1 - p_0)^{n-j} + \gamma \binom{n}{m} p_0^{m} (1 - p_0)^{n-m}. $$

• Unless

$$ \alpha = \sum_{j=m+1}^{n} \binom{n}{j} p_0^{j} (1 - p_0)^{n-j} $$

for some integer m, the UMP test is a randomized test.

• Do you notice that the UMP test T does not depend on p1?

– The Neyman-Pearson lemma tells us that we should put points x into the rejection region in order of their likelihood ratio until the level of the test reaches α.

– Think of two hypothesis testing problems: the first one is H0 : p = p0 versus H1 : p = p1, and the second one is H0 : p = p0 versus H1 : p = p2, where p1 > p0 and p2 > p0.

– For both testing problems, the likelihood ratio increases as y increases.

– T is in fact a UMP test for testing H0 : p = p0 versus H1 : p > p0.
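The constants m and γ of the randomized test can be found by accumulating the upper tail of Bin(n, p0) until the level α would be exceeded. The sketch below does this with the standard library; the function name `ump_randomized` and the example values n = 10, p0 = 0.5, α = 0.05 are assumptions chosen for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(Y = k) for Y ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def ump_randomized(n, p0, alpha):
    """Return (m, gamma) with P0(Y > m) + gamma * P0(Y = m) = alpha."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    tail = 0.0                              # running value of P0(Y > m)
    for m in range(n, -1, -1):
        pm = binom_pmf(m, n, p0)
        if tail + pm > alpha:               # adding Y = m entirely would overshoot
            return m, (alpha - tail) / pm   # reject with probability gamma at Y = m
        tail += pm
    raise RuntimeError("unreachable for 0 < alpha < 1")

n, p0, alpha = 10, 0.5, 0.05
m, gamma = ump_randomized(n, p0, alpha)

# Verify that the randomized test has exact size alpha
size = sum(binom_pmf(j, n, p0) for j in range(m + 1, n + 1)) + gamma * binom_pmf(m, n, p0)
print(f"m = {m}, gamma = {gamma:.4f}, size = {size:.4f}")
```

Note that p1 never enters the computation, which is the observation that makes the same test uniformly most powerful against every p > p0.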

• Suppose that there is a test T of size α such that, for every distribution P1 in the alternative family, T is UMP for testing H0 versus the hypothesis P = P1. Then T is UMP for testing H0 versus H1.

Example: Suppose we have reason to believe that the true average monthly return on stocks selected by darts is 1.5%. We want to choose between H0 : µ = 1.5 versus H1 : µ ≠ 1.5, where µ is the true mean monthly return.

• We need to select a significance level α. Let's pick α = 0.05. This means that there is at most a 5% chance that we will mistakenly reject H0 if in fact H0 is true (Type I error). It says nothing about the chances that we will mistakenly stick with H0 if in fact H1 is true (Type II error).

• Large sample hypothesis test. Let's suppose we have samples X1, · · · , Xn with n > 30.

– The first step in choosing between our hypotheses is computing the following test statistic:

$$ Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}. $$

– If the null hypothesis is true, then the test statistic Z has approximately a standard normal distribution by the central limit theorem.

– The test: If Z < −zα/2 or Z > zα/2, we reject the null hypothesis; otherwise, we stick with the null hypothesis. (Recall that zα/2 is defined by the requirement that the area to the right of zα/2 under N (0, 1) is α/2.)

• T-test for normal population. Suppose now that we don't necessarily have a large sample but we do have a normal population. Consider the same hypotheses as before.

– Now our test statistic becomes

$$ t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}. $$

– Under H0, the test statistic t has a t-distribution with n − 1 degrees of freedom.

– Consider the mean return on darts. Suppose we have n = 20 observations (the 1-month contests) with a sample mean of −1.0 and a sample standard deviation of 7.2. Our test statistic is −1.55. The threshold for rejection is t19,0.025 = 2.093. Since |t| < 2.093, we cannot reject H0.

Example: Consider the effect of a packaging change on sales of a product. Let µ be the (unknown) mean increase in sales due to the change. We have data available from a test-marketing study. We will not undertake the change unless there is strong evidence in favor of increased sales. We should therefore set up the test like this: H0 : µ ≤ 0 versus H1 : µ > 0.

• Note that this is a one-sided test.

• This formulation implies that a large X̄ (i.e., large increases in sales in a test market) will support H1 (i.e., cause us to switch to the new package), but negative values of X̄ (reflecting decreased sales) support H0.

• The packaging example: Suppose that, based on test-marketing in 36 stores, we observe a sample mean increase in sales of 13.6 units per week with a sample standard deviation of 42.

Is the observed increase significant at level α = 0.05? To answer this, we compute the test statistic Z = 13.6/(42/√36) = 1.94.

Our cutoff is zα = 1.645. Since Z > zα, the increase is significant.
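Both worked examples use the same standardized statistic, so a short script can recompute them from the quoted sample summaries. This is a sketch only: the t cutoff 2.093 is taken from the text rather than computed (the standard library has no t quantile function), and `standardized_mean` is an illustrative helper name.

```python
from math import sqrt
from statistics import NormalDist

def standardized_mean(xbar, mu0, sd, n):
    """(xbar - mu0) / (sd / sqrt(n)), the common form of the Z and t statistics."""
    return (xbar - mu0) / (sd / sqrt(n))

# Darts example: two-sided t-test, n = 20
t_darts = standardized_mean(xbar=-1.0, mu0=1.5, sd=7.2, n=20)
t_cutoff = 2.093                       # t_{19, 0.025}, quoted in the text
reject_darts = abs(t_darts) > t_cutoff

# Packaging example: one-sided large-sample test, n = 36
z_pack = standardized_mean(xbar=13.6, mu0=0.0, sd=42.0, n=36)
z_cutoff = NormalDist().inv_cdf(0.95)  # upper 5% point of N(0, 1)
reject_pack = z_pack > z_cutoff

print(f"darts:     t = {t_darts:.2f}, cutoff = {t_cutoff}, reject H0: {reject_darts}")
print(f"packaging: Z = {z_pack:.2f}, cutoff = {z_cutoff:.3f}, reject H0: {reject_pack}")
```

The one-sided test compares Z with the upper cutoff only, while the two-sided test compares the absolute value with the cutoff for α/2.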



Observational Studies

• An observational study on sex bias in admissions to the Graduate Division at the University of California, Berkeley, was carried out in the fall quarter of 1973. Bickel, P., O'Connell, J.W., and Hammel, E. (1975) Is there a sex bias in graduate admissions? Science 187, 398-404.

– There were 8,442 men who applied for admission to graduate school that quarter, and 4,321 women.

– About 44% of the men and 35% of the women were admitted.

– Assuming that the men and women were on the whole equally well qualified (and there is no evidence to the contrary), the difference in admission rates looks like a very strong piece of evidence that men and women are treated differently in the admission procedure.

• Admissions to graduate work are made separately for each major. By looking at each major separately, it should have been possible to identify the ones which discriminated against the women.

– At Berkeley, there are over a hundred majors.

– Look at the six largest majors, which had over five hundred applicants each. (Together they accounted for over one third of the total number of applicants to the campus.)

– In each major, the percentage of female applicants who were admitted is roughly equal to the percentage of male applicants.

– The only exception is major A, which appears to discriminate against men: it admitted 82% of the women, and only 62% of the men.

– When all six majors are taken together, they admitted 44% of the male applicants, and only 30% of the females; the difference is 14%.

• Admissions data in the six largest majors

                  Men                       Women
         Number of    Percent      Number of    Percent
Major    applicants   admitted     applicants   admitted
A           825          62           108          82
B           560          63            25          68
C           325          37           593          34
D           417          33           375          35
E           191          28           393          24
F           373           6           341           7

• What is going on? An explanation:

– The first two majors were easy to get into. Over 50% of the men applied to these two.

– The other four majors were much harder to get into. Over 90% of the women applied to these four.

– There was an effect due to the choice of major, confounded with the effect due to sex. When the choice of major is controlled for, as in the above table, there is little difference in the admission rates for men and women.

• An experiment is controlled when the investigators determine which subjects will be the controls and which will get the treatment, for instance by tossing a coin.

• Statisticians distinguish carefully between controlled experiments and observational studies.

– Studies of the effects of smoking are necessarily observational: nobody is going to smoke for ten years just to please a statistician.

– Many problems can be studied only observationally, and all observational studies have to deal with the problem of confounding.

– For the admission example, it is misleading to compare campus-wide admission rates without accounting for the choice of major. We have to make comparisons for homogeneous subgroups.

– This was not a controlled, randomized experiment, however; sex was not randomly assigned to the applicants.

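• An alternative analysis: Compare the weighted average admission rates for men and women, weighting each major by its total number of applicants. For men,

$$ \frac{933}{4526} \times 62\% + \frac{585}{4526} \times 63\% + \frac{918}{4526} \times 37\% + \frac{792}{4526} \times 33\% + \frac{584}{4526} \times 28\% + \frac{714}{4526} \times 6\%, $$

and similarly for women, which leads to 39% versus 43%.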
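The contrast between the pooled and the weighted admission rates can be recomputed from the table of the six largest majors. In this sketch the number of women applying to major C is taken to be 593, the value implied by the total of 918 applicants used for that major in the weighted average; the dictionary layout is an illustrative choice.

```python
# major: (men applicants, % men admitted, women applicants, % women admitted)
majors = {
    "A": (825, 62, 108, 82),
    "B": (560, 63, 25, 68),
    "C": (325, 37, 593, 34),   # 593 assumed: 918 total applicants minus 325 men
    "D": (417, 33, 375, 35),
    "E": (191, 28, 393, 24),
    "F": (373, 6, 341, 7),
}

men_total = sum(m for m, _, _, _ in majors.values())
women_total = sum(w for _, _, w, _ in majors.values())
all_total = men_total + women_total

# Pooled rates: each sex weighted by where its own applicants applied
pooled_men = sum(m * pm / 100 for m, pm, _, _ in majors.values()) / men_total
pooled_women = sum(w * pw / 100 for _, _, w, pw in majors.values()) / women_total

# Weighted rates: both sexes weighted by the same major sizes (men + women)
weighted_men = sum((m + w) * pm / 100 for m, pm, w, _ in majors.values()) / all_total
weighted_women = sum((m + w) * pw / 100 for m, _, w, pw in majors.values()) / all_total

print(f"total applicants in the six majors: {all_total}")
print(f"pooled:   men {pooled_men:.1%}, women {pooled_women:.1%}")
print(f"weighted: men {weighted_men:.1%}, women {weighted_women:.1%}")
```

Using common weights removes the effect of men and women applying to different majors, and the direction of the difference reverses, which is the confounding described above.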


Hypothesis Testing By Likelihood Methods

Example. Let X1, . . . , Xn be iid with X1 ∼ N (µ, 1).

• Test H0 : µ = 0 versus H1 : µ = µ0 > 0.

• Construct a test with α = 0.05 and β = 0.2005.

• Reject H0 if $\sqrt{n}\,\bar{X}_n > 1.645$.

• Note that

$$ \beta = P(\sqrt{n}\,\bar{X}_n \le 1.645 \mid \mu = \mu_0) = \Phi(1.645 - \sqrt{n}\,\mu_0). $$

• If n → ∞ and µ0 is a fixed positive constant, β → 0.

• To ensure β = 0.2005, it requires that $1.645 - \sqrt{n}\,\mu_0 = -0.84$, or $\mu_0 = 2.485\,n^{-1/2}$.

• Notice that µ0 then changes with n, so it is no longer a fixed alternative.
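The behaviour of β under a fixed alternative and under the shrinking alternative µ0 = 2.485 n^{−1/2} can be tabulated directly. The sketch below uses `statistics.NormalDist` for Φ; the fixed value µ0 = 0.5 and the sample sizes are assumptions chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf    # standard normal cdf
Z_ALPHA = 1.645           # cutoff for alpha = 0.05, as in the text

def type2_error(n, mu0):
    """beta = Phi(1.645 - sqrt(n) * mu0) for the test that rejects when sqrt(n) * mean > 1.645."""
    return Phi(Z_ALPHA - sqrt(n) * mu0)

sample_sizes = (10, 100, 1000)

# Fixed alternative: beta shrinks to 0 as n grows
beta_fixed = [type2_error(n, 0.5) for n in sample_sizes]

# Local alternative mu0 = 2.485 / sqrt(n): beta stays at about 0.2005
beta_local = [type2_error(n, 2.485 / sqrt(n)) for n in sample_sizes]

for n, bf, bl in zip(sample_sizes, beta_fixed, beta_local):
    print(f"n = {n:5d}: beta(fixed mu0 = 0.5) = {bf:.4f}, beta(local) = {bl:.4f}")
```

Holding β constant forces the alternative to move toward the null at rate n^{−1/2}, which is why asymptotic power comparisons are usually made along such local alternatives.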

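Test Statistics for A Simple Null Hypothesis

Consider testing H0 : θ = θ0 ∈ R^s versus H1 : θ ≠ θ0.

Likelihood Ratio Test

• A likelihood ratio statistic,

$$ \Lambda_n = \frac{L(\theta_0; x)}{\sup_{\theta \in \Theta} L(\theta; x)}, $$

was introduced by Neyman and Pearson (1928).

• Λn takes values in the interval [0, 1] and H0 is to be rejected for sufficiently small values of Λn.

• The rationale behind LR tests is that when H0 is true, Λn tends to be close to 1, whereas when H1 is true, Λn tends to be close to 0.

• The test may be carried out in terms of the statistic

λn = −2 log Λn.

• For finite n, the null distribution of λn will generally depend on n and on the form of the pdf of X.

• LR tests are closely related to MLEs.

• Denote the MLE by θ̂. For asymptotic analysis, expanding the log-likelihood at θ0 around θ̂ in a Taylor series (the first-order term vanishes at the MLE), we get

$$ \lambda_n = -2\left\{ -\sum_{i=1}^{n} \log f(X_i; \hat\theta) + \sum_{i=1}^{n} \log f(X_i; \theta_0) \right\} = 2\left\{ -\frac{1}{2} (\theta_0 - \hat\theta)^T \left( \sum_{i=1}^{n} \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(X_i; \theta) \Big|_{\theta = \theta^*} \right) (\theta_0 - \hat\theta) \right\}, $$

where θ* lies between θ̂ and θ0.

• Since θ̂ is consistent,

$$ \lambda_n = n (\hat\theta - \theta_0)^T \left( -\frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(X_i; \theta) \Big|_{\theta = \theta_0} \right) (\hat\theta - \theta_0) + o_P(1). $$

By the asymptotic normality of θ̂ and

$$ -n^{-1} \sum_{i=1}^{n} \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(X_i; \theta) \Big|_{\theta = \theta_0} \xrightarrow{P} I(\theta_0), $$

λn has, under H0, a limiting chi-squared distribution on s degrees of freedom.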

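Example. Consider the testing problem H0 : θ = θ0 versus H1 : θ ≠ θ0 based on iid X1, . . . , Xn from the uniform distribution U (0, θ).

• $L(\theta_0; x) = \theta_0^{-n}\, 1\{x_{(n)} \le \theta_0\}$

• θ̂ = x(n) (MLE) and $\sup_{\theta \in \Theta} L(\theta; x) = x_{(n)}^{-n}$

• We have

$$ \Lambda_n = \begin{cases} (X_{(n)}/\theta_0)^n & X_{(n)} \le \theta_0 \\ 0 & X_{(n)} > \theta_0. \end{cases} $$

• Reject H0 if X(n) > θ0 or X(n)/θ0 < c^{1/n}.

• What is the asymptotic distribution of λn?

• What is P (n log(X(n)/θ0) ≤ c) where c < 0? Under H0, X(n)/θ0 has cdf uⁿ on [0, 1], so this probability equals e^c. The distribution of λn is therefore not the chi-squared distribution with 1 degree of freedom that the general theory would suggest. (Why?)

Example. Consider the testing problem H0 : σ² = σ0² versus H1 : σ² ≠ σ0² based on iid X1, . . . , Xn from the normal distribution N (µ0, σ²).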

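• $L(\theta_0; x) = (2\pi\sigma_0^2)^{-n/2} \exp\left\{ -\sum_i (x_i - \mu_0)^2 / (2\sigma_0^2) \right\}$

• $\hat\sigma^2 = n^{-1} \sum_i (x_i - \mu_0)^2$ (MLE) and

$$ \sup_{\theta} L(\theta; x) = (2\pi\hat\sigma^2)^{-n/2} \exp(-n/2). $$

• We have

$$ \Lambda_n = \left( \frac{\hat\sigma^2}{\sigma_0^2} \right)^{n/2} \exp\left\{ \frac{n}{2} - \frac{\sum_i (x_i - \mu_0)^2}{2\sigma_0^2} \right\}, $$

or, under H0,

$$ \lambda_n = -n \left\{ \ln\left( \frac{1}{n} \sum_{i=1}^{n} Z_i^2 \right) + \left( 1 - \frac{1}{n} \sum_{i=1}^{n} Z_i^2 \right) \right\}, $$

where Z1, . . . , Zn are iid N (0, 1).

• Fact: Using the CLT, we have

$$ \frac{n^{-1} \sum_{i=1}^{n} Z_i^2 - 1}{\sqrt{2/n}} \xrightarrow{d} N(0, 1) \quad\text{or}\quad \frac{n}{2} \left( \frac{1}{n} \sum_{i=1}^{n} Z_i^2 - 1 \right)^2 \xrightarrow{d} \chi^2_1. $$

• Note that ln u ≈ −(1 − u) − (1 − u)²/2 when u is near 1, and $n^{-1} \sum_{i=1}^{n} Z_i^2 \to 1$ in probability by the LLN.

• A common question in Taylor series approximation is how many terms we should keep. In this example, the question is whether to use the first-order approximation ln u ≈ −(1 − u) instead of the second-order approximation used above. If we use the first-order approximation, we end up with the difficulty of finding lim_n a_n b_n when lim_n a_n = ∞ and lim_n b_n = 0.

• We conclude that λn has a limiting chi-squared distribution with 1 degree of freedom.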
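The chi-squared limit for the normal-variance example can be checked by simulation. The sketch below draws samples under H0, computes λn = −n{ln U + 1 − U} with U = σ̂²/σ0², and compares the rejection frequency with the nominal 5% level using the χ²₁ critical value 3.841. The sample size, number of replications, and seed are arbitrary assumptions.

```python
import random
from math import log

def lr_statistic(sample, mu0, sigma0_sq):
    """lambda_n = -2 log(Lambda_n) for H0: sigma^2 = sigma0^2 with known mean mu0."""
    n = len(sample)
    u = sum((x - mu0) ** 2 for x in sample) / (n * sigma0_sq)   # sigma_hat^2 / sigma0^2
    return -n * (log(u) + 1 - u)

rng = random.Random(0)       # fixed seed so the run is reproducible
n, reps = 100, 2000
CHI2_1_95 = 3.841            # upper 5% point of the chi-squared distribution with 1 df

stats = [
    lr_statistic([rng.gauss(0.0, 1.0) for _ in range(n)], mu0=0.0, sigma0_sq=1.0)
    for _ in range(reps)
]
rejection_rate = sum(s > CHI2_1_95 for s in stats) / reps

print(f"mean of lambda_n: {sum(stats) / reps:.3f} (chi-squared with 1 df has mean 1)")
print(f"fraction exceeding 3.841: {rejection_rate:.3f} (nominal 0.05)")
```

The simulated rejection rate should sit close to 0.05, with the remaining gap reflecting both Monte Carlo noise and the finite-sample error of the asymptotic approximation.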



1. We begin with the simplest case of a test. Suppose we are inclined to believe that some (unknown) population mean µ has the value µ0, where µ0 is some (known) number. We have samples X1, . . . , Xn from the underlying population and we want to test our hypothesis that µ = µ0. Thus, we have

H0 : µ = µ0
H1 : µ ≠ µ0

What sort of evidence would lead us to reject H0 in favor of H1? Naturally, a sample mean far from µ0 would support H1, while one close to µ0 would not. Hypothesis testing makes this intuition precise.

2. This is a two-sided or two-tailed test because sample means that are very large or very small count as evidence against H0. In a one-sided test, only values in one direction are evidence against the null hypothesis. We treat that case later.

3. Example: Suppose we have reason to believe that the true average monthly return on stocks selected by darts is 1.5%. (See Dart Investment Fund in the casebook for background and data.) We want to choose between

H0 : µ = 1.5
H1 : µ ≠ 1.5,

where µ is the true mean monthly return.

4. We need to select a significance level α. Let's pick α = .05. This means that there is at most a 5% chance that we will mistakenly reject H0 if in fact H0 is true (Type I error). It says nothing about the chances that we will mistakenly stick with H0 if in fact H1 is true (Type II error).

5. Large sample hypothesis test. Let's suppose we have samples X1, . . . , Xn with n > 30. The first step in choosing between our hypotheses is computing the following test statistic:

$$ Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}. $$

I am temporarily assuming that we know σ.

6. Remember that we know µ0 (it's part of the null hypothesis we've formulated), even though we don't know µ.

7. If the null hypothesis is true, then the test statistic Z has approximately a standard normal distribution.

8. Now we carry out the test: If Z < −zα/2 or Z > zα/2, we reject the null hypothesis; otherwise, we stick with the null hypothesis. (Recall that zα/2 is defined by the requirement that the area to the right of zα/2 under N (0, 1) is α/2. Thus, with α = .05, the cutoff is 1.96.)

9. Equivalently, we reject H0 when the test statistic Z lands in the set of points having absolute value greater than zα/2. This set is called the rejection region for the test.

10. Every hypothesis test has this general form: we compute a test statistic from data, then check if the test statistic lands inside or outside the rejection region. The rejection region depends on α but not on the data.

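11. Notice that saying

$$ -z_{\alpha/2} < \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} < z_{\alpha/2} $$

is equivalent to saying

$$ \bar{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} < \mu_0 < \bar{X} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}. $$

So, here is another way to think of the test we just did. We found a confidence interval for the mean and checked to see if µ0 lands in that interval. If µ0 lands inside, we don't reject H0; if µ0 lands outside, we do reject H0.

12. This supports our intuition that we should reject H0 if X̄ is far from µ0.

13. As usual, if we don't know σ we replace it with the sample standard deviation s.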

14. T-test for normal population. Suppose now that we don't necessarily have a large sample but we do have a normal population. Consider the same hypotheses as before. Now our test statistic becomes

$$ t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}. $$

15. Under the null hypothesis, the test statistic t has a t-distribution with n − 1 degrees of freedom.

16. Now we carry out the test. Reject if

t < −t(n−1, α/2) or t > t(n−1, α/2);

otherwise, do not reject.

17. As before, rejecting based on this rule is equivalent to rejecting whenever µ0 falls outside the confidence interval for µ.

18. Example: Let's continue with the hypothesis test for the mean return on darts. As above, µ0 = 1.5 and α = .05. Suppose we have n = 20 observations (the 1-month contests) with a sample mean of −1.0 and a sample standard deviation of 7.2. Our test statistic is

$$ t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{-1.0 - 1.5}{7.2/\sqrt{20}} = -1.55. $$

The threshold for rejection is t(19, .025) = 2.093. Since our test statistic t has an absolute value smaller than the cutoff, we cannot reject the null hypothesis. In other words, based on a significance level of .05, the evidence does not significantly support the view that µ ≠ 1.5.

19. Absent significant evidence, we stick with the null hypothesis. The smaller we make α, the harder it is to reject H0.

20. If a test leads us to reject H0, we say that the results are significant at level α.


1. There is something rather arbitrary about the choice of α. Why should we use α = .05 rather than .01, .10 or some other value? What if we would have rejected H0 at α = .10 but fail to reject it because we chose α = .05? Should we change our choice of α?

2. Changing α after a hypothesis test is "cheating" in a precise sense. Recall that, by the definition of α, the probability of a Type I error is at most α. Thus, fixing α gives us a guarantee on the effectiveness of the test. If we change α, we lose this guarantee.

3. Nevertheless, there is an acceptable way to report what would have happened had we chosen a different significance level. This is based on something called the p-value of a test.

4. The p-value is the smallest significance level (i.e., the smallest α) at which H0 would be rejected, for a given test statistic. It is therefore a measure of how significant the evidence in favor of H1 is: the smaller the p-value, the more compelling the evidence.

5. Example: Consider the test of mean returns on stocks picked by darts, as above. To simplify the present discussion, let's suppose we have 30 data points, rather than 20. (See Dart Investment Fund in the casebook for background and data.) As before, the hypotheses are

H0 : µ = 1.5
H1 : µ ≠ 1.5

Let's suppose that our sample mean X̄ (based on 30 observations) is −0.8 and the sample standard deviation is 6.1. Since we are assuming a large sample, our test statistic is

$$ Z = \frac{\bar{X} - 1.5}{s/\sqrt{n}} = \frac{-0.8 - 1.5}{6.1/\sqrt{30}} = -2.06. $$

With a significance level of α = .05, we get zα/2 = 1.96, and our rejection region would be

Z < −1.96 or Z > 1.96.

So, in this case, Z = −2.06 would be significant: it is sufficiently far from zero to cause us to reject H0.

We now ask, what is the smallest α at which we would reject H0, based on Z = −2.06. We are asking for the smallest α such that −2.06 < −zα/2; i.e., the smallest α such that zα/2 < 2.06. To find this value, we look up 2.06 in the normal table. This gives us
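The table lookup can be replaced by a direct computation of the two-sided p-value. This is a sketch using `statistics.NormalDist`, with the sample summaries quoted in the example; it is not the original handout's calculation, whose result is cut off in the text.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf           # standard normal cdf

# Sample summaries from the darts example with n = 30
xbar, mu0, s, n = -0.8, 1.5, 6.1, 30
z = (xbar - mu0) / (s / sqrt(n))

# Two-sided p-value: the smallest alpha with z_{alpha/2} below |Z|
p_value = 2 * (1 - Phi(abs(z)))

print(f"Z = {z:.2f}")
print(f"two-sided p-value = {p_value:.4f}")
print("reject at alpha = 0.05:", p_value < 0.05)
print("reject at alpha = 0.01:", p_value < 0.01)
```

Reporting the p-value lets a reader apply any significance level: the result here is significant at the 5% level but not at the 1% level.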



