Financial Time Series I Topic 3: Analysis of Variance Hung Chen Department of Mathematics National Taiwan University 11/25/2002



OUTLINE

1. One-Way Analysis of Variance
– Introduction
– Examples
– Basic Set-up
– Data Example
– Algebra
– Validity Conditions on F-test

2. Analysis of Variance of Ranks

3. Logistic Regression

4. Two-Way Analysis of Variance
– Interaction
– Data Example
– Related R-command
– Design


Comparison of Several Populations

• When some measurement, such as height or aptitude for a particular job, is made on several individuals, the values vary from person to person.

– The variability of a quantitative scale is measured by its variance.

– If the set of individuals is stratified into more homogeneous groups, the variance of the measurements within the more homogeneous group will be less than that of the measurements in the entire group; that is what “more homogeneous” means.

• As an example, consider the heights of pupils in an elementary school.

– Fact 1: The variance of the heights of all pupils in the school is usually greater than the variance of the heights of pupils in just the first grade, in just the second grade, or in any other single grade. The within-grade variances are less than the overall variance.

– Fact 2: The average height of pupils also varies from grade to grade; that is, the averages vary between grades.

– The total variability (of heights) is made up of two components: the variability of individuals within groups (grades) and the variability of means between groups (grades).

Recall

$\mathrm{Var}(Y) = E[\mathrm{Var}(Y \mid X)] + \mathrm{Var}(E[Y \mid X])$.

Y refers to the height of students and X refers to the grades.

– At the extreme, all of the variability of a measured variable may be within the groups and none of it between groups, that is, the means of the subgroups are equal. Mathematically, this is the case E(Y|X) = E(Y).
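The variance decomposition can be checked numerically. The sketch below uses a small set of made-up heights (the numbers and grade labels are illustrative assumptions, not data from the lecture) and treats the sample as the whole population, so every variance divides by the number of observations.

```python
# Numerical check of Var(Y) = E[Var(Y|X)] + Var(E[Y|X]).
# The heights (in cm) are hypothetical; X is the grade, Y is the height.
heights_by_grade = {
    1: [115.0, 118.0, 121.0],
    2: [122.0, 125.0, 128.0, 125.0],
    3: [130.0, 133.0, 127.0],
}

def mean(values):
    return sum(values) / len(values)

def pop_var(values):
    """Population variance (divide by n), matching Var(.) in the formula."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

all_heights = [h for group in heights_by_grade.values() for h in group]
n = len(all_heights)
overall_mean = mean(all_heights)

# E[Var(Y|X)]: within-grade variances weighted by the share of pupils in each grade.
within = sum(len(g) / n * pop_var(g) for g in heights_by_grade.values())
# Var(E[Y|X]): weighted squared deviations of grade means from the overall mean.
between = sum(len(g) / n * (mean(g) - overall_mean) ** 2
              for g in heights_by_grade.values())
total = pop_var(all_heights)

print(f"total = {total:.4f}, within = {within:.4f}, between = {between:.4f}")
```

The two components add up to the total variance, and the within-grade component is smaller than the total whenever the grade means differ.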

• The analysis of variance is a set of statistical techniques for studying variability from different sources and comparing them to understand the relative importance of each of the sources.

– It is also used to make inferences about the population through tests of significance, including the very important comparison of the means of two or more separate populations.

– The analysis of variance is the most straightforward way to examine the association between a categorical variable (“groups”) and a numerical variable (the measure on which the means are based).

– In the above example, we try to find the association between height of student and grade of student.


One-Way Analysis of Variance

Suppose that we wish to compare the means of three populations on some measured dependent variable.

• The population means are represented as µ₁, µ₂, and µ₃.

• A hypothesis that may be tested is H₀ : µ₁ = µ₂ = µ₃

that is, that the means of the three populations are equal.

• The hypothesis is rejected if the population means are different in any way, for example, if one of the means differs from the other two or if all three means are different.

• The procedure for testing H₀ through the analysis of variance parallels that for other tests of significance.

• Samples of data are obtained and a test statistic is computed. The test statistic is based on the following idea.

– The total variability of the data is allocated to two sources: variability among the group means and variability of the observations within the groups.

– Two measures of variability are computed, Mean Square between Groups and Mean Square within Groups, respectively.

– The hypothesis that the three population means are equal is tested by comparing the magnitudes of these two mean squares. The test statistic is the ratio of Mean Square between Groups to Mean Square within Groups.

– We need a reference distribution, the F-distribution, to provide significance points for deciding if the ratio is large enough to reject the hypothesis of equal means.

• In this application the test statistic reflects the extent to which variation among the sample means is greater than variation among observations within the groups.

• If the test statistic is large enough (that is, if it exceeds the corresponding significance point) then H₀ is rejected.


• If the sample means are close together and hence the test statistic is small, then H₀ is accepted.

• Can we apply this procedure for comparing the means of two populations?

• When only two groups are compared and H₀ is rejected, the sample means show directly which group has the significantly higher mean.

– When there are three or more groups, however, rejection of H₀ only means that at least one population mean is different from the others.

– It is necessary to follow the test of significance with additional analysis-of-variance procedures to determine which of the population means are different from which other means.

– What are these follow-up procedures?


Set-Up

• The “groups” compared through analysis of variance may be created in a number of different ways.

– In many studies the data are already classified into groups, such as states, countries, gender groups, medical diagnostic categories, or religious affiliations.

– In other instances, the statistician may define the groups based on one or more measured characteristics.

For example, socioeconomic classifications may be created by combining individuals’ educational levels, incomes, and employment statuses according to some set of rules. The result might be, say, three or four or five separate socioeconomic categories.

– A medical researcher might classify patients as “high risk” or “low risk” for some disease based on a number of indicators, for example, age, family history, and health behaviors.

– In conducting surveys, groups are frequently formed from the respondents’ answers, for example, all people who support a particular political issue, those who are opposed, and those who have no opinion.

– In an experiment, the groups are defined by the various experimental conditions or “treatments,” such as instructional methods, psychological or social interventions, or dosages or different forms of a drug.

In these situations, individuals are assigned to the conditions by the researcher.

When possible, the assignment should be done randomly so that any difference between the groups is explained by differences in the treatments received and not by other factors that are irrelevant to the study.


EXAMPLE 1

A psychologist is interested in comparing the effects of three different informational “sets” on children’s ability to memorize words.

• Eighteen 7-year-old children were randomly assigned to one of three groups.

– In condition 1, the children were shown a list of 12 words and were asked to study them in preparation for a recall test.

– In condition 2, the children were told that the words comprised three global categories, flowers, animals, and foods, and were asked to study the list.

– In condition 3, the children were told that the words comprised 6 more detailed categories.

• After studying the word list for 10 minutes, each youngster was asked to list all of the words that he/she could remember.

• The number of correctly recalled words was the measured outcome variable of the study.

EXAMPLE 2


A study was performed to see if school grades are related to television viewing habits among high-school juniors.

• Respondents were monitored for 15 weekdays during the year and classified into 4 groups according to the average amount of television they viewed on those days (0–0.5 hours; 0.5–1.5 hours; 1.5–3.0 hours; more than 3 hours).

• Each student’s grade point average (GPA) was recorded for all courses taken during the year.

• Mean GPAs were compared among the four television viewing groups.

EXAMPLE 3

In the study The Academic Mind by Lazarsfeld and Thielens, a total of 2451 social science faculty members from 165 of the larger American colleges and universities were interviewed in order to assess the impact of the McCarthy era on social science faculties.

• At each college, the number of “academic freedom incidents” was counted.

• These were incidents mentioned by more than one respondent as an attack on the academic freedom of the faculty.

• They ranged from small-scale matters, such as a verbal denunciation of a professor by a student group, to large-scale matters, including a Congressional investigation.

• It was of interest to examine whether and how the institutional basis of a school’s support and control affected the number of “incidents” occurring there.

• Hence, each college was classified as publicly controlled, privately controlled, or controlled by some other institution. (Teachers’ colleges and schools controlled by a religious institution were included in the “other” category.)

• The distributions of numbers of “incidents” in the different types of institutions were studied.

EXAMPLE 4

A large manufacturing firm employs high-school dropouts, high-school graduates, and individuals who attended college as production-line workers.

• The company management speculated that job proficiency was related to educational attainment.

• If so, only high-school graduates, or perhaps only individuals who had attended college, would be hired in the future.

• To test their idea, the job performance of a sample of employees was rated by their supervisors on an extensive rating scale that yields possible proficiency scores ranging from 0 to 200.

• The mean ratings of the three education groups (high-school dropouts, high-school graduates, college attendees) were compared by the analysis of variance.


A Complete Example with Equal Sample Sizes

The analysis of variance indicates whether population means are different by comparing the variability among sample means with variability among individual observations within groups.

The following data give the numbers of words memorized by 18 children given three different information sets as described in Example 1.

                No information   3 categories   6 categories
                      2               9               4
                      4              10               5
                      3              10               6
                      4               7               3
                      5               8               7
                      6              10               5
Sums                 24              54              30
Sample means          4               9               5

• The scores for the three samples and the pooled sample can be displayed with box plots.

• The means of the three samples are 4, 9, and 5, respectively.

• There is clear variability from group to group.

• The variability in the entire pooled sample of 18 is much larger than the variability within each of these three groups.

• Variability within Groups: The measure of variability among observations within the groups is called the Mean Square within Groups and is denoted MS_w.

– MS_w is a variance of the scores, computed separately within each of the three samples and then pooled.

– Like the sample variance, MS_w has a sum of squared deviations in the numerator and the corresponding number of degrees of freedom in the denominator.

– The numerator is called the Sum of Squares within Groups and the denominator is called the degrees of freedom within groups.

– The Sum of Squares within Groups (SS_w) is the sum of squared deviations of individual scores from their subgroup means.

In this example, SS_w = 10 + 8 + 10 = 28.

– The degrees of freedom within groups (df_w) is the total number of degrees of freedom of the deviations within the groups.

In this example, df_w = (6 − 1) + (6 − 1) + (6 − 1) = 15, so MS_w = SS_w/df_w = 28/15 = 1.867.

• Variability between Groups

– The measure of variability among the group means is called the Mean Square between Groups and is denoted MS_B.

– MS_B is similar to a variance but is computed from the three subgroup means, 4, 9, and 5.

– MS_B also has a sum of squared deviations in the numerator and a corresponding number of degrees of freedom in the denominator. The numerator is called the Sum of Squares between Groups and the denominator is called the degrees of freedom between groups.

– The Sum of Squares between Groups (SS_B) is the sum of squared deviations of the subgroup means from the overall or “pooled” mean.

∗ The pooled mean of all 18 observations is 6.

∗ The squared deviations of the group means from the pooled mean sum to (4 − 6)² + (9 − 6)² + (5 − 6)² = 14.

∗ Before dividing by the degrees of freedom between groups, one additional adjustment is necessary.

To test the null hypothesis of equal means, SS_B is going to be compared to SS_w by converting each to a mean square and then dividing one mean square by the other.

The unadjusted sum cannot be meaningfully compared to SS_w because it reflects variability among means while SS_w reflects variability among individual measurements.

To put SS_B on a scale comparable to SS_w, the sum is multiplied by the number of observations in each subgroup, that is, 6 in this example.

∗ SS_B = 6 × 14 = 84.

– The degrees of freedom between groups (df_B) is the number of independent deviations summarized in SS_B.

In this example, df_B = 3 − 1 = 2.

– In this example, MS_B = SS_B/df_B = 84/2 = 42.

• The Test Statistic:

– The test statistic to test H₀ : µ₁ = µ₂ = µ₃ is the ratio MS_B/MS_w, or 42/1.867 = 22.50.

How do you connect them with Var(Y) and Var(E(Y|X))?

– This ratio seems to indicate that the variability among groups is much greater than that within groups.

However, we know that different samples would give us different values of the ratio even if the three population means are equal.

– The calculated ratio of 22.50 is compared to values from the F-distribution to see if the test statistic is large enough for 3 samples and 18 observations to reject H₀.

• Conclusion: If we are using a 1% significance level, the significance point from the table of the F-distribution with 2 and 15 degrees of freedom is 6.36.

Thus the calculated ratio of 22.50 is highly significant, and we conclude that there are real differences in the average number of words memorized depending on the amount of organizing information that is provided.

• The follow-up question remains, specifically:

Which of the three means are significantly different from which others?
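The arithmetic of this example can be reproduced in a few lines. The sketch below is written in plain Python rather than R, and the function name `one_way_anova` is an illustrative choice, not something defined in the lecture. It recomputes SS_B, SS_w, the degrees of freedom, and the F-ratio from the word-recall data.

```python
# One-way ANOVA by hand for the word-recall data of Example 1.
groups = {
    "No information": [2, 4, 3, 4, 5, 6],
    "3 categories": [9, 10, 10, 7, 8, 10],
    "6 categories": [4, 5, 6, 3, 7, 5],
}

def one_way_anova(samples):
    """Return (SS_B, SS_w, df_B, df_w, F) for a list of samples."""
    n = sum(len(s) for s in samples)
    k = len(samples)
    grand_mean = sum(sum(s) for s in samples) / n
    means = [sum(s) / len(s) for s in samples]
    # Between groups: squared deviations of group means, weighted by group size.
    ss_b = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # Within groups: squared deviations of scores from their own group mean.
    ss_w = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    df_b, df_w = k - 1, n - k
    f_ratio = (ss_b / df_b) / (ss_w / df_w)
    return ss_b, ss_w, df_b, df_w, f_ratio

ss_b, ss_w, df_b, df_w, f_ratio = one_way_anova(list(groups.values()))
print(ss_b, ss_w, df_b, df_w, round(f_ratio, 2))
```

The computed values agree with SS_B = 84, SS_w = 28, df_B = 2, df_w = 15, and F = 22.50. The function also accepts unequal group sizes, since each group mean is weighted by its own sample size.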


The Algebra of ANOVA

• Suppose the number of groups being compared is k.

• The null hypothesis is

H₀ : µ₁ = µ₂ = · · · = µ_k,

that is, the k population means are equal.

• Notation and Data:

Group                       1      2     · · ·   k
Population  Mean            µ₁     µ₂    · · ·   µ_k
            Variance        σ₁²    σ₂²   · · ·   σ_k²
Sample      Observations    x₁₁    x₂₁   · · ·   x_k1
                            x₁₂    x₂₂   · · ·   x_k2
                            ...    ...           ...
            Sample size     n₁     n₂    · · ·   n_k
            Mean            x̄₁     x̄₂    · · ·   x̄_k
            Variance        s₁²    s₂²   · · ·   s_k²

• In R, try help.search("anova") and look at the help files of aov and manova.

• The number of sampled observations in a “typical” group, group g, is denoted by n_g and the total number of observations is n = n₁ + n₂ + · · · + n_k.

• Each observation has two subscripts, the first indicating the group to which the observation belongs and the second indicating the observation number within that group. Thus, x_gi represents the ith observation in group g.

– The sample mean of the gth subgroup is denoted by $\bar{x}_g = n_g^{-1} \sum_i x_{gi}$.

– The overall mean is denoted by $\bar{x} = n^{-1} \sum_g \sum_i x_{gi}$.

• The ANOVA Decomposition: A single typical score x_gi can be “decomposed” as

$x_{gi} - \bar{x} = (\bar{x}_g - \bar{x}) + (x_{gi} - \bar{x}_g)$.

– This expression shows that the deviation of a score from the overall mean x̄ (the source of Var(Y)) can be decomposed as the sum of two parts, the deviation of the group mean from the overall mean (the source of Var(E(Y|X))) and the deviation of an observation from its group mean (the source of Var(Y|X)).

– The terms on the right-hand side of the above expression are the fundamental elements of “between-group” and “within-group” variability, respectively.

– By squaring and summing the terms in the expression we obtain exactly the sum-of-squares partition essential for the analysis of variance.

• Total Sum of Squares:

$SS_T = \sum_g \sum_i (x_{gi} - \bar{x})^2$.

• By the property of least squares method, we have

SS_T = SS_B + SS_w.

• Between-group Sum of Squares and Mean Square:

$SS_B = \sum_g n_g (\bar{x}_g - \bar{x})^2$

and MS_B = SS_B/df_B where df_B = k − 1.

• Within-group Sum of Squares and Mean Square:

– SS_w can be obtained by subtracting SS_B from SS_T. (Recall the geometric interpretation of the least squares method.)

– df_w is the number of independent deviations of observed values from their subgroup means. Since there are n_g observations in subgroup g, the number of independent deviations in that group is n_g − 1.

Summing these across the k groups, we have df_w = n − k.

– The Mean Square within Groups is MS_w = SS_w/df_w.

• The F-ratio is F = MS_B/MS_w.
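The partition of the total sum of squares follows from squaring and summing the decomposition of a single score. A short derivation, written with x_gi for the ith observation in group g, x̄_g for the group mean, and x̄ for the overall mean:

```latex
\begin{align*}
SS_T &= \sum_g \sum_i (x_{gi} - \bar{x})^2
      = \sum_g \sum_i \bigl[(\bar{x}_g - \bar{x}) + (x_{gi} - \bar{x}_g)\bigr]^2 \\
     &= \sum_g n_g (\bar{x}_g - \bar{x})^2
      + \sum_g \sum_i (x_{gi} - \bar{x}_g)^2
      + 2 \sum_g (\bar{x}_g - \bar{x}) \sum_i (x_{gi} - \bar{x}_g) \\
     &= SS_B + SS_w .
\end{align*}
```

The cross term vanishes because the deviations from a group mean sum to zero within every group, that is, $\sum_i (x_{gi} - \bar{x}_g) = 0$ for each g.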


Validity Conditions for the F-test

The analysis of variance rests on a set of assumptions that should be considered before the procedure is applied.

• The most inflexible assumption is that the n observations must be independent.

– Sampling observations at random is an important step in assuring that this condition is met.

– Steps should be taken to assure that the response on the measured outcome variable of no respondent is in any way affected by other respondents.

If observations are humans, they should not have the opportunity to hear, see, or otherwise be influenced by other subjects’ answers or behavior.

– If the assumption of independence is not met, then the analysis of variance tests of significance are not generally valid.

• The sampling distribution of the subgroup means should be nearly normal.


– This condition is usually met in analysis of variance, especially if the subgroup n’s are moderate to large, since the Central Limit Theorem assures us that means based on large sample sizes are nearly normally distributed.

– The normality condition will also be met if the distribution of the underlying measured variable is normal, whether or not the sample sizes are large.

– It is always advisable to make histograms of the data to see whether there are any gross irregularities in the distributions, however.

– If the sample sizes are small and the distribution of the measured variable is highly non-normal, then the analysis of variance of ranks should be used in place of the ANOVA methods presented so far.

• The F-test is based on an assumption that the population variances are all equal, that is, σ₁² = · · · = σ_k².

– This condition is especially important if the sample sizes (n₁, n₂, . . . , n_k) are not equal.

– Sample variances should be computed prior to conducting the analysis of variance to see if they are in the same general range as one another.

If they appear to be very different, a formal test of equality of the σ²’s may be conducted.

– If the test indicates that the variances are not homogeneous, several options may be available.

The data may be transformed to a scale on which the variances are more equal.

For example, this might involve analyzing the logarithms of the original observed values, the square roots of the observed values, or some other function of the data.


Which Groups Differ from Which, and by How Much?

The F-test gives information about all the means µ₁, µ₂, . . . , µ_k simultaneously.

• If the hypothesis of equal means is accepted, the conclusion is that the data do not indicate differences among the population means.

• If the null hypothesis is rejected, the conclusion is that there are some differences; then the researcher may want to know which specific means are significantly different from which others, and the direction and magnitudes of the differences.

What is the method for comparing two means?

• Construct a confidence interval for the difference of the two means as follows:

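$\left(\bar{x}_1 - \bar{x}_2 - t_{n-2}(\alpha/2)\, s \sqrt{n_1^{-1} + n_2^{-1}},\ \ \bar{x}_1 - \bar{x}_2 + t_{n-2}(\alpha/2)\, s \sqrt{n_1^{-1} + n_2^{-1}}\right)$,

where s is the pooled standard deviation of the two samples and n = n₁ + n₂.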

Extend the above idea to the comparison of three or more groups.

• When three or more groups are compared by the analysis of variance, several specific comparisons may be made after the overall hypothesis of equality is rejected.

– For example, if there are three groups and H₀ is rejected, µ₁ may be compared with µ₂, µ₁ may be compared with µ₃, and µ₂ may be compared with µ₃.

– It is up to the researcher to decide which comparisons to make. The decision rests partially on the design of the research.

For example, if group 1 is a control group and groups 2 and 3 are two different experimental conditions, then it would be sensible to compare µ₂ with µ₁ and µ₃ with µ₁, that is, both experimental conditions with the control.

– If students from four different universities are being compared on mean scores on the Law School Admissions Test, every school’s mean might be routinely compared to every other school’s mean. In the latter case, the number of pairwise comparisons among k means is k × (k − 1)/2.


– The procedure for any one of these comparisons is very much like comparing two groups as described before.

– But one additional factor needs to be considered: the probability of making a Type I error when performing several tests of significance from the same data set.

– If, indeed, all the µ’s were equal, so that there were no real differences, the probability that any particular one of the pairwise differences would exceed the corresponding t-value is α.

However, the probability of making at least one Type I error out of two or more pairwise comparisons is greater than this.

That is, when many differences are tested, the probability that some will appear to be “significant” when the corresponding population means are equal is greater than the nominal significance level α. The more comparisons that are made, the greater the probability of making at least one Type I error.

• How can a researcher protect against too high a Type I error rate?

– One widely used approach is based on the Bonferroni inequality.

– The Bonferroni inequality states that the probability of making at least one Type I error out of a given set of comparisons is less than or equal to the sum of the α’s used for the separate comparisons.

– For example, if we make 2 comparisons among 3 group means and use an α level of 0.05 for each comparison, the probability of making at least one Type I error is no greater than 0.10.

– This overall α level for the pair or “family” of tests is called the familywise (or experimentwise) Type I error rate.

– The Bonferroni inequality can be put to use to keep the familywise error rate acceptably small.

– Suppose that we wish to make m specific pairwise comparisons. We can then decide on a reasonable familywise error rate (α) and divide this value by m to obtain a significance level to be used for each comparison separately; call this result α*.

– If α* is used for each of m comparisons, the probability of making at least one Type I error out of the set is not greater than m × α* = α.
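The Bonferroni adjustment is a one-line calculation. The sketch below, in plain Python with illustrative function names, computes the per-comparison level for all pairwise comparisons among k = 3 groups. The "independent" rate is shown only for comparison and rests on an assumption that does not hold exactly for pairwise tests on the same data.

```python
# Bonferroni adjustment for all pairwise comparisons among k group means.
def n_pairwise(k):
    """Number of pairwise comparisons among k means: k(k-1)/2."""
    return k * (k - 1) // 2

def bonferroni_alpha(familywise_alpha, m):
    """Per-comparison level alpha* = alpha / m."""
    return familywise_alpha / m

k = 3
m = n_pairwise(k)                       # 3 comparisons for 3 groups
alpha_star = bonferroni_alpha(0.05, m)  # level used for each single comparison
# Bonferroni bound on the familywise error rate: m * alpha*.
bound = m * alpha_star
# For comparison only: the familywise rate if the m tests were independent
# (an assumption; pairwise tests that share data are not independent).
independent_rate = 1 - (1 - alpha_star) ** m
print(m, round(alpha_star, 4), round(bound, 4), round(independent_rate, 4))
```

With a familywise rate of 0.05 and three comparisons, each comparison is tested at about 0.0167, and the bound m × α* recovers 0.05.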


Analysis of Variance of Ranks

• The basic idea of ANOVA is to compare the distributions of a variable in several populations by focusing on the means of the samples.

– The ANOVA is based on the conditions that the sample means have a normal distribution and that the populations being compared have the same variance σ².

– When these conditions are not met, or when the raw data are ordinal, can we still use the ANOVA technique to compare several populations?

– Consider the test procedure developed by Kruskal and Wallis (1952).

• Another approach to compare the distributions of a variable in several populations:

– The locations of distribution can also be compared through an analysis of ranks.

– This approach can be applied even when the data are ordinal, and does not require the assumption of normality or equal variances.


– The null hypothesis is that the locations of the populations are the same, and the alternative hypothesis is that they are not.

• Kruskal and Wallis (1952) considered the daily outputs of three bottle-cap machines.

– Use the following 12 values to test whether the three machines produce equal numbers of bottle caps.

Machine A   340, 345, 330, 342, 338
Machine B   339, 333, 344
Machine C   347, 343, 349, 355

– Instead of the above data, we use their ranks in the pooled sample.

Ranks
Machine A   5, 9, 1, 6, 3
Machine B   4, 2, 8
Machine C   10, 7, 11, 12

– The test applies ANOVA formulas to these ranks. We have

$SS_T = \frac{n(n+1)(2n+1)}{6} - n\left(\frac{n+1}{2}\right)^2 = \frac{n(n+1)(n-1)}{12}$

and MS_T = n(n + 1)/12.

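– The F-ratio in ANOVA is F = MS_B/MS_w.

– Unlike ANOVA, we consider H = SS_B/MS_T.

– Observe that

$SS_B = \sum_g \frac{\left(\sum_i r_{gi}\right)^2}{n_g} - n\left(\frac{n+1}{2}\right)^2$,

where r_gi is the rank of the ith observation in group g.

– In this example,

$H = \frac{12}{12 \times 13}\left(\frac{24^2}{5} + \frac{14^2}{3} + \frac{40^2}{4}\right) - 3 \times (12 + 1) = 5.66$.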

• The null hypothesis of no difference between the locations of the three populations is equivalent to random sampling, that is, that the ranks have been allocated at random to the k groups.

– When this hypothesis is true, the sampling distribution of H is approximately χ² with k − 1 degrees of freedom if the sample sizes are large.

– Why? What is the F-distribution?
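The rank calculation for the bottle-cap data can be sketched in plain Python. The variable names are illustrative, and the simple ranking step relies on the fact that these 12 values contain no ties; tied data would need average ranks and a correction that is not shown here.

```python
# Kruskal-Wallis H for the bottle-cap data (no ties, so ranks are 1..n).
outputs = {
    "A": [340, 345, 330, 342, 338],
    "B": [339, 333, 344],
    "C": [347, 343, 349, 355],
}

pooled = sorted(x for values in outputs.values() for x in values)
rank_of = {x: r for r, x in enumerate(pooled, start=1)}  # valid because no ties
n = len(pooled)

# Rank sum of each machine.
rank_sums = {m: sum(rank_of[x] for x in values) for m, values in outputs.items()}

# SS_B = sum_g R_g^2 / n_g - n ((n + 1) / 2)^2, with R_g the rank sum of group g.
ss_b = (sum(rank_sums[m] ** 2 / len(outputs[m]) for m in outputs)
        - n * ((n + 1) / 2) ** 2)
ms_t = n * (n + 1) / 12
h = ss_b / ms_t
print(rank_sums, round(h, 2))
```

The rank sums are 24, 14, and 40, and H rounds to 5.66. For reference, the 5% point of χ² with 2 degrees of freedom is about 5.99, although with samples this small the χ² approximation is only rough.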


Introduction to Logistic Regression

There are many situations in which we want to forecast events which cannot be modeled as a continuous variable. Examples include

• Whether a consumer will purchase a product from a menu of products, click on a web button, or respond to a direct mail offer.

• Whether a firm will decide to repurchase stock, change accounting procedures, or write off an asset.

• In these cases, the Y variable is not only discrete but takes only two values (1 “Event” or 0 “No Event”).

This is the simplest type of random variable, a Bernoulli random variable. The classic example of this variable is the outcome of a coin toss.

• We want to make predictions about this discrete Y on the basis of some other explanatory variables.

– For example, we want to study how product attributes and price influence consumer choice from an array of products or how industry and firm financial conditions affect the decision to write off assets.

– To make the conditional predictions, we need to incorporate the X variables into a prediction about Y .

– Unfortunately, the standard linear regression of Y on the X variables is inappropriate.

Our forecasts should be probabilities!

• If we think back to the introduction of regression, we viewed the regression model as a model for the conditional mean of Y given the X variables.

• For a Bernoulli random variable, the conditional mean will be a probability.

E(Y |X) = P (Y = 1|X)

• The logit model is one very convenient and useful way of forming probabilities from X variables.

$P(Y = 1 \mid X) = \frac{\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)}{1 + \exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)}$


– Think of V = β₀ + β₁X₁ + · · · + β_kX_k as a “score.”

– As V gets large, the probability Y = 1 should increase (but never exceed 1).

– For example, if a firm is considering writing off an asset, then V might be the “desirability” of the write-off.

If V is negative, the write-off is not desired.

– The logit model is a particular form for how V gets mapped into a probability.

$P(Y = 1) = \frac{\exp(V)}{1 + \exp(V)}$.

What does this probability curve or locus look like?

– Use R to plot it by yourself.

There are several key aspects to notice about this curve:

i) As V moves to levels far above or below 0, the probability of the event (for example, a purchase) gets very close to 1 or 0.

ii) The sensitivity to changes in V varies depending on the level of utility V and the associated probability. Around prob = 0.5, the slope is at its maximum of 0.25, but at low or high probabilities the slope declines to very small numbers. This is because of the “logistic” or S-shaped curve.
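Both properties of the curve can be tabulated before plotting it. The sketch below is in plain Python instead of R; the function names `logistic` and `slope` are illustrative. It uses the fact that the derivative of exp(V)/(1 + exp(V)) with respect to V is p(1 − p).

```python
import math

def logistic(v):
    """P(Y = 1) = exp(v) / (1 + exp(v)), written to avoid overflow for large |v|."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def slope(v):
    """Derivative of the logistic curve with respect to v: p * (1 - p)."""
    p = logistic(v)
    return p * (1.0 - p)

# Tabulate the probability and the slope over a range of scores V.
for v in (-6, -3, -1, 0, 1, 3, 6):
    print(f"V = {v:+d}  P = {logistic(v):.4f}  slope = {slope(v):.4f}")
```

The table shows the probability approaching 0 and 1 at the extremes, with the slope peaking at 0.25 when V = 0 and shrinking on either side.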

How do we estimate unknown parameters in such a model?

Use the method called Maximum Likelihood Estimation (“MLE”).

• Form a “likelihood function” from the contributions of the individual observations:

$\prod_i \left(\frac{\exp(V_i)}{1 + \exp(V_i)}\right)^{Y_i} \left(\frac{1}{1 + \exp(V_i)}\right)^{1 - Y_i}$

where V_i = b₀ + b₁X_i1 + · · · + b_kX_ik.

• Consider the logarithm of the likelihood, which is a sum over all the observations.

• Find the values of b₀, b₁, b₂, . . . that make this sum (log likelihood function) as large as possible.

• In R, try help(glm) and help(family) to find the MLE.

– Logistic regression is a special case of the generalized linear model.

– We need to specify the error distribution and the link function.

– For logistic regression, the error distribution is the binomial distribution and the logit link is used to connect the linear predictor to the mean of the error distribution.

– In glm, the “binomial” family admits the links “logit”, “probit”, “log”, and “cloglog” (complementary log-log).
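To make the maximization concrete, here is a minimal sketch in plain Python that maximizes the log likelihood by Newton's method for a single explanatory variable. It is not what glm does internally in every detail, and the data vectors `x` and `y` are made up purely for illustration.

```python
import math

# Hypothetical data: one explanatory variable x and a binary outcome y.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 1, 0, 1, 1, 0, 1]

def prob(b0, b1, xi):
    """P(Y = 1 | x) under the logit model with V = b0 + b1 * x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))

def log_likelihood(b0, b1):
    return sum(yi * math.log(prob(b0, b1, xi))
               + (1 - yi) * math.log(1.0 - prob(b0, b1, xi))
               for xi, yi in zip(x, y))

def score(b0, b1):
    """Gradient of the log likelihood with respect to (b0, b1)."""
    r = [yi - prob(b0, b1, xi) for xi, yi in zip(x, y)]
    return sum(r), sum(ri * xi for ri, xi in zip(r, x))

def newton_fit(iterations=25):
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0, g1 = score(b0, b1)
        w = [prob(b0, b1, xi) * (1.0 - prob(b0, b1, xi)) for xi in x]
        # Entries of the 2x2 information matrix (negative Hessian).
        a = sum(w)
        b = sum(wi * xi for wi, xi in zip(w, x))
        c = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a * c - b * b
        # Newton step: solve the 2x2 linear system explicitly.
        b0 += (c * g0 - b * g1) / det
        b1 += (a * g1 - b * g0) / det
    return b0, b1

b0_hat, b1_hat = newton_fit()
print(round(b0_hat, 3), round(b1_hat, 3), round(log_likelihood(b0_hat, b1_hat), 3))
```

At the fitted values the gradient of the log likelihood is essentially zero, which is the defining property of the maximum. With real data and several explanatory variables, a library routine such as R's glm is the practical choice.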


An Example of Logistic Regression

In “Causes and Effects of Discretionary Asset Write-Offs” by Francis, Hanna and Vincent (JAR 1996),

• The authors investigate the decision of firms to write-off assets by gathering a large sample of firms that made and did not make write-offs in the 1989-92 period.

• The authors propose a set of independent variables which might be expected to influence the write-off decision.

The set of explanatory variables are as follows:

1. ryear1 - cumulative abnormal return over the preceding year

2. ryear5 - cumulative abnormal return

3. mtbdif - firm’s industry-adjusted Book to Market ratio

4. mtbchg - change in firm’s B-t-M ratio

5. indmtb - change in industry’s B-t-M ratio

6. roa - change in roa

7. indroa - change in industry roa

8. history - number of yrs in which firm reported special negative items

9. indhis - mean value of history for all other firms in industry

10. dmgmt - 1 if management changed in previous year

11. poor - unexpected earnings (UE) if UE < 0, 0 otherwise

12. good - UE − $amt of write-off, if > 0, 0 otherwise

13. lnsale - log of sales in yr preceding write-off

Unless otherwise stated, all variables are averages of the preceding 5 years.

• They are interested in determining whether write-offs are the result of managers’ attempts to manipulate accounting performance or simply the result of declines in the value of assets.

• The article relates the predicted probability of write-off to changes in stock price, but we will focus on the problem of predicting write-offs on the basis of the financial performance of the firm and the industry in which the firm operates.


Two-Way ANOVA; General Model

Two-way ANOVA has two factors of interest that occur in all combinations.

• Sometimes there is only one observation on each combination.

• Sometimes there are more (called “replications”).

• The general model with replications has observations labeled as Y_ijk.

The index i (i = 1, . . . , I) labels the value of the first factor, j (j = 1, . . . , J ) labels the value of the second factor, and k labels the replicated observations on the combination ij.

• When we always have the same number of replications for each ij combination, so that k = 1, . . . , K, the model is called “balanced”.

• There are two types of models - the “additive model” and the “additive model with interactions”.

The second model needs K ≥ 2.


Additive Model

• The additive model decomposes the population means µ_ij = E(Y_ijk) as a sum of effects of the corresponding i and j factors:

$\mu_{ij} = \mu + \alpha_i + \beta_j$ where $\sum_i \alpha_i = 0$ and $\sum_j \beta_j = 0$.

• Without proper constraints on α_i and β_j, those unknown parameters are not identifiable.

• For the balanced model, µ, the αs and the βs can be estimated in the natural way (as in the one-way model) by

$\hat{\mu} = \bar{Y}_{\cdot\cdot\cdot}$, $\hat{\alpha}_i = \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot}$, $\hat{\beta}_j = \bar{Y}_{\cdot j\cdot} - \bar{Y}_{\cdot\cdot\cdot}$.

These are the natural estimates of the α_i and β_j. Thus

$\hat{\mu}_{ij} = \bar{Y}_{i\cdot\cdot} + \bar{Y}_{\cdot j\cdot} - \bar{Y}_{\cdot\cdot\cdot}$.

(For non-balanced models the corresponding formulas are more complicated, which is why we treat only balanced models.)
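The estimates for the balanced additive model are just row, column, and grand means. The sketch below, in plain Python, uses a small hypothetical layout with I = 2, J = 3, and K = 2 replicates per cell; the numbers carry no meaning beyond illustrating the arithmetic.

```python
# Balanced two-way layout: y[i][j] is the list of K replicates for cell (i, j).
# The numbers are hypothetical and only illustrate the arithmetic.
y = [
    [[10.0, 12.0], [14.0, 16.0], [20.0, 18.0]],
    [[11.0, 13.0], [17.0, 15.0], [22.0, 24.0]],
]
I, J, K = len(y), len(y[0]), len(y[0][0])

def mean(values):
    return sum(values) / len(values)

# Grand mean, row means (over j and k), and column means (over i and k).
grand_mean = mean([v for row in y for cell in row for v in cell])
row_means = [mean([v for cell in y[i] for v in cell]) for i in range(I)]
col_means = [mean([v for i in range(I) for v in y[i][j]]) for j in range(J)]

mu_hat = grand_mean
alpha_hat = [m - grand_mean for m in row_means]
beta_hat = [m - grand_mean for m in col_means]
# Fitted cell means under the additive model: mu + alpha_i + beta_j.
fitted = [[mu_hat + alpha_hat[i] + beta_hat[j] for j in range(J)] for i in range(I)]

print(mu_hat, alpha_hat, beta_hat)
```

By construction the estimated effects satisfy the sum-to-zero constraints, and each fitted cell mean equals the row mean plus the column mean minus the grand mean.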

• There is an overall null hypothesis to be tested:

H₀ : α₁ = · · · = α_I = 0 and β₁ = · · · = β_J = 0 versus H_a : H₀ is not true.


It includes two sub-hypotheses of interest:

H_0A : α₁ = · · · = α_I = 0 versus H_aA : H_0A is not true.

H_0B : β₁ = · · · = β_J = 0 versus H_aB : H_0B is not true.

The Data:

Hourly arrivals of telephone calls to a telephone call center

• The data in this example involve telephone calls to a relatively small Israeli Bank telephone call center in 1999.

• The caller desires to speak to a telephone service agent.

• The call center managers want to be able to predict the number of calls that will arrive in any given hour.

– The working day at this center runs from 7am to 11:59pm.

– Look at data for all the full work-weeks in November and December 1999.

– Divide each day up into hourly intervals;

from 7 − 8, 8 − 9, etc.

Label these intervals as i = 1, 2, . . . , 17.


Thus interval i = 2 corresponds to 8am to 9am, and I = 17.

– There are 5 regular working days each week (Sunday through Thursday in Israel).

Label these as j = 1, . . . , 5.

Thus j = 2 corresponds to a Monday and J = 5.

– Let N_ijk denote the number of calls arriving during hourly interval i, day-of-week j, and week k. Note that K = 8.

• Use ANOVA to summarize this data set.

• How could we model N_ijk?

– It is reasonable to conjecture that these arrival times are well modeled by an inhomogeneous Poisson process.

– The arrival rate for this process should depend only on time of day, and perhaps other calendar related covariates such as month or day of the week.

– Theory suggests that they may have a Poisson distribution with mean λ_ij.


– If the arrival process for a given call category is as above, then the number of arrivals each day within any given interval of time should be independent Poisson variables with a parameter that depends only on the given time interval.

If other covariates are involved (such as day of the week) then the Poisson parameter may also depend on these.

– A Poisson distribution can be approximated by a normal distribution when λ_ij is large.

Even so, the N_ijk would not be homoscedastic (since their variance would equal their mean).

– Anscombe's (1948, Biometrika) variance-stabilizing transformation suggests that the variables $\sqrt{N_{ijk}}$ might be nearly homoscedastic with variance 1/4.

– Consider $Y_{ijk} = \sqrt{N_{ijk} + 1/4}$.
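The effect of the square-root transformation can be seen in a small simulation. This is only a sketch: the Poisson means in `lambdas` and the sample size are arbitrary choices for illustration, not values from the call-center data.

```r
set.seed(2)
lambdas <- c(10, 30, 100, 300)    # hypothetical Poisson means
n_sim   <- 1e5

# Variance of the raw counts grows with the mean ...
raw_var  <- sapply(lambdas, function(l) var(rpois(n_sim, l)))
# ... while the variance of sqrt(N + 1/4) stays close to 1/4
root_var <- sapply(lambdas, function(l) var(sqrt(rpois(n_sim, l) + 1/4)))

round(rbind(lambda = lambdas, raw_var = raw_var, root_var = root_var), 3)
```

The `root_var` row should stay near 0.25 for every mean, whereas `raw_var` tracks `lambda`.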

• Goal of the Manager:

– The manager of the call center would like to be able to predict the number of customers who will call the center in any particular hour desiring to speak to an agent.

– Plan how many agents are needed at that time of day.

– (Other considerations also enter into this decision, such as the length of time that it takes an agent to serve a customer.)

• The plot of predicted values for Mondays (j = 2) tells the manager how to predict N for each hour of the day.

• This plot also tells the manager what the 95% prediction intervals are for that prediction.

• Note that the prediction limits are pretty wide. That is unfortunate, but it can't be helped if we only know about time-of-day and day-of-week.

• The 95% confidence intervals are also shown, but probably aren't as important to the manager.
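The difference between the two kinds of interval can be sketched in R with predict. The data below are simulated (the rate formula is a made-up assumption standing in for the real call counts), and squaring the limits and subtracting 1/4 is used here as a simple back-transformation to the scale of N.

```r
set.seed(3)
I <- 17; J <- 5; K <- 8
d <- expand.grid(hour = factor(7:23), day = factor(1:J), week = 1:K)
lambda <- 30 + 4 * as.integer(d$hour) + 2 * as.integer(d$day)  # hypothetical rates
d$Y <- sqrt(rpois(nrow(d), lambda) + 1/4)

fit <- lm(Y ~ hour + day, data = d)          # additive model on the root scale

# All 17 hourly intervals on day j = 2 (Monday)
monday <- data.frame(hour = factor(7:23, levels = 7:23),
                     day  = factor(2, levels = 1:J))
pred <- predict(fit, newdata = monday, interval = "prediction", level = 0.95)
conf <- predict(fit, newdata = monday, interval = "confidence", level = 0.95)

back <- function(y) y^2 - 1/4                # undo Y = sqrt(N + 1/4)
out <- data.frame(hour    = 7:23,
                  N_pred  = back(pred[, "fit"]),
                  conf_lo = back(conf[, "lwr"]), conf_hi = back(conf[, "upr"]),
                  pred_lo = back(pred[, "lwr"]), pred_hi = back(pred[, "upr"]))
round(out, 1)
```

The prediction limits are always wider than the confidence limits because they include the variability of a single future observation, not just the uncertainty in the estimated mean.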

ANOVA Table


The tests of these hypotheses are summarized in an ANOVA table.

Summary of Fit

Rsquare                  0.8127
Root Mean Square Error   0.7645
Mean of Response         7.889
Observations             680

Analysis of Variance

Source     DF                       SS        MS       F Ratio   Prob > F
Model      (I − 1) + (J − 1) = 20   1671.63   83.581   143.00    < .0001
Error      n − I − J + 1 = 659      385.17    0.585
C. Total   n − 1 = 679              2056.80

Effect Tests

Source        Npar.   DF           SS        F Ratio   Prob > F
Hour          16      I − 1 = 16   1658.40   177.34    < .0001
day of week   4       J − 1 = 4    13.23     5.66      0.0002

• Sum of Squares for “C. Total” is $SST = \sum_i \sum_j \sum_k (Y_{ijk} - \bar Y_{\cdot\cdot\cdot})^2$.

• Sum of Squares for “Error” is $SSE = \sum_i \sum_j \sum_k (Y_{ijk} - \hat\mu_{ij})^2$.

• In the model, we consider two factors: hour and day of the week.

SSModel = SSA + SSB.

[Figure: Residual by Predicted Plot. Vertical axis: residual of Y_ijk = √N_ijk; horizontal axis: predicted Y_ijk = √N_ijk.]

[Figure: Plot of Predicted Values of N for Mondays, as a function of Hour-of-Day. Vertical axis: No. of Customers/Hr; horizontal axis: time (7 to 23). Curves: Pred N_ij, Lower Conf N_ij, Upper Conf N_ij, Lower Pred N_ij, Upper Pred N_ij.]

– SSModel = SST − SSE.

– SS(Hour) is $SSA = \sum_j \sum_k \sum_i (\bar Y_{i\cdot\cdot} - \bar Y_{\cdot\cdot\cdot})^2$.

– SS(day of week) is $SSB = \sum_i \sum_k \sum_j (\bar Y_{\cdot j\cdot} - \bar Y_{\cdot\cdot\cdot})^2$.

• F-ratios:

– The first F-ratio (= 143 = MSM/MSE) tests H₀ and here has 20 and 659 df.

– The second F-ratio (= 177.34 = MSA/MSE) tests H_0A and here has 16 and 659 df.

– The third F-ratio (= 5.66 = MSB/MSE) tests H_0B and here has 4 and 659 df.
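The sums of squares and F-ratios can be reproduced by hand and compared with the output of anova. The sketch below uses simulated data with the same dimensions as the call-center example (I = 17, J = 5, K = 8); the rate formula and variable names are assumptions made only for illustration, so the numbers will not match the table above.

```r
set.seed(4)
I <- 17; J <- 5; K <- 8
d <- expand.grid(hour = factor(1:I), day = factor(1:J), week = 1:K)
lambda <- 30 + 4 * as.integer(d$hour) + 2 * as.integer(d$day)  # hypothetical rates
d$Y <- sqrt(rpois(nrow(d), lambda) + 1/4)
n <- nrow(d)                                  # 680 observations

grand  <- mean(d$Y)
m_hour <- ave(d$Y, d$hour)                    # hour mean, repeated per observation
m_day  <- ave(d$Y, d$day)                     # day mean, repeated per observation
fitted_add <- m_hour + m_day - grand          # additive-model fitted values

SST <- sum((d$Y - grand)^2)
SSA <- sum((m_hour - grand)^2)
SSB <- sum((m_day - grand)^2)
SSE <- sum((d$Y - fitted_add)^2)

MSE     <- SSE / (n - I - J + 1)
F_model <- ((SSA + SSB) / (I + J - 2)) / MSE  # tests H0,  df 20 and 659
F_A     <- (SSA / (I - 1)) / MSE              # tests H0A, df 16 and 659
F_B     <- (SSB / (J - 1)) / MSE              # tests H0B, df 4 and 659

tab <- anova(lm(Y ~ hour + day, data = d))
tab
c(F_A = F_A, F_B = F_B, F_model = F_model)
```

In this balanced layout the decomposition SST = SSA + SSB + SSE holds exactly, and `F_A` and `F_B` agree with the "F value" column of `tab`.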

• The analysis tells us that knowing the day of the week makes a statistically significant difference, but not a very important one; one could do almost as well just knowing the time of day. This means that, from the service point of view, there is not much point in the manager making different work schedules for different weekdays.

• CONCLUSION: All three null hypotheses are rejected, but the differences among the levels of factor B (day of week) are much less striking than those among the levels of factor A (hour).


Model Checking: Checking the residuals

• Each RESIDUAL is the value of the OBSERVATION minus ITS PREDICTION. Symbolically, $r_{ijk} = Y_{ijk} - \hat\mu_{ij}$.

• When the basic ANOVA analysis is complete, the residuals should then be examined for adherence to the basic assumptions:

– homoscedasticity

– normality

• Standard graphical procedures: residual plot and quantile plot

• Residual plot:

– Look at a plot of the residuals against the values of Y_ijk or against the predicted values $\hat\mu_{ij}$.

This provides a check for homoscedasticity. See the Residual by Predicted Plot.

– What you are looking for is that the vertical spreads of the data are approximately the same at each value of $\hat\mu_{ij}$. (You need to allow for random variation when evaluating such a plot. Our basic assumption is that such an equality of spreads holds in the population.)

– CONCLUSION: So far as we can tell, it appears that the populations are homoscedastic (apart from that one outlier).

• Normal probability plot

– Once it has been decided that the data are acceptably homoscedastic it becomes of interest to check whether the within ij group residuals are also acceptably near normality.

– R can give you the values of the residuals and then form a Normal Quantile Plot to check normality.

– The normal quantile plot of the residuals shows startlingly good agreement with normality, except for the one annoying outlier.

– Usually we’re satisfied with considerably less convincing agreement to normality.

– What we want to avoid are heavily skewed residual distributions.

These would suggest that the analysis is invalid, and that some corrective action (such as a transformation of the Y-variables) needs to be taken before re-analyzing the data.

Model Checking: Compare it to a model with interactions

• The ij population group means are modeled as

$$\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij},$$

where $\sum_i \alpha_i = 0$, $\sum_j \beta_j = 0$, $\sum_i \gamma_{ij} = 0$ for every j, and $\sum_j \gamma_{ij} = 0$ for every i.

• The γ_ij are the new part of this model.

They describe the difference between the µ_ij here and their value in the additive model, µ + α_i + β_j.

• This model imposes no restrictions on the µ_ij; they can take any value. Correspondingly, it turns out that

$$\hat\mu_{ij} = \hat\mu + \hat\alpha_i + \hat\beta_j + \hat\gamma_{ij} = \bar Y_{ij\cdot}.$$

Data Analysis

The tests of these hypotheses are summarized in an ANOVA table.


Summary of Fit

Root Mean Square Error   0.7617
Mean of Response         7.889
Observations             680

Analysis of Variance

Source     DF    SS        MS        F Ratio   Prob > F
Model      84    1711.59   20.3760   35.1193   < .0001
Error      595   345.21    0.5802
C. Total   679   2056.80

Effect Tests

Source               Npar.   DF   SS        F Ratio   Prob > F
Hour                 16      16   1658.40   177.34    < .0001
day of week          4       4    13.23     5.66      0.0002
Hour × day of week   64      64   39.96     1.08      0.3272

• The “new” part of this table is the line for Hour × day of week.

• This corresponds to a test of H_0AB: all γ_ij = 0 versus H_aAB: not H_0AB.

• It has df = (I − 1)(J − 1) = 64. (And, note that the “Model” now has df = IJ − 1 = 84, so that the Error df = n − IJ = 595, which is also not the same as in the additive model.)


• CONCLUSION: We DO NOT REJECT H_0AB. We conclude that there is no statistically significant evidence of interaction effects. So far as we can determine the additive model fits this situation as well as the model with interactions.
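The comparison of the additive model with the interaction model can be sketched in R by fitting both and passing them to anova. The data are simulated from an additive model (rates and names are illustrative assumptions), so the interaction test should usually come out non-significant here as well.

```r
set.seed(5)
I <- 17; J <- 5; K <- 8
d <- expand.grid(hour = factor(1:I), day = factor(1:J), week = 1:K)
lambda <- 30 + 4 * as.integer(d$hour) + 2 * as.integer(d$day)  # hypothetical rates
d$Y <- sqrt(rpois(nrow(d), lambda) + 1/4)

g_add <- lm(Y ~ hour + day, data = d)     # additive model
g_int <- lm(Y ~ hour * day, data = d)     # additive model with interactions

cmp <- anova(g_add, g_int)                # F-test of H0AB: all gamma_ij = 0
cmp

cmp$Df[2]            # (I - 1)(J - 1) = 64
df.residual(g_int)   # n - IJ = 595

# In the interaction model the fitted values are the cell means
cell_means <- ave(d$Y, d$hour, d$day)
all.equal(unname(fitted(g_int)), cell_means)
```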

• Profile Plots:

– Graphical method for the detection of interactions

– Plots of $\hat\mu_{ij}$ as a function of i for each j are a good way to see the interaction estimates. (Or plots for each i as a function of j.)

This produces a Profile Plot.

– If there were no interactions in the data then we would have $\hat\mu_{ij} = \hat\mu + \hat\alpha_i + \hat\beta_j$, so that the functions displayed on the profile plot would be exactly parallel.

– The degree to which they are not parallel indicates how much interaction there is in the data.
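A profile plot can be drawn with interaction.plot, and the "exactly parallel" property of the additive fit can be confirmed numerically. The sketch below again uses simulated data; the rate formula, axis labels and variable names are assumptions for illustration only.

```r
set.seed(6)
I <- 17; J <- 5; K <- 8
d <- expand.grid(hour = factor(1:I), day = factor(1:J), week = 1:K)
lambda <- 30 + 4 * as.integer(d$hour) + 2 * as.integer(d$day)  # hypothetical rates
d$Y <- sqrt(rpois(nrow(d), lambda) + 1/4)

# Profile plot of the cell means: one trace per day, hours on the x-axis
with(d, interaction.plot(x.factor = hour, trace.factor = day, response = Y,
                         xlab = "Hour interval i", ylab = "Mean of Y",
                         trace.label = "Day j"))

# Under the additive fit the profiles are exactly parallel:
# the gap between two days is the same at every hour
g_add   <- lm(Y ~ hour + day, data = d)
cell    <- expand.grid(hour = factor(1:I), day = factor(1:J))
fit_mat <- matrix(predict(g_add, newdata = cell), nrow = I)  # rows: hours, columns: days
gap     <- fit_mat[, 2] - fit_mat[, 1]
diff(range(gap))    # zero up to rounding error
```

The traces of the raw cell means wobble around this parallel pattern; how far they depart from it is the visual counterpart of the interaction sum of squares.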


Related R-commands

• interaction.plot

– Two-way Interaction Plot

– Plots the mean (or other summary) of the response for two-way combinations of factors, thereby illustrating possible interactions.

• lm

– Fitting Linear Models

– “lm” is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although “aov” may provide a more convenient interface for these).

– Models for lm are specified symbolically.

A typical model has the form “response ~ terms” where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response.


– A terms specification of the form first + second indicates all the terms in first together with all the terms in second with duplicates removed.

– A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second.

– The specification first*second indicates the cross of first and second. This is the same as first + second + first:second.
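These formula rules can be checked directly by asking R for the term labels of a formula. The variable names y, a and b below are placeholders chosen for illustration; no data are needed.

```r
# Expand each formula into its list of model terms
labels_star <- attr(terms(y ~ a * b), "term.labels")
labels_long <- attr(terms(y ~ a + b + a:b), "term.labels")
labels_dup  <- attr(terms(y ~ a + b + a), "term.labels")   # duplicate term is dropped

labels_star                          # "a" "b" "a:b"
identical(labels_star, labels_long)  # TRUE
labels_dup                           # "a" "b"
```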

• g = lm(calltransf ~ hour * day, data)

anova(g)

(For an ANOVA fit, hour and day should be stored as factors in data.)

– anova: It gives Anova Tables

– Compute analysis of variance (or deviance) tables for one or more fitted model objects.

• Diagnostic checks on the residuals

– qqnorm(residuals(g))

– plot(fitted(g), residuals(g), xlab = "Fitted", ylab = "Residuals", main = "Square root response")


Blocking Designs and ANOVA

Suppose we want to compare 4 treatments and have 20 patients available.

• In completely randomized designs (CRD), the treatments are assigned to the experimental units at random.

The one-way ANOVA can be used to analyze the resulting data to check on treatment effects.

– This design is appropriate when the units are homogeneous.

– We may suspect that the units are heterogeneous, but we can't describe the form the heterogeneity takes; for example, we may know that a group of patients are not identical but have no further information about them.

– In this case, it is still appropriate to use a CRD.

– The randomization will tend to spread the heterogeneity around to reduce bias.

– Under the null hypothesis, there is no link between a factor and the response.

In other words, the responses have been assigned to the units in a way that is unlinked to the factor. This corresponds to the randomization used in assigning the levels of the factor to the units. This is why the randomization is crucial: it allows us to make this argument.

If the difference in the response between levels of the factor seems too unlikely to have occurred by chance, we can reject the null hypothesis.
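The randomization argument can be sketched in R as a permutation test. Here 4 treatments are assigned at random to 20 patients (5 each), and the observed F-statistic is compared with its distribution under re-randomization. The responses are simulated with no treatment effect, and the number of permutations (999) and all names are arbitrary choices for illustration.

```r
set.seed(7)
treatments <- c("A", "B", "C", "D")

# Completely randomized design: shuffle 5 copies of each treatment over 20 patients
assignment <- sample(rep(treatments, each = 5))
table(assignment)

y <- rnorm(20)    # hypothetical responses, generated with no treatment effect

f_stat <- function(groups) anova(lm(y ~ factor(groups)))[1, "F value"]
f_obs  <- f_stat(assignment)

# Under the null hypothesis any re-assignment of labels was equally likely
f_perm <- replicate(999, f_stat(sample(assignment)))
p_perm <- mean(c(f_obs, f_perm) >= f_obs)   # permutation p-value
p_perm
```

A small `p_perm` would indicate that the observed difference between treatment groups is unlikely to have arisen from the random assignment alone.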

• When the experimental units are heterogeneous in a known way and can be arranged into blocks where the intrablock variation is ideally small but the interblock variation is large, a block design can be more efficient than a CRD.

– We might be able to divide the patients into 5 blocks of 4 patients each, where the patients in each block have some relevant similarity.

– We would then randomly assign the treatments within each block.


– Suppose we want to test 3 crop varieties on 5 fields.

Divide each field into 3 strips and randomly assign the crop variety.

– We prefer to have block size equal to the number of treatments.

If this is not done or not possible, an incomplete block design must be used.

– We have just motivated the randomized block design.

For the example we just described, we have one factor (or treatment) at 3 levels and one blocking variable at 5 levels.

We can use two-way ANOVA to check for interaction and check for a treatment effect.

We can also check the block effect but this is only useful for future reference.
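The crop example can be sketched in R: the varieties are randomized separately within each field (block), and the data are analyzed with field and variety as the two factors. The yields are simulated and the effect sizes are made up for illustration. Note that with one strip per variety per field there is one observation per cell, so this sketch fits only the additive model.

```r
set.seed(8)
varieties <- c("V1", "V2", "V3")
fields    <- paste0("Field", 1:5)

# Restricted randomization: an independent random order of varieties in each field
layout <- t(sapply(fields, function(f) sample(varieties)))
colnames(layout) <- paste0("Strip", 1:3)
layout

# One row per strip; simulated yields with hypothetical field and variety effects
d <- data.frame(field   = factor(rep(fields, each = 3)),
                variety = factor(as.vector(t(layout))))
variety_effect <- c(V1 = 0, V2 = 2, V3 = 4)
d$yield <- 50 + 3 * as.integer(d$field) +
  variety_effect[as.character(d$variety)] + rnorm(nrow(d))

tab <- anova(lm(yield ~ field + variety, data = d))
tab      # df: 4 for blocks, 2 for treatments, 8 for error
```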

• Notice that under the randomized block design the randomization used in assigning the treatments to the units is restricted relative to the full randomization used in the CRD.

– Blocking is a feature of the experimental units and restricts the randomized assignment of the treatments.

– This means that we cannot regain the degrees of freedom devoted to blocking even if the blocking effect turns out not to be significant.

– Only use blocking where there is some heterogeneity in the experimental units.

The decision to block is a matter of judgment prior to the experiment.

There is no guarantee that it will increase precision.

– How do we compare the efficiency of two designs?

• Latin Squares: These are useful when there are two blocking variables.

– Suppose, in a field used for agricultural experiments, the level of moisture may vary across the field in one direction and the fertility in another.

– In an industrial experiment, suppose we wish to compare 4 production methods (the treatment) - A, B, C, and D.


We have available 4 machines 1, 2, 3, and 4, and 4 operators, I, II, III, IV.

– A Latin square design is (rows: operators; columns: machines)

      1  2  3  4
I     A  B  C  D
II    B  D  A  C
III   C  A  D  B
IV    D  C  B  A

– Each treatment is assigned to each block once and only once.

The design and assignment of treatments and blocks should be random.

– Three-way ANOVA
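The Latin square above can be checked and randomized in R. The sketch below verifies the "once and only once" property, permutes rows and columns at random (which preserves that property), and fits the three-way additive ANOVA to simulated responses. The helper `is_latin`, the simulated `y` and the other names are hypothetical, introduced only for illustration.

```r
sq <- matrix(c("A", "B", "C", "D",
               "B", "D", "A", "C",
               "C", "A", "D", "B",
               "D", "C", "B", "A"),
             nrow = 4, byrow = TRUE,
             dimnames = list(operator = c("I", "II", "III", "IV"),
                             machine  = as.character(1:4)))

# TRUE if every symbol occurs exactly once in each row and each column
is_latin <- function(m) {
  k <- nrow(m)
  ncol(m) == k &&
    all(apply(m, 1, function(r) length(unique(r)) == k)) &&
    all(apply(m, 2, function(cl) length(unique(cl)) == k))
}
is_latin(sq)           # TRUE

# Randomize by permuting rows and columns; the result is still a Latin square
set.seed(9)
randomized <- sq[sample(4), sample(4)]
is_latin(randomized)   # TRUE

# Three-way additive ANOVA on simulated responses (no real effects)
d <- data.frame(operator = factor(rownames(randomized)[as.vector(row(randomized))]),
                machine  = factor(colnames(randomized)[as.vector(col(randomized))]),
                method   = factor(as.vector(randomized)))
d$y <- rnorm(16)
fit <- lm(y ~ operator + machine + method, data = d)
anova(fit)$Df          # 3 3 3 6
```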