### Financial Time Series I

### Topic 3: Regression Analysis and Correlation

### Hung Chen

### Department of Mathematics National Taiwan University

### 11/16/2002

OUTLINE

1. Association

– Scatter Plot

– Correlation Coefficient: linear association

– Interpreting Association

– Nonlinear Association

– Causation

– Distribution of Correlation Coefficient

2. Simple Linear Regression

– Statistical Relationship

– Model

– Method of Least Squares

– Properties of Least Squares Solution

– correlation

3. Association of Categorical Variables

– Frequency Tables

– Bivariate Categorical Data

– Test of Independence

– Measure of Association

Association Between Numerical Scales

• Scientific question: Does studying more help raise scores on the SAT?

– Data Collection: Record the number of hours spent studying for the SAT and the SAT scores for a sample of students.

– Do these data lead to the conclusion that “individuals with higher hours also have higher scores”?

• Scientific question: Is sodium intake related to systolic and diastolic blood pressure?

– Data Collection: Record the monthly sodium intakes for each individual in a sample and his/her blood pressure.

– Do individuals with higher sodium consumption also have higher blood pressure readings?

• Question: How do we assess the association of two numerical variables in statistics?

– scatter plot: a graphical technique

Scatter plots frequently depict information about the relationship between variables that is not indicated by a single summary statistic.

– correlation coefficient: a formal numerical index of linear relationship.

∗ How do we measure a curvilinear relationship?

– How reliable are these two tools?

Scatter plots

• Data: We have the language and non-language mental maturity scores (two kinds of “IQs”) of 23 school children.

How do we explore it?

– Let x denote language IQ and y denote non-language IQ where

x <- c(86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119, 82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102)

y <- c(44, 53, 42, 50, 65, 52, 37, 50, 46, 30, 37, 66, 41, 43, 74, 69, 44, 67, 43, 60, 47, 54, 43)

– These data are represented visually by making a graph on two axes, the horizontal x axis representing language IQ and the vertical y axis representing non-language IQ.

– Such a graph is called a scatter plot (or scatter diagram).

– Use the R command plot(x, y, xlab = "language IQ", ylab = "non-language IQ", main = "Scatter Plot").

Each point in such a plot represents one individual.

– When all of the observations are plotted, the diagram conveys information about the direction and magnitude of the association of x and y.

– The swarm of points goes in a southwest-northeast direction.

This indicates a positive or direct association of x and y.

Namely, individuals who have the lower y values are the same people who have the lower values on x; they form a cluster of points in the lower-left portion of the diagram.

– If the swarm of points lies in a northwest-southeast direction (i.e., upper left to lower right), there is a negative or inverse association of x with y.

• The strength or magnitude of the association is indicated by the degree to which the points are clustered together around a single line.

– If all of the points fall exactly on the line, there is a “perfect” association of the two variables. In this case, if we knew an individual’s value on variable x, we would be able to compute his/her value on y exactly.

– To the extent that the points in the diagram diverge from a straight line, the association is less than perfect.

– Since the scatter plot is a nonnumerical way of assessing association, adjectives are used to describe the strength of association.

– We may speak of a “strong” (or “moderate,” or even “weak”) association of x with y.

– Are such descriptions objective?

The Correlation Coefficient: A Measure of Linear Relationship

• correlation coefficient: $r = \dfrac{s_{xy}}{s_x s_y}$

– It is a statistic based on n pairs of measurements $(x_i, y_i)$ on two variables x and y.

– r is a measure of association between two quantitative variables.

– −1 ≤ r ≤ +1.

– The absolute value of r is exactly 1 for a perfect linear relationship, but lower if the points in the scatter plot diverge from a straight line.
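The notes do not spell out $s_{xy}$, $s_x$, and $s_y$. Assuming the usual definitions (sample covariance and sample standard deviations with divisor $n - 1$, which is an assumption rather than something stated above), the coefficient can be written out as follows; the divisor cancels, so the same r results with divisor n.

$$
s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y), \qquad
s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar x)^2}, \qquad
s_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar y)^2},
$$

$$
r = \frac{s_{xy}}{s_x s_y}
  = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}
         {\sqrt{\sum_{i=1}^{n}(x_i - \bar x)^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar y)^2}} .
$$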

• A descriptive statistic that indicates the degree of linear association of two numerical variates is the correlation coefficient, usually represented by the letter r.

Do you see why?

• The sign of r indicates the direction of association: it is positive for a direct association of x and y and negative for an inverse association.

We might ask

– Whether the selling prices of various models of automobiles are related to the numbers of each that are sold in a given period of time?

– Whether the amount of rainfall in a given agricultural area is related to the size of the crop yield?

– The observational units may not be people, but institutions, objects, or events.

• The examples above involve two numerical variables, but questions about categorical variables are equally common; these will be discussed later on.

• A correlation close to zero, whether positive or negative, indicates little or no linear association between x and y.

• The question “what is a strong correlation?” has several answers.

Statisticians would generally refer to a correlation close to zero as indicating “no correlation”; a correlation between 0 and 0.3 as “weak”; a correlation between 0.3 and 0.6 as “moderate”; a correlation between 0.6 and 1.0 as “strong”; and a correlation of 1.0 as “perfect.”

– Two laboratory technicians counting impurities in the same water samples should have very high agreement; the correlation between their counts may be 0.95 or better.

– Two different human characteristics rarely have such a high correlation.

The correlation of height and weight is generally in the neighborhood of 0.8; of scores on the Scholastic Assessment Test with college freshman grade average about 0.6; of measured intelligence with socioeconomic status about 0.4; of heart rate with blood pressure about 0.2.

– Example: consider the data collected by Nanji and French (1985) to examine the relationship of alcohol consumption with mortality due to cirrhosis of the liver in the 10 Canadian provinces. In this case, variable x is an index of the amount of alcohol consumed in the province in one year (1978) and variable y is the number of individuals who died from cirrhosis of the liver per 100,000 residents.

r is equal to 0.51.

Calculating Correlation Coefficient

• help.search("correlation")

• The software responds with “Help files with alias or title matching ‘correlation’, type ‘help(FOO, package = PKG)’ to inspect entry ‘FOO(PKG) TITLE’:”

– cor(base) Correlation, Variance and Covariance (Matrices)

– corr(boot) Correlation Coefficient

– acf(ts) Auto- and Cross- Covariance and -Correlation Function Estimation

– plot.acf(ts) Plotting Autocovariance and Autocorrelation Functions

• We try the first two.

• Consider the example on language and non-language IQ scores.

• Use R base package, cor(x, y) leads to 0.7689431.

• Use R boot package, corr(cbind(x, y)) leads to 0.7689431.

– From Packages, load the package boot.

– Look at help manual by command help(corr).

– corr accepts only a matrix, so we need to form a matrix from x and y, for example with cbind(x, y).
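As a check on the value reported above, the following minimal R sketch computes r for the IQ data twice: once from the formula $r = s_{xy}/(s_x s_y)$ (assuming the divisor $n - 1$ for the sample covariance and variances) and once with the base function cor. The variable names r_manual and r_builtin are illustrative only.

```r
# Language (x) and non-language (y) IQ scores of the 23 children
x <- c(86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
       82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102)
y <- c(44, 53, 42, 50, 65, 52, 37, 50, 46, 30, 37, 66,
       41, 43, 74, 69, 44, 67, 43, 60, 47, 54, 43)

# Sample covariance and standard deviations (divisor n - 1, an assumption)
n    <- length(x)
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
s_x  <- sqrt(sum((x - mean(x))^2) / (n - 1))
s_y  <- sqrt(sum((y - mean(y))^2) / (n - 1))

r_manual  <- s_xy / (s_x * s_y)  # r = s_xy / (s_x * s_y)
r_builtin <- cor(x, y)           # Pearson correlation from base R

print(c(manual = r_manual, builtin = r_builtin))
```

Both numbers should agree with the 0.7689431 quoted in the notes.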

Test and Confidence Interval on r

• Efron (1982) analyzes data on law school admission, with the object being to examine the correlation between the LSAT (Law School Admission Test) score and the first-year GPA.

– For each of 15 law schools, we have the pair of data points (average LSAT, average GPA):

(576, 3.39), (635, 3.30), (558, 2.81), (578, 3.03), (666, 3.44), (580, 3.07), (555, 3.00), (661, 3.43), (651, 3.36), (605, 3.13), (653, 3.12), (575, 2.74), (545, 2.76), (572, 2.88), (594, 2.96)

– Let (X, Y) be bivariate normal with correlation coefficient ρ, and let r be the sample correlation. It can be shown that

$$\sqrt{n}\,(r - \rho) \to N\left(0, (1 - \rho^2)^2\right).$$

– The above result can be used to derive confidence intervals and carry out hypothesis tests when the data are normally distributed.

– Fisher suggested considering the z-transformation $\frac{1}{2}\log\frac{1 + x}{1 - x}$. Then we have

$$\sqrt{n}\,\frac{1}{2}\left(\log\frac{1 + r}{1 - r} - \log\frac{1 + \rho}{1 - \rho}\right) \to N(0, 1).$$
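The z-transformation result can be turned into an approximate confidence interval for ρ. The sketch below is one way to do this for the law school data, under the bivariate normal assumption and using the $\sqrt{n}$ scaling exactly as stated above; many texts use $\sqrt{n - 3}$ instead as a small-sample refinement. The helper fisher_z and the 95% level are illustrative choices, not part of the notes.

```r
# Law school data: (average LSAT, average GPA) for 15 schools
lsat <- c(576, 635, 558, 578, 666, 580, 555, 661, 651, 605,
          653, 575, 545, 572, 594)
gpa  <- c(3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13,
          3.12, 2.74, 2.76, 2.88, 2.96)

n <- length(lsat)
r <- cor(lsat, gpa)

# Fisher's z-transformation (hypothetical helper); identical to atanh(r)
fisher_z <- function(r) 0.5 * log((1 + r) / (1 - r))

# Approximate 95% interval on the z scale: z(r) +/- 1.96 / sqrt(n)
z          <- fisher_z(r)
half_width <- qnorm(0.975) / sqrt(n)
ci_z       <- c(z - half_width, z + half_width)

# Map the interval back to the correlation scale with the inverse transform
ci_rho <- tanh(ci_z)

# Approximate two-sided test of H0: rho = 0
z_stat  <- sqrt(n) * (z - fisher_z(0))
p_value <- 2 * pnorm(-abs(z_stat))

print(list(r = r, ci_rho = ci_rho, z_stat = z_stat, p_value = p_value))
```

Working on the z scale keeps the interval inside (−1, 1) after back-transformation, which an interval built directly on r need not do.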

Interpreting Association

• The correlation coefficient is useful for summarizing the direction and magnitude of association between two variables. It is a widely used statistic, cited frequently in both scientific and popular reports.

• Limitations to the meaning of any particular correlation:

– The correlation coefficient reveals only the straight line (linear) association between x and y.

– r may be close to 0 even when the scatter plot reveals important curvilinear patterns.

In this case, the scatter plots reveal that a straight line is not the whole story of the relationship of x to y.

This suggests that a scatter plot should always be examined when a correlation coefficient is to be computed or interpreted.

– Nonlinear association: Pairs of variables are often associated in a clear pattern, but not conforming to a straight line.
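A small constructed example (not data from the notes) makes the point concrete: y below is an exact function of x, yet the correlation coefficient is zero because the relationship is symmetric and curved rather than linear.

```r
# Constructed example: y is completely determined by x, but not linearly
x <- -10:10
y <- x^2

r <- cor(x, y)
print(r)  # essentially zero, up to floating-point rounding
```

A scatter plot of these points, plot(x, y), shows the parabola immediately, which is why the plot should accompany the coefficient.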

Nonlinear Association

Quite commonly, we will see scatter plots with the following characteristics.

• (asymptote) Consider the following two cases.

– Suppose that the observations are patients who experience pain; variable x might be the amount of aspirin taken orally, in milligrams, and variable y the amount that is absorbed into the bloodstream.

Beyond a certain point, additional amounts of ingested aspirin are no longer absorbed, and the amount found in the bloodstream reaches a plateau.

– In research on human memory, it has been found that initial practice trials are very helpful in increasing the amount of material memorized; after a certain number of trials, however, each additional attempt to memorize has only a small added benefit.

Consider students who are learning a foreign language: what is the relationship between the number of 25-minute periods devoted to studying vocabulary and the number of words memorized?

• (N shape) Examine the effects of advertising on sales.

– If observations are branches of a large chain of stores with independent control over expenditures, variable x might be advertising outlays and y the total sales volume in a given period of time.

– What will the plot look like?

– Suppose initial advertising outlays have a substantial effect on sales, additional advertising outlays have little additional impact on sales, but expensive “saturation” advertising again gives a significant boost to sales.

• (concave upward) Consider the amount of water provided to agricultural plots and the proportion of plants on the plot that do not grow to a given size.

This pattern would suggest that too much as well as too little water is harmful, while moderate amounts of water minimize the plant loss.

• (concave downward) In examining the responses of humans and animals to various kinds of physiological stimulation, little or no stimulation produces little or no response, while moderate amounts of stimulation produce maximal response.

Levels of stimulation that exceed the individual’s ability to process the input, however, can result in a partial or complete suppression of the response.

Causation

• Even a strong correlation between two variables does not imply that one causes the other.

When a statistical analysis reveals association between two variables it is generally desirable to know more about the association.

– Does it persist under different conditions?

– Does one factor “cause” the other?

– Is there a third factor that causes both?

– Is there another link in the chain, a factor influenced by one variable and in turn influencing the other?

• The correlation indicates only that certain pairings of values on x and y occur more frequently than other combinations.

• Example: If the x and y variables were “years on the job” and “job satisfaction ratings” for a sample of employees, the positive correlation between them might indicate that

– if employees hold their jobs for more years, the work seems to become more satisfying, or

– if employees are more satisfied, they keep their jobs longer.

That is, the causal connection may go in either direction and may also be affected by other intermediary mechanisms.

It may be that

– more senior employees are given subtle or overt rewards that in turn enhance their satisfaction. Thus, the distribution of rewards (referred to as an intervening variable) explains the association of years with satisfaction; the correlation itself does not imply a direct cause-and-effect relationship.

• Association between two variables may also occur because x and y are both consequences of some third variable that has not been observed. This is seen in the following illustration:

• Example: Do Storks Bring Babies?

In Scandinavian countries a positive association between the number of storks living in the area and the number of babies born in the area was noticed.

Do storks bring the babies? Without shattering the illusions of the incurably romantic, we may suggest the following:

Districts with large populations have a large number of births and also have many buildings, on the chimneys of which storks can nest.

Consider the diagram representing the idea that the population factor explains both the number of births and the frequency with which storks are sighted.

Large population → Many babies born

Large population → Many buildings → Many storks

The three variables to study are populations of districts, numbers of births in districts, and numbers of storks seen in the districts.

Simple Regression Analysis

• Find the statistical relationship between two quantitative variables.

• Statistical data are often used to answer questions about relationships between variables.

– How do we summarize the relationship or association between 2 or among 3 or more variables?

– How do we find the association among variables measured on numerical scales?

How about categorical variables?

• Functional Relationship: A variable y is said to be a function of a variable x if to any value of x there corresponds one and only one value of y.

– We symbolize a functional relationship by writing y = f (x), where f represents the function.

– The variable x is called the independent variable; the variable y is called the dependent variable because it is considered to depend on x.

– If x is the height from which a ball is dropped and y is the time the ball takes to fall to the ground, then y is functionally related to x because the law of gravity determines y in terms of x.

• Statistical relationship: When one variable is used to predict or “explain” values of the second variable, we allow some imperfection in the prediction.

– How do we express statistical relationships?

– What kind of methods can be used for estimating those relationships from a sample of data?

– How do we measure and interpret variability around the predicted or explained values?

Statistical Relationship

• The relationship between x and y is not an exact, mathematical relationship, but rather several y values corresponding to a given x value scatter about a value that depends on the x value.

• For example, although not all persons of the same height have exactly the same weight, their weights bear some relation to that height.

– On the average, people who are 6 feet tall are heavier than those who are 5 feet tall; the mean weight in the population of 6-footers exceeds the mean weight in the population of 5-footers.

• The relationship between height and weight is modeled statistically as follows:

– For every value of x there is a corresponding population of y values.

Denote its distribution by $F_Y(y \mid x)$, where $F_Y(\cdot \mid x)$ is a distribution function.

– The population mean of y for a particular value of x is denoted by µ(x). Note that $\mu(x) = E(Y \mid X = x)$.

– As a function of x, µ(x) is called the regression function.

– The population of y values at a particular x value also has a variance, denoted $\sigma^2$; the usual assumption is that the variance is the same for all values of x.

homoscedastic: $\mathrm{Var}(Y \mid X = x) = \sigma^2$

heteroscedastic: $\mathrm{Var}(Y \mid X = x)$ depends on x

• For many variables encountered in statistical research, the regression function is a linear function of x, and thus may be written as $\mu(x) = \alpha + \beta x$.

– The quantities α and β are parameters that define the relationship between x and µ(x).

– Write $Y = \alpha + \beta x + \epsilon$ with $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$.

– In conducting a regression analysis, we use a sample of data, $(x_i, y_i)$, to estimate the values of these parameters so that we can understand this relationship.

• The focus of regression analysis is on making inferences about α, β, and $\sigma^2$.

– Estimate the magnitude of these parameters and test hypotheses about them.

– Consider the hypothesis $H_0 : \beta = 0$.

If this null hypothesis is true then $\mu(x) = \alpha + 0 \times x = \alpha$, the same number for all values of x.

This means that the mean of y does not depend on x, that is, there is no (linear) statistical relationship between x and y.

If $H_0$ is rejected, then the existence of a statistical relationship between x and y is supported by the data.

• The data required for regression analysis are observations on the pair of variables (x, y).

– Variable x may be uncontrolled or “naturally occurring,” as in the case of observing a sample of n individuals with their heights x (random design) and their weights y, or it may be controlled, as in an experiment in which persons are trained as data processors for different lengths of time x (fixed design), and one measures the accuracy of their work y.

• Examples:

– “Does studying more help raise scores on the Scholastic Assessment Tests (SAT)?”

This question could also be worded as follows: If we recorded the number of hours spent studying for the SAT and the SAT scores for a sample of students, do individuals with higher “hours” also have higher SAT scores and individuals with lower hours have lower SAT scores?

– We might ask if high sodium intake in one’s diet is associated with elevated blood pressure.

The question could be worded as follows: If we recorded the monthly sodium intake for each individual in a sample and his/her blood pressure, do individuals with higher sodium consumption also have higher blood pressure readings while those with lower sodium intakes have the lower blood pressure readings?

– The term regression stems from the work of Sir Francis Galton (1822-1911), a famous geneticist, who studied the sizes of seeds and their offspring and the heights of fathers and their sons.

In both cases, he found that

∗ The offspring of parents of larger than average size tended to be smaller than their parents.

∗ The offspring of parents of smaller than average size tended to be larger than their parents.

∗ He called this phenomenon “regression toward mediocrity.”

Least-Squares Estimates

• The data consist of n pairs of numbers $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

• To explore their association, they can be plotted as a scatter plot.

• How do we find a “best fit” line for two variables x and y?

$$y_i = \hat\alpha + \hat\beta x_i + r_i, \qquad \hat y_i = \hat\alpha + \hat\beta x_i,$$

where $r_i = y_i - \hat y_i$ is called the residual.

Is $r_i$ close to the unobserved $\epsilon_i$?

• In regression analysis we seek to determine the equation of the line that gives $\hat y_i$ values as close as possible to the data values $y_i$.

– How do we determine $\hat\alpha$ and $\hat\beta$, estimates of α and β, respectively?

– How do we obtain confidence intervals and test hypotheses about the parameters of interest?

• If we agree to choose, among all lines, the line of best fit as the one whose $\hat y_i$ values are as close as possible to the data values $y_i$, the question becomes how to choose a y intercept a and a slope b such that the deviations $y_i - \hat y_i$, with $\hat y_i = a + b x_i$, are as small as possible.

One approach is to find a and b to minimize the criterion

$$\sum_{i=1}^{n} (y_i - \hat y_i)^2.$$

• The principle of least squares leads to

$$\bar y = \hat\alpha + \hat\beta \bar x, \qquad \hat\beta = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2}.$$
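The step from the least squares criterion to these two formulas is not shown in the notes. A standard derivation, sketched here with $S(a, b)$ as a label introduced only for this derivation, sets the partial derivatives of the criterion to zero:

$$
S(a, b) = \sum_{i=1}^{n} (y_i - a - b x_i)^2,
$$

$$
\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0
\;\Longrightarrow\; \bar y = a + b \bar x,
$$

$$
\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0 .
$$

Substituting $a = \bar y - b \bar x$ into the second equation gives $\sum_i x_i \left[(y_i - \bar y) - b (x_i - \bar x)\right] = 0$. Because $\sum_i (y_i - \bar y) = 0$ and $\sum_i (x_i - \bar x) = 0$, each $x_i$ in front of the bracket can be replaced by $x_i - \bar x$, so that

$$
\hat\beta = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2},
\qquad
\hat\alpha = \bar y - \hat\beta \bar x .
$$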

Example 1. One of the questions about peaceful uses of atomic energy is the possibility that radioactive contamination poses health hazards.

• Since World War II, plutonium has been produced at the Hanford, Washington, facility of the Atomic Energy Commission.

• Over the years, appreciable quantities of radioactive wastes have leaked from their open-pit storage areas into the nearby Columbia River, which flows through parts of Oregon to the Pacific.

• To assess the consequences of this contamination on human health, investigators calculated, for each of the nine Oregon counties having frontage on either the Columbia River or the Pacific Ocean, an “index of exposure.”

– This index of exposure was based on several factors, including distance from Hanford and average distance of the population from water frontage.

– The cancer mortality rate, cancer mortality per 100,000 person-years (1959-1964), was also determined for each of these nine counties. They are Clatsop, Columbia, Gilliam, Hood River, Morrow, Portland, Sherman, Umatilla, and Wasco.

– Data, as pairs (index of exposure, cancer mortality rate): (8.34, 210.3), (6.41, 177.9), (3.41, 129.9), (3.83, 162.3), (2.57, 130.1), (11.64, 207.5), (1.25, 113.5), (2.49, 147.1), (1.62, 137.5)
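The least squares formulas can be applied directly to these nine pairs. The R sketch below computes the slope and intercept by hand and compares them with lm. It assumes the fourth mortality value is 162.3, reading the printed “1623” as a dropped decimal point; the variable names are illustrative.

```r
# Index of exposure and cancer mortality (per 100,000 person-years)
# for the nine Oregon counties, in the order listed in the notes.
# Assumption: the fourth mortality value is 162.3 (printed as "1623").
exposure  <- c(8.34, 6.41, 3.41, 3.83, 2.57, 11.64, 1.25, 2.49, 1.62)
mortality <- c(210.3, 177.9, 129.9, 162.3, 130.1, 207.5, 113.5, 147.1, 137.5)

# Least squares estimates from the closed-form expressions
s_xy <- sum((exposure - mean(exposure)) * (mortality - mean(mortality)))
s_xx <- sum((exposure - mean(exposure))^2)
beta_hat  <- s_xy / s_xx
alpha_hat <- mean(mortality) - beta_hat * mean(exposure)

# The same fit with R's linear model function
fit <- lm(mortality ~ exposure)

print(c(alpha_hat = alpha_hat, beta_hat = beta_hat))
print(coef(fit))
```

The numerator s_xy evaluates to about 900.13, the quantity that appears later in the notes when the correlation for these data is computed.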

Simple Regression Model

• A simple regression model for a set of pairs $(X_i, Y_i)$, $1 \le i \le n$, of points with dependent or response variable $Y = (Y_1, \ldots, Y_n)^T$ and explanatory variable $X = (X_1, \ldots, X_n)^T$ is the following:

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i,$$

where $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$ is a vector of independent, identically distributed random variables with $\epsilon_i \sim N(0, \sigma^2)$, the normal distribution with mean zero and variance $\sigma^2$.

• The assumptions on the random variables $\epsilon_i$ can be relaxed to just uncorrelated, i.e. $E[\epsilon_i \epsilon_j] = 0$ for $i \ne j$, but not necessarily independent.

• We can never separate the random variables $\epsilon_i$ from the observations $Y_i$ since we don’t know the value of $\epsilon_i$, which means that we can never know the true values of $(\beta_0, \beta_1)$.

• The method of least squares leads us to choose a “best” $(\hat\beta_0, \hat\beta_1)$. This is where the sum of squared errors, or SSE, of the model comes in. We measure the goodness of a linear regression model by its squared error

$$SSE(b_0, b_1) = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2.$$

The least squares regression model is the model with the smallest SSE.

• The least squares solutions $(\hat\beta_0, \hat\beta_1)$ aren’t equal to the true values of $(\beta_0, \beta_1)$ but they do have the property that

$$E[\hat\beta_0] = \beta_0, \qquad E[\hat\beta_1] = \beta_1.$$

That is, if the data do come from some simple linear regression model, the least squares solutions are unbiased estimates of the true values $(\beta_0, \beta_1)$.

They also have the property that, among all unbiased estimates of $(\beta_0, \beta_1)$ that are linear functions of the response variable, the least squares solutions have the smallest variance; this is what the classic Gauss-Markov Theorem states.
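Unbiasedness can be illustrated, though not proved, by simulation. In the sketch below the true parameters, the design points, and the number of replications are all hypothetical choices made for illustration; the average of the least squares estimates over many simulated data sets should land close to the true values.

```r
set.seed(1)

# Hypothetical true parameters and fixed design points
beta0 <- 1
beta1 <- 2
sigma <- 1
x <- seq(0.1, 5, length.out = 50)

# Simulate many data sets from Y = beta0 + beta1 * x + eps, eps ~ N(0, sigma^2),
# and record the least squares estimates from each one
n_rep <- 2000
estimates <- replicate(n_rep, {
  y <- beta0 + beta1 * x + rnorm(length(x), mean = 0, sd = sigma)
  coef(lm(y ~ x))
})

# Row 1 holds intercept estimates, row 2 holds slope estimates
avg <- unname(rowMeans(estimates))
print(c(mean_intercept = avg[1], mean_slope = avg[2]))
```

Any single fit differs from (1, 2); it is only the average over repeated samples that approaches the true values.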

Properties of Least Squares Solution

• To summarize, we have made the following decomposition of each observation using the fitted and residual values

$$Y = \hat Y + r,$$

where $\hat Y$ is computed from X and corr(r, X) = 0.

• Under the standard assumption that $\mathrm{Var}(Y_i) = \sigma^2$ and $\mathrm{Cov}(Y_i, Y_j) = 0$ where $i \ne j$, we have

$$\mathrm{Var}(\hat\beta_0) = \frac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2},$$

$$\mathrm{Var}(\hat\beta_1) = \frac{n \sigma^2}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2},$$

$$\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = \frac{-\sigma^2 \sum_{i=1}^{n} x_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}.$$

• Define the residual sum of squares (RSS) to be

$$RSS = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2,$$

where $\hat\beta_0$ and $\hat\beta_1$ are the least squares solutions.

Let $s^2 = RSS/(n - 2)$, which is an unbiased estimate of $\sigma^2$.

• If the errors $\epsilon_i$ are independent normal random variables, then the estimated slope and intercept, being linear combinations of independent normal random variables, are normally distributed as well.

– Under the normality assumption, it can be shown that

$$\frac{\hat\beta_i - \beta_i}{s_{\hat\beta_i}} \sim t_{n-2}, \qquad i = 0, 1,$$

where $s_{\hat\beta_i}$ is the estimated standard error of $\hat\beta_i$, obtained by replacing $\sigma^2$ with $s^2$ in the variance formulas above.

This result makes possible the construction of confidence intervals and hypothesis tests.

– If the $\epsilon_i$ are independent and the $x_i$ satisfy certain assumptions, a version of the central limit theorem implies that, for large n, the estimated slope and intercept are approximately normally distributed.

The above t-distributions can then still be used as approximations.
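The variance formulas and the t result can be checked numerically. The sketch below, for the cancer mortality data of Example 1 (again assuming the fourth mortality value is 162.3), computes $s^2$, the standard errors, and a 95% confidence interval for the slope by hand, and compares them with what summary and confint report for the lm fit. The 95% level and the variable names are illustrative.

```r
# Cancer mortality data from Example 1 (fourth mortality value assumed 162.3)
exposure  <- c(8.34, 6.41, 3.41, 3.83, 2.57, 11.64, 1.25, 2.49, 1.62)
mortality <- c(210.3, 177.9, 129.9, 162.3, 130.1, 207.5, 113.5, 147.1, 137.5)
n <- length(exposure)

fit <- lm(mortality ~ exposure)

# Unbiased estimate of sigma^2 from the residual sum of squares
rss <- sum(residuals(fit)^2)
s2  <- rss / (n - 2)

# Standard errors: the variance formulas with sigma^2 replaced by s^2
denom <- n * sum(exposure^2) - sum(exposure)^2
se_b0 <- sqrt(s2 * sum(exposure^2) / denom)
se_b1 <- sqrt(n * s2 / denom)

# 95% confidence interval and t statistic for the slope, using t with n - 2 df
b1     <- coef(fit)[["exposure"]]
t_crit <- qt(0.975, df = n - 2)
ci_b1  <- b1 + c(-1, 1) * t_crit * se_b1
t_b1   <- b1 / se_b1   # tests H0: beta_1 = 0

print(c(se_b0 = se_b0, se_b1 = se_b1, t_b1 = t_b1))
print(ci_b1)
print(summary(fit)$coefficients)
```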

• Since the absolute value of r cannot exceed 1, the squared correlation has possible values from 0 to 1.

• If r = 0, then x is of no help in predicting y.

• If r = ±1 (so that $r^2 = 1$), then x predicts y exactly.

This can only occur if each point $(x_i, y_i)$ falls exactly on the regression line; all of the variation in y can be explained by variability in x.

• In between these extremes, a weak correlation (e.g., in the range 0 to 0.3) is accompanied by a small proportion of explained variation (0 to 0.09), while a strong correlation (e.g., between 0.6 and 1.0) is accompanied by a substantially larger proportion of explained variation (0.36 to 1.0).

• Both r and $r^2$ are useful measures of association for regression analysis.

The correlation tells the direction of association and its square tells the extent to which y is predictable from x.

For the cancer mortality data, the correlation coefficient is r = 900.13/... = 0.926.

This value is very high, and the proportion of variation in mortality explained by the index of radioactive exposure is also high, $0.926^2 = 0.857$ (that is, about 86%).
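The link between $r^2$ and explained variation can be verified on the same data. This sketch (fourth mortality value assumed to be 162.3, names illustrative) computes the proportion of explained variation as 1 − RSS/TSS and compares it with the squared correlation.

```r
# Cancer mortality data from Example 1 (fourth mortality value assumed 162.3)
exposure  <- c(8.34, 6.41, 3.41, 3.83, 2.57, 11.64, 1.25, 2.49, 1.62)
mortality <- c(210.3, 177.9, 129.9, 162.3, 130.1, 207.5, 113.5, 147.1, 137.5)

fit <- lm(mortality ~ exposure)
r   <- cor(exposure, mortality)

# Proportion of variation in y explained by the fitted line
rss <- sum(residuals(fit)^2)                  # unexplained variation
tss <- sum((mortality - mean(mortality))^2)   # total variation
r2  <- 1 - rss / tss

print(c(r = r, r_squared = r^2, explained = r2))
```

In simple linear regression the two quantities coincide, which is what justifies reading $r^2$ as a proportion of explained variation.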

Data Analysis

• In R, “lm” is used to fit linear models.

– It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although ‘aov’ may provide a more convenient interface for these).

– Usage: lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset = NULL, ...)

– Models for ‘lm’ are specified symbolically.

A typical model has the form ‘response ~ terms’ where ‘response’ is the (numeric) response vector and ‘terms’ is a series of terms which specifies a linear predictor for ‘response’.

A terms specification of the form ‘first + second’ indicates all the terms in ‘first’ together with all the terms in ‘second’ with duplicates removed.

A specification of the form ‘first:second’ indicates the set of terms obtained by taking the interactions of all terms in ‘first’ with all terms in ‘second’.

The specification ‘first*second’ indicates the cross of ‘first’ and ‘second’. This is the same as ‘first + second + first:second’.

– Additive model and multiplicative model

• Example: Consider Plant Weight Data in Annette Dobson (1990) “An Introduction to Generalized Linear Models.”

– Data input:

ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)

trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)

group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))

weight <- c(ctl, trt)

– Regress weight on group.

lm(weight ~ group) gives 5.032 as the estimate of $\beta_0$ and −0.371 as the estimate of $\beta_1$. Here $\beta_1$ refers to the group effect.

– Carry out an analysis of variance on the proposed model.

anova(lm(weight ~ group)) leads to the following table.

Analysis of Variance Table

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| group | 1 | 0.6882 | 0.6882 | 1.4191 | 0.249 |
| Residuals | 18 | 8.7293 | 0.4850 | | |
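The commands above can be collected into one runnable script. The sketch below reproduces the fit and the ANOVA table, and also checks how the coefficients should be read under R's default treatment coding for factors (this coding is R's default, not something stated in the notes): the intercept is the control-group mean and the group coefficient is the difference between the treatment and control means.

```r
# Plant weight data (Dobson, 1990)
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group  <- gl(2, 10, 20, labels = c("Ctl", "Trt"))  # 10 "Ctl" then 10 "Trt"
weight <- c(ctl, trt)

# Regress weight on the two-level factor
fit <- lm(weight ~ group)
print(coef(fit))

# Interpretation under the default treatment coding
b0 <- coef(fit)[["(Intercept)"]]  # mean of the control group
b1 <- coef(fit)[["groupTrt"]]     # treatment mean minus control mean

# Analysis of variance for the fitted model
aov_table <- anova(fit)
print(aov_table)
```

With a p-value of about 0.249 for the group term, these data give no strong evidence of a difference between the two groups.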

The Association Among Categorical Variables

• “Is inoculation with a polio vaccine related to the occurrence of paralytic polio?”

– Data may be collected on a sample of observations and we ask whether a particular value on one variable (e.g., “vaccinated”) co-occurs with a particular value on the other (“polio absent”).

– Both of these variables are simple yes/no dichotomies.

• How do we measure the association among categorical variables?

– The idea of association is basically the same as in quantitative variables, that is, do certain values of one variable tend to occur more frequently with certain values of another?

– The values of categorical variables, however, may not have a range from lower to higher quantities, but may represent qualitatively distinct conditions, such as a condition being “present” or “absent.”

– Summary statistics such as the mean and standard deviation are not applicable.

Instead, observations are simply counted; for example, how many observations have both condition A and condition B present?

– Counts are entered into a frequency table.

• As with numerical variables, an association of two categorical variables does not imply that one causes the other.

– The direction of causation may be reciprocal, with each variable affecting the other.

– Causation may be mediated by one or more intervening third variable(s).

– Both variables may be consequences of additional variables not included in the data, causing the two outcomes to occur together.

– To understand these effects more completely, it is often necessary to examine the association of three or more variables.

The tabulation of data for three categorical variables results in a three-way table; the tabulation for two variables is a two-way table.

Bivariate Categorical Data

• Bivariate categorical data result from the observation of two categorical variables for each individual.

• A variable with two categories, such as male/female or graduate/undergraduate, is called a dichotomous variable.

A pair of dichotomous variables is called a double dichotomy.

Because each variable has just two categories, the table that results is a two-by-two (2 × 2) frequency table.

• Double dichotomies arise, for example, when each of a number of persons is asked a pair of yes-no questions.

– Company X asks each of 600 men whether or not they use Brand X razors and whether or not they use Brand X blades.

– A physician may classify patients according to whether or not they have been inoculated against a disease and whether or not they contracted the disease.

– Businesses of a certain type, for example, might be classified based on whether or not they provide “day-care” facilities and whether or not they provide maternity leave for pregnant employees.

• Association between the pair of variables is seen by examining the patterns of frequencies in the individual cells.

• Many statistical studies involve 2 × 2 tables, and because many concepts of statistics can be presented in this form, we shall consider such tables in some detail.

This discussion is simplified by the notation displayed in the following table.

| | Question 2: Yes | Question 2: No | Total |
|---|---|---|---|
| Question 1: Yes | a | b | a + b |
| Question 1: No | c | d | c + d |
| Total | a + c | b + d | n |

Here

– a is the number of persons answering “yes” to both questions.

– b is the number of persons answering “yes” to Question 1 and “no” to Question 2.

– c is the number of persons answering “no” to Question 1 and “yes” to Question 2.

– d is the number of persons answering “no” to both questions.

– We denote by n the total number of persons included in the table and n = a + b + c + d.

• Independence:

– When the joint frequencies for two vari- ables have no association, they are said to be independent.

– For example, suppose communities were cross-classified according to average in- come and crime rate as the following ta- ble.

Average income level

Low Medium High Total Crime Low 1(10%) 4(10%) 3(10%) 8(10%)

Medium 7(70%) 28(70%) 21(70%) 56(70%) rate High 2(20%) 8(20%) 6(20%) 16(20%)

All crime 10 40 30 80

levels

– The table shows that for each income level most communities (70%) have a medium crime rate.

– The entire distribution of crime rates is the same for each level of average income; that is, crime rate is independent of level of average income.

– In this case, knowledge of a community’s average income level provides no information about its crime rate.

– The percentages of communities with low, medium, and high crime rates are 10%, 70%, and 20% regardless of the average income.

– Note that the marginal distribution of crime rate has to be the same as the distribution for each income level.

– When will $\frac{a}{a+b} = \frac{c}{c+d}$?

– When will $\frac{a}{a+c} = \frac{b}{b+d}$?
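
The independence check described above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function names are hypothetical, and exact fractions are used so that equal percentages compare as exactly equal. The counts are those of the income-by-crime table.

```python
from fractions import Fraction

# Counts from the income-by-crime table.
# Rows: crime rate (low, medium, high); columns: income (low, medium, high).
counts = [
    [1, 4, 3],
    [7, 28, 21],
    [2, 8, 6],
]

def column_distributions(table):
    """For each column, return the exact share of each row category."""
    col_totals = [sum(col) for col in zip(*table)]
    return [
        [Fraction(row[j], col_totals[j]) for row in table]
        for j in range(len(col_totals))
    ]

def marginal_row_distribution(table):
    """Share of each row category over the whole table."""
    n = sum(sum(row) for row in table)
    return [Fraction(sum(row), n) for row in table]

def is_independent(table):
    """True when every column has the same distribution as the row margin."""
    marginal = marginal_row_distribution(table)
    return all(dist == marginal for dist in column_distributions(table))

print(marginal_row_distribution(counts))  # shares 1/10, 7/10, 1/5
print(is_independent(counts))
```

For a 2 × 2 table the same check reduces to comparing a/(a + b) with c/(c + d).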

• Index of association for 2 × 2 tables

– A numerical indicator of association between two dichotomous variables can be based on the quantity ad − bc.

– a and d are the upper-left and lower-right entries in the above table, and b and c are the upper-right and lower-left entries.

– Write A1 and A2 for the two row categories and B1 and B2 for the two column categories.

– If ad − bc = 0, it indicates independence.

– If ad − bc > 0, it indicates that A1 occurs more frequently with B1, and A2 with B2, than the other way around.

– If ad − bc < 0, it indicates that A1 occurs more frequently with B2, and A2 with B1, than the other way around.

– The difference ad − bc is divided by a quantity that keeps the final index of association between −1 and +1, like the correlation coefficient for numerical scales.

• Measure of Association: φ coefficient

$$\phi = \frac{ad - bc}{\sqrt{(a + b)(c + d)(a + c)(b + d)}}$$

– If the variables A and B are ordered then the magnitude of φ indicates the strength of association and the sign indicates the direction of association as well.

– (a + b + c + d)φ^{2}, that is, nφ^{2}, is identical to the chi-square test statistic we will discuss later on.
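
As a rough illustration of the φ coefficient, the sketch below evaluates it for the Brand X razor and blade counts that appear later in this section (a = 186, b = 93, c = 59, d = 262). The function name `phi_coefficient` is an assumption made for this example.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table with cells a, b (first row) and c, d (second row)."""
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denominator

# Rows: uses / does not use the Brand X razor.
# Columns: uses / does not use Brand X blades.
phi_razor = phi_coefficient(186, 93, 59, 262)
print(round(phi_razor, 2))  # 0.49
```

A value near 0.49 indicates a moderate positive association. A table with ad = bc, such as a = 1, b = 4, c = 7, d = 28, gives φ = 0.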

Other kinds of 2 × 2 tables

• Dichotomous variables sometimes have categories that are ordered, so that one value reflects more of some characteristic than the other.

– To study the association of income and education level, the income might be classified simply as above or below poverty level, education as having completed fewer than 12 years of schooling or 12 years or more.

• Another type of double dichotomy is shown by the two-by-two tabulation of a dichotomous variable for the same individuals at two different times.

– Suppose in August we asked a number of people which of two presidential candidates they favored and then asked the same people the same question in September.

– We could tabulate the results as follows.

                        September
                  Bush     Clinton   Total
August   Bush     a        b         a + b
         Clinton  c        d         c + d
         Total    a + c    b + d     n

Here the number c would be the number of persons who switched from Democratic candidate Clinton to Republican candidate Bush between August and September 1992.

– These data are change-in-time data and the table is called a “turnover” table.

– We may ask whether the number of people who kept their original preference (cells a and d) is substantially larger than the number who changed (cells b and c).

Interpretation of frequencies

• Question: Is there a tendency for those who use Brand X razors also to use Brand X blades?

• Company X conducted a market survey, interviewing a sample of 600 men.

Each man was asked whether he uses the Brand X razor and whether he uses Brand X blades.

                       Use Brand X    Not use Brand X
                       blades         blades             Total
Use Brand X razor      186 (67%)      93 (33%)           279 (100%)
Not use                59 (18%)       262 (82%)          321 (100%)
Total                  245 (41%)      355 (59%)          600 (100%)

• Association: Most (67%) of the men using Brand X razors also use Brand X blades; only a few (18%) of the men not using Brand X razors use Brand X blades. The percentage of men using Brand X razors who use Brand X blades differs from that percentage among men not using Brand X razors.

– We say there is an association between using the Brand X razor and using Brand X blades.

– We can say that men who use the Brand X razor are more likely to use Brand X blades than are men who do not use the razor.

– This difference may suggest that an advertising campaign for blades should be directed to men not using Brand X razors.
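
The row percentages quoted in the razor and blade table can be reproduced with a short sketch. The helper name `row_percentages` is hypothetical; the counts are taken from the table.

```python
def row_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    return [[100.0 * cell / sum(row) for cell in row] for row in table]

razor_by_blade = [
    [186, 93],   # uses Brand X razor: uses blades, does not use blades
    [59, 262],   # does not use Brand X razor: uses blades, does not use blades
]

percentages = row_percentages(razor_by_blade)
for label, row in zip(["Use razor", "Not use razor"], percentages):
    print(label, [round(p) for p in row])
```

Comparing the first entry of each row (about 67% against about 18%) is the comparison the interpretation above rests on.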

Chi-Square Tests of Independence

Consider frequency tables in the case in which individuals were classified simultaneously on two categorical variables.

• Consider testing the null hypothesis that in a population two variables are independent on the basis of a sample drawn from that population.

– Even though the variables are independent in the population, they may not be (and in fact probably will not be) independent in a sample.

Consider the following hypothetical data.

• At each of two different dates, a sample of 1000 registered voters was drawn; at each time each respondent was asked which of two potential candidates he or she would favor.

           10/91    1/92     Total
Bush       523      502      1025
Clinton    477      498      975
Total      1000     1000     2000

• In the underlying population, let the proportion favoring Bush be $p_{1}$ at the time of the first poll and $p_{2}$ at the time of the second poll.

• The null hypothesis that the proportion favoring one candidate does not depend on the date of polling is $H_{0} : p_{1} = p_{2}$.

• In this example the estimates of $p_{1}$ and $p_{2}$ are $\hat{p}_{1} = 0.523$ and $\hat{p}_{2} = 0.502$, respectively, based on sample sizes $n_{1} = n_{2} = 1000$.

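• The estimate of the SD of the difference between two sample proportions when $H_{0}$ is true is

$$\sqrt{\hat{p}\hat{q}\left(\frac{1}{n_{1}} + \frac{1}{n_{2}}\right)} = 0.02235,$$

where $\hat{p} = 1025/2000 = 0.5125$ is the pooled sample proportion favoring Bush and $\hat{q} = 1 - \hat{p}$.

The test statistic is

$$z = \frac{\hat{p}_{1} - \hat{p}_{2}}{\sqrt{\hat{p}\hat{q}\left(1/n_{1} + 1/n_{2}\right)}} \approx 0.94.$$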

• What is the distribution of the above test statistic?
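
The calculation above can be checked numerically. The sketch below assumes the usual large-sample result that, under the null hypothesis, z is approximately standard normal, and uses that to attach a two-sided p-value. The function name `two_proportion_z` is hypothetical.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: p1 = p2 using the pooled estimate."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)   # estimate of the common proportion under H0
    q_pooled = 1.0 - p_pooled
    se = math.sqrt(p_pooled * q_pooled * (1.0 / n1 + 1.0 / n2))
    z = (p1_hat - p2_hat) / se
    # Standard normal two-sided tail area: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# 523 of 1000 favored Bush in 10/91 and 502 of 1000 in 1/92.
z, p_value = two_proportion_z(523, 1000, 502, 1000)
print(round(z, 2), round(p_value, 2))
```

With these counts z is about 0.94 and the two-sided p-value is roughly 0.35, so under the normal approximation the difference between the two polls is well within what sampling variation alone would produce.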

Measure of Association Based on Prediction

• Consider another measure of association which is based on the idea of using one variable to predict the other.

• Consider the cross-classification of exercise and health in the following Table.

                     Health status
                  Good     Poor     Total
Exerciser         92       14       106
Non-exerciser     25       71       96
Total             117      85       202

– How well does exercise group predict health status?

– If one of the 202 persons represented in this table is selected at random and we know nothing else about the person, our best guess of the health status is that the person is in the good-health group, because more of the people are in that group (117, compared with 85 for the poor-health group); the good-health category is the mode.

– If we make this prediction for each of the 202 persons, we shall be right in 117 cases and wrong in 85 cases.

• If we take the person’s exercise level into account, we can improve our prediction.

– If we know the person exercises regularly, our best guess is still that the person is in the good-health group, for 92 of the regular exercisers are in the good-health group, compared with only 14 in the poor-health group.

– If we know the person is not an exerciser, we should guess that the person is in the poor-health group; in this case we would be correct 71 times and incorrect 25 times.

– Our total number of errors in predicting all 202 health conditions for both exercisers and non-exercisers is 14 + 25 = 39, compared with 85 errors if we do not use the exercise category in making the prediction.

• For this example, a coefficient of association that measures the improvement in prediction of the column category due to using the row classification is

$$\lambda_{c \cdot r} = \frac{85 - (14 + 25)}{85} = 0.54.$$

Here 85−(14+25) is the reduction in errors when using the rows to predict columns and 85 is the number of errors not using the rows.

• The subscript c · r refers to predicting the column category using the row category.
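
The error-counting argument above generalizes to any table of counts. A minimal sketch, with a hypothetical function name, is given here and applied to the exercise-by-health table.

```python
def lambda_column_given_row(table):
    """Proportional reduction in errors when rows are used to predict columns."""
    column_totals = [sum(col) for col in zip(*table)]
    n = sum(column_totals)
    # Without the rows: always guess the modal column.
    errors_without_rows = n - max(column_totals)
    # With the rows: within each row, guess that row's modal column.
    errors_with_rows = sum(sum(row) - max(row) for row in table)
    if errors_without_rows == 0:
        raise ValueError("lambda is undefined when every case falls in one column")
    return (errors_without_rows - errors_with_rows) / errors_without_rows

exercise_by_health = [
    [92, 14],   # exercisers: good health, poor health
    [25, 71],   # non-exercisers: good health, poor health
]
lam = lambda_column_given_row(exercise_by_health)
print(round(lam, 2))  # 0.54
```

For a table in which knowing the row does not change the best guess, such as the income-by-crime table earlier in this section, the function returns 0.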

Recall that the correlation coefficient $r = s_{xy}/(s_{x}s_{y})$ is also a measure of association between two quantitative variables.

• Although the measure is symmetric in the two variables, it can be interpreted in terms of how well one variable y can be predicted from the other variable x.

• Recall the definition of λ, which defines a measure of association for categorical variables.

• With quantitative variables,

– The notion that replaces “number of errors” is “sum of squared deviations” of the predicted values of y from the actual values.

– Number of errors in prediction of y without using x is replaced with sum of squares of deviations in predicting y without using x. It is

$$\sum_{i} (y_{i} - \bar{y})^{2}.$$

– Number of errors in prediction of y using x is replaced with sum of squares of deviations in predicting y using x. It is

$$\sum_{i} (y_{i} - \hat{y}_{i})^{2}.$$

– With the above interpretation, λ is equal to $r^{2}$.
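
The identity between the proportional reduction in squared error and $r^{2}$ can be checked numerically. The sketch below assumes that the predicted values come from the least-squares line of y on x. The data are made-up illustrative numbers, not values from the lecture, and the function name is hypothetical.

```python
import math

def reduction_and_r_squared(x, y):
    """Return (proportional reduction in squared error, r squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)   # squared error without using x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx                          # least-squares slope
    intercept = y_bar - slope * x_bar          # least-squares intercept
    # Squared error when y is predicted from x by the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    reduction = (syy - sse) / syy
    r = sxy / math.sqrt(sxx * syy)             # the (n - 1) divisors cancel
    return reduction, r * r

# Hypothetical toy data, for illustration only.
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]
reduction, r_squared = reduction_and_r_squared(x_demo, y_demo)
print(abs(reduction - r_squared) < 1e-9)
```

The two quantities agree up to floating-point rounding, which mirrors the role of λ in the categorical case.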