Financial Time Series I Topic 3: Regression Analysis and Correlation Hung Chen Department of Mathematics National Taiwan University 11/16/2002


OUTLINE

1. Association

– Scatter Plot

– Correlation Coefficient: linear association

– Interpreting Association

– Nonlinear Association

– Causation

– Distribution of Correlation Coefficient

2. Simple Linear Regression

– Statistical Relationship

– Model

– Method of Least Squares

– Properties of Least Squares Solution

– Correlation

3. Association of Categorical Variables

– Frequency Tables

– Bivariate Categorical Data

– Test of Independence

– Measure of Association


Association Between Numerical Scales

• Scientific question: Does studying more help raise scores on the SAT?

– Data Collection: Record the number of hours spent studying for the SAT and the SAT scores for a sample of students.

– Do these data lead to the conclusion that “individuals with higher hours also have higher scores”?

• Scientific question: Is sodium intake related to systolic and diastolic blood pressure?

– Data Collection: Record the monthly sodium intakes for each individual in a sample and his/her blood pressure.

– Do individuals with higher sodium consumption also have higher blood pressure readings?

• Question: How do we assess the association of two numerical variables in statistics?

– scatter plot: a graphical technique

Scatter plots frequently depict information about the relationship between variables that is not indicated by a single summary statistic.

– correlation coefficient: a formal numerical index of linear relationship.

∗ How do we measure a curvilinear relationship?

– How reliable are these two tools?


Scatter plots

• Data: We have the language and non-language mental maturity scores (two kinds of “IQs”) of 23 school children.

How do we explore them?

– Let x denote language IQ and y denote non-language IQ, where

x <- c(86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119, 82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102)

y <- c(44, 53, 42, 50, 65, 52, 37, 50, 46, 30, 37, 66, 41, 43, 74, 69, 44, 67, 43, 60, 47, 54, 43)

– These data are represented visually by making a graph on two axes, the horizontal x axis representing language IQ and the vertical y axis representing non-language IQ.

– Such a graph is called a scatter plot (or scatter diagram).

– Use the R command plot(x, y, xlab = "language IQ", ylab = "non-language IQ", main = "Scatter Plot").

Each point in such a plot represents one individual.

– When all of the observations are plotted, the diagram conveys information about direction and magnitude of the association of x and y.

– The swarm of points goes in a southwest-northeast direction.

This indicates a positive or direct association of x and y.

Namely, individuals who have the lower y values are the same people who have the lower values on x; they form a cluster of points in the lower-left portion of the diagram.

– If the swarm of points lies in a northwest-southeast direction (i.e., upper left to lower right), there is a negative or inverse association of x with y.

• The strength or magnitude of the association is indicated by the degree to which the points are clustered together around a single line.

– If all of the points fall exactly on the line, there is a “perfect” association of the two variables. In this case, if we knew an individual’s value on variable x, we would be able to compute his/her value on y exactly.

– To the extent that the points in the diagram diverge from a straight line, the association is less than perfect.

– Since the scatter plot is a nonnumerical way of assessing association, adjectives are used to describe the strength of association.

– We may speak of a “strong” (or moderate, or even weak) association of x with y.

– Are such descriptions objective?


The Correlation Coefficient: A Measure of Linear Relationship

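• correlation coefficient: r = s_xy / (s_x s_y), where s_xy is the sample covariance of x and y, and s_x and s_y are their sample standard deviations.

– It is a statistic based on n pairs of measurements (x_i, y_i) on two variables x and y.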

– r is a measure of association between two quantitative variables.

– −1 ≤ r ≤ +1.

– The absolute value of r is exactly 1 for a perfect linear relationship, but lower if the points in the scatter plot diverge from a straight line.

• A descriptive statistic that indicates the degree of linear association of two numerical variates is the correlation coefficient, usually represented by the letter r.

Do you see why?

• The sign of r indicates the direction of association: it is positive for a direct association of x and y and negative for an inverse association.

We might ask

– Whether the selling prices of various models of automobiles are related to the numbers of each that are sold in a given period of time?

– Whether the amount of rainfall in a given agricultural area is related to the size of the crop yield?

– The observational units may not be people, but institutions, objects, or events.

• The above examples involve two numerical variables, but questions about categorical variables are equally common; these will be discussed later on.

• A correlation close to zero, whether positive or negative, indicates little or no linear association between x and y.

• The question “what is a strong correlation?” has several answers.

Statisticians would generally refer to a correlation close to zero as indicating “no correlation”; a correlation between 0 and 0.3 as “weak”; a correlation between 0.3 and 0.6 as “moderate”; a correlation between 0.6 and 1.0 as “strong”; and a correlation of 1.0 as “perfect.”

– Two laboratory technicians counting impurities in the same water samples should have very high agreement; the correlation between their counts may be 0.95 or better.

– Two different human characteristics rarely have such a high correlation.

The correlation of height and weight is generally in the neighborhood of 0.8; of scores on the Scholastic Assessment Test with college freshman grade average about 0.6; of measured intelligence with socioeconomic status about 0.4; of heart rate with blood pressure about 0.2.

– Example: consider the data collected by Nanji and French (1985) to examine the relationship of alcohol consumption with mortality due to cirrhosis of the liver in the 10 Canadian provinces. In this case, variable x is an index of the amount of alcohol consumed in the province in one year (1978) and variable y is the number of individuals who died from cirrhosis of the liver per 100,000 residents.

r is equal to 0.51.

Calculating Correlation Coefficient

• help.search("correlation")

• The software responds with “Help files with alias or title matching ‘correlation’, type ‘help(FOO, package = PKG)’ to inspect entry ‘FOO(PKG) TITLE’:”

– cor(base) Correlation, Variance and Covariance (Matrices)

– corr(boot) Correlation Coefficient

– acf(ts) Auto- and Cross- Covariance and -Correlation Function Estimation

– plot.acf(ts) Plotting Autocovariance and Autocorrelation Functions

• We try the first two.

• Consider the example on language and non-language IQ scores.

• Using the R base package, cor(x, y) gives 0.7689431.

• Using the R boot package, corr(cbind(x, y)) gives 0.7689431.

– From Packages, load the package boot.

– Look at the help manual with the command help(corr).

– It accepts only a matrix, so we need to form a matrix from x and y.
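As a check on the definition r = s_xy / (s_x s_y), the following minimal sketch computes the correlation of the IQ data by hand and compares it with cor(). The variable names s_xy, s_x, s_y, r_by_hand and r_builtin are introduced here for illustration only.

```r
# Language (x) and non-language (y) IQ scores of the 23 children.
x <- c(86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
       82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102)
y <- c(44, 53, 42, 50, 65, 52, 37, 50, 46, 30, 37, 66,
       41, 43, 74, 69, 44, 67, 43, 60, 47, 54, 43)

# Sample covariance and standard deviations (both use the n - 1 divisor).
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
s_x  <- sd(x)
s_y  <- sd(y)

r_by_hand <- s_xy / (s_x * s_y)  # definition of the correlation coefficient
r_builtin <- cor(x, y)           # base R

print(c(by_hand = r_by_hand, builtin = r_builtin))
```

Both numbers should agree with the value 0.7689431 quoted above.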

Test and Confidence Interval on r

• Efron (1982) analyzes data on law school admission, with the object being to examine the correlation between the LSAT (Law School Admission Test) score and the first-year GPA.

– For each of 15 law schools, we have the pair of data points (average LSAT, average GPA):

(576, 3.39), (635, 3.30), (558, 2.81), (578, 3.03), (666, 3.44), (580, 3.07), (555, 3.00), (661, 3.43), (651, 3.36), (605, 3.13), (653, 3.12), (575, 2.74), (545, 2.76), (572, 2.88), (594, 2.96)

– Let (X, Y) be bivariate normal with correlation coefficient ρ, and let r be the sample correlation. It can be shown that

√n(r − ρ) → N(0, (1 − ρ²)²).

– The above result can be used to derive confidence interval and carry out hypothesis testing when the data is normally distributed.

– Fisher suggests considering the z-transformation log((1 + x)/(1 − x)). Then we have

$$\sqrt{n}\,\frac{1}{2}\left[\log\frac{1+r}{1-r} - \log\frac{1+\rho}{1-\rho}\right] \to N(0, 1).$$
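The limiting result for the z-transformation can be turned into an approximate 95% confidence interval for ρ. The sketch below does this for the law school data, assuming the √n scaling stated above; a common finite-sample refinement, not used here, replaces n by n − 3. The names lsat, gpa, half_width, ci_z and ci_rho are illustrative.

```r
# Law school data: average LSAT and average GPA for 15 schools.
lsat <- c(576, 635, 558, 578, 666, 580, 555, 661, 651, 605,
          653, 575, 545, 572, 594)
gpa  <- c(3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13,
          3.12, 2.74, 2.76, 2.88, 2.96)

n <- length(lsat)
r <- cor(lsat, gpa)

# z-transform of r, including the factor 1/2 (this equals atanh(r)).
z <- 0.5 * log((1 + r) / (1 - r))

# Normal-theory interval on the z scale, using the sqrt(n) scaling above.
half_width <- qnorm(0.975) / sqrt(n)
ci_z <- c(z - half_width, z + half_width)

# Back-transform the endpoints to the correlation scale.
ci_rho <- tanh(ci_z)

print(r)
print(ci_rho)
```

Working on the z scale keeps the interval inside (−1, 1) after back-transformation, which the interval built directly from √n(r − ρ) does not guarantee.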

Interpreting Association

• The correlation coefficient is useful for summarizing the direction and magnitude of association between two variables. It is a widely used statistic, cited frequently in both scientific and popular reports.

• Limitations to the meaning of any particular correlation:

– The correlation coefficient reveals only the straight line (linear) association between x and y.

– r may be close to 0 even when the scatter plot reveals an important curvilinear pattern.

In this case, the scatter plot reveals that a straight line is not the whole story of the relationship of x to y.

This suggests that a scatter plot should always be examined when a correlation coefficient is to be computed or interpreted.

– Nonlinear association: Pairs of variables are often associated in a clear pattern, but not one conforming to a straight line.
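A small constructed example (not from the slides) makes the limitation concrete: y is an exact function of x, yet the correlation coefficient is zero because the relationship is symmetric rather than linear.

```r
# Hypothetical data: y is completely determined by x, but not linearly.
x <- -10:10
y <- x^2

r <- cor(x, y)
print(r)  # numerically zero
```

Calling plot(x, y) on these data shows a parabola, which is exactly the kind of pattern a scatter plot reveals and r misses.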

Nonlinear Association

Quite commonly, we will see scatter plots with the following characteristics.

• (asymptote) Consider the following two cases.

– Suppose the observations are patients who experience pain; variable x might be the amount of aspirin taken orally, in milligrams, and variable y the amount that is absorbed into the bloodstream.

Beyond a certain point, additional amounts of ingested aspirin are no longer absorbed, and the amount found in the bloodstream reaches a plateau.

– In research on human memory, it has been found that initial practice trials are very helpful in increasing the amount of material memorized; after a certain number of trials, however, each additional attempt to memorize has only a small added benefit.

Consider students who are learning a foreign language: what is the relationship between the number of 25-minute periods devoted to studying vocabulary and the number of words memorized?

• (N shape) Examine the effects of advertising on sales.

– If observations are branches of a large chain of stores with independent control over expenditures, variable x might be advertising outlays and y the total sales volume in a given period of time.

– What will the plot look like?

– Initial advertising outlays may have a substantial effect on sales, and additional outlays little further impact, but expensive “saturation” advertising may again give a significant boost to sales.

• (concave upward) Consider the amount of water provided to agricultural plots and the proportion of plants on the plot that do not grow to a given size.

This pattern would suggest that too much as well as too little water is harmful, while moderate amounts of water minimize the plant loss.

• (concave downward) In examining the responses of humans and animals to various kinds of physiological stimulation, little or no stimulation produces little or no response, while moderate amounts of stimulation produce maximal response.

Levels of stimulation that exceed the individual’s ability to process the input, however, can result in a partial or complete suppression of the response.


Causation

• Even a strong correlation between two variables does not imply that one causes the other.

When a statistical analysis reveals association between two variables it is generally desirable to know more about the association.

– Does it persist under different conditions?

– Does one factor “cause” the other?

– Is there a third factor that causes both?

– Is there another link in the chain, a factor influenced by one variable and in turn influencing the other?

• The correlation indicates only that certain pairings of values on x and y occur more frequently than other combinations.

• Example: If the x and y variables were “years on the job” and “job satisfaction ratings” for a sample of employees, the positive correlation between them might indicate that

– if employees hold their jobs for more years, the work seems to become more satisfying, or

– if employees are more satisfied, they keep their jobs longer.

That is, the causal connection may go in either direction and may also be affected by other intermediary mechanisms.

It may be that

– more senior employees are given subtle or overt rewards that in turn enhance their satisfaction. Thus, the distribution of rewards (referred to as an intervening variable) explains the association of years with satisfaction; the correlation itself does not imply a direct cause-and-effect relationship.

• Association between two variables may also occur because x and y are both consequences of some third variable that has not been observed. This is seen in the following illustration:

• Example: Do Storks Bring Babies?

In Scandinavian countries a positive association between the number of storks living in the area and the number of babies born in the area was noticed.

Do storks bring the babies? Without shattering the illusions of the incurably romantic, we may suggest the following:

Districts with large populations have a large number of births and also have many buildings, on the chimneys of which storks can nest.

Consider the diagram representing the idea that the population factor explains both the number of births and the frequency with which storks are sighted.

Large population → Many babies born

Large population → Many buildings → Many storks

The three variables to study are populations of districts, numbers of births in districts, and numbers of storks seen in the districts.
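The stork diagram can be mimicked with simulated data. In the sketch below every number (district count, population range, rates, noise levels, seed) is invented for illustration; the only structural assumption is the one in the diagram, namely that population drives both births and storks while storks have no effect on births.

```r
set.seed(1)          # arbitrary seed, for reproducibility only
n_districts <- 200   # hypothetical number of districts

# Hypothetical district populations.
population <- runif(n_districts, min = 1000, max = 50000)

# Births and storks each depend on population only.
births <- 0.02  * population + rnorm(n_districts, sd = 50)
storks <- 0.001 * population + rnorm(n_districts, sd = 3)

# Raw correlation: strongly positive although storks do not cause births.
r_raw <- cor(storks, births)

# Remove the linear effect of population from both variables, then
# correlate what is left (a partial correlation given population).
r_partial <- cor(resid(lm(storks ~ population)),
                 resid(lm(births ~ population)))

print(c(raw = r_raw, given_population = r_partial))
```

Once population is accounted for, the association between storks and births essentially disappears, which is what the third-variable explanation predicts.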


Simple Regression Analysis

• Find the statistical relationship between two quantitative variables.

• Statistical data are often used to answer questions about relationships between variables.

– How do we summarize the relationship or association between 2 or among 3 or more variables?

– How do we find the association among variables measured on numerical scales?

How about categorical variables?

• Functional Relationship: A variable y is said to be a function of a variable x if to any value of x there corresponds one and only one value of y.

– We symbolize a functional relationship by writing y = f (x), where f represents the function.

– The variable x is called the independent variable; the variable y is called the dependent variable because it is considered to depend on x.

– If x is the height from which a ball is dropped and y is the time the ball takes to fall to the ground, then y is functionally related to x because the law of gravity determines y in terms of x.

• Statistical relationship: When one variable is used to predict or “explain” values of the second variable, we allow some imperfection in the prediction.

– How do we express statistical relationships?

– What kinds of methods can be used for estimating those relationships from a sample of data?

– How do we measure and interpret variability around the predicted or explained values?


Statistical Relationship

• The relationship between x and y is not an exact, mathematical relationship, but rather several y values corresponding to a given x value scatter about a value that depends on the x value.

• For example, although not all persons of the same height have exactly the same weight, their weights bear some relation to that height.

– On the average, people who are 6 feet tall are heavier than those who are 5 feet tall; the mean weight in the population of 6-footers exceeds the mean weight in the population of 5-footers.

• The relationship between height and weight is modeled statistically as follows:

– For every value of x there is a corresponding population of y values.

Denote its distribution function by F_Y(y|x).

– The population mean of y for a particular value of x is denoted by µ(x). Note that µ(x) = E(Y|X = x).

– As a function of x, µ(x) is called the regression function.

– The population of y values at a particular x value also has a variance, denoted σ²; the usual assumption is that the variance is the same for all values of x.

homoscedastic: Var(Y|X = x) = σ²

heteroscedastic: Var(Y|X = x) depends on x

• For many variables encountered in statistical research, the regression function is a linear function of x, and thus may be written as µ(x) = α + βx.

– The quantities α and β are parameters that define the relationship between x and µ(x).

– Write Y = α + βx + ε with E(ε) = 0 and Var(ε) = σ².

– In conducting a regression analysis, we use a sample of data, (x_i, y_i), to estimate the values of these parameters so that we can understand this relationship.

• The focus of regression analysis is on making inferences about α, β, and σ².

– Estimate the magnitude of these parameters and test hypotheses about them.

– Consider the hypothesis H₀ : β = 0.

If this null hypothesis is true then µ(x) = α + 0 × x = α, the same number for all values of x.

This means that the values of y do not depend on x, that is, there is no statistical relationship between x and y.

If H₀ is rejected, then the existence of a statistical relationship between x and y is confirmed.

• The data required for regression analysis are observations on the pair of variables (x, y).

– Variable x may be uncontrolled or “naturally occurring,” as when we observe a sample of n individuals with their heights x and their weights y (random design), or it may be controlled, as in an experiment in which persons are trained as data processors for different lengths of time x and one measures the accuracy of their work y (fixed design).

• Examples:

– “Does studying more help raise scores on the Scholastic Assessment Tests (SAT)?”

This question could also be worded as follows: If we recorded the number of hours spent studying for the SAT and the SAT scores for a sample of students, do individuals with higher “hours” also have higher SAT scores and individuals with lower hours have lower SAT scores?

– We might ask if high sodium intake in one’s diet is associated with elevated blood pressure.

The question could be worded as follows:

If we recorded the monthly sodium intake for each individual in a sample and his/her blood pressure, do individuals with higher sodium consumption also have higher blood pressure readings while those with lower sodium intakes have the lower blood pressure readings?

– The term regression stems from the work of Sir Francis Galton (1822-1911), a famous geneticist, who studied the sizes of seeds and their offspring and the heights of fathers and their sons.

In both cases, he found that

∗ The offspring of parents of larger than average size tended to be smaller than their parents.

∗ The offspring of parents of smaller than average size tended to be larger than their parents.

∗ He called this phenomenon “regression toward mediocrity.”


Least-Square Estimates

• The data consist of n pairs of numbers (x₁, y₁), (x₂, y₂), . . . , (x_n, y_n).

• To explore their association, they can be plotted as a scatter plot.

• How do we find a “best fit” line for two variables x and y?

$$y_i = \hat{\alpha} + \hat{\beta} x_i + r_i, \qquad \hat{y}_i = \hat{\alpha} + \hat{\beta} x_i,$$

where the r_i = y_i − ŷ_i are called the residuals.

Is r_i close to the unobserved error ε_i?

• In regression analysis we seek to determine the equation of the line that gives ŷ_i values as close as possible to the data values y_i.

– How do we determine α̂ and β̂, estimates of α and β, respectively?

– How do we obtain confidence intervals and test hypotheses about the parameters of interest?

• If we all agree on choosing a line of best fit from all the lines that give ŷ_i values as close as possible to the data values y_i, this question is equivalent to asking how we can choose an appropriate y intercept and slope, a and b, such that the deviations y_i − ŷ_i are as small as possible.

One approach is to find a and b to minimize the criterion

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$

• The principle of least squares leads to

$$\bar{y} = \hat{\alpha} + \hat{\beta}\,\bar{x}, \qquad \hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$
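The two least squares equations follow from setting the partial derivatives of the criterion to zero. A brief sketch of that step, writing Q(a, b) for the criterion (a label introduced here, not used in the slides):

```latex
\begin{aligned}
Q(a, b) &= \sum_{i=1}^{n} (y_i - a - b x_i)^2, \\
\frac{\partial Q}{\partial a} &= -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0
  \;\Longrightarrow\; \bar{y} = \hat{\alpha} + \hat{\beta}\,\bar{x}, \\
\frac{\partial Q}{\partial b} &= -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0
  \;\Longrightarrow\; \hat{\beta}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.
\end{aligned}
```

The second implication uses the first: substituting a = ȳ − b x̄ into the second equation and solving for b gives the ratio of centered sums.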

Example 1. One of the questions about peaceful uses of atomic energy is the possibility that radioactive contamination poses health hazards.

• Since World War II, plutonium has been produced at the Hanford, Washington, facility of the Atomic Energy Commission.

• Over the years, appreciable quantities of radioactive wastes have leaked from their open-pit storage areas into the nearby Columbia River, which flows through parts of Oregon to the Pacific.

• To assess the consequences of this contamination on human health, investigators calculated, for each of the nine Oregon counties having frontage on either the Columbia River or the Pacific Ocean, an “index of exposure.”

– This index of exposure was based on several factors, including distance from Hanford and average distance of the population from water frontage.

– The cancer mortality rate, cancer mortality per 100,000 person-years (1959-1964), was also determined for each of these nine counties. They are Clatsop, Columbia, Gilliam, Hood River, Morrow, Portland, Sherman, Umatilla, and Wasco.

– Data, as (index of exposure, cancer mortality rate): (8.34, 210.3), (6.41, 177.9), (3.41, 129.9), (3.83, 162.3), (2.57, 130.1), (11.64, 207.5), (1.25, 113.5), (2.49, 147.1), (1.62, 137.5)
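The least squares formulas can be applied directly to these nine counties. The sketch below assumes the fourth mortality value is 162.3, the value consistent with the correlation r = 0.926 quoted later in these slides; the names exposure, mortality, alpha_hat and beta_hat are illustrative.

```r
# Index of exposure and cancer mortality rate for the nine counties.
# Assumption: the fourth mortality value is 162.3.
exposure  <- c(8.34, 6.41, 3.41, 3.83, 2.57, 11.64, 1.25, 2.49, 1.62)
mortality <- c(210.3, 177.9, 129.9, 162.3, 130.1, 207.5, 113.5, 147.1, 137.5)

# Slope: ratio of the centered cross-product sum to the centered sum of squares.
beta_hat <- sum((exposure - mean(exposure)) * (mortality - mean(mortality))) /
            sum((exposure - mean(exposure))^2)

# Intercept: the fitted line passes through the point of means.
alpha_hat <- mean(mortality) - beta_hat * mean(exposure)

# The same fit obtained with lm().
fit <- lm(mortality ~ exposure)

print(c(alpha_hat = alpha_hat, beta_hat = beta_hat))
print(coef(fit))
```

Under that assumption the hand computation and lm() give the same line, with a slope of roughly 9.2 deaths per 100,000 person-years per unit of the exposure index.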


Simple Regression Model

• A simple regression model for a set of pairs (X_i, Y_i), 1 ≤ i ≤ n, of points with dependent or response variable Y = (Y₁, . . . , Y_n)^T and explanatory variable X = (X₁, . . . , X_n) is the following:

Y_i = β₀ + β₁X_i + ε_i,

where ε = (ε₁, . . . , ε_n)^T is a vector of independent, identically distributed random variables with ε_i ∼ N(0, σ²), the normal distribution with mean zero and variance σ².

• The assumptions on the random variables ε_i can be relaxed to their being merely uncorrelated, i.e. E[ε_i ε_j] = 0 for i ≠ j, but not necessarily independent.

• We can never separate the random variables ε_i from the observations Y_i since we don’t know the value of ε_i, which means that we can never know the true values of (β₀, β₁).

• The method of least squares leads us to choose a “best” (β̂₀, β̂₁). This is where the sum of squared errors, or SSE, of the model comes in. We measure the goodness of a linear regression model by its squared error

$$\mathrm{SSE}(b_0, b_1) = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2.$$

The least squares regression model is the model with the smallest SSE.

• The least squares solutions (β̂₀, β̂₁) aren’t equal to the true values of (β₀, β₁) but they do have the property that

E[β̂₀] = β₀, E[β̂₁] = β₁.

That is, if the data do come from some simple linear regression model, the least squares solutions are unbiased estimates of the true values (β₀, β₁).

They also have the property that, among all unbiased estimates of (β₀, β₁) that are linear functions of the response variable, the least squares solutions have the smallest variance; this is what the classic Gauss-Markov Theorem states.
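Unbiasedness can be illustrated by simulation. In this sketch the true parameters, the design points, the error standard deviation, the number of replications and the seed are all hypothetical choices made for illustration; one_fit is a helper name introduced here.

```r
set.seed(2)        # arbitrary seed, for reproducibility only

beta0 <- 1         # hypothetical true intercept
beta1 <- 2         # hypothetical true slope
sigma <- 1         # hypothetical error standard deviation
x <- 1:30          # fixed design points

# Simulate one data set from the model and return the least squares estimates.
one_fit <- function() {
  y <- beta0 + beta1 * x + rnorm(length(x), sd = sigma)
  coef(lm(y ~ x))
}

# 2 x 2000 matrix: row 1 holds intercept estimates, row 2 slope estimates.
estimates <- replicate(2000, one_fit())

# Averages over the replications approximate E[beta0_hat] and E[beta1_hat].
avg <- rowMeans(estimates)
print(avg)
```

Individual estimates scatter around the truth, but their averages settle close to (1, 2), as unbiasedness suggests.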


Properties of Least Squares Solution

• To summarize, we have made the following decomposition of each observation using the fitted and residual values

Y = Ŷ + r,

where Ŷ is computed from X and corr(r, X) = 0.

• Under the standard assumption that Var(Y_i) = σ² and Cov(Y_i, Y_j) = 0 for i ≠ j, we have

$$\mathrm{Var}(\hat{\beta}_0) = \frac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}, \qquad \mathrm{Var}(\hat{\beta}_1) = \frac{n \sigma^2}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2},$$

$$\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) = \frac{-\sigma^2 \sum_{i=1}^{n} x_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}.$$

• Define the residual sum of squares (RSS) to be

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2,$$

where β̂₀ and β̂₁ are the least squares solutions.

Let s² = RSS/(n − 2), which is an unbiased estimate of σ².
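Replacing σ² by s² in the variance formulas gives estimated variances, which should match what R reports through vcov(). The sketch below checks this on the cancer mortality data (again assuming the fourth mortality value is 162.3); the names denom, var_b0, var_b1 and cov_b01 are illustrative.

```r
# Cancer mortality data; assumption: the fourth mortality value is 162.3.
x <- c(8.34, 6.41, 3.41, 3.83, 2.57, 11.64, 1.25, 2.49, 1.62)
y <- c(210.3, 177.9, 129.9, 162.3, 130.1, 207.5, 113.5, 147.1, 137.5)
n <- length(x)

fit <- lm(y ~ x)

# Residual sum of squares and the unbiased variance estimate s^2.
rss <- sum(resid(fit)^2)
s2  <- rss / (n - 2)

# Common denominator of the three formulas.
denom <- n * sum(x^2) - sum(x)^2

var_b0  <-  s2 * sum(x^2) / denom   # estimated Var(beta0_hat)
var_b1  <-  n * s2 / denom          # estimated Var(beta1_hat)
cov_b01 <- -s2 * sum(x) / denom     # estimated Cov(beta0_hat, beta1_hat)

# R's estimated covariance matrix of the coefficients, for comparison.
V <- vcov(fit)
print(V)
print(c(var_b0 = var_b0, var_b1 = var_b1, cov_b01 = cov_b01))
```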

• If the errors ε_i are independent normal random variables, then the estimated slope and intercept, being linear combinations of independent normal random variables, are normally distributed as well.

– Under the normality assumption, it can be shown that

$$\frac{\hat{\beta}_i - \beta_i}{s_{\hat{\beta}_i}} \sim t_{n-2}, \qquad i = 0, 1,$$

where s_{β̂_i} is the estimated standard error of β̂_i, obtained from the variance formula with s² in place of σ².

This result makes possible the construction of confidence intervals and hypothesis tests.

– If the ε_i are independent and the x_i satisfy certain assumptions, a version of the central limit theorem implies that, for large n, the estimated slope and intercept are approximately normally distributed.

The t-distribution above can then be used as an approximation.
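As an illustration of the t result, the sketch below tests H₀: β = 0 for the slope and builds a 95% confidence interval on the cancer mortality data (assuming the fourth mortality value is 162.3), then compares the interval with confint(). The names est, se, t_stat, p_value and ci are introduced here.

```r
# Cancer mortality data; assumption: the fourth mortality value is 162.3.
exposure  <- c(8.34, 6.41, 3.41, 3.83, 2.57, 11.64, 1.25, 2.49, 1.62)
mortality <- c(210.3, 177.9, 129.9, 162.3, 130.1, 207.5, 113.5, 147.1, 137.5)

fit   <- lm(mortality ~ exposure)
coefs <- coef(summary(fit))

est <- coefs["exposure", "Estimate"]     # estimated slope
se  <- coefs["exposure", "Std. Error"]   # its estimated standard error
df  <- df.residual(fit)                  # n - 2 degrees of freedom

# t statistic and two-sided p-value for H0: slope = 0.
t_stat  <- est / se
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval for the slope based on t with n - 2 df.
ci <- est + c(-1, 1) * qt(0.975, df) * se

print(c(t = t_stat, p = p_value))
print(ci)
print(confint(fit)["exposure", ])
```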

• Since the absolute value of r cannot exceed 1, the squared correlation has possible values from 0 to 1.

• If r = 0, then x is of no help in predicting y.

• If r = ±1 (so that r² = 1), then x predicts y exactly.

This can only occur if each point (x_i, y_i) falls exactly on the regression line; all of the variation in y can be explained by variability in x.

• In between these extremes, a weak correlation (e.g., in the range 0 to 0.3) is accompanied by a small proportion of explained variation (0 to 0.09), while a strong correlation (e.g., between 0.6 and 1.0) is accompanied by a substantially larger proportion of explained variation (0.36 to 1.0).

• Both r and r² are useful measures of association for regression analysis.

The correlation tells the direction of association and its square tells the extent to which y is predictable from x.

For the cancer mortality data, the correlation coefficient is r = 900.13/... = 0.926.

This value is very high, and the proportion of variation in mortality attributable to radioactive exposure is also high, 0.926² = 0.857 (that is, about 86%).
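The interpretation of r² as a proportion of explained variation can be checked numerically: for simple linear regression, 1 − RSS/TSS equals the squared correlation. A minimal sketch on the cancer mortality data, assuming the fourth mortality value is 162.3 (tss and explained are names introduced here):

```r
# Cancer mortality data; assumption: the fourth mortality value is 162.3.
exposure  <- c(8.34, 6.41, 3.41, 3.83, 2.57, 11.64, 1.25, 2.49, 1.62)
mortality <- c(210.3, 177.9, 129.9, 162.3, 130.1, 207.5, 113.5, 147.1, 137.5)

r   <- cor(exposure, mortality)
fit <- lm(mortality ~ exposure)

rss <- sum(resid(fit)^2)                      # variation left unexplained
tss <- sum((mortality - mean(mortality))^2)   # total variation in y

explained <- 1 - rss / tss                    # proportion explained by x

print(c(r = r, r_squared = r^2, explained = explained))
```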


Data Analysis

• In R, “lm” is used to fit linear models.

– It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although ‘aov’ may provide a more convenient interface for these).

– Usage: lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset = NULL, ...)

– Models for ‘lm’ are specified symbolically.

A typical model has the form ‘response ~ terms’ where ‘response’ is the (numeric) response vector and ‘terms’ is a series of terms which specifies a linear predictor for ‘response’.

A terms specification of the form ‘first + second’ indicates all the terms in ‘first’ together with all the terms in ‘second’ with duplicates removed.

A specification of the form ‘first:second’ indicates the set of terms obtained by taking the interactions of all terms in ‘first’ with all terms in ‘second’.

The specification ‘first*second’ indicates the cross of ‘first’ and ‘second’. This is the same as ‘first + second + first:second’.

– Additive model and multiplicative model

• Example: Consider Plant Weight Data in Annette Dobson (1990) “An Introduction to Generalized Linear Models.”

– Data input:

ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)

trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)

group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))

weight <- c(ctl, trt)

– Regress weight on group.

lm(weight ~ group) gives 5.032 as the estimate of β₀ and −0.371 as the estimate of β₁. Here β₁ refers to the group effect.

– Carry out an analysis of variance on the proposed model.

anova(lm(weight ~ group)) leads to the following table.

Analysis of Variance Table

           Df  Sum Sq  Mean Sq  F value  Pr(>F)
group       1  0.6882   0.6882   1.4191   0.249
Residuals  18  8.7293   0.4850
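The statement that β₁ is the group effect can be made concrete: with a two-level factor, the intercept is the control mean and the slope is the difference between the treatment and control means. The following sketch collects the commands above into one script and checks this; only the name fit is new.

```r
# Plant weight data (Dobson, 1990): control and treatment groups.
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)

# Two-level factor: the first 10 observations are "Ctl", the next 10 "Trt".
group  <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)

fit <- lm(weight ~ group)

# Intercept = mean of the control group; slope = Trt mean minus Ctl mean.
print(coef(fit))
print(c(mean(ctl), mean(trt) - mean(ctl)))

# Analysis of variance table for the fitted model.
print(anova(fit))
```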


The Association Among Categorical Variables

• “Is inoculation with a polio vaccine related to the occurrence of paralytic polio?”

– Data may be collected on a sample of observations and we ask whether a particular value on one variable (e.g., “vaccinated”) co-occurs with a particular value on the other (“polio absent”).

– Both of these variables are simple yes/no dichotomies.

• How do we measure the association among categorical variables?

– The idea of association is basically the same as in quantitative variables, that is, do certain values of one variable tend to occur more frequently with certain values of another?

– The values of categorical variables, however, may not have a range from lower to higher quantities, but may represent qualitatively distinct conditions, such as a condition being “present” or “absent.”


– Summary statistics such as the mean and standard deviation are not applicable.

Instead, observations are simply counted;

for example, how many observations have both condition A and condition B present?

– Counts are entered into a frequency table.

• As with numerical variables, an association of two categorical variables does not imply that one causes the other.

– The direction of causation may be reciprocal, with each variable affecting the other.

– Causation may be mediated by one or more intervening third variable(s).

– Both variables may be consequences of additional variables not included in the data causing the two outcomes to occur together.

– To understand these effects more completely, it is often necessary to examine the association of three or more variables.

The tabulation of data for three categorical variables results in a three-way table; for two variables it is a two-way table.


Bivariate Categorical Data

• Bivariate categorical data result from the observation of two categorical variables for each individual.

• A variable with two categories, such as male/female or graduate/undergraduate, is called a dichotomous variable.

A pair of dichotomous variables is called a double dichotomy.

Because each variable has just two categories, the table that results is a two-by-two (2 × 2) frequency table.

• Double dichotomies arise, for example, when each of a number of persons is asked a pair of yes-no questions.

– Company X asks each of 600 men whether or not they use Brand X razors and whether or not they use Brand X blades.

– A physician may classify patients according to whether or not they have been inoculated against a disease and whether or not they contracted the disease.


– Businesses of a certain type, for example, might be classified based on whether or not they provide “day-care” facilities and whether or not they provide maternity leave for pregnant employees.

• Association between the pair of variables is seen by examining the patterns of frequencies in the individual cells.

• Many statistical studies involve 2 × 2 tables and because many concepts of statistics can be presented in this form, we shall consider such tables in some detail.

This discussion is simplified by the notation displayed in the following table.

                      Question 2
                  Yes      No       Total
Question 1  Yes   a        b        a + b
            No    c        d        c + d
Total             a + c    b + d    n

Here

– a is the number of persons answering “yes” to both questions.

– b is the number of persons answering “yes” to Question 1 and “no” to Question 2.

– c is the number of persons answering “no” to Question 1 and “yes” to Question 2.

– d is the number of persons answering “no” to both questions.

– We denote by n the total number of persons included in the table, so n = a + b + c + d.

• Independence:

– When the joint frequencies for two variables have no association, they are said to be independent.

– For example, suppose communities were cross-classified according to average income and crime rate as in the following table.

                            Average income level
                       Low        Medium      High        Total
Crime rate  Low        1 (10%)    4 (10%)     3 (10%)     8 (10%)
            Medium     7 (70%)    28 (70%)    21 (70%)    56 (70%)
            High       2 (20%)    8 (20%)     6 (20%)     16 (20%)
All crime levels       10         40          30          80

– The table shows that for each income level most communities (70%) have a medium crime rate.

– The entire distribution of crime rates is the same for each level of average income;

that is, crime rate is independent of level of average income.

– In this case, knowledge of a community’s average income level provides no information about its crime rate.

– The percentages of communities with low, medium, and high crime rates are 10%, 70%, and 20% regardless of the average income.

– Note that the marginal distribution of crime rate has to be the same as the distribution for each income level.

– When will $\frac{a}{a+b} = \frac{c}{c+d}$?

– When will $\frac{a}{a+c} = \frac{b}{b+d}$?
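The independence check described above can be sketched in a few lines of Python. The counts are those of the income and crime-rate table; the function names column_distributions and marginal_distribution are illustrative choices, not part of the original notes.

```python
# Counts from the income / crime-rate table.
# Rows: crime rate (low, medium, high); columns: income level (low, medium, high).
counts = [
    [1, 4, 3],
    [7, 28, 21],
    [2, 8, 6],
]

def column_distributions(table):
    """For each column, return the share of that column's total in each row."""
    n_rows, n_cols = len(table), len(table[0])
    col_totals = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[table[i][j] / col_totals[j] for i in range(n_rows)]
            for j in range(n_cols)]

def marginal_distribution(table):
    """Return the share of the grand total that falls in each row."""
    n = sum(sum(row) for row in table)
    return [sum(row) / n for row in table]

cols = column_distributions(counts)
marginal = marginal_distribution(counts)

# The table shows independence when every column's distribution
# coincides with the marginal distribution of the row variable.
independent = all(
    abs(p - m) < 1e-12 for col in cols for p, m in zip(col, marginal)
)
print(marginal)
print(independent)
```

For this table every income level reproduces the 10%, 70%, 20% split, so the check reports independence.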

• Index of association for 2 × 2 tables

– A numerical indicator of association between two dichotomous variables can be based on the quantity ad − bc.

(47)

– a and d are the upper-left and lower-right entries in the above table, and b and c are the upper-right and lower-left entries.

– If ad − bc = 0, it indicates independence.

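– If ad − bc > 0, it indicates that A1 occurs more frequently with B1, and A2 with B2, than the other way around. Here A1 and A2 denote the two categories of the row variable A, and B1 and B2 the two categories of the column variable B.

– If ad − bc < 0, it indicates that A1 occurs more frequently with B2, and A2 with B1, than the other way around.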

– The difference ad − bc is divided by a quantity that keeps the final index of association between −1 and +1, like the correlation coefficient for numerical scales.

• Measure of Association: φ coefficient

$$\phi = \frac{ad - bc}{\sqrt{(a + b)(c + d)(a + c)(b + d)}}$$

– If the variables A and B are ordered then the magnitude of φ indicates the strength of association and the sign indicates the direction of association as well.

– (a + b + c + d)φ² = nφ² is identical to the chi-square test statistic we will discuss later on.
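As a rough illustration, the φ coefficient can be computed directly from the four cell counts. The sketch below also compares n·φ² with a Pearson chi-square statistic computed from observed and expected counts; taking that as the chi-square test meant here is an assumption. The counts 30, 10, 15, 45 are hypothetical, made up for illustration, and the function names are illustrative.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table with first row (a, b) and second row (c, d)."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square from observed and expected counts (assumed definition)."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical counts, not taken from the notes.
a, b, c, d = 30, 10, 15, 45
n = a + b + c + d
phi = phi_coefficient(a, b, c, d)
chi2 = chi_square_2x2(a, b, c, d)
print(phi)
print(n * phi ** 2, chi2)

# A sub-table of the income / crime-rate example (low and medium only):
# ad - bc = 1*28 - 4*7 = 0, so phi is zero.
phi_independent = phi_coefficient(1, 4, 7, 28)
print(phi_independent)
```

The two numbers on the second printed line should agree up to floating-point error, and φ always lies between −1 and +1.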

(48)

Other kinds of 2 × 2 tables

• Dichotomous variables sometimes have categories that are ordered, so that one value reflects more of some characteristic than the other.

– To study the association of income and education level, the income might be classified simply as above or below poverty level, education as having completed fewer than 12 years of schooling or 12 years or more.

• Another type of double dichotomy is shown by the two-by-two tabulation of a dichotomous variable for the same individuals at two different times.

– Suppose in August we asked a number of people which of two presidential candidates they favored and then asked the same people the same question in September.

– We could tabulate the results as follows.

(49)

                     September
                  Bush     Clinton  Total
August  Bush      a        b        a + b
        Clinton   c        d        c + d
Total             a + c    b + d    n

Here the number c would be the number of persons who switched from Democratic candidate Clinton to Republican candidate Bush between August and September 1992.

– These data are change-in-time data and the table is called a “turnover” table.

– We may ask whether the number of people who kept their original preference (cells a and d) is substantially larger than the number who changed (cells b and c).

(50)

Interpretation of frequencies

• Question: Is there a tendency for those who use Brand X razors also to use Brand X blades?

• Company X conducted a market survey, interviewing a sample of 600 men.

Each man was asked whether he uses the Brand X razor and whether he uses Brand X blades.

                      Use Brand X   Not use Brand X
                      blades        blades            Total
Use Brand X razor     186 (67%)     93 (33%)          279 (100%)
Not use               59 (18%)      262 (82%)         321 (100%)
Total                 245 (41%)     355 (59%)         600 (100%)

• Association: Most (67%) of the men using Brand X razors also use Brand X blades; only a few (18%) of the men not using Brand X razors use Brand X blades. The percentage of men using Brand X razors who use Brand X blades differs from that percentage among men not using Brand X razors.

(51)

– We say there is an association between using the Brand X razor and using Brand X blades.

– We can say that men who use the Brand X razor are more likely to use Brand X blades than are men who do not use the razor.

– This difference may suggest that an advertising campaign for blades should be directed to men not using Brand X razors.
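The row percentages quoted in the survey table can be reproduced with a short sketch like the following. The counts come from the table; the dictionary layout and the function name row_percentages are illustrative choices.

```python
# Counts from the Brand X survey.
# Each row is (uses Brand X blades, does not use Brand X blades).
survey = {
    "Use Brand X razor": (186, 93),
    "Not use Brand X razor": (59, 262),
}

def row_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    return {
        label: tuple(100 * count / sum(row) for count in row)
        for label, row in table.items()
    }

pct = row_percentages(survey)
for label, (uses_blades, no_blades) in pct.items():
    print(f"{label}: {uses_blades:.0f}% use blades, {no_blades:.0f}% do not")

# Sign of ad - bc: positive means razor users go with blade users.
(a, b), (c, d) = survey["Use Brand X razor"], survey["Not use Brand X razor"]
cross_difference = a * d - b * c
print(cross_difference)
```

Comparing the two rows of percentages, rather than the raw counts, is what makes the association visible, because the two groups of men differ in size.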

(52)

Chi-Square Tests of Independence

Consider frequency tables in the case in which individuals were classified simultaneously on two categorical variables.

• Consider testing the null hypothesis that in a population two variables are independent on the basis of a sample drawn from that population.

– Even though the variables are independent in the population, they may not be (and in fact probably will not be) independent in a sample.

Consider the following hypothetical data.

• At each of two different dates, a sample of 1000 registered voters was drawn; at each time each respondent was asked which of two potential candidates he or she would favor.

           10/91    1/92     Total
Bush       523      502      1025
Clinton    477      498      975
Total      1000     1000     2000

(53)

• In the underlying population, let the proportion favoring Bush be p₁ at the time of the first poll and p₂ at the time of the second poll.

• The null hypothesis that the proportion favoring one candidate does not depend on the date of polling is H₀ : p₁ = p₂.

• In this example the estimates of p₁ and p₂ are p̂₁ = 0.523 and p̂₂ = 0.502, respectively, based on sample sizes n₁ = n₂ = 1000.

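• The estimate of the SD of the difference between two sample proportions when H₀ is true is

$$\sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = 0.02235,$$

where p̂ = 1025/2000 = 0.5125 is the pooled sample proportion favoring Bush and q̂ = 1 − p̂ = 0.4875.

The test statistic is

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\,(1/n_1 + 1/n_2)}} = 0.940.$$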

• What is the distribution of the above test statistic?
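The arithmetic behind the SD estimate and the test statistic can be sketched as follows, using the poll counts from the table. The pooled proportion is taken to be the combined share favoring Bush in both samples; the variable names are illustrative.

```python
import math

# Respondents favoring Bush out of 1000 at each polling date.
x1, n1 = 523, 1000   # 10/91
x2, n2 = 502, 1000   # 1/92

p1_hat = x1 / n1
p2_hat = x2 / n2

# Pooled estimate of the common proportion under H0: p1 = p2.
p_hat = (x1 + x2) / (n1 + n2)
q_hat = 1 - p_hat

# Estimated SD of the difference of sample proportions under H0.
sd_diff = math.sqrt(p_hat * q_hat * (1 / n1 + 1 / n2))

# Test statistic.
z = (p1_hat - p2_hat) / sd_diff

print(sd_diff)
print(z)
```

Carried at full precision, z comes out slightly below the value quoted above; the small gap appears to come from dividing by the SD after it has been rounded to 0.02235.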

(54)

Measure of Association Based on Prediction

• Consider another measure of association which is based on the idea of using one variable to predict the other.

• Consider the cross-classification of exercise and health in the following Table.

                   Health status
                   Good     Poor     Total
Exerciser          92       14       106
Non-exerciser      25       71       96
Total              117      85       202

– How well does exercise group predict health status?

– If one of the 202 persons represented in this table is selected at random, our best guess of the health status, if we don’t know anything about the person, is to say that the person is in the good-health group because more of the people are in that group (117, compared with 85 for the poor-health group); the good-health category is the mode.

(55)

– If we make this prediction for each of the 202 persons, we shall be right in 117 cases and wrong in 85 cases.

• If we take the person’s exercise level into account, we can improve our prediction.

– If we know the person exercises regularly, our best guess is still that the person is in the good-health group, for 92 of the regular exercisers are in the good-health group, compared with only 14 in the poor-health group.

– If we know the person is not an exerciser, we should guess that the person is in the poor-health group; in this case we would be correct 71 times and incorrect 25 times.

– Our total number of errors in predicting all 202 health conditions for both exercisers and non-exercisers is 14 + 25 = 39, compared with 85 errors if we do not use the exercise category in making the prediction.

(56)

• For this example, a coefficient of association that measures the improvement in prediction of the column category due to using the row classification is

$$\lambda_{c \cdot r} = \frac{85 - (14 + 25)}{85} = 0.54.$$

Here 85 − (14 + 25) is the reduction in errors when using the rows to predict columns and 85 is the number of errors not using the rows.

• The subscript c · r refers to predicting the column category using the row category.
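The error-counting argument above can be sketched as a small function. It assumes the prediction rule described in the text (always guess the modal column, overall or within each row); the function name lambda_cr is an illustrative choice.

```python
def lambda_cr(table):
    """Proportional reduction in errors when predicting the column
    category from the row category. `table` is a list of rows of counts."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    # Without the rows: always guess the overall modal column.
    errors_without_rows = n - max(col_totals)
    # With the rows: guess the modal column within each row.
    errors_with_rows = sum(sum(row) - max(row) for row in table)
    return (errors_without_rows - errors_with_rows) / errors_without_rows

# Exercise / health table: columns are (good health, poor health).
exercise_health = [
    [92, 14],   # exercisers
    [25, 71],   # non-exercisers
]

lam = lambda_cr(exercise_health)
print(round(lam, 2))
```

For the exercise and health counts this gives (85 − 39)/85, the value computed by hand above.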

Recall the correlation coefficient $r = s_{xy}/(s_x s_y)$ is also a measure of association between two quantitative variables.

• Although the measure is symmetric in the two variables, it can be interpreted in terms of how well one variable y can be predicted from the other variable x.

• Recall the definition of λ which defines a measure of association for categorical variables.

• With quantitative variables,

(57)

– The notion that replaces “number of errors” is “sum of squared deviations” of the predicted values of y from the actual values.

– Number of errors in prediction of y without using x is replaced with sum of squares of deviations in predicting y without using x. It is

$$\sum_i (y_i - \bar{y})^2.$$

– Number of errors in prediction of y using x is replaced with sum of squares of deviations in predicting y using x. It is

$$\sum_i (y_i - \hat{y}_i)^2.$$

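– With the above interpretation, the analogue of λ, namely $\left[\sum_i (y_i - \bar{y})^2 - \sum_i (y_i - \hat{y}_i)^2\right] / \sum_i (y_i - \bar{y})^2$, is equal to r² when ŷ_i is the least squares prediction of y_i.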
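A quick numerical check of this identity, as a sketch: the data below are hypothetical, made up for illustration, and the predictions ŷ_i are assumed to come from the least squares line of y on x.

```python
# Hypothetical (x, y) data, not taken from the notes.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.3, 6.6]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares line: y_hat = intercept + slope * x.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

# "Errors" without x: squared deviations from the mean of y.
ss_without_x = s_yy
# "Errors" with x: squared deviations from the fitted line.
ss_with_x = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

# Proportional reduction in squared error, the analogue of lambda.
lam = (ss_without_x - ss_with_x) / ss_without_x

# Correlation coefficient; the 1/(n - 1) factors in s_xy, s_x, s_y cancel.
r = s_xy / (s_xx * s_yy) ** 0.5

print(lam, r ** 2)
```

The two printed values should agree up to floating-point error, whatever data are used, as long as the predictions come from the least squares fit.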