3. Methodology
3.2. Proposed Methods
We simplify our goal to finding the relation between two variables.
If X is the parent (cause) and Y is the child (effect), then we use X → Y to represent the causal relation between them. We assume X and Y have the regression relation $Y = f(X) + \varepsilon$, where f is a smooth function and $\varepsilon$ is a random variable with mean zero that is independent of X. Sometimes X can at the same time be represented as $X = g(Y) + \varepsilon'$, with Y being independent of $\varepsilon'$. This will hold when f is a linear function. When this happens, we denote the cause/effect relationship by X ↔ Y.
Let $R_1$ and $R_2$ be the residuals of regressing Y on X and X on Y, respectively.
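The core idea of our method is that, if X → Y, then we should have the following result:
$R_1 \perp X$ and $R_2 \not\perp Y$, where $R_1 = Y - \hat{f}(X)$ and $R_2 = X - \hat{g}(Y)$.
On the other hand, if Y → X, then $R_1 \not\perp X$ and $R_2 \perp Y$. The possible relationships between X and Y are:
$X \overset{+}{\rightarrow} Y$, $X \overset{-}{\rightarrow} Y$, $X \overset{+}{\leftarrow} Y$, $X \overset{-}{\leftarrow} Y$, $X \overset{+}{\leftrightarrow} Y$, $X \overset{-}{\leftrightarrow} Y$, $X \perp Y$.
The arrow gives the direction. The sign above the arrow indicates activation (+) or repression (-).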
Our proposed method is divided into two parts according to the patterns of the residuals $R_i$, i = 1, 2.
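The two residual series can be illustrated with a small sketch. The text does not say which smoother is used for $\hat{f}$ and $\hat{g}$, so the Gaussian-kernel (Nadaraya-Watson) estimator, the function names, and the `bandwidth` parameter below are assumptions chosen only for illustration.

```python
import math


def kernel_smooth(x_train, y_train, x_eval, bandwidth):
    """Nadaraya-Watson estimate of E[y | x] with a Gaussian kernel.

    Assumed stand-in for the smooth regression function; the paper
    does not name a specific smoother.
    """
    fitted = []
    for x0 in x_eval:
        weights = [math.exp(-0.5 * ((x0 - xi) / bandwidth) ** 2) for xi in x_train]
        total = sum(weights)
        fitted.append(sum(w * yi for w, yi in zip(weights, y_train)) / total)
    return fitted


def residuals_both_ways(x, y, bandwidth=1.0):
    """Return (R1, R2) with R1 = Y - f_hat(X) and R2 = X - g_hat(Y)."""
    f_hat = kernel_smooth(x, y, x, bandwidth)  # regress Y on X
    g_hat = kernel_smooth(y, x, y, bandwidth)  # regress X on Y
    r1 = [yi - fi for yi, fi in zip(y, f_hat)]
    r2 = [xi - gi for xi, gi in zip(x, g_hat)]
    return r1, r2
```

The smoother is evaluated at the training points themselves, so each point always receives a positive weight and the denominator never vanishes. Any other smooth regression method could be substituted without changing the rest of the procedure.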
We summarize the method by the following flow chart.
[Figure 5: The flow chart of our proposed method. The distribution of the residuals $R_i$, i = 1, 2, determines the branch. First part (at least one of the residual distributions is normal): decision rules. Second part (neither residual is approximately normal, or the sample size is small, < 30): pre-process by transforming $R_1$, $R_2$, X and Y to ordinal data using RANK or RANGE statistics; step one, relation test using nonparametric correlations (Kendall's Tau, Spearman's Rho) with Bonferroni correction, to establish the relationship; step two, direction test using the chi-square test of independence.]
We can use a graphical display, the QQ plot, to check the normality of the residuals.
QQ plots are used to assess whether a data set has a particular distribution, or whether two data sets have the same distribution. If two distributions are the same, then the plot will approximate a straight line. The extreme points have more variability than points toward the center. Alternatively, we can use the Kolmogorov-Smirnov goodness-of-fit test (Chakravarti, Laha, and Roy, 1967) [5] or the chi-square goodness-of-fit test (Snedecor and Cochran, 1989) [24] to compare the distribution of the residuals with the normal distribution. If the sample size is small, we suggest skipping this part and going to the second part.
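As a rough illustration of the Kolmogorov-Smirnov check on a residual series, the sketch below computes the KS distance to a normal distribution fitted by the sample mean and standard deviation. The function names and the use of the asymptotic 5% critical value 1.358/sqrt(n) are assumptions made for this example; because the normal parameters are estimated from the same data, that critical value is conservative.

```python
import math
from statistics import NormalDist, mean, stdev


def ks_normal_statistic(sample):
    """Kolmogorov-Smirnov distance between the sample and a fitted normal."""
    n = len(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    d = 0.0
    for i, value in enumerate(sorted(sample)):
        cdf = fitted.cdf(value)
        # The empirical CDF jumps from i/n to (i+1)/n at this point.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def looks_normal(sample, critical_coefficient=1.358):
    """True if D is below the asymptotic 5% value 1.358 / sqrt(n).

    Assumption: the critical value for a fully specified distribution
    is used as an approximation even though mean and sd are estimated.
    """
    return ks_normal_statistic(sample) < critical_coefficient / math.sqrt(len(sample))
```

A sample needs at least two distinct values so that the fitted standard deviation is positive.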
First part: At least one of the residual distributions is normal.
(We remark that this condition usually does not hold in our study.) In this part we use the following decision rule:
1. If $R_1$ is approximately normal and $R_2$ is not, then X → Y.
2. If $R_2$ is approximately normal and $R_1$ is not, then Y → X.
3. If $R_1$ and $R_2$ are both approximately normal, then X ↔ Y.
4. If neither $R_1$ nor $R_2$ is approximately normal, or the sample size is small, then go to the second part.
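The four rules above amount to a small lookup on two normality flags and the sample size. The sketch below encodes them; the function name, the string labels, and the `min_n` parameter (set to the threshold of 30 quoted in the text) are illustrative choices, not part of the original method description.

```python
def first_part_decision(r1_normal, r2_normal, n, min_n=30):
    """Apply the first-part decision rule.

    r1_normal / r2_normal: whether R1 / R2 is approximately normal.
    n: sample size; below min_n the first part is skipped.
    """
    if n < min_n or (not r1_normal and not r2_normal):
        return "second part"  # rule 4
    if r1_normal and r2_normal:
        return "X<->Y"  # rule 3
    return "X->Y" if r1_normal else "Y->X"  # rules 1 and 2
```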
From our experience, the chance of using the first part of our method is quite small.
Second part: The residuals $R_1$ and $R_2$ are both not approximately normal, or the sample size is small (< 30).
In this part, we divide the target into the following questions:
Q1. Is there a relationship between the two variables?
Q2. How do they relate? Repress (-) or activate (+)?
Q3. What is the direction of the relationship, if one really exists?
Step one answers Q1 and Q2. Step two answers Q3.
Before we can describe the relationship between two variables, we must first confirm that a relationship between them exists.
Relationships between variables. One way to express a relationship between two variables is to compute the correlation coefficient between them. We discuss the correlation between X and Y using nonparametric correlations (Kendall's Tau and Spearman's Rho). An advantage of nonparametric or rank correlation is that we need not know the probability distribution functions from which the $x_i$'s and $y_i$'s are drawn.
The slight loss of information in ranking is a small price to pay for a very major advantage: when a correlation is demonstrated to be present nonparametrically, then it is really there! Nonparametric correlation is more robust than linear correlation and more resistant to unplanned defects in the data.
Spearman Rank-Order Correlation Coefficient (Siegel & Castellan, 1988 and Siegel, 1956) [23, 24]
Suppose we have N data points $(x_i, y_i)$, i = 1, ..., N. Let $R_i$ be the rank of $x_i$ among the other x's and $S_i$ be the rank of $y_i$ among the other y's. Then the rank-order correlation coefficient is defined to be the linear correlation coefficient of the ranks, namely,
$$r_s = \frac{\sum_{i} (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum_{i} (R_i - \bar{R})^2}\,\sqrt{\sum_{i} (S_i - \bar{S})^2}}. \qquad (3.1)$$
If N is larger than 10, the significance of a nonzero value of $r_s$ is tested by computing
$$t = r_s \sqrt{\frac{N - 2}{1 - r_s^2}}.$$
This statistic is distributed approximately as Student's t distribution with N - 2 degrees of freedom. A key point is that this approximation does not depend on the original distribution of the $x_i$'s and $y_i$'s; it is always the same approximation, and always pretty good.
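The rank-order coefficient and its t statistic can be sketched in a few lines of standard-library Python. The function names are illustrative, and giving tied values the average of their rank positions is an assumption; the text does not say how ties are ranked.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(a, b):
    """Linear correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_rho(x, y):
    """Spearman's rho: the linear correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_t(rho, n):
    """t statistic with n - 2 degrees of freedom; undefined when |rho| = 1."""
    return rho * math.sqrt((n - 2) / (1 - rho ** 2))
```

The resulting t value would then be compared with Student's t distribution with N - 2 degrees of freedom, as described above.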
Kendall’s Tau (Helsel & Hirsch, 1995) [12]
Suppose we have N data points $(x_i, y_i)$. Now consider all $N(N-1)/2$ pairs of data points, where a data point cannot be paired with itself, and where the points in either order count as one pair. Let $(x_i, y_i)$ and $(x_j, y_j)$ be a pair of (bivariate) observations. If $x_j - x_i$ and $y_j - y_i$ have the same sign, we say that the pair is concordant. If they have opposite signs, we say that the pair is discordant.
Let C be the number of concordant pairs and D be the number of discordant pairs. Then Kendall's Tau is defined as
$$\tau = \frac{C - D}{N(N-1)/2}.$$
If $x_i = x_j$, or $y_i = y_j$, or both, the comparison is called a 'tie'. Ties are not counted as concordant or discordant. If the number of ties is large, then Tau has to be replaced by
$$\tau = \frac{C - D}{\sqrt{N(N-1)/2 - N_x}\,\sqrt{N(N-1)/2 - N_y}},$$
where $N_x$ is the number of ties involving x and $N_y$ is the number of ties involving y.
Obviously, $-1 \le \tau \le 1$.
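The pair counting behind Kendall's Tau can be sketched directly. The tie-corrected denominator below assumes the common tau-b form, in which the number of pairs tied in x and the number tied in y are each subtracted from N(N-1)/2; the function name and return layout are illustrative.

```python
import math


def kendall_tau(x, y):
    """Return (tau, tau_with_tie_correction) by counting all pairs.

    The tie-corrected value assumes the tau-b form and is undefined
    when every pair is tied in x or every pair is tied in y.
    """
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            dx = x[j] - x[i]
            dy = y[j] - y[i]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            product = dx * dy
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    n_pairs = n * (n - 1) / 2
    tau = (concordant - discordant) / n_pairs
    tau_ties = (concordant - discordant) / math.sqrt(
        (n_pairs - ties_x) * (n_pairs - ties_y)
    )
    return tau, tau_ties
```

This double loop costs O(N^2) comparisons, which is adequate for the sample sizes discussed here.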
If N is larger than 40, the significance of Kendall's Tau can be tested by calculating a test statistic, t, and comparing it to the tabulated values of Student's t distribution.
Kendall's Tau is equivalent to Spearman's Rho (3.1) with regard to the underlying assumptions, but the two are not equal in magnitude because their underlying logic and computational formulae are quite different. Their relation can be expressed by the inequality
$$-1 \le 3\tau - 2 r_s \le 1.$$
In order to use the rank correlation test, we must transform the original data to ordinal type data. The following pre-process is necessary.
Pre-process: Transforming $R_1$, $R_2$, X, and Y to ordinal type. We use the following methods:
1. RANK:
Let $X_1, \ldots, X_N$ be continuous data and let $R_i$ be the rank of $X_i$. We use
$$\mathrm{Int}\!\left[\frac{R_i}{N/K}\right]$$
to translate the original continuous data to discrete ordinal data with K classes.
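A minimal sketch of the RANK transformation follows, assuming the rule is Int[R_i / (N/K)], computed here as the integer quotient K * R_i // N to avoid floating-point rounding. The function name and the handling of ties (broken by original order) are assumptions.

```python
def rank_classes(values, k):
    """Map continuous values to ordinal labels via Int[rank / (N / K)].

    Ranks run from 1 to N. Under this assumed formula only the largest
    rank maps to the label K itself; all other ranks map to 0 .. K-1.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # stable: ties keep input order
    classes = [0] * n
    for position, index in enumerate(order):
        rank = position + 1
        classes[index] = (k * rank) // n
    return classes
```

For example, `rank_classes([0.3, 0.1, 0.9, 0.5], 2)` assigns the ranks 2, 1, 4, 3 and returns `[1, 0, 2, 1]`.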
2. RANGE:
Let $X_1, \ldots, X_N$ be continuous data, with $A = \max(X_1, \ldots, X_N)$ and $B = \min(X_1, \ldots, X_N)$. We use
$$\mathrm{Int}\!\left[\frac{X_i - B}{A - B} \times K \times (1 - 10^{-7}) + 5 \times 10^{-8}\right]$$
to translate the original continuous data to discrete ordinal data with K classes.
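The RANGE transformation can be sketched as equal-width binning of the interval from the minimum to the maximum. The exact constants in the printed formula are hard to read, so the `shrink` and `offset` defaults below (10^-7 and 5 x 10^-8) are assumptions; their role is only to keep the maximum value inside the top class K - 1.

```python
import math


def range_classes(values, k, shrink=1e-7, offset=5e-8):
    """Map continuous values to labels 0 .. K-1 by position in [min, max].

    Assumed form: Int[(x - B) / (A - B) * K * (1 - shrink) + offset],
    with A = max and B = min. Requires A != B.
    """
    a, b = max(values), min(values)
    width = a - b
    return [
        math.floor((v - b) / width * k * (1 - shrink) + offset)
        for v in values
    ]
```

Unlike RANK, this rule preserves the spacing of the original values, so classes can be empty when the data are unevenly spread.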
Step one: Relation test.
We use the two nonparametric correlations (Kendall's Tau and Spearman's Rho) to perform the rank correlation test. The null hypothesis is that the coefficient (Kendall's Tau or Spearman's Rho) is zero. We use the signs of the two coefficients to indicate the type of relation (repress (-) or activate (+)). Next we use the Bonferroni correction to combine the two test results.
The rank correlation test is a distribution-free test that determines whether there is a monotonic relation between two variables. A monotonic relation exists when any increase in one variable is invariably associated with either an increase or a decrease in the other variable.
Bonferroni Correction (Sidak, 1968, 1971) [21, 22]
The following is the general Bonferroni inequality:
$$P\!\left(\bigcap_{i=1}^{g} A_i\right) \ge 1 - \sum_{i=1}^{g} P[\bar{A}_i], \qquad (3.7)$$
where $A_i$ and its complement $\bar{A}_i$ are any events, and g is the number of statements or comparisons in the finite set. In particular, if each $A_i$ is the event that a calculated confidence interval for a particular linear combination of treatments includes the true value of that combination, then the left-hand side of the inequality is the probability that all the confidence intervals simultaneously cover their respective true values. The right-hand side is one minus the sum of the probabilities of each of the intervals missing their true values. Therefore, if simultaneous multiple interval estimates are desired with an overall confidence coefficient $1 - \alpha$, one can construct each interval with confidence coefficient $1 - \alpha/g$, and the Bonferroni inequality ensures that the overall confidence coefficient is at least $1 - \alpha$.
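In our simulations, we use 0.0975 for the overall $\alpha$. This corresponds to applying a significance level of 0.05 to each of the two tests: if the two tests were independent, the chance that at least one of them is declared significant under the null hypothesis would be $1 - (1 - 0.05)^2 = 0.0975$, and the Bonferroni inequality bounds this chance by 0.10 in general.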
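The arithmetic linking the per-test level 0.05 and the overall level 0.0975 can be checked with a short sketch. The function names are illustrative; the independence formula is shown only as the calculation that reproduces 0.0975, and independence of the two correlation tests is an assumption, not something the text establishes.

```python
def bonferroni_per_test_level(alpha, g):
    """Per-test level alpha / g that keeps the overall level at most alpha."""
    return alpha / g


def familywise_upper_bound(per_test_levels):
    """Bonferroni bound: P(at least one false rejection) <= sum of levels."""
    return min(1.0, sum(per_test_levels))


def familywise_if_independent(per_test_levels):
    """Exact chance of at least one false rejection if tests are independent."""
    keep_all = 1.0
    for level in per_test_levels:
        keep_all *= 1.0 - level
    return 1.0 - keep_all
```

With two tests at level 0.05 each, `familywise_upper_bound` gives 0.10 and `familywise_if_independent` gives 0.0975.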
If step one fails to reject the null hypothesis, then we say that the relation is not strong enough to be noted and the two variables are treated as uncorrelated. Otherwise, we must go a step further and determine the direction of the relation.
Step two: Direction test.
At this step, we only focus on which direction is better: X → Y, Y → X, or X ↔ Y. We decide the direction by comparing the strength of independence of $R_1$ and X with that of $R_2$ and Y. We use the Pearson chi-square test to examine the independence of two variables.
Intuitively, if the smooth regression function fits the data well, then the residual should be small and (almost) independent of the predictor variable.
Note that the p-value itself loses its usual meaning under some conditions. The validity of the Pearson chi-square test depends heavily on the assumption that the expected cell counts are at least moderately large; a minimum size of five is often quoted as a rule of thumb. Even when cell counts are adequate, the chi-square distribution is only a large-sample approximation to the true distribution of X-squared under the null hypothesis. For our purpose, however, we only need to compare the magnitudes of the two p-values, or we can use Pearson's X-squared statistic directly.
So, even if the sample size is small, we can still use this method. The discriminant rules are as follows:
Let $P_1$ and $\chi^2_1$ be the p-value and X-squared statistic of Pearson's chi-square test of $R_1$ and X, and let $P_2$ and $\chi^2_2$ be the p-value and X-squared statistic of Pearson's chi-square test of $R_2$ and Y, respectively.
1. If $P_1$ is larger than $P_2$, then we accept X → Y. That means $R_2$ and Y are less independent than $R_1$ and X. (I.e., if $\chi^2_1$ is smaller than $\chi^2_2$, then we accept X → Y.)
2. If $P_2$ is larger than $P_1$, then we accept Y → X. That means $R_1$ and X are less independent than $R_2$ and Y. (I.e., if $\chi^2_1$ is larger than $\chi^2_2$, then we accept Y → X.)
3. If $|P_1 - P_2| < 0.00001$, then we accept X ↔ Y. The dependence of $R_1$ and X is similar to that of $R_2$ and Y. (I.e., if $\chi^2_1$ is very close to $\chi^2_2$, then we accept X ↔ Y.)
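The direction test can be sketched with the X-squared statistic alone, which the text allows in place of p-values. The function names are illustrative. The `tolerance` parameter is hypothetical: the text states a threshold of 0.00001 for p-values, and reusing that number for the statistics is an assumption. Comparing statistics matches comparing p-values only when both contingency tables have the same degrees of freedom.

```python
from collections import Counter


def chi_square_statistic(a, b):
    """Pearson X-squared for the contingency table of two label sequences."""
    n = len(a)
    cell_counts = Counter(zip(a, b))
    row_counts = Counter(a)
    col_counts = Counter(b)
    statistic = 0.0
    for row, row_total in row_counts.items():
        for col, col_total in col_counts.items():
            expected = row_total * col_total / n
            observed = cell_counts.get((row, col), 0)
            statistic += (observed - expected) ** 2 / expected
    return statistic


def direction_by_statistic(chi2_1, chi2_2, tolerance=1e-5):
    """Pick a direction from chi2_1 (R1 vs X) and chi2_2 (R2 vs Y).

    A smaller statistic means the residual looks more independent of
    the predictor, which favors that regression direction.
    """
    if abs(chi2_1 - chi2_2) < tolerance:
        return "X<->Y"
    return "X->Y" if chi2_1 < chi2_2 else "Y->X"
```

Here `a` and `b` would be the ordinal labels produced by the RANK or RANGE pre-process, for example the labels of $R_1$ and of X.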