2.2.1 The Weighted OS
Similar to the OS method, we standardize each gene. We change the original method (OS) from computing P
group2x0iI (x0i > Q30+ IQR0) to P
group2x0iwi, where wi is a weight function. In OS, wi take values either 0 or 1, i.e.
wi =
Therefore, it’s not a robust statistics. In our method, we will choose wi as a continuous function as follows.
where Q10 and Q30 are the values of 25% quantile and 75% quantile for for the stan-dardized samples and IQR0 = Q30− Q10 is the interquartile range.
The weighted outlier-sum statistic (WOS) is defined as
W∗ = X
i∈group2
x0iwi
We compare W∗ for each gene.
2.2.2 The Weighted ORT
The first step is to standardize each gene as ORT method, and to choose weight as
wi =
where Q100 and Q300 are the values of 25% quantile and 75% quantile for the samples in group 1, and IQR00 = Q300− Q100 is the interquartile range.
The weighted outlier robust t-statistic is defined as U∗ = X
i∈group2
x00iwi
where x00i is the data after standardization.
We compare U∗ for each gene.
2.2.3 Methods related Gaussian mixture model
In disease group, the gene expression of some patients is no difference with the normal group. Under the normal assumption in the normal group and mixed normal assump-tion on the disease group. We use the EM algorithm to find the MLE of the parameters.
Following three methods are related this MLE. Let
X1, . . . , Xn1 ∼ N (µ1, σ2) and Y1 = Xn1+1, . . . , Yn2 = Xn ∼ pN (µ1, σ2) + qN (µ2, σ2)
PGM method(the MLE of probability of Gaussian mixture model):
Let index for each gene be ˆq =
Pn2
i=1bqi
n2 , where ˆq is the MLE for q.
TGM method(T-statistic of Gaussian mixture model):
The index is defined as µb2−µb1
This index similar to the t-statistic. In the t-statistic, we only assume that the second group is normally distributed, and this index is an extension of the t-statistic by as-suming the second group is a mixture model.
QGM method(Quantile of Gaussian mixture model):
Let Y(1), . . . , Y(n2) be the order statistic of Y1, . . . , Yn2, and q(1), . . . , q(n2) be the
By the way, we get a theorem according to QGM.
Theorem.
The y(l) in the QGM converges to the r-percent quantile of the group N (µ2, σ2).
Proof.
The data in disease group comes from the distribution of pf1 + qf2 where f1 is the distribution of N (µ1, σ2) and f2 is the distribution of N (µ2, σ2).
2.2.4 Bayesian Rule P-value
There are many statistican using Bayesian Rule to solve their problems in biology, P.
Baldi and A.D. Long. (2001) and E. Kristiansson and A. Sjogren (2006). Here, we will try using Bayesian Rule in our problem. Let
X1, . . . , Xn1|µ1, σ ∼ N (µ1, σ2), Xn1+1, . . . , Xn|µ2, σ ∼ N (µ2, σ2).
where µ1 and µ2comes from uniform distribution, and σ2 comes from Inverse Gaussian distribution with the mean one and the shape parameter one.
We know that
σ2 ∼ f3(σ) = 1
3 Simulation Study
Now, we try to compare above methods. Theoretically, we could derive the distribution functions of the p-value of the indexes for a gene. Let the distribution of the index of normal group and disease group be F1 and F2 respectively. Suppose for some gene, the distribution of the index follows the distribution of F2, F2 = F1 if the gene comes from the normal group. If we observe the index value v, then 1 − F1(v) is the proportion of the normal genes with the index statistics greater than this index value, that is 1 − F1(∗) is the p-value of this gene. Therefore, 1 − F2(F1−1(1 − ∗))1 is the cumulative distribution function of the p-value for the gene. And we could obtain the mean, me-dian, Q1, Q3 and the plot of the distribution of this p-value(i.e. the true/false-positive rates plot) and then use these statistics to compare all methods. The distribution of T and Q in the t-statistic method and COPA method could be obtained by analytically, which will be seen in Appendix in detail. Else, we use simulation to find mean, me-dian, standard deviation, Q1, Q3, and the empirical cumulative distribution function plot. In the simulation study, we let n1 = n2 = 25 samples in normal and disease group. Set one disease gene which contains k = 1, 5, 10, 15, 20, 25 outlier disease sam-ples from the normal distribution with µ = 1, 2, 3 and σ2 = 1, and the other n2− k genes and the 999 normal genes coming from the standard normal distribution. That is X1, . . . , Xn1, Xn1+k+1, . . . , Xn∼ N (0, 1) , Xn1+1, . . . , Xn1+k ∼ N (µ, 1) where µ = 1, 2, 3 and k = 1, 5, 10, 15, 20, 25.
3.1 Comparison by mean, median, Q1, and Q3
We use simulation to compare these methods by checking mean, median, Q1, and Q3 of the p-value for the disease gene (See the tables in Appendix. The blue marked numbers are the smallest one of each row, and the light blue numbers are a little bit bigger than the red ones. There are no big differences between them.) We compare all methods by two ways, by fixing k and µ.
First, we fix k to see the behavior of the p-value when µ increases. When k is small, such as k = 1, 5 , from table 2 and table 3, we can see that no matter what µ is, QGM is the best choice for finding the over-expressed gene. Subsequently, when k is a little bit larger (k = 10), from table 4, it’s shown that QGM and BRP are good choices. Besides,
1We want to find the distribution of 1 − F1(v):
F (ζ) = P (1 − F1(v) ≤ ζ) = P (1 − ζ ≤ F(v)) = P (F1−1(1 − ζ) ≤ v) = 1 − F2(F1−1(1 − ζ))
as µ increases, ORT and WORT could be considered to be good indexs. When k is large, from table 5, 6, and 7, T-statistics and BRP are acceptable. And as µ increases, similarly as k = 10 , ORT and WORT could be taken into consideration. By the way, when k equals to the number of patients in disease group, i.e. k = 25 , T-statistics, ORT, WORT, PGM, BRP can be considered.
Second, we fix µ to observe the behavior of the p-value when k increases. When µ is small, that is, the difference between normal persons’ and patients’ genes is small, we can change our choice from QGM to T-statistics and BRP if k is increasing. When µ is bigger (µ = 2, 3), QGM may be good as k is small, but we have more choices such like T-statistics, ORT, WORT, PGM, and BRP if k increases.