New methods - 在基因晶片中關鍵基因之選取方法

2.2.1 The Weighted OS

Similar to the OS method, we standardize each gene. We change the original method (OS) from computing P

group2x⁰_iI (x⁰_i > Q3⁰+ IQR⁰) to P

group2x⁰_iw_i, where w_i is a weight function. In OS, w_i take values either 0 or 1, i.e.

w_i =

Therefore, it’s not a robust statistics. In our method, we will choose w_i as a continuous function as follows.

where Q1⁰ and Q3⁰ are the values of 25% quantile and 75% quantile for for the stan-dardized samples and IQR⁰ = Q3⁰− Q1⁰ is the interquartile range.

The weighted outlier-sum statistic (WOS) is defined as

W^∗ = X

i∈group2

x⁰_iw_i

We compare W^∗ for each gene.

2.2.2 The Weighted ORT

The first step is to standardize each gene as ORT method, and to choose weight as

w_i =

where Q1⁰⁰ and Q3⁰⁰ are the values of 25% quantile and 75% quantile for the samples in group 1, and IQR⁰⁰ = Q3⁰⁰− Q1⁰⁰ is the interquartile range.

The weighted outlier robust t-statistic is defined as U^∗ = X

i∈group2

x⁰⁰_iw_i

where x⁰⁰_i is the data after standardization.

We compare U^∗ for each gene.

2.2.3 Methods related Gaussian mixture model

In disease group, the gene expression of some patients is no difference with the normal group. Under the normal assumption in the normal group and mixed normal assump-tion on the disease group. We use the EM algorithm to find the MLE of the parameters.

Following three methods are related this MLE. Let

X1, . . . , Xn1 ∼ N (µ1, σ²) and Y1 = Xn1+1, . . . , Yn2 = Xn ∼ pN (µ1, σ²) + qN (µ2, σ²)

PGM method(the MLE of probability of Gaussian mixture model):

Let index for each gene be ˆq =

P_n2

i=1bqi

n2 , where ˆq is the MLE for q.

TGM method(T-statistic of Gaussian mixture model):

The index is defined as ^µ^b²⁻^µ^b¹

This index similar to the t-statistic. In the t-statistic, we only assume that the second group is normally distributed, and this index is an extension of the t-statistic by as-suming the second group is a mixture model.

QGM method(Quantile of Gaussian mixture model):

Let Y₍₁₎, . . . , Y_(n₂₎ be the order statistic of Y₁, . . . , Y_n₂, and q₍₁₎, . . . , q_(n₂₎ be the

By the way, we get a theorem according to QGM.

Theorem.

The y_(l) in the QGM converges to the r-percent quantile of the group N (µ₂, σ²).

Proof.

The data in disease group comes from the distribution of pf₁ + qf₂ where f₁ is the distribution of N (µ₁, σ²) and f₂ is the distribution of N (µ₂, σ²).

2.2.4 Bayesian Rule P-value

There are many statistican using Bayesian Rule to solve their problems in biology, P.

Baldi and A.D. Long. (2001) and E. Kristiansson and A. Sjogren (2006). Here, we will try using Bayesian Rule in our problem. Let

X₁, . . . , X_n₁|µ₁, σ ∼ N (µ₁, σ²), X_n₁₊₁, . . . , X_n|µ₂, σ ∼ N (µ₂, σ²).

where µ₁ and µ₂comes from uniform distribution, and σ² comes from Inverse Gaussian distribution with the mean one and the shape parameter one.

We know that

σ² ∼ f3(σ) = 1

3 Simulation Study

Now, we try to compare above methods. Theoretically, we could derive the distribution functions of the p-value of the indexes for a gene. Let the distribution of the index of normal group and disease group be F₁ and F₂ respectively. Suppose for some gene, the distribution of the index follows the distribution of F2, F2 = F1 if the gene comes from the normal group. If we observe the index value v, then 1 − F₁(v) is the proportion of the normal genes with the index statistics greater than this index value, that is 1 − F₁(∗) is the p-value of this gene. Therefore, 1 − F₂(F₁⁻¹(1 − ∗))¹ is the cumulative distribution function of the p-value for the gene. And we could obtain the mean, me-dian, Q1, Q3 and the plot of the distribution of this p-value(i.e. the true/false-positive rates plot) and then use these statistics to compare all methods. The distribution of T and Q in the t-statistic method and COPA method could be obtained by analytically, which will be seen in Appendix in detail. Else, we use simulation to find mean, me-dian, standard deviation, Q1, Q3, and the empirical cumulative distribution function plot. In the simulation study, we let n1 = n2 = 25 samples in normal and disease group. Set one disease gene which contains k = 1, 5, 10, 15, 20, 25 outlier disease sam-ples from the normal distribution with µ = 1, 2, 3 and σ² = 1, and the other n₂− k genes and the 999 normal genes coming from the standard normal distribution. That is X₁, . . . , X_n₁, X_n₁_+k+1, . . . , X_n∼ N (0, 1) , X_n₁₊₁, . . . , X_n₁_+k ∼ N (µ, 1) where µ = 1, 2, 3 and k = 1, 5, 10, 15, 20, 25.

3.1 Comparison by mean, median, Q1, and Q3

We use simulation to compare these methods by checking mean, median, Q1, and Q3 of the p-value for the disease gene (See the tables in Appendix. The blue marked numbers are the smallest one of each row, and the light blue numbers are a little bit bigger than the red ones. There are no big differences between them.) We compare all methods by two ways, by fixing k and µ.

First, we fix k to see the behavior of the p-value when µ increases. When k is small, such as k = 1, 5 , from table 2 and table 3, we can see that no matter what µ is, QGM is the best choice for finding the over-expressed gene. Subsequently, when k is a little bit larger (k = 10), from table 4, it’s shown that QGM and BRP are good choices. Besides,

1We want to find the distribution of 1 − F₁(v):

F (ζ) = P (1 − F1(v) ≤ ζ) = P (1 − ζ ≤ F₍v)) = P (F₁⁻¹(1 − ζ) ≤ v) = 1 − F2(F₁⁻¹(1 − ζ))

as µ increases, ORT and WORT could be considered to be good indexs. When k is large, from table 5, 6, and 7, T-statistics and BRP are acceptable. And as µ increases, similarly as k = 10 , ORT and WORT could be taken into consideration. By the way, when k equals to the number of patients in disease group, i.e. k = 25 , T-statistics, ORT, WORT, PGM, BRP can be considered.

Second, we fix µ to observe the behavior of the p-value when k increases. When µ is small, that is, the difference between normal persons’ and patients’ genes is small, we can change our choice from QGM to T-statistics and BRP if k is increasing. When µ is bigger (µ = 2, 3), QGM may be good as k is small, but we have more choices such like T-statistics, ORT, WORT, PGM, and BRP if k increases.

在文檔中在基因晶片中關鍵基因之選取方法 (頁 12-17)