
Results of Simulation

The nine figures above fall into three types. First, in Figure 2(a-c), Figure 3(a-d), and Figure 4, there is no visible difference between the N(0,1) curve and the dashed lines, so the correlation between genes does not appear to affect the empirical distribution of the zi’s.

Second, in Figure 5(a), the empirical distribution of the zi’s becomes wider than N(0,1) as the positive φ increases. In Figure 5(b), it becomes narrower than N(0,1) as the negative φ decreases. Figure 5(c) shows both effects: the empirical distribution is wider than N(0,1) for larger positive φ and narrower than N(0,1) for smaller negative φ. In Figure 7, the empirical distribution of the zi’s becomes wider than N(0,1) as the correlation coefficient c increases.

Similarly, in Figure 6(a) and Figure 6(c), the empirical distribution of the zi’s becomes wider than N(0,1) as the positive θ increases. In Figure 6(b) and Figure 6(d), it becomes narrower than N(0,1) as the negative θ decreases.

Hence, the dashed lines differ clearly from N(0,1) in Figure 5(a-c), Figure 6(a-d), and Figure 7, which shows that the correlation among microarrays does affect the empirical distribution of the zi’s.
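The widening and narrowing described above can be reproduced with a minimal sketch. It assumes that each gene's expression across the arrays follows a stationary AR(1) series with coefficient φ and that G0 is the Student t distribution with n − 2 degrees of freedom. The gene count, the sample sizes, and the function names `ar1_rows`, `z_values`, and `z_spread` are illustrative choices and are not taken from the study.

```python
import numpy as np
from scipy import stats


def ar1_rows(m, n, phi, rng):
    """Return an m x n matrix whose rows are independent stationary AR(1) series."""
    x = np.empty((m, n))
    # Start from the stationary distribution so that every column has unit variance.
    x[:, 0] = rng.standard_normal(m)
    innov_sd = np.sqrt(1.0 - phi ** 2)
    for j in range(1, n):
        x[:, j] = phi * x[:, j - 1] + innov_sd * rng.standard_normal(m)
    return x


def z_values(x, n1):
    """Pooled two-sample t-statistic per row (first n1 columns vs the rest), mapped to z."""
    df = x.shape[1] - 2
    t = stats.ttest_ind(x[:, :n1], x[:, n1:], axis=1).statistic
    # z = Phi^{-1}(G0(t)); positive t goes through the upper tail to limit rounding error.
    return np.where(
        t > 0,
        -stats.norm.ppf(stats.t.sf(t, df)),
        stats.norm.ppf(stats.t.cdf(t, df)),
    )


def z_spread(phi, m=2000, n=20, n1=10, seed=0):
    """Standard deviation of the null z-values when arrays are AR(1)-correlated."""
    rng = np.random.default_rng(seed)
    return float(np.std(z_values(ar1_rows(m, n, phi, rng), n1)))


if __name__ == "__main__":
    for phi in (-0.5, 0.0, 0.5):
        print(f"phi = {phi:+.1f}: sd of z-values = {z_spread(phi):.2f}")
```

Under these assumptions, a standard deviation above 1 corresponds to an empirical distribution wider than N(0,1), and a value below 1 to a narrower one.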

Third, the dashed lines differ visibly from N(0,1) in Figure 8(a)(b), Figure 9(a)(b), and Figure 10(a)(b), so the various distribution assumptions also affect the empirical distribution of the zi’s.

From these results, we conclude that the correlation among microarrays and the various distribution assumptions can cause the empirical distribution of the zi’s to differ from N(0,1) in microarray experiments.

5 Real Data

The data come from a microarray experiment on breast cancer, provided by the Department of Interdisciplinary Oncology, Moffitt Cancer Center and Research Institute, University of South Florida. The experiment included 185 samples, 143 from the normal group and 42 from the patient group. Each sample was measured on a microarray giving expression levels for the same m = 54675 genes. Applying the multiple testing procedure to these data yields m = 54675 zi’s. Figure 11 shows the histogram of the observed zi’s; the heavy blue line indicates the theoretical null distribution. The empirical distribution of the zi’s is wider than N(0,1), which suggests that the data may be correlated among microarrays.

If the genes are null, the zi’s should follow a standard normal distribution under the normal assumption, which is not what we observe here. To address this discrepancy, we consider an improved method.

For example, permutation methods can be used to avoid the assumption zi|Hi ∼ N(0,1), and the resulting permutation-improved theoretical null may match the empirical null more closely (Efron et al. 2001; Dudoit et al. 2003; Efron 2004; Efron 2007). Moreover, Efron (2007) noted that randomly permuting the microarrays eliminates the group differences while preserving the correlation structure of the genes.

Hence we apply permutation methods to the breast cancer data.

Let X represent the 54675 × 185 matrix X = (xij) of the breast cancer data.

Each row of X (i.e., each gene) yields a two-sample t-statistic ti comparing the 143 samples from the normal group with the 42 samples from the patients, which is then transformed to a z-value by zi = Φ−1(G0(ti)), giving 54675 zi’s. We then recalculate the 54675 zi’s after randomly permuting the columns of X, that is, after randomly dividing the 185 samples into groups of 143 and 42. This process is repeated independently 100 times, generating a total of 100 × 54675 permutation zi’s. This procedure is called permutation testing. Because the permutation test is model-free, it can be regarded as more robust than the t-test. The empirical distribution of the 100 × 54675 permutation zi’s (i.e., the permutation null) is also plotted in Figure 11, where the heavy red line indicates the permutation null. The empirical distribution of the observed zi’s is wider than the permutation null distribution, but the permutation null matches the histogram of the observed zi’s more closely than N(0,1) does.

Figure 11: The distribution of the zi’s in the real data. (Histogram of the observed zi’s with the N(0,1) and permutation null curves; x-axis: z-value, y-axis: Density.)
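The permutation procedure can be sketched as follows. This is a minimal illustration, not the code used in the study. It assumes that G0 is the Student t distribution with n − 2 degrees of freedom and that the t-statistic uses a pooled variance. The function names `z_values` and `permutation_null` are hypothetical, and the demo uses a small synthetic matrix rather than the breast cancer data.

```python
import numpy as np
from scipy import stats


def z_values(x, is_patient):
    """Row-wise two-sample t-statistics (normal vs patient) mapped to z = Phi^{-1}(G0(t))."""
    df = x.shape[1] - 2
    t = stats.ttest_ind(x[:, ~is_patient], x[:, is_patient], axis=1).statistic
    # Positive t goes through the upper tail so that G0(t) does not round to 1.
    return np.where(
        t > 0,
        -stats.norm.ppf(stats.t.sf(t, df)),
        stats.norm.ppf(stats.t.cdf(t, df)),
    )


def permutation_null(x, is_patient, n_perm, rng):
    """Recompute all z-values after randomly reassigning the group labels, n_perm times."""
    out = np.empty((n_perm, x.shape[0]))
    for b in range(n_perm):
        # Shuffling the labels is equivalent to permuting the columns of x;
        # the group sizes stay the same.
        out[b] = z_values(x, rng.permutation(is_patient))
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for the 54675 x 185 matrix (NOT the real data):
    # 500 genes, 30 normal and 10 patient samples, 50 genes with a group shift.
    x = rng.standard_normal((500, 40))
    is_patient = np.arange(40) >= 30
    x[:50, is_patient] += 1.5
    observed = z_values(x, is_patient)
    null = permutation_null(x, is_patient, n_perm=20, rng=rng)
    print("sd of observed z-values:", observed.std())
    print("sd of permutation null :", null.std())
```

For the real data, `x` would be the 54675 × 185 matrix, `is_patient` would mark the 42 patient columns, and `n_perm` would be 100.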

Permutation methods avoid the normal assumption (Dudoit et al., 2003; Efron et al., 2001; Efron, 2004, 2006), but they do not solve the problem of selecting a suitable null hypothesis (Efron, 2004). For the choice of a suitable null hypothesis, see Efron (2004, 2006, 2007).

6 Conclusions and Future Research

In this study, we focused on the reasons why the empirical distribution of the zi’s differs from N(0,1) in large-scale multiple hypothesis testing. We proposed three possible reasons: the correlation between genes, the correlation among microarrays, and the various distribution assumptions. We constructed twelve models covering these three reasons and simulated data from them.

From the data simulated under the models of correlation among microarrays, we saw that the empirical distribution of the zi’s departs further from N(0,1) as the correlation becomes stronger. The data simulated under the models of various distribution assumptions likewise showed a clear difference between the empirical distribution of the zi’s and N(0,1). Hence, based on the simulation results, we conclude that the correlation between genes does not appear to affect the empirical distribution of the zi’s, and that the correlation among microarrays and the various distribution assumptions are the main reasons.

This study proposed only three possible reasons in large-scale multiple hypothesis testing. It would be worthwhile to examine further reasons that may cause the distribution of the zi’s to differ from N(0,1) and to provide appropriate models for them.

Also, this study used AR and MA models with different coefficients and orders to generate correlated data between genes and among microarrays. Another direction for future research is to use an autoregressive moving average (ARMA) model or another correlation model for the proposed reasons. In addition, this study provided six distribution models for the various distribution assumptions; other distribution models could be investigated in future research.

References

[1] Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Ser. B, 57, 289-300.

[2] Chan, N. H. (2001). Time series applications to finance. Wiley, New York.

[3] Dudoit, S., Shaffer, J., and Boldrick, J. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science, 18, 71-103.

[4] Efron, B. (2003). Robbins, empirical Bayes, and microarrays. The Annals of Statistics.

[5] Efron, B. (2004). Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the American Statistical Association, 99, 96-104.

[6] Efron, B. (2005). Local false discovery rates. Available at www-stat.stanford.edu/ ckirby/brad/papers/2005LocalFDR.pdf

[7] Efron, B. (2006). Size, power, and false discovery rates. The Annals of Statistics.

[8] Efron, B. (2007). Correlation and large-scale simultaneous significance testing. Journal of the American Statistical Association, 102, 93-103.

[9] Efron, B., Tibshirani, R., Storey, J., and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96, 1151-1160.

[10] Ge, Y., Dudoit, S., and Speed, T. (2003). Resampling-based multiple testing for microarray data analysis. Test, 12, 1-77.

[11] Gentleman, R., Carey, V., Huber, W., Irizarry, R., and Dudoit, S. (2005). Bioinformatics and computational biology solutions using R and Bioconductor. Springer-Verlag, New York.

[12] Gottardo, R., Raftery, A., Yeung, K., and Bumgarner, R. (2006). Bayesian robust inference for differential gene expression in microarrays with multiple samples. Biometrics, 62, 10-18.

[13] Hedenfalk, I., Duggen, D., Chen, Y., et al. (2001). Gene expression profiles in hereditary breast cancer. New England Journal of Medicine, 344, 539-548.

[14] Lockhart, D. J., Dong, H.l., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., and Brown, E. L. (1996). Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnology, 14, 1675-1680.

[15] Qiu, X., Brooks, A., Klebanov, L., and Yakovlev, A. (2005a). The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics, 6.

[16] Shumway, R. H., and Stoffer, D. (2005). Time series analysis and its applications. 2nd ed. Springer-Verlag, New York.

[17] van’t Wout, A., Lehrma, G., Mikheeva, S., O’Keeffe, G., Katze, M., Bumharner, R., Geiss, G., and Mullins, J. (2003). Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4+-T-Cell lines. Journal of Virology, 77, 1392-1402.
