5. Results and discussion

5.4 Simulation results for other imputations

In this section, we compare the performance of LLSimpute, SLLSimpute, and ILLSimpute, together with the James-Stein approach applied to each of these methods, on the SP.Alpha and GA.Env datasets.

ILLSimpute, however, presents a problem: in some settings it produces a serious estimation error, so that its NRMSE value becomes very large. The following figure gives one example, comparing LLSimpute, SLLSimpute, ILLSimpute, and the James-Stein approach for different percentages of missing entries on the SP.Alpha and GA.Env datasets.

The ILLSimpute curves in the figures above show some sharp spikes. In these settings ILLSimpute yields much worse estimates, while the other imputation methods remain stable. To compare the methods, we therefore delete the points at which ILLSimpute has a serious error.
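The text does not state how the points with serious ILLSimpute error were identified. As a minimal sketch, one could flag a point when its NRMSE exceeds the larger of its neighbouring values by a fixed factor. The function name `flag_spikes` and the factor 1.2 are assumptions made for illustration, not the rule used in this study; the input values are the ILLS rows of Table 8.

```python
def flag_spikes(values, factor=1.2):
    """Return indices whose value exceeds factor * max(adjacent values).

    Hypothetical rule for illustration only; the threshold is an assumption.
    """
    flagged = []
    for i, v in enumerate(values):
        # Collect the left and right neighbours that exist.
        neighbours = values[max(i - 1, 0):i] + values[i + 1:i + 2]
        if neighbours and v > factor * max(neighbours):
            flagged.append(i)
    return flagged

# ILLS NRMSEs from Table 8 (Alpha and Env), with their missing percentages.
p_alpha = [5, 7, 10, 11, 13, 15, 17, 20]
ills_alpha = [0.4049, 0.4580, 0.5147, 0.5274, 0.5634, 0.5785, 0.7972, 0.6232]
p_env = [5, 7, 9, 11, 13, 15, 17, 20]
ills_env = [0.6036, 0.6023, 0.6125, 0.9456, 0.6152, 0.6210, 0.6266, 0.6321]

spikes_alpha = [p_alpha[i] for i in flag_spikes(ills_alpha)]
spikes_env = [p_env[i] for i in flag_spikes(ills_env)]
print(spikes_alpha, spikes_env)  # [17] [11]
```

A neighbour-based rule is used here rather than a global threshold because the NRMSE curves rise steadily with the missing percentage, so a single cut-off would either miss spikes at low percentages or flag ordinary values at high ones.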

Fig. 6. Comparison of the NRMSEs of the four methods with respect to the number of genes (k) used for estimating missing values on the SP.Alpha and GA.Env datasets.

As shown in Figure 6, SLLSimpute and ILLSimpute have smaller NRMSEs than LLSimpute, indicating that they outperform LLSimpute overall. In addition, the James-Stein estimator-based methods effectively improve all three imputation methods for small k.
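For readers who want to reproduce the comparison metric, the following is a minimal sketch of an NRMSE computation. It assumes the definition commonly used in the local least squares imputation literature, namely the root mean squared error of the imputed entries divided by the standard deviation of the true entries; the exact normalisation used in this study is defined elsewhere in the paper and may differ.

```python
import numpy as np

def nrmse(y_true, y_imputed):
    """Root mean squared error normalised by the std of the true values.

    Assumed definition: sqrt(mean((imputed - true)^2)) / std(true).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_imputed = np.asarray(y_imputed, dtype=float)
    rmse = np.sqrt(np.mean((y_imputed - y_true) ** 2))
    return rmse / np.std(y_true)

true_vals = np.array([0.2, -0.5, 1.1, 0.7, -0.3])
perfect = nrmse(true_vals, true_vals)                           # exact recovery
baseline = nrmse(true_vals, np.full(5, true_vals.mean()))       # mean imputation
print(perfect, baseline)
```

Under this definition, filling every missing entry with the mean of the true values gives an NRMSE of 1, so ILLS entries such as 1.5308 (Table 10) or 3.7703 (Table 12) would correspond to estimates worse than that baseline.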

Fig. 7. Comparison of the NRMSEs against the percentage of missing entries for two methods on the Alpha and Env datasets.

In Figure 7, ILLSimpute performs better than LLSimpute and SLLSimpute. In addition, the James-Stein estimator-based method effectively improves all three imputation methods as the percentage of missing entries increases.

Fig. 8. Comparison of the NRMSEs of the four methods with respect to noise levels on the Alpha and Env datasets.

In Figure 8, all three imputation methods perform worse as the standard deviation of the artificial noise increases. However, the James-Stein based versions of these three methods perform better overall. We conclude that the James-Stein based method is less sensitive to the noise level.

6. Conclusion

Efficient imputation of missing values is needed for the use of microarray data, since most downstream analyses require a complete dataset. Exploring accurate and efficient methods for estimating missing values has therefore become an important issue. In this study, we propose a shrinkage estimator method combined with a regression model to estimate missing values in microarray data. Our method takes advantage of the correlation structure in microarray data and selects genes similar to the target gene by their Pearson correlation coefficients. We then apply the least squares principle and use the James-Stein estimator to adjust the least squares coefficients of the regression model used to estimate the missing values. A simulation study demonstrated that the shrinkage estimator-based method provides better estimation accuracy than LLSimpute and SLLSimpute for various types of datasets when the k-value is less than 50. Since the proposed method can be applied to any regression model-based method and provides better missing value estimation, it is a competitive alternative to the conventional least squares method.
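The pipeline summarised above (select similar genes by Pearson correlation, fit a least squares regression, shrink the coefficients) can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this study: the function `impute_entry` is hypothetical, genes are ranked by absolute correlation, the data are assumed complete apart from the entry being estimated, and the shrinkage uses a textbook positive-part James-Stein-type factor toward zero, which may differ from the exact estimator defined earlier in the paper.

```python
import numpy as np

def impute_entry(data, target, col, k, shrink=True):
    """Estimate data[target, col] from the k genes most correlated with the target.

    Assumes data (genes x experiments) is complete apart from the estimated
    entry, and that 3 <= k < number of observed columns.
    """
    n_genes, n_exps = data.shape
    obs = [c for c in range(n_exps) if c != col]     # columns where the target is known
    others = np.delete(np.arange(n_genes), target)   # candidate genes

    # Rank candidates by absolute Pearson correlation with the target gene.
    y = data[target, obs]
    r = np.array([np.corrcoef(y, data[g, obs])[0, 1] for g in others])
    nearest = others[np.argsort(-np.abs(r))[:k]]

    # Least squares: target values as a linear combination of the similar genes.
    X = data[nearest][:, obs].T                      # shape (n_obs, k)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    if shrink:
        n, p = X.shape
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - p)             # residual variance estimate
        fit = beta @ X.T @ X @ beta                  # squared norm of the fitted values
        factor = max(0.0, 1.0 - (p - 2) * sigma2 / fit)  # assumed shrinkage factor
        beta = factor * beta

    # Apply the coefficients to the similar genes' values in the missing column.
    return float(data[nearest, col] @ beta)

# Synthetic data: 30 genes sharing a common profile over 12 experiments.
rng = np.random.default_rng(0)
data = rng.normal(size=(1, 12)) + 0.3 * rng.normal(size=(30, 12))
est_ls = impute_entry(data, target=0, col=5, k=4, shrink=False)
est_js = impute_entry(data, target=0, col=5, k=4, shrink=True)
print(est_ls, est_js)
```

The sketch covers only the overdetermined case, where the residual variance can be estimated from the fit. When k exceeds the number of observed experiments, as is common in local least squares imputation, a different variance estimate would be needed.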

Table 1 Improvement ratio against specific percentage (p %) of missing entries

Table 2 The NRMSEs against specific percentage (p %) of missing entries

Alpha

p        10%      11%      13%      15%      17%      19%      20%
LLS      0.5652   0.5819   0.6192   0.6526   0.6830   0.7124   0.7250
LLS-J    0.5608   0.5773   0.6137   0.6461   0.6755   0.7036   0.7151

Elu

p        10%      11%      13%      15%      17%      19%      20%
LLS      0.4739   0.4890   0.5187   0.5461   0.5676   0.5900   0.6001
LLS-J    0.4731   0.4882   0.5170   0.5434   0.5644   0.5861   0.5955

Env

p        1%       2%       3%       4%       5%       10%      15%
LLS      0.6333   0.6359   0.6355   0.6375   0.6390   0.6515   0.6611
LLS-J    0.6298   0.6326   0.6319   0.6339   0.6356   0.6478   0.6570

Environ

p        1%       2%       3%       5%       6%       7%       8%
LLS      0.3783   0.4087   0.3964   0.4344   0.4371   0.4495   0.4776
LLS-J    0.3729   0.4029   0.3905   0.4279   0.4303   0.4425   0.4701

Table 3 Improvement ratio against different number of similar genes (k)

k          20        30        50        80        100
Alpha      0.0382    0.0185    0.0071    0.0017    0
Elu        0.0279    0.0124    0.0011    -0.0046   -0.0068
Env        0.0299    0.0152    0.0059    0.0009    -0.001
Environ    0.0059    0.0128    0.0188    0.0201    0.0153

Table 4 The NRMSEs against different number of similar genes (k)

Alpha

k        20       30       50       80       100
LLS      0.8629   0.6286   0.5637   0.5412   0.535
LLS-J    0.8299   0.6170   0.5597   0.5403   0.535

Elu

k        20       30       50       80       100
LLS      0.5834   0.5086   0.4744   0.4608   0.4556
LLS-J    0.5671   0.5023   0.4739   0.4629   0.4587

Env

k        20       30       50       80       100
LLS      0.7888   0.6951   0.6489   0.6335   0.6283
LLS-J    0.7652   0.6845   0.6451   0.6329   0.6289

Environ

k        10       20       30       50       100
LLS      0.3719   0.4444   0.6602   0.7305   0.4312
LLS-J    0.3697   0.4387   0.6478   0.7158   0.4246
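The improvement ratios in Table 3 appear to be the relative reduction in NRMSE obtained by the James-Stein adjustment, that is, (NRMSE of LLS minus NRMSE of LLS-J) divided by the NRMSE of LLS. This formula is an inference from the tabulated numbers rather than a definition quoted from the text. The short check below applies it to the Alpha rows of Table 4; the helper name `improvement_ratio` is illustrative.

```python
def improvement_ratio(nrmse_base, nrmse_js):
    """Relative NRMSE reduction of the James-Stein variant over the base method."""
    return (nrmse_base - nrmse_js) / nrmse_base

# Alpha rows of Table 4 (k = 20, 30, 50, 80, 100).
lls = [0.8629, 0.6286, 0.5637, 0.5412, 0.535]
lls_j = [0.8299, 0.6170, 0.5597, 0.5403, 0.535]

ratios = [round(improvement_ratio(a, b), 4) for a, b in zip(lls, lls_j)]
print(ratios)  # [0.0382, 0.0185, 0.0071, 0.0017, 0.0]
```

The printed values agree with the Alpha row of Table 3. A positive ratio means the James-Stein version has the lower NRMSE, and a negative ratio means the unadjusted method did better.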

Table 5 Improvement ratio against artificial noise with different standard deviations (σ)

σ        0.01      0.05      0.10      0.15      0.20      0.25
Alpha    0.0034    0.0066    0.0088    0.0099    0.0105    0.0118
Elu      -0.0056   0.0025    0.0075    0.0097    0.0112    0.0127
Env      0.0055    0.0051    0.0062    0.0067    0.0080    0.0088

Table 6 The NRMSEs against artificial noise with different standard deviations (σ)

Alpha

σ        0.01     0.05     0.10     0.15     0.20     0.25
LLS      0.4475   0.6053   0.7293   0.8056   0.8638   0.9094
LLS-J    0.4460   0.6013   0.7229   0.7976   0.8547   0.8987

Elu

σ        0.01     0.05     0.10     0.15     0.20     0.25
LLS      0.3761   0.5125   0.6391   0.7189   0.7783   0.8255
LLS-J    0.3782   0.5112   0.6343   0.7119   0.7696   0.8150

Env

σ        0.01     0.05     0.10     0.15     0.20     0.25
LLS      0.6394   0.6448   0.6661   0.6854   0.7112   0.7374
LLS-J    0.6359   0.6415   0.6620   0.6808   0.7055   0.7309

Table 7 Improvement ratio against specific percentage (p %) for three imputations.

Alpha

p       5%        7%        10%       11%       13%       15%       17%       20%
LLS     0.0043    0.0058    0.0076    0.0079    0.0087    0.0098    0.0111    0.0132
SLLS    0.0019    0.0041    0.0057    0.0058    0.0060    0.0066    0.0070    0.0083
ILLS    -0.0044   -0.0004   0.0006    0.0009    0.0030    0.0029    0.0187    0.0050

Elu

p       5%        7%        10%       11%       13%       15%       17%       20%
LLS     0.0052    0.0055    0.0057    0.0058    0.0056    0.0061    0.0059    0.0065
SLLS    0.0055    0.0064    0.0061    0.0060    0.0063    0.0077    0.0068    0.0073
ILLS    -0.0017   -0.0017   0.0008    0.0385    0.0005    0.0018    0.0024    0.0035

Table 8 The NRMSEs against specific percentage (p %) for three imputations.

Alpha

p         5%       7%       10%      11%      13%      15%      17%      20%
LLS       0.4369   0.4966   0.5653   0.5842   0.6191   0.6506   0.6817   0.7270
LLS-J     0.4350   0.4937   0.5610   0.5796   0.6137   0.6442   0.6741   0.7174
SLLS      0.4291   0.4844   0.5410   0.5559   0.5832   0.6065   0.6268   0.6501
SLLS-J    0.4283   0.4824   0.5379   0.5527   0.5797   0.6025   0.6224   0.6447
ILLS      0.4049   0.4580   0.5147   0.5274   0.5634   0.5785   0.7972   0.6232
ILLS-J    0.4067   0.4582   0.5144   0.5269   0.5617   0.5768   0.7823   0.6201

Env

p         5%       7%       9%       11%      13%      15%      17%      20%
LLS       0.6392   0.6422   0.6472   0.6521   0.6565   0.6602   0.6658   0.6728
LLS-J     0.6359   0.6387   0.6435   0.6483   0.6528   0.6562   0.6619   0.6684
SLLS      0.6379   0.6397   0.6442   0.6474   0.6507   0.6532   0.6575   0.6608
SLLS-J    0.6344   0.6356   0.6403   0.6435   0.6466   0.6482   0.6530   0.6560
ILLS      0.6036   0.6023   0.6125   0.9456   0.6152   0.6210   0.6266   0.6321
ILLS-J    0.6046   0.6033   0.6120   0.9092   0.6149   0.6199   0.6251   0.6299

Table 9 Improvement ratio against different number (k) for three imputations.

Alpha

k       20        30        50        70        100
LLS     0.0380    0.0186    0.0078    0.0031    0
SLLS    0.0393    0.0154    0.0055    0.0013    0
ILLS    0.0117    0.0014    0.0117    0.0010    0.0111

Elu

k       20        30        50        70        100
LLS     0.0302    0.0156    0.0057    0.0017    0
SLLS    0.0298    0.0161    0.0061    0.0027    0
ILLS    0.0097    0.0458    0.0007    -0.0002   0.0016

Table 10 The NRMSEs against different number of similar genes selected (k)

Alpha

k         20       30       50       70       100
LLS       0.8624   0.6295   0.5653   0.5454   0.537
LLS-J     0.8296   0.6178   0.5609   0.5437   0.537
SLLS      0.8285   0.5976   0.5409   0.5224   0.515
SLLS-J    0.7959   0.5884   0.5379   0.5217   0.515
ILLS      0.6320   0.5148   0.6320   0.5155   0.632
ILLS-J    0.6246   0.5141   0.6246   0.5150   0.625

Env

k         20       30       50       70       100
LLS       0.7935   0.6937   0.6492   0.6334   0.627
LLS-J     0.7695   0.6829   0.6455   0.6323   0.627
SLLS      0.7853   0.6894   0.6445   0.6300   0.623
SLLS-J    0.7619   0.6783   0.6406   0.6283   0.623
ILLS      0.6816   1.5308   0.6120   0.6074   0.615
ILLS-J    0.6750   1.4607   0.6116   0.6075   0.614

Table 11 Improvement ratio against artificial noise with different standard deviations (σ) for three imputations.

Table 12 The NRMSEs against artificial noise with different standard deviations (σ)

Alpha

σ         0.01     0.05     0.1      0.15     0.2      0.25
LLS       0.5682   0.6610   0.7610   0.8288   0.8846   0.9335
LLS-J     0.5641   0.6557   0.7537   0.8200   0.8742   0.9213
SLLS      0.5449   0.6397   0.7422   0.8086   0.8641   0.9115
SLLS-J    0.5413   0.6332   0.7342   0.7998   0.8541   0.9001
ILLS      0.5166   0.6126   3.7703   0.7499   0.8075   0.8462
ILLS-J    0.5159   0.6060   3.5619   0.7451   0.8017   0.8405

Env

σ         0.01     0.05     0.1      0.15     0.2      0.25
LLS       0.6489   0.6556   0.6717   0.6955   0.7186   0.7466
LLS-J     0.6452   0.6517   0.6675   0.6907   0.7130   0.7399
SLLS      0.6451   0.6523   0.6690   0.6923   0.7151   0.7439
SLLS-J    0.6413   0.6481   0.6644   0.6871   0.7093   0.7369
ILLS      0.6172   0.6847   0.6296   0.6447   0.6645   0.6877
ILLS-J    0.6161   0.6781   0.6295   0.6452   0.6651   0.6879

