5. Results and discussion
5.4 Simulation results for other imputations
In this section, we compare the performance of LLSimpute, SLLSimpute, ILLSimpute, and the James-Stein approach applied to each of these methods on the SP.Alpha and GA.Env datasets.
ILLSimpute, however, presents one problem: in some settings it produces a serious estimation error, so that its NRMSE value becomes very large. The following figure shows one such example. It compares LLSimpute, SLLSimpute, ILLSimpute, and the James-Stein approach at different missing percentages on the SP.Alpha and GA.Env datasets.
The ILLSimpute curves in the above figures show several sharp spikes. ILLSimpute gives much worse estimates in these settings, whereas the other imputation methods remain stable under the same conditions. To compare the methods, we therefore delete the points at which ILLSimpute has a serious error.
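The deletion of spike points described above can be sketched as follows. The text does not state the rule used to decide when an error is serious, so the median-based cutoff, the factor of 1.3, and the function name drop_spikes are assumptions made only for illustration. The NRMSE values are the ILLSimpute entries for the Env dataset in Table 8.

```python
from statistics import median


def drop_spikes(x_values, nrmse_values, factor=1.3):
    """Keep the (x, nrmse) pairs whose NRMSE is at most factor * median.

    The cutoff rule and the default factor are hypothetical; the paper does
    not specify how a "serious error" is detected.
    """
    cutoff = factor * median(nrmse_values)
    return [(x, v) for x, v in zip(x_values, nrmse_values) if v <= cutoff]


# ILLSimpute NRMSEs on the Env dataset (Table 8), indexed by missing percentage.
p = [5, 7, 9, 11, 13, 15, 17, 20]
ills_env = [0.6036, 0.6023, 0.6125, 0.9456, 0.6152, 0.6210, 0.6266, 0.6321]

kept = drop_spikes(p, ills_env)
print([x for x, _ in kept])  # the point at p = 11 is removed
```

With these values the median is about 0.618, so only the entry 0.9456 at p = 11% exceeds the cutoff. A different cutoff rule could flag different points.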
Fig. 6. Comparison of the NRMSEs of the four methods with respect to the number of genes used for estimating missing values on the SP.Alpha and GA.Env datasets.
As shown in Figure 6, SLLSimpute and ILLSimpute have smaller NRMSEs than LLSimpute, which indicates that both perform better than LLSimpute overall. In addition, the James-Stein estimator based methods improve all three imputation methods for small k.
Fig. 7. Comparison of the NRMSEs against the percentage of missing entries for two methods on the Alpha and Env datasets.
In Figure 7, ILLSimpute performs better than LLSimpute and SLLSimpute. In addition, the James-Stein estimator based method improves these three imputation methods as the missing percentage increases.
Fig. 8. Comparison of the NRMSEs of the four methods with respect to noise levels on the Alpha and Env datasets.
In Figure 8, the three imputation methods perform worse as the standard deviation of the artificial noise increases. However, the James-Stein based versions of these three methods perform better overall. We conclude that the James-Stein based method is less sensitive to the noise level.
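The role of the noise level can be illustrated with a small sketch. It assumes that the NRMSE is the root mean squared error divided by the standard deviation of the true values, and that the artificial noise is zero-mean Gaussian; neither definition is restated in this section. No imputation is run here: a noisy copy of synthetic values stands in for the estimates, and the names nrmse and add_noise are hypothetical.

```python
import math
import random
from statistics import pstdev


def nrmse(estimated, true):
    """RMSE of the estimates divided by the standard deviation of the true values."""
    mse = sum((e - t) ** 2 for e, t in zip(estimated, true)) / len(true)
    return math.sqrt(mse) / pstdev(true)


def add_noise(values, sigma, rng):
    """Add independent zero-mean Gaussian noise with standard deviation sigma."""
    return [v + rng.gauss(0.0, sigma) for v in values]


rng = random.Random(0)
# Synthetic stand-in for expression values (not data from the paper).
true_values = [rng.uniform(-1.0, 1.0) for _ in range(1000)]

results = []
for sigma in (0.01, 0.05, 0.10, 0.15, 0.20, 0.25):
    noisy = add_noise(true_values, sigma, rng)
    results.append(nrmse(noisy, true_values))
    print(sigma, round(results[-1], 4))
```

In this toy setting the NRMSE grows roughly in proportion to the noise standard deviation, which matches the direction of the trend reported for Figure 8.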
6. Conclusion
Efficient imputation of missing values is needed when using microarray data, since most downstream analyses require a complete dataset. Exploring accurate and efficient methods for estimating missing values has therefore become an important issue. In this study, we propose a shrinkage estimator method combined with a regression model to estimate missing values in microarray data. Our method takes advantage of the correlation structure present in microarray data and selects similar genes for the target gene by their Pearson correlation coefficients. We then apply the least squares principle and use the James-Stein estimator to adjust the least squares coefficient estimates of the regression model that is used to estimate the missing values. A simulation study demonstrated that the shrinkage estimator based method provides better estimation accuracy than LLSimpute and SLLSimpute on various types of datasets when the k-value is less than 50. Since the proposed method can be applied to any regression model based method and can provide better missing value estimation, it is a competitive alternative to the conventional least squares method.
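The steps summarized above (selecting similar genes by Pearson correlation, fitting by least squares, and shrinking the coefficients) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in the paper: the function name impute_one is hypothetical, only one entry is treated as missing, and the shrinkage uses a textbook positive-part James-Stein factor because the exact factor is not restated in this section. The data are synthetic.

```python
import numpy as np


def impute_one(data, gene, col, k, shrink=True):
    """Estimate data[gene, col] from the k genes most correlated with `gene`.

    data is a genes x samples array; data[gene, col] is treated as missing
    and every other entry is assumed to be observed.
    """
    obs = [j for j in range(data.shape[1]) if j != col]
    target = data[gene, obs]
    others = [g for g in range(data.shape[0]) if g != gene]

    # Rank candidate genes by absolute Pearson correlation with the target gene.
    corr = [abs(np.corrcoef(data[g, obs], target)[0, 1]) for g in others]
    nearest = [others[i] for i in np.argsort(corr)[::-1][:k]]

    # Least squares fit of the target gene on the k similar genes.
    X = data[nearest][:, obs].T  # samples x k design matrix
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)

    dof = len(obs) - k
    if shrink and k > 2 and dof > 0:
        fitted = X @ beta
        sigma2 = np.sum((target - fitted) ** 2) / dof
        # Positive-part James-Stein factor (assumed textbook form).
        factor = max(0.0, 1.0 - (k - 2) * sigma2 / np.sum(fitted ** 2))
        beta = factor * beta

    return float(data[nearest, col] @ beta)


# Synthetic genes that share one underlying profile plus noise.
rng = np.random.default_rng(0)
base = rng.normal(size=30)
data = np.array([a * base + rng.normal(scale=0.3, size=30)
                 for a in rng.uniform(0.5, 2.0, size=50)])

print("true value      :", data[0, 3])
print("least squares   :", impute_one(data, 0, 3, k=5, shrink=False))
print("with shrinkage  :", impute_one(data, 0, 3, k=5))
```

Because the shrinkage factor lies between 0 and 1, the shrunken estimate is the least squares estimate scaled toward zero. When the number of observed samples does not exceed k, the sketch skips the shrinkage step, since the residual variance cannot be estimated in that case.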
Table 1 Improvement ratio against specific percentage (p %) of missing entries
Table 2 The NRMSEs against specific percentage (p %) of missing entries
Alpha
p       10%     11%     13%     15%     17%     19%     20%
LLS     0.5652  0.5819  0.6192  0.6526  0.6830  0.7124  0.7250
LLS-J   0.5608  0.5773  0.6137  0.6461  0.6755  0.7036  0.7151
Elu
p       10%     11%     13%     15%     17%     19%     20%
LLS     0.4739  0.4890  0.5187  0.5461  0.5676  0.5900  0.6001
LLS-J   0.4731  0.4882  0.5170  0.5434  0.5644  0.5861  0.5955
Env
p       1%      2%      3%      4%      5%      10%     15%
LLS     0.6333  0.6359  0.6355  0.6375  0.6390  0.6515  0.6611
LLS-J   0.6298  0.6326  0.6319  0.6339  0.6356  0.6478  0.6570
Environ
p       1%      2%      3%      5%      6%      7%      8%
LLS     0.3783  0.4087  0.3964  0.4344  0.4371  0.4495  0.4776
LLS-J   0.3729  0.4029  0.3905  0.4279  0.4303  0.4425  0.4701
Table 3 Improvement ratio against different number of similar genes (k)
k        20       30       50       80       100
Alpha    0.0382   0.0185   0.0071   0.0017   0
Elu      0.0279   0.0124   0.0011   -0.0046  -0.0068
Env      0.0299   0.0152   0.0059   0.0009   -0.001
Environ  0.0059   0.0128   0.0188   0.0201   0.0153
Note: the Environ row corresponds to k = 10, 20, 30, 50, 100, the values of k used for Environ in Table 4.
Table 4 The NRMSEs against different number of similar genes (k)
Alpha
k       20      30      50      80      100
LLS     0.8629  0.6286  0.5637  0.5412  0.535
LLS-J   0.8299  0.6170  0.5597  0.5403  0.535
Elu
k       20      30      50      80      100
LLS     0.5834  0.5086  0.4744  0.4608  0.4556
LLS-J   0.5671  0.5023  0.4739  0.4629  0.4587
Env
k       20      30      50      80      100
LLS     0.7888  0.6951  0.6489  0.6335  0.6283
LLS-J   0.7652  0.6845  0.6451  0.6329  0.6289
Environ
k       10      20      30      50      100
LLS     0.3719  0.4444  0.6602  0.7305  0.4312
LLS-J   0.3697  0.4387  0.6478  0.7158  0.4246
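The relation between Table 3 and Table 4 can be checked with a short calculation. The sketch assumes that the improvement ratio is the relative reduction in NRMSE, (LLS - LLS-J) / LLS; this definition is not restated in this section, but it reproduces the Alpha row of Table 3 from the Alpha rows of Table 4. The function name improvement_ratio is hypothetical.

```python
def improvement_ratio(nrmse_base, nrmse_js):
    """Relative NRMSE reduction of the James-Stein variant over the base method."""
    return (nrmse_base - nrmse_js) / nrmse_base


# Alpha rows of Table 4 for k = 20, 30, 50, 80, 100.
lls = [0.8629, 0.6286, 0.5637, 0.5412, 0.535]
lls_j = [0.8299, 0.6170, 0.5597, 0.5403, 0.535]

ratios = [round(improvement_ratio(a, b), 4) for a, b in zip(lls, lls_j)]
print(ratios)  # [0.0382, 0.0185, 0.0071, 0.0017, 0.0]
```

A positive ratio means that the James-Stein variant has the smaller NRMSE, and a negative ratio means that the base method does.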
Table 5 Improvement ratio against artificial noise with different standard deviations (σ)
σ       0.01     0.05     0.10     0.15     0.20     0.25
Alpha   0.0034   0.0066   0.0088   0.0099   0.0105   0.0118
Elu     -0.0056  0.0025   0.0075   0.0097   0.0112   0.0127
Env     0.0055   0.0051   0.0062   0.0067   0.0080   0.0088
Table 6 The NRMSEs against artificial noise with different standard deviations (σ)
Alpha
σ       0.01    0.05    0.10    0.15    0.20    0.25
LLS     0.4475  0.6053  0.7293  0.8056  0.8638  0.9094
LLS-J   0.4460  0.6013  0.7229  0.7976  0.8547  0.8987
Elu
σ       0.01    0.05    0.10    0.15    0.20    0.25
LLS     0.3761  0.5125  0.6391  0.7189  0.7783  0.8255
LLS-J   0.3782  0.5112  0.6343  0.7119  0.7696  0.8150
Env
σ       0.01    0.05    0.10    0.15    0.20    0.25
LLS     0.6394  0.6448  0.6661  0.6854  0.7112  0.7374
LLS-J   0.6359  0.6415  0.6620  0.6808  0.7055  0.7309
Table 7 Improvement ratio against specific percentage (p %) for three imputations.
p       5%       7%       10%      11%      13%      15%      17%      20%
Alpha
LLS     0.0043   0.0058   0.0076   0.0079   0.0087   0.0098   0.0111   0.0132
SLLS    0.0019   0.0041   0.0057   0.0058   0.0060   0.0066   0.0070   0.0083
ILLS    -0.0044  -0.0004  0.0006   0.0009   0.0030   0.0029   0.0187   0.0050
Elu
LLS     0.0052   0.0055   0.0057   0.0058   0.0056   0.0061   0.0059   0.0065
SLLS    0.0055   0.0064   0.0061   0.0060   0.0063   0.0077   0.0068   0.0073
ILLS    -0.0017  -0.0017  0.0008   0.0385   0.0005   0.0018   0.0024   0.0035
Table 8 The NRMSEs against specific percentage (p %) for three imputations.
Alpha
p       5%      7%      10%     11%     13%     15%     17%     20%
LLS     0.4369  0.4966  0.5653  0.5842  0.6191  0.6506  0.6817  0.7270
LLS-J   0.4350  0.4937  0.5610  0.5796  0.6137  0.6442  0.6741  0.7174
SLLS    0.4291  0.4844  0.5410  0.5559  0.5832  0.6065  0.6268  0.6501
SLLS-J  0.4283  0.4824  0.5379  0.5527  0.5797  0.6025  0.6224  0.6447
ILLS    0.4049  0.4580  0.5147  0.5274  0.5634  0.5785  0.7972  0.6232
ILLS-J  0.4067  0.4582  0.5144  0.5269  0.5617  0.5768  0.7823  0.6201
Env
p       5%      7%      9%      11%     13%     15%     17%     20%
LLS     0.6392  0.6422  0.6472  0.6521  0.6565  0.6602  0.6658  0.6728
LLS-J   0.6359  0.6387  0.6435  0.6483  0.6528  0.6562  0.6619  0.6684
SLLS    0.6379  0.6397  0.6442  0.6474  0.6507  0.6532  0.6575  0.6608
SLLS-J  0.6344  0.6356  0.6403  0.6435  0.6466  0.6482  0.6530  0.6560
ILLS    0.6036  0.6023  0.6125  0.9456  0.6152  0.6210  0.6266  0.6321
ILLS-J  0.6046  0.6033  0.6120  0.9092  0.6149  0.6199  0.6251  0.6299
Table 9 Improvement ratio against different number (k) for three imputations.
k       20       30       50       70       100
Alpha
LLS     0.0380   0.0186   0.0078   0.0031   0
SLLS    0.0393   0.0154   0.0055   0.0013   0
ILLS    0.0117   0.0014   0.0117   0.0010   0.0111
Elu
LLS     0.0302   0.0156   0.0057   0.0017   0
SLLS    0.0298   0.0161   0.0061   0.0027   0
ILLS    0.0097   0.0458   0.0007   -0.0002  0.0016
Table 10 The NRMSEs against different number of similar genes selected (k)
Alpha
k       20      30      50      70      100
LLS     0.8624  0.6295  0.5653  0.5454  0.537
LLS-J   0.8296  0.6178  0.5609  0.5437  0.537
SLLS    0.8285  0.5976  0.5409  0.5224  0.515
SLLS-J  0.7959  0.5884  0.5379  0.5217  0.515
ILLS    0.6320  0.5148  0.6320  0.5155  0.632
ILLS-J  0.6246  0.5141  0.6246  0.5150  0.625
Env
k       20      30      50      70      100
LLS     0.7935  0.6937  0.6492  0.6334  0.627
LLS-J   0.7695  0.6829  0.6455  0.6323  0.627
SLLS    0.7853  0.6894  0.6445  0.6300  0.623
SLLS-J  0.7619  0.6783  0.6406  0.6283  0.623
ILLS    0.6816  1.5308  0.6120  0.6074  0.615
ILLS-J  0.6750  1.4607  0.6116  0.6075  0.614
Table 11 Improvement ratio against artificial noise with different standard deviations (σ) for three imputations.
Table 12 The NRMSEs against artificial noise with different standard deviations (σ)
Alpha
σ       0.01    0.05    0.1     0.15    0.2     0.25
LLS     0.5682  0.6610  0.7610  0.8288  0.8846  0.9335
LLS-J   0.5641  0.6557  0.7537  0.8200  0.8742  0.9213
SLLS    0.5449  0.6397  0.7422  0.8086  0.8641  0.9115
SLLS-J  0.5413  0.6332  0.7342  0.7998  0.8541  0.9001
ILLS    0.5166  0.6126  3.7703  0.7499  0.8075  0.8462
ILLS-J  0.5159  0.6060  3.5619  0.7451  0.8017  0.8405
Env
σ       0.01    0.05    0.1     0.15    0.2     0.25
LLS     0.6489  0.6556  0.6717  0.6955  0.7186  0.7466
LLS-J   0.6452  0.6517  0.6675  0.6907  0.7130  0.7399
SLLS    0.6451  0.6523  0.6690  0.6923  0.7151  0.7439
SLLS-J  0.6413  0.6481  0.6644  0.6871  0.7093  0.7369
ILLS    0.6172  0.6847  0.6296  0.6447  0.6645  0.6877
ILLS-J  0.6161  0.6781  0.6295  0.6452  0.6651  0.6879