The proposed simulation method for gene expression data is based on an empirical distribution obtained from 1279 HG-U133 arrays. Genes displaying only one mode are recognized as either expressed in all tissues or unexpressed in all tissues.
These genes do not provide information for the purpose of differentiating between expressed and unexpressed. To distinguish abnormal tissues from normal tissues, we can just focus on 5005 genes which have multiple modes. To simulate a set of arrays, our simulation method recommends that first identifying each of 5005 genes as expressed or unexpressed, and then using this expression status to acquire these 5005 genes’ simulated expression values. The rest of the 22283-5005 genes will be simulated by one-mode empirical distributions.
The empirical distribution follows similar pattern with intensity distribution obtained from the spike-in dataset. It implies that the empirical distribution imitates some characteristics of real gene intensity distribution successfully. The derived simulated data can be used to evaluate various differential expression methods objectively. With three replicate arrays, the performances of compared six differential expression methods are all of inferior quality when using simulated data than using real spike-in data. This may be explained by the reason that the spike-in dataset contains technical replicates only, but our simulated data also contain variations from biological replicates. The ability of detecting differentially expressed genes with these six approaches except for SAM is of the same rank. Using five and ten replicate arrays, there will be some changes of the performance of these six methods. The fold-change and EBarrays-based analyses are superior in the low replication situation.
But when the replicate arrays increased, the performance of the two methods is not as excellent as it appears formerly. Besides, the capability of distinguishing differentially
expressed genes with SAM becomes better while augmenting the replicate arrays.
This simulation method has some limitation. First of all, we preprocessed all the row data of 1279 arrays from normal tissues using justRMA. Consequently, the empirical intensity distributions are generated through RMA-preprocessed data, and there is no way to evaluate varied preprocessing methods. Besides, simulation of 5005 genes has some problems. This simulation method considered the empirical distribution which had multiple modes as a complete distribution and ignored the fact that the expressed part and unexpressed part might come from the different distributions. That is, when simulating gene expressions from these multiple-mode genes, the expressed expression will always larger than the unexpressed expression.
Therefore, as the replicate arrays increase to a vast amount, the ability of differential expression methods will be too sensitive to evaluate.
With this simulation method, three replicate arrays for each group can be obtained within two minutes, and it is convenient to simulate an enormous amount of imitative data. Arising from the expensive price of Affymetrix chips, microarray experiment size was restricted and it would limit the power of microarray analysis methods. It is believed that the proposed simulation method of gene expression data can benefit the development of the microarray analysis tools.
Reference
Affymetrix. (2002) Statistical algorithms description document.
http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf Affymetrix. (2003) GeneChip Human Genome Array
http://www.affymetrix.com/support/technical/datasheets/human_datasheet.pdf Brazma,A., Robinson,A.,Cameron,G. and Ashburner,A. (2000) One-stop shop for
microarray data. Nature, 403, 699-700.
Chang,K.M, Harbron,C. and South,M.C. (2007) The vignette of RefPlus package in Bioconductor.
http://www.bioconductor.org/packages/2.4/bioc/vignettes/RefPlus/inst/doc/RefPl us.pdf
Chang,K.M., Harbron,C., South,M.C. (2006) An Exploration of Extensions to the RMA Algorithm.
http://bioconductor.org/packages/2.4/bioc/vignettes/RefPlus/inst/doc/Extensions _to_RMA_Algorithm.pdf
Choe,S.E., Boutros,M., Michelson,A.M., Church,G.M. and Halfon,M.S. (2005) Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biology, 6:R16
Ivliev,A.E., Hoen,P.A.C., Villerius,M.P., Dunnen,J.T. and Bradndt,B.W. (2008) Microarray retriever: a web-based tool for searching and large scale retrieval of public microarray data. Nucleic Acids Research, 36, 327–331.
MacDonald,J. and Bolstad,B. (2009) The manuals of affy package in Bioconductor.
http://www.bioconductor.org/packages/2.4/bioc/manuals/affy/man/affy.pdf Mcgee,M. and Chen,Z. (2006) New Spiked-In Probe Sets for the Affymetrix
HGU-133A Latin Square Experiment. COBRA Preprint Series, Article 5.
http://biostats.bepress.com/cobra/ps/art5
Smyth,G.K. (2004) Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3, Article 3.
Tusher, V. G., Tibshirani, R. and Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response, Proceedings of the National Academy of Sciences of the United States of America, 98, 5116–5121.
Wilson,C., Pepper,S.D. and Miller,C.J. (2009) The vignette of simpleaffy package in Bioconductor.
http://www.bioconductor.org/packages/2.4/bioc/vignettes/simpleaffy/inst/doc/Q CandSimpleaffy.pdf
Zilliox,M.J. and Irizarry,R.A. (2007) A gene expression bar code for microarray data.
Nature Methods, 4, 911-913.
Table 2.1. Affymetrix human genome U133 dataset contains 14 spike-in gene groups in each of 14 experimental groups. This table shows the
Table 2.2. Probe IDs for the seventeen additional probe sets annotated in the HG-U133Atag CDF.
Spike Ins Non Spike Ins AFFX-r2-TagA_at AFFX-r2-TagO-3_at AFFX-r2-TagB_at AFFX-r2-TagO-5_at AFFX-r2-TagC_at AFFX-r2-TagIN-3_at AFFX-r2-TagD_at AFFX-r2-TagIN-5_at AFFX-r2-TagE_at AFFX-r2-TagQ-3_at AFFX-r2-TagF_at AFFX-r2-TagQ-5_at AFFX-r2-TagG_at AFFX-r2-TagJ-3_at AFFX-r2-TagH_at AFFX-r2-TagJ-5_at
AFFX-r2-TagIN-M_at
Table 3.1. Summary of the amount of delete data in QC step.
GEO AE Total
Before QC 1886 559 2445
Scale factor -359 -112 -471
Averages background -56 0 -56
3’/5’ ratios -302 -98 -400
Percent present calls -4 -13 -17
After QC 1165 336 1501
Remove different type 943 336 1279
Table 3.2. The frequency distribution among tissue types of the reference training set.
beta cell islets 1 Normal cervix 8 T cells resting 23
medulla oblongata 1 prostate 8 cerebellum 24
Normal Breast 1 TERV (cell line) 8 Normal Kidney 25
Normal Colon 1 smooth muscle 9 uterus 25
Normal Corpus 1 primary fibroblast cell
line 9 esophageal
epithelium 26
Normal Stomach 1 PBSC CD34 selected
cells 10 Frontal Cortex 26
Normal Thalamus 1 Baseline macrophages 11
blood (cell type :
neuroblastoma cells 12 skeletal muscle 33
salivary gland 2
Stratagene Universal Human Reference
RNA
12 Normal Caudate
Nucleus 33
pituitary 2 peripheral blood CD8
T cells 12 prefrontal cortex 33
Normal Amygdala 3 umbilical cord blood 13 duodenal tissue 40
trachea 3 Post-mortem medial
substantia nigra 15 peripheral blood
(human PBMC) 47
Pulp tissue 4 skin 16 white blood cells 48
occipital lobe 4 Undifferentiated
human ES cells 16 lateralis muscle 48
Theca cell 4 lymphoblastoid cell
lines 17
Human umbilical vein endothelial
cells
53
Normal_Ovary 5 Human optic nerve
head astrocytes 18 bone marrow 56
thyroid gland
(thyrocytes) 7 hypothalamus 22 lung 63
Normal Spleen 7 liver 22 whole blood 67
adipose tissue 8 Bronchial Epithelium 23
0 500 1000 1500 2000 2500
0510152025
Scale Factor
2445 arrays
sfs
0 500 1000 1500 2000
0100200300400500
Figure 3.1. The deleted data in each quality assessment step. The red dots represent data deleted in this step. The block dots are the data still kept after this step.
0 500 1000 1500 2000 2500
0510152025
Scale Factor
2445 arrays
sfs
0 500 1000 1500 2000 2500
0100200300400500600
Average Background
2445 arrays
avbg
0 500 1000 1500 2000 2500
-4-202468
Ratios(actin3/actin5)
2445 arrays
actin3.actin5
0 500 1000 1500 2000 2500
-20246
Ratios(gapdh3/gapdh5)
2445 arrays
gapdh3.gapdh5
0 500 1000 1500 2000 2500
102030405060
Percent Precent
2445 arrays
pp
Figure 3.2. The total deleted data in each quality assessment step. The red dots represent all deleted data. The block dots are the data still kept after all the quality assessment steps.
Figure 3.3. These histograms are the log (base 2) expression distribution for gene
“200886_s_at”. The color lines are the smoothed densities, using various “n” and
“adjust” in function density(n, adjust) of R.
Figure 3.4. The red dots represent the rates of correctly identifying the two experiments (conditions) to be differentially expressed among 34 spike-in genes under various K. The blue dots are the rates of correctly identifying the two experiments (conditions) to be not differentially expressed among 4993 genes which were not in spike-in genes and had multiple modes.
Figure 4.1(a) Log (base 2) intensity distribution of one-mode gene.
Figure 4.1(b) Log (base 2) intensity distribution of two-mode gene which the second mode is close to the first mode.
Figure 4.1(c) Log (base 2) intensity distribution of two-mode gene which the second mode is more distant to the first mode.
Figure 4.1(d) Log (base 2) intensity distribution of gene which has more than two modes and the second mode is far away from the first mode.
Figure 4.2. Different combinations of arguments to fit the density smoother.
Combinations using the same adjust are assigned to the same color as shown in the legend.
Figure 4.3. The empirical intensity distributions and spike-in intensity distributions for two genes that are included as the spike-in genes and have multiple modes. The solid lines are the empirical distributions obtained through 1279 arrays of the reference training set, and the dotted lines are the intensity distributions using 42 spike-in arrays. The brilliant ticks act for the observed values of spike-in samples with color denoting the experimental group to which the observation belongs.
Figure 4.4. The empirical intensity distributions and spike-in intensity distributions for two genes that are included as spike-in genes and have only one mode. The solid lines are the empirical distributions obtained through 1279 arrays of the reference training set, and the dotted lines are the intensity distributions using 42 spike-in arrays.
The brilliant ticks act for the observed values of spike-in samples with color denoting the experimental group to which the observation belongs.
(a)
(b) (c)
Figure 4.5.1. The empirical intensity distribution and spike-in intensity distribution for a gene that is not included as the spike-in gene and has multiple modes, where the second mode is distant from the first mode. In (a), the empirical intensity distribution and spike-in intensity distribution are put together. (b) is for the empirical intensity distribution only, and (c) is for the spike-in intensity distribution only. The brilliant ticks act for the observed values of spike-in samples with color denoting the experimental group to which the observation belongs.
(a)
(b) (c)
Figure 4.5.2. The empirical intensity distribution and spike-in intensity distribution for a gene that is not included as the spike-in gene and has multiple modes, where the second mode is close to the first mode. In (a), the empirical intensity distribution and spike-in intensity distribution are put together. (b) is for the empirical intensity distribution only, and (c) is for the spike-in intensity distribution only. The brilliant ticks act for the observed values of spike-in samples with color denoting the experimental group to which the observation belongs.
(a)
(b) (c)
Figure 4.5.3. The empirical intensity distribution and spike-in intensity distribution for a gene that is not included as the spike-in gene and has multiple modes, where the second mode is far away from the first mode. In (a), the empirical intensity distribution and spike-in intensity distribution are put together. (b) is for the empirical intensity distribution only, and (c) is for the spike-in intensity distribution only. The brilliant ticks act for the observed values of spike-in samples with color denoting the experimental group to which the observation belongs.
(a)
(b) (c)
Figure 4.6.1. The empirical intensity distribution and spike-in intensity distribution for a gene that is not included as the in spike-in gene and has only one mode. In (a), the empirical intensity distribution and spike-in intensity distribution are put together.
(b) is for the empirical intensity distribution only, and (c) is for the spike-in intensity distribution only. The brilliant ticks act for the observed values of spike-in samples with color denoting the experimental group to which the observation belongs.
(a)
(b) (c)
Figure 4.6.2. The empirical intensity distribution and spike-in intensity distribution for a gene that is not included as the spike-in gene and has only one mode. In (a), the empirical intensity distribution and spike-in intensity distribution are put together. (b) is for the empirical intensity distribution only, and (c) is for the spike-in intensity distribution only. The brilliant ticks act for the observed values of spike-in samples with color denoting the experimental group to which the observation belongs.
Figure 4.7. ROC curves for six differential-expression methods, comparing the three replicate arrays from the 4th experimental group of the spike-in dataset with the three replicate arrays from the 10th experimental group. Various colors represent different differential-expression methods as shown in the legend.
Figure 4.8. ROC curves with FPs<100 for six differential-expression methods, comparing the three replicate arrays simulated based on the 4th experimental group of the spike-in dataset with the three replicate arrays simulated based on the 10th experimental group. Various colors represent different differential-expression methods as shown in the legend.
Figure 4.9. ROC curves with FPs<1000 for six differential-expression methods, comparing the three replicate arrays simulated based on the 4th experimental group of the spike-in dataset with the three replicate arrays simulated based on the 10th experimental group. Various colors represent different differential-expression methods as shown in the legend.
Figure 4.10. ROC curves with FPs<100 for six differential-expression methods, comparing the five replicate arrays simulated based on the 4th experimental group of the spike-in dataset with the five replicate arrays simulated based on the 10th experimental group. Various colors represent different differential-expression methods as shown in the legend.
Figure 4.11. ROC curves with FPs<100 for six differential-expression methods, comparing the ten replicate arrays simulated based on the 4th experimental group of the spike-in dataset with the ten replicate arrays simulated based on the 10th experimental group. Various colors represent different differential-expression methods as shown in the legend.