5 Conclusions and Discussion
5.2 Discussion
The erratic pattern of the EBarrays curves in Figure 6 arises because too many genes have a posterior probability of differential expression equal to 1. When genes are ranked by posterior probability, such a large block of tied values makes the order within the block meaningless. For example, more than one thousand genes have a posterior probability equal to 1 when PDNN+EBarrays(GG) is applied to the L_CFY treatment/control comparison, so the 100 most significant genes cannot be selected unambiguously. The same problem occurs even with the spike-in datasets. This is one disadvantage of EBarrays, and we have not found a satisfactory way to order genes that share the same significance score.
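The effect of tied scores on a top-k selection can be illustrated with a minimal sketch. The function name `top_k_with_ties` and the example scores are hypothetical and are not taken from EBarrays output; the sketch only shows why a cutoff that falls inside a block of equal values leaves the selection undetermined.

```python
def top_k_with_ties(scores, k):
    """Split a top-k selection into an unambiguous part and a tied part.

    scores: dict mapping gene name -> significance score (higher = more significant).
    Returns (selected, tied), where `selected` holds the genes scoring strictly
    above the k-th largest score and `tied` holds every gene whose score equals
    the k-th largest score.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if k >= len(ranked):
        return [gene for gene, _ in ranked], []
    cutoff = ranked[k - 1][1]
    selected = [gene for gene, s in ranked if s > cutoff]
    tied = [gene for gene, s in ranked if s == cutoff]
    return selected, tied


# Hypothetical example: 1000 genes saturate at a posterior probability of 1.
scores = {f"gene{i}": 1.0 for i in range(1000)}
scores.update({f"gene{i}": 1.0 - (i - 999) * 0.001 for i in range(1000, 1500)})

selected, tied = top_k_with_ties(scores, 100)
print(len(selected), len(tied))  # prints: 0 1000
```

In this example no gene is selected unambiguously, and all 100 slots would have to be filled arbitrarily from the 1000 tied genes.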
References
Affymetrix. (2002) Statistical algorithms description document.
http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf
Bolstad,B.M., Irizarry,R.A., Astrand,M. and Speed,T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19, 185-193.
Choe,S.E., Boutros,M., Michelson,A.M., Church,G.M. and Halfon,M.S. (2005) Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biology, 6, R16.
Chu,G., Narasimhan,B., Tibshirani,R. and Tusher,V. SAM “Significance Analysis of Microarrays”–Users guide and technical document. Technical Report, Stanford University. http://www-stat.stanford.edu/~tibs/SAM/sam.pdf
Cope,L.M., Irizarry,R.A., Jaffee,H.A., Wu,Z. and Speed,T.P. (2004) A benchmark for Affymetrix GeneChip expression measures. Bioinformatics, 20, 323-331.
Guo,L., Lobenhofer,E.K., Wang,C., Shippy,R., Harris,S.C., Zhang,L., Mei,N., Chen,T., Herman,D., Goodsaid,F.M., Hurban,P., Phillips,K.L., Xu,J., Deng,X., Sun,Y.A., Tong,W., Dragan,Y.P. and Shi,L. (2006) Rat toxicogenomic study reveals analytical consistency across microarray platforms. Nature Biotechnology, 24, 1162-1169.
Huber,W., Irizarry,R.A. and Gentleman,R. (2005) Preprocessing overview. In Gentleman,R., Irizarry,R.A., Carey,V.J., Dudoit,S. and Huber,W. (eds), Bioinformatics and Computational Biology Solutions using R and Bioconductor, Springer, New York, Chapter 1, pp. 3-12.
Irizarry,R.A., Hobbs,B., Collin,F., Beazer-Barclay,Y.D., Antonellis,K.J., Scherf,U. and Speed,T.P. (2003a) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249-264.
Irizarry,R.A., Bolstad,B.M., Collin,F., Cope,L.M., Hobbs,B. and Speed,T.P. (2003b) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31, e15.
Kendziorski,C., Sarkar,D., Chen,M. and Newton,M. (2007) The vignette of EBarrays package in Bioconductor.
http://bioconductor.org/packages/2.0/bioc/vignettes/EBarrays/inst/doc/vignette.pdf
Kendziorski,C.M., Newton,M.A., Lan,H. and Gould,M.N. (2003) On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statistics in Medicine, 22, 3899-3914.
Li,C. and Wong,W. (2001a) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proceedings of the National Academy of Sciences, 98, 31-36.
Li,C. and Wong,W. (2001b) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology, 2(8), research 0032.1-0032.11.
Newton,M.A. and Kendziorski,C.M. (2003) Parametric Empirical Bayes Methods for Microarrays. In Parmigiani,G., Garrett,E.S., Irizarry,R.A. and Zeger,S.L. (eds), The Analysis of Gene Expression Data: Methods and Software. Springer, Chapter 11, pp. 254-271.
Newton,M.A., Kendziorski,C.M., Richmond,C.S., Blattner,F.R. and Tsui,K.W. (2001) On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology, 8, 37-52.
Smyth,G.K. (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3, Article 3.
Smyth,G.K. (2005) Limma: linear models for microarray data. In Gentleman,R., Carey,V., Dudoit,S., Irizarry,R. and Huber,W. (eds), Bioinformatics and Computational Biology Solutions using R and Bioconductor, Springer, New York, Chapter 23, pp. 397–420.
Sugimoto,N. et al. (1995) Thermodynamic parameters to predict stability of RNA/DNA hybrid duplexes. Biochemistry, 34, 11211-11216.
Tusher,V.G., Tibshirani,R. and Chu,G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, 98, 5116–5121.
Wolfinger,R. and Chu,T.-M. (2002) Who are those strangers in the Latin square? Critical Assessment of Microarray Data Analysis ‘CAMDA 02’.
Zhang,L., Miles,M.F. and Aldape,K.D. (2003) A model of molecular interactions on short oligonucleotide microarrays. Nature Biotechnology, 21, 818–821.
Table 1. Summary of the four preprocessing methods used.
Model | Method | Background adjustment | Normalization | Summarization | Reference
 | MAS5.0 | Locational adjustment & MM subtracted | Scale normalization | Tukey biweight average | Affymetrix, 2002
 | dChip (PM-MM) | MM intensities are subtracted | Invariant set | Fit a model-based expression index |
 | RMA | | Quantile normalization | A robust linear model is fitted (median polish) | Irizarry et al., 2003
Physical model | PDNN | PM only | Quantile normalization | A free energy model accounts for background and signal | Zhang et al., 2003
Table 2. Summary of the three spike-in datasets used.
Dataset | Spike-in genes / Total genes in array | Conditions | Total arrays | Replicates (conditions) | Fold change range | Reference
HGU95 | 14 / 12626 | 14 | 59 | 2 (1), 3 (11), 12 (2) | 2 ~ 2^12 | Affymetrix
HGU133 | 42 / 22300 | 14 | 42 | 3 (14) | 2 ~ 2^12 | Affymetrix
Golden Spike | 1331 / 14010 | 2 | 6 | 3 (2) | 1.2 ~ 4.0 | Choe et al., 2005
Table 3. Affymetrix human genome U95 dataset contains 14 spike-in gene groups in each of 14 experimental groups. This table shows the spiked-in concentrations (pM).
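HGU95: rows are experimental groups, columns are spike-in gene groups.
Spike-in gene group | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14
Probe set | 37777_at | 684_at | 1597_at | 38734_at | 39058_at | 36311_at | 36889_at | 1024_at | 36202_at | 36085_at | 40322_at | 407_at | 1091_at | 1708_at
A | 0 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 0 | 512 | 1024
B | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 0.25 | 1024 | 0
C | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 0.5 | 0 | 0.25
D | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 1 | 0.25 | 0.5
E | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 0 | 2 | 0.5 | 1
F | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 0 | 0.25 | 4 | 1 | 2
G | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 0 | 0.25 | 0.5 | 8 | 2 | 4
H | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 0 | 0.25 | 0.5 | 1 | 16 | 4 | 8
I | 32 | 64 | 128 | 256 | 512 | 1024 | 0 | 0.25 | 0.5 | 1 | 2 | 32 | 8 | 16
J | 64 | 128 | 256 | 512 | 1024 | 0 | 0.25 | 0.5 | 1 | 2 | 4 | 64 | 16 | 32
K | 128 | 256 | 512 | 1024 | 0 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 128 | 32 | 64
L | 256 | 512 | 1024 | 0 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 | 256 | 64 | 128
M, N, O, P | 512 | 1024 | 0 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 512 | 128 | 256
Q, R, S, T | 1024 | 0 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 1024 | 256 | 512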
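The concentrations in Table 3 follow a cyclic (Latin square style) pattern: each experimental group shifts the concentration series by one position, and spike-in gene group 12 carries the same concentration as group 1. The following sketch regenerates the rows from that observation. The names `CONCS`, `ROW_LABELS`, and `latin_square_row` are illustrative, and the rule is inferred from the table values rather than taken from the Affymetrix documentation.

```python
# Concentration series (pM) read off Table 3, in cyclic order.
CONCS = [0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

# Experimental groups as listed in Table 3; M-P and Q-T share one row each.
ROW_LABELS = list("ABCDEFGHIJKL") + ["M-P", "Q-T"]


def latin_square_row(shift):
    """Concentrations of the 14 spike-in gene groups for one experimental group.

    shift: 0 for group A, 1 for group B, and so on up to 13 for groups Q-T.
    """
    n = len(CONCS)
    row = [CONCS[(j + shift) % n] for j in range(n)]
    # In Table 3, gene group 12 always matches gene group 1.
    row[11] = row[0]
    return row


design = {label: latin_square_row(i) for i, label in enumerate(ROW_LABELS)}
for label, row in design.items():
    print(label, row)
```

Comparing two rows of `design` element by element gives the nominal concentration ratio of each spike-in gene group between the two experimental groups.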
Table 4. Affymetrix human genome U133 dataset contains 14 spike-in gene groups in each of 14 experimental groups. This table shows the
Table 5. Area under ROC curve (FP<100) for HGU95 dataset.
HGU95 Preprocessing Differential expression AUC (FP<100)
1 PDNN limma 0.948579
7 RMA EBarrays(GG) 0.927794
8 RMA EBarrays(LNN) 0.923941
9 dChip(PM-only) limma 0.905235
10 PDNN EBarrays(GG) 0.902541
11 PDNN EBarrays(LNN) 0.90131
12 PDNN t.test 0.898442
13 RMA t.test 0.886426
14 dChip(PM-only) t.test 0.88254
15 dChip(PM-only) SAM 0.880197
16 dChip(PM-only) FC 0.846546
17 dChip(PM-MM) t.test 0.841166
18 dChip(PM-MM) limma 0.835926
19 dChip(PM-MM) SAM 0.825455
20 dChip(PM-only) EBarrays(GG) 0.824395
21 dChip(PM-only) EBarrays(LNN) 0.820898
22 MAS5.0 t.test 0.815033
23 MAS5.0 limma 0.799162
24 MAS5.0 SAM 0.794531
25 PDNN Welch.t 0.767155
26 RMA Welch.t 0.7576
27 dChip(PM-only) Welch.t 0.742685
28 dChip(PM-MM) Welch.t 0.701568
29 dChip(PM-MM) FC 0.668716
30 dChip(PM-MM) EBarrays(LNN) 0.647701
31 dChip(PM-MM) EBarrays(GG) 0.645769
32 MAS5.0 Welch.t 0.644588
33 MAS5.0 FC 0.615917
34 MAS5.0 EBarrays(GG) 0.612341
35 MAS5.0 EBarrays(LNN) 0.587304
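As an illustration of how an AUC restricted to FP<100 can be computed from a ranked gene list, the sketch below integrates the true positive rate over the false positive count up to a cutoff and divides by the cutoff, so that a perfect ranking scores 1. This normalization and the function name `partial_auc_fp` are assumptions made for the example; the exact definition used to produce Table 5 is not restated here.

```python
def partial_auc_fp(scores, is_true, max_fp=100):
    """Area under the TPR-versus-FP-count curve for FP in [0, max_fp], scaled to [0, 1].

    scores:  significance scores, higher = more likely differentially expressed.
    is_true: parallel sequence of booleans, True for spiked-in (truly changed) genes.
    Ties in `scores` are broken by input order, which is arbitrary.
    """
    n_true = sum(1 for flag in is_true if flag)
    if n_true == 0 or max_fp <= 0:
        raise ValueError("need at least one true gene and a positive max_fp")

    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = 0
    fp = 0
    area = 0.0
    for i in order:
        if is_true[i]:
            tp += 1
        else:
            # Each false positive moves one unit along the x-axis;
            # add a column whose height is the current true positive rate.
            area += tp / n_true
            fp += 1
            if fp == max_fp:
                break
    # If the list ran out before max_fp false positives, the curve stays flat.
    area += (max_fp - fp) * (tp / n_true)
    return area / max_fp


# Hypothetical toy ranking: true, false, true, false.
print(partial_auc_fp([0.9, 0.8, 0.7, 0.6], [True, False, True, False], max_fp=2))  # prints: 0.75
```

Because ties are broken arbitrarily, a method that assigns the same score to many genes can receive a different value under a different tie order.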
Table 6. Area under ROC curve (FP<100) for HGU133 dataset.
HGU133 Preprocessing Differential expression AUC (FP<100)
1 RMA EBarrays(GG) 0.863092
2 RMA EBarrays(LNN) 0.862798
3 RMA FC 0.817002
4 RMA limma 0.81548
5 RMA SAM 0.815347
6 PDNN SAM 0.81162
7 PDNN limma 0.809847
8 PDNN FC 0.797237
9 dChip(PM-only) limma 0.786985
10 dChip(PM-only) SAM 0.785353
11 PDNN EBarrays(LNN) 0.779613
12 PDNN t.test 0.777446
13 dChip(PM-MM) SAM 0.771588
14 dChip(PM-only) t.test 0.770554
15 dChip(PM-MM) limma 0.764034
16 RMA t.test 0.752983
17 dChip(PM-MM) t.test 0.752711
18 MAS5.0 SAM 0.720726
19 PDNN Welch.t 0.720642
20 dChip(PM-only) FC 0.718709
21 MAS5.0 limma 0.706744
22 dChip(PM-only) Welch.t 0.706271
23 RMA Welch.t 0.699885
24 dChip(PM-only) EBarrays(GG) 0.684316
25 MAS5.0 t.test 0.670529
26 dChip(PM-only) EBarrays(LNN) 0.669003
27 dChip(PM-MM) Welch.t 0.668742
28 dChip(PM-MM) FC 0.577278
29 MAS5.0 Welch.t 0.553659
30 MAS5.0 EBarrays(GG) 0.552828
31 MAS5.0 EBarrays(LNN) 0.549571
32 dChip(PM-MM) EBarrays(LNN) 0.54558
33 MAS5.0 FC 0.535097
Table 7. Area under ROC curve (FPR<0.1) for Golden Spike dataset.
GoldenS Preprocessing Differential expression AUC (FPR<0.1)
1 dChip(PM-only) limma 0.56372
2 dChip(PM-only) SAM 0.559223
3 dChip(PM-only) t.test 0.547767
4 dChip(PM-only) Welch.t 0.535408
5 dChip(PM-MM) t.test 0.521514
6 dChip(PM-MM) Welch.t 0.512999
7 dChip(PM-only) FC 0.507993
8 dChip(PM-MM) limma 0.501604
9 dChip(PM-only) EBarrays(GG) 0.496914
10 dChip(PM-only) EBarrays(LNN) 0.493245
11 dChip(PM-MM) SAM 0.481958
12 PDNN FC 0.36528
18 RMA EBarrays(GG) 0.328945
19 RMA EBarrays(LNN) 0.32708
20 PDNN SAM 0.321463
21 MAS5.0 Welch.t 0.314948
22 PDNN EBarrays(GG) 0.312507
23 PDNN EBarrays(LNN) 0.312131
24 RMA t.test 0.307509
31 dChip(PM-MM) EBarrays(LNN) 0.034358
32 dChip(PM-MM) EBarrays(GG) 0.016425
33 MAS5.0 FC 0.00642
34 MAS5.0 EBarrays(LNN) 0.004662
35 MAS5.0 EBarrays(GG) 0.004136
Figure 1-1. ROC curves for all combinations using HGU95 dataset (35 in total). Combinations using the same preprocessing method are assigned to the same color as shown in the legend.
Figure 1-2. ROC curves for all combinations using HGU95 dataset (35 in total), restricted to FP<100.
Figure 1-3. ROC curves for all combinations using HGU133 dataset (33 in total). Combinations using the same preprocessing method are assigned to the same color as shown in the legend.
Figure 1-4. ROC curves for all combinations using HGU133 dataset (33 in total), restricted to FP<100.
Figure 1-5. ROC curves for all combinations using Golden Spike dataset (35 in total). Combinations using the same preprocessing method are assigned to the same color as shown in the legend.
Figure 1-6. ROC curves for all combinations using Golden Spike dataset (35 in total), restricted to false positive rate<0.1.
Figure 2-1. For HGU95 dataset, ROC curves of all combinations are divided by preprocessing method. Combinations using the same differential expression method are assigned to the same color as shown in the legend.
Figure 2-2. For HGU133 dataset, ROC curves of all combinations are divided by preprocessing method.
Figure 2-3. For Golden Spike dataset, ROC curves of all combinations are divided by preprocessing method.
Figure 3-1. ROC curves for all combinations using HGU95 dataset. Combinations using the same differential expression method are assigned to the same color as shown in the legend.
Figure 3-2. ROC curves for all combinations using HGU133 dataset.
Figure 3-3. ROC curves for all combinations using Golden Spike dataset.
Figure 4. Overlap rate of two differentially expressed gene lists generated using different combinations. The x-axis (log scale) shows the number of genes selected as differentially expressed, and the y-axis shows the overlap rate of the two gene lists at that number. The four treated tissue groups compared with their controls are labeled K_AA, L_AA, L_CFY, and L_RDL. The fifth panel shows the average across the four conditions. Each line represents one combination, and there are 36 combinations in total. This graph shows the overall patterns.
Figure 5-1. Overlap rate of two differentially expressed gene lists generated using different combinations for K_AA treatment/control. All combinations are divided by preprocessing method. Combinations using the same differential expression method are assigned to the same color as shown in the legend.
Figure 5-2. Overlap rate of two differentially expressed gene lists generated using different combinations for L_AA treatment/control.
Figure 5-3. Overlap rate of two differentially expressed gene lists generated using different combinations for L_CFY treatment/control.
Figure 5-4. Overlap rate of two differentially expressed gene lists generated using different combinations for L_RDL treatment/control.
Figure 6. Overlap rate of two differentially expressed gene lists generated using different combinations with EBarrays as differential expression method. Ten combinations in total are shown in the legend.
Figure 7. Overlap rate of two differentially expressed gene lists generated using different combinations. Combinations using the same differential expression method are assigned to the same color as shown in the legend. All combinations are included.
Figure 8. Overlap rate of two differentially expressed gene lists generated using different combinations. Only the nine combinations of RMA, dChip(PM-only), or PDNN as preprocessing method with FC, SAM, or limma as differential expression method are plotted.
Figure 9-1. Average overlap rate of two differentially expressed gene lists generated using different combinations. Combinations using the same preprocessing method are assigned to the same color. All combinations are included. Black for RMA, red for MAS5.0, green for dChip(PM-MM), blue-black for dChip(PM-only), and baby blue for PDNN.
Figure 9-2. Average overlap rate of two differentially expressed gene lists generated using different combinations. Combinations using the same differential expression method are assigned to the same color. All combinations are included. Black for FC, red for SAM, green for t-test, blue-black for Welch t-test, baby blue for EBarrays(GG), pink for EBarrays(LNN), and yellow for limma.