Using validity and reliability to compare combinations of preprocessing methods and differential expression methods for Affymetrix microarray GeneChips

5 Conclusions and Discussion

5.2 Discussion

The unusual pattern for EBarrays in Figure 6 arises because too many genes have a posterior probability of differential expression equal to 1. When genes are ranked by posterior probability, a large number of tied values makes their order meaningless. For example, more than one thousand genes have a posterior probability equal to 1 when PDNN+EBarrays(GG) is applied to the L_CFY treatment/control comparison, so we cannot select only 100 of them as differentially expressed genes. Even with the spike-in datasets, too many genes still have a posterior probability of differential expression equal to 1. This is one disadvantage of EBarrays, and we have not found a satisfactory way to handle genes with equal significance scores.
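The effect of tied scores on a top-k selection can be illustrated with a minimal Python sketch. It is not the procedure used in this study; the function name `top_k_with_ties` and the toy gene names are hypothetical. The sketch separates genes that rank strictly above the cutoff score from genes tied at the cutoff, among which any choice would be arbitrary.

```python
def top_k_with_ties(scores, k):
    """Split a top-k selection into an unambiguous part and a tied part.

    scores: dict mapping gene name -> significance score (higher = more significant).
    Returns (strictly_above, tied_at_cutoff): genes scoring strictly higher than
    the k-th best score, and all genes whose score equals the k-th best score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    if k >= len(ranked):
        return ranked, []
    cutoff = scores[ranked[k - 1]]  # score of the k-th ranked gene
    strictly_above = [g for g in ranked if scores[g] > cutoff]
    tied_at_cutoff = [g for g in ranked if scores[g] == cutoff]
    return strictly_above, tied_at_cutoff


# Toy example: five genes share posterior probability 1, but only 3 are wanted.
toy_scores = {"g1": 1.0, "g2": 1.0, "g3": 1.0, "g4": 1.0, "g5": 1.0, "g6": 0.4}
above, tied = top_k_with_ties(toy_scores, 3)
slots_left = 3 - len(above)  # slots that must be filled from the tied group
print(len(above), len(tied), slots_left)
```

In the toy example no gene is strictly above the cutoff, and three slots must be filled from five tied genes, so the selected list depends entirely on an arbitrary tie-breaking rule.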

References

Affymetrix. (2002) Statistical algorithms description document. http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf

Bolstad,B.M., Irizarry,R.A., Astrand,M. and Speed,T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19, 185-193.

Choe,S.E., Boutros,M., Michelson,A.M., Church,G.M. and Halfon,M.S. (2005) Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biology, 6, R16.

Chu,G., Narasimhan,B., Tibshirani,R. and Tusher,V. SAM “Significance Analysis of Microarrays”–Users guide and technical document. Technical Report, Stanford University. http://www-stat.stanford.edu/~tibs/SAM/sam.pdf

Cope,L.M., Irizarry,R.A., Jaffee,H.A., Wu,Z. and Speed,T.P. (2004) A benchmark for Affymetrix GeneChip expression measures. Bioinformatics, 20, 323-331.

Guo,L., Lobenhofer,E.K., Wang,C., Shippy,R., Harris,S.C., Zhang,L., Mei,N., Chen,T., Herman,D., Goodsaid,F.M., Hurban,P., Phillips,K.L., Xu,J., Deng,X., Sun,Y.A., Tong,W., Dragan,Y.P. and Shi,L. (2006) Rat toxicogenomic study reveals analytical consistency across microarray platforms. Nature Biotechnology, 24, 1162-1169.

Huber,W., Irizarry,R.A. and Gentleman,R. (2005) Preprocessing overview. In Gentleman,R., Irizarry,R.A., Carey,V.J., Dudoit,S. and Huber,W. (eds), Bioinformatics and Computational Biology Solutions using R and Bioconductor, Springer, New York, Chapter 1, pp. 3-12.

Irizarry,R.A., Hobbs,B., Collin,F., Beazer-Barclay,Y.D., Antonellis,K.J., Scherf,U. and Speed,T.P. (2003a) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249-264.

Irizarry,R.A., Bolstad,B.M., Collin,F., Cope,L.M., Hobbs,B. and Speed,T.P. (2003b) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31, e15.

Kendziorski,C., Sarkar,D., Chen,M. and Newton,M. (2007) The vignette of EBarrays package in Bioconductor. http://bioconductor.org/packages/2.0/bioc/vignettes/EBarrays/inst/doc/vignette.pdf

Kendziorski,C.M., Newton,M.A., Lan,H. and Gould,M.N. (2003) On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statistics in Medicine, 22, 3899-3914.

Li,C. and Wong,W. (2001a) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proceedings of the National Academy of Sciences, 98, 31-36.

Li,C. and Wong,W. (2001b) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology, 2(8), research 0032.1-0032.11.

Newton,M.A. and Kendziorski,C.M. (2003) Parametric Empirical Bayes Methods for Microarrays. In Parmigiani,G., Garrett,E.S., Irizarry,R.A. and Zeger,S.L. (eds), The Analysis of Gene Expression Data: Methods and Software. Springer, Chapter 11, pp. 254-271.

Newton,M.A., Kendziorski,C.M., Richmond,C.S., Blattner,F.R. and Tsui,K.W. (2001) On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology, 8, 37-52.

Smyth,G.K. (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3, Article 3.

Smyth,G.K. (2005) Limma: linear models for microarray data. In Gentleman,R., Carey,V., Dudoit,S., Irizarry,R. and Huber,W. (eds), Bioinformatics and Computational Biology Solutions using R and Bioconductor, Springer, New York, Chapter 23, pp. 397–420.

Sugimoto,N. et al. (1995) Thermodynamic parameters to predict stability of RNA/DNA hybrid duplexes. Biochemistry, 34, 11211-11216.

Tusher,V.G., Tibshirani,R. and Chu,G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, 98, 5116–5121.

Wolfinger,R. and Chu,T.-M. (2002) Who are those strangers in the Latin square? Critical Assessment of Microarray Data Analysis ‘CAMDA 02’.

Zhang,L., Miles,M.F. and Aldape,K.D. (2003) A model of molecular interactions on short oligonucleotide microarrays. Nature Biotechnology, 21, 818–821.

Table 1. Summary of the four preprocessing methods used.

Model           Method         Background adjustment                 Normalization          Summarization                                           Reference
[...]           MAS5.0         Locational adjustment & MM subtracted Scale normalization    Tukey biweight average                                  Affymetrix, 2002
[...]           dChip (PM-MM)  MM intensities are subtracted         Invariant set          Fit a model-based expression index                      [...]
[...]           RMA            [...]                                 Quantile normalization A robust linear model is fitted (median polish)         Irizarry et al., 2003
Physical model  PDNN           PM only                               Quantile normalization A free energy model accounts for background and signal  Zhang et al., 2003

Table 2. Summary of the three spike-in datasets used.

Dataset       Spike-in genes / total genes  Conditions  Total arrays  Replicates (conditions)   Fold change range  Reference
HGU95         14 / 12626                    14          59            2 (1), 3 (11), 12 (2)     2 to 2^12          Affymetrix
HGU133        42 / 22300                    14          42            3 (14)                    2 to 2^12          Affymetrix
Golden Spike  1331 / 14010                  2           6             3 (2)                     1.2 to 4.0         Choe et al., 2005

Table 3. Affymetrix human genome U95 dataset contains 14 spike-in gene groups in each of 14 experimental groups. This table shows the spiked-in concentrations (pM).

Spike-in gene group 1         2         3         4         5         6         7         8         9         10        11        12        13        14
Probe set           37777_at  684_at    1597_at   38734_at  39058_at  36311_at  36889_at  1024_at   36202_at  36085_at  40322_at  407_at    1091_at   1708_at
Experimental group
A                   0         0.25      0.5       1         2         4         8         16        32        64        128       0         512       1024
B                   0.25      0.5       1         2         4         8         16        32        64        128       256       0.25      1024      0
C                   0.5       1         2         4         8         16        32        64        128       256       512       0.5       0         0.25
D                   1         2         4         8         16        32        64        128       256       512       1024      1         0.25      0.5
E                   2         4         8         16        32        64        128       256       512       1024      0         2         0.5       1
F                   4         8         16        32        64        128       256       512       1024      0         0.25      4         1         2
G                   8         16        32        64        128       256       512       1024      0         0.25      0.5       8         2         4
H                   16        32        64        128       256       512       1024      0         0.25      0.5       1         16        4         8
I                   32        64        128       256       512       1024      0         0.25      0.5       1         2         32        8         16
J                   64        128       256       512       1024      0         0.25      0.5       1         2         4         64        16        32
K                   128       256       512       1024      0         0.25      0.5       1         2         4         8         128       32        64
L                   256       512       1024      0         0.25      0.5       1         2         4         8         16        256       64        128
M, N, O, P          512       1024      0         0.25      0.5       1         2         4         8         16        32        512       128       256
Q, R, S, T          1024      0         0.25      0.5       1         2         4         8         16        32        64        1024      256       512

Table 4. Affymetrix human genome U133 dataset contains 14 spike-in gene groups in each of 14 experimental groups. This table shows the

Table 5. Area under ROC curve (FP<100) for HGU95 dataset.

Rank Preprocessing Differential expression AUC (FP<100)

1 PDNN limma 0.948579

7 RMA EBarrays(GG) 0.927794

8 RMA EBarrays(LNN) 0.923941

9 dChip(PM-only) limma 0.905235

10 PDNN EBarrays(GG) 0.902541

11 PDNN EBarrays(LNN) 0.90131

12 PDNN t.test 0.898442

13 RMA t.test 0.886426

14 dChip(PM-only) t.test 0.88254

15 dChip(PM-only) SAM 0.880197

16 dChip(PM-only) FC 0.846546

17 dChip(PM-MM) t.test 0.841166

18 dChip(PM-MM) limma 0.835926

19 dChip(PM-MM) SAM 0.825455

20 dChip(PM-only) EBarrays(GG) 0.824395

21 dChip(PM-only) EBarrays(LNN) 0.820898

22 MAS5.0 t.test 0.815033

23 MAS5.0 limma 0.799162

24 MAS5.0 SAM 0.794531

25 PDNN Welch.t 0.767155

26 RMA Welch.t 0.7576

27 dChip(PM-only) Welch.t 0.742685

28 dChip(PM-MM) Welch.t 0.701568

29 dChip(PM-MM) FC 0.668716

30 dChip(PM-MM) EBarrays(LNN) 0.647701

31 dChip(PM-MM) EBarrays(GG) 0.645769

32 MAS5.0 Welch.t 0.644588

33 MAS5.0 FC 0.615917

34 MAS5.0 EBarrays(GG) 0.612341

35 MAS5.0 EBarrays(LNN) 0.587304

Table 6. Area under ROC curve (FP<100) for HGU133 dataset

Rank Preprocessing Differential expression AUC (FP<100)

1 RMA EBarrays(GG) 0.863092

2 RMA EBarrays(LNN) 0.862798

3 RMA FC 0.817002

4 RMA limma 0.81548

5 RMA SAM 0.815347

6 PDNN SAM 0.81162

7 PDNN limma 0.809847

8 PDNN FC 0.797237

9 dChip(PM-only) limma 0.786985

10 dChip(PM-only) SAM 0.785353

11 PDNN EBarrays(LNN) 0.779613

12 PDNN t.test 0.777446

13 dChip(PM-MM) SAM 0.771588

14 dChip(PM-only) t.test 0.770554

15 dChip(PM-MM) limma 0.764034

16 RMA t.test 0.752983

17 dChip(PM-MM) t.test 0.752711

18 MAS5.0 SAM 0.720726

19 PDNN Welch.t 0.720642

20 dChip(PM-only) FC 0.718709

21 MAS5.0 limma 0.706744

22 dChip(PM-only) Welch.t 0.706271

23 RMA Welch.t 0.699885

24 dChip(PM-only) EBarrays(GG) 0.684316

25 MAS5.0 t.test 0.670529

26 dChip(PM-only) EBarrays(LNN) 0.669003

27 dChip(PM-MM) Welch.t 0.668742

28 dChip(PM-MM) FC 0.577278

29 MAS5.0 Welch.t 0.553659

30 MAS5.0 EBarrays(GG) 0.552828

31 MAS5.0 EBarrays(LNN) 0.549571

32 dChip(PM-MM) EBarrays(LNN) 0.54558

33 MAS5.0 FC 0.535097

Table 7. Area under ROC curve (FPR<0.1) for Golden Spike dataset.

Rank Preprocessing Differential expression AUC (FPR<0.1)

1 dChip(PM-only) limma 0.56372

2 dChip(PM-only) SAM 0.559223

3 dChip(PM-only) t.test 0.547767

4 dChip(PM-only) Welch.t 0.535408

5 dChip(PM-MM) t.test 0.521514

6 dChip(PM-MM) Welch.t 0.512999

7 dChip(PM-only) FC 0.507993

8 dChip(PM-MM) limma 0.501604

9 dChip(PM-only) EBarrays(GG) 0.496914

10 dChip(PM-only) EBarrays(LNN) 0.493245

11 dChip(PM-MM) SAM 0.481958

12 PDNN FC 0.36528

18 RMA EBarrays(GG) 0.328945

19 RMA EBarrays(LNN) 0.32708

20 PDNN SAM 0.321463

21 MAS5.0 Welch.t 0.314948

22 PDNN EBarrays(GG) 0.312507

23 PDNN EBarrays(LNN) 0.312131

24 RMA t.test 0.307509

31 dChip(PM-MM) EBarrays(LNN) 0.034358

32 dChip(PM-MM) EBarrays(GG) 0.016425

33 MAS5.0 FC 0.00642

34 MAS5.0 EBarrays(LNN) 0.004662

35 MAS5.0 EBarrays(GG) 0.004136
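The AUC values in Tables 5 to 7 summarize only the left-hand part of each ROC curve. The exact computation is not given in this section, so the Python sketch below shows just one plausible reading of "AUC (FP<100)": the area under the curve of true positive rate against false positive count, truncated at 100 false positives and divided by 100 so that a perfect ranking scores 1. The function name `partial_auc_by_fp` and the toy data are hypothetical, and ties in the scores are broken by input order, which is itself an assumption.

```python
def partial_auc_by_fp(scores, is_true, max_fp=100):
    """Normalized area under the TPR-versus-FP-count curve for FP up to max_fp.

    scores:  significance scores, higher = more significant.
    is_true: booleans, True for genes that are truly differentially expressed.
    """
    n_true = sum(1 for t in is_true if t)
    if n_true == 0:
        raise ValueError("at least one true positive is required")
    # Rank genes from most to least significant (stable sort keeps tie order).
    ranked = sorted(zip(scores, is_true), key=lambda pair: pair[0], reverse=True)
    tp = 0
    fp = 0
    area = 0.0
    for _, truth in ranked:
        if truth:
            tp += 1
        else:
            # Each false positive moves one unit along the x-axis;
            # add a column whose height is the current true positive rate.
            area += tp / n_true
            fp += 1
            if fp == max_fp:
                break
    # If there were fewer than max_fp negatives, extend at the final TPR.
    area += (max_fp - fp) * tp / n_true
    return area / max_fp


# Toy example with four genes and a cutoff of two false positives.
toy_auc = partial_auc_by_fp([0.9, 0.8, 0.7, 0.6], [True, False, True, False], max_fp=2)
print(toy_auc)
```

Under the same assumptions, the FPR<0.1 criterion of Table 7 would correspond to setting `max_fp` to one tenth of the number of true negatives.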

Figure 1-1. ROC curves for all combinations using HGU95 dataset (35 in total). Combinations using the same preprocessing method are assigned to the same color as shown in the legend.

Figure 1-2. ROC curves for all combinations using HGU95 dataset (35 in total), restricted to FP<100.

Figure 1-3. ROC curves for all combinations using HGU133 dataset (33 in total). Combinations using the same preprocessing method are assigned to the same color as shown in the legend.

Figure 1-4. ROC curves for all combinations using HGU133 dataset (33 in total), restricted to FP<100.

Figure 1-5. ROC curves for all combinations using Golden Spike dataset (35 in total). Combinations using the same preprocessing method are assigned to the same color as shown in the legend.

Figure 1-6. ROC curves for all combinations using Golden Spike dataset (35 in total), restricted to false positive rate<0.1.

Figure 2-1. For HGU95 dataset, ROC curves of all combinations are divided by preprocessing method. Combinations using the same differential expression method are assigned to the same color as shown in the legend.

Figure 2-2. For HGU133 dataset, ROC curves of all combinations are divided by preprocessing method.

Figure 2-3. For Golden Spike dataset, ROC curves of all combinations are divided by preprocessing method.

Figure 3-1. ROC curves for all combinations using HGU95 dataset. Combinations using the same differential expression method are assigned to the same color as shown in the legend.

Figure 3-2. ROC curves for all combinations using HGU133 dataset.

Figure 3-3. ROC curves for all combinations using Golden Spike dataset.

Figure 4. Overlap rate of two differentially expressed gene lists generated using different combinations. The x-axis (log scale) represents the number of genes selected as differentially expressed, and the y-axis is the overlap rate of the two gene lists for that number of genes. The four comparisons of treated tissues versus their controls are denoted K_AA, L_AA, L_CFY, and L_RDL. The fifth graph shows the average across the four conditions. Each line represents one combination, and there are 36 combinations in total. This graph shows the overall patterns.
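The overlap rate plotted in Figures 4 to 9 can be sketched in Python under a stated assumption: that it is the fraction of genes shared by the top n entries of two ranked gene lists. The function names `overlap_rate` and `overlap_curve` and the toy lists are hypothetical and are not taken from the analysis code of this study.

```python
def overlap_rate(ranked_a, ranked_b, n):
    """Fraction of genes shared by the top n entries of two ranked gene lists."""
    if n <= 0:
        raise ValueError("n must be positive")
    top_a = set(ranked_a[:n])
    top_b = set(ranked_b[:n])
    return len(top_a & top_b) / n


def overlap_curve(ranked_a, ranked_b, sizes):
    """Overlap rate for each list size in sizes, as (n, rate) pairs."""
    return [(n, overlap_rate(ranked_a, ranked_b, n)) for n in sizes]


# Toy example: two rankings of the same four genes, most significant first.
list_a = ["g1", "g2", "g3", "g4"]
list_b = ["g2", "g1", "g4", "g3"]
curve = overlap_curve(list_a, list_b, [1, 2, 4])
print(curve)
```

Evaluating the curve on list sizes spaced evenly on a log scale would match the log-scaled x-axis described in the caption.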

Figure 5-1. Overlap rate of two differentially expressed gene lists generated using different combinations for K_AA treatment/control. All combinations are divided by preprocessing method. Combinations using the same differential expression method are assigned to the same color as shown in the legend.

Figure 5-2. Overlap rate of two differentially expressed gene lists generated using different combinations for L_AA treatment/control.

Figure 5-3. Overlap rate of two differentially expressed gene lists generated using different combinations for L_CFY treatment/control.

Figure 5-4. Overlap rate of two differentially expressed gene lists generated using different combinations for L_RDL treatment/control.

Figure 6. Overlap rate of two differentially expressed gene lists generated using different combinations with EBarrays as differential expression method. Ten combinations in total are shown in the legend.

Figure 7. Overlap rate of two differentially expressed gene lists generated using different combinations. Combinations using the same differential expression method are assigned to the same color as shown in the legend. All combinations are included.

Figure 8. Overlap rate of two differentially expressed gene lists generated using different combinations. Only the nine permutations with RMA, dChip(PM-only), PDNN as preprocessing method and FC, SAM, limma as differential expression method are plotted.

Figure 9-1. Average overlap rate of two differentially expressed gene lists generated using different combinations. Combinations using the same preprocessing method are assigned to the same color. All combinations are included. Black for RMA, red for MAS5.0, green for dChip(PM-MM), blue-black for dChip(PM-only), and baby blue for PDNN.

Figure 9-2. Average overlap rate of two differentially expressed gene lists generated using different combinations. Combinations using the same differential expression method are assigned to the same color. All combinations are included. Black for FC, red for SAM, green for t-test, blue-black for Welch t-test, baby blue for EBarrays(GG), pink for EBarrays(LNN), and yellow for limma.
