DISCUSSION - 基因表現量晶片資料模擬器－使用公開之晶片資料庫

The proposed simulation method for gene expression data is based on an empirical distribution obtained from 1279 HG-U133 arrays. Genes displaying only one mode are recognized as either expressed in all tissues or unexpressed in all tissues.

These genes do not provide information for the purpose of differentiating between expressed and unexpressed. To distinguish abnormal tissues from normal tissues, we can just focus on 5005 genes which have multiple modes. To simulate a set of arrays, our simulation method recommends that first identifying each of 5005 genes as expressed or unexpressed, and then using this expression status to acquire these 5005 genes’ simulated expression values. The rest of the 22283-5005 genes will be simulated by one-mode empirical distributions.

The empirical distribution follows similar pattern with intensity distribution obtained from the spike-in dataset. It implies that the empirical distribution imitates some characteristics of real gene intensity distribution successfully. The derived simulated data can be used to evaluate various differential expression methods objectively. With three replicate arrays, the performances of compared six differential expression methods are all of inferior quality when using simulated data than using real spike-in data. This may be explained by the reason that the spike-in dataset contains technical replicates only, but our simulated data also contain variations from biological replicates. The ability of detecting differentially expressed genes with these six approaches except for SAM is of the same rank. Using five and ten replicate arrays, there will be some changes of the performance of these six methods. The fold-change and EBarrays-based analyses are superior in the low replication situation.

But when the replicate arrays increased, the performance of the two methods is not as excellent as it appears formerly. Besides, the capability of distinguishing differentially

expressed genes with SAM becomes better while augmenting the replicate arrays.

This simulation method has some limitation. First of all, we preprocessed all the row data of 1279 arrays from normal tissues using justRMA. Consequently, the empirical intensity distributions are generated through RMA-preprocessed data, and there is no way to evaluate varied preprocessing methods. Besides, simulation of 5005 genes has some problems. This simulation method considered the empirical distribution which had multiple modes as a complete distribution and ignored the fact that the expressed part and unexpressed part might come from the different distributions. That is, when simulating gene expressions from these multiple-mode genes, the expressed expression will always larger than the unexpressed expression.

Therefore, as the replicate arrays increase to a vast amount, the ability of differential expression methods will be too sensitive to evaluate.

With this simulation method, three replicate arrays for each group can be obtained within two minutes, and it is convenient to simulate an enormous amount of imitative data. Arising from the expensive price of Affymetrix chips, microarray experiment size was restricted and it would limit the power of microarray analysis methods. It is believed that the proposed simulation method of gene expression data can benefit the development of the microarray analysis tools.

Reference

Affymetrix. (2002) Statistical algorithms description document.

http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf Affymetrix. (2003) GeneChip Human Genome Array

http://www.affymetrix.com/support/technical/datasheets/human_datasheet.pdf Brazma,A., Robinson,A.,Cameron,G. and Ashburner,A. (2000) One-stop shop for

microarray data. Nature, 403, 699-700.

Chang,K.M, Harbron,C. and South,M.C. (2007) The vignette of RefPlus package in Bioconductor.

http://www.bioconductor.org/packages/2.4/bioc/vignettes/RefPlus/inst/doc/RefPl us.pdf

Chang,K.M., Harbron,C., South,M.C. (2006) An Exploration of Extensions to the RMA Algorithm.

http://bioconductor.org/packages/2.4/bioc/vignettes/RefPlus/inst/doc/Extensions _to_RMA_Algorithm.pdf

Choe,S.E., Boutros,M., Michelson,A.M., Church,G.M. and Halfon,M.S. (2005) Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biology, 6:R16

Ivliev,A.E., Hoen,P.A.C., Villerius,M.P., Dunnen,J.T. and Bradndt,B.W. (2008) Microarray retriever: a web-based tool for searching and large scale retrieval of public microarray data. Nucleic Acids Research, 36, 327–331.

MacDonald,J. and Bolstad,B. (2009) The manuals of affy package in Bioconductor.

http://www.bioconductor.org/packages/2.4/bioc/manuals/affy/man/affy.pdf Mcgee,M. and Chen,Z. (2006) New Spiked-In Probe Sets for the Affymetrix

HGU-133A Latin Square Experiment. COBRA Preprint Series, Article 5.

http://biostats.bepress.com/cobra/ps/art5

Smyth,G.K. (2004) Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3, Article 3.

Tusher, V. G., Tibshirani, R. and Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response, Proceedings of the National Academy of Sciences of the United States of America, 98, 5116–5121.

Wilson,C., Pepper,S.D. and Miller,C.J. (2009) The vignette of simpleaffy package in Bioconductor.

http://www.bioconductor.org/packages/2.4/bioc/vignettes/simpleaffy/inst/doc/Q CandSimpleaffy.pdf

Zilliox,M.J. and Irizarry,R.A. (2007) A gene expression bar code for microarray data.

Nature Methods, 4, 911-913.

Table 2.1. Affymetrix human genome U133 dataset contains 14 spike-in gene groups in each of 14 experimental groups. This table shows the

Table 2.2. Probe IDs for the seventeen additional probe sets annotated in the HG-U133Atag CDF.

Spike Ins Non Spike Ins AFFX-r2-TagA_at AFFX-r2-TagO-3_at AFFX-r2-TagB_at AFFX-r2-TagO-5_at AFFX-r2-TagC_at AFFX-r2-TagIN-3_at AFFX-r2-TagD_at AFFX-r2-TagIN-5_at AFFX-r2-TagE_at AFFX-r2-TagQ-3_at AFFX-r2-TagF_at AFFX-r2-TagQ-5_at AFFX-r2-TagG_at AFFX-r2-TagJ-3_at AFFX-r2-TagH_at AFFX-r2-TagJ-5_at

AFFX-r2-TagIN-M_at

Table 3.1. Summary of the amount of delete data in QC step.

GEO AE Total

Before QC 1886 559 2445

Scale factor -359 -112 -471

Averages background -56 0 -56

3’/5’ ratios -302 -98 -400

Percent present calls -4 -13 -17

After QC 1165 336 1501

Remove different type 943 336 1279

Table 3.2. The frequency distribution among tissue types of the reference training set.

beta cell islets 1 Normal cervix 8 T cells resting 23

medulla oblongata 1 prostate 8 cerebellum 24

Normal Breast 1 TERV (cell line) 8 Normal Kidney 25

Normal Colon 1 smooth muscle 9 uterus 25

Normal Corpus 1 primary fibroblast cell

line 9 esophageal

epithelium 26

Normal Stomach 1 PBSC CD34 selected

cells 10 Frontal Cortex 26

Normal Thalamus 1 Baseline macrophages 11

blood (cell type :

neuroblastoma cells 12 skeletal muscle 33

salivary gland 2

Stratagene Universal Human Reference

RNA

12 Normal Caudate

Nucleus 33

pituitary 2 peripheral blood CD8

T cells 12 prefrontal cortex 33

Normal Amygdala 3 umbilical cord blood 13 duodenal tissue 40

trachea 3 Post-mortem medial

substantia nigra 15 peripheral blood

(human PBMC) 47

Pulp tissue 4 skin 16 white blood cells 48

occipital lobe 4 Undifferentiated

human ES cells 16 lateralis muscle 48

Theca cell 4 lymphoblastoid cell

lines 17

Human umbilical vein endothelial

cells

Normal_Ovary 5 Human optic nerve

head astrocytes 18 bone marrow 56

thyroid gland

(thyrocytes) 7 hypothalamus 22 lung 63

Normal Spleen 7 liver 22 whole blood 67

adipose tissue 8 Bronchial Epithelium 23

0 500 1000 1500 2000 2500

0510152025

Scale Factor

2445 arrays

sfs

0 500 1000 1500 2000

0100200300400500

Figure 3.1. The deleted data in each quality assessment step. The red dots represent data deleted in this step. The block dots are the data still kept after this step.

0 500 1000 1500 2000 2500

0510152025

Scale Factor

2445 arrays

sfs

0 500 1000 1500 2000 2500

0100200300400500600

Average Background

2445 arrays

avbg

0 500 1000 1500 2000 2500

-4-202468

Ratios(actin3/actin5)

2445 arrays

actin3.actin5

0 500 1000 1500 2000 2500

-20246

Ratios(gapdh3/gapdh5)

2445 arrays

gapdh3.gapdh5

0 500 1000 1500 2000 2500

102030405060

Percent Precent

2445 arrays

Figure 3.2. The total deleted data in each quality assessment step. The red dots represent all deleted data. The block dots are the data still kept after all the quality assessment steps.

Figure 3.3. These histograms are the log (base 2) expression distribution for gene

“200886_s_at”. The color lines are the smoothed densities, using various “n” and

“adjust” in function density(n, adjust) of R.

Figure 3.4. The red dots represent the rates of correctly identifying the two experiments (conditions) to be differentially expressed among 34 spike-in genes under various K. The blue dots are the rates of correctly identifying the two experiments (conditions) to be not differentially expressed among 4993 genes which were not in spike-in genes and had multiple modes.

Figure 4.1(a) Log (base 2) intensity distribution of one-mode gene.

Figure 4.1(b) Log (base 2) intensity distribution of two-mode gene which the second mode is close to the first mode.

Figure 4.1(c) Log (base 2) intensity distribution of two-mode gene which the second mode is more distant to the first mode.

Figure 4.1(d) Log (base 2) intensity distribution of gene which has more than two modes and the second mode is far away from the first mode.

Figure 4.2. Different combinations of arguments to fit the density smoother.

Combinations using the same adjust are assigned to the same color as shown in the legend.

Figure 4.3. The empirical intensity distributions and spike-in intensity distributions for two genes that are included as the spike-in genes and have multiple modes. The solid lines are the empirical distributions obtained through 1279 arrays of the reference training set, and the dotted lines are the intensity distributions using 42 spike-in arrays. The brilliant ticks act for the observed values of spike-in samples with color denoting the experimental group to which the observation belongs.

Figure 4.4. The empirical intensity distributions and spike-in intensity distributions for two genes that are included as spike-in genes and have only one mode. The solid lines are the empirical distributions obtained through 1279 arrays of the reference training set, and the dotted lines are the intensity distributions using 42 spike-in arrays.

The brilliant ticks act for the observed values of spike-in samples with color denoting the experimental group to which the observation belongs.

(a)

(b) (c)

Figure 4.5.1. The empirical intensity distribution and spike-in intensity distribution for a gene that is not included as the spike-in gene and has multiple modes, where the second mode is distant from the first mode. In (a), the empirical intensity distribution and spike-in intensity distribution are put together. (b) is for the empirical intensity distribution only, and (c) is for the spike-in intensity distribution only. The brilliant ticks act for the observed values of spike-in samples with color denoting the experimental group to which the observation belongs.

(a)

(b) (c)

Figure 4.5.2. The empirical intensity distribution and spike-in intensity distribution for a gene that is not included as the spike-in gene and has multiple modes, where the second mode is close to the first mode. In (a), the empirical intensity distribution and spike-in intensity distribution are put together. (b) is for the empirical intensity distribution only, and (c) is for the spike-in intensity distribution only. The brilliant ticks act for the observed values of spike-in samples with color denoting the experimental group to which the observation belongs.

(a)

(b) (c)

Figure 4.5.3. The empirical intensity distribution and spike-in intensity distribution for a gene that is not included as the spike-in gene and has multiple modes, where the second mode is far away from the first mode. In (a), the empirical intensity distribution and spike-in intensity distribution are put together. (b) is for the empirical intensity distribution only, and (c) is for the spike-in intensity distribution only. The brilliant ticks act for the observed values of spike-in samples with color denoting the experimental group to which the observation belongs.

(a)

(b) (c)

Figure 4.6.1. The empirical intensity distribution and spike-in intensity distribution for a gene that is not included as the in spike-in gene and has only one mode. In (a), the empirical intensity distribution and spike-in intensity distribution are put together.

(b) is for the empirical intensity distribution only, and (c) is for the spike-in intensity distribution only. The brilliant ticks act for the observed values of spike-in samples with color denoting the experimental group to which the observation belongs.

(a)

(b) (c)

Figure 4.6.2. The empirical intensity distribution and spike-in intensity distribution for a gene that is not included as the spike-in gene and has only one mode. In (a), the empirical intensity distribution and spike-in intensity distribution are put together. (b) is for the empirical intensity distribution only, and (c) is for the spike-in intensity distribution only. The brilliant ticks act for the observed values of spike-in samples with color denoting the experimental group to which the observation belongs.

Figure 4.7. ROC curves for six differential-expression methods, comparing the three replicate arrays from the 4th experimental group of the spike-in dataset with the three replicate arrays from the 10th experimental group. Various colors represent different differential-expression methods as shown in the legend.

Figure 4.8. ROC curves with FPs<100 for six differential-expression methods, comparing the three replicate arrays simulated based on the 4th experimental group of the spike-in dataset with the three replicate arrays simulated based on the 10th experimental group. Various colors represent different differential-expression methods as shown in the legend.

Figure 4.9. ROC curves with FPs<1000 for six differential-expression methods, comparing the three replicate arrays simulated based on the 4th experimental group of the spike-in dataset with the three replicate arrays simulated based on the 10th experimental group. Various colors represent different differential-expression methods as shown in the legend.

Figure 4.10. ROC curves with FPs<100 for six differential-expression methods, comparing the five replicate arrays simulated based on the 4th experimental group of the spike-in dataset with the five replicate arrays simulated based on the 10th experimental group. Various colors represent different differential-expression methods as shown in the legend.

Figure 4.11. ROC curves with FPs<100 for six differential-expression methods, comparing the ten replicate arrays simulated based on the 4th experimental group of the spike-in dataset with the ten replicate arrays simulated based on the 10th experimental group. Various colors represent different differential-expression methods as shown in the legend.

在文檔中基因表現量晶片資料模擬器－使用公開之晶片資料庫 (頁 41-66)