
2 Literature Review

2.6 Datasets

Our purpose is to evaluate which combinations of preprocessing and differential expression methods perform well, in terms of both validity and reliability. To compare the combinations in terms of validity, the truly differentially expressed genes in a dataset must be known. In one kind of microarray experiment, called a “spike-in experiment”, gene fragments are added at known concentrations; these genes are called spike-in genes. To evaluate validity, we choose three spike-in datasets: the human genome U95 dataset from Affymetrix, the human genome U133 dataset from Affymetrix, and a wholly defined control spike-in dataset (Choe et al., 2005). To compare the method combinations in terms of reliability, we use a dataset generated from rat samples that were distributed evenly to different test sites (Guo et al., 2006). We use four datasets in total and describe each briefly below.

Affymetrix human genome U95 dataset (HGU95)

The human dataset with array type HG-U95A consists of a series of genes spiked in at known concentrations and arrayed in a format analogous to a cyclic Latin Square, although it differs slightly from a true cyclic Latin Square. The arrays represent a subset of the data used to develop and validate the Affymetrix Microarray Suite (MAS) 5.0 algorithm.

A standard 14×14 cyclic Latin Square design consists of 14 gene groups in 14 experimental groups. Each gene group contains only one spike-in gene, and each experimental group contains the same 14 spiked-in gene groups, but at different concentrations. For example, the concentrations of the 14 gene groups in the first experimental group are 0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 pM. Each subsequent experimental group rotates the spike-in concentrations by one group; i.e., experimental group 2 begins with 0.25 pM and ends with 0 pM, and so on up to experimental group 14, which begins with 1024 pM and ends with 512 pM. In addition to the 14 spike-in genes, a common background cRNA has been added to all arrays.
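The rotation described above can be sketched in a few lines of Python. This is only an illustration of the standard 14×14 cyclic design under the concentrations listed in the text; the function name `cyclic_latin_square` and the list-of-rows layout are our own assumptions, not part of any Affymetrix tool.

```python
# Concentrations (pM) of the 14 gene groups in experimental group 1,
# as listed in the text.
BASE_PM = [0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]


def cyclic_latin_square(base):
    """Return a list of rows; row i is `base` rotated left by i positions.

    Row i holds the concentrations of experimental group i + 1, and
    column j holds the concentrations of gene group j + 1.
    (Hypothetical helper, introduced here for illustration only.)
    """
    n = len(base)
    return [[base[(i + j) % n] for j in range(n)] for i in range(n)]


design = cyclic_latin_square(BASE_PM)
for group_number, row in enumerate(design, start=1):
    print(group_number, row)
```

Every row and every column of `design` contains each concentration exactly once, which is the defining property of a Latin Square.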

The Affymetrix human genome U95 dataset contains 14 human genes in each of 14 experimental groups. Most gene groups contain 1 gene. The exceptions are group 1, which contains 2 genes, and group 12, which is empty. Specifically, transcript 407_at, listed as present in group 12, is actually included in group 1 (together with 37777_at). The details are shown in Table 3, where the columns represent the 14 spiked-in gene groups and the rows represent the 14 experimental groups. The first row shows the gene name in each gene group.

Most experimental groups contain 3 replicates; the exceptions are the 3rd experimental group, which contains only 2 replicates, and the 13th and 14th experimental groups, which contain 12 replicates each. The replicates across all groups give a total of 59 arrays. This dataset is available at http://www.affymetrix.com/support/technical/sample_data/datasets.affx
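As a quick consistency check on the replicate counts stated above, the total number of arrays can be recomputed as follows. The dictionary layout is our own choice for illustration.

```python
# Replicates per experimental group in HGU95, as stated in the text:
# 3 by default, 2 for group 3, and 12 for groups 13 and 14.
replicates = {group: 3 for group in range(1, 15)}
replicates[3] = 2
replicates[13] = 12
replicates[14] = 12

total_arrays = sum(replicates.values())
print(total_arrays)  # 59
```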

Some researchers have reported that this dataset contains 16 spike-in probesets rather than the 14 originally described by Affymetrix (Cope et al., 2004). The two additional genes are "33818_at" and "546_at". They claimed that "33818_at" shows the pattern of gene group 12, which is missing from the Latin Square, as agreed by three methods of calculating expression (RMA, MAS 5.0, dChip). They also claimed that "546_at" should be treated as having the same concentration as "36202_at" in gene group 9, since the three methods show the same pattern for both. Wolfinger and Chu (2002) identified both of these as well. Because the preprocessing methods we compare are not limited to these three, treating the two genes as spike-in genes may not be advisable. For this reason, we treat the 14 genes originally described by Affymetrix as the complete set of spike-in genes.

Affymetrix human genome U133 dataset (HGU133)

This dataset, with array type HG-U133A_tag, consists of genes spiked in at known concentrations and arrayed in a cyclic Latin Square format. The dataset is expected to be useful for the development and comparison of expression analysis methods. Unlike the HGU95 dataset above, it includes many more spikes and a lower minimum spike concentration (0.125 pM).

This dataset consists of 14 spiked-in gene groups in 14 experimental groups. Unlike the HGU95 dataset above, each gene group contains three spike-in genes, so there are 42 spike-in genes in total. Each experimental group contains the same 42 spiked-in genes, but the genes in different gene groups are spiked in at different concentrations. For example, the concentrations of the 14 gene groups in the first experimental group are 0, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, and 512 pM. Each subsequent experimental group rotates the spike-in concentrations by one group; i.e., experimental group 2 begins with 0.125 pM and ends with 0 pM, and so on up to experimental group 14, which begins with 512 pM and ends with 256 pM. The details are shown in Table 4.
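Because the spike-in concentrations are known, the nominal fold change of each gene group between any two experimental groups follows directly from the design. The sketch below illustrates this for the HGU133 layout described above. The helper names `concentration` and `log2_fold_changes` are hypothetical, and returning `None` when either concentration is 0 pM is an assumption made here for illustration, not a convention taken from the source.

```python
import math

# Concentrations (pM) of the 14 gene groups in experimental group 1 (HGU133),
# as listed in the text.
BASE_PM = [0, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
GENES_PER_GROUP = 3


def concentration(experiment, gene_group, base=BASE_PM):
    """Concentration (pM) of a gene group in an experimental group (both 1-based)."""
    return base[(experiment - 1 + gene_group - 1) % len(base)]


def log2_fold_changes(experiment_a, experiment_b):
    """Nominal log2(b / a) per gene group; None when either concentration is 0."""
    changes = {}
    for gene_group in range(1, len(BASE_PM) + 1):
        a = concentration(experiment_a, gene_group)
        b = concentration(experiment_b, gene_group)
        changes[gene_group] = math.log2(b / a) if a > 0 and b > 0 else None
    return changes


n_spike_in_genes = GENES_PER_GROUP * len(BASE_PM)
changes = log2_fold_changes(1, 2)
print(n_spike_in_genes)  # 42
print(changes)
```

Under this sketch, comparing experimental groups 1 and 2 gives a nominal two-fold increase for gene groups 2 to 13, while gene groups 1 and 14 each involve a 0 pM concentration and are reported as `None`.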

As in the HGU95 dataset, all arrays share a common background cRNA in addition to the 42 spike-in genes. Each experimental group contains 3 replicates, giving a total of 42 arrays. This dataset is available at http://www.affymetrix.com/support/technical/sample_data/datasets.affx .

A wholly defined control spike-in dataset

Because of the vast number of genes interrogated in a microarray experiment, only a relatively small fraction of gene expression differences tends to be validated in any given study. Choe et al. (2005) generated a new control dataset for the purpose of evaluating methods for identifying differentially expressed genes between two sets of triplicated hybridizations to Affymetrix GeneChips. The two sets are called spike-in samples and control samples, giving a total of 6 arrays. This dataset has three main features that facilitate the relative assessment of different analysis options. First, the experiment has 1331 spike-in genes spiked in at known relative concentrations between the spike-in and control samples, so the dataset has a larger fraction of gene expression differences than typical spike-in datasets. Second, the experiment used a defined background sample of 2535 genes present at identical concentrations in both spike-in and control samples, rather than a biological RNA sample of unknown composition. Third, the dataset includes low fold changes, ranging from only a 1.2-fold to a 4-fold concentration difference. This dataset is available at http://www.ccr.buffalo.edu/halfon/spike/index.html.

Table 2 summarizes the three spike-in datasets.

Rat dataset

The dataset we use is part of the complete dataset from a rat toxicogenomic study, which is one of the reference datasets of the MAQC (MicroArray Quality Control) project (http://www.fda.gov/nctr/science/centers/toxicoinformatics/maqc/). The purpose of the MAQC project is to provide quality control tools to the microarray community in order to avoid procedural failures, and to develop guidelines for microarray data analysis by providing the public with large reference datasets along with readily accessible reference RNA samples. The rat toxicogenomic dataset was generated using 36 RNA samples from rats treated with three chemicals (aristolochic acid, riddelliine and comfrey). In total there were six treatment/tissue groups: kidney from aristolochic acid–treated rats (K_AA), kidney from vehicle control (K_CTR), liver from aristolochic acid–treated rats (L_AA), liver from riddelliine-treated rats (L_RDL), liver from comfrey-treated rats (L_CFY) and liver from vehicle control (L_CTR). Within each treatment/tissue group there were six biological replicates.

Aliquots of these samples were prepared and distributed to each of the test sites for gene expression profiling using microarrays from four different platforms (Affymetrix, Agilent, Applied Biosystems and GE Healthcare). Two test sites used the Affymetrix platform, and we adopt only the data from these two sites. Each of the two sites generated 36 arrays. In this paper, the Rat dataset refers to the 72 arrays generated at the two sites using the Affymetrix platform. This dataset is available at http://www.fda.gov/nctr/science/centers/toxicoinformatics/maqc/ .
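The structure of the Rat dataset (six treatment/tissue groups, six biological replicates, two Affymetrix test sites) can be enumerated as in the sketch below. The site labels `site1` and `site2` and the array naming scheme are hypothetical placeholders, not identifiers from the MAQC data.

```python
from itertools import product

# Treatment/tissue groups as named in the text.
GROUPS = ["K_AA", "K_CTR", "L_AA", "L_RDL", "L_CFY", "L_CTR"]
N_REPLICATES = 6
# Hypothetical labels for the two Affymetrix test sites.
SITES = ["site1", "site2"]

# One label per array: site, treatment/tissue group, replicate number.
arrays = [
    f"{site}_{group}_{rep}"
    for site, group, rep in product(SITES, GROUPS, range(1, N_REPLICATES + 1))
]
print(len(arrays))  # 72
```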
