Assessment of validity - Materials and Methods

3 Materials and Methods



β = . In general, we are interested in testing whether individual contrast values β_gj are equal to zero. The basic statistic for a given contrast β_gj is the moderated t-statistic, in which posterior residual standard deviations, obtained by an empirical Bayes approach, are used in place of ordinary standard deviations. An alternative statistic, the B-statistic, represents the log posterior odds that the gene is differentially expressed. The B-statistic is the default in the limma package, and we use it as the score of significance.
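For reference, the moderated t-statistic described above is commonly written as below. This is a sketch in the usual limma notation, none of which is defined in this section, so every symbol here is an assumption: the estimated contrast is \hat{\beta}_{gj}, its unscaled variance is v_{gj}, the ordinary residual variance for gene g is s_g^2 with d_g degrees of freedom, and s_0^2 and d_0 are the prior variance and prior degrees of freedom estimated from all genes.

```latex
\tilde{t}_{gj} = \frac{\hat{\beta}_{gj}}{\tilde{s}_g \sqrt{v_{gj}}},
\qquad
\tilde{s}_g^{2} = \frac{d_0 s_0^{2} + d_g s_g^{2}}{d_0 + d_g}
```

The posterior variance \tilde{s}_g^2 shrinks each gene's own variance toward the prior value, which is what distinguishes the moderated statistic from an ordinary t-statistic.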

3.2 Assessment of validity

To compare the combinations properly in terms of validity, the truly differentially expressed genes of each dataset must be known. We therefore choose three datasets from spike-in experiments, in which gene fragments have been added at known concentrations: the human genome U95 dataset from Affymetrix, the human genome U133 dataset from Affymetrix, and a wholly defined control spike-in dataset (Choe et al., 2005). The three datasets contain different numbers of spike-in genes. ROC curves are used for the evaluation. We describe the three datasets briefly as follows.

Affymetrix human genome U95 dataset (HGU95)

This dataset was used to develop and validate the MAS 5.0 algorithm. It consists of 59 arrays, in which 14 different cRNA gene fragments have been spiked in at known concentrations ranging from 0.25 to 1024 pM. Apart from the 14 spike-in genes, a common background cRNA has been added to all arrays. The 14 spike-in genes are arranged in a format similar to a 14×14 cyclic Latin square design, with each concentration appearing once in each row and column. The difference from a true 14×14 cyclic Latin square is that two of the 14 spike-in genes are spiked in at the same concentrations across arrays (Table 3). Most experimental groups contain 3 replicates; the exceptions are the 3rd experimental group, which contains only 2 replicates, and the 13th and 14th experimental groups, which each contain 12 replicates. For more details, see the Affymetrix website. The dataset is available at

http://www.affymetrix.com/support/technical/sample_data/datasets.affx
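The cyclic Latin square layout can be illustrated with a short sketch. The function name and the list of concentration levels are assumptions made for illustration (the text states only the range 0.25 to 1024 pM), and the sketch builds an ideal square, without the two duplicated genes or the unequal replicate counts of the real dataset.

```python
def cyclic_latin_square(levels):
    """Return an n x n cyclic Latin square built from a list of n levels.

    Row i is the list of levels rotated left by i positions, so every
    level appears exactly once in each row and in each column.
    """
    n = len(levels)
    return [[levels[(i + j) % n] for j in range(n)] for i in range(n)]


# Hypothetical concentration levels in pM (an assumption for illustration).
levels = [0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

# Rows stand for experimental groups, columns for spike-in genes.
design = cyclic_latin_square(levels)
```

Reading across a row gives the concentration of each spike-in gene in one experimental group; reading down a column gives the concentrations of one gene across groups.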

Affymetrix human genome U133 dataset (HGU133)

Unlike the HGU95 dataset above, this dataset includes many more spike-in genes and a lower minimum spike-in concentration (0.125 pM). It consists of 14 gene groups in 14 experimental groups, arranged in a cyclic Latin square design over gene groups and experimental groups. Each gene group contains 3 spike-in genes and each experimental group contains 3 replicates, giving a total of 42 spike-in genes and 42 arrays. For more details, see the Affymetrix website. The dataset is available at

http://www.affymetrix.com/support/technical/sample_data/datasets.affx

A wholly defined control spike-in dataset (Golden Spike)

Choe et al. (2005) generated a control dataset consisting of two sets of triplicate hybridizations to Affymetrix GeneChips. The two sets are called the spike-in samples and the control samples, giving a total of 6 arrays. This dataset has three main features: (1) 1331 spike-in genes are spiked in at known relative concentrations between the spike-in and control samples, so a larger fraction of genes is differentially expressed; (2) the background is a defined sample of 2535 genes present at identical concentrations in both spike-in and control samples, rather than a biological RNA sample of unknown composition; and (3) the fold changes are lower, beginning at only a 1.2-fold concentration difference. The dataset is available at

http://www.ccr.buffalo.edu/halfon/spike/index.html

Methodology of comparison

To evaluate the validity of these combinations, we use the receiver operating characteristic (ROC) curve. The ROC curve, which is widely used to evaluate differential expression methods in microarray analysis, is a plot of sensitivity versus 1 − specificity for a binary classifier as its discrimination threshold is varied. Sensitivity and specificity measure how well a binary classification test identifies the truth.

Sensitivity is defined as the probability that the test makes a positive decision given that the truth is positive; it is also known as the true positive rate (TPR). Specificity is defined as the probability that the test makes a negative decision given that the truth is negative. Hence 1 − specificity is the probability that a positive decision is made when the truth is negative, which is the false positive rate (FPR). For most differential expression methods, the null hypothesis is that a gene is expressed equally under the two conditions. The four possible outcomes of a test are summarized in the following table.

                                        Null hypothesis H₀ (non-differentially expressed)
                                        False               True
Reject H₀ (called significant)          TP (1−β)            FP (α)
Not reject H₀ (not called significant)  FN                  TN

TP: true positive; FP: false positive; FN: false negative; TN: true negative

TPR: true positive rate (sensitivity); FPR: false positive rate (1 − specificity)
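As a minimal sketch, the two rates follow directly from the four counts in the table. The function name and the example counts are hypothetical and are not taken from any of the three datasets.

```python
def rates(tp, fp, fn, tn):
    """Return (TPR, FPR) from the four outcome counts of a test.

    TPR (sensitivity)     = TP / (TP + FN)
    FPR (1 - specificity) = FP / (FP + TN)
    """
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr


# Hypothetical counts: 14 truly positive genes and 1000 truly negative genes.
tpr, fpr = rates(tp=12, fp=30, fn=2, tn=970)
```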

Thus, the ROC curve is equivalently a plot of the false positive rate (FPR) on the x-axis against the true positive rate (TPR) on the y-axis. It provides a means of selecting possibly optimal methods by comparing the area under the ROC curve (AUC). The area measures discrimination, that is, the ability of the test to correctly separate positive cases from negative cases. The AUC ranges from 0 to 1, since both axes range from 0 to 1. The larger the AUC, the better the overall performance of the test. We use the ROC curve and the AUC as criteria to assess the validity of different combinations.
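The construction of an ROC curve and its AUC can be sketched as follows. This is an illustrative implementation, not the code used in this study: the function names, the scores, and the labels are hypothetical, ties between scores are not treated specially, and the area is approximated by the trapezoidal rule.

```python
def roc_points(scores, is_positive):
    """Return (fpr, tpr) lists traced by lowering the threshold gene by gene.

    Genes are ranked from highest to lowest score; each step calls one more
    gene significant and updates the true and false positive counts.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    for i in order:
        if is_positive[i]:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))


# Hypothetical significance scores and truth labels for six genes.
scores = [4.1, 3.0, 2.2, 1.5, 0.3, -1.0]
is_positive = [True, True, False, True, False, False]
fpr, tpr = roc_points(scores, is_positive)
area = auc(fpr, tpr)
```

With these made-up values, one negative gene outranks one positive gene, so the area falls slightly below 1.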

Here we briefly describe how an average ROC curve is obtained for a given combination of dataset, preprocessing method, and differential expression method. For each spike-in dataset, spike-in genes are considered true positives and non-spike-in genes true negatives. Within each dataset, different experimental groups have spike-in genes at different concentrations, so only replicates are regarded as belonging to the same experimental group. For each pair of experimental groups, we compute the numbers of true positives (TP) and false positives (FP) over a wide range of thresholds. To form an average ROC curve, we average the TP values over all pairs at each FP value. The average ROC curve is then created by plotting FP against the average TP, and the area under the average ROC curve is the measure for this combination (Cope et al., 2004).
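The averaging step can be sketched as below, under stated assumptions. For each pairwise comparison, the sketch records the TP count reached when k false positives are allowed, for k = 0 to a chosen maximum, and then averages these counts across comparisons at each k. The function names, the `max_fp` cutoff, and the example scores are hypothetical, and the exact procedure of Cope et al. (2004) may differ in detail.

```python
def tp_at_each_fp(scores, is_positive, max_fp):
    """Return a list whose entry k is the TP count with k false positives allowed.

    Genes are ranked from highest to lowest score. Entry k is the number of
    true positives ranked above the (k + 1)-th negative gene.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp_by_fp = []
    tp = 0
    for i in order:
        if len(tp_by_fp) > max_fp:
            break
        if is_positive[i]:
            tp += 1
        else:
            tp_by_fp.append(tp)
    # If there were fewer negatives than max_fp + 1, repeat the final TP count.
    while len(tp_by_fp) <= max_fp:
        tp_by_fp.append(tp)
    return tp_by_fp


def average_roc(comparisons, max_fp):
    """Average the TP count over all comparisons at each FP value 0..max_fp."""
    curves = [tp_at_each_fp(s, y, max_fp) for s, y in comparisons]
    return [sum(c[k] for c in curves) / len(curves) for k in range(max_fp + 1)]


# Two hypothetical pairwise comparisons over the same five genes.
comparisons = [
    ([5, 4, 3, 2, 1], [True, False, True, False, False]),
    ([5, 4, 3, 2, 1], [False, True, True, False, False]),
]
avg_tp = average_roc(comparisons, max_fp=2)
```

Plotting the FP values 0, 1, 2 against `avg_tp` gives the average ROC curve on the count scale; dividing by the numbers of negatives and positives converts it to rates.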

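Source document: Using validity and reliability to compare combinations of preprocessing methods and differential expression methods for Affymetrix microarray GeneChips (pp. 42-45)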