RESULTS - Gene Expression Microarray Data Simulator Using Public Microarray Databases

4.1 Gene-specific expression distribution

After the raw data of all 1279 arrays were preprocessed together using justRMA, we fitted a kernel density smoother to the expression values of every gene using the R function density() with n=512 and adjust=3. The intensity distributions of some randomly selected genes are shown in Figure 4.1. In Figure 4.1(a), the gene displays only one mode and is believed to be either expressed in all tissues or unexpressed in all tissues. In Figure 4.1(b) to Figure 4.1(d), multiple modes are observed. By assumption, the lowest-intensity mode reflects a lack of expression. The second mode is close to the first mode in Figure 4.1(b), whereas the two modes are farther apart in Figure 4.1(c).

More than two modes can be seen in Figure 4.1(d). Among the 22283 genes on the Affymetrix HG-U133A array, 5005 genes have multiple modes, with distributions similar to Figure 4.1(b), Figure 4.1(c) or Figure 4.1(d), and the remaining 17278 genes have a single mode, with distributions like Figure 4.1(a).
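The density smoothing and mode counting described above can be sketched as follows. This is a minimal Python illustration, not the R code used in this work: the function names (nrd0_bandwidth, gaussian_kde_grid, count_modes) are hypothetical, the bandwidth rule is a simplified version of the rule-of-thumb default of density(), the density is evaluated directly rather than by the binned approximation R uses, and the rule for deciding what counts as a mode (a local maximum at least 5% as high as the tallest peak) is an assumption, since the criterion is not stated in the text.

```python
import math
import statistics

def nrd0_bandwidth(x):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR/1.34) * n^(-1/5).
    Simplified; falls back to sd (or 1.0) when the spread is zero."""
    n = len(x)
    sd = statistics.stdev(x)
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    spread = min(sd, (q3 - q1) / 1.34) or sd or 1.0
    return 0.9 * spread * n ** (-0.2)

def gaussian_kde_grid(x, n=512, adjust=3.0, cut=3.0):
    """Evaluate a Gaussian kernel density estimate on n equally spaced
    points spanning [min(x) - cut*bw, max(x) + cut*bw]."""
    bw = adjust * nrd0_bandwidth(x)
    lo, hi = min(x) - cut * bw, max(x) + cut * bw
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    norm = 1.0 / (len(x) * bw * math.sqrt(2.0 * math.pi))
    dens = [norm * sum(math.exp(-0.5 * ((g - xi) / bw) ** 2) for xi in x)
            for g in grid]
    return grid, dens

def count_modes(dens, min_rel_height=0.05):
    """Count local maxima of the density curve, ignoring bumps lower than
    min_rel_height times the tallest peak (assumed threshold)."""
    peak = max(dens)
    modes = 0
    for i in range(1, len(dens) - 1):
        is_local_max = dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]
        if is_local_max and dens[i] >= min_rel_height * peak:
            modes += 1
    return modes

# Made-up log2 intensities: one gene with a single cluster of values,
# one gene unexpressed in most arrays and expressed in a few.
one_mode_gene = [7 + 0.01 * i for i in range(-50, 51)]
two_mode_gene = ([4 + 0.01 * i for i in range(-40, 40)]
                 + [11 + 0.02 * i for i in range(-10, 10)])

for name, values in [("one_mode_gene", one_mode_gene),
                     ("two_mode_gene", two_mode_gene)]:
    _, dens = gaussian_kde_grid(values)
    print(name, count_modes(dens))
```

Because adjust=3 triples the bandwidth, two clusters of values are reported as separate modes only when they are clearly separated relative to the spread of the data, which is one reason a large adjust value tends to yield conservative mode counts.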

4.2 Comparison with the HG-U133A tag spike-in dataset

After removing the 17 tag probe sets that exist only on HG-U133A tag and not on HG-U133A, the 42 spike-in arrays contained exactly the same 22283 genes as the HG-U133A arrays. We then used the RefPlus package of R to preprocess the spike-in dataset, so that the resulting log (base 2) expression values were on the same RMA scale as the reference set (Chang et al., 2006).

To obtain density smoothers of the spike-in data, the density(n, adjust) function of R was used. We plotted histograms of the log (base 2) expression values for some randomly selected genes and overlaid smoothing curves produced by different argument settings (Figure 4.2). After examining all these plots, we set the smoothness parameters to n=512 and adjust=3 for creating the density distributions.

Then, for each gene, we compared the empirical distribution estimated from the training set with the intensity distribution estimated from the spike-in data. Genes can be divided into four categories: spike-in genes with multiple modes (Figure 4.3), spike-in genes with only one mode (Figure 4.4), non-spike-in genes with multiple modes (Figure 4.5), and non-spike-in genes with one mode (Figure 4.6). In these figures, the solid lines represent the empirical distribution computed from the 1279 arrays of the reference training set, and the dotted lines represent the intensity distribution computed from the 42 spike-in arrays. The colored tick marks show the observed values of the spike-in samples, with color denoting the experimental group to which each observation belongs. For the spike-in genes, although the empirical densities differ from the spike-in densities, the patterns of the two resemble each other (Figure 4.3 and Figure 4.4). In Figure 4.5 and Figure 4.6, because the densities of the training set and the spike-in data differed greatly, we plotted the empirical distribution and the spike-in intensity distribution separately to show their patterns clearly.

For genes that are not spike-in genes and have multiple modes, the empirical density and the spike-in intensity distribution share a similar pattern when the first and second modes are not too far apart (Figure 4.5.1 and Figure 4.5.2). In contrast, in Figure 4.5.3, genes whose first and second modes are far apart do not appear to have the same pattern in the empirical density and the spike-in intensity distribution. For the genes in Figure 4.6.1 and Figure 4.6.2, the two distributions also share similar characteristics.

4.3 Simulation based on the spike-in dataset (exp 4 vs. exp 10)

The Affymetrix HG-U133A spike-in dataset is used to determine the sensitivity and specificity of various methods for the analysis of microarray data (Choe and Boutros, 2005). Because the truly differentially expressed spike-in genes are known, the performance of six differential expression methods can be assessed: fold-change (FC), two-sample t-test, Welch t-test, SAM, EBarrays and limma. Here we simulate gene expression intensities of cases and controls that mimic the expression patterns of experiments no. 4 and no. 10 of the spike-in dataset. We aim to see whether these simulated expression values can be used to assess the performance of the six differential expression methods and whether they lead to conclusions similar to those obtained from the spike-in dataset.
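For reference, the three simplest of these statistics can be written down directly for one gene. The sketch below is a Python illustration under stated assumptions: the function names are hypothetical, the inputs are log (base 2) expression values so that the fold-change is a difference of group means, and the example numbers are made up. SAM, EBarrays and limma rely on moderated or model-based statistics from their own packages and are not reproduced here.

```python
import math
import statistics

def log2_fold_change(case, control):
    """Fold-change on the log2 scale: difference of the group means."""
    return statistics.mean(case) - statistics.mean(control)

def student_t(case, control):
    """Two-sample t statistic with a pooled (equal-variance) estimate."""
    n1, n2 = len(case), len(control)
    v1, v2 = statistics.variance(case), statistics.variance(control)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    return log2_fold_change(case, control) / se

def welch_t(case, control):
    """Welch t statistic (unequal variances) and its approximate
    Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(case), len(control)
    a = statistics.variance(case) / n1
    b = statistics.variance(control) / n2
    t = log2_fold_change(case, control) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df

# Made-up log2 intensities for one gene, three replicates per group.
case = [9.0, 9.2, 9.4]
control = [7.0, 7.1, 7.3]

print(log2_fold_change(case, control))
print(student_t(case, control))
print(welch_t(case, control))
```

With equal group sizes the two t statistics coincide and only the degrees of freedom differ, so with three replicates per group the Welch test has fewer effective degrees of freedom than the pooled test.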

Three replicate arrays were simulated separately for each of the 4th and 10th experimental groups. Simulating one group took about two minutes.

Then we created ROC curves for the six differential expression methods and compared the simulated data with the real spike-in dataset (Figure 4.7 and Figure 4.8). The ROC curves here plot the number of false positives (FPs) on the x axis against the number of true positives (TPs) on the y axis. In the spike-in dataset, the growth in TPs for all six methods had already flattened once FPs>100, so we first focused on the region FPs<100 in both the spike-in dataset and the simulated dataset. Although the number of replicate arrays was the same in the simulated and real spike-in data, the ability to detect differentially expressed genes was lower with the simulated data than with the real spike-in dataset. Because TPs in the simulated data continued to rise sharply beyond FPs>100, we also obtained the ROC curve for the region FPs<1000 on the simulated data to see a more complete pattern (Figure 4.9). We compared it with the ROC curve from the real spike-in dataset and found that the two curves show comparable trends. Methods such as EBarrays(LNN), FC and limma perform very well in both the real spike-in dataset and the simulated dataset. In contrast, the Welch t-test performs poorly in both datasets. The performance of SAM differs markedly between the real spike-in dataset and the simulated dataset.
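The count-based ROC curves used here can be built from any per-gene score once the true spike-in genes are known. The following is a minimal Python sketch, not the code used to produce the figures: the function names are hypothetical, genes are assumed to be ranked by the absolute value of their score (appropriate for two-sided statistics such as a t statistic or log fold-change), ties are broken by input order, and the scores and labels are made up.

```python
def tp_fp_curve(scores, is_true_de):
    """Walk down the gene list from the largest |score| to the smallest and
    record the cumulative (false positives, true positives) at each step."""
    order = sorted(range(len(scores)), key=lambda i: -abs(scores[i]))
    fp = tp = 0
    curve = [(0, 0)]
    for i in order:
        if is_true_de[i]:
            tp += 1
        else:
            fp += 1
        curve.append((fp, tp))
    return curve

def tp_at_fp(curve, max_fp):
    """Largest number of true positives reachable with at most max_fp
    false positives, e.g. to restrict attention to a region of the curve."""
    return max(tp for fp, tp in curve if fp <= max_fp)

# Made-up example: six genes, three of which are truly differentially expressed.
scores = [5.0, -4.0, 0.5, 3.0, -0.2, 1.0]
truth = [True, True, False, False, True, False]

curve = tp_fp_curve(scores, truth)
print(curve)
print(tp_at_fp(curve, 0), tp_at_fp(curve, 3))
```

Plotting the first coordinate of each pair against the second gives a curve of the same kind as described in the text, and truncating it at a chosen number of FPs corresponds to focusing on one region of the plot.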

To improve the power to detect differentially expressed genes, we simulated more replicate arrays for the same two experimental groups. Simulating five arrays took about three minutes per group. Compared with the real spike-in dataset, and despite the larger number of simulated replicate arrays, the simulated data still had less power to detect differentially expressed genes (Figure 4.10). In Figure 4.10, limma and SAM perform very well in both the real spike-in dataset and the simulated dataset, whereas FC and EBarrays(LNN) perform differently in the two datasets: they detect differentially expressed genes well in the real spike-in dataset but poorly in the simulated dataset. In addition, simulating ten arrays took about five minutes. The performance of the differential expression methods in this case is shown in Figure 4.11. All of the six methods except FC and EBarrays(LNN) show high power to detect differentially expressed genes.
