
3 MATERIALS AND METHODS

3.1 THE REFERENCE TRAINING SET

3.1.1 Microarray Retriever

We downloaded all the raw data (.CEL files) for Human Genome U133A (HG-U133A) arrays that we could find in the Gene Expression Omnibus (GEO) and ArrayExpress (AE) using Microarray Retriever (MaRe) (Ivliev et al., 2008). The MaRe web interface contains three boxes for entering query terms. To limit the search to the human species and HG-U133A arrays, we chose “Homo sapiens” as the specified species and entered “GPL96” and “A-AFFY-33”, the platform accession numbers of HG-U133A, in the platform accessions field. Since only individual gene expressions were needed, “Retrieve only GSE” was chosen for GEO. “Not retrieved from GEO” was chosen for ArrayExpress to avoid overlap with experiments that already existed in GEO. The “Retrieve raw data” checkbox was also checked. We then entered an email address in the “Start search” box to start the search. The search returns the raw data that meet the search criteria. In total, 701 experiments were obtained from MaRe.

3.1.2 Obtaining normal controls

The downloaded raw data contained samples from a variety of conditions. Only normal controls were used to create the reference training set. For example, series GSE10072 from GEO was an experiment with 49 samples of normal lung tissue and 58 samples of adenocarcinoma of the lung. In this case, we retained only the 49 samples of normal lung tissue and discarded the 58 samples of adenocarcinoma of the lung. After removing files that were not from normal controls, 1886 .CEL files from GEO and 559 .CEL files from AE were kept.

3.1.3 Quality assessment metrics

The data quality of each array was verified with the qc function of the simpleaffy package in BioConductor. The qc function generates the most commonly used quality assessment metrics, described below. All of these metrics are parameters computed for or from the MAS 5.0 (Microarray Suite software, Version 5.0) algorithm (Wilson et al., 2009).

Scale Factor

Under the assumption that gene expression does not change significantly for the vast majority of transcripts in an experiment, the trimmed mean intensity of each array should be constant. MAS 5.0 scales the intensities of every sample so that each array has the same mean. Since the “Scale Factor” represents the amount of scaling applied, it provides a measure of the overall expression level of an array (Wilson et al., 2009).

In our quality assessment process, we propose a “stepwise” scale factor refinement. First, the scale factors of our samples should be within 6-fold of one another. To obtain the 6-fold region, we calculated the mean and the of log (base 2) scale factors from all arrays in advance, and the region was the one between the borders of 3 up or down from the mean value. Arrays whose log (base 2) scale factors fell outside this region were removed. Next, the scale factors of the remaining samples were required to be within 4-fold of one another. After removing the arrays outside the 4-fold range, we further removed those outside the 3-fold range of the scale factors of the samples still retained. In the end, 1974 arrays were kept.
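The stepwise refinement can be illustrated with a minimal sketch. It is not the procedure actually run for this study: the window definition is an assumption (a region of total width log2(fold), centred on the mean of the log2 scale factors and recomputed at every step), and the function names and example values are hypothetical. The scale factors are assumed to have been exported beforehand from the simpleaffy qc output.

```python
import math


def within_fold_window(scale_factors, fold):
    """Keep arrays whose log2 scale factor lies inside a window of total
    width log2(fold) centred on the mean log2 scale factor.

    ASSUMPTION: this centred-window reading is one interpretation of the
    text; any two retained arrays are then within `fold` of one another.
    """
    if not scale_factors:
        return {}
    logs = {name: math.log2(sf) for name, sf in scale_factors.items()}
    mean_log = sum(logs.values()) / len(logs)
    half_width = math.log2(fold) / 2
    return {
        name: scale_factors[name]
        for name, value in logs.items()
        if abs(value - mean_log) <= half_width
    }


def stepwise_refinement(scale_factors, folds=(6, 4, 3)):
    """Apply the fold filter repeatedly, recomputing the mean each time."""
    kept = dict(scale_factors)
    for fold in folds:
        kept = within_fold_window(kept, fold)
    return kept


# Hypothetical scale factors keyed by array name.
example = {"a": 1.0, "b": 1.2, "c": 0.9, "d": 8.0, "e": 0.1}
kept = stepwise_refinement(example)
print(sorted(kept))
```

Recomputing the mean at every step matters because extreme arrays removed in the 6-fold pass would otherwise keep shifting the centre of the narrower windows.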

Average background

A significant difference between the average backgrounds of arrays results from large differences in the brightness of the arrays. The average backgrounds should be similar across all chips (Wilson et al., 2009). To avoid dramatic variation in array intensity, we removed arrays with an extreme average background. According to the figure, we recommend that the average background value be below 300. After this step, 1918 arrays remained.

3’ to 5’ ratios

The 3’ to 5’ ratio compares the amount of signal from the 3’ probeset with that from either the mid or the 5’ probeset. It therefore makes it possible to measure not only the quality of the RNA hybridized to the chip but also the RMA quality. Affymetrix suggests that the beta-actin 3’:5’ ratio should be below 3 and that a GAPDH 3’:5’ ratio below 1.25 is acceptable (Wilson et al., 2009). After removing the unsatisfactory arrays, we had 1501 arrays.

Number of genes called present (% Present)

Based on the difference between the PM and MM values of each probe pair in a probeset, the probeset is assigned a Present, Marginal, or Absent call. A Marginal or Absent call is made when the PM probe values are not considered to be significantly above the MM probe values. Large differences between the numbers of genes called present on different arrays can occur when different amounts of labeled RNA have been successfully hybridized to the chips. The “% Present” call is the percentage of probesets called Present on an array. Significant variation in the % Present call across arrays should be treated with care, since it may instead reflect that some samples express more genes than others. For this reason, the percentages present are required to be similar (Wilson et al., 2009). We set the criterion at 20% and removed the arrays whose percentage present was below 20%. All 1501 arrays were retained after this step.
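The three threshold checks described above (average background, 3’:5’ ratios, and % Present) can be sketched together as follows. This is an illustration only: the record type, field names, and example arrays are hypothetical, the ratios are assumed to be plain (non-logarithmic) values, and the checks are combined into one pass here, whereas the text applies them one after another.

```python
from dataclasses import dataclass

# Thresholds taken from the text above.
MAX_AVG_BACKGROUND = 300.0   # average background should be below 300
MAX_ACTIN_RATIO = 3.0        # beta-actin 3':5' ratio should be below 3
MAX_GAPDH_RATIO = 1.25       # GAPDH 3':5' ratio should be below 1.25
MIN_PERCENT_PRESENT = 20.0   # arrays below 20% Present are removed


@dataclass
class ArrayQC:
    """Hypothetical per-array QC record (values assumed to be exported
    from the simpleaffy qc output, with ratios on a linear scale)."""
    name: str
    avg_background: float
    actin_ratio: float
    gapdh_ratio: float
    percent_present: float


def failed_checks(qc):
    """Return the names of the checks an array fails (empty if it passes)."""
    reasons = []
    if qc.avg_background >= MAX_AVG_BACKGROUND:
        reasons.append("average background")
    if qc.actin_ratio >= MAX_ACTIN_RATIO:
        reasons.append("beta-actin 3':5' ratio")
    if qc.gapdh_ratio >= MAX_GAPDH_RATIO:
        reasons.append("GAPDH 3':5' ratio")
    if qc.percent_present < MIN_PERCENT_PRESENT:
        reasons.append("% Present")
    return reasons


# Hypothetical example arrays.
arrays = [
    ArrayQC("ok", 80.0, 1.4, 1.0, 45.0),
    ArrayQC("bright", 350.0, 1.2, 1.1, 40.0),
    ArrayQC("degraded", 90.0, 4.5, 1.6, 15.0),
]
passed = [qc.name for qc in arrays if not failed_checks(qc)]
print(passed)
```

Returning the list of failed checks, rather than a single pass/fail flag, makes it easy to count how many arrays each criterion removes.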

Among the 1501 arrays that passed quality assessment, 222 were not of the same type as the other 1279 arrays and could not be read into R 2.3.0. We discarded these 222 arrays. The remaining 1279 arrays formed the reference training set we used.

A brief summary of the amount of data deleted at each step is shown in Table 3.1. Figure 3.1 shows the data deleted at each step. Figure 3.2 compares the total deleted data with the 1501 arrays retained after quality assessment. The 1279 arrays in the reference training set belonged to 74 different tissue types. The frequency distribution among these tissue types can be found in Table 3.2.
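As a cross-check of the counts reported in this section, the number of arrays removed at each step can be recomputed from the retained totals. The sketch below uses only the figures stated in the text and assumes that all 1886 + 559 normal-control files entered quality assessment; it does not reproduce Table 3.1.

```python
# Arrays remaining after each step, as reported in the text.
steps = [
    ("normal-control .CEL files (GEO + AE)", 1886 + 559),
    ("scale factor", 1974),
    ("average background", 1918),
    ("3':5' ratios", 1501),
    ("% Present", 1501),
    ("array type readable in R", 1279),
]

# Removed at a step = remaining before the step - remaining after it.
removed = {
    label: previous - current
    for (_, previous), (label, current) in zip(steps, steps[1:])
}

for label, count in removed.items():
    print(f"{label}: {count} removed")

# Total removed during quality assessment, before the array-type check.
total_removed_qa = steps[0][1] - 1501
print("removed during quality assessment:", total_removed_qa)
```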
