
Chapter 2 Materials and Methods

2.2 Methods

2.2.2 Data Preprocessing

2.2.2.1 Normalization

There are a variety of reasons why the raw measurements of gene expression for two samples may not be directly comparable: the quantity of starting RNA may not be equal for each of the samples, there may be differences in labeling and detection efficiencies for the fluorescent labels, and there may be additional systematic effects that can skew the measured expression levels and the derived expression ratios. Normalization is any data transformation that adjusts for these effects and allows the data from two samples to be appropriately compared.

Robust normalization accounts for probe set characteristics that result from sequence-related factors, such as the affinity of the probe set for the RNA and the linearity of hybridization of each probe pair. More specifically, it corrects for the inevitable error introduced by using the average intensity of all probes on the array as the normalization factor for every probe set. We adopted Robust Multi-array Analysis (RMA) because of its sensitivity and specificity in detecting differential expression; it is a useful improvement over other normalization methods for researchers using the GeneChip technology [14, 15]. The normalization results are presented in Figure 2.1 and Figure 2.2.

Figure 2.1 Normalization result of human data. Panels: Human 10 time points (Before); Human 10 time points (After). Labels: Human : 10; RMA: Quantile Normalization.

Figure 2.2 Normalization result of mouse data. Panels: Mouse 16 time points with replicates (Before); Mouse 16 time points with replicates (After). Labels: Mouse : 18; RMA: Quantile Normalization.
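The quantile normalization step named in the figure labels can be illustrated with a minimal sketch. This is not the RMA implementation used in the study: RMA also includes background correction and probe-level summarization, which are omitted here, and in practice an established implementation (for example the `rma` function in the Bioconductor affy package) would normally be used. The function name, the tie handling (ties broken by position rather than averaged), and the toy intensities are assumptions made for illustration only.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of samples (one list of intensities per array).

    Illustrative sketch: ties are broken by position, not averaged.
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all samples must have the same number of probe sets")

    # Sort each sample, then average across samples at each rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns)
                  for r in range(n)]

    normalized = []
    for col in columns:
        # order[r] is the index of the r-th smallest value in this sample
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]  # replace value by the mean at its rank
        normalized.append(out)
    return normalized


# Hypothetical toy data: three arrays, four probe sets each
samples = [[5.0, 2.0, 3.0, 4.0],
           [4.0, 1.0, 4.5, 2.0],
           [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(samples)
```

After this step every array has the same set of intensity values, so the "After" distributions coincide, while the rank order of probe sets within each array is preserved.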

2.2.2.2 Use of replicate data

Replication is essential for identifying and reducing the effect of variability in any experimental assay, and microarray analysis is no exception. Biological replicates use independently derived RNA from distinct biological sources to provide an assessment of both the variability in the assay and the inherent biological variability in the system under study.

Biological replicates allow commonly expressed genes to be identified, as well as those that are distinct to a particular biological sample. In this study, we averaged the replicates to produce a single consensus measurement and thereby reduce the complexity of the final data.
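The replicate averaging described above can be sketched as follows. The function name, the time-point labels, and the intensities are hypothetical and serve only to show the grouping; the text does not specify how replicates were labeled in the original data.

```python
from collections import defaultdict
from statistics import mean


def average_replicates(values, time_points):
    """Average the measurements of one gene that share a time-point label."""
    grouped = defaultdict(list)
    for value, tp in zip(values, time_points):
        grouped[tp].append(value)
    # dicts keep insertion order, so time points stay in first-seen order
    return {tp: mean(vals) for tp, vals in grouped.items()}


# Hypothetical gene with two replicates at each of three time points
labels = ["T1", "T1", "T2", "T2", "T3", "T3"]
intensities = [100.0, 110.0, 200.0, 180.0, 50.0, 70.0]
consensus = average_replicates(intensities, labels)
```

Each time point is thereby reduced to a single consensus value per gene.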

2.2.2.3 Data Filtering

The goal of most other transformations is to filter the dataset to reduce its complexity and increase its overall quality. Many are designed to flag questionable and low-quality data, while others are used to identify differentially expressed genes or to enhance particular features of the data. Our procedure was as follows.

If more than one probe set represented the same gene, their intensities were averaged. All hybridization intensity values < 20, including negative values, were then raised to 20 to remove very small and negative intensities from the datasets.

Genes whose expression profile across consecutive time points was too flat, which we call a “smooth pattern”, were filtered out. Each gene passed to the subsequent dynamic time warping algorithm should have a specific expression pattern, that is, variable expression intensities at different developmental ages; we hypothesize that such genes may control embryogenesis and play an important role in heart development. Specifically, genes for which all time-point intensities fell within mean ± 0.3*mean were excluded from the subsequent mapping. As a result, we retained only fluctuating genes with at least one intensity outside mean ± 0.3*mean. Finally, the data were transformed to z-scores, calculated at the transcriptome level, to represent the expression data of each gene.
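The filtering steps can be summarized in a short sketch. The threshold of 20 and the factor 0.3 come from the text; everything else is an assumption made for illustration: the function names, the input layout (a list of gene/profile pairs), the treatment of values lying exactly on the band boundary as flat, and the use of the sample standard deviation for the z-score.

```python
from statistics import mean, stdev

FLOOR = 20.0          # intensities below this are raised to 20 (from the text)
FLAT_FRACTION = 0.3   # band half-width as a fraction of the gene mean (from the text)


def average_probe_sets(probe_rows):
    """probe_rows: list of (gene, profile); average profiles sharing a gene."""
    by_gene = {}
    for gene, profile in probe_rows:
        by_gene.setdefault(gene, []).append(profile)
    return {gene: [mean(col) for col in zip(*profiles)]
            for gene, profiles in by_gene.items()}


def floor_intensities(profile, floor=FLOOR):
    """Raise every value below the floor (including negatives) to the floor."""
    return [max(v, floor) for v in profile]


def is_flat(profile, fraction=FLAT_FRACTION):
    """True if all time points lie within mean +/- fraction * mean."""
    m = mean(profile)
    return all(abs(v - m) <= fraction * m for v in profile)


def z_scores(profile):
    """Standardize one profile (sample standard deviation; an assumption)."""
    m, s = mean(profile), stdev(profile)
    return [(v - m) / s for v in profile]


def preprocess(probe_rows):
    """Average probe sets, floor intensities, drop flat genes, standardize."""
    result = {}
    for gene, profile in average_probe_sets(probe_rows).items():
        profile = floor_intensities(profile)
        if not is_flat(profile):
            result[gene] = z_scores(profile)
    return result


# Hypothetical input: GENE_B is represented by two probe sets
rows = [("GENE_A", [100.0, 105.0, 95.0, 102.0]),
        ("GENE_B", [10.0, -5.0, 200.0, 400.0]),
        ("GENE_B", [30.0, 15.0, 220.0, 380.0])]
filtered = preprocess(rows)
```

In this toy input GENE_A stays within 30% of its mean at every time point and is dropped as a smooth pattern, while GENE_B is kept and standardized.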

2.2.2.4 Standardization

If a distribution is normal but not standard, a value can be converted for use with the standard normal distribution table by first finding how many standard deviations it lies from the mean. The number of standard deviations from the mean is called the z-score and is given by the formula

$$Z = \frac{x - \mu}{\sigma}$$

where x is the observed value, μ is the mean, and σ is the standard deviation. Consider the gene expression matrices in Table 2.3 and Table 2.4. Both represent the expression levels of genes G1-G9 for experimental conditions C1, C2, C3 and C4. Table 2.3 is the original data and Table 2.4 is the same data transformed into z-scores (standardization).

Table 2.3 Gene expression data matrix Ⅰ.

Gene expression data matrix of absolute expression measurements after normalization for samples C1, C2, C3 and C4.

Gene  C1        C2        C3        C4        Mean      Std
G1    211.5703  168.1379  175.8446  180.5085  184.0153  19.06502
G2    199.3421  370.9393  450.259   413.8647  358.6013  111.0119
G3    292.1011  384.8857  330.9426  277.6322  321.3904  47.94283
G4    58.30043  57.17114  59.13815  57.66531  58.06876  0.849661
G5    289.157   362.7946  335.4638  346.5588  333.4935  31.61678
G6    126.1376  120.9111  140.856   126.5952  128.625   8.551966
G7    658.9924  686.8183  809.7875  701.4234  714.2554  66.07527
G8    46.54035  48.21487  51.91154  47.12361  48.44759  2.411336
G9    219.3456  253.1414  285.1363  243.8249  250.362   27.21356

Table 2.4 Gene expression data matrix Ⅱ.

Gene expression data matrix of expression measurements after standardization for samples C1, C2, C3 and C4.

Gene  C1        C2        C3        C4
G1    1.445314  -0.8328   -0.42857  -0.18394
G2    -1.43461  0.111142  0.825657  0.497815
G3    -0.61092  1.324397  0.199241  -0.91272
G4    0.272665  -1.05644  1.258615  -0.47484
G5    -1.40231  0.926756  0.062316  0.413238
G6    -0.29085  -0.902    1.430197  -0.23734
G7    -0.83636  -0.41524  1.445807  -0.1942
G8    -0.79095  -0.09651  1.436528  -0.54907
G9    -1.13974  0.10213   1.27783   -0.24022
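As a worked check of the standardization, the sketch below recomputes the z-scores for gene G1 from its row in Table 2.3. It assumes the sample standard deviation (n - 1 denominator), which appears consistent with the Std column of Table 2.3; the function name is illustrative.

```python
from statistics import mean, stdev


def standardize(profile):
    """Return the z-score (x - mu) / sigma for each value in one gene's profile."""
    mu = mean(profile)
    sigma = stdev(profile)  # sample standard deviation (n - 1 denominator)
    return [(x - mu) / sigma for x in profile]


g1 = [211.5703, 168.1379, 175.8446, 180.5085]  # row G1 of Table 2.3
g1_z = standardize(g1)
```

The values in `g1_z` should agree with row G1 of Table 2.4 up to rounding of the tabulated inputs.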

2.2.2.5 Identification of Orthologous Genes

Orthologs are genes that are related by direct evolutionary descent. The identification of orthologs is particularly important because these genes should play similar developmental or physiological roles, and consequently, their study in rodent or other models can provide insight into their functions in humans. We use orthologous genes to establish relations between human and mouse and then analyze their gene expression profiles with microarray data.

HomoloGene is a system for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes. The genomes represented in the recent Build 52 of HomoloGene include Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, and so on [16]. This database contains 19157 orthologous genes between human and mouse.

Table 3.1 presents the preprocessing steps and detailed information on the microarray data we used. We performed a novel bioinformatics study that uses orthologous genes as the bridge between human and mouse. We found that about 15530 orthologs (probe sets) are represented on both the U133A and 430A arrays. We therefore have a large set of common genes covered by both arrays for the comparative functional genomics study. Figure 2.3 gives an overview of our analysis of the microarray data between human and mouse.

Figure 2.3 Overview of the microarray data analysis between human and mouse across the developmental stages.
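The use of orthologs as a bridge between the two arrays can be sketched as an intersection: an ortholog pair is kept only if the human gene is measured on the human array and the mouse gene on the mouse array. The function names, the data layout, and the expression values are hypothetical; the gene symbols are used only as illustrative human and mouse pairs, and the actual HomoloGene file format is not modeled here.

```python
def common_orthologs(ortholog_pairs, human_genes, mouse_genes):
    """Keep (human, mouse) pairs whose members are both measured on their array."""
    human_set, mouse_set = set(human_genes), set(mouse_genes)
    return [(h, m) for h, m in ortholog_pairs
            if h in human_set and m in mouse_set]


def paired_profiles(pairs, human_expr, mouse_expr):
    """Attach the human and mouse expression profiles to each ortholog pair."""
    return {(h, m): (human_expr[h], mouse_expr[m]) for h, m in pairs}


# Hypothetical ortholog table and per-array expression profiles
ortholog_pairs = [("NKX2-5", "Nkx2-5"), ("GATA4", "Gata4"), ("TBX5", "Tbx5")]
human_expr = {"NKX2-5": [0.2, 1.1, -1.3], "GATA4": [-0.5, 0.9, -0.4]}
mouse_expr = {"Nkx2-5": [0.1, 1.0, -1.1], "Gata4": [-0.7, 1.1, -0.4],
              "Tbx5": [1.0, -0.2, -0.8]}

pairs = common_orthologs(ortholog_pairs, human_expr, mouse_expr)
profiles = paired_profiles(pairs, human_expr, mouse_expr)
```

Only the pairs present on both arrays remain, which gives the set of common genes whose human and mouse profiles can then be compared.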

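In document: Comparing the Relationship Between Human and Mouse Embryonic Heart Development Using Time-Series Microarray Gene Expression Data (pages 17-21)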