3 Materials and Methods

3.1 Implementation of methods selected

Four preprocessing methods and five differential expression methods are applied to each of the selected datasets. The preprocessing methods comprise three statistical models (MAS 5.0, dChip and RMA) and one physical model (PDNN). The five differential expression methods are fold-change, the two-sample t-test, SAM, EBarrays and limma. Counting variants separately (dChip(PM-MM) and dChip(PM-only); t-test and Welch t-test; EBarrays(GG) and EBarrays(LNN)) gives five preprocessing procedures and seven differential expression procedures, for a total of 5 × 7 = 35 combinations. We regard the criterion of each method as a “score” that expresses the level of significance: the higher the score, the more significant the result.
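The count of 35 can be checked by enumerating the pairs directly. The short sketch below assumes that each variant named in the following subsections counts as a separate procedure; the list names are illustrative only.

```python
from itertools import product

# Variants are counted separately, as in the subsections below:
# dChip, the t-test and EBarrays each come in two versions.
preprocessing = ["MAS 5.0", "dChip(PM-MM)", "dChip(PM-only)", "RMA", "PDNN"]
de_methods = ["FC", "t-test", "Welch t-test", "SAM",
              "EBarrays(GG)", "EBarrays(LNN)", "limma"]

# Every preprocessing procedure is paired with every DE procedure.
combinations = list(product(preprocessing, de_methods))
print(len(combinations))
```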

3.1.1 Four preprocessing methods used

MAS 5.0

MAS 5.0 (Microarray Suite software, Version 5.0) is offered by Affymetrix (Affymetrix, 2002). Each probe, whether PM or MM, is first background-adjusted according to its location on the array. To avoid obtaining a negative value when subtracting MM from PM, MAS 5.0 introduces the concept of an Ideal Mismatch (IM), which is derived from MM and is never larger than its PM. The expression level is defined as the anti-log of a robust average (Tukey biweight) of the values

$$\{\log_2(PM_{jg} - IM_{jg})\}$$

where $PM_{jg}$ and $IM_{jg}$ represent the PM and IM intensities for the j-th probe pair of gene g. Finally, the expression level is scaled using a trimmed mean.

We apply the absolute analysis of MAS 5.0 in R.
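As an illustration of the robust averaging step only, the following sketch computes a one-step Tukey biweight of the log2(PM − IM) values for one probe set and takes the anti-log. The tuning constants `c=5` and `eps=1e-4` and the function names are assumptions made for this example; background adjustment, the construction of IM and the final trimmed-mean scaling are not shown.

```python
import math
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight location estimate (c and eps are assumed)."""
    m = median(values)
    mad = median([abs(v - m) for v in values])
    weights = []
    for v in values:
        u = (v - m) / (c * mad + eps)
        # Points far from the median receive zero weight.
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def mas5_like_signal(pm, im):
    """Anti-log of the biweight average of log2(PM - IM); requires PM > IM."""
    logs = [math.log2(p - i) for p, i in zip(pm, im)]
    return 2 ** tukey_biweight(logs)

# Toy probe set: the last probe pair is an outlier and is down-weighted.
signal = mas5_like_signal([200, 220, 210, 5000], [100, 110, 105, 100])
print(round(signal, 1))
```

In the toy probe set the outlying fourth probe pair receives zero weight, so the result stays close to the other three differences.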

dChip (including dChip(PM-MM) and dChip(PM-only))

Li and Wong (2001a) proposed the Model Based Expression Index (MBEI), which uses multiple arrays to estimate expression levels. For any given gene, the model is defined as follows:

$$MM_{ij} = \nu_j + \theta_i \alpha_j + \varepsilon_{ij}$$

$$PM_{ij} = \nu_j + \theta_i \alpha_j + \theta_i \phi_j + \varepsilon_{ij}$$

where $PM_{ij}$ and $MM_{ij}$ denote the PM and MM intensities for the i-th array and the j-th probe pair for this gene. $\theta_i$ denotes the expression index for this gene in the i-th array. $\alpha_j$ and $\phi_j$ represent the rate of increase of the intensity of the MM probe and the additional rate of increase in the corresponding PM probe, respectively. $\nu_j$ is the baseline response of the j-th probe pair due to nonspecific hybridization, and the $\varepsilon_{ij}$ are assumed to be independent, normally distributed errors. Two methods based on this model have been developed: (1) subtracting MM from PM intensities (Li and Wong, 2001a), and (2) using PM intensities only (Li and Wong, 2001b).

Li and Wong’s measure is the maximum likelihood estimate of the expression index $\theta_i$, and the estimation procedure includes rules for outlier removal.

The “Invariant Set” normalization method is used to normalize the arrays at the PM and MM probe levels.
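For the PM−MM variant, subtracting the two model equations leaves differences of the form θ_i φ_j plus error, which under normal errors can be fitted by alternating least squares. The sketch below is a minimal illustration under that assumption; the function name `fit_mbei`, the fixed number of iterations and the identifiability constraint Σφ_j² = J are choices made for this example, and outlier removal and invariant-set normalization are omitted.

```python
def fit_mbei(diff, n_iter=50):
    """Fit diff[i][j] ~ theta[i] * phi[j] by alternating least squares.

    diff[i][j] is PM - MM for array i and probe pair j.
    phi is rescaled so that sum(phi_j ** 2) equals the number of probes.
    """
    n_arrays, n_probes = len(diff), len(diff[0])
    phi = [1.0] * n_probes

    def update_theta(phi):
        denom = sum(p * p for p in phi)
        return [sum(y * p for y, p in zip(row, phi)) / denom for row in diff]

    for _ in range(n_iter):
        theta = update_theta(phi)
        denom = sum(t * t for t in theta)
        phi = [sum(diff[i][j] * theta[i] for i in range(n_arrays)) / denom
               for j in range(n_probes)]
        # Enforce the identifiability constraint on the probe effects.
        scale = (sum(p * p for p in phi) / n_probes) ** 0.5
        phi = [p / scale for p in phi]
    return update_theta(phi), phi

# Noise-free toy data built from known indices and probe effects.
theta_true, phi_true = [100.0, 200.0, 400.0], [0.5, 1.0, 1.5]
toy = [[t * p for p in phi_true] for t in theta_true]
theta_hat, phi_hat = fit_mbei(toy)
```

On noise-free data the fitted values reproduce the ratios between arrays and between probes; their absolute scale is fixed only by the constraint on φ.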

RMA

Irizarry et al. (2003a) developed a log-scale linear additive model using only PM probes, known as RMA (Robust Multi-array Analysis). For any given gene, it is described as $T(PM_{ij}) = e_i + a_j + \varepsilon_{ij}$, where $PM_{ij}$ is the PM intensity of array i and probe pair j for this gene. $T(\cdot)$ represents the transformation that background-corrects the PM intensities, normalizes them by quantile normalization, and takes logarithms. The three terms on the right represent the log2-scale expression value for this gene on array i, the log-scale affinity effect for probe j, and the error, respectively (Irizarry et al., 2003b). To protect against outlier probes, a robust procedure such as median polish is used to estimate the model parameters and the log-scale measure of expression level $e_i$.
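The robust fitting step can be sketched with a plain median polish applied to a table of already transformed values. The example assumes that background correction, quantile normalization and the log transform have been done beforehand; the function name and the fixed number of sweeps are illustrative.

```python
from statistics import median

def median_polish(table, n_iter=10):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] by median polish.

    Rows are arrays and columns are probes. Returns the per-array
    expression estimates (overall + row effect) and the probe effects.
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [row[:] for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians out of the residuals.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [r - m for r in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep column medians out of the residuals.
        for j in range(n_cols):
            m = median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return [overall + r for r in row_eff], col_eff

# Additive toy table with one outlying probe value on the first array.
e_true, a_true = [8.0, 9.0, 10.0], [-1.0, -0.5, 0.0, 0.5, 1.0]
toy = [[e + a for a in a_true] for e in e_true]
toy[0][0] += 5.0
expression, probe_effects = median_polish(toy)
```

Despite the outlying value, the estimates for this toy table recover the array-level values used to build it, which is the motivation for using medians rather than means.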

PDNN

Zhang et al. (2003) proposed a simple free energy model, called the “position-dependent nearest-neighbor (PDNN) model”. Unlike most methods, which rely on statistical models such as those introduced above, it is a physical model that takes into account the sequence of nearest neighbors (two adjacent bases) and the position of these nucleotide pairs. In the PDNN model, the signal of a probe is divided into three components: gene-specific binding, non-specific binding, and uniform background. The free energy of each of the two types of binding of a probe can be expressed as a weighted sum of its stacking energies (Sugimoto et al., 1995), where the stacking energies depend on the sequence of nearest neighbors and the weights depend on the position along the probe. Further technical details can be found in Zhang et al. (2003).
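The weighted-sum form of the free energy can be illustrated as follows. The stacking energies and position weights used here are placeholder numbers chosen for the example, not the parameters of Zhang et al. (2003), and the toy probe is much shorter than a real probe.

```python
def free_energy(sequence, stacking_energy, position_weights):
    """Weighted sum of nearest-neighbor stacking energies along a probe.

    stacking_energy maps a dinucleotide (e.g. "AC") to an energy;
    position_weights[k] is the weight of the pair starting at position k.
    """
    pairs = [sequence[k:k + 2] for k in range(len(sequence) - 1)]
    if len(position_weights) != len(pairs):
        raise ValueError("need one weight per nearest-neighbor pair")
    return sum(w * stacking_energy[p] for w, p in zip(position_weights, pairs))

# Placeholder values for illustration only.
toy_energy = {"AC": 1.0, "CG": 2.0, "GT": 3.0}
toy_weights = [0.5, 1.0, 0.5]
energy = free_energy("ACGT", toy_energy, toy_weights)
print(energy)
```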

3.1.2 Five differential expression methods used

Fold-change (FC)

Fold-change is the most commonly used method for detecting differentially expressed genes between samples from two conditions. For any given gene, fold-change is calculated as the ratio of the probe set intensities under the two conditions. If there are replicates, the samples are usually averaged within each condition first, and the ratio of these averaged values is referred to as the fold-change. Fold-change is employed as the score of significance.
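A minimal sketch of this calculation for one gene is given below. It assumes that the intensities are on the raw (not log) scale and that the first argument is the numerator condition; both are choices made for the example.

```python
from statistics import mean

def fold_change(condition_a, condition_b):
    """Ratio of the replicate-averaged intensities of two conditions."""
    return mean(condition_a) / mean(condition_b)

# Three replicates per condition for a single gene (toy values).
fc = fold_change([200, 220, 180], [100, 110, 90])
print(fc)
```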

Two-sample t-test (including t-test and Welch t-test)

The simplest statistical method for comparing means between two groups is the two-sample t-test. When carrying out a two-sample t-test, the variances of the two samples may be assumed to be equal or unequal. The approach under the unequal variance assumption is also called Welch’s t-test. We employ the negative of the p-value as the score of significance, so that a smaller p-value corresponds to a higher score.
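The two variants differ only in the standard error and the degrees of freedom, as the following sketch shows. It computes the statistic and the degrees of freedom; the p-value, which would be obtained from the t distribution with the returned degrees of freedom, is not computed here. The function names are illustrative.

```python
import math
from statistics import mean, variance

def pooled_t(x, y):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def welch_t(x, y):
    """Welch t statistic with Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x) / n1, variance(y) / n2
    t = (mean(x) - mean(y)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
t_pooled, df_pooled = pooled_t(x, y)
t_welch, df_welch = welch_t(x, y)
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ, so the two p-values, and hence the two scores, can still differ.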

SAM (Significance Analysis of Microarrays)

SAM was proposed by Tusher, Tibshirani and Chu (2001). The method is based on a modified version of the standard t-statistic that adjusts for the high variance often associated with low expression levels. For each gene g, the “relative difference” $d_g$ in gene expression is defined by adding an exchangeability factor to the denominator of the standard two-sample t-statistic for equal variances. The exchangeability factor is added to ensure that the variance of $d_g$ is independent of the gene expression level. All genes are ranked by the observed relative difference $d_g$, and the ordered values are denoted $d_{(g)}$. B sets of permutations of the samples are taken to obtain the expected relative difference $d_{(g)}^*$ in a similar way (for more details, see Tusher et al., 2001 and Chu et al.). A scatter plot of $d_{(g)}$ vs. $d_{(g)}^*$ is drawn, and the genes lying farther than the threshold Δ from the line $d_{(g)} = d_{(g)}^*$ are regarded as differentially expressed genes.

Using the samr package in R, differentially expressed genes can be identified by specifying a threshold Δ. However, the number of genes selected is then determined by the given threshold Δ and cannot be set at will. In addition, the package applies further filtering criteria that are not mentioned in the original paper (Tusher et al., 2001). We therefore do not use the samr package and instead employ the difference between $d_{(g)}$ and $d_{(g)}^*$ as the score of significance, following the methodology described in Tusher et al. (2001).
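A score of this kind can be sketched as follows. The example is not the samr implementation: the exchangeability factor `s0` is fixed by hand rather than chosen from the data, the permutations are drawn at random, and the score is taken as the signed difference between the ordered observed value and the permutation average at the same rank, which is an assumption about how the difference is formed.

```python
import math
import random
from statistics import mean

def relative_differences(data, labels, s0):
    """d_g for every gene; data[g] holds one value per sample, labels are 0/1."""
    d = []
    for row in data:
        a = [v for v, l in zip(row, labels) if l == 0]
        b = [v for v, l in zip(row, labels) if l == 1]
        ma, mb = mean(a), mean(b)
        ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
        s = math.sqrt((1 / len(a) + 1 / len(b)) * ss / (len(a) + len(b) - 2))
        d.append((mb - ma) / (s + s0))
    return d

def sam_scores(data, labels, s0=0.1, n_perm=100, seed=0):
    """Map gene index -> ordered observed d minus expected d at that rank."""
    rng = random.Random(seed)
    observed = relative_differences(data, labels, s0)
    order = sorted(range(len(data)), key=lambda g: observed[g])
    expected = [0.0] * len(data)
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)
        d_perm = sorted(relative_differences(data, perm, s0))
        expected = [e + d / n_perm for e, d in zip(expected, d_perm)]
    return {g: observed[g] - expected[rank] for rank, g in enumerate(order)}

# Toy data: gene 0 differs clearly between the groups, the others do not.
toy = [
    [1.0, 1.1, 0.9, 5.0, 5.1, 4.9],
    [1.0, 1.2, 0.8, 1.1, 0.9, 1.0],
    [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    [3.0, 2.9, 3.1, 3.0, 3.1, 2.9],
]
scores = sam_scores(toy, [0, 0, 0, 1, 1, 1])
top_gene = max(scores, key=scores.get)
```

Because every gene receives a score, any desired number of top-ranked genes can be selected, which is not possible when only a threshold Δ is specified.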

EBarrays (including EBarrays(GG) and EBarrays(LNN))

An empirical Bayes analysis, implemented in the Bioconductor EBarrays package, attempts to describe the probability distribution of expression levels for gene g and selects differentially expressed genes by the posterior probability of differential expression.

Two mixture models, Gamma-Gamma model and lognormal-normal model, are considered according to their sampling and prior distributions. For more details on the methodology, see Newton et al. (2001), Kendziorski et al. (2003), and Newton and Kendziorski (2003). We employ the posterior probability of differential expression as the score of significance.
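For a two-condition comparison, the posterior probability of differential expression follows from Bayes' rule once the mixing proportion and the two marginal densities are available. The sketch below shows only this final step and treats the densities as given numbers; in the Gamma-Gamma and lognormal-normal models they would be computed from the fitted sampling and prior distributions, which is not attempted here. The function and argument names are illustrative.

```python
def posterior_de(p_de, f_de, f_ee):
    """Posterior probability of differential expression for one gene.

    p_de: prior proportion of differentially expressed genes
    f_de: marginal density of the gene's data under differential expression
    f_ee: marginal density of the gene's data under equivalent expression
    """
    numerator = p_de * f_de
    return numerator / (numerator + (1 - p_de) * f_ee)

# Hypothetical densities for a single gene.
prob = posterior_de(p_de=0.1, f_de=0.8, f_ee=0.2)
print(round(prob, 4))
```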

limma

Smyth (2004) proposed an approach combining linear models and empirical Bayes methods, which is implemented in the Bioconductor limma package (Smyth, 2005). The linear model for gene g is

$$E(\mathbf{y}_g) = X \boldsymbol{\alpha}_g$$

where $\mathbf{y}_g$ is the expression level vector of I arrays in total for this gene, $X$ is the design matrix, and $\boldsymbol{\alpha}_g$ is a vector of coefficients. Certain contrasts of the coefficients are assumed to be of biological interest and these are defined by $\boldsymbol{\beta}_g = C^T \boldsymbol{\alpha}_g$. In general, we are interested in testing whether individual contrast values $\beta_{gj}$ are equal to zero. The basic statistic for a given contrast $\beta_{gj}$ is the moderated t-statistic, in which posterior residual standard deviations, obtained by an empirical Bayes approach, are used in place of ordinary standard deviations. An alternative statistic, the B-statistic, represents the log posterior odds that the gene is differentially expressed. The B-statistic is the default in the limma package, and we employ it as the score of significance.
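The moderation of the t-statistic can be illustrated for a single gene and contrast. In the sketch below the prior variance `s0_sq` and prior degrees of freedom `d0` are supplied by hand, whereas in the empirical Bayes approach they are estimated from all genes; `v` denotes the unscaled variance factor of the contrast estimate. The B-statistic used as the score in this study is not computed here, and the function name is illustrative.

```python
import math

def moderated_t(beta_hat, s2, df_resid, v, s0_sq, d0):
    """Moderated t-statistic for one contrast of one gene.

    The residual variance s2 (df_resid degrees of freedom) is shrunk
    toward the prior variance s0_sq (d0 degrees of freedom).
    """
    s2_post = (d0 * s0_sq + df_resid * s2) / (d0 + df_resid)
    t = beta_hat / math.sqrt(s2_post * v)
    return t, d0 + df_resid

# Hypothetical values: a gene with an unusually small residual variance.
t_mod, df_total = moderated_t(beta_hat=1.0, s2=0.04, df_resid=4,
                              v=0.5, s0_sq=0.1, d0=4)
t_ordinary = 1.0 / math.sqrt(0.04 * 0.5)
```

In this toy case the ordinary t-statistic is inflated by the small residual variance, and the moderated version is pulled toward a more conservative value.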
