Four preprocessing methods used

2 Literature Review

2.4 Four preprocessing methods used

MAS 5.0

MAS 5.0 (Microarray Suite software, version 5.0) is provided by Affymetrix (Affymetrix, 2002). The expression level of a gene is calculated from the combined, background-adjusted PM and MM values of its probe set. First, both the PM and MM probe intensities are adjusted for background.

For the background adjustment, the array is divided into K rectangular zones (default K = 16). Within each zone the probe intensities are ranked, and the lowest 2% are taken as the background b_k for that zone. Each probe intensity is then adjusted using a weighted average of the zone background values, b(x,y), where the weights depend on the distance from the probe location (x,y) to each of the zone centers. In particular, the weight is defined as:

$$w_k(x,y) = \frac{1}{d_k^2(x,y) + \mathit{smooth}}, \qquad b(x,y) = \frac{\sum_{k=1}^{K} w_k(x,y)\, b_k}{\sum_{k=1}^{K} w_k(x,y)},$$

where d_k(x,y) is the distance from (x,y) to the center of zone k. The default value of smooth is 100; it is added to d_k²(x,y) so that the denominator can never be zero. The calculated background b(x,y) establishes a “floor” to be subtracted from each raw probe intensity. Additional rules ensure that the adjusted intensity does not become negative.
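The distance-weighted averaging described above can be illustrated with a minimal Python sketch. The 2 × 2 zone layout, the zone centers, and the zone background values are hypothetical inputs chosen for illustration (MAS 5.0 uses 16 zones by default), and the function names are not part of any Affymetrix software.

```python
def zone_weights(x, y, centers, smooth=100.0):
    """Weight of each zone: 1 / (squared distance to the zone center + smooth)."""
    return [1.0 / ((x - cx) ** 2 + (y - cy) ** 2 + smooth) for cx, cy in centers]


def smoothed_background(x, y, centers, zone_bg, smooth=100.0):
    """Weighted average of the zone backgrounds at probe location (x, y)."""
    w = zone_weights(x, y, centers, smooth)
    return sum(wk * bk for wk, bk in zip(w, zone_bg)) / sum(w)


# Hypothetical layout: four zones with made-up background values.
centers = [(25, 25), (75, 25), (25, 75), (75, 75)]
zone_bg = [40.0, 50.0, 60.0, 70.0]

b_center = smoothed_background(50, 50, centers, zone_bg)      # equidistant from all zones
b_near_first = smoothed_background(25, 25, centers, zone_bg)  # sits on the first zone center
```

A probe that is equidistant from all zone centers receives the plain average of the zone backgrounds, while a probe sitting on a zone center is dominated by that zone's value.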

After the background adjustment, an ideal mismatch value is calculated and subtracted from each PM intensity. The MM probes were originally intended to adjust the PM probes for non-specific binding. The naïve approach subtracts the intensity of each MM probe from the intensity of the corresponding PM probe. This becomes problematic because the MM value is sometimes larger than the PM value. To avoid negative expression values, Affymetrix introduced the Ideal Mismatch (IM), a quantity derived from the MM value that is never larger than its corresponding PM value. IM equals MM when MM < PM and is adjusted to be less than PM when MM ≥ PM. This is done by computing

$$IM_{jg} = \begin{cases} MM_{jg}, & MM_{jg} < PM_{jg}, \\[4pt] \dfrac{PM_{jg}}{2^{SB_g}}, & MM_{jg} \ge PM_{jg} \text{ and } SB_g > \mathit{contrast}\tau, \\[8pt] \dfrac{PM_{jg}}{2^{\,\mathit{contrast}\tau / \left(1 + (\mathit{contrast}\tau - SB_g)/\mathit{scale}\tau\right)}}, & MM_{jg} \ge PM_{jg} \text{ and } SB_g \le \mathit{contrast}\tau, \end{cases}$$

where SB_g, the specific background of probe set g, is the one-step Tukey biweight of log₂(PM_jg) − log₂(MM_jg) over the probe pairs j of the set (Affymetrix, 2002), and contrastτ (with a default value of 0.03) and scaleτ (with a default value of 10) are tuning constants. The adjusted PM intensity is obtained by subtracting the corresponding IM from the observed PM intensity. MAS 5.0 then uses a one-step Tukey biweight to combine the adjusted probe intensities on the log scale.

$$\mathrm{signal}_g = \text{the anti-log of } \mathrm{TukeyBiweight}\left\{ \log_2\left( PM_{jg} - IM_{jg} \right) \right\}.$$
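The ideal mismatch and the Tukey biweight summary can be sketched in Python as follows. This is a simplified illustration based on our reading of the Statistical Algorithms Description Document (Affymetrix, 2002), not the Affymetrix implementation: the biweight tuning constants `c = 5` and `eps = 0.0001`, the three-case form of `ideal_mismatch`, and all function names are assumptions, and the probe intensities are made up.

```python
import math
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights outliers."""
    m = median(values)
    s = median([abs(v - m) for v in values])  # median absolute deviation
    total_w = total_wx = 0.0
    for v in values:
        u = (v - m) / (c * s + eps)
        w = (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0
        total_w += w
        total_wx += w * v
    return total_wx / total_w


def ideal_mismatch(pm, mm, sb, contrast_tau=0.03, scale_tau=10.0):
    """Ideal mismatch: equals MM when MM < PM, otherwise a value below PM."""
    if mm < pm:
        return mm
    if sb > contrast_tau:
        return pm / 2.0 ** sb
    return pm / 2.0 ** (contrast_tau / (1.0 + (contrast_tau - sb) / scale_tau))


def mas5_signal(pm, mm):
    """Anti-log of the Tukey biweight of log2(PM - IM) for one probe set."""
    sb = tukey_biweight([math.log2(p) - math.log2(m) for p, m in zip(pm, mm)])
    im = [ideal_mismatch(p, m, sb) for p, m in zip(pm, mm)]
    return 2.0 ** tukey_biweight([math.log2(p - i) for p, i in zip(pm, im)])


# Hypothetical probe set with one probe pair where MM exceeds PM.
pm = [200.0, 400.0, 300.0, 500.0]
mm = [100.0, 150.0, 350.0, 200.0]
signal = mas5_signal(pm, mm)
```

Because the ideal mismatch is strictly smaller than PM in every case, the logarithm of PM − IM is always defined for positive intensities.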

Finally, the signal is scaled using a trimmed mean. The algorithm defines a scaling factor sf and a normalization factor nf:

$$sf = \frac{Sc}{\mathrm{TrimMean}(\mathrm{signal}_g,\ 0.02,\ 0.98)},$$

where Sc is the target signal (default Sc = 500). MAS 5.0 offers two types of analysis, absolute analysis and comparison analysis, and the definition of nf depends on which one is performed:

$$nf = \begin{cases} 1, & \text{absolute analysis}, \\[4pt] \dfrac{\mathrm{TrimMean}(SPVb,\ 0.02,\ 0.98)}{\mathrm{TrimMean}(SPVe,\ 0.02,\ 0.98)}, & \text{comparison analysis}, \end{cases}$$

where SPVb is the baseline array signal and SPVe is the experiment array signal.

More details are given in the Statistical Algorithms Description Document (Affymetrix, 2002). The reported MAS 5.0 value for probe set g is:

$$nf \times sf \times \mathrm{signal}_g.$$
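The scaling step can be sketched in Python as follows. The trimmed mean here is a simplified version that drops the lowest and highest 2% of the values by whole counts; the exact handling of fractional counts in MAS 5.0 may differ. The function names and the example signals are hypothetical.

```python
def trimmed_mean(values, trim=0.02):
    """Mean after dropping the lowest and highest `trim` fraction of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def scaling_factor(signals, target=500.0, trim=0.02):
    """sf: target signal divided by the trimmed mean of the array's signals."""
    return target / trimmed_mean(signals, trim)


def normalization_factor(baseline_spv, experiment_spv, trim=0.02):
    """nf for a comparison analysis; an absolute analysis uses nf = 1."""
    return trimmed_mean(baseline_spv, trim) / trimmed_mean(experiment_spv, trim)


def reported_values(signals, target=500.0, nf=1.0, trim=0.02):
    """Reported value nf * sf * signal for every probe set on one array."""
    sf = scaling_factor(signals, target, trim)
    return [nf * sf * s for s in signals]


# Hypothetical array with 100 probe sets (absolute analysis, nf = 1).
signals = [100.0] * 50 + [300.0] * 50
reported = reported_values(signals)
```

After scaling, the trimmed mean of the reported values equals the target signal, which is what makes arrays comparable in an absolute analysis.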

dChip

dChip (DNA-Chip Analyzer) is another popular software package for probe-level and high-level analysis of Affymetrix gene expression microarrays (Li and Wong, 2001a) and SNP microarrays. The software can be downloaded from http://biosun1.harvard.edu/complab/dchip/. dChip can be used to fit the Model Based Expression Index (MBEI) (Li and Wong, 2001a) and to obtain what we refer to as the dChip expression measure. Li and Wong reported that the variation of a specific probe across multiple arrays (the between-array variance) is in general smaller than the variance across probes within a probe set (the within-probe set variance) (Li and Wong, 2001a). To account for this strong probe affinity effect, they proposed a multiplicative model for any given gene:

$$MM_{ij} = \nu_j + \theta_i \alpha_j + \varepsilon_{ij}, \qquad PM_{ij} = \nu_j + \theta_i \alpha_j + \theta_i \phi_j + \varepsilon_{ij}, \tag{1}$$

where PM_ij and MM_ij denote the PM and MM intensity values for the i-th array and the j-th probe pair for this gene, and θ_i denotes the expression index for this gene in the i-th array. Multiple arrays are assumed to be available for analysis. The intensity of a probe is assumed to increase linearly as θ_i increases, but at a different rate for each probe, and within the same probe pair PM_ij is assumed to increase at a higher rate than MM_ij. α_j and φ_j represent the rate of increase of the MM_ij probe and the additional rate of increase of the corresponding PM_ij probe, respectively. These rates are assumed to be nonnegative. ν_j is the baseline response of the j-th probe pair due to nonspecific hybridization, and the ε_ij are assumed to be independent, normally distributed errors.

The model for individual probe responses implies an even simpler model for the PM-MM differences:

$$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}. \tag{2}$$

The model above is called the PM-MM difference model (Li and Wong, 2001a).
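One way to fit a multiplicative model of the form PM_ij − MM_ij = θ_i φ_j + ε_ij is alternating least squares, sketched below in Python. This is a minimal illustration, not the dChip implementation: it omits outlier removal and the nonnegativity constraint, and it assumes the identifiability constraint Σ_j φ_j² = J, which is not stated in the text above. The function name and the noise-free example data are hypothetical.

```python
import math


def fit_difference_model(y, n_iter=20):
    """Alternating least squares for y[i][j] = theta[i] * phi[j] + error."""
    n_arrays, n_probes = len(y), len(y[0])
    phi = [1.0] * n_probes
    theta = [0.0] * n_arrays
    for _ in range(n_iter):
        # Least-squares update of theta_i with phi held fixed.
        ss_phi = sum(p * p for p in phi)
        theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / ss_phi
                 for i in range(n_arrays)]
        # Least-squares update of phi_j with theta held fixed.
        ss_theta = sum(t * t for t in theta)
        phi = [sum(y[i][j] * theta[i] for i in range(n_arrays)) / ss_theta
               for j in range(n_probes)]
        # Rescale so that sum(phi_j^2) = J; theta absorbs the inverse scale.
        scale = math.sqrt(sum(p * p for p in phi) / n_probes)
        phi = [p / scale for p in phi]
        theta = [t * scale for t in theta]
    return theta, phi


# Hypothetical noise-free data: 3 arrays, 3 probe pairs.
true_theta = [100.0, 200.0, 400.0]
true_phi = [0.5, 1.0, 1.5]
y = [[t * p for p in true_phi] for t in true_theta]
theta, phi = fit_difference_model(y)
```

Only the products θ_i φ_j are determined by the data, which is why a scale constraint on φ is needed before the θ_i can be compared across genes.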

Li and Wong noted that, because the effectiveness of MM probes has been questioned, some investigators design custom arrays using PM probes exclusively. They therefore later proposed another model for estimating gene expression levels, called the PM-only model (Li and Wong, 2001b). The PM-only model uses only the PM probes and follows from the description of PM in model (1). The PM-only model is as follows:

$$PM_{ij} = \nu_j + \theta_i \phi'_j + \varepsilon_{ij}. \tag{3}$$

The notation in the PM-only model has the same meaning as in the PM-MM difference model, except that φ'_j merges the two rates of increase α_j and φ_j.

Whichever of the two models is used, Li and Wong’s measure is defined as the maximum likelihood estimate of the expression index θ_i, and outlier probe intensities are removed as part of the estimation procedure. Before computing model-based expression levels, dChip uses the “Invariant Set” normalization method to normalize arrays at the PM and MM probe level for the PM-MM difference model, or at the PM probe level for the PM-only model. Using a baseline array, arrays are normalized by selecting invariant sets of probes and then using them to fit a non-linear relationship between the "treatment" and "baseline" arrays. A set of probes is said to be invariant if the ordering of the probes on one chip is the same as on the other. By default, an array with median overall intensity is chosen as the baseline and all other arrays are normalized to it.

To summarize the probe intensities, dChip first performs the “Invariant Set” normalization and then fits the normalized probe intensities to the chosen model for any given gene. The maximum likelihood estimate of the expression index θ_i is the expression measure for this gene in array i.
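The idea of an invariant set can be illustrated with a short Python sketch that keeps the probes whose rank is nearly the same on the baseline and treatment arrays. This is a single-pass simplification under assumed settings: the threshold `max_rank_diff`, the function name, and the intensities are hypothetical, and the iterative selection and the non-linear curve fit used by dChip are not reproduced here.

```python
def invariant_set(baseline, treatment, max_rank_diff=0.05):
    """Indices of probes whose proportional rank difference is small."""
    n = len(baseline)

    def ranks(values):
        order = sorted(range(n), key=lambda k: values[k])
        r = [0] * n
        for rank, k in enumerate(order):
            r[k] = rank
        return r

    rank_b, rank_t = ranks(baseline), ranks(treatment)
    return [k for k in range(n) if abs(rank_b[k] - rank_t[k]) / n <= max_rank_diff]


# Hypothetical intensities: the last two probes swap order on the treatment array.
baseline = [10.0, 20.0, 30.0, 40.0, 50.0]
treatment = [12.0, 25.0, 33.0, 90.0, 41.0]
selected = invariant_set(baseline, treatment)
```

Probes that change rank between the two arrays are presumed to be differentially expressed and are excluded, so that only the remaining probes inform the normalization curve.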

RMA

RMA (Robust Multi-array Analysis; Irizarry et al., 2003a) is an expression measure consisting of three preprocessing steps: convolution background correction, quantile normalization, and a summarization based on a multi-array model fitted robustly with the median polish algorithm. Many preprocessing methods, such as MAS 5.0 and dChip, base their measures on the difference PM − MM with the intention of correcting for non-specific binding. However, the exploratory analysis presented in Irizarry et al. (2003a) suggests that the MM probes detect not only non-specific binding and background noise but also transcript signal, just like the PM probes. Subtracting the MM intensity from the PM intensity is therefore not always an appropriate way of correcting for non-specific binding and background noise. The RMA authors proposed a procedure that ignores the MM intensities and uses only the PM intensities.

The RMA convolution background correction method is motivated by the distribution of probe intensities. The observed PM intensity is modeled as the sum of a background intensity bg_ijg, caused by optical noise and non-specific binding, and a signal intensity s_ijg:

$$PM_{ijg} = bg_{ijg} + s_{ijg}, \qquad i = 1,\ldots,I,\quad j = 1,\ldots,J,\quad g = 1,\ldots,G,$$

with i indexing the arrays, j the probe pairs, and g the probe sets. Under this model, the background-corrected probe intensities are given by B(PM_ijg), where B(PM_ijg) ≡ E(s_ijg | PM_ijg). To obtain a computationally feasible B(⋅), the closed-form transformation obtained by assuming that s_ijg is exponentially distributed and bg_ijg is normally distributed is used; the results obtained with this B(⋅) work well in practice (Irizarry et al., 2003a).
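Under the exponential-signal, normal-background assumption, the conditional expectation has a closed form that can be sketched in Python. The sketch assumes the parameters are already known: `mu` and `sigma` for the normal background and `alpha` for the rate of the exponential signal. In practice these are estimated from the data, which is not shown here, and the parameter values below are made up. The formula used is the mean of a normal distribution truncated at zero, a + σ·φ(a/σ)/Φ(a/σ) with a = PM − μ − σ²α.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def rma_background_correct(pm, mu, sigma, alpha):
    """E(signal | PM) for signal ~ Exponential(alpha), background ~ Normal(mu, sigma^2)."""
    a = pm - mu - sigma ** 2 * alpha
    return a + sigma * norm_pdf(a / sigma) / norm_cdf(a / sigma)


# Hypothetical parameters for one array.
mu, sigma, alpha = 100.0, 20.0, 0.01
bright = rma_background_correct(1000.0, mu, sigma, alpha)  # far above background
dim = rma_background_correct(50.0, mu, sigma, alpha)       # below the mean background
```

The corrected value stays positive even when the observed intensity is below the mean background, which avoids the negative values produced by a plain subtraction.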

Next, quantile normalization is performed, which makes the distribution of probe intensities the same for every array (Bolstad et al., 2003). To summarize the probe intensities, RMA introduces a linear additive model on the log scale. The model is:

$$T(PM_{ij}) = e_i + a_j + \varepsilon_{ij},$$

where PM_ij represents the PM intensity of array i = 1,…,I and probe pair j = 1,…,J for any given probe set g, T(⋅) represents the transformation that background corrects, normalizes, and takes the logarithm of the PM intensities, e_i represents the log₂-scale expression value on array i, a_j represents the log-scale affinity effect of probe j, and ε_ij represents the error (Irizarry et al., 2003b). To protect against outlier probes, a robust procedure such as median polish is used to estimate the model parameters (Irizarry et al., 2003a). The estimate of e_i, as the log-scale measure of expression, is referred to as the robust multi-array average (RMA).
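The quantile normalization and median polish steps can be sketched in Python as follows. This is a minimal illustration with hypothetical function names and toy data: ties are handled naively in the normalization, and the median polish follows the standard Tukey procedure with a fixed number of sweeps rather than a convergence test.

```python
from statistics import median


def quantile_normalize(columns):
    """columns[i] holds the intensities of array i; all arrays have equal length."""
    n = len(columns[0])
    sorted_cols = [sorted(c) for c in columns]
    rank_means = [sum(sc[r] for sc in sorted_cols) / len(columns) for r in range(n)]
    out = []
    for c in columns:
        order = sorted(range(n), key=lambda k: c[k])
        col = [0.0] * n
        for rank, k in enumerate(order):
            col[k] = rank_means[rank]  # replace each value by the mean at its rank
        out.append(col)
    return out


def median_polish(y, n_iter=10):
    """y[i][j]: log2 intensity of array i, probe j. Returns the estimates of e_i."""
    n_rows, n_cols = len(y), len(y[0])
    resid = [row[:] for row in y]
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(n_iter):
        for i in range(n_rows):  # sweep out row (array) medians
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        for j in range(n_cols):  # sweep out column (probe) medians
            m = median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return [overall + r for r in row_eff]


normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
clean = median_polish([[7.0, 8.0, 9.0], [9.0, 10.0, 11.0]])
with_outlier = median_polish([[7.0, 8.0, 20.0], [9.0, 10.0, 11.0]])
```

In the toy example, a single outlying probe value leaves the array-level estimates unchanged, which is the robustness property that motivates median polish over a least-squares fit.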

PDNN

Zhang et al. (2003) proposed a simple free energy model of the probe signals for estimating gene expression levels, called the “position-dependent nearest-neighbor (PDNN) model”, which describes the formation of RNA-DNA duplexes on Affymetrix microarrays. Unlike most methods, which are based on statistical models, it is a physical model that takes into account the sequence of nearest neighbors (two adjacent bases) and the position of these nucleotide pairs. It has been suggested that the effect of nearest-neighbor nucleotide pairs is the most important factor in determining RNA/DNA duplex stability. Their model also describes binding interactions complicated by many factors, such as steric hindrance on the chip surface, probe-probe interaction, and RNA secondary structure formation.

The model is based on the nearest-neighbor model (Sugimoto et al., 1995) with two modifications: (1) a positional weight factor is added to reflect the different contributions from different parts of the probe; (2) two different types of binding on the probes are considered. The two types of binding are gene-specific binding (GSB), the formation of DNA-RNA duplexes with exactly complementary sequences, and non-specific binding (NSB), the formation of duplexes with many mismatches between the probe and the attached RNA molecule. Note that PDNN assumes that the majority of probes are designed specifically for their target, and only PM probes are used for GSB and NSB estimation. The PDNN model divides the signal of a probe into three components, GSB, NSB, and a uniform background B, as follows:

$$\hat{I}_{jg} = \frac{N_g}{1 + \exp(E_{jg})} + \frac{N^*}{1 + \exp(E^*_{jg})} + B,$$

where Î_jg is the expected intensity of the j-th probe in the probe set targeted to detect gene g, N_g is the true expression level of gene g, and N* is the population of RNA molecules that contributes to NSB. E_jg is the free energy for formation of the specific RNA-DNA duplex with the targeted gene, and E*_jg is the average free energy for NSB, that is, formation of duplexes with many different genes. E_jg and E*_jg are computed as weighted sums of stacking energies. With the sequence of a probe given as (b1, b2, ..., b25),

$$E_{jg} = \sum_{k=1}^{24} \omega_k\, \varepsilon(b_k, b_{k+1}), \qquad E^*_{jg} = \sum_{k=1}^{24} \omega^*_k\, \varepsilon^*(b_k, b_{k+1}),$$

where ω_k and ω*_k are weight factors that depend on the position k along the probe, and ε(b_k, b_{k+1}) is the same as the stacking energy used in the nearest-neighbor model (Sugimoto et al., 1995). Both GSB and NSB involve 16 stacking energy parameters and 24 weight factors.
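The weighted sum of stacking energies and the resulting expected intensity can be sketched in Python. The sketch assumes that the expected intensity takes the form N_g/(1 + exp(E_jg)) + N*/(1 + exp(E*_jg)) + B given by Zhang et al. (2003). The stacking energies, positional weights, probe sequence, and all numerical inputs below are hypothetical placeholders, not fitted PDNN parameters.

```python
import math


def duplex_energy(seq, stacking, weights):
    """Weighted sum of stacking energies over the 24 adjacent base pairs of a 25-mer."""
    pairs = [seq[k:k + 2] for k in range(len(seq) - 1)]
    return sum(w * stacking[p] for w, p in zip(weights, pairs))


def expected_intensity(e_gsb, e_nsb, n_gene, n_star, background):
    """Expected probe intensity: GSB term + NSB term + uniform background."""
    gsb = n_gene / (1.0 + math.exp(e_gsb))
    nsb = n_star / (1.0 + math.exp(e_nsb))
    return gsb + nsb + background


# Hypothetical parameters: 16 identical stacking energies and 24 equal weights.
stacking = {a + b: -1.0 for a in "ACGT" for b in "ACGT"}
weights = [1.0 / 24.0] * 24
probe = "ACGTACGTACGTACGTACGTACGTA"  # made-up 25-mer

energy = duplex_energy(probe, stacking, weights)
intensity = expected_intensity(0.0, 0.0, 1000.0, 200.0, 30.0)
```

In a real fit, separate stacking energies and weights would be used for GSB and NSB, so `duplex_energy` would be called twice per probe with different parameter sets.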

The unknown parameters are obtained by minimizing the fitness function F, which measures the match between the expected probe intensity Î_jg and the observed probe intensity I_jg:

$$F = \frac{1}{M} \sum \left( \ln \hat{I}_{jg} - \ln I_{jg} \right)^2,$$

where the sum runs over all probes and M is the total number of probes on an array. A Monte Carlo simulation procedure is used to minimize F. Once the parameters are determined, the gene expression levels N_g can be calculated; they are scaled to an average of 500 on an array.
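A fitness function of this kind, the mean squared difference between the logarithms of expected and observed intensities, can be sketched in Python, together with the final rescaling of expression levels to an average of 500. The function names and the input values are hypothetical, and the Monte Carlo minimization itself is not shown.

```python
import math


def pdnn_fitness(expected, observed):
    """Mean squared difference of log intensities over the M probes of an array."""
    m = len(observed)
    return sum((math.log(e) - math.log(o)) ** 2 for e, o in zip(expected, observed)) / m


def scale_to_average(levels, target=500.0):
    """Rescale expression levels so that their average equals the target."""
    factor = target / (sum(levels) / len(levels))
    return [factor * v for v in levels]


perfect = pdnn_fitness([120.0, 450.0, 80.0], [120.0, 450.0, 80.0])
imperfect = pdnn_fitness([math.e, 1.0], [1.0, 1.0])
scaled = scale_to_average([100.0, 200.0, 300.0])
```

Working on the log scale means that a twofold error contributes equally to F for dim and bright probes.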

For ease of comparison, the four preprocessing methods above are summarized in Table 1.

2.5 Five differential expression methods used
