• 沒有找到結果。

In this section, we briefly introduce our analysis procedure and the main methods we used to construct our gene signature. The theoretical and im-plemented details of these methods are given in later chapters.

The major challenge in microarray data analysis is the large dimension-ality. The number of genes (G) is in the range of ten to fifty thousands but the sample size (n) is only about hundreds. To reduce the effect of microar-ray noise, we used three criterions to filter out non-informative genes. We excluded the genes with low expressions, small variation expressions or with inconsistent expressions between three preprocessing methods, the MAS 5.0 Statistical algorithm from Affymatrix (2001) [18], the dChip algorithm from Li and Wong (2001) [19] and the Robust Multi-chip Average (RMA) from Irizarry et al. (2003) [20]. The detail of this part is discussed in Chapter 3.

After gene filtering, we reduced the number of genes (G) to a related smaller number (G), but the true effective dimensions to the survival time might be much smaller. Our strategy was to implement a two-steps dimen-sion reduction method proposed by Wu et al. (2008) with some modifications.

This approach contained gene selection and signature construction two steps.

In the gene selection part, we used both correlation and liquid association methods to select the important candidate genes related to survival time.

Pearson’s correlation coefficient was introduced to measure the strength of linear dependency between two variables. However, the association between gene expressions and patients survival might not be linear and might be more complicated. The liquid association (LA) method was implemented here to explore the interaction of two genes related to the survival time. Due to the

CHAPTER 2. MATERIALS AND ANALYSIS PROCEDURE 8 data censoring issue, both correlation and liquid association could not be implemented directly. Therefore, we modified a nonparametric imputation method [11] to impute the censored data, then the correlation coefficients could be evaluated by plugging the imputed survival probability. After that, we calculated and ranked the correlation coefficients between gene expres-sions and the imputed survival probability by the absolute values. Genes in the first few places were selected as candidate genes in this part. We also proposed a permutation procedure to decide how many genes should be se-lected. For implementing the liquid association (LA) method, our strategy was to select the genes, which recurrently appeared in the first few extreme LA gene pairs. These genes were called the LA hub genes. We selected the LA hub gene and also its paired genes as candidate genes in this part.

We note that the gene expression profiles and the imputed survival proba-bility were normalized by normal quantile transformed in the first two parts, gene filter and the gene selection. The normal quantile transformation is necessary for the liquid association method and makes the procedure robust against the outliers. Both the correlation and liquid association can be com-puted in the website http://kiefer.stat2.sinica.edu.tw/LAP3/index.php. The details of the imputation methods and liquid association method are given in Chapter 4.

In the signature construction part, the candidate genes selected from the previous step were used to derive a gene signature for survival prediction.

First, we applied the modified sliced inverse regression to estimate the effec-tive dimension reduction (e.d.r.) directions and projected the selected gene expression profiles on the e.d.r. space. If there is only one SIR direction, the

CHAPTER 2. MATERIALS AND ANALYSIS PROCEDURE 9 estimated e.d.r. direction, found significantly by a large sample chi-squared test, we projected the expression profiles on it as our final gene signature.

Otherwise, we could use the projected directions to fit other survival model, for example the multivariate Cox proportional hazard model, and derive the final gene signature. We note that the normal quantile transformation was not used in this part. The regressors (X) in the dimension reduction model were the candidate gene expression profiles transformed by log-2 transforma-tion and centered toward sample mean in training data set. The theoretical derivation and practical implementation of modified sliced inverse regression are given in Chapter 5.

The prediction power of our gene signature was tested in the independent validation data sets. In each validation data set, we used the linear combina-tion coefficients estimated from the training data set to combine the selected gene expressions into a gene signature. We used two ways to present the prediction power of our signature. First we used median of our signatures in each validation set to separate the samples into two groups, high risk and low risk groups, as a categorical classifier. For this categorical classifier, the log rank test was used to test the difference of the survival distribution of two groups. Second, we used our derived gene signature as a continuous risk score to fit the Cox proportional hazard model. We estimated the hazard ratios with corresponding p-value and the concordance probabilities (CPE) [23] for both categorical classifier and the continuous risk score. The CPE estimated the probability that survival outcome agreed with the risk score or categorical classifier under the Cox proportional hazard model. To compare the derived gene signature and the TNM tumor stage the results of multi-variate Cox proportional hazard model were also presented. A flow chart of

CHAPTER 2. MATERIALS AND ANALYSIS PROCEDURE 10 our procedure is given as figure 2.1.

Training data (HLM+UM) 

Gene filter  Inconsistent gene 

expressions  Low gene 

expressions  Small varia?on  gene expressions 

Gene selec?on  Correla?on  Liquid associa?on 

Gene signature  construc?on 

Modified sliced inverse regression 

(Mul?variate Cox model, if more than one significant direc?ons) 

Valida?on  CAN/DF  MSK  DUKE 

Normal score  transforma?on 

Normal score  transforma?on 

Log 2  transforma?on 

Log 2  transforma?on 

Survival  imputa?on 

Figure 2.1: The flow chart of analysis procedures.

Chapter 3 Gene filter

3.1 Inconsistent gene expressions

At the beginning of microarray analysis, choosing data preprocessing method is still an open issue. MAS 5.0 Statistical algorithm, dChip algo-rithm and the Robust Multi-chip Average (RMA) method are three widely used data preprocessing methods. However, different methods may lead to different results. In Shedden et al. (2008), they preprocessed the expression profiles by running dChip algorithm on all four data sets together. Never-theless, there was an issue they remarked: running the dChip algorithm on the entire data sets may have removed some of the inter-site differences but is somewhat unrealistic. Figure 3.1 showed a dramatic shift of the data indi-cating that gene expressions preprocessed separately or together as a group using dChip algorithm are not comparably scaled. This inter-site difference may impact the validation results a lot.

In our data analysis, we chose the MAS 5.0 Statistical algorithm for data preprocessing. The MAS 5.0 algorithm allowed us to preprocess the

microar-11

CHAPTER 3. GENE FILTER 12

10.5 11.0 11.5 12.0

10.511.011.512.0

HLM and UM preprocessed separatly

HLM and UM preprocessed together

UM HLM

-2 -1 0 1 2

-2-1012

HLM and UM preprocessed separatly

HLM and UM preprocessed together

HLM UM

Figure 3.1: Scatter plots of gene expression profiles of DDR1 in HLM and UM two data sets preprocessed together versus preprocessed separately by dChip algorithm: The left panel is log-2 transformed gene expression profiles and the right panel is normal quantile transformed gene expression profiles.

ray data entirely or chip by chip with the same results. However, we thought that the genes with inconsistent expression profiles between three preprocess-ing methods were unconvinced. Therefore, we filtered out the genes that had inconsistent expression levels between three preprocessing methods. Corre-lation coefficient is a measure also used to measure the similarity of two variables. Here we used it to measure the similarity of the expression levels preprocessed by each two of the three preprocessing methods for each gene.

Nevertheless, since the RMA preprocessed data is in log-2 scale, we may transform it before we calculated the correlation coefficients. Furthermore, the normal quantile transformed correlation coefficients between gene expres-sions and imputed survival probability is an important selecting criterion in the gene selection part. Thus, we also used the normal quantile transformed

CHAPTER 3. GENE FILTER 13 correlation coefficient in this part.

Before continuing the introduction of the gene filter, here we give a def-inition of the normal quantile transformation and note some properties of it.

Definition 3.1.1. For any n observations x = (x1, ..., xn) of variable X, The normal quantile transformation is define by

N (x) =

where Φ(·) is the cumulative distribution function of standard normal distri-bution and Ri is the rank of xi in x for i = 1, 2, ..., n.

Since Pearson’s correlation coefficient with normal quantile transforma-tion only depends on the rank of observatransforma-tions, it can be viewed as a kind of rank correlation coefficient. It is more robust against outliers than the orig-inal Pearson’s correlation coefficient. Furthermore, in elementary statistics, the correlation coefficient between N(x) and x is used to test for the null hy-pothesis that X is normally distributed. The correlation coefficient between N(x) and x is closed to 1 under null hypothesis. If X1 and X2 are both normally distributed, the correlation coefficient between N(x1) and N(x2) is closed to the correlation coefficient between x1 and x2. Then some proper-ties of the original Pearson’s correlation coefficient carried over. We also note that the normal quantile transformation is necessary for the LA calculation.

Therefore, we used the normal quantile transformed correlation coefficient for all the correlations between two variables in our analysis procedure.

For our real data analysis, first we preprocessed the expression profiles by using all the three preprocessing methods separately in HLM, UM, CAN/DF

CHAPTER 3. GENE FILTER 14 and MSK four data sets for three versions of gene expression profiles. Since the CAN/DF and MSK data sets were used for validation, the gene filter was only implemented in the training data sets, HLM and UM. There were 22,215 probe sets on Affymetrix U133A microarray. For each gene in the two training data sets, we evaluated the normal quantile transformed correlation coefficients between each two of the three different preprocessed expression profiles. Thus, there were six correlation coefficients evaluated for each gene.

The genes that had as least one of the six correlation coefficients smaller than 0.78 were excluded.

In each data set, the ranks of the mean expressions of the remaining genes with a total of 22,215 probe sets were recorded. The histogram is given in figure 3.2. In the histogram, the proportion of the remaining genes with high rank is found larger than the proportion of the remaining genes with low rank.

相關文件