Analysis procedure - Microarray Data Analysis and Prediction of Lung Adenocarcinoma Survival

In this section, we briefly introduce our analysis procedure and the main methods we used to construct our gene signature. The theoretical and implementation details of these methods are given in later chapters.

The major challenge in microarray data analysis is the high dimensionality. The number of genes (G) ranges from ten thousand to fifty thousand, whereas the sample size (n) is only in the hundreds. To reduce the effect of microarray noise, we used three criteria to filter out non-informative genes. We excluded genes with low expression, with small variation in expression, or with inconsistent expression across three preprocessing methods: the MAS 5.0 Statistical algorithm from Affymetrix (2001) [18], the dChip algorithm from Li and Wong (2001) [19], and the Robust Multi-chip Average (RMA) from Irizarry et al. (2003) [20]. The details of this part are discussed in Chapter 3.

After gene filtering, the number of genes (G) was reduced to a relatively smaller number (G^∗), but the true effective dimension for the survival time might be much smaller still. Our strategy was to implement, with some modifications, the two-step dimension reduction method proposed by Wu et al. (2008).

This approach consisted of two steps: gene selection and signature construction.

In the gene selection part, we used both correlation and liquid association methods to select the important candidate genes related to survival time.

Pearson’s correlation coefficient measures the strength of linear dependency between two variables. However, the association between gene expressions and patient survival might not be linear and might be more complicated. The liquid association (LA) method was implemented here to explore the interaction of two genes in relation to the survival time.

CHAPTER 2. MATERIALS AND ANALYSIS PROCEDURE

Due to the data censoring issue, neither correlation nor liquid association could be applied directly. Therefore, we modified a nonparametric imputation method [11] to impute the censored data, so that the correlation coefficients could be evaluated by plugging in the imputed survival probability. We then calculated the correlation coefficients between gene expressions and the imputed survival probability and ranked them by absolute value. The top-ranked genes were selected as candidate genes in this part. We also proposed a permutation procedure to decide how many genes should be selected. For the liquid association (LA) method, our strategy was to select the genes that recurrently appeared in the first few extreme LA gene pairs. These genes were called the LA hub genes. We selected each LA hub gene and also its paired genes as candidate genes in this part.
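The correlation-based ranking described above can be illustrated with a minimal sketch. It assumes that the imputed survival probabilities are already available, that there are no tied values, and that the normal quantile transformation takes the common rank-based form Φ^{-1}(R_i/(n+1)). The function names, the toy data and the fixed `top_k` are hypothetical; in particular, the permutation procedure for choosing the number of genes is not reproduced here.

```python
from statistics import NormalDist, mean


def normal_scores(x):
    """Rank-based normal scores Phi^{-1}(R_i / (n + 1)); assumes no tied values."""
    n = len(x)
    inv_cdf = NormalDist().inv_cdf
    scores = [0.0] * n
    for rank, i in enumerate(sorted(range(n), key=lambda j: x[j]), start=1):
        scores[i] = inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5


def rank_genes(expression, imputed_survival, top_k):
    """Return the top_k genes by |correlation| with imputed survival, plus all correlations."""
    surv = normal_scores(imputed_survival)
    corr = {g: pearson(normal_scores(v), surv) for g, v in expression.items()}
    ranked = sorted(corr, key=lambda g: abs(corr[g]), reverse=True)
    return ranked[:top_k], corr


# Hypothetical toy data: three genes measured on five patients.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],   # increases with survival
    "g2": [9.0, 7.0, 6.0, 3.0, 1.0],   # decreases with survival
    "g3": [2.0, 5.0, 1.0, 4.0, 3.0],   # no clear monotone relation
}
imputed_survival = [0.1, 0.4, 0.5, 0.7, 0.9]
selected, corr = rank_genes(expression, imputed_survival, top_k=2)
```

Ranking by absolute value keeps genes that are strongly associated with survival in either direction, which is why both `g1` and `g2` are selected in this toy example.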

We note that the gene expression profiles and the imputed survival probability were normalized by the normal quantile transformation in the first two parts, gene filtering and gene selection. The normal quantile transformation is necessary for the liquid association method and makes the procedure robust against outliers. Both the correlation and the liquid association can be computed on the website http://kiefer.stat2.sinica.edu.tw/LAP3/index.php. The details of the imputation method and the liquid association method are given in Chapter 4.
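To make the role of the normal quantile transformation in the LA calculation concrete, the following sketch computes a liquid association score for a gene pair (X, Y) with a third variable Z. It assumes the three-product-moment estimator from Li's original liquid association work, the average of x_i y_i z_i over normal-score-transformed data, with Z standing in for the imputed survival probability. The estimator actually used in this study is described in Chapter 4 and may differ; the function names and toy data are hypothetical.

```python
from statistics import NormalDist


def normal_scores(x):
    """Rank-based normal scores Phi^{-1}(R_i / (n + 1)); assumes no tied values."""
    n = len(x)
    inv_cdf = NormalDist().inv_cdf
    scores = [0.0] * n
    for rank, i in enumerate(sorted(range(n), key=lambda j: x[j]), start=1):
        scores[i] = inv_cdf(rank / (n + 1))
    return scores


def liquid_association(x, y, z):
    """Average of the triple products of the normal scores of x, y and z."""
    xs, ys, zs = normal_scores(x), normal_scores(y), normal_scores(z)
    return sum(a * b * c for a, b, c in zip(xs, ys, zs)) / len(xs)


# Hypothetical toy data: x and y move in opposite directions when z is low
# and in the same direction when z is high.
x = [1, 8, 2, 7, 3, 6, 4, 5]
y = [8, 1, 7, 2, 3, 6, 4, 5]
z = [1, 2, 3, 4, 5, 6, 7, 8]
la = liquid_association(x, y, z)
la_reversed = liquid_association(x, y, z[::-1])
```

A positive score indicates that the co-expression of X and Y tends to increase with Z, and reversing the order of Z flips the sign of the score.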

In the signature construction part, the candidate genes selected from the previous step were used to derive a gene signature for survival prediction.

First, we applied the modified sliced inverse regression (SIR) to estimate the effective dimension reduction (e.d.r.) directions and projected the selected gene expression profiles onto the e.d.r. space. If only one SIR direction (estimated e.d.r. direction) was found significant by a large-sample chi-squared test, we projected the expression profiles onto it and took the projection as our final gene signature.

Otherwise, we could use the projections onto the significant directions to fit another survival model, for example the multivariate Cox proportional hazards model, and derive the final gene signature. We note that the normal quantile transformation was not used in this part. The regressors (X) in the dimension reduction model were the candidate gene expression profiles, log-2 transformed and centered at the sample means of the training data set. The theoretical derivation and practical implementation of the modified sliced inverse regression are given in Chapter 5.
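The preparation of the regressors and the projection onto a single direction can be sketched as follows. This is only an illustration under stated assumptions: the direction coefficients are hypothetical placeholders for an estimated e.d.r. direction (the modified SIR estimation itself is not shown), and new samples are assumed to be centered with the training means.

```python
import math


def training_means(train):
    """train: {gene: [raw expression per training sample]} -> mean log-2 expression per gene."""
    return {g: sum(math.log2(v) for v in vals) / len(vals) for g, vals in train.items()}


def signature(sample, means, direction):
    """Project one sample's centered log-2 expressions onto a direction."""
    return sum(direction[g] * (math.log2(sample[g]) - means[g]) for g in direction)


# Hypothetical toy data: two genes, two training samples.
train = {"A": [2.0, 8.0], "B": [4.0, 4.0]}
direction = {"A": 1.0, "B": -0.5}   # placeholder for an estimated e.d.r. direction
means = training_means(train)

score_1 = signature({"A": 8.0, "B": 4.0}, means, direction)
score_2 = signature({"A": 2.0, "B": 16.0}, means, direction)
```

Because the coefficients and the centering constants are stored separately from the data, the same two objects can be reused to score samples outside the training set.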

The predictive power of our gene signature was tested in independent validation data sets. In each validation data set, we used the linear combination coefficients estimated from the training data set to combine the selected gene expressions into a gene signature. We presented the predictive power of our signature in two ways. First, we used the median of the signature in each validation set to separate the samples into two groups, a high-risk group and a low-risk group, as a categorical classifier. For this categorical classifier, the log-rank test was used to test the difference between the survival distributions of the two groups. Second, we used our derived gene signature as a continuous risk score to fit the Cox proportional hazards model. We estimated the hazard ratios with the corresponding p-values and the concordance probability estimates (CPE) [23] for both the categorical classifier and the continuous risk score. The CPE estimates the probability that the survival outcome agrees with the risk score or categorical classifier under the Cox proportional hazards model. To compare the derived gene signature with the TNM tumor stage, the results of the multivariate Cox proportional hazards model were also presented. A flow chart of our procedure is given in Figure 2.1.
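The categorical evaluation described above, a median split followed by a log-rank test, can be sketched in plain Python. The sketch uses the standard two-group log-rank statistic with the hypergeometric variance and a chi-squared reference distribution with one degree of freedom; the function names and the toy data are hypothetical, and the Cox model and CPE calculations are not shown.

```python
import math
from statistics import median


def median_split(scores):
    """Label a sample 'high' if its signature exceeds the set's median, else 'low'."""
    cut = median(scores)
    return ["high" if s > cut else "low" for s in scores]


def logrank_test(times, events, groups):
    """Two-group log-rank test; events[i] is 1 for death and 0 for censoring.

    Returns (chi-squared statistic, p-value on one degree of freedom).
    """
    labels = sorted(set(groups))
    assert len(labels) == 2, "exactly two groups are required"
    first = labels[0]
    obs_minus_exp = 0.0
    variance = 0.0
    # Loop over the distinct observed event times.
    for t in sorted({tt for tt, e in zip(times, events) if e == 1}):
        at_risk = [g for tt, g in zip(times, groups) if tt >= t]
        n = len(at_risk)
        n1 = sum(1 for g in at_risk if g == first)
        d = sum(1 for tt, e in zip(times, events) if tt == t and e == 1)
        d1 = sum(1 for tt, e, g in zip(times, events, groups)
                 if tt == t and e == 1 and g == first)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-squared distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical validation set: signature scores, follow-up times, event indicators.
scores = [2.1, 1.8, 1.5, 1.2, -0.3, -0.9, -1.4, -2.0]
times = [5, 8, 12, 15, 40, 44, 50, 60]
events = [1, 1, 1, 1, 0, 1, 0, 0]

groups = median_split(scores)
chi2, p_value = logrank_test(times, events, groups)
```

In this toy example the four high-score patients all die early, so the test rejects equality of the two survival distributions at the 5% level.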

[Figure 2.1, flow chart: Training data (HLM+UM) → Gene filter (inconsistent gene expressions, low gene expressions, small-variation gene expressions) → Gene selection (correlation, liquid association) → Gene signature construction (modified sliced inverse regression; multivariate Cox model if more than one significant direction) → Validation (CAN/DF, MSK, DUKE). Annotations: normal score transformation, log-2 transformation, survival imputation.]

Figure 2.1: The flow chart of analysis procedures.

Chapter 3 Gene filter

3.1 Inconsistent gene expressions

At the outset of a microarray analysis, the choice of data preprocessing method is still an open issue. The MAS 5.0 Statistical algorithm, the dChip algorithm and the Robust Multi-chip Average (RMA) method are three widely used preprocessing methods. However, different methods may lead to different results. Shedden et al. (2008) preprocessed the expression profiles by running the dChip algorithm on all four data sets together. Nevertheless, they remarked on an issue: running the dChip algorithm on the entire collection of data sets may have removed some of the inter-site differences but is somewhat unrealistic. Figure 3.1 shows a dramatic shift in the data, indicating that gene expressions preprocessed separately and those preprocessed together as a group by the dChip algorithm are not comparably scaled. This inter-site difference may substantially affect the validation results.

[Figure 3.1: two scatter plots with axes labeled 'HLM and UM preprocessed separately' and 'HLM and UM preprocessed together'; points are marked by data set (HLM, UM).]

Figure 3.1: Scatter plots of the gene expression profiles of DDR1 in the HLM and UM data sets, preprocessed together versus preprocessed separately by the dChip algorithm. The left panel shows log-2 transformed gene expression profiles and the right panel shows normal quantile transformed gene expression profiles.

In our data analysis, we chose the MAS 5.0 Statistical algorithm for data preprocessing. The MAS 5.0 algorithm allowed us to preprocess the microarray data either all together or chip by chip with the same results. However, we considered genes whose expression profiles were inconsistent across the three preprocessing methods to be unreliable. Therefore, we filtered out the genes that had inconsistent expression levels across the three preprocessing methods. The correlation coefficient can also be used to measure the similarity of two variables. Here we used it to measure, for each gene, the similarity of the expression levels produced by each pair of the three preprocessing methods.

However, because the RMA-preprocessed data are on the log-2 scale, a transformation may be needed before the correlation coefficients are calculated. Furthermore, the normal quantile transformed correlation coefficient between gene expressions and the imputed survival probability is an important selection criterion in the gene selection part. Thus, we also used the normal quantile transformed correlation coefficient in this part.

Before continuing with the gene filter, we give a definition of the normal quantile transformation and note some of its properties.

Definition 3.1.1. For any n observations x = (x₁, ..., x_n)^′ of a variable X, the normal quantile transformation is defined by

N(x) = (Φ^{-1}(R₁/(n+1)), ..., Φ^{-1}(R_n/(n+1)))^′,

where Φ(·) is the cumulative distribution function of the standard normal distribution and R_i is the rank of x_i in x for i = 1, 2, ..., n.
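As an illustration of the definition, the sketch below computes normal scores with the Python standard library. It assumes the common rank-based form Φ^{-1}(R_i/(n+1)) and, as an additional assumption not stated in the text, gives tied observations their average rank. The function name is hypothetical.

```python
from statistics import NormalDist


def normal_quantile_transform(x):
    """Map each observation to Phi^{-1}(R_i / (n + 1)), where R_i is its rank.

    Assumption: ranks run from 1 to n and tied values share their average rank.
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        # Find the block of tied values starting at sorted position pos.
        end = pos
        while end + 1 < n and x[order[end + 1]] == x[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # average of the 1-based ranks pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    inv_cdf = NormalDist().inv_cdf
    return [inv_cdf(r / (n + 1)) for r in ranks]


scores = normal_quantile_transform([3.0, 1.0, 2.0, 100.0])
scores_no_outlier = normal_quantile_transform([3.0, 1.0, 2.0, 4.0])
```

Since only the ranks enter the transformation, replacing the largest value 4.0 by the outlier 100.0 leaves the scores unchanged.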

Since Pearson’s correlation coefficient with the normal quantile transformation depends only on the ranks of the observations, it can be viewed as a kind of rank correlation coefficient. It is more robust against outliers than the original Pearson’s correlation coefficient. Furthermore, in elementary statistics, the correlation coefficient between N(x) and x is used to test the null hypothesis that X is normally distributed; under the null hypothesis, this correlation coefficient is close to 1. If X₁ and X₂ are both normally distributed, the correlation coefficient between N(x₁) and N(x₂) is close to the correlation coefficient between x₁ and x₂, so some properties of the original Pearson’s correlation coefficient carry over. We also note that the normal quantile transformation is necessary for the LA calculation. Therefore, we used the normal quantile transformed correlation coefficient for all correlations between two variables in our analysis procedure.

For our real data analysis, we first preprocessed the expression profiles with each of the three preprocessing methods separately in the four data sets HLM, UM, CAN/DF and MSK, obtaining three versions of the gene expression profiles. Since the CAN/DF and MSK data sets were used for validation, the gene filter was applied only to the training data sets, HLM and UM. There were 22,215 probe sets on the Affymetrix U133A microarray. For each gene in each of the two training data sets, we evaluated the normal quantile transformed correlation coefficients between each pair of the three differently preprocessed expression profiles. Thus, six correlation coefficients were evaluated for each gene.

Genes for which at least one of the six correlation coefficients was smaller than 0.78 were excluded.
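The consistency filter can be sketched for a single gene as follows. The threshold 0.78 and the count of six coefficients (three method pairs in each of two training sets) come from the text; the data layout, function names and toy values are hypothetical, and the normal scores assume the rank-based form Φ^{-1}(R_i/(n+1)) without ties.

```python
from itertools import combinations
from statistics import NormalDist, mean


def normal_scores(x):
    """Rank-based normal scores Phi^{-1}(R_i / (n + 1)); assumes no tied values."""
    n = len(x)
    inv_cdf = NormalDist().inv_cdf
    scores = [0.0] * n
    for rank, i in enumerate(sorted(range(n), key=lambda j: x[j]), start=1):
        scores[i] = inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5


def is_consistent(profiles, threshold=0.78):
    """profiles: {data set: {method: [expression per sample]}} for one gene.

    Returns True only if every pairwise correlation between preprocessing
    methods, within every data set, is at least the threshold.
    """
    for methods in profiles.values():
        for m1, m2 in combinations(sorted(methods), 2):
            r = pearson(normal_scores(methods[m1]), normal_scores(methods[m2]))
            if r < threshold:
                return False
    return True


# Hypothetical toy gene whose three versions agree in rank order in both sets.
gene_ok = {
    "HLM": {"MAS5": [1, 2, 3, 4, 5], "dChip": [2, 4, 6, 8, 11], "RMA": [0.1, 0.2, 0.3, 0.4, 0.5]},
    "UM": {"MAS5": [5, 3, 1, 2, 4], "dChip": [50, 30, 10, 20, 40], "RMA": [2.5, 2.3, 2.1, 2.2, 2.4]},
}
# Same gene, but the RMA version in UM is reversed, so it disagrees with the others.
gene_bad = {
    "HLM": gene_ok["HLM"],
    "UM": {"MAS5": [5, 3, 1, 2, 4], "dChip": [50, 30, 10, 20, 40], "RMA": [2.1, 2.3, 2.5, 2.4, 2.2]},
}

keep_ok = is_consistent(gene_ok)
keep_bad = is_consistent(gene_bad)
```

Requiring all six coefficients to pass means that a gene is dropped as soon as any one pair of preprocessing methods disagrees in either training set.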

In each data set, the ranks of the mean expressions of the remaining genes among all 22,215 probe sets were recorded. The histogram is given in Figure 3.2. It shows that a larger proportion of the remaining genes have high ranks than have low ranks.
